Summary
- Research focus shifted from a graph-structured long-term memory approach to an end-to-end pipeline for Temporal Reasoning.
- We established a traceable Retrieval → Packing → Reader → Evaluation pipeline, fixed the "events not indexed" bug and temporal-visibility issues, and restored non-zero retrieval hits.
- Key finding this month: retrieval Hit@10 ≈ 0.26, but nDCG@10 ≈ 0.019. Ranking quality is the main bottleneck; differences across temporal views are minimal.
Why Temporal Reasoning (and why now)
Core temporal views: RAW / AW / AR
- RAW: No temporal-visibility filtering; serves as a superset index that is built once and reused end-to-end.
- AW (As-World): Filters by the “real-world state” at question time, keeping only facts/events still valid at the time of the query.
- AR (As-Recorded): Filters by “what was recorded at that time,” preserving the evidence that existed when it was logged for auditability.
Example 1: Rescheduled meeting
- Oct 1: “The meeting is on Oct 10.” Oct 5: “Rescheduled to Oct 12.”
- Question A (asked on Oct 3): “When is the meeting?”
- AR: Sees the Oct 1 record → Oct 10 (the reschedule does not exist yet).
- AW: Under “current world knowledge” on Oct 3, only Oct 10 exists → Oct 10.
- Question B (asked on Oct 8): “When is the meeting?”
- AR: Returns both entries (Oct 10 and Oct 12) to show the change log.
- AW: Answers Oct 12 (the latest valid state).
- Why it matters: AW/AR resolve conflicts along the time axis—AW prevents stale facts from polluting answers; AR preserves the audit trail.
Example 2: Identity change (moving residence)
- 2023: “Alice lives in Paris.” 2024: “Alice moved to Berlin.”
- Question (asked in 2025): “Where does Alice live?”
- AW: Berlin (currently valid).
- AR: Shows the full timeline, Paris → Berlin, for traceability.
Example 3: Subscription/contract expiry
- Subscription valid from 2024-01-01 to 2024-12-31.
- Question (2025-01-02): “Is the subscription active?”
- AW: Inactive (expired).
- RAW: The subscription fact is still present in the index, but filtered out at candidate time under AW.
Implementation approach: Use RAW as the superset index (single build, reused), then apply AW/AR visibility filters at candidate time; align gold labels to the same temporal view during evaluation to ensure time-correct evidence.
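The candidate-time filtering described above can be sketched as follows. This is a minimal sketch, not the repo's actual schema: the `Doc` fields (`recorded_at`, `valid_from`, `valid_to`, `superseded`) are illustrative names for the two temporal axes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Doc:
    doc_id: str
    recorded_at: datetime          # when the record was written (As-Recorded axis)
    valid_from: datetime           # start of real-world validity (As-World axis)
    valid_to: Optional[datetime]   # None = open interval (still valid)
    superseded: bool = False       # set when a later record REPLACEs this one

def visible_aw(doc: Doc, at: datetime) -> bool:
    # As-World: recorded by query time, not superseded, and not yet expired;
    # future events (valid_to in the future) stay visible.
    return (doc.recorded_at <= at
            and not doc.superseded
            and (doc.valid_to is None or doc.valid_to >= at))

def visible_ar(doc: Doc, at: datetime) -> bool:
    # As-Recorded: everything written by query time, superseded or not.
    return doc.recorded_at <= at

def filter_candidates(docs: list[Doc], view: str, at: datetime) -> list[Doc]:
    # RAW is the superset index: no filtering; AW/AR filter at candidate time.
    if view == "RAW":
        return list(docs)
    pred = visible_aw if view == "AW" else visible_ar
    return [d for d in docs if pred(d, at)]
```

Under this sketch, a superseded meeting record is dropped under AW but kept under AR and RAW, matching the worked examples above.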
Methods & System Changes (centered on Temporal Reasoning)
Design goals
Use the RAW → AW/AR view stack as the backbone to close the loop from “data generation/assimilation” to “retrieval/reranking/packing/reader/evaluation.” Quantify each stage—“retrieved?”, “packed?”, “cited?”, “answered?”—so bottlenecks are observable and localizable.
End-to-end dataflow (worked example with a dialogue)
Dialogue (with timestamps)
[2025-10-01 09:00] U: Let's schedule the project meeting on Oct 10 in the morning.
[2025-10-05 18:12] U: Reschedule the meeting to the morning of Oct 12; room B203.
[2025-10-08 10:00] Q: When will the project meeting be held?
1) Preprocessing (memowrite)
- Sentence splitting & speaker normalization: identify `U` (user) as the subject; extract candidate temporal expressions ("Oct 10", "Oct 12") and a location ("B203").
- Relative-time normalization: map to ISO dates (`2025-10-10`, `2025-10-12`) with timezone metadata.
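A minimal sketch of this normalization step, using a hypothetical helper rather than the repo's `src/modules/time/normalize.py`; it anchors bare month-day mentions to the year of the record timestamp and ignores timezones and year rollover:

```python
import re
from datetime import datetime

# Month-name prefixes to month numbers ("Oct" and "October" both resolve to 10).
_MONTHS = {m: i + 1 for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

def normalize_month_day(text: str, recorded_at: datetime) -> list[str]:
    """Resolve bare 'Oct 10'-style mentions to ISO dates, anchored to the
    year of the record timestamp (a simplification: no year-rollover handling)."""
    dates = []
    for month, day in re.findall(r"\b([A-Za-z]{3})[a-z]*\.?\s+(\d{1,2})\b", text):
        num = _MONTHS.get(month.lower())
        if num:
            dates.append(datetime(recorded_at.year, num, int(day)).date().isoformat())
    return dates
```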
2) Entities / Aliases / Coreference
- Entity extraction: `Meeting(project)` (event), `B203` (location).
- Alias merge: unify "project meeting" and "meeting" as the same `Meeting(project)`.
- Coreference: the second "meeting" refers to the same `Meeting(project)`.
3) Temporal parsing & alignment
- Event intervals:
  - `E1: Meeting(project) @ 2025-10-10 09:00` (recorded at `2025-10-01`)
  - `E2: Meeting(project) @ 2025-10-12 09:00` (recorded at `2025-10-05`, supersedes `E1`)
- Facts/states:
  - `F1: Meeting(project).location = B203` (recorded at `2025-10-05`)
4) RBU assimilation (state transitions)
- Business rule "a later record supersedes an earlier plan": `E2` REPLACEs `E1` as the currently valid event; `F1` writes the location state.
5) Evidence generation & storage
- Create two event docs (schematic `doc_id`s):
  - `doc:event/meeting_20251010_v1` (superseded)
  - `doc:event/meeting_20251012_v2` (currently valid)
- And a location fact: `doc:fact/meeting_location_b203_v1`
6) Indexing & visibility (RAW → AW/AR)
- RAW index: one-time build; includes all docs (even expired/superseded).
- AW filter (As-World, answer time = `2025-10-08 10:00`):
  - `doc:event/meeting_20251012_v2`: visible (valid and recorded)
  - `doc:event/meeting_20251010_v1`: not visible (superseded)
  - `doc:fact/meeting_location_b203_v1`: visible
- AR filter (As-Recorded):
  - Both event records are visible (Oct 10 and Oct 12) for audit.
7) Retrieval & reranking
- Query construction: `Q = "When is the project meeting?"` + temporal hints (observation time = `2025-10-08`).
- Candidate set: BM25 ∪ e5 with RRF fusion produces the Top-N.
- Reranking: cross-encoder (MiniLM) + optional temporal similarity `time_sim(E, Q)` (events closer to the effective answer time score higher).
| rank | doc_id | kind | score(CE) | time_sim | fused |
|---|---|---|---|---|---|
| 1 | doc:event/meeting_20251012_v2 | event | 2.31 | 0.98 | 3.29 |
| 2 | doc:fact/meeting_location_b203_v1 | fact | 1.05 | 0.72 | 1.77 |
| 3 | doc:event/meeting_20251010_v1 | event | 1.12 | 0.12 | 1.24 |
| 4 | doc:note/meeting_summary_sep | note | 0.85 | 0.40 | 1.25 |
| 5 | doc:fact/old_room_b201 | fact | 0.60 | 0.05 | 0.65 |
Under AW, `meeting_20251010_v1` is removed by visibility filtering; under RAW it remains but ranks lower.
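The fusion logic of step 7 can be sketched as follows. The RRF constant `k=60` and the exponential `time_sim` with a 7-day half-life are assumptions, not the repo's actual parameters; with `alpha = beta = 1`, `fuse` reproduces the `fused` column of the table above.

```python
import math

def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal-rank fusion of several ranked doc_id lists (e.g. BM25 and e5)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

def time_sim(event_ts: float, query_ts: float, half_life_days: float = 7.0) -> float:
    """Exponential decay in |event time - query time| (timestamps in seconds):
    events closer to the effective answer time score higher."""
    delta_days = abs(event_ts - query_ts) / 86400.0
    return math.exp(-math.log(2.0) * delta_days / half_life_days)

def fuse(ce_score: float, tsim: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Linear fusion of cross-encoder score and temporal similarity."""
    return alpha * ce_score + beta * tsim
```

Grid-searching α/β over this linear fusion is exactly the tuning item listed under Next Steps.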
8) Packing
- Fold Top-K chunks to document-level evidence (chunk→doc): generate a pack table, e.g.:
| rank_in_ctx | doc_id | view | active_at/intervals | tokens |
|---|---|---|---|---|
| 1 | doc:event/meeting_20251012_v2 | AW | [2025-10-12 09:00, …] | 180 |
| 2 | doc:fact/meeting_location_b203_v1 | AW | [2025-10-05 18:12, …] | 120 |
- Compute `pack_hit@k` (whether gold made it into the pack) and `pack_loss@k = retrieval_hit@k − pack_hit@k`.
9) Reader & citations
- Structured output contract (JSON-only):
{
"answer": "The meeting is on the morning of Oct 12 in room B203.",
"citations": ["doc:event/meeting_20251012_v2", "doc:fact/meeting_location_b203_v1"]
}
- Evaluate `cite_hit@k` (does the reader cite gold?) and `context_use@k` (are citations drawn from the pack?).
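Both citation metrics can be computed directly from the reader's JSON output; this is a sketch with an illustrative function name, not the pipeline's actual code.

```python
import json

def citation_metrics(reader_output: str, gold: set[str], pack: list[str], k: int) -> dict:
    """cite_hit@k: does the reader cite a gold doc?
    context_use@k: are all citations drawn from the packed top-k context?"""
    citations = json.loads(reader_output).get("citations", [])
    return {
        "cite_hit@k": any(c in gold for c in citations),
        "context_use@k": bool(citations) and all(c in pack[:k] for c in citations),
    }
```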
10) Evaluation loop & stage metrics
- Retrieval: `hit@k` / `ndcg@k`
- Packing: `pack_hit@k` / `pack_loss@k`
- Citations: `cite_hit@k` / `context_use@k`
- End-to-end: `accuracy` / `acc_oracle_pack`
- Diagnostics: `candidates.tsv` / `gold.tsv` / `id_join_report.json` / `pack_context.tsv` / `reader_citations.tsv` / `visibility_losses.tsv`
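For reference, the two retrieval metrics under binary relevance (a standard formulation; the implementation in `src/pipelines/eval_runner.py` may differ in detail):

```python
import math

def hit_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """1.0 if any gold doc appears in the top-k, else 0.0."""
    return float(any(doc_id in gold for doc_id in ranked[:k]))

def ndcg_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Binary-relevance nDCG@k: discounted gain of gold docs in the top-k,
    normalized by the ideal ordering (all gold docs ranked first)."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0
```

The gap reported in the Summary (Hit@10 far above nDCG@10) falls out of this definition: gold docs are found inside the top 10 but sit near the bottom of it, so the log-discounted gain stays small.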
Key implementation switches (repo-aligned)
- RAW superset index & stable fingerprints: `src/modules/retrieval/index_key.py`, `src/modules/retrieval/index.py`
- View filtering (AW/AR) & fusion retrieval: `src/modules/retrieval/eagle_v1.py`
- Cross-encoder reranking & (optional) temporal features: `src/modules/retrieval/rerank.py`
- Packing & folding: via `eagle_v1.pack`; ID normalization: `tools/id_normalize.py`
- Reader contract & citations: `src/pipelines/memoread.py` (prompt template enforces JSON)
- Evaluation & sampling: `src/pipelines/eval_runner.py` (`--max-samples`, `--sample-step`); retrieval-only: `runner.py --retrieval-only`
- Temporal subset & filtering: `tools/datasets.py` (`longmemeval:temporal`)
Experimental Setup (October baseline)
- Dataset: `longmemeval:temporal` (133 records)
- Views: AW as primary; RAW / AR as controls
- Evaluation: `k=10`; retrieval-only and end-to-end observed separately
Results & Analysis (retrieval-side)
| config | view | hit@1 | hit@5 | hit@10 | ndcg@1 | ndcg@5 | ndcg@10 |
|---|---|---|---|---|---|---|---|
| qwen_emb+rerank_xenc | RAW | 0.1654 | 0.2481 | 0.2857 | 0.0116 | 0.0155 | 0.0208 |
| qwen_emb+rerank_xenc | AR | 0.1654 | 0.2556 | 0.3008 | 0.0117 | 0.0158 | 0.0208 |
| qwen_emb+rerank_xenc | AW | 0.1654 | 0.2556 | 0.2932 | 0.0117 | 0.0171 | 0.0215 |
| qwen_emb | RAW | 0.1429 | 0.2256 | 0.2556 | 0.0095 | 0.0132 | 0.0211 |
| qwen_emb | AR | 0.1353 | 0.2180 | 0.2707 | 0.0093 | 0.0132 | 0.0209 |
| qwen_emb | AW | 0.1353 | 0.2331 | 0.2632 | 0.0093 | 0.0131 | 0.0208 |
| colbert | RAW | 0.1353 | 0.2180 | 0.2632 | 0.0093 | 0.0131 | 0.0208 |
| colbert | AR | 0.1353 | 0.2331 | 0.2632 | 0.0093 | 0.0131 | 0.0208 |
| colbert | AW | 0.1353 | 0.2180 | 0.2632 | 0.0093 | 0.0132 | 0.0208 |
| bm25_only | RAW | 0.1353 | 0.2105 | 0.2406 | 0.0015 | 0.0051 | 0.0165 |
| bm25_only | AR | 0.1353 | 0.2105 | 0.2331 | 0.0015 | 0.0051 | 0.0163 |
| bm25_only | AW | 0.1278 | 0.2180 | 0.2331 | 0.0015 | 0.0051 | 0.0163 |
| dense_only | RAW | 0.1429 | 0.2256 | 0.2556 | 0.0013 | 0.0111 | 0.0175 |
| dense_only | AR | 0.1429 | 0.2105 | 0.2481 | 0.0015 | 0.0112 | 0.0176 |
| dense_only | AW | 0.1353 | 0.2105 | 0.2556 | 0.0013 | 0.0111 | 0.0175 |
Key takeaways
- Fusion retrieval + cross-encoder (`qwen_emb+rerank_xenc`) outperforms pure BM25 or pure dense retrieval.
- Temporal views differ little (RAW ≈ AW ≈ AR), supporting the "RAW superset index + candidate-time filtering" strategy.
- To improve nDCG@10, we need a larger candidate pool, stronger reranking, and temporal-similarity features.
Errors & Bottlenecks
- Reranking / ordering: the cross-encoder currently scores few pairs and the candidate pool is narrow, pushing gold hits toward the tail of the ranking.
- Temporal-sensitive scoring: missing a feature capturing proximity between query time and event time.
- By-kind detail: event-side ranking remains weaker; needs targeted boosts.
Next Steps
- Ranking enhancement (priority)
  - Increase seed candidates (50/100/200) and cross-encoder pairs (128/256).
  - Add temporal similarity and linearly fuse it with CE/BM25 scores (grid-search α/β).
  - Doc-level de-dup/aggregation to avoid near-duplicates crowding the top ranks.
- By-kind evaluation
  - Plot event/fact hit-and-rank curves; tune features/weights for temporal questions.
- End-to-end accuracy
  - After retrieval gains, focus on the Packing/Reader loss channels (`pack_loss@k`, `cite_hit@k`) and answer/citation consistency.
- Operational stability & reproducibility
  - Keep metric key names and docs consistent; maintain the single-build RAW superset index reused across runs.
Implementation Progress (P1 Completion Summary)
Code map & toggles (collapsible)
P1 Completion Report (Module A/B/C options & code mapping)
Conclusion: Across the current codebase, the modules along the P1 path have landed progressively. Most expose Option A (rule-based / lightweight); some also expose Option B/C (small models / constrained LLM). Below is an itemized audit with status tags, each mapped directly to source paths and existing script configs.
Notes:
- All code paths are rooted at `src/`.
- If a function is integrated via a pipeline (rather than a standalone module file), we note the integration point and config switch.
- Status tags use [Completed | Partially done | To do], with file paths provided for verification.
Preprocessing (`pipelines/memowrite.preprocess`)
- Option A (rule-based sentence splitting + regex extraction + relative time normalization): Completed
  - Entrypoints: `src/pipelines/memowrite.py:preprocess_turn`, `preprocess_dialogue_turn`
  - Sentence splitting: in-file `parser_utils` (`src/utils/parser.py`) plus simple rules; dialogues inject stable speaker/addressee entities
  - Regex: email/URL/phone/amount recognized in the NER stage
- Option B (lightweight sentence splitter model): To do
  - No dedicated tiny-splitter toggle found
- Config: `ExtractionConfig.use_flair` (syntax/NER assist); other preprocessing toggles are aggregated in `memowrite`
  - Ref: `src/pipelines/memowrite.py:ExtractionConfig`
NER / Nodes (`modules/extract/entities`)
- Option A (rules + Flair + regex for account-like fields): Completed
  - File: `src/modules/extract/entities.py`
  - Functionality: capitalization heuristics for PER/ORG/LOC, regex for email/URL/phone/amount, optional Flair NER
  - Output: entity aggregation (aliases, mention merging), confidence scores
- Option B (RoBERTa-NER fine-tune): To do
  - No RoBERTa NER path/call found
- Option C (constrained LLM extraction): To do
  - No LLM gateway landed for entity extraction (the LLM is used only for relation triage and QA)
Alias merging (Alias)
- Option A (exact + Jaro–Winkler): Completed
  - `AliasTracker`: `src/modules/extract/entities.py` (Jaro–Winkler / token-set thresholds)
- Option B (token-set similarity): Completed
  - Supported via `token_set_threshold`
- Option C (contextual discriminator): To do
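For illustration, the token-set side of the alias merge might look like this. This is a sketch: the `0.5` default threshold is an assumption (chosen so that "project meeting" and "meeting" merge, as in the worked dialogue example); the actual thresholds live in `ExtractionConfig`, and the Jaro–Winkler path is not shown.

```python
def token_set_sim(a: str, b: str) -> float:
    """Jaccard similarity over lowercase token sets (the token-set side of the
    alias merge; Jaro-Winkler, not shown here, handles spelling variants)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def same_alias(a: str, b: str, token_set_threshold: float = 0.5) -> bool:
    # Exact match (case-insensitive) or token-set overlap above the threshold.
    return a.lower() == b.lower() or token_set_sim(a, b) >= token_set_threshold
```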
Coreference (Coref)
- Option A (rule-based, windowed, person-first): Completed
  - File: `src/modules/extract/coref.py` (pronoun rules, configurable window)
- Option B (SpanBERT distillation): To do
- Option C (LLM conflict arbitration): To do
Time normalization
- Option A (rules/dictionaries + timezone inference): Completed
  - File: `src/modules/time/normalize.py` (rich regex/parsing + TZ handling)
- Option B (SUTime/HeidelTime fallback): To do
  - Not integrated
- Option C (LLM low-confidence disambiguation): Partially done (capability stub)
  - `TimeNormalizer.use_llm_disamb` exists, but no wired call chain yet
Relation extraction (Relations)
- Option A (templates + dependency): Completed
  - File: `src/modules/extract/relations.py` (dep/regex templates for schedules, contacts, preferences, etc.)
- Option B (RoBERTa binary filter): To do
  - No RoBERTa filter present
- Option C (constrained LLM structuring): Partially done (triage layer)
  - `src/modules/extract/triage.py` uses an LLM (Gemini) for keep/drop + confidence; hooked via `ExtractionConfig.triage` in `src/pipelines/memowrite.py`
RBU assimilation (Assimilation)
- Option A (AR snapshot exact match + open versioning): Completed
  - File: `src/modules/assimilation/rbu.py` (SKIP/INSERT/REPLACE/CLOSE transitions and DSL emission)
- Option B (fuzzy-match extensions): To do
  - Currently relies on alias closure + exact dst matching
Event building & merging (Events)
- Option A (hard triggers + thresholded creation + caching): Completed
  - File: `src/modules/events/builder.py` (hard-trigger predicates, fixed-window aggregation, templated `src rel dst` summaries)
- Option B (semantic merge: sim ≥ 0.8 + participant overlap + adjacent/overlapping windows): To do
- Option C (constrained LLM for readable summaries): To do
Episode aggregation
- Option A (fixed time window + participant overlap): Partially done
  - Implemented at the event layer; no standalone episode module yet
- Option B (topic drift detection): To do
Retrieval entry (EAGLE v1)
- Option A (BM25): Completed
  - File: `src/modules/retrieval/index.py` (BM25 implementation + index cache)
- Option B (BM25 + e5-small fusion): Completed
  - e5 embeddings: `src/modules/retrieval/emb_backends/e5.py` (CPU/GPU with fallback); RRF fusion: `src/modules/retrieval/eagle_v1.py`
- Option C (+ ColBERT-lite): To do
  - No ColBERT implementation; a `configs/retrieval/colbert.yaml` draft exists without an execution path
Reranking & expansion
- Expansion A (depth ≤ 2, fanout ≤ m): Completed
  - Controlled by `depth`/`fanout` in `src/modules/retrieval/eagle_v1.py`
- Expansion B (relation-type gating): To do
- Rerank A (none / RRF / MMR): Completed
  - Currently using RRF (MMR not explicitly surfaced)
- Rerank B (MonoT5-small): To do
  - No MonoT5 dependency wired
- Rerank C (MiniLM cross-encoder): Completed
  - Cross-encoder: `src/modules/retrieval/rerank.py` (`cross-encoder/ms-marco-MiniLM-L-6-v2` when available)
Packing
- Option A (events/facts → minimal evidence with active_at/intervals/evidence_id): Completed
  - Integrated into the retrieval output: `src/modules/retrieval/eagle_v1.py` (`pack` field)
- Option B (statistical summaries): To do
Triggers / Backfill / QA / Evaluation
- Triggers A/B/C: Completed (hard triggers / salience keep-archive / periodic & backfill)
  - `src/modules/triggers/strong.py`, `salience.py`, `periodic.py`, `backfill.py`; types in `src/modules/triggers/types.py`
- QA A/B: Completed (baseline LLM + temporal prompts)
  - `src/pipelines/memoread.py` via `modules.llm.gateway`; Gemini provider available
- Evaluation A: Completed (EM / Acc-AR/AW / Hit@k / nDCG@k / costs)
  - `src/pipelines/eval_runner.py` (retrieval + packing + QA aggregation and metrics dump)
- Evaluation B/C: Partially done (error classifier / dashboard)
  - Has error/fallback/empty-graph counters and retrieval decomposition; no dedicated classifier module yet
Configuration & scripts
- P1 config loader: `src/utils/config.py:load_p1_config`
- Retrieval/rerank: `src/modules/retrieval/eagle_v1.py` (under `retrieval`: `use_bm25`, `use_embeddings|use_e5`, `rerank.backend/top/weight`, `depth/fanout/k_rrf`)
- Extraction & triage toggles: `src/pipelines/memowrite.py` → `ExtractionConfig`, `TriageConfig` (`use_flair`, `alias_*_threshold`, `triage.enabled/backend/per_turn_tokens/...`)
- Trigger toggles: `_DEFAULT_TRIGGER_CONFIG` in `src/pipelines/memowrite.py`; dispatched as `TriggerConfig` at runtime
- Evaluation entrypoints: `scripts/run_p1.sh`, `scripts/run_retrieval.sh`, `scripts/run_eval.sh`
- Artifacts: under `reports/` (`metrics.json`, `per_query.jsonl`, `pack_*`, `reader_*`, etc.)