Summary

  • Research focus shifted from a graph-structured long-term memory approach to an end-to-end pipeline for Temporal Reasoning.
  • We established a traceable Retrieval → Packing → Reader → Evaluation pipeline, fixed the “events not indexed” bug and temporal-visibility filtering, and restored non-zero retrieval hits.
  • Key finding this month: retrieval Hit@10 ≈ 0.26, but nDCG@10 ≈ 0.019. Ranking quality is the main bottleneck; differences across temporal views are minimal.

Why Temporal Reasoning (and why now)

Core temporal views: RAW / AW / AR

  • RAW: No temporal-visibility filtering; serves as a superset index that is built once and reused end-to-end.
  • AW (As-World): Filters by the “real-world state” at question time, keeping only facts/events still valid at the time of the query.
  • AR (As-Recorded): Filters by “what was recorded at that time,” preserving the evidence that existed when it was logged for auditability.

Example 1: Rescheduled meeting

  • Oct 1: “The meeting is on Oct 10.” Oct 5: “Rescheduled to Oct 12.”
  • Question A (asked on Oct 3): “When is the meeting?”
    • AR: Sees the Oct 1 record → Oct 10 (the reschedule does not exist yet).
    • AW: Under “current world knowledge” on Oct 3, only Oct 10 exists → Oct 10.
  • Question B (asked on Oct 8): “When is the meeting?”
    • AR: Returns both entries (Oct 10 and Oct 12) to show the change log.
    • AW: Answers Oct 12 (the latest valid state).
  • Why it matters: AW/AR resolve conflicts along the time axis—AW prevents stale facts from polluting answers; AR preserves the audit trail.

Example 2: Identity change (moving residence)

  • 2023: “Alice lives in Paris.” 2024: “Alice moved to Berlin.”
  • Question (asked in 2025): “Where does Alice live?”
    • AW: Berlin (currently valid).
    • AR: Shows the full timeline, Paris → Berlin, for traceability.

Example 3: Subscription/contract expiry

  • Subscription valid from 2024-01-01 to 2024-12-31.
  • Question (2025-01-02): “Is the subscription active?”
    • AW: Inactive (expired).
    • RAW: The subscription fact remains in the superset index; it is the AW filter that removes it from the candidate set at query time.

Implementation approach: Use RAW as the superset index (single build, reused), then apply AW/AR visibility filters at candidate time; align gold labels to the same temporal view during evaluation to ensure time-correct evidence.
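The three views reduce to a per-document visibility predicate evaluated at candidate time over the shared RAW index. The following is a minimal sketch assuming a simplified record shape (recorded_at, optional valid_to); the repo's actual schema and filter live in src/modules/retrieval/eagle_v1.py and may differ:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Illustrative record shape; the repo's actual schema may differ.
@dataclass
class Doc:
    doc_id: str
    recorded_at: datetime                # when the record was written
    valid_to: Optional[datetime] = None  # None = still valid; else superseded/expired at

def visible(doc: Doc, t: datetime, view: str) -> bool:
    """Candidate-time visibility filter over the shared RAW superset index."""
    if view == "RAW":
        return True                      # superset: no filtering
    if view == "AR":
        return doc.recorded_at <= t      # everything recorded by time t
    if view == "AW":
        # recorded by t AND not yet superseded/expired at t
        return doc.recorded_at <= t and (doc.valid_to is None or t <= doc.valid_to)
    raise ValueError(f"unknown view: {view}")
```

On the rescheduled-meeting example, the superseded Oct 10 record passes RAW and AR but fails AW at an Oct 8 question time, matching the behavior described above.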

Methods & System Changes (centered on Temporal Reasoning)

Design goals

Use the RAW → AW/AR view stack as the backbone to close the loop from “data generation/assimilation” to “retrieval/reranking/packing/reader/evaluation.” Quantify each stage—“retrieved?”, “packed?”, “cited?”, “answered?”—so bottlenecks are observable and localizable.

End-to-end dataflow (worked example with a dialogue)

Dialogue (with timestamps)


[2025-10-01 09:00] U: Let's schedule the project meeting on Oct 10 in the morning.
[2025-10-05 18:12] U: Reschedule the meeting to the morning of Oct 12; room B203.
[2025-10-08 10:00] Q: When will the project meeting be held?

1) Preprocessing (memowrite)

  • Sentence splitting & speaker normalization: identify U (user) as subject; extract candidate temporal expressions (“Oct 10”, “Oct 12”) and location (“B203”).
  • Relative-time normalization: map to ISO (2025-10-10, 2025-10-12) with timezone metadata.
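A minimal sketch of the relative-time normalization step, assuming a simple “Mon DD” pattern resolved against the message timestamp's year; the pipeline's actual normalizer also handles timezones and richer relative expressions:

```python
import re
from datetime import datetime

# Month-name lookup for expressions like "Oct 10"; illustrative only.
MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def normalize_date(expr: str, anchor: datetime) -> str:
    """Resolve a 'Mon DD' expression against the anchor message's year."""
    m = re.match(r"([A-Za-z]{3})\w*\.?\s+(\d{1,2})", expr.strip())
    if not m:
        raise ValueError(f"unrecognized date expression: {expr!r}")
    month = MONTHS[m.group(1).lower()]
    return f"{anchor.year:04d}-{month:02d}-{int(m.group(2)):02d}"
```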

2) Entities / Aliases / Coreference

  • Entity extraction: Meeting(project) (event), B203 (location).
  • Alias merge: unify “project meeting” and “meeting” as the same Meeting(project).
  • Coreference: the second “meeting” refers to the same Meeting(project).

3) Temporal parsing & alignment

  • Event intervals:
    • E1: Meeting(project)@2025-10-10 09:00 (recorded at 2025-10-01)
    • E2: Meeting(project)@2025-10-12 09:00 (recorded at 2025-10-05, supersedes E1)
  • Facts/states:
    • F1: Meeting(project).location = B203 (recorded at 2025-10-05)

4) RBU assimilation (state transitions)

  • Business rule “later record supersedes earlier plan”: E2 REPLACE E1 as the current valid time; F1 writes the location state.
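The supersede rule can be sketched as a per-entity history list, where the newest record closes out the previous one while keeping the full chain queryable for AR; field names are illustrative, not the repo's RBU schema:

```python
def assimilate(state: dict, entity: str, event: dict) -> None:
    """RBU rule 'later record supersedes earlier plan' (illustrative shape):
    the new record REPLACEs the previous one as currently valid, while the
    superseded record stays in the history for As-Recorded audit queries."""
    history = state.setdefault(entity, [])
    if history:
        # Close out the previously current record at the new record's time.
        history[-1]["superseded_at"] = event["recorded_at"]
    history.append(event)
```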

5) Evidence generation & storage

  • Create two event docs (schematic doc_id):
    • doc:event/meeting_20251010_v1 (superseded)
    • doc:event/meeting_20251012_v2 (currently valid)
  • And a location fact:
    • doc:fact/meeting_location_b203_v1

6) Indexing & visibility (RAW → AW/AR)

  • RAW index: one-time build; includes all docs (even expired/superseded).
  • AW filter (As-World, answer time = 2025-10-08 10:00):
    • doc:event/meeting_20251012_v2: visible (valid and recorded)
    • doc:event/meeting_20251010_v1: not visible (superseded)
    • doc:fact/meeting_location_b203_v1: visible
  • AR filter (As-Recorded):
    • Both event records are visible (Oct 10 and Oct 12) for audit.

7) Retrieval & reranking

  • Query construction: Q="When is the project meeting?" + temporal hints (observation time = 2025-10-08).
  • Candidate set: BM25 ∪ e5 with RRF fusion produces Top-N.
  • Reranking: cross-encoder (MiniLM) + optional temporal similarity time_sim(E, Q) (events closer to the effective answer time score higher).
rank  doc_id                             kind   score(CE)  time_sim  fused
1     doc:event/meeting_20251012_v2      event  2.31       0.98      3.29
2     doc:fact/meeting_location_b203_v1  fact   1.05       0.72      1.77
3     doc:event/meeting_20251010_v1      event  1.12       0.12      1.24
4     doc:note/meeting_summary_sep       note   0.85       0.40      1.25
5     doc:fact/old_room_b201             fact   0.60       0.05      0.65

Under AW, meeting_20251010_v1 is removed by visibility filtering; under RAW it remains but ranks lower.
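The BM25 ∪ e5 candidate fusion above uses RRF; a standard sketch follows (the constant k = 60 is the common default from the RRF literature and an assumption here, not a verified repo setting):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum_i 1 / (k + rank_i(d)),
    summed over each retriever's ranked list that contains d."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both lists dominate; documents seen by only one retriever still enter the Top-N candidate set with a smaller fused score.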

8) Packing

  • Fold Top-K chunks to document-level evidence (chunk→doc): generate a pack table, e.g.:
rank_in_ctx  doc_id                             view  active_at/intervals    tokens
1            doc:event/meeting_20251012_v2      AW    [2025-10-12 09:00, …]  180
2            doc:fact/meeting_location_b203_v1  AW    [2025-10-05 18:12, …]  120
  • Compute pack_hit@k (whether gold made it into the pack) and pack_loss@k = retrieval_hit@k − pack_hit@k.
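The two packing metrics can be computed as below (a sketch under binary gold labels; the repo's eval_runner implementation may differ):

```python
def pack_metrics(retrieved: list[str], packed: list[str],
                 gold: set[str], k: int = 10) -> dict:
    """pack_hit@k: did any gold doc survive packing into the context?
    pack_loss@k: gold was retrieved in the top-k but dropped by packing."""
    retrieval_hit = int(any(d in gold for d in retrieved[:k]))
    pack_hit = int(any(d in gold for d in packed[:k]))
    return {"retrieval_hit@k": retrieval_hit,
            "pack_hit@k": pack_hit,
            "pack_loss@k": retrieval_hit - pack_hit}
```

A positive pack_loss@k localizes the failure to the packing stage rather than retrieval.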

9) Reader & citations

  • Structured output contract (JSON-only):
{
  "answer": "The meeting is on the morning of Oct 12 in room B203.",
  "citations": ["doc:event/meeting_20251012_v2", "doc:fact/meeting_location_b203_v1"]
}
  • Evaluate cite_hit@k (does the reader cite gold?) and context_use@k (are citations drawn from the pack?).
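These citation metrics can be sketched as follows, assuming doc-id string matching between reader citations, the pack, and gold (the repo's exact definitions may differ):

```python
def citation_metrics(citations: list[str], pack: list[str],
                     gold: set[str]) -> dict:
    """cite_hit: the reader cited at least one gold doc.
    context_use: fraction of citations drawn from the packed context."""
    cite_hit = int(any(c in gold for c in citations))
    in_pack = sum(c in pack for c in citations)
    context_use = in_pack / len(citations) if citations else 0.0
    return {"cite_hit": cite_hit, "context_use": context_use}
```

On the worked example, both cited docs come from the pack and one is gold, so cite_hit = 1 and context_use = 1.0.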

10) Evaluation loop & stage metrics

  • Retrieval: hit@k / ndcg@k
  • Packing: pack_hit@k / pack_loss@k
  • Citations: cite_hit@k / context_use@k
  • End-to-end: accuracy / acc_oracle_pack
  • Diagnostics: candidates.tsv / gold.tsv / id_join_report.json / pack_context.tsv / reader_citations.tsv / visibility_losses.tsv
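A sketch of the two retrieval metrics under binary relevance, which also explains the headline gap: a gold doc anywhere in the top 10 yields hit@10 = 1, but contributes little to ndcg@10 unless it sits near the top of the ranking:

```python
import math

def hit_at_k(ranked: list[str], gold: set[str], k: int) -> int:
    """1 if any gold doc appears in the top-k, else 0."""
    return int(any(d in gold for d in ranked[:k]))

def ndcg_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain over the top-k,
    normalized by the ideal ranking's DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]) if d in gold)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / idcg if idcg else 0.0
```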

Key implementation switches (repo-aligned)

  • RAW superset index & stable fingerprints: src/modules/retrieval/index_key.py, src/modules/retrieval/index.py
  • View filtering (AW/AR) & fusion retrieval: src/modules/retrieval/eagle_v1.py
  • Cross-encoder reranking & (optional) temporal features: src/modules/retrieval/rerank.py
  • Packing & folding: via eagle_v1.pack; ID normalization: tools/id_normalize.py
  • Reader contract & citations: src/pipelines/memoread.py (prompt template enforces JSON)
  • Evaluation & sampling: src/pipelines/eval_runner.py (--max-samples --sample-step), retrieval-only: runner.py --retrieval-only
  • Temporal subset & filtering: tools/datasets.py (longmemeval:temporal)

Experimental Setup (October baseline)

  • Dataset: longmemeval:temporal (133 records)
  • Views: AW as primary, RAW / AR as controls
  • Evaluation: k=10; observe retrieval-only and end-to-end separately

Results & Analysis (retrieval-side)

config                view  hit@1   hit@5   hit@10  ndcg@1  ndcg@5  ndcg@10
qwen_emb+rerank_xenc  RAW   0.1654  0.2481  0.2857  0.0116  0.0155  0.0208
qwen_emb+rerank_xenc  AR    0.1654  0.2556  0.3008  0.0117  0.0158  0.0208
qwen_emb+rerank_xenc  AW    0.1654  0.2556  0.2932  0.0117  0.0171  0.0215
qwen_emb              RAW   0.1429  0.2256  0.2556  0.0095  0.0132  0.0211
qwen_emb              AR    0.1353  0.2180  0.2707  0.0093  0.0132  0.0209
qwen_emb              AW    0.1353  0.2331  0.2632  0.0093  0.0131  0.0208
colbert               RAW   0.1353  0.2180  0.2632  0.0093  0.0131  0.0208
colbert               AR    0.1353  0.2331  0.2632  0.0093  0.0131  0.0208
colbert               AW    0.1353  0.2180  0.2632  0.0093  0.0132  0.0208
bm25_only             RAW   0.1353  0.2105  0.2406  0.0015  0.0051  0.0165
bm25_only             AR    0.1353  0.2105  0.2331  0.0015  0.0051  0.0163
bm25_only             AW    0.1278  0.2180  0.2331  0.0015  0.0051  0.0163
dense_only            RAW   0.1429  0.2256  0.2556  0.0013  0.0111  0.0175
dense_only            AR    0.1429  0.2105  0.2481  0.0015  0.0112  0.0176
dense_only            AW    0.1353  0.2105  0.2556  0.0013  0.0111  0.0175

Key takeaways

  • Fusion retrieval + cross-encoder reranking (qwen_emb+rerank_xenc) outperforms pure BM25 or pure dense retrieval.
  • Temporal views differ little (RAW ≈ AW ≈ AR), supporting the “RAW superset index + candidate-time filtering” strategy.
  • To improve nDCG@10, we need a larger candidate pool, stronger reranking, and temporal similarity features.

Errors & Bottlenecks

  • Reranking / ordering: the cross-encoder currently scores too few pairs and the candidate pool is narrow, pushing gold documents toward the tail of the ranking.
  • Temporal-sensitive scoring: missing a feature capturing proximity between query time and event time.
  • By-kind detail: event-side ranking remains weaker; needs targeted boosts.

Next Steps

  • Ranking enhancement (priority)

    • Increase seed candidates (50/100/200) and cross-encoder pairs (128/256).
    • Add temporal similarity and linearly fuse with CE/BM25 (grid-search α/β).
    • Doc-level de-dup/aggregation to avoid near-duplicates crowding top ranks.
  • By-kind evaluation

    • Plot event/fact hit-and-rank curves; tune features/weights for temporal questions.
  • End-to-end accuracy

    • After retrieval gains, focus on Packing/Reader loss channels (pack_loss@k, cite_hit@k) and answer/citation consistency.
  • Operational stability & reproducibility

    • Keep metric key names and docs consistent; maintain single-build RAW superset index reused across runs.
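The planned α/β linear fusion of temporal similarity with the cross-encoder score could be sketched as below; the score field names, the hit@1 objective, and the grid values are illustrative assumptions, not the repo's implementation:

```python
import itertools

def rank_with(candidates: list[dict], alpha: float, beta: float) -> list[dict]:
    """Linear fusion of cross-encoder and temporal-similarity scores."""
    return sorted(candidates,
                  key=lambda d: alpha * d["ce"] + beta * d["time_sim"],
                  reverse=True)

def grid_search(queries: list[dict], alphas, betas):
    """Pick (alpha, beta) maximizing mean hit@1 on a dev split."""
    best = (-1.0, None, None)
    for a, b in itertools.product(alphas, betas):
        hits = [int(rank_with(q["candidates"], a, b)[0]["id"] in q["gold"])
                for q in queries]
        score = sum(hits) / len(hits)
        if score > best[0]:
            best = (score, a, b)
    return best
```

In the toy case below, the gold doc has a weak CE score but strong temporal proximity, so the search prefers a nonzero β, which is the intended effect of the temporal feature.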

Implementation Progress (P1 Completion Summary)