Summary
- Research focus shifted from a graph-structured long-term memory approach to an end-to-end pipeline for Temporal Reasoning.
- We established a traceable Retrieval → Packing → Reader → Evaluation pipeline, fixed the "events not indexed" bug and temporal-visibility issues, and restored non-zero retrieval hits.
- Key finding this month: retrieval Hit@10 ≈ 0.26, but nDCG@10 ≈ 0.019. Ranking quality is the main bottleneck; differences across temporal views are minimal.
Why Temporal Reasoning (and why now)
Core temporal views: RAW / AW / AR
- RAW: No temporal-visibility filtering; serves as a superset index that is built once and reused end-to-end.
- AW (As-World): Filters by the “real-world state” at question time, keeping only facts/events still valid at the time of the query.
- AR (As-Recorded): Filters by “what was recorded at that time,” preserving the evidence that existed when it was logged for auditability.
Example 1: Rescheduled meeting
- Oct 1: “The meeting is on Oct 10.” Oct 5: “Rescheduled to Oct 12.”
- Question A (asked on Oct 3): “When is the meeting?”
- AR: Sees the Oct 1 record → Oct 10 (the reschedule does not exist yet).
- AW: Under “current world knowledge” on Oct 3, only Oct 10 exists → Oct 10.
- Question B (asked on Oct 8): “When is the meeting?”
- AR: Returns both entries (Oct 10 and Oct 12) to show the change log.
- AW: Answers Oct 12 (the latest valid state).
- Why it matters: AW/AR resolve conflicts along the time axis—AW prevents stale facts from polluting answers; AR preserves the audit trail.
Example 2: Identity change (moving residence)
- 2023: “Alice lives in Paris.” 2024: “Alice moved to Berlin.”
- Question (asked in 2025): “Where does Alice live?”
- AW: Berlin (currently valid).
- AR: Shows the full timeline, Paris → Berlin, for traceability.
Example 3: Subscription/contract expiry
- Subscription valid from 2024-01-01 to 2024-12-31.
- Question (2025-01-02): “Is the subscription active?”
- AW: Inactive (expired).
- RAW: The subscription fact is still present in the index, but filtered out at candidate time under AW.
Implementation approach: Use RAW as the superset index (single build, reused), then apply AW/AR visibility filters at candidate time; align gold labels to the same temporal view during evaluation to ensure time-correct evidence.
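The candidate-time filtering described above can be sketched as follows. This is a minimal sketch, not the repo's actual schema: the `Doc` fields (`recorded_at`, `valid_from`, `valid_to`, `superseded`) are illustrative names for the two temporal axes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Doc:
    doc_id: str
    recorded_at: datetime          # when the record was written (As-Recorded axis)
    valid_from: datetime           # start of real-world validity (As-World axis)
    valid_to: Optional[datetime]   # None = open interval (still valid)
    superseded: bool = False       # set when a later record REPLACEs this one

def visible_aw(doc: Doc, at: datetime) -> bool:
    # As-World: recorded by query time, not superseded, and not yet expired;
    # future events (valid_to in the future) stay visible.
    return (doc.recorded_at <= at
            and not doc.superseded
            and (doc.valid_to is None or doc.valid_to >= at))

def visible_ar(doc: Doc, at: datetime) -> bool:
    # As-Recorded: everything written by query time, superseded or not.
    return doc.recorded_at <= at

def filter_candidates(docs: list[Doc], view: str, at: datetime) -> list[Doc]:
    # RAW is the superset index: no filtering; AW/AR filter at candidate time.
    if view == "RAW":
        return list(docs)
    pred = visible_aw if view == "AW" else visible_ar
    return [d for d in docs if pred(d, at)]
```

Under this sketch, a superseded meeting record is dropped under AW but kept under AR and RAW, matching the worked examples above.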
Methods & System Changes (centered on Temporal Reasoning)
Design goals
Use the RAW → AW/AR view stack as the backbone to close the loop from “data generation/assimilation” to “retrieval/reranking/packing/reader/evaluation.” Quantify each stage—“retrieved?”, “packed?”, “cited?”, “answered?”—so bottlenecks are observable and localizable.
End-to-end dataflow (worked example with a dialogue)
Dialogue (with timestamps)
[2025-10-01 09:00] U: Let's schedule the project meeting on Oct 10 in the morning.
[2025-10-05 18:12] U: Reschedule the meeting to the morning of Oct 12; room B203.
[2025-10-08 10:00] Q: When will the project meeting be held?
1) Preprocessing (memowrite)
- Sentence splitting & speaker normalization: identify `U` (user) as the subject; extract candidate temporal expressions ("Oct 10", "Oct 12") and a location ("B203").
- Relative-time normalization: map to ISO dates (`2025-10-10`, `2025-10-12`) with timezone metadata.
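A minimal sketch of this normalization step, using a hypothetical helper rather than the repo's `src/modules/time/normalize.py`; it anchors bare month-day mentions to the year of the record timestamp and ignores timezones and year rollover:

```python
import re
from datetime import datetime

# Month-name prefixes to month numbers ("Oct" and "October" both resolve to 10).
_MONTHS = {m: i + 1 for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

def normalize_month_day(text: str, recorded_at: datetime) -> list[str]:
    """Resolve bare 'Oct 10'-style mentions to ISO dates, anchored to the
    year of the record timestamp (a simplification: no year-rollover handling)."""
    dates = []
    for month, day in re.findall(r"\b([A-Za-z]{3})[a-z]*\.?\s+(\d{1,2})\b", text):
        num = _MONTHS.get(month.lower())
        if num:
            dates.append(datetime(recorded_at.year, num, int(day)).date().isoformat())
    return dates
```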
2) Entities / Aliases / Coreference
- Entity extraction: `Meeting(project)` (event), `B203` (location).
- Alias merge: unify "project meeting" and "meeting" as the same `Meeting(project)`.
- Coreference: the second "meeting" refers to the same `Meeting(project)`.
3) Temporal parsing & alignment
- Event intervals:
  - `E1: Meeting(project) @ 2025-10-10 09:00` (recorded at `2025-10-01`)
  - `E2: Meeting(project) @ 2025-10-12 09:00` (recorded at `2025-10-05`, supersedes `E1`)
- Facts/states:
  - `F1: Meeting(project).location = B203` (recorded at `2025-10-05`)
4) RBU assimilation (state transitions)
- Business rule "a later record supersedes an earlier plan": `E2` REPLACEs `E1` as the currently valid event; `F1` writes the location state.
5) Evidence generation & storage
- Create two event docs (schematic `doc_id`s):
  - `doc:event/meeting_20251010_v1` (superseded)
  - `doc:event/meeting_20251012_v2` (currently valid)
- And a location fact: `doc:fact/meeting_location_b203_v1`
6) Indexing & visibility (RAW → AW/AR)
- RAW index: one-time build; includes all docs (even expired/superseded).
- AW filter (As-World, answer time = `2025-10-08 10:00`):
  - `doc:event/meeting_20251012_v2`: visible (valid and recorded)
  - `doc:event/meeting_20251010_v1`: not visible (superseded)
  - `doc:fact/meeting_location_b203_v1`: visible
- AR filter (As-Recorded):
  - Both event records are visible (Oct 10 and Oct 12) for audit.
7) Retrieval & reranking
- Query construction: `Q = "When is the project meeting?"` + temporal hints (observation time = `2025-10-08`).
- Candidate set: BM25 ∪ e5 with RRF fusion produces the Top-N.
- Reranking: cross-encoder (MiniLM) + optional temporal similarity `time_sim(E, Q)` (events closer to the effective answer time score higher).
| rank | doc_id | kind | score(CE) | time_sim | fused |
|---|---|---|---|---|---|
| 1 | doc:event/meeting_20251012_v2 | event | 2.31 | 0.98 | 3.29 |
| 2 | doc:fact/meeting_location_b203_v1 | fact | 1.05 | 0.72 | 1.77 |
| 3 | doc:event/meeting_20251010_v1 | event | 1.12 | 0.12 | 1.24 |
| 4 | doc:note/meeting_summary_sep | note | 0.85 | 0.40 | 1.25 |
| 5 | doc:fact/old_room_b201 | fact | 0.60 | 0.05 | 0.65 |
Under AW, `meeting_20251010_v1` is removed by visibility filtering; under RAW it remains but ranks lower.
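The fusion logic of step 7 can be sketched as follows. The RRF constant `k=60` and the exponential `time_sim` with a 7-day half-life are assumptions, not the repo's actual parameters; with `alpha = beta = 1`, `fuse` reproduces the `fused` column of the table above.

```python
import math

def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal-rank fusion of several ranked doc_id lists (e.g. BM25 and e5)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

def time_sim(event_ts: float, query_ts: float, half_life_days: float = 7.0) -> float:
    """Exponential decay in |event time - query time| (timestamps in seconds):
    events closer to the effective answer time score higher."""
    delta_days = abs(event_ts - query_ts) / 86400.0
    return math.exp(-math.log(2.0) * delta_days / half_life_days)

def fuse(ce_score: float, tsim: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Linear fusion of cross-encoder score and temporal similarity."""
    return alpha * ce_score + beta * tsim
```

Grid-searching α/β over this linear fusion is exactly the tuning item listed under Next Steps.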
8) Packing
- Fold Top-K chunks to document-level evidence (chunk→doc): generate a pack table, e.g.:
| rank_in_ctx | doc_id | view | active_at/intervals | tokens |
|---|---|---|---|---|
| 1 | doc:event/meeting_20251012_v2 | AW | [2025-10-12 09:00, …] | 180 |
| 2 | doc:fact/meeting_location_b203_v1 | AW | [2025-10-05 18:12, …] | 120 |
- Compute `pack_hit@k` (whether gold made it into the pack) and `pack_loss@k = retrieval_hit@k − pack_hit@k`.
9) Reader & citations
- Structured output contract (JSON-only):
{
"answer": "The meeting is on the morning of Oct 12 in room B203.",
"citations": ["doc:event/meeting_20251012_v2", "doc:fact/meeting_location_b203_v1"]
}
- Evaluate `cite_hit@k` (does the reader cite gold?) and `context_use@k` (are citations drawn from the pack?).
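Both citation metrics can be computed directly from the reader's JSON output; this is a sketch with an illustrative function name, not the pipeline's actual code.

```python
import json

def citation_metrics(reader_output: str, gold: set[str], pack: list[str], k: int) -> dict:
    """cite_hit@k: does the reader cite a gold doc?
    context_use@k: are all citations drawn from the packed top-k context?"""
    citations = json.loads(reader_output).get("citations", [])
    return {
        "cite_hit@k": any(c in gold for c in citations),
        "context_use@k": bool(citations) and all(c in pack[:k] for c in citations),
    }
```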
10) Evaluation loop & stage metrics
- Retrieval: `hit@k` / `ndcg@k`
- Packing: `pack_hit@k` / `pack_loss@k`
- Citations: `cite_hit@k` / `context_use@k`
- End-to-end: `accuracy` / `acc_oracle_pack`
- Diagnostics: `candidates.tsv` / `gold.tsv` / `id_join_report.json` / `pack_context.tsv` / `reader_citations.tsv` / `visibility_losses.tsv`
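For reference, the two retrieval metrics under binary relevance (a standard formulation; the implementation in `src/pipelines/eval_runner.py` may differ in detail):

```python
import math

def hit_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """1.0 if any gold doc appears in the top-k, else 0.0."""
    return float(any(doc_id in gold for doc_id in ranked[:k]))

def ndcg_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Binary-relevance nDCG@k: discounted gain of gold docs in the top-k,
    normalized by the ideal ordering (all gold docs ranked first)."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0
```

The gap reported in the Summary (Hit@10 far above nDCG@10) falls out of this definition: gold docs are found inside the top 10 but sit near the bottom of it, so the log-discounted gain stays small.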
Key implementation switches (repo-aligned)
- RAW superset index & stable fingerprints: `src/modules/retrieval/index_key.py`, `src/modules/retrieval/index.py`
- View filtering (AW/AR) & fusion retrieval: `src/modules/retrieval/eagle_v1.py`
- Cross-encoder reranking & (optional) temporal features: `src/modules/retrieval/rerank.py`
- Packing & folding: via `eagle_v1.pack`; ID normalization: `tools/id_normalize.py`
- Reader contract & citations: `src/pipelines/memoread.py` (prompt template enforces JSON)
- Evaluation & sampling: `src/pipelines/eval_runner.py` (`--max-samples`, `--sample-step`); retrieval-only: `runner.py --retrieval-only`
- Temporal subset & filtering: `tools/datasets.py` (`longmemeval:temporal`)
Experimental Setup (October baseline)
- Dataset: `longmemeval:temporal` (133 records)
- Views: AW as primary; RAW / AR as controls
- Evaluation: `k=10`; retrieval-only and end-to-end observed separately
Results & Analysis (retrieval-side)
| config | view | hit@1 | hit@5 | hit@10 | ndcg@1 | ndcg@5 | ndcg@10 |
|---|---|---|---|---|---|---|---|
| qwen_emb+rerank_xenc | RAW | 0.1654 | 0.2481 | 0.2857 | 0.0116 | 0.0155 | 0.0208 |
| qwen_emb+rerank_xenc | AR | 0.1654 | 0.2556 | 0.3008 | 0.0117 | 0.0158 | 0.0208 |
| qwen_emb+rerank_xenc | AW | 0.1654 | 0.2556 | 0.2932 | 0.0117 | 0.0171 | 0.0215 |
| qwen_emb | RAW | 0.1429 | 0.2256 | 0.2556 | 0.0095 | 0.0132 | 0.0211 |
| qwen_emb | AR | 0.1353 | 0.2180 | 0.2707 | 0.0093 | 0.0132 | 0.0209 |
| qwen_emb | AW | 0.1353 | 0.2331 | 0.2632 | 0.0093 | 0.0131 | 0.0208 |
| colbert | RAW | 0.1353 | 0.2180 | 0.2632 | 0.0093 | 0.0131 | 0.0208 |
| colbert | AR | 0.1353 | 0.2331 | 0.2632 | 0.0093 | 0.0131 | 0.0208 |
| colbert | AW | 0.1353 | 0.2180 | 0.2632 | 0.0093 | 0.0132 | 0.0208 |
| bm25_only | RAW | 0.1353 | 0.2105 | 0.2406 | 0.0015 | 0.0051 | 0.0165 |
| bm25_only | AR | 0.1353 | 0.2105 | 0.2331 | 0.0015 | 0.0051 | 0.0163 |
| bm25_only | AW | 0.1278 | 0.2180 | 0.2331 | 0.0015 | 0.0051 | 0.0163 |
| dense_only | RAW | 0.1429 | 0.2256 | 0.2556 | 0.0013 | 0.0111 | 0.0175 |
| dense_only | AR | 0.1429 | 0.2105 | 0.2481 | 0.0015 | 0.0112 | 0.0176 |
| dense_only | AW | 0.1353 | 0.2105 | 0.2556 | 0.0013 | 0.0111 | 0.0175 |
Key takeaways
- Fusion retrieval + cross-encoder (`qwen_emb+rerank_xenc`) outperforms pure BM25 or pure dense retrieval.
- Temporal views differ little (RAW ≈ AW ≈ AR), supporting the "RAW superset index + candidate-time filtering" strategy.
- To improve nDCG@10, we need a larger candidate pool, stronger reranking, and temporal-similarity features.
Errors & Bottlenecks
- Reranking / ordering: the cross-encoder currently scores few pairs and the candidate pool is narrow, pushing gold hits toward the tail of the ranking.
- Temporal-sensitive scoring: missing a feature capturing proximity between query time and event time.
- By-kind detail: event-side ranking remains weaker; needs targeted boosts.
Next Steps
- Ranking enhancement (priority)
  - Increase seed candidates (50/100/200) and cross-encoder pairs (128/256).
  - Add temporal similarity and linearly fuse it with CE/BM25 scores (grid-search α/β).
  - Doc-level de-dup/aggregation to avoid near-duplicates crowding the top ranks.
- By-kind evaluation
  - Plot event/fact hit-and-rank curves; tune features/weights for temporal questions.
- End-to-end accuracy
  - After retrieval gains, focus on the Packing/Reader loss channels (`pack_loss@k`, `cite_hit@k`) and answer/citation consistency.
- Operational stability & reproducibility
  - Keep metric key names and docs consistent; maintain the single-build RAW superset index reused across runs.
Implementation Progress (P1 Completion Summary)
Code map & toggles (collapsible)
P1 Completion Report (Module A/B/C options & code mapping)
Conclusion: Across the current codebase, the modules along the P1 path have landed progressively. Most expose Option A (rule-based / lightweight); some also expose Option B/C (small models / constrained LLM). Below is an itemized audit with status tags, each mapped directly to source paths and existing script configs.
Notes:
- All code paths are rooted at `src/`.
- If a function is integrated via a pipeline (rather than a standalone module file), we note the integration point and config switch.
- Status tags use [Completed | Partially done | To do], with file paths provided for verification.
Preprocessing (`pipelines/memowrite.preprocess`)
- Option A (rule-based sentence splitting + regex extraction + relative time normalization): Completed
  - Entrypoints: `src/pipelines/memowrite.py:preprocess_turn`, `preprocess_dialogue_turn`
  - Sentence splitting: in-file `parser_utils` (`src/utils/parser.py`) plus simple rules; dialogues inject stable speaker/addressee entities
  - Regex: email/URL/phone/amount recognized in the NER stage
- Option B (lightweight sentence splitter model): To do
  - No dedicated tiny-splitter toggle found
- Config: `ExtractionConfig.use_flair` (syntax/NER assist); other preprocessing toggles are aggregated in `memowrite`
  - Ref: `src/pipelines/memowrite.py:ExtractionConfig`
NER / Nodes (`modules/extract/entities`)
- Option A (rules + Flair + regex for account-like fields): Completed
  - File: `src/modules/extract/entities.py`
  - Functionality: capitalization heuristics for PER/ORG/LOC, regex for email/URL/phone/amount, optional Flair NER
  - Output: entity aggregation (aliases, mention merging), confidence scores
- Option B (RoBERTa-NER fine-tune): To do
  - No RoBERTa NER path/call found
- Option C (constrained LLM extraction): To do
  - No LLM gateway landed for entity extraction (the LLM is used only for relation triage and QA)
Alias merging (Alias)
- Option A (exact + Jaro–Winkler): Completed
  - `AliasTracker`: `src/modules/extract/entities.py` (Jaro–Winkler / token-set thresholds)
- Option B (token-set similarity): Completed
  - Supported via `token_set_threshold`
- Option C (contextual discriminator): To do
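For illustration, the token-set side of the alias merge might look like this. This is a sketch: the `0.5` default threshold is an assumption (chosen so that "project meeting" and "meeting" merge, as in the worked dialogue example); the actual thresholds live in `ExtractionConfig`, and the Jaro–Winkler path is not shown.

```python
def token_set_sim(a: str, b: str) -> float:
    """Jaccard similarity over lowercase token sets (the token-set side of the
    alias merge; Jaro-Winkler, not shown here, handles spelling variants)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def same_alias(a: str, b: str, token_set_threshold: float = 0.5) -> bool:
    # Exact match (case-insensitive) or token-set overlap above the threshold.
    return a.lower() == b.lower() or token_set_sim(a, b) >= token_set_threshold
```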
Coreference (Coref)
- Option A (rule-based, windowed, person-first): Completed
  - File: `src/modules/extract/coref.py` (pronoun rules, configurable window)
- Option B (SpanBERT distillation): To do
- Option C (LLM conflict arbitration): To do
Time normalization
- Option A (rules/dictionaries + timezone inference): Completed
  - File: `src/modules/time/normalize.py` (rich regex/parsing + TZ handling)
- Option B (SUTime/HeidelTime fallback): To do
  - Not integrated
- Option C (LLM low-confidence disambiguation): Partially done (capability stub)
  - `TimeNormalizer.use_llm_disamb` exists, but no wired call chain yet
Relation extraction (Relations)
- Option A (templates + dependency): Completed
  - File: `src/modules/extract/relations.py` (dep/regex templates for schedules, contacts, preferences, etc.)
- Option B (RoBERTa binary filter): To do
  - No RoBERTa filter present
- Option C (constrained LLM structuring): Partially done (triage layer)
  - `src/modules/extract/triage.py` uses an LLM (Gemini) for keep/drop + confidence; hooked via `ExtractionConfig.triage` in `src/pipelines/memowrite.py`
RBU assimilation (Assimilation)
- Option A (AR snapshot exact match + open versioning): Completed
  - File: `src/modules/assimilation/rbu.py` (SKIP/INSERT/REPLACE/CLOSE transitions and DSL emission)
- Option B (fuzzy-match extensions): To do
  - Currently relies on alias closure + exact dst matching
Event building & merging (Events)
- Option A (hard triggers + thresholded creation + caching): Completed
  - File: `src/modules/events/builder.py` (hard-trigger predicates, fixed-window aggregation, templated `src rel dst` summaries)
- Option B (semantic merge: sim ≥ 0.8 + participant overlap + adjacent/overlapping windows): To do
- Option C (constrained LLM for readable summaries): To do
Episode aggregation
- Option A (fixed time window + participant overlap): Partially done
  - Implemented at the event layer; no standalone episode module yet
- Option B (topic drift detection): To do
Retrieval entry (EAGLE v1)
- Option A (BM25): Completed
  - File: `src/modules/retrieval/index.py` (BM25 implementation + index cache)
- Option B (BM25 + e5-small fusion): Completed
  - e5 embeddings: `src/modules/retrieval/emb_backends/e5.py` (CPU/GPU with fallback); RRF fusion: `src/modules/retrieval/eagle_v1.py`
- Option C (+ ColBERT-lite): To do
  - No ColBERT implementation; a `configs/retrieval/colbert.yaml` draft exists without an execution path
Reranking & expansion
- Expansion A (depth ≤ 2, fanout ≤ m): Completed
  - Controlled by `depth`/`fanout` in `src/modules/retrieval/eagle_v1.py`
- Expansion B (relation-type gating): To do
- Rerank A (none / RRF / MMR): Completed
  - Currently using RRF (MMR not explicitly surfaced)
- Rerank B (MonoT5-small): To do
  - No MonoT5 dependency wired
- Rerank C (MiniLM cross-encoder): Completed
  - Cross-encoder: `src/modules/retrieval/rerank.py` (`cross-encoder/ms-marco-MiniLM-L-6-v2` when available)
Packing
- Option A (events/facts → minimal evidence with active_at/intervals/evidence_id): Completed
  - Integrated into the retrieval output: `src/modules/retrieval/eagle_v1.py` (`pack` field)
- Option B (statistical summaries): To do
Triggers / Backfill / QA / Evaluation
- Triggers A/B/C: Completed (hard triggers / salience keep-archive / periodic & backfill)
  - `src/modules/triggers/strong.py`, `salience.py`, `periodic.py`, `backfill.py`; types in `src/modules/triggers/types.py`
- QA A/B: Completed (baseline LLM + temporal prompts)
  - `src/pipelines/memoread.py` via `modules.llm.gateway`; Gemini provider available
- Evaluation A: Completed (EM / Acc-AR/AW / Hit@k / nDCG@k / costs)
  - `src/pipelines/eval_runner.py` (retrieval + packing + QA aggregation and metrics dump)
- Evaluation B/C: Partially done (error classifier / dashboard)
  - Has error/fallback/empty-graph counters and retrieval decomposition; no dedicated classifier module yet
Configuration & scripts
- P1 config loader: `src/utils/config.py:load_p1_config`
- Retrieval/rerank: `src/modules/retrieval/eagle_v1.py` (under `retrieval`: `use_bm25`, `use_embeddings|use_e5`, `rerank.backend/top/weight`, `depth/fanout/k_rrf`)
- Extraction & triage toggles: `src/pipelines/memowrite.py` → `ExtractionConfig`, `TriageConfig` (`use_flair`, `alias_*_threshold`, `triage.enabled/backend/per_turn_tokens/...`)
- Trigger toggles: `_DEFAULT_TRIGGER_CONFIG` in `src/pipelines/memowrite.py`; dispatched as `TriggerConfig` at runtime
- Evaluation entrypoints: `scripts/run_p1.sh`, `scripts/run_retrieval.sh`, `scripts/run_eval.sh`
- Artifacts: under `reports/` (`metrics.json`, `per_query.jsonl`, `pack_*`, `reader_*`, etc.)