MemoTrace: Bitemporal Knowledge-Graph Memory for Conversational Agents
Abstract
Longer context windows do not reliably solve long-term conversational memory. Models degrade with distractors, show “lost-in-the-middle” position bias, and fail on multi-needle retrieval and time consistency across sessions @liu2023lostmiddle @hsieh2024ruler @shaham2023zeroscrolls @bai2023longbench. Retrieval-augmented methods struggle with fact evolution, temporal repair, evidence traceability, and multi-session stability @asai2023selfrag @yan2024crag. Graph-structured retrieval improves relational reasoning but lacks explicit bitemporal semantics @edge2024graphrag.
MemoTrace converts dialog increments into an append-only, bitemporal fact layer (L1) and an event layer (L2). It exposes two temporal cuts, As-Recorded (system-known time) and As-World (world-valid time), and retrieves compact, auditable subgraphs via a hybrid EAGLE pipeline under fixed token budgets. A transaction-style DSL enables UPSERT/ARCHIVE/RETRO_CORRECT/UPSERT_EVENT with replayable JSONL logs, exact snapshots, and evidence-linked answers. We target wins on long-term memory benchmarks (LoCoMo, LongMemEval) with explicit time-consistency metrics and latency/token constraints @maharana2024locomo @wu2024longmemeval.
Novelty & Positioning
We claim the first true bitemporal data model for LLM agent memory, bridging the auditability gap between valid time and transaction time (dual-cut queries). We further introduce an event visibility constraint to prevent future leakage in As-Recorded answers and design a temporal self-critique mechanism extending Self-RAG to check time consistency. Our retrieval is a graph-first hybrid under strict token budgets. Positioning emphasizes enterprise-ready accountability, not just conversational quality @microsoft_temporal @wiki_temporaldb @asai2023selfrag.
Background & Related Work
Limits of long context. Performance drops with middle placement and increased length; multi-needle retrieval remains brittle @liu2023lostmiddle @hsieh2024ruler @bai2023longbench.
RAG and graph retrieval. Corrective RAG and self-reflective RAG reduce hallucination, but lack bitemporal audit trails @asai2023selfrag @yan2024crag. GraphRAG structures context but omits dual-time semantics @edge2024graphrag.
Temporal representation. SQL:2011 separates system-versioned and application-time periods, enabling point-in-time and audit queries; event/interval formalisms guide temporal annotation @microsoft_temporal @allen1983interval @pustejovsky2003timeml.
Long-term conversational memory. LoCoMo and LongMemEval expose multi-session recall, knowledge updates, temporal reasoning, and abstention gaps @maharana2024locomo @wu2024longmemeval.
Problem Statement
Given a timestamped conversation stream, construct an append-only, bitemporal, versioned fact graph (L1) and an event graph (L2). For any time t, answer under As-Recorded@t (only evidence recorded and visible by t) or As-World@t (world-valid at t). Under equal token budgets, improve multi-needle accuracy and p95 latency while returning auditable subgraphs.
Claims
- Bitemporal facts with dual-cut QA reduce position bias and semantic masking under equal token budgets @liu2023lostmiddle @hsieh2024ruler.
- Event abstraction compresses context while retaining traceable evidence, improving relevance and p95 latency @edge2024graphrag.
- Temporal self-critique improves time-consistent grounding over static factual critique @asai2023selfrag.
Methodology
Design principles
- Truth in facts, not entities. Entities carry identity; changeable statements are versioned facts with bitemporal semantics.
- Two temporal cuts. All operations respect As-Recorded or As-World, aligned with SQL temporal practice @microsoft_temporal.
- Explainability by construction. Events are reconstructible via evidence closure; aggregation nodes never invent facts.
- Append-only and auditable. A JSONL DSL log is the single source of truth; all states are reproducible by replay.
Memory hierarchy and write triggers (cost-first long/short memory)
We adopt a three-tier memory with four write triggers and unified budgets for LoCoMo/LongMemEval.
Tiers
- STM: recent turns context window; never written to LTM.
- WM: ephemeral working memory for current topic, pending confirmations; merged or dropped.
- LTM: MemoTrace L1/L2 store with bitemporal facts and events.
Write triggers
- Event trigger (strong): on knowledge update/correction, commitments/decisions, long-term preferences, identity/schedule changes → UPSERT_EVENT + includes_fact; use RETRO_CORRECT to close outdated valid/record intervals.
- Information-gain/salience trigger (soft): score from recency, question/decision flags, personal slots, cross-session frequency, temporal specificity, centrality. If the score reaches the keep threshold (TAU_KEEP in the pseudocode), write; if it falls below a lower skip threshold, skip; otherwise park in WM.
- Periodic trigger (coverage): at topic boundary or end of session; write summary Event + minimal evidence closure only.
- Backfill trigger (failure-driven): on retrieval miss/abstention/temporal conflict, extract the missing fact and write with source turn and cut label (AR/AW).
Explicit vs implicit: strong triggers and corrections use explicit agent tools; periodic/backfill run implicitly in background to avoid latency.
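The three-way decision of the soft trigger can be sketched as follows. This is a minimal illustration under stated assumptions, not the MemoTrace implementation: the feature weights, the threshold values, and the name TAU_SKIP are hypothetical. Only the feature list and the write/skip/park outcomes come from the trigger description.

```python
from typing import Dict

# Hypothetical thresholds and weights; the proposal does not give values.
TAU_KEEP = 0.7
TAU_SKIP = 0.3
WEIGHTS: Dict[str, float] = {
    "recency": 0.2,
    "question_or_decision": 0.2,
    "personal_slot": 0.2,
    "cross_session_frequency": 0.15,
    "temporal_specificity": 0.15,
    "centrality": 0.1,
}


def salience(features: Dict[str, float]) -> float:
    """Weighted sum of features, each assumed to be pre-scaled to [0, 1]."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


def soft_trigger(features: Dict[str, float]) -> str:
    """Return 'write' (to LTM), 'skip', or 'park' (keep in WM)."""
    score = salience(features)
    if score >= TAU_KEEP:
        return "write"
    if score < TAU_SKIP:
        return "skip"
    return "park"
```

Parking mid-range items in WM defers the write decision until a later turn either confirms the item or lets it drop, which keeps the per-session write budget for high-salience content.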
Budgets & GC
- Per-session caps: strong writes ≤ 3, periodic ≤ 1; others to WM. Write budget ~120 tokens/session (avg). Read budget ~600 tokens/QA including evidence.
- LTM compression: consolidate low-salience Events into Episodes with stats only; deduplicate facts via alias merges (append-only). Archive when salience falls below an archive threshold.
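A per-session cap of this kind could be enforced with a small counter, sketched below. The class name and the routing rule are assumptions: the sketch sends both over-cap writes and writes from uncapped trigger kinds to WM, which is one reading of "others to WM".

```python
from dataclasses import dataclass, field
from typing import Dict

# Caps taken from the text: strong writes <= 3 and periodic writes <= 1 per session.
SESSION_CAPS: Dict[str, int] = {"strong": 3, "periodic": 1}


@dataclass
class SessionWriteBudget:
    """Counts LTM writes for one session (illustrative name)."""
    used: Dict[str, int] = field(
        default_factory=lambda: {"strong": 0, "periodic": 0}
    )

    def route(self, trigger: str) -> str:
        """Return 'LTM' if the write fits its cap, otherwise 'WM'."""
        cap = SESSION_CAPS.get(trigger)
        if cap is None or self.used[trigger] >= cap:
            return "WM"  # uncapped trigger kinds and overflow go to working memory
        self.used[trigger] += 1
        return "LTM"
```

A fresh SessionWriteBudget would be created at each session boundary so that counts do not carry over.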
L1 layer: nodes and versioned facts
Node types
Entity(id, types[], canonical_name, aliases[])
Literal(id, datatype, value) # text, number, boolean, ISO-8601, etc.
TimeInstant(id, iso8601, tz)
Document(id, uri, hash) # provenance
Mention(id, doc_id, span, text) # surface evidence
Versioned facts
FactVersion = {
id, src, rel, dst, # assertion triple
valid=[s_v, e_v), # world-time validity
record=[s_r, e_r), # system-known/visible time
source: Document.id,
evidence_id: Mention.id[],
confidence: float
}
Required relations
has_attr(Entity -> Literal)
alias_of(Entity' -> Entity)
mentioned_in(Entity|Event -> Mention)
Invariants
- At most one open version per id.
- Version starts are monotone.
- For any id, a snapshot selects the candidate with max record.start among versions satisfying the temporal predicate.
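The versioned-fact record and the first two invariants can be sketched in Python as below. The sketch rests on three assumptions that the text does not state explicitly: all versions of one fact share the same id, "open" means record.end is unset, and "monotone" means record.start never decreases in append order. The function name check_invariants is illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Interval:
    start: float
    end: Optional[float] = None  # None means the interval is still open

    def contains(self, t: float) -> bool:
        """Half-open membership test: start <= t < end."""
        return self.start <= t and (self.end is None or t < self.end)


@dataclass(frozen=True)
class FactVersion:
    id: str            # assumed to be shared by all versions of one fact
    src: str
    rel: str
    dst: str
    valid: Interval    # world-time validity
    record: Interval   # system-known time


def check_invariants(versions: List[FactVersion]) -> List[str]:
    """Return violation messages; versions are assumed to be in append order."""
    problems: List[str] = []
    open_count: Dict[str, int] = {}
    last_start: Dict[str, float] = {}
    for v in versions:
        if v.record.end is None:
            open_count[v.id] = open_count.get(v.id, 0) + 1
            if open_count[v.id] > 1:
                problems.append(f"{v.id}: more than one open version")
        if v.id in last_start and v.record.start < last_start[v.id]:
            problems.append(f"{v.id}: record.start went backwards")
        last_start[v.id] = v.record.start
    return problems
```

Running such a check during log replay would surface a malformed log before any snapshot is served from it.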
L1 visibility
- Nodes with no incident visible facts and no explicit creation record can be hidden in As-Recorded.
- alias_of follows As-Recorded visibility to avoid leaking future coreference.
L2 layer: events and optional aggregation
Event nodes
Event(id, summary, time_span=[t_start, t_end], status)
time_span is narrative, not a replacement for L1 bitemporal truth @allen1983interval @pustejovsky2003timeml.
L2 edge types and semantics
includes_fact(Event -> FactVersion.id)
has_participant(Event -> Entity, role)
has_time(Event -> TimeInstant|Literal)
related_to(Event -> Event, qualifier)
subevent_of(Event -> Event)
causes(Event -> Event)
corefers(Event -> Event)
Policy: includes_fact defines the auditable closure. All participants/times/locations are reconstructible from that closure.
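A minimal sketch of how such a closure might be computed from includes_fact is given below. The Fact and Event classes are simplified stand-ins for the L1/L2 records, and the shape of the returned dictionary is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Fact:
    id: str
    src: str
    rel: str
    dst: str
    evidence_id: List[str] = field(default_factory=list)  # Mention ids


@dataclass
class Event:
    id: str
    summary: str
    includes_fact: List[str]


def evidence_closure(event: Event, facts: Dict[str, Fact]) -> Dict[str, Set[str]]:
    """Collect everything reachable from the event through includes_fact."""
    closure: Dict[str, Set[str]] = {"facts": set(), "nodes": set(), "mentions": set()}
    for fid in event.includes_fact:
        fact = facts[fid]  # a missing id raises KeyError: the closure must be complete
        closure["facts"].add(fact.id)
        closure["nodes"].update((fact.src, fact.dst))
        closure["mentions"].update(fact.evidence_id)
    return closure
```

Because participants and times are derived from the included facts rather than stored separately on the event, an event summary cannot assert anything that lacks a fact and a mention behind it.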
Optional Episode aggregation
Episode(id, title, window=[t_start, t_end])
aggregates(Episode -> Event)
Episodes compress across sessions for retrieval/visualization only; visibility derives from contained events; Episodes never act as evidence sources.
Bitemporal snapshots and visibility
Snapshot selection
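As-World@t keeps the versions whose valid interval contains t (valid.start ≤ t < valid.end); As-Recorded@t keeps those whose record interval contains t (record.start ≤ t < record.end). In both cuts, each fact id resolves to the candidate with the greatest record.start. These mirror application-time vs system-versioned queries and preserve prefix-consistency under append-only logs @microsoft_temporal.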
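As a worked illustration of how the two cuts can disagree, consider one fact that is retro-corrected. The data below is invented for the example (integer timestamps, a single fact id alice.city), and the tuple layout is an assumption. The intervals show the state after the correction has closed the first version.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, Optional[int]]          # half-open [start, end); None = open
Version = Tuple[str, str, Span, Span]     # (fact id, value, valid, record)

LOG: List[Version] = [
    # Recorded at t=1: Alice lives in Paris. Retro-corrected at t=5 so that
    # validity ends at t=3 and the record interval closes at t=5.
    ("alice.city", "Paris", (1, 3), (1, 5)),
    # Recorded at t=5: Alice has lived in Berlin since t=3.
    ("alice.city", "Berlin", (3, None), (5, None)),
]


def active(span: Span, t: int) -> bool:
    start, end = span
    return start <= t and (end is None or t < end)


def snapshot(mode: str, t: int, log: List[Version]) -> Dict[str, str]:
    """Per fact id, keep the active version with the greatest record.start."""
    best: Dict[str, Version] = {}
    for v in log:
        span = v[2] if mode == "AW" else v[3]
        if active(span, t) and (v[0] not in best or v[3][0] > best[v[0]][3][0]):
            best[v[0]] = v
    return {fid: v[1] for fid, v in best.items()}


print(snapshot("AR", 4, LOG))  # {'alice.city': 'Paris'}: correction not yet recorded
print(snapshot("AW", 4, LOG))  # {'alice.city': 'Berlin'}: world state after the move
```

At t=4 the system had only been told about Paris, so the As-Recorded answer reproduces what an agent would have said at that time, while the As-World answer reflects what was later learned to be true.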
Event visibility (As-Recorded)
If any required evidence is not visible at t, hide the event in As-Recorded. Freshness for As-Recorded uses the latest record.start among the event's visible included fact versions.
Episode visibility
An Episode's visibility derives from its contained events. Episodes never replace evidence.
Properties
- Monotonicity: under append-only updates, is non-decreasing.
- Cut isolation: As-Recorded outputs never depend on future records; As-World aligns with valid intervals.
DSL: operations and idempotency
UPSERT_ENTITY(canonical_name, aliases[])
MERGE_ENTITY(src_id, dst_id)
UPSERT_EDGE(src, rel, dst, valid, record, evidence[])
ARCHIVE_EDGE(edge_id, record_end)
RETRO_CORRECT(edge_id, valid_end, record_end)
UPSERT_EVENT(summary, participants[], includes_fact[], time_span)
LINK(role|temporal|causal|coref|subevent|related_to)
UPSERT_EPISODE(title, window, events[])
All destructive changes are expressed as appended records. Events/Episodes use deterministic IDs like hash(topic, day, n) to ensure replay idempotence.
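A minimal sketch of deterministic IDs and idempotent replay follows. The JSONL field names (op, topic, day, n, summary, includes_fact), the evt_ prefix, and the choice of SHA-256 are assumptions for illustration. Note that Python's built-in hash() is salted per process for strings, so a cryptographic digest is used to get IDs that are stable across runs.

```python
import hashlib
import json
from typing import Dict, Iterable


def event_id(topic: str, day: str, n: int) -> str:
    """Stable ID derived from (topic, day, n)."""
    key = json.dumps([topic, day, n], ensure_ascii=False)
    return "evt_" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def replay(lines: Iterable[str]) -> Dict[str, dict]:
    """Rebuild the event table from JSONL records; later records win per ID."""
    events: Dict[str, dict] = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the log
        op = json.loads(line)
        if op["op"] == "UPSERT_EVENT":
            eid = event_id(op["topic"], op["day"], op["n"])
            events[eid] = {
                "summary": op["summary"],
                "includes_fact": op["includes_fact"],
            }
    return events
```

Because the ID depends only on the record's content, replaying the same log twice, or replaying a log in which a record was appended twice, yields the same table.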
Retrieval and packing (EAGLE)
Entry: BM25 + E5/mE5 embeddings for broad recall @wang2022e5 @wang2024me5 + ColBERTv2 late interaction for multi-needle robustness @santhanam2021colbertv2.
Apply the temporal cut first (snapshot). Enforce event visibility in As-Recorded. Freshness features: L1 uses record.start (AR) or valid.{start|end} (AW); L2 uses the latest record.start among the event's visible included facts (AR) or time_span (AW).
Across-layer expansion: from the top-k L1 and L2 seeds, expand bidirectionally L1↔L2 up to a bounded depth and a per-relation fanout limit, de-duplicating nodes and facts. Attach intervals and active_at flags to facts; active_fact_count to events.
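The bounded expansion can be sketched as a breadth-first traversal. The adjacency-map representation, the node naming, and the parameter names depth and fanout are assumptions. Bidirectional expansion is modelled by storing edges in both directions in the map.

```python
from collections import deque
from typing import Dict, List, Set

# node -> relation -> neighbours; "fact:" and "event:" prefixes stand in
# for L1 and L2 nodes (illustrative naming).
Graph = Dict[str, Dict[str, List[str]]]


def expand(graph: Graph, seeds: List[str], depth: int, fanout: int) -> List[str]:
    """Breadth-first expansion, following at most `fanout` edges per relation."""
    order: List[str] = list(dict.fromkeys(seeds))  # de-duplicate, keep seed order
    seen: Set[str] = set(order)
    queue = deque((s, 0) for s in order)
    while queue:
        node, d = queue.popleft()
        if d >= depth:
            continue  # depth limit reached for this branch
        for neighbours in graph.get(node, {}).values():
            for nxt in neighbours[:fanout]:
                if nxt not in seen:  # de-duplicate across layers
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append((nxt, d + 1))
    return order
```

In this sketch the fanout limit is applied before de-duplication, so an already-seen neighbour still consumes a slot. A variant could skip seen nodes first if fuller coverage matters more than a strict per-relation bound.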
Rerank: Stage-1 RRF/MMR; Stage-2 lightweight cross-encoder (e.g., MonoT5-small) over top-64 with explicit temporal features @monot5.
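For the Stage-1 fusion, reciprocal rank fusion (RRF) scores each candidate by summing 1/(k + rank) over the input rankings. The sketch below assumes the commonly used constant k = 60 and plain string IDs; neither value is specified in this proposal, and the tie-break by ID is added only to make the output deterministic.

```python
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank), rank from 1."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25 = ["a", "b", "c"]
dense = ["b", "c", "a"]
late = ["b", "a", "d"]
print(rrf([bm25, dense, late]))  # ['b', 'a', 'c', 'd']
```

RRF uses only ranks, so the BM25, embedding, and late-interaction scores do not need to be calibrated against each other before fusion.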
Pack: emit a compact subgraph — events first, then the minimal evidence closure. Episodes appear only as summarizing containers under tight token budgets.
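One way to realize "events first, then evidence" under a token cap is a two-pass greedy packer, sketched below. The tuple layout, the precomputed token costs, and the greedy skip-if-too-large rule are assumptions. The proposal does not fix a packing algorithm.

```python
from typing import List, Tuple

Evidence = Tuple[str, int]               # (text, token cost)
Event = Tuple[str, int, List[Evidence]]  # (summary, token cost, evidence closure)


def pack(events: List[Event], budget: int) -> List[str]:
    """Greedy packing: event summaries first, then evidence for packed events."""
    out: List[str] = []
    kept: List[Event] = []
    used = 0
    for ev in events:  # events are assumed to be ranked best-first
        if used + ev[1] <= budget:
            out.append(ev[0])
            kept.append(ev)
            used += ev[1]
    for ev in kept:  # second pass: evidence only for events that made it in
        for text, cost in ev[2]:
            if text not in out and used + cost <= budget:
                out.append(text)
                used += cost
    return out
```

Evidence shared by several events is emitted once, so overlapping closures reduce the total token cost.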
Complexity: snapshot = index slice + tail scan; retrieval is bounded by the token cap, with early stopping by coverage@k.
Temporal self-critique (extension to Self-RAG)
We add a reflection signal to verify that generated statements are supported and valid for the query’s temporal cut (AR/AW). Training data is synthesized via DSL replays that create lagged corrections and counterfactuals; learning target penalizes temporally inconsistent generations even if textually supported @asai2023selfrag.
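The kind of label such training data would carry can be illustrated with a rule-based check. This is a stand-in for explanation only, not the learned reflection signal: the label names and the function temporal_label are hypothetical, and cited facts are reduced to their (valid, record) interval pairs.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, Optional[float]]  # half-open [start, end); None = open


def active(interval: Interval, t: float) -> bool:
    start, end = interval
    return start <= t and (end is None or t < end)


def temporal_label(
    cited: List[Tuple[Interval, Interval]], mode: str, t: float
) -> str:
    """Label an answer from the (valid, record) intervals of its cited facts."""
    if not cited:
        return "unsupported"
    for valid, record in cited:
        interval = valid if mode == "AW" else record
        if not active(interval, t):
            # The text supports the statement, but not under this cut at time t.
            return "temporally_inconsistent"
    return "consistent"
```

A statement citing a fact recorded only after t would be labeled temporally_inconsistent for an As-Recorded query even though the cited text supports it, which is the case a purely factual critique would miss.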
Pseudocode
Snapshot
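def snapshot_L1(mode, t, versions):
    """Per fact id, return the version active at t with the latest record.start."""
    if mode == "AW":  # As-World: filter on world-valid time
        C = [v for v in versions if v.valid.start <= t < (v.valid.end or float("inf"))]
    else:             # As-Recorded: filter on system-known time
        C = [v for v in versions if v.record.start <= t < (v.record.end or float("inf"))]
    best = {}
    for v in C:
        if v.id not in best or v.record.start > best[v.id].record.start:
            best[v.id] = v
    return list(best.values())
Event visibility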
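def event_visible_at(event, t):
    """Return the latest record.start among visible evidence, or None if hidden."""
    latest = None
    for fv_id in event.includes_fact:
        # history(fv_id) is assumed to return all versions of the fact.
        vis = [v for v in history(fv_id)
               if v.record.start <= t < (v.record.end or float("inf"))]
        if not vis:
            return None  # one missing piece of evidence hides the whole event
        newest = max(v.record.start for v in vis)
        latest = newest if latest is None else max(latest, newest)
    return latest
Turn handler with write triggers and backfill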
def on_turn(dialogue, q, t, view):  # view in {"AR", "AW"}
    # stm, TAU_KEEP, K, and the helper functions are assumed to be defined elsewhere.
    turn = dialogue[-1]
    stm.update(turn)
    events = detect_events(turn)
    sal = salience(turn)
    # Write triggers in priority order: strong, soft (salience), periodic.
    if is_update_or_correction(events):
        write_strong(events, t)
    elif sal >= TAU_KEEP:
        write_keep(turn, t)
    elif is_topic_shift(dialogue):
        write_periodic_summary(dialogue, t)
    ans, ev = memosearch(q, t, view)
    # Backfill trigger: retrieval miss, abstention, or temporal conflict.
    if ev.is_empty() or ans.abstain or ans.temporal_conflict:
        backfill_from(dialogue[-K:], q, t)
    return ans
Evaluation Plan
Benchmarks: LoCoMo for multi-session, long-turn QA; LongMemEval for five abilities with focus on Knowledge Updates, Temporal Reasoning, Abstention @maharana2024locomo @wu2024longmemeval.
Tasks: As-Recorded/As-World QA with contradictions and retro-corrections; multi-needle retrieval with counting/aggregation; no-literal-match QA; long-term memory tasks.
Metrics: Acc-AR/AW; Time-Consistency-F1; Temporal-Conflict-Rate↓; Evidence-Hit@k, NDCG@k; p50/p95 latency; token count; calibration (ECE/Brier) @guo2017calibration.
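For the calibration metrics, a minimal sketch of expected calibration error (ECE) and the Brier score is shown below. The equal-width binning and the default of 10 bins are conventional choices assumed here, not settings taken from this proposal.

```python
from typing import List


def expected_calibration_error(
    conf: List[float], correct: List[bool], bins: int = 10
) -> float:
    """ECE: sum over bins of (n_b / n) * |accuracy_b - mean confidence_b|."""
    n = len(conf)
    if n == 0:
        raise ValueError("need at least one prediction")
    # Per bin: [count, sum of confidence, number correct]
    totals = [[0, 0.0, 0.0] for _ in range(bins)]
    for c, ok in zip(conf, correct):
        b = min(int(c * bins), bins - 1)  # confidence 1.0 falls in the last bin
        totals[b][0] += 1
        totals[b][1] += c
        totals[b][2] += float(ok)
    return sum(
        cnt / n * abs(hit / cnt - s / cnt) for cnt, s, hit in totals if cnt
    )


def brier(conf: List[float], correct: List[bool]) -> float:
    """Mean squared gap between confidence and the 0/1 outcome."""
    if not conf:
        raise ValueError("need at least one prediction")
    return sum((c - float(ok)) ** 2 for c, ok in zip(conf, correct)) / len(conf)
```

Reporting both is useful because ECE depends on the binning while the Brier score does not.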
Baselines: equal-budget long-context prompting; vanilla vector-RAG; GraphRAG; SGMem-style sentence graph; DyG-RAG-style timeline retrieval; Self-RAG critique @edge2024graphrag @asai2023selfrag.
Ablations: −bitemporal, −, −temporal-critique, −Episode, −ColBERTv2, −periodic, −backfill, −salience.
Cost curves: report accuracy–cost–latency Pareto fronts; tokens/turn (read/write) and LTM growth rate under caps.
Risks & Mitigations
- Entity/coref drift: append-only alias_of merges + low-frequency human curation; AR cut forbids future-merge leakage.
- Over-writing/ballooning: trigger priority and cooldown; periodic writes are summaries only; archive by salience.
- Temporal parsing errors: TIMEX normalization; uncertainty tags; AR/AW fallback strategies.
- Self-critique bias: balance positive/negative temporal cases in training; audit false reject/accept rates.
Ethics & Governance
Evidence traceability with deletion/correction interfaces; minimal retention for sensitive slots; full audit logs of who/when/why.
Expected Contributions
- A bitemporal, auditable graph memory and DSL for agentic systems.
- Dual-cut temporal QA with verifiable time consistency and no future leakage.
- Event abstraction with minimal-evidence, token-budgeted packing.
- Cost-aware long/short memory with adaptive write triggers.
Resources & Reproducibility
Code: src/{core,modules,pipelines,utils}, scripts/run_mvp.sh. Run: ruff + mypy + pytest + run_mvp.sh. Artifacts: snapshots/logs/configs; seeds recorded; leakage test set and query–evidence packages released.
Timeline
- M1 (done): GraphStore, bitemporal snapshots, event store with As-Recorded visibility, CLI (memowrite/memoread/memosearch), tests with coverage gates.
- M2: segmentation/consolidation v1; EAGLE v1 (embedding entry + light rerank); subgraph visualization; end-to-end evaluation; initial temporal self-critique.