MemoTrace: Bitemporal Knowledge-Graph Memory for Conversational Agents

Abstract

Longer context windows do not reliably solve long-term conversational memory. Models degrade with distractors, show “lost-in-the-middle” position bias, and fail on multi-needle retrieval and time consistency across sessions @liu2023lostmiddle @hsieh2024ruler @shaham2023zeroscrolls @bai2023longbench. Retrieval-augmented methods struggle with fact evolution, temporal repair, evidence traceability, and multi-session stability @asai2023selfrag @yan2024crag. Graph-structured retrieval improves relational reasoning but lacks explicit bitemporal semantics @edge2024graphrag.

MemoTrace converts dialog increments into an append-only, bitemporal fact layer (L1) and an event layer (L2). It exposes two temporal cuts—As-Recorded (system-known time) and As-World (world-valid time)—and retrieves compact, auditable subgraphs via a hybrid EAGLE pipeline under fixed token budgets. A transaction-style DSL enables UPSERT/ARCHIVE/RETRO_CORRECT/UPSERT_EVENT with replayable JSONL logs, exact snapshots, and evidence-linked answers. We target wins on long-term memory benchmarks (LoCoMo, LongMemEval) with explicit time-consistency metrics and latency/tokens constraints @maharana2024locomo @wu2024longmemeval.

Novelty & Positioning

To our knowledge, MemoTrace is the first bitemporal data model for LLM agent memory, bridging the auditability gap between valid time and transaction time (dual-cut queries). We further introduce an event visibility constraint that prevents future leakage in As-Recorded answers, and a temporal self-critique mechanism extending Self-RAG to check time consistency. Retrieval is a graph-first hybrid under strict token budgets. Positioning emphasizes enterprise-ready accountability, not just conversational quality @microsoft_temporal @wiki_temporaldb @asai2023selfrag.

Limits of long context. Performance drops with middle placement and increased length; multi-needle retrieval remains brittle @liu2023lostmiddle @hsieh2024ruler @bai2023longbench.

RAG and graph retrieval. Corrective RAG and self-reflective RAG reduce hallucination, but lack bitemporal audit trails @asai2023selfrag @yan2024crag. GraphRAG structures context but omits dual-time semantics @edge2024graphrag.

Temporal representation. SQL:2011 separates system-versioned and application-time periods, enabling point-in-time and audit queries; event/interval formalisms guide temporal annotation @microsoft_temporal @allen1983interval @pustejovsky2003timeml.

Long-term conversational memory. LoCoMo and LongMemEval expose multi-session recall, knowledge updates, temporal reasoning, and abstention gaps @maharana2024locomo @wu2024longmemeval.

Problem Statement

Given a timestamped conversation stream, construct an append-only, bitemporal, versioned fact graph (L1) and an event graph (L2). For any time t, answer under As-Recorded@t (only evidence recorded and visible by t) or As-World@t (facts world-valid at t). Under equal token budgets, improve multi-needle accuracy and p95 latency while returning auditable subgraphs.

Claims

  1. Bitemporal facts with dual-cut QA reduce position bias and semantic masking under equal token budgets @liu2023lostmiddle @hsieh2024ruler.
  2. Event abstraction compresses context while retaining traceable evidence, improving relevance and p95 latency @edge2024graphrag.
  3. Temporal self-critique improves time-consistent grounding over static factual critique @asai2023selfrag.

Methodology

Design principles

  • Truth in facts, not entities. Entities carry identity; changeable statements are versioned facts with bitemporal semantics.
  • Two temporal cuts. All operations respect As-Recorded or As-World, aligned with SQL temporal practice @microsoft_temporal.
  • Explainability by construction. Events are reconstructible via evidence closure; aggregation nodes never invent facts.
  • Append-only and auditable. A JSONL DSL log is the single source of truth; all states are reproducible by replay.

Memory hierarchy and write triggers (cost-first long/short memory)

We adopt a three-tier memory with four write triggers and unified budgets for LoCoMo/LongMemEval.

Tiers

  • STM: the recent-turns context window; never written to LTM.
  • WM: ephemeral working memory for the current topic and pending confirmations; merged into LTM or dropped.
  • LTM: MemoTrace L1/L2 store with bitemporal facts and events.

Write triggers

  1. Event trigger (strong): on knowledge update/correction, commitments/decisions, long-term preferences, identity/schedule changes → UPSERT_EVENT + includes_fact; use RETRO_CORRECT to close outdated valid/record.
  2. Information-gain/salience trigger (soft): score from recency, question/decision flags, personal slots, cross-session frequency, temporal specificity, and centrality. If the score clears the keep threshold, write; if it falls below the drop threshold, skip; otherwise park in WM.
  3. Periodic trigger (coverage): at topic boundary or end of session; write summary Event + minimal evidence closure only.
  4. Backfill trigger (failure-driven): on retrieval miss/abstention/temporal conflict, extract the missing fact and write with source turn and cut label (AR/AW).

Explicit vs implicit: strong triggers and corrections use explicit agent tools; periodic/backfill run implicitly in background to avoid latency.

Budgets & GC

  • Per-session caps: strong writes ≤ 3, periodic ≤ 1; others to WM. Write budget ~120 tokens/session (avg). Read budget ~600 tokens/QA including evidence.
  • LTM compression: consolidate low-salience Events into Episodes with stats only; deduplicate facts via alias merges (append-only); archive by salience when budgets are exceeded.
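
The soft trigger's gating can be sketched as follows. The feature names, weights, and the TAU_KEEP/TAU_DROP values here are illustrative assumptions, not MemoTrace's tuned configuration:

```python
# Sketch of the information-gain/salience gate (trigger 2).
# Weights and thresholds are hypothetical placeholders.
TAU_KEEP, TAU_DROP = 0.6, 0.3

WEIGHTS = {
    "recency": 0.15, "question_flag": 0.2, "personal_slot": 0.25,
    "cross_session_freq": 0.15, "temporal_specificity": 0.15, "centrality": 0.1,
}

def salience(features: dict) -> float:
    """Weighted sum of normalized [0, 1] features."""
    return sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)

def gate(features: dict) -> str:
    """Route a turn's extracted content: LTM write, skip, or park in WM."""
    s = salience(features)
    if s >= TAU_KEEP:
        return "write"
    if s <= TAU_DROP:
        return "skip"
    return "wm"
```

Anything between the two thresholds stays in WM until the topic closes, at which point it is merged or dropped.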

L1 layer: nodes and versioned facts

Node types

Entity(id, types[], canonical_name, aliases[])
Literal(id, datatype, value)            # text, number, boolean, ISO-8601, etc.
TimeInstant(id, iso8601, tz)
Document(id, uri, hash)                 # provenance
Mention(id, doc_id, span, text)         # surface evidence

Versioned facts

FactVersion = {
  id, src, rel, dst,                    # assertion triple
  valid=[s_v, e_v),                     # world-time validity
  record=[s_r, e_r),                    # system-known/visible time
  source: Document.id,
  evidence_id: Mention.id[],
  confidence: float
}
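
A minimal Python sketch of the schema above, using half-open intervals for both time dimensions (the `Interval` helper is an assumption; field names follow the record above):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class Interval:
    """Half-open interval [start, end); end=None means still open."""
    start: float
    end: Optional[float] = None

    def contains(self, t: float) -> bool:
        return self.start <= t and (self.end is None or t < self.end)

@dataclass(frozen=True)
class FactVersion:
    id: str
    src: str
    rel: str
    dst: str
    valid: Interval                   # world-time validity
    record: Interval                  # system-known/visible time
    source: str                       # Document.id
    evidence_id: List[str] = field(default_factory=list)  # Mention.id[]
    confidence: float = 1.0
```

Half-open intervals make adjacent versions non-overlapping by construction, which is what the "at most one open version per id" invariant below relies on.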

Required relations

has_attr(Entity -> Literal)
alias_of(Entity' -> Entity)
mentioned_in(Entity|Event -> Mention)

Invariants

  • At most one open version per id.
  • Version starts are monotone.
  • For any id, a snapshot selects the candidate with max record.start among versions satisfying the temporal predicate.

L1 visibility

  • Nodes with no incident visible facts and no explicit creation record can be hidden in As-Recorded.
  • alias_of follows As-Recorded visibility to avoid leaking future coreference.

L2 layer: events and optional aggregation

Event nodes

Event(id, summary, time_span=[t_start, t_end], status)

time_span is narrative, not a replacement for L1 bitemporal truth @allen1983interval @pustejovsky2003timeml.

L2 edge types and semantics

includes_fact(Event -> FactVersion.id)
has_participant(Event -> Entity, role)
has_time(Event -> TimeInstant|Literal)
related_to(Event -> Event, qualifier)
subevent_of(Event -> Event)
causes(Event -> Event)
corefers(Event -> Event)

Policy: includes_fact defines the auditable closure. All participants/times/locations are reconstructible from that closure.
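
The closure policy can be sketched as follows; the `get_fact` / `get_mentions` store lookups are assumed interfaces, not MemoTrace's actual API:

```python
def evidence_closure(event, get_fact, get_mentions):
    """Collect the auditable closure of an event: every included FactVersion
    plus the Mentions it cites. Participants, times, and locations are
    reconstructible from these facts, so nothing outside the closure is
    ever needed as evidence."""
    facts, mentions = [], []
    for fv_id in event.includes_fact:
        fv = get_fact(fv_id)
        facts.append(fv)
        mentions.extend(get_mentions(fv))
    return facts, mentions
```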

Optional Episode aggregation

Episode(id, title, window=[t_start, t_end])
aggregates(Episode -> Event)

Episodes compress across sessions for retrieval/visualization only; visibility derives from contained events; Episodes never act as evidence sources.

Bitemporal snapshots and visibility

Snapshot selection

As-World@t selects versions whose valid interval contains t (valid.start ≤ t < valid.end); As-Recorded@t selects versions whose record interval contains t. Within each fact id, the candidate with the latest record.start wins. These mirror application-time vs system-versioned queries and preserve prefix-consistency under append-only logs @microsoft_temporal.

Event visibility (As-Recorded)

If any required evidence is not visible at t, hide the event in As-Recorded. Freshness for As-Recorded uses the latest record.start across the event's evidence closure.

Episode visibility

An Episode is visible iff at least one contained Event is visible under the active cut; Episodes never replace evidence.
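
Derived Episode visibility can be sketched as follows, reusing the per-event check (passed in here as a callable; `aggregates` follows the L2 schema):

```python
def episode_visible_at(episode, t, event_visible_at):
    """An Episode is visible at t iff at least one aggregated Event is
    visible at t. Episodes summarize for retrieval/visualization only and
    never serve as evidence themselves."""
    return any(event_visible_at(ev, t) is not None for ev in episode.aggregates)
```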

Properties

  • Monotonicity: under append-only updates, the set of facts and events visible at any fixed t is non-decreasing.
  • Cut isolation: As-Recorded outputs never depend on future records; As-World aligns with valid intervals.

DSL: operations and idempotency

UPSERT_ENTITY(canonical_name, aliases[])
MERGE_ENTITY(src_id, dst_id)
UPSERT_EDGE(src, rel, dst, valid, record, evidence[])
ARCHIVE_EDGE(edge_id, record_end)
RETRO_CORRECT(edge_id, valid_end, record_end)
UPSERT_EVENT(summary, participants[], includes_fact[], time_span)
LINK(role|temporal|causal|coref|subevent|related_to)
UPSERT_EPISODE(title, window, events[])

All destructive changes are expressed as appended records. Events/Episodes use deterministic IDs like hash(topic, day, n) to ensure replay idempotence.
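
The deterministic-ID scheme can be sketched with a stable hash (SHA-256 with a truncated digest here; the exact key fields and digest length are assumptions):

```python
import hashlib

def event_id(topic: str, day: str, n: int) -> str:
    """Deterministic Event ID from (topic, day, n): replaying the same JSONL
    log reproduces the same IDs, so UPSERT_EVENT is idempotent across replays."""
    key = f"{topic}|{day}|{n}".encode("utf-8")
    return "ev_" + hashlib.sha256(key).hexdigest()[:16]
```

Because the ID depends only on log content, a second replay upserts onto the same node instead of duplicating it.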

Retrieval and packing (EAGLE)

Entry: BM25 + E5/mE5 embeddings for broad recall @wang2022e5 @wang2024me5 + ColBERTv2 late interaction for multi-needle robustness @santhanam2021colbertv2.

Apply the temporal cut first (snapshot). Enforce event visibility in As-Recorded. Freshness features: L1 uses record.start (AR) or valid.{start|end} (AW); L2 uses the latest evidence record.start (AR) or time_span (AW).

Across-layer expansion: from top-k L1 and L2 seeds, expand bidirectionally L1↔L2 with bounded depth and per-relation fanout caps, de-duplicating nodes and facts. Attach intervals and active_at flags to facts; active_fact_count to events.
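
The expansion can be sketched as a bounded BFS; the `neighbors(node)` interface and the default depth/fanout values are assumptions:

```python
from collections import deque

def expand(seeds, neighbors, max_depth=2, fanout=5):
    """Bidirectional L1<->L2 expansion with bounded depth and per-relation
    fanout, de-duplicating visited nodes. `neighbors(node)` yields
    (relation, neighbor) pairs in ranked order."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        taken = {}                       # per-relation fanout counters
        for rel, nb in neighbors(node):
            if taken.get(rel, 0) >= fanout or nb in seen:
                continue
            taken[rel] = taken.get(rel, 0) + 1
            seen.add(nb)
            frontier.append((nb, depth + 1))
    return seen
```

Depth and fanout caps keep the subgraph small enough to pack under the read budget before reranking.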

Rerank: Stage-1 RRF/MMR; Stage-2 lightweight cross-encoder (e.g., MonoT5-small) over top-64 with explicit temporal features @monot5.
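
Stage-1 fusion via Reciprocal Rank Fusion can be sketched as follows (k=60 is the common RRF default; this is a sketch of the standard formula, not MemoTrace's tuned fusion):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each input ranking contributes 1/(k + rank)
    to a document's score. `rankings` is a list of ordered doc-id lists,
    e.g. from BM25, dense embeddings, and late interaction."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```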

Pack: emit a compact subgraph — events first, then the minimal evidence closure. Episodes appear only as summarizing containers under tight token budgets.

Complexity: snapshot = index slice + tail scan; retrieval runs under a token cap with early stopping by coverage@k.

Temporal self-critique (extension to Self-RAG)

We add a reflection signal to verify that generated statements are supported and valid for the query’s temporal cut (AR/AW). Training data is synthesized via DSL replays that create lagged corrections and counterfactuals; learning target penalizes temporally inconsistent generations even if textually supported @asai2023selfrag.
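
The underlying consistency rule can be sketched as follows. This is a simplification: the learned critic scores generations, while a rule-based check like this one labels the synthesized training data (the `Interval`-like attribute shapes are assumptions):

```python
def temporally_consistent(claim_facts, t, cut):
    """A claim is time-consistent at t under a cut iff every supporting
    FactVersion's relevant interval covers t: valid for As-World ("AW"),
    record for As-Recorded ("AR"). Textual support alone is not enough."""
    for fv in claim_facts:
        iv = fv.valid if cut == "AW" else fv.record
        if not (iv.start <= t and (iv.end is None or t < iv.end)):
            return False
    return True
```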

Pseudocode

Snapshot

def snapshot_L1(mode, t, versions):
    """Bitemporal snapshot: "AW" filters by the valid interval, anything
    else ("AR") by the record interval; per fact id, keep the version
    with the latest record.start."""
    if mode == "AW":
        C = [v for v in versions if v.valid.start <= t < (v.valid.end or float("inf"))]
    else:
        C = [v for v in versions if v.record.start <= t < (v.record.end or float("inf"))]
    return argmax_groupby(C, key=lambda v: v.id, argmax=lambda v: v.record.start)
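
The `argmax_groupby` helper used above can be sketched as:

```python
def argmax_groupby(items, key, argmax):
    """Group items by `key` and keep, per group, the item maximizing
    `argmax` (here: the freshest record.start per fact id)."""
    best = {}
    for it in items:
        k = key(it)
        if k not in best or argmax(it) > argmax(best[k]):
            best[k] = it
    return list(best.values())
```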

Event visibility

def event_visible_at(event, t):
    """Return the latest evidence record.start if every included fact is
    visible at t (As-Recorded); return None to hide the event."""
    latest = None
    for fv_id in event.includes_fact:
        vis = [v for v in history(fv_id)
               if v.record.start <= t < (v.record.end or float("inf"))]
        if not vis:
            return None                  # missing evidence hides the event
        newest = max(v.record.start for v in vis)
        latest = newest if latest is None else max(latest, newest)
    return latest

Turn handler with write triggers and backfill

def on_turn(dialogue, q, t, view):  # view in {"AR","AW"}
    stm.update(dialogue[-1])
    events = detect_events(dialogue[-1])
    sal = salience(dialogue[-1])
 
    if is_update_or_correction(events): write_strong(events, t)
    elif sal >= TAU_KEEP: write_keep(dialogue[-1], t)
    elif is_topic_shift(dialogue): write_periodic_summary(dialogue, t)
 
    ans, ev = memosearch(q, t, view)
    if ev.is_empty() or ans.abstain or ans.temporal_conflict:
        backfill_from(dialogue[-K:], q, t)
    return ans

Evaluation Plan

Benchmarks: LoCoMo for multi-session, long-turn QA; LongMemEval for five abilities with focus on Knowledge Updates, Temporal Reasoning, Abstention @maharana2024locomo @wu2024longmemeval.

Tasks: As-Recorded/As-World QA with contradictions and retro-corrections; multi-needle retrieval with counting/aggregation; no-literal-match QA; long-term memory tasks.

Metrics: Acc-AR/AW; Time-Consistency-F1; Temporal-Conflict-Rate↓; Evidence-Hit@k, NDCG@k; p50/p95 latency; token count; calibration (ECE/Brier) @guo2017calibration.

Baselines: equal-budget long-context prompting; vanilla vector-RAG; GraphRAG; SGMem-style sentence graph; DyG-RAG-style timeline retrieval; Self-RAG critique @edge2024graphrag @asai2023selfrag.

Ablations: −bitemporal, −, −temporal-critique, −Episode, −ColBERTv2, −periodic, −backfill, −salience.

Cost curves: report accuracy–cost–latency Pareto fronts; tokens/turn (read/write) and LTM growth rate under caps.

Risks & Mitigations

  • Entity/coref drift: append-only alias_of merges + low-frequency human curation; AR cut forbids future-merge leakage.
  • Overwriting/ballooning: trigger priority and cooldown; periodic writes are summaries only; archive by salience.
  • Temporal parsing errors: TIMEX normalization; uncertainty tags; AR/AW fallback strategies.
  • Self-critique bias: balance positive/negative temporal cases in training; audit false reject/accept rates.

Ethics & Governance

Evidence traceability with deletion/correction interfaces; minimal retention for sensitive slots; full audit logs of who/when/why.

Expected Contributions

  • A bitemporal, auditable graph memory and DSL for agentic systems.
  • Dual-cut temporal QA with verifiable time consistency and no future leakage.
  • Event abstraction with minimal-evidence, token-budgeted packing.
  • Cost-aware long/short memory with adaptive write triggers.

Resources & Reproducibility

Code: src/{core,modules,pipelines,utils}, scripts/run_mvp.sh. Run: ruff + mypy + pytest + run_mvp.sh. Artifacts: snapshots/logs/configs; seeds recorded; leakage test set and query–evidence packages released.

Timeline

  • M1 (done): GraphStore, bitemporal snapshots, event store with As-Recorded visibility, CLI (memowrite/memoread/memosearch), tests with coverage gates.
  • M2: segmentation/consolidation v1; EAGLE v1 (embedding entry + light rerank); subgraph visualization; end-to-end evaluation; initial temporal self-critique.