MemoTrace: Bitemporal Knowledge-Graph Memory for Conversational Agents

Abstract

Longer context windows do not reliably solve long-term conversational memory. Models degrade with distractors, show “lost-in-the-middle” position bias, and fail on multi-needle retrieval and time consistency across sessions @liu2023lostmiddle @hsieh2024ruler @shaham2023zeroscrolls @bai2023longbench. Retrieval-augmented methods struggle with fact evolution, temporal repair, evidence traceability, and multi-session stability @asai2023selfrag @yan2024crag. Graph-structured retrieval improves relational reasoning but lacks explicit bitemporal semantics @edge2024graphrag.

MemoTrace converts dialog increments into an append-only, bitemporal fact layer (L1) and an event layer (L2). It exposes two temporal cuts—As-Recorded (system-known time) and As-World (world-valid time)—and retrieves compact, auditable subgraphs via a hybrid EAGLE pipeline under fixed token budgets. A transaction-style DSL enables UPSERT/ARCHIVE/RETRO_CORRECT/UPSERT_EVENT with replayable JSONL logs, exact snapshots, and evidence-linked answers. We target wins on long-term memory benchmarks (LoCoMo, LongMemEval) with explicit time-consistency metrics and latency/tokens constraints @maharana2024locomo @wu2024longmemeval.

Novelty & Positioning

To our knowledge, MemoTrace is the first bitemporal data model for LLM agent memory, bridging the auditability gap between valid time and transaction time (dual-cut queries). We further introduce an event visibility constraint that prevents future leakage in As-Recorded answers, and a temporal self-critique mechanism extending Self-RAG to check time consistency. Retrieval is a graph-first hybrid under strict token budgets. Positioning emphasizes enterprise-ready accountability, not just conversational quality @microsoft_temporal @wiki_temporaldb @asai2023selfrag.

Limits of long context. Performance drops with middle placement and increased length; multi-needle retrieval remains brittle @liu2023lostmiddle @hsieh2024ruler @bai2023longbench.

RAG and graph retrieval. Corrective RAG and self-reflective RAG reduce hallucination, but lack bitemporal audit trails @asai2023selfrag @yan2024crag. GraphRAG structures context but omits dual-time semantics @edge2024graphrag.

Temporal representation. SQL:2011 separates system-versioned and application-time periods, enabling point-in-time and audit queries; event/interval formalisms guide temporal annotation @microsoft_temporal @allen1983interval @pustejovsky2003timeml.

Long-term conversational memory. LoCoMo and LongMemEval expose multi-session recall, knowledge updates, temporal reasoning, and abstention gaps @maharana2024locomo @wu2024longmemeval.

Problem Statement

Given a timestamped conversation stream, construct an append-only, bitemporal, versioned fact graph (L1) and an event graph (L2). For any time t, answer under As-Recorded@t (only evidence recorded and visible by t) or As-World@t (facts world-valid at t). Under equal token budgets, improve multi-needle accuracy and p95 latency while returning auditable subgraphs.

Claims

  1. Bitemporal facts with dual-cut QA reduce position bias and semantic masking under equal token budgets @liu2023lostmiddle @hsieh2024ruler.
  2. Event abstraction compresses context while retaining traceable evidence, improving relevance and p95 latency @edge2024graphrag.
  3. Temporal self-critique improves time-consistent grounding over static factual critique @asai2023selfrag.

Methodology

Design principles

  • Truth in facts, not entities. Entities carry identity; changeable statements are versioned facts with bitemporal semantics.
  • Two temporal cuts. All operations respect As-Recorded or As-World, aligned with SQL temporal practice @microsoft_temporal.
  • Explainability by construction. Events are reconstructible via evidence closure; aggregation nodes never invent facts.
  • Append-only and auditable. A JSONL DSL log is the single source of truth; all states are reproducible by replay.

Memory hierarchy and write triggers (cost-first long/short memory)

We adopt a three-tier memory with four write triggers and unified budgets for LoCoMo/LongMemEval.

Tiers

  • STM: the recent-turns context window; never written to LTM.
  • WM: ephemeral working memory for the current topic and pending confirmations; merged into LTM or dropped.
  • LTM: MemoTrace L1/L2 store with bitemporal facts and events.

Write triggers

  1. Event trigger (strong): on knowledge update/correction, commitments/decisions, long-term preferences, identity/schedule changes → UPSERT_EVENT + includes_fact; use RETRO_CORRECT to close outdated valid/record.
  2. Information-gain/salience trigger (soft): score from recency, question/decision flags, personal slots, cross-session frequency, temporal specificity, and centrality. If the score clears the keep threshold, write; if it falls below the drop threshold, skip; otherwise park in WM.
  3. Periodic trigger (coverage): at topic boundary or end of session; write summary Event + minimal evidence closure only.
  4. Backfill trigger (failure-driven): on retrieval miss/abstention/temporal conflict, extract the missing fact and write with source turn and cut label (AR/AW).

Explicit vs implicit: strong triggers and corrections use explicit agent tools; periodic/backfill run implicitly in background to avoid latency.

Budgets & GC

  • Per-session caps: strong writes ≤ 3, periodic ≤ 1; others to WM. Write budget ~120 tokens/session (avg). Read budget ~600 tokens/QA including evidence.
  • LTM compression: consolidate low-salience Events into Episodes with stats only; deduplicate facts via alias merges (append-only); archive by salience when budgets are exceeded.
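
The soft trigger's gating can be sketched as follows. The feature names, weights, and the TAU_KEEP/TAU_DROP values here are illustrative assumptions, not MemoTrace's tuned configuration:

```python
# Sketch of the information-gain/salience gate (trigger 2).
# Weights and thresholds are hypothetical placeholders.
TAU_KEEP, TAU_DROP = 0.6, 0.3

WEIGHTS = {
    "recency": 0.15, "question_flag": 0.2, "personal_slot": 0.25,
    "cross_session_freq": 0.15, "temporal_specificity": 0.15, "centrality": 0.1,
}

def salience(features: dict) -> float:
    """Weighted sum of normalized [0, 1] features."""
    return sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)

def gate(features: dict) -> str:
    """Route a turn's extracted content: LTM write, skip, or park in WM."""
    s = salience(features)
    if s >= TAU_KEEP:
        return "write"
    if s <= TAU_DROP:
        return "skip"
    return "wm"
```

Anything between the two thresholds stays in WM until the topic closes, at which point it is merged or dropped.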

L1 layer: nodes and versioned facts

Node types

Entity(id, types[], canonical_name, aliases[])
Literal(id, datatype, value)            # text, number, boolean, ISO-8601, etc.
TimeInstant(id, iso8601, tz)
Document(id, uri, hash)                 # provenance
Mention(id, doc_id, span, text)         # surface evidence

Versioned facts

FactVersion = {
  id, src, rel, dst,                    # assertion triple
  valid=[s_v, e_v),                     # world-time validity
  record=[s_r, e_r),                    # system-known/visible time
  source: Document.id,
  evidence_id: Mention.id[],
  confidence: float
}
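
A minimal Python sketch of the schema above, using half-open intervals for both time dimensions (the `Interval` helper is an assumption; field names follow the record above):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class Interval:
    """Half-open interval [start, end); end=None means still open."""
    start: float
    end: Optional[float] = None

    def contains(self, t: float) -> bool:
        return self.start <= t and (self.end is None or t < self.end)

@dataclass(frozen=True)
class FactVersion:
    id: str
    src: str
    rel: str
    dst: str
    valid: Interval                   # world-time validity
    record: Interval                  # system-known/visible time
    source: str                       # Document.id
    evidence_id: List[str] = field(default_factory=list)  # Mention.id[]
    confidence: float = 1.0
```

Half-open intervals make adjacent versions non-overlapping by construction, which is what the "at most one open version per id" invariant below relies on.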

Required relations

has_attr(Entity -> Literal)
alias_of(Entity' -> Entity)
mentioned_in(Entity|Event -> Mention)

Invariants

  • At most one open version per id.
  • Version starts are monotone.
  • For any id, a snapshot selects the candidate with max record.start among versions satisfying the temporal predicate.

L1 visibility

  • Nodes with no incident visible facts and no explicit creation record can be hidden in As-Recorded.
  • alias_of follows As-Recorded visibility to avoid leaking future coreference.

L2 layer: events and optional aggregation

Event nodes

Event(id, summary, time_span=[t_start, t_end], status)

time_span is narrative, not a replacement for L1 bitemporal truth @allen1983interval @pustejovsky2003timeml.

L2 edge types and semantics

includes_fact(Event -> FactVersion.id)
has_participant(Event -> Entity, role)
has_time(Event -> TimeInstant|Literal)
related_to(Event -> Event, qualifier)
subevent_of(Event -> Event)
causes(Event -> Event)
corefers(Event -> Event)

Policy: includes_fact defines the auditable closure. All participants/times/locations are reconstructible from that closure.
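
The closure policy can be sketched as follows; the `get_fact` / `get_mentions` store lookups are assumed interfaces, not MemoTrace's actual API:

```python
def evidence_closure(event, get_fact, get_mentions):
    """Collect the auditable closure of an event: every included FactVersion
    plus the Mentions it cites. Participants, times, and locations are
    reconstructible from these facts, so nothing outside the closure is
    ever needed as evidence."""
    facts, mentions = [], []
    for fv_id in event.includes_fact:
        fv = get_fact(fv_id)
        facts.append(fv)
        mentions.extend(get_mentions(fv))
    return facts, mentions
```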

Optional Episode aggregation

Episode(id, title, window=[t_start, t_end])
aggregates(Episode -> Event)

Episodes compress across sessions for retrieval/visualization only; visibility derives from contained events; Episodes never act as evidence sources.

Bitemporal snapshots and visibility

Snapshot selection

As-World@t selects versions whose valid interval contains t (valid.start ≤ t < valid.end); As-Recorded@t selects versions whose record interval contains t. Within each fact id, the candidate with the latest record.start wins. These mirror application-time vs system-versioned queries and preserve prefix-consistency under append-only logs @microsoft_temporal.

Event visibility (As-Recorded)

If any required evidence is not visible at t, hide the event in As-Recorded. Freshness for As-Recorded uses the latest record.start across the event's evidence closure.

Episode visibility

An Episode is visible iff at least one contained Event is visible under the active cut; Episodes never replace evidence.
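
Derived Episode visibility can be sketched as follows, reusing the per-event check (passed in here as a callable; `aggregates` follows the L2 schema):

```python
def episode_visible_at(episode, t, event_visible_at):
    """An Episode is visible at t iff at least one aggregated Event is
    visible at t. Episodes summarize for retrieval/visualization only and
    never serve as evidence themselves."""
    return any(event_visible_at(ev, t) is not None for ev in episode.aggregates)
```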

Properties

  • Monotonicity: under append-only updates, the set of facts and events visible at any fixed t is non-decreasing.
  • Cut isolation: As-Recorded outputs never depend on future records; As-World aligns with valid intervals.

DSL: operations and idempotency

UPSERT_ENTITY(canonical_name, aliases[])
MERGE_ENTITY(src_id, dst_id)
UPSERT_EDGE(src, rel, dst, valid, record, evidence[])
ARCHIVE_EDGE(edge_id, record_end)
RETRO_CORRECT(edge_id, valid_end, record_end)
UPSERT_EVENT(summary, participants[], includes_fact[], time_span)
LINK(role|temporal|causal|coref|subevent|related_to)
UPSERT_EPISODE(title, window, events[])

All destructive changes are expressed as appended records. Events/Episodes use deterministic IDs like hash(topic, day, n) to ensure replay idempotence.
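
The deterministic-ID scheme can be sketched with a stable hash (SHA-256 with a truncated digest here; the exact key fields and digest length are assumptions):

```python
import hashlib

def event_id(topic: str, day: str, n: int) -> str:
    """Deterministic Event ID from (topic, day, n): replaying the same JSONL
    log reproduces the same IDs, so UPSERT_EVENT is idempotent across replays."""
    key = f"{topic}|{day}|{n}".encode("utf-8")
    return "ev_" + hashlib.sha256(key).hexdigest()[:16]
```

Because the ID depends only on log content, a second replay upserts onto the same node instead of duplicating it.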

Retrieval and packing (EAGLE)

Entry: BM25 + E5/mE5 embeddings for broad recall @wang2022e5 @wang2024me5 + ColBERTv2 late interaction for multi-needle robustness @santhanam2021colbertv2.

Apply the temporal cut first (snapshot). Enforce event visibility in As-Recorded. Freshness features: L1 uses record.start (AR) or valid.{start|end} (AW); L2 uses the latest evidence record.start (AR) or time_span (AW).

Across-layer expansion: from top-k L1 and L2 seeds, expand bidirectionally L1↔L2 with bounded depth and per-relation fanout caps, de-duplicating nodes and facts. Attach intervals and active_at flags to facts; active_fact_count to events.
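
The expansion can be sketched as a bounded BFS; the `neighbors(node)` interface and the default depth/fanout values are assumptions:

```python
from collections import deque

def expand(seeds, neighbors, max_depth=2, fanout=5):
    """Bidirectional L1<->L2 expansion with bounded depth and per-relation
    fanout, de-duplicating visited nodes. `neighbors(node)` yields
    (relation, neighbor) pairs in ranked order."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        taken = {}                       # per-relation fanout counters
        for rel, nb in neighbors(node):
            if taken.get(rel, 0) >= fanout or nb in seen:
                continue
            taken[rel] = taken.get(rel, 0) + 1
            seen.add(nb)
            frontier.append((nb, depth + 1))
    return seen
```

Depth and fanout caps keep the subgraph small enough to pack under the read budget before reranking.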

Rerank: Stage-1 RRF/MMR; Stage-2 lightweight cross-encoder (e.g., MonoT5-small) over top-64 with explicit temporal features @monot5.
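
Stage-1 fusion via Reciprocal Rank Fusion can be sketched as follows (k=60 is the common RRF default; this is a sketch of the standard formula, not MemoTrace's tuned fusion):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each input ranking contributes 1/(k + rank)
    to a document's score. `rankings` is a list of ordered doc-id lists,
    e.g. from BM25, dense embeddings, and late interaction."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```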

Pack: emit a compact subgraph — events first, then the minimal evidence closure. Episodes appear only as summarizing containers under tight token budgets.

Complexity: snapshot = index slice + tail scan; retrieval runs under a token cap with early stopping by coverage@k.

Temporal self-critique (extension to Self-RAG)

We add a reflection signal to verify that generated statements are supported and valid for the query’s temporal cut (AR/AW). Training data is synthesized via DSL replays that create lagged corrections and counterfactuals; learning target penalizes temporally inconsistent generations even if textually supported @asai2023selfrag.
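
The underlying consistency rule can be sketched as follows. This is a simplification: the learned critic scores generations, while a rule-based check like this one labels the synthesized training data (the `Interval`-like attribute shapes are assumptions):

```python
def temporally_consistent(claim_facts, t, cut):
    """A claim is time-consistent at t under a cut iff every supporting
    FactVersion's relevant interval covers t: valid for As-World ("AW"),
    record for As-Recorded ("AR"). Textual support alone is not enough."""
    for fv in claim_facts:
        iv = fv.valid if cut == "AW" else fv.record
        if not (iv.start <= t and (iv.end is None or t < iv.end)):
            return False
    return True
```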

Pseudocode

Snapshot

def snapshot_L1(mode, t, versions):
    """Bitemporal snapshot: "AW" filters by the valid interval, anything
    else ("AR") by the record interval; per fact id, keep the version
    with the latest record.start."""
    if mode == "AW":
        C = [v for v in versions if v.valid.start <= t < (v.valid.end or float("inf"))]
    else:
        C = [v for v in versions if v.record.start <= t < (v.record.end or float("inf"))]
    return argmax_groupby(C, key=lambda v: v.id, argmax=lambda v: v.record.start)
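
The `argmax_groupby` helper used above can be sketched as:

```python
def argmax_groupby(items, key, argmax):
    """Group items by `key` and keep, per group, the item maximizing
    `argmax` (here: the freshest record.start per fact id)."""
    best = {}
    for it in items:
        k = key(it)
        if k not in best or argmax(it) > argmax(best[k]):
            best[k] = it
    return list(best.values())
```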

Event visibility

def event_visible_at(event, t):
    """Return the latest evidence record.start if every included fact is
    visible at t (As-Recorded); return None to hide the event."""
    latest = None
    for fv_id in event.includes_fact:
        vis = [v for v in history(fv_id)
               if v.record.start <= t < (v.record.end or float("inf"))]
        if not vis:
            return None                  # missing evidence hides the event
        newest = max(v.record.start for v in vis)
        latest = newest if latest is None else max(latest, newest)
    return latest

Turn handler with write triggers and backfill

def on_turn(dialogue, q, t, view):  # view in {"AR","AW"}
    stm.update(dialogue[-1])
    events = detect_events(dialogue[-1])
    sal = salience(dialogue[-1])
 
    if is_update_or_correction(events): write_strong(events, t)
    elif sal >= TAU_KEEP: write_keep(dialogue[-1], t)
    elif is_topic_shift(dialogue): write_periodic_summary(dialogue, t)
 
    ans, ev = memosearch(q, t, view)
    if ev.is_empty() or ans.abstain or ans.temporal_conflict:
        backfill_from(dialogue[-K:], q, t)
    return ans

Evaluation Plan

Benchmarks: LoCoMo for multi-session, long-turn QA; LongMemEval for five abilities with focus on Knowledge Updates, Temporal Reasoning, Abstention @maharana2024locomo @wu2024longmemeval.

Tasks: As-Recorded/As-World QA with contradictions and retro-corrections; multi-needle retrieval with counting/aggregation; no-literal-match QA; long-term memory tasks.

Metrics: Acc-AR/AW; Time-Consistency-F1; Temporal-Conflict-Rate↓; Evidence-Hit@k, NDCG@k; p50/p95 latency; token count; calibration (ECE/Brier) @guo2017calibration.

Baselines: equal-budget long-context prompting; vanilla vector-RAG; GraphRAG; SGMem-style sentence graph; DyG-RAG-style timeline retrieval; Self-RAG critique @edge2024graphrag @asai2023selfrag.

Ablations: −bitemporal, −, −temporal-critique, −Episode, −ColBERTv2, −periodic, −backfill, −salience.

Cost curves: report accuracy–cost–latency Pareto fronts; tokens/turn (read/write) and LTM growth rate under caps.

Risks & Mitigations

  • Entity/coref drift: append-only alias_of merges + low-frequency human curation; AR cut forbids future-merge leakage.
  • Overwriting/ballooning: trigger priority and cooldown; periodic writes are summaries only; archive by salience.
  • Temporal parsing errors: TIMEX normalization; uncertainty tags; AR/AW fallback strategies.
  • Self-critique bias: balance positive/negative temporal cases in training; audit false reject/accept rates.

Ethics & Governance

Evidence traceability with deletion/correction interfaces; minimal retention for sensitive slots; full audit logs of who/when/why.

Expected Contributions

  • A bitemporal, auditable graph memory and DSL for agentic systems.
  • Dual-cut temporal QA with verifiable time consistency and no future leakage.
  • Event abstraction with minimal-evidence, token-budgeted packing.
  • Cost-aware long/short memory with adaptive write triggers.

Resources & Reproducibility

Code: src/{core,modules,pipelines,utils}, scripts/run_mvp.sh. Run: ruff + mypy + pytest + run_mvp.sh. Artifacts: snapshots/logs/configs; seeds recorded; leakage test set and query–evidence packages released.

Timeline

  • M1 (done): GraphStore, bitemporal snapshots, event store with As-Recorded visibility, CLI (memowrite/memoread/memosearch), tests with coverage gates.
  • M2: segmentation/consolidation v1; EAGLE v1 (embedding entry + light rerank); subgraph visualization; end-to-end evaluation; initial temporal self-critique.