Cross-Graph Attention: Mathematical Foundation
Purpose: Canonical reference for the mathematical framework connecting transformer-style attention mechanisms to cross-graph discovery in compounding intelligence systems.
1. The Core Claim
The cross-graph discovery mechanism described in our architecture is structurally analogous to the scaled dot-product attention mechanism in Vaswani et al. (2017), "Attention Is All You Need." This is not a loose metaphor — the operations have the same mathematical form, and three key properties transfer directly: quadratic interaction space, constant path length between any two entities, and residual preservation of accumulated knowledge.
This document formalizes the correspondence across three levels — single-decision attention, cross-graph attention, and multi-domain attention — with precise notation and plain-English explanations at each step.

[GRAPHIC: CI-05 — The Rosetta Stone | "Transformers gave machines language. Cross-graph attention gives enterprises compounding intelligence." | Side-by-side correspondence: LEFT (blue) transformer word-attention ("bank" attending to "river," "flooded" etc.), RIGHT (gold) cross-graph entity-attention (PAT-TRAVEL attending to TI-SG-CRED, TI-RU-RANSOM etc.), CENTER bridge with variable mapping table (Q↔E_i, K↔E_j, V↔P_j, softmax identical). Bottom strip: 1 head → 15 heads → O(n² × t^γ). Visual thesis of the correspondence this document formalizes.]
2. Background: Transformer Attention
From Vaswani et al., the scaled dot-product attention function is:
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V — Eq. (1)
The variables — read as a table, not a wall of symbols:
Symbol | Shape | Role | Plain English |
Q | n × d_k | Query matrix | "What am I looking for?" |
K | m × d_k | Key matrix | "What do I contain?" |
V | m × d_v | Value matrix | The information payload to extract |
Q · K^T | n × m | Compatibility scores | Every query scored against every key |
√d_k | scalar | Scaling factor | Prevents softmax from becoming too sharp |
Output | n × d_v | Weighted value sum | Each query's best information, weighted by relevance |
Shape check: Q (n × d_k) times K^T (d_k × m) = (n × m). That (n × m) matrix times V (m × d_v) = Output (n × d_v). Dimensions chain correctly.
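For readers who want to check Eq. (1) directly, here is a minimal NumPy sketch. The shapes follow the table above; the random Q, K, V are placeholders standing in for learned representations, not trained weights.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. (1): Attention(Q, K, V) = softmax(Q @ K.T / sqrt(d_k)) @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n, m) compatibility scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (n, d_v) weighted value sum

# Shape check from the table: n queries, m keys/values.
n, m, d_k, d_v = 4, 7, 64, 32
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(m, d_k))
V = rng.normal(size=(m, d_v))
out = scaled_dot_product_attention(Q, K, V)
assert out.shape == (n, d_v)                           # dimensions chain correctly
```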
Multi-head attention runs h parallel attention functions, each with learned projections:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O — Eq. (2)
where head_i = Attention(Q · W_i^Q, K · W_i^K, V · W_i^V)
Cross-attention (encoder-decoder attention) is where Q comes from one sequence and K, V come from a different sequence — enabling one representation to attend to another.
Residual connections preserve the original input while adding the attention output:
output = LayerNorm(x + Attention(x)) — Eq. (3)
Why this matters for us: Residual connections preserve the original representation while adding new information. We'll use this property when we show that cross-graph discovery is architecturally designed to add intelligence without destroying existing knowledge (with appropriate safeguards — see Property 3 in Section 6).

[GRAPHIC: CGA-01 — Three Levels of Cross-Graph Attention | Three-level zoom diagram showing the correspondence formalized in Sections 3-5: Level 1 (single alert → scoring matrix = single-query attention, Eq. 4) → Level 2 (domain pair → cross-graph attention = cross-attention, Eq. 6) → Level 3 (all 6 domains → 15 attention heads = multi-head attention, Eq. 9). Each level shows the transformer parallel on left and the cross-graph equivalent on right.]
3. Level 1: The Scoring Matrix as Single-Query Attention
The Current Implementation
In the SOC Copilot demo, for each alert α, the system evaluates a scoring matrix:
6 context factors: travel_match, asset_criticality, VIP_status, time_anomaly, device_trust, pattern_history
4 possible actions: false_positive_close, escalate_tier2, enrich_and_wait, escalate_incident
Softmax with temperature τ ≈ 0.25 produces action probabilities
Formal Correspondence
The variables:
Symbol | Shape | Role | Plain English |
f | 1 × 6 | Factor vector for alert α | "Here's what this alert looks like" — values extracted from the context graph |
W | 4 × 6 | Weight matrix | Each row is one action's preference profile across the 6 factors |
τ | scalar | Temperature | Controls how decisive the softmax is (low τ = winner-take-all) |
The action selection computation is:
P(action | alert) = softmax(f · W^T / τ) — Eq. (4)
Shape check: f (1 × 6) times W^T (6 × 4) = (1 × 4). That's one score per action. Softmax converts to probabilities. The system picks the highest.
Plain English: The factor vector f asks "given what I know about this alert, how compatible is each action?" The weight matrix W encodes what each action "cares about." The dot product f · W^T scores every (alert, action) compatibility simultaneously. Softmax converts scores to probabilities. The system selects the highest-probability action.
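For concreteness, a minimal sketch of Eq. (4). The factor vector reuses the illustrative values from Appendix A, Step 1; the weight matrix here is a random placeholder, since the real W is what the AgentEvolver calibrates from verified outcomes.

```python
import numpy as np

def select_action(f, W, tau=0.25):
    """Eq. (4): P(action | alert) = softmax(f @ W.T / tau)."""
    scores = f @ W.T / tau                     # (1, 4): one score per action
    scores = scores - scores.max()             # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs

# f: 1 x 6 factor vector (illustrative values from Appendix A, Step 1).
f = np.array([[0.95, 0.3, 0.0, 0.7, 0.9, 0.85]])
# W: 4 x 6 weight matrix, one row per action. Placeholder values only.
W = np.random.default_rng(1).uniform(size=(4, 6))
probs = select_action(f, W)
assert probs.shape == (1, 4) and np.isclose(probs.sum(), 1.0)
chosen = int(probs.argmax())                   # index of the highest-probability action
```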
So what? This is not a custom heuristic — it is attention-shaped: a softmax over dot-product compatibility scores producing a convex mixture. This is the same computational form as the scoring step in transformer attention, though the roles differ: in transformers, Q and K are representations of items in the same embedding space; here, Q is a factor vector and K is a parameter matrix (making this closer to a softmax linear policy or multinomial logistic regression). The structural correspondence means that shape-level properties — softmax selectivity, dot-product compatibility, temperature-controlled entropy — carry over. Full theoretical transfer (e.g., convergence guarantees, expressivity results) requires the additional structure that Levels 2 and 3 provide.
Mapping to Attention
Attention Component | Scoring Matrix Equivalent | Shape |
Query Q | Factor vector f — "what does this alert look like?" | 1 × 6 |
Key K | Weight matrix W — "what does each action need?" | 4 × 6 |
Value V | Action outcome vectors — the action to execute | 4 × d_v |
Scaling 1/√d_k | Temperature 1/τ — controls softmax selectivity | scalar |
Softmax output | Action probability distribution | 1 × 4 |
Temperature τ and the transformer scaling factor 1/√d_k are both softmax sharpness controls — they regulate the entropy of the output distribution. The mechanism differs: √d_k is a principled variance normalization derived from the assumption that dot-product components are independent with unit variance (preventing gradient saturation as d grows); τ is an empirical calibration parameter tuned to the decision domain (τ ≈ 0.25 yields appropriately decisive action selection for the SOC scoring matrix). They occupy the same position in the computation and serve the same functional role, but τ lacks the distributional justification that makes √d_k principled.
What the AgentEvolver learns: After each decision, verified outcomes update W — the weight matrix. This serves the same architectural role as gradient updates to the projection matrices W^Q, W^K, W^V in transformers. The critical difference: transformer projections are learned at training time via backpropagation over a differentiable loss function; the AgentEvolver learns them at runtime via verified-outcome reinforcement (closer to bandit-style reward learning than gradient descent). Same position in the computational graph — different learning mechanism. That difference is the key architectural insight: the base model stays frozen, and the operational layer evolves.
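The document does not fix the AgentEvolver's update rule, so the following is only a hypothetical bandit-style sketch of the architectural role described above: the chosen action's weight row is nudged by the verified outcome, with no gradients and no change to the base model.

```python
import numpy as np

def verified_outcome_update(W, f, action, reward, lr=0.05):
    """Hypothetical update rule (not the production mechanism): nudge the chosen
    action's weight row toward the factors that were present when the outcome was
    verified correct (reward > 0), or away from them when it was wrong (reward < 0).
    No gradients, no base-model change; only W evolves."""
    W = W.copy()
    W[action] += lr * reward * f               # reinforce the factors that fired
    return np.clip(W, 0.0, 1.0)                # keep weights in an interpretable [0, 1] range

# Example: an analyst later confirms the auto-close was correct (reward = +1).
f = np.array([0.95, 0.3, 0.0, 0.7, 0.9, 0.85])
W = np.full((4, 6), 0.5)                       # Day-1 uniform weights
W = verified_outcome_update(W, f, action=0, reward=+1.0)
```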
Note: The scoring matrix operates in raw factor space without learned projections — equivalent to attention with identity projection matrices. The correspondence holds for the core operation (compatibility scoring via dot product + softmax normalization). Learned projections (which would enable the system to attend in different subspaces, as multi-head attention does) are a future optimization.
The Compounding Property
At initialization, W contains uniform or heuristic weights. After 340+ decisions with verified outcomes, W has calibrated to reflect the firm's actual risk profile. This is why Week 1 shows 68% auto-close accuracy and Week 4 shows 89% — same model, same factors, evolved weights.
So what? The improvement isn't from a smarter model or more data in the traditional sense. It's from the weight matrix absorbing the firm's specific patterns — exactly as transformer weights absorb language patterns during training. But here it happens continuously, in production, without retraining. This is the Decision Clock ticking.

[GRAPHIC: FC-04 — Decision Clock Weight Evolution (Day 1 vs Day 30) | Side-by-side weight matrices showing calibration over time. Day 1: uniform/heuristic weights across 6 factors × 4 actions. Day 30: firm-specific weights after 340+ verified outcomes — travel_match and device_trust amplified for false_positive, VIP_status amplified for escalate_incident. Accuracy 68% → 89%. Visualizes the Compounding Property described above.]
4. Level 2: Cross-Graph Discovery as Cross-Attention
Setup
Let G = {G_1, G_2, ..., G_n} be n semantic graph domains. In the SOC architecture, n = 6:
Domain i | Graph G_i | Entity Examples | Typical size m_i |
1 | Security Context | Assets, users, alerts, attack patterns | ~200 entities |
2 | Decision History | Decisions, outcomes, confidence trajectories | ~500 (grows daily) |
3 | Threat Intelligence | CVEs, IOCs, campaign TTPs, geo-risk | ~300 (grows daily) |
4 | Organizational | Reporting lines, role changes, access policies | ~100 (stable) |
5 | Behavioral Baseline | Normal patterns per user/asset/time | ~150 |
6 | Compliance & Policy | Regulatory mandates, audit schedules | ~80 (stable) |
Each domain G_i contains m_i entities. Each entity has a d-dimensional representation (embedding) computed from its properties and graph-structural context.
Entity Embedding Matrices
For domain G_i, define:
E_i — shape: m_i × d — Eq. (5)
where m_i = number of entities in domain i
d = embedding dimension (e.g. 128)
Row k of E_i is the embedding of the k-th entity in domain i.
Plain English: Each entity in each graph domain is represented as a vector of numbers that captures its properties and relationships. An asset's embedding might encode its criticality, exposure, patch status, and network position. A threat indicator's embedding might encode its severity, recency, geographic scope, and campaign association.
The quality of cross-graph discoveries depends on the quality of these embeddings. In the current architecture, embeddings are constructed from graph-structural features (node properties, relationship types, neighborhood statistics). Learned embeddings via graph neural networks or contrastive learning are a future optimization that would strengthen the dot-product compatibility signal.
Cross-domain alignment requirement: Eq. (6) computes E_i · E_j^T across different domains. For this dot product to produce meaningful compatibility scores, all domain embeddings must live in a shared d-dimensional vector space with comparable geometry. In the current implementation, this is achieved through: (a) a common feature schema applied across all domains (each entity is encoded using the same d-dimensional template regardless of domain), (b) z-score normalization per feature dimension (ensuring comparable scales), and (c) unit-norm normalization of entity vectors (ensuring dot products reflect angular similarity, not magnitude). Without these alignment steps, cross-domain dot products would be uninterpretable — high scores could reflect scale mismatches rather than genuine semantic compatibility. A future optimization is domain-specific linear projections P_i mapping each domain into the shared space (even if P_i = I today), which would enable per-domain learned alignment while preserving the shared-space invariant.
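A sketch of alignment steps (b) and (c), assuming step (a) has already encoded every domain's entities into the same d-dimensional feature template (the feature extraction itself is domain-specific and not shown). The shared statistics and unit-norm step are what make cross-domain dot products comparable.

```python
import numpy as np

def align_domain_embeddings(E, mu, sigma):
    """Alignment steps (b) and (c): z-score each feature dimension with shared
    statistics, then unit-normalize each entity so dot products reflect angular
    similarity rather than magnitude."""
    Z = (E - mu) / (sigma + 1e-8)
    return Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-8)

rng = np.random.default_rng(2)
E_decision = rng.normal(2.0, 5.0, size=(500, 128))    # raw features on a large scale
E_threat = rng.normal(-1.0, 0.5, size=(300, 128))     # raw features on a small scale
pooled = np.vstack([E_decision, E_threat])
mu, sigma = pooled.mean(axis=0), pooled.std(axis=0)   # shared per-dimension statistics
E_i = align_domain_embeddings(E_decision, mu, sigma)
E_j = align_domain_embeddings(E_threat, mu, sigma)
# E_i @ E_j.T now yields cosine-style compatibility scores, comparable across domains.
```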
Cross-Graph Attention
For source domain G_i (seeking enrichment) and target domain G_j (providing context):
CrossAttention(G_i, G_j) = softmax(E_i · E_j^T / √d) · V_j — Eq. (6)
The variables — what each piece is and what shape it has:
Symbol | Shape | Role | Plain English |
E_i | m_i × d | Queries from domain i | "For each of my entities: what do I need to know?" |
E_j | m_j × d | Keys from domain j | "For each of my entities: what do I have to offer?" |
V_j | m_j × d_v | Values from domain j | The actionable information payload to transfer |
E_i · E_j^T | m_i × m_j | Compatibility matrix | Every entity in i scored against every entity in j |
√d | scalar | Scaling factor (using d since d_k = d in our architecture) | Prevents attention scores from exploding |
Output | m_i × d_v | Enriched representations | Domain i entities, now carrying domain j's insights |
Shape check — follow the dimensions through the equation:
E_i (m_i × d) × E_j^T (d × m_j) = Compatibility matrix (m_i × m_j) ← d cancels; every i-entity scored vs every j-entity
softmax normalizes each row → still (m_i × m_j) ← each i-entity's attention weights over j-entities sum to 1
Attention (m_i × m_j) × V_j (m_j × d_v) = Output (m_i × d_v) ← m_j cancels; each i-entity gets a weighted blend of j's values
Concrete numbers: If Decision History has 500 entities and Threat Intel has 300 entities, the compatibility matrix is 500 × 300 = 150,000 relevance scores, computed in a single matrix multiplication. Most will be near-zero. The discoveries are the few with high scores.
So what? This is the mechanism that finds the Singapore recalibration — and every similar cross-domain insight — automatically. No analyst needs to remember that Singapore logins were being auto-closed AND that a new Singapore threat campaign exists. The math checks every possible pairing, in every direction, in one operation. The insight that would take a human analyst hours of cross-referencing falls out of a single matrix multiply.
Plain English: Cross-graph attention asks: "For each entity in my domain, which entities in the other domain are most relevant — and what should I learn from them?" The dot product computes relevance between every pair. Softmax focuses attention on the most compatible pairs. The output enriches each entity with weighted information from the other domain.
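A minimal sketch of Eq. (6), assuming aligned embeddings as above and taking V_j = E_j (the d_v = d choice adopted in Section 6). The random matrices stand in for real domain graphs.

```python
import numpy as np

def cross_graph_attention(E_i, E_j, V_j):
    """Eq. (6): CrossAttention(G_i, G_j) = softmax(E_i @ E_j.T / sqrt(d)) @ V_j."""
    d = E_i.shape[1]
    S = E_i @ E_j.T / np.sqrt(d)                  # (m_i, m_j) compatibility matrix, Eq. (7a)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)             # row-wise softmax, Eq. (7b)
    return A @ V_j, S, A                          # enriched reps + raw and normalized scores

rng = np.random.default_rng(3)
E_i = rng.normal(size=(500, 128))                 # Decision History (m_i = 500)
E_j = rng.normal(size=(300, 128))                 # Threat Intelligence (m_j = 300)
enriched, S, A = cross_graph_attention(E_i, E_j, V_j=E_j)
assert S.shape == (500, 300)                      # 150,000 relevance scores in one multiply
assert enriched.shape == (500, 128)               # domain-i entities carrying domain-j context
```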
Worked Example: Singapore Threat Recalibration
Setup:
Domain i = Decision History. Entity: PAT-TRAVEL-001 (pattern for Singapore travel logins, confidence 0.94, 127 closures as false positive).
Domain j = Threat Intelligence. Entity: Campaign TI-2026-SG-CRED (Singapore IP range 103.15.x.x, credential stuffing, 340% increase this month, confidence: HIGH).
Step 1: Dot product. The embedding of PAT-TRAVEL-001 encodes geographic focus (Singapore), decision type (false_positive_close), and high confidence. The embedding of TI-2026-SG-CRED encodes geographic scope (Singapore), threat type (credential stuffing), and high severity. The dot product is high because both vectors have strong Singapore-related components.
Step 2: Softmax. Among all threat intelligence entities, TI-2026-SG-CRED receives high attention weight for PAT-TRAVEL-001 because of geographic and temporal overlap.
Step 3: Value transfer. The value payload from TI-2026-SG-CRED — "active credential stuffing campaign, 340% elevation, HIGH confidence" — is transferred to enrich PAT-TRAVEL-001's representation.
Step 4: Discovery. The enriched representation surfaces a high-relevance cross-graph signal: "The pattern used to auto-close 127 Singapore logins may be dangerously miscalibrated given an active credential stuffing campaign targeting that geography." Action: reduce PAT-TRAVEL-001 confidence from 0.94 to 0.79, add threat_intel_risk as a new scoring factor.

[GRAPHIC: CI-04 — The Singapore Discovery | "How compounding intelligence learns: from generic guesses to autonomous cross-graph discovery" | Horizontal three-phase timeline with dual layers (human outcome ↑ / machine state ↓): Phase 1 (Day 1, gray) generic weights, 68% accuracy, wrong escalation → Phase 2 (Day 30, blue) calibrated weights after 340+ verified outcomes, 89% accuracy, correct auto-close → Phase 3 (Day 47, gold, HERO) cross-graph discovery fires, PAT-TRAVEL-001 × TI-2026-SG-CRED compatibility score 0.94, weights recalibrated 0.42→0.18, $4.88M breach prevented. Traces the worked example above through all three phases. "No human told the system to look."]
Discovery Threshold
Not every cross-graph pair produces a meaningful discovery. Define the pre-softmax logit matrix (raw compatibility scores before normalization):
S_{i,j} = E_i · E_j^T / √d — Eq. (7a)
shape: m_i × m_j (one raw compatibility score per entity pair)
And the attention matrix (row-normalized):
A_{i,j} = softmax(S_{i,j}) — Eq. (7b)
shape: m_i × m_j (attention weights summing to 1 per row)
A discovery is identified using a two-stage criterion:
Discovery(entity_k from domain i, entity_l from domain j)
⟺ S_{i,j}[k, l] > θ_logit — Eq. (8a)
AND entity_l ∈ top-K(A_{i,j}[k, :]) — Eq. (8b)
Why two stages, not just a threshold on softmax weights:
Softmax produces relative scores — row-normalized attention weights depend on all other keys in that row. A weight A[k,l] can be high simply because everything else is low, not because the absolute compatibility is meaningful. As domain j grows (m_j increases), the distribution of attention weights shifts, making a fixed threshold θ unstable.
The two-stage criterion avoids both problems:
Eq. (8a) thresholds on pre-softmax logits S, which are absolute compatibility scores independent of how many other entities exist in the target domain. This provides stability as domains grow.
Eq. (8b) applies top-K selection within each row's attention distribution, ensuring only the most salient connections pass. K can be tuned per domain pair based on expected discovery density.
An optional strengthening: require the discovery to be bidirectional — entity_l also attends strongly to entity_k (high in both S_{i,j} and S_{j,i}) — which reduces false discoveries from one-sided similarity.
Concrete example: In a 500 × 300 logit matrix (Decision History × Threat Intel), the pre-softmax threshold θ_logit filters to entities with genuinely high raw compatibility. Top-K=3 per row selects the most salient targets. Perhaps 30-50 entries survive both gates. Those are the discoveries — the cross-domain insights that no single graph contains.
So what? This is what makes the system's discoveries firm-specific. The two-stage criterion sifts 150,000 potential connections down to the few dozen that matter for this particular firm — its patterns, its threats, its organizational structure. A different firm, with different graphs, gets different discoveries. Crucially, the discovery count is stable as domains grow — the pre-softmax threshold ensures new entities don't dilute or inflate discovery rates.
Plain English: Most cross-graph pairs have low relevance — a compliance mandate about data retention has little to say about a specific login alert. Discoveries are the high-attention exceptions: the pairs where entities across domains have genuinely high compatibility on an absolute scale AND rank among the top matches within their attention row. These are the firm-specific insights that nobody programmed.
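A sketch of the two-stage criterion in Eqs. (7)-(8). The θ_logit and K values here are illustrative placeholders, not calibrated settings, and the random S and A stand in for the matrices a real cross-attention sweep would produce.

```python
import numpy as np

def find_discoveries(S, A, theta_logit, top_k):
    """Two-stage criterion: Eq. (8a) absolute threshold on pre-softmax logits S,
    AND Eq. (8b) membership in the row's top-K attention weights of A."""
    topk_idx = np.argpartition(A, -top_k, axis=1)[:, -top_k:]   # each row's top-K targets
    in_topk = np.zeros_like(A, dtype=bool)
    np.put_along_axis(in_topk, topk_idx, True, axis=1)
    passes = (S > theta_logit) & in_topk                        # both gates must pass
    return list(zip(*np.nonzero(passes)))                       # (k, l) entity-pair indices

# Example with a 500 x 300 sweep (random stand-ins for real logits and attention):
rng = np.random.default_rng(4)
S = rng.normal(size=(500, 300))
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
discoveries = find_discoveries(S, A, theta_logit=3.0, top_k=3)
# Only a small fraction of the 150,000 candidate pairs survives both gates;
# with real graphs, those survivors are the firm-specific discoveries.
```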
5. Level 3: Multi-Domain Attention
Formulation
Analogous to multi-head attention (Eq. 2), define multi-domain attention across all graph pairs:
MultiDomainAttention(G) = Aggregate({head_{i,j}}) — Eq. (9)
where head_{i,j} = CrossAttention(G_i, G_j)
for all i < j (unique pairs only)
With n = 6 domains, the number of heads h = n(n-1)/2 = 15.
So what? The analogy to multi-head attention is structural, not mechanistic. In a transformer, heads are learned projections over the same token set — each head discovers a different linguistic relationship (syntax, coreference, semantic roles) through learned query/key spaces. Here, each "head" is a different bipartite graph — different domains with different entity types and sizes. The heads discover different types of institutional insight not through learned projections but through the semantic structure of the domains themselves. The parallel is in the outcome (categorically different insights from parallel attention operations), not the mechanism (learned projections vs. domain structure).
Plain English: Just as a transformer runs multiple attention heads in parallel — each discovering a different type of relationship — multi-domain attention runs 15 cross-graph searches in parallel, each discovering a different type of institutional insight.
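A sketch of Eq. (9): one cross-attention head per unordered domain pair, using the approximate domain sizes from the Section 4 table. The Aggregate step is left as a plain dictionary of per-head outputs, since the document does not commit to a specific aggregation.

```python
import numpy as np
from itertools import combinations

domain_sizes = {                              # approximate sizes from the Section 4 table
    "security": 200, "decision_history": 500, "threat_intel": 300,
    "organizational": 100, "behavioral": 150, "compliance": 80,
}
d = 128
rng = np.random.default_rng(5)
E = {name: rng.normal(size=(m, d)) for name, m in domain_sizes.items()}

def cross_graph_attention(E_i, E_j):          # same computation as the Section 4 sketch
    S = E_i @ E_j.T / np.sqrt(E_i.shape[1])
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ E_j                            # d_v = d

# Eq. (9): one head per unordered pair (i < j) -> n(n-1)/2 = 15 heads for n = 6.
heads = {(i, j): cross_graph_attention(E[i], E[j]) for i, j in combinations(domain_sizes, 2)}
assert len(heads) == 15
```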
Each Head Discovers a Different Category of Insight
Head (i, j) | Discovery Category | Example | Why it's firm-specific |
Security × Threat Intel | Threat-contextualized re-evaluation | SG threat elevation recalibrates travel-login FP rate | Depends on your FP patterns + current threat landscape |
Security × Organizational | Role-change sensitivity | CFO promotion → alert re-scrutiny | Depends on your org structure + your auto-close history |
Decision × Behavioral | Decision-behavior consistency | Auto-close trend vs. actual behavior drift | Only emerges from your accumulated decision patterns |
Decision × Threat Intel | Historical decision quality | Were past closures correct given new intel? | Retroactive re-evaluation of your specific decisions |
Threat Intel × Compliance | Regulatory exposure from threats | New CVE affecting PCI-scoped assets | Depends on your compliance scope + current threats |
Organizational × Behavioral | Insider risk emergence | Access pattern change after org restructure | Only visible if you connect your org graph + your behavior graph |
Behavioral × Compliance | Audit-triggered behavioral review | Data transfer spike during active audit | Depends on your audit schedule + your user behavior |
This parallels — at the outcome level — the transformer finding in Figures 3-5 of Vaswani et al., where different attention heads capture categorically different linguistic relationships. Here, different domain-pair heads discover categorically different institutional relationships. The structural difference: transformer heads discover categories through learned projections; cross-graph heads discover categories through the semantic structure of the domain pairs themselves.
Quadratic Growth of Discovery Space
The total number of pairwise interaction terms computed across all heads:
Total interactions = Σ_{i<j} m_i × m_j — Eq. (10)
For equal-sized domains (m entities each):
Total interactions = [n(n-1)/2] × m² — Eq. (11)
= 15 × m² for n = 6
Adding one domain (n → n+1):
ΔInteractions = n × m² — Eq. (12)
= 6 × m² when going from 6 → 7 domains
The acceleration pattern — why each new domain is worth more than the last:
Domains (n) | Pairs | New pairs from latest domain | Discovery surfaces |
2 | 1 | 1 | 1 × m² |
3 | 3 | 2 | 3 × m² |
4 | 6 | 3 | 6 × m² |
5 | 10 | 4 | 10 × m² |
6 | 15 | 5 | 15 × m² |
7 | 21 | 6 | 21 × m² |
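The table can be checked in a few lines; a short sketch assuming equal-sized domains of m entities each:

```python
# Eqs. (10)-(11): discovery-surface growth as domains are added (equal-sized domains).
for n in range(2, 8):
    pairs = n * (n - 1) // 2                  # Eq. (11): number of cross-graph heads
    new_pairs = n - 1                         # pairs added by the latest domain (third column)
    print(f"n={n}: pairs={pairs}, new pairs from latest domain={new_pairs}, "
          f"discovery surfaces = {pairs} x m^2")
# Matches the table: 1, 3, 6, 10, 15, 21 pairs; 1, 2, 3, 4, 5, 6 new pairs.
```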

[GRAPHIC: GM-02 — Cross-Graph Connections: Combinatorial Growth (2→4→6 domains) | Visual progression from 2 domains (1 pair) to 4 domains (6 pairs) to 6 domains (15 pairs). Each domain shown as a node with cross-links between all pairs. Demonstrates n(n-1)/2 growth — each new domain connects to ALL existing domains. Formal companion to Eq. (10-12).]
So what? This is the mathematical reason the moat accelerates. A competitor can replicate our code, our model, even our architecture. But they cannot replicate the accumulated entities in our graphs. And the value of those entities grows quadratically with the number of connected domains — not linearly. Connecting a 7th domain doesn't add 1/7th more value. It adds 6 new cross-graph heads, each with access to the full richness of all existing domains. The marginal value of each new domain increases with the number you already have.
6. Three Properties That Transfer from Attention Theory

[GRAPHIC: CGA-03 — Three Properties from Attention Theory | Three-panel visualization: (1) Quadratic Interaction Space — discovery surfaces grow as n(n-1)/2, (2) Constant Path Length O(1) — any entity reaches any other in one operation, (3) Residual Preservation — the graph retains access to accumulated knowledge. Each property shown with transformer parallel on left and cross-graph equivalent on right.]
Property 1: Quadratic Interaction Space
Transformer: Self-attention complexity is O(n² · d) because every position attends to every other position (Table 1, Vaswani et al.).
Cross-graph: Multi-domain attention computes n(n-1)/2 cross-attention operations, each with O(m_i · m_j · d) interactions. Total interaction space grows quadratically with graph coverage n.
Why it matters for the moat: A competitor can copy our code and start with the same 6 graph domains. But they start with empty graphs (m = 0 entities, 0 accumulated decisions). Our system after 12 months has thousands of entities per domain and thousands of verified decisions. The interaction space — and therefore the discovery potential — scales with both the number of domains AND the richness of each domain.
So what? This is why the moat is structural, not just temporal. It's not "we have 12 more months of data." It's "we have 12 more months of quadratically compounding interactions." The math makes the difference concrete: more domains × richer domains = super-linearly more discovery surfaces.
Practical note on compute: Total compute for the full sweep is Σ_{i<j} m_i · m_j · d. If one domain grows large (e.g., Decision History accumulating thousands of entities), it dominates the cost. Production deployments will require sparsification strategies: approximate nearest neighbor (ANN) indexing for the compatibility computation, entity pruning (removing low-information entities from attention), and blocking/bucketing by temporal recency or entity type. These reduce the constant factor without changing the O(n²) growth in discovery surfaces.
Property 2: Constant Path Length
Transformer: Self-attention connects any two positions with O(1) sequential operations (Table 1). Recurrent layers require O(n) steps. This is attention's key advantage: distant dependencies don't require signal propagation through intermediaries.
Cross-graph: Any entity in domain i can attend to any entity in domain j in a single cross-attention operation. No routing through intermediate domains. A threat intelligence fact about Singapore directly enriches a decision history pattern about Singapore logins — no intermediate graph traversal required.
Why it matters: In traditional approaches, discovering the Singapore recalibration would require: (1) an analyst notices the threat report, (2) remembers that Singapore logins have been auto-closed, (3) manually cross-references the two. That's O(n) human cognitive steps. Cross-graph attention discovers it in O(1) — a single matrix multiplication surfaces all high-relevance cross-domain pairs simultaneously.
So what? This isn't about speed — it's about coverage. A human analyst can hold maybe 5-10 cross-references in working memory. Cross-graph attention computes 150,000 simultaneously. The Singapore connection might be obvious to a senior analyst. But what about the 47 other high-attention pairs in that same sweep? Those are the discoveries that no human would ever make — not because they're stupid, but because no human can hold 150,000 pairwise comparisons in their head.
Property 3: Residual Enrichment (Knowledge Preservation)
Transformer: Residual connections (Eq. 3) ensure that the original representation is preserved while attention adds new information:
output = LayerNorm(x + Attention(x))
The original signal x is never destroyed — only augmented.
Cross-graph: When cross-graph attention enriches domain i with discoveries from domain j, the original entity representations are preserved:
E_i^{enriched} = E_i + Σ_{j≠i} CrossAttention(G_i, G_j) — Eq. (13)
Shape check: E_i is (m_i × d). CrossAttention output is (m_i × d_v). We set d_v = d for residual compatibility (ensuring the attention output can be added element-wise to E_i). This is a design choice, not a mathematical requirement — transformers sometimes use d_v ≠ d with a projection back to d. Addition is element-wise. Original E_i is preserved in full.
The graph structure itself is also preserved — new discovery edges are added, but existing nodes and relationships are never deleted.
Plain English: Cross-graph discovery adds to what the system knows — it does not delete or overwrite existing representations. When the system discovers that Singapore FP rates need recalibration, it doesn't delete the 127 previous decisions. It enriches the pattern with new context. The original knowledge is the residual; the cross-graph insight is the sublayer output.
Precision note on "monotonically non-decreasing": Residual-style enrichment (Eq. 13) preserves the original embedding vector in the sum, meaning the system retains access to prior state. However, residual addition alone does not guarantee that downstream decisions improve — representations could become less separable, later layers could effectively ignore the residual, and normalization (if applied) modifies the residual. The stronger claim requires architectural safeguards: gated residuals (so enrichment can be selectively applied), provenance tracking (so the source of each enrichment is auditable), and versioned graph snapshots (enabling rollback). With these safeguards in place — which our architecture provides — forgetting is avoidable by design, and accumulated intelligence is monotonically non-decreasing in the accessible knowledge base, though not automatically in every downstream metric.
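A sketch of Eq. (13) with the gated-residual safeguard described above. The per-domain scalar gates are an illustrative simplification; production gating, provenance tracking, and snapshot versioning are architectural details beyond this sketch.

```python
import numpy as np

def gated_residual_enrichment(E_i, attention_outputs, gates):
    """Eq. (13) with gating: E_i_enriched = E_i + sum_j g_j * CrossAttention(G_i, G_j).
    A gate g_j in [0, 1] lets enrichment from a low-trust source be attenuated,
    while the original E_i always remains in the sum (knowledge preservation)."""
    enriched = E_i.copy()                          # original representations kept intact
    for j, out in attention_outputs.items():
        enriched += gates.get(j, 1.0) * out        # d_v = d, so element-wise addition works
    return enriched

rng = np.random.default_rng(6)
E_i = rng.normal(size=(500, 128))
attention_outputs = {"threat_intel": rng.normal(size=(500, 128)),
                     "behavioral": rng.normal(size=(500, 128))}
gates = {"threat_intel": 1.0, "behavioral": 0.3}   # selectively applied enrichment
E_enriched = gated_residual_enrichment(E_i, attention_outputs, gates)
assert E_enriched.shape == E_i.shape
```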
So what? The moat's permanence rests on two architectural facts, not just the residual property: (1) the graph itself persists through any model swap (GPT-4 → GPT-5, Claude → Gemini) — enriched embeddings, discovered patterns, and accumulated decisions all survive, and (2) gated residuals + versioning ensure that enrichment is additive and auditable. A competitor doesn't just need to "catch up" — they need to replicate discoveries that emerged from your specific graph state at a moment in time that no longer exists.
7. The Temporal Dimension: Positional Encoding as Clock Alignment
Transformer Positional Encoding
Transformers have no inherent notion of sequence order. Positional encoding injects temporal/positional information:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i/d_model))
This allows the model to reason about when things occurred relative to each other.
Clock-Based Temporal Encoding
In cross-graph attention, temporal context is critical. A threat intelligence indicator from yesterday is more relevant than one from six months ago. A role change from three weeks ago matters more than one from three years ago.
The Four Clocks provide structured temporal encoding:
Clock | Temporal Dimension | What It Encodes | Encoding Role |
State | Current snapshot (t = now) | What's true at query time | Static features in entity embeddings |
Event | Historical timeline (t = event) | When decisions were made, when threats emerged | Recency weighting in attention scores |
Decision | Evolution trajectory (Δt) | How fast confidence is changing, weight drift rate | Velocity features in embeddings |
Insight | Discovery timeline (t = discovery) | When cross-graph insights emerged, validation count | Maturity weighting for discovered patterns |
Plain English: Just as positional encoding tells a transformer "this word is the 5th in the sentence," clock-based encoding tells cross-graph attention "this threat indicator is 3 days old," "this decision pattern has been validated 14 times over 6 weeks," or "this role change happened 3 weeks ago." Temporal context changes which cross-graph pairs are relevant.
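The clocks specify what to encode, not a functional form. The sketch below assumes one plausible choice for the Event Clock row of the table: an exponential recency decay added to the pre-softmax logits, so fresher target entities attract more attention. The half-life is an illustrative parameter, not a documented setting.

```python
import numpy as np

def recency_weighted_logits(S, age_days, half_life_days=14.0):
    """Hypothetical Event-Clock weighting: subtract a per-day penalty from the
    pre-softmax logits so that a 3-day-old indicator keeps nearly its full
    compatibility score while a 6-month-old one is strongly discounted."""
    decay = np.log(0.5) / half_life_days           # negative; halves exp(logit) every half-life
    return S + (decay * age_days)[np.newaxis, :]   # broadcast over all query entities

rng = np.random.default_rng(7)
S = rng.normal(size=(500, 300))                    # Decision History x Threat Intel logits
age_days = rng.uniform(0, 180, size=300)           # age of each threat-intel entity
S_temporal = recency_weighted_logits(S, age_days)  # feed this into softmax instead of S
```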

[GRAPHIC: FC-01 — Four Clocks Progression Diagram | The four temporal clocks — State (what's true now), Event (what happened), Decision (how judgment evolves), Insight (cross-domain discovery) — shown as a progression from static to self-improving. Maps directly to the temporal encoding roles in the table above.]
8. Refined Moat Equation
The original moat equation was:
Moat = n × t × f
With the attention framework, we can be more precise:
Institutional Intelligence = Σ_i W_i(t) [Term 1: within-domain]
+ α · Σ_{i<j} D_{i,j}(n, t, f) [Term 2: cross-domain]
+ β · R(n, t) [Term 3: second-order]
— Eq. (14)
What each term means — and what business outcome it drives:
Term | Symbol | What it captures | Clock | Business impact |
Within-domain compounding | W_i(t) | Weight calibration after t decisions | Decision Clock | 68% → 89% accuracy. Same model, evolved weights. |
Cross-domain discoveries | D_{i,j}(n, t, f) | Insights from cross-attention between domain pairs | Insight Clock | Singapore recalibration, role-change sensitivity — firm-specific insights nobody programmed |
Second-order discoveries | R(n, t) | Insights discovered by attending to previous discoveries | Insight Clock (recursive) | A discovery from (Security × Threat Intel) becomes an entity that can be attended to by (Compliance × Discovered Insights). Compounding on compounding. |
The coupling constants α and β reflect discovery density and second-order discovery rate.
The key mathematical insight: D_{i,j} grows with the product of domain sizes m_i · m_j, and domain sizes themselves grow with time t. This motivates the following scaling model (not a derived theorem, but a parameterized growth estimate):
Assume each domain's entity count grows as m_i(t) ~ t^{a_i}, where a_i reflects the domain's growth rate (a_i ≈ 1 for domains with daily entity creation like Decision History; a_i ≈ 0 for stable domains like Organizational). Then the cross-graph interaction count for head (i,j) grows as:
m_i(t) · m_j(t) ~ t^{a_i + a_j}
Summing over all n(n-1)/2 heads, the dominant term grows as t^{max(a_i + a_j)}. For a mix of growing and stable domains, the effective exponent γ = max(a_i + a_j) across all head pairs falls in [1, 2]. Combined with the quadratic growth in n (number of domain pairs), the total intelligence function scales approximately as:
I(n, t) ~ O(n² × t^γ) where 1 ≤ γ ≤ 2 — Eq. (15)
This is a scaling model, not a formal bound — it characterizes expected growth under the assumed domain growth rates, not worst-case guarantees.
What γ means — and why 1.5 is a reasonable estimate for the SOC architecture:
γ value | When it applies | Example | Assumed growth rates |
γ → 2 | Both domains in the dominant pair grow linearly | Decision History (a≈1) × Threat Intel (a≈1) | a_i + a_j ≈ 2 |
γ → 1 | Only one domain in each pair grows | Organizational (a≈0) × Compliance (a≈0) | a_i + a_j ≈ 0-1 |
γ ≈ 1.5 | Practical blend: dominant heads are near-quadratic, others near-linear | Weighted average across 15 heads | Effective blend |
Even γ = 1.2 creates a moat that linear accumulation models cannot match.

[GRAPHIC: CGA-02 — Why the Moat Is Super-Linear: O(n² × t^γ) | Dual-axis visualization: LEFT shows graph coverage (n) with quadratic growth in discovery pairs, RIGHT shows time-in-operation (t) with super-linear compounding. Combined curve shows O(n² × t^γ) growth. Month 24: first mover 117 vs competitor 41. Formal companion to Eq. (15).]
The competitive gap — in concrete numbers:
First mover at month 24: 24^1.5 = 117 units of accumulated intelligence
Competitor at month 12: 12^1.5 = 41 units of accumulated intelligence
Gap: 117 - 41 = 76 (nearly 2× the competitor's total)
At month 36:
First mover: 36^1.5 = 216
Competitor at month 24: 24^1.5 = 117
Gap: 216 - 117 = 99 (still growing)
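Those figures follow directly from the t^γ term in Eq. (15); a short sketch with γ = 1.5, truncating to whole units as the text does:

```python
import math

# Eq. (15) scaling model: I(n, t) ~ n^2 * t^gamma with gamma = 1.5. Both firms run the
# same n = 6 domains, so n^2 is common to both and only the t^gamma term differs here.
def accumulated_intelligence(months):
    return math.floor(months * math.sqrt(months))   # t^1.5, truncated to whole units

for month, lag in [(24, 12), (36, 12)]:
    first = accumulated_intelligence(month)
    rival = accumulated_intelligence(month - lag)
    print(f"Month {month}: first mover {first} vs competitor {rival}, gap {first - rival}")
# Prints: Month 24: first mover 117 vs competitor 41, gap 76
#         Month 36: first mover 216 vs competitor 117, gap 99
```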

[GRAPHIC: GM-04-v2 — The Gap Widens Every Month | First mover vs. competitor divergence curves over 36 months. Your firm at 117 vs competitor at 41 at Month 24, gap widening super-linearly. Visualizes the numbers in the calculation above.]

[FIGURE: 36-Month Simulation Dashboard | Compounding Intelligence: 36-Month Simulation | 6-panel validation: (1) Accuracy 68%→95%, (2) Cross-graph discoveries/month (quadratic growth), (3) Accumulated intelligence I(n,t) showing 117 vs 42 at Month 24, (4) Competitive gap widening every month, (5) Analyst hours saved (29K hrs / $2.46M), (6) Breach cost avoided ($3.60M). Parameters: n=6, γ=1.5, 50 alerts/day, competitor delay=12 months. All parameters tunable — run compounding_sim.py to regenerate. See: dashboard_combined.png, simulation_data.csv]
So what? This is the entire argument in one paragraph. A competitor who copies your code, your model, your architecture, and starts running the exact same system today — they still cannot catch you. Not because you have proprietary technology. Because the math of attention applied to accumulating graphs creates a gap that widens over time. Every month you operate, the gap grows. Every domain you connect, the gap multiplies. This is not a lead measured in months. It's a lead measured in the square of domains times a super-linear power of months. That's the moat. It's mathematical, it widens by construction, and the scaling model behind it is explicit and falsifiable.

[GRAPHIC: GM-05-v2 — The Compounding Moat Equation — Dual Form | The moat equation in both intuitive (Moat = graph coverage × time × search frequency) and formal (I(n,t) ~ O(n² × t^γ)) forms. Dark theme, three input dials feeding the dual equation. Capstone visual for the moat argument.]
9. Summary of Correspondences
Transformer Concept | Cross-Graph Equivalent | Shape / Form | So What |
Token embedding | Entity embedding in graph domain | m_i × d | Each entity's properties encoded as a vector |
Query Q | Source domain entities seeking enrichment | m_i × d | "What do I need to know?" |
Key K | Target domain entities providing context | m_j × d | "What do I have?" |
Value V | Information payloads from target domain | m_j × d_v | The actionable content to transfer |
QK^T/√d_k | Cross-graph compatibility matrix | m_i × m_j | All pairwise relevance scores in one multiply |
Softmax | Attention normalization | m_i × m_j → [0,1] | Focus on high-relevance, suppress noise |
Attention output | Enriched entity representations | m_i × d_v | Domain i entities now carrying domain j's insights |
Multi-head (h heads) | Multi-domain (15 heads for 6 domains) | 15 compatibility matrices | Each head discovers a different insight type |
Residual connection | Enrichment without replacement | E_i + Σ attention | Graph retains access to accumulated knowledge |
Positional encoding | Clock-based temporal encoding | 4 clock dimensions | State, Event, Decision, Insight |
Layer stacking | Periodic discovery sweeps | 1 sweep = 1 "layer" | Each sweep builds on all previous discoveries |
Training-time weight learning | Runtime weight evolution (AgentEvolver) | W evolves via outcomes | Same role, different mechanism — production, not training |
10. What This Is and What It Isn't
What this IS:
A formal mathematical framework showing that cross-graph discovery has the same computational form as transformer attention (attention-shaped operations with transferable shape-level properties)
An explanation that borrows the most widely understood mathematical vocabulary in modern AI to make the super-linear compounding argument precise and falsifiable
A precise vocabulary that technical audiences (ML engineers, architects, VCs with technical depth) will immediately understand
A set of transferable properties (quadratic interaction space, constant path length, residual preservation) that explain the moat's permanence
What this IS NOT:
A claim that the system IS a transformer or uses neural network training
A claim that the system uses backpropagation or gradient descent (the AgentEvolver uses verified-outcome feedback, not gradient updates)
A claim that the cross-graph mechanism requires learned embeddings (graph-structural features work; learned embeddings are a future optimization)
A replacement for the existing plain-English positioning (new employee analogy, Four Clocks, moat equation remain primary; this is the mathematical foundation underneath them)
The soundbite:
"Transformers let tokens attend to tokens. We let graph domains attend to graph domains. Same math. Applied to institutional knowledge instead of language."
Appendix A: The Undergraduate Walkthrough — Building Intuition from Scratch
This appendix builds the entire framework from first principles, assuming no prior knowledge of transformers or attention mechanisms. It is designed to make Sections 2-8 above feel inevitable rather than opaque. Read this first if the formal notation feels dense; read Sections 2-8 first if you prefer precision.
Step 0: What's a dot product?
You have two lists of numbers. A dot product multiplies them pairwise and adds up:
[3, 1, 0] · [2, 0, 4] = (3×2) + (1×0) + (0×4) = 6
The key intuition: if two lists have big numbers in the same positions, the dot product is high. It measures how similar two things are — how much they "point in the same direction."
This single operation — multiply pairwise, sum the results — is the engine behind everything that follows. Transformers, attention, cross-graph discovery: they're all built on dot products.
Step 1: What does an alert look like?
When a security alert fires, the system extracts 6 numbers about it — a "factor vector" f:
f = [0.95, 0.3, 0.0, 0.7, 0.9, 0.85]
│ │ │ │ │ └── pattern_history (seen this before: high)
│ │ │ │ └──────── device_trust (known device: high)
│ │ │ └────────────── time_anomaly (odd hour: moderate)
│ │ └──────────────────── VIP_status (not a VIP: zero)
│ └────────────────────────── asset_criticality (low-value asset)
└───────────────────────────────── travel_match (employee in Singapore: high)
That's the query — "here's what this alert looks like." (This is the factor vector f from Eq. 4 in Section 3.)
Step 2: What does each action care about?
The system has 4 possible actions. Each action has its own row of 6 weights — what it "cares about":
travel asset VIP time device history
────── ───── ─── ──── ────── ───────
false_positive: [ 0.8, 0.1, 0.0, 0.2, 0.7, 0.9 ] ← "I fire when travel+device+history are high"
escalate_tier2: [ 0.3, 0.6, 0.8, 0.5, 0.2, 0.1 ] ← "I fire when asset+VIP are high"
enrich_wait: [ 0.4, 0.4, 0.3, 0.7, 0.3, 0.4 ] ← "I fire when time_anomaly is high but nothing else screams"
escalate_incident:[ 0.1, 0.9, 0.9, 0.8, 0.1, 0.1 ] ← "I fire when asset+VIP+time are all high"
Stack those 4 rows and you get the weight matrix W (4 rows × 6 columns). That's the keys — "here's what each action needs." (This is W from Eq. 4.)
Step 3: The dot product scores compatibility
Now multiply the alert vector f against each action's row:
f · false_positive = (0.95×0.8) + (0.3×0.1) + (0.0×0.0) + (0.7×0.2) + (0.9×0.7) + (0.85×0.9) = 2.33
f · escalate_tier2 = (0.95×0.3) + (0.3×0.6) + (0.0×0.8) + (0.7×0.5) + (0.9×0.2) + (0.85×0.1) = 1.08
f · enrich_wait = (0.95×0.4) + (0.3×0.4) + (0.0×0.3) + (0.7×0.7) + (0.9×0.3) + (0.85×0.4) = 1.60
f · escalate_incident= (0.95×0.1) + (0.3×0.9) + (0.0×0.9) + (0.7×0.8) + (0.9×0.1) + (0.85×0.1) = 1.10
False positive gets the highest score (2.33) because the alert and that action are similar — they both have high values in the travel, device, and history slots. That's the dot product doing its job: measuring compatibility.
In matrix notation, all four dot products at once: f · Wᵀ — that's Eq. (4).
Step 4: Softmax turns scores into probabilities
Raw scores (2.33, 1.08, 1.60, 1.10) aren't probabilities yet. Softmax converts them:
softmax([2.33, 1.08, 1.60, 1.10] / τ) where τ = 0.25 (temperature)
Temperature τ controls sharpness. Low τ = very decisive (winner takes almost all). High τ = more uncertain.
Result: something like [0.94, 0.01, 0.05, 0.01] — 94% confidence it's a false positive.
That's Level 1. That's the whole scoring matrix. It is attention-shaped. The query asks "what is this alert?" The keys answer "what does each action need?" The dot product measures compatibility. Softmax picks the best match. This is the same computational form as what happens inside every transformer when a word "attends to" other words to decide what it means in context — though in transformers Q and K are representations of items in the same space, while here Q is a factor vector and K is a parameter matrix. (See Section 3 for the formal treatment.)
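Every number in Steps 1-4 can be reproduced in a few lines of NumPy, using exactly the vectors above:

```python
import numpy as np

f = np.array([0.95, 0.3, 0.0, 0.7, 0.9, 0.85])        # Step 1: the alert's factor vector
W = np.array([[0.8, 0.1, 0.0, 0.2, 0.7, 0.9],          # false_positive
              [0.3, 0.6, 0.8, 0.5, 0.2, 0.1],          # escalate_tier2
              [0.4, 0.4, 0.3, 0.7, 0.3, 0.4],          # enrich_wait
              [0.1, 0.9, 0.9, 0.8, 0.1, 0.1]])         # escalate_incident

scores = f @ W.T                                        # Step 3: ~[2.33, 1.08, 1.60, 1.10]
tau = 0.25
logits = scores / tau                                   # Step 4: temperature scaling
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                    # ~[0.94, 0.01, 0.05, 0.01]
print(np.round(scores, 2), np.round(probs, 2))          # 94% confidence: false_positive
```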
Step 5: Now the key insight — the weights LEARN
On Day 1, those weights are generic guesses. On Day 30, after 340+ decisions with verified outcomes (a human confirmed "yes, that was actually a false positive" or "no, you should have escalated"), the AgentEvolver adjusts the weights. The travel_match weight for false_positive might go from 0.8 to 0.92 because in this firm, Singapore travel logins really are almost always legitimate. The time_anomaly weight might increase because this firm's SOC sees more real attacks at odd hours than the baseline assumed.
That's why accuracy goes from 68% to 89%. Same model. Same 6 factors. Better weights. The weights absorbed the firm's specific patterns. (This is the "Compounding Property" in Section 3.)
Step 6: Cross-graph attention — Level 2
Now make it bigger. Instead of one alert attending to 4 actions, imagine every entity in one graph domain attending to every entity in another domain.
Decision History has hundreds of entities (patterns, past decisions, confidence scores). Threat Intelligence has hundreds of entities (CVEs, campaigns, IOC indicators). Each entity is represented as a vector of numbers — an embedding — just like the alert's factor vector, but richer.
Cross-graph attention computes the dot product between every Decision History entity and every Threat Intel entity:
E_decision · E_threat^T = a big matrix of compatibility scores
If Decision History has 500 entities and Threat Intel has 300 entities, that's a 500 × 300 matrix — 150,000 compatibility scores, computed in one operation (the same sweep described in Section 4).
Most scores are low (a compliance retention policy has nothing to say about a specific login pattern). But a few are high — like PAT-TRAVEL-001 (Singapore false positive pattern) and TI-2026-SG-CRED (Singapore credential stuffing campaign). Both embeddings encode "Singapore" prominently, so their dot product is high.
Softmax focuses attention on these high-compatibility pairs. The value transfer carries the threat intel payload ("active campaign, 340% increase") over to enrich the decision pattern. That's the discovery: the system just realized its Singapore false positive calibration is dangerously wrong, without anyone telling it to look.
That's Eq. (6) in Section 4. Same math as Step 3, but operating on entire domains instead of one alert vs. four actions.
Step 7: Why 15 heads — Level 3
With 6 domains, there are 6×5/2 = 15 unique pairs. Each pair is its own "attention head" — its own compatibility matrix, its own set of potential discoveries:
Decision History × Threat Intel → "are our past decisions still valid given new threats?"
Organizational × Decision History → "did a role change make our auto-close habits dangerous?"
Behavioral × Compliance → "does this behavior spike coincide with an active audit?"
Each head finds a categorically different kind of insight. All 15 run in parallel. (See Section 5 for the full head taxonomy.)
Step 8: Why this creates a permanent moat
Here's where it gets competitive. Three properties fall out of this math:
(a) Quadratic interaction space. Adding a 7th domain doesn't add 1 new discovery source — it adds 6 (one new pair with each existing domain). An 8th domain adds 7. The number of discovery surfaces grows as n(n-1)/2 — quadratic in the number of domains.
(b) Each domain gets richer over time. Decision History grows every day (more decisions). Threat Intel grows every day (new CVEs, new campaigns). So the compatibility matrices get bigger — more rows, more columns, more potential discoveries. If domains grow linearly with time t, the interaction space grows as ~t². Combined with the n² from domain pairs: total intelligence grows as O(n² × t^γ) where γ is between 1 and 2.
(c) Discoveries are preserved by design. The enrichment is additive — E_i + attention output. Original knowledge preserved. New discoveries layered on top. With gated residuals and versioned snapshots, the system is architecturally designed to retain what it has learned.
A competitor starting 12 months late has empty graphs. At γ = 1.5: you're at 24^1.5 = 117 units of accumulated intelligence. They're at 12^1.5 = 41. The gap is 76 — nearly double their total. And it widens every month. (See Section 8 for the full moat equation.)
Cross-Graph Attention: Mathematical Foundation
February 2026 — Dakshineshwari LLC
Arindam Banerji, PhD


