GIDS series

preprint · appendix a · 7 of 7

God’s Infinite Dimensional Space — Step-by-Step Guide

Cheat sheet: significant terms and operations

| Symbol / term | Plain meaning | What the operation does | Why it is here |
| --- | --- | --- | --- |
| \(\mathcal N\) | Noumenal arena | The largest space of possible distinctions | Gives the framework a “reality is larger than experience” starting point |
| \(\mathbf n_t\) | Noumenal microstate at time \(t\) | A point in \(\mathcal N\) | Represents the raw state before organism-specific projection |
| \(\mathcal M^{\mathrm{spec}}\) | Species-level accessible subspace | A finite slice of \(\mathcal N\) | Says a lineage can only access some distinctions |
| \(P^{\mathrm{spec}}\) | Species projection | Orthogonally projects raw state into the accessible slice | Formalizes “organisms do not get the whole world” |
| \(E^{\mathrm{spec}}\) | Species coordinate encoder | Converts the projected slice into coordinates | Turns the accessible slice into a workable vector |
| \(\Delta \mathbf v_\perp\) | Novelty residual | Removes what the current species template already captures | Tests whether a mutation adds a new axis |
| \(\Delta \Phi_\tau\) | Net fitness contribution | Expected gain minus cost of keeping a candidate axis | Decides whether an axis is retained |
| \(G_i\) | Inherited template | The repertoire of distinctions person \(i\) could in principle host | Separates lineage structure from individual realization |
| \(T_i\) | Realized individual embedding | The durable structure of one person after history, language, culture, and experience | This is the theoretical person-side object |
| \(\phi_{i,t}\) | Full phenomenal state | Everything live in the person at time \(t\) | Motivating ideal object |
| \(q_{i,t}^{(\tau,\Delta)}\) | Predictive observer-state | Smallest state that preserves the future law for task \(\tau\) and horizon \(\Delta\) | The formal target |
| \(s_{i,t}^{(\tau,\Delta)}\) | Operational state | \((\hat T_i, z_{i,t}, c_{i,t}, w_t)\) | What the benchmark can actually train on |
| \(\chi_{i,t}\) | Chimera | Person-in-role object | Makes context explicit instead of hiding it inside “the person” |
| \(c_{i,t}\) | Role and institution context | Role, regime, local demands | Explains why the same person acts differently in different settings |
| \(w_t\) | World state | Market, account, pressure, or other external state | Keeps the environment separate from the person |
| \(x_t\) | Proposition | The thing hitting the observer now | The object being scored, simulated, or chosen |
| \(Y_{i,t+\Delta}^{(\tau)}\) | Main future outcome | The task target | What the model tries to predict |
| \(A_{i,t+\Delta}^{(m,\tau)}\) | Auxiliary probe | Side target such as objection class or delay bucket | Forces the latent state to carry reusable structure |
| \(\Pi_\tau(T_i,c_{i,t},x_t)\) | Task projection | Selects task-relevant coordinates | Says not all dimensions matter for every task |
| \(a_{i,t}\) | Salience weights | Reweights coordinates elementwise | Captures what is active now |
| \(z_{i,t}=a_{i,t}\odot \Pi_\tau(\cdot)\) | Active fast slice | Uses elementwise multiplication to gate the task coordinates | Produces the current live state for the transition |
| \(m_{i,t}=\sum_j \omega_{ij,t}\mu_{ij}\) | Memory field | Weighted sum of traces | Makes memory computable |
| \(R(\mu_{ij},x_t,c_{i,t},\phi_{i,t})\) | Retrieval rule | Updates relevance of a trace | Explains why the same prompt can work differently later |
| \(\Xi(C,c)\) | Contextual lifting | Retypes categories using context | Avoids false contradictions |
| \(E_{f,s}(c)\) | Category embedding | Maps a discrete token to a dense vector | Standard ML move for categorical data |
| \(u_{i,t}^{(f,s)}\) | Within-event pooled category vector | Averages embedded tokens in one bag | Converts sparse categorical events into fixed-width vectors |
| \(\nu\) | Null vector | Learned stand-in for an empty bag | Keeps empty slots explicit instead of pretending they are zero |
| \(m_{i,t}^{(f,s)}\) | Mask bit | Indicates whether a slot is populated | Separates absence from value |
| \(\Vert\) | Concatenation | Joins vectors end-to-end | Preserves slot identity |
| \(g_i^{\mathrm{slow}}\) | Slow categorical bank | Regime-aware durable pooled categorical memory | Captures what the person is generally like now |
| \(g_{i,t}^{\mathrm{fast},\tau}\) | Fast categorical pool | Task-conditioned recent categorical summary | Captures what is currently active |
| \(E_T(\cdot)\) | Slow encoder | Builds \(\hat T_i\) from durable information | Produces the slow person embedding |
| \(U_\theta(\cdot)\) | Fast update rule | Updates fast state from new events | Makes the model sequential |
| \(E_o^{(\tau)}(\cdot)\) | Observer encoder | Packs slow state, fast state, context, and world into one task representation | Creates the observer-side object for interaction |
| \(E_p^{(\tau)}(x_t)\) | Proposition encoder | Encodes the proposition into the same task space | Makes propositions comparable with the observer-side state |
| \(\Psi_\tau(o,p)\) | Interaction operator | Combines observer and proposition encodings | Represents “what happens when this proposition hits this observer” |
| \(G_\theta(\cdot)\) | Transition map | Predicts the next latent predictive state | Core world-model step |
| \(R_0,R_m\) | Readout heads | Decode outcomes and probes from the latent state | Converts hidden state into measurable outputs |
| \(\Delta_\tau(f)\) | Feature contribution | Performance with a feature family minus performance without it | Decides whether a feature family stays |
| \(\mathcal L_\tau\) | Training objective | Adds main loss, probe losses, and regularization | Defines what gradient descent is minimizing |
| \(\Omega(\theta)\) | Regularizer | Penalizes overly flexible parameter settings | Keeps the fit from becoming brittle |
| \(\hat T_i \leftarrow (1-\alpha)\hat T_i+\alpha \hat T_i^{\mathrm{new}}\) | Slow EMA refresh | Mixes old durable state with a refreshed durable estimate | Keeps slow state stable |
| \(\operatorname{score}_\theta(x\mid s)\) | Proposition score | Expected task utility if proposition \(x\) is used in state \(s\) | Turns forecasting into ranking |
| \(\arg\max\) | Best choice | Picks the highest scoring candidate | Formal proposition search |
| \(P(\cdot\mid\cdot)\) | Conditional probability | “Probability of this given that” | Language of sufficiency and prediction |
| \(\mathbb E[\cdot]\) | Expectation | Average predicted value under uncertainty | Needed when scoring uncertain futures |
| \(\perp\) | Conditional independence | Says extra information stops helping once a state is known | Lets the framework define sufficiency and mediation |
| \(\hat V_{\mathrm{IPS}}(\pi)\) | IPS estimate | Reweights logged rewards to estimate a target policy’s value | Separates ranking from policy evaluation |
| LogLoss / Brier / PR-AUC / ECE | Evaluation metrics | Fit, probability error, rare-event ranking, and calibration | Measures whether the model is useful |

The full arc in one line

$$ \mathcal N \longrightarrow \mathcal M^{\mathrm{spec}} \longrightarrow G_i \longrightarrow T_i \longrightarrow \phi_{i,t} \rightsquigarrow q_{i,t}^{(\tau,\Delta)} \approx s_{i,t}^{(\tau,\Delta)} \longrightarrow y_{i,t+\Delta}^{(\tau)}. $$

Read it left to right:

  1. Start with a world that contains more distinctions than any organism can use.
  2. Restrict that world to the distinctions a lineage can access.
  3. Turn the lineage template into an individual template.
  4. Realize that template in one person.
  5. Let that person occupy a full momentary state.
  6. Compress the full state into a task-specific predictive state.
  7. Approximate that predictive state with something measurable.
  8. Decode future observable outcomes from it.

Part 0 — Background

Step 0.1: Replace fixed categories with evolved structure

The opening move is simple: organisms do not passively mirror the world; they inherit a structured way of carving it up.

Why this matters: it turns the appearance of reality into something that can be modeled as a built structure rather than a raw copy of an external world.

Step 0.2: Treat interpretation and response as one process

The framework treats perception, interpretation, understanding, and action as one continuous state-transition process.

Why this matters: once they are written in one format, the same algebra can describe seeing, feeling, remembering, and acting.

Part 1 — Specifying the Area of Interest

Step 1.1: Represent change as vectors

The framework starts by treating observable changes as vectors.

Plain meaning: instead of handling vision, memory, and action as unrelated substances, it writes them as coordinates in a shared space.

Step 1.2: Sample a continuous stream into modelable states

Reality is continuous, but the model uses snapshots.

Plain meaning: the flow is continuous in life, but discrete time makes the math and the benchmark possible.

Step 1.3: Distinguish noumenal vectors from phenomenal vectors

Noumenal vectors are raw distinctions available before the organism has organized them; phenomenal vectors are reality as it appears after the organism’s inherited structure processes them.

Why this matters: the framework keeps raw physical distinctions separate from lived experience.

Step 1.4: Introduce the inherited seed

The inherited seed is the lineage-fixed structure that determines what kinds of distinctions the organism can even register.

Why this matters: the seed explains why an organism gets one kind of world rather than all possible worlds.

Step 1.5: Use a toy sequence to show state transition

The early toy example turns one vector sequence into another by inserting intermediate internal states.

What is happening mathematically: one structured vector recruits others in the same larger state space, and later states inherit or modify earlier coordinates.

Why it is here: it gives an intuitive picture of how a current input can activate affect, action, and an updated object state without changing notation.

Step 1.6: Define the universal arena

$$ \mathbf n_t = \sum_{k=1}^{\infty} n_{t,k}\mathbf e_k \in \mathcal N. $$

Plain meaning: \(\mathcal N\) is the largest coordinate system the framework will allow, and \(\mathbf n_t\) is a raw state inside it.

What the operation does: it writes a raw state as a sum of basis directions with weights.

Why it is here: without a large ambient space, there is nowhere to place distinctions that a lineage does not access.

Step 1.7: Define the species-level accessible slice

$$ \mathcal M^{\mathrm{spec}}=\operatorname{span}\{\mathbf v_1,\ldots,\mathbf v_d\}\subset \mathcal N. $$

Plain meaning: evolution preserves a finite set of useful axes.

What the operation does: \(\operatorname{span}\) says every accessible state is a linear combination of those selected axes.

Why it is here: it formalizes the idea that a lineage experiences only a finite, useful slice of a much larger arena.

Step 1.8: Project raw state into that slice

$$ P^{\mathrm{spec}}\mathbf n = \sum_{i=1}^{d}\langle \mathbf v_i,\mathbf n\rangle \mathbf v_i. $$

Plain meaning: keep the part of the raw world that aligns with the lineage’s accessible axes.

What the operation does: the inner products \(\langle \mathbf v_i,\mathbf n\rangle\) measure how much of \(\mathbf n\) lies along each accessible direction, then rebuild the accessible component from those amounts.

Why it is here: it makes “accessible world” a real projection rather than a metaphor.

Step 1.9: Encode the projected slice as coordinates

$$ E^{\mathrm{spec}}(\mathbf n_t) = \begin{bmatrix} \langle \mathbf v_1,\mathbf n_t\rangle \\ \vdots \\ \langle \mathbf v_d,\mathbf n_t\rangle \end{bmatrix}. $$

Plain meaning: convert the accessible slice into a finite vector.

What the operation does: the encoder takes the coefficients of the projected state along the accessible axes.

Why it is here: later models need coordinates, not just abstract subspaces.
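A minimal NumPy sketch of Steps 1.8 and 1.9 together, with made-up dimensions (a 10-dimensional arena and \(d=3\) accessible axes): build orthonormal axes, read off the coordinates \(\langle \mathbf v_i,\mathbf n\rangle\), and rebuild the projected state from them.

```python
import numpy as np

# Toy illustration (assumed shapes, not from the paper): project a raw state n
# onto the accessible slice spanned by orthonormal axes v_1..v_d, then read off
# the coordinate vector E_spec(n) = [<v_1, n>, ..., <v_d, n>].
rng = np.random.default_rng(0)

V = np.linalg.qr(rng.normal(size=(10, 3)))[0]  # columns: orthonormal axes v_i
n = rng.normal(size=10)                        # raw noumenal state

coords = V.T @ n       # E_spec(n): inner products along each accessible axis
n_proj = V @ coords    # P_spec n: rebuild the accessible component

# Projecting twice changes nothing: P_spec is idempotent.
assert np.allclose(V @ (V.T @ n_proj), n_proj)
```

The leftover part \(\mathbf n - P^{\mathrm{spec}}\mathbf n\) is orthogonal to every accessible axis, which is exactly the geometry Step 5.2 later proves.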

Step 1.10: Test whether evolution added a genuinely new axis

$$ \Delta \mathbf v_{\perp} = \Delta \mathbf v - P^{\mathrm{spec}}_{\tau}\Delta \mathbf v. $$

Plain meaning: subtract what the current species template already explains.

What the operation does: it removes the old component of a candidate mutation, leaving only the genuinely new part.

Why it is here: this is the novelty test.

If the residual is zero, the candidate distinction is redundant. If the residual is nonzero, the candidate adds something new.
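The novelty test can be sketched in a few lines of NumPy (toy dimensions, hypothetical mutation vectors): subtract the projection onto the current accessible span and check whether anything survives.

```python
import numpy as np

# Toy illustration of dv_perp = dv - P_spec dv. A residual of ~0 means the
# candidate axis is redundant; a nonzero residual means it adds something new.
rng = np.random.default_rng(1)
V = np.linalg.qr(rng.normal(size=(8, 2)))[0]   # current accessible axes (orthonormal)

def residual(dv, V):
    """Remove the component of dv already explained by span(V)."""
    return dv - V @ (V.T @ dv)

dv_old = V @ np.array([0.5, -2.0])             # lies inside the current span
dv_new = rng.normal(size=8)                    # generic direction: partly novel

assert np.allclose(residual(dv_old, V), 0)     # redundant: residual vanishes
assert np.linalg.norm(residual(dv_new, V)) > 0 # genuinely new component survives
```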

Step 1.11: Keep the axis only if gain beats cost

$$ \Delta \Phi_{\tau}(\widehat{\Delta \mathbf v}_{\perp}) = \mathbb E_{e\sim \mathcal E_{\tau}} \Big[ W\!\big(e,\mathcal M_{\tau}^{\mathrm{spec}} \oplus \operatorname{span}\{\widehat{\Delta \mathbf v}_{\perp}\}\big) - W\!\big(e,\mathcal M_{\tau}^{\mathrm{spec}}\big) \Big] - C(\widehat{\Delta \mathbf v}_{\perp}). $$

Plain meaning: an axis stays only if it helps more than it costs.

Harder terms:

  • \(\mathbb E\) means average over environments the lineage encounters.
  • \(W(e,\mathcal M)\) is expected reproductive value in environment \(e\) if the lineage has accessible subspace \(\mathcal M\).
  • \(C(\cdot)\) is maintenance cost: energy, wiring, false positives, and related burdens.
  • \(\oplus\) means “add a new independent direction to the current space.”

Why it is here: it gives the species template a retention rule instead of treating it as a mysterious gift.

Part 2 — Deriving the Transcendental Embedding

Step 2.1: Separate five different objects

Part 2 splits one overloaded phrase into five levels:

  1. inherited template,
  2. realized individual embedding,
  3. full phenomenal state,
  4. task-conditioned predictive state,
  5. measurable estimate.

Why it matters: these are not the same object, and the math stays cleaner once they are separated.

Step 2.2: Define the person-in-role object

$$ \chi_{i,t} = (T_i, c_{i,t}). $$

Plain meaning: model a person as a person plus their active role-context.

What the operation does: it forms a tuple.

Why it is here: the same person can produce different outputs under different roles without becoming a different person.

Step 2.3: Use psychometrics as a prior, not a full ontology

$$ p_i \in \mathbb R^k. $$

Plain meaning: factor scores are useful summaries, but they are only one input channel.

Why it is here: standardized summaries help with approximation, but they do not replace state, memory, proposition, or transition.

Step 2.4: Project the person down to what matters for this task

$$ \Pi_\tau(T_i, c_{i,t}, x_t) \in \mathbb R^d. $$

Plain meaning: for a given task, only some coordinates matter.

What the operation does: it chooses the task-relevant slice of the person given the current context and proposition.

Why it is here: the framework is task-specific, not universally maximal at every step.

Step 2.5: Weight that slice by salience

$$ z_{i,t} = a_{i,t} \odot \Pi_\tau(T_i, c_{i,t}, x_t). $$

Plain meaning: even within the task-relevant slice, some coordinates are live and some are quiet.

Harder term:

  • \(\odot\) is elementwise multiplication. If one coordinate has salience \(0\), it is suppressed. If it has salience \(1\), it passes through unchanged. Values in between partially gate it.

Why it is here: it turns a broad person representation into a current active state.
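A one-line NumPy sketch of the gate, with made-up values for the task slice and the salience weights:

```python
import numpy as np

# Toy illustration of z = a ⊙ Π_τ(·): salience 0 suppresses a coordinate,
# salience 1 passes it through, intermediate values partially gate it.
task_slice = np.array([2.0, -1.0, 0.5, 3.0])  # stand-in for Π_τ(T_i, c_it, x_t)
salience = np.array([1.0, 0.0, 0.5, 1.0])     # stand-in for a_it

z = salience * task_slice                      # ⊙ is plain elementwise product
# The second coordinate is fully gated out; the third is halved.
```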

Step 2.6: Distinguish full state, predictive state, and measurable state

$$ P\!\left( Y_{i,t+\Delta}^{(\tau)} \mid H_{i,\le t}, T_i, c_{i,t}, w_t, x_t \right) = P\!\left( Y_{i,t+\Delta}^{(\tau)} \mid q_{i,t}^{(\tau,\Delta)}, x_t \right). $$

Plain meaning: once \(q_{i,t}^{(\tau,\Delta)}\) is known, the rest of the past adds nothing for predicting the task outcome under the same proposition.

Harder term:

  • A conditional probability \(P(A\mid B)\) means “probability of \(A\) once \(B\) is known.”
  • Sufficiency here means the state keeps all the information the future still needs.

Why it is here: this is the central definition of the predictive observer-state.

Then the measurable approximation is

$$ s_{i,t}^{(\tau,\Delta)}=(\hat T_i,z_{i,t},c_{i,t},w_t). $$

Plain meaning: this is the version the benchmark can actually construct from data.

Step 2.7: Model memory as weighted traces

$$ m_{i,t}=\sum_{j=1}^{N_i}\omega_{ij,t}\mu_{ij}. $$

Plain meaning: memory is treated as stored traces with time-varying weights.

Harder terms:

  • \(\mu_{ij}\) is trace \(j\).
  • \(\omega_{ij,t}\) is how relevant or active that trace is at time \(t\).

Why it is here: it gives the framework a concrete way to represent persistence and reactivation.
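As a sketch with toy traces (the vectors and weights are invented for illustration), the memory field is just a weighted sum:

```python
import numpy as np

# Toy illustration of m = Σ_j ω_j μ_j: stored trace vectors mixed by
# time-varying relevance weights.
traces = np.array([[1.0, 0.0],   # μ_1
                   [0.0, 1.0],   # μ_2
                   [1.0, 1.0]])  # μ_3
weights = np.array([0.5, 0.0, 0.25])  # ω_j at the current time step

m = weights @ traces   # 0.5·μ_1 + 0·μ_2 + 0.25·μ_3
```

Trace \(\mu_2\) is stored but currently inert; changing the weights, not the traces, is what retrieval does in the next step.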

Step 2.8: Let the current event change retrieval

$$ \omega_{ij,t+1}=R(\mu_{ij},x_t,c_{i,t},\phi_{i,t}). $$

Plain meaning: new input changes which old traces matter.

Why it is here: the same message can land differently after prior experiences.

Step 2.9: Let learning update the memory field

$$ m_{i,t+1}=U(m_{i,t},x_t,\phi_{i,t}). $$

Plain meaning: memory is not static; each event changes the future state space.

Why it is here: it explains why sequence models should matter.

Step 2.10: Contextually lift categories before pooling them

$$ \widetilde{C}_{i,t}^{(f,s)} = \Xi\!\big(C_{i,t}^{(f,s)}, c_{i,t}\big). $$

Plain meaning: raw categories are retyped with context before they are compared.

Example: “aggressive” in self-interest and “aggressive” in out-group treatment may not be the same fact.

Why it is here: it prevents the model from averaging away real structure or inventing contradictions too early.

Step 2.11: Pool categories inside one event

$$ u_{i,t}^{(f,s)} = \begin{cases} \frac{1}{|\widetilde{C}_{i,t}^{(f,s)}|} \sum_{c \in \widetilde{C}_{i,t}^{(f,s)}} E_{f,s}(c), & |\widetilde{C}_{i,t}^{(f,s)}| > 0, \\[6pt] \nu_{f,s}, & |\widetilde{C}_{i,t}^{(f,s)}| = 0. \end{cases} \qquad m_{i,t}^{(f,s)}=\mathbf 1\{|\widetilde{C}_{i,t}^{(f,s)}|>0\}. $$

Plain meaning: embed the event’s categorical tokens, average them, and use a learned null vector plus a mask bit when the bag is empty.

Harder terms:

  • \(E_{f,s}(c)\) is an embedding lookup: it turns a token into a trainable dense vector.
  • \(\mathbf 1\{\cdot\}\) is an indicator: it equals 1 when the condition is true and 0 otherwise.

Why it is here: categorical traces are abundant in real logs, and this makes them usable without flattening them into brittle one-hot tables.
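A minimal sketch of the pooling rule with a hypothetical two-token embedding table (the tokens, vectors, and null vector are invented for illustration):

```python
import numpy as np

# Toy illustration: average the embedded tokens in one event's bag; an empty
# bag gets the learned null vector ν and a mask bit of 0, so absence stays
# explicit rather than becoming a silent zero vector.
emb = {"angry": np.array([1.0, 0.0]), "terse": np.array([0.0, 2.0])}
nu = np.array([-1.0, -1.0])  # learned stand-in for "nothing observed"

def pool(bag):
    if len(bag) == 0:
        return nu, 0                        # (ν, mask = 0)
    vecs = np.stack([emb[c] for c in bag])
    return vecs.mean(axis=0), 1             # (mean embedding, mask = 1)

u_full, m_full = pool(["angry", "terse"])
u_empty, m_empty = pool([])
```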

Step 2.12: Preserve slot identity

$$ e_{i,t}^{\mathrm{cat}} = \big\|_{(f,s)} [P_{f,s}u_{i,t}^{(f,s)},\,m_{i,t}^{(f,s)}]. $$

Plain meaning: keep family and source channels separate when you build the event-level categorical representation.

Why it is here: “biography said X” and “behavior showed X” are not the same kind of evidence.

Step 2.13: Build slow categorical memory

$$ g_{i,\rho}^{\mathrm{slow}} = \frac{ \sum_{r \le t}\mathbf 1\{\rho_r=\rho\}\,\beta_{i,r}^{\mathrm{slow}}\,e_{i,r}^{\mathrm{cat}} }{ \sum_{r \le t}\mathbf 1\{\rho_r=\rho\}\,\beta_{i,r}^{\mathrm{slow}} } \quad\text{when usable evidence exists.} $$

Plain meaning: average old categorical events inside the same regime, but weight them by importance.

Harder terms:

  • \(\rho\) is the regime or role bucket.
  • \(\beta^{\mathrm{slow}}\) controls how much each past event contributes.

Why it is here: the slow bank is meant to capture durable person structure, not just the last touch.

Step 2.14: Build fast categorical memory

$$ g_{i,t}^{\mathrm{fast},\tau} = \sum_{r \le t}\alpha_{i,r,t}^{(\tau)} e_{i,r}^{\mathrm{cat}}, \qquad \sum_{r \le t}\alpha_{i,r,t}^{(\tau)}=1. $$

Plain meaning: create a task-conditioned summary of recent categorical history.

Why it is here: short-horizon prediction usually depends on what is currently active, not just what is durable.
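Steps 2.13 and 2.14 can be sketched together with toy events (the regime labels, importance weights, and recency scores below are invented): the slow bank is an importance-weighted average within one regime, and the fast pool is a normalized recency mixture.

```python
import numpy as np

# Toy illustration of the slow regime average and the fast normalized pool.
events = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])  # e_cat per event
regimes = np.array(["work", "work", "home"])
beta = np.array([1.0, 3.0, 1.0])                         # slow importance weights

mask = regimes == "work"
g_slow_work = (beta[mask] @ events[mask]) / beta[mask].sum()

alpha = np.exp(np.array([0.0, 1.0, 2.0]))  # recency scores, newest largest
alpha = alpha / alpha.sum()                 # normalize so Σ α = 1
g_fast = alpha @ events
```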

Step 2.15: Define minimality

$$ q_{i,t}^{(\tau,\Delta)} = h\big(r_{i,t}^{(\tau,\Delta)}\big) \quad \text{for some measurable map } h, $$

for every other sufficient state \(r_{i,t}^{(\tau,\Delta)}\).

Plain meaning: a minimal sufficient state is one that every other sufficient state can be reduced to.

Why it is here: “sufficient” alone could mean dragging the full archive forever; minimality asks for the smallest useful state.

Step 2.16: Split the measurable state into slow and fast pieces

$$ s_{i,t}^{(\tau,\Delta)} = (\hat T_i,z_{i,t},c_{i,t},w_t) \approx q_{i,t}^{(\tau,\Delta)}. $$

Plain meaning: the benchmark approximation has a durable person part and a rapidly updating local part.

Why it is here: durable traits and recent events change on different timescales.

Step 2.17: Define the realized embedding and its first estimate

$$ T_i=\Psi(G_i,\ell_i,h_i), \qquad h_i=\sum_{k=1}^{n_i}\beta_{ik}e_{ik}. $$

Plain meaning: the realized person is the inherited template filtered through language, culture, and weighted life events.

Then the first operational estimate is

$$ \hat T_i^{(0)} = W_p p_i + W_b b_i + W_\ell \ell_i + W_r r_i + W_h h_i + W_g g_i^{\mathrm{slow}}. $$

Plain meaning: start with a weighted combination of person summaries, life history, and slow categorical memory.

Why it is here: this is the bridge from theory to something that can be computed.

Step 2.18: Use the local state to predict the next predictive state

$$ \hat q_{i,t+1}^{(\tau,\Delta)} = G_\theta(\hat T_i,z_{i,t},c_{i,t},w_t,x_t). $$

Plain meaning: once the slow person state, fast local state, context, world, and proposition are known, predict the next task-relevant state.

Why it is here: this is the handoff into the world model.

Part 3 — Application: Predicting How People Behave

Step 3.1: Keep the ideal transition law, but do not train against it directly

$$ \phi_{i,t+1}=F(T_i,\phi_{i,t},x_t). $$

Plain meaning: the motivating ideal is still the next phenomenal state.

Why it is here: it says what the framework is aiming at, even though the benchmark cannot observe \(\phi_{i,t}\) directly.

Step 3.2: Introduce a task projection from a large ambient space

$$ \Pi_\tau:\mathcal U\to \mathbb R^{d_\tau}. $$

Plain meaning: each task only needs a finite slice of the larger representational arena.

Why it is here: it keeps the framework open-ended without requiring infinite computation.

Step 3.3: Define the operational transition law

$$ \hat q_{i,t+1}^{(\tau,\Delta)} = G_\theta(\hat T_i,z_{i,t},c_{i,t},w_t,x_t). $$

Plain meaning: the trainable model predicts the next predictive state from the operational state and proposition.

Why it is here: this is the model actually learned from data.

Step 3.4: Decompose the world model into encoder, interaction, and decoder

$$ o_{i,t}^{(\tau)}=E_o^{(\tau)}(\hat T_i,z_{i,t},c_{i,t},w_t), $$
$$ p_t^{(\tau)}=E_p^{(\tau)}(x_t), $$
$$ h_{i,t}^{(\tau)}=\Psi_\tau(o_{i,t}^{(\tau)},p_t^{(\tau)}), $$
$$ \hat q_{i,t+1}^{(\tau,\Delta)}=G_\theta(h_{i,t}^{(\tau)}). $$

Plain meaning:

  • encode the observer-side state,
  • encode the proposition,
  • let them interact,
  • decode the next predictive state.

Why it is here: it separates representation from interaction and makes the architecture modular.
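The modularity can be sketched with random linear maps (the shapes, the tanh nonlinearities, and the elementwise interaction below are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

# Toy illustration: the world-model step factors into encode-observer,
# encode-proposition, interact, decode. Each piece can be swapped independently.
rng = np.random.default_rng(2)
W_o = rng.normal(size=(4, 6))  # stand-in for observer encoder E_o
W_p = rng.normal(size=(4, 3))  # stand-in for proposition encoder E_p
W_g = rng.normal(size=(4, 4))  # stand-in for transition map G_θ

def step(T_hat, z, c, w, x):
    o = np.tanh(W_o @ np.concatenate([T_hat, z, c, w]))  # observer-side state
    p = np.tanh(W_p @ x)                                 # proposition encoding
    h = o * p                                            # interaction Ψ_τ (elementwise here)
    return W_g @ h                                       # next predictive state

q_next = step(np.ones(2), np.ones(1), np.ones(2), np.ones(1), np.ones(3))
```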

Step 3.5: Decode visible consequences from the latent state

$$ \hat y_{i,t+\Delta}^{(\tau)}=R_0(\hat q_{i,t+1}^{(\tau,\Delta)}), \qquad \hat a_{i,t+\Delta}^{(m,\tau)}=R_m(\hat q_{i,t+1}^{(\tau,\Delta)}). $$

Plain meaning: outcomes and probe labels are readouts from the predicted next state, not the state itself.

Why it is here: a reply, a meeting, or an objection class is a measurable residue of an underlying transition.

Step 3.6: Define task-equivalence of propositions

$$ \Pi_{\tau}(E_p^{(\tau)}(x_t^{(1)})) = \Pi_{\tau}(E_p^{(\tau)}(x_t^{(2)})) \;\Rightarrow\; G_\theta(\cdot,x_t^{(1)})\approx G_\theta(\cdot,x_t^{(2)}). $$

Plain meaning: if two propositions look the same in the task-relevant projection, they should produce the same next-state prediction.

Why it is here: this is the paper’s formal notion of composability.

Step 3.7: Keep only useful feature families

$$ \Delta_\tau(f)=\mathrm{Perf}_\tau(M\cup f)-\mathrm{Perf}_\tau(M). $$

Plain meaning: a feature family stays only if it improves the task.

Why it is here: the framework is meant to be discoverable and revisable, not fixed in advance.

Step 3.8: Define the world model and its rollout

$$ \mathcal W_\tau:(\hat T_i,z_{i,t},c_{i,t},w_t,x_t)\mapsto \hat q_{i,t+1}^{(\tau,\Delta)}, $$
$$ \hat q_{i,t+k}^{(\tau,\Delta)} = \mathcal W_\tau^{(k)}(\hat T_i,z_{i,t},c_{i,t},w_t,x_t,\ldots,x_{t+k-1}). $$

Plain meaning: a one-step predictor becomes a simulator once you apply it repeatedly.

Why it is here: proposition choice is not only about one immediate readout; it can change future trajectories.
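A sketch of the rollout with toy linear dynamics standing in for \(G_\theta\) (the matrices below are invented): repeatedly applying the one-step model under a planned proposition sequence yields a finite-horizon simulation.

```python
import numpy as np

# Toy illustration: a one-step world model becomes a simulator by iteration.
A = np.array([[0.9, 0.1], [0.0, 0.8]])  # stand-in for G_θ acting on the state
B = np.array([[1.0], [0.5]])            # stand-in for the proposition's effect

def world_model(q, x):
    return A @ q + B @ x                 # one-step transition

def rollout(q0, xs):
    q = q0
    trajectory = []
    for x in xs:                         # feed propositions x_t, ..., x_{t+k-1}
        q = world_model(q, x)
        trajectory.append(q)
    return trajectory

traj = rollout(np.zeros(2), [np.array([1.0])] * 3)  # 3-step simulation
```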

Step 3.9: Train with main loss, probe losses, and regularization

$$ \mathcal L_\tau = \mathcal L_{\mathrm{main}} + \sum_{m=1}^{M}\lambda_m \mathcal L_{\mathrm{probe},m} + \lambda_{\mathrm{reg}}\Omega(\theta). $$

Plain meaning: optimize the main task, auxiliary probes, and weight control together.

Harder terms:

  • A loss is a number that gets smaller when the model improves.
  • \(\lambda_m\) and \(\lambda_{\mathrm{reg}}\) control how much the probe terms and regularizer matter relative to the main task.
  • Regularization keeps a model from fitting noise too aggressively.

Why it is here: it encourages the latent state to carry more than one narrow signal.

Then update by gradient descent:

$$ \theta_{t+1}=\theta_t-\eta\nabla_\theta \mathcal L_\tau. $$

Plain meaning: compute how the loss changes with respect to the parameters and step in the direction that lowers it.

Step 3.10: Rank propositions by expected task utility

$$ \operatorname{score}_\theta(x\mid s_{i,t}^{(\tau,\Delta)}) = \mathbb E_\theta\!\left[ U_\tau\!\left( \hat q_{i,t+1}^{(\tau,\Delta)}, \hat y_{i,t+\Delta}^{(\tau)}, \hat a_{i,t+\Delta}^{(\tau)} \right) \mid s_{i,t}^{(\tau,\Delta)},x \right]. $$

Plain meaning: score each admissible proposition by its expected downstream value.

Harder term:

  • The expectation \(\mathbb E_\theta[\cdot]\) is an average over what the model predicts could happen.

Why it is here: prediction becomes decision support once propositions are scored.

Then search for the best candidate:

$$ x_t^\star \in \arg\max_{x\in \mathcal X_{i,t}^{\mathrm{adm}}} \operatorname{score}_\theta(x\mid s_{i,t}^{(\tau,\Delta)}). $$

Plain meaning: pick the candidate with highest score among the allowed options.
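The score-then-search loop can be sketched with a hypothetical candidate set and a Monte Carlo stand-in for the expectation (the candidate names and utilities are invented):

```python
import numpy as np

# Toy illustration: score each admissible proposition by an expected utility,
# then pick the arg-max over the candidate set.
rng = np.random.default_rng(3)
candidates = {"call": 0.2, "email": 0.5, "wait": 0.1}  # hypothetical propositions

def score(x, s):
    # Stand-in for E_θ[U_τ | s, x]: a base utility plus a Monte Carlo
    # average over simulated noisy outcomes.
    sims = candidates[x] + 0.01 * rng.normal(size=1000)
    return sims.mean()

state = None  # in the real model, the operational state s conditions the scores
best = max(candidates, key=lambda x: score(x, state))  # arg-max search
```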

Part 4 — Benchmarking the World Model

Step 4.1: Make the state hierarchy explicit

$$ \phi_{i,t} \quad\text{full motivating state} $$
$$ q_{i,t}^{(\tau,\Delta)} \quad\text{formal predictive state} $$
$$ s_{i,t}^{(\tau,\Delta)} = (\hat T_i,z_{i,t},c_{i,t},w_t) \quad\text{measurable approximation}. $$

Plain meaning: the benchmark only has access to the third object.

Why it is here: it keeps the benchmark honest.

Step 4.2: Build the dataset around event-time prediction

$$ \mathcal D_\tau = \left\{ (u_i,H_{i,\le t},c_{i,t},w_t,x_t,y_{i,t+\Delta}^{(\tau)},a_{i,t+\Delta}^{(\tau)}) \right\}_{(i,t)}. $$

Plain meaning: every row is a person snapshot with history, context, world, proposition, and future labels.

Why it is here: the task is to predict future transition from current state plus proposition.

Step 4.3: Encode the event stream explicitly

$$ e_{i,t}=[x_t,\delta_t,r_t,a_t,m_t^{\mathrm{obs}},e_{i,t}^{\mathrm{cat}}]. $$

Plain meaning: each event carries the proposition, delay, response, action, memory proxy, and raw categorical shock.

Why it is here: the model needs both structured sequence data and categorical trace data.

Step 4.4: Benchmark against baselines of increasing strength

The benchmark tests:

  1. frequency only,
  2. current-touch only,
  3. static tabular features,
  4. shallow history summaries,
  5. two-tower recommendation style,
  6. monolithic sequence modeling.

Why it is here: a slow/fast latent-state model should only survive if it beats or matches simpler alternatives in a meaningful way.

Step 4.5: Define the proposed latent-state benchmark model

$$ \hat T_i=E_T(u_i,g_i^{\mathrm{slow}}), \qquad z_{i,0}=z_0(\hat T_i), $$
$$ z_{i,t+1}=U_\theta(z_{i,t},\hat T_i,c_{i,t},w_t,e_{i,t},g_{i,t}^{\mathrm{fast},\tau}), $$
$$ \hat q_{i,t+1}^{(\tau,\Delta)} = G_\theta(\hat T_i,z_{i,t},c_{i,t},w_t,x_{t+1}). $$

Plain meaning: estimate durable person state, initialize fast state, update fast state with each event, and predict the next task-relevant state for each candidate proposition.

Step 4.6: Use the corrected training objective

$$ \mathcal L_\tau = \mathcal L_{\mathrm{main}} + \sum_{m=1}^{M}\lambda_m \mathcal L_{\mathrm{probe},m} + \lambda_{\mathrm{reg}} \Omega(\theta). $$

Plain meaning: all three pieces are added because the optimizer is minimizing the objective.

Why it is here: probe losses should be reduced, not increased, and regularization should discourage unstable fits, not reward them.

Step 4.7: Update slow and fast state on different timescales

Fast update:

$$ z_{i,t+1}=U_\theta(z_{i,t},\hat T_i,c_{i,t},w_t,e_{i,t},g_{i,t}^{\mathrm{fast},\tau}). $$

Slow update:

$$ \hat T_i \leftarrow (1-\alpha)\hat T_i + \alpha\,\hat T_i^{\mathrm{new}}. $$

Plain meaning: recent events can move the fast state immediately, but durable identity should drift slowly toward a refreshed durable estimate.

Harder term:

  • This is an exponential moving average. A small \(\alpha\) means the slow state changes gradually.

Why it is here: without this split, one loud event can rewrite the whole person-side representation too aggressively.
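A sketch of the exponential moving average with toy numbers (\(\alpha = 0.1\) and the two estimates below are invented):

```python
import numpy as np

# Toy illustration: the slow state drifts toward a refreshed durable estimate
# via an EMA; a small α means identity changes gradually, not per event.
alpha = 0.1
T_hat = np.array([0.0, 0.0])  # current durable estimate
T_new = np.array([1.0, 1.0])  # refreshed durable estimate

for _ in range(3):            # three refresh cycles
    T_hat = (1 - alpha) * T_hat + alpha * T_new

# After 3 cycles the slow state has moved 1 - 0.9³ = 0.271 of the way.
```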

Step 4.8: Separate ranking from policy evaluation

The same score function can be used in three regimes:

  1. Observational ranking: rank candidate propositions under the learned simulator.
  2. Off-policy evaluation: estimate how a target policy would have done using logged propensities.
  3. Online policy improvement: test and improve the policy under controlled experimentation.

Why it is here: ranking alone is not causal control.

Step 4.9: Use IPS only when logged propensities exist

$$ \hat V_{\mathrm{IPS}}(\pi) = \frac{1}{N} \sum_{t=1}^{N} \frac{\mathbf 1\{x_t=\pi(s_{i,t}^{(\tau,\Delta)})\}}{e_t}\,r_t. $$

Plain meaning: upweight cases where the historical policy was unlikely to choose the action the target policy would have chosen.

Harder terms:

  • \(e_t=\mu(x_t\mid s_{i,t}^{(\tau,\Delta)})\) is the behavior policy’s logged propensity.
  • \(r_t\) is realized reward.
  • If the target policy is stochastic rather than deterministic, the indicator is replaced by the importance ratio \(\pi(x_t\mid s)/\mu(x_t\mid s)\).

Why it is here: it gives a principled bridge from logged data to policy-value estimates.
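The estimator can be sketched on a small synthetic log (every number below is fabricated for illustration, and a deterministic target policy is assumed):

```python
import numpy as np

# Toy illustration of IPS: reweight logged rewards by 1/e_t wherever the
# deterministic target policy agrees with the logged action.
logged_actions = np.array([0, 1, 1, 0, 1])
propensities = np.array([0.5, 0.25, 0.25, 0.5, 0.25])  # e_t under behavior policy
rewards = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
target_actions = np.array([1, 1, 1, 0, 1])             # π(s_t) for each logged state

match = (logged_actions == target_actions).astype(float)
v_ips = np.mean(match / propensities * rewards)        # IPS value estimate
```

Rows where the policies disagree contribute zero; rows where a rarely-logged action matches get upweighted by \(1/e_t\), which is where the estimator's variance comes from.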

Step 4.10: Split train, validation, and test over time

$$ \mathcal D_{\mathrm{train}}^{(1:T_1)}, \quad \mathcal D_{\mathrm{val}}^{(T_1:T_2)}, \quad \mathcal D_{\mathrm{test}}^{(T_2:T_3)}. $$

Plain meaning: future rows must stay in the future.

Why it is here: random row splits leak information in sequence problems.

Step 4.11: Evaluate both forecast quality and calibration

$$ \mathrm{LogLoss},\qquad \mathrm{Brier},\qquad \mathrm{PR\text{-}AUC},\qquad \mathrm{ECE}. $$

Plain meaning:

  • LogLoss checks probabilistic fit,
  • Brier checks squared probability error,
  • PR-AUC checks rare-event ranking,
  • ECE checks calibration.

Why it is here: a useful decision model must rank well and produce believable probabilities.
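All four quantities are cheap to compute directly; here is a sketch on a tiny fabricated label set (the two-bin ECE below is a deliberately crude version of the usual binned estimator):

```python
import numpy as np

# Toy illustration: log loss, Brier score, and a two-bin expected calibration
# error computed straight from predicted probabilities.
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

logloss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
brier = np.mean((p - y) ** 2)

# Two-bin ECE: |mean prediction - empirical rate| per bin, weighted by bin size.
bins = p >= 0.5
ece = sum(
    np.mean(sel) * abs(p[sel].mean() - y[sel].mean())
    for sel in (bins, ~bins) if sel.any()
)
```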

Step 4.12: Use ablations to test whether the decomposition is real

The benchmark removes:

  • \(z_{i,t}\),
  • \(\hat T_i\),
  • salience-weighted pooling,
  • source/regime separation,
  • probe heads,
  • or the explicit slow/fast structure itself.

Why it is here: if performance does not move when these are removed, those pieces were decorative.

Step 4.13: Define success clearly

Success on forecasting means better held-out log loss and Brier score than the best baseline across more than one horizon.

Success on proposition selection means better off-policy value or live lift than a baseline policy, when the data regime actually supports that claim.

Part 5 — Axioms, Lemmas, and Main Theorem

Step 5.1: State the axioms

Part 5 assumes:

  1. an ambient space exists,
  2. individuals instantiate inherited structure,
  3. a task-conditioned predictive state exists,
  4. the operational slow/fast state approximates that predictive state,
  5. transition factors through the task-relevant proposition encoding,
  6. evolutionary retention uses a positive-gain rule,
  7. auxiliary probes also factor through predictive state.

Why it is here: theorems need fixed primitives.

Step 5.2: Prove the accessible slice is the best accessible approximation

$$ P^{\mathrm{spec}}\mathbf n_t = \operatorname{argmin}_{\mathbf m\in\mathcal M^{\mathrm{spec}}} \lVert \mathbf n_t-\mathbf m\rVert. $$

Plain meaning: among all states inside the accessible subspace, the projection is the closest one to the raw noumenal state.

Harder term:

  • \(\operatorname{argmin}\) means “the value that makes the quantity as small as possible.”

Why it is here: projection now has a precise geometric meaning.
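The claim can be checked numerically. In this sketch (an illustration under toy assumptions), \(\mathcal M^{\mathrm{spec}}\) is the span of an orthonormal basis `B`, \(P^{\mathrm{spec}}\mathbf n_t\) is `B @ (B.T @ n_t)`, and no random point of the slice comes closer to the raw state:

```python
import numpy as np
rng = np.random.default_rng(0)

# Accessible subspace M^spec: span of two orthonormal axes in R^5.
B, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # orthonormal basis, shape (5, 2)
n_t = rng.normal(size=5)                      # raw noumenal state

proj = B @ (B.T @ n_t)                        # P^spec applied to n_t

# The residual is orthogonal to the slice...
assert np.allclose(B.T @ (n_t - proj), 0)

# ...so every other point of the slice is at least as far from n_t.
for _ in range(100):
    m = B @ rng.normal(size=2)                # random candidate in the slice
    assert np.linalg.norm(n_t - proj) <= np.linalg.norm(n_t - m) + 1e-12
```

The orthogonality check is the geometric heart of the proof: closest-point and perpendicular-residual are the same condition.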

Step 5.3: Prove the novelty test is exact

If \(\Delta \mathbf v\) already lies in the current accessible subspace, its residual is zero. If the residual is nonzero and its net gain is positive, adding it increases the evolutionary objective.

Why it is here: new axes are not added by intuition; they are added by residual novelty plus positive value.
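The two-part test (nonzero residual \(\Delta\mathbf v_\perp\), positive net gain \(\Delta\Phi_\tau\)) can be sketched directly; the threshold `1e-9` is an illustrative numerical tolerance, not part of the theory:

```python
import numpy as np
rng = np.random.default_rng(1)

B, _ = np.linalg.qr(rng.normal(size=(6, 3)))   # current accessible axes
P = B @ B.T                                     # projector onto the current slice

def novelty_residual(dv):
    """Delta v_perp: the part of a candidate direction not already captured."""
    return dv - P @ dv

def retain(dv, delta_phi):
    """Keep a candidate axis only if it is genuinely new AND pays for itself."""
    return np.linalg.norm(novelty_residual(dv)) > 1e-9 and delta_phi > 0

inside = B @ np.array([1.0, -2.0, 0.5])         # already representable
outside = rng.normal(size=6)                    # almost surely a new direction

assert np.allclose(novelty_residual(inside), 0)  # exact: residual vanishes
assert retain(outside, 0.5)                      # novel and net-positive: kept
assert not retain(inside, 0.5)                   # redundant: rejected
assert not retain(outside, -0.1)                 # novel but too costly: rejected
```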

Step 5.4: Prove task-equivalent propositions induce the same next-state prediction

If two propositions have the same task-relevant projected encoding, then the transition map gives the same predicted next state.

Why it is here: this is the formal version of composability.

Step 5.5: Prove rollout is well-defined

Applying the same deterministic one-step world model repeatedly defines a unique finite-horizon rollout.

Why it is here: a one-step transition law becomes a true world model once it can simulate trajectories.
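Well-definedness here is just determinism plus iteration, which a few lines make concrete (a toy contraction stands in for the learned one-step law):

```python
from typing import Callable, List, TypeVar

S = TypeVar("S")

def rollout(f: Callable[[S], S], s0: S, horizon: int) -> List[S]:
    """Iterate a deterministic one-step world model to a finite horizon.

    Determinism of f makes the resulting trajectory unique.
    """
    states = [s0]
    for _ in range(horizon):
        states.append(f(states[-1]))
    return states

# Toy one-step law: halve the state; a unique 4-step trajectory follows.
traj = rollout(lambda s: 0.5 * s, 8.0, 4)
assert traj == [8.0, 4.0, 2.0, 1.0, 0.5]
```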

Step 5.6: Prove explicit history can help

Under log loss, the Bayes-optimal risk is \(H(Y\mid X)\), conditional entropy. Under Brier score for binary outcomes, the Bayes-optimal risk is \(\mathbb E[\operatorname{Var}(Y\mid X)]\), expected conditional variance.

Plain meaning: conditioning on more variables can only shrink conditional entropy and expected conditional variance, so giving the model more genuinely useful history cannot make the optimal predictor worse.

Why it is here: this justifies explicit state and sequence modeling when the future depends on the past.
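The inequality is easy to verify on a toy joint distribution. In this sketch (illustrative numbers only), conditioning on a binary history variable \(X\) lowers the Bayes-optimal log loss from \(H(Y)\) to \(H(Y\mid X)\):

```python
import numpy as np

# Toy joint: P(X=0)=P(X=1)=0.5; P(Y=1|X=0)=0.9, P(Y=1|X=1)=0.2.
px = np.array([0.5, 0.5])
py1_given_x = np.array([0.9, 0.2])

def h(p):
    """Binary entropy in nats, elementwise."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

py1 = float(px @ py1_given_x)             # marginal P(Y=1) = 0.55
h_y = float(h(py1))                       # Bayes log-loss risk without history
h_y_given_x = float(px @ h(py1_given_x))  # Bayes log-loss risk with history

# Conditioning on genuinely useful history lowers the optimal risk.
assert h_y_given_x <= h_y
```

When \(X\) carries no information about \(Y\) the two risks coincide, so the state advantage is "cannot hurt, can help", exactly as stated.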

Step 5.7: Prove minimal sufficient state is unique up to reparameterization

If two predictive states are both minimal and sufficient for the same task, then they are the same object up to measurable bijection on their supports.

Plain meaning: the coordinates can change, but the minimal information content is the same.

Why it is here: the framework is not claiming one sacred coordinate chart for the mind.

Step 5.8: Prove sufficient compression

If

$$ Y \perp H_{i,\le t} \mid (z_{i,t},\hat T_i,c_{i,t},w_t,x_t), $$

then the full history can be replaced by the compressed state for prediction.

Plain meaning: if the compressed state blocks any remaining dependence on the raw history, the raw history no longer needs to be carried directly.

Why it is here: it justifies latent-state compression.
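A toy case shows what the conditional independence buys. In this sketch (a hypothetical law, not the framework's model), the outcome depends on the event history only through its running mean, so that mean is a sufficient compressed state:

```python
import numpy as np

# History H: a stream of +/-1 events. Suppose the true law for Y depends on
# the history only through its running mean z -- the compressed state.
def p_y_from_history(history):
    return 1 / (1 + np.exp(-np.mean(history)))   # law applied to raw history

def p_y_from_state(z):
    return 1 / (1 + np.exp(-z))                  # same law, compressed input

h = [1, -1, 1, 1, -1, 1]
z = float(np.mean(h))                            # z_{i,t}: the compressed state

# Given z, the raw history adds nothing to the prediction.
assert p_y_from_history(h) == p_y_from_state(z)
```

When the independence fails (say, the law also reads event order), no such scalar replacement exists; the assumption is doing real work.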

Step 5.9: Prove slow/fast mediation

If the predictive state factors into slow and fast parts and the relevant conditional independence holds, then recent within-window history contributes through the fast state once the slow profile and context are fixed.

Why it is here: it gives the slow/fast split a formal interpretation rather than treating it as architecture taste.

Step 5.10: Prove probe consistency

If two histories give the same predictive state, then under the same proposition they induce the same conditional law for every probe that factors through that state.

Why it is here: probe heads become principled readouts instead of decorative extras.

Step 5.11: State the main theorem

The main theorem says that for each task and horizon there exists a finite-dimensional task-conditioned representation

$$ (\hat T_i,z_{i,t},c_{i,t},w_t,x_t) \longmapsto \hat q_{i,t+1}^{(\tau,\Delta)} \longmapsto (\hat y_{i,t+\Delta}^{(\tau)},\hat a_{i,t+\Delta}^{(\tau)}) $$

with these properties:

  1. best accessible approximation,
  2. evolutionary coherence,
  3. task-equivalence,
  4. recursive simulability,
  5. state advantage,
  6. minimal-state uniqueness,
  7. sufficient compression,
  8. slow/fast mediation,
  9. probe consistency.

Plain meaning: once the framework’s assumptions are accepted, the world model is internally coherent and usable as a predictive system.

Step 5.12: State the corollary on proposition ranking

$$ x \mapsto \operatorname{score}_\theta(x\mid s_{i,t}^{(\tau,\Delta)}). $$

Plain meaning: the model induces an observational ranking over admissible propositions.

Why it is here: the best proposition is well-defined as an argmax of the score whenever the candidate set is finite or the maximum is attained.
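For a finite candidate set the corollary is just an argmax, sketched here with a hypothetical inner-product score standing in for \(\operatorname{score}_\theta\):

```python
import numpy as np

def rank_propositions(score, state, candidates):
    """Observational ranking over admissible propositions, best first.

    For a finite candidate list the argmax is always attained.
    """
    scores = np.array([score(x, state) for x in candidates])
    order = np.argsort(-scores)                      # descending by score
    return [candidates[i] for i in order], candidates[int(scores.argmax())]

# Hypothetical score: inner product between proposition and state encodings.
state = np.array([1.0, 0.0])
cands = [np.array([0.2, 0.9]), np.array([0.8, 0.1]), np.array([-0.5, 0.3])]
ranked, best = rank_propositions(lambda x, s: float(x @ s), state, cands)
assert np.allclose(best, [0.8, 0.1])                 # highest-scoring candidate
```

Note this produces a ranking, not a guarantee: the next step in the text is exactly about that boundary.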

Step 5.13: Keep the causal boundary sharp

The corollary does not prove that following the ranking improves the real world.

Why it is here: observational ranking, off-policy evaluation, and online causal improvement are different epistemic regimes.

Operational summary

For one task, the framework reduces to this recipe:

  1. Build the durable person estimate \(\hat T_i\) from slow features and slow categorical memory.
  2. Build the current fast state \(z_{i,t}\) from recent events and fast categorical memory.
  3. Attach role-context \(c_{i,t}\) and world state \(w_t\).
  4. Encode the candidate proposition \(x_t\).
  5. Predict the next task-relevant latent state.
  6. Decode outcomes and probes.
  7. Update fast state after each event.
  8. Refresh slow state gradually as durable evidence accumulates.
  9. Train against main outcome, probes, and regularization.
  10. Benchmark against simpler baselines.
  11. Rank propositions observationally.
  12. Only claim policy improvement when propensities or experiments support it.
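The state-handling core of the recipe (steps 3 to 8) can be sketched as a single object. Everything here is a placeholder under stated assumptions: linear toy encoders, a random transition matrix, and exponential-moving-average updates whose differing rates mirror the slow/fast split; none of it is the paper's architecture:

```python
import numpy as np

class ObserverState:
    """Minimal sketch of the operational state s = (T_hat, z, c, w)."""

    def __init__(self, dim, slow_rate=0.01, fast_rate=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.T_hat = np.zeros(dim)   # durable person estimate (slow)
        self.z = np.zeros(dim)       # current fast state
        self.W = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # toy transition
        self.slow_rate, self.fast_rate = slow_rate, fast_rate

    def observe(self, event):
        """Steps 7-8: per-event fast update, gradual slow refresh."""
        self.z = (1 - self.fast_rate) * self.z + self.fast_rate * event
        self.T_hat = (1 - self.slow_rate) * self.T_hat + self.slow_rate * event

    def predict(self, x, c, w):
        """Steps 3-5: combine state, context, world, proposition -> next latent."""
        return np.tanh(self.W @ (self.T_hat + self.z + c + w + x))

obs = ObserverState(dim=4)
for _ in range(10):
    obs.observe(np.ones(4))
# After ten identical events the fast state has nearly converged,
# while the slow profile has barely moved -- the intended timescale gap.
assert obs.z[0] > 0.9 and obs.T_hat[0] < 0.2
```

Decoding outcomes and probes (step 6) would attach readout heads to `predict`'s output; training, baselines, and the causal caveat (steps 9 to 12) live outside this object.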

One clean mental model

The framework says:

  • evolution gives a lineage a finite accessible slice of reality,
  • a person realizes that slice in an individual way,
  • recent events activate only part of that person-space,
  • a proposition interacts with that active state,
  • the interaction moves the observer into a new task-relevant state,
  • visible outcomes are readouts from that new state.

That is the whole system in its shortest usable form.