preprint · appendix a · 7 of 7
God’s Infinite Dimensional Space — Step-by-Step Guide
Cheat sheet: significant terms and operations
| Symbol / term | Plain meaning | What the operation does | Why it is here |
|---|---|---|---|
| \(\mathcal N\) | Noumenal arena | The largest space of possible distinctions | Gives the framework a “reality is larger than experience” starting point |
| \(\mathbf n_t\) | Noumenal microstate at time \(t\) | A point in \(\mathcal N\) | Represents the raw state before organism-specific projection |
| \(\mathcal M^{\mathrm{spec}}\) | Species-level accessible subspace | A finite slice of \(\mathcal N\) | Says a lineage can only access some distinctions |
| \(P^{\mathrm{spec}}\) | Species projection | Orthogonally projects raw state into the accessible slice | Formalizes “organisms do not get the whole world” |
| \(E^{\mathrm{spec}}\) | Species coordinate encoder | Converts the projected slice into coordinates | Turns the accessible slice into a workable vector |
| \(\Delta \mathbf v_\perp\) | Novelty residual | Removes what the current species template already captures | Tests whether a mutation adds a new axis |
| \(\Delta \Phi_\tau\) | Net fitness contribution | Expected gain minus cost of keeping a candidate axis | Decides whether an axis is retained |
| \(G_i\) | Inherited template | The repertoire of distinctions person \(i\) could in principle host | Separates lineage structure from individual realization |
| \(T_i\) | Realized individual embedding | The durable structure of one person after history, language, culture, and experience | This is the theoretical person-side object |
| \(\phi_{i,t}\) | Full phenomenal state | Everything live in the person at time \(t\) | Motivating ideal object |
| \(q_{i,t}^{(\tau,\Delta)}\) | Predictive observer-state | Smallest state that preserves the future law for task \(\tau\) and horizon \(\Delta\) | The formal target |
| \(s_{i,t}^{(\tau,\Delta)}\) | Operational state | \((\hat T_i, z_{i,t}, c_{i,t}, w_t)\) | What the benchmark can actually train on |
| \(\chi_{i,t}\) | Chimera | Person-in-role object | Makes context explicit instead of hiding it inside “the person” |
| \(c_{i,t}\) | Role and institution context | Role, regime, local demands | Explains why the same person acts differently in different settings |
| \(w_t\) | World state | Market, account, pressure, or other external state | Keeps the environment separate from the person |
| \(x_t\) | Proposition | The thing hitting the observer now | The object being scored, simulated, or chosen |
| \(Y_{i,t+\Delta}^{(\tau)}\) | Main future outcome | The task target | What the model tries to predict |
| \(A_{i,t+\Delta}^{(m,\tau)}\) | Auxiliary probe | Side target such as objection class or delay bucket | Forces the latent state to carry reusable structure |
| \(\Pi_\tau(T_i,c_{i,t},x_t)\) | Task projection | Selects task-relevant coordinates | Says not all dimensions matter for every task |
| \(a_{i,t}\) | Salience weights | Reweights coordinates elementwise | Captures what is active now |
| \(z_{i,t}=a_{i,t}\odot \Pi_\tau(\cdot)\) | Active fast slice | Uses elementwise multiplication to gate the task coordinates | Produces the current live state for the transition |
| \(m_{i,t}=\sum_j \omega_{ij,t}\mu_{ij}\) | Memory field | Weighted sum of traces | Makes memory computable |
| \(R(\mu_{ij},x_t,c_{i,t},\phi_{i,t})\) | Retrieval rule | Updates relevance of a trace | Explains why the same prompt can work differently later |
| \(\Xi(C,c)\) | Contextual lifting | Retypes categories using context | Avoids false contradictions |
| \(E_{f,s}(c)\) | Category embedding | Maps a discrete token to a dense vector | Standard ML move for categorical data |
| \(u_{i,t}^{(f,s)}\) | Within-event pooled category vector | Averages embedded tokens in one bag | Converts sparse categorical events into fixed-width vectors |
| \(\nu\) | Null vector | Learned stand-in for an empty bag | Keeps empty slots explicit instead of pretending they are zero |
| \(m_{i,t}^{(f,s)}\) | Mask bit | Indicates whether a slot is populated | Separates absence from value |
| \(|\) | Concatenation | Joins vectors end-to-end | Preserves slot identity |
| \(g_i^{\mathrm{slow}}\) | Slow categorical bank | Regime-aware durable pooled categorical memory | Captures what the person is generally like now |
| \(g_{i,t}^{\mathrm{fast},\tau}\) | Fast categorical pool | Task-conditioned recent categorical summary | Captures what is currently active |
| \(E_T(\cdot)\) | Slow encoder | Builds \(\hat T_i\) from durable information | Produces the slow person embedding |
| \(U_\theta(\cdot)\) | Fast update rule | Updates fast state from new events | Makes the model sequential |
| \(E_o^{(\tau)}(\cdot)\) | Observer encoder | Packs slow state, fast state, context, and world into one task representation | Creates the observer-side object for interaction |
| \(E_p^{(\tau)}(x_t)\) | Proposition encoder | Encodes the proposition into the same task space | Makes propositions comparable with the observer-side state |
| \(\Psi_\tau(o,p)\) | Interaction operator | Combines observer and proposition encodings | Represents “what happens when this proposition hits this observer” |
| \(G_\theta(\cdot)\) | Transition map | Predicts the next latent predictive state | Core world-model step |
| \(R_0,R_m\) | Readout heads | Decode outcomes and probes from the latent state | Converts hidden state into measurable outputs |
| \(\Delta_\tau(f)\) | Feature contribution | Performance with a feature family minus performance without it | Decides whether a feature family stays |
| \(\mathcal L_\tau\) | Training objective | Adds main loss, probe losses, and regularization | Defines what gradient descent is minimizing |
| \(\Omega(\theta)\) | Regularizer | Penalizes overly flexible parameter settings | Keeps the fit from becoming brittle |
| \(\hat T_i \leftarrow (1-\alpha)\hat T_i+\alpha \hat T_i^{\mathrm{new}}\) | Slow EMA refresh | Mixes old durable state with a refreshed durable estimate | Keeps slow state stable |
| \(\operatorname{score}_\theta(x\mid s)\) | Proposition score | Expected task utility if proposition \(x\) is used in state \(s\) | Turns forecasting into ranking |
| \(\arg\max\) | Best choice | Picks the highest scoring candidate | Formal proposition search |
| \(P(\cdot\mid\cdot)\) | Conditional probability | “Probability of this given that” | Language of sufficiency and prediction |
| \(\mathbb E[\cdot]\) | Expectation | Average predicted value under uncertainty | Needed when scoring uncertain futures |
| \(\perp\) | Conditional independence | Says extra information stops helping once a state is known | Lets the framework define sufficiency and mediation |
| \(\hat V_{\mathrm{IPS}}(\pi)\) | IPS estimate | Reweights logged rewards to estimate a target policy’s value | Separates ranking from policy evaluation |
| LogLoss / Brier / PR-AUC / ECE | Evaluation metrics | Fit, probability error, rare-event ranking, and calibration | Measures whether the model is useful |
The full arc in one line
Read it left to right:
- Start with a world that contains more distinctions than any organism can use.
- Restrict that world to the distinctions a lineage can access.
- Turn the lineage template into an individual template.
- Realize that template in one person.
- Let that person occupy a full momentary state.
- Compress the full state into a task-specific predictive state.
- Approximate that predictive state with something measurable.
- Decode future observable outcomes from it.
Part 0 — Background
Step 0.1: Replace fixed categories with evolved structure
The opening move is simple: organisms do not passively mirror the world; they inherit a structured way of carving it up.
Why this matters: it turns the appearance of reality into something that can be modeled as a built structure rather than a raw copy of an external world.
Step 0.2: Treat interpretation and response as one process
The framework treats perception, interpretation, understanding, and action as one continuous state-transition process.
Why this matters: once they are written in one format, the same algebra can describe seeing, feeling, remembering, and acting.
Part 1 — Specifying the Area of Interest
Step 1.1: Represent change as vectors
The framework starts by treating observable changes as vectors.
Plain meaning: instead of handling vision, memory, and action as unrelated substances, it writes them as coordinates in a shared space.
Step 1.2: Sample a continuous stream into modelable states
Reality is continuous, but the model uses snapshots.
Plain meaning: the flow is continuous in life, but discrete time makes the math and the benchmark possible.
Step 1.3: Distinguish noumenal vectors from phenomenal vectors
Noumenal vectors are raw distinctions available before the organism has organized them; phenomenal vectors are reality as it appears after the organism’s inherited structure processes them.
Why this matters: the framework keeps raw physical distinctions separate from lived experience.
Step 1.4: Introduce the inherited seed
The inherited seed is the lineage-fixed structure that determines what kinds of distinctions the organism can even register.
Why this matters: the seed explains why an organism gets one kind of world rather than all possible worlds.
Step 1.5: Use a toy sequence to show state transition
The early toy example turns one vector sequence into another by inserting intermediate internal states.
What is happening mathematically: one structured vector recruits others in the same larger state space, and later states inherit or modify earlier coordinates.
Why it is here: it gives an intuitive picture of how a current input can activate affect, action, and an updated object state without changing notation.
Step 1.6: Define the universal arena
Plain meaning: \(\mathcal N\) is the largest coordinate system the framework will allow, and \(\mathbf n_t\) is a raw state inside it.
What the operation does: it writes a raw state as a sum of basis directions with weights.
Why it is here: without a large ambient space, there is nowhere to place distinctions that a lineage does not access.
Step 1.7: Define the species-level accessible slice
Plain meaning: evolution preserves a finite set of useful axes.
What the operation does: \(\operatorname{span}\) says every accessible state is a linear combination of those selected axes.
Why it is here: it formalizes the idea that a lineage experiences only a finite, useful slice of a much larger arena.
Step 1.8: Project raw state into that slice
Plain meaning: keep the part of the raw world that aligns with the lineage’s accessible axes.
What the operation does: the inner products \(\langle \mathbf v_i,\mathbf n\rangle\) measure how much of \(\mathbf n\) lies along each accessible direction, then rebuild the accessible component from those amounts.
Why it is here: it makes “accessible world” a real projection rather than a metaphor.
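As a concrete illustration, the projection step can be sketched numerically. This is a minimal sketch with a toy orthonormal basis; the function name and dimensions are illustrative, not from the paper.

```python
import numpy as np

# Sketch of P^spec: keep the part of the raw state that aligns with the
# lineage's accessible axes, assuming an orthonormal basis for the slice.
def project_accessible(n, basis):
    """Project raw state n onto span(basis) via inner products <v_i, n>."""
    coeffs = [np.dot(v, n) for v in basis]            # how much of n lies along each axis
    return sum(c * v for c, v in zip(coeffs, basis))  # rebuild the accessible component

# Toy arena: 3-D noumenal space, 2-D accessible slice (the xy-plane).
basis = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
n_raw = np.array([2.0, 3.0, 5.0])
n_acc = project_accessible(n_raw, basis)  # the z-axis distinction is dropped
```

The inaccessible coordinate simply vanishes from the projected state, which is the geometric content of "organisms do not get the whole world."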
Step 1.9: Encode the projected slice as coordinates
Plain meaning: convert the accessible slice into a finite vector.
What the operation does: the encoder takes the coefficients of the projected state along the accessible axes.
Why it is here: later models need coordinates, not just abstract subspaces.
Step 1.10: Test whether evolution added a genuinely new axis
Plain meaning: subtract what the current species template already explains.
What the operation does: it removes the old component of a candidate mutation, leaving only the genuinely new part.
Why it is here: this is the novelty test.
If the residual is zero, the candidate distinction is redundant. If the residual is nonzero, the candidate adds something new.
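The novelty test above can be sketched in the same toy setting; the function name is illustrative, and the basis is again assumed orthonormal.

```python
import numpy as np

# Sketch of the novelty residual: remove the component of a candidate
# mutation that the current accessible subspace already explains.
def novelty_residual(dv, basis):
    explained = sum(np.dot(v, dv) * v for v in basis)  # old component
    return dv - explained                              # genuinely new part

basis = [np.array([1.0, 0.0, 0.0])]                    # current species axes
redundant = novelty_residual(np.array([3.0, 0.0, 0.0]), basis)  # in-span
novel = novelty_residual(np.array([0.0, 2.0, 0.0]), basis)      # new axis
```

A zero residual means the candidate distinction is redundant; a nonzero residual is the part that could become a new axis.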
Step 1.11: Keep the axis only if gain beats cost
$$\Delta \Phi_\tau = \mathbb E_e\Big[ W\big(e,\mathcal M_{\tau}^{\mathrm{spec}} \oplus \operatorname{span}\{\widehat{\Delta \mathbf v}_{\perp}\}\big) - W\big(e,\mathcal M_{\tau}^{\mathrm{spec}}\big) \Big] - C(\widehat{\Delta \mathbf v}_{\perp}). $$
Plain meaning: an axis stays only if it helps more than it costs.
Harder terms:
- \(\mathbb E\) means average over environments the lineage encounters.
- \(W(e,\mathcal M)\) is expected reproductive value in environment \(e\) if the lineage has accessible subspace \(\mathcal M\).
- \(C(\cdot)\) is maintenance cost: energy, wiring, false positives, and related burdens.
- \(\oplus\) means “add a new independent direction to the current space.”
Why it is here: it gives the species template a retention rule instead of treating it as a mysterious gift.
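The retention rule can be sketched as a simple expectation-minus-cost check. The fitness functions and numbers below are toy stand-ins, not the paper's actual \(W\) and \(C\).

```python
# Sketch of ΔΦ > 0: an axis is retained only when its expected fitness
# gain, averaged over environments, exceeds its maintenance cost.
def net_fitness_gain(envs, probs, W_with, W_without, cost):
    gain = sum(p * (W_with(e) - W_without(e)) for e, p in zip(envs, probs))
    return gain - cost

envs, probs = ["dry", "wet"], [0.5, 0.5]       # environments and their frequencies
keep = net_fitness_gain(envs, probs,
                        W_with=lambda e: 1.2,   # toy fitness with the new axis
                        W_without=lambda e: 1.0,  # toy fitness without it
                        cost=0.1) > 0           # retain iff net gain is positive
```

Here the expected gain (0.2) beats the cost (0.1), so the axis stays.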
Part 2 — Deriving the Transcendental Embedding
Step 2.1: Separate five different objects
Part 2 splits one overloaded phrase into five levels:
- inherited template,
- realized individual embedding,
- full phenomenal state,
- task-conditioned predictive state,
- measurable estimate.
Why it matters: these are not the same object, and the math stays cleaner once they are separated.
Step 2.2: Define the person-in-role object
Plain meaning: model a person as a person plus their active role-context.
What the operation does: it forms a tuple.
Why it is here: the same person can produce different outputs under different roles without becoming a different person.
Step 2.3: Use psychometrics as a prior, not a full ontology
Plain meaning: factor scores are useful summaries, but they are only one input channel.
Why it is here: standardized summaries help with approximation, but they do not replace state, memory, proposition, or transition.
Step 2.4: Project the person down to what matters for this task
Plain meaning: for a given task, only some coordinates matter.
What the operation does: it chooses the task-relevant slice of the person given the current context and proposition.
Why it is here: the framework is task-specific, not universally maximal at every step.
Step 2.5: Weight that slice by salience
Plain meaning: even within the task-relevant slice, some coordinates are live and some are quiet.
Harder term:
- \(\odot\) is elementwise multiplication. If one coordinate has salience \(0\), it is suppressed. If it has salience \(1\), it passes through unchanged. Values in between partially gate it.
Why it is here: it turns a broad person representation into a current active state.
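The gating step \(z_{i,t}=a_{i,t}\odot \Pi_\tau(\cdot)\) is a one-liner in code; the vectors below are toy values.

```python
import numpy as np

# Sketch of salience gating: elementwise multiplication suppresses quiet
# coordinates of the task-relevant slice and passes live ones through.
task_slice = np.array([0.8, -0.5, 1.2])  # Π_τ(T_i, c, x): task coordinates
salience = np.array([1.0, 0.0, 0.5])     # a: fully live, quiet, partially live
z = salience * task_slice                # ⊙: the active fast slice
```

The second coordinate is suppressed entirely, the third is half-gated, which is exactly the "live vs quiet" distinction in the text.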
Step 2.6: Distinguish full state, predictive state, and measurable state
Plain meaning: once \(q_{i,t}^{(\tau,\Delta)}\) is known, the rest of the past adds nothing for predicting the task outcome under the same proposition.
Harder term:
- A conditional probability \(P(A\mid B)\) means “probability of \(A\) once \(B\) is known.”
- Sufficiency here means the state keeps all the information the future still needs.
Why it is here: this is the central definition of the predictive observer-state.
Then the measurable approximation is \(s_{i,t}^{(\tau,\Delta)}=(\hat T_i, z_{i,t}, c_{i,t}, w_t)\).
Plain meaning: this is the version the benchmark can actually construct from data.
Step 2.7: Model memory as weighted traces
Plain meaning: memory is treated as stored traces with time-varying weights.
Harder terms:
- \(\mu_{ij}\) is trace \(j\).
- \(\omega_{ij,t}\) is how relevant or active that trace is at time \(t\).
Why it is here: it gives the framework a concrete way to represent persistence and reactivation.
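The memory field \(m_{i,t}=\sum_j \omega_{ij,t}\mu_{ij}\) is directly computable; traces and weights below are toy values.

```python
import numpy as np

# Sketch of the memory field: stored traces μ_j combined under
# time-varying relevance weights ω_j.
traces = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # μ_j: stored traces
weights = [0.9, 0.1]                                   # ω_j at time t
m = sum(w * mu for w, mu in zip(weights, traces))      # current memory field
```

Changing the weights over time, without touching the traces, is what lets the same stored history surface differently at different moments.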
Step 2.8: Let the current event change retrieval
Plain meaning: new input changes which old traces matter.
Why it is here: the same message can land differently after prior experiences.
Step 2.9: Let learning update the memory field
Plain meaning: memory is not static; each event changes the future state space.
Why it is here: it explains why sequence models should matter.
Step 2.10: Contextually lift categories before pooling them
Plain meaning: raw categories are retyped with context before they are compared.
Example: “aggressive” in self-interest and “aggressive” in out-group treatment may not be the same fact.
Why it is here: it prevents the model from averaging away real structure or inventing contradictions too early.
Step 2.11: Pool categories inside one event
Plain meaning: embed the event’s categorical tokens, average them, and use a learned null vector plus a mask bit when the bag is empty.
Harder terms:
- \(E_{f,s}(c)\) is an embedding lookup: it turns a token into a trainable dense vector.
- \(\mathbf 1\{\cdot\}\) is an indicator: it equals 1 when the condition is true and 0 otherwise.
Why it is here: categorical traces are abundant in real logs, and this makes them usable without flattening them into brittle one-hot tables.
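The pooling rule can be sketched directly; the embedding table and null vector below are toy values standing in for trained parameters.

```python
import numpy as np

# Sketch of within-event pooling: embed each categorical token, average
# the bag, and fall back to a learned null vector ν plus a mask bit of 0
# when the bag is empty.
emb = {"calm": np.array([1.0, 0.0]), "bold": np.array([0.0, 1.0])}
null_vec = np.array([-1.0, -1.0])  # ν: explicit stand-in for an empty bag

def pool_bag(tokens):
    if not tokens:                        # empty slot: ν and mask m = 0
        return null_vec, 0
    vecs = [emb[t] for t in tokens]
    return np.mean(vecs, axis=0), 1       # pooled vector u and mask m = 1

u_full, m_full = pool_bag(["calm", "bold"])
u_empty, m_empty = pool_bag([])
```

The mask bit keeps "no evidence" distinct from "evidence with value zero," which is the point of the null-vector construction.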
Step 2.12: Preserve slot identity
Plain meaning: keep family and source channels separate when you build the event-level categorical representation.
Why it is here: “biography said X” and “behavior showed X” are not the same kind of evidence.
Step 2.13: Build slow categorical memory
Plain meaning: average old categorical events inside the same regime, but weight them by importance.
Harder terms:
- \(\rho\) is the regime or role bucket.
- \(\beta^{\mathrm{slow}}\) controls how much each past event contributes.
Why it is here: the slow bank is meant to capture durable person structure, not just the last touch.
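A minimal sketch of the slow bank, assuming events have already been pooled to vectors and tagged with a regime \(\rho\) and an importance weight \(\beta\); the data layout is illustrative.

```python
import numpy as np

# Sketch of the slow categorical bank: an importance-weighted average of
# past pooled event vectors within one regime.
events = {  # regime ρ -> list of (pooled event vector, importance β)
    "work": [(np.array([1.0, 0.0]), 3.0), (np.array([0.0, 1.0]), 1.0)],
}

def slow_bank(regime):
    vecs, betas = zip(*events[regime])
    return np.average(np.stack(vecs), axis=0, weights=np.array(betas))

g_slow = slow_bank("work")  # durable summary for the "work" regime
```

The heavily weighted event dominates the durable summary, which is how the bank captures "what the person is generally like" rather than the last touch.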
Step 2.14: Build fast categorical memory
Plain meaning: create a task-conditioned summary of recent categorical history.
Why it is here: short-horizon prediction usually depends on what is currently active, not just what is durable.
Step 2.15: Define minimality
A minimal sufficient state \(q_{i,t}^{(\tau,\Delta)}\) can be recovered by some measurable map \(f\), \(q_{i,t}^{(\tau,\Delta)}=f(r)\), for every other sufficient state \(r\).
Plain meaning: a minimal sufficient state is one that every other sufficient state can be reduced to.
Why it is here: “sufficient” alone could mean dragging the full archive forever; minimality asks for the smallest useful state.
Step 2.16: Split the measurable state into slow and fast pieces
Plain meaning: the benchmark approximation has a durable person part and a rapidly updating local part.
Why it is here: durable traits and recent events change on different timescales.
Step 2.17: Define the realized embedding and its first estimate
Plain meaning: the realized person is the inherited template filtered through language, culture, and weighted life events.
Then the first operational estimate is \(\hat T_i\), built by the slow encoder \(E_T(\cdot)\) from durable information.
Plain meaning: start with a weighted combination of person summaries, life history, and slow categorical memory.
Why it is here: this is the bridge from theory to something that can be computed.
Step 2.18: Use the local state to predict the next predictive state
Plain meaning: once the slow person state, fast local state, context, world, and proposition are known, predict the next task-relevant state.
Why it is here: this is the handoff into the world model.
Part 3 — Application: Predicting How People Behave
Step 3.1: Keep the ideal transition law, but do not train against it directly
Plain meaning: the motivating ideal is still the next phenomenal state.
Why it is here: it says what the framework is aiming at, even though the benchmark cannot observe \(\phi_{i,t}\) directly.
Step 3.2: Introduce a task projection from a large ambient space
Plain meaning: each task only needs a finite slice of the larger representational arena.
Why it is here: it keeps the framework open-ended without requiring infinite computation.
Step 3.3: Define the operational transition law
Plain meaning: the trainable model predicts the next predictive state from the operational state and proposition.
Why it is here: this is the model actually learned from data.
Step 3.4: Decompose the world model into encoder, interaction, and decoder
Plain meaning:
- encode the observer-side state,
- encode the proposition,
- let them interact,
- decode the next predictive state.
Why it is here: it separates representation from interaction and makes the architecture modular.
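The modular decomposition can be sketched with placeholder linear maps. Everything here is illustrative: the weights are random, the interaction is an elementwise gate, and the dimensions are arbitrary; only the encoder/interaction/transition structure comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

W_o = rng.normal(size=(4, 6))   # observer encoder weights (toy)
W_p = rng.normal(size=(4, 3))   # proposition encoder weights (toy)
W_g = rng.normal(size=(4, 4))   # transition weights (toy)

def encode_observer(s):    return W_o @ s           # E_o^(τ): pack state into task space
def encode_proposition(x): return W_p @ x           # E_p^(τ): same task space
def interact(o, p):        return o * p             # Ψ_τ: toy elementwise interaction
def transition(h):         return np.tanh(W_g @ h)  # G_θ: next latent predictive state

s = rng.normal(size=6)   # stand-in for the packed (T̂, z, c, w)
x = rng.normal(size=3)   # stand-in for the proposition
q_next = transition(interact(encode_observer(s), encode_proposition(x)))
```

Because each piece is a separate function, any one of them can be replaced without touching the others, which is the modularity the step is claiming.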
Step 3.5: Decode visible consequences from the latent state
Plain meaning: outcomes and probe labels are readouts from the predicted next state, not the state itself.
Why it is here: a reply, a meeting, or an objection class is a measurable residue of an underlying transition.
Step 3.6: Define task-equivalence of propositions
Plain meaning: if two propositions look the same in the task-relevant projection, they should produce the same next-state prediction.
Why it is here: this is the paper’s formal notion of composability.
Step 3.7: Keep only useful feature families
Plain meaning: a feature family stays only if it improves the task.
Why it is here: the framework is meant to be discoverable and revisable, not fixed in advance.
Step 3.8: Define the world model and its rollout
Plain meaning: a one-step predictor becomes a simulator once you apply it repeatedly.
Why it is here: proposition choice is not only about one immediate readout; it can change future trajectories.
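The rollout idea is just repeated application of the one-step map; the scalar dynamics below are a toy stand-in for \(G_\theta\).

```python
# Sketch of a rollout: a one-step predictor becomes a simulator once it
# is applied repeatedly along a sequence of propositions.
def step(state, proposition):
    return 0.5 * state + proposition  # toy stand-in for G_θ

def rollout(state, propositions):
    trajectory = [state]
    for x in propositions:
        state = step(state, x)
        trajectory.append(state)
    return trajectory

traj = rollout(1.0, [0.0, 0.0, 0.0])  # state decays toward zero under null input
```

A different proposition sequence would produce a different trajectory, which is why proposition choice affects futures, not just the next readout.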
Step 3.9: Train with main loss, probe losses, and regularization
$$\mathcal L_\tau(\theta) = \mathcal L_{\mathrm{main}} + \sum_{m=1}^{M}\lambda_m \mathcal L_{\mathrm{probe},m} + \lambda_{\mathrm{reg}}\Omega(\theta). $$
Plain meaning: optimize the main task, auxiliary probes, and weight control together.
Harder terms:
- A loss is a number that gets smaller when the model improves.
- \(\lambda_m\) and \(\lambda_{\mathrm{reg}}\) control how much the probe terms and regularizer matter relative to the main task.
- Regularization keeps a model from fitting noise too aggressively.
Why it is here: it encourages the latent state to carry more than one narrow signal.
Then update by gradient descent: \(\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal L_\tau(\theta)\), where \(\eta\) is the step size.
Plain meaning: compute how the loss changes with respect to the parameters and step in the direction that lowers it.
Step 3.10: Rank propositions by expected task utility
Plain meaning: score each admissible proposition by its expected downstream value.
Harder term:
- The expectation \(\mathbb E_\theta[\cdot]\) is an average over what the model predicts could happen.
Why it is here: prediction becomes decision support once propositions are scored.
Then search for the best candidate over the admissible set: \(x_t^\star = \arg\max_{x}\operatorname{score}_\theta\big(x \mid s_{i,t}^{(\tau,\Delta)}\big)\).
Plain meaning: pick the candidate with highest score among the allowed options.
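The score-then-argmax loop is simple to sketch; the score function below is a toy stand-in for the learned \(\operatorname{score}_\theta\).

```python
# Sketch of proposition search: score each admissible candidate in the
# current state and pick the argmax.
def score(x, state):
    return -abs(x - state)  # toy: higher when the proposition matches the state

state = 0.7
candidates = [0.0, 0.5, 1.0]                       # admissible propositions
best = max(candidates, key=lambda x: score(x, state))
```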
Part 4 — Benchmarking the World Model
Step 4.1: Make the state hierarchy explicit
Plain meaning: the benchmark only has access to the third object.
Why it is here: it keeps the benchmark honest.
Step 4.2: Build the dataset around event-time prediction
Plain meaning: every row is a person snapshot with history, context, world, proposition, and future labels.
Why it is here: the task is to predict future transition from current state plus proposition.
Step 4.3: Encode the event stream explicitly
Plain meaning: each event carries the proposition, delay, response, action, memory proxy, and raw categorical shock.
Why it is here: the model needs both structured sequence data and categorical trace data.
Step 4.4: Benchmark against baselines of increasing strength
The benchmark tests:
- frequency only,
- current-touch only,
- static tabular features,
- shallow history summaries,
- two-tower recommendation style,
- monolithic sequence modeling.
Why it is here: a slow/fast latent-state model should only survive if it beats or matches simpler alternatives in a meaningful way.
Step 4.5: Define the proposed latent-state benchmark model
Plain meaning: estimate durable person state, initialize fast state, update fast state with each event, and predict the next task-relevant state for each candidate proposition.
Step 4.6: Use the corrected training objective
$$\mathcal L_\tau(\theta) = \mathcal L_{\mathrm{main}} + \sum_{m=1}^{M}\lambda_m \mathcal L_{\mathrm{probe},m} + \lambda_{\mathrm{reg}} \Omega(\theta). $$
Plain meaning: all three pieces are added because the optimizer is minimizing the objective.
Why it is here: probe losses should be reduced, not increased, and regularization should discourage unstable fits, not reward them.
Step 4.7: Update slow and fast state on different timescales
Fast update: apply \(U_\theta(\cdot)\) to move the fast state with each new event.
Slow update: \(\hat T_i \leftarrow (1-\alpha)\hat T_i+\alpha \hat T_i^{\mathrm{new}}\).
Plain meaning: recent events can move the fast state immediately, but durable identity should drift slowly toward a refreshed durable estimate.
Harder term:
- This is an exponential moving average. A small \(\alpha\) means the slow state changes gradually.
Why it is here: without this split, one loud event can rewrite the whole person-side representation too aggressively.
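The slow EMA refresh is one line of arithmetic; the vectors below are toy values.

```python
import numpy as np

# Sketch of the slow EMA refresh: with a small α, the durable estimate
# drifts only slightly toward the refreshed estimate, so one loud event
# cannot rewrite the person-side representation.
alpha = 0.1
T_hat = np.array([1.0, 1.0])  # current durable estimate
T_new = np.array([0.0, 2.0])  # refreshed durable estimate
T_hat = (1 - alpha) * T_hat + alpha * T_new  # slow EMA refresh
```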
Step 4.8: Separate ranking from policy evaluation
The same score function can be used in three regimes:
- Observational ranking: rank candidate propositions under the learned simulator.
- Off-policy evaluation: estimate how a target policy would have done using logged propensities.
- Online policy improvement: test and improve the policy under controlled experimentation.
Why it is here: ranking alone is not causal control.
Step 4.9: Use IPS only when logged propensities exist
Plain meaning: upweight cases where the historical policy was unlikely to choose the action the target policy would have chosen.
Harder terms:
- \(e_t=\mu(x_t\mid s_{i,t}^{(\tau,\Delta)})\) is the behavior policy’s logged propensity.
- \(r_t\) is realized reward.
- If the target policy is stochastic rather than deterministic, the indicator is replaced by the importance ratio \(\pi(x_t\mid s)/\mu(x_t\mid s)\).
Why it is here: it gives a principled bridge from logged data to policy-value estimates.
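For a deterministic target policy, the IPS estimate can be sketched as follows; the log format and names are illustrative.

```python
# Sketch of IPS for a deterministic target policy π: each logged reward
# is kept only when π would have chosen the logged action, and is then
# reweighted by the inverse of the behavior policy's logged propensity e_t.
def ips_value(logs, target_policy):
    """logs: list of (state, action, reward, propensity) tuples."""
    total = 0.0
    for s, a, r, e in logs:
        indicator = 1.0 if target_policy(s) == a else 0.0
        total += indicator * r / e
    return total / len(logs)

logs = [("s1", "A", 1.0, 0.5),  # behavior policy picked A half the time
        ("s1", "B", 0.0, 0.5)]
v_hat = ips_value(logs, target_policy=lambda s: "A")
```

For a stochastic target policy the indicator would be replaced by the importance ratio \(\pi(x\mid s)/\mu(x\mid s)\), as noted above.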
Step 4.10: Split train, validation, and test over time
Plain meaning: future rows must stay in the future.
Why it is here: random row splits leak information in sequence problems.
Step 4.11: Evaluate both forecast quality and calibration
Plain meaning:
- LogLoss checks probabilistic fit,
- Brier checks squared probability error,
- PR-AUC checks rare-event ranking,
- ECE checks calibration.
Why it is here: a useful decision model must rank well and produce believable probabilities.
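Two of these metrics are simple enough to compute by hand; the predictions and labels below are toy values.

```python
import math

# Sketch of log loss and Brier score for binary outcomes:
# log loss penalizes confident wrong probabilities sharply,
# Brier score is the mean squared probability error.
preds = [0.9, 0.2, 0.8]   # predicted probabilities of the positive class
labels = [1, 0, 1]        # realized outcomes

logloss = -sum(math.log(p) if y else math.log(1 - p)
               for p, y in zip(preds, labels)) / len(preds)
brier = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds)
```

PR-AUC and ECE additionally need ranking over many examples and probability binning, so they are omitted from this toy sketch.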
Step 4.12: Use ablations to test whether the decomposition is real
The benchmark removes:
- \(z_{i,t}\),
- \(\hat T_i\),
- salience-weighted pooling,
- source/regime separation,
- probe heads,
- or the explicit slow/fast structure itself.
Why it is here: if performance does not move when these are removed, those pieces were decorative.
Step 4.13: Define success clearly
Success on forecasting means better held-out log loss and Brier score than the best baseline across more than one horizon.
Success on proposition selection means better off-policy value or live lift than a baseline policy, when the data regime actually supports that claim.
Part 5 — Axioms, Lemmas, and Main Theorem
Step 5.1: State the axioms
Part 5 assumes:
- an ambient space exists,
- individuals instantiate inherited structure,
- a task-conditioned predictive state exists,
- the operational slow/fast state approximates that predictive state,
- transition factors through the task-relevant proposition encoding,
- evolutionary retention uses a positive-gain rule,
- auxiliary probes also factor through predictive state.
Why it is here: theorems need fixed primitives.
Step 5.2: Prove the accessible slice is the best accessible approximation
Plain meaning: among all states inside the accessible subspace, the projection is the closest one to the raw noumenal state.
Harder term:
- \(\arg\min\) means “the value that makes the quantity as small as possible.”
Why it is here: projection now has a precise geometric meaning.
Step 5.3: Prove the novelty test is exact
If \(\Delta \mathbf v\) already lies in the current accessible space, its residual is zero. If the residual is nonzero and its net gain is positive, adding it increases the evolutionary objective.
Why it is here: new axes are not added by intuition; they are added by residual novelty plus positive value.
Step 5.4: Prove task-equivalent propositions induce the same next-state prediction
If two propositions have the same task-relevant projected encoding, then the transition map gives the same predicted next state.
Why it is here: this is the formal version of composability.
Step 5.5: Prove rollout is well-defined
Applying the same deterministic one-step world model repeatedly defines a unique finite-horizon rollout.
Why it is here: a one-step transition law becomes a true world model once it can simulate trajectories.
Step 5.6: Prove explicit history can help
Under log loss, the Bayes-optimal risk is \(H(Y\mid X)\), conditional entropy. Under Brier score for binary outcomes, the Bayes-optimal risk is \(\mathbb E[\operatorname{Var}(Y\mid X)]\), expected conditional variance.
Plain meaning: giving the model more genuinely useful history cannot make the optimal predictor worse.
Why it is here: this justifies explicit state and sequence modeling when the future depends on the past.
Step 5.7: Prove minimal sufficient state is unique up to reparameterization
If two predictive states are both minimal and sufficient for the same task, then they are the same object up to measurable bijection on their supports.
Plain meaning: the coordinates can change, but the minimal information content is the same.
Why it is here: the framework is not claiming one sacred coordinate chart for the mind.
Step 5.8: Prove sufficient compression
If the task outcome is conditionally independent of the raw history once the compressed state and the current proposition are known, then the full history can be replaced by the compressed state for prediction.
Plain meaning: if the compressed state blocks any remaining dependence on the raw history, the raw history no longer needs to be carried directly.
Why it is here: it justifies latent-state compression.
Step 5.9: Prove slow/fast mediation
If the predictive state factors into slow and fast parts and the relevant conditional independence holds, then recent within-window history contributes through the fast state once the slow profile and context are fixed.
Why it is here: it gives the slow/fast split a formal interpretation rather than treating it as architecture taste.
Step 5.10: Prove probe consistency
If two histories give the same predictive state, then under the same proposition they induce the same conditional law for every probe that factors through that state.
Why it is here: probe heads become principled readouts instead of decorative extras.
Step 5.11: State the main theorem
The main theorem says that for each task and horizon there exists a finite-dimensional task-conditioned representation
with these properties:
- best accessible approximation,
- evolutionary coherence,
- task-equivalence,
- recursive simulability,
- state advantage,
- minimal-state uniqueness,
- sufficient compression,
- slow/fast mediation,
- probe consistency.
Plain meaning: once the framework’s assumptions are accepted, the world model is internally coherent and usable as a predictive system.
Step 5.12: State the corollary on proposition ranking
Plain meaning: the model induces an observational ranking over admissible propositions.
Why it matters: the best proposition is well-defined as an argmax of the score whenever the candidate set is finite or the maximum is attained.
Step 5.13: Keep the causal boundary sharp
The corollary does not prove that following the ranking improves the real world.
Why it is here: observational ranking, off-policy evaluation, and online causal improvement are different epistemic regimes.
Operational summary
For one task, the framework reduces to this recipe:
- Build the durable person estimate \(\hat T_i\) from slow features and slow categorical memory.
- Build the current fast state \(z_{i,t}\) from recent events and fast categorical memory.
- Attach role-context \(c_{i,t}\) and world state \(w_t\).
- Encode the candidate proposition \(x_t\).
- Predict the next task-relevant latent state.
- Decode outcomes and probes.
- Update fast state after each event.
- Refresh slow state gradually as durable evidence accumulates.
- Train against main outcome, probes, and regularization.
- Benchmark against simpler baselines.
- Rank propositions observationally.
- Only claim policy improvement when propensities or experiments support it.
One clean mental model
The framework says:
- evolution gives a lineage a finite accessible slice of reality,
- a person realizes that slice in an individual way,
- recent events activate only part of that person-space,
- a proposition interacts with that active state,
- the interaction moves the observer into a new task-relevant state,
- visible outcomes are readouts from that new state.
That is the whole system in its shortest usable form.