preprint · part 4 · 5 of 7
Part 4: Benchmarking the World Model
Part 3 supplied a framework. Part 4 makes it answerable to data.
We’ve argued so far that an observer can be represented as a transcendental embedding, that this embedding can be approximated from outward traces, and that propositions can be represented in the same formal arena as the observer. But any framework that does not specify what counts as state, what data instantiates that state, what task is being predicted, what stronger baselines it must beat, and what evidence would justify proposition optimization is nascent and not really worth anyone’s time. This section deals with benchmarking.
The purpose of Part 4 is not to prove the full metaphysical claim directly. It is to ask a narrower question: if we represent a human in role (the Chimera) as a slow embedding plus a fast latent state, do we predict observable transitions better than simpler models that use only static profile data, current-touch features, or history summaries? And if we later use that model to choose messages, do we have the intervention machinery to say something more than “the simulator liked this one”?
4.1 Operational Definition of State
Philosophically, the state of the organism is total. It includes perception, interoception, memory, action tendency, and the actions already underway. But benchmarks do not get to be mystical.
Let
$$ S_{i,t} $$
denote the full phenomenal state of person \(i\) at time \(t\). That object is not directly observed, so the benchmark requires two additional definitions.
First, define the predictive observer-state \(q_{i,t}^{(\tau,\Delta)}\) for task \(\tau\) and horizon \(\Delta\) by the sufficiency condition
$$ P\big(y_{i,t+\Delta}^{(\tau)} \mid S_{i,t}\big) = P\big(y_{i,t+\Delta}^{(\tau)} \mid q_{i,t}^{(\tau,\Delta)}\big), $$
where \(y_{i,t+\Delta}^{(\tau)}\) is the task-\(\tau\) outcome at horizon \(\Delta\): the predictive observer-state keeps exactly the information in the full state that matters for this task at this horizon.
Second, define the operational state used in the dataset as the measurable approximation
$$ s_{i,t}^{(\tau,\Delta)} = \big(\hat T_i,\; z_{i,t},\; c_{i,t},\; w_t\big), $$
where \(\hat T_i\) is the slow estimated embedding of the person, \(z_{i,t}\) is the fast latent state inferred from recent interaction history, \(c_{i,t}\) is the current role-and-institution context, and \(w_t\) is the relevant world state.
So the hierarchy is now explicit: phenomenal, then predictive, then operational,
$$ S_{i,t} \;\longrightarrow\; q_{i,t}^{(\tau,\Delta)} \;\longrightarrow\; s_{i,t}^{(\tau,\Delta)}. $$
The current proposition \(x_t\) remains separate from the state, because the benchmark asks how a particular proposition changes the next state. Here the slow categorical bank captures what the person is generally like at this stage, while the fast categorical pool captures which discrete dispositions are currently active under the present regime.
The ideal transition is still the map on full phenomenal states,
$$ S_{i,t} \;\mapsto\; S_{i,t+1}, $$
but the measurable benchmark version is the corresponding map on operational states,
$$ s_{i,t}^{(\tau,\Delta)} \;\mapsto\; s_{i,t+1}^{(\tau,\Delta)}, $$
each driven by the presented proposition and the ambient world.
This preserves the paper’s claim that the true target is the next phenomenal transition, while admitting that the benchmark must train against observable proxies rather than direct access to qualia.
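As a concrete anchor for the operational state, a minimal Python sketch follows. The field names and the flat concatenation are hypothetical illustrations, not a fixed schema from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OperationalState:
    """Measurable stand-in for the phenomenal state (hypothetical field names)."""
    slow_embedding: np.ndarray   # durable person embedding, updated rarely
    fast_state: np.ndarray       # latent state inferred from recent history
    role_context: dict           # current role-and-institution context
    world_state: dict            # relevant world features

    def as_features(self) -> np.ndarray:
        """Concatenate the numeric parts for a downstream predictor."""
        return np.concatenate([self.slow_embedding, self.fast_state])

s = OperationalState(
    slow_embedding=np.zeros(8),
    fast_state=np.zeros(4),
    role_context={"role": "buyer"},
    world_state={"quarter_end": False},
)
```

The context and world terms stay as raw dictionaries here because how they are featurized depends on what data is actually available.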
4.2 Dataset Construction
For each task \(\tau\), define a dataset of event-time examples,
$$ \mathcal D^{(\tau)} = \Big\{ \big(s_{i,t}^{(\tau,\Delta)},\; x_t,\; y_{i,t+\Delta}^{(\tau)}\big) \Big\}_{i,t}, $$
pairing the operational state and the presented proposition with the logged outcome at horizon \(\Delta\).
Here the person-side features are the bundle
$$ \big(p_i,\; b_i,\; \ell_i,\; r_i,\; h_i,\; g_i^{\mathrm{grp}}\big), $$
where \(p_i\) is any psychometric proxy we can actually infer with a straight face, \(b_i\) is biography, \(\ell_i\) is language and discourse features, \(r_i\) is role and institution history, \(h_i\) is the observable subset of life-history structure recoverable from logs or profiles, and \(g_i^{\mathrm{grp}}\) is firm, account, or longer-run group context. If a coordinate cannot be inferred cleanly from real logs, its slot is masked rather than fabricated.
In addition, construct a slow categorical bank \(g_i^{\mathrm{slow}}\) from long-run source-tagged categorical traces and a fast categorical pool \(g_{i,t}^{\mathrm{fast},\tau}\) from recent history. The slow bank stores what the person is generally like now across regimes; the fast pool stores what is currently active after contextual lifting and weighting.
Let the interaction history be
$$ H_{i,\le t} = \big(o_{i,1}, \dots, o_{i,t}\big), $$
with each event encoded as
$$ o_{i,t} = \big(x_t,\; \delta_t,\; r_t,\; a_t,\; m_t^{\mathrm{obs}},\; e_{i,t}^{\mathrm{cat}}\big), $$
where \(x_t\) is the proposition presented at time \(t\), \(\delta_t\) is the time since the last interaction, \(r_t\) is the observed response bundle, \(a_t\) is the action taken by the agent, \(m_t^{\mathrm{obs}}\) is an observable memory proxy such as resurfaced objection themes, repeated concerns, or revisited product topics, and \(e_{i,t}^{\mathrm{cat}}\) is the source-tagged raw categorical event bundle. This bundle should preserve whether a category came from biography, stated language, observed behavior, or a third-party inference. It can later be aligned; it should not be forced into immediate equivalence.
The distinction between \(e_{i,t}^{\mathrm{cat}}\) and \(g_{i,t}^{\mathrm{fast},\tau}\) matters. The first is the current event-level categorical shock. The second is the separately computed pooled summary of prior categorical history after contextual lifting and weighting. Keeping both lets the model distinguish current content from accumulated state rather than injecting the same object twice under two names.
When available, action-type traces and mere-exposure traces should both be logged. The first often carries sharper salience; the second can still accumulate gradually.
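The exposure-versus-action asymmetry can be made concrete with a toy pooling rule. Everything numeric below — the salience weights, the decay half-life — is an assumed illustration, not a tuned choice from the paper:

```python
# Hypothetical salience weights: decisive actions hit harder than mere exposure.
SALIENCE = {"action": 1.0, "exposure": 0.2}
HALF_LIFE_DAYS = 14.0  # assumed decay half-life for the fast pool

def update_fast_pool(pool, event_categories, event_kind, days_since_last):
    """Decay the pooled categorical weights, then add the new event's categories."""
    decay = 0.5 ** (days_since_last / HALF_LIFE_DAYS)
    pool = {cat: w * decay for cat, w in pool.items()}
    for cat in event_categories:
        pool[cat] = pool.get(cat, 0.0) + SALIENCE[event_kind]
    return pool

pool = {}
# Repeated weak exposure accumulates gradually...
for _ in range(5):
    pool = update_fast_pool(pool, ["pricing"], "exposure", days_since_last=1.0)
# ...while one decisive action lands at full salience in a single step.
pool = update_fast_pool(pool, ["demo_request"], "action", days_since_last=1.0)
```

The point of the toy: five passive exposures accumulate to a nontrivial pooled weight, but a single decisive action still exceeds any single exposure by construction.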
The role-context term \(c_{i,t}\) and the world-state term \(w_t\) stay explicit because the environment matters. Which components we can actually carry depends on what data we can get.
For the first benchmark, the primary outcome bundle should stay narrow and logged: for example, reply, meeting booked, and close events at the chosen horizon.
Now add the auxiliary probe bundle,
$$ a_{i,t+\Delta}^{(\tau)} = \big(a_{i,t+\Delta}^{(1,\tau)}, \dots, a_{i,t+\Delta}^{(M,\tau)}\big). $$
These probes should be domain-specific and extractable from transcripts, email text, CRM status changes, or structured coding. They are not meant to reveal the one true hidden motive of the prospect. They are meant to test whether the latent state carries structured signal that generalizes beyond a single binary target.
So the first dataset should be built from CRM events, email logs, call transcripts, meeting records, account metadata, sender metadata, and company descriptors. If psychometric proxies, firm embeddings, or market-pressure features are unavailable, run the benchmark without them first. Do not invent variables because they sound sophisticated.
Also, a quick aside: I originally mentally modeled the fast and slow factors to help represent the phenomenon of mere exposure, but it evolved into its own thing over time. That origin still matters. Repeated exposure is exactly the kind of effect that should show up in the fast categorical pools before it ever deserves metaphysical inflation, while decisive actions should usually be allowed to hit the local state harder than passive contact alone.
4.3 The Benchmark
The benchmark is simple: does an explicit predictive-state model beat weaker baselines on prediction, and does the explicit slow/fast decomposition beat a generic sequence model that has enough capacity to absorb everything into one black box?
Formally, the claim is this:
If recent interaction history contains predictive information that cannot be reduced to static profile features or the current proposition alone, then a model with an explicit fast latent state \(z_{i,t}\) should outperform baselines that omit that state.
That means the model has to beat the following baselines.
Baseline 0: frequency baseline,
$$ \hat y_{i,t+\Delta}^{(\tau)} = \bar y^{(\tau)}, $$
estimated from training prevalence alone.
Baseline 1: current-touch model,
$$ \hat y_{i,t+\Delta}^{(\tau)} = f_1\big(x_t,\; c_{i,t}\big), $$
for example logistic regression or linear classification using only the current proposition and current context.
Baseline 2: static tabular model,
$$ \hat y_{i,t+\Delta}^{(\tau)} = f_2\big(p_i, b_i, \ell_i, r_i, h_i, g_i^{\mathrm{grp}},\; w_t,\; x_t\big), $$
for example gradient-boosted trees or regularized logistic regression using person, firm, world, and proposition features, but no explicit sequence state.
Baseline 3: shallow-history model,
$$ \hat y_{i,t+\Delta}^{(\tau)} = f_3\big(\mathrm{agg}(H_{i,\le t}),\; x_t,\; c_{i,t}\big), $$
where \(\mathrm{agg}(H_{i,\le t})\) is a hand-built summary such as touch count, last-response delay, reply rate, prior meeting count, or resurfaced topic counts.
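A minimal sketch of what \(\mathrm{agg}(H_{i,\le t})\) might compute. The event keys are hypothetical; a real implementation would pull these fields from CRM and email logs:

```python
import numpy as np

def agg_history(events):
    """Hand-built history summary for the shallow-history baseline.

    Each event is a dict with hypothetical keys:
    'replied' (bool), 'delay_days' (float), 'meeting' (bool), 'topics' (list).
    """
    if not events:
        return np.zeros(5)
    touch_count = len(events)
    reply_rate = sum(e["replied"] for e in events) / touch_count
    last_delay = events[-1]["delay_days"]
    meetings = sum(e["meeting"] for e in events)
    # Count consecutive events that share a topic, as a crude resurfacing proxy.
    resurfaced = sum(
        1 for a, b in zip(events, events[1:])
        if set(a["topics"]) & set(b["topics"])
    )
    return np.array([touch_count, reply_rate, last_delay, meetings, resurfaced],
                    dtype=float)

h = [
    {"replied": True, "delay_days": 2.0, "meeting": False, "topics": ["price"]},
    {"replied": False, "delay_days": 5.0, "meeting": True, "topics": ["price"]},
]
feats = agg_history(h)
```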
Baseline 4: recommender-style two-tower model,
$$ \hat y_{i,t+\Delta}^{(\tau)} = \sigma\big(\langle \phi_{\mathrm{person}},\; \psi(x_t)\rangle\big), $$
where person/account and proposition are embedded separately and scored by dot product or shallow fusion, but no explicit recurrent state is maintained.
Baseline 5: monolithic sequence model,
$$ \hat y_{i,t+\Delta}^{(\tau)} = f_5\big(H_{i,\le t},\; x_t,\; c_{i,t},\; w_t\big), $$
implemented by a generic recurrent, transformer, or state-space sequence model that sees the same event stream but does not enforce an explicit slow/fast decomposition.
This last baseline matters. If a monolithic sequence block with enough capacity eats my lunch, then the decomposition was just a story I told myself after the fact. If the explicit decomposition still wins or ties while being more interpretable, then it has earned the right to stay.
4.4 The Proposed Latent-State Model
The model keeps the slow and fast parts separate. The slow categorical bank is source-aware and regime-aware. The fast categorical pool privileges decisive action traces over mere exposure while still allowing repeated weak exposure to accumulate, and apparent contradictions are contextually lifted before they are treated as unresolved opposition.
First, estimate the slow person-side embedding,
$$ \hat T_i = E_\theta\big(p_i, b_i, \ell_i, r_i, h_i, g_i^{\mathrm{grp}},\; g_i^{\mathrm{slow}}\big), $$
where \(E_\theta\) is the durable person-side encoder.
Second, initialize a fast state,
$$ z_{i,0} = z_{\mathrm{init}}\big(\hat T_i,\; c_{i,0}\big). $$
Third, update the fast state as events occur,
$$ z_{i,t+1} = U_\theta\big(z_{i,t},\; x_{t+1},\; \delta_{t+1},\; r_{t+1},\; a_{t+1},\; m_{t+1}^{\mathrm{obs}},\; e_{i,t+1}^{\mathrm{cat}},\; g_{i,t+1}^{\mathrm{fast},\tau}\big). $$
The active regime determines which mask is presently live, but the person-side object remains one person rather than many separate selves.
When a candidate proposition \(x_{t+1}\) is under consideration, predict the resulting next predictive state by
$$ \hat q_{i,t+1}^{(\tau,\Delta)} = G_\theta\big(s_{i,t}^{(\tau,\Delta)},\; x_{t+1}\big). $$
The primary readout is
$$ \hat y_{i,t+1+\Delta}^{(\tau)} = g_{\mathrm{out}}\big(\hat q_{i,t+1}^{(\tau,\Delta)}\big). $$
The auxiliary probe readouts are
$$ \hat a_{i,t+1+\Delta}^{(m,\tau)} = g_m\big(\hat q_{i,t+1}^{(\tau,\Delta)}\big), \qquad m = 1, \dots, M. $$
When I later write \(\hat a_{i,t+1+\Delta}^{(\tau)}\) or \(\hat a_{i,t+\Delta}^{(\tau)}\) without the probe index, I mean the full auxiliary bundle \((\hat a^{(1,\tau)}, \dots, \hat a^{(M,\tau)})\).
This multi-head structure is there so the latent state does not remain a completely black box. If the state is real in the operational sense, it should carry reusable structure that helps decode more than one downstream observable.
The architecture of \(U_\theta\) and \(G_\theta\) is not fixed by the theory. The first implementation may be a GRU, an LSTM, a state-space model, or a Mamba-like sequential block. That choice is an implementation detail. The theory requires a stateful update. It does not require blind loyalty to one named architecture.
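To make "stateful update" concrete, here is a minimal NumPy GRU-style cell as one possible \(U_\theta\). The dimensions, initialization, and the cell itself are illustrative assumptions; the theory does not privilege this architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class GRUCell:
    """Minimal GRU-style update; one possible U_theta, not the required one."""
    def __init__(self, event_dim, state_dim):
        k = event_dim + state_dim
        self.Wz = rng.normal(0, 0.1, (state_dim, k))  # update gate
        self.Wr = rng.normal(0, 0.1, (state_dim, k))  # reset gate
        self.Wh = rng.normal(0, 0.1, (state_dim, k))  # candidate state

    def step(self, z_prev, event_vec):
        x = np.concatenate([event_vec, z_prev])
        u = sigmoid(self.Wz @ x)                      # how much to update
        r = sigmoid(self.Wr @ x)                      # how much history to reuse
        cand = np.tanh(self.Wh @ np.concatenate([event_vec, r * z_prev]))
        return (1 - u) * z_prev + u * cand            # convex mix keeps z bounded

cell = GRUCell(event_dim=6, state_dim=4)
z = np.zeros(4)
for _ in range(3):                                    # three observed events
    z = cell.step(z, rng.normal(size=6))
```

Swapping this cell for an LSTM, a state-space block, or a Mamba-like block changes nothing in the surrounding benchmark; only the stateful-update contract matters.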
4.5 Training Objective, Update Loop, and Intervention
Start with a simple training objective,
$$ J(\theta) = -\,\mathcal L_{\mathrm{task}}^{(\tau)}(\theta)
\;-\; \sum_{m=1}^{M}\lambda_m \mathcal L_{\mathrm{probe},m}
\;-\; \lambda_{\mathrm{reg}}\, \Omega(\theta), $$
maximized in \(\theta\), or equivalently its negation minimized by ordinary descent. The main term fits the task we actually care about. The probe terms make the latent state carry reusable structure instead of one narrow trick. The regularization term is ordinary weight control, not a dare.
The model updates on two timescales.
The fast state updates after each observed event, through the event-level update \(U_\theta\).
The slow embedding updates more slowly, when durable evidence accumulates. Let \(\hat T_i^{\mathrm{new}}\) denote the refreshed slow estimate obtained after recomputing the durable person-side encoder from newly accumulated slow evidence. Then
$$ \hat T_i \;\leftarrow\; (1-\alpha)\,\hat T_i + \alpha\, \hat T_i^{\mathrm{new}}, \qquad 0 < \alpha \ll 1. $$
That distinction matters. A prospect’s momentary state may move after one email. Their durable embedding should not be rewritten that quickly; it should drift toward a refreshed durable estimate instead of lurching around because one event was loud.
The model parameters update by ordinary gradient step,
$$ \theta \;\leftarrow\; \theta + \eta_\theta\, \nabla_\theta J(\theta). $$
Now for the part I wanted to make explicit.
In deployment, the update loop is:
- construct the current operational state \(s_{i,t}^{(\tau,\Delta)}\),
- score one or more candidate propositions \(x_t^{(1)}, \dots, x_t^{(K)}\),
- choose an action by the current policy,
- observe the response,
- update \(z_{i,t}\),
- periodically refit \(\theta\) and refresh \(\hat T_i\).
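The loop above can be sketched end to end with stub components. Every class, method, and helper name here is hypothetical; the stubs exist only to show the control flow:

```python
class ToyModel:
    """Stub standing in for the latent-state model; all names are hypothetical."""
    def __init__(self):
        self.fast = {}  # per-person fast state, here a single scalar

    def build_state(self, person):
        return {"person": person, "fast": self.fast.get(person, 0.0)}

    def score(self, state, proposition):
        # Toy score: fast state plus a fixed per-proposition prior.
        return state["fast"] + {"a": 0.1, "b": 0.3}.get(proposition, 0.0)

    def update_fast_state(self, person, proposition, response):
        self.fast[person] = self.fast.get(person, 0.0) + (1.0 if response else -0.2)

def choose_greedy(candidates, scores):
    return max(zip(scores, candidates))[1]

def deployment_step(model, person, candidates, observe):
    state = model.build_state(person)                     # 1. build current state
    scores = [model.score(state, x) for x in candidates]  # 2. score candidates
    action = choose_greedy(candidates, scores)            # 3. choose by policy
    response = observe(action)                            # 4. observe the response
    model.update_fast_state(person, action, response)     # 5. update fast state
    return action, response                               # (periodic refits happen offline)

m = ToyModel()
action, resp = deployment_step(m, "p1", ["a", "b"], observe=lambda x: x == "b")
```

The greedy choice is a placeholder; a real deployment would use whatever policy the randomization budget allows.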
But there are two very different ways to use those scores.
Observational ranking
If all you have are retrospective logs, define the model score as
$$ \mathrm{score}\big(x \mid s_{i,t}^{(\tau,\Delta)}\big) = \hat y_{i,t+\Delta}^{(\tau)}\big(s_{i,t}^{(\tau,\Delta)},\; x\big), $$
the model's predicted task outcome if proposition \(x\) were presented in the current state.
This lets you rank or simulate candidate propositions. Useful, yes. Causal, not yet.
Off-policy evaluation
If the historical system logged propensities
$$ \mu\big(x_t \mid s_{i,t}^{(\tau,\Delta)}\big) $$
for the behavior policy \(\mu\), then a deterministic target proposition policy \(\pi\) can be evaluated off-policy. A simple inverse-propensity estimator is
$$ \hat V(\pi) = \frac{1}{N} \sum_{t=1}^{N} \frac{\mathbb 1\big\{\pi\big(s_{i,t}^{(\tau,\Delta)}\big) = x_t\big\}}{\mu\big(x_t \mid s_{i,t}^{(\tau,\Delta)}\big)}\; r_t, $$
where \(r_t\) is the realized reward or task utility. If the target policy is stochastic rather than deterministic, replace the indicator with the usual importance ratio \(\pi(x_t \mid s_{i,t}^{(\tau,\Delta)}) / \mu(x_t \mid s_{i,t}^{(\tau,\Delta)})\). In practice, a stabilized or doubly-robust estimator is usually preferable, but the point is not the exact estimator; the point is that without propensities or randomization, you do not get to claim policy value cleanly.
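The inverse-propensity logic is easy to state in code. The sketch below implements plain IPS for a deterministic target policy plus a self-normalized variant, on a hand-checkable toy log; the data values are invented for illustration:

```python
import numpy as np

def ips_value(rewards, logged_actions, propensities, policy_actions):
    """Inverse-propensity estimate of a deterministic target policy's value."""
    rewards = np.asarray(rewards, dtype=float)
    match = np.asarray(logged_actions) == np.asarray(policy_actions)
    weights = match / np.asarray(propensities, dtype=float)
    return float(np.mean(weights * rewards))

def snips_value(rewards, logged_actions, propensities, policy_actions):
    """Self-normalized variant; usually lower variance, mildly biased."""
    rewards = np.asarray(rewards, dtype=float)
    match = np.asarray(logged_actions) == np.asarray(policy_actions)
    weights = match / np.asarray(propensities, dtype=float)
    return float(np.sum(weights * rewards) / np.sum(weights))

# Toy log: two actions, known logging propensities, target policy always picks "A".
r = [1.0, 0.0, 1.0, 0.0]
a_logged = ["A", "B", "A", "B"]
mu = [0.5, 0.5, 0.25, 0.5]
a_pi = ["A", "A", "A", "A"]
v_ips = ips_value(r, a_logged, mu, a_pi)   # (1/4) * (1/0.5 + 1/0.25) = 1.5
```

Without logged propensities, neither estimator is available, which is exactly the point made above.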
Online policy improvement
If you can randomize a controlled fraction of traffic, then proposition selection becomes a real policy-learning problem rather than a retrospective ranking problem. At that point, the latent-state model can be used to choose among admissible messages, offers, sequences, or interventions, and its value can be evaluated by lift, regret, or cumulative reward under live deployment.
Until then, leave the causal swagger out of it. The model may still be useful. It is just useful as a ranking and simulation device rather than as a proven controller.
4.6 Temporal Split, Evaluation, and Drift
The benchmark must be temporal. Random row splits leak future information.
Partition the data into rolling windows: train on \([t_0, t_1)\), validate on \([t_1, t_2)\), test on \([t_2, t_3)\), then advance all three windows and repeat.
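A rolling temporal split can be generated mechanically. In the sketch below (window lengths and step are assumed parameters), windows are half-open and the test window always starts where the training window ends, so no future rows leak:

```python
def rolling_windows(t_start, t_end, train_len, test_len, step):
    """Yield (train_lo, train_hi, test_lo, test_hi) half-open time windows."""
    lo = t_start
    while lo + train_len + test_len <= t_end:
        yield (lo, lo + train_len, lo + train_len, lo + train_len + test_len)
        lo += step

windows = list(rolling_windows(0, 100, train_len=60, test_len=20, step=10))
```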
The main primary-task metrics should be PR-AUC, AUROC, and expected calibration error.
Expected calibration error matters if calibrated probabilities are needed for ranking actions. PR-AUC matters because reply, meeting, and close events are sparse.
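For completeness, a binned expected-calibration-error computation; ten equal-width bins is an assumed default, not a prescribed choice:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: |mean confidence - empirical rate| per bin, weighted by bin mass."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so probability 1.0 is not dropped.
        mask = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo)
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return ece

ece = expected_calibration_error([0.05, 0.05, 0.95, 0.95], [0, 0, 1, 1])
```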
For the auxiliary probes, report probe-appropriate metrics such as macro-F1, AUROC, or calibration, depending on whether the probe is multiclass, binary, ordinal, or continuous. If the latent state is supposed to carry structured signal, that signal should show up in more than one head.
The benchmark should also include ablations that force the slow/fast story to either pay rent or die.
- Remove \(z_{i,t}\). If short-horizon performance barely moves, the fast state is ornamental.
- Remove \(\hat T_i\). If cold-start or cross-context generalization barely moves, the slow embedding is ornamental.
- Replace the salience-weighted categorical pools with uniform averaging. If performance barely moves, the claim that sharp action and cumulative weak exposure deserve different treatment is ornamental.
- Collapse source channels and role regimes before pooling. If performance improves, my insistence on source-aware and regime-aware separation was theater; if it hurts, then source/regime separation together with contextual disambiguation was buying real signal.
- Shuffle recent within-person history while preserving static profile. If performance does not fall, the model was not really using the history in the way I claimed.
- Replace the explicit slow/fast architecture with the monolithic sequence baseline. If the generic sequence model dominates, then the decomposition is not buying enough.
- Remove the probe heads. If probes contribute no stable signal and no regularization value, then they are decorative; if they improve robustness or reveal reusable structure, then they are doing their job.
Drift has to be part of the benchmark rather than an afterthought. Let recent performance on a rolling window be
$$ M_t = \mathrm{metric}\big(\mathcal D^{(\tau)}_{[t-W,\, t]}\big), $$
for window width \(W\). If this falls below a threshold,
$$ M_t < M_{\min}, $$
the system should trigger one of three responses: refit parameters, expand the feature set, or reopen the task projection \(\Pi_{\tau}\). A model like this is expected to become wrong. The point is to catch it when it does and update it.
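The escalation can be wired as a simple tiered check. The thresholds and their ordering (cheapest response first) are assumptions of this sketch, not something the framework fixes:

```python
def drift_response(rolling_metric, thresholds):
    """Map rolling performance to an escalating response.

    thresholds = (mild, moderate, severe), descending. The tiering is an
    assumed operational policy: refit first, expand features next, and only
    then reopen the task projection itself.
    """
    mild, moderate, severe = thresholds
    if rolling_metric >= mild:
        return "ok"
    if rolling_metric >= moderate:
        return "refit_parameters"
    if rolling_metric >= severe:
        return "expand_feature_set"
    return "reopen_task_projection"

status = drift_response(0.62, thresholds=(0.70, 0.60, 0.50))
```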
If the proposition-selection layer is being tested with propensities or online experiments, report off-policy value estimates or live lift separately from the pure forecasting metrics. Those are different questions and should not be mashed together.
4.7 What Counts as Success
Success is not that the model sounds deep. Success is narrower.
The latent-state framing succeeds in the first domain if the full model beats Baselines 0 through 4 on the primary metrics, and if it matches or beats the monolithic sequence baseline while remaining more interpretable, on temporally held-out data, across more than one outcome horizon.
It succeeds more strongly if the gains survive drift, if ablations show that the explicit fast state \(z_{i,t}\) is carrying unique short-horizon signal, if removing \(\hat T_i\) hurts cold-start or cross-context performance in exactly the way the slow/fast story predicts, and if uniform or source-collapsed categorical pooling underperforms the salience-weighted, source-aware, and regime-aware version.
It succeeds more strongly still if the auxiliary probe heads show that the latent state supports reusable structure beyond a single binary label.
If intervention data exists, proposition search succeeds when the off-policy value estimate of the model-guided policy exceeds that of the logged behavior policy,
$$ \hat V\big(\pi_{\mathrm{model}}\big) > \hat V(\mu). $$
And if we do later choose messages from the model, the extra line is positive live lift under randomized deployment, with the giant asterisk that ranking is not causality unless the data-collection regime supports that claim.
This is the point where the theory becomes falsifiable. The question is no longer whether reality can be expressed in a shared formal arena. The question is whether that arena yields better forecasts of observable human transition than models that ignore explicit state, and whether its proposition-search layer survives the much nastier standard of intervention.