preprint · part 4 · 5 of 7
Part 4: Benchmarking the World Model
Part 3 supplied a framework. Part 4 makes it answerable to data.
We’ve argued so far that an observer can be represented as a transcendental embedding, that this embedding can be approximated from outward traces, and that propositions can be represented in the same formal arena as the observer. But any framework that does not specify what counts as state, what data instantiates that state, what task is being predicted, what stronger baselines it must beat, and what evidence would justify proposition optimization is nascent and not really worth anyone’s time. This section deals with benchmarking.
The purpose of Part 4 is not to prove the full metaphysical claim directly. It is to ask a narrower question: if we represent a human in role (the Chimera) as a slow embedding plus a fast latent state, do we predict observable transitions better than simpler models that use only static profile data, current-touch features, or history summaries? And if we later use that model to choose messages, do we have the intervention machinery to say something more than “the simulator liked this one”?
4.1 Operational Definition of State
Philosophically, the state of the organism is total. It includes perception, interoception, memory, action tendency, and the actions already underway. But benchmarks do not get to be mystical.
Let
$$ S_{i,t} $$
denote the full phenomenal state of person \(i\) at time \(t\). That object is not directly observed, so the benchmark requires two additional definitions.
First, define the predictive observer-state \(q_{i,t}^{(\tau,\Delta)}\) for task \(\tau\) and horizon \(\Delta\) by the sufficiency condition
$$ P\big(y_{i,t+\Delta}^{(\tau)} \mid S_{i,t}\big) = P\big(y_{i,t+\Delta}^{(\tau)} \mid q_{i,t}^{(\tau,\Delta)}\big), $$
where \(y_{i,t+\Delta}^{(\tau)}\) is the task-\(\tau\) outcome at horizon \(\Delta\): the predictive observer-state keeps exactly the information in the full state that matters for this task at this horizon.
Second, define the operational state used in the dataset as the measurable approximation
$$ s_{i,t}^{(\tau,\Delta)} = \big(\hat T_i,\; z_{i,t},\; c_{i,t},\; w_t\big), $$
where \(\hat T_i\) is the slow estimated embedding of the person, \(z_{i,t}\) is the fast latent state inferred from recent interaction history, \(c_{i,t}\) is the current role-and-institution context, and \(w_t\) is the relevant world state.
So the hierarchy is now explicit: phenomenal, then predictive, then operational,
$$ S_{i,t} \;\longrightarrow\; q_{i,t}^{(\tau,\Delta)} \;\longrightarrow\; s_{i,t}^{(\tau,\Delta)}. $$
The current proposition \(x_t\) remains separate from the state, because the benchmark asks how a particular proposition changes the next state. Here the slow categorical bank captures what the person is generally like at this stage, while the fast categorical pool captures which discrete dispositions are currently active under the present regime.
The ideal transition is still the map on full phenomenal states,
$$ S_{i,t} \;\mapsto\; S_{i,t+1}, $$
but the measurable benchmark version is the corresponding map on operational states,
$$ s_{i,t}^{(\tau,\Delta)} \;\mapsto\; s_{i,t+1}^{(\tau,\Delta)}, $$
each driven by the presented proposition and the ambient world.
This preserves the paper’s claim that the true target is the next phenomenal transition, while admitting that the benchmark must train against observable proxies rather than direct access to qualia.
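As a concrete anchor for the operational state, a minimal Python sketch follows. The field names and the flat concatenation are hypothetical illustrations, not a fixed schema from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OperationalState:
    """Measurable stand-in for the phenomenal state (hypothetical field names)."""
    slow_embedding: np.ndarray   # durable person embedding, updated rarely
    fast_state: np.ndarray       # latent state inferred from recent history
    role_context: dict           # current role-and-institution context
    world_state: dict            # relevant world features

    def as_features(self) -> np.ndarray:
        """Concatenate the numeric parts for a downstream predictor."""
        return np.concatenate([self.slow_embedding, self.fast_state])

s = OperationalState(
    slow_embedding=np.zeros(8),
    fast_state=np.zeros(4),
    role_context={"role": "buyer"},
    world_state={"quarter_end": False},
)
```

The context and world terms stay as raw dictionaries here because how they are featurized depends on what data is actually available.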
4.2 Dataset Construction
For each task \(\tau\), define a dataset of event-time examples,
$$ \mathcal D^{(\tau)} = \Big\{ \big(s_{i,t}^{(\tau,\Delta)},\; x_t,\; y_{i,t+\Delta}^{(\tau)}\big) \Big\}_{i,t}, $$
pairing the operational state and the presented proposition with the logged outcome at horizon \(\Delta\).
Here the person-side features are the bundle
$$ \big(p_i,\; b_i,\; \ell_i,\; r_i,\; h_i,\; g_i^{\mathrm{grp}}\big), $$
where \(p_i\) is any psychometric proxy we can actually infer with a straight face, \(b_i\) is biography, \(\ell_i\) is language and discourse features, \(r_i\) is role and institution history, \(h_i\) is the observable subset of life-history structure recoverable from logs or profiles, and \(g_i^{\mathrm{grp}}\) is firm, account, or longer-run group context. If a coordinate cannot be inferred cleanly from real logs, its slot is masked rather than fabricated.
In addition, construct a slow categorical bank \(g_i^{\mathrm{slow}}\) from long-run source-tagged categorical traces and a fast categorical pool \(g_{i,t}^{\mathrm{fast},\tau}\) from recent history. The slow bank stores what the person is generally like now across regimes; the fast pool stores what is currently active after contextual lifting and weighting.
Let the interaction history be
$$ H_{i,\le t} = \big(o_{i,1}, \dots, o_{i,t}\big), $$
with each event encoded as
$$ o_{i,t} = \big(x_t,\; \delta_t,\; r_t,\; a_t,\; m_t^{\mathrm{obs}},\; e_{i,t}^{\mathrm{cat}}\big), $$
where \(x_t\) is the proposition presented at time \(t\), \(\delta_t\) is the time since the last interaction, \(r_t\) is the observed response bundle, \(a_t\) is the action taken by the agent, \(m_t^{\mathrm{obs}}\) is an observable memory proxy such as resurfaced objection themes, repeated concerns, or revisited product topics, and \(e_{i,t}^{\mathrm{cat}}\) is the source-tagged raw categorical event bundle. This bundle should preserve whether a category came from biography, stated language, observed behavior, or a third-party inference. It can later be aligned; it should not be forced into immediate equivalence.
The distinction between \(e_{i,t}^{\mathrm{cat}}\) and \(g_{i,t}^{\mathrm{fast},\tau}\) matters. The first is the current event-level categorical shock. The second is the separately computed pooled summary of prior categorical history after contextual lifting and weighting. Keeping both lets the model distinguish current content from accumulated state rather than injecting the same object twice under two names.
When available, action-type traces and mere-exposure traces should both be logged. The first often carries sharper salience; the second can still accumulate gradually.
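The exposure-versus-action asymmetry can be made concrete with a toy pooling rule. Everything numeric below — the salience weights, the decay half-life — is an assumed illustration, not a tuned choice from the paper:

```python
# Hypothetical salience weights: decisive actions hit harder than mere exposure.
SALIENCE = {"action": 1.0, "exposure": 0.2}
HALF_LIFE_DAYS = 14.0  # assumed decay half-life for the fast pool

def update_fast_pool(pool, event_categories, event_kind, days_since_last):
    """Decay the pooled categorical weights, then add the new event's categories."""
    decay = 0.5 ** (days_since_last / HALF_LIFE_DAYS)
    pool = {cat: w * decay for cat, w in pool.items()}
    for cat in event_categories:
        pool[cat] = pool.get(cat, 0.0) + SALIENCE[event_kind]
    return pool

pool = {}
# Repeated weak exposure accumulates gradually...
for _ in range(5):
    pool = update_fast_pool(pool, ["pricing"], "exposure", days_since_last=1.0)
# ...while one decisive action lands at full salience in a single step.
pool = update_fast_pool(pool, ["demo_request"], "action", days_since_last=1.0)
```

The point of the toy: five passive exposures accumulate to a nontrivial pooled weight, but a single decisive action still exceeds any single exposure by construction.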
The role-context term \(c_{i,t}\) and the world-state term \(w_t\) stay explicit because the environment matters. Which components we can actually carry depends on what data we can get.
For the first benchmark, the primary outcome bundle should stay narrow and logged: for example, reply, meeting booked, and close events at the chosen horizon.
Now add the auxiliary probe bundle,
$$ a_{i,t+\Delta}^{(\tau)} = \big(a_{i,t+\Delta}^{(1,\tau)}, \dots, a_{i,t+\Delta}^{(M,\tau)}\big). $$
These probes should be domain-specific and extractable from transcripts, email text, CRM status changes, or structured coding. They are not meant to reveal the one true hidden motive of the prospect. They are meant to test whether the latent state carries structured signal that generalizes beyond a single binary target.
So the first dataset should be built from CRM events, email logs, call transcripts, meeting records, account metadata, sender metadata, and company descriptors. If psychometric proxies, firm embeddings, or market-pressure features are unavailable, run the benchmark without them first. Do not invent variables because they sound sophisticated.
Also, a quick aside: I originally mentally modeled the fast and slow factors to help represent the phenomenon of mere exposure, but it evolved into its own thing over time. That origin still matters. Repeated exposure is exactly the kind of effect that should show up in the fast categorical pools before it ever deserves metaphysical inflation, while decisive actions should usually be allowed to hit the local state harder than passive contact alone.
4.3 The Benchmark
The benchmark is simple: does an explicit predictive-state model beat weaker baselines on prediction, and does the explicit slow/fast decomposition beat a generic sequence model that has enough capacity to absorb everything into one black box?
Formally, the claim is this:
If recent interaction history contains predictive information that cannot be reduced to static profile features or the current proposition alone, then a model with an explicit fast latent state \(z_{i,t}\) should outperform baselines that omit that state.
That means the model has to beat the following baselines.
Baseline 0: frequency baseline,
$$ \hat y_{i,t+\Delta}^{(\tau)} = \bar y^{(\tau)}, $$
estimated from training prevalence alone.
Baseline 1: current-touch model,
$$ \hat y_{i,t+\Delta}^{(\tau)} = f_1\big(x_t,\; c_{i,t}\big), $$
for example logistic regression or linear classification using only the current proposition and current context.
Baseline 2: static tabular model,
$$ \hat y_{i,t+\Delta}^{(\tau)} = f_2\big(p_i, b_i, \ell_i, r_i, h_i, g_i^{\mathrm{grp}},\; w_t,\; x_t\big), $$
for example gradient-boosted trees or regularized logistic regression using person, firm, world, and proposition features, but no explicit sequence state.
Baseline 3: shallow-history model,
$$ \hat y_{i,t+\Delta}^{(\tau)} = f_3\big(\mathrm{agg}(H_{i,\le t}),\; x_t,\; c_{i,t}\big), $$
where \(\mathrm{agg}(H_{i,\le t})\) is a hand-built summary such as touch count, last-response delay, reply rate, prior meeting count, or resurfaced topic counts.
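A minimal sketch of what \(\mathrm{agg}(H_{i,\le t})\) might compute. The event keys are hypothetical; a real implementation would pull these fields from CRM and email logs:

```python
import numpy as np

def agg_history(events):
    """Hand-built history summary for the shallow-history baseline.

    Each event is a dict with hypothetical keys:
    'replied' (bool), 'delay_days' (float), 'meeting' (bool), 'topics' (list).
    """
    if not events:
        return np.zeros(5)
    touch_count = len(events)
    reply_rate = sum(e["replied"] for e in events) / touch_count
    last_delay = events[-1]["delay_days"]
    meetings = sum(e["meeting"] for e in events)
    # Count consecutive events that share a topic, as a crude resurfacing proxy.
    resurfaced = sum(
        1 for a, b in zip(events, events[1:])
        if set(a["topics"]) & set(b["topics"])
    )
    return np.array([touch_count, reply_rate, last_delay, meetings, resurfaced],
                    dtype=float)

h = [
    {"replied": True, "delay_days": 2.0, "meeting": False, "topics": ["price"]},
    {"replied": False, "delay_days": 5.0, "meeting": True, "topics": ["price"]},
]
feats = agg_history(h)
```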
Baseline 4: recommender-style two-tower model,
$$ \hat y_{i,t+\Delta}^{(\tau)} = \sigma\big(\langle \phi_{\mathrm{person}},\; \psi(x_t)\rangle\big), $$
where person/account and proposition are embedded separately and scored by dot product or shallow fusion, but no explicit recurrent state is maintained.
Baseline 5: monolithic sequence model,
$$ \hat y_{i,t+\Delta}^{(\tau)} = f_5\big(H_{i,\le t},\; x_t,\; c_{i,t},\; w_t\big), $$
implemented by a generic recurrent, transformer, or state-space sequence model that sees the same event stream but does not enforce an explicit slow/fast decomposition.
This last baseline matters. If a monolithic sequence block with enough capacity eats my lunch, then the decomposition was just a story I told myself after the fact. If the explicit decomposition still wins or ties while being more interpretable, then it has earned the right to stay.
4.4 The Proposed Latent-State Model
The model keeps the slow and fast parts separate. The slow categorical bank is source-aware and regime-aware. The fast categorical pool privileges decisive action traces over mere exposure while still allowing repeated weak exposure to accumulate, and apparent contradictions are contextually lifted before they are treated as unresolved opposition.
First, estimate the slow person-side embedding,
$$ \hat T_i = E_\theta\big(p_i, b_i, \ell_i, r_i, h_i, g_i^{\mathrm{grp}},\; g_i^{\mathrm{slow}}\big), $$
where \(E_\theta\) is the durable person-side encoder.
Second, initialize a fast state,
$$ z_{i,0} = z_{\mathrm{init}}\big(\hat T_i,\; c_{i,0}\big). $$
Third, update the fast state as events occur,
$$ z_{i,t+1} = U_\theta\big(z_{i,t},\; x_{t+1},\; \delta_{t+1},\; r_{t+1},\; a_{t+1},\; m_{t+1}^{\mathrm{obs}},\; e_{i,t+1}^{\mathrm{cat}},\; g_{i,t+1}^{\mathrm{fast},\tau}\big). $$
The active regime determines which mask is presently live, but the person-side object remains one person rather than many separate selves.
When a candidate proposition \(x_{t+1}\) is under consideration, predict the resulting next predictive state by
$$ \hat q_{i,t+1}^{(\tau,\Delta)} = G_\theta\big(s_{i,t}^{(\tau,\Delta)},\; x_{t+1}\big). $$
The primary readout is
$$ \hat y_{i,t+1+\Delta}^{(\tau)} = g_{\mathrm{out}}\big(\hat q_{i,t+1}^{(\tau,\Delta)}\big). $$
The auxiliary probe readouts are
$$ \hat a_{i,t+1+\Delta}^{(m,\tau)} = g_m\big(\hat q_{i,t+1}^{(\tau,\Delta)}\big), \qquad m = 1, \dots, M. $$
When I later write \(\hat a_{i,t+1+\Delta}^{(\tau)}\) or \(\hat a_{i,t+\Delta}^{(\tau)}\) without the probe index, I mean the full auxiliary bundle \((\hat a^{(1,\tau)}, \dots, \hat a^{(M,\tau)})\).
This multi-head structure is there so the latent state does not remain a completely black box. If the state is real in the operational sense, it should carry reusable structure that helps decode more than one downstream observable.
The architecture of \(U_\theta\) and \(G_\theta\) is not fixed by the theory. The first implementation may be a GRU, an LSTM, a state-space model, or a Mamba-like sequential block. That choice is an implementation detail. The theory requires a stateful update. It does not require blind loyalty to one named architecture.
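To make "stateful update" concrete, here is a minimal NumPy GRU-style cell as one possible \(U_\theta\). The dimensions, initialization, and the cell itself are illustrative assumptions; the theory does not privilege this architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class GRUCell:
    """Minimal GRU-style update; one possible U_theta, not the required one."""
    def __init__(self, event_dim, state_dim):
        k = event_dim + state_dim
        self.Wz = rng.normal(0, 0.1, (state_dim, k))  # update gate
        self.Wr = rng.normal(0, 0.1, (state_dim, k))  # reset gate
        self.Wh = rng.normal(0, 0.1, (state_dim, k))  # candidate state

    def step(self, z_prev, event_vec):
        x = np.concatenate([event_vec, z_prev])
        u = sigmoid(self.Wz @ x)                      # how much to update
        r = sigmoid(self.Wr @ x)                      # how much history to reuse
        cand = np.tanh(self.Wh @ np.concatenate([event_vec, r * z_prev]))
        return (1 - u) * z_prev + u * cand            # convex mix keeps z bounded

cell = GRUCell(event_dim=6, state_dim=4)
z = np.zeros(4)
for _ in range(3):                                    # three observed events
    z = cell.step(z, rng.normal(size=6))
```

Swapping this cell for an LSTM, a state-space block, or a Mamba-like block changes nothing in the surrounding benchmark; only the stateful-update contract matters.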
4.5 Training Objective, Update Loop, and Intervention
Start with a simple training objective,
$$ J(\theta) = -\,\mathcal L_{\mathrm{task}}^{(\tau)}(\theta)
\;-\; \sum_{m=1}^{M}\lambda_m \mathcal L_{\mathrm{probe},m}
\;-\; \lambda_{\mathrm{reg}}\, \Omega(\theta), $$
maximized in \(\theta\), or equivalently its negation minimized by ordinary descent. The main term fits the task we actually care about. The probe terms make the latent state carry reusable structure instead of one narrow trick. The regularization term is ordinary weight control, not a dare.
The model updates on two timescales.
The fast state updates after each observed event, through the event-level update \(U_\theta\).
The slow embedding updates more slowly, when durable evidence accumulates. Let \(\hat T_i^{\mathrm{new}}\) denote the refreshed slow estimate obtained after recomputing the durable person-side encoder from newly accumulated slow evidence. Then
$$ \hat T_i \;\leftarrow\; (1-\alpha)\,\hat T_i + \alpha\, \hat T_i^{\mathrm{new}}, \qquad 0 < \alpha \ll 1. $$
That distinction matters. A prospect’s momentary state may move after one email. Their durable embedding should not be rewritten that quickly; it should drift toward a refreshed durable estimate instead of lurching around because one event was loud.
The model parameters update by ordinary gradient step,
$$ \theta \;\leftarrow\; \theta + \eta_\theta\, \nabla_\theta J(\theta). $$
Now for the part I wanted to make explicit.
In deployment, the update loop is:
- construct the current operational state \(s_{i,t}^{(\tau,\Delta)}\),
- score one or more candidate propositions \(x_t^{(1)}, \dots, x_t^{(K)}\),
- choose an action by the current policy,
- observe the response,
- update \(z_{i,t}\),
- periodically refit \(\theta\) and refresh \(\hat T_i\).
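The loop above can be sketched end to end with stub components. Every class, method, and helper name here is hypothetical; the stubs exist only to show the control flow:

```python
class ToyModel:
    """Stub standing in for the latent-state model; all names are hypothetical."""
    def __init__(self):
        self.fast = {}  # per-person fast state, here a single scalar

    def build_state(self, person):
        return {"person": person, "fast": self.fast.get(person, 0.0)}

    def score(self, state, proposition):
        # Toy score: fast state plus a fixed per-proposition prior.
        return state["fast"] + {"a": 0.1, "b": 0.3}.get(proposition, 0.0)

    def update_fast_state(self, person, proposition, response):
        self.fast[person] = self.fast.get(person, 0.0) + (1.0 if response else -0.2)

def choose_greedy(candidates, scores):
    return max(zip(scores, candidates))[1]

def deployment_step(model, person, candidates, observe):
    state = model.build_state(person)                     # 1. build current state
    scores = [model.score(state, x) for x in candidates]  # 2. score candidates
    action = choose_greedy(candidates, scores)            # 3. choose by policy
    response = observe(action)                            # 4. observe the response
    model.update_fast_state(person, action, response)     # 5. update fast state
    return action, response                               # (periodic refits happen offline)

m = ToyModel()
action, resp = deployment_step(m, "p1", ["a", "b"], observe=lambda x: x == "b")
```

The greedy choice is a placeholder; a real deployment would use whatever policy the randomization budget allows.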
But there are two very different ways to use those scores.
Observational ranking
If all you have are retrospective logs, define the model score as
$$ \mathrm{score}\big(x \mid s_{i,t}^{(\tau,\Delta)}\big) = \hat y_{i,t+\Delta}^{(\tau)}\big(s_{i,t}^{(\tau,\Delta)},\; x\big), $$
the model's predicted task outcome if proposition \(x\) were presented in the current state.
This lets you rank or simulate candidate propositions. Useful, yes. Causal, not yet.
Off-policy evaluation
If the historical system logged propensities
$$ \mu\big(x_t \mid s_{i,t}^{(\tau,\Delta)}\big) $$
for the behavior policy \(\mu\), then a deterministic target proposition policy \(\pi\) can be evaluated off-policy. A simple inverse-propensity estimator is
$$ \hat V(\pi) = \frac{1}{N} \sum_{t=1}^{N} \frac{\mathbb 1\big\{\pi\big(s_{i,t}^{(\tau,\Delta)}\big) = x_t\big\}}{\mu\big(x_t \mid s_{i,t}^{(\tau,\Delta)}\big)}\; r_t, $$
where \(r_t\) is the realized reward or task utility. If the target policy is stochastic rather than deterministic, replace the indicator with the usual importance ratio \(\pi(x_t \mid s_{i,t}^{(\tau,\Delta)}) / \mu(x_t \mid s_{i,t}^{(\tau,\Delta)})\). In practice, a stabilized or doubly-robust estimator is usually preferable, but the point is not the exact estimator; the point is that without propensities or randomization, you do not get to claim policy value cleanly.
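The inverse-propensity logic is easy to state in code. The sketch below implements plain IPS for a deterministic target policy plus a self-normalized variant, on a hand-checkable toy log; the data values are invented for illustration:

```python
import numpy as np

def ips_value(rewards, logged_actions, propensities, policy_actions):
    """Inverse-propensity estimate of a deterministic target policy's value."""
    rewards = np.asarray(rewards, dtype=float)
    match = np.asarray(logged_actions) == np.asarray(policy_actions)
    weights = match / np.asarray(propensities, dtype=float)
    return float(np.mean(weights * rewards))

def snips_value(rewards, logged_actions, propensities, policy_actions):
    """Self-normalized variant; usually lower variance, mildly biased."""
    rewards = np.asarray(rewards, dtype=float)
    match = np.asarray(logged_actions) == np.asarray(policy_actions)
    weights = match / np.asarray(propensities, dtype=float)
    return float(np.sum(weights * rewards) / np.sum(weights))

# Toy log: two actions, known logging propensities, target policy always picks "A".
r = [1.0, 0.0, 1.0, 0.0]
a_logged = ["A", "B", "A", "B"]
mu = [0.5, 0.5, 0.25, 0.5]
a_pi = ["A", "A", "A", "A"]
v_ips = ips_value(r, a_logged, mu, a_pi)   # (1/4) * (1/0.5 + 1/0.25) = 1.5
```

Without logged propensities, neither estimator is available, which is exactly the point made above.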
Online policy improvement
If you can randomize a controlled fraction of traffic, then proposition selection becomes a real policy-learning problem rather than a retrospective ranking problem. At that point, the latent-state model can be used to choose among admissible messages, offers, sequences, or interventions, and its value can be evaluated by lift, regret, or cumulative reward under live deployment.
Until then, leave the causal swagger out of it. The model may still be useful. It is just useful as a ranking and simulation device rather than as a proven controller.
4.6 Temporal Split, Evaluation, and Drift
The benchmark must be temporal. Random row splits leak future information.
Partition the data into rolling windows: train on \([t_0, t_1)\), validate on \([t_1, t_2)\), test on \([t_2, t_3)\), then advance all three windows and repeat.
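A rolling temporal split can be generated mechanically. In the sketch below (window lengths and step are assumed parameters), windows are half-open and the test window always starts where the training window ends, so no future rows leak:

```python
def rolling_windows(t_start, t_end, train_len, test_len, step):
    """Yield (train_lo, train_hi, test_lo, test_hi) half-open time windows."""
    lo = t_start
    while lo + train_len + test_len <= t_end:
        yield (lo, lo + train_len, lo + train_len, lo + train_len + test_len)
        lo += step

windows = list(rolling_windows(0, 100, train_len=60, test_len=20, step=10))
```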
The main primary-task metrics should be PR-AUC, AUROC, and expected calibration error.
Expected calibration error matters if calibrated probabilities are needed for ranking actions. PR-AUC matters because reply, meeting, and close events are sparse.
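For completeness, a binned expected-calibration-error computation; ten equal-width bins is an assumed default, not a prescribed choice:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: |mean confidence - empirical rate| per bin, weighted by bin mass."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so probability 1.0 is not dropped.
        mask = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo)
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return ece

ece = expected_calibration_error([0.05, 0.05, 0.95, 0.95], [0, 0, 1, 1])
```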
For the auxiliary probes, report probe-appropriate metrics such as macro-F1, AUROC, or calibration, depending on whether the probe is multiclass, binary, ordinal, or continuous. If the latent state is supposed to carry structured signal, that signal should show up in more than one head.
The benchmark should also include ablations that force the slow/fast story to either pay rent or die.
- Remove \(z_{i,t}\). If short-horizon performance barely moves, the fast state is ornamental.
- Remove \(\hat T_i\). If cold-start or cross-context generalization barely moves, the slow embedding is ornamental.
- Replace the salience-weighted categorical pools with uniform averaging. If performance barely moves, the claim that sharp action and cumulative weak exposure deserve different treatment is ornamental.
- Collapse source channels and role regimes before pooling. If performance improves, my insistence on source-aware and regime-aware separation was theater; if it hurts, then source/regime separation together with contextual disambiguation was buying real signal.
- Shuffle recent within-person history while preserving static profile. If performance does not fall, the model was not really using the history in the way I claimed.
- Replace the explicit slow/fast architecture with the monolithic sequence baseline. If the generic sequence model dominates, then the decomposition is not buying enough.
- Remove the probe heads. If probes contribute no stable signal and no regularization value, then they are decorative; if they improve robustness or reveal reusable structure, then they are doing their job.
Drift has to be part of the benchmark rather than an afterthought. Let recent performance on a rolling window be
$$ M_t = \mathrm{metric}\big(\mathcal D^{(\tau)}_{[t-W,\, t]}\big), $$
for window width \(W\). If this falls below a threshold,
$$ M_t < M_{\min}, $$
the system should trigger one of three responses: refit parameters, expand the feature set, or reopen the task projection \(\Pi_{\tau}\). A model like this is expected to become wrong. The point is to catch it when it does and update it.
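The escalation can be wired as a simple tiered check. The thresholds and their ordering (cheapest response first) are assumptions of this sketch, not something the framework fixes:

```python
def drift_response(rolling_metric, thresholds):
    """Map rolling performance to an escalating response.

    thresholds = (mild, moderate, severe), descending. The tiering is an
    assumed operational policy: refit first, expand features next, and only
    then reopen the task projection itself.
    """
    mild, moderate, severe = thresholds
    if rolling_metric >= mild:
        return "ok"
    if rolling_metric >= moderate:
        return "refit_parameters"
    if rolling_metric >= severe:
        return "expand_feature_set"
    return "reopen_task_projection"

status = drift_response(0.62, thresholds=(0.70, 0.60, 0.50))
```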
If the proposition-selection layer is being tested with propensities or online experiments, report off-policy value estimates or live lift separately from the pure forecasting metrics. Those are different questions and should not be mashed together.
4.7 What Counts as Success
Success is not that the model sounds deep. Success is narrower.
The latent-state framing succeeds in the first domain if the full model beats Baselines 0 through 4 on the primary metrics, and if it matches or beats the monolithic sequence baseline while remaining more interpretable, on temporally held-out data, across more than one outcome horizon.
It succeeds more strongly if the gains survive drift, if ablations show that the explicit fast state \(z_{i,t}\) is carrying unique short-horizon signal, if removing \(\hat T_i\) hurts cold-start or cross-context performance in exactly the way the slow/fast story predicts, and if uniform or source-collapsed categorical pooling underperforms the salience-weighted, source-aware, and regime-aware version.
It succeeds more strongly still if the auxiliary probe heads show that the latent state supports reusable structure beyond a single binary label.
If intervention data exists, proposition search succeeds when the off-policy value estimate of the model-guided policy exceeds that of the logged behavior policy,
$$ \hat V\big(\pi_{\mathrm{model}}\big) > \hat V(\mu). $$
And if we do later choose messages from the model, the extra line is positive live lift under randomized deployment, with the giant asterisk that ranking is not causality unless the data-collection regime supports that claim.
This is the point where the theory becomes falsifiable. The question is no longer whether reality can be expressed in a shared formal arena. The question is whether that arena yields better forecasts of observable human transition than models that ignore explicit state, and whether its proposition-search layer survives the much nastier standard of intervention.