# raw_markdown_GIDS_for_llms.md

Combined raw markdown for the GIDS opening, Parts 0 through 5, and Appendix A.
Image markdown has been stripped so the source text and equations remain intact.

---

## God's Infinite Dimensional Space

<div class="paper-title-block">
  <h1>God’s Infinite Dimensional Space</h1>
  <p class="paper-subtitle">Transcendental Embeddings as a Way to Mathematically Express Reality, Predictive Observer-State, and the Next Phenomenal Transition of Observers</p>
</div>

<div class="paper-epigraph">
  <p>
    “Und mich ergreift ein längst entwöhntes Sehnen<br />
    Nach jenem stillen, ernsten Geisterreich,<br />
    Es schwebet nun, in unbestimmten Tönen,<br />
    …<br />
    Was ich besitze seh’ ich wie im weiten,<br />
    Und was verschwand wird mir zu Wirklichkeiten.”
  </p>
  <p>
    “What I possess, I see as if far away;<br />
    I yearn for what lies beyond, for infinite space.”
  </p>
  <p class="paper-attribution">Faust: Part One, Act I · <em>Loosely translated</em></p>
</div>

**How does reality appear to you?**

_Reality is too large to be experienced all at once, so organisms inherit a finite way of carving it up. A person then becomes a specific realized version of that inherited structure through language, history, memory, culture, and repeated events. For prediction, I do not need the whole 'soul' in some mystical sense, I just need a task-relevant approximation of the person: a slow representation of what they are generally like now, a fast representation of what is currently active in them, their present role and world-state, and a representation of the 'proposition' hitting them now. Then I model the interaction, predict the next task-relevant state (for the observer), decode visible outcomes from it, and update the system under error. The categorical part matters because a lot of what we observe about people is discrete, repeated, and role-dependent._

You can model your current mental interior, everything that you are experiencing now, as a small slice of reality that your genetic lineage allows you to experience, that can be traced by a series of state transitions up until this moment in time. The mind is an evolved, structured projection system that turns input into a lived state, and behavior is downstream of transitions in that state. Predicting what your next state will be is _not_ an impossible task: in this work I am attempting to formalize a standard algebra to make this easier and tractable.

I'll be blunt: this work is a monster, and it is, in essence, autobiographical of the mental state of the author who wrote it, representing the debauched & tortured way in which these 'discoveries' were made:

- As philosophy: this work is ambitious but undisciplined.

- As math: this work is mostly formal packaging around those undisciplined assumptions.

- As ML research propositions: this work is _potentially_ worthwhile if you squint at it.

- As a finished research article: I fear it cannot be completed within a lifetime of work.

However, the goal is to examine this:

\[
\mathcal N
\longrightarrow
\mathcal M^{\mathrm{spec}}
\longrightarrow
G_i
\longrightarrow
T_i
\longrightarrow
\phi_{i,t}
\rightsquigarrow
q_{i,t}^{(\tau,\Delta)}
\approx
s_{i,t}^{(\tau,\Delta)}
\longrightarrow
y_{i,t+\Delta}^{(\tau)}.
\]
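To make the arrow chain concrete before any formal definitions, here is a minimal Python sketch of the pipeline. Every function body, dimension, and update rule here is an illustrative assumption of mine, not part of the formalism; the names only mirror the symbols in the display above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the framework idealizes the latent space as infinite-dimensional.
D_NOUMENAL, D_PHENOMENAL = 8, 6

def species_template(noumenal):
    """N -> M^spec: restrict raw reality to what the lineage can register (assumed linear)."""
    W_spec = rng.standard_normal((D_PHENOMENAL, D_NOUMENAL))
    return W_spec @ noumenal

def individual_seed(spec):
    """M^spec -> G_i: the individual's realized interpretive lens (placeholder nonlinearity)."""
    return np.tanh(spec)

def transition(seed, phi_prev):
    """T_i: deterministic next-state rule from the fixed seed + current phenomenal state."""
    return 0.9 * phi_prev + 0.1 * seed

def readout(phi):
    """phi_{i,t} ~> q ~ s -> y: decode a visible outcome from the task-relevant state."""
    return float(phi.sum() > 0)  # toy binary behavior

noumenal = rng.standard_normal(D_NOUMENAL)
seed = individual_seed(species_template(noumenal))
phi = np.zeros(D_PHENOMENAL)
for _ in range(5):               # recursive phenomenal update
    phi = transition(seed, phi)
y = readout(phi)
print(phi.shape, y)
```

The point is only the shape of the composition: lineage template, individual seed, recursive state transition, then a decode step that can be compared against observed behavior.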

> skip to the end if you are impatient and want a definition now

If you believe my axioms, you can apply this framework to your own projects and start predicting the next phenomenal state of 'agent observers' (people); if you do not believe my assumptions this paper will be useless to you (but I swear to entertain, nonetheless).

Almost every serious attempt to formalize mind or behavior ends up either:

- Waiting on neuroscience: "once we map the 'connectome' (or whatever the new limitation is) we'll understand behavior," which has been 20 years away for 50 years.

- Staying purely behavioral: black-box input/output with no theory or framework of internal structure.

- Getting lost in phenomenology: Husserl, Heidegger, etc.; philosophical but computationally intractable, thus mostly pointless.

This paper attempts a fourth path: take the structure of experience seriously as a mathematical object without needing to know its physical underpinnings. The brain is completely irrelevant to the formalism. You could run the same framework on an octopus, a corporation, or a hypothetical silicon agent and the algebra doesn't change, only the dimensionality of the embedding and the content of the axes.

The closest intellectual ancestors are probably:

Friston's free energy principle (I legitimately didn't read this guy until well after Part 2 was written; avoiding this line of thinking earlier would have been great): similar ambition of substrate-independence, but Friston goes deep into neuroscience anyway, and the math becomes almost deliberately obscure; we cannot do engineering off of his concept.

Marr's levels of analysis: the idea that computational and algorithmic descriptions are valid independently of implementation

Early Dennett: intentional stance as a legitimate predictive state without committing to substrate

But this paper is more engineering-forward than any of those. It's not asking "what is mind," which at this point is a stupid question to ask; instead we ask: "assuming mind has structure, what's the minimal formal system that lets us predict its next state from observations alone?"

This paper aims to formalize several disparate fields into a single, coherent whole. We’ll begin with the tragic story for this exploration (which has to do with Kant), then go down the rabbit hole of theory together and come out the other side with a fundamental theory of 'reality' that can be applied across some fields. First, we’ll discuss how the appearance of reality is constructed and how organisms parse out their version of reality. Next comes how organisms perceive state (state being the appearance of reality at that instant), and what the organism is biased to do next. Afterwards, we’ll discuss how to compute memory and learning, and apply this to our understanding of state and decompose the philosophical proposition into a register that can be understood by engineers. Once fully understood, you’ll be able to _eventually_ program/understand an individual’s psychology with precision for the task you are interested in discovering.

There is a universal way to decompose all of these questions into a single mathematical space, and I will illustrate that here. Let us take the measure of reality and examine God’s infinite dimensional space!

Lastly, here are the differences between GIDS and standard ML:

- Standard ML: construct a model → optimize for task performance → latent representations are a byproduct.
- GIDS: construct a stable latent object over the observer → task performance is a probe that tells you if your latent object is good → the ontology is the product.

The dominant paradigm right now is:

- Collect massive undifferentiated data
- Train a general model on reconstruction or next-token prediction
- Hope that task-relevant structure emerges in the latent space
- Fine-tune or probe for specific applications afterward

This is the GPT/BERT/foundation model playbook. It works extraordinarily well for language and increasingly for vision. But it treats the latent space as a consequence of scale rather than a design target.

GIDS inverts this completely and deliberately. The latent space is the goal; it is the product we will define, and the inputs and outputs are just discovery probes. Scale is just a mechanism to get there; GIDS is a research program on how to bootstrap itself.

**Preface**

First, we’re going to talk about Kant (German philosopher, hugely important); don’t worry about the exact details of his works. I’m just going over the first handful of sections in his main book, Critique of Pure Reason, and using that as a jumping-off point to how your reality can be represented using embeddings and states. Next, we’ll use the embedding concept to examine and measure bias when an individual is interpreting reality. And finally, we’ll talk about applications using this technique. Apologies for using philosophy as a segue into math; however, the pill is easier to swallow if the source of all this is adequately explained. I’ve kept the terms restricted to what you can find in a modern dictionary, so don’t worry about converting from some esoteric nonsense to English.

I read Kant directly while going through the Western canon. Unfortunately, Kant is the worst person to represent his own ideas, so you’ll have to bear with my fundamental misunderstanding of the source material. This is good news, however, as my misunderstanding of Kant is more useful than getting a ‘correct’ interpretation from most commentators. If you want to save yourself a year of your life, you can skip The Critique – and ignore all of the requisite readings – and try Wolff’s class ([Link](https://youtube.com/playlist?list=PLo0o3xtOPNLgnl2CtaxNHzie1TUWt_bp4&si=xxnQ7496XheYI5k4)). Kant uses a lot of dated terminology and systems that are only relevant to the era he wrote in (a reason why you should always start with the Greeks). I forgot exactly why, but if you’re going to read it, read the first edition, not the second. Also, just skip Kant’s stupid moral system and the categorical imperative altogether. The Critique of Pure Reason rips itself to shreds: Nietzsche was right, Kant became a coward before _his_ God.

Table of Contents:

**Part 0: Background:**
“Kant From An Evolutionary Perspective”
“A Fucking Table”

**Part 1: Specifying the Area of Interest:**
“Vectors Are All You Need”
“The Nature of Phenomenal Reality: What are we trying to measure?”
“The Evolutionary Mechanism for Encoding Transcendental Embeddings”

**Part 2: Deriving the Transcendental Embedding:**
“The Technical Scope (because otherwise I'll accidentally lie to you)”
“Behold; You! The Chimera”
“The Notion of State”
“Observable Predictive State”
“Memory as a Series of Vectors”
“Minimality, Identifiability, and Slow/Fast Factorization”
“Deriving the Transcendental Embedding”

**Part 3: Application — Predicting How People Behave:**
“Towards a Universal State Transition Function”
“God's Infinite Dimensional Space: Making All Realities Composable”
“Creating the World Model”
“From Forecasting to Proposition Search”

**Part 4: Benchmarking the World Model:**
“Operational Definition of State”
“Dataset Construction”
“The Benchmark”
“The Proposed Latent-State Model”
“Training Objective, Update Loop, and Intervention”
“Temporal Split, Evaluation, and Drift”

**Part 5: Axioms, Lemmas, and Main Theorem:**
“Proof Boundary”
“Sufficiency and Minimality”
“Minimal-State Uniqueness”
“History Mediation by the Fast State”
“Observational Proposition Ranking”

---

## Part 0: Background


## Kant From An Evolutionary Perspective

> This idea, transcendental idealism, is a man walking on phantom legs: locomotion is achieved via selective understanding. Evolutionary theory is the cure for this leap from logic into one man’s faith; we merely need to describe how we can go from noumena to phenomena and from purely unfiltered ‘reality’ external to the mind to ‘Transcendental Idealism.’ Fortunately, evolution is entirely capable of explaining how rocks can become delusional enough to think they are alive. The core of this (my) concept is that you are not evolving a proximal understanding of reality that is only mildly different from ‘noumenal reality’; no, your perceptions of reality were progressively built up, specialized, and abstracted from ‘older’ realities. I use ‘reality’ in the solipsist sense, i.e., reality exists for an individual organism in a very particular way, and organisms have no access to what the ‘actual’ universe looks like. Let’s get into it.

After reading Critique of Pure Reason, I was struck with the urge to align the Kantian worldview with evolutionary theory; he was so close to having a good explanation for reality, as it appears to you, that with a little modification and a little extension, we could have a perfect system of reality construction. The basic thought I had was that the Transcendental Aesthetic (space and time are not external realities, but forms of intuition) and the Transcendental Logic (how we think and organize the phenomena our minds generate) needed to be fully integrated to make proper sense of the origin of mind and how the mind constructs reality. It was not Kant’s project to explain the origin of mind; he was only interested in presenting the functioning of its existence and what philosophies of science and math we can extract out of this understanding. In this pursuit Kant was proven correct by posterity… but we can do better.

To quickly get you up to speed on Kantian theory: he was basically against the empiricist’s assumption that you are a passive observer of sensory data which you can objectively examine and experience, while also being against the rationalist’s perspective that reality can be understood and interpreted through mechanisms of reasoning alone. Kant’s central idea is that you cannot separate the interpretive structure of your sensory organs from the knowledge that can be gathered from that process. What that means is that the innate way you see the world is a priori knowledge, literally knowledge on how to interpret something before you have experienced that thing. This also means that concepts of space and time are intuitions of your mind rather than a part of the physical world. Taken further, your interpretive structures and the information gained from them cannot be separated. And that idea in particular stuck with me like a tick.

> "I think, therefore I am." Here is the phrase that has injured philosophy the most: a heady mixture of solipsism and egoism is all it took to shackle the minds of the great systematizes and force their expenditure on triviality. To make great systems and frameworks of mind means little, regardless of how correct one can be, unless there is practical application. We should measure our system’s correctness via its potential for practical application. "I think, therefore I am" is an island of mathematical tautology only fit (and attractive) for the autistic mind; given the language of communication, definitions, the cultural setting, and the genetic predisposition of the individuals committing to dialectic, we can examine an infinite variety of logical games which can all be true. Unfortunately, Kant takes Descartes' conclusion as the basis of his philosophy and creates a great synthesis from this premise. More unfortunately, the whole project of “philosophy” is tainted by a series of logical games generated out of misunderstanding, and the subject should usually not be engaged with seriously.

> Split the sea and walk across the ocean floor with me while tautologists play their games.

My modification of the thesis is: “Our representation of reality presupposes the interpretive structures required to generate 'reality' for the organism. The origin of mind comes from a manifold combination of interpretive structures and efficient responses as a single unit encoded in a particular way via evolutionary pressures. E.g., some encoding process (evolution) hammered out a lower-dimensional version of reality which—when understood and decompressed—can inform you of the version of reality one is seeing; this can be represented mathematically.”

For humans, we find the basis of our mind in the mental representation we have of reality, the ordered transmission of expedient data to the mind, the symbolic transmission between minds (conversation/writing), and the collective delusion we experience and confer to one another through symbolic transmission. Consciousness (this reality you are experiencing) is an individual projection, a group effort and affect, and is easier to think about as a series of ordered delusions cast onto our mental interior than as something entirely rational. We ‘evolved’ our ‘Synthetic A Priori Judgments,’ those mental structures that add contextual understanding to the subjects we are looking at, to react to increasingly complex streams of data gathered through sensory organs and processed into increasingly complex phenomena in our minds.

From simple rules, we can examine the development of complex systems. Imagine a group of organisms that can only see a 48x48 grid of pixels; they all ‘see’ a shape traveling across the pixel surface, and each organism has three programmed responses: Attack, Mate, Run. Let’s say one response in this scenario is correct: Attack. All organisms that didn’t have a bias toward that action didn’t pass on their genes. Natural selection refines the infinite data stream into biasing structures. Nothing new here. Let’s make this a little bit more complicated: what if we had a vector for the size of the observer and a vector for the size of the shape? We can imagine an interaction in which the bias for attacking or running depends on the relative size difference. However, size is a spatial phenomenon, so we need a vector for distance and position; and how do you compute the distance and position vectors? You’d need time and velocity vectors such that, over time, a shape’s relative increase or decrease in size gives its position and distance to the observer. You’d also need causality to infer that shapes seen in this state are the same shapes seen in the previous state. Thus, we can see a sort of synergistic interrelatedness between space, time, causality, and other things that probably share some primitive in the mind’s interpretive structures.
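The selection story above can be run directly as a toy simulation. The population size, mutation scale, and the "Attack is correct" rule are arbitrary assumptions for illustration, not claims about real evolutionary dynamics:

```python
import numpy as np

rng = np.random.default_rng(42)
ACTIONS = ["attack", "mate", "run"]
CORRECT = 0  # in this toy scenario, "attack" is the only surviving response

# Each organism is reduced to a bias vector over the three programmed responses.
population = rng.random((200, 3))

for generation in range(20):
    choices = population.argmax(axis=1)          # each organism acts on its strongest bias
    survivors = population[choices == CORRECT]   # wrong responders don't reproduce
    # offspring inherit the surviving biases with small mutation
    offspring = survivors[rng.integers(0, len(survivors), 200)]
    population = np.clip(offspring + rng.normal(0, 0.02, offspring.shape), 0, 1)

# after selection, essentially the whole population is biased toward "attack"
share_attack = (population.argmax(axis=1) == CORRECT).mean()
print(share_attack)
```

Nothing new here either: the point is that a filter on responses is enough to carve a biasing structure out of an undifferentiated data stream.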

Mathematics is a great example of these interpretive structures operating in concert and cohesively developing a system of rules that seems to be universally correct, even though mathematics isn’t part of the physical world. To count to 3, you need the a priori knowledge that: 1. separation exists, 2. separations in a ‘chunk’ of data constitute an ‘object,’ 3. there can be more than a singular object. From this premise (and a bit of a logical leap), you can count to 3: object 1 is in my mental interior, object 2 is in my mental interior, object 3 is in my mental interior, and the sum, an abstract ‘de-separation,’ is a variable representing all of the objects. This is Kant’s ‘synthesis’: a single act of knowledge generated from observing what is ‘manifold’ in them. From this basis, you can abstract the object from the physical data that is being represented in your mental interior and treat it independently of physical reality, yet another interpretive structure. This function of adding together objects where the sum is represented by a value contains ‘predicates,’ which provide extra context to abstract objects. We evolved a huge array of predicates to deal with data that needed synthetic manipulation for efficient responses.

So, we see the basis of mathematics, a synthetic a priori judgment that cannot be understood by ‘pure reason’ alone. ‘Math’ is just the most efficient way to interact with your interpretation of abstractions. In fact, your notion of what ‘reason’ is comes from a biasing mechanism that assists you in understanding & parsing a very complex set of environmental data, all in an effort to assist you in reproduction.

A posteriori knowledge, knowledge through experience, is subsumed by the a priori, because our interpretive structures form the basis by which we can learn things. We are designed to learn in (generally, not pedagogically) specific patterns; otherwise, our development is stunted. This is why we cannot separate understanding from interpretation; they are functionally the same thing. The practical application for pedagogy is that to do is to learn; however, this idea is hugely extensible and can explain very complicated behaviors. Your politics is the interpretation of your reality through your biases, your understanding of violence comes from your inbuilt interpretive structures, and your understanding of yourself is biased through your interpretive structures. Hence, the whole field of psychology/psychiatry is built on the discovery of these interpretive structures as “constructs,” a way to simplify the mind into factors that are computable.

However, all of the information I've just supplied to you is very Kantian in origin, and it doesn’t explain the core idea we’re tackling. What we’ve just talked about is pure conjecture on how the mind may work in an abstract way, and how reality is plausibly experienced; this has almost no practical application. It is merely a device to softly introduce the following: exploring how the mind works from the point of view of a mind is extremely difficult and is not a first-principles approach. By figuring out the primitives and then attempting to understand the origin of mind, we can get a more holistic understanding of what is happening to people.

From here, I’m going to make a critique of Kant, then we’re going to talk about the nature of phenomenal representations (and how to represent them mathematically), then we’re going to get into how those representations cascade into phenomenal structures, and then finally we’re going to integrate it all together so we can see how minds work. Simple.

However, I’m going to preface the next section, which is full of vitriol, by stating that I believe Kant’s childish moral philosophy has done more harm to the human race than any philosophy before it and should be exiled from our dialectic; we will not be touching on this as it is a complete distraction.

Now comes the worst part of reading Kant … the table of categories.

## A Fucking Table

> Kant explaining categories, Circa 1781 (colorized)

Kant's Table of Categories is key to the Critique of Pure Reason and outlines the fundamental concepts that the mind uses to structure experience. These categories are divided into four groups, each containing three categories. The four groups are quantity (unity, plurality, totality), quality (reality, negation, limitation), relation (inherence and subsistence, causality and dependence, community), and modality (possibility, existence, necessity). Kant believes these categories are standard and innate cognitive tools that allow us to interpret sensory data, giving form and structure to our experiences. While they sound plausible and nice, they are utterly made up: convenient devices to bridge the beginning of his philosophy with the end.

The worst part about reading German philosophers is the tour-de-force required for them to explain simple ideas (you'll find the irony in that statement soon, lmao); Kant explained his ideas so succinctly that his manifesto was barely 800 pages long, and he famously had no examples, only a continuous stream of logic. The second-worst part of reading German philosophers is that they assume everyone reading it will be German. I have no idea where the table of categories came from, I have no idea why he thought this argument from necessity would be sufficient, and I have no idea why he bases the rest of the book off of this stupid conception. Kant couldn’t contemplate the chaos of noumena becoming phenomena without his own mental instruments, but he cannot expect me to use the same flawed device as a way of coping with chaos.

Here are some quotes:

> “Without sensibility no object would be given to us, without understanding no object would be thought. Thoughts without content are empty, intuitions without concepts are blind. **Hence it is as necessary to make our concepts sensible, i.e., to add the object to them in intuition**, as to make our intuitions understandable, i.e., to bring them under concepts.”

> "The categories are concepts which **prescribe laws** a priori to appearances, and therefore to nature, the sum of all appearances (natura materialiter spectata). The categories are thus the conditions of the possibility of experience, and an experience is itself a cognition, an empirical cognition, that is, a cognition which determines an object a priori."

This is what happens when someone has a linguistic IQ of 150 and needs to explain away something so the rest of his theory works. Moreover, Kant never addresses differences in cross-cultural minds, let alone individual minds. Nietzsche is right again: ‘all philosophy is autobiographical.’

This is where I differ greatly from Kant, his assumptions leave too many gaps in his core philosophy and even if you were to understand The Critique Of Pure Reason in its totality, there would be _little_ practical application. I don’t believe the categories are a necessary predicate of experience. With a slight modification to Kant’s core, we can draw a vast ocean of wealth out of his philosophy, instead of a stupid table and everything that comes after it. Some aspects of the table _may_ be a linguistic description of what I describe later; however, as it stands now, it lacks composability, nuance, degrees of intensity, and it is not a universal key to understand any form of cognition.

---

## Part 1: Specifying the Area of Interest


## Vectors Are All You Need

> Earlier in my life, I wrote a chapter called ‘chaos’ in a book I was writing; it was essentially a long-winded deconstruction of light and matter into component datums that interact with each other, so light + a particular protein/molecule + interpretation from a brain = the experiential feeling of, for example, seeing a color. I stretched that idea past its breaking point when I went on to describe a ‘being’ that can ‘see’ all light without interpretive structures as an illustrative tool. I imagined it would be like a chaotic static without form. The point of describing that background is that, in the next chapter, I brought forward a nascent version of this Kantian theory I’m presenting now. We need to bias our interpretation of data to make heuristic sense of what’s going on around us and produce optimal responses. Afterwards, I diverge from the Kantian line by proposing the evolutionary model of transcendental idealism and a far more nuanced tool to understand reality.

I want to build a system that has universal composability, from something that is true at the biochemical level to the realm of mental interiority, and then further to the realm of macroscopic behavioral analysis, while still being flexible enough to be integrated into practical applications. As I've stated before, inside this framework, perception, interpretation, understanding, and action are manifold; they are all the same thing. This normalization greatly simplifies the mathematics of state transitions, as we no longer need to factor in different subroutines of the observer, and we can ignore the machinery of the brain entirely: a thing is what it does. You can map states of phenomenal transition from one form to another form and call that change the product of a ‘transcendental function.’ The molecules in a rod cell interacting with light are as transcendental as the ‘image’ being cast from the eyes into the brain, and then the action that a particular constellation of light causes the observer to take.

Let me declare my priors before continuing; we will return to each of them in more precise form later.

1. For the purposes of this framework, every observable change in matter can be modeled as a vector.
   1. These vectors are meaningless to an organism until an evolutionary process encodes significance onto the observer.
   2. I will call the pre-interpreted side of this picture **noumenal reality vectors**: the raw distinctions available prior to the organism’s full phenomenal organization of them.
      1. These are not “seen” directly by consciousness; they are the bits of raw, unfiltered reality that first enter the system through physical interaction.
2. Reality arrives as a continuous stream, but for modeling purposes we can sample that stream into discrete state-snapshots. At any given snapshot, a local noumenal state contains the vector-representable material changes relevant at that moment and position.
3. An observer’s qualia—phenomenal reality—can likewise be modeled as a state that updates over time. That state is the representation of feelings, visuals, memories, action-tendencies, and every other cognitive process available to the observer at that moment.
4. Evolution encodes a **seed** for the organism’s Transcendental Embedding. This is the inherited template of reality the organism can, in principle, be privy to: evolution guarantees an expected version of reality for the lineage, and that inherited seed structures how noumena are turned into phenomena and then more phenomena recursively at each moment of qualia.
   1. There exists an ambient latent space—conceptually open-ended, and idealized here as infinite-dimensional—into which experiencable phenomena can be projected.
   2. A transformation function maps noumenal inputs together with the current phenomenal state into new phenomenal representations.
   3. This inherited seed represents the organism’s interpretive lens: the lineage-fixed framework through which experience must first pass.
   4. This genetic seed is, in itself, unchanging and stateless; later development, memory, and history determine how that inherited template is realized in the individual.
5. The Transcendental Embedding transforms noumenal vectors into phenomenal vectors through two complementary processes:
   1. **Projection:** mapping raw inputs into meaningful coordinates within the organism’s latent space.
   2. **Transformation:** combining the current phenomenal state with new inputs under the inherited rule-set.
   3. These phenomenal vectors are representations of reality as it appears to the organism: this is the framework in which perception, interpretation, and action are treated as one continuous process.
6. Paired with this inherited seed is a transition rule that maps one phenomenal state to the next. In the idealized version of the framework, that rule is treated as fixed with respect to the seed itself, while the organism’s actual state supplies the changing input.
7. Taken together, the inherited seed and the transition rule generate phenomenal vectors. Those phenomenal vectors are reality as it appears to the organism.
   1. Noumenal reality vectors are invisible to consciousness; they are only encountered through the chain of physical interactions that first register them.
   2. Everything available to conscious experience is already on the phenomenal side of the transformation.
   3. In principle, one can trace the transformation of a noumenal vector into a phenomenal vector through the biochemical and computational chain that produces the experience.
8. The recursive mapping from one phenomenal state to the next is, for the organism, its lived reality.
   1. In the idealized version of this framework, I assume a deterministic update rule: given the organism’s inherited seed and the full current phenomenal state, the next phenomenal state is fixed.
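Priors 4 through 8 can be rendered as a toy sketch. The projection matrix and rule-set below are arbitrary placeholders standing in for the lineage-encoded structure; nothing here claims to be the real embedding, only the shape of projection, transformation, and the deterministic recursive update:

```python
import numpy as np

rng = np.random.default_rng(7)
D_NOUMENAL, D_LATENT = 5, 12

# The inherited seed: a fixed, stateless projection plus a fixed rule-set (prior 4.4).
# Both matrices are toy placeholders for the lineage-encoded structure.
P = rng.standard_normal((D_LATENT, D_NOUMENAL))   # projection (prior 5.1)
R = 0.5 * np.eye(D_LATENT)                        # transformation rule-set (prior 5.2)

def project(noumenal_vec):
    """Projection: raw input -> coordinates in the organism's latent space."""
    return P @ noumenal_vec

def transform(phi_current, projected):
    """Transformation: current phenomenal state combined with new input
    under the inherited, unchanging rule-set (priors 5.2 and 6)."""
    return R @ phi_current + projected

phi = np.zeros(D_LATENT)
for t in range(3):
    n_t = rng.standard_normal(D_NOUMENAL)   # local noumenal state-snapshot (prior 2)
    phi = transform(phi, project(n_t))      # recursive phenomenal update (prior 8)

# determinism (prior 8.1): same seed + same current state + same input => same next state
phi_next_a = transform(phi, project(n_t))
phi_next_b = transform(phi, project(n_t))
print(np.allclose(phi_next_a, phi_next_b))  # prints True
```

The noumenal vectors never appear in `phi` directly; only their projected, transformed image does, which is the sense in which everything available to experience is already on the phenomenal side.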

> WARNING: I'm using a simpler, incomplete and somewhat contradictory version of the term "Transcendental Embedding" so that it will be easier to grok now, but will be explained more fully later.

From your eyes to inside your mind, you are currently running extremely complex systems to organize these letters into discernible symbols and then translate that ordering of symbols into information you can grok. But let’s pretend, for a moment, that you were much dumber than you are now—so dumb, in fact, that you are not even conscious. Things simply happen to you. You barely have a sense of time. Your vision is more like a flash of symbols to which you can only have strong affective reactions. Here is a story of your life:

`You’re walking along and suddenly, out of nowhere:`

`A0 [0.54, -0.13, 0.75, 0.42, -0.26, 0.87]`

`Of course, in a moment of panic, you feel:`

`B [0.32, 0.69, -0.15, 0.78, 0.25, -0.44]`

`And instinctually you do:`

`C [0.61, -0.33, 0.48, 0.91, -0.18, 0.36]`

`Whew, thank God that’s over; now you want to do:`

`D [0.27, 0.72, -0.09, 0.65, 0.41, -0.53] with the:`
`	A1 [0.54, -0.13, 0.75, 0.42, -0.26, 0.10]`

`	...Kinda gross, but whatever.`

Notice the small change in the last coordinate between the first and last object-vector, `.87 → .10`. We can infer that most of the structure remains intact while one aspect of the represented object has changed. In this toy example, the system’s bias structure turns

`A0 [0.54, -0.13, 0.75, 0.42, -0.26, 0.87]`

into

`A1 [0.54, -0.13, 0.75, 0.42, -0.26, 0.10]`,

and everything in between is the vector representation of the internal phenomenal activity required to bring about that change. On this view, the feeling and the instinctive action can be written in the same general format.

For readability, I decomposed the previous series into separate time-steps. That is not how the process actually unfolds. In reality, these states overlap and bleed into one another. The point of the decomposition is only to show how one structured representation can recruit another by shared positions. If `A0` activates the system and `B` appears immediately afterward, you should imagine the coordinates of `A0` and `B` occupying the same larger state-space, with some regions active and others blank. In that more realistic presentation, the same story looks like this. Let `S(n)` denote the state at discrete modeling step `n`:

`S1 [... 0.54, -0.13, 0.75, 0.42, -0.26, 0.87, 0, 0, 0, ... 0]`
`      A0 alone`
`S2 [... {0.54, -0.13, 0.75, 0.42, -0.26, 0.87},`
`        {0.32, 0.69, -0.15, 0.78, 0.25, -0.44},`
`        0, 0, 0, 0, 0, ... 0]`
`      A0 + B`
`S3 [... {0.54, -0.13, 0.75, 0.42, -0.26, 0.87},`
`        {0.32, 0.69, -0.15, 0.78, 0.25, -0.44},`
`        {0.61, -0.33, 0.48, 0.91, -0.18, 0.36},`
`        ... 0]`
`      A0 + B + C`
`S4 [... {0.54, -0.13, 0.75, 0.42, -0.26, 0.10},`
`        0, 0, 0, 0, 0, ... 0]`
`      A1`
`S5 [... {0.54, -0.13, 0.75, 0.42, -0.26, 0.10},`
`        {0.27, 0.72, -0.09, 0.65, 0.41, -0.53},`
`        0, 0, 0, 0, 0, ... 0]`
`      A1 + D`

I added the `{}` only to make the story more legible and to separate features of the represented state; they are not part of the formal system itself. Notice also that the `B` and `D` representations share positions. That suggests a region of the state-space associated with an affective or motivational response to the coordinates occupied by `A`.
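The slot picture above can be sketched in a few lines. The layout below is an entirely hypothetical assumption: each named representation owns a six-coordinate region of a larger 30-dimensional state vector, and the braces of the prose version become array slices. Note that `B` and `D` are assigned the same region, mirroring the shared positions in the story.

```python
import numpy as np

# Hypothetical slot layout: each named representation owns a fixed
# 6-coordinate region of a larger (here 30-dim) state vector.
DIM = 30
SLOTS = {"A": slice(0, 6), "B": slice(6, 12), "C": slice(12, 18),
         "D": slice(6, 12)}  # D reuses B's region, as in the story

A0 = [0.54, -0.13, 0.75, 0.42, -0.26, 0.87]
A1 = [0.54, -0.13, 0.75, 0.42, -0.26, 0.10]
B  = [0.32,  0.69, -0.15, 0.78,  0.25, -0.44]
C  = [0.61, -0.33,  0.48, 0.91, -0.18,  0.36]
D  = [0.27,  0.72, -0.09, 0.65,  0.41, -0.53]

def state(active):
    """Build one state vector with the named regions filled in."""
    s = np.zeros(DIM)
    for name, values in active:
        s[SLOTS[name]] = values
    return s

S1 = state([("A", A0)])                      # A0 alone
S2 = state([("A", A0), ("B", B)])            # A0 + B
S3 = state([("A", A0), ("B", B), ("C", C)])  # A0 + B + C
S4 = state([("A", A1)])                      # A1, most structure intact
S5 = state([("A", A1), ("D", D)])            # A1 + D
```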

I am using a more tangible action-based example here, but the same logic would apply to something as simple as light hitting a photoreceptor and the observer mapping color and position into a phenomenal representation. Depending on the application, you can average the state over time, or produce a higher-order vector representation of the whole sequence. That may be lossy, but that is acceptable: evolution does not need a perfect copy of reality. It needs an approximation of reality good enough to bring about adaptive state-transitions.

There are many ways to describe the world. People often say that functions describe the world, and that is true as far as it goes. But most approximations—including functions in isolation—describe only the world of appearances available to the model. The advantage of the vector-space picture is that it lets us describe many different levels of organization inside one composable framework. Neural networks are still useful here, but they are downstream processors of structured representations; they are not, by themselves, a theory of how those representations are made available to the organism.

So, in this section, I assume an open-ended nonlinear mapping rule capable of weighting and summing phenomenal vectors as they co-occur. That rule takes the current phenomenal state, maps it through the organism’s inherited interpretive structure, and yields a new phenomenal state. Later, when we get to applications, the practical problem will be path-isolation and signal-processing: given the noise of a large state-space, which contributing factors produce the strongest signal for the transition we care about?
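That mapping rule can be sketched as follows. The coupling matrix `W` and the `tanh` nonlinearity are stand-in assumptions; the section only requires some deterministic nonlinear rule that weights and sums the current phenomenal state through the inherited structure to yield the next state.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 30  # dimensionality of the (finite) phenomenal state

# Stand-in for the inherited interpretive structure: a fixed linear
# coupling followed by a saturating nonlinearity. Both are assumptions;
# the text only posits some open-ended nonlinear mapping.
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))

def next_state(phi):
    """Deterministically map the current phenomenal state to the next."""
    return np.tanh(W @ phi)

phi = rng.normal(size=d)
trajectory = [phi]
for _ in range(5):                  # a short stretch of "lived reality"
    trajectory.append(next_state(trajectory[-1]))
```

Because the rule is deterministic, the same seed structure and the same current state always produce the same next state, which is exactly the idealization assumed earlier.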

The broader claim is not that the organism first builds a neural network and only then acquires a world. It is the reverse. The organism inherits a structured way of carving reality into usable distinctions, and the processor it builds later operates within that inherited space. Evolution gives the observer a repertoire of vector-like distinctions—time, space, color, objectness, shape, bodily boundary, hierarchy, proportion, attention, lower-order affect, higher-order affect, symbolic meaning, interiority, and so on. Action is a byproduct of the continuous processing of these distinctions through the organism’s interpretive structure.

It is important to understand the primitives before the abstractions. We want a representation of phenomenal life that is mathematically tractable without pretending that every observer receives the same world in the same way. The same broad stream of reality may confront multiple observers, but different inherited structures and different realized histories will transform that stream into different phenomenal outcomes.

If you want an intuitive picture, think of the inherited structure as a massive coordinate-ready template and of lived history as the process that determines how that template is realized in the individual. In principle, that structure constrains how you can react to the world; in practice, any model we build will only ever approximate that structure. Evolution is the encoding protocol, compression is part of the mechanism, and the point of the framework is to describe how reality is described for an organism—not to confuse the description with the thing itself.

Why use an embedding system rather than just talk about neural networks? Because I am not trying only to describe a processor. I am trying to describe the representational conditions that make processing possible in the first place. In that sense, what this section does is combine the space/time side of experience with the rest of phenomenal life into one common representational framework, rather than treating them as separate faculties that must later be stitched back together.

Note also that we are never really processing one isolated datum at a time. The phenomenal stream is continuous, even when the model samples it discretely. The inherited structure is relatively stable; your experiences, memories, and moods are not modifications to the seed itself so much as modifications to the stream it is processing and to the realized organization built on top of that seed. That is why this device can represent what Kant’s table of categories cannot: not just a generic human mind, but, in principle, the different ways minds can be structured across organisms and across individuals.

So ends **Kant from the Evolutionary Perspective** and **Vectors Are All You Need**.

Next, we get into the real meat: what an application of this theory looks like, and how to do these calculations.

## The Nature of Phenomenal Reality: What are we trying to measure?

Most of the time, when mathematicians use the ‘infinite’ it is to simplify a problem and say something concrete about finite things. We continue that tradition here: by emphasizing the infinite ways in which reality can be understood, we can isolate a finite representation relevant to us. To simplify further, if you haven’t gathered this already, we are working with a deterministic model rather than a probabilistic one: we assume that if perfect knowledge of the noumenal and phenomenal reality could be encoded into the state, and if we knew the nonlinear mapping function with precision, then we could obtain the next state. Philosophically, probability is just a way of calculating around a lack of understanding of the world, and no event is truly random; thus, for the next series of equations, we also assume a deterministic world governed by finite cause and effect.

In the previous section, we went over the concept of Transcendental Embeddings and assumed the finite vector space the organism plays on. In ‘God’s Infinite Dimensional Space’ there are infinite dimensions of information: infinite ways to represent the underlying vectors of information that make the disordered world into an apparent unity for the observer. Fundamentally, this is what quantum physics attempts with Hilbert spaces: they describe the state of a particle with as many dimensions of information as needed to specify what the particle is doing. In the same way, we describe the state of the observer with as many dimensions of information as needed to specify what the next state of observation will be.

So what we are trying to find first is the universal function that maps the current state onto the next state for the organism. Keep in mind we are trying to approximate the whole evolutionary history of the organism as that history encoded a representation of reality onto the organism’s genes. So how do we reduce the complexity to something tangible? Let’s start by recalling that every organism’s representation of reality—its distinct “slice” of the infinite dimensional chaos—emerges from the genetic blueprint that encodes both the capacity to perceive and the biases that drive interpretation. This genetic blueprint constrains the otherwise unbounded dimensionality of possible experiences to something finite. The guiding question, then, is how to operationalize this genetic constraint in a mathematical sense so that we can start constructing (or at least approximating) that “universal function” which transitions an organism from one phenomenal state to the next.

From the infinity of potential experiences, genes carve out a stable but extraordinarily high-dimensional subspace: the Transcendental Embedding. Within this subspace, an organism’s capacities—whether they be sensory (e.g., the ability to sense temperature, pressure, color, etc.) or conceptual (e.g., forming an internal sense of power, confidence, or fear)—are written in the coordinates of a Hilbert space. Each gene (or set of genes) provides a vectorial “template” for how certain categories of data (like “red light,” “predatory motion,” “threat posture,” “affection cues,” etc.) get processed and transformed.

The trick is, no organism has conscious access to these gene-level templates; instead, they appear as “normal” to the experiencing subject. That is, if your lineage evolved to treat the color red as an immediate signifier of threat, that intense emotional surge you feel upon seeing red is not an option but a genetically inherited, encoded response. The genetically molded embedding architecture—the shape of your finite subspace—guarantees that specific transformations from input to output (or from “noumenal vectors” to “phenomenal vectors”) will feel self-evident or even inescapable.

Also, keep in mind that we do not want a statistical framework; this is, in essence, a deterministic model in which we are attempting to do calculus rather than statistics. To start with, probability would introduce unnecessary complexity, and thus we have to ask the question: what if reality weren’t probabilistic?

### The Evolutionary Mechanism for Encoding Transcendental Embeddings

It is useful to imagine humans as experiencing a small slice of all possible ways to experience reality. Within this small slice, there is enormous variation; however, we are still subject to an evolutionary inheritance that provides us with a predictable range of ways to perceive and structure reality. You could imagine the way humans experience reality as a series of dimensions. To be simple, we could have dimensions for the range of colors we see, as well as dimensions for how we position those colors in our mental interior when a photon hits our eyeball. We could build on this by adding other dimensions and associating groups of pixels. Then, we could create structures from those groupings and assign dimensions that associate meaning (or potential for meaning) with those superstructures. Finally (you could imagine) we have dimensions that represent placeholders for the superstructures humans expect in their lives. An example of this would be something like a mother figure, or what a place to sit looks like, or what violence is, or even something complex like a god object. These are complex phenomena that depend on humans holding a particular psychic position in reality; the god object, for example, could be computed by finding and averaging the vectors of communal bonding, fatherhood, war, purity, spite, revenge, love, sacrifice, death, externalized meaning, care, fear, language, outsiders, and so on. The addition of all these vectors could make a person strongly or weakly predisposed towards believing in god; coupled with a supporting or dismissive society, this yields a range of possible outcomes for how a human internalizes and practices the god object.

All of the sub-vectors of the god object are their own series of complex dimensions as well. The war object might require communal bonds, hunger, fear, negative ethnocentrism, concepts of ownership and land, hierarchy, etc. Let’s get even simpler: the hunger object could be composed of the vectors of glucose saturation, fullness, thirst, fat concentration, presence of ghrelin, and other biochemical factors. The sleight of hand I performed was relating biochemical signals to things that exist purely as concepts in the mind of the experiencer, like god and war. Ah, but that is the point! To the mental interior, there is no difference between the experience of feeling hungry (even though it has simplistic origins) and feeling like your group is at war; they both offer the same qualia to the experiencer. Humans (to varying individual degrees of intensity and presence) have a space in their minds for both concepts.
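The composition claim can be sketched numerically: a composite object as a weighted average of its sub-vectors. The axis names and weights below are illustrative assumptions, not measured quantities; the point is only that a "hunger object" and a "war object" admit the same representational form.

```python
import numpy as np

# Hypothetical sub-vectors (unit directions in a shared 8-dim space);
# names and weights are illustrative, not measured quantities.
rng = np.random.default_rng(1)
axes = {name: rng.normal(size=8) for name in
        ["glucose_saturation", "fullness", "thirst",
         "fat_concentration", "ghrelin"]}
for k in axes:
    axes[k] /= np.linalg.norm(axes[k])   # normalize each sub-vector

weights = {"glucose_saturation": -0.8, "fullness": -0.6, "thirst": 0.2,
           "fat_concentration": -0.3, "ghrelin": 0.9}

def composite(axes, weights):
    """Weighted average of sub-vectors: one candidate 'hunger object'."""
    total = sum(weights[k] * axes[k] for k in axes)
    return total / len(axes)

hunger = composite(axes, weights)
```

The same `composite` call would serve for the god object or the war object; only the sub-vector inventory and the weights change.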

All these complex traits being the results of a series of simpler objects makes their examination and deracination possible, as well as completely computable. But how did humans get to become so complex? Why is the range of our experiences so large relative to other creatures? How did we obtain this Transcendental Embedding? To be overly simplistic, we can just trace our evolutionary lineage backward and see the additions and pruning of dimensions over time. This process also gives us insight into which organisms we overlap with in experience; however, it is crucial to note that convergent evolution for mental interiors is entirely possible. I’m sure corvids and humans use very different neural hardware to arrive at the same implicit understanding of the physics of buoyancy and water, and thus we can ‘share’ some dimensions of how we experience reality, just as most complex organisms share the reality dimensions of 3D space and time.

> **Note:** To be very clear, some number N of seemingly arbitrary vectors creates all of the expected reality you have. To have any level of precision, it is helpful to imagine this vector array in the hundreds of billions per human, and maybe only a couple of hundred for something like an earthworm.

But let’s start at the beginning and answer the question: how did every organism gain the dimensions by which it understands reality? We begin with the most austere state imaginable: a membrane-bound molecule that can do only one thing—detect whether its ionic balance crosses a critical threshold that will rupture the lipid wall. That single binary distinction creates the first coordinate of an experiential space: intact → viable, ruptured → death. Every replication cycle that preserves this discrimination reinforces its fitness value, so the “membrane-rupture axis” becomes genetically fixed.

Random copying errors then throw up tiny tweaks: a peptide that bends when it binds a proton, a chromophore that flips shape when hit by a photon, an ion channel that opens more readily in warmer fluid. Each tweak is, in effect, a proposal for a new axis, a new vector of feeling reality. Most axes cost energy without adding reproductive payoff, so they fade. Occasionally, one lets the organism swim toward glucose or duck away from acid, and descendants bearing that vector out-compete their siblings. The vector is retained, and the organism’s Transcendental Embedding expands from 1D to 2D, 3D, and so on.

With every retained axis, earlier ones are not discarded but re-encoded into higher-order compound dimensions. Once a light-sensitive molecule exists, downstream mutations wire two such molecules together, letting the organism register differences in intensity rather than mere presence. That difference becomes a new vector—contrast. A later duplication introduces a second pigment shifted in wavelength; the comparator circuit now yields a chromatic axis. What began as a single light/no-light bit has unfolded into a color cube where the vectors of understanding reality become synergistically linked.

This simple logical mechanism is highly scalable. Chemical gradients become taste maps; pressure sensors become a multi-point body schema; socially relevant signals (gaze direction, group ranking) become vectors in an emerging social manifold. At each step, selection keeps only those dimensions whose ‘predictive power’ outweighs their metabolic drag. Complexity is the byproduct of past environments and developments that rewarded sharper distinctions of the noumena and phenomena available to the organism.

By the time we reach hominids, the embedding has accrued countless axes. Some are exteroceptive (hue, pitch, depth), others interoceptive (blood CO₂, hunger peptides), and others still are purely synthetic (synthetic a priori: to Kant this is an impossibility, but for us it is the nature of reality), patterns that exist only as concepts such as tool, ally, lover, or taboo. The aggregate is a pseudo-species-level template: the expected set of distinctions any typical human can, in principle, entertain. However, the level of resolution of our examination should be the individual organism. An individual inherits their template, their seed Transcendental Embedding, fully formed at birth. If an illness or pathology in the individual eliminates the red-cone pigment, the axis for “long-wave chromatic contrast” is still present in the template; the organism ‘expects’ that information to be present, but the data stream along it is fixed at zero. Reality shrinks by one shade, even though its axis persists in the mathematically expected blueprint of this individual. Our concept of self-preservation appears complex, but it is a product of our ancestors’ efforts to keep their lipid membranes from rupturing. Within these vectors, of course, weights of importance can be dictated by distance from the origin; self-preservation, for example, would sit relatively far from it.

To be brief:

1. A mutation is proposed that creates a new distinction in the fabric of reality
   1. This mutation constitutes a new vector of qualia for the organism
2. Based on reproductive success, the mutation is passed down.
3. Retention of the mutation embeds that vector permanently in all offspring of the organism unless pruned.
4. Richer composite coordinates are created by the addition of new vectors, creating more differentiated versions of reality.

## Formalization

> _(Bold lowercase symbols denote column vectors, bold uppercase symbols denote matrices, and calligraphic symbols denote spaces or distributions.)_

Before we continue: the philosophical argument I gave was to ensure that we can simplify the essential elements down to the point where we can inject them into standard, proven mathematical frameworks and systems. From here on, if the previous axioms have been accepted, I build off of them in a series of layered sequences and attempt to formalize a proof. I have to reiterate: I am not inventing new math.

Lastly, this work I've done is an absolute tour-de-force of effort; as a consequence of that, we should assume that there can be logical gaps or errors in my formalization. Email me if you see anything wrong.

### 1\. Universal arena

Let \(\mathcal N\) denote the **noumenal arena**: the ambient space of possible distinctions from which an organism’s reality is carved.

> We need a container large enough to hold every possible distinction reality could make for all organisms and people, including the distinctions no organism has ever perceived. Take \(\mathcal N\) to be a real separable Hilbert space: an infinite coordinate system where each axis represents a genuinely independent distinction that can phenomenally be made. For concreteness, think of \(\ell^2\), the space of infinite sequences of real numbers that don't blow up when summed, which gives us the tidiness we need without sacrificing the infinite space we require. Let \((\mathbf e_k)_{k=1}^{\infty}\) be an orthonormal basis for \(\mathcal N\). A full noumenal microstate at time \(t\) is then

\[
\mathbf n_t = \sum_{k=1}^{\infty} n_{t,k}\mathbf e_k \in \mathcal N.
\]

This ambient state is **not** what the organism experiences. It is the total field of possible distinctions relevant to that moment. Experience begins only after a lineage-specific projection has been applied.

If one wants the intuitive reading: \(\mathbf n_t\) is the full noumenal state, while \(\mathcal N\) is the space in which such states live.

### 2\. Species-level Transcendental Embedding

Evolution does not give a lineage access to all of \(\mathcal N\). It preserves a finite set of distinctions that proved useful for survival and reproduction. These distinctions span the lineage’s accessible subspace.

Define the **species-level transcendental embedding** by

\[
\mathcal M^{\mathrm{spec}}
=
\operatorname{span}\{\mathbf v_1,\ldots,\mathbf v_d\}
\subset \mathcal N,
\qquad d<\infty.
\]

Here \(\mathbf v_1,\ldots,\mathbf v_d\) are the lineage-selected axes of distinction. For simplicity, assume they are orthonormal:

\[
\langle \mathbf v_i,\mathbf v_j\rangle = \delta_{ij}.
\]

This lets us represent the species template in two equivalent ways.

First, as a **projection operator** onto the accessible subspace:

\[
P^{\mathrm{spec}}:\mathcal N \to \mathcal M^{\mathrm{spec}},
\qquad
P^{\mathrm{spec}}\mathbf n
=
\sum_{i=1}^{d}\langle \mathbf v_i,\mathbf n\rangle \mathbf v_i.
\]

Applied to the noumenal state \(\mathbf n_t\), this yields the part of the world the lineage can in principle access:

\[
\widetilde{\mathbf n}_t
=
P^{\mathrm{spec}}\mathbf n_t
\in \mathcal M^{\mathrm{spec}}.
\]

Second, as a **coordinate encoder** that expresses the same accessible slice in \(d\) coordinates:

\[
E^{\mathrm{spec}}:\mathcal N \to \mathbb R^d,
\qquad
E^{\mathrm{spec}}(\mathbf n_t)
=
\begin{bmatrix}
\langle \mathbf v_1,\mathbf n_t\rangle \\
\vdots \\
\langle \mathbf v_d,\mathbf n_t\rangle
\end{bmatrix}.
\]

In matrix form,

\[
\mathbf E^{\mathrm{spec}}
=
\begin{bmatrix}
\mathbf v_1^\top \\
\vdots \\
\mathbf v_d^\top
\end{bmatrix},
\qquad
\mathbf z_t = \mathbf E^{\mathrm{spec}}\mathbf n_t \in \mathbb R^d.
\]

The vector \(\mathbf z_t\) is the lineage-accessible coordinate description of the noumenal state at time \(t\). It is not yet the full phenomenal state. It is the species-level input from which phenomenal processing will later be built.

This preserves the earlier intuition:

- \(\mathcal N\) is the total noumenal arena.
- \(\mathcal M^{\mathrm{spec}}\) is the finite slice evolution preserved for the lineage.
- \(\mathbf z_t\) is the organism-usable coordinate form of that slice at a moment in time.

If one wants the “pick a column” picture, each \(\mathbf v_i\) can be taken as one basis direction \(\mathbf e_{k_i}\). If one wants composite detectors, each \(\mathbf v_i\) can be a weighted combination of basis directions. The second form is more general and is the better default.
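Both representations can be checked numerically on a finite stand-in for \(\mathcal N\) (truncating \(\ell^2\) to \(K\) coordinates). The `qr` construction of the orthonormal axes below is an implementation convenience, not part of the theory; the check itself is that encoding into \(\mathbb R^d\) and re-embedding equals applying the projection operator.

```python
import numpy as np

# Finite stand-in for the noumenal arena: truncate l^2 to K coordinates.
K, d = 50, 3
rng = np.random.default_rng(2)

# d orthonormal lineage axes v_1..v_d (columns of V), via QR.
V, _ = np.linalg.qr(rng.normal(size=(K, d)))

n_t = rng.normal(size=K)        # noumenal microstate n_t

# Coordinate encoder: z_t = E^spec n_t in R^d.
z_t = V.T @ n_t

# Projection operator: P^spec n_t = sum_i <v_i, n_t> v_i.
P = V @ V.T
n_tilde = V @ z_t               # re-embed the coordinates

# The two descriptions agree: encode-then-embed equals project.
assert np.allclose(n_tilde, P @ n_t)
```

The projection is idempotent (projecting twice changes nothing), which is the operator-level statement that the lineage-accessible slice is a fixed subspace.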

### 3\. Axis-creation rule (mutation \+ selection)

Now let evolutionary time be indexed by \(\tau\), and let \(\mathcal M_{\tau}^{\mathrm{spec}}\) denote the current species-level template at that stage.

A mutation proposes a candidate distinction

\[
\Delta \mathbf v \in \mathcal N.
\]

To determine whether it adds a genuinely new axis, first remove the part already captured by the existing template:

\[
\Delta \mathbf v_{\perp}
=
\Delta \mathbf v - P_{\tau}^{\mathrm{spec}}\Delta \mathbf v.
\]

If \(\Delta \mathbf v_{\perp}=0\), then the mutation adds no new distinction: the information it would supply is already representable in the current template. If \(\Delta \mathbf v_{\perp}\neq 0\), normalize it:

\[
\widehat{\Delta \mathbf v}_{\perp}
=
\frac{\Delta \mathbf v_{\perp}}{\|\Delta \mathbf v_{\perp}\|}.
\]

This normalized vector is the genuinely new candidate axis.

Let \(\mathcal E_{\tau}\) denote the distribution of environments encountered by the lineage at evolutionary stage \(\tau\). Let

\[
W(e,\mathcal M)
\]

denote expected reproductive value in environment \(e\) for organisms whose accessible subspace is \(\mathcal M\), and let

\[
C(\widehat{\Delta \mathbf v}_{\perp})
\]

denote the cost of maintaining the new axis: metabolic cost, developmental cost, wiring cost, false-positive cost, and related burdens.

Define the net fitness contribution of the candidate axis by

\[
\Delta \Phi_{\tau}(\widehat{\Delta \mathbf v}_{\perp})
=
\mathbb E_{e\sim \mathcal E_{\tau}}
\Big[
W\!\big(e,\mathcal M_{\tau}^{\mathrm{spec}} \oplus \operatorname{span}\{\widehat{\Delta \mathbf v}_{\perp}\}\big)
-
W\!\big(e,\mathcal M_{\tau}^{\mathrm{spec}}\big)
\Big]
- C(\widehat{\Delta \mathbf v}_{\perp}).
\]

The retention rule is then

\[
\mathcal M_{\tau+1}^{\mathrm{spec}}
=
\begin{cases}
\mathcal M_{\tau}^{\mathrm{spec}}
\oplus
\operatorname{span}\{\widehat{\Delta \mathbf v}_{\perp}\},
& \text{if } \Delta \Phi_{\tau} > 0, \\[6pt]
\mathcal M_{\tau}^{\mathrm{spec}},
& \text{if } \Delta \Phi_{\tau} \le 0.
\end{cases}
\]

So the story is:

1. mutation proposes a candidate distinction;
2. subtract what the lineage already captures;
3. evaluate expected reproductive gain minus cost;
4. retain the axis only if the net contribution is positive.

Iterating this rule across evolutionary time yields the species-level accessible template. Repeat it as many times as needed and you have the species template embedding for humans.
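Steps 1 through 4 can be sketched as one update function. The scalar `fitness_gain` and `cost` arguments are stand-ins for the expectation term \(\mathbb E_e[W(\cdot)-W(\cdot)]\) and \(C(\cdot)\); estimating those quantities is the hard part and is not attempted here.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 40                                   # truncated noumenal arena

def propose_and_retain(V, delta_v, fitness_gain, cost):
    """One step of the axis-creation rule.

    V: (K, d) orthonormal template; delta_v: candidate distinction.
    fitness_gain/cost: scalar stand-ins for E[W(...) - W(...)] and C(.).
    Returns the (possibly expanded) template."""
    # Step 2: subtract what the lineage already captures.
    residual = delta_v - V @ (V.T @ delta_v)
    norm = np.linalg.norm(residual)
    if norm < 1e-12:                     # nothing genuinely new
        return V
    axis = residual / norm               # normalized candidate axis
    # Steps 3-4: retain only if net contribution is positive.
    if fitness_gain - cost > 0:
        return np.hstack([V, axis[:, None]])
    return V

V = np.linalg.qr(rng.normal(size=(K, 2)))[0]     # start with d = 2
V = propose_and_retain(V, rng.normal(size=K), fitness_gain=1.0, cost=0.4)
assert V.shape == (K, 3)     # retained: the template grew by one axis
V = propose_and_retain(V, V[:, 0], fitness_gain=5.0, cost=0.0)
assert V.shape == (K, 3)     # redundant candidate: no new axis, ever
```

Note that a redundant candidate is rejected before fitness is even consulted: if the residual is zero, the distinction is already representable.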

### Species template vs. individual embedding

This is a lineage-level template. It says what kinds of distinctions a member of the lineage can in principle represent.

\[
\mathcal M^{\mathrm{spec}}
\quad \text{or equivalently} \quad
\bigl(\mathcal M^{\mathrm{spec}}, E^{\mathrm{spec}}\bigr).
\]

It is **not yet** the full individual transcendental embedding. We'll get to that in the next section.

---

## Part 2: Deriving the Transcendental Embedding

Part 1 described how evolution carves a finite repertoire of distinctions out of the noumenal space. That account explains why an organism has a "world" at all. It does not yet explain why one human inhabits that world differently from another human, even when both inherit the same species-level template. Part 2 answers that question.

## The Technical Scope (because otherwise I'll accidentally lie to you)

Before I keep descending into Kantian hell, I need to pin down the scope so I do not smuggle Kant into the parts that are supposed to be engineering.

There are really three layers running through the rest of this paper.

> First, there is the **interpretive layer**: the noumenal/phenomenal story that motivates why an observer should have structured experience at all.

> Second, there is the **predictive layer**: the formal object the theorems will actually touch. That object is **not** the whole ineffable mush of phenomenal life in itself, but a task-conditioned predictive observer-state: the minimal representation that preserves the conditional law of future observables under admissible propositions.

> Third, there is the **control layer**: once such a state can be estimated, we can rank or search over candidate propositions by their predicted effect on the observer's next task-relevant state and downstream objective. That is the whole point. But causal claims there require intervention-grade data, not just retrospective logs and me getting excited.

The draft up to this point used one name, _Transcendental Embedding_, for several different things at once. From here onward I separate them. I find the staged transition easier to read, under somewhat dubious pretenses, than dumping everything on you, the reader, at once:

> First, there is the **inherited template**: the repertoire of distinctions a human organism can in principle host inside their mental interior.

> Second, there is the **realized individual embedding**: the weighting and coupling of that template in one person after development, language, culture, memory, and repeated experience.

> Third, there is the **phenomenal state at a time**: the full lived condition of the organism now.

> Fourth, there is the **predictive observer-state** for a task and horizon: the smallest state that still preserves what we need to forecast.

> Fifth, there is the **estimated embedding**: the low-resolution object we can compute from outward traces.

Part 2 is the transition from the first object to the other four.
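One way to keep the five objects from re-collapsing into a single name is to pin each to its own type. The field names below are placeholder assumptions, not a committed schema; the sketch only enforces that the objects cannot be silently substituted for one another.

```python
from dataclasses import dataclass, field

# Five distinct objects that the draft previously shared one name for.
# All field types are illustrative placeholders, not a committed schema.

@dataclass(frozen=True)
class InheritedTemplate:        # repertoire of hostable distinctions
    axes: tuple

@dataclass
class RealizedEmbedding:        # template as weighted/coupled in one person
    template: InheritedTemplate
    weights: dict = field(default_factory=dict)

@dataclass
class PhenomenalState:          # full lived condition at time t
    t: float
    values: dict

@dataclass
class PredictiveObserverState:  # minimal state for a task and horizon
    task: str
    horizon: int
    features: dict

@dataclass
class EstimatedEmbedding:       # low-resolution object from outward traces
    source_channels: dict       # kept separate, not averaged at ingestion
```

Making `InheritedTemplate` frozen mirrors the earlier idealization that the seed itself is fixed while everything built on top of it varies.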

## Behold; You! The Chimera

Call the person-in-role "object" a **Chimera**. The term is mnemonic only. The theoretical work is done by the fact that a person is never encountered in the abstract, but always as a person under a role, inside an institution, in a regime, in a place, at a time. It would be computationally challenging if we didn't introduce this categorization now, even though it amounts to a shortcut and a bastardization of the philosophical thesis.

> If an alien biologist watched human outputs only, human life would look repetitive. Much of what humans do can be reduced, at the level of gross behavior, to self-preservation, courtship, reproduction, kin-bonding, status competition, alliance formation, and resource control. The outer patterns recur. The difficulty lies elsewhere. Human beings often arrive at similar outputs by different internal routes.

Practically, for what I, the author, am interested in (GTM engineering): imagine one founder rejects a product because of caution. Another rejects it because of fear. Another because the price signals weakness. Another because the pitch activated a prior bad memory. Another because the role they occupy requires public skepticism. Same output, different internal geometry.

That difference is the point of this section.

Let \(G_i\) denote the inherited template available to person \(i\). This is the species-compatible repertoire of possible distinctions.

Let \(T_i\) denote the realized individual transcendental embedding of person \(i\).

Let \(\phi_{i,t}\) denote the total phenomenal state of person \(i\) at time \(t\).

Let \(c_{i,t}\) denote the active role-and-institution context at time \(t\): founder, buyer, parent, employee, soldier, friend, plus the relevant company, market, group, and local demands.

We can then define the person-in-role object as

\[
\chi_{i,t} = (T_i, c_{i,t}).
\]

This says something simple. The same person can yield different outputs across settings not because the person changes species, but because a different context activates a different organization of salience, inhibition, and available action. The person remains one person. The local geometry changes. Those role-specific masks are highly specific to the individual and the regime rather than generic archetypes, and they can preserve opposed dispositions long enough for the model to ask whether the opposition is a true contradiction or whether it resolves along a deeper axis activated by the regime.
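A minimal sketch of \(\chi_{i,t} = (T_i, c_{i,t})\): one shared realized embedding, with hypothetical role-specific masks standing in for the context-dependent organization of salience and inhibition. The masks and coordinate assignments are assumptions for illustration only.

```python
import numpy as np

# One realized embedding T_i, shared across all of person i's roles.
rng = np.random.default_rng(4)
T_i = rng.normal(size=12)

# Hypothetical role-specific masks: each context amplifies some
# coordinates (salience) and damps others (inhibition).
MASKS = {
    "founder": np.where(np.arange(12) < 6, 1.0, 0.2),
    "parent":  np.where(np.arange(12) >= 6, 1.0, 0.2),
}

def chimera(T_i, context):
    """chi_{i,t} = (T_i, c_{i,t}), read as the locally active geometry."""
    return MASKS[context] * T_i

as_founder = chimera(T_i, "founder")
as_parent = chimera(T_i, "parent")
# Same person, different local geometry: T_i itself never changed.
```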

## Psychology and Factor Analysis

Psychology already contains rough tools for decomposition. Spearman, Eysenck, and later trait work attempted to compress human difference into stable factors. These matter here because they show that individual variation can be represented in coordinates rather than only in prose.

Write a psychometric proxy for person \(i\) as

\[
p_i \in \mathbb{R}^k,
\]

where the coordinates may include IQ-like measures, psychometric traits, moral scales, behavioral factors, or related standardized summaries.

This object is useful, but it is not the transcendental embedding itself.

A factor score is not a memory field.
It is not a role.
It is not a present state.
It is not a proposition.
It is not a transition rule.

What it does provide is a **coarse prior**. It places a person inside a region of likely behavior. That is enough to matter, but not enough to solve micro-interaction. Factor analysis may tell us that a person is threat-sensitive, novelty-seeking, rigid, verbal, impulsive, or dutiful. It does not tell us whether those coordinates are active now, under this framing, with this memory already cued, inside this role.

So psychology enters Part 2 as one source of approximation, not as the final ontology. To make the problem more tractable from an engineering standpoint, we also need to append source-aware categorical traces to the estimation of a person's Transcendental Embedding: role labels, objection classes, recurring topics, counterpart identities, action-types, firm-state tags, and other discrete observations that accumulate over time.

Those categorical traces are not the transcendental embedding in itself. They are an operational bridge from outward history to the estimated observer-side object. The important engineering choice is to avoid collapsing biography, stated language, observed behavior, and third-party inference into one immediate average merely because they sound semantically related. Some channels later become comparable; they should not be assumed comparable at ingestion.

> the specific implementation details here are very much up to you, tbh; be creative

## Dimensionality Reduction, Attention, and Relevance

The issue is that no decision has one permanent principal axis in the person. The problem compounds because the individual embedding contains many coordinates, while any given transition is usually governed by a weighted subset of them.

Let \(x_t\) be the present proposition or stimulus. Let \(d\) be the number of coordinates used to represent the person at the relevant level of analysis. Then define a task-conditioned projection

\[
\pi_\tau(T_i, c_{i,t}, x_t) \in \mathbb{R}^d,
\]

where \(\tau\) indexes the task under study.

Let attention or salience be represented by a weight vector

\[
a_{i,t} \in [0,1]^d.
\]

Then the active coordinates for the current transition are

\[
z_{i,t} = a_{i,t} \odot \pi_\tau(T_i, c_{i,t}, x_t),
\]

where \(\odot\) denotes elementwise weighting.

This is the operational claim: the person may occupy a large space, but the next transition often depends on a smaller weighted slice of that space. The task is therefore not to discover one universal "main factor" of the person. The task is to discover which coordinates carry signal for a transition under a task and a context.

Attention matters because not all dimensions are weighted equally at every moment. The environment does not strike the whole embedding uniformly. A sentence, a person, a price, or a memory cue activates some coordinates and leaves others inert. That is why identical prompts can produce different outputs at different times.
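As a minimal sketch of the gate \(z_{i,t} = a_{i,t} \odot \pi_\tau(T_i, c_{i,t}, x_t)\): the projection rule, the coordinate values, and the salience weights below are illustrative toys, not a committed implementation.

```python
# Toy sketch of the salience gate z_{i,t} = a_{i,t} (elementwise) pi_tau(...).
# pi_tau here is a hypothetical stand-in: a context- and stimulus-modulated
# copy of the person coordinates.

def pi_tau(T, c, x):
    """Task-conditioned projection (toy): modulate person coords by context
    and by which coordinates the stimulus actually touches."""
    return [t * ci * xi for t, ci, xi in zip(T, c, x)]

def gate(a, v):
    """Elementwise weighting a (in [0,1]^d) applied to v."""
    return [ai * vi for ai, vi in zip(a, v)]

T_i = [0.9, -0.4, 0.7]   # person coordinates (toy)
c_t = [1.0, 1.0, 0.5]    # role/context modulation
x_t = [1.0, 0.0, 1.0]    # the stimulus touches dims 1 and 3 only
a_t = [1.0, 0.2, 0.8]    # salience weights

# Only attended, stimulus-active coordinates carry signal into the transition.
z_t = gate(a_t, pi_tau(T_i, c_t, x_t))
```

The point the code makes is structural: the same person vector yields different active slices \(z_{i,t}\) as the context modulation and salience weights change.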

## The Notion of State

Philosophically, everything belongs inside state.

The phenomenal state includes what is perceived, what is remembered, what is felt, what is attended to, what is being done, what bodily changes are underway, and what action tendencies are presently live. In that sense, the state is total.

Let that total state be

\[
\phi_{i,t} \in \Phi.
\]

That is the full object.

If I leave the formal section aimed directly at \(\phi_{i,t}\), however, I start overclaiming almost immediately. So from here on I distinguish the motivating object from the object the mathematics is actually allowed to touch.

Let

\[
Y_{i,t+\Delta}^{(\tau)}
\]

denote the random bundle of future observables relevant to task \(\tau\) and horizon \(\Delta\): reply, meeting, objection class, delay bucket, stage advance, sentiment shift, or whatever the task actually cares about.

Then define the **task-conditioned predictive observer-state**

\[
q_{i,t}^{(\tau,\Delta)}.
\]

This state is sufficient for the task if, for every admissible proposition \(x_t \in \mathcal X_{i,t}^{\mathrm{adm}}\),

\[
P\!\left(
Y_{i,t+\Delta}^{(\tau)}
\mid
H_{i,\le t}, T_i, c_{i,t}, w_t, x_t
\right)
=
P\!\left(
Y_{i,t+\Delta}^{(\tau)}
\mid
q_{i,t}^{(\tau,\Delta)}, x_t
\right).
\]

Read that in English: once the same proposition is applied, everything in the past that still matters for the future has already been compressed into \(q_{i,t}^{(\tau,\Delta)}\).

Two observer histories are therefore equivalent for the task if, under the same admissible proposition, they induce the same conditional law over future observables. That is the precise object I want in the proof section. The full phenomenal state remains the motivating ideal. The predictive observer-state is the formal object.
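The equivalence claim can be made concrete with a toy example: two different histories are task-equivalent when, under the same proposition, they induce the same conditional law over the next observable. The lookup table, the event names, and the "last event is a sufficient state" rule are all illustrative assumptions.

```python
# Toy illustration of predictive sufficiency: here the conditional law over
# the next observable depends only on the last event, so that one symbol
# plays the role of q and longer histories compress into it losslessly.

def law(history, x):
    """Hypothetical conditional distribution over the next observable,
    given history and the present proposition x (toy lookup)."""
    last = history[-1]
    table = {
        ("rebuff", "pitch"):  {"reply": 0.1, "ignore": 0.9},
        ("meeting", "pitch"): {"reply": 0.6, "ignore": 0.4},
    }
    return table[(last, x)]

h1 = ["intro", "rebuff"]
h2 = ["intro", "meeting", "rebuff"]  # different history, same sufficient state

# The two histories are equivalent for this task under the same proposition.
same = law(h1, "pitch") == law(h2, "pitch")
```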

## Observable Predictive State

For engineering work, the key move is to define the formal state in terms of observable consequences rather than inaccessible total interiority.

Write the predictive state schematically as

\[
q_{i,t}^{(\tau,\Delta)}
=
S_{\tau,\Delta}(H_{i,\le t}, T_i, c_{i,t}, w_t),
\]

where \(S_{\tau,\Delta}\) is not assumed known in advance. What matters is not its exact implementation, but the criterion it must satisfy: preserve the task-relevant law of the future under admissible propositions.

This lets me keep the strong philosophical language without pretending the model gets direct access to the whole interior. The ideal object remains the next phenomenal transition. The formal object is the predictive observer-state that stands in for it on the task.

There is an important asymmetry here.

\[
\phi_{i,t}
\]

is the rich, total, motivating object.

\[
q_{i,t}^{(\tau,\Delta)}
\]

is the minimal predictive object.

\[
s_{i,t}^{(\tau,\Delta)}
\]

which I will define in a second, is the measurable approximation used in training.

The paper gets much more honest the second these three stop being treated as the same thing.

## Memory as a Series of Vectors

For present purposes, memory need not be treated as narrative first. It can be modeled as a field of traces with weights.

Let

\[
m_{i,t} = \sum_{j=1}^{N_i} \omega_{ij,t}\mu_{ij},
\]

where \(\mu_{ij}\) is a stored trace and \(\omega_{ij,t}\) is its weight at time \(t\).

Some traces are weak. Some are strong. Some decay. Some reactivate under similarity, emotion, role, or repetition. The point is not that memory is literally a sum in the brain; the point is that weighted trace structure gives us a tractable model of persistence and retrieval.

A present proposition does not encounter the whole memory field evenly. Retrieval depends on context and prior state. Write retrieval schematically as

\[
\omega_{ij,t+1} = R(\mu_{ij}, x_t, c_{i,t}, \phi_{i,t}),
\]

where \(R\) updates the relevance of trace \(j\) for person \(i\) after the present encounter.

This gives memory two jobs.

First, it stores prior traces.
Second, it changes which parts of the person-space are active now.

Learning, on this view, is usually not the invention of an arbitrary new universe inside the person. It is an update within an inherited repertoire of possible distinctions. A child cannot learn calculus by exposure alone if the required structures are not yet organized. Once they are, learning can be represented as a change in the memory field and the couplings between traces:

\[
m_{i,t+1} = U(m_{i,t}, x_t, \phi_{i,t}).
\]

This also explains why repeated prompts can produce different outputs. The second encounter is not with the same person-state as the first. The first encounter has already changed the trace structure. A prior positive or negative experience can therefore make the second prediction harder, not easier, if the model fails to represent that update.

Recommendation systems provide a useful analogy here. A view history is not a mind. But a recommender does show the core move: repeated traces can be compressed into a latent representation that improves prediction. In the present framework, biography, language, preferences, recurrent actions, role history, and prior interactions play the role of trace data from which a person-level embedding can be estimated.
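A minimal sketch of memory's two jobs above: storing weighted traces and reweighting them after an encounter. The dot-product similarity rule inside `reweight` is an assumed stand-in for \(R\), not a claim about the brain.

```python
# Weighted memory field m = sum_j omega_j * mu_j, plus a retrieval-style
# update R that raises the weight of traces similar to the present proposition.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def memory_field(weights, traces):
    """Compute the pooled field from traces mu and weights omega."""
    d = len(traces[0])
    m = [0.0] * d
    for w, mu in zip(weights, traces):
        for k in range(d):
            m[k] += w * mu[k]
    return m

def reweight(weights, traces, x, boost=0.5):
    """R (toy): boost traces whose similarity to the proposition x is positive."""
    return [w + boost * max(0.0, dot(mu, x)) for w, mu in zip(weights, traces)]

traces = [[1.0, 0.0], [0.0, 1.0]]   # two stored traces mu
weights = [0.2, 0.8]                # current weights omega

x = [1.0, 0.0]                      # proposition cues the first trace
m_before = memory_field(weights, traces)
weights2 = reweight(weights, traces, x)
m_after = memory_field(weights2, traces)
```

This is also why repeated prompts differ: the second encounter runs against `weights2`, not `weights`.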

## Categorical Trace Pooling as an Operational Memory Estimator

A large part of what we observe about people arrives in categorical form: role labels, recurring topics, objection classes, counterpart identities, action-types, domain tags, product themes, price postures, and other discrete markers. Rather than treat these as dead one-hot tables or discard them into prose, we can embed them and pool them over time. Operationally, the implementation assumes a fixed global registry of categorical families, source channels, and admissible regimes, with learned null vectors and mask bits for absent cells so the resulting representation stays fixed-width across people and time.

Let \(f \in \{1,\dots,F\}\) index categorical families and let \(s \in \mathcal S\) index source channels, for example biography, stated language, observed behavior, and third-party or inferred traces. For person \(i\) at time \(t\), let \(C_{i,t}^{(f,s)}\) be the multiset of observed raw category tokens in family \(f\) from source \(s\).

Before pooling, however, surface labels should be contextually typed. A token that looks contradictory in the raw may become perfectly consistent once we mark whether it concerns self versus other, in-group versus out-group, own-interest versus third-party interest, formal stance versus enacted stance, or another asymmetry carried by the regime. Write this contextual lifting as

\[
\widetilde{C}_{i,t}^{(f,s)}
=
\Xi\!\big(C_{i,t}^{(f,s)}, c_{i,t}\big),
\]

where \(\Xi\) maps raw categorical observations into richer typed tokens prior to comparison or pooling. Only the residual opposition that remains after this lifting deserves to be treated as genuine contradiction.

Let \(E_{f,s}\) be the corresponding embedding table and let \(\nu_{f,s}\) be a learned null vector for empty bags. Then the within-event pooled representation is

\[
u_{i,t}^{(f,s)}
=
\begin{cases}
\frac{1}{|\widetilde{C}_{i,t}^{(f,s)}|}
\sum_{c \in \widetilde{C}_{i,t}^{(f,s)}} E_{f,s}(c),
& |\widetilde{C}_{i,t}^{(f,s)}| > 0, \\[8pt]
\nu_{f,s},
& |\widetilde{C}_{i,t}^{(f,s)}| = 0,
\end{cases}
\qquad
m_{i,t}^{(f,s)} = \mathbf 1\{|\widetilde{C}_{i,t}^{(f,s)}| > 0\}.
\]

This is the recommender move in its simplest form: sparse categorical IDs are mapped to dense vectors and multivalent bags are pooled into fixed-width representations. Large-scale recommenders do this with sparse user histories; here the same move is repurposed for person-state estimation. But I do not want the next step to be a naive global average across every source and every role. Categories that arrive through speech, biography, and behavior are not automatically the same thing just because they share a label. They can later become comparable; they should not be forced into comparability at ingestion.

So I preserve slot identity:

\[
e_{i,t}^{\mathrm{cat}}
=
\big\|_{(f,s)} \big[P_{f,s}u_{i,t}^{(f,s)},\, m_{i,t}^{(f,s)}\big],
\]

where \(\|\) denotes concatenation and \(P_{f,s}\) is a family-and-source-specific projection.
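In code, the within-event pooling and slot-preserving concatenation might look like the following sketch. The embedding tables, null vectors, and token names are toy stand-ins, and the per-slot projection \(P_{f,s}\) is omitted (treated as identity) for brevity.

```python
# Within-event pooled representation u^(f,s) with null vectors and mask bits,
# then slot-preserving concatenation into e_cat. Tables are illustrative.

EMB = {  # E_{f,s}: per (family, source) embedding tables (toy)
    ("objection", "stated"):   {"price": [1.0, 0.0], "timing": [0.0, 1.0]},
    ("objection", "behavior"): {"price": [0.5, 0.5]},
}
NULL = {key: [0.0, 0.0] for key in EMB}  # learned null vectors nu (toy zeros)

def pool(family, source, bag):
    """Mean-pool a multiset of tokens; empty bags fall back to the null vector."""
    table = EMB[(family, source)]
    vecs = [table[c] for c in bag if c in table]
    if not vecs:
        return NULL[(family, source)], 0          # (nu, mask = 0)
    d = len(vecs[0])
    mean = [sum(v[k] for v in vecs) / len(vecs) for k in range(d)]
    return mean, 1

def e_cat(bags):
    """Concatenate [u, mask] per (family, source) slot, preserving identity
    so stated and behavioral evidence are never averaged at ingestion."""
    out = []
    for key in sorted(EMB):
        u, m = pool(key[0], key[1], bags.get(key, []))
        out.extend(u + [float(m)])
    return out

event = {("objection", "stated"): ["price", "timing"]}
rep = e_cat(event)   # behavior slot stays a null vector with mask 0
```

Note that the empty behavior slot is represented explicitly (null vector plus mask bit) rather than being dropped, which keeps the representation fixed-width across people and time.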

The person remains one person, but that person appears through different masks in different regimes. Let \(\rho_t = \rho(c_{i,t})\) denote the active role/regime at time \(t\), and for historical events write \(\rho_r = \rho(c_{i,r})\). Then the slow categorical memory for regime \(\rho\) is

\[
g_{i,\rho}^{\mathrm{slow}}
=
\begin{cases}
\frac{
\sum_{r \le t}\mathbf 1\{\rho_r=\rho\}\,\beta_{i,r}^{\mathrm{slow}}\,e_{i,r}^{\mathrm{cat}}
}{
\sum_{r \le t}\mathbf 1\{\rho_r=\rho\}\,\beta_{i,r}^{\mathrm{slow}}
},
& \sum_{r \le t}\mathbf 1\{\rho_r=\rho\}\,\beta_{i,r}^{\mathrm{slow}} > 0, \\[10pt]
\nu_{\rho}^{\mathrm{slow}},
& \text{otherwise},
\end{cases}
\qquad
m_{i,\rho}^{\mathrm{slow}}
=
\mathbf 1\!\left\{\sum_{r \le t}\mathbf 1\{\rho_r=\rho\}\,\beta_{i,r}^{\mathrm{slow}} > 0\right\}.
\]

Define the full slow categorical bank by

\[
g_i^{\mathrm{slow}} = \big\|_{\rho} [g_{i,\rho}^{\mathrm{slow}}, m_{i,\rho}^{\mathrm{slow}}].
\]

For the fast task-conditioned state, define

\[
g_{i,t}^{\mathrm{fast},\tau}
=
\begin{cases}
\sum_{r \le t}\alpha_{i,r,t}^{(\tau)} e_{i,r}^{\mathrm{cat}},
& N_{i,t}^{\mathrm{cat}} > 0, \\[6pt]
\nu_{\tau}^{\mathrm{fast}},
& N_{i,t}^{\mathrm{cat}} = 0,
\end{cases}
\qquad
\sum_{r \le t}\alpha_{i,r,t}^{(\tau)} = 1 \;\; \text{when } N_{i,t}^{\mathrm{cat}} > 0,
\]

where \(N_{i,t}^{\mathrm{cat}}\) is the number of usable categorical trace-events up to time \(t\).

The weighting laws should not treat every trace equally. Let \(\eta_{i,r}^{\mathrm{act}}\) denote action intensity, \(n_{i,r}^{\mathrm{exp}}\) cumulative weak exposure, \(\sigma_{i,\rho_r}^{(\tau)}\) susceptibility of person \(i\) to the relevant proposition family under regime \(\rho_r\), and \(\upsilon_{i,r}^{\mathrm{src}}\) the effective reliability or sincerity weight of the source channel at event \(r\). Then \(\alpha_{i,r,t}^{(\tau)}\) and \(\beta_{i,r}^{\mathrm{slow}}\) are functions of recency, task relevance, regime relevance, \(\eta_{i,r}^{\mathrm{act}}\), \(n_{i,r}^{\mathrm{exp}}\), \(\sigma_{i,\rho_r}^{(\tau)}\), and \(\upsilon_{i,r}^{\mathrm{src}}\). Decisive action traces should often outrank passive exposure traces, while repeated weak exposures should still accumulate over time according to the person's susceptibility. This is also the place where strategic self-presentation enters the model: stated concern, biographical prior, and observed behavior are allowed to disagree without being collapsed at ingestion.
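A sketch of the slow regime-indexed bank under these weighting laws, with the \(\beta\) weights assumed to already combine recency, action intensity, susceptibility, and source reliability. The decisive action trace dominates the weak exposure trace without erasing it, and a never-observed regime gets a null vector and a zero mask.

```python
# Slow categorical bank g^slow_{i,rho}: beta-weighted average of event
# representations within each regime, with null vector + mask for regimes
# that have never been observed. Weights and vectors are toy values.

NULL_SLOW = [0.0, 0.0]   # nu_rho (toy)

def slow_bank(events, regimes):
    """events: list of (regime, beta, e_cat) triples accumulated over time."""
    bank = {}
    for rho in regimes:
        num, den = [0.0, 0.0], 0.0
        for regime, beta, e in events:
            if regime == rho:
                den += beta
                num = [n + beta * x for n, x in zip(num, e)]
        if den > 0:
            bank[rho] = ([x / den for x in num], 1)   # (g, mask = 1)
        else:
            bank[rho] = (NULL_SLOW, 0)                # (nu_rho, mask = 0)
    return bank

history = [
    ("founder", 1.00, [1.0, 0.0]),   # decisive action trace, high beta
    ("founder", 0.25, [0.0, 1.0]),   # weak exposure trace, low beta
]
bank = slow_bank(history, regimes=["founder", "parent"])
```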

This choice prevents a major modeling mistake. Surface-opposed categorical evidence should not be smoothed into one fake midpoint by default, but neither should every opposition be declared a deep contradiction too early. First extract the asymmetry that may explain it. Only if the opposition survives that contextual lifting should it be treated as a genuine contradiction, typically by emitting separate typed tokens or probe features rather than erasing it into an immediate average.

The slow categorical bank is therefore best read as what the person is generally like **now**, at this stage and across regimes, rather than as a timeless essence. The pooled categorical embedding is not the transcendental embedding itself. It is a measurable, source-aware, regime-aware compression of repeated outward traces, and it belongs to the estimated observer-side object rather than to the full phenomenal state.

## Minimality, Identifiability, and Slow/Fast Factorization

Now comes the part that keeps the whole thing from dissolving into vibes.

A predictive state is not interesting merely because it is sufficient. It is interesting if it is also close to **minimal**. The task is not to drag the whole archive behind every prediction forever, but to keep only the state that preserves what the future still cares about.

Say that a predictive state \(q_{i,t}^{(\tau,\Delta)}\) is **minimal** if, for any other sufficient state \(r_{i,t}^{(\tau,\Delta)}\), there exists a measurable map \(h\) such that

\[
q_{i,t}^{(\tau,\Delta)} = h\!\left(r_{i,t}^{(\tau,\Delta)}\right).
\]

That is the right level of humility. I am not claiming there is one mystical coordinate chart for the soul. I am claiming that, for a task, there may be a smallest predictive object up to reparameterization.

The operational approximation I actually want to estimate is

\[
s_{i,t}^{(\tau,\Delta)}
=
(\hat T_i, z_{i,t}, c_{i,t}, w_t)
\approx
q_{i,t}^{(\tau,\Delta)}.
\]

Here

\[
\hat T_i = E_T(u_i, g_i^{\mathrm{slow}})
\]

is the **slow** person-side embedding inferred from durable features, and

\[
z_{i,t+1} = U_\theta(z_{i,t}, \hat T_i, c_{i,t}, w_t, e_{i,t}, g_{i,t}^{\mathrm{fast},\tau})
\]

is the **fast** latent state inferred from recent event history.

The reason for the split is not aesthetic. It is that different things change on different timescales. A founder does not become a different founder because of one email. But their local state can absolutely change because of one email. The slow term is supposed to carry durable person structure, including the source-aware and regime-aware bank of categorical traces. The fast term is supposed to carry within-window state needed to preserve the predictive content of recent history, including whichever categorical traces are active now.

If this split is real, then removing \(z_{i,t}\) should hurt short-horizon prediction, while removing \(\hat T_i\) should hurt cold-start performance and cross-context generalization. If neither happens, I do not get to pretend the decomposition was profound. Part 4 will force that issue.

## Deriving the Transcendental Embedding

We can now write the distinction that Part 1 left implicit.

Let \(G_i\) be the inherited template: the repertoire of distinctions the organism can in principle host.

Let

\[
T_i = \Psi(G_i, \ell_i, h_i)
\]

be the realized individual transcendental embedding, where \(\ell_i\) denotes language, culture, and socialization, and \(h_i\) denotes life history.

We can model life history as a weighted structure of events:

\[
h_i = \sum_{k=1}^{n_i}\beta_{ik} e_{ik},
\]

where \(e_{ik}\) is an event embedding and \(\beta_{ik}\) its weight for later organization of response.

Some events contribute little.
Some events bend the later space of response.

This is the theoretical object.

But we do not observe \(T_i\) directly. What we observe are traces and proxies.

Let

- \(p_i\) be psychometric and cognitive summaries,
- \(b_i\) be biography and background,
- \(\ell_i\) be language and culture,
- \(r_i\) be role and institution history,
- \(h_i\) be weighted life-event structure,
- \(g_i^{\mathrm{slow}}\) be the slow source-aware and regime-aware categorical trace bank.

Then a first operational estimate of the person-level transcendental embedding is

\[
\hat{T}_i^{(0)} = E(p_i, b_i, \ell_i, r_i, h_i, g_i^{\mathrm{slow}}).
\]

At first pass, this estimator can be simple:

\[
\hat{T}_i^{(0)}
=
W_p p_i
+ W_b b_i
+ W_\ell \ell_i
+ W_r r_i
+ W_h h_i
+ W_g g_i^{\mathrm{slow}},
\]

where the \(W\) terms denote weighting or projection operators. If one channel suppresses or inverts another, that sign belongs inside the learned operator rather than being hard-coded as a minus sign on the whole source. In other words, the estimator should stay algebraically additive even when the learned effect of a coordinate is inhibitory.

This should be read carefully. The weighted estimate is **not** the transcendental embedding itself. It is the tractable object from which we begin. It is a low-resolution approximation built from standardized signals because those are the signals computation can access at scale. The categorical term is not allowed to collapse source disagreement or surface opposition into one bland midpoint. It preserves source and regime separation, and it allows contextual lifting to determine whether an apparent contradiction resolves along a deeper axis before later layers decide whether, when, and how the traces become comparable.

Given the fast categorical pool \(g_{i,t}^{\mathrm{fast},\tau}\), the online update becomes

\[
z_{i,t+1}
=
U_\theta(z_{i,t}, \hat T_i, c_{i,t}, w_t, e_{i,t}, g_{i,t}^{\mathrm{fast},\tau}).
\]

Once the slow embedding is estimated and the fast state is updated online, the local predictive object becomes

\[
\hat \chi_{i,t}^{(\tau,\Delta)}
=
(\hat T_i, z_{i,t}, c_{i,t}, w_t)
=
s_{i,t}^{(\tau,\Delta)}
\approx
q_{i,t}^{(\tau,\Delta)}.
\]

Then, given a proposition \(x_t\), the next predictive observer-state is approximated by

\[
\hat q_{i,t+1}^{(\tau,\Delta)}
=
F_\tau(\hat T_i, z_{i,t}, c_{i,t}, w_t, x_t).
\]

Observable consequences are then read out from that state:

\[
\hat y_{i,t+\Delta}^{(\tau)} = R_0\!\left(\hat q_{i,t+1}^{(\tau,\Delta)}\right).
\]

Later I will add auxiliary probe heads, but the logic is already here: the task does not need access to the whole ineffable interior, it needs a predictive state from which future observables can be decoded.
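The loop just described, estimate \(\hat T_i\) once, update \(z_{i,t}\) online, apply \(F_\tau\), read out with \(R_0\), can be sketched end to end. Every operator below is a toy linear stand-in, and context \(c_{i,t}\) and world state \(w_t\) are dropped for brevity.

```python
import math

# Toy end-to-end pass: slow estimate, fast online updates, transition, readout.
# E_T, U_theta, F_tau, R_0 are illustrative stand-ins, not the paper's operators.

def E_T(u, g_slow):
    """Slow person-side estimate from durable features (toy elementwise sum)."""
    return [a + b for a, b in zip(u, g_slow)]

def U_theta(z, T_hat, e):
    """Fast online update of z from the latest event embedding e."""
    return [0.5 * zi + 0.3 * ti + 0.2 * ei for zi, ti, ei in zip(z, T_hat, e)]

def F_tau(T_hat, z, x):
    """Next predictive observer-state under proposition x (toy sum)."""
    return [ti + zi + xi for ti, zi, xi in zip(T_hat, z, x)]

def R_0(q):
    """Readout head: squash the state into one observable, e.g. reply odds."""
    return 1.0 / (1.0 + math.exp(-sum(q)))

u_i, g_slow = [0.2, 0.1], [0.1, 0.0]   # durable features (toy)
T_hat = E_T(u_i, g_slow)               # estimated once, changes slowly

z = [0.0, 0.0]
for e_t in ([0.1, 0.2], [0.3, 0.0]):   # two recent events update the fast state
    z = U_theta(z, T_hat, e_t)

x_t = [0.5, -0.1]                      # the present proposition
q_next = F_tau(T_hat, z, x_t)          # next predictive state
y_hat = R_0(q_next)                    # decoded observable in (0, 1)
```

The outcome `y_hat` is deliberately downstream: it is decoded from the state, not computed in place of it.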

Part 2 stops here on purpose.

We cannot yet claim to know the true form of \(F_\tau\).
In no way am I claiming to have solved qualia or consciousness; I have just, maybe, created a representation that keeps me from talking nonsense while trying to predict human transition.
The framework does not claim that the estimate and the reality itself are identical.

What it does claim is smaller and enough for the next step: a person can be represented as a latent structure derived from an inherited template, development, language, culture, memory, and repeated life events; that latent structure can be estimated from outward traces; and that estimate can serve as the person-side object in a transition model whose target is a task-conditioned predictive observer-state and the observable readouts downstream of it.

Part 3 can now ask the narrower question: once a person has been represented as \(\hat T_i\), once their recent dynamics have been represented by \(z_{i,t}\), and once the person-in-role state has been represented as \(s_{i,t}^{(\tau,\Delta)}\), how do we represent the proposition \(x_t\) in the same algebra, and how do we learn a transition map that predicts the next predictive state with increasing fidelity?

---

## Part 3: Application — Predicting How People Behave

Part 2 established the person-side object. It distinguished the inherited template from the realized individual embedding, the realized embedding from the momentary phenomenal state, and the momentary phenomenal state from the task-conditioned predictive state the mathematics can actually handle. Part 3 now asks what can be done with that object.

The claim of this section is narrower than a final proof about mind. It is not that we already possess the exact universal transition law for human beings. It is that we can define a mathematical program in which observer, role, proposition, and environment can be placed into a shared representational space, and that inside this space we can iteratively improve our estimate of how one predictive observer-state gives rise to the next. In other words: this section does not complete the science; it specifies the playground in which the science can be built.

Everything the model computes on ends up as vectors or tuples of vectors; however, not every symbol in the paper is itself a primitive vector before preprocessing. Raw inputs can be text, categories, histories, metadata, and context objects. Before they hit the learned equations, they should be converted into vector or tensor representations; a neat trick is to use your favorite multi-input embedding model for the basic operations. For anything to do with categorical bags, like "life history" for example, you can average the token embeddings together to get a decent latent representation. Beware: if you implement the averaging trick incorrectly, your model can become hot garbage.
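A minimal sketch of that averaging trick, including the failure mode the warning points at: bags must handle emptiness explicitly, and mixing vectors of very different magnitudes lets one trace dominate the mean. The normalization step is one assumed fix, not the only one.

```python
import math

# Mean-pooling a bag of embeddings into one latent representation.
# Two easy ways to get "hot garbage": crashing or reusing stale state on
# empty bags, and averaging unnormalized vectors of unequal magnitude.

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def mean_pool(vectors, dim=2):
    """Average a bag of embeddings; an empty bag maps to an explicit
    zero vector rather than failing silently."""
    if not vectors:
        return [0.0] * dim
    vs = [normalize(v) for v in vectors]   # equalize magnitudes first
    return [sum(v[k] for v in vs) / len(vs) for k in range(dim)]

life_history = [[3.0, 0.0], [0.0, 4.0]]   # raw embeddings, unequal norms
pooled = mean_pool(life_history)          # balanced; without normalize(),
                                          # the larger vector would dominate
```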

## Towards a Universal State Transition Function

In the ideal philosophical form of the theory, the object of interest is still the next phenomenal state.

Let

\[
\phi_{i,t}
\]

denote the full phenomenal state of person \(i\) at time \(t\), and let

\[
T_i
\]

denote the realized transcendental embedding of that person. Let

\[
x_t
\]

denote the proposition confronting the observer at time \(t\). Here "proposition" is meant broadly. It may be a sentence, an email, a product, a person, a meeting, a threat, a market signal, or a whole local arrangement of circumstances. For the observer, what matters is not bare matter but presented structure.

The ideal transition law is therefore

\[
\phi_{i,t+1} = F(T_i, \phi_{i,t}, x_t).
\]

I keep this equation because it is the motivating ideal. I do **not** keep it as the formal target of the benchmark, because that would be me quietly pretending I have direct access to the thing I explicitly said I do not have.

So the formal target from here on is the task-conditioned predictive observer-state.

Let

\[
\mathcal U = \{u=(u_1,u_2,\dots)\}
\]

be an ambient infinite coordinate space of possible distinctions. Nothing in the present argument requires stronger structure than this. The point of \(\mathcal U\) is simply to provide a common representational arena large enough to host observer, proposition, role, and world.

For any prediction task \(\tau\), define a finite projection

\[
\Pi_\tau:\mathcal U \to \mathbb R^{d_\tau},
\]

where \(d_\tau\) is the number of coordinates needed for that task. This preserves the core idea of infinite dimensional space without requiring infinite computation. The ambient space is open-ended; each task lives in a finite slice of it.

The formal state is

\[
q_{i,t}^{(\tau,\Delta)},
\]

chosen so that, for admissible propositions,

\[
P\!\left(
Y_{i,t+\Delta}^{(\tau)}
\mid
H_{i,\le t}, T_i, c_{i,t}, w_t, x_t
\right)
=
P\!\left(
Y_{i,t+\Delta}^{(\tau)}
\mid
q_{i,t}^{(\tau,\Delta)}, x_t
\right).
\]

The measurable approximation is

\[
s_{i,t}^{(\tau,\Delta)} = (\hat T_i, z_{i,t}, c_{i,t}, w_t).
\]

The operational transition law is then

\[
\hat q_{i,t+1}^{(\tau,\Delta)}
=
F_\tau(\hat T_i, z_{i,t}, c_{i,t}, w_t, x_t; \theta),
\]

and the observable readout is

\[
\hat y_{i,t+\Delta}^{(\tau)}
=
R_0^{(\tau)}\!\left(\hat q_{i,t+1}^{(\tau,\Delta)}\right).
\]

If we want to keep the stronger language without lying to ourselves, the clean interpretation is this: the next predictive observer-state is the formally accessible stand-in for the part of phenomenal transition that matters to the task.

It is useful to decompose the transition map into three pieces. First, encode the observer-side object:

\[
o_{i,t}^{(\tau)} = E_o^{(\tau)}(\hat T_i, z_{i,t}, c_{i,t}, w_t).
\]

Second, encode the proposition:

\[
p_t^{(\tau)} = E_p^{(\tau)}(x_t).
\]

Third, compute the interaction:

\[
h_{i,t}^{(\tau)} = \Psi_\tau(o_{i,t}^{(\tau)}, p_t^{(\tau)}),
\]

and decode the next predictive state:

\[
\hat q_{i,t+1}^{(\tau,\Delta)} = G_\tau(h_{i,t}^{(\tau)}).
\]

Finally, if the task demands an observable output or auxiliary readouts, decode them from the predictive state:

\[
\hat y_{i,t+\Delta}^{(\tau)} = R_0^{(\tau)}(\hat q_{i,t+1}^{(\tau,\Delta)}),
\qquad
\hat a_{i,t+\Delta}^{(m,\tau)} = R_m^{(\tau)}(\hat q_{i,t+1}^{(\tau,\Delta)}).
\]

Whenever I write \(\hat a_{i,t+\Delta}^{(\tau)}\) without the probe index \(m\), I mean the full auxiliary bundle \((\hat a_{i,t+\Delta}^{(1,\tau)}, \dots, \hat a_{i,t+\Delta}^{(M,\tau)})\).

This last step matters. The reply, the purchase, the rejection, the delay, the meeting, or the concession is not the state itself. It is a visible residue of the transition. The model is not fundamentally about "will they buy?" It is about "what state will they enter next for the purposes of this task?" The purchase is then one possible readout from that state.

## God’s Infinite Dimensional Space: Making All Realities Composable

"To make all realities composable" does not mean every distinction matters for every observer. It means that anything capable of altering the observer's next predictive state can be represented within the same formal program.

The observer remains the privileged object. The world matters only insofar as it enters the observer's state transition.

If two propositions differ physically but not in the task-relevant projection available to the observer, then they are equivalent for that task. Formally, if

\[
\Pi_{\tau}(E_p(x_t^{(1)})) = \Pi_{\tau}(E_p(x_t^{(2)})),
\]

then, for fixed \(\hat T_i\), \(z_{i,t}\), \(c_{i,t}\), and \(w_t\),

\[
F_\tau(\hat T_i, z_{i,t}, c_{i,t}, w_t, x_t^{(1)}; \theta)
\approx
F_\tau(\hat T_i, z_{i,t}, c_{i,t}, w_t, x_t^{(2)}; \theta).
\]

This captures a number of cases at once. Dark matter may exist in the ambient space and yet project to nothing for the ordinary human observer. A table made of wood and a table made of stone may differ in physical constitution, yet remain interchangeable for a task in which that distinction never enters the observer's next state. The world is therefore not composable because everything is the same; it is composable because differences can be represented, weighted, or nullified within one algebra.

The same framework scales across levels. An amoeba may be modeled with only a handful of coordinates: nutrient, toxin, gradient, rupture, motion. A dog adds more. A human adds far more. A corporation, later on, may be treated as a higher-order observer if it can be represented as a system with memory, incentives, internal communication, and persistent response surfaces. The universal claim is therefore about form, not about one fixed content-size. For the present paper, however, we restrict the application to humans situated within roles.

At the human level, the person-side estimate begins with standardized features. Let

\[
u_i = [p_i \mid b_i \mid \ell_i \mid r_i \mid h_i \mid g_i^{\mathrm{grp}}],
\]

where \(p_i\) denotes psychometric and cognitive proxies, \(b_i\) biography, \(\ell_i\) language and cultural position, \(r_i\) role and institution history, \(h_i\) life-event structure, and \(g_i^{\mathrm{grp}}\) longer-run group or company variables relevant to the person. Let \(g_i^{\mathrm{slow}}\) denote the slow source-aware and regime-aware categorical trace bank built from repeated role labels, objection classes, action-types, topical recurrences, counterpart identities, and similar discrete observations. Then

\[
\hat T_i = E_\theta(u_i, g_i^{\mathrm{slow}})
\]

is the first operational estimate of the transcendental embedding.

This estimate is not final. It is the opening move in a discovery process. The correct coordinates for a task are not assumed in advance; they are tested.

For any candidate feature family \(f\), define its contribution to a task \(\tau\) as

\[
\Delta_\tau(f) = \mathrm{Perf}_\tau(M \cup f) - \mathrm{Perf}_\tau(M),
\]

where \(M\) is the current model and \(\mathrm{Perf}_\tau\) is its performance on the task. If a feature family raises predictive validity in a stable way, it remains. If it does not, it is removed. The theory therefore does not begin by declaring the final human embedding solved. It begins by specifying the logic by which that embedding is approximated and improved.
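The retention rule can be sketched directly. The performance function below is a stubbed lookup with invented feature families and toy numbers; in practice \(\mathrm{Perf}_\tau\) wraps a real train-and-evaluate loop with a stability check across splits.

```python
# Sketch of the contribution test: Delta_tau(f) = Perf(M union {f}) - Perf(M).
# perf() is a toy stand-in; "psychometrics", "biography", "astrology" and the
# threshold 0.01 are invented for illustration.

def perf(feature_set):
    """Hypothetical task performance as a function of the feature families."""
    gains = {"psychometrics": 0.05, "biography": 0.02, "astrology": 0.0}
    return 0.60 + sum(gains.get(f, 0.0) for f in feature_set)

def contribution(model_features, candidate):
    """Delta_tau(f): performance lift from adding one feature family."""
    return perf(model_features | {candidate}) - perf(model_features)

M = {"psychometrics"}  # current model's feature families
keep = [f for f in ["biography", "astrology"]
        if contribution(M, f) > 0.01]   # retain only a stable positive lift
```

A feature family that contributes nothing, here the deliberately inert "astrology", is dropped rather than kept on theoretical sentiment.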

This is why the first operational analogue is closer to a recommender system than to a completed metaphysics. A recommender infers a user representation from traces, maps a candidate object into a related space, predicts the next response, and updates from feedback. The present framework generalizes that move: infer a person representation, maintain an explicit local state, map a proposition into the same task-space, estimate the next predictive observer-state, and update the estimate from observed consequences.

## Creating the World Model

A world model, in the present sense, is not a copy of physical reality. It is a learned simulator of predictive-state transitions under propositions.

For a task \(\tau\), define the world model as

\[
\mathcal W_\tau:(\hat T_i, z_{i,t}, c_{i,t}, w_t, x_t)\mapsto \hat q_{i,t+1}^{(\tau,\Delta)}.
\]

This is the single-step form. Multi-step rollout is obtained by recursion:

\[
\hat q_{i,t+k}^{(\tau,\Delta)}
=
\mathcal W_\tau^{(k)}(\hat T_i, z_{i,t}, c_{i,t}, w_t, x_t, x_{t+1}, \dots, x_{t+k-1}).
\]

The model can later be extended to multiple agents by replacing the single observer with a set of observer-state tuples, but the present paper keeps one observer at the center because that is enough to ground the framework.
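The recursion can be made concrete with a toy single-step operator. The linear-plus-tanh transition below is an arbitrary stand-in; the theory leaves the form of \(\mathcal W_\tau\) open:

```python
import numpy as np

# Minimal rollout sketch: a single-step world model applied recursively
# over a planned sequence of propositions. The transition here is an
# invented stand-in, not the operator the paper commits to.

def world_step(q, x, A, B):
    """One transition: next predictive state from current state and proposition."""
    return np.tanh(A @ q + B @ x)

def rollout(q0, propositions, A, B):
    """Multi-step form: feed each planned proposition x_t, ..., x_{t+k-1} in turn."""
    states, q = [], q0
    for x in propositions:
        q = world_step(q, x, A, B)
        states.append(q)
    return states
```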

The first application domain is go-to-market interaction because it produces repeated transitions, clear timestamps, and measurable outcomes. Consider a founder receiving a sales pitch. The founder is not just a person. The founder is a person-in-role, with a company, a market position, a prior history, a threat model, a time horizon, and live incentives. The proposition is not just the seller's sentence. It includes the seller, the product, the message, the channel, the price, the timing, the current market state, and the problem-frame through which the product is being introduced.

In this case, the person-side object is

\[
o_{i,t}^{(\tau)} = E_o^{(\tau)}(\hat T_i, z_{i,t}, c_{i,t}, w_t),
\]

where \(c_{i,t}\) includes the founder role and relevant company state. The proposition-side object is

\[
p_t^{(\tau)} = E_p^{(\tau)}(x_t),
\]

where \(x_t\) includes the seller, product, pitch, and local conditions. Their interaction produces the next predictive state, and that state can be read out as reply likelihood, meeting likelihood, objection class, cycle delay, purchase likelihood, sentiment shift, or any other observable the task legitimately carries.

In that sense, the eventual sales outcome is not the main object. It is the visible consequence of a prior state transition. If the email changed salience, trust, urgency, or perceived fit, then the transition already occurred before the purchase did.

Nothing in this framework commits us to one architecture. The transition operator \(\Psi_\tau\) may be implemented by a linear model, a recurrent model, an attention-based model, a state-space model, or a neural system that combines several of these. Sequence models in the Mamba family are one candidate because they compress long histories into an evolving state, but they do not define the theory. They are implementation options inside it.

What matters most at the outset is not architectural ambition but a working procedure.

### Algorithm 1: Estimate the person-side object

Choose a task \(\tau\) and a time horizon \(\Delta\). Collect standardized person-level observations and build the slow categorical trace bank \(g_i^{\mathrm{slow}}\). Estimate

\[
\hat T_i = E_\theta(u_i, g_i^{\mathrm{slow}}).
\]

Construct the current fast state \(z_{i,t}\) from event history together with the active categorical pool \(g_{i,t}^{\mathrm{fast},\tau}\), then attach the active role-context \(c_{i,t}\) and world state \(w_t\). This produces the operational state

\[
s_{i,t}^{(\tau,\Delta)} = (\hat T_i, z_{i,t}, c_{i,t}, w_t).
\]

### Algorithm 2: Encode the proposition and predict the transition

Encode the proposition \(x_t\) into

\[
p_t^{(\tau)} = E_p^{(\tau)}(x_t).
\]

Compute the interaction

\[
h_{i,t}^{(\tau)} = \Psi_\tau(o_{i,t}^{(\tau)}, p_t^{(\tau)}),
\]

then decode the next predictive state

\[
\hat q_{i,t+1}^{(\tau,\Delta)} = G_\tau(h_{i,t}^{(\tau)}),
\]

and, when needed, produce observable readouts

\[
\hat y_{i,t+\Delta}^{(\tau)} = R_0^{(\tau)}(\hat q_{i,t+1}^{(\tau,\Delta)}),
\qquad
\hat a_{i,t+\Delta}^{(m,\tau)} = R_m^{(\tau)}(\hat q_{i,t+1}^{(\tau,\Delta)}).
\]
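Algorithm 2 is a short pipeline in code. In this sketch every matrix is an invented linear encoder; only the flow encode → interact → decode comes from the text:

```python
import numpy as np

# Sketch of Algorithm 2 with illustrative linear encoders: Psi is a
# concatenate-and-project interaction, G and the R heads are linear maps.
# All matrices and dimensions are assumptions for the sketch.

def encode_proposition(x, Wp):
    return Wp @ x                                  # p_t = E_p(x_t)

def interact(o, p, Wh):
    return np.tanh(Wh @ np.concatenate([o, p]))    # h = Psi(o, p)

def next_state(h, Wg):
    return Wg @ h                                  # q_hat = G(h)

def readouts(q, Wy, Wa):
    return Wy @ q, Wa @ q                          # primary y_hat, probe bundle a_hat
```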

### Algorithm 3: Update from error

Observe the realized outcome \(y_{i,t+\Delta}^{(\tau)}\) and auxiliary probes \(a_{i,t+\Delta}^{(m,\tau)}\). Compute the training objective

\[
\mathcal L_\tau
=
\mathcal L_{\mathrm{main}}
+ \sum_{m=1}^{M}\lambda_m \mathcal L_{\mathrm{probe},m}
+ \lambda_{\mathrm{reg}}\Omega(\theta).
\]

This is the ordinary minimize-by-gradient-descent version of the update. The probe heads are there to force reusable structure into the predictive state rather than letting the model survive on one brittle headline target, and the regularizer is there to keep the fit honest instead of rewarding parameter sprawl.

Update parameters by gradient step,

\[
\theta_{t+1} = \theta_t - \eta\nabla_\theta \mathcal L_\tau,
\]

or, in Bayesian form,

\[
p(\theta\mid D_{1:t+1}) \propto p(D_{t+1}\mid \theta)\,p(\theta\mid D_{1:t}).
\]

The purpose of the update is not only to improve the transition operator. It is also to refine the estimated embedding, the fast state update, the task projection, and the feature family retained for the task.
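A toy scalar instance makes the update concrete. The targets and weights below are made up; the point is only that the main term, one probe term, and the regularizer descend together:

```python
# Toy Algorithm 3: one scalar parameter, a main loss, one probe loss, and
# an L2 regularizer, minimized by plain gradient descent. All constants
# are invented for illustration.

def loss(theta, lam_probe=0.5, lam_reg=0.01):
    main = (theta - 2.0) ** 2      # L_main
    probe = (theta - 1.0) ** 2     # L_probe,1
    reg = theta ** 2               # Omega(theta)
    return main + lam_probe * probe + lam_reg * reg

def grad(theta, lam_probe=0.5, lam_reg=0.01):
    return 2 * (theta - 2.0) + lam_probe * 2 * (theta - 1.0) + lam_reg * 2 * theta

def descend(theta0=0.0, lr=0.1, steps=200):
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta
```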

### Algorithm 4: Detect drift and reopen discovery

No implementation should be assumed stable forever. If the environment changes, if incentives shift, or if a once-inert distinction becomes active, model quality will decay. Let recent performance be compared against reference performance. When the drop is persistent, the current projection is no longer sufficient. At that point the system must reopen the discovery process: add candidate features, reweight existing ones, or rebuild the task projection.
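A minimal drift monitor, with invented window sizes and tolerance, looks like this:

```python
# Sketch of the Algorithm 4 trigger: compare a recent rolling window of
# performance against an early reference window and flag persistent decay.
# Window size and tolerance are illustrative choices.

def drift_detected(perf_history, window=5, tolerance=0.05):
    """True when the recent-window mean falls below reference by > tolerance."""
    if len(perf_history) < 2 * window:
        return False
    reference = sum(perf_history[:window]) / window
    recent = sum(perf_history[-window:]) / window
    return recent < reference - tolerance
```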

This is the correct sense in which the framework is open-ended. The framework is not falsified by a weak first implementation; specific implementations are. The framework supplies the lexicon and the update logic by which better implementations can be constructed.

## From Forecasting to Proposition Search

This is the place where I stop pretending the point of the machinery is merely to admire prediction metrics.

The practical purpose of the framework is not only to forecast outcomes, but to compare admissible candidate propositions by their expected effect on the observer's next task-relevant state and downstream objective. Otherwise, why the hell are we building it?

Let \(\mathcal X_{i,t}^{\mathrm{adm}}\) denote the admissible candidate proposition set for observer \(i\) at time \(t\). "Admissible" here just means allowable under the task, channel, platform, and whatever policy constraints you are actually operating under.

Let \(U_\tau\) be the utility for task \(\tau\), defined on the predicted next state and its readouts. Then the model-induced score of a proposition is

\[
\operatorname{score}_\theta(x \mid s_{i,t}^{(\tau,\Delta)})
=
\mathbb E_\theta\!\left[
U_\tau\!\left(
\hat q_{i,t+1}^{(\tau,\Delta)},
\hat y_{i,t+\Delta}^{(\tau)},
\hat a_{i,t+\Delta}^{(\tau)}
\right)
\mid
s_{i,t}^{(\tau,\Delta)}, x
\right].
\right].
\]

The corresponding proposition search problem is

\[
x_t^\star
\in
\arg\max_{x \in \mathcal X_{i,t}^{\mathrm{adm}}}
\operatorname{score}_\theta(x \mid s_{i,t}^{(\tau,\Delta)}).
\]
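Operationally, the search is an argmax over the admissible set. Here `score` is a dummy closure standing in for \(\operatorname{score}_\theta\):

```python
# Proposition search as an argmax over admissible candidates. The `score`
# callable is a placeholder for the model-induced score; the toy version
# in the test just rewards closeness to a scalar state.

def best_proposition(state, admissible, score):
    """Return the admissible proposition with the highest model score."""
    return max(admissible, key=lambda x: score(x, state))
```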

That gives us three distinct regimes.

First, there is **forecasting**: estimate the next predictive state and readout of the proposition that actually happened.

Second, there is **observational ranking**: simulate or score counterfactual candidate propositions under the model. This is useful and often already valuable.

Third, there is **interventional policy improvement**: choose propositions based on those scores and claim they will improve outcomes in the real world. This third step is not free. It requires logged propensities, randomized exploration, or online experimentation. Without that, what you have is ranking, not causal control.

That distinction matters because this system is obviously drifting toward control. Better to say it cleanly than hide it in a footnote.

## Closing Part 3

Part 1 argued that reality, as it appears to an organism, is not a mirror of noumena but the output of an evolved representational structure. Part 2 showed how that structure can be individualized, historically deformed, estimated from observable traces, and compressed into a task-conditioned predictive state. Part 3 completes the descent into mathematics by specifying the program that follows from those claims.

The ideal object remains the next phenomenal state. The practical object is a predictive observer-state. The observer, the role, the world, and the proposition can be embedded in a shared arena. Their interaction can be modeled as a transition. The resulting state can be rolled forward, compared against outcomes, decoded into auxiliary probes, and updated under error. This is the meaning of a transcendental world model in operational terms.

The framework can therefore be summarized in one line:

\[
(\hat T_i, z_{i,t}, c_{i,t}, w_t, x_t)
\;\longrightarrow\;
\hat q_{i,t+1}^{(\tau,\Delta)}
\;\longrightarrow\;
(\hat y_{i,t+\Delta}^{(\tau)}, \hat a_{i,t+\Delta}^{(\tau)})
\;\longrightarrow\;
\text{update}.
\]

And if proposition search is turned on, the line extends to

\[
s_{i,t}^{(\tau,\Delta)}
\;\longrightarrow\;
\operatorname{score}_\theta(x \mid s_{i,t}^{(\tau,\Delta)})
\;\longrightarrow\;
x_t^\star.
\]

That is the whole ambition of this section. Not a complete algebra of mind, but a way to build one without lying about what has and has not been solved.

OK, it is time to get serious now.

---


## Part 4: Benchmarking the World Model


Part 3 supplied a framework. Part 4 makes it answerable to some sort of data.

We've argued so far that an observer can be represented as a transcendental embedding, that this embedding can be approximated from outward traces, and that propositions can be represented in the same formal arena as the observer. But any framework that does not specify what counts as state, what data instantiates that state, what task is being predicted, what stronger baselines it must beat, and what evidence would justify proposition optimization is nascent and not really worth anyone's time. This section deals with benchmarking.

The purpose of Part 4 is not to prove the full metaphysical claim directly. It is to ask a narrower question: if we represent a human in role (the Chimera) as a slow embedding plus a fast latent state, do we predict observable transitions better than simpler models that use only static profile data, current-touch features, or history summaries? And if we later use that model to choose messages, do we have the intervention machinery to say something more than "the simulator liked this one"?

## 4.1 Operational Definition of State

Philosophically, the state of the organism is total. It includes perception, interoception, memory, action tendency, and the actions already underway. But benchmarks do not get to be mystical.

Let

\[
\phi_{i,t}
\]

denote the full phenomenal state of person \(i\) at time \(t\).

That object is not directly observed. So the benchmark requires two additional definitions.

First, define the predictive observer-state \(q_{i,t}^{(\tau,\Delta)}\) for task \(\tau\) and horizon \(\Delta\) by the sufficiency condition

\[
P\!\left(
Y_{i,t+\Delta}^{(\tau)}
\mid
H_{i,\le t}, T_i, c_{i,t}, w_t, x_t
\right)
=
P\!\left(
Y_{i,t+\Delta}^{(\tau)}
\mid
q_{i,t}^{(\tau,\Delta)}, x_t
\right),
\qquad
x_t \in \mathcal X_{i,t}^{\mathrm{adm}}.
\]

Second, define the operational state used in the dataset as the measurable approximation

\[
s_{i,t}^{(\tau,\Delta)} = (\hat T_i, z_{i,t}, c_{i,t}, w_t),
\]

where

\[
\hat T_i = E_T(u_i, g_i^{\mathrm{slow}})
\]

is the slow estimated embedding of the person,

\[
z_{i,t} = E_Z(H_{i,\le t}, \hat T_i, c_{i,t}, w_t, g_{i,t}^{\mathrm{fast},\tau})
\]

is the fast latent state inferred from recent interaction history,

\[
c_{i,t}
\]

is the current role-and-institution context, and

\[
w_t
\]

is the relevant world state.

So the hierarchy is now explicit:

\[
\phi_{i,t}
\quad\text{(full motivating state)}
\]

\[
q_{i,t}^{(\tau,\Delta)}
\quad\text{(formal predictive state)}
\]

\[
s_{i,t}^{(\tau,\Delta)}
\quad\text{(measurable benchmark approximation)}.
\]

The current proposition remains separate:

\[
x_t \in \mathcal X_{i,t}^{\mathrm{adm}},
\]

because the benchmark asks how a particular proposition changes the next state. Here the slow categorical bank captures what the person is generally like at this stage, while the fast categorical pool captures which discrete dispositions are currently active under the present regime.

The ideal transition is still

\[
\phi_{i,t+\Delta} = F(T_i, \phi_{i,t}, x_t),
\]

but the measurable benchmark version is

\[
\hat q_{i,t+1}^{(\tau,\Delta)} = G_\theta(s_{i,t}^{(\tau,\Delta)}, x_t),
\qquad
\hat y_{i,t+\Delta}^{(\tau)} = R_0\!\left(\hat q_{i,t+1}^{(\tau,\Delta)}\right).
\]

This preserves the paper's claim that the true target is the next phenomenal transition, while admitting that the benchmark must train against observable proxies rather than direct access to qualia.

## 4.2 Dataset Construction

For each task \(\tau\), define a dataset of event-time examples:

\[
\mathcal D_\tau
=
\left\{
(u_i, H_{i,\le t}, c_{i,t}, w_t, x_t, y_{i,t+\Delta}^{(\tau)}, a_{i,t+\Delta}^{(\tau)})
\right\}_{(i,t)}.
\]

Here

\[
u_i = [p_i, b_i, \ell_i, r_i, h_i, g_i^{\mathrm{grp}}],
\]

where \(p_i\) is any psychometric proxy we can actually infer with a straight face, \(b_i\) is biography, \(\ell_i\) is language and discourse features, \(r_i\) is role and institution history, \(h_i\) is the observable subset of life-history structure recoverable from logs or profiles, and \(g_i^{\mathrm{grp}}\) is firm, account, or longer-run group context. If a coordinate cannot be inferred cleanly from real logs, its slot is masked rather than fabricated.

In addition, construct a slow categorical bank \(g_i^{\mathrm{slow}}\) from long-run source-tagged categorical traces and a fast categorical pool \(g_{i,t}^{\mathrm{fast},\tau}\) from recent history. The slow bank stores what the person is generally like now across regimes; the fast pool stores what is currently active after contextual lifting and weighting.

Let the interaction history be

\[
H_{i,\le t} = [e_{i,1}, e_{i,2}, \dots, e_{i,t}],
\]

with each event encoded as

\[
e_{i,t} = [x_t, \delta_t, r_t, a_t, m_t^{\mathrm{obs}}, e_{i,t}^{\mathrm{cat}}],
\]

where \(x_t\) is the proposition presented at time \(t\), \(\delta_t\) is time since the last interaction, \(r_t\) is the observed response bundle, \(a_t\) is the action taken by the agent, \(m_t^{\mathrm{obs}}\) is an observable memory proxy such as resurfaced objection themes, repeated concerns, or revisited product topics, and \(e_{i,t}^{\mathrm{cat}}\) is the source-tagged **raw** categorical event bundle. This bundle should preserve whether a category came from biography, stated language, observed behavior, or a third-party inference. It can later be aligned; it should not be forced into immediate equivalence.

The distinction between \(e_{i,t}^{\mathrm{cat}}\) and \(g_{i,t}^{\mathrm{fast},\tau}\) matters. The first is the current event-level categorical shock. The second is the separately computed pooled summary of prior categorical history after contextual lifting and weighting. Keeping both lets the model distinguish current content from accumulated state rather than injecting the same object twice under two names.

When available, action-type traces and mere-exposure traces should both be logged. The first often carries sharper salience; the second can still accumulate gradually.

The role-context term is

\[
c_{i,t}
=
[
\text{title}_{i,t},
\text{seniority}_{i,t},
\text{dealstage}_{i,t},
\text{sender-role}_{t},
\text{buyer-role}_{t},
\text{firm-position}_{i,t}
].
\]

The world-state term is

\[
w_t
=
[
\text{market}_{t},
\text{account-health}_{t},
\text{org-pressure}_{t}
].
\]

These terms stay explicit because the environment matters. Which ones we can actually carry depends on what data we can get.

For the first benchmark, the primary outcome bundle should stay narrow and logged:

\[
y_{i,t+\Delta}^{(\tau)}
=
[
\text{reply}_{7d},
\text{meeting}_{21d},
\text{stageadvance}_{30d},
\text{close}_{90d}
].
\]

Now add the auxiliary probe bundle:

\[
a_{i,t+\Delta}^{(\tau)}
=
[
\text{objectionclass},
\text{sentimentshift},
\text{urgencyshift},
\text{nextactiontype},
\text{replydelaybucket}
].
\]

These probes should be domain-specific and extractable from transcripts, email text, CRM status changes, or structured coding. They are not meant to reveal the one true hidden motive of the prospect. They are meant to test whether the latent state carries structured signal that generalizes beyond a single binary target.

So the first dataset should be built from CRM events, email logs, call transcripts, meeting records, account metadata, sender metadata, and company descriptors. If psychometric proxies, firm embeddings, or market-pressure features are unavailable, run the benchmark without them first. Do not invent variables because they sound sophisticated.

Also, a quick aside: I originally mentally modeled the fast and slow factors to help represent the phenomenon of mere exposure, but it evolved into its own thing over time. That origin still matters. Repeated exposure is exactly the kind of effect that should show up in the fast categorical pools before it ever deserves metaphysical inflation, while decisive actions should usually be allowed to hit the local state harder than passive contact alone.

## 4.3 The Benchmark

The benchmark is simple: does an explicit predictive-state model beat weaker baselines on prediction, and does the explicit slow/fast decomposition beat a generic sequence model that has enough capacity to absorb everything into one black box?

Formally, the claim is this:

If recent interaction history contains predictive information that cannot be reduced to static profile features or the current proposition alone, then a model with an explicit fast latent state \(z_{i,t}\) should outperform baselines that omit that state.

That means the model has to beat the following baselines.

**Baseline 0: frequency baseline**

\[
\hat y = \Pr(y=1)
\]

estimated from training prevalence alone.
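Even Baseline 0 deserves to exist as code, since it is the floor every later model must beat. A sketch with log loss as the metric:

```python
import math

# Baseline 0: predict the training prevalence for every example and
# evaluate with log loss. Anything that cannot beat this is noise.

def prevalence(train_labels):
    """Pr(y = 1) estimated from training data alone."""
    return sum(train_labels) / len(train_labels)

def log_loss(p, labels, eps=1e-12):
    """Mean negative log-likelihood of a single constant prediction p."""
    p = min(max(p, eps), 1 - eps)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y in labels) / len(labels)
```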

**Baseline 1: current-touch model**

\[
\hat y_{i,t+\Delta}^{(\tau)} = h_1(x_t, c_{i,t}),
\]

for example logistic regression or linear classification using only the current proposition and current context.

**Baseline 2: static tabular model**

\[
\hat y_{i,t+\Delta}^{(\tau)} = h_2(u_i, c_{i,t}, w_t, x_t),
\]

for example gradient-boosted trees or regularized logistic regression using person, firm, world, and proposition features, but no explicit sequence state.

**Baseline 3: shallow-history model**

\[
\hat y_{i,t+\Delta}^{(\tau)} = h_3(u_i, c_{i,t}, w_t, x_t, \mathrm{agg}(H_{i,\le t})),
\]

where \(\mathrm{agg}(H_{i,\le t})\) is a hand-built summary such as touch count, last-response delay, reply rate, prior meeting count, or resurfaced topic counts.

**Baseline 4: recommender-style two-tower model**

\[
\hat y_{i,t+\Delta}^{(\tau)} = h_4(\hat u_i, \hat x_t),
\]

where person/account and proposition are embedded separately and scored by dot product or shallow fusion, but no explicit recurrent state is maintained.

**Baseline 5: monolithic sequence model**

\[
\hat y_{i,t+\Delta}^{(\tau)} = h_5(u_i, c_{i,t}, w_t, x_t, H_{i,\le t}),
\]

implemented by a generic recurrent, transformer, or state-space sequence model that sees the same event stream but does **not** enforce an explicit slow/fast decomposition.

This last baseline matters. If a monolithic sequence block with enough capacity eats my lunch, then the decomposition was just a story I told myself after the fact. If the explicit decomposition still wins or ties while being more interpretable, then it has earned the right to stay.

## 4.4 The Proposed Latent-State Model

The model keeps the slow and fast parts separate. The slow categorical bank is source-aware and regime-aware. The fast categorical pool privileges decisive action traces over mere exposure while still allowing repeated weak exposure to accumulate, and apparent contradictions are contextually lifted before they are treated as unresolved opposition.

First, estimate the slow person-side embedding:

\[
\hat T_i = E_T(u_i, g_i^{\mathrm{slow}}).
\]

Second, initialize a fast state:

\[
z_{i,0} = z_0(\hat T_i).
\]

Third, update the fast state as events occur:

\[
z_{i,t+1} = U_\theta(z_{i,t}, \hat T_i, c_{i,t}, w_t, e_{i,t}, g_{i,t}^{\mathrm{fast},\tau}).
\]

The active regime determines which mask is presently live, but the person-side object remains one person rather than many separate selves.

When a candidate proposition \(x_{t+1}\) is under consideration, predict the resulting next predictive state by

\[
\hat q_{i,t+1}^{(\tau,\Delta)}
=
G_\theta(\hat T_i, z_{i,t}, c_{i,t}, w_t, x_{t+1}).
\]

The primary readout is

\[
\hat y_{i,t+1+\Delta}^{(\tau)}
=
R_0(\hat q_{i,t+1}^{(\tau,\Delta)}).
\]

The auxiliary probe readouts are

\[
\hat a_{i,t+1+\Delta}^{(m,\tau)}
=
R_m(\hat q_{i,t+1}^{(\tau,\Delta)}),
\qquad m = 1,\dots,M.
\]

When I later write \(\hat a_{i,t+1+\Delta}^{(\tau)}\) or \(\hat a_{i,t+\Delta}^{(\tau)}\) without the probe index, I mean the full auxiliary bundle \((\hat a^{(1,\tau)}, \dots, \hat a^{(M,\tau)})\).

This multi-head structure is there so the latent state does not remain a completely black box. If the state is real in the operational sense, it should carry reusable structure that helps decode more than one downstream observable.

The architecture of \(U_\theta\) and \(G_\theta\) is not fixed by the theory. The first implementation may be a GRU, an LSTM, a state-space model, or a Mamba-like sequential block. That choice is an implementation detail. The theory requires a stateful update. It does not require blind loyalty to one named architecture.
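As one candidate among several, a single gated cell in NumPy is enough to show what a stateful \(U_\theta\) looks like. The gate structure is GRU-like but simplified, and the input is assumed to be the concatenated \([\hat T_i, c_{i,t}, w_t, e_{i,t}, g_{i,t}^{\mathrm{fast},\tau}]\) feature vector:

```python
import numpy as np

# Minimal stateful update U_theta as one gated recurrent cell. This is a
# simplified GRU-style candidate, not the required architecture; the theory
# only demands that the update be stateful.

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gated_update(z, inputs, Wz, Uz, Wh, Uh):
    """z_{t+1} = (1 - g) * z_t + g * candidate, with g a learned gate."""
    g = sigmoid(Wz @ inputs + Uz @ z)        # update gate in (0, 1)
    cand = np.tanh(Wh @ inputs + Uh @ z)     # candidate fast state
    return (1.0 - g) * z + g * cand
```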

## 4.5 Training Objective, Update Loop, and Intervention

Start with a simple training objective:

\[
\mathcal L_\tau
=
\mathcal L_{\mathrm{main}}
+ \sum_{m=1}^{M}\lambda_m \mathcal L_{\mathrm{probe},m}
+ \lambda_{\mathrm{reg}} \Omega(\theta).
\]

This is the ordinary descent-friendly form. The main term fits the task we actually care about. The probe terms make the latent state carry reusable structure instead of one narrow trick. The regularization term is ordinary weight control, not a dare.

The model updates on two timescales.

The fast state updates after each observed event:

\[
z_{i,t+1} = U_\theta(z_{i,t}, \hat T_i, c_{i,t}, w_t, e_{i,t}, g_{i,t}^{\mathrm{fast},\tau}).
\]

The slow embedding updates more slowly when durable evidence accumulates. Let \(\hat T_i^{\mathrm{new}}\) denote the refreshed slow estimate obtained after recomputing the durable person-side encoder from newly accumulated slow evidence. Then

\[
\hat T_i \leftarrow (1-\alpha)\hat T_i + \alpha\,\hat T_i^{\mathrm{new}}.
\]

That distinction matters. A prospect's momentary state may move after one email. Their durable embedding should not be rewritten that quickly; it should drift toward a refreshed durable estimate instead of lurching around because one event was loud.
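The refresh is an exponential moving average, which is easy to state directly. The rate \(\alpha\) below is illustrative:

```python
# Slow refresh as an exponential moving average: the durable embedding
# drifts toward the refreshed estimate at rate alpha instead of jumping.
# alpha = 0.1 is an illustrative choice, not a recommendation.

def slow_refresh(T_hat, T_new, alpha=0.1):
    """T_hat <- (1 - alpha) * T_hat + alpha * T_new, coordinate-wise."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(T_hat, T_new)]
```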

The model parameters update by ordinary gradient step,

\[
\theta_{k+1} = \theta_k - \eta \nabla_\theta \mathcal L_\tau.
\]

Now for the part I wanted to make explicit.

In deployment, the update loop is:

1. construct the current operational state \(s_{i,t}^{(\tau,\Delta)}\),
2. score one or more candidate propositions \(x_t^{(1)}, \dots, x_t^{(K)}\),
3. choose an action by the current policy,
4. observe the response,
5. update \(z_{i,t}\),
6. periodically refit \(\theta\) and refresh \(\hat T_i\).
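The six steps above reduce to a small control loop. Every callable here is a placeholder for machinery defined elsewhere in the paper; only the flow is the point:

```python
# Skeleton of one deployment step. `score`, `policy`, `observe`, and
# `update_fast` are placeholders for the model pieces defined in Part 4;
# periodic refits of theta and T_hat happen outside this inner loop.

def deployment_step(state, candidates, score, policy, observe, update_fast):
    scores = {x: score(x, state) for x in candidates}   # step 2: score candidates
    action = policy(scores)                             # step 3: choose by policy
    response = observe(action)                          # step 4: observe outcome
    new_state = update_fast(state, action, response)    # step 5: update z_{i,t}
    return action, response, new_state
```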

But there are two very different ways to use those scores.

### Observational ranking

If all you have are retrospective logs, define the model score as

\[
\operatorname{score}_\theta(x \mid s_{i,t}^{(\tau,\Delta)})
=
\mathbb E_\theta\!\left[
U_\tau\!\left(
\hat q_{i,t+1}^{(\tau,\Delta)},
\hat y_{i,t+\Delta}^{(\tau)},
\hat a_{i,t+\Delta}^{(\tau)}
\right)
\mid
s_{i,t}^{(\tau,\Delta)}, x
\right].
\]

This lets you rank or simulate candidate propositions. Useful, yes. Causal, not yet.

### Off-policy evaluation

If the historical system logged propensities

\[
e_t = \mu(x_t \mid s_{i,t}^{(\tau,\Delta)}),
\]

for the behavior policy \(\mu\), then a deterministic target proposition policy \(\pi\) can be evaluated off-policy. A simple inverse-propensity estimator is

\[
\hat V_{\mathrm{IPS}}(\pi)
=
\frac{1}{N}
\sum_{t=1}^{N}
\frac{\mathbf 1\{x_t = \pi(s_{i,t}^{(\tau,\Delta)})\}}{e_t}
\, r_t,
\]

where \(r_t\) is the realized reward or task utility. If the target policy is stochastic rather than deterministic, replace the indicator with the usual importance ratio \(\pi(x_t \mid s_{i,t}^{(\tau,\Delta)}) / \mu(x_t \mid s_{i,t}^{(\tau,\Delta)})\). In practice, a stabilized or doubly-robust estimator is usually preferable, but the point is not the exact estimator; the point is that without propensities or randomization, you do not get to claim policy value cleanly.
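A bare IPS estimator for the deterministic-policy case can be sketched directly from the formula. The logs here are synthetic tuples \((s, x, e, r)\):

```python
# Inverse-propensity scoring for a deterministic target policy pi. Each log
# entry is (state, logged_action, logged_propensity, reward); the data in
# the test is synthetic with a uniform behavior policy.

def ips_value(logs, pi):
    """V_hat_IPS(pi) = mean over logs of 1{x_t = pi(s_t)} * r_t / e_t."""
    total = 0.0
    for s, x, e, r in logs:
        if x == pi(s):
            total += r / e
    return total / len(logs)
```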

### Online policy improvement

If you can randomize a controlled fraction of traffic, then proposition selection becomes a real policy-learning problem rather than a retrospective ranking problem. At that point, the latent-state model can be used to choose among admissible messages, offers, sequences, or interventions, and its value can be evaluated by lift, regret, or cumulative reward under live deployment.

Until then, leave the causal swagger out of it. The model may still be useful. It is just useful as a ranking and simulation device rather than as a proven controller.

## 4.6 Temporal Split, Evaluation, and Drift

The benchmark must be temporal. Random row splits leak future information.

Partition the data into rolling windows:

\[
\mathcal D_{\mathrm{train}}^{(1:T_1)},
\quad
\mathcal D_{\mathrm{val}}^{(T_1:T_2)},
\quad
\mathcal D_{\mathrm{test}}^{(T_2:T_3)}.
\]
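The partition itself is a few lines once events carry timestamps. The boundaries \(T_1 < T_2 < T_3\) are task-specific choices:

```python
# Temporal partitioning sketch: rolling windows by timestamp, never random
# rows. Windows are half-open on the right; events are (timestamp, payload).

def temporal_split(events, t1, t2, t3):
    train = [e for e in events if e[0] < t1]
    val = [e for e in events if t1 <= e[0] < t2]
    test = [e for e in events if t2 <= e[0] < t3]
    return train, val, test
```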

The main primary-task metrics should be

\[
\mathrm{LogLoss}, \qquad \mathrm{Brier}, \qquad \mathrm{PR\text{-}AUC}, \qquad \mathrm{ECE}.
\]

Expected calibration error matters if calibrated probabilities are needed for ranking actions. PR-AUC matters because reply, meeting, and close events are sparse.
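Of the four metrics, ECE is the one most often hand-waved, so here is a standard equal-width binned version as a sketch:

```python
# Expected calibration error: bin predictions by confidence and average the
# |accuracy - confidence| gap, weighted by bin size. Ten equal-width bins
# is a common default, not a requirement.

def ece(probs, labels, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean confidence in bin
        acc = sum(y for _, y in b) / len(b)    # empirical accuracy in bin
        total += len(b) / len(probs) * abs(acc - conf)
    return total
```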

For the auxiliary probes, report probe-appropriate metrics such as macro-F1, AUROC, or calibration, depending on whether the probe is multiclass, binary, ordinal, or continuous. If the latent state is supposed to carry structured signal, that signal should show up in more than one head.

The benchmark should also include ablations that force the slow/fast story to either pay rent or die.

1. **Remove \(z_{i,t}\).** If short-horizon performance barely moves, the fast state is ornamental.
2. **Remove \(\hat T_i\).** If cold-start or cross-context generalization barely moves, the slow embedding is ornamental.
3. **Replace the salience-weighted categorical pools with uniform averaging.** If performance barely moves, the claim that sharp action and cumulative weak exposure deserve different treatment is ornamental.
4. **Collapse source channels and role regimes before pooling.** If performance improves, my insistence on source-aware and regime-aware separation was theater; if it hurts, then source/regime separation together with contextual disambiguation was buying real signal.
5. **Shuffle recent within-person history while preserving static profile.** If performance does not fall, the model was not really using the history in the way I claimed.
6. **Replace the explicit slow/fast architecture with the monolithic sequence baseline.** If the generic sequence model dominates, then the decomposition is not buying enough.
7. **Remove the probe heads.** If probes contribute no stable signal and no regularization value, then they are decorative; if they improve robustness or reveal reusable structure, then they are doing their job.

Drift has to be part of the benchmark rather than an afterthought. Let recent performance on a rolling window be

\[
\mathrm{Perf}_{\tau}(t:t+h).
\]

If this falls below a threshold

\[
\mathrm{Perf}_{\tau}(t:t+h) < \gamma_{\tau},
\]

the system should trigger one of three responses: refit parameters, expand the feature set, or reopen the task projection \(\Pi_{\tau}\). A model like this is expected to become wrong. The point is to catch it when it does and update it.

If the proposition-selection layer is being tested with propensities or online experiments, report off-policy value estimates or live lift separately from the pure forecasting metrics. Those are different questions and should not be mashed together.

## 4.7 What Counts as Success

Success is not that the model sounds deep. Success is narrower.

The latent-state framing succeeds in the first domain if

\[
\mathrm{LogLoss}(M_{\mathrm{latent}})
<
\mathrm{LogLoss}(M_{\mathrm{best\ baseline}}) - \epsilon_1,
\]

and

\[
\mathrm{Brier}(M_{\mathrm{latent}})
<
\mathrm{Brier}(M_{\mathrm{best\ baseline}}) - \epsilon_2,
\]

on temporally held-out data, across more than one outcome horizon.

It succeeds more strongly if the gains survive drift, if ablations show that the explicit fast state \(z_{i,t}\) is carrying unique short-horizon signal, if removing \(\hat T_i\) hurts cold-start or cross-context performance in exactly the way the slow/fast story predicts, and if uniform or source-collapsed categorical pooling underperforms the salience-weighted, source-aware, and regime-aware version.

It succeeds more strongly still if the auxiliary probe heads show that the latent state supports reusable structure beyond a single binary label.

If intervention data exists, proposition search succeeds when

\[
\hat V(\pi_{\mathrm{latent}})
>
\hat V(\pi_{\mathrm{baseline}}) + \epsilon_3
\]

under off-policy evaluation or live testing. If intervention data does **not** exist, then all proposition search results must be described as simulated rankings, not causal policy wins.

It fails if one of the simpler baselines matches or exceeds it once history summaries are included, if the monolithic sequence model dominates the explicit decomposition without interpretability tradeoff, or if the gains disappear on future windows. In that case, either the state decomposition is wrong, the dataset does not contain the signal I thought it did, or the task never needed this much machinery in the first place.

## End of Part 4

Part 4 is where the framework becomes somewhat verifiable, because data can now be brought to bear.

At this stage, the job is straightforward: define what state means in a form a dataset can carry, define what must be predicted, define which weaker and stronger models must be beaten, define what the probes are supposed to prove, and define what intervention evidence is required before proposition optimization can be called causal. That is enough to move from framework to testable program.

The whole ambition can be written operationally as

\[
(\hat T_i, z_{i,t}, c_{i,t}, w_t, x_t)
\longrightarrow
\hat q_{i,t+1}^{(\tau,\Delta)}
\longrightarrow
(\hat y_{i,t+\Delta}^{(\tau)}, \hat a_{i,t+\Delta}^{(\tau)})
\longrightarrow
\text{update}
\longrightarrow
\text{benchmark}.
\]

And if we do later choose messages from the model, the extra line is

\[
s_{i,t}^{(\tau,\Delta)}
\longrightarrow
\operatorname{score}_\theta(x \mid s_{i,t}^{(\tau,\Delta)})
\longrightarrow
x_t^\star,
\]

with the giant asterisk that ranking is not causality unless the data collection regime supports that claim.

This is the point where the theory becomes falsifiable. The question is no longer whether reality can be expressed in a shared formal arena. The question is whether that arena yields better forecasts of observable human transition than models that ignore explicit state, and whether its proposition-search layer survives the much nastier standard of intervention.

---

## Part 5: Axioms, Lemmas, and Main Theorem


## Proof Boundary

> I am not proving the whole metaphysical carnival I constructed. I am proving the internal mathematics of the system and the conditions under which explicit predictive state beats thinner baselines, can be compressed, can be decomposed into slow and fast pieces, and can induce a well-defined proposition ranking. The rest would be impossible from the formalism alone, and I do not get extra points for pretending otherwise.

> The boundary keeps the proof mostly honest and keeps my pseudo-Kantian philosophical leanings from being smuggled into the math. I will systematically build my case now off of everything we discussed. Also, I do not have formal training in proofs, I mostly used the last 2 generations of AI models for this formalization, most of the work I did here was structural and ensuring the notation didn't drift and that the results made enough sense to publish. I cannot fully trust this proof so if you see anything off please email me Bjorn@psi.dev.

Because Parts 3 and 4 now fix the sign of the training objective and the meaning of the slow refresh, I am going to keep those conventions explicit here too instead of quietly proving against a stale version of the machinery.

## 5.1 Axioms

I begin with axioms because these are the primitive commitments the rest of the machine will run on. If I do not pin these down explicitly, every later lemma quietly changes its meaning depending on how charitable the reader feels like being.

**Axiom 1 (Ambient space).** Let \(\mathcal N\) be a real separable Hilbert space, and let \(\mathcal M^{\mathrm{spec}} \subset \mathcal N\) be the finite-dimensional accessible subspace preserved by the lineage. Let

\[
P^{\mathrm{spec}} : \mathcal N \to \mathcal M^{\mathrm{spec}}
\]

be the orthogonal projection onto that accessible subspace.

**Axiom 2 (Individual instantiation).** For each observer \(i\), there exists an inherited template \(G_i\), a realized embedding

\[
T_i = \Psi(G_i, \ell_i, h_i),
\]

and a phenomenal state \(\phi_{i,t}\) at time \(t\).

**Axiom 3 (Task-conditioned predictive state).** For each task \(\tau\) and horizon \(\Delta\), there exists a predictive observer-state

\[
q_{i,t}^{(\tau,\Delta)}
\]

and an admissible proposition set \(\mathcal X_{i,t}^{\mathrm{adm}}\) such that, for every \(x_t \in \mathcal X_{i,t}^{\mathrm{adm}}\),

\[
P\!\left(
Y_{i,t+\Delta}^{(\tau)}
\mid
H_{i,\le t}, T_i, c_{i,t}, w_t, x_t
\right)
=
P\!\left(
Y_{i,t+\Delta}^{(\tau)}
\mid
q_{i,t}^{(\tau,\Delta)}, x_t
\right).
\]

**Axiom 4 (Operational slow/fast approximation).** For a task \(\tau\) and horizon \(\Delta\), define the measurable operational state by

\[
s_{i,t}^{(\tau,\Delta)} = (\hat T_i, z_{i,t}, c_{i,t}, w_t),
\]

with

\[
\hat T_i = E_T(u_i, g_i^{\mathrm{slow}}),
\qquad
z_{i,t} = E_Z(H_{i,\le t}, \hat T_i, c_{i,t}, w_t, g_{i,t}^{\mathrm{fast},\tau}),
\]

and, in online form,

\[
z_{i,t+1}
=
U_\theta(z_{i,t}, \hat T_i, c_{i,t}, w_t, e_{i,t}, g_{i,t}^{\mathrm{fast},\tau}).
\]

Here \(g_i^{\mathrm{slow}}\) and \(g_{i,t}^{\mathrm{fast},\tau}\) are source-aware and regime-aware categorical trace pools built under a fixed slot schema with null and mask behavior for missing cells. Apparent contradictions are first contextually lifted into richer typed traces before later maps decide whether they are comparable, axis-conditioned, or genuinely unresolved.

The slow term moves on a longer timescale. If \(\hat T_i^{\mathrm{new}}\) is the refreshed durable estimate obtained from newly accumulated slow evidence, then the operational refresh is

\[
\hat T_i \leftarrow (1-\alpha)\hat T_i + \alpha\,\hat T_i^{\mathrm{new}},
\qquad 0 < \alpha \le 1.
\]

Equivalently, if

\[
\Delta \hat T_i := \hat T_i^{\mathrm{new}} - \hat T_i,
\]

then

\[
\hat T_i \leftarrow \hat T_i + \alpha\,\Delta \hat T_i.
\]

Assume this state is an operational approximation to \(q_{i,t}^{(\tau,\Delta)}\).

**Axiom 5 (Transition factorization).** For each task \(\tau\), the transition law factors through the task-relevant encoding of the proposition. That is, there exists a measurable transition map \(G_\theta\) such that

\[
\hat q_{i,t+1}^{(\tau,\Delta)}
=
G_\theta(\hat T_i, z_{i,t}, c_{i,t}, w_t, x_t).
\]

**Axiom 6 (Axis-retention rule).** At evolutionary stage \(\tau\), a candidate axis is retained only when its net fitness contribution is positive. If \(\Delta \Phi_{\tau} > 0\), the axis is added to the species-level access structure; otherwise it is not retained.

**Axiom 7 (Auxiliary readout factorization).** For each auxiliary probe \(A_{i,t+\Delta}^{(m,\tau)}\) included in the task, its conditional law factors through the predictive state. Equivalently, there exists a measurable readout \(R_m\) such that

\[
P\!\left(
A_{i,t+\Delta}^{(m,\tau)}
\mid
H_{i,\le t}, T_i, c_{i,t}, w_t, x_t
\right)
=
P\!\left(
A_{i,t+\Delta}^{(m,\tau)}
\mid
q_{i,t}^{(\tau,\Delta)}, x_t
\right).
\]

Whenever the probe index \(m\) is suppressed below, \(A_{i,t+\Delta}^{(\tau)}\) or \(\hat a_{i,t+\Delta}^{(\tau)}\) denotes the full auxiliary bundle \((A_{i,t+\Delta}^{(1,\tau)}, \dots, A_{i,t+\Delta}^{(M,\tau)})\) or its prediction.

Whenever the benchmark training loop appears below, the optimize-by-descent objective is

\[
\mathcal L_\tau
=
\mathcal L_{\mathrm{main}}
+ \sum_{m=1}^{M}\lambda_m \mathcal L_{\mathrm{probe},m}
+ \lambda_{\mathrm{reg}}\Omega(\theta),
\qquad
\lambda_m,\lambda_{\mathrm{reg}} \ge 0,
\]

with gradient descent update

\[
\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal L_\tau.
\]

This is not a fresh metaphysical axiom. It is just the sign-consistent operational convention inherited from Parts 3 and 4, and I am writing it here once so the proof section does not silently revert to the wrong loop.
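A toy descent step makes the sign convention concrete. This sketch assumes all penalty terms enter the minimized objective with positive weight (the standard convention, so that descent reduces each of them); the quadratic stand-ins for \(\mathcal L_{\mathrm{main}}\), \(\mathcal L_{\mathrm{probe},m}\), and \(\Omega\) are mine, not the paper's model.

```python
# A toy descent step on a composite objective of the Part 5 form: main loss
# plus weighted probe losses plus a regularizer, all minimized together.
# The quadratic terms are placeholders, not the paper's actual model.

def loss(theta, lambdas=(0.5,), lam_reg=0.1):
    main = (theta - 2.0) ** 2                 # stand-in for L_main
    probes = [(theta - 1.0) ** 2]             # stand-ins for L_probe,m
    reg = theta ** 2                          # stand-in for Omega(theta)
    return main + sum(l * p for l, p in zip(lambdas, probes)) + lam_reg * reg

def grad(theta, eps=1e-6):
    # Central finite difference; good enough for a scalar sketch.
    return (loss(theta + eps) - loss(theta - eps)) / (2 * eps)

theta, eta = 0.0, 0.1
for _ in range(200):
    theta -= eta * grad(theta)                # theta_{t+1} = theta_t - eta * grad

# theta converges to the minimizer of the composite objective (here 1.5625),
# which simultaneously trades off the main, probe, and regularization terms.
```

Had the probe and regularization terms carried negative signs in a minimized objective, the same loop would drive the probe losses up, which is exactly the "wrong loop" the convention above guards against.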

These are all of the axioms. From here onward, the proofs are internal to the system.

## 5.2 Definitions: sufficiency and minimality

I need these definitions explicitly because otherwise "state" can mean anything from the whole person to a CRM row.

**Definition 1 (Task-sufficient predictive state).** A state \(q_{i,t}^{(\tau,\Delta)}\) is sufficient for task \(\tau\) and horizon \(\Delta\) if, for every admissible proposition \(x_t \in \mathcal X_{i,t}^{\mathrm{adm}}\),

\[
P\!\left(
Y_{i,t+\Delta}^{(\tau)}
\mid
H_{i,\le t}, T_i, c_{i,t}, w_t, x_t
\right)
=
P\!\left(
Y_{i,t+\Delta}^{(\tau)}
\mid
q_{i,t}^{(\tau,\Delta)}, x_t
\right).
\]

**Definition 2 (Minimal predictive state).** A sufficient state \(q_{i,t}^{(\tau,\Delta)}\) is minimal if, for every other sufficient state \(r_{i,t}^{(\tau,\Delta)}\), there exists a measurable map \(h\) such that

\[
q_{i,t}^{(\tau,\Delta)} = h\!\left(r_{i,t}^{(\tau,\Delta)}\right)
\]

almost surely.

This is the cleanest way to state identifiability here. Not "the one true coordinates of the mind," but "the smallest sufficient state is unique up to admissible reparameterization."

## 5.3 Lemma 1: Best accessible approximation

I prove this first because projection has to stop being metaphor here. If the species-level slice is going to do any real work, it has to be the best accessible approximation in a strict sense, not just the slice I like philosophically.

**Lemma 1.** For every noumenal state \(\mathbf n_t \in \mathcal N\),

\[
P^{\mathrm{spec}} \mathbf n_t
=
\operatorname*{argmin}_{\mathbf m \in \mathcal M^{\mathrm{spec}}}
\lVert \mathbf n_t - \mathbf m \rVert.
\]

That is, the lineage-accessible projection is the unique closest point to the full noumenal state among all points in the accessible subspace.

**Proof.** Because \(\mathcal M^{\mathrm{spec}}\) is finite-dimensional, it is a closed subspace of the Hilbert space \(\mathcal N\). Every closed subspace of a Hilbert space admits a unique orthogonal projection. By the Hilbert projection theorem, that projection is exactly the unique element of the subspace minimizing the distance to \(\mathbf n_t\). Therefore \(P^{\mathrm{spec}} \mathbf n_t\) is the best accessible approximation to the noumenal state. QED.
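Lemma 1 can be checked numerically in a finite-dimensional slice. The basis and the test vector below are arbitrary examples of mine; the point is only that the orthogonal projection beats every other point of the subspace.

```python
# Numeric check of Lemma 1 in a finite-dimensional slice: the orthogonal
# projection onto a subspace is the closest point of that subspace.
import numpy as np

B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])                    # columns span M^spec inside R^3
P = B @ np.linalg.inv(B.T @ B) @ B.T          # orthogonal projector P^spec

n = np.array([3.0, 4.0, 5.0])                 # a stand-in "noumenal" vector
proj = P @ n                                  # best accessible approximation

# Randomly sampled points of the subspace are never closer than the projection.
rng = np.random.default_rng(0)
for _ in range(100):
    other = B @ rng.normal(size=2)
    assert np.linalg.norm(n - proj) <= np.linalg.norm(n - other) + 1e-12
```

Here `proj` is `[3, 4, 0]`: the third coordinate is exactly the part of the noumenal state the lineage cannot access, and dropping it is the unique distance-minimizing move.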

## 5.4 Lemma 2: Novelty test and monotonicity of retained axes

This is where I force the evolutionary step to become exact. A candidate axis only counts as new if it survives subtraction by what is already captured, and it only stays if it actually improves the objective rather than just sounding biologically plausible.

**Lemma 2.** Let \(\mathcal M_{\tau}^{\mathrm{spec}}\) be the current species-level accessible subspace, and let \(\Delta \mathbf v \in \mathcal N\) be a candidate mutation. Define the orthogonal residual

\[
\Delta \mathbf v_{\perp} = \Delta \mathbf v - P_{\tau}^{\mathrm{spec}} \Delta \mathbf v.
\]

Then:

1. if \(\Delta \mathbf v \in \mathcal M_{\tau}^{\mathrm{spec}}\), we have \(\Delta \mathbf v_{\perp} = 0\), so the mutation adds no genuinely new axis;
2. if \(\Delta \mathbf v_{\perp} \neq 0\) and \(\Delta \Phi_{\tau} > 0\), then retaining the normalized residual strictly increases the evolutionary objective

\[
J_{\tau}(\mathcal M)
=
\mathbb E_{e \sim \mathcal E_{\tau}}[W(e,\mathcal M)] - C(\mathcal M).
\]

**Proof.** The first claim is immediate: if \(\Delta \mathbf v\) already lies in the current subspace, its orthogonal projection onto that subspace is itself, hence the residual is zero. For the second claim, if \(\Delta \mathbf v_{\perp} \neq 0\), normalize it and consider the updated space

\[
\mathcal M_{\tau+1}^{\mathrm{spec}}
=
\mathcal M_{\tau}^{\mathrm{spec}} \oplus \operatorname{span}\{\widehat{\Delta \mathbf v}_{\perp}\}.
\]

By definition of the retention rule, \(\Delta \Phi_{\tau} > 0\) exactly when the expected reproductive gain of adding the axis exceeds its maintenance cost. Hence

\[
J_{\tau}(\mathcal M_{\tau+1}^{\mathrm{spec}})
>
J_{\tau}(\mathcal M_{\tau}^{\mathrm{spec}}).
\]

So accepted axes are precisely those that increase the evolutionary objective. QED.
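The novelty test of Lemma 2 is just a projection and a subtraction. A minimal sketch, with an illustrative subspace and candidate mutations of my choosing:

```python
# Sketch of the Lemma 2 novelty test: subtract what the current subspace
# already captures and check whether anything orthogonal is left over.
import numpy as np

B = np.eye(3)[:, :2]                  # current M_tau^spec = span(e1, e2)
P = B @ B.T                           # orthogonal projector (B is orthonormal)

def novelty_residual(dv):
    """Delta v_perp = Delta v - P^spec Delta v."""
    return dv - P @ dv

old_axis = np.array([1.0, 2.0, 0.0])  # already inside the subspace
new_axis = np.array([0.0, 0.0, 3.0])  # points along a genuinely new axis

assert np.allclose(novelty_residual(old_axis), 0.0)    # claim 1: no novelty
assert np.linalg.norm(novelty_residual(new_axis)) > 0  # survives the subtraction
```

Only the second candidate passes the subtraction test; whether it is then retained depends on the fitness condition \(\Delta\Phi_\tau > 0\), which is an empirical quantity the code cannot decide.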

## 5.5 Lemma 3: Task-equivalence and composability

This is needed because "composable" means nothing unless I can say exactly when two different things are the same for the task. This is the place where physical difference gets subordinated to observer-relevant difference.

**Lemma 3.** Suppose two propositions \(x_t^{(1)}\) and \(x_t^{(2)}\) have identical task-relevant projected encodings:

\[
\Pi_\tau(E_p^{(\tau)}(x_t^{(1)}))
=
\Pi_\tau(E_p^{(\tau)}(x_t^{(2)})).
\]

Then, for fixed \(\hat T_i\), \(z_{i,t}\), \(c_{i,t}\), and \(w_t\), the next predicted predictive state is identical:

\[
\hat q_{i,t+1}^{(\tau,\Delta)}(x_t^{(1)})
=
\hat q_{i,t+1}^{(\tau,\Delta)}(x_t^{(2)}).
\]

**Proof.** By Axiom 5, the transition map depends on the proposition only through its task-relevant encoding. If two propositions have the same projected encoding, then every argument of \(G_\theta\) is the same in the two evaluations. Therefore the outputs must be equal. That is the precise sense in which realities become composable here: different physical propositions may be equivalent for a task if they project to the same observer-relevant coordinates. QED.
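The equivalence in Lemma 3 is easy to exhibit with toy stand-ins. The encoder, projection, and transition below are illustrative placeholders of mine, not the learned maps of the system:

```python
# Illustration of Lemma 3: two physically different propositions that project
# to the same task-relevant encoding produce identical next-state predictions.

def encode(x):                  # stand-in for E_p^(tau)
    return (x["price"], x["tone"], x["font"])

def project(feats):             # stand-in for Pi_tau: keep task-relevant coords
    return feats[:2]            # this task ignores "font"

def transition(state, proj_x):  # stand-in for G_theta, other arguments fixed
    return tuple(s + p for s, p in zip(state, proj_x))

state = (0.1, 0.2)
x1 = {"price": 9.0, "tone": 1.0, "font": "serif"}
x2 = {"price": 9.0, "tone": 1.0, "font": "mono"}   # physically different

assert project(encode(x1)) == project(encode(x2))                  # same projection
assert transition(state, project(encode(x1))) == \
       transition(state, project(encode(x2)))                      # same next state
```

The two propositions differ physically, but because the difference lives entirely in a coordinate the task projection discards, the world model cannot and need not distinguish them.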

## 5.6 Lemma 4: Recursive rollout

I prove rollout separately because a one-step update is not yet a world model in the strong sense I want. The minute the system can be iterated without ambiguity, it stops being a static fit and starts becoming something that can actually simulate a trajectory.

**Lemma 4.** Fix a task \(\tau\). Let the one-step world model be

\[
\mathcal W_{\tau} : (\hat T_i, z_{i,t}, c_{i,t}, w_t, x_t) \mapsto \hat q_{i,t+1}^{(\tau,\Delta)}.
\]

Then for any finite horizon \(k \geq 1\), the multi-step rollout

\[
\hat q_{i,t+k}^{(\tau,\Delta)}
=
\mathcal W_{\tau}^{(k)}(\hat T_i, z_{i,t}, c_{i,t}, w_t, x_t, x_{t+1}, \ldots, x_{t+k-1})
\]

is uniquely defined.

**Proof.** The case \(k=1\) is given by definition. Assume the rollout exists and is unique for some \(k\). Then the \((k+1)\)-step rollout is obtained by applying the same deterministic map \(\mathcal W_{\tau}\) to the unique state produced at step \(k\) together with the next proposition \(x_{t+k}\). Hence the \((k+1)\)-step rollout also exists and is unique. By induction, the result holds for every finite horizon. QED.
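The induction in Lemma 4 is just iterated function application. A minimal sketch with a toy one-step map standing in for \(\mathcal W_\tau\):

```python
# Sketch of Lemma 4: a deterministic one-step map, iterated over a given
# proposition path, yields a uniquely defined finite-horizon rollout.

def one_step(state, x):
    # Toy W_tau: the real transition G_theta is learned, not hand-written.
    return tuple(0.9 * s + x for s in state)

def rollout(state, propositions):
    for x in propositions:      # apply the same deterministic map each step
        state = one_step(state, x)
    return state

start = (1.0, -1.0)
path = [0.5, 0.0, 0.25]         # x_t, x_{t+1}, x_{t+2}
final = rollout(start, path)

# Determinism of the one-step map is what makes the k-step rollout unique:
assert rollout(start, path) == final
```

Note what the uniqueness claim does and does not say: given the proposition path, the trajectory is pinned down; nothing here chooses the path itself.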

## 5.7 Theorem 1: The value of explicit history

This is the benchmark theorem in plain language: if the past carries usable signal, then models that throw the past away are leaving information on the table by construction. I want that point stated mathematically so "state matters" stops sounding like an intuition and starts sounding like an inequality.

**Theorem 1.** Let the observable task outcome be

\[
Y = Y_{i,t+\Delta}^{(\tau)}.
\]

Define the static feature bundle

\[
X_0 = (u_i, c_{i,t}, w_t, x_t)
\]

and the richer history-aware bundle

\[
X_1 = (u_i, H_{i,\le t}, c_{i,t}, w_t, x_t).
\]

Then:

1. under log loss, the Bayes-optimal history-aware predictor cannot do worse than the Bayes-optimal static predictor;
2. for binary outcomes under Brier score, the same inequality holds;
3. the inequality is strict whenever \(Y\) is not conditionally independent of history given the static bundle.

**Proof.** Under log loss, the Bayes-optimal risk is the conditional entropy:

\[
\mathcal R_{\log}^{\ast}(X) = H(Y \mid X).
\]

Because conditioning on more information cannot increase conditional entropy,

\[
H(Y \mid X_1) \leq H(Y \mid X_0).
\]

So the Bayes-optimal predictor using history cannot do worse than the Bayes-optimal predictor without it.

For binary outcomes under Brier score, the Bayes-optimal prediction is \(\mathbb E[Y \mid X]\), and the corresponding optimal risk is

\[
\mathcal R_{\mathrm{Brier}}^{\ast}(X) = \mathbb E[\operatorname{Var}(Y \mid X)].
\]

Conditioning on a larger sigma-algebra cannot increase conditional variance in expectation, hence

\[
\mathcal R_{\mathrm{Brier}}^{\ast}(X_1) \leq \mathcal R_{\mathrm{Brier}}^{\ast}(X_0).
\]

The inequality is strict whenever history carries signal about \(Y\) not already contained in \(X_0\), equivalently whenever \(Y \not\mathrel{\perp\mspace{-10mu}\perp} H_{i,\le t} \mid X_0\). Therefore explicit history is guaranteed to help in principle whenever it contains conditionally relevant information. QED.
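The log-loss half of Theorem 1 can be verified numerically on a small joint distribution. The distribution below is an arbitrary example of mine in which history carries signal about \(Y\) beyond the static bundle, so the inequality is strict:

```python
# Numeric check of Theorem 1, part 1, on a toy joint distribution p(y, x0, h):
# conditioning on more information cannot raise the conditional entropy of Y.
from math import log2
from collections import defaultdict

joint = {
    (0, 0, 0): 0.30, (1, 0, 0): 0.05,
    (0, 0, 1): 0.05, (1, 0, 1): 0.10,
    (0, 1, 0): 0.10, (1, 1, 0): 0.15,
    (0, 1, 1): 0.05, (1, 1, 1): 0.20,
}   # probabilities sum to 1; h is informative about y given x0

def cond_entropy(cond):
    """H(Y | C) where C = cond(x0, h) selects the conditioning variables."""
    marg, pair = defaultdict(float), defaultdict(float)
    for (y, x0, h), p in joint.items():
        marg[cond(x0, h)] += p
        pair[(y, cond(x0, h))] += p
    return -sum(p * log2(p / marg[c]) for (y, c), p in pair.items() if p > 0)

h_given_x0  = cond_entropy(lambda x0, h: x0)        # static bundle only
h_given_x0h = cond_entropy(lambda x0, h: (x0, h))   # history-aware bundle

assert h_given_x0h <= h_given_x0 + 1e-12            # "conditioning cannot hurt"
assert h_given_x0h < h_given_x0                     # strict: history has signal
```

Under a distribution where \(Y \perp H \mid X_0\), the two entropies would coincide exactly, which is the boundary case of part 3 of the theorem.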

## 5.8 Theorem 2: Minimal-state uniqueness up to reparameterization

I put this here because sufficiency alone is cheap. A gigantic archive is sufficient too. The more interesting claim is that if a minimal sufficient state exists, then two such states are the same object up to relabeling.

**Theorem 2.** Suppose \(q_{i,t}^{(\tau,\Delta)}\) and \(q_{i,t}^{\prime(\tau,\Delta)}\) are both minimal sufficient predictive states for the same task \(\tau\) and horizon \(\Delta\). Then there exist full-measure subsets \(S\) and \(S'\) of their supports and a measurable bijection \(b:S \to S'\) such that

\[
q_{i,t}^{\prime(\tau,\Delta)} = b\!\left(q_{i,t}^{(\tau,\Delta)}\right)
\]

almost surely.

**Proof.** Because \(q_{i,t}^{\prime(\tau,\Delta)}\) is sufficient and \(q_{i,t}^{(\tau,\Delta)}\) is minimal, there exists a measurable map \(h\) such that

\[
q_{i,t}^{(\tau,\Delta)} = h\!\left(q_{i,t}^{\prime(\tau,\Delta)}\right)
\]

almost surely.

Likewise, because \(q_{i,t}^{(\tau,\Delta)}\) is sufficient and \(q_{i,t}^{\prime(\tau,\Delta)}\) is minimal, there exists a measurable map \(h'\) such that

\[
q_{i,t}^{\prime(\tau,\Delta)} = h'\!\left(q_{i,t}^{(\tau,\Delta)}\right)
\]

almost surely.

Composing these gives

\[
h\!\circ h' = \mathrm{id}
\quad\text{on } q_{i,t}^{(\tau,\Delta)} \text{ almost surely},
\]

and

\[
h'\!\circ h = \mathrm{id}
\quad\text{on } q_{i,t}^{\prime(\tau,\Delta)} \text{ almost surely}.
\]

Therefore there are full-measure subsets \(S\) and \(S'\) of the respective supports on which the compositions are literally the identity. On those subsets, \(h'\) and \(h\) are inverse measurable maps. Hence \(b := h'|_S\) is a measurable bijection from \(S\) to \(S'\), and the two minimal sufficient states differ only by reparameterization up to null sets. QED.

## 5.9 Theorem 3: Sufficient latent-state compression

I put this after the minimality theorem because the point is not merely to worship bigger models. It would probably be impossible to compute an individual's state transitions based on a literal one-to-one representation of the current state. The whole game is to find the smaller state that keeps the predictive content of the larger history without dragging the whole archive behind it forever.

**Theorem 3.** Suppose the fast latent state \(z_{i,t}\) is sufficient for task \(\tau\) in the sense that

\[
Y \perp H_{i,\le t}
\mid
(z_{i,t}, \hat T_i, c_{i,t}, w_t, x_t),
\]

where \(Y = Y_{i,t+\Delta}^{(\tau)}\). Then, conditional on \((\hat T_i, c_{i,t}, w_t, x_t)\), the full interaction history can be replaced for predictive purposes by the compressed state \(z_{i,t}\):

\[
P(Y \mid H_{i,\le t}, \hat T_i, c_{i,t}, w_t, x_t)
=
P(Y \mid z_{i,t}, \hat T_i, c_{i,t}, w_t, x_t).
\]

**Proof.** This is immediate from the stated conditional independence. If, conditional on \((z_{i,t}, \hat T_i, c_{i,t}, w_t, x_t)\), the outcome \(Y\) no longer depends on the full history, then the conditional distributions on the two sides are equal. Hence the latent state is a sufficient compression of history for the task once the slow/context bundle is held fixed. QED.
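A toy instance makes the compression claim concrete. Here the outcome law depends on the history only through a running sum, so the sum is a sufficient compressed state; the latent and the logistic readout are illustrative choices of mine:

```python
# Toy instance of Theorem 3: when the outcome law depends on history only
# through a compressed statistic, that statistic can replace the history
# without any predictive loss.
import math

def latent(history):             # stand-in for z_{i,t}: a running sum
    return sum(history)

def outcome_prob(history):       # P(Y = 1 | history) factors through the sum
    return 1.0 / (1.0 + math.exp(-sum(history)))

h1 = [1.0, 2.0, -0.5]
h2 = [2.5, 0.0, 0.0]             # a different history with the same statistic

assert latent(h1) == latent(h2)
assert outcome_prob(h1) == outcome_prob(h2)   # prediction is unchanged
```

The two histories are different objects, but once the sufficient statistic is fixed they are predictively indistinguishable, which is exactly the conditional independence the theorem assumes.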

## 5.10 Theorem 4: History mediation by the fast state

This is the theorem that makes the slow/fast decomposition more than architecture taste.

**Theorem 4.** Assume the predictive state factors as

\[
q_{i,t}^{(\tau,\Delta)} = (\sigma_i, \zeta_{i,t}, c_{i,t}, w_t),
\]

where \(\sigma_i\) is the slow profile term corresponding to the durable embedding \(\hat T_i\), and \(\zeta_{i,t}\) is the fast state corresponding to \(z_{i,t}\), updated from within-window event history. If

\[
Y \perp H_{i,\le t}
\mid
(\sigma_i, \zeta_{i,t}, c_{i,t}, w_t, x_t),
\]

then every predictive contribution of within-window history beyond the static profile \(\sigma_i\) is mediated by the fast state \(\zeta_{i,t}\).

**Proof.** Conditioning on \((\sigma_i, c_{i,t}, w_t, x_t)\), the stated conditional independence implies that any residual dependence of \(Y\) on the full within-window history \(H_{i,\le t}\) must pass through \(\zeta_{i,t}\). Therefore, once the static profile and context are fixed, the predictive contribution of recent history is fully mediated by the fast state. QED.

## 5.11 Lemma 5: Probe-readout consistency

The probe heads have to connect to the state mathematically, not just aesthetically.

**Lemma 5.** Suppose an auxiliary probe \(A_{i,t+\Delta}^{(m,\tau)}\) satisfies Axiom 7. If two histories \(H^{(1)}\) and \(H^{(2)}\) yield the same predictive state

\[
q_{i,t}^{(\tau,\Delta)}(H^{(1)}) = q_{i,t}^{(\tau,\Delta)}(H^{(2)}),
\]

then, under the same admissible proposition \(x_t\), they induce the same conditional law for the probe:

\[
P\!\left(
A_{i,t+\Delta}^{(m,\tau)}
\mid
H^{(1)}, T_i, c_{i,t}, w_t, x_t
\right)
=
P\!\left(
A_{i,t+\Delta}^{(m,\tau)}
\mid
H^{(2)}, T_i, c_{i,t}, w_t, x_t
\right).
\]

**Proof.** By Axiom 7, the probe depends on the past only through the predictive state and the proposition. If the two histories map to the same predictive state and the same proposition is applied, then the conditional laws coincide immediately. QED.

## 5.12 Main theorem

Everything above is staged so it can collapse into one statement here without cheating too much. By the time I get to the main theorem, I want the geometry, the evolutionary filter, the transition equivalence, the rollout machinery, the benchmark logic, the minimality claim, the slow/fast mediation claim, and the probe logic all locked into the same object.

**Main Theorem (Task-conditioned predictive observer-state world model).** Assume Axioms 1 through 7. Then for every task \(\tau\) and prediction horizon \(\Delta\), there exists a finite-dimensional task-conditioned representation

\[
(\hat T_i, z_{i,t}, c_{i,t}, w_t, x_t)
\longmapsto
\hat q_{i,t+1}^{(\tau,\Delta)}
\longmapsto
(\hat y_{i,t+\Delta}^{(\tau)}, \hat a_{i,t+\Delta}^{(\tau)})
\]

with the following properties:

1. **Best accessible approximation.** The species-level projection is the unique closest accessible representation of the noumenal state.
2. **Evolutionary coherence.** Newly retained axes are precisely those that increase the evolutionary objective defined above.
3. **Task-equivalence.** Propositions with identical task-relevant projected encodings induce identical next-state predictions.
4. **Recursive simulability.** Finite-horizon rollouts of the world model are uniquely defined.
5. **State advantage.** If history contains signal not reducible to static features, then the Bayes-optimal history-aware predictor strictly improves on the Bayes-optimal static predictor under log loss, and under Brier score for binary outcomes.
6. **Minimal-state uniqueness.** Any two minimal sufficient predictive states for the same task and horizon are equal up to measurable bijection on full-measure subsets of their supports.
7. **Sufficient compression.** If the fast latent state is sufficient for the task, then, conditional on the slow/context bundle already carried by the operational state, the full history can be replaced by that latent state without predictive loss.
8. **Slow/fast mediation.** If the predictive state factors into slow and fast parts and the stated conditional independence holds, then the predictive contribution of within-window history beyond static profile is mediated by the fast state.
9. **Probe consistency.** Auxiliary probes that factor through the predictive state induce equal conditional laws whenever the predictive state is equal.

**Proof.** Property 1 is Lemma 1. Property 2 is Lemma 2. Property 3 is Lemma 3. Property 4 is Lemma 4. Property 5 is Theorem 1. Property 6 is Theorem 2. Property 7 is Theorem 3. Property 8 is Theorem 4. Property 9 is Lemma 5. Together these results establish that the framework admits a mathematically well-defined predictive observer-state world model, that this model can be rolled forward in time, that its explicit state variables matter exactly when the data-generating process contains information that simpler baselines discard, that minimal sufficient state is unique up to reparameterization, and that the slow/fast decomposition has a precise mediation interpretation rather than being mere architecture cosplay. Finally, QED.

## 5.13 Corollary: Observational proposition ranking

I wanted the proposition-optimization point stated plainly in the formal section, but let's keep it bounded.

**Corollary.** Let \(U_\tau\) be any measurable task utility and let \(\mathcal X_{i,t}^{\mathrm{adm}}\) be any admissible candidate proposition set. Then the predictive observer-state world model induces a well-defined observational ranking

\[
x
\mapsto
\operatorname{score}_\theta(x \mid s_{i,t}^{(\tau,\Delta)})
=
\mathbb E_\theta\!\left[
U_\tau(\hat q_{i,t+1}^{(\tau,\Delta)}, \hat y_{i,t+\Delta}^{(\tau)}, \hat a_{i,t+\Delta}^{(\tau)})
\mid
s_{i,t}^{(\tau,\Delta)}, x
\right].
\]

Hence the argmax

\[
x_t^\star
\in
\arg\max_{x \in \mathcal X_{i,t}^{\mathrm{adm}}}
\operatorname{score}_\theta(x \mid s_{i,t}^{(\tau,\Delta)})
\]

is well-defined whenever the candidate set is finite, or compact and the map \(x \mapsto \operatorname{score}_\theta(x \mid s_{i,t}^{(\tau,\Delta)})\) is continuous.

**Proof.** By Axiom 5 and the readout maps, the predicted next state and its observable consequences are measurable functions of the operational state and the proposition. Therefore any measurable utility of those quantities has a well-defined conditional expectation, which induces an ordering over admissible propositions. Existence of an argmax is immediate in the finite case. In the compact case, continuity of the score and the extreme-value theorem give an attained maximum. QED.

This corollary gives me a ranking or search problem. It does **not** prove that following the ranking causally improves the world. That still requires propensities, randomization, or online experimentation.
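The finite-set case of the corollary reduces to scoring and taking a maximum. The scoring function below is a toy stand-in of mine for \(\mathbb E_\theta[U_\tau \mid s, x]\); in the real system it would roll the transition and readout heads forward:

```python
# Sketch of the corollary's observational ranking over a finite admissible
# set: score each candidate proposition and take the argmax. The scoring
# function is a hypothetical stand-in for E_theta[U_tau | s, x].

def score(state, x):
    # Toy expected utility: candidates closer to a preferred value score higher.
    return -(x - state["preference"]) ** 2

state = {"preference": 0.6}                    # stand-in operational state
candidates = [0.0, 0.25, 0.5, 0.75, 1.0]       # finite admissible set

x_star = max(candidates, key=lambda x: score(state, x))
assert x_star == 0.5                           # closest to 0.6 among candidates

# Caveat from the text: this ranking is observational. Acting on it is not
# causal policy improvement without randomization or propensity correction.
```

With a finite candidate set the argmax trivially exists; the compactness-plus-continuity condition in the corollary is what replaces this enumeration when the candidate set is continuous.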

## 5.14 What is proved, and what is not

I end here because this is where the proof should stop, not because this is completely done. The point is to leave you with a result that is strong where it is proved and sharply bounded where it is not, instead of letting the formal section bloat until it starts hallucinating.

This is the strongest result available here.

What is proved:

1. the accessible-space formalism is mathematically fine;
2. the axis-retention rule has a well-defined novelty test;
3. observer-relevant proposition equivalence induces equal next-state prediction;
4. the world model can be recursively simulated for finite horizons once the relevant proposition path and exogenous inputs are specified;
5. explicit history and explicit state have a principled predictive advantage whenever the future is not conditionally independent of them;
6. a minimal sufficient predictive state, if it exists, is unique up to reparameterization on full-measure subsets;
7. a fast latent state can stand in for the full history, conditional on the slow/context bundle, whenever the stated sufficiency condition holds;
8. the slow/fast decomposition has a clean mediation interpretation when the stated independence assumptions hold;
9. auxiliary probe readouts can be tied to the predictive state rather than bolted on as decoration;
10. the model induces a well-defined observational ranking over admissible propositions, with argmax existence under the stated finiteness or compactness-plus-continuity conditions.

What is not proved:

1. that the noumenal arena literally exists;
2. that qualitative experience is exhausted by the chosen coordinates;
3. that the universe is deterministic;
4. that retrospective ranking over candidate propositions is already causal policy improvement;
5. that \(\hat T_i\) or \(z_{i,t}\) recovers the one true hidden coordinates of a person.

Those remain axioms, interpretations, or empirical research claims. They are not derivable from the mathematics alone.

That is the stopping point. Start with the claim that reality, as it appears to an organism, is the output of an evolved interpretive structure. End with a formal result: once that structure is written as an embedding-plus-predictive-state system, the resulting world model can be defined, proved internally coherent, compressed, minimally identified up to reparameterization on full-measure subsets, rolled forward, probed, ranked over propositions, and tested against baselines.

In one line:

\[
\mathcal N
\longrightarrow
\mathcal M^{\mathrm{spec}}
\longrightarrow
G_i
\longrightarrow
T_i
\longrightarrow
\phi_{i,t}
\rightsquigarrow
q_{i,t}^{(\tau,\Delta)}
\approx
s_{i,t}^{(\tau,\Delta)}
\longrightarrow
y_{i,t+\Delta}^{(\tau)}.
\]

That is the full arc from noumenal excess to observer, from observer to state, and from state to prediction.

Done.

---

## In Memory of Einar Kringlen.

It has been an honor tackling this multi-generational problem with you. To you I owe much.

---

## Appendix A: Study Guide / Cheat Sheet

# God’s Infinite Dimensional Space — Step-by-Step Guide

## Cheat sheet: significant terms and operations

| Symbol / term                                                             | Plain meaning                       | What the operation does                                                               | Why it is here                                                           |
| ------------------------------------------------------------------------- | ----------------------------------- | ------------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| \(\mathcal N\)                                                            | Noumenal arena                      | The largest space of possible distinctions                                            | Gives the framework a “reality is larger than experience” starting point |
| \(\mathbf n_t\)                                                           | Noumenal microstate at time \(t\)   | A point in \(\mathcal N\)                                                             | Represents the raw state before organism-specific projection             |
| \(\mathcal M^{\mathrm{spec}}\)                                            | Species-level accessible subspace   | A finite slice of \(\mathcal N\)                                                      | Says a lineage can only access some distinctions                         |
| \(P^{\mathrm{spec}}\)                                                     | Species projection                  | Orthogonally projects raw state into the accessible slice                             | Formalizes “organisms do not get the whole world”                        |
| \(E^{\mathrm{spec}}\)                                                     | Species coordinate encoder          | Converts the projected slice into coordinates                                         | Turns the accessible slice into a workable vector                        |
| \(\Delta \mathbf v_\perp\)                                               | Novelty residual                    | Removes what the current species template already captures                            | Tests whether a mutation adds a new axis                                 |
| \(\Delta \Phi_\tau\)                                                     | Net fitness contribution            | Expected gain minus cost of keeping a candidate axis                                  | Decides whether an axis is retained                                      |
| \(G_i\)                                                                   | Inherited template                  | The repertoire of distinctions person \(i\) could in principle host                   | Separates lineage structure from individual realization                  |
| \(T_i\)                                                                   | Realized individual embedding       | The durable structure of one person after history, language, culture, and experience  | This is the theoretical person-side object                               |
| \(\phi_{i,t}\)                                                           | Full phenomenal state               | Everything live in the person at time \(t\)                                           | Motivating ideal object                                                  |
| \(q_{i,t}^{(\tau,\Delta)}\)                                              | Predictive observer-state           | Smallest state that preserves the future law for task \(\tau\) and horizon \(\Delta\) | The formal target                                                        |
| \(s_{i,t}^{(\tau,\Delta)}\)                                              | Operational state                   | \((\hat T_i, z_{i,t}, c_{i,t}, w_t)\)                                                | What the benchmark can actually train on                                 |
| \(\chi_{i,t}\)                                                           | Chimera                             | Person-in-role object                                                                 | Makes context explicit instead of hiding it inside “the person”          |
| \(c_{i,t}\)                                                              | Role and institution context        | Role, regime, local demands                                                           | Explains why the same person acts differently in different settings      |
| \(w_t\)                                                                   | World state                         | Market, account, pressure, or other external state                                    | Keeps the environment separate from the person                           |
| \(x_t\)                                                                   | Proposition                         | The thing hitting the observer now                                                    | The object being scored, simulated, or chosen                            |
| \(Y_{i,t+\Delta}^{(\tau)}\)                                              | Main future outcome                 | The task target                                                                       | What the model tries to predict                                          |
| \(A_{i,t+\Delta}^{(m,\tau)}\)                                            | Auxiliary probe                     | Side target such as objection class or delay bucket                                   | Forces the latent state to carry reusable structure                      |
| \(\Pi_\tau(T_i,c_{i,t},x_t)\)                                             | Task projection                     | Selects task-relevant coordinates                                                     | Says not all dimensions matter for every task                            |
| \(a_{i,t}\)                                                              | Salience weights                    | Reweights coordinates elementwise                                                     | Captures what is active now                                              |
| \(z_{i,t}=a_{i,t}\odot \Pi_\tau(\cdot)\)                                 | Active fast slice                   | Uses elementwise multiplication to gate the task coordinates                          | Produces the current live state for the transition                       |
| \(m_{i,t}=\sum_j \omega_{ij,t}\mu_{ij}\)                                 | Memory field                        | Weighted sum of traces                                                                | Makes memory computable                                                  |
| \(R(\mu_{ij},x_t,c_{i,t},\phi_{i,t})\)                                   | Retrieval rule                      | Updates relevance of a trace                                                          | Explains why the same prompt can work differently later                  |
| \(\Xi(C,c)\)                                                              | Contextual lifting                  | Retypes categories using context                                                      | Avoids false contradictions                                              |
| \(E_{f,s}(c)\)                                                           | Category embedding                  | Maps a discrete token to a dense vector                                               | Standard ML move for categorical data                                    |
| \(u_{i,t}^{(f,s)}\)                                                      | Within-event pooled category vector | Averages embedded tokens in one bag                                                   | Converts sparse categorical events into fixed-width vectors              |
| \(\nu\)                                                                   | Null vector                         | Learned stand-in for an empty bag                                                     | Keeps empty slots explicit instead of pretending they are zero           |
| \(m_{i,t}^{(f,s)}\)                                                      | Mask bit                            | Indicates whether a slot is populated                                                 | Separates absence from value                                             |
| \(\|\)                                                                    | Concatenation                       | Joins vectors end-to-end                                                              | Preserves slot identity                                                  |
| \(g_i^{\mathrm{slow}}\)                                                   | Slow categorical bank               | Regime-aware durable pooled categorical memory                                        | Captures what the person is generally like now                           |
| \(g_{i,t}^{\mathrm{fast},\tau}\)                                         | Fast categorical pool               | Task-conditioned recent categorical summary                                           | Captures what is currently active                                        |
| \(E_T(\cdot)\)                                                            | Slow encoder                        | Builds \(\hat T_i\) from durable information                                          | Produces the slow person embedding                                       |
| \(U_\theta(\cdot)\)                                                      | Fast update rule                    | Updates fast state from new events                                                    | Makes the model sequential                                               |
| \(E_o^{(\tau)}(\cdot)\)                                                   | Observer encoder                    | Packs slow state, fast state, context, and world into one task representation         | Creates the observer-side object for interaction                         |
| \(E_p^{(\tau)}(x_t)\)                                                     | Proposition encoder                 | Encodes the proposition into the same task space                                      | Makes propositions comparable with the observer-side state               |
| \(\Psi_\tau(o,p)\)                                                       | Interaction operator                | Combines observer and proposition encodings                                           | Represents “what happens when this proposition hits this observer”       |
| \(G_\theta(\cdot)\)                                                       | Transition map                      | Predicts the next latent predictive state                                             | Core world-model step                                                    |
| \(R_0,R_m\)                                                               | Readout heads                       | Decode outcomes and probes from the latent state                                      | Converts hidden state into measurable outputs                            |
| \(\Delta_\tau(f)\)                                                       | Feature contribution                | Performance with a feature family minus performance without it                        | Decides whether a feature family stays                                   |
| \(\mathcal L_\tau\)                                                      | Training objective                  | Adds main loss, probe losses, and regularization                                      | Defines what gradient descent is minimizing                              |
| \(\Omega(\theta)\)                                                        | Regularizer                         | Penalizes overly flexible parameter settings                                          | Keeps the fit from becoming brittle                                      |
| \(\hat T_i \leftarrow (1-\alpha)\hat T_i+\alpha \hat T_i^{\mathrm{new}}\) | Slow EMA refresh                    | Mixes old durable state with a refreshed durable estimate                             | Keeps slow state stable                                                  |
| \(\operatorname{score}_\theta(x\mid s)\)                                 | Proposition score                   | Expected task utility if proposition \(x\) is used in state \(s\)                     | Turns forecasting into ranking                                           |
| \(\arg\max\)                                                              | Best choice                         | Picks the highest scoring candidate                                                   | Formal proposition search                                                |
| \(P(\cdot\mid\cdot)\)                                                     | Conditional probability             | “Probability of this given that”                                                      | Language of sufficiency and prediction                                   |
| \(\mathbb E[\cdot]\)                                                      | Expectation                         | Average predicted value under uncertainty                                             | Needed when scoring uncertain futures                                    |
| \(\perp\)                                                                 | Conditional independence            | Says extra information stops helping once a state is known                            | Lets the framework define sufficiency and mediation                      |
| \(\hat V_{\mathrm{IPS}}(\pi)\)                                           | IPS estimate                        | Reweights logged rewards to estimate a target policy’s value                          | Separates ranking from policy evaluation                                 |
| LogLoss / Brier / PR-AUC / ECE                                            | Evaluation metrics                  | Fit, probability error, rare-event ranking, and calibration                           | Measures whether the model is useful                                     |

## The full arc in one line

\[
\mathcal N
\longrightarrow
\mathcal M^{\mathrm{spec}}
\longrightarrow
G_i
\longrightarrow
T_i
\longrightarrow
\phi_{i,t}
\rightsquigarrow
q_{i,t}^{(\tau,\Delta)}
\approx
s_{i,t}^{(\tau,\Delta)}
\longrightarrow
y_{i,t+\Delta}^{(\tau)}.
\]

Read it left to right:

1. Start with a world that contains more distinctions than any organism can use.
2. Restrict that world to the distinctions a lineage can access.
3. Turn the lineage template into an individual template.
4. Realize that template in one person.
5. Let that person occupy a full momentary state.
6. Compress the full state into a task-specific predictive state.
7. Approximate that predictive state with something measurable.
8. Decode future observable outcomes from it.

# Part 0 — Background

## Step 0.1: Replace fixed categories with evolved structure

The opening move is simple: organisms do not passively mirror the world; they inherit a structured way of carving it up.

Why this matters: it turns the appearance of reality into something that can be modeled as a built structure rather than a raw copy of an external world.

## Step 0.2: Treat interpretation and response as one process

The framework treats perception, interpretation, understanding, and action as one continuous state-transition process.

Why this matters: once they are written in one format, the same algebra can describe seeing, feeling, remembering, and acting.

# Part 1 — Specifying the Area of Interest

## Step 1.1: Represent change as vectors

The framework starts by treating observable changes as vectors.

Plain meaning: instead of handling vision, memory, and action as unrelated substances, it writes them as coordinates in a shared space.

## Step 1.2: Sample a continuous stream into modelable states

Reality is continuous, but the model uses snapshots.

Plain meaning: the flow is continuous in life, but discrete time makes the math and the benchmark possible.

## Step 1.3: Distinguish noumenal vectors from phenomenal vectors

Noumenal vectors are raw distinctions available before the organism has organized them; phenomenal vectors are reality as it appears after the organism’s inherited structure processes them.

Why this matters: the framework keeps raw physical distinctions separate from lived experience.

## Step 1.4: Introduce the inherited seed

The inherited seed is the lineage-fixed structure that determines what kinds of distinctions the organism can even register.

Why this matters: the seed explains why an organism gets one kind of world rather than all possible worlds.

## Step 1.5: Use a toy sequence to show state transition

The early toy example turns one vector sequence into another by inserting intermediate internal states.

What is happening mathematically: one structured vector recruits others in the same larger state space, and later states inherit or modify earlier coordinates.

Why it is here: it gives an intuitive picture of how a current input can activate affect, action, and an updated object state without changing notation.

## Step 1.6: Define the universal arena

\[
\mathbf n_t = \sum_{k=1}^{\infty} n_{t,k}\mathbf e_k \in \mathcal N.
\]

Plain meaning: \(\mathcal N\) is the largest coordinate system the framework will allow, and \(\mathbf n_t\) is a raw state inside it.

What the operation does: it writes a raw state as a sum of basis directions with weights.

Why it is here: without a large ambient space, there is nowhere to place distinctions that a lineage does not access.

## Step 1.7: Define the species-level accessible slice

\[
\mathcal M^{\mathrm{spec}}=\operatorname{span}\{\mathbf v_1,\ldots,\mathbf v_d\}\subset \mathcal N.
\]

Plain meaning: evolution preserves a finite set of useful axes.

What the operation does: \(\operatorname{span}\) says every accessible state is a linear combination of those selected axes.

Why it is here: it formalizes the idea that a lineage experiences only a finite, useful slice of a much larger arena.

## Step 1.8: Project raw state into that slice

\[
P^{\mathrm{spec}}\mathbf n
=
\sum_{i=1}^{d}\langle \mathbf v_i,\mathbf n\rangle \mathbf v_i.
\]

Plain meaning: keep the part of the raw world that aligns with the lineage’s accessible axes.

What the operation does: the inner products \(\langle \mathbf v_i,\mathbf n\rangle\) measure how much of \(\mathbf n\) lies along each accessible direction, then rebuild the accessible component from those amounts.

Why it is here: it makes “accessible world” a real projection rather than a metaphor.

## Step 1.9: Encode the projected slice as coordinates

\[
E^{\mathrm{spec}}(\mathbf n_t)
=
\begin{bmatrix}
\langle \mathbf v_1,\mathbf n_t\rangle \\
\vdots \\
\langle \mathbf v_d,\mathbf n_t\rangle
\end{bmatrix}.
\]

Plain meaning: convert the accessible slice into a finite vector.

What the operation does: the encoder takes the coefficients of the projected state along the accessible axes.

Why it is here: later models need coordinates, not just abstract subspaces.
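The projection and encoding steps can be sketched in a few lines of plain Python. This is a minimal illustration assuming an orthonormal accessible basis \(\{\mathbf v_i\}\); the function names (`encode_spec`, `project_spec`) are illustrative, not from the framework itself.

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def encode_spec(basis, n):
    """E^spec: the coefficients of n along each accessible axis v_i."""
    return [dot(v, n) for v in basis]

def project_spec(basis, n):
    """P^spec: rebuild the accessible component from those coefficients.
    Assumes the basis vectors are orthonormal."""
    coeffs = encode_spec(basis, n)
    return [sum(c * v[k] for c, v in zip(coeffs, basis)) for k in range(len(n))]

# Two orthonormal axes inside a 3-dimensional ambient space.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
n = [2.0, -1.0, 5.0]

print(encode_spec(basis, n))   # → [2.0, -1.0]
print(project_spec(basis, n))  # → [2.0, -1.0, 0.0]
```

The third coordinate of \(\mathbf n\) simply vanishes under the projection: that is the formal sense in which a distinction the lineage does not access is invisible.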

## Step 1.10: Test whether evolution added a genuinely new axis

\[
\Delta \mathbf v_{\perp}
=
\Delta \mathbf v - P^{\mathrm{spec}}_{\tau}\Delta \mathbf v.
\]

Plain meaning: subtract what the current species template already explains.

What the operation does: it removes the old component of a candidate mutation, leaving only the genuinely new part.

Why it is here: this is the novelty test.

If the residual is zero, the candidate distinction is redundant.
If the residual is nonzero, the candidate adds something new.
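The novelty test admits an equally small numeric check, under the same orthonormal-basis assumption as above; `residual` is an illustrative name.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(basis, dv):
    """Δv_perp: subtract the component the current template already spans.
    Assumes the basis vectors are orthonormal."""
    proj = [sum(dot(v, dv) * v[k] for v in basis) for k in range(len(dv))]
    return [d - p for d, p in zip(dv, proj)]

basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(residual(basis, [3.0, 2.0, 0.0]))  # → [0.0, 0.0, 0.0]: redundant
print(residual(basis, [0.0, 0.0, 4.0]))  # → [0.0, 0.0, 4.0]: genuinely new
```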

## Step 1.11: Keep the axis only if gain beats cost

\[
\Delta \Phi_{\tau}(\widehat{\Delta \mathbf v}_{\perp})
=
\mathbb E_{e\sim \mathcal E_{\tau}}
\Big[
W\!\big(e,\mathcal M_{\tau}^{\mathrm{spec}} \oplus \operatorname{span}\{\widehat{\Delta \mathbf v}_{\perp}\}\big)
-
W\!\big(e,\mathcal M_{\tau}^{\mathrm{spec}}\big)
\Big]
- C(\widehat{\Delta \mathbf v}_{\perp}).
\]

Plain meaning: an axis stays only if it helps more than it costs.

Harder terms:

- \(\mathbb E\) means average over environments the lineage encounters.
- \(W(e,\mathcal M)\) is expected reproductive value in environment \(e\) if the lineage has accessible subspace \(\mathcal M\).
- \(C(\cdot)\) is maintenance cost: energy, wiring, false positives, and related burdens.
- \(\oplus\) means “add a new independent direction to the current space.”

Why it is here: it gives the species template a retention rule instead of treating it as a mysterious gift.

# Part 2 — Deriving the Transcendental Embedding

## Step 2.1: Separate five different objects

Part 2 splits one overloaded phrase into five levels:

1. inherited template,
2. realized individual embedding,
3. full phenomenal state,
4. task-conditioned predictive state,
5. measurable estimate.

Why it matters: these are not the same object, and the math stays cleaner once they are separated.

## Step 2.2: Define the person-in-role object

\[
\chi_{i,t} = (T_i, c_{i,t}).
\]

Plain meaning: model a person as a person plus their active role-context.

What the operation does: it forms a tuple.

Why it is here: the same person can produce different outputs under different roles without becoming a different person.

## Step 2.3: Use psychometrics as a prior, not a full ontology

\[
p_i \in \mathbb R^k.
\]

Plain meaning: factor scores are useful summaries, but they are only one input channel.

Why it is here: standardized summaries help with approximation, but they do not replace state, memory, proposition, or transition.

## Step 2.4: Project the person down to what matters for this task

\[
\Pi_\tau(T_i, c_{i,t}, x_t) \in \mathbb R^d.
\]

Plain meaning: for a given task, only some coordinates matter.

What the operation does: it chooses the task-relevant slice of the person given the current context and proposition.

Why it is here: the framework is task-specific, not universally maximal at every step.

## Step 2.5: Weight that slice by salience

\[
z_{i,t} = a_{i,t} \odot \Pi_\tau(T_i, c_{i,t}, x_t).
\]

Plain meaning: even within the task-relevant slice, some coordinates are live and some are quiet.

Harder term:

- \(\odot\) is elementwise multiplication. If one coordinate has salience \(0\), it is suppressed. If it has salience \(1\), it passes through unchanged. Values in between partially gate it.

Why it is here: it turns a broad person representation into a current active state.
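The gating operation itself is one line; the vectors below are illustrative values, not outputs of any real \(\Pi_\tau\).

```python
def gate(salience, slice_):
    """z = a ⊙ Π: elementwise gate of the task-relevant slice."""
    return [a * s for a, s in zip(salience, slice_)]

task_slice = [0.8, -1.2, 0.5]   # Π_τ(T_i, c_{i,t}, x_t), illustrative
salience   = [1.0, 0.0, 0.5]    # live, quiet, partially gated

print(gate(salience, task_slice))  # → [0.8, -0.0, 0.25]
```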

## Step 2.6: Distinguish full state, predictive state, and measurable state

\[
P\!\left(
Y_{i,t+\Delta}^{(\tau)}
\mid
H_{i,\le t}, T_i, c_{i,t}, w_t, x_t
\right)
=
P\!\left(
Y_{i,t+\Delta}^{(\tau)}
\mid
q_{i,t}^{(\tau,\Delta)}, x_t
\right).
\]

Plain meaning: once \(q_{i,t}^{(\tau,\Delta)}\) is known, the rest of the past adds nothing for predicting the task outcome under the same proposition.

Harder term:

- A conditional probability \(P(A\mid B)\) means “probability of \(A\) once \(B\) is known.”
- Sufficiency here means the state keeps all the information the future still needs.

Why it is here: this is the central definition of the predictive observer-state.

Then the measurable approximation is

\[
s_{i,t}^{(\tau,\Delta)}=(\hat T_i,z_{i,t},c_{i,t},w_t).
\]

Plain meaning: this is the version the benchmark can actually construct from data.

## Step 2.7: Model memory as weighted traces

\[
m_{i,t}=\sum_{j=1}^{N_i}\omega_{ij,t}\mu_{ij}.
\]

Plain meaning: memory is treated as stored traces with time-varying weights.

Harder terms:

- \(\mu_{ij}\) is trace \(j\).
- \(\omega_{ij,t}\) is how relevant or active that trace is at time \(t\).

Why it is here: it gives the framework a concrete way to represent persistence and reactivation.
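As a concrete sketch, the memory field is just a weighted vector sum; a retrieval rule would then rewrite the weights in place after each event. Trace values and weights here are illustrative.

```python
def memory_field(weights, traces):
    """m_{i,t} = Σ_j ω_{ij,t} μ_{ij}: weighted sum of stored traces."""
    dim = len(traces[0])
    return [sum(w * mu[k] for w, mu in zip(weights, traces)) for k in range(dim)]

traces  = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # μ_{ij}
weights = [0.5, 0.2, 0.0]                        # ω_{ij,t}: third trace dormant

print(memory_field(weights, traces))  # → [0.5, 0.2]
```

Raising the third weight from \(0\) would reactivate a dormant trace without adding any new storage, which is exactly the behavior Step 2.8 asks retrieval to produce.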

## Step 2.8: Let the current event change retrieval

\[
\omega_{ij,t+1}=R(\mu_{ij},x_t,c_{i,t},\phi_{i,t}).
\]

Plain meaning: new input changes which old traces matter.

Why it is here: the same message can land differently after prior experiences.

## Step 2.9: Let learning update the memory field

\[
m_{i,t+1}=U(m_{i,t},x_t,\phi_{i,t}).
\]

Plain meaning: memory is not static; each event changes the future state space.

Why it is here: it explains why sequence models should matter.

## Step 2.10: Contextually lift categories before pooling them

\[
\widetilde{C}_{i,t}^{(f,s)}
=
\Xi\!\big(C_{i,t}^{(f,s)}, c_{i,t}\big).
\]

Plain meaning: raw categories are retyped with context before they are compared.

Example: “aggressive” in self-interest and “aggressive” in out-group treatment may not be the same fact.

Why it is here: it prevents the model from averaging away real structure or inventing contradictions too early.

## Step 2.11: Pool categories inside one event

\[
u_{i,t}^{(f,s)}
=
\begin{cases}
\frac{1}{|\widetilde{C}_{i,t}^{(f,s)}|}
\sum_{c \in \widetilde{C}_{i,t}^{(f,s)}} E_{f,s}(c),
& |\widetilde{C}_{i,t}^{(f,s)}| > 0, \\[6pt]
\nu_{f,s},
& |\widetilde{C}_{i,t}^{(f,s)}| = 0.
\end{cases}
\qquad
m_{i,t}^{(f,s)}=\mathbf 1\{|\widetilde{C}_{i,t}^{(f,s)}|>0\}.
\]

Plain meaning: embed the event’s categorical tokens, average them, and use a learned null vector plus a mask bit when the bag is empty.

Harder terms:

- \(E_{f,s}(c)\) is an embedding lookup: it turns a token into a trainable dense vector.
- \(\mathbf 1\{\cdot\}\) is an indicator: it equals 1 when the condition is true and 0 otherwise.

Why it is here: categorical traces are abundant in real logs, and this makes them usable without flattening them into brittle one-hot tables.
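The pooling rule, including the empty-bag branch, can be sketched as follows. The embedding table and null vector are illustrative stand-ins for learned parameters.

```python
def pool_bag(bag, embed, null_vec):
    """Average the embedded tokens in one bag; fall back to the learned
    null vector plus a zero mask bit when the bag is empty."""
    if not bag:
        return null_vec, 0                      # ν and mask m = 0
    vecs = [embed[c] for c in bag]
    dim = len(null_vec)
    mean = [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
    return mean, 1                              # pooled vector and mask m = 1

# Illustrative 2-d embedding table and learned null vector.
embed = {"aggressive": [1.0, 0.0], "patient": [0.0, 1.0]}
null_vec = [-1.0, -1.0]

print(pool_bag(["aggressive", "patient"], embed, null_vec))  # → ([0.5, 0.5], 1)
print(pool_bag([], embed, null_vec))                          # → ([-1.0, -1.0], 0)
```

Note that the empty bag returns a distinct learned vector, not zeros, so downstream layers can tell "no evidence" apart from "evidence that averages to zero."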

## Step 2.12: Preserve slot identity

\[
e_{i,t}^{\mathrm{cat}}
=
\big\|_{(f,s)} [P_{f,s}u_{i,t}^{(f,s)},\,m_{i,t}^{(f,s)}].
\]

Plain meaning: keep family and source channels separate when you build the event-level categorical representation.

Why it is here: “biography said X” and “behavior showed X” are not the same kind of evidence.

## Step 2.13: Build slow categorical memory

\[
g_{i,\rho}^{\mathrm{slow}}
=
\frac{
\sum_{r \le t}\mathbf 1\{\rho_r=\rho\}\,\beta_{i,r}^{\mathrm{slow}}\,e_{i,r}^{\mathrm{cat}}
}{
\sum_{r \le t}\mathbf 1\{\rho_r=\rho\}\,\beta_{i,r}^{\mathrm{slow}}
}
\quad\text{when usable evidence exists.}
\]

Plain meaning: average old categorical events inside the same regime, but weight them by importance.

Harder terms:

- \(\rho\) is the regime or role bucket.
- \(\beta^{\mathrm{slow}}\) controls how much each past event contributes.

Why it is here: the slow bank is meant to capture durable person structure, not just the last touch.

## Step 2.14: Build fast categorical memory

\[
g_{i,t}^{\mathrm{fast},\tau}
=
\sum_{r \le t}\alpha_{i,r,t}^{(\tau)} e_{i,r}^{\mathrm{cat}},
\qquad
\sum_{r \le t}\alpha_{i,r,t}^{(\tau)}=1.
\]

Plain meaning: create a task-conditioned summary of recent categorical history.

Why it is here: short-horizon prediction usually depends on what is currently active, not just what is durable.
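One simple way to realize the normalized weights \(\alpha_{i,r,t}^{(\tau)}\) is exponential recency decay; this particular weighting scheme is an illustrative assumption, since the framework leaves the attention rule open.

```python
import math

def fast_pool(events, decay=0.5):
    """g^fast: convex combination of past event vectors with normalized
    recency weights (most recent event weighted highest)."""
    n = len(events)
    raw = [math.exp(-decay * (n - 1 - r)) for r in range(n)]
    total = sum(raw)
    alphas = [w / total for w in raw]          # weights sum to 1, as required
    dim = len(events[0])
    return [sum(a * e[k] for a, e in zip(alphas, events)) for k in range(dim)]

events = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # e^cat vectors, oldest first
g = fast_pool(events)
print(g)  # tilted toward the two recent events
```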

## Step 2.15: Define minimality

\[
q_{i,t}^{(\tau,\Delta)} = h\!\big(r_{i,t}^{(\tau,\Delta)}\big)
\quad\text{for some map } h,
\]

for every other sufficient state \(r_{i,t}^{(\tau,\Delta)}\).

Plain meaning: a minimal sufficient state is one that every other sufficient state can be reduced to.

Why it is here: “sufficient” alone could mean dragging the full archive forever; minimality asks for the smallest useful state.

## Step 2.16: Split the measurable state into slow and fast pieces

\[
s_{i,t}^{(\tau,\Delta)}
=
(\hat T_i,z_{i,t},c_{i,t},w_t)
\approx
q_{i,t}^{(\tau,\Delta)}.
\]

Plain meaning: the benchmark approximation has a durable person part and a rapidly updating local part.

Why it is here: durable traits and recent events change on different timescales.

## Step 2.17: Define the realized embedding and its first estimate

\[
T_i=\Psi(G_i,\ell_i,h_i),
\qquad
h_i=\sum_{k=1}^{n_i}\beta_{ik}e_{ik}.
\]

Plain meaning: the realized person is the inherited template filtered through language, culture, and weighted life events.

Then the first operational estimate is

\[
\hat T_i^{(0)}
=
W_p p_i + W_b b_i + W_\ell \ell_i + W_r r_i + W_h h_i + W_g g_i^{\mathrm{slow}}.
\]

Plain meaning: start with a weighted combination of person summaries, life history, and slow categorical memory.

Why it is here: this is the bridge from theory to something that can be computed.

## Step 2.18: Use the local state to predict the next predictive state

\[
\hat q_{i,t+1}^{(\tau,\Delta)}
=
G_\theta(\hat T_i,z_{i,t},c_{i,t},w_t,x_t).
\]

Plain meaning: once the slow person state, fast local state, context, world, and proposition are known, predict the next task-relevant state.

Why it is here: this is the handoff into the world model.

# Part 3 — Application: Predicting How People Behave

## Step 3.1: Keep the ideal transition law, but do not train against it directly

\[
\phi_{i,t+1}=F(T_i,\phi_{i,t},x_t).
\]

Plain meaning: the motivating ideal is still the next phenomenal state.

Why it is here: it says what the framework is aiming at, even though the benchmark cannot observe \(\phi_{i,t}\) directly.

## Step 3.2: Introduce a task projection from a large ambient space

\[
\Pi_\tau:\mathcal U\to \mathbb R^{d_\tau}.
\]

Plain meaning: each task only needs a finite slice of the larger representational arena.

Why it is here: it keeps the framework open-ended without requiring infinite computation.

## Step 3.3: Define the operational transition law

\[
\hat q_{i,t+1}^{(\tau,\Delta)}
=
G_\theta(\hat T_i,z_{i,t},c_{i,t},w_t,x_t).
\]

Plain meaning: the trainable model predicts the next predictive state from the operational state and proposition.

Why it is here: this is the model actually learned from data.

## Step 3.4: Decompose the world model into encoder, interaction, and decoder

\[
o_{i,t}^{(\tau)}=E_o^{(\tau)}(\hat T_i,z_{i,t},c_{i,t},w_t),
\]

\[
p_t^{(\tau)}=E_p^{(\tau)}(x_t),
\]

\[
h_{i,t}^{(\tau)}=\Psi_\tau(o_{i,t}^{(\tau)},p_t^{(\tau)}),
\]

\[
\hat q_{i,t+1}^{(\tau,\Delta)}=G_\theta(h_{i,t}^{(\tau)}).
\]

Plain meaning:

- encode the observer-side state,
- encode the proposition,
- let them interact,
- decode the next predictive state.

Why it is here: it separates representation from interaction and makes the architecture modular.
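A toy sketch of the four-stage decomposition, with concatenation standing in for the encoders and the interaction operator, and a fixed halving map standing in for \(G_\theta\). Every function here is an illustrative stand-in, not the trained architecture.

```python
def encode_observer(T_hat, z, c, w):
    """E_o: pack slow state, fast state, context, and world into one vector."""
    return T_hat + z + c + w            # list concatenation as a toy encoder

def encode_proposition(x):
    """E_p: map the proposition into the same task space."""
    return x

def interact(o, p):
    """Ψ_τ: toy interaction operator; here, concatenation again."""
    return o + p

def transition(h):
    """G_θ: toy linear transition producing the next latent state."""
    return [0.5 * v for v in h]

o = encode_observer([1.0], [2.0], [0.0], [3.0])
p = encode_proposition([4.0])
q_next = transition(interact(o, p))
print(q_next)  # → [0.5, 1.0, 0.0, 1.5, 2.0]
```

The point of the modularity is that any one stage can be swapped (a richer interaction operator, a recurrent transition) without touching the other three.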

## Step 3.5: Decode visible consequences from the latent state

\[
\hat y_{i,t+\Delta}^{(\tau)}=R_0(\hat q_{i,t+1}^{(\tau,\Delta)}),
\qquad
\hat a_{i,t+\Delta}^{(m,\tau)}=R_m(\hat q_{i,t+1}^{(\tau,\Delta)}).
\]

Plain meaning: outcomes and probe labels are readouts from the predicted next state, not the state itself.

Why it is here: a reply, a meeting, or an objection class is a measurable residue of an underlying transition.

## Step 3.6: Define task-equivalence of propositions

\[
\Pi_{\tau}(E_p^{(\tau)}(x_t^{(1)})) = \Pi_{\tau}(E_p^{(\tau)}(x_t^{(2)}))
\;\Rightarrow\;
G_\theta(\cdot,x_t^{(1)})\approx G_\theta(\cdot,x_t^{(2)}).
\]

Plain meaning: if two propositions look the same in the task-relevant projection, they should produce the same next-state prediction.

Why it is here: this is the paper’s formal notion of composability.

## Step 3.7: Keep only useful feature families

\[
\Delta_\tau(f)=\mathrm{Perf}_\tau(M\cup f)-\mathrm{Perf}_\tau(M).
\]

Plain meaning: a feature family stays only if it improves the task.

Why it is here: the framework is meant to be discoverable and revisable, not fixed in advance.

## Step 3.8: Define the world model and its rollout

\[
\mathcal W_\tau:(\hat T_i,z_{i,t},c_{i,t},w_t,x_t)\mapsto \hat q_{i,t+1}^{(\tau,\Delta)},
\]

\[
\hat q_{i,t+k}^{(\tau,\Delta)}
=
\mathcal W_\tau^{(k)}(\hat T_i,z_{i,t},c_{i,t},w_t,x_t,\ldots,x_{t+k-1}).
\]

Plain meaning: a one-step predictor becomes a simulator once you apply it repeatedly.

Why it is here: proposition choice is not only about one immediate readout; it can change future trajectories.

## Step 3.9: Train with main loss, probe losses, and regularization

\[
\mathcal L_\tau
=
\mathcal L_{\mathrm{main}}
+ \sum_{m=1}^{M}\lambda_m \mathcal L_{\mathrm{probe},m}
+ \lambda_{\mathrm{reg}}\Omega(\theta).
\]

Plain meaning: optimize the main task, auxiliary probes, and weight control together.

Harder terms:

- A loss is a number that gets smaller when the model improves.
- \(\lambda_m\) and \(\lambda_{\mathrm{reg}}\) control how much the probe terms and regularizer matter relative to the main task.
- Regularization keeps a model from fitting noise too aggressively.

Why it is here: it encourages the latent state to carry more than one narrow signal.

Then update by gradient descent:

\[
\theta_{t+1}=\theta_t-\eta\nabla_\theta \mathcal L_\tau.
\]

Plain meaning: compute how the loss changes with respect to the parameters and step in the direction that lowers it.
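The update rule is the standard descent step. A minimal sketch on a one-parameter toy loss \(L(\theta)=\theta^2\), whose gradient \(2\theta\) is written by hand:

```python
def grad_step(theta, grad, eta=0.1):
    """θ_{t+1} = θ_t − η ∇L: one gradient-descent step, elementwise."""
    return [t - eta * g for t, g in zip(theta, grad)]

# Minimize L(θ) = θ² with gradient 2θ, starting from θ = 1.
theta = [1.0]
for _ in range(50):
    theta = grad_step(theta, [2.0 * theta[0]])
print(theta)  # close to the minimizer at 0
```

In practice the gradient of \(\mathcal L_\tau\) comes from automatic differentiation, but the step itself is exactly this.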

## Step 3.10: Rank propositions by expected task utility

\[
\operatorname{score}_\theta(x\mid s_{i,t}^{(\tau,\Delta)})
=
\mathbb E_\theta\!\left[
U_\tau\!\left(
\hat q_{i,t+1}^{(\tau,\Delta)},
\hat y_{i,t+\Delta}^{(\tau)},
\hat a_{i,t+\Delta}^{(\tau)}
\right)
\mid
s_{i,t}^{(\tau,\Delta)},x
\right].
\]

Plain meaning: score each admissible proposition by its expected downstream value.

Harder term:

- The expectation \(\mathbb E_\theta[\cdot]\) is an average over what the model predicts could happen.

Why it is here: prediction becomes decision support once propositions are scored.

Then search for the best candidate:

\[
x_t^\star
\in
\arg\max_{x\in \mathcal X_{i,t}^{\mathrm{adm}}}
\operatorname{score}_\theta(x\mid s_{i,t}^{(\tau,\Delta)}).
\]

Plain meaning: pick the candidate with highest score among the allowed options.
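Scoring plus \(\arg\max\) is a plain search over the admissible set. The scores below are invented numbers standing in for the model's expected utilities.

```python
def best_proposition(candidates, score):
    """x* ∈ argmax over the admissible set of the proposition score."""
    return max(candidates, key=score)

# Toy expected task utilities for three admissible propositions.
scores = {"follow_up": 0.42, "discount_offer": 0.61, "case_study": 0.55}
x_star = best_proposition(list(scores), scores.get)
print(x_star)  # → discount_offer
```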

# Part 4 — Benchmarking the World Model

## Step 4.1: Make the state hierarchy explicit

\[
\phi_{i,t}
\quad\text{full motivating state}
\]

\[
q_{i,t}^{(\tau,\Delta)}
\quad\text{formal predictive state}
\]

\[
s_{i,t}^{(\tau,\Delta)}
=
(\hat T_i,z_{i,t},c_{i,t},w_t)
\quad\text{measurable approximation}.
\]

Plain meaning: the benchmark only has access to the third object.

Why it is here: it keeps the benchmark honest.

## Step 4.2: Build the dataset around event-time prediction

\[
\mathcal D_\tau
=
\left\{
(u_i,H_{i,\le t},c_{i,t},w_t,x_t,y_{i,t+\Delta}^{(\tau)},a_{i,t+\Delta}^{(\tau)})
\right\}_{(i,t)}.
\]

Plain meaning: every row is a person snapshot with history, context, world, proposition, and future labels.

Why it is here: the task is to predict future transition from current state plus proposition.

## Step 4.3: Encode the event stream explicitly

\[
e_{i,t}=[x_t,\delta_t,r_t,a_t,m_t^{\mathrm{obs}},e_{i,t}^{\mathrm{cat}}].
\]

Plain meaning: each event carries the proposition, delay, response, action, memory proxy, and raw categorical shock.

Why it is here: the model needs both structured sequence data and categorical trace data.

## Step 4.4: Benchmark against baselines of increasing strength

The benchmark tests:

1. frequency only,
2. current-touch only,
3. static tabular features,
4. shallow history summaries,
5. two-tower recommendation style,
6. monolithic sequence modeling.

Why it is here: a slow/fast latent-state model should only survive if it beats or matches simpler alternatives in a meaningful way.

## Step 4.5: Define the proposed latent-state benchmark model

\[
\hat T_i=E_T(u_i,g_i^{\mathrm{slow}}),
\qquad
z_{i,0}=z_0(\hat T_i),
\]

\[
z_{i,t+1}=U_\theta(z_{i,t},\hat T_i,c_{i,t},w_t,e_{i,t},g_{i,t}^{\mathrm{fast},\tau}),
\]

\[
\hat q_{i,t+1}^{(\tau,\Delta)}
=
G_\theta(\hat T_i,z_{i,t},c_{i,t},w_t,x_{t+1}).
\]

Plain meaning: estimate durable person state, initialize fast state, update fast state with each event, and predict the next task-relevant state for each candidate proposition.

## Step 4.6: Use the corrected training objective

\[
\mathcal L_\tau
=
\mathcal L_{\mathrm{main}}
+ \sum_{m=1}^{M}\lambda_m \mathcal L_{\mathrm{probe},m}
+ \lambda_{\mathrm{reg}} \Omega(\theta).
\]

Plain meaning: all three pieces are added because the optimizer is minimizing the objective.

Why it is here: probe losses should be reduced, not increased, and regularization should discourage unstable fits, not reward them.

## Step 4.7: Update slow and fast state on different timescales

Fast update:

\[
z_{i,t+1}=U_\theta(z_{i,t},\hat T_i,c_{i,t},w_t,e_{i,t},g_{i,t}^{\mathrm{fast},\tau}).
\]

Slow update:

\[
\hat T_i \leftarrow (1-\alpha)\hat T_i + \alpha\,\hat T_i^{\mathrm{new}}.
\]

Plain meaning: recent events can move the fast state immediately, but durable identity should drift slowly toward a refreshed durable estimate.

Harder term:

- This is an exponential moving average. A small \(\alpha\) means the slow state changes gradually.

Why it is here: without this split, one loud event can rewrite the whole person-side representation too aggressively.
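The slow refresh is a one-line exponential moving average; the vectors below are illustrative.

```python
def ema_refresh(T_hat, T_new, alpha=0.1):
    """Slow EMA refresh: T̂ ← (1 − α) T̂ + α T̂_new, elementwise."""
    return [(1 - alpha) * old + alpha * new for old, new in zip(T_hat, T_new)]

T_hat = [1.0, 0.0]
T_new = [0.0, 1.0]      # refreshed durable estimate after one loud event
print(ema_refresh(T_hat, T_new))  # → [0.9, 0.1]: the slow state barely moves
```

With \(\alpha=0.1\), a single dramatic event shifts the durable state by only ten percent of the gap, which is the intended damping.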

## Step 4.8: Separate ranking from policy evaluation

The same score function can be used in three regimes:

1. **Observational ranking**: rank candidate propositions under the learned simulator.
2. **Off-policy evaluation**: estimate how a target policy would have done using logged propensities.
3. **Online policy improvement**: test and improve the policy under controlled experimentation.

Why it is here: ranking alone is not causal control.

## Step 4.9: Use IPS only when logged propensities exist

\[
\hat V_{\mathrm{IPS}}(\pi)
=
\frac{1}{N}
\sum_{t=1}^{N}
\frac{\mathbf 1\{x_t=\pi(s_{i,t}^{(\tau,\Delta)})\}}{e_t}\,r_t.
\]

Plain meaning: upweight cases where the historical policy was unlikely to choose the action the target policy would have chosen.

Harder terms:

- \(e_t=\mu(x_t\mid s_{i,t}^{(\tau,\Delta)})\) is the behavior policy’s logged propensity.
- \(r_t\) is realized reward.
- If the target policy is stochastic rather than deterministic, the indicator is replaced by the importance ratio \(\pi(x_t\mid s)/\mu(x_t\mid s)\).

Why it is here: it gives a principled bridge from logged data to policy-value estimates.
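For a deterministic target policy, the estimator is a short loop over logged tuples. The logs below are invented; a real run would use the behavior policy's recorded propensities.

```python
def ips_value(logs, pi):
    """IPS estimate: reweight logged rewards by 1{x_t = π(s_t)} / e_t."""
    total = 0.0
    for s, x, e, r in logs:          # state, logged action, propensity, reward
        if pi(s) == x:               # indicator for the deterministic case
            total += r / e
    return total / len(logs)

# Illustrative logs: the behavior policy rarely sent "case_study".
logs = [
    ("warm", "follow_up",  0.8, 0.0),
    ("warm", "case_study", 0.2, 1.0),
    ("cold", "follow_up",  0.8, 0.0),
]
pi = lambda s: "case_study"          # target policy under evaluation
print(ips_value(logs, pi))           # the rare matching action is upweighted
```

The single matching log row contributes \(1.0 / 0.2 = 5\), so its rarity under the behavior policy is exactly what inflates its weight.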

## Step 4.10: Split train, validation, and test over time

\[
\mathcal D_{\mathrm{train}}^{(1:T_1)},
\quad
\mathcal D_{\mathrm{val}}^{(T_1:T_2)},
\quad
\mathcal D_{\mathrm{test}}^{(T_2:T_3)}.
\]

Plain meaning: future rows must stay in the future.

Why it is here: random row splits leak information in sequence problems.
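The split itself is a pure timestamp filter with no shuffling; `time_split` and the toy rows are illustrative.

```python
def time_split(rows, t1, t2):
    """Chronological split over (timestamp, payload) rows.
    Future rows must stay in the future, so no shuffling."""
    train = [r for r in rows if r[0] < t1]
    val   = [r for r in rows if t1 <= r[0] < t2]
    test  = [r for r in rows if r[0] >= t2]
    return train, val, test

rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
train, val, test = time_split(rows, t1=3, t2=4)
print(train, val, test)  # → [(1, 'a'), (2, 'b')] [(3, 'c')] [(4, 'd')]
```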

## Step 4.11: Evaluate both forecast quality and calibration

\[
\mathrm{LogLoss},\qquad \mathrm{Brier},\qquad \mathrm{PR\text{-}AUC},\qquad \mathrm{ECE}.
\]

Plain meaning:

- LogLoss checks probabilistic fit,
- Brier checks squared probability error,
- PR-AUC checks rare-event ranking,
- ECE checks calibration.

Why it is here: a useful decision model must rank well and produce believable probabilities.
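Three of these metrics are short enough to state exactly in code. A minimal sketch for binary outcomes (PR-AUC is omitted here; in practice it would come from a library such as scikit-learn's `average_precision_score`); the bin count for ECE is an illustrative choice.

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    """Negative mean log likelihood of binary outcomes y under probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def brier(y, p):
    """Mean squared error between predicted probabilities and outcomes."""
    return np.mean((np.asarray(p) - np.asarray(y)) ** 2)

def ece(y, p, n_bins=10):
    """Expected calibration error: bin-weighted |empirical rate - mean confidence|."""
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return total
```

A model can have excellent log loss yet poor ECE if it is systematically over- or under-confident in particular probability ranges, which is why both are reported.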

## Step 4.12: Use ablations to test whether the decomposition is real

The benchmark removes:

- \(z_{i,t}\),
- \(\hat T_i\),
- salience-weighted pooling,
- source/regime separation,
- probe heads,
- or the explicit slow/fast structure itself.

Why it is here: if performance does not move when these are removed, those pieces were decorative.

## Step 4.13: Define success clearly

Success on forecasting means better held-out log loss and Brier score than the best baseline across more than one horizon.

Success on proposition selection means better off-policy value or live lift than a baseline policy, when the data regime actually supports that claim.

# Part 5 — Axioms, Lemmas, and Main Theorem

## Step 5.1: State the axioms

Part 5 assumes:

1. an ambient space exists,
2. individuals instantiate inherited structure,
3. a task-conditioned predictive state exists,
4. the operational slow/fast state approximates that predictive state,
5. transition factors through the task-relevant proposition encoding,
6. evolutionary retention uses a positive-gain rule,
7. auxiliary probes also factor through predictive state.

Why it is here: theorems need fixed primitives.

## Step 5.2: Prove the accessible slice is the best accessible approximation

\[
P^{\mathrm{spec}}\mathbf n_t
=
\operatorname*{argmin}_{\mathbf m\in\mathcal M^{\mathrm{spec}}}
\lVert \mathbf n_t-\mathbf m\rVert.
\]

Plain meaning: among all states inside the accessible subspace, the projection is the closest one to the raw noumenal state.

Harder term:

- \(\arg\min\) means “the value that makes the quantity as small as possible.”

Why it is here: projection now has a precise geometric meaning.
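The claim is the standard fact that orthogonal projection onto a subspace returns the nearest point in that subspace. A minimal numeric sketch, assuming the accessible subspace \(\mathcal M^{\mathrm{spec}}\) is given as the column span of a matrix `B` (an illustrative encoding):

```python
import numpy as np

def project(B, n):
    """Orthogonal projection of n onto span(B): the closest accessible state.
    Solves the least-squares problem min_c ||B c - n||, then maps back."""
    coeffs, *_ = np.linalg.lstsq(B, n, rcond=None)
    return B @ coeffs
```

Any other point in the span is strictly farther from `n` than the projection, which is the argmin property stated above.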

## Step 5.3: Prove the novelty test is exact

If \(\Delta \mathbf v\) already lies in the current accessible space, its residual is zero.
If the residual is nonzero and its net gain is positive, adding it increases the evolutionary objective.

Why it is here: new axes are not added by intuition; they are added by residual novelty plus positive value.
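The two-part test (nonzero residual, positive net gain) can be written down directly. A minimal sketch under the same assumption as before, that the current accessible space is the column span of `B`; the scalar `gain` and `cost` stand in for the evolutionary objective's terms:

```python
import numpy as np

def residual(B, dv):
    """Component of dv outside span(B); the zero vector iff dv is already accessible."""
    coeffs, *_ = np.linalg.lstsq(B, dv, rcond=None)
    return dv - B @ coeffs

def maybe_add_axis(B, dv, gain, cost, tol=1e-10):
    """Add a new axis only when dv is genuinely novel AND its net gain is positive."""
    r = residual(B, dv)
    if np.linalg.norm(r) > tol and gain - cost > 0:
        return np.column_stack([B, r / np.linalg.norm(r)])  # append normalized residual
    return B
```

A direction already inside the span leaves `B` untouched no matter how large its gain, which is the exactness claim: novelty is measured by the residual, not by intuition.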

## Step 5.4: Prove task-equivalent propositions induce the same next-state prediction

If two propositions have the same task-relevant projected encoding, then the transition map gives the same predicted next state.

Why it is here: this is the formal version of composability.

## Step 5.5: Prove rollout is well-defined

Applying the same deterministic one-step world model repeatedly defines a unique finite-horizon rollout.

Why it is here: a one-step transition law becomes a true world model once it can simulate trajectories.
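Well-definedness here is just function iteration. A minimal sketch, with the one-step world model abstracted as any deterministic function `f(state, proposition)`:

```python
def rollout(f, s0, propositions):
    """Finite-horizon trajectory from iterating a deterministic one-step world model.
    Returns [s0, f(s0, x1), f(f(s0, x1), x2), ...]."""
    traj = [s0]
    s = s0
    for x in propositions:
        s = f(s, x)
        traj.append(s)
    return traj
```

Because `f` is deterministic, the trajectory is unique given the initial state and the proposition sequence, which is all the lemma asserts.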

## Step 5.6: Prove explicit history can help

Under log loss, the Bayes-optimal risk is \(H(Y\mid X)\), conditional entropy.
Under Brier score for binary outcomes, the Bayes-optimal risk is \(\mathbb E[\operatorname{Var}(Y\mid X)]\), expected conditional variance.

Plain meaning: giving the model more genuinely useful history cannot make the optimal predictor worse.

Why it is here: this justifies explicit state and sequence modeling when the future depends on the past.
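A tiny worked example makes the inequality concrete. The joint distribution below is an illustrative assumption; the code checks that conditioning on \(X\) lowers both Bayes-optimal risks, \(H(Y\mid X)\le H(Y)\) and \(\mathbb E[\operatorname{Var}(Y\mid X)]\le\operatorname{Var}(Y)\):

```python
import numpy as np

# Toy joint p[x, y] over X in {0,1}, Y in {0,1}, chosen so X is informative about Y.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])
px = p.sum(axis=1)                       # P(X)
py = p.sum(axis=0)                       # P(Y)
py_given_x = p / px[:, None]             # P(Y | X)

H_Y = -np.sum(py * np.log(py))                         # marginal entropy (log-loss risk, no history)
H_Y_given_X = -np.sum(p * np.log(py_given_x))          # conditional entropy (log-loss risk, with history)

var_Y = py[1] * (1 - py[1])                            # Bernoulli variance (Brier risk, no history)
E_var_Y_given_X = np.sum(px * py_given_x[:, 1] * (1 - py_given_x[:, 1]))
```

Here \(H(Y)=\log 2\approx 0.693\) while \(H(Y\mid X)\approx 0.500\), and \(\operatorname{Var}(Y)=0.25\) while \(\mathbb E[\operatorname{Var}(Y\mid X)]=0.16\): genuinely useful history strictly reduces both optimal risks.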

## Step 5.7: Prove minimal sufficient state is unique up to reparameterization

If two predictive states are both minimal and sufficient for the same task, then they are the same object up to measurable bijection on their supports.

Plain meaning: the coordinates can change, but the minimal information content is the same.

Why it is here: the framework is not claiming one sacred coordinate chart for the mind.

## Step 5.8: Prove sufficient compression

If

\[
Y \perp H_{i,\le t}
\mid
(z_{i,t},\hat T_i,c_{i,t},w_t,x_t),
\]

then the full history can be replaced by the compressed state for prediction.

Plain meaning: if the compressed state blocks any remaining dependence on the raw history, the raw history no longer needs to be carried directly.

Why it is here: it justifies latent-state compression.

## Step 5.9: Prove slow/fast mediation

If the predictive state factors into slow and fast parts and the relevant conditional independence holds, then recent within-window history contributes to prediction only through the fast state once the slow profile and context are fixed.

Why it is here: it gives the slow/fast split a formal interpretation rather than treating it as architecture taste.

## Step 5.10: Prove probe consistency

If two histories give the same predictive state, then under the same proposition they induce the same conditional law for every probe that factors through that state.

Why it is here: probe heads become principled readouts instead of decorative extras.

## Step 5.11: State the main theorem

The main theorem says that for each task and horizon there exists a finite-dimensional task-conditioned representation

\[
(\hat T_i,z_{i,t},c_{i,t},w_t,x_t)
\longmapsto
\hat q_{i,t+1}^{(\tau,\Delta)}
\longmapsto
(\hat y_{i,t+\Delta}^{(\tau)},\hat a_{i,t+\Delta}^{(\tau)})
\]

with these properties:

1. best accessible approximation,
2. evolutionary coherence,
3. task-equivalence,
4. recursive simulability,
5. state advantage,
6. minimal-state uniqueness,
7. sufficient compression,
8. slow/fast mediation,
9. probe consistency.

Plain meaning: once the framework’s assumptions are accepted, the world model is internally coherent and usable as a predictive system.

## Step 5.12: State the corollary on proposition ranking

\[
x
\mapsto
\operatorname{score}_\theta(x\mid s_{i,t}^{(\tau,\Delta)}).
\]

Plain meaning: the model induces an observational ranking over admissible propositions.

Why it matters: the best proposition is well-defined as an argmax of the score whenever the candidate set is finite or the maximum is attained.
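When the candidate set is finite, the argmax is a one-liner. A minimal sketch, with the learned score function abstracted as any callable of the candidate (the state \(s_{i,t}^{(\tau,\Delta)}\) is assumed to be closed over by `score`):

```python
def best_proposition(candidates, score):
    """Argmax of the score over a finite admissible candidate set.
    Well-defined because a finite set always attains its maximum."""
    return max(candidates, key=score)
```

Ties are broken by iteration order here; a real system would need an explicit tie-breaking rule.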

## Step 5.13: Keep the causal boundary sharp

The corollary does not prove that following the ranking improves the real world.

Why it is here: observational ranking, off-policy evaluation, and online causal improvement are different epistemic regimes.

# Operational summary

For one task, the framework reduces to this recipe:

1. Build the durable person estimate \(\hat T_i\) from slow features and slow categorical memory.
2. Build the current fast state \(z_{i,t}\) from recent events and fast categorical memory.
3. Attach role-context \(c_{i,t}\) and world state \(w_t\).
4. Encode the candidate proposition \(x_t\).
5. Predict the next task-relevant latent state.
6. Decode outcomes and probes.
7. Update fast state after each event.
8. Refresh slow state gradually as durable evidence accumulates.
9. Train against main outcome, probes, and regularization.
10. Benchmark against simpler baselines.
11. Rank propositions observationally.
12. Only claim policy improvement when propensities or experiments support it.
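The event-loop core of the recipe (steps 5 through 8) can be sketched as one small class. Everything concrete in it is an illustrative assumption: the linear "transition", the sigmoid readout, and the update rates stand in for the learned maps \(f_\theta\), the decoder, and the trained schedules.

```python
import numpy as np

class ObserverModel:
    """Minimal sketch of the per-event loop; shapes and update rules are toys."""

    def __init__(self, dim, alpha=0.05, beta=0.5):
        self.T = np.zeros(dim)          # durable person estimate \hat T_i (slow)
        self.z = np.zeros(dim)          # current fast state z_{i,t}
        self.alpha, self.beta = alpha, beta

    def predict(self, c, w, x):
        # Step 5: toy stand-in for the learned transition to the next latent state.
        return self.T + self.z + c + w + x

    def decode(self, q):
        # Step 6: toy readout of an outcome probability from the latent state.
        return 1.0 / (1.0 + np.exp(-q.mean()))

    def after_event(self, event_emb):
        # Step 7: the fast state moves immediately after each event.
        self.z = (1 - self.beta) * self.z + self.beta * event_emb

    def refresh_slow(self, T_new):
        # Step 8: the slow state drifts gradually via EMA.
        self.T = (1 - self.alpha) * self.T + self.alpha * T_new
```

Training (step 9), benchmarking (step 10), and the ranking/policy distinction (steps 11 and 12) sit outside this loop and are not sketched here.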

# One clean mental model

The framework says:

- evolution gives a lineage a finite accessible slice of reality,
- a person realizes that slice in an individual way,
- recent events activate only part of that person-space,
- a proposition interacts with that active state,
- the interaction moves the observer into a new task-relevant state,
- visible outcomes are readouts from that new state.

That is the whole system in its shortest usable form.
