Steven A. Thompson

ZOMBORG: LLM Compression via Structured-Manifold Projection

project_zomborg

The Story

I learn by doing, so I decided to learn LLMs by figuring out how to compress one, vibe coding real implementations of things I found in research papers.

Project WIZORB identified something interesting about the nature of transformer-based large language models. Namely, that much of the strong Universal Weight Subspace Hypothesis (UWSH) seemed not to be real, or at least not to be provable based on my own tests. However, a weak UWSH (that a subspace exists where the model can live) is true, and we can use it to do real work. This sort of stands to reason, as the JL Lemma says we can down-project any higher-dimensional manifold into a lower-dimensional one; it just limits the minimum size we can project into without huge losses. So how much loss can we get away with?

To (over)simplify, much of the manifold of an LLM is degenerate, and we can down-project layers into much lower-rank representations without losing much in the way of functional capacity. This was already known in the form of distillation, but this was different: I wasn't training it, I was just projecting it down, then doing a very brief "repair" and recovering almost all functionality. Some lucky down-projections even seemed to increase functional capacity! See the Lottery Ticket Hypothesis for one potential explanation.

This project (ZOMBORG) was an attempt to take that rather surprising, but empirically derived, result and scale it to a full model. During ZOMBORG I quickly identified the limits of my PHD (Persistent Homology Dimension) based procedure for compression, and pivoted to optimizing the fundamental shape of the projected manifold. The simple theory was that a less chaotic manifold would lead to better performance, higher throughput, and smaller sizes.

What I discovered was that all of those things were true: the model WAS more performant and did not lose significant capacity with a single-layer replacement. However, the findings did not scale. There seemed to be a non-linear scaling limit, wherein errors increased rapidly due to small edge misalignment between layers. Or at least that seemed to be the case.

Ultimately I found that the issue was probably NOT a matter of simple edge alignment: try as I might, better alignment, edge shims, etc. could never completely compensate. Individual layer replacements were fine, maybe even better than the originals, but multi-layer replacements always resulted in slightly worse models.

My new working theory is that the "degenerate" sections of the manifold actually DO represent functional capacity, in that they are performing computation. In shrinking or replacing the degenerate manifold we were preserving its representational abilities but reducing its raw computational capacity.

This is likely due to superposition. Superposition allows a neural network to represent more independent features than it has physical dimensions by assigning each feature a direction vector. Those direction vectors combine to form an over-complete basis.

Because the basis is over-complete, any single axis (neuron) aligns with multiple feature vectors. This is the origin of the "degeneracy" I was trying to eliminate. It wasn't degeneracy at all. It was compute. That convoluted, cloud-like manifold was actually performing real computation, just in a strangely "organic" way that's difficult to really grok intuitively.

I'll upload the files/etc. to GitHub if anyone turns out to be interested. You can email me here if you want them for something.

I've already begun work on a new project, which uses what I've learned from ZOMBORG to try to create an alternative manifold that preserves the superposition but still allows us to order the manifold and achieve compression. But... math is hard. Especially high-dimensional hyperboloid manifold math. We'll see if I can cobble something together from the existing research papers and my own wild theories.

NOTE: Most of the rest of this post was written, or at least edited, by an AI. The AI was instructed to summarize my data in the form of an academic paper and to expand upon my own cursory notes, because "ain't nobody got time for that."™️ I have a lot of data and not a lot of time. This is just a hobby; I'm not doing it professionally or anything. I did at least proofread all of it. If you don't like that an AI assisted with this, you're entitled to a full refund of everything you paid to read it.

Project ZOMBORG: Structured-Manifold Reparameterization and Interface Alignment in Llama-3.1-8B

Abstract

Project ZOMBORG is an empirical study of whether key transformer linear layers can be re-parameterized into constrained forms while preserving function, and what prevents such replacements from scaling from one layer to many. The project began with Fisher-weighted low-rank factorization plus distillation (“repair”), then pivoted to a geometry-first approach that replaces layers with PHD-family structured manifolds and repairs them via teacher KL distillation.

Across controlled experiments, ZOMBORG establishes several robust findings:

  1. Basis choice dominates approximation quality at fixed rank: Fisher-weighted SVD outperforms Hadamard/permuted/random structured bases in a rank-64 duel.
  2. Sigma-only tuning is insufficient to recover from a poor basis, even with extended optimization.
  3. Small structured replacements can work, but scaling exposes an interface-misalignment failure mode: 1–2 layer structured replacements can remain stable after brief repair, while deeper blocks collapse without additional interface degrees of freedom.
  4. The strongest 8-layer result to date uses a coupled-aligned 8-layer core (layers 14–21) plus per-layer shims (r=16), achieving 15% on MBPP20 and improving heldout KL from 1.64→0.75, while remaining at 0% on HE10.

Later phases test whether connectors can be made more principled: Phase 51 (in progress) introduces a TASE-based expanded-interface diagnostic with an explicit near-identity gate.

NOTE: This TASE-based interface shim also improved on the otherwise poor results, but the model still collapsed at around 8–10 layers, and that's when I gave up on ZOMBORG. ZOMBORG IS DEAD. Long may he live.


1. Introduction and Motivation

Most compute and parameters in transformer decoders live in large linear maps (attention projections and MLP projections). ZOMBORG explores whether those maps can be replaced with structured, lower‑parameter representations while preserving function through minimal “repair” training. When classic low‑rank compression failed to restore logic with sigma‑only repair at strong compression, ZOMBORG pivoted to a higher-level scientific question:

Can the model’s function be preserved in a different manifold geometry (structured PHD-like factors), and if so, what prevents scaling this from one layer to many?

The “projection paradox” intuition motivating the pivot is that the base manifold is empirically inefficient: if function survives after moving into a structured manifold with very different weight geometry, then the model’s function may be representable in a substantially more constrained parameterization than dense weights suggest.


2. Methods

2.1 Baseline low-rank replacement

For a linear layer weight matrix W ∈ ℝ^(d_out × d_in), ZOMBORG uses a rank-r factorization:

W ≈ U · diag(σ) · Vᵀ

with U ∈ ℝ^(d_out × r), V ∈ ℝ^(d_in × r), and trainable coefficients σ ∈ ℝ^r.

Forward pass for activations x ∈ ℝ^(d_in):

y = U (diag(σ) (Vᵀ x))

Implementation note: matrix multiplies run in the input dtype (often BF16), while sigma scaling is computed in FP32 for numerical stability.
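The factorized forward pass above is straightforward to sketch. The following is a minimal NumPy illustration (the real implementation runs on BF16 GPU tensors, which NumPy lacks, so the dtype handling here is only indicative):

```python
import numpy as np

class LowRankLinear:
    """Rank-r replacement for a dense layer: y = U @ (diag(sigma) @ (V.T @ x)).

    U: (d_out, r), V: (d_in, r), sigma: (r,) trainable scale coefficients.
    Mirrors the note above: sigma scaling is done in float32.
    """
    def __init__(self, W, r):
        # Build the factors from a plain (unweighted) truncated SVD of W.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        self.U = U[:, :r]            # (d_out, r)
        self.V = Vt[:r, :].T         # (d_in, r)
        self.sigma = s[:r].copy()    # (r,) -- the trainable coefficients

    def __call__(self, x):
        z = self.V.T @ x                          # project into the rank-r subspace
        z = self.sigma.astype(np.float32) * z     # sigma scaling in FP32
        return self.U @ z                         # project back to d_out

# Sanity check: a full-rank factorization reproduces the dense layer exactly.
W = np.random.default_rng(0).standard_normal((8, 6))
layer = LowRankLinear(W, r=6)
x = np.ones(6)
assert np.allclose(layer(x), W @ x)
```

At rank r the replacement stores (d_out + d_in + 1) · r parameters instead of d_out · d_in, which is the source of the compression.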

2.2 Fisher-weighted SVD and covariance modes

ZOMBORG’s original path uses activation statistics to weight decompositions so approximation error is low in the distribution where the model operates. This yields large MSE reductions relative to unweighted SVD in proof-of-concept tests (e.g., v_proj weighted MSE about half of unweighted). Because full covariances for very wide layers (e.g., MLP down-projection with 14k inputs) are expensive, ZOMBORG implements both:
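One standard way to realize an activation-weighted factorization is to whiten by the input covariance before truncating. The sketch below assumes that construction; the project's exact Fisher weighting may differ:

```python
import numpy as np

def weighted_lowrank(W, X, r):
    """Activation-weighted rank-r approximation of W.

    Minimizes E_x ||(W - W_r) x||^2 over calibration activations X (columns
    are samples) by taking the SVD of W @ C^{1/2}, where C = X X^T / n is
    the input covariance, then folding the whitening back in.
    """
    C = X @ X.T / X.shape[1]
    evals, evecs = np.linalg.eigh(C)
    evals = np.clip(evals, 1e-8, None)           # guard near-singular covariance
    Chalf = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    Chalf_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    U, s, Vt = np.linalg.svd(W @ Chalf, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :] @ Chalf_inv

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 12))
X = rng.standard_normal((12, 200))
Wr = weighted_lowrank(W, X, r=6)
# Compare against plain truncated SVD on the calibration distribution.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
Wp = U[:, :6] @ np.diag(s[:6]) @ Vt[:6, :]
err_w = np.linalg.norm((W - Wr) @ X)
err_p = np.linalg.norm((W - Wp) @ X)
assert err_w <= err_p + 1e-6   # weighted is optimal for this metric
```

The inequality holds by construction: the weighted SVD is the best rank-r approximation under the data-weighted norm, which is exactly what matters when the layer only ever sees activations from that distribution.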

2.3 Structured manifold replacements (PHD family)

The “geometry-first” pivot seeks to construct U, V (or related factors) using structured orthogonal transforms composed of operations like permutation matrices P, sign-diagonal D, and Hadamard transforms H. In practice, ZOMBORG explores blockwise PHD (e.g., block size 512) and “coupled alignment” variants that attempt to preserve layer-to-layer interfaces. The key scientific constraint throughout: the structured manifold should be doing real work; adapters/repair should not simply reconstitute a dense solution (a hidden LoRA). Later phases propose measurements like “shim dominance” to verify this principle.
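The P, H, D building blocks can be sketched with a fast Walsh–Hadamard transform. This is a minimal NumPy illustration; block sizes, composition order, and how many factors are stacked are project choices not reproduced here:

```python
import numpy as np

def hadamard_transform(x):
    """Fast Walsh-Hadamard transform (length must be a power of two), O(n log n)."""
    x = x.copy()
    n = x.shape[0]
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)   # orthonormal scaling

def phd_apply(x, perm, signs):
    """Apply one P·H·D factor: sign flip (D), Hadamard mix (H), permutation (P).

    Each factor is orthogonal and cheap to store (n indices + n signs),
    which is the appeal of the structured basis over dense U/V.
    """
    return hadamard_transform(signs * x)[perm]

rng = np.random.default_rng(2)
n = 8
perm = rng.permutation(n)
signs = rng.choice([-1.0, 1.0], size=n)
x = rng.standard_normal(n)
y = phd_apply(x, perm, signs)
# Orthogonality check: the transform preserves the norm of x.
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))
```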

2.4 Repair training (distillation)

Repair training minimizes teacher–student divergence. ZOMBORG implements:
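The repair objective can be sketched as a per-token KL divergence between teacher and student logits. Temperature scaling, loss masking, and any auxiliary terms the project uses are omitted from this sketch:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_repair_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student): the divergence minimized
    during 'repair' distillation."""
    p = softmax(teacher_logits)
    log_p = np.log(p + 1e-12)
    # Stable log-softmax of the student.
    log_q = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_q = log_q - np.log(np.exp(log_q).sum(axis=-1, keepdims=True))
    return float((p * (log_p - log_q)).sum(axis=-1).mean())

rng = np.random.default_rng(3)
t = rng.standard_normal((4, 10))            # (positions, vocab)
assert np.isclose(kl_repair_loss(t, t), 0.0, atol=1e-6)   # perfect student
assert kl_repair_loss(t, t + rng.standard_normal((4, 10))) > 0.0
```

Note that the loss is invariant to a constant shift of the student logits, which is why the heldout-KL metric used throughout the results tracks distributional match rather than raw logit match.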

2.5 The Dreamer/Hologram dataset

A major early failure: calibration and repair on code-only distributions led to partial MBPP recovery but zero HumanEval. To prevent subspace collapse and preserve non-code dimensions, ZOMBORG adds Phase 0 Dreamer: a synthetic dataset generated from the teacher using hybrid sampling. The project then standardizes on a 50/50 Code + Hologram mix for calibration and repair.


3. Metrics and Experimental Gates

ZOMBORG uses a fail-fast ladder with increasingly expensive runs.

3.1 Metrics

3.2 Gates (typical)


4. Results (chronological narrative)

4.1 Phases 2–9: Compression pipeline succeeds engineering-wise but fails semantically at aggressive ranks

ZOMBORG successfully builds:


4.2 Phase 1.5: Structured Basis Duel establishes “basis supremacy” and sigma-only limits

A controlled duel on layers[15].mlp.down_proj (rank 64) compares:

Results:

This anchors a key constraint: if your structured basis is “wrong,” you cannot fix it by only tuning sigma; you need a correct U/V rotation or additional trainable degrees of freedom.


4.3 Phase 0 Dreamer: stabilizing repair distributions

The project adds a hologram dataset:


4.4 Phases 20–22: Early structured “match Fisher basis” attempts fail SCS thresholds

ZOMBORG introduces similarity criteria (SCS) and tries to “transfer” the empirically derived Fisher geometry onto structured candidates:

These results suggest that “find a structured basis that closely matches Fisher” is not easy and may not be the best path.


4.5 Phase 30–34: Blockwise PHD becomes functional (core novelty)

ZOMBORG then pivots from “match Fisher” to “build a structured manifold that can be repaired into function.” A reported best configuration centers on blockwise PHD for the MLP down-projection with:

This phase family provides evidence that a structured manifold can retain non-trivial function after minimal repair, and sets the stage for multi-layer scaling.


4.6 Phase 35: Two-layer structured manifold smoke (multi-layer stability begins)

A two-layer replacement (layers 15–16 down-projections) yields low heldout KL (~0.18) with non-zero task signals on tiny evals, and appears stable under joint optimization in that small regime. This suggests the structured manifold can compose across at least a couple of layers, but does not yet validate scaling to 8+ layers.


4.7 Phases 37–40: Scaling reveals interface mismatch; ΔW residual does not solve it

Later scaling attempts highlight that error becomes roughly additive and competence collapses as more independently built structured layers are stacked. A summary table of later-phase outcomes emphasizes:


4.8 Phases 41–42: Coupled interface basis extraction (edge alignment infrastructure)

To address the interface hypothesis directly, ZOMBORG adds a coupled interface basis extraction approach over a layer range (e.g., 14–21) and builds supporting infrastructure for aligned basis metadata. Conceptually, this treats layer pairs (or groups) as coupled during manifold construction, so the output subspace of layer ℓ better matches the input subspace expected by layer ℓ+1.


4.9 Phases 46–51: The 8-layer “interface wall”, shims, and the TASE diagnostic

ZOMBORG’s late-stage work focuses on whether structured replacements can scale from “a few layers” to a deep block while preserving competence.

Phase 46 baseline (independent shims) — fails at 8 layers

A baseline attempt to scale shims to an 8-layer block without coupled alignment yields 0% on tiny coding evals, motivating alignment-first and coupled-core approaches.

Phase 47 (current best): coupled-aligned 8-layer core + per-layer shims (rank 16)

A coupled-aligned core for layers 14–21 combined with per-layer shims (r=16) achieves:

This is a partial functional recovery, not a return to baseline, but it establishes that the 8-layer “wall” is not absolute under the project’s constraints.

Phase 48: shim-rank sweep (rank 32) — fails

Doubling shim rank 16 → 32 produces negligible KL change (0.753 → 0.751) while function collapses back to 0% MBPP20 in the tested run. This indicates that distributional matching (KL) is not sufficient, and that additional adapter capacity can destabilize task behavior even when KL improves slightly.

Phase 49: alignment-first, sigma-only (no shims) — fails

Using coupled alignment without shims, and allowing only sigma correction, does not restore competence at 8 layers (tiny evals remain 0%). This is consistent with the earlier “sigma-only limits” result.

Phase 50: freezing shim components — fails

Freezing parts of the shim path (B-only / frozen input projection) degrades performance relative to Phase 47 in the tested configuration, suggesting that input-side flexibility matters for stability.

Phase 51: TASE→PHD “expanded-interface” diagnostic (in progress)

Phase 51 tests a diagnostic hypothesis: if we expand the interface (TASE) and then compress back into the PHD substrate, can we make the interface mapping near-identity under a strict gate?

These diagnostics do not yet establish a functional improvement pathway, but they define a measurable target (“near-identity interface”) for future connector designs.
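The near-identity gate can be phrased as a simple deviation metric. The Frobenius-norm definition and the threshold below are illustrative assumptions; the project's exact gate is not specified in the notes above:

```python
import numpy as np

def identity_deviation(M):
    """Normalized deviation of an interface map from identity:
    ||M - I||_F / sqrt(d)."""
    d = M.shape[0]
    return float(np.linalg.norm(M - np.eye(d)) / np.sqrt(d))

def near_identity_gate(M, tau=0.05):
    """Pass when the interface mapping is close enough to identity.

    tau is an illustrative threshold, not the project's.
    """
    return identity_deviation(M) <= tau

d = 16
rng = np.random.default_rng(4)
assert near_identity_gate(np.eye(d))                                   # identity passes
assert not near_identity_gate(np.eye(d) + 0.5 * rng.standard_normal((d, d)))  # noisy map fails
```

A gate like this gives the connector search a pass/fail target that is independent of downstream evals, which is the point of the diagnostic.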


5. Scientific Interpretation

5.1 Is the structured manifold doing “real work,” or is it bending back?

Two pieces of evidence support “real work” rather than trivial bending:

  1. The basis duel demonstrates sigma-only cannot rescue a bad basis; therefore, if the PHD core remains fixed and function returns, the structured core is contributing materially.
  2. The project explicitly tracks the risk that adapters could become a hidden dense path; later phases introduce measurement and constraints (“shim dominance”) to ensure the structured manifold remains dominant.

However, the Phase 47 result also shows that adding shims can restore competence, raising the central methodological challenge: how to allow enough interface correction without “rebuilding the original dense manifold.”

5.2 Why multi-layer scaling is hard

A transformer stack is a composition of learned maps whose intermediate representations are highly coordinated. Independently replacing layers with structured manifolds introduces rotations/stat shifts; even small per-layer mismatch can accumulate into catastrophic failures at 8+ layers. This makes edge alignment (coupled extraction) a principled route, and shims a pragmatic route.
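The accumulation argument can be illustrated with a toy stack of orthogonal maps, each perturbed by a small relative error. This is a toy model of per-layer interface mismatch, not the transformer itself:

```python
import numpy as np

def stacked_relative_error(n_layers, eps, d=64, seed=0):
    """Compose n_layers random orthogonal maps, perturbing each by relative
    Frobenius error eps, and return the relative output error of the stack."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    y_exact = x.copy()
    y_pert = x.copy()
    for _ in range(n_layers):
        Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal layer
        E = rng.standard_normal((d, d))
        E *= eps * np.linalg.norm(Q) / np.linalg.norm(E)   # ||E||_F = eps * ||Q||_F
        y_exact = Q @ y_exact
        y_pert = (Q + E) @ y_pert
    return float(np.linalg.norm(y_pert - y_exact) / np.linalg.norm(y_exact))

# Small per-layer error compounds: deeper stacks drift further from exact.
e1 = stacked_relative_error(1, eps=0.01)
e8 = stacked_relative_error(8, eps=0.01)
assert e8 > e1
```

In this linear toy the growth is roughly additive; in a real transformer, nonlinearities and layernorm statistics can amplify mismatch further, consistent with the observed 8-layer wall.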


6. Negative Results and Dead-Ends (so future experimenters avoid them)

  1. Hadamard basis did not outperform random controls at rank 64 in the key duel.
  2. Sigma-only tuning does not meaningfully recover from a poor basis even with 1,000 steps.
  3. Structured candidates designed to match the Fisher basis (PHD matching, sparse Givens, butterfly) failed early similarity thresholds.
  4. Aggressive MLP-only compression at rank ratio 0.30 repeatedly yields 0 pass@1 despite better loss curves and response-only distillation.
  5. Increasing shim rank (r=16→32) did not improve heldout KL meaningfully and regressed tiny-eval competence to 0% in the tested 8-layer setting (Phase 48).
  6. Coupled alignment without connector degrees of freedom (sigma-only repair, no shims) did not restore competence at 8 layers (Phase 49).
  7. Freezing parts of the shim pathway degraded performance relative to the best shimmed run (Phase 50).

7. Reproducibility Checklist

7.1 Core artifacts

7.2 Key scripts

7.3 Minimal “novel result” reproductions

A. Basis Duel: replicate Fisher vs Hadamard/random dominance and sigma-only non-recovery.
B. Blockwise PHD functional retention: reproduce Phase 30–34 single-layer stability and heldout KL behavior.
C. Multi-layer scaling and the interface wall: reproduce 4-layer collapse vs ΔW residual rescue, and 8-layer collapse vs coupled core + shims partial recovery (Phases 36–50).
D. TASE near-identity diagnostic: reproduce the Phase 51 sanity check and identity-deviation sweep.


8. Suggested Next Steps (evidence-based)

  1. Proof-of-principle: full-model PHD without compression. If the immediate goal is science rather than size, move the full model into the structured manifold at high rank (or full-rank structured factors) and repair. This answers: “Can the network live in this manifold at all?” before optimizing rank/speed.

  2. Strengthen alignment-first approaches. Phase 49 suggests alignment improves KL but not competence. This points to missing task-critical information in the aligned manifold or insufficient repair freedom (sigma-only). Consider richer but still constrained repair degrees: tiny interface mixers, constrained rotations, or alignment penalties.

  3. If shims are used, enforce anti-LoRA constraints. Use measurements like shim dominance and restrict shims to pre/post interfaces, keeping PHD cores frozen. The aim is connectors as scaffolding, not as the primary compute substrate.

  4. Only commit to 50+ hour runs when 8-layer has non-zero HE and MBPP. Use KL + tiny evals to gate expensive runs.
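The "shim dominance" measurement mentioned above can be sketched as an output-energy ratio. This is an assumed definition for illustration; the project's exact metric may differ:

```python
import numpy as np

def shim_dominance(core_out, shim_out):
    """Fraction of the replacement layer's output energy carried by the shim
    path: ||shim||^2 / (||core||^2 + ||shim||^2).

    Near 1 means the adapters have become a hidden dense path (a "hidden
    LoRA"); near 0 means the structured core is doing the work.
    """
    c = float(np.sum(np.asarray(core_out) ** 2))
    s = float(np.sum(np.asarray(shim_out) ** 2))
    return s / (c + s)

rng = np.random.default_rng(5)
core = rng.standard_normal(100)
assert shim_dominance(core, 0.1 * core) < 0.1   # core-dominated layer
assert shim_dominance(0.1 * core, core) > 0.9   # shim-dominated (hidden LoRA)
```

Tracked over a calibration batch, a dominance cap (e.g., rejecting runs where the ratio drifts above some threshold) would operationalize the anti-LoRA constraint.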


Appendix: One-paragraph “brief” for a new experimenter

ZOMBORG began as a Fisher-weighted low-rank compression pipeline for Llama-3.1-8B, building robust calibration, compression, and distillation infrastructure, but repeatedly failed to restore coding benchmark competence under aggressive sigma-only compression. The project then pivoted to structured manifolds: replacing layers with blockwise PHD-style structured factors and repairing with teacher distillation on a mixture of code and a synthetic “Dreamer/Hologram” dataset to preserve general dimensions. The most robust scientific finding is that basis choice dominates approximation and sigma-only cannot fix a bad basis; nevertheless, structured PHD replacements can remain functional at small scale after minimal repair, implying “different geometry, similar function” is possible. Scaling to 8+ layers reveals a dominant interface-misalignment failure mode, motivating coupled interface basis extraction (edge alignment) and constrained connector shims; the best current 8-layer result partially recovers MBPP but remains weak on HumanEval.