ZOMBORG: LLM Compression via Structured-Manifold Projection

The Story
I learn by doing, so I decided to learn LLMs by figuring out how to compress one, vibe-coding real implementations of things I found in research papers.
Project WIZORB identified something interesting about the nature of transformer-based large language models. Namely, much of the strong Universal Weight Subspace Hypothesis (UWSH) seemed not to be real, or at least not to be provable based on my own tests. However, a weak UWSH (that a subspace exists where the model can live) is true, and we can use it to do real work. This sort of stands to reason: the JL lemma says we can project points from a high-dimensional space into a much lower-dimensional one with bounded distortion, it just limits the minimum dimension we can project into without huge losses. So how much loss can we get away with?
To (over)simplify, much of the manifold of an LLM is degenerate, and we can down-project layers into much lower-rank representations without losing much in the way of functional capacity. This was already known in the form of distillation, but this was different. I wasn't training it; I was just projecting it down, then doing a very brief "repair" and recovering almost all functionality. Some lucky down-projections even seemed to increase functional capacity! See the Lottery Ticket Hypothesis for one potential explanation.
This project (ZOMBORG) was an attempt to take that rather surprising, but empirically derived, result and scale it to a full model. During ZOMBORG I quickly identified the limits of my PHD (Persistent Homology Dimension) based procedure for compression, and pivoted to optimizing the fundamental shape of the projected manifold. The simple theory was that a less chaotic manifold would lead to better performance, higher throughput, and smaller sizes.
What I discovered was that all of those things were true: the model WAS more performant and did not lose significant capacity with a single layer replacement. However, the findings did not scale. There seemed to be a non-linear scaling limit, wherein errors increased rapidly due to small edge misalignments between layers. Or at least that seemed to be the case.
Ultimately I found that the issue was probably NOT a matter of simple edge alignment: try as I might, better alignment, edge shims, etc. could never completely compensate. Individual layer replacements were fine, maybe even better than the originals, but multi-layer replacements always resulted in slightly worse models.
My new working theory is that the "degenerate" sections of the manifold actually DO represent functional capacity, in that they are performing computation. In shrinking or replacing the degenerate manifold we were preserving its representational abilities, but reducing its raw computational capacity.
This is likely due to superposition. Superposition allows a neural network to represent more independent features than it has physical dimensions by assigning each feature a direction vector. Those direction vectors combine to form an over-complete basis.
Because the basis is over-complete, any single axis (neuron) aligns with multiple feature vectors. This is the origin of the "degeneracy" I was trying to eliminate. It wasn't degeneracy at all. It was compute. That convoluted, cloud-like manifold was actually performing real computation, just in a strangely "organic" way that's difficult to really grok intuitively.
I'll upload the files/etc. to GitHub if anyone turns out to be interested. You can email me here if you want them for something.
I've already begun work on a new project, which uses what I've learned from ZOMBORG to try to create an alternative manifold that preserves the superposition but still allows us to order the manifold and achieve compression. But... math is hard. Especially high-dimensional hyperboloid manifold math. We'll see if I can cobble something together from the existing research papers and my own wild theories.
NOTE: Most of the rest of this post was written or at least edited by an AI. The AI was instructed to summarize my data in the form of an academic paper, and to expand upon my own cursory notes, because "ain't nobody got time for that." I have a lot of data and not a lot of time. This is just a hobby; I'm not doing it professionally or anything. I did at least proofread all of it. If you don't like that an AI assisted with this, you're entitled to a full refund of everything you paid to read it.
Project ZOMBORG: Structured-Manifold Reparameterization and Interface Alignment in Llama-3.1-8B
Abstract
Project ZOMBORG is an empirical study of whether key transformer linear layers can be re-parameterized into constrained forms while preserving function, and what prevents such replacements from scaling from one layer to many. The project began with Fisher-weighted low-rank factorization plus distillation ("repair"), then pivoted to a geometry-first approach that replaces layers with PHD-family structured manifolds and repairs them via teacher KL distillation.
Across controlled experiments, ZOMBORG establishes several robust findings:
- Basis choice dominates approximation quality at fixed rank: Fisher-weighted SVD outperforms Hadamard/permuted/random structured bases in a rank-64 duel.
- Sigma-only tuning is insufficient to recover from a poor basis, even with extended optimization.
- Small structured replacements can work, but scaling exposes an interface-misalignment failure mode: 1–2 layer structured replacements can remain stable after brief repair, while deeper blocks collapse without additional interface degrees of freedom.
- The strongest 8-layer result to date uses a coupled-aligned 8-layer core (layers 14–21) + per-layer shims (r=16), achieving MBPP20 15% and improving heldout KL from 1.64 → 0.75, while remaining 0% on HE10.
Later phases test whether connectors can be made more principled: Phase 51 (in progress) introduces a TASE-based expanded-interface diagnostic with an explicit near-identity gate.
NOTE: This TASE-based interface shim also improved on the otherwise poor results, but the model still collapsed at around 8-10 layers, and that's when I gave up on ZOMBORG. ZOMBORG IS DEAD. Long may he live.
1. Introduction and Motivation
Most compute and parameters in transformer decoders live in large linear maps (attention projections and MLP projections). ZOMBORG explores whether those maps can be replaced with structured, lower-parameter representations while preserving function through minimal "repair" training. When classic low-rank compression failed to restore logic with sigma-only repair at strong compression, ZOMBORG pivoted to a higher-level scientific question:
Can the model's function be preserved in a different manifold geometry (structured PHD-like factors), and if so, what prevents scaling this from one layer to many?
The "projection paradox" intuition motivating the pivot is that the base manifold is empirically inefficient: if function survives after moving into a structured manifold with very different weight geometry, then the model's function may be representable in a substantially more constrained parameterization than dense weights suggest.
2. Methods
2.1 Baseline low-rank replacement
For a linear layer weight matrix W ∈ R^(d_out × d_in), ZOMBORG uses a rank-r factorization:

W ≈ U · diag(σ) · Vᵀ

with fixed bases U ∈ R^(d_out × r) and V ∈ R^(d_in × r), and trainable coefficients σ ∈ R^r.

Forward pass for activations x: y = U (σ ⊙ (Vᵀ x)).
Implementation note: matrix multiplies run in the input dtype (often BF16), while sigma scaling is computed in FP32 for numerical stability.
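As a concrete illustration of the rank-r parameterization above, here is a minimal NumPy sketch (function names are mine, not the project's; the real code runs matmuls in BF16 with FP32 sigma scaling, which NumPy does not reproduce):

```python
import numpy as np

def factorize(W, r):
    """Truncated SVD of W: returns fixed bases U (d_out x r), V (d_in x r)
    and the r scale coefficients sigma, the only part left trainable."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r], S[:r], Vt[:r].T

def lowrank_forward(x, U, sigma, V):
    """y = U @ (sigma * (V.T @ x)); in the real code the matmuls run in the
    input dtype (often BF16) while the sigma scaling is done in FP32."""
    return U @ (sigma * (V.T @ x))

# At full rank the factorization reproduces the dense layer exactly;
# compression comes from choosing r well below min(d_out, d_in).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
x = rng.standard_normal(6)
U, sigma, V = factorize(W, r=6)
assert np.allclose(lowrank_forward(x, U, sigma, V), W @ x)
```

Only sigma (r numbers per layer) stays trainable, which is what makes the "repair_state" checkpoints described later so small.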
2.2 Fisher-weighted SVD and covariance modes
ZOMBORG's original path uses activation statistics to weight decompositions so approximation error is low in the distribution where the model operates. This yields large MSE reductions relative to unweighted SVD in proof-of-concept tests (e.g., v_proj weighted MSE about half of unweighted). Because full covariances for very wide layers (e.g., MLP down-projection with 14k inputs) are expensive, ZOMBORG implements both:
- full covariance modes where feasible, and
- diagonal-weighted approximations for large in_features, selected automatically by threshold.
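A sketch of what such a weighted decomposition might look like: whiten the input side by calibration statistics, SVD in the whitened space, then fold the whitener back into the returned basis. The function name, epsilon, and threshold default are my assumptions, not the project's API:

```python
import numpy as np

def fisher_weighted_svd(W, acts, r, full_cov_threshold=4096):
    """Activation-weighted truncated SVD (a sketch): make the rank-r error
    small where the calibration activations actually live."""
    d_in = W.shape[1]
    if d_in <= full_cov_threshold:
        C = acts.T @ acts / len(acts)                     # full input covariance
        L = np.linalg.cholesky(C + 1e-6 * np.eye(d_in))
    else:
        # Diagonal approximation for very wide layers (cheap second moments).
        L = np.diag(np.sqrt((acts ** 2).mean(axis=0) + 1e-6))
    U, S, Vt = np.linalg.svd(W @ L, full_matrices=False)  # SVD in whitened space
    V_adj = np.linalg.solve(L.T, Vt[:r].T).T              # un-whiten the basis
    return U[:, :r], S[:r], V_adj                         # W ~ U @ diag(S) @ V_adj
```

Truncating at r < d_in then preferentially preserves the directions that matter under the activation distribution, rather than under a flat Frobenius norm.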
2.3 Structured manifold replacements (PHD family)
The "geometry-first" pivot seeks to construct U (or related factors) using structured orthogonal transforms composed of operations like permutation matrices P, sign-diagonal matrices D, and Hadamard transforms H. In practice, ZOMBORG explores blockwise PHD (e.g., block size 512) and "coupled alignment" variants that attempt to preserve layer-to-layer interfaces. The key scientific constraint throughout: the structured manifold should be doing real work; adapters/repair should not simply reconstitute a dense solution (a hidden LoRA). Later phases propose measurements like "shim dominance" to verify this principle.
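To make the PHD family concrete, here is a hypothetical sketch of a blockwise structured-orthogonal transform built from a permutation (P), a sign-diagonal (D), and a fast Walsh-Hadamard transform (H). All names and the composition order are illustrative assumptions, not the project's exact construction:

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, h * 2):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)

def phd_block(x, perm, signs):
    """One structured-orthogonal factor: permute (P), flip signs (D), Hadamard (H)."""
    return fwht(signs * x[perm])

def phd_transform(x, perms, sign_vecs, block=512):
    """Apply an independent PHD factor to each contiguous block of the input."""
    out = np.empty(len(x))
    for j, i in enumerate(range(0, len(x), block)):
        out[i:i + block] = phd_block(x[i:i + block], perms[j], sign_vecs[j])
    return out
```

Because every factor is orthogonal, the composite is norm-preserving and costs only O(n log n) compute and O(n) parameters per block, which is the parameter-count appeal of this family over dense rotations.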
2.4 Repair training (distillation)
Repair training minimizes teacher–student divergence. ZOMBORG implements:
- shifted logits for causal alignment,
- attention masking,
- memory-efficient KL with log_target=True,
- response-only loss options,
- and "repair_state" checkpoints that store only trainable sigma/bias, shrinking checkpoints from >10GB to ~1MB.
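The objective above can be sketched in NumPy (a simplification; the actual implementation presumably uses a framework KL primitive with log_target=True, but the math is the same):

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last (vocab) axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def repair_kl_loss(teacher_logits, student_logits, mask):
    """Masked token-level KL(teacher || student) with shifted logits.

    Position t predicts token t+1 (causal alignment), padded positions are
    masked out, and the teacher term stays in log space, mirroring the
    log_target=True trick for memory efficiency.
    """
    t = log_softmax(teacher_logits[:, :-1])   # drop the last position
    s = log_softmax(student_logits[:, :-1])
    m = mask[:, 1:]                           # mask aligned with shifted targets
    kl = (np.exp(t) * (t - s)).sum(axis=-1)   # per-position KL divergence
    return float((kl * m).sum() / m.sum())
```

A response-only variant would simply zero the mask over prompt tokens so only the completion contributes to the loss.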
2.5 The Dreamer/Hologram dataset
A major early failure: calibration and repair on code-only distributions led to partial MBPP recovery but zero HumanEval. To prevent subspace collapse and preserve non-code dimensions, ZOMBORG adds Phase 0 Dreamer: a synthetic dataset generated from the teacher using hybrid sampling. The project then standardizes on a 50/50 Code + Hologram mix for calibration and repair.
3. Metrics and Experimental Gates
ZOMBORG uses a fail-fast ladder with increasingly expensive runs.
3.1 Metrics
- Tier-0 layer metrics: activation MSE and cosine similarity (basis-duel style).
- Heldout KL: teacher vs student logits on hologram samples (primary scaling signal).
- EvalPlus pass@1: HumanEval+/MBPP+ smoke and expanded runs.
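The Tier-0 layer metrics can be sketched directly (a minimal illustration; the exact reductions used in the project are an assumption on my part):

```python
import numpy as np

def tier0_metrics(y_ref, y_new):
    """Tier-0 layer metrics: activation MSE and mean per-token cosine
    similarity between original and replaced layer outputs (rows are tokens)."""
    mse = float(((y_ref - y_new) ** 2).mean())
    num = (y_ref * y_new).sum(axis=-1)
    den = np.linalg.norm(y_ref, axis=-1) * np.linalg.norm(y_new, axis=-1) + 1e-12
    return mse, float((num / den).mean())
```

These are cheap enough to run per layer on calibration activations before any expensive eval is attempted, which is what makes them useful as the bottom rung of the fail-fast ladder.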
3.2 Gates (typical)
- Non-zero pass@1 on smoke sets before long runs.
- Heldout KL should not worsen (often should improve) post-repair.
- When shims/connectors are introduced: "shim dominance" and related constraints (to avoid an accidental LoRA).
4. Results (chronological narrative)
4.1 Phases 2–9: Compression pipeline succeeds engineering-wise but fails semantically at aggressive ranks
ZOMBORG successfully builds:
- full-model calibration and replacement tooling, including robust loaders and fallback paths, with coherent untrained generation at moderate ranks in smoke tests, and unit tests across decomposition and loading.
- an eval harness for HumanEval+/MBPP+.

However, repeated attempts to compress MLP-only layers at rank ratio 0.30 produce 0 pass@1 even after extended repair (500 steps) and response-only loss; inspection shows syntactically valid but logically wrong solutions.

Interpretation: Sigma-only repair and/or the chosen factorization is insufficient to restore reasoning/code logic after strong compression, motivating the later geometry-first structured manifold approach.
4.2 Phase 1.5: Structured Basis Duel establishes "basis supremacy" and sigma-only limits
A controlled duel on layers[15].mlp.down_proj (rank 64) compares:
- Fisher-weighted SVD basis vs Hadamard vs permuted-Hadamard vs random orthogonal.
Results:
- Fisher yields significantly lower MSE and much higher cosine similarity.
- Hadamard ≈ random (no special structure captured at this rank/layer).
- Sigma-only tuning (100 steps, and later 1,000) yields negligible improvement.
This anchors a key constraint: if your structured basis is "wrong," you cannot fix it by only tuning sigma; you need a correct rotation or additional trainable degrees of freedom.
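The duel's core comparison can be sketched as scoring how much of W survives projection onto each candidate rank-r basis (a simplification of the actual duel, which weighted error by activation statistics):

```python
import numpy as np

def basis_residual(W, B):
    """Mean squared error left after projecting W's rows onto span(B),
    where B (d_in x r) has orthonormal columns."""
    return float(((W - W @ B @ B.T) ** 2).mean())

rng = np.random.default_rng(1)
W = rng.standard_normal((32, 16))

# Candidate rank-8 bases: top right-singular vectors vs a random orthonormal basis.
_, _, Vt = np.linalg.svd(W, full_matrices=False)
B_svd = Vt[:8].T
B_rand, _ = np.linalg.qr(rng.standard_normal((16, 8)))

# Eckart-Young guarantees the SVD basis minimizes this residual at fixed rank.
assert basis_residual(W, B_svd) <= basis_residual(W, B_rand)
```

The Eckart-Young theorem is the mathematical backbone of "basis supremacy": at fixed rank, no structured basis can beat the (weighted) SVD basis on reconstruction error, so structured candidates must win on parameters/speed, not accuracy.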
4.3 Phase 0 Dreamer: stabilizing repair distributions
The project adds a hologram dataset:
- 20,000 synthetic samples generated in ~9.6 hours,
- ~74.7 tokens/s throughput (BF16, CUDA, KV caching),
- with smoke generation success and artifacts stored on disk.

This becomes standard for later structured manifold and scaling experiments.
4.4 Phases 20–22: Early structured "match Fisher basis" attempts fail SCS thresholds
ZOMBORG introduces similarity criteria (SCS) and tries to "transfer" the empirically derived Fisher geometry onto structured candidates:
- PHD matching shows low SCS and low max cosine to Fisher.
- Learned butterfly improves but remains far from targets.
- PHD + sparse Givens also remains poor.
These results suggest that "find a structured basis that closely matches Fisher" is not easy and may not be the best path.
4.5 Phases 30–34: Blockwise PHD becomes functional (core novelty)
ZOMBORG then pivots from "match Fisher" to "build a structured manifold that can be repaired into function." A reported best configuration centers on blockwise PHD for the MLP down-projection with:
- rank 64, block size 512,
- repair for 200 steps,
- a hybrid repair mix with hologram data.
This phase family provides evidence that a structured manifold can retain non-trivial function after minimal repair, and sets the stage for multi-layer scaling.
4.6 Phase 35: Two-layer structured manifold smoke (multi-layer stability begins)
A two-layer replacement (layers 15–16 down-projections) yields low heldout KL (~0.18) with non-zero task signals on tiny evals, and appears stable under joint optimization in that small regime. This suggests the structured manifold can compose across at least a couple of layers, but does not yet validate scaling to 8+ layers.
4.7 Phases 37–40: Scaling reveals interface mismatch; ΔW residual does not solve it
Later scaling attempts highlight that error becomes roughly additive and competence collapses as more independently built structured layers are stacked. A summary table of later-phase outcomes emphasizes:
- "Full MLP down-proj replacement" variants and ΔW residual variants do not reliably restore task scores at 8 layers.
- ΔW at 8 layers is flagged as broken; earlier evidence suggests "residual off" controls do not work reliably when scaling.

Interpretation: The dominant failure mode is likely edge/interface misalignment between replaced layers.
4.8 Phases 41–42: Coupled interface basis extraction (edge alignment infrastructure)
To address the interface hypothesis directly, ZOMBORG adds a coupled interface basis extraction approach over a layer range (e.g., 14–21) and builds supporting infrastructure for aligned basis metadata. Conceptually, this treats layer pairs (or groups) as coupled during manifold construction, so the output subspace of layer i better matches the input subspace expected by layer i+1.
4.9 Phases 46–51: The 8-layer "interface wall", shims, and the TASE diagnostic
ZOMBORG's late-stage work focuses on whether structured replacements can scale from "a few layers" to a deep block while preserving competence.
Phase 46 baseline (independent shims) – fails at 8 layers
A baseline attempt to scale shims to an 8-layer block without coupled alignment yields 0% on tiny coding evals, motivating alignment-first and coupled-core approaches.
Phase 47 (current best): coupled-aligned 8-layer core + per-layer shims (rank 16)
A coupled-aligned core for layers 14–21 combined with per-layer shims (r=16) achieves:
- Heldout KL: 1.64 → 0.75
- Tiny evals: HE10 0%, MBPP20 15%
This is a partial functional recovery, not a return to baseline, but it establishes that the 8-layer "wall" is not absolute under the project's constraints.
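The per-layer shims in this configuration read as small low-rank residual connectors at the layer interfaces. A hypothetical sketch (the residual form, initialization, and naming are my assumptions), including a crude "shim dominance" style check:

```python
import numpy as np

class InterfaceShim:
    """Hypothetical rank-r residual connector at a layer interface:
    x -> x + B @ (A @ x), initialized with B = 0 so the map starts at the
    identity and the frozen structured core carries the function initially."""

    def __init__(self, d, r=16, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((r, d)) / np.sqrt(d)  # down-projection
        self.B = np.zeros((d, r))                          # up-projection, zero-init

    def __call__(self, x):
        return x + self.B @ (self.A @ x)

    def dominance(self, x):
        """Fraction of the output magnitude carried by the shim correction
        rather than the pass-through path (should stay small if the
        structured core, not the shim, is doing the real work)."""
        return float(np.linalg.norm(self.B @ (self.A @ x)) /
                     (np.linalg.norm(x) + 1e-12))
```

Monitoring dominance during repair is one way to operationalize the anti-LoRA constraint: if the correction term starts rivaling the pass-through term, the shim is rebuilding a dense path.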
Phase 48: shim-rank sweep (rank 32) – fails
Doubling shim rank 16 → 32 produces negligible KL change (0.753 → 0.751) while function collapses back to 0% MBPP20 in the tested run. This indicates that distributional matching (KL) is not sufficient, and that additional adapter capacity can destabilize task behavior even when KL improves slightly.
Phase 49: alignment-first, sigma-only (no shims) – fails
Using coupled alignment without shims, and allowing only sigma correction, does not restore competence at 8 layers (tiny evals remain 0%). This is consistent with the earlier "sigma-only limits" result.
Phase 50: freezing shim components – fails
Freezing parts of the shim path (B-only / frozen input projection) degrades performance relative to Phase 47 in the tested configuration, suggesting that input-side flexibility matters for stability.
Phase 51: TASE–PHD "expanded-interface" diagnostic (in progress)
Phase 51 tests a diagnostic hypothesis: if we expand the interface (TASE) and then compress back into the PHD substrate, can we make the interface mapping near-identity under a strict gate?
- A "no sparsity" sanity run (k = m = 16384) reaches mean d_id ≈ 0.0339 by step 200 (passes the sanity condition).
- Sparse/limited expansions (k = 512, 1024, 4096, 8192) currently show much larger identity deviations (mean d_id ≈ 0.58, 0.44, 0.27, 0.17 respectively), with the final k = 16384 run still in progress.
These diagnostics do not yet establish a functional improvement pathway, but they define a measurable target ("near-identity interface") for future connector designs.
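The near-identity gate reduces to a measurable quantity: the mean relative deviation of the interface map from the identity. A sketch of such a d_id metric (the exact normalization used in Phase 51 is an assumption on my part):

```python
import numpy as np

def identity_deviation(T, xs):
    """Mean relative deviation of the interface map T from identity over a
    batch of probe activations xs: mean of ||T(x) - x|| / ||x||."""
    devs = [np.linalg.norm(T(x) - x) / (np.linalg.norm(x) + 1e-12) for x in xs]
    return float(np.mean(devs))
```

A gate then becomes a simple threshold, e.g. accept the connector only if identity_deviation stays below some d_max on heldout activations.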
5. Scientific Interpretation
5.1 Is the structured manifold doing "real work," or is it bending back?
Two pieces of evidence support "real work" rather than trivial bending:
1) The basis duel demonstrates that sigma-only tuning cannot rescue a bad basis; therefore, if the PHD core remains fixed and function returns, the structured core is contributing materially.
2) The project explicitly tracks the risk that adapters could become a hidden dense path; later phases introduce measurements and constraints ("shim dominance") to ensure the structured manifold remains dominant.
However, the Phase 47 result also shows that adding shims can restore competence, raising the central methodological challenge: how to allow enough interface correction without "rebuilding the original dense manifold."
5.2 Why multi-layer scaling is hard
A transformer stack is a composition of learned maps whose intermediate representations are highly coordinated. Independently replacing layers with structured manifolds introduces rotations/stat shifts; even small per-layer mismatch can accumulate into catastrophic failures at 8+ layers. This makes edge alignment (coupled extraction) a principled route, and shims a pragmatic route.
6. Negative Results and Dead-Ends (so future experimenters avoid them)
- Hadamard basis did not outperform random controls at rank 64 in the key duel.
- Sigma-only tuning does not meaningfully recover from a poor basis even with 1,000 steps.
- Structured candidates designed to match the Fisher basis (PHD matching, sparse Givens, butterfly) failed early similarity thresholds.
- Aggressive MLP-only compression at rank ratio 0.30 repeatedly yields 0 pass@1 despite better loss curves and response-only distillation.
- Increasing shim rank (r=16 → 32) did not improve heldout KL meaningfully and regressed tiny-eval competence to 0% in the tested 8-layer setting (Phase 48).
- Coupled alignment without connector degrees of freedom (sigma-only repair, no shims) did not restore competence at 8 layers (Phase 49).
- Freezing parts of the shim pathway degraded performance relative to the best shimmed run (Phase 50).
7. Reproducibility Checklist
7.1 Core artifacts
- Hologram datasets: artifacts/data/hologram*.pt
- Calibration covariances: produced by scripts/calibrate_full.py
- Compressed checkpoints: produced by scripts/compress_model.py
- Repair states: src/zomborg/repair_state.py (sigma-only)
7.2 Key scripts
- scripts/train_repair.py (distillation)
- scripts/evalplus_benchmark.py (HumanEval+/MBPP+)
7.3 Minimal "novel result" reproductions
A. Basis Duel: replicate Fisher vs Hadamard/random dominance and sigma-only non-recovery.
B. Blockwise PHD functional retention: reproduce Phase 30–34 single-layer stability and heldout KL behavior.
C. Multi-layer scaling and the interface wall: reproduce 4-layer collapse vs ΔW residual rescue, and 8-layer collapse vs coupled core + shims partial recovery (Phases 36–50).
D. TASE near-identity diagnostic: reproduce the Phase 51 sanity check and identity-deviation sweep.
8. Suggested Next Steps (evidence-based)
Proof-of-principle: full-model PHD without compression. If the immediate goal is science rather than size, move the full model into the structured manifold at high rank (or full-rank structured factors) and repair. This answers "Can the network live in this manifold at all?" before optimizing rank/speed.
Strengthen alignment-first approaches. Phase 49 suggests alignment improves KL but not competence. This points to missing task-critical information in the aligned manifold or insufficient repair freedom (sigma-only). Consider richer but still constrained repair degrees of freedom: tiny interface mixers, constrained rotations, or alignment penalties.
If shims are used, enforce anti-LoRA constraints. Use measurements like shim dominance and restrict shims to pre/post interfaces, keeping PHD cores frozen. The aim is connectors as scaffolding, not as the primary compute substrate.
Only commit to 50+ hour runs when 8-layer has non-zero HE and MBPP. Use KL + tiny evals to gate expensive runs.
Appendix: One-paragraph âbriefâ for a new experimenter
ZOMBORG began as a Fisher-weighted low-rank compression pipeline for Llama-3.1-8B, building robust calibration, compression, and distillation infrastructure, but repeatedly failed to restore coding benchmark competence under aggressive sigma-only compression. The project then pivoted to structured manifolds: replacing layers with blockwise PHD-style structured factors and repairing with teacher distillation on a mixture of code and a synthetic "Dreamer/Hologram" dataset to preserve general dimensions. The most robust scientific finding is that basis choice dominates approximation and sigma-only cannot fix a bad basis; nevertheless, structured PHD replacements can remain functional at small scale after minimal repair, implying "different geometry, similar function" is possible. Scaling to 8+ layers reveals a dominant interface-misalignment failure mode, motivating coupled interface basis extraction (edge alignment) and constrained connector shims; the best current 8-layer result partially recovers MBPP but remains weak on HumanEval.