Transformers scaled us to GPT-4, but the quadratic attention bottleneck and tokenization tradeoffs are real constraints. I have been working on a different approach that replaces the attention mechanism entirely.
No Tokenizer
The model works directly with raw bytes. No BPE, no SentencePiece, no subword vocabulary. This removes a failure point — no tokenizer files to drift, no special token logic, no vocabulary mismatch between training and inference.
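To make the byte-level interface concrete, here is a minimal sketch of what "no tokenizer" means in practice. The function names are hypothetical, not from the actual codebase; the point is that encoding is just UTF-8 bytes, a fixed range of 256 values with nothing to train, version, or ship.

```python
# Hypothetical helper names for illustration -- the real pipeline is not shown.

def to_model_input(text: str) -> list[int]:
    """Encode text as raw UTF-8 bytes: a fixed 'vocabulary' of 256 values."""
    return list(text.encode("utf-8"))

def from_model_output(byte_ids: list[int]) -> str:
    """Decode model-emitted bytes back to text (invalid sequences replaced)."""
    return bytes(byte_ids).decode("utf-8", errors="replace")

ids = to_model_input("héllo")          # multi-byte chars just become more bytes
assert all(0 <= b < 256 for b in ids)
assert from_model_output(ids) == "héllo"
```

Because the encoding is bijective and deterministic, there is no train/inference mismatch to guard against: the same six bytes come out of `"héllo"` every time, on every machine.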
Beyond Attention
Instead of computing pairwise interactions across the full sequence, the model propagates prediction deltas through a feedback stack. Only what deviates from expectation gets updated, and those deviations tend to be sparse. This sidesteps the all-to-all communication pattern that makes attention expensive.
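The delta-propagation idea can be sketched as follows. This is an illustrative toy, not the actual architecture: `FeedbackLayer`, the thresholding rule, and the state update are all assumptions chosen to show the sparsity mechanism, namely that a layer maintains a running expectation and only passes upward the units that surprised it.

```python
import numpy as np

class FeedbackLayer:
    """Toy sketch (hypothetical): propagate only above-threshold deltas."""

    def __init__(self, dim: int, threshold: float = 0.1, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((dim, dim)) * 0.01
        self.state = np.zeros(dim)           # running prediction of the input
        self.threshold = threshold

    def step(self, x: np.ndarray) -> np.ndarray:
        delta = x - self.state               # deviation from expectation
        mask = np.abs(delta) > self.threshold
        sparse = np.where(mask, delta, 0.0)  # only surprising units fire
        self.state += sparse                 # fold the surprise into the state
        return self.w @ sparse               # pass the sparse delta upward

layer = FeedbackLayer(dim=8)
x = np.zeros(8); x[2] = 1.0                  # one surprising input unit
layer.step(x)                                # first pass: unit 2 propagates
out = layer.step(x)                          # same input again: no surprise
assert np.allclose(out, 0.0)                 # nothing propagates the 2nd time
```

The cost of a step scales with the number of surprising units, not the sequence length, which is where the claimed savings over all-to-all attention would come from.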
Training Stack
The pipeline is built without standard frameworks. It has its own forward pass, gradient logic, and parameter updates, compiled directly to GPU code. No PyTorch or JAX in the hot path.
I keep a reference implementation in another language for rapid prototyping, but the production training stack handles the heavy lifting.
Where It Stands
Research-scale, not production-scale. The focus is on proving the architecture can learn stably and that the pipeline is numerically sound. Early results on sequential prediction are promising. The model is not competitive on creative writing against GPT-4, but it holds context across long sequences without degradation.
This is a multi-year research project. I am not chasing benchmarks — I am testing whether the architecture scales.