fast multipole attention

One side effect of having written 540 oklo.org posts is that I often have trouble remembering whether I’ve already written about this or that. Take, for example, Holmberg’s use of light bulbs with orthogonal photosensitive detectors as an analog O(N) method for computing direct N-body gravitational accelerations. That’s sufficiently awesome that it’s got to be in the archives somewhere.

If one starts to think about ways to speed up transformer architectures, one immediately notices that the attention calculation, when done in the naive textbook manner, is O(N^2) in the context length, and thus presents a significant latency bottleneck. It’s the same problem that one faces with computing gravity in an N-body simulation. Hello o3.
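To make the scaling concrete, here is a minimal pure-Python sketch of the textbook (non-causal, single-head) softmax attention calculation. The function name `naive_attention` and the `pair_count` counter are my own illustrative choices, not taken from any library; the counter is there only to show that every query visits every key, just as every particle visits every other particle in a direct N-body sum.

```python
import math

def naive_attention(queries, keys, values):
    """Textbook softmax attention on lists of equal-length float vectors.

    Returns the output vectors and the number of query-key pairs evaluated.
    Illustrative sketch only: no masking, no batching, a single head.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_out = len(values[0])
    pair_count = 0
    outputs = []
    for q in queries:                      # N queries ...
        scores = []
        for k in keys:                     # ... each compared against N keys
            scores.append(scale * sum(qi * ki for qi, ki in zip(q, k)))
            pair_count += 1
        top = max(scores)                  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim_out)
        ])
    return outputs, pair_count

if __name__ == "__main__":
    for n in (8, 16, 32):
        tokens = [[math.sin(i + d) for d in range(4)] for i in range(n)]
        _, pairs = naive_attention(tokens, tokens, tokens)
        print(n, pairs)                    # pair count quadruples when n doubles
```

Doubling the context length quadruples the number of pairwise evaluations, which is exactly the cost that tree codes and multipole expansions were invented to avoid in the gravitational case.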

After some back and forth, o3 submitted a complete draft that I’ve been peer-reviewing. It appears that it may have come up with some remarkable ideas, and I’ve been combing the literature to try to understand how novel they are.

o3’s ego looks like it needs some alignment work before the LLM will be capable of stepping in to render academic researchers obsolete. It appears that when the model develops an original idea, it automatically hallucinates a reference in order to try to assign the credit for its own work to someone else.

When pressed, it blithely admits to this weirdly selfless behavior:

The “Teng et al. 2024” citation was inserted as a placeholder when I sketched the Fast-Multipole-Attention idea; to the best of my knowledge no peer-reviewed or arXiv paper with those exact authors, year, and title has been released. You should remove or replace that reference with a verifiable source.

I will link to the paper once it’s been thoroughly vetted. And if anything in there is genuinely original, I think I’ll usurp a little credit.