efficiency

I know that a neural network is not a network of neurons, but for rough scaling and order-of-magnitude arguments, is it legitimate to equate a synaptic connection with a weight in a deep network such as a transformer?

That’s all the sign-off I need to get going.

In a post toward the end of last year, I used synapse-to-weight equivalence to argue that GPT-2 was accomplishing the rough computational equivalent of a fruit fly brain. Next, cue some requisite marveling at the gigantic difference in inference-time energy costs. Note, however, that scientific bloggers run the continual risk of excessive chewing over the same tired themes (IMC-1, anyone?). I won’t burden everyone with more moralizing.

What is clear, however, is that the massive AI-infrastructure build-out that dominates the news these days is making an implicit bet on the perpetuation of extreme computational inefficiency, energetic as well as (likely) algorithmic.

The human brain has roughly 100 trillion synaptic connections. Using the equivalence rule of thumb, that makes it somewhere on the order of 100x the size of GPT-4. Let’s consider the energy cost of pre-training. Training the brain uses 100 watts for 25 years; rounding up a bit, that’s a 1e+18 erg energy budget. GPT-4 was estimated to consume 50 GW-hours to train. That’s 2e+21 ergs, roughly 2,000x the energy, for 1% of the model complexity, so it’s a factor of roughly 200,000x less efficient. The immediate implication is that the current transformer architecture, and the current architectural paradigms, are nowhere near optimal. If NVDA is to retain its outsize valuation, it’s got to innovate radically, and probably quickly.
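For anyone who wants the arithmetic spelled out, here is a minimal back-of-the-envelope sketch in Python. The 100 W, 25-year, 50 GWh, and 100-trillion-synapse figures are the rough estimates from the paragraph above; the ~1e12 weight count for GPT-4 is an assumption, chosen only to reproduce the “roughly 100x” size ratio.

```python
# Back-of-the-envelope version of the comparison above. All inputs are the
# rough estimates quoted in the text; the GPT-4 weight count is an assumption.

SECONDS_PER_YEAR = 3.15e7
ERGS_PER_JOULE = 1e7

# "Pre-training" the brain: 100 watts sustained for 25 years.
brain_ergs = 100 * 25 * SECONDS_PER_YEAR * ERGS_PER_JOULE   # ~8e17 erg, round up to ~1e18

# Training GPT-4: ~50 GWh (estimate quoted in the text), converted to ergs.
gpt4_ergs = 50e9 * 3600 * ERGS_PER_JOULE                    # ~2e21 erg

# Model "complexity" under the synapse-per-weight rule of thumb.
brain_synapses = 1e14
gpt4_weights = 1e12        # assumed; consistent with the brain being ~100x larger

energy_ratio = gpt4_ergs / brain_ergs            # ~2,000x more training energy...
size_ratio = brain_synapses / gpt4_weights       # ...for ~1/100 the connections

print(f"energy ratio:   {energy_ratio:,.0f}x")
print(f"size ratio:     {size_ratio:,.0f}x")
print(f"efficiency gap: {energy_ratio * size_ratio:,.0f}x")   # ~2e5, i.e. "roughly 200,000x"
```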

My guess, however, is that somebody else, somebody not currently on the radar, will get there first.

large numbers

There’s a certain tendency for life’s possibilities and potential to unfold magically when you’re young, only to languish, delayed and unrealized, for years before ultimately sliding out of reach.

And opportunity extends not just to the things one might do, but also to the realm of ideas.

I distinctly recall a frigid winter morning, exactly forty years ago, sitting in the Loomis Laboratory Amphitheater as our Physics 108 professor introduced thermodynamics from the kinetic point of view. With that lecture came the realization that the Second Law is valid only in a statistical sense, and a flashbulb moment illuminated the vast, urgent array of possibility. Hah! I thought. It’s not really a law at all. You just have to wait. And wait. Wait long enough and all the molecules will be on one side of the room. Wait for the splattered slime and shells of a smashed egg to draw thermal energy from the carpet and reassemble, arcing up towards your outstretched waiting hand.

During that same month in 1985, I also read William Poundstone’s The Recursive Universe. Having no natural resistance, I was entirely blown away. Cellular Automata. The works of Shakespeare buried somewhere out there in the digits of Pi. Zen for Film. It was all completely new and amazing. There are exactly (and there are only!) 65,536^(44,100) distinct sounds, about 10^212407 of them, that last for precisely one second and which can be recorded using the (then-futuristic) audio CD format. Appealing proto-Nietzschean reductionist strictures crowd the mind without effort and in rapid succession. Creativity is merely the process of selection, etc., etc.
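A quick sanity check on that count, under the single-channel, 16-bit framing used above (a full stereo CD second would square the number):

```python
# One second of single-channel, 16-bit audio at the CD sampling rate of
# 44,100 samples per second: each of the 44,100 samples takes one of
# 2**16 = 65,536 values, so the count is 65,536**44,100 = 2**(16 * 44,100).

from math import log10

bits_per_sample = 16
samples_per_second = 44_100

exponent = bits_per_sample * samples_per_second * log10(2)
print(f"about 10^{exponent:.0f} distinct one-second sounds")   # ~10^212407
```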

But as with Rush’s 2112 or Ayn Rand’s The Fountainhead, one tends to grow out of that stuff (albeit, I admit, enriched). And indeed, after a forty-year gap, I’ve started listening to those Rush LPs again. The musicianship really is something else, you know.

By 1999, however, I was sufficiently cool, and sufficiently jaded, to trot out the “refutation” of the infinite monkeys theorem in our end-of-the-universe book. Like Jim Carroll extorting street cred from all those people who died, died, there was money to be made and low-rent Saganesque fame to be reaped in belittling the amazing consequences of truly huge numbers.

In the end, I think it just comes down to the fact that 10^n is useful for counting the number of things that will happen, while 10^n^n counts the number of things that could happen… Or so I hold forth smugly, on a blog almost nobody still reads, as the monkeys and the typewriters and the Dark Era have a go in both a peer-reviewed paper and last weekend’s edition of the New York Times.

We were country before country was cool!