
Why Was Deep Learning a Surprise

This page was written by one or more anonymous contributors.

by gwern

I kept an eye on deep learning the entire time post-AlexNet, and was perturbed by how DL just kept on growing in capabilities and marching through fields, and in particular, how its strengths were in the areas that had always historically bedeviled AI the most and how they kept scaling as model sizes increased: impressive as models with millions of parameters were, people were already talking about training NNs with as many as a billion parameters. Crazy talk? One couldn't write it off so easily. Back in 2009 or so, I had spent a lot of time reading about Lisp machines and AI in the 1980s, going through old journals and news articles to improve the Wikipedia article on Lisp machines, and I was amazed by the Lisp machine OSes & software, so superior to Linux et al, but also doing a lot of eye-rolling at the expert systems and robots which passed for AI back then; in following deep learning, I was struck by how it was the reverse: GPUs were a nightmare to program for and the software ecosystem was almost actively malicious in sabotaging productivity, but the resulting AIs were uncannily good and excelled at perceptual tasks. Gradually, I became convinced that DL was here to stay, and offered a potential path to AGI: not that anyone was going to throw a 2016-style char-RNN at a million GPUs and get an AGI, of course, but that there was now a nontrivial possibility that further tweaks to DL-style architectures of simple differentiable units combined with DRL would keep on scaling to human-level capabilities across the board. (Have you noticed how no one uses the word "transhumanist" anymore? Because we're all transhumanists now.)
There was no equivalent of Rietveld et al 2013 for me, just years of reading arXiv and following the trends, reinforced by occasional case-studies like AlphaGo (let's take a moment to remember how amazing it was that between October and May, the state of computer Go went from 'perhaps an MCTS variant will defeat a pro in a few years, and then maybe the world champ in a decade or two' to 'untouchably superhuman'; technology and computers do not follow human timelines or scaling, and 9 GPUs can train a NN in a month).

An open question: why were I and everyone else wrong to ignore connectionism when things have played out much as Schmidhuber and Moravec 1998 and a few others predicted? Were we wrong, or just unlucky? What was, ex ante, the right way to think about this, even back in the 1990s or 1960s? I am usually pretty good at bullet-biting on graphs of trends, but I can't remember any performance graphs for connectionism; what graph should I have believed, or if it didn't exist, why not?

What went wrong? There is a Catch-22 here: with the right techniques, impressive proofs-of-concept could have been done quite a few years ago on existing supercomputers and successful prototypes would have justified the investment, without waiting for commodity gaming GPUs; but the techniques could not be found without running many failed prototypes on those supercomputers in the first place! Only once the prerequisites fell to such low costs that near-zero funding sufficed to go through those countless iterations of failure, could the right techniques be found, and justify the creation of the necessary datasets, and further justify scaling up. Hence, the sudden deep learning renaissance---had we known what we were doing from the start, we would have simply seen a gradual increase in capabilities from the 1980s.

The flip side of the bitter lesson is the sweet shortcut: as long as you have weak compute and small data, it's always easy for the researcher to smuggle in prior knowledge/bias to gain greater performance. That this will be disproportionately true of the architectures which scale the worst will be invisible, because it is impossible to scale any superior approach at that time. Appealing to future compute and speculating about how "brain-equivalent computing power" arriving by 2010 or 2030 will enable AI sounds more like wishful thinking than good science. A connectionist might scoff at this skepticism, but they have no compelling arguments: the human brain may be an existence proof, but most connectionist work is a caricature of the baroque complexity of neurobiology, and besides, planes do not flap their wings nor do submarines swim. How would they prove any of this? They can't, until it's too late and everyone has retired.

Thus, there is an epistemic trap. The very fact that connectionism is so general and scales to the best possible solutions means that it performs the worst early on in R&D and compute trends, and is outcompeted by its smaller (but more limited) competitors; because of this competition, it is starved of research, further ensuring that it looks useless; with a track record of being useless, the steadily decreasing required investments don't make any difference because no one takes any projections seriously; until finally, a hardware overhang accumulates to the point that it is doomed to success, when 1 GPU is enough to iterate and set SOTAs, breaking the equilibrium by providing undeniable hard results.
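
The shape of the trap can be made concrete with a toy model. This is only a sketch under made-up assumptions, not a fit to any real benchmark: the function names and every constant are hypothetical, chosen so that a prior-heavy method starts ahead but saturates, while a general method starts near zero and keeps improving with log-compute.

```python
import math

# Toy illustration only: all constants below are hypothetical, not measurements.

def handcrafted_score(compute: float) -> float:
    """Prior-heavy method: strong head start, but saturates at a low ceiling."""
    ceiling, head_start, rate = 0.60, 0.40, 0.5
    return ceiling - (ceiling - head_start) * math.exp(-rate * math.log10(compute))

def general_score(compute: float) -> float:
    """General method: starts at zero, climbs slowly toward a high ceiling."""
    ceiling, rate = 0.95, 0.12
    return ceiling * (1.0 - math.exp(-rate * math.log10(compute)))

def crossover_exponent(max_exponent: int = 30):
    """Smallest power of ten of compute at which the general method wins, or None."""
    for exponent in range(max_exponent + 1):
        compute = 10.0 ** exponent
        if general_score(compute) > handcrafted_score(compute):
            return exponent
    return None

if __name__ == "__main__":
    for exponent in range(0, 13, 3):
        c = 10.0 ** exponent
        print(f"compute=1e{exponent:<2d} handcrafted={handcrafted_score(c):.3f} "
              f"general={general_score(c):.3f}")
    print("general method first wins at compute = 1e%s" % crossover_exponent())
```

With these invented constants, the general method loses at every budget below a billion units of compute and wins at every budget above it. Anyone benchmarking below the crossover would correctly observe the hand-crafted method winning, and nothing in those observations distinguishes an approach that will eventually scale from one that never will.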

This trap is intrinsic to the approach. There is no alternate history where connectionism somehow wins the day in the 1970s and all this DL progress happens decades ahead of schedule. If Minsky hadn't pointed out the problems with perceptrons, someone else would have; if someone had imported convolutions in the 1970s rather than LeCun in 1990, it would have sped things up only a little; if backpropagation had been introduced decades earlier, as early as imaginable, perhaps in the 1950s with the development of dynamic programming, that too would have made little difference because there would be little one could backprop over (and residual networks were introduced in the 1980s decades before they were reinvented in 2015, to no effect); and so on. The history of connectionism is not one of being limited by ideas---everyone has tons of ideas, great ideas, just ask Schmidhuber for a basket as a party favor!---but one of results; somewhat like behavioral & population genetics, all of these great ideas fell through a portal from the future, dropping in on savages lacking the prerequisites to sort rubbish from revolution. The compute was not available, and humans just aren't smart enough to either invent everything required without painful trial-and-error or prove beyond a doubt their efficacy without needing to run them.

[draft from ]