I heard recently that people were frequently misunderstanding the results in this paper, at least until they realized the constraint is not particularly meaningful. The result says all algorithms have the same expected performance when averaged across all cost functions – which is to say, it *appears* to preclude the possibility that there is any completely general search algorithm that's any better than the stupidest one you could imagine.

[A warning here: I'm making this argument informally, but the general idea seems solid, and I wrote most of this before reading the proof in the above paper – because why spoil my fun?]

The paper's formalization of a search algorithm is basically this: using the history of previously sampled points, the algorithm chooses the next point to sample, trying to minimize a cost function.
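That formalization can be sketched in a few lines. Everything here – the function names and the deliberately stupid strategy – is my own illustration, not the paper's notation; the point is just that a "search algorithm" is any rule mapping the history of (point, cost) pairs to the next sample.

```python
import random

def search(cost, candidates, num_samples):
    """Generic black-box search: each step picks the next point to sample
    based only on the history of (point, cost) pairs seen so far. The
    strategy here is deliberately trivial (uniform random sampling, which
    ignores the history) -- exactly the kind of baseline the theorem says
    no algorithm can beat on average over all cost functions."""
    history = []
    for _ in range(num_samples):
        # A smarter algorithm would inspect `history` here; random search doesn't.
        x = random.choice(candidates)
        history.append((x, cost(x)))
    # Report the best point sampled so far.
    return min(history, key=lambda pair: pair[1])

# Hypothetical example: minimize f(x) = (x - 3)^2 over a small grid.
best_x, best_cost = search(lambda x: (x - 3) ** 2, list(range(-10, 11)), 20)
```

A hill-climber or Bayesian optimizer would slot into the same interface; only the rule for choosing `x` from `history` changes.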

For simplicity's sake, let's imagine a one-dimensional search (minimizing a function f : **R** -> **R**). In this case, it's relatively clear that our *estimated* cost function can only be pinned down by the d results we get from our d samples, and that if the cost function is (without loss of generality) a polynomial (possibly an infinite series, constructed via Fourier transform), we would be unable to distinguish between the infinitely many degree-(d + 1) cost functions consistent with those samples, nor any of the degree-(d + 2) ones, and so on. Since the polynomials of degree N or less form a measure-0 subset, essentially all cost functions will be indistinguishable after a finite search, and so of course, averaging over them, all algorithms are identical. Also without loss of generality, for any degree-d approximate cost function, we can construct an infinite number of degree-(d + 2) cost functions that add an additional, arbitrarily bad minimum – or any other outcome – placed on the first point of divergence between one algorithm's sampling and another's. So really, the sampling just means our best guesses will all lie in a measure-0 subset of the cost functions we could imagine fitting the data, and so of course *average* results are completely identical across algorithms.
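The construction behind that indistinguishability claim can be made concrete: given any polynomial p and the d points we sampled, adding a term that vanishes on every sample yields a cost function q that agrees with p on everything we've observed but can be made arbitrarily different everywhere else. The particular polynomial and sample points below are made up for illustration:

```python
def poly_eval(coeffs, x):
    """Evaluate a polynomial with coefficients [a0, a1, a2, ...] at x."""
    return sum(a * x ** i for i, a in enumerate(coeffs))

def indistinguishable_variant(p, samples, c):
    """Return a cost function q(x) = p(x) + c * (x - s1)(x - s2)...(x - sd),
    which agrees with p at every sampled point but diverges elsewhere."""
    def q(x):
        bump = c
        for s in samples:
            bump *= (x - s)
        return p(x) + bump
    return q

p = lambda x: poly_eval([1.0, -2.0, 0.5], x)   # p(x) = 1 - 2x + 0.5x^2
samples = [0.0, 1.0, 2.0]                      # d = 3 observed points
q = indistinguishable_variant(p, samples, c=1000.0)

# q matches p exactly on the samples...
assert all(abs(p(s) - q(s)) < 1e-9 for s in samples)
# ...but diverges wildly one step outside them: q(3) = p(3) + 1000 * 3 * 2 * 1.
assert abs(p(3.0) - q(3.0)) > 100.0
```

Since c is a free parameter, there are infinitely many such q, including ones whose true minimum is arbitrarily bad and sits exactly where our d samples never looked.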

Really, "over all cost functions" brings in arbitrarily complicated cost functions pretty casually. I think this argument (which seemed generally correct after reading just the formalization in the paper, but before I read their argument) is actually pretty similar to their proof. So I suppose I'm surprised people kept getting a confused picture of the result. It's plausible that either they or I didn't read the paper carefully enough, but I do feel as though I understand it. They're searching a discrete space, and I'm imagining it as a polynomial, but the arguments really do feel very similar, and the polynomial point is pretty easy to make (their argument boils down to a counting argument about the functions matching the d observed points – which really does feel like the same thing).

Or more essentially: your search can only ever pin down cost functions at most as complex as the set of samples it has looked at. But however much information you get, all remaining information is orthogonal to it, so how well you've *actually* approximated an arbitrary cost function is completely undetermined, even if you're missing only the smallest amount of data. [This is for essentially the same reason that n-bit strings must have an average compression length of n – simple encodings can only exist for a small subset of data, so we should expect simple solutions (i.e. results found after d rounds of search that *actually work*) to exist only for a small subset of problems, even in the discrete case. Fortunately for us in real life, the problems we care about are of bounded complexity, so if a neural net doesn't work well, we understand we can just make a bigger one and train it for longer.]
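The discrete counting version of this is easy to check directly: whatever d points an algorithm sampled, the number of complete cost functions consistent with its observations is exactly the same, so no sampling strategy narrows things down more than any other. The domain size, value count, and observed points below are illustrative, not from the paper:

```python
from itertools import product

# Over a domain of size n with k possible cost values, there are k ** n
# cost functions total. Any d observed (point, value) pairs -- no matter
# which points an algorithm chose to sample -- are extended by exactly
# k ** (n - d) of them: the unobserved points remain completely free.

n, k = 4, 2                    # domain {0, 1, 2, 3}, cost values {0, 1}
observations = {0: 1, 2: 0}    # d = 2 sampled points from some algorithm's run

consistent = [
    f for f in product(range(k), repeat=n)          # enumerate all k**n functions
    if all(f[x] == v for x, v in observations.items())
]

assert len(consistent) == k ** (n - len(observations))   # 2**2 = 4
```

Averaging any performance measure over those equally-numerous extensions is what washes out every algorithm's apparent advantage.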