Statistical Learning Theory


  • Learning with a finite dictionary

Recall from the end of last lecture our setup: We are working with a finite dictionary
$H = \{h_1, \ldots, h_M\}$ of estimators, and we would like to understand the scaling of this
problem with respect to $M$ and the sample size $n$. Given $H$, one idea is simply to
minimize the empirical risk computed from the samples, and so we define the empirical risk
minimizer, $\hat h^{\mathrm{erm}}$, by

$$\hat h^{\mathrm{erm}} \in \operatorname*{argmin}_{h \in H} \hat{R}_n(h).$$
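To make the definition concrete, here is a minimal sketch of ERM over a finite dictionary. The setup is an illustration, not from the notes: threshold classifiers $h_t(x) = \mathbf{1}\{x > t\}$, data with $X \sim \mathrm{Unif}[0,1]$ and noisy threshold labels, and the 0-1 loss, so that $\hat{R}_n(h)$ is the fraction of mistakes on the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): X ~ Uniform[0, 1], and
# Y = 1{X > 0.4} with each label flipped independently with probability 0.1.
n = 200
X = rng.uniform(0.0, 1.0, size=n)
Y = ((X > 0.4) ^ (rng.random(n) < 0.1)).astype(int)

# Finite dictionary H = {h_1, ..., h_M} of threshold classifiers h_t(x) = 1{x > t}.
thresholds = np.linspace(0.0, 1.0, 21)  # M = 21
H = [lambda x, t=t: (x > t).astype(int) for t in thresholds]

def empirical_risk(h, X, Y):
    """R̂_n(h): average 0-1 loss of h on the n samples."""
    return float(np.mean(h(X) != Y))

# ĥ_erm ∈ argmin_{h in H} R̂_n(h); np.argmin breaks ties by the first minimizer.
risks = np.array([empirical_risk(h, X, Y) for h in H])
best = int(np.argmin(risks))
print(f"ERM picks threshold {thresholds[best]:.2f} with empirical risk {risks[best]:.3f}")
```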
In what follows, we will simply write $\hat h$ instead of $\hat h^{\mathrm{erm}}$ when
possible. Also recall the definition of the oracle, $\bar h$, which (somehow) minimizes the
true risk and is defined by

$$\bar h \in \operatorname*{argmin}_{h \in H} R(h).$$
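The oracle, in contrast, needs the true risk, which depends on the unknown data distribution; that is what the "(somehow)" is flagging. In the toy model above we control the distribution, so we can approximate the true risk (here $R(h) = \mathbf{P}(h(X) \neq Y)$, the 0-1-loss case) by Monte Carlo with a large fresh sample:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n, rng):
    """Draw n points from the same toy distribution as above."""
    X = rng.uniform(0.0, 1.0, size=n)
    Y = ((X > 0.4) ^ (rng.random(n) < 0.1)).astype(int)
    return X, Y

thresholds = np.linspace(0.0, 1.0, 21)
H = [lambda x, t=t: (x > t).astype(int) for t in thresholds]

# Approximate the true risk R(h) = P(h(X) != Y) of every h in H with a
# large fresh sample; the oracle h̄ minimizes this (unobservable) quantity.
X_big, Y_big = sample(1_000_000, rng)
true_risks = np.array([np.mean(h(X_big) != Y_big) for h in H])
oracle = int(np.argmin(true_risks))
print(f"oracle threshold {thresholds[oracle]:.2f}, true risk ~ {true_risks[oracle]:.3f}")
```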

The following theorem shows that, although $\hat h$ cannot hope to do better than $\bar h$ in
general, the difference between their risks should not be too large as long as the sample
size $n$ is not too small compared to $M$.
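Before stating the theorem, its message can be checked empirically in the same toy model: average the excess risk $R(\hat h) - R(\bar h)$ over many repetitions and watch it shrink as $n$ grows. This is only a simulation sketch under the assumptions above, not the result itself.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n, rng):
    X = rng.uniform(0.0, 1.0, size=n)
    Y = ((X > 0.4) ^ (rng.random(n) < 0.1)).astype(int)
    return X, Y

thresholds = np.linspace(0.0, 1.0, 21)  # M = 21
H = [lambda x, t=t: (x > t).astype(int) for t in thresholds]

# True risks of all h in H, approximated once with a large sample;
# the oracle risk R(h̄) is their minimum.
X_big, Y_big = sample(1_000_000, rng)
true_risks = np.array([np.mean(h(X_big) != Y_big) for h in H])
R_oracle = true_risks.min()

# Mean excess risk R(ĥ) - R(h̄) over 200 repetitions, for growing n.
for n in (10, 100, 1000, 10000):
    excess = []
    for _ in range(200):
        X, Y = sample(n, rng)
        emp = [np.mean(h(X) != Y) for h in H]
        excess.append(true_risks[int(np.argmin(emp))] - R_oracle)
    print(f"n = {n:>5}: mean excess risk ~ {np.mean(excess):.4f}")
```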
