Nonparametric Inference

Suppose that we have a data set \((x_i) \in \mathbb{R}^n\) produced from \(n\) independent repetitions of an experiment. The mathematical model for this situation is \(n\) independent identically distributed random variables \(X_1,\dots,X_n \sim F\) for some cumulative distribution function \(F\). What can we learn about \(F\)? We shall approach this problem from a frequentist perspective, so \(F\) is some fixed unknown cumulative distribution function. Everything we are going to say can be found in the wonderful book All of Nonparametric Statistics by Wasserman.

The Empirical Distribution Function

The most important random object in this setting is the empirical distribution function: \[\widehat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n \; {\bf 1}_{X_i \leq x}\] It is the step function which jumps up by \(1/n\) whenever \(x\) crosses one of the \(x_i\) in our data set. The importance of the empirical distribution function stems from the Glivenko-Cantelli theorem, which says that \[\sup_{x} \lvert \widehat{F}_n(x) - F(x) \rvert \; \xrightarrow{P} \; 0\] The \(P\) stands for convergence in probability. Intuitively, this says that if the sample size \(n\) is large, the probability that \(\widehat{F}_n\) and \(F\) are "far apart" is small. This intuition can be quantified as follows: define \[ L(x) = \widehat{F}_n(x) - \epsilon_n \qquad U(x) = \widehat{F}_n(x) + \epsilon_n \qquad \epsilon_n = \sqrt{\frac{1}{2n}\log \left( \frac{2}{\alpha} \right) }.\] Then the Dvoretzky-Kiefer-Wolfowitz inequality implies that \[ {\bf P}(L(x) \leq F(x) \leq U(x) \text{ for all $x$}) \geq 1 - \alpha. \] The theory is nice, but the best thing is to see it work:

import numpy
import scipy.stats
import matplotlib.pyplot as pyplot
%matplotlib inline

# The "true" distribution: normal with mean 163 and standard deviation 7.3
mu, sig = 163, 7.3
distribution = scipy.stats.norm(mu, sig)

# Draw 1000 independent samples
sample_size = 1000
bin_size = 10
num_bins = sample_size // bin_size
sample = distribution.rvs(sample_size)

# Cumulative histogram of the sample; with density=True the last bin equals 1,
# so this plots the empirical distribution function
pyplot.hist(sample, bins=num_bins, density=True, cumulative=True)
None

This Python code should be run inside a Jupyter notebook. It takes 1000 independent samples from a normal distribution with mean 163 and standard deviation 7.3. It then plots the empirical distribution function as a cumulative histogram. The resulting graph looks very similar to the cumulative distribution function for \({\rm N}(163,7.3)\).
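
We can also draw the DKW confidence band. Here is a minimal sketch, assuming it runs in the same notebook session, so that sample, distribution, mu, sig and sample_size from the previous cell are still defined; it overlays the band \([L(x), U(x)]\) for \(\alpha = 0.05\) on the true cumulative distribution function:

import numpy

# 95% DKW band: epsilon_n = sqrt(log(2 / alpha) / (2n))
alpha = 0.05
epsilon = numpy.sqrt(numpy.log(2 / alpha) / (2 * sample_size))

# Evaluate the empirical distribution function on a grid of points
grid = numpy.linspace(mu - 4 * sig, mu + 4 * sig, 400)
ecdf = numpy.array([numpy.mean(sample <= x) for x in grid])

# Clip the band to [0, 1] and plot it together with the true distribution function
pyplot.fill_between(grid, numpy.clip(ecdf - epsilon, 0, 1), numpy.clip(ecdf + epsilon, 0, 1), alpha=0.3)
pyplot.plot(grid, distribution.cdf(grid))
None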

Plug-In Estimators

Suppose that \(\theta = \int_{\mathbb{R}} a(x) \, dF(x)\) is some quantity we are interested in. The corresponding plug-in estimator is defined to be \[\widehat{\theta}_n = \int_{\mathbb{R}} a(x) \, d\widehat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n a(X_i)\] The weak law of large numbers tells us that \(\widehat{\theta}_n\) converges in probability to \(\theta\); we say that \(\widehat{\theta}_n\) is a consistent estimator. Intuitively, this says that if \(n\) is large, the probability of \(\widehat{\theta}_n\) and \(\theta\) being far apart is small. We can produce a confidence interval using the central limit theorem, which (provided \(a(X)\) has finite variance) says that \[\frac{\widehat{\theta}_n - \theta}{\sqrt{{\bf V}_F(\widehat{\theta}_n)}} \xrightarrow{D} N(0,1)\] The \(D\) stands for convergence in distribution. As long as \(n\) is large, \[{\bf P} \left(- \epsilon \leq \frac{\widehat{\theta}_n - \theta}{\sqrt{{\bf V}_F(\widehat{\theta}_n)}} \leq \epsilon \right) \thickapprox {\bf P}(-\epsilon \leq Z \leq \epsilon) \qquad Z \sim N(0,1)\] There is a problem: the value \({\bf V}_F(\widehat{\theta}_n)\) depends on the distribution \(F\), which is unknown to us.
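
To make this concrete, here is a small sketch in the same notebook as above. The choice \(a(x) = {\bf 1}_{x \leq 170}\) and the point \(170\) are arbitrary, picked only to have something to estimate; with this choice \(\theta = F(170)\) and the plug-in estimator is exactly \(\widehat{F}_n(170)\):

import numpy

# Plug-in estimate of theta = F(170), using a(x) = 1_{x <= 170}
a = lambda x: x <= 170
theta_hat = numpy.mean(a(sample))
theta_hat, distribution.cdf(170)  # compare with the true value F(170)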

The Bootstrap

The Bootstrap is a sneaky way to estimate the value \({\bf V}_F(\widehat{\theta}_n)\). Suppose that we have our sample \((x_i) \in \mathbb{R}^n\). The empirical distribution \(\widehat{F}_n\) puts mass \(1/n\) at each data point \(x_i\). The trick is to estimate \({\bf V}_{F}(\widehat{\theta}_n)\) with \({\bf V}_{\widehat{F}_n}(\widehat{\theta}_n)\), which we approximate as follows:

  1. Produce an \(h \times n\) array whose entries are drawn uniformly at random, with replacement, from \((x_i)\).
  2. Apply \(a\) to every entry, sum along each row and divide by \(n\).
  3. Compute the variance of the resulting vector in \(\mathbb{R}^h\).

If we make \(h\) very large, the weak law of large numbers tells us that, with high probability, the result is going to be very close to \({\bf V}_{\widehat{F}_n}(\widehat{\theta}_n)\).
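
Here is a minimal sketch of this recipe, again reusing the sample from the notebook above and the illustrative choice \(a(x) = {\bf 1}_{x \leq 170}\); the number of resamples \(h\) is an arbitrary choice:

import numpy

h = 10000                 # number of bootstrap resamples
n = len(sample)
a = lambda x: x <= 170

# 1. An h x n array drawn uniformly at random, with replacement, from the sample
resamples = numpy.random.choice(sample, size=(h, n), replace=True)

# 2. Apply a to every entry and average along each row
theta_stars = numpy.mean(a(resamples), axis=1)

# 3. The variance of the resulting vector estimates the variance of theta_hat_n
#    under the empirical distribution
numpy.var(theta_stars)

The square root of this number can then stand in for \(\sqrt{{\bf V}_F(\widehat{\theta}_n)}\) in the normal approximation from the previous section, giving an approximate confidence interval for \(\theta\).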