Naive Bayes and the MNIST Database

The MNIST database consists of handwritten digits stored as \(28 \times 28\) grayscale images. In this post, we are going to use the database to train a naive Bayesian classifier.

The Model

Our model has the following random variables:

- \(c \in \{0,\dots,9\}\), the class (digit) of the image;
- \(x = (x_1,\dots,x_{784})\), the binarized pixels of the image, each \(x_i \in \{0,1\}\);
- \(\theta = (\theta_{ic})\), where \(\theta_{ic}\) is the probability that pixel \(i\) is on in an image of class \(c\);
- \(\pi = (\pi_c)\), the prior probabilities of the classes.

We are using a naive Bayesian model. Therefore, the pixels are conditionally independent given the class, with \(p(c \mid \theta,\pi) = \pi_c\) and \[p(x \mid c,\theta,\pi) = \prod_{i} p(x_i \mid c,\theta), \quad p(x_i \mid c,\theta) = \theta_{ic}^{\,x_i} (1 - \theta_{ic})^{1 - x_i}.\]

The joint distribution is given by

\[p(c,x,\theta,\pi) = p(c,x \mid \theta,\pi)\, p(\theta,\pi) = p(x \mid c,\theta,\pi)\, p(c \mid \theta,\pi)\, p(\theta,\pi) = p(\theta,\pi)\, \pi_c \prod_{i} p(x_i \mid c,\theta)\]

The MNIST training set is \(\mathcal{D} = \left\{ c^{(n)},x^{(n)} \right\}_{n=1,\dots,N}\), and we are interested in computing the distribution \(p(c \mid x, \mathcal{D})\). The joint posterior distribution is \[p(c,x,\theta,\pi \mid \mathcal{D}) = p(c,x \mid \theta,\pi,\mathcal{D})\, p(\theta,\pi \mid \mathcal{D}) = p(c,x \mid \theta,\pi)\, p(\theta,\pi \mid \mathcal{D}),\] where the second equality holds because a new pair \((c,x)\) is independent of \(\mathcal{D}\) given the parameters. We use the uniform prior distribution \[p(\theta,\pi) = {\rm Dirichlet}(\pi; 1) \prod_{i,c} {\rm Beta}(\theta_{ic}; 1, 1).\] By conjugacy, the posterior distribution is \[p(\theta,\pi \mid \mathcal{D}) = {\rm Dirichlet}(\pi; N_c + 1) \prod_{i,c} {\rm Beta}(\theta_{ic}; N_{ic} + 1, N_c - N_{ic} + 1),\] where \(N_c = \#\left\{ n : c^{(n)} = c \right\}\) is the number of training images of class \(c\) and \(N_{ic} = \#\left\{ n : c^{(n)} = c,\ x_i^{(n)} = 1 \right\}\) is the number of those images in which pixel \(i\) is on.
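To make the counts concrete, here is a minimal NumPy sketch (the function and variable names are ours, not from the original code), assuming the binarized training images are stored in an array `X` of shape \((N, 784)\) with entries in \(\{0,1\}\) and the labels in an integer array `y`:

```python
import numpy as np

def counts(X, y, n_classes=10):
    """Sufficient statistics N_c and N_ic of the training set.

    X: (N, 784) array of binarized pixels in {0, 1}
    y: (N,) integer array of digit labels in {0, ..., 9}
    """
    # N_c[c] = number of training images with label c
    N_c = np.bincount(y, minlength=n_classes).astype(float)
    # N_ic[c, i] = number of images of class c in which pixel i is on
    N_ic = np.stack([X[y == c].sum(axis=0) for c in range(n_classes)])
    return N_c, N_ic
```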

Since \(\mathcal{D}\) is a large data set \((N = 60{,}000)\), we can approximate the posterior \(p(\theta,\pi \mid \mathcal{D})\) by a Dirac measure supported at its mean: \[p(\theta,\pi \mid \mathcal{D}) \approx \delta_{(\widehat{\theta},\widehat{\pi})}, \quad \widehat{\theta}_{ic} = \frac{N_{ic} + 1}{N_c + 2}, \quad \widehat{\pi}_{c} = \frac{N_c + 1}{N + 10},\] using that the mean of \({\rm Beta}(a,b)\) is \(a/(a+b)\) and that there are \(10\) classes. Therefore \[p(c,x \mid \mathcal{D}) = \int p(c,x \mid \theta,\pi)\, p(\theta,\pi \mid \mathcal{D})\, d\theta\, d\pi = p(c,x \mid \widehat{\theta},\widehat{\pi}),\] which implies that \[p(c \mid x, \mathcal{D}) \varpropto p(x \mid c,\widehat{\theta})\, \widehat{\pi}_c.\]
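Under the same assumptions as the sketch above, the posterior means follow directly from the counts:

```python
import numpy as np

def posterior_means(N_c, N_ic, N, n_classes=10):
    """Means of the Beta(N_ic + 1, N_c - N_ic + 1) and Dirichlet(N_c + 1) posteriors."""
    theta_hat = (N_ic + 1.0) / (N_c[:, None] + 2.0)  # shape (10, 784)
    pi_hat = (N_c + 1.0) / (N + n_classes)           # shape (10,)
    return theta_hat, pi_hat
```

The \(+1\) and \(+2\) terms are the Laplace smoothing induced by the uniform prior; they keep every estimated probability strictly between \(0\) and \(1\), which matters once we take logarithms at prediction time.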

The Implementation

The MNIST images are in grayscale. First, we binarize them to black and white and translate the raw byte files into CSV files using Haskell. We then implement the model above in Python. The trained naive Bayesian classifier correctly identifies \(85\%\) of the test set.
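For reference, here is a hedged sketch of the prediction step (again with illustrative names, not the original code); the scores are accumulated in log space, since a product of \(784\) Bernoulli factors underflows double precision:

```python
import numpy as np

def predict(X, theta_hat, pi_hat):
    """Return, for each row of X, the class c maximizing log pi_c + log p(x | c, theta)."""
    log_on = np.log(theta_hat)      # log p(x_i = 1 | c)
    log_off = np.log1p(-theta_hat)  # log p(x_i = 0 | c)
    # scores[n, c] = log pi_c + sum_i [x_i log theta_ic + (1 - x_i) log(1 - theta_ic)]
    scores = X @ log_on.T + (1 - X) @ log_off.T + np.log(pi_hat)
    return scores.argmax(axis=1)

# Hypothetical usage, assuming grayscale test images in X_test_gray and labels in y_test:
#   X_test = (X_test_gray >= 128).astype(float)  # binarize at mid-gray
#   accuracy = (predict(X_test, theta_hat, pi_hat) == y_test).mean()
```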