The Histogram

The simplest density estimator is the histogram. It is formed by first dividing the real line into intervals called bins. When the bins have equal width $h$, the histogram is a step function that estimates the density at a point $x$ by


\begin{displaymath}
\hat f(x) = \frac{1}{nh} (\mbox{no. of } x_i \mbox{ in the same
bin as }x)
\end{displaymath} (1)

where $n$ is the sample size. Constructing the histogram requires choosing both the origin (the position of the bin edges) and the bin width $h$. Both choices make a significant difference to the performance of the method. Figure 1 shows the effects of different start points and numbers of bins when plotting some of the data set ``Observations of eruptions of the Old Faithful geyser in Yellowstone National Park, USA'', from Weisberg (1980). The data consist of the eruption lengths (in minutes) of 107 eruptions of the Old Faithful geyser and are analysed using several existing KDE methods in Silverman (1986). For further examples see Scott (1992).
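As a rough illustration of equation (1), the following Python sketch evaluates the equal-width histogram estimate at a single point. The function name `histogram_density`, the convention that bins are half-open intervals $[\mbox{origin} + kh, \mbox{origin} + (k+1)h)$, and the toy sample are assumptions made for this example; the sample is not the Old Faithful data.

```python
import math


def histogram_density(x, data, origin, h):
    """Equal-width histogram estimate of the density at x (equation 1).

    Assumed convention: bins are [origin + k*h, origin + (k+1)*h)
    for integer k. `origin` and `h` are the two user choices.
    """
    n = len(data)
    # Index of the bin that contains x.
    k = math.floor((x - origin) / h)
    # Number of observations falling in the same bin as x.
    count = sum(1 for xi in data if math.floor((xi - origin) / h) == k)
    return count / (n * h)


# Hypothetical toy sample, used only to show the effect of the origin.
sample = [1.0, 1.5, 2.0, 2.5, 4.0]
est_origin_0 = histogram_density(1.2, sample, origin=0.0, h=1.0)
est_origin_half = histogram_density(1.2, sample, origin=0.5, h=1.0)
```

With the same bin width, moving the origin from 0 to 0.5 changes which observations share a bin with $x = 1.2$, so the two estimates differ even though the data are unchanged.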

Figure 1: Four different histograms of the Old Faithful data.
\begin{figure}
\centering
\psfig{figure=../../thesis/pics/histfig.ps,width=5.25in,angle=270}
\end{figure}

The histogram can be generalised by allowing the bin widths to vary (Scott and Terrell, 1987; Silverman, 1986), such that


\begin{displaymath}
\hat f(x) = \frac{1}{n} \frac{(\mbox{no. of } x_i \mbox{ in the
same bin as }x)}
{(\mbox{width of bin containing }x)}
\end{displaymath} (2)
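Equation (2) can be sketched in the same way. In this minimal Python example the bins are described by a sorted list of boundaries; the function name `variable_histogram_density`, the half-open bin convention, and the toy values are assumptions for illustration only.

```python
from bisect import bisect_right


def variable_histogram_density(x, data, edges):
    """Histogram estimate with unequal bin widths (equation 2).

    Assumed convention: `edges` is a sorted list of boundaries and
    bin j is [edges[j], edges[j + 1]). Points outside the outermost
    boundaries receive a density of zero.
    """
    n = len(data)
    # Locate the bin containing x.
    j = bisect_right(edges, x) - 1
    if j < 0 or j >= len(edges) - 1:
        return 0.0
    lo, hi = edges[j], edges[j + 1]
    # Count observations in that bin and divide by n times its width.
    count = sum(1 for xi in data if lo <= xi < hi)
    return count / (n * (hi - lo))


# Hypothetical toy sample and bin boundaries.
sample = [1.0, 1.5, 2.0, 2.5, 4.0]
edges = [0.0, 2.0, 3.0, 5.0]
estimates = [variable_histogram_density(x, sample, edges) for x in (1.2, 2.5, 4.0)]
```

Dividing each count by the width of its own bin keeps the total area under the step function equal to one whenever every observation lies inside the outermost boundaries.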

The bin width is often called the smoothing parameter, as it specifies the amount of smoothing applied to the data: a small value gives a more jagged appearance, a large value a smoother one. The histogram is an excellent tool for Exploratory Data Analysis (EDA) (Tukey, 1977); however, it is of limited use for the application considered here. The histogram has the unfortunate feature that it estimates all densities as step functions. As the densities that result from Bayesian analysis are usually continuous, and part of the interest in KDE is to obtain some estimate of their smoothness, a continuous estimate of the density function is desirable. In addition, the histogram's sensitivity to bin position and width means that, in order to obtain a satisfactory representation, multiple histograms, or even some form of averaged histogram (Scott, 1992), are needed.
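One simple way to reduce the dependence on bin position is to average several equal-width histograms whose origins are shifted by a fraction of the bin width. The Python sketch below does this for $m$ shifts of size $h/m$. It is only an illustration of the general idea and is not claimed to reproduce the exact formulation in Scott (1992); the function name `shifted_average_density` and the toy sample are assumptions.

```python
import math


def shifted_average_density(x, data, h, m):
    """Average of m equal-width histogram estimates at x.

    Illustrative assumption: the s-th histogram uses origin s * h / m,
    for s = 0, ..., m - 1, with bins [origin + k*h, origin + (k+1)*h).
    """
    n = len(data)
    total = 0.0
    for s in range(m):
        origin = s * h / m
        # Bin of the s-th shifted histogram that contains x.
        k = math.floor((x - origin) / h)
        count = sum(1 for xi in data if math.floor((xi - origin) / h) == k)
        total += count / (n * h)
    return total / m


# Hypothetical toy sample: averages the estimates for origins 0 and 0.5.
sample = [1.0, 1.5, 2.0, 2.5, 4.0]
averaged = shifted_average_density(1.2, sample, h=1.0, m=2)
```

The result is still a step function, but with steps of width $h/m$, so it only partly addresses the lack of smoothness noted above.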

danny 2009-07-23