Extension to bivariate density estimation

Outputs from the Grand Tour can be projections onto subspaces of any dimensionality, usually one, two, or three for convenience, as discussed in section [*]. Extending the BKDE discussed above to more than one dimension is both natural and straightforward. A KDE is still specified in terms of some function $K(\cdot\vert\mu, \theta)$, and an estimate of $f_X(\cdot)$ based on the sample ${\mbox{\boldmath$x$}}$, at the point $X={{\mbox{\boldmath$t$}}}$, is written as


\begin{displaymath}
{\hat f}({\mbox{\boldmath$t$}}; {\mbox{\boldmath$H$}})=\frac{1}{n}\sum_{i=1}^{n} K_{{\mbox{\boldmath$H$}}}({\mbox{\boldmath$t$}}-{\mbox{\boldmath$x$}}_i)
\end{displaymath} (27)

where

\begin{displaymath}
K_{{\mbox{\boldmath$H$}}}({\mbox{\boldmath$x$}})=\frac{1}{\sqrt{\det{\mbox{\boldmath$H$}}}}\,K\left({\mbox{\boldmath$H$}}^{-1/2}{\mbox{\boldmath$x$}}\right)
\end{displaymath} (28)

${\mbox{\boldmath$H$}}$ is a bandwidth matrix and $K$ is a bivariate kernel.
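As a concrete illustration, the estimate of (27) with the scaled kernel of (28) can be sketched as follows, taking $K$ to be the bivariate standard Normal density (consistent with the Normal kernels in the figure below); the function name is illustrative, not part of the text:

```python
import numpy as np

def kde_bivariate(t, data, H):
    """Evaluate the bivariate KDE at point t, with bandwidth matrix H.

    Implements f_hat(t; H) = (1/n) * sum_i K_H(t - x_i), where
    K_H(x) = det(H)^{-1/2} K(H^{-1/2} x) and K is the bivariate
    standard Normal density.
    """
    t = np.asarray(t, dtype=float)
    data = np.asarray(data, dtype=float)
    H = np.asarray(H, dtype=float)
    # Symmetric inverse square root of H via its eigendecomposition.
    vals, vecs = np.linalg.eigh(H)
    H_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    # Whitened differences H^{-1/2} (t - x_i), one row per sample point.
    u = (t - data) @ H_inv_sqrt.T
    # Standard bivariate Normal kernel evaluated at each whitened difference.
    k = np.exp(-0.5 * np.sum(u * u, axis=1)) / (2.0 * np.pi)
    return k.sum() / (len(data) * np.sqrt(np.linalg.det(H)))
```

Because the kernel integrates to one and the $\det{\mbox{\boldmath$H$}}^{-1/2}$ factor compensates for the whitening, the resulting estimate is itself a density.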

The estimate of $f_X(\cdot)$ based on the sample ${\mbox{\boldmath$x$}}$, evaluated at the point ${\mbox{\boldmath$X$}}={\mbox{\boldmath$t$}}$, is again seen to be of the form


\begin{displaymath}
\hat f_X({\mbox{\boldmath$t$}}\vert{\mbox{\boldmath$x$}}, {\mbox{\boldmath$\theta$}}) = \frac{1}{n}\sum_{i=1}^{n} K({\mbox{\boldmath$t$}}\vert{\mbox{\boldmath$x$}}_i, {\mbox{\boldmath$\theta$}})
\end{displaymath} (29)

where $n$ is the number of data items. Note that ${\mbox{\boldmath$\theta$}}$ may be considered a generalisation of the bandwidth. Taking ${\mbox{\boldmath$\theta$}} = h$ with univariate data recovers the standard KDE of (1); taking ${\mbox{\boldmath$\theta$}} ={\mbox{\boldmath$H$}}$, however, gives a higher-dimensional estimate. Here interest is in bivariate data, so ${\mbox{\boldmath$H$}}$ is a $2\times 2$ matrix.

There are three levels of complexity for ${\mbox{\boldmath$H$}}$: if ${\mbox{\boldmath$H$}} \in \cal F$, the class of all symmetric, positive definite $2\times 2$ matrices, there are 3 bandwidth parameters to choose; if ${\mbox{\boldmath$H$}} \in \cal D$, the subclass of all diagonal, positive definite $2\times 2$ matrices, there are 2 bandwidth parameters to choose; and finally, if ${\mbox{\boldmath$H$}} \in \cal S$, where ${\cal S} = \{ h^2
{\mbox{\boldmath$I$}} : h > 0 \}$, there is only 1 bandwidth parameter to choose.
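The three parameterisations can be written down directly; the numerical values below are arbitrary, chosen only to show how many free parameters each class has:

```python
import numpy as np

# Illustrative (arbitrary) bandwidth values for each class of H.
h = 0.7                      # single smoothing parameter (class S)
h1, h2 = 0.7, 1.2            # per-coordinate bandwidths (class D)
h12 = 0.4                    # off-diagonal orientation term (class F)

H_S = h ** 2 * np.eye(2)                          # class S: 1 parameter
H_D = np.diag([h1 ** 2, h2 ** 2])                 # class D: 2 parameters
H_F = np.array([[h1 ** 2, h12], [h12, h2 ** 2]])  # class F: 3 parameters

# Each must be symmetric positive definite to be a valid bandwidth matrix.
for H in (H_S, H_D, H_F):
    assert np.allclose(H, H.T)
    assert np.all(np.linalg.eigvalsh(H) > 0)
```

Class $\cal S$ constrains the kernel to circular contours, $\cal D$ allows axis-aligned ellipses, and $\cal F$ allows ellipses of arbitrary orientation, matching the figure below.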

Figure: Contour plots of Normal kernels parameterised by (a) ${\mbox{\boldmath $H$}} \in \cal S$, (b) ${\mbox{\boldmath $H$}} \in \cal D$, and (c) ${\mbox{\boldmath $H$}} \in
\cal F$.
\begin{figure}
\centering
\subfigure[]
{\psfig{figure=/home/danny/thesis/pic...
...re=/home/danny/thesis/pics/kshapec.ps,width=1.5in,height=1.5in}}
\end{figure}

In practice, a compromise is required between the effort needed to estimate the bandwidth and the time taken to compute the density estimate. Fukunaga (1972, p. 175) suggests a simple way of obtaining a bandwidth matrix of arbitrary orientation (see Silverman, 1986, p. 78): take ${\mbox{\boldmath$H$}}$ to be of the form


\begin{displaymath}
{\mbox{\boldmath$H$}} = h^2 {\mbox{\boldmath$S$}}
\end{displaymath} (30)

where ${\mbox{\boldmath$S$}}$ is the sample covariance matrix. This approach is equivalent to sphering the data (i.e. linearly transforming it to have the identity covariance matrix).

This gives an estimate of the form


\begin{displaymath}
{\hat f}({\mbox{\boldmath$x$}}) = \frac{1}{nh^2\sqrt{\det {\mbox{\boldmath$S$}}}}\sum_{i=1}^{n} K\left( \frac{({\mbox{\boldmath$x$}} - {\mbox{\boldmath$x$}}_i)^T{\mbox{\boldmath$S$}}^{-1}({\mbox{\boldmath$x$}} - {\mbox{\boldmath$x$}}_i)}{h^2} \right)
\end{displaymath} (31)
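A minimal sketch of Fukunaga's approach, assuming a bivariate Normal kernel; the second function computes the same estimate by explicitly sphering the data first, which illustrates the claimed equivalence (function names are illustrative):

```python
import numpy as np

def fukunaga_kde(x, data, h):
    """KDE with bandwidth matrix H = h^2 S, where S is the sample
    covariance matrix of the data, using a bivariate Normal kernel."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    n = len(data)
    S = np.cov(data, rowvar=False)
    S_inv = np.linalg.inv(S)
    d = x - data                                    # (n, 2) differences
    # Quadratic form (x - x_i)^T S^{-1} (x - x_i) / h^2 for each i.
    q = np.einsum('ij,jk,ik->i', d, S_inv, d) / h ** 2
    k = np.exp(-0.5 * q) / (2.0 * np.pi)
    return k.sum() / (n * h ** 2 * np.sqrt(np.linalg.det(S)))

def sphered_kde(x, data, h):
    """Equivalent route: sphere the data, apply an isotropic KDE with
    bandwidth h, then correct by the Jacobian of the sphering map."""
    data = np.asarray(data, dtype=float)
    S = np.cov(data, rowvar=False)
    vals, vecs = np.linalg.eigh(S)
    S_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    z = data @ S_inv_sqrt.T                         # sphered sample
    t = np.asarray(x, dtype=float) @ S_inv_sqrt.T   # sphered eval point
    d = t - z
    k = np.exp(-0.5 * np.sum(d * d, axis=1) / h ** 2) / (2.0 * np.pi)
    return k.sum() / (len(data) * h ** 2 * np.sqrt(np.linalg.det(S)))
```

The two routes agree because $\Vert{\mbox{\boldmath$S$}}^{-1/2}({\mbox{\boldmath$x$}}-{\mbox{\boldmath$x$}}_i)\Vert^2 = ({\mbox{\boldmath$x$}}-{\mbox{\boldmath$x$}}_i)^T{\mbox{\boldmath$S$}}^{-1}({\mbox{\boldmath$x$}}-{\mbox{\boldmath$x$}}_i)$, so only one scalar bandwidth $h$ remains to be chosen.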

It can be shown (Wand and Jones, 1995, p. 106) that, for the multivariate $N({\mbox{\boldmath$\mu$}}, {\mbox{\boldmath$\Sigma$}})$ distribution, the Asymptotic Mean Integrated Squared Error (AMISE) optimal ${\mbox{\boldmath$H$}}$ satisfies


\begin{displaymath}
{\mbox{\boldmath$H$}}_{AMISE} = c\Sigma
\end{displaymath} (32)

for a scalar constant $c$. This implies that, for the multivariate Normal, sphering is appropriate. There is, unfortunately, no equivalent result for the estimation of arbitrary density shapes. This is the approach taken for the version of the bivariate BKDE incorporated into the Grand Tour. Taking $\hat f_X(\cdot)$ to be a model for the data, a likelihood function is constructed as before.


\begin{displaymath}
\ell({\mbox{\boldmath$x$}}; {\mbox{\boldmath$\theta$}}) = \prod_{i=1}^{n} \hat f_X({\mbox{\boldmath$x$}}_i \vert{\mbox{\boldmath$x$}}_{(i-1)}, {\mbox{\boldmath$\theta$}}).
\end{displaymath} (33)

The choice of prior for ${\mbox{\boldmath$\theta$}}$ again expresses belief about the smoothness of the underlying density, and the strength of that belief. This gives the posterior density


\begin{displaymath}
p({\mbox{\boldmath$\theta$}} \vert {\mbox{\boldmath$x$}})=\frac{\ell({\mbox{\boldmath$x$}}; {\mbox{\boldmath$\theta$}})\,p({\mbox{\boldmath$\theta$}})}{p({\mbox{\boldmath$x$}})}
\end{displaymath} (34)

and the predictive density


\begin{displaymath}
\hat{f}({\mbox{\boldmath$y$}}\vert{\mbox{\boldmath$x$}}) = \int \hat f({\mbox{\boldmath$y$}}\vert{\mbox{\boldmath$x$}}, {\mbox{\boldmath$\theta$}})\,p({\mbox{\boldmath$\theta$}} \vert {\mbox{\boldmath$x$}})\,d{\mbox{\boldmath$\theta$}}.
\end{displaymath} (35)
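The scheme of (33)-(35) can be sketched numerically for the sphered case, where ${\mbox{\boldmath$\theta$}}$ reduces to the scalar $h$. This sketch makes several assumptions not prescribed by the text: it reads ${\mbox{\boldmath$x$}}_{(i-1)}$ as the first $i-1$ observations, replaces the integral over ${\mbox{\boldmath$\theta$}}$ with a finite grid of $h$ values, and uses an arbitrary vague prior:

```python
import numpy as np

def normal_kde(t, data, h):
    """Isotropic bivariate Normal KDE with bandwidth h (data assumed sphered)."""
    d = np.asarray(t, dtype=float) - np.asarray(data, dtype=float)
    k = np.exp(-0.5 * np.sum(d * d, axis=1) / h ** 2) / (2.0 * np.pi * h ** 2)
    return k.mean()

def log_prequential_likelihood(data, h, m0=2):
    """One reading of (33): sum of log predictive densities
    log f(x_i | x_1..x_{i-1}, h), starting once m0 points are available."""
    data = np.asarray(data, dtype=float)
    return sum(np.log(normal_kde(data[i], data[:i], h))
               for i in range(m0, len(data)))

def predictive_density(y, data, h_grid, log_prior):
    """(34)-(35) on a grid: posterior weights over h, then the predictive
    density as a posterior-weighted mixture of KDEs."""
    log_post = np.array([log_prequential_likelihood(data, h) for h in h_grid])
    log_post = log_post + log_prior(h_grid)
    w = np.exp(log_post - log_post.max())
    w /= w.sum()                                # normalised posterior weights
    return sum(wi * normal_kde(y, data, h) for wi, h in zip(w, h_grid))
```

A vague prior such as $p(h) \propto 1/h$ (i.e. `log_prior = lambda h: -np.log(h)`) lets the prequential likelihood dominate the choice of smoothing; with a single grid point the predictive density collapses to the ordinary KDE at that bandwidth.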

danny 2009-07-23