A Bayesian Model

A KDE is specified in terms of a function $K(\cdot\vert\mu, \boldsymbol{\theta})$ where $\mu$ is a location parameter and $\boldsymbol{\theta}$, which may more generally be a vector, is a scale parameter. An example of $K(\cdot\vert\mu, \boldsymbol{\theta})$ is a Normal pdf with location $\mu = x_i$ and scale $\boldsymbol{\theta} = \sigma$. The estimate of $f_X(\cdot)$ based on the sample $\boldsymbol{x}$, at the point $X = t$, is

\begin{displaymath}
\hat f_X(t\vert\boldsymbol{x}, \boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^n K(t\vert x_i, \boldsymbol{\theta})
\end{displaymath} (2)
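As a concrete illustration, the estimator (2) with a Normal kernel and a scalar bandwidth $h$ can be sketched in NumPy as follows (a minimal sketch; the function name `kde` is an illustrative choice, not from the text):

```python
import numpy as np

def kde(t, x, h):
    """Evaluate the KDE (2) at the points t, using a Normal kernel
    located at each sample value x_i with common bandwidth h."""
    t = np.asarray(t, dtype=float)[:, None]   # evaluation points, as a column
    x = np.asarray(x, dtype=float)[None, :]   # sample, as a row
    z = (t - x) / h
    # mean of n Normal densities N(x_i, h^2) evaluated at each t
    return np.exp(-0.5 * z**2).sum(axis=1) / (x.size * h * np.sqrt(2.0 * np.pi))
```

Because each kernel is itself a density, the returned estimate is non-negative and integrates to one, as noted below.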

where $n$ is the number of data items. Note that $\boldsymbol{\theta}$ may be considered to be a generalisation of the bandwidth. Taking $\boldsymbol{\theta} = h$ leads to the standard KDE of (1).

The form $\hat f_X(t\vert\boldsymbol{x}, \boldsymbol{\theta})$ exposes clearly the dependence of the estimator on both the sample to hand, $\boldsymbol{x}$, and the parameter vector $\boldsymbol{\theta}$. The estimate is itself a probability density whenever the kernel $K(\cdot)$ is a probability density, and it inherits the continuity of $K(\cdot)$.

The need to choose $h$, or more generally $\boldsymbol{\theta}$, is formulated as a problem of inference about $\boldsymbol{\theta}$. The sample data $\boldsymbol{x}$ and model (1) allow a likelihood to be built for $\boldsymbol{\theta}$, and belief about smoothness is expressed in a prior $p(\boldsymbol{\theta})$. The choice of $\boldsymbol{\theta}$ is then guided by the posterior distribution $p(\boldsymbol{\theta}\vert\boldsymbol{x})$.

By taking $\hat f_X(x)$ to be a model for the data, a likelihood function can be constructed in the following way:

Define $\boldsymbol{x}_{(i)} = \{x_1, x_2, \ldots, x_i\}$ as the sub-sample consisting of the first $i$ elements of $\boldsymbol{x}$. Assuming no particular ordering of the sample data (so that it can at least be considered exchangeable with a random sample observed in the order implied by the subscripts), the likelihood may be written as

\begin{displaymath}
\ell(\boldsymbol{x}; \boldsymbol{\theta}) = \prod_{i=1}^n p(x_i\vert\boldsymbol{x}_{(i-1)}, \boldsymbol{\theta})
\end{displaymath} (3)

where $p(x_i\vert\boldsymbol{x}_{(i-1)}, \boldsymbol{\theta})$ is the probability density of $x_i$, given $\boldsymbol{x}_{(i-1)}$ and $\boldsymbol{\theta}$, given by the chosen model.

This is of the style commonly used for a time series $\{x_i\}$, but it holds more generally. If, for example, the density were $f_X(x\vert\boldsymbol{\theta})$, we would have the usual formulation for a random sample

\begin{displaymath}
\ell(\boldsymbol{x}; \boldsymbol{\theta}) = \prod_{i=1}^n f_X(x_i\vert\boldsymbol{\theta})
\end{displaymath} (4)

If the KDE (2) is taken as a model for observation $x_i$, having seen $\boldsymbol{x}_{(i-1)}$, then (3) can be rewritten as

\begin{displaymath}
\ell(\boldsymbol{x}; \boldsymbol{\theta}) = \prod_{i=1}^n \hat f_X(x_i\vert\boldsymbol{x}_{(i-1)}, \boldsymbol{\theta})
\end{displaymath} (5)

It would be difficult to argue that $\hat f_X()$ is, in any sense, a `true' model, but, in the absence of a parametric family for $p()$, it is the `best' available model.

Since $\boldsymbol{x}_{(i-1)}$ is a parameter of $\hat f_X()$, there is generally some minimum number $n_0$ (say) of values needed for $\hat f_X()$ to be defined, and it seems reasonable to use a conditional likelihood of the form

\begin{displaymath}
\ell(\boldsymbol{x}; \boldsymbol{\theta}, \boldsymbol{x}_{(n_0)}) = \prod_{i=n_0+1}^n \hat f_X(x_i\vert\boldsymbol{x}_{(i-1)}, \boldsymbol{\theta})
\end{displaymath} (6)
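The conditional likelihood (6) can be computed directly: each $x_i$, for $i > n_0$, is scored against the KDE built from the preceding points $\boldsymbol{x}_{(i-1)}$. A minimal sketch with a Normal kernel and scalar bandwidth $h$ (the function name and the default $n_0 = 5$ are illustrative assumptions, not from the text):

```python
import numpy as np

def conditional_loglik(x, h, n0=5):
    """Log of the conditional likelihood (6) for a scalar bandwidth h:
    x[i] is scored against the KDE of the first i points, for i >= n0."""
    x = np.asarray(x, dtype=float)
    ll = 0.0
    for i in range(n0, len(x)):
        z = (x[i] - x[:i]) / h
        # KDE (2) built from the i preceding points, evaluated at x[i]
        dens = np.exp(-0.5 * z**2).sum() / (i * h * np.sqrt(2.0 * np.pi))
        ll += np.log(dens)
    return ll
```

Working on the log scale avoids underflow in the product over $i$; a severely undersmoothed bandwidth gives a much lower value than a moderate one.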

The choice of prior $p(\boldsymbol{\theta})$ will typically reflect belief about the smoothness of the true density that $\hat f_X()$ is intended to estimate. For example, for the basic KDE (1), $\boldsymbol{\theta}$ is taken to be $\{h\}$. In this case a Normal prior for $h$ with a large mean implies a belief in a rather smooth, near uniform density, while a small mean implies a non-smooth density. The variance parameter is chosen to reflect the strength of belief: a small value indicates a strongly held belief and a large value the opposite.
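The Normal prior on $h$ described above is straightforward to encode; a minimal sketch (the function name and parameterisation are illustrative, and in practice one might truncate or renormalise the density to $h > 0$):

```python
import numpy as np

def normal_prior(h, mu, sigma):
    """Normal prior density for the bandwidth h: mu encodes the believed
    smoothness, sigma the strength of that belief (small = strong)."""
    h = np.asarray(h, dtype=float)
    return np.exp(-0.5 * ((h - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
```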

With small data sets the choice of prior has a large influence on the smoothness of the final estimate. With large data sets, for example those available from MCMC, the likelihood may be expected to dominate the posterior, and roughly non-informative priors are a convenient initial choice.

Once the likelihood function and the prior are settled, Bayes' theorem is applied to obtain the posterior density

\begin{displaymath}
p(\boldsymbol{\theta}\vert\boldsymbol{x}) = \frac{\ell(\boldsymbol{x}; \boldsymbol{\theta}, \boldsymbol{x}_{(n_0)})\, p(\boldsymbol{\theta})}{p(\boldsymbol{x})}
\end{displaymath} (7)

in which

\begin{displaymath}
p(\boldsymbol{x}) = \int_\Theta \ell(\boldsymbol{x}; \boldsymbol{\theta}, \boldsymbol{x}_{(n_0)})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}
\end{displaymath} (8)
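For the one-dimensional case $\boldsymbol{\theta} = \{h\}$, the posterior (7) and the evidence (8) can be approximated on a grid of bandwidth values, with the integral in (8) done by the trapezoidal rule. A minimal self-contained sketch (the function name, the grid-based quadrature, and the default $n_0 = 5$ are illustrative assumptions):

```python
import numpy as np

def posterior_h(x, h_grid, prior, n0=5):
    """Posterior density (7) for the bandwidth h, evaluated on h_grid.
    prior is a callable giving the prior density p(h) on the grid;
    the evidence (8) is approximated by the trapezoidal rule."""
    x = np.asarray(x, dtype=float)
    h_grid = np.asarray(h_grid, dtype=float)
    root2pi = np.sqrt(2.0 * np.pi)

    # conditional log-likelihood (6) at each candidate bandwidth
    logl = np.empty_like(h_grid)
    for k, h in enumerate(h_grid):
        ll = 0.0
        for i in range(n0, len(x)):
            z = (x[i] - x[:i]) / h
            ll += np.log(np.exp(-0.5 * z**2).sum() / (i * h * root2pi))
        logl[k] = ll

    # unnormalised posterior; subtract max log-likelihood for stability
    w = np.exp(logl - logl.max()) * prior(h_grid)
    # evidence (8) by the trapezoidal rule over the grid
    evidence = np.sum(0.5 * (w[1:] + w[:-1]) * np.diff(h_grid))
    return w / evidence
```

The posterior mode or mean can then be read off the grid as a point estimate of $h$.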

where the integration is over the whole parameter space $\Theta$.

The result in (7) is a posterior density for the parameter vector $\boldsymbol{\theta}$. The model used for the density is a KDE, in which many densities are summed to model a single density. This is not a realistic model in the usual sense, since its form depends on the sample size $n$. It is not believed to be the true model; however, the KDE family is very rich and adaptable, and it can reasonably be expected to provide an adequate model. It is, in any case, presumably the best available model, since otherwise some other density or approach would be used.

Situations in which the model is known, a priori, not to include the `true' density have been considered by Berk (1966). In this situation the application of Bayes' theorem leads, asymptotically, to the member of the family nearest to the `true' density in terms of Kullback-Leibler directed distance (Kullback, 1997). Berk terms this the `asymptotic carrier'. There is then some assurance that Bayes' methods will lead to an adequate, even optimal or near optimal, solution within the chosen modelling family -- in this case the KDE family.

The primary objective here is to estimate the density $f_X(\cdot)$. Within the family of KDEs defined in terms of $\boldsymbol{\theta}$, it is easy to obtain estimates of the form $\hat f(\cdot\vert\hat{\boldsymbol{\theta}})$, where $\hat{\boldsymbol{\theta}}$ is some convenient point estimate of $\boldsymbol{\theta}$, perhaps the posterior mode or the posterior mean $E(\boldsymbol{\theta}\vert\boldsymbol{x})$. Such a density could be evaluated at some set of values $\boldsymbol{y}$, as $\hat f(\boldsymbol{y}\vert\hat{\boldsymbol{\theta}})$, and this estimative density plotted. However, $\hat{\boldsymbol{\theta}}$ is an estimate, not the true value, and the estimative density makes no allowance for the shape of, or the uncertainty in, the posterior distribution of $\boldsymbol{\theta}$.

A better approach is to integrate over the parameter space of $\boldsymbol{\theta}$, giving the predictive density of an unobserved value $y$; see, for example, Aitchison and Dunsmore (1975):

\begin{displaymath}
\hat{f}(y\vert\boldsymbol{x}) = \int_{\Theta} \hat f_X(y\vert\boldsymbol{x}, \boldsymbol{\theta})\, p(\boldsymbol{\theta}\vert\boldsymbol{x})\, d\boldsymbol{\theta}
\end{displaymath} (9)

Each plotted point requires the evaluation of an integral like (9) for $y = y_i$, which unfortunately becomes computationally expensive. Given data with range $x_{\mathrm{range}} = x_{\mathrm{max}} - x_{\mathrm{min}}$, a set of points $\boldsymbol{y}$ is chosen that covers $x_{\mathrm{range}}$ in sufficient detail, and the corresponding values of $\hat{f}(y_i\vert\boldsymbol{x})$ are plotted.
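The whole procedure for the scalar-bandwidth case, ending with the predictive density (9) on a grid of evaluation points, can be sketched as follows. This is a minimal self-contained illustration, not the authors' implementation: the function name, the grid-based trapezoidal quadrature over $h$, and the default $n_0 = 5$ are all assumptions.

```python
import numpy as np

def predictive_density(y_grid, x, h_grid, prior, n0=5):
    """Predictive density (9) at the points y_grid: the KDE (2) averaged
    over the posterior (7) for the bandwidth h, with both integrals
    approximated by the trapezoidal rule on h_grid."""
    x = np.asarray(x, dtype=float)
    y_grid = np.asarray(y_grid, dtype=float)
    h_grid = np.asarray(h_grid, dtype=float)
    root2pi = np.sqrt(2.0 * np.pi)

    # conditional log-likelihood (6) at each candidate bandwidth
    logl = np.empty_like(h_grid)
    for k, h in enumerate(h_grid):
        ll = 0.0
        for i in range(n0, len(x)):
            z = (x[i] - x[:i]) / h
            ll += np.log(np.exp(-0.5 * z**2).sum() / (i * h * root2pi))
        logl[k] = ll

    # posterior weights p(h | x) on the grid, normalised as in (7)-(8)
    w = np.exp(logl - logl.max()) * prior(h_grid)
    w /= np.sum(0.5 * (w[1:] + w[:-1]) * np.diff(h_grid))

    # KDE (2) at y_grid for each h, then integrate over h as in (9)
    f = np.empty((h_grid.size, y_grid.size))
    for k, h in enumerate(h_grid):
        z = (y_grid[:, None] - x[None, :]) / h
        f[k] = np.exp(-0.5 * z**2).sum(axis=1) / (x.size * h * root2pi)
    integrand = f * w[:, None]
    return np.sum(0.5 * (integrand[1:] + integrand[:-1])
                  * np.diff(h_grid)[:, None], axis=0)
```

Evaluating the KDE once per grid value of $h$ and reusing those values for every $y_i$ keeps the cost of (9) manageable, as the text notes the pointwise integrals would otherwise be expensive.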

In (8) and (9) the integration is over the whole parameter space $\Theta$; this is typically of low dimension, but in many cases the integral cannot be obtained analytically. In such cases numerical quadrature (see Naylor and Smith, 1982) can be applied.
