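A KDE is specified in terms of some function $K(x; \mu, \theta)$, where $\mu$ is a location parameter and $\theta$, which may more generally be a vector, is a scale parameter. Examples of $K$ include a Normal pdf with location $\mu$ and scale $\theta$. The estimate of the density $f$ based on the sample $\mathbf{x} = (x_1, \dots, x_n)$, at the point $x$, is
$$\hat f(x \mid \mathbf{x}, \theta) = \frac{1}{n} \sum_{i=1}^{n} K(x; x_i, \theta), \tag{2}$$
where $n$ is the number of data items. Note that $\theta$ may be considered to be a generalisation of the bandwidth $h$. Taking $K(x; x_i, \theta) = h^{-1} k\left((x - x_i)/h\right)$, with $\theta = h$ and $k$ a standard kernel, leads to the standard KDE of (1).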
The form $\hat f(x \mid \mathbf{x}, \theta)$ exposes clearly the dependence of the estimator on both the sample $\mathbf{x}$ to hand and the parameter vector $\theta$, and gives a density estimate that is a probability density if the kernel $K$ is a probability density. The estimate has the continuity of the kernel $K$.
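As an illustration of this generalised form, the following is a minimal sketch assuming a Normal kernel. The function names `normal_kernel` and `kde`, the data values and the scale `theta = 0.5` are all hypothetical choices made for the example, not taken from the text. The final lines check numerically that the estimate has unit mass, as expected when the kernel is itself a probability density.

```python
import math

def normal_kernel(x, mu, theta):
    """Normal pdf evaluated at x, with location mu and scale theta (> 0)."""
    z = (x - mu) / theta
    return math.exp(-0.5 * z * z) / (theta * math.sqrt(2.0 * math.pi))

def kde(x, sample, theta, kernel=normal_kernel):
    """Average of the kernels located at each sample value, evaluated at x."""
    return sum(kernel(x, xi, theta) for xi in sample) / len(sample)

sample = [-1.2, -0.4, 0.1, 0.3, 1.5]  # made-up data, for illustration only
theta = 0.5                            # assumed scale, playing the role of a bandwidth

# Crude Riemann sum over [-10, 10] to check that the estimate has unit mass.
step = 0.01
total_mass = step * sum(kde(-10.0 + step * i, sample, theta) for i in range(2001))
```

Passing the kernel as an argument keeps the estimator separate from the choice of kernel, mirroring the way the text treats the kernel as a general function of a location and a scale.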
The need to choose $h$, or more generally $\theta$, is formulated as a problem of inference about $\theta$. Sample data $\mathbf{x}$ and model (1) allow the building of a likelihood for $\theta$, and belief about the smoothness is expressed in a prior $p(\theta)$. Choice of $\theta$ is provided by the posterior distribution $p(\theta \mid \mathbf{x})$.
By taking $\hat f$ to be a model for the data, a likelihood function can be constructed in the following way:
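Define $\mathbf{x}_k = (x_1, \dots, x_k)$, $k = 1, \dots, n$, as the sub-sample consisting of the first $k$ elements of $\mathbf{x}$ (with $\mathbf{x}_0$ taken to be empty). Assuming no particular ordering to the sample data (so that it can, at least, be considered exchangeable with a random sample observed in the order implied by the subscripts), the likelihood may be written as
$$L(\theta; \mathbf{x}) = \prod_{k=1}^{n} p(x_k \mid \mathbf{x}_{k-1}, \theta), \tag{4}$$
where $p(x_k \mid \mathbf{x}_{k-1}, \theta)$ is the probability density of $x_k$, given $\mathbf{x}_{k-1}$ and $\theta$, given by the chosen model.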
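This is of the style commonly used for a time series $x_1, x_2, \dots$, but is more generally true. If, for example, the density were $p(x_k \mid \mathbf{x}_{k-1}, \theta) = p(x_k \mid \theta)$, we would have the usual formulation for a random sample
$$L(\theta; \mathbf{x}) = \prod_{k=1}^{n} p(x_k \mid \theta).$$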
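If the KDE (2) is taken as a model for observation $x_k$ having seen $\mathbf{x}_{k-1}$, then (4) can be rewritten as
$$L(\theta; \mathbf{x}) = \prod_{k=1}^{n} \hat f(x_k \mid \mathbf{x}_{k-1}, \theta). \tag{5}$$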
It would be difficult to argue that $\hat f$ is, in any sense, a `true' model, but, in the absence of a parametric family for $f$, it is the `best' available model.
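Since $\mathbf{x}_{k-1}$ is a parameter of $\hat f$, there is generally some minimum number ($m$, say) of values needed for $\hat f$ to be defined, and it seems reasonable to use a conditional likelihood of the form
$$L(\theta; \mathbf{x} \mid \mathbf{x}_m) = \prod_{k=m+1}^{n} \hat f(x_k \mid \mathbf{x}_{k-1}, \theta). \tag{6}$$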
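The sequential construction of the conditional likelihood can be sketched as follows, again assuming a Normal kernel. The names `log_kde` and `conditional_log_likelihood`, the default `m = 1` and the data are hypothetical. The computation is done on the log scale, with a log-sum-exp step, purely as a numerical precaution against underflow for small scales; this is an implementation choice of the sketch, not something stated in the text.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)

def log_kde(x, sample, theta):
    """Log of the Normal-kernel KDE at x, built from `sample` with scale theta."""
    logs = [-0.5 * ((x - xi) / theta) ** 2 - math.log(theta) - LOG_SQRT_2PI
            for xi in sample]
    top = max(logs)  # log-sum-exp keeps the sum stable when terms are tiny
    return top + math.log(sum(math.exp(v - top) for v in logs)) - math.log(len(sample))

def conditional_log_likelihood(sample, theta, m=1):
    """Sum of log fhat(x_k | x_1..x_{k-1}, theta) for k = m+1, ..., n.

    With 0-based indexing, sample[k] is x_{k+1} and sample[:k] is x_1..x_k.
    """
    return sum(log_kde(sample[k], sample[:k], theta) for k in range(m, len(sample)))

sample = [-1.2, -0.4, 0.1, 0.3, 1.5]  # made-up data, for illustration only
log_lik = {theta: conditional_log_likelihood(sample, theta) for theta in (0.001, 0.5, 5.0)}
```

For a Normal kernel a single earlier observation is enough for the estimate to be defined, which is why the sketch defaults to `m = 1`; other kernels or vector-valued scales may need a larger value.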
The choice of prior will, typically, reflect belief about the smoothness of the true density $f$ that $\hat f$ is intended to estimate. For example, for the basic KDE (1), $\theta$ is taken to be the bandwidth $h$. In this case a Normal prior for $\theta$ with a high value of the prior mean implies a belief in a rather smooth, near uniform density, while a low value implies a non-smooth density. The variance parameter of the prior is chosen to reflect the strength of belief, a small value indicating a strongly held belief and a large value the opposite.
With small data sets the choice of prior has a large influence on the smoothness of the final estimate. It may be expected that with large data sets, for example those available from MCMC, the likelihood will dominate in the posterior and roughly non-informative priors will be a convenient initial choice.
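Once the likelihood function $L(\theta; \mathbf{x})$ and the prior $p(\theta)$ are settled, Bayes' theorem is applied to obtain the posterior density
$$p(\theta \mid \mathbf{x}) = \frac{L(\theta; \mathbf{x})\, p(\theta)}{\int_{\Theta} L(\theta; \mathbf{x})\, p(\theta)\, d\theta}, \tag{7}$$
where the integration is over the whole parameter space $\Theta$.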
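For a scalar scale parameter the posterior can be approximated on a grid. The sketch below is one simple way to do this, assuming a Normal kernel, a Normal prior with hypothetical hyperparameters `prior_mean` and `prior_sd`, and a parameter space truncated to the grid; all function names and numerical values are illustrative assumptions. The normalising integral is approximated with the trapezoidal rule.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)

def log_kde(x, sample, theta):
    """Log of the Normal-kernel KDE at x, built from `sample` with scale theta."""
    logs = [-0.5 * ((x - xi) / theta) ** 2 - math.log(theta) - LOG_SQRT_2PI
            for xi in sample]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs)) - math.log(len(sample))

def conditional_log_likelihood(sample, theta, m=1):
    """Sum of log fhat(x_k | x_1..x_{k-1}, theta) for k = m+1, ..., n."""
    return sum(log_kde(sample[k], sample[:k], theta) for k in range(m, len(sample)))

def log_normal_prior(theta, mean, sd):
    """Log of a Normal prior density for theta (hyperparameters are assumptions)."""
    return -0.5 * ((theta - mean) / sd) ** 2 - math.log(sd) - LOG_SQRT_2PI

def trapezoid(values, grid):
    """Trapezoidal rule for values tabulated on an increasing grid."""
    return sum(0.5 * (values[i] + values[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))

def grid_posterior(sample, grid, prior_mean, prior_sd, m=1):
    """Posterior density on the grid: likelihood times prior, normalised numerically."""
    log_post = [conditional_log_likelihood(sample, t, m)
                + log_normal_prior(t, prior_mean, prior_sd) for t in grid]
    top = max(log_post)  # subtract the maximum before exponentiating
    unnorm = [math.exp(v - top) for v in log_post]
    area = trapezoid(unnorm, grid)
    return [u / area for u in unnorm]

sample = [-1.2, -0.4, 0.1, 0.3, 1.5]          # made-up data, for illustration only
grid = [0.05 + 0.01 * i for i in range(296)]  # theta from 0.05 to 3.00
posterior = grid_posterior(sample, grid, prior_mean=1.0, prior_sd=1.0)
posterior_mode = grid[posterior.index(max(posterior))]
posterior_mean = trapezoid([t * p for t, p in zip(grid, posterior)], grid)
```

A grid is only practical when the parameter is of low dimension; the grid limits here are arbitrary and would need checking against the region where the posterior has appreciable mass.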
The result in (7) is a posterior density for the parameter vector $\theta$. The model used for the density $f$ is a KDE, in which many densities are summed to model a single density. This is not a realistic model, since its form depends on the sample size $n$. It is not believed to be the true model; however, the KDE family is very rich and adaptable, and it can reasonably be expected to provide an adequate model. It is, in any case, presumably the best available model, since otherwise some other density or approach would be used.
Situations in which the model is known, a priori, not to include the `true' density have been considered by Berk (1966). In this situation the application of Bayes' theorem leads, asymptotically, to the member of the model family nearest to the `true' density, in terms of Kullback-Leibler directed distance (Kullback, 1997). Berk terms this the `asymptotic carrier'. There is, then, some assurance that Bayesian methods will lead to an adequate, even optimal or near optimal, solution within the chosen modelling family, in this case the KDE.
The primary objective here is that of estimating the density $f$. Within the family of KDEs, defined in terms of $\theta$, it is easy to obtain estimates of the form $\hat f(y \mid \mathbf{x}, \hat\theta)$, where $\hat\theta$ is some convenient point estimate of $\theta$, perhaps the posterior mode or the posterior mean $\int_{\Theta} \theta\, p(\theta \mid \mathbf{x})\, d\theta$. Such a density could be evaluated at some set of values $y_1, \dots, y_r$, as $\hat f(y_j \mid \mathbf{x}, \hat\theta)$, and this estimative density plotted. Here $\hat\theta$ is an estimate, not the true value, and the estimate makes no allowance for the shape (or uncertainty in the estimate) of the posterior distribution for $\theta$.
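A better approach is to integrate over the parameter space of $\theta$, giving the predictive density of the unobserved data $y$ (see, for example, Aitchison and Dunsmore, 1975):
$$p(y \mid \mathbf{x}) = \int_{\Theta} \hat f(y \mid \mathbf{x}, \theta)\, p(\theta \mid \mathbf{x})\, d\theta. \tag{9}$$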
Each point $y$ requires the evaluation of an integral like (9) for $p(y \mid \mathbf{x})$; unfortunately, this becomes computationally expensive. Given data $\mathbf{x}$ with range $[\min_i x_i, \max_i x_i]$, a set of points $y_1, \dots, y_r$ is chosen that gives sufficient information to cover this range. The corresponding values of $p(y_j \mid \mathbf{x})$ are plotted.
In (8) and (9) the integration is over the whole parameter space $\Theta$. This is typically of low dimension, but in many cases it is not possible to obtain the result analytically. In such cases numerical quadrature (see Naylor and Smith, 1982) can be applied.
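The predictive density can be approximated numerically along the following lines. This is a minimal sketch for a scalar scale parameter, in which a plain trapezoidal rule on a fixed grid stands in for the quadrature methods of Naylor and Smith (1982), which are not reproduced here. The Normal kernel, the Normal prior and its hyperparameters, the grids, the data and all function names are assumptions made for the example.

```python
import math

def normal_kernel(x, mu, theta):
    """Normal pdf evaluated at x, with location mu and scale theta (> 0)."""
    z = (x - mu) / theta
    return math.exp(-0.5 * z * z) / (theta * math.sqrt(2.0 * math.pi))

def kde(x, sample, theta):
    """Normal-kernel KDE at x, built from `sample` with scale theta."""
    return sum(normal_kernel(x, xi, theta) for xi in sample) / len(sample)

def trapezoid(values, grid):
    """Trapezoidal rule for values tabulated on an increasing grid."""
    return sum(0.5 * (values[i] + values[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))

def posterior_on_grid(sample, grid, prior_mean, prior_sd, m=1):
    """Posterior for theta on a grid, using the sequential conditional likelihood."""
    log_post = []
    for t in grid:
        # Plain logs are adequate for this grid; very small scales could underflow.
        log_lik = sum(math.log(kde(sample[k], sample[:k], t))
                      for k in range(m, len(sample)))
        # Normal prior up to an additive constant, which cancels on normalising.
        log_post.append(log_lik - 0.5 * ((t - prior_mean) / prior_sd) ** 2)
    top = max(log_post)
    unnorm = [math.exp(v - top) for v in log_post]
    area = trapezoid(unnorm, grid)
    return [u / area for u in unnorm]

def predictive_density(y, sample, grid, posterior):
    """Integral over theta of the KDE at y weighted by the posterior."""
    return trapezoid([kde(y, sample, t) * p for t, p in zip(grid, posterior)], grid)

sample = [-1.2, -0.4, 0.1, 0.3, 1.5]               # made-up data, for illustration only
theta_grid = [0.2 + 0.05 * i for i in range(57)]   # theta from 0.2 to 3.0
posterior = posterior_on_grid(sample, theta_grid, prior_mean=1.0, prior_sd=1.0)

y_step = 0.05
y_points = [-15.0 + y_step * j for j in range(601)]  # evaluation points from -15 to 15
predictive = [predictive_density(y, sample, theta_grid, posterior) for y in y_points]
predictive_mass = y_step * sum(predictive)           # should be close to 1
```

Each evaluation point costs one pass over the whole parameter grid, which makes the expense noted above concrete: the work grows with the product of the number of evaluation points, the number of grid points and the sample size.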