A KDE is specified in terms of some function $K(x; \mu, h)$, where $\mu$ is a location parameter and $h$, which may, more generally, be a vector, is a scale parameter. Examples of $K$ include a Normal pdf with location $\mu$ and scale $h$. The estimate of the density $f$, based on the sample $x_1, \ldots, x_n$, at the point $x$, is

$$\hat f(x \mid x_1, \ldots, x_n, h) = \frac{1}{n} \sum_{i=1}^{n} K(x; x_i, h), \qquad (2)$$

where $n$ is the number of data items. Note that $h$ may be considered to be a generalisation of bandwidth. Taking $K(x; x_i, h) = \frac{1}{h} K\!\left(\frac{x - x_i}{h}\right)$ leads to the
standard KDE of (1).
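As a concrete illustration, the following is a minimal sketch of the estimator (2), assuming a Normal kernel; the function name `kde` and the use of NumPy/SciPy are illustrative choices, not part of the original development.

```python
import numpy as np
from scipy.stats import norm

def kde(x, data, h):
    """Generalised KDE of (2): the average of kernels K(x; x_i, h).

    K is taken here to be a Normal pdf with location x_i and scale h,
    so the estimate coincides with the standard KDE of (1) with bandwidth h.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    # Evaluate every kernel at every point of x and average over the sample.
    return norm.pdf(x[:, None], loc=data[None, :], scale=h).mean(axis=1)

# Example: estimate at a few points from a simulated sample, with h = 0.3.
sample = np.random.default_rng(0).normal(size=100)
print(kde([-1.0, 0.0, 1.0], sample, h=0.3))
```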
The form (2) exposes clearly the dependence of the estimator on both the sample to hand $x_1, \ldots, x_n$ and the parameter vector $h$, and gives a density estimate that is a probability density if the kernel $K$ is a probability density. The estimate has the continuity of the kernel $K$.
The need to choose the bandwidth $h$ or, more generally, the parameter vector $h$, is formulated as a problem of inference about $h$. Sample data $x_1, \ldots, x_n$ and model (1) allow the building of a likelihood for $h$, and belief about the smoothness is expressed in a prior $p(h)$. The choice of $h$ is provided by the posterior distribution $p(h \mid x_1, \ldots, x_n)$.
By taking the KDE (2) to be a model for the data, a likelihood function can be constructed in the following way. Define $x_{(i)} = (x_1, \ldots, x_i)$ as the sub-sample consisting of the first $i$ elements of $(x_1, \ldots, x_n)$. Assuming no particular ordering to the sample data (so that it can, at least, be considered exchangeable with a random sample observed in the order implied by the subscripts), the likelihood may be written as

$$p(x_1, \ldots, x_n \mid h) = \prod_{i=1}^{n} p(x_i \mid x_{(i-1)}, h), \qquad (4)$$

where $p(x_i \mid x_{(i-1)}, h)$ is the probability density of $x_i$, given $x_{(i-1)}$ and $h$, given by the chosen model.
This is of the style commonly used for a time series, but is more generally true. If, for example, the density were $p(x_i \mid x_{(i-1)}, h) = p(x_i \mid h)$, we would have the usual formulation for a random sample

$$p(x_1, \ldots, x_n \mid h) = \prod_{i=1}^{n} p(x_i \mid h).$$
If the KDE (2) is taken as a model for observation $x_i$ having seen $x_{(i-1)}$, then (4) can be rewritten as

$$p(x_1, \ldots, x_n \mid h) = \prod_{i=1}^{n} \hat f(x_i \mid x_{(i-1)}, h). \qquad (5)$$

It would be difficult to argue that (5) is, in any sense, a `true' model but, in the absence of a parametric family for $f$, it is the `best' available model.
Since $h$ is a parameter of $\hat f$ there is generally some minimum number $m$ (say) of values needed for $\hat f$ to be defined, and it seems reasonable to use a conditional likelihood of the form

$$L(h) = \prod_{i=m+1}^{n} \hat f(x_i \mid x_{(i-1)}, h). \qquad (6)$$
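A minimal sketch of the conditional likelihood (6) follows, again assuming a Normal kernel and taking the minimum number $m$ of observations needed to define the KDE as one; the function name and defaults are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def log_conditional_likelihood(h, data, m=1):
    """Log of (6): for each i > m, the KDE built from x_(i-1) evaluated at x_i."""
    data = np.asarray(data, dtype=float)
    total = 0.0
    for i in range(m, len(data)):
        # KDE of (2) based on the first i observations, evaluated at the next one;
        # a Normal kernel with scale h is assumed.
        total += np.log(norm.pdf(data[i], loc=data[:i], scale=h).mean())
    return total
```

Evaluating this over a range of $h$ values shows where the conditional likelihood concentrates.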
The choice of prior $p(h)$ will, typically, reflect belief about the smoothness of the true density that $\hat f$ is intended to estimate. For example, for the basic KDE (1), the parameter $h$ is taken to be the bandwidth. In this case a Normal prior for $h$ with a high value of the prior mean implies a belief in a rather smooth, near uniform density, while a low value implies a non-smooth density. The variance parameter is chosen to reflect the strength of belief, a small value indicating a strongly held belief and a large value the opposite.
With small data sets the choice of prior has a large influence on the smoothness of the final estimate. It may be expected that with large data sets, for example those available from MCMC, the likelihood will dominate in the posterior and roughly non-informative priors will be a convenient initial choice.
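As a sketch of how such a prior might be encoded, the snippet below places a Normal prior on the bandwidth $h$, truncated to positive values since $h$ is a scale; treating $h$ itself as the quantity given the Normal prior is an assumption made purely for illustration.

```python
import numpy as np
from scipy.stats import truncnorm

def bandwidth_prior(mean, sd):
    """Normal prior on the bandwidth h, truncated to h > 0.

    A large `mean` expresses belief in a smooth, near-uniform density and a
    small `mean` a non-smooth one; `sd` encodes how strongly the belief is held.
    """
    a, b = (0.0 - mean) / sd, np.inf   # truncation bounds in standardised units
    return truncnorm(a, b, loc=mean, scale=sd)

prior = bandwidth_prior(mean=0.5, sd=0.2)   # fairly strong belief in moderate smoothing
```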
Once the likelihood function and the prior are settled, Bayes theorem is applied to obtain the posterior density

$$p(h \mid x_1, \ldots, x_n) = \frac{L(h)\, p(h)}{p(x_1, \ldots, x_n)}, \qquad (7)$$

in which

$$p(x_1, \ldots, x_n) = \int L(h)\, p(h)\, dh, \qquad (8)$$

where the integration is over the whole parameter space of $h$.
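The posterior (7) can be computed numerically on a grid of bandwidth values, with a simple Riemann sum standing in for the integral (8); the kernel, prior, grid, and simulated data below are all illustrative assumptions rather than the paper's prescription.

```python
import numpy as np
from scipy.stats import norm

def log_lik(h, data, m=1):
    # Conditional likelihood (6) with a Normal kernel, as sketched above.
    return sum(np.log(norm.pdf(data[i], loc=data[:i], scale=h).mean())
               for i in range(m, len(data)))

data = np.random.default_rng(1).normal(size=200)
h_grid = np.linspace(0.05, 1.5, 200)                    # grid over the parameter space of h
log_prior = norm.logpdf(h_grid, loc=0.5, scale=0.5)     # illustrative Normal prior on h
log_post = np.array([log_lik(h, data) for h in h_grid]) + log_prior
post = np.exp(log_post - log_post.max())                # unnormalised posterior (7)
post /= post.sum() * (h_grid[1] - h_grid[0])            # Riemann sum stands in for (8)
```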
The result in (7) is a posterior density for the parameter vector $h$. The model used for the density is a KDE in which many densities are summed to model a single density. It is in no way a realistic model, since it depends on the sample size $n$, and it is not believed to be the true model; however, the KDE family is very rich and adaptable, and it can reasonably be expected to provide an adequate model. It is, in any case, presumably the best available model, since otherwise some other density or approach would be used.
Situations in which the model is known, a priori, not to include the `true' density have been considered by Berk (1966). In this situation the application of Bayes theorem leads, asymptotically, to the member of the family nearest to the `true' density in terms of Kullback-Leibler directed distance (Kullback, 1997); Berk terms this the `asymptotic carrier'. There is, then, some assurance that Bayes' methods will lead to an adequate, even optimal or near-optimal, solution within the chosen modelling family -- in this case the KDE family.
The primary objective here is that of estimating the density $f$. Within the family of KDEs, defined in terms of (2), it is easy to obtain estimates of the form $\hat f(x \mid x_1, \ldots, x_n, \hat h)$, where $\hat h$ is some convenient point estimate of $h$, perhaps the posterior mode or the posterior mean. Such a density could be evaluated at some set of values $y_1, \ldots, y_k$, as $\hat f(y_j \mid x_1, \ldots, x_n, \hat h)$, and this estimative density plotted. Here $\hat h$ is an estimate, not the true value, and the estimate makes no allowance for the shape (or uncertainty in the estimate) of the posterior distribution for $h$.
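A sketch of this estimative (plug-in) approach follows, repeating the illustrative helpers of the earlier sketches so that it stands alone: the posterior mode found on a grid is substituted into (2) and the resulting density evaluated over a set of plotting points.

```python
import numpy as np
from scipy.stats import norm

def kde(x, data, h):
    # Generalised KDE of (2) with a Normal kernel.
    return norm.pdf(np.atleast_1d(x)[:, None], loc=np.asarray(data)[None, :], scale=h).mean(axis=1)

def log_lik(h, data, m=1):
    # Conditional likelihood (6), as in the earlier sketches.
    return sum(np.log(norm.pdf(data[i], loc=data[:i], scale=h).mean())
               for i in range(m, len(data)))

data = np.random.default_rng(2).normal(size=200)
h_grid = np.linspace(0.05, 1.5, 200)
log_post = np.array([log_lik(h, data) for h in h_grid]) + norm.logpdf(h_grid, loc=0.5, scale=0.5)
h_hat = h_grid[np.argmax(log_post)]                      # posterior mode as a point estimate of h
ys = np.linspace(data.min() - 1, data.max() + 1, 400)    # points at which the density is plotted
estimative = kde(ys, data, h_hat)                        # makes no allowance for posterior spread
```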
A better approach is to integrate over the parameter space of $h$, giving the predictive density of an unobserved value $y$. See, for example, Aitchison and Dunsmore (1975):

$$\hat f(y \mid x_1, \ldots, x_n) = \int \hat f(y \mid x_1, \ldots, x_n, h)\, p(h \mid x_1, \ldots, x_n)\, dh. \qquad (9)$$

Each point $y_j$ requires the evaluation of an integral like (9); unfortunately, this becomes computationally expensive. Given data with range $(x_{\min}, x_{\max})$, a set of points $y_1, \ldots, y_k$ is chosen that gives sufficient information to cover that range. The corresponding values of $\hat f(y_j \mid x_1, \ldots, x_n)$ are plotted.
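A sketch of the predictive density (9) follows, approximating the integral over $h$ by a weighted sum over the same kind of grid posterior as above; as before, the Normal kernel, prior, grid, and simulated data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def kde(x, data, h):
    # Generalised KDE of (2) with a Normal kernel, evaluated at the points x.
    return norm.pdf(np.atleast_1d(x)[:, None], loc=np.asarray(data)[None, :], scale=h).mean(axis=1)

def log_lik(h, data, m=1):
    # Conditional likelihood (6), as in the earlier sketches.
    return sum(np.log(norm.pdf(data[i], loc=data[:i], scale=h).mean())
               for i in range(m, len(data)))

data = np.random.default_rng(3).normal(size=200)
h_grid = np.linspace(0.05, 1.5, 200)
dh = h_grid[1] - h_grid[0]
log_post = np.array([log_lik(h, data) for h in h_grid]) + norm.logpdf(h_grid, loc=0.5, scale=0.5)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dh                                  # normalised posterior (7) on the grid

ys = np.linspace(data.min() - 1, data.max() + 1, 200)    # points covering the range of the data
# Predictive density (9): average the KDE over the posterior for h, one integral per point y.
dens_by_h = np.array([kde(ys, data, h) for h in h_grid]) # shape (len(h_grid), len(ys))
predictive = (dens_by_h * post[:, None]).sum(axis=0) * dh
```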
In (8) and (9) the integration is over the whole parameter space of $h$. This is typically of low dimension, but in many cases it is not possible to obtain the result analytically. In such cases numerical quadrature (see Naylor and Smith, 1982) can be applied.
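The sketches above use a fixed grid; adaptive quadrature is an alternative when $h$ is low dimensional. The snippet below uses scipy.integrate.quad to evaluate the normalising constant (8), as a readily available stand-in for the Gaussian quadrature of Naylor and Smith (1982); the likelihood, prior, and integration limits are again illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def log_lik(h, data, m=1):
    # Conditional likelihood (6) with a Normal kernel, as in the earlier sketches.
    return sum(np.log(norm.pdf(data[i], loc=data[:i], scale=h).mean())
               for i in range(m, len(data)))

data = np.random.default_rng(4).normal(size=100)
prior = norm(loc=0.5, scale=0.5)                          # illustrative prior on h

# Normalising constant (8): integrate likelihood times prior over the bandwidth h > 0.
marginal, abs_err = quad(lambda h: np.exp(log_lik(h, data)) * prior.pdf(h), 1e-3, 5.0)
```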