Neighborhood Kernel Density Estimation

For estimating the conditional density \(p(y|x)\), \(\epsilon\)-neighbor kernel density estimation (\(\epsilon\)-KDE) employs standard kernel density estimation in a local \(\epsilon\)-neighborhood around a query point \((x,y)\).

\(\epsilon\)-KDE is a lazy learner: it simply stores the training points \(\{(x_i,y_i)\}_{i=1}^n\) and places a kernel function on each of them. To compute \(p(y|x)\), the estimator only considers a local subset of the training samples \(\{(x_i, y_i)\}_{i \in \mathcal{I}_{x, \epsilon}}\), where \(\mathcal{I}_{x, \epsilon}\) is the set of sample indices such that \(||x_i - x|| \leq \epsilon\).

In the case of Gaussian kernels, the estimated density can be expressed as

\[p(y|x) = \sum_{j \in \mathcal{I}_{x, \epsilon}} w_j ~ N(y~| y_j, \sigma^2 I)\]

where \(w_j\) is the weight of the j-th kernel and \(N(y~|\mu,\Sigma)\) is the probability density function of a multivariate Gaussian. This implementation currently supports two types of weighting:

  • equal weights: \(w_j = \frac{1}{|\mathcal{I}_{x, \epsilon}|}\)

  • weights \(w_j\) proportional to \(||x_j - x||_2\), the Euclidean distance to x
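The density formula above can be sketched in plain Python for scalar x and y with equal weights. This is a minimal illustration of the formula, not the library's implementation: the helper names `gaussian_pdf` and `epsilon_kde_pdf` are hypothetical, and returning zero for an empty neighborhood is an assumption.

```python
import math

def gaussian_pdf(y, mu, sigma):
    """Density of a univariate Gaussian N(y | mu, sigma^2)."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def epsilon_kde_pdf(x, y, xs, ys, epsilon, sigma):
    """Equal-weight epsilon-KDE estimate of p(y|x) for scalar x and y."""
    # Indices of training points whose x lies within epsilon of the query x
    neighbors = [j for j, xj in enumerate(xs) if abs(xj - x) <= epsilon]
    if not neighbors:
        return 0.0  # assumption: an empty neighborhood yields zero density
    weight = 1.0 / len(neighbors)
    return sum(weight * gaussian_pdf(y, ys[j], sigma) for j in neighbors)

# Toy training data: the last point is too far from x = 0.1 to contribute
xs = [0.0, 0.1, 0.2, 1.5]
ys = [1.0, 1.2, 0.8, 5.0]
density = epsilon_kde_pdf(x=0.1, y=1.0, xs=xs, ys=ys, epsilon=0.4, sigma=0.6)
```

Only the first three training points fall inside the neighborhood here, so the estimate is an average of three Gaussian kernels centered at their y values.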

class cde.density_estimator.NeighborKernelDensityEstimation(name='NKDE', ndim_x=None, ndim_y=None, epsilon=0.4, bandwidth=0.6, param_selection='normal_reference', weighted=True, n_jobs=-1, random_seed=None)[source]

Epsilon-Neighbor Kernel Density Estimation (lazy learner) with Gaussian Kernels

  • name – (str) name / identifier of estimator

  • ndim_x – (int) dimensionality of x variable

  • ndim_y – (int) dimensionality of y variable

  • epsilon – size of the (normalized) neighborhood region

  • bandwidth – (float or array_like) initial bandwidth parameter

  • param_selection – parameter selection method. Must be one of:

    - None or False: use the provided epsilon and bandwidth

    - normal_reference: bandwidths are chosen according to a normal reference distribution

    - cv_ml: select bandwidth and epsilon via maximum likelihood leave-one-out cross-validation

  • weighted – if True, the neighborhood Gaussians are weighted according to their distance to the query point; if False, all neighborhood Gaussians are weighted equally

  • random_seed – (optional) seed (int) of the random number generators used
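For the normal_reference option, a common rule of thumb scales the bandwidth with the standard deviation of the targets and the sample size. The sketch below shows that rule for one-dimensional y. The constant 1.06 and the exponent -1/(4 + d) are the usual normal reference values; whether this estimator uses exactly these values is an assumption, and the function name is hypothetical.

```python
import statistics

def normal_reference_bandwidth(ys, ndim_y=1):
    """Rule-of-thumb bandwidth: 1.06 * std(y) * n^(-1 / (4 + ndim_y))."""
    n = len(ys)
    sigma = statistics.pstdev(ys)  # population standard deviation of the targets
    return 1.06 * sigma * n ** (-1.0 / (4 + ndim_y))

ys = [0.0, 2.0, 4.0, 6.0, 8.0]
bandwidth = normal_reference_bandwidth(ys)
```

With the same spread, more samples lead to a smaller bandwidth, which is the behavior one would expect from a kernel smoother.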

conditional_value_at_risk(x_cond, alpha=0.01, n_samples=1000000)

Computes the Conditional Value-at-Risk (CVaR) / Expected Shortfall of the fitted distribution. Only available if ndim_y = 1.

  • x_cond – different x values to condition on - numpy array of shape (n_values, ndim_x)

  • alpha – quantile percentage of the distribution


CVaR values for each x to condition on - numpy array of shape (n_values)

covariance(x_cond, n_samples=1000000)

Covariance of the fitted distribution conditioned on x_cond


x_cond – different x values to condition on - numpy array of shape (n_values, ndim_x)


Covariances Cov[y|x] corresponding to x_cond - numpy array of shape (n_values, ndim_y, ndim_y)

eval_by_cv(X, Y, n_splits=5, verbose=True)

Fits the conditional density model with cross-validation by using the score function of the BaseDensityEstimator for scoring the various splits.

  • X – numpy array to be conditioned on - shape: (n_samples, n_dim_x)

  • Y – numpy array of y targets - shape: (n_samples, n_dim_y)

  • n_splits – number of cross-validation folds (positive integer)

  • verbose – the verbosity level

fit(X, Y, **kwargs)[source]

Since NKDE is a lazy learner, fit just stores the provided training data (X,Y)

  • X – numpy array to be conditioned on - shape: (n_samples, n_dim_x)

  • Y – numpy array of y targets - shape: (n_samples, n_dim_y)

fit_by_cv(X, Y, n_folds=3, param_grid=None, verbose=True, n_jobs=-1)

Fits the conditional density model with hyperparameter search and cross-validation:

  • Determines the best hyperparameter configuration from a pre-defined set using cross-validation. The conditional log-likelihood is used for evaluation.

  • Fits the model with the previously selected hyperparameter configuration

  • X – numpy array to be conditioned on - shape: (n_samples, n_dim_x)

  • Y – numpy array of y targets - shape: (n_samples, n_dim_y)

  • n_folds – number of cross-validation folds (positive integer)

  • param_grid

    (optional) a dictionary with the hyperparameters of the model as keys and a list of respective parametrizations as values. The hyperparameter search is performed over the Cartesian product of the provided lists. Example: {"n_centers": [20, 50, 100, 200], "center_sampling_method": ["agglomerative", "k_means", "random"], "keep_edges": [True, False]}
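The Cartesian product over a parameter grid can be enumerated as sketched below. This only illustrates how such a grid expands into individual configurations; the helper `expand_param_grid` is hypothetical, and using `epsilon` and `bandwidth` (constructor arguments of this class) as grid keys is an assumption about what one would tune here.

```python
import itertools

def expand_param_grid(param_grid):
    """Return one dict per point in the Cartesian product of the value lists."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]

param_grid = {"epsilon": [0.2, 0.4, 0.8], "bandwidth": [0.3, 0.6]}
configurations = expand_param_grid(param_grid)
# 3 epsilon values x 2 bandwidth values give 6 candidate configurations
```

Each configuration would then be scored by cross-validation, and the best one used for the final fit.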



Get parameter configuration for this estimator.


deep – (boolean, optional) if True, will return the parameters for this estimator and contained subobjects that are estimators.


params – (mapping of string to any) parameter names mapped to their values.


Get parameters for this estimator.


deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.


params – Parameter names mapped to their values.

Return type

mapping of string to any

kurtosis(x_cond, n_samples=1000000)

Kurtosis of the fitted distribution conditioned on x_cond


x_cond – different x values to condition on - numpy array of shape (n_values, ndim_x)


Kurtosis Kurt[y|x] corresponding to x_cond - numpy array of shape (n_values, ndim_y, ndim_y)

log_pdf(X, Y)[source]

Predicts the conditional log-probability log p(y|x). Requires the model to be fitted.

  • X – numpy array to be conditioned on - shape: (n_samples, n_dim_x)

  • Y – numpy array of y targets - shape: (n_samples, n_dim_y)


conditional log-probability log p(y|x) - numpy array of shape (n_query_samples, )

loo_likelihood(bandwidth, epsilon)[source]

Calculates the negative leave-one-out log-likelihood of the training data

  • bandwidth – bandwidth parameter

  • epsilon – size of the (normalized) neighborhood region
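A leave-one-out objective of this kind can be sketched as follows for scalar x and y with equal weights. It is an illustration under stated assumptions rather than the library's code: the function name is hypothetical, the log-likelihood is summed (not averaged) over the training points, and points without any neighbor receive a small floor density to keep the logarithm finite.

```python
import math

def loo_neg_log_likelihood(xs, ys, bandwidth, epsilon, floor=1e-300):
    """Negative leave-one-out log-likelihood of an equal-weight epsilon-KDE."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Neighbors of point i, excluding i itself (leave-one-out)
        neighbors = [j for j in range(n)
                     if j != i and abs(xs[j] - xs[i]) <= epsilon]
        density = 0.0
        for j in neighbors:
            z = (ys[i] - ys[j]) / bandwidth
            kernel = math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2.0 * math.pi))
            density += kernel / len(neighbors)
        # Assumption: floor the density so isolated points do not produce log(0)
        total += math.log(max(density, floor))
    return -total

value = loo_neg_log_likelihood([0.0, 0.1], [0.0, 0.0], bandwidth=1.0, epsilon=0.4)
```

Minimizing such an objective over bandwidth and epsilon corresponds to the cv_ml selection described for the constructor.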

mean_(x_cond, n_samples=1000000)

Mean of the fitted distribution conditioned on x_cond


x_cond – different x values to condition on - numpy array of shape (n_values, ndim_x)


Means E[y|x] corresponding to x_cond - numpy array of shape (n_values, ndim_y)

mean_std(x_cond, n_samples=1000000)

Computes the mean and covariance of the fitted distribution conditioned on x_cond.

Computationally more efficient than calling the mean and covariance computations separately


x_cond – different x values to condition on - numpy array of shape (n_values, ndim_x)


Means E[y|x] and Covariances Cov[y|x]
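For a mixture of Gaussians with a shared bandwidth, such as the one in the density formula at the top of this page, mean and variance have a closed form. The sketch below shows it for scalar y. The n_samples argument suggests that the library estimates moments by sampling, so this closed form is an illustrative alternative and not the library's method; the function name is hypothetical.

```python
def mixture_mean_and_variance(neighbor_ys, weights, sigma):
    """Mean and variance of sum_j w_j N(y | y_j, sigma^2) for scalar y."""
    mean = sum(w * y for w, y in zip(weights, neighbor_ys))
    # Law of total variance: kernel variance plus spread of the kernel centers
    spread = sum(w * (y - mean) ** 2 for w, y in zip(weights, neighbor_ys))
    return mean, sigma ** 2 + spread

mean, variance = mixture_mean_and_variance([0.0, 2.0], [0.5, 0.5], sigma=1.0)
```

Both quantities come from a single pass over the same neighborhood, which is the kind of shared work a combined mean and covariance call can exploit.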

pdf(X, Y)[source]

Predicts the conditional probability density p(y|x). Requires the model to be fitted.

  • X – numpy array to be conditioned on - shape: (n_samples, n_dim_x)

  • Y – numpy array of y targets - shape: (n_samples, n_dim_y)


conditional probability density p(y|x) - numpy array of shape (n_query_samples, )

plot2d(x_cond=[0, 1, 2], ylim=(-8, 8), resolution=100, mode='pdf', show=True, prefix='', numpyfig=False)

Generates a 3d surface plot of the fitted conditional distribution if x and y are 1-dimensional each

  • xlim – 2-tuple specifying the x axis limits

  • ylim – 2-tuple specifying the y axis limits

  • resolution – integer specifying the resolution of plot

plot3d(xlim=(-5, 5), ylim=(-8, 8), resolution=100, show=False, numpyfig=False)

Generates a 3d surface plot of the fitted conditional distribution if x and y are 1-dimensional each

  • xlim – 2-tuple specifying the x axis limits

  • ylim – 2-tuple specifying the y axis limits

  • resolution – integer specifying the resolution of plot

predict_density(X, Y=None, resolution=50)

Computes conditional density p(y|x) over a predefined grid of y target values

  • X – values/vectors to be conditioned on - shape: (n_instances, n_dim_x)

  • Y – (optional) y values to be evaluated from p(y|x) - if not set, Y will be a grid with the specified resolution

  • resolution – integer specifying the resolution of the evaluation grid

Returns: tuple (P, Y)
  • P - density p(y|x) - shape (n_instances, resolution**n_dim_y)

  • Y - grid with the specified resolution - shape (resolution**n_dim_y, n_dim_y) or a copy of Y in case it was provided as an argument
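The grid shape stated above, (resolution**n_dim_y, n_dim_y), follows from taking the Cartesian product of one evenly spaced axis per y dimension. The sketch below builds such a grid; the helper name and the explicit y_min / y_max bounds are hypothetical, since how the library chooses the grid range is not stated here.

```python
import itertools

def make_y_grid(y_min, y_max, resolution, n_dim_y):
    """Regular grid with resolution**n_dim_y points, each of length n_dim_y."""
    # Evenly spaced axis values including both end points (needs resolution >= 2)
    step = (y_max - y_min) / (resolution - 1)
    axis = [y_min + i * step for i in range(resolution)]
    return [list(point) for point in itertools.product(axis, repeat=n_dim_y)]

grid = make_y_grid(-1.0, 1.0, resolution=5, n_dim_y=2)
```

The number of grid points grows exponentially with n_dim_y, so a moderate resolution is advisable for multivariate targets.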

score(X, Y)

Computes the mean conditional log-likelihood of the provided data (X, Y)

  • X – numpy array to be conditioned on - shape: (n_query_samples, n_dim_x)

  • Y – numpy array of y targets - shape: (n_query_samples, n_dim_y)


average log likelihood of data


Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.




skewness(x_cond, n_samples=1000000)

Skewness of the fitted distribution conditioned on x_cond


x_cond – different x values to condition on - numpy array of shape (n_values, ndim_x)


Skewness Skew[y|x] corresponding to x_cond - numpy array of shape (n_values, ndim_y, ndim_y)

std_(x_cond, n_samples=1000000)

Standard deviation of the fitted distribution conditioned on x_cond


x_cond – different x values to condition on - numpy array of shape (n_values, ndim_x)


Standard deviations sqrt(Var[y|x]) corresponding to x_cond - numpy array of shape (n_values, ndim_y)

tail_risk_measures(x_cond, alpha=0.01, n_samples=1000000)

Computes the Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR)

  • x_cond – different x values to condition on - numpy array of shape (n_values, ndim_x)

  • alpha – quantile percentage of the distribution

  • n_samples – number of samples for Monte Carlo estimation


  • VaR values for each x to condition on - numpy array of shape (n_values)

  • CVaR values for each x to condition on - numpy array of shape (n_values)
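Given samples drawn from the conditional distribution, both risk measures can be estimated empirically as sketched below. The sketch assumes the lower tail convention (VaR as the alpha-quantile, CVaR as the mean of the samples at or below it); the library's exact convention is not stated here, and the function name is hypothetical.

```python
import math

def tail_risk_from_samples(samples, alpha=0.01):
    """Empirical VaR and CVaR of the lower alpha-tail of a sample."""
    ordered = sorted(samples)
    # Number of samples in the lower alpha-tail (at least one)
    k = max(1, math.floor(alpha * len(ordered)))
    var = ordered[k - 1]           # empirical alpha-quantile
    cvar = sum(ordered[:k]) / k    # mean of the samples at or below the quantile
    return var, cvar

samples = [3.0, -1.0, 4.0, -5.0, 2.0, 0.0, 1.0, -2.0]
var, cvar = tail_risk_from_samples(samples, alpha=0.25)
```

With small alpha, only a few samples land in the tail, which is why a large n_samples is needed for a stable estimate.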

value_at_risk(x_cond, alpha=0.01, n_samples=1000000)

Computes the Value-at-Risk (VaR) of the fitted distribution. Only available if ndim_y = 1.

  • x_cond – different x values to condition on - numpy array of shape (n_values, ndim_x)

  • alpha – quantile percentage of the distribution


VaR values for each x to condition on - numpy array of shape (n_values)