Making HMC mix better

tldr; for MCMC for VI use the posterior covariance of q(u) from an initial standard-VI run to whiten the latent variables, rather than use the prior covariance?

I’m new to both Variational Inference and MCMC/HMC but I’ve already experienced the well known issues associated with each.

  • Variational Inference is well known for being poor at properly quantifying the uncertainty,

we do know that variational inference generally underestimates the variance of the posterior density; this is a consequence of its objective function. But, depending on the task at hand, underestimating the variance may be acceptable [1]

  • HMC/MCMC often struggles with Gaussian process models,

The common approach to inference for latent Gaussian models is Markov chain Monte Carlo (MCMC) sampling. It is well known, however, that MCMC methods tend to exhibit poor performance when applied to such models. Various factors explain this. First, the components of the latent field x are strongly dependent on each other. Second, \theta and x are also strongly dependent, especially when n is large. [2]

One solution is to “whiten” the latent variables for the HMC decorrelating them. This is described in [3] and was also suggested to me in this github issue. In both the prior covariance matrix is used to initially transform the latent variables.

In my application however I’ve found that isn’t sufficient. My application is calibrating lots of [low cost] air pollution sensors scattered across a city (one or two sensors are ‘reference’ and are assumed to give correct measurements). Some sensors can move & it’s assumed when sensors are colocated they’re measuring the same thing. In my model each sensor has a transformation function (parameterised by a series of correlated random variables that have GP priors, over time) this function describes the scaling etc that the cheap sensors do to the real data. We assume these can drift over time. For now imagine this is just a ‘calibration scaling’ each sensor needs to apply to the raw signal to get the ‘true’ pollution, at a given time.

A-priori the scaling for each sensor is assumed to be independent, but once two sensors (A & B) have been in proximity their (posterior) scalings are very very correlated. If B is colocated with sensor C and C with D, etc, it will lead to very strong correlations in these scaling variables across the network. Importantly these correlations are induced by the observations and are only in the posterior, not the prior. The upshot is that the whitening provided by the prior covariance doesn’t help decorrelate the latent variables much, and I found the HMC really struggled to mix.

The Idea
I was thinking an obvious solution is to run ‘normal’ Variational Inference first, using a standard multivariate Gaussian for the approximating distribution, q(u). Then, once it’s (very) roughly optimised, use the covariance from q(u) instead of the covariance from the prior for whitening. This way we hopefully will have the benefit of a complicated distribution provided by HMC but it can still sample/mix successfully.

[1] Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. “Variational inference: A review for statisticians.” Journal of the American statistical Association 112.518 (2017): 859-877. link

[2] Rue, Håvard, Sara Martino, and Nicolas Chopin. “Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations.” Journal of the royal statistical society: Series b (statistical methodology) 71.2 (2009): 319-392. link

[3] Hensman, James, et al. “MCMC for variationally sparse Gaussian processes.” Advances in Neural Information Processing Systems. 2015. link

EM interference and xenon flash bulbs

I’ve started looking at bumblebee flight paths (especially during initial orientation flights from the nest). To this end I’ve wanted to increase the acquisition frequency of the system.

I found that firing more frequently than about every 1.6s seconds lead to marked decreases in the actual brightness of the flash. Instead I’ve rebuilt the electronics to allow the flashes to be fired sequentially. I have previously done this but found often the flashes would end up firing together. I assumed my circuit had a short, but I found the same happened again in my new circuit. I experimented for a while adding various capacitors across the supply rail (thinking maybe there was a ‘spike’ coming back through the two trigger wires from the flash), with the same thinking I added a reversed diode across those lines to catch any such spike. None of this helped.

I eventually found that spreading the flashes out across my desk helped a lot… this wasn’t expected, and suggested the problem was likely to be electromagnetic interference. The flash going off involves considerable current*, which even given the small size of the flash probably will equate to quite a large whole-spectrum EM pulse.

After chatting with helpful people on freenode’s ##electronics, I tried out tin-foil, and found (as long as each flash & wire was wrapped in its own foil & not touching) the 4 flashes would fire independently.

Quick foil experiment!

I’ll try out using some household build thermal insulation tape to shield the flashes next as that will be a little more tidy. I also found direct sunlight got the black casings quite hot so they will probably benefit from the silver foil for that reason too.

*back of the envelope calculation: If the flash (at 1/32) is abour 3J over 10us (=300kW) with say, 300V across it, then the current will be on the order of 300kW/300=1kA.

Remove first digit from Swiss Coordinates

Quick tip. I’m using data from air pollution sensors provided here. The Koordinaten are of the form 2’710’500 / 1’259’810. These use the Swiss coordinate system. pyproj is the defacto standard for doing coordinate processing, this page also helped.

But I could make it work, those coordinate values were too big.

Eventually I realised one needs to remove the leading 1 and 2 from the two numbers. This is mentioned in the wikipedia article:

In order to nonetheless achieve a clear distinction between the two systems, an additional digit was added to the coordinates of LV95: any East coordinate (E) now starts with a 2, and any North coordinate (N) with a 1. Consequently, LV95 coordinates are given by pairs of 7-digit numbers, whereas LV03 used pairs of 6-digit numbers – for instance the coordinates (2 600 000m E / 1 200 000m N) in LV95 would be expressed as (600 000m E / 200 000m N) in LV03.

https://en.wikipedia.org/wiki/Swiss_coordinate_system

In summary, my code now looks like:

from pyproj import Proj, transform
sites = []
sites.append({'name':'Zurich, Schimmelstrasse (ZH)','E':681942,'N':247245,'height':415})
sites.append({'name':'Zurich, Heubeeribüel (ZH)','E':685126,'N':248460,'height':610})
pWorld = Proj(init='epsg:4326')
pCH = Proj(init='epsg:21781')
for site in sites:
    print(transform(pCH,pWorld, site['E'], site['N']))

Tensorflow and Matrices containing Variables

Recently Pablo, Dennis and I were wondering what the best way to build Tensors with variables inside. I’ve found three ways (that largely mirror the numpy equivalents). Basically just different combinations of stacking, concatting, reshaping and gathering. [related SO question]

import tensorflow as tf
import numpy as np

a = tf.Variable(1.0,dtype=np.float32)
b = tf.Variable(2.0,dtype=np.float32)
with tf.GradientTape() as t:
    #these lines are equivalent:
    M = tf.reshape(tf.gather([a**2,b**2,a**2/2,1],[0,2,3,1]),[2,2])
    M = tf.reshape(tf.stack([a**2,a**2/2,1,b**2]),[2,2])
    M = tf.concat([tf.stack([[a**2,a**2/2]]),tf.stack([[1,b**2]])],0)
    gradients = t.gradient(tf.linalg.det(M),[a,b])
    print(gradients)
[<tf.Tensor: shape=(), dtype=float32, numpy=7.000001>, <tf.Tensor: shape=(), dtype=float32, numpy=4.0000005>]

I thought I’d just add that, one (possibly unwise) default behaviour of the gradient method is, if one were to ask for the derivative of a matrix it will return the derivative of the reduce_sum of the matrix:

with tf.GradientTape() as t:
    M = tf.concat([tf.stack([[a**2,a**2/2]]),tf.stack([[1,b**2]])],0)
    gradients = t.gradient(M,[a,b])
    print(gradients)
[<tf.Tensor: shape=(), dtype=float32, numpy=3.0>, <tf.Tensor: shape=(), dtype=float32, numpy=4.0>]

Which one can see is returning the derivative of the sum of M.