MBatista

Extending python with Go

2021-04-03T00:00:00+00:00

This post is about extending python code with Go.
Python’s ecosystem typically contains a great deal of what is needed, but for the cases when it doesn’t or when some bespoke development is justified, Go might be worth looking into. For one, the language is simple and the compiler forces whatever code you generate to maintain some readibility.
After ad-hoc calls of Go code, structured calls of Go code, trying alternatives like this or this, it seems like it can be simpler or just a bit more automated.
Gopy generates (and compiles) a CPython extension module from a go package. It’s well maintained for linux environments and has plenty of examples to learn from. Installing Go and Gopy is straighforward to install and instructions are provided in Gopy’s repository.
—

To illustrate the process and expose the practical pitfalls of Gopy, let’s start with an implementation of a vantage-point tree from the gonum project.

First step: have Go code that you want to use in your python pipeline. This bit is an example from gonum that should be simple to follow: essentially from a collection of places and given a specific address, determine which are whithin a certain distance and also to display the top 5 closest distances.

package vptree

import (
  "fmt"
  "log"
  "math"

  "gonum.org/v1/gonum/spatial/vptree"
)

func Example_accessiblePublicTransport() {
  // Construct a vp tree of train station locations
  // to identify accessible public transport for the
  // elderly.
  t, err := vptree.New(stations, 5, nil)
  if err != nil {
    log.Fatal(err)
  }

  // Residence.
  q := place{lat: 51.501476, lon: -0.140634}

  var keep vptree.Keeper

  // Find all stations within 0.75 of the residence.
  keep = vptree.NewDistKeeper(0.75)
  t.NearestSet(keep, q)

  fmt.Println(`Stations within 750 m of 51.501476N 0.140634W.`)
  for _, c := range keep.(*vptree.DistKeeper).Heap {
    p := c.Comparable.(place)
    fmt.Printf("%s: %0.3f km\n", p.name, p.Distance(q))
  }
  fmt.Println()

  // Find the five closest stations to the residence.
  keep = vptree.NewNKeeper(5)
  t.NearestSet(keep, q)

  fmt.Println(`5 closest stations to 51.501476N 0.140634W.`)
  for _, c := range keep.(*vptree.NKeeper).Heap {
    p := c.Comparable.(place)
    fmt.Printf("%s: %0.3f km\n", p.name, p.Distance(q))
  }
}

// stations is a list of railways stations.
var stations = []vptree.Comparable{
  place{name: "Bond Street", lat: 51.5142, lon: -0.1494},
  place{name: "Charing Cross", lat: 51.508, lon: -0.1247},
  place{name: "Covent Garden", lat: 51.5129, lon: -0.1243},
  place{name: "Embankment", lat: 51.5074, lon: -0.1223},
  place{name: "Green Park", lat: 51.5067, lon: -0.1428},
  place{name: "Hyde Park Corner", lat: 51.5027, lon: -0.1527},
  place{name: "Leicester Square", lat: 51.5113, lon: -0.1281},
  place{name: "Marble Arch", lat: 51.5136, lon: -0.1586},
  place{name: "Oxford Circus", lat: 51.515, lon: -0.1415},
  place{name: "Picadilly Circus", lat: 51.5098, lon: -0.1342},
  place{name: "Pimlico", lat: 51.4893, lon: -0.1334},
  place{name: "Sloane Square", lat: 51.4924, lon: -0.1565},
  place{name: "South Kensington", lat: 51.4941, lon: -0.1738},
  place{name: "St. James's Park", lat: 51.4994, lon: -0.1335},
  place{name: "Temple", lat: 51.5111, lon: -0.1141},
  place{name: "Tottenham Court Road", lat: 51.5165, lon: -0.131},
  place{name: "Vauxhall", lat: 51.4861, lon: -0.1253},
  place{name: "Victoria", lat: 51.4965, lon: -0.1447},
  place{name: "Waterloo", lat: 51.5036, lon: -0.1143},
  place{name: "Westminster", lat: 51.501, lon: -0.1254},
}

// place is a vptree.Comparable implementations.
type place struct {
  name     string
  lat, lon float64
}

// Distance returns the distance between the receiver and c.
func (p place) Distance(c vptree.Comparable) float64 {
  q := c.(place)
  return haversine(p.lat, p.lon, q.lat, q.lon)
}

// haversine returns the distance between two geographic coordinates.
func haversine(lat1, lon1, lat2, lon2 float64) float64 {
  const r = 6371 // km
  sdLat := math.Sin(radians(lat2-lat1) / 2)
  sdLon := math.Sin(radians(lon2-lon1) / 2)
  a := sdLat*sdLat + math.Cos(radians(lat1))*math.Cos(radians(lat2))*sdLon*sdLon
  d := 2 * r * math.Asin(math.Sqrt(a))
  return d // km
}

func radians(d float64) float64 {
  return d * math.Pi / 180
}   

This is a particularly good example of code to take. The definition of the distance can easily be changed to reflect similarity in words, geometric spaces or whatever rule that makes sense to classify as similar.

  gopy build -output=some/folder -vm=python3 path/to/go_pkg

This creates the shared library and other objects needed for the binding.

For the the shared objects(.so), there is one extra step before interacting with the bindings, which is to add a new path to the LD_LIBRARY_PATH variable to tell the dynamic link loader where to search for the dynamic shared libraries. There is a long lasting issue, where all steps are descbribed.
If you’re in the location of the generated folder add the current working directory ($PWD) to the environment variable, else adjust it accordingly.

  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PWD python3

After it, you’re free to import vptree and use it with little to no issues.

    Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import vptree
    >>> vptree.Example_accessiblePublicTransport()
    Stations within 750 m of 51.501476N 0.140634W.
    St. James's Park: 0.545 km
    Green Park: 0.600 km
    Victoria: 0.621 km

    5 closest stations to 51.501476N 0.140634W.
    St. James's Park: 0.545 km
    Green Park: 0.600 km
    Victoria: 0.621 km
    Hyde Park Corner: 0.846 km
    Picadilly Circus: 1.027 km

Simple enough. This was a very simple example, but it seems to generalize well to more complex Go code.

Subsampling as a strategy to find optimal parameters (2/2)

2021-03-22T00:00:00+00:00

This is the continuation to this post where we explore the changes in the output of machine learning models when they are trained on samples of varying sizes.

Some preliminary thoughts and conclusions from the last post:

Complexity of the models determine how profitable it is to explore in lower samples. For quadratic algorithms and ignoring the actual implementation, training one model with a full dataset should cost the same amount of time as training 6.25 with 40% of that dataset.
Before a threshold sampling percentage, the results of a model are not informative for the full dataset. Being greedy doesn’t help here.
Overall, models at smaller samples seem to be noisier images of the full dataset trained models.

Let’s take this example from bayes_opt.
We want to optimize over a 3d space composed of random forest parameters (max_features, min_sample_split, trees) where the model is evaluated a cross-validated negative log-loss score.
A standard bayesian optimizer runs 100 models with a synthetically generated dataset with a binary target.

Let’s do a run in an exploratory mode: focusing one exploring the landscape instead of necessarily exploiting regions near local or global maxima.

The logs below print how many models were ran during the bayesian optimization and the computational budget it consumed.
We print the best combination of parameters until the last iteration. During this process, the best model was found in iteration 13 and the remaining 87 never resulted in a superior model.

Optimizing Random Forest: 100 models; budget: 100 
|   iter    |  target   | max_fe... | min_sa... | n_esti... |
=============================================================
|  4        | -0.3418   |  0.4902   |  0.02077  |  10.02    |
|  8        | -0.3349   |  0.6166   |  0.01651  |  10.02    |
|  13       | -0.2919   |  0.999    |  0.01     |  250.0    |
=============================================================

The figure below shows a fairly explored space, where some regions clearly seem to have more performant models (lighter tone in the color scale).

Another run promoting a more balanced relation between exploration and exploitation:

Optimizing Random Forest: 100 models; budget: 100
|   iter    |  target   | max_fe... | min_sa... | n_esti... |
=============================================================
|  4        | -0.3423   |  0.4902   |  0.02077  |  10.02    |
|  5        | -0.3404   |  0.999    |  0.01     |  10.0     |
|  6        | -0.3149   |  0.9771   |  0.01587  |  188.5    |
|  12       | -0.2959   |  0.999    |  0.01     |  96.31    |
|  16       | -0.2946   |  0.999    |  0.01     |  141.4    |
|  51       | -0.2943   |  0.999    |  0.01     |  122.7    |
|  95       | -0.2942   |  0.999    |  0.01     |  126.5    |
=============================================================

Both strategies seem to be effective exploring the parameters. Let’s add subsampling to it.

To benchmark all the variants in the exploration we fix the same compute budget, that is, the same that is needed to run 100 models with the full dataset.
Let’s consider negligeble the compute time for the gaussian process that fits the hyperparameter space, even though it isn’t: 1) as observations grow; 2) as the hyperparameter space grows; 3) and if the model function is not too expensive to compute.

Some remarks:

The computational budget is divided between different sample sizes, and for each size, we can quantify a quantity of models given by the complexity relation. Random forests are assumed to be log-linear; SVMs are considered to be quadratic. Abstracting some implmentation details is acceptable to get started.
The lowest percentage which samples the dataset is fixed and it should be tuned depending on the complexity of the data. 1% of the data may be enough to learn the target.
How many sample sizes should we explore? Not enough has been explored here, but it seems to be again contingent on the data.
What strategy to use when dividing the budget: evenly over the various sample sizes or something more complex?

The key concern once the exploration in a sample is completed, is how to pass the information gathered and pass it to the next (larger) sample exploration.
A simple step is to pass promissing points to probe; adjusting the domain of the parameters in order not to explore flat areas is promissing but not easy to implement in bayes_opt; another simple way to pass information is to copy the posterior (after fitting to the observations) covariance function of the underlying gaussian process and use it as a prior to the optimization process of the following sample.

Some ideas to make subsampled bayesian optimization more clever where: 1) to make the strategy of exploration sample size dependent. Exploring at lower samples and exploiting at higher samples seems a good heuristic; 2) to add a noise term to the gaussian process that is dependent on the sample size. This is to model the higher variance at lower samples.

The logs and figures below show the result of a 3 sample size split, with sample percentages in the following set: [30%, 70%, 100%], where the computational budget (the equivalent of training 100 models with the full dataset) was split evenly.
Notice that the same budget allows for a very different amount of models at each sample size. Allocating more budget to lower or higher percentages could ease the exploration of more complex parameter spaces.

Sample percentage:30%

Optimizing Random Forest: 148 models; budget: 33 
|   iter    |  target   | max_fe... | min_sa... | n_esti... |
=============================================================
|  4        | -0.3858   |  0.4902   |  0.02077  |  10.02    |
|  9        | -0.3753   |  0.9791   |  0.01979  |  23.29    |
|  19       | -0.3539   |  0.6998   |  0.02344  |  241.5    |
|  43       | -0.3354   |  0.999    |  0.01     |  221.5    |
|  112      | -0.3344   |  0.999    |  0.01     |  46.79    |
|  137      | -0.3343   |  0.999    |  0.01     |  40.97    |
=============================================================

Sample percentage:70%

Optimizing Random Forest: 57 models; budget: 33
|   iter    |  target   | max_fe... | min_sa... | n_esti... |
=============================================================
|  3        | -0.3023   |  0.999    |  0.01     |  114.4    |
|  5        | -0.302    |  0.999    |  0.01     |  234.8    |
|  42       | -0.3006   |  0.7511   |  0.0109   |  67.98    |
=============================================================

Sample percentage:100%

Optimizing Random Forest: 33 models; budget: 33 
|   iter    |  target   | max_fe... | min_sa... | n_esti... |
=============================================================
|  32       | -0.2917   |  0.7301   |  0.01     |  250.0    |
=============================================================

To conclude, this was not an exhaustive search of how subsampling can be used to search optimal points in the hyperparameter space.
The code below can be changed to:

Explore different budget dividing strategies;
Explore different amounts of noise in each of the sample percentages;
Investigate how to leverage the exploit vs explore trade-off;
Explore some relation between the observed points and the number of points to pass across samples;
Explore how different these points should be;
…

The exploration of higher dimensional spaces at lower samples seems at least more effective than relying on very expensive model full dataset training. For models with higher complexity this should be evident.

Code

Identifying outliers in time series

2021-02-20T00:00:00+00:00

This is essentially a back of the envelope study for the identification of outliers in time series. The idea is to sketch a method that associates model quality to the presence of outliers.
When dealing with data which does not follow time, finding outliers is hard, but even simple approaches might yield decent results. Setting a threshold for the percentile that determines what is a classifier works nicely in one dimensional data, and might even be useful in low dimensional data. To make sure, add a Bonferroni Outlier test and whatever you decide, it has some support.
For time series that evidently not a satisfying answer. Even very rare values can be periodical; this is in fact a common pattern.

One decent definition of outlier is a measurement that does not fit with the data generating process.
For a sufficient amount of samples, the signal makes itself clear, even when in presence of significant noise. Let’s focus on the problem with a small amount of data points — something like monthly series, a frequent business scenario.

Let’s create a small series which has the following decomposition:

We need to create a generating model which will be critical to evaluate the likelihood of a point being an outlier. Let’s not use the entirity of our knowledge of the series. For a series this simple, we’ll use a gaussian processes regression, and we’ll define the covariance function as the sum of the Matérn 5/2 kernel and a periodic kernel of period 12. We also define a linear mean funcion, with the slope defined with by a random variable that is inferred using MCMC.
This is as a plausible injection of basic yet informative priors to the model. More complex series may require complex models, which are harder to sample and harder to illustrate the idea of the post.

A quick test, performing a 12 month forecast, shows the model capturing the signal pretty well when there is no noise, and as the noise magnitude increases,the predictive ability decreases, as expected.

One particular feature that I enjoy in gaussian processes is the ability to interpolate data very nicely and allowing imputation of missing values in a principled way. Because we can draw samples from the model, we can generate distributions for each of the points in the series. We exclude each point and, assuming the model is sufficiently well defined, we could collect percentile values for the values of the series.
Now, the outcome of this step seems redundant or a symptom of a weak model/a hard problem to model. However, the main objective is to show that the absence of outliers will generate a superior model, that reduces the modelling error significantly for the rest of the series.

This extra iteration over the entire series adds a significant amount of computing time to an already complex method — but this is a small series and a few seconds per model is nothing obscene Below we get to see the process for the series with minimal amount of noise.

Even with a model that’s as close to naive as possible, the signal is picked and for the most part the mean of the samples matches the missing value.

The key idea behind this post is the superior model that is trained when outliers are removed from the time series. This is something made clear in the next animation. When the outlier is removed, the variance of the model is very low compared to the models generated when it’s not, and most importantly, the mean of the samples of the posterior resembles the original time series.

In case of more than one outlier, the problem stops being trivial; the absence of one of the outlier is no longer sufficient for the model to obtain a clear signal — depending on the magnitude of the outlier, the perturbation it adds makes any modelling quite hard, forcing the removal of another data point.

We can see the model seems to improve when any of the outliers are removed; measuring this improvement might be sufficient to list outlier candidates and then removing the combinations of elements of this list to find out the most promising set.
It’s still hard to generate some heuristic that generalizes for most outliers — any value with an offset large enough from the original time series is easily detectable, but after removed, requires one additional loop over the remaining points; even with a small amount of data, it gets prohibitive.

The main conclusions taken from this post is that outliers are hard to evaluate, unless if specific cases.
Here, knowledge of the system was needed to build a good enough model; of the amount of noise that influences the process ; and the of how many outliers are in series were essential. In a real world scenario, this is not the case, but it may still serve as an exploratory step.

Subsampling as a strategy to find optimal parameters (1/2)

2021-02-14T00:00:00+00:00

This is the first of two posts about finding optimal parameters for machine learning models, and is motivated by Hyper-Parameter Optimization: A Review of Algorithms and Applications. Subsampling is commented on a later section, as a strategy to reduce the training time; and doing so, reduces the search time for necessary to find optimal values. It’s stated that subsampling is risky in terms of the potential to introduce more noise and uncertainty.
On this first post I’m going to explore subsampling and how much can be learned about the parameter space from it. Try to find support from observations that the parameters tend to converge to the some value as training data increases, and if it seems to increase at the same rate for all parameters; observe the effect in combinations of parameters; and to see if the overall patterns generalizes in the same way to different methods.
On the second post I’ll use subsampling to explore the parameter space, using whatever computational budget there is, efficiently.

Predictive models are able to generalize what they learn from data, otherwise they’re not very useful – but there is a minimum amount of data which serves as a boundary for the lack of learning ability of the model.

Let’s pick then the Random Forest implementation by sklearn, and pick three parameters as a start. The data is the Boston housing dataset, around 500 records which is a nice amount to perform some grid searches for these initial tests. I’ll track the R² score obtained from cross validation.

What we expect to find is evidence that smaller samples are still informative to find the optimal parameters of the full dataset.
For each size and combination of parameters, we randomly sample from the original dataset, while keeping the random seed of the algorithm to maintain some amount of determinism.
The plots below show the results for grid search for the parameters. Keep the colorscale in mind; it’s used for the rest of the plots.

Some comments:

Less data means less to learn: the performance must necessarily be smaller with small samples.
Less data means greater variation in how the training data can be generated – some combinations can be very unrepresentative of the dataset. This needs to be explored further.
For some parameters – max features – subsampling seems to have a greater impact on the predictive performance of the model.

As expected, the parameter curves seems to share a lot of features across sample sizes, even at the smallest ones, and as the size increases the convergence seems to be even more apparent. To put it simply, the optimal parameters are quite similar after a certain size.

Let’s explore the variation of the scores a bit further.
Let’s focus on one parameter, min_sample_split and repeat the exploration a few times.

And we can see the mean value and the region of the two standard deviations.

Smaller samples have a greater variance than larger samples – there are certainly odd combinations, in particular for a small dataset as this one.
Larger samples will have a significant overlap with the orignal dataset which means that there’s less to vary in the training data, and as expected, the results are much less scattered.
In my implementation, even when the sample sizes matches the entire dataset, the order is not fixed by a random seed and is not kept; this adds additional variation, which is not only acceptable, but actually desirable to explore the what causes different outcomes in these methods, and to get an idea of the magnitude of the variation.

Before trying to observe the same patterns hold in other scenarios, let’s take a minute and explore, fixing one parameter, the distributions for the scores at each sample size. The reason is to start observing the actual dynamics of these distributions as more data gets fed to the model.

Below we can see that if we exclude very low sample sizes – there is a very significant overlap in the R² scores distributions. In other words, there is plenty of useful information at lower samples.

On interesting aspect of these repeated draws at a specific sample size seems to be the bell shape of the distribution. Even with ceiling effects, normality tests are passed and we can at least use this as information to model the behavior in the second post.

One continuation to this exploration needs to be related with the relation between sample size and more than one parameter.
We have two examples of combinations of parameters that support what we saw previously: the surfaces share plenty of similarities for various sample sizes, and most importantly, the parameters that maximize the R² score seem to be at the very least neighbours and at the very best the exact same.

One comment regarding how different datasets affect what we have observed. There seems to be a point in the sample size after which the dataset is informative enough for the model to pick up patterns. Rich datasets are a bit more demanding when it comes to how small samples can be – this adds another layer of variance to what we see and that is ignored for now.

One final exploration with a SVM is carried out. There is no scaling of the features or anything remotely trying to maximize the performance; again, the point is to see that there is no dramatic changes to the model behaviour.

We can see that for the very smallest sample size, the maximum seems to be contested by two distinct locations in the parameter space. Small samples are troublesome as we’ve seen and different algorithms scale differently.
The second smallest sample already seems to converge to the optimal set of parameters and at this point we can make some comments that motivate the next post.
Our first method was the random forest, which scales nicely, log-linearly. SVMs on another hand have quadratic time complexity, and this is one aspect to explore and that motivates subsampling in this exploration.
To compute one model with the entire dataset it costs us the same as the cost of training seven in the second smallest sample size: there is intrinsic noise that comes from small samples, yet we can explore a lot more.
A structured, yet simple, approach to how this exploration can be made, leveraging the trade-off between information gained and reduced computational load, is the content of the next post.