Bits and what?
After wetting (hopefully) your appetite with the Machine Learning quiz / teaser I am now moving on to a series of posts that I decided to title “Geoscience Machine Learning bits and bobs”.
OK, BUT fist of all, what does ‘bits and bobs‘ mean? It is a (mostly) British English expression that means “a lot of small things”.
Is it a commonly used expression? If you are curious enough you can read this post about it on the Not one-off British-isms blog. Or you can just look at the two Google Ngram plots below: the first is my updated version of the one in the post, comparing the usage of the expression in British vs. US English; the second is a comparison of its British English to that of the more familiar “bits and pieces” (not exactly the same according to the author of the blog, but the Cambridge Dictionary seems to contradict the claim).
I’ve chosen this title because I wanted to distill, in one spot, some of the best collective bits of Machine Learning that came out during, and in the wake of the 2016 SEG Machine Learning contest, including:
- The best methods and insights from the submissions, particularly the top 4 teams
- Things that I learned myself, during and after the contest
- Things that I learned from blog posts and papers published after the contest
I will touch on a lot of topics (see list below) but I hope that – in spite of the title’s pointing to a random assortment of things – what I will have created in the end is a cohesive blog narrative and a complete, mature Machine Learning pipeline in a Python notebook.
- List of previous works (in this post)
- Data inspection
- Data visualization
- Data sufficiency
- Data imputation
- Feature augmentation
- Model training and evaluation
- Connecting the bits: a full pipeline
Some background on the 2016 ML contest
The goal of the SEG contest was for teams to train a machine learning algorithm to predict rock facies from well log data. Below is the (slightly modified) description of the data form the original notebook by Brendon Hall:
The data is originally from a class exercise from The University of Kansas on Neural Networks and Fuzzy Systems. This exercise is based on a consortium project to use machine learning techniques to create a reservoir model of the largest gas fields in North America, the Hugoton and Panoma Fields. For more info on the origin of the data, see Bohling and Dubois (2003) and Dubois et al. (2007).
This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.
The seven predictor variables are:
- Five wire line log curves include gamma ray (GR), resistivity logging(ILD_log10), photoelectric effect (PE), neutron-density porosity difference and average neutron-density porosity (DeltaPHI and PHIND). Note, some wells do not have PE.
- Two geologic constraining variables: nonmarine-marine indicator (NM_M) and relative position (RELPOS)
The nine discrete facies (classes of rocks) are:
List of previous works (comprehensive, to the best of my knowledge)
In each post I will make a point to explicitly reference whether a particular bit (or a bob) comes from a submitted notebook by a team, a previously unpublished notebook of mine, a blog post, or a paper.
However, I’ve also compiled below a list of all the published works, for those that may be interested.
The Github repo with all teams’ submissions.
The published summary of the contest by Brendon Hall and Matt Hall on The Leading Edge
An SEG extended abstract on using gradient boosting on the contest dataset
An arXiv e-print paper on using a ConvNet on the contest dataset
Abstract for a talk at the 2019 CSEG / CSPG Calgary Geoconvention