Machine Learning in Geoscience with Scikit-learn. Part 2: inferential statistics and domain knowledge to select features for oil prediction

In the first post of this series I showed how to use Pandas, Seaborn, and Matplotlib to:

  • load a dataset
  • test, clean up, and summarize the data
  • start looking for relationships between variables using scatterplots and correlation coefficients

In this second post, I will expand on the latter point by introducing some tests and visualizations that will help highlight the possible criteria for choosing some variables, and dropping others. All in Python.

I will use a different dataset than that in the previous post. This one is from the paper “Many correlation coefficients, null hypotheses, and high value“(Lee Hunt, CSEG Recorder, December 2013).

The target to be predicted is oil production from a marine barrier sand. We have measured production (in tens of barrels per day) and 7 unknown (initially) predictors, at 21 wells:

Hang on tight, and read along, because it will be a wild ride!

I will show how to:

1) automatically flag linearly correlated predictors, so we can decide which might be dropped. In the example below (a matrix of pair-wise correlation coefficients between variables) we see that X2, and X7, the second and third best individual predictors of production (shown in the bottom row) are also highly correlated to X1, the best overall predictor. This is flagged in the first column from the left in the image below.

2) automatically flag predictors that fail a critical r test, as shown in the next image

3) create a table to assess the probability that a certain correlation is spurious, in other words the probability of getting at least the correlation coefficient we got with our the sample, or even higher, purely by chance.

I will not recommend to run these tests and apply the criteria blindly. Rather, I will suggest how to use them to learn more about the data, and in conjunction with domain knowledge about the problem at hand (in this case oil production), make more informed choices about which variables should, and which should not be used.

And, of course, I will show how to make the prediction.

Have fun reading: get the Jupyter notebook on GitHub.

One response to “Machine Learning in Geoscience with Scikit-learn. Part 2: inferential statistics and domain knowledge to select features for oil prediction

  1. Pingback: Machine learning in geoscience and planetary science with scikit-learn: series outline | MyCarta·

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s