import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Quiet the pandas SettingWithCopy warnings
pd.options.mode.chained_assignment = None
Step 5: Scaling structures¶
We now look at scaling structures. We do this early in the modelling process as it will help us decide which library to start with and avoid wasting time changing libraries later on when scaling up in Step 9. Across the input features, the elevation change is well sampled and in many cases the features we are interested in are oversampled. This points to the use of sparse variational GPs (see Step 9) and therefore to libraries such as GPJax, GPflow or GPyTorch.
Step 6: Data transformations¶
6.1 Output variable transformation¶
Inference can be improved if the target distribution is close to Gaussian (so that the Gaussian posterior assumption is better justified) and the data is standardised. In this case, the target distribution is especially complicated. Multiple transformations were applied, both to the full domain and to the positive and negative domains separately, including:
- Log transformation
- Exponential transformation
- Box Cox transformation
- Sampling the CDF of the data to transform it to a Gaussian distribution
None of these significantly improved the inference, so only a standardised version of the distribution was used (a sketch of these candidate transforms is shown below).
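As a rough sketch of the kinds of transforms that were evaluated (a minimal illustration, not the exact code used; the helper `candidate_transforms` and its restriction of the log and Box-Cox transforms to the positive domain are assumptions):
import numpy as np
from scipy import stats
from sklearn.preprocessing import QuantileTransformer
def candidate_transforms(y):
    """Illustrative versions of the candidate output transformations."""
    y = np.asarray(y, dtype=float)
    out = {}
    out['log'] = np.log(y[y > 0])                 # positive domain only
    out['exp'] = np.exp(y / np.abs(y).max())      # scaled to avoid overflow
    out['boxcox'], _ = stats.boxcox(y[y > 0])     # requires strictly positive values
    # CDF-based mapping to a Gaussian (rank-based quantile transform)
    qt = QuantileTransformer(output_distribution='normal')
    out['cdf_gauss'] = qt.fit_transform(y.reshape(-1, 1)).ravel()
    return out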
# Load data
train_df = pd.read_csv('data/training_set.csv', index_col=[0])
val_df = pd.read_csv('data/validation_set.csv', index_col=[0])
test_df = pd.read_csv('data/test_set.csv', index_col = [0])
6.2. Input variable transformation¶
From the pairplot in the previous notebook, the velocity and slope values are heavily skewed. We apply a log transformation to the velocity and slope values to make these more Gaussian. We can check these values are now more Gaussian by plotting the pairplot again.
# Transform skewed variables
train_df['vel_log'] = np.log(train_df['vel'])
train_df['slope_log'] = np.log(train_df['slope'])
grid1 = sb.pairplot(train_df[['slope', 'vel', 'slope_log', 'vel_log']][::100])
# Repeat for the validation and test sets
val_df['vel_log'] = np.log(val_df['vel'])
val_df['slope_log'] = np.log(val_df['slope'])
test_df['vel_log'] = np.log(test_df['vel'])
test_df['slope_log'] = np.log(test_df['slope'])
6.3. Variable standardisation¶
As we are using GPyTorch, the data is not standardised before being passed to the model, so we need to z-score it manually (removing the mean and dividing by the standard deviation). We fit the scaler on the training set and apply it to both the validation and test sets; this avoids leaking information from the evaluation sets into the preprocessing.
# Z score the variables
vars = ['elev', 'ocean_dist', 'slope_log', 'vel_log', 'aspect', 'x', 'y', 'dhdt_filt']
scaler = StandardScaler().fit(train_df[vars])
train_df[vars] = scaler.transform(train_df[vars])
val_df[vars] = scaler.transform(val_df[vars])
test_df[vars] = scaler.transform(test_df[vars])
# Remove NaNs
train_std_df = train_df.dropna()
val_std_df = val_df.dropna()
test_std_df = test_df.dropna()
# Save the data
# train_df.to_csv('data/training_set_tr.csv')
# val_df.to_csv('data/validation_set_tr.csv')
# test_df.to_csv('data/test_set_tr.csv')
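Because `dhdt_filt` is standardised together with the inputs, model predictions will eventually need to be mapped back to physical units. A minimal sketch using the fitted scaler (the argument `y_pred_std` is a hypothetical stand-in for standardised model output):
def unscale_dhdt(y_pred_std):
    """Hypothetical helper: map standardised dhdt_filt predictions back to physical units."""
    idx = vars.index('dhdt_filt')
    return y_pred_std * scaler.scale_[idx] + scaler.mean_[idx]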
Step 7: Kernel design¶
In this section, we examine the properties of the data that will help us choose between different possible covariance function designs. From the plots and analyses below, we assess the following aspects:
- Smoothness: the overall glacier elevation seems to vary smoothly, although areas near the coast show abrupt changes.
- Periodicity: none observed.
- Outliers: the scatter plots show that some points do not follow the general trend of the data (in particular for y, x, elevation, slope and velocity).
- Distribution tails: the histograms show that some variables are heavy-tailed (even after transformation). Although Gaussian distributions do help with inference, part of the predictive signal may lie in the skewed regions of the distribution, so the skew can contribute to the predictive skill of the model.
- Stationarity: the data appears to exhibit two regimes, a high-variance, short-lengthscale regime near the edges of the glacier and a long-lengthscale regime at the centre; this could be difficult to model without a non-stationary kernel.
We also need to ascertain the following:
- Key predictive inputs: the pairplot and correlation plots show that the elevation change is most strongly correlated with elevation, slope and ocean distance. Although the spatial coordinates correlate with elevation change on their own, locations that are geographically close to each other will be highly correlated (c.f. the k-means plot).
- Covariance lengthscales: these can be estimated by looking at the scatter plots in the last row of the pairplot. For most variables they appear to be fairly large.
Lastly, the heteroskedastic nature of the data is also important to consider: the most extreme values occupy the same region of variable space. A custom likelihood function may be needed to model this behaviour accurately.
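One possible (purely illustrative) starting point in GPyTorch, short of a full custom likelihood, is to pass a per-point noise estimate to a fixed-noise Gaussian likelihood; the noise proxy below is an assumption made only for illustration:
import torch
import gpytorch
# Illustrative noise proxy: assume observation variance grows towards the coast
# (small ocean_dist). This proxy is an assumption, not part of the analysis above.
noise_proxy = np.exp(-train_std_df['ocean_dist'].values)
likelihood = gpytorch.likelihoods.FixedNoiseGaussianLikelihood(
    noise=torch.as_tensor(noise_proxy, dtype=torch.float32),
    learn_additional_noise=True,  # also learn a shared homoskedastic noise term
)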
Pairplot¶
grid2 = sb.pairplot(train_df[vars][::100])
Scatter plots with respect to spatial coordinates¶
f, ax = plt.subplots(1, 4, figsize=(15, 5))
ax[0].scatter(x=val_df['x'], y=val_df['y'],
              c=val_df['dhdt_filt'], s=1, cmap='coolwarm', norm=colors.CenteredNorm())
ax[0].set_title('dhdt_filt')
ax[1].scatter(x=train_df['x'], y=train_df['y'],
              c=train_df['elev'], s=1, cmap='gist_earth')
ax[1].set_title('elev')
ax[2].scatter(x=train_df['x'], y=train_df['y'],
              c=train_df['ocean_dist'], s=1, cmap='Blues_r')
ax[2].set_title('ocean_dist')
ax[3].scatter(x=train_df['x'], y=train_df['y'],
              c=train_df['vel_log'], s=1, cmap='Greens')
ax[3].set_title('vel_log')
K-means¶
We apply the k-means algorithm to identify other patterns within the data. In particular, we note that data that is geographically close in the centre of Greenland is more likely to be similar; we also observe that ocean distance and elevation become more important on the west and north sides of Greenland.
kmeans = KMeans(n_clusters=50, random_state=0,
n_init="auto").fit(train_df[['dhdt_filt']])
train_df['groups'] = kmeans.labels_
plt.figure(figsize=(10, 10))
km_plot = plt.scatter(x=train_df['x'], y=train_df['y'],
c=train_df['groups'], cmap='tab20', s=1)
1st and 2nd order statistics¶
These variables should already be normalised, since we standardised the values in Step 6.3. However, if they were not normalised, these statistics could be used to initialise the kernel and mean hyperparameters (see the sketch after the statistics below).
print('Mean')
print(train_df[['elev', 'ocean_dist', 'slope_log',
'vel_log', 'aspect', 'x', 'y']].mean())
Mean
elev         -4.927579e-16
ocean_dist   -1.703684e-16
slope_log     1.048560e-17
vel_log       3.695684e-16
aspect       -8.194884e-17
x            -6.552631e-18
y            -3.669473e-17
dtype: float64
print('Standard deviation')
print(train_df[['elev', 'ocean_dist', 'slope_log',
'vel_log', 'aspect', 'x', 'y']].std())
Standard deviation
elev          1.000001
ocean_dist    1.000001
slope_log     1.000001
vel_log       1.000001
aspect        1.000001
x             1.000001
y             1.000001
dtype: float64
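If an unstandardised copy of the data were used instead, a minimal GPyTorch sketch of seeding hyperparameters from these statistics might look as follows (the `raw_df` variable and the choice of which statistic seeds which hyperparameter are assumptions):
import gpytorch
# Hypothetical: `raw_df` stands for an unstandardised copy of the training data.
kernel = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
kernel.outputscale = float(raw_df['dhdt_filt'].var())         # seed the signal variance
kernel.base_kernel.lengthscale = float(raw_df['elev'].std())   # rough lengthscale guess
# A constant mean could similarly be seeded with raw_df['dhdt_filt'].mean().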
Correlation matrix¶
The correlation heatmap (or matrix) gives us an idea of which variables are most linearly predictive. Pairs of variables with a correlation coefficient close to 1 or -1 are the most strongly correlated and therefore most predictive of each other.
corr = train_df[vars].corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sb.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
heatmap= sb.heatmap(corr, mask=mask, cmap=cmap, center=0, annot=True,
square=True, linewidths=.5, cbar_kws={"shrink": .5, "label": "Correlation Coefficient R"})
Autocorrelation¶
The autocorrelation plots show no signs of periodicity. If periodicity were present, the autocorrelation would show recurring peaks at lags away from the centre.
fig, ax = plt.subplots(len(vars), 1, figsize=(20, 20), sharey=True)
plt.subplots_adjust(hspace=0.5)
for i in range(len(vars)):
    df = train_df.loc[430000:470000].sort_values(by=vars[i])
    y = df[['dhdt_filt']].values.flatten()
    ax[i].acorr(y, maxlags=10000)
    ax[i].set_xlabel('Ranked by ' + vars[i])
    ax[i].set_ylabel('Autocorrelation')
Kernel list¶
Below we list the kernels we will try in the next notebook. We start with a simple kernel over the spatial coordinates using the smoothest covariance function. This should give us values similar to simply interpolating the data with e.g. splines. Here we use a multiplicative kernel to combine the spatial coordinates, as we expect the elevation change to vary jointly with them.
$$ k_1 = k_{\text{RBF}}(x, y) $$
We can then add more variables to the kernel, starting with the most predictive ones, trying both multiplicative and additive versions of each:
$$ k_2 = k_{\text{RBF}}(x, y, \text{elev}) $$
$$ k_3 = k_{\text{RBF}}(x, y) + k_{\text{RBF}}(\text{elev}) $$
and so on, making sure all the while to set the lengthscale (and variance) parameters.
Once the most predictive variables and structure have been identified, the kernel smoothness can be varied by trying different Matérn kernels:
$$ k_{N-3} = k_{\text{Matérn 5/2}}(\cdot) $$
$$ k_{N-2} = k_{\text{Matérn 3/2}}(\cdot) $$
$$ k_{N-1} = k_{\text{Matérn 1/2}}(\cdot) $$
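As a rough sketch of how these candidate kernels could be expressed in GPyTorch (the column ordering and the use of `active_dims` below are assumptions for illustration; lengthscale and variance initialisation is omitted):
from gpytorch.kernels import RBFKernel, MaternKernel, ScaleKernel
# Assumed column order of the model inputs (illustrative only):
# 0: elev, 1: ocean_dist, 2: slope_log, 3: vel_log, 4: aspect, 5: x, 6: y
XY, ELEV = (5, 6), (0,)
# k1: smooth joint (multiplicative) kernel over the spatial coordinates
k1 = ScaleKernel(RBFKernel(ard_num_dims=2, active_dims=XY))
# k2: fold elevation into the same multiplicative RBF
k2 = ScaleKernel(RBFKernel(ard_num_dims=3, active_dims=XY + ELEV))
# k3: additive combination of a spatial kernel and an elevation kernel
k3 = (ScaleKernel(RBFKernel(ard_num_dims=2, active_dims=XY))
      + ScaleKernel(RBFKernel(ard_num_dims=1, active_dims=ELEV)))
# Rougher variants of the preferred structure (Matérn 5/2, 3/2 and 1/2)
k_matern52 = ScaleKernel(MaternKernel(nu=2.5, ard_num_dims=2, active_dims=XY))
k_matern32 = ScaleKernel(MaternKernel(nu=1.5, ard_num_dims=2, active_dims=XY))
k_matern12 = ScaleKernel(MaternKernel(nu=0.5, ard_num_dims=2, active_dims=XY))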