Explore data for machine learning wave model

Author: Leo Peach

Date: 22-12-2018

Initial data requirements

The initial investigation will follow a similar method to that used in James et al 2017. Initially we will focus on developing the necessary data for the MLP model (Mult-layer perceptron).

Data:

Wave Buoy from Queensland Government and Hindcast from CSIRO

  • Hindcast model Hs at an offshore location which also suitable to for a forecast model
  • Hindcast model Hs at or near the wave buoy locations
  • Historical Wave Buoy data

Get the Hindcast Data

The hindcast data is in daily grid files, to get data for one location we are going to access each file and grab the data we want.

In [1]:
import netCDF4
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
In [2]:
#wave buoy info
brisbane = {'name': 'Brisbane, Australia', 'lat': -27.54722, 'lon': 153.8667}
gold_coast = {'name': 'Gold Coast, Australia', 'lat': -27.9646, 'lon': 153.4404} # location in metadata incorrect
In [3]:
def extract_hindcast(file, wb_info):
    '''Parsed a hindcast netCDF file
    Returns a dataframe with the relevant data for a location'''

    fh = netCDF4.Dataset(file,'r')
    lons = fh.variables['longitude'][:]
    lats = fh.variables['latitude'][:]
    hs = fh.variables['hs'][:]
    t02 = fh.variables['t02'][:]
    u10 = fh.variables['U10'][:]
    v10 = fh.variables['V10'][:]
    time = fh.variables['time'][:]
    pkdir = fh.variables['dp'][:]
    pkfreq = fh.variables['fp'][:]
    fh.close()
    
    latidx = np.abs(lats - wb_info['lat']).argmin()
    lonidx = np.abs(lons - wb_info['lon']).argmin()
    
    hs = pd.Series(hs[:,latidx, lonidx])
    t02 = pd.Series(t02[:,latidx, lonidx])
    u10 = pd.Series(u10[:,latidx, lonidx])
    v10 = pd.Series(v10[:, latidx, lonidx])
    pkfreq = pd.Series(pkfreq[:,latidx, lonidx])
    pkdir = pd.Series(pkdir[:,latidx, lonidx])
    
    tstr = file[-13:-3]
    
    start = tstr + " 00:00"
    end = tstr + " 23:00"
    dateidx = pd.date_range(start,end , freq='H')
    data = pd.concat([hs, t02,u10, v10, pkfreq, pkdir], axis = 1)
    data.index = dateidx
    data.columns = ['hs','t02','u10', 'v10','pkfreq','pkdir']
    
    return data
In [ ]:
import os
hindcast_directory = 'D:/Masters/data/hindcast/'
gc_hc_list = []
errs = 0
for file in os.listdir(hindcast_directory):
    if file.endswith('.nc'):
        try:
            gc_hc_list.append(extract_hindcast(os.path.join(hindcast_directory+file), gold_coast))
        except:
            errs +=1
            continue

gc_hc = pd.concat(gc_hc_list)
        

Save data at this point to ensure we don't need to run this again for gold wave bouy location

In [ ]:
gc_hc.to_csv('gold_coast_hindcast.csv')

Now that that the data is in CSV format grab it anduse it instead of re-processing

In [4]:
gc_hc = pd.read_csv('D:/Masters/python/machine_learning/gold_coast_hindcast.csv')
In [6]:
gc_hc.index = pd.to_datetime(gc_hc['Unnamed: 0'])
In [7]:
gc_hc.head()
Out[7]:
Unnamed: 0 hs t02 u10 v10 pkfreq pkdir
Unnamed: 0
1979-01-01 00:00:00 1979-01-01 00:00:00 0.000 1.98 -3.8 3.5 NaN NaN
1979-01-01 01:00:00 1979-01-01 01:00:00 0.122 1.46 -3.8 3.5 NaN NaN
1979-01-01 02:00:00 1979-01-01 02:00:00 0.238 1.51 -3.8 3.3 NaN NaN
1979-01-01 03:00:00 1979-01-01 03:00:00 0.268 1.88 -4.2 3.1 0.423 117.0
1979-01-01 04:00:00 1979-01-01 04:00:00 0.280 2.10 -4.9 2.9 0.391 116.0

Select a suitable offshore location

first location select offshore of the wave buoy in approx -500m depth.

In [ ]:
loc1_gc = {'name': 'Location1, Australia', 'lat': -27.9646, 'lon': 154.061}
In [ ]:
import os
hindcast_directory = 'D:/Masters/data/hindcast/'
lc_hc1_list = []
errs = 0
for file in os.listdir(hindcast_directory):
    if file.endswith('.nc'):
        try:
            lc_hc1_list.append(extract_hindcast(os.path.join(hindcast_directory+file), gold_coast))
        except:
            errs +=1
            continue
print(errs)
lc1_hc = pd.concat(lc_hc1_list)

also save this as a csv

In [ ]:
lc1_hc.to_csv('loc1_hindcast.csv')

now that we have this saved as a csv, use this instead of retrieving the data again

In [8]:
lc1_hc = pd.read_csv('D:/Masters/python/machine_learning/loc1_hindcast.csv')
In [10]:
lc1_hc.index = pd.to_datetime(gc_hc['Unnamed: 0'])
In [11]:
lc1_hc.head()
Out[11]:
Unnamed: 0 hs t02 u10 v10 pkfreq pkdir
Unnamed: 0
1979-01-01 00:00:00 1979-01-01 00:00:00 0.000 1.98 -3.8 3.5 NaN NaN
1979-01-01 01:00:00 1979-01-01 01:00:00 0.122 1.46 -3.8 3.5 NaN NaN
1979-01-01 02:00:00 1979-01-01 02:00:00 0.238 1.51 -3.8 3.3 NaN NaN
1979-01-01 03:00:00 1979-01-01 03:00:00 0.268 1.88 -4.2 3.1 0.423 117.0
1979-01-01 04:00:00 1979-01-01 04:00:00 0.280 2.10 -4.9 2.9 0.391 116.0
In [14]:
fig, ax = plt.subplots(figsize = (14, 6))
ax.plot(lc1_hc.index, lc1_hc.hs)
ax.plot(gc_hc.index, gc_hc.hs)
plt.show()
In [15]:
fig, ax = plt.subplots(figsize = (14, 6))
ax.plot(lc1_hc.index, lc1_hc.t02)
ax.plot(gc_hc.index, gc_hc.t02)
plt.show()

Observation: The data from both locations is identical, this could be due to a number of reasons, they are confirmed not to be in the same grid as they are 25km apart.

Conclusion: Although the hindcast dataset is extremely useful it will need to be upscaled to a higher resolution to be applicable for building machine learning models.

However we will need to look at other parts of the data to see how well the model has performed against the wave buoy data generally, are there trends that need to be accounted for within the data?

Plot the data in bokeh so that it is easier to interogate

In [13]:
from bokeh.io import output_notebook, show, push_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Span
from bokeh.models.tools import HoverTool
output_notebook()
Loading BokehJS ...
In [ ]:
p = figure(x_axis_type="datetime", plot_height = 500, plot_width = 800, title='hindcast data comparison' )
plotf =p.line(x = gc_hc.index,y = gc_hc.hs, color = 'blue',legend='Wave Buoy Site')
p.add_tools(HoverTool(renderers=[plotf], tooltips=[("Hs","@y")],mode='vline'))
plota = p.line(lc1_hc.index, lc1_hc.hs, color = 'green',legend='Offshore')
p.add_tools(HoverTool(renderers=[plota], tooltips=[("Hs","@y")],mode='vline'))

    
p.yaxis.axis_label = "Signifcant Wave Height (m)"
    
p.legend.location = "top_left"
p.legend.click_policy="hide"
show(p)

Comparison with WB data

A comparison of the WB data we have and the hindcast dataset for the same location

In [3]:
wb_data_filename = r'D:/Masters/python/machine_learning/gold_coast_wb_cleaned.csv'
In [4]:
wb_data = pd.read_csv(wb_data_filename, index_col = 0, parse_dates = True)
In [5]:
wb_data = wb_data.sort_index()
In [6]:
print(wb_data.index[0], wb_data.index[-1])
1987-02-21 08:55:00 2018-10-31 23:30:00
In [7]:
hindcast_filename = r'D:/Masters/python/machine_learning/gold_coast_hindcast.csv'
In [8]:
hindcast_data = pd.read_csv(hindcast_filename, index_col = 0, parse_dates = True)
In [9]:
hindcast_data = hindcast_data.loc['1987-02-21 09:00':]
In [10]:
hindcast_data.index[-1]
Out[10]:
Timestamp('2010-12-31 23:00:00')
In [11]:
wb_data = wb_data.loc[:'2010-12-31']
In [12]:
wb_data = wb_data[['Hs','Tz', 'Tp','pkdir']]
In [13]:
hindcast_data = hindcast_data[['hs','t02','pkfreq','pkdir']]

Convert peak frequency to peak period

In [14]:
hindcast_data['pkfreq'] = 1/ hindcast_data['pkfreq']
In [15]:
hindcast_data = hindcast_data.rename(index=str, columns={"hs": "Hs", "t02": "Tz", "pkfreq":"Tp"})
In [16]:
wb_data.head()
Out[16]:
Hs Tz Tp pkdir
Date/Time
1987-02-21 08:55:00 1.00 5.59 7.05 NaN
1987-02-21 14:56:00 1.05 4.45 7.19 NaN
1987-02-21 20:53:00 1.08 5.14 7.24 NaN
1987-02-22 02:51:00 0.99 5.25 7.25 NaN
1987-02-22 08:55:00 0.83 5.51 7.39 NaN
In [17]:
hindcast_data.head()
Out[17]:
Hs Tz Tp pkdir
1987-02-21 09:00:00 1.134 4.36 7.092199 83.0
1987-02-21 10:00:00 1.118 4.39 7.042254 83.0
1987-02-21 11:00:00 1.100 4.43 7.042254 83.0
1987-02-21 12:00:00 1.078 4.48 6.993007 84.0
1987-02-21 13:00:00 1.066 4.42 6.993007 84.0

Now we need to interpolate the data to the same time format

In [18]:
wb_data_interp = wb_data.resample('1h').bfill(limit=1).interpolate()
In [19]:
hindcast_data.index = pd.to_datetime(hindcast_data.index)
In [20]:
fig, ax = plt.subplots(figsize = (14, 6))
ax.plot(hindcast_data.index, hindcast_data.Hs, label = 'hindcast')
ax.plot(wb_data_interp[1:].index, wb_data_interp[1:].Hs, label = 'wavebuoy')
plt.legend()
plt.show()
In [21]:
from pandas import plotting

hindcast_scatter = plotting.scatter_matrix(hindcast_data, alpha = 0.5, figsize = (14, 14))

create a combined dataframe for comparison

In [22]:
wb_data_interp['source'] = 'wave buoy'
In [23]:
hindcast_data['source'] = 'hindcast'
In [24]:
combined = pd.concat([wb_data_interp,hindcast_data])
In [25]:
import seaborn as sns
sns.set_style("white")
colours = ['windows blue', 'greyish']
In [26]:
sns.palplot(sns.xkcd_palette(colours))
In [27]:
pairplt = sns.pairplot(combined, hue="source", diag_kind = 'hist',plot_kws = {'alpha': 0.75, 's': 5, 'edgecolor': 'k',
                                                                             'linewidth' :0}, palette = sns.xkcd_palette(colours))
pairplt.fig.suptitle('Gold Coast Hindcast comparison matrix (1987-2010)')
pairplt.fig.subplots_adjust(top=.9)
plt.savefig('gold_coast_comparison_matrix.png')
C:\Users\leope\Anaconda3\envs\mlenv\lib\site-packages\numpy\lib\histograms.py:754: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
C:\Users\leope\Anaconda3\envs\mlenv\lib\site-packages\numpy\lib\histograms.py:755: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)

Comparitive Statistics

In order to measure the similarity between the Hindcast model and the recroded data over the period we will compare the overlapping datasets.

Correlation

In [31]:
wb_data_interp.Hs.corr(hindcast_data.Hs)
Out[31]:
0.8161723825035221
In [32]:
wb_data_interp.Tz.corr(hindcast_data.Tz)
Out[32]:
0.49410156204469424
In [33]:
wb_data_interp.Tp.corr(hindcast_data.Tp)
Out[33]:
0.4361751729667104
In [34]:
wb_data_interp.pkdir.corr(hindcast_data.pkdir)
Out[34]:
0.5340731576138681

RMSE and MAE

Best value = 0.0

In [53]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, explained_variance_score
In [45]:
mean_squared_error(wb_data_interp.loc[hindcast_data.index].Hs, hindcast_data.Hs)
Out[45]:
0.09052734243585876
In [50]:
mean_absolute_error(wb_data_interp.loc[hindcast_data.index].Hs, hindcast_data.Hs)
Out[50]:
0.21260746103628508
In [46]:
mean_squared_error(wb_data_interp.loc[hindcast_data.index].Tz, hindcast_data.Tz)
Out[46]:
1.7732443762246959
In [51]:
mean_absolute_error(wb_data_interp.loc[hindcast_data.index].Tz, hindcast_data.Tz)
Out[51]:
0.9505038817364095
In [47]:
mean_squared_error(wb_data_interp.loc[hindcast_data.index].Tp, hindcast_data.Tp)
Out[47]:
5.75690646503502
In [52]:
mean_absolute_error(wb_data_interp.loc[hindcast_data.index].Tp, hindcast_data.Tp)
Out[52]:
1.5567921869687742

Variance

Best Value 1.0

In [54]:
explained_variance_score(wb_data_interp.loc[hindcast_data.index].Hs, hindcast_data.Hs)
Out[54]:
0.6357589432227202
In [55]:
explained_variance_score(wb_data_interp.loc[hindcast_data.index].Tz, hindcast_data.Tz)
Out[55]:
0.02472878300670589
In [56]:
explained_variance_score(wb_data_interp.loc[hindcast_data.index].Tp, hindcast_data.Tp)
Out[56]:
0.0166738219398046