Time Series Split

Source: Kaggle hosts an hourly energy demand dataset:

https://www.kaggle.com/datasets/robikscube/hourly-energy-consumption

The data is provided by PJM (Pennsylvania, New Jersey, Maryland), which manages the electricity grid across 13 states. The available data spans about two decades (~1982 to 2002).

You can find some preliminary EDA here: https://github.com/w-winnie/livnlearn/blob/main/TimeSeries1_EDAsplit_livnlearnversion.ipynb

Now let’s dive into the topic of discussion: splitting this time series data.

We split the data into train and test (and validation) sets to test our hypothesis or model: we build it on one set of data (train) and evaluate it on a different set (test) to analyze its performance, reliability, and accuracy. (I refrained from terms like bias, overfit, and underfit, but accuracy is plain English, so it’s fair.)

We’re given a sample of data in which some variables are our x’s (the features) and one is y (the target variable); we want to predict what y will be given x.
If there is a linear relationship (for simplicity), we might try to fit a linear model that looks something like y = mx + b. We could use, say, a linear regression library to fit all the x and y data points and come up with the parameters/weights that define our model: m (slope) and b (intercept).
All the data points we use to determine these parameters are said to be part of the training set, and how closely the fitted line matches those training points (measured, for example, by R squared) is the fit accuracy.
We then evaluate this model on data it hasn’t seen in training, which would be the incoming data in practice or some data we set aside before training to see how the model works on unseen data. These unseen data points used to evaluate the model are called the test set.
We’ll go over the evaluation metrics and strategies later.
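To make the fit-then-evaluate idea concrete, here is a minimal sketch. It assumes synthetic data generated from y = 3x + 2 plus noise (not the energy dataset), uses NumPy's `polyfit` as a stand-in for a linear regression library, and defines its own small `r_squared` helper for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)

# The points are already in random order, so hold out the last 20% as test.
n_train = len(x) * 8 // 10
x_train, x_test = x[:n_train], x[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Fit slope m and intercept b using the training set only.
m, b = np.polyfit(x_train, y_train, deg=1)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Fit accuracy on the data the model saw, and on the data it did not.
r2_train = r_squared(y_train, m * x_train + b)
r2_test = r_squared(y_test, m * x_test + b)
print(f"m={m:.2f}, b={b:.2f}, train R^2={r2_train:.3f}, test R^2={r2_test:.3f}")
```

If the test score comes out much lower than the train score, that is usually a hint the model learned the training points rather than the underlying relationship.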

We will not talk about validation data here for simplicity, but just so it doesn’t feel completely foreign: a validation set is data we use to tune the model and select hyperparameters. It is kept separate from both the training and test sets. Because it influences model selection and tuning, the model has effectively seen it before giving a final answer, so it cannot count as test data, which has to be completely unseen.

When we split data into train and test, we have a couple of options at hand. Libraries like sklearn take a ratio such as 0.8 train and 0.2 test and split all the data randomly, keeping 80% for training and 20% for testing. There are also stratified splits (which keep the target variable ratios the same in train and test), k-fold (which repeats the split multiple times with different combinations of train and test), and other combined and custom strategies.
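The three options above can be sketched with scikit-learn's `train_test_split` and `KFold`. The toy arrays here (100 samples, 20% positive labels) are made up for illustration, and the `random_state` values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy data (an assumption): 100 samples, one feature, 20% positive labels.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# Random split: 80% of rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stratified split: the share of positive labels is kept the same
# in the train and test portions.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# k-fold: five different train/test combinations, where each sample
# lands in the test portion exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kfold.split(X)]

print(len(X_train), len(X_test))
print(ys_train.mean(), ys_test.mean())
print(fold_sizes)
```

Note that all three shuffle the rows, which is fine for independent samples but ignores any ordering in the data.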

Today, though, I want to discuss the time series split.
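As a first taste, one simple way to split time-ordered data is to cut it at a point in time instead of shuffling. The sketch below assumes a hypothetical 30-day hourly series with placeholder values (not the PJM data) and an 80/20 cut chosen only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical hourly series: 30 days of (timestamp, value) pairs.
start = datetime(2002, 1, 1)
series = [(start + timedelta(hours=h), float(h)) for h in range(30 * 24)]

# Keep time order: the earliest 80% of rows train, the most recent 20% test.
cutoff = len(series) * 8 // 10
train, test = series[:cutoff], series[cutoff:]

# Every training timestamp comes before every test timestamp, so the model
# is never trained on observations from the period it is evaluated on.
assert train[-1][0] < test[0][0]
print(len(train), len(test))
```

The point of the design is that the test set plays the role of "the future": the model is scored only on hours that come after everything it was trained on.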

