Building A Time Series Forecasting Model For Electricity Usage


Electricity usage patterns are influenced by many factors, including weather conditions, time of day, day of the week, and calendar effects such as holidays and seasons. To build an effective forecasting model for electricity usage, we therefore need to account for these factors. This article walks through the process of building a time series forecasting model from weather data and electricity usage data. Before proceeding, we recommend going through this article for an overview of time series concepts; it will make the steps below easier to follow.

We start with two datasets: the weather data (weather_data) and the power usage data (power_data). The weather data contains daily observations of various weather conditions, while the power usage data contains hourly observations of power consumption.

# First, let's load the data and inspect it.
import pandas as pd
# Load the hourly power usage data
power_data = pd.read_csv('/mnt/data/power_usage_2016_to_2020.csv')
# Load the daily weather data
weather_data = pd.read_csv('/mnt/data/weather_2016_2020_daily.csv')
# Take a quick look at the first few rows of each dataset
print(power_data.head())
print(weather_data.head())


The power data contains the following columns:

  1. StartDate: The start date and hour of the power usage.
  2. Value (kWh): The amount of power used in kilowatt hours.
  3. day_of_week: The day of the week, where Monday is 0 and Sunday is 6.
  4. notes: Categorization of the day as either ‘weekday’ or ‘weekend’.

The weather data contains the following columns:

  1. Date: The date of the weather data.
  2. Day: This might be a day counter, but we’ll need to confirm.
  3. Temp_max, Temp_avg, Temp_min: The maximum, average, and minimum temperatures for the day.
  4. Dew_max, Dew_avg, Dew_min: The maximum, average, and minimum dew points for the day.
  5. Hum_max, Hum_avg, Hum_min: The maximum, average, and minimum humidity levels for the day.
  6. Wind_max, Wind_avg, Wind_min: The maximum, average, and minimum wind speeds for the day.
  7. Press_max, Press_avg, Press_min: The maximum, average, and minimum pressure levels for the day.
  8. Precipit: The amount of precipitation for the day.
  9. day_of_week: The day of the week, where Monday is 0 and Sunday is 6.

To prepare the data for forecasting, we’ll perform the following steps:

  1. Convert StartDate in the power data and Date in the weather data to datetime format.
  2. Aggregate the hourly power usage data to daily data to match the frequency of the weather data.
  3. Merge the power and weather datasets on the date.

Next, we convert the date columns to datetime, aggregate the hourly power usage to daily totals, and merge the power data with the weather data:

# Convert 'StartDate' and 'Date' to datetime format
power_data['StartDate'] = pd.to_datetime(power_data['StartDate'])
weather_data['Date'] = pd.to_datetime(weather_data['Date'])
# Aggregate the hourly power usage to daily totals
# (note: .sum() sums every numeric column, including day_of_week)
power_data_daily = power_data.resample('D', on='StartDate').sum()
power_data_daily.reset_index(inplace=True)
# Merge the daily power usage with the daily weather observations
merged_data = pd.merge(power_data_daily, weather_data, left_on='StartDate', right_on='Date', how='inner')
# Drop redundant columns and rename the remaining ones for clarity
merged_data.drop(columns=['Day', 'day_of_week_y'], inplace=True)
merged_data.rename(columns={'day_of_week_x': 'day_of_week', 'Value (kWh)': 'Power_kWh'}, inplace=True)

Now that we have a merged dataset, we can proceed with further preprocessing, such as checking for missing values, normalizing the data, and splitting the data into training and test sets.

# Check for missing values
missing_values = merged_data.isnull().sum()
# Descriptive statistics for numerical columns
desc_stats = merged_data.describe()
missing_values, desc_stats

RESULT

(StartDate      0
 Power_kWh      0
 day_of_week    0
 Date           0
 Temp_max       0
 Temp_avg       0
 Temp_min       0
 Dew_max        0
 Dew_avg        0
 Dew_min        0
 Hum_max        0
 Hum_avg        0
 Hum_min        0
 Wind_max       0
 Wind_avg       0
 Wind_min       0
 Press_max      0
 Press_avg      0
 Press_min      0
 Precipit       0
 dtype: int64,
          Power_kWh  day_of_week     Temp_max     Temp_avg     Temp_min  
 count  1550.000000  1550.000000  1550.000000  1550.000000  1550.000000   
 mean     20.681241    69.522581    81.266452    71.731871    63.316129   
 std      13.005544    48.926439    12.718255    12.485566    13.609492   
 min       0.000000     0.000000    34.000000    27.600000     0.000000   
 25%      10.493250    24.000000    74.000000    63.125000    52.250000   
 50%      16.312000    72.000000    84.000000    74.300000    67.000000   
 75%      29.883500   120.000000    92.000000    82.000000    75.000000   
 max      78.225300   144.000000   101.000000    90.000000    83.000000   
            Dew_max      Dew_avg      Dew_min      Hum_max      Hum_avg  
 count  1550.000000  1550.000000  1550.000000  1550.000000  1550.000000   
 mean     66.601935    62.131226    56.921290    92.574839    74.374968   
 std      12.281647    13.628116    15.248287     7.659423    11.705409   
 min      18.000000    13.200000     0.000000    50.000000    31.500000   
 25%      61.000000    54.025000    45.250000    90.000000    67.425000   
 50%      70.000000    66.550000    62.000000    94.000000    75.000000   
 75%      76.000000    73.400000    70.000000    97.000000    82.700000   
 max      83.000000    79.800000    78.000000   100.000000    99.900000   
            Hum_min     Wind_max     Wind_avg     Wind_min    Press_max  
 count  1550.000000  1550.000000  1550.000000  1550.000000  1550.000000   
 mean     51.216774    15.854839     7.736774     1.376774    29.973935   
 std      15.607830     4.843872     2.991458     2.501641     0.166044   
 min       0.000000     6.000000     1.600000     0.000000    29.500000   
 25%      41.000000    13.000000     5.700000     0.000000    29.900000   
 50%      50.000000    15.000000     7.300000     0.000000    29.900000   
 75%      61.000000    18.000000     9.400000     3.000000    30.100000   
 max      97.000000    39.000000    23.900000    17.000000    30.600000   
          Press_avg    Press_min     Precipit  
 count  1550.000000  1550.000000  1550.000000  
 mean     29.903613    29.813355     0.158052  
 std       0.160494     0.774077     0.658718  
 min      28.800000     0.000000     0.000000  
 25%      29.800000    29.700000     0.000000  
 50%      29.900000    29.800000     0.000000  
 75%      30.000000    29.900000     0.020000  
 max      30.600000    30.500000    13.430000  )

 

There are no missing values in our dataset, which is good news.

From the descriptive statistics, we can make a few observations:

  • The Power_kWh data has a wide range, from 0 to about 78.22 kWh. The instances of zero usage are worth investigating further; they could reflect outages or missing meter readings.
  • The weather data also shows a wide range of values. However, these are within expected ranges for weather data.
  • The Press_min column has a minimum value of 0, which likely indicates erroneous data, as atmospheric pressure should never be zero. One possible cleanup is sketched below.
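
If we assume the zero pressure readings are sensor dropouts rather than real observations (an assumption, not something the data source confirms), a minimal cleanup is to treat them as missing and interpolate:

import numpy as np
# Zero pressure (in inches of mercury) is physically implausible,
# so treat those readings as missing values (assumed sensor dropouts)
merged_data['Press_min'] = merged_data['Press_min'].replace(0, np.nan)
# Fill the gaps by linear interpolation between neighboring days
merged_data['Press_min'] = merged_data['Press_min'].interpolate()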

Exploratory Data Analysis and Visualization

Next, we conduct some exploratory data analysis (EDA) and data visualization. We’ll plot the distribution of daily power usage and some scatter plots to see the relationship between power usage and different weather variables.

import matplotlib.pyplot as plt
import seaborn as sns
# Set plot style
sns.set_style("whitegrid")
# Plot histogram of daily power usage
plt.figure(figsize=(10, 6))
sns.histplot(merged_data['Power_kWh'], bins=50, kde=True)
plt.title('Distribution of Daily Power Usage')
plt.xlabel('Power Usage (kWh)')
plt.ylabel('Frequency')
plt.show()
# Select some weather variables to check their relationship with power usage
weather_variables = ['Temp_avg', 'Dew_avg', 'Hum_avg', 'Wind_avg', 'Press_avg', 'Precipit']
# Create scatter plots of power usage vs. weather variables
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(18, 12))
axs = axs.flatten() # Flatten to 1D array
for i, var in enumerate(weather_variables):
    sns.scatterplot(data=merged_data, x=var, y='Power_kWh', ax=axs[i], alpha=0.6)
    axs[i].set_title(f'Power Usage vs. {var}')
plt.tight_layout()
plt.show()

The histogram of daily power usage shows a right-skewed distribution, with most of the values falling between 10 and 30 kWh. There are also a few instances of very high power usage, reaching up to around 78 kWh.

The scatter plots of power usage versus various weather variables suggest the following:

  • Temp_avg: There seems to be a positive relationship between average temperature and power usage. This could be due to increased use of air conditioning on hotter days.
  • Dew_avg: There’s no clear relationship between average dew point and power usage.
  • Hum_avg: There’s no clear relationship between average humidity and power usage.
  • Wind_avg: There’s no clear relationship between average wind speed and power usage.
  • Press_avg: There’s no clear relationship between average pressure and power usage.
  • Precipit: There’s no clear relationship between precipitation and power usage.

These relationships, or lack thereof, can be better quantified using correlation coefficients. Let’s compute those next. After that, we’ll create some lag features for our feature engineering step.
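
One way to compute these correlations is with pandas’ corr() (a minimal sketch using the weather_variables list defined above):

# Correlation of each average weather variable with daily power usage
correlations = merged_data[weather_variables + ['Power_kWh']].corr()['Power_kWh']
print(correlations.sort_values(ascending=False))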

The correlation coefficients confirm our observations from the scatter plots:

  • Temp_avg: There’s a strong positive correlation (0.62) with power usage.
  • Dew_avg: There’s a moderate positive correlation (0.55) with power usage.
  • Hum_avg, Precipit: These have almost no correlation with power usage.
  • Wind_avg, Press_avg: These have weak negative correlations (-0.18 and -0.25 respectively) with power usage.

# Create lag features for power usage
merged_data['Power_lag1'] = merged_data['Power_kWh'].shift(1)
merged_data['Power_lag2'] = merged_data['Power_kWh'].shift(2)
merged_data['Power_lag3'] = merged_data['Power_kWh'].shift(3)
# Drop the first rows, where the lag values are undefined (NaN)
merged_data = merged_data.dropna().reset_index(drop=True)

As part of feature engineering, we’ve added three new features: Power_lag1, Power_lag2, and Power_lag3, representing the power usage of the previous 1, 2, and 3 days respectively. This is based on the intuition that power usage is auto-correlated: the usage on a given day is influenced by usage on the preceding days. We also drop the first three rows, where the lag values are undefined. Lag features like these can be very useful for time series forecasting models.

Time Series Decomposition and Stationarity Test

Time series decomposition allows us to observe the trend and seasonality in the power usage data, separate from the random fluctuations. We decompose the power usage time series into trend, seasonal, and residual components.

Let’s decompose our daily power usage time series and visualize the components. We’ll use additive decomposition first, as it’s the simplest and most commonly used method. If the residuals show a pattern, we might need to switch to multiplicative decomposition. For daily data like ours, a common choice for the seasonal period is 7 (representing a weekly cycle). However, this might not be the best choice for all datasets, as the appropriate seasonal period can depend on the specific characteristics of the data.
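
A minimal sketch of the decomposition with statsmodels, assuming an additive model and a weekly (7-day) seasonal period, looks like this:

from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose daily power usage into trend, seasonal, and residual components
decomposition = seasonal_decompose(merged_data['Power_kWh'], model='additive', period=7)
# Plot the observed, trend, seasonal, and residual components
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.show()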

The time series decomposition has four components:

  1. Observed: This is the original time series.
  2. Trend: This shows the overall trend in the data. We can see a somewhat stable trend in power usage over the years, with some fluctuations.
  3. Seasonal: This shows the seasonal variation in the data. We can clearly see a repeating pattern every 7 days, which indicates a weekly cycle.
  4. Residual: This is what’s left after removing the trend and seasonal components from the original time series. Ideally, the residuals should look like white noise, i.e., they should be random and have no discernible pattern. In our case, the residuals show some patterns, suggesting that there might be some information that is not captured by the trend and seasonal components.

We also test the power usage time series for stationarity using the Augmented Dickey-Fuller (ADF) test. Stationarity is an important characteristic of time series data that most time series forecasting models require. A time series is stationary if its statistical properties, such as mean and variance, are constant over time.

from statsmodels.tsa.stattools import adfuller
# Perform ADF test
adf_result = adfuller(merged_data['Power_kWh'])
# Print test statistic and p-value
adf_statistic = adf_result[0]
adf_pvalue = adf_result[1]
adf_statistic, adf_pvalue

RESULT

(-3.358494489457015, 0.012457836898292775)

The ADF test statistic is -3.36 and the p-value is 0.012. Since the p-value is less than 0.05, we can reject the null hypothesis that the series has a unit root. This suggests that our power usage time series is stationary.

This result is useful for selecting a forecasting model. Many time series forecasting models, such as ARIMA, assume that the underlying data are stationary. This assumption implies that the properties of the time series do not change over time, i.e., the mean, variance, and autocorrelation structure remain constant.

If the data were not stationary, we would need to apply some transformations to make it stationary before using these models. Common transformations include differencing, logarithmic transformation, or square root transformation. In our case, since the data appear to be stationary, we can proceed without these transformations.
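
For illustration, had the test failed to reject the null hypothesis, first-order differencing would be the usual starting point (a hypothetical snippet; our data don’t need it):

# First-order differencing: subtract the previous day's value
# (only needed if the series were non-stationary)
power_diff = merged_data['Power_kWh'].diff().dropna()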

However, it’s important to note that even if the overall time series is stationary, there may still be some patterns or structures in the data that these tests cannot detect. For example, there could be patterns that change over different time periods or patterns that depend on external variables. Therefore, we should always perform additional exploratory analysis and feature engineering to capture these patterns, as we have done.

Let’s proceed to the next steps, which include splitting the data into training and test sets, selecting the features to use for modeling and choosing a model for forecasting.


Model Building and Evaluation

We split the data into a training set and a test set, select our features, and then choose a model for forecasting.

# Split the data into training and test sets
# We'll use the last 365 days of data for testing
train_data = merged_data[:-365]
test_data = merged_data[-365:]
# Select features for modeling
# We'll use the weather variables, the day of week, and the lag features
feature_columns = weather_variables + ['day_of_week', 'Power_lag1', 'Power_lag2', 'Power_lag3']
# Build the feature matrices and target vectors
X_train, y_train = train_data[feature_columns], train_data['Power_kWh']
X_test, y_test = test_data[feature_columns], test_data['Power_kWh']

The data has been split into training and test sets. The training set contains 1,182 observations and the test set contains 365 observations. We’re using the last 365 days of data for testing, which corresponds to approximately the last year of observations.

For our features, we’ve selected the weather variables (Temp_avg, Dew_avg, Hum_avg, Wind_avg, Press_avg, Precipit), the day_of_week variable, and the lag features (Power_lag1, Power_lag2, Power_lag3) that we created earlier.

Next, we’ll proceed to the model building phase. There are many models that can be used for time series forecasting, ranging from simple models like ARIMA to more complex models like LSTM.

Given the characteristics of our data, we recommend XGBoost for this task. XGBoost is a powerful, flexible, and efficient implementation of the gradient boosting algorithm. It can handle multiple input features, capture complex relationships, and automatically model nonlinearities and interactions between variables. To evaluate the model’s performance, we’ll use root mean squared error (RMSE), a common metric for regression tasks like ours; it measures the typical magnitude of the model’s prediction errors.

Let’s proceed with the model building using XGBoost.

from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Initialize XGBoost model
xgb_model = XGBRegressor(n_estimators=300, learning_rate=0.1)
# Train the model
xgb_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = xgb_model.predict(X_test)
# Compute RMSE of the predictions
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse

RESULT

13.541792715203322

The XGBoost model has been trained and evaluated. The root mean squared error (RMSE) on the test set is approximately 13.54 kWh, meaning the model’s predictions typically deviate from the actual values by about 13.54 kWh, with larger errors weighted more heavily.

This is a starting point for our forecasting task. We could potentially improve the model’s performance by tuning its hyperparameters, using a more complex model, or engineering additional features. However, these steps would require more computational resources and time.
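
As an illustration of the hyperparameter tuning step, here is a minimal sketch using scikit-learn’s GridSearchCV with time-ordered splits (the grid values are arbitrary examples, not tuned recommendations):

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor
# Search a small grid of common XGBoost hyperparameters,
# using time-ordered splits so we never train on future data
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
}
search = GridSearchCV(
    XGBRegressor(),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring='neg_root_mean_squared_error',
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)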

Feature Importance

We can also examine the importance of each feature in the model. This provides insight into which variables are most influential in predicting power usage.

# Extract feature importances
feature_importances = xgb_model.feature_importances_
# Create a dataframe for visualization
importances_df = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': feature_importances
})
# Sort the dataframe by importance
importances_df = importances_df.sort_values(by='Importance', ascending=False)
# Plot the feature importances
plt.figure(figsize=(10, 6))
sns.barplot(data=importances_df, x='Importance', y='Feature', color='skyblue')
plt.title('Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

The bar plot above shows the importance of each feature in the XGBoost model.

Here are a few observations:

  • Power_lag1, Power_lag2, and Power_lag3 are the most important features. This suggests that the power usage of the previous 1 to 3 days is highly influential in predicting the power usage of the current day. This is consistent with the intuition that power usage is likely to be auto-correlated, i.e., the power usage of a day is influenced by the power usage of previous days.
  • Among the weather variables, Temp_avg and Hum_avg have the highest importance. For Temp_avg this matches the strong positive correlation we measured earlier; Hum_avg showed almost no linear correlation, so its importance suggests the model is picking up a nonlinear relationship or interactions with other variables.
  • The day_of_week variable also has some importance, suggesting that the day of the week might have some influence on power usage. This could reflect weekly patterns in power usage, such as differences between weekdays and weekends.

Conclusion

  1. Model Selection: While the XGBoost model we used performed reasonably well, there are many other models we could try, such as ARIMA, SARIMA, or LSTM. These models might capture different patterns in the data and could potentially improve the forecasting accuracy.

  2. Hyperparameter Tuning: We can fine-tune the parameters of our XGBoost model (or any other model we choose) to further improve its performance. This involves systematically searching for the combination of parameters that produces the best results.

  3. Feature Engineering: We could create additional features to help improve the model’s performance. For example, we could create more lag features, rolling window features (e.g., rolling mean or standard deviation), or interaction terms between the most important features; a rolling-feature sketch follows this list.

  4. Model Evaluation: We should continue to evaluate our model’s performance on new data over time. This can help us detect if the model’s performance is degrading and if it needs to be retrained or updated.

  5. Error Analysis: We can analyze the instances where the model makes large errors to understand why these errors occur and how we might improve the model.

  6. Monitoring and Updating the Model: Once the model is deployed, it’s important to monitor its performance and update or retrain it as needed. This is because the patterns in the data might change over time, which could cause the model’s performance to degrade.
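
For instance, the rolling window features mentioned in point 3 could be added along these lines (a sketch; the 7-day window is an arbitrary example):

# 7-day rolling mean and standard deviation of power usage,
# shifted by one day so each row only uses past information
merged_data['Power_roll7_mean'] = merged_data['Power_kWh'].shift(1).rolling(7).mean()
merged_data['Power_roll7_std'] = merged_data['Power_kWh'].shift(1).rolling(7).std()

Shifting by one day before computing the window keeps each feature strictly backward-looking, so the model never sees information from the day it is trying to predict.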

Remember that model building is an iterative process. It often involves trying out different models, tuning their parameters, engineering features, and evaluating their performance. With each iteration, we learn more about the problem and improve our solution.

You can get the Jupyter notebook and dataset here.
