Electricity usage patterns are influenced by many factors, including weather conditions, time of day, day of the week, and calendar effects such as holidays. An effective forecasting model therefore needs to account for these factors. This article walks through the process of building a time series forecasting model from weather data and electricity usage data. Before proceeding, we recommend going through this article for an overview of time series concepts; it will make what follows easier to understand.
We start with two datasets: the weather data (weather_data) and the power usage data (power_usage_data). The weather data includes daily observations of various weather conditions, while the power usage data includes hourly observations of power consumption.
# First, let's load the data and inspect it.
import pandas as pd
# Load power data
power_data = pd.read_csv('/mnt/data/power_usage_2016_to_2020.csv')
# Load weather data
weather_data = pd.read_csv('/mnt/data/weather_2016_2020_daily.csv')
The power data contains the following columns:
- StartDate: The start date and hour of the power usage.
- Value (kWh): The amount of power used in kilowatt-hours.
- day_of_week: The day of the week, where Monday is 0 and Sunday is 6.
- notes: Categorization of the day as either ‘weekday’ or ‘weekend’.
The weather data contains the following columns:
- Date: The date of the weather observation.
- Day: This might be a day counter, but we’ll need to confirm.
- Temp_max, Temp_avg, Temp_min: The maximum, average, and minimum temperatures for the day.
- Dew_max, Dew_avg, Dew_min: The maximum, average, and minimum dew points for the day.
- Hum_max, Hum_avg, Hum_min: The maximum, average, and minimum humidity levels for the day.
- Wind_max, Wind_avg, Wind_min: The maximum, average, and minimum wind speeds for the day.
- Press_max, Press_avg, Press_min: The maximum, average, and minimum pressure levels for the day.
- Precipit: The amount of precipitation for the day.
- day_of_week: The day of the week, where Monday is 0 and Sunday is 6.
To prepare the data for forecasting, we’ll perform the following steps:
- Convert StartDate in the power data and Date in the weather data to datetime format.
- Aggregate the hourly power usage data to daily totals to match the frequency of the weather data.
- Merge the power and weather datasets on the date.
The following code carries out all three steps:
# Convert 'StartDate' and 'Date' to datetime format
power_data['StartDate'] = pd.to_datetime(power_data['StartDate'])
weather_data['Date'] = pd.to_datetime(weather_data['Date'])
# Aggregate hourly power usage to daily totals
power_data_daily = power_data.resample('D', on='StartDate').sum()
power_data_daily.reset_index(inplace=True)
# Merge the daily power data with the weather data on the date
merged_data = pd.merge(power_data_daily, weather_data, left_on='StartDate', right_on='Date', how='inner')
# Drop redundant columns and rename the rest for clarity
merged_data.drop(columns=['Day', 'day_of_week_y'], inplace=True)
merged_data.rename(columns={'day_of_week_x': 'day_of_week', 'Value (kWh)': 'Power_kWh'}, inplace=True)
Now that we have a merged dataset, we can proceed with further preprocessing: checking for missing values, examining summary statistics, and, later, splitting the data into training and test sets.
# Check for missing values
missing_values = merged_data.isnull().sum()
# Descriptive statistics for numerical columns
desc_stats = merged_data.describe()
missing_values, desc_stats
RESULT
(StartDate      0
 Power_kWh      0
 day_of_week    0
 Date           0
 Temp_max       0
 Temp_avg       0
 Temp_min       0
 Dew_max        0
 Dew_avg        0
 Dew_min        0
 Hum_max        0
 Hum_avg        0
 Hum_min        0
 Wind_max       0
 Wind_avg       0
 Wind_min       0
 Press_max      0
 Press_avg      0
 Press_min      0
 Precipit       0
 dtype: int64,
          Power_kWh  day_of_week     Temp_max     Temp_avg     Temp_min
 count  1550.000000  1550.000000  1550.000000  1550.000000  1550.000000
 mean     20.681241    69.522581    81.266452    71.731871    63.316129
 std      13.005544    48.926439    12.718255    12.485566    13.609492
 min       0.000000     0.000000    34.000000    27.600000     0.000000
 25%      10.493250    24.000000    74.000000    63.125000    52.250000
 50%      16.312000    72.000000    84.000000    74.300000    67.000000
 75%      29.883500   120.000000    92.000000    82.000000    75.000000
 max      78.225300   144.000000   101.000000    90.000000    83.000000

           Dew_max      Dew_avg      Dew_min      Hum_max      Hum_avg
 count  1550.000000  1550.000000  1550.000000  1550.000000  1550.000000
 mean     66.601935    62.131226    56.921290    92.574839    74.374968
 std      12.281647    13.628116    15.248287     7.659423    11.705409
 min      18.000000    13.200000     0.000000    50.000000    31.500000
 25%      61.000000    54.025000    45.250000    90.000000    67.425000
 50%      70.000000    66.550000    62.000000    94.000000    75.000000
 75%      76.000000    73.400000    70.000000    97.000000    82.700000
 max      83.000000    79.800000    78.000000   100.000000    99.900000

           Hum_min     Wind_max     Wind_avg     Wind_min    Press_max
 count  1550.000000  1550.000000  1550.000000  1550.000000  1550.000000
 mean     51.216774    15.854839     7.736774     1.376774    29.973935
 std      15.607830     4.843872     2.991458     2.501641     0.166044
 min       0.000000     6.000000     1.600000     0.000000    29.500000
 25%      41.000000    13.000000     5.700000     0.000000    29.900000
 50%      50.000000    15.000000     7.300000     0.000000    29.900000
 75%      61.000000    18.000000     9.400000     3.000000    30.100000
 max      97.000000    39.000000    23.900000    17.000000    30.600000

         Press_avg    Press_min     Precipit
 count  1550.000000  1550.000000  1550.000000
 mean     29.903613    29.813355     0.158052
 std       0.160494     0.774077     0.658718
 min      28.800000     0.000000     0.000000
 25%      29.800000    29.700000     0.000000
 50%      29.900000    29.800000     0.000000
 75%      30.000000    29.900000     0.020000
 max      30.600000    30.500000    13.430000 )
There are no missing values in our dataset, which is good news.
From the descriptive statistics, we can make a few observations:
- The Power_kWh data has a wide range, from 0 to about 78.22 kWh. The instances of zero usage are worth investigating further.
- The day_of_week column ranges from 0 to 144 rather than 0 to 6: the daily resampling summed it along with the power values, so it should be recomputed from the date (see the sketch below).
- The weather data also shows a wide range of values, but these are within expected ranges for weather data.
- The Press_min column has a minimum value of 0, which likely indicates erroneous data, as atmospheric pressure (reported here in inches of mercury, around 29.9 on average) is never zero.
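The day_of_week repair is a one-liner; here is a minimal sketch, assuming the merged frame from the code above:
# Recompute day_of_week from the date, since resample(...).sum()
# summed the hourly values (hence the 0-144 range above)
merged_data['day_of_week'] = merged_data['StartDate'].dt.dayofweek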
Exploratory Data Analysis and Visualization
Next, we conduct some exploratory data analysis (EDA) and data visualization. We’ll plot the distribution of daily power usage and some scatter plots to see the relationship between power usage and different weather variables.
import matplotlib.pyplot as plt
import seaborn as sns
# Set plot style
sns.set_style("whitegrid")
# Plot histogram of daily power usage
plt.figure(figsize=(10, 6))
sns.histplot(merged_data['Power_kWh'], bins=50, kde=True)
plt.title('Distribution of Daily Power Usage')
plt.xlabel('Power Usage (kWh)')
plt.ylabel('Frequency')
plt.show()
# Select some weather variables to check their relationship with power usage
weather_variables = ['Temp_avg', 'Dew_avg', 'Hum_avg', 'Wind_avg', 'Press_avg', 'Precipit']
# Create scatter plots of power usage vs. weather variables
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(18, 12))
axs = axs.flatten() # Flatten to 1D array
for i, var in enumerate(weather_variables):
sns.scatterplot(data=merged_data, x=var, y='Power_kWh', ax=axs[i], alpha=0.6)
axs[i].set_title(f'Power Usage vs. {var}')
plt.tight_layout()
plt.show()
The histogram of daily power usage shows a right-skewed distribution, with most of the values falling between 10 and 30 kWh. There are also a few instances of very high power usage, reaching up to around 78 kWh.
The scatter plots of power usage versus various weather variables suggest the following:
- Temp_avg: There seems to be a positive relationship between average temperature and power usage, likely due to increased use of air conditioning on hotter days.
- Dew_avg: There’s no clear relationship between average dew point and power usage.
- Hum_avg: There’s no clear relationship between average humidity and power usage.
- Wind_avg: There’s no clear relationship between average wind speed and power usage.
- Press_avg: There’s no clear relationship between average pressure and power usage.
- Precipit: There’s no clear relationship between precipitation and power usage.
These relationships, or lack thereof, can be better quantified using correlation coefficients. Let’s compute those next. After that, we’ll create some lag features for our feature engineering step.
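One way to compute these coefficients (a minimal sketch using the merged frame and the weather_variables list defined above):
# Correlation of each weather variable with daily power usage
correlations = merged_data[weather_variables + ['Power_kWh']].corr()['Power_kWh'].drop('Power_kWh')
print(correlations.sort_values(ascending=False))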
The correlation coefficients largely confirm our observations from the scatter plots:
- Temp_avg: There’s a strong positive correlation (0.62) with power usage.
- Dew_avg: There’s a moderate positive correlation (0.55) with power usage, stronger than the scatter plot suggested; dew point tends to track temperature.
- Hum_avg, Precipit: These have almost no correlation with power usage.
- Wind_avg, Press_avg: These have weak negative correlations (-0.18 and -0.25, respectively) with power usage.
# Create lag features for power usage
merged_data['Power_lag1'] = merged_data['Power_kWh'].shift(1)
merged_data['Power_lag2'] = merged_data['Power_kWh'].shift(2)
merged_data['Power_lag3'] = merged_data['Power_kWh'].shift(3)
As part of feature engineering, we’ve added three new features: Power_lag1, Power_lag2, and Power_lag3, which represent the power usage of the previous 1, 2, and 3 days, respectively. This is based on the intuition that power usage might be auto-correlated, i.e., the power usage of a day might be influenced by the power usage of previous days. Such lag features can be very useful for time series forecasting models.
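We can sanity-check this intuition by plotting the autocorrelation function of daily usage; a short sketch using statsmodels (the 30-lag window is an arbitrary choice):
from statsmodels.graphics.tsaplots import plot_acf
# Significant bars at small lags support using the previous
# days' usage as predictive features
plot_acf(merged_data['Power_kWh'], lags=30)
plt.show()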
Time Series Decomposition and Stationarity Test
Time series decomposition allows us to observe the trend and seasonality in the power usage data, separate from the random fluctuations. We decompose the power usage time series into trend, seasonal, and residual components.
Let’s decompose our daily power usage time series and visualize the components. We’ll use additive decomposition first, as it’s the simplest and most commonly used method. If the residuals show a pattern, we might need to switch to multiplicative decomposition. For daily data like ours, a common choice for the seasonal period is 7 (representing a weekly cycle). However, this might not be the best choice for all datasets, as the appropriate seasonal period can depend on the specific characteristics of the data.
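A minimal sketch of the decomposition using statsmodels’ seasonal_decompose, assuming the merged frame from earlier:
from statsmodels.tsa.seasonal import seasonal_decompose
# Additive decomposition with a weekly (7-day) seasonal period
decomposition = seasonal_decompose(merged_data.set_index('StartDate')['Power_kWh'],
                                   model='additive', period=7)
fig = decomposition.plot()
fig.set_size_inches(10, 8)
plt.show()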
The time series decomposition has four components:
- Observed: This is the original time series.
- Trend: This shows the overall trend in the data. We can see a somewhat stable trend in power usage over the years, with some fluctuations.
- Seasonal: This shows the seasonal variation in the data. We can clearly see a repeating pattern every 7 days, which indicates a weekly cycle.
- Residual: This is what’s left after removing the trend and seasonal components from the original time series. Ideally, the residuals should look like white noise, i.e., they should be random and have no discernible pattern. In our case, the residuals show some patterns, suggesting that there might be some information that is not captured by the trend and seasonal components.
We also test the power usage time series for stationarity using the Augmented Dickey-Fuller (ADF) test. Stationarity is an important characteristic of time series data that most time series forecasting models require. A time series is stationary if its statistical properties, such as mean and variance, are constant over time.
from statsmodels.tsa.stattools import adfuller
# Perform ADF test
adf_result = adfuller(merged_data['Power_kWh'])
# Print test statistic and p-value
adf_statistic = adf_result[0]
adf_pvalue = adf_result[1]
adf_statistic, adf_pvalue
RESULT
(-3.358494489457015, 0.012457836898292775)
The ADF test statistic is -3.36 and the p-value is 0.012. Since the p-value is less than 0.05, we can reject the null hypothesis that the series has a unit root. This suggests that our power usage time series is stationary.
This result is useful for selecting a forecasting model. Many time series forecasting models, such as ARIMA, assume that the underlying data are stationary. This assumption implies that the properties of the time series do not change over time, i.e., the mean, variance, and autocorrelation structure remain constant.
If the data were not stationary, we would need to apply some transformations to make it stationary before using these models. Common transformations include differencing, logarithmic transformation, or square root transformation. In our case, since the data appear to be stationary, we can proceed without these transformations.
However, it’s important to note that even if the overall time series is stationary, there may still be some patterns or structures in the data that these tests cannot detect. For example, there could be patterns that change over different time periods or patterns that depend on external variables. Therefore, we should always perform additional exploratory analysis and feature engineering to capture these patterns, as we have done.
Let’s proceed to the next steps, which include splitting the data into training and test sets, selecting the features to use for modeling and choosing a model for forecasting.
Model Building and Evaluation
We split the data into a training set and a test set, select our features, and then choose a model for forecasting.
# Drop the first rows, whose lag features are undefined (NaN)
merged_data.dropna(inplace=True)
# Split the data into training and test sets
# We'll use the last 365 days of data for testing
train_data = merged_data[:-365]
test_data = merged_data[-365:]
# Select features for modeling
# We'll use the weather variables, the day of week, and the lag features
feature_columns = weather_variables + ['day_of_week', 'Power_lag1', 'Power_lag2', 'Power_lag3']
# Build feature matrices and targets for modeling
X_train, y_train = train_data[feature_columns], train_data['Power_kWh']
X_test, y_test = test_data[feature_columns], test_data['Power_kWh']
The data has been split into training and test sets. The training set contains 1,182 observations (after dropping the first three rows, whose lag features are undefined) and the test set contains 365 observations, corresponding to approximately the last year of data.
For our features, we’ve selected the weather variables (Temp_avg, Dew_avg, Hum_avg, Wind_avg, Press_avg, Precipit), the day_of_week variable, and the lag features (Power_lag1, Power_lag2, Power_lag3) that we created earlier.
Next, we’ll proceed to the model building phase. There are many models that can be used for time series forecasting, ranging from simple models like ARIMA to more complex models like LSTM.
Given the characteristics of our data, we recommend using the XGBoost model for this task. XGBoost is a powerful, flexible, and efficient implementation of the gradient boosting algorithm. It can handle multiple input features, capture complex relationships, and automatically model nonlinearities and interactions between variables. To evaluate the model’s performance, we’ll use root mean squared error (RMSE), a common metric for regression tasks like ours that measures the average magnitude of the model’s prediction errors.
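For reference, with y_i the actual usage, ŷ_i the model’s prediction, and n the number of test days:
RMSE = sqrt( (1/n) * Σ (y_i − ŷ_i)² )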
Let’s proceed with the model building using XGBoost.
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Initialize XGBoost model
xgb_model = XGBRegressor(n_estimators=300, learning_rate=0.1)
# Train the model
xgb_model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = xgb_model.predict(X_test)
# Compute RMSE of the predictions
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
rmse
RESULT
13.541792715203322
The XGBoost model has been successfully trained and evaluated. The root mean squared error (RMSE) on the test set is approximately 13.54. This means that on average, our model’s predictions are about 13.54 kWh off from the actual values.
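To see where the errors occur, we can plot predictions against actual values over the test year; a short sketch using the split and predictions from above:
# Compare predicted and actual daily usage across the test period
plt.figure(figsize=(12, 6))
plt.plot(test_data['StartDate'], y_test, label='Actual', alpha=0.7)
plt.plot(test_data['StartDate'], y_pred, label='Predicted', alpha=0.7)
plt.title('Actual vs. Predicted Daily Power Usage (Test Set)')
plt.xlabel('Date')
plt.ylabel('Power Usage (kWh)')
plt.legend()
plt.show()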
This is a starting point for our forecasting task. We could potentially improve the model’s performance by tuning its hyperparameters, using a more complex model, or engineering additional features. However, these steps would require more computational resources and time.
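As an illustration of hyperparameter tuning, one possible approach is a grid search with time-ordered cross-validation folds; the grid values below are arbitrary starting points, not recommendations:
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
# A small, illustrative parameter grid
param_grid = {
    'n_estimators': [100, 300, 500],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
}
# TimeSeriesSplit keeps each validation fold after its training fold,
# which respects the chronological order of the data
search = GridSearchCV(XGBRegressor(), param_grid,
                      cv=TimeSeriesSplit(n_splits=3),
                      scoring='neg_root_mean_squared_error')
search.fit(X_train, y_train)
print(search.best_params_)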
Feature Importance
We can also examine the importance of each feature in the model. It will provide some insights into which variables are most influential in predicting power usage.
# Extract feature importances
feature_importances = xgb_model.feature_importances_
# Create a dataframe for visualization
importances_df = pd.DataFrame({
'Feature': feature_columns,
'Importance': feature_importances
})
# Sort the dataframe by importance
importances_df = importances_df.sort_values(by='Importance', ascending=False)
# Plot the feature importances
plt.figure(figsize=(10, 6))
sns.barplot(data=importances_df, x='Importance', y='Feature', color='skyblue')
plt.title('Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
The bar plot above shows the importance of each feature in the XGBoost model. Here are a few observations:
- Power_lag1, Power_lag2, and Power_lag3 are the most important features. This suggests that the power usage of the previous 1 to 3 days is highly influential in predicting the current day’s usage, consistent with the intuition that power usage is auto-correlated.
- Among the weather variables, Temp_avg and Hum_avg have the highest importance. For Temp_avg this matches the strong positive correlation we found earlier; Hum_avg’s importance despite its weak linear correlation suggests the model may be capturing a nonlinear effect or an interaction with other variables.
- The day_of_week variable also has some importance, suggesting that the day of the week influences power usage. This could reflect weekly patterns, such as differences between weekdays and weekends.
Conclusion
- Model Selection: While the XGBoost model we used performed reasonably well, there are many other models we could try, such as ARIMA, SARIMA, or LSTM. These models might capture different patterns in the data and could potentially improve the forecasting accuracy.
- Hyperparameter Tuning: We can fine-tune the parameters of our XGBoost model (or any other model we choose) to further improve its performance. This involves systematically searching for the combination of parameters that produces the best results.
- Feature Engineering: We could create additional features to help improve the model’s performance, for example more lag features, rolling window features such as a rolling mean or standard deviation (see the sketch after this list), or interaction terms between the most important features.
- Model Evaluation: We should continue to evaluate the model’s performance on new data over time. This can help us detect whether performance is degrading and whether the model needs to be retrained or updated.
- Error Analysis: We can analyze the instances where the model makes large errors to understand why they occur and how we might improve the model.
- Monitoring and Updating the Model: Once the model is deployed, it’s important to monitor its performance and update or retrain it as needed, because the patterns in the data may change over time.
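As an illustration of the rolling-window idea above, here is a minimal sketch; the 7-day window and the feature names are arbitrary choices:
# Rolling-window features of past power usage; shift(1) ensures each
# feature only uses days strictly before the one being predicted
merged_data['Power_roll_mean7'] = merged_data['Power_kWh'].shift(1).rolling(7).mean()
merged_data['Power_roll_std7'] = merged_data['Power_kWh'].shift(1).rolling(7).std()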
Remember that model building is an iterative process. It often involves trying out different models, tuning their parameters, engineering features, and evaluating their performance. With each iteration, we learn more about the problem and improve our solution.
You can get the Jupyter notebook and dataset here.