Electricity usage patterns are influenced by various factors, including weather conditions, time of day, day of the week and seasonal factors such as holidays. Therefore, to build an effective forecasting model for electricity usage, we need to consider these factors. This article will guide you through the process of building a time series forecasting model using weather data and electricity usage data. But before proceeding, we recommend that you go through this article to gain an overview of time series concepts. It will help you better understand the following.
We start with two datasets: the weather data (weather_data
) and the power usage data (power_usage_data
). The weather data includes daily observations of various weather conditions, while the power usage data includes hourly observations of power consumption.
# First, let's load the data and inspect it.
import pandas as pd
# Load power data
power_data = pd.read_csv('/mnt/data/power_usage_2016_to_2020.csv')
# Load weather data
weather_data = pd.read_csv('/mnt/data/weather_2016_2020_daily.csv')
The power data contains the following columns:
StartDate
: The start date and hour of the power usage.Value (kWh)
: The amount of power used in kilowatt hours.day_of_week
: The day of the week, where Monday is 0 and Sunday is 6.notes
: Categorization of the day as either ‘weekday’ or ‘weekend’.
The weather data contains the following columns:
Date
: The date of the weather data.Day
: This might be a day counter, but we’ll need to confirm.Temp_max
,Temp_avg
,Temp_min
: The maximum, average, and minimum temperatures for the day.Dew_max
,Dew_avg
,Dew_min
: The maximum, average, and minimum dew points for the day.Hum_max
,Hum_avg
,Hum_min
: The maximum, average, and minimum humidity levels for the day.Wind_max
,Wind_avg
,Wind_min
: The maximum, average, and minimum wind speeds for the day.Press_max
,Press_avg
,Press_min
: The maximum, average, and minimum pressure levels for the day.Precipit
: The amount of precipitation for the day.day_of_week
: The day of the week, where Monday is 0 and Sunday is 6.
To prepare the data for forecasting, we’ll perform the following steps:
- Convert
StartDate
in the power data andDate
in the weather data to datetime format. - Aggregate the hourly power usage data to daily data to match the frequency of the weather data.
- Merge the power and weather datasets on the date.
In the next step we will merge the power data with weather data
# Convert 'StartDate' and 'Date' to datetime format
power_data['StartDate'] = pd.to_datetime(power_data['StartDate'])
weather_data['Date'] = pd.to_datetime(weather_data['Date'])
power_data_daily = power_data.resample('D', on='StartDate').sum()
power_data_daily.reset_index(inplace=True)
merged_data = pd.merge(power_data_daily, weather_data, left_on='StartDate', right_on='Date', how='inner')
merged_data.drop(columns=['Day', 'day_of_week_y'], inplace=True)
merged_data.rename(columns={'day_of_week_x': 'day_of_week', 'Value (kWh)': 'Power_kWh'}, inplace=True)
Now that we have a merged dataset, we can proceed with further preprocessing, such as checking for missing values, normalizing the data, and splitting the data into training and test sets.
# Check for missing values
missing_values = merged_data.isnull().sum()
# Descriptive statistics for numerical columns
desc_stats = merged_data.describe()
missing_values, desc_stats
RESULT
(StartDate 0 Power_kWh 0 day_of_week 0 Date 0 Temp_max 0 Temp_avg 0 Temp_min 0 Dew_max 0 Dew_avg 0 Dew_min 0 Hum_max 0 Hum_avg 0 Hum_min 0 Wind_max 0 Wind_avg 0 Wind_min 0 Press_max 0 Press_avg 0 Press_min 0 Precipit 0 dtype: int64, Power_kWh day_of_week Temp_max Temp_avg Temp_min count 1550.000000 1550.000000 1550.000000 1550.000000 1550.000000 mean 20.681241 69.522581 81.266452 71.731871 63.316129 std 13.005544 48.926439 12.718255 12.485566 13.609492 min 0.000000 0.000000 34.000000 27.600000 0.000000 25% 10.493250 24.000000 74.000000 63.125000 52.250000 50% 16.312000 72.000000 84.000000 74.300000 67.000000 75% 29.883500 120.000000 92.000000 82.000000 75.000000 max 78.225300 144.000000 101.000000 90.000000 83.000000 Dew_max Dew_avg Dew_min Hum_max Hum_avg count 1550.000000 1550.000000 1550.000000 1550.000000 1550.000000 mean 66.601935 62.131226 56.921290 92.574839 74.374968 std 12.281647 13.628116 15.248287 7.659423 11.705409 min 18.000000 13.200000 0.000000 50.000000 31.500000 25% 61.000000 54.025000 45.250000 90.000000 67.425000 50% 70.000000 66.550000 62.000000 94.000000 75.000000 75% 76.000000 73.400000 70.000000 97.000000 82.700000 max 83.000000 79.800000 78.000000 100.000000 99.900000 Hum_min Wind_max Wind_avg Wind_min Press_max count 1550.000000 1550.000000 1550.000000 1550.000000 1550.000000 mean 51.216774 15.854839 7.736774 1.376774 29.973935 std 15.607830 4.843872 2.991458 2.501641 0.166044 min 0.000000 6.000000 1.600000 0.000000 29.500000 25% 41.000000 13.000000 5.700000 0.000000 29.900000 50% 50.000000 15.000000 7.300000 0.000000 29.900000 75% 61.000000 18.000000 9.400000 3.000000 30.100000 max 97.000000 39.000000 23.900000 17.000000 30.600000 Press_avg Press_min Precipit count 1550.000000 1550.000000 1550.000000 mean 29.903613 29.813355 0.158052 std 0.160494 0.774077 0.658718 min 28.800000 0.000000 0.000000 25% 29.800000 29.700000 0.000000 50% 29.900000 29.800000 0.000000 75% 30.000000 29.900000 0.020000 max 30.600000 30.500000 13.430000 )
There are no missing values in our dataset, which is good news.
From the descriptive statistics, we can make a few observations:
- The
Power_kWh
data has a wide range, from 0 to about 78.22 kWh. We can investigate the instances of zero usage further. - The weather data also shows a wide range of values. However, these are within expected ranges for weather data.
- The
Press_min
column has a minimum value of 0, which might indicate some erroneous data, as atmospheric pressure should never be zero.
Exploratory Data Analysis and Visualization
Next, we conduct some exploratory data analysis (EDA) and data visualization. We’ll plot the distribution of daily power usage and some scatter plots to see the relationship between power usage and different weather variables.
import matplotlib.pyplot as plt
import seaborn as sns
# Set plot style
sns.set_style("whitegrid")
# Plot histogram of daily power usage
plt.figure(figsize=(10, 6))
sns.histplot(merged_data['Power_kWh'], bins=50, kde=True)
plt.title('Distribution of Daily Power Usage')
plt.xlabel('Power Usage (kWh)')
plt.ylabel('Frequency')
plt.show()
# Select some weather variables to check their relationship with power usage
weather_variables = ['Temp_avg', 'Dew_avg', 'Hum_avg', 'Wind_avg', 'Press_avg', 'Precipit']
# Create scatter plots of power usage vs. weather variables
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(18, 12))
axs = axs.flatten() # Flatten to 1D array
for i, var in enumerate(weather_variables):
sns.scatterplot(data=merged_data, x=var, y='Power_kWh', ax=axs[i], alpha=0.6)
axs[i].set_title(f'Power Usage vs. {var}')
plt.tight_layout()
plt.show()
The histogram of daily power usage shows a right-skewed distribution, with most of the values falling between 10 and 30 kWh. There are also a few instances of very high power usage, reaching up to around 78 kWh.
The scatter plots of power usage versus various weather variables suggest the following:
Temp_avg
: There seems to be a positive relationship between average temperature and power usage. This could be due to increased use of air conditioning on hotter days.Dew_avg
: There’s no clear relationship between average dew point and power usage.Hum_avg
: There’s no clear relationship between average humidity and power usage.Wind_avg
: There’s no clear relationship between average wind speed and power usage.Press_avg
: There’s no clear relationship between average pressure and power usage.Precipit
: There’s no clear relationship between precipitation and power usage.
These relationships, or lack thereof, can be better quantified using correlation coefficients. Let’s compute those next. After that, we’ll create some lag features for our feature engineering step.
The correlation coefficients confirm our observations from the scatter plots:
Temp_avg
: There’s a strong positive correlation (0.62) with power usage.Dew_avg
: There’s a moderate positive correlation (0.55) with power usage.Hum_avg
,Precipit
: These have almost no correlation with power usage.Wind_avg
,Press_avg
: These have weak negative correlations (-0.18 and -0.25 respectively) with power usage.
# Create lag features for power usage
merged_data['Power_lag1'] = merged_data['Power_kWh'].shift(1)
merged_data['Power_lag2'] = merged_data['Power_kWh'].shift(2)
merged_data['Power_lag3'] = merged_data['Power_kWh'].shift(3)
As part of feature engineering, we’ve added three new features: Power_lag1
, Power_lag2
, and Power_lag3
, which represent the power usage of the previous 1, 2, and 3 days respectively. This is based on the intuition that power usage might be auto-correlated, i.e., the power usage of a day might be influenced by the power usage of previous days. These lag features can be very useful for time series forecasting models.
Time Series Decomposition and Stationarity Test
Time series decomposition allows us to observe the trend and seasonality in the power usage data, separate from the random fluctuations. We decompose the power usage time series into trend, seasonal, and residual components.
Let’s decompose our daily power usage time series and visualize the components. We’ll use additive decomposition first, as it’s the simplest and most commonly used method. If the residuals show a pattern, we might need to switch to multiplicative decomposition. For daily data like ours, a common choice for the seasonal period is 7 (representing a weekly cycle). However, this might not be the best choice for all datasets, as the appropriate seasonal period can depend on the specific characteristics of the data.
The time series decomposition has four components:
- Observed: This is the original time series.
- Trend: This shows the overall trend in the data. We can see a somewhat stable trend in power usage over the years, with some fluctuations.
- Seasonal: This shows the seasonal variation in the data. We can clearly see a repeating pattern every 7 days, which indicates a weekly cycle.
- Residual: This is what’s left after removing the trend and seasonal components from the original time series. Ideally, the residuals should look like white noise, i.e., they should be random and have no discernible pattern. In our case, the residuals show some patterns, suggesting that there might be some information that is not captured by the trend and seasonal components.
We also test the power usage time series for stationarity using the Augmented Dickey-Fuller (ADF) test. Stationarity is an important characteristic of time series data that most time series forecasting models require. A time series is stationary if its statistical properties, such as mean and variance, are constant over time.
from statsmodels.tsa.stattools import adfuller
# Perform ADF test
adf_result = adfuller(merged_data['Power_kWh'])
# Print test statistic and p-value
adf_statistic = adf_result[0]
adf_pvalue = adf_result[1]
adf_statistic, adf_pvalue
RESULT