Building A Time Series Forecasting Model For Electricity Usage

light bulb, light, idea-3535435.jpg

Electricity usage patterns are influenced by various factors, including weather conditions, time of day, day of the week and seasonal factors such as holidays. Therefore, to build an effective forecasting model for electricity usage, we need to consider these factors. This article will guide you through the process of building a time series forecasting model using weather data and electricity usage data. But before proceeding, we recommend that you go through this article to gain an overview of time series concepts. It will help you better understand the following.

We start with two datasets: the weather data (weather_data) and the power usage data (power_usage_data). The weather data includes daily observations of various weather conditions, while the power usage data includes hourly observations of power consumption.

# First, let's load the data and inspect it.
import pandas as pd
# Load power data
power_data = pd.read_csv('/mnt/data/power_usage_2016_to_2020.csv')
# Load weather data
weather_data = pd.read_csv('/mnt/data/weather_2016_2020_daily.csv')

 

The power data contains the following columns:

  1. StartDate: The start date and hour of the power usage.
  2. Value (kWh): The amount of power used in kilowatt hours.
  3. day_of_week: The day of the week, where Monday is 0 and Sunday is 6.
  4. notes: Categorization of the day as either ‘weekday’ or ‘weekend’.

The weather data contains the following columns:

  1. Date: The date of the weather data.
  2. Day: This might be a day counter, but we’ll need to confirm.
  3. Temp_max, Temp_avg, Temp_min: The maximum, average, and minimum temperatures for the day.
  4. Dew_max, Dew_avg, Dew_min: The maximum, average, and minimum dew points for the day.
  5. Hum_max, Hum_avg, Hum_min: The maximum, average, and minimum humidity levels for the day.
  6. Wind_max, Wind_avg, Wind_min: The maximum, average, and minimum wind speeds for the day.
  7. Press_max, Press_avg, Press_min: The maximum, average, and minimum pressure levels for the day.
  8. Precipit: The amount of precipitation for the day.
  9. day_of_week: The day of the week, where Monday is 0 and Sunday is 6.

To prepare the data for forecasting, we’ll perform the following steps:

  1. Convert StartDate in the power data and Date in the weather data to datetime format.
  2. Aggregate the hourly power usage data to daily data to match the frequency of the weather data.
  3. Merge the power and weather datasets on the date.

In the next step we will merge the power data with weather data

# Convert 'StartDate' and 'Date' to datetime format
power_data['StartDate'] = pd.to_datetime(power_data['StartDate'])
weather_data['Date'] = pd.to_datetime(weather_data['Date'])
power_data_daily = power_data.resample('D', on='StartDate').sum()
power_data_daily.reset_index(inplace=True)
merged_data = pd.merge(power_data_daily, weather_data, left_on='StartDate', right_on='Date', how='inner')
merged_data.drop(columns=['Day', 'day_of_week_y'], inplace=True)
merged_data.rename(columns={'day_of_week_x': 'day_of_week', 'Value (kWh)': 'Power_kWh'}, inplace=True)

Now that we have a merged dataset, we can proceed with further preprocessing, such as checking for missing values, normalizing the data, and splitting the data into training and test sets.

# Check for missing values
missing_values = merged_data.isnull().sum()
# Descriptive statistics for numerical columns
desc_stats = merged_data.describe()
missing_values, desc_stats

RESULT

(StartDate      0
 Power_kWh      0
 day_of_week    0
 Date           0
 Temp_max       0
 Temp_avg       0
 Temp_min       0
 Dew_max        0
 Dew_avg        0
 Dew_min        0
 Hum_max        0
 Hum_avg        0
 Hum_min        0
 Wind_max       0
 Wind_avg       0
 Wind_min       0
 Press_max      0
 Press_avg      0
 Press_min      0
 Precipit       0
 dtype: int64,
          Power_kWh  day_of_week     Temp_max     Temp_avg     Temp_min  
 count  1550.000000  1550.000000  1550.000000  1550.000000  1550.000000   
 mean     20.681241    69.522581    81.266452    71.731871    63.316129   
 std      13.005544    48.926439    12.718255    12.485566    13.609492   
 min       0.000000     0.000000    34.000000    27.600000     0.000000   
 25%      10.493250    24.000000    74.000000    63.125000    52.250000   
 50%      16.312000    72.000000    84.000000    74.300000    67.000000   
 75%      29.883500   120.000000    92.000000    82.000000    75.000000   
 max      78.225300   144.000000   101.000000    90.000000    83.000000   
            Dew_max      Dew_avg      Dew_min      Hum_max      Hum_avg  
 count  1550.000000  1550.000000  1550.000000  1550.000000  1550.000000   
 mean     66.601935    62.131226    56.921290    92.574839    74.374968   
 std      12.281647    13.628116    15.248287     7.659423    11.705409   
 min      18.000000    13.200000     0.000000    50.000000    31.500000   
 25%      61.000000    54.025000    45.250000    90.000000    67.425000   
 50%      70.000000    66.550000    62.000000    94.000000    75.000000   
 75%      76.000000    73.400000    70.000000    97.000000    82.700000   
 max      83.000000    79.800000    78.000000   100.000000    99.900000   
            Hum_min     Wind_max     Wind_avg     Wind_min    Press_max  
 count  1550.000000  1550.000000  1550.000000  1550.000000  1550.000000   
 mean     51.216774    15.854839     7.736774     1.376774    29.973935   
 std      15.607830     4.843872     2.991458     2.501641     0.166044   
 min       0.000000     6.000000     1.600000     0.000000    29.500000   
 25%      41.000000    13.000000     5.700000     0.000000    29.900000   
 50%      50.000000    15.000000     7.300000     0.000000    29.900000   
 75%      61.000000    18.000000     9.400000     3.000000    30.100000   
 max      97.000000    39.000000    23.900000    17.000000    30.600000   
          Press_avg    Press_min     Precipit  
 count  1550.000000  1550.000000  1550.000000  
 mean     29.903613    29.813355     0.158052  
 std       0.160494     0.774077     0.658718  
 min      28.800000     0.000000     0.000000  
 25%      29.800000    29.700000     0.000000  
 50%      29.900000    29.800000     0.000000  
 75%      30.000000    29.900000     0.020000  
 max      30.600000    30.500000    13.430000  )

 

There are no missing values in our dataset, which is good news.

From the descriptive statistics, we can make a few observations:

  • The Power_kWh data has a wide range, from 0 to about 78.22 kWh. We can investigate the instances of zero usage further.
  • The weather data also shows a wide range of values. However, these are within expected ranges for weather data.
  • The Press_min column has a minimum value of 0, which might indicate some erroneous data, as atmospheric pressure should never be zero.

Exploratory Data Analysis and Visualization

Next, we conduct some exploratory data analysis (EDA) and data visualization. We’ll plot the distribution of daily power usage and some scatter plots to see the relationship between power usage and different weather variables.

import matplotlib.pyplot as plt
import seaborn as sns
# Set plot style
sns.set_style("whitegrid")
# Plot histogram of daily power usage
plt.figure(figsize=(10, 6))
sns.histplot(merged_data['Power_kWh'], bins=50, kde=True)
plt.title('Distribution of Daily Power Usage')
plt.xlabel('Power Usage (kWh)')
plt.ylabel('Frequency')
plt.show()
# Select some weather variables to check their relationship with power usage
weather_variables = ['Temp_avg', 'Dew_avg', 'Hum_avg', 'Wind_avg', 'Press_avg', 'Precipit']
# Create scatter plots of power usage vs. weather variables
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(18, 12))
axs = axs.flatten() # Flatten to 1D array
for i, var in enumerate(weather_variables):
sns.scatterplot(data=merged_data, x=var, y='Power_kWh', ax=axs[i], alpha=0.6)
axs[i].set_title(f'Power Usage vs. {var}')
plt.tight_layout()
plt.show()

The histogram of daily power usage shows a right-skewed distribution, with most of the values falling between 10 and 30 kWh. There are also a few instances of very high power usage, reaching up to around 78 kWh.

The scatter plots of power usage versus various weather variables suggest the following:

  • Temp_avg: There seems to be a positive relationship between average temperature and power usage. This could be due to increased use of air conditioning on hotter days.
  • Dew_avg: There’s no clear relationship between average dew point and power usage.
  • Hum_avg: There’s no clear relationship between average humidity and power usage.
  • Wind_avg: There’s no clear relationship between average wind speed and power usage.
  • Press_avg: There’s no clear relationship between average pressure and power usage.
  • Precipit: There’s no clear relationship between precipitation and power usage.

These relationships, or lack thereof, can be better quantified using correlation coefficients. Let’s compute those next. After that, we’ll create some lag features for our feature engineering step.

The correlation coefficients confirm our observations from the scatter plots:

  • Temp_avg: There’s a strong positive correlation (0.62) with power usage.
  • Dew_avg: There’s a moderate positive correlation (0.55) with power usage.
  • Hum_avg, Precipit: These have almost no correlation with power usage.
  • Wind_avg, Press_avg: These have weak negative correlations (-0.18 and -0.25 respectively) with power usage.
# Create lag features for power usage
merged_data['Power_lag1'] = merged_data['Power_kWh'].shift(1)
merged_data['Power_lag2'] = merged_data['Power_kWh'].shift(2)
merged_data['Power_lag3'] = merged_data['Power_kWh'].shift(3)

As part of feature engineering, we’ve added three new features: Power_lag1, Power_lag2, and Power_lag3, which represent the power usage of the previous 1, 2, and 3 days respectively. This is based on the intuition that power usage might be auto-correlated, i.e., the power usage of a day might be influenced by the power usage of previous days. These lag features can be very useful for time series forecasting models.

Time Series Decomposition and Stationarity Test

Time series decomposition allows us to observe the trend and seasonality in the power usage data, separate from the random fluctuations. We decompose the power usage time series into trend, seasonal, and residual components.

Let’s decompose our daily power usage time series and visualize the components. We’ll use additive decomposition first, as it’s the simplest and most commonly used method. If the residuals show a pattern, we might need to switch to multiplicative decomposition. For daily data like ours, a common choice for the seasonal period is 7 (representing a weekly cycle). However, this might not be the best choice for all datasets, as the appropriate seasonal period can depend on the specific characteristics of the data.

The time series decomposition has four components:

  1. Observed: This is the original time series.
  2. Trend: This shows the overall trend in the data. We can see a somewhat stable trend in power usage over the years, with some fluctuations.
  3. Seasonal: This shows the seasonal variation in the data. We can clearly see a repeating pattern every 7 days, which indicates a weekly cycle.
  4. Residual: This is what’s left after removing the trend and seasonal components from the original time series. Ideally, the residuals should look like white noise, i.e., they should be random and have no discernible pattern. In our case, the residuals show some patterns, suggesting that there might be some information that is not captured by the trend and seasonal components.

We also test the power usage time series for stationarity using the Augmented Dickey-Fuller (ADF) test. Stationarity is an important characteristic of time series data that most time series forecasting models require. A time series is stationary if its statistical properties, such as mean and variance, are constant over time.

from statsmodels.tsa.stattools import adfuller
# Perform ADF test
adf_result = adfuller(merged_data['Power_kWh'])
# Print test statistic and p-value
adf_statistic = adf_result[0]
adf_pvalue = adf_result[1]
adf_statistic, adf_pvalue

RESULT