⚡🔮Power-Outages-Prediction🔮⚡

This is a project for DSC80 at UCSD

Framing the Problem

Power outages can significantly disrupt households, businesses, communities, and critical infrastructure. The ability to forecast the severity of outages is therefore critical to ensuring the reliability of electrical grids. By developing predictive models, this project aims to provide insights needed to manage power outages and help affected populations. It uses a dataset created by the Laboratory for Advancing Sustainable Critical Infrastructure at Purdue University, which contains information about major power outages in the U.S. from January 2000 to July 2016. Previous exploratory data analysis on this dataset can be found here.

Prediction problem: Predict the severity of a major power outage in terms of its duration

This is a regression problem: the goal is to predict the duration of a power outage.

Response variable: ‘OUTAGE.DURATION’⏱️

The model will try to predict ‘OUTAGE.DURATION’, a continuous numerical variable that describes how long a power outage lasted (in minutes). This is chosen as the response variable because the severity of an outage can be characterized by its duration, which directly reflects the extent of the outage’s impact (i.e. outages that last longer are more “severe”).

Evaluation Metric: RMSE

Valid evaluation metrics for regression models include RMSE and R^2. RMSE measures how well a regression model predicts the response variable in absolute terms (here, in minutes), while R^2 does so in relative terms, as the proportion of variance explained. Because we want to evaluate the model’s ability to generalize to unseen data, RMSE is chosen over R^2 as the evaluation metric, since it gives a more direct assessment of how far predictions are likely to be from actual durations for unseen observations. The lower the RMSE, the better the predictions.
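As a minimal sketch of the metric, RMSE is the square root of the mean squared difference between actual and predicted durations, so it is expressed in the same units as the response (minutes). The helper name `rmse` and the sample values below are illustrative assumptions, not taken from the project code.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error, in the units of the response variable."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Hypothetical outage durations in minutes, for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print("RMSE (minutes):", rmse(actual, predicted))
```

Because the errors are squared before averaging, a few very long outages that the model misses badly can dominate the score.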

Data Cleaning:

The same data cleaning performed in previous EDA will be performed here:

Fix Formatting
- the first 4 rows are all NaN values
- the correct column names appear in row 5
- These rows will be dropped and column names will be reassigned.
Typecasting
- All columns are stored as strings, but it would make more sense for numerical information to be stored as floats.
- It would be preferable if the power outages start date and time were merged into one pd.Timestamp column, and power outages restoration date and time were merged into one pd.Timestamp column.
Fill in missing values
- The response variable ‘OUTAGE.DURATION’ has 58 NaN values, so median imputation will be performed because its distribution is skewed (as discovered in the previous EDA).
Keeping relevant columns
- The DataFrame now has 57 columns. Since we’re concerned with predicting the severity of a power outage from other power outage features, we only need columns that contain information related to outage severity, plus other potentially useful contextual variables. We also only want information we would know at the time of prediction, so columns containing information that could only be known after a power outage will be excluded.
- The chosen relevant columns and their descriptions are:
  - YEAR: Indicates the year when the outage event occurred
  - MONTH: Indicates the month when the outage event occurred
  - U.S._STATE: Represents all the states in the continental U.S.
  - POSTAL.CODE: Represents the postal code of the U.S. states
  - NERC.REGION: The North American Electric Reliability Corporation (NERC) regions involved in the outage event
  - CLIMATE.REGION: U.S. Climate regions as specified by National Centers for Environmental Information (nine climatically consistent regions in continental U.S.A.)
  - CLIMATE.CATEGORY: This represents the climate episodes corresponding to the years. The categories—“Warm”, “Cold” or “Normal” episodes of the climate are based on a threshold of ± 0.5 °C for the Oceanic Niño Index (ONI)
  - OUTAGE.START.DATE: This variable indicates the day of the year when the outage event started (as reported by the corresponding Utility in the region)
  - OUTAGE.START.TIME: This variable indicates the time of the day when the outage event started (as reported by the corresponding Utility in the region)
  - CAUSE.CATEGORY: Categories of all the events causing the major power outages
  - CAUSE.CATEGORY.DETAIL: Detailed description of the event categories causing the major power outages
  - OUTAGE.DURATION: Duration of outage events (in minutes)
  - PI.UTIL.OFUSA: State utility sector’s income (earnings) as a percentage of the total earnings of the U.S. utility sector’s income (in %)
  - TOTAL.CUSTOMERS: Annual number of total customers served in the U.S. state
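The cleaning steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the project's actual code: `raw` is a tiny in-memory stand-in for the real sheet (four all-NaN rows, then a row of column names, then string-typed data), only a handful of columns are included, and the function name `clean` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw sheet: 4 all-NaN rows, a header row, then data as strings.
raw = pd.DataFrame(
    [[np.nan] * 4] * 4
    + [
        ["YEAR", "OUTAGE.START.DATE", "OUTAGE.START.TIME", "OUTAGE.DURATION"],
        ["2011", "2011-07-01", "17:00:00", "3060"],
        ["2014", "2014-05-11", "18:38:00", "1"],
        ["2010", "2010-10-26", "20:00:00", np.nan],
    ]
)


def clean(raw):
    # Fix formatting: drop the all-NaN rows and promote the first remaining row to column names.
    df = raw.dropna(how="all").reset_index(drop=True)
    df.columns = df.iloc[0]
    df.columns.name = None
    df = df.iloc[1:].reset_index(drop=True)

    # Typecasting: numeric information stored as strings becomes numeric.
    for col in ["YEAR", "OUTAGE.DURATION"]:
        df[col] = pd.to_numeric(df[col])

    # Merge the start date and time into a single pd.Timestamp column.
    df["OUTAGE.START"] = pd.to_datetime(
        df["OUTAGE.START.DATE"] + " " + df["OUTAGE.START.TIME"]
    )
    df = df.drop(columns=["OUTAGE.START.DATE", "OUTAGE.START.TIME"])

    # Fill missing values: median imputation for the skewed response variable.
    median_duration = df["OUTAGE.DURATION"].median()
    df["OUTAGE.DURATION"] = df["OUTAGE.DURATION"].fillna(median_duration)
    return df


cleaned = clean(raw)
print(cleaned)
```

The median is used instead of the mean because a few very long outages would otherwise pull the imputed value upward.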

The first 5 rows of the resulting DataFrame:

U.S._STATE	POSTAL.CODE	NERC.REGION	CLIMATE.REGION	OUTAGE.START	OUTAGE.DURATION	YEAR	MONTH	CAUSE.CATEGORY	CAUSE.CATEGORY.DETAIL	CLIMATE.CATEGORY	PI.UTIL.OFUSA	TOTAL.CUSTOMERS
Minnesota	MN	MRO	East North Central	2011-07-01 17:00:00	3060.0	2011	7.0	severe weather	NaN	normal	2.2	2595696
Minnesota	MN	MRO	East North Central	2014-05-11 18:38:00	1.0	2014	5.0	intentional attack	vandalism	normal	2.2	2640737
Minnesota	MN	MRO	East North Central	2010-10-26 20:00:00	3000.0	2010	10.0	severe weather	heavy wind	cold	2.1	2586905
Minnesota	MN	MRO	East North Central	2012-06-19 04:30:00	2550.0	2012	6.0	severe weather	thunderstorm	normal	2.2	2606813
Minnesota	MN	MRO	East North Central	2015-07-18 02:00:00	1740.0	2015	7.0	severe weather	NaN	warm	2.2	2673531

Now that the data is cleaned, a baseline model can be built.

Baseline Model

Model

The model used in this prediction task will be a RandomForestRegressor, a model that is suitable for regression and relatively robust to overfitting, which tends to help performance on unseen data.

Features

In previous exploration of the dataset, it was found that certain regions have longer power outages. Thus, it seems that locational factors could be related to outage duration. The selected features for the model are:

‘CLIMATE.REGION’ 🗺️

U.S. Climate regions as specified by National Centers for Environmental Information (nine climatically consistent regions in continental U.S.A.)

This is a nominal categorical variable

‘NERC.REGION’ 📍

The North American Electric Reliability Corporation (NERC) regions involved in the outage event

This is a nominal categorical variable

These columns provide relevant locational information and are accessible at the time of prediction.

Feature Engineering

Since ‘CLIMATE.REGION’ and ‘NERC.REGION’ are both nominal categorical variables, they will be converted into numerical representations so that they are suitable to predict ‘OUTAGE.DURATION’.

‘CLIMATE.REGION’ 🗺️: One-Hot Encoding

One-hot encoding will transform the single categorical feature into multiple numerical features, creating one binary column per unique category that indicates the presence or absence of that category in each row. The potential categories in ‘CLIMATE.REGION’ are: ‘East North Central’, ‘Central’, ‘South’, ‘Southeast’, ‘Northwest’, ‘Southwest’, ‘Northeast’, ‘West North Central’, and ‘West’.

‘NERC.REGION’ 📍: One-Hot Encoding

The potential categories in ‘NERC.REGION’ are: ‘MRO’, ‘SERC’, ‘RFC’, ‘ECAR’, ‘TRE’, ‘WECC’, ‘SPP’, ‘NPCC’, ‘FRCC’, ‘FRCC, SERC’.

After one-hot encoding the 2 nominal categorical features, the baseline RandomForestRegressor model will use 19 discrete numerical features:

CLIMATE.REGION (East North Central)🗺️

CLIMATE.REGION (Central)🗺️

CLIMATE.REGION (South)🗺️

CLIMATE.REGION (Southeast)🗺️

CLIMATE.REGION (Northwest)🗺️

CLIMATE.REGION (Southwest)🗺️

CLIMATE.REGION (Northeast)🗺️

CLIMATE.REGION (West North Central)🗺️

CLIMATE.REGION (West)🗺️

NERC.REGION (MRO)📍

NERC.REGION (SERC)📍

NERC.REGION (RFC)📍

NERC.REGION (ECAR)📍

NERC.REGION (TRE)📍

NERC.REGION (WECC)📍

NERC.REGION (SPP)📍

NERC.REGION (FRCC)📍

NERC.REGION (NPCC)📍

NERC.REGION (FRCC, SERC)📍

Note: The model will actually use 17 features, because for each original feature (‘CLIMATE.REGION’ and ‘NERC.REGION’), the model will drop one one-hot encoded feature in order to prevent multicollinearity. Additionally, ‘CLIMATE.REGION’ has 6 missing values, a trivial portion of the dataset, so those rows will be dropped.
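A baseline along these lines could be assembled as a scikit-learn pipeline. The sketch below runs on synthetic stand-in data with only three categories per feature; the train/test split proportion, the random seeds, and the helper name `rmse` are assumptions, since the source does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the cleaned outage data (values are not real).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "CLIMATE.REGION": rng.choice(["Central", "South", "West"], size=n),
    "NERC.REGION": rng.choice(["MRO", "SERC", "WECC"], size=n),
    "OUTAGE.DURATION": rng.exponential(2000.0, size=n),
})

features = ["CLIMATE.REGION", "NERC.REGION"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["OUTAGE.DURATION"], test_size=0.25, random_state=0
)

# drop="first" removes one binary column per original feature,
# mirroring the multicollinearity note above.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(drop="first"), features),
])

baseline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(random_state=0)),
])
baseline.fit(X_train, y_train)


def rmse(model, X, y):
    """RMSE of a fitted model's predictions on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))


print("Train RMSE:", rmse(baseline, X_train, y_train))
print("Test RMSE:", rmse(baseline, X_test, y_test))
```

Keeping the encoder inside the pipeline means the categories are learned from the training split only, so no information from the test split leaks into preprocessing.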

Model Performance

	Train Predicted Duration	Actual Duration
1484	1696.929466	1895.0
1279	1696.929466	4605.0
1152	1696.929466	249.0
914	4025.372307	1044.0
1505	1750.409304	103.0
…	…	…

	Test Predicted Duration	Actual Duration
1478	2986.133051	77.0
371	2807.566034	360.0
1157	1696.929466	5160.0
1034	4000.567202	182.0
1330	2675.564788	1337.0
…	…	…

Train RMSE: 4916.138668308378 Test RMSE: 8312.675770084086

Both the train and test RMSEs are high, and the test RMSE is roughly 1.7 times the train RMSE. This indicates that the model’s predictive ability is weak and that the model is overfit to the training data. The model’s poor performance can be attributed to the limited number of features (only 2) and their limited scope (only locational characteristics). Since no hyperparameter search was performed, the model may also have been held back by poorly chosen hyperparameters. It would be beneficial to consider other features and better hyperparameter configurations, so a final model with those changes will be built upon the baseline model to see whether predictions of power outage duration can improve.

Final Model

A RandomForestRegressor model will continue to be used, but the model’s performance will be further improved through additional steps like:

Additional feature engineering: extracting more information and insights will improve the model’s predictive ability
Hyperparameter tuning: choosing hyperparameter values that yield lower validation error should allow the model to produce better predictions

Features

In addition to location, there may be many other factors that affect outage duration. It could be beneficial to add features that capture economic characteristics, electricity consumption characteristics, weather characteristics, causal characteristics, etc. The features chosen for the final model are:

‘CLIMATE.REGION’🗺️

see baseline model

‘NERC.REGION’📍

see baseline model

‘CAUSE.CATEGORY’🚨

Categories of all the events causing the major power outages

This is a nominal categorical variable

This feature could be useful because certain severe causes (e.g. severe weather) might cause longer outage durations than less severe causes (e.g. public appeal)

‘CLIMATE.CATEGORY’🌤️

This represents the climate episodes corresponding to the years. The categories—“Warm”, “Cold” or “Normal” episodes of the climate are based on a threshold of ± 0.5 °C for the Oceanic Niño Index (ONI)

This is an ordinal categorical variable

This feature could be useful because certain temperatures might affect power outages, as outages often occur during extremely hot or cold weather.

‘PI.UTIL.OFUSA’💲

State utility sector’s income (earnings) as a percentage of the total earnings of the U.S. utility sector’s income (in %)

This is a continuous numerical variable

The distribution of power infrastructure earnings across different areas could give insight into economic influences on outage duration.

‘TOTAL.CUSTOMERS’👥

Annual number of total customers served in the U.S. state

This is a discrete numerical variable

This feature could provide relevant information about electricity consumption; more usage could cause greater strain on power grids and thus longer outage durations.

Feature Engineering

Two categorical features were added, so they will be transformed into numerical features. Two numerical features were also added, but they are measured in different units; to keep scales consistent and reduce the dominance of features with larger values, they will be standardized.

‘CLIMATE.REGION’🗺️: One-Hot Encoding

same as baseline model

‘NERC.REGION’📍: One-Hot Encoding

same as baseline model

‘CAUSE.CATEGORY’🚨: One-Hot Encoding

The potential one-hot encoded categories in ‘CAUSE.CATEGORY’ are: ‘severe weather’, ‘intentional attack’, ‘system operability disruption’, ‘equipment failure’, ‘public appeal’, ‘fuel supply emergency’, and ‘islanding’.

‘CLIMATE.CATEGORY’🌤️: Ordinal Encoding

The categories of ‘CLIMATE.CATEGORY’ are ‘normal’, ‘cold’, and ‘warm’. They will be mapped to integers that follow their natural ordering.

‘PI.UTIL.OFUSA’💲: Scaled using StandardScaler
‘TOTAL.CUSTOMERS’👥: Scaled using StandardScaler

Note: For each original feature, the model will drop one one-hot encoded feature to prevent multicollinearity. Additionally, ‘CLIMATE.CATEGORY’ has a very trivial number of NaN values so those rows will be dropped.
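The transformations listed above could be combined in a single ColumnTransformer, roughly as sketched here. The explicit ordering cold < normal < warm is an assumption about what "natural ordering" means, the sample rows are hypothetical, and `sparse_threshold=0` is only set so the illustration returns a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

nominal = ["CLIMATE.REGION", "NERC.REGION", "CAUSE.CATEGORY"]
ordinal = ["CLIMATE.CATEGORY"]
numeric = ["PI.UTIL.OFUSA", "TOTAL.CUSTOMERS"]

preprocess = ColumnTransformer(
    [
        # One binary column per category, minus one dropped per original feature.
        ("onehot", OneHotEncoder(drop="first"), nominal),
        # Assumed ordering: cold (0) < normal (1) < warm (2).
        ("ordinal", OrdinalEncoder(categories=[["cold", "normal", "warm"]]), ordinal),
        # Zero mean and unit variance for each numeric column.
        ("scale", StandardScaler(), numeric),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Hypothetical rows, for illustration only.
sample = pd.DataFrame({
    "CLIMATE.REGION": ["East North Central", "South", "West"],
    "NERC.REGION": ["MRO", "SERC", "WECC"],
    "CAUSE.CATEGORY": ["severe weather", "intentional attack", "severe weather"],
    "CLIMATE.CATEGORY": ["normal", "cold", "warm"],
    "PI.UTIL.OFUSA": [2.2, 2.1, 2.3],
    "TOTAL.CUSTOMERS": [2595696, 2640737, 2586905],
})

encoded = preprocess.fit_transform(sample)
print(encoded.shape)
```

In a full pipeline this transformer would take the place of the baseline's one-hot-only preprocessing step, ahead of the RandomForestRegressor.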

Hyperparameter Search

GridSearchCV will be used to find better hyperparameters and train the model more effectively. The optimal hyperparameters found are:

max_depth: 22
max_features: sqrt
min_samples_split: 10
n_estimators: 122
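A grid search of this kind can be sketched as below. The source does not list the grid that was actually searched, so the candidate values, the number of folds, the scoring string, and the synthetic data are all assumptions chosen to keep the example small and fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the encoded outage features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 1000.0 * np.abs(X[:, 0]) + rng.exponential(500.0, size=120)

# Hypothetical candidate values; the project's real grid is not given in the text.
param_grid = {
    "max_depth": [2, 6],
    "max_features": ["sqrt", None],
    "min_samples_split": [2, 10],
    "n_estimators": [20, 40],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, so RMSE is negated
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
```

Each combination is scored by cross-validation on the training data only, so the held-out test set stays untouched until the final evaluation.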

Model Performance

Train RMSE: 3593.13972810358 Test RMSE: 7954.385910146231

	Baseline Train Predicted Duration	Final Train Predicted Duration	Actual Duration
1484	1696.929466	1064.502460	1895.0
1279	1696.929466	3291.529008	4605.0
1152	1696.929466	285.622604	249.0
914	4025.372307	5781.134054	1044.0
1505	1750.409304	499.433308	103.0
…	…	…	…

	Baseline Test Predicted Duration	Final Test Predicted Duration	Actual Duration
1478	2986.133051	44.929826	77.0
371	2807.566034	1187.522924	360.0
1157	1696.929466	6467.448835	5160.0
1034	4000.567202	1341.313430	182.0
1330	2675.564788	1323.058217	1337.0
…	…	…	…

	Train (Baseline)	Test (Baseline)	Train (Final)	Test (Final)
RMSE	4916.138668	8312.67577	3593.139728	7954.38591

The final model is considered an improvement, as both the training and testing RMSE are lower (training RMSE dropped by ~1300 minutes and testing RMSE by ~360 minutes). However, the testing RMSE is still more than double the training RMSE, indicating that the model is overfit to the training data; the gap between the two is in fact wider than in the baseline model (about 4,400 minutes versus about 3,400). Overall, it appears that incorporating additional features (which carry more information about each outage) and tuning hyperparameters improved the final model’s performance.

Fairness Analysis

To assess whether the model is fair, the test dataset is split into two groups: normal climate (‘CLIMATE.CATEGORY’ == ‘normal’) vs extreme climate (‘CLIMATE.CATEGORY’ == ‘warm’ or ‘cold’). To answer the question “Does my model perform worse for normal climates than it does for extreme climates?”, a permutation test will be conducted.

Null Hypothesis: The model is fair and its RMSE for both groups (normal vs warm/cold) is roughly the same, and any differences are due to random chance.

Alternative Hypothesis: The model is unfair and its RMSE for normal climates is higher than its RMSE for extreme climates.

Test Statistic: Absolute difference in RMSE between the ‘normal’ and combined ‘cold’/‘warm’ climate categories

Significance level: 0.05

P-value: 0.931

With a p-value of 0.931, well above the 0.05 significance level, we fail to reject the null hypothesis: the test provides no evidence that the model performs worse for normal climates than for extreme climates. Of course, this is a statistical test rather than a randomized controlled trial, so the result does not prove that the model is fair.
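The permutation test described above can be sketched as follows, using the absolute difference in RMSE as the test statistic. The function names, the number of permutations, the seed, and the synthetic data are assumptions made for illustration, not the project's actual code or results.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def permutation_test(actual, predicted, is_normal, n_permutations=1000, seed=0):
    """Return the observed |RMSE(normal) - RMSE(extreme)| and its permutation p-value."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    is_normal = np.asarray(is_normal, dtype=bool)
    rng = np.random.default_rng(seed)

    def statistic(groups):
        return abs(
            rmse(actual[groups], predicted[groups])
            - rmse(actual[~groups], predicted[~groups])
        )

    observed = statistic(is_normal)
    # Shuffle the group labels to simulate the null hypothesis of no group effect.
    simulated = np.array(
        [statistic(rng.permutation(is_normal)) for _ in range(n_permutations)]
    )
    p_value = float(np.mean(simulated >= observed))
    return observed, p_value


# Synthetic test-set stand-in: prediction errors are unrelated to the group labels.
rng = np.random.default_rng(1)
actual = rng.exponential(2000.0, size=300)
predicted = actual + rng.normal(0.0, 500.0, size=300)
is_normal = rng.random(300) < 0.5

observed, p_value = permutation_test(actual, predicted, is_normal)
print("Observed statistic:", observed)
print("p-value:", p_value)
```

Shuffling the labels keeps the group sizes and the set of prediction errors fixed, so the simulated statistics show how large a difference can arise from group assignment alone.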