⚡Power Outages Analysis⚡

This is a project for DSC 80 at UCSD.

Name: Essie Cheng

Website Link: https://essiecheng.github.io/Power-Outage-Analysis/

Introduction

This dataset, created by the Laboratory for Advancing Sustainable Critical Infrastructure at Purdue University, contains data on major power outages in the U.S. from January 2000 to July 2016. It includes general information about when and where outages occurred, regional climate information, outage event information, regional electricity consumption, regional economic characteristics, and regional land-use characteristics. With this information available, a question of interest is where major power outages are more likely to occur and to be severe.

Question: Are certain locations more prone to major power outages?

Finding regional patterns of where severe power outages may occur can provide crucial insight into areas that require better preparation for future power outages. It can also help people in areas where major outages are likely to occur be better informed about risks and prepare accordingly. The dataset contains 1534 rows and 54 columns. Since our question concerns power outage durations and climate regions, we only need columns in the dataset that include information related to outage duration, geographical information, and other potentially contextual variables. The chosen relevant columns and their descriptions are:

Cleaning and EDA

Data Cleaning

The following was done to clean the data:

The resulting dataframe has 1534 rows and 14 columns. The first 5 rows of the dataframe look like:

| U.S._STATE | POSTAL.CODE | NERC.REGION | CLIMATE.REGION | OUTAGE.START | OUTAGE.RESTORATION | OUTAGE.DURATION | YEAR | MONTH | CAUSE.CATEGORY | CAUSE.CATEGORY.DETAIL | CLIMATE.CATEGORY | CUSTOMERS.AFFECTED | TOTAL.CUSTOMERS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Minnesota | MN | MRO | East North Central | 2011-07-01 17:00:00 | 2011-07-03 20:00:00 | 3060.0 | 2011 | 7.0 | severe weather | NaN | normal | 70000.0 | 2595696 |
| Minnesota | MN | MRO | East North Central | 2014-05-11 18:38:00 | 2014-05-11 18:39:00 | 1.0 | 2014 | 5.0 | intentional attack | vandalism | normal | NaN | 2640737 |
| Minnesota | MN | MRO | East North Central | 2010-10-26 20:00:00 | 2010-10-28 22:00:00 | 3000.0 | 2010 | 10.0 | severe weather | heavy wind | cold | 70000.0 | 2586905 |
| Minnesota | MN | MRO | East North Central | 2012-06-19 04:30:00 | 2012-06-20 23:00:00 | 2550.0 | 2012 | 6.0 | severe weather | thunderstorm | normal | 68200.0 | 2606813 |
| Minnesota | MN | MRO | East North Central | 2015-07-18 02:00:00 | 2015-07-19 07:00:00 | 1740.0 | 2015 | 7.0 | severe weather | NaN | warm | 250000.0 | 2673531 |

Univariate Analysis

‘CLIMATE.REGION’ Distribution

A bar chart was created to get an idea of the distribution of climate regions.

The Northeast region has the most power outages in the dataset, and the West North Central region has the fewest.

‘OUTAGE.DURATION’ Distribution

A histogram was created to get an idea of the distribution of outage durations.

The histogram appears to be heavily skewed right with some outliers of extreme outage durations.

A choropleth was also created to look at the distribution of average outage duration, per state. It’s important to note that the average durations may be skewed higher because of the outliers of longer outage durations.

The choropleth suggests that neighboring states tend to have similar average durations. For example, higher mean power outage durations are generally found in states such as New York, New Jersey, and West Virginia, which lie in or near the Northeast region. Lower mean power outage durations are generally found in states such as Montana, Wyoming, and South Dakota, all of which are in the West North Central region. Thus, the bivariate analysis will illustrate the relationship between ‘OUTAGE.DURATION’ and ‘CLIMATE.REGION’.

Bivariate Analysis

Outage Duration and Climate Region

In the bivariate analysis, we’ll use a coarser granularity for outage duration: the mean outage duration by state rather than individual outages. This avoids noisy visualizations and makes it easier to identify patterns and bigger-picture trends across climate regions.

Most notable is the East North Central climate region boxplot, as it has the highest median and largest max for mean outage duration of states in that region, suggesting that outage durations are longer in that climate region. Although the Northeast and Central regions have a relatively low median, they have a more skewed distribution and wider spread, indicating that they could be prone to longer outage durations.
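The per-state aggregation described above could be computed along these lines. This is a minimal sketch on a made-up toy dataframe, not the project's actual code; the name `state_means` and the output column `MEAN.DURATION` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned outage data (values are invented for illustration).
outages = pd.DataFrame({
    "U.S._STATE": ["Minnesota", "Minnesota", "Texas", "Texas", "Ohio"],
    "CLIMATE.REGION": ["East North Central", "East North Central",
                       "South", "South", "Central"],
    "OUTAGE.DURATION": [3060.0, 1.0, 500.0, 1500.0, 200.0],
})

# One row per state: its climate region and its mean outage duration.
# Grouping on both columns keeps the region available for a per-region box plot.
state_means = (
    outages.groupby(["CLIMATE.REGION", "U.S._STATE"])["OUTAGE.DURATION"]
    .mean()
    .reset_index(name="MEAN.DURATION")
)
print(state_means)
```

Each box in a per-region box plot would then summarize the `MEAN.DURATION` values of the states in that region.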

Outage Duration and Customers Affected

The earlier choropleth showed that similar average durations tend to cluster in the same regions. Since some regions are more densely populated than others, this hints at another possible relationship, between outage duration and customers affected. For example, lower mean power outage durations are generally found in states such as Idaho, Wyoming, and Montana, all of which are less populated. A scatterplot is thus used to observe the relationship between ‘OUTAGE.DURATION’ and ‘CUSTOMERS.AFFECTED’.

The association between average outage duration and average number of customers affected appears to be positive and somewhat linear, with a few outliers. The scatterplot indicates that generally the longer the outage duration, the more customers affected.

Interesting Aggregates

Outage Duration, Climate Region, and Month

In addition to investigating where longer power outages occur, we can also investigate when outages tend to be longer by using a pivot table. (The column headers represent the month from ‘MONTH’, e.g. 1.0 = January.)

| CLIMATE.REGION | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0 | 11.0 | 12.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Central | 5320.42 | 3728.60 | 709.80 | 2621.21 | 2924.45 | 2112.77 | 1412.07 | 1500.75 | 5428.29 | 4355.00 | 1634.30 | 1110.38 |
| East North Central | 10983.25 | 2728.80 | 13200.40 | 4592.91 | 7395.80 | 4457.40 | 3136.10 | 2730.92 | 3908.89 | 3356.43 | 4442.56 | 4213.70 |
| Northeast | 1161.29 | 5398.94 | 2501.88 | 819.11 | 1409.38 | 2565.00 | 2616.00 | 3652.84 | 2786.95 | 5488.26 | 876.94 | 3992.36 |
| Northwest | 846.36 | 1041.58 | 539.88 | 1004.20 | 222.67 | 370.09 | 326.00 | 1148.08 | 902.29 | 1703.25 | 4171.00 | 2988.60 |
| South | 3473.44 | 999.60 | 2022.62 | 1234.12 | 1690.18 | 1944.97 | 1643.13 | 2496.28 | 10335.06 | 887.00 | 1720.60 | 7403.67 |
| Southeast | 1731.53 | 1425.89 | 1153.67 | 474.50 | 3359.60 | 1243.28 | 569.80 | 3986.91 | 3763.00 | 5701.60 | 203.67 | 1241.22 |
| Southwest | 53.83 | 174.53 | 1276.33 | 124.00 | 84.50 | 220.50 | 9023.75 | 109.00 | 300.00 | 566.75 | 16.00 | 1450.57 |
| West | 3903.62 | 1543.32 | 3540.24 | 853.31 | 342.73 | 504.78 | 992.78 | 382.58 | 337.00 | 2048.41 | 1410.17 | 2687.52 |
| West North Central | NaN | NaN | 56.00 | NaN | 0.00 | 40.40 | NaN | 100.00 | NaN | 106.00 | 30.50 | 5160.00 |

One observation is that longer outages seem to occur in colder months: for several regions (Central, East North Central, West, Northwest, Southeast), the longest average outage durations fall in fall and winter months. Another notable observation is that the West North Central row contains many NaN values, meaning no outages were recorded there in those months, which suggests that power outages occur much less frequently in that region. The months that do have data mostly show very short average durations compared to other regions, with December (5160.00) as the exception.
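A pivot table of this shape can be produced with `DataFrame.pivot_table`. The following is a minimal sketch on invented toy data, assuming the aggregate is the mean of ‘OUTAGE.DURATION’ per region and month; the variable name `month_pivot` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the cleaned outage data (values are invented for illustration).
outages = pd.DataFrame({
    "CLIMATE.REGION": ["Central", "Central", "Central", "West", "West"],
    "MONTH": [1.0, 1.0, 2.0, 1.0, 3.0],
    "OUTAGE.DURATION": [100.0, 300.0, 50.0, 400.0, 20.0],
})

# Rows are climate regions, columns are months, cells are mean durations.
# A region/month pair with no recorded outages shows up as NaN.
month_pivot = outages.pivot_table(
    index="CLIMATE.REGION",
    columns="MONTH",
    values="OUTAGE.DURATION",
    aggfunc="mean",
)
print(month_pivot)
```

Note that a NaN cell here means "no outages recorded for that region and month", which is different from an outage with a missing duration.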

Outage Duration, Climate Region, and Causes

The average outage duration in each region can also be compared with the causes behind those outages, which can reveal what kind of recovery resources to focus on for each region. (The column headers are cause categories from ‘CAUSE.CATEGORY’)

| CLIMATE.REGION | equipment failure | fuel supply emergency | intentional attack | islanding | public appeal | severe weather | system operability disruption |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Central | 322.000000 | 10035.250000 | 346.058824 | 125.333333 | 1410.000000 | 3250.007519 | 2695.200000 |
| East North Central | 26435.333333 | 33971.250000 | 2376.050000 | 1.000000 | 733.000000 | 4434.817308 | 2610.000000 |
| Northeast | 215.800000 | 14629.571429 | 195.984733 | 881.000000 | 2655.000000 | 4429.902857 | 773.500000 |
| Northwest | 702.000000 | 1.000000 | 373.811765 | 73.333333 | 898.000000 | 4838.000000 | 141.000000 |
| South | 295.777778 | 17482.500000 | 325.607143 | 493.500000 | 1163.976190 | 4391.349057 | 866.074074 |
| Southeast | 554.500000 | NaN | 504.666667 | NaN | 2865.400000 | 2662.560345 | 169.312500 |
| Southwest | 113.800000 | 76.000000 | 265.672131 | 2.000000 | 2275.000000 | 11572.900000 | 329.222222 |
| West | 524.809524 | 6154.600000 | 857.677419 | 214.857143 | 2028.111111 | 2928.373134 | 363.666667 |
| West North Central | 61.000000 | NaN | 23.500000 | 68.200000 | 439.500000 | 2442.500000 | NaN |

It can be seen that the East North Central region faces the longest power outages, particularly due to equipment failure and fuel supply emergency. In the Southwest region though, those causes played much less of a role and the longest power outages were due to severe weather instead.

It’s worth noting that the columns used in the EDA contain NaN values that have not been dealt with yet, which may skew or bias the results of the analyses. Missingness will thus be assessed next.

Assessment of Missingness

NMAR Analysis

NMAR (not missing at random) missingness is where the chance that a value is missing depends on the actual, unreported value. A column in the power outages dataset that I believe is NMAR is the ‘CAUSE.CATEGORY’ column, which describes categories of events that cause the major power outages. One possible reason for a value to be missing is that the cause itself is unknown, or so unusual that it doesn’t fit into a category. Another possibility is that it’s missing due to the nature of the cause, where “bad” causes are more likely to go unreported because they reflect poorly on certain groups. For example, cause categories like “equipment failure” may be a bad look for those who built and maintain the power grids, and cause categories like “intentional attack” may imply that those responsible for the grids failed to keep citizens safe. Though we can’t conclude for certain that the ‘CAUSE.CATEGORY’ column is NMAR, it seems likely that its missingness depends on the actual cause itself. Additional data that could explain the missingness (thereby making it MAR) would be data on reporting policies, which would show whether there are specific criteria for reporting certain types of causes.

Missingness Dependency

A column with non-trivial missingness is ‘CUSTOMERS.AFFECTED’. Permutation tests are used to analyze whether the missingness of this column depends on ‘CAUSE.CATEGORY’ and whether it depends on ‘CLIMATE.CATEGORY’.

‘CUSTOMERS.AFFECTED’ vs ‘CAUSE.CATEGORY’

A pivot table and bar graph will be used to compare the two distributions:

| CAUSE.CATEGORY | affected_missing = False | affected_missing = True |
| --- | --- | --- |
| equipment failure | 0.027498 | 0.067720 |
| fuel supply emergency | 0.006416 | 0.099323 |
| intentional attack | 0.182401 | 0.494357 |
| islanding | 0.031164 | 0.027088 |
| public appeal | 0.019248 | 0.108352 |
| severe weather | 0.657195 | 0.103837 |
| system operability disruption | 0.076077 | 0.099323 |

The distribution of cause category appears very different, indicating the missingness of ‘CUSTOMERS.AFFECTED’ is dependent on ‘CAUSE.CATEGORY’. A permutation test will be used to analyze the dependency of the missingness.

Null hypothesis: The distribution of ‘CAUSE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is missing is the same as the distribution of ‘CAUSE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is not missing.

Alternative hypothesis: The distribution of ‘CAUSE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is missing is different than the distribution of ‘CAUSE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is not missing.

Test statistic: Total variation distance

Significance level: 0.05

The resulting p-value of 0.0 leads us to reject the null hypothesis and conclude that the missingness of ‘CUSTOMERS.AFFECTED’ is dependent on ‘CAUSE.CATEGORY’.
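A permutation test of this kind can be sketched as follows. This is an illustrative version on invented toy data, not the project's actual code: the helper names `total_variation_distance` and `missingness_permutation_test`, the number of repetitions, and the seed are all assumptions. The idea is to shuffle the missingness labels, recompute the total variation distance (half the sum of absolute differences between the two category distributions), and see how often a shuffled TVD is at least as large as the observed one.

```python
import numpy as np
import pandas as pd

def total_variation_distance(categories, is_missing):
    """TVD between the category distributions of the missing and non-missing groups."""
    # Each column of the crosstab is one group's distribution over categories.
    dist = pd.crosstab(np.asarray(categories), np.asarray(is_missing),
                       normalize="columns")
    return float((dist.iloc[:, 0] - dist.iloc[:, 1]).abs().sum() / 2)

def missingness_permutation_test(df, target_col, category_col,
                                 n_repetitions=500, seed=0):
    """Return (observed TVD, p-value) for missingness of target_col vs category_col."""
    rng = np.random.default_rng(seed)
    categories = df[category_col].to_numpy()
    is_missing = df[target_col].isna().to_numpy()
    observed = total_variation_distance(categories, is_missing)

    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        # Shuffling the labels breaks any link between missingness and category.
        shuffled = rng.permutation(is_missing)
        simulated[i] = total_variation_distance(categories, shuffled)

    p_value = float((simulated >= observed).mean())
    return observed, p_value

# Toy data in which missingness depends entirely on the cause category.
toy = pd.DataFrame({
    "CAUSE.CATEGORY": ["severe weather"] * 20 + ["intentional attack"] * 20,
    "CUSTOMERS.AFFECTED": [1000.0] * 20 + [np.nan] * 20,
})
observed_tvd, p_value = missingness_permutation_test(
    toy, "CUSTOMERS.AFFECTED", "CAUSE.CATEGORY"
)
print(observed_tvd, p_value)
```

A reported p-value of 0.0 from such a simulation means that none of the shuffled statistics reached the observed one, so the true p-value is small but not literally zero.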

‘CUSTOMERS.AFFECTED’ vs ‘CLIMATE.CATEGORY’

A pivot table and histogram will be used to compare the two distributions:

| CLIMATE.CATEGORY | affected_missing = False | affected_missing = True |
| --- | --- | --- |
| cold | 0.303506 | 0.326531 |
| normal | 0.484317 | 0.496599 |
| warm | 0.212177 | 0.176871 |

The distribution of climate category appears to be very similar, indicating the missingness of ‘CUSTOMERS.AFFECTED’ is not dependent on ‘CLIMATE.CATEGORY’.

A permutation test will be used to analyze the dependency of the missingness.

Null hypothesis: The distribution of ‘CLIMATE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is missing is the same as the distribution of ‘CLIMATE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is not missing.

Alternative hypothesis: The distribution of ‘CLIMATE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is missing is different than the distribution of ‘CLIMATE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is not missing.

Test statistic: Total variation distance

Significance level: 0.05

The resulting p-value of 0.3504 leads us to fail to reject the null hypothesis: we do not have sufficient evidence that the missingness of ‘CUSTOMERS.AFFECTED’ depends on ‘CLIMATE.CATEGORY’.

Because the missingness of ‘CUSTOMERS.AFFECTED’ depended on at least 1 column (‘CAUSE.CATEGORY’), it can be concluded that the overall missingness of ‘CUSTOMERS.AFFECTED’ is MAR (missing at random).

Hypothesis Testing

Now that we have an understanding and assessment of our data, we can address our question: Are certain locations more prone to major outages?

Major outages will be defined as outages with durations above the national average duration. More specifically, the goal is to determine which climate regions have longer-than-average power outages more frequently. To do so, hypothesis tests will be performed for each of the climate regions using the following hypotheses:

Null hypothesis: The region’s proportion of major outages is equal to the national proportion of major outages.

Alternative hypothesis: The region’s proportion of major outages is greater than the national proportion of major outages.

Test statistic: The proportion of major outages

Significance level: 0.01

Final adjustments

Before conducting the hypothesis tests, missingness should be dealt with, as the relevant columns ‘CLIMATE.REGION’ and ‘OUTAGE.DURATION’ both have NaN values. There are only 6 NaN values in ‘CLIMATE.REGION’, a trivial portion of the dataset, so listwise deletion can be used to drop those rows. There are 58 NaN values in ‘OUTAGE.DURATION’. These will be filled in by median imputation, since earlier exploration revealed outliers in the dataset and the median is more robust to outliers than the mean.
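These two steps might look like the following in pandas. This is a minimal sketch on invented toy data; whether the median was computed before or after dropping rows is not stated in the write-up, so computing it after the deletion is an assumption, as are the names `cleaned` and `median_duration`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the outage data (values are invented for illustration).
outages = pd.DataFrame({
    "CLIMATE.REGION": ["Central", None, "West", "West", "South"],
    "OUTAGE.DURATION": [100.0, 50.0, np.nan, 300.0, 5000.0],
})

# Listwise deletion: drop rows where the climate region is missing.
cleaned = outages.dropna(subset=["CLIMATE.REGION"]).copy()

# Median imputation: fill missing durations with the median of the observed ones.
# (Assumption: the median is taken over the rows that survive the deletion.)
median_duration = cleaned["OUTAGE.DURATION"].median()
cleaned["OUTAGE.DURATION"] = cleaned["OUTAGE.DURATION"].fillna(median_duration)
print(cleaned)
```

In the toy data the 5000.0 outlier would pull a mean-based fill value well above typical durations, while the median stays at 300.0.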

Testing

The result:

| Region | P-value | Reject Null Hypothesis |
| --- | --- | --- |
| East North Central | 1.0 | False |
| Central | 1.0 | False |
| South | 1.0 | False |
| Southeast | 0.0 | True |
| Northwest | 1.0 | False |
| Southwest | 0.0 | True |
| Northeast | 0.0 | True |
| West North Central | 0.0 | True |
| West | 0.0 | True |

It looks like the Southeast, Southwest, Northeast, West North Central, and West regions have a proportion of major outages greater than the national proportion, as the null hypothesis is rejected for those regions. This suggests that regional factors affect power outage durations and that those regions are more prone to major power outages.
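One way to run a one-sided test on a region's proportion of major outages is by simulation. The sketch below is an assumption about how such a test could be set up, not the project's actual procedure: under the null, each of a region's outages is treated as major with the national probability, so simulated proportions are drawn from a binomial distribution. The function name `region_major_outage_test`, the toy data, the number of simulations, and the seed are all hypothetical.

```python
import numpy as np
import pandas as pd

def region_major_outage_test(df, region, n_simulations=10_000, seed=0):
    """Return (observed proportion, p-value) for one region's major-outage rate."""
    rng = np.random.default_rng(seed)

    # A "major" outage lasts longer than the national mean duration.
    is_major = df["OUTAGE.DURATION"] > df["OUTAGE.DURATION"].mean()
    national_prop = is_major.mean()

    in_region = df["CLIMATE.REGION"] == region
    n_region = int(in_region.sum())
    observed = float(is_major[in_region].mean())

    # Under the null, the region's count of major outages is Binomial(n, national_prop).
    simulated = rng.binomial(n_region, national_prop, size=n_simulations) / n_region

    # One-sided p-value: how often a simulated proportion is at least the observed one.
    p_value = float((simulated >= observed).mean())
    return observed, p_value

# Toy data: region "A" has only long outages, region "B" only short ones.
toy = pd.DataFrame({
    "CLIMATE.REGION": ["A"] * 10 + ["B"] * 30,
    "OUTAGE.DURATION": [1000.0] * 10 + [10.0] * 30,
})
obs_a, p_a = region_major_outage_test(toy, "A")
obs_b, p_b = region_major_outage_test(toy, "B")
print(obs_a, p_a, obs_b, p_b)
```

With a one-sided alternative, a region whose observed proportion sits far below the national proportion gets a p-value near 1, and one far above gets a p-value near 0.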

An observation of interest is that the East North Central region isn’t on the list of regions that are more prone to major power outages, despite having the highest median and max outage duration out of all regions. This may be because the East North Central region has major outages that are longer compared to other regions, but its proportion of major outages is actually similar to the proportion of major outages nationwide.

One last note is that these conclusions come from statistical tests on observational data rather than randomized controlled trials, so the results are evidence for regional differences, not proof of them.