⚡Power Outages Analysis⚡
This is a project for DSC 80 at UCSD.
Name: Essie Cheng
Website Link: https://essiecheng.github.io/Power-Outage-Analysis/
Introduction
This dataset, created by the Laboratory for Advancing Sustainable Critical Infrastructure at Purdue University, contains data on major power outages in the U.S. that occurred from January 2000 to July 2016. It includes general information about when and where the outages occurred, regional climate information, outage event information, regional electricity consumption, regional economic characteristics, and regional land-use characteristics. With this information available, a question of interest is where major power outages are more likely to occur and be severe.
Question: Are certain locations more prone to major power outages?
Finding regional patterns of where severe power outages may occur can provide crucial insight into areas that require better preparation for future power outages. It can also help people in areas where major outages are likely to occur be better informed about risks and prepare accordingly. The dataset contains 1534 rows and 54 columns. Since our question concerns power outage durations and climate regions, we only need columns in the dataset that include information related to outage duration, geographical information, and other potentially contextual variables. The chosen relevant columns and their descriptions are:
- YEAR: Indicates the year when the outage event occurred
- MONTH: Indicates the month when the outage event occurred
- U.S._STATE: Represents all the states in the continental U.S.
- POSTAL.CODE: Represents the postal code of the U.S. states
- NERC.REGION: The North American Electric Reliability Corporation (NERC) regions involved in the outage event
- CLIMATE.REGION: U.S. Climate regions as specified by National Centers for Environmental Information (nine climatically consistent regions in continental U.S.A.)
- CLIMATE.CATEGORY: This represents the climate episodes corresponding to the years. The categories—“Warm”, “Cold” or “Normal” episodes of the climate are based on a threshold of ± 0.5 °C for the Oceanic Niño Index (ONI)
- OUTAGE.START.DATE: This variable indicates the day of the year when the outage event started (as reported by the corresponding Utility in the region)
- OUTAGE.START.TIME: This variable indicates the time of the day when the outage event started (as reported by the corresponding Utility in the region)
- OUTAGE.RESTORATION.DATE: This variable indicates the day of the year when power was restored to all the customers (as reported by the corresponding Utility in the region)
- OUTAGE.RESTORATION.TIME: This variable indicates the time of the day when power was restored to all the customers (as reported by the corresponding Utility in the region)
- CAUSE.CATEGORY: Categories of all the events causing the major power outages
- CAUSE.CATEGORY.DETAIL: Detailed description of the event categories causing the major power outages
- OUTAGE.DURATION: Duration of outage events (in minutes)
- CUSTOMERS.AFFECTED: Number of customers affected by the power outage event
- TOTAL.CUSTOMERS: Annual number of total customers served in the U.S. state
Cleaning and EDA
Data Cleaning
The following was done to clean the data:
- Fix formatting: it appears that due to formatting issues from converting the xlsx file to a csv file, the first 4 rows of the resulting dataframe were all NaN values and the correct column names appeared in row 5. Those rows were dropped and the column names were reassigned.
- Keep relevant columns: all columns are dropped except for the relevant columns listed above.
- Typecasting: all columns were initially stored as strings, so numerical columns were converted to floats. The outage start date and time were merged into one pd.Timestamp column, and the outage restoration date and time were merged into another pd.Timestamp column.
The resulting dataframe has 1534 rows and 14 columns. The first 5 rows of the dataframe look like:
| U.S._STATE | POSTAL.CODE | NERC.REGION | CLIMATE.REGION | OUTAGE.START | OUTAGE.RESTORATION | OUTAGE.DURATION | YEAR | MONTH | CAUSE.CATEGORY | CAUSE.CATEGORY.DETAIL | CLIMATE.CATEGORY | CUSTOMERS.AFFECTED | TOTAL.CUSTOMERS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Minnesota | MN | MRO | East North Central | 2011-07-01 17:00:00 | 2011-07-03 20:00:00 | 3060.0 | 2011 | 7.0 | severe weather | NaN | normal | 70000.0 | 2595696 |
| Minnesota | MN | MRO | East North Central | 2014-05-11 18:38:00 | 2014-05-11 18:39:00 | 1.0 | 2014 | 5.0 | intentional attack | vandalism | normal | NaN | 2640737 |
| Minnesota | MN | MRO | East North Central | 2010-10-26 20:00:00 | 2010-10-28 22:00:00 | 3000.0 | 2010 | 10.0 | severe weather | heavy wind | cold | 70000.0 | 2586905 |
| Minnesota | MN | MRO | East North Central | 2012-06-19 04:30:00 | 2012-06-20 23:00:00 | 2550.0 | 2012 | 6.0 | severe weather | thunderstorm | normal | 68200.0 | 2606813 |
| Minnesota | MN | MRO | East North Central | 2015-07-18 02:00:00 | 2015-07-19 07:00:00 | 1740.0 | 2015 | 7.0 | severe weather | NaN | warm | 250000.0 | 2673531 |
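The typecasting step described above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the project's actual code: the helper name `combine_date_time` and the string formats of the raw date and time values are assumptions.

```python
import pandas as pd

def combine_date_time(df, date_col, time_col, new_col):
    """Merge a date column and a time column into a single Timestamp column.

    Hypothetical helper: assumes dates parse with pd.to_datetime and
    times look like "HH:MM:SS".
    """
    out = df.copy()
    dates = pd.to_datetime(out[date_col], errors="coerce")
    times = pd.to_timedelta(out[time_col], errors="coerce")
    # Rows missing either piece end up as NaT.
    out[new_col] = dates + times
    return out.drop(columns=[date_col, time_col])

# Toy rows standing in for the raw, all-string dataframe.
raw = pd.DataFrame({
    "OUTAGE.START.DATE": ["2011-07-01", "2014-05-11", None],
    "OUTAGE.START.TIME": ["17:00:00", "18:38:00", None],
    "OUTAGE.DURATION": ["3060", "1", None],
})

clean = combine_date_time(
    raw, "OUTAGE.START.DATE", "OUTAGE.START.TIME", "OUTAGE.START"
)
# Numerical information stored as strings is converted to floats.
clean["OUTAGE.DURATION"] = pd.to_numeric(clean["OUTAGE.DURATION"], errors="coerce")
```

The same helper could be applied to the restoration date and time columns to build ‘OUTAGE.RESTORATION’.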
Univariate Analysis
‘CLIMATE.REGION’ Distribution
A bar chart was created to get an idea of the distribution of climate regions.
The Northeast region has the most recorded power outages in the dataset, and the West North Central region has the fewest.
‘OUTAGE.DURATION’ Distribution
A histogram was created to get an idea of the distribution of outage durations.
The histogram appears to be heavily skewed right with some outliers of extreme outage durations.
A choropleth was also created to look at the distribution of average outage duration, per state. It’s important to note that the average durations may be skewed higher because of the outliers of longer outage durations.
Indeed, it appears that similar average durations cluster geographically. For example, higher mean power outage durations are generally found in states such as New York, New Jersey, and West Virginia, which cluster in the northeastern part of the country. Lower mean power outage durations are generally found in states such as Montana, Wyoming, and South Dakota, all of which are in the West North Central region. Thus, the bivariate analysis will illustrate the relationship between ‘OUTAGE.DURATION’ and ‘CLIMATE.REGION’.
Bivariate Analysis
Outage Duration and Climate Region
In the bivariate analysis, we’ll use a coarser granularity for outage duration, taking the mean outage duration by state, in order to avoid noisy visualizations and make it easier to identify patterns and bigger-picture trends across climate regions.
Most notable is the East North Central climate region boxplot, as it has the highest median and largest max for mean outage duration of states in that region, suggesting that outage durations are longer in that climate region. Although the Northeast and Central regions have a relatively low median, they have a more skewed distribution and wider spread, indicating that they could be prone to longer outage durations.
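The coarser granularity used for the boxplots can be sketched as a two-step aggregation: first average durations within each state, then summarize those state means within each climate region. The rows below are invented for illustration, and the column name `MEAN.DURATION` is an assumption.

```python
import pandas as pd

# Toy outage records (values are made up).
outages = pd.DataFrame({
    "U.S._STATE": ["Minnesota", "Minnesota", "Wisconsin", "Texas", "Texas"],
    "CLIMATE.REGION": [
        "East North Central", "East North Central", "East North Central",
        "South", "South",
    ],
    "OUTAGE.DURATION": [3060.0, 1.0, 2000.0, 500.0, 1500.0],
})

# Step 1: one mean duration per state (keeping its climate region).
state_means = (
    outages.groupby(["CLIMATE.REGION", "U.S._STATE"])["OUTAGE.DURATION"]
    .mean()
    .reset_index(name="MEAN.DURATION")
)

# Step 2: the per-region statistics a boxplot would display.
region_summary = state_means.groupby("CLIMATE.REGION")["MEAN.DURATION"].agg(
    ["median", "max"]
)
```

Working with state means gives each state equal weight within its region, so a state with many recorded outages does not dominate the boxplot.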
Outage Duration and Customers Affected
The earlier choropleth showed that similar average durations tend to appear in the same regions, which points to another possible relationship between outage duration and customers affected, as some regions are more densely populated than others. For example, lower mean power outage durations are generally found in states such as Idaho, Wyoming, and Montana, all of which are less populated. A scatterplot is thus used to observe the relationship between ‘OUTAGE.DURATION’ and ‘CUSTOMERS.AFFECTED’.
The association between average outage duration and average number of customers affected appears to be positive and somewhat linear, with a few outliers. The scatterplot indicates that generally the longer the outage duration, the more customers affected.
Interesting Aggregates
Outage Duration, Climate Region, and Month
In addition to investigating where longer power outages occur, we can also investigate when outages tend to be longer using a pivot table. (The column headers represent the month from ‘MONTH’, e.g. 1.0 = January.)
| CLIMATE.REGION | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0 | 11.0 | 12.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Central | 5320.42 | 3728.60 | 709.80 | 2621.21 | 2924.45 | 2112.77 | 1412.07 | 1500.75 | 5428.29 | 4355.00 | 1634.30 | 1110.38 |
| East North Central | 10983.25 | 2728.80 | 13200.40 | 4592.91 | 7395.80 | 4457.40 | 3136.10 | 2730.92 | 3908.89 | 3356.43 | 4442.56 | 4213.70 |
| Northeast | 1161.29 | 5398.94 | 2501.88 | 819.11 | 1409.38 | 2565.00 | 2616.00 | 3652.84 | 2786.95 | 5488.26 | 876.94 | 3992.36 |
| Northwest | 846.36 | 1041.58 | 539.88 | 1004.20 | 222.67 | 370.09 | 326.00 | 1148.08 | 902.29 | 1703.25 | 4171.00 | 2988.60 |
| South | 3473.44 | 999.60 | 2022.62 | 1234.12 | 1690.18 | 1944.97 | 1643.13 | 2496.28 | 10335.06 | 887.00 | 1720.60 | 7403.67 |
| Southeast | 1731.53 | 1425.89 | 1153.67 | 474.50 | 3359.60 | 1243.28 | 569.80 | 3986.91 | 3763.00 | 5701.60 | 203.67 | 1241.22 |
| Southwest | 53.83 | 174.53 | 1276.33 | 124.00 | 84.50 | 220.50 | 9023.75 | 109.00 | 300.00 | 566.75 | 16.00 | 1450.57 |
| West | 3903.62 | 1543.32 | 3540.24 | 853.31 | 342.73 | 504.78 | 992.78 | 382.58 | 337.00 | 2048.41 | 1410.17 | 2687.52 |
| West North Central | NaN | NaN | 56.00 | NaN | 0.00 | 40.40 | NaN | 100.00 | NaN | 106.00 | 30.50 | 5160.00 |
One observation is that colder months tend to see the longest outages: the highest average outage durations occur during fall and winter months for several regions (Central, East North Central, West, Northwest, Southeast). Another notable observation is that the West North Central row contains many NaN values, indicating that power outages were recorded much less frequently in that region (many months have no data). The months that do have data mostly show very short durations in comparison to other regions, with December as the exception.
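A table of this shape can be produced with `DataFrame.pivot_table`. The sketch below uses a handful of invented rows and assumes the cells are mean durations; it is meant only to show the mechanics, including how NaN cells arise.

```python
import pandas as pd

# Toy outage records (values are made up).
outages = pd.DataFrame({
    "CLIMATE.REGION": ["Central", "Central", "Central", "West", "West"],
    "MONTH": [1.0, 1.0, 9.0, 1.0, 12.0],
    "OUTAGE.DURATION": [100.0, 300.0, 500.0, 50.0, 70.0],
})

# Rows: climate regions. Columns: months. Cells: mean outage duration.
duration_by_month = outages.pivot_table(
    index="CLIMATE.REGION",
    columns="MONTH",
    values="OUTAGE.DURATION",
    aggfunc="mean",
)
```

A region and month combination with no recorded outages has nothing to average, so it shows up as NaN. Passing a different column to `columns`, such as ‘CAUSE.CATEGORY’, would give a breakdown along that variable instead.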
Outage Duration, Climate Region, and Causes
The average outage duration in each region can also be compared with the causes behind those outages, which can reveal what kind of recovery resources to focus on for each region. (The column headers are cause categories from ‘CAUSE.CATEGORY’)
| CLIMATE.REGION | equipment failure | fuel supply emergency | intentional attack | islanding | public appeal | severe weather | system operability disruption |
|---|---|---|---|---|---|---|---|
| Central | 322.000000 | 10035.250000 | 346.058824 | 125.333333 | 1410.000000 | 3250.007519 | 2695.200000 |
| East North Central | 26435.333333 | 33971.250000 | 2376.050000 | 1.000000 | 733.000000 | 4434.817308 | 2610.000000 |
| Northeast | 215.800000 | 14629.571429 | 195.984733 | 881.000000 | 2655.000000 | 4429.902857 | 773.500000 |
| Northwest | 702.000000 | 1.000000 | 373.811765 | 73.333333 | 898.000000 | 4838.000000 | 141.000000 |
| South | 295.777778 | 17482.500000 | 325.607143 | 493.500000 | 1163.976190 | 4391.349057 | 866.074074 |
| Southeast | 554.500000 | NaN | 504.666667 | NaN | 2865.400000 | 2662.560345 | 169.312500 |
| Southwest | 113.800000 | 76.000000 | 265.672131 | 2.000000 | 2275.000000 | 11572.900000 | 329.222222 |
| West | 524.809524 | 6154.600000 | 857.677419 | 214.857143 | 2028.111111 | 2928.373134 | 363.666667 |
| West North Central | 61.000000 | NaN | 23.500000 | 68.200000 | 439.500000 | 2442.500000 | NaN |
It can be seen that the East North Central region faces the longest power outages, particularly due to equipment failure and fuel supply emergency. In the Southwest region though, those causes played much less of a role and the longest power outages were due to severe weather instead.
It’s worth noting that the columns used in the EDA contain NaN values that have not been dealt with yet, which may skew or bias the results above. Missingness will thus be assessed next.
Assessment of Missingness
NMAR Analysis
NMAR (not missing at random) missingness is where the chance that a value is missing depends on the actual, unreported value itself. A column in the power outages dataset that I believe could be NMAR is the ‘CAUSE.CATEGORY’ column, which describes the categories of events that cause major power outages. One possibility for a value to be missing is that the cause itself is unknown, or so unusual that it doesn’t fit into a category. Another possibility is that it’s missing due to the nature of the cause, where “bad” causes are more likely to go unreported because they reflect poorly on certain groups. For example, cause categories like “equipment failure” may be a bad look for those who built and maintain the power grids, and cause categories like “intentional attack” may imply that those responsible for the grids failed to keep citizens safe and are bad at their job. Though we can’t conclude for certain that the ‘CAUSE.CATEGORY’ column is NMAR, it seems plausible that its missingness depends on the actual cause itself. Additional data that could explain the missingness (thereby making it MAR) would be data on reporting policies, which would show whether there are specific criteria for reporting certain types of causes.
Missingness Dependency
A column with non-trivial missingness is ‘CUSTOMERS.AFFECTED’. Permutation tests are used to analyze whether the missingness of this column depends on ‘CAUSE.CATEGORY’ and whether it depends on ‘CLIMATE.CATEGORY’.
‘CUSTOMERS.AFFECTED’ vs ‘CAUSE.CATEGORY’
A pivot table and bar graph will be used to compare the two distributions:
- The distribution of ‘CAUSE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is missing.
- The distribution of ‘CAUSE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is not missing.
| CAUSE.CATEGORY | affected_missing = False | affected_missing = True |
|---|---|---|
| equipment failure | 0.027498 | 0.067720 |
| fuel supply emergency | 0.006416 | 0.099323 |
| intentional attack | 0.182401 | 0.494357 |
| islanding | 0.031164 | 0.027088 |
| public appeal | 0.019248 | 0.108352 |
| severe weather | 0.657195 | 0.103837 |
| system operability disruption | 0.076077 | 0.099323 |
The two distributions of cause category look very different, suggesting that the missingness of ‘CUSTOMERS.AFFECTED’ may depend on ‘CAUSE.CATEGORY’. A permutation test will be used to check this dependency.
Null hypothesis: The distribution of ‘CAUSE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is missing is the same as the distribution of ‘CAUSE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is not missing.
Alternative hypothesis: The distribution of ‘CAUSE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is missing is different than the distribution of ‘CAUSE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is not missing.
Test statistic: Total variation distance
Significance level: 0.05
The resulting p-value of 0.0 leads us to reject the null hypothesis and conclude that the missingness of ‘CUSTOMERS.AFFECTED’ is dependent on ‘CAUSE.CATEGORY’.
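The permutation test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names `tvd` and `missingness_permutation_test`, the number of repetitions, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def tvd(df, group_col, cat_col):
    """Total variation distance between the cat_col distributions of two groups."""
    # Each column of `dist` is one group's distribution over the categories.
    dist = pd.crosstab(df[cat_col], df[group_col], normalize="columns")
    return dist.diff(axis=1).iloc[:, -1].abs().sum() / 2

def missingness_permutation_test(df, missing_col, cat_col,
                                 n_repetitions=500, seed=0):
    """Return (observed TVD, p-value) for missingness of missing_col vs cat_col."""
    rng = np.random.default_rng(seed)
    work = df[[cat_col]].copy()
    work["is_missing"] = df[missing_col].isna()
    observed = tvd(work, "is_missing", cat_col)

    simulated = np.empty(n_repetitions)
    for i in range(n_repetitions):
        # Shuffling the missingness labels breaks any link to cat_col,
        # which is what the null hypothesis asserts.
        work["shuffled"] = rng.permutation(work["is_missing"].to_numpy())
        simulated[i] = tvd(work, "shuffled", cat_col)

    p_value = float(np.mean(simulated >= observed))
    return observed, p_value

# Toy data in which missingness clearly depends on the cause category.
toy = pd.DataFrame({
    "CAUSE.CATEGORY": ["severe weather"] * 40 + ["intentional attack"] * 40,
    "CUSTOMERS.AFFECTED": [1000.0] * 36 + [np.nan] * 4
                          + [np.nan] * 30 + [500.0] * 10,
})
observed, p_value = missingness_permutation_test(
    toy, "CUSTOMERS.AFFECTED", "CAUSE.CATEGORY"
)
```

A p-value reported as 0.0 from a simulation like this means that none of the shuffled statistics reached the observed one, not that the true probability is exactly zero.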
‘CUSTOMERS.AFFECTED’ vs ‘CLIMATE.CATEGORY’
A pivot table and histogram will be used to compare the two distributions:
- The distribution of ‘CLIMATE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is missing.
- The distribution of ‘CLIMATE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is not missing.
| CLIMATE.CATEGORY | affected_missing = False | affected_missing = True |
|---|---|---|
| cold | 0.303506 | 0.326531 |
| normal | 0.484317 | 0.496599 |
| warm | 0.212177 | 0.176871 |
The two distributions of climate category look very similar, suggesting that the missingness of ‘CUSTOMERS.AFFECTED’ may not depend on ‘CLIMATE.CATEGORY’.
A permutation test will be used to analyze the dependency of the missingness.
Null hypothesis: The distribution of ‘CLIMATE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is missing is the same as the distribution of ‘CLIMATE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is not missing.
Alternative hypothesis: The distribution of ‘CLIMATE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is missing is different than the distribution of ‘CLIMATE.CATEGORY’ when ‘CUSTOMERS.AFFECTED’ is not missing.
Test statistic: Total variation distance
Significance level: 0.05
The resulting p-value of 0.3504 is above the significance level, so we fail to reject the null hypothesis: there is no evidence that the missingness of ‘CUSTOMERS.AFFECTED’ depends on ‘CLIMATE.CATEGORY’.
Because the missingness of ‘CUSTOMERS.AFFECTED’ depended on at least 1 column (‘CAUSE.CATEGORY’), it can be concluded that the overall missingness of ‘CUSTOMERS.AFFECTED’ is MAR (missing at random).
Hypothesis Testing
Now that we have an understanding and assessment of our data, we can address our question: Are certain locations more prone to major outages?
Major outages will be defined as outages whose duration is above the national average duration. More specifically, the goal is to determine which climate regions have longer-than-average power outages more frequently. To do so, hypothesis tests will be performed for each of the climate regions using the following hypotheses:
Null hypothesis: The region’s proportion of major outages is equal to the national proportion of major outages.
Alternative hypothesis: The region’s proportion of major outages is greater than the national proportion of major outages.
Test statistic: The proportion of major outages
Significance level: 0.01
- Because multiple testing is occurring, the risk of committing a Type I error (rejecting the null hypothesis when it’s actually true) is increased. Therefore, the significance level is set lower to reduce the Type I error probability.
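One way a test with these hypotheses could be simulated is sketched below. It assumes that, under the null hypothesis, each outage in a region is major with probability equal to the national proportion, so the number of major outages in a region of size n is binomial. The function name `proportion_test` and the toy arrays are hypothetical, and the project's actual simulation may differ.

```python
import numpy as np

def proportion_test(is_major, in_region, n_repetitions=10_000, seed=0):
    """One-sided test: is the region's proportion of major outages
    greater than the national proportion?

    Returns (observed regional proportion, p-value).
    """
    rng = np.random.default_rng(seed)
    is_major = np.asarray(is_major, dtype=bool)
    in_region = np.asarray(in_region, dtype=bool)

    national_prop = is_major.mean()
    n_region = int(in_region.sum())
    observed = is_major[in_region].mean()

    # Simulate regional proportions under the null hypothesis.
    simulated = rng.binomial(n_region, national_prop, size=n_repetitions) / n_region
    # Large proportions favor the alternative, so count the upper tail.
    p_value = float(np.mean(simulated >= observed))
    return observed, p_value

# Toy data: 200 outages nationwide, 50 of them major (25%).
# The first 40 outages belong to the region of interest; 20 of those are major.
is_major = np.array([True] * 20 + [False] * 20 + [True] * 30 + [False] * 130)
in_region = np.array([True] * 40 + [False] * 160)

observed, p_value = proportion_test(is_major, in_region)
reject_null = p_value < 0.01
```

With real data, `is_major` would come from comparing ‘OUTAGE.DURATION’ to its national mean, and `in_region` from comparing ‘CLIMATE.REGION’ to the region being tested.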
Final adjustments
Before conducting the hypothesis tests, missingness should be dealt with, as the relevant columns ‘CLIMATE.REGION’ and ‘OUTAGE.DURATION’ both have NaN values. There are only 6 NaN values in ‘CLIMATE.REGION’, a trivial portion of the dataset. Therefore, listwise deletion can be used to get rid of the rows with NaN values in ‘CLIMATE.REGION’. There are 58 NaN values in ‘OUTAGE.DURATION’. Imputation will be used to fill in these NaN values, specifically median imputation, since earlier exploration revealed that there are outliers in the dataset and the median is more robust to outliers.
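The two adjustments above can be sketched on a toy dataframe as follows. The sketch assumes the overall median is used for imputation; whether the project imputed with the overall median or a per-region median is not stated.

```python
import numpy as np
import pandas as pd

# Toy data with one missing region and one missing duration (values made up).
toy = pd.DataFrame({
    "CLIMATE.REGION": ["Central", None, "West", "West", "Central"],
    "OUTAGE.DURATION": [100.0, 50.0, np.nan, 300.0, 10000.0],
})

# Listwise deletion: drop rows whose climate region is missing.
cleaned = toy.dropna(subset=["CLIMATE.REGION"]).copy()

# Median imputation: the median is computed from the observed durations only,
# so the extreme value has little influence on the fill value.
median_duration = cleaned["OUTAGE.DURATION"].median()
cleaned["OUTAGE.DURATION"] = cleaned["OUTAGE.DURATION"].fillna(median_duration)
```

In this toy example the mean of the observed durations would be pulled far upward by the single extreme outage, while the median stays at a typical value.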
Testing
The result:
| Region | P-value | Reject Null Hypothesis |
|---|---|---|
| East North Central | 1.0 | False |
| Central | 1.0 | False |
| South | 1.0 | False |
| Southeast | 0.0 | True |
| Northwest | 1.0 | False |
| Southwest | 0.0 | True |
| Northeast | 0.0 | True |
| West North Central | 0.0 | True |
| West | 0.0 | True |
It looks like the Southeast, Southwest, Northeast, West North Central, and West regions have a proportion of major outages greater than the national proportion, as the null hypothesis is rejected for those regions. This suggests that regional factors impact power outage durations and that those regions are more prone to major power outages.
An observation of interest is that the East North Central region isn’t on the list of regions that are more prone to major power outages, despite having the highest median and max outage duration out of all regions. This may be because the East North Central region has major outages that are longer compared to other regions, but its proportion of major outages is actually similar to the proportion of major outages nationwide.
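One last note is that hypothesis tests provide evidence for or against a hypothesis rather than proof. Because the data are observational and do not come from a randomized controlled trial, the results cannot establish that a region itself causes longer outages.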