Garrett Woody

LESSONS LEARNED THROUGH NETWORK ANALYSIS & FORECASTING OF THE U.S. POWER GRID

 

PROJECT GOALS

Explore and Analyze the U.S. Power Grid Using Network Analysis

Explore ML Forecasting using Energy Demand for Balancing Authorities within the Grid

Visualize the U.S. Power Grid and Centralize Some Information About It on This Website for Easier Accessibility

 

AUDIENCE

The intended audience for this blog is people interested in data science who are also curious about possible applications to the U.S. power grid, and it is written with that in mind. The blog generally does not assume prior knowledge of the power grid, but it does assume some familiarity with data science concepts and terms. Additionally, our GitHub repository includes the code used to collect, clean, and enhance the data sets, making it easier for readers not only to reproduce our results but also to do further analysis of their own.


The intended audience for the website where this blog finds a home is anyone curious about the U.S. power grid. Much of the information shared and visualized here is otherwise not easily found, not centralized, not clearly visualized, or not available at all. The website is meant to be a friendly landing point for those interested in learning more and seeing some fun, interactive visuals.


An audience we did not initially have in mind is the organizations that host the public datasets we used. As you will read below, creating the power grid network from the publicly available data hosted by the Department of Homeland Security required a great deal of data cleaning and manipulation. The “Disturbance and Unusual Occurrence” dataset had some variable-naming irregularities and also required a significant amount of cleaning and manipulation. Lastly, the EIA has been transitioning its API from an older query approach to one more in line with current REST API standards, and some queries still fail when standard request methods and parameter inclusion are used. Given the amount of data cleaning necessary, we reached out to each of the parties hosting this data, and they expressed interest in learning about the issues we encountered and how we solved them. We plan to share some of our outcomes so these publicly available datasets are more user-friendly for those who come after us.


 

BACKGROUND

The U.S. power grid network is both interesting and incredibly complex. It is governed by 70+ balancing authorities and comprises 13,000+ power plants, 77,000+ substations, 160,000+ miles of high-voltage power lines, and millions of low-voltage power lines and distribution transformers connecting customers around the United States. At the largest management level, the grid in the U.S. lower 48 states consists of the Western, Eastern, and ERCOT (Electric Reliability Council of Texas) Interconnections (Figure 1.1). Energy flows freely within each interconnection, but AC-DC-AC valves act as gatekeepers of energy flow between them. While these interconnections describe the physical system of the grid, the operation of the electric system is managed by entities called balancing authorities (BAs) (Figure 1.2).


Quoting the EIA, “A balancing authority ensures, in real-time, that power system demand and supply are finely balanced. This balance is needed to maintain the safe and reliable operation of the power system. If demand and supply fall out of balance, local or even wide-area blackouts can result…balancing authorities maintain appropriate operating conditions for the electric system by ensuring that a sufficient supply of electricity is available to serve expected demand, which includes managing transfers of electricity with other balancing authorities.” Based on the daily demand forecast, BAs will coordinate which generating units will run the next day. A utility serving as a BA (e.g. Florida Power and Light) selects units to run. The EIA states “In areas served by one of the largest independent system operators (ISOs) or regional transmission organizations (RTOs), about two-thirds of the country, generators submit offers into the RTO/ISOs day-ahead energy market to generate power the next day. RTO/ISO day-ahead energy markets use sophisticated optimization models to select bids and set wholesale prices. These arrangements are financially binding, and generators who fail to perform as agreed may face financial penalties.”



The North American Electric Reliability Corporation (NERC) is an international entity that regulates the reliability of the “Bulk Power System” (BPS). The BPS comprises the facilities and control systems necessary for operating an interconnected electric energy supply and transmission network. A visual breakdown of NERC regions is shown in Figure 1.3; these regions appeared in various datasets related to power generation, disturbances, and emissions for the power plant network. NERC develops and enforces reliability standards.



BAs oversee and balance the electricity generated by power plants in a given region. As shown in Figure 1.4 and Figure 1.5, a power plant generates AC electricity, and step-up transformers within substations raise the voltage for longer-distance transmission so that less of the energy being bought and sold is dissipated in the transmission wires. Transmission substations can split power off to different areas, and eventually substations with step-down transformers lower the voltage to different levels depending on the kind of application and consumer.




The "Bulk Electric System" (BES), somewhat different than BPS, refers to the physical electrical generation resources, transmission lines, interconnections with neighboring systems, and associated equipment, generally operated at voltages of 100kV or higher (it is this part of the network that we visualized and explored), and not the medium and lower voltage wires (e.g. below 69 kV) which also were not available in the publicly available datasets. Figure 1.7 and Figure 1.8 highlight the different kinds of transmission lines that can be found throughout the power grid.




 

DATA SOURCES

To see how we accessed, cleaned, and enhanced the datasets below for the analysis performed, please check out the Jupyter notebooks in our GitHub repository.

U.S. Energy Information Administration (EIA)

U.S. Environmental Protection Agency (EPA)

Department of Homeland Security (DHS)

Department of Energy (DOE)

U.S. Census

NOAA Northeast Regional Climate Center (NRCC)

 

PART A: NETWORK ANALYSIS

Data Manipulation & Cleaning


Infrastructure

The Department of Homeland Security hosts various datasets related to U.S. infrastructure, including the power grid. As discussed in the background section, there are various components that make up the power grid and these are hosted in different data sets (power plants, substations, and transmission lines). At first glance there appear to be fields connecting these: the power plant data set has fields for the substations they’re connected to and the transmission lines data set has fields connecting substations to substations. Upon further investigation, however, it was discovered that the keys used are not unique as they leverage substation names instead of substation codes. Approximately 4.5% (3,500) of the substations have duplicate names, which affected 16.5% of the connections in the power plant and transmission line data sets.


In addition to the duplicate names, the lack of a unique key meant that spelling errors had to be cleaned manually, and in some cases a misspelled name happened to match a legitimate substation located somewhere else entirely. As an example, there is a substation named “Rock Island” in Florida and another named “Rock Island Dam” in Washington; omitting “Dam” links the line to the wrong substation.


To overcome the challenges presented by the missing unique key, we decided to leverage the GeoJSON data provided by the Department of Homeland Security instead. While this on its own doesn’t solve the problem, it allowed us to use latitude and longitude values to make an educated guess about the correct substation based on proximity. Our initial attempt used native methods in GeoPandas, but that proved very time-consuming on the transmission line data set of almost 100,000 lines. To streamline the process, we leveraged the query method from SciPy’s cKDTree to quickly find the nearest neighbors based on the coordinates at each end of the transmission lines (Figure 2.1). While this method does not guarantee 100% accuracy, it significantly reduces errors and corrects links that do not pass the visual test, such as the aforementioned transmission line stretching from Florida to Washington.


Figure 2.1

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def ckdnearest(gdA, gdB):
    """Match each end of every transmission line in gdA to its nearest substation in gdB."""
    # (lat, lon) of the last and first points of each line geometry (the two endpoints).
    nA = np.array(list(gdA.geometry.apply(lambda g: (g.coords.xy[1][-1],
                                                     g.coords.xy[0][-1]))))
    nB = np.array(list(gdA.geometry.apply(lambda g: (g.coords.xy[1][ 0],
                                                     g.coords.xy[0][ 0]))))
    # Substation coordinates used to build the spatial index.
    nC = np.array(list(zip(gdB['lat'], gdB['lon'])))

    btree = cKDTree(nC)
    distA, idxA = btree.query(nA, k=1)   # nearest substation to one end of each line
    distB, idxB = btree.query(nB, k=1)   # nearest substation to the other end

    gdB_nearestA = gdB.iloc[idxA][['sub_code', 'name']] \
                      .reset_index(drop=True) \
                      .rename(columns={'sub_code':'sub_code_1',
                                       'name':'sub_name_1'})
    gdB_nearestB = gdB.iloc[idxB][['sub_code', 'name']] \
                      .reset_index(drop=True) \
                      .rename(columns={'sub_code':'sub_code_2',
                                       'name':'sub_name_2'})

    # Attach the matched substation codes/names and the match distances to each line.
    gdf = pd.concat(
        [
            gdA.reset_index(drop=True),
            gdB_nearestA.astype(str),
            round(pd.Series(distA, name='sub_dist_1'), 3),
            gdB_nearestB.astype(str),
            round(pd.Series(distB, name='sub_dist_2'), 3)
        ],
        axis=1)

    return gdf

While the Department of Homeland Security provides data for power plants, substations, and transmission lines, it does not provide data for the related balancing authorities. The Environmental Protection Agency does, however, via its eGRID data set, and thankfully eGRID uses a unique key for power plants, which allowed us to tie the balancing authorities to the power plant and substation network.


The connections between balancing authorities, and related information about them, were present in the Energy Information Administration’s EIA-930 data sets, so we had yet another data set to leverage to create the various connections for the network. As a balancing authority is not a single building somewhere like a power plant or substation, there is no latitude or longitude data included. While that is not necessary to link the various components together from a network perspective, it became necessary in order to visualize the network within the United States. To gather those coordinates programmatically, we gathered the latitude and longitude values for all power plants connected to a given balancing authority and then leveraged the centroid method from Shapely’s MultiPoint class to find the geometric center for each balancing authority (Figure 2.2). A handful of balancing authorities required manual latitude and longitude values, as they have no U.S. power plants connected to them.


Figure 2.2


from shapely.geometry import MultiPoint

# One row per (plant, balancing authority) pair; a plant may list several BAs.
bas_pos = plants.copy()[['plant_code', 'lat', 'lon', 'connected_ba']].dropna()
bas_pos['connected_ba'] = bas_pos['connected_ba'].str.split('; ')
bas_pos = bas_pos.explode('connected_ba')

# Collect the coordinates of every plant connected to each balancing authority.
bas_pos['coords'] = list(zip(bas_pos['lat'], bas_pos['lon']))
bas_pos = bas_pos.groupby('connected_ba')['coords'] \
                 .apply(list) \
                 .reset_index() \
                 .rename(columns={'connected_ba':'ba_code'})

# Use the geometric center of those plants as the BA's position. Points were built
# as (lat, lon), so centroid.x is the latitude and centroid.y the longitude.
bas_pos['lat'] = bas_pos['coords'].apply(lambda x: MultiPoint(x).centroid.x)
bas_pos['lon'] = bas_pos['coords'].apply(lambda x: MultiPoint(x).centroid.y)

While the original data sets did not include a unique key and the balancing authorities lacked latitude and longitude values, our processed data sets have both based on the previously mentioned methods. To help understand the new relationships, we created an Entity Relationship Diagram (Figure 2.3).


Figure 2.3


Disturbances

The data sets for the power grid infrastructure were used to build the network, but there is a complementary data set of disturbances and outages that addresses the likelihood of something happening. The Department of Energy hosts this information going back to 2002, but we only went back to 2010 in our analysis, partially due to data cleanliness and partially because the more recent data is more accurate. From 2011 to 2022, there were 2,643 events captured.


Affected Area

One of the largest challenges with this data set is identifying the affected area programmatically and tying it back to power plants, substations, and/or balancing authorities. At the most granular level, disturbances are listed at the county level, and more recent years do a decent job of capturing that. Because the form used to capture events is largely free-form, however, the entries are still fairly inconsistent. Especially in earlier years, the area affected may be listed as something like “Central Texas” or “Metropolitan Detroit”, making it challenging to assign to a specific county. In cases where a specific county could not be identified, only the state was kept, which was weighted accordingly in downstream analysis.


As disturbances are not always limited to a single county or location, the area affected often includes multiple locations. Due to discrepancies in how the data was collected over the years, the separators are not consistent either. In more recent years the area affected is listed like “Michigan: Washtenaw County, Lapeer County; Illinois: Cook County”, but sometimes entries are separated by commas instead of semicolons, sometimes the county is listed before the state, etc. Parsing these programmatically required a lot of regex, along with a separate data set of states and their abbreviations.
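
To give a flavor of the parsing involved, below is a simplified sketch (not the exact code from our notebooks): the column names, the tiny abbreviation map, and the helper function are illustrative assumptions, and the real data requires many more special cases.

import re
import pandas as pd

# A state-name-to-abbreviation map would normally be loaded from a separate
# reference data set; a small excerpt is shown here for illustration only.
STATE_ABBREVS = {"Michigan": "MI", "Illinois": "IL", "Texas": "TX"}

def parse_area_affected(area: str) -> pd.DataFrame:
    """Split a free-form 'Area Affected' string into (state, county) rows."""
    rows = []
    # Entries for different states are usually separated by semicolons,
    # e.g. "Michigan: Washtenaw County, Lapeer County; Illinois: Cook County"
    for chunk in re.split(r";", str(area)):
        if ":" in chunk:
            state, counties = chunk.split(":", 1)
            state = state.strip()
            for county in counties.split(","):
                county = county.strip()
                if county:
                    rows.append({"state_affected": STATE_ABBREVS.get(state, state),
                                 "county_affected": county})
        elif chunk.strip():
            # No county detail available; keep the state only.
            rows.append({"state_affected": chunk.strip(), "county_affected": None})
    return pd.DataFrame(rows)

# One event spanning multiple counties and states becomes multiple rows.
parse_area_affected("Michigan: Washtenaw County, Lapeer County; Illinois: Cook County")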


The raw data includes a single column titled “Area Affected”, but after the various cleaning activities we ended up with the following 3 columns: “state_affected”, “county_affected”, and “county_affected_fips”. If an event spanned multiple counties and/or states, it was broken into multiple rows with the same event ID so that the affected location can be used as a unique key when combining with the other data sets.


Event Type

As with the affected area, there was wide inconsistency between the event types over the years. While there are 14 types of events listed on the OE-417 form for submitting disturbances, there is also an “Other” category for free form text and they may have also used a different form in prior years. While there are only 14 event types currently available, we found 151 unique types from 2011 through 2022. Based on the existing types listed and guidance from the United States Electricity Industry Primer (Figure 2.4), we used regex to narrow this down to 7 categories (Figure 2.5).



Figure 2.5

disturb['event_type'] = disturb['event_type'].str.replace(
    '(.*)(weather|storm|disaster|earthquake)(.*)',
    'severe_weather',
    regex=True)
disturb['event_type'] = disturb['event_type'].str.replace(
    '(.*)(vandalism|vandalsim|cyber|attack|suspicious|sabotage)(.*)',
    'vandalism/cyber_attack',
    regex=True)
disturb['event_type'] = disturb['event_type'].str.replace(
    '(.*)(system|transmission|interruption)(.*)',
    'system_operations',
    regex=True)
disturb['event_type'] = disturb['event_type'].str.replace(
    '(.*)(appeal)(.*)',
    'public_appeal',
    regex=True)
disturb['event_type'] = disturb['event_type'].str.replace(
    '(.*)(supply|inadequacy|loss)(.*)',
    'fuel_supply_emergency',
    regex=True)
disturb['event_type'] = disturb['event_type'].str.replace(
    '(.*)(malfunction|shed|failure|equipment|fault|islanding)(.*)',
    'malfunction/equipment_failure',
    regex=True)
disturb['event_type'] = disturb['event_type'].str.replace(
    '(.*)(event)(.*)',
    'other',
    regex=True)

Exploring disturbance data using those new event type categories, it is easy to identify the top 3 as “Severe Weather”, “System Operations”, and “Vandalism/Cyber Attack” (Figure 2.6). It is also worth noting that there are significantly more events in the past couple of years, but that does not necessarily mean that more events are occurring and may require further exploration. It could simply mean that data is being captured more accurately and therefore spread across affected areas at a more granular level.


Figure 2.6


Building the Network

After going through the various cleaning activities, building the networks was relatively straightforward. We ultimately ended up with two networks: one for power plants and substations (Figure 2.7) and one for power plants and balancing authorities (Figure 2.8 and Figure 2.9). The networks were overlaid on the U.S. using Basemap for both visual context and speed. Because there are about 90,000 power plant and substation nodes with 180,000 edges, relying on NetworkX to calculate the location of the nodes/edges using the spring layout and the related Fruchterman-Reingold force-directed algorithm can take several hours.


Figure 2.7 - Power Plants/Substations to Substations

Figure 2.8 - Power Plants to Balancing Authorities


Figure 2.9 - Balancing Authorities to Balancing Authorities
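
Our notebooks draw the maps with Basemap, but the general approach of assembling a graph from the cleaned edge lists and using real coordinates as the layout (instead of the slow spring layout) can be sketched as below. The column names and the toy edge list are illustrative assumptions, not the actual data.

import networkx as nx
import pandas as pd

# Illustrative cleaned edge list produced by the nearest-substation matching above:
# one row per transmission line, with the substation codes at each end.
lines = pd.DataFrame({
    "sub_code_1": ["SUB001", "SUB002", "SUB002"],
    "sub_code_2": ["SUB002", "SUB003", "SUB004"],
})

# Illustrative node table with coordinates for plotting.
subs = pd.DataFrame({
    "sub_code": ["SUB001", "SUB002", "SUB003", "SUB004"],
    "lat": [28.5, 29.1, 30.0, 27.8],
    "lon": [-81.4, -82.0, -83.2, -80.3],
})

G = nx.from_pandas_edgelist(lines, source="sub_code_1", target="sub_code_2")

# Using real coordinates as the layout avoids the expensive spring layout
# entirely and keeps the drawing geographically meaningful.
pos = {row.sub_code: (row.lon, row.lat) for row in subs.itertuples()}
nx.draw_networkx(G, pos=pos, node_size=10, with_labels=False)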


Building the networks was not just for visualization, however, as creating them allowed us to capture the following metrics:

  • Degree Centrality

    • Degree centrality is the number of edges that a node has. In the context of these networks, a higher degree centrality points to the potential for more risk. If that central node (e.g. power plant) goes down, how many other things go with it?

  • Betweenness Centrality

    • Betweenness centrality measures the fraction of shortest paths passing through a node. For the power grid, that can point to the increasing importance of a node (e.g. substation, balancing authority) if it is on the shortest path that power flows through.

  • Clustering Coefficient

    • The clustering coefficient is a measure of the degree to which nodes tend to cluster together. If all of a node’s neighbors are connected to each other, the clustering coefficient is 1. If a node is connected to other nodes but none of them are connected to each other, the clustering coefficient is 0. For the power grid, this touches on the redundancy of the network: if the clustering coefficient is high, then there are potentially other paths for power to flow in the event of an outage.

Of these metrics, betweenness centrality was the most computationally expensive. On a 32-core machine with 96 GB of memory, it still took 11 hours to calculate betweenness centrality for the power plant/substation network. This is largely because NetworkX performs the calculation on a single core, so the extra cores did not have a significant impact. The results of that analysis were exported to a pickle file in our GitHub repository. There may be ways to improve performance by leveraging something like Spark and GraphX for distributed calculation, but betweenness centrality does not appear to be a built-in method there and may require custom libraries.
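
For reference, the NetworkX calls involved are sketched below on a small stand-in graph. NetworkX also supports an approximate betweenness centrality that samples only k source nodes, which is one way to trade a little accuracy for a large reduction in runtime on a graph the size of the substation network (we used the exact calculation in our notebooks).

import networkx as nx

G = nx.karate_club_graph()  # small stand-in graph; our notebooks use the grid networks

degree = nx.degree_centrality(G)
clustering = nx.clustering(G)

# Exact betweenness considers shortest paths between every pair of nodes,
# which is what took ~11 hours on the full substation network.
betweenness_exact = nx.betweenness_centrality(G)

# Approximate betweenness: sample k source nodes instead of all of them.
betweenness_approx = nx.betweenness_centrality(G, k=10, seed=42)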


Case-Study: Assessing Inherent Vulnerability

One of the project goals was to “use engineered features from network analysis combined with other features (energy demand, disturbance data, population size, etc.) to identify nodes that are either more vulnerable or important to the network”.


The focus of this analysis is on inherent risk rather than residual risk. Inherent risk is the risk before any mitigating factors are implemented. A power plant may be inherently susceptible to tornadoes, for instance, but the residual risk is much lower once its walls have been reinforced with steel beams. As those mitigating factors are not generally available to us, residual risk was not a realistic metric to calculate.


The metrics from the network analysis (degree centrality, betweenness centrality, and clustering coefficient) all point to a level of risk based on the importance of the nodes (substations or balancing authorities). The probability of a disturbance and/or outage points to the likelihood of something happening based on historical events. The potential population affected should something occur points to the impact. These factors all contributed to our risk scoring mechanism, relative to each other (Figure 2.10).


Figure 2.10
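
As a rough illustration of the kind of combination Figure 2.10 describes, a score can be built from normalized factors, where redundancy (clustering) pulls risk down and everything else pushes it up. The equal weights and column names below are illustrative assumptions, not our exact formula.

import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    """Scale a factor to [0, 1] so the components are comparable."""
    return (s - s.min()) / (s.max() - s.min())

def risk_score(df: pd.DataFrame) -> pd.Series:
    # Higher centrality, higher outage probability, and larger population all
    # push risk up; higher clustering (more redundancy) pushes it down.
    return (min_max(df["degree_centrality"])
            + min_max(df["betweenness_centrality"])
            + min_max(df["outage_probability"])
            + min_max(df["population"])
            + (1 - min_max(df["clustering_coefficient"]))) / 5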


We recognize that this is not the end all be all for risk in the U.S. power grid, but we view it as a lens to assist with that discussion.


Excerpts of the top risk scores for substations and balancing authorities can be found in Figure 2.11 and Figure 2.13 below, but you can recreate the results and explore them further leveraging the Jupyter notebooks in our Github repository.


When reviewing the substations (Figure 2.11), you can see that many at the top of the list are in Florida. If you look at the most risky states, however, based on a mean risk score (Figure 2.12), you can see that Missouri tops the list and Florida is number 8. It is also worth noting that some of the substations or balancing authorities with the highest risk scores might have low centrality measures and/or probabilities of outages, but that is offset by the lack of redundancy (small clustering coefficient) and potential impact if something were to happen (a large population).


Again, there are different ways to slice these metrics, but this is one slice that we hope can offer insight into potential vulnerabilities in the U.S. power grid.


Figure 2.11 - Substations by Risk Score


Figure 2.12 - States by Mean Substation Risk Score

Figure 2.13 - Balancing Authorities by Risk Score


Figure 2.14 - Region by Mean Balancing Authority Risk Score


Possible Future Work

  • Case studies of specific blackouts and extended outages (e.g. 2003 NE Blackout, 2017 Hurricane Irma hits Florida, 2021 Texas winter blackouts, etc.) to examine where and why outages occurred, seeing the effect on the grid.

  • Testing out different kinds of network analysis on one of the 3 interconnections (which behave largely independently). We chose not to go down this path, but exploring possible findings related to vulnerability by performing SIS or independent cascade analysis could be interesting.

  • We were originally interested in cybersecurity concerns for the grid, but through searches and some interviews we found that we couldn’t get access to the data we would need, so we had to pivot. Using the disturbance data to infer information about grid vulnerabilities to cyber threats could be explored more specifically. One could look at transmission lines that fall below the 69 kV threshold, and thus below the cybersecurity standard set by FERC, and could also look at the age of different elements of the grid, though that would require other datasets.

  • In Extreme Weather and Climate Vulnerabilities of the Electric Grid: A Summary of Environmental Sensitivity Quantification Methods, there are modeling functions provided for various electrical generation hazards (fire, wind, etc.) such that raw or live data could be incorporated into a model to assess a kind of probabilistic risk of disturbance/outage in a given area.

  • An interesting possible extension came up in this article: What the Coming Wave of Distributed Energy Resources Means for the U.S. Grid. There is potential to track how demand on the energy grid changes as “microgrids” and better-developed smart-grids become more common and generate more energy (sample illustration shown in Figure 2.15).

Figure 2.15



 

PART B: FORECASTING

Forecasting is an important and challenging task. As we alluded to in the background section, the EIA gets its day-ahead (24-hour) Demand Forecast (DF) estimate directly from each balancing authority, and BAs do not all use the same method to generate their forecasts (nor do they all define their forecasts in exactly the same way), which makes the modeling process more challenging. Because some BAs operate in somewhat different markets for the energy sold, they may also benefit from reporting demand a certain way in one region but not in another (another factor that can affect a forecast). BAs will sometimes pay outside companies for forecasts, which perform multivariate forecasts and univariate forecasts with exogenous data. Our main goal was to see how well a forecasting model could perform based, at least at first, on freely available historical demand or net generation data alone. Obviously, our model would not be able to outcompete a proprietary, professional model, but we wanted to see how closely our forecasts could match not only held-out test data, but also the day-ahead forecasts provided to the EIA.


Some quick notes about terminology (as stated by the EIA):

  • Net generation (NG) represents the metered output of electric generating units in the BA's electric system. This generation only includes generating units that are managed by the BA or whose operations are visible to the BA.

  • Energy Interchange (EI) is the net metered tie-line flow from one BA to another directly interconnected BA. Total net interchange is the net sum of all interchange occurring between a BA and its directly interconnected neighboring BAs. Negative interchange values indicate net inflows, and positive interchange values indicate net outflows.

  • Demand (D) is a calculated value representing the amount of electrical load within the BA's electric system. A BA derives its demand value by taking the total metered net electricity generation within its electric system and subtracting the total metered net electricity interchange occurring between the BA and its neighboring BAs.

  • Demand Forecast (DF) is produced by each BA for every hour of the next day as a day-ahead electricity demand forecast. These forecasts help BAs plan for and coordinate the reliable operation of their electric system.


Data Manipulation & Cleaning

Fortunately, the data pulled directly from the EIA’s API did not require much cleaning, just filtering using a query string to reduce the size of the output for easier and faster manipulation later. Since the EIA (at the time of this post) is getting ready to discontinue support for its legacy API (APIv1) in November 2022, all code was written for APIv2 interaction methods. There were some bugs in the system that led to somewhat uglier query code, because some parameter-entry features of API queries were not working properly yet (and the website went down a few times, leaving the API non-functional), which slowed down this aspect of the project. The “01_data_collection” and “05_energy_forecast” Colab notebooks include a function pipeline we created that will download all available data (generally, June 15, 2015 to present) for all balancing authorities for analysis.
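
The general shape of an APIv2 call is sketched below. Because some parameters did not always behave as expected through standard request parameters during the transition, we fell back to building the query string directly; the route, facet names, and filter values shown here are illustrative assumptions (see the 01_data_collection notebook for the working pipeline).

import requests

API_KEY = "YOUR_EIA_API_KEY"  # obtained from the EIA open data portal

# Route and facet names below are illustrative of the APIv2 query-string style.
url = (
    "https://api.eia.gov/v2/electricity/rto/daily-region-data/data/"
    f"?api_key={API_KEY}"
    "&frequency=daily"
    "&data[0]=value"
    "&facets[respondent][]=FPL"
    "&facets[type][]=D"            # e.g. D = demand, NG = net generation
    "&start=2015-06-15"
)

response = requests.get(url)
response.raise_for_status()
records = response.json()["response"]["data"]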


Case-Study: Forecasting Energy Demand for a Balancing Authority

We explored energy net generation, demand, demand forecasts, and energy interchange for all BAs, but this post will focus, as an example, on net generation, demand, and demand forecast data pulled from the EIA’s API for one balancing authority, Florida Power and Light (FPL), because one of us is from Florida =). The data spans the entire range available from the API at the time: June 15, 2015 to August 17, 2022.


First, we wanted to explore seasonality in the data. Figure 3.1 shows a seasonal decomposition for the entire time series and Figure 3.2 shows the data for a single year, 2016. The seasonal decomposition breaks the data down into “observed” (the full dataset), “trend” (any inherent trend in the data, which also shows some seasonality here), “seasonal” (which helps indicate what repeating patterns can be drawn out of the data), and “residual” (the remaining variance in the data not represented in the other three graphs). Interesting to note: the large anomaly in 2017 coincides with Hurricane Irma hitting Florida, wiping out power for many (some for hours, some for weeks). The duration of that outage is an indication of the vulnerability of this part of the grid, which we spoke about in the prior section.


Figure 3.1

Figure 3.2
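
As a minimal sketch of how a decomposition like Figure 3.1 can be produced, statsmodels’ seasonal_decompose is one common option; the synthetic demand series below is a stand-in assumption for the real FPL series pulled from the API.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic stand-in for the daily FPL demand series (MWh), indexed by date.
idx = pd.date_range("2015-06-15", "2022-08-17", freq="D")
rng = np.random.default_rng(0)
demand = pd.Series(
    200_000
    + 40_000 * np.sin(2 * np.pi * np.arange(len(idx)) / 365)  # yearly cycle
    + rng.normal(0, 8_000, len(idx)),
    index=idx,
)

result = seasonal_decompose(demand, model="additive", period=365)
fig = result.plot()  # observed, trend, seasonal, and residual panels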

When data is considered “stationary”, the assumption is, essentially, that the mean and variance do not change over time. This is often an assumption in time-series analysis, as it leads to easier and more robust statistical analysis when it holds; when it does not, differencing and deseasonalizing techniques are sometimes applied. One statistical test used to assess stationarity is the Augmented Dickey-Fuller (ADF) unit root test. Its null hypothesis is that the time series is not stationary, so a p-value < 0.05 indicates that the series can be considered “stationary enough”. Though the data oscillates, if you look at the larger-scale seasonal cycle (365 days, as indicated in the top graph of Figure 3.2), the data appears fairly stationary, and the ADF test returns a p-value of 0.043 (for the data up to 8/03/22 at least), supporting this conclusion. Interestingly, when more recent data is included (as referenced in some visuals below), net energy generation reaches peaks never seen before, and since the data is trending upward the ADF p-value exceeds 0.09, indicating that the series ceases to be stationary; this explains part of why the models below do not forecast the most recent data as well.
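
The test itself is a one-liner with statsmodels; a sketch, reusing the demand series from the decomposition example above:

from statsmodels.tsa.stattools import adfuller

# `demand` is the daily series from the decomposition sketch above.
adf_stat, p_value, *_ = adfuller(demand.dropna())
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")
# p-value < 0.05 -> reject the unit-root null, i.e. treat the series as stationary.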


Even though Figure 3.2 suggests a weekly seasonality (with 52 cycles per year visible in the “seasonal” graph), forecast results were not accurate with that seasonality. This was tested with ForecastingGridSearchCV. We chose the ExpandingWindowSplitter approach to cross-validation (CV), a technique more appropriate than standard CV for time series: it starts with an initial window for the first step of cross-validation, and for each successive split the window of training data expands until the whole training dataset is in the last CV window. Figure 3.3 and Figure 3.4 visualize the expanding window and the forecast horizon used for testing (Figure 3.3 used a step length of ⅓ of a year and Figure 3.4 used a full year, which is why fewer splits are displayed). These two visuals were created with code from this article.

Figure 3.3

Figure 3.4


First, the time series is split into a training and testing set (Figure 3.5), and the training set is split up into cross-validated sections. In the examples here, the test set is the last year of available data.


Figure 3.5

To determine what seasonality to use, ForecastingGridSearchCV evaluated performance with the seasonal periodicity (sp) set to 7, 52, and 365 for two simple forecasters, NaiveForecaster and ThetaForecaster, each with a few tuned parameters. A performance metric commonly used to assess time-series forecasts is the Mean Absolute Percentage Error (MAPE). An interesting outcome, however, was that the MAPE values for sp=365 were not all significantly better, yet when the fitted results were graphed against the test data the discrepancy was much greater than the MAPE results indicated. From this we learned that ForecastingGridSearchCV alone was not enough to identify the best models; graphing the predictions and comparing them to test data was also necessary, which slowed down the model selection step.
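
A sketch of the kind of search performed, using sktime’s ForecastingGridSearchCV with an ExpandingWindowSplitter, is shown below. The window sizes, step length, and parameter grid are illustrative assumptions rather than our exact settings, and y_train stands in for the training portion of the daily demand series.

import numpy as np
from sktime.forecasting.model_selection import (
    ExpandingWindowSplitter,
    ForecastingGridSearchCV,
)
from sktime.forecasting.theta import ThetaForecaster
from sktime.performance_metrics.forecasting import MeanAbsolutePercentageError

# Training portion of the daily demand series (e.g. everything but the last year).
y_train = demand.iloc[:-365]

cv = ExpandingWindowSplitter(
    initial_window=3 * 365,    # start with ~3 years of history
    step_length=122,           # grow the window ~1/3 of a year per split
    fh=np.arange(1, 366),      # forecast horizon: the next 365 days
)

gscv = ForecastingGridSearchCV(
    forecaster=ThetaForecaster(),
    cv=cv,
    param_grid={"sp": [7, 52, 365]},  # candidate seasonalities
    scoring=MeanAbsolutePercentageError(),
)
gscv.fit(y_train)
print(gscv.best_params_)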


A seasonality of sp=365 notably outperformed the others with both forecasters, especially when looking at other performance metrics, as shown in Figure 3.6, a table of the metrics used. The MAPE value is helpful, but the “pearsonr_test_pred” value is also very informative (this is the correlation coefficient when graphing y_test vs. y_predicted, where a value closer to 1 is better), as is the “dtw_cost” (the dynamic time warping cost, often used to measure the similarity, or distance, between time series that may be somewhat out of sync in time).


The metrics represented in Figure 3.6 below are:

  • MAPE (365_day_forecast)

  • MAPE (30_day_forecast)

  • MdAPE (365_day_forecast) - Median Absolute Percent Error

  • MAE (365_day_forecast) - Mean Absolute Error

  • MSE (365_day_forecast) - Mean Squared Error

  • Pearsonr (365_day_forecast) - the correlation coefficient when the predicted data was graphed vs. the test (actual) data

  • dtw_cost (365_day_forecast) - Pairwise Dynamic Time Warping Cost

Figure 3.6


To highlight the problem with looking at MAPE alone (and why other performance metrics were explored): if you look at Figure 3.7 and Figure 3.8 (which correspond to rows 2 and 4, respectively, in Figure 3.6), you can see that though the “day30_forecast (MAPE)” values are comparable, the graphs tell a different story. In fact, when running certain other models under different random states, the boxplots of the results showed more variability than the gap between these two values, indicating that the two metrics are really not very different.


Figure 3.7


Figure 3.8


Selecting a Forecasting Model

To help formalize the model selection process, we used a suite of performance metrics to identify the best-performing models. For each forecast model, we gathered the following error metrics (a sketch of how several of these can be computed follows the list):

  • MAPE (365_day_forecast)

  • MAPE (30_day_forecast)

  • MdAPE (365_day_forecast) - Median Absolute Percent Error

  • MAE (365_day_forecast) - Mean Absolute Error

  • MSE (365_day_forecast) - Mean Squared Error

  • dtw_cost (365_day_forecast) - Pairwise Dynamic Time Warping Cost

  • Pearsonr (365_day_forecast) - the correlation coefficient when the predicted data was graphed vs. the test (actual) data

  • Pearsonr_ema (365_day_forecast) - the correlation coefficient of y_pred vs y_test when an exponential moving average is applied to the data

  • Timeit (365_day_forecast) - the %timeit magic function was applied to the fitting of the model to the training data in multiple loops to test time efficiency

  • Memit (365_day_forecast) - the %memit magic function was applied to the fitting of the model to the training data in multiple loops to find the peak memory usage
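
Several of these can be computed with standard libraries, as sketched below; the DTW cost and the %timeit/%memit measurements are omitted here, and the function name and dictionary keys are illustrative.

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

def error_metrics(y_test, y_pred) -> dict:
    """Collect several of the error metrics above for one 365-day forecast."""
    ape = np.abs((np.asarray(y_test) - np.asarray(y_pred)) / np.asarray(y_test))
    return {
        "mape_365": mean_absolute_percentage_error(y_test, y_pred),
        "mdape_365": float(np.median(ape)),
        "mae_365": mean_absolute_error(y_test, y_pred),
        "mse_365": mean_squared_error(y_test, y_pred),
        "pearsonr_test_pred": pearsonr(y_test, y_pred)[0],
    }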


In speaking with an individual who works at the EIA, we learned that day-ahead and other short-term forecasts are generally provided to the EIA by each BA, and that multiple input datasets and outside parties sometimes provide proprietary model forecasts as well. The goal here was first to model and forecast the time series based solely on historical data for that balancing authority.


A few notable challenges arose:

  • The fact that the default performance metric (MAPE) wasn’t enough to distinguish between best models

  • The high level of daily variability that exists in the energy data

  • The high seasonality of 365 days


Because of the high seasonality, ARIMA models were non-starters: they took an incredibly long time to auto-tune, and even with provided parameters the model still took a comparatively long time to produce a forecast while not being competitive with significantly faster models.


Looking at a smaller time window of the data in Figure 3.9, you can see not only the volatility of the data but also how the NaiveForecaster makes its prediction when the modeling strategy is set to “last”: the y_pred data is an exact copy of the prior 365-day season (represented in blue, “y_train”).


Figure 3.9


One of the top base models tested was the ThetaForecaster (equivalent to simple exponential smoothing (SES) with drift, with seasonal decomposition performed first and the resulting forecast reseasonalized). Figure 3.10 shows how the forecast based on training data compares to the test data.


Figure 3.10
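
Fitting the forecaster and producing the 365-day forecast compared in Figure 3.10 looks roughly like the sketch below; y_train and y_test are assumed to be the training and final-year test portions of the demand series described above.

import numpy as np
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.theta import ThetaForecaster

# Relative horizon covering every day of the test period.
fh = ForecastingHorizon(np.arange(1, len(y_test) + 1), is_relative=True)

theta = ThetaForecaster(sp=365)
theta.fit(y_train)
y_pred = theta.predict(fh)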

Figure 3.11 plots the predicted values vs. the actual (test) values from the graph above (a correlation coefficient close to 1 would indicate that the test and predicted values correlate well, pairwise). The graph shows not only the variance present, but also a “rightward skew” where the most recent spike in the data is a historical anomaly (reaching the highest values ever recorded).


Figure 3.11



Because the data is so volatile, to better see the outcome of the models (and make the comparison more robust to the high variance), Figure 3.12 shows the original data along with the same data smoothed by an exponential moving average (EMA) in orange and an EMA applied to the ThetaForecaster’s predictions in green.


Figure 3.12
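
The smoothing itself is a single pandas call applied to the series from the previous sketch; the span below is an illustrative choice, not the exact value used for the figure.

# Exponential moving average applied to the actual and predicted series to damp
# day-to-day volatility before comparing them (span is illustrative).
y_test_ema = y_test.ewm(span=14).mean()
y_pred_ema = y_pred.ewm(span=14).mean()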

Another way we tried to improve model performance and alleviate some of the volatility issues was through Bagging (“Bootstrap Aggregating”) forecasts, which “are obtained by forecasting bootstrapped time series and then aggregating the resulting forecasts.” Figure 3.13 shows the results of an STLBootstrapTransformer, which generates a population of time series similar to the input series: it uses the Box-Cox transformation to stabilize the variance, decomposes the seasonal, trend, and residual components, and creates “synthetic residuals series that mimic the autocorrelation patterns of the observed series”. The synthetic and actual time series are all fed into a forecaster and the resulting forecast is an aggregate of the results. This method shows promise, often outperforming a non-bagged model, but sometimes the gains were not significant and, more importantly, the bagging forecast could take 20 times longer (or more) to run.


Figure 3.13

The models that tended to perform best were combinations of more than one base model using either StackingForecaster or AutoEnsembleForecaster. StackingForecaster stacks “two or more Forecasters and uses a meta-model (regressor) to infer the final predictions from the predictions of the given forecasters”, while AutoEnsembleForecaster “finds optimal weights for the ensembled forecasters using a given method or a meta-model (regressor)”. These models not only outperformed other models but were also much faster than the bagging methods described above.


Figure 3.14 shows a table summarizing the models and the resulting performance metrics used to hone in on the top models. One of the better models, titled “autoensemble_nte_br”, is an AutoEnsembleForecaster using an ensemble of three forecasters (Naive, Theta, and Exponential Smoothing) with a Bayesian Ridge regressor as the meta-regressor. Other meta-regressors tested were k-neighbors and random forest.


Figure 3.14
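
A sketch of the “autoensemble_nte_br” style of model is shown below; the component parameters are illustrative rather than our exact tuned values, and y_train and fh are assumed from the earlier sketches.

from sklearn.linear_model import BayesianRidge
from sktime.forecasting.compose import AutoEnsembleForecaster
from sktime.forecasting.exp_smoothing import ExponentialSmoothing
from sktime.forecasting.naive import NaiveForecaster
from sktime.forecasting.theta import ThetaForecaster

# Naive + Theta + Exponential Smoothing, with a Bayesian Ridge regressor
# finding the ensemble weights.
autoensemble_nte_br = AutoEnsembleForecaster(
    forecasters=[
        ("naive", NaiveForecaster(strategy="last", sp=365)),
        ("theta", ThetaForecaster(sp=365)),
        ("exp_smoothing", ExponentialSmoothing(trend="add", seasonal="add", sp=365)),
    ],
    regressor=BayesianRidge(),
)
autoensemble_nte_br.fit(y_train)
y_pred_ensemble = autoensemble_nte_br.predict(fh)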


Unsurprisingly, models tended to fare better on a 30-day forecast rather than the 365-day forecast. In fact, the ProphetForecaster (originally developed by Facebook) is a top performer in the 30-day forecast (shown in Figure 3.15).


Figure 3.15

Even some of the top performers struggled with the upward trend of recent data as shown in Figure 3.16. It is easy to see how this kind of forecasting can be used for anomaly detection based on model performance over time.


Figure 3.16

Depending on the stakeholder looking at the data, if the goal is to prevent an outage or possible failure, providing a confidence interval for the forecast is not only valuable but appropriate. Figure 3.17 shows a 90% prediction interval applied to a model; if failure prevention were the chief concern, BAs would potentially plan for the upper bound of the prediction, though sometimes the actual values fall outside even the 90% interval (which happened with all models). If the goal is instead to set the price of energy for the following day, the actual forecast is of greater interest than an upper or lower bound.


Figure 3.17
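
For a fitted probabilistic forecaster (such as the ThetaForecaster above), recent sktime versions can return interval forecasts directly; a sketch, where coverage=0.9 corresponds to the 90% band in Figure 3.17 (the exact call may differ depending on the sktime version in use):

# Columns of the returned frame distinguish the lower and upper bounds.
pred_interval = theta.predict_interval(fh, coverage=0.9)
print(pred_interval.head())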

Some models with a stochastic component in Figure 3.14 were run with different random states to get a sense of the variance of a given model’s predictions, and several models showed notable overlap, indicating that, in reality, some of the forecasters essentially “tie” (which is part of why no single winner was declared). The best way to improve model results would be to feed in exogenous variables or perform a multivariate forecast with additional relevant data (e.g. local weather, temperature, etc.); we elaborate on this in the next section, “Possible Future Work”.


Lastly, we also wanted to test higher-powered models, so we tuned several parameters of an LSTM (Long Short-Term Memory) network, a type of recurrent neural network (RNN) often cited as good for forecasting. We found that, though it showed some potential, the time to run the model made it a non-starter for this task: many models listed in the table above could outperform the LSTM, which took over 90 minutes to run (as an example).


Possible Future Work

  • Temperature is a key factor affecting energy demand. Ideally, local temperature would be included as an exogenous variable in the model. In a video call we set up with someone at the EIA, it was suggested that one way to improve the forecast with exogenous data would be to get the daily temperature for the different counties that make up a balancing authority and weight them by county population (because the temperature where more people live has a greater effect on demand). Feeding this in as another variable to fit on would be a great next step, to see how much it improves the model (if at all). Getting this kind of live data proved difficult, which is why it was not implemented in this project beyond the tests described for the FPL BA above.

  • One of the next things we will explore - after this project - is creating a forecast in Tableau Public that automatically updates by running code that queries the API once a week - for example - and indicates a somewhat “live” forecast model. A sample method is explored in this article: Create Tableau Dashboards with the Auto-Refresh Data Source Feature.

  • Especially with recent attempts to accelerate efforts to combat climate change, it is important to think about the emissions profile of the power plants that supply the electricity we use. The EIA has rolled out a Power Profiler to show how clean the electricity you are using is, somewhat similar to the work that WattTime has done. WattTime measures the Marginal Operating Emissions Rate (MOER), a reflection of how many pounds of CO2 are emitted by the marginal power plant (effectively, the power plant that either most recently came into operation or is about to) per megawatt-hour (MWh) of energy it generates. Figures 3.18 and 3.19 below illustrate this concept with demonstration graphs for a theoretical BA: it already has a renewable power plant in operation (e.g. wind, solar, etc.) and a nuclear power plant, as well as coal-fired power plants. The three levels in yellow effectively represent three coal-fired power plants of increasing operating cost, and likely emissions. In this visual, the marginal plant is the first yellow bar (where the demand line sits) because it is in use and, since that power plant is not yet at capacity, as demand increases it remains the one most recently “turned on”. The MOER value would therefore reflect the emissions of this particular generator (and is thus different from a simple average over all power plants currently in operation). Forecasting MOER or CO2 emissions directly would be a great area for future work.



  • Performing anomaly/outlier analysis using time-series classification and/or similarity metrics, which could “flag” data that is particularly unusual given past data, prompting deeper analysis or even assessing whether “concept drift” has occurred (indicating the forecasting approach may need to be updated in some way).

  • Hourly forecasting of demand (the daily data used was aggregated from hourly data gathered from BAs directly). There is a pattern/seasonality within a given day, and hourly forecasts would also make risk assessment an hourly matter, because there are certain hours of the day when demand peaks (e.g. see Figure 3.20 below) and disturbances occur more frequently (as shown in the disturbance dataset).




  • More exploration of neural networks for forecasting (when there is less emphasis on producing a timely forecast and the pursuit is purely accuracy). We performed some parameter tuning for an LSTM model, but we wonder whether different NN architectures and more tuning of learning rates, layer sizes, etc. could notably outperform a simpler stacking or auto-ensemble forecasting model.





 

PART C: WEBSITE FOR BROADER CONSUMPTION

One of our goals early on was to have a public-facing website with information and visualizations that could help centralize and convey information about the U.S. power grid. This led to hosting this blog and the linked GitHub repository, whose notebooks let the data be downloaded directly from the web, cleaned, manipulated, and stored in a convenient location so that others can perform analysis of their own. We also posted interactive Tableau Public dashboards that people have found helpful.


What dashboards can be found?

  • Power Plant Map and profilers

  • CO2 and other metrics by state

  • BA network information

  • Forecasting

Possible Future Work

  • In a prior milestone, one of our team members looked at the ML/supervised-learning task of classifying power plants based on data present in the eGRID datasets (including emissions, number of generators, energy generation rates, etc.). It turned out that though some power plants were labeled differently, they had similar emissions signatures, enough so that certain power plants thought to be less damaging to the environment were grouped together in ML tasks. A blog post about this kind of task may be added.

  • As discussed in the prior section, performing MOER forecasting by BA and showing hourly and monthly projections (comparing their error metrics) would also be very interesting to explore.


 

ANNOTATED REFERENCES

Documentation

Papers/Books

Web Links

