The Testing Show: Storm Readiness Testing
Have you ever wondered what happens when something like a nor’easter, a tornado, a blizzard, or a wildfire comes through your area? How do critical systems like utilities, hospitals, and emergency services deal with these situations? How can they help ensure their IT capabilities will remain intact, or as intact as possible?
Matthew Heusser and Michael Larsen welcome Kimberly Humphrey and Scott Swanigan to talk about Storm Readiness Testing and how organizations can plan for the worst and be up and running as quickly as possible.
- Qualitest Open Positions
- Storm Readiness Testing
- Disaster Recovery Testing: Ensuring Your DR Plan Works
- Utilities Storm Readiness: Don’t Leave Storm Critical IT Systems Out in the Cold
- March 1–3, 2018 nor’easter (Winter Storm Riley)
- March 6–8, 2018 nor’easter (Winter Storm Quinn)
Michael Larsen (INTRO) (00:00):
Hello and welcome to The Testing Show.
Storm Readiness Testing.
This show was recorded on March 15, 2021.
Matthew Heusser and Michael Larsen welcome Kimberly Humphrey and Scott Swanigan to talk about the unique challenges that go with Storm Readiness Testing. What happens to vital infrastructure and the IT systems related to it during hurricanes, tornadoes, thunderstorms, blizzards, and even wildfires? How can we learn from those events and find out how to make sure systems are able to withstand such emergencies? Listen and find out!
… and with that, on with the show.
We’ve got a real treat for you this week on The Testing Show. Many of you may have heard of the power outages in Texas. These were the result of a snowstorm, which is not unheard of; it could have been predicted. As you might guess, afterward every politician and executive would say, “Why didn’t QA find that bug?” Or to put it differently, “How is it possible that we have this potential for problems in our system and didn’t find it and mitigate it?” That sounds like a testing and QA problem. As it turns out, there’s a discipline called storm readiness testing that is exactly that, and there’s actually a group at Qualitest that does this, primarily for utilities. When we thought about that, we said, “We’ve got to have them on the show!” So today we’ve got Kimberly Humphrey, who’s vice president of accounts at Qualitest. We haven’t met before. Is it Kimberly or is it Kim?
Okay. And can you tell us a little bit about how you got involved into that role, and what it entails?
Well, as an account manager, I’m responsible for a lot of our accounts, either in the central United States or in the utility vertical. About 20 years ago, I was working in the financial industry. We were going to start to automate some underwriting software and started working with what was then Andersen Consulting. As part of that, I filled a lot of different roles. One of them was to do testing, and I fell in love with it. From there, my career changed a lot and accelerated. I ended up at Orasi Software, which then led me to Qualitest.
And we also have Scott Swanigan, who is vice president of Utilities for Olenick and Associates, a company that partners with Qualitest on the resourcing and staffing side to solve some of these very complex problems. In that role, Scott has actually done and supervised a fair bit of this storm readiness testing as a discipline. Did I get that right, Scott?
Yeah, you’ve got it right, at a high level.
Okay. How does one become a storm readiness tester? Did you just get hired into that role? Or how did you get into storm readiness testing?
I was trained as a mechanical engineer, and I worked in the nuclear power industry as a systems engineer for a handful of years. Our team came up with a software application to solve repetitive engineering calculations, and I ended up in an oversight role for that application. That really turned me on to IT and information systems. From there, I moved into a couple of small companies before joining Olenick in 2001. My career with Olenick spans several responsibilities. I was a testing lead for a SCADA/outage management system implementation back in the 2003-ish timeframe. And I’ve worked in the real-time space with real-time applications such as SCADA and LMS/ADMS, mobile dispatch, et cetera, either as a quality assurance manager or a project manager, for about 10 or 12 years. Then I rolled over into a quality sales role and a solutions director role. So when one of our clients had some issues in 2018 with some of the larger storms that passed through the Northern United States, they needed our assistance in pulling together a solution to ensure that their IT systems would perform better in the future. That’s kind of how that all came about.
Let’s get into Storm Readiness Testing. I’ve noticed in your resume that you’ve done a fair bit of mechanical engineering and integration systems, so physical control systems. For folks who’ve heard that word SCADA, it’s supervisory control and data acquisition: you have computers tied into physical things that are gathering information from the real world and taking action, which I imagine would be helpful for Storm Readiness Testing. But aside from checking for redundancy, I honestly have no idea what storm readiness testing is or how to do it. So maybe we should start there.
Yeah. So Storm Readiness Testing is a discipline and a test, all in one. It’s the solution that we bring to the table to ensure that the IT systems that support utilities during their outages, restoration, and customer reporting, that those critical systems are going to perform well under a heavy load and remain stable and reliable in order to support the mission of power restoration. In a nutshell, it’s that simple. What’s interesting about the Texas event that you mentioned earlier is that SRT, the way that we perform it, isn’t something that would have prevented the issues that we saw in Texas recently. Texas is a deregulated market, and they are largely dependent upon their own energy generation. The issues in Texas were a little bit different. Their generation facilities were not winterized to the degree that they could have been. That resulted in a tremendous loss of generation capacity at a time when it was desperately needed. The regulatory bodies, FERC and NERC in particular, I think, are going to do a little bit more of an investigation on that particular event to determine what course of action they in Texas need to take in order to ensure that it doesn’t happen again in the future. The challenges that they had and the lack of generation really weren’t so much IT related; that’s my point. I think the thing that the storm does highlight for us is the tremendous amount of damage and the negative impact that these weather events in general can have on electric utilities. First and foremost, it’s the safety of the people that rely on the power for heat in the winter, like in Texas, or air conditioning, et cetera. Public services rely on energy: police, fire departments, medical facilities, et cetera. A lot of those places had generation backup, but those are temporary sorts of arrangements.
So power is very important to people, just from a reliability perspective and a safety perspective. The other aspect, or impact, would be financial. When these storms hit, they take out power. Customers can often lose money because their businesses rely on power. Think of a restaurant that has large freezers and an investment in food stores. If there’s no power to run the freezers, the food will go to waste. There are also financial damages to the utility in the event that power is out for longer periods of time. There are states within the union that have reliability measures the utilities have to meet, and in the event that those reliability thresholds are not met, there can be financial penalties, depending upon the state. The final piece of this is reputational damage, which is very obvious from just watching the events on the news. Any utility that goes through a situation where there are prolonged outages is really hit hard from a PR perspective. All of this winds up being a big mess for utilities, so ensuring that their IT systems are ready to support a heavy storm load really becomes critical to their operations.
Thank you. You mentioned a couple of things that kind of surprised me. If I heard you right, you said that it’s actually about volume testing of the systems that report and take corrective action when things go down, so you can get people in the field to put power lines back up. You didn’t elaborate on what the corrective action would be, but it’s more about making sure that if something goes wrong, your reporting systems will work, if I heard that right. Also, I’m wondering if you could help us differentiate this from just plain old Disaster Recovery Testing. Isn’t this just disaster recovery testing with a fancy name?
Disaster recovery is about a backup system, making sure the systems come back and work right. For storm readiness, what we’re asking is: are the systems working? Whether it’s the notification systems to the end users, the businesses, or the residential customers, are they getting notified that there’s an outage, how long it’s going to be out, and when it will be restored?
So what is SRT? Storm Readiness Testing is a very particular type of performance or load test wherein we connect IT applications end to end. The kinds of IT applications that an electric utility is really interested in ensuring work during a storm are the customer-facing systems that allow us, the electric customers, to log in, report that our power is out, check in on the outage status, et cetera. Those outage channels include the web page, the mobile applications we use on our phones, text messaging, et cetera. So we have a whole customer-facing suite that is a concern. And then downstream from that, there are applications that the utilities use to manage the restoration of power. Outage Management is one of those; the outage management system predicts where outages are, or potentially are, based upon the call characteristics of the outages that are coming in through the various customer channels. AMI meters are a part of this as well, the automated meters that are on the side of our homes these days. When our power goes out, they automatically report back to the utility, “The power’s out.” There’s an influx of information back into these outage management systems, and the systems allow the dispatchers to reliably predict where they need to send the crews to restore the power. On the back end, we have the mobile dispatch system, or the mobile workforce management system as it’s known, that also helps in the storm restoration for the field crews. Field crews get tickets on the mobile device that’s in the truck, and they roll their truck to the problem to get to work restoring power. All these systems work together and need to be highly reliable and very resilient in order to restore power and meet the mission. The SRT testing to date throughout the industry, if it’s done at all, has typically been very isolated to independent systems.
So think about a performance test. We oftentimes run a performance test against a single application, like the outage management system I just mentioned, and we do that in isolation, or it’s been done in isolation, I should say, from the other systems that connect to it and feed it, and that it feeds through the other interfaces. SRT is unique in that we connect the test systems end to end, in a daisy-chain method. We load, through the upstream side, a large volume of outage tickets, customer calls, et cetera, and we let it flow down through the systems, and we monitor and observe how the systems perform and where they may have issues. Again, the big difference with SRT is the coordinated effort across systems that are connected.
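[Editor’s note] The daisy-chain approach Scott describes can be sketched in code. This is a hypothetical, minimal simulation, not any utility’s real systems: the stage names, channels, and delays are illustrative stand-ins. The point it demonstrates is the SRT idea of injecting a storm-sized burst of outage reports at the upstream end and timing each ticket across every connected stage, rather than load testing any one application in isolation.

```python
import random
import time
from collections import deque

# Illustrative daisy chain: customer intake -> outage management -> dispatch.
STAGES = ["customer_channel", "outage_management", "mobile_dispatch"]

def make_storm_load(n_tickets, seed=42):
    """Generate synthetic outage reports spread across several intake channels."""
    rng = random.Random(seed)
    channels = ["web", "mobile_app", "sms", "ami_meter"]
    return [{"id": i, "channel": rng.choice(channels)} for i in range(n_tickets)]

def run_end_to_end(tickets, per_stage_delay=0.0001):
    """Flow every ticket through all stages in order, recording end-to-end latency.

    per_stage_delay is a stand-in for real per-system processing time.
    """
    latencies = []
    for ticket in tickets:
        start = time.perf_counter()
        queue = deque([ticket])
        for stage in STAGES:
            item = queue.popleft()
            time.sleep(per_stage_delay)  # simulate the stage doing its work
            item[stage] = "processed"    # mark the stage as having handled it
            queue.append(item)
        # End-to-end time across the whole chain, not one stage in isolation.
        latencies.append(time.perf_counter() - start)
    return latencies

if __name__ == "__main__":
    load = make_storm_load(500)
    latencies = run_end_to_end(load)
    print(f"tickets: {len(load)}, worst end-to-end latency: {max(latencies):.4f}s")
```

A real SRT exercise would replace the simulated stages with the actual connected test environments and compare the observed latencies against the utility’s service thresholds; the value of the coordinated run is that a bottleneck at any interface shows up in the end-to-end numbers.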
So the question that I have for both of you, and either can answer it… I’m not going to jump up and down on Texas and what they could or couldn’t have done, or how they went about things, because I live in a glass house myself. I live in California. California has two major utility companies: Pacific Gas and Electric for Northern California, and I think it’s Southern California Edison for Southern California. I can definitely point to PG&E and a really significant disaster that I wonder if your team here might have been able to do something about, and that was the San Bruno pipeline explosion of 2010. That’s my hometown. We were evacuated. Fortunately, our neighborhood survived it, but right next door to our neighborhood, across a small canyon, they didn’t. Something in the neighborhood of 40 houses were damaged or destroyed, lives were lost, and many people were displaced. And this is just one of those things where it came to light. We didn’t know, or at least I didn’t know, that there was a massive gas line running underneath the housing development. All the people I asked, nobody knew this was there. PG&E, I guess, did; it came out later. Why wasn’t this marked? Why wasn’t this known? Why were houses built over this huge pipeline where this catastrophe could happen? The second one, again, when we’re talking about storms in California: we get rainfall occasionally. We’re used to being arid, but we do get rainfall. What we don’t often get is lightning. This past year, we got a really big lightning storm up in Northern California. While that was kind of exciting to see, the downside was it started a massive wildfire in the Santa Cruz mountains, and that really had an effect on the power grid, as you might imagine. Now, with storms and snow and ice, that’s one thing, but lightning and fire seem like a very different realm.
So to tie all this together, are there any ways the work that you do could have helped, or been an early warning, or somehow given some sense of, “Hey, we need to do something here,” and what we could do?
At a high level, Acts of God… there are things that we can’t avoid or prevent, okay? The wildfires, things like that. The utility companies aren’t going to be able to avoid those things; they’re going to happen. But how do their systems react during those times? Can they shut the power off? Are they notifying people the way they should? Are people being kept informed? Those are the things storm readiness testing helps with: how is it going to perform? Also, if there’s monitoring in production, we can recreate those storms to make sure that the systems react more quickly, things of that nature. It doesn’t do the root cause analysis on those systems. Could they check the pipeline and see from the generation where it is? The testing that we’re speaking about, Scott, I don’t believe would impact that, but the lessons learned would.
I think the fire scenario is an interesting one. Power lines being down in dry areas have started some of the fires and have been traced back as the root cause for PG&E in several cases. To Kimberly’s point, with the storm readiness testing we perform, one of the things that we look to infuse our solution with is a look at each utility and their high-activity scenarios. Not all utilities are created equal. Geography varies significantly. Some utilities are up against hurricanes more frequently than others; for California, it’s fires. In the Midwest, it’s often thunderstorms, sometimes tornadoes. So the Acts of God, if you will, are different, and they bring a different dynamic to how IT systems are impacted. Our SRT approach is to look at the data, talk to the IT system owners and the business users inside each utility, identify those high-activity test scenarios, the scenarios that are the most likely to occur, and test those specifically, rather than trying to apply a scenario from one utility onto another where it’s not applicable. Does that help?
Yes. Yes, it does. Thank you very much.
Yeah. I was going to ask about the Texas situation. It’s just in our minds, in our collective consciousness, for lots of reasons: for political reasons, for social reasons, for “weird, that’s a one-in-a-hundred-year event; Texas got a whole lot of snow.” So it’s kind of an edge case, and as testers, we love edge cases. We want to dissect them. I’m curious: with what happened in Texas, could that have been prevented, and if not prevented, handled better? Could systems have come online more quickly because SRT testing had been done and corrective action had been taken before the fact?
As far as I’ve heard, and there may be a finding that comes out of this in the future that I’m unaware of, at the moment, the IT systems were resilient and they performed their functions during that event. The power outages and issues that they saw in Texas and ERCOT were rooted in other, non-IT-related problems. We have seen other customers who’ve experienced IT system failures in the form of communications to their electric customers being completely absent because the communication systems are down. Communication channels, like the web application we were talking about earlier or your phone app: customers unable to see when the power would be restored, or perhaps getting information that’s stale because there are backups in the IT systems during these storms. That’s been a failure we’ve seen. We’ve also seen the back-end systems that manage the power restoration have some backups and some stability issues that resulted in a lack of visibility for the operators who are trying to restore the power, or the dispatchers who are trying to get the crews out to the field. These are the sorts of problems that we have seen, not so much in Texas, but in other areas of the United States. That’s been the genesis of why we’ve come up with the storm readiness test solution.
Yeah. And Matt, there’s a difference between the front end and then the power generation, like the pipeline explosion. That’s more to do with the generation or the transmission, if you will, of the natural gas than it is to do with a storm readiness piece of it.
Well, I think what I’m hearing is that SRT from a software perspective really is about testing the IT systems, and we don’t think the systematic failure in Texas was due to the IT systems. And it could have been worse. Things were bad in Texas. I don’t know if there were people that didn’t have potable water, but there were people that didn’t have power or heat for an extended period of time in parts of the country where they never really thought they would need it. So they don’t have things like a wood-burning stove; there’s no wood-burning component to the home. When they lose power, they can’t heat the house. They’re in trouble. It could have been worse is what I’m hearing. There could have been failures to notify people, to figure out where to send the trucks, to do the recovery. At least it wasn’t worse than it was, because the systems worked. At least that’s what I’m hearing. Back in 2018, you had two storms that came in back to back, first Riley and then Quinn, which mostly, I think, hit the US Northeast. There were a lot of outages, and there were people trying to restore power, and then you have another serious issue and you have to kind of reboot your restoration process with a bunch of failures. Could SRT have helped there?
That’s actually the genesis and the history behind SRT in general. In March of 2018, Riley and Quinn came across the Northern United States and hit the Northeastern corridor really hard. We saw rain, snow, and winds up to 70 miles per hour that persisted for about a 48-hour period as Riley hit, and then immediately on the heels of that came Quinn. By the end of the week, March 7th I think it was, our largest customer had more than a million and a half of their customers out of power. Those two storms caused a load on the IT systems that they just weren’t prepared to handle. That resulted in a lack of communications back to customers that were looking to get a status on the restoration of their power. It resulted in a lack of visibility for the utility personnel trying to restore the power and the systems that they use. So this perfect storm, if I can use that phrase, caused a significant amount of harm in a handful of different categories for the utility. With the power being out, there were safety issues related to the power that the police and the fire departments and medical facilities need, financial damages to customers, potential fines, and of course, reputational damage for the utility. All these things came into play, and our client stepped back from this and did a root cause investigation to determine how they could avoid this sort of scenario in the future. One of the things that came out of that root cause investigation was the fact that they needed to perform a more thorough end-to-end performance test across all the applications that support their business during a storm. That was the genesis of the Storm Readiness Test, and our team came quickly to the rescue. We’d been doing performance testing for this particular client for about 20 years, and they’ve known us to really be the experts in that category.
So they engaged us to pull together a plan that could pull off what is, in my experience, the largest test of its kind, certainly in my career and in any industry that I’m aware of. It’s a huge undertaking. The number of applications that are connected end to end for this performance test is on the order of 30. The test typically goes off in a single afternoon with upwards of 65 or 70 people on the call supporting various aspects of the load injection or the application systems monitoring. Quite the undertaking, and we’re really proud to be a part of it.
That’s amazing. I hope it helps. Have you got a success story or two? Have you done some storm readiness testing and made some significant changes that maybe got power back online faster at some point later?
Yeah, that’s a good question. The testing typically occurs two to four times a year, depending upon the business needs and what needs to be validated. On the IT side, the test often does reveal places within the data flow across the applications, and in between all the interfaces within the applications, that need improvement. We have seen some success with regard to the tests that we’ve run. The current pace is to get through two to four of these tests in a given year, and oftentimes this is in advance of a storm season, whether that be winter or summer, when the business needs to see that the applications will perform their function under heavy load. We’ve load tested and discovered limitations, if you will, and vulnerabilities inside the systems and between them and their interfaces. I think the first year we did this was 2019. In 2019, we had 76 different issues that we were able to identify and put out in front of the team for resolution. Some of these things the team was aware of; other things, they were not. Our reporting mechanism really enables the larger organization, not just the individual application support teams, to have some visibility into the readiness of their systems and where these issues are, so that they can take the necessary actions to rectify them. And that’s not always a software-related change. Sometimes these are process-dependent issues, or things that may require a good deal of time to resolve. So this gives the utility some visibility into those things. They have an opportunity to put in place some mitigation steps, so that when the next storm approaches, they can successfully negotiate a potential weakness in a particular system.
We have another customer in the Midwest where we’ve been doing this type of testing, and we do it, as Scott had mentioned, at least twice a year ahead of their different seasons. In the wintertime they get freezing snow and ice; later on, in the springtime, it’s more tornadoes. By doing this, we’ve seen an increase in the reliability of their systems as a whole. One of the things that caused them to do it more readily wasn’t so much what could go wrong for the people impacted. For a tornado, not everyone’s impacted, right? It’s usually a certain, very specific area. But what happened one time is that their customers were unable to pay their bills. Everyone thinks about the path of what can go wrong for impacted customers; you don’t think about customers who aren’t impacted being unable to do something online. So that’s a case where the testing vetted that out, and before the next seasonal readiness event, we were able to fix that in the software so that it didn’t occur during the actual event.
Okay. Well, I think we now understand what storm readiness testing is, and maybe when we would use it and how it could be beneficial and its impact. This is a part of the show where we get to talk about what you’re working on or more about the topic. Scott, where can people go to learn more?
To learn more, you can go to the webinar that we just released; we’ll have the link. We also have a case study that should be available as well, which gives a little bit more detail about one of our larger customers and their experience with our team.
Okay. Thank you, Scott. Kimberly, do you have any comments?
To follow up on what Scott said, even if you’re not an expert in utilities or real-time systems, there are a lot of exciting opportunities outside the utility space as well. There are many different industries that we cover: banking, energy, medical, and medical devices. There are opportunities in any one of those industries as a subject matter expert, performance tester, or functional tester within the systems in those industries.
Yeah. We’ll make sure that those are in the show notes.
All right. Well, thanks everybody for being on the show for those of you in the listening audience, we’ll be back in a couple of weeks. Thanks everybody.
Thanks for having us, Matt. Take care, everybody.
Michael Larsen (OUTRO):
That concludes this episode of The Testing Show.
We also want to encourage you, our listeners, to give us a rating and a review on Apple podcasts.
Those ratings and reviews, help raise the visibility of the show and let more people find us.
Also, we want to invite you to come join us on The Testing Show Slack channel, as a way to communicate about the show.
Talk to us about what you like and what you’d like to hear, and also to help us shape future shows.
Please email us at thetestingshow (at) qualitestgroup (dot) com and we will send you an invite to join the group.
The Testing Show is produced and edited by Michael Larsen, moderated by Matt Heusser, with frequent contributions from our many featured guests who bring the topics and expertise to make the show happen.
Additionally, if you have questions you’d like to see addressed on The Testing Show, or if you would like to be a guest on the podcast, please email us at thetestingshow (at) qualitestgroup (dot) com.