The Testing Show: Chaos Engineering
Many of us are familiar with the idea of negative testing, where we feed bad data or inputs to a program or application to see how it behaves. That works for a program or an app but how about an entire infrastructure? A discipline that has come to be known as Chaos Engineering is where this level of “testing” comes into play. Intriguing but what is “Chaos Engineering?”
Claire Moss joins Matt Heusser and Michael Larsen to discuss the good, bad, ugly, and just plain odd aspects of a discipline that is not readily understood but bears a resemblance to Exploratory Testing. It is also available to any organization that wants to implement it, provided they are ready and willing to go down a rabbit hole or ten.
- All Day DevOps 2020 Ask Me Anything Keynote: Chaos Engineering
- The DevOps Handbook
- You may be doing DevOps and not even know it – ‘The Third Way’
- Wheel of Misfortune: A Role-playing Game
- Effective DevOps (Book)
- Principles of Chaos Engineering
Michael Larsen (00:00):
Hello everybody. And welcome to The Testing Show. We are recording this on Monday, December 21, 2020, which means that this very memorable year is almost in our rear view mirror. There is a good chance that the show might be out in the waning days of 2020. If for some reason it’s not, you’ll get to hear it the first thing in the new year. Either way, thanks for joining us. Glad to have you listening. I’m Michael Larsen. I am the show producer and editor, engineer, all the stuff that goes with it to make sure that you get something to listen to. And, of course, we wouldn’t be anywhere without our stellar guests. And this time we have a returning guest who… It’s been a while, I would like to say welcome back, Ms. Claire Moss.
Claire Moss (00:44):
Hey, y’all, how’s it going?
Michael Larsen (00:46):
Glad to have you with us. And, of course, you all are familiar with our master of ceremonies, Mr. Matthew Heusser. So, Mr. Matthew Heusser… Go!
Matthew Heusser (00:56):
Thanks, Michael. Wow. You can just call me host, but glad to be here. I’ve known Claire for, I think, nine or 10 years now… known of her for before that. But we met at a conference in Seattle in 2011. She has a background not too different from mine. That was a career that prepared her to be a programmer, then became a tester along the way, particularly with an Agile focus. Has moved around a bit to do testing and quality adjacent roles. Today, we’re going to talk about exploratory testing and how that interacts with Chaos Engineering as a discipline. Is there anything you’d like to add to that before we get started, Claire?
Claire Moss (01:44):
I guess you could say that my most recent sojourn has been into reliability engineering. That may not be as familiar as an area of work for the listeners.
Matthew Heusser (01:55):
Well, that’s what we’re here to do. You can learn just enough to know a little bit about the topic and know where to go next without having to go get a college degree to do it. There’s two things there. There’s reliability engineering and there’s Chaos Engineering. Maybe we can start with Chaos Engineering. If you had to… For you for this context, for this conversation, how would you define Chaos Engineering?
Claire Moss (02:21):
So I think it’s important… The chaos part is not very scary. Chaos part is not just exciting, but also inevitable. Things will happen unexpectedly. And then the other word being Engineering, we’re trying to bring some methodology and deliberate action to cultivating that chaos and responding to that chaos. So I liked what I heard at All Day DevOps 2020 and the “Ask Me Anything” keynote that it was about facilitation of experiments to uncover systemic weaknesses. That really resonates with me.
Matthew Heusser (02:56):
So in that case, chaos is not like the bad like, “Oh, the engineering team is chaos. It’s a total mess”. It’s, “we’re going to simulate maybe a chaotic environment to see how the software responds to it”?
Claire Moss (03:09):
I think responding to chaos is very important. Chaos is not something we’re afraid of because we know we’re going to engage with it. Since we know we’re going to engage with it, let’s stop being afraid of it and start getting curious about it. So not saying, “everything’s chaos and there’s nothing I can do about it”. It’s not being in the room that’s on fire and saying, “This is fine”. It’s saying, “Hey, the room is on fire. What are we going to do about that?”
Michael Larsen (03:36):
So if I could step in here for just a second, just because I’m an etymological nerd and these things always fascinate me. When we use terms like “chaos”, we also have to look at it with the terms that it also goes with. So the idea is, in ancient Greek, “chaos” was just “without form”, whereas “cosmos” was “with order”. So that’s where those two words come from. We take liberties with words all the time. So chaos now means “everything’s on fire”, but that’s not really what that word means. It actually means that there is either a lack of order or there is a randomness factor to it. I just wonder if that’s maybe part of the definition that might get overlooked sometimes.
Claire Moss (04:15):
I think you’re right. People have a very negative response to chaos. It feels bad. We have these connotations of something must be wrong if it’s not well formed, maybe that’s why we want to pair it with engineering. Because we think of engineering as bringing order to the chaos. When we know that we’re not actually going to tame it and prevent everything from going wrong, we’re trying to make it more deliberate that we can see things that go wrong and maybe anticipate them, but definitely respond to them.
Matthew Heusser (04:49):
So for example, in a very micro way, when we do negative testing or disruptive testing, where we say, “I’m going to fill out stuff the developer doesn’t expect to be filled out this way. I’m going to leave this field blank and put characters in that one. I’m going to cause a buffer overflow by putting in too much data”. In a very micro way, you could kind of think of that as chaos testing. But I think you’re talking about something a lot more intentional and maybe even outside of the usual realm of classic quick-attack software testing.
Claire Moss (05:22):
That’s the same direction I was coming from. When I started learning about Chaos Engineering, I was thinking, “Well, you know, we do a lot of exploratory testing, and in our charters we can choose to focus on different aspects of failure modes or evoking behaviors that are bad, where we have a list of test ideas, and we’re going to set aside some time to explore those.” When you want to connect that to DevOps, there’s this book called “The DevOps Handbook”, and it has what they’re calling “The Third Way”, where you’re trying to have a deliberate practice of continual learning and experimentation, not just one-time questions about how the system will fail, but also making this more repeatable, being able to monitor the results in order to compare behavior over time in a way that can give you more automatic feedback. So you can take those things that maybe you would explore manually and then make those programmatic. A lot of times when we have these exploratory testing charters, we don’t necessarily know an expected result. When you’re doing automation, you’re going to compare a result that you actually produced to some kind of expectation. Exploratory testing doesn’t necessarily have that expectation. Similarly, in Chaos Engineering, you might not be predicting the outcome, but once you have that outcome, you can evaluate it and say, is this important to me? Will I need to evaluate this in the future? And if so, how do we want to keep an eye on that?
Matthew Heusser (07:03):
Can you give a couple of concrete examples of what that would look like?
Claire Moss (07:07):
All right. So we have a couple of different practices. You mentioned destructive testing. So when I’m thinking about putting stressful inputs into a system, destructive testing could also include damaging the system in some way. So maybe you have turned off a dependent service. So that testing of integration points would be an example of checking to see the system, its behavior, when it can’t communicate with other systems.
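As a minimal sketch of the dependent-service example, the Python below simulates turning off a dependency and then checks what the calling code does. The names here (`RecommendationService`, `render_home_page`, the empty-shelf fallback) are hypothetical and not from the conversation, and a real experiment would disable an actual service in a test environment rather than flip a flag in memory.

```python
class ServiceUnavailable(Exception):
    """Raised when a (simulated) dependency cannot be reached."""


class RecommendationService:
    """Hypothetical downstream dependency with an on/off switch for experiments."""

    def __init__(self):
        self.available = True

    def top_picks(self, user):
        if not self.available:
            raise ServiceUnavailable("recommendation service is down")
        return [f"pick-for-{user}-1", f"pick-for-{user}-2"]


def render_home_page(user, recommendations):
    """System under test: it should degrade gracefully when the dependency is gone."""
    try:
        picks = recommendations.top_picks(user)
    except ServiceUnavailable:
        picks = []  # fall back to an empty shelf instead of failing the whole page
    return {"user": user, "picks": picks, "degraded": not picks}


service = RecommendationService()
baseline = render_home_page("claire", service)

service.available = False  # the destructive step: knock out the dependency
during_outage = render_home_page("claire", service)

print(baseline)
print(during_outage)
```

Comparing `baseline` with `during_outage` is the observation step; whether an empty shelf is an acceptable outcome is a judgment for the team, not something the script decides.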
Matthew Heusser (07:38):
So before we started, we talked about Netflix and how they have these sort of… use the term “magical inputs” in quotes. There’s a lot that goes into it that is kind of ignored: this magical tool that sort of pulls down systems to see what happens. It’s injecting chaos. It sounds like what you’re talking about is very similar. It’s just linear. We are going to create tests that pull down systems, that monitor the results, and maybe even run them automatically, maybe even as part of CI/CD. It’s just: pull down this system, pull down that system, do this, do that, do the other thing, to make sure that the software under test behaves as expected.
Claire Moss (08:20):
So if I understand the way that Netflix is using their tooling to implement Chaos Engineering, these things were not necessarily running with supervision of people. These were systems that were running constantly to take down parts of the infrastructure and the monitoring, alerting, metrics, logging, all of the observability would have recorded the results of that, but there wasn’t necessarily a person in real time engaged with it.
Matthew Heusser (08:53):
Right. So I think what you’re talking about is in test, not in production and it’s linear. So you actually know what the experiment is instead of some sort of randomization created by an AI or machine learning or something.
Claire Moss (09:08):
So the controlled experiment, you want to have a known environment to run it. It depends on how much control you want in your experiment, I guess. If you want to be able to create failure in that controlled situation, you’re not necessarily only testing the execution of the software, but that is where I most commonly saw that testing occurring when I was focusing on exploratory testing. There is another aspect of Chaos Engineering. I’ve heard it called “Wheel of Misfortune”. It’s more like a hands-on practice workshop, where you are building habits in people and not just executing the system in a vacuum. So when you’re doing exploratory testing, you’re not usually looking at how other people who aren’t executing the exploratory test would respond to that. Chaos Engineering, depending on your tolerance for risk, can be done in production or non-production. So if you’re doing it in production, you need to have those systems that will give you feedback in place in order to have confidence that you are not having undue customer impact. Reliability engineering is very focused on customer experience, and if you’re doing Chaos Engineering, you want to make sure that when you are getting information about the system, you’re not causing a problem for the business or its customers that can’t be resolved quickly.
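One way to picture the feedback requirement for production experiments is an abort guard: inject the fault, watch a customer-facing signal, and roll back as soon as the signal crosses a limit. The sketch below is only an illustration under assumptions; `run_experiment`, the error-rate metric, and the 5% threshold are all hypothetical, and the metric readings are canned values rather than real monitoring data.

```python
def run_experiment(inject_fault, restore, read_error_rate, max_error_rate, checks):
    """Inject a fault, poll a health metric, and stop early if impact is too high.

    Returns a tuple (aborted, observed_rates).
    """
    observed = []
    inject_fault()
    try:
        for _ in range(checks):
            rate = read_error_rate()
            observed.append(rate)
            if rate > max_error_rate:
                return True, observed  # customer impact too high: stop now
        return False, observed
    finally:
        restore()  # always undo the fault, even on abort or an exception


# Stand-ins for real tooling: a flag instead of a real outage, canned metric values.
state = {"fault_active": False}
fake_rates = iter([0.01, 0.02, 0.09, 0.01])

aborted, observed = run_experiment(
    inject_fault=lambda: state.update(fault_active=True),
    restore=lambda: state.update(fault_active=False),
    read_error_rate=lambda: next(fake_rates),
    max_error_rate=0.05,  # assumed tolerance, not a recommendation
    checks=4,
)
print(aborted, observed, state)
```

The `finally` clause is the design choice that matters here: the fault is removed whether the experiment finishes, aborts, or crashes.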
Michael Larsen (10:39):
Let me give you an example. And I don’t know if this really would qualify as Chaos Engineering or if this is just more expansive workflow model. And again, maybe I’m just using terms that make more sense to me, but part of what I do right now, and what I’m working with is that I’m working with an organization that leverages our core product. The one I’m most used to. And all of our other offerings are displayed or operate within it. And one of the ways that I like to do this, a lot of the things that we’re working on are microservice models. You don’t necessarily know if something is working or not, unless you go in and start plucking the cords, so to speak, to see which one communicates where. So part of what I want to do and part of how I try to do it, I get a list of the microservices that I know that I’m working with, or that I know that I’m going to have some level of interaction with and I’ll start doing various workflows. And in the process of doing those workflows, I’ll either go to the server or I will go to the connection string and I’ll blank it out. And what happens is, of course, that you blank out that connection string and you do a refresh. It’s not going to know how to get to that. So the point is, at what point do you work on something and determine, “Hey, where does this come in, where does this actually get used?” And you might go through several workflows and realize I’m not even using that service. We’ve actually used this as a way to say, “what are the products that we are so desperately trying to cover for? And yet when push comes to shove, there’s no real reason for it to be there.” We’re not even really using the service. And yet it’s so terribly important to us.
Of course, the flip side is you pluck a cord and the whole thing goes down, “Oh boy, that one was really important!” But there’s that vagueness in the sense, because so many things are interacting with so many other things, unless you’ve got a clear map of everything going on in your system, you don’t necessarily know what is dependent upon something else. And so in this way, I picture Chaos Engineering kind of being a way to help get a handle on some of that and at least see, you know, which levers you can manipulate that have next to no impact and which ones are catastrophic. Does that make sense?
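The one-service-at-a-time exercise can be sketched as a loop: disable a service, run every workflow, and record what broke. In the Python below, the service names, the workflows, and the `WORKFLOW_CALLS` table are all hypothetical. In a real system that table is exactly what you do not know in advance, so `run_workflow` would exercise the actual application with a blanked connection string instead of consulting a lookup.

```python
# Hypothetical stand-in for the real system: which services each workflow calls.
WORKFLOW_CALLS = {
    "browse_catalog": ["auth", "catalog"],
    "checkout": ["auth", "catalog", "payments"],
    "view_profile": ["auth"],
}
SERVICES = ["auth", "catalog", "payments", "legacy_reports"]


def run_workflow(name, disabled):
    """Return True if the workflow still completes with one service disabled."""
    return disabled not in WORKFLOW_CALLS[name]


def classify(service):
    """Disable one service, run every workflow, and summarize the blast radius."""
    broken = [w for w in WORKFLOW_CALLS if not run_workflow(w, service)]
    if not broken:
        return "no observed impact"
    if len(broken) == len(WORKFLOW_CALLS):
        return "catastrophic"
    return "partial: " + ", ".join(broken)


report = {service: classify(service) for service in SERVICES}
for service, impact in report.items():
    print(f"{service}: {impact}")
```

A "no observed impact" result only covers the workflows that were actually run, so it is a prompt to ask whether the service is needed, not proof that it is unused.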
Claire Moss (12:50):
Matthew Heusser (12:52):
I’d like to add something to that. If I could real quick, if I was testing a new system that I didn’t know anything about, and I just get a list of all the web services, service oriented services, microservices, whatever you want to call them, maybe the databases. And I started pulling them down and seeing what happens when I pull them down. And what breaks. Is that what we’re talking about?
Claire Moss (13:13):
Disrupting the system’s interactions and you’re observing the results. That would, to me, be an example of destructive testing, where you are removing capabilities from the system, you’re knocking them out, you’re damaging the behaviors deliberately.
Matthew Heusser (13:30):
And that is a kind of Chaos Engineering part of it, but not all of it.
Claire Moss (13:37):
Matthew Heusser (13:37):
And I think something that you mentioned that I want to build onto is, like, we’re making this a workshop. We could get all the disciplines in the room so that when I pull down the product display service, now you can’t see products anymore on the website. We could see how long it takes for us to fix and adapt that and run it, like, so then we know, okay, if product goes down, it takes us 25 minutes to bring that back up. Is that a problem? I don’t know. We can then present that information to management. Either try to tighten that window or say that’s fine.
Claire Moss (14:12):
Right. You don’t necessarily know what your targets are. There’s two things that you’re looking at there. How long did it take us to detect something? And how long did it take us to resolve it and to recover from that issue that we detected.
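The two measurements named here, time to detect and time to resolve, fall out of three timestamps from a workshop run. The sketch below uses made-up times chosen so the total matches the 25-minute figure mentioned earlier; the 7-minute and 18-minute split is an assumption for illustration only.

```python
from datetime import datetime

# Hypothetical timeline recorded during one workshop run.
timeline = {
    "fault_injected": datetime(2020, 12, 21, 10, 0),
    "detected": datetime(2020, 12, 21, 10, 7),
    "recovered": datetime(2020, 12, 21, 10, 25),
}

time_to_detect = timeline["detected"] - timeline["fault_injected"]
time_to_resolve = timeline["recovered"] - timeline["detected"]
total_outage = timeline["recovered"] - timeline["fault_injected"]


def minutes(delta):
    """Convert a timedelta to minutes for reporting."""
    return delta.total_seconds() / 60


print(f"time to detect:  {minutes(time_to_detect):.0f} min")
print(f"time to resolve: {minutes(time_to_resolve):.0f} min")
print(f"total outage:    {minutes(total_outage):.0f} min")
```

Keeping the two intervals separate shows whether the slow part is noticing the problem or fixing it, which points to different corrective actions.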
Matthew Heusser (14:26):
In my experience, unless you’re in a relatively mature organization, nobody does, there are no SLAs or things like that. So what you can do is you can post the service capability. Well, this is what happened when I did it, is that a problem? And now management has got something to react to and they can say, “well, that’s a big problem. Okay, well then let’s get some engineering resource to fix it. How much resource can we get? How many people can we get for how long to fix this problem?” Or it’s not a problem. Either way, all I’m doing is providing information for decision makers.
Claire Moss (14:58):
Yeah. And I think that feels a lot like the exploratory testing charters that I used to run where you didn’t necessarily know the outcome, you would run through it, you’d get some observed results and say, what do we think about this? How do we evaluate these results? And you didn’t necessarily have data to compare it to ahead of time. Maybe you’re setting expectations by saying here’s what a baseline looks like.
Matthew Heusser (15:22):
And here’s where the power is in that. I agree that a lot of testers have the ability to run those experiments and we can ask those questions. At the company I’m consulting at right now, I bet if I asked that question, “What is our expectation for recovery?” No idea. Nobody knows. Like, I know they’ll make up numbers. Well, maybe there’s monitoring in place, but in terms of time to recover, including the human elements, we don’t know if we take those on one at a time. So then to be able to say, “Hey, I can run a workshop where we can figure that out. And then we can take corrective action”. That could be a very powerful thing for a tester to do. And testers might have the skills that other disciplines do not. Frankly, sometimes testers are the only ones that can see the whole board.
Claire Moss (16:09):
I think testers definitely have some troubleshooting skills. And there’s also an expectation in the testing role of being able to report out on those results and be able to work across disciplines in order to improve quality. That may not be everybody’s definition of testing, but that’s mine. I would certainly expect testers to be able to communicate more effectively across roles and then probably would have some facilitation involved in that. Especially if you are doing paired work or like a collective exercise. There are certainly training exercises and all different disciplines and hands on training. I mean, you and I have run that so many times. That’s definitely my preferred mode is to have an actual hands-on experience of something. I feel like that is very instructive. And so that’s where you definitely have that visceral learning. It’s one thing to read it in a book or have someone tell you about it, but then when you’ve experienced it yourself and you see how short five minutes is before you’re supposed to pull the rip cord, you don’t realize how fast time flies. But if you’re trying to say that facilitation would be a specialty of a tester, I’m not sure whether individual contributors get to do that as much.
Matthew Heusser (17:28):
What I think testers can do that a lot of people can’t is this: say you’re on the product team and you did product search, but you need to be able to create a product. They know how to get the dependencies they need to build their test environment, to do their testing. And a lot of times developers on the team don’t know how to do that. They don’t have to work cross-functionally. They don’t have to build their dependencies. So when we talk about pulling down this server that isn’t ours in the test environment, that’s something they don’t do and don’t have interest in as programmers. And what that results in is software that fails when you pull down the dependencies. They never tested it, or they tested it with mocks and made some incorrect assumptions. So then the tester has to explain why this is important to people outside of their group. I don’t know if that’s facilitation skills, but that’s communication and collaboration skills. Not all testers have that.
Claire Moss (18:26):
And I think actually that points to a big, important thread in the DevOps culture. This is what Jennifer Davis in her “Effective DevOps” book calls affinity, where it’s that relationship building across roles and teams and building empathy. And I think if you have that focus on the systems being people problems and needing to resolve issues through people or through influence without authority, then you’re definitely going to be more in line with Chaos Engineering, because it’s not just about the technical behavior of the system.
Matthew Heusser (19:00):
Yeah. I liked that idea of the whole picture. Something goes down: who notices, how fast do they notice? What do they do to get it resolved? Is that too slow? Is it fast enough? Should we change the work processes? Many of the companies I’ve worked with, you could actually become radically more effective by calling someone on the phone, instead of sticking something into a work queue in JIRA. Like, it’s only a 20-minute fix, but we don’t get to it for three days. That’s been a few years since I’ve had that experience. Most companies are doing a little bit better, but I don’t think I’m exaggerating by that much.
Claire Moss (19:33):
I think that to be effective in your role, if you think of your role as being just producing code, whether you’re an automator or a developer, then you wouldn’t necessarily have to have that interpersonal facilitation skill set. But I think when you want to advance to more senior levels, or if you want to be an accelerator of a development organization, that you have to have the interpersonal capabilities that enable us all to be effective, and that collaboration would certainly be really important regardless of what your job title is. I think testers are more likely to need to develop that earlier because of the collaborative nature of cultivating quality and being able to push that forward.
Matthew Heusser (20:23):
And again, just to get to a real concrete example, if you have a work process that’s supposed to be, when someone notices the thing is down, they enter a JIRA ticket, and it’s not a priority one. It’s not login, but it’s some important feature, and you measure it. And you say, given the SLAs that the various teams have, as the ticket moves through the system, it’s a two-day resolution. And we can make that a two-hour resolution if we substitute that with a phone call and a ticket, because these things are important and we have a list of what’s important. Boom. So we can get from two days to two hours. What’s that, a 24X productivity improvement, just by making a phone call? I submit that those kinds of productivity improvements exist. They’re out there and you can get them. They’re available. Someone just has to conduct the experiment to find out. And Chaos Engineering is one way to get there. Am I right about that?
Claire Moss (21:23):
Yeah. I would say Chaos Engineering can help you to have visibility into how you’re collaborating and executing those processes. Like, what are your routines? Human routines, for sure. That may not be the case if you’re doing the Netflix style of executing an unmanned script that is knocking out parts of the system, and you’re maybe seeing the bugs filed automatically in a system somewhere. So I guess it will depend on how you implement the practice.
Matthew Heusser (21:54):
What I’m really asking is, are there massive productivity improvements all over the place, all over the floor, that we just have to pick up? If we just understood how we respond to failure and we respond better to failure, without writing any code at all. We just say, “Oh yeah, if that were to happen, I would do this thing”, write it down, document it, and make it a team process. The team picks up their bugs and fixes them, prioritizing them better so that we don’t have to fix it two sprints from now, but we can do it today.
Claire Moss (22:21):
Matthew Heusser (22:22):
And I think between the three of us, we have a pretty broad experience in software engineering. We talked to a lot of people. So I see that at my clients. I’m kind of asking to validate that in many organizations in the developed world that are doing software, a lot of the productivity improvements are just, you know, if I called this person when the system went down, as soon as I noticed it, and they put it back up, it would go back up in an hour instead of three business days. Because if I enter a ticket, it’s going to go to a triage committee and they’re going to triage the triage, and then it’s going to go right back in the next sprint. I’m exaggerating, but…
Michael Larsen (23:03):
But not by much, not exaggerating by much. That is absolutely something that I’ve had more than my share of experience with, you know, sadly it’s, it’s one of those things that when a company starts out and they’re, you know, actively engaged, you can be nimble and you can work on things that have a relatively small footprint, still fairly complex, but you’re able to get things done and accomplished. Whereas when you get to dealing with larger companies and larger organizations, you start to lose a lot of that nimbleness. And then, being able to make sure that, well, are you gonna be able to make a case for this? Or how, how are you going to be able to explain this? And sometimes you have to be very creative just to say, “Okay, now I’ve got your attention. What are we going to do about this?” And so in that case, yeah, I think that being able to know how to leverage those and being able to understand what to do, I think is important. And I have had some experiences where knowing which card to pull so that the whole thing comes tumbling down is good, but it’s more to the point to be able to know, “Okay, what are the cards that I can pull that will still safely keep the house up?” Very often, the problems that we experience aren’t as catastrophic. I’ll give you an example of one very recently. I don’t think I’m tattling out of turn to talk about this, because the product that I work on has to do with data transformation. And there’s something that we go through and we pass through fields and we get back the values for it. And oftentimes you’ll look at it and everything just looks fine. Like, what are you looking for to confirm that what you’re getting is exactly correct? Well, we had an example of this, where somebody who had not submitted any data for quite a long time went in and did a submit. And they came back to us and said, “Hey, we’ve got this issue” and we couldn’t recreate the issue.
But then with a little bit of back and forth, what we determined was since they hadn’t done any type of change for a while, they hadn’t updated the API version they were using, or even the calls that they were using. So what had happened for us was we realized that an older API call was being made and that older API call was giving back results but somehow in the process, it had missed something. Missing a line of code that should have been there. And this is one of those things that for me, as a tester going through, it was like, okay, you know, how would I have thought to look for that? Or how would I have considered that? But once we had this experience of that, “Oh, okay, that’s worth knowing that’s a good thing to work with”. So now I’ve actually worked that in to say, okay, we’re encouraging people to use the newer API, of course, but there might be reasons why people still use the older one. And how will I know if that’s actually working? Well, introducing a call that uses a different API version, actually structures how those calls are made. And, that was a neat little thing for me to learn about. “Oh, that’s cool. Okay. So if I want to, I could go back several API versions and make this exact same request and I could potentially get very different results”. So does that seem like in line?
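The idea of replaying the same request against several API versions can be sketched as a comparison loop. Everything in the example below is hypothetical: the `transform_v1` and `transform_v2` handlers, the record fields, and the dropped `currency` value are invented to illustrate the kind of gap described, and none of it reflects the actual product or its API.

```python
def transform_v1(record):
    """Hypothetical older handler that silently drops the 'currency' field."""
    return {"id": record["id"], "amount": record["amount"]}


def transform_v2(record):
    """Hypothetical current handler, treated as the baseline."""
    return {
        "id": record["id"],
        "amount": record["amount"],
        "currency": record["currency"],
    }


HANDLERS = {"v1": transform_v1, "v2": transform_v2}


def diff_versions(record, baseline="v2"):
    """Send one record through every version and report gaps against the baseline."""
    expected = HANDLERS[baseline](record)
    differences = {}
    for version, handler in HANDLERS.items():
        actual = handler(record)
        missing = sorted(set(expected) - set(actual))
        changed = sorted(
            key for key in expected if key in actual and actual[key] != expected[key]
        )
        if missing or changed:
            differences[version] = {"missing": missing, "changed": changed}
    return differences


result = diff_versions({"id": 1, "amount": 10, "currency": "USD"})
print(result)
```

Both versions return a plausible-looking result on their own, which is why the defect was hard to spot; it only shows up when the outputs are compared side by side.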
Claire Moss (26:06):
Yeah, that makes sense to me. And I like what you said there about when you found an unexpected behavior, you reflected on it and came up with a way to make that more manageable in the future. And you shared that out so that when other people had this experience, then they would also get the benefit of that learning.
Matthew Heusser (26:28):
So, 10 years ago, we were talking about how Kanban allowed you to do just-in-time bug fixes and comparing that to maybe, “Oh, we found a bug. That means that we should prioritize it and get it in the sprint. And then we go to production every two weeks. And now you’re going to have your bug fix in some time between 10 business days and 20 from when it was discovered, which would have been 20 days after it was introduced, because we’re only deploying every two weeks.” With continuous delivery we can fix bugs just in time. And the exposure to the user might be a day, two days, or even less if we’ve got config flags. We’re talking about something very similar, but on the inside of the organization, on the infrastructure end: conducting experiments to figure out how long it takes to recover and what we can do about it, what the impact is on the customer, and maybe build redundancy, maybe not. And I think it’s no surprise that testers can contribute to both. So thanks for being on the show, Claire. Is there anything I missed in that little wrap-up that you want to add? I’ll give you the last word.
Claire Moss (27:40):
I liked what you said. The whole thing of DevOps is sharing practices among the silos. And so bringing that testing expertise to the infrastructure operations context is really helpful. So we don’t want to limit the benefit that our organization can get from our skillset. We want to amplify it. I love that.
Matthew Heusser (28:01):
So where can we go to read more about this?
Claire Moss (28:04):
Well, there is a website called Principles of Chaos. That seems like a good starting point, just to see what Chaos Engineering values and how to think about it. So you can start off there. There are many different open source tool sets. If you’re looking to implement something, I can’t recommend one in particular, but they all should have a decent README that explains their use cases. And you can see whether those fit your current situation. And of course there are quite a few books out there. I know we just finished All Day DevOps last month and all of those recordings should be published. So you can check those out online and maybe look at the AMA keynote for Chaos Engineering, because people were just bringing their questions about what Chaos Engineering is. And it was all different levels of understanding and familiarity. So that would be definitely your place to go.
Michael Larsen (29:00):
Fantastic. Well, I guess at this point, it’s probably a good idea for us to say our farewells. For those who are listening, thank you very much for joining us and participating with us on The Testing Show. Claire, thanks very much for being part of the show today. And we look forward to talking to you again, and this time I will say, we look forward to talking to you again sometime in 2021. Take care everybody.
Claire Moss (29:24):
Matthew Heusser (29:25):