TestOps

Just as melding operations and software development led to the discipline of DevOps, melding testing and operations has led to the concept of TestOps, where many operations areas also fall under the role of the testing team(s), helping organizations test for infrastructure needs and ensure that feature enhancements and code changes aren't just deployed efficiently but also work well. For this episode, Alex Langshall and David Vydra join Matthew Heusser and Michael Larsen to discuss TestOps, its role in the development lifecycle, and ways organizations can leverage its benefits for better systems and better release management.

Transcript

Michael Larsen (INTRO):

Hello and welcome to The Testing Show.

Episode 103.

TestOps.

This episode was recorded June 22, 2021.

The blending of Operations and Software Development gave us DevOps, and today we are talking about the concept of TestOps, where we see a practical blend of Operations and Testing. For this discussion, we welcome Alex Langshall and David Vydra to discuss TestOps, its role in the development lifecycle, and how organizations can leverage TestOps for better systems and better release management.

And with that, on with the show!

Matthew Heusser (00:00):
Thank you, Michael, for the great introduction. This week, I’m really excited. We’ve got David Vydra with Time by Ping, who is a senior quality engineer over there. I’ve known David for a long time. I think Michael and I met him together at Test Coach Camp nine years ago in San Jose. David and I have been talking about CI/CD for a very long time and he’s been bouncing around Silicon Valley. Welcome to the show, David. Is this the first time we’ve had you on?

David Vydra (00:34):
Yes, it is. Thank you for having me.

Matthew Heusser (00:37):
And what else did I miss from your resume?

David Vydra (00:39):
So let me share a little bit about my background. I got introduced to Extreme Programming in '98 and it sort of rocked my world; I've been test-driven ever since. A few years later, I moved from being mostly a developer to being mostly a tester and build engineer. So I've probably spent the last 23 years of my career about half and half between quality and testing and build/release engineering.

Matthew Heusser (01:05):
Fantastic. Glad to have you on. And Alex Langshall is a TestOps engineer at Lucid. I know you've got a really interesting setup where you have TestOps and TestOps development as two different teams, and that's where we started this conversation on Twitter. I wanted to bring you in because we're talking about release engineering this week and also TestOps, and you're really well positioned; you do both. Tell us a little bit about yourself, Alex.

Alex Langshall (01:36):
Sure. I got hired by Lucid in 2014 as a QA analyst. My previous work was teaching community college film courses as an adjunct, so I come from an education background and came into tech as a newbie, but got hired by a really great…

Matthew Heusser (01:55):
Wait, wait, you kind of fell into testing?! That’s a story I’ve never heard.

Alex Langshall (02:00):
No, nobody does that, right?! I looked at the careers available to me in tech and QA seemed really interesting. So I got hired by Lucid Software. When I started, I was one of three new hires on a QA team of six. As of yesterday, I'm on a QA team of 43. So things have changed, and we've evolved from large monolithic releases every two weeks to having the vast majority of our services released continuously, either twice a day or multiple times a day. It's really been the quality assurance team that has helped push that process forward. So yeah, that's sort of the short version.

Matthew Heusser (02:44):
I’m so pleased to hear it. You know, one of the things we talk about on the show is when change is coming to testing, you can fight it and maybe not have a job anymore in a couple of years… Or you can lead it. It’s great to hear that story. So let me throw the first question to Alex, if it’s all right? What the heck is TestOps?

Alex Langshall (03:04):
I use the term pretty loosely. I was struggling to identify what I needed testers to do in interaction with the ops team. A little bit of background: long ago, when I started, we had a staging environment that was pretty awful. This is a common story for a lot of test teams. We had this staging environment, all the services on one box, completely unlike our production system. And after a lot of late nights dealing with releases that didn't work the way we thought they should, because staging was nothing like production (in terms of caching, databases, even service configuration), we pushed and pushed to establish a pre-production environment. We poked at ops to give us this thing and ops said, "That's great, but we don't want to run it. We've got enough to do dealing with production." I looked around on the internet for people who talked about ops and testing and merging the two ideas together. And I found this fantastic talk by Ioana Șerban, who at the time, I believe, was working at eBay, where she was doing testing around capacity planning. Instead of her ops team just randomly picking what they thought would be the right size containers in AWS for what they were doing, she actually did performance testing and went, "Okay, this is what AWS is telling you, but the reality is that these are the sizes that are really performing for you." It's bringing the QA mindset to your operations team; that's, in a nutshell, what TestOps is about. We've been able to test everything from migrating complete domains to changing our database structure so that we can have multiple rewrite nodes, all sorts of big-picture infrastructure stuff that is scary to a lot of ops teams. We can rehearse it, we can test it, we can make sure that it's right before it gets to production. In addition to all the stuff that we do making sure we're ready to release on a weekly basis and making sure that code is working. To me, that's what TestOps is really about.
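
To make the capacity-planning example concrete, here is a minimal sketch of the kind of check being described: drive a fixed number of requests at a candidate environment and fail if the 95th-percentile latency blows a budget, so instance sizing is based on measurement rather than the vendor's sizing chart. The endpoint, request count, and budget below are hypothetical placeholders, not anything from Lucid or eBay.

    # Minimal capacity-check sketch (hypothetical endpoint and thresholds).
    import time
    import urllib.request

    TARGET = "https://preprod.example.com/health"   # hypothetical pre-prod endpoint
    REQUESTS = 200
    P95_BUDGET_MS = 250                             # hypothetical latency budget

    def measure_once(url: str) -> float:
        """Time a single request in milliseconds."""
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()
        return (time.perf_counter() - start) * 1000.0

    latencies = sorted(measure_once(TARGET) for _ in range(REQUESTS))
    p95 = latencies[int(len(latencies) * 0.95) - 1]
    print(f"p95 latency: {p95:.1f} ms over {REQUESTS} requests")

    # Fail loudly if the candidate instance size cannot hold the budget.
    assert p95 <= P95_BUDGET_MS, "instance size under test does not meet the latency budget"

A check like this can be run against each candidate instance size before ops commits to one.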

Matthew Heusser (05:29):
I want to make sure I get what you're saying right, because this is something I don't see in many organizations: we're actually going to have a prod-like environment that we maintain, and we're going to test changes before they roll out to make sure the infrastructure supports them and everything still works. Maybe run our automated tests on top of it, make sure nothing breaks due to the changes in the infrastructure. Did I get that right?

Alex Langshall (05:54):
Yeah, and it's pretty unique. I don't think we see a lot of companies in the industry doing it. When you see things going down in production at major companies, you realize they could have tested that infrastructure change; they could have seen what something is like when you add caching to it. They don't do that, and you have these high-profile companies running into these issues pretty often.

David Vydra (06:15):
This fits perfectly into the paradigm I've been living. I started by going all in on unit tests. In fact, I was so all in, I even bought the domain test-driven dot com. I was completely test-driven. Over the years, I started noticing that the level of desired quality just wasn't reachable in many organizations with only traditional developer unit testing. A good amount of end-to-end tests were absolutely required, and this is where my background as a build and release engineer came in. Creating environments to run those tests reliably is a huge challenge in many organizations. A lot of the work I had to do was go up to a very high level, sometimes C-level, and basically sell them on the idea that we should be able to spin up production-like environments at will and run our end-to-end tests in parallel. If that's achievable, then usually the company experiences a complete mind shift and a tremendous increase in its ability to release and in the initial quality level. I'd never heard the term TestOps, but I think it's a perfect term for this sort of endeavor.
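
As a rough illustration of "spin up a production-like environment at will and run the end-to-end tests in parallel," the sketch below assumes an environment defined in a Docker Compose file and a pytest suite with the pytest-xdist plugin installed; the file names, test path, and port are invented for the example.

    # Bring up an ephemeral production-like stack, run e2e tests in parallel, tear it down.
    import os
    import subprocess
    import sys

    COMPOSE_FILE = "docker-compose.preprod.yml"   # hypothetical environment definition

    def run(cmd: list[str]) -> None:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    try:
        # Start the stack and wait for containers to report healthy.
        run(["docker", "compose", "-f", COMPOSE_FILE, "up", "--detach", "--wait"])
        # Convention assumed here: the suite reads its target URL from an env var.
        os.environ["E2E_BASE_URL"] = "http://localhost:8080"
        # Run the end-to-end suite across four parallel workers (pytest-xdist).
        run(["python", "-m", "pytest", "tests/e2e", "-n", "4"])
    except subprocess.CalledProcessError as err:
        sys.exit(err.returncode)
    finally:
        # Tear the environment down whether the suite passed or failed.
        run(["docker", "compose", "-f", COMPOSE_FILE, "down", "--volumes"])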

Michael Larsen (07:27):
So this is for Alex or for David, either/or. Here's something that I would love to see addressed, because TestOps, like every term that ever seems to come down the pike, takes on this all-encompassing quality at some point. When you say, "Oh yeah, don't worry, TestOps is going to take care of that," what exactly does that mean? Does TestOps mean you're part of the infrastructure side of things? Does TestOps mean you're part of the build side of things? Does TestOps mean you're part of the actual hands-on testing? The automation components associated with testing? I could make an argument that every one of those could be considered TestOps, and yet they're not the same thing. Am I making sense? Do you get where I'm coming from with that?

Alex Langshall (08:16):
Yeah, I think DevOps has the same problem, where everything and anything a company does where developers and operations work hand in hand gets called DevOps, and it gets really confusing really quickly. I have run into this problem personally, because I've got a TestOps engineering team of three people that works on infrastructure stories, doing actual active development in Chef, Ruby, TypeScript, and Scala. And then I have a TestOps on-call team that functions a lot like your ops on-call would, but for pre-production: the same alert suite that you would have in production, with the same kinds of alerts going off for pre-prod. We take care of those and escalate them to ops if we believe they'll be a symptom of something that's going to happen in production. There is certainly a danger with any sort of buzzword of slapping it on everything under the sun. I try to use TestOps in a very tactical way at Lucid. I'm trying to immediately get people on board with the idea that QA and infrastructure teams work best when they work together. Before this, our backend teams didn't really have QA members working with them. Now we have manual testing across all of the backend. Again, I think that's pretty unique; I don't see a lot of that happening. We tend to rely a lot on automation for backend stuff and not think about how to test all these backend changes. It's an umbrella term, and I try to use it very carefully and tactically. I gave a talk two years ago at TestBash San Francisco on the subject, and I'll completely admit that I picked a cool name and went with it. I thought TestOps sounded cool for the things I was doing, so I grabbed it so I could have teams that sounded cool and sounded like they were contributing to the organization in the ways that I wanted them to. Even some of the monitoring and logging activities that I think fall into TestOps, a lot of people say should just be part of testing, and I don't disagree with that. But having ownership over test environments, having ownership over your own infrastructure as a tester, means you're more in tune with what the rest of your QA team is doing. You know what testing is coming down the line, you know what your testers are seeing. That's much better than having it relegated to the infrastructure ops teams, whose primary purpose really needs to be keeping production stable and up and running; everything else is always going to be a second or third priority. For me, keeping pre-prod up and running is my first priority. Being able to put engineering effort toward that and keep it there is really a huge benefit of calling something TestOps like that.
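
The pre-prod on-call idea Alex describes, same alerts as production but a different first responder, can be pictured as a small routing rule. This is only a sketch; the rule names, thresholds, and the prod_risk flag are hypothetical, not Lucid's actual alerting setup.

    # Route alerts by environment: pre-prod pages TestOps first, escalating to ops
    # only when the condition looks like a symptom production would share.
    from dataclasses import dataclass

    @dataclass
    class Alert:
        rule: str
        environment: str        # "production" or "preprod"
        value: float
        threshold: float
        prod_risk: bool         # would this same condition bite production?

    def route(alert: Alert) -> str:
        if alert.value < alert.threshold:
            return "no-op"
        if alert.environment == "production":
            return "page-ops-oncall"
        # Pre-prod alerts go to TestOps first; escalate if it is a prod symptom.
        return "page-testops-oncall+escalate-ops" if alert.prod_risk else "page-testops-oncall"

    print(route(Alert("db_replication_lag_seconds", "preprod", 42.0, 30.0, prod_risk=True)))
    print(route(Alert("stale_test_fixture_count", "preprod", 9.0, 5.0, prod_risk=False)))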

Matthew Heusser (11:17):
So I like this for a couple of reasons. One, it actually does meld testing and ops in a real way, in that we are testing the operations. I suspect it would also have design benefits, because we've got to have a copy of the systems we're interacting with available on short notice (button-click, script-running, that kind of thing), which is going to create the sort of infrastructure as code that we've been talking about for however many years. Make it scriptable and then we can make it testable, which means we've got to have a nice separation of concerns. All of that is going to have benefits for the developers, and benefits for debugging. We're going to have to have observability in order to make the systems we're talking about testable. And then we can run our GUI automated tests on top of it if we have them, or manual tests, whatever, to say, yeah, green bar, with this change things still work. It's very much creating the kind of puzzle pieces that fit together and are testable as puzzle pieces, which is what we spent, what, a decade or two trying to do at the unit, integration, and system level for programming. I think that's really neat. And I suspect that when most people hear TestOps, they just hear "automate all the things." So this is a much clearer definition that I think makes a lot more sense. My next question is, let's flip it around. The other real interest that Alex and I were talking about is how does this change release management? I heard you say that most of your teams have moved from "let's release every couple of weeks" to "I can make a change and push it out the same day, and a battery of tests will run." What does release management mean, and how does that change under something like TestOps?
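
One way to picture "make it scriptable and then we can make it testable": once an environment is declared as data, ordinary tests can assert that pre-prod still resembles production before anything rolls out. The structure below is invented for illustration and does not correspond to any particular infrastructure-as-code tool's schema.

    # Hypothetical declarative environment spec, checked with pytest-style tests.
    PREPROD_SPEC = {
        "services": {
            "web":    {"replicas": 3, "cache": "redis", "image": "web:1.4.2"},
            "api":    {"replicas": 2, "cache": "redis", "image": "api:1.4.2"},
            "worker": {"replicas": 2, "cache": None,    "image": "worker:1.4.2"},
        },
    }

    def test_every_service_is_replicated(spec=PREPROD_SPEC):
        # A single replica anywhere means pre-prod no longer resembles production.
        for name, svc in spec["services"].items():
            assert svc["replicas"] >= 2, f"{name} is not replicated like production"

    def test_user_facing_services_have_a_cache(spec=PREPROD_SPEC):
        # Caching differences were exactly what made the old staging box misleading.
        for name in ("web", "api"):
            assert spec["services"][name]["cache"] is not None, f"{name} has no cache layer"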

David Vydra (13:14):
The one thing about release management I want to talk about is the impact on developer culture. If we're able to create environments where we can run end-to-end tests reliably, what are the norms when the tests break? As an example, on my team at Google, we locked down version control: until an end-to-end test was fixed, nobody could commit code. That was a pretty extreme position to take, but the context was a huge project with 70-plus engineers. It was very difficult to debug existing test failures while people were adding additional code, so we just locked down version control. Things had to be fixed before additional code could be merged. This may not be the right solution for every company, but in general, this is the first thing I look at: what are the developer norms around test breakage, who jumps in, how quickly it gets fixed, et cetera.
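
The "lock down version control until the end-to-end suite is green" norm can be enforced with a required CI step along these lines. The status endpoint and its JSON shape are hypothetical; the point is only that a red suite makes the gate exit non-zero so merges stop.

    # Merge gate: block when the latest end-to-end run is not green.
    import json
    import sys
    import urllib.request

    STATUS_URL = "https://ci.example.com/api/e2e/latest"   # hypothetical CI endpoint

    with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
        status = json.load(resp)

    if status.get("result") != "passed":
        print(f"End-to-end suite is {status.get('result')} (build {status.get('build_id')}).")
        print("Merges are blocked until the suite is green; fix or revert first.")
        sys.exit(1)

    print("End-to-end suite is green; merge may proceed.")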

Alex Langshall (14:08):
The way our build system is designed, we continue to allow developers to merge code, but we block continuous release. Our front-end targets are continuously released; they go through a series of automated tests prior to building the release target, which is separate from the regular build. So that's one way we handle that. Culturally, when the release manager tells you to do something, you drop everything and do it. When I started, we were on a two-week release cycle; if you found a release blocker on Wednesday, it could wait a day or two and didn't necessarily impact the release. Now we cut our non-CD branches on a Monday morning to release on a Wednesday morning, so if any bugs are found Monday or early Tuesday, as a developer you've got less than a day to get a fix in and merged. Our culture has had to shift: when the release manager says you've got this issue and you've got to get a fix out right now, you drop everything and you do it. The other thing that has helped is that the things that are continuously delivered have such a good battery of automation. We have really good alerting. We have the ability to fail fast and roll back quickly in production. It's all little pieces that lead to safety. Because of the way our product is built, we've had to have multiple strategies: one for the backend and a different release strategy for the front end, where we actively cache things in CDNs, and continuously releasing those means we're being hammered by users every time we release because they're loading new JavaScript into their browsers. It's all context.
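
The "fail fast and roll back quickly" safety net Alex mentions might look something like the sketch below: after a continuous release goes out, poll a health endpoint for a short window and trigger a rollback if it does not stay healthy. The endpoint, rollback command, and timings are placeholders, not Lucid's actual tooling.

    # Post-release watch: roll back if the majority of health checks fail.
    import subprocess
    import time
    import urllib.request

    HEALTH_URL = "https://www.example.com/healthz"     # hypothetical health endpoint
    ROLLBACK_CMD = ["./deploy.sh", "rollback", "web"]  # hypothetical rollback script
    CHECKS, INTERVAL_S = 10, 6

    def healthy(url: str) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    failures = 0
    for _ in range(CHECKS):
        if not healthy(HEALTH_URL):
            failures += 1
        time.sleep(INTERVAL_S)

    if failures > CHECKS // 2:
        print(f"{failures}/{CHECKS} health checks failed; rolling back.")
        subprocess.run(ROLLBACK_CMD, check=True)
    else:
        print("Release looks healthy; keeping it.")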

Michael Larsen (16:09):
So I've got a question here about release strategy, and I'll pose it as a hypothetical because this is kind of my favorite headache. I also have a history of being the release manager in past lives. One of the things we always found frustrating and limiting is the fact that we had a hybrid approach to release. If you're a Software as a Service on a platform that everybody accesses uniformly, then pushing out a release a day or a feature a day is not a big deal. You can do that. But when you have people with an on-premises product, their requirement is that their servers stay on premises. They can't be in the cloud for security reasons; they have to actually have control over their systems. So we don't get that flexibility, unfortunately. We're always having to say, hey, we can push these things out to this group of people if we want to, but we can't really do a release unless it goes out to everybody. Is that something that either of you have dealt with? Do you have similar limitations, where you can push things out at a given time but then might have to do a more formal release later that encompasses all of that? And do you ever find that you get interesting tension in doing that, where you have a little bit of a release you can lead up to, and then you do a fuller release and have to come back and kind of scramble to make sure that everything fits?

Alex Langshall (17:35):
Yeah, totally. For a while we had an on-premises product that we've since deprecated, to mixed results. It's really a development headache for us. It's difficult. First, there's the cultural problem: once you've got everybody on board with continuous deployment, having something that's still a monolithic release means, okay, we've got this special thing that we've now got to get a bunch of testers on and make sure is ready for release. It's even worse when those larger monolithic things have to interact with the continuously deployed things. This happens a lot with our mobile apps. We have two, an iOS and an Android app, and we don't push a release out to the app store every time there's a build, but we do keep everything building whether or not we release it as a full release of the product. It requires a level of vigilance to keep building on a regular basis, even when you're not releasing, as if you were going to release, and then to put the effort in to check every once in a while and make sure, okay, is the app still doing what I think it's doing? Even though we don't have a release coming up, it's better (and easier for the developers) to catch those integration bugs that creep in much sooner in the process. Particularly if you have a deadline, it's like, I've got to get this out by Thursday and I can't even get the build to start on a Tuesday. My opinion is you've got to have those builds already running incrementally. Otherwise you're setting yourself up to have to deal with all of it all at once, and that's never a good feeling.

Matthew Heusser (19:21):
Let me see if I understand what we're saying here. Maybe there's a Software as a Service version, but there's definitely an on-premises version: single-tenant, like one company has their own box. The code base is different enough that even if you have the cloud-based, hosted, log-in-and-use-it, multitenant version, the local box is so different that those tests are not really valid. And Alex is describing two different strategies. One is a traditional, classic regression-test burn-down, where we've got to sweat this thing out at the end, versus at least having some sort of relatively decent continuous integration and continuous testing for every change. And the first approach is going to be painful and icky. Was there nuance that I lost in that quick summary?

Alex Langshall (20:23):
The only nuance I would add is that sometimes that icky regression still has to happen. The ideal is that we're deploying all the time, everything is building all the time, and our automation covers everything we need to cover. The reality is that our engineering budgets are finite. We tackle the thing that is most important to the business, and rightfully so. That means we can't always hit those ideals, and it does mean that every once in a while we've got a major feature or a legacy release that we've got to deal with and we've got to be all hands on deck about it. It's unpleasant, but it's also a business necessity, so we have to suck it up and just do it.

Matthew Heusser (21:11):
Yeah, I mean, I would say even as a methodologist and theorist, the CI/CD/CT stuff mitigates the risk and finds some of the errors. But in many cases, if not most, there's going to be that burn-down, especially if you're not deploying an appliance but instead a Windows application. The release cadence for a Windows application is simply going to be different. One of the things we talked about was this idea that there might be a kind of maturity ladder for you to climb. What we're describing is this platonic ideal end state, and reality might be not so great. We're never there. It's always, what should we do next week, next month, next quarter? What are the key maturity steps that we commonly see as companies strive toward a more nuanced version of testing?

David Vydra (22:08):
I want to jump in to share our actual practice at Time by Ping, because what was discussed before is exactly what we do. We have multiple versions of our Windows client out there in the wild, because our customers are not interested in upgrading whenever we push something. They have their own plans, so they upgrade when they upgrade, and we know when they do. So we have to make sure that the backend is compatible with all of these versions of the desktop client in the wild. This leads to the conversation about maturity. What we've decided to do is trust developers to automate backend tests, but the backend test specification and testing rests solely in the QA organization. We actually run our client with a proxy and capture all of the HTTP traffic, so we know exactly what it does. Then we run the API tests, again through the proxy, and we carefully make sure that the API tests do exactly what the actual product does. It is a bit tedious and a bit expensive, but so far it has worked really well for us. And then on the front end, we practice testing of the desktop client. We have a daily build of our Windows installer, and we test all the new work that was done during the day. We do some live regression testing. So we're trying to keep the product always releasable, even though a particular version may get to a particular client many months after the binary was built; we practice continuous testing. These are a couple of examples of maturing my organization, which at this point is still a Series A startup.
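
David's proxy approach, capture what the real client does and hold the API tests to the same traffic, can be reduced to a coverage diff once both sides are recorded. The capture files and their JSON shape below are invented for illustration; in practice they might be exported from a proxy such as mitmproxy, or from HAR files.

    # Compare endpoints exercised by the desktop client vs. by the API test suite.
    import json

    def endpoints(capture_path: str) -> set[tuple[str, str]]:
        # Each entry is assumed to look like {"method": "POST", "path": "/v1/time-entries"}.
        with open(capture_path) as f:
            return {(e["method"], e["path"]) for e in json.load(f)}

    client_traffic = endpoints("captures/desktop_client.json")   # hypothetical capture
    api_test_traffic = endpoints("captures/api_tests.json")      # hypothetical capture

    for method, path in sorted(client_traffic - api_test_traffic):
        print(f"Client calls {method} {path} but no API test exercises it.")
    for method, path in sorted(api_test_traffic - client_traffic):
        print(f"API tests call {method} {path} but the client never does.")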

Matthew Heusser (23:47):
Fantastic. So, look at the time; it's been great. What we typically end with is, what did we miss? Sort of a last word. And then we'll do: tell us where people can learn more about you. And that absolutely is your chance to pimp your book or your training class. You know, you're speaking at STAR West and they should really go and give you high scores if they think your talk is good. Whatever you want. I'm going to start with Michael. I'd love to hear your thoughts on this idea of TestOps.

Michael Larsen (24:19):
So my last comment on this is that, like any other thing, it comes down to a maturity model and to how you ultimately put something in place. What it really comes down to with automation is that you recognize patterns. You recognize where you're doing things repetitively and reliably, or at least that you have the avenue of doing dynamic work where you know what the patterns are. Then you work those in, you get the ability to test those, and you make sure that they are sound. Once that's the case... I'm not necessarily a fan of "automate everything," but I am absolutely a fan of "automate the drudgery." Those are the things I would love to have automated and out of my hair. And if that means that ultimately I do a lot less release management, that's great, because it means that, especially from a testing perspective, your eyes are now open and available for more important things. That's the thing I keep hitting on. Remember, you're not necessarily stopping the work you need to do; you're just opening yourself up to being able to deal with much more interesting things. And I think that's an important place to go. So there's my quote-unquote elevator pitch for why this is an important topic. As for what I'm doing: I've been accepted to the Pacific Northwest Software Quality Conference in October, and I'm in the process of writing a paper and doing a talk and presentation for it. It's right in my wheelhouse: The Do's and Don'ts of Accessibility. So I'm excited about that.

Matthew Heusser (25:54):
Awesome. Thank you, Michael. I would just add that, the way this is positioned, testing doesn't go away. We're not eliminating testing; we're not automating it away. In fact, what we're saying is, there may be some tools and scripts and things, but if I heard this correctly, let's inject testing into operations so that we can test changes before we deploy them. So first, yeah, we'll poke it, we'll test it, we'll do whatever, in our copy-of-prod environment, to make sure it doesn't break anything. But then we can also run our tooling against it, which is going to have to be developed by testers or programmers or whoever, and see that with the changes in the environment, the software still works. And the way we figure that out is through a traditional process that we've automated. That's really kind of a neat two-for-one. I really like the framing, and I'm going to start talking that way. So Alex, tell us what we've missed.

Alex Langshall (26:47):
My director at Lucid, the director of quality assurance, is a guy by the name of Craig Randall. Before working in quality assurance, he used to drive container ships; his degree is in maritime transport. The thing he always tells us is that a ship doesn't turn 90 degrees right away. Multi-million-pound vehicles have inertia that you're fighting when you're trying to go somewhere. Engineering organizations are like that. When you bring new ideas to them, it's going to take a while; adoption is not going to happen immediately. Getting your testers involved with your infrastructure is going to take a while to catch on. My big thing is: be patient with that. It's a process for any kind of change in an organization, but eventually your boat will go where you turn it to go. In terms of self-promotion, I've been kind of on a hiatus from talks and writing during COVID, mostly to spend time with family. I'm on Twitter @ALangshall; feel free to follow me there. Lucid is hiring: go to https://lucid.co/careers. We're looking for software engineers, testing jobs open up pretty often, and we're often looking for automation engineers. Great place to work, great benefits. Come join us.

Speaker 1 (28:17):
Thanks, Alex. David.

David Vydra (28:20):
Yeah. So one thing I think would have been great to touch on is the simple fact that the quality of your automation is only as good as the quality of your analysis. So I'm a big believer in hiring and nurturing great quality analysts who can truly partner with product to shift left and come up with the right test cases in the first place. As for me, I should say my company, Time by Ping, is on a mission to automate time capture and billing for professional services organizations. We're starting with the legal market. We're growing and we're hiring, both in engineering and on the testing side, and you can find that easily by searching for "Time by Ping". My personal website is https://testdriven.com; I blog there once in a while. Or just look for my last name, @vydra on Twitter. I just want to say it's been a pleasure participating in the podcast and lovely meeting all of you. Thank you.

Matthew Heusser (29:15):
For the 2020 definition of "meeting." Thanks, everybody.

David Vydra (29:18):
Exactly. Thank you.

Michael Larsen (29:20):
All right. Thanks for having us.

Alex Langshall (29:22):
Yeah. Thank you.

Michael Larsen (OUTRO):
That concludes this episode of The Testing Show.

We also want to encourage you, our listeners, to give us a rating and a review on Apple Podcasts or Google Podcasts, and we are also available on Spotify.

Those ratings and reviews, as well as word of mouth and sharing, help raise the visibility of the show and let more people find us.

Also, we want to invite you to come join us on The Testing Show Slack channel, as a way to communicate about the show.

Talk to us about what you like and what you’d like to hear, and also to help us shape future shows.

Please email us at thetestingshow (at) qualitestgroup (dot) com and we will send you an invite to join the group.

The Testing Show is produced and edited by Michael Larsen, moderated by Matt Heusser, with frequent contributions from our many featured guests who bring the topics and expertise to make the show happen.

Additionally, if you have questions you’d like to see addressed on The Testing Show, or if you would like to be a guest on the podcast, please email us at thetestingshow (at) qualitestgroup (dot) com.
