The Testing Show: Data Quality
In our ever-changing world of applications, processes, and systems, we spend a lot of time talking specifically about improving those applications and how they are built. However, how many of us have taken a step back and asked about the actual data that we deal with? The quality of our data has everything to do with the ability of applications to be successful and work in ways that actually matter. To that end, Naresh Nunna and Sendhil Selvanathan join Matthew Heusser and Michael Larsen to discuss ways in which we can better assure overall data quality and perhaps introduce a DevOps for Data in conjunction with CI/CD pipeline modernization and analytics.
Michael Larsen (Intro):
Hello and welcome to The Testing Show.
This episode was recorded Monday, July 11, 2022.
We pay a lot of attention to applications, processes, and systems but those systems stand or fall based on the quality of data they contain. How can we make sure that the data that we have is both accurate and timely? Naresh Nunna and Sendhil Selvanathan join us to talk about improving overall data quality and creating a DevOps for Data model.
And with that, on with the show.
Matthew Heusser (00:00):
Thanks, Michael, for the introduction. Today, we’re gonna be talking about data quality, which I think a lot of corporations take for granted, take for a given; “Of course our data is good!” and we in software know that that’s not always the case. Not only is there garbage data at some places, some organizations, it’s usually delayed. Usually, by the time we can do analytics on it, we’ve lost a day if we’re lucky. A week, if we’re unlucky. It can get to the point where organizations, the delays sort of lap themselves. By the time we want to provide a data update, we’ve already entered into the next cycle when we should have a new data update. We’ve learned so many lessons in continuous delivery and DevOps that could be applied to data… and I think it’s fair to say that they haven’t been yet. So this is what we’re gonna talk about today. To do that, we’re bringing in two experts who represent different sides of the business. Naresh Nunna is on the technical engineering side and… do me a favor, Michael? Help me get Naresh to come out because he has a lot of data pipeline experience, CI/CD for data, which we just don’t hear about much, but he’s… correct me if I’m wrong, you seem more like the strong, silent type, right? <laugh> So he’s done a lot. We gotta get him to talk about it. Tell us about your career, Naresh. What am I missing in there?
Naresh Nunna (01:30):
Oh, it’s pretty good. To start with, I’ve done the beauty of both. In my mind, it’s the development and the quality side of it. I started off my career being a developer on mainframe. Slowly moved into open source, building and managing applications on distributed technologies, microservices, cloud native, and then, for almost a decade, I’ve been focusing on the quality side of it, which is primarily on the data side of it. How do I give the right set of data so the businesses can run smoothly? How can they make very well informed decisions rather than me making some decision based off, “Hey, I have a good feeling about it. Let’s just go with it.” Data should be factual and it should be real, which is where data quality comes into the picture. We’ve been trying to help customers understand the importance of the data, and validating the data and assuring the data as part of their application development is one major thing we’re trying to focus on, and how could we do it even better? And that’s where the technology has evolved. Lots of principles have evolved on the software side of it, which is DevOps, or CI/CD pipelines, things like that. One angle that we started looking at is how do we bring these rich principles to the data side? How do we make data quality a continuous process, by building some data quality pipelines where the business will get immediate feedback on where their data is, what condition the data is in, whether the data is good, and at the same time supply quality data to the data scientists so they can run mature algorithms and analytics, start uncovering the hidden facts behind the data, and start driving businesses with much more accurate, reliable, and timely data.
Matthew Heusser (03:33):
Yeah. Thanks. I should say that we’re looking at Naresh’s resume. He’s currently a director at Qualitest, which is more of an independent, responsible for a business unit position, but he’s moved up and down to manager, architect, consultant, back again, and can represent that sort of technical perspective. But neither of the people we’re talking to today are 100% focused on one thing. There’s a balance there. Our other guest is Sendhil Selvanathan. Who’s the vice president for customer success at Qualitest but before that was an architect for another consulting company, established a digital quality engineering center of excellence, and did a lot in the space, and is now I like to think come home to quality, and is bringing that into data, particularly for business and finance. So welcome to the show, Sendhil. Glad to have you here. What did I get wrong?
Sendhil Selvanathan (04:35):
So you got most of it right, Matt. Thanks for the quick introduction. So I represent Qualitest from a customer success organization perspective. What does customer success at Qualitest mean? As in what it really means to us, right? It means changing people’s lives with our positive contributions in terms of bringing in a quality mindset as a lifestyle, rather than as a process validation. We have done these kinds of activities in the past as well, grew through the ranks of various customer program management, delivering business outcomes to the customers, but quality has always been at the forefront of my career. And I’m now getting back into the quality attainment. So those are the areas that we focused on in the past.
Matthew Heusser (05:22):
Thanks, Sendhil. And Naresh kind of gave it his perspective very briefly. I’d like to hear from you if it’s all right, what would it mean to have quality around data? And maybe we can get into an, you know, we can throw these things out timely, accurate. How does the customer’s life change if they have quality around their data?
Sendhil Selvanathan (05:44):
Let me give you a short example of what it really means about quality around data. A couple of weeks back, my wife applied for a savings account in one of the financial services, their local entity as such, she started receiving emails about activating her card, but she had not received the card yet. So the debit card, which was supposed to be delivered to her for two weeks has still not been delivered, but we have been constantly getting like 4 to 5 reminders. And she started doubting whether I really missed those letters, which I know for a fact that I always checked those letters, with some unrelated ones on the same day. So she started doubting me and we had a little bit of argument that I did not lose your letter honey, right? When we went back and validated, like what’s really wrong with this particular process, we called customer care.
Then she found out that the card was never even dispatched, the processes, which were sending automated alerts to activate the card, were not even getting the right data. So what, really, that means is the data that’s supposed to drive the business and supposed to give you the best customer experience in any financial services (or any other industry) is not really working in tandem, not really working in the completeness about the customer’s data and life cycle. So that brings a point in terms of why should we maintain the quality and what quality really means for data? The completeness, the accuracy of data, in managing every specific process around the customer, you know, interactions with our organizations is quite key. And we really need to validate the accuracy across the various stages of those life cycle of data movement within our systems to ensure that we give the best customer experience. Hope that answers the question.
Michael Larsen (07:31):
Sounds good. It definitely helps me understand a little bit about what you’re both working with. So as is often the case, Matt asks the general questions and I tend to ask the mercenary questions because, interestingly enough, you’re right in my wheelhouse right now, especially with the fact that I deal with data transformation and the team that I work with deals with data transformation. The company that I work for has a number of products. And oftentimes they take data from one application and they import it into another, or they have a need to do a special transformation of that data. And then it goes out. That is something that I deal with on a regular basis. For me personally, the big challenge, of course, is data integrity. How do you know that what comes in is what goes out, especially if you’re not necessarily getting a like for like? So when you’re approaching things like data quality, data integrity, how do you know that stuff is where it should be? How would you go about doing something like that?
Naresh Nunna (08:41):
So I think you’ve brought up a very interesting aspect, which is real. A lot of our customers have their data sort of disparate. It could be flat files. It could be mainframe, or it could be coming from the cloud or, even better, now we’ve started getting this data from smart sensors, like IoT devices. So there’s definitely a bundle of data flowing through; the data sources have increased quite a bit compared to what we saw before. That is only on the data sources. What has evolved in terms of storing the data, transforming the data, converting the data, and loading into a data warehouse, data lake, data lakehouses? There’s even more beyond that. The one challenge which is still not solved is automating the data quality assurance of it. There are, uh, a lot of tools in this space that can connect to the data source, extract the data, and then show it as a profile to you where you can just visualize where your data state is, what the health of your data is, and then start building some algorithms and procedures to run these in an automated fashion. We have created a framework which is tool agnostic, which will work irrespective of the tools, irrespective of the data sources that you’re talking to, irrespective of wherever we store or process the data. This is where our frameworks will really help to automate the whole assuring of data quality. Where we are looking even further is, “Hey, how do we apply the machine learning side of this? It works great from monitoring the data by running some SQL queries or applying some business rules and transformation rules around it. But how do I automate it even better and make it more predictive?” That’s where we started discussing internally and with our product partners as well. How do we bring machine learning to it?
I want to understand the data anomalies, look at the trends in the data, start making some decisions based on what I’ve learned and how do I correct the data in a way it’ll still fit into the boundaries of the data semantics. And that’s where we started in looking at to make it even better or even faster in terms of assuring the quality of the data.
Matthew Heusser (11:06):
Okay. It brings up two questions for me. You mentioned visualizing the quality of the data. And when I hear that, I mean like a score. “This data is a 99 out of a hundred. This data is only an 87.” Maybe you could give us an example, or a story of a particular type of data, whether it’s customer data or orders or what’s in the warehouse. What would it mean to have that number evaluated?
Naresh Nunna (11:32):
If you take an example of a customer data store, I used to work for a banking financial customer, where they have this customer data store, which is running at the center of their deposits, consumer banking, loans, credit cards. So the customer data store is very key for running all those businesses. The challenge that they always had is the data is not accurate or consistently stored every day. For example, uh, I have a customer name, which is stored as in first name, last name in one place, and middle name. In other place, I have stored it as a full name. The challenge is there is data, but the data, how it is stored is different. In some cases, we have seen date of birth where we don’t even have year stored for various reasons of the business. How do we even reconcile the data? One is understanding the data itself.
We looked at the data semantics, look at the business rules behind it, and then start giving a score to it. And the score is given. There is six different dimensions which are industry popular. They look at accuracy, consistency, validity, uniqueness, timeliness… so created algorithm to start giving a score to it. Is my address consistent across the board? Is my name consistent across the board? Do I have the data uniquely stored rather than same data represented in different places? You started giving that score to measure the state of the data and then monitor that on a continuous basis to improve that over a period of time. That is the way you are methodically measuring the quality and quantity of the data. And at the same time, you are building processes to improve it over the period of the time, not just the existing data, also adopt these principles on the new dynamic data, which is coming in as part of new products that you are launching, new enhancements that you’re building into your projects. It’s simple and powerful, and you just start measuring the quality and quantity of the data and then make it as a continuous process.
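As a rough illustration of the scoring idea described above, here is a minimal Python sketch. It is not the framework discussed on the show: the record fields, the choice of three dimensions (completeness, uniqueness, validity), the crude email rule, and the equal weighting are all assumptions made for the example.

```python
from datetime import date

# Hypothetical customer records; the field names are illustrative assumptions.
records = [
    {"id": 1, "first_name": "Ana", "last_name": "Diaz", "dob": date(1980, 5, 17), "email": "ana@example.com"},
    {"id": 2, "first_name": "Bo", "last_name": "", "dob": None, "email": "bo@example.com"},
    {"id": 3, "first_name": "Cy", "last_name": "Lee", "dob": date(1975, 1, 2), "email": "not-an-email"},
    {"id": 3, "first_name": "Cy", "last_name": "Lee", "dob": date(1975, 1, 2), "email": "cy@example.com"},
]


def completeness(rows, fields):
    """Share of required field values that are present (not None or empty)."""
    total = len(rows) * len(fields)
    filled = sum(1 for r in rows for f in fields if r.get(f) not in (None, ""))
    return filled / total if total else 1.0


def uniqueness(rows, key):
    """Share of rows whose key value is distinct."""
    keys = [r[key] for r in rows]
    return len(set(keys)) / len(keys) if keys else 1.0


def validity(rows, field, predicate):
    """Share of non-empty values that satisfy a format rule."""
    values = [r[field] for r in rows if r.get(field) not in (None, "")]
    if not values:
        return 1.0
    return sum(1 for v in values if predicate(v)) / len(values)


def looks_like_email(value):
    # Deliberately crude rule, for illustration only.
    local, sep, domain = value.partition("@")
    return bool(local) and sep == "@" and "." in domain


scores = {
    "completeness": completeness(records, ["first_name", "last_name", "dob", "email"]),
    "uniqueness": uniqueness(records, "id"),
    "validity": validity(records, "email", looks_like_email),
}

# Equal weighting is an assumption; a real scorecard would weight by business impact.
overall = round(100 * sum(scores.values()) / len(scores))
print(scores, overall)
```

Run on a schedule, a score like this gives the trend line mentioned above: the number itself matters less than whether it improves over time.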
Matthew Heusser (13:39):
Okay. So I think I get you now. You’re talking about those classic six and I’ll, I’ll say ’em slow for the audience. If you want to take notes; Accuracy, Completeness, Consistency, Uniqueness, Timeliness, and Validity. And from what I have seen, frankly, most organizations do not have the tools to assess all those things by themselves. So just the assessment I would think would add value. Sendhil, I’m curious, how does having that information on the customer side, how does that change their business? And then how do we improve it once we’ve got the assessment?
Sendhil Selvanathan (14:20):
Excellent question. So as a customer organization, most of our customers would like to measure the quality of data when the data is sent, when the data is in motion, or when the data is in use. So what do we really mean by these three different dimensions? When the data is sent, the source systems provide the data, and we need to validate the quality of that data. As you spoke about the six different dimensions in terms of completeness, accuracy, validity, conformity, we really need to validate on those. When the data is in motion, we’ve gotta be really looking at the source systems to the target, all the transformation rules, how the data exchanges itself, right, with various different parameters and data validation aspects; we need to validate those transformation logic aspects. We need an automated solution even to validate the data in motion and its transformation rules. When the data is gonna be delivered, when it’s gonna be used by the business intelligence report, we really need an automated solution to validate what the business executives are looking at.
Are they the right data? Are they the right combination of data that is supposed to be part of that report? Because those reports are the ones which are driving the business decisions on any given day and every organization can make or break based on those decisions. So when we look at the holistic data ecosystem of data, coming from your core systems and staying at rest, all the way until the data is used by the business intelligence reports, we need an end-to-end framework or a solution which can really help the customers visualize the way the data moves across these different hops. At the same time, there are the quality measurements associated with that, in terms of how accurate my data is when it traverses my enterprise ecosystems. Every customer would need to have a scorecard looking at the data governance layer, which is not just validating the data quality aspect of it, but the end-to-end delivery of data to the end business. And that is what we at Qualitest also would like to propose and work with the customers on, in terms of how do you create such a platform which can enable automated validations of all of these different hops and make sure the customers have confidence in the data we deliver to them and make sure that the customers really get the real benefit and use of those data sets. Hope that makes sense, Matt.
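One way to picture validating data in motion is a source-to-target reconciliation. The Python sketch below is an illustration under assumptions, not the platform described on the show: the row layouts, the `transform` rule (joining first and last name into a hypothetical `full_name` field), and the report keys are all invented for the example.

```python
# Hypothetical source rows and the rows that actually landed in the target system.
source_rows = [
    {"id": 1, "first_name": "Ana", "last_name": "Diaz"},
    {"id": 2, "first_name": "Bo", "last_name": "Chen"},
    {"id": 3, "first_name": "Cy", "last_name": "Lee"},
]
target_rows = [
    {"id": 1, "full_name": "Ana Diaz"},
    {"id": 2, "full_name": "Bo  Chen"},  # extra space introduced in transit
]


def transform(row):
    """Assumed transformation rule: the target stores a single full_name field."""
    return {"id": row["id"], "full_name": f'{row["first_name"]} {row["last_name"]}'}


def reconcile(source, target):
    """Apply the rule to the source and compare the result with the target by id."""
    expected = {r["id"]: transform(r) for r in source}
    actual = {r["id"]: r for r in target}
    return {
        "missing": sorted(expected.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - expected.keys()),
        "mismatched": sorted(
            k for k in expected.keys() & actual.keys() if expected[k] != actual[k]
        ),
    }


report = reconcile(source_rows, target_rows)
print(report)
```

The same comparison can be repeated at each hop, so a scorecard can show where along the path a record was dropped or altered.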
Matthew Heusser (16:52):
Yeah. And now that you said that I thought of a real concrete example around credit cards or health insurance. We’ve updated your address, and now you should get new mail in six weeks, or, you know, we’ve updated your address for the credit card. We got you a new credit card, and this is gonna be valid in two weeks. We’ve added your name to an email list. You should start getting emails in two weeks… an email list?! I changed my contact information and you don’t know it for two weeks?! And that’s because I think of old non-transformed systems that have to slowly replicate over time to push your new address into the rewards card information for your points. That’s actually in an entirely different system that was, who knows, written in COBOL. And so the data doesn’t replicate quickly and all these little tools, we’re talking about a very similar analogy to continuous delivery for data, that should replicate in minutes, maybe near real-time. Is that the kind of transformation that’s possible with these tools?
Sendhil Selvanathan (18:05):
So Matt, we called it as DevOps for Data. What you really mean by that is replicating these system of records. There is a customer reference information in terms of your email, your addresses, all your personal PII, PCI, the credit card information. All of them need to be hosted in a centralized customer hub, which is referenced as a reference data. So, which means there is a data architecture exchange, all that also need to be enabled for various organizations in order to understand what is the system operating under as the single source of truth in that particular information as such, so that all the systems underneath could consume the right information at any point in time. So data gets delivered in minutes rather than in days. That is one. And the validation around the entire ecosystem will also need to happen in real time. And that is what we call it as a DevOps for Data where the automation solutions, which can bring in the validation of the source to target as the data moves into your system, as the customer updates a particular information, your validation systems also would need to give a score like, “Am I confident with all these changes the customers made? That my systems really process those changes accurately? And how do I deliver to my business and how do I improve my customer experience net promoter scores with the data ecosystem being consistent?” That is a key focus around all of these areas.
Michael Larsen (19:33):
Very cool. So I wanna look at one aspect here, if I can, especially because you just opened up that whole DevOps for Data <laugh> and hearing that this of course gets me thinking whenever I hear DevOps, I think of the C alphabet; CI, CE, CD, C whatever, continuous fill in the blank, continuous testing. And I guess the question is, of course, we wanna be able to have something that we can do that will allow us to continuously be able to add our updates, make our changes, and be able to roll out a new version of the product and deploy it. This is something I’ve never really considered because outside of the test and outside of the modular functions, doing a data integrity test, that would be a fairly involved process. And one that I would normally say, does that fall outside of continuous testing approaches? Or is this just yet another element that you can look at and that you can focus on? And if so, how do you approach that in your projects? Or do you?
Naresh Nunna (20:44):
So it is actually a bit different. It is continuous, but there are two folds to this, Mike. One, as you are building these modern pipelines or the cloud data fabrications, these modern frameworks that you’re building, the plumbing of the data is happening as data is added to it. But the data quality measurement should not be started by me deploying a data pipeline. It actually starts after you deploy the data into production. After you deploy your pipeline to production, because you have constant data, which is feeding through your source systems, operation systems, to your applications, your application might not have changed, but just because your data has changed, your application can start giving incorrect results, can start behaving differently. The continuous process is two-fold here. One, as you build the data products; second, you run these tests continuously after you deploy to production as well. That is where the monitoring solutions are coming into play, where the continuous testing happens after deploying into production as well.
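The point that the application can stay the same while the data drifts can be sketched as a check that runs on every incoming batch. This Python example is only a sketch under assumptions: the `address` field, the 25% null-rate threshold, and the batch contents are hypothetical, and a real monitoring setup would track many more measures.

```python
def null_rate(batch, field):
    """Fraction of rows in a batch where the field is missing (None)."""
    if not batch:
        return 0.0
    return sum(1 for row in batch if row.get(field) is None) / len(batch)


def check_batch(batch, field, max_null_rate):
    """Evaluate one production batch against an agreed threshold."""
    rate = null_rate(batch, field)
    return {"field": field, "null_rate": rate, "passed": rate <= max_null_rate}


# Two hypothetical production batches. The pipeline code is identical for both;
# only the incoming data has changed between them.
batches = [
    [{"address": "1 Main St"}, {"address": "2 Oak Ave"}, {"address": None}, {"address": "4 Elm Rd"}],
    [{"address": None}, {"address": None}, {"address": "7 Pine Ln"}, {"address": None}],
]

results = [check_batch(batch, "address", max_null_rate=0.25) for batch in batches]

# Indexes of batches that should raise an alert.
alerts = [index for index, result in enumerate(results) if not result["passed"]]
print(alerts)
```

Nothing was redeployed between the two batches, yet the second one fails, which is why the checks keep running after release rather than only at build time.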
Matthew Heusser (21:52):
Well, since Michael likes to get specific, let’s dig into it a little bit. Typical company, big enterprise, Global 50, S&P 100. You’ve got data that you don’t even know about because it’s a conglomeration of companies. Even at the corporate level, you’re aware of YAML, XML, SQL, APIs, mainframes, NoSQL. The data landscape is huge. You’re gonna come in and there’s a bunch of fiefdoms that all have their own thing. Maybe there’s an ERP system. If you’re really lucky, you get a company like Procter & Gamble, where they sell products. And those products all have unique identifiers. And you could, in theory, at least have a database of sales. Maybe. If you’re lucky. You need that, you know, for your quarterly reports. What do you do? How do you manage that? How do you help companies?
Naresh Nunna (22:43):
To be honest, what we really start with, I don’t know where my data resides. If you say that I know where my data resides, it is XML, JSON, flat files which are sitting on mainframes, you probably win quarter of the battle, but often what we see and what we hear is the data testing/data quality is done only on data warehouse or either on the business intelligence server. When we come, when we go into the customer, first thing, what we ask is, “Do you know where your data resides?” It’s very important for us to understand the whole life cycle of the data, where the data gets generated, where it is getting stored, transformed, loaded, and consumed. That is very key. Before we even try to solve something, start there first to understand your data estate then involve the business. Often these solutions cannot be looked only through the technology.
We involve the business as well. So they bring a unique aspect of the rules that goes behind your data and the experiences that they’re facing from a customer perspective. Are they able to trust their data? That is where involving the business is very crucial as well when we build these solutions. Then once you know where your data resides, second, you’ll know what kind of rules the business runs behind the data? Third is just applying pretty simple solutions in the play, where “How do I connect these data sources using these next gen tools, which are intelligent enough to understand various data structures and ability to discover by themself?” and show you, “Hey, this is where you are right now.” And then start building some custom rules. It could be Python programs, or it could be Groovy scripts or rules that you write on top of it, which covers the data semantics. And it also applies business rules behind it. Start with a highly critical business processes, understand the data footprint, start building priority basis. That is the meat of how to start your data quality journey.
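The custom rules mentioned above could look something like the following Python sketch. Everything here is assumed for illustration: the `Rule` class, the rule names, the priority numbers, and the sample rows are hypothetical, and date of birth is stored as a string so that a missing year (the case raised earlier in the conversation) can be represented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int  # 1 = tied to the most critical business process
    check: Callable[[dict], bool]


# Hypothetical business rules, written as plain predicates over a row.
rules = [
    Rule("postcode_present", 2, lambda r: bool(r.get("postcode"))),
    Rule("dob_has_year", 1, lambda r: bool(r.get("dob")) and len(r["dob"].split("-")[0]) == 4),
    Rule("balance_not_negative", 1, lambda r: r.get("balance", 0) >= 0),
]

rows = [
    {"id": 1, "dob": "1980-05-17", "balance": 120.0, "postcode": "48104"},
    {"id": 2, "dob": "--05-17", "balance": 50.0, "postcode": ""},  # year not stored
    {"id": 3, "dob": None, "balance": -10.0, "postcode": "10001"},
]


def run_rules(rows, rules):
    """Return failing row ids per rule, most critical rules first."""
    failures = {}
    for rule in sorted(rules, key=lambda r: r.priority):
        failures[rule.name] = [row["id"] for row in rows if not rule.check(row)]
    return failures


failures = run_rules(rows, rules)
print(failures)
```

Keeping each rule as a small named predicate lets the business review the list directly, and the priority field reflects the advice to start with the most critical processes and build out from there.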
Michael Larsen (24:57):
Very cool. So to maybe go in a future direction here, if this is future, maybe this is current and it’s like, “Oh dude! You’re behind the times. We’re totally doing this.” We’re seeing, of course, a lot of traditional businesses, even in just regular corporate environments, it’s common to have systems that are running in their particular offices and they have replications on actual hardware. But then we’re also seeing people moving this into the cloud. In my opinion, the more often we look at cloud infrastructure, that just adds more complexity and more places where something can hiccup and go wrong, because of a handoff from one machine to another. So is that something that you find yourself dealing with and how do you approach that?
Naresh Nunna (25:45):
Yeah, I think if you are in the journey of migrating to cloud, first, before you start, if you know your state of data, health of your data, you are very welcome to go with it. But often that’s not the case. What we come across is, “I have to migrate data to cloud yesterday,” which is always the case. You have to first understand where your data resides and the quality of your data; understanding that is actually going to help you plan your migration better. You might find a lot of obsolete data, which is sitting and collecting dust over a period of time, which may not be needed for you to put on the cloud. First, start creating a baseline on what you have. What is the state of your data, and how much of the data is really being used in my business processes or application processes? That will really help you to prioritize your use cases to the cloud.
Once you have done it, you have a before and after to compare it again. Once you have the base, you have a way to compare it against, to see how it is compared to my new system, which is running on the cloud. Be it from a data quality score perspective, how fast my procs are running compared to before and after, how quickly I’m able to produce these reports to businesses? It is really still important for you to understand about your state of data before you put it on the cloud. Once you are on the cloud, it is much easier with the cloud where you can create on-demand processes. You can create even generated pipelines where I can run these data quality rules whenever I need it, which is the beauty that comes with the cloud, which you may not be able to do on prem. You can run these data quality rules as part of your pipelines, trigger them as you have new data, which is loaded on workload. That’s how you need to adapt when it comes to running these data quality checks, be it on prem or on the cloud, or while you are planning to migrate to the cloud.
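The before-and-after comparison can be sketched as a baseline fingerprint per table, taken on prem and again on the cloud. This is a minimal Python illustration under assumptions: the table names, rows, and the `fingerprint` and `compare` helpers are hypothetical, and it assumes each row is a flat dictionary with string keys and simple values.

```python
import hashlib


def fingerprint(rows):
    """Row count plus an order-independent checksum of the rows."""
    digests = sorted(
        hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest() for row in rows
    )
    combined = hashlib.sha256("".join(digests).encode()).hexdigest()
    return {"row_count": len(rows), "checksum": combined}


def compare(baseline, migrated):
    """Report tables whose fingerprint differs between baseline and migrated copies."""
    diffs = {}
    for table, rows in baseline.items():
        before = fingerprint(rows)
        after = fingerprint(migrated.get(table, []))
        if before != after:
            diffs[table] = {"before": before["row_count"], "after": after["row_count"]}
    return diffs


# Hypothetical snapshots: same customers in a different order, one order missing.
on_prem = {
    "customers": [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Bo"}],
    "orders": [{"id": 10, "total": 25}, {"id": 11, "total": 40}],
}
cloud = {
    "customers": [{"id": 2, "name": "Bo"}, {"id": 1, "name": "Ana"}],
    "orders": [{"id": 10, "total": 25}],
}

diffs = compare(on_prem, cloud)
print(diffs)
```

Sorting the per-row digests makes the checksum insensitive to row order, so a table that was merely reloaded in a different sequence does not show up as a difference.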
Michael Larsen (27:40):
So we, I think we’ve pretty much covered everything that we had initially talked about for this. So this is where we typically say for our guests, this is our opportunity for you to tell us what you’re up to, what you’re doing. Give your 30-second elevator pitch, for example, that you want to make sure that people walk away with. So the time is yours. Naresh? Sendhil? How can people learn more about you? What you’re up to, and if there’s any parting thoughts you want to give, here’s your time. Naresh, let’s start with you.
Naresh Nunna (28:11):
Sure. I think the first thing is I wanted to talk to my business and other folks like customer service partners, like Sendhil, to understand what they are hearing from their customers. Are there challenges in terms of knowing the data quality? How are they even thinking about data quality as part of their transformation, or organization goals, which is the very meat of everything we try to do here. Make them realize the importance of the data and data quality, and then create an ROI around it. Once I create an ROI by building some modern pipelines, which are running things much faster, dealing with the data… modern data, I would call it, be it 5G or IoT or smart sensors, as simple as the data feeds, which I get from internal or external systems, understanding the data footprint and blending that with our quality background and building these modern pipelines, which feed their system in real time, and the ability for the businesses to look at their data. Start using the data for making those informed decisions. This is where we see a big need, and it is much more important if you are in the journey of migrating to the cloud or digital transformation or data modernization, wherever you are. We think data is always the oil, which is running the businesses, and considering the data quality assurance as part of your application roadmap, product roadmap is gonna make you win a lot more business and improve the customer experience. Over to Sendhil.
Sendhil Selvanathan (29:51):
Yeah, data is always the messiest place to deal with in any administration. It is more like all the old pipelines and wiring underneath your century-old fortress of a warehouse, but people always think about, oh, when we want to validate data, it’s always complex, but we can always bring a method to the madness. Data validation is not all that complex. We have differentiated solutions now, which can make our lives easy in terms of validating what we are consuming and what we are delivering to the business. Let us make the lives of our customers simple and easy that they could cluster data. Let’s make the lives of our employees easy as well, so that they don’t have to think about all the data issues. They could spend time with their family the rest of the evenings. That’s what we deal with.
Matthew Heusser (30:41):
Well, thanks Sendhil. Thank you, Naresh. Thank you for your time. Appreciate you being on the show. And I think with that, we’re gonna say, thanks everybody.
Michael Larsen (30:50):
Thank you very much.
Naresh Nunna (30:50):
Sendhil Selvanathan (30:51):
Thank you all.
Michael Larsen (OUTRO):
That concludes this episode of The Testing Show. We also want to encourage you, our listeners, to give us a rating and a review on Apple Podcasts, Google Podcasts, and we are also available on Spotify. Those ratings and reviews, as well as word of mouth and sharing, help raise the visibility of the show and let more people find us. Also, we want to invite you to come join us on The Testing Show Slack channel, as a way to communicate about the show. Talk to us about what you like and what you’d like to hear, and also to help us shape future shows. Please email us at thetestingshow (at) qualitestgroup (dot) com and we will send you an invite to join the group. The Testing Show is produced and edited by Michael Larsen, moderated by Matt Heusser, with frequent contributions from our many featured guests who bring the topics and expertise to make the show happen. Additionally, if you have questions you’d like to see addressed on The Testing Show, or if you would like to be a guest on the podcast, please email us at thetestingshow (at) qualitestgroup (dot) com.