The tech industry has been abuzz about big data lately; everyone from the medical industry to the government of China has been tapping into it for its potential to streamline the massive amounts of information they deal with on a daily basis. Wikipedia defines big data as “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” In other words, it’s a way to handle massive data systems that are difficult to work with using conventional methodologies such as ETL (extract, transform, load). Big data requires its own careful methods of implementation and data processing, and it presents interesting new challenges for testers as well.

It’s often incredibly difficult to know for sure not only that data is being transmitted properly on these systems, but also that the best possible data is being used. Verifying data quality is therefore one of the biggest concerns when testing big data, and it requires that a wide array of software testing methodologies be applied across both functional and non-functional testing. In this respect, big data testing is similar to the testing performed for any database, regardless of size: a diverse approach that incorporates multiple methodologies gives the most complete picture of any errors present in the system.
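
As a concrete, if simplified, illustration of what a basic data-quality check can look like, the sketch below reconciles record counts between a source extract and the data that actually landed in the target store, and flags rows missing a key field. The file paths and the `customer_id` field are invented for the example.

```python
import csv

# Invented paths for the sketch; in practice these would point at a source
# extract and at data exported back out of the big data store.
SOURCE_FILE = "source_extract.csv"
TARGET_FILE = "target_export.csv"

def load_rows(path):
    """Read a CSV file into a list of dictionaries."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def check_completeness(source_rows, target_rows):
    """Flag the most basic data-quality problems: lost records and empty keys."""
    issues = []
    if len(source_rows) != len(target_rows):
        issues.append(f"record count mismatch: {len(source_rows)} source "
                      f"vs {len(target_rows)} target")
    # 'customer_id' is an assumed key field, purely for illustration.
    empty_keys = sum(1 for row in target_rows if not row.get("customer_id"))
    if empty_keys:
        issues.append(f"{empty_keys} target rows have an empty customer_id")
    return issues

if __name__ == "__main__":
    problems = check_completeness(load_rows(SOURCE_FILE), load_rows(TARGET_FILE))
    for problem in problems:
        print("DATA QUALITY ISSUE:", problem)
```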

Some of the biggest aspects of big data testing are the following:

  • Defining test strategies for both structured and unstructured data validation (see the sketch after this list)
  • Setting up an optimal test environment
  • Working with non-relational databases
  • Performing non-functional testing
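
To make the first item above a bit more concrete, here is a minimal sketch of what validation might look like for structured versus unstructured data. It assumes a setup where structured records are exported from a queryable table as JSON Lines and unstructured application logs sit in raw files; the file names, the expected columns, and the "ERROR" marker are all invented for the example.

```python
import glob
import json

# Invented locations for the sketch.
STRUCTURED_EXPORT = "orders_table_export.jsonl"  # rows exported from the warehouse table
RAW_LOG_GLOB = "raw_logs/*.log"                  # unstructured application logs

EXPECTED_COLUMNS = {"order_id", "customer_id", "amount", "created_at"}

def validate_structured(path):
    """Structured validation: every exported row should match the agreed schema."""
    bad_rows = 0
    with open(path) as f:
        for line in f:
            if not EXPECTED_COLUMNS.issubset(json.loads(line)):
                bad_rows += 1
    return bad_rows

def validate_unstructured(pattern):
    """Unstructured validation: confirm the raw logs were ingested and readable,
    and count lines that look like processing errors."""
    files = glob.glob(pattern)
    error_lines = 0
    for path in files:
        with open(path, errors="replace") as f:
            error_lines += sum(1 for line in f if "ERROR" in line)
    return len(files), error_lines

if __name__ == "__main__":
    print("rows failing schema check:", validate_structured(STRUCTURED_EXPORT))
    files_seen, suspect_lines = validate_unstructured(RAW_LOG_GLOB)
    print(f"log files ingested: {files_seen}, suspicious lines: {suspect_lines}")
```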

Poor implementation of the above can lead to poor quality, delays in testing, and increased cost. Defining the test approach for data validation early in the implementation lifecycle ensures that defects are identified as soon as possible, which reduces the overall cost and time to market of the finished system. Performing data functionality testing can identify data quality issues that originate in coding or node configuration errors, while strong test data and test environment management ensures that data from a variety of sources can be processed without error and is of high enough quality for analysis.
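
As one small piece of test data management, a team might generate representative input in more than one source format, deliberately including malformed records, so the pipeline's error handling is exercised before real feeds arrive. Everything below (field names, file names, the injected defect) is invented for the illustration.

```python
import csv
import json
import random

random.seed(42)  # keep the generated test data reproducible

def make_record(i):
    """One well-formed record, using invented fields."""
    return {
        "order_id": i,
        "customer_id": f"C{i % 50:03d}",
        "amount": round(random.uniform(5, 500), 2),
    }

def write_test_data(num_records=1000, bad_ratio=0.02):
    """Write the same logical data as CSV and JSON Lines (two 'sources'),
    sprinkling in a few deliberately broken records."""
    with open("orders.csv", "w", newline="") as csv_file, \
         open("orders.jsonl", "w") as json_file:
        writer = csv.DictWriter(csv_file, fieldnames=["order_id", "customer_id", "amount"])
        writer.writeheader()
        for i in range(num_records):
            record = make_record(i)
            if random.random() < bad_ratio:
                record["amount"] = "not-a-number"  # inject a type error on purpose
            writer.writerow(record)
            json_file.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    write_test_data()
```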

Apart from functional testing, non-functional testing (specifically performance and failover testing) plays a key role in ensuring the scalability of the process. Functional testing identifies coding and requirements issues, while non-functional testing identifies performance bottlenecks and validates the non-functional requirements. This is true of database testing in general (and probably of testing any system at all), but it is worth noting that, at the volumes big data deals with, errors that would be minor in a smaller system can become enormous. That size also makes it difficult to replicate the entire system in the test environment; a smaller environment usually has to be created instead, which can itself contribute to inaccurate test results. System engineers therefore need to be careful when building the test environment, since many of these concerns can be eliminated by carefully designing the system architecture.
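
As a rough illustration of why a scaled-down environment can mislead, consider a simple throughput extrapolation. The node counts, volumes, and the linear-scaling assumption below are all invented for the example; real clusters rarely scale perfectly linearly, which is exactly the kind of inaccuracy described above.

```python
# Hypothetical measurements from a scaled-down test cluster.
TEST_NODES = 4
TEST_RECORDS_PER_HOUR = 2_000_000   # throughput observed in the test environment

PROD_NODES = 40
PROD_DAILY_VOLUME = 400_000_000     # records the production system must handle per day

def naive_projection(test_nodes, test_throughput, prod_nodes):
    """Project production throughput by assuming perfectly linear scaling.
    Real systems lose efficiency to shuffle, network and coordination overhead,
    so this is an optimistic upper bound, not a guarantee."""
    return test_throughput * (prod_nodes / test_nodes)

if __name__ == "__main__":
    projected_per_hour = naive_projection(TEST_NODES, TEST_RECORDS_PER_HOUR, PROD_NODES)
    hours_needed = PROD_DAILY_VOLUME / projected_per_hour
    print(f"Projected throughput: {projected_per_hour:,.0f} records/hour")
    print(f"Estimated time for the daily batch: {hours_needed:.1f} hours")
    # At 20 hours against a 24-hour window, the optimistic assumption leaves
    # little headroom -- a hint that the test environment may be too small to trust.
```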

Careful architectural design can help eliminate performance issues such as imbalanced input splits or redundant shuffle-and-sort phases, but of course, this alone doesn’t guarantee a system which performs well. According to an Infosys Labs briefing, “performance testing is conducted by setting up [a] huge volume of data and an infrastructure similar to production” (70), which aids in identifying bottlenecks. Failover testing, meanwhile, targets the threats posed by node failure and non-functioning system components. It involves performing various validations, such as verifying data recovery when a data node fails and verifying recovery of edit logs and node file names, among others. This area deserves attention because it is paramount for a big data implementation that data processing continues seamlessly when work is switched to other data nodes. Granted, the chance of any single piece of data being corrupted is fairly small; at this scale, however, the chance that some data will eventually be corrupted is fairly high, and failover testing is a pertinent way of dealing with these issues before they happen.
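
A failover check along those lines might be scripted roughly as follows. The HDFS path, node name, timeout, and the `stop_datanode` helper are placeholders invented for the sketch; `hdfs fsck` is a standard HDFS utility, but the exact orchestration (and the output it prints) will depend on the cluster and Hadoop version.

```python
import subprocess
import time

# Placeholder values for the sketch.
HDFS_PATH = "/data/ingest"      # path whose blocks should survive a node failure
FAILED_NODE = "datanode-03"     # the node the test deliberately takes down
RECOVERY_TIMEOUT_SECONDS = 600

def hdfs_reports_healthy(path):
    """Shell out to 'hdfs fsck' and look for the HEALTHY verdict in its summary.
    (The exact wording can vary between Hadoop versions.)"""
    result = subprocess.run(["hdfs", "fsck", path], capture_output=True, text=True)
    return "HEALTHY" in result.stdout

def stop_datanode(node):
    """Placeholder: a real test would stop the DataNode process on the target
    host, for example over SSH or through the cluster manager's API."""
    raise NotImplementedError(f"hook this up to your cluster to stop {node}")

def test_failover():
    assert hdfs_reports_healthy(HDFS_PATH), "cluster unhealthy before the test"
    stop_datanode(FAILED_NODE)

    # Give HDFS time to re-replicate the lost blocks onto the remaining nodes.
    deadline = time.time() + RECOVERY_TIMEOUT_SECONDS
    while time.time() < deadline:
        if hdfs_reports_healthy(HDFS_PATH):
            return  # data recovered; processing can continue on the other nodes
        time.sleep(30)
    raise AssertionError(f"data not fully recovered within {RECOVERY_TIMEOUT_SECONDS}s")
```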

Whether functional or non-functional, testing for big data clearly presents some new challenges while amplifying the testing concerns inherent to databases of any size. Because these challenges will likely become increasingly common across the IT industry, IT professionals may want to familiarize themselves with them sooner rather than later. That goes double for testers like us; with so many industries branching into big data, it’s very possible we’ll eventually get the opportunity to test these systems ourselves.

Heads up, UK readers! If you’re interested in big data and the unique challenges it presents (which, as I’ve just said, you should be!), QualiTest UK is going to be hosting a Star Testing event based around this topic in September. Please take a look here for more information on attending.

This post is part of an ongoing series about quality control, quality assurance, and software testing. More on the topic:

The Internet of Things and Software Testing
Quality Control and Android 4.4 “Kit Kat”
Software Testing and HealthCare.Gov