August 18, 2019

The Morning Paper - Testing

As I did earlier with virtualization, I’m reviewing paper summaries that Adrian Colyer wrote on his blog.

This time I’m reviewing his summaries of papers about software testing!

Here are a handful:

QSYM: a practical concolic execution engine tailored for hybrid fuzzing (2018)

I’m vaguely familiar with techniques for generating software tests automatically.

Apparently there are two techniques:

  • “fuzzing”
  • “concolic execution”

Fuzzing refers to messing around with the inputs to your unit under test to try and get it to break.

I think that this approach is very intuitive and obvious.

Concolic execution “uses symbolic execution to uncover constraints and pass them to a solver”. As I understand it, the name blends “concrete” and “symbolic”: the unit under test is run on concrete inputs while the branch conditions it hits are tracked symbolically, and the solver then produces new inputs that reach other branches, so that you can get more comprehensive scenario coverage.

Fuzzing fails to get 100% scenario coverage because it relies on random sampling to uncover all the conditional branches that your unit under test might have.
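
To make that limitation concrete, here is a minimal sketch of a random fuzzer in Python. Everything in it (the `parse_header` unit, the `MAGIC` value, the `fuzz` helper) is hypothetical and chosen only for illustration; it is not how QSYM or any real fuzzer is implemented.

```python
import random

MAGIC = 0xDEADBEEF  # hypothetical value guarding a hard-to-reach branch


def parse_header(value: int) -> str:
    """Hypothetical unit under test with one wide bug and one narrow bug."""
    if value % 7 == 0:
        # Wide bug: roughly 1 in 7 random inputs lands here.
        raise ValueError("unsupported alignment")
    if value == MAGIC:
        # Narrow bug: exactly 1 input out of 2**32 lands here.
        raise RuntimeError("corrupted header")
    return "ok"


def fuzz(unit, trials: int, seed: int = 0) -> dict:
    """Feed random 32-bit integers to `unit`; group crashing inputs by exception type."""
    rng = random.Random(seed)
    crashes = {}
    for _ in range(trials):
        value = rng.randrange(2**32)
        try:
            unit(value)
        except Exception as exc:
            crashes.setdefault(type(exc).__name__, []).append(value)
    return crashes


crashes = fuzz(parse_header, trials=10_000)
print({name: len(inputs) for name, inputs in crashes.items()})
```

With a budget of 10,000 trials the wide bug shows up almost immediately, while the narrow one has roughly a 1 in 430,000 chance of being hit at all. A constraint solver, by contrast, can take the `value == MAGIC` condition and produce that input directly.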

Concolic execution’s main drawback is that it is slow, because the number of paths that it has to enumerate increases exponentially with every if-else block in your unit-under-test.
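
A quick way to see the exponential growth, assuming the simplest case of independent if-else blocks executed one after another: each block doubles the number of paths. The `enumerate_paths` helper below is a hypothetical illustration, not part of any concolic engine.

```python
from itertools import product


def enumerate_paths(num_branches: int) -> list[tuple[bool, ...]]:
    """List every taken/not-taken combination for independent if-else blocks."""
    return list(product((True, False), repeat=num_branches))


for n in (1, 2, 3, 10):
    print(n, "if-else blocks ->", len(enumerate_paths(n)), "paths")
```

At 30 such blocks there are already more than a billion paths, each of which may need its own solver query.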

So this paper identifies a hybrid approach (both fuzzing + concolic execution) to test generation that is faster than existing published approaches.

The hybrid approach itself is not novel, but QSYM uncovered some previously untested edge cases in some popular open-source libraries.

I’m not tremendously interested in this, but these are interesting tools that impact every human being’s quality of life (since everyone is impacted by the quality of open-source software).

These techniques are also interesting to me because I think they will “trickle down” to higher-level languages like JavaScript, etc. at some point.

Interesting? ✅✅✅☑️ ☑️ ☑️ ☑️ ☑️ ☑️ ☑️

A dissection of the test-driven development process: does it really matter to test-first or test-last? (2017)

I always find discussions of TDD interesting.

As with most “empirical software engineering” studies, this study has a small sample size (39 practitioners).

…the results show that the most important thing is to work in short uniform cycles with each cycle introducing a small new piece of functionality and its associated tests. The order within the cycle – i.e., test-first or test-last didn’t really seem to matter.

Maybe this is a case of confirmation bias or a case of “you hear what you want to hear”, but this is my anecdotal experience as well.

This experiment design sounds like rubbish, but it’s always encouraging to see people attempting to get more empirical data on effective software engineering.

Interesting? ✅✅✅✅✅✅✅☑️ ☑️ ☑️

Simple testing can prevent most critical failures (2014)

The paper’s subtitle is: “an analysis of production failures in distributed data-intensive systems”

The authors evaluated five distributed data-intensive systems:

  • Cassandra
  • HBase
  • HDFS
  • MapReduce
  • Redis

They reviewed 73 user-reported failures of these datastores, reproducing and chronicling them.

Almost all catastrophic failures (48 in total – 92%) are the result of incorrect handling of non-fatal errors explicitly signalled in software.

I’ve seen this at every job I’ve held in my career:

Using try/catch, logging the error, and then swallowing it.
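
For anyone who hasn’t run into it, here is a minimal Python sketch of the pattern and of one way to avoid it. The `ReplicaUnavailable` error and the `save_*` functions are hypothetical names made up for this example; they do not come from the paper or from any of the systems it studied.

```python
import logging

logger = logging.getLogger("storage")


class ReplicaUnavailable(Exception):
    """Hypothetical non-fatal error signalled by a lower layer."""


def write_to_replica(record: str, replica_up: bool) -> None:
    if not replica_up:
        raise ReplicaUnavailable(f"could not replicate {record!r}")


def save_swallowing(record: str, replica_up: bool) -> bool:
    """Anti-pattern: log the error, then carry on as if the write succeeded."""
    try:
        write_to_replica(record, replica_up)
    except ReplicaUnavailable as exc:
        logger.error("replication failed: %s", exc)
    return True  # the caller is told the record is safe even when it is not


def save_propagating(record: str, replica_up: bool) -> bool:
    """Log the error, then re-raise so the caller can retry or abort."""
    try:
        write_to_replica(record, replica_up)
    except ReplicaUnavailable as exc:
        logger.error("replication failed: %s", exc)
        raise
    return True
```

The two functions differ only by the `raise`, which is exactly why the swallowing version is so easy to write and so easy to miss in review.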

So the authors of this paper made a tool to identify cases where this is done, and concluded that “33% of the Cassandra, HBase, HDFS, and MapReduce’s catastrophic failures we studied could have been prevented”.

Also interesting tidbits:

A majority of the production failures (77%) can be reproduced by a unit test.

For a majority (84%) of failures, all of their triggering events are logged. This suggests that it is possible to deterministically replay the majority of failures based on the existing log messages alone.

Interesting? ✅✅✅✅✅✅✅✅✅☑️

Why do record/replay tests of web applications break? (2016)

UI-driven tests are very brittle. It seems obvious why these tests would break, as UIs change very frequently.

Humorously, the authors of this paper couldn’t find suitable open-source examples because few open-source projects have record/replay style tests.

So they made their own 🙄

What you really need to know is your tests are breaking because the information used to locate page elements keeps breaking.
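
Here is a small sketch of what a broken locator looks like, using only Python’s standard-library `xml.etree.ElementTree`. The two page snippets and both locators are hypothetical; the positional one stands in for the kind of path a recorder might capture, and is not taken from the paper or from any particular tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before and after a redesign that adds a banner <div>.
PAGE_V1 = "<html><body><div><button id='submit'>Go</button></div></body></html>"
PAGE_V2 = (
    "<html><body><div class='banner'>Sale!</div>"
    "<div><button id='submit'>Go</button></div></body></html>"
)

POSITIONAL_LOCATOR = "./body/div[1]/button"  # depends on the page layout
ID_LOCATOR = ".//*[@id='submit']"            # depends on a stable attribute


def locate(page: str, locator: str):
    """Return the first element matching `locator`, or None if it no longer resolves."""
    return ET.fromstring(page).find(locator)


for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name,
          "positional found:", locate(page, POSITIONAL_LOCATOR) is not None,
          "| id found:", locate(page, ID_LOCATOR) is not None)
```

The button itself never changed between the two versions; only the position-based way of finding it stopped working.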

Interesting? ✅☑️ ☑️ ☑️ ☑️ ☑️ ☑️ ☑️ ☑️ ☑️

The Art of Testing Less Without Sacrificing Quality (2015)

The paper’s authors made a tool for evaluating the value of running an automated software test based on:

  • the time cost (i.e. how much time it takes to run)
  • the value of preventing bugs that are reported later on

The authors work at Microsoft, so they were able to perform this research on legit data (Microsoft’s Office, Windows, and Dynamics products).

At the core of the model are estimates of the probability that a given test execution will find a genuine defect (true positive), and that it will raise a false alarm (false positive).

Both probability measurements consider the entire history from the beginning of monitoring until the moment the test is about to be executed. Consequently, probability measures get more stable and more reliable the more historic information we gathered for the corresponding test.

The authors’ model compares the estimated cost of running a test (time + test infrastructure) vs. the estimated cost of skipping the test (probability of a future defect multiplied by the cost to fix the defect in the future) to decide whether to run a test at all.
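
The comparison can be sketched in a few lines of Python. This is my own simplified reading, not the paper’s actual formulas: the `ExecutionHistory` counters, the parameter names, and the choice to charge false alarms as an expected triage cost on the “run” side are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExecutionHistory:
    """Hypothetical per-test counters gathered since monitoring began."""
    runs: int
    defects_found: int  # true positives
    false_alarms: int   # false positives


def should_run(history: ExecutionHistory, machine_cost: float,
               triage_cost: float, escaped_defect_cost: float) -> bool:
    """Run the test only if skipping it is expected to cost more than running it."""
    if history.runs == 0:
        return True  # no history yet, so there is no evidence for skipping
    p_defect = history.defects_found / history.runs
    p_false_alarm = history.false_alarms / history.runs
    cost_of_running = machine_cost + p_false_alarm * triage_cost
    cost_of_skipping = p_defect * escaped_defect_cost
    return cost_of_skipping > cost_of_running


noisy = ExecutionHistory(runs=1000, defects_found=1, false_alarms=50)
useful = ExecutionHistory(runs=100, defects_found=10, false_alarms=0)
print(should_run(noisy, machine_cost=2.0, triage_cost=10.0, escaped_defect_cost=500.0))
print(should_run(useful, machine_cost=2.0, triage_cost=10.0, escaped_defect_cost=500.0))
```

With these made-up numbers the noisy test costs 2.5 to run against an expected 0.5 for skipping it, so it is skipped, while the useful test costs 2.0 to run against an expected 50.0 for skipping it, so it runs.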

This model grossly simplifies the cost of maintaining test infrastructure (or hiring software engineers who know how to effectively test), and also grossly underestimates the cost of a defect making it to customers’ hands.

The most interesting thing about this paper to me is that it analyzed existing test results and support ticket data, so it operated entirely on historical data.

If this research required instrumenting Microsoft’s build servers or test runners in some way, this paper would not have been produced (as it would have been too expensive).

The ultimate goal of adopting this kind of test-skipping criterion is to save time in the product development process and deliver product to customers faster without losing test coverage.

Curious to see if this finds adoption in industry (I have not seen anything like it yet, besides tests being ignored because they are too slow).

Interesting? ✅✅✅✅✅✅✅✅✅✅

Coverage and its Discontents (2014)

Most experienced testers can immediately answer that measuring code coverage is not a completely adequate replacement for measuring fault detection.

Interesting? ✅☑️ ☑️ ☑️ ☑️ ☑️ ☑️ ☑️ ☑️ ☑️

Coverage is not strongly correlated with test suite effectiveness (2014)

Testing is an important part of producing high quality software, but its effectiveness depends on the quality of the test suite: some suites are better at detecting faults than others.

This is a very interesting topic to me.

When you evaluate the $$$ value of a software product, the test suite contributes some non-zero value.

But how do you evaluate the value of a test suite?

The paper’s conclusion, specifically with regard to a test suite’s line coverage:

While coverage measures are useful for identifying under-tested parts of a program, and low coverage may indicate that a test suite is inadequate, high coverage does not indicate that a test suite is effective.
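
A tiny Python illustration of how that can happen. The `is_adult` function and its assumed spec (“adult means 18 or older”) are hypothetical and are not taken from the paper.

```python
def is_adult(age: int) -> bool:
    """Hypothetical unit under test; the assumed spec is 'adult means 18 or older'."""
    return age > 18  # bug: should be >= 18


def weak_suite() -> bool:
    """Executes every line of is_adult (100% line coverage) but never probes the boundary."""
    return is_adult(30) is True and is_adult(5) is False


def boundary_check() -> bool:
    """The single check that would expose the bug."""
    return is_adult(18) is True


print("weak suite passes:", weak_suite())
print("boundary check passes:", boundary_check())
```

The weak suite reaches full line coverage and passes, yet the off-by-one bug is only caught by a check that coverage never asked for.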

Not a tremendously helpful conclusion; software engineers will continue to pursue this holy grail.

Interesting? ✅✅✅✅✅☑️ ☑️ ☑️ ☑️ ☑️