How not to test, part 1

or Complete coverage testing
or More is Better testing

The setup

For the sake of this post, let’s say I’ve got a Python package that needs testing.

  • It’s written completely in Python
  • It has a specification fully describing the API
  • The specification is so complete that it also covers behaviors and side effects
  • The API is the only interface this program exposes
  • It was written by us
  • It was written recently
  • It only uses base Python

Therefore:

  • no GUI
  • no API other than the well-defined, specified one
  • we have full access to the source
  • we remember all of the decisions for all of the code
  • no third-party modules that might be flaky

This is the ideal program to both write and test.
You can tell this is made up, because I have never had all of the above be true.
In reality, many of those conditions won’t hold. But let’s go with it for now and see where it takes us.

Test Strategy: Test everything

I want this package to be rock solid. I don’t want any defect reports coming back from customers.
So why not just go ahead and plan on testing everything?

Functionality:

  • Every function or method in the API should have at least one test verifying its correct behavior, and at least one test for each possible error condition when faulty input is passed in (see the sketch after this list).
  • Every feature or behavior in the specification should have at least one test that verifies the package meets that feature or behavior.
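
Concretely, a test pair under that first rule might look something like this sketch (unittest style, since we’re sticking to base Python; ‘mypackage’ and its ‘parse_config()’ function are made-up stand-ins, not part of any real spec):

    import unittest

    from mypackage import api  # hypothetical package under test


    class TestParseConfig(unittest.TestCase):
        def test_valid_input(self):
            # One test verifying correct behavior on well-formed input.
            self.assertEqual(api.parse_config("key = value"), {"key": "value"})

        def test_faulty_input(self):
            # One test per specified error condition; here, a non-string input.
            with self.assertRaises(TypeError):
                api.parse_config(None)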

Independence of components:

  • Every package/function/class/method/module should be tested in isolation, with dependencies mocked (a sketch follows).
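
A sketch of what “isolated with dependencies mocked” could mean in practice, using unittest.mock from the standard library (‘report’ and its ‘fetch_data’ dependency are again made-up names):

    import unittest
    from unittest import mock

    from mypackage import report  # hypothetical module under test


    class TestBuildSummaryIsolated(unittest.TestCase):
        def test_in_isolation(self):
            # Patch out the dependency so only build_summary() is exercised.
            with mock.patch.object(report, "fetch_data", return_value=[1, 2, 3]):
                self.assertEqual(report.build_summary(), "3 records")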

Integration:

  • Every package/function/class/method/module should also be tested with its dependencies NOT mocked (the counterpart sketch follows).
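
And the integration counterpart of that same sketch, with nothing patched out:

    import unittest

    from mypackage import report  # hypothetical module under test


    class TestBuildSummaryIntegrated(unittest.TestCase):
        def test_with_real_dependency(self):
            # No mocks: build_summary() runs against the real fetch_data().
            self.assertTrue(report.build_summary().endswith("records"))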

The plan is to just test the begeezers out of the package.
Some testing is good.
Full and complete testing must be better. Right? Wrong.

Why this is bad

In my mind there are quite a few problems with this kind of testing.

It takes too long

Surely my time is better spent developing the next super cool thing.

It’s overkill

I’d argue that it’s ok if some of the mid- to low-level functions don’t really satisfy their full internal interface.
As long as these deficiencies don’t result in errors at the API level, the users will never get bitten by little bugs in helper functions.

For every internal function in your software, the set of possible valid inputs to the function is larger than the set of inputs your software will actually pass to it.
This is obvious but important.

Let’s say your software sometimes uses the ‘pow()’ function to compute cubic volume.
The interface of ‘pow’ is ‘pow(x,y[,z])’.
Your software only ever sets y to 3, and never fills in z.
If you feel you need to test ‘pow()’, you don’t need to bother with any z input, and y = 3 is the only exponent worth testing.
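
In test form, that observation collapses the whole ‘pow()’ test matrix to something like this (with a hypothetical cubic_volume() wrapper standing in for my software’s one call site):

    import unittest


    def cubic_volume(side):
        # The only way this software ever calls pow(): y is always 3, z unset.
        return pow(side, 3)


    class TestCubicVolume(unittest.TestCase):
        def test_only_the_pattern_we_use(self):
            # pow(x, y, z) never occurs in this software, so it gets no tests.
            self.assertEqual(cubic_volume(2), 8)
            self.assertEqual(cubic_volume(0.5), 0.125)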

Let’s look at another example.
Let’s say I’ve got an internal function that takes a string and removes html tags from it.
Now, to fully test it in isolation I should probably at least do the following (a few of these are sketched after the list):

  • Test it with passing in ‘None’
  • Test an empty string
  • Test a one character string
  • Test something big, like a 2 MByte string.
  • Test it with unicode strings
  • Test xml tags or other non-html tags
  • Test multiple levels of tags
  • Test non-matching tags: <h1> with no </h1>, etc.
  • Test strings with random newlines inserted
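
Just the first few items of that list, sketched out (the tag_strip() import is hypothetical, and the expected behaviors are assumptions about what a full spec would demand):

    import unittest

    from mypackage import tag_strip  # hypothetical internal function


    class TestTagStrip(unittest.TestCase):
        def test_none_input(self):
            # Assumption: non-strings are rejected rather than coerced.
            with self.assertRaises(TypeError):
                tag_strip(None)

        def test_empty_string(self):
            self.assertEqual(tag_strip(""), "")

        def test_non_matching_tag(self):
            # Assumption: a dangling tag is still stripped.
            self.assertEqual(tag_strip("<h1>Hi there"), "Hi there")

        def test_large_string(self):
            # A 2 MByte payload wrapped in a single tag pair.
            payload = "x" * (2 * 1024 * 1024)
            self.assertEqual(tag_strip("<p>" + payload + "</p>"), payload)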

I can see right now that I’ll probably spend more time writing tests for ‘tag_strip()’ than the time it will take to actually write the function. However, if ‘tag_strip()’ is part of my API, it’s time well spent.
But ‘tag_strip()’ isn’t part of the API. It’s an internal function.
And upon inspection of the final software, I might notice that my software only ever strips html tags from strings it generates.
And I only ever use it for titles.

Real interface to tag_strip():

  • tag_strip('<h1>Hi there</h1>') -> ‘Hi there’
  • tag_strip('<h2>Foo</h2>') -> ‘Foo’
  • tag_strip('<h3></h3>') -> ‘’

So:

  • I never pass in None
  • I never pass the empty string
  • I never pass in one character strings
  • I never pass in large strings
  • I never pass in unicode
  • I never pass in xml tags or other non-html tags
  • I never pass in multi-line strings

In the end, I may have a very robust ‘tag_strip()’ function.
But it’s way over designed.
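
For contrast, a version sized to the real interface could be this small. This is only a sketch, and the regular-expression approach is my own assumption; the point is how little is actually required:

    import re

    # Matches only the opening/closing heading tags this software generates.
    _HEADING_TAG = re.compile(r"</?h[123]>")


    def tag_strip(text):
        # Strips exactly the tags our own code emits; nothing more is needed.
        return _HEADING_TAG.sub("", text)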

It’s inflexible

If absolutely every component of your software has a test harness around it, then any redesign of the code, any change of the code at all, will probably make many of your tests fail. You will have to go back and examine whether the test code is wrong, or your new code is wrong. And you’ll need to write new tests to accompany the new code.

Try it sometime. This is a crippling way to write software. And it’s not fun.

Many important design and implementation changes will be skipped due to this inflexibility, and the overhead attached to any and all changes.

It’s an illusion of complete testing

There are so many tests in the system that many people will assume the test coverage is complete.
However, complete coverage is simply not possible.

Complete coverage of every possible input AND combinations of input AND combinations of the order of method calls is JUST NOT POSSIBLE.
Unless you are providing an API with one function that takes no input.
Every function you add and every input parameter exponentially expands the combinations of inputs possible to your system.
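
A quick back-of-the-envelope illustration, with every number made up purely for the arithmetic:

    # Ten API functions, each with five parameters and ten interesting
    # values per parameter, exercised in call sequences of length four.
    inputs_per_call = 10 ** 5              # 100,000 inputs per single call
    call_choices = 10 * inputs_per_call    # function choice times its inputs
    sequences = call_choices ** 4          # four-call sequences: 10**24
    print(f"{sequences:.1e}")              # prints 1.0e+24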

You can’t have complete exhaustive tests of your system.
If you thought you could, get over it.
It’s not possible.

What should we do?

All hope is not lost. And testing is a wonderful thing.
Don’t throw in the towel yet.

Before I give my opinion on how testing should be done, I’m going to cover at least a couple more approaches that I feel sound reasonable at the start, but have serious problems.
And yes, I can only present my opinion.
I don’t think there is one right way to test.

However, I think it is useful to look at some of the ways that are seriously wrong.

Comments

  1. Anonymous says

    Well, my 2 cents…

    IMHO, you are focusing too much on general testing elements instead of on what is supposed to worry others about over-testing.

    – It takes too long: Yes, and it is called testing. Usually worth it (the time).

    – It’s overkill: It depends on how neurotic you get testing things. Finding the equilibrium is one of the most valuable skills in a test writer.

    – It’s inflexible: Well, that is one way to look at it. That inflexibility is the underlying mechanism preventing you from breaking your own code base. That is not a “con”, but a defining characteristic of testing.

    – It’s an illusion of complete testing: Related to the second point. It also depends on how prone you are to thinking that testing is infallible. I do not know many good testers who believe in “perfect testing”. Again, a matter of nature and not a “con”.

    You attack the points of general testing. You paint basic procedural stuff as “seriously wrong”. I think you should take away the general testing elements and focus on the cons of the “full-coverage” strategy. Then you can build another set of softer arguments:

    – Full coverage takes >>too much<< time.
    – Full coverage is not as full as it looks.
    – Full coverage overpopulates your testing code base.

    Then you can defend a more conservative testing approach without discouraging testing as a practice. Because that is the "feeling" I get from the current article.

    Anyway, like I said, it is my opinion.

    • says

      Very well stated.
      Thank you for taking the time to post this comment.

      I by no means want to discourage testing.
      It’s kinda the point of this site to encourage more testing.
      I want to encourage sane testing practices.

      I did want to present full complete testing as an extreme. I don’t think anyone really attempts this.
      I wanted to present an extreme version and explain a bit why it’s impractical.

      However, I think I may have presented the topic a bit too aggressively.

  2. Adam Skutt says

    Well, there’s a reasonable conclusion here but the premise and logic are extremely faulty.

    For starters, essentially no one has a complete specification, or even close to it. Especially including behaviors and side effects, especially in a language such as Python. Python’s runtime specification is sadly incomplete w.r.t. side effects in particular, so you just can’t do that. Plus, base Python does include a GUI.

    Testing with mocks doesn’t give you any notion of “independence” and is entirely pointless if you’re going to do exhaustive, comprehensive “integration” testing. You’re independent iff you don’t need the external component or the mock. Most software isn’t so loosely coupled, and that’s honestly really just fine.

    Your logic about internal functions not needing comprehensive testing is obviously faulty. The distinction between “API exposed” and “internal” is entirely irrelevant, as the only way to prove an “internal” defect is not exposable through the API is to write a formal proof or to test it. If you can do the former you have no need for testing in the first place; if you’re doing the latter, you end up writing comprehensive tests and removing the defects in your “internal” functions anyway.

    Also, it’s unusual for anyone to suggest that one needs to test 3rd party software used to construct your own application or library, so I don’t understand why I would be testing pow() in the first place.

    Likewise, this points out the problem in your comprehensive testing of the hypothetical tag_strip() function. I don’t need to test with different-sized strings (unless I’m concerned about performance), as that’s a behavior handled entirely by the runtime for me. I don’t see how or why it should care about newlines, or honestly most of the things in that list. Plainly, I would never consider writing the majority of the tests in your list. Most of the tests you describe are tests of the str class and not tests of the tag_strip() method in the least. If this is what you think “comprehensive” testing would or should entail, then you really need to go back to basics.

    Most importantly, comprehensive testing is something that’s understood as intuitively absurd in just about every other technical discipline. It’s grossly too time consuming for the construction of any hardware or other physical component, merely because you would have to repeat it for every component you manufacture! As such, I personally consider that sufficient reason alone to dismiss anyone advocating such detailed testing of even software outright. We don’t do it anywhere else, and there’s nothing about software to suggest such testing is necessary.

    Such testing, of hardware or software, occurs in the rarest situations, such as avionics systems, some space-rated gear, and some life-safety systems.

  3. says

    Hello, and thank you for sharing your feelings on the subject. I would like to submit that the examples you have shown to be ‘too expensive/long to test’ are code smells, and expose a bad design. I would like to present this as a good thing, and one that we should embrace.

    – “Let’s say your software sometimes uses the ‘pow()’ function to compute cubic volume. … Your software only ever sets y to 3, and never fills in z.”
    Here, if you find yourself needing to test built-in functions or library code written/tested/maintained by another team, you are not providing value to *your* project, and are only helping to flesh out the test suite of that product. Here your time is better spent writing a test case for your code validating that you expect to only ever use the pow(x,y) flavor. You should then write a test for the error state when your function is used in a way that would necessitate the use of pow(x,y[,z]). (Or better yet, write your function in such a way that there *can’t* be a state that would necessitate the use of that form.)

    – “Let’s say I’ve got an internal function that takes a string and removes html tags from it.” Wow, that *is* quite the large function, and it would necessitate a large set of tests, but then you go on to say: “And upon inspection of the final software, I might notice that my software only ever strips html tags from strings it generates. And I only ever use it for titles.”
    So here we have a case where you don’t want to build a function that takes a string and removes the html tags; you want a function that removes the *very specific* set of heading tags. This is a great case where test case explosion is a ‘smell’ that indicates ‘over engineering’. Your article also points this out: “But it’s way over designed”. Exactly! I agree, so don’t blame testing for being difficult in this case; *thank* testing for showing you an area for code simplification.

    To your point that it’s inflexible: you didn’t submit any examples, but I would still like to share a thought. Having tests that break when you change code is a good thing. I fear that some developers have been stuck in terrible situations where they are criticized or punished for every test failure that gets kicked back. This can breed a fear of test failure, and you are correct, it is not fun being afraid of your code base. I would like to challenge anyone in this situation to embrace test failure as a necessary step to functionality expansion. An unforeseen broken test when you add new functionality shows, in beautiful detail, the hidden dependencies that exist in your code base, and should be looked upon as an area to re-factor. A foreseen broken test shows that you have a good understanding of the workings of your code base, and should embolden you to make ‘fearless changes’.

    The ‘right-sizing’ of your application in this case can be achieved from the onset (not as a side effect of after-the-fact testing) if you flip the scenario on its head. Instead of having a spec that is “so complete that it covers behaviors and side effects”, first look at writing test cases that define the actual behavior needed in your application. The TDD camp sometimes comes off as “just do it, I know better”, but here we have a concrete example where your design could have benefited from TDD. For example: with TDD you would initially write the test cases for your ‘cubic volume’ function for the way your application will actually use it. You will notice then that you never write a test case driving your use of the 3-parameter pow() function, and can quickly cover that error case with a small number of tests (1 or 2, most likely).

    Thank you again for this blog post, and I hope to see a continuation of the lively discussion here.

  4. Adam Skutt says

    “Here your time is better spent writing a test case for your code validating that you expect to only ever use the pow(x,y) flavor. You should then write a test for the error state when your function is used in a way that would necessitate the use of pow(x,y[,z]). (Or better yet, write your function in such a way that there *can’t* be a state that would necessitate the use of that form.)”

    I would love to see such a test in the first place (never mind its correct application), but moreover, I’d love to see an actual rationale for how that test is detecting or preventing any software defects. The two forms compute different mathematical operations, and there’s no way to readily reason that just because the software performs x ^ y everywhere else, this instance of x ^ y % z is therefore incorrect.

    “So here we have a case where you don’t want to build a function that takes a string and removes the html tags; you want a function that removes the *very specific* set of heading tags. This is a great case where test case explosion is a ‘smell’ that indicates ‘over engineering’.”

    Writing a function to strip all HTML tags, which have a consistent syntax and grammar, is not over engineering in the least, even if you only actually strip a small subset of those tags. In fact, I would argue that any correct function must be trivially capable of removing all tags, or it is broken.

    Plainly, I don’t see how limiting the function based on tag name simplifies its job at all, despite your claim that it would do so. Recognizing the content between the tags is not the hard part of safely and correctly stripping HTML tags.

  5. says

    Hey, I 95% agree. I think you can get more bang for your buck if your tests are well-factored, and if you continue to keep the test suite factored as it grows. But your main point is exactly right: test code is a liability. Minimize it.
