Functional Testing, BDD and domain-specific languages

I love Test Driven Development (TDD). If you look back through the posts on this blog that soon becomes apparent. I’m pretty comfortable with using TDD techniques at all levels of a solution, from the tiniest code snippet to multiply-redundant collaborating systems. Of course, the difficulty of actually coding the tests in a test-driven design can vary widely based on the context and complexity of the interaction being tested, and everyone has a threshold at which they decide that the effort is not worth it.

The big difference between testing small code snippets and large systems is the combinatorial growth of test cases. To fully test at a higher (or perhaps more appropriately, outer) layer would require an effectively infinite number of tests. By a happy coincidence, though, if you have good test coverage at lower/inner layers of the code, you can rely on them to do their job, so you only need to test the additional behaviour at each layer. Even with this simplifying assumption the problem does not fully go away. Very often the nature of outer layers implies a broader range of inputs and outputs, and a greater dependence on internal state from previous actions. It can still seem to need a ridiculous number of tests to properly cover a complex application. And worst of all, these tests are often boring. Slogging through acres of largely similar sequences, differing only in small details, can be a real turn-off for developers, so once again, it can feel that the effort is not worth it.

If fully covering the outer layers of a system, with tests for every combination of data, every sequence of actions, and every broken configuration is prohibitively expensive, time-consuming and downright boring, then it makes sense to be smart about it, and prioritise those tests which have the highest value. Value in this sense is a loose term encompassing cost of failures, importance of features, and scale of use. A scenario which has a high business value if successful and a high cost of failure, and which is used often by large numbers of people, would be an obvious choice for an outer layer test. Something of little value which nobody cares about and which is hardly ever used would be much further down the list.

And this leads neatly to the concept of automated “functional testing”. Functional tests are outer layer tests which exercise these important, valuable interactions with the system as a whole. Arguably there is a qualitative difference between unit tests of internal components and functional tests of outer layers. Internal components have an internal function and relate mostly to other internal components, and their interactions are designed by developers for code purposes. This makes it relatively easy for developers to decide what to test, so specifying tests in similar language to the implementation code is a straightforward and effective way to get the tests written. Outer layers and whole applications have an external function, and business drivers for what they do and how they do it. This can be less amenable to describing tests in the language used by programmers. Add to this the potentially tedious nature of these external, functional tests and it’s easy to see why they sometimes get overlooked, despite their potentially high business value.

Many attempts have been made over the years to come up with ways to get users and business experts involved in writing and maintaining external functional test cases. My friend, some-time colleague and agile expert Steve Cresswell has recently blogged about a comparison of Behaviour-Driven Development (BDD) tools. It’s an interesting article, but I can’t help thinking that there is another important dimension to these tools which also needs to be unpicked.

Along this dimension, testing tools range from “raw program code” (with no assistance from a framework); through “library-style” frameworks which just add some extra features to a programming language using its existing extension mechanisms; “internal DSL” (Domain Specific Language) frameworks which use language meta-programming features to re-work the language into something more suitable for expressing business test cases; “external languages” which are parsed (and typically interpreted, but compilation is also an option) by some specialist software; through to “non-textual” tools where the test specification is stored and managed in another way, for example by recording activity, or entering details in a spreadsheet.

In my experience, most “TDD” tools (JUnit and the like) sit comfortably in the “library-style” group. Test case specification is done in a general-purpose programming language, with additions to help with common test activities such as assertions and collecting test statistics. Likewise, most “BDD” tools (Cucumber and the like) are largely in the “external language” group. This is slightly complicated by the need to drop “down” to a general-purpose language for the details of particular actions and assertions.

Aside from the “raw program code” option which, by definition, has no frameworks, the great bulk of test tools occupy the “library style” and “external language” groups, with a small but significant number of tools in the “non-textual” group. I find it somewhat surprising how few there seem to be in the “internal DSL” category, especially given how popular internal DSL approaches are for other domains such as web applications. There are some, of course, (coulda, for example) which claim to be internal DSLs for BDD-style testing, but there is still a more subtle issue with these and with many external languages.

The biggest problem I have with most BDD test frameworks is related to the concept of “comments” in programming languages. Once upon a time, when I was new to programming, it was widely assumed that adding comments to code was very important. I recall several courses which stressed this so much that a significant portion of the mark was dependent on comments. Student (and by implication, junior developer) code would sometimes contain more comments than executable code, just to be on the safe side. Over the years it seems that the prevailing wisdom has changed. Sure, there are still books and courses which emphasise comments, but many experienced developers try to avoid the need for comments wherever possible, using techniques such as extracting blocks of code to named methods and grouping free-floating variables into semantically meaningful data structures, as well as brute-force approaches such as just deleting comments.

The move away from comments has developed gradually, as more and more programmers have found themselves working on code produced by the above-mentioned comment-happy processes. The more you do this, the more you realise that the natural churn and change of code has an unpleasant effect on comments. By their nature comments are not part of the executable code (I am specifically excluding “parsed comments” which function like hints to a compiler or interpreter here). This in turn means that comments cannot be tested, and thus cannot automatically be protected against errors and regressions. Add to this the common pressure to get the code working as quickly and efficiently as possible, and you can see that comments often go unchanged even when the code being commented is radically different. This effect then snowballs – the more that comments get out of step with the code, the less value they provide, and so the less effort is made to update them. Soon enough most (or all!) comments are at best a useless waste of space, and at worst dangerously misleading.

What does this have to do with BDD languages? If you look in detail at examples of statements in many BDD languages, they have a general form something like: keyword "some literal string or regular expression". Typical keywords are things like “Given”, “When”, “Then”, “And”, and so on. For example: Given a payment server with support for Visa and Mastercard. This seems lovely and friendly, phrased in business terms. But let’s dig into how this is commonly implemented. Very likely, somewhere in the code will be a “step” definition.

An example from spinach, which is written in Ruby, might be:

  step 'a payment server with support for Visa and Mastercard' do
    @server = deploy_server(:payment, 8099)
  end

This also looks lovely, linking the business terminology to concrete program actions in a nice simple manner. However, this apparent simplicity hides the fact that the textual step name has no actual relationship to the concrete code. Sure, the text is used in at least two places, so it seems as if the system is on the case to prevent typos and accidental mis-edits, but it still says nothing about what the step code actually does. As an extreme example, suppose we changed the step code to be:

  step 'a payment server with support for Visa and Mastercard' do
    Dir.foreach('/') {|f| File.delete(f) if f != '.' && f != '..'}
  end

The test specifications in business language would still be sensible, but running the tests would potentially delete a whole heap of files!

For a less extreme example, imagine that the system has grown a broad BDD test suite, with many cases which use “Given a payment server with support for Visa and Mastercard”. Now we need to add support for PayPal. There are several options including:

  • copy the existing step to a new one with a different name and an extra line to install a PayPal module (including going through all the test code to decide which tests should use the old step and which should use the new one)
  • add the extra line to the existing step and modify the name to include PayPal then change all the references to the new name
  • or just add the extra module and leave the step name the same.

The last option happens more often than you might think. Just as with the comments, the BDD step name is not part of the test code, and is not itself tested, so there is nothing in the system to keep it in step with the implementation. And the further this goes, the less the business language of the test specifications can be trusted.
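
To make that drift concrete, here is a minimal sketch in plain Ruby. The `step` and `run_step` helpers and the `FakeServer` struct are hypothetical stand-ins that I have made up for illustration; they are not the spinach API or real deployment code. The point is only that the step name acts as a lookup key, so the body can grow a PayPal module without anything noticing that the name is now out of date.

```ruby
# Hypothetical miniature step registry; real frameworks differ in detail.
STEPS = {}

# Register a block under a literal step name.
def step(name, &block)
  STEPS[name] = block
end

# Look up a step by its exact text and run it.
def run_step(name)
  STEPS.fetch(name).call
end

# Hypothetical stand-in for a deployed payment server.
FakeServer = Struct.new(:modules)

# The name still promises Visa and Mastercard only...
step 'a payment server with support for Visa and Mastercard' do
  # ...but the body has quietly grown a third module.
  FakeServer.new([:visa, :mastercard, :paypal])
end

server = run_step('a payment server with support for Visa and Mastercard')
puts server.modules.inspect
```

Nothing in this arrangement can fail because of the mismatch: the registry only checks that the text in the specification equals the text in the step definition.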

I have worked on a few projects which used these BDD approaches, and all of them fell foul of this problem to some degree. Once this “rot” takes hold, it seems almost inevitable that the BDD specifications either become just another set of “unit” tests, only understood by developers who can dig into the steps to work out what is really happening, or they are progressively abandoned. It seems unlikely that many projects would be willing to take on the costs of going through a large test suite and checking a bunch of arbitrary text that’s not part of the deliverable, just to see if it makes sense.

I wonder if the main reason that this is not seen to be more of a problem is that so few projects have progressed with BDD through several iterations of the system, and are still using the approach.

So, is there any way out of this trap? For some cases, using regular expressions for step name matching can help. This can provide a form of parameter-passing to steps, increasing reuse for multiple purposes and reducing churn in the test specifications. This does not solve the overall problem, though, as it still has chunks of un-parsed literal text. For that we would need a more sweeping change.
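
As a rough sketch of what regular-expression matching buys, the plain Ruby below passes the captured part of a step name to the step as a parameter. The `step_matching` and `run_pattern_step` helpers are hypothetical names of my own, not any real framework's API, and the splitting rule for the card list is an assumption made for the example.

```ruby
# Hypothetical registry of [pattern, block] pairs; not a real framework API.
PATTERN_STEPS = []

# Register a block against a regular expression.
def step_matching(pattern, &block)
  PATTERN_STEPS << [pattern, block]
end

# Run the first step whose pattern matches, passing captures as arguments.
def run_pattern_step(text)
  PATTERN_STEPS.each do |pattern, block|
    match = pattern.match(text)
    return block.call(*match.captures) if match
  end
  raise ArgumentError, "no step matches: #{text}"
end

# The captured list becomes a real parameter, but the words around it
# ("a payment server with support for") are still un-parsed text.
step_matching(/\Aa payment server with support for (.+)\z/) do |list|
  list.split(/,\s*|\s+and\s+/).map { |name| name.downcase.to_sym }
end

p run_pattern_step('a payment server with support for Visa, Mastercard and PayPal')
```

The list of card types is now tied to what the step does, which is an improvement, but the rest of the sentence could still say anything at all.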

Which brings me back to internal and external DSLs. To my mind, the only way to address this issue in the longer term is to define the test specifications in a much more flexible language, one which suits both the domain being tested and the domain of testing itself. My aim would be to avoid the need for those comment-like chunks of un-parsed literal text, and align the programming language with the business language well enough that expressions in the business language make sense as programming constructs. If done well, this should allow test specifications to be changed at a business level and still be a valid and correct expression of the desired actions and assertions.
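
To give a flavour of the direction I mean, and certainly not a finished design, here is a small Ruby sketch in which the business sentence is made entirely of program constructs. Every name in it (`PaymentServer`, `payment_server`, `supporting`, `accepts?`) is invented for illustration.

```ruby
# Hypothetical domain vocabulary; every name here is invented for illustration.
class PaymentServer
  attr_reader :card_types

  def initialize
    @card_types = []
  end

  # Each supported type is a real value that the program acts on.
  def supporting(*types)
    @card_types.concat(types)
    self
  end

  def accepts?(type)
    @card_types.include?(type)
  end
end

# Entry point for the vocabulary, so a specification can start with a noun.
def payment_server
  PaymentServer.new
end

# Reads close to the business sentence, yet every word is executable:
# changing the list changes what is actually set up.
server = payment_server.supporting(:visa, :mastercard)

puts server.accepts?(:visa)    # true
puts server.accepts?(:paypal)  # false
```

Here adding PayPal means editing the specification itself, so the wording and the behaviour cannot drift apart in the way a step name can. Ruby's own syntax still shows through, though, which is part of why I think something more flexible is needed.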

Although some modern programming languages are relatively adept in metaprogramming and internal DSLs (Ruby is well-known for this), mostly they don’t go far enough. Arguably a better choice would be to base the DSL on one of the languages which have very little by way of their own syntax, and thus are more flexible in adapting to a domain. Languages such as LISP and FORTH have hardly any syntax, and have both been used for describing complex domains. It used to be said of FORTH that the task of programming in FORTH was mainly one of writing a language in which the solution is trivial to express.

I’m afraid I don’t have a final answer to this, though. I have no tool, language or framework to sell, just a hope that somebody, somewhere, is working on this kind of next-generation BDD approach.

For more information about low-syntax languages, you might want to read some of my articles on Raspberry Alpha Omega, such as What is a High Level Language and More Language Thoughts and Delimter-Free Languages.

Also, for Paul Marrington’s take on internal vs external DSLs for testing, see BDD – Use a DSL or natural language?.