Searching Literature for Technical Key Texts

A literature review is a key part of postgraduate research. To start with I’m attempting a broad literature search to try and find anything I can which sheds light on my topic area. In particular I’m trying to locate some “key texts” which align fairly closely with my planned research area, and could help inform my attempts at narrowing down my search. Despite the power of search engines, databases and indexes, this is not as easy as it might seem, particularly when the terminology of the domain is not always consistent.

For example, my topic area is templated generation of text. In my small corner of software development that seems quite specific and it’s easy, though naive, to assume that all that is required is typing this phrase into a search box to turn up everything I need. As anyone who has tried this will tell you, this is very rarely the case.
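For readers unfamiliar with the term, templated generation of text usually means combining fixed boilerplate with named placeholders that are filled in from data. The following is a minimal sketch only, using Python's standard string.Template as one illustration among many possible templating tools; the render helper, the template wording and the field names are all invented for this example.

```python
from string import Template


def render(template_text, values):
    """Fill the $placeholders in template_text from the values mapping."""
    # substitute() raises KeyError if a placeholder has no matching value,
    # which makes missing data obvious rather than silently ignored.
    return Template(template_text).substitute(values)


# Hypothetical boilerplate: the static text stays the same for every reader,
# while the $name and $item placeholders are replaced per record.
message = render(
    "Dear $name, your $item has been dispatched.",
    {"name": "Ada", "item": "book"},
)
print(message)
```

The static part of the template is the "boilerplate", and the substituted values are the "dynamic" part.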

I typed this very phrase into the search box at University of Suffolk Library (login may be required), and got a lot of results. It wasn’t until result 58 (Templated Search over Relational Databases by Zouzias, Anastasios; Vlachos, Michail; Hristidis, Vagelis) that anything was even vaguely related to computer science. The preceding 57 were mostly from biology and/or chemistry, with perhaps a smattering of maths. At result 79 there’s something which looks at least worth reading the abstract (Patent Issued for Serializing a Templated Markup Language Representation of Test Artifacts, Computer Weekly News, 12/2013) but the link from the search results takes me to a set of entries in a publication database with no mention of templates. Result 98 (IXIR: A statistical information distillation system by Levit, Michael; Hakkani-Tür, Dilek; Tur, Gokhan, …) seems possibly interesting but turns out to be a system which uses templated queries, a different technology altogether.

It’s not until result 119 that I get anything even peripherally related to my topic area. In this case a book (Professional ASP.NET MVC 4, by Galloway, Jon; Allen, K. Scott; Haack, Phil, …) which describes a web software development technology, some parts of which have aspects of templating. Crucially, though, this book is a commercial publication, not subject to academic peer review, and so not the best choice for a key reference. Result 132 has an updated version of the same book. Result 166 has another (older, this time) version of the same book. Result 191 yet another version. By result 198 even the microbiology is running thin, and we get a “business” book (Lead Generation For Dummies by Rothman, Dayna). With nothing useful in the first two hundred results it’s clear that this is not a productive search. Sure I could keep going, but tenacity on its own is not enough.

There are several directions to go from here, including:

  • Use the advanced selection features of the database search and restrict results to scholarly and peer-reviewed computer science publications
  • Re-think the search terms to try and find more precise and less ambiguous terminology
  • Give up on searching and instead concentrate on following the bibliographic citation tree of articles and researchers

The first option initially seems reasonable, and gives a more appropriate set of results, but there are still considerably more misses than hits. The first result is IXIR: A statistical information distillation system by Levit, Michael; Hakkani-Tür, Dilek; Tur, Gokhan, … again, the second is about algorithmic program generation, and the third looks like it might even be useful (Statically safe program generation with SafeGen by Huang, Shan Shan; Zook, David; Smaragdakis, Yannis). As it turns out it’s certainly not a key text, but at least the approach described does make some small use of textual templating to generate code fragments. Result 6 is effectively the same work, but from a different source. Result 23 (Extracting Web Data Using Instance-Based Learning by Zhai, Yanhong; Liu, Bing) is the first hint of something I will see much more of later, the inverse of my topic. It seems that trying to analyse and remove boilerplate text, in this case from web pages, leaving just the “important” parts, is (or at least was in 2007) an important issue in data extraction and indexing. This paper is unconcerned with how the templated pages were generated, however. Nothing else of interest appears in the remaining nine results.

At this point I have become convinced that it is my search terms which hold the key. And I also have a sneaking suspicion that I will need several different searches to uncover papers on the use of templates in different contexts. The only even marginally useful documents I have found so far relate to web pages, so I decide to focus on this area in an attempt to discover some more effective terminology.

A search for web template language begins to turn up useful works. Bracketed by a pair of “inverse” papers, result two (TAL—Template Authoring Language by Soares Neto, Carlos de Salles; Soares, Luiz Fernando Gomes; de Souza, Clarisse Sieckenius) at last introduces a paper which is actually about templating. Best of all, it’s not a template language I have ever heard of. Despite a publication date of 2012 this seems an oddly old-fashioned approach, but it meets my inclusion criteria so it’s one for the big bibliography. Result 8 (Framework testing of web applications using TTCN-3 by Stepien, Bernard; Peyton, Liam; Xiong, Pulei) initially looks interesting, but turns out to be another form of inverse: template-like patterns used to match variable data during testing. I do take note of the possibly useful keyword “framework” for later searches, though. Result 17 (Advanced authoring of paper-digital systems: Introducing templates and variable content elements for interactive paper publishing by Signer, Beat; Norrie, Moira C; Weibel, Nadir; …) is interesting as it contains (among other things) an approach to using templated text in a broader range of documents. So that one is in, too. Looks like “authoring” could be a useful keyword, as well. Result 29 (A lightweight framework for authoring XML multimedia content on the web by Vanoirbeek, Christine; Quint, Vincent; Sire, Stéphane; …) reinforces the importance of these keywords, as it is of interest and contains both “framework” and “authoring”. Result 37 (EDITEC – a graphical editor for hypermedia composite templates by Damasceno, Jean Ribeiro; dos Santos, Joel André Ferreira; Muchaluat-Saade, Débora Christina) is somewhat similar to the TAL paper, above, but this has the benefit of a few possibly more interesting looking references in its bibliography.

So far, then, I have one document with some potentially useful references and a slowly growing list of useful keywords: “template”, “authoring”, “framework” and “web”. Looking at that list makes me wonder what other keywords might be useful. If “framework” is there, then maybe “language” and “system” could also be useful, for example. And perhaps I should also be thinking of words which might be common in the body of articles as well as the titles, such as “placeholder”, “boilerplate”, “static”, “dynamic” or “replace”. Words such as “text” or “document” are too vague and conflict with the meta-domain of publishing, thus disproportionately likely to appear in papers and articles of any kind. Perhaps a better approach might be to focus on a few specific document format types in which templating is common, such as “HTML” or “XML”, with the hope of finding references to the more generic texts from specific documents.
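One way to keep track of which keyword combinations have been tried is to generate the full list up front and tick them off. This is only a sketch under my own assumptions: the search_queries function, the limits of two to three terms per query, and the checklist layout are all hypothetical choices, and a real library search box may treat multi-word queries differently.

```python
from itertools import combinations

# The keyword list gathered so far; extend it as new terms turn up.
KEYWORDS = ["template", "authoring", "framework", "web"]


def search_queries(keywords, min_terms=2, max_terms=3):
    """Return every combination of min_terms to max_terms keywords as a query string."""
    queries = []
    for size in range(min_terms, max_terms + 1):
        # combinations() keeps the original keyword order and never repeats a set.
        for combo in combinations(keywords, size):
            queries.append(" ".join(combo))
    return queries


queries = search_queries(KEYWORDS)

# Print a simple checklist that can be pasted into a scratchpad document.
for number, query in enumerate(queries, start=1):
    print(f"{number:2d}. [ ] {query}")
```

With four keywords this gives ten queries (six pairs and four triples). The list grows quickly as keywords are added, so it may be worth capping the number of terms per query.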

With a growing search term list it’s important to be methodical and make sure that no potentially fertile combination is accidentally missed in the excitement of following leads. It’s tempting to just click away on anything which seems interesting, but that rapidly loses context. Another, almost as tempting, approach is to use the power of a web browser to open each lead in a separate tab, only closing the tabs when the associated document has either been recorded or rejected. I tried this and it seemed very effective to start with, but it soon became unworkable, for a variety of reasons.

The first group of problems with this approach is fairly simple: unlike hyperlinks in a web page, references and document index results usually require several steps to resolve to a readable text. Index entries often link to an abstract or a summary page, the actual articles may be behind a paywall, and even where I have access via the university or via a subscription to a professional body it’s still extra steps. Some document stores don’t play well with the web, and try to open documents in iframes, pop-ups, or specific named tabs; some stores present an unusable MIME type for the document, forcing the browser to download rather than display the text, and so on. Implicit in this is that some documents will result in several tabs. If there were no limit, this would be only a minor annoyance, but even the best browsers have a hidden problem. As the number of tabs grows, the room on the label of each tab decreases, in turn both hiding the document title (leading to a lot more tab-swapping to find anything) and (arguably worse) making it increasingly probable that a close button will be hit by mistake, throwing away what might be a vital resource. All of this means that managing tabs, and making sure that everything important is visible and nothing vanishes before processing, becomes increasingly tricky, even without the next problem.

The second type of problem is one of time and place. Attempting to manage a document search using multiple browser tabs feels like pearl diving. Take the biggest breath you can and head into the depths, keep looking while your lungs burn and hope that you can find something valuable before you have to come up for air and start all over again. In this metaphor “coming up for air” is any kind of break in the flow of searching which loses context. This interruption might be as commonplace as a phone call or a toilet break, but could also be something more serious such as a browser or computer shutdown. Who using a laptop has not felt the mounting stress of a low-battery warning? As a part-time student, this is my biggest problem. I rarely have the luxury of a long block of time to devote to a single activity, so I have to split tasks into smaller, more achievable, chunks. If you look carefully you may even be able to spot the several occasions where I left and came back to working on this article. I also work on a variety of machines. I have a laptop for travel, desktops at home and in my office and a growing stable of available machines at universities, co-working spaces and so on.

Any technique I use for more than the most superficial of queries has to be persistent across distractions, crashes and accidental clicks as well as easily transportable between different computers, which rules out browser tabs and anything transient such as an open text editor. For now I am using a combination of tools. As a “scratchpad” I use a Google Document and I copy and paste anything which might bear further investigation: a URL for a search page or a document link, text of a citation, possible search terms and so on. Once I have found a document which looks worth more detailed study, I add it to Mendeley, which maintains a synchronised library between my various machines, with a web interface for when I am using a shared device. This works as far as capturing documents for study, but I have yet to master Mendeley’s tools for tagging, grouping, reviewing, annotating and citing of documents. Mendeley also offers a way to “follow” other writers and researchers, to be notified of what they find or produce. In time this could prove to be very valuable, but I’m not using it much yet.

As of this post, I have used the above techniques to search for several combinations of the above keywords, and eventually found some useful papers. For example a search for “Web framework” turned up Server-centric Web frameworks: an overview by Vosloo, Iwan; Kourie, Derrick G, ACM Computing Surveys, 06/2008, Volume 40, Issue 2 at position 23. This is the most useful article so far; in among a general survey of web frameworks, one section attempts to map a taxonomy of templated approaches. I can see immediate ways in which this will be useful in my own research. A search for boilerplate template results in a lot of legal and sociological texts, but at position 15 gives Is the Browser the Side for Templating? by Garcia-Izquierdo, F. J; Izquierdo, R, IEEE Internet Computing, 2012, Volume 16, Issue 1. Searching for “template processor” brings up a lot of articles about templates for processor design, but at position 7 we find XRound: A reversible template language and its application in model-based security analysis by Chivers, Howard; Paige, Richard F, Information and Software Technology, 2009, Volume 51, Issue 5. XTemplate is at position 10, TAL is at position 14, and there is nothing else of note until position 74, with the peripherally useful Advanced authoring of paper-digital systems: Introducing templates and variable content elements for interactive paper publishing by Signer, Beat; Norrie, Moira C; Weibel, Nadir; …

There’s obviously a lot more searching to do, as well as following up on bibliographical entries, looking for other work by the authors of useful texts, and working through the indexes of likely-looking journals, but the conclusion is that even with pretty specific technical terminology, it takes work and patience to track down appropriate academic resources.