Querying and Analysis of Data (Electronic Texts)
The range of functionality that TuStep (TUebingen System of TExt Processing programs - http://www.zdv.uni-tuebingen.de/tustep/tustep_eng.html) offers makes it the ideal candidate with which to begin a section devoted to some of the tools and techniques that are (or will be) available to scholars wishing to query and analyze digital texts. In addition to the collation function referred to elsewhere (see wiki article Data Preparation ('Electronic Texts')), the other categories of operation are:
- Processing Text
- Preparing Indexes
- Generating Indexes and Concordances
- Generating Listings
With additional file handling and job control commands, TuStep provides a modular system in which programmes are run in sequence to cover the whole chain of processes, whilst enabling the editor to intervene at any stage. Batch processes, defined and controlled by user input parameters, can be aggregated and then stored so that the same sequence can be reused subsequently. In his discussion of ‘Text Tools’, John Bradley attempts to be realistic when summarising the level of expertise required to use this tool.
[quote]Tustep is especially developed for performing tasks of interest to scholars working with texts, although it will still take real work to learn how to use it. […] For non-programmers, then, both TuStep and Perl have steep and long learning curves.
(from Bradley, J., ‘Text Tools’, in Schreibman, S., Siemens, R., Unsworth, J. (eds), A Companion to Digital Humanities (pp. 505-522), online version)[/quote]
The ongoing development of EDITION (http://www.sd-editions.com/EDITION/) by Peter Robinson and colleagues (software based partly on the COLLATE system - http://www.itsee.bham.ac.uk/software/collate/) will offer researchers a similarly full-featured toolset for the production of digital editions. The objective of the project is to design a system that will be usable by any scholar who has the knowledge to produce a print edition, whilst still featuring enough functionality to enable output of exemplary quality. The other components of the system will be based on ANASTASIA (also developed by Robinson - http://anastasia.sourceforge.net/index.html), which is designed for publishing large and highly complex XML documents, and a third piece of software developed by the ARCHway project (http://beowulf.engl.uky.edu/~kiernan/ARCHway/entrance.htm), based at the University of Kentucky, which allows users to link text and images down to the finest level of detail.
A feature of Robinson’s research in the past has been the use of cladistic analysis on manuscripts. Borrowed originally from evolutionary biology, this method attempts to map [quote]‘trees of descent or history for which the fewest changes are required, basing this on comparisons between the descendants’
(from Hockey, S., Electronic Texts in the Humanities, OUP: New York, 2000) [/quote]
Using PAUP (Phylogenetic Analysis Using Parsimony) software, family trees of manuscripts were created from regularized word lists, showing patterns of similarity and deviation among the witnesses. Hockey also reports on the use, by Patricia Galloway (1979), of cluster analysis and dendrogram diagrams, techniques which appear to have been transposed from her activities relating to archaeology, a discipline in which cluster analysis is a standard technique.
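The cluster analysis Hockey describes can be illustrated in outline. The sketch below is not the PAUP algorithm itself (which uses parsimony methods on character matrices); it is a minimal single-linkage agglomerative clustering over edit distances between variant spellings, with the word list and threshold invented purely for illustration.

```python
from itertools import combinations

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def single_linkage(items, threshold):
    """Agglomerative clustering: repeatedly merge the two closest
    clusters until the smallest inter-cluster distance exceeds the
    threshold. Returns the remaining clusters as lists of items."""
    clusters = [[w] for w in items]
    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), min(edit_distance(a, b)
                          for a in clusters[i] for b in clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1])
        if dist > threshold:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters

# Hypothetical variant spellings from different witnesses:
witnesses = ["yonge", "yong", "young", "knyght", "knight", "knyghte"]
groups = single_linkage(witnesses, threshold=2)
```

The same merge history, plotted as a tree, is what a dendrogram diagram shows: which items join first, and at what distance.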
The widespread use of data (or text) mining techniques in humanities projects is now fairly well established, and its application to literary studies can be demonstrated by reference to the NORA project, a collaborative U.S.-based project directed by John Unsworth. The objective is [quote]‘to produce software for discovering, visualizing, and exploring significant patterns across large collections of full-text humanities resources in existing digital libraries.’
(NORA project - http://www.noraproject.org/description.php)[/quote]
Fig. 3 shows a screenshot from one of the functions that the project is developing, which allows users to feed training information into the system by rating texts for evidence of a specific attribute. Once a training set is established, the system searches all of the available data and applies the knowledge resulting from an analysis of the training set to return relevant results for data the user has not evaluated.
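NORA's internal machinery is not described here, so the following is only a hedged sketch of the general technique: a multinomial naive Bayes classifier trained on user-rated examples and then applied to texts the user has not evaluated. The attribute label, training texts and function names are all invented for illustration.

```python
import math
from collections import Counter

def train(labelled_docs):
    """Build per-label word counts and document counts from a list of
    (text, label) pairs -- the 'training set' the user has rated."""
    counts, doc_totals = {}, Counter()
    for text, label in labelled_docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
        doc_totals[label] += 1
    return counts, doc_totals

def classify(text, counts, doc_totals):
    """Score the text against each label with multinomial naive Bayes
    (add-one smoothing) and return the best-scoring label."""
    vocab = {w for c in counts.values() for w in c}
    n_docs = sum(doc_totals.values())
    scores = {}
    for label, c in counts.items():
        total_words = sum(c.values())
        s = math.log(doc_totals[label] / n_docs)
        for w in text.lower().split():
            s += math.log((c[w] + 1) / (total_words + len(vocab)))
        scores[label] = s
    return max(scores, key=scores.get)

# A hypothetical user-rated training set for one attribute:
training = [
    ("tears wept sorrow tender heart", "sentimental"),
    ("weeping tender sorrow grief", "sentimental"),
    ("railway timetable engine coal", "not-sentimental"),
    ("engine iron coal factory", "not-sentimental"),
]
counts, doc_totals = train(training)
```

Once trained, `classify` can be run over every unrated document in a collection, which is the "apply the knowledge from the training set" step the paragraph above describes.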
The principles of data mining have emerged from computing science and encompass a variety of complex methods and procedures, but at a very general level, it is clear that intelligent knowledge-augmented searching is already being used in a variety of disciplines and may well provide new methods of querying the ever larger datasets that arts and humanities scholars are confronted with.
The discipline of linguistics is one such area that grapples with increasingly large repositories of information, frequently in the form of corpora that in some cases contain several hundred million words. Text editing and literary scholarship can clearly benefit from these vast reference sources, particularly in relation to historical collections of words, which have often been harvested from literary sources. A recent Methods Network workshop on Corpus Approaches to Literature (http://www.methodsnetwork.ac.uk/activities/act3.html) demonstrated a range of techniques that will be of interest to literary scholars. Focused on the use of Wordsmith (http://www.lexically.net/wordsmith/), a lexical analysis software package developed by Mike Scott at the University of Liverpool, the workshop introduced participants to clustering, collocation, colligation and semantic prosody analysis methods that can shed light on a number of issues, including style and content analysis, attribution, the study of literary effects and the creative use of language in comparison with quantitative norms. In one study concerning the use of repetition in the works of Charles Dickens, Wordsmith was used to pick out short phrases that featured repeatedly in: a single chapter of a Dickens novel; the whole book; and then a number of other novels featured in a corpus of nineteenth-century literature. It was demonstrated that Dickens recycled phrases more regularly than any of the other authors featured in the corpus, which might represent quantitative proof of the effect that journalistic deadlines had on Dickens’s often serialised output.
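The kind of repeated-phrase listing used in the Dickens study can be approximated in a few lines. This is not Wordsmith's own code, merely an illustrative n-gram counter; the phrase length, threshold and sample text are assumptions for demonstration.

```python
from collections import Counter

def repeated_phrases(text, n=3, min_count=2):
    """Count every n-word phrase in the text and keep those occurring
    at least min_count times -- a rough analogue of a 'cluster' list
    in lexical analysis software."""
    words = [w.strip('.,;:!?"').lower() for w in text.split()]
    ngrams = Counter(tuple(words[i:i + n])
                     for i in range(len(words) - n + 1))
    return {' '.join(g): c for g, c in ngrams.items() if c >= min_count}

sample = "It was the best of times, it was the worst of times."
phrases = repeated_phrases(sample, n=3)
```

Running the same counter over a single chapter, a whole novel, and a reference corpus, then comparing the rates of repetition, is the shape of the comparison described above.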
In the course of other Methods Network events, presentations have been given on a range of techniques that also have relevance. The automatic tagging system CLAWS (the Constituent Likelihood Automatic Word-tagging System - http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/) is widely referenced and was the system used to automatically add part-of-speech (POS) tags to the British National Corpus (BNC - http://www.natcorp.ox.ac.uk/). The variant spelling detection programme VARD (Variant Detector - http://ucrel.lancs.ac.uk/events/htm06/) uses fuzzy matching procedures to identify historical spellings of words and match them with their ‘normalized’ equivalents. The Historical Thesaurus of English (http://www.arts.gla.ac.uk/sesll/englang/thesaur/thes.htm) provides researchers with an enormous resource, arranged semantically and chronologically, that details English vocabulary as it has changed over the centuries. The semantic ontology USAS (UCREL Semantic Analysis System - http://www.methodsnetwork.ac.uk/redist/pdf/es1_08archer.pdf) has also featured in research connected with word domain analysis in Shakespeare; this work concentrates on the semantic tagging of texts, which allows for the grouping of words into conceptual clusters.
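VARD's actual matching procedures are more sophisticated than this, but the core idea of fuzzy-matching a historical spelling against a modern lexicon can be sketched with Python's standard difflib. The lexicon, cutoff and function name below are invented for illustration.

```python
from difflib import get_close_matches

# A tiny stand-in for a modern-English word list:
modern_lexicon = ["would", "should", "love", "learn", "together"]

def normalise(word, lexicon, cutoff=0.7):
    """Return the closest modern spelling by difflib's similarity
    ratio, or the word itself if no candidate meets the cutoff."""
    matches = get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

Applied token by token to an early-modern text, such a matcher proposes ‘normalized’ equivalents (e.g. ‘loue’ to ‘love’) that a scholar can then accept or reject, which is the workflow fuzzy variant detection supports.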