Interview: Paul Rayson WMatrix, text mining

Hi Paul,

You took an active part in the Methods Network workshop Text mining for historians in July 2007 and organized an earlier one on Historical text mining.

Is this focus on History just a coincidence, or are historians especially interested in text mining? And, to make the question more general, are there many differences regarding the application of text mining techniques in different humanities disciplines? If so, I would be interested to hear how this influences the development of tools for text mining.

--
Torsten Reimer
http://www.methodsnetwork.ac.uk

Hi Torsten,

The first workshop at Lancaster (Historical text mining) grew out of an overlap of interests that Dawn Archer and I had: historical linguistics, computational and corpus linguistics. Part of the reason that we organised the first workshop was to form a network of scholars working at the intersection of the same areas. We wanted to extend the group in both directions, i.e. towards text mining researchers and towards historians. We were more successful in the former direction than the latter, with the exception of Stephen Pumfrey at Lancaster, who presented an account of his early explorations into corpus-based methods.

For the second workshop (Text mining for historians), the focus was squarely on the historians and seeing what research questions they had that could be answered by exploiting the text mining and corpus (linguistic) methods.

More generally, I think we need further networking events like those to explore the level of adoption or possibilities for use of text mining techniques in other disciplines.

Paul.

This week I had two meetings with members of the e-Uptake project, which exists to understand potential barriers that hinder wider adoption of e-infrastructures in research (and make recommendations about how to address these issues). Returning to our conversation about text mining and corpus methods, this made me wonder what the 'barriers' would be in this field of research. As text is still the most important source for historical research, you would expect historians to focus on these methods. Now, there is definitely interest, but not as much as you could expect.

Unless you disagree with the last statement, would you say this is due to historians lacking the skills or the discipline-specific tools? Or is it a more basic problem, i.e. that historians are maybe not really aware of what text mining could do for them? If raising awareness is a major issue in this (not only in History), are there any groups or projects that could take this agenda forward? Where would you point researchers with an interest in this field?

I don't really agree with your statement there, because historians would use a different definition of "text" to that used in corpus linguistics. This came up at the text mining workshops and again in a recent meeting that I was having at Lancaster with colleagues who are using Wmatrix to support stylistics research. The view of a text provided through a corpus tool tends to obscure the structure or flow of a text. It permits the user to focus in on specific sentences or parts of sentences without necessarily providing the wider context of the full text, or even the context outside the text.
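To illustrate the keyword-in-context view described above, here is a minimal Python sketch. It is not how Wmatrix or any other corpus tool is implemented; the function name `kwic`, the `width` parameter and the sample sentences are all invented for illustration. It shows how a concordance line keeps only a narrow window around each hit and drops the rest of the text.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines: up to `width` characters either side of each hit."""
    lines = []
    # Whole-word, case-insensitive match; the keyword is escaped so it is treated literally.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so the keyword lines up in a single column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, for illustration only.
sample = (
    "The king granted the charter to the town. "
    "The charter was later confirmed by his son."
)

for line in kwic(sample, "charter", width=20):
    print(line)
```

Each printed line shows the keyword with a few words on either side, which is exactly why the structure and flow of the full text are no longer visible.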

Raising awareness of these tools and techniques is one issue, and it looks like the e-Uptake project will address this through discipline-based case studies. However, two points that need to be addressed from my perspective on the computer science side are awareness of the historian's requirements and the ease of use of the tools that are provided. We talked about Google-style ease of use in the e-science scoping study of Linguistics last year in London: http://www.ahds.ac.uk/e-science/e-science-scoping-study.htm - these are the key barriers in my opinion.

You are right that corpus tools make you look at texts in a different way. Being a historian myself I do think that this can actually be very useful for the (usually) more qualitative approach my discipline takes.

For my Ph.D. research, the access to corpus tools and a wide sample of early modern English prints would have allowed me to see how insights gained through a qualitative analysis of one genre of texts could be weighted against a more representative sample of texts from a specific period (I was, among other things, concerned with how topoi such as the Royal Navy as the "Wooden Walls" were used in political discourse - it would have been interesting to see how popular this topos was beyond politicians, naval officials etc. and their pamphleteers).

A large corpus of, for instance, early modern texts could help historians to test theories developed through qualitative analysis against a wider range of sources and help the discipline to avoid the "tunnel vision" that a qualitative approach can sometimes have. Here I would like to see more activity among my colleagues, which is what I meant by my comment.

One of the things that we are planning to do internally at Lancaster is put our full text copy of the EEBO dataset inside a corpus tool such as Corpus Workbench.

CWB was originally developed at [[http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/|Stuttgart ]] but has recently been made [[http://cwb.sourceforge.net/|open source ]].

This would allow local historians to see the full benefit of corpus methods on texts that they already use online through the standard EEBO interface.

That would be an extremely useful resource and incredibly helpful for researchers!

Unfortunately due to the licence restrictions on EEBO, we can only make it available internally. However, all UK academic institutions can obtain a copy of the full text EEBO. This was part of the JISC licence agreement, see the [[http://www.jisc-collections.ac.uk/eebo|JISC website for more information]].

Having worked intensively with EEBO, I am (unfortunately) aware of the licensing problems. Germany has recently acquired a national license for EEBO, but even that requires registration and access control. Still, it is good to have the resource available and I am sure that researchers at Lancaster will be able to make good use of your full text copy!

You have mentioned ease of use as something to focus on. Interestingly, a historian I discussed a similar issue with earlier this year, said more or less exactly the same:

[quote=GVogeler]I remember the great success of a simple Visual-Basic-Tool connecting MS-Word with MS-Access to build a glossary of phrases in the critical edition of the charters of Frederic II. (1194-1250): The principal editor Walter Koch – he stands somewhere between "computers are dangerous" and the "every day historian" - loved the tool (with its football-button) because it made index entries and linked them to phrases he formed out of the texts of his charters. There wasn't any linguistic process, maybe some simple string functions, just a relational database as the most complex programming part in it. My conclusion from this experience was: A good computer tool for a historian has to be easy to use, reliable, looking atomic ("It's just one operation I think I can control, not a collection of operations that make the result blurry") and have a fancy user interface. The main task to disseminate computing methods among historians is to build tools for their needs.[/quote]

The question is, how does one build such a tool? If it is specifically designed for the needs of a certain discipline then you may have a relatively clear set of requirements. Text mining and corpus linguistic tools, however, can be used in many different disciplines. How did you approach this task when you started work on Wmatrix? Was it mainly built for your own needs, did you focus on a specific discipline or did you aim at developing a tool with good "general" ergonomics?

Developing a tool with general usability would not be the place to start (cf. Microsoft Word)!!

The Wmatrix tool was originally developed with software engineering users in mind. The key requirement was to hide away the detail of the natural language processing and linguistic analysis. We used a web-based interface with that particular user group in mind in the [[http://www.comp.lancs.ac.uk/computing/research/cseg/projects...|REVERE project]].

After demoing the tool in Linguistics at Lancaster, I developed a much more open interface for those users to retrieve the detailed NL analysis. Over the past several years, the interface has been 'simplified' a number of times as I've gained more and more user feedback. It is an ongoing process. My solution within Wmatrix is to enable different 'viewpoints' so that different types of users can see different views on the same underlying datasets. The end point of this simplification process may be a Google-style simple box with one button, but we're not there yet!

Thank you for these details! Software such as MS Word is a good example to show that the process of simplification is not an easy one...

You have mentioned software engineering and linguistics; I first came across Wmatrix in the context of historical text mining (via the Methods Network workshops). Are there other disciplines using the tool? Who are your main clientèle and how do you engage with users to gather feedback?

Within the area of linguistics, it has been used for the analysis of political discourse, EAP, EFL, varietal studies, stylistics, modality, class and gender in sociolinguistics, weblogs in Singapore English and, most recently, metaphor analysis.

In software engineering, the tool is being used for requirements engineering and early-aspect identification.

The other main user community is in the management school at Lancaster. They are using it as a support tool for qualitative data analysis of interviews in the study of areas such as customer relationship management, knowledge transfer and entrepreneurship studies.

I'm always amazed at what people use the tool for!! I maintain a list of papers/presentations using Wmatrix on the [[http://ucrel.lancs.ac.uk/wmatrix/|tool's homepage]].

My two main ways of gathering feedback are dealing with direct questions from users and running workshops such as the one at the historical text mining event in Glasgow.

The topic of tools development in the (arts and) humanities comes up again and again. Lorna Hughes mentioned a dearth of tools available to scholars in her presentation at DRHA. Other events such as the Summit on Digital Tools for the Humanities have addressed tools development.

What do you feel are the most important issues in tools development for research in general or in your field? And what could be done to support such work - should we think about better ways to gather user feedback, do we need open source repositories or something like a scholarly tools portal site?

The Methods Network is soon to convene a second meeting of its Tools Workgroup - in this context I am especially interested to learn more about the needs of developers.

One of the most important things is to encourage inter-disciplinary collaborations. This would allow developers (such as myself) to be immersed in the existing tools and methodologies within a particular sub-discipline. It also enables the important feedback loop, so that new tools can be tested and commented on by arts and humanities scholars, and then the next iteration of development begins. This picks up on the focus on methodology in the tools workgroup report.

The second strategic issue in the tools workgroup report is sustainability. From a developer's point of view, this relates to funding. Research funding generally allows software prototypes to be built showing a proof of concept, but in order to build a community of practitioners and continue tool development over time, a funding model is required. This could either be via licensing the tool or development within further research projects. Open source development is another way forward, although I think this tends to favour more technically-minded users and exclude those who require the actual tool support!

Thank you for your comments, Paul! That is very interesting and I will pass them on to the tools workgroup at our next meeting.

Getting back to Wmatrix, your corpus analysis and comparison tool: what are, apart from the mentioned changes to the user interface, your further plans and hopes for its development? Which part do you think community input will have in this and what role can you see for this forum?

Torsten: sorry for the long delay in responding. I hadn't spotted the new messages on the forum.

Future specific plans for new features include:

1. Collocations
2. N-grams
3. Multiple file upload
4. Dispersion and range statistics
5. Better support for creating corpora by joining or splitting files
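As a rough illustration of two of the planned features listed above, the following Python sketch counts n-grams and computes a simple range statistic (the number of files in which a word occurs at least once). It is only a sketch under stated assumptions: the function names, the tokenisation rule and the three tiny "files" are invented here, and nothing in it reflects how Wmatrix itself implements or will implement these features.

```python
import re
from collections import Counter

def tokenize(text):
    """Very crude tokeniser (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def word_range(files, word):
    """Range: the number of files in which `word` occurs at least once."""
    return sum(1 for text in files.values() if word in tokenize(text))

# Invented miniature corpus, for illustration only.
files = {
    "a.txt": "the wooden walls of the kingdom",
    "b.txt": "the wooden walls kept the realm safe",
    "c.txt": "the army marched north",
}

# Count bigrams file by file so that no n-gram spans a file boundary.
bigrams = Counter()
for text in files.values():
    bigrams.update(ngram_counts(tokenize(text), n=2))

print(bigrams.most_common(3))
print(word_range(files, "wooden"))
```

Counting per file and then summing is a deliberate choice: concatenating all files first would create spurious n-grams across file boundaries.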

I'm also going to load standard corpora such as LOB, Brown and our new 1931 equivalent in order to make it more widely available.

Over and above that I'd like to engage with further user communities. On Thursday I'm meeting up with colleagues in Linguistics who are using the tool for corpus stylistics. I also noticed that Wmatrix was mentioned in the English Subject Centre newsletter (October 2007).

I hope that new Wmatrix users will stop by this forum and start to create a community of users.

Paul, thank you very much for your time and taking part in this. I think we will have an ongoing discussion about the best ways to build an infrastructure for tools development and interaction between users and developers!

As always, the thread will remain open for future comments and questions.

Hi Paul, so sorry to interrupt your discussion. I was just wondering whether you are the same Paul Rayson from lxe Dubai whom I met at the Hilton hotel in Durban, South Africa in 2007?
Please get in touch if it is you!
Thanks so much
Susan

Err, no. That wasn't me!

Paul.

I am enjoying this discussion. But I am a little worried at times that those of us who use all sorts of tools to engage with the past sometimes lose sight of the past that we are engaging with. Torsten hinted at this; he hints that we can gain new knowledge in a quantitative way when engaging with the textual historical record through using textual analysis, but still I worry that this undermines the art of the craft. The 'process' of the craft, of any craft, is only half of the story. We still need to use this data to bring into our grand narratives. This is basically what we are; we are narrative story tellers, and we use the data underneath to construct stories. Computational process isn't outside of political discourse...more facts just create more ways in which the fish monger can dish them up to tell stories.

And saying historians use 'text'. But I don't believe there is such a place outside of text. There are places outside of empiricism, but not text.

I agree: historians are storytellers. These stories are what makes our work (hopefully) relevant and entertaining. Everything we do eventually leads to this - including the use of tools. We need pencils, keyboards, books, PDF files and/or corpus linguistics software, depending on what kind of story we are after. So I wonder: how does getting a new tool undermine our craft? Do you feel that we focus too much on the tool and forget the story about it?
If you would argue that way, I would probably counter by saying that tools influence our work in several ways, which makes it important to reflect every now and then on how they do it.

And regarding the text argument: If you define it in a more post-modern way you are certainly right. Although I do quite like doing that, it is not always a helpful definition, as digital text (as used for corpus linguistics and text mining) has different qualities compared to an image, an audio file or an idea. We would lose that quality by saying everything is text, I guess.