Moving through the past: lecture at TOPOI

Returning from Berlin, where I was giving a lecture on the MiPP project at TOPOI, part of the Free University. This was the first MiPP presentation I have given to an audience composed for the most part of students, and, as so often, it was the students' questions which raised the really interesting and important issues.

MiPP was the requested topic, and the pictures are still pretty, but it is now beginning to sound a bit like old news, especially since we now have at least four publications on the back of it. Now thinking hard about where the concept could go next. After all, MiPP was part of the AHRC’s DEDEFI programme, a capital infrastructure grant that was not even supposed to fund research per se in the first place. What research in archaeology might we enable?

So in preparation for the lecture, I revisited some old intellectual haunts: the spatial significance of architecture in Classical Greece. Lisa Nevett has done a good deal of work pulling together the material, iconographic and literary sources to elucidate notions of oikos vs polis, household and state, inside and outside; but, as Nevett shows, these are concepts drawn from literary culture. We have no way of knowing what actual relevance they had on the ground. Nevett's approach is to fall back on material/archaeological evidence, but here we run into the sort of conceptual limitations of interpretation that MiPP has, I think quite successfully, defined. Discussion with colleagues working with reconstruction during and after the lecture made me think that there is a great deal of scholarly demand for augmenting material culture beyond simply representing it in the virtual world: what are the points of interest and points of interpretation drawn from architecture, artefacts and landscape that determine how people react to all three? How can we document these in 3D? The concept of comparanda between different levels of familiarity and experience was also raised. A simple question could be: if a person is used to living and working in a round house, how would they behave in a Roman villa? How would their actions differ from those of someone who had grown up in such an environment?

Another factor which came up in questions is the potential for using motion-based representation as a means of publication, dissemination and engagement – an area where there is likely to be not only interest, but money too.

I should record that there was great interest in the iPad app that Kirk Woolford developed as part of MiPP. Check out this video on his Sussex webpage.


Blackouts, copycratism and intellectual property

This seems as good a week as any to address the issue of copyright, what with the Wikipedia et al. blackout this week. Perhaps like many non-Americans, I find the exact details of SOPA and PIPA require a little reaching for, but the premise is that American-based websites would be banned from providing funding, advertising, links or other assistance to non-US websites which host ‘pirated content’. This could take the form of forcing search engines such as Google to stop indexing such sites, of barring requests from clients in the US from resolving the domain names of targeted foreign sites, or of shutting down ‘offending’ sites in the US. The bills’ many detractors say that this is too broad a brush, potentially allowing unscrupulous commercial operators to target US websites for their own purposes, and that such sites could be targeted even if they are not knowingly hosting pirated content. Think of Facebook having to individually clear each and every picture and video uploaded to it anywhere in the world, and assuming legal responsibility for its presence there.

This all seems a bit weird. It is as if the UK Parliament decided to revisit the 1865 Locomotives Act, which limited any mechanically-propelled vehicle on the highway to 4mph, and stipulated that an authorized crew member should walk before it holding a red flag. Imagine Parliament reasserting this speed limit for, say, the M6, and stipulating that a bigger flag was needed. The interesting thing about these bills is that they come straight from the ink-in-the-blood mentality of zillionaire copycrats (lit. ‘one who rules through the exercise of copyright’) like Rupert Murdoch, who, rather predictably, tweeted “Seems blogsphere has succeeded in terrorising many senators and congressmen who previously committed … Politicians all the same”, and the Motion Picture Association of America. There is still, in some quarters, a Mauer im Kopf (‘wall in the head’) which says ‘it is a bad thing to share my data’ and which, at least in some ways, transcends potential financial loss. What we are seeing, in some quarters of the digitisation world at least, are smarter ways to regulate *how* information is shared on the internet, and to ensure attribution when it is.

How do this week’s debates relate to scholarly communication in the digital humanities? Here, there seems to be an emerging realization that, if we actually give up commercial control of our products, then not only will the sun continue to rise in the east and set in the west, but our profiles, and thus our all-important impact factors, will rise. Witness Bethany Nowviskie’s thoughtful intervention a little less than a year ago, or the recent request from the journal Digital Humanities Quarterly to its authors to allow commercial re-use of material they have contributed, for example for indexing by proprietary metadata registries and repositories. I said that was just fine. For me, the danger only emerges when one commits one’s content to being available only through commercial channels, which DHQ was not proposing.

So, beyond my contributions to DHQ, what lessons might we learn from applying the questions raised by this week’s events to content provided by movie studios, pop stars, commercial publishers (or indeed the writings of people that other people have actually heard of)? We should recognise that there is a conflict between good old-fashioned capitalist market forces and our – quite understandable – nervousness about Giving Up Control. Our thoughts are valuable, and not just to us. The way out is not to dig our heels in and resist the pressure; rather, I feel we should see where it leads us. If Amazon (net worth in 2011: $78.09 billion) can do it for distribution by riding on long-tail marketing, where are the equivalent business models for IP in the digital age, and especially in scholarly communication? We need to look for better ways to identify our intellectual property, while setting it free for others to use. Combining digital data from a particular resource could lead to increased sales of (full) proprietary versions of that resource, if the content is mounted correctly and the right sort of targeting achieved. Clearly there is no one answer: it seems that there will be (must be) a whole new discipline emerging in how scholarly digital content is, and can be, reused. We are perhaps seeing early indications of this discipline in namespacing, and in the categorisation of ideas in super-refined, multi-faceted CC licences, but these will only ever be part of the answer.

But the first stage is to get over the Mauer im Kopf, and I suggest the first step for that is to allow ourselves to believe that the exploitation of web-mounted content is equivalent to citation, but taken to the logical extreme that technology allows. We have spent years developing systems for managing citation, properly attributing ideas and the authorship of concepts, and avoiding plagiarism: we now base our academic crediting systems on these conventions and terrorise our students with the consequences of deviating from them. We need to do the same for commercial and non-commercial reuse of data, applied across the whole spectrum that the concept of ‘reuse’ implies.

Otherwise, we are simply legislating for men with flags to walk in front of Lamborghinis.

CeRch seminar series, Spring 2012

We have a great line up for the Centre for e-Research Seminar Series this term. The events are held in the Anatomy Theatre, KCL, Strand. All welcome.

Tuesday 17 January, 6.15pm: Digital Transformations of Research and Styles of Knowing, Ralph Schroeder and Eric T. Meyer, Oxford Internet Institute

Tuesday 31 January, 6.15pm: Manuscript Digitisation: How applying publishing and content packaging theory can move us forward, Leah Tether, Anglia Ruskin University

Tuesday 14 February, 6.15pm: Networks of Networks: a critical review of formal network methods in archaeology through citation network analysis and close reading, Tom Brughmans, University of Southampton

Tuesday 28 February, 6.15pm: Building an Ontology of Creativity: a language processing approach, Anna Jordanous, King’s College London and Bill Keller, University of Sussex

Tuesday 13 March, 6.15pm: Digitization and Collaboration in the Study of Religious History: Rethinking the Dissenting Academies in Britain, 1660-1860, Simon Dixon and Rosemary Dixon, Queen Mary, University of London

Tuesday 27 March, 6.15pm: Enhanced Publications in the Social Sciences and Humanities: tensions, opportunities and problems, Andrea Scharnhorst, Nick Jankowski, Clifford Tatum, Sally Wyatt, Royal Netherlands Academy of Arts and Sciences, Netherlands

Happy 2012

My blogging has been somewhat quiet for the last couple of months (well, non-existent really). This is partly due to many of my waking hours being taken up with the Digital Exposure of English Place-names project, a JISC mass-digitisation content project to digitise the entire corpus of the Survey of English Place-Names (earning an honourable mention in the Times Higher). Normal service has now been resumed. This is a fantastic project with CDDA in Belfast, Nottingham and Edinburgh. It will make SEPN available as a linked data gazetteer via Unlock Text, and as downloadable XML for text mining and visualization.

Also, just before Christmas, I was in the fair city of Umeå, at a NEDIMAH workshop on information visualization. Our homework is to gather evidence about important topics in this area. Still digesting it really, but will try to amass such evidence here.


CeRch seminar: Webometric Analyses of Social Web Texts: case studies Twitter and YouTube

Herewith a slightly belated report of the recent talk in the CeRch seminar series given by Professor Mike Thelwall of the University of Wolverhampton. Mike’s talk, Webometric Analyses of Social Web Texts: case studies Twitter and YouTube, concerned getting useful information out of social media, primarily by social-science means: information, specifically, about the sentiment of the communications on those platforms. His group produces software for text-based information analysis, making it easy to gather and process large-scale data, focusing on Twitter, YouTube (especially the textual comments), and the web in general via the Technorati blog search engine and Bing. This shows how a website is positioned on the web, and gives insights into how its users are interacting with it.

In sentiment analysis, a computer programme reads text and predicts whether it is positive or negative in flavour, and how strongly that positivity or negativity is expressed. This is immensely useful in market research, and is widely employed by big corporations. It also goes to the heart of why social media work – they engage human emotions – and it tracks what role sentiments play in social media. The sentiment analysis engine is designed for text that is not written with good grammar. At its heart is a list of 2,489 terms which are either normally positive or negative. Each has a ‘normal’ value, and ratings of -2 to -5. Mike was asked if it could be adapted to slang words, which often develop, and sometimes recede, rapidly. Experience is that it copes well with changing language over time – new words don’t have a big impact in the immediate term. However, the engine does not appear to work with sarcastic statements, which, linguistically, might have diction opposite to their meaning, nor with (for example) ‘typical British understatement’. This means that it does not work very well for news fora, where comments are often sarcastic and/or ironic (e.g. ‘David Cameron must be very happy that I have lost my job’). There is a need for contextual knowledge – e.g. ‘This book has a brilliant cover’ means ‘this is a terrible book’ in the context of the phrase ‘don’t judge a book by its cover’. Automating the analysis of such contextual minutiae would be a gigantic task, and the project is not attempting to do so.
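
For readers unfamiliar with the approach, the sketch below gives the flavour of lexicon-based sentiment scoring. It is not SentiStrength itself, just a toy illustration of the idea described above: the lexicon, strength values and booster words here are invented for the example, and the real engine's scales and handling of informal language are far more sophisticated.

```python
# Toy illustration of lexicon-based sentiment scoring (not SentiStrength itself).
# The lexicon, strengths and booster words below are invented for this example.

LEXICON = {
    "love": 3, "brilliant": 4, "happy": 2,     # positive term strengths
    "hate": -4, "terrible": -4, "awful": -3,   # negative term strengths
}
BOOSTERS = {"very": 1, "really": 1}            # intensifiers strengthen the next term

def sentiment(text):
    """Return (strongest positive, strongest negative) strength found in the text."""
    pos, neg = 1, -1                           # neutral baselines
    boost = 0
    for raw in text.lower().split():
        word = raw.strip(".,!?:;")
        if word in BOOSTERS:
            boost = BOOSTERS[word]
            continue
        score = LEXICON.get(word, 0)
        if score > 0:
            pos = max(pos, min(score + boost, 5))
        elif score < 0:
            neg = min(neg, max(score - boost, -5))
        boost = 0
    return pos, neg

print(sentiment("really love this, the cover is brilliant"))  # -> (4, -1)
print(sentiment("awful book, hate it"))                        # -> (1, -4)
```

As the examples in the talk made clear, a purely lexical approach like this has no way of spotting that the 'brilliant cover' comment is damning the book with faint praise.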

Mike also discussed the Cyberemotions project. This looked at peaks of individual words in Twitter – e.g. ‘Chile’ when the earthquake struck in February 2010. As might be expected, positivity decreased; but negativity increased only by 9%. It was suggested that this might have been to do with praise for the response of the emergency services, or good wishes to the Chilean people. Also, the very transience of social media means that people might not need to express sentiment one way or another: simply mentioning the earthquake and its context would be enough to convey the message the writer needed to convey. Mike also talked about the sentiment engine’s analysis of YouTube. As a whole, most YouTube comments are positive; however, those individual videos which provoke many responses are frequently viewed negatively.
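
To make the kind of analysis described above concrete (this is an illustration only, not the Cyberemotions methodology), one can imagine counting daily mentions of a keyword in a stream of timestamped, sentiment-scored tweets and averaging the scores, so that peaks and sentiment shifts around an event become visible. The sample data and field layout here are invented.

```python
from collections import defaultdict
from datetime import date

# Invented sample data: (date, text, sentiment score) tuples standing in for a
# real stream of timestamped, sentiment-scored tweets around the Chile earthquake.
tweets = [
    (date(2010, 2, 26), "lovely evening in chile", 3),
    (date(2010, 2, 27), "earthquake in chile, thoughts with everyone there", -2),
    (date(2010, 2, 27), "chile earthquake: emergency services doing a great job", 2),
    (date(2010, 2, 28), "still no news from family in chile", -3),
]

def daily_profile(tweets, keyword):
    """Count mentions of `keyword` per day and average their sentiment scores."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for day, text, score in tweets:
        if keyword in text.lower():
            counts[day] += 1
            totals[day] += score
    return {day: (counts[day], totals[day] / counts[day]) for day in counts}

for day, (n, mean) in sorted(daily_profile(tweets, "chile").items()):
    print(day, "mentions:", n, "mean sentiment:", round(mean, 2))
```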

Try the sentiment engine at http://sentistrength.wlv.ac.uk. One wonders if it might be useful in XML/RDF projects such as SAWS, or indeed for book reviews on publications such as http://www.arts-humanities.net.

Semantic MediaWiki: a tool for collaborative databases

This promises to be a great event at KCL later this month:

Semantic MediaWiki: a tool for collaborative databases

Monday 26th September 2011

Anatomy Theatre and Museum, King’s College London
6th floor, King’s Building, Strand Campus, London WC2R 2LS

In association with Judaica Europeana, the British Library and the European Holocaust Research Infrastructure (EHRI) Project

On 26th September, the Centre for e-Research will host a day exploring the Semantic MediaWiki, led by New York City-based developer Yaron Koren.

Please register for the event(s) you wish to attend using the links below.

WORKSHOP: Semantic MediaWiki: a practical workshop

15:30 – 17:00, Anatomy Museum

The first part of the day will consist of an interactive seminar in the Anatomy Museum led by Koren, demonstrating the principles of Semantic MediaWiki. Participants will have an opportunity to create and use their own data structures on a public test wiki. The workshop will be of particular interest to people interested in the development and application of wiki technologies, and their place in digital research infrastructures.

Please register to attend at: http://www.eventbrite.com/event/1519008395

LECTURE: The Judaica Europeana Haskala (Jewish Enlightenment) database

18:00, Anatomy Lecture Theatre (TBC) followed by refreshments (All welcome)

With an introduction by Lena Stanley Clamp, Director, European Association for Jewish Culture

This seminar will give an overview of Semantic MediaWiki, with a special focus on the Judaica Europeana Haskala database of Jewish Enlightenment literature, which is currently being converted into an SMW system. While the focus of the lecture will be on Semantic MediaWiki, it will be of relevance to broader aspects of e-Research. The British Library is a partner in the Judaica Europeana project, assisting with technical advice and dissemination.

There will also be a short introduction to the EU-funded EHRI project by Tobias Blanke.

Please register to attend at: http://www.eventbrite.com/event/1995834595

About Semantic MediaWiki

Semantic wikis are a technology that combines the massively collaborative abilities of a wiki with the well-defined structure and data-reusability of a database. Semantic MediaWiki is an extension, first developed in 2005, that adds this capability to MediaWiki, the popular open-source wiki application best known for powering Wikipedia. SMW is by far the most successful semantic wiki technology, currently in use on hundreds of wikis around the world, including internal use at major companies like Audi and Boeing.

There is a set of additional MediaWiki extensions that work alongside Semantic MediaWiki to extend its functionality; Semantic MediaWiki is almost always used together with one or more of these, and the term ‘Semantic MediaWiki’ is sometimes used to describe the entire set. Using Semantic MediaWiki and its related extensions, one can easily create custom data structures that provide forms for letting users add and edit data, and whose data can be queried and displayed in a variety of ways, including tables, charts, maps and calendars.
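
By way of illustration of the data-reusability point, and assuming a wiki with Semantic MediaWiki’s ‘ask’ API module enabled, a script along the following lines could pull structured data back out of such a wiki for reuse elsewhere. The endpoint URL, category and property names below are hypothetical, not those of any actual Judaica Europeana installation.

```python
import requests

# Hypothetical example: query a Semantic MediaWiki installation through the
# MediaWiki API's 'ask' module (provided by SMW). The endpoint, category and
# property names are invented for illustration.
API = "https://example.org/w/api.php"
query = "[[Category:Haskala works]]|?Author|?Publication year|limit=10"

resp = requests.get(API, params={
    "action": "ask",
    "query": query,
    "format": "json",
}, timeout=30)
resp.raise_for_status()

# SMW returns one entry per matching page, with the requested property values
# under 'printouts'.
for title, page in resp.json().get("query", {}).get("results", {}).items():
    printouts = page.get("printouts", {})
    print(title, "-", printouts.get("Author"), printouts.get("Publication year"))
```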

About the speakers

Yaron Koren is one of the main developers of Semantic MediaWiki. He has been involved with the project since 2006, runs the MediaWiki consulting company WikiWorks, and helps to run the MediaWiki-based wiki farm Referata. Yaron grew up in Israel and the United States, and currently lives in New York City.

Lena Stanley Clamp is the director of the European Association for Jewish Culture and manager of the Judaica Europeana project, which will contribute vast quantities of digital content documenting Jewish life in Europe to Europeana – Europe’s libraries, archives and museums online.

Bobby on the beat

Finally, a footnote on Greyfriars Bobby. Bobby was, of course, the Skye terrier (my auto spell-check changed this to Skype terrier, thus surely coining a term for a person who IMs you incessantly) whose loyalty and devotion in refusing to abandon his departed master Auld Jock’s grave inspired generations. So what genius thought this would be a good sign to put at the Kirkyard entrance? Or is the cold-hearted Sexton alive and well, and working for Edinburgh council?

Bobbies keep out

Fringe Benefits

I saw two shows at the Fringe. Sammy McMillan, aka Sammy J, and ‘Randy’, his bug-eyed purple puppet, were highly amusing. The Carroll Myth by the Schmuck’s Theatre Company was rather meatier, featuring the dark shadows around Lewis Carroll’s ‘friendship’ with the 11-year-old Alice Pleasance Liddell, and his own descent into madness. A cast of characters including the Mad Hatter, Tweedle-Dum and Tweedle-Dee, the Walrus, a trio of Cheshire Cats and others swirl around the hapless Carroll, with highly effective presence and undoubtedly accomplished menace. An interesting interpretation of what one might call an individual’s ‘personal myth’, and a nice reflection on the jumbled lines joining that myth to literature and to popular culture (and perception). It makes you wonder what other artefacts, literary or otherwise, one could trace a ‘myth’ through. All in all, it was decidedly odd but rather clever.

On Stallman and Surveillance

Key billing at the Turing Festival was, of course, Richard M. Stallman, the so-called ‘prophet of free software’. He delivered a ringing peroration, indicting proprietary software in all its forms and, occasionally throughout the meeting, engaging in spirited discourse with those who react less strongly to the ‘i-Bad’, and the ‘Amazon Swindle’.

Richard M. Stallman, GNU in hand, addressing the Turing Festival

Stallman identified a number of threats to our freedom online and, by extension, to our freedom overall in the Information Age. The surveillance carried out by governments via our own devices, and the reporting of our activities to others, is one such threat. Mobile phones which send your GPS location to third parties are another. Various online features of Windows, and ‘like’ buttons on Facebook, all allow data on us to be harvested. Remote surveillance is carried out via systems that are not ours, for example ISPs keeping records about users. This can be used to attack democratic activity. And data gathered for the most legitimate of reasons can still be abused by future regimes. ANY data retention, Stallman argued, is dangerous. In a free society you are not guaranteed anonymity – you can be recognized in the street – but that information is diffuse, and cannot be collated easily. With computerization and digitization, all of it can be indexed. Censorship is another threat, even in the supposedly democratic West.

Stallman also discussed ‘threats’ posed by proprietary standards, whose source elements are not viewable by their users. Of course, the recent experience of the Digital Humanities suggests that matters influencing, or limiting, the application of free standards are not limited to the mechanics of what is open and what is not. Followers of the travails of the TEI on Twitter and elsewhere will know that openness in governance and administration is just as important as openness of schemata and documentation. One cannot detach the one from the other, as one risks doing if one simply demands that the source be open.

It is difficult not to admire the elegance of Stallman’s dictum that ‘either users control their programmes or the programmes control the users’; and few, outside the neoist of neo-cons, doubt the horrors inflicted on Americans and non-Americans alike by the reactionary and deeply unpatriotic PATRIOT Act. But one does perhaps have to wonder if, even in our ultra-technologized age, all this rests on the assumption that we *have* to give up our freedoms to technology in the first place. If I assume that anything I write in an email might potentially become public – just ask the climate scientists at UEA’s Climatic Research Unit – then what does it matter if Windows is tracking my emails through Outlook?

Stallman also made the point that Open Source communities are typically more interested in improving their code bases than in enabling the users of the software. Again, one needs to question why, exactly, there should be such a stark either/or approach. I suppose this might take on a very different perspective, or set of perspectives, if one is using open vs. proprietary software in the development of products or commercial services, or dealing with particularly sensitive information. But in the academic humanities, one has to ask whether this is something that should really bug us. Is it really making ‘war on sharing’ to point out that there is a trade-off between (say) the ease of using ESRI Arc products and the openness of GRASS? Stallman surely has a point when he says that big corporations make universities dependent on their products by providing cheap site licences, but if that provides a level playing field across the ac.uk domain, doesn’t it allow us to make better use of our fEC-ravaged budgets? And if Autodesk wants to burrow into the code underneath MiPP’s reconstructions using some clever Trojans that they installed alongside our software without telling us, then good luck to them. They could save themselves the effort by simply downloading it from our website, where we make it available for free.

Other highlights of the Festival included a hugely entertaining talk by David McCandless on data visualization. Rather reminiscent of Steven Levitt and Stephen J. Dubner’s Freakonomics, McCandless’s thesis is that any data, anywhere, can be visualized in some way. Well worth checking out his website. Also notable were Arjan Haring and Maurits Kaptein from Persuasion API, with a talk on the science of persuasion. I guess I need to get some advice from them on writing grant applications.

Turing Festival

It’s been a busy weekend at the fantastic Turing Festival on Edinburgh’s Fringe. One can only hope that this kick-off to the Turing Centenary year leads to Alan Turing, one of the great geniuses of the twentieth century, gaining the historical recognition he deserves.

Dome of the Surgeons' Hall, Edinburgh, where the Turing Festival was held

It was very useful to be able to think through some of the issues that MiPP has raised. What we have found in this project is the potential – and potential only, really, since it was a capital grant rather than a research project – for embodiment based on actual people in heritage visualization, rather than simple representation. Even if the latter is based on motion capture (which, as far as I know, is rare), it is usually only employed to generate scenarios which are, effectively, digital surrogates of re-enactments. Despite stimulating conversations, and some differing views, within the MiPP team, I still do not believe that this is what the project is, or should be, doing: rather, we are seeking to demonstrate that it is *OK* to use conjecture or interpretation, provided that the provenance of the reconstruction in question is crystal clear – that a conjectured model of, say, an Iron Age round house dweller sweeping or querning is understood not to be based on direct empirical evidence, but rather derived from it, albeit by circuitous interpretive routes. Surely this should be the principle behind all archaeological illustration anyway?