CeRch seminar: Webometric Analyses of Social Web Texts: case studies Twitter and YouTube

Herewith a slightly belated report of the recent talk in the CeRch seminar series given by Professor Mike Thelwall of Wolverhampton University. Mike’s talk, Webometric Analyses of Social Web Texts: case studies Twitter and YouTube, concerned getting useful information out of social media, primarily by social science means: specifically, information about the sentiment of the communications on those platforms. His group produces software for text-based information analysis, making it easy to gather and process large-scale data, focusing on Twitter, YouTube (especially the textual comments) and the web in general, as well as the Technorati blog search engine and Bing. Such analysis shows how a website is positioned on the web, and gives insights into how its users are interacting with it.

In sentiment analysis, a computer programme reads text and predicts whether it is positive or negative in flavour, and how strongly that positivity or negativity is expressed. This is immensely useful in market research, and is widely employed by big corporations. It also goes to the heart of why social media work: they function well with human emotions, and sentiment analysis can track what role sentiments play in them. The sentiment analysis engine is designed for text that is not written with good grammar. At its heart is a list of 2,489 terms which are either normally positive or normally negative. Each has a ‘normal’ value, with ratings of -2 to -5. Mike was asked if it could be adapted to slang words, which often develop, and sometimes recede, rapidly. Experience is that it copes well with changing language over time: new words don’t have a big impact in the immediate term. However, the engine does not appear to work with sarcastic statements, whose diction might, linguistically, be the opposite of their meaning, nor with (for example) ‘typical British understatement’. This means that it does not work very well for news fora, where comments are often sarcastic and/or ironic (e.g. ‘David Cameron must be very happy that I have lost my job’). There is a need for contextual knowledge: e.g. ‘This book has a brilliant cover’ means ‘this is a terrible book’ in the context of the phrase ‘don’t judge a book by its cover’. Automating the analysis of such contextual minutiae would be a gigantic task, and the project is not attempting to do so.
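To make the term-list idea concrete, here is a minimal sketch of a lexicon-based scorer in Python. It is not the engine Mike described: the six-word lexicon, the strength values and the function name `score` are all invented for illustration, and using 1 and -1 to mean "no sentiment term found" is my own assumption. It simply reports the strongest positive and the strongest negative term in a text.

```python
import re

# Hypothetical mini-lexicon for illustration only. The real list has
# 2,489 terms, and these strength values are made up.
LEXICON = {
    "happy": 3,
    "brilliant": 4,
    "love": 3,
    "terrible": -4,
    "hate": -4,
    "lost": -2,
}


def score(text):
    """Return (strongest positive, strongest negative) term strengths.

    Assumption: 1 and -1 stand for 'no positive term' and
    'no negative term' respectively.
    """
    # Lower-case the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Look up every token that appears in the lexicon.
    strengths = [LEXICON[w] for w in words if w in LEXICON]
    positive = max([s for s in strengths if s > 0], default=1)
    negative = min([s for s in strengths if s < 0], default=-1)
    return positive, negative


print(score("David Cameron must be very happy that I have lost my job"))
print(score("This book has a brilliant cover"))
```

Even this toy version shows the sarcasm problem: the first sentence gets a clear positive score from ‘happy’, and the second is rated strongly positive on the strength of ‘brilliant’, because a word list alone knows nothing about context.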

Mike also discussed the Cyberemotions project. This looked at peaks in the use of individual words on Twitter, e.g. ‘Chile’ when the earthquake struck in February 2010. As might be expected, positivity decreased. But negativity increased by only 9%: it was suggested that this might have been to do with praise for the response of the emergency services, or good wishes to the Chilean people. Also, the very transience of social media means that people might not need to express sentiment one way or another. For example, simply mentioning the earthquake and its context would be enough to convey the message the writer needed to convey. Mike also talked about the sentiment engine’s analysis of YouTube. As a whole, most YouTube comments are positive; however, those individual videos which provoke many responses are frequently viewed negatively.

Try the sentiment engine (http://sentistrength.wlv.ac.uk). One wonders if it might be useful in XML/RDF projects such as SAWS, or indeed for book reviews on publications such as http://www.arts-humanities.net.

Digital Classicist: Classical studies facing digital research infrastructures: from practice to requirements

Apologies are due to Agiatis Benardou. I am a couple of weeks late posting my discussion of her paper in the Digital Classicist Seminar Series, Classical studies facing digital research infrastructures: from practice to requirements. Agiatis is from the Digital Curation Unit, part of the “Athena” Research Centre, and her talk focused in the main on the preparatory phase of DARIAH, the European Arts and Humanities Research Infrastructure project. She began by outlining her own research background in Classics, which contained very little computing (it surely can’t be coincidence that the digital humanities is so full of former and practicing archaeologists and classicists).

DARIAH is a technical and conceptual project, with the aim of providing a research infrastructure for the Arts and Humanities across Europe. In practice, it is an umbrella for other projects, involving a big effort in the areas of law and finance, as well as technical infrastructure. A key part of this is to ensure that scholars in the arts and humanities are supported at each stage of the research lifecycle. This means ensuring that the requirements at each stage are understood. The DCU was part of the technical workpackage in DARIAH, and was tasked with doing this. Its approach was to develop a conceptual framework to map user requirements, using an abstract model to represent the information practices within humanities research.

This included an empirical study of scholarly research activity. The main form of data collection was interviews with humanities scholars. The design of the study included transcription, coding and analysis of recordings of these interviews. Context was provided by a good deal of previous work in this area, in the form of user studies of information browsing behaviour. In the 1980s, this carried the assumption that most humanists were ‘lone scholars’, with little interest in, or need for, collaborative practices. This, however, gave way to an increasingly self-critical awareness of how humanists work, highlighting practices such as annotation, which *might* be for the consumption of the lone scholar, but which might equally be a means of communication, interpretation and thinking. This in turn led to a consideration of scholarly primitives: low-level, basic things that humanists do all the time and, often, at the same time. Agiatis cited the six types of information retrieval behaviour identified by D. Ellis, as revisited for the humanities by John Unsworth, whose list runs: discovering, associating, comparing, referring, sampling, illustrating and representing.

The DCU’s aim was to produce a map of who does what and how. If one has a research goal, for example to produce a commentary on Homer, what are the scholarly activities that one would need to achieve that, and what processes do those activities involve? To this end, Agiatis highlighted the following aspects that need to be mapped: Actor (researcher), Research activity, Research goal, information object, tool/service, format, and resource type. The properties that link these include hasType, Creates, partOf, Searches, refersTo and Scholarly Activity.
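One way to picture such a map is as a set of subject, property and object statements. The Python sketch below is my own illustration rather than the DCU’s model: only the entity types and property names come from the talk, while every instance name (such as `homer_commentary` or `annotate_text`), the helper functions, and the choice of which property links which kind of thing are assumptions made up for the Homer example.

```python
from collections import namedtuple

# One statement in the map: subject --prop--> obj.
Triple = namedtuple("Triple", ["subject", "prop", "obj"])

# Hypothetical instances for the Homer commentary example. Only the
# type names (Actor, Research goal, ...) and the property names
# (hasType, partOf, ...) are taken from the talk.
TRIPLES = [
    Triple("researcher", "hasType", "Actor"),
    Triple("homer_commentary", "hasType", "Research goal"),
    Triple("find_manuscripts", "hasType", "Research activity"),
    Triple("annotate_text", "hasType", "Research activity"),
    Triple("find_manuscripts", "partOf", "homer_commentary"),
    Triple("annotate_text", "partOf", "homer_commentary"),
    Triple("find_manuscripts", "Searches", "manuscript_catalogue"),
    Triple("annotate_text", "Creates", "annotation_set"),
    Triple("annotation_set", "refersTo", "iliad_text"),
]


def objects_of(subject, prop):
    """List everything that `subject` is linked to by `prop`."""
    return [t.obj for t in TRIPLES if t.subject == subject and t.prop == prop]


def activities_for(goal):
    """List the activities recorded as partOf the given research goal."""
    return [t.subject for t in TRIPLES if t.prop == "partOf" and t.obj == goal]


print(activities_for("homer_commentary"))
print(objects_of("annotate_text", "Creates"))
```

Asking which activities belong to a goal, and what each activity searches or creates, is then a simple filter over the statements, which is roughly the ‘who does what and how’ question in miniature.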

A meaningful map of these processes must include meaningful descriptions of information types. DARIAH therefore has to embrace multiple interconnected objects that need to be identified, represented and managed, so they can be curated and reached throughout the digital research lifecycle. In this regard, there is a distinction, second nature to most archaeologists, between the visual representation of information and hands-on access to objects.

The main interest of Agiatis’s paper for me was the possibilities the DCU’s approach holds for specific research problems. One could easily see, for example, how the www.arts-humanities.net Methods Taxonomy could be better represented as a set of processes rather than as a static group of abstract entities, as it is at the moment. But if one could specify the properties of a particular purpose, the approach would be even more useful: for example, one could test the efficacy of augmented reality by mapping the ways scholars engage with and use AR environments.