Digital Classicist: Aggregating Classical Datasets with Linked Data

Last week’s Digital Classicist seminar concerned the question of Linked Data and its application to data about inscriptions. In his paper, Aggregating Classical Datasets with Linked Data, David Scott of the Edinburgh Parallel Computing Centre described the Supporting Productive Queries for Research (SPQR) project, a collaboration between EPCC and CeRch at KCL. The premise is that inscriptions contain many different kinds of information: personal names (gods, emperors, officials etc.), places, concepts, and so on. When epigraphers and historians use inscriptions for historical research, they take a reflexive and highly unpredictable approach to building links – both implicit and explicit – between these different kinds of information. SPQR’s long-term aim is to support such searches, making it easier for classicists and epigraphers to establish links between inscriptions. Its case studies are the Heidelberger Gesamtverzeichnis, the Inscriptions of Aphrodisias, and the Inscriptions of Roman Tripolitania (the latter being the subject of a use case I undertook for the TEXTvre project last year).

There have been a number of challenges in preparing the data. Epigraphers, of course, are not computer scientists, and they therefore do not prepare their data in a way that makes it machine-readable. The data can therefore be fuzzy, incomplete, uncertain, and implicit or open to interpretation. Nor are epigraphers going to sit down and write programmes to do their analysis. They have highly interactive workflows that are difficult to predict, both methodologically and in terms of research questions: answering one question about inscriptions often leads on to further questions of which the original workflow took no account. Epigraphic data is therefore distributed and diversely represented. It can appear in Excel or Word, or in a relational database. It might be available via static or interactive webpages, or one might have to download a file. Yet there are overlaps in the content – in places and persons, for example – which might be separate or contemporaneous.

The SPQR approach is based on URIs: each subject and predicate is given a URI, and each object is either a URI or a literal. For example, the subject might identify a particular inscription, the predicate express the relationship ‘material is…’, and the object be the literal ‘white marble’. This approach allows the user to build pathways of interpretation through sub-object units of the data.
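The triple model described above can be sketched in a few lines of Python. The URIs here are illustrative placeholders I have invented for the example, not SPQR’s actual identifiers, and the pattern-matching function is a toy stand-in for a real triple store:

```python
# A minimal sketch of the subject–predicate–object model.
# All URIs are hypothetical placeholders, not SPQR's real identifiers.
# Each statement is a (subject, predicate, object) triple; objects may be
# URIs or literal values such as "white marble".
triples = [
    ("http://example.org/inscription/IRT013",
     "http://example.org/vocab/material", "white marble"),
    ("http://example.org/inscription/IRT013",
     "http://example.org/vocab/foundAt", "http://example.org/place/Lepcis"),
    ("http://example.org/place/Lepcis",
     "http://example.org/vocab/modernName", "Lebda"),
]

def query(triples, subject=None, predicate=None, obj=None):
    """Return every triple matching the pattern (None acts as a wildcard)."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Which inscriptions are made of white marble?
matches = query(triples,
                predicate="http://example.org/vocab/material",
                obj="white marble")
```

Because modern place names, ancient place names and inscription attributes all live in the same graph of triples, a follow-up question (“where was that found, and what is the site called today?”) is just another pattern match rather than a new SQL query against a differently shaped table.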

SPQR is looking at inscriptions marked up in EpiDoc. In EpiDoc one might find information on provenance; descriptions including date and language; edited texts; translations; findspots; and the material from which the inscriptions themselves were made. As my use case for IRT showed, the flexibility afforded by EpiDoc is of great value to digital epigraphers, but that flexibility can also count against consistent markup: an object’s material, for example, can be marked up in more than one way – two different representations of the same thing. SPQR is therefore re-encoding the EpiDoc using uniform descriptions.

The EpiDoc resources also contain references to findspots: names are given as ancientFindspot and modernFindspot (ancient findspots refer to the Barrington Atlas; modern names to GeoNames). This is an example of data being linked together: reference sets containing both ancient and modern places are queried simultaneously.

SPQR builds on the Linking and Querying Ancient Texts project, which used a relational database approach. The data – essentially the same three datasets now being used by SPQR – is stored as tables: each row describes a particular inscription, and the columns contain attribute information such as date, place and so on. To search across these, the user has to have all the tables available, or write an SQL query. This is not straightforward, since it relies on the data being consistently encoded and, as noted above, epigraphers using EpiDoc do not always encode things consistently.
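The markup-inconsistency problem can be illustrated with a small normalisation sketch. The element and attribute names below are simplified stand-ins I have made up for this example, not the actual EpiDoc schema; the point is only that a harvester has to anticipate each encoding variant before the data can be uniformly described:

```python
# Sketch of normalising two hypothetical encodings of an inscription's
# material. These element names are illustrative, not real EpiDoc.
import xml.etree.ElementTree as ET

variant_a = "<object><material>white marble</material></object>"
variant_b = '<object><support material="white marble"/></object>'

def extract_material(xml_text):
    """Return the material however it happens to be encoded, else None."""
    root = ET.fromstring(xml_text)
    el = root.find(".//material")            # variant A: a dedicated element
    if el is not None and el.text:
        return el.text.strip()
    el = root.find(".//support[@material]")  # variant B: an attribute
    if el is not None:
        return el.get("material")
    return None
```

Both variants yield the same value once normalised, which is what makes a uniform re-encoding – and hence reliable querying across the three corpora – possible.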

The visual interface being used by SPQR is Gruff. It uses a straightforward colour-coding approach: literals are yellow, objects are grey, and predicates are represented as arrows in different colours depending on the type of predicate.

SPQR Gruff interface

The talk was followed by a wide-ranging discussion, which mostly centred on the nature of the things to be linked. There seemed to be a high-level consensus that more needs to be done on the terminology behind the objects we are linking. If we are not careful, there is a danger that we will end up trying to represent the whole world (which would perhaps echo the big visions of some early adopters of CRM models a few years ago). As will no doubt be picked up in Charlotte Roueche and Charlotte Tupman’s presentation next week (which alas I will not be able to attend), all this comes down to defining units of information. EpiDoc, as a disciplined and rigorous mark-up schema, gives us the basis for this, but there need to be very strict guidelines for its application in any given corpus.