A History of Place 3: Dead Trees and Digital Content

The stated aim of this series of posts is to reflect on what it means to write a book in the Digital Humanities. This is not a subject one can address without discussing how digital content and paper publication can work together. I need to say at the outset that A History of Place does not have any digital content per se. Therefore, what follows is a more general reflection on what seems to be going on at the moment, perhaps framing what I’d like to do for my next book.

It is hardly a secret that the world of academic publication is not particularly well set up for the publication of digital research data. Of course the “prevailing wind” in these waters is the need for high-quality publications to secure scholarly reputation, and with it the keys to the kingdom of job security, tenure and promotion. As long as DH happens in universities, the need to publish in order to be tenured and promoted is not going to go away. There is also the symbiotically related need to satisfy the metrics imposed by governments and funding agencies. In the UK, for example, the upcoming Research Excellence Framework exercise explicitly sets out to encourage (ethically grounded) Open Access publication, but this does nothing to problematize the distinction, which is particularly acute in DH, between peer-reviewed research outputs (which can be digital or analogue) and research data, which is perforce digital only. Yet research data publication is a fundamental intellectual requirement for many DH projects and practitioners. There is therefore a paradox of sorts, a set of shifting and, at times, conflicting motivations and considerations, facing those contemplating such a publication.

It seems to me that journals and publishers are responding to this paradox in two ways. The first is to facilitate the publication of traditional, albeit short, articles online which draw on research datasets deposited elsewhere, and to require certain minimum standards of preservation, access and longevity for those datasets. Ubiquity Press’s Journal of Open Archaeological Data, as the name suggests, follows this model. It describes its practice thus:

JOAD publishes data papers, which do not contain research results but rather a concise description of a dataset, and where to find it. Papers will only be accepted for datasets that authors agree to make freely available in a public repository. This means that they have been deposited in a data repository under an open licence (such as a Creative Commons Zero licence), and are therefore freely available to anyone with an internet connection, anywhere in the world.

In order to be accepted, the “data paper” must reference a dataset which has been accepted for accession in one of 11 “recommended repositories”, including, for example, the Archaeology Data Service and Open Context. It recommends that more conventional research papers then reference the data paper.

The second response is more monolithic: the publisher takes on the data produced by or for the publication and hosts it online itself. One early adopter of this model is Stanford University Press’s digital scholarship project, which seeks to

[A]dvance a publishing process that helps authors develop their concept (in both content and form) and reach their market effectively to confer the same level of academic credibility on digital projects as print books receive.

In 2014, when I spent a period at Stanford’s Center for Spatial and Textual Analysis, I was privileged to meet Nicholas Bauch, who was working on SUP’s first project of this type, Enchanting the Desert. This wonderful publication presents and discusses the photographic archive of Henry Peabody, who visited the Grand Canyon in 1879 and produced a series of landscape photographs. Bauch’s work enriches the presentation and context of these photographs by showing them alongside viewsheds of the Grand Canyon from the points where they were taken, thus providing a landscape-level picture of what Peabody himself would have perceived.

However, to meet the mission SUP sets out in the passage quoted above requires significant resources, effort and institutional commitment over the longer term. It also depends on the preservation not only of the data (which JOAD does by linking to trusted repositories), but also the software which keeps the data accessible and usable. This in turn presents the problem encapsulated rather nicely in the observation that data ages like a fine wine, whereas software applications age like fish (much as I wish I could claim to be the source of this comparison, I’m afraid I can’t). This is also the case where a book (or thesis) produces data which in turn depends on a specialized third-party application. A good example of this would be 3D visualization files that need Unity or Blender, or GIS shapefiles which need ESRI plugins. These data will only be useful as long as those applications are supported.
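
To make that dependency concrete – and as a gesture towards mitigating it – here is a minimal sketch of the sort of conversion one might run before depositing GIS data: moving it out of the ESRI-defined shapefile format into open, plain-text formats that do not depend on any one vendor’s software. It assumes the geopandas library; the file names are hypothetical.

```python
# A minimal sketch, assuming geopandas is installed; file names are hypothetical.
import geopandas as gpd

# Read the shapefile (geopandas also needs its .shx and .dbf sidecar files).
gdf = gpd.read_file("survey_points.shp")

# GeoJSON: an open, human-readable standard that does not depend on one vendor.
gdf.to_file("survey_points.geojson", driver="GeoJSON")

# CSV with well-known text (WKT) geometries: readable even without GIS software.
gdf.assign(wkt=gdf.geometry.to_wkt()).drop(columns="geometry").to_csv(
    "survey_points.csv", index=False
)
```

The point is not that GeoJSON or CSV will live forever, only that plain, documented formats tend to age more like the wine than the fish.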

My advice, therefore, to anyone contemplating such a publication – which potentially includes advice to my future self – is to go for pragmatism. Bearing in mind the truism about wine and fish, and software dependency, it probably makes sense to pare down the functional aspect of any digital output and focus on the representational, i.e. the data itself. Ideally, I think, one would go down the JOAD route and deposit one’s data in a trusted repository, which has the professional skills and resources to keep the data available. Or, if you are lucky enough to work for an enlightened and forward-thinking Higher Education Institution, a better option still would be to have its IT infrastructure services accession, publish and maintain your data, so that it can be cross-referenced with your paper book which, in a wonderfully “circle of life” sort of way, will contribute to the HEI’s own academic standing and reputation.

One absolutely key piece of advice – probably one of the few aspects of this, in fact, that anyone involved in such a process would agree on – is that any Uniform Resource Identifiers (URIs) you use must be reliably persistent. This was the approach we adopted in the Heritage Gazetteer of Cyprus project, one of whose main aims was to provide a structure for URI references to toponyms that was both consistent and persistent, and thus citable – as my colleague Tassos Papacostas demonstrated in his online Inventory of Byzantine Churches on Cyprus, published alongside the HGC precisely to demonstrate the utility of persistent URIs for referencing. As I argue in Chapter 7 of A History of Place, in fact, developing resources which promote the “citability” of place, and which link the flexibility of spatial web annotations with the academic authority of formal gazetteer and library structures, is one of the key challenges for the spatial humanities itself.

I do feel that one further piece of advice needs a mention, especially when citing web pages rather than data: ensure the page is archived using the Internet Archive’s Wayback Machine, then cite the Wayback link, as advocated elsewhere earlier this year.

This is very sound advice, as it will ensure persistence even if the website itself disappears.
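
For what it’s worth, the archive-then-cite step can even be scripted. The sketch below – a minimal example, assuming the Python requests library and a purely hypothetical page URL – asks the Wayback Machine’s public “Save Page Now” endpoint to capture a page, then queries the availability API for the snapshot URL one would actually cite.

```python
# A minimal sketch, assuming the `requests` library; the page URL is hypothetical.
import requests

page = "https://example.org/some-cited-page"

# Ask the Wayback Machine to capture the page now.
requests.get("https://web.archive.org/save/" + page, timeout=60)

# Look up the closest archived snapshot and cite that rather than the live page.
resp = requests.get(
    "https://archive.org/wayback/available", params={"url": page}, timeout=30
)
closest = resp.json().get("archived_snapshots", {}).get("closest", {})
if closest.get("available"):
    print("Cite:", closest["url"])  # e.g. https://web.archive.org/web/<timestamp>/<page>
else:
    print("No snapshot yet; the capture may take a little while to appear.")
```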

Returning, however, to the publication of data alongside a print publication: the minimum one can do is simply purchase a domain name and publish the data oneself, alongside the book. This greatly reduces the risk of obsolescence, keeps you in control, and recognizes the fact that, by their very nature, books start to date the moment they are published.

All these approaches require a certain amount of critical reduction of the idea that publishing a book is a railway buffer marking the conclusion of a major part of one’s career. Remember – especially if you are early career – that this will not be the last thing you ever publish, digitally or otherwise. Until that bells-and-whistles hybrid digital/paper publishing model arrives, it is worth remembering that there are all sorts of ways data can be preserved, sustained and made a valuable part of a “traditional” monograph. The main thing for your own monograph is to find the approach that fits, and it may be that you have to face down the norms and expectations of the traditional academic monograph and settle for something that works, as opposed to something that is perfect.

Sourcing GIS data

Where does one get GIS data for teaching purposes? This is the sort of question one might ask on Twitter. However, while, like many, I have learned to overcome, or at least creatively ignore, the constraints of 140 characters, that can’t really be done for a question this broad, or one with as many attendant sub-issues. That said, this post was finally edged into existence by a Twitter follow from “Canadian GIS & Geomatics Resources” (@CanadianGIS) – so many thanks to them for the unintended prod. The website linked from that account states:

I am sure that almost any geomatics professional would agree that a major part of any GIS are the data sets involved. The data can be in the form of vectors, rasters, aerial photography or statistical tabular data and most often the data component can be very costly or labor intensive.

Too true. And as the university term ends, reviewing the issue from the point of view of teaching seems apposite.

First, of course, students need to know what a shapefile actually is. The shapefile is a basic building block of GIS: the dataset in which an individual map layer lives. Points, lines, polygons: Cartesian geography is what makes the world go round – or at least the digital world, if we accept the oft-quoted statistic that 80% of all online material is in some way georeferenced. I have made various efforts to establish the veracity (or otherwise) of this statistic, and if anyone has any leads, I would be most grateful if you would share them with me by email or, better still, in the comments section here. Surely it can’t be any less than that now, with the emergence of mobile computing and the saturation of the 4G smartphone market. Anyway…
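
Before moving on: the few lines below are roughly what a student sees on first opening a shapefile and inspecting what it actually contains. This is a minimal sketch assuming the geopandas library; the file name is hypothetical, though any of the downloads mentioned below would behave the same way.

```python
# A minimal sketch, assuming geopandas is installed; the file name is hypothetical.
import geopandas as gpd

gdf = gpd.read_file("open_roads_sample.shp")

print(gdf.crs)                 # the coordinate reference system, e.g. EPSG:27700 for OS data
print(gdf.geom_type.unique())  # Point, LineString or Polygon features
print(gdf.head())              # the attribute table: one row per feature, plus a geometry column
```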

In my postgraduate module on digital mapping, part of a Digital Humanities MA programme, I have used the Ordnance Survey Open Data resources; Geofabrik, an on-demand batch download service for OpenStreetMap data; Web Feature Service data from Westminster City Council; and continental coastline data from the European Environment Agency. The first two in particular are useful, as they provide different perspectives: official central mapping versus open-source/crowdsourced geodata. But in the expediency required of teaching a module, their main virtues are that they are free, (fairly) reliable and malleable, and can be delivered straight to the student’s machine or classroom PC (infrastructure problems aside – but that’s a different matter) and loaded into a package such as QGIS. I also use some shapefiles – specifically point files – that I created myself. Students should also be encouraged to consider where the data comes from and how it was made. This seems to me the most important aspect of geospatial work within the Digital Humanities. The data is out there and it can be downloaded, but to understand what it actually *is*, what it actually means, you have to create it. That can mean writing Python scripts to extract toponyms, considering how place is represented in a text, or poring over Google Earth to identify latitude/longitude references for archaeological features.
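
To give a flavour of that “create it yourself” exercise, here is a hedged sketch of one possible pipeline (not the module’s actual code): spaCy’s named-entity recognizer pulls candidate toponyms out of a text, a small hand-made gazetteer supplies coordinates looked up by eye in Google Earth, and geopandas writes the result out as a point shapefile for QGIS. The file names, gazetteer entries and coordinates are all purely illustrative.

```python
# A hedged sketch; assumes spaCy (with the en_core_web_sm model), shapely and
# geopandas are installed. File names, places and coordinates are illustrative.
import spacy
import geopandas as gpd
from shapely.geometry import Point

nlp = spacy.load("en_core_web_sm")

with open("travel_account.txt", encoding="utf-8") as f:  # hypothetical source text
    doc = nlp(f.read())

# Keep the entities spaCy labels as geopolitical entities (GPE) or locations (LOC).
toponyms = sorted({ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")})

# A tiny hand-made gazetteer of name -> (longitude, latitude), e.g. from Google Earth.
gazetteer = {"Nicosia": (33.38, 35.17), "Famagusta": (33.95, 35.12)}

records = [
    {"name": name, "geometry": Point(*gazetteer[name])}
    for name in toponyms
    if name in gazetteer
]

if records:
    gdf = gpd.GeoDataFrame(records, geometry="geometry", crs="EPSG:4326")
    gdf.to_file("toponyms.shp")  # a point shapefile ready to load into QGIS
```

The pedagogical value lies less in the script itself than in the decisions it forces: which entities count as places, which gazetteer to trust, and what is lost when a textual sense of place is reduced to a pair of coordinates.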

This goes to the heart of what it means to create geodata, certainly in the Digital Humanities. Like the Ordnance Survey and Geofabrik data, much of the geodata around us on the internet arrives pre-packaged, with all its assumptions hidden from view. Agnieszka Leszczynski, whose excellent work on the distinction between quantitative and qualitative geography I have been re-reading in preparation for various forthcoming writings, calls this a ‘datalogical’ view of the world. Everything is abstracted into computable points, lines and polygons (or rasters). Such data is abstracted from the ‘infological’ view of the world as understood by the humanities. As Leszczynski puts it: “The conceptual errors and semantic ambiguities of representation in the infological world propagate and assume materiality in the form of bits and bytes”[1]. It is this process of assumption that a good DH module on digital mapping must address.

In the course of this module I have also become aware of important intellectual gaps in this sort of provision. Nowhere, for example, in either the OS or Geofabrik datasets is there information on British public Rights of Way (PROWs). I am going to need this data later in the summer for my own research on the historical geography of corpse roads (more here in the future, I hope). But a bit of Googling turned up the following blog reply from OS at the time of the OS OpenData release in April 2010:

I’ve done some more digging on ROW information. It is the IP of the Local Authorities and currently we have an agreement that allows us to to include it in OS Explorer and OS Landranger Maps. Copies of the ‘Definitive Map’ are passed to our Data Collection and Management team where any changes are put into our GIS system in a vector format. These changes get fed through to Cartographic Production who update the ROW information within our raster mapping. Digitising the changes in this way is actually something we’ve not been doing for very long so we don’t have a full coverage in vector format, but it seems the answer to your question is a bit of both! I hope that makes sense![2]

So… teaching GIS in the arcane backstreets of the (digital) spatial humanities still means learning to see what is not there, thanks to intellectual property restrictions, as well as what is.

[1] Leszczynski, Agnieszka. “Quantitative Limits to Qualitative Engagements: GIS, Its Critics, and the Philosophical Divide.” The Professional Geographer 61.3 (2009): 350–365.

[2] https://www.ordnancesurvey.co.uk/blog/2010/04/os-opendata-goes-live/

Blackouts, copycratism and intellectual property

This seems as good a week as any to address the issue of copyright, what with the Wikipedia et al. blackout this week. Perhaps, like many non-Americans, I find the exact details of SOPA and PIPA require a little reaching for, but the premise is that American-based websites would be banned from supporting – in the form of funding, advertising, links or other assistance – non-US websites which host ‘pirated content’. This could take the form of forcing search engines such as Google to stop indexing such sites, barring requests from clients in the US from resolving the domain names of targeted foreign sites, or shutting down ‘offending’ sites in the US. The bills’ many detractors say that this is too broad a brush, potentially allowing unscrupulous commercial operators to target US websites for their own purposes, and also that such sites could be targeted even if they are not knowingly hosting pirated content. Think of Facebook having to individually clear each and every picture and video uploaded to it anywhere in the world, and assuming legal responsibility for their presence there.

This all seems a bit weird. It is as if the UK Parliament decided to revisit the 1865 Locomotives Act, which limited any mechanically propelled vehicle on the highway to 4mph and stipulated that an authorized crew member should walk before it holding a red flag. Imagine Parliament reasserting this speed limit for, say, the M6, and stipulating that a bigger flag was needed. The interesting thing about these bills is that they come straight from the ink-in-the-blood mentality of zillionaire copycrats (lit. ‘one who rules through the exercise of copyright’) like Rupert Murdoch – who, rather predictably, tweeted “Seems blogsphere has succeeded in terrorising many senators and congressmen who previously committed … Politicians all the same” – and the Motion Picture Association of America. There is still, in some quarters, a Mauer im Kopf (a ‘wall in the head’) which says ‘it is a bad thing to share my data’, and which, at least in some ways, transcends potential financial loss. What we are seeing, in some quarters of the digitisation world at least, are smarter ways to regulate *how* information is shared on the internet, and to ensure attribution when it is.

How do this week’s debates relate to scholarly communication in the digital humanities? Here, there seems to be an emerging realization that, if we actually give up commercial control of our products, then not only will the sun continue to rise in the east and set in the west, but our profiles, and thus our all-important impact factors, will rise. Witness Bethany Nowviskie’s thoughtful intervention a little less than a year ago, or the recent request from the journal Digital Humanities Quarterly to its authors to allow commercial re-use of material they have contributed – for example, for indexing by proprietary metadata registries and repositories. I said that was just fine. For me, the danger only emerges when one commits one’s content to being available only through commercial channels, which is not what DHQ was proposing.

So, beyond my contributions to DHQ, what lessons might we learn from applying the questions raised by this week’s events to content provided by movie studios, pop stars, commercial publishers (or indeed the writings of people that other people have actually heard of)? We should recognise that there is a conflict between good old-fashioned capitalist market forces and our – quite understandable – nervousness about Giving Up Control. Our thoughts are valuable, and not just to us. The way out is not to dig in our heels and resist the pressure; rather, I feel we should see where it leads us. If Amazon (worth $78.09 billion in 2011) can do it for distribution by riding on long-tail marketing, where are the equivalent business models for intellectual property in the digital age, and especially in scholarly communication? We need to look for better ways to identify our intellectual property while setting it free for others to use. Combining digital data from a particular resource could lead to increased sales of (full) proprietary versions of that resource, if the content is mounted correctly and the right sort of targeting is achieved. Clearly there is no one answer: it seems that a whole new discipline will have to emerge (indeed, must emerge) around how scholarly digital content is, and can be, reused. We are perhaps seeing early indications of this discipline in namespacing, and in the categorisation of ideas in super-refined, multi-faceted CC licences, but these will only ever be part of the answer.

But the first stage is to get over the Mauer im Kopf, and I suggest the first step towards that is to allow ourselves to believe that the exploitation of web-mounted content is equivalent to citation, taken to the logical extreme that technology allows. We have spent years developing systems for managing citation, properly attributing ideas and the authorship of concepts, and avoiding plagiarism; now we base our academic crediting systems on these conventions and terrorise our students with the consequences of deviating from them. We need to do the same for commercial and non-commercial reuse of data, applied across the whole spectrum that the concept of ‘reuse’ implies.

Otherwise, we are simply legislating for men with flags to walk in front of Lamborghinis.