Sourcing GIS data

Where does one get GIS data for teaching purposes? This is the sort of question one might ask on Twitter. However, while, like many, I have learned to overcome, or at least creatively ignore, the constraints of 140 characters, it can’t really be done for a question this broad, or with as many attendant sub-issues. That said, this post was finally edged into existence by a Twitter follow, from “Canadian GIS & Geomatics Resources” (@CanadianGIS). So many thanks to them for the unintended prod. The linked website of this account states:

I am sure that almost any geomatics professional would agree that a major part of any GIS are the data sets involved. The data can be in the form of vectors, rasters, aerial photography or statistical tabular data and most often the data component can be very costly or labor intensive.

Too true. And as the university term ends, reviewing the issue from the point of view of teaching seems apposite.

First, of course, students need to know what a shapefile actually is. The shapefile is the basic building block of GIS: the dataset format in which individual map layers live. Points, lines, polygons: Cartesian geography is what makes the world go round – or at least the digital world, if we accept the oft-quoted statistic that 80% of all online material is in some way georeferenced. I have made various efforts to establish the veracity (or otherwise) of this statistic, and if anyone has any leads, I would be most grateful if you would share them with me by email or, better still, in the comments section here. Surely it can’t be any less than that now, with the emergence of mobile computing and the saturation of the 4G smartphone market. Anyway…
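The anatomy of a shapefile is in fact simple enough to sketch in code. The snippet below is a minimal illustration rather than a production reader: it builds the bytes of a toy .shp main file containing 2D points and parses the 100-byte header back out, using only Python’s standard library. A real shapefile also travels with .shx and .dbf sidecar files, which are omitted here, and the coordinates used are arbitrary examples.

```python
import struct

def build_point_shapefile(points):
    """Build the bytes of a minimal .shp main file containing 2D points.

    Each record is an 8-byte big-endian header (record number, content
    length in 16-bit words) followed by the shape type (1 = Point) and
    two little-endian doubles for X and Y.
    """
    records = b""
    for i, (x, y) in enumerate(points, start=1):
        content = struct.pack("<idd", 1, x, y)                 # shape type 1 = Point
        records += struct.pack(">ii", i, len(content) // 2) + content
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    file_len_words = (100 + len(records)) // 2                 # length in 16-bit words
    header = (
        struct.pack(">i", 9994)                                # magic file code
        + b"\x00" * 20                                         # five unused int32 fields
        + struct.pack(">i", file_len_words)
        + struct.pack("<ii", 1000, 1)                          # version, shape type
        + struct.pack("<8d", min(xs), min(ys), max(xs), max(ys),
                      0.0, 0.0, 0.0, 0.0)                      # bbox; Z/M ranges unused
    )
    return header + records

def read_header(data):
    """Parse the 100-byte shapefile main file header."""
    file_code, = struct.unpack_from(">i", data, 0)
    version, shape_type = struct.unpack_from("<ii", data, 28)
    bbox = struct.unpack_from("<4d", data, 36)
    return {"file_code": file_code, "version": version,
            "shape_type": shape_type, "bbox": bbox}

# Two illustrative points, in lon/lat order as shapefiles expect.
shp = build_point_shapefile([(-0.1276, 51.5072), (-1.2577, 51.7520)])
hdr = read_header(shp)
```

Even this toy version makes the ‘datalogical’ point nicely: everything about the header is geometry and byte-order, and nothing about what the points mean.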

In my postgraduate course on digital mapping, part of a Digital Humanities MA programme, I have used the Ordnance Survey Open Data resources, Geofabrik (an on-demand batch download service for OpenStreetMap data), Web Feature Service data from Westminster City Council, and continental coastline data from the European Environment Agency. The first two in particular are useful, as they provide different perspectives: the central mapping versus the open source/crowdsourced geodata angles respectively. But in the expediency required of teaching a module, their main virtues are that they’re free, (fairly) reliable and malleable, and can be delivered straight to the student’s machine or classroom PC (infrastructure problems aside – but that’s a different matter) and loaded into a package such as QGIS. But I also use some shapefiles, specifically point files, I created myself. Students should also be encouraged to consider how (and where) the data comes from. This seems to be the most important aspect of geospatial work within the Digital Humanities. This data is out there, it can be downloaded, but to understand what it actually *is*, what it actually means, you have to create it. That can mean writing Python scripts to extract toponyms, considering how place is represented in a text, or poring over Google Earth to identify latitude/longitude references for archaeological features.
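Toponym extraction in its very simplest classroom form can be sketched as matching a text against a small gazetteer. The place names and coordinates below are purely illustrative stand-ins; a real exercise would draw on a resource such as GeoNames or OS Open Names, and would have to wrestle with ambiguity (which ‘Boston’?) in a way this sketch does not.

```python
import re

# A toy gazetteer mapping place names to WGS84 (lat, lon) pairs.
# Entries and coordinates are illustrative only.
GAZETTEER = {
    "London": (51.5072, -0.1276),
    "Westminster": (51.4975, -0.1357),
    "Oxford": (51.7520, -1.2577),
}

def extract_toponyms(text):
    """Return (name, (lat, lon)) pairs for gazetteer entries found in text.

    Word boundaries stop 'Oxford' matching inside 'Oxfordshire';
    everything subtler (case variants, ambiguity) is left to the student.
    """
    hits = []
    for name, coords in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            hits.append((name, coords))
    return hits

sentence = "The old corpse road ran from London towards Oxford."
found = extract_toponyms(sentence)
```

The pedagogical value lies less in the matching itself than in what it forces students to confront: every pair of coordinates emitted here is an interpretive claim about what a text means by a place name.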

This goes to the heart of what it means to create geodata, certainly in the Digital Humanities. Like the Ordnance Survey and Geofabrik, much of the geodata around us on the internet arrives pre-packaged and with all its assumptions hidden from view. Agnieszka Leszczynski, whose excellent work on the distinction between quantitative and qualitative geography I have been re-reading as part of preparation for various forthcoming writings, calls this a ‘datalogical’ view of the world. Everything is abstracted as computable points, lines and polygons (or rasters). Such data is abstracted from the ‘infological’ view of the world, as understood by the humanities. As Leszczynski puts it: “The conceptual errors and semantic ambiguities of representation in the infological world propagate and assume materiality in the form of bits and bytes”[1]. It is this process of assumption that a good DH module on digital mapping must address.

In the course of this module I have also become aware of important intellectual gaps in this sort of provision. Nowhere, for example, in either the OS or Geofabrik datasets is there information on British Public Rights of Way (PROWs). I’m going to be needing this data later in the summer for my own research on the historical geography of corpse roads (more here in the future, I hope). But a bit of Googling turned up the following blog reply from OS at the time of the OS data release in April 2010:

I’ve done some more digging on ROW information. It is the IP of the Local Authorities and currently we have an agreement that allows us to to include it in OS Explorer and OS Landranger Maps. Copies of the ‘Definitive Map’ are passed to our Data Collection and Management team where any changes are put into our GIS system in a vector format. These changes get fed through to Cartographic Production who update the ROW information within our raster mapping. Digitising the changes in this way is actually something we’ve not been doing for very long so we don’t have a full coverage in vector format, but it seems the answer to your question is a bit of both! I hope that makes sense![2]

So… teaching GIS in the arcane backstreets of the (digital) spatial humanities still means seeing what is not there due to IP as well as what is.

[1] Leszczynski, Agnieszka. “Quantitative Limits to Qualitative Engagements: GIS, Its Critics, and the Philosophical Divide.” The Professional Geographer 61.3 (2009): 350-365.


Blackouts, copycratism and intellectual property

This seems as good a week as any to address the issue of copyright, what with the Wikipedia et al. blackout this week. Perhaps like many non-Americans, I find that the exact details of SOPA and PIPA require a little reaching for, but the premise is that American-based websites would be banned from supporting non-US websites which host ‘pirated content’, whether with funding, advertising, links or other assistance. This could take the form of forcing search engines such as Google to stop indexing such sites, of barring DNS resolution of targeted foreign sites for clients in the US, or of shutting down ‘offending’ sites in the US. The bills’ many detractors say that this is too broad a brush, potentially allowing unscrupulous commercial operators to target US websites for their own purposes, and also that such sites could be targeted even if they are not knowingly hosting pirated content. Think of Facebook having to individually clear every picture and video uploaded to it anywhere in the world, and assuming legal responsibility for its presence there.

This all seems a bit weird. It is as if the UK Parliament decided to revisit the 1865 Locomotives Act, which limited any mechanically-propelled vehicle on the highway to 4mph, and stipulated that an authorized crew member should walk before it holding a red flag. Imagine Parliament reasserting this speed limit for, say, the M6, and stipulating that a bigger flag was needed. The interesting thing about these bills is that they come straight from the ink-in-the-blood mentality of zillionaire copycrats (lit. ‘one who rules through the exercise of copyright’) like Rupert Murdoch – who, rather predictably, tweeted “Seems blogsphere has succeeded in terrorising many senators and congressmen who previously committed … Politicians all the same” – and the Motion Picture Association of America. There is still, in some quarters, a Mauer im Kopf (a ‘wall in the head’) which says ‘it is a bad thing to share my data’ and which, at least in some ways, transcends potential financial loss. What we are seeing, in some quarters of the digitisation world at least, are smarter ways of regulating *how* information is shared on the internet, and of ensuring attribution when it is.

How do this week’s debates relate to scholarly communication in the digital humanities? Here, there seems to be an emerging realization that, if we actually give up commercial control of our products, then not only will the sun continue to rise in the east and set in the west, but our profiles, and thus our all-important impact factors, will rise. Witness Bethany Nowviskie’s thoughtful intervention a little less than a year ago, or the recent request from the journal Digital Humanities Quarterly to its authors to allow commercial re-use of material they have contributed, for example for indexing by proprietary metadata registries and repositories. I said that was just fine. For me, the danger only emerges when one commits one’s content to being available only through commercial channels, which DHQ was not proposing.

So, beyond my contributions to DHQ, what lessons might we learn from applying the questions raised by this week’s events to content provided by movie studios, pop stars, commercial publishers (or indeed the writings of people that other people have actually heard of)? We should recognise that there is a conflict between good old-fashioned capitalist market forces and our – quite understandable – nervousness about Giving Up Control. Our thoughts are valuable, and not just to us. The way out is not to dig our heels in and resist the pressure; rather, I feel we should see where it leads us. If Amazon (net worth in 2011 $78.09 billion) can do it for distribution by riding on long-tail marketing, where are the equivalent business models for IP in the digital age, and especially in scholarly communication? We need to look for better ways to identify our intellectual property, while setting it free for others to use. Combining digital data from a particular resource could lead to increased sales of (full) proprietary versions of that resource, if the content is mounted correctly and the right sort of targeting achieved. Clearly there is no one answer: it seems that there will be (must be) a whole new discipline emerging in how scholarly digital content is, and can be, reused. We are perhaps seeing early indications of this discipline in namespacing, and in the categorisation of ideas in super-refined multi-faceted CC licences, but these will only ever be part of the answer.

But the first stage is to get over the Mauer im Kopf, and I suggest the first step for that is to allow ourselves to believe that the exploitation of web-mounted content is equivalent to citation, but taken to the logical extreme that technology allows. We have spent years developing systems for managing citation, properly attributing ideas and the authorship of concepts, and avoiding plagiarism: now we base our academic crediting systems on these conventions and terrorise our students with the consequences of deviating from them. We need to do the same for commercial and non-commercial reuse of data, applied across the whole spectrum that the concept of ‘reuse’ implies.

Otherwise, we are simply legislating for men with flags to walk in front of Lamborghinis.