Data curation and digital preservation are often confused, but they are very different things. Terminology is a big problem in this area, especially where common terms from one domain – e.g. ‘curation’ in a museum or cultural heritage context – are used in another. So can the emerging debate on Big Data help us move forward on a definition of ‘digital curation’?
The current issue of ‘Foreign Affairs’ has a paper by Kenneth Cukier and Viktor Mayer-Schoenberger entitled ‘The Rise of Big Data: How It’s Changing the Way We Think About the World’. In it they argue that big data represents an epistemic change in how we do statistics, from the model of extrapolating general trends and patterns in populations from small, representative random samples, to generalised overviews of entire datasets using data mining. In this world, ‘N=all’. The latter, they argue, are both imperfect and about correlation rather than causation. For example, Google claims to be able to track flu outbreaks by correlating certain search terms, but it doesn’t claim to know the actual reason why people made those searches, which would be a ‘traditional’ statistical research question. Recently, however, Google’s method dramatically overestimated peak flu levels: a salutary reminder that correlation and causation are very different things.
Cukier and Mayer-Schoenberger argue that big data research means ‘giving up on clean carefully curated data and tolerating some messiness’. They also argue that the process of ‘datafication’ – capturing more and more forms of intangible processes such as friendships (as in Facebook likes), thoughts (Twitter) and professional relationships (LinkedIn) – means that this body of data is growing less formal even as it grows exponentially in volume.
For me this raises two questions:
1. What does this mean for a museum-focused definition of ‘curation’? Can we give up on cleanly curated museum and cultural heritage collections and tolerate messiness? If so, how, and where does that data come from?
2. By what processes can ‘the museum experience’ be ‘datafied’? I have an idea forming that this has to do, at least partly, with freeing some of the interaction between audience and collection from being specific to a time and place. For example, I don’t have to actually go to the British Museum to encounter all aspects of the experience of the Pompeii exhibition, as some of those aspects have been datafied by others (both employees of the BM and other visitors), and I can review them wherever or whenever I like.
The main question is: what does ‘big data curation’ actually mean? I am not sure I agree with the definition implied in the Cukier/Mayer-Schoenberger view, where it is precluded: that is, that a curated dataset is necessarily one that is ‘small data’, shaped, presented and processed by a series of well-understood human interventions into a human-readable narrative. However, they also make the very valid point that ‘in a world of big data, it is the most human traits that will need to be fostered – creativity, intuition, and intellectual ambition’. So whereas the present understanding of ‘curation’ in cultural heritage is the communication of a story or narrative of a collection of objects for an audience of specialists and/or non-specialists, where N can never = all, in big data terms it means taking the imperfections of correlation across patterns in big data and refining these by bridging with communities of experts: experts with the uncomputable human traits needed to take the broad brushstrokes that software tools are pulling out of our datafied world, and make worldly sense of them.