The week before last Silver and I went along to the Realising the Opportunities of Digital Humanities conference in Dublin. We gave a short presentation about linked data at the BBC then sat on a panel session attempting to answer questions. I've never been on a panel before but it's a bit like an interview: you only think of the answer you wanted to give once you've left the room.

Anyway, one person asked a question something like, "with all this data does 'content' become a second-class citizen?". At the time we were sat in the Royal Irish Academy library, which is three storeys of floor-to-ceiling books. The thought that all that humanity could ever become subservient to some descriptive data seemed so odd that I don't think anyone even answered the question. A bit like suggesting that if the library catalogue ever got good enough you could just burn the books.

A follow-up point was made by a person from RTE about the costs associated with digitising content. I think it's often hard to justify digitisation costs because the thing you end up with is pretty much the thing you started with, except in ones and zeros. And to my mind there are three steps to opening an archive. As a fag packet sketch it would look something like:

Step 1: Digitisation

As the RTE person said, digitisation is expensive and sometimes hard to justify. And I have no idea how to make it less expensive. Until content is digitised there's no way to get the economies of scale that computers and the web and the people on the web bring. And there's no real benefit in just digitising. You can put the resulting files on the web but until they link and are linked to and findable they're not in the web. To put stuff in the web you need links and to make links you need context so...

Step 2: Contextualisation

Once you have digitised files you need to somehow generate context and there seem to be three options:

  1. employ a team of librarians to catalogue content - which is great if you can but doesn't scale too well and can, occasionally, lead to systems which only other librarians can understand
  2. let loose the machines to analyse the content (OCR, speech to text, entity extraction, music detection, voice recognition, scene detection, object recognition, face recognition etc etc etc) - but machines can get things wrong
  3. build a community who are willing to help - but communities need nurturing (not managing, never managing)

The project I'm currently (kind of) working on has the research goal of finding a sweet spot between the last two: machine processing of audio content to build enough descriptive data to be corrected and enhanced by a community of users. Copying / pasting from an earlier comment:

We've been working on a project for BBC World Service to take 70,000 English-language programmes and somehow make them available on the web. The big problem is that whilst we have high-quality audio files we have no descriptive data about them. So nothing about the subject matter discussed, or who's in them, and in some cases not even when they were broadcast.

To fix this we've put the audio through a speech to text system which gives us a (very) rough transcript. We've then entity extracted the text against DBpedia / Wikipedia concepts to make some navigation by "tags". Because the speech to text step is noisy some of the tags extracted are not accurate but we're working with the World Service Global Minds panel (a community of World Service listeners) who are helping us to correct them.
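
To give a flavour of what that tagging step might look like, here's a minimal sketch (not our actual pipeline): it assumes the public DBpedia Spotlight annotate endpoint, an arbitrary confidence threshold and a hypothetical suggest_tags helper, and simply turns a rough transcript into a ranked list of candidate DBpedia concepts.

```python
import requests

# Sketch only: assumes the public DBpedia Spotlight endpoint and an
# arbitrary confidence threshold, not the real World Service pipeline.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def suggest_tags(transcript, confidence=0.4):
    """Suggest candidate DBpedia concepts ("tags") for a rough transcript."""
    response = requests.post(
        SPOTLIGHT_URL,
        data={"text": transcript, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()
    resources = response.json().get("Resources", [])
    # Keep one entry per DBpedia URI with its best similarity score, so a
    # person (or a community of listeners) can confirm or reject each tag.
    tags = {}
    for resource in resources:
        uri = resource["@URI"]
        score = float(resource["@similarityScore"])
        tags[uri] = max(score, tags.get(uri, 0.0))
    return sorted(tags.items(), key=lambda item: -item[1])

if __name__ == "__main__":
    rough_transcript = "reports from lagos say the president has announced a rise in oil prices"
    for uri, score in suggest_tags(rough_transcript):
        print(f"{score:.2f}  {uri}")
```

The low-scoring suggestions are exactly the ones you'd want to put in front of people first rather than publish as-is.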

Machines plus people is an interesting approach but, like digitisation, machine processing of content is expensive. Or at least difficult to set up unless you're a technology wizard. And there's a definite gap in the market for an out-of-the-box, cloud-based (sorry) solution (sorry) for content processing to extract useful metadata to build bare-bones navigation.

Step 3: Analysis

The line between contextualisation and analysis is probably not as clear-cut as I've implied here. But by analysis I mean any attempt to interrogate the content to make more meaning. I'm reminded of the recent Literature is not Data: Against Digital Humanities article by Stephen Marche:

But there is a deeper problem with the digital humanities in general, a fundamental assumption that runs through all aspects of the methodology and which has not been adequately assessed in its nascent theory. Literature cannot meaningfully be treated as data. The problem is essential rather than superficial: literature is not data. Literature is the opposite of data.

Data precedes written literature. The first Sumerian examples of written language are recordings of beer and barley orders. But The Epic of Gilgamesh, the first story, is the story of "the man who saw the deep," a hero who has contact with the ineffable. The very first work of surviving literature is on the subject of what can't be processed as information, what transcends data.

It also reminds me of a trip to a Music Information Retrieval conference a couple of years back. Every other session was accompanied by a click track and seemed to be another attempt to improve "onset beat detection" in some exotic music genre by two or three percent. I'm no musicologist but it felt like a strange approach to determining meaning from music. If you were ever asked to describe punk or hip-hop or acid house I doubt you'd start with chord sequences or rhythm patterns. For at least some genres the context of the culture and the politics (and the narcotics) feels like a better starting point.

So I think when we throw machines at the analysis part there's a tendency to reduce the ineffable to a meaningless set of atoms. Or a pile of salt. Machines have their place but it's their usual place: boring, repetitive tasks at speed.

Getting back to the diagram: once archive items are digitised, contextualised, findable and in the web they become social objects. People can link to them, share them, "curate" them, annotate them, analyse them, celebrate them, debunk them, take them, repurpose them, "remix" them and make new things from them. The best description I've seen of the possibilities of what could happen when people are allowed to meet archive items is Tony Ageh's recent speech on the Digital Public Space which is too good to quote and should just be read.

On the original question then: no, I don't think "content" (novels, poems, pamphlets, journalism, oral history, radio, film, photography etc) will ever become a second-class citizen to the data describing it. Digitisation costs are a lot easier to justify when coupled with contextualisation and analysis. And some jobs are best done by machines and some by people.