More thoughts on open music data

Occasioned by someone describing media catalogue type data as the “crown jewels”. It is not the crown jewels. It is, at best, a poster pointing out the attractions of the Tower of London.

If any data deserves the description of crown jewels it’s your customer relationship data.

But since Amazon, Apple, Facebook and Google probably know more about your users / customers / audience / fan base than you do, you’ve probably already accidentally outsourced that anyway…

Longer thoughts over here

Events, causation, articles, reports, stories, repetition, insinuation, supposition and journalism as data

In a conversation with various folks around ontologies for news I went a bit tangential and tried to model news as I thought it should be rather than how it is. Which was probably not helpful. And left me with a bee in my bonnet. So…

Some events in the life of Chris Huhne

  1. In March 2003 he was clocked speeding somewhere in Essex. Already having 9 points on his licence a conviction would have seen him banned from driving so…
  2. …shortly after his then wife, Vicky Pryce, was declared to have been driving at the time of the speeding incident
  3. 16 days after the speeding incident he was caught again for using a mobile phone whilst driving and banned anyway
  4. In May 2005 he was elected to Parliament as the representative for Eastleigh
  5. Also in May 2005 Ms Pryce told a friend that Mr Huhne had named her as the driver without her consent
  6. Between October and December 2007 he stood for leadership of the Lib Dems
  7. At some point (I can’t track down) he began an affair with his aide Carina Trimingham
  8. In June 2010 he was clocked again, this time by the press emerging after spending the night at Ms Trimingham’s home
  9. A week later Ms Pryce filed for divorce
  10. In May 2011 The Sunday Times printed allegations that Mr Huhne had persuaded someone to pick up his driving points
  11. In the same month Labour MP Simon Danczuk made a formal complaint about the allegation to the police
  12. At some point after this there was a series of text messages between Mr Huhne and his son where his son accused him of lying and setting up Ms Pryce
  13. In February 2012 both Mr Huhne and Ms Pryce were charged with perverting the course of justice
  14. In June 2012 Mr Huhne and Ms Pryce announced they’d plead not guilty with Ms Pryce claiming Mr Huhne had coerced her into taking his penalty points
  15. In February 2013 the trial began and on the first day Mr Huhne changed his plea to guilty. He also resigned his parliamentary seat
  16. The trial of Ms Pryce continued. And collapsed shortly after when the jury failed to agree. Shortly after a second trial found her guilty
  17. In late February the by election resulting from the resignation of Mr Huhne took place
  18. And in March 2013 they were both sentenced to 8 months in prison

Some of the events went on to become part of other storylines. For a brief while Mr Huhne’s driving ban for using a mobile phone at the wheel became part of a “Government makes a million a month from drivers using mobiles” story (at least for the Daily Mail), the collapse of the first trial of Ms Pryce became a story about failures in the trial by jury system and the result of the by election became part of a story about the rise of minority parties in austerity hit Europe.

Anyway this list of events is as partial as any other. Many more things happened (in public and in private) and some of the events listed were really lots of little events bundled up into something bigger. But that’s the trouble with events: they quickly go fractal because everything is one. As Dan said, “it’s good to think about events but it’s good to stop thinking about them too.” I’m not quite there yet.

Anyway, boiling things down further to fit in a picture:

Causation and influence

For every event there’s a fairly obvious model with times, locations, people, organisations, factors and products. And (mostly) the facts expressed around events are agreed on across journalistic outlets.

The more interesting part (for me) is the dependencies and correlations that exist between events, because “why” is always the most interesting question and “because” the most interesting answer. Getting the Daily Mail and The Guardian to agree that austerity is happening is relatively easy; getting them to agree on why, and on that basis what should happen next, is much more difficult.

The same picture this time with arrows. The arrows aren’t meant to represent “causality”; the fact that Mr Huhne was elected did not cause him to resign. But without him being elected he couldn’t have resigned so there’s some connection there. Let’s say “influence”:

Articles, reports and stories

The simplest model for news would bump most of the assertions (who, where, when etc.) to the events and hang articles off them, stitched together with predicates like depicts, reports or analyses. But whilst news organisations make great claims around reports and breaking news, journalists don’t talk about writing articles and rarely talk about writing reports. Journalists write stories, usually starting from a report about an event but filling in background events and surmising possible future events.
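As a rough sketch of that simplest model (not anything a news organisation actually runs), events carry the who and when, "influence" arrows point back at earlier events, and articles hang off events via a predicate. The class names, field names and event keys are made up for illustration, and the arrow from the guilty plea to the resignation is my assumption about how the picture would be drawn.

```python
from dataclasses import dataclass, field
from enum import Enum


class Predicate(Enum):
    """How an article relates to an event."""
    DEPICTS = "depicts"
    REPORTS = "reports"
    ANALYSES = "analyses"


@dataclass
class Event:
    key: str
    when: str                       # free text, since news dates are often partial
    label: str
    people: list[str] = field(default_factory=list)
    influenced_by: list[str] = field(default_factory=list)  # keys of earlier events


@dataclass
class Article:
    headline: str
    links: list[tuple[Predicate, str]] = field(default_factory=list)  # (predicate, event key)


def background(events: dict[str, Event], key: str) -> list[str]:
    """Follow the influence arrows backwards from one event and collect
    every earlier event connected to it."""
    seen: list[str] = []
    stack = list(events[key].influenced_by)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(events[current].influenced_by)
    return seen


events = {
    e.key: e
    for e in [
        Event("elected", "May 2005", "Elected MP for Eastleigh", ["Chris Huhne"]),
        Event("charged", "February 2012", "Charged with perverting the course of justice",
              ["Chris Huhne", "Vicky Pryce"]),
        Event("plea", "February 2013", "Changed plea to guilty", ["Chris Huhne"],
              influenced_by=["charged"]),
        Event("resigned", "February 2013", "Resigned parliamentary seat", ["Chris Huhne"],
              influenced_by=["elected", "plea"]),
    ]
}

# A report hangs off one event; the "background" is derived, not rewritten.
report = Article("Huhne resigns seat", links=[(Predicate.REPORTS, "resigned")])
print(background(events, "resigned"))
```

The point of the sketch is that the background to a report falls out of the arrows between events, rather than being retyped into every article.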

So an article written around the time of Mr Huhne’s resignation would look less like this:

and more like this:

Repetition, insinuation and supposition

The average piece of journalism is 10% reporting new facts and 90% repetition, insinuation and supposition where correlation and causation between events are never made explicit. Events from the storyline are hand picked and stitched together with a thin thread of causality. Often it’s enough to just mention two events in close proximity for the connections between them to be implied. The events you choose to mention and the order you mention them in give the first layer of editorial spin.

And the claims you choose to make about an event and its actors are the second level. If there’s a female involved and she’s under 35 it’s usually best to mention her hair colour. “Bisexual” scores triple points. We know what we’re meant to think.

The Daily Mail took insinuation to new heights with the collapse of Ms Pryce’s first trial, printing a “story” about the ethnic make-up of the jury telling its readers:

Of the eight women and four men on the Vicky Pryce jury, only two were white – the rest appeared to be of Afro-Caribbean or Asian origin.

The point they were trying to make and how the appointment of a jury of certain skin colour might have led to the collapse of the trial was left as an exercise.

Sports journalism seems particularly attracted to insinuation and supposition. Maybe it’s because their events (and sometimes even the outcomes of those events) are more predictable than in most other news whilst the actual facts are mainly locked inside dressing rooms and boardrooms. But Rafa Benitez getting slightly stroppy in a news conference turned into, “Rafa out by the weekend, Grant to take over until the end of the season and Jose to return” headlines by the next day. None of which turned out to be true. Yet.

As Paul pointed out, the article as repetition of storyline and makeshift crystal ball wasn’t always true. In the past newspapers printed many small reports per page. This isn’t the best image but was the best image I could find without rights restrictions:

Photo via Boston Public Library cc-by-nc-nd

Neither of us knew enough about newspaper history to know when this changed or why it changed. Presumably there are good business reasons why articles stopped being reports and started being stories. We guessed that it might have been due to falling paper and printing prices meaning more space to fill but without evidence that’s just insinuation too.

To an outside observer the constant re-writing of “background” seems tedious to consume and wasteful to produce. Especially where the web gives us better tools for managing updates, corrections and clarifications. Maybe it’s because most news websites are a by-product of print production where articles are still commissioned, written and edited to fill a certain size on a piece of paper and are just re-used on digital platforms. But even news websites with no print edition follow the same pattern. Maybe it’s partly an SEO thing with journalists and editors trying to cram as many keywords into a news story as possible but surely one article per storyline with frequent updates would pick up more inbound links over time than publishing a new article every time there’s a “development”? It seems to work for Wikipedia. (Although that said, Google news search seems to reward the publishing of new articles over the updating of existing ones.) Or maybe it’s all just unintentional. Someone at the meeting (I forget who) mentioned “lack of institutional memory” as one possible cause of constant re-writing.

But in a “do what you do best and link to the rest” sense, constantly rewriting the same things doesn’t make sense unless what you do best is repetition.

An aside on television

Television producers seem to feel the same pull toward repetition: this is what we’re about to show you, this is us showing it, this is what we’ve just shown you. I have a secret addiction to block viewing (I think the industry term is binge viewing) episodes of Michael Portillo’s Great British Railway Journeys but for every 30 minute episode there’s 10 minutes of filler and 20 minutes of new “content”.

Interestingly the Netflix commissioned series assume binge viewing as a general pattern so have dropped the continuity filler and characterisation repetition and get straight into the meat of the story. Nothing similar seems to be happening with news yet but I’m an old fashioned McLuhanist and believe the medium and the message are inextricably tied so maybe one day…

Journalism as data

Over the last couple of years there’s been much talk of data journalism which usually involves scanning through spreadsheets for gotcha moments and hence stories. It’s all good and all helps to make other institutions more transparent and accountable. But journalism is still opaque. I’m more interested in journalism as data not because I want to fetishise data but because I think it’s important for society that journalists make explicit their claims of causation. You can fact check when and where and who and what but you can’t fact check why because you can’t fact check insinuation and supposition. At the risk of using wonk-words “evidence-based journalism” feels like a good thing to aspire to.

I’m not terribly hopeful that this will ever happen. If forced to be explicit quite a lot of journalism would collapse under its own contradictions. In the meantime I think online journalism would be better served by an article per storyline (rather than development), an easily accessible edit history and clearly marked updates. I’m not suggesting most news sites would be more efficiently run as a minimal wiki, pushing updates via a microblog-of-your-choice. But given the fact that if you want to piece together the story of Mr Huhne you’ll have more luck going to Wikipedia than bouncing around news sites and news articles… maybe I am.

Thoughts on open music data

Yesterday I wore my MusicBrainz hat (or at least moth-eaten t-shirt) to the music4point5 event. It was an interesting event, but with so many people from so many bits of the music industry attending I thought some of the conversation was at cross-purposes. So this is my attempt at describing open data for music.

What is (are, if you must) the data?

The first speaker on the schedule was Gavin Starks from the Open Data Institute. He gave a good talk around some of the benefits of open data on the web and was looking for case studies from the music industry. He also made the point that, “personal data is not open data” (not an exact quote but hopefully close enough).

After that I think the “personal data” point got a bit lost. Data in general got clumped together as an homogenous lump of stuff and it was difficult to pick apart arguments without some agreement on terms. It felt like there was a missing session identifying some of the types of data we might be talking about. Someone tried to make a qualitative distinction between data as facts and data as other stuff but I didn’t quite follow that. So this is my attempt…

In any “content” business (music, TV, radio, books, newspapers) there are four layers of data:

  1. The core business graph. Contracts, payments, correspondence, financial reports
  2. The content graph. Or the stuff we used to call metadata (but slightly expanded). For music this might be works, events, performances, recordings, tracks, releases, labels, sessions, recording studios, cover art, licensing, download / streaming availabilities etc. Basically anything which might be used to describe the things you want to sell.
  3. The interest / attention graph. The bits where punters express interest toward your wares. Event attendance, favourites, playlists, purchases, listens etc.
  4. The social graph. Who those punters are, who they know, who they trust.

I don’t think anyone calling for open music data was in any way calling for the opening of 1, 3 or 4 (although obviously aggregate data is interesting). All of those touch on personal data and as Gavin made clear, personal data is not open data. There’s probably some fuzzy line between 1 and 2 where there’s non-personal business data which might be of interest to punters and might help to shift “product” but for convenience I’m leaving that out of my picture:


Given that different bits of the music industry have exposure to (and business interests in) different bits of these graphs they all seemed to have a different take on what data was being talked about and what opening that data might mean. I’m sure all of these people are exploring data from other sources to improve the services they offer, but plotting more traditional interests on a venn:

So lack of agreement on terms made conversation difficult. Sticking to the content graph side of things I can’t think of any reasonable reason why it shouldn’t be open, free, libre etc. It’s the Argus catalogue of data (with more details and links); it describes the things you have for sale. Why wouldn’t you want the world to know that? I don’t think anyone in the room disagreed but it was hard to say for sure…

Data portability

The social and interest / attention graphs are a different breed of fish. Outside the aggregate they’re where personal data and personal expression live. Depending on who you choose to believe that data either belongs to the organisation who harvested it or the person who created it. I’m firmly in the latter camp. As a consumer I want to be able to take my Last.fm interest data and give it to Spotify or my Spotify data to Amazon or my Amazon data to Apple or my Apple data to Last.fm. In the unlikely event I ever ran a startup I’d also want that because otherwise my potential customers are locked-in to other services and are unlikely to move to mine. If I were an “established player” I’d probably feel differently. Anyway data portability is important but it’s not “open data” and shouldn’t be confused with it.

Crossing the content to social divide

Many things in the content graph have a presence in the social graph. Any music brand whether it’s an artist, a label or a venue is likely to have a Twitter account or a Facebook account or etc. So sometimes the person to interest to content graph is entirely contained in the social graph. Social media is often seen as a marketing channel but it’s a whole chain of useful data from punters to “product”. Which is why it puzzles me when organisations set up social media accounts for things they’ve never minted a URI for on their own website (it’s either important or it’s not) and with no real plan for how to harvest the attention data back into their own business. “Single customer view” includes people out there too.

Data views, APIs and API control

Just down the bill from Gavin were two speakers from Last.fm. They spoke about how they’d built the business and what they plan to do next. In the context of open data (or not) that meant reviewing their API usage and moving toward a more “industry standard” approach to API management. Twitter was mentioned alongside the words best practice.

Throughout the afternoon there was lots of talk about a “controlled open” approach; open but not quite. Occasionally around licensing terms but more often about API management and restrictions. It’s another subject I find difficult as more and more structured data finds its way out of APIs and into webpages via RDFa and schema.org. In the past, the worlds of API development and Search Engine Optimisation haven’t been close bedfellows but they’re heading toward being the same thing. And there’s no point having your developers lock down API views when your SEO consultants are advising you to add RDFa all over your web pages and your social media consultants are advising you to add OpenGraph. But it all depends on the type of data you’re exposing, why you’re exposing it and who you want to expose it to. If you’re reliant on Google or Facebook for traffic you’re going to end up exposing some of your data somehow. The risk either way is accidentally outsourcing your business.

MusicBrainz

Robert from MusicBrainz appeared at the conference via a slightly glitchy Skype link. He spoke about how MusicBrainz came into being, what its goals are and how it became a profit making non-profit. He also said the most important thing MusicBrainz has is not its data or its code or its servers but its community. I’ve heard this said several times but it tends to be treated like an Oscar starlet thanking her second grip.

From all dealings with open data I’ve ever had I can’t stress enough how wrong this reaction is. The big open data initiatives (Wiki/DBpedia, MusicBrainz, GeoNames, OpenStreetMap) are not community “generated”. They are not a source of free labour. They are community governed, community led and community policed. If your business adopts open data then you’re not dealing with a Robert like figure; you’re dealing with a community. If you hit a snag then your business development people can’t talk to their business development people and bang out a deal. And the usual maxim of not approaching people with a solution but an explanation of the problem you want to solve is doubly true for community projects because the chances are they’ve already thought about similar problems.

Dealing with open data means you’re also dealing with dependencies on the communities. If the community loses interest or gets demoralised or moves on then the open data well dries up. Or goes stale. And stale data is pretty useless unless you’re an historian.

So open data is not a free tap. If you expect something for nothing then you might well be disappointed. The least you need to give back is an understanding of and an interest in the community and the community norms. You need to understand how they operate, where their interests lie and how their rules are codified and acted on. And be polite and live by those rules because you’re not a client; you’re a guest. You wouldn’t do a business deal without checking the health of the organisation. Don’t adopt community data without checking the health of the community. Maybe spend a little of the money you might have spent on a biz dev person on a “community liaison officer”.

Question and answer

At the end of Robert’s talk I had to get up and answer questions. There was only one which was something like, “would you describe MusicBrainz as disruptive?” I had no idea what that meant so I didn’t really answer. As ever with question sessions there was a question I’d rather have answered because I think it’s more interesting: why should music industry people be interested in and adopt MusicBrainz? Answers anyway:

  1. Because it has stable identifiers for things. In an industry that’s only just realising the value of this, it’s not nothing.
  2. Because those identifiers are HTTP URIs which you can put in a browser or a line of code and get back data. This is useful.
  3. Because it’s open and with the right agreements you can use it to open your data and make APIs without accidentally giving away someone else’s business model.
  4. Because it links. If you have a MusicBrainz identifier you can get to artist websites, Twitter accounts, Facebook pages, Wikipedia, Discogs, YouTube and shortly Spotify / other streaming services of your choice. No data is an island and the value is at the joins.
  5. Because it’s used by other music services from Last.fm to the BBC. Which means you can talk to their APIs without having to jump through identifier translation hoops.
  6. Because, whilst it’s pretty damn big, size isn’t everything and it’s rather shapely too. The value of data is too easily separated from the shape of the model it lives in. Lots of commercial music data suppliers model saleable items because that’s where the money lives. MusicBrainz models music which means it models the relationships between things your potential customers care about. So not just artists and bands but band memberships. And not just Rubber Soul the UK LP and the Japanese CD and the US remastered CD but Rubber Soul the cultural artefact. Which is an important hook in the interest graph when normal people don’t say, “I like the double CD remastered rerelease with the extra track and the tacky badge.”
  7. Because its coverage is deep and wide. There are communities within communities and niches of music I never knew existed have data in MusicBrainz.
  8. Because the edit cycle is almost immediate. If you spot missing data in MusicBrainz you can add it now. And you’re a part of the community.
  9. Because the community is engaged and doing this because they care, it polices itself.
  10. Because Google’s Knowledge Graph is based on Freebase and Freebase takes data from MusicBrainz. If you want to optimise for the search engines, stop messing about with h1s and put your data in MusicBrainz.

So if any record label or agent or publisher or delivery service ever asked me what the smallest useful change to the data they store might be, I’d say just store MusicBrainz identifiers against your records. Even if you’re not yet using open data, one day they’ll be useful. Stable identifiers are the gateway drug to linked data. And I’d advise any record label large or small to spend a small portion of the money they might have spent building bespoke websites and maintaining social media accounts, on adding their data to MusicBrainz. Everybody benefits, most of all your consumers.
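To make "store the identifier, get data back from a URI" concrete, here is a minimal sketch. The `CatalogueRecord` class and its `internal_id` are hypothetical stand-ins for whatever a label already stores. The URL shapes follow my understanding of the MusicBrainz site and its version 2 web service (`/ws/2/`, with `fmt=json` and `inc=url-rels`), so check them against the current documentation before relying on them. The MBID is the one MusicBrainz uses for The Beatles.

```python
import json
import urllib.request
from dataclasses import dataclass

MB_ROOT = "https://musicbrainz.org"


@dataclass
class CatalogueRecord:
    """A hypothetical in-house record with a MusicBrainz ID stored against it."""
    internal_id: str
    title: str
    artist_mbid: str


def artist_page(mbid: str) -> str:
    """The human-readable page: paste it into a browser."""
    return f"{MB_ROOT}/artist/{mbid}"


def artist_json_url(mbid: str) -> str:
    """The machine-readable lookup, asking for URL relationships
    (artist websites, social accounts, Wikipedia, Discogs and so on)."""
    return f"{MB_ROOT}/ws/2/artist/{mbid}?fmt=json&inc=url-rels"


def fetch_artist(mbid: str, user_agent: str) -> dict:
    """Fetch the artist as JSON. MusicBrainz asks clients to identify
    themselves, so the caller supplies a descriptive User-Agent string."""
    request = urllib.request.Request(
        artist_json_url(mbid),
        headers={"User-Agent": user_agent, "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


record = CatalogueRecord(
    internal_id="cat-0001",  # made-up internal key
    title="Rubber Soul",
    artist_mbid="b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d",
)
print(artist_page(record.artist_mbid))
print(artist_json_url(record.artist_mbid))
```

Calling `fetch_artist(record.artist_mbid, "my-label-catalogue/0.1 (me@example.org)")` would make the actual request; the example stops short of that so it runs without a network connection.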

ps If you’re an indie artist Tom Robinson wrote a great guide to getting started with MusicBrainz here.

Dumb TVs

Following on from this year’s CES there’s been lots of talk about bigger, better, sharper, smarter TVs. As ever conversation around gadgets tends to get caught up with conversations around business models which tends to lead to breathless commentary on OTT vs traditional broadcast and whether smart TVs will render traditional broadcasters as obsolete as Blockbusters, HMV and Jessops. But this is only tangentially about that.

Rumbling away in the background is the usual speculation around Apple’s plans to “revolutionise” the TV “experience” and whether they’re planning to do the same to the TV industry as they did to the music industry (content deals permitting). In among the chatter there seems to be an assumption from some commentators that Apple’s plans for TV revolve around how Apple TV might improve the on-screen interface and controls, possibly replacing the EPG with an App Store style interface. There’s a tendency amongst media futurologists to predict the future by extrapolating from the past; therefore televisions will follow the same fat-client route as phones and already complicated TV interfaces will become more complicated still.

But to my mind this doesn’t make sense. Apple already own the content discovery route via their iDevices, they own the content acquisition route via iTunes and they own the play-out route via AirPlay. Why do they need to invent fat-client TV sets when they’ve already put fat-client laptops, tablets and phones into the hands of their customers? The App Store model might just about work when it’s in your hand / on your lap. But placing the same interaction model 10 feet away just doesn’t offer the affordances you need to discover, purchase and play programmes. From an accessibility angle alone, making potential customers interact from 10 feet away when you’ve already given them a better option seems like a painful redundancy.

How “smart” do TVs need to be?

In more general terms I think there’s a problem with the definition of a “smart” TV and the interfaces envisaged. If TVs are web connected why do they need to be smart? Some arguments why not:

  1. Upgrade cycles for TVs and radios (and most other household white goods) are too slow to build-in smartness. Build in too much and the smarts go obsolete before the primary function of the device.
  2. For any connected device smartness belongs in the network. This is why we connect them. If there are existing discovery and distribution channels and backchannels, then all a TV needs to do is accept instructions from the network; a connected (but dumb) screen.
  3. 10 feet away is no place for an interface. And just because a device has a screen doesn’t mean it has to be an input. As TV functionality becomes ever smarter and more complicated, the remote control grows to fit the demands and we end up with something almost resembling a keyboard on the arm of the sofa. When there’s a much better, much more accessible phone or pad or laptop (or any point in between) sat redundant alongside.
  4. The App Store / Smart TV model presupposes the existence of apps. But making native apps is expensive and the more platforms you have to provide for the more expensive it gets. A dumb TV only needs to accept instructions and play-out media.
  5. TV screens tend to be a shared device and authentication, personalisation and privacy concerns are hard on a shared device. Hard from an implementation point of view and hard from a user comfort point of view. There’s a spectrum from TV screen to desktop PC to laptop to tablet to phone and the further down that list you travel the less shared / more personal the device feels and the more comfortable users feel with authentication. Dumb TVs move authentication to where it makes sense.
  6. Smart TVs open up the possibility of device manufacturers finding a new role as content gatekeepers. Having control of both the interface and the backchannel data allows them to control the prominence of content. This is a particular problem for public service broadcasters. By the time your smart TV is plugged into your set top box and your assortment of games consoles, the front room TV can acquire a stack of half a dozen gatekeepers. Just keeping track of which one is currently active and which one you need to control is confusing.
  7. Media people like to talk about TV as a “lean back” medium. This is pure conjecture but it’s possible that separating the input interface from the play-out leads to this more “lean back” experience…

How dumb is dumb?

From conversations around Dumb TVs there seem to be two main options: the dumb but programmable TV and the dumber than Kletus TV.

Programmable TVs

Modern TV sets don’t live alone. There are ancillary devices like PVRs which sit alongside the TV box. TVs don’t need to be programmable but PVRs do. The big question is where you want to programme your PVR from. If it’s same room / same local area network then there’s no need for any additional smartness or authentication. If it’s on the same network you can control it. If you want to programme your PVR from the top deck of the bus this is somewhat harder. Somewhere you need a server to mediate your actions and given the need for a server there’s a need for authentication. But…

…how do PVRs as discrete devices make sense in a connected world? If 3 million people choose to record an episode of Doctor Who that’s a hell of a lot of redundant storage. And a hell of a lot of redundant power usage. Over time PVR functionality will move to “the cloud” (the legality of loopholes notwithstanding), your mobile will programme it, discover content there and push that content to your TV screen. With no need for TV programmability.

Dumb, dumb, dumb

So what’s the very simplest thing with the least build and integration costs? Something which allows you to push and control media from a fat client to a dumb TV. DIAL promises to do something similar but seems to assume a native app at each end and the simplest thing is probably two browsers.

So somehow devices on a local area network need to be able to advertise the functionality they offer. There’s a web intents connection here but I’m not quite sure what it is. Once your laptop / tablet / phone knows there’s a device on the network which can play audio / video it needs to make that known to the browser. So there needs to be some kind of browser API standardisation allowing for the insertion of “play over there” buttons. And the ability to push a content location with play, pause, stop and volume control notifications from the browser on the fat client to the browser on the dumb TV. Which might be something like WebRTC. Given the paywalls and geo-restrictions which accompany much of the online TV and movie business there’d probably need to be some kind of authentication / permission token passed. But that’s all dumb but connected would involve.
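To show how little the dumb end would need to understand, here is a sketch of the instruction set as JSON messages and a screen that does nothing but react to them. The message shape, the action names beyond play / pause / stop / volume, and the `DumbScreen` class are all invented for illustration. Discovery and transport (whatever ends up carrying the messages between the two browsers) are deliberately left out.

```python
import json
from dataclasses import dataclass
from typing import Optional

# "load" is an assumed extra action that carries the content location.
ACTIONS = {"load", "play", "pause", "stop", "volume"}


def instruction(action: str, **fields) -> str:
    """Build one instruction on the fat client (phone, tablet, laptop)."""
    return json.dumps({"action": action, **fields})


@dataclass
class DumbScreen:
    """State held by the TV end. It never initiates anything."""
    location: Optional[str] = None
    token: Optional[str] = None
    state: str = "stopped"
    volume: int = 50

    def handle(self, raw: str) -> None:
        message = json.loads(raw)
        action = message["action"]
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        if action == "load":
            # The token is opaque to the screen; it would simply be passed
            # on to whoever serves the content.
            self.location = message["location"]
            self.token = message.get("token")
            self.state = "stopped"
        elif action == "play" and self.location is not None:
            self.state = "playing"
        elif action == "pause" and self.state == "playing":
            self.state = "paused"
        elif action == "stop":
            self.state = "stopped"
        elif action == "volume":
            # Clamp to a 0-100 range rather than trusting the sender.
            self.volume = max(0, min(100, int(message["level"])))


screen = DumbScreen()
screen.handle(instruction("load", location="https://example.org/episode.mp4",
                          token="opaque-permission-token"))
screen.handle(instruction("play"))
screen.handle(instruction("volume", level=130))
screen.handle(instruction("pause"))
print(screen)
```

Everything to do with discovery, search, purchase and personalisation stays on the device in your hand; the screen only holds a location, a token and a playback state.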

A late answer to a question from the digital humanities conference

The week before last Silver and I went along to the Realising the Opportunities of Digital Humanities conference in Dublin. We gave a short presentation about linked data at the BBC then sat on a panel session attempting to answer questions. I’ve never been on a panel before but it’s a bit like an interview: you only think of the answer you wanted to give once you’ve left the room.

Anyway, one person asked a question something like, “with all this data does ‘content’ become a second class citizen”. At the time we were sat in the Royal Irish Academy library which is three storeys of floor to ceiling books. The thought that all that humanity could ever become subservient to some descriptive data seemed so odd that I don’t think anyone even answered the question. A bit like suggesting if the library catalogue ever got good enough you could just burn the books.

A follow up point was made by a person from RTE about the costs associated with digitising content. I think it’s often hard to justify digitisation costs because the thing you end up with is pretty much the thing you started with except in ones and zeros. And to my mind there are three steps to opening an archive. As a fag packet sketch it would look something like:

Step 1: Digitisation

As the RTE person said, digitisation is expensive and sometimes hard to justify. And I have no ideas on how to make it less expensive. Until content is digitised there’s no way to get the economies of scale that computers and the web and the people on the web bring. And there’s no real benefit in just digitising. You can put the resulting files on the web but until they link and are linked to and findable they’re not in the web. To put stuff in the web you need links and to make links you need context so…

Step 2: Contextualisation

Once you have digitised files you need to somehow generate context and there seem to be three options:

  1. employ a team of librarians to catalogue content – which is great if you can but doesn’t scale too well and can, occasionally, lead to systems which only other librarians can understand
  2. let loose the machines to analyse the content (OCR, speech to text, entity extraction, music detection, voice recognition, scene detection, object recognition, face recognition etc etc etc) – but machines can get things wrong
  3. build a community who are willing to help – but communities need nurturing (not managing, never managing)

The project I’m currently (kind of) working on has the research goal of finding a sweet spot between the last two: machine processing of audio content to build enough descriptive data to be corrected and enhanced by a community of users. Copying / pasting from an earlier comment:

We’ve been working on a project for BBC World Service to take 70,000 English-language programmes and somehow make them available on the web. The big problem is that whilst we have high quality audio files we have no descriptive data about them. So nothing about the subject matter discussed or who’s in them and in some cases not even when they were broadcast.

To fix this we’ve put the audio through a speech to text system which gives us a (very) rough transcript. We’ve then entity extracted the text against DBpedia / Wikipedia concepts to make some navigation by “tags”. Because the speech to text step is noisy some of the tags extracted are not accurate but we’re working with the World Service Global Minds panel (a community of World Service listeners) who are helping us to correct them.

Machines plus people is an interesting approach but, like digitisation, machine processing of content is expensive. Or at least difficult to set up unless you’re a technology wizard. And there’s a definite gap in the market for an out-of-the-box, cloud-based (sorry) solution (sorry) for content processing to extract useful metadata to build bare-bones navigation.
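As a rough sketch of the machines-plus-people idea (not the World Service project’s actual code): machine-extracted tags carry a confidence score, and once enough people have voted on a tag their verdict overrides the machine. The `Tag` class, the thresholds and the example concepts are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    concept: str          # e.g. a DBpedia-style concept name (illustrative)
    machine_score: float  # confidence from a hypothetical extractor, 0..1
    up_votes: int = 0
    down_votes: int = 0

    def vote(self, agrees: bool) -> None:
        """Record one community member's verdict on this tag."""
        if agrees:
            self.up_votes += 1
        else:
            self.down_votes += 1

    def is_visible(self, min_score: float = 0.5, min_votes: int = 3) -> bool:
        """People override the machine once enough of them have voted."""
        votes = self.up_votes + self.down_votes
        if votes >= min_votes:
            return self.up_votes > self.down_votes
        return self.machine_score >= min_score


def visible_tags(tags):
    """Names of the tags that should currently be used for navigation."""
    return [t.concept for t in tags if t.is_visible()]


# A noisy transcript yields one good tag, one mishearing and one weak-but-right tag.
tags = [Tag("Nelson_Mandela", 0.9), Tag("Mandolin", 0.6), Tag("Apartheid", 0.3)]
for _ in range(3):
    tags[1].vote(False)  # listeners reject the mishearing
    tags[2].vote(True)   # and rescue the tag the machine was unsure about

print(visible_tags(tags))
```

The thresholds are the interesting design choice: set `min_votes` too high and the community never gets to correct anything; too low and one mischievous listener can.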

Step 3: Analysis

The line between contextualisation and analysis is probably not as clear cut as I’ve implied here. But by analysis I mean any attempt to interrogate the content to make more meaning. I’m reminded of the recent Literature is not Data: Against Digital Humanities article by Stephen Marche:

But there is a deeper problem with the digital humanities in general, a fundamental assumption that runs through all aspects of the methodology and which has not been adequately assessed in its nascent theory. Literature cannot meaningfully be treated as data. The problem is essential rather than superficial: literature is not data. Literature is the opposite of data.

Data precedes written literature. The first Sumerian examples of written language are recordings of beer and barley orders. But The Epic of Gilgamesh, the first story, is the story of “the man who saw the deep,” a hero who has contact with the ineffable. The very first work of surviving literature is on the subject of what can’t be processed as information, what transcends data.

It also reminds me of a trip to a Music Information Retrieval conference a couple of years back. Every other session was accompanied by a click track and seemed to be another attempt to improve “onset beat detection” in some exotic music genre by two or three percent. I’m no musicologist but it felt like a strange approach to determining meaning from music. If you were ever asked to describe punk or hip-hop or acid house I doubt you’d start with chord sequences or rhythm patterns. For at least some genres the context of the culture and the politics (and the narcotics) feels like a better starting point.

So I think when we throw machines at the analysis part there’s a tendency to reduce down the ineffable to a meaningless set of atoms. Or a pile of salt. Machines have their place but it’s their usual place: boring, repetitive tasks at speed.

Getting back to the diagram: once archive items are digitised, contextualised, findable and in the web they become social objects. People can link to them, share them, “curate” them, annotate them, analyse them, celebrate them, debunk them, take them, repurpose them, “remix” them and make new things from them. The best description I’ve seen of the possibilities of what could happen when people are allowed to meet archive items is Tony Ageh’s recent speech on the Digital Public Space which is too good to quote and should just be read.

On the original question then, no I don’t think “content” (novels, poems, pamphlets, journalism, oral history, radio, film, photography etc) will ever become a second class citizen to the data describing it. And digitisation costs are a lot easier to justify when coupled with contextualisation and analysis. And I think that some jobs are best done by machines and some jobs are best done by people.

 

A plea for the word service

Some small things that bother me about the word product. Mostly stuff I’ve already said on Twitter rearranged into sentences.

Product doesn’t really mean anything

Or its meaning is slippery. It certainly doesn’t seem to mean what my dictionary says it does:

Product
an article or substance that is manufactured or refined for sale: dairy products.
a thing or person that is the result of an action or process: her perpetual suntan was the product of a solarium.

I guess digital services are things resulting from actions and processes but then so is everything and anything else.

Because product doesn’t mean anything when it’s combined with other words that don’t mean anything it makes meaningless sentences

Like “Minimum Viable Product”. A good deal of the conversation around “product development” is an attempt to define an MVP but no-one can ever agree on that definition. If you substitute useful for viable and service for product it’s still not immediately apparent but at least it’s a starting point for a conversation. Useful for what? To whom? Useful enough? Useful enough for people to pay for it? Useful for enough people for enough people to pay for it? What problem does it solve? Who has that problem?

Because product doesn’t mean anything people bring their own meanings to it

Some people seem to define it as a managed development process. And lots of the talk around “products” seems to be more about process than product. About cross-discipline teams and co-location. Which is equally applicable to building anything.

Some people seem to think it’s about building things that people want rather than just chucking up some web pages. Something about product life-cycles and future development based on interpretation of analytics. Which is all equally (and probably more) applicable to service design. Did we ever just make something and stick it live and forget about it? I can think of a few examples where this was true but not much I’ve ever worked on.

And some people think it’s about building something marketable; something that can be summed up in a one liner on the side of a bus. If you can’t explain to the people your service is designed to help how it might help them there’s probably not much point making it.

All the ‘products’ our industry makes get marketed as services

Services to help you pay bills or fill in tax returns or contact your MP or listen to music or radio or watch television or films or share photos or videos or find places. Why describe them between ourselves as products when we describe them to users / customers as services?

It at least implies branded, bounded boxes

Product management in a software sense has been around at least since Microsoft were still shipping software as shrink-wrapped packages. Back then the product word probably made sense but now…

Everything that used to be a product is becoming a service

Software, music, books, films, games, maps, newspapers, even bathroom scales and maybe one day the beloved internet fridge. As everything that can be connected is connected, the lines between tangible products and intangible services are blurring. But it’s more about products becoming more service-like (less hard edged, more malleable, more adaptive to use) than services becoming more product-like.

Services outlive products

Lots of services manifest themselves through a product but when your iPhone 9 is making its final journey to the landfill site chances are something very like the iTunes store or Amazon’s Whispersync will still be around.

No-one I know makes products

From a rough list of friends and acquaintances there’s some teachers, a lawyer, a nurse, a doctor, a postman, someone who does “cloud provisioning” of some sort, some people who make it easier to listen to radio and watch TV, some people who make it easy to find and read news, some people who make legislation and case law available to lawyers, some people who make it easier to book travel, someone who provides wire feeds to news organisations etc. All of them provide services, not one of them makes products. Occasionally the service is manifested as something product-like but more as a token (like a postage stamp or an Oyster card) for service. The value (and the surplus value of labour) is always in the service. You don’t buy a loaf of bread, you buy the convenience of not having to bake.

Service just sounds nicer

Services are helpful. They adapt as you use them. They’re things that people need (in a large or small sense); things which make their lives easier. Products are things that people can be persuaded they want or desire.

Maybe I’m just out of date and not following the change of language but to my ears products are still tangible and the things our industry makes are largely intangible. And we have a perfectly good word for that.

My PVR is “in the cloud”

in The Cloud
on somebody else’s server

Because I’m lazy and can’t be arsed leafing through tables of features I get my broadband off BT. And they throw in a BT Vision box. Which is basically Freeview with a hard disc to record to and an internet connection. It’s ok. Mostly it works. About once a month the fan gets noisy and it panics and falls on its face and you need to turn it off and on again. Or, if it’s a proper panic attack, turn it off, hold down the reset button and turn it on again.

But every so often the box has a complete breakdown and no amount of off / on / off / on / reset makes it feel better. A while back my BT Vision box went a bit hysterical and broke. So they sent round an engineer with a new one.

Obviously I lost all my recorded programmes (even the on / off / reset achieves that aim). Probably no great loss; some Midsomer Murders, some Gardener’s World, some Great British Railway Journeys and my daughter’s carefully curated collection of M.I. Highs. But I also figured I’d lost all my instructions to record.

Last Thursday morning I had a bit of a panic because I thought the new series of Gardener’s World wouldn’t be recorded. So I clicked through several pages of the god awful EPG, found Gardener’s World and found it was already set up to record.

Because, I assume, the box phones home and record instructions are stored “in the cloud” and the whole thing is re-synched periodically. Which made me ponder three things:

  1. What else is the box recording and phoning home? What I watch? What I record? What I record and watch? What I record and fail to watch?
  2. How is that data used and by whom? Which reminded me of this vintage article from Wired on Sky’s plans to serve personalised adverts based on TV attention data and my oft quoted quote from Clive Humby (emphasis mine):

    If I knew your whole transaction profile – restaurants, travel, fashion – that could be immensely powerful. You’d need a consent-based model, but you’d understand every aspect of a person’s life. The credit-card data tells you how they live generally, the supermarket data tells you their motivations, the media data tells you how to talk to them. If you have those three things, you’re in marketing nirvana.

  3. Who owns that data and how else could it be used? If it’s mine then why shouldn’t I be able to port it out and offer it to Sky in the hope of a money off deal and a more reliable box? Why shouldn’t I be able to port it into Programme List and get broadcast reminders and links to VOD services? Or take it to Amazon and get DVD recommendations. Or go the other way and take my Amazon data and get programme recommendations? There’s a lot of personal data floating around but it’s all locked into proprietary systems and outside my control.

I’m not, despite appearances, a privacy zealot. I don’t think absolute privacy is possible or desirable. From supermarket loyalty cards to Oyster cards to Facebook every day we trade some privacy for some convenience. My problem is when the terms of trade are so obfuscated that it’s not possible to weigh what we gain against what we lose. In the BT Vision case if I’d been offered the option of exporting my record instructions off the box I might well have clicked yes. (I’d have been even more tempted to click yes if my recordings were stored off the box but that would probably break 10,000 copyright agreements.) But I wasn’t given the option and knowing that data is out there but outside my reach is just frustrating.

I don’t think I’m alone in this. Some recent research by the NoTube project found that most people were uncomfortable about their online TV viewing being recorded when they hadn’t consented and couldn’t control what happens to the data.

There’s been a long running debate about privacy vs “publicness” which mostly seems to miss the point. The point being informed consent. Those on the “publicness” side tend to say that organisations harvesting user data is a price worth paying for “free” access to “open” publishing tools. And ignore that the data being harvested disappears into proprietary systems where it’s impossible for users to extricate it or correct it. The exact opposite of openness. It’s bad for consumers because they get locked into a single system. And it’s bad for competitiveness because new businesses can’t hope to compete with established players who monopolise the interest graphs. Wherever you see customer relationship management or user relationship management it can almost always be read as “lock-in”.

Most people are used to the idea that when they use the web (in the browser open, clicking links sense) their actions are being reported and recorded. The fact that we seem no closer to solving informed consent and data portability on the web doesn’t bode well for when our white goods start phoning home.

A rambling post about RiscOS. Ish

As a kid I never had a computer. In those pre-web days the main reason for owning one seemed to be the playing of games and I never really liked games. Still don’t.

The first computer I ever met was a university mainframe. You’d spend a morning with a map chart and something that looked like an air hockey puck, tracing lines from red shifting stars or particles being accelerated or collided or some such. Then take the results and write some convoluted Fortran programme which would (hopefully) change the numbers you entered into some different set of numbers.

Except not before the code had been “compiled”. I don’t think anyone ever explained what “compilation” entailed. You spent several minutes copying your Fortran to a huge floppy disc, handed it over to a sullen lab assistant and they went away and did the water into wine thing. Which for more reasons unexplained took most of the afternoon. Since any time spent in the Blackett Laboratory was demoralising and depressing, compilation time was pub time.

My second meeting with a computer was a Mac Classic that sat in the corner of the same lab. You’d stagger back from the pub and (if you’d struck lucky and your code had compiled) take the new numbers it had made and type them into a spreadsheet type programme on the Mac and get back a pretty graph. I remember wondering why that Fortran computer thing couldn’t just work like this Mac computer thing. It had the distinct advantage of working even when you were drunk; the mainframe thing barely worked if you were sober.

After university I went into the then thriving (cough) CD-ROM trade. It was mostly Mac “Power PCs” with a few Windows machines scattered about. But they felt like a different breed than the old Mac Classic. The CD-ROM industry inherited the Adobe bloatware of the 1980s desktop publishing industry that we’re still stuck with today (Photoshop and Illustrator). And added a few of its own (Director and Authorware). The minute you started up one of these applications your Mac stopped being a general purpose computer. The chosen application ate all your memory for several hours before eventually overheating the machine and crashing. These were applications as operating systems; for the duration of use nothing else was possible.

After a little while I moved to glorious Norwich to make educational CD-ROMs. At that time most schools had graduated from the BBC Micro but kept faith with Acorn and invested in shiny new RISC PCs running RISC OS. So alongside the PCs and Macs we had RISC PCs. Which were desperately unfashionable. The only people who ever bought RISC PCs were schools so any RISC OS conference was entirely populated by geography teacher retreads with beards and elbow patches. In the room next door was Paul Mison (who kicked off this vague reminisce) and the first web people I ever met. They were Mac users to a (wo)man; our RISC PCs and CD-ROM burners were a badge of digital shame.

But secretly I quite liked my RISC PC. It was cute and simple and transparent. You could look inside it and tinker with it and work out what it was trying to do. It was far removed from the shrink-sealed product of the modern Apple machine. And it didn’t have dongles and bloatware. Instead of a packaged application like Photoshop to meet all your photo editing needs it had an app for resizing photos, an app for recolouring, an app for cropping…

In some ways these were a bit like the modern app of the twonkPhone or twonkPad: single purpose applications designed to do one thing (fairly) well. But unlike app store apps they co-operated, they talked to one another. As my friend Tom might say, they were generative. You could drag and drop the output from one app as the input for another. You could write simple little scripts that chained together apps as a mini process. And because each component co-operated you never got stuck with vendor lock-in or forced upgrades. You could just grab a different app for the same purpose off a floppy disc and swap them in and out at will.
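For anyone who never met it, the chaining idea looks something like this in modern clothes. This is not RISC OS code, just a sketch in Python with made-up `resize`, `crop` and `recolour` steps and an “image” that is nothing more than a dictionary. The point is only that each step does one thing and the output of one is the input of the next.

```python
from functools import reduce


def resize(factor):
    """Single-purpose step: scale the image dimensions."""
    def step(img):
        return {**img,
                "width": int(img["width"] * factor),
                "height": int(img["height"] * factor)}
    return step


def crop(width, height):
    """Single-purpose step: trim the image to at most width x height."""
    def step(img):
        return {**img,
                "width": min(img["width"], width),
                "height": min(img["height"], height)}
    return step


def recolour(palette):
    """Single-purpose step: swap the palette."""
    def step(img):
        return {**img, "palette": palette}
    return step


def chain(*steps):
    """Join steps so the output of each becomes the input of the next."""
    return lambda img: reduce(lambda acc, s: s(acc), steps, img)


photo = {"width": 1600, "height": 1200, "palette": "colour"}
make_thumbnail = chain(resize(0.25), crop(300, 300), recolour("greyscale"))
thumbnail = make_thumbnail(photo)
print(thumbnail)
```

Swapping one cropping app for another is just a matter of passing a different function to `chain`; nothing else in the process needs to know.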

Which reminded me for some reason…

…of a pub chat with Mo and Faith about APIs and open data. In my head at least there is a connection here…

Most businesses of any size will have data spread across multiple systems (staff details, product inventories, room bookings, finances). At some point there’s a realisation that scattering knowledge is inefficient and there’d be more value if the multitude of different systems co-operated and exchanged information. For lots of organisations that path leads inevitably to the large scale enterprise architecture dreams of one consolidated system to rule them all. So the usual coterie of consultants march in to design the uber-system. It’s the mainframe mentality that never quite died out. (I half remember another company I used to work for had seven systems, all called The Global Platform).

Eventually the uber system gets built and all the data from the legacy systems gets pumped in. And it fails. Because the data isn’t quite the right shape and different systems use different identifiers and some systems use different fields to mean different things at different points in time… But mostly the enterprise architect dreams die because they ignore the real problem until it’s too late. Designing data cathedrals is (relatively) easy; data and identifier consolidation is the hard part. And if you can solve identifier consolidation in the first place there’s no need to ever build the cathedral. Which is why god gave us URIs, HTTP and APIs. And why…
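A fag packet sketch of what identifier consolidation means in practice, assuming three imaginary systems (`hr`, `rooms`, `finance`) that each know the same person under a different local identifier. Every name, identifier and `example.org` URI here is invented. The mapping table is the hard, boring, valuable bit; once it exists the “join” is a few lines and no cathedral is required.

```python
# (system, local identifier) -> one shared URI. Building and maintaining
# this table is the real work; everything below it is trivial.
SAME_AS = {
    ("hr", "E1042"): "https://example.org/people/jane-doe",
    ("rooms", "jdoe"): "https://example.org/people/jane-doe",
    ("finance", "000731"): "https://example.org/people/jane-doe",
}


def consolidate(records):
    """Group per-system records under their shared URI.

    Returns (merged, unmapped): merged maps URI -> combined fields,
    unmapped lists the (system, local_id) pairs nobody has reconciled yet.
    """
    merged, unmapped = {}, []
    for system, local_id, fields in records:
        uri = SAME_AS.get((system, local_id))
        if uri is None:
            unmapped.append((system, local_id))
            continue
        merged.setdefault(uri, {}).update(fields)
    return merged, unmapped


records = [
    ("hr", "E1042", {"name": "Jane Doe", "team": "News"}),
    ("rooms", "jdoe", {"booked_room": "Studio 3"}),
    ("finance", "000731", {"cost_centre": "N-17"}),
    ("rooms", "asmith", {"booked_room": "Studio 1"}),  # not yet reconciled
]

merged, unmapped = consolidate(records)
print(merged)
print(unmapped)
```

The `unmapped` list is where the enterprise architect dreams usually go to die.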

…tl;dr everything should have an API

I realise there’s nothing new in any of this. It’s the usual, “small pieces, loosely joined” (though I’d quibble about the definition of loosely). And I realise that hanging this argument on nostalgia for the rotting corpse of a badly failed operating system weakens it somewhat :-/

But business systems should be more like RISC OS and less like mainframes or enterprise / Adobe bloatware. They should do one thing well and co-operate and communicate. An organisation should be its APIs. And everything should have an API.

If you’re in the legal business every case should have an API, every lawyer should have an API, every citation of case law or legislation should have an API. If you’re in the TV / radio business every studio should have an API (what’s recording there, what was recorded there, what’s planned), every camera should have an API (where is it, where has it been), every production should have an API (what stage is it at, what’s the budget, who’s the co-producer). If you’re in the news business every camera crew should have an API (where are they, where have they been), every journalist should have an API (what have they submitted, where are they (though it’s not unlikely that Twitter or Facebook or FourSquare already know this)). If a riot kicks off somewhere in Tottenham you should be able to query across systems to easily find the closest journalist, the closest camera crew, the closest radio car… without having to chase down six different spreadsheets across four different departments.
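To make the riot example slightly less hand-wavy, here’s a minimal sketch of querying across systems for the closest journalist and the closest camera crew. The two `*_api` functions stand in for whatever separate systems would really hold this data; they, the names and the (approximate) coordinates are all assumptions. Distance is the standard haversine great-circle formula.

```python
from math import asin, cos, radians, sin, sqrt


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


# Stand-ins for two separate systems, each exposing "where is everyone?"
def journalists_api():
    return [
        {"kind": "journalist", "name": "Journalist A", "position": (51.60, -0.07)},
        {"kind": "journalist", "name": "Journalist B", "position": (53.48, -2.24)},
    ]


def camera_crews_api():
    return [
        {"kind": "camera crew", "name": "Crew 1", "position": (51.50, -0.12)},
        {"kind": "camera crew", "name": "Crew 2", "position": (51.58, -0.10)},
    ]


def closest_by_kind(incident, *sources):
    """Ask every system, then keep the nearest resource of each kind."""
    nearest = {}
    for source in sources:
        for resource in source():
            d = distance_km(incident, resource["position"])
            best = nearest.get(resource["kind"])
            if best is None or d < best[0]:
                nearest[resource["kind"]] = (d, resource["name"])
    return {kind: name for kind, (d, name) in nearest.items()}


incident = (51.59, -0.07)  # somewhere around Tottenham, roughly
print(closest_by_kind(incident, journalists_api, camera_crews_api))
```

Add a radio car system and it’s one more argument to `closest_by_kind`, not a seventh spreadsheet.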

Whenever talk turns to APIs it’s usually a side effect of already publishing to the web. The usual question is, “we’ve published this content to the open web, can we give it an API?” Which feels like the wrong question. If everything is / has an API the real question is, “Which bits of this can we open to the web and which bits are better kept private?” That’s just a permissioning problem and permissioning is easy :-)

All of this is dependent on whether the intention is to create a more intelligent, connected, generative website. Which is not a bad goal. It just seems more ambitious to create a more intelligent, connected, generative business. And expose the bits you choose to expose to the world.

Stop testing the wrong things

Or at least start testing the right things.

Lots of chat about Test Driven Development and a brief flurry of tweets with @rarepleasures left my bicker button feeling unfulfilled so this is just another rant that wouldn’t fit into 140…

Why I don’t really like test driven development

  1. Because the minute you add a label to an approach, within a week it becomes a “process”, within a month someone will organise a conference and within six months it’s just more dogma and doctrine. But that aside…
  2. There’s a chain. At one end are the people somewhat pompously referred to as “the business”. At the other end an assortment of developers and designers patronisingly referred to as “geeks” and “creatives”. The people at “the business” end want to solve a problem; the people at the building stuff end generally help to solve problems. The more links in the chain, the more noise gets introduced until you end up with requirements and “user stories” as Chinese whispers. Professionalising a class of people into business analysts and product managers doesn’t stop Chinese whispers being Chinese whispers.
  3. As Dan North says in his What’s in a story post:

    Usually, the business outcomes are too coarse-grained to be used to directly write software (where do you start coding when the outcome is “save 5% of my operating costs”?) so we need to define requirements at some intermediate level in order to get work done.

    The point being that by the time any of this stuff hits the designer / developer it’s usually passed through the hands of several intermediaries and been reduced to some requirements / user stories. But requirements don’t matter. They’re just an abstraction to make it easier to start writing code. What matters are the “business” objectives. Or, without wanting to sound too New Labour, the “outcomes”.

    The usual pattern is to explain the what to the developer / designer and leave the how to them. Which might be fine. But explaining the why is probably more important. Who knows, they might even have an opinion on the what. Stranger things have happened.

    Anyway, the more you separate developers and designers from the “why” the more we head back to the bad old days of waterfall, with the people doing the work sat at the end of the process being drip-fed user stories and expected to lay golden feature eggs.

  4. Requirements are fine as a starting point for code and using those requirements to generate tests for that code makes sense but you’re only testing the code against the requirements. You’re not testing the service / product / let’s-just-call-it-a-website against business objectives and outcomes.
  5. Businesses have all kinds of ways of measuring performance. That’s what the final slide of the boss people’s presentation on “KPIs” is all about. And anything that can be measured can be tested. The main problem is they usually get measured six months after the fact.

    The objective might be to get more registered users; the requirement might be a simplified registration process and / or the ability to authenticate with 3rd party accounts. The objective might be less abandoned shopping carts; the requirement simplified checkout and / or one click purchase. You can measure any of these objectives / outcomes so you can test them. But software tests only test software against requirements and…

  6. code does not live in isolation. Until real code meets real data and real content and real copywriting and real design and real users with real needs (and probably a real marketing campaign) you can’t measure the changes you make against real objectives.
  7. It’s fine to have those screens in development corner that show regression tests passing and failing with green and red lights. But it would be good to see other screens showing real registration rate data, real close account rate data, real buy / play / consume button data, real abandoned shopping cart data, real inbound traffic from search engines or social media or whatever data.
  8. If you’re measuring the impact of your work against real usage you can make tiny, tiny changes very, very quickly; isolate those changes from other changes in the system and see how they work for real people. Test code against requirements by all means but don’t assume your tests tell you anything meaningful.
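By way of illustration, testing against an outcome rather than a requirement might look something like this: compare the registration rate before and after a small change and ask whether the difference is bigger than noise. The visitor and registration numbers are made up, `compare_rates` is a hypothetical helper, and the check is a plain two-proportion z-test, which is only one of several reasonable ways to do it.

```python
from math import sqrt


def compare_rates(visitors_a, successes_a, visitors_b, successes_b):
    """Two-proportion z-test: did variant B move the rate relative to A?

    Returns (rate_a, rate_b, z). As a rule of thumb, |z| above about 1.96
    corresponds to the conventional 5% significance level.
    """
    rate_a = successes_a / visitors_a
    rate_b = successes_b / visitors_b
    pooled = (successes_a + successes_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    return rate_a, rate_b, (rate_b - rate_a) / std_err


# Invented figures: old registration form vs the simplified one.
rate_old, rate_new, z = compare_rates(1000, 100, 1000, 130)
print(f"old {rate_old:.1%}, new {rate_new:.1%}, z = {z:.2f}")
```

A green light on this sort of check says something about the objective; a green light on the regression suite only says something about the requirements.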