Sausages and sausage machines: open data and transparency

Last Wednesday was the second BBC Data Day. I didn’t manage to make the first one but I did end up chatting afterwards with various BBC, ODI and OU people about the sort of data they’d like to see the BBC release. Shortly afterwards I sketched some wireframes and off the back of that was invited to talk at the second event. Which I also didn’t manage to make because I was at home, ill and feeling sorry for myself. In the event Bill stepped in and presented my slides. These are the slides and notes I would have presented if I had managed to be there:

Slide 1


I’d like to talk about open data on the web, what it’s for and in particular how it enables transparency to audiences across journalism and programme making.

Slide 2


So why publish open data on the web? Three common reasons are given:

  1. to enable content and service discovery by 3rd parties like Google, Bing, Facebook, Twitter etc. These are things like Open Graph and Twitter Cards, used to describe services so 3rd parties can find your stuff and make (pretty) links to it. This often becomes a very low-level form of automated syndication because that’s how the web works
  2. to outsource innovation and open up the possibilities of improving your service to 3rd parties. The Facebook strategy of encouraging flowers to bloom around their fields. Then picking the best ones and buying them
  3. and finally because… transparency. To show the world your workings in the best interests of serving the public

Today I’m only really talking about transparency.

Slide 3


So sausages. The BBC already publishes some “open” data but that data only describes the end product, the articles and programmes, and not the process.

Slide 4


This is the Programmes Ontology. It shows the kinds of data we publish about all BBC programmes.

There are programme brands and series and individual episodes and versions of those episodes and broadcasts and iPlayer availabilities. The kind of data you’d need to build a Radio Times or an EPG. Or iPlayer.
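To make that hierarchy concrete, here’s a rough sketch in Python of the shape of the data. The class and field names loosely echo the ontology’s terms (brand, series, episode, version, broadcast) but this is my illustration, not the BBC’s actual schema:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the programme hierarchy; names loosely follow
# the Programmes Ontology (po:Brand, po:Series, po:Episode, po:Version,
# po:Broadcast) but this is not the BBC's real data model.

@dataclass
class Broadcast:
    channel: str
    start: str  # ISO 8601 datetime

@dataclass
class Version:
    kind: str  # e.g. "original", "signed", "audio-described"
    broadcasts: list = field(default_factory=list)

@dataclass
class Episode:
    title: str
    versions: list = field(default_factory=list)

@dataclass
class Series:
    number: int
    episodes: list = field(default_factory=list)

@dataclass
class Brand:
    title: str
    series: list = field(default_factory=list)

# A brand with a series with an episode with a version with a broadcast
panorama = Brand(title="Panorama")
s = Series(number=1)
ep = Episode(title="Example episode")
ep.versions.append(Version(kind="original",
                           broadcasts=[Broadcast("BBC One", "2014-03-03T20:30:00Z")]))
s.episodes.append(ep)
panorama.series.append(s)

print(panorama.series[0].episodes[0].versions[0].broadcasts[0].channel)
```

Walk the tree from brand down to broadcast and you have everything an EPG needs.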

Slide 5


And this is the brand page for Panorama. Ask for it as data and you’ll get…

Slide 6



Again, a brand with episodes with broadcasts etc.

Slide 7


What’s interesting is what isn’t there. What goes on in the factory before the sausages make it to the shelves.

Slide 8


Things like:

  1. commissioning decisions. Who? When? Why? What didn’t get commissioned?
  2. scheduling decisions
  3. talent decisions
  4. guest decisions
  5. running orders. What things / what order?

Who refused to appear? Who refused to put up a spokesperson? What was the gender split of guests? What was the airtime gender split?

A couple of weeks back there was a George Monbiot piece in the Guardian bemoaning the fact that BBC programmes often didn’t include enough background information about guests on current affairs programmes. Particularly in respect to connections with lobbyists and lobbying firms.

As a suggestion: every contributor to BBC news programmes should have a page (and data) listing their appearances and detailing their links to political parties, NGOs, campaigning groups, lobbyists, corporations, trade unions etc.
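As a sketch of what that contributor data might look like (every name and field here is hypothetical, not any real BBC format):

```python
# Hypothetical per-contributor open data, as suggested above.
contributor = {
    "name": "Example Contributor",
    "appearances": [
        {"programme": "Newsnight", "date": "2014-02-20", "role": "guest"},
    ],
    "affiliations": [
        {"organisation": "Example Lobbying Ltd", "type": "lobbyist"},
        {"organisation": "Example Party", "type": "political party"},
    ],
}

# One of the questions such a page could answer directly:
# which lobbying firms is this guest connected to?
lobbying_links = [a["organisation"] for a in contributor["affiliations"]
                  if a["type"] == "lobbyist"]
print(lobbying_links)
```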

Slide 9


Away from programmes, what would transparency look like for online news?

Slide 10


The Guardian is the most obvious example where clarifications and corrections aren’t hidden away but given their own home on the website.

Slide 11


And the articles come with a history panel which doesn’t show you what changed but at least indicates when a change has happened.

The Guardian’s efforts are good but not as linked together as they might be.

Slide 12


Unlike Wikipedia. This is the edit history of the English Wikipedia article on the 2014 Crimean Crisis. Every change is there together with who made it, when and any discussion that happened around it.

Slide 13


And every edit can be compared with what went before, building a picture of how the article formed over time as new facts emerged and old facts were discounted.

Slide 14


I didn’t manage to attend last year’s data day but I did end up in the pub afterwards with Bill and some folk from the ODI and the Open University.

We talked about the kind of data we’d all like to see the BBC release and it was all about the process and not the products. The sausage factory and not the sausages.

We made a list of the kinds of data that might be published and it fitted well with how the BBC likes to measure its own activities: Reach, Impact and Value.

Slide 15


It also looked a lot like this infographic, which made the rounds of social media last week, detailing the cost per user per hour of the BBC TV channels.

Slide 16


These were the wireframes I made following last year’s pub chat.

They were intended to sit on the “back” of BBC programme pages; side 1 would show the end product, side 2 would be the “making of” DVD extra, the data about the process.

Headline stats for every programme would include total cost, environmental impact, number of viewers / listeners across all platforms and the cost per viewer of that episode.

Programmes would be broken down by gender split of contributors and their speaking time.

Reach would list viewer / listener figures across broadcast, iPlayer, downloads and commercial sales.

Slide 17


Impact would list awards, complaints, clarifications, corrections and feedback from across the web.

And value would list production costs, acquisition costs and marketing spend.

All of this would be available as open data for licence fee payers to take, query, recombine, evaluate and comment on.

Having made the wireframes I chatted with Tony Hirst from the OU about how we might prototype something similar. We came up with a rough data model and Tony attempted to gather some data via FOI requests.

Slide 18


Unfortunately they were all refused under the banner of “purposes of journalism, art or literature” which seems to be a catch all category for FOI requests marked “no”.

Google has 20 million results for the query “foi literature art journalism”, around 10 million of those would seem to relate in some way to the BBC.

The idealist in me would say that, for “the purposes of journalism”, in its noblest sense, and the greater good of society, the default position needs to flip from closed to open. The “purposes of journalism”, more than any other public service, should not be an escape hatch from open information.

And the public would benefit from “journalism as data” at least as much as from “data journalism”.

Photo credits

Packing Carsten’s wiener sausages on an assembly line, Tacoma, Washington by Washington University

Sausages at Wurstkuche by Sam Howzit


From the final few pages of Beneath the City Streets by Peter Laurie (1970) where he gets off the subject that the threat of global thermonuclear war might just be a plan to distract us and gets on to the subject of… transistors:

I am coming to believe that there is a much more serious threat to the technological way of life than the H-bomb. It is the transistor. Over the last two or three hundred years in the West we have followed a course of development that coupled increasingly powerful machines to small pieces of human brain to produce increasingly vast quantities of goods. The airliner, the ship, the typewriter, the lathe, the sewing machine, all employ a small part of the operator’s intelligence, and with it multiply his or her productivity a thousandfold.

As long as each machine needed a brain, it was profitable to make more brains and with them more profits. Industrial populations grew in all the advanced countries, and political systems became more liberal simply to get their cooperation.

But now we are beginning to find that we do not need the brains – at least not in the huge droves that we have them. Little by little [..] artificial intelligence is dispossessing hundreds of thousands and soon millions of workers. Because ‘the computer’ is seen only in large installations doing book-keeping, where it puts few out of work, this tendency goes on unnoticed. But in every job economics forces economies on management. Little gadgets here and there get rid of workers piecemeal. [..] Any job that can be specified in, say, a thousand rules, can now be automated by equipment that costs £200 or so. The microprocessor, which now costs in itself perhaps £20, [..] has not begun to be applied: over the next 10 to 15 years millions will be installed in industry, distribution, commerce. Machinery, which has almost denuded the land, will now denude cities.

Politically, this will split the population into two sharply divided groups: those who have intelligence or complicated manual skills that cannot be imitated by computers – a much larger group, who have not. In strict economic terms the second group will not be worth employing. They can do nothing that cannot be done cheaper by machinery. [..] The working population will be reduced to a relatively small core of technicians, artists, scientists, managers surrounded by a large, unemployed, dissatisfied and expensive mob. I would even argue that this process is much further advanced than it seems, and the political subterfuges necessary to keep it concealed, are responsible for the economic malaise of western nations [..] If one has to pay several million people who are in fact useless, this is bound to throw a strain on the economy and arouse the resentment of those who are useful, but who cannot be paid what they deserve for fear of arousing the envy of others.

If the unemployed can be kept down to a million or so in a country like Britain, the political problem they present can be contained by paying a generous dole [..] The real total of unemployed is hidden in business. What happens when automation advances further and the sham can no longer be kept up? [..] To cope with the millions of unemployed and unemployable people needs – in terms of crude power – greatly improved police and security services. [..] It suggests that the unemployed should be concentrated in small spaces where they can be controlled, de-educated, penned up.

Unless some drastic alteration occurs in economic and political thought, the developed nations are going to be faced in the next thirty years with the fact that the majority of their citizens are a dangerous, useless burden. One can see that when civil defence has moved everything useful out of the cities, there might be strong temptation on governments to solve the problem by nuclear war: the technological elite against the masses.

Now I’m no more of a fan of the technocratic, Silicon Valley, Ayn Rand fanboys than the next man on the street but even in my most paranoid moments I’d never suspected that when they’d done disrupting they might stagger out of a TED talk and h-bomb us all.

NoUI, informed consent and the internet of things

In more youthful days I spent a year studying HCI. I’m sure there was much more to it but unfortunately only three things have stuck in my mind:

  1. interactions should be consistent
  2. interactions should be predictable
  3. interactions should not come with unexpected side-effects

I half remember writing a dissertation which was mostly finding creative ways to keep rewriting the same three points until the requisite word count was reached.

I was thinking about this today whilst reading an assortment of pro and anti NoUI blog posts. I half agree with some of the points the NoUI camp are making and if they save us from designers putting a screen on everything and the internet fridge with an iPad strapped to the front I’d be happy. But mostly I agree with Timo Arnall’s No to NoUI post and his point that “as both users and designers of interface technology, we are disenfranchised by the concepts of invisibility and disappearance.”

This doesn’t really add much to that but some thoughts in no particular order:

  1. Too often chrome is confused with interface. There’s too much chrome and not enough interface.
  2. Even when something has a screen it doesn’t have to be an input / output interface. The screen can be dumb, the smarts can be elsewhere, the interface can be distributed to where it’s useful. The network takes care of that.
  3. An interface should be exactly as complex as the system it describes. The system is the interface. The design happens inside. I’m reminded of a quote from the Domain Driven Design book that, “the bones of the model should show through in the view”. An interface should be honest. It should put bone-structure before make-up (or lack thereof).
  4. The simplify, simplify, simplify mantra is all very good but only if you’re simplifying the systems and not just the interface. And some systems are hard to simplify because some things are just hard.
  5. No matter how much you think some side-effect of an interaction will please and “delight” your users if the side-effect is unexpected it’s bad design. You might want to save on interface complexity by using one button to both bookmark a thing and subscribe to its grouping but things are not the same as groups and bookmarks are not the same as subscriptions and conflating the two is just confusing. Because too little interface can be more confusing than too much.
  6. There seems to be a general belief in UX circles that removing friction is a good thing. Friction is good. Friction is important. Friction helps us to understand the limits of the systems we work with. Removing friction removes honesty and a good interface should be honest.
  7. Invisible interfaces with friction stripped out are the fast path to vendor lock-in. If you can’t see the sides of the system you can’t understand it, control it or leave because you don’t even know where it ends.
  8. If your goal is something like simplifying the “onboarding process” removing friction might well please your paymasters but it doesn’t make for an honest interface. Too much UX serves corporate goals; not enough serves people.
  9. Decanting friction out of the interface and turning it into a checkbox on the terms and conditions page is not a good interface.
  10. In the media world in particular there’s a belief that if you could just strip out the navigation then by the intercession of magic and pink fluffy unicorns “content will come to you”. Which is usually accompanied by words like intuitive systems. Which seems to miss the point that the thing at the other end of the phone line is a machine in a data centre. It is not about to play the Jeeves to your Bertie Wooster. It does not have intuition. What it probably has is a shedload of your usage data matched to your payment data matched to some market demographic data matched to all the same for every other user in the system. For the majority of organisations the internet / web has always been more interesting as a backchannel than as a distribution platform. They’d happily forego the benefits of an open generative web if only they could get better data on what you like.
  11. If and when we move away from an internet of screens to an “internet of things” the opportunities for sensor network and corporate-state surveillance multiply. Everything becomes a back-channel, everything phones home. With interface and friction removed there’s not only no way to control this, there’s no way to see it. Think about the data that seeps out of your front room: the iPad, the Kindle, the Samsung telly, the Sky box, the Netflix app, YouView, iPlayer, XBox, Spotify. And god only knows where it goes past there.
  12. Informed consent is the only interesting design challenge. With no interface informed consent is just another tick box on the set-up screen. Or a signature on a sales contract.
  13. The fact that we’ve not only never solved but deliberately sidelined informed consent in a world with interfaces doesn’t bode well for a world without.

More thoughts on open music data

Occasioned by someone describing media catalogue type data as the “crown jewels”. It is not the crown jewels. It is, at best, a poster pointing out the attractions of the Tower of London.

If any data deserves the description of crown jewels it’s your customer relationship data.

But since Amazon, Apple, Facebook and Google probably know more about your users / customers / audience / fan base than you do, you’ve probably already accidentally outsourced that anyway…

Longer thoughts over here

Events, causation, articles, reports, stories, repetition, insinuation, supposition and journalism as data

In a conversation with various folks around ontologies for news I went a bit tangential and tried to model news as I thought it should be rather than how it is. Which was probably not helpful. And left me with a bee in my bonnet. So…

Some events in the life of Chris Huhne

  1. In March 2003 he was clocked speeding somewhere in Essex. Already having 9 points on his licence a conviction would have seen him banned from driving so…
  2. …shortly after his then wife, Vicky Pryce, was declared to have been driving at the time of the speeding incident
  3. 16 days after the speeding incident he was caught again for using a mobile phone whilst driving and banned anyway
  4. In May 2005 he was elected to Parliament as the representative for Eastleigh
  5. Also in May 2005 Ms Pryce told a friend that Mr Huhne had named her as the driver without her consent
  6. Between October and December 2007 he stood for leadership of the Lib Dems
  7. At some point (I can’t track down) he began an affair with his aide Carina Trimingham
  8. In June 2010 he was clocked again, this time by the press emerging after spending the night at Ms Trimingham’s home
  9. A week later Ms Pryce filed for divorce
  10. In May 2011 The Sunday Times printed allegations that Mr Huhne had persuaded someone to pick up his driving points
  11. In the same month Labour MP Simon Danczuk made a formal complaint about the allegation to the police
  12. At some point after this there was a series of text messages between Mr Huhne and his son where his son accused him of lying and setting up Ms Pryce
  13. In February 2012 both Mr Huhne and Ms Pryce were charged with perverting the course of justice
  14. In June 2012 Mr Huhne and Ms Pryce announced they’d plead not guilty with Ms Pryce claiming Mr Huhne had coerced her into taking his penalty points
  15. In February 2013 the trial began and on the first day Mr Huhne changed his plea to guilty. He also resigned his parliamentary seat
  16. The trial of Ms Pryce continued. And collapsed shortly after when the jury failed to agree. Shortly after a second trial found her guilty
  17. In late February the by-election resulting from the resignation of Mr Huhne took place
  18. And in March 2013 they were both sentenced to 8 months in prison

Some of the events went on to become part of other storylines. For a brief while Mr Huhne’s driving ban for using a mobile phone at the wheel became part of a “Government makes a million a month from drivers using mobiles” story (at least for the Daily Mail), the collapse of the first trial of Ms Pryce became a story about failures in the trial by jury system and the result of the by-election became part of a story about the rise of minority parties in austerity-hit Europe.

Anyway this list of events is as partial as any other. Many more things happened (in public and in private) and some of the events listed were really lots of little events bundled up into something bigger. But that’s the trouble with events: they quickly go fractal because everything is one. As Dan said, “it’s good to think about events but it’s good to stop thinking about them too.” I’m not quite there yet.

Anyway, boiling things down further to fit in a picture:

Causation and influence

For every event there’s a fairly obvious model with times, locations, people, organisations, factors and products. And (mostly) the facts expressed around events are agreed on across journalistic outlets.

The more interesting part (for me) is the dependencies and correlations that exist between events because “why” is always the most interesting question and “because” the most interesting answer. Getting the Daily Mail and The Guardian to agree that austerity is happening is relatively easy; getting them to agree on why, and on that basis what should happen next, much more difficult.

The same picture this time with arrows. The arrows aren’t meant to represent “causality”; the fact that Mr Huhne was elected did not cause him to resign. But without him being elected he couldn’t have resigned so there’s some connection there. Let’s say “influence”:
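A minimal sketch of that model in Python, with events as nodes and “influence” as edges (the identifiers and fields are mine, for illustration only):

```python
# Events as nodes; edges mean "without A, B couldn't have happened" --
# influence, not causation. Identifiers and fields are illustrative.
events = {
    "speeding": {"date": "2003-03", "label": "Clocked speeding in Essex"},
    "elected":  {"date": "2005-05", "label": "Elected MP for Eastleigh"},
    "charged":  {"date": "2012-02", "label": "Charged with perverting the course of justice"},
    "resigned": {"date": "2013-02", "label": "Resigned parliamentary seat"},
}

# influence edges: (from_event, to_event)
influences = [
    ("speeding", "charged"),
    ("charged", "resigned"),
    ("elected", "resigned"),
]

def influenced_by(event_id):
    """Events that fed into the given event."""
    return [a for (a, b) in influences if b == event_id]

print(influenced_by("resigned"))
```

The point of making the edges explicit is that they become queryable: you can ask which events fed into the resignation rather than leaving the reader to infer it from proximity.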

Articles, reports and stories

The simplest model for news would bump most of the assertions (who, where, when etc.) to the events and hang articles off them, stitched together with predicates like depicts, reports or analyses. But whilst news organisations make great claims around reports and breaking news, journalists don’t talk about writing articles and rarely talk about writing reports. Journalists write stories, usually starting from a report about an event but filling in background events and surmising possible future events.

So an article written around the time of Mr Huhne’s resignation would look less like this:

and more like this:

Repetition, insinuation and supposition

The average piece of journalism is 10% reporting new facts and 90% repetition, insinuation and supposition where correlation and causation between events are never made explicit. Events from the storyline are hand picked and stitched together with a thin thread of causality. Often it’s enough to just mention two events in close proximity for the connections between them to be implied. The events you choose to mention and the order you mention them in gives the first layer of editorial spin.

And the claims you choose to make about an event and its actors are the second level. If there’s a female involved and she’s under 35 it’s usually best to mention her hair colour. “Bisexual” scores triple points. We know what we’re meant to think.

The Daily Mail took insinuation to new heights with the collapse of Ms Pryce’s first trial, printing a “story” about the ethnic make-up of the jury telling its readers:

Of the eight women and four men on the Vicky Pryce jury, only two were white – the rest appeared to be of Afro-Caribbean or Asian origin.

The point they were trying to make and how the appointment of a jury of certain skin colour might have led to the collapse of the trial was left as an exercise.

Sports journalism seems particularly attracted to insinuation and supposition. Maybe it’s because their events (and sometimes even the outcomes of those events) are more predictable than in most other news whilst the actual facts are mainly locked inside dressing rooms and boardrooms. But Rafa Benitez getting slightly stroppy in a news conference turned into “Rafa out by the weekend, Grant to take over until the end of the season and Jose to return” headlines by the next day. None of which turned out to be true. Yet.

As Paul pointed out, the article as repetition of storyline and makeshift crystal ball wasn’t always true. In the past newspapers printed many small reports per page. This isn’t the best image but was the best image I could find without rights restrictions:

Photo via Boston Public Library cc-by-nc-nd

Neither of us knew enough about newspaper history to know when this changed or why it changed. Presumably there are good business reasons why articles stopped being reports and started being stories. We guessed that it might have been due to falling paper and printing prices meaning more space to fill but without evidence that’s just insinuation too.

To an outside observer the constant re-writing of “background” seems tedious to consume and wasteful to produce. Especially where the web gives us better tools for managing updates, corrections and clarifications. Maybe it’s because most news websites are a by-product of print production where articles are still commissioned, written and edited to fill a certain size on a piece of paper and are just re-used on digital platforms. But even news websites with no print edition follow the same pattern. Maybe it’s partly an SEO thing with journalists and editors trying to cram as many keywords into a news story as possible but surely one article per storyline with frequent updates would pick up more inbound links over time than publishing a new article every time there’s a “development”? It seems to work for Wikipedia. (Although that said, Google news search seems to reward the publishing of new articles over the updating of existing ones.) Or maybe it’s all just unintentional. Someone at the meeting (I forget who) mentioned “lack of institutional memory” as one possible cause of constant re-writing.

But in a “do what you do best and link to the rest” sense, constantly rewriting the same things doesn’t make sense unless what you do best is repetition.

An aside on television

Television producers seem to feel the same pull toward repetition: this is what we’re about to show you, this is us showing it, this is what we’ve just shown you. I have a secret addiction to block viewing (I think the industry term is binge viewing) episodes of Michael Portillo’s Great British Railway Journeys but for every 30-minute episode there’s 10 minutes of filler and 20 minutes of new “content”.

Interestingly the Netflix commissioned series assume binge viewing as a general pattern so have dropped the continuity filler and characterisation repetition and get straight into the meat of the story. Nothing similar seems to be happening with news yet but I’m an old fashioned McLuhanist and believe the medium and the message are inextricably tied so maybe one day…

Journalism as data

Over the last couple of years there’s been much talk of data journalism which usually involves scanning through spreadsheets for gotcha moments and hence stories. It’s all good and all helps to make other institutions more transparent and accountable. But journalism is still opaque. I’m more interested in journalism as data not because I want to fetishise data but because I think it’s important for society that journalists make explicit their claims of causation. You can fact check when and where and who and what but you can’t fact check why because you can’t fact check insinuation and supposition. At the risk of using wonk-words “evidence-based journalism” feels like a good thing to aspire to.

I’m not terribly hopeful that this will ever happen. If forced to be explicit quite a lot of journalism would collapse under its own contradictions. In the meantime I think online journalism would be better served by an article per storyline (rather than development), an easily accessible edit history and clearly marked updates. I’m not suggesting most news sites would be more efficiently run as a minimal wiki, pushing updates via a microblog-of-your-choice. But given the fact that if you want to piece together the story of Mr Huhne you’ll have more luck going to Wikipedia than bouncing around news sites and news articles… maybe I am.

Thoughts on open music data

Yesterday I wore my MusicBrainz hat (or at least moth-eaten t-shirt) to the music4point5 event. It was an interesting event, but with so many people from so many bits of the music industry attending I thought some of the conversation was at cross-purposes. So this is my attempt at describing open data for music.

What is (are, if you must) the data?

The first speaker on the schedule was Gavin Starks from the Open Data Institute. He gave a good talk around some of the benefits of open data on the web and was looking for case studies from the music industry. He also made the point that, “personal data is not open data” (not an exact quote but hopefully close enough).

After that I think the “personal data” point got a bit lost. Data in general got clumped together as a homogeneous lump of stuff and it was difficult to pick apart arguments without some agreement on terms. It felt like there was a missing session identifying some of the types of data we might be talking about. Someone tried to make a qualitative distinction between data as facts and data as other stuff but I didn’t quite follow that. So this is my attempt…

In any “content” business (music, TV, radio, books, newspapers) there are four layers of data:

  1. The core business graph. Contracts, payments, correspondence, financial reports
  2. The content graph. Or the stuff we used to call metadata (but slightly expanded). For music this might be works, events, performances, recordings, tracks, releases, labels, sessions, recording studios, cover art, licencing, download / streaming availabilities etc. Basically anything which might be used to describe the things you want to sell.
  3. The interest / attention graph. The bits where punters express interest toward your wares. Event attendance, favourites, playlists, purchases, listens etc.
  4. The social graph. Who those punters are, who they know, who they trust.
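Crudely, as data (the flags are my own reading of the conversation, not anything agreed in the room):

```python
# The four data layers sketched above, flagged for whether they touch
# personal data and whether anyone was arguing for opening them.
# Purely illustrative.
layers = {
    1: {"name": "core business graph",      "personal": True,  "open_candidate": False},
    2: {"name": "content graph",            "personal": False, "open_candidate": True},
    3: {"name": "interest / attention graph", "personal": True,  "open_candidate": False},
    4: {"name": "social graph",             "personal": True,  "open_candidate": False},
}

# Only the layer with no personal data in it is a candidate for opening
open_layers = [v["name"] for v in layers.values() if v["open_candidate"]]
print(open_layers)
```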

I don’t think anyone calling for open music data was in any way calling for the opening of 1, 3 or 4 (although obviously aggregate data is interesting). All of those touch on personal data and as Gavin made clear, personal data is not open data. There’s probably some fuzzy line between 1 and 2 where there’s non-personal business data which might be of interest to punters and might help to shift “product” but for convenience I’m leaving that out of my picture:

Given that different bits of the music industry have exposure to (and business interests in) different bits of these graphs they all seemed to have a different take on what data was being talked about and what opening that data might mean. I’m sure all of these people are exploring data from other sources to improve the services they offer, but plotting more traditional interests on a Venn diagram:

So lack of agreement on terms made conversation difficult. Sticking to the content graph side of things I can’t think of any good reason why it shouldn’t be open, free, libre etc. It’s the Argos catalogue of data (with more details and links); it describes the things you have for sale. Why wouldn’t you want the world to know that? I don’t think anyone in the room disagreed but it was hard to say for sure…

Data portability

The social and interest / attention graphs are a different breed of fish. Outside the aggregate they’re where personal data and personal expression live. Depending on who you choose to believe that data either belongs to the organisation who harvested it or the person who created it. I’m firmly in the latter camp. As a consumer I want to be able to take my interest data and give it to Spotify, or my Spotify data to Amazon, or my Amazon data to Apple, or my Apple data to… In the unlikely event I ever ran a startup I’d also want that because otherwise my potential customers are locked-in to other services and are unlikely to move to mine. If I were an “established player” I’d probably feel differently. Anyway data portability is important but it’s not “open data” and shouldn’t be confused with it.

Crossing the content to social divide

Many things in the content graph have a presence in the social graph. Any music brand whether it’s an artist, a label or a venue is likely to have a Twitter account or a Facebook account or etc. So sometimes the person to interest to content graph is entirely contained in the social graph. Social media is often seen as a marketing channel but it’s a whole chain of useful data from punters to “product”. Which is why it puzzles me when organisations set up social media accounts for things they’ve never minted a URI for on their own website (it’s either important or it’s not) and with no real plan for how to harvest the attention data back into their own business. “Single customer view” includes people out there too.

Data views, APIs and API control

Just down the bill from Gavin were two speakers from … who spoke about how they’d built the business and what they plan to do next. In the context of open data (or not) that meant reviewing their API usage and moving toward a more “industry standard” approach to API management. Twitter was mentioned alongside the words “best practice”.

Throughout the afternoon there was lots of talk about a “controlled open” approach; open but not quite. Occasionally around licensing terms but more often about API management and restrictions. It’s another subject I find difficult as more and more structured data finds its way out of APIs and into webpages via RDFa and the like. In the past, the worlds of API development and Search Engine Optimisation haven’t been close bedfellows but they’re heading toward being the same thing. And there’s no point having your developers lock down API views when your SEO consultants are advising you to add RDFa all over your web pages and your social media consultants are advising you to add OpenGraph. But it all depends on the type of data you’re exposing, why you’re exposing it and who you want to expose it to. If you’re reliant on Google or Facebook for traffic you’re going to end up exposing some of your data somehow. The risk either way is accidentally outsourcing your business.
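As a rough illustration of the point, once OpenGraph markup is in your pages anyone can read it back out with a few lines of code. A minimal sketch using only the Python standard library (the sample HTML here is invented):

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect og:* properties from <meta property="og:…" content="…"> tags."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            prop = d.get("property", "")
            if prop.startswith("og:") and "content" in d:
                self.og[prop] = d["content"]

html = ('<html><head>'
        '<meta property="og:title" content="Sausages and sausage machines">'
        '<meta property="og:type" content="article">'
        '</head></html>')
p = OpenGraphParser()
p.feed(html)
print(p.og)
```

Anything you publish this way is, in practice, an API view of your data whether you call it one or not.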


Robert from MusicBrainz appeared at the conference via a slightly glitchy Skype link. He spoke about how MusicBrainz came into being, what its goals are and how it became a profit-making non-profit. He also said the most important thing MusicBrainz has is not its data or its code or its servers but its community. I’ve heard this said several times but it tends to be treated like an Oscar starlet thanking her second grip.

From all the dealings with open data I’ve ever had I can’t stress enough how wrong this reaction is. The big open data initiatives (Wiki/DBpedia, MusicBrainz, GeoNames, OpenStreetMap) are not community “generated”. They are not a source of free labour. They are community governed, community led and community policed. If your business adopts open data then you’re not dealing with a Robert-like figure; you’re dealing with a community. If you hit a snag then your business development people can’t talk to their business development people and bang out a deal. And the usual maxim of not approaching people with a solution but with an explanation of the problem you want to solve is doubly true for community projects because the chances are they’ve already thought about similar problems.

Dealing with open data means you’re also dealing with dependencies on the communities. If the community loses interest or gets demoralised or moves on then the open data well dries up. Or goes stale. And stale data is pretty useless unless you’re an historian.

So open data is not a free tap. If you expect something for nothing then you might well be disappointed. The least you need to give back is an understanding of and an interest in the community and the community norms. You need to understand how they operate, where their interests lie and how their rules are codified and acted on. And be polite and live by those rules because you’re not a client; you’re a guest. You wouldn’t do a business deal without checking the health of the organisation. Don’t adopt community data without checking the health of the community. Maybe spend a little of the money you might have spent on a biz dev person on a “community liaison officer”.

Question and answer

At the end of Robert’s talk I had to get up and answer questions. There was only one which was something like, “would you describe MusicBrainz as disruptive?” I had no idea what that meant so I didn’t really answer. As ever with question sessions there was a question I’d rather have answered because I think it’s more interesting: why should music industry people be interested in and adopt MusicBrainz? Answers anyway:

  1. Because it has stable identifiers for things. In an industry that’s only just realising the value of this, it’s not nothing.
  2. Because those identifiers are HTTP URIs which you can put in a browser or a line of code and get back data. This is useful.
  3. Because it’s open and with the right agreements you can use it to open your data and make APIs without accidentally giving away someone else’s business model.
  4. Because it links. If you have a MusicBrainz identifier you can get to artist websites, Twitter accounts, Facebook pages, Wikipedia, Discogs, YouTube and shortly Spotify / other streaming services of your choice. No data is an island and the value is at the joins.
  5. Because it’s used by other music services, including the BBC. Which means you can talk to their APIs without having to jump through identifier translation hoops.
  6. Because, whilst it’s pretty damn big, size isn’t everything and it’s rather shapely too. The value of data is too easily separated from the shape of the model it lives in. Lots of commercial music data suppliers model saleable items because that’s where the money lives. MusicBrainz models music which means it models the relationships between things your potential customers care about. So not just artists and bands but band memberships. And not just Rubber Soul the UK LP and the Japanese CD and the US remastered CD but Rubber Soul the cultural artefact. Which is an important hook in the interest graph when normal people don’t say, “I like the double CD remastered rerelease with the extra track and the tacky badge.”
  7. Because its coverage is deep and wide. There are communities within communities and niches of music I never knew existed have data in MusicBrainz.
  8. Because the edit cycle is almost immediate. If you spot missing data in MusicBrainz you can add it now. And you’re a part of the community.
  9. Because the community is engaged and doing this because they care, it polices itself.
  10. Because Google’s Knowledge Graph is based on Freebase and Freebase takes data from MusicBrainz. If you want to optimise for the search engines, stop messing about with h1s and put your data in MusicBrainz.
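On point 2 above, a minimal sketch of what “put it in a line of code and get back data” looks like. This assumes the shape of the MusicBrainz v2 web service (`/ws/2/artist/{mbid}`), and the MBID below is, to the best of my knowledge, The Beatles’ — check the MusicBrainz documentation before relying on either; live requests also need a descriptive User-Agent header per their etiquette rules:

```python
# Build a MusicBrainz web service lookup URI for an artist MBID.
# Fetching it (with a proper User-Agent) returns JSON describing the
# artist plus, with inc=url-rels, links out to Wikipedia, Discogs etc.
MBID = "b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d"  # The Beatles (example MBID)

def artist_lookup_url(mbid, inc="url-rels"):
    """Stable identifier in, data URI out."""
    return f"https://musicbrainz.org/ws/2/artist/{mbid}?inc={inc}&fmt=json"

url = artist_lookup_url(MBID)
print(url)
```

The point being that the identifier alone is enough: paste the URI into a browser or hand it to any HTTP client and the data comes back.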

So if any record label or agent or publisher or delivery service ever asked me what the smallest useful change to the data they store might be, I’d say just store MusicBrainz identifiers against your records. Even if you’re not yet using open data, one day they’ll be useful. Stable identifiers are the gateway drug to linked data. And I’d advise any record label large or small to spend a small portion of the money they might have spent building bespoke websites and maintaining social media accounts, on adding their data to MusicBrainz. Everybody benefits, most of all your consumers.

ps If you’re an indie artist Tom Robinson wrote a great guide to getting started with MusicBrainz here.

Dumb TVs

Following on from this year’s CES there’s been lots of talk about bigger, better, sharper, smarter TVs. As ever conversation around gadgets tends to get caught up with conversations around business models which tends to lead to breathless commentary on OTT vs traditional broadcast and whether smart TVs will render traditional broadcasters as obsolete as Blockbusters, HMV and Jessops. But this is only tangentially about that.

Rumbling away in the background is the usual speculation around Apple’s plans to “revolutionise” the TV “experience” and whether they’re planning to do the same to the TV industry as they did to the music industry (content deals permitting). In among the chatter there seems to be an assumption from some commentators that Apple’s plans for TV revolve around how Apple TV might improve the on-screen interface and controls, possibly replacing the EPG with an App Store style interface. There’s a tendency amongst media futurologists to predict the future by extrapolating from the past; therefore televisions will follow the same fat-client route as phones and already complicated TV interfaces will become more complicated still.

But to my mind this doesn’t make sense. Apple already own the content discovery route via their iDevices, they own the content acquisition route via iTunes and they own the play-out route via AirPlay. Why do they need to invent fat-client TV sets when they’ve already put fat-client laptops, tablets and phones into the hands of their customers? The App Store model might just about work when it’s in your hand / on your lap. But placing the same interaction model 10 feet away just doesn’t offer the affordances you need to discover, purchase and play programmes. From an accessibility angle alone, making potential customers interact from 10 feet away when you’ve already given them a better option seems like a painful redundancy.

How “smart” do TVs need to be?

In more general terms I think there’s a problem with the definition of a “smart” TV and the interfaces envisaged. If TVs are web connected why do they need to be smart? Some arguments why not:

  1. Upgrade cycles for TVs and radios (and most other household white goods) are too slow to build in smartness. Build in too much and the smarts go obsolete before the primary function of the device.
  2. For any connected device smartness belongs in the network. This is why we connect them. If there are existing discovery and distribution channels and backchannels, then all a TV needs to do is accept instructions from the network; a connected (but dumb) screen.
  3. 10 feet away is no place for an interface. And just because a device has a screen doesn’t mean it has to be an input. As TV functionality becomes ever smarter and more complicated, the remote control grows to fit the demands and we end up with something almost resembling a keyboard on the arm of the sofa. When there’s a much better, much more accessible phone or pad or laptop (or any point in between) sat redundant alongside.
  4. The App Store / Smart TV model presupposes the existence of apps. But making native apps is expensive and the more platforms you have to provide for the more expensive it gets. A dumb TV only needs to accept instructions and play-out media.
  5. TV screens tend to be shared devices and authentication, personalisation and privacy concerns are hard on a shared device. Hard from an implementation point of view and hard from a user comfort point of view. There’s a spectrum from TV screen to desktop PC to laptop to tablet to phone and the further down that list you travel the less shared / more personal the device feels and the more comfortable users feel with authentication. Dumb TVs move authentication to where it makes sense.
  6. Smart TVs open up the possibility of device manufacturers finding a new role as content gatekeepers. Having control of both the interface and the backchannel data allows them to control the prominence of content. This is a particular problem for public service broadcasters. By the time your smart TV is plugged into your set top box and your assortment of games consoles, the front room TV can acquire a stack of half a dozen gatekeepers. Just keeping track of which one is currently active and which one you need to control is confusing.
  7. Media people like to talk about TV as a “lean back” medium. This is pure conjecture but it’s possible that separating the input interface from the play-out leads to this more “lean back” experience…

How dumb is dumb?

From conversations around Dumb TVs there seem to be two main options: the dumb but programmable TV and the dumber than Cletus TV.

Programmable TVs

Modern TV sets don’t live alone. There are ancillary devices like PVRs which sit alongside the TV box. TVs don’t need to be programmable but PVRs do. The big question is where you want to programme your PVR from. If it’s same room / same local area network then there’s no need for any additional smartness or authentication. If it’s on the same network you can control it. If you want to programme your PVR from the top deck of the bus this is somewhat harder. Somewhere you need a server to mediate your actions and given the need for a server there’s a need for authentication. But…

…how do PVRs as discrete devices make sense in a connected world? If 3 million people choose to record an episode of Doctor Who that’s a hell of a lot of redundant storage. And a hell of a lot of redundant power usage. Over time PVR functionality will move to “the cloud” (the legality of loopholes notwithstanding); your mobile will programme it, you’ll discover content there and push that content to your TV screen. With no need for TV programmability.

Dumb, dumb, dumb

So what’s the very simplest thing with the least build and integration costs? Something which allows you to push and control media from a fat client to a dumb TV. DIAL promises to do something similar but seems to assume a native app at each end; the simplest thing is probably two browsers.

So somehow devices on a local area network need to be able to advertise the functionality they offer. There’s a web intents connection here but I’m not quite sure what it is. Once your laptop / tablet / phone knows there’s a device on the network which can play audio / video it needs to make that known to the browser. So there needs to be some kind of browser API standardisation allowing for the insertion of “play over there” buttons. And the ability to push a content location with play, pause, stop and volume control notifications from the browser on the fat client to the browser on the dumb TV. Which might be something like WebRTC. Given the paywalls and geo-restrictions which accompany much of the online TV and movie business there’d probably need to be some kind of authentication / permission token passed. But that’s all dumb but connected would involve.
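A sketch of what those “dumb but connected” control messages might look like on the wire. Every action and field name here is invented for illustration; a real version might ride over WebRTC data channels or WebSockets, but the shape of the protocol is the point:

```python
import json

# Hypothetical message vocabulary: a fat client (phone / tablet / laptop)
# pushes a content location plus simple transport controls to the dumb TV.
ALLOWED_ACTIONS = {"play", "pause", "stop", "volume"}

def make_command(action, media_url=None, volume=None, token=None):
    """Serialise one control message for the TV's browser to act on."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    msg = {"action": action}
    if media_url is not None:
        msg["mediaUrl"] = media_url  # content location to play out
    if volume is not None:
        msg["volume"] = volume       # 0.0 – 1.0
    if token is not None:
        msg["token"] = token         # auth / permission token for paywalled content
    return json.dumps(msg)

print(make_command("play", media_url="https://example.org/stream.m3u8"))
```

Everything the TV needs fits in a handful of verbs and a URL, which is rather the argument: the smarts stay on the device in your hand.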

A late answer to a question from the digital humanities conference

The week before last Silver and I went along to the Realising the Opportunities of Digital Humanities conference in Dublin. We gave a short presentation about linked data at the BBC then sat on a panel session attempting to answer questions. I’ve never been on a panel before but it’s a bit like an interview: you only think of the answer you wanted to give once you’ve left the room.

Anyway, one person asked a question something like, “with all this data does ‘content’ become a second-class citizen?” At the time we were sat in the Royal Irish Academy library which is three storeys of floor to ceiling books. The thought that all that humanity could ever become subservient to some descriptive data seemed so odd that I don’t think anyone even answered the question. A bit like suggesting if the library catalogue ever got good enough you could just burn the books.

A follow up point was made by a person from RTE about the costs associated with digitising content. I think it’s often hard to justify digitisation costs because the thing you end up with is pretty much the thing you started with except in ones and zeros. And to my mind there are three steps to opening an archive. As a fag packet sketch it would look something like:

Step 1: Digitisation

As the RTE person said, digitisation is expensive and sometimes hard to justify. And I have no ideas on how to make it less expensive. Until content is digitised there’s no way to get the economies of scale that computers and the web and the people on the web bring. And there’s no real benefit in just digitising. You can put the resulting files on the web but until they link and are linked to and findable they’re not in the web. To put stuff in the web you need links and to make links you need context so…

Step 2: Contextualisation

Once you have digitised files you need to somehow generate context and there seem to be three options:

  1. employ a team of librarians to catalogue content – which is great if you can but doesn’t scale too well and can, occasionally, lead to systems which only other librarians can understand
  2. let loose the machines to analyse the content (OCR, speech to text, entity extraction, music detection, voice recognition, scene detection, object recognition, face recognition etc etc etc) – but machines can get things wrong
  3. build a community who are willing to help – but communities need nurturing (not managing, never managing)

The project I’m currently (kind of) working on has the research goal of finding a sweet spot between the last two: machine processing of audio content to build enough descriptive data to be corrected and enhanced by a community of users. Copying / pasting from an earlier comment:

We’ve been working on a project for BBC World Service to take 70,000 English-language programmes and somehow make them available on the web. The big problem is that whilst we have high quality audio files we have no descriptive data about them. So nothing about the subject matter discussed or who’s in them and in some cases not even when they were broadcast.

To fix this we’ve put the audio through a speech to text system which gives us a (very) rough transcript. We’ve then entity extracted the text against DBpedia / Wikipedia concepts to make some navigation by “tags”. Because the speech to text step is noisy some of the tags extracted are not accurate but we’re working with the World Service Global Minds panel (a community of World Service listeners) who are helping us to correct them.
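A toy sketch of that tagging step. The concept list, the two-word window and the confidence threshold are all invented for illustration (the real pipeline matched against DBpedia / Wikipedia); the idea is just that fuzzy matches over a noisy transcript get a confidence score, and low-confidence tags are flagged for the community to review:

```python
import difflib

# Toy stand-in for DBpedia / Wikipedia concept labels
CONCEPTS = ["Nelson Mandela", "Berlin Wall", "World Service"]

def extract_tags(transcript, cutoff=0.8):
    """Fuzzy-match two-word windows of a noisy transcript against known concepts."""
    tags = []
    words = transcript.split()
    for i in range(len(words) - 1):
        candidate = " ".join(words[i:i + 2])
        for concept in CONCEPTS:
            score = difflib.SequenceMatcher(None, candidate.lower(),
                                            concept.lower()).ratio()
            if score >= cutoff:
                # imperfect matches get flagged for human correction
                tags.append({"tag": concept, "score": round(score, 2),
                             "needs_review": score < 0.95})
    return tags

# "nelsen" is a deliberate speech-to-text style error
print(extract_tags("nelsen mandela spoke after the berlin wall fell"))
```

The machine does the boring sweep at speed; the people fix the “nelsen”s. That division of labour is the whole approach.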

Machines plus people is an interesting approach but, like digitisation, machine processing of content is expensive. Or at least difficult to set up unless you’re a technology wizard. And there’s a definite gap in the market for an out-of-the-box, cloud-based (sorry) solution (sorry) for content processing to extract useful metadata to build bare-bones navigation.

Step 3: Analysis

The line between contextualisation and analysis is probably not as clear cut as I’ve implied here. But by analysis I mean any attempt to interrogate the content to make more meaning. I’m reminded of the recent Literature is not Data: Against Digital Humanities article by Stephen Marche:

But there is a deeper problem with the digital humanities in general, a fundamental assumption that runs through all aspects of the methodology and which has not been adequately assessed in its nascent theory. Literature cannot meaningfully be treated as data. The problem is essential rather than superficial: literature is not data. Literature is the opposite of data.

Data precedes written literature. The first Sumerian examples of written language are recordings of beer and barley orders. But The Epic of Gilgamesh, the first story, is the story of “the man who saw the deep,” a hero who has contact with the ineffable. The very first work of surviving literature is on the subject of what can’t be processed as information, what transcends data.

It also reminds me of a trip to a Music Information Retrieval conference a couple of years back. Every other session was accompanied by a click track and seemed to be another attempt to improve “onset beat detection” in some exotic music genre by two or three percent. I’m no musicologist but it felt like a strange approach to determining meaning from music. If you were ever asked to describe punk or hip-hop or acid house I doubt you’d start with chord sequences or rhythm patterns. For at least some genres the context of the culture and the politics (and the narcotics) feels like a better starting point.

So I think when we throw machines at the analysis part there’s a tendency to reduce down the ineffable to a meaningless set of atoms. Or a pile of salt. Machines have their place but it’s their usual place: boring, repetitive tasks at speed.

Getting back to the diagram: once archive items are digitised, contextualised, findable and in the web they become social objects. People can link to them, share them, “curate” them, annotate them, analyse them, celebrate them, debunk them, take them, repurpose them, “remix” them and make new things from them. The best description I’ve seen of the possibilities of what could happen when people are allowed to meet archive items is Tony Ageh’s recent speech on the Digital Public Space which is too good to quote and should just be read.

On the original question then, no I don’t think “content” (novels, poems, pamphlets, journalism, oral history, radio, film, photography etc) will ever become a second-class citizen to the data describing it. And digitisation costs are a lot easier to justify when coupled with contextualisation and analysis. And some jobs are best done by machines, some by people.


A plea for the word service

Some small things that bother me about the word product. Mostly stuff I’ve already said on Twitter rearranged into sentences.

Product doesn’t really mean anything

Or its meaning is slippery. It certainly doesn’t seem to mean what my dictionary says it does:

an article or substance that is manufactured or refined for sale: dairy products.
a thing or person that is the result of an action or process: her perpetual suntan was the product of a solarium.

I guess digital services are things resulting from actions and processes but then so is everything and anything else.

Because product doesn’t mean anything, combining it with other words that don’t mean anything makes meaningless sentences

Like “Minimum Viable Product”. A good deal of the conversation around “product development” is an attempt to define an MVP but no-one can ever agree on that definition. If you substitute viable for useful and product for service it’s still not immediately apparent but at least it’s a starting point for a conversation. Useful for what? To whom? Useful enough? Useful enough for people to pay for it? Useful for enough people for enough people to pay for it? What problem does it solve? Who has that problem?

Because product doesn’t mean anything people bring their own meanings to it

Some people seem to define it as a managed development process. And lots of the talk around “products” seems to be more about process than product. About cross-discipline teams and co-location. Which is equally applicable to building anything.

Some people seem to think it’s about building things that people want rather than just chucking up some web pages. Something about product life-cycles and future development based on interpretation of analytics. Which is all equally (and probably more) applicable to service design. Did we ever just make something and stick it live and forget about it? I can think of a few examples where this was true but not much I’ve ever worked on.

And some people think it’s about building something marketable; something that can be summed up in a one liner on the side of a bus. If you can’t explain to the people your service is designed to help how it might help them there’s probably not much point making it.

All the ‘products’ our industry makes get marketed as services

Services to help you pay bills or fill in tax returns or contact your MP or listen to music or radio or watch television or films or share photos or videos or find places. Why describe them between ourselves as products when we describe them to users / customers as services?

It at least implies branded, bounded boxes

Product management in a software sense has been around at least since Microsoft were still shipping software as shrink-wrapped packages. Back then the product word probably made sense but now…

Everything that used to be a product is becoming a service

Software, music, books, films, games, maps, newspapers, even bathroom scales and maybe one day the beloved internet fridge. As everything that can be connected is connected, the lines between tangible products and intangible services are blurring. But it’s more about products becoming more service-like (less hard edged, more malleable, more adaptive to use) than services becoming more product-like.

Services outlive products

Lots of services manifest themselves through a product but when your iPhone 9 is making its final journey to the landfill site, chances are something very like the iTunes store or Amazon’s Whispersync will still be around.

No-one I know makes products

From a rough list of friends and acquaintances there’s some teachers, a lawyer, a nurse, a doctor, a postman, someone who does “cloud provisioning” of some sort, some people who make it easier to listen to radio and watch TV, some people who make it easy to find and read news, some people who make legislation and case law available to lawyers, some people who make it easier to book travel, someone who provides wire feeds to news organisations etc. All of them provide services, not one of them makes products. Occasionally the service is manifested as something product-like but more as a token (like a postage stamp or an Oyster card) for service. The value (and the surplus value of labour) is always in the service. You don’t buy a loaf of bread, you buy the convenience of not having to bake.

Service just sounds nicer

Services are helpful. They adapt as you use them. They’re things that people need (in a large or small sense); things which make their lives easier. Products are things that people can be persuaded they want or desire.

Maybe I’m just out of date and not following the change of language but to my ears products are still tangible and the things our industry makes are largely intangible. And we have a perfectly good word for that.