One thing connected journalism could do

This is a rant I tend to have every time I tag along to a news / newslabs meeting so I figured writing it up would save time in the future.

I’ve typed words before about journalism and the relentless churn of repetition, mystic meg prediction and loose, unqualified claims of causality. And somewhere in there claimed that of the Five Ws of journalism “why is always the most interesting question and because the most interesting answer”. “Because of this that” seems to be the underlying message of most journalism even if it does get wrapped up in insinuation, nudges and winks.

And because I’m as guilty of repetition as the next hack, in another post I made the same point:

In news storytelling in particular, why and because are the central pillars of decent journalism. Why is my local library closing? Because of council cutbacks. Why are the council cutting back? Because of central government cutbacks. Why are central government cutting back? Because they need to balance the national budget. Why does the budget need to be balanced? Because the previous government borrowed too much. Why did they borrow too much? Because the banks collapsed. Why did the banks collapse? [etc]

The problem I think we have is that causality claims are not only insinuations but that they’re confined to the article. Connected journalism would make assertions of causality explicit and free them from the confines of the rendered article or programme so chains of claims could be assembled and citizens could trace (assorted and disputed) claims of causality from the (hyper?)-local to the national to the global. And back.
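
As a rough illustration of what "explicit" might mean here, this is a minimal sketch in Python of causality claims held as data rather than prose. The `Claim` shape, the `trace` function and the source names are all assumptions made up for the example, not an existing news ontology; the claims themselves are lifted from the library example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    effect: str  # the thing being explained (the "why?")
    cause: str   # the asserted explanation (the "because")
    source: str  # who asserted it (hypothetical outlet names below)


def trace(effect, claims, seen=None):
    """Return every chain of claims leading back from an effect."""
    seen = seen or set()
    chains = []
    for claim in claims:
        # Skip causes already visited so circular claims cannot loop forever.
        if claim.effect == effect and claim.cause not in seen:
            tails = trace(claim.cause, claims, seen | {effect})
            chains.extend([[claim] + tail for tail in tails] or [[claim]])
    return chains


claims = [
    Claim("library closing", "council cutbacks", "local paper"),
    Claim("council cutbacks", "central government cutbacks", "national paper"),
    Claim("central government cutbacks", "balancing the national budget", "national paper"),
]

chains = trace("library closing", claims)
for chain in chains:
    for claim in chain:
        print(f"{claim.effect} <- {claim.cause} (according to {claim.source})")
```

Because each claim carries its source, two outlets asserting different causes for the same effect would simply produce two chains, which is about as far as "assorted and disputed" can be modelled in a dozen lines.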

Given that the world becomes more globalised and more decisions get made above democratically elected governments it’s often not clear where the levers of power might be found or whether they’d actually work if you found them. People become divorced from democracy when they no longer see how the policies they’re voting for actually impact on their lives. And power structures become less of a map and more an invisible set of ley lines. Connected journalism would at least attempt to give national (or international) context to local events and local context to (in)ternational events. Which sounds like something public service journalism should at least attempt.

And given that no one organisation can hope to report everything it would provide hooks and links between news organisations and citizen journalists and maybe help to sustain local news as a functioning industry.

I think this possibly echoes some of what Tony Hirst wrote earlier today about open data and transparency. As schools and hospitals and every other social service get reduced to a league table of stats, the decisions that lead to those numbers, and the why and because, get lost in the data noise. And isolated incidents of journalism don’t fill any of those gaps.

So as ever, none of this will probably happen. Too many people would have to work together before any bigger picture of society got painted and “greater than the sum of the parts” always works better on PowerPoint slides than in reality. In the meantime news orgs can continue to worry about getting more news onto more screens and in more channels. Because in New York Times innovation fashion, it’s definitely, absolutely not the journalism that’s the problem, just that people don’t read or need it.

Yet another post about Google (not really) removing the URL bar from Chrome

A Twitter conversation with Frankie Roberto and Paul Rissen continuing in my head.

A few days ago Google released a canary build of Chrome which, by default, hid the URL of the current page behind a button. The clue was probably in the canary part; the next day’s build reverted to visible URLs. And the URL was never actually removed, just placed a click away. Users were free to type URLs or search by whichever search engine they had configured in standard “omnibar” fashion.

Even so just about everyone seems to have chipped in with an opinion about this. I can’t pretend I have a clue about whether Google were experimenting with ways to protect users from phishing attacks or whether it was just a toe in the water of self interest (an attempt to expand and consolidate centralised power). But it’s their browser and I guess they can do what they like with it. Beware Silicon Valley bearing gifts and all that; they’ll probably arrive in wrapping paper made of adverts.

What I do think is: the URL has too many people expecting too much and that makes things break.

A few years back I always used to say a good URL should be three things:

  1. Persistent
  2. Readable
  3. Hackable

And the greatest of these was persistent. Three is a nice number and I enjoy a biblical reference or two but I’m not sure I ever really thought hackable mattered. It’s a nice geek feature to navigate by chopping bits out of URLs but do punters actually do that? If they do it’s probably more because your navigation is broken than because your URLs make nice sentences.

But the persistent / readable trade off is hard. My natural inclination is to repeat the words of my former teacher; “the job of a URL is to identify and locate a resource, the job of a resource is to describe itself.” And of course to quote liberally from Cool URIs don’t change. Whilst making the usual point that:

<a href="this-bit-is-for-machines">this-bit-is-for-people</a>


<a href="identify-and-locate">label-and-describe</a>

Which is why browsers have URL and title bars. Identifying / locating and labelling / describing are different things and HTML and browsers provide for both.

All of which is fine in theory but…

URLs have long since broken free of the href attribute and the URL bar. They’re on TV, read out on radio and on the side of buses. Pretending that URLs are just there to identify and locate sidesteps how they actually get used and how people think about them. When they stopped being an implementation detail of the linky web, when they stopped being identifiers and started becoming labels, everyone had an opinion on what they were for and what they should look like. The techy people have an opinion, the UX people have an opinion, the brand manager has an opinion, the marketing department have an opinion, the SEO people have an (often misguided) opinion and then the social media team chip in with theirs. And the people selling domains want to sell more domains. None of the opinions agree or are reconcilable. Like most things with more than one stakeholder the result is a bit of a shambles.

I guess the starting point is what do punters actually want from URLs:

  1. they want to trust that the page they’re looking at comes from the domain they think it does
  2. a sub-set want to copy and paste the URL and make new links
  3. they want to trust that the link they’re clicking on goes to the domain they think it does
  4. they might want to type one they’ve seen on the side of a bus into a box but probably they’ll just search like normal people do

And that’s probably about it. But it does mean that as well as techy and UX and marketing and SEO and etc opinions the URL also gets lumbered with providing provenance and trust. It’s quite a lot to expect from a short string of letters and numbers and colons and slashes.

That said, in almost all cases (aside from the suspiciously spammy looking email) trust really resides in the linker and not the link. There are plenty of places where the bit-for-people part just replicates the bit-for-machines, often with the added horrors inserted by a series of URL shorteners. But we keep clicking on links in Twitter because we trust the linker not the link.

Even so there must be a way we can decouple provenance from location from label. What we’ve got now doesn’t work because too many “stakeholders” disagree about what we’re trying to achieve. It’s hard to not break the web because the marketing manager changes their mind about the brand message and no-one knows how to separate identifiers from labels. The problem isn’t with Google “removing” the URL bar; whatever any browser provider does to patch over this won’t work because there isn’t a right answer because the problem goes deeper. We’re misusing a thing designed to do one thing to do half a dozen other things none of which are compatible.


A couple more things since I posted this:

Should URLs be “hackable”?

Via Twitter Matthew Somerville said, FWIW I know many people who ‘hack’ the URLs, though not many of them would call it that ;) .

It’s something I do myself, usually to get a feel for the shape of the thing, more often when presented with a new website to check if there are any holes in the URL structure. As the same old mentor used to say, “never hack back to a hole”. Does it really matter? Not really but removing bits of the URL on the lookout for redirects or 40Xs is a pretty good proxy for how much care and attention has been given to the design.
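
For the curious, the chopping itself is mechanical. Here is a small Python sketch that lists the URLs you would visit by hacking back one path segment at a time. `hack_back` is a made-up name and example.org is a placeholder; actually requesting each URL and noting the redirects and 40Xs is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def hack_back(url):
    """Yield each URL reached by chopping one more path segment off the end."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    while segments:
        segments.pop()
        # Query string and fragment are dropped: only the path is being hacked.
        path = "/" + "/".join(segments)
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))


for candidate in hack_back("https://example.org/programmes/genres/drama"):
    print(candidate)
```

Whether the parent URLs should carry a trailing slash is a choice the sketch makes arbitrarily; a site that cares about its URL design will redirect one form to the other.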

I can’t deny hackable URLs are cute and lots of geeks seem to think the same. I just searched Twitter for “hackable URL” and came across someone who “loves RESTful, hackable URLs” which is as big a misreading of REST as almost all other uses. But in real life (and in user testing) I’ve never seen anyone go anywhere near the URL bar. It gets used to check the page they’re looking at really is coming from their bank, to bring back websites from history and to summon Google. I suspect (though have no data) that’s the majority use case. Given all the other things we seem to expect of URLs, expecting them to also function as navigation widgets probably just adds to the confusion.

And again, conflating REST with human readable and hackable is just wrong. And don’t get me started on “RESTful APIs” which are apparently something different from websites.

Should URLs be hidden?

I stumbled across a post from Nicholas C. Zakas with the title URLs are already dead which didn’t actually say URLs were dead (because that would be silly) but did say they were slowly disappearing in the same way email addresses and telephone numbers are disappearing. Which is true; URLs are already hidden away in iOS and as screen sizes shrink that will probably continue. Wherever browsers can use titles in preference to URLs they do. Autocomplete (from history) completes on titles (and URLs), history shows title not URLs, bookmarks show titles not URLs. Take a look at your bookmarks and history and imagine how much less useful and useable they’d be if they listed URLs.

The natural extension is to put URLs a click away from the URL bar. Whatever their motivations Google were right to hide the URL. It’s just a shame it only happened for one day.

Does hiding URLs in the browser solve the bigger problem?

No because URLs long ago stopped being the province of developers and became voodoo fetish objects for marketeers and brand consultants. I’d happily predict that the first place where we’ll no longer see URLs will be the browser. Well past that point they’ll be shown on telly screens, read out on air, plastered over posters etc.

I now think my thinking that URLs / URIs / whatever should be persistent, human readable and hackable made a nice slogan but was just wrong. They should only and always be persistent. Everything else is just sugar.

But that still leaves us with a problem because the marketeers and sales people still want to slap URLs over posters and books and beer mats. It’s interesting that the presence of a URL no longer seems to signify you can get some more information if you type this into a URL bar but instead to signify a vague acceptance of modernity (look we’re the webz).

Or at least that’s my understanding. Presumably the marketeers don’t assume punters emerge from a tube station and type these URLs into URL bars? Because that isn’t what appears to happen. From my day job I know plenty of people search for “”. Given the omnibar I’m fairly sure lots of people end up searching Google for Google. They’re just happier using search than weird looking slashdot protocols. Twitter is an interesting side case where the slightly geeky @ of @fantasticlife displaces the very geeky slashdots of Good.

So what if the marketeers could be dissuaded from plastering URLs over every surface they see. It would make our lives easier because we’d no longer have to have all those conversations trying to find a middle ground between “must / just persistent” and “must carry the brand message”. But it won’t happen because the alternative is something like, “just search for” and then you’re at the mercy of Google and Bing and all those competitors outbidding you for keywords.

Which is complete bollocks. Because that’s what happens. Punters do not memorise your URL and even if they do they search for it anyway. Your organisation / brand / “product” / whatever is already at the mercy of search engines because that’s how real people use the web.

So for the love of god, Google, if only to save me from another meeting conversation about this, please hide the URL behind a click in Chrome. And hope the marketeers start to think that covering the world in URLs makes as much sense as covering it in ISBNs or catalogue numbers or Amazon product IDs.

Sausages and sausage machines: open data and transparency

Last Wednesday was the second BBC Data Day. I didn’t manage to make the first one but I did end up chatting afterwards with various BBC, ODI and OU people about the sort of data they’d like to see the BBC release. Shortly afterwards I sketched some wireframes and off the back of that was invited to talk at the second event. Which I also didn’t manage to make because I was at home, ill and feeling sorry for myself. In the event Bill stepped in and presented my slides. These are the slides and notes I would have presented if I had managed to be there:

Slide 1


I’d like to talk about open data on the web, what it’s for and in particular how it enables transparency to audiences across journalism and programme making.

Slide 2


So why publish open data on the web? Three common reasons are given:

  1. to enable content and service discovery from 3rd parties like Google, Bing, Facebook, Twitter etc. These are things like, Open Graph, Twitter Cards etc used to describe services so 3rd parties can find your stuff and make (pretty) links to it. Which often becomes a very low level form of automated syndication because that’s how the web works
  2. to outsource innovation and open up the possibilities of improving your service to 3rd parties. The Facebook strategy of encouraging flowers to bloom around their fields. Then picking the best ones and buying them
  3. and finally because… transparency. To show the world your workings in the best interests of serving the public

Today I’m only really talking about transparency.

Slide 3


So sausages. The BBC already publishes some “open” data but that data only describes the end product, the articles and programmes, and not the process.

Slide 4


This is the Programmes Ontology. It shows the kinds of data we publish about all BBC programmes.

There are programme brands and series and individual episodes and versions of those episodes and broadcasts and iPlayer availabilities. The kind of data you’d need to build a Radio Times or an EPG. Or iPlayer.
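
To make that shape concrete, here is a much simplified Python sketch of the hierarchy and of the flattening an EPG would need. The class and field names are illustrative rather than the ontology's actual terms, iPlayer availabilities are left out for brevity, and the brand, channels and times are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Broadcast:
    channel: str
    start: str  # ISO 8601 timestamp, so plain string sorting is chronological


@dataclass
class Version:
    label: str
    broadcasts: list = field(default_factory=list)


@dataclass
class Episode:
    title: str
    versions: list = field(default_factory=list)


@dataclass
class Series:
    title: str
    episodes: list = field(default_factory=list)


@dataclass
class Brand:
    title: str
    series: list = field(default_factory=list)


def schedule(brand):
    """Flatten a brand into (start, channel, brand, episode) rows in time order."""
    rows = [
        (b.start, b.channel, brand.title, e.title)
        for s in brand.series
        for e in s.episodes
        for v in e.versions
        for b in v.broadcasts
    ]
    return sorted(rows)


# Invented sample data: one series, two episodes, one repeat on another channel.
brand = Brand("Example Brand", [
    Series("Series 1", [
        Episode("Episode 2", [Version("original", [
            Broadcast("Channel One", "2014-05-12T20:30:00"),
        ])]),
        Episode("Episode 1", [Version("original", [
            Broadcast("Channel One", "2014-05-05T20:30:00"),
            Broadcast("Channel Two", "2014-05-07T23:00:00"),
        ])]),
    ]),
])

listing = schedule(brand)
for row in listing:
    print(row)
```

Everything in the sketch describes the finished programme and where it went out, which is rather the point: nothing in this shape says how the programme came to exist.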

Slide 5


And this is the brand page for Panorama. Ask for it as data and you’ll get…

Slide 6



Again a brand with episodes with broadcast etc

Slide 7


What’s interesting is what isn’t there. What goes on in the factory before the sausages make it to the shelves.

Slide 8


Things like:

  1. commissioning decisions. Who? When? Why? What didn’t get commissioned?
  2. scheduling decisions
  3. talent decisions
  4. guest decisions
  5. running orders. What things / what order?

Who refused to appear? Who refused to put up a spokesperson? What was the gender split of guests? What was the airtime gender split?

A couple of weeks back there was a George Monbiot piece in the Guardian bemoaning the fact that BBC programmes often didn’t include enough background information about guests on current affairs programmes. Particularly in respect to connections with lobbyists and lobbying firms.

As a suggestion: every contributor to BBC news programmes should have a page (and data) on listing their appearances and detailing their links to political parties, NGOs, campaigning groups, lobbyists, corporations, trade unions etc.

Slide 9


Away from programmes, what would transparency look like for online news?

Slide 10


The Guardian is the most obvious example where clarifications and corrections aren’t hidden away but given their own home on the website.

Slide 11


And the articles come with a history panel which doesn’t show you what changed but at least indicates when a change has happened.

The Guardian’s efforts are good but not as linked together as they might be.

Slide 12


Unlike Wikipedia. This is the edit history of the English Wikipedia article on the 2014 Crimean Crisis. Every change is there together with who made it, when and any discussion that happened around it.

Slide 13


And every edit can be compared with what went before, building a picture of how the article formed over time as new facts emerged and old facts were discounted.

Slide 14


I didn’t manage to attend last year’s data day but I did end up in the pub afterwards with Bill and some folk from the ODI and the Open University.

We talked about the kind of data we’d all like to see the BBC release and it was all about the process and not the products. The sausage factory and not the sausages.

We made a list of the kinds of data that might be published and it fitted well with how the BBC likes to measure its own activities: Reach, Impact and Value.

Slide 15


It also looked a lot like this infographic which made the rounds of social media last week detailing the cost per user per hour of the BBC TV channels

Slide 16


These were the wireframes I made following last year’s pub chat.

They were intended to sit on the “back” of BBC programme pages; side 1 would show the end product, side 2 would be the “making of” DVD extra, the data about the process.

Headline stats for every programme would include total cost, environmental impact, number of viewers / listeners across all platforms and the cost per viewer of that episode.

Programmes would be broken down by gender split of contributors and their speaking time.

Reach would list viewer / listener figures across broadcast, iPlayer, downloads and commercial sales.

Slide 17


Impact would list awards, complaints, clarifications, corrections and feedback from across the web.

And value would list production costs, acquisition costs and marketing spend.

All of this would be available as open data for licence fee payers to take, query, recombine, evaluate and comment on.
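
None of the headline stats need anything cleverer than division. A minimal Python sketch of two of them follows, with every figure invented purely for illustration (no real BBC numbers were available, as the next slide explains) and with function and key names that are assumptions rather than anything from the wireframes.

```python
def cost_per_viewer(total_cost, audience_by_platform):
    """Total cost divided by the audience summed across every platform."""
    audience = sum(audience_by_platform.values())
    return total_cost / audience if audience else None


def speaking_share(seconds_by_group):
    """Fraction of total speaking time taken by each group of contributors."""
    total = sum(seconds_by_group.values())
    return {g: s / total for g, s in seconds_by_group.items()} if total else {}


# Invented figures for a single hypothetical episode.
episode_cost = 100_000  # pounds
reach = {"broadcast": 400_000, "iplayer": 90_000, "downloads": 10_000}
airtime = {"women": 900, "men": 2_100}  # seconds of speaking time

per_viewer = cost_per_viewer(episode_cost, reach)
shares = speaking_share(airtime)
print(f"cost per viewer: £{per_viewer:.2f}")
print({group: f"{share:.0%}" for group, share in shares.items()})
```

The hard part is not the arithmetic but getting hold of the inputs.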

Having made the wireframes I chatted with Tony Hirst from the OU about how we might prototype something similar. We came up with a rough data model and Tony attempted to gather some data via FOI requests.

Slide 18


Unfortunately they were all refused under the banner of “purposes of journalism, art or literature” which seems to be a catch all category for FOI requests marked “no”.

Google has 20 million results for the query “foi literature art journalism”, around 10 million of those would seem to relate in some way to the BBC.

The idealist in me would say that, for “the purposes of journalism”, in its noblest sense, and the greater good of society, the default position needs to flip from closed to open. The “purposes of journalism”, more than any other public service, should not be an escape hatch from open information.

And the public would benefit from “journalism as data” at least as much as from “data journalism”.

Photo credits

Packing Carsten’s weiner sausages on an assembly line, Tacoma, Washington by Washington University

Sausages at Wurstkuche by Sam Howzit


From the final few pages of Beneath the City Streets by Peter Laurie (1970) where he gets off the subject that the threat of global thermonuclear war might just be a plan to distract us and gets on to the subject of… transistors:

I am coming to believe that there is a much more serious threat to the technological way of life than the H-bomb. It is the transistor. Over the last two or three hundred years in the West we have followed a course of development that coupled increasingly powerful machines to small pieces of human brain to produce increasingly vast quantities of goods. The airliner, the ship, the typewriter, the lathe, the sewing machine, all employ a small part of the operator’s intelligence, and with it multiply his or her productivity a thousandfold.

As long as each machine needed a brain, it was profitable to make more brains and with them more profits. Industrial populations grew in all the advanced countries, and political systems became more liberal simply to get their cooperation.

But now we are beginning to find that we do not need the brains – at least not in the huge droves that we have them. Little by little [..] artificial intelligence is dispossessing hundreds of thousands and soon millions of workers. Because ‘the computer’ is seen only in large installations doing book-keeping, where it puts few out of work, this tendency goes on unnoticed. But in every job economics forces economies on management. Little gadgets here and there get rid of workers piecemeal. [..] Any job that can be specified in, say, a thousand rules, can now be automated by equipment that costs £200 or so. The microprocessor, which now costs in itself perhaps £20, [..] has not begun to be applied: over the next 10 to 15 years millions will be installed in industry, distribution, commerce. Machinery, which has almost denuded the land, will now denude cities.

Politically, this will split the population into two sharply divided groups: those who have intelligence or complicated manual skills that cannot be imitated by computers – a much larger group, who have not. In strict economic terms the second group will not be worth employing. They can do nothing that cannot be done cheaper by machinery. [..] The working population will be reduced to a relatively small core of technicians, artists, scientists, managers surrounded by a large, unemployed, dissatisfied and expensive mob. I would even argue that this process is much further advanced than it seems, and the political subterfuges necessary to keep it concealed, are responsible for the economic malaise of western nations [..] If one has to pay several million people who are in fact useless, this is bound to throw a strain on the economy and arouse the resentment of those who are useful, but who cannot be paid what they deserve for fear of arousing the envy of other.

If the unemployed can be kept down to a million or so in a country like Britain, the political problem they present can be contained by paying a generous dole [..] The real total of unemployed is hidden in business. What happens when automation advances further and the sham can no longer be kept up? [..] To cope with the millions of unemployed and unemployable people needs – in terms of crude power – greatly improved police and security services. [..] It suggests that the unemployed should be concentrated in small spaces where they can be controlled, de-educated, penned up.

Unless some drastic alteration occurs in economic and political thought, the developed nations are going to be faced in the next thirty years with the fact that the majority of their citizens are a dangerous, useless burden. One can see that when civil defence has moved everything useful out of the cities, there might be strong temptation on governments to solve the problem by nuclear war: the technological elite against the masses.

Now I’m no more of a fan of the technocratic, Silicon Valley, Ayn Rand fanboys than the next man on the street but even in my most paranoid moments I’d never suspected that when they’d done disrupting they might stagger out of a TED talk and H-bomb us all.

NoUI, informed consent and the internet of things

In more youthful days I spent a year studying HCI. I’m sure there was much more to it but unfortunately only three things have stuck in my mind:

  1. interactions should be consistent
  2. interactions should be predictable
  3. interactions should not come with unexpected side-effects

I half remember writing a dissertation which was mostly finding creative ways to keep rewriting the same three points until the requisite word count was reached.

I was thinking about this today whilst reading an assortment of pro and anti NoUI blog posts. I half agree with some of the points the NoUI camp are making and if they save us from designers putting a screen on everything and the internet fridge with an iPad strapped to the front I’d be happy. But mostly I agree with Timo Arnall’s No to NoUI post and his point that “as both users and designers of interface technology, we are disenfranchised by the concepts of invisibility and disappearance.”

This doesn’t really add much to that but some thoughts in no particular order:

  1. Too often chrome is confused with interface. There’s too much chrome and not enough interface.
  2. Even when something has a screen it doesn’t have to be an input / output interface. The screen can be dumb, the smarts can be elsewhere, the interface can be distributed to where it’s useful. The network takes care of that.
  3. An interface should be exactly as complex as the system it describes. The system is the interface. The design happens inside. I’m reminded of a quote from the Domain Driven Design book that, “the bones of the model should show through in the view”. An interface should be honest. It should put bone-structure before make-up (or lack thereof).
  4. The simplify, simplify, simplify mantra is all very good but only if you’re simplifying the systems and not just the interface. And some systems are hard to simplify because some things are just hard.
  5. No matter how much you think some side-effect of an interaction will please and “delight” your users if the side-effect is unexpected it’s bad design. You might want to save on interface complexity by using one button to both bookmark a thing and subscribe to its grouping but things are not the same as groups and bookmarks are not the same as subscriptions and conflating the two is just confusing. Because too little interface can be more confusing than too much.
  6. There seems to be a general belief in UX circles that removing friction is a good thing. Friction is good. Friction is important. Friction helps us to understand the limits of the systems we work with. Removing friction removes honesty and a good interface should be honest.
  7. Invisible interfaces with friction stripped out are the fast path to vendor lock-in. If you can’t see the sides of the system you can’t understand it, control it or leave because you don’t even know where it ends.
  8. If your goal is something like simplifying the “onboarding process” removing friction might well please your paymasters but it doesn’t make for an honest interface. Too much UX serves corporate goals; not enough serves people.
  9. Decanting friction out of the interface and turning it into a checkbox on the terms and conditions page is not a good interface.
  10. In the media world in particular there’s a belief that if you could just strip out the navigation then by the intercession of magic and pink fluffy unicorns “content will come to you”. Which is usually accompanied by words like intuitive systems. Which seems to miss the point that the thing at the other end of the phone line is a machine in a data centre. It is not about to play the Jeeves to your Bertie Wooster. It does not have intuition. What it probably has is a shedload of your usage data matched to your payment data matched to some market demographic data matched to all the same for every other user in the system. For the majority of organisations the internet / web has always been more interesting as a backchannel than as a distribution platform. They’d happily forego the benefits of an open generative web if only they could get better data on what you like.
  11. If and when we move away from an internet of screens to an “internet of things” the opportunities for sensor network and corporate-state surveillance multiply. Everything becomes a back-channel, everything phones home. With interface and friction removed there’s not only no way to control this, there’s no way to see it. Think about the data that seeps out of your front room: the iPad, the Kindle, the Samsung telly, the Sky box, the Netflix app, YouView, iPlayer, XBox, Spotify. And god only knows where it goes past there.
  12. Informed consent is the only interesting design challenge. With no interface informed consent is just another tick box on the set-up screen. Or a signature on a sales contract.
  13. The fact that we’ve not only never solved but deliberately sidelined informed consent in a world with interfaces doesn’t bode well for a world without.

More thoughts on open music data

Occasioned by someone describing media catalogue type data as the “crown jewels”. It is not the crown jewels. It is, at best, a poster pointing out the attractions of the Tower of London.

If any data deserves the description of crown jewels it’s your customer relationship data.

But since Amazon, Apple, Facebook and Google probably know more about your users / customers / audience / fan base than you do, you’ve probably already accidentally outsourced that anyway…

Longer thoughts over here

Events, causation, articles, reports, stories, repetition, insinuation, supposition and journalism as data

In a conversation with various folks around ontologies for news I went a bit tangential and tried to model news as I thought it should be rather than how it is. Which was probably not helpful. And left me with a bee in my bonnet. So…

Some events in the life of Chris Huhne

  1. In March 2003 he was clocked speeding somewhere in Essex. Already having 9 points on his licence a conviction would have seen him banned from driving so…
  2. …shortly after his then wife, Vicky Pryce, was declared to have been driving at the time of the speeding incident
  3. 16 days after the speeding incident he was caught again for using a mobile phone whilst driving and banned anyway
  4. In May 2005 he was elected to Parliament as the representative for Eastleigh
  5. Also in May 2005 Ms Pryce told a friend that Mr Huhne had named her as the driver without her consent
  6. Between October and December 2007 he stood for leadership of the Lib Dems
  7. At some point (I can’t track down) he began an affair with his aide Carina Trimingham
  8. In June 2010 he was clocked again, this time by the press emerging after spending the night at Ms Trimingham’s home
  9. A week later Ms Pryce filed for divorce
  10. In May 2011 The Sunday Times printed allegations that Mr Huhne had persuaded someone to pick up his driving points
  11. In the same month Labour MP Simon Danczuk made a formal complaint about the allegation to the police
  12. At some point after this there was a series of text messages between Mr Huhne and his son where his son accused him of lying and setting up Ms Pryce
  13. In February 2012 both Mr Huhne and Ms Pryce were charged with perverting the course of justice
  14. In June 2012 Mr Huhne and Ms Pryce announced they’d plead not guilty with Ms Pryce claiming Mr Huhne had coerced her into taking his penalty points
  15. In February 2013 the trial began and on the first day Mr Huhne changed his plea to guilty. He also resigned his parliamentary seat
  16. The trial of Ms Pryce continued. And collapsed shortly after when the jury failed to agree. Shortly after a second trial found her guilty
  17. In late February the by-election resulting from the resignation of Mr Huhne took place
  18. And in March 2013 they were both sentenced to 8 months in prison

Some of the events went on to become part of other storylines. For a brief while Mr Huhne’s driving ban for using a mobile phone at the wheel became part of a “Government makes a million a month from drivers using mobiles” story (at least for the Daily Mail), the collapse of the first trial of Ms Pryce became a story about failures in the trial by jury system and the result of the by election became part of a story about the rise of minority parties in austerity hit Europe.

Anyway this list of events is as partial as any other. Many more things happened (in public and in private) and some of the events listed were really lots of little events bundled up into something bigger. But that’s the trouble with events: they quickly go fractal because everything is one. As Dan said, “it’s good to think about events but it’s good to stop thinking about them too.” I’m not quite there yet.

Anyway, boiling things down further to fit in a picture:

Causation and influence

For every event there’s a fairly obvious model with times, locations, people, organisations, factors and products. And (mostly) the facts expressed around events are agreed on across journalistic outlets.

The more interesting part (for me) is the dependencies and correlations that exist between events because why is always the most interesting question and because the most interesting answer. Getting the Daily Mail and The Guardian to agree that austerity is happening is relatively easy, getting them to agree on why, and on that basis what should happen next, much more difficult.

The same picture this time with arrows. The arrows aren’t meant to represent “causality”; the fact that Mr Huhne was elected did not cause him to resign. But without him being elected he couldn’t have resigned so there’s some connection there. Let’s say “influence”:

Articles, reports and stories

The simplest model for news would bump most of the assertions (who, where, when etc.) to the events and hang articles off them, stitched together with predicates like depicts, reports or analyses. But whilst news organisations make great claims around reports and breaking news, journalists don’t talk about writing articles and rarely talk about writing reports. Journalists write stories, usually starting from a report about an event but filling in background events and surmising possible future events.
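As a rough sketch of that simplest model (not anything a news organisation actually runs), events could carry the agreed facts, influence claims could be explicit and attributed, and articles could hang off events via a predicate. The class names, field names and the example.org URL below are all invented for illustration; the dates are the ones from the list above, with the resignation placed in February 2013.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One thing that happened: the agreed what and when."""
    key: str
    when: str  # loose, human-readable date such as "2005-05"
    what: str


@dataclass
class Storyline:
    """Events plus explicit, attributable claims of influence between them."""
    events: dict = field(default_factory=dict)
    # Each claim is (earlier_event_key, later_event_key, who_claims_it)
    influences: list = field(default_factory=list)
    # Each link is (article_url, predicate, event_key)
    articles: list = field(default_factory=list)

    def add_event(self, event):
        self.events[event.key] = event

    def claim_influence(self, earlier, later, claimed_by):
        self.influences.append((earlier, later, claimed_by))

    def attach_article(self, url, predicate, event_key):
        if predicate not in {"depicts", "reports", "analyses"}:
            raise ValueError(f"unknown predicate: {predicate}")
        self.articles.append((url, predicate, event_key))

    def trace_back(self, key):
        """Follow influence claims backwards and return the earlier event keys."""
        seen, queue = [], [key]
        while queue:
            current = queue.pop(0)
            for earlier, later, _ in self.influences:
                if later == current and earlier not in seen:
                    seen.append(earlier)
                    queue.append(earlier)
        return seen


story = Storyline()
story.add_event(Event("elected", "2005-05", "Elected MP for Eastleigh"))
story.add_event(Event("charged", "2012-02", "Charged with perverting the course of justice"))
story.add_event(Event("resigned", "2013-02", "Resigned parliamentary seat"))
story.claim_influence("elected", "resigned", claimed_by="this post")
story.claim_influence("charged", "resigned", claimed_by="this post")
story.attach_article("https://example.org/huhne-resigns", "reports", "resigned")

print(story.trace_back("resigned"))  # ['elected', 'charged']
```

The point of keeping `claimed_by` on every influence is that two outlets can disagree about the arrows while still sharing the events.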

So an article written around the time of Mr Huhne’s resignation would look less like this:

and more like this:

Repetition, insinuation and supposition

The average piece of journalism is 10% reporting new facts and 90% repetition, insinuation and supposition where correlation and causation between events are never made explicit. Events from the storyline are hand picked and stitched together with a thin thread of causality. Often it’s enough to just mention two events in close proximity for the connections between them to be implied. The events you choose to mention and the order you mention them in gives the first layer of editorial spin.

And the claims you choose to make about an event and its actors are the second level. If there’s a female involved and she’s under 35 it’s usually best to mention her hair colour. “Bisexual” scores triple points. We know what we’re meant to think.

The Daily Mail took insinuation to new heights with the collapse of Ms Pryce’s first trial, printing a “story” about the ethnic make-up of the jury telling its readers:

Of the eight women and four men on the Vicky Pryce jury, only two were white – the rest appeared to be of Afro-Caribbean or Asian origin.

The point they were trying to make and how the appointment of a jury of certain skin colour might have led to the collapse of the trial was left as an exercise.

Sports journalism seems particularly attracted to insinuation and supposition. Maybe it’s because their events (and sometimes even the outcomes of those events) are more predictable than in most other news whilst the actual facts are mainly locked inside dressing rooms and boardrooms. But Rafa Benitez getting slightly stroppy in a news conference turned into, “Rafa out by the weekend, Grant to take over until the end of the season and Jose to return” headlines by the next day. None of which turned out to be true. Yet.

As Paul pointed out, the article as repetition of storyline and makeshift crystal ball wasn’t always true. In the past newspapers printed many small reports per page. This isn’t the best image but was the best image I could find without rights restrictions:

Photo via Boston Public Library cc-by-nc-nd

Neither of us knew enough about newspaper history to know when this changed or why it changed. Presumably there are good business reasons why articles stopped being reports and started being stories. We guessed that it might have been due to falling paper and printing prices meaning more space to fill but without evidence that’s just insinuation too.

To an outside observer the constant re-writing of “background” seems tedious to consume and wasteful to produce. Especially where the web gives us better tools for managing updates, corrections and clarifications. Maybe it’s because most news websites are a by-product of print production where articles are still commissioned, written and edited to fill a certain size on a piece of paper and are just re-used on digital platforms. But even news websites with no print edition follow the same pattern. Maybe it’s partly an SEO thing with journalists and editors trying to cram as many keywords into a news story as possible but surely one article per storyline with frequent updates would pick up more inbound links over time than publishing a new article every time there’s a “development”? It seems to work for Wikipedia. (Although that said, Google news search seems to reward the publishing of new articles over the updating of existing ones.) Or maybe it’s all just unintentional. Someone at the meeting (I forget who) mentioned “lack of institutional memory” as one possible cause of constant re-writing.

But in a “do what you do best and link to the rest” sense, constantly rewriting the same things doesn’t make sense unless what you do best is repetition.

An aside on television

Television producers seem to feel the same pull toward repetition: this is what we’re about to show you, this is us showing it, this is what we’ve just shown you. I have a secret addiction to block viewing (I think the industry term is binge viewing) episodes of Michael Portillo’s Great British Railway Journeys but for every 30 minute episode there’s 10 minutes of filler and 20 minutes of new “content”.

Interestingly the Netflix commissioned series assume binge viewing as a general pattern so have dropped the continuity filler and characterisation repetition and get straight into the meat of the story. Nothing similar seems to be happening with news yet but I’m an old fashioned McLuhanist and believe the medium and the message are inextricably tied so maybe one day…

Journalism as data

Over the last couple of years there’s been much talk of data journalism which usually involves scanning through spreadsheets for gotcha moments and hence stories. It’s all good and all helps to make other institutions more transparent and accountable. But journalism is still opaque. I’m more interested in journalism as data not because I want to fetishise data but because I think it’s important for society that journalists make explicit their claims of causation. You can fact check when and where and who and what but you can’t fact check why because you can’t fact check insinuation and supposition. At the risk of using wonk-words “evidence-based journalism” feels like a good thing to aspire to.

I’m not terribly hopeful that this will ever happen. If forced to be explicit quite a lot of journalism would collapse under its own contradictions. In the meantime I think online journalism would be better served by an article per storyline (rather than development), an easily accessible edit history and clearly marked updates. I’m not suggesting most news sites would be more efficiently run as a minimal wiki, pushing updates via a microblog-of-your-choice. But given the fact that if you want to piece together the story of Mr Huhne you’ll have more luck going to Wikipedia than bouncing around news sites and news articles… maybe I am.

Thoughts on open music data

Yesterday I wore my MusicBrainz hat (or at least moth-eaten t-shirt) to the music4point5 event. It was an interesting event, but with so many people from so many bits of the music industry attending I thought some of the conversation was at cross-purposes. So this is my attempt at describing open data for music.

What is (are, if you must) the data?

The first speaker on the schedule was Gavin Starks from the Open Data Institute. He gave a good talk around some of the benefits of open data on the web and was looking for case studies from the music industry. He also made the point that, “personal data is not open data” (not an exact quote but hopefully close enough).

After that I think the “personal data” point got a bit lost. Data in general got clumped together as an homogenous lump of stuff and it was difficult to pick apart arguments without some agreement on terms. It felt like there was a missing session identifying some of the types of data we might be talking about. Someone tried to make a qualitative distinction between data as facts and data as other stuff but I didn’t quite follow that. So this is my attempt…

In any “content” business (music, TV, radio, books, newspapers) there are four layers of data:

  1. The core business graph. Contracts, payments, correspondence, financial reports
  2. The content graph. Or the stuff we used to call metadata (but slightly expanded). For music this might be works, events, performances, recordings, tracks, releases, labels, sessions, recording studios, cover art, licensing, download / streaming availabilities etc. Basically anything which might be used to describe the things you want to sell.
  3. The interest / attention graph. The bits where punters express interest toward your wares. Event attendance, favourites, playlists, purchases, listens etc.
  4. The social graph. Who those punters are, who they know, who they trust.

I don’t think anyone calling for open music data was in any way calling for the opening of 1, 3 or 4 (although obviously aggregate data is interesting). All of those touch on personal data and as Gavin made clear, personal data is not open data. There’s probably some fuzzy line between 1 and 2 where there’s non-personal business data which might be of interest to punters and might help to shift “product” but for convenience I’m leaving that out of my picture:

Given that different bits of the music industry have exposure to (and business interests in) different bits of these graphs they all seemed to have a different take on what data was being talked about and what opening that data might mean. I’m sure all of these people are exploring data from other sources to improve the services they offer, but plotting more traditional interests on a venn:

So lack of agreement on terms made conversation difficult. Sticking to the content graph side of things I can’t think of any reasonable reason why it shouldn’t be open, free, libre etc. It’s the Argus catalogue of data (with more details and links); it describes the things you have for sale. Why wouldn’t you want the world to know that? I don’t think anyone in the room disagreed but it was hard to say for sure…

Data portability

The social and interest / attention graphs are a different breed of fish. Outside the aggregate they’re where personal data and personal expression live. Depending on who you choose to believe that data either belongs to the organisation who harvested it or the person who created it. I’m firmly in the latter camp. As a consumer I want to be able to take my interest data and give it to Spotify or my Spotify data to Amazon or my Amazon data to Apple or my Apple data to In the unlikely event I ever ran a startup I’d also want that because otherwise my potential customers are locked-in to other services and are unlikely to move to mine. If I were an “established player” I’d probably feel differently. Anyway data portability is important but it’s not “open data” and shouldn’t be confused with it.

Crossing the content to social divide

Many things in the content graph have a presence in the social graph. Any music brand whether it’s an artist, a label or a venue is likely to have a Twitter account or a Facebook account or etc. So sometimes the person to interest to content graph is entirely contained in the social graph. Social media is often seen as a marketing channel but it’s a whole chain of useful data from punters to “product”. Which is why it puzzles me when organisations set up social media accounts for things they’ve never minted a URI for on their own website (it’s either important or it’s not) and with no real plan for how to harvest the attention data back into their own business. “Single customer view” includes people out there too.

Data views, APIs and API control

Just down the bill from Gavin were two speakers from They spoke about how they’d built the business and what they plan to do next. In the context of open data (or not) that meant reviewing their API usage and moving toward a more “industry standard” approach to API management. Twitter was mentioned alongside the words best practice.

Throughout the afternoon there was lots of talk about a “controlled open” approach; open but not quite. Occasionally around licensing terms but more often about API management and restrictions. It’s another subject I find difficult as more and more structured data finds its way out of APIs and into webpages via RDFa and In the past, the worlds of API development and Search Engine Optimisation haven’t been close bedfellows but they’re heading toward being the same thing. And there’s no point having your developers lock down API views when your SEO consultants are advising you to add RDFa all over your web pages and your social media consultants are advising you to add OpenGraph. But it all depends on the type of data you’re exposing, why you’re exposing it and who you want to expose it to. If you’re reliant on Google or Facebook for traffic you’re going to end up exposing some of your data somehow. The risk either way is accidentally outsourcing your business.
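To make the "your web pages are already an API" point concrete, here is a minimal sketch using only Python's standard library. It pulls OpenGraph properties out of a page's meta tags. The page markup is invented for illustration, and a real consumer would want something sturdier than this.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("property") or ""
        if name.startswith("og:") and "content" in attributes:
            self.properties[name] = attributes["content"]


# Hypothetical page markup, made up for this example.
PAGE = """
<html><head>
<meta property="og:type" content="music.album" />
<meta property="og:title" content="Rubber Soul" />
<meta name="description" content="Not OpenGraph, so ignored" />
</head><body></body></html>
"""

parser = OpenGraphParser()
parser.feed(PAGE)
print(parser.properties)
```

Anyone who can fetch the page can run something like this, whatever the API terms say.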


Robert from MusicBrainz appeared at the conference via a slightly glitchy Skype link. He spoke about how MusicBrainz came into being, what its goals are and how it became a profit making non-profit. He also said the most important thing MusicBrainz has is not its data or its code or its servers but its community. I’ve heard this said several times but it tends to be treated like an Oscar starlet thanking her second grip.

From all dealings with open data I’ve ever had I can’t stress enough how wrong this reaction is. The big open data initiatives (Wiki/DBpedia, MusicBrainz, GeoNames, OpenStreetMap) are not community “generated”. They are not a source of free labour. They are community governed, community led and community policed. If your business adopts open data then you’re not dealing with a Robert like figure; you’re dealing with a community. If you hit a snag then your business development people can’t talk to their business development people and bang out a deal. And the usual maxim of not approaching people with a solution but an explanation of the problem you want to solve is doubly true for community projects because the chances are they’ve already thought about similar problems.

Dealing with open data means you’re also dealing with dependencies on the communities. If the community loses interest or gets demoralised or moves on then the open data well dries up. Or goes stale. And stale data is pretty useless unless you’re an historian.

So open data is not a free tap. If you expect something for nothing then you might well be disappointed. The least you need to give back is an understanding of and an interest in the community and the community norms. You need to understand how they operate, where their interests lie and how their rules are codified and acted on. And be polite and live by those rules because you’re not a client; you’re a guest. You wouldn’t do a business deal without checking the health of the organisation. Don’t adopt community data without checking the health of the community. Maybe spend a little of the money you might have spent on a biz dev person on a “community liaison officer”.

Question and answer

At the end of Robert’s talk I had to get up and answer questions. There was only one which was something like, “would you describe MusicBrainz as disruptive?” I had no idea what that meant so I didn’t really answer. As ever with question sessions there was a question I’d rather have answered because I think it’s more interesting: why should music industry people be interested in and adopt MusicBrainz. Answers anyway:

  1. Because it has stable identifiers for things. In an industry that’s only just realising the value of this, it’s not nothing.
  2. Because those identifiers are HTTP URIs which you can put in a browser or a line of code and get back data. This is useful.
  3. Because it’s open and with the right agreements you can use it to open your data and make APIs without accidentally giving away someone else’s business model.
  4. Because it links. If you have a MusicBrainz identifier you can get to artist websites, Twitter accounts, Facebook pages, Wikipedia, Discogs, YouTube and shortly Spotify / other streaming services of your choice. No data is an island and the value is at the joins.
  5. Because it’s used by other music services from to the BBC. Which means you can talk to their APIs without having to jump through identifier translation loopholes.
  6. Because, whilst it’s pretty damn big, size isn’t everything and it’s rather shapely too. The value of data is too easily separated from the shape of the model it lives in. Lots of commercial music data suppliers model saleable items because that’s where the money lives. MusicBrainz models music which means it models the relationships between things your potential customers care about. So not just artists and bands but band memberships. And not just Rubber Soul the UK LP and the Japanese CD and the US remastered CD but Rubber Soul the cultural artefact. Which is an important hook in the interest graph when normal people don’t say, “I like the double CD remastered rerelease with the extra track and the tacky badge.”
  7. Because its coverage is deep and wide. There are communities within communities and niches of music I never knew existed have data in MusicBrainz.
  8. Because the edit cycle is almost immediate. If you spot missing data in MusicBrainz you can add it now. And you’re a part of the community.
  9. Because the community is engaged and doing this because they care, it polices itself.
  10. Because Google’s Knowledge Graph is based on Freebase and Freebase takes data from MusicBrainz. If you want to optimise for the search engines, stop messing about with h1s and put your data in MusicBrainz.

So if any record label or agent or publisher or delivery service ever asked me what the smallest useful change to the data they store might be, I’d say just store MusicBrainz identifiers against your records. Even if you’re not yet using open data, one day they’ll be useful. Stable identifiers are the gateway drug to linked data. And I’d advise any record label large or small to spend a small portion of the money they might have spent building bespoke websites and maintaining social media accounts, on adding their data to MusicBrainz. Everybody benefits, most of all your consumers.
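A minimal sketch of what "store MusicBrainz identifiers against your records" might look like in practice. The catalogue row and its `our_id` are hypothetical. The URL pattern is the version 2 MusicBrainz web service with `fmt=json`, and the identifier shown is, as far as I know, the one MusicBrainz uses for The Beatles. The `fetch` function needs network access and a descriptive User-Agent, so it is defined but not called here.

```python
import json
import urllib.request
import uuid

# Root of the MusicBrainz web service (version 2).
WS_ROOT = "https://musicbrainz.org/ws/2"


def entity_url(entity_type, mbid):
    """Build the web service URL for a MusicBrainz identifier."""
    uuid.UUID(mbid)  # raises ValueError if the identifier is not a well-formed UUID
    return f"{WS_ROOT}/{entity_type}/{mbid}?fmt=json"


def fetch(entity_type, mbid, user_agent):
    """Fetch the entity as a dict. Needs network access; not called below."""
    request = urllib.request.Request(
        entity_url(entity_type, mbid), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# A hypothetical in-house catalogue row with the identifier stored against it.
catalogue = [
    {"our_id": 1017, "name": "The Beatles",
     "mbid": "b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d"},
]

for row in catalogue:
    print(row["our_id"], entity_url("artist", row["mbid"]))
```

The in-house identifier stays exactly as it was; the extra column is just the join to everybody else's data.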

ps If you’re an indie artist Tom Robinson wrote a great guide to getting started with MusicBrainz here.

Dumb TVs

Following on from this year’s CES there’s been lots of talk about bigger, better, sharper, smarter TVs. As ever conversation around gadgets tends to get caught up with conversations around business models which tends to lead to breathless commentary on OTT vs traditional broadcast and whether smart TVs will render traditional broadcasters as obsolete as Blockbusters, HMV and Jessops. But this is only tangentially about that.

Rumbling away in the background is the usual speculation around Apple’s plans to “revolutionise” the TV “experience” and whether they’re planning to do the same to the TV industry as they did to the music industry (content deals permitting). In among the chatter there seems to be an assumption from some commentators that Apple’s plans for TV revolve around how Apple TV might improve the on-screen interface and controls, possibly replacing the EPG with an App Store style interface. There’s a tendency amongst media futurologists to predict the future by extrapolating from the past; therefore televisions will follow the same fat-client route as phones and already complicated TV interfaces will become more complicated still.

But to my mind this doesn’t make sense. Apple already own the content discovery route via their iDevices, they own the content acquisition route via iTunes and they own the play-out route via AirPlay. Why do they need to invent fat-client TV sets when they’ve already put fat-client laptops, tablets and phones into the hands of their customers? The App Store model might just about work when it’s in your hand / on your lap. But placing the same interaction model 10 feet away just doesn’t offer the affordances you need to discover, purchase and play programmes. From an accessibility angle alone, making potential customers interact from 10 feet away when you’ve already given them a better option seems like a painful redundancy.

How “smart” do TVs need to be?

In more general terms I think there’s a problem with the definition of a “smart” TV and the interfaces envisaged. If TVs are web connected why do they need to be smart? Some arguments why not:

  1. Upgrade cycles for TVs and radios (and most other household white goods) are too slow to build-in smartness. Build in too much and the smarts go obsolete before the primary function of the device.
  2. For any connected device smartness belongs in the network. This is why we connect them. If there are existing discovery and distribution channels and backchannels, then all a TV needs to do is accept instructions from the network; a connected (but dumb) screen.
  3. 10 feet away is no place for an interface. And just because a device has a screen doesn’t mean it has to be an input. As TV functionality becomes ever smarter and more complicated, the remote control grows to fit the demands and we end up with something almost resembling a keyboard on the arm of the sofa. When there’s a much better, much more accessible phone or pad or laptop (or any point in between) sat redundant alongside.
  4. The App Store / Smart TV model presupposes the existence of apps. But making native apps is expensive and the more platforms you have to provide for the more expensive it gets. A dumb TV only needs to accept instructions and play-out media.
  5. TV screens tend to be a shared device and authentication, personalisation and privacy concerns are hard on a shared device. Hard from an implementation point of view and hard from a user comfort point of view. There’s a spectrum from TV screen to desktop PC to laptop to tablet to phone and the further down that list you travel the less shared / more personal the device feels and the more comfortable users feel with authentication. Dumb TVs move authentication to where it makes sense.
  6. Smart TVs open up the possibility of device manufacturers finding a new role as content gatekeepers. Having control of both the interface and the backchannel data allows them to control the prominence of content. This is a particular problem for public service broadcasters. By the time your smart TV is plugged into your set top box and your assortment of games consoles, the front room TV can acquire a stack of half a dozen gatekeepers. Just keeping track of which one is currently active and which one you need to control is confusing.
  7. Media people like to talk about TV as a “lean back” medium. This is pure conjecture but it’s possible that separating the input interface from the play-out leads to this more “lean back” experience…

How dumb is dumb?

From conversations around Dumb TVs there seem to be two main options: the dumb but programmable TV and the dumber than Kletus TV.

Programmable TVs

Modern TV sets don’t live alone. There are ancillary devices like PVRs which sit alongside the TV box. TVs don’t need to be programmable but PVRs do. The big question is where you want to programme your PVR from. If it’s same room / same local area network then there’s no need for any additional smartness or authentication. If it’s on the same network you can control it. If you want to programme your PVR from the top deck of the bus this is somewhat harder. Somewhere you need a server to mediate your actions and given the need for a server there’s a need for authentication. But…

…how do PVRs as discrete devices make sense in a connected world? If 3 million people choose to record an episode of Doctor Who that’s a hell of a lot of redundant storage. And a hell of a lot of redundant power usage. Over time PVR functionality will move to “the cloud” (the legality of loopholes notwithstanding), your mobile will programme it, discover content there and push that content to your TV screen. With no need for TV programmability.

Dumb, dumb, dumb

So what’s the very simplest thing with the least build and integration costs? Something which allows you to push and control media from a fat client to a dumb TV. DIAL promises to do something similar but seems to assume a native app at each end and the simplest thing is probably two browsers.

So somehow devices on a local area network need to be able to advertise the functionality they offer. There’s a web intents connection here but I’m not quite sure what it is. Once your laptop / tablet / phone knows there’s a device on the network which can play audio / video it needs to make that known to the browser. So there needs to be some kind of browser API standardisation allowing for the insertion of “play over there” buttons. And the ability to push a content location with play, pause, stop and volume control notifications from the browser on the fat client to the browser on the dumb TV. Which might be something like WebRTC. Given the paywalls and geo-restrictions which accompany much of the online TV and movie business there’d probably need to be some kind of authentication / permission token passed. But that’s all dumb but connected would involve.
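To show how little the screen would need to understand, here is a toy sketch of the instructions described above: a content location plus play, pause, stop and volume. None of this is DIAL, WebRTC or any real standard. The message shape, class name and token field are assumptions made up for illustration, and the token is carried but never checked.

```python
import json


class DumbScreen:
    """A screen that only accepts instructions; all the smarts live elsewhere."""

    # Advertised so a controller on the same network knows what it can ask for.
    CAPABILITIES = ("play", "pause", "stop", "volume")

    def __init__(self):
        self.state = "stopped"
        self.location = None
        self.volume = 50

    def handle(self, raw_message):
        message = json.loads(raw_message)
        action = message["action"]
        if action not in self.CAPABILITIES:
            raise ValueError(f"unsupported action: {action}")
        if action == "play":
            # A new location replaces whatever was showing; no location resumes.
            self.location = message.get("location", self.location)
            if self.location is None:
                raise ValueError("nothing to play")
            self.state = "playing"
        elif action == "pause":
            if self.state == "playing":
                self.state = "paused"
        elif action == "stop":
            self.state = "stopped"
            self.location = None
        elif action == "volume":
            # Clamp to the 0-100 range rather than trusting the sender.
            self.volume = max(0, min(100, int(message["level"])))
        return {"state": self.state, "location": self.location, "volume": self.volume}


def instruction(action, **fields):
    """What the phone / tablet / laptop would send."""
    return json.dumps({"action": action, **fields})


screen = DumbScreen()
screen.handle(instruction("play", location="https://example.org/episode.mp4",
                          token="hypothetical-token"))
screen.handle(instruction("volume", level=130))
print(screen.handle(instruction("pause")))
```

Discovery, search, accounts and payment all stay on the device in your hand; the screen only has to keep up with four verbs.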

A late answer to a question from the digital humanities conference

The week before last Silver and I went along to the Realising the Opportunities of Digital Humanities conference in Dublin. We gave a short presentation about linked data at the BBC then sat on a panel session attempting to answer questions. I’ve never been on a panel before but it’s a bit like an interview: you only think of the answer you wanted to give once you’ve left the room.

Anyway, one person asked a question something like, “with all this data does ‘content’ become a second class citizen”. At the time we were sat in the Royal Irish Academy library which is three storeys of floor to ceiling books. The thought that all that humanity could ever become subservient to some descriptive data seemed so odd that I don’t think anyone even answered the question. A bit like suggesting if the library catalogue ever got good enough you could just burn the books.

A follow up point was made by a person from RTE about the costs associated with digitising content. I think it’s often hard to justify digitisation costs because the thing you end up with is pretty much the thing you started with except in ones and zeros. And to my mind there are three steps to opening an archive. As a fag packet sketch it would look something like:

Step 1: Digitisation

As the RTE person said, digitisation is expensive and sometimes hard to justify. And I have no ideas on how to make it less expensive. Until content is digitised there’s no way to get the economies of scale that computers and the web and the people on the web bring. And there’s no real benefit in just digitising. You can put the resulting files on the web but until they link and are linked to and findable they’re not in the web. To put stuff in the web you need links and to make links you need context so…

Step 2: Contextualisation

Once you have digitised files you need to somehow generate context and there seem to be three options:

  1. employ a team of librarians to catalogue content – which is great if you can but doesn’t scale too well and can, occasionally, lead to systems which only other librarians can understand
  2. let loose the machines to analyse the content (OCR, speech to text, entity extraction, music detection, voice recognition, scene detection, object recognition, face recognition etc etc etc) – but machines can get things wrong
  3. build a community who are willing to help – but communities need nurturing (not managing, never managing)

The project I’m currently (kind of) working on has the research goal of finding a sweet spot between the last two: machine processing of audio content to build enough descriptive data to be corrected and enhanced by a community of users. Copying / pasting from an earlier comment:

We’ve been working on a project for BBC World Service to take 70,000 English-language programmes and somehow make them available on the web. The big problem is that whilst we have high quality audio files we have no descriptive data about them. So nothing about the subject matter discussed or who’s in them and in some cases not even when they were broadcast.

To fix this we’ve put the audio through a speech to text system which gives us a (very) rough transcript. We’ve then entity extracted the text against DBpedia / Wikipedia concepts to make some navigation by “tags”. Because the speech to text step is noisy some of the tags extracted are not accurate but we’re working with the World Service Global Minds panel (a community of World Service listeners) who are helping us to correct them.

Machines plus people is an interesting approach but, like digitisation, machine processing of content is expensive. Or at least difficult to set up unless you’re a technology wizard. And there’s a definite gap in the market for an out-of-the-box, cloud-based (sorry) solution (sorry) for content processing to extract useful metadata to build bare-bones navigation.
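A toy sketch of the machines plus people idea, and very much not how the World Service project actually does it: machine-extracted tags arrive with a confidence score, listeners vote tags up or down, and votes win over the machine whenever there are any. The function name, threshold, tags and votes are all invented for illustration.

```python
from collections import defaultdict


def merge_tags(machine_tags, votes, threshold=0.5):
    """Combine noisy machine tags with community votes.

    machine_tags: {tag: confidence between 0 and 1}
    votes: list of (tag, +1 or -1) from listeners
    A net-positive vote keeps or adds a tag, a net-negative vote removes it,
    and with no votes (or a tie) the machine confidence decides.
    """
    tally = defaultdict(int)
    for tag, vote in votes:
        tally[tag] += vote

    kept = set()
    for tag in set(machine_tags) | set(tally):
        score = tally.get(tag, 0)
        if score > 0:
            kept.add(tag)
        elif score == 0 and machine_tags.get(tag, 0.0) >= threshold:
            kept.add(tag)
    return sorted(kept)


# Made-up output from a noisy speech to text plus entity extraction step.
machine_tags = {"Nelson Mandela": 0.9, "Mandolin": 0.6, "Apartheid": 0.3}
# Made-up listener corrections.
votes = [("Mandolin", -1), ("Mandolin", -1), ("Apartheid", +1), ("South Africa", +1)]

print(merge_tags(machine_tags, votes))
# ['Apartheid', 'Nelson Mandela', 'South Africa']
```

The machine gets the programme onto the web with something to navigate by; the community decides what stays.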

Step 3: Analysis

The line between contextualisation and analysis is probably not as clear cut as I’ve implied here. But by analysis I mean any attempt to interrogate the content to make more meaning. I’m reminded of the recent Literature is not Data: Against Digital Humanities article by Stephen Marche:

But there is a deeper problem with the digital humanities in general, a fundamental assumption that runs through all aspects of the methodology and which has not been adequately assessed in its nascent theory. Literature cannot meaningfully be treated as data. The problem is essential rather than superficial: literature is not data. Literature is the opposite of data.

Data precedes written literature. The first Sumerian examples of written language are recordings of beer and barley orders. But The Epic of Gilgamesh, the first story, is the story of “the man who saw the deep,” a hero who has contact with the ineffable. The very first work of surviving literature is on the subject of what can’t be processed as information, what transcends data.

It also reminds me of a trip to a Music Information Retrieval conference a couple of years back. Every other session was accompanied by a click track and seemed to be another attempt to improve “onset beat detection” in some exotic music genre by two or three percent. I’m no musicologist but it felt like a strange approach to determining meaning from music. If you were ever asked to describe punk or hip-hop or acid house I doubt you’d start with chord sequences or rhythm patterns. For at least some genres the context of the culture and the politics (and the narcotics) feels like a better starting point.

So I think when we throw machines at the analysis part there’s a tendency to reduce down the ineffable to a meaningless set of atoms. Or a pile of salt. Machines have their place but it’s their usual place: boring, repetitive tasks at speed.

Getting back to the diagram: once archive items are digitised, contextualised, findable and in the web they become social objects. People can link to them, share them, “curate” them, annotate them, analyse them, celebrate them, debunk them, take them, repurpose them, “remix” them and make new things from them. The best description I’ve seen of the possibilities of what could happen when people are allowed to meet archive items is Tony Ageh’s recent speech on the Digital Public Space which is too good to quote and should just be read.

On the original question then, no I don’t think “content” (novels, poems, pamphlets, journalism, oral history, radio, film, photography etc) will ever become a second class citizen to the data describing it. And digitisation costs are a lot easier to justify when coupled with contextualisation and analysis. And some jobs are best done by machines and some jobs are best done by people.