Designing a URL structure for BBC programmes

This is a post I’ve been meaning to write for the last 7 years. In the circumstances it might be more of a eulogy than a birth announcement but since the subject still occasionally raises its head I figured it was finally worth typing.

Thanks to David Marland, Richard Jolly, Yves Raimond and Zillah Watson for idiot checking / making it to the end.

Some background

The story starts with a small team of researchers (Tom Coates and Matt Webb) and developers (Matt Biddulph and Paul Hammond) in what was BBC Radio and Music Interactive. Other people from across the BBC were involved; both Kim Plowright and Gavin Bell spring to mind. But if your name’s not here and you think it should be, please blame my ignorance and not any intentional slight.

Sometime around 2006 the team began to look at how programme pages should work in a world of blogs and social software. Up until that point online programme support had been sporadic, with big-budget programmes getting dedicated “websites” and low-profile programmes often getting nothing. The PIPs (Programme Information Pages) project aimed to solve this problem by providing a baseline web page for all programmes which could be enhanced where time, budget and user demand dictated.

The team working on PIPs were keen to ensure that PIPs generated pages worked with the wefts and seams of the web; that all the important things had pages and were linkable and pointable at. Tom Coates in particular had given a great deal of thought to how the web was evolving the use of links to disambiguate and add meaning. In 2006 he published the seminal The Age of Point-at-Things. If you’ve not read it you should.

Taking linkability as a starting point, the first job was to establish what the important things were. At the BBC, “programme” is a stretchy word. It can be used to mean all episodes of Eastenders ever, or this particular episode of Eastenders, or even tonight’s broadcast of this particular episode of Eastenders. Pre-PIPs hand-crafted programme support was inconsistent: most would have overarching programme pages (the “programme homepage”), some would have episode guides, some would have episode pages, some would have series pages, some would have character pages. And there was no real definition of what the important things were or how they related. It’s a hand-wavey generalisation but the important things in any broadcast chain tend to be the asset (piece of media) to be transmitted and the broadcast slot to put it in. The PIPs insight was that neither the asset version nor the broadcast slot was interesting to users; the thing people talk about is the more platonic “episode”. So that episode of Eastenders might have a British Sign Language version or an Audio Described version or might be edited for duration, and all of those versions might be broadcast many times, but to the user who wants to find, find out about and share, they’re all the same thing. Or they all share the same “editorial intent”.

So the core object of PIPs was the episode. Every episode was part of a larger programme grouping and every programme grouping (although it might be broadcast on many networks) belonged to one network. Which led to the original PIPs URL structure of:

/:owning-network/:programme-group/pip/:episode/

or:

/radio3/freethinking2006/pip/132yy/

The 132yy part was referred to as a PIP key and uniquely identified a single episode. Tom explains the reasoning behind this pattern in his post on Developing a URL structure for broadcast radio sites. Everything in that post is still true and, since there’s no point in retyping all of it, needs to be read before any of the rest of this makes sense.

So the intent of PIPs was to provide a persistent (or persistent as possible) URL for every episode. The interesting part of doing that is not how you construct the URL but what you’re forced to leave out: no broadcast date or time because there might be many (or might be none), no genres (same reason) etc.

So what happened to PIPs version 1 and 2?

PIPs version 1 was designed to automate programme pages for Radio 3. PIPs version 2 took the same model and attempted to roll it out over the rest of the national radio networks. Each version was a single system with three parts: data storage, management and publishing. Both predated any BBC dynamic publishing infrastructure by several years and relied on “compiling” pages and parts of pages offline to be FTPed to the live servers (the kind of architecture platforms like Jekyll have recently returned to). But the programme data model can get quite complicated and broadcasting in general is subject to lots of last minute changes (a football match overruns and the schedule is juggled to accommodate). Working out what had changed and which pages and parts of pages that change would affect became increasingly complicated. Sometimes a change would happen in the data and the results would be published in seconds; sometimes a change would go into the system and only emerge on the website a couple of days later. And no-one really knew how the internals of the system worked.

So PIPs version 3 was born and the acronym changed from Programme Information Pages to Programme Information Platform. PIPs would no longer be responsible for providing data management and editing facilities and would no longer be responsible for publishing pages. It would just be a data store from which other services would take data and publish.

All of this was around the time that iPlayer was gestating. And around the time that Tom Loosemore was proposing BBC 2.0 and automated programme support across radio and TV. The former obviously became iPlayer and the latter became /programmes. I think when we first worked on /programmes we all thought there’d be an “iPlayer inside” type model; where an episode was available to play it would be playable on the episode page, and where it wasn’t, it wouldn’t be. Certainly Kim’s wireframes suggested that and I think we all thought that was what we were working toward. Given the BBC has so many brands it felt like adding another one for ondemand programme viewing would just add confusion. Which shows how little we knew about branding. Back to the story…

Because the BBC had (and still has) no single store of programme information the data needed to populate PIPs came in from outside the BBC. Programme teams and schedulers would send information to Red Bee who would structure it in a system called SID (acronym expansion lost in the mists of time but possibly Schedule Interface for DSAT), then send it back into PIPs as XML. The incoming data just reflected schedule events; there were no programme IDs and event IDs rolled over. PIPs had a set of heuristics to identify episodes, mostly based on textual repeat tags, such as “[Rpt Fri 4pm]” or “[Rpt of Mon 4pm]”. These were parsed and used to infer the episode structure. The format of repeat text was not exactly consistent so there was an editorial interface to confirm the identification and assign the brand. Episodes were coerced into a brand/series/episode structure with dummy brands and series labelled UNKNOWN R3 and UNKNOWN R4.
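
For flavour, here’s a toy reconstruction of that kind of heuristic in Python (the real rules are long gone and were certainly messier; these patterns only cover the two example tags quoted above):

	import re

	# Toy patterns for the two repeat tags quoted above. "Rpt of" marks the
	# event as a repeat of an earlier broadcast; plain "Rpt" promises a later
	# repeat of this one.
	RPT_OF = re.compile(r"\[Rpt of (?P<day>\w{3}) (?P<time>\d{1,2}(?::\d{2})?[ap]m)\]")
	RPT = re.compile(r"\[Rpt (?P<day>\w{3}) (?P<time>\d{1,2}(?::\d{2})?[ap]m)\]")

	def classify(billing):
	    """Guess whether a schedule event is an original or a repeat."""
	    match = RPT_OF.search(billing)
	    if match:
	        return ("repeat-of", match.group("day"), match.group("time"))
	    match = RPT.search(billing)
	    if match:
	        return ("repeated-on", match.group("day"), match.group("time"))
	    return ("original", None, None)

	print(classify("Free Thinking [Rpt of Mon 4pm]"))  # ('repeat-of', 'Mon', '4pm')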

The SID feed had a very thin data model lacking the ability to describe programme structures accurately. And PIPs v1/2 had a very rich data model based on SMEF (the Standard Media Exchange Framework). Populating PIPs with SID was like attempting to paint a cathedral with a Humbrol paintbrush. There just wasn’t enough data to know how to fill all the gaps. There were more empty tables in the PIPs database than populated ones. And many of those that were populated were filled with dummy data to fill in the gaps in the model.

So given details of a new episode from a new programme there was no way to know whether that episode was a one-off, stand-alone episode (like a film) or the first of many. For that reason all episodes in PIPs v1/2 were assigned to programme groups whether or not those groups existed or would ever exist. For the original PIPs URLs this was a bonus. The grouping was always guaranteed to exist and that grouping belonged to an owning network. But as iPlayer was developed people realised that the SID feed wasn’t rich enough to drive the developments envisaged. So PIPs 3 switched from taking data feeds from SID to Red Bee’s Teleview via a TV Anytime XML feed.

The Teleview model was much more descriptive and didn’t introduce phantom grouping objects. So a one-off film would be modelled as an orphan episode with no higher level grouping of series or “brand”. Which meant that the PIPs v1/2 style URLs of:

/:network/:group/pip/:episode/

would no longer work for at least some episodes. Given that programme structures can change over time and that pilot episodes get commissioned which may or may not lead to series commissions, building grouping hierarchies into the URLs was not an option if we wanted to maximise persistence and minimise the management of redirects.

What happened to the PIP keys?

So what happened to the PIP key and the old PIPs v1/2 URLs? Pretty much they stayed published. I’m not sure how many pages are still live from the old PIPs system, and I’m not sure how you’d find out, but quite a few are still out there: here’s one for an electronic music episode of Discovering Music from 2008.

There was a plan to migrate the PIPs v1/2 data to PIPs 3, make new /programmes pages and redirect the old static pages. But unfortunately that never happened.

PIPs v3, iPlayer and /programmes

So from earlyish 2007 we had a new and shiny programme data store in PIPs v3 but no way to write to that store and no way to publish from it. Radio and Music took on two pieces of work to plug those gaps. The Programme Information Tool provided an admin interface to read from and write to PIPs and turned into one of those projects where people still bear the scars. And in October 2007 we went live with the first release of /programmes which published a page for every episode, series, brand, schedule, genre, format etc in PIPs.

In December of that year the first streaming version of iPlayer was released. For a whole variety of reasons the “iPlayer inside” (/programmes) model never came about and instead iPlayer became a destination in its own right. But both iPlayer and /programmes were based on PIPs and both use PIDs in their URLs:

http://www.bbc.co.uk/iplayer/episode/b04dzswb/the-kate-bush-story-running-up-that-hill

http://www.bbc.co.uk/programmes/b04dzswb

So while we got to one-thing-per-page we never quite got to one-page-per-thing. Though it’s worth pointing out that iPlayer episode (sometimes called item) pages only exist while an episode is available (or due to be available shortly). Once the catch-up availability window closes the iPlayer page redirects to the /programmes page. Although since the episode might be repeated and consequently the availability window might reopen this redirect is a 302 (temporary) rather than a 301 (permanent).

From PIPs to public

Before we could start work on /programmes we had two problems to solve: how to get PIPs data in a suitable shape for publishing and how to publish it dynamically.

In late 2006 the BBC had no dynamic publishing infrastructure. PIPs v1/2 had proved that offline processing to generate flat pages to be FTPed to the web servers wasn’t a great idea. And the BBC’s technical infrastructure back then was pretty much limited to static files and a slightly forked version of Perl. So Paul Clifford spent a couple of weekends making a Perl MVC framework that echoed some of the design patterns of Ruby on Rails. Which worked fine even if talking about it did generate a blog post with the memorable title of Why the BBC fails at the internet.

The second problem was harder to solve. PIPs data is heavily normalised (or at least heavily abstracted). It’s a relational database but the parent-child relationships are managed through a table called pip_pip which relates one thing with a PID to another thing with a PID via some relationship, rather than by foreign keys between tables. This was theoretically to allow for multiple parents: a feature that was never used and was eventually deprecated, although the modelling remained. PIPs is optimised for data storage but not for publishing so we had to transform it to allow for the queries we needed to build the pages we wanted.
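
To make that concrete, here’s a minimal sketch of the pattern (the table shape, names and PIDs are my guesses at the idea, not the real schema). Finding an episode’s ancestors means walking a graph rather than following foreign keys; exactly the sort of query that needed denormalising away:

	# Toy rows in a pip_pip shape: (child_pid, relationship, parent_pid).
	pip_pip = [
	    ("b000000f", "member_of", "b000000s"),   # episode -> series (PIDs invented)
	    ("b000000s", "member_of", "b000000b"),   # series  -> brand
	]

	def ancestors(pid):
	    """Walk pip_pip upwards, yielding each parent PID in turn."""
	    parent_of = {child: parent for child, _, parent in pip_pip}
	    while pid in parent_of:
	        pid = parent_of[pid]
	        yield pid

	print(list(ancestors("b000000f")))  # ['b000000s', 'b000000b']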

So the first thing Duncan Robertson built was called the Green Box – because it was first drawn with a green pen on a whiteboard. Slightly later it got called the Trickle Application (because it trickled data from PIPs to the /programmes database) and soon after that got shortened to the TAPP. The green box denormalised data to make the queries we needed to run easier and faster. It still exists today, piping data through from PIPs into the original /programmes database. Which feeds the ‘Perl on Rails’ application which these days only publishes data views. The new /programmes PHP front-end consumes the JSON from the old /programmes application, together with some data from Dynamite (the application built to serve iPlayer). Requests for data views are channelled through to the old /programmes application. At some point soon both /programmes and iPlayer will run off the new Nitro backend, the green box and old /programmes application will be turned off, and the data views will no longer work.

Data flow from PIPs to /programmes

What is a PID?

The PID acronym gets expanded in one of two ways. Usually people expand it to “Programme Identifier” which makes the most sense in most cases. Occasionally it gets expanded to “PIPs Identifier” which is less useful in most conversations but probably more accurate since any object (genres, formats, contributors, characters) in PIPs can have a PID and not just “programme” objects. No matter what the commissioning pack says it definitely does not stand for Packet Identification number which is a very different thing.

In PIPs database terms a PID is something like a non-domain-native surrogate key for any object. More or less every row in every table has a PID and those tables might be describing programme objects (brands, series, episodes, clips, versions, segments); higher level programme groupings (seasons, franchises, collections); programme availability objects (broadcasts, ondemands); or non-programme objects (characters, contributors, genres, formats etc). Usually PIDs are described as opaque identifiers for PIPs objects; the term obfuscated has been suggested too, but since they’re not substituting for any more “natural” identifier I’ll stick with opaque.

So PIDs are not designed to be human readable or meaningful. They’re lower-case, alpha-numeric strings of 8 (or potentially / occasionally more) characters with vowels excluded to prevent inadvertent swearing (although occasionally a swear does creep through; in this case, one suspects, deliberately).

The only character in a PID that is “meaningful” is the first one, which denotes the authority of the PID (the people responsible for creating it). For PIDs starting b the generating authority is Red Bee, for PIDs starting p it’s PIPs, for w it’s the old World Service scheduling system. Other leading characters / authorities exist for partner data but those are the main ones you’ll see. For Red Bee generated PIDs there’s a two-way transform between the PID and the CRID used in the Red Bee Teleview system. CRIDs are a form of non-HTTP URI defined as part of the TV-Anytime spec and are the identifiers that make series record work on your Freeview box. Although in this case the CRIDs are internal to Teleview and are not the same CRIDs as used by DVB, so you can’t reverse engineer PIDs to do any useful hacking with Freeview.
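
Put together, a sketch of the format and authority rules as described (my reading of them, not an official grammar):

	import re

	# 8 or more lower-case characters drawn from digits and consonants.
	PID = re.compile(r"^[0-9b-df-hj-np-tv-z]{8,}$")

	AUTHORITIES = {"b": "Red Bee", "p": "PIPs", "w": "World Service scheduling"}

	def authority(pid):
	    if not PID.match(pid):
	        raise ValueError("%r doesn't look like a PID" % pid)
	    return AUTHORITIES.get(pid[0], "other / partner")

	print(authority("b04dzswb"))  # Red Bee
	print(authority("p025j01q"))  # PIPs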

Because World Service programme information doesn’t travel through Red Bee they self-provision programme data into PIPs, which led to their early PIDs looking slightly different to most, being 11 characters, not 8:

http://www.bbc.co.uk/programmes/wcr5dr3dnl3

I’m not sure when the changeover happened but these days World Service PIDs are generated as p PIDs with a PIPs authority and are 8 characters long.

The data model and the URLs

Hopefully, this at least partly explains why the URL structures changed between PIPs v1/2 and /programmes and iPlayer (and if it doesn’t, hopefully the rest of this post will). But to explain why we ended up with the URLs /programmes ended up with, unfortunately, you need to understand something of the PIPs data model. So…

Episodes and versions (and clips)

The core of the PIPs data model is the episode. As explained above this is not the broadcast or the media asset but the more platonic grouping of media assets. I’ve heard this described in many ways from assets / broadcasts with the same “editorial intent” to assets / broadcasts telling the same “story”. So for example the Today Programme is a 3-hour broadcast on FM but a 2.5-hour broadcast on LW (the last 30 minutes make way for Yesterday in Parliament) but they’re recognisably the same episode. Or an episode of Casualty might have a BSL version and a non-BSL version but they’re recognisably the same episode. Or an episode of Merlin might get recut to be suitable for broadcast on CBBC but it’s recognisably the same episode / tells the same story. In theory at least (although not to my knowledge in practice) a Prom concert might be simulcast on Radio 3 and BBC Four so would be two media assets (one with moving pictures, one without) grouped into a single episode. (Though in reality, Red Bee are not contracted to recognise simulcasts which is why there’s a BBC One Match of the Day 2 Extra and a Five Live Match of the Day Extra 2.)

This asset grouping is handled in PIPs with episodes and versions. An episode can have one or many versions (but always at least one) with one of those marked as the “canonical” version. And a version always belongs to one and only one episode (actually versions can belong to clips too but let’s ignore that for a minute). Versions are probably the closest mapping in the model to media assets although the complications of delivering A/V online mean a version in iPlayer can have many different media files. And versions can have types. Aside from the canonical (default) version there might be versions with increased duration (the “repeat” of Desert Island Discs is longer), decreased duration (the LW version of the Today Programme) and, for TV, versions with BSL or audio description. So versions handle two aspects of change: editorial versioning (recuts) and accessibility versions.

Attached to a version there might be “publication events”: broadcasts and “ondemands”. A version might have zero, one or many broadcasts. Each of which might be on a different radio network or TV channel (e.g. Eastenders on BBC One and BBC Three). Networks and channels are modelled as “services” in PIPs because there isn’t a common “natural” word so SMEF went with service. And a version might have zero, one or many ondemands which determine if a version is available for iPlayer streaming or download for a given time period, territory, platform etc. Ondemands are mapped to availability types (like iPlayer international desktop streaming or iPlayer UK download) which again are modelled as services. Importantly (for URLs anyway) an episode may have no broadcasts on any of its versions. This isn’t the most common case but it’s becoming more common and will probably become more common still when BBC Three moves off broadcast and onto the web.
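
As a rough sketch of the shape of all that (class and field names are mine, not PIPs’):

	from dataclasses import dataclass, field

	@dataclass
	class Broadcast:
	    service: str              # a network or channel, e.g. BBC One
	    start: str                # when it goes (or went) out

	@dataclass
	class Ondemand:
	    service: str              # an availability type, e.g. iPlayer UK download
	    available_from: str
	    available_until: str

	@dataclass
	class Version:
	    pid: str
	    canonical: bool = False
	    broadcasts: list = field(default_factory=list)   # zero, one or many
	    ondemands: list = field(default_factory=list)    # zero, one or many

	@dataclass
	class Episode:
	    pid: str
	    title: str
	    versions: list = field(default_factory=list)     # always at least one

	    @property
	    def canonical_version(self):
	        return next(v for v in self.versions if v.canonical)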

Episodes and versions are probably the trickiest part for people starting to work with PIPs data. Although iPlayer and /programmes are (largely) episode centric the original PIPs v3 data model had most description around versions. Because an episode could be recut and the duration altered, lots of things about that episode have the potential to change between versions. So the segments / running order / music played and contributors might change, which is why they were modelled at version level. Or the episode might be recut to make it suitable for a younger audience and because “Children’s” is a genre the genre was set at version not episode level. And the same for formats. It meant that Match of the Day in its entirety was not assigned to the Sport – Football genre but every version of every episode of every series was. In the green box we propagated genres and formats up from versions, through episode to series and brands just to make it possible to build useful aggregations. But it was all a bit of a workaround. These days the PIPs model has changed and things like genres, formats and contributors are assigned at the top level and cascade down in an “inheritance with override” fashion. So an episode looking for its genres will look to see if it has any genres directly assigned and if it doesn’t, will look to its parent and onwards and upwards.
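
That “inheritance with override” lookup is simple enough to sketch (a toy version, assuming a single parent pointer):

	class Programme:
	    """A toy node: directly assigned genres plus a single parent pointer."""
	    def __init__(self, genres=None, parent=None):
	        self.genres = genres or []
	        self.parent = parent

	def effective_genres(programme):
	    node = programme
	    while node is not None:
	        if node.genres:          # directly assigned genres override ancestors'
	            return node.genres
	        node = node.parent       # otherwise look onwards and upwards
	    return []

	brand = Programme(genres=["Sport: Football"])
	episode = Programme(parent=Programme(parent=brand))   # episode -> series -> brand
	print(effective_genres(episode))                      # ['Sport: Football']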

But segments are still set at version level so if you’re looking at an episode page and see, for example, a tracklist, that list is a list of segments set on the version. So an episode page is actually something of a hybrid between the data from the episode (and its associations) and data from the canonical version. If you’re trying to work with /programmes data views like:

http://www.bbc.co.uk/programmes/b04g708d.xml

and wondering why the episode page shows contributors and a tracklist but they aren’t shown in the data, you need to look for the version marked canonical:

		<versions>
			<version canonical="1">
				<pid>b04g708b</pid>
				<duration>1800</duration>
				<types>
					<type>Original version</type>
				</types>
			</version>
			<version canonical="0">
				<pid>p025j01q</pid>
				<duration>1800</duration>
				<types>
					<type>Dubbed Audio Described</type>
				</types>
			</version>
		</versions>
	

construct a link to that version:

http://www.bbc.co.uk/programmes/b04g708b.xml

where you’ll find the contributor and segment (tracklist) details:

		<contributors>...</contributors>
		<segment_events>...</segment_events>
	

Or… you could just request the episode URL and tack on /segments which is a damn sight simpler but not as useful an explanation…
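
For completeness, the manual two-step walk looks something like this (a sketch against the data views as they existed at the time of writing, assuming the XML structure shown above):

	import requests
	import xml.etree.ElementTree as ET

	BASE = "http://www.bbc.co.uk/programmes/"

	def canonical_segment_events(episode_pid):
	    # fetch the episode data view and pick out the canonical version's PID
	    episode = ET.fromstring(requests.get(BASE + episode_pid + ".xml").content)
	    version_pid = episode.find(".//version[@canonical='1']/pid").text
	    # fetch that version's data view, which is where the segments live
	    version = ET.fromstring(requests.get(BASE + version_pid + ".xml").content)
	    return version.find(".//segment_events")

	print(canonical_segment_events("b04g708d"))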

Finally clips. Clips were a later addition to the PIPs model and describing them is hard. Importantly they are not (as the name might suggest) necessarily bits clipped from full-length episodes. Although they might be. They might also be a trailer for an episode, a trailer for a series, a recap of a series, a best-bits highlight package of a series, additional footage or outtakes from an episode or something else entirely. It’s maybe best to think of them as meta-programmes; “programmes” about programmes. In some way.

From the data model point of view clips are pretty much identical to episodes. They have all the same attributes and also have at least one, sometimes many versions. The only real difference is that a clip can belong directly to an episode or a series or a brand, unlike episodes which can only belong to a series or a brand. And not another episode. Obviously.

Which brings us to…

Brands and series

So a few episodes exist in isolation. Films are the obvious example of episodes that stand alone (although I guess franchises like The Godfather or Star Wars or Home Alone could be grouped together). And occasionally (fairly often on Radio 4) there are single episode documentaries or dramas which are not grouped into a wider series. And at the risk of repetition, PIPs v3 (unlike v1/2) allows episodes to be orphans with no parent brand or series.

More commonly episodes are not orphaned and are grouped by a brand or a series (which may in turn be grouped by a brand or a series). At this point it all gets a bit complicated. An episode may be an orphan. Or it might belong to a series. Or it might belong to a brand. And a series might belong to another series. Or might belong to a brand. Or might be an orphan. And a brand is always an orphan. The possible combinations are shown below:

Brand, series, episode hierarchy

(It’s probably worth pointing out that series here is the UK (possibly antiquated) use of the word and is what Americans refer to as a season. Since PIPs v3 and /programmes were designed it feels like many more UK people (inside and outside the BBC) use season to mean what PIPs means by series. Though a season in PIPs terms is quite a different thing.)

Use cases should be fairly obvious. Orphan episodes I’ve described. The Series > Episode structure is usually used for short running series. (The Series > Series > Episode structure is theoretically possible but not seen in practice.) The Brand > Episode structure is used for long running programmes which aren’t broken into series (like the Today Programme or Eastenders or The Archers). Brand > Series > Episode is the classic case where long running programmes are split into series (like Doctor Who). And the Brand > Series > Series > Episode is only occasionally used for programmes split into series split into storylines where a storyline takes place over two episodes (the Waking the Dead edge case).
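
Those containment rules are compact enough to write down as a table. This is my formalisation of the diagram, not anything PIPs enforces in this form:

	# Allowed parent types per object type; None means the object can sit at
	# the top of the tree (i.e. be a TLEO).
	ALLOWED_PARENTS = {
	    "episode": {"series", "brand", None},  # None covers orphans like films
	    "series":  {"series", "brand", None},
	    "brand":   {None},                     # a brand is always an orphan
	}

	def valid_parent(child_type, parent_type):
	    return parent_type in ALLOWED_PARENTS[child_type]

	print(valid_parent("episode", None))     # True  (one-off films etc.)
	print(valid_parent("brand", "series"))   # False (brands never have parents)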

It’s also a mixed content model so a brand for example can have both series and episodes as children. So you might have Doctor Who as a brand, with series 1 inside, then the Christmas special as an orphan episode, then series 2 etc.:

Mixed model of series and episodes

So if series can be the top item in any family tree what’s the difference between series and brands? In all honesty I’m not sure and I never have been. The series / brand distinction is a hangover from SMEF and nobody seems to remember why it was invented. There was some talk of a brand being a series with marketing value but why that isn’t just a decorated series rather than a separate class of object I’m not sure. People new to PIPs tend to assume the object at the top of the tree is always a brand and talk about brand and series pages but in truth no /programmes code takes any notice of the brand / series distinction; the only important thing is how far down the family tree you’re looking.

To patch over the language difficulties of brands and series and no-one knowing the difference and the difference not mattering anyway, we invented a new term to describe all the objects at the top of their trees: Top Level Editorial Objects (TLEOs) (for which I’m truly sorry):

The set of TLEOs

(Top Level Editorial Containers (TLECs) is also occasionally used to refer to the subset of TLEOs which are brands and series (i.e. not orphan episodes)).

One of the most common questions asked about /programmes URLs goes something like:

I’d be interested to hear why you rejected the “brand/parent/series/episode” format.

What’s simpler than www.bbc.co.uk/programmes/heroes/s01/e20 ?

To which the answer is: not all episodes belong to series, not all episodes belong to brands, not all series belong to brands. The other answer is…

Unstable TLEOs

Building hierarchy into URLs always feels like a neat thing to do. It makes them readable and hackable and human guessable. At least for the subset of people who look at and manipulate URLs. But building in hierarchy is painful when the hierarchies change over time. And the PIPs hierarchy is not stable. Things which were orphans acquire parents as new series get made. So there might be a pilot episode that gets produced and broadcast. If it sinks like a stone it will probably remain an orphan. But if the pilot works out, a series will follow and the pilot episode and the series will be wrapped into a brand (or the pilot episode made into episode 1 of the new series).

And it’s the same with recommissions of short series. When the first series of Sherlock was made it was created as a series with 3 episodes. When the second series came along a new brand was created and the first series was placed into that brand alongside the new series, again with 3 episodes:

Sherlock programme structure evolution over time

The unstable TLEO issue raises a few problems. Taking Sherlock as an example, because Series 1 had been around on the web for much longer than the Sherlock brand and because for most of its lifetime it represented both the first series of Sherlock and the entirety of the known Sherlock programme universe, it picked up lots of inbound links mostly titled “Sherlock” (and not Series 1). Before the brand came along this was good as the PageRank gained from all those links pushed that page to the top of searches for Sherlock. But when the second series came along and the brand came into existence, the search engines still saw Series 1 as the URL with most of the ‘Sherlock’ titled inbound links, so Series 1 still topped the search listings and the brand page promoting Series 2 was nowhere to be seen. Over time this problem irons itself out as the web rebalances and more links head toward the brand URL but it’s still awkward to explain why second series take a while to establish.

The second problem is more about user subscriptions. If a user subscribes to a URL for RSS or calendar feeds to get updates on the programme, until the second series and brand come along they’ll actually be subscribing to a subsidiary resource of Series 1. So when Series 2 comes along the RSS or ICS feeds they receive won’t include any information about the new episodes.

And the third problem is less visible. Many external systems (inside and outside the BBC) store TLEO PIDs to reference programmes. So a user might favourite a programme group and the thing that gets favourited is the Series 1 PID (because the brand doesn’t exist yet and may never exist). Again Series 2 comes along and they don’t get updates. Most of these problems are possible to work around but they’re always worth bearing in mind if you’re working with programme data.

In addition to intentional changes in the hierarchy, objects can also move when the PIPs data gets tidied. When we first started work on /programmes it was fairly common for an episode of one programme to be added to a brand / series of another. So an episode of Blue Peter in the middle of Doctor Who for example. Or just not added to any brand or series and left as an orphan object. If / when PIPs housekeeping happened, unintentionally orphaned episodes would be moved back into the appropriate brand or series causing the hierarchy to shift. You can still see a few accidentally orphaned episodes if you look at this list of Match of the Day TLEOs; some of the items listed are spin-off programmes (Match of the Day 2, Match of the Day Live, Match of the Day at 50) but most are accidentally orphaned episodes.

So sometimes hierarchies don’t exist and sometimes they change over time. Building them into URLs you expect to be persistent over years if not decades is a pretty bad idea.

“Masterbrands”

In case you thought brands were confusing… Masterbrands were introduced as a bit of a workaround for iPlayer. Given that a programme can have many episodes and each episode might be broadcast on multiple channels / networks (e.g. Eastenders is broadcast on BBC One and repeated on BBC Three) there was no way in the PIPs data model to associate a programme with an “owning” channel or network. It meant that iPlayer couldn’t associate the correct channel / network branding, couldn’t assign stats to the right place and couldn’t display the appropriate channel or network ‘ident’ (the bit of video that plays at the start of an iPlayer episode). Masterbrands were introduced to assign a programme object to one and only one channel or network. So whilst an episode of Eastenders might be broadcast on both BBC One and BBC Three, the Eastenders brand has a masterbrand of BBC One to denote that that channel has “editorial ownership”.

In the original build of iPlayer, masterbrands were also used to generate A-Z listings by channel and network. So the BBC One programme listing for programmes beginning ‘E’ would feature Eastenders but the BBC Three listing wouldn’t (A-Z by network no longer appears in iPlayer). /programmes only ever used masterbrand for styling and stats so Eastenders appears on both the BBC One and BBC Three aggregations. Although I’m told this is an expensive view to generate so that won’t be the case soon and, slightly sadly, /programmes will move to scoping service and service type level aggregations by masterbrand. So no more Eastenders on the BBC Three A-Z.

At first glance masterbrands look like they’d be good for inclusion into URLs. They’re familiar to users (mostly TV channels and radio networks) and they’re mono-hierarchical (a programme can have one and only one). But under the surface masterbrands are more complicated because different objects in a programme hierarchy can have different masterbrands. So the brand (TLEO) might (currently) have masterbrand BBC One whilst series one has masterbrand BBC Three and series two has masterbrand BBC One. QI is one example; it started life on BBC Two, moved to BBC One and then moved back to BBC Two. Since it’s fairly common for TV programmes to shift channels over time (or at least more common than for radio) programme groups with multiple masterbrands at different points of the hierarchy are an edge case but not an uncommon one.

The other problem with using masterbrands as part of the URL is that channels and networks are subject to occasional marketing changes. So what was Five Live became 5 Live and the URL that was /fivelive became /5live.

Back to URLs

When we started to design the URLs for /programmes we had three aims in mind:

  1. They must be persistent (or redirectable without hideous amounts of human intervention)
  2. They should be human readable / meaningful
  3. They should be hackable so users could trim bits off the end of the URL or substitute bits of the structure and be reasonably confident of what they’d get back

And since /programmes was one part of Tom Loosemore’s BBC 2.0 project, and that project had a set of principles, and principle 8 was “make sure all your content can be linked to, forever”, the greatest of these was and is persistence.

In practice the requirement for persistence and the requirements for readability and hackability never played well together. In order for the URLs to be persistent (or at least reasonably persistent given best efforts) the constraints of the programme domain model (and the assorted data and workflows and legal agreements) meant that:

  1. The URL couldn’t contain the broadcast date or time because lots of episodes have multiple broadcasts. Or none.
  2. The URL couldn’t contain the broadcast channel or network because lots of episodes have multiple broadcasts across multiple channels and networks. Or none.
  3. The URL couldn’t contain the genre because lots of episodes have multiple genres. Or none.
  4. The URL couldn’t contain the programme hierarchy because programme hierarchies are subject to change
  5. The URL couldn’t contain the masterbrand channel or network because they were also subject to change

Having eliminated the impossible, you’re left with URLs that can’t be a composite key of properties but have to address each programme object (brand, series, episode, clip) individually either by a label or by a key. And generating human readable / meaningful labels for programmes is hard to impossible. No-one knows how many films or dramas or readings of Pride and Prejudice the BBC will ever broadcast given all the ones in the archive and all the ones in the future. And heading down the road of /prideandprejudice1 and /prideandprejudice2 felt like it would introduce confusion rather than reduce it. And still wouldn’t solve the multiple types of shape-shifting hierarchy problems.

There’s one more problem with human readable / meaningful labels generated from programme titles. Most programme data goes into PIPs 7-10 days pre-broadcast at which point programme titles are mostly stable. But some priority programmes are added to PIPs much earlier. In general the greater the gap before broadcast the greater the chance that the title will change in production.

So we ended up with two requirements. Something in the URL to ensure that requests were directed to the /programmes application and something to uniquely identify a programme object. The first bit was solved by always including /programmes somewhere in the URL of every page we were responsible for generating (which is why we always described /programmes as more of a namespace than a “product”). The only other contender was /shows but whilst it felt comfortable to describe all programmes as programmes, describing some programmes (Today Programme, Newsnight, Panorama) as shows didn’t feel quite right. And the second part was solved by using PIDs as the URL key for every programme object. So /programmes/:pid it became.

There has been the occasional suggestion that it would somehow have been more “RESTful” to go with something like /:object-type/:pid, so /brands/:pid, /series/:pid, /episodes/:pid etc. Which always felt like the kind of understanding of REST that comes from using “RESTful APIs” that aren’t actually RESTful. And since the brand and series distinction doesn’t mean anything to us it definitely wouldn’t mean much to users. Plus we had an extra requirement that came from my old teacher Nic Ferrier: never hack back to a hole. So every time a user removes a bit from the end of a URL path never return a 404 and definitely never a 403. And hacking back to /episodes would have returned what? A list of all episodes ever?

We did have a brief conversation about whether URLs should use singular or plural. So /programme/:pid or /programmes/:pid. Given the desire to make URLs hackable (at least where possible / for aggregations) we decided to go with plural so /things/:thing would be a thing page and /things would be a list of things or at least some routes to scoped lists of things. In practice it wouldn’t have made much difference but it’s good to be consistent.

It’s probably worth noting that requests for version URLs only return version information if you request a data view. Requests for HTML will just redirect you to the episode URL.

Marketing URLs and redirects

URLs might have started life as web plumbing but they’ve long since escaped the browser. These days you’re more likely to come across a URL on a poster or the side of a bus or read out on radio or shown on TV. And /programmes/:pid doesn’t work particularly well for that. BBC Standards and Guidelines have a URL Requirements document that says:

3.2.1. Only a top level directory SHOULD be promoted in connection with BBC public service web sites. Therefore there MUST only be one slash after the hostname when promoted in print or on air.

3.4.1. All BBC public service web sites and services MUST be promoted using the following syntax: bbc.co.uk/sitename

3.4.2. The URL SHOULD be pronounced as: ‘bbc dot co dot uk slash sitename’. The ‘slash’ element MUST NOT be read aloud as ‘forward slash’.

3.5.1. On television a URL MUST always be displayed on screen in the form: bbc.co.uk/sitename.

3.6.1. Sub-directories of URLs MUST NOT be promoted.

3.6.2. For example, the Radio 4 Today programme site MUST be promoted using the URL bbc.co.uk/today and NOT bbc.co.uk/radio4/today.

It also says:

3.3.1. A top level directory MAY be used to redirect a user to a subdirectory.

So the marketing URL problem gets solved with redirects. A request for http://bbc.co.uk/archers will first 301 redirect to http://www.bbc.co.uk/archers (all requests missing the www get redirected) then 301 redirect to http://www.bbc.co.uk/archers/ (with a trailing slash – a legacy of the old static serving infrastructure) then 301 redirect to http://www.bbc.co.uk/programmes/b006qpgr.
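
You can watch the chain happen with a few lines of Python; requests keeps each redirect hop in r.history (though the live chain today may well differ from the one described above):

	import requests

	r = requests.get("http://bbc.co.uk/archers")
	for hop in r.history:                # one entry per 301 in the chain
	    print(hop.status_code, hop.url)
	print(r.status_code, r.url)          # should end up at /programmes/b006qpgr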

Trailing slashes

Up until /programmes bbc.co.uk had been a static website with all content served from flat files on web servers (and not dynamically from an application server). Pages were built using server side includes from .shtml files including .ssi or .sssi files. In standard UNIX fashion URLs for directories would have a trailing slash and URLs for files wouldn’t. Most internal links were to a directory / folder on the web server so included a trailing slash like /radio4/. Given a request for a folder the web server would look inside for an index.shtml file, process any includes and serve it.

With /programmes all pages were assembled dynamically so there were no files or folders sitting on web servers. We took the decision to drop all trailing slashes because denoting something was a folder when it wasn’t didn’t seem useful.

Because we can’t prevent people adding trailing slashes to links and to avoid splitting links (and Google juice) between URLs, when /programmes sees a request for a URL with trailing slash it 301 redirects to the same URL with no trailing slash.
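
The rule itself is trivial. A sketch, not the actual routing code:

	# Anything other than the root with a trailing slash gets a 301 to the
	# slashless equivalent.
	def normalise(path):
	    if len(path) > 1 and path.endswith("/"):
	        return 301, path.rstrip("/")
	    return 200, path

	print(normalise("/programmes/b006qpgr/"))  # (301, '/programmes/b006qpgr')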

Keyword stuffing

One fairly common question arising from the /programmes URL design is how the lack of readability / meaning affects search engines. Standard SEO arguments tend to stress the importance of keywords in URLs and there’s long been a suggestion that URL keywords are one factor influencing search result prominence. Then again there are so many factors rumoured to affect search results it’s hard to pick apart which factors are real and how much they count. Standard SEO arguments tend to be attempts, six years out of date, to second-guess the better brains of Google. And since Google’s “Caffeine” release in 2010 keywords in the URL have had very little influence on ranking.

One thing we do know is the importance of links to PageRank, and if URLs move, links break and PageRank evaporates. So given that we can’t manage both persistence and readability the only option was to stuff the URL with additional keywords similar to the iPlayer approach:

http://www.bbc.co.uk/iplayer/episode/b04gr4l7/eastenders-02092014

where:

http://www.bbc.co.uk/iplayer/episode/b04gr4l7

identifies the episode and:

/eastenders-02092014

is tacked on for the perceived benefit of search engines. We did think about doing this for at least 5 minutes but decided it felt like the worst of both worlds. Keyword stuffed URLs suggest to users that they’re hackable when they aren’t. So we didn’t and /programmes doesn’t seem to have suffered in the eyes of Google et al.

Subsidiary resources, transclusion, IDs and anchor links

Many of the pages on /programmes (particularly brands, series, episodes and clips) are constructed from data from a whole variety of related objects. So an episode page might display core episode data (title, synopsis), data from its ancestors (brand and series titles) and data from its descendants (canonical version data and downwards). More concretely it might display a list of segments (actually a list taken from its canonical version) and a list of cast and crew (ditto). Where possible we tried to ensure that all subsidiary resources (lists of descendant things) were addressable at their own URL even if we didn’t intend to link to them on desktop pages (although see the section on mobile views). David Marland has written an excellent post on how responsive design begins with the URL. And more particularly about how making subsidiary resources addressable makes it easy to swap and change what gets served as the core page and what gets AJAX transcluded depending on screen sizes etc.

HTML IDs and anchor links tend to get neglected in URL design. When designing the /programmes URLs we tried to add IDs to all transcluded subsidiary resources even if we didn’t intend to link to them. And we tried to keep the language used in those IDs consistent with the language used in the PIPs domain model and the URL of the subsidiary resource. So for an episode of Eastenders:

http://www.bbc.co.uk/programmes/b04gr4l7

adding #segments:

http://www.bbc.co.uk/programmes/b04gr4l7#segments

will anchor link you to the segment list (in this case a list of the music tracks played). And swapping out the hash for a slash:

http://www.bbc.co.uk/programmes/b04gr4l7/segments

will redirect you to the segment list resource nested under the canonical version URI (again because segments belong to versions not episodes):

http://www.bbc.co.uk/programmes/b04gr4l3/segments

This is particularly useful (and saves an extra request) if you want to work with tracklist data but only have episode PIDs to work with.

In terms of URL hackability you should always be able to replace a hash with a slash and vice versa.

Some titling problems with the PIPs hierarchy

Several paragraphs earlier I mentioned that programme makers and schedulers don’t tend to think in terms of brands, series and episodes. For them a programme gets commissioned, produced and broadcast in a slot on a network / channel. Often programmes are commissioned as a block (even an ongoing programme like Eastenders gets commissioned as a series) but that block doesn’t always reflect how the programme is offered to the public (either through broadcast or maybe later as a DVD). Occasionally the public facing title of a programme reflects the broadcast slot and not the “content”. So a play gets commissioned and broadcast as part of the Afternoon Drama slot on Radio 4 and gets placed into the Afternoon Drama brand in PIPs. In /programmes and radio iPlayer the title displayed will be generated by combining the titles of the brand and episode. But some time later the play might be rebroadcast in the evening outside the Afternoon Drama context. At which point you have to either make a new episode or live with the slightly misleading title.

And there are similar problems with recontextualised TV repeats with episodes from Have I Got News for You being largely the same as episodes from Have I Got a Bit More News For You but having to duplicate brands, series and episodes to cope with the titling requirements.

There have been some conversations about modelling schedule specific override titles on broadcasts but for now we have brands, series and episodes and the titling is generated from that hierarchy.

Segments and segment events

Back to the data model and all the other things… Episodes usually have a running order. In a news programme these might be individual news stories, in a music programme the tracks played, in a football programme the matches covered. In PIPs the running order is modelled as a set of segments hanging off an episode version. Because a version can have many segments and a segment can be used in many versions, segments are joined to versions via segment events. Segment objects describe the editorial content of the segment and its duration. Segment events describe where the segment occurs in the version, either by position (this is the third segment) or by the offset start time (how many seconds into the version the segment starts). This allows for segment reuse between different versions of the same episode or different versions of different episodes. So a segment of the canonical version of a Top Gear episode might be a review of a Ferrari and that same segment might be reused in the canonical version of the Top Gear Christmas special or might be reused in the canonical version of a clip:

Segment reuse via segment events

Though in practice only music tracks ever use this segment reuse across episode versions.
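
In sketch form (class names are mine and the PIDs of the second placement are invented):

	from dataclasses import dataclass
	from typing import Optional

	@dataclass
	class Segment:
	    pid: str
	    segment_type: str                     # e.g. "music"

	@dataclass
	class SegmentEvent:
	    pid: str
	    version_pid: str                      # the version this placement sits in
	    segment: Segment                      # the reusable editorial content
	    position: Optional[int] = None        # "this is the third segment", or...
	    version_offset: Optional[int] = None  # ...seconds into the version

	# the same Segment placed in two versions via two SegmentEvents
	track = Segment("p025pdv0", "music")
	first = SegmentEvent("p025pdv2", "b04gr4l3", track, position=1)
	reuse = SegmentEvent("p0000000", "b0000000", track, position=4)  # invented PIDs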

If you look at the data view of a version with segments like:

http://www.bbc.co.uk/programmes/b04gr4l3.xml

you’ll find that both segments and segment events have PIDs:

		<segment_events>
			<segment_event>
				<title/>
				<pid>p025pdv2</pid>
				...
				<version_offset/>
				<position>1</position>
				<segment type="music">
					<pid>p025pdv0</pid>
					<duration/>
					...
				</segment>
			</segment_event>
		</segment_events>
	

Given that segments can belong to many versions we weren’t able to nest them under version URLs and needed to give them a URL of their own. So we followed the standard pattern of /programmes/:pid. Like http://www.bbc.co.uk/programmes/p025pdv0. Segment events were different though. A segment event belongs to one and only one segment and one and only one version so we could, theoretically, have nested them as either:

http://www.bbc.co.uk/programmes/:version.pid/:segment_event.pid

or:

http://www.bbc.co.uk/programmes/:segment.pid/:segment_event.pid

Given that segment events are not available as HTML pages but only as data views (at least at the time of writing), that the URLs were intended only for use as an API, and that consumers of an API should be constructing URLs rather than hacking them, making the URL dependent on a composite key of the segment event PID and another object would have added lookup complexity that wasn’t needed. So again we did the simplest thing and went for /programmes/:pid.

Collections, seasons and franchises

The final set of programme-type objects are collections, seasons and franchises. Collections provide a generic way to group any type of PIPs objects (brand, series, episode, clips, segments etc) although they’re usually used to group episodes and clips into editorially coherent packages. They’re basically a way to generate a random list of things with a similar theme (and I didn’t even say “curate”). This collection of John Betjeman episodes from a variety of archive programmes would be a classic example.

Seasons and franchises are specialised types of collection. A season is used to group “publications”: broadcasts and iPlayer ondemands. Although, in practice, pretty much always broadcasts. They correspond to the traditional (UK) definition of a broadcast season where broadcasts of episodes from multiple programme groups are promoted as a themed season. So there might be multiple broadcasts of some Clint Eastwood film but this broadcast and only this broadcast is part of the Wild West season. The currently running World War One season would be an obvious example.

Franchises are intended to group “related” TLEOs (although see unstable TLEOs) usually by some narrative theme shared between original programmes and spin-offs. So you might want to call Doctor Who and Torchwood a franchise. Or Doctor Who and Sarah Jane. Or all the various Matches of the Day. Or Casualty and Holby City. Or Autumn Watch and Spring Watch. So far there are only 3 franchises in PIPs and /programmes: Daily and Sunday Politics, UK Black and Desi Download.

When it comes to designing URLs, collections, seasons and franchises come with the same problems as the rest of the PIPs model. They are top-level objects (they don’t belong to anything) and there’s nothing to stop people from publishing a John Betjeman collection this year and a different John Betjeman collection next year. So again, titles don’t help much with URL generation. To keep things simple and consistent we decided to stick with the same pattern and publish seasons, collections and franchises at /programmes/:pid.

Aggregation URLs

Heading back to Tom Loosemore’s 15 web principles for the BBC, principle 10 said:

Maximise routes to content: Develop as many aggregations of content about people, places, topics, channels, networks & time as possible. Optimise your site to rank high in Google.

So we did. Or at least tried to. There are (or were) five main types of aggregation in /programmes:

  1. Schedule views (in the usual network / channel fashion but also schedules by genres, formats and tags)
  2. A-Z views
  3. Genres (the rough subject matter of a programme)
  4. Formats (the style of a programme)
  5. Tags (more granular descriptions of the subject matter of a programme based on DBpedia tags) – sadly now removed

Aggregations present none of the problems of programme objects when it comes to designing URLs. For a start there are far fewer of them and their structure is more stable over time. So whereas we sacrificed readability and hackability for brands and series and episodes etc, we were able to make aggregation URLs that were persistent (or persistent enough / easily redirectable), readable, meaningful and hackable.

Schedules

Before /programmes the BBC generated online radio and TV schedule listings through a service called WhatsOn. This was completely isolated from programme pages except where programme teams manually added a link to their hand-rolled programme home page. Where these links were added you couldn’t get directly to the details of the episode being broadcast but only to the overarching page about the programme. In addition to programme pages for brands and series and episodes, /programmes was designed to replace WhatsOn powered schedules with listings linking directly to the episode concerned. Though the decision to stop linking to the overarching TLEO page took some debate.

Radio network and TV channel schedules are nested under the top-level directory for the network or channel (or service in PIPs language) concerned at:

http://www.bbc.co.uk/:service/programmes/schedules

Again the /programmes part is just there to make sure the BBC web servers know to send the request to the /programmes application. There was some debate about whether networks / services (let’s stick with services for the sake of typing) should live inside the top-level directory for the service or whether, in old iPlayer style, the service should live under /programmes like:

http://www.bbc.co.uk/programmes/radio1/schedules

At the time the BBC was keen to cut down on the number of top-level directories (for reasons I’ve never quite understood) but we were pretty sure that the service directories wouldn’t be deleted. For about five minutes I campaigned to have all radio aggregations under /radio and all TV aggregations under /tv but people scowled when I mentioned slash radio slash four. So service first, then /programmes as a namespace, then schedules.

Usually if you navigate to:

http://www.bbc.co.uk/:service/programmes/schedules

you’ll get a list of today’s broadcasts. But some services have different schedules depending on transmission method and / or location. So Radio 4 has a Long Wave schedule and an FM schedule. And BBC One and Two have a whole host of different regional schedules (and different regions for that matter). In PIPs (and SMEF) language these variations are referred to as outlets. So when you request the /schedules URL for a service with outlets you get a list of outlets rather than a day schedule and the schedule page is found at:

http://www.bbc.co.uk/:service/programmes/schedules/:outlet

Navigating between days expands the URL to include year, month and day:

http://www.bbc.co.uk/radio4/programmes/schedules/fm/2014/09/04

Removing the day returns a calendar month view:

http://www.bbc.co.uk/radio4/programmes/schedules/fm/2014/09

and removing the month returns a calendar year view:

http://www.bbc.co.uk/radio4/programmes/schedules/fm/2014

We also made week view schedules (mostly I think because that’s what Radio 3 had always had and that’s what they still wanted to have) with URLs like:

http://www.bbc.co.uk/:service/programmes/schedules/:outlet/:year/w:week-number

where the week number is the ISO week number (although some people did request we used “BBC week numbers” because the BBC has its very own week numbering system…). So you’ll find a week schedule for Radio 3 at:

http://www.bbc.co.uk/radio3/programmes/schedules/2014/w12

should you ever need such a thing.
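
All of which makes schedule URLs easy to construct programmatically. A sketch of the patterns above:

	from datetime import date

	BASE = "http://www.bbc.co.uk"

	def schedule_url(service, outlet=None, year=None, month=None, day=None, week=None):
	    parts = [BASE, service, "programmes", "schedules"]
	    if outlet:
	        parts.append(outlet)                   # only some services have outlets
	    if week is not None:
	        parts += ["%d" % year, "w%d" % week]   # ISO week number
	    else:
	        for value, fmt in ((year, "%d"), (month, "%02d"), (day, "%02d")):
	            if value is not None:
	                parts.append(fmt % value)
	    return "/".join(parts)

	d = date(2014, 9, 4)
	print(schedule_url("radio4", "fm", d.year, d.month, d.day))
	# http://www.bbc.co.uk/radio4/programmes/schedules/fm/2014/09/04
	print(schedule_url("radio3", year=2014, week=12))
	# http://www.bbc.co.uk/radio3/programmes/schedules/2014/w12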

There was some bickering about what kind of day a day schedule should represent. Stakeholders seemed to want a schedule day to represent a broadcast day (~ 6am to 6am) but we thought it would be odd for a link saying something like ‘Broadcast on Thursday September 4th at 3:30’ to point at a schedule labelled ‘Wednesday September 3rd’. So we compromised and schedule days run from midnight to 6am the next day.
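
One implication worth sketching (assuming the boundary works exactly as described): an early-hours broadcast appears on two day pages, because yesterday’s page runs until 6am this morning:

	from datetime import datetime, timedelta

	def schedule_days(start):
	    days = [start.date()]
	    if start.hour < 6:   # also inside the previous day's midnight-to-6am overhang
	        days.insert(0, start.date() - timedelta(days=1))
	    return days

	print(schedule_days(datetime(2014, 9, 4, 3, 30)))
	# [datetime.date(2014, 9, 3), datetime.date(2014, 9, 4)]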

It’s also worth noting that for today’s schedule you can anchor link to the current broadcast. So add #on-now to today’s schedule URL and the page will scroll to the current broadcast:

http://www.bbc.co.uk/radio4/programmes/schedules/fm#on-now

Whilst not quite keeping with the pattern of making hash URLs and slash URLs consistent, you can also add /now to a schedule view like:

http://www.bbc.co.uk/radio4/programmes/schedules/now

which 302 redirects to the episode page of the current broadcast. Currently this only works for the ‘default’ schedule (e.g. FM for Radio 4) and doesn’t work if you specify a particular outlet. But there are plans to improve this to work with outlets and bring the naming in line with the anchor link. Even without improvements it’s a handy way to get quickly to details of the programme now being broadcast.

Finally (and this is edging into data views territory) in July 2008 Duncan Robertson added calendar data views to the schedules. Add .ics to any schedule URL, subscribe to that URL in the calendar application of your choice and you’ll be able to see what’s on for the next 7 days without ever having to visit the website.

Schedule helper URLs

Service schedules have a couple of other hidden URLs that occasionally prove useful:

http://www.bbc.co.uk/radio4/programmes/schedules/fm/yesterday will show yesterday’s schedule

http://www.bbc.co.uk/radio4/programmes/schedules/fm/today will show today’s schedule

http://www.bbc.co.uk/radio4/programmes/schedules/fm/tomorrow will show tomorrow’s schedule

http://www.bbc.co.uk/radio4/programmes/schedules/fm/last_week will show last week’s schedule

http://www.bbc.co.uk/radio4/programmes/schedules/fm/this_week will show this week’s schedule

http://www.bbc.co.uk/radio4/programmes/schedules/fm/next_week will show next week’s schedule

The decision to use underscores and not hyphens was mine and was the wrong one. And the helper URLs should probably redirect to their dated equivalents but…
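Resolving the helpers is trivial, which makes the missing redirects more annoying. A sketch of the mapping from helper keyword to the dated path it could 302 to:

import datetime

def resolve_helper(keyword, today=None):
    """Map a helper path segment ('today', 'next_week' etc) to a dated path."""
    today = today or datetime.date.today()
    days = {'yesterday': -1, 'today': 0, 'tomorrow': 1}
    weeks = {'last_week': -1, 'this_week': 0, 'next_week': 1}
    if keyword in days:
        return (today + datetime.timedelta(days=days[keyword])).strftime('/%Y/%m/%d')
    if keyword in weeks:
        year, week, _ = (today + datetime.timedelta(weeks=weeks[keyword])).isocalendar()
        return '/%d/w%d' % (year, week)
    return None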

A-Z

A-Z views are fairly simple. They live at:

http://www.bbc.co.uk/programmes/a-z

and list TLEOs (including accidentally orphaned episodes). The only interesting point to note is they’re cross-listed so The Archers will appear under both T and A. At least for now. Although I’m told the new platform /programmes is moving to doesn’t support this, so probably not for much longer. If that move has happened before you read this and you can’t find The Archers, you might want to look under T.

Programme lookup URLs

Another slightly undocumented feature is the quick programme lookup URL. Although we couldn’t make the TLEO URLs readable and meaningful we still wanted a way to hack the URL to find programmes by title. So we added:

http://www.bbc.co.uk/programmes/:programme-title

What happens is:

  1. You hack the URL to include the programme title (or a bit of the programme title) like:

    http://www.bbc.co.uk/programmes/some-text

  2. The /programmes application first checks whether the bit after the final slash matches the PID pattern. If it does, it looks for a TLEO with that PID. If it finds one it serves the TLEO page; if it doesn’t it returns a 404.
  3. If it doesn’t match the PID pattern the application checks to see how many TLEOs have the string you’ve typed as a substring of the title.

    1. If no TLEO titles match that pattern the application returns a 404:

      http://www.bbc.co.uk/programmes/this-is-the-most-boring-thing-ive-ever-read

    2. If only one TLEO title matches that pattern the application returns a 302 redirect to that TLEO PID:

      http://www.bbc.co.uk/programmes/sharedplanet > http://www.bbc.co.uk/programmes/b02xf2qg

    3. If many TLEO titles match that pattern the application returns a 302 redirect to a list of those TLEOs:

      http://www.bbc.co.uk/programmes/thearchers > http://www.bbc.co.uk/programmes/a-z/by/thearchers/all

(302s are used for the redirects because you never know when a similarly titled programme / spinoff might come along.)
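In pseudo-Python the whole lookup is something like the sketch below. The PID regex is a simplified guess (real PIDs avoid vowels, hence the odd alphabet) and the title match strips spaces so sharedplanet can find Shared Planet:

import re

PID = re.compile(r'^[0-9b-df-hj-np-tv-z]{8,}$')  # a simplified guess at the PID pattern

def lookup(slug, tleos):
    """tleos maps pid -> title. Returns a (status, location or body) pair."""
    if PID.match(slug):
        if slug in tleos:
            return 200, 'TLEO page for ' + slug
        return 404, None
    needle = slug.lower()
    matches = [pid for pid, title in tleos.items()
               if needle in title.lower().replace(' ', '')]
    if not matches:
        return 404, None
    if len(matches) == 1:
        return 302, '/programmes/' + matches[0]
    return 302, '/programmes/a-z/by/' + slug + '/all'

The many-matches case is where the /by/ URLs come in, of which more below.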

The /by/ bit of the listing URL is a bit of a puzzler. Memory suggested it was a clever bit of namespacing to fence off the title lookup from the listings but lots of curling suggests that isn’t the case. I think it was added because we first put the programme lookup / list logic under /programmes/a-z/.. and wanted to separate that from the URL that only ever brought back lists: /programmes/a-z/by/.. It seems redundant now that /programmes/some-text does the TLEO match or title match or title list logic.

Genres, formats and tags

PIPs has (or at least had) three different category schemes for programmes:

  1. Genres describe the rough subject matter of the programme. They were originally populated by Red Bee only on versions but propagated up to TLEOs by the green box. And now they can be set on any programme object and inherit down unless they’re overridden.
  2. Formats describe the way the programme has been made: a film and / or a documentary for example. Again they were originally populated by Red Bee on versions but now can be on any object and inherit with override.
  3. Tags were assigned by BBC staff to episodes and clips. They were based on DBpedia URLs so a programme could theoretically be “tagged” with any concept with a Wikipedia URL allowing for more granular description of the subject matter of a programme: who, where, when and what.

Tags were always something of a difficult sell to production staff; there were never quite enough to expose tag navigation properly and without navigation being exposed it was difficult to persuade people to add them. Given the number of iPlayer-available episodes at any one time they were probably also a little too granular to be useful to users. Often there was only one available episode with a given tag at any one time so they didn’t really help with sideways navigation. Anyway, tags have been removed now though the data still exists somewhere. Maybe they’ll return as the number of available episodes increases.

One final point on tags. Whilst it was possible to assume that all episodes under a TLEO would share the same genres and formats it wasn’t possible to assume the same of tags. So whilst genre and format aggregations link to the TLEO homepage, tag aggregations used to link to an episode aggregation under the TLEO like:

http://www.bbc.co.uk/programmes/:tleo/episodes/topics/:tag

/programmes treated genres, formats and tags as types of “category” and the code to handle them was identical. The only major difference was that genres could have child genres to three levels whilst formats and tags were flat:

http://www.bbc.co.uk/programmes/formats/films

http://www.bbc.co.uk/programmes/genres/music/jazzandblues/blues

The URL keys for formats and genres didn’t exist in PIPs so were created in the green box. I think it was my decision to concatenate them (jazzandblues) but in retrospect they probably should have been hyphen separated (jazz-and-blues).
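For the record, the hyphenated keys would only have cost a line (a sketch, not the green box’s actual logic):

import re

def url_key(title):
    """'Jazz and Blues' -> 'jazz-and-blues' rather than 'jazzandblues'."""
    return re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')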

The tag URL keys were taken from DBpedia which in turn takes them from Wikipedia. This presents a real challenge for URL persistence: Wikipedia URLs change as Wikipedia page titles change, that changes DBpedia and that in turn changed our tags. If we were starting this now we’d probably use Wikidata IDs rather than DBpedia URL keys.

Anyway, the main genre, format and tag URLs listed TLEOs filterable by availability and service / service type but there were also day schedule views like:

http://www.bbc.co.uk/programmes/genres/sport/football/worldcup/schedules

Like service schedules, clicking on another day expands the URL to include month and day like:

http://www.bbc.co.uk/programmes/genres/sport/football/worldcup/schedules/2014/07/13

Unfortunately whilst the URL is fairly readable / meaningful it isn’t quite hackable. Unlike service schedules, if you remove the day or the day and the month you get a 404 rather than a calendar view. It was always tricky to agree on what counted as the “definition of done”. It’s hard to persuade a project manager that a view users can only reach by hacking the URL, or a serialisation only a few geeks will ever use, has to be built before the box gets ticked, especially when there’s other, more visible, work on the list.

Like service schedules, genre and format schedules are available as .ics so if you want a calendar of films on the BBC in your Apple or Google calendar you can subscribe to:

http://www.bbc.co.uk/programmes/formats/films/schedules.ics

Unlike service schedules there’s no week view but there is a paginated list of all upcoming broadcasts (not split by day) at:

http://www.bbc.co.uk/programmes/formats/films/schedules/upcoming

Which isn’t currently linked to but is also available as ICS.

We did have half a plan to allow cross-pollination of genre and format URLs so you could query by combinations like:

http://www.bbc.co.uk/programmes/genres/sport/football/formats/performancesandevents

for all live football across the BBC or:

http://www.bbc.co.uk/programmes/genres/sport/football/formats/phoneins

for all football phone-ins across the BBC. But that never quite happened.

Availability filters

A-Z, genre, format and (in their day) tag aggregations could all be scoped by programme availability:

  1. Adding /player returns only TLEOs which have (or are) episodes available to stream
  2. Adding /current returns TLEOs which have (or are) episodes available to stream and / or have been broadcast in the last 7 days or will be broadcast in the next 7
  3. Adding /all returns all TLEOs

Current was added so network / channel listings would show programmes currently being promoted but which might not (yet) have episodes available to stream.
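As I read them, the three scopes boil down to something like this sketch (field names invented):

import datetime

def in_scope(tleo, scope, today=None):
    """tleo has 'streamable' (bool) and 'broadcasts' (a list of dates)."""
    today = today or datetime.date.today()
    week = datetime.timedelta(days=7)
    if scope == 'all':
        return True
    if scope == 'player':
        return tleo['streamable']
    if scope == 'current':
        near = any(abs(d - today) <= week for d in tleo['broadcasts'])
        return tleo['streamable'] or near
    raise ValueError('unknown scope: ' + scope)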

For genres and formats (but not A-Z) there’s also a /player/episodes URL like:

http://www.bbc.co.uk/programmes/genres/sport/player/episodes

which lists all episodes currently available (not grouped into TLEOs). This was only ever meant to serve RSS / Atom and never meant to be linked to as an HTML page. We built the HTML view just to test the queries and in case a user removed the .rss from the URL. Unfortunately we never got round to adding an RSS serialisation. And the page got linked to in lieu of a TLEO listing sortable by latest episode broadcast time.

Service type and service filters

A-Z, formats, genres and (again when they existed) tags could all be filtered by service-type (TV or radio) or by service (TV channel or radio network). These URLs were designed to fit into existing top level directories so you can get all football programmes with available episodes across the BBC:

http://www.bbc.co.uk/programmes/genres/sport/football/player

or all football programmes with available episodes on radio:

http://www.bbc.co.uk/radio/programmes/genres/sport/football/player

or all football programmes with available episodes on 5 Live:

http://www.bbc.co.uk/5live/programmes/genres/sport/football/player

which leads to probably the most readable, hackable and definitely longest URLs in /programmes:

http://www.bbc.co.uk/bbcone/programmes/genres/sport/football/worldcup/schedules/2014/07/13

Unlike early iPlayer service listings, /programmes service type and service aggregations are based on broadcast history rather than masterbrand. So programmes like Eastenders and Doctor Who still appear on BBC Three listings even though their masterbrand is BBC One:

http://www.bbc.co.uk/bbcthree/programmes/genres/drama/soaps/player

Container aggregations

Most of the /programmes aggregations link down to TLEO pages (brand, series and orphan episodes). Once you’ve arrived at a brand or series you still need to be able to find individual episodes inside it. So /programmes also publishes aggregations of episodes inside brands and series. These are:

  1. A list of episodes currently available online:

    http://www.bbc.co.uk/programmes/b006q2x0/episodes/player

  2. A list of upcoming broadcasts (including repeats):

    http://www.bbc.co.uk/programmes/b006q2x0/broadcasts/upcoming

  3. A list of upcoming broadcasts (excluding repeats):

    http://www.bbc.co.uk/programmes/b006q2x0/broadcasts/upcoming/debut

    This view isn’t currently linked to.

  4. A list of all broadcasts by month:

    http://www.bbc.co.uk/programmes/b006q2x0/broadcasts/2014/09

    Hackable back to months with broadcasts in a year.

  5. A list of direct children (series and episodes):

    http://www.bbc.co.uk/programmes/b006q2x0/episodes/guide

The URLs have changed a little since we first went live and probably aren’t as hackable as they could be: ideally /episodes would list years and months with broadcasts and /broadcasts would redirect to the current year and month.

There also used to be aggregations of episodes by tag but they disappeared when tags disappeared.

The Onion Problem

So we ended up with a lot of aggregations and probably spent more time thinking about them than we did the “content” pages (brands, series, episodes etc). The upsides are obvious; many more approach roads to programmes for both users and search bots. But one thing we never quite solved was the journeys up from programme pages and back out to the aggregations. Given a made up example of an episode of In Our Time tagged with Babylon, should it link up to other In Our Time episodes also tagged Babylon, programmes from Radio 4 tagged Babylon, programmes from across radio tagged Babylon, all programmes tagged Babylon or everything the BBC has about Babylon? There were arguments both ways: the higher up the onion you link the more stuff you expose but the more context you lose. And definite arguments around losing the context of intended audience. The best argument for keeping links local was children’s programmes, where you probably didn’t want to link up outside that context. Although since Children’s TV eventually opted out of /programmes for programme pages (links from the /programmes CBBC schedule page get redirected, which is rather sad) that became less of an issue. I think our gut feeling was to take users as high up the tree as possible and expose as much content as possible. And other options felt a little like information architecture as organisation structure. But then parts of the BBC organisation structure are meaningful to users. And some parts definitely aren’t. So there’s no absolutely, definitely correct answer.

Universality and “one web”

The overriding principle when designing /programmes was universality. The “manifesto” we drew up included:

/programmes believes:

  1. in one web
  2. in accessibility for people
  3. in accessibility for machines

The aim was to ensure users got the information they wanted no matter what their accessibility needs, device or agent. For that reason we spent a lot of time ensuring that the URLs we supported were accessible, would work across screen sizes and would output data in whatever fashion users wanted.

Mobile views

/programmes was born in the age of “feature phones”. The first generation iPhone launched a few months before we did but web browsing smart phones weren’t commonplace and it was another 3 years before responsive design went mainstream. But there was a feeling that one day smart phones would be everywhere and Chris Yanda in particular was telling everyone at the BBC to design for mobile. And /programmes was supposed to work everywhere so…

In the absence of responsive design we added a separate set of templates to /programmes serialising all the standard views as XHTML Mobile Profile. The plan was to add some device detection to route requests between “desktop” views and “mobile” views but since we didn’t have that technology in place for a few months, the first mobile friendly /programmes pages were served at separate URLs with a .mp suffix. Some of those views still exist, like:

http://www.bbc.co.uk/programmes/b006q2x0/episodes/guide.mp

/programmes is currently being migrated to responsive design so I guess all these .mp URLs will 301 to the standard URL some time fairly soon.

Data views

All of what follows and any previous mention of XML or JSON or ICS comes with the caveat of being true at the time of typing. By the time you read this (if anyone makes it this far) it may no longer be true. If you curl a /programmes data view like:

curl --head http://www.bbc.co.uk/programmes/genres.xml

part of the response you get back is:

X-Aps-Deprecation-Notice: APS is soon to be deprecated. It will first of all cease to be supported on a 24/7 basis, and will then cease responding entirely. Nitro is the BBC’s new API for programme data, and can provide all the information previously provided by APS. Go here to read more: http://developer.bbc.co.uk/nitro

So data views exist for now but possibly / probably not for much longer.

In keeping with the principle of universal access to information, /programmes was designed to be RESTful. Not RESTful as in a RESTful API and some other separate website thing somewhere else. But RESTful as in some resources and some representations where one of those representations just happens to be HTML. But could be JSON or YAML or XML or RDF-XML or ICS or RSS or XSPF. And which representation you get back depends on what you choose to accept.

Or at least that was the case when /programmes still supported content negotiation. These days you have to request a specific representation by adding an extension to the URL (except for HTML which comes back as default (obviously)). So adding .json brings back JSON, .xml returns vanilla XML, .rdf gets you RDF-XML, .ics added to schedule views gets you ICS files and adding /segments.xspf to an episode page with a tracklist will bring back an XSPF playlist. Obviously not all URLs support all representations and some are more specialised than others. Back in the early design and development phase we used to have a whole wall of post-its outlining the URL structure and which representations each resource returned. If you’re designing large and fairly complex websites it’s still a useful technique for getting a general feel of the shape of what you’re building.

Post-it notes for URLs
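The extension handling amounts to a dispatch on the URL suffix with HTML as the fallback. A sketch with two toy serialisers standing in for the real set:

import json

def represent(path, resource):
    """Pick a representation from the URL extension; HTML is the default."""
    _, dot, ext = path.rpartition('.')
    serialisers = {
        'json': lambda r: json.dumps(r),
        'xml': lambda r: '<resource><title>%s</title></resource>' % r['title'],
    }
    if dot and ext in serialisers:
        return serialisers[ext](resource)
    return '<html><body><h1>%s</h1></body></html>' % resource['title']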

It’s interesting (I think) that we seem to have moved from seeing the web as a universal information space to drawing clear lines of demarcation between content delivery to end users (the website) and a data space for programmers (the API). It always felt to me more like a continuum than a strict separation. You have:

  1. HTML designed to be viewed by end users in browsers
  2. Data exchange formats designed to be used by end users outside browsers (RSS, ICS, XSPF) which also happen to be handy for programmers
  3. The standard API-ish serialisations like JSON and XML up to various flavours of RDF
  4. And RDFa (RDF in HTML) and you’re back where you started

Trying to draw a clear line and segment the continuum by class of user agent and intended use seems to me misguided but maybe I’m suffering a Web 2.0 hangover and these days people are happier with websites as sealed systems.

Linked data views

Sometime in early 2008 Yves Raimond joined us for a couple of weeks and started to translate the PIPs data model into the Programmes Ontology. A few months later he joined us full time and began to add RDF views to /programmes in much the same way as we’d already added XML, JSON and YAML. I think maybe because we spoke quite a lot about RDF and the semantic web, /programmes is seen in some parts as a linked data website. But it’s no more a linked data site than it is a desktop site or a tablet site or a mobile site or a JSON site. RDF is just one more serialisation it publishes and one more way of making it universal.

At this point discussion about URLs gets a bit complicated and you probably end up slipping into talking about URIs (or if you’re feeling particularly geeky IRIs). A central conceit of the semantic web is that you need different URIs to publish data about the document and the real-world thing the document represents. So you might want to say that one person wrote an Archers episode page but another person wrote the episode. There’s a fair amount of bickering about the nomenclature around this but people tend to refer to the real-world thing (the actual episode) as a non-information resource and the document about the thing (the various serialisations of the episode “page”) as an information resource. And some people talk about the non-information resource having a URI and the information resource having a URL. And some people use URI for everything.

There are two common ways to differentiate between non-information resources and information resources:

  1. Give the non-information resource a completely different URI path like http://www.bbc.co.uk/things/:pid and when that resource gets requested return a 303 (see other) to the information resource at http://www.bbc.co.uk/programmes/:pid

    This is the httpRange-14 debate. It’s been going on since 2002 and it’s almost guaranteed to make your brain hurt.

  2. Give the non-information resource a hash URI like http://www.bbc.co.uk/programmes/:pid#thing. Because hashes don’t get sent to the server when you request a hash URI the server sees the hashless URI and returns details of the information resource

For simplicity and because our existing URIs were fairly granular (one URI per thing, one thing per URI) we chose to go with the hash pattern. For about an hour we went with the hash being the class of the object like http://www.bbc.co.uk/programmes/:pid#brand or http://www.bbc.co.uk/programmes/:pid#series but figured it would be easier for external consumers (who might not necessarily know the class of an object) to just go with http://www.bbc.co.uk/programmes/:pid#programme

So we ended up with:

  1. http://www.bbc.co.uk/programmes/:pid#programme being the URI of the non-information resource
  2. http://www.bbc.co.uk/programmes/:pid being the URI of the generic information resource
  3. http://www.bbc.co.uk/programmes/:pid.:representation being the URI of the information resource representation

Unlike the fairly common DBpedia URI pattern this doesn’t conflate the information resource / non-information resource split (I can’t send you that but (303) here’s some information) with the content negotiation part (which serialisation of the information would you prefer). So the thing that appears in the browser bar and gets copied and pasted and used to create new links is the generic information resource URI and not the representation URI.
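In RDF terms the split looks something like this (sketched with Python’s rdflib and the Programmes Ontology namespace; the exact terms the real views used may have differed):

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, FOAF

PO = Namespace('http://purl.org/ontology/po/')

page = URIRef('http://www.bbc.co.uk/programmes/b006q2x0')             # information resource
thing = URIRef('http://www.bbc.co.uk/programmes/b006q2x0#programme')  # non-information resource

g = Graph()
g.add((thing, RDF.type, PO.Brand))       # the programme itself (assuming it's a brand)
g.add((page, RDF.type, FOAF.Document))   # the page about it
g.add((page, FOAF.primaryTopic, thing))  # and the link between the two

print(g.serialize(format='turtle'))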

Anyway, at some point fairly soon, the RDF XML views (like the JSON and XML and YAML and ICS and XSPF) will (probably) disappear. But /programmes won’t quite stop being a linked data website. Specifically because it will continue to serve RDFa (as Programmes Ontology / Schema.org). Where JSON or JSON-LD etc are only ever a transform away. But more generally because the principles that underlie the design of /programmes and the design of PIPs (since its Programme Information Pages days) are the same principles that underlie linked data: one URL (I?) per thing, one thing per URL, semantically linked. RDF was only ever an implementation detail and the design principles of how we make websites all remain true.

General stuff

There are a few general points about the URLs that don’t fit anywhere else but are fairly obvious. For completeness: the URLs don’t include any technology choices (no .php or /servlets/ etc) because the technology always changes. They don’t include brand names (except for service level aggregations) because brand names change. They don’t include any details about the backend systems (like those newspaper sites you see with /cms/ in the URL bar). And they only ever use parameters to change the display of a resource like:

http://www.bbc.co.uk/programmes/a-z/by/a/all?page=2

and never to specify the resource requested.

Conclusion

So that’s all I can think of on why /programmes URLs / URIs / whatever ended up looking like they did. The general rule of making object URLs flat, opaque and persistent and the aggregation URLs readable, meaningful and hackable seems to work quite well. It’s not perfect but then the way that programmes get commissioned, produced and broadcast isn’t “perfect”. And as URLs disappear back into the plumbing, I’m probably less of a fan of readable / hackable URLs than I was. Anyway, if there are any obvious omissions or loose ends please do leave a comment.

One thing connected journalism could do

This is a rant I tend to have every time I tag along to a news / newslabs meeting so I figured writing it up would save time in the future.

I’ve typed words before about journalism and the relentless churn of repetition, mystic meg prediction and loose, unqualified claims of causality. And somewhere in there claimed that of the Five Ws of journalism “why is always the most interesting question and because the most interesting answer”. “Because of this that” seems to be the underlying message of most journalism even if it does get wrapped up in insinuation, nudges and winks.

And because I’m as guilty of repetition as the next hack, in another post I made the same point:

In news storytelling in particular, why and because are the central pillars of decent journalism. Why is my local library closing? Because of council cutbacks. Why are the council cutting back? Because of central government cutbacks. Why are central government cutting back? Because they need to balance the national budget? Why does the budget need to be balanced? Because the previous government borrowed too much? Why did they borrow too much? Because the banks collapsed. Why did the banks collapse? [etc]

The problem I think we have is that causality claims are not only insinuations but that they’re confined to the article. Connected journalism would make assertions of causality explicit and free them from the confines of the rendered article or programme so chains of claims could be assembled and citizens could trace (assorted and disputed) claims of causality from the (hyper?)-local to the national to the global. And back.

As the world becomes more globalised and more decisions get made above democratically elected governments it’s often not clear where the levers of power might be found or whether they’d actually work if you found them. People become divorced from democracy when they no longer see how the policies they’re voting for actually impact on their lives. And power structures become less of a map and more an invisible set of ley lines. Connected journalism would at least attempt to give national (or international) context to local events and local context to (inter)national events. Which sounds like something public service journalism should at least attempt.

And given that no one organisation can hope to report everything it would provide hooks and links between news organisations and citizen journalists and maybe help to sustain local news as a functioning industry.

I think this possibly echoes some of what Tony Hirst wrote earlier today about open data and transparency. As schools and hospitals and every other social service gets reduced to a league table of stats the decisions that lead to those numbers and why and because get lost in the data noise. And isolated incidents of journalism don’t fill any of those gaps.

So as ever, none of this will probably happen. Too many people would have to work together before any bigger picture of society got painted and “greater than the sum of the parts” always works better on powerpoint slides than in reality. In the meantime news orgs can continue to worry about getting more news onto more screens and in more channels. Because in New York Times Innovation Report fashion, it’s definitely, absolutely not the journalism that’s the problem, just that people don’t read or need it.

Yet another post about Google (not really) removing the URL bar from Chrome

A Twitter conversation with Frankie Roberto and Paul Rissen continuing in my head.

A few days ago Google released a canary build of Chrome which, by default, hid the URL of the current page behind a button. The clue was probably in the canary part; the next day’s build reverted to visible URLs. And the URL was never actually removed, just placed a click away. Users were free to type URLs or search by whichever search engine they had configured in standard “omnibar” fashion.

Even so just about everyone seems to have chipped in with an opinion about this. I can’t pretend I have a clue about whether Google were experimenting with ways to protect users from phishing attacks or whether it was just a toe in the water of self interest (an attempt to expand and consolidate centralised power). But it’s their browser and I guess they can do what they like with it. Beware Silicon Valley bearing gifts and all that; they’ll probably arrive in wrapping paper made of adverts.

What I do think is: the URL has too many people expecting too much and that makes things break.

A few years back I always used to say a good URL should be three things:

  1. Persistent
  2. Readable
  3. Hackable

And the greatest of these was persistent. Three is a nice number and I enjoy a biblical reference or two but I’m not sure I ever really thought hackable mattered. It’s a nice geek feature to navigate by chopping bits out of URLs but do punters actually do that? If they do it’s probably more because your navigation is broken than because your URLs make nice sentences.

But the persistent / readable trade-off is hard. My natural inclination is to repeat the words of my former teacher: “the job of a URL is to identify and locate a resource, the job of a resource is to describe itself.” And of course to quote liberally from Cool URIs don’t change. Whilst making the usual point that:


<a href="this-bit-is-for-machines">this-bit-is-for-people</a>

or


<a href="identify-and-locate">label-and-describe</a>

Which is why browsers have URL and title bars. Identifying / locating and labelling / describing are different things and HTML and browsers provide for both.

All of which is fine in theory but…

URLs have long since broken free of the href attribute and the URL bar. They’re on TV, read out on radio and on the side of buses. Pretending that URLs are just there to identify and locate sidesteps how they actually get used and how people think about them. When they stopped being an implementation detail of the linky web, when they stopped being identifiers and started becoming labels, everyone had an opinion on what they were for and what they should look like. The techy people have an opinion, the UX people have an opinion, the brand manager has an opinion, the marketing department have an opinion, the SEO people have an (often misguided) opinion and then the social media team chip in with theirs. And the people selling domains want to sell more domains. None of the opinions agree or are reconcilable. Like most things with more than one stakeholder the result is a bit of a shambles.

I guess the starting point is what do punters actually want from URLs:

  1. they want to trust that the page they’re looking at comes from the domain they think it does
  2. a sub-set want to copy and paste the URL and make new links
  3. they want to trust that the link they’re clicking on goes to the domain they think it does
  4. they might want to type one they’ve seen on the side of a bus into a box but probably they’ll just search like normal people do

And that’s probably about it. But it does mean that as well as techy and UX and marketing and SEO and etc opinions the URL also gets lumbered with providing provenance and trust. It’s quite a lot to expect from a short string of letters and numbers and colons and slashes.

That said, in almost all cases (aside from the suspiciously spammy looking email) trust really resides in the linker and not the link. There are plenty of places where the bit-for-people part just replicates the bit-for-machines, often with the added horrors inserted by a series of URL shorteners. But we keep clicking on links in Twitter because we trust the linker not the link.

Even so there must be a way we can decouple provenance from location from label. What we’ve got now doesn’t work because too many “stakeholders” disagree about what we’re trying to achieve. It’s hard to not break the web because the marketing manager changes their mind about the brand message and no-one knows how to separate identifiers from labels. The problem isn’t with Google “removing” the URL bar; whatever any browser provider does to patch over this won’t work because there isn’t a right answer because the problem goes deeper. We’re misusing a thing designed to do one thing to do half a dozen other things none of which are compatible.

Update

A couple more things since I posted this:

Should URLs be “hackable”?

Via Twitter Matthew Somerville said, “FWIW I know many people who ‘hack’ the http://traintimes.org.uk URLs, though not many of them would call it that ;)”.

It’s something I do myself, usually to get a feel for the shape of the thing, more often when presented with a new website to check if there are any holes in the URL structure. As the same old mentor used to say, “never hack back to a hole”. Does it really matter? Not really but removing bits of the URL on the lookout for redirects or 40Xs is a pretty good proxy for how much care and attention has been given to the design.

I can’t deny hackable URLs are cute and lots of geeks seem to think the same. I just searched twitter for “hackable URL” and came across someone who “loves RESTful, hackable URLs” which is as big a misreading of REST as almost all other uses. But in real life (and in user testing) I’ve never seen anyone go anywhere near the URL bar. It gets used to check the page they’re looking at really is coming from their bank, to bring back websites from history and to summon Google. I suspect (though have no data) that’s the majority use case. Given all the other things we seem to expect of URLs expecting them to also function as navigation widgets probably just adds to the confusion.

And again, conflating REST with human readable and hackable is just wrong. And don’t get me started on “RESTful APIs” which are apparently something different from websites.

Should URLs be hidden?

I stumbled across a post from Nicholas C. Zakas with the title URLs are already dead which didn’t actually say URLs were dead (because that would be silly) but did say they were slowly disappearing in the same way email addresses and telephone numbers are disappearing. Which is true; URLs are already hidden away in iOS and as screen sizes shrink that will probably continue. Wherever browsers can use titles in preference to URLs they do. Autocomplete (from history) completes on titles (and URLs), history shows title not URLs, bookmarks show titles not URLs. Take a look at your bookmarks and history and imagine how much less useful and useable they’d be if they listed URLs.

The natural extension is to put URLs a click away from the URL bar. Whatever their motivations Google were right to hide the URL. It’s just a shame it only happened for one day.

Does hiding URLs in the browser solve the bigger problem?

No because URLs long ago stopped being the province of developers and became voodoo fetish objects for marketeers and brand consultants. I’d happily predict that the first place where we’ll no longer see URLs will be the browser. Well past that point they’ll still be shown on telly screens, read out on air, plastered over posters etc.

I now think my thinking that URLs / URIs / whatever should be persistent, human readable and hackable made a nice slogan but was just wrong. They should only and always be persistent. Everything else is just sugar.

But that still leaves us with a problem because the marketeers and sales people still want to slap URLs over posters and books and beer mats. It’s interesting that the presence of a URL no longer seems to signify you can get some more information if you type this into a URL bar but instead to signify a vague acceptance of modernity (look we’re the webz).

Or at least that’s my understanding. Presumably the marketeers don’t assume punters emerge from a tube station and type these URLs into URL bars? Because that isn’t what appears to happen. From my day job I know plenty of people search for “bbc.co.uk”. Given the omnibar I’m fairly sure lots of people end up searching Google for Google. They’re just happier using search than weird looking slashdot protocols. Twitter is an interesting side case where the slightly geeky @ of @fantasticlife displaces the very geeky slashdots of https://twitter.com/fantasticlife. Good.

So what if the marketeers could be dissuaded from plastering URLs over every surface they see. It would make our lives easier because we’d no longer have to have all those conversations trying to find a middle ground between “must / just persistent” and “must carry the brand message”. But it won’t happen because the alternative is something like, “just search for” and then you’re at the mercy of Google and Bing and all those competitors outbidding you for keywords.

Which is complete bollocks. Because that’s what happens. Punters do not memorise your URL and even if they do they search for it anyway. Your organisation / brand / “product” / whatever is already at the mercy of search engines because that’s how real people use the web.

So for the love of god Google, if only to save me from another meeting conversation about this, please hide the URL behind a click in Chrome. And hope the marketeers start to think that covering the world in URLs makes as much sense as covering it in ISBNs or catalogue numbers or Amazon product IDs.

Sausages and sausage machines: open data and transparency

Last Wednesday was the second BBC Data Day. I didn’t manage to make the first one but I did end up chatting afterwards with various BBC, ODI and OU people about the sort of data they’d like to see the BBC release. Shortly afterwards I sketched some wireframes and off the back of that was invited to talk at the second event. Which I also didn’t manage to make because I was at home, ill and feeling sorry for myself. In the event Bill stepped in and presented my slides. These are the slides and notes I would have presented if I had managed to be there:

Slide 1


I’d like to talk about open data on the web, what it’s for and in particular how it enables transparency to audiences across journalism and programme making.

Slide 2


So why publish open data on the web? Three common reasons are given:

  1. to enable content and service discovery from 3rd parties like Google, Bing, Facebook, Twitter etc. These are things like schema.org, Open Graph, Twitter Cards etc used to describe services so 3rd parties can find your stuff and make (pretty) links to it. Which often becomes a very low level form of automated syndication because that’s how the web works
  2. to outsource innovation and open up the possibilities of improving your service to 3rd parties. The Facebook strategy of encouraging flowers to bloom around their fields. Then picking the best ones and buying them
  3. and finally because… transparency. To show the world your workings in the best interests of serving the public

Today I’m only really talking about transparency.

Slide 3


So sausages. The BBC already publishes some “open” data but that data only describes the end product, the articles and programmes, and not the process.

Slide 4


This is the Programmes Ontology. It shows the kinds of data we publish about all BBC programmes.

There are programme brands and series and individual episodes and versions of those episodes and broadcasts and iPlayer availabilities. The kind of data you’d need to build a Radio Times or an EPG. Or iPlayer.

Slide 5


And this is the brand page for Panorama. Ask for it as data and you’ll get…

Slide 6


…this.

Again, a brand with episodes with broadcasts etc

Slide 7


What’s interesting is what isn’t there. What goes on in the factory before the sausages make it to the shelves.

Slide 8


Things like:

  1. commissioning decisions. Who? When? Why? What didn’t get commissioned?
  2. scheduling decisions
  3. talent decisions
  4. guest decisions
  5. running orders. What things / what order?

Who refused to appear? Who refused to put up a spokesperson? What was the gender split of guests? What was the airtime gender split?

A couple of weeks back there was a George Monbiot piece in the Guardian bemoaning the fact that BBC programmes often didn’t include enough background information about guests on current affairs programmes. Particularly in respect to connections with lobbyists and lobbying firms.

As a suggestion: every contributor to BBC news programmes should have a page (and data) on bbc.co.uk listing their appearances and detailing their links to political parties, NGOs, campaigning groups, lobbyists, corporations, trade unions etc.

Slide 9


Away from programmes, what would transparency look like for online news?

Slide 10


The Guardian is the most obvious example where clarifications and corrections aren’t hidden away but given their own home on the website.

Slide 11


And the articles come with a history panel which doesn’t show you what changed but at least indicates when a change has happened.

The Guardian’s efforts are good but not as linked together as they might be.

Slide 12


Unlike Wikipedia. This is the edit history of the English Wikipedia article on the 2014 Crimean Crisis. Every change is there together with who made it, when and any discussion that happened around it.

Slide 13


And every edit can be compared with what went before, building a picture of how the article formed over time as new facts emerged and old facts were discounted.

Slide 14


I didn’t manage to attend last year’s data day but I did end up in the pub afterwards with Bill and some folk from the ODI and the Open University.

We talked about the kind of data we’d all like to see the BBC release and it was all about the process and not the products. The sausage factory and not the sausages.

We made a list of the kinds of data that might be published and it fitted well with how the BBC likes to measure its own activities: Reach, Impact and Value.

Slide 15


It also looked a lot like this infographic which made the rounds of social media last week detailing the cost per user per hour of the BBC TV channels.

Slide 16


These were the wireframes I made following last year’s pub chat.

They were intended to sit on the “back” of BBC programme pages; side 1 would show the end product, side 2 would be the “making of” DVD extra, the data about the process.

Headline stats for every programme would include total cost, environmental impact, number of viewers / listeners across all platforms and the cost per viewer of that episode.

Programmes would be broken down by gender split of contributors and their speaking time.

Reach would list viewer / listener figures across broadcast, iPlayer, downloads and commercial sales.

Slide 17


Impact would list awards, complaints, clarifications, corrections and feedback from across the web.

And value would list production costs, acquisition costs and marketing spend.

All of this would be available as open data for licence fee payers to take, query, recombine, evaluate and comment on.

Having made the wireframes I chatted with Tony Hirst from the OU about how we might prototype something similar. We came up with a rough data model and Tony attempted to gather some data via FOI requests.

Slide 18


Unfortunately they were all refused under the banner of “purposes of journalism, art or literature” which seems to be a catch all category for FOI requests marked “no”.

Google has 20 million results for the query “foi literature art journalism”; around 10 million of those would seem to relate in some way to the BBC.

The idealist in me would say that, for “the purposes of journalism”, in its noblest sense, and the greater good of society, the default position needs to flip from closed to open. The “purposes of journalism”, more than any other public service, should not be an escape hatch from open information.

And the public would benefit from “journalism as data” at least as much as from “data journalism”.

Photo credits

Packing Carsten’s weiner sausages on an assembly line, Tacoma, Washington by Washington University
www.flickr.com/photos/uw_digital_images/4670205658

Sausages at Wurstkuche by Sam Howzit
www.flickr.com/photos/12508217@N08/7281627216

Disruption

From the final few pages of Beneath the City Streets by Peter Laurie (1970) where he gets off the subject that the threat of global thermonuclear war might just be a plan to distract us and gets on to the subject of… transistors:

I am coming to believe that there is a much more serious threat to the technological way of life than the H-bomb. It is the transistor. Over the last two or three hundred years in the West we have followed a course of development that coupled increasingly powerful machines to small pieces of human brain to produce increasingly vast quantities of goods. The airliner, the ship, the typewriter, the lathe, the sewing machine, all employ a small part of the operator’s intelligence, and with it multiply his or her productivity a thousandfold.

As long as each machine needed a brain, it was profitable to make more brains and with them more profits. Industrial populations grew in all the advanced countries, and political systems became more liberal simply to get their cooperation.

But now we are beginning to find that we do not need the brains – at least not in the huge droves that we have them. Little by little [..] artificial intelligence is dispossessing hundreds of thousands and soon millions of workers. Because ‘the computer’ is seen only in large installations doing book-keeping, where it puts few out of work, this tendency goes on unnoticed. But in every job economics forces economies on management. Little gadgets here and there get rid of workers piecemeal. [..] Any job that can be specified in, say, a thousand rules, can now be automated by equipment that costs £200 or so. The microprocessor, which now costs in itself perhaps £20, [..] has not begun to be applied: over the next 10 to 15 years millions will be installed in industry, distribution, commerce. Machinery, which has almost denuded the land, will now denude cities.

Politically, this will split the population into two sharply divided groups: those who have intelligence or complicated manual skills that cannot be imitated by computers – a much larger group, who have not. In strict economic terms the second group will not be worth employing. They can do nothing that cannot be done cheaper by machinery. [..] The working population will be reduced to a relatively small core of technicians, artists, scientists, managers surrounded by a large, unemployed, dissatisfied and expensive mob. I would even argue that this process is much further advanced than it seems, and the political subterfuges necessary to keep it concealed, are responsible for the economic malaise of western nations [..] If one has to pay several million people who are in fact useless, this is bound to throw a strain on the economy and arouse the resentment of those who are useful, but who cannot be paid what they deserve for fear of arousing the envy of others.

If the unemployed can be kept down to a million or so in a country like Britain, the political problem they present can be contained by paying a generous dole [..] The real total of unemployed is hidden in business. What happens when automation advances further and the sham can no longer be kept up? [..] To cope with the millions of unemployed and unemployable people needs – in terms of crude power – greatly improved police and security services. [..] It suggests that the unemployed should be concentrated in small spaces where they can be controlled, de-educated, penned up.

Unless some drastic alteration occurs in economic and political thought, the developed nations are going to be faced in the next thirty years with the fact that the majority of their citizens are a dangerous, useless burden. One can see that when civil defence has moved everything useful out of the cities, there might be strong temptation on governments to solve the problem by nuclear war: the technological elite against the masses.

Now I’m no more of a fan of the technocratic, silicon valley, Ayn Rand fanboys than the next man on the street but even in my most paranoid moments I’d never suspected that when they’d done disrupting they might stagger out of a ted talk and h-bomb us all.

NoUI, informed consent and the internet of things

In more youthful days I spent a year studying HCI. I’m sure there was much more to it but unfortunately only three things have stuck in my mind:

  1. interactions should be consistent
  2. interactions should be predictable
  3. interactions should not come with unexpected side-effects

I half remember writing a dissertation which was mostly finding creative ways to keep rewriting the same three points until the requisite word count was reached.

I was thinking about this today whilst reading an assortment of pro and anti NoUI blog posts. I half agree with some of the points the NoUI camp are making and if they save us from designers putting a screen on everything and the internet fridge with an iPad strapped to the front I’d be happy. But mostly I agree with Timo Arnall’s No to NoUI post and his point that “as both users and designers of interface technology, we are disenfranchised by the concepts of invisibility and disappearance.”

This doesn’t really add much to that but some thoughts in no particular order:

  1. Too often chrome is confused with interface. There’s too much chrome and not enough interface.
  2. Even when something has a screen it doesn’t have to be an input / output interface. The screen can be dumb, the smarts can be elsewhere, the interface can be distributed to where it’s useful. The network takes care of that.
  3. An interface should be exactly as complex as the system it describes. The system is the interface. The design happens inside. I’m reminded of a quote from the Domain Driven Design book that, “the bones of the model should show through in the view”. An interface should be honest. It should put bone-structure before make-up (or lack thereof).
  4. The simplify, simplify, simplify mantra is all very good but only if you’re simplifying the systems and not just the interface. And some systems are hard to simplify because some things are just hard.
  5. No matter how much you think some side-effect of an interaction will please and “delight” your users if the side-effect is unexpected it’s bad design. You might want to save on interface complexity by using one button to both bookmark a thing and subscribe to its grouping but things are not the same as groups and bookmarks are not the same as subscriptions and conflating the two is just confusing. Because too little interface can be more confusing than too much.
  6. There seems to be a general belief in UX circles that removing friction is a good thing. Friction is good. Friction is important. Friction helps us to understand the limits of the systems we work with. Removing friction removes honesty and a good interface should be honest.
  7. Invisible interfaces with friction stripped out are the fast path to vendor lock-in. If you can’t see the sides of the system you can’t understand it, control it or leave because you don’t even know where it ends.
  8. If your goal is something like simplifying the “onboarding process” removing friction might well please your paymasters but it doesn’t make for an honest interface. Too much UX serves corporate goals; not enough serves people.
  9. Decanting friction out of the interface and turning it into a checkbox on the terms and conditions page is not a good interface.
  10. In the media world in particular there’s a belief that if you could just strip out the navigation then by the intercession of magic and pink fluffy unicorns “content will come to you”. Which is usually accompanied by words like intuitive systems. Which seems to miss the point that the thing at the other end of the phone line is a machine in a data centre. It is not about to play the Jeeves to your Bertie Wooster. It does not have intuition. What it probably has is a shedload of your usage data matched to your payment data matched to some market demographic data matched to all the same for every other user in the system. For the majority of organisations the internet / web has always been more interesting as a backchannel than as a distribution platform. They’d happily forego the benefits of an open generative web if only they could get better data on what you like.
  11. If and when we move away from an internet of screens to an “internet of things” the opportunities for sensor network and corporate-state surveillance multiply. Everything becomes a back-channel, everything phones home. With interface and friction removed there’s not only no way to control this, there’s no way to see it. Think about the data that seeps out of your front room: the iPad, the Kindle, the Samsung telly, the Sky box, the Netflix app, YouView, iPlayer, XBox, Spotify. And god only knows where it goes past there.
  12. Informed consent is the only interesting design challenge. With no interface informed consent is just another tick box on the set-up screen. Or a signature on a sales contract.
  13. The fact that we’ve not only never solved but deliberately sidelined informed consent in a world with interfaces doesn’t bode well for a world without.

More thoughts on open music data

Occasioned by someone describing media catalogue type data as the “crown jewels”. It is not the crown jewels. It is, at best, a poster pointing out the attractions of the Tower of London.

If any data deserves the description of crown jewels it’s your customer relationship data.

But since Amazon, Apple, Facebook and Google probably know more about your users / customers / audience / fan base than you do, you’ve probably already accidentally outsourced that anyway…

Longer thoughts over here

Events, causation, articles, reports, stories, repetition, insinuation, supposition and journalism as data

In a conversation with various folks around ontologies for news I went a bit tangential and tried to model news as I thought it should be rather than how it is. Which was probably not helpful. And left me with a bee in my bonnet. So…

Some events in the life of Chris Huhne

  1. In March 2003 he was clocked speeding somewhere in Essex. Already having 9 points on his licence a conviction would have seen him banned from driving so…
  2. …shortly after his then wife, Vicky Pryce, was declared to have been driving at the time of the speeding incident
  3. 16 days after the speeding incident he was caught again for using a mobile phone whilst driving and banned anyway
  4. In May 2005 he was elected to Parliament as the representative for Eastleigh
  5. Also in May 2005 Ms Pryce told a friend that Mr Huhne had named her as the driver without her consent
  6. Between October and December 2007 he stood for leadership of the Lib Dems
  7. At some point (I can’t track down) he began an affair with his aide Carina Trimingham
  8. In June 2010 he was clocked again, this time by the press emerging after spending the night at Ms Trimingham’s home
  9. A week later Ms Pryce filed for divorce
  10. In May 2011 The Sunday Times printed allegations that Mr Huhne had persuaded someone to pick up his driving points
  11. In the same month Labour MP Simon Danczuk made a formal complaint about the allegation to the police
  12. At some point after this there was a series of text messages between Mr Huhne and his son where his son accused him of lying and setting up Ms Pryce
  13. In February 2012 both Mr Huhne and Ms Pryce were charged with perverting the course of justice
  14. In June 2012 Mr Huhne and Ms Pryce announced they’d plead not guilty with Ms Pryce claiming Mr Huhne had coerced her into taking his penalty points
  15. In February 2013 the trial began and on the first day Mr Huhne changed his plea to guilty. He also resigned his parliamentary seat
  16. The trial of Ms Pryce continued. And collapsed when the jury failed to agree. Shortly after, a second trial found her guilty
  17. In late February the by-election resulting from the resignation of Mr Huhne took place
  18. And in March 2013 they were both sentenced to 8 months in prison

Some of the events went on to become part of other storylines. For a brief while Mr Huhne’s driving ban for using a mobile phone at the wheel became part of a “Government makes a million a month from drivers using mobiles” story (at least for the Daily Mail), the collapse of the first trial of Ms Pryce became a story about failures in the trial by jury system and the result of the by-election became part of a story about the rise of minority parties in austerity-hit Europe.

Anyway this list of events is as partial as any other. Many more things happened (in public and in private) and some of the events listed were really lots of little events bundled up into something bigger. But that’s the trouble with events: they quickly go fractal because everything is one. As Dan said, “it’s good to think about events but it’s good to stop thinking about them too.” I’m not quite there yet.

Anyway, boiling things down further to fit in a picture:

Causation and influence

For every event there’s a fairly obvious model with times, locations, people, organisations, factors and products. And (mostly) the facts expressed around events are agreed on across journalistic outlets.

The more interesting part (for me) is the dependencies and correlations that exist between events because why is always the most interesting question and because the most interesting answer. Getting the Daily Mail and The Guardian to agree that austerity is happening is relatively easy, getting them to agree on why, and on that basis what should happen next, much more difficult.

The same picture this time with arrows. The arrows aren’t meant to represent “causality”; the fact that Mr Huhne was elected did not cause him to resign. But without him being elected he couldn’t have resigned so there’s some connection there. Let’s say “influence”:
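
In data terms influence is just a directed edge between two events; deliberately weaker than causation. A sketch building on the event shape above (the identifiers are made up):

// Influence as an edge: "without A, B couldn't have happened",
// which is weaker than "A caused B".
interface Influence {
  from: string; // id of the earlier event
  to: string;   // id of the later event
}

const influences: Influence[] = [
  { from: "event/elected-eastleigh-2005", to: "event/resigns-seat-2013" },
];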

Articles, reports and stories

The simplest model for news would bump most of the assertions (who, where, when etc.) to the events and hang articles off them, stitched together with predicates like depicts, reports or analyses. But whilst news organisations make great claims around reports and breaking news, journalists don’t talk about writing articles and rarely talk about writing reports. Journalists write stories, usually starting from a report about an event but filling in background events and surmising possible future events.

So an article written around the time of Mr Huhne’s resignation would look less like this:

and more like this:
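
Spelled out as data, the difference between the two shapes is roughly the sketch below. The predicate names are lifted from the paragraph above; the identifiers are made up:

// How an article relates to an event.
type Predicate = "reports" | "depicts" | "analyses" | "surmises";

interface ArticleEventLink {
  article: string;
  predicate: Predicate;
  event: string;
}

// The "simple" model: one article hung off the one event it reports.
const asReport: ArticleEventLink[] = [
  { article: "article/123", predicate: "reports", event: "event/resigns-seat-2013" },
];

// The story shape: the same article stitched to the reported event,
// to background events and to surmised future events.
const asStory: ArticleEventLink[] = [
  { article: "article/123", predicate: "reports",  event: "event/resigns-seat-2013" },
  { article: "article/123", predicate: "depicts",  event: "event/speeding-2003" },
  { article: "article/123", predicate: "analyses", event: "event/charged-2012" },
  { article: "article/123", predicate: "surmises", event: "event/by-election-2013" },
];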

Repetition, insinuation and supposition

The average piece of journalism is 10% reporting new facts and 90% repetition, insinuation and supposition, where correlation and causation between events are never made explicit. Events from the storyline are hand-picked and stitched together with a thin thread of causality. Often it’s enough to just mention two events in close proximity for the connections between them to be implied. The events you choose to mention and the order you mention them in give the first layer of editorial spin.

And the claims you choose to make about an event and its actors are the second layer. If there’s a female involved and she’s under 35 it’s usually best to mention her hair colour. “Bisexual” scores triple points. We know what we’re meant to think.

The Daily Mail took insinuation to new heights with the collapse of Ms Pryce’s first trial, printing a “story” about the ethnic make-up of the jury which told its readers:

Of the eight women and four men on the Vicky Pryce jury, only two were white – the rest appeared to be of Afro-Caribbean or Asian origin.

The point they were trying to make, and how the appointment of a jury of a certain skin colour might have led to the collapse of the trial, were left as an exercise for the reader.

Sports journalism seems particularly attracted to insinuation and supposition. Maybe it’s because its events (and sometimes even the outcomes of those events) are more predictable than in most other news whilst the actual facts are mainly locked inside dressing rooms and boardrooms. But Rafa Benitez getting slightly stroppy in a news conference turned into “Rafa out by the weekend, Grant to take over until the end of the season and Jose to return” headlines by the next day. None of which turned out to be true. Yet.

As Paul pointed out, the article as repetition of storyline and makeshift crystal ball wasn’t always the norm. In the past newspapers printed many small reports per page. The image below isn’t the best example but it was the best I could find without rights restrictions:

Photo via Boston Public Library cc-by-nc-nd

Neither of us knew enough about newspaper history to know when or why this changed. Presumably there are good business reasons why articles stopped being reports and started being stories. We guessed it might have been falling paper and printing prices, meaning more space to fill, but without evidence that’s just insinuation too.

To an outside observer the constant re-writing of “background” seems tedious to consume and wasteful to produce, especially where the web gives us better tools for managing updates, corrections and clarifications. Maybe it’s because most news websites are a by-product of print production, where articles are still commissioned, written and edited to fill a certain size on a piece of paper and are just re-used on digital platforms. But even news websites with no print edition follow the same pattern. Maybe it’s partly an SEO thing, with journalists and editors trying to cram as many keywords into a news story as possible, but surely one article per storyline with frequent updates would pick up more inbound links over time than publishing a new article every time there’s a “development”? It seems to work for Wikipedia. (Although that said, Google News search seems to reward the publishing of new articles over the updating of existing ones.) Or maybe it’s all just unintentional. Someone at the meeting (I forget who) mentioned “lack of institutional memory” as one possible cause of constant re-writing.

But in a “do what you do best and link to the rest” sense, constantly rewriting the same things doesn’t make sense unless what you do best is repetition.

An aside on television

Television producers seem to feel the same pull toward repetition: this is what we’re about to show you, this is us showing it, this is what we’ve just shown you. I have a secret addiction to block viewing (I think the industry term is binge viewing) episodes of Michael Portillo’s Great British Railway Journeys but for every 30-minute episode there’s 10 minutes of filler and 20 minutes of new “content”.

Interestingly the Netflix-commissioned series assume binge viewing as a general pattern so have dropped the continuity filler and characterisation repetition and get straight into the meat of the story. Nothing similar seems to be happening with news yet but I’m an old-fashioned McLuhanist and believe the medium and the message are inextricably tied so maybe one day…

Journalism as data

Over the last couple of years there’s been much talk of data journalism, which usually involves scanning through spreadsheets for gotcha moments and hence stories. It’s all good and it all helps to make other institutions more transparent and accountable. But journalism itself is still opaque. I’m more interested in journalism as data, not because I want to fetishise data but because I think it’s important for society that journalists make explicit their claims of causation. You can fact check when and where and who and what but you can’t fact check why, because you can’t fact check insinuation and supposition. At the risk of using wonk-words, “evidence-based journalism” feels like a good thing to aspire to.

I’m not terribly hopeful that this will ever happen. If forced to be explicit quite a lot of journalism would collapse under its own contradictions. In the meantime I think online journalism would be better served by an article per storyline (rather than development), an easily accessible edit history and clearly marked updates. I’m not suggesting most news sites would be more efficiently run as a minimal wiki, pushing updates via a microblog-of-your-choice. But given the fact that if you want to piece together the story of Mr Huhne you’ll have more luck going to Wikipedia than bouncing around news sites and news articles… maybe I am.

Thoughts on open music data

Yesterday I wore my MusicBrainz hat (or at least moth-eaten t-shirt) to the music4point5 event. It was an interesting event, but with so many people from so many bits of the music industry attending I thought some of the conversation was at cross-purposes. So this is my attempt at describing open data for music.

What is (are, if you must) the data?

The first speaker on the schedule was Gavin Starks from the Open Data Institute. He gave a good talk around some of the benefits of open data on the web and was looking for case studies from the music industry. He also made the point that, “personal data is not open data” (not an exact quote but hopefully close enough).

After that I think the “personal data” point got a bit lost. Data in general got clumped together as a homogeneous lump of stuff and it was difficult to pick apart arguments without some agreement on terms. It felt like there was a missing session identifying some of the types of data we might be talking about. Someone tried to make a qualitative distinction between data as facts and data as other stuff but I didn’t quite follow that. So this is my attempt…

In any “content” business (music, TV, radio, books, newspapers) there are four layers of data:

  1. The core business graph. Contracts, payments, correspondence, financial reports
  2. The content graph. Or the stuff we used to call metadata (but slightly expanded). For music this might be works, events, performances, recordings, tracks, releases, labels, sessions, recording studios, cover art, licensing, download / streaming availabilities etc. Basically anything which might be used to describe the things you want to sell.
  3. The interest / attention graph. The bits where punters express interest toward your wares. Event attendance, favourites, playlists, purchases, listens etc.
  4. The social graph. Who those punters are, who they know, who they trust.

I don’t think anyone calling for open music data was in any way calling for the opening of 1, 3 or 4 (although obviously aggregate data is interesting). All of those touch on personal data and as Gavin made clear, personal data is not open data. There’s probably some fuzzy line between 1 and 2 where there’s non-personal business data which might be of interest to punters and might help to shift “product” but for convenience I’m leaving that out of my picture:


Given that different bits of the music industry have exposure to (and business interests in) different bits of these graphs they all seemed to have a different take on what data was being talked about and what opening that data might mean. I’m sure all of these people are exploring data from other sources to improve the services they offer, but plotting the more traditional interests on a Venn diagram:

So lack of agreement on terms made conversation difficult. Sticking to the content graph side of things, I can’t think of any reasonable argument why it shouldn’t be open, free, libre etc. It’s the Argos catalogue of data (with more details and links); it describes the things you have for sale. Why wouldn’t you want the world to know that? I don’t think anyone in the room disagreed but it was hard to say for sure…
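
To make “content graph” a little more concrete, here’s a thin slice of one sketched in code. The entity names are borrowed from the list above; the field shapes are my assumptions:

// A thin, illustrative slice of a music content graph.
interface Work      { id: string; title: string; composers: string[] }
interface Recording { id: string; workId: string; performers: string[] }
interface Track     { id: string; recordingId: string; position: number }
interface Release   { id: string; title: string; labelId: string; trackIds: string[] }
interface Availability {
  releaseId: string;
  territory: string;                           // e.g. "GB"
  channel: "download" | "stream" | "physical";
}

None of which is personal data; it’s all catalogue.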

Data portability

The social and interest / attention graphs are a different breed of fish. Outside the aggregate they’re where personal data and personal expression live. Depending on who you choose to believe, that data either belongs to the organisation who harvested it or the person who created it. I’m firmly in the latter camp. As a consumer I want to be able to take my Last.fm interest data and give it to Spotify or my Spotify data to Amazon or my Amazon data to Apple or my Apple data to Last.fm. In the unlikely event I ever ran a startup I’d also want that because otherwise my potential customers are locked in to other services and are unlikely to move to mine. If I were an “established player” I’d probably feel differently. Anyway, data portability is important but it’s not “open data” and shouldn’t be confused with it.

Crossing the content to social divide

Many things in the content graph have a presence in the social graph. Any music brand, whether it’s an artist, a label or a venue, is likely to have a Twitter account or a Facebook account or etc. So sometimes the person-to-interest-to-content graph is entirely contained in the social graph. Social media is often seen as a marketing channel but it’s a whole chain of useful data from punters to “product”. Which is why it puzzles me when organisations set up social media accounts for things they’ve never minted a URI for on their own website (it’s either important or it’s not) and with no real plan for how to harvest the attention data back into their own business. “Single customer view” includes people out there too.

Data views, APIs and API control

Just down the bill from Gavin were two speakers from Last.fm. They spoke about how they’d built the business and what they plan to do next. In the context of open data (or not) that meant reviewing their API usage and moving toward a more “industry standard” approach to API management. Twitter was mentioned alongside the words “best practice”.

Throughout the afternoon there was lots of talk about a “controlled open” approach; open but not quite. Occasionally around licensing terms but more often about API management and restrictions. It’s another subject I find difficult as more and more structured data finds its way out of APIs and into webpages via RDFa and schema.org. In the past, the worlds of API development and Search Engine Optimisation haven’t been close bedfellows but they’re heading toward being the same thing. And there’s no point having your developers lock down API views when your SEO consultants are advising you to add RDFa all over your web pages and your social media consultants are advising you to add OpenGraph. But it all depends on the type of data you’re exposing, why you’re exposing it and who you want to expose it to. If you’re reliant on Google or Facebook for traffic you’re going to end up exposing some of your data somehow. The risk either way is accidentally outsourcing your business.
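
To illustrate the point: the catalogue facts an API might lock down can end up published in your pages anyway. A sketch using schema.org terms, expressed as JSON-LD purely because it’s the tersest way to show the idea in code (RDFa and OpenGraph carry equivalent data):

// Catalogue facts as schema.org structured data, ready for embedding
// in a web page. The recording chosen is illustrative.
const recording = {
  "@context": "https://schema.org",
  "@type": "MusicRecording",
  name: "Drive My Car",
  byArtist: { "@type": "MusicGroup", name: "The Beatles" },
  inAlbum: { "@type": "MusicAlbum", name: "Rubber Soul" },
};

const markup =
  `<script type="application/ld+json">${JSON.stringify(recording)}</script>`;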

MusicBrainz

Robert from MusicBrainz appeared at the conference via a slightly glitchy Skype link. He spoke about how MusicBrainz came into being, what its goals are and how it became a profit-making non-profit. He also said the most important thing MusicBrainz has is not its data or its code or its servers but its community. I’ve heard this said several times but it tends to be treated like an Oscar starlet thanking her second grip.

From all the dealings with open data I’ve ever had, I can’t stress enough how wrong this reaction is. The big open data initiatives (Wiki/DBpedia, MusicBrainz, GeoNames, OpenStreetMap) are not community “generated”. They are not a source of free labour. They are community governed, community led and community policed. If your business adopts open data then you’re not dealing with a Robert-like figure; you’re dealing with a community. If you hit a snag then your business development people can’t talk to their business development people and bang out a deal. And the usual maxim of not approaching people with a solution but with an explanation of the problem you want to solve is doubly true for community projects, because the chances are they’ve already thought about similar problems.

Dealing with open data means you’re also dealing with dependencies on the communities. If the community loses interest or gets demoralised or moves on then the open data well dries up. Or goes stale. And stale data is pretty useless unless you’re an historian.

So open data is not a free tap. If you expect something for nothing then you might well be disappointed. The least you need to give back is an understanding of and an interest in the community and the community norms. You need to understand how they operate, where their interests lie and how their rules are codified and acted on. And be polite and live by those rules because you’re not a client; you’re a guest. You wouldn’t do a business deal without checking the health of the organisation. Don’t adopt community data without checking the health of the community. Maybe spend a little of the money you might have spent on a biz dev person on a “community liaison officer”.

Question and answer

At the end of Robert’s talk I had to get up and answer questions. There was only one, which was something like, “would you describe MusicBrainz as disruptive?” I had no idea what that meant so I didn’t really answer. As ever with question sessions there was a question I’d rather have answered because I think it’s more interesting: why should music industry people be interested in and adopt MusicBrainz? Answers anyway:

  1. Because it has stable identifiers for things. In an industry that’s only just realising the value of this, it’s not nothing.
  2. Because those identifiers are HTTP URIs which you can put in a browser or a line of code and get back data. This is useful (there’s a sketch of this just after the list).
  3. Because it’s open and with the right agreements you can use it to open your data and make APIs without accidentally giving away someone else’s business model.
  4. Because it links. If you have a MusicBrainz identifier you can get to artist websites, Twitter accounts, Facebook pages, Wikipedia, Discogs, YouTube and shortly Spotify / other streaming services of your choice. No data is an island and the value is at the joins.
  5. Because it’s used by other music services from Last.fm to the BBC. Which means you can talk to their APIs without having to jump through identifier translation hoops.
  6. Because, whilst it’s pretty damn big, size isn’t everything and it’s rather shapely too. The value of data is too easily separated from the shape of the model it lives in. Lots of commercial music data suppliers model saleable items because that’s where the money lives. MusicBrainz models music which means it models the relationships between things your potential customers care about. So not just artists and bands but band memberships. And not just Rubber Soul the UK LP and the Japanese CD and the US remastered CD but Rubber Soul the cultural artefact. Which is an important hook in the interest graph when normal people don’t say, “I like the double CD remastered rerelease with the extra track and the tacky badge.”
  7. Because its coverage is deep and wide. There are communities within communities and niches of music I never knew existed have data in MusicBrainz.
  8. Because the edit cycle is almost immediate. If you spot missing data in MusicBrainz you can add it now. And you’re a part of the community.
  9. Because the community is engaged and doing this because it cares, it polices itself.
  10. Because Google’s Knowledge Graph is based on Freebase and Freebase takes data from MusicBrainz. If you want to optimise for the search engines, stop messing about with h1s and put your data in MusicBrainz.
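
To make point 2 concrete, here’s the smallest possible sketch against version 2 of the MusicBrainz web service. The MBID is The Beatles’ (purely illustrative); fmt=json asks for JSON and inc=url-rels pulls in the outward links:

// Resolve a MusicBrainz identifier to data (and onward links).
const mbid = "b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d"; // The Beatles

async function lookupArtist(mbid: string) {
  const res = await fetch(
    `https://musicbrainz.org/ws/2/artist/${mbid}?inc=url-rels&fmt=json`,
    // MusicBrainz asks API clients to send a meaningful User-Agent
    { headers: { "User-Agent": "example-app/0.1 (you@example.com)" } }
  );
  if (!res.ok) throw new Error(`MusicBrainz lookup failed: ${res.status}`);
  return res.json();
}

// artist.name, artist.relations (the outward links) and so on.
lookupArtist(mbid).then(artist => console.log(artist.name, artist.relations));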

So if any record label or agent or publisher or delivery service ever asked me what the smallest useful change to the data they store might be, I’d say just store MusicBrainz identifiers against your records. Even if you’re not yet using open data, one day they’ll be useful. Stable identifiers are the gateway drug to linked data. And I’d advise any record label large or small to spend a small portion of the money they might have spent building bespoke websites and maintaining social media accounts, on adding their data to MusicBrainz. Everybody benefits, most of all your consumers.

ps If you’re an indie artist, Tom Robinson wrote a great guide to getting started with MusicBrainz here.

Dumb TVs

Following on from this year’s CES there’s been lots of talk about bigger, better, sharper, smarter TVs. As ever, conversation around gadgets tends to get caught up with conversations around business models, which tends to lead to breathless commentary on OTT vs traditional broadcast and whether smart TVs will render traditional broadcasters as obsolete as Blockbuster, HMV and Jessops. But this is only tangentially about that.

Rumbling away in the background is the usual speculation around Apple’s plans to “revolutionise” the TV “experience” and whether they’re planning to do the same to the TV industry as they did to the music industry (content deals permitting). In among the chatter there seems to be an assumption from some commentators that Apple’s plans for TV revolve around how Apple TV might improve the on-screen interface and controls, possibly replacing the EPG with an App Store style interface. There’s a tendency amongst media futurologists to predict the future by extrapolating from the past; therefore televisions will follow the same fat-client route as phones and already complicated TV interfaces will become more complicated still.

But to my mind this doesn’t make sense. Apple already own the content discovery route via their iDevices, they own the content acquisition route via iTunes and they own the play-out route via AirPlay. Why do they need to invent fat-client TV sets when they’ve already put fat-client laptops, tablets and phones into the hands of their customers? The App Store model might just about work when it’s in your hand / on your lap. But placing the same interaction model 10 feet away just doesn’t offer the affordances you need to discover, purchase and play programmes. From an accessibility angle alone, making potential customers interact from 10 feet away when you’ve already given them a better option seems like a painful redundancy.

How “smart” do TVs need to be?

In more general terms I think there’s a problem with the definition of a “smart” TV and the interfaces envisaged. If TVs are web connected why do they need to be smart? Some arguments why not:

  1. Upgrade cycles for TVs and radios (and most other household white goods) are too slow to build in smartness. Build in too much and the smarts go obsolete before the primary function of the device.
  2. For any connected device smartness belongs in the network. This is why we connect them. If there are existing discovery and distribution channels and backchannels, then all a TV needs to do is accept instructions from the network; a connected (but dumb) screen.
  3. 10 feet away is no place for an interface. And just because a device has a screen doesn’t mean it has to be an input. As TV functionality becomes ever smarter and more complicated, the remote control grows to fit the demands and we end up with something almost resembling a keyboard on the arm of the sofa, while a much better, much more accessible phone or pad or laptop (or any point in between) sits redundant alongside.
  4. The App Store / Smart TV model presupposes the existence of apps. But making native apps is expensive and the more platforms you have to provide for the more expensive it gets. A dumb TV only needs to accept instructions and play-out media.
  5. TV screens tend to be shared devices, and authentication, personalisation and privacy concerns are hard on a shared device. Hard from an implementation point of view and hard from a user comfort point of view. There’s a spectrum from TV screen to desktop PC to laptop to tablet to phone and the further down that list you travel the less shared / more personal the device feels and the more comfortable users feel with authentication. Dumb TVs move authentication to where it makes sense.
  6. Smart TVs open up the possibility of device manufacturers finding a new role as content gatekeepers. Having control of both the interface and the backchannel data allows them to control the prominence of content. This is a particular problem for public service broadcasters. By the time your smart TV is plugged into your set top box and your assortment of games consoles, the front room TV can acquire a stack of half a dozen gatekeepers. Just keeping track of which one is currently active and which one you need to control is confusing.
  7. Media people like to talk about TV as a “lean back” medium. This is pure conjecture but it’s possible that separating the input interface from the play-out leads to this more “lean back” experience…

How dumb is dumb?

From conversations around Dumb TVs there seem to be two main options: the dumb but programmable TV and the dumber-than-Cletus TV.

Programmable TVs

Modern TV sets don’t live alone. There are ancillary devices like PVRs which sit alongside the TV box. TVs don’t need to be programmable but PVRs do. The big question is where you want to programme your PVR from. If it’s same room / same local area network then there’s no need for any additional smartness or authentication: if it’s on the same network you can control it. If you want to programme your PVR from the top deck of the bus, that’s somewhat harder. Somewhere you need a server to mediate your actions, and given the need for a server there’s a need for authentication. But…

…how do PVRs as discrete devices make sense in a connected world? If 3 million people choose to record an episode of Doctor Who that’s a hell of a lot of redundant storage. And a hell of a lot of redundant power usage. Over time PVR functionality will move to “the cloud” (the legality of loopholes notwithstanding), your mobile will programme it, discover content there and push that content to your TV screen. With no need for TV programmability.

Dumb, dumb, dumb

So what’s the very simplest thing with the least build and integration costs? Something which allows you to push and control media from a fat client to a dumb TV. DIAL promises to do something similar but seems to assume a native app at each end; the simplest thing is probably two browsers.

So somehow devices on a local area network need to be able to advertise the functionality they offer. There’s a web intents connection here but I’m not quite sure what it is. Once your laptop / tablet / phone knows there’s a device on the network which can play audio / video it needs to make that known to the browser. So there needs to be some kind of browser API standardisation allowing for the insertion of “play over there” buttons. And the ability to push a content location with play, pause, stop and volume control notifications from the browser on the fat client to the browser on the dumb TV. Which might be something like WebRTC. Given the paywalls and geo-restrictions which accompany much of the online TV and movie business there’d probably need to be some kind of authentication / permission token passed. But that’s all dumb but connected would involve.
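
For what it’s worth, the wire format for all of that could be tiny. A sketch of the control messages from the fat client’s side; the message shapes, the WebSocket transport and the device address are all made up for illustration, none of this is a standard:

// "Dumb but connected": push a content location plus simple transport
// controls from the fat client's browser to the screen.
type ControlMessage =
  | { kind: "play"; url: string; token?: string } // token: auth / rights pass
  | { kind: "pause" }
  | { kind: "stop" }
  | { kind: "volume"; level: number };            // 0.0 to 1.0

const tv = new WebSocket("ws://dumb-tv.local:8080"); // hypothetical address

function send(msg: ControlMessage) {
  tv.send(JSON.stringify(msg));
}

tv.onopen = () => {
  send({ kind: "play", url: "https://example.org/episode.mp4" });
  send({ kind: "volume", level: 0.5 });
};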