My PVR is "in the cloud"

in The Cloud
on somebody else's server

Because I'm lazy and can't be arsed leafing through tables of features I get my broadband off BT. And they throw in a BT Vision box. Which is basically Freeview with a hard disc to record to and an internet connection. It's ok. Mostly it works. About once a month the fan gets noisy and it panics and falls on its face and you need to turn it off and on again. Or, if it's a proper panic attack, turn it off, hold down the reset button and turn it on again.

But every so often the box has a complete breakdown and no amount of off / on / off / on / reset makes it feel better. A while back my BT Vision box went a bit hysterical and broke. So they sent round an engineer with a new one.

Obviously I lost all my recorded programmes (even the on / off / reset achieves that aim). Probably no great loss; some Midsomer Murders, some Gardeners' World, some Great British Railway Journeys and my daughter's carefully curated collection of M.I. Highs. But I also figured I'd lost all my instructions to record.

Last Thursday morning I had a bit of a panic because I thought the new series of Gardeners' World wouldn't be recorded. So I clicked through several pages of the god-awful EPG, found Gardeners' World and found it was already set up to record.

Because, I assume, the box phones home and record instructions are stored "in the cloud" and the whole thing is re-synched periodically. Which made me ponder three things:

  1. What else is the box recording and phoning home? What I watch? What I record? What I record and watch? What I record and fail to watch?
  2. How is that data used and by whom? Which reminded me of this vintage article from Wired on Sky's plans to serve personalised adverts based on TV attention data and my oft quoted quote from Clive Humby (emphasis mine):

    If I knew your whole transaction profile - restaurants, travel, fashion - that could be immensely powerful. You'd need a consent-based model, but you'd understand every aspect of a person's life. The credit-card data tells you how they live generally, the supermarket data tells you their motivations, the media data tells you how to talk to them. If you have those three things, you're in marketing nirvana.

  3. Who owns that data and how else could it be used? If it's mine then why shouldn't I be able to port it out and offer it to Sky in the hope of a money off deal and a more reliable box? Why shouldn't I be able to port it into Programme List and get broadcast reminders and links to VOD services? Or take it to Amazon and get DVD recommendations. Or go the other way and take my Amazon data and get programme recommendations? There's a lot of personal data floating around but it's all locked into proprietary systems and outside my control.

I'm not, despite appearances, a privacy zealot. I don't think absolute privacy is possible or desirable. From supermarket loyalty cards to Oyster cards to Facebook every day we trade some privacy for some convenience. My problem is when the terms of trade are so obfuscated that it's not possible to weigh what we gain against what we lose. In the BT Vision case if I'd been offered the option of exporting my record instructions off the box I might well have clicked yes. (I'd have been even more tempted to click yes if my recordings were stored off the box but that would probably break 10,000 copyright agreements.) But I wasn't given the option and knowing that data is out there but outside my reach is just frustrating.

I don't think I'm alone in this. Some recent research by the NoTube project found that most people were uncomfortable about their online TV viewing being recorded when they hadn't consented and couldn't control what happens to the data.

There's been a long running debate about privacy vs "publicness" which mostly seems to miss the point. The point being informed consent. Those on the "publicness" side tend to say that organisations harvesting user data is a price worth paying for "free" access to "open" publishing tools. And ignore that the data being harvested disappears into proprietary systems where it's impossible for users to extricate it or correct it. The exact opposite of openness. It's bad for consumers because they get locked into a single system. And it's bad for competitiveness because new businesses can't hope to compete with established players who monopolise the interest graphs. Wherever you see customer relationship management or user relationship management it can almost always be read as "lock-in".

Most people are used to the idea that when they use the web (in the browser open, clicking links sense) their actions are being reported and recorded. The fact that we seem no closer to solving informed consent and data portability on the web doesn't bode well for when our white goods start phoning home.

A rambling post about RiscOS. Ish

As a kid I never had a computer. In those pre-web days the main reason for owning one seemed to be the playing of games and I never really liked games. Still don't.

The first computer I ever met was a university mainframe. You'd spend a morning with a map chart and something that looked like an air hockey puck, tracing lines from red shifting stars or particles being accelerated or collided or some such. Then take the results and write some convoluted Fortran programme which would (hopefully) change the numbers you entered into some different set of numbers.

Except not before the code had been "compiled". I don't think anyone ever explained what "compilation" entailed. You spent several minutes copying your Fortran to a huge floppy disc, handed it over to a sullen lab assistant and they went away and did the water into wine thing. Which for more reasons unexplained took most of the afternoon. Since any time spent in the Blackett Laboratory was demoralising and depressing, compilation time was pub time.

My second meeting with a computer was a Mac Classic that sat in the corner of the same lab. You'd stagger back from the pub and (if you'd struck lucky and your code had compiled) take the new numbers it had made and type them into a spreadsheet type programme on the Mac and get back a pretty graph. I remember wondering why that Fortran computer thing couldn't just work like this Mac computer thing. It had the distinct advantage of working even when you were drunk; the mainframe thing barely worked if you were sober.

After university I went into the then thriving (cough) CD-ROM trade. It was mostly Mac "Power PCs" with a few Windows machines scattered about. But they felt like a different breed than the old Mac Classic. The CD-ROM industry inherited the Adobe bloatware of the 1980s desktop publishing industry that we're still stuck with today (Photoshop and Illustrator). And added a few of its own (Director and Authorware). The minute you started up one of these applications your Mac stopped being a general purpose computer. The chosen application ate all your memory for several hours before eventually overheating the machine and crashing. These were applications as operating systems; for the duration of use nothing else was possible.

After a little while I moved to glorious Norwich to make educational CD-ROMs. At that time most schools had graduated from the BBC Micro but kept faith with Acorn and invested in shiny new RISC PCs running RISC OS. So alongside the PCs and Macs we had RISC PCs. Which were desperately unfashionable. The only people who ever bought RISC PCs were schools so any RISC OS conference was entirely populated by geography teacher retreads with beards and elbow patches. In the room next door was Paul Mison (who kicked off this vague reminisce) and the first web people I ever met. They were Mac users to a (wo)man; our RISC PCs and CD-ROM burners were a badge of digital shame.

But secretly I quite liked my RISC PC. It was cute and simple and transparent. You could look inside it and tinker with it and work out what it was trying to do. It was far removed from the shrink-sealed product of the modern Apple machine. And it didn't have dongles and bloatware. Instead of a packaged application like Photoshop to meet all your photo editing needs it had an app for resizing photos, an app for recolouring, an app for cropping...

In some ways these were a bit like the modern app of the twonkPhone or twonkPad: single purpose applications designed to do one thing (fairly) well. But unlike app store apps they co-operated, they talked to one another. As my friend Tom might say, they were generative. You could drag and drop the output from one app as the input for another. You could write simple little scripts that chained together apps as a mini process. And because each component co-operated you never got stuck with vendor lock-in or forced upgrades. You could just grab a different app for the same purpose off a floppy disc and swap them in and out at will.
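For anyone who never saw it, the chaining idea looks roughly like this in Python. It's only a sketch: the three "apps" are made-up functions working on a made-up dictionary that stands in for an image, and nothing here is real RISC OS scripting.

```python
from functools import reduce


# Three hypothetical single-purpose "apps". Each takes an image
# description and returns a new one, leaving the input untouched.
def crop(image, box):
    left, top, right, bottom = box
    return {**image, "size": (right - left, bottom - top)}


def resize(image, width, height):
    return {**image, "size": (width, height)}


def recolour(image, palette):
    return {**image, "palette": palette}


def chain(*steps):
    """Join steps so the output of one becomes the input of the next."""
    return lambda image: reduce(lambda current, step: step(current), steps, image)


# A mini process built from the small pieces.
thumbnail = chain(
    lambda img: crop(img, (0, 0, 400, 400)),
    lambda img: resize(img, 100, 100),
    lambda img: recolour(img, "greyscale"),
)

result = thumbnail({"name": "photo", "size": (640, 480), "palette": "colour"})
```

Swapping in a different resizer means changing one line of the chain, which is about as close as the sketch gets to grabbing a different app off a floppy disc.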

Which reminded me for some reason...

...of a pub chat with Mo and Faith about APIs and open data. In my head at least there is a connection here...

Most businesses of any size will have data spread across multiple systems (staff details, product inventories, room bookings, finances). At some point there's a realisation that scattering knowledge is inefficient and there'd be more value if the multitude of different systems co-operated and exchanged information. For lots of organisations that path leads inevitably to the large scale enterprise architecture dreams of one consolidated system to rule them all. So the usual coterie of consultants march in to design the uber-system. It's the mainframe mentality that never quite died out. (I half remember another company I used to work for had seven systems, all called The Global Platform).

Eventually the uber system gets built and all the data from the legacy systems gets pumped in. And it fails. Because the data isn't quite the right shape and different systems use different identifiers and some systems use different fields to mean different things at different points in time... But mostly the enterprise architect dreams die because they ignore the real problem until it's too late. Designing data cathedrals is (relatively) easy; data and identifier consolidation is the hard part. And if you can solve identifier consolidation in the first place there's no need to ever build the cathedral. Which is why god gave us URIs, HTTP and APIs. And why...
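To make the identifier point a bit more concrete, here's a minimal sketch in Python. The two systems, their field names and the example.org URIs are all invented for illustration. The only idea taken from the argument above is that each local record gets mapped to one shared URI, and the data stays where it already lives.

```python
# Hypothetical records from two legacy systems describing the same person
# under different local identifiers and different field names.
hr_system = {"E1001": {"name": "Alice Example", "email": "alice@example.org"}}
room_bookings = {"alice.e": {"booked_by_email": "alice@example.org", "room": "2.14"}}

SOURCES = {"hr": hr_system, "rooms": room_bookings}

# The consolidation step: (system, local id) -> one shared URI.
# Building this table is the hard part in real life; here it's hand-written.
SAME_AS = {
    ("hr", "E1001"): "https://example.org/people/alice",
    ("rooms", "alice.e"): "https://example.org/people/alice",
}


def describe(uri):
    """Gather what each system knows about one URI without copying it anywhere."""
    found = {}
    for (system, local_id), target in SAME_AS.items():
        if target == uri:
            found[system] = SOURCES[system][local_id]
    return found


alice = describe("https://example.org/people/alice")
```

There's no cathedral here, just a lookup table and two systems that carry on doing their own thing.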

...tl;dr everything should have an API

I realise there's nothing new in any of this. It's the usual, "small pieces, loosely joined" (though I'd quibble about the definition of loosely). And I realise that hanging this argument on nostalgia for the rotting corpse of a badly failed operating system weakens it somewhat :-/

But business systems should be more like RISC OS and less like mainframes or enterprise / Adobe bloatware. They should do one thing well and co-operate and communicate. An organisation should be its APIs. And everything should have an API.

If you're in the legal business every case should have an API, every lawyer should have an API, every citation of case law or legislation should have an API. If you're in the TV / radio business every studio should have an API (what's recording there, what was recorded there, what's planned), every camera should have an API (where is it, where has it been), every production should have an API (what stage is it at, what's the budget, who's the co-producer). If you're in the news business every camera crew should have an API (where are they, where have they been), every journalist should have an API (what have they submitted, where are they (though it's not unlikely that Twitter or Facebook or FourSquare already know this)). If a riot kicks off somewhere in Tottenham you should be able to query across systems to easily find the closest journalist, the closest camera crew, the closest radio car... without having to chase down six different spreadsheets across four different departments.
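As a rough sketch of what that cross-system query might look like, here's some Python. The lists stand in for whatever a journalist API and a camera crew API would return; the identifiers and positions are made up and the Tottenham coordinates are approximate. The distance calculation is the standard haversine formula.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat / lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))


# Hypothetical responses from two separate systems, each exposing
# "where is this thing right now" for the things it manages.
journalists = [
    {"id": "journalist/1", "lat": 51.52, "lon": -0.10},
    {"id": "journalist/2", "lat": 51.59, "lon": -0.07},
]
camera_crews = [
    {"id": "crew/1", "lat": 51.45, "lon": -0.20},
    {"id": "crew/2", "lat": 51.60, "lon": -0.05},
]


def closest(resources, lat, lon):
    """Pick the resource nearest to the given point."""
    return min(resources, key=lambda r: haversine_km(r["lat"], r["lon"], lat, lon))


incident = (51.588, -0.072)  # roughly Tottenham
nearest = {
    kind: closest(items, *incident)["id"]
    for kind, items in (("journalists", journalists), ("camera_crews", camera_crews))
}
```

The interesting bit isn't the maths. It's that the same few lines work for journalists, crews and radio cars, because every system answers the same question in the same shape.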

Whenever talk turns to APIs it's usually a side effect of already publishing to the web. The usual question is, "we've published this content to the open web, can we give it an API?" Which feels like the wrong question. If everything is / has an API the real question is, "Which bits of this can we open to the web and which bits are better kept private?" That's just a permissioning problem and permissioning is easy :-)

All of this is dependent on whether the intention is to create a more intelligent, connected, generative website. Which is not a bad goal. It just seems more ambitious to create a more intelligent, connected, generative business. And expose the bits you choose to expose to the world.

Stop testing the wrong things

Or at least start testing the right things.

Lots of chat about Test Driven Development and a brief flurry of tweets with @rarepleasures left my bicker button feeling unfulfilled so this is just another rant that wouldn't fit into 140...

Why I don't really like test driven development

  1. Because the minute you add a label to an approach, within a week it becomes a "process", within a month someone will organise a conference and within six months it's just more dogma and doctrine. But that aside...
  2. There's a chain. At one end are the people somewhat pompously referred to as "the business". At the other end an assortment of developers and designers patronisingly referred to as "geeks" and "creatives". The people at "the business" end want to solve a problem; the people at the building stuff end generally help to solve problems. The more links in the chain, the more noise gets introduced until you end up with requirements and "user stories" as Chinese whispers. Professionalising a class of people into business analysts and product managers doesn't stop Chinese whispers being Chinese whispers.
  3. As Dan North says in his What's in a story post:

    Usually, the business outcomes are too coarse-grained to be used to directly write software (where do you start coding when the outcome is "save 5% of my operating costs"?) so we need to define requirements at some intermediate level in order to get work done.

    The point being that by the time any of this stuff hits the designer / developer it's usually passed through the hands of several intermediaries and been reduced to some requirements / user stories. But requirements don't matter. They're just an abstraction to make it easier to start writing code. What matters are the "business" objectives. Or, without wanting to sound too New Labour, the "outcomes".

    The usual pattern is to explain the what to the developer / designer and leave the how to them. Which might be fine. But explaining the why is probably more important. Who knows, they might even have an opinion on the what. Stranger things have happened.

    Anyway, the more you separate developers and designers from the "why" the more we head back to the bad old days of waterfall, with the people doing the work sat at the end of the process being drip-fed user stories and expected to lay golden feature eggs.

  4. Requirements are fine as a starting point for code and using those requirements to generate tests for that code makes sense but you're only testing the code against the requirements. You're not testing the service / product / let's-just-call-it-a-website against business objectives and outcomes.
  5. Businesses have all kinds of ways of measuring performance. That's what the final slide of the boss people's presentation on "KPIs" is all about. And anything that can be measured can be tested. The main problem is they usually get measured six months after the fact.

    The objective might be to get more registered users; the requirement might be a simplified registration process and / or the ability to authenticate with 3rd party accounts. The objective might be less abandoned shopping carts; the requirement simplified checkout and / or one click purchase. You can measure any of these objectives / outcomes so you can test them. But software tests only test software against requirements and...

  6. ...code does not live in isolation. Until real code meets real data and real content and real copywriting and real design and real users with real needs (and probably a real marketing campaign) you can't measure the changes you make against real objectives.
  7. It's fine to have those screens in development corner that show regression tests passing and failing with green and red lights. But it would be good to see other screens showing real registration rate data, real close account rate data, real buy / play / consume button data, real abandoned shopping cart data, real inbound traffic from search engines or social media or whatever data.
  8. If you're measuring the impact of your work against real usage you can make tiny, tiny changes very, very quickly; isolate those changes from other changes in the system and see how they work for real people. Test code against requirements by all means but don't assume your tests tell you anything meaningful.
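Testing against an outcome rather than a requirement could be as simple as the sketch below. The event log, the variant names and the numbers are all invented; a real version would read from live usage data and need far more traffic before the difference meant anything.

```python
from collections import Counter

# Hypothetical event log: (variant, event) pairs as they might arrive
# from the live site while two registration forms run side by side.
events = [
    ("old_form", "visit"), ("old_form", "visit"),
    ("old_form", "visit"), ("old_form", "visit"),
    ("old_form", "registered"),
    ("new_form", "visit"), ("new_form", "visit"),
    ("new_form", "visit"), ("new_form", "visit"),
    ("new_form", "registered"), ("new_form", "registered"),
]


def registration_rate(events, variant):
    """Registrations per visit for one variant; 0.0 if it had no visits."""
    counts = Counter(event for v, event in events if v == variant)
    visits = counts["visit"]
    return counts["registered"] / visits if visits else 0.0


rates = {variant: registration_rate(events, variant) for variant in ("old_form", "new_form")}
```

Here `rates` is the number to put on the other screen in development corner: not whether the code passed, but whether more people registered.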

Data ghosts in the Facebook machine

This is partly an extended comment on Paul Clarke's excellent Accidental Data Controller post. And partly a whine that, even though we've been talking about social graphs, and very little else, for the last few years, we still don't really think in graph terms when it comes to our friendships. Or much else.

Paul's post is about the "find my friends by pillaging my address book" function that seems to ship with every social networking / commodity publishing website. And in particular about how Facebook stores contact data for people who've never registered with Facebook, the better to help them find their friends when / if they do. But best to read it.

The ghosts of the not yet born

Obviously I have no more knowledge of how Facebook model their data than the next data geek. But if I were evil then...

...say Alice registers on Facebook and consents to the pillage my address book function. Somewhere in that address book are contact details for Bob. Let's say email and mobile number. The first step is to check if there's a registered account in the system matching those details. If there is then Bob gets suggested to Alice as a possible friend. But if Bob isn't registered or is registered but hasn't supplied those details, Ghost Bob gets created:

[Figure 1]

If real Bob comes along later and registers or gives his email / mobile number to Facebook real Bob gets consolidated with Ghost Bob. But it doesn't necessarily stop there. Say Chris registers and also consents to the pillage. Chris isn't really a friend of Bob but they have worked together. So Chris' address book has a record for Bob with his email address and his work phone number. All of this is about finding points in data you can triangulate from. In this case it's the email address so Facebook's Ghost Bob now has email, mobile and work number:

[Figure 2]

Add in Dave who submits Bob's email address and postcode and Ghost Bob starts to accrete data like a velcro ball on a fluffy rug:

[Figure 3]

Then add in Edith and Fred and Gareth and Ghost Bob gets a lot less ghostly. He's just another person node in a huge graph of data; just a slightly less active one.
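Purely as a guess at the mechanics, the accretion could be sketched like this in Python. None of it is known to be how Facebook model their data; the matching rule, the field names and the contact details are all assumptions made up for the example.

```python
# Assumed rule: two records are the same person if they share a value
# for any one of these identifying fields.
MATCH_KEYS = ("email", "mobile", "work_phone")


def same_person(ghost, contact):
    return any(
        key in contact and ghost["details"].get(key) == contact[key]
        for key in MATCH_KEYS
    )


def merge_contact(ghosts, uploader, contact):
    """Fold one address book entry into the ghosts, creating a ghost if needed."""
    for ghost in ghosts:
        if same_person(ghost, contact):
            ghost["details"].update(contact)   # accrete any new details
            ghost["known_by"].add(uploader)    # and another edge in the graph
            return ghost
    ghost = {"details": dict(contact), "known_by": {uploader}}
    ghosts.append(ghost)
    return ghost


ghosts = []
merge_contact(ghosts, "Alice", {"email": "bob@example.org", "mobile": "07700 900001"})
merge_contact(ghosts, "Chris", {"email": "bob@example.org", "work_phone": "020 7946 0002"})
merge_contact(ghosts, "Dave", {"email": "bob@example.org", "postcode": "N17 0XX"})
ghost_bob = ghosts[0]
```

After three uploads there's still only one ghost, but it now holds four details and three relationships, none of which Bob supplied.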

And the ghosts of the dead

It's been reported almost everywhere that Facebook's delete button is really more of a hide button. So the same thing works in reverse; leave Facebook and your data ghost lingers on. It would be interesting to know the figures for registered accounts vs the ghosts of the dead and the ghosts of the not yet born. I'm not on Facebook anymore but I'd happily bet that my data ghost still haunts the place.

Putting the ghosts to work

In theory the ghost people just sit there until a corresponding account is created / linked, at which point the suggested friendship schtick takes over. But even ghost people can be useful.

If you can infer that Alice knows Ghost Bob, Chris knows Ghost Bob and Dave knows Alice, Chris and Ghost Bob, then Alice has three indirect connections to Chris. One through Dave, one through Ghost Bob and one through Dave and Ghost Bob. Which increases the chances that Alice might know Chris. The more connections in the system the better you can predict other connections. And it really doesn't matter how many of those connections link to ghosts; the number of edges is more important than the quality of the nodes.
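The counting can be checked with a few lines of Python. This is a plain enumeration of routes over the edges described above and nothing cleverer; the function names are just for illustration.

```python
# The relationships inferred above, as undirected edges.
edges = [
    ("Alice", "Ghost Bob"),
    ("Chris", "Ghost Bob"),
    ("Dave", "Alice"),
    ("Dave", "Chris"),
    ("Dave", "Ghost Bob"),
]


def neighbours(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def simple_paths(graph, start, end, seen=()):
    """Yield every route from start to end that never revisits a node."""
    seen = seen + (start,)
    if start == end:
        yield seen
        return
    for nxt in sorted(graph[start]):
        if nxt not in seen:
            yield from simple_paths(graph, nxt, end, seen)


graph = neighbours(edges)
paths = list(simple_paths(graph, "Alice", "Chris"))

# Group the routes by who sits between Alice and Chris.
go_betweens = {frozenset(path[1:-1]) for path in paths}
```

Grouped by go-between there are three connections (Dave, Ghost Bob, and the pair of them). Counted as separate routes there are four, because the pair can be visited in either order. Either way, Ghost Bob is doing a lot of work for someone who never signed up.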

The social graph is not a different thing

Thinking in graph terms is hard. Thinking in social graph terms is even harder because our egos take over and we tend to picture ourselves at the centre of a spider's web of connections. To understand what's going on you need to step above, god-like and look down.

The other problem when thinking about the social graph is the tendency to see it as something separate. In page design terms it's usually the bit on the right of the "content" that looks like a bolted on afterthought. But switching examples to Twitter.

If Alice follows Bob and Bob follows Alice and Chris follows Bob and Dave follows Chris. And if Alice tweets and Bob retweets and Chris retweets and Dave favourites. And if Chris makes a list and Bob and Dave are both on that list and Alice follows that list. The whole thing is just some interwingled things and there's no content and no social graph; just a graph and some nodes and some edges. And some of the nodes are people.

It's not how big it is or even how you use it

Paul ends his post with a question:

how big does your address book have to be before you need to register it under the Data Protection Act?

I tried to leave an intelligent comment but accidentally added some angle brackets. So failed. What I wanted to say was: it doesn't matter how big the data set is or even how you intend(ed) to use it. The only thing that matters is how interwingled it is. Divide your edges (relationships) by your nodes (things) and you might be on to something...
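As a back-of-an-envelope version in Python (the function name and both example data sets are made up, and the ratio is a rough indicator rather than any kind of legal test):

```python
def edges_per_node(edges):
    """Unique relationships divided by unique things; 0.0 for an empty data set."""
    unique_edges = {frozenset(edge) for edge in edges}
    nodes = set().union(*unique_edges)
    return len(unique_edges) / len(nodes) if nodes else 0.0


# A plain address book: one owner, four contacts, nobody linked to anybody else.
address_book = [("Me", "Alice"), ("Me", "Bob"), ("Me", "Chris"), ("Me", "Dave")]

# The Ghost Bob data set: four people, five relationships between them.
ghost_graph = [
    ("Alice", "Ghost Bob"),
    ("Chris", "Ghost Bob"),
    ("Dave", "Alice"),
    ("Dave", "Chris"),
    ("Dave", "Ghost Bob"),
]

address_book_score = edges_per_node(address_book)
ghost_graph_score = edges_per_node(ghost_graph)
```

The address book scores 0.8 and the Ghost Bob graph 1.25, even though the address book has more people in it.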

Why is any of this a problem?

Mostly it isn't. Every day in every way we trade privacy for convenience. Own a mobile phone or a sat nav or a connected set top box or a supermarket loyalty card and you're trading some privacy for some convenience. The trouble is it's never quite clear what the trade-off is. (Maybe we just need the equivalent of a nutritional information label for privacy / convenience?)

But most of the debates about online privacy aren't really about privacy at all. They're about informed consent and how we make the decision to make the privacy / convenience trade. Most of the convenience benefits are seen best from inside the graph. And most of the privacy invasion is only apparent when you step outside and look down. Which makes things tricky.

Being informed enough to give consent is difficult enough for most people. If you're Ghost Bob you were never even given the opportunity. You never signed up for the service or ticked the crappy little "I've read the Ts and Cs" checkbox. You're just an accidental node in some parasite's recommendation engine.

Massively interconnected data is dangerous when some of the nodes are people. When some of the nodes are ghost people it's just unethical.

Amazon and the reintermediation of the spectacle

There's an archetypal narrative of the web that starts with the word disintermediation. Which is the posh way of saying cutting out the supply chain middle men.

It's usually accompanied by a picture showing before:

[Figure: Before]

and after:

[Figure: After]

The perennial poster child for the promise of disintermediation is Dell; cutting out the distributors and retailers to go direct to consumers with their marvellous black boxes of technology.

Like most other things in the realm of new technology people tend to take this picture and extrapolate. So many industries have so many middle men it's tempting to imagine a world without. In the creative / spectacular industries in particular there was a promise from the early years of blogging that content producers could go directly to consumers and cut out countless layers of intermediation en route. Music artists wouldn't need record labels, authors wouldn't need publishers, journalists wouldn't need newspapers. And all this would play out on an open web tied together with links, search and micropayments.

But it hasn't really played out that way. Since the last dot com boom and bust the narrative has changed from disintermediation to reintermediation. And from there to the consolidation of reintermediators.

Apple reintermediated the music industry, Sky reintermediated the TV industry, Netflix threatens to reintermediate the cable TV industry, Etsy reintermediated the local craft market, eBay reintermediated the jumble sale and Facebook reintermediated our friendships.

And for everything else there's Amazon...

...who've managed to reintermediate everything from the book industry to large parts of the film industry to the small power tools industry and onwards.

The weird thing about Amazon is they're always there but rarely noticed. The rest of the tech community is happy to stand on any platform available to shout their own praises. And the tech press follow along like adoring puppies. Every time Ed or Ev or whatever his name is announces a new feature the web lights up. Usually with adoration. It only takes an announcement of an announcement by Zuckerberg to trigger a torrent of praise / condemnation. And every time Google make a public proclamation... well at least Jeff Jarvis gets excited. Meanwhile Amazon just get their heads down and build the best online store and the best engineering platform and the best integrated service design and there's barely a murmur. It's all vaguely weird...

Something else that's weird. When the industry gets together for a spot of communal backslapping the integrated service design prize always heads towards Apple. Which makes me wonder if the people handing out these prizes have ever tried to use iTunes? In general it's pretty horrible. But compared to Amazon's Whispernet / Whispersync and one click purchase it's a real turd of a system. In the time it takes to boot up iTunes you can grab a Kindle, search for a book, find a book, buy a book and start reading it. When it comes to integration of web storage, web services, software services and physical devices Amazon make Apple look like amateurs. And the rest of us are barely trying.

That said Amazon aren't exactly shy about their role as a reintermediator. The self-publishing upload form cuts out the agent, the publisher, the distributor, the wholesaler and the retailer in one swoop. The Author Central and 'ask the author' features plug readers directly into authors, again cutting out the middle men.

But back to the point...

So why the new middle men?

Or why didn't the promise of producer to consumer work out? Wikipedia says:

Reintermediation occurred due to many new problems associated with the e-commerce disintermediation concept, largely centered on the issues associated with the direct-to-consumers model. The high cost of shipping many small orders, massive customer service issues, and confronting the wrath of disintermediated retailers and supply channel partners all presented real obstacles. Huge resources are required to accommodate presales and postsales issues of individual consumers. Before disintermediation, supply chain middlemen acted as salespeople for the producers. Without them, the producer itself would have to handle procuring those customers. Selling online has its own associated costs: developing quality websites, maintaining product information, and marketing expenses all add up. Finally, limiting a product's availability to Internet channels forces the producer to compete with the rest of the Internet for customers' attention, a space that is becoming increasingly crowded over time.

Which is probably true but I suspect it's more than that. Recently there's been a flurry of blog posts all saying that some random industry is "becoming software". It all kicked off when The Wall Street Journal published Why software is eating the world.

There's a premise that what sets successful businesses apart is their readiness to adapt to a software driven world. I'm not so sure. I've been around enough software for long enough to think most of it is just a patchwork of bug fixes for obscure corner cases and a set of features that no-one can quite remember requesting. Actually, that's not quite true. Software is what we write to extract information from data. The worse your data model is, the more software you have to write. The optimum line count for code is zero. But anyway, software without data is about as much use as a pub without beer. And I'm firmly of the opinion that:

Re-read Why software is eating the world and replace every occurrence of the word software with the word data. I swear it makes more sense.

The attention graph

In my head there's a picture that looks something like:

[Figure: Graphs]

Our friends in social media world tend to worry most about:

[Figure: Social graph]

Enterprise architects, IAs, archivists, taxonomists etc tend to worry most about:

[Figure: Content graph]

But the one thing all the usual web whatever-number-we're-up-to suspects really get right is:

[Figure: Attention graph]

Facebook's creepy Open Graph protocol and "frictionless sharing" are just an attempt to own the attention graph no matter where its users are paying attention. The read / write web exists; you read something, Facebook write it to their database.

In my day job I've heard this described as "having a personalisation strategy". Which completely misses the point. Personalisation is the bait, customer relationship is the trap.

Anyway, Amazon take the exploitation of attention data to new levels. Their social graph is minimal and their content graph barely exists. Browse Amazon.wherever-you-are and the majority of the content is contributed by customers. And the majority of the context / navigation is contributed by customers. Web services inviting contributions by users tend to have standard boilerplate terms and conditions that at least pay lip service to the contributors' rights over their material:

You or the owner of the content still own the copyright in the content sent to us, but by submitting content to us, you are granting us an unconditional, irrevocable, non-exclusive, royalty-free, fully transferable, perpetual worldwide licence to use, publish or transmit, or to authorise third-parties to use, publish or transmit your content in any format and on any platform, either now known or hereinafter invented.

But Amazon don't even bother with the lip service:

If you do post content or submit material, and unless we indicate otherwise, you (a) grant Amazon.co.uk and its affiliates a non-exclusive, royalty-free and fully sublicensable rights to use, reproduce, modify, adapt, publish, translate, create derivative works from, distribute, and display such content throughout the world in any media; and (b) Amazon.co.uk and its affiliates and sublicensees the right to use the name that you submit in connection with such content, if they choose. You agree that the rights you grant above are irrevocable during the entire period of protection of your intellectual property rights associated with such content and material. You agree to waive your right to be identified as the author of such content and your right to object to derogatory treatment of such content. You agree to perform all further acts necessary to perfect any of the above rights granted by you to Amazon.co.uk, including the execution of deeds and documents, at the request of Amazon.co.uk.

What you contribute to Amazon belongs to Amazon.

Pervasive computing, ubiquitous surveillance

By now we've probably all sat through conference talks on the "internet of things" and pervasive / ubiquitous computing. Past the point where people stop talking about making rabbits twitch their ears when someone tweets about carrots, I think the Kindle is the first real world example of any of this. So it's still got a screen but it doesn't feel like a computer. It's a reading device, a retail terminal and a beautifully designed back channel.

Read the Whispersync marketing foo-foo and the public messaging is all about seamless synching between devices. So you put down your Kindle and open the Kindle app on your iPad and your book is miraculously open at the page you were reading. Obviously there's no device to device synching going on here. It's device to web service to device. So all that data gets phoned home to Amazon.

Amazon already know the books you've bought, the books you've browsed, the books that other people who've bought similar books to you have bought, the books you've rated, the books you've listed, the books you've reviewed. Now if you're using a Kindle with Whispersync they also know if you're a slow reader. They know if you're the kind of person who buys books and never makes it to the end. They know the books you've bought that you skim through in an afternoon. They know the books you read slowly, flicking back through pages to check facts.

Again, this kind of integrated end-to-end service design is interesting to compare with the supposed masters of this sort of stuff. Apple managed to build the iTunes store, some clumsy iTunes software and the rather lovely iPod. But they allowed the backchannel data to get intermediated by Last.fm / Audioscrobbler. You just can't imagine Amazon letting Good Reads or Open Bookmarks build a business by tapping into the Kindle backchannel. They just seem to understand that connected technology isn't just good for distributing content outwards; it's also rather well adapted to reporting back on the usage of that content.

I'm probably gonna get flamed if I've got my facts wrong here but I've searched long and hard for Kindle / Whispersync terms and conditions about user contributed content / data and they just don't seem to exist. So I'm assuming that Amazon terms and conditions cover Kindle Whispersync too. In which case all your Kindle reading data, your bookmarks and your margin notes belong to Amazon too.

I can't help but wonder what it would be like to hack with that kind of data. What could you build around community reading groups, formal education, adult literacy? At the very least it would save me the chore of ticking homework diaries. But I doubt we'll get that chance.

Facebook regularly take a beating for pushing privacy issues to breaking point. I'm not sure what the difference is between Facebook snooping on your reading and Amazon snooping on your reading except the commonly reported privacy issues around Facebook are all about what gets reflected back onto the web. As opposed to what gets absorbed by Amazon.

The usual answer to all of these privacy worries is, what's the worst that could happen? I get some better adverts to watch. Which fits firmly in the, "nothing to hide, nothing to fear" bucket. And it's equally nonsense. It's not like there's not past evidence of data being collected for perfectly innocent purposes being used for something altogether different.

The obligatory user experience bit

Since most of this blog is about user experience in one form or another I'm feeling duty bound to pitch in with: forget about HTML5 and CSS3 and jQuery and responsive design and "mobile first". The most important issue for user experience people to grapple with is informed consent. More and more web services are dependent on user contributed content and data. Every time you make a contribution (explicit or implicit) you're trading convenience for privacy. This isn't necessarily a bad thing; it's something we do every day in real life from mobile phones to loyalty cards. But as the web moves out of the browser and into smart objects, the trade-offs we're making need to be made explicit so people can make informed choices about when to get involved and when to back away.

There's another personal bugbear in all of this. We're increasingly told that companies fail when "UX professionals" don't have a C*O seat at the top table. The company would make better products that more people would want to own and use; everybody wins. Again the poster child is Apple and the pin up boy Jonathan Ive. But in the world of constantly connected, reporting devices your employer's interests are not necessarily the same as those of your users. There's a more general question here: as the supposed representatives of "the user" who exactly are we working for?

A dystopian summary

It feels like we've long since lost the desire for an open, disintermediating web. The old power brokers of publishers, record companies and film studios are dead or dying. But we're all more than willing to trade our privacy for the convenience offered by the new intermediators.

It's not like you can just ignore what's going on. Owning your own domain isn't enough anymore. If you're a newspaper and you don't engage on Facebook you'll just miss a massive chunk of your audience. If you're a broadcaster and you don't promote your programmes on YouTube you'll miss a huge chunk of your audience. If you're a publisher and you don't go through Amazon you might as well give up and go home.

But all the while you're being cut out of the deal. Signing up to any of the new intermediators isn't about opening up new distribution channels; it's about outsourcing your customer relationship. And attempting to fight the reintermediators on your own turf only works if you imagine that customers would get more value out of a relationship with Penguin than they would out of a relationship with Amazon.

I'm not sure what happens next. Maybe the web is just going through a period of reintermediation and consolidation. More optimistic voices than mine would point to things like Diaspora, data portability and personal data stores and say the web will return to a more open, distributed model. But for all the talk of the web as a distributed system it kinda isn't. It's some servers and some clients. Querying distributed data sources to build the kind of "experiences" that people have got used to isn't close to happening. Even in the occasionally rarefied world of Linked Data, reintermediation is happening in the form of consolidated data stores / market places like Kasabi. Because having all that data out there isn't enough. It has to be in one place to make it useful to query.

I have nothing but admiration for Amazon. They make the best shop. They make the best services. They make the best software. They have the best data. But they scare me. If I was a book publisher they'd terrify me. And if I was a film studio or broadcaster I'd be watching my back. Especially since they also own LoveFilm and IMDB. And especially when they've just announced Whispersync for movies and TV. And even if I was Google or Facebook or Apple I'd be concerned. Because Amazon are light years ahead of the game we think we're playing.

Right now the bits of the web that aren't Amazon or Google or Apple or Facebook feel like the high street greengrocer waiting for the supermarket to ride into town. Which depresses me because it's the exact opposite of what got me interested in the web in the first place. Maybe there's still room for a chi-chi artisan cheese shop or two. But who the hell wants to run an artisan cheese shop?

Storytellin'

Round where I live it's almost impossible to walk past a powerpoint session without coming across the "Storytelling" word. Bonus points get scored for managing to get "second screen" and "transmedia" in the same "deck" but we won't go there. It all leads to lots of talk around what storytelling means, how it's done and how it relates to the web. To date the main experiment has been building the Mythology Engine and injecting some Doctor Who storylines into it to make journeys between characters and events from the new and 'classic' Doctor Who. What follows are some fairly random thoughts on how you'd go about modelling stories in order to tell them on the web. Since I'm an expert in neither RDF modelling nor critical theory it might all be nonsense.

 

Events

The obvious starting point for modelling stories is the event. Things happen in stories; capture those things and you have the basic building blocks of story telling. So we've used Yves' Event Ontology to capture events (real and fictional), the time they occurred, the place they occurred and the people and things involved. The next obvious step is to say a story has many events but also different people might tell different stories around the same event(s). So stories have many events and events have many stories. Which in relational database terms means a many-to-many and a many-to-many tends to suggest a missing concept. In this case the missing concept is narrative order, allowing a story to reveal events out of the sequence in which they happened. Which is useful if you're trying to describe a non-linear narrative with flashbacks and various recollections of nested narrators (think Wuthering Heights). So you end up with something like:

Events_stories

As a simple example take two of my favourite TV programmes: Columbo and Midsomer Murders. They have the same basic event structure which looks roughly (give or take a murder) like:

Events

But they're told very differently. Columbo almost always tells it straight, in event order: first establishing the characters (murderer and murderee), then revealing the motive, the means, the murder and onwards. Right from the start you know who did it, why and how. For the audience the game is all about guessing how Columbo will come good and catch them.

Midsomer Murders is a more standard whodunnit, told out of sequence using the usual techniques of recollection and flashback. It often opens with the murder scene followed by the investigation. The investigation turns up various clues en route; some real, some red herrings. The motive and means are only fully revealed as part of the post-investigation accusation. (Which, as a complete aside, is not a dissimilar narrative structure to The Apprentice: Sir Alan as detective in a murder mystery, the country house replaced by a rented office in Docklands.)

Reordered_events
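As a rough sketch of the "stories as reordered events" model (before the objections that follow), the join between stories and events can carry the narrative position. Everything here is illustrative: the class names, the five-event simplification and the exact Midsomer Murders ordering are my assumptions, not the Mythology Engine's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    happened_at: int  # position in chronological order (assumed, simplified)


@dataclass(frozen=True)
class Telling:
    """Join row between a story and an event, carrying narrative order."""
    story: str
    event: Event
    narrative_position: int


# Hypothetical, simplified event structure shared by both programmes.
MOTIVE = Event("motive", 1)
MEANS = Event("means", 2)
MURDER = Event("murder", 3)
INVESTIGATION = Event("investigation", 4)
ACCUSATION = Event("accusation", 5)


def tell(story, events):
    """Build join rows for a story, in the order the story reveals events."""
    return [Telling(story, e, i) for i, e in enumerate(events, start=1)]


TELLINGS = (
    # Columbo tells it straight, in event order.
    tell("Columbo", [MOTIVE, MEANS, MURDER, INVESTIGATION, ACCUSATION])
    # Midsomer Murders opens with the murder and reveals motive and means late.
    + tell("Midsomer Murders", [MURDER, INVESTIGATION, MOTIVE, MEANS, ACCUSATION])
)


def narrative_order(story):
    rows = [t for t in TELLINGS if t.story == story]
    return [t.event.name for t in sorted(rows, key=lambda t: t.narrative_position)]


def chronological_order(story):
    rows = [t for t in TELLINGS if t.story == story]
    return [t.event.name for t in sorted(rows, key=lambda t: t.event.happened_at)]


print(narrative_order("Midsomer Murders"))
print(chronological_order("Midsomer Murders"))
```

Both programmes give the same chronological order; only the narrative order differs, which is the whole point of putting the position on the join rather than on the event.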

Assertions

This basic model works fine if all stories that agree on an event also agree on all the assertions made about that event: when, where, who and what. But imagining that the event being described is a crime, everyone might agree the crime took place but Alice might say that Bob was present and Bob might not agree.

None of this potential for disputed assertion (whether when, where, who or what) is covered by the stories as ordered events model. But in my mind at least stories are more an ordered set of assertions than a reordered set of events.

Scenes

So the reordered events model for Midsomer Murders shown above is clearly not correct. Midsomer Murders does often start with a scene from the murder event but whilst the murderee and maybe the location are depicted the murderer is kept out of shot. Over the course of the programme subsequent scenes often return to the murder event progressively revealing more detail. It's this split between events and scenes that the 'stories as reordered events model' doesn't give you.

Every medium has a bag of tricks that allows story tellers to control what's revealed when. In TV and film it's usually close up, over the shoulder shots filmed with low light levels (the shower scene in Psycho). Columbo's interesting for comparison because it's not a whodunnit. The murder is usually filmed as a well-lit wide shot with every detail (location, time, murderer, murderee, weapon...) made explicit.

The closest comparison I can think of to the bag of assertions model is the RDF named graph. And I'm not saying that to be all linked data-ish; I just can't think of a way you'd do this in any other data store. Named graphs allow you to bundle up a set of statements / assertions / claims (in this case RDF triples) and associate them with some provenance: person X stated these things:

Namedgraph
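To make the named graph idea a bit more concrete, here's a minimal sketch in plain Python rather than a real triple store. Each statement is a quad (a triple plus the name of the graph it belongs to), and provenance is just more statements about the graph name. The `ex:` identifiers, the library location and the helper names are all made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # name of the graph (bundle of claims) this triple belongs to


# Hypothetical data: everyone agrees on the crime, Alice also places Bob there.
QUADS = [
    Quad("ex:crime", "ex:tookPlaceAt", "ex:library", "ex:alices-account"),
    Quad("ex:crime", "ex:hadParticipant", "ex:bob", "ex:alices-account"),
    Quad("ex:crime", "ex:tookPlaceAt", "ex:library", "ex:bobs-account"),
    # Provenance: who stated each bundle of claims.
    Quad("ex:alices-account", "ex:statedBy", "ex:alice", "ex:provenance"),
    Quad("ex:bobs-account", "ex:statedBy", "ex:bob", "ex:provenance"),
]


def graphs(quads):
    """Group triples by the named graph they were asserted in."""
    bundled = defaultdict(set)
    for q in quads:
        bundled[q.graph].add((q.subject, q.predicate, q.obj))
    return bundled


def disputed(quads, graph_a, graph_b):
    """Triples asserted in one graph but not the other."""
    bundled = graphs(quads)
    return bundled[graph_a] ^ bundled[graph_b]


print(disputed(QUADS, "ex:alices-account", "ex:bobs-account"))
# → {('ex:crime', 'ex:hadParticipant', 'ex:bob')}
```

A real implementation would more likely use an RDF library and a format like TriG or N-Quads; the point of the sketch is only that the graph name gives you something to hang "person X stated these things" on.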

The named graph model only gets you as far as some collections of assertions. But stories are more than just bags of assertions: in order to 'tell' them you need to be able to control how those assertions are revealed to the reader. In this case it's the scene (maybe that should be act?) that reveals a particular named graph's bag of assertions:

Scenes

Event interwingling

The model so far allows you to bundle up and progressively reveal assertions around events. But it doesn't allow for assertions about the relationships between events: event A directly caused event B; A was one factor in B happening; A didn't cause B, but without A, B couldn't have happened etc. For me these assertions are the most important thing about storytelling because they speak to the reason we tell stories in the first place: an attempt to understand and explain why things happen. They also speak to the inner child's cry of "why?" (and the inner adult's response of "because"). Every story we tell is one long chain of "cause" and "effect", why and because. Who and where and when matter but why trumps them all.

In news storytelling in particular, why and because are the central pillars of decent journalism. Why is my local library closing? Because of council cutbacks. Why are the council cutting back? Because of central government cutbacks. Why are central government cutting back? Because they need to balance the national budget? Why does the budget need to be balanced? Because the previous government borrowed too much? Why did they borrow too much? Because the banks collapsed. Why did the banks collapse? Because mankind is sinful and the bankers weren't washed in the blood of Christ...

Almost all journalism (all the examples I can think of anyway) follows this pattern of chaining events together with a sequence of becauses. Sometimes the because is explicit, sometimes implied, sometimes insinuated but it's almost always there. And it's usually where the majority of disputes arise. Even where they agree on all other details, The Guardian's chain of causality is going to look very different to the Daily Mail's and every claim in the cause and effect chain could be and will be disputed by someone. The ability to see how claims of "causality" differ between different journalists and different news organisations would be a handy tool for general media literacy.

As an aside I think this is my main misgiving about the rNews spec. It models online news article publishing; it doesn't model news or journalism. No events, no claims of event <> event causality, no why, no because. To steal a line from Tom Scott, news stories [are] metadata about real world events. And to steal a line from Jeff Jarvis, articles are the byproducts of journalism. Which makes rNews meta-metadata or the byproduct of a byproduct.

Anyway, that was a long aside to add one more line to the model: Alan was arrested for the murder of Joyce.

Causality
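A small sketch of what attributed causality claims might look like as data, and how you'd walk one source's chain of becauses. The two "papers" and the "falling visitor numbers" claim are hypothetical, invented purely to show two sources disagreeing about the same effect.

```python
from typing import NamedTuple


class CausalClaim(NamedTuple):
    cause: str
    effect: str
    claimed_by: str  # who is asserting the "because"


# Hypothetical claims from two hypothetical news organisations.
CLAIMS = [
    CausalClaim("council cutbacks", "library closing", "paper A"),
    CausalClaim("central government cutbacks", "council cutbacks", "paper A"),
    CausalClaim("falling visitor numbers", "library closing", "paper B"),
]


def because_chain(effect, source, claims):
    """Follow one source's 'why? because...' chain back from an effect."""
    chain = []
    seen = {effect}  # guards against circular claims
    current = effect
    while True:
        step = next(
            (c for c in claims
             if c.effect == current
             and c.claimed_by == source
             and c.cause not in seen),
            None,
        )
        if step is None:
            return chain
        chain.append(step.cause)
        seen.add(step.cause)
        current = step.cause


print(because_chain("library closing", "paper A", CLAIMS))
# → ['council cutbacks', 'central government cutbacks']
print(because_chain("library closing", "paper B", CLAIMS))
# → ['falling visitor numbers']
```

This only follows a single cause per step; a fuller model would need to cope with several contributing causes and with the weaker "one factor in" and "couldn't have happened without" flavours of claim.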

Stories and discourse

From the diagram above it seems like stories operate on two basic levels: the assertions they contain (the story) and the way in which those assertions are revealed (the telling). At this point I went off in search of better labels for these levels. I'd thought that some of story, narrative and plot might apply here but all the definitions seem a little fuzzy being both event (rather than assertion) centric and using account to cover a multitude of "telling" possibilities. At least according to the OED:

Story
an account of imaginary or real people and events told for entertainment
an account of past events, experiences, etc
Narrative
spoken or written account of connected events; a story
Plot
the main sequence of events in a play, novel or film

A chat with Matthew sent me in the direction of Roland Barthes' Introduction to the Structural Analysis of Narratives, an essay collected in Image Music Text [PDF - page 76 (page 79 of the book)] which says:

Tzvetan Todorov [..] proposes working on two major levels, themselves subdivided: story (the argument), comprising a logic of actions and a 'syntax' of characters, and discourse, comprising the tenses, aspects and modes of the narrative.

Which gives two useful labels, ending up with something roughly like:

Story-discourse

Down the structuralist rabbit hole

From my (probably simplistic) reading of Barthes his main point seems to be that discourse can be analysed and deconstructed in much the same way that linguistics deconstructs the sentence. The major premise being:

[A narrative] shares with other narratives a common structure which is open to analysis, no matter how much patience its formulation requires.

Once this structure is identified:

[The] 'art' of the storyteller, [..] is the ability to generate narratives (messages) from the structure (the code). This art corresponds to the notion of performance in Chomsky and is far removed from the 'genius' of the author, romantically conceived as some barely explicable personal secret [..] it is impossible to combine (to produce) a narrative without reference to an implicit system of units and rules.

Barthes proposes that narratives operate over a set of hierarchical levels in much the same way as linguistics describes the sentence as operating at multiple levels:

To understand a narrative is not merely to follow the unfolding of the story, it is also to recognize its construction in 'storeys', to project the horizontal concatenations of the narrative 'thread' on to an implicitly vertical axis; to read (to listen to) a narrative is not merely to move from one word to the next, it is also to move from one level to the next.

That said, Barthes doesn't identify the precise levels of narrative but he does propose:

to distinguish three levels of description in the narrative work: the level of 'functions' (in the sense this word has in Propp and Bremond), the level of 'actions' (in the sense this word has in Greimas when he talks of characters as actants) and the level of 'narration' (which is roughly the level of 'discourse' in Todorov).

If you choose to believe Barthes then the story level shown above breaks down into two parts: Propp style functions and 'actions'. Which seems to fit with the event part of the model although I have no idea how you'd model 'characters', let alone 'characters as actants'. And life's too short to read Greimas. If you choose to believe Propp then capturing the functions seems trivial, every event sub-classes some more archetypal event / function.

But more interesting is Barthes' description of the way narrative levels interact:

Narrative thus appears as a succession of tightly interlocking mediate and immediate elements; dystaxia determines a 'horizontal' reading, while integration superimposes a 'vertical' reading: there is a sort of structural 'limping', an incessant play of potentials whose varying falls give the narrative its dynamism or energy.

these levels are in a hierarchical relationship with one another, for, while all have their own units and correlations [..] no level on its own can produce meaning. A unit belonging to a particular level only takes on meaning if it can be integrated in a higher level. The theory of levels gives two types of relations: distributional (if the relations are situated on the same level) and integrational (if they are grasped from one level to the next);

All of the examples given in the book are based in literature but thinking about film (and TV) for a minute, there's lots of obvious examples of integrational relationships between the story and discourse levels: again the background "music" in the Psycho shower scene, the cymbal crash at the end of a pratfall. When it comes to "telling" a story there are all kinds of claims made on the discourse level about things on the story level. Every decision on script, casting, costumes, locations, props, sound effects, background music, lighting, camera angles, editing, maybe even film stock is a claim made in the discourse level about objects in the story level.

Any attempt to capture the relationships between discourse and story (beyond "reveals") turns the simple model shown above to spaghetti. But storytelling is as much about how things are revealed as it is about when they're revealed. There are techniques that could probably be identified but how you'd model that I have no idea.

An attempt at a conclusion

I think it's possible (although the presence of named graphs makes it tricky) to model the mechanics of a story (the ordered revealing of claims around events). And the "cause and effect claims" still feel like the most important part (especially for news and history) because they reflect how we attempt to understand the world.

But a model of the mechanics of a story doesn't really get you any closer to being able to tell a story using that model. I think it would be good for news organisations to share identifiers for events and people and places. I think it would be good for journalism if claims of causality were made explicit rather than insinuated. (I'm thinking of the Tottenham / London / England riots and the varying claims of causality.) But I don't think it gets us any closer to "web native storytelling". Whatever that might be.

If I worked for a big media organisation...

...(or at least one whose content could reasonably end up encoded as an mp3 / mp4) one thing I'd definitely like to see is an ID3 tag dedicated to holding a RESTful HTTP URI.

ID3 tags are designed to allow people to embed metadata about the content of a media file into the file. Although designed can seem quite a strong word in this context. A quick glance at the ID3 spec gives the impression that it was more thrown together. New tags have accreted over time with little discernible rhyme or reason. What started as an attempt to add core metadata like track title, artist name and release title to music tracks has bloated to a spec with a quite ridiculous number of tags.

But there are still two important attributes missing from ID3:

  1. A stable, persistent identifier for the content of the file
  2. A way to get more information about the content of the file

Actually ID3 does make provision for a Unique file identifier but it goes on to disclaim responsibility with:

This frame's purpose is to be able to identify the audio file in a database, that may provide more information relevant to the content. Since standardisation of such a database is beyond this document, all UFID frames begin with an 'owner identifier' field. It is a null-terminated string with a URL [URL] containing an email address, or a link to a location where an email address can be found, that belongs to the organisation responsible for this specific database implementation. Questions regarding the database should be sent to the indicated email address.

Eh? Really? Who on earth would populate an ID3 tag with the email address of a database owner? And why?

Both gaps could be filled by the addition of an ID3 tag dedicated to storing a RESTful HTTP URI. Settling on a stable URI gives a stable globally-unique identifier. And because it's an HTTP URI you can dereference it to get back more information. And if that information is returned as Linked Data you can follow your nose to more information and etc. In short the URI should employ content negotiation so if it's requested by a browser the user should get back an appropriate human readable webpage. And if the user requests JSON or RDF or CSV then the URI should return JSON or RDF or CSV. And if the user requests the media itself (audio/mp3 eg) they should get back the media file if it's still available.
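For what it's worth, the server side of that content negotiation could be sketched roughly like this. It's a deliberately simplified take on the HTTP Accept header (q-values and `*/*` only, no `type/*` wildcards or specificity rules), the file names are invented, and I've used audio/mpeg as the registered media type for mp3.

```python
# Hypothetical representations available at a single URI.
REPRESENTATIONS = {
    "text/html": "page.html",
    "application/json": "data.json",
    "application/rdf+xml": "data.rdf",
    "text/csv": "data.csv",
    "audio/mpeg": "media.mp3",
}


def parse_accept(header):
    """Return (media_type, q) pairs, highest preference first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0  # treat an unparseable weight as "not wanted"
        prefs.append((media_type, q))
    # sorted() is stable, so equal weights keep their order in the header.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(header, default="text/html"):
    """Pick the media type to serve, or None if nothing acceptable exists."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in REPRESENTATIONS:
            return media_type
        if media_type == "*/*":
            return default
    return None  # a server would typically answer 406 Not Acceptable


print(negotiate("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
# → text/html
print(negotiate("application/json"))
# → application/json
```

In practice you'd lean on whatever your web framework provides for this rather than hand-rolling the parsing.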

The basic problem with ID3 is however much the spec expands and however many tags get added there's always going to be more that people want to say about a music track or a film or a TV programme. Trying to encapsulate all this descriptive power in a pre-defined set of tags is always going to be way too limiting. Or why embed metadata as tags when you could embed one HTTP URI and just dereference that to get the data? Metadata embedding is a silly solution to a hard problem.

Taking music as an example, you could embed an artist name, track title, release title and record label in the file. But adding a MusicBrainz URI makes all this core data available over HTTP. And adding a MusicBrainz URI makes additional data that could never be encoded in ID3 (like band membership (and data about those members)) available too. Because both MusicBrainz and BBC Music are published as Linked Data you can traverse the web to get BBC News stories for that artist, BBC reviews for that artist and BBC programmes that play that artist. Because The Guardian uses MusicBrainz identifiers in their new music site you can get Guardian reviews and news stories about that artist. And because the Echonest uses MusicBrainz identifiers you can get recommendations for similar artists.

Taking a BBC programme example, if ID3 allowed for an HTTP URI, that tag could be populated by a RESTful /programmes URI. Dereference that and you'd get not only core episode data (title, the programme it belongs to, the series it belongs to, broadcast information, contributor information, clips) but also music played in that episode (again linked to MusicBrainz), trackbacks to blog posts about the episode, products for sale including that episode, recipes in that episode. The list probably isn't endless but it's more than ID3 could ever scale to.

Most importantly for content publishers one of the many things you could get back is recommendations for similar (legally available) content. If there's a recognition that content will "travel", "upselling" to legality feels like an obvious response. So punters get better, more expansive metadata, better services and opportunities to explore new content. And publishers get an opportunity to tempt people back to legality. And if it doesn't completely solve the provenance problem at least it's a step in the appropriate direction.

All it takes (and I'm probably simplifying through ignorance) is for media companies to mint HTTP URIs for their content which return liberally licensed (meta)data in standard, non-proprietary formats and link out to other data sources. And an ID3 tag to embed these URIs into files. And for people to build smart media clients that suck in this data to make interesting and useful experiences.

In the meantime, as Mo has pointed out, there are ID3 tags designed to hold URLs. WOAF (Official audio file webpage) and WOAS (Official audio source webpage) are obvious candidates for overloading if anyone fancies a hack. But even the use of the word "webpage" suggests they weren't designed for RESTful HTTP URIs.
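If anyone does fancy that hack, here's a minimal sketch of what packing a URI into a WOAF frame might look like. It assumes the ID3v2.3 frame layout as I understand it (4 character frame ID, 4 byte big-endian size, 2 flag bytes, then the URL as ISO-8859-1 text) and only builds the single frame, not a complete tag. ID3v2.4 encodes frame sizes differently, so don't treat this as spec-accurate for every version. The function names and the example URI are made up.

```python
import struct
from typing import Tuple


def url_link_frame(frame_id: str, url: str) -> bytes:
    """Build one ID3v2.3-style URL link frame (assumed layout, see above)."""
    if len(frame_id) != 4 or not frame_id.startswith("W") or frame_id == "WXXX":
        # WXXX (user defined URL) has extra fields, so it's not handled here.
        raise ValueError("expected a plain URL link frame ID such as WOAF")
    payload = url.encode("latin-1")
    header = (
        frame_id.encode("ascii")
        + struct.pack(">I", len(payload))  # size of the payload only
        + b"\x00\x00"                      # no flags set
    )
    return header + payload


def read_url_link_frame(frame: bytes) -> Tuple[str, str]:
    """Read back a frame produced by url_link_frame()."""
    frame_id = frame[:4].decode("ascii")
    (size,) = struct.unpack(">I", frame[4:8])
    return frame_id, frame[10:10 + size].decode("latin-1")


# Hypothetical URI, purely for illustration.
uri = "http://example.org/programmes/an-episode"
frame = url_link_frame("WOAF", uri)
print(read_url_link_frame(frame))
# → ('WOAF', 'http://example.org/programmes/an-episode')
```

An existing tagging library would be the sensible route for writing this into real files; the sketch is only meant to show how little there is to the frame itself.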

So, in summary, if I worked for a big media company I'd be putting in the effort to ensure both my website and ID3 were Linked Data compliant.

One from the archive: the /programmes manifesto

Not so hot on the heels of Tom Scott's development manifesto for the BBC Nature site I thought I'd dig out the old BBC Programmes (@programmes) manifesto. It took a while to track down but eventually turned up in a dusty folder with the title dogma.html...

The timestamp says 14/10/2008 but I think it existed as some post-it notes on a wall several months before that. I know it predated Yves' arrival, so also predated any of the Linked Data work. Which was really just a logical extension rather than any new principles. Anyway, here it is:

/programmes believes:

  1. in one web
  2. in accessibility for people
  3. in accessibility for machines
  4. it's a service, not a product
  5. in designing from the domain model up, not the interface down
  6. in being RESTful
  7. in open standards
  8. in open data
  9. in linked data
  10. in fixing the data, not hacking the code
  11. in links before pages
  12. that the real value is in the links to other domains
  13. in designing for the browser in the browser

Like Tom's list we didn't always live up to these standards but they kept us (mainly) honest. I seem to remember we also kept a 'hack log' to keep track of anywhere we evaded our principles for the sake of expedience. Wonder what happened to that?

Impolite personalisation - impotent in the face of inference

I'm not a massive fan of cars. Never having learnt to drive, their design pretty much passes me by. But a wife working at BBC Magazines means a bathroom floor covered with old copies of Top Gear. Yesterday I came across a review of the Audi A1 (I'd link but topgear.com just returns a 500) which said:

Even its stop/start system is behind the Mini's - it keeps finding reasons not to stop at all. Not that it gives you its excuses, so I don't know how I can alter my driving style to make it more active. Too warm? It's not summer yet. Too cold? The coming of spring made no difference. Battery low? Shouldn't be. Aircon or heater on? Nope, have been careful to avoid that. Lights on? That changes nothing.

This struck a chord with a conversation going on on Twitter about personalisation and personalised recommendation. Which had been triggered by an Eli Pariser article in The Guardian which said, roughly:

the increasing personalisation of information [..] threatens to limit our access to information and enclose us in a self-reinforcing world view.

The opposing view was taken in a post by Better the Mask saying, roughly:

A lot of this article, I think, reads like a digital complement to the Reithian view on broadcasting - that it should be public service, give people what they need not what they want. High-minded, certainly, and noble in a certain light, but also highly problematic. Who decides what "we" as a community need?

Much of the debate seemed to centre on the usual paternalist reading of Reith with "low culture" as the sugar to make the "high culture" pill go down. I'm not sure that's entirely accurate. I don't remember ever seeing "inform, educate and entertain" rendered with bolds or italics. And as Tony Ageh might say, scheduling Top of the Pops next to Panorama was as much about exposing Top of the Pops to Panorama viewers as it was about exposing Panorama to Top of the Pops viewers.

I'd probably go further and say any attempt to break down culture into high and low is itself paternalistic and just leads to the usual sneering at the poor old Daily Mail reader. It also ignores the connections between things. It's usually not that many skips of the graph from "low" to "high"; there are no continents in culture.

And from a personalised recommendation perspective all the anecdotal evidence of user testing I've seen seems to suggest that people value recommendations outside of their bubble. Obviously that doesn't mean recommending Bells on Sunday to Westwood fans (or vice versa). But neither does it mean recommending Casualty from Holby City. People like to be surprised by recommendations, not locked into content ghettos.

All that said, there is one thing that bothers me about "personalised" content services. Recommendation engines take a large graph of data and compress it into a smaller set of one to many recommendations; compression for recommendation is just some inference over a data set to reduce too much choice to some choice. For personalised recommendation, part of the original graph is the user's past activity. There's some truth in the adage that, if you don't know your past, you don't know your future (who am I to disagree with Chuck D) and basing recommendations for future behaviour on observed past behaviour makes some sense.

The problems come when some system starts making inferences and you have no idea why. Like the Audi A1 stop/start system if you can't tell why a system is making some assumption you can't tweak your behaviour to change those assumptions and the whole thing just becomes frustrating. For recommendation engines the metric of measurement tends to be about what is returned. But for a useful and usable system why is equally important. And too often why becomes a black box with the intercession of magic. A polite, useful system would explain the assumptions it's making and the logical leaps it's taking. And allow you to help it to help you.

So given a standard e-commerce application I might be recommended products on the basis of products I've bought in the past. Which might work until the point that somebody else uses my account to buy things. At which point I start getting recommendations for things I have no interest in. Same deal for a TV recommender based on my past consumption.

All this is fine if I can see and modify any data that's been collected about me; so long as I can tell the system, "no, I didn't buy or watch that so please stop recommending me stuff on the basis that I did." Or, "yes, I did watch that, but my tastes have changed / it was crap."
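As a toy illustration of a "polite" recommender, here's a sketch where every recommendation carries its reasons and the user can strike items from their own history. The catalogue, the tags and "Car Show B" are invented, and matching on shared tags stands in for whatever inference a real engine would do.

```python
# Hypothetical catalogue: programme -> descriptive tags.
CATALOGUE = {
    "Columbo": {"crime"},
    "Midsomer Murders": {"crime"},
    "Top Gear": {"cars"},
    "Car Show B": {"cars"},
    "Gardeners' World": {"gardening"},
}


def recommend(history, catalogue):
    """Return (recommendation, reasons) pairs; reasons are history items."""
    results = []
    for candidate, tags in catalogue.items():
        if candidate in history:
            continue
        # The "why": which things I watched share a tag with the candidate.
        reasons = sorted(h for h in history if catalogue.get(h, set()) & tags)
        if reasons:
            results.append((candidate, reasons))
    # Best supported first, then alphabetical for a stable order.
    return sorted(results, key=lambda r: (-len(r[1]), r[0]))


def forget(history, item):
    """Let the user say 'no, I didn't watch that' (returns a new history)."""
    return {h for h in history if h != item}


history = {"Columbo", "Top Gear"}
print(recommend(history, CATALOGUE))
# → [('Car Show B', ['Top Gear']), ('Midsomer Murders', ['Columbo'])]

# Someone else used my account to watch Top Gear, so I remove it.
history = forget(history, "Top Gear")
print(recommend(history, CATALOGUE))
# → [('Midsomer Murders', ['Columbo'])]
```

The interesting part isn't the matching; it's that the reasons are returned alongside the results and that the history is something the user can edit.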

But too often the data collected about me is hidden from view and when it is exposed I can't change it. But this is probably just me banging on about #userowneddata again...