The premise

A lot gets written about the economic collapse of the news industry in the face of the internet. Like a fair chunk of technology-related journalism, it tends to concentrate on gadgets, gizmos, experiences and platforms.

Maybe the problem is more fundamental: not enough interesting things happen in the world to justify the current population numbers of "journalists".

The egg cup shaped news industry

The majority of original journalism happens in the news wire services. They have networks of journalists across the world spotting newsworthy events and reporting back the details. The wire copy goes to the news outlets, who decide whether it's the kind of story they want to cover, then copy, paste, decorate with background material and publish.

This worked fine pre-internet, when each news outlet sent out stories to defined demographic and regional audiences. Since the web (and Google) came along it's worked less well, given they're all competing for the same sets of people on the same turf.

The egg shaped news industry

The competition has increased recently with the advent of news aggregation platforms like Google, Facebook, Apple etc.

Now we have a few people in the wire services doing actual news gathering; many people in the news organisations copying, pasting and spinning; and a few people in the aggregators attempting to sort through all this noise to get back to the actual news.

There's an assumption that the news aggregators must be doing something like plagiarism detection to extract some signal from the noise of copy / paste. There was certainly an effort by Google a couple of years back to encourage news organisations to add rel-source to published articles to identify the originating wire copy. Which, understandably, wasn't well received. But any processing happening in aggregation services is opaque to the news industry, which can't see what happens to its material once it hits the aggregators.

The work

To run plagiarism detection software across both Juicer (for published material) and Window on the Newsroom (for wire copy) to see where, when and how much additional editing of wire copy happens in news organisations. To cluster published articles around originating work according to similarity / divergence (e.g. the Daily Mail tends to add lots of additional material, the Daily Express less so). To build a map of the type of news that tends to get picked up by news organisations and who they're competing against with that material. To pick up original journalism happening in news organisations that isn't just repurposed wire copy. To establish and publish provenance / trust chains back to source. Possibly to feed into the event detection work (or at least to provide aggregation and clustering of shared assertions around existing event detection).
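As a rough idea of what the similarity / divergence step might look like, here's a minimal sketch using word shingles and set overlap. It's one common approach to near-duplicate detection, not necessarily what any plagiarism detection package does under the hood. Everything named here (shingles, containment, added_fraction, cluster_by_wire, the threshold and the shingle size k) is made up for illustration, and it assumes the wire copy and published articles have already been pulled out of Window on the Newsroom and Juicer as plain strings keyed by id.

```python
from __future__ import annotations

import re


def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(wire: str, article: str, k: int = 5) -> float:
    """Fraction of the wire copy's shingles that reappear in the article."""
    wire_set = shingles(wire, k)
    if not wire_set:
        return 0.0
    return len(wire_set & shingles(article, k)) / len(wire_set)


def added_fraction(wire: str, article: str, k: int = 5) -> float:
    """Fraction of the article's shingles that are NOT in the wire copy."""
    article_set = shingles(article, k)
    if not article_set:
        return 0.0
    return len(article_set - shingles(wire, k)) / len(article_set)


def cluster_by_wire(wires: dict[str, str], articles: dict[str, str],
                    threshold: float = 0.5, k: int = 5):
    """Attach each article to the wire item it reuses most.

    Returns (clusters, unmatched). clusters maps a wire id to a list of
    (article id, containment, added fraction). unmatched lists articles
    below the threshold, i.e. candidates for original journalism.
    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    clusters = {wire_id: [] for wire_id in wires}
    unmatched = []
    for article_id, text in articles.items():
        best_id, best_score = None, 0.0
        for wire_id, wire_text in wires.items():
            score = containment(wire_text, text, k)
            if score > best_score:
                best_id, best_score = wire_id, score
        if best_id is not None and best_score >= threshold:
            added = added_fraction(wires[best_id], text, k)
            clusters[best_id].append((article_id, best_score, added))
        else:
            unmatched.append(article_id)
    return clusters, unmatched
```

High containment with a high added fraction would be the "adds lots of additional material" pattern; high containment with a low added fraction is closer to straight copy / paste. Comparing every article against every wire item won't scale to a real archive, so a proper version would probably want something like MinHash to narrow down candidates first.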

Also to play to one of the founding myths (which may be a mythical myth) of BBC News Online as a neutral copy of the wires and a link hub to "other takes on this news article" from across the industry. A semi-automated way to help toward BBC outbound link measurements.

Possible future work

To combine plagiarism clustering with the entity extraction already happening in Juicer to cluster coverage by topics. To combine that with some form of sentiment analysis to see how different news outlets spin different types of story (does the Daily X tend to add negative spin to stories about topic Y? More than Daily Z?).
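A sketch of how the spin comparison might be tallied, assuming each published article has already been matched to its wire source, tagged with a topic, and scored for sentiment by some tool not shown here. The Coverage record, its field names and the spin_by_outlet_and_topic function are all hypothetical, and the scores are assumed to be floats where lower means more negative.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple


class Coverage(NamedTuple):
    """One published article matched to the wire copy it was built on."""
    outlet: str
    topic: str
    wire_sentiment: float     # score of the originating wire copy
    article_sentiment: float  # score of the published article


def spin_by_outlet_and_topic(rows):
    """Mean sentiment shift (article minus wire) per (outlet, topic).

    A negative value means the outlet tends to publish something more
    negative than the wire copy it started from.
    """
    shifts = defaultdict(list)
    for row in rows:
        shifts[(row.outlet, row.topic)].append(
            row.article_sentiment - row.wire_sentiment
        )
    return {key: mean(values) for key, values in shifts.items()}
```

Measuring the shift against the wire copy, rather than the raw article score, is a design choice: it separates the spin an outlet adds from stories that were bleak to begin with.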

Ideally (but probably impossibly) to combine with audience figures to see how individual wire copy is split across news outlets and audiences. Ideally (but probably more impossibly) to combine that with advertising revenue to compare the same article across different outlets and how the money gets distributed. To get cost / benefit figures for copy / paste vs actual journalism across the industry.

To do the same with radio and telly news?

Possible problem

A similar project has been mooted in the past but caused some consternation amongst content creators (lol). Sell it as identifying and rewarding original journalism and send them a copy of All the President's Men. Be fine.