This is a long, long delayed comment to Jeni Tennison's Opaque URIs != Unreadable URIs post from back in 2009. I think I meant to comment at the time but forgot and now comments are closed so doing it here instead...

Best start by saying I agree with Jeni but think the post might be subject to misinterpretation and, in my view at least, some of the commenters have misinterpreted.

First off Jeni separates out URI opacity from URI readability. She points out that URI opacity is already claimed as a term in web architecture, which states that web applications must not try to pick apart URIs in order to work out information from them, or, roughly, machines shouldn't guess. For discussions about the readability of URIs she suggests replacing 'opaque' with 'obfuscated', which makes some sense to me. Certainly if the label opaque is already used to identify one concept, reusing it for another will probably just cause confusion. And because this is all about labels in context as identifiers...

I think it's probably also useful to separate out human readable from human meaningful. In my mind human readable at least implies natural language. Or at least that's the way it tends to be used in conversations around this topic. In contrast W1A 1AA is not natural language (it's not in a dictionary) but it is human meaningful (especially to a certain generation). But that meaning doesn't stretch that far outside the context of the UK.

Anyway, Jeni goes on to discuss possible URI designs for a school and suggests 3 possible options:

  1. the name of the school
  2. the unique reference number for the school
  3. the record number for the school in the database that is being published on the web

The first option is, I think, what most people mean when they say human readable as opposed to opaque or obfuscated. Jeni dismisses it by pointing out that school names can change over time so persistence is a problem. The other obvious problem is that school names aren't unique. I have no idea how many St Mary's schools there might be in the UK but I'd guess that /schools/st-marys would return a fair few results. Again in my head, an identifier is a label that's guaranteed unique in some defined context. @Dmitry picks up on this in the comments suggesting that a desirable URI for a school should include type/class identifier 'school', school name, city/..., state/province/..., and country which is not dissimilar to a database composite key combining the identifier for the thing with some facet identifiers to add just enough context to guarantee uniqueness. It looks like a nice, desirable solution but it doesn't solve the original problem of school names changing and it introduces a whole new set of problems of its own. Firstly it implies that the world is a mono-hierarchical taxonomy of things when the world is more like a giant set of many-to-many relationships. The world is not a filing system or indeed a set of Russian dolls. Secondly it compounds the changing school name problem by introducing a whole set of other labels that are also subject to change. And thirdly it assumes where mono-hierarchical taxonomies do exist they remain stable over time.
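To make the uniqueness problem concrete, here's a rough sketch of turning school names into URI slugs. The slugify function, the sample schools and the towns are all made up for illustration; nothing here comes from Jeni's post or from any real schools dataset.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Lower-case a name, drop apostrophes and join the words with hyphens."""
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    ascii_name = ascii_name.lower().replace("'", "")
    return re.sub(r"[^a-z0-9]+", "-", ascii_name).strip("-")


# Hypothetical records: (record id, school name, town)
schools = [
    (1, "St Mary's School", "Leeds"),
    (2, "St. Mary's School", "Bristol"),
    (3, "Hilltop Primary", "Leeds"),
]

# Group record ids by the slug their name produces.
by_slug = {}
for record_id, name, town in schools:
    by_slug.setdefault(slugify(name), []).append(record_id)

# Any slug shared by more than one record cannot act as an identifier.
collisions = {slug: ids for slug, ids in by_slug.items() if len(ids) > 1}
print(collisions)
```

Adding the town to the path would separate these two, but only until a school renames itself or the boundaries move, which is the composite key problem all over again.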

The classic example of this is the Linnaean taxonomy and the use of genus and species labels as a composite key to identify a species. In practice it's fraught with difficulties as biologists constantly reclassify species into different genera. As my old master would say, never build your taxonomies into your URIs because they will become unmaintainable and make you cry.

For now I'll skip over the second option and come back later. Option three is to use the database record number for the school. So basically publish the primary key of the database row as the web identifier. Which is a fairly common solution to the problem and common enough to be the default pattern for a Ruby on Rails app where the out of the box URI for a 'thing' page is /:table_name/:primary_key. And I'd guess this is what the 114 is in the URI of Jeni's blog post. It's also how dbpedia lite mints its URIs using the primary key from the Wikipedia table row. Back when I was a lad there used to be a standard warning accompanying any plans to use database primary keys in URIs: what happens if your database drops for some reason and you have to resurrect it and it gets resurrected in some different order with primary keys assigned to different things. Although I've never seen that happen in practice...?

So if publishing with composite keys has problems and publishing primary keys is frowned upon the only other option is the surrogate primary key: a column in your database table that's guaranteed unique across that table but isn't the primary key. Which is pretty much what a MusicBrainz 36 character UUID is. And also what the 8 character PID is. (Although at least some PIDs are actually 2-way transforms of Freeview broadcast CRIDs (non-HTTP URIs) but that's a different story.)
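A minimal sketch of the difference, using an in-memory SQLite database. The table layout, the public_id column name and the /schools/ path are assumptions of mine, not how MusicBrainz or anyone else actually does it. The point is only that a surrogate key travels with the row when the database is rebuilt, whereas an auto-assigned primary key may not.

```python
import sqlite3
import uuid

# Hypothetical schema: "id" is the internal primary key,
# "public_id" is the surrogate key that gets published in URIs.
CREATE = (
    "CREATE TABLE schools ("
    " id INTEGER PRIMARY KEY,"
    " public_id TEXT NOT NULL UNIQUE,"
    " name TEXT NOT NULL)"
)


def build(rows):
    """Create a fresh database and load (public_id, name) rows in the order given."""
    conn = sqlite3.connect(":memory:")
    conn.execute(CREATE)
    conn.executemany("INSERT INTO schools (public_id, name) VALUES (?, ?)", rows)
    return conn


def lookup(conn, public_id):
    """Return (primary key, name) for a published surrogate key, or None."""
    return conn.execute(
        "SELECT id, name FROM schools WHERE public_id = ?", (public_id,)
    ).fetchone()


def school_uri(public_id):
    return f"/schools/{public_id}"


rows = [
    (str(uuid.uuid4()), "Hilltop Primary"),
    (str(uuid.uuid4()), "St Mary's School"),
]

original = build(rows)
rebuilt = build(list(reversed(rows)))  # the database "resurrected" in a different order

hilltop = rows[0][0]
print(school_uri(hilltop))
print(lookup(original, hilltop))  # primary key 1
print(lookup(rebuilt, hilltop))   # primary key 2, same school, same URI
```

Had the URI been built from the primary key, /schools/1 would now point at a different school.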

Back to Jeni's option 2 then, which again is a surrogate primary key. She makes the distinction between primary key and surrogate key by saying:

Using the record number for the school within the particular database that's being published is entirely non-human-readable because there is simply no way of finding out what that would be for a given school. The unique reference number for the school, on the other hand, may be an obscure series of digits, but it is a meaningful one which renders the URI readable and hackable.

The obvious point is that the school reference number might be readable (though isn't natural language) and it might be meaningful. But it's only really more meaningful than a primary key to a very select group of school administrators.

The other point is that you can only reuse "real world" identifiers as your surrogate key if "real world" identifiers exist in the domain you're working in. Using real world identifiers is really more a case of outsourcing your obfuscation because someone else has done the work already and, as ever, it's best to reuse and recycle. Meaningful to some is better than meaningful to none.

But "real world" identifiers tend to exist where there are administrative benefits around transactions (car registration plates, ISBNs, catalogue numbers, DOIs, National Insurance numbers, TV Licence numbers...). And they tend to only act as identifiers within that administrative framework. As Frankie picks up in the comments ISBNs are useful, can be meaningful / recognisable to some people and do have structure. But without wishing to disappear down a FRBR shaped rabbit hole they're about editions / saleable items. And there are no similar "real world" identifier frameworks for works. Or the usual problem that this is not a dramatisation of this, this or this. It's a dramatisation of something that no one's ever bothered to give a real world identifier to because there's no administrative / transactional benefit in them doing so.

So using real world IDs as surrogate keys is useful and adds meaning for some users but it's only possible where real world identifiers already exist. Otherwise you end up having to mint your own.

Which does open up the option of making your surrogate keys human readable / natural language URI slugs (as posterous does for this post). Given enough people to throw at the problem any site can generate human readable / natural language slugs. It really depends on the throughput of data, the number of new pages that results in, the friction it introduces and the cost.

And I think cost is the big factor in all of this. For large data volumes you need human intervention to allocate URL keys. And human intervention is expensive. And humans make mistakes and change their minds. So you need to start storing history to generate redirects. Which adds storage and code complexity and makes things more expensive.
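For a sense of what "storing history to generate redirects" means in code, here's a minimal in-memory sketch. The SlugRegistry class, its method names and the /schools/ path are hypothetical; a real site would keep this in the database and have to cope with a good deal more (deletions, slugs being handed on to something else, redirect chains and so on).

```python
class SlugRegistry:
    """Remember every slug ever issued so that old URIs can redirect to the current one."""

    def __init__(self):
        self.current = {}  # record id -> current slug
        self.owner = {}    # every slug ever issued -> record id

    def assign(self, record_id, slug):
        """Give a record a new slug, refusing slugs that belong to another record."""
        existing = self.owner.get(slug)
        if existing is not None and existing != record_id:
            raise ValueError(f"slug {slug!r} already belongs to record {existing}")
        self.owner[slug] = record_id
        self.current[record_id] = slug

    def resolve(self, slug):
        """Return (HTTP status, path) for a requested slug."""
        record_id = self.owner.get(slug)
        if record_id is None:
            return (404, None)
        current = self.current[record_id]
        status = 200 if slug == current else 301
        return (status, f"/schools/{current}")


registry = SlugRegistry()
registry.assign(1, "hilltop-primary")
registry.assign(1, "hilltop-academy")  # someone changed their mind

print(registry.resolve("hilltop-academy"))
print(registry.resolve("hilltop-primary"))
print(registry.resolve("never-existed"))
```

Even this toy version needs a second lookup table and a rule about who owns a retired slug, which is the extra storage and code complexity in miniature.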

I'm still unconvinced that anyone outside "the industry" cares about any of this. No-one stops using twitter because a tweet URI is obfuscated. Amazon still makes a profit despite its use of obfuscated product IDs. I've sat through a fair few user-testing sessions and I've never seen anyone hack the URL or even look at it unless the task being tested is sharing in which case they copy and paste and send.

Which is not to say I wouldn't by preference make URIs human readable, human meaningful, natural language and hackable. Mostly because it seems polite. If I was building a website for my local restaurant then I'd definitely go for /drinks/wines/pink/sparkling or whatever. But for a site of any real complexity, based on any real amount of data / content, human readable / meaningful / natural language / hackable costs money (in admin, in storage, in code complexity for redirects). I'm going to kick myself for using 'return on investment' but I struggle to see where it lives in this case.