Occasioned by several walks through parks with Robert where we came to the conclusion that we could do a better job of explaining the new Parliamentary search. Why we've done what we've done, what some of the advantages and disadvantages might be, and what we need to do to make it better. So here goes...

The problem we were trying to solve

Parliament has a web presence with a lot of search forms. There's a web search that searches the public website. But not all of it. There's the requisite intranet search. There's an amazingly complicated public search form for Parliamentary material and an even more complicated version available internally. There are search-like forms for individual types of business like this one for written questions and answers and search forms for particular types of content like this one for Parliament TV.

Over the years we've spent time, effort and money building and tweaking our own search functionality. There's an entire triple store (separate from data.parliament) built to index material for the purposes of search.

Past a point it becomes difficult to maintain and update the code sitting behind this assortment of search options. And no matter how much effort we put into building our own search tools, 90% of traffic still comes from organic web search. So Google, Bing, DuckDuckGo etc. Though in fairness almost entirely Google. There was a general feeling that the effort poured into rolling our own search systems for 10% of users was out of proportion to the very little effort we put into making the website friendly to external search engines for 90% of users. And maintaining multiple search options is also confusing for people, who have to get used to a variety of interfaces and are never quite sure what this particular search is searching over.

Our initial goal was to "improve the experience of searching parliament.uk", which gave us some wiggle room because at least no-one was expecting us to fix it overnight. We also wanted to step toward a unified search rather than multiple code bases supporting multiple interfaces.

What we did

Given the opportunity to start afresh with assorted varieties of Parliamentary search we decided not to roll another new search platform, but to make use of what was already out there. So for now we're using the Microsoft Cognitive Services API which uses Bing web crawlers behind the scenes.
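
For the curious, a site-restricted web search boils down to something like the sketch below. This is not our production code. The endpoint URL, the subscription key header, the count and offset parameters and the use of the site: operator are assumptions based on the publicly documented Bing Web Search API v7; the function name and the placeholder key are made up for illustration. It only builds the request and doesn't send it.

```python
from urllib.parse import urlencode

# Assumption: the v7 Bing Web Search endpoint. The post does not say
# which endpoint or API version is actually in use.
ASSUMED_ENDPOINT = "https://api.cognitive.microsoft.com/bing/v7.0/search"


def build_site_search_request(query, site="parliament.uk", count=10, offset=0):
    """Return (url, headers) for a web search restricted to a single site.

    Hypothetical helper: nothing is sent over the network here.
    """
    params = {
        # The site: operator narrows a general web search to one domain.
        "q": f"site:{site} {query}",
        "count": count,    # results per page
        "offset": offset,  # how many results to skip, for paging
    }
    headers = {"Ocp-Apim-Subscription-Key": "<your-key-here>"}  # placeholder
    return f"{ASSUMED_ENDPOINT}?{urlencode(params)}", headers


url, headers = build_site_search_request("written questions")
print(url)
```

The point is how little there is to it. There's no index to build and no relevancy code to maintain, just a query handed to someone else's crawler and ranking.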

This is the first step in making it easier for people to find material from Parliament. We're intending to use more than this one data source and will be adding to it incrementally.

There are assorted advantages and disadvantages to this approach. Let's start with the disadvantages and end on a high note.

Some disadvantages of using web search as site search

  1. You're outsourcing relevancy and ranking to what is, in effect, a black box owned by a third party and subject to change by a third party. If some type of material is deemed by the business to merit a higher ranking than some other type of material you can't just tinker with some relevancy algorithm and see what happens. If you outsource you lose some control.
  2. You're outsourcing keywords to the vagaries of a search engine and to the wider web. Since the arrival of PageRank, search engines tend to work by taking into account not only the words on your pages but the text of links to your pages from other pages. And though we're not entirely sure (because it's all a black box) it looks like they also take account of words on pages linking to your pages regardless of whether those words are in the link text or not. The problem here is Google bombs will happen. People can choose to link to your stuff with deliberately mischievous and occasionally offensive words and search engines will rank your pages for those words. Ask George W. Bush and Rick Santorum. It's possible that our new search will return results for slightly offensive terms but if you're spending your time typing swears into text boxes you should probably grow up, maybe?
  3. In traditional "enterprise" search you fix search by fixing search. Web search engines, by contrast, work by "browsing" your site. So if you want to improve the findability and presentation of material you publish on the web, you have to spend time fixing your website and not tweaking your search code. This is a very different approach, which seems to confuse people.

Some advantages of using web search as site search

  1. You're outsourcing your relevancy and ranking. Which is also a good thing. Because relevancy and ranking are hard. For an organisation like Parliament producing a variety of materials, some with long term reference value, some with much more immediate short term value, it's particularly hard. When you roll your own search, all the relevancy signals are packed into your corpus. You can combine them with usage stats which in all likelihood you've already bought from Google. But you're still treating your bit of the web as a sealed box rather than part of a wider ecosystem. Using web search plugs you back into that ecosystem by taking account of much more than your documents and their usage. How the web sees them and links to them, what gets clicked on in search results, how usage varies over times of day and days of week: all of these provide a whole new set of relevancy signals you just don't get from mining your own material.
  2. You're outsourcing keywords to the wider web and how your chosen search engine chooses to see the wider web. When you're reliant on keywords contained in your own documents, you're reliant on editorial policies that may have changed over time. So Hansard didn't use the word Brexit until May 2016 but there's lots of Parliamentary debate pertinent to Brexit that predates that. Because other people, outside Parliament, have linked to this material using the word Brexit, our new web based search takes this into account and returns results pertinent to Brexit that never mention the word Brexit. Using web search massively expands your corpus and massively expands your pool of relevant keywords. The price you pay is a few people typing rude words into search boxes. What you gain outweighs that.
  3. Our previous approach has been two-pronged. We've been doing the work to improve our own, hand-rolled search whilst also trying to play nicely with search engines (not always with great success). Combining the two means you're optimising once. All the work you do to improve internal search also improves web search. Which might be as simple as giving a little more thought to page titles and descriptions. Or making sure the site is progressively enhanced and doesn't fall to pieces when JavaScript disappears. Or including data in some form of schema.org markup. Again, 90% of traffic comes from external search which in the past has had 10% of the budget. Combining the two means there's a lot less lost effort.
  4. Search engines like Google and Bing have a massive user base and a massive cache of usage data. They've seen every possible typo and learned how to route around them. So a search for dungerous droogs compensates for wonky spellings and returns best guess reasonable results.
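
To make the schema.org suggestion in point 3 a little more concrete, here's a minimal sketch of generating JSON-LD for a page. The WebPage type and the name, description and url properties are real schema.org vocabulary. The function name, the example URL and the example text are invented for illustration and don't describe any markup we actually publish.

```python
import json


def json_ld_script(name, description, url):
    """Build a JSON-LD <script> element describing a page as a schema.org WebPage."""
    data = {
        "@context": "https://schema.org",
        "@type": "WebPage",
        "name": name,
        "description": description,
        "url": url,
    }
    # Escape "</" so the JSON can never close the surrounding script element.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


print(json_ld_script(
    name="Example Bill: stages",  # hypothetical page title
    description="The stages this example bill has passed so far.",
    url="https://example.org/bills/example-bill/stages",  # hypothetical URL
))
```

The same name and description can feed the page title and meta description, so the work gets done once for people and for bots.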

Fix the browse, fix the goddam browse

So we've partially outsourced relevancy, ranking and keywords, but only partially. For now at least, search engines still place some emphasis on site structure, link density and the wording of links. We can still exercise some control over the findability of our documents by:

  1. Creating as many routes to them from as many angles as possible (people, groups, places, times, topics etc.).
  2. Increasing link density to more relevant documents (current members above ex members, open inquiries above closed inquiries, bills currently passing through Parliament above those that have already passed into legislation etc.).
  3. Ensuring that link titles are as descriptive as possible.
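
The third of these is the easiest to check mechanically. As a rough sketch, using only the Python standard library, you could scan a page for links whose text says nothing about where they lead. The list of "vague" phrases, the class and function names and the sample HTML are all illustrative assumptions rather than an agreed standard.

```python
from html.parser import HTMLParser

# Illustrative only: phrases that tell a search bot nothing about the target.
GENERIC_LINK_TEXT = {"click here", "here", "more", "read more", "link"}


class LinkTextCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href="..."> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def vague_links(html):
    """Return links whose text is empty or one of the generic phrases."""
    collector = LinkTextCollector()
    collector.feed(html)
    return [(href, text) for href, text in collector.links
            if not text or text.lower() in GENERIC_LINK_TEXT]


sample = ('<p><a href="/bills/1">Read more</a> '
          '<a href="/bills/2">Example Bill: committee stage</a></p>')
print(vague_links(sample))
```

Run against the sample, this flags the "Read more" link and leaves the descriptive one alone.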

The tree-like design of the current website doesn't lend itself to any of this. With important documents hidden at the end of twigs, at the end of branches, at the end of trunks there's no sense of which documents we value and want to promote. By improving the information architecture of the new website, we have a much better starting point for playing well with the wider web and with search engines.

So far we've made a decent stab at making every resource we think might address an information need separately addressable. So there are "pages" to answer questions like:

As we explore and build out more of the domain, other resources designed to answer other questions will appear:

  • What government bills have been presented by Ministers in this department?
  • What questions have been asked of this department?
  • What stages has this bill passed?
  • Which Members have signed this EDM?
  • Which EDMs has this Member signed?

The intention was to make subsidiary resources available for inclusion in "thing" pages. So nested beneath a person page there's another page listing their committee memberships. And another listing their parties over time. And so on. All of these, or bits of these, can be included into the person page and swapped in and out as design responds to actual usage and changes over time. This at least is the theory.

Except in places we've added information to people pages without making the corresponding subsidiary resources. So a person page might list government positions, but as yet /people/:person/government-positions doesn't exist. And where we have made subsidiary resources we haven't made a great fist of linking down to them. So they're invisible to users and invisible to search bots. To meet information needs we need to fix some of this.
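
Finding these gaps is mostly bookkeeping. A minimal sketch, assuming we can list the sections shown on a person page, the routes that exist and the routes the person page links to: only /people/:person/government-positions comes from the example above, and the other slugs, the variable names and the functions are hypothetical.

```python
PARENT = "/people/:person"

# Sections currently shown on the person page (hypothetical slugs, apart
# from government-positions, which is the example from the post).
PAGE_SECTIONS = ["committee-memberships", "parties", "government-positions"]

# Routes that exist, and routes the person page actually links down to.
IMPLEMENTED = {PARENT, f"{PARENT}/committee-memberships", f"{PARENT}/parties"}
LINKED_FROM_PERSON_PAGE = {f"{PARENT}/parties"}


def missing_routes(sections, implemented, parent=PARENT):
    """Sections shown on the page with no subsidiary resource behind them."""
    return [f"{parent}/{s}" for s in sections
            if f"{parent}/{s}" not in implemented]


def unlinked_routes(implemented, linked, parent=PARENT):
    """Subsidiary resources that exist but are never linked from the parent."""
    return sorted(r for r in implemented if r != parent and r not in linked)


print(missing_routes(PAGE_SECTIONS, IMPLEMENTED))
print(unlinked_routes(IMPLEMENTED, LINKED_FROM_PERSON_PAGE))
```

The first list is resources we still need to make. The second is resources we've made but hidden from both users and search bots.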

The upshot of all this is you can't just put time and budget into tweaking search code. Once you've outsourced ranking, the only way to "improve search" is to improve the website and make it friendlier to search bots. Which will have all kinds of knock-on benefits for the other users who happen to be people.

This goes against more traditional (or at least computer age traditional) approaches to search, where you end up building ever more complicated search forms to cope with ever more complicated queries but neglect the work necessary to make the underlying documents discoverable via linking and browsing. This feels particularly true in the worlds of libraries and archives which seem to squeeze whole websites into a single form, a result listing and a "record".

As Karen Coyle says in Catalogs and Context:

First, the indexes of the database system are not visible to the user. This is the opposite of the card catalog where the entry points were what the user saw and navigated through. Those entry points, at their best, served as a knowledge organisation system that gave the user a context for the headings. Those headings suggest topics to users once the user finds a starting point in the catalog.

Most, if not all, online catalogs do not present the catalog as a linear, alphabetically ordered list of headings. Database management technology encourages the use of searching rather than linear browsing. Even if one searches in headings as a left-anchored string of characters a search results in a retrieved set of matching entries, not a point in an alphabetical list. There is no way to navigate to nearby entries. The bibliographic data is therefore not provided either in the context or the order of the catalog. After a search on "cat breeds" the user sees a screen-full of bibliographic records but lacking in context because most default displays do not show the user the headings or text that caused the item to be retrieved.

Search engines can't search

The emphasis on search interfaces at the expense of browse has other knock-on effects for the wider web. If all of your "website" is packed into a single search form, it's impossible for search bots to fill in that form and get to your documents. Because, and this is possibly pointing out the obvious, search engines can't search. They send crawler bots out across the web. The bots "read" a page and follow links, "read" a page and follow links. Onwards and forever. If they meet any means of navigation that's not a plain and simple hypertext link they're baffled. If they meet a search form they're stumped. A page like Search Parliamentary Material is barren soil for search bots. All the money you've spent on building impossibly intricate forms not only makes it more difficult for non-expert users but also makes it impossible for search bots to find your stuff.
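
A toy crawler makes the point. The sketch below is a deliberately simplified model, not how any real search bot is built: the "site" is an in-memory dictionary of invented paths and HTML, and the crawler only knows how to follow plain links. The document that can only be reached by submitting the form is never found.

```python
from collections import deque
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every plain <a> link. Forms are ignored."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(site, start="/"):
    """Breadth-first crawl: read a page, follow its links, repeat."""
    seen = {start}
    queue = deque([start])
    while queue:
        path = queue.popleft()
        parser = HrefCollector()
        parser.feed(site.get(path, ""))
        for href in parser.hrefs:
            if href in site and href not in seen:
                seen.add(href)
                queue.append(href)
    return seen


# Hypothetical site: one document sits behind a search form only.
SITE = {
    "/": '<a href="/people">People</a> '
         '<form action="/search"><input name="q"></form>',
    "/people": '<a href="/people/1">A member</a>',
    "/people/1": "<h1>A member</h1>",
    "/documents/42": "<h1>Only reachable by submitting the search form</h1>",
}

found = crawl(SITE)
print(sorted(found))              # pages the bot reached
print(sorted(set(SITE) - found))  # pages it never saw
```

Add a single link to /documents/42 from any reachable page and it turns up in the crawl. That, in miniature, is the argument for fixing the browse.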

Anyway, our approach to search will evolve over time but the basic approach should stay similar. Instead of sinking effort into building dazzlingly complex search forms we intend to spend the time making incremental improvements to the website we're searching over. To pin out Parliamentary material like a butterfly and provide as many approach routes and as many aggregations as possible.