Jun 26 2007

Today sees the first major conference on the Automated Content Access Protocol (ACAP).

ACAP is potentially very important for online (i.e., all) publishers. What is it? This is the description from the conference website:

a standard by which the owners of content published on the World Wide Web can provide permissions information (relating to access and use of their content) in a form that can be recognised and interpreted by a search engine “spider”, so that the search engine operator is enabled systematically to comply with the permissions granted by the owner. ACAP will allow publishers, broadcasters and any other to express their individual access and use policies in a language that search engine’s robot “spiders” can be taught to understand.

(For more information, see also the Wikipedia page and the official ACAP website.)

ACAP was conceived in January 2006 and born some 9 months later at the Frankfurt Book Fair. As of today, it is mid-way through a first pilot project that is intended to design v1.0 of ACAP. The pilot is due to finish in October 2007, with a final conference scheduled for 29 October in London.

The ACAP partners include leading publishers (STM and others), media and news organisations (including WAN, the World Association of Newspapers) and the British Library. But right now none of the three major search engines are formal members (though they have participated informally), and clearly ACAP is never going to work without their active endorsement and participation.

So what’s in it for the search engines? On the one hand, they stand to lose access to content, or be barred from certain kinds of re-use they currently enjoy (particularly in news). But on the other hand there are potential gains: publishers may make certain kinds of currently restricted content (books, databases) available to search engines if they feel they have more control over re-use (and the potential to monetise that use). There are certainly potentially huge gains for end-users here.

But one problem for early adoption of ACAP is that it (at least partly) addresses an area of current tension between content owners and the search engines: copyright. For example, publishers and Google have engaged in legal battles over Google’s interpretation of copyright law in relation to book scanning (e.g. the Authors Guild and the Association of American Publishers have separately sued Google) and news aggregation (e.g. Agence France-Presse). So it wouldn’t be surprising if the search engines decided to play their cards close to their chests at this point.

So what’s new at the London conference? In the opening keynote (pdf), WAN’s Gavin O’Reilly says of the Big Three search engines’ non-membership:

So however perplexing I find the fact that the big three still aren’t full participants in ACAP, it is – for me, probably the sole and minor disappointment among a long and continuing litany of successes and triumphs and I welcome the self-evident operational involvement that we continue to see from some of them.

Francis Cave’s project report (pdf) appears to show that the pilot is roughly on track, with the Use Cases defined and the technology options for the Technical Framework researched and specified. Defining the Use Cases was clearly an important milestone, as they lie at the heart of what ACAP is about and are potentially quite complex. For instance, Mark Maddocks’ presentation Why ACAP? Reed Elsevier Perspective (pdf) gives these examples:

  • Specify permitted use of indexed content (e.g. Limit number of displayed words in the search result; or Require direct link back to publisher site)
  • Exclude certain SE services from using indexed content (e.g. Allow for inclusion in main Google index but not Google Scholar)
  • Exclude specific parts of the page from indexing and/or display (e.g. Paragraphs, images, figures, or tables)
  • Exclude from the index copies of the page not found at the specified URL
  • First Click Free – site is indexed, but provides limited content to the user (e.g. Crawlers are allowed to index pages; A search results page can present search results from these pages; People can link to the page from the search results page but not onward link from that destination page. Instead they are redirected to a registration or other page)
  • Registration – site is indexed & free, but have to register for access (e.g. Crawlers are allowed to index the pages bypassing registration, the pages are flagged as Registration so that the crawler can explain this to the user if they choose to; Users clicking on a link are asked to register before seeing the content)
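To make these use cases a little more concrete, here is a minimal Python sketch of how a publisher's permissions might be modelled and applied by a compliant search service. This is purely illustrative: ACAP's actual syntax is still being designed in the pilot, and every name here (UsagePolicy, build_result, the field names, the service labels "main" and "scholar", and the example URL) is a hypothetical invented for this example, not anything taken from the ACAP documents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UsagePolicy:
    """Hypothetical per-page permissions (an assumption, not ACAP syntax)."""
    max_snippet_words: Optional[int] = None      # None means no word limit
    require_link_back: bool = False              # result must link to publisher
    excluded_services: frozenset = frozenset()   # services barred from re-use
    access_model: str = "open"  # "open", "first_click_free" or "registration"


def build_result(policy, service, text, url):
    """Return what a compliant service may display, or None if excluded."""
    if service in policy.excluded_services:
        return None
    words = text.split()
    if policy.max_snippet_words is not None:
        words = words[:policy.max_snippet_words]
    result = {"snippet": " ".join(words), "access": policy.access_model}
    if policy.require_link_back:
        result["link"] = url
    return result


# Example: allow the main index, bar a scholar-style service,
# cap snippets at five words and flag the page as registration-only.
policy = UsagePolicy(
    max_snippet_words=5,
    require_link_back=True,
    excluded_services=frozenset({"scholar"}),
    access_model="registration",
)
article = "Publishers want finer control over how search engines reuse content"
url = "https://publisher.example/article"

print(build_result(policy, "main", article, url))
print(build_result(policy, "scholar", article, url))
```

Even this toy version hints at why the Use Cases are complex: the permissions interact (a service can be allowed to index but not to display, or to display only a truncated snippet), and the access model has to travel with the result so the search engine can explain it to the user.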
