Reconciliation Service API

This document describes the reconciliation service API, as implemented in OpenRefine 2.8 to 3.2. It is intended as a comprehensive and definitive specification of this API in its given state. Various aspects of this API need to be improved, as hinted by notes throughout this document, and by the choice of the version number 0.1 indicating an early development stage. Further improvements to the API, to be discussed in the W3C Entity Reconciliation Community Group, will be specified in the next iteration of this document.

Members of the Community Group are encouraged to contribute to this document by documenting the current behaviour of the reconciliation API. The ReSpec Editor's Guide can be used to learn more about the markup to use in this document.

Introduction

Data Matching on the Web

Integrating data from sources which do not share common unique identifiers often requires matching (or reconciling, merging) records which refer to the same entities. The promise of the Linked Open Data movement, namely the ability to mix data from different publishers expressed in a common language (such as RDF), relies on being able to recognize identical entities across services. Due to the Web's decentralized nature, there is nothing preventing a service from publishing a new URI for a resource or concept which is already expressed by another URI.

Various mechanisms exist to state the equivalence between two URIs: for instance, such a correspondence can be stated with the owl:sameAs property [[owl-features]], or using the looser notions of equivalence defined in SKOS [[skos-primer]]. But such statements must in turn be themselves findable. One can aggregate owl:sameAs statements from various sources to infer identities by transitivity, but this is a subtle art as some data sources can erroneously equate different concepts [[beek-2018]]. After all, any quest towards building a universal identifier system which avoids duplicates is necessarily doomed. Data publishers use different granularities to model the world. Concepts have fluctuating boundaries across languages, cultures and time.

In practice, we can determine if two database records refer to the same entity by comparing their attributes. For instance, two entries about cities bearing the same name, in the same country and with the same mayor are likely to refer to the same city. The reconciliation API that we present here makes it easier to discover such matches. It is a protocol that a data provider can implement, enabling its consumers to efficiently match their own data to the entities represented by the provider.

By nature, reconciliation is a heuristic process. Different entities can have many identical characteristics, leading to false positives. The same entity can be represented in different ways by two databases, for instance by spelling names differently, leading to false negatives. This problem has been extensively studied and many heuristics have been proposed to tackle it [[christen-2012]]. The reconciliation API is agnostic about the particulars of the heuristics involved: it lets data providers choose how they want to determine which of their entities are good candidates for a particular query. What it provides is a web API to let users obtain these candidate entities without having to implement the underlying reconciliation heuristics themselves, nor download the entire contents of the target database.

History of the Reconciliation API

This API was originally designed by Metaweb as a protocol used between Freebase and Gridworks (now known as OpenRefine). Freebase was a free crowdsourced knowledge graph, storing data about a broad range of topics and exposed on the web as linked data. OpenRefine is a tool which was originally designed to help populate this knowledge graph by importing data into it. It supports a range of operations which help the user reshape their data to prepare it for ingestion in a data model such as Freebase's. One of these operations is reconciliation, which matches mentions of entities in the local dataset to records in the target database. The reconciliation API was initially introduced to specify how OpenRefine and Freebase could communicate during that process.

The reconciliation API was then turned into a generic protocol that any database could implement. This made it possible to register such a database into OpenRefine by adding it as a Standard Service. This API was implemented by various services, either directly by the service provider itself (for instance the Crossref funder database, Nomisma or the Getty thesaurus) or by a third party as a wrapper sitting on top of other existing web APIs for the service (such as Wikidata or VIAF). Software was also developed to expose a reconciliation endpoint out of any tabular file (reconcile-csv) or by wrapping a SPARQL endpoint (in the OpenRefine RDF extension).

This API was documented on OpenRefine's wiki as a living document which evolved gradually, as OpenRefine improved. In addition to its core feature, fetching reconciliation candidates matching a given query, services are optionally able to implement additional endpoints which ease the integration of the service in OpenRefine's UI, by providing previews for entities (with a Preview Service) and auto-completion for various inputs (with Suggest Services). In 2018, a Data Extension Service was added, letting consumers pull data from the target database once they have reconciled their records.

In 2019 the W3C Entity Reconciliation Community Group was formed, with the intention of promoting and improving this API outside the strict scope of the OpenRefine project. This document is an attempt to better specify this API.

External Resources

A list of known public endpoints is maintained by the community, where they can also be tried out interactively. OpenRefine's wiki also hosts a list of reconcilable data sources, which also includes non-hosted or discontinued services. Existing clients to the API, such as OpenRefine or Cocoda, can be used to interact with reconciliation services.

Core Concepts

This section documents the data model behind the reconciliation API. A reconciliation service lets users match their data against entities exposed by the service. Matching can be refined by filtering by type or properties with property values. The purpose of this section is to define these notions.

Entities

An entity is a record in the data source exposed by the service. It comprises the following fields:

id
an identifier, which is a non-empty string. This identifier must be unique among all entities;
name
a name, which is also a non-empty string;
type
a list of types, possibly empty;
Moreover, for each property it contains a set of associated property values, possibly empty.

Reconciliation services can define in their service manifest a view template which associates to each entity a corresponding URI, by inserting its identifier in the template. A view template is a string which contains the {{id}} substring. For each entity, replacing {{id}} in the template by the entity's identifier MUST result in a valid URI [[RFC2396]].
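The substitution described above can be sketched in a few lines of Python. This is only an illustration: the template URL and the identifier are placeholders, the function name is hypothetical, and the final check is a loose sanity test (a scheme is present), not full [[RFC2396]] validation.

```python
from urllib.parse import urlparse

ID_PLACEHOLDER = "{{id}}"


def entity_uri(view_template: str, entity_id: str) -> str:
    """Insert an entity identifier into a view template."""
    if ID_PLACEHOLDER not in view_template:
        raise ValueError("a view template must contain the {{id}} substring")
    if not entity_id:
        raise ValueError("entity identifiers are non-empty strings")
    uri = view_template.replace(ID_PLACEHOLDER, entity_id)
    # Loose check only: we merely require a URI scheme such as "https".
    if not urlparse(uri).scheme:
        raise ValueError(f"result does not look like a URI: {uri!r}")
    return uri


# Placeholder template and identifier, not taken from any real service.
example_uri = entity_uri("https://example.com/entity/{{id}}", "1234")
```

Whether identifiers need percent-encoding before insertion is not stated here, so the sketch inserts them verbatim.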

Types

A type represents a category of entities. It comprises the following fields:

id
an identifier, which is a non-empty string. This identifier must be unique among all types;
name
a human-readable name, which is a non-empty string.

Properties

A property represents a type of attribute that entities can have in the data source. It comprises the following fields:

id
an identifier, which is a non-empty string. This identifier must be unique among all properties;
name
a human-readable name, which is a non-empty string.

Property Values

A property value can be any of the following:

Identifier and Schema Spaces

A reconciliation service MUST define two URIs, exposed in its service manifest:

identifier space
A URI which identifies the sort of entity identifiers returned by this service. This URI MAY resolve to a page describing these entities and their identifiers;
schema space
A URI which identifies the ontology used by the service, in other words its collection of properties. This URI also MAY resolve to a page describing these properties and their identifiers.

If two different reconciliation services expose the same entities and properties, then they SHOULD use the same identifier and schema space URIs, signalling that (for instance) the Data Extension service of the first one can be used on reconciliation candidates by the second one.

The notions of identifier and schema space have been inherited from the API's original purpose, when it was specific to Freebase. Their original meaning was to be understood within Freebase's own data model.

Service Definition

This section documents how reconciliation services are exposed as HTTP(S) services and how they can announce the features of the API they implement.

Service Manifest

A service manifest consists of the following fields:

name
A human-readable name for the service, generally the name of the database it exposes. In the case where multiple reconciliation services exist for the same database, it is in the interest of a service to bear a meaningful name which will help disambiguate it from others;
identifierSpace
The identifier space used by the service, as a URI;
schemaSpace
The schema space used by the service, as a URI;
defaultTypes
A list of types which are considered sensible default choices as types supplied in reconciliation queries. For services which do not rely on types, this MAY contain a single type with a generic name making it clear that all entities in the database are instances of this type.
view
An optional object which contains a single field url. Its value is a view template for the service;
preview
A preview metadata, supplied if the service offers a preview service;
suggest
An optional object which may contain the following fields, depending on which suggest services are offered:
entity
A suggest metadata for auto-suggestion of entities;
property
A suggest metadata for auto-suggestion of properties;
type
A suggest metadata for auto-suggestion of types.
extend
A data extension metadata, supplied if the service offers a data extension service.

For instance, a service could expose the following minimal service manifest:
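As a hedged sketch of what such a manifest might look like, the Python snippet below builds one and serializes it to JSON. The service name, the URIs and the default type are placeholders. Treating name, identifierSpace, schemaSpace and defaultTypes as the required fields is an assumption read off the field list above (the other fields are described as optional); the manifest schema remains the authority.

```python
import json

# Assumption: these four fields are the ones not described as optional above.
REQUIRED_FIELDS = ("name", "identifierSpace", "schemaSpace", "defaultTypes")


def missing_manifest_fields(manifest: dict) -> list:
    """Return the required manifest fields absent from `manifest`."""
    return [field for field in REQUIRED_FIELDS if field not in manifest]


# All values below are placeholders for illustration.
minimal_manifest = {
    "name": "Example reconciliation service",
    "identifierSpace": "https://example.com/entity/",
    "schemaSpace": "https://example.com/property/",
    # A single generic type, as suggested for services that do not rely on types.
    "defaultTypes": [{"id": "thing", "name": "Thing"}],
}

serialized_manifest = json.dumps(minimal_manifest, indent=2)
```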


      

A more complete example, with some optional services implemented:


      

HTTP(S) Access

In the interest of protecting the data sent as reconciliation queries, all endpoints of reconciliation services SHOULD be available over HTTPS [[RFC7230]] [[SECURING-WEB]]. This does not apply to locally hosted services.

Cross-Origin Access

All HTTP(S) endpoints exposed by the service MUST support JSONP [[JSONP]], which enables web-based clients to access the service from a different domain.

Some clients might only require cross-origin access on some particular endpoints, which are called directly by a web UI. Since this depends on the architecture of the client, this cannot be relied upon and cross-origin access MUST be implemented for all endpoints in a uniform way.

In addition, endpoints SHOULD also enable access by CORS [[cors]] to enable newer web-based clients to access the service without exposing themselves to untrusted third-party code.

This can be achieved by adding the following HTTP headers to all HTTP responses produced by the service:

             Access-Control-Allow-Methods: GET, POST, PUT, DELETE, OPTIONS
             Access-Control-Allow-Headers: Origin, Accept, Content-Type, X-Requested-With, X-CSRF-Token
             Access-Control-Allow-Origin: *
           

CORS provides a safer way to expose cross-origin web services and SHOULD therefore be supported by reconciliation services, in the interest of other clients and potentially of newer versions of OpenRefine.
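The two cross-origin mechanisms can be combined in a single response helper, sketched below in Python. The header values are the ones listed above. The name of the JSONP parameter (callback) and the content types are assumptions, since this section does not fix them, and the helper is framework-agnostic: it only returns the headers and body a service would send.

```python
import json
import re

CORS_HEADERS = {
    "Access-Control-Allow-Methods": "GET, POST, PUT, DELETE, OPTIONS",
    "Access-Control-Allow-Headers": "Origin, Accept, Content-Type, X-Requested-With, X-CSRF-Token",
    "Access-Control-Allow-Origin": "*",
}

# Only accept callback names that look like JavaScript identifiers or dotted paths.
CALLBACK_PATTERN = re.compile(r"[A-Za-z_$][\w$.]*")


def render_response(payload, callback=None):
    """Return (headers, body) for a JSON payload, wrapped as JSONP if requested."""
    headers = dict(CORS_HEADERS)
    body = json.dumps(payload)
    if callback is None:
        headers["Content-Type"] = "application/json"
    else:
        # Hypothetical "callback" parameter supplied by a JSONP client.
        if not CALLBACK_PATTERN.fullmatch(callback):
            raise ValueError("invalid JSONP callback name")
        headers["Content-Type"] = "application/javascript"
        body = f"{callback}({body});"
    return headers, body
```

Rejecting unusual callback names is a design choice of this sketch: the callback is echoed into executable JavaScript, so it should not be copied through unchecked.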

Reconciliation Queries

This section specifies how clients can send reconciliation queries to services and how services respond to them.

Structure of a Reconciliation Query

A reconciliation query consists of:

query
A query string, consisting of a non-empty string, which is mandatory. By supplying such a string, a client intends to search for entities with similar names. The specifics of how this similarity is defined are determined by the service.
type
Optionally, a list of types. Supplying such types allows users to restrict the search to entities which bear those types. Whether this restriction should be a hard constraint or simply induce a change on the reconciliation scores can be determined by the service. In particular, services MAY return candidates which do not belong to any of the supplied types;
limit
Optionally, a limit on the number of candidates to return, which must be a positive integer;
properties
Optionally, a map from property identifiers to a property value (or a list of property values). These are used to further filter the set of candidates (similar to a WHERE clause in SQL), by allowing clients to specify other attributes of entities that should match, beyond their name in the query field. How reconciliation services handle this further restriction ("must match all properties" or "should match some"), and how it affects the score, is up to the service;
type_strict
Optionally, a type strictness parameter, which can be one of the strings "should", "all" or "any".

A reconciliation query batch is a set of reconciliation queries indexed by string identifiers.

Minimal example of a reconciliation query batch with mandatory fields only:


        

Full example of a reconciliation query batch with all optional fields:


        

For a single property it is possible to provide multiple values as a list. The values provided do not need to have the same type. In the following example a string and a reconciled value are provided as values for the same property.


	

A JSON schema to validate the serialization of a query batch is available.
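As a rough illustration of the structure above, the following Python sketch builds a small query batch and checks the constraints that are stated unambiguously: a non-empty query string, a positive integer limit, and one of the three type_strict values. The query strings and batch keys are placeholders. The type and properties fields are left out because their JSON shape is not spelled out in this section, and the check is no substitute for the JSON schema.

```python
import json

ALLOWED_TYPE_STRICT = {"should", "all", "any"}


def check_query(query: dict) -> list:
    """Return a list of problems found in a single reconciliation query."""
    problems = []
    text = query.get("query")
    if not isinstance(text, str) or not text:
        problems.append("query must be a non-empty string")
    limit = query.get("limit")
    if limit is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
            problems.append("limit must be a positive integer")
    strictness = query.get("type_strict")
    if strictness is not None and strictness not in ALLOWED_TYPE_STRICT:
        problems.append("type_strict must be one of should, all, any")
    return problems


# A batch is a set of queries indexed by string identifiers (placeholders here).
query_batch = {
    "q0": {"query": "Paris"},
    "q1": {"query": "Springfield", "limit": 5, "type_strict": "should"},
}

serialized_batch = json.dumps(query_batch)
```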

The meaning of type_strict is unclear: it is inherited from Freebase's API but is not used or documented in OpenRefine.

Reconciliation Candidates

A reconciliation candidate represents an entity as a response to a reconciliation query. It is proposed to the client as a potential matching entity for this query. It contains the following fields:

id
The identifier of the candidate entity;
name
The name of the candidate entity;
type
The types of the candidate entity;
score
A numeral indicating how well this candidate entity matches the query: a higher score indicates a better match;
match
A boolean matching decision, which indicates whether the service considers this candidate good enough to be chosen as a correct match.

Sending Reconciliation Queries to a Service

A reconciliation result is a set of reconciliation candidates. It is serialized in JSON as an array of such reconciliation candidate objects. This list SHOULD be sorted by decreasing score.

A reconciliation result batch is a set of reconciliation results indexed by string identifiers of the corresponding reconciliation query batch.

A JSON schema to validate the serialization of a reconciliation result batch is available.
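A service-side sketch of assembling a result batch, in Python, might look as follows. The candidates are invented placeholders, the function names are hypothetical, and the type field is left empty because this section does not specify how types are serialized. The only behaviour taken from the text is that each result is an array of candidates sorted by decreasing score and that results are indexed by the identifiers of the query batch.

```python
def build_result(candidates: list) -> list:
    """Order the candidates of one query by decreasing score."""
    return sorted(candidates, key=lambda candidate: candidate["score"], reverse=True)


def build_result_batch(candidates_by_query: dict) -> dict:
    """Map each query identifier to its sorted reconciliation result."""
    return {
        query_id: build_result(candidates)
        for query_id, candidates in candidates_by_query.items()
    }


# Placeholder candidates for a single query with identifier "q0".
result_batch = build_result_batch({
    "q0": [
        {"id": "1234", "name": "Paris (Texas)", "type": [], "score": 41.0, "match": False},
        {"id": "5678", "name": "Paris", "type": [], "score": 96.5, "match": True},
    ],
})
```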

The primary role of a reconciliation service is to translate reconciliation query batches to reconciliation result batches over HTTP.

A reconciliation service MUST support HTTP POST requests with application/x-www-form-urlencoded bodies containing a reconciliation query batch (serialized in JSON) in a form element named queries.

POST / queries=<URL-encoded reconciliation query batch>

Similarly, a reconciliation service SHOULD support HTTP GET requests with a reconciliation query batch in a query string parameter named queries.

GET /?queries=<URL-encoded reconciliation query batch>

In both cases, the service returns the corresponding reconciliation result batch serialized in JSON.

The POST method is the primary way to send reconciliation queries to a service since it does not restrict the length of the query batches. The GET method is useful for interactive debugging of reconciliation queries in a web browser, for instance.
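On the client side, both request forms can be prepared with the Python standard library alone. The sketch below only constructs the requests and does not send them; the endpoint URL and function names are placeholders. Sending would be a matter of passing the request to urllib.request.urlopen and decoding the JSON body of the response.

```python
import json
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def build_post_request(endpoint: str, query_batch: dict) -> Request:
    """Prepare a form-encoded POST carrying the batch in the `queries` field."""
    body = urlencode({"queries": json.dumps(query_batch)}).encode("ascii")
    return Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def build_get_url(endpoint: str, query_batch: dict) -> str:
    """Prepare the equivalent GET URL, handy for debugging in a browser."""
    return endpoint + "?" + urlencode({"queries": json.dumps(query_batch)})


# Placeholder endpoint and batch.
batch = {"q0": {"query": "Paris"}}
post_request = build_post_request("https://example.com/reconcile", batch)
get_url = build_get_url("https://example.com/reconcile", batch)
```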

A Note on Candidate Retrieval and Scoring

The way candidates are retrieved from the underlying database and scored against the query is left entirely at the discretion of the service. However, services should retrieve and score the candidates of each query in a batch independently of the other queries in the same batch, or in previous ones. It is also expected that reconciliation queries where query matches exactly the name of an entity in the database and with no other constraint should return at least this entity, unless it is hidden by many namesakes. Similarly, supplying an entity identifier as query should return the corresponding entity as a candidate, with a high score.

Deciding on a scoring method is one of the main difficulties in developing a reconciliation service. Many scoring strategies used in data matching [[christen-2012]] cannot be implemented easily in the context of this reconciliation API due to the separation of responsibilities between the client and the server.

Many open source reconciliation services are available and these might provide some inspiration concerning indexing and scoring methods when developing new services. See External Resources for some examples.

Preview Service

This section specifies how reconciliation services can provide embeddable HTML previews of their entities, which clients can display in their user interface.

Preview Metadata

Reconciliation services MAY offer a preview service by providing the preview metadata as an object stored in the service manifest under the key preview. It consists of the following fields, all mandatory:

url
A string containing {{id}} such that replacing {{id}} by an entity identifier yields the preview URL for that entity. This preview URL MUST resolve to an HTML page summarizing the entity. It SHOULD render appropriately in an <iframe> whose dimensions are specified by the service in the following fields;
width
The width in pixels of the <iframe> element where to render an entity preview;
height
The height in pixels of the same <iframe>.

For instance, a service may expose the following preview metadata:


      

Preview Queries

A preview service is queried by resolving the preview URL for an entity. The URL must resolve to an HTML document.
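A client might turn preview metadata into an embeddable element roughly as follows. This Python sketch uses placeholder metadata (the URL and the dimensions are not taken from any real service) and a hypothetical function name; it only shows the {{id}} substitution and the use of the declared width and height.

```python
from html import escape

# Placeholder preview metadata, as it could appear under the `preview` key.
preview_metadata = {
    "url": "https://example.com/preview/{{id}}",
    "width": 400,
    "height": 300,
}


def preview_iframe(metadata: dict, entity_id: str) -> str:
    """Build an <iframe> element pointing at the preview URL of an entity."""
    src = escape(metadata["url"].replace("{{id}}", entity_id), quote=True)
    width = int(metadata["width"])
    height = int(metadata["height"])
    return f'<iframe src="{src}" width="{width}" height="{height}"></iframe>'


iframe_html = preview_iframe(preview_metadata, "1234")
```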

For instance, assuming the example preview metadata above, the service could respond to a preview request as follows:

       

Suggest Services

This section specifies how reconciliation services can provide auto-complete endpoints for their entities, properties and types. A reconciliation service can offer a suggest service for any of these three classes. For instance, a service which only exposes a single type might not want to expose a suggest service for types. These suggest services can be used by clients to let users select an entity, property or type manually, at various stages of their reconciliation workflows. Suggest services for entities, properties and types are declared independently in the service manifest by providing a suggest metadata for them.

Suggest Metadata

A suggest metadata object consists of the following fields:

service_url
The base URL for the suggest service;
service_path
A URL path which will be concatenated to the service_url to obtain the full URL of the suggest service;
flyout_service_url
The base URL for the flyout service. If none is provided, it is assumed to be identical to service_url;
flyout_service_path
An optional URL path which will be concatenated to the flyout_service_url to obtain the full URL of the flyout service. The absence of this parameter indicates that no flyout service is associated with this suggest service.

For instance, a suggest metadata could be as follows:


Such metadata indicates that a suggest service is available at https://example.com/api/suggest with an associated flyout endpoint at https://example.com/api/suggest/flyout/${id}.
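The concatenation rules can be sketched in Python. The metadata object below is a guess at one that would yield the two URLs just mentioned; the exact split between base URL and path is an assumption, and the function names are hypothetical.

```python
from typing import Optional

# Assumed metadata producing the URLs mentioned above.
suggest_metadata = {
    "service_url": "https://example.com/api",
    "service_path": "/suggest",
    "flyout_service_path": "/suggest/flyout/${id}",
}


def suggest_endpoint(metadata: dict) -> str:
    """Full URL of the suggest service: base URL followed by the path."""
    return metadata["service_url"] + metadata["service_path"]


def flyout_endpoint(metadata: dict) -> Optional[str]:
    """Full URL template of the flyout service, or None if there is none."""
    path = metadata.get("flyout_service_path")
    if path is None:
        # No flyout path means no flyout service for this suggest service.
        return None
    # The flyout base URL defaults to the suggest base URL.
    base = metadata.get("flyout_service_url", metadata["service_url"])
    return base + path
```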
      

Suggest Queries

A suggest service MUST accept GET queries with the following URL-encoded parameters:

prefix
The string input by the user in the auto-suggest-enabled field;
cursor
An optional integer to specify the number of suggestions to skip: this can be used by clients to fetch more suggestions.
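A client could assemble such a GET query as sketched below in Python. The endpoint URL is a placeholder and the function name is hypothetical; only the prefix and cursor parameter names come from the list above.

```python
from typing import Optional
from urllib.parse import urlencode


def suggest_query_url(endpoint: str, prefix: str, cursor: Optional[int] = None) -> str:
    """Build the URL of a suggest query, URL-encoding its parameters."""
    params = {"prefix": prefix}
    if cursor is not None:
        # Number of suggestions to skip, e.g. to fetch the next page.
        params["cursor"] = cursor
    return endpoint + "?" + urlencode(params)


first_page = suggest_query_url("https://example.com/api/suggest", "new yo")
next_page = suggest_query_url("https://example.com/api/suggest", "new yo", cursor=10)
```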

Suggest Responses

A response to a suggest query consists of the following fields:

result
A list of items, which can be entities, properties or types depending on which of these the service is provided for. Each such object can contain the following fields:
id
The identifier of the entity, property or type suggested;
name
Its corresponding human-readable name, to be displayed prominently to the user;
description
An optional description which can be provided to disambiguate namesakes, providing more context. This could for instance be displayed underneath the name;
notable
When suggesting entities only, this field can be used to supply some important types (not necessarily all types) of the suggested entity. The value must be an array of either type identifiers (as strings) or type objects, containing an id and a name field which represent the type.

The key notable comes from a notion of notable types that existed in Freebase.
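Because notable may mix bare identifiers and type objects, a client will probably want to normalize it before display. The Python sketch below is one way to do so; the function name is hypothetical, and falling back to the identifier when no name is available is a choice of this sketch, not something the API prescribes.

```python
def normalize_notable(notable: list) -> list:
    """Convert a `notable` array into a uniform list of {id, name} objects."""
    normalized = []
    for item in notable:
        if isinstance(item, str):
            # A bare type identifier: no name is known, so reuse the identifier.
            normalized.append({"id": item, "name": item})
        else:
            normalized.append({"id": item["id"], "name": item["name"]})
    return normalized


# Placeholder suggestion for an entity, mixing both allowed forms.
suggestion = {
    "id": "1234",
    "name": "Paris",
    "description": "capital of France",
    "notable": ["city", {"id": "capital", "name": "Capital city"}],
}
types = normalize_notable(suggestion["notable"])
```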

For instance, a suggest service for entities could return the following response:


        

A suggest service for properties could return the following response:


        

And a suggest service for types could return the following response:


        

JSON schemas to validate suggest responses are available for entities, for properties and for types.

General Expectations about Suggest Services

It is generally expected by users that an entity suggest query where prefix is the name of an entity should return this entity in the suggest response, unless that entity is hidden behind many other namesakes. Similarly, supplying an entity identifier as prefix should return this entity in the suggest response. Analogous expectations apply for property and type suggest services.

As the prefix name suggests, suggest services are expected to perform prefix search on their database of records, such that a suggest service can be used to provide auto-completion as users type names or identifiers in a field.
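A toy in-memory version of these expectations is sketched below in Python. The records, the page size and the case-insensitive comparison are all assumptions made for the example; a real service would typically delegate prefix search to its database or search index.

```python
def suggest(records: list, prefix: str, cursor: int = 0, page_size: int = 2) -> dict:
    """Return records whose name or identifier starts with `prefix`."""
    needle = prefix.casefold()
    matches = [
        record
        for record in records
        if record["name"].casefold().startswith(needle)
        or record["id"].casefold().startswith(needle)
    ]
    # `cursor` is the number of suggestions to skip.
    return {"result": matches[cursor:cursor + page_size]}


# Placeholder records standing in for a service's database.
records = [
    {"id": "c1", "name": "Paris"},
    {"id": "c2", "name": "Parma"},
    {"id": "c3", "name": "Porto"},
    {"id": "c4", "name": "Paraty"},
]
```

With these records, a query for "par" yields Paris and Parma first, and the same query with a cursor of 2 yields Paraty.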

Flyout Services

A flyout service provides small previews of suggested elements. These previews are designed to be shown when hovering over a suggested element. When a suggest service supports flyout, it declares the flyout endpoint in its suggest metadata.

A preview for a suggested entity, property or type can then be obtained at the flyout endpoint by replacing ${id} by the identifier for the entity, property or type. Upon a GET query to this URL, the service returns a JSON response consisting of an object with the following fields:

id
The identifier supplied in the URL;
html
A string containing HTML code that can be used to display a small preview alongside the suggested entity, property or type.

For instance, if a service's flyout endpoint is https://example.com/suggest/entities/flyout?id=${id}, then by retrieving https://example.com/suggest/entities/flyout?id=Q38274, one might get the following response:
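The client side of this exchange can be sketched in Python. The endpoint template and the identifier are the ones from the sentence above; the HTML in the sample response is a made-up placeholder, and the helper names are hypothetical.

```python
def flyout_url(endpoint_template: str, identifier: str) -> str:
    """Replace ${id} in the flyout endpoint by the given identifier."""
    return endpoint_template.replace("${id}", identifier)


def is_valid_flyout_response(response: dict, identifier: str) -> bool:
    """Check that a flyout response echoes the identifier and carries HTML."""
    return response.get("id") == identifier and isinstance(response.get("html"), str)


url = flyout_url("https://example.com/suggest/entities/flyout?id=${id}", "Q38274")

# Placeholder response body, after JSON decoding.
sample_response = {"id": "Q38274", "html": "<p>Short description of the entity</p>"}
```

Since the html field can contain arbitrary markup, a cautious client would sanitize it before inserting it into a page.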


	

Flyout services were used by Freebase and are mostly redundant with the description field in suggest responses. Given that they allow services to return arbitrary HTML content, they also pose a security threat to clients. It is therefore proposed that this functionality is dropped in the future.

Data Extension Service

This section specifies how reconciliation services can let clients fetch the values of some properties on a selection of entities.

A data extension service MUST support data extension query requests.

A data extension service SHOULD provide data extension property proposals.

A data extension service MAY support data extension property settings.

Data Extension Metadata

The data extension metadata is an object stored in the service manifest in the extend field. It consists of the following settings, all optional:

propose_properties
A service path object defining a URL which implements data extension property proposal, which consists of:
service_url
The root URL of the service;
service_path
The path to the endpoint for property proposals.
The full URL for data extension property proposals is obtained by concatenating these two fields.
property_settings
A list of data extension property settings.

A data extension property setting consists of:

name
A name for the setting, which identifies the setting uniquely;
label
A human-readable label, which is used when presenting the setting to the user in a form;
type
A data type, which can be one of the strings "number", "text", "checkbox", or "select". This determines which type of value the property setting is expected to store: clients SHOULD render this setting with the corresponding HTML element;
default
A default value for the setting, when not provided or left untouched by the user;
help_text
A help text, which describes the meaning of the field to the user. This is meant to be a short string that can be displayed alongside the corresponding form field;
choices
If type is select, a list of property setting choices.

Example of data extension metadata with all optional fields:
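One hedged possibility is sketched below in Python, together with a small consistency check. Every URL, name and label is a placeholder. The shape of the entries in choices (a value and a name) is an assumption, since property setting choices are not defined in this section.

```python
SETTING_TYPES = {"number", "text", "checkbox", "select"}


def check_property_setting(setting: dict) -> list:
    """Return a list of problems found in one property setting."""
    problems = []
    if setting.get("type") not in SETTING_TYPES:
        problems.append("type must be one of " + ", ".join(sorted(SETTING_TYPES)))
    if setting.get("type") == "select" and not setting.get("choices"):
        problems.append("a select setting needs a list of choices")
    return problems


# Placeholder metadata, as it could appear under the `extend` key.
extend_metadata = {
    "propose_properties": {
        "service_url": "https://example.com/api",
        "service_path": "/propose_properties",
    },
    "property_settings": [
        {
            "name": "limit",
            "label": "Limit",
            "type": "number",
            "default": 0,
            "help_text": "Maximum number of values to return per row",
        },
        {
            "name": "order",
            "label": "Order",
            "type": "select",
            "default": "any",
            "help_text": "Which values to return first",
            # Assumed shape for choices: a stored value and a displayed name.
            "choices": [
                {"value": "any", "name": "Any order"},
                {"value": "newest", "name": "Newest first"},
            ],
        },
    ],
}
```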


        

Data Extension Property Proposals

A data extension property proposal service returns properties for a given type identifier.

The service MUST support HTTP GET requests with a `type` query string parameter.

The service SHOULD support an optional `limit` query string parameter to control the number of proposed properties.

The service URL and path are declared in the data extension metadata of the service manifest.

GET /properties?type=<type identifier>[&limit=<limit>]

A data extension property proposal response consists of:

properties
An array of proposed properties. These properties are suggested as fields that could be potentially fetched via data extension for entities of the type provided in the query;
type
The type identifier supplied in the query;
limit
Optionally, the requested limit.
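The request and response can be illustrated with a short Python sketch. The service URL, type identifier and proposed properties are placeholders and the function name is hypothetical; the response dictionary simply mirrors the three fields listed above, with properties carrying an id and a name as defined in the Properties section.

```python
from typing import Optional
from urllib.parse import urlencode


def propose_properties_url(extend_metadata: dict, type_id: str, limit: Optional[int] = None) -> str:
    """Build the URL of a property proposal request from the extend metadata."""
    service = extend_metadata["propose_properties"]
    # The full URL is the concatenation of service_url and service_path.
    base = service["service_url"] + service["service_path"]
    params = {"type": type_id}
    if limit is not None:
        params["limit"] = limit
    return base + "?" + urlencode(params)


# Placeholder metadata and request.
metadata = {
    "propose_properties": {
        "service_url": "https://example.com/api",
        "service_path": "/properties",
    }
}
request_url = propose_properties_url(metadata, "city", limit=3)

# Placeholder response mirroring the fields described above.
sample_proposal = {
    "type": "city",
    "limit": 3,
    "properties": [
        {"id": "population", "name": "Population"},
        {"id": "country", "name": "Country"},
    ],
}
```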

Example of a data extension property proposal response:


        

Data Extension Query Requests

A data extension query request lets clients fetch the values of some properties on a selection of entities.

The fact that a reconciliation service offers data extension MUST be announced by including a data extension metadata in the extend field of the service manifest.

A data extension service MUST support HTTP POST requests with application/x-www-form-urlencoded bodies containing a data extension query in a form element named extend.

POST / extend=<URL-encoded data extension query>

A data extension service SHOULD support HTTP GET requests with a data extension query in a query string parameter named extend.

GET /?extend=<URL-encoded data extension query>

A data extension query consists of:

Example of a data extension query:


        

Data Extension Responses

A data extension response consists of metadata and rows.

The metadata contains the properties used for data extension, as requested in the data extension query. If properties have entities as values, they MAY specify a type in the metadata.

The rows object contains, for each entity identifier in the data extension query, for each property identifier in the metadata, the property values of that property in that entity. If the property values are entities, their identifiers MUST be in the service's identifier space.
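To make the nesting concrete, the Python sketch below flattens a response into a table with one row per entity and one column per property. The key names meta and rows, the identifiers and the values are all assumptions or placeholders, and property values are treated as opaque; the data extension response schema remains the reference for the exact serialization.

```python
def flatten_extension_response(response: dict):
    """Return (property ids, table rows) from a data extension response."""
    # Assumed key names: "meta" for the metadata and "rows" for the rows object.
    property_ids = [prop["id"] for prop in response["meta"]]
    table = []
    for entity_id, values_by_property in response["rows"].items():
        # Each cell holds the (possibly empty) list of values of one property.
        cells = [values_by_property.get(pid, []) for pid in property_ids]
        table.append([entity_id] + cells)
    return property_ids, table


# Placeholder response for two entities and two properties.
sample_response = {
    "meta": [
        {"id": "population", "name": "Population"},
        {"id": "country", "name": "Country"},
    ],
    "rows": {
        "1234": {"population": [2100000], "country": ["5678"]},
        "4321": {"population": [], "country": ["5678"]},
    },
}
columns, table = flatten_extension_response(sample_response)
```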

Response example for the data extension query from the previous example:


        

JSON Schemas

This appendix provides JSON schemas [[json-schema]] which can be used to validate the JSON serialization of various elements as specified by these specifications.

Manifest Schema

The manifest schema can be used to validate a service manifest.

      

Reconciliation Query Batch Schema

The reconciliation query batch schema can be used to validate the JSON serialization of any reconciliation query batch, i.e. the payload of a GET/POST to the reconciliation endpoint.

      

Reconciliation Result Batch Schema

The reconciliation result batch schema can be used to validate the JSON serialization of any reconciliation result batch.

      

Suggest Entities Response Schema

The suggest entities response schema can be used to validate the JSON serialization of any suggest response for entities.

      

Suggest Properties Response Schema

The suggest properties response schema can be used to validate the JSON serialization of any suggest response for properties.

      

Suggest Types Response Schema

The suggest types response schema can be used to validate the JSON serialization of any suggest response for types.

      

Data Extension Query Schema

The data extension query schema validates data extension queries.

      

Data Extension Response Schema

The data extension response schema validates data extension responses.