Author: rwesten
Date: Mon May 23 12:49:51 2011
New Revision: 1126476
URL: http://svn.apache.org/viewvc?rev=1126476&view=rev
Log:
Proposal for using Linked Data / Linked media principles for the RESTful
service interface of the Stanbol Entityhub
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/entityhub/
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/entityhub/entityhubandlinkeddata.mdtext
Added:
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/entityhub/entityhubandlinkeddata.mdtext
URL:
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/entityhub/entityhubandlinkeddata.mdtext?rev=1126476&view=auto
==============================================================================
---
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/entityhub/entityhubandlinkeddata.mdtext
(added)
+++
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/entityhub/entityhubandlinkeddata.mdtext
Mon May 23 12:49:51 2011
@@ -0,0 +1,264 @@
+Adopting Linked Media principles for Stanbol Entityhub
+======================================================
+
+[Linked Data](http://linkeddata.org/) describes the idea of linking formerly
unconnected bits of data over the web. Think about how hyperlinks are used to
navigate within the Web of documents. Linked Data tries to do the same for the
Web of Data. This basic idea is also central to most of the Apache Stanbol
components. However, Stanbol is concerned not only with linking data but
also with interlinking the web of documents with the web of data. Therefore
[this
proposal](http://lists.w3.org/Archives/Public/public-lod/2011May/0019.html) to
extend Linked Data principles to also support content and not just data seems
like a natural fit for Apache Stanbol.
+
+This document first provides a short introduction to Linked Data and the
proposed Linked Media extensions. The second part of the document analyzes the
requirements of the Stanbol Entityhub related to Linked Data and Linked Media.
The third section then goes into more detail on how Linked Media principles
could be implemented by the Entityhub.
+
+
+Short Introduction to Linked Data and proposed Linked Media extensions
+----------------------------------------------------------------------
+
+from [linkeddata.org](http://linkeddata.org/faq)
+> ### What is Linked Data?
+> The Web enables us to link related documents. Similarly it enables us to
link related data.
+> The term Linked Data refers to a set of best practices for publishing and
connecting structured data on the Web.
+> Key technologies that support Linked Data are URIs (a generic means to
identify entities or
+> concepts in the world), HTTP (a simple yet universal mechanism for
retrieving resources,
+> or descriptions of resources), and RDF (a generic graph-based data model
with which to
+> structure and link data that describes things in the world).
+
+The following terminology is often used with Linked Data:
+
+* Resources: All items of interest that are to be published on the Web.
+* Information Resources: All documents on the Web (text, images, videos ...)
+* Non-Information Resources: Real-world objects that exist outside of the Web
(Persons, Organizations, Places ...) but also social concepts (Categories,
Terminologies ...).
+* Resource Identifiers: Linked Data recommends using only HTTP URIs as
identifiers because this allows information about the resource to be accessed
directly over the web.
+* Representation: A stream of bytes in a certain format that describes an
Information Resource. Representations can be available in different formats.
+* Dereferencing of HTTP URIs: For Information Resources the content is
directly returned. For Non-Information Resources the HTTP status "303 See
Other" with a link to the Information Resource describing the Non-Information
resource is returned.
+* Content Negotiation: Users can select the format (content type) of the
returned Representations by setting the "Accept" header in requests. Linked
Data recommends using different URIs for Representations of different content
types to allow bookmarking. The "Accept" header sent with the request is
therefore used to decide which URI is returned with a "303 See Other" response.
+* URI Aliases: If different providers publish information about the same
Non-Information Resource (e.g. a famous Person, a Country, ...) then
"[owl:sameAs](http://www.w3.org/TR/owl-ref/#sameAs-def)" relations are used to
tell clients that two different Resource Identifiers (HTTP URIs) identify the
same Resource.
+
+A more detailed overview is provided by the [Linked Data
Tutorial](http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/).
+
+### Linked Media
+
+The [Linked Media
proposal](http://lists.w3.org/Archives/Public/public-lod/2011May/0019.html)
tries to extend Linked Data by two features.
+
+1. Creating and updating of resources: Linked Data currently covers only
retrieval of information, which is sufficient for sites like
[DBpedia](http://dbpedia.org) or [Geonames](http://www.geonames.org) where
users are only able to consume data. When creating interactive (web)
applications one needs to be able to create, update and remove information.
These features are currently not covered by Linked Data, but are well defined
for RESTful services. The Linked Media proposal therefore suggests using HTTP
PUT, POST and DELETE requests for this purpose.
+2. Handling both content and metadata: Linked Data uses Content Negotiation to
select suitable content types. In addition it provides means to redirect to
Information Resources about Non-Information Resources. However, Linked Data does
not differentiate between metadata and content. One cannot explicitly ask
first for a GIF image and later for its metadata as RDF, or first for an HTML
blog post and later for its metadata formatted as HTML. Such a differentiation
is only supported for Non-Information Resources, e.g. for a famous painting
(Non-Information Resource) and a photo of it (Information Resource). Linked
Media proposes to use the "rel" parameter of the Accept header to allow users to
explicitly ask for content ("Accept: type/subtype; rel=content") or metadata
("Accept: type/subtype; rel=meta").
+
+For a more detailed description please follow the link to the [Linked Media
proposal](http://lists.w3.org/Archives/Public/public-lod/2011May/0019.html) [1]
as posted by Sebastian Schaffert on the linked open data mailing list of the
W3C. You might also be interested in reading the discussion that followed. Note
also
[ResourceWebService](http://code.google.com/p/kiwi/source/browse/kiwi-core/src/main/java/kiwi/core/webservices/resource/ResourceWebService.java)
[2] a first implementation of the Linked Media proposal based on the
[Kiwi2/Linked Media Framework](http://code.google.com/p/kiwi/) [3][4].
+
+Requirements of the Stanbol Entityhub
+-------------------------------------
+
+This section tries to identify requirements of the Stanbol Entityhub related
to Linked Data and Linked Media. The goal of this analysis is to identify where
it makes sense to adopt Linked Data/Media principles for the RESTful interface
of the Entityhub.
+
+The Entityhub fulfills two requirements:
+
+1. It allows defining and managing a network of referenced sites from which
information about entities is retrieved. In addition the Entityhub supports the
use of local caches to speed up access and to become independent of the
availability of remote services.
+2. It manages its own (local) site that is used to manage local entities. Such
entities can be created locally but it is also possible to import them from any
referenced site. Typical examples of locally managed entities are customers,
employees, concepts of a company thesaurus, offices, meeting rooms ...
+
+### Entity Model of the Entityhub
+
+Each Entity managed by the Entityhub first of all defines a unique ID. In case
the referenced site follows Linked Data principles this will be the HTTP URI of
the Non-Information Resource. However, it might be any valid URI (including
URNs). The URI prefix of locally managed entities is configurable. Therefore the
URI type of locally managed entities depends on the configuration. The Entity
itself represents a Non-Information Resource. Each Entity comes with a
Representation. The Representation holds all information known by the site
about the entity. In Linked Data terminology the Representation is the
Information Resource a user needs to be redirected to when requesting the Entity
(Non-Information Resource). Finally an Entity also links to the ID of the
(referenced) site managing it. This allows users to track who is providing the
information for an Entity.
+
+Currently the Entityhub distinguishes three different types of Entities:
+
+1. Sign: All Entities managed by referenced sites
+2. Symbol: All locally managed Entities. Symbols hold additional metadata such
as a preferred label and a state.
+3. EntityMapping: Mappings from Symbols to Signs. Linked Data typically uses
owl:sameAs to define such mappings; however, in the case of the Entityhub such
mappings need to hold additional meta information such as the state, the expiry
date of the mapping ...
+
+Metadata such as license, copyright statements, attributions as well as
information about the organization managing a referenced site is managed with
referenced sites and not with single entities.
+
+All the additional information provided by these three Entity types as well as
the additional metadata provided for referenced sites is, in terms of Linked
Data principles, metadata about the Information Resource - the Representation -
and not about the Non-Information Resource - the Entity.
+
+Therefore the Entityhub manages:
+
+* Non-Information Resources: All the Entities of referenced Sites as well as
locally managed Entities
+* Content: All Representations about Entities
+* Metadata: Additional information about Representations such as license,
copyrights, attributions as well as mappings to other entities.
+
+### RESTful Services of the Entityhub
+
+The Entityhub defines the following service endpoints:
+
+1. The (referenced) Site Manager: Provides retrieval and search over all
referenced sites.
+2. (referenced) Site Endpoint: Provides the same interface but for a specific
referenced site.
+3. The Entityhub Endpoint: Provides full read/write and retrieval access for
locally managed Entities.
+
+Therefore the Entityhub needs to support read-only access for Entities managed
by referenced sites and full read/write access (CRUD) for locally managed
Entities.
+
+### Summary
+
+Consuming Linked Data:
+
+* Consume Linked Data from remote sites
+* Search resources on remote sites based on labels/language and type (by using
SPARQL)
+
+Referenced Entities (Entities of Referenced Sites)
+
+* Support local management of additional metadata for referenced entities
(e.g. mappings to local entities)
+* Support merging of remote metadata (e.g. defined by "foaf:primaryTopic")
with local ones (e.g. mappings to local entities)
+* Provide Content + Metadata - as proposed by Linked Media - even for
referenced entities.
+* Support Search for Entities based on labels/language and type
+
+Local Entities (Entities managed by the Entityhub)
+
+* Provide local Entities as Linked Media (full CRUD support; management of
Content and Metadata)
+* Support creation of local entities based on referenced ones
+* Support finding of additional mappings based on owl:sameAs relations
+* Support importing of metadata for mapped entities (e.g. to correctly handle
attribution requirements)
+* Support Enabling/disabling the use of redirects
+* Support Search for Entities based on labels/language and type
+
+Based on this evaluation of the Model and the Services provided by the
Entityhub, the proposed Linked Media extension to the Linked Data principles
would be sufficient to cover most of the functionality exposed by the
Entityhub as RESTful services. For referenced Sites only the distinction
between Metadata and Content is needed. For locally managed Entities the
possibility to create, update and remove Entities, their Representation
(content) and metadata is also of central importance. The main functionality
not covered is the import of Entities from referenced sites. Also, for
functionality like the creation of mappings and the management of the Entity
workflow, special additions to the generic Linked Media/Linked Data API would be
useful.
+
+
+Specific Considerations
+-----------------------
+
+This section contains Entityhub specific considerations about some of the
principles defined for Linked Data and Linked Media.
+
+### Resource Identifier
+
+Linked Data defines the principle of using HTTP URIs as Resource Identifiers so
that one can retrieve data by directly accessing the URI of a resource. This
does not work for the Entityhub because it also needs to manage remote
entities, and even for local entities it will not always be an option. Because
of that the RESTful interface also needs to support an alternative that allows
passing the URI of an entity as a parameter. This is also required so that the
IDs of entities are not affected when the Entityhub is deployed on a different
host or even accessed via localhost. In addition this allows the use of other
URI types (mainly URNs but also other protocols such as LDAP) as identifiers for
locally managed entities.
+
+### Redirects for Content Negotiation
+
+It is important to consider that Entities are Non-Information Resources and
that, based on Linked Data principles, requests for Non-Information Resources
need to be answered with redirects ("303 See Other") to the URI of the
Information Resource. In practice such redirects are used for two things:
+
+1. To allow users to directly access (and bookmark) URIs of a specific format
and therefore bypass content negotiation. This is mainly because browsers do
not allow users to set the "Accept" header. Without this indirection typical
users would be unable to retrieve formats other than HTML.
+
+ For the Entityhub, where most of the requests will be issued by clients
that support the usage of "Accept" headers, the usage of redirects seems
unfavorable for two reasons. First, it doubles the number of requests and adds
an additional RTT (round trip time). Second, browsers always issue a GET
request when following a redirect, independent of the method of the initial
request. This can cause problems when returning redirects for POST, PUT and
DELETE requests. Because of this it would make sense for the Entityhub to
provide the possibility to deactivate/activate the usage of redirects (e.g. via
a configuration, a request property or even a header field).
+
+2. To attach metadata to the Information Resources. As an example take the
[Linked Data endpoint of the New York Times](http://data.nytimes.com). It uses
"http://data.nytimes.com/{uuid}" for Entities and
"http://data.nytimes.com/{uuid}.rdf" for the RDF/XML representations. When
looking at the representations provided for Entities (e.g. take [North
Carolina](http://data.nytimes.com/N25800450843199534421)) one can see that
triples using "http://data.nytimes.com/{uuid}" as subject are data about North
Carolina, whereas triples that use "http://data.nytimes.com/{uuid}.rdf" as
subject represent metadata. Note also that the metadata is connected to the
representation of North Carolina by the
[foaf:primaryTopic](http://xmlns.com/foaf/0.1/primaryTopic) relation.
+
+ When using the extensions proposed by Linked Media, it would be possible
to directly refer to the metadata by setting the "rel" parameter of the
"Accept" header to "meta". A request defining "Accept:
application/rdf+xml; rel=meta" would therefore - assuming that redirects are
deactivated - directly return the metadata for the requested entity (e.g. the
license) encoded as RDF/XML. In case redirects are enabled it would return a
"303 See Other" with the URI of the metadata.
+
+ Note that - in principle - there are two kinds of redirects: (1) redirects
between Resources, which includes redirects from Entities to the Representation
("rel=content") as well as to the Metadata ("rel=meta"); (2) redirects used for
Content Negotiation. It would therefore be possible to allow enabling/disabling
these two types separately.
+
+ Also note that in cases where several redirects would be needed to reach
the final resource (e.g. when requesting information about a Non-Information
Resource in "text/html": Non-Information Resource -> Information Resource
-> HTML version) the request will directly return the final
destination.
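+
+The behaviour described above can be sketched as follows. This is only an
illustration under assumptions: the function name `resolve` and its parameters
are hypothetical, and the ".rep" and ".meta" suffixes are the ones proposed in
the "URI scheme for Resources" section of this document. With redirects enabled
a single "303 See Other" points straight at the final destination; with
redirects disabled the same target is answered directly.
+

```python
def resolve(localname, rel="content", extension=None, redirects=True):
    """Return (status, target) for a GET on the Entity URI (hypothetical helper)."""
    # Step 1: Entity -> Information Resource (representation or metadata).
    suffix = ".meta" if rel == "meta" else ".rep"
    target = localname + suffix
    # Step 2: Information Resource -> concrete format, if one was negotiated.
    if extension:
        target += "." + extension
    # Both steps are collapsed: at most one redirect is ever returned.
    status = 303 if redirects else 200
    return status, target
```

+
+For example, `resolve("Paris", extension="html")` yields a single 303 to
"Paris.rep.html" instead of two chained redirects.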
+
+
+Redesigning the Entityhub
+-------------------------
+
+This section evaluates necessary changes to the Entityhub.
+
+### URI scheme for Resources
+
+The support of Linked Data requires the use of a local URI. This is in
contrast to the parameter based approach ("?id={remoteURI}") as currently used
by the Entityhub. The goal is that the Entityhub allows both variants
+
+ http://{host}/entityhub/{site}/entity/{localname} and
+ http://{host}/entityhub/{site}/entity?uri={uri}
+
+to refer to an Entity. This requires that the Entityhub provides a local HTTP
URI for any (local or remote) entity. The suggestion is to use the local name of
the remote entity or the MD5 of the whole URI in cases where this is not
possible.
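+
+A minimal sketch of this suggestion, under assumptions the proposal does not
spell out: the local name is taken to be the part after the last "#" or "/",
and it counts as usable only if it matches a conservative character set. The
function name `local_name_for` and the pattern `_SAFE_NAME` are hypothetical.
+

```python
import hashlib
import re

# Assumption: a usable local name contains only these characters. Dots are
# excluded so that the name cannot be confused with a suffix such as ".rep".
_SAFE_NAME = re.compile(r"^[A-Za-z0-9_-]+$")


def local_name_for(uri: str) -> str:
    """Return the local name of `uri`, or the MD5 hex digest of the whole URI."""
    # Everything after the last '#' or '/' is the candidate local name.
    candidate = re.split(r"[#/]", uri)[-1]
    if _SAFE_NAME.match(candidate):
        return candidate
    # Fallback: hash the complete URI.
    return hashlib.md5(uri.encode("utf-8")).hexdigest()
```

+
+Note that under this rule two remote URIs sharing the same local name would map
to the same local URI; the proposal does not say how such cases are handled.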
+
+To support the redirects as defined by Linked Data it is also necessary to
generate separate URIs for Representations. To support the differentiation
between Content and Metadata we also need a separate URI for the metadata.
+
+The proposal is to use file-extension-like additions to the local name of
Entities:
+
+ http://{host}/entityhub/{site}/entity/{localname}.rep
+
+is used to directly refer to the Representation of an Entity - in Linked Media
terminology the Information Resource. Note that the local HTTP URI is used as
the base for the ".rep" extension. "?uri={uri}.rep" will not be supported. Users
of the Entityhub can therefore use the ".rep" extension to directly access the
content for an Entity. Note that content negotiation will still be needed when
requesting this kind of URI.
+
+Similar to the above the ".meta" extension will be used for constructing URIs
for the metadata:
+
+ http://{host}/entityhub/{site}/entity/{localname}.meta
+
+For referenced entities such representations will be created by merging remote
metadata with locally managed metadata. Remote metadata will be recognized by
Resources with a [foaf:primaryTopic](http://xmlns.com/foaf/0.1/primaryTopic)
relation to the Entity. Local metadata can include information known for the
referenced site (e.g. license, copyright, attributions, information about the
managing organization ...) as well as mappings to other (locally managed)
entities.
+
+For locally managed Entities the metadata will also include all the additional
information as currently defined by the Symbol API (state, predecessors,
successors).
+
+Note that the URIs for Representations and Metadata are optional and will be
omitted based on HTTP request headers in case redirects are disabled. However
even in case that redirects are disabled it is still possible to use such URIs
for requests.
+
+### URI scheme for Content Negotiation
+
+To conform to the Linked Data principles the Entityhub needs to provide
unique HTTP URIs for every content type in which Information Resources (Content
and Metadata Resources) can be serialized. As with the ".rep" and ".meta"
extensions used to directly access Representations and their Metadata, the
proposal is to use file extensions to indicate the media type. In case users
wish to pass the remote URI as a parameter it is also possible to pass the
extension or the media type as a parameter.
+
+ http://{host}/entityhub/{site}/entity/{localname}.{extension} or
+
http://{host}/entityhub/{site}/entity?uri={uri}&format={extension}&mediaType={mediatype}
+
+This shows the case where the extension is directly added to the local URI of
the entity. In this case the "rel" parameter of the Accept header would be used
to determine whether the content - the representation - or the metadata needs
to be encoded in the response. If not specified the representation will be
returned.
+
+To also allow directly addressing the representation or the metadata in a
specific format the Entityhub also supports the following two variants:
+
+ http://{host}/entityhub/{site}/entity/{localname}.rep.{extension}
+ http://{host}/entityhub/{site}/entity/{localname}.meta.{extension}
+
+Note that the URIs used for content negotiation are optional and will be
omitted based on HTTP request headers in case redirects are disabled. However
even in case that redirects are disabled it is still possible to use such URIs
for requests.
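+
+The URI patterns of this section could be taken apart as sketched below. This
is an illustration only: `EntityPath` and `split_entity_path` are hypothetical
names, and the sketch assumes that the local name itself contains no dots.
+

```python
from typing import NamedTuple, Optional


class EntityPath(NamedTuple):
    localname: str
    resource: Optional[str]   # "rep", "meta" or None (the Entity itself)
    extension: Optional[str]  # media type extension such as "rdf", or None


def split_entity_path(segment: str) -> EntityPath:
    """Split '{localname}[.rep|.meta][.{extension}]' into its parts."""
    parts = segment.split(".")
    localname, rest = parts[0], parts[1:]
    resource = None
    # A leading "rep" or "meta" selects the representation or the metadata.
    if rest and rest[0] in ("rep", "meta"):
        resource = rest.pop(0)
    # Whatever remains is treated as the media type extension.
    extension = ".".join(rest) if rest else None
    return EntityPath(localname, resource, extension)
```

+
+When `resource` is None, the "rel" parameter of the Accept header would decide
between representation and metadata, as described above.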
+
+### HTTP Request/Response Headers with special use
+
+This section provides information about header fields that are specially
evaluated by the Entityhub. The normal evaluation of headers as specified by
[RFC2616 section 14](http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html),
e.g. the use of Content-Type to read data sent with PUT/POST requests, is not
described.
+
+#### [Accept
header](http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.1)
+
+The Accept header allows the client to specify the media type of the content
it expects in the response. The [Linked Media
proposal](http://lists.w3.org/Archives/Public/public-lod/2011May/0019.html)
suggests using the "rel" parameter to specify whether the response should return
the data or the metadata of the requested resource. The semantics of the "rel"
parameter are defined for the Link header by
[RFC5988](http://www.ietf.org/rfc/rfc5988.txt). A related example can be found
on the [LinkHeader](http://www.w3.org/wiki/LinkHeader) page on the W3C wiki.
+
+The pattern usable for the Accept header looks like
+
+ Accept: {media-type}[; rel=meta]
+
+If no "rel" parameter is specified the Entityhub will return the data
(the representation of the entity) by default. If users want to retrieve the
metadata they need to add "rel=meta". The {media-type} is always applied to the
information selected by the "rel" parameter.
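+
+As a rough sketch of how such a header value could be evaluated: the function
`parse_accept` is hypothetical and, for brevity, assumes a header with a single
media type entry (no comma-separated alternatives). Other parameters such as
"q" are ignored.
+

```python
def parse_accept(value: str):
    """Return (media_type, rel) for a single-entry Accept header value."""
    media_type, *params = [part.strip() for part in value.split(";")]
    rel = "content"  # default: return the representation of the entity
    for param in params:
        name, _, val = param.partition("=")
        if name.strip().lower() == "rel":
            # Accept both rel=meta and rel="meta".
            rel = val.strip().strip('"')
    return media_type, rel
```
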
+
+####
[Cache-Control](http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9)
+
+The Entityhub supports the following cache-request-directives to give clients
some control over the local caching of entities managed by remote sites. Note
that the Stanbol OFFLINE mode takes precedence over Cache-Control
specifications.
+* no-cache: Entities are retrieved from the remote site even if a local cache
exists (if Stanbol is not in OFFLINE mode)
+* no-store: Entities retrieved from a remote site are not cached locally (if
Stanbol is not in OFFLINE mode)
+* no-transform: The Entityhub may be configured to transform/filter
information from the remote site. This directive can be used to bypass this kind
of transformation. In case transformations are used for the local cache,
this directive will not work if Stanbol operates in OFFLINE mode
+* only-if-cached: Representations are only returned if they are available in
the local cache.
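+
+The interplay of these directives with the OFFLINE mode might be sketched as
follows. The function `cache_policy` and the keys of the returned dictionary
are hypothetical, and the rule that "only-if-cached" wins over "no-cache" when
both are sent is an assumption of this sketch, not part of the proposal.
+

```python
def cache_policy(cache_control: str, offline: bool) -> dict:
    """Derive caching decisions from a Cache-Control request header value."""
    directives = {d.strip().lower() for d in cache_control.split(",") if d.strip()}
    # OFFLINE mode takes precedence: only the local cache may be used.
    cache_only = offline or "only-if-cached" in directives
    return {
        "cache_only": cache_only,
        # no-cache: go to the remote site even if a cached copy exists.
        "bypass_cache": "no-cache" in directives and not cache_only,
        # no-store: do not cache what was retrieved from the remote site.
        "store": not offline and "no-store" not in directives,
        # no-transform: request untransformed data (may not be honoured when
        # transformed data is all the local cache holds in OFFLINE mode).
        "no_transform": "no-transform" in directives,
    }
```
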
+
+#### [Link Header](http://www.ietf.org/rfc/rfc5988.txt)
+
+The Link header is central to Linked Data and Linked Media because it is used
to expose internal structures defined between Resources (between Entities
but also between Entities and their Representations and Metadata).
+
+The principal syntax of Link headers is as follows:
+
+ Link: <{uri}>; rel="{relation}"; type="{media-type}"
+
+The relation parameter defines the type of the relation. [Registered relation
types](http://www.iana.org/assignments/link-relations/link-relations.xml) are
mainly used to improve the navigation of users. The values "content" and "meta"
as suggested by the Linked Media proposal are currently not registered. In such
cases [RFC5988](http://www.ietf.org/rfc/rfc5988.txt) requires the use of
absolute URIs as {relation}. This document will use "content" and "meta"
instead of the full URIs as required by RFC5988.
+
+Regardless of that the values used for the "rel" parameter within the "Link"
header by the Entityhub MUST BE the SAME as supported values for the "rel"
parameter in the "Accept" header for requests. A pragmatic solution would be to
support both the short form and a full URI.
+
+The Entityhub will add the following Links (if applicable)
+
+* A reference to the Non-Information resource for the Entity by using the
relation type "self". This will always use the local URI used for the resource.
In case of remote entities there is also a link to the original resource.
+
+ Link: <http://{host}/entityhub/{site}/entity/{localname}>; rel="self"
+
+* A reference to the representation of the Entity by using the relation
type "content". Currently it is not intended to provide separate links to all
available media types for content.
+
+ Link: <http://{host}/entityhub/{site}/entity/{localname}.rep>; rel="content"
+
+* A reference to the metadata about the representation about the Entity.
Currently it is not intended to provide separate links to all available media
types for metadata.
+
+ Link: <http://{host}/entityhub/{site}/entity/{localname}.meta>; rel="meta"
+
+* A reference to the source in case of referenced entities. This will be the
URI of the entity
+
+ Link: <{uri}>; rel="via"
+
+* A link to the license for the entity if present
+
+ Link: <{licenseURI}>; rel="license"
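+
+The list above could be turned into header lines roughly as follows. The
helper names `link_header` and `entity_links` are hypothetical; the sketch
follows the Link syntax given earlier in this section (URI in angle brackets,
quoted "rel") and assumes the ".rep" and ".meta" suffixes from the URI scheme
section.
+

```python
from typing import List, Optional


def link_header(uri: str, rel: str, media_type: Optional[str] = None) -> str:
    """Format one Link header line: Link: <uri>; rel="..."[; type="..."]."""
    value = f'<{uri}>; rel="{rel}"'
    if media_type:
        value += f'; type="{media_type}"'
    return f"Link: {value}"


def entity_links(base: str, localname: str,
                 source_uri: Optional[str] = None,
                 license_uri: Optional[str] = None) -> List[str]:
    """Build the Link headers for one Entity; optional links only if known."""
    entity = f"{base}/{localname}"
    links = [
        link_header(entity, "self"),
        link_header(entity + ".rep", "content"),
        link_header(entity + ".meta", "meta"),
    ]
    if source_uri:    # only for referenced (remote) entities
        links.append(link_header(source_uri, "via"))
    if license_uri:   # only if a license is present
        links.append(link_header(license_uri, "license"))
    return links
```
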
+
+### Entity Model
+
+These changes to the RESTful API should also be reflected in the Java API.
Currently on the API level there are three types of Entities: Sign, Symbol and
EntityMapping. The only difference between those Entities is a different
set of metadata. However, there is no plan to distinguish such types on the
RESTful API level.
+
+To streamline the domain model and to bring it more in line with the RESTful
API the proposal is to drop the different Entity types. The Sign, Symbol and
EntityMapping interfaces will be replaced by a single Entity interface with the
following methods:
+
+ Entity
+ + getId() : String
+ + getSite() : String
+ + getRepresentation() : Representation
+ + getMetadata() : Representation
+
+The use of the Representation interface also for the Metadata allows the use
of the same parsers and serializers for both content and metadata. Functionality
currently depending on the special APIs of Sign, Symbol and EntityMapping needs
to be adapted to retrieve the information via the Representation interface.
This should be implemented by a utility class.
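+
+To illustrate the idea (in Python rather than in the Java API itself), the
unified model and such a utility could look like the sketch below. Everything
here is hypothetical: the `Representation` alias, the field URI
"urn:example:state" and the helper `get_state` only stand in for whatever the
real API would define.
+

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical stand-in for the Representation interface: field URI -> values.
Representation = Dict[str, List[str]]


@dataclass
class Entity:
    id: str
    site: str
    representation: Representation = field(default_factory=dict)
    metadata: Representation = field(default_factory=dict)


def get_state(entity: Entity, state_field: str = "urn:example:state") -> Optional[str]:
    """Utility: read what used to be Symbol.getState() from the metadata."""
    values = entity.metadata.get(state_field, [])
    return values[0] if values else None
```
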
+
+
+References
+----------
+
+[1] http://lists.w3.org/Archives/Public/public-lod/2011May/0019.html
+
+[2]
http://code.google.com/p/kiwi/source/browse/kiwi-core/src/main/java/kiwi/core/webservices/resource/ResourceWebService.java
+
+[3] Kiwi Project: http://www.kiwi-community.eu/ Blog:
http://planet.kiwi-project.eu/
+
+[4] Kiwi Source Repository: http://code.google.com/p/kiwi/