Author: rwesten
Date: Mon May 23 12:49:51 2011
New Revision: 1126476

URL: http://svn.apache.org/viewvc?rev=1126476&view=rev
Log:
Proposal for using Linked Data / Linked media principles for the RESTful 
service interface of the Stanbol Entityhub

Added:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/entityhub/
    
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/entityhub/entityhubandlinkeddata.mdtext

Added: 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/entityhub/entityhubandlinkeddata.mdtext
URL: 
http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/entityhub/entityhubandlinkeddata.mdtext?rev=1126476&view=auto
==============================================================================
--- 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/entityhub/entityhubandlinkeddata.mdtext
 (added)
+++ 
incubator/stanbol/site/trunk/content/stanbol/docs/trunk/entityhub/entityhubandlinkeddata.mdtext
 Mon May 23 12:49:51 2011
@@ -0,0 +1,264 @@
+Adopting Linked Media principles for Stanbol Entityhub
+======================================================
+
+[Linked Data](http://linkeddata.org/) describes the idea of linking - formerly 
unconnected - bits of data over the web. Think about how hyperlinks are used to 
navigate within the Web of documents. Linked Data tries to do the same for the 
Web of Data. This basic idea is also central to most of the Apache Stanbol 
components. However, Stanbol is concerned not only with linking data but 
also with interlinking the web of documents with the web of data. Therefore 
[this 
proposal](http://lists.w3.org/Archives/Public/public-lod/2011May/0019.html) to 
extend Linked Data principles to also support content, and not just data, seems 
like a natural fit for Apache Stanbol.
+
+This document first provides a short introduction to Linked Data and the 
proposed Linked Media extensions. The second part of the document analyzes the 
requirements of the Stanbol Entityhub related to Linked Data and Linked Media. 
The third section then goes into more detail on how Linked Media principles 
could be implemented by the Entityhub.
+
+
+Short Introduction to Linked Data and proposed Linked Media extensions
+----------------------------------------------------------------------
+
+from [linkeddata.org](http://linkeddata.org/faq) 
+> ### What is Linked Data?
+> The Web enables us to link related documents. Similarly it enables us to 
link related data. 
+> The term Linked Data refers to a set of best practices for publishing and 
connecting structured data on the Web. 
+> Key technologies that support Linked Data are URIs (a generic means to 
identify entities or 
+> concepts in the world), HTTP (a simple yet universal mechanism for 
retrieving resources, 
+> or descriptions of resources), and RDF (a generic graph-based data model 
with which to 
+> structure and link data that describes things in the world).
+
+The following terminology is often used with Linked Data:
+
+* Resources: All items of interest that are to be published on the Web.
+* Information Resources: All documents on the Web (text, images, videos ...)
+* Non-Information Resources: Real-world objects that exist outside of the Web 
(Persons, Organizations, Places ...) but also social concepts (Categories, 
Terminologies …).
+* Resource Identifiers: Linked Data recommends using only HTTP URIs as 
identifiers because this allows information about the resource to be accessed 
directly over the web.
+* Representation: A stream of bytes in a certain format that describes an 
Information Resource. Representations can be available in different formats.
+* Dereferencing of HTTP URIs: For Information Resources the content is 
directly returned. For Non-Information Resources the HTTP status "303 See 
Other" with a link to the Information Resource describing the Non-Information 
Resource is returned.
+* Content Negotiation: Users can select the format (content type) of the 
returned Representations by setting the "Accept" header in requests. Linked 
Data recommends using different URIs for Representations of different content 
types to allow bookmarking. The "Accept" header sent by the client is therefore 
used to decide which URI is returned with a "303 See Other" response.
+* URI Aliases: If different providers publish information about the same 
Non-Information Resource (e.g. a famous Person, a Country, ...) then 
"[owl:sameAs](http://www.w3.org/TR/owl-ref/#sameAs-def)" relations are used to 
tell clients that two different Resource Identifiers (HTTP URIs) identify the 
same Resource.
+
+A more detailed overview is provided by the [Linked Data 
Tutorial](http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/).
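+
+The dereferencing rule from the terminology list can be illustrated with a 
small Python sketch. The `example.org` URIs and the two lookup tables are 
invented for this illustration, and the function returns plain tuples instead 
of real HTTP responses.
+

```python
from typing import Dict, Optional, Tuple

# Hypothetical Information Resources (documents) and their content.
DOCUMENTS: Dict[str, bytes] = {
    "http://example.org/doc/paris": b"<html>About Paris</html>",
}

# Hypothetical Non-Information Resources and the document describing each.
DESCRIBED_BY: Dict[str, str] = {
    "http://example.org/id/paris": "http://example.org/doc/paris",
}


def dereference(uri: str) -> Tuple[int, Optional[str], Optional[bytes]]:
    """Return (status, Location header, body) for a GET request on `uri`."""
    if uri in DOCUMENTS:
        # Information Resource: the content is returned directly.
        return 200, None, DOCUMENTS[uri]
    if uri in DESCRIBED_BY:
        # Non-Information Resource: "303 See Other" to its description.
        return 303, DESCRIBED_BY[uri], None
    return 404, None, None
```

+
+A client that receives the 303 status follows the `Location` value and 
obtains the document with a second request.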
+
+### Linked Media
+
+The [Linked Media 
proposal](http://lists.w3.org/Archives/Public/public-lod/2011May/0019.html) 
tries to extend Linked Data by two features.
+
+1. Creating and updating of resources: Linked Data currently covers only 
retrieval of information, which is sufficient for sites like 
[DBpedia](http://dbpedia.org) or [Geonames](http://www.geonames.org) where 
users are only able to consume data. When creating interactive (web) 
applications one needs to be able to create, update and remove information. 
These features are currently not covered by Linked Data, but they are well 
defined for RESTful Services. The Linked Media proposal therefore suggests 
using HTTP PUT, POST and DELETE requests for this purpose.
+2. Handling both content and metadata: Linked Data uses Content Negotiation to 
select suitable content types. In addition it provides means to redirect to 
Information Resources about Non-Information Resources. However, Linked Data does 
not differentiate between metadata and content. One cannot explicitly ask 
first for a GIF image and later for its metadata as RDF, or first for an HTML 
blog post and later for its metadata formatted as HTML. Such a differentiation 
is only supported for Non-Information Resources, e.g. for a famous painting 
(Non-Information Resource) and a photo of it (Information Resource). Linked Media 
proposes to use the "rel" parameter of the Accept header to allow users to 
explicitly ask for content ("Accept: type/subtype; rel=content") or metadata 
("Accept: type/subtype; rel=meta").
+
+For a more detailed description please follow the link to the [Linked Media 
proposal](http://lists.w3.org/Archives/Public/public-lod/2011May/0019.html) [1] 
as posted by Sebastian Schaffert on the Linked Open Data mailing list of the 
W3C. You might also be interested in reading the discussion that followed it. 
Note also 
[ResourceWebService](http://code.google.com/p/kiwi/source/browse/kiwi-core/src/main/java/kiwi/core/webservices/resource/ResourceWebService.java)
 [2], a first implementation of the Linked Media proposal based on the 
[Kiwi2/Linked Media Framework](http://code.google.com/p/kiwi/) [3][4].  
+
+Requirements of the Stanbol Entityhub
+-------------------------------------
+
+This section tries to identify requirements of the Stanbol Entityhub related 
to Linked Data and Linked Media. The goal of this analysis is to identify where 
it makes sense to adopt Linked Data/Media principles for the RESTful interface 
of the Entityhub.
+
+The Entityhub fulfills two requirements: 
+
+1. It allows users to define and manage a network of referenced sites from 
which information about entities is retrieved. In addition the Entityhub 
supports the use of local caches to speed up access and to be independent of 
the availability of remote services. 
+2. It manages its own (local) site that is used to manage local entities. Such 
entities can be created locally but it is also possible to import them from any 
referenced site. Typical examples of locally managed entities are customers, 
employees, concepts of a company thesaurus, offices, meeting rooms ... 
+
+### Entity Model of the Entityhub
+
+Entities managed by the Entityhub first of all define a unique ID. In case the 
referenced site follows Linked Data principles this will be the HTTP URI of the 
Non-Information Resource. However, this might be any valid URI (including URNs). 
The URI prefix of locally managed entities is configurable. Therefore the URI 
type of locally managed entities depends on the configuration. The Entity 
itself represents a Non-Information Resource. Each Entity comes with a 
Representation. The Representation holds all information known by the site 
about the entity. In Linked Data terminology the Representation is the 
Information Resource a user is redirected to when requesting the Entity 
(Non-Information Resource). Finally an Entity also links to the ID of the 
(referenced) site managing it. This allows users to track who is providing the 
information for an Entity.
+
+Currently the Entityhub distinguishes three different types of Entities:
+
+1. Sign: All Entities managed by referenced sites
+2. Symbol: All locally managed Entities. Symbols hold additional metadata such 
as a preferred label and a state.
+3. EntityMapping: Mappings from Symbols to Signs. Linked Data typically uses 
owl:sameAs to define such mappings. However, in the case of the Entityhub such 
mappings need to hold additional meta information such as the state, the expiry 
date of the mapping ...
+
+Metadata such as license, copyright statements, attributions as well as 
information about the organization managing a referenced site are managed with 
referenced sites and not with single entities.
+
+All the additional information provided by these three Entity types, as well as 
the additional metadata provided for referenced sites, is, in terms of Linked 
Data principles, metadata about the Information Resource - the Representation - 
and not about the Non-Information Resource - the Entity.
+
+Therefore the Entityhub manages:
+
+* Non-Information Resources: All the Entities of referenced Sites as well as 
locally managed Entities
+* Content: All Representations about Entities
+* Metadata: Additional information about Representations such as license, 
copyrights, attributions as well as mappings to other entities.
+
+### RESTful Services of the Entityhub
+
+The Entityhub defines the following service endpoints:
+
+1. The (referenced) Site Manager: Provides retrieval and search over all 
referenced sites.
+2. (referenced) Site Endpoint: Provides the same interface but for a specific 
referenced site.
+3. The Entityhub Endpoint: Provides full read/write and retrieval access for 
locally managed Entities.
+
+Therefore the Entityhub needs to support read-only access for Entities managed 
by referenced sites and full read/write access (CRUD) for locally managed 
Entities.
+
+### Summary
+
+Consuming Linked Data:
+
+* Consume Linked Data from remote sites
+* Search resources on remote sites based on labels/language and type (by using 
SPARQL)
+
+Referenced Entities (Entities of Referenced Sites):
+
+* Support local management of additional metadata for referenced entities 
(e.g. mappings to local entities)
+* Support merging of remote metadata (e.g. defined by "foaf:primaryTopic") 
with local ones (e.g. mappings to local entities)
+* Provide Content + Metadata - as proposed by Linked Media - even for 
referenced entities.
+* Support Search for Entities based on labels/language and type
+
+Local Entities (Entities managed by the Entityhub):
+
+* Provide local Entities as Linked Media (full CRUD support; management of 
Content and Metadata)
+* Support creation of local entities based on referenced ones
+* Support finding of additional mappings based on owl:sameAs relations
+* Support importing of metadata for mapped entities (e.g. to correctly handle 
attribution requirements)
+* Support Enabling/disabling the use of redirects
+* Support Search for Entities based on labels/language and type
+
+Based on this evaluation of the Model and the Services provided by the 
Entityhub, the proposed Linked Media extension to the Linked Data principles 
would be sufficient to cover most of the functionality exposed by the 
Entityhub as RESTful services. For referenced Sites only the distinction 
between Metadata and Content is needed. For locally managed Entities the 
possibility to create, update and remove Entities, their Representation 
(content) and metadata is also of central importance. The main functionality 
not covered is the import of Entities from referenced sites. In addition, for 
functionality like the creation of mappings and the management of the Entity 
workflow, special additions to the generic Linked Media/Linked Data API would be 
useful.
+
+
+Specific Considerations
+-----------------------
+
+This section contains Entityhub specific considerations about some of the 
principles defined for Linked Data and Linked Media. 
+
+### Resource Identifier
+
+Linked Data defines the principle to use HTTP URIs as Resource Identifiers so 
that one can retrieve data by directly accessing the URI of a resource. This 
does not work for the Entityhub because it also needs to manage remote 
entities, and even for local entities this will not always be an option. Because 
of that the RESTful interface also needs to support an alternative that allows 
clients to pass the URI of an entity as a parameter. This is also required so 
that the IDs of entities are not affected when the Entityhub is deployed on a 
different host or even accessed via localhost. In addition this allows the use 
of other URI types (mainly URNs but also other protocols such as LDAP) as 
identifiers for locally managed entities.
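+
+As a minimal sketch of the parameter-based alternative, the following Python 
function builds a request URL using the `entity?uri={uri}` pattern that this 
document proposes for the Entityhub. The function name, the host and the site 
name are placeholders chosen for this illustration. The point is that the 
entity URI must be percent-encoded so that it survives as a single parameter 
value.
+

```python
from urllib.parse import parse_qs, quote, urlsplit


def entity_request_url(host: str, site: str, entity_uri: str) -> str:
    """Build a request URL that carries the entity URI as a query parameter."""
    # safe="" also escapes ":", "/", "?" and "&" inside the entity URI.
    encoded = quote(entity_uri, safe="")
    return f"http://{host}/entityhub/{site}/entity?uri={encoded}"


def entity_uri_from_url(url: str) -> str:
    """Recover the entity URI from a URL built by entity_request_url()."""
    return parse_qs(urlsplit(url).query)["uri"][0]
```

+
+Because the ID travels as a parameter, the same entity ID (an HTTP URI, a URN 
or any other URI type) stays valid regardless of the host the Entityhub runs on.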
+
+### Redirects for Content Negotiation
+
+It is important to consider that Entities are Non-Information Resources and, 
based on Linked Data principles, requests for Non-Information Resources need to 
be answered with redirects ("303 See Other") to the URI of the Information 
Resource. In practice such redirects serve two purposes:
+
+1.  To allow users to directly access (and bookmark) URIs of a specific format 
and therefore bypass content negotiation. This is mainly because browsers do 
not allow users to define the "Accept" header. Because of that, without this 
indirection typical users would be unable to retrieve formats other than HTML.
+
+    For the Entityhub, where most of the requests will be issued by clients 
that support the usage of "Accept" headers, the usage of redirects seems 
unfavorable for two reasons. First, it doubles the number of requests and adds 
an additional RTT (round trip time). Second, browsers always issue a GET 
request when following a redirect, independent of the method of the initial 
request. This can cause problems when returning redirects for POST, PUT and 
DELETE requests. Because of this it would make sense for the Entityhub to 
provide the possibility to deactivate/activate the usage of redirects (e.g. via 
a configuration, a request property or even a header field).
+
+2.  To attach metadata of the Information Resources. As an example take the 
[Linked Data endpoint of the New York Times](http://data.nytimes.com). It uses 
"http://data.nytimes.com/{uuid}" for Entities and 
"http://data.nytimes.com/{uuid}.rdf" for the RDF/XML representations. When 
looking at the representations provided for Entities (e.g. take [North 
Carolina](http://data.nytimes.com/N25800450843199534421)) one can see that 
triples using "http://data.nytimes.com/{uuid}" as subject are data about North 
Carolina, whereas triples that use "http://data.nytimes.com/{uuid}.rdf" as 
subject represent metadata. Note also that the metadata is connected to the 
representation of North Carolina by the 
[foaf:primaryTopic](http://xmlns.com/foaf/0.1/primaryTopic) relation. 
+
+    When using the extensions proposed by Linked Media, it would be possible 
to directly refer to the metadata by setting the "rel" parameter of the 
"Accept" header to "meta". Therefore a request defining "Accept: 
application/rdf+xml; rel=meta" would - assuming that redirects are deactivated 
- directly return the metadata for the requested entity (e.g. the license) 
encoded as RDF/XML. In case redirects are enabled it would return a "303 See 
Other" with the URI of the metadata.
+
+    Note that - in principle - there are two kinds of redirects: (1) redirects 
between Resources. This includes redirects from Entities to the Representation 
("rel=content") as well as to the Metadata ("rel=meta"); (2) redirects used for 
Content Negotiation. Therefore it would be possible to allow these two types to 
be enabled/disabled separately. 
+
+    Also note that in cases where several redirects would be needed to reach 
the final resource (e.g. when requesting information about a Non-Information 
Resource in "text/html": Non-Information Resource -> Information Resource 
-> HTML version) the response will directly redirect to the final 
destination. 
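+
+The interplay of the two redirect kinds can be sketched as follows. This is 
only an illustration in Python under stated assumptions: it borrows the ".rep", 
".meta" and media type suffixes proposed later in this document, while the 
function name, the two switches and the media type table are invented here.
+

```python
from typing import Dict, Tuple

# Assumed mapping from media type to URI suffix (illustrative only).
EXTENSIONS: Dict[str, str] = {
    "text/html": "html",
    "application/rdf+xml": "rdf",
}


def respond(entity_url: str, media_type: str, rel: str = "content",
            resource_redirects: bool = True,
            conneg_redirects: bool = True) -> Tuple[int, str]:
    """Return (303, redirect target) or (200, URL answered in place)."""
    if rel not in ("content", "meta"):
        raise ValueError(f"unsupported rel value: {rel!r}")
    target = entity_url
    if resource_redirects:
        # Kind (1): Entity -> Representation or Metadata.
        target += ".rep" if rel == "content" else ".meta"
    if conneg_redirects:
        # Kind (2): generic resource -> media type specific URI.
        target += "." + EXTENSIONS[media_type]
    if target == entity_url:
        return 200, entity_url   # both kinds disabled: answer directly
    return 303, target           # one hop to the final destination
```

+
+With both switches enabled the sketch issues a single 303 to the final URI 
instead of a chain of redirects. With only content negotiation redirects 
enabled, the client still has to send "rel=meta" on the follow-up request to 
obtain metadata.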
+
+
+Redesigning the Entityhub
+-------------------------
+
+This section evaluates necessary changes to the Entityhub.
+
+### URI scheme for Resources
+
+The support of Linked Data requires the use of a local URI. This is in 
contrast to the parameter based approach ("?id={remoteURI}") as currently used 
by the Entityhub. The goal is that the Entityhub allows both variants
+
+    http://{host}/entityhub/{site}/entity/{localname} and
+    http://{host}/entityhub/{site}/entity?uri={uri}
+
+to refer to an Entity. This requires that the Entityhub provides a local HTTP 
URI for any (local or remote) entity. The suggestion is to use the local name 
of the remote entity, or the MD5 of the whole URI in cases where this is not 
possible.
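+
+A minimal Python sketch of this suggestion is shown below. The document does 
not define when using the local name is "not possible", so the rule used here 
(the part after the last "/" or "#" must consist only of letters, digits, "_" 
and "-") is an assumption made for this example.
+

```python
import hashlib
import re

# Assumed rule for an acceptable local name. Dots are excluded so that names
# cannot clash with extension-like suffixes added to local URIs.
_SAFE_LOCAL_NAME = re.compile(r"[A-Za-z0-9_\-]+")


def local_name(entity_uri: str) -> str:
    """Derive the {localname} path segment for an entity URI."""
    candidate = re.split(r"[#/]", entity_uri)[-1]
    if _SAFE_LOCAL_NAME.fullmatch(candidate):
        return candidate
    # Fallback: hex encoded MD5 of the whole URI.
    return hashlib.md5(entity_uri.encode("utf-8")).hexdigest()
```

+
+Note that two remote entities from different namespaces can share a local 
name. The sketch does not resolve such collisions.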
+
+To support the redirects as defined by Linked Data it is also necessary to 
generate separate URIs for Representations. To support the differentiation 
between Content and Metadata a separate URI is also needed for the metadata.
+
+The proposal is to append additions that look like file extensions to the 
local name of Entities:
+
+    http://{host}/entityhub/{site}/entity/{localname}.rep 
+
+is used to directly refer to the Representation of an Entity - in Linked Media 
terminology the Information Resource. Note that the local HTTP URI is used as 
the base for the ".rep" extension. "?uri={uri}.rep" will not be supported. Users 
of the Entityhub can therefore use the ".rep" extension to directly access the 
content for an Entity. Note that content negotiation will still be needed when 
requesting this kind of URI.
+
+Similar to the above the ".meta" extension will be used for constructing URIs 
for the metadata:
+
+    http://{host}/entityhub/{site}/entity/{localname}.meta
+
+For referenced entities such representations will be created by merging remote 
metadata with locally managed metadata. Remote metadata will be recognized as 
Resources with a [foaf:primaryTopic](http://xmlns.com/foaf/0.1/primaryTopic) 
relation to the Entity. Local metadata can include information known for the 
referenced site (e.g. license, copyright, attributions, information about the 
managing organization ...) as well as mappings to other (locally managed) 
entities.
+
+For locally managed Entities the metadata will also include all the additional 
information as currently defined by the Symbol API (state, predecessors, 
successors).
+
+Note that the URIs for Representations and Metadata are optional and will be 
omitted based on HTTP request headers in case redirects are disabled. However 
even in case that redirects are disabled it is still possible to use such URIs 
for requests.
+
+### URI scheme for Content Negotiation
+
+To conform to the Linked Data principles the Entityhub needs to provide a 
unique HTTP URI for every content type in which Information Resources (Content 
and Metadata Resources) can be serialized. As with the ".rep" and ".meta" 
extensions used to directly access Representations and their Metadata, the 
proposal is to use file extensions to indicate the media type. In case users 
wish to pass the remote URI as a parameter it is also possible to pass the 
extension or the media type as a parameter.
+
+    http://{host}/entityhub/{site}/entity/{localname}.{extension} or
+    
http://{host}/entityhub/{site}/entity?uri={uri}&format={extension}&mediaType={mediatype}
+
+The first variant shows the case where the extension is directly added to the 
local URI of the entity. In this case the "rel" parameter of the Accept header 
is used to determine whether the content - the representation - or the metadata 
needs to be encoded in the response. If not specified the representation will 
be returned.
+
+To allow also to directly address the representation or the metadata in a 
specific format the Entityhub also supports the following two variants: 
+
+    http://{host}/entityhub/{site}/entity/{localname}.rep.{extension}
+    http://{host}/entityhub/{site}/entity/{localname}.meta.{extension}
+
+Note that the URIs used for content negotiation are optional and will be 
omitted based on HTTP request headers in case redirects are disabled. However 
even in case that redirects are disabled it is still possible to use such URIs 
for requests.
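+
+How a server might split the last path segment of such URIs into its parts can 
be sketched in Python as follows. The set of known extensions is an assumption 
for this example, and the sketch presumes that local names themselves never end 
in one of the reserved suffixes.
+

```python
from typing import NamedTuple, Optional


class EntityPath(NamedTuple):
    local_name: str
    resource: Optional[str]    # "rep", "meta" or None (the Entity itself)
    extension: Optional[str]   # media type extension or None


KNOWN_EXTENSIONS = {"rdf", "html", "json"}   # assumed set, for illustration


def parse_entity_path(segment: str) -> EntityPath:
    """Split '{localname}[.rep|.meta][.{extension}]' into its parts."""
    resource = extension = None
    head, dot, tail = segment.rpartition(".")
    if dot and tail in KNOWN_EXTENSIONS:
        # Strip the media type extension first.
        segment, extension = head, tail
        head, dot, tail = segment.rpartition(".")
    if dot and tail in ("rep", "meta"):
        # Then strip the resource selector.
        segment, resource = head, tail
    return EntityPath(segment, resource, extension)
```

+
+When `resource` is `None` the "rel" parameter of the Accept header decides 
between content and metadata, as described above.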
+
+### HTTP Request/Response Headers with special use
+
+This section provides information about header fields that are specially 
evaluated by the Entityhub. Normal evaluations of headers as specified by 
[RFC2616 section 14](http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html), 
e.g. the use of Content-Type to read data passed by PUT/POST requests, are not 
described.
+
+#### [Accept 
header](http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.1)
+
+The Accept header allows the client to specify the media type of the content 
it expects in the response. The [Linked Media 
proposal](http://lists.w3.org/Archives/Public/public-lod/2011May/0019.html) 
suggests using the "rel" parameter to specify whether the response should 
return the data or the metadata of the requested resource. The semantics of the 
"rel" parameter are defined for the Link header by 
[RFC5988](http://www.ietf.org/rfc/rfc5988.txt). A related example can be found 
on the [LinkHeader](http://www.w3.org/wiki/LinkHeader) page on the W3C wiki.
+
+The pattern usable for the Accept header looks like
+
+    Accept: {media-type}[; rel=meta]
+
+If no "rel" parameter is specified the Entityhub will return the data 
(the representation of the entity) by default. If users want to retrieve the 
metadata they need to add "rel=meta". The {media-type} is always applied to the 
information selected by the "rel" parameter. 
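+
+A minimal Python sketch of evaluating this pattern is given below. It handles 
only a single media type per header value (no lists and no "q" weights), and 
the function name is chosen for this example.
+

```python
from typing import Tuple


def parse_accept(value: str) -> Tuple[str, str]:
    """Return (media_type, rel) for a single-entry Accept header value."""
    media_type, *params = [part.strip() for part in value.split(";")]
    rel = "content"   # default: the representation of the entity
    for param in params:
        name, _, val = param.partition("=")
        if name.strip().lower() == "rel":
            rel = val.strip().strip('"').lower()
    if rel not in ("content", "meta"):
        raise ValueError(f"unsupported rel value: {rel!r}")
    return media_type.lower(), rel
```

+
+For example, "application/rdf+xml; rel=meta" selects the metadata serialized 
as RDF/XML, while a plain "text/html" selects the representation as HTML.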
+
+#### 
[Cache-Control](http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9)
+
+The Entityhub supports the following cache-request-directives to allow clients 
some control over local caching of entities managed by remote sites. Note that 
the Stanbol OFFLINE mode has precedence over Cache-Control specifications.
+
+* no-cache: Entities are retrieved from the remote site even if a local cache 
exists (if Stanbol is not in OFFLINE mode)
+* no-store: Entities retrieved from a remote site are not cached locally (if 
Stanbol is not in OFFLINE mode)
+* no-transform: The Entityhub may be configured to transform/filter 
information from the remote site. This directive can be used to bypass this 
kind of transformation. In case transformations are applied to the local cache, 
this directive has no effect if Stanbol operates in OFFLINE mode
+* only-if-cached: Representations are only returned if they are available in 
the local cache.
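+
+The directives and the precedence of the OFFLINE mode can be summarized in a 
small decision function. This Python sketch is an interpretation of the list 
above, not Entityhub code. All names are invented, and the handling of 
contradictory combinations (e.g. "no-cache, only-if-cached") is an assumption.
+

```python
from typing import NamedTuple


class CachePlan(NamedTuple):
    read_cache: bool   # the local cache may be used to answer
    ask_remote: bool   # the remote site may be contacted
    store: bool        # the result may be written to the local cache
    transform: bool    # configured transformations/filters are applied


def plan_request(cache_control: str, offline: bool = False) -> CachePlan:
    """Translate a Cache-Control request header value into a plan."""
    directives = {d.strip().lower() for d in cache_control.split(",") if d.strip()}
    transform = "no-transform" not in directives
    if offline:
        # OFFLINE mode wins: only the local cache is available. The
        # transform flag is moot if cached data was already transformed.
        return CachePlan(True, False, False, transform)
    only_cached = "only-if-cached" in directives
    return CachePlan(
        read_cache=only_cached or "no-cache" not in directives,
        ask_remote=not only_cached,
        store=not only_cached and "no-store" not in directives,
        transform=transform,
    )
```
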
+
+#### [Link Header](http://www.ietf.org/rfc/rfc5988.txt)
+
+The Link header is central to Linked Data and Linked Media because it is used 
to expose the structures defined between Resources (between Entities 
but also between Entities and their Representations and Metadata).
+
+The basic syntax of Link headers is as follows:
+
+    Link: <{uri}>; rel="{relation}"; type="{media-type}"
+
+The relation parameter defines the type of the relation. [Registered relation 
types](http://www.iana.org/assignments/link-relations/link-relations.xml) are 
mainly used to improve the navigation of users. The values "content" and "meta" 
as suggested by the Linked Media proposal are currently not registered. In such 
cases [RFC5988](http://www.ietf.org/rfc/rfc5988.txt) requires the use of 
absolute URIs as {relation}. This document will use "content" and "meta" 
instead of the full URIs as required by RFC5988.
+
+Regardless of that the values used for the "rel" parameter within the "Link" 
header by the Entityhub MUST BE the SAME as supported values for the "rel" 
parameter in the "Accept" header for requests. A pragmatic solution would be to 
support both the short form and a full URI.  
+
+The Entityhub will add the following Links (if applicable):
+
+* A reference to the Non-Information Resource for the Entity by using the 
relation type "self". This will always use the local URI used for the resource. 
In case of remote entities there is also a link to the original resource.
+
+    Link: <http://{host}/entityhub/{site}/entity/{localname}>; rel="self"
+
+* A reference to the representation of the Entity by using the relation 
type "content". Currently it is not intended to provide separate links to all 
available media types for content.
+
+    Link: <http://{host}/entityhub/{site}/entity/{localname}.rep>; rel="content"
+
+* A reference to the metadata about the representation of the Entity. 
Currently it is not intended to provide separate links to all available media 
types for metadata.
+
+    Link: <http://{host}/entityhub/{site}/entity/{localname}.meta>; rel="meta"
+
+* A reference to the source in case of referenced entities. This will be the 
URI of the entity
+
+    Link: <{uri}>; rel="via"
+
+* A link to the license for the entity if present
+
+    Link: <{licenseURI}>; rel="license"
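+
+As an illustration, the following Python sketch assembles these headers for 
one Entity using the Link header syntax shown earlier in this section. The 
function name and its parameters are made up for this example, and it uses the 
short "content" and "meta" relation names like the rest of this document.
+

```python
from typing import List, Optional


def link_headers(local_url: str, source_uri: Optional[str] = None,
                 license_uri: Optional[str] = None) -> List[str]:
    """Build the Link header lines for the Entity at `local_url`."""
    links = [
        (local_url, "self"),
        (local_url + ".rep", "content"),
        (local_url + ".meta", "meta"),
    ]
    if source_uri:
        # Only referenced (remote) entities have an original source.
        links.append((source_uri, "via"))
    if license_uri:
        links.append((license_uri, "license"))
    return [f'Link: <{uri}>; rel="{rel}"' for uri, rel in links]
```
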
+
+### Entity Model
+
+These changes to the RESTful API should also be reflected in the Java API. 
Currently on the API level there are three types of Entities: Sign, Symbol and 
EntityMapping. The only difference between those Entity types is the set of 
metadata they hold. However, there is no plan to distinguish such types on the 
RESTful API level.
+
+To streamline the domain model and to bring it more in line with the RESTful 
API the proposal is to drop the different Entity types. The Sign, Symbol and 
EntityMapping interfaces will be replaced by a single Entity interface with the 
following methods
+
+    Entity
+        + getId() : String
+        + getSite() : String
+        + getRepresentation() : Representation
+        + getMetadata() : Representation
+
+Using the Representation interface for the Metadata as well allows the use 
of the same parsers and serializers for both content and metadata. Functionality 
currently depending on the special APIs of Sign, Symbol and EntityMapping needs 
to be adapted to retrieve the information via the Representation interface. 
This should be implemented by a utility class.
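+
+The shape of the proposed model can be illustrated with a short Python sketch. 
It is not the Java API: the Representation is simplified to a dictionary of 
field values, and the `STATE` field URI as well as the helper names are 
hypothetical stand-ins for the utility class mentioned above.
+

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Simplified stand-in for the Representation interface: field URI -> values.
Representation = Dict[str, List[str]]

STATE = "urn:example:state"   # hypothetical field URI, not an Entityhub constant


@dataclass
class Entity:
    id: str
    site: str
    representation: Representation = field(default_factory=dict)
    metadata: Representation = field(default_factory=dict)


def first_value(rep: Representation, key: str) -> Optional[str]:
    """Return the first value of a field, or None if the field is absent."""
    values = rep.get(key, [])
    return values[0] if values else None


def state_of(entity: Entity) -> Optional[str]:
    """Utility access replacing a type specific getter such as a Symbol state."""
    return first_value(entity.metadata, STATE)
```

+
+Because content and metadata share one type, helpers such as `first_value` 
work on either of them.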
+
+
+References
+----------
+
+[1] http://lists.w3.org/Archives/Public/public-lod/2011May/0019.html
+
+[2] 
http://code.google.com/p/kiwi/source/browse/kiwi-core/src/main/java/kiwi/core/webservices/resource/ResourceWebService.java
+
+[3] Kiwi Project: http://www.kiwi-community.eu/ Blog: 
http://planet.kiwi-project.eu/
+
+[4] Kiwi Source Repository: http://code.google.com/p/kiwi/

