Hi Reto,

On 06.08.2012 at 20:35, Reto Bachmann-Gmür wrote:
> Hi Sebastian,
>
> On Fri, Aug 3, 2012 at 2:26 PM, Sebastian Schaffert <[email protected]> wrote:
>
>> Hi Reto,
>>
>> comments inline. ;-)
>
> same here ;)

and again ;-)

Just a meta-comment: before, I was told that Clerezza is a kind of wrapper that abstracts away from the concrete triple store implementations (Jena and Sesame). What you explained to me now is a completely alternative triple store implementation, which may have its justifications, but also brings additional problems. Comments follow below…

Another meta-comment: I see this as an intellectual challenge from which we both learn; it is not my intention to criticise Clerezza. I just take the Sesame position in this discussion because I know it quite well and have had very good experiences with it, so I need really convincing arguments to switch to a different API. Maybe the discussion with me can also help you convince others to use/develop for Clerezza.

>> On 03.08.2012 at 12:54, Reto Bachmann-Gmür wrote:
>>
>>> I agree that Clerezza should finally have the fast lane for SPARQL queries; the current approach only makes sense if a query runs against graphs from multiple backends. This is definitely a bottleneck now.
>>>
>>> What puzzles me is that you seem to think that the Sesame API is cleaner than the Clerezza one. The Clerezza API was introduced because none of the available APIs modelled the RDF abstract syntax without tying additional concepts and utility classes into the core. If you find anything that's not clean, I'd like to address it.
>>
>> Essentially, Sesame already provides many of the things Clerezza promises, and it is well proven, established, and highly performant. So I don't completely understand the rationale of reinventing the wheel with Clerezza.
>>
>> I also don't understand your argument about tying additional concepts and utility classes into the core. Clerezza implements utility classes in the same way as Sesame, so where is the difference?
>> Additionally, the Sesame utility classes simply extend the Java core functionality (e.g. with iterators that can throw checked exceptions).
>
> Looking at http://www.openrdf.org/doc/sesame2/api/ I see 1188 classes, many of which seem to be related to implementation-specific aspects and transport. The Clerezza core API (http://incubator.apache.org/clerezza/mvn-site/rdf.core/apidocs/index.html) contains 125 classes. These are the classes in the jar that an API client or implementor depends on.

Merely counting classes does not say much, especially since Sesame provides much more functionality. Many of the Sesame classes I see actually come from the abstract syntax tree of the SPARQL Query and Update parsers (and as we all know, there are many different ways to implement a parser), from the HTTP server functionality, and from the various serializers and parsers. If you can live without these, you will easily end up with a package as small as the Clerezza core.

> What I mean by separating utility classes is mainly the separate resource-centric API provided by RDF utils (http://incubator.apache.org/clerezza/mvn-site/utils/apidocs/index.html). I think this is the main difference to the Jena API, where the resource-oriented and the triple-oriented approach are provided by the same API.

This is probably a philosophical issue: having all functionality in one place vs. having things cleanly separated. Both have advantages and disadvantages. Personally, I like the Sesame value factory because it is a single place to look for the suitable factory methods.

> The classes in org.openrdf.model are indeed similar to the ones in org.apache.clerezza.rdf.core, so let me argue why I think the zz variants are better:
>
> - The zz API defines identity criteria for graphs. The Sesame API doesn't define when equals should return true; the zz API defines clear rules, which are distinct for mutable and for immutable graphs.
> Similarly, the hashCode method for graphs is defined. In Sesame it seems that an instance is equal only to itself. This doesn't take into account what RDF semantics say about graph identity.

This is honestly a functionality I have never needed in 10 years. I see its use case for small in-memory graphs (like the ones used in Stanbol), but for a multi-billion triple graph it is irrelevant. BTW, if you want to implement graph equivalence, you have to implement the bijection as specified in http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-graph-equality . This is a very expensive operation anyway, especially when taking blank nodes into account. For example, the following two graphs could be considered equivalent (because the blank nodes are existentially quantified):

Graph 1:
<http://www.example.com/r/1> <http://www.example.com/p/1> _:a
_:a <http://www.example.com/p/2> "123"
_:a <http://www.example.com/p/3> "456"

Graph 2:
<http://www.example.com/r/1> <http://www.example.com/p/1> _:b
<http://www.example.com/r/1> <http://www.example.com/p/1> _:c
_:b <http://www.example.com/p/2> "123"
_:c <http://www.example.com/p/3> "456"

> - In Sesame graphs, triples are added with one or several contexts. Such a context is not defined in the RDF semantics or in the abstract syntax. In Sesame a Graph is a collection of Statements, where a Statement is not the same as a Triple in RDF.

The currently official RDF specification dates back to 1998, with a minor revision in 2004 [1]. The definition of named graphs is being specified as part of the work on RDF 1.1 at the W3C, running until 2013 [2]. Whether a named graph is technically represented as quadruples (as in Sesame) or as in the proposal for the abstract RDF model (as in Clerezza) is merely an implementation detail, and also still under discussion.
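Coming back to the equivalence point: the bijection check from RDF Concepts could be sketched as follows. This is a deliberately brute-force illustration with made-up types (nodes as plain strings, blank nodes marked by a leading "_:"), not the Sesame or Clerezza API:

```java
import java.util.*;

/** Brute-force sketch of RDF graph isomorphism via a blank-node bijection.
    All types here are illustrative; blank nodes are Strings starting "_:". */
public class GraphIso {
    public record Triple(String s, String p, String o) {}

    static boolean isBNode(String n) { return n.startsWith("_:"); }

    static List<String> bnodes(Set<Triple> g) {
        Set<String> b = new LinkedHashSet<>();
        for (Triple t : g) {
            if (isBNode(t.s())) b.add(t.s());
            if (isBNode(t.o())) b.add(t.o());
        }
        return new ArrayList<>(b);
    }

    /** Applies a blank-node renaming to every triple of the graph. */
    static Set<Triple> rename(Set<Triple> g, Map<String, String> m) {
        Set<Triple> out = new HashSet<>();
        for (Triple t : g)
            out.add(new Triple(m.getOrDefault(t.s(), t.s()), t.p(), m.getOrDefault(t.o(), t.o())));
        return out;
    }

    /** Tries every bijection between the two blank-node sets: exponential in general. */
    public static boolean isomorphic(Set<Triple> g1, Set<Triple> g2) {
        if (g1.size() != g2.size()) return false;
        List<String> b1 = bnodes(g1), b2 = bnodes(g2);
        if (b1.size() != b2.size()) return false;
        return tryMap(g1, g2, b1, b2, new HashMap<>(), 0);
    }

    static boolean tryMap(Set<Triple> g1, Set<Triple> g2, List<String> b1, List<String> b2,
                          Map<String, String> m, int i) {
        if (i == b1.size()) return rename(g1, m).equals(g2);
        for (String cand : b2) {
            if (m.containsValue(cand)) continue;   // keep the mapping injective
            m.put(b1.get(i), cand);
            if (tryMap(g1, g2, b1, b2, m, i + 1)) return true;
            m.remove(b1.get(i));
        }
        return false;
    }
}
```

Even this tiny sketch enumerates blank-node bijections, which is exactly the cost I mean above.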
The Sesame approach essentially implements the named graph specification of SPARQL 1.1 [3] (the only one that currently officially exists). It has the advantage of allowing more efficient implementations, and especially of being very convenient for the user (e.g. "give me the triples matching a certain pattern and occurring either in graph1 or in graph2").

[1] http://www.w3.org/TR/REC-rdf-syntax/
[2] http://www.w3.org/TR/rdf11-concepts/#section-dataset
[3] http://www.w3.org/TR/sparql11-query/#rdfDataset

> - Value factory: In Sesame a value factory is tied to the Graph. In zz, triples can be added to any graph and need not be created via a method specific to that graph (it is left to the implementation to transparently do the optimization for nodes that originate from its backend).

Sesame follows a classical factory pattern from object-oriented design here, allowing the backend to choose which implementation it wants to return to give the best results. Without a factory, you will always have additional maintenance overhead for converting into the correct implementation, e.g. when adding a triple to a graph backed by a database. With a factory, the implementation will immediately give you a database-backed version of the triple.

> - IDs for BNodes: In zz, BNodes are just what they are according to the specs: anonymous resources. They are not Java-serializable objects, so a client can only reference a BNode as long as the object is alive. This allows an implementation to remove obsolete triples and duplicate bnodes when nobody holds a reference to the bnode anymore. In Sesame, BNodes have an ID and can be reconstructed from an ID. This means that an implementation doesn't know how long a bnode is referenced. When a duplicate is detected, it should internally keep all the aliases of the node, as it doesn't know for sure that clients will not reference the bnode by a specific ID it was once exposed with.
The semantics of BNodes are a subject of open debate and even dispute to this day. In practice, it is often a disadvantage not to expose an ID, which is why both Sesame and Jena do it, and most serialization formats do as well. Actually, I had some trouble with Clerezza in Stanbol for exactly this reason. The case that does not work easily here is incremental updates of graphs between two systems involving blank nodes. In the specification, this case is forbidden (blank nodes are always distinct). In practice, it is very useful to still be able to do it. And actually, since this is also very common practice in logics (so-called Skolemization), the RDF specification takes this into account and explicitly acknowledges it [4]:

"Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes…."

[4] http://www.w3.org/TR/rdf11-concepts/#dfn-blank-node

> - Namespaces: what are they doing in the core of the Sesame API? There is no such thing in RDF.

There is in RDF 1.1: http://www.w3.org/TR/rdf11-concepts/#vocabularies

Again, this is a point where people, after working with RDF for a while, discovered it would be extremely useful to have abbreviated ways of writing URIs or IRIs, so they included them in their software systems. The fact that both Sesame and Jena do this is proof of it, and so is the fact that RDF 1.1 takes it up.

> Also, the Sesame URI class, which (probably) represents what the RDF spec describes as a "URI reference", has methods to split it into namespace (not using the Namespace class here) and local name.

RDF 1.1 no longer uses the term URI reference; it speaks of IRIs. I do not find Sesame's use of "URI" as an interface name very fortunate, however, mainly because it sometimes clashes with the existing Java URI class.
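To illustrate what the namespace mechanism buys you: expanding a prefixed name to a full IRI. A minimal sketch with a hypothetical `Prefixes` class, not the actual Sesame Namespace API; the FOAF namespace below is just example data:

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of namespace-based abbreviation: expanding a prefixed
    name like "foaf:name" to a full IRI. Hypothetical class, not the
    Sesame Namespace API. */
public class Prefixes {
    private final Map<String, String> namespaces = new HashMap<>();

    public void setNamespace(String prefix, String namespaceIri) {
        namespaces.put(prefix, namespaceIri);
    }

    /** Expands "prefix:local" to the full IRI; unknown prefixes pass through. */
    public String expand(String prefixedName) {
        int colon = prefixedName.indexOf(':');
        if (colon < 0) return prefixedName;
        String base = namespaces.get(prefixedName.substring(0, colon));
        return base == null ? prefixedName : base + prefixedName.substring(colon + 1);
    }
}
```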
The namespace handling methods are merely convenience methods.

> - Literals: The zz API differentiates between typed and plain literals. The Sesame API has one literal datatype with some utility methods to access its value for common XSD datatypes.

If I look here, I see many different literal implementations; Sesame just hides them using the factory pattern: http://www.openrdf.org/doc/sesame2/api/org/openrdf/model/impl/package-summary.html

> The zz approach of having a literal factory to convert Java types to literals is more flexible and can be extended to new types.

Which is (strictly speaking) not really foreseen in the RDF specification. But I agree that it can be convenient … ;-)

> - Statement: A Sesame Statement has a context, but this context is irrelevant for the identity criterion defined by equals and hashCode. This doesn't make a clean impression on me: either contexts are relevant, in which case adding two statements with different contexts to a Set should give a Set of size two, or they aren't, in which case they should disappear from the API.

Two triples with the same subject, predicate and object are the same, regardless of which graphs they occur in. But I agree that this can lead to confusion.

> All in all I think the zz core is not only closer to the spec …

depending on which version of the spec you are talking about - the 1998 one or the evolving 2013 spec?

> it is also easier to implement a Graph representation of data with. The implementor of a Graph need not care about splitting URIs or providing utility methods to get values from a literal or …

Which are in most cases very straightforward to implement, and in the few cases where they are NOT easy, they are actually needed (e.g. for mapping database types to Java types).

>> One aspect I like about the Sesame API is its completely modular structure at several levels.
>> This allows me to easily and cleanly add functionality as needed, e.g.:
>> - a custom triple store like the one I described before; you can e.g. easily provide a Jena TDB backend for Sesame (see http://sjadapter.sourceforge.net/)
>
> In zz, a custom triple store or a gateway just provides graph implementations. The Sesame solution you're referring to implements a separate SPI (SAIL), which adds a level of complexity and is also a bit less performant, as in zz there is (potentially) nothing between your implementation and the client; you can use zz utility classes like AbstractMGraph, but you don't have to.

The Sesame SAIL API adds negligible overhead, but considerable additional functionality. It is a plugin mechanism that allows adding additional layers in between where you want them (e.g. for implementing a native SPARQL query, for adding a reasoner, etc.).

>> - a custom SPARQL implementation that allows a very efficient native evaluation; at the same time, Sesame provides me an abstract syntax tree and completely frees me from all the nasty parsing stuff, and implements the latest SPARQL specification for queries, updates and federation extensions without me having to actually care
>
> I already mentioned that zz should improve here: a SPARQL fast lane to allow backend optimization. The abstract syntax tree you describe is, however, implemented in zz as well, for the latest released SPARQL spec (i.e. not yet for SPARQL 1.1).

So browsers should also not support HTML5? Last time I checked, it was also still a working draft (http://www.w3.org/TR/html5/) … ;-) W3C standardisation processes take incredibly long, because they involve a lot of process, including a complete reference implementation. For me, this is no reason not to implement what is very likely to come in the future.
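To come back to the SAIL layering I mentioned: the idea is that each layer wraps the next one and can intercept operations. Roughly, it could be sketched as follows, with logging standing in for e.g. a reasoner; the interfaces are made up for the sketch, not the real SAIL API:

```java
import java.util.ArrayList;
import java.util.List;

/** Rough sketch of a stackable-layer design like Sesame's SAIL: each layer
    wraps the next one and can intercept operations. Hypothetical
    interfaces, not the real SAIL API. */
public class LayerDemo {
    interface Store {
        void add(String triple);
        List<String> dump();
    }

    /** Bottom layer: a trivial in-memory store. */
    static class MemoryStore implements Store {
        private final List<String> data = new ArrayList<>();
        public void add(String triple) { data.add(triple); }
        public List<String> dump() { return data; }
    }

    /** A layer stacked on top: records every addition, then delegates. */
    static class LoggingLayer implements Store {
        private final Store next;
        final List<String> log = new ArrayList<>();
        LoggingLayer(Store next) { this.next = next; }
        public void add(String triple) { log.add("add " + triple); next.add(triple); }
        public List<String> dump() { return next.dump(); }
    }
}
```

Because every layer implements the same interface, a reasoner, a cache, or an access-control check can be pushed between the client and the store without either of them noticing.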
>> - a very clean Java SPI based approach for registering RDF parsers and serializers that can operate natively on the respective triple store implementations
>
> So we are comparing org.openrdf.rio with org.apache.clerezza.rdf.core.serializedform, or is there a separate SPI package? The Sesame RDFParser interface seems much more complex than zz's ParsingProvider.

You would typically implement the abstract class RDFParserBase in Sesame, leaving you only the two parse() methods and getRDFFormat() to override, while giving you the option to also implement optional features like namespace support. Not so complex; I have written several Sesame parsers already.

> Clerezza supports registering parsers and serializers for any media type (which is identified just by its media type, without introducing an RDF-Format class), both using OSGi as well as the META-INF/services approach for non-OSGi environments. Parsers and serializers have to work with data from any backend; they can, however, be optimized for data from a particular backend.
>
>> - easily wrap filters around triple stores or iterators
>> You can easily see the modularity by looking at the many Maven artifacts the project is composed of. Essentially, if I don't need a functionality I can simply leave it out, and if I need it, adding it is a no-brainer.
>
> The minimum Clerezza jar you need is 240K; this contains all you need to access and query graphs. It also contains the infrastructure for serializing and parsing, but you have to add the jars for the formats you need (just adding the jar to the classpath, or loading the bundle when using OSGi, is enough).

The complete Sesame distribution with all its features and bells and whistles is also just 2.1 MB. Compare this to Jena TDB ;-)

>> In addition, the Sesame data model is completely based on lightweight interfaces instead of wrapper objects.
>> This makes it very easy to provide efficient implementations and is IMHO very clean. In contrast, Clerezza provides its complete own RDF model, based on your own version of a triple, your own version of a node, your own version of a URI, …
>
> No, Literals, Triples, Resources and others are just interfaces as well. A BNode is indeed a class (just an empty subclass of Object), and the same goes for UriRefs. The reason for this is that we couldn't find a use case where providing different implementations would provide benefits, while it would provide great potential for misuse (e.g. I have my custom object implement BNode or UriRef, add it to a triple store, and expect to get my object back when querying).

No one who uses some sort of storage should ever expect to get exactly the same object back. The benefit of leaving BNode and UriRef as interfaces is to leave the implementation open to others. Just because YOU did not find a use case doesn't mean such a use case doesn't exist for others. For efficient storage and querying, as well as caching, there is indeed a definite benefit in being able to e.g. associate identifiers with BNodes. You are thinking too much in-memory.

>> What I am missing from Clerezza:
>> - a lightweight data model that does not require additional instances in main memory
>
> Yes, bnodes and urirefs need to be in memory as long as they are used by a client. For BNodes I think this brings a significant advantage, as I described above about the backend knowing when redundancy can be removed without risk.
>
>> - a good reuse of functionality that is already there, e.g. direct SPARQL, transactions, …
>
> Agreed for direct SPARQL.
> For transactions I don't know what you mean by "already there". Yes, some triple stores support some sort of transactions, but requiring all backends to support this would be quite a strong requirement, and probably not what users want in many cases; see http://mail-archives.apache.org/mod_mbox/incubator-clerezza-dev/201009.mbox/%[email protected]%3E for thoughts on this issue.

Sesame has transactions and Jena has transactions, so other intelligent people might have thought it is a good idea. I need them, because transactions are the natural level of granularity for our versioning and for our incremental reasoning. And when you build a web application on top of some sort of triple store where users interact with data, they are in fact a major requirement, because you want your system to always be in a consistent state, even in the case of multithreading, and you want to take into account users simply clicking the back button in the middle of some workflow, etc. If you said exactly this sentence in a database forum ("probably not what users want in many cases"), you would be completely torn apart in a flame war. ;-)

>> - support for named graphs / contexts
>
> Named graphs are supported; I am not sure why you would need contexts on the individual triples,

Technical simplification, and one of the suggested implementations for named graphs (quadruples or "quads").

> and I am missing a clear description of this in the Sesame API.

http://www.openrdf.org/doc/sesame2/users/ch08.html#d0e1238

>> I have been working with RDF APIs for about 10 years now in various programming languages (even in Prolog, Haskell and Python). And my conclusion at the moment is that Sesame by far offers the most convenient API for a developer. But I am of course open to switching in case I get convincing arguments. ;-)
>
> Not sure how convincing you found them, or if the missing SPARQL fast lane is a blocker for you.
What I am missing is a convincing argument that shows me how I, as an experienced RDF developer, can benefit from Clerezza over e.g. Sesame, and I am sure that many other RDF developers will have the same hesitation. What you have argued is that Clerezza follows the 1998 RDF spec more closely; but the fact is that Sesame has already gone beyond that and tries to anticipate many features of the upcoming RDF 1.1 and SPARQL 1.1 specifications (which have not really been secrets for many years now). Furthermore, you have argued why Clerezza ALSO can do what Sesame (and Jena, for that matter) already does, but you did not show me what it can do MORE.

>>> It seems that what you are using of Sesame is mainly the SPI/API and not the actual triple store. This is definitely something Clerezza should have a good offer for.
>>
>> So convince me why it is BETTER than the established project ;-)
>
> Well, before Clerezza there were other attempts to have a thinner backend-agnostic layer, like RDF2Go. This seems to confirm that others too think that the APIs provided by Sesame or Jena aren't the thin RDF API one can use and implement without endorsing a big infrastructure or a large set of concepts one doesn't necessarily want to deal with.

As it happens, the initial founder of RDF2Go (Max Völkel) is a friend of mine. The main rationale behind the project was strictly to be able to exchange the underlying triple store more easily, because you might want to build applications first and think about the triple store later (and then of course get the bigger infrastructure). It is only used in research projects as far as I know, and is not really under active development anymore. As to big infrastructure: the Sesame jar is 2.1 MB altogether, smaller than Lucene or Xerces, so I would not consider it big infrastructure. It comes in many Maven modules, though (which is IMO a good thing).
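As a side note on the transactions point from above: the all-or-nothing behaviour I mean could be sketched minimally like this. It is an illustrative in-memory example, not the Sesame transaction API:

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Sketch of why transactions matter for the web-application case above:
    a batch of additions either all land in the graph or none do.
    Illustrative only, not the Sesame or Jena transaction API. */
public class TxGraph {
    private final Set<String> triples = new HashSet<>();

    /** Adds all triples, rolling back on any failure so the graph stays consistent. */
    public void addAll(Collection<String> batch) {
        Set<String> added = new HashSet<>();
        try {
            for (String t : batch) {
                if (t == null) throw new IllegalArgumentException("invalid triple");
                if (triples.add(t)) added.add(t);
            }
        } catch (RuntimeException e) {
            triples.removeAll(added);   // rollback: undo the partial batch
            throw e;
        }
    }

    public int size() { return triples.size(); }
}
```

A real store does this durably and under concurrency, of course; the sketch only shows the consistency contract a client can rely on.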
>>> When I last looked at it, Sesame was not ready to be used in Apache projects; I am not sure if license issues are the cause of it not being available in Maven Central.
>>
>> Sesame is under a BSD license, so it should be compatible with Apache projects:
>>
>> http://www.openrdf.org/download.jsp
>
> Is this true for the dependencies it requires as well?

Yes: http://repo.aduna-software.org/svn/info.aduna/commons/LICENSE.txt

Greetings,

Sebastian

--
| Dr. Sebastian Schaffert              [email protected]
| Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
| Head of Knowledge and Media Technologies Group  +43 662 2288 423
| Jakob-Haringer Strasse 5/II
| A-5020 Salzburg
