Hi Reto,

On 06.08.2012 at 20:35, Reto Bachmann-Gmür wrote:
> Hi Sebastian,
>
> On Fri, Aug 3, 2012 at 2:26 PM, Sebastian Schaffert <[email protected]> wrote:
>
>> Hi Reto,
>>
>> comments inline. ;-)
>
> same here ;)

and again ;-)

Just a meta-comment: before, I was told that Clerezza is a kind of wrapper that abstracts away from the concrete triple store implementations (Jena and Sesame). What you explained to me now is a completely alternative triple store implementation, which may have its justifications, but also brings additional problems. Comments follow below…

Another meta-comment: I see this as an intellectual challenge from which we both learn; it is not my intention to criticise Clerezza. I just take the Sesame position in this discussion because I know it quite well and have had very good experiences with it, so I need really convincing arguments to switch to a different API. Maybe the discussion with me can also help you convince others to use/develop for Clerezza.

>> On 03.08.2012 at 12:54, Reto Bachmann-Gmür wrote:
>>
>>> I agree that Clerezza should finally have the fast lane for SPARQL queries; the current approach only makes sense if a query runs against graphs from multiple backends. This is definitely a bottleneck now.
>>>
>>> What puzzles me is that you seem to think that the Sesame API is cleaner than the Clerezza one. The Clerezza API was introduced because none of the available APIs modelled the RDF abstract syntax without tying additional concepts and utility classes into the core. If you find anything that's not clean, I'd like to address it.
>>
>> Essentially, Sesame already provides many of the things Clerezza promises, and it is well proven, established, and highly performant. So I don't completely understand the rationale of reinventing the wheel with Clerezza.
>>
>> I also don't understand your argument about tying additional concepts and utility classes into the core. Clerezza implements utility classes in the same way as Sesame, so where is the difference?
>> Additionally, the Sesame utility classes simply extend the Java core functionality (e.g. with iterators that can throw checked exceptions).
>
> Looking at http://www.openrdf.org/doc/sesame2/api/ I see 1188 classes, many of which seem to be related to implementation-specific aspects and transport. The Clerezza core API (http://incubator.apache.org/clerezza/mvn-site/rdf.core/apidocs/index.html) contains 125 classes. These are the classes in the jar that an API client or implementor depends on.

Merely counting classes does not say much, especially since Sesame provides much more functionality. Many of the Sesame classes I see actually come from the abstract syntax tree of the SPARQL Query and Update parsers (and as we all know, there are many different ways to implement a parser), from the HTTP server functionality, and from the various serializers and parsers. If you can live without these, you will easily end up with a package as small as the Clerezza core.

> What I mean by separating utility classes is mainly the separate resource-centric API provided by RDF utils (http://incubator.apache.org/clerezza/mvn-site/utils/apidocs/index.html). I think this is the main difference to the Jena API, where the resource-oriented and the triple-oriented approach are provided by the same API.

This is probably a philosophical issue: having all functionality in one place vs. having things cleanly separated. Both have advantages and disadvantages. Personally, I like the Sesame value factory because it is a single place to look for the suitable factory methods.

> The classes in org.openrdf.model are indeed similar to the ones in org.apache.clerezza.rdf.core, so let me argue why I think the zz variants are better:
>
> - The zz API defines identity criteria for graphs. The Sesame API doesn't define when equals should return true; the zz API defines clear rules, which are distinct for mutable and for immutable graphs.
> Similarly, the hashCode method for graphs is defined. In Sesame it seems that an instance is equal only to itself. This doesn't take into account what RDF semantics say about graph identity.

This is honestly a functionality I have never needed in 10 years. I see its use case for small in-memory graphs (like the ones used in Stanbol), but for a multi-billion triple graph it is irrelevant. BTW, if you want to implement graph equivalence, you have to implement the bijection as specified in http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-graph-equality . This is a very expensive operation anyway, especially when taking blank nodes into account. For example, the following two graphs could be considered equivalent (because the blank nodes are existentially quantified):

Graph 1:
<http://www.example.com/r/1> <http://www.example.com/p/1> _:a
_:a <http://www.example.com/p/2> "123"
_:a <http://www.example.com/p/3> "456"

Graph 2:
<http://www.example.com/r/1> <http://www.example.com/p/1> _:b
<http://www.example.com/r/1> <http://www.example.com/p/1> _:c
_:b <http://www.example.com/p/2> "123"
_:c <http://www.example.com/p/3> "456"

> - In Sesame graphs, triples are added with one or several contexts. Such a context is not defined in the RDF semantics or in the abstract syntax. In Sesame a Graph is a collection of Statements, where a Statement is not the same as a Triple in RDF.

The currently official RDF specification dates back to 1998, with a minor revision in 2004 [1]. The definition of named graphs is being specified as part of the work on RDF 1.1 at the W3C, running until 2013 [2]. Whether a named graph is technically represented as quadruples (as in Sesame) or as in the proposal for the abstract RDF model (as in Clerezza) is merely an implementation detail, and also still under discussion.
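Coming back to the equivalence point: the bijection check from RDF Concepts could be sketched as follows. This is a deliberately brute-force illustration with made-up types (nodes as plain strings, blank nodes marked by a leading "_:"), not the Sesame or Clerezza API:

```java
import java.util.*;

/** Brute-force sketch of RDF graph isomorphism via a blank-node bijection.
    All types here are illustrative; blank nodes are Strings starting "_:". */
public class GraphIso {
    public record Triple(String s, String p, String o) {}

    static boolean isBNode(String n) { return n.startsWith("_:"); }

    static List<String> bnodes(Set<Triple> g) {
        Set<String> b = new LinkedHashSet<>();
        for (Triple t : g) {
            if (isBNode(t.s())) b.add(t.s());
            if (isBNode(t.o())) b.add(t.o());
        }
        return new ArrayList<>(b);
    }

    /** Applies a blank-node renaming to every triple of the graph. */
    static Set<Triple> rename(Set<Triple> g, Map<String, String> m) {
        Set<Triple> out = new HashSet<>();
        for (Triple t : g)
            out.add(new Triple(m.getOrDefault(t.s(), t.s()), t.p(), m.getOrDefault(t.o(), t.o())));
        return out;
    }

    /** Tries every bijection between the two blank-node sets: exponential in general. */
    public static boolean isomorphic(Set<Triple> g1, Set<Triple> g2) {
        if (g1.size() != g2.size()) return false;
        List<String> b1 = bnodes(g1), b2 = bnodes(g2);
        if (b1.size() != b2.size()) return false;
        return tryMap(g1, g2, b1, b2, new HashMap<>(), 0);
    }

    static boolean tryMap(Set<Triple> g1, Set<Triple> g2, List<String> b1, List<String> b2,
                          Map<String, String> m, int i) {
        if (i == b1.size()) return rename(g1, m).equals(g2);
        for (String cand : b2) {
            if (m.containsValue(cand)) continue;   // keep the mapping injective
            m.put(b1.get(i), cand);
            if (tryMap(g1, g2, b1, b2, m, i + 1)) return true;
            m.remove(b1.get(i));
        }
        return false;
    }
}
```

Even this tiny sketch enumerates blank-node bijections, which is exactly the cost I mean above.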
The Sesame approach essentially implements the named graph specification of SPARQL 1.1 [3] (the only one that currently officially exists). It has the advantage of allowing more efficient implementations, and especially of being very convenient for the user (e.g. "give me the triples matching a certain pattern and occurring either in graph1 or in graph2").

[1] http://www.w3.org/TR/REC-rdf-syntax/
[2] http://www.w3.org/TR/rdf11-concepts/#section-dataset
[3] http://www.w3.org/TR/sparql11-query/#rdfDataset

> - Value factory: In Sesame a value factory is tied to the Graph. In zz, triples can be added to any graph and need not be created via a method specific to that graph (it is left to the implementation to transparently do the optimization for nodes that originate from its backend).

Sesame follows a classical factory pattern from object-oriented design here, allowing the backend to choose which implementation it wants to return to give the best results. Without a factory, you will always have additional maintenance overhead for converting into the correct implementation, e.g. when adding a triple to a graph backed by a database. With a factory, the implementation will immediately give you a database-backed version of the triple.

> - IDs for BNodes: In zz, BNodes are just what they are according to the specs: anonymous resources. They are not Java-serializable objects, so a client can only reference a BNode as long as the object is alive. This allows an implementation to remove obsolete triples and duplicate bnodes when nobody holds a reference to the bnode anymore. In Sesame, BNodes have an ID and can be reconstructed from an ID. This means that an implementation doesn't know how long a bnode is referenced. When a duplicate is detected, it should internally keep all the aliases of the node, as it doesn't know for sure that clients will not reference the bnode by a specific ID it was once exposed with.
The semantics of BNodes are a subject of open debate and even dispute to this day. In practice, it is often a disadvantage not to expose an ID, which is why both Sesame and Jena do it, and most serialization formats do as well. Actually, I had some trouble with Clerezza in Stanbol for exactly this reason. The case that does not work easily here is incremental updates of graphs between two systems involving blank nodes. In the specification, this case is forbidden (blank nodes are always distinct). In practice, it is very useful to still be able to do it. And actually, since this is also very common practice in logics (so-called Skolemization), the RDF specification takes this into account and explicitly acknowledges it [4]:

"Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes…."

[4] http://www.w3.org/TR/rdf11-concepts/#dfn-blank-node

> - Namespaces: what are they doing in the core of the Sesame API? There is no such thing in RDF.

There is in RDF 1.1: http://www.w3.org/TR/rdf11-concepts/#vocabularies

Again, this is a point where people, after working with RDF for a while, discovered it would be extremely useful to have abbreviated ways of writing URIs or IRIs, so they included them in their software systems. The fact that both Sesame and Jena do this is proof of it, and so is the fact that RDF 1.1 takes it up.

> Also, the Sesame URI class, which (probably) represents what the RDF spec describes as a "URI reference", has methods to split it into namespace (not using the Namespace class here) and local name.

RDF 1.1 no longer uses the term URI reference; it speaks of IRIs. I do not find Sesame's use of "URI" as an interface name very fortunate, however, mainly because it sometimes clashes with the existing Java URI class.
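To illustrate what the namespace mechanism buys you: expanding a prefixed name to a full IRI. A minimal sketch with a hypothetical `Prefixes` class, not the actual Sesame Namespace API; the FOAF namespace below is just example data:

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of namespace-based abbreviation: expanding a prefixed
    name like "foaf:name" to a full IRI. Hypothetical class, not the
    Sesame Namespace API. */
public class Prefixes {
    private final Map<String, String> namespaces = new HashMap<>();

    public void setNamespace(String prefix, String namespaceIri) {
        namespaces.put(prefix, namespaceIri);
    }

    /** Expands "prefix:local" to the full IRI; unknown prefixes pass through. */
    public String expand(String prefixedName) {
        int colon = prefixedName.indexOf(':');
        if (colon < 0) return prefixedName;
        String base = namespaces.get(prefixedName.substring(0, colon));
        return base == null ? prefixedName : base + prefixedName.substring(colon + 1);
    }
}
```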
The namespace handling methods are merely convenience methods.

> - Literals: The zz API differentiates between typed and plain literals. The Sesame API has one literal datatype with some utility methods to access its value for common XSD datatypes.

If I look here, I see many different literal implementations; Sesame just hides them using the factory pattern: http://www.openrdf.org/doc/sesame2/api/org/openrdf/model/impl/package-summary.html

> The zz approach of having a literal factory to convert Java types to literals is more flexible and can be extended to new types.

Which is (strictly speaking) not really foreseen in the RDF specification. But I agree that it can be convenient … ;-)

> - Statement: A Sesame Statement has a context, but this context is irrelevant for the identity criterion defined by equals and hashCode. This doesn't make a clean impression on me: either contexts are relevant, in which case adding two statements with different contexts to a Set should give a Set of size two, or they aren't, in which case they should disappear from the API.

Two triples with the same subject, predicate and object are the same, regardless of which graphs they occur in. But I agree that this can lead to confusion.

> All in all I think the zz core is not only closer to the spec …

depending on which version of the spec you are talking about - the 1998 one or the evolving 2013 spec?

> it is also easier to implement a Graph representation of data with. The implementor of a Graph need not care about splitting URIs or providing utility methods to get values from a literal or …

Which are in most cases very straightforward to implement, and in the few cases where they are NOT easy, they are actually needed (e.g. for mapping database types to Java types).

>> One aspect I like about the Sesame API is its completely modular structure at several levels.
>> This allows me to easily and cleanly add functionality as needed, e.g.:
>> - a custom triple store like the one I described before; you can e.g. easily provide a Jena TDB backend for Sesame (see http://sjadapter.sourceforge.net/)
>
> In zz, a custom triple store or a gateway just provides graph implementations. The Sesame solution you're referring to implements a separate SPI (SAIL), which adds a level of complexity and is also a bit less performant, as in zz there is (potentially) nothing between your implementation and the client; you can use zz utility classes like AbstractMGraph, but you don't have to.

The Sesame SAIL API adds negligible overhead, but considerable additional functionality. It is a plugin mechanism that allows adding additional layers in between where you want them (e.g. for implementing a native SPARQL query, for adding a reasoner, etc.).

>> - a custom SPARQL implementation that allows a very efficient native evaluation; at the same time, Sesame provides me an abstract syntax tree and completely frees me from all the nasty parsing stuff, and implements the latest SPARQL specification for queries, updates and federation extensions without me having to actually care
>
> I already mentioned that zz should improve here: a SPARQL fast lane to allow backend optimization. The abstract syntax tree you describe is, however, implemented in zz as well, for the latest released SPARQL spec (i.e. not yet for SPARQL 1.1).

So browsers should also not support HTML5? Last time I checked, it was also still a working draft (http://www.w3.org/TR/html5/) … ;-) W3C standardisation processes take incredibly long, because they involve a lot of process, including a complete reference implementation. For me, this is no reason not to implement what is very likely to come in the future.
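To come back to the SAIL layering I mentioned: the idea is that each layer wraps the next one and can intercept operations. Roughly, it could be sketched as follows, with logging standing in for e.g. a reasoner; the interfaces are made up for the sketch, not the real SAIL API:

```java
import java.util.ArrayList;
import java.util.List;

/** Rough sketch of a stackable-layer design like Sesame's SAIL: each layer
    wraps the next one and can intercept operations. Hypothetical
    interfaces, not the real SAIL API. */
public class LayerDemo {
    interface Store {
        void add(String triple);
        List<String> dump();
    }

    /** Bottom layer: a trivial in-memory store. */
    static class MemoryStore implements Store {
        private final List<String> data = new ArrayList<>();
        public void add(String triple) { data.add(triple); }
        public List<String> dump() { return data; }
    }

    /** A layer stacked on top: records every addition, then delegates. */
    static class LoggingLayer implements Store {
        private final Store next;
        final List<String> log = new ArrayList<>();
        LoggingLayer(Store next) { this.next = next; }
        public void add(String triple) { log.add("add " + triple); next.add(triple); }
        public List<String> dump() { return next.dump(); }
    }
}
```

Because every layer implements the same interface, a reasoner, a cache, or an access-control check can be pushed between the client and the store without either of them noticing.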
>> - a very clean Java SPI based approach for registering RDF parsers and serializers that can operate natively on the respective triple store implementations
>
> So we are comparing org.openrdf.rio with org.apache.clerezza.rdf.core.serializedform, or is there a separate SPI package? The Sesame RDFParser interface seems much more complex than zz's ParsingProvider.

You would typically implement the abstract class RDFParserBase in Sesame, leaving you only the two parse() methods and getRDFFormat() to override, while giving you the option to also implement optional features like namespace support. Not so complex; I have written several Sesame parsers already.

> Clerezza supports registering parsers and serializers for any media type (which is identified just by its media type, without introducing an RDF-Format class), both using OSGi as well as the META-INF/services approach for non-OSGi environments. Parsers and serializers have to work with data from any backend; they can, however, be optimized for data from a particular backend.
>
>> - easily wrap filters around triple stores or iterators
>> You can easily see the modularity by looking at the many Maven artifacts the project is composed of. Essentially, if I don't need a functionality I can simply leave it out, and if I need it, adding it is a no-brainer.
>
> The minimum Clerezza jar you need is 240K; this contains all you need to access and query graphs. It also contains the infrastructure for serializing and parsing, but you have to add the jars for the formats you need (just adding the jar to the classpath, or loading the bundle when using OSGi, is enough).

The complete Sesame distribution with all its features and bells and whistles is also just 2.1 MB. Compare this to Jena TDB ;-)

>> In addition, the Sesame data model is completely based on lightweight interfaces instead of wrapper objects.
>> This makes it very easy to provide efficient implementations and is IMHO very clean. In contrast, Clerezza provides its complete own RDF model, based on your own version of a triple, your own version of a node, your own version of a URI, …
>
> No, Literals, Triples, Resources and others are just interfaces as well. A BNode is indeed a class (just an empty subclass of Object), and the same goes for UriRefs. The reason for this is that we couldn't find a use case where providing different implementations would provide benefits, while it would provide great potential for misuse (e.g. I have my custom object implement BNode or UriRef, add it to a triple store, and expect to get my object back when querying).

No one who uses some sort of storage should ever expect to get exactly the same object back. The benefit of leaving BNode and UriRef as interfaces is to leave the implementation open to others. Just because YOU did not find a use case doesn't mean such a use case doesn't exist for others. For efficient storage and querying, as well as caching, there is indeed a definite benefit in being able to e.g. associate identifiers with BNodes. You are thinking too much in-memory.

>> What I am missing from Clerezza:
>> - a lightweight data model that does not require additional instances in main memory
>
> Yes, bnodes and urirefs need to be in memory as long as they are used by a client. For BNodes I think this brings a significant advantage, as I described above about the backend knowing when redundancy can be removed without risk.
>
>> - a good reuse of functionality that is already there, e.g. direct SPARQL, transactions, …
>
> Agreed for direct SPARQL.
> For transactions I don't know what you mean by "already there". Yes, some triple stores support some sort of transactions, but requiring all backends to support this would be quite a strong requirement, and probably not what users want in many cases; see http://mail-archives.apache.org/mod_mbox/incubator-clerezza-dev/201009.mbox/%[email protected]%3E for thoughts on this issue.

Sesame has transactions and Jena has transactions, so other intelligent people might have thought it is a good idea. I need them, because transactions are the natural level of granularity for our versioning and for our incremental reasoning. And when you build a web application on top of some sort of triple store where users interact with data, they are in fact a major requirement, because you want your system to always be in a consistent state, even in the case of multithreading, and you want to take into account users simply clicking the back button in the middle of some workflow, etc. If you said exactly this sentence in a database forum ("probably not what users want in many cases"), you would be completely torn apart in a flame war. ;-)

>> - support for named graphs / contexts
>
> Named graphs are supported; I am not sure why you would need contexts on the individual triples,

Technical simplification, and one of the suggested implementations for named graphs (quadruples or "quads").

> and I am missing a clear description of this in the Sesame API.

http://www.openrdf.org/doc/sesame2/users/ch08.html#d0e1238

>> I have been working with RDF APIs for about 10 years now in various programming languages (even in Prolog, Haskell and Python). And my conclusion at the moment is that Sesame by far offers the most convenient API for a developer. But I am of course open to switching in case I get convincing arguments. ;-)
>
> Not sure how convincing you found them, or if the missing SPARQL fast lane is a blocker for you.
What I am missing is a convincing argument that shows me how I, as an experienced RDF developer, can benefit from Clerezza over e.g. Sesame, and I am sure that many other RDF developers will have the same hesitation. What you have argued is that Clerezza follows the 1998 RDF spec more closely; but the fact is that Sesame has already gone beyond that and tries to anticipate many features of the upcoming RDF 1.1 and SPARQL 1.1 specifications (which have not really been secrets for many years now). Furthermore, you have argued why Clerezza ALSO can do what Sesame (and Jena, for that matter) already does, but you did not show me what it can do MORE.

>>> It seems that what you are using of Sesame is mainly the SPI/API and not the actual triple store. This is definitely something Clerezza should have a good offer for.
>>
>> So convince me why it is BETTER than the established project ;-)
>
> Well, before Clerezza there were other attempts to have a thinner backend-agnostic layer, like RDF2Go. This seems to confirm that others too think that the APIs provided by Sesame or Jena aren't the thin RDF API one can use and implement without endorsing a big infrastructure or a large set of concepts one doesn't necessarily want to deal with.

As it happens, the initial founder of RDF2Go (Max Völkel) is a friend of mine. The main rationale behind the project was strictly to be able to exchange the underlying triple store more easily, because you might want to build applications first and think about the triple store later (and then of course get the bigger infrastructure). It is only used in research projects as far as I know, and is not really under active development anymore. As to big infrastructure: the Sesame jar is 2.1 MB altogether, smaller than Lucene or Xerces, so I would not consider it big infrastructure. It comes in many Maven modules, though (which is IMO a good thing).
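As a side note on the transactions point from above: the all-or-nothing behaviour I mean could be sketched minimally like this. It is an illustrative in-memory example, not the Sesame transaction API:

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Sketch of why transactions matter for the web-application case above:
    a batch of additions either all land in the graph or none do.
    Illustrative only, not the Sesame or Jena transaction API. */
public class TxGraph {
    private final Set<String> triples = new HashSet<>();

    /** Adds all triples, rolling back on any failure so the graph stays consistent. */
    public void addAll(Collection<String> batch) {
        Set<String> added = new HashSet<>();
        try {
            for (String t : batch) {
                if (t == null) throw new IllegalArgumentException("invalid triple");
                if (triples.add(t)) added.add(t);
            }
        } catch (RuntimeException e) {
            triples.removeAll(added);   // rollback: undo the partial batch
            throw e;
        }
    }

    public int size() { return triples.size(); }
}
```

A real store does this durably and under concurrency, of course; the sketch only shows the consistency contract a client can rely on.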
>>> When I last looked at it, Sesame was not ready to be used in Apache projects; I am not sure if license issues are the cause of it not being available in Maven Central.
>>
>> Sesame is under a BSD license, so it should be compatible with Apache projects:
>>
>> http://www.openrdf.org/download.jsp
>
> Is this true for the dependencies it requires as well?

Yes: http://repo.aduna-software.org/svn/info.aduna/commons/LICENSE.txt

Greetings,

Sebastian

--
| Dr. Sebastian Schaffert              [email protected]
| Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
| Head of Knowledge and Media Technologies Group  +43 662 2288 423
| Jakob-Haringer Strasse 5/II
| A-5020 Salzburg
