Re: Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Chris Bizer wrote: Hi Orri, It is my feeling that RDF has a dual role: 1. interchange format: This is like what XML does, except that RDF has more semantics and expressivity. 2. database storage format for cases where data must be integrated and is too heterogeneous to easily fall into one relational schema. This is for example the case in the open web conversation and social space. The first case is for mapping, the second for warehousing. Aside from this, there is potential for more expressive queries through the query language dealing with inferencing, like subclass/subproperty/transitive etc. These do not go very well with SQL views. I cannot agree more with what you say :-) We are seeing the first RDF use case emerge within initiatives like the Linking Open Data effort, where besides being more expressive, RDF is also playing to its strength of providing data links between records in different databases. Chris, Talking with people from industry, I get the feeling that more and more people also understand the second use case and that RDF is increasingly used as a technology for something like poor man's data integration. You don't have to spend a lot of time and money on designing a comprehensive data warehouse. You just throw data having different schemata from different sources together and instantly get the benefit that you can browse and query the data and that you have proper provenance tracking (using Named Graphs). Depending on how much data integration you need, you then start to apply some identity resolution and schema mapping techniques. We have been talking to some pharma and media companies that have done data warehousing for years and they all seem to be very interested in this quick and dirty approach. Quick & Dirty is simply not how I would characterize this matter. I prefer to describe this as step 1 in a multi-phased approach to RDF-based data integration. For both use cases, inferencing is a nice add-on but not essential. 
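Chris's point about provenance tracking via Named Graphs can be pictured in miniature. The following is a toy Python illustration (not any particular store's API; all names and URIs are made up): each triple is stored as a quad carrying the graph it came from, so conflicting statements thrown together from different sources remain distinguishable and attributable.

```python
# Toy quad store: triples tagged with the named graph (source) they came from.
class QuadStore:
    def __init__(self):
        self.quads = []

    def add(self, s, p, o, g):
        """Record a triple together with the named graph it was loaded from."""
        self.quads.append((s, p, o, g))

    def query(self, s=None, p=None, o=None):
        """Return matching triples plus the graph (provenance) of each match."""
        return [(qs, qp, qo, qg) for (qs, qp, qo, qg) in self.quads
                if (s is None or qs == s)
                and (p is None or qp == p)
                and (o is None or qo == o)]

store = QuadStore()
# The same entity described (inconsistently) by two autonomous sources:
store.add("ex:aspirin", "ex:risk", "low",  "http://pharma-a.example/")
store.add("ex:aspirin", "ex:risk", "high", "http://pharma-b.example/")

for s, p, o, g in store.query(s="ex:aspirin"):
    print(s, p, o, "asserted by", g)
```

Both statements survive side by side; identity resolution and schema mapping can then be layered on later, exactly as the phased approach above suggests.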
Within the first use case, inferencing usually does not work, as data published by various autonomous sources tends to be too dirty for reasoning engines. Inferencing is not a nice add-on; it is essential (in varying degrees) once you get beyond the initial stages of heterogeneous data integration. As with all things, these matters are connected and inherently symbiotic: you can't inference without having something you want to reason about available in palatable form, which goes back to the phased approach I refer to above. In my eyes, and experience, RDF is a powerful vehicle for implementing conceptual-level data access that sits atop heterogeneous data sources. Its novelty comes from the platform independence that it injects into the data integration technology realm. Kingsley Cheers, Chris -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of Orri Erling Sent: Tuesday, 30 September 2008 00:16 To: 'Seaborne, Andy'; 'Story Henry' Cc: [EMAIL PROTECTED] Subject: RE: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL From Henry Story: As a matter of interest, would it be possible to develop RDF stores that optimize the layout of the data by analyzing the queries to the database? A bit like a Java Just In Time compiler analyses the usage of the classes in order to decide how to optimize the compilation. From Andy Seaborne: On a similar note, by mining the query logs it would be possible to create parameterised queries and associated plan fragments without the client needing to notify the server of the templates. Coupled with automatically calculating possible materialized views or other layout optimizations, the poor, overworked client application writer doesn't get brought into optimizing the server. Andy Orri here: With the BSBM workload, using parametrized queries at a small scale saves roughly 1/3 of the execution time. 
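Orri's plan-caching idea in the next message (reuse a plan when the same query text is resubmitted with only literal values changed) can be sketched as a cache keyed on a literal-normalized query string. This is purely an illustrative Python sketch, not Virtuoso's actual mechanism; `fake_optimizer` stands in for a real cost-based optimizer.

```python
import re

PLAN_CACHE = {}
calls = []  # tracks how often the (expensive) optimizer actually runs

def normalize(query):
    """Replace string and numeric literals with a placeholder so that
    queries differing only in literal values share one cache key."""
    q = re.sub(r'"[^"]*"', '?lit', query)        # string literals
    q = re.sub(r'\b\d+(\.\d+)?\b', '?lit', q)    # numeric literals
    return q

def get_plan(query, optimizer):
    """Reuse a cached plan when only literals changed; otherwise optimize."""
    key = normalize(query)
    if key not in PLAN_CACHE:
        PLAN_CACHE[key] = optimizer(query)       # full cost model runs once
    return PLAN_CACHE[key]

def fake_optimizer(query):
    calls.append(query)                          # count real optimizations
    return "plan-for:" + normalize(query)

get_plan('SELECT ?s WHERE { ?s ex:price 42 }', fake_optimizer)
get_plan('SELECT ?s WHERE { ?s ex:price 99 }', fake_optimizer)
print(len(calls))  # 1 -- the second query reused the cached plan
```

The exceptions Orri mentions are real: when a different literal would call for a different plan (e.g. a highly skewed value), this cache silently reuses a stale plan, and detecting that would cost as much as re-optimizing.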
It is possible to remember query plans and to notice if the same query text is submitted with only changes in literal values. If the first query ran quickly, one may presume the query with substitutions will also run quickly. There are of course exceptions. But detecting these would mean running most of the optimizer cost model and would eliminate any benefit from caching. The other optimizations suggested have a larger upside but are far harder. I would say that if we have a predictable workload, then mapping relational to RDF is a lot easier than expecting the DBMS to figure out materialized views to do the same. If we do not have a predictable workload, then making too many materialized views based on transient usage patterns has a large downside because it grows the database, meaning less working set. The difference between an in-memory random access and a random access that goes to disk is about 5000-fold. Plus there is a high cost to making
Re: Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Kingsley Idehen wrote: Chris Bizer wrote: ... Depending on how much data integration you need, you then start to apply some identity resolution and schema mapping techniques. We have been talking to some pharma and media companies that have done data warehousing for years and they all seem to be very interested in this quick and dirty approach. Quick & Dirty is simply not how I would characterize this matter. I prefer to describe this as step 1 in a multi-phased approach to RDF-based data integration. Kingsley, I agree. I find it clean to make the basic but necessary first steps towards data integration (identification of things, mapping, etc.). Building upon legacy systems, which you just have to adapt, doesn't make the approach dirty: it makes it possible ;-) cheers, fps
Re: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Chris Bizer wrote: Hi Kingsley and Paul, Yes, I completely agree with you that different storage solutions fit different use cases and that one of the main strengths of the RDF data model is its flexibility and the possibility to mix different schemata. Nevertheless, I think it is useful to give application developers an indicator about what performance they can expect when they choose a specific architecture, which is what the benchmark is trying to do. Chris, Yes, but the user profile has to be a little clearer. If you separate the results in the narrative you achieve the goal. You can use SQL numbers as a sort of benchmark if you clearly explain the natural skew that SQL enjoys due to the nature of the schema. We plan to run the benchmark again in January and it would be great to also test Tucana/Kowari/Mulgara in this run. As the performance of RDF stores is constantly improving, let's also hope that the picture will not look that bad for them anymore then. But at the current time, there is no clear sense of what better means :-) What's the goal? What I fundamentally take from the benchmarks is the following: 1. Native RDF and RDF Views/Mapper scalability is becoming less of an issue (of course depending on your choice of product) and we are already at the point where this technology can be used for real-world solutions that have enterprise-level scalability demands and expectations 2. It's impractical to create RDF warehouses from existing SQL Data Sources when you can put RDF Views / Wrappers in front of the SQL Data Sources (SQL cost optimization technology has evolved significantly over the years across RDBMS engines). And Yes, I would also like to see Mulgara and other RDF Stores in the next round of benchmarks :-) Kingsley Cheers, Chris -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of Kingsley Idehen Sent: Wednesday, 24 September 2008 20:57 To: Paul Gearon Cc: [EMAIL PROTECTED]; public-lod@w3.org Subject: Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL Paul Gearon wrote: On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren [EMAIL PROTECTED] wrote: On 09/19/08 23:12 +0200, Orri Erling wrote: Has there been any analysis on whether there is a *fundamental* reason for such performance difference? Or is it simply a question of maturity; in other words, relational db technology has been around for a very long time and is very mature, whereas RDF implementations are still quite recent, so this gap will surely narrow ...? This is a very complex subject. I will offer some analysis below, but this I fear will only raise further questions. This is not the end of the road, far from it. As far as I understand, another issue is relevant: this benchmark is somewhat unfair as the relational stores have one advantage compared to the native triple stores: the relational data structure is fixed (Products, Producers, Reviews, etc. with given columns), while the triple representation is generic (arbitrary s,p,o). This point has an effect on several levels. For instance, the flexibility afforded by triples means that objects stored in this structure require processing just to piece it all together, whereas the RDBMS has already encoded the structure into the table. Ironically, this is exactly the reason we (Tucana/Kowari/Mulgara) ended up building an RDF database instead of building on top of an RDBMS: The flexibility in table structure was less efficient than a system that just knew it only had to deal with 3 columns. Obviously the shape of the data (among other things) dictates which type of storage is the better one to use. A related point is that processing RDF to create an object means you have to move around a lot in the graph. This could mean a lot of seeking on disk, while an RDBMS will usually find the entire object in one place on the disk. 
And seeks kill performance. This leads to the operations used to build objects from an RDF store. A single object often requires the traversal of several statements, where the object of one statement becomes the subject of the next. Since the tables are typically represented as Subject/Predicate/Object, this means that the main table will be joined against itself. Even RDBMSs are notorious for not doing this efficiently. One of the problems with self-joins is that efficient operations like merge-joins (when they can be identified) will still result in lots of seeking, since simple iteration on both sides of the join means seeking around in the same data. Of course, there ARE ways to optimize some of this, but the various stores are only just starting to get to these optimizations now. Relational databases suffer similar problems, but joins are usually only required for complex
Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren [EMAIL PROTECTED] wrote: On 09/19/08 23:12 +0200, Orri Erling wrote: Has there been any analysis on whether there is a *fundamental* reason for such performance difference? Or is it simply a question of maturity; in other words, relational db technology has been around for a very long time and is very mature, whereas RDF implementations are still quite recent, so this gap will surely narrow ...? This is a very complex subject. I will offer some analysis below, but this I fear will only raise further questions. This is not the end of the road, far from it. As far as I understand, another issue is relevant: this benchmark is somewhat unfair as the relational stores have one advantage compared to the native triple stores: the relational data structure is fixed (Products, Producers, Reviews, etc. with given columns), while the triple representation is generic (arbitrary s,p,o). This point has an effect on several levels. For instance, the flexibility afforded by triples means that objects stored in this structure require processing just to piece it all together, whereas the RDBMS has already encoded the structure into the table. Ironically, this is exactly the reason we (Tucana/Kowari/Mulgara) ended up building an RDF database instead of building on top of an RDBMS: The flexibility in table structure was less efficient than a system that just knew it only had to deal with 3 columns. Obviously the shape of the data (among other things) dictates which type of storage is the better one to use. A related point is that processing RDF to create an object means you have to move around a lot in the graph. This could mean a lot of seeking on disk, while an RDBMS will usually find the entire object in one place on the disk. And seeks kill performance. This leads to the operations used to build objects from an RDF store. 
A single object often requires the traversal of several statements, where the object of one statement becomes the subject of the next. Since the tables are typically represented as Subject/Predicate/Object, this means that the main table will be joined against itself. Even RDBMSs are notorious for not doing this efficiently. One of the problems with self-joins is that efficient operations like merge-joins (when they can be identified) will still result in lots of seeking, since simple iteration on both sides of the join means seeking around in the same data. Of course, there ARE ways to optimize some of this, but the various stores are only just starting to get to these optimizations now. Relational databases suffer similar problems, but joins are usually only required for complex structures between different tables, which can be stored on different spindles. Contrast this to RDF, which needs to do many of these joins for all but the simplest of data. One can question whether such flexibility is relevant in practice, and if so, one may try to extract such structured patterns from data on-the-fly. Still, it's important to note that we're comparing somewhat different things here between the relational and the triple representation of the benchmark. This is why I think it is very important to consider the type of data being stored before choosing the type of storage to use. For some applications an RDBMS is going to win hands down every time. For other applications, an RDF store is definitely the way to go. Understanding the flexibility and performance constraints of each is important. This kind of benchmarking helps with that. It also helps identify where RDF databases need to pick up their act. Regards, Paul Gearon
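Paul's self-join point can be demonstrated with a throwaway SQLite session: in a generic s/p/o table, following a chain of statements (review to product to producer) means joining the triples table against itself once per hop, whereas a relational schema would hold the whole record in one row. The table and property names below are invented for illustration only.

```python
import sqlite3

# A generic triples table, as a native RDF store conceptually keeps it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
con.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("review1",   "reviewFor",  "product7"),
    ("product7",  "producedBy", "producer3"),
    ("producer3", "label",      "Acme"),
])

# Reassembling one "object" (review -> product -> producer -> label)
# takes two self-joins; a relational row would hold it all in one place.
row = con.execute("""
    SELECT t3.o
    FROM triples t1
    JOIN triples t2 ON t2.s = t1.o   -- object of one statement ...
    JOIN triples t3 ON t3.s = t2.o   -- ... becomes subject of the next
    WHERE t1.s = 'review1'  AND t1.p = 'reviewFor'
      AND t2.p = 'producedBy' AND t3.p = 'label'
""").fetchone()
print(row[0])  # Acme
```

Each extra hop in the graph adds another self-join over the same table, which is exactly the seek-heavy access pattern Paul describes.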
Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Paul Gearon wrote: On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren [EMAIL PROTECTED] wrote: On 09/19/08 23:12 +0200, Orri Erling wrote: Has there been any analysis on whether there is a *fundamental* reason for such performance difference? Or is it simply a question of maturity; in other words, relational db technology has been around for a very long time and is very mature, whereas RDF implementations are still quite recent, so this gap will surely narrow ...? This is a very complex subject. I will offer some analysis below, but this I fear will only raise further questions. This is not the end of the road, far from it. As far as I understand, another issue is relevant: this benchmark is somewhat unfair as the relational stores have one advantage compared to the native triple stores: the relational data structure is fixed (Products, Producers, Reviews, etc. with given columns), while the triple representation is generic (arbitrary s,p,o). This point has an effect on several levels. For instance, the flexibility afforded by triples means that objects stored in this structure require processing just to piece it all together, whereas the RDBMS has already encoded the structure into the table. Ironically, this is exactly the reason we (Tucana/Kowari/Mulgara) ended up building an RDF database instead of building on top of an RDBMS: The flexibility in table structure was less efficient than a system that just knew it only had to deal with 3 columns. Obviously the shape of the data (among other things) dictates which type of storage is the better one to use. A related point is that processing RDF to create an object means you have to move around a lot in the graph. This could mean a lot of seeking on disk, while an RDBMS will usually find the entire object in one place on the disk. And seeks kill performance. This leads to the operations used to build objects from an RDF store. 
A single object often requires the traversal of several statements, where the object of one statement becomes the subject of the next. Since the tables are typically represented as Subject/Predicate/Object, this means that the main table will be joined against itself. Even RDBMSs are notorious for not doing this efficiently. One of the problems with self-joins is that efficient operations like merge-joins (when they can be identified) will still result in lots of seeking, since simple iteration on both sides of the join means seeking around in the same data. Of course, there ARE ways to optimize some of this, but the various stores are only just starting to get to these optimizations now. Relational databases suffer similar problems, but joins are usually only required for complex structures between different tables, which can be stored on different spindles. Contrast this to RDF, which needs to do many of these joins for all but the simplest of data. One can question whether such flexibility is relevant in practice, and if so, one may try to extract such structured patterns from data on-the-fly. Still, it's important to note that we're comparing somewhat different things here between the relational and the triple representation of the benchmark. This is why I think it is very important to consider the type of data being stored before choosing the type of storage to use. For some applications an RDBMS is going to win hands down every time. For other applications, an RDF store is definitely the way to go. Understanding the flexibility and performance constraints of each is important. This kind of benchmarking helps with that. It also helps identify where RDF databases need to pick up their act. Regards, Paul Gearon Paul, You make valid points; the problem here is that the benchmark has been released without enough clarity about its prime purpose. To even compare RDF Quad Stores with an RDBMS engine when the schema is Relational is in itself kinda twisted. 
The role of mappers (D2RQ & Virtuoso RDF Views), for instance, should have been made much clearer, maybe in separate results tables. I say this because these mappers offer different approaches to projecting RDBMS-based data in RDF Linked Data form, on the fly, and their purpose in this benchmark is all about raw performance and scalability as it relates to the following RDF Linked Data generation and deployment conditions: 1. Schema is Relational 2. RDF warehouse is impractical As I am sure you know, we could invert this whole benchmark Open World style, and then bring RDBMS engines to their knees by incorporating SPARQL query patterns comprised of ?p's and subclasses. To conclude, the quad store numbers should simply be a comparison of the quad stores themselves, and not the quad stores vs the mappers or native SQL. This clarification really needs to make its way into the benchmark narrative. -- Regards, Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen President & CEO
AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Hi Kingsley and Paul, Yes, I completely agree with you that different storage solutions fit different use cases and that one of the main strengths of the RDF data model is its flexibility and the possibility to mix different schemata. Nevertheless, I think it is useful to give application developers an indicator about what performance they can expect when they choose a specific architecture, which is what the benchmark is trying to do. We plan to run the benchmark again in January and it would be great to also test Tucana/Kowari/Mulgara in this run. As the performance of RDF stores is constantly improving, let's also hope that the picture will not look that bad for them anymore then. Cheers, Chris -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of Kingsley Idehen Sent: Wednesday, 24 September 2008 20:57 To: Paul Gearon Cc: [EMAIL PROTECTED]; public-lod@w3.org Subject: Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL Paul Gearon wrote: On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren [EMAIL PROTECTED] wrote: On 09/19/08 23:12 +0200, Orri Erling wrote: Has there been any analysis on whether there is a *fundamental* reason for such performance difference? Or is it simply a question of maturity; in other words, relational db technology has been around for a very long time and is very mature, whereas RDF implementations are still quite recent, so this gap will surely narrow ...? This is a very complex subject. I will offer some analysis below, but this I fear will only raise further questions. This is not the end of the road, far from it. As far as I understand, another issue is relevant: this benchmark is somewhat unfair as the relational stores have one advantage compared to the native triple stores: the relational data structure is fixed (Products, Producers, Reviews, etc. with given columns), while the triple representation is generic (arbitrary s,p,o). 
This point has an effect on several levels. For instance, the flexibility afforded by triples means that objects stored in this structure require processing just to piece it all together, whereas the RDBMS has already encoded the structure into the table. Ironically, this is exactly the reason we (Tucana/Kowari/Mulgara) ended up building an RDF database instead of building on top of an RDBMS: The flexibility in table structure was less efficient than a system that just knew it only had to deal with 3 columns. Obviously the shape of the data (among other things) dictates which type of storage is the better one to use. A related point is that processing RDF to create an object means you have to move around a lot in the graph. This could mean a lot of seeking on disk, while an RDBMS will usually find the entire object in one place on the disk. And seeks kill performance. This leads to the operations used to build objects from an RDF store. A single object often requires the traversal of several statements, where the object of one statement becomes the subject of the next. Since the tables are typically represented as Subject/Predicate/Object, this means that the main table will be joined against itself. Even RDBMSs are notorious for not doing this efficiently. One of the problems with self-joins is that efficient operations like merge-joins (when they can be identified) will still result in lots of seeking, since simple iteration on both sides of the join means seeking around in the same data. Of course, there ARE ways to optimize some of this, but the various stores are only just starting to get to these optimizations now. Relational databases suffer similar problems, but joins are usually only required for complex structures between different tables, which can be stored on different spindles. Contrast this to RDF, which needs to do many of these joins for all but the simplest of data. 
One can question whether such flexibility is relevant in practice, and if so, one may try to extract such structured patterns from data on-the-fly. Still, it's important to note that we're comparing somewhat different things here between the relational and the triple representation of the benchmark. This is why I think it is very important to consider the type of data being stored before choosing the type of storage to use. For some applications an RDBMS is going to win hands down every time. For other applications, an RDF store is definitely the way to go. Understanding the flexibility and performance constraints of each is important. This kind of benchmarking helps with that. It also helps identify where RDF databases need to pick up their act. Regards, Paul Gearon Paul, You make valid points; the problem here is that the benchmark has been released without enough clarity about its prime purpose. To even compare RDF Quad Stores with an RDBMS
Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
As a matter of interest, would it be possible to develop RDF stores that optimize the layout of the data by analyzing the queries to the database? A bit like a Java Just In Time compiler analyses the usage of the classes in order to decide how to optimize the compilation. Henry On 24 Sep 2008, at 20:30, Paul Gearon wrote: A related point is that processing RDF to create an object means you have to move around a lot in the graph. This could mean a lot of seeking on disk, while an RDBMS will usually find the entire object in one place on the disk. And seeks kill performance. This leads to the operations used to build objects from an RDF store. A single object often requires the traversal of several statements, where the object of one statement becomes the subject of the next. Since the tables are typically represented as Subject/Predicate/Object, this means that the main table will be joined against itself. Even RDBMSs are notorious for not doing this efficiently.
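Henry's JIT analogy might look something like the following toy sketch: the store watches incoming triple patterns and, once a predicate crosses a usage threshold, builds a dedicated index for it, much as a JIT compiler only optimizes methods after they have run hot. This is purely hypothetical; no store named in this thread is claimed to work this way, and all names below are invented.

```python
from collections import Counter

HOT_THRESHOLD = 3  # arbitrary cutoff for this illustration

class AdaptiveStore:
    """Toy store that 'JIT-compiles' layout: hot predicates get an index."""
    def __init__(self):
        self.pattern_counts = Counter()
        self.special_indexes = set()

    def record_query(self, predicate):
        """Called for each triple pattern seen in an incoming query."""
        self.pattern_counts[predicate] += 1
        if (self.pattern_counts[predicate] >= HOT_THRESHOLD
                and predicate not in self.special_indexes):
            self.special_indexes.add(predicate)  # "compile" the hot path

store = AdaptiveStore()
for _ in range(3):
    store.record_query("ex:reviewFor")   # queried often -> gets an index
store.record_query("ex:label")           # queried once -> stays generic
print(sorted(store.special_indexes))     # ['ex:reviewFor']
```

Orri's caveat in the thread applies directly: each such "compiled" structure grows the database, so indexing transient patterns too eagerly shrinks the working set that fits in memory.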
Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Hello Eyal, this benchmark is somewhat unfair as the relational stores have one advantage compared to the native triple stores: the relational data structure is fixed (Products, Producers, Reviews, etc. with given columns), while the triple representation is generic (arbitrary s,p,o). One can question whether such flexibility is relevant in practice, and if so, one may try to extract such structured patterns from data on-the-fly. That will be our next big extension -- updatable RDF Views, as proposed in http://esw.w3.org/topic/UpdatingRelationalDataViaSPARUL . So we will be able to load BSBM data as RDF and query them via a SPARQL web service endpoint; thus we will masquerade the relational storage entirely. Best Regards, Ivan Mikhailov, OpenLink Software http://virtuoso.openlinksw.com
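The updatable-views idea Ivan describes can be sketched roughly as follows: given a fixed mapping from RDF predicates to relational columns, a SPARUL-style INSERT against the view can be rewritten into a SQL UPDATE on the base table. The mapping and all names below are hypothetical illustrations, not Virtuoso's actual rewriting scheme.

```python
# Hypothetical predicate -> (table, column) mapping for a BSBM-like schema.
MAPPING = {
    "ex:label": ("products", "label"),
    "ex:price": ("products", "price"),
}

def rewrite_insert(subject_id, predicate, value):
    """Rewrite 'INSERT { <subject> predicate value }' over the RDF view
    into a parameterized SQL statement against the mapped base table."""
    table, column = MAPPING[predicate]
    sql = f"UPDATE {table} SET {column} = ? WHERE id = ?"
    return sql, (value, subject_id)

sql, params = rewrite_insert(7, "ex:price", 19.99)
print(sql)     # UPDATE products SET price = ? WHERE id = ?
print(params)  # (19.99, 7)
```

With both reads (SPARQL) and writes (SPARUL) rewritten this way, the relational storage is indeed fully masqueraded behind the RDF view.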
Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL
Hi all, over the last few weeks, we have extended the Berlin SPARQL Benchmark (BSBM) to a multi-client scenario, fine-tuned the benchmark dataset and the query mix, and implemented a SQL version of the benchmark in order to be able to compare SPARQL stores with classical SQL stores. Today, we have released the results of running the BSBM Benchmark Version 2 against: + three RDF stores (Virtuoso Version 5.0.8, Sesame Version 2.2, Jena TDB Version 0.53) and + two relational database-to-RDF wrappers (D2R Server Version 0.4 and Virtuoso - RDF Views Version 5.0.8) for datasets ranging from 250,000 triples to 100,000,000 triples. In order to put the SPARQL query performance into context, we also report the results of running the SQL version of the benchmark against two relational database management systems (MySQL 5.1.26 and Virtuoso - RDBMS Version 5.0.8). A comparison of the performance for a single client working against the stores is found here: http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html#comparison A comparison of the performance for 1 to 16 clients simultaneously executing query mixes against the stores is found here: http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html#multiResults The complete benchmark results, including the setup of the experiment and the configuration of the different stores, are found here: http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/index.html The current specification of the Berlin SPARQL Benchmark is found here: http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/20080912/ It is interesting to see: 1. that relational database-to-RDF wrappers generally outperform RDF stores for larger dataset sizes. 2. that no store outperforms the others for all queries and dataset sizes. 3. that the query throughput still varies widely within the multi-client scenario. 4. that the fastest RDF store is still 7 times slower than a relational database. 
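For readers unfamiliar with the multi-client setup, the driver can be pictured as N concurrent clients each executing query mixes, with throughput reported as query mixes per hour (QMpH). Below is a minimal Python sketch of that shape only; `run_query_mix` is a stand-in for executing a real BSBM query mix against an endpoint, and the real test driver is of course far more elaborate (warm-up runs, per-query timings, etc.).

```python
import threading
import time

def run_query_mix():
    """Placeholder for real SPARQL/SQL round trips of one BSBM query mix."""
    time.sleep(0.001)

def client(n_mixes, counter, lock):
    """One benchmark client: runs its query mixes and tallies completions."""
    for _ in range(n_mixes):
        run_query_mix()
        with lock:
            counter[0] += 1

def benchmark(n_clients=4, mixes_per_client=5):
    """Run n_clients concurrently; return (total mixes, mixes per hour)."""
    counter, lock = [0], threading.Lock()
    start = time.time()
    threads = [threading.Thread(target=client,
                                args=(mixes_per_client, counter, lock))
               for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - start
    return counter[0], counter[0] / elapsed * 3600  # total, QMpH

total, qmph = benchmark()
print(total)  # 20 query mixes completed across all 4 clients
```

Scaling `n_clients` from 1 to 16 against a fixed store is what produces the multi-client comparison linked above.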
Thanks a lot to + Eli Lilly and Company, and especially Susie Stephens, for making this work possible through a research grant. + Orri Erling, Andy Seaborne, Arjohn Kampman, Michael Schmidt, Richard Cyganiak, Ivan Mikhailov, Patrick van Kleef, and Christian Becker for their feedback on the benchmark design and their help with configuring the stores and running the benchmark experiment. Without all your help it would not have been possible to conduct this experiment. We highly welcome feedback on the benchmark design and the results of the experiment. Cheers, Chris Bizer and Andreas Schultz -- Prof. Dr. Chris Bizer Freie Universität Berlin Phone: +49 30 838 55509 Mail: [EMAIL PROTECTED] Web: www.bizer.de