Re: Inserting large volumes into a RW TDB store.
I might be confusing the DynamicDataset...

Dick

On 20 October 2014 20:40, Dick Murray dandh...@gmail.com wrote:

Thanks, that confirms what I thought.

Crazy idea time! Am I correct in thinking that there is a dataset view which allows you to present multiple datasets as one? I'm sure I saw it in the codebase some time back. If I present the current datasets using this view, I can create a new dataset, load the new quads into it without a transaction, and then add it to a transient reference which is used by the system from then on; the old view would then be GC'd. This would keep the concurrency in the system and keep failures within a dataset.

Currently the TDB is 53GB for the 120M triples and it's estimated that it will grow by the same amount every working day, which equates to 31,200M or 31B triples and 13,780GB or 14TB on disk in a year...

Dick

On 20 Oct 2014 17:56, Andy Seaborne a...@apache.org wrote:

On 20/10/14 10:12, Dick Murray wrote:

Hello all.

Are there any pointers to inserting large volumes of data into a persistent RW TDB store, please?

I currently have an 8M-line, 500MB+ input file which is being parsed by JavaCC, and the created quads are inserted into a TDB store. The process generates 120M quads and takes just over 2hrs, which is 60M quads/hr, or 1M quads/min, or roughly 16.7K quads/sec. Parse is single threaded (12% overall CPU utilization, i.e. 100% of one core) with -Xmx8GB (16GB available) on an i7 8 core and a 512GB SSD.

I am working with the datasetGraph after opening the TDB store to remove any extra code which might slow the process down. I begin/commit a transaction for every 1000 input rows, as prior to this an OOME occurred after ~3M input rows when I tried to wrap the entire load in a transaction.

The TDB store is being read from, so I am unable to use a TDB loader.

I don't believe the runtime is poor but any pointers which would improve the speed...

Dick,

If you are loading into a live TDB store with transactions, performance will be lower than bulk loading offline. The system is a bit read-centric.

The only tuning parameter you have at your disposal is the commit size. 1000 is very small - try more like 100K. This isn't inside Fuseki so some batching already occurs, but the size of the transactions themselves can make a difference.

Andy
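The throughput and growth figures in the message above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are made up for illustration, the load time is rounded down to 2 hours, and 260 working days per year is an assumption (52 weeks x 5 days) chosen because it reproduces the quoted yearly totals.

```python
# Figures taken from the message; WORKING_DAYS is an assumption
# (52 weeks x 5 days) that reproduces the quoted yearly totals.
QUADS_PER_LOAD = 120_000_000   # quads generated by one load
LOAD_HOURS = 2                 # "just over 2hrs", rounded down
STORE_GB_PER_LOAD = 53         # on-disk size after one load
WORKING_DAYS = 260

# Throughput of a single load.
quads_per_hour = QUADS_PER_LOAD / LOAD_HOURS
quads_per_min = quads_per_hour / 60
quads_per_sec = quads_per_min / 60

# Projected growth if one such load is added every working day.
quads_per_year = QUADS_PER_LOAD * WORKING_DAYS
gb_per_year = STORE_GB_PER_LOAD * WORKING_DAYS

print(f"{quads_per_hour:,.0f} quads/hr")
print(f"{quads_per_min:,.0f} quads/min")
print(f"{quads_per_sec:,.0f} quads/sec")
print(f"{quads_per_year:,} quads/year, {gb_per_year:,} GB/year")
```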
Re: Inserting large volumes into a RW TDB store.
On 21/10/14 10:25, Dick Murray wrote:

I might be confusing the DynamicDataset...

You might well be ...

There is also DatasetGraphViewGraphs which is a view of one graph in a dataset.

Dick

On 20 October 2014 20:40, Dick Murray dandh...@gmail.com wrote:

Thanks, that confirms what I thought.

Crazy idea time! Am I correct in thinking that there is a dataset view which allows you to present multiple datasets as one? I'm sure I saw it in the codebase some time back. If I present the current datasets using this view, I can create a new dataset, load the new quads into it without a transaction, and then add it to a transient reference which is used by the system from then on; the old view would then be GC'd.

Yes - that might work for you.

It's like MultiUnion, which is a union graph, but for datasets.

This would keep the concurrency in the system and keep failures within a dataset.

Currently the TDB is 53GB for the 120M triples and it's estimated that it will grow by the same amount every working day, which equates to 31,200M or 31B triples and 13,780GB or 14TB on disk in a year...

Dick

On 20 Oct 2014 17:56, Andy Seaborne a...@apache.org wrote:

On 20/10/14 10:12, Dick Murray wrote:

Hello all.

Are there any pointers to inserting large volumes of data into a persistent RW TDB store, please?

I currently have an 8M-line, 500MB+ input file which is being parsed by JavaCC, and the created quads are inserted into a TDB store. The process generates 120M quads and takes just over 2hrs, which is 60M quads/hr, or 1M quads/min, or roughly 16.7K quads/sec. Parse is single threaded (12% overall CPU utilization, i.e. 100% of one core) with -Xmx8GB (16GB available) on an i7 8 core and a 512GB SSD.

I am working with the datasetGraph after opening the TDB store to remove any extra code which might slow the process down. I begin/commit a transaction for every 1000 input rows, as prior to this an OOME occurred after ~3M input rows when I tried to wrap the entire load in a transaction.

The TDB store is being read from, so I am unable to use a TDB loader.

I don't believe the runtime is poor but any pointers which would improve the speed...

Dick,

If you are loading into a live TDB store with transactions, performance will be lower than bulk loading offline. The system is a bit read-centric.

The only tuning parameter you have at your disposal is the commit size. 1000 is very small - try more like 100K. This isn't inside Fuseki so some batching already occurs, but the size of the transactions themselves can make a difference.

Andy
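The commit-size advice can be illustrated with a small batching loop. This is a language-neutral sketch only, not Jena or TDB code: `InMemoryStore`, `load_in_batches` and `batch_size` are hypothetical names, the store is a stand-in that merely counts commits, and the batch is counted in quads rather than input rows.

```python
from typing import Iterable, List, Tuple

Quad = Tuple[str, str, str, str]  # (graph, subject, predicate, object)


class InMemoryStore:
    """Hypothetical stand-in for a transactional store; NOT the TDB API."""

    def __init__(self) -> None:
        self.committed: List[Quad] = []
        self.pending: List[Quad] = []
        self.commits = 0

    def begin(self) -> None:
        self.pending = []

    def add(self, quad: Quad) -> None:
        self.pending.append(quad)

    def commit(self) -> None:
        self.committed.extend(self.pending)
        self.pending = []
        self.commits += 1


def load_in_batches(store: InMemoryStore, quads: Iterable[Quad],
                    batch_size: int = 100_000) -> int:
    """Insert quads, committing once every `batch_size` quads."""
    count = 0
    in_txn = False
    for quad in quads:
        if not in_txn:
            store.begin()
            in_txn = True
        store.add(quad)
        count += 1
        if count % batch_size == 0:
            store.commit()
            in_txn = False
    if in_txn:
        # Commit the final, partially filled batch.
        store.commit()
    return count


store = InMemoryStore()
quads = (("g", "s", "p", str(i)) for i in range(250))
loaded = load_in_batches(store, quads, batch_size=100)
print(loaded, store.commits)
```

Keeping the batch size a parameter makes it easy to try 1000, 100K and values in between against the real store and compare load times, while still bounding the amount of uncommitted data held in memory.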
Re: can't get subclass of a specific class
From 1m height, a likely cause is an error in the prefix product: .

Anyway, unless you have a special reason to use XML, it's better to use Turtle format all the way.

2014-10-21 16:53 GMT+02:00 Bruno Baloi bruno.ba...@rogers.com:

Hello,

I am new to Jena, and to the semantic space as well. I am trying to build an application using Jena and I am encountering some issues.

I built the Ontology using the Ontology APIs. It all looks good, and I can use a reasoner to get information out. I then generate the XML doc representing the Ontology. I can also use Protege to visualize the ontology. So far so good.

The problem comes when I try to use SPARQL. When I use the query ?subject rdfs:subClassOf ?object it works fine. However, if I try to use something to the effect of ?subject rdfs:subclassOf product:Food I get nothing (I am trying to get all the classes that are subclasses of Food), and there are many subclasses of Food declared. I have attached the generated XML Ontology.

If, however, I use OWL-DL (DL-Query in Protege) or in code, it works fine, i.e. I get the necessary subclasses.

So my question is: what am I doing wrong? Is the way the graph is stored in XML causing the problem for SPARQL, or am I misunderstanding the way to use the SPARQL query?

Any insight or guidance would be greatly appreciated.

regards,
Bruno

P.S. I had a look at the Wine ontology that is publicly available, and I noticed that in the XML doc sub classes are referred to by resource, whereas in the Product graph that I generate they are referred to by ID. Could that be a problem? And if it is, what do I need to use in the APIs to rectify this?

--
Jean-Marc Vanel
Déductions SARL - Consulting, services, training, Rule-based programming, Semantic Web
http://deductions-software.com/
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
Re: can't get subclass of a specific class
On 21/10/14 16:07, Jean-Marc Vanel wrote:

From 1m height, a likely cause is an error in the prefix product: .

I agree.

The data has relative URIs in it, so the result depends on how and where you read the file. (Try using riot to parse your file and see the Food declarations as full URIs.)

Maybe try setting XML base in the XML to the same as xmlns:product.

Andy

Anyway, unless you have a special reason to use XML, it's better to use Turtle format all the way.

2014-10-21 16:53 GMT+02:00 Bruno Baloi bruno.ba...@rogers.com:

Hello,

I am new to Jena, and to the semantic space as well. I am trying to build an application using Jena and I am encountering some issues.

I built the Ontology using the Ontology APIs. It all looks good, and I can use a reasoner to get information out. I then generate the XML doc representing the Ontology. I can also use Protege to visualize the ontology. So far so good.

The problem comes when I try to use SPARQL. When I use the query ?subject rdfs:subClassOf ?object it works fine. However, if I try to use something to the effect of ?subject rdfs:subclassOf product:Food I get nothing (I am trying to get all the classes that are subclasses of Food), and there are many subclasses of Food declared. I have attached the generated XML Ontology.

If, however, I use OWL-DL (DL-Query in Protege) or in code, it works fine, i.e. I get the necessary subclasses.

So my question is: what am I doing wrong? Is the way the graph is stored in XML causing the problem for SPARQL, or am I misunderstanding the way to use the SPARQL query?

Any insight or guidance would be greatly appreciated.

regards,
Bruno

P.S. I had a look at the Wine ontology that is publicly available, and I noticed that in the XML doc sub classes are referred to by resource, whereas in the Product graph that I generate they are referred to by ID. Could that be a problem? And if it is, what do I need to use in the APIs to rectify this?
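The effect of relative URIs described above can be sketched with ordinary URI resolution. In RDF/XML, rdf:ID="Food" denotes the IRI obtained by resolving "#Food" against the base, and without xml:base the base is wherever the file was read from. The namespace and file location below are hypothetical (the attached ontology is not shown in the thread), and `resolve_rdf_id` is an illustrative helper, not a Jena function.

```python
from urllib.parse import urljoin

# Hypothetical namespace bound to the "product:" prefix in the query.
PRODUCT_NS = "http://example.org/product#"
QUERY_IRI = PRODUCT_NS + "Food"  # what product:Food expands to


def resolve_rdf_id(base: str, local_id: str) -> str:
    """Resolve an rdf:ID value the way RDF/XML does: '#' + id against the base."""
    return urljoin(base, "#" + local_id)


# No xml:base: the base is the location the file was read from
# (hypothetical URL), so the class gets a location-dependent IRI.
from_file = resolve_rdf_id("http://localhost/files/product.owl", "Food")

# xml:base set to the namespace without its trailing '#'.
with_base = resolve_rdf_id("http://example.org/product", "Food")

print(from_file, from_file == QUERY_IRI)
print(with_base, with_base == QUERY_IRI)
```

Under these assumptions the first IRI does not match the one the query asks for, which would explain an empty result, while the second one does.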
Re: can't get subclass of a specific class
Thx guys for the quick response. I will set the base and see what happens.

Regards,
Bruno

On 14-10-21 11:16 AM, Andy Seaborne wrote:

On 21/10/14 16:07, Jean-Marc Vanel wrote:

From 1m height, a likely cause is an error in the prefix product: .

I agree.

The data has relative URIs in it, so the result depends on how and where you read the file. (Try using riot to parse your file and see the Food declarations as full URIs.)

Maybe try setting XML base in the XML to the same as xmlns:product.

Andy

Anyway, unless you have a special reason to use XML, it's better to use Turtle format all the way.

2014-10-21 16:53 GMT+02:00 Bruno Baloi bruno.ba...@rogers.com:

Hello,

I am new to Jena, and to the semantic space as well. I am trying to build an application using Jena and I am encountering some issues.

I built the Ontology using the Ontology APIs. It all looks good, and I can use a reasoner to get information out. I then generate the XML doc representing the Ontology. I can also use Protege to visualize the ontology. So far so good.

The problem comes when I try to use SPARQL. When I use the query ?subject rdfs:subClassOf ?object it works fine. However, if I try to use something to the effect of ?subject rdfs:subclassOf product:Food I get nothing (I am trying to get all the classes that are subclasses of Food), and there are many subclasses of Food declared. I have attached the generated XML Ontology.

If, however, I use OWL-DL (DL-Query in Protege) or in code, it works fine, i.e. I get the necessary subclasses.

So my question is: what am I doing wrong? Is the way the graph is stored in XML causing the problem for SPARQL, or am I misunderstanding the way to use the SPARQL query?

Any insight or guidance would be greatly appreciated.

regards,
Bruno

P.S. I had a look at the Wine ontology that is publicly available, and I noticed that in the XML doc sub classes are referred to by resource, whereas in the Product graph that I generate they are referred to by ID. Could that be a problem? And if it is, what do I need to use in the APIs to rectify this?
Re: can't get subclass of a specific class
Great, that worked, thx a bundle.

Now for a short follow-up question:

1) What would the SPARQL query look like to get all the classes that have a certain property with a certain value?

I would like to have a base class (i.e. Product) that carries a certain value (Product.ClassCounter) and that all other classes inherit from. As things get added to the Ontology, this counter gets incremented. I would like to be able to query the ontology for all the classes that have a ClassCounter value greater than X, for instance.

Any hints would be greatly appreciated.

regards,
Bruno

On 14-10-21 11:07 AM, Jean-Marc Vanel wrote:

From 1m height, a likely cause is an error in the prefix product: .

Anyway, unless you have a special reason to use XML, it's better to use Turtle format all the way.

2014-10-21 16:53 GMT+02:00 Bruno Baloi bruno.ba...@rogers.com:

Hello,

I am new to Jena, and to the semantic space as well. I am trying to build an application using Jena and I am encountering some issues.

I built the Ontology using the Ontology APIs. It all looks good, and I can use a reasoner to get information out. I then generate the XML doc representing the Ontology. I can also use Protege to visualize the ontology. So far so good.

The problem comes when I try to use SPARQL. When I use the query ?subject rdfs:subClassOf ?object it works fine. However, if I try to use something to the effect of ?subject rdfs:subclassOf product:Food I get nothing (I am trying to get all the classes that are subclasses of Food), and there are many subclasses of Food declared. I have attached the generated XML Ontology.

If, however, I use OWL-DL (DL-Query in Protege) or in code, it works fine, i.e. I get the necessary subclasses.

So my question is: what am I doing wrong? Is the way the graph is stored in XML causing the problem for SPARQL, or am I misunderstanding the way to use the SPARQL query?

Any insight or guidance would be greatly appreciated.

regards,
Bruno

P.S. I had a look at the Wine ontology that is publicly available, and I noticed that in the XML doc sub classes are referred to by resource, whereas in the Product graph that I generate they are referred to by ID. Could that be a problem? And if it is, what do I need to use in the APIs to rectify this?