Re: Inserting large volumes into a RW TDB store.

2014-10-21 Thread Dick Murray
I might be confusing the DynamicDataset...

Dick

On 20 October 2014 20:40, Dick Murray dandh...@gmail.com wrote:

 Thanks that confirms what I thought.

 Crazy idea time!

 Am I correct in thinking that there is a dataset view which allows you
 to present multiple datasets as one? I'm sure I saw it in the codebase some
 time back?

 If I present the current datasets using this view I can create a new
 dataset and load in the new quads without a transaction then add it to a
 transient reference which is used by the system from then on and the old
 view would then be garbage collected.

 This would keep the concurrency in the system and keep failures within a
 dataset. Currently the TDB is 53GB for the 120M triples and it's estimated
 that it will grow by the same amount every working day which equates to
 31,200M or 31B triples and 13,780GB or 14TB on disk in a year...

 Dick
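
The yearly estimate above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic: the figure of 260 working days per year (52 weeks x 5 days) is inferred from the quoted totals rather than stated in the thread, and the class and method names are made up for illustration.

```java
// Sketch only: reproduces the yearly growth estimate from the quoted figures.
// WORKING_DAYS_PER_YEAR is an assumption inferred from the totals (52 x 5).
final class GrowthEstimate {
    static final long QUADS_PER_DAY = 120_000_000L; // 120M per daily load
    static final long GB_PER_DAY = 53L;             // 53GB on disk per daily load
    static final int WORKING_DAYS_PER_YEAR = 260;   // assumed, not from the thread

    static long quadsPerYear() {
        return QUADS_PER_DAY * WORKING_DAYS_PER_YEAR;
    }

    static long gbPerYear() {
        return GB_PER_DAY * WORKING_DAYS_PER_YEAR;
    }

    public static void main(String[] args) {
        System.out.println(quadsPerYear() / 1_000_000 + "M per year");
        System.out.println(gbPerYear() + "GB on disk per year");
    }
}
```

With these inputs the result is 31,200M and 13,780GB, which matches the figures in the message.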

 On 20 Oct 2014 17:56, Andy Seaborne a...@apache.org wrote:
 
  On 20/10/14 10:12, Dick Murray wrote:
 
  Hello all.
 
  Are there any pointers to inserting large volumes of data in a persistent
  RW TDB store please?
 
  I currently have a 8M line 500MB+ input file which is being parsed by
  JavaCC and the created quads inserted into a TDB store.
 
  The process generates 120M quads and takes just over 2hrs, which is:

  60M quads/hr or
  1M quads/min or
  roughly 16.7K quads/sec.

  Parsing is single threaded (12% CPU utilization overall, i.e. 100% of one
  core) with -Xmx8GB (16GB available) on an i7 8 core and a 512GB SSD.
 
  I am working with the datasetGraph after opening the TDB store to remove
  any extra code which might slow the process down. I begin/commit a
  transaction for every 1000 input rows, as prior to this an OOME occurred
  after ~3M input rows when I tried to wrap the entire load in a transaction.
  The TDB store is being read from, so I am unable to use a TDB loader.
 
  I don't believe the runtime is poor but any pointers which would improve
  the speed...
 
 
  Dick,
 
  If you are loading into a live TDB store with transactions, performance
  will be lower than bulk loading offline.  The system is a bit read-centric.

  The only tuning parameter you have at your disposal is the commit size.
  1000 is very small - try more like 100K.

  This isn't inside Fuseki so some batching already occurs but the size of
  the transactions themselves can make a difference.
 
  Andy
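
The commit-size advice above can be sketched in plain Java as follows. This is not Jena code: TxnStore and MemoryStore are hypothetical stand-ins so that the example runs on its own. With Jena, the three calls would presumably map to Dataset.begin(ReadWrite.WRITE), adding quads through the DatasetGraph, and Dataset.commit(). The batch size is counted in whatever unit the caller iterates over (input rows or quads).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch only: commits once per batch instead of once per load or per row.
final class BatchedLoadSketch {

    /** Hypothetical stand-in for a transactional store (not a Jena interface). */
    interface TxnStore<T> {
        void beginWrite();
        void add(T item);
        void commit();
    }

    /** Adds every item, committing after each full batch; returns the commit count. */
    static <T> int load(Iterator<T> items, TxnStore<T> store, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int commits = 0;
        int inBatch = 0;
        while (items.hasNext()) {
            if (inBatch == 0) {
                store.beginWrite();      // open a write transaction lazily
            }
            store.add(items.next());
            if (++inBatch == batchSize) {
                store.commit();          // bound the size of each transaction
                commits++;
                inBatch = 0;
            }
        }
        if (inBatch > 0) {               // commit the final, partial batch
            store.commit();
            commits++;
        }
        return commits;
    }

    /** In-memory store used only to make the sketch runnable. */
    static final class MemoryStore<T> implements TxnStore<T> {
        final List<T> committed = new ArrayList<>();
        private final List<T> pending = new ArrayList<>();

        public void beginWrite() { pending.clear(); }
        public void add(T item) { pending.add(item); }
        public void commit() { committed.addAll(pending); pending.clear(); }
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 250_000; i++) {
            rows.add(i);
        }
        MemoryStore<Integer> store = new MemoryStore<>();
        int commits = load(rows.iterator(), store, 100_000);
        System.out.println(commits + " commits, " + store.committed.size() + " items");
    }
}
```

Making the batch size a parameter keeps it easy to try 1000, 100K and values in between, and to measure which one gives the best load time without running out of heap.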
 



Re: Inserting large volumes into a RW TDB store.

2014-10-21 Thread Andy Seaborne

On 21/10/14 10:25, Dick Murray wrote:

I might be confusing the DynamicDataset...


You might well be ...

There is also DatasetGraphViewGraphs which is a view of one graph in a 
dataset.



Dick

On 20 October 2014 20:40, Dick Murray dandh...@gmail.com wrote:


Thanks that confirms what I thought.

Crazy idea time!

Am I correct in thinking that there is a dataset view which allows you
to present multiple datasets as one? I'm sure I saw it in the codebase some
time back?

If I present the current datasets using this view I can create a new
dataset and load in the new quads without a transaction then add it to a
transient reference which is used by the system from then on and the old
view would then be garbage collected.


Yes - that might work for you.  It's like MultiUnion which is a union 
graph, but for datasets.
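
The "load separately, then swap the reference" idea discussed above might look roughly like this in plain Java. It is only a sketch under assumptions: it does not use DynamicDataset, MultiUnion or any other Jena class, the type parameter D is a placeholder for whatever dataset handle the application uses, and the class name is invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: readers see an immutable list of datasets; a fully loaded
// dataset is published by atomically swapping in a new list.
final class DatasetViewSwapSketch<D> {
    private final AtomicReference<List<D>> current =
            new AtomicReference<>(Collections.<D>emptyList());

    /** Readers take a snapshot; it never changes underneath them. */
    List<D> snapshot() {
        return current.get();
    }

    /** Publishes a dataset that has already been loaded offline. */
    void publish(D loaded) {
        current.updateAndGet(old -> {
            List<D> next = new ArrayList<>(old); // copy, never mutate the old view
            next.add(loaded);
            return Collections.unmodifiableList(next);
        });
    }

    public static void main(String[] args) {
        // Strings stand in for dataset handles here.
        DatasetViewSwapSketch<String> view = new DatasetViewSwapSketch<>();
        List<String> before = view.snapshot();
        view.publish("day-1");
        view.publish("day-2");
        System.out.println(before.size() + " -> " + view.snapshot());
    }
}
```

A reader that took a snapshot before the swap keeps working against the old list, which becomes eligible for garbage collection once no reader holds it any more.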






Re: can't get subclass of a specific class

2014-10-21 Thread Jean-Marc Vanel
At first glance, a likely cause is an error in the prefix product: .

Anyway, unless you have a special reason to use XML,
it's better to use Turtle format all the way.

2014-10-21 16:53 GMT+02:00 Bruno Baloi bruno.ba...@rogers.com:

 Hello,

 I am new to Jena, and to the semantic space as well.

 I am trying to build an application using Jena and I am encountering some
 issues.

 I built the Ontology using the Ontology APIs. It all looks good, and I can
 use a reasoner to get information out.
 I then generate the XML doc representing the Ontology. I can also use
 Protege to visualize the ontology.
 So far so good.

 The problem comes when I try to use SPARQL. When I use the query
 ?subject rdfs:subClassOf ?object  it works fine. However, if I try to use
 something to the effect of: ?subject rdfs:subClassOf product:Food I get
 nothing (I am trying to get all the classes that are subclasses of Food),
 and there are many subclasses of Food declared. I have attached the
 generated XML Ontology.

 If however I use OWL-DL (DL-Query in Protege) or in code, it works fine
 i.e. I get the necessary subclasses.

 So my question is: what am I doing wrong? Is the way the graph is stored
 in XML causing the problem for SPARQL, or am I misunderstanding the way to
 use the SPARQL query?

 Any insight or guidance would be greatly appreciated.

 regards,
 Bruno


 P.S. I had a look at the Wine ontology that is publicly available, and I
 noticed that in the XML doc, subclasses are referred to by resource,
 whereas in the Product graph that I generate, they are referred to by
 ID. Could that be a problem? And if it is, what do I need to use in the
 APIs to rectify this?





-- 
Jean-Marc Vanel
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
http://deductions-software.com/
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui


Re: can't get subclass of a specific class

2014-10-21 Thread Andy Seaborne

On 21/10/14 16:07, Jean-Marc Vanel wrote:

 At first glance, a likely cause is an error in the prefix product: .


I agree.

The data has relative URIs in it, so the result depends on how and where you
read the file.


(Try using riot to parse your file and see the Food declarations as
full URIs.)


Maybe try setting the XML base in the XML to the same value as xmlns:product.

Andy
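
To illustrate why the base matters: in RDF/XML, rdf:ID="Food" stands for the relative reference "#Food" resolved against the base URI in effect when the file is parsed. The sketch below uses java.net.URI from the standard library rather than a real RDF parser, and every URI in it is a made-up example, not taken from the attached ontology.

```java
import java.net.URI;

// Sketch only: shows how the same rdf:ID yields different full URIs
// depending on the base the file is read with.
final class RelativeUriSketch {

    /** rdf:ID="id" denotes "#id" resolved against the base URI. */
    static String resolveId(String base, String id) {
        return URI.create(base).resolve("#" + id).toString();
    }

    public static void main(String[] args) {
        // Hypothetical namespace bound to the product: prefix in the query.
        String productNs = "http://example.org/product#";
        String wanted = productNs + "Food"; // what product:Food expands to

        // Base equal to the namespace (minus '#'): the URIs line up.
        String matching = resolveId("http://example.org/product", "Food");
        // Base taken from wherever the file happened to be read: they do not.
        String mismatched = resolveId("http://localhost/tmp/product.owl", "Food");

        System.out.println(matching + " matches: " + matching.equals(wanted));
        System.out.println(mismatched + " matches: " + mismatched.equals(wanted));
    }
}
```

If the second case is what is happening, a query for product:Food finds nothing even though the Food class is present in the data under a different full URI.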





Re: can't get subclass of a specific class

2014-10-21 Thread Bruno Baloi

Thx guys for the quick response.

I will set the base and see what happens.

Regards,

Bruno


Re: can't get subclass of a specific class

2014-10-21 Thread Bruno Baloi

Great,

That worked, thx a bundle.

Now for a short follow-up question:

1) What would the SPARQL query look like to get all the classes that
have a certain property with a certain value?


I would like to have a base class (i.e. Product), which all other classes
inherit from, that has a certain value (Product.ClassCounter).

As things get added to the ontology, this counter gets incremented.

I would like to be able to query the ontology for all the classes that
have a ClassCounter value greater than X, for instance.


Any hints would be greatly appreciated.

regards,

Bruno

