Re: Trace back RDF containers in SPARQL

2015-05-12 Thread Laurent Rucquoy
I will upgrade Jena and investigate further following your advice.
Thank you for your help Andy.

Laurent.

On 11 May 2015 at 22:57, Andy Seaborne a...@apache.org wrote:

 On 11/05/15 14:55, Laurent Rucquoy wrote:

 Yes, the bad query is the good query with the last 3 triple patterns
 added.


 The optimizer in 2.10 would probably do a bad job on your query.  Adding
 the patterns makes it worse as it puts an unconstrained cross product (due
 to the ??? a :SomeClass2 parts).

 2.13 is better, using fixed.opt.

 It probably makes no difference whether you have a stats.opt file;
 but if you have one, with 2.13 it's worth trying both with and without it.

 It could well explain what you are seeing and until that possibility is
 removed, it's hard to see any further.

 Andy


  When I run the good query (the bad query without its last 3 triple
 patterns), I get about ten calculationResult nodes.
 When I run the bad query to try to retrieve the containing
 calculationResultCollection, the system freezes.

 What I want to do is find the CalculationResultCollection nodes that
 contain CalculationResult nodes referring to CalculationDataCollection
 nodes which in turn contain CalculationData nodes with the value
 "0"^^xsd:string.

 Here is what an instance diagram could look like:

 CalculationResultCollection ---listCalculationResult--- blank_node_CR
 ---rdf:_1--- CalculationResult_1 ---calculationDataCollection---
 CalculationDataCollection ---listCalculationData--- blank_node_CD
 ---rdf:_1--- CalculationData_1_1
 CalculationResultCollection ---listCalculationResult--- blank_node_CR
 ---rdf:_1--- CalculationResult_1 ---calculationDataCollection---
 CalculationDataCollection ---listCalculationData--- blank_node_CD
 ---rdf:_2--- CalculationData_1_2
 CalculationResultCollection ---listCalculationResult--- blank_node_CR
 ---rdf:_1--- CalculationResult_1 ---calculationDataCollection---
 CalculationDataCollection ---listCalculationData--- blank_node_CD
 ---rdf:_3--- CalculationData_1_3
 CalculationResultCollection ---listCalculationResult--- blank_node_CR
 ---rdf:_2--- CalculationResult_2 ---calculationDataCollection---
 CalculationDataCollection ---listCalculationData--- blank_node_CD
 ---rdf:_1--- CalculationData_2_1
 CalculationResultCollection ---listCalculationResult--- blank_node_CR
 ---rdf:_2--- CalculationResult_2 ---calculationDataCollection---
 CalculationDataCollection ---listCalculationData--- blank_node_CD
 ---rdf:_2--- CalculationData_2_2
 CalculationResultCollection ---listCalculationResult--- blank_node_CR
 ---rdf:_2--- CalculationResult_2 ---calculationDataCollection---
 CalculationDataCollection ---listCalculationData--- blank_node_CD
 ---rdf:_3--- CalculationData_2_3
 ...


 Thank you for your help.

 Laurent.


 On 8 May 2015 at 12:42, Andy Seaborne a...@apache.org wrote:

  On 08/05/15 09:43, Laurent Rucquoy wrote:

  Hi Andy,

 Thank you for your response.

 1) Which version of Jena are you running?
 The version of Jena in use is 2.10.1 (I will upgrade soon...)


 Try with 2.13.0 because the area of BGP optimizations has been improved.



 2) How are you storing the data and how big is it?
 TDBFactory.createDataset(directory)
 COUNT(*) - 1 224 103
 350MB on disk
 Do you need any other details?


 3) You say the query returns good results - what sort of query causes the
 system to freeze?
 This is the query returning good results, with 3 more statements appended
 to the WHERE clause:


 So the bad query is the good query with the last 3 triple patterns added?

 It's hard to read but

 ?seqCalculationResultCollection ?seqCalculationResultCollectionIndex ?calculationResult .
 ?calculationResultCollection :listCalculationResult ?seqCalculationResultCollection .
 ?calculationResultCollection rdf:type :CalculationResultCollection

 is connected to the good part by ?calculationResult; all the other
 variables are just fanning out from that point without anything like the
 :value "0"^^xsd:string in the good part.  From what I understand of your
 data, that can be a huge number of results.

 Do you get no results, or that some results appear but then the query
 does not finish?

  Andy




 PREFIX :     <http://www.telemis.com/>
 PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
 PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

 SELECT ?calculationResultCollection
 WHERE {
   ?calculationData a :CalculationData ;
       :value "0"^^xsd:string .
   ?seqCalculationDataCollection ?seqCalculationDataCollectionIndex ?calculationData .
   ?calculationDataCollection a :CalculationDataCollection ;
       :listCalculationData ?seqCalculationDataCollection .
   ?calculationResult a :CalculationResult ;
       :calculationDataCollection ?calculationDataCollection .
   ?seqCalculationResultCollection ?seqCalculationResultCollectionIndex ?calculationResult .
   ?calculationResultCollection :listCalculationResult ?seqCalculationResultCollection ;
       a :CalculationResultCollection .
 

Re: Implementing RDF reader

2015-05-12 Thread Andy Seaborne

On 11/05/15 20:28, Martynas Jusevičius wrote:

Thanks Andy.

I have a parser that works on String, but this time I want to do it
right and make it streaming and plug it into Jena at the low level.

It seems that I should be able to reuse some code from TokenizerText.

I understand StreamRDF is used to sink the triples, but what about
ParserProfile? I see LangTurtleBase uses it:

 org.apache.jena.iri.IRI iri = profile.makeIRI(iriStr, currLine, currCol) ;

How do I construct an instance of ParserProfile? Or is there an
alternative way to construct IRIs etc.?


RiotLib.profile

Andy



Martynas

On Mon, May 11, 2015 at 2:44 PM, Andy Seaborne a...@apache.org wrote:

On 10/05/15 21:48, Martynas Jusevičius wrote:


Hey all,

I want to refactor my RDF/POST parser into a Jena-compatible reader.
An example of the format can be found here:
http://www.lsrn.org/semweb/rdfpost.html#sec-examples

The documentation suggests implementing ReaderRIOT interface:

https://github.com/apache/jena/blob/master/jena-arq/src-examples/arq/examples/riot/ExRIOT_5.java

However, if I look at (what I think is) existing readers such as
Turtle for example, they do not seem to implement ReaderRIOT:

https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/riot/lang/LangTurtleBase.java

What is the explanation for that?



Hi Martynas,

It is historical - the Turtle derived parsers emerged with the RiotReader
interface and some code is/was around that used that interface.

ReaderRIOTLang is the cross-over code from the proper interface ReaderRIOT
to RiotReader. RiotReader is a fixed set of parsers.

This can be sorted out in Jena3.



Do I need to tokenize the InputStream myself or is there some
machinery I can reuse?



The Turtle-world tokenizer is TokenizerText.  It is turtle term specific.

Any tokenizing for a new language is often, in my experience, very sensitive
to the language details.

If you are used to javacc, and performance isn't critical at scale, that's a
good tool.

RIOT uses custom I/O for speed; Jena used to have a javacc parser for Turtle
but Turtle is sufficiently simple that a hand-written parser is doable.  A
hand-written tokenizer is for speed at scale (big files - about 2x faster than
basic javacc tokenizing) but you need large input to make it worthwhile.
NTriples dumps of databases make it worthwhile.

If you do rdfpost -> Turtle (string manipulation), then you can parse the
Turtle as normal.  Downside: Error messages may be confusing as they refer
to the Turtle, not the input string.
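
A minimal sketch of that string-manipulation route, for illustration only. It
assumes the form fields have already been split and decoded, and it handles
just four field names (su, pu, ou, ol for subject URI, predicate URI, object
URI and object literal). Those names are my reading of the RDF/POST examples
page and should be checked against it; the class RdfPostToTurtleSketch and its
methods are hypothetical, not part of Jena. Blank nodes, namespaces, language
tags and datatypes are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Jena code: rewrites decoded RDF/POST fields as Turtle.
final class RdfPostToTurtleSketch {

    // Each element is {fieldName, decodedValue}, in the order the fields arrived.
    static String toTurtle(List<String[]> fields) {
        StringBuilder out = new StringBuilder();
        String subject = null;
        String predicate = null;
        for (String[] field : fields) {
            String name = field[0];
            String value = field[1];
            switch (name) {
                case "su": // assumed: a new subject URI, which resets the predicate
                    subject = iri(value);
                    predicate = null;
                    break;
                case "pu": // assumed: a new predicate URI for the current subject
                    predicate = iri(value);
                    break;
                case "ou": // assumed: an object URI, completing one triple
                    emit(out, subject, predicate, iri(value));
                    break;
                case "ol": // assumed: a plain object literal, completing one triple
                    emit(out, subject, predicate, literal(value));
                    break;
                default:   // every other field is ignored in this sketch
                    break;
            }
        }
        return out.toString();
    }

    private static void emit(StringBuilder out, String s, String p, String o) {
        if (s == null || p == null) {
            throw new IllegalArgumentException("Object seen before subject and predicate");
        }
        out.append(s).append(' ').append(p).append(' ').append(o).append(" .\n");
    }

    // Wraps a URI in angle brackets, rejecting characters Turtle forbids inside <...>.
    private static String iri(String value) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c <= ' ' || "<>\"{}|^`\\".indexOf(c) >= 0) {
                throw new IllegalArgumentException("Character not allowed in a Turtle IRI: " + value);
            }
        }
        return "<" + value + ">";
    }

    // Quotes a literal, escaping backslash, double quote, newline and carriage return.
    private static String literal(String value) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        List<String[]> fields = new ArrayList<>();
        fields.add(new String[] {"su", "http://example.org/book"});
        fields.add(new String[] {"pu", "http://purl.org/dc/elements/1.1/title"});
        fields.add(new String[] {"ol", "An example title"});
        System.out.print(toTurtle(fields));
    }
}
```

The resulting string can then be handed to the ordinary Turtle parser. As
noted above, any syntax error will be reported against this generated Turtle
rather than against the original form data.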

Splitting up the query string, with all the HTTP escaping rules, can be done
with library code (see FusekiLib.parseQueryString [no longer used, but it
works without consuming the body, unlike the servlet operations which
combine form and query string processing]); there are probably lots of better
code examples on the web.
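
As a rough illustration of the splitting step, here is a sketch that uses only
the JDK. It is not the FusekiLib code; the class name QueryStringSketch and
its parse method are made up for this example. It keeps the fields as an
ordered list rather than a map, on the assumption that field order and
repeated names matter for a format like RDF/POST.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits an application/x-www-form-urlencoded string
// into decoded name/value pairs, preserving order and duplicates.
final class QueryStringSketch {

    static List<Map.Entry<String, String>> parse(String raw) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        if (raw == null || raw.isEmpty()) {
            return pairs;
        }
        for (String part : raw.split("&")) {
            if (part.isEmpty()) {
                continue; // tolerate "a=1&&b=2"
            }
            int eq = part.indexOf('=');
            // A field without '=' is treated as a name with an empty value.
            String name = eq < 0 ? part : part.substring(0, eq);
            String value = eq < 0 ? "" : part.substring(eq + 1);
            pairs.add(new SimpleImmutableEntry<>(decode(name), decode(value)));
        }
        return pairs;
    }

    // URLDecoder implements form decoding: '+' becomes a space and %XX
    // sequences become bytes, read here as UTF-8. Malformed escapes make it
    // throw IllegalArgumentException.
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : parse("a=1&b=x+y&a=%C3%A9")) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Decoding happens after splitting on & and =, so an escaped %26 or %3D inside a
value is not mistaken for a separator.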

 Andy



Martynas
graphityhq.com