Ok, thanks for the explanation.
> not separate partitions on disk for each graph
I wrote "'partitions' or 'tables'" but I had the TDB files in mind
(GOSP.dat + GOSP.idn); it was a wrong analogy...
> you are reducing the number of tuples over which the filter has to
scan by narrowing its scope, and that is what is improving your results.
Fewer tuples to scan => fewer to throw away => quicker to get to those
that won't get thrown away.
Yes, that's what I was trying to say, thank you for rephrasing!
Can I conclude that using named graphs is faster even when there isn't
any FILTER?
> You can do that by confining the scope of the filter to a graph, as
you did, but you can also do it by introducing a scope for the purpose,
as Andy showed.
I didn't know about this "{ }" scope; I'm going to read about it. I
thought it was just some syntax to clarify the query.
On 17/12/2018 at 16:00, ajs6f wrote:
No, I don't think that's what's going on. Andy can correct me if I'm going
wrong here, but TDB builds the equivalent of a quad table for named graphs, not
separate partitions on disk for each graph. The exception is the default graph.
The same is true for TIM, Jena's transactional in-memory dataset impl.
As Andy wrote, what seems to be happening here is that you are reducing the number of
tuples over which the filter has to scan by narrowing its scope, and that is what is
improving your results. Fewer tuples to scan => fewer to throw away => quicker
to get to those that won't get thrown away. You can do that by confining the scope of
the filter to a graph, as you did, but you can also do it by introducing a scope for
the purpose, as Andy showed.
ajs6f
On Dec 17, 2018, at 9:53 AM, Vincent Ventresque
<[email protected]> wrote:
$word is provided as a value? $word looks like a variable to the optimizer so
the FILTER is after the whole of the BGP.
Sorry, I should have written:
FILTER(REGEX(?title, "some word", 'i'))
my '$word' meant 'any value to complete the query string'.
With GRAPH, or using just the {} part, the scope of the filter is reduced.
Do you mean that using a named graph reduces the number of triples to check, so
that the query is faster?
Here's my understanding of named graphs: the triples are stored in different
'partitions' or 'tables' instead of all being in the same place, so there are
fewer entries in the triple store index to check to find a matching pattern,
and therefore the query is more efficient.
On 17/12/2018 at 15:24, Andy Seaborne wrote:
On 17/12/2018 07:53, Vincent Ventresque wrote:
Are you sure that named graphs have better performance?
I'm not a specialist, and I'd like to know other users' opinions on that
question. I think it depends both on the structure of your data and the
queries you run.
My use case involved a dataset of about 168 M triples, including about 10 M
titles (triples like ?s dcterms:title "some words"), with queries using
FILTER(REGEX()):
-- a query like this took a long time (1 min or more):
SELECT * WHERE {
  ?edition dcterms:title ?title .
  FILTER(REGEX(?title, $word, 'i')) .
  ?edition rdarelationships:expressionManifested ?expr .
  ?expression bnf-roles:r70 ?author .
  ?author foaf:familyName $name
}
$word is provided as a value?
$word looks like a variable to the optimizer so the FILTER is after the whole
of the BGP. With GRAPH, or using just the {} part, the scope of the filter is
reduced.
Add {}
SELECT * WHERE {
  {
    ?edition dcterms:title ?title .
    FILTER(REGEX(?title, $word, 'i')) .
  }
  ?edition rdarelationships:expressionManifested ?expr .
  ?expression bnf-roles:r70 ?author .
  ?author foaf:familyName $name
}
-------------
-- whereas a query like this one takes about 1 sec:
SELECT * WHERE {
GRAPH <:titles> {
?edition dcterms:title ?title .
FILTER(REGEX(?title, $word, 'i'))
}
?edition rdarelationships:expressionManifested ?expr .
?expression bnf-roles:r70 ?author .
?author foaf:familyName $name
}
----------------
So, depending on your data, it might be more efficient to use named graphs,
e.g.:
SELECT ?rel (count (?rel) as ?co)
where {
GRAPH <:names> { ?object MKG:English_name 'Pyrilamine' } # <-- HERE: named graph for names
?RelAttr owl:annotatedTarget ?object ;
owl:annotatedSource ?subject ;
owl:annotatedProperty ?rel ;
MKG:pyear '1967' .
}
group by ?rel
limit 10
-------------------------------
Then how do you build named graphs?
There are several ways; here are three methods:
#1) when uploading files in Fuseki web interface, specify the graph URI for the
file
#2) use tdbloader : java -Xms4096m -Xmx4096m -cp ./fuseki-server.jar
tdb.tdbloader --graph=$namedGraph --tdb=$configFile $f
#3) use SPARQL INSERT + DELETE
#1 and #2 are fast, but all the triples of a file go into the same graph, so
you may have to modify your files first.
#3 is slower, but you don't have to modify your files. See my question on
StackOverflow:
https://stackoverflow.com/questions/48500404/sparql-offset-whithout-order-by-to-get-all-results-of-a-query
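For method #3, a minimal sketch of the update (the graph name <:titles> and
the dcterms:title predicate are illustrative, taken from my earlier example;
adapt them to your data):

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>

# Copy the matching triples from the default graph into the named graph...
INSERT { GRAPH <:titles> { ?s dcterms:title ?o } }
WHERE  { ?s dcterms:title ?o } ;

# ...then remove them from the default graph.
DELETE WHERE { ?s dcterms:title ?o }
```

On a large dataset you may have to run this in batches, which is what the
StackOverflow question above is about.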
On 07/12/2018 at 18:16, HYP wrote:
OK. I'll explain my project below.
The KG schema is composed of a set of semantic types, like disease or drug,
and a set of relations, like treated_by(disease, drug).
Each instance relation, like treated_by(disease_1, drug_1), has an annotation
property 'year', which means the triple occurred in that year.
My query has two steps. First, query the triples related to some drug, like
Pyrilamine, group them by relation type, and give a count. Second, query the
related nodes in one relation type.
The first-step query looks like:
SELECT ?rel (count (?rel) as ?co)
where {
?object MKG:English_name 'Pyrilamine' .
?RelAttr owl:annotatedTarget ?object ;
owl:annotatedSource ?subject ;
owl:annotatedProperty ?rel ;
MKG:pyear '1967' .
}
group by ?rel
limit 10
On 12/8/2018 01:00, ajs6f <[email protected]> wrote:
Let's slow down here a bit.
We can't give you any reasonable advice until you tell us _much_ more about
your work. What is the data like? What kinds of queries are you doing? How are
you running them? What do you expect to happen?
Please give us a great deal more context.
ajs6f
On Dec 7, 2018, at 11:45 AM, HYP <[email protected]> wrote:
I stored the 1.4B triples in two steps. First, I made 886 RDF files, each of
which contains 1,615,837 triples. Then I uploaded them into TDB using Fuseki.
This is a huge job. Are you sure that named graphs give better performance?
Then how do I build named graphs?
On 12/7/2018 23:48, Vincent Ventresque <[email protected]> wrote:
Do you mean -Xms = 64G?
N.B.: with 1.4 B triples, you should have better performance using
named graphs.
On 07/12/2018 at 16:37, 胡云苹 wrote:
My machine has 64 GB of memory, and I set no upper limit.
On 12/7/2018 23:34, Vincent Ventresque <[email protected]> wrote:
Hello
How do you run Fuseki? You can increase the Java memory limit with
Java options:
java -Xms4096m -Xmx4096m -jar fuseki-server.jar
(where 4096m = 4 GB, but it could be 8192m or more)
N.B.: I'm not a specialist; I don't know whether -Xms and -Xmx must be
the same.
If I remember correctly, the memory limit is 1.2 GB when you run
'./fuseki start' or './fuseki-server'.
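If I remember correctly, the fuseki-server script reads the JVM_ARGS
environment variable for its JVM options (defaulting to -Xmx1200M), so a
sketch for the script launchers would be (the heap sizes here are just
examples):

```
# Override the script's default heap (-Xmx1200M) via JVM_ARGS,
# then start the server as usual:
export JVM_ARGS="-Xms8g -Xmx8g"
./fuseki-server
```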
Vincent
On 07/12/2018 at 16:23, 胡云苹 wrote:
Dear Jena,
I have built a graph with 1.4 billion triples and stored it as a
dataset in TDB through the Fuseki upload system. Now, when I run some
SPARQL searches, the speed is very slow.
For example, the SPARQL query below takes 50 seconds in Fuseki. How can
I improve the speed?
-