With bin/hdtsparql.sh instead (previously I ran the query against Fuseki, which 
pre-loads the dataset):

$ time ./bin/hdtsparql.sh wikidata.hdt "SELECT (COUNT(*) AS ?cnt) {?s a 
<http://wikiba.se/ontology-beta#Item>}"

cnt
37871468
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 28.244 s
[INFO] Finished at: 2017-12-17T17:22:30+01:00
[INFO] Final Memory: 29M/4203M
[INFO] ------------------------------------------------------------------------
./bin/hdtsparql.sh    51.71s user 2.09s system 180% cpu 29.855 total

Sent: Sunday, December 17, 2017 at 4:18 PM
From: "Andy Seaborne" <a...@apache.org>
To: users@jena.apache.org
Subject: Re: Very very slow query when using a high OFFSET

On 17/12/17 09:51, Lorenz Buehmann wrote:
> I think that aggregation is one of the access patterns that HDT is not
> really designed for:
>
> time bin/hdtsparql.sh ~/wikidata.hdt "SELECT (COUNT(?s) AS ?cnt) {?s a
> <http://wikiba.se/ontology-beta#Item>}"
>
> cnt
> 37871468
> bin/hdtsparql.sh ~/wikidata.hdt   282,55s user 5,20s system 185% cpu
> 2:35,04 total
>
> It's not a problem with the OFFSET, it's just slow:
>
> time bin/hdtsparql.sh ~/wikidata.hdt "SELECT (COUNT(?s) AS ?cnt) {?s a
> <http://wikiba.se/ontology-beta#Item>}
> LIMIT 20 OFFSET 20000000"
>
> cnt
> bin/hdtsparql.sh ~/wikidata.hdt   350,53s user 30,95s system 207% cpu
> 3:03,83 total

If you still have the setup to hand, could you try:

SELECT (COUNT(*) AS ?cnt) {?s a 
<http://wikiba.se/ontology-beta#Item>}

COUNT(?s) materializes ?s, which is strictly unnecessary in this case,
but in other cases it is necessary.
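
To illustrate the difference, here is a toy Python sketch. It is not
hdt-java's or Jena's actual code: it simply assumes a dictionary-encoded
store that keeps triples as integer IDs and only produces a term string
when a variable is materialized. The names ToyStore, count_star and
count_var are made up for this example.

```python
class ToyStore:
    """Toy dictionary-encoded triple store (hypothetical, for illustration)."""

    def __init__(self, triples):
        self.terms = []    # ID -> term string
        self.ids = {}      # term string -> ID
        self.decoded = 0   # number of ID -> string lookups performed
        self.triples = [tuple(self._encode(t) for t in triple)
                        for triple in triples]

    def _encode(self, term):
        # Assign the next free integer ID to a term seen for the first time.
        if term not in self.ids:
            self.ids[term] = len(self.terms)
            self.terms.append(term)
        return self.ids[term]

    def decode(self, term_id):
        # Materialize a term: this is the extra work COUNT(?s) pays for.
        self.decoded += 1
        return self.terms[term_id]

    def match(self, p, o):
        """Yield subject IDs of triples matching (?s, p, o)."""
        p_id, o_id = self.ids.get(p), self.ids.get(o)
        for s_id, tp, to in self.triples:
            if tp == p_id and to == o_id:
                yield s_id


def count_star(store, p, o):
    # COUNT(*): only the number of matches matters, nothing is decoded.
    return sum(1 for _ in store.match(p, o))


def count_var(store, p, o):
    # COUNT(?s): each ?s is materialized, then counted if it is bound.
    return sum(1 for s_id in store.match(p, o)
               if store.decode(s_id) is not None)
```

In this model both functions return the same number, but count_var does
one decode per matching triple while count_star does none.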

My expectation is that COUNT(*) and a slice of (10,2000000) should be
about the same. (It indicates something about how hdt-java works.)
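
A rough sketch of why the two might cost about the same, as a toy model
in Python and not a description of how Jena or hdt-java implement it:
assuming there is no index that can jump straight to the n-th match, a
slice has to step over every skipped row, so it walks a comparable
number of matches to a full count. The helper names are hypothetical.

```python
from itertools import islice


def counting(iterable, stats):
    # Wrap an iterator and record how many items were pulled from it.
    for item in iterable:
        stats["pulled"] += 1
        yield item


def count_all(matches):
    # COUNT(*): walk every match once.
    return sum(1 for _ in matches)


def slice_rows(matches, offset, limit):
    # OFFSET/LIMIT: skipped rows are still produced by the iterator
    # and then discarded one by one.
    return list(islice(matches, offset, offset + limit))
```

With this model, slicing at a high offset pulls offset + limit items
from the underlying iterator, which approaches the cost of counting
everything as the offset grows.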

Andy

>
>
> On 16.12.2017 19:17, Laura Morales wrote:
>>> What I'm trying to understand is why you would have such a large offset and 
>>> what real world application there is?
>> I don't have any particular use case in mind. I just tried to break it and 
>> it broke.
>>
>>> It's because the query is simple with no order that it seems 
>>> synthetic/contrived to me.
>> I think the default order is how triples are physically stored, which is 
>> probably SPO. But anyway this wasn't important for me. I just wanted to test 
>> a high offset.
>>
>>> I'm not near my hardware but I wonder if similar symptoms are obtained with 
>>> a count (s) and a
>>> limit 20000000. As this should be similar in that it reads a large number 
>>> of triples but
>>> returns a small result set?
>> Curiously, this query seems to hang in both cases, that is if I use 
>> defaultGraph or namedGraph
>>
>> SELECT (COUNT(?s) AS ?cnt)
>> FROM <...> <-- only used with namedGraph. No FROM with defaultGraph
>> WHERE {
>> ?s a 
>> <http://wikiba.se/ontology-beta#Item>
>> }
>> LIMIT 10
>> OFFSET 20000000
>
>
