With bin/hdtsparql.sh instead (previously I ran the query against Fuseki, which pre-loads the dataset):
$ time ./bin/hdtsparql.sh wikidata.hdt "SELECT (COUNT(*) AS ?cnt) {?s a <http://wikiba.se/ontology-beta#Item>}"
cnt
37871468
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 28.244 s
[INFO] Finished at: 2017-12-17T17:22:30+01:00
[INFO] Final Memory: 29M/4203M
[INFO] ------------------------------------------------------------------------
./bin/hdtsparql.sh  51.71s user 2.09s system 180% cpu 29.855 total


Sent: Sunday, December 17, 2017 at 4:18 PM
From: "Andy Seaborne" <a...@apache.org>
To: users@jena.apache.org
Subject: Re: Very very slow query when using a high OFFSET

On 17/12/17 09:51, Lorenz Buehmann wrote:
> I think that aggregation is one of the access patterns that HDT is not
> really designed for:
>
>     time bin/hdtsparql.sh ~/wikidata.hdt "SELECT (COUNT(?s) AS ?cnt) {?s a
>     <http://wikiba.se/ontology-beta#Item>}"
>
>     cnt
>     37871468
>     bin/hdtsparql.sh ~/wikidata.hdt  282,55s user 5,20s system 185% cpu
>     2:35,04 total
>
> It's not a problem with the OFFSET, it's just slow:
>
>     time bin/hdtsparql.sh ~/wikidata.hdt "SELECT (COUNT(?s) AS ?cnt) {?s a
>     <http://wikiba.se/ontology-beta#Item>}
>     LIMIT 20 OFFSET 20000000"
>
>     cnt
>     bin/hdtsparql.sh ~/wikidata.hdt  350,53s user 30,95s system 207% cpu
>     3:03,83 total

If you have the setup to hand still, could you try:

SELECT (COUNT(*) AS ?cnt) {?s a <http://wikiba.se/ontology-beta#Item>}

COUNT(?s) materializes ?s, which is strictly unnecessary in this case, but
in other cases it is necessary.

My expectation is that COUNT(*) and a slice of (10,2000000) should be about
the same. (It indicates something about how hdt-java works.)
    Andy

>
> On 16.12.2017 19:17, Laura Morales wrote:
>>> What I'm trying to understand is why you would have such a large offset
>>> and what real world application there is?
>> I don't have any particular use case in mind. I just tried to break it
>> and it broke.
>>
>>> It's because the query is simple with no order that it seems
>>> synthetic/contrived to me.
>> I think the default order is how triples are physically stored, which is
>> probably SPO. But anyway this wasn't important for me. I just wanted to
>> test a high offset.
>>
>>> I'm not near my hardware but I wonder if similar symptoms are obtained
>>> with a count (s) and a limit 20000000. As this should be similar in that
>>> it reads a large number of triples but returns a small result set?
>> Curiously, this query seems to hang in both cases, that is if I use
>> defaultGraph or namedGraph
>>
>> SELECT (COUNT(?s) AS ?cnt)
>> FROM <...>  <-- only used with namedGraph. No FROM with defaultGraph
>> WHERE {
>>   ?s a <http://wikiba.se/ontology-beta#Item>
>> }
>> LIMIT 10
>> OFFSET 20000000
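The two cost effects discussed in the thread, COUNT(?s) materializing each binding where COUNT(*) only has to tally, and a large OFFSET still walking every skipped solution, can be sketched with a small pure-Python iterator model. This is only an illustration of the general cost model, not hdt-java's actual internals; `matching_triples` and `decode` are hypothetical stand-ins for an HDT triple iterator and its dictionary lookup.

```python
from itertools import islice

def matching_triples(n):
    """Stand-in for a lazy HDT triple iterator: yields encoded triple IDs."""
    for i in range(n):
        yield (i, 1, 2)  # (subject-id, predicate-id, object-id)

def decode(term_id):
    """Hypothetical dictionary lookup: turns an ID back into an IRI string."""
    return f"http://example.org/entity/{term_id}"

N = 1000

# COUNT(*): count solutions without decoding any term.
count_star = sum(1 for _ in matching_triples(N))

# COUNT(?s): each subject must be materialized (decoded) before counting,
# so every row pays the dictionary-lookup cost on top of iteration.
count_s = sum(1 for s, _, _ in matching_triples(N) if decode(s) is not None)

# LIMIT 10 OFFSET 900: islice still advances through all 900 skipped
# solutions one by one, so the cost grows linearly with the offset.
page = list(islice(matching_triples(N), 900, 910))
```

Both counts come out the same; the difference is only in the per-row work, which is why COUNT(*) and a plain slice over the same pattern would be expected to cost about the same while COUNT(?s) costs more.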