Re: Very very slow query when using a high OFFSET

Lorenz Buehmann Mon, 18 Dec 2017 02:38:46 -0800

That's what I get from the metadata/header

bin/hdtInfo.sh ~/wikidata.hdt
<file://wikidata.ttl> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://purl.org/HDT/hdt#Dataset> .
<file://wikidata.ttl> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://rdfs.org/ns/void#Dataset> .
<file://wikidata.ttl> <http://rdfs.org/ns/void#triples> "4579973187" .
<file://wikidata.ttl> <http://rdfs.org/ns/void#properties> "17301" .
<file://wikidata.ttl> <http://rdfs.org/ns/void#distinctSubjects>
"481902070" .
<file://wikidata.ttl> <http://rdfs.org/ns/void#distinctObjects>
"715508797" .
<file://wikidata.ttl> <http://purl.org/HDT/hdt#statisticalInformation>
_:statistics .
<file://wikidata.ttl> <http://purl.org/HDT/hdt#publicationInformation>
_:publicationInformation .
<file://wikidata.ttl> <http://purl.org/HDT/hdt#formatInformation> _:format .
_:format <http://purl.org/HDT/hdt#dictionary> _:dictionary .
_:format <http://purl.org/HDT/hdt#triples> _:triples .
_:dictionary <http://purl.org/dc/terms/format>
<http://purl.org/HDT/hdt#dictionaryFour> .
_:dictionary <http://purl.org/HDT/hdt#dictionarynumSharedSubjectObject>
"381953626" .
_:dictionary <http://purl.org/HDT/hdt#dictionarymapping> "1" .
_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "22827063388" .
_:dictionary <http://purl.org/HDT/hdt#dictionaryblockSize> "16" .
_:triples <http://purl.org/dc/terms/format>
<http://purl.org/HDT/hdt#triplesBitmap> .
_:triples <http://purl.org/HDT/hdt#triplesnumTriples> "4579973187" .
_:triples <http://purl.org/HDT/hdt#triplesOrder> "SPO" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "198373280855" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "47873693833" .
_:publicationInformation <http://purl.org/dc/terms/issued>
"2017-11-03T21:24:29+01:00" .


In particular:

_:triples <http://purl.org/HDT/hdt#triplesOrder> "SPO" .
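
As a quick sanity check on the numbers above, here is a minimal sketch in Python. It is not part of the HDT tooling: the regular expression and the helper name header_literals are my own, and the sample statements are copied from the header dump with each statement joined onto one line. It pulls out the literal values, then checks the triple order and whether the triple count exceeds 2^32, which matters for the 32-bit vs 64-bit ID question raised in the quoted mail below.

```python
import re

# A few statements copied from the hdtInfo.sh output above,
# with each statement joined onto a single line.
HEADER = '''
_:triples <http://purl.org/HDT/hdt#triplesnumTriples> "4579973187" .
_:triples <http://purl.org/HDT/hdt#triplesOrder> "SPO" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "198373280855" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "47873693833" .
'''

# Hypothetical helper pattern (not an HDT API): matches
#   <...#localName> "value" .
# and captures the predicate's local name and the literal value.
LITERAL = re.compile(r'<[^>]*[#/]([^>#/]+)>\s+"([^"]*)"\s*\.')


def header_literals(text):
    """Return {predicate local name: literal value} for plain literals."""
    return {name: value for name, value in LITERAL.findall(text)}


props = header_literals(HEADER)
num_triples = int(props["triplesnumTriples"])
exceeds_32_bit = num_triples > 2**32
ratio = int(props["hdtSize"]) / int(props["originalSize"])

print("order:", props["triplesOrder"])
print("more than 2^32 triples:", exceeds_32_bit)
print("hdtSize / originalSize:", round(ratio, 3))
```

The same pattern only works on statements that fit on one line, so the wrapped output above would need to be re-joined first.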

Moreover, the beginning of the wikidata.hdt.index file contains:

$HDT^E<http://purl.org/HDT/hdt#indexFoQ>^@numTriples=4579973187;order=1;
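
The readable tail of that line looks like a list of key=value pairs separated by semicolons. A minimal sketch for splitting it, assuming exactly that layout (parse_properties is a hypothetical helper, not an HDT function, and the binary prefix before the properties is ignored):

```python
def parse_properties(raw):
    """Split a 'key=value;key=value;' string into a dict of strings."""
    props = {}
    for item in raw.split(";"):
        if not item:
            # Skip the empty piece after the trailing semicolon.
            continue
        key, _, value = item.partition("=")
        props[key.strip()] = value.strip()
    return props


# Property string copied from the start of wikidata.hdt.index.
index_props = parse_properties("numTriples=4579973187;order=1;")
print(index_props)

# The triple count in the index should match the one in the HDT header.
header_num_triples = 4579973187
print(int(index_props["numTriples"]) == header_num_triples)
```

The numeric order code is left undecoded here, since mapping it to a permutation name depends on the enumeration used by the HDT implementation.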

I don't know how/where the .index file is taken into account. According
to the docs, that happens the first time a search is triggered. But let's
stop here, since this is off-topic for this list. Continue on the HDT
list/forum if necessary.

Lorenz


On 18.12.2017 11:03, Dick Murray wrote:
> On 18 December 2017 at 08:07, Laura Morales <laure...@mail.com> wrote:
>
>>> They don't have index permutations spo, ops, pos, etc.
>> Yes they have, what you're saying is wrong. See
>> http://www.rdfhdt.org/hdt-binary-format/#triples
>> That's what the .hdt.index file is about, to store
>> more index permutations.
>>
> This is going off the Jena list, but do we know how the wiki HDT was
> compiled? Having read the technical stuff, including the link above, the
> $$streamsOrder property (which defaults to SPO) sets the triple index
> order. Can you query the HDT header and see what this is set to? 0 = SPO,
> 1 = SOP, etc. Also check $$IDCodificationBits, because Wiki blew the
> original HDT code as it exceeded 2^32 triples and there was a new 64-bit
> ID code base in dev. Plus, how big is the generated .hdt.index file (it's
> in the same folder as the .hdt file)? This file is autogenerated as soon
> as you try to search the HDT.
>
> As previously mentioned this is best off this list, so dick-twocows on
> github.
>
>
>>
>>> To bring this thread to an end, I guess we finally answered your
>>> question? Or are there any open issues?
>> I think the only remaining open questions are:
>>
>> - since the problem was not with the OFFSET, would the query "SELECT ?s
>> FROM <wikidata> WHERE ..." also fail to terminate with a TDB-backed
>> namedGraph (instead of HDT)?
>>
>> - is there any improvement that can be added to Jena to solve these types
>> of queries faster, or is it just the way it is and nothing can be done
>> about it?
>>
