Re: Report on loading wikidata

Laura Morales Tue, 12 Dec 2017 23:11:53 -0800

So, if I understand correctly, Mosaic queries multiple TDB stores in parallel 
and then combines all results to answer the user query. Right?
Now, what I'm curious about is this: with Mosaic, can I split Wikidata into
multiple stores (e.g. 100M triples each) and query them like this:
"SELECT FROM <wikidata> ...", where <wikidata> is a graph split among
several independent stores?

Sent: Tuesday, December 12, 2017 at 11:27 PM
From: "Dick Murray" <[email protected]>
To: [email protected]
Subject: Re: Report on loading wikidata
Correct, Mosaic federates multiple datasets as one. At some point in a
query, find [G]SPO will get called, and Mosaic will concurrently call find
on each child dataset and return the set of results. The dataset can be
memory or TDB or Thrift (this one's another discussion); Mosaic doesn't
care as long as it implements DatasetGraph. The child calls use parallel
streams, with distinct or find first as appropriate. Transactions are
supported via ThreadProxy and are delayed until needed, for two reasons:
parallel streams use fork join pools, which create threads whenever, and
certain stream actions such as find first will short circuit and may never
get past reading the first child.
Mosaic exists because I needed to bulk load fast and perform multiple loads
after the bulk loads, i.e. MRMW (multiple readers, multiple writers), which
Mosaic can do/spoof because it extends Transactional with
tryBegin(ReadWrite). Also, we needed to access TDBs from multiple JVMs
because... (this one's another discussion too).
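
To make the fan-out described above concrete, here is a minimal sketch in
plain Java. It is not Mosaic's code and it does not use the Jena API:
QuadSource, inMemory, findAll and findFirst are hypothetical names made up
for this illustration, quads are plain strings, and transactions
(ThreadProxy, tryBegin) are left out entirely. It only shows the general
shape of calling find on every child through a parallel stream, merging
with distinct, and short-circuiting with findFirst.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FederatedFindSketch {

    /** Hypothetical stand-in for a child dataset; not Jena's DatasetGraph. */
    interface QuadSource {
        Stream<String> find(Predicate<String> pattern);
    }

    /** Trivial in-memory child, used only for this illustration. */
    static QuadSource inMemory(String... quads) {
        List<String> data = Arrays.asList(quads);
        return pattern -> data.stream().filter(pattern);
    }

    /** Calls find on every child via a parallel stream and merges the matches. */
    static List<String> findAll(List<QuadSource> children, Predicate<String> pattern) {
        return children.parallelStream()
                .flatMap(child -> child.find(pattern))
                .distinct()   // the same quad may be present in more than one child
                .sorted()     // sorted only to keep the printed output easy to read
                .collect(Collectors.toList());
    }

    /** findFirst is short-circuiting, so later children may never be read. */
    static Optional<String> findFirst(List<QuadSource> children, Predicate<String> pattern) {
        return children.parallelStream()
                .flatMap(child -> child.find(pattern))
                .findFirst();
    }

    public static void main(String[] args) {
        List<QuadSource> children = Arrays.asList(
                inMemory("a p 1", "a p 2"),
                inMemory("a p 2", "b p 3"));
        System.out.println(findAll(children, q -> q.startsWith("a ")));
        System.out.println(findFirst(children, q -> q.startsWith("b ")));
    }
}
```

In this sketch a child is anything that can answer find, which mirrors the
point that the federation layer does not care what backs each child.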

There was a PR, but work got in the way of me testing with sufficient data
to stress it. It's now being stressed. Ideally I'd like to provide Mosaic
as a separate group, e.g. jena-mosaic, which takes the load off maintaining
yet another add-on.

Back on the thread topic: IMHO splitting the bulk load is the way to go, as
you can always use SERVICE in your SPARQL, plus manipulating a 250GB+ file
is a PITA!!! ;-)
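
As a rough illustration of the SERVICE route, the sketch below builds a
SPARQL 1.1 query string that sends the same pattern to several endpoints
and unions the answers. The class name, the helper unionOverServices and
the endpoint URLs are all hypothetical, chosen only for this example; it
assumes each split store is exposed as its own SPARQL endpoint.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ServiceUnionSketch {

    /** Builds one SERVICE block per endpoint and joins the blocks with UNION. */
    static String unionOverServices(List<String> endpoints, String pattern) {
        String branches = endpoints.stream()
                .map(url -> "  { SERVICE <" + url + "> { " + pattern + " } }")
                .collect(Collectors.joining("\n  UNION\n"));
        return "SELECT * WHERE {\n" + branches + "\n}";
    }

    public static void main(String[] args) {
        // Hypothetical endpoints, one per split store.
        List<String> endpoints = Arrays.asList(
                "http://localhost:3030/wikidata-1/sparql",
                "http://localhost:3030/wikidata-2/sparql");
        System.out.println(unionOverServices(endpoints, "?s ?p ?o"));
    }
}
```

One caveat with this shape: each SERVICE block is evaluated against a
single store, so a multi-triple pattern whose triples live in different
stores would not match inside any one block.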

On 12 Dec 2017 21:52, "ajs6f" <[email protected]> wrote:

That's not what Mosaic is doing at all. I'll leave it to Dick to explain
after this, because I am not the expert here, he is, but it's federating
multiple datasets so that they appear as one to SPARQL. It's got nothing to
do with individual graphs within a dataset.

ajs6f

> On Dec 12, 2017, at 4:36 PM, Laura Morales <[email protected]> wrote:
>
>> He can correct me as needed, but it seems that Dick is using (and
>> getting great results from) an extension to Jena ("Mosaic") that
>> federates different datasets (in this case from independent TDB
>> instances) and runs queries over them in parallel. We've had some
>> discussions (all the way to a PR:
>> https://github.com/apache/jena/pull/233) about getting Mosaic into
>> Jena's codebase, but we haven't quite managed to do it. I would love
>> to move that process forward.
>
>
> I think his approach of splitting and running multiple tdbloaders works
> if every TDB is loaded into the default graph (using
> tdb:unionDefaultGraph). However I'm not sure if I want to maintain graph
> labels. Is there any way to tell Jena that one particular graph is
> "composed" of more than one TDB store? For example if I split Wikidata
> into smaller stores of 100M triples each, I could "SELECT FROM
> <wikidata>" instead of "SELECT FROM <wikidata-store1> <wikidata-store2>
> <wikidata-store3> ..."
