So, if I understand correctly, Mosaic queries multiple TDB stores in parallel and then combines all the results to answer the user's query. Right? Now, what I'm curious about is: with Mosaic, can I split Wikidata into multiple stores (e.g. 100M triples each) and query them like this "SELECT FROM <wikidata> ..." where <wikidata> is a graph split among several independent stores?

Sent: Tuesday, December 12, 2017 at 11:27 PM
From: "Dick Murray" <[email protected]>
To: [email protected]
Subject: Re: Report on loading wikidata

Correct, Mosaic federates multiple datasets as one. At some point in a query, find [G]SPO will get called, and Mosaic will concurrently call find on each child dataset and return the set of results. The dataset can be memory, TDB, or Thrift (this one's another discussion). Mosaic doesn't care as long as it implements DatasetGraph. The child calls use parallel streams, with distinct or find first as appropriate.

Transactions are supported via ThreadProxy and delayed until needed, because parallel streams use fork-join pools, which create threads as they see fit, and certain stream actions such as find first will short-circuit and may never get past reading the first child.

Mosaic exists because I needed to bulk load fast and perform multiple loads after the bulk loads, i.e. MRMW, which Mosaic can do/spoof because it extends Transactional with tryBegin(ReadWrite). Also, we needed to access TDBs from multiple JVMs because... (this one's another discussion too).

There was a PR, but work got in the way of me testing with sufficient data to stress it. It's now being stressed. Ideally I'd like to provide Mosaic as a separate group, e.g. jena-mosaic, which takes the load off maintaining yet another add-on.

Back on the thread topic: IMHO splitting the bulk load is the way to go, as you can always use SERVICE in your SPARQL, plus manipulating a 250GB+ file is a PITA!!! ;-)

On 12 Dec 2017 21:52, "ajs6f" <[email protected]> wrote:

That's not what Mosaic is doing at all. I'll leave it to Dick to explain after this, because I am not the expert here, he is, but it's federating multiple datasets so that they appear as one to SPARQL. It's got nothing to do with individual graphs within a dataset.

ajs6f

> On Dec 12, 2017, at 4:36 PM, Laura Morales <[email protected]> wrote:
>
>> He can correct me as needed, but it seems that Dick is using (and getting great results from)
>> an extension to Jena ("Mosaic") that federates different datasets (in this case from
>> independent TDB instances) and runs queries over them in parallel. We've had some discussions
>> (all the way to a PR: https://github.com/apache/jena/pull/233) about getting Mosaic into Jena's
>> codebase, but we haven't quite managed to do it. I would love to move that process forward.
>
> I think his approach of splitting and running multiple tdbloaders works if every TDB is loaded
> into the default graph (using tdb:unionDefaultGraph). However, I'm not sure if I want to maintain
> graph labels. Is there any way to tell Jena that one particular graph is "composed" of more than
> one TDB store? For example, if I split Wikidata into smaller stores of 100M triples each, I could
> "SELECT FROM <wikidata>" instead of "SELECT FROM <wikidata-store1> <wikidata-store2> <wikidata-store3> ..."
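
For readers following the thread, the concurrent find that Dick describes (call find on every child in parallel, then apply distinct or find first) might look roughly like the sketch below. This is not Mosaic's code and it does not use Jena: `Quad`, `ChildStore`, `InMemoryStore`, and `FederatedFindSketch` are hypothetical stand-ins for Jena's quad type and `DatasetGraph`, and the transaction handling (ThreadProxy, tryBegin) is left out entirely.

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical federating layer: presents several child stores as one.
final class FederatedFindSketch {
    // Query every child through a parallel stream and merge the results,
    // dropping quads that more than one child returns.
    static List<Quad> findAll(List<ChildStore> children, String g, String s, String p, String o) {
        return children.parallelStream()
                .flatMap(child -> child.find(g, s, p, o))
                .distinct()
                .collect(Collectors.toList());
    }

    // Short-circuiting variant: the stream may stop as soon as a match is
    // found, so later children may never be read.
    static Optional<Quad> findFirst(List<ChildStore> children, String g, String s, String p, String o) {
        return children.parallelStream()
                .flatMap(child -> child.find(g, s, p, o))
                .findFirst();
    }

    // Two small children that share one quad, for demonstration only.
    static List<ChildStore> demoChildren() {
        ChildStore a = new InMemoryStore(new Quad("g1", "s1", "p", "o1"), new Quad("g1", "s2", "p", "o2"));
        ChildStore b = new InMemoryStore(new Quad("g1", "s2", "p", "o2"), new Quad("g2", "s3", "p", "o3"));
        return List.of(a, b);
    }

    public static void main(String[] args) {
        List<ChildStore> children = demoChildren();
        // Wildcard on everything except the predicate.
        findAll(children, null, null, "p", null).forEach(System.out::println);
        System.out.println(findFirst(children, null, "s3", null, null).isPresent());
    }
}

// Hypothetical stand-in for a quad (graph, subject, predicate, object).
final class Quad {
    final String g, s, p, o;

    Quad(String g, String s, String p, String o) {
        this.g = g; this.s = s; this.p = p; this.o = o;
    }

    @Override public boolean equals(Object other) {
        if (!(other instanceof Quad)) return false;
        Quad q = (Quad) other;
        return g.equals(q.g) && s.equals(q.s) && p.equals(q.p) && o.equals(q.o);
    }

    @Override public int hashCode() { return Objects.hash(g, s, p, o); }

    @Override public String toString() { return g + " " + s + " " + p + " " + o; }
}

// Hypothetical stand-in for a child dataset; null in any position is a wildcard.
interface ChildStore {
    Stream<Quad> find(String g, String s, String p, String o);
}

// Trivial list-backed child store used only to make the sketch runnable.
final class InMemoryStore implements ChildStore {
    private final List<Quad> quads;

    InMemoryStore(Quad... quads) { this.quads = List.of(quads); }

    @Override public Stream<Quad> find(String g, String s, String p, String o) {
        return quads.stream().filter(q ->
                matches(g, q.g) && matches(s, q.s) && matches(p, q.p) && matches(o, q.o));
    }

    private static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }
}
```

Note that the sketch federates whole stores behind one find call. It says nothing about how a single named graph could be mapped onto several stores, which is the part of Laura's question that ajs6f's reply addresses.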
