tdbloader2

For anyone still following this thread ;-)
latest-truthy supposedly contains just the verified facts, but this is Wikipedia... latest-truthy is unsorted and contains duplicates: running sort then uniq yields 61K+ duplicated lines, each repeated anywhere from 2 to 100+ times. Running sort takes a while! Whilst it's not going to reduce the line count hugely (~3M lines), it's worth considering when doing any import.

Fastest elapsed load TPS currently is ~405K (got there Andy), achieved by splitting the file into 200M-line files using split and running four concurrent loads into four TDBs, each TDB on a separate 5400rpm 6G drive. The tdbloader2 script was hacked to run sort with --parallel=6, an 8G buffer, and temporary files on an appropriate drive. Repeat three times to give 12 TDB instances, then query via my Mosaic extension.

I'll up the file size until the drive saturates and stalls, then drop it back and run the loads concurrently, as the stall appears to occur on the drive write. Currently the per-index data-triple sorts run in parallel, but the indexes are written sequentially. The drives were stolen from some old laptops, so not exactly bling hardware.

On the subject of performance, it's possible to cascade the split if you have enough drives: split the file in half, and as soon as the second half is created, split the first one in half, and so on. A script using inotify drives this.

While I aim to "load" truthy in under an hour, that won't account for getting the file, uncompressing the file (non-parallel bzip2!!!), splitting the file, etc. But for marketing purposes, who cares... ;-)

On 11 Dec 2017 18:43, "Laura Morales" <laure...@mail.com> wrote:

Did you run your Threadripper test using tdbloader, tdbloader2, or tdb2.tdbloader?

@Andy where can I find a description of the TDB1/2 binary format (how stuff is stored in the files)?

Sent: Monday, December 11, 2017 at 11:31 AM
From: "Dick Murray" <dandh...@gmail.com>
To: users@jena.apache.org
Subject: Re: Report on loading wikidata

Inline...

On 10 December 2017 at 23:03, Laura Morales <laure...@mail.com> wrote:

> Thank you a lot Dick!
> Is this test for tdbloader, tdbloader2, or tdb2.tdbloader?

> > 32GB DDR4 quad channel
>
> 2133 or higher?

2133

> > 3 x M.2 Samsung 960 EVO
>
> Are these PCI-e disks? Or SATA? Also, what size and configuration?

PCIe Turbo

> Is it possible to split the index files into separate folders? Or sym link
> the files, if I run the data phase, sym link, then run the index phase?

What would you gain from this? n index files need to be written, so split the load across multiple devices, be that cores/controllers/storage. Potentially use a fast/expensive device to perform the load and copy the files over to a production-grade device. The load device would have no redundancy, as who cares if it throws a drive? Production devices are redundant, as there's a 5 9's requirement.

> > 172K/sec, 3h45m for truthy.
>
> It still feels slow considering that you throw such a powerful machine at
> it, but it's very interesting nonetheless! What I think after these tests
> is that the larger impact here is given by the M.2 disks,

It's also got 2 x SATA III 6G drives, and the load time doesn't increase by much using these. There's a fundamental limit at which degradation occurs, as eventually stuff has to be swapped or committed, which then cascades into stalls. As an ex-DBA, bulk loads always involved dropping or disabling indexes, running overnight while the users were asleep, rebuilding the indexes, updating stats, and presenting the DB in TP mode to make the users happy! Things have moved on, but the same problems exist.

> and perhaps to a smaller scale by the DDR4 modules. When I tested with a
> xeon+ddr3-1600, it didn't seem to make any difference. It would be
> interesting to test with a more "mid-range setup" (iCore/xeon + DDR3) and
> M.2 disks. Is this something that you can try as well?

IMHO it's not, because our SLA equates to 50K/sec, or 180M quads an hour, so anything over this is a bonus. But we don't work on getting 500M quads into a store at 150K/sec, because that will eventually hit a ceiling.
We work on getting 500M quads into stores concurrently at 75K/sec. Production environments are a completely different beast to having fun with a test setup.

Consider the simplified steps involved in getting a single quad into a store (please correct me Andy):

- Read the quad from the source.
- Verify the GSPO lexical forms and types.
- Check GSPO for uniqueness (read and compare), with possibly x4 writes to the node->id lookup.
- Write the indexes.
- Repeat n times.
- Understand the binary format and tweak appropriately for tdbloader2 ;-)

Broadly speaking you can affect the overall time and the elapsed time, what we refer to as the "fast or clever" problem. Simplistically: reduce the overall time by loading more per second, and reduce the elapsed time by loading more concurrently. I prefer going after the elapsed time with the divide-and-conquer approach because it yields more scalable results. This is why we run multiple stores (not just TDB) and query over them. That is itself a trade-off, because we need to use DISTINCT when merging streams, which can be RAM intensive. And we're really tight on the number of quads you can return! :-)
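The DISTINCT-when-merging-streams point can be illustrated with plain coreutils: if each store hands back its results already sorted, GNU sort can merge them and deduplicate as they stream past, without re-sorting or holding everything in RAM. A toy sketch with made-up files, not Jena code:

```shell
# Two stores' (already sorted) result streams; file names are invented.
printf 'a\nb\nc\n' > store1.txt
printf 'b\nc\nd\n' > store2.txt

# -m merges pre-sorted inputs without re-sorting; -u drops duplicates on
# the fly, so memory stays bounded at roughly one line per input stream.
sort -m -u store1.txt store2.txt     # -> a b c d, one per line
```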
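For the duplicate check mentioned at the top of the thread, a minimal coreutils sketch. The three-line sample file is invented for illustration; on the real dump you would scale the GNU sort flags up (--parallel=6, -S 8G, -T on a scratch drive, as in the hacked loader script):

```shell
# Toy N-Triples file with one duplicated line.
printf '%s\n' \
  '<urn:s> <urn:p> "a" .' \
  '<urn:s> <urn:p> "a" .' \
  '<urn:s> <urn:p> "b" .' > sample.nt

# uniq -d prints each repeated line once; wc -l counts how many distinct
# lines are duplicated (61K+ on latest-truthy).
sort sample.nt | uniq -d | wc -l                 # -> 1

# Sort and deduplicate in one pass (buffer scaled down from 8G here).
sort -u --parallel=2 -S 64M -T . sample.nt > deduped.nt
wc -l < deduped.nt                               # -> 2
```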
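The split-then-load-concurrently step looks roughly like this, with a stand-in file and chunk sizes scaled down from the 200M lines used on the real dump. The tdbloader2 loop is left commented out since paths and drive layout are site-specific:

```shell
# Stand-in for the decompressed dump.
seq 1 1000 > truthy.nt

# Fixed-line-count chunks: chunk_aa, chunk_ab, chunk_ac, chunk_ad.
# The real run used: split -l 200000000 latest-truthy.nt
split -l 250 truthy.nt chunk_
ls chunk_* | wc -l                    # -> 4

# One loader per chunk, each TDB on its own drive (sketch, not run here):
# i=0
# for f in chunk_*; do
#   tdbloader2 --loc "/mnt/drive$i/tdb" "$f" &
#   i=$((i+1))
# done
# wait
```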
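And the sym-link idea from the index-files question, sketched with plain directories standing in for separate drives. The index file names (SPO.dat and friends) are only illustrative of TDB1's on-disk layout, not an exhaustive list, and whether your tdbloader2 exposes separate data/index phases (recent Jena releases take a --phase argument) should be checked against your version:

```shell
# disk1..disk3 stand in for mount points on separate drives.
DB=tdbdemo
mkdir -p "$DB" disk1 disk2 disk3

i=1
for idx in SPO POS OSP; do
  : > "disk$i/$idx.dat"                      # real file lives on its own drive
  ln -s "../disk$i/$idx.dat" "$DB/$idx.dat"  # database dir sees a symlink
  i=$((i+1))
done

ls -l "$DB"   # each *.dat points at another disk; index writes now spread out
# tdbloader2 --loc "$DB" --phase index ...   # (sketch, not run here)
```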