Re: Report on loading wikidata

Dick Murray Mon, 11 Dec 2017 02:32:17 -0800

Inline...

On 10 December 2017 at 23:03, Laura Morales <[email protected]> wrote:


> Thank you a lot Dick! Is this test for tdbloader, tdbloader2, or
> tdb2.tdbloader?
>
> > 32GB DDR4 quad channel
>
> 2133 or higher?
>

2133


> > 3 x M.2 Samsung 960 EVO
>
> Are these PCI-e disks? Or SATA? Also, what size and configuration?


PCIe Turbo


> > Is it possible to split the index files into separate folders?
> > Or sym link the files, if I run the data phase, sym link, then run the
> index phase?
>
> What would you gain from this?
>

n index files need to be written, so split the load across multiple devices,
be that cores/controllers/storage. Potentially use a fast/expensive device
to perform the load, then copy the files over to a production-grade device.
The load device would have no redundancy, as who cares if it throws a drive?
Production devices are redundant because of a five 9's requirement.
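
A minimal sketch of the sym link idea, assuming the data phase has already
written its files into one folder. The helper name and the file name in the
demo are made up for illustration, and whether the loader is happy with sym
linked index files is exactly the open question here, so treat this as an
experiment rather than a recipe.

```python
import shutil
import tempfile
from pathlib import Path


def relocate_with_symlink(src: Path, target_dir: Path) -> Path:
    """Move src onto another device/folder and leave a sym link behind.

    Hypothetical helper: the original path keeps working, but reads and
    writes now land in target_dir.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    dest = target_dir / src.name
    shutil.move(str(src), str(dest))  # copy + delete when crossing devices
    src.symlink_to(dest)              # original name now points at the new home
    return dest


if __name__ == "__main__":
    # Demo in a throwaway folder; "index-a.dat" is an invented file name.
    root = Path(tempfile.mkdtemp())
    store_dir = root / "store"
    store_dir.mkdir()
    index_file = store_dir / "index-a.dat"
    index_file.write_bytes(b"placeholder")

    relocate_with_symlink(index_file, root / "fast-device")
    print(index_file.is_symlink(), index_file.read_bytes())
```

Copying the finished files to the production device afterwards would be a
plain file copy of the resolved targets.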


>
> > 172K/sec 3h45m for truthy.
>
> It still feels slow considering that you throw such a powerful machine to
> it, but it's very interesting nonetheless! What I think after these tests,
> is that the larger impact here is given by the M.2 disks


It's also got 2 x SATAIII 6G drives and the load time doesn't increase by
much using these. There's a fundamental limit at which degradation occurs,
as eventually stuff has to be swapped or committed, which then cascades into
stalls. As an ex-DBA, bulk loads always involved dropping or disabling
indexes, running overnight while users were asleep, building indexes,
updating stats, then presenting the DB in TP mode to make users happy!
Things have moved on but the same problems exist.


> , and perhaps to a smaller scale by the DDR4 modules. When I tested with a
> xeon+ddr3-1600, it didn't seem to make any difference. It would be
> interesting to test with a more "mid-range setup" (iCore/xeon + DDR3) and
> M.2 disks. Is this something that you can try as well?
>

IMHO it's not, our SLA equates to 50K quads/sec or 180M quads/hr, so
anything over this is a bonus. But we don't work on getting 500M quads into
a store at 150K/sec because this will eventually hit a ceiling. We work on
getting 500M quads into several stores concurrently at 75K/sec each.
Production environments are a completely different beast to having fun with
a test setup.
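
The arithmetic behind those figures can be sketched as below. The rates and
the 500M quad load size are the ones quoted above; the store count of 4 is
purely an assumption for illustration, not our actual setup.

```python
def quads_per_hour(rate_per_sec: int) -> int:
    """Convert a load rate in quads/sec to quads/hr."""
    return rate_per_sec * 3600


def load_hours(total_quads: int, rate_per_sec: int) -> float:
    """Elapsed hours to load total_quads at a steady rate."""
    return total_quads / rate_per_sec / 3600


def aggregate_rate(rate_per_store: int, stores: int) -> int:
    """Combined quads/sec when several stores load concurrently."""
    return rate_per_store * stores


print(quads_per_hour(50_000))                      # 180000000, the SLA per hour
print(round(load_hours(500_000_000, 150_000), 2))  # 0.93 hours for one fast store
print(round(load_hours(500_000_000, 75_000), 2))   # 1.85 hours per slower store
print(aggregate_rate(75_000, 4))                   # 300000 with 4 (assumed) stores
```

Each store takes longer at 75K/sec, but the combined rate goes up with every
store added, provided the loads really are independent.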

Consider the simplified steps involved in getting a single quad into a
store (please correct me Andy);

Read quad from source.
Verify GSPO lexical form and type.
Check GSPO for uniqueness (read and compare), possibly x4 writes to the
node->id lookup.
Write indexes.
Repeat n times.
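
The steps above can be sketched as a toy in-memory store. This is not how
TDB is implemented; the class name, the string-only term check and the
choice of three index orderings are all assumptions made to keep the example
short.

```python
class ToyQuadStore:
    """Illustrative only: node->id lookup plus a few quad indexes."""

    INDEX_ORDERS = ("GSPO", "GPOS", "GOSP")  # assumed orderings for the demo

    def __init__(self):
        self.node_to_id = {}   # term -> integer id
        self.id_to_node = []   # integer id -> term
        self.indexes = {order: set() for order in self.INDEX_ORDERS}

    def _node_id(self, term: str) -> int:
        # Read and compare; write to the lookup only for unseen terms.
        nid = self.node_to_id.get(term)
        if nid is None:
            nid = len(self.id_to_node)
            self.node_to_id[term] = nid
            self.id_to_node.append(term)
        return nid

    def add(self, g: str, s: str, p: str, o: str) -> bool:
        quad = {"G": g, "S": s, "P": p, "O": o}
        # Stand-in for lexical/type verification: non-empty strings only.
        for slot, term in quad.items():
            if not isinstance(term, str) or not term:
                raise ValueError(f"bad term in slot {slot}: {term!r}")
        # Up to four node->id writes, one per slot.
        ids = {slot: self._node_id(term) for slot, term in quad.items()}
        if tuple(ids[c] for c in "GSPO") in self.indexes["GSPO"]:
            return False  # duplicate quad, nothing to write
        # Write every index with the ids permuted into its ordering.
        for order, index in self.indexes.items():
            index.add(tuple(ids[c] for c in order))
        return True


store = ToyQuadStore()
print(store.add("g1", "s1", "p1", "o1"))  # True
print(store.add("g1", "s1", "p1", "o1"))  # False
```

Even in the toy, one quad costs several lookups and one write per index,
which is where the per-quad time goes.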

Understand binary format and tweak appropriately for tdbloader2 ;-)

Broadly speaking you can affect the overall time and the elapsed time, what
we refer to as the fast or clever problem. Simplistically, reduce the
overall time by loading more per second and reduce the elapsed time by
loading more concurrently. I prefer going after the elapsed time with the
divide and conquer approach because it yields more scalable results. This is
why we run multiple stores (not just TDB) and query over them. This in
itself is a trade-off because we need to use distinct when merging streams,
which can be RAM intensive. And we're really tight on the number of quads
you can return! :-)
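
A minimal sketch of the distinct-when-merging cost, assuming each store
hands back an iterable of quads as tuples. The function name and the sample
data are invented for illustration; our real query layer is not shown here.

```python
from itertools import chain, islice


def merged_distinct(streams, limit):
    """Merge quad streams from several stores, dropping duplicates.

    The 'seen' set holds every distinct quad produced so far, which is
    the RAM cost; 'limit' caps how many quads are returned.
    """
    seen = set()

    def unique():
        for quad in chain.from_iterable(streams):
            if quad not in seen:
                seen.add(quad)
                yield quad

    return list(islice(unique(), limit))


# Two hypothetical stores that overlap on one quad.
store_a = [("g", "s1", "p", "o"), ("g", "s2", "p", "o")]
store_b = [("g", "s2", "p", "o"), ("g", "s3", "p", "o")]

print(len(merged_distinct([store_a, store_b], limit=10)))  # 3
print(len(merged_distinct([store_a, store_b], limit=2)))   # 2
```

Because the generator stops as soon as the limit is reached, a tight limit
also bounds how large the seen set can grow.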
