tdbloader2

For anyone still following this thread ;-)
latest-truthy supposedly contains just the verified facts, but this is Wikipedia... latest-truthy is unsorted and contains duplicates: running sort then uniq yields 61K+ duplicated lines, each repeated anywhere from 2 to 100+ times. Running sort takes a while! Whilst it's not going to reduce the line count hugely (~3M lines), it's worth considering when doing any import.

Fastest elapsed load TPS currently is ~405K (got there Andy), achieved by splitting the file into 200M-line files using split and running four concurrent loads into four TDBs, each TDB on a separate 5400rpm 6G drive. The tdbloader2 script was hacked to run sort with --parallel=6, an 8G buffer, and temporary files on an appropriate drive. Repeat three times to give 12 TDB instances, then query via my Mosaic extension.

I'll up the file size until the drive saturates and stalls, then drop it back and run the loads concurrently, as the stall appears to occur on the drive write. Currently the per-index data-triple sorts run in parallel, but the indexes are written sequentially. The drives were stolen from some old laptops, so not exactly bling hardware.

On the subject of performance, it's possible to cascade the split if you have enough drives: split the file in half, and as soon as the second half is created, split the first one in half, and so on. A script using inotify drives this.

While I aim to "load" truthy in under an hour, that won't account for getting the file, uncompressing the file (non-parallel bzip2!!!), splitting the file, etc. But for marketing purposes, who cares... ;-)

On 11 Dec 2017 18:43, "Laura Morales" <laure...@mail.com> wrote:

Did you run your Threadripper test using tdbloader, tdbloader2, or tdb2.tdbloader?

@Andy where can I find a description of the TDB1/2 binary format (how stuff is stored in the files)?

Sent: Monday, December 11, 2017 at 11:31 AM
From: "Dick Murray" <dandh...@gmail.com>
To: users@jena.apache.org
Subject: Re: Report on loading wikidata

Inline...

On 10 December 2017 at 23:03, Laura Morales <laure...@mail.com> wrote:

> Thank you a lot Dick!
> Is this test for tdbloader, tdbloader2, or tdb2.tdbloader?

> > 32GB DDR4 quad channel
>
> 2133 or higher?

2133

> > 3 x M.2 Samsung 960 EVO
>
> Are these PCI-e disks? Or SATA? Also, what size and configuration?

PCIe Turbo

> Is it possible to split the index files into separate folders? Or sym link
> the files, if I run the data phase, sym link, then run the index phase?

What would you gain from this? n index files need to be written, so split the load across multiple devices, be that cores/controllers/storage. Potentially use a fast/expensive device to perform the load and copy the files over to a production-grade device. The load device would have no redundancy, as who cares if it throws a drive? Production devices are redundant, as there's a 5 9's requirement.

> > 172K/sec, 3h45m for truthy.
>
> It still feels slow considering that you throw such a powerful machine at
> it, but it's very interesting nonetheless! What I think after these tests
> is that the larger impact here is given by the M.2 disks,

It's also got 2 x SATA III 6G drives, and the load time doesn't increase by much using these. There's a fundamental limit at which degradation occurs, as eventually stuff has to be swapped or committed, which then cascades into stalls. As an ex-DBA, bulk loads always involved dropping or disabling indexes, running overnight while the users were asleep, rebuilding the indexes, updating stats, and presenting the DB in TP mode to make the users happy! Things have moved on, but the same problems exist.

> and perhaps to a smaller scale by the DDR4 modules. When I tested with a
> xeon+ddr3-1600, it didn't seem to make any difference. It would be
> interesting to test with a more "mid-range setup" (iCore/xeon + DDR3) and
> M.2 disks. Is this something that you can try as well?

IMHO it's not, because our SLA equates to 50K/sec, or 180M quads an hour, so anything over this is a bonus. But we don't work on getting 500M quads into a store at 150K/sec, because that will eventually hit a ceiling.
We work on getting 500M quads into stores concurrently at 75K/sec. Production environments are a completely different beast to having fun with a test setup.

Consider the simplified steps involved in getting a single quad into a store (please correct me Andy):

- Read the quad from the source.
- Verify the GSPO lexical forms and types.
- Check GSPO for uniqueness (read and compare), with possibly x4 writes to the node->id lookup.
- Write the indexes.
- Repeat n times.
- Understand the binary format and tweak appropriately for tdbloader2 ;-)

Broadly speaking you can affect the overall time and the elapsed time, what we refer to as the "fast or clever" problem. Simplistically: reduce the overall time by loading more per second, and reduce the elapsed time by loading more concurrently. I prefer going after the elapsed time with the divide-and-conquer approach because it yields more scalable results. This is why we run multiple stores (not just TDB) and query over them. That is itself a trade-off, because we need to use DISTINCT when merging streams, which can be RAM intensive. And we're really tight on the number of quads you can return! :-)
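The DISTINCT-when-merging-streams point can be illustrated with plain coreutils: if each store hands back its results already sorted, GNU sort can merge them and deduplicate as they stream past, without re-sorting or holding everything in RAM. A toy sketch with made-up files, not Jena code:

```shell
# Two stores' (already sorted) result streams; file names are invented.
printf 'a\nb\nc\n' > store1.txt
printf 'b\nc\nd\n' > store2.txt

# -m merges pre-sorted inputs without re-sorting; -u drops duplicates on
# the fly, so memory stays bounded at roughly one line per input stream.
sort -m -u store1.txt store2.txt     # -> a b c d, one per line
```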
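For the duplicate check mentioned at the top of the thread, a minimal coreutils sketch. The three-line sample file is invented for illustration; on the real dump you would scale the GNU sort flags up (--parallel=6, -S 8G, -T on a scratch drive, as in the hacked loader script):

```shell
# Toy N-Triples file with one duplicated line.
printf '%s\n' \
  '<urn:s> <urn:p> "a" .' \
  '<urn:s> <urn:p> "a" .' \
  '<urn:s> <urn:p> "b" .' > sample.nt

# uniq -d prints each repeated line once; wc -l counts how many distinct
# lines are duplicated (61K+ on latest-truthy).
sort sample.nt | uniq -d | wc -l                 # -> 1

# Sort and deduplicate in one pass (buffer scaled down from 8G here).
sort -u --parallel=2 -S 64M -T . sample.nt > deduped.nt
wc -l < deduped.nt                               # -> 2
```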
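The split-then-load-concurrently step looks roughly like this, with a stand-in file and chunk sizes scaled down from the 200M lines used on the real dump. The tdbloader2 loop is left commented out since paths and drive layout are site-specific:

```shell
# Stand-in for the decompressed dump.
seq 1 1000 > truthy.nt

# Fixed-line-count chunks: chunk_aa, chunk_ab, chunk_ac, chunk_ad.
# The real run used: split -l 200000000 latest-truthy.nt
split -l 250 truthy.nt chunk_
ls chunk_* | wc -l                    # -> 4

# One loader per chunk, each TDB on its own drive (sketch, not run here):
# i=0
# for f in chunk_*; do
#   tdbloader2 --loc "/mnt/drive$i/tdb" "$f" &
#   i=$((i+1))
# done
# wait
```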
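And the sym-link idea from the index-files question, sketched with plain directories standing in for separate drives. The index file names (SPO.dat and friends) are only illustrative of TDB1's on-disk layout, not an exhaustive list, and whether your tdbloader2 exposes separate data/index phases (recent Jena releases take a --phase argument) should be checked against your version:

```shell
# disk1..disk3 stand in for mount points on separate drives.
DB=tdbdemo
mkdir -p "$DB" disk1 disk2 disk3

i=1
for idx in SPO POS OSP; do
  : > "disk$i/$idx.dat"                      # real file lives on its own drive
  ln -s "../disk$i/$idx.dat" "$DB/$idx.dat"  # database dir sees a symlink
  i=$((i+1))
done

ls -l "$DB"   # each *.dat points at another disk; index writes now spread out
# tdbloader2 --loc "$DB" --phase index ...   # (sketch, not run here)
```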