Quick update on loader performance. I did a modest (in terms of cost) hardware upgrade on one of the dedicated data processors, with a faster CPU and a faster NVMe SSD, and was able to almost halve our load times. Very satisfied with the HW upgrade and TDB2 loader performance. VMs don't seem to work well for us in combination with TDB.
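For comparing runs across machines, a minimal sketch in Python that pulls the Batch/Avg figures out of the "Add:" progress lines that tdb2.tdbloader logs. It assumes the line format shown in the logs quoted in this thread; `parse_progress` and `summarize` are hypothetical helper names for illustration, not part of Jena.

```python
import re

# Matches progress lines in the format quoted in this thread, e.g.
#   INFO Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)
# Assumption: other Jena versions may format these lines differently.
PROGRESS = re.compile(
    r"Add:\s+([\d,]+)\s+(\S+)\s+\(Batch:\s+([\d,]+)\s+/\s+Avg:\s+([\d,]+)\)"
)


def _to_int(text):
    """Convert a number with thousands separators, e.g. '237,755', to int."""
    return int(text.replace(",", ""))


def parse_progress(log_text):
    """Return a list of (tuples_so_far, batch_rate, avg_rate) per progress line."""
    # Mail clients wrap long lines and add ">" quote markers, so drop the
    # markers and collapse all whitespace before matching.
    flat = " ".join(log_text.replace(">", " ").split())
    return [
        (_to_int(m.group(1)), _to_int(m.group(3)), _to_int(m.group(4)))
        for m in PROGRESS.finditer(flat)
    ]


def summarize(log_text):
    """Summarise a load: tuples seen, slowest/fastest batch, last running average."""
    rows = parse_progress(log_text)
    if not rows:
        return None
    batch_rates = [batch for _, batch, _ in rows]
    return {
        "tuples": rows[-1][0],
        "min_batch": min(batch_rates),
        "max_batch": max(batch_rates),
        "last_avg": rows[-1][2],
    }
```

Usage: read a saved log with `open(path).read()` and pass the text to `summarize`. A large gap between `max_batch` and `min_batch` late in a load is the kind of slowdown described for the memory-constrained VM.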

On Fri, Jun 14, 2019 at 11:56 PM Marco Neumann <[email protected]> wrote:

> absolutely it does, preferably NVMe SSD. tdbloaders are almost a showcase
> themselves for good up-to-date hardware.
>
> if possible I'd like to load the wikidata dataset* at some point to see
> where 57GB fits in terms of tdb. The wikidata team is currently looking at
> new solutions that can go beyond blazegraph. And I get the impression that
> they have not yet actively considered giving Jena TDB a try.
>
> * https://dumps.wikimedia.org/wikidatawiki/entities/
>
> On Fri, Jun 14, 2019 at 11:47 PM Martynas Jusevičius <[email protected]> wrote:
>
>> What about SSD disks, don't they make a difference?
>>
>> On Sat, Jun 15, 2019 at 12:36 AM Marco Neumann <[email protected]> wrote:
>> >
>> > that did the trick Andy, very good. might be a good idea to add this to the
>> > distribution in jena-log4j.properties
>> >
>> > I am getting these numbers for a midsize dedicated server, very nice
>> > numbers indeed Andy. well done!
>> >
>> > 00:24:53 INFO loader :: Loader = LoaderPhased
>> > 00:24:53 INFO loader :: Start: ../../public_html/lotico.ttl.gz
>> > 00:24:55 INFO loader :: Add: 500,000 lotico.ttl.gz (Batch: 237,755 / Avg: 237,755)
>> > 00:24:56 INFO loader :: Add: 1,000,000 lotico.ttl.gz (Batch: 305,250 / Avg: 267,308)
>> > 00:24:58 INFO loader :: Add: 1,500,000 lotico.ttl.gz (Batch: 313,087 / Avg: 281,004)
>> > 00:25:00 INFO loader :: Add: 2,000,000 lotico.ttl.gz (Batch: 328,299 / Avg: 291,502)
>> > 00:25:01 INFO loader :: Add: 2,500,000 lotico.ttl.gz (Batch: 341,763 / Avg: 300,336)
>> > 00:25:03 INFO loader :: Add: 3,000,000 lotico.ttl.gz (Batch: 337,381 / Avg: 305,935)
>> > 00:25:04 INFO loader :: Add: 3,500,000 lotico.ttl.gz (Batch: 318,877 / Avg: 307,719)
>> > 00:25:06 INFO loader :: Add: 4,000,000 lotico.ttl.gz (Batch: 295,857 / Avg: 306,184)
>> > 00:25:07 INFO loader :: Add: 4,500,000 lotico.ttl.gz (Batch: 327,225 / Avg: 308,388)
>> > 00:25:09 INFO loader :: Add: 5,000,000 lotico.ttl.gz (Batch: 349,406 / Avg: 312,051)
>> > 00:25:09 INFO loader :: Elapsed: 16.02 seconds [2019/06/15 00:25:09 CEST]
>> > 00:25:11 INFO loader :: Add: 5,500,000 lotico.ttl.gz (Batch: 285,062 / Avg: 309,388)
>> > 00:25:13 INFO loader :: Add: 6,000,000 lotico.ttl.gz (Batch: 203,665 / Avg: 296,559)
>> > 00:25:16 INFO loader :: Add: 6,500,000 lotico.ttl.gz (Batch: 189,393 / Avg: 284,190)
>> >
>> > on another machine that sits in the Azure infrastructure somewhere,
>> > tdbloader doesn't look as good. even with decent hardware it seems to die a
>> > slow death of memory exhaustion at 16GB. started off with 70kT/s and is now
>> > down to 17kT/s and still going.
>> >
>> > lesson learned: big iron and big memory is the way to go with Jena
>> > tdbloaders.
>> >
>> > On Fri, Jun 14, 2019 at 10:53 PM Andy Seaborne <[email protected]> wrote:
>> >
>> > > These messages are logged (to logger "org.apache.jena.tdb2.loader") - do
>> > > you have log4j.properties in the current working directory?
>> > >
>> > > Do you get any output?
>> > >
>> > > INFO Loader = LoaderParallel
>> > > INFO Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
>> > > INFO Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)
>> > > INFO Add: 1,000,000 bsbm-5m.nt.gz (Batch: 189,753 / Avg: 157,604)
>> > > INFO Add: 1,500,000 bsbm-5m.nt.gz (Batch: 205,676 / Avg: 170,920)
>> > > INFO Add: 2,000,000 bsbm-5m.nt.gz (Batch: 204,248 / Avg: 178,189)
>> > > INFO Add: 2,500,000 bsbm-5m.nt.gz (Batch: 202,101 / Avg: 182,508)
>> > > INFO Add: 3,000,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 186,173)
>> > > INFO Add: 3,500,000 bsbm-5m.nt.gz (Batch: 183,621 / Avg: 185,804)
>> > > INFO Add: 4,000,000 bsbm-5m.nt.gz (Batch: 151,423 / Avg: 180,676)
>> > > INFO Add: 4,500,000 bsbm-5m.nt.gz (Batch: 152,765 / Avg: 177,081)
>> > > INFO Add: 5,000,000 bsbm-5m.nt.gz (Batch: 158,881 / Avg: 175,076)
>> > > INFO Elapsed: 28.56 seconds [2019/06/14 22:51:37 BST]
>> > > INFO Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples in 28.63s (Avg: 174,644)
>> > > INFO Finish - index SPO
>> > > INFO Finish - index POS
>> > > INFO Finish - index OSP
>> > > INFO Time = 35.572 seconds : Triples = 5,000,599 : Rate = 140,577 /s
>> > >
>> > > There is a pause after the first "Finished:" - this is finished data in,
>> > > the index threads are still running and the pause comes from flush to disk.
>> > >
>> > > Andy
>> > >
>> > > On 14/06/2019 20:16, Marco Neumann wrote:
>> > > > let me fire up one of the big machines to see what I will get there.
>> > > > currently I have no info display during load with tdb2.tdbloader. if -v is
>> > > > specified I get some extra info but no load info.
>> > > >
>> > > > On Fri, Jun 14, 2019 at 8:03 PM Andy Seaborne <[email protected]> wrote:
>> > > >
>> > > >> On 14/06/2019 18:13, Marco Neumann wrote:
>> > > >>> I am collecting jena loader benchmarks. if you have results please post
>> > > >>> them directly.
>> > > >>>
>> > > >>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
>> > > >>
>> > > >> tdb2.tdbloader has variations controlled by --loader.
>> > > >>
>> > > >> --loader=
>> > > >> Loader to use: 'basic', 'phased' (default), 'sequential', 'parallel' or
>> > > >> 'light'
>> > > >>
>> > > >> "basic" is a super naive parser-add triple loop - it is used if a loader
>> > > >> can't cope with an already loaded database.
>> > > >>
>> > > >> "phased" is a balanced loader that does not saturate the machine. Some
>> > > >> parallelism.
>> > > >>
>> > > >> "sequential" is the tdbloader algorithm for TDB2, more for reference.
>> > > >>
>> > > >> "parallel" is as much parallelism as it wants. (5 for triples, more for
>> > > >> quads)
>> > > >>
>> > > >> "light" is two threaded. Slightly lighter than "phased".
>> > > >>
>> > > >> See LoaderPlans.
>> > > >>
>> > > >>> On a linux machine I am using "time" to collect data.
>> > > >>>
>> > > >>> Is there a flag on tdb2.tdbloader to report time and triples per second?
>> > > >>>
>> > > >>> I have noticed that storage space use for tdbloader2 is significantly
>> > > >>> smaller on disk compared to tdbloader and tdb2.tdbloader. Is there a
>> > > >>> straightforward explanation here?
>> >
>> > --
>> > ---
>> > Marco Neumann
>> > KONA
>
> --
> ---
> Marco Neumann
> KONA

--
---
Marco Neumann
KONA
