Yes, I'd say the local NVMe SSDs make the difference here. In my case, for the US East and US East 2 zones, the VMs only show a premium SSD option. The so-called ultra SSDs seem to be in high demand and are currently not available in my profile. They also come at a very high estimated price point.
Which dataset do you use to run the load test above?

On Sat, Jun 22, 2019 at 11:47 PM Andy Seaborne <[email protected]> wrote:

> On 20/06/2019 16:01, Marco Neumann wrote:
> > quick update here on loader performance. Did a modest (in terms of
> > cost) hardware upgrade of one of the dedicated data processors with a
> > faster CPU and faster NVMe SSD drive and was able to almost halve our
> > load times. Very satisfied with the HW upgrade and TDB2 loader
> > performance. VMs don't seem to work well for us in combination with
> > TDB.
>
> My experience has been significant variation across different VM types.
> My assumption is the form of virtualization matters.
>
> I had access to an AWS i3.8xlarge for a short while which had local NVMe
> SSDs and got very good performance:
>
>   500m       TDB2   2,362s       39m 22s   218,460 TPS
>   1 billion  TDB2   5,164s   1h  26m 04s   200,100 TPS
>
> (this is a single graph dataset)
>
> i3 are "Storage optimized"
>
> The TDB2 loader is multithreaded and each thread is working on a
> different index, so the access patterns are jumping around all over the
> place, both because the non-primary index is, in effect at scale,
> randomly accessed, and because multiple indexes are updating at the same
> time.
>
>      Andy
>
> > On Fri, Jun 14, 2019 at 11:56 PM Marco Neumann <[email protected]>
> > wrote:
> >
> >> absolutely it does, preferably NVMe SSD. tdbloaders are almost a
> >> showcase themselves for good up-to-date hardware.
> >>
> >> if possible I'd like to load the wikidata dataset* at some point to
> >> see where 57GB fits in terms of tdb. The wikidata team is currently
> >> looking at new solutions that can go beyond blazegraph. And I get the
> >> impression that they have not yet actively considered giving jena tdb
> >> a try.
> >>
> >> https://dumps.wikimedia.org/wikidatawiki/entities/
> >>
> >> On Fri, Jun 14, 2019 at 11:47 PM Martynas Jusevičius
> >> <[email protected]> wrote:
> >>
> >>> What about SSD disks, don't they make a difference?
> >>>
> >>> On Sat, Jun 15, 2019 at 12:36 AM Marco Neumann
> >>> <[email protected]> wrote:
> >>>>
> >>>> that did the trick Andy, very good. might be a good idea to add
> >>>> this to the distribution in jena-log4j.properties
> >>>>
> >>>> I am getting these numbers for a midsize dedicated server, very
> >>>> nice numbers indeed Andy. well done!
> >>>>
> >>>> 00:24:53 INFO loader :: Loader = LoaderPhased
> >>>> 00:24:53 INFO loader :: Start: ../../public_html/lotico.ttl.gz
> >>>> 00:24:55 INFO loader :: Add: 500,000 lotico.ttl.gz (Batch: 237,755 / Avg: 237,755)
> >>>> 00:24:56 INFO loader :: Add: 1,000,000 lotico.ttl.gz (Batch: 305,250 / Avg: 267,308)
> >>>> 00:24:58 INFO loader :: Add: 1,500,000 lotico.ttl.gz (Batch: 313,087 / Avg: 281,004)
> >>>> 00:25:00 INFO loader :: Add: 2,000,000 lotico.ttl.gz (Batch: 328,299 / Avg: 291,502)
> >>>> 00:25:01 INFO loader :: Add: 2,500,000 lotico.ttl.gz (Batch: 341,763 / Avg: 300,336)
> >>>> 00:25:03 INFO loader :: Add: 3,000,000 lotico.ttl.gz (Batch: 337,381 / Avg: 305,935)
> >>>> 00:25:04 INFO loader :: Add: 3,500,000 lotico.ttl.gz (Batch: 318,877 / Avg: 307,719)
> >>>> 00:25:06 INFO loader :: Add: 4,000,000 lotico.ttl.gz (Batch: 295,857 / Avg: 306,184)
> >>>> 00:25:07 INFO loader :: Add: 4,500,000 lotico.ttl.gz (Batch: 327,225 / Avg: 308,388)
> >>>> 00:25:09 INFO loader :: Add: 5,000,000 lotico.ttl.gz (Batch: 349,406 / Avg: 312,051)
> >>>> 00:25:09 INFO loader :: Elapsed: 16.02 seconds [2019/06/15 00:25:09 CEST]
> >>>> 00:25:11 INFO loader :: Add: 5,500,000 lotico.ttl.gz (Batch: 285,062 / Avg: 309,388)
> >>>> 00:25:13 INFO loader :: Add: 6,000,000 lotico.ttl.gz (Batch: 203,665 / Avg: 296,559)
> >>>> 00:25:16 INFO loader :: Add: 6,500,000 lotico.ttl.gz (Batch: 189,393 / Avg: 284,190)
> >>>>
> >>>> on another machine that sits in the Azure infrastructure somewhere
> >>>> tdbloader doesn't look as good, even with decent hardware it seems
> >>>> to die a slow death of memory exhaustion at 16GB. started off with
> >>>> 70kT/s and is now down to 17kT/s and still going.
> >>>>
> >>>> lesson learned: big iron and big memory is the way to go with Jena
> >>>> tdbloaders.
> >>>>
> >>>> On Fri, Jun 14, 2019 at 10:53 PM Andy Seaborne <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> These messages are logged (to logger
> >>>>> "org.apache.jena.tdb2.loader") - do you have log4j.properties in
> >>>>> the current working directory?
> >>>>>
> >>>>> Do you get any output?
> >>>>>
> >>>>> INFO Loader = LoaderParallel
> >>>>> INFO Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
> >>>>> INFO Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)
> >>>>> INFO Add: 1,000,000 bsbm-5m.nt.gz (Batch: 189,753 / Avg: 157,604)
> >>>>> INFO Add: 1,500,000 bsbm-5m.nt.gz (Batch: 205,676 / Avg: 170,920)
> >>>>> INFO Add: 2,000,000 bsbm-5m.nt.gz (Batch: 204,248 / Avg: 178,189)
> >>>>> INFO Add: 2,500,000 bsbm-5m.nt.gz (Batch: 202,101 / Avg: 182,508)
> >>>>> INFO Add: 3,000,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 186,173)
> >>>>> INFO Add: 3,500,000 bsbm-5m.nt.gz (Batch: 183,621 / Avg: 185,804)
> >>>>> INFO Add: 4,000,000 bsbm-5m.nt.gz (Batch: 151,423 / Avg: 180,676)
> >>>>> INFO Add: 4,500,000 bsbm-5m.nt.gz (Batch: 152,765 / Avg: 177,081)
> >>>>> INFO Add: 5,000,000 bsbm-5m.nt.gz (Batch: 158,881 / Avg: 175,076)
> >>>>> INFO Elapsed: 28.56 seconds [2019/06/14 22:51:37 BST]
> >>>>> INFO Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples in 28.63s (Avg: 174,644)
> >>>>> INFO Finish - index SPO
> >>>>> INFO Finish - index POS
> >>>>> INFO Finish - index OSP
> >>>>> INFO Time = 35.572 seconds : Triples = 5,000,599 : Rate = 140,577 /s
> >>>>>
> >>>>> There is a pause after the first "Finished:" - this is finished
> >>>>> data in, the index threads are still running and the pause comes
> >>>>> from flush to disk.
> >>>>>
> >>>>>      Andy
> >>>>>
> >>>>> On 14/06/2019 20:16, Marco Neumann wrote:
> >>>>>> let me fire up one of the big machines to see what I will get
> >>>>>> there. currently I have no info display during load with
> >>>>>> tdb2.tdbloader. if -v is specified I get some extra info but no
> >>>>>> load info.
> >>>>>>
> >>>>>> On Fri, Jun 14, 2019 at 8:03 PM Andy Seaborne <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> On 14/06/2019 18:13, Marco Neumann wrote:
> >>>>>>>> I am collecting jena loader benchmarks. if you have results
> >>>>>>>> please post them directly.
> >>>>>>>>
> >>>>>>>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
> >>>>>>>
> >>>>>>> tdb2.tdbloader has variations controlled by --loader.
> >>>>>>>
> >>>>>>> --loader=
> >>>>>>>     Loader to use: 'basic', 'phased' (default), 'sequential',
> >>>>>>>     'parallel' or 'light'
> >>>>>>>
> >>>>>>> "basic" is a super naive parser-add triple loop - it is used if
> >>>>>>> a loader can't cope with an already loaded database.
> >>>>>>>
> >>>>>>> "phased" is a balanced loader that does not saturate the
> >>>>>>> machine. Some parallelism.
> >>>>>>>
> >>>>>>> "sequential" is the tdbloader algorithm for TDB2, more for
> >>>>>>> reference.
> >>>>>>>
> >>>>>>> "parallel" is as much parallelism as it wants. (5 for triples,
> >>>>>>> more for quads)
> >>>>>>>
> >>>>>>> "light" is two threaded. Slightly lighter than "phased".
> >>>>>>>
> >>>>>>> See LoaderPlans.
> >>>>>>>
> >>>>>>>> On a linux machine I am using "time" to collect data.
> >>>>>>>>
> >>>>>>>> Is there a flag on tdb2.tdbloader to report time and triples
> >>>>>>>> per second?
> >>>>>>>>
> >>>>>>>> I have noticed that storage space use for tdbloader2 is
> >>>>>>>> significantly smaller on disk compared to tdbloader and
> >>>>>>>> tdb2.tdbloader. Is there a straightforward explanation here?
> >>>>
> >>>> --
> >>>> ---
> >>>> Marco Neumann
> >>>> KONA
> >>
> >> --
> >> ---
> >> Marco Neumann
> >> KONA

--
---
Marco Neumann
KONA
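
For anyone collecting loader benchmarks from the log output quoted in this thread, here is a minimal sketch of how the "Add: ... (Batch: ... / Avg: ...)" progress lines could be turned into numbers. This is not part of Jena; the names `Progress`, `parse_progress` and `to_int` are hypothetical helpers made up for illustration, and the sketch assumes each log entry sits on a single line (not re-wrapped by a mail client). The sample lines are copied from the bsbm-5m run above.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List

# Matches loader progress lines such as:
#   INFO Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)
# Assumption: this layout is inferred from the logs in this thread only.
PROGRESS = re.compile(
    r"Add:\s+([\d,]+)\s+(\S+)\s+\(Batch:\s+([\d,]+)\s+/\s+Avg:\s+([\d,]+)\)"
)


@dataclass
class Progress:
    tuples: int      # tuples added so far
    source: str      # file being loaded
    batch_rate: int  # tuples per second for the last batch
    avg_rate: int    # running average in tuples per second


def to_int(text: str) -> int:
    """Convert a number with thousands separators, e.g. '1,000,000'."""
    return int(text.replace(",", ""))


def parse_progress(lines: Iterable[str]) -> List[Progress]:
    """Return one Progress record per matching line; other lines are skipped."""
    records = []
    for line in lines:
        match = PROGRESS.search(line)
        if match:
            count, source, batch, avg = match.groups()
            records.append(
                Progress(to_int(count), source, to_int(batch), to_int(avg))
            )
    return records


sample = [
    "INFO Loader = LoaderParallel",
    "INFO Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)",
    "INFO Add: 1,000,000 bsbm-5m.nt.gz (Batch: 189,753 / Avg: 157,604)",
    "INFO Elapsed: 28.56 seconds [2019/06/14 22:51:37 BST]",
]

records = parse_progress(sample)
fastest = max(records, key=lambda r: r.batch_rate)
print(f"{len(records)} progress lines, fastest batch: {fastest.batch_rate:,} /s")
```

The same function also matches the timestamped variant ("00:24:55 INFO loader :: Add: ...") because the pattern only anchors on the "Add:" part of the line.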
