Yes, I'd say the local NVMe SSDs make the difference here. In my case, for the
US East and US East 2 zones, the VMs only show a premium SSD option. The
so-called ultra SSDs seem to be in high demand and are currently not available
in my profile. They also come at a very high estimated price point.

Which dataset did you use to run the load test above?
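
For comparing runs like the ones quoted below, the loader progress lines can
be summarized with a few lines of Python. This is only a sketch: it assumes
the raw log format shown in this thread ("Add: N file (Batch: X / Avg: Y)"),
and the names batch_rates and slowdown are made up for illustration, they are
not part of Jena. It expects the raw log, not a mail-quoted copy with ">"
markers. The sample lines are taken from the bsbm-5m output quoted below.

```python
import re

# Progress lines in the format shown in this thread, e.g.
#   INFO  Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)
# \s+ also tolerates a line break where a mail client wrapped the entry.
ADD_LINE = re.compile(
    r"Add:\s+([\d,]+)\s+\S+\s+\(Batch:\s+([\d,]+)\s+/\s+Avg:\s+([\d,]+)\)"
)


def to_int(text):
    """Convert a number with thousands separators, e.g. '134,770', to int."""
    return int(text.replace(",", ""))


def batch_rates(log_text):
    """Return (triples_so_far, batch_rate, avg_rate) for each progress line."""
    return [
        tuple(to_int(group) for group in match.groups())
        for match in ADD_LINE.finditer(log_text)
    ]


def slowdown(rates):
    """Ratio of the last batch rate to the peak batch rate (1.0 = no slowdown)."""
    batches = [batch for _, batch, _ in rates]
    return batches[-1] / max(batches)


sample = """
INFO  Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)
INFO  Add: 3,000,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 186,173)
INFO  Add: 5,000,000 bsbm-5m.nt.gz (Batch: 158,881 / Avg: 175,076)
"""

rates = batch_rates(sample)
for triples, batch, avg in rates:
    print(f"{triples:>12,}  batch={batch:>9,}  avg={avg:>9,}")
print(f"last batch / peak batch = {slowdown(rates):.2f}")
```

A ratio well below 1.0 late in a load would point at the kind of tail-off seen
on the Azure VM, as opposed to a run that holds its rate.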



On Sat, Jun 22, 2019 at 11:47 PM Andy Seaborne <[email protected]> wrote:

>
>
> On 20/06/2019 16:01, Marco Neumann wrote:
> > quick update here on loader performance. Did a modest (in terms of cost)
> > hardware upgrade of one of the dedicated data processors with a faster CPU
> > and faster NVMe SSD drive and was able to almost halve our load times. Very
> > satisfied with the HW upgrade and TDB2 loader performance. VMs don't seem
> > to work well for us in combination with TDB.
>
> My experience has been significant variation across different VM types.
> My assumption is the form of virtualization matters.
>
> I had access to an AWS i3.8xlarge for a short while which had local NVMe
> SSDs and got very good performance:
>
> 500m            TDB2    2,362s  39m 22s         218,460 TPS
> 1 billion       TDB2    5,164s  1h 26m 04s      200,100 TPS
>
> (this is a single graph dataset)
>
> i3 are "Storage optimized"
>
> The TDB2 loader is multithreaded and each thread is working on a
> different index, so the access patterns are jumping around all over the
> place, both because the non-primary index is, in effect at scale,
> randomly accessed, and because multiple indexes are updating at the same
> time.
>
>      Andy
>
> >
> > On Fri, Jun 14, 2019 at 11:56 PM Marco Neumann <[email protected]>
> > wrote:
> >
> >> absolutely it does, preferably NVMe SSD. tdbloaders are almost a showcase
> >> in themselves for good up-to-date hardware.
> >>
> >> if possible I'd like to load the wikidata dataset* at some point to see
> >> where 57GB fits in terms of tdb. The wikidata team is currently looking at
> >> new solutions that can go beyond blazegraph. And I get the impression that
> >> they have not yet actively considered giving jena tdb a try.
> >>
> >> https://dumps.wikimedia.org/wikidatawiki/entities/
> >>
> >>
> >> On Fri, Jun 14, 2019 at 11:47 PM Martynas Jusevičius <
> >> [email protected]> wrote:
> >>
> >>> What about SSD disks, don't they make a difference?
> >>>
> >>> On Sat, Jun 15, 2019 at 12:36 AM Marco Neumann <
> [email protected]>
> >>> wrote:
> >>>>
> >>>> that did the trick Andy, very good. It might be a good idea to add this
> >>>> to the distribution in jena-log4j.properties
> >>>>
> >>>> I am getting these numbers for a midsize dedicated server, very nice
> >>>> numbers indeed Andy. well done!
> >>>>
> >>>> 00:24:53 INFO  loader               :: Loader = LoaderPhased
> >>>> 00:24:53 INFO  loader               :: Start: ../../public_html/lotico.ttl.gz
> >>>> 00:24:55 INFO  loader               :: Add: 500,000 lotico.ttl.gz (Batch: 237,755 / Avg: 237,755)
> >>>> 00:24:56 INFO  loader               :: Add: 1,000,000 lotico.ttl.gz (Batch: 305,250 / Avg: 267,308)
> >>>> 00:24:58 INFO  loader               :: Add: 1,500,000 lotico.ttl.gz (Batch: 313,087 / Avg: 281,004)
> >>>> 00:25:00 INFO  loader               :: Add: 2,000,000 lotico.ttl.gz (Batch: 328,299 / Avg: 291,502)
> >>>> 00:25:01 INFO  loader               :: Add: 2,500,000 lotico.ttl.gz (Batch: 341,763 / Avg: 300,336)
> >>>> 00:25:03 INFO  loader               :: Add: 3,000,000 lotico.ttl.gz (Batch: 337,381 / Avg: 305,935)
> >>>> 00:25:04 INFO  loader               :: Add: 3,500,000 lotico.ttl.gz (Batch: 318,877 / Avg: 307,719)
> >>>> 00:25:06 INFO  loader               :: Add: 4,000,000 lotico.ttl.gz (Batch: 295,857 / Avg: 306,184)
> >>>> 00:25:07 INFO  loader               :: Add: 4,500,000 lotico.ttl.gz (Batch: 327,225 / Avg: 308,388)
> >>>> 00:25:09 INFO  loader               :: Add: 5,000,000 lotico.ttl.gz (Batch: 349,406 / Avg: 312,051)
> >>>> 00:25:09 INFO  loader               ::   Elapsed: 16.02 seconds [2019/06/15 00:25:09 CEST]
> >>>> 00:25:11 INFO  loader               :: Add: 5,500,000 lotico.ttl.gz (Batch: 285,062 / Avg: 309,388)
> >>>> 00:25:13 INFO  loader               :: Add: 6,000,000 lotico.ttl.gz (Batch: 203,665 / Avg: 296,559)
> >>>> 00:25:16 INFO  loader               :: Add: 6,500,000 lotico.ttl.gz (Batch: 189,393 / Avg: 284,190)
> >>>>
> >>>> On another machine that sits somewhere in the Azure infrastructure,
> >>>> tdbloader doesn't look as good. Even with decent hardware it seems to
> >>>> die a slow death of memory exhaustion at 16GB. It started off at 70kT/s
> >>>> and is now down to 17kT/s and still going.
> >>>>
> >>>> Lesson learned: big iron and big memory is the way to go with Jena
> >>>> tdbloaders.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Jun 14, 2019 at 10:53 PM Andy Seaborne <[email protected]>
> wrote:
> >>>>
> >>>>> These messages are logged (to logger "org.apache.jena.tdb2.loader") -
> >>>>> do you have log4j.properties in the current working directory?
> >>>>>
> >>>>> Do you get any output?
> >>>>>
> >>>>> INFO  Loader = LoaderParallel
> >>>>> INFO  Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
> >>>>> INFO  Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)
> >>>>> INFO  Add: 1,000,000 bsbm-5m.nt.gz (Batch: 189,753 / Avg: 157,604)
> >>>>> INFO  Add: 1,500,000 bsbm-5m.nt.gz (Batch: 205,676 / Avg: 170,920)
> >>>>> INFO  Add: 2,000,000 bsbm-5m.nt.gz (Batch: 204,248 / Avg: 178,189)
> >>>>> INFO  Add: 2,500,000 bsbm-5m.nt.gz (Batch: 202,101 / Avg: 182,508)
> >>>>> INFO  Add: 3,000,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 186,173)
> >>>>> INFO  Add: 3,500,000 bsbm-5m.nt.gz (Batch: 183,621 / Avg: 185,804)
> >>>>> INFO  Add: 4,000,000 bsbm-5m.nt.gz (Batch: 151,423 / Avg: 180,676)
> >>>>> INFO  Add: 4,500,000 bsbm-5m.nt.gz (Batch: 152,765 / Avg: 177,081)
> >>>>> INFO  Add: 5,000,000 bsbm-5m.nt.gz (Batch: 158,881 / Avg: 175,076)
> >>>>> INFO    Elapsed: 28.56 seconds [2019/06/14 22:51:37 BST]
> >>>>> INFO  Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples in 28.63s (Avg: 174,644)
> >>>>> INFO  Finish - index SPO
> >>>>> INFO  Finish - index POS
> >>>>> INFO  Finish - index OSP
> >>>>> INFO  Time = 35.572 seconds : Triples = 5,000,599 : Rate = 140,577 /s
> >>>>>
> >>>>>
> >>>>> There is a pause after the first "Finished:" - at that point the data
> >>>>> has finished coming in, but the index threads are still running and
> >>>>> the pause comes from the flush to disk.
> >>>>>
> >>>>>       Andy
> >>>>>
> >>>>> On 14/06/2019 20:16, Marco Neumann wrote:
> >>>>>> let me fire up one of the big machines to see what I will get there.
> >>>>>> currently I have no info display during load with tdb2.tdbloader.
> >>>>>> if -v is specified I get some extra info but no load info.
> >>>>>>
> >>>>>> On Fri, Jun 14, 2019 at 8:03 PM Andy Seaborne <[email protected]>
> >>> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 14/06/2019 18:13, Marco Neumann wrote:
> >>>>>>>> I am collecting jena loader benchmarks. if you have results please
> >>>>>>>> post them directly.
> >>>>>>>>
> >>>>>>>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
> >>>>>>>
> >>>>>>> tdb2.tdbloader has variations controlled by --loader.
> >>>>>>>
> >>>>>>> --loader=
> >>>>>>> Loader to use: 'basic', 'phased' (default), 'sequential', 'parallel'
> >>>>>>> or 'light'
> >>>>>>>
> >>>>>>> "basic" is a super naive parser-add triple loop - it is used if a
> >>>>>>> loader can't cope with an already loaded database.
> >>>>>>>
> >>>>>>> "phased" is a balanced loader that does not saturate the machine.
> >>>>>>> Some parallelism.
> >>>>>>>
> >>>>>>> "sequential" is the tdbloader algorithm for TDB2, more for reference.
> >>>>>>>
> >>>>>>> "parallel" is as much parallelism as it wants. (5 for triples, more
> >>>>>>> for quads)
> >>>>>>>
> >>>>>>> "light" is two threaded. Slightly lighter than "phased".
> >>>>>>>
> >>>>>>> See LoaderPlans.
> >>>>>>>
> >>>>>>>> On a linux machine I am using "time" to collect data.
> >>>>>>>>
> >>>>>>>> Is there a flag on tdb2.tdbloader to report time and triples per
> >>>>>>>> second?
> >>>>>>>>
> >>>>>>>> I have noticed that storage space use for tdbloader2 is
> >>>>>>>> significantly smaller on disk compared to tdbloader and
> >>>>>>>> tdb2.tdbloader. Is there a straightforward explanation here?
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>>
> >>>> ---
> >>>> Marco Neumann
> >>>> KONA
> >>>
> >>
> >>
> >> --
> >>
> >>
> >> ---
> >> Marco Neumann
> >> KONA
> >>
> >>
> >
>


-- 


---
Marco Neumann
KONA
