Yes, I have the jena-log4j.properties file within the Jena repo and the tdb2.tdbloader script under bin in the same repo.
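For reference, a minimal log4j 1.x configuration along these lines sends the loader's INFO messages to the console. This is a sketch, not the exact jena-log4j.properties shipped with Jena; the pattern layout is illustrative:

```properties
# Console appender for all logging (log4j 1.x syntax).
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{HH:mm:ss} %-5p %-20c{1} :: %m%n
# Make sure Jena itself logs at INFO so loader progress lines appear.
log4j.logger.org.apache.jena=INFO
```

Note that log4j only picks this up if it can find the file, e.g. a log4j.properties in the current directory or a -Dlog4j.configuration= pointing at it.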
For me, when I run tdb2.tdbloader --loc=../../db ../../file.nt, I see no logs. The process starts consuming cores and RAM, but there's nothing on the console. When the loading is finished, the cursor moves on to the next line.

On Thu, 28 Nov, 2019, 3:48 PM Andy Seaborne, <[email protected]> wrote:

> On 28/11/2019 05:44, Amandeep Srivastava wrote:
> > Thanks Andy, setting it that way worked.
> >
> > Also, can we turn on the verbose logging in tdb2.tdbloader like we have
> > in tdbloader2?
> >
> > Basically, giving an output of how many triples it's loading and how
> > much time has elapsed so far.
>
> It does that by default for the data phase. The report step size is
> longer (500k) than TDB1.
>
> The index phase is more parallel and not all mods report progress.
>
> What are you seeing?
> (Do you have a log4j.properties in the current directory?)
>
>     Andy
>
> tdb2.tdbloader --loader=parallel --loc DB2 ~/Datasets/BSBM/bsbm-5m.nt.gz
>
> INFO Loader = LoaderParallel
> INFO Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
> INFO Add: 500,000 bsbm-5m.nt.gz (Batch: 124,875 / Avg: 124,875)
> INFO Add: 1,000,000 bsbm-5m.nt.gz (Batch: 171,174 / Avg: 144,404)
> INFO Add: 1,500,000 bsbm-5m.nt.gz (Batch: 190,403 / Avg: 157,051)
> INFO Add: 2,000,000 bsbm-5m.nt.gz (Batch: 200,883 / Avg: 166,112)
> INFO Add: 2,500,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 172,938)
> INFO Add: 3,000,000 bsbm-5m.nt.gz (Batch: 205,170 / Avg: 177,588)
> INFO Add: 3,500,000 bsbm-5m.nt.gz (Batch: 198,255 / Avg: 180,272)
> INFO Add: 4,000,000 bsbm-5m.nt.gz (Batch: 147,449 / Avg: 175,392)
> INFO Add: 4,500,000 bsbm-5m.nt.gz (Batch: 159,642 / Avg: 173,490)
> INFO Add: 5,000,000 bsbm-5m.nt.gz (Batch: 166,777 / Avg: 172,795)
> INFO Elapsed: 28.94 seconds [2019/11/28 10:17:55 GMT]
> INFO Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples in 28.96s (Avg: 172,690)
> INFO Finish - index POS
> INFO Finish - index SPO
> INFO Finish - index OSP
> INFO Time = 39.180 seconds : Triples = 5,000,599 : Rate = 127,631 /s
>
> though the default may be faster on this small dataset
>
> > On Thu, 14 Nov, 2019, 2:20 PM Andy Seaborne, <[email protected]> wrote:
> >
> >> Firstly - just to be clear - tdb.tdbloader2 is (confusingly) for TDB1.
> >> Old name, before TDB2 came along, so we're a bit stuck with it.
> >>
> >> tdbloader2 respects the $TMPDIR environment variable.
> >>
> >> Or set the SORT_ARGS environment variable with --temporary-directory=
> >> (or -T). See tdbloader2 --help.
> >>
> >>     Andy
> >>
> >> On 14/11/2019 02:54, Amandeep Srivastava wrote:
> >>> I was trying to test the performance of tdb.tdbloader2 by creating a
> >>> TDB database. The loader failed at the sort SPO step. The failure
> >>> seems to occur because of insufficient storage in the /tmp folder.
> >>> Can we point TDB to use another folder as /tmp?
> >>>
> >>> Error log:
> >>> sort: write failed: /tmp/sortxRql3B: No space left on device
> >>>
> >>> On Wed, 13 Nov, 2019, 5:37 PM Amandeep Srivastava,
> >>> <[email protected]> wrote:
> >>>
> >>>> Thanks, Andy, for the detailed explanation :)
> >>>>
> >>>> On Wed, 13 Nov, 2019, 4:52 PM Andy Seaborne, <[email protected]> wrote:
> >>>>
> >>>>> On 12/11/2019 15:53, Amandeep Srivastava wrote:
> >>>>>> Thanks for the heads up, Dan. Will go and check the archives.
> >>>>>>
> >>>>>> I think I should get how to decide between TDB1 and TDB2 in the
> >>>>>> archives itself.
> >>>>>
> >>>>> For large bulk loads, the TDB2 loader is faster, if you use
> >>>>> --loader=parallel (NB it can take over your machine's I/O!)
> >>>>>
> >>>>> See tdb2.tdbloader --help for the names of the plans that are
> >>>>> built in.
> >>>>>
> >>>>> The only way to know which is best is to try, but
> >>>>>
> >>>>> the order of threading used is:
> >>>>>
> >>>>>     sequential < light < phased < parallel
> >>>>>
> >>>>> (it does not always mean more threads is faster).
> >>>>>
> >>>>> sequential is roughly the same as the TDB1 bulk loader.
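To make the /tmp advice concrete: tdbloader2 (TDB1) shells out to sort(1), which spills to /tmp by default, and either of the two environment variables mentioned redirects that. A minimal sketch; /data/tmp is a placeholder for any volume with enough free space:

```shell
# Point sort(1)'s spill files at a larger scratch directory.
TMPDIR=/data/tmp        # placeholder path; substitute a roomy volume
export TMPDIR

# Equivalent, via sort's own option (see tdbloader2 --help):
SORT_ARGS="--temporary-directory=/data/tmp"
export SORT_ARGS
```

After setting either variable, rerun tdbloader2 in the same shell.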
> >>>>>
> >>>>> parallel usually wins as data gets larger (several 100m) if the
> >>>>> machine has the I/O to handle it.
> >>>>>
> >>>>>     Andy
> >>>>>
> >>>>>> On Tue, 12 Nov, 2019, 8:59 PM Dan Pritts, <[email protected]> wrote:
> >>>>>>
> >>>>>>> Look through the list archives for posts from Andy describing the
> >>>>>>> differences between TDB1 and TDB2. They have different
> >>>>>>> optimizations; I don't recall the differences.
> >>>>>>>
> >>>>>>> thanks
> >>>>>>> danno
> >>>>>>>
> >>>>>>> Dan Pritts
> >>>>>>> ICPSR Computing and Network Services
> >>>>>>>
> >>>>>>> On 12 Nov 2019, at 7:29, Amandeep Srivastava wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I'm trying to create a TDB database from Wikidata's official RDF
> >>>>>>>> dump to read the data using the Fuseki service. I need to make a
> >>>>>>>> few queries for my personal project, running which the online
> >>>>>>>> service times out.
> >>>>>>>>
> >>>>>>>> I have a 12-core machine with 36 GB memory.
> >>>>>>>>
> >>>>>>>> Can you please advise on the best way to create the database?
> >>>>>>>> Since the dump is huge, I cannot try all the approaches.
> >>>>>>>> Besides, I'm not sure if the tdbloader function works in a
> >>>>>>>> similar way on data of different sizes.
> >>>>>>>>
> >>>>>>>> Questions:
> >>>>>>>>
> >>>>>>>> 1. Which one would be better to use - tdb.tdbloader2 (TDB1) or
> >>>>>>>> tdb2.tdbloader (TDB2) for creating the database, and why? Any
> >>>>>>>> specific configurations that I should be aware of?
> >>>>>>>>
> >>>>>>>> 2. I'm running a job currently using tdb.tdbloader2, but it is
> >>>>>>>> using just a single core. Also, its loading speed is decreasing
> >>>>>>>> slowly. It started at an avg of 120k tuples and is currently at
> >>>>>>>> 80k tuples. Can you advise how I can utilize all the cores of my
> >>>>>>>> machine and maintain the loading speed at the same time?
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Aman
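Pulling the thread's advice together, a TDB2 bulk load with the parallel plan looks roughly like the sketch below. The database location, dump filename, and heap size are placeholders, not values from the thread:

```shell
# Sketch: TDB2 bulk load with the parallel loader plan.
# DB and DATA are hypothetical names, not from the thread.
DB=wikidata-db
DATA=latest-all.nt.gz
JVM_ARGS="-Xmx8G"       # Jena's bin scripts honour JVM_ARGS; heap is a guess
export JVM_ARGS

if command -v tdb2.tdbloader >/dev/null 2>&1; then
    # --loader=parallel uses all cores and can saturate the machine's I/O.
    tdb2.tdbloader --loader=parallel --loc "$DB" "$DATA"
else
    echo "tdb2.tdbloader not on PATH; would run:" \
         "tdb2.tdbloader --loader=parallel --loc $DB $DATA"
fi
```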
