Attaching command-line details for reference. Also, after creating the database, the loader isn't removing the tdb.lock file, which prevents the Fuseki server from reading the database.
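For now I'm clearing the stale lock by hand before starting Fuseki - a rough sketch, assuming the loader process has genuinely exited and that the Fuseki distribution is unpacked next to the Jena one (`../test` is my database directory; the Fuseki path is an assumption):

```shell
# tdb.lock appears to hold the PID of the process that owns the
# database - check whether that process is still alive first:
cat ../test/tdb.lock

# Only if no Jena process is running any more is the lock stale
# and safe to remove:
rm ../test/tdb.lock

# Then Fuseki can open the TDB2 database:
../apache-jena-fuseki-3.13.1/fuseki-server --tdb2 --loc=../test /ds
```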
aman@DESKTOP-ML2LO1I:~$ cd apache-jena-3.13.1/
aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls
LICENSE  NOTICE  README  bat  bin  jena-log4j.properties  lib  lib-src  src-examples  test
aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ./bin/tdb2.tdbloader --loader=parallel --loc=../test ../bsbm-generated-dataset.nt
aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls ../test/
Data-0001  tdb.lock
aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls ../test/Data-0001/
GOSP.bpt  GPOS.idn  GSPO.dat  OSPG.bpt  POS.idn   SPO.dat   journal.jrnl    nodes.idn          prefixes.idn
GOSP.dat  GPU.bpt   GSPO.idn  OSPG.dat  POSG.bpt  SPO.idn   nodes-data.bdf  prefixes-data.bdf  tdb.lock
GOSP.idn  GPU.dat   OSP.bpt   OSPG.idn  POSG.dat  SPOG.bpt  nodes-data.obj  prefixes-data.obj
GPOS.bpt  GPU.idn   OSP.dat   POS.bpt   POSG.idn  SPOG.dat  nodes.bpt       prefixes.bpt
GPOS.dat  GSPO.bpt  OSP.idn   POS.dat   SPO.bpt   SPOG.idn  nodes.dat       prefixes.dat

Thanks,
Aman

On Thu, Nov 28, 2019 at 3:59 PM Amandeep Srivastava <[email protected]> wrote:

> Correction, using
>
> tdb2.tdbloader --loader=parallel --loc=../../db ../../file.nt
>
> On Thu, 28 Nov, 2019, 3:56 PM Amandeep Srivastava <[email protected]> wrote:
>
>> Yes, I have the jena-log4j.properties file within the Jena directory
>> and the tdb2.tdbloader script under bin in the same directory.
>>
>> For me, when I run tdb2.tdbloader --loc=../../db ../../file.nt, I see
>> no logs. The process starts consuming cores and RAM but there's
>> nothing on the console. When the loading is finished, the cursor moves
>> on to the next line.
>>
>> On Thu, 28 Nov, 2019, 3:48 PM Andy Seaborne <[email protected]> wrote:
>>
>>> On 28/11/2019 05:44, Amandeep Srivastava wrote:
>>> > Thanks Andy, setting it that way worked.
>>> >
>>> > Also, can we turn on verbose logging in tdb2.tdbloader like we have
>>> > in tdbloader2?
>>> >
>>> > Basically, giving an output of how many triples it's loading and
>>> > how much time has elapsed so far.
>>>
>>> It does that by default for the data phase. The report step size
>>> (500k) is longer than TDB1's.
>>>
>>> The index phase is more parallel and not all of it reports progress.
>>>
>>> What are you seeing?
>>> (Do you have a log4j.properties in the current directory?)
>>>
>>>     Andy
>>>
>>> tdb2.tdbloader --loader=parallel --loc DB2 ~/Datasets/BSBM/bsbm-5m.nt.gz
>>>
>>> INFO  Loader = LoaderParallel
>>> INFO  Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
>>> INFO  Add: 500,000 bsbm-5m.nt.gz (Batch: 124,875 / Avg: 124,875)
>>> INFO  Add: 1,000,000 bsbm-5m.nt.gz (Batch: 171,174 / Avg: 144,404)
>>> INFO  Add: 1,500,000 bsbm-5m.nt.gz (Batch: 190,403 / Avg: 157,051)
>>> INFO  Add: 2,000,000 bsbm-5m.nt.gz (Batch: 200,883 / Avg: 166,112)
>>> INFO  Add: 2,500,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 172,938)
>>> INFO  Add: 3,000,000 bsbm-5m.nt.gz (Batch: 205,170 / Avg: 177,588)
>>> INFO  Add: 3,500,000 bsbm-5m.nt.gz (Batch: 198,255 / Avg: 180,272)
>>> INFO  Add: 4,000,000 bsbm-5m.nt.gz (Batch: 147,449 / Avg: 175,392)
>>> INFO  Add: 4,500,000 bsbm-5m.nt.gz (Batch: 159,642 / Avg: 173,490)
>>> INFO  Add: 5,000,000 bsbm-5m.nt.gz (Batch: 166,777 / Avg: 172,795)
>>> INFO  Elapsed: 28.94 seconds [2019/11/28 10:17:55 GMT]
>>> INFO  Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599
>>> tuples in 28.96s (Avg: 172,690)
>>> INFO  Finish - index POS
>>> INFO  Finish - index SPO
>>> INFO  Finish - index OSP
>>> INFO  Time = 39.180 seconds : Triples = 5,000,599 : Rate = 127,631 /s
>>>
>>> though the default may be faster on this small dataset
>>>
>>> > On Thu, 14 Nov, 2019, 2:20 PM Andy Seaborne <[email protected]> wrote:
>>> >
>>> >> Firstly - just to be clear - tdb.tdbloader2 is (confusingly) for
>>> >> TDB1. It's an old name from before TDB2 came along, so we're a bit
>>> >> stuck with it.
>>> >>
>>> >> tdbloader2 respects the $TMPDIR environment variable.
>>> >>
>>> >> Or set the SORT_ARGS environment variable with
>>> >> --temporary-directory= (or -T). See tdbloader2 --help.
>>> >>
>>> >>     Andy
>>> >>
>>> >> On 14/11/2019 02:54, Amandeep Srivastava wrote:
>>> >>> I was trying to test the performance of tdb.tdbloader2 by creating
>>> >>> a TDB database. The loader failed at the sort SPO step. The
>>> >>> failure seems to occur because of insufficient storage in the /tmp
>>> >>> folder. Can we point TDB to use another folder as /tmp?
>>> >>>
>>> >>> Error log:
>>> >>> sort: write failed: /tmp/sortxRql3B: No space left on device
>>> >>>
>>> >>> On Wed, 13 Nov, 2019, 5:37 PM Amandeep Srivastava <[email protected]> wrote:
>>> >>>
>>> >>>> Thanks, Andy, for the detailed explanation :)
>>> >>>>
>>> >>>> On Wed, 13 Nov, 2019, 4:52 PM Andy Seaborne <[email protected]> wrote:
>>> >>>>
>>> >>>>> On 12/11/2019 15:53, Amandeep Srivastava wrote:
>>> >>>>>> Thanks for the heads up, Dan. Will go and check the archives.
>>> >>>>>>
>>> >>>>>> I think I should find how to decide between TDB1 and TDB2 in
>>> >>>>>> the archives themselves.
>>> >>>>>
>>> >>>>> For large bulk loads, the TDB2 loader is faster if you use
>>> >>>>> --loader=parallel (NB it can take over your machine's I/O!).
>>> >>>>>
>>> >>>>> See tdb2.tdbloader --help for the names of the built-in loader
>>> >>>>> plans.
>>> >>>>>
>>> >>>>> The only way to know which is best is to try, but:
>>> >>>>>
>>> >>>>> The order of threading used is:
>>> >>>>>
>>> >>>>>     sequential < light < phased < parallel
>>> >>>>>
>>> >>>>> (more threads does not always mean faster).
>>> >>>>>
>>> >>>>> sequential is roughly the same as the TDB1 bulk loader.
>>> >>>>>
>>> >>>>> parallel usually wins as the data gets larger (several hundred
>>> >>>>> million triples) if the machine has the I/O to handle it.
>>> >>>>>
>>> >>>>>     Andy
>>> >>>>>
>>> >>>>>> On Tue, 12 Nov, 2019, 8:59 PM Dan Pritts <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>>> Look through the list archives for posts from Andy describing
>>> >>>>>>> the differences between TDB1 and TDB2. They have different
>>> >>>>>>> optimizations; I don't recall the details.
>>> >>>>>>>
>>> >>>>>>> thanks
>>> >>>>>>> danno
>>> >>>>>>>
>>> >>>>>>> Dan Pritts
>>> >>>>>>> ICPSR Computing and Network Services
>>> >>>>>>>
>>> >>>>>>> On 12 Nov 2019, at 7:29, Amandeep Srivastava wrote:
>>> >>>>>>>
>>> >>>>>>>> Hi,
>>> >>>>>>>>
>>> >>>>>>>> I'm trying to create a TDB database from Wikidata's official
>>> >>>>>>>> RDF dump so that I can read the data using the Fuseki
>>> >>>>>>>> service. I need to make a few queries for my personal project
>>> >>>>>>>> which time out on the online service.
>>> >>>>>>>>
>>> >>>>>>>> I have a 12-core machine with 36 GB of memory.
>>> >>>>>>>>
>>> >>>>>>>> Can you please advise on the best way of creating the
>>> >>>>>>>> database? Since the dump is huge, I cannot try all the
>>> >>>>>>>> approaches. Besides, I'm not sure whether the loaders behave
>>> >>>>>>>> the same way on data of different sizes.
>>> >>>>>>>>
>>> >>>>>>>> Questions:
>>> >>>>>>>>
>>> >>>>>>>> 1. Which one would be better for creating the database -
>>> >>>>>>>> tdb.tdbloader2 (TDB1) or tdb2.tdbloader (TDB2) - and why? Are
>>> >>>>>>>> there any specific configurations I should be aware of?
>>> >>>>>>>>
>>> >>>>>>>> 2. I'm currently running a job using tdb.tdbloader2, but it
>>> >>>>>>>> is using just a single core. Also, its loading speed is
>>> >>>>>>>> slowly decreasing: it started at an average of 120k tuples/s
>>> >>>>>>>> and is currently at 80k tuples/s. Can you advise how I can
>>> >>>>>>>> utilize all the cores of my machine and maintain the loading
>>> >>>>>>>> speed at the same time?
>>> >>>>>>>>
>>> >>>>>>>> Regards,
>>> >>>>>>>> Aman

--
Regards,
Amandeep Srivastava
Final Year, Bachelor of Technology,
Computer Science and Engineering Department,
Indian Institute of Technology (ISM), Dhanbad.
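For anyone hitting the "/tmp ... No space left on device" failure above, a minimal sketch of the two workarounds Andy describes for tdbloader2 (/data/tmp is a placeholder for any directory on a disk with enough free space):

```shell
# tdbloader2 (TDB1) shells out to sort(1), which spills its temporary
# files into /tmp by default. Point it at a larger disk instead:
export TMPDIR=/data/tmp

# Alternatively, pass options straight through to sort via SORT_ARGS:
export SORT_ARGS="--temporary-directory=/data/tmp"

./bin/tdbloader2 --loc ../db ../bsbm-generated-dataset.nt
```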
