Attaching command-line details for reference. Also, after creating the database, the loader isn't removing the tdb.lock file, which prevents the Fuseki server from reading the database.
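For now I'm clearing the stale lock by hand before starting Fuseki - a rough sketch, assuming the loader process has genuinely exited and that the Fuseki distribution is unpacked next to the Jena one (`../test` is my database directory; the Fuseki path is an assumption):

```shell
# tdb.lock appears to hold the PID of the process that owns the
# database - check whether that process is still alive first:
cat ../test/tdb.lock

# Only if no Jena process is running any more is the lock stale
# and safe to remove:
rm ../test/tdb.lock

# Then Fuseki can open the TDB2 database:
../apache-jena-fuseki-3.13.1/fuseki-server --tdb2 --loc=../test /ds
```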
aman@DESKTOP-ML2LO1I:~$ cd apache-jena-3.13.1/
aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls
LICENSE  NOTICE  README  bat  bin  jena-log4j.properties  lib  lib-src  src-examples  test
aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ./bin/tdb2.tdbloader --loader=parallel --loc=../test ../bsbm-generated-dataset.nt
aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls ../test/
Data-0001  tdb.lock
aman@DESKTOP-ML2LO1I:~/apache-jena-3.13.1$ ls ../test/Data-0001/
GOSP.bpt  GPOS.idn  GSPO.dat  OSPG.bpt  POS.idn   SPO.dat   journal.jrnl    nodes.idn          prefixes.idn
GOSP.dat  GPU.bpt   GSPO.idn  OSPG.dat  POSG.bpt  SPO.idn   nodes-data.bdf  prefixes-data.bdf  tdb.lock
GOSP.idn  GPU.dat   OSP.bpt   OSPG.idn  POSG.dat  SPOG.bpt  nodes-data.obj  prefixes-data.obj
GPOS.bpt  GPU.idn   OSP.dat   POS.bpt   POSG.idn  SPOG.dat  nodes.bpt       prefixes.bpt
GPOS.dat  GSPO.bpt  OSP.idn   POS.dat   SPO.bpt   SPOG.idn  nodes.dat       prefixes.dat

Thanks,
Aman

On Thu, Nov 28, 2019 at 3:59 PM Amandeep Srivastava <[email protected]> wrote:

> Correction, using
>
> tdb2.tdbloader --loader=parallel --loc=../../db ../../file.nt
>
> On Thu, 28 Nov, 2019, 3:56 PM Amandeep Srivastava <[email protected]> wrote:
>
>> Yes, I have the jena-log4j.properties file within the Jena directory
>> and the tdb2.tdbloader script under bin in the same directory.
>>
>> For me, when I run tdb2.tdbloader --loc=../../db ../../file.nt, I see
>> no logs. The process starts consuming cores and RAM but there's
>> nothing on the console. When the loading is finished, the cursor moves
>> on to the next line.
>>
>> On Thu, 28 Nov, 2019, 3:48 PM Andy Seaborne <[email protected]> wrote:
>>
>>> On 28/11/2019 05:44, Amandeep Srivastava wrote:
>>> > Thanks Andy, setting it that way worked.
>>> >
>>> > Also, can we turn on verbose logging in tdb2.tdbloader like we have
>>> > in tdbloader2?
>>> >
>>> > Basically, giving an output of how many triples it's loading and
>>> > how much time has elapsed so far.
>>>
>>> It does that by default for the data phase. The report step size
>>> (500k) is longer than TDB1's.
>>>
>>> The index phase is more parallel and not all of it reports progress.
>>>
>>> What are you seeing?
>>> (Do you have a log4j.properties in the current directory?)
>>>
>>>     Andy
>>>
>>> tdb2.tdbloader --loader=parallel --loc DB2 ~/Datasets/BSBM/bsbm-5m.nt.gz
>>>
>>> INFO  Loader = LoaderParallel
>>> INFO  Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
>>> INFO  Add: 500,000 bsbm-5m.nt.gz (Batch: 124,875 / Avg: 124,875)
>>> INFO  Add: 1,000,000 bsbm-5m.nt.gz (Batch: 171,174 / Avg: 144,404)
>>> INFO  Add: 1,500,000 bsbm-5m.nt.gz (Batch: 190,403 / Avg: 157,051)
>>> INFO  Add: 2,000,000 bsbm-5m.nt.gz (Batch: 200,883 / Avg: 166,112)
>>> INFO  Add: 2,500,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 172,938)
>>> INFO  Add: 3,000,000 bsbm-5m.nt.gz (Batch: 205,170 / Avg: 177,588)
>>> INFO  Add: 3,500,000 bsbm-5m.nt.gz (Batch: 198,255 / Avg: 180,272)
>>> INFO  Add: 4,000,000 bsbm-5m.nt.gz (Batch: 147,449 / Avg: 175,392)
>>> INFO  Add: 4,500,000 bsbm-5m.nt.gz (Batch: 159,642 / Avg: 173,490)
>>> INFO  Add: 5,000,000 bsbm-5m.nt.gz (Batch: 166,777 / Avg: 172,795)
>>> INFO  Elapsed: 28.94 seconds [2019/11/28 10:17:55 GMT]
>>> INFO  Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599
>>> tuples in 28.96s (Avg: 172,690)
>>> INFO  Finish - index POS
>>> INFO  Finish - index SPO
>>> INFO  Finish - index OSP
>>> INFO  Time = 39.180 seconds : Triples = 5,000,599 : Rate = 127,631 /s
>>>
>>> though the default may be faster on this small dataset
>>>
>>> > On Thu, 14 Nov, 2019, 2:20 PM Andy Seaborne <[email protected]> wrote:
>>> >
>>> >> Firstly - just to be clear - tdb.tdbloader2 is (confusingly) for
>>> >> TDB1. It's an old name from before TDB2 came along, so we're a bit
>>> >> stuck with it.
>>> >>
>>> >> tdbloader2 respects the $TMPDIR environment variable.
>>> >>
>>> >> Or set the SORT_ARGS environment variable with
>>> >> --temporary-directory= (or -T). See tdbloader2 --help.
>>> >>
>>> >>     Andy
>>> >>
>>> >> On 14/11/2019 02:54, Amandeep Srivastava wrote:
>>> >>> I was trying to test the performance of tdb.tdbloader2 by creating
>>> >>> a TDB database. The loader failed at the sort SPO step. The
>>> >>> failure seems to occur because of insufficient storage in the /tmp
>>> >>> folder. Can we point TDB to use another folder as /tmp?
>>> >>>
>>> >>> Error log:
>>> >>> sort: write failed: /tmp/sortxRql3B: No space left on device
>>> >>>
>>> >>> On Wed, 13 Nov, 2019, 5:37 PM Amandeep Srivastava <[email protected]> wrote:
>>> >>>
>>> >>>> Thanks, Andy, for the detailed explanation :)
>>> >>>>
>>> >>>> On Wed, 13 Nov, 2019, 4:52 PM Andy Seaborne <[email protected]> wrote:
>>> >>>>
>>> >>>>> On 12/11/2019 15:53, Amandeep Srivastava wrote:
>>> >>>>>> Thanks for the heads up, Dan. Will go and check the archives.
>>> >>>>>>
>>> >>>>>> I think I should find how to decide between TDB1 and TDB2 in
>>> >>>>>> the archives themselves.
>>> >>>>>
>>> >>>>> For large bulk loads, the TDB2 loader is faster if you use
>>> >>>>> --loader=parallel (NB it can take over your machine's I/O!).
>>> >>>>>
>>> >>>>> See tdb2.tdbloader --help for the names of the built-in loader
>>> >>>>> plans.
>>> >>>>>
>>> >>>>> The only way to know which is best is to try, but:
>>> >>>>>
>>> >>>>> The order of threading used is:
>>> >>>>>
>>> >>>>>     sequential < light < phased < parallel
>>> >>>>>
>>> >>>>> (more threads does not always mean faster).
>>> >>>>>
>>> >>>>> sequential is roughly the same as the TDB1 bulk loader.
>>> >>>>>
>>> >>>>> parallel usually wins as the data gets larger (several hundred
>>> >>>>> million triples) if the machine has the I/O to handle it.
>>> >>>>>
>>> >>>>>     Andy
>>> >>>>>
>>> >>>>>> On Tue, 12 Nov, 2019, 8:59 PM Dan Pritts <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>>> Look through the list archives for posts from Andy describing
>>> >>>>>>> the differences between TDB1 and TDB2. They have different
>>> >>>>>>> optimizations; I don't recall the details.
>>> >>>>>>>
>>> >>>>>>> thanks
>>> >>>>>>> danno
>>> >>>>>>>
>>> >>>>>>> Dan Pritts
>>> >>>>>>> ICPSR Computing and Network Services
>>> >>>>>>>
>>> >>>>>>> On 12 Nov 2019, at 7:29, Amandeep Srivastava wrote:
>>> >>>>>>>
>>> >>>>>>>> Hi,
>>> >>>>>>>>
>>> >>>>>>>> I'm trying to create a TDB database from Wikidata's official
>>> >>>>>>>> RDF dump so that I can read the data using the Fuseki
>>> >>>>>>>> service. I need to make a few queries for my personal project
>>> >>>>>>>> which time out on the online service.
>>> >>>>>>>>
>>> >>>>>>>> I have a 12-core machine with 36 GB of memory.
>>> >>>>>>>>
>>> >>>>>>>> Can you please advise on the best way of creating the
>>> >>>>>>>> database? Since the dump is huge, I cannot try all the
>>> >>>>>>>> approaches. Besides, I'm not sure whether the loaders behave
>>> >>>>>>>> the same way on data of different sizes.
>>> >>>>>>>>
>>> >>>>>>>> Questions:
>>> >>>>>>>>
>>> >>>>>>>> 1. Which one would be better for creating the database -
>>> >>>>>>>> tdb.tdbloader2 (TDB1) or tdb2.tdbloader (TDB2) - and why? Are
>>> >>>>>>>> there any specific configurations I should be aware of?
>>> >>>>>>>>
>>> >>>>>>>> 2. I'm currently running a job using tdb.tdbloader2, but it
>>> >>>>>>>> is using just a single core. Also, its loading speed is
>>> >>>>>>>> slowly decreasing: it started at an average of 120k tuples/s
>>> >>>>>>>> and is currently at 80k tuples/s. Can you advise how I can
>>> >>>>>>>> utilize all the cores of my machine and maintain the loading
>>> >>>>>>>> speed at the same time?
>>> >>>>>>>>
>>> >>>>>>>> Regards,
>>> >>>>>>>> Aman

--
Regards,
Amandeep Srivastava
Final Year, Bachelor of Technology,
Computer Science and Engineering Department,
Indian Institute of Technology (ISM), Dhanbad.
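For anyone hitting the "/tmp ... No space left on device" failure above, a minimal sketch of the two workarounds Andy describes for tdbloader2 (/data/tmp is a placeholder for any directory on a disk with enough free space):

```shell
# tdbloader2 (TDB1) shells out to sort(1), which spills its temporary
# files into /tmp by default. Point it at a larger disk instead:
export TMPDIR=/data/tmp

# Alternatively, pass options straight through to sort via SORT_ARGS:
export SORT_ARGS="--temporary-directory=/data/tmp"

./bin/tdbloader2 --loc ../db ../bsbm-generated-dataset.nt
```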
