Thanks for the information on tdbloader. It appears that I will need to start 
over with a new database. I see that "JENA-648: Make TDB datasets harder to 
corrupt" is in Jira (https://issues.apache.org/jira/browse/JENA-648).  It might 
be useful to note the issue with tdbloader there as well. Perhaps CTRL+C could 
be caught by tdbloader to set the database in a consistent state before exiting.

I tried to restore from a backup I made via tdbdump before the load operation. 
Unfortunately, I have run into some trouble with that as well. I filed 
"JENA-1000: tdbdump / tdbloader sequence corrupts rdf:type predicates" 
(https://issues.apache.org/jira/browse/JENA-1000) to document one of the issues 
with the restore.

I am trying to load the entire PubChem RDF collection. I have a single node 
with 94 GB of memory, 24 CPUs, and a hybrid storage unit that caches to SSD in 
front of mechanical drives on a 10 GigE interconnect. Unfortunately, the load 
rate had dropped to an average of about 2,500 triples per second before even 
10% of the files in the Compound sub-collection were loaded:

07:57:45 INFO  loader               :: Add: 169,000,000 triples (Batch: 1,567 / Avg: 2,567)
07:57:45 INFO  loader               ::   Elapsed: 65,830.09 seconds [2015/07/24 07:57:45 EDT]
07:58:13 INFO  loader               :: Add: 169,050,000 triples (Batch: 1,758 / Avg: 2,566)

Do you have any suggestions for tuning the load operation for faster 
performance?
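
To put the log figures in perspective, here is a small Python sketch of the 
arithmetic. The loaded count and elapsed time are copied from the log above. 
The total of 1.6 billion triples is an assumption taken from the estimate 
quoted later in this thread, and the function names are only illustrative. 
Because the batch rate is below the average, the real remaining time would 
probably be longer than the roughly six and a half days this constant-rate 
estimate gives.

```python
SECONDS_PER_DAY = 86_400


def average_rate(triples_loaded, elapsed_seconds):
    """Average load rate in triples per second."""
    return triples_loaded / elapsed_seconds


def remaining_days(total_triples, triples_loaded, rate):
    """Days still needed if the load continued at a constant rate."""
    return (total_triples - triples_loaded) / rate / SECONDS_PER_DAY


# Values from the loader log: 169,000,000 triples after 65,830.09 seconds.
rate = average_rate(169_000_000, 65_830.09)

# Assumption: the full collection is about 1.6 billion triples.
days = remaining_days(1_600_000_000, 169_000_000, rate)

print(f"average rate: {rate:,.0f} triples/s")
print(f"remaining at that rate: {days:.1f} days")
```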

-----Original Message-----
From: Andy Seaborne [mailto:[email protected]] 
Sent: Friday, July 24, 2015 1:44 PM
To: [email protected]
Subject: Re: Canceled tdbloader operation generates "WARN DatasetPrefixesTDB :: 
Mangled prefix map: graph name="

Hi Donald,

The bulk loader tdbloader is not transactional, and if it is aborted part way 
through, the database is suspect.  You *may* find that deleting the prefix 
tables sorts things out, but there is a good chance that the triple indexes or 
the node table are broken as well.

TDB has two bulk loaders; tdbloader2 can be faster for larger datasets. 
Whichever one you use, more RAM (not Java heap) improves loading 
performance.

Both bulk loaders can only do anything special if the database is initially 
empty.  tdbloader2 simply refuses to load an existing database, tdbloader,

You can give all the files to load in a single command to load multiple files 
into an empty database.
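
As a rough illustration of passing every file in one invocation, the Python 
sketch below assembles a single tdbloader command line. The directory paths 
and the function name are hypothetical, and the `--loc=` form is simply the 
one used in the command quoted further down in this thread. The sketch only 
prints the command; it does not run the loader.

```python
import glob
import shlex


def build_tdbloader_command(db_dir, data_files):
    """Return one tdbloader argument list covering every data file."""
    if not data_files:
        raise ValueError("no data files given")
    # One invocation for all files, so the whole load starts from an
    # initially empty database.
    return ["tdbloader", "--loc=" + db_dir] + sorted(data_files)


if __name__ == "__main__":
    # Hypothetical locations; adjust to the local layout.
    files = glob.glob("/data/pubchem/compound/*.ttl.gz")
    if files:
        print(shlex.join(build_tdbloader_command("/data/tdb/pubchem", files)))
    else:
        print("no .ttl.gz files found")
```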

I tried downloading PubChem but failed (the server didn't like the bulk loader 
I was using).  Are you loading the whole thing, which is about 1.6 billion 
triples?  You will need a machine with a lot of RAM to use TDB. 
Having an SSD makes a big difference.

Once you have loaded the database, you can move it to another machine by 
simply copying the directory when no program is connected to the database.

        Andy

On 24/07/15 15:40, Pellegrino, Donald (DA) wrote:
> I attempted to load the NCBI PubChem RDF Compound data 
> (https://pubchem.ncbi.nlm.nih.gov/rdf/#_Toc421254632) into an Apache Jena TDB 
> database. Given 18 hours, the load of PubChem RDF Compound data was only 
> 12/109 .ttl.gz files (11%) complete. Therefore, I hit CTRL+C to cancel the 
> tdbloader operation and try other approaches. Unfortunately, now when I try 
> to run tdbloader I get "WARN  DatasetPrefixesTDB   :: Mangled prefix map: 
> graph name=" followed by a java.lang.NullPointerException. Partial tdbloader 
> error output is below.
>
> Please let me know if you have any suggestions for debugging this error.
>
> ---
>
> tdbloader --verbose --loc=/home/irkmoo/reactionsdb/ pc_compound_type.ttl.gz
> Java maximum memory: 1029177344
> symbol:http://jena.hpl.hp.com/ARQ#constantBNodeLabels = true
> symbol:http://jena.hpl.hp.com/ARQ#regexImpl = symbol:http://jena.hpl.hp.com/ARQ#javaRegex
> symbol:http://jena.hpl.hp.com/ARQ#stageGenerator = com.hp.hpl.jena.tdb.solver.StageGeneratorDirectTDB@18078bef
> symbol:http://jena.hpl.hp.com/ARQ#strictSPARQL = false
> symbol:http://jena.hpl.hp.com/ARQ#enablePropertyFunctions = true
> 10:26:22 INFO  loader               :: -- Start triples data phase
> 10:26:22 INFO  loader               :: ** Load into triples table with existing data
> 10:26:22 INFO  loader               :: -- Start quads data phase
> 10:26:22 INFO  loader               :: ** Load empty quads table
> 10:26:22 INFO  loader               :: Load: pc_compound_type.ttl.gz -- 2015/07/24 10:26:22 EDT
> 10:26:22 WARN  DatasetPrefixesTDB   :: Mangled prefix map: graph name=
> java.lang.NullPointerException
>          at com.hp.hpl.jena.tdb.store.DatasetPrefixesTDB.readPrefixMap(DatasetPrefixesTDB.java:119)
>          at com.hp.hpl.jena.sparql.graph.GraphPrefixesProjection.getNsPrefixMap(GraphPrefixesProjection.java:62)
>          at com.hp.hpl.jena.tdb.store.DatasetPrefixesTDB.getPrefixMapping(DatasetPrefixesTDB.java:168)
>          at com.hp.hpl.jena.tdb.store.DatasetPrefixesTDB.getPrefixMapping(DatasetPrefixesTDB.java:160)
>          at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$DestinationDSG.prefix(BulkLoader.java:272)
>          at org.apache.jena.riot.lang.LangTurtleBase.emitPrefix(LangTurtleBase.java:492)
>          at org.apache.jena.riot.lang.LangTurtleBase.directivePrefix(LangTurtleBase.java:164)
>          at org.apache.jena.riot.lang.LangTurtleBase.directive(LangTurtleBase.java:140)
>          at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:79)
>          at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
>          at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:182)
>          at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:906)
>          at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:687)
>          at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:666)
>          at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:654)
>          at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:148)
>          at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:114)
>          at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:261)
>          at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
>          at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
>          at tdb.tdbloader.loadQuads(tdbloader.java:118)
>          at tdb.tdbloader.exec(tdbloader.java:86)
>          at arq.cmdline.CmdMain.mainMethod(CmdMain.java:102)
>          at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
>          at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
>          at tdb.tdbloader.main(tdbloader.java:44)
> 10:26:22 WARN  DatasetPrefixesTDB   :: Mangled prefix map: graph name=
> java.lang.NullPointerException
> ...
>
>
