Thanks for the information on tdbloader. It appears that I will need to start over with a new database. I see that "JENA-648: Make TDB datasets harder to corrupt" is in Jira (https://issues.apache.org/jira/browse/JENA-648). It might be useful to note the issue with tdbloader there as well. Perhaps tdbloader could catch CTRL+C and leave the database in a consistent state before exiting.

I tried to restore from a backup I made via tdbdump before the load operation. Unfortunately, I have run into some trouble with that as well. I filed "JENA-1000: tdbdump / tdbloader sequence corrupts rdf:type predicates" (https://issues.apache.org/jira/browse/JENA-1000) to document one of the issues with the restore.

I am trying to load the entire PubChem RDF collection. I have a single node with 94 GB of memory, 24 CPUs, and a hybrid storage unit that caches to SSD in front of mechanical drives on a 10 GigE interconnect. Unfortunately, the load operation had slowed to an average of about 2,500 triples per second before even 10% of the files in the Compound sub-collection were loaded:

07:57:45 INFO loader :: Add: 169,000,000 triples (Batch: 1,567 / Avg: 2,567)
07:57:45 INFO loader :: Elapsed: 65,830.09 seconds [2015/07/24 07:57:45 EDT]
07:58:13 INFO loader :: Add: 169,050,000 triples (Batch: 1,758 / Avg: 2,566)

Do you have any suggestions for tuning the load operation for faster performance?

-----Original Message-----
From: Andy Seaborne [mailto:[email protected]]
Sent: Friday, July 24, 2015 1:44 PM
To: [email protected]
Subject: Re: Canceled tdbloader operation generates "WARN DatasetPrefixesTDB :: Mangled prefix map: graph name="

Hi Donald,

The bulk loader tdbloader is not transactional, and if it is aborted partway through, the database is suspect. You *may* find that deleting the prefix tables sorts things out, but there is a good chance the triple indexes or node table are broken as well.

TDB has two bulk loaders; tdbloader2 can be faster for larger datasets. Whichever one you use, more RAM (not Java heap) improves loading performance.

Both bulk loaders can only do anything special if the database is initially empty. tdbloader2 simply refuses to load into an existing database. You can give all the files to load in a single command to load multiple files into an empty database.

I tried downloading PubChem but failed (the server didn't like the bulk loader I was using).
Are you loading the whole thing, which is about 1.6 billion triples?

You will need a machine with a lot of RAM to use TDB. Having an SSD makes a big difference.

Once you have loaded the database, you can move it to another machine by simply copying the directory when no program is connected to the database.

Andy

On 24/07/15 15:40, Pellegrino, Donald (DA) wrote:
> I attempted to load the NCBI PubChem RDF Compound data
> (https://pubchem.ncbi.nlm.nih.gov/rdf/#_Toc421254632) into an Apache Jena TDB
> database. Given 18 hours, the load of PubChem RDF Compound data was only
> 12/109 .ttl.gz files (11%) complete. Therefore, I hit CTRL+C to cancel the
> tdbloader operation and try other approaches. Unfortunately, now when I try
> to run tdbloader I get "WARN DatasetPrefixesTDB :: Mangled prefix map:
> graph name=" followed by a java.lang.NullPointerException. Partial tdbloader
> error output is below.
>
> Please let me know if you have any suggestions for debugging this error.
>
> ---
>
> tdbloader --verbose --loc=/home/irkmoo/reactionsdb/ pc_compound_type.ttl.gz
> Java maximum memory: 1029177344
> symbol:http://jena.hpl.hp.com/ARQ#constantBNodeLabels = true
> symbol:http://jena.hpl.hp.com/ARQ#regexImpl = symbol:http://jena.hpl.hp.com/ARQ#javaRegex
> symbol:http://jena.hpl.hp.com/ARQ#stageGenerator = com.hp.hpl.jena.tdb.solver.StageGeneratorDirectTDB@18078bef
> symbol:http://jena.hpl.hp.com/ARQ#strictSPARQL = false
> symbol:http://jena.hpl.hp.com/ARQ#enablePropertyFunctions = true
> 10:26:22 INFO loader :: -- Start triples data phase
> 10:26:22 INFO loader :: ** Load into triples table with existing data
> 10:26:22 INFO loader :: -- Start quads data phase
> 10:26:22 INFO loader :: ** Load empty quads table
> 10:26:22 INFO loader :: Load: pc_compound_type.ttl.gz -- 2015/07/24 10:26:22 EDT
> 10:26:22 WARN DatasetPrefixesTDB :: Mangled prefix map: graph name=
> java.lang.NullPointerException
>     at com.hp.hpl.jena.tdb.store.DatasetPrefixesTDB.readPrefixMap(DatasetPrefixesTDB.java:119)
>     at com.hp.hpl.jena.sparql.graph.GraphPrefixesProjection.getNsPrefixMap(GraphPrefixesProjection.java:62)
>     at com.hp.hpl.jena.tdb.store.DatasetPrefixesTDB.getPrefixMapping(DatasetPrefixesTDB.java:168)
>     at com.hp.hpl.jena.tdb.store.DatasetPrefixesTDB.getPrefixMapping(DatasetPrefixesTDB.java:160)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$DestinationDSG.prefix(BulkLoader.java:272)
>     at org.apache.jena.riot.lang.LangTurtleBase.emitPrefix(LangTurtleBase.java:492)
>     at org.apache.jena.riot.lang.LangTurtleBase.directivePrefix(LangTurtleBase.java:164)
>     at org.apache.jena.riot.lang.LangTurtleBase.directive(LangTurtleBase.java:140)
>     at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:79)
>     at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
>     at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:182)
>     at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:906)
>     at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:687)
>     at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:666)
>     at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:654)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:148)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:114)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:261)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
>     at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
>     at tdb.tdbloader.loadQuads(tdbloader.java:118)
>     at tdb.tdbloader.exec(tdbloader.java:86)
>     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:102)
>     at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
>     at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
>     at tdb.tdbloader.main(tdbloader.java:44)
> 10:26:22 WARN DatasetPrefixesTDB :: Mangled prefix map: graph name=
> java.lang.NullPointerException
> ...
>
>
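
For a rough sense of what the loader progress lines in this thread imply, the sketch below parses one "Add: ... (Batch: ... / Avg: ...)" line and estimates the time remaining. It is only an illustration: the function names `parse_progress` and `remaining_days` are hypothetical, the total of 1.6 billion triples is Andy's approximate figure rather than a measured count, and the estimate assumes the average rate holds steady. Since the rate had been falling during the load, the real figure would probably be worse.

```python
import re

# Matches tdbloader progress lines such as:
#   07:58:13 INFO loader :: Add: 169,050,000 triples (Batch: 1,758 / Avg: 2,566)
PROGRESS = re.compile(
    r"Add:\s*([\d,]+)\s+triples\s+\(Batch:\s*([\d,]+)\s*/\s*Avg:\s*([\d,]+)\)"
)


def parse_progress(line):
    """Return (triples_loaded, batch_rate, avg_rate) or None if the line does not match."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    return tuple(int(group.replace(",", "")) for group in match.groups())


def remaining_days(loaded, avg_rate, total):
    """Days left if the average rate (triples per second) stayed constant."""
    remaining_triples = max(total - loaded, 0)
    return remaining_triples / avg_rate / 86400


# Assumption: the approximate collection size mentioned in the reply.
TOTAL_TRIPLES = 1_600_000_000

line = "07:58:13 INFO loader :: Add: 169,050,000 triples (Batch: 1,758 / Avg: 2,566)"
loaded, batch_rate, avg_rate = parse_progress(line)
days = remaining_days(loaded, avg_rate, TOTAL_TRIPLES)
print(f"{loaded:,} loaded; about {days:.1f} more days at {avg_rate:,} triples/s")
```

The average rate in the log is consistent with the elapsed time reported alongside it (169,000,000 triples over 65,830.09 seconds is roughly 2,567 triples per second), which is why the sketch treats "Avg" as triples per second.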
