Re: Canceled tdbloader operation generates "WARN DatasetPrefixesTDB :: Mangled prefix map: graph name="

Andy Seaborne Fri, 24 Jul 2015 14:02:42 -0700

On 24/07/15 20:50, Pellegrino, Donald (DA) wrote:

Thanks for the information on tdbloader. It appears that I will need to start over with a 
new database. I see that "JENA-648: Make TDB datasets harder to corrupt" is in 
Jira (https://issues.apache.org/jira/browse/JENA-648).  It might be useful to note the 
issue with tdbloader there as well. Perhaps CTRL+C could be caught by tdbloader to set 
the database in a consistent state before exiting.

JENA-648 is actually about two instances of TDB accessing the same files, which would result in chaos because TDB, like every database, caches a lot.

The only reset for tdbloader (and tdbloader2) is to delete the database. Both build each database index separately because doing one at a time is faster than doing all at the same time (better cache and working set effects).
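As a minimal sketch of that reset, assuming the database lives in a directory of its own and that no other process has it open, the directory can be removed and recreated empty before rerunning the loader. The function name and any path passed to it are hypothetical, not part of Jena:

```python
import shutil
from pathlib import Path


def reset_database(db_dir: str) -> Path:
    """Delete a TDB database directory and recreate it empty.

    This is destructive: everything under db_dir is removed.
    """
    path = Path(db_dir)
    if path.exists():
        shutil.rmtree(path)      # remove all index and node table files
    path.mkdir(parents=True)     # leave an empty directory for the next load
    return path
```

The bulk loader can then be pointed at the empty directory as if it were a new database.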

I tried to restore from a backup I made via tdbdump before the load operation. 
Unfortunately, I have run into some troubles with that as well. I filed "JENA-1000: 
tdbdump / tdbloader sequence corrupts rdf:type predicates" 
(https://issues.apache.org/jira/browse/JENA-1000) to document one of the issues with the 
restore.


There is a question for you on that JIRA issue.

I am trying to load the entire PubChem RDF collection. I have a single node 
with 94 GB of memory, 24 CPUs, and a hybrid storage unit that caches to SSD in 
front of mechanical drives on a 10 GigE interconnect. Unfortunately, the load 
operation had slowed to an average of about 2,500 triples per second before 
even 10% of the files in the Compound sub-collection were loaded:


tdbloader? tdbloader2?

07:57:45 INFO  loader               :: Add: 169,000,000 triples (Batch: 1,567 / Avg: 2,567)
07:57:45 INFO  loader               ::   Elapsed: 65,830.09 seconds [2015/07/24 07:57:45 EDT]
07:58:13 INFO  loader               :: Add: 169,050,000 triples (Batch: 1,758 / Avg: 2,566)


Very slow.

I regularly load that sort of size for simple testing on my 32G/quadcore desktop.
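To put those log figures in perspective, here is a rough sketch that parses a progress line and projects the remaining load time. It assumes the Batch and Avg figures are triples per second, and it uses the figure of about 1.6 billion triples mentioned in the earlier message as the target, which is only an estimate. The function names are made up for this illustration.

```python
import re

# Matches progress lines of the form shown in the log excerpt.
PROGRESS = re.compile(
    r"Add: ([\d,]+) triples \(Batch: ([\d,]+)\s*/\s*Avg: ([\d,]+)\)"
)


def parse_progress(line: str):
    """Return (total_triples, batch_rate, avg_rate), or None if no match."""
    m = PROGRESS.search(line)
    if m is None:
        return None
    return tuple(int(g.replace(",", "")) for g in m.groups())


def remaining_hours(total_target: int, loaded: int, rate: float) -> float:
    """Hours left if the loader kept the given rate (triples per second)."""
    return (total_target - loaded) / rate / 3600.0


line = "07:57:45 INFO  loader :: Add: 169,000,000 triples (Batch: 1,567 / Avg: 2,567)"
loaded, batch, avg = parse_progress(line)

# 1_600_000_000 is an assumed target, taken from the estimate in this thread.
print(f"{remaining_hours(1_600_000_000, loaded, avg):.0f} hours at the average rate")
print(f"{remaining_hours(1_600_000_000, loaded, batch):.0f} hours at the batch rate")
```

At the average rate this works out to roughly 155 more hours, and the batch rate is already lower than the average, so the real figure would be worse if the slowdown continued.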


I thought I'd look at the data. I'm trying to download

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/compound/general/

but I'm only getting 10 mb/s and many FTP errors. As I'm on a consumer-grade connection, the fact that the server is the bottleneck is, well, ironic.

Do you have any suggestions for tuning the load operation for faster 
performance?


Is this CentOS?

I've not encountered such a severe drop-off but others have reported it. The cause has never been clear but in this case:


1/ How much of RAM is actually being used?

TDB uses memory-mapped files. Apparently, some OSs limit their use (ulimit? OS configuration?), though we have never had hard evidence one way or the other.

2/ How big is the SSD? If the working set is bigger than the cache, performance will suffer. Whether the 10 GigE interconnect is a factor, I don't know. Which side of the connection is the SSD on? Local or network storage?

Released tdbloader(1|2) isn't a parallel loader so your cores are not an issue.

-----Original Message-----
From: Andy Seaborne [mailto:[email protected]]
Sent: Friday, July 24, 2015 1:44 PM
To: [email protected]
Subject: Re: Canceled tdbloader operation generates "WARN DatasetPrefixesTDB :: 
Mangled prefix map: graph name="

Hi Donald,

The bulk loader tdbloader is not transactional and if aborted part way through, 
the database is suspect.  You *may* find that deleting the prefix tables sorts 
things out, but there is a good chance the triple indexes or node table are 
broken as well.

TDB has two bulk loaders - tdbloader2 can be faster for larger datasets.
Whichever one you use, more RAM (not Java heap) improves loading performance.

Both bulk loaders can only do anything special if the database is initially 
empty.  tdbloader2 simply refuses to load an existing database, tdbloader,

You can give all the files to load in a single command to load multiple files 
into an empty database.
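As a sketch of that single command, the helper below collects every matching file in a directory and builds one tdbloader argument list. It uses the same `--loc=` form as the command in the original report. The function name, the directory arguments and the `*.ttl.gz` pattern are assumptions for illustration.

```python
from pathlib import Path


def loader_command(db_dir: str, data_dir: str, pattern: str = "*.ttl.gz"):
    """Build one tdbloader invocation covering every matching file."""
    files = sorted(str(p) for p in Path(data_dir).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {data_dir}")
    # One command, many files: the database is empty only for the first load.
    return ["tdbloader", f"--loc={db_dir}", *files]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, assuming tdbloader is on the PATH and the database directory starts out empty.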

I tried downloading PubChem but failed (the server didn't like the bulk loader 
I was using).  Are you loading the whole thing, which is about 
1.6 billion triples?  You will need a large RAM machine to use TDB.
Having an SSD makes a big difference.

Once you have loaded the database, you can move it to another machine by 
simply copying the directory when no program is connected to the database.
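A minimal sketch of such a copy, assuming the target is reachable as a local path (for example a mounted volume) and that nothing has the database open at the time. The function name is hypothetical. The script itself cannot tell whether another program is connected, so that part stays with the operator.

```python
import shutil
from pathlib import Path


def copy_database(src_dir: str, dst_dir: str) -> Path:
    """Copy a whole database directory; refuses to overwrite a target."""
    src, dst = Path(src_dir), Path(dst_dir)
    if not src.is_dir():
        raise NotADirectoryError(str(src))
    # copytree raises FileExistsError if dst already exists, so an existing
    # database at the target is never silently mixed with the copy.
    shutil.copytree(src, dst)
    return dst
```

For a copy across the network, any tool that transfers the whole directory unchanged should serve the same purpose.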

        Andy

On 24/07/15 15:40, Pellegrino, Donald (DA) wrote:

I attempted to load the NCBI PubChem RDF Compound data 
(https://pubchem.ncbi.nlm.nih.gov/rdf/#_Toc421254632) into an Apache Jena TDB database. 
Given 18 hours, the load of PubChem RDF Compound data was only 12/109 .ttl.gz files (11%) 
complete. Therefore, I hit CTRL+C to cancel the tdbloader operation and try other 
approaches. Unfortunately, now when I try to run tdbloader I get "WARN  
DatasetPrefixesTDB   :: Mangled prefix map: graph name=" followed by a 
java.lang.NullPointerException. Partial tdbloader error output is below.

Please let me know if you have any suggestions for debugging this error.

---

tdbloader --verbose --loc=/home/irkmoo/reactionsdb/ pc_compound_type.ttl.gz
Java maximum memory: 1029177344
symbol:http://jena.hpl.hp.com/ARQ#constantBNodeLabels = true
symbol:http://jena.hpl.hp.com/ARQ#regexImpl = symbol:http://jena.hpl.hp.com/ARQ#javaRegex
symbol:http://jena.hpl.hp.com/ARQ#stageGenerator = com.hp.hpl.jena.tdb.solver.StageGeneratorDirectTDB@18078bef
symbol:http://jena.hpl.hp.com/ARQ#strictSPARQL = false
symbol:http://jena.hpl.hp.com/ARQ#enablePropertyFunctions = true
10:26:22 INFO  loader               :: -- Start triples data phase
10:26:22 INFO  loader               :: ** Load into triples table with existing data
10:26:22 INFO  loader               :: -- Start quads data phase
10:26:22 INFO  loader               :: ** Load empty quads table
10:26:22 INFO  loader               :: Load: pc_compound_type.ttl.gz -- 2015/07/24 10:26:22 EDT
10:26:22 WARN  DatasetPrefixesTDB   :: Mangled prefix map: graph name=
java.lang.NullPointerException
          at com.hp.hpl.jena.tdb.store.DatasetPrefixesTDB.readPrefixMap(DatasetPrefixesTDB.java:119)
          at com.hp.hpl.jena.sparql.graph.GraphPrefixesProjection.getNsPrefixMap(GraphPrefixesProjection.java:62)
          at com.hp.hpl.jena.tdb.store.DatasetPrefixesTDB.getPrefixMapping(DatasetPrefixesTDB.java:168)
          at com.hp.hpl.jena.tdb.store.DatasetPrefixesTDB.getPrefixMapping(DatasetPrefixesTDB.java:160)
          at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$DestinationDSG.prefix(BulkLoader.java:272)
          at org.apache.jena.riot.lang.LangTurtleBase.emitPrefix(LangTurtleBase.java:492)
          at org.apache.jena.riot.lang.LangTurtleBase.directivePrefix(LangTurtleBase.java:164)
          at org.apache.jena.riot.lang.LangTurtleBase.directive(LangTurtleBase.java:140)
          at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:79)
          at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
          at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:182)
          at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:906)
          at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:687)
          at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:666)
          at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:654)
          at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:148)
          at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:114)
          at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:261)
          at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
          at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
          at tdb.tdbloader.loadQuads(tdbloader.java:118)
          at tdb.tdbloader.exec(tdbloader.java:86)
          at arq.cmdline.CmdMain.mainMethod(CmdMain.java:102)
          at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
          at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
          at tdb.tdbloader.main(tdbloader.java:44)
10:26:22 WARN  DatasetPrefixesTDB   :: Mangled prefix map: graph name=
java.lang.NullPointerException
...
