On 24/07/15 20:50, Pellegrino, Donald (DA) wrote:
Thanks for the information on tdbloader. It appears that I will need to start over with a
new database. I see that "JENA-648: Make TDB datasets harder to corrupt" is in
Jira (https://issues.apache.org/jira/browse/JENA-648). It might be useful to note the
issue with tdbloader there as well. Perhaps CTRL+C could be caught by tdbloader to set
the database in a consistent state before exiting.
JENA-648 is actually about two instances of TDB accessing the same files,
which would result in chaos because TDB, like every database, caches a lot.
The only reset for tdbloader (and tdbloader2) is to delete the database.
Both build each database index separately because doing one at a time
is faster than doing all at the same time (better cache and working set
effects).
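
Since deleting the database is the only reset, one way to avoid being left with a suspect database is to load into a scratch directory and move it into place only if the loader exits cleanly. The sketch below is a minimal illustration of that idea, not part of Jena. The function name `load_into_fresh_db` and its `loader_cmd` parameter are hypothetical, and it assumes `tdbloader` is on the PATH and accepts `--loc=DIR` followed by the data files, as in the command quoted later in this thread.

```python
import os
import shutil
import subprocess
import tempfile


def load_into_fresh_db(files, final_dir, loader_cmd=("tdbloader",)):
    """Bulk-load `files` into a new, empty directory and publish it on success.

    Hypothetical helper: `loader_cmd` is assumed to accept `--loc=DIR`
    followed by the files to load.
    """
    final_dir = os.path.abspath(final_dir)
    if os.path.exists(final_dir):
        # The bulk loaders are only special for an initially empty database.
        raise FileExistsError(final_dir + " already exists; delete it first")

    # The scratch directory sits next to the target so the final rename
    # stays on the same filesystem.
    scratch = tempfile.mkdtemp(
        prefix=os.path.basename(final_dir) + ".loading-",
        dir=os.path.dirname(final_dir),
    )
    try:
        # One loader invocation with all files; check=True raises
        # CalledProcessError on a non-zero exit status.
        subprocess.run([*loader_cmd, "--loc=" + scratch, *files], check=True)
    except BaseException:
        # Covers loader failure and Ctrl+C (KeyboardInterrupt): a partial
        # load is not trusted, so the scratch database is deleted.
        shutil.rmtree(scratch, ignore_errors=True)
        raise
    os.rename(scratch, final_dir)
    return final_dir


# Example call (not executed here):
# load_into_fresh_db(["pc_compound_type.ttl.gz"], "/path/to/new-db")
```

With this arrangement an interrupted load never touches an existing database, at the cost of always starting from empty.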
I tried to restore from a backup I made via tdbdump before the load operation.
Unfortunately, I have run into some troubles with that as well. I filed "JENA-1000:
tdbdump / tdbloader sequence corrupts rdf:type predicates"
(https://issues.apache.org/jira/browse/JENA-1000) to document one of the issues with the
restore.
There is a question for you on that JIRA.
I am trying to load the entire PubChem RDF collection. I have a single node
with 94 GB of memory, 24 CPUs, and a hybrid storage unit that caches to SSD in
front of mechanical drives on a 10 GigE interconnect. Unfortunately, the load
operation had slowed to an average of about 2,500 triples per second before
even 10% of the files in the Compound sub-collection were loaded:
tdbloader? tdbloader2?
07:57:45 INFO loader :: Add: 169,000,000 triples (Batch: 1,567 / Avg: 2,567)
07:57:45 INFO loader :: Elapsed: 65,830.09 seconds [2015/07/24 07:57:45 EDT]
07:58:13 INFO loader :: Add: 169,050,000 triples (Batch: 1,758 / Avg: 2,566)
Very slow.
I regularly load that sort of size for simple testing on my 32G/quad
core desktop.
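
To put the quoted log figures in perspective, the following sketch parses a loader progress line and checks the arithmetic. It assumes the progress line has been rejoined onto one line, and that "Batch" and "Avg" are triples per second, which is consistent with 169,000,000 triples in 65,830.09 seconds. The helper names are hypothetical, and the 1.6 billion total is only the rough figure mentioned elsewhere in this thread, not a measured count.

```python
import re

# Matches the "Add: ... (Batch: ... / Avg: ...)" part of a progress line.
PROGRESS = re.compile(
    r"Add: ([\d,]+) triples \(Batch: ([\d,]+) / Avg: ([\d,]+)\)"
)


def parse_progress(line):
    """Return (triples_loaded, batch_rate, avg_rate) or None if no match."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    return tuple(int(group.replace(",", "")) for group in match.groups())


def hours_remaining(total_triples, loaded, rate_per_second):
    """Hours left if the load continued at a constant rate."""
    return (total_triples - loaded) / rate_per_second / 3600.0


line = ("07:57:45 INFO loader :: Add: 169,000,000 triples "
        "(Batch: 1,567 / Avg: 2,567)")
loaded, batch_rate, avg_rate = parse_progress(line)

elapsed_seconds = 65830.09  # from the "Elapsed:" line in the same log
print("average rate from elapsed time:", round(loaded / elapsed_seconds))

# Assumption: about 1.6 billion triples in total (rough figure from the thread).
assumed_total = 1_600_000_000
print("hours left at the average rate:",
      round(hours_remaining(assumed_total, loaded, avg_rate)))
```

Under those assumptions the remaining load would take on the order of six days even at the overall average rate, and longer at the current batch rate.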
I thought I'd look at the data. I'm trying to download
ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/compound/general/
but I'm only getting 10 mb/s and many FTP errors. As I'm on a
consumer-grade connection, the fact that the server is the bottleneck is,
well, ironic.
Do you have any suggestions for tuning the load operation for faster
performance?
Is this CentOS?
I've not encountered such a severe drop-off but others have reported it.
The cause has never been clear but in this case:
1/ How much of RAM is actually being used?
TDB uses memory mapped files - apparently, some OSs limit their use
(ulimit? OS configuration?) though we have never had hard evidence one
way or the other.
2/ How big is the SSD? If the working set is bigger than the cache, then
performance will suffer. Whether the 10 GigE interconnect is a factor, I
don't know. Which side of the connection is the SSD? Local or network
storage?
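
For question 1/, one way to gather the numbers is sketched below. It assumes a Linux system such as CentOS, where `/proc/meminfo` is available and Python's `resource` module reports the limits that `ulimit -v` and `ulimit -n` set. The function names are hypothetical, and whether any of these limits actually affects TDB's memory mapped files is not established in this thread.

```python
import resource


def describe(value):
    """Render a resource limit, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def mmap_related_limits():
    """Soft/hard limits that could plausibly constrain memory mapped files."""
    wanted = {
        "address space (ulimit -v)": resource.RLIMIT_AS,
        "open files (ulimit -n)": resource.RLIMIT_NOFILE,
    }
    report = {}
    for label, which in wanted.items():
        soft, hard = resource.getrlimit(which)
        report[label] = "soft=%s hard=%s" % (describe(soft), describe(hard))
    return report


def read_meminfo(path="/proc/meminfo"):
    """Return /proc/meminfo as {name: number}; empty if unavailable."""
    info = {}
    try:
        with open(path) as handle:
            for line in handle:
                name, _, rest = line.partition(":")
                fields = rest.split()
                if fields and fields[0].isdigit():
                    info[name.strip()] = int(fields[0])  # mostly in kB
    except OSError:
        pass  # not Linux, or /proc not mounted
    return info


for label, text in mmap_related_limits().items():
    print(label, text)

meminfo = read_meminfo()
# "Cached" is the page cache, which is where memory mapped file data lives.
for key in ("MemTotal", "MemFree", "Cached"):
    if key in meminfo:
        print(key, meminfo[key], "kB")
```

Running this as the same user, and in the same shell, that launches the loader matters, because limits are inherited per process.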
Released tdbloader(1|2) isn't a parallel loader so your cores are not an
issue.
-----Original Message-----
From: Andy Seaborne [mailto:[email protected]]
Sent: Friday, July 24, 2015 1:44 PM
To: [email protected]
Subject: Re: Canceled tdbloader operation generates "WARN DatasetPrefixesTDB ::
Mangled prefix map: graph name="
Hi Donald,
The bulk loader tdbloader is not transactional and if aborted part way through,
the database is suspect. You *may* find that deleting the prefix tables sorts
things but there is a good chance the triple indexes or node table are
broken as well.
TDB has two bulk loaders; tdbloader2 can be faster for larger datasets.
Whichever one you use, more RAM (not java heap) improves performance of
loading.
Both bulk loaders can only do anything special if the database is initially
empty. tdbloader2 simply refuses to load an existing database, tdbloader,
You can give all the files to load in a single command to load multiple files
into an empty database.
I tried downloading PubChem but failed (the server didn't like the bulk loader
I was using). Are you loading the whole thing? Which is about
1.6 billion triples? You will need a large RAM machine to use TDB.
Having an SSD makes a big difference.
Once you have loaded the database, you can move it to another machine by
simply copying the directory when no program is connected to the database.
Andy
On 24/07/15 15:40, Pellegrino, Donald (DA) wrote:
I attempted to load the NCBI PubChem RDF Compound data
(https://pubchem.ncbi.nlm.nih.gov/rdf/#_Toc421254632) into an Apache Jena TDB database.
After 18 hours, the load of PubChem RDF Compound data was only 12/109 .ttl.gz files (11%)
complete. Therefore, I hit CTRL+C to cancel the tdbloader operation and try other
approaches. Unfortunately, now when I try to run tdbloader I get "WARN
DatasetPrefixesTDB :: Mangled prefix map: graph name=" followed by a
java.lang.NullPointerException. Partial tdbloader error output is below.
Please let me know if you have any suggestions for debugging this error.
---
tdbloader --verbose --loc=/home/irkmoo/reactionsdb/ pc_compound_type.ttl.gz
Java maximum memory: 1029177344
symbol:http://jena.hpl.hp.com/ARQ#constantBNodeLabels = true
symbol:http://jena.hpl.hp.com/ARQ#regexImpl = symbol:http://jena.hpl.hp.com/ARQ#javaRegex
symbol:http://jena.hpl.hp.com/ARQ#stageGenerator = com.hp.hpl.jena.tdb.solver.StageGeneratorDirectTDB@18078bef
symbol:http://jena.hpl.hp.com/ARQ#strictSPARQL = false
symbol:http://jena.hpl.hp.com/ARQ#enablePropertyFunctions = true
10:26:22 INFO loader :: -- Start triples data phase
10:26:22 INFO loader :: ** Load into triples table with existing data
10:26:22 INFO loader :: -- Start quads data phase
10:26:22 INFO loader :: ** Load empty quads table
10:26:22 INFO loader :: Load: pc_compound_type.ttl.gz -- 2015/07/24 10:26:22 EDT
10:26:22 WARN DatasetPrefixesTDB :: Mangled prefix map: graph name=
java.lang.NullPointerException
at com.hp.hpl.jena.tdb.store.DatasetPrefixesTDB.readPrefixMap(DatasetPrefixesTDB.java:119)
at com.hp.hpl.jena.sparql.graph.GraphPrefixesProjection.getNsPrefixMap(GraphPrefixesProjection.java:62)
at com.hp.hpl.jena.tdb.store.DatasetPrefixesTDB.getPrefixMapping(DatasetPrefixesTDB.java:168)
at com.hp.hpl.jena.tdb.store.DatasetPrefixesTDB.getPrefixMapping(DatasetPrefixesTDB.java:160)
at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$DestinationDSG.prefix(BulkLoader.java:272)
at org.apache.jena.riot.lang.LangTurtleBase.emitPrefix(LangTurtleBase.java:492)
at org.apache.jena.riot.lang.LangTurtleBase.directivePrefix(LangTurtleBase.java:164)
at org.apache.jena.riot.lang.LangTurtleBase.directive(LangTurtleBase.java:140)
at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:79)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:182)
at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:906)
at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:687)
at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:666)
at org.apache.jena.riot.RDFDataMgr.parse(RDFDataMgr.java:654)
at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:148)
at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:114)
at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:261)
at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
at tdb.tdbloader.loadQuads(tdbloader.java:118)
at tdb.tdbloader.exec(tdbloader.java:86)
at arq.cmdline.CmdMain.mainMethod(CmdMain.java:102)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
at tdb.tdbloader.main(tdbloader.java:44)
10:26:22 WARN DatasetPrefixesTDB :: Mangled prefix map: graph name=
java.lang.NullPointerException
...