I've put some debugging in so that the term being unpacked is printed out.

It looks like it is the timezone.

        Andy

On 24/07/12 12:13, Michael Brunnbauer wrote:

Hello Andy,

On Thu, Jun 14, 2012 at 01:12:25PM +0100, Andy Seaborne wrote:
I guess it would be a good idea to look at the end of the dump and check
the corresponding named graph for bad datetimes?

Yes - my best guess at the moment is that a dateTime can get in (they
are encoded into 56 bits, not recorded using the lexical form) but there
was a problem on the recreation of the lexical form.  Whether the
encoding or decoding is wrong, I can't tell.
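
As a rough illustration of how a value can be packed without complaint yet
fail when the lexical form is rebuilt, here is a minimal sketch. The 7-bit
timezone field, its position, the quarter-hour unit and all class and method
names are assumptions made up for the example; they are not taken from TDB's
DateTimeNode and this is not a diagnosis of the actual bug. It only shows
that a signed offset stored in a narrow bit field needs its sign handled
separately before it reaches a fixed-width unsigned formatter.

```java
public final class PackedTimezoneSketch {
    // Hypothetical layout: a 7-bit timezone field in the top bits of a 56-bit value.
    static final int TZ_BITS = 7;
    static final int TZ_SHIFT = 49;
    static final long TZ_MASK = (1L << TZ_BITS) - 1;

    /** Pack a timezone offset, in quarter-hours, as 7-bit two's complement. */
    public static long packTimezone(long packed, int quarterHours) {
        long cleared = packed & ~(TZ_MASK << TZ_SHIFT);
        return cleared | ((quarterHours & TZ_MASK) << TZ_SHIFT);
    }

    /** Unpack the field and sign-extend it so negative offsets are recovered. */
    public static int unpackTimezone(long packed) {
        long raw = (packed >>> TZ_SHIFT) & TZ_MASK;
        return (int) ((raw << (64 - TZ_BITS)) >> (64 - TZ_BITS));
    }

    /** Fixed-width formatter for non-negative values; rejects anything else. */
    public static String formatUnsigned(int value, int width) {
        String digits = Integer.toString(value);
        if (value < 0 || digits.length() > width) {
            throw new IllegalArgumentException("format overflow: " + value);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    /** Render the offset as +hh:mm or -hh:mm, handling the sign separately. */
    public static String formatTimezone(int quarterHours) {
        int abs = Math.abs(quarterHours);
        String sign = quarterHours < 0 ? "-" : "+";
        return sign + formatUnsigned(abs / 4, 2) + ":" + formatUnsigned((abs % 4) * 15, 2);
    }

    public static void main(String[] args) {
        long packed = packTimezone(0L, -20);   // -05:00 as quarter-hours
        int tz = unpackTimezone(packed);
        System.out.println(formatTimezone(tz));
        try {
            // Passing the signed value straight to the unsigned formatter fails.
            System.out.println(formatUnsigned(tz / 4, 2));
        } catch (IllegalArgumentException e) {
            System.out.println("decode failed: " + e.getMessage());
        }
    }
}
```

A round-trip check of this kind (pack, unpack, format, compare with the
input) is one way to tell whether the encoding or the decoding side is at
fault for a given term.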

I was not able to find the named graph causing the problem so I recreated the
TDB with tdbloader2 from apache-jena-2.7.2 and tried tdbdump from
apache-jena-2.7.2 immediately after that. The result is that I seem to run
into the same problem:

Exception in thread "main" org.openjena.atlas.AtlasException: formatInt: overflow
        at org.openjena.atlas.lib.NumberUtils.formatUnsignedInt(NumberUtils.java:115)
        at org.openjena.atlas.lib.NumberUtils.formatInt(NumberUtils.java:87)
        at org.openjena.atlas.lib.NumberUtils.formatInt(NumberUtils.java:60)
        at com.hp.hpl.jena.tdb.store.DateTimeNode.unpack(DateTimeNode.java:255)
        at com.hp.hpl.jena.tdb.store.DateTimeNode.unpackDateTime(DateTimeNode.java:180)
        at com.hp.hpl.jena.tdb.store.NodeId.extract(NodeId.java:313)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:64)
        at com.hp.hpl.jena.tdb.lib.TupleLib.quad(TupleLib.java:163)
        at com.hp.hpl.jena.tdb.lib.TupleLib.quad(TupleLib.java:155)
        at com.hp.hpl.jena.tdb.lib.TupleLib.access$100(TupleLib.java:45)
        at com.hp.hpl.jena.tdb.lib.TupleLib$4.convert(TupleLib.java:89)
        at com.hp.hpl.jena.tdb.lib.TupleLib$4.convert(TupleLib.java:85)
        at org.openjena.atlas.iterator.Iter$4.next(Iter.java:301)
        at org.openjena.atlas.iterator.IteratorCons.next(IteratorCons.java:94)
        at org.openjena.atlas.iterator.Iter.sendToSink(Iter.java:560)
        at org.openjena.riot.out.NQuadsWriter.write(NQuadsWriter.java:45)
        at org.openjena.riot.out.NQuadsWriter.write(NQuadsWriter.java:37)
        at org.openjena.riot.RiotWriter.writeNQuads(RiotWriter.java:41)
        at tdb.tdbdump.exec(tdbdump.java:49)
        at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
        at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
        at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
        at tdb.tdbdump.main(tdbdump.java:31)

This seems to be a serious issue.

BTW: Here is some output from tdbloader2 for this TDB. It shows that the
tdbloader2 data phase slows down sharply for very big datasets: the batch rate
falls from about 68,000 at 100 million tuples to a few hundred beyond
1.1 billion. I called tdbloader2 with JVM_ARGS="-Xmx32768M -server" and it did
not seem to run into memory problems.

  12:39:17 -- TDB Bulk Loader Start
  12:39:17 Data phase
...
INFO  Add: 100,000,000 Data (Batch: 68,027 / Avg: 57,649)
...
INFO  Add: 500,000,000 Data (Batch: 55,309 / Avg: 41,446)
...
INFO  Add: 1,000,000,000 Data (Batch: 27,901 / Avg: 24,119)
...
INFO  Add: 1,100,000,000 Data (Batch: 335 / Avg: 6,308)
...
INFO  Add: 1,138,800,000 Data (Batch: 256 / Avg: 5,038)
...
INFO  Total: 1,138,845,529 tuples : 227,654.44 seconds : 5,002.52 tuples/sec [2012/07/22 03:53:36 CEST]
...
  20:24:24 -- TDB Bulk Loader Finish
  20:24:24 -- 373477 seconds

Regards,

Michael Brunnbauer

