Great. Thanks, Andy! I moved from CSV to the text output because I was fighting a phantom newline that was messing up the downstream processing. The phantom won that round, but there are more days ahead.
Regards,
Tim

On Mar 28, 2014, at 11:15 AM, Andy Seaborne <[email protected]> wrote:

>>     at com.hp.hpl.jena.sparql.resultset.TextOutput.format(TextOutput.java:65)
>>     at com.hp.hpl.jena.query.ResultSetFormatter.out(ResultSetFormatter.java:135)
>>     at com.hp.hpl.jena.sparql.util.QueryExecUtils.outputResultSet(QueryExecUtils.java:157)
>>     at com.hp.hpl.jena.sparql.util.QueryExecUtils.doSelectQuery(QueryExecUtils.java:199)
>>     at com.hp.hpl.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:75)
>
> Looks like you are trying to output as formatted text.
>
> For text format aligns column widths so it needs to scan the entire result
> set to find column widths, then go back and actually write stuff.
>
> It takes a copy of the whole results to do that.
>
> You can use a streaming format like JSON, TSV, CSV (the last two can be
> thought of as unformatted text).
>
> Andy
>
> On 28/03/14 15:10, Timothy Lebo wrote:
>> Thanks, David.
>>
>> Bumping it from 1 GB to 4 GB handled it to produce:
>>
>> 38 MB of gzipped dbpedia URLs,
>> 8 MB of gzipped freebase URLs, and
>> 7 MB of gzipped reference.data.gov.uk URLs.
>> (the only three “big” domains)
>>
>> I’ll put the streaming question on hold until I run out of memory :-)
>>
>> Regards,
>> Tim
>>
>> On Mar 28, 2014, at 10:44 AM, David Jordan <[email protected]> wrote:
>>
>>> The first question to answer is how much memory have you allocated in the
>>> Java heap. You can control this. The default JVM heap size will very likely
>>> be too small.
>>>
>>> -----Original Message-----
>>> From: Timothy Lebo [mailto:[email protected]]
>>> Sent: Friday, March 28, 2014 10:41 AM
>>> To: [email protected]
>>> Subject: OutOfMemoryError with tdbquery
>>>
>>> Jena,
>>>
>>> I have a TDB with 4.2 billion triples that I created with tdbloader.
>>> It's taken from the 2012 Billion Triples Challenge.
>>> I assert three triples for each URL they retrieved ("context"), e.g. for
>>> the URL http://www.hyphen.info/rdf/30.xml:
>>>
>>> <http://www.hyphen.info/rdf/30.xml> <http://purl.org/twc/vocab/between-the-edges/root> <http://www.hyphen.info> .
>>> <http://www.hyphen.info> <http://purl.org/twc/vocab/between-the-edges/pld> <http://hyphen.info> .
>>> <http://hyphen.info> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/twc/vocab/between-the-edges/PayLevelDomain> .
>>>
>>>
>>> When I submit the following query with tdbquery:
>>>
>>> select ?url where{?url <http://purl.org/twc/vocab/between-the-edges/root> <http://dbpedia.org>.}
>>>
>>> The following Exception is thrown.
>>>
>>> I'm assuming that Jena is trying to build up all of the results before
>>> reporting them.
>>> Is there a way to just get "the stream" to avoid the memory issue?
>>>
>>> Thanks,
>>> Tim Lebo
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>     at com.hp.hpl.jena.tdb.base.record.RecordFactory.create(RecordFactory.java:87)
>>>     at com.hp.hpl.jena.tdb.base.record.RecordFactory.buildFrom(RecordFactory.java:122)
>>>     at com.hp.hpl.jena.tdb.base.buffer.RecordBuffer._get(RecordBuffer.java:107)
>>>     at com.hp.hpl.jena.tdb.base.buffer.RecordBuffer.get(RecordBuffer.java:53)
>>>     at com.hp.hpl.jena.tdb.base.recordbuffer.RecordRangeIterator.hasNext(RecordRangeIterator.java:130)
>>>     at org.openjena.atlas.iterator.Iter$4.hasNext(Iter.java:295)
>>>     at com.hp.hpl.jena.tdb.sys.DatasetControlMRSW$IteratorCheckNotConcurrent.hasNext(DatasetControlMRSW.java:119)
>>>     at org.openjena.atlas.iterator.Iter$4.hasNext(Iter.java:295)
>>>     at org.openjena.atlas.iterator.Iter$3.hasNext(Iter.java:181)
>>>     at org.openjena.atlas.iterator.Iter.hasNext(Iter.java:825)
>>>     at org.openjena.atlas.iterator.RepeatApplyIterator.hasNext(RepeatApplyIterator.java:58)
>>>     at org.openjena.atlas.iterator.Iter$4.hasNext(Iter.java:295)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIterPlainWrapper.hasNextBinding(QueryIterPlainWrapper.java:54)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIterConvert.hasNextBinding(QueryIterConvert.java:59)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>>>     at com.hp.hpl.jena.sparql.engine.ResultSetStream.hasNext(ResultSetStream.java:72)
>>>     at com.hp.hpl.jena.sparql.resultset.ResultSetMem.<init>(ResultSetMem.java:95)
>>>     at com.hp.hpl.jena.sparql.resultset.TextOutput.write(TextOutput.java:147)
>>>     at com.hp.hpl.jena.sparql.resultset.TextOutput.write(TextOutput.java:130)
>>>     at com.hp.hpl.jena.sparql.resultset.TextOutput.write(TextOutput.java:118)
>>>     at com.hp.hpl.jena.sparql.resultset.TextOutput.format(TextOutput.java:65)
>>>     at com.hp.hpl.jena.query.ResultSetFormatter.out(ResultSetFormatter.java:135)
>>>     at com.hp.hpl.jena.sparql.util.QueryExecUtils.outputResultSet(QueryExecUtils.java:157)
>>>     at com.hp.hpl.jena.sparql.util.QueryExecUtils.doSelectQuery(QueryExecUtils.java:199)
>>>     at com.hp.hpl.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:75)
>>>     at arq.query.queryExec(query.java:186)
>>>     at arq.query.exec(query.java:145)
>>>
>>>
>>
>
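
For readers following the thread: Andy's point about why aligned text output holds the whole result set in memory, while TSV or CSV can be written row by row, can be illustrated with a small self-contained Java sketch. This is not Jena's code. The class `ResultWriters`, the methods `writeTsv` and `writeAligned`, and the example.org rows are all hypothetical names made up for illustration; only the general idea (buffer everything to compute column widths versus write each row as it arrives) comes from the discussion above.

```java
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these names come from Jena.
class ResultWriters {

    // Streaming: each row is written as soon as it is read, so memory use
    // does not grow with the number of rows. Returns the row count.
    static int writeTsv(Iterator<String[]> rows, PrintWriter out) {
        int count = 0;
        while (rows.hasNext()) {
            out.print(String.join("\t", rows.next()));
            out.print('\n');
            count++;
        }
        out.flush();
        return count;
    }

    // Aligned text: column widths are only known after the last row has been
    // seen, so every row is buffered first and written in a second pass.
    // Returns how many rows were held in memory at once.
    static int writeAligned(Iterator<String[]> rows, PrintWriter out) {
        List<String[]> buffered = new ArrayList<>();
        List<Integer> widths = new ArrayList<>();
        while (rows.hasNext()) {
            String[] row = rows.next();
            buffered.add(row); // this copy is what costs heap
            for (int i = 0; i < row.length; i++) {
                if (i == widths.size()) {
                    widths.add(0);
                }
                widths.set(i, Math.max(widths.get(i), row[i].length()));
            }
        }
        for (String[] row : buffered) {
            StringBuilder line = new StringBuilder("|");
            for (int i = 0; i < row.length; i++) {
                line.append(' ').append(row[i]);
                for (int pad = row[i].length(); pad < widths.get(i); pad++) {
                    line.append(' ');
                }
                line.append(" |");
            }
            out.print(line);
            out.print('\n');
        }
        out.flush();
        return buffered.size();
    }

    public static void main(String[] args) {
        // Made-up rows standing in for query results.
        List<String[]> rows = List.of(
                new String[] {"<http://example.org/a>"},
                new String[] {"<http://example.org/a-much-longer-url>"});
        PrintWriter out = new PrintWriter(System.out, true);
        writeTsv(rows.iterator(), out);
        writeAligned(rows.iterator(), out);
    }
}
```

In the aligned variant the number of buffered rows equals the size of the result set, which is consistent with the `ResultSetMem` frame in the stack trace; the TSV variant never holds more than one row at a time.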
