Great. Thanks, Andy!

I moved from CSV to the text output because I was fighting a phantom newline 
that was messing up the downstream processing.
The phantom won that round, but there are more days ahead.

Regards,
Tim

On Mar 28, 2014, at 11:15 AM, Andy Seaborne <[email protected]> wrote:

>>      at com.hp.hpl.jena.sparql.resultset.TextOutput.format(TextOutput.java:65)
>>      at com.hp.hpl.jena.query.ResultSetFormatter.out(ResultSetFormatter.java:135)
>>      at com.hp.hpl.jena.sparql.util.QueryExecUtils.outputResultSet(QueryExecUtils.java:157)
>>      at com.hp.hpl.jena.sparql.util.QueryExecUtils.doSelectQuery(QueryExecUtils.java:199)
>>      at com.hp.hpl.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:75)
> 
> Looks like you are trying to output as formatted text.
> 
> The text format aligns column widths, so it needs to scan the entire result 
> set to find column widths, then go back and actually write stuff.
> 
> It takes a copy of the whole results to do that.
> 
> You can use a streaming format like JSON, TSV, CSV (the last two can be 
> thought of as unformatted text).
> 
>       Andy
> 
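To make the difference Andy describes concrete, here is a minimal sketch in plain Java. It is not Jena's implementation: the class RowOutputSketch, its two methods, and the example.org values are made up for illustration. It only shows why an aligned text writer has to hold every row before printing, while a TSV-style writer can emit each row as it arrives.

```java
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative only: these are not Jena classes. Rows arrive one at a time
// from an iterator, standing in for a stream of query results.
class RowOutputSketch {

    // Streaming: each row is written as soon as it is read, so memory use
    // does not grow with the number of rows.
    static void writeTsv(Iterator<String[]> rows, PrintWriter out) {
        while (rows.hasNext()) {
            out.print(String.join("\t", rows.next()) + "\n");
        }
    }

    // Aligned text: column widths are only known after the last row has been
    // seen, so every row is kept in memory before anything is written.
    static void writeAligned(Iterator<String[]> rows, PrintWriter out) {
        List<String[]> buffered = new ArrayList<>();
        int[] widths = new int[0];
        while (rows.hasNext()) {
            String[] row = rows.next();
            if (row.length > widths.length) {
                widths = Arrays.copyOf(widths, row.length);
            }
            for (int i = 0; i < row.length; i++) {
                widths[i] = Math.max(widths[i], row[i].length());
            }
            buffered.add(row);
        }
        // Second pass over the buffered copy, padding each cell to its column width.
        for (String[] row : buffered) {
            StringBuilder line = new StringBuilder("|");
            for (int i = 0; i < row.length; i++) {
                line.append(' ').append(row[i]);
                for (int pad = row[i].length(); pad < widths[i]; pad++) {
                    line.append(' ');
                }
                line.append(" |");
            }
            out.print(line.append('\n').toString());
        }
    }

    public static void main(String[] args) {
        // Made-up sample rows, not data from the thread.
        List<String[]> rows = Arrays.asList(
            new String[] {"<http://example.org/a>"},
            new String[] {"<http://example.org/a-longer-name>"});
        PrintWriter out = new PrintWriter(System.out);
        writeTsv(rows.iterator(), out);
        writeAligned(rows.iterator(), out);
        out.flush();
    }
}
```

In the sketch, the `buffered` list plays the role of the copy of the whole results that Andy mentions. With millions of rows, that list is what would fill the heap.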
> On 28/03/14 15:10, Timothy Lebo wrote:
>> Thanks, David.
>> 
>> Bumping it from 1 GB to 4 GB handled it to produce:
>> 
>> 38 MB of gzipped dbpedia URLs,
>> 8 MB of gzipped freebase URLs, and
>> 7 MB of gzipped reference.data.gov.uk URLs.
>> (the only three “big” domains)
>> 
>> I’ll put the streaming question on hold until I run out of memory :-)
>> 
>> Regards,
>> Tim
>> 
>> On Mar 28, 2014, at 10:44 AM, David Jordan <[email protected]> wrote:
>> 
>>> The first question to answer is how much memory you have allocated to the 
>>> Java heap. You can control this. The default JVM heap size will very likely 
>>> be too small.
>>> 
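One way to answer David's first question is to ask the JVM itself. The sketch below uses only the standard `Runtime.maxMemory()` call. The class name HeapCheck is made up for illustration, and the program is a generic check, not part of Jena or tdbquery.

```java
// Prints the maximum heap size the running JVM will attempt to use,
// which reflects the -Xmx setting (or the JVM default when none is given).
class HeapCheck {

    // Maximum heap in megabytes, as reported by the JVM.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as `java -Xmx4g HeapCheck` should report a value close to 4096 MB, since `-Xmx` is the standard JVM option for the maximum heap size. How the tdbquery wrapper script passes that option through to the JVM is not shown in this thread.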
>>> -----Original Message-----
>>> From: Timothy Lebo [mailto:[email protected]]
>>> Sent: Friday, March 28, 2014 10:41 AM
>>> To: [email protected]
>>> Subject: OutOfMemoryError with tdbquery
>>> 
>>> Jena,
>>> 
>>> I have a TDB with 4.2 billion triples that I created with tdbloader.
>>> It's taken from the 2012 Billion Triples Challenge.
>>> I assert three triples for each URL they retrieved ("context"), e.g. for 
>>> the URL http://www.hyphen.info/rdf/30.xml:
>>> 
>>> <http://www.hyphen.info/rdf/30.xml> <http://purl.org/twc/vocab/between-the-edges/root> <http://www.hyphen.info> .
>>> <http://www.hyphen.info> <http://purl.org/twc/vocab/between-the-edges/pld> <http://hyphen.info> .
>>> <http://hyphen.info> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/twc/vocab/between-the-edges/PayLevelDomain> .
>>> 
>>> 
>>> When I submit the following query with tdbquery:
>>> 
>>> select ?url where{?url <http://purl.org/twc/vocab/between-the-edges/root> 
>>> <http://dbpedia.org>.}
>>> 
>>> The following Exception is thrown.
>>> 
>>> I'm assuming that Jena is trying to build up all of the results before 
>>> reporting them.
>>> Is there a way to just get "the stream" to avoid the memory issue?
>>> 
>>> Thanks,
>>> Tim Lebo
>>> 
>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>     at com.hp.hpl.jena.tdb.base.record.RecordFactory.create(RecordFactory.java:87)
>>>     at com.hp.hpl.jena.tdb.base.record.RecordFactory.buildFrom(RecordFactory.java:122)
>>>     at com.hp.hpl.jena.tdb.base.buffer.RecordBuffer._get(RecordBuffer.java:107)
>>>     at com.hp.hpl.jena.tdb.base.buffer.RecordBuffer.get(RecordBuffer.java:53)
>>>     at com.hp.hpl.jena.tdb.base.recordbuffer.RecordRangeIterator.hasNext(RecordRangeIterator.java:130)
>>>     at org.openjena.atlas.iterator.Iter$4.hasNext(Iter.java:295)
>>>     at com.hp.hpl.jena.tdb.sys.DatasetControlMRSW$IteratorCheckNotConcurrent.hasNext(DatasetControlMRSW.java:119)
>>>     at org.openjena.atlas.iterator.Iter$4.hasNext(Iter.java:295)
>>>     at org.openjena.atlas.iterator.Iter$3.hasNext(Iter.java:181)
>>>     at org.openjena.atlas.iterator.Iter.hasNext(Iter.java:825)
>>>     at org.openjena.atlas.iterator.RepeatApplyIterator.hasNext(RepeatApplyIterator.java:58)
>>>     at org.openjena.atlas.iterator.Iter$4.hasNext(Iter.java:295)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIterPlainWrapper.hasNextBinding(QueryIterPlainWrapper.java:54)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIterConvert.hasNextBinding(QueryIterConvert.java:59)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
>>>     at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
>>>     at com.hp.hpl.jena.sparql.engine.ResultSetStream.hasNext(ResultSetStream.java:72)
>>>     at com.hp.hpl.jena.sparql.resultset.ResultSetMem.<init>(ResultSetMem.java:95)
>>>     at com.hp.hpl.jena.sparql.resultset.TextOutput.write(TextOutput.java:147)
>>>     at com.hp.hpl.jena.sparql.resultset.TextOutput.write(TextOutput.java:130)
>>>     at com.hp.hpl.jena.sparql.resultset.TextOutput.write(TextOutput.java:118)
>>>     at com.hp.hpl.jena.sparql.resultset.TextOutput.format(TextOutput.java:65)
>>>     at com.hp.hpl.jena.query.ResultSetFormatter.out(ResultSetFormatter.java:135)
>>>     at com.hp.hpl.jena.sparql.util.QueryExecUtils.outputResultSet(QueryExecUtils.java:157)
>>>     at com.hp.hpl.jena.sparql.util.QueryExecUtils.doSelectQuery(QueryExecUtils.java:199)
>>>     at com.hp.hpl.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:75)
>>>     at arq.query.queryExec(query.java:186)
>>>     at arq.query.exec(query.java:145)
>>> 
>>> 
>> 
> 
> 
