On 25/03/13 18:26, Martino Buffolino wrote:
Sorry I became a little sidetracked over the last week.
Ditto.
This occurs when indexing a model of size 19,169,727.
What I *think* is happening is that the Lucene writer is growing too large.
I can see, maybe, two workarounds:
1/ setAvoidDuplicates(false)
2/ Close and reopen the index writer every so often to stop RAM-state
build-up.
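For 1/, that's a one-line switch on the builder before indexing (untested;
assuming setAvoidDuplicates(boolean) on IndexBuilderBase as in LARQ 1.0).
It skips the unindex/hash step that shows up in the stack trace below:

IndexBuilderString larqBuilder = new IndexBuilderString(indexDir) ;
larqBuilder.setAvoidDuplicates(false) ;  // don't check for / remove duplicates on each index() call
larqBuilder.indexStatements(model.listStatements()) ;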
For 2/, something like (untested; assumes larqBuilder was created over a
file-based index directory indexDir):
StmtIterator sIter = model.listStatements() ;
for ( int i = 0 ; sIter.hasNext() ; i++ )
{
    larqBuilder.indexStatement(sIter.next()) ;
    if ( i % 10000 == 9999 )
    {
        // Flush the in-memory Lucene state: close the writer, then
        // reopen the builder over the same index directory.
        larqBuilder.closeWriter() ;
        larqBuilder = new IndexBuilderString(indexDir) ;
    }
}
larqBuilder.closeWriter() ;
Sorry this isn't a definitive solution - I don't see a way to get hold of
the Lucene writer to call .commit().
My preferred long-term solution is to start again with a fully integrated
approach to text indexing, based on datasets, assemblers, Lucene4 and
Solr4 (and then it all works with Fuseki). LARQ1 predates much of the
dataset and Fuseki work - in this case, evolution is hard, revolution is
easier.
A better approach would seem to be an index that simply maps
"textquery -> uri" for a definable property. LARQ1 also stores the terms
themselves, which makes for large index sizes.
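To sketch that direction (untested; plain Lucene4 API, and the field names
"uri"/"text" are just illustrative): one Lucene document per literal, with
the subject URI stored for retrieval and the text analyzed but not stored,
so the index stays small:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

IndexWriterConfig cfg =
    new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
IndexWriter writer = new IndexWriter(FSDirectory.open(new File("text-index")), cfg);

Document doc = new Document();
doc.add(new StringField("uri", "http://example/doc1", Field.Store.YES));  // stored: returned as the answer
doc.add(new TextField("text", "text of the literal", Field.Store.NO));    // analyzed only: searchable, not stored
writer.addDocument(doc);
writer.close();

A search over "text" then yields the stored "uri" values - "textquery ->
uri" - without keeping the literal bodies in the index.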
Andy
*Stack Trace:*
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuilder.append(StringBuilder.java:119)
    at org.apache.jena.larq.LARQ.hash(LARQ.java:266)
    at org.apache.jena.larq.LARQ.unindex(LARQ.java:132)
    at org.apache.jena.larq.IndexBuilderNode.unindex(IndexBuilderNode.java:116)
    at org.apache.jena.larq.IndexBuilderNode.index(IndexBuilderNode.java:83)
    at org.apache.jena.larq.IndexBuilderLiteral.indexStatement(IndexBuilderLiteral.java:88)
    at org.apache.jena.larq.IndexBuilderModel.indexStatements(IndexBuilderModel.java:84)
    at RDFIndexer.main(RDFIndexer.java:53)
line 53: larqBuilder.indexStatements(model.listStatements());
*Running:*
Mac OS X 10.8.3
2.66 GHz Intel Core i7, 64-bit
8 GB RAM
and setting VM arg -Xmx2048m
*Code:*
IndexBuilderString larqBuilder = new IndexBuilderString();
Dataset dataset = TDBFactory.createDataset(dir);
Model model = dataset.getDefaultModel();
larqBuilder.indexStatements(model.listStatements());
Please let me know if you need any other information. Thanks
On Mon, Mar 18, 2013 at 12:10 PM, Andy Seaborne <[email protected]> wrote:
On 18/03/13 01:50, Martino Buffolino wrote:
Thanks for the response, Andy.
So I guess the overall picture would be that I have a TDB dataset stored
on disk and I would like to query it using Lucene text match like the
following:
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT ?doc {
  ?lit pf:textMatch '+text' .
  ?doc ?p ?lit
}
If I index by partitions of the dataset, can I store that to disk so I
don't have to repeat the process again?
Yes - the Lucene index should be on disk.
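e.g. (untested; constructor and method names as per the LARQ 1.0 docs -
and, if I remember right, the no-argument constructor keeps the index in
memory, so pass a directory to the builder):

import java.io.File;
import org.apache.jena.larq.IndexBuilderString;
import org.apache.jena.larq.IndexLARQ;
import org.apache.jena.larq.LARQ;

File indexDir = new File("lucene-index") ;              // any directory on disk
IndexBuilderString larqBuilder = new IndexBuilderString(indexDir) ;
larqBuilder.indexStatements(model.listStatements()) ;
larqBuilder.closeWriter() ;
IndexLARQ index = larqBuilder.getIndex() ;
LARQ.setDefaultIndex(index) ;                           // make pf:textMatch use this index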
(/me still not clear where it runs out of memory - what's your system?
32-bit? What's the stack trace?)
On Sun, Mar 17, 2013 at 8:57 AM, Andy Seaborne <[email protected]> wrote:
On 17/03/13 00:45, Martino Buffolino wrote:
Hi,
I built a large dataset using tdbloader and now I would like to query it
by using a Lucene index. I've tried to index using
larqBuilder.indexStatements(model.listStatements()); which led to an
out-of-memory exception.
Could you give some more details?
It might be that it is using up RAM for something, but it might also be
that the model has many large text literals which, combined with all the
other uses of heap, are causing the problem, rather than LARQ per se.
Is there another approach to do this?
If it's a large database, then doing it in sections is a possibility.
What might work (given I'm not sure where it is running out of memory) is
to: get an iterator, e.g. model.listStatements(), then index some selection
of it (e.g. 1,000 items), then close and reopen the index, then index
another 1,000 items from the iterator.
Andy
Thanks