On 25/03/13 18:26, Martino Buffolino wrote:
Sorry I became a little sidetracked over the last week.
Ditto.
This occurs when indexing a model of size 19,169,727.
What I *think* is happening is that the Lucene writer is growing too large.
I can see, maybe, two workarounds:
1/ setAvoidDuplicates(false)
2/ Close and reopen the index writer every so often to stop RAM-state
build-up.
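For 1/, that's a one-line switch on the builder before indexing (untested;
assuming setAvoidDuplicates(boolean) on IndexBuilderBase as in LARQ 1.0).
It skips the unindex/hash step that shows up in the stack trace below:

IndexBuilderString larqBuilder = new IndexBuilderString(indexDir) ;
larqBuilder.setAvoidDuplicates(false) ;  // don't check for / remove duplicates on each index() call
larqBuilder.indexStatements(model.listStatements()) ;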
For 2/, something like (untested; assumes larqBuilder was created over a
file-based index directory indexDir):
StmtIterator sIter = model.listStatements() ;
for ( int i = 0 ; sIter.hasNext() ; i++ )
{
    larqBuilder.indexStatement(sIter.next()) ;
    if ( i % 10000 == 9999 )
    {
        // Flush the in-memory Lucene state: close the writer, then
        // reopen the builder over the same index directory.
        larqBuilder.closeWriter() ;
        larqBuilder = new IndexBuilderString(indexDir) ;
    }
}
larqBuilder.closeWriter() ;
Sorry this isn't a definitive solution - I don't see a way to get hold of
the Lucene writer to call .commit().
My preferred long-term solution is to start again with a fully integrated
approach to text indexing, based on datasets, assemblers, Lucene4 and
Solr4 (and then it all works with Fuseki). LARQ1 predates much of the
dataset and Fuseki work - in this case, evolution is hard, revolution is
easier.
A better approach would seem to be an index that simply maps
"textquery -> uri" for a definable property. LARQ1 also stores the terms
themselves, which makes for large index sizes.
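To sketch that direction (untested; plain Lucene4 API, and the field names
"uri"/"text" are just illustrative): one Lucene document per literal, with
the subject URI stored for retrieval and the text analyzed but not stored,
so the index stays small:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

IndexWriterConfig cfg =
    new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
IndexWriter writer = new IndexWriter(FSDirectory.open(new File("text-index")), cfg);

Document doc = new Document();
doc.add(new StringField("uri", "http://example/doc1", Field.Store.YES));  // stored: returned as the answer
doc.add(new TextField("text", "text of the literal", Field.Store.NO));    // analyzed only: searchable, not stored
writer.addDocument(doc);
writer.close();

A search over "text" then yields the stored "uri" values - "textquery ->
uri" - without keeping the literal bodies in the index.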
Andy
*Stack Trace:*
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuilder.append(StringBuilder.java:119)
    at org.apache.jena.larq.LARQ.hash(LARQ.java:266)
    at org.apache.jena.larq.LARQ.unindex(LARQ.java:132)
    at org.apache.jena.larq.IndexBuilderNode.unindex(IndexBuilderNode.java:116)
    at org.apache.jena.larq.IndexBuilderNode.index(IndexBuilderNode.java:83)
    at org.apache.jena.larq.IndexBuilderLiteral.indexStatement(IndexBuilderLiteral.java:88)
    at org.apache.jena.larq.IndexBuilderModel.indexStatements(IndexBuilderModel.java:84)
    at RDFIndexer.main(RDFIndexer.java:53)
line 53: larqBuilder.indexStatements(model.listStatements());
*Running:*
Mac OS X 10.8.3
2.66 GHz Intel Core i7, 64-bit
8 GB RAM
and setting VM arg -Xmx2048m
*Code:*
IndexBuilderString larqBuilder = new IndexBuilderString();
Dataset dataset = TDBFactory.createDataset(dir);
Model model = dataset.getDefaultModel();
larqBuilder.indexStatements(model.listStatements());
Please let me know if you need any other information. Thanks
On Mon, Mar 18, 2013 at 12:10 PM, Andy Seaborne <[email protected]> wrote:
On 18/03/13 01:50, Martino Buffolino wrote:
Thanks for the response, Andy.
So I guess the overall picture would be that I have a TDB dataset stored
on disk and I would like to query it using Lucene text match like the
following:
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT ?doc {
  ?lit pf:textMatch '+text' .
  ?doc ?p ?lit
}
If I index by partitions of the dataset, can I store that to disk so I
don't have to repeat the process again?
Yes - the Lucene index should be on disk.
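e.g. (untested; constructor and method names as per the LARQ 1.0 docs -
and, if I remember right, the no-argument constructor keeps the index in
memory, so pass a directory to the builder):

import java.io.File;
import org.apache.jena.larq.IndexBuilderString;
import org.apache.jena.larq.IndexLARQ;
import org.apache.jena.larq.LARQ;

File indexDir = new File("lucene-index") ;              // any directory on disk
IndexBuilderString larqBuilder = new IndexBuilderString(indexDir) ;
larqBuilder.indexStatements(model.listStatements()) ;
larqBuilder.closeWriter() ;
IndexLARQ index = larqBuilder.getIndex() ;
LARQ.setDefaultIndex(index) ;                           // make pf:textMatch use this index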
(/me still not clear where it runs out of memory - what's your system?
32-bit? What's the stack trace?)
On Sun, Mar 17, 2013 at 8:57 AM, Andy Seaborne <[email protected]> wrote:
On 17/03/13 00:45, Martino Buffolino wrote:
Hi,
I built a large dataset using tdbloader and now I would like to query it
by using a Lucene index. I've tried to index using
larqBuilder.indexStatements(model.listStatements()); which led to an
out-of-memory exception.
Could you give some more details?
It might be that it is using up RAM for something, but it might also be
that the model has many large text literals which, combined with all the
other uses of heap, are causing the problem, rather than LARQ per se.
Is there another approach to do this?
If it's a large database, then doing it in sections is a possibility.
What might work (given I'm not sure where it is running out of memory) is
to: get an iterator, e.g. model.listStatements(), then index some selection
of it (e.g. 1,000 items), then close and reopen the index, then index
another 1,000 items from the iterator.
Andy
Thanks