Hey Andy,
I made a dumb mistake, but I'll share it. I was using the default constructor
( IndexBuilderString larqBuilder = new IndexBuilderString(); ), which keeps the
index in main memory, and since I was indexing a large portion of the DBpedia
triples I would run out of memory. To fix it, I switched to
IndexBuilderString larqBuilder = new IndexBuilderString("some/file/path");
which writes the index to that folder, and it works like a charm.
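For completeness, the indexing step now looks roughly like this (the path and
the dir variable are just placeholders for my own values):

// Build the LARQ index on disk instead of in memory
IndexBuilderString larqBuilder = new IndexBuilderString("some/file/path");

// Open the TDB dataset and index the string literals of its statements
Dataset dataset = TDBFactory.createDataset(dir);
Model model = dataset.getDefaultModel();
larqBuilder.indexStatements(model.listStatements());

// Flush and close the Lucene writer when finished
larqBuilder.closeWriter();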
Then, to query the LARQ index, I added:
// LOAD LARQ Index
IndexWriter indexWriter =
    IndexWriterFactory.create(FSDirectory.open(new File(larqIndex)));
IndexBuilderString larqBuilder = new IndexBuilderString(indexWriter);
IndexLARQ index = larqBuilder.getIndex();

// LOAD TDB Dataset (actual n-triple store)
Dataset dataset = TDBFactory.createDataset(dir);
Model model = dataset.getDefaultModel();

larqBuilder.closeWriter();
model.unregister(larqBuilder);
LARQ.setDefaultIndex(index);

// query goes here
and it works beautifully.
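In case it helps anyone else, the query step itself is just ordinary ARQ once
the default index is set - roughly (same model as above, query text as in the
thread below):

String queryString =
    "PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#> " +
    "SELECT ?doc { ?lit pf:textMatch '+text' . ?doc ?p ?lit }";

QueryExecution qexec = QueryExecutionFactory.create(queryString, model);
try {
    // pf:textMatch is resolved against the index set via LARQ.setDefaultIndex
    ResultSet results = qexec.execSelect();
    ResultSetFormatter.out(System.out, results);
} finally {
    qexec.close();
}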
Thanks
On Mon, Apr 1, 2013 at 12:53 PM, Andy Seaborne <[email protected]> wrote:
> On 25/03/13 18:26, Martino Buffolino wrote:
>
>> Sorry I became a little sidetracked over the last week.
>>
>
> Ditto.
>
>
>> This occurs when indexing a model of size 19169727.
>>
>
> What I *think* is happening is that the Lucene writer is growing too large.
>
> I can see, maybe, two workarounds:
>
> 1/ setAvoidDuplicates(false)
>
> 2/ Close/reopen the index writer every so often to stop RAM-state
> build-up.
>
> something like (untested):
>
> StmtIterator sIter = model.listStatements() ;
>
> for ( int i = 0 ; sIter.hasNext() ; i++ )
> {
>     larqBuilder.indexStatement(sIter.next()) ;
>     if ( i % 10000 == 9999 )
>     {
>         // close the index writer here
>         // reopen the index writer here
>     }
> }
>
> larqBuilder.close()
>
> Sorry this isn't a definite solution - I don't see a way to get hold of
> the Lucene writer to call .commit.
>
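> One (equally untested) possibility that avoids reaching for the Lucene writer
> is to close the builder's writer and construct a fresh builder over the same
> on-disk directory - assuming the file-backed constructor reopens, rather than
> recreates, an existing index:
>
>     IndexBuilderString larqBuilder = new IndexBuilderString("some/file/path") ;
>     StmtIterator sIter = model.listStatements() ;
>
>     for ( int i = 0 ; sIter.hasNext() ; i++ )
>     {
>         larqBuilder.indexStatement(sIter.next()) ;
>         if ( i % 10000 == 9999 )
>         {
>             larqBuilder.closeWriter() ;    // flush RAM-held state to disk
>             larqBuilder = new IndexBuilderString("some/file/path") ;   // reopen
>         }
>     }
>     larqBuilder.closeWriter() ;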
>
> My preferred long-term solution is to start again with a fully integrated
> approach to text indexing, based on datasets, assemblers, Lucene4 and Solr4
> (and then it all works with Fuseki). LARQ1 predates much of datasets and
> Fuseki - in this case, evolution is hard, revolution is easier.
>
> A better approach would seem to be an index that just maps
> "text query -> URI" for a definable property. LARQ1 also stores terms, which
> makes for large index sizes.
>
> Andy
>
>
>> *Stack Trace:*
>>
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>     at java.util.Arrays.copyOf(Arrays.java:2882)
>>     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
>>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
>>     at java.lang.StringBuilder.append(StringBuilder.java:119)
>>     at org.apache.jena.larq.LARQ.hash(LARQ.java:266)
>>     at org.apache.jena.larq.LARQ.unindex(LARQ.java:132)
>>     at org.apache.jena.larq.IndexBuilderNode.unindex(IndexBuilderNode.java:116)
>>     at org.apache.jena.larq.IndexBuilderNode.index(IndexBuilderNode.java:83)
>>     at org.apache.jena.larq.IndexBuilderLiteral.indexStatement(IndexBuilderLiteral.java:88)
>>     at org.apache.jena.larq.IndexBuilderModel.indexStatements(IndexBuilderModel.java:84)
>>     at RDFIndexer.main(RDFIndexer.java:53)
>>
>> line 53: larqBuilder.indexStatements(model.listStatements());
>>
>> *Running:*
>>
>>
>> Mac OS X 10.8.3
>> 2.66 GHz Intel Core i7, 64-bit
>> 8 GB Ram
>>
>> and setting VM arg -Xmx2048m
>>
>>
>
>> *Code:*
>>
>> IndexBuilderString larqBuilder = new IndexBuilderString();
>>
>> Dataset dataset = TDBFactory.createDataset(dir);
>> Model model = dataset.getDefaultModel();
>> larqBuilder.indexStatements(model.listStatements());
>>
>>
>> Please let me know if you need any other information. Thanks
>>
>>
>> On Mon, Mar 18, 2013 at 12:10 PM, Andy Seaborne <[email protected]> wrote:
>>
>> On 18/03/13 01:50, Martino Buffolino wrote:
>>>
>>> Thanks for the response Andy.
>>>>
>>>> So I guess the overall picture would be that I have a TDB dataset stored
>>>> on disk and I would like to query it using lucene text match like the
>>>> following:
>>>>
>>>> PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
>>>> SELECT ?doc {
>>>>   ?lit pf:textMatch '+text' .
>>>>   ?doc ?p ?lit
>>>> }
>>>>
>>>> If I index by partitions of the dataset, can I store that to disk so I
>>>> don't have to repeat the process again?
>>>>
>>>>
>>> Yes - the lucene index should be on disk.
>>>
>>> (/me still not clear where it runs out of memory - what's your system? 32
>>> bit? What's the stacktrace?)
>>>
>>>
>>>
>>>> On Sun, Mar 17, 2013 at 8:57 AM, Andy Seaborne <[email protected]> wrote:
>>>>
>>>> On 17/03/13 00:45, Martino Buffolino wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>>
>>>>>> I built a large dataset using tdbloader and now I would like to query it
>>>>>> by using a lucene index. I've tried to index by using
>>>>>> larqBuilder.indexStatements(model.listStatements()); which led to an
>>>>>> out of memory exception.
>>>>>>
>>>>>>
>>>>>> Could you give some more details?
>>>>>
>>>>> It might be that it is using up RAM for something, but it might also be
>>>>> because the model has many large text literals which, combined with all
>>>>> the other uses of heap, is causing the problem, rather than LARQ per se.
>>>>>
>>>>>
>>>>> Is there another approach to do this?
>>>>>
>>>>>
>>>>>>
>>>>>> If it's a large database, then doing it in sections is a possibility.
>>>>>
>>>>> What might work (given I'm not sure where it is running out of memory)
>>>>> is to:
>>>>>
>>>>> Get an iterator, e.g. model.listStatements(), then index some selection
>>>>> of it (e.g. 1,000 items), then close and reopen the index, then index
>>>>> another 1,000 items from the iterator.
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>