Hey Andy,
I made a dumb mistake, but I'll share it. I was using the default constructor
( IndexBuilderString larqBuilder = new IndexBuilderString(); ), which keeps the
index in main memory, and since I was indexing a large portion of the DBpedia
triples I would run out of memory. To fix it, I switched to
IndexBuilderString larqBuilder = new IndexBuilderString("some/file/path");
which writes the index to that folder, and it works like a charm.
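For completeness, the indexing step now looks roughly like this (the path and
the dir variable are just placeholders for my own values):

// Build the LARQ index on disk instead of in memory
IndexBuilderString larqBuilder = new IndexBuilderString("some/file/path");

// Open the TDB dataset and index the string literals of its statements
Dataset dataset = TDBFactory.createDataset(dir);
Model model = dataset.getDefaultModel();
larqBuilder.indexStatements(model.listStatements());

// Flush and close the Lucene writer when finished
larqBuilder.closeWriter();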
Then, to query the LARQ index, I added:
// LOAD LARQ Index
IndexWriter indexWriter =
    IndexWriterFactory.create(FSDirectory.open(new File(larqIndex)));
IndexBuilderString larqBuilder = new IndexBuilderString(indexWriter);
IndexLARQ index = larqBuilder.getIndex();

// LOAD TDB Dataset (actual n-triple store)
Dataset dataset = TDBFactory.createDataset(dir);
Model model = dataset.getDefaultModel();

larqBuilder.closeWriter();
model.unregister(larqBuilder);
LARQ.setDefaultIndex(index);

// query goes here
and it works beautifully.
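In case it helps anyone else, the query step itself is just ordinary ARQ once
the default index is set - roughly (same model as above, query text as in the
thread below):

String queryString =
    "PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#> " +
    "SELECT ?doc { ?lit pf:textMatch '+text' . ?doc ?p ?lit }";

QueryExecution qexec = QueryExecutionFactory.create(queryString, model);
try {
    // pf:textMatch is resolved against the index set via LARQ.setDefaultIndex
    ResultSet results = qexec.execSelect();
    ResultSetFormatter.out(System.out, results);
} finally {
    qexec.close();
}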
Thanks
On Mon, Apr 1, 2013 at 12:53 PM, Andy Seaborne <[email protected]> wrote:
> On 25/03/13 18:26, Martino Buffolino wrote:
>
>> Sorry I became a little sidetracked over the last week.
>>
>
> Ditto.
>
>
>> This occurs when indexing a model of size 19169727.
>>
>
> What I *think* is happening is that the Lucene writer is growing too large.
>
> I can see, maybe, two workarounds:
>
> 1/ setAvoidDuplicates(false)
>
> 2/ Close/reopen the index writer every so often to stop RAM-state
> build-up.
>
> something like (untested):
>
> StmtIterator sIter = model.listStatements() ;
>
> for ( int i = 0 ; sIter.hasNext() ; i++ )
> {
>     larqBuilder.indexStatement(sIter.next()) ;
>     if ( i % 10000 == 9999 )
>     {
>         // close the index writer here
>         // reopen the index writer here
>     }
> }
>
> larqBuilder.close()
>
> Sorry this isn't a definite solution - I don't see a way to get hold of
> the Lucene writer to call .commit.
>
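> One (equally untested) possibility that avoids reaching for the Lucene writer
> is to close the builder's writer and construct a fresh builder over the same
> on-disk directory - assuming the file-backed constructor reopens, rather than
> recreates, an existing index:
>
>     IndexBuilderString larqBuilder = new IndexBuilderString("some/file/path") ;
>     StmtIterator sIter = model.listStatements() ;
>
>     for ( int i = 0 ; sIter.hasNext() ; i++ )
>     {
>         larqBuilder.indexStatement(sIter.next()) ;
>         if ( i % 10000 == 9999 )
>         {
>             larqBuilder.closeWriter() ;    // flush RAM-held state to disk
>             larqBuilder = new IndexBuilderString("some/file/path") ;   // reopen
>         }
>     }
>     larqBuilder.closeWriter() ;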
>
> My preferred long-term solution is to start again with a fully integrated
> approach to text indexing, based on datasets, assemblers, Lucene4 and Solr4
> (and then it all works with Fuseki). LARQ1 predates much of datasets and
> Fuseki - in this case, evolution is hard, revolution is easier.
>
> A better approach would seem to be an index that just maps
> "text query -> URI" for a definable property. LARQ1 also stores terms, which
> makes for large index sizes.
>
> Andy
>
>
>> *Stack Trace:*
>>
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>     at java.util.Arrays.copyOf(Arrays.java:2882)
>>     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
>>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
>>     at java.lang.StringBuilder.append(StringBuilder.java:119)
>>     at org.apache.jena.larq.LARQ.hash(LARQ.java:266)
>>     at org.apache.jena.larq.LARQ.unindex(LARQ.java:132)
>>     at org.apache.jena.larq.IndexBuilderNode.unindex(IndexBuilderNode.java:116)
>>     at org.apache.jena.larq.IndexBuilderNode.index(IndexBuilderNode.java:83)
>>     at org.apache.jena.larq.IndexBuilderLiteral.indexStatement(IndexBuilderLiteral.java:88)
>>     at org.apache.jena.larq.IndexBuilderModel.indexStatements(IndexBuilderModel.java:84)
>>     at RDFIndexer.main(RDFIndexer.java:53)
>>
>> line 53: larqBuilder.indexStatements(model.listStatements());
>>
>> *Running:*
>>
>>
>> Mac OS X 10.8.3
>> 2.66 GHz Intel Core i7, 64-bit
>> 8 GB Ram
>>
>> and setting VM arg -Xmx2048m
>>
>>
>
>> *Code:*
>>
>> IndexBuilderString larqBuilder = new IndexBuilderString();
>>
>> Dataset dataset = TDBFactory.createDataset(dir);
>> Model model = dataset.getDefaultModel();
>> larqBuilder.indexStatements(model.listStatements());
>>
>>
>> Please let me know if you need any other information. Thanks
>>
>>
>> On Mon, Mar 18, 2013 at 12:10 PM, Andy Seaborne <[email protected]> wrote:
>>
>> On 18/03/13 01:50, Martino Buffolino wrote:
>>>
>>> Thanks for the response Andy.
>>>>
>>>> So I guess the overall picture would be that I have a TDB dataset stored
>>>> on disk and I would like to query it using lucene text match like the
>>>> following:
>>>>
>>>> PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
>>>> SELECT ?doc {
>>>>   ?lit pf:textMatch '+text' .
>>>>   ?doc ?p ?lit
>>>> }
>>>>
>>>> If I index by partitions of the dataset, can I store that to disk so I
>>>> don't have to repeat the process again?
>>>>
>>>>
>>> Yes - the lucene index should be on disk.
>>>
>>> (/me still not clear where it runs out of memory - what's your system? 32
>>> bit? What's the stacktrace?)
>>>
>>>
>>>
>>>> On Sun, Mar 17, 2013 at 8:57 AM, Andy Seaborne <[email protected]> wrote:
>>>>
>>>> On 17/03/13 00:45, Martino Buffolino wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>>
>>>>>> I built a large dataset using tdbloader and now I would like to query it
>>>>>> by using a lucene index. I've tried to index by using
>>>>>> larqBuilder.indexStatements(model.listStatements()); which led to an
>>>>>> out of memory exception.
>>>>>>
>>>>>>
>>>>>> Could you give some more details?
>>>>>
>>>>> It might be that it is using up RAM for something, but it might also be
>>>>> because the model has many large text literals which, combined with all
>>>>> the other uses of heap, is causing the problem, rather than LARQ per se.
>>>>>
>>>>>
>>>>> Is there another approach to do this?
>>>>>
>>>>>
>>>>>>
>>>>>> If it's a large database, then doing it in sections is a possibility.
>>>>>
>>>>> What might work (given I'm not sure where it is running out of memory)
>>>>> is to:
>>>>>
>>>>> Get an iterator, e.g. model.listStatements(), then index some selection
>>>>> of it (e.g. 1,000 items), then close and reopen the index, then index
>>>>> another 1,000 items from the iterator.
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>