Otis,

I haven't tried it yet, but what I meant is this:
If we divide the content into multiple parts, words will be split across two
different Solr documents. If the main document contains 'Hello World', then
these two words might get indexed in two different documents. Searching for
'Hello World' won't give me the required search result unless I use OR in the
query.
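To illustrate the concern, here is a minimal chunking sketch (the class and the chunk size are made up for illustration, not Solr code). Cutting only at whitespace keeps individual words intact, but the two words of a phrase like 'Hello World' can still land in different chunk documents, so a phrase or AND query over the chunks would not match:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split text into chunks of at most maxLen chars,
// backing up to the previous whitespace so no single word is cut in half
// (a word longer than maxLen is still cut). Each chunk would become its
// own Solr document, which is exactly why a phrase can be separated.
public class Chunker {
    public static List<String> chunkOnWhitespace(String text, int maxLen) {
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxLen, text.length());
            if (end < text.length()) {
                // back up to the last whitespace before the cut point
                int ws = text.lastIndexOf(' ', end);
                if (ws > start) end = ws;
            }
            chunks.add(text.substring(start, end));
            start = end;
            // skip the separating whitespace before the next chunk
            while (start < text.length() && text.charAt(start) == ' ') start++;
        }
        return chunks;
    }
}
```

With a small chunk size, "Hello World again and again" splits into "Hello World", "again and", "again"; if the cut had fallen one word earlier, "Hello" and "World" would sit in two different documents and a phrase query would miss.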

Thanks,
Siddharth

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Tuesday, February 17, 2009 9:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Outofmemory error for large files

Siddharth,

At the end of your email you said:
"One option I see is to break the file in chunks, but with this, I won't be 
able to search with multiple words if they are distributed in different 
documents."

Unless I'm missing something unusual about your application, I don't think the 
above is technically correct.  Have you tried doing this and have you 
then tried your searches?  Everything should still work, even if you index one 
document at a time.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




________________________________
From: "Gargate, Siddharth" <sgarg...@ptc.com>
To: solr-user@lucene.apache.org
Sent: Monday, February 16, 2009 2:00:58 PM
Subject: Outofmemory error for large files


I am trying to index a text file of around 150 MB with a 1024 MB max heap, but
I get an OutOfMemoryError in the SolrJ code.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
    at java.lang.StringBuffer.append(StringBuffer.java:320)
    at java.io.StringWriter.write(StringWriter.java:60)
    at org.apache.solr.common.util.XML.escape(XML.java:206)
    at org.apache.solr.common.util.XML.escapeCharData(XML.java:79)
    at org.apache.solr.common.util.XML.writeXML(XML.java:149)
    at org.apache.solr.client.solrj.util.ClientUtils.writeXML(ClientUtils.java:115)
    at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:200)
    at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:178)
    at org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(UpdateRequest.java:173)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:136)
    at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:243)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)


I modified the UpdateRequest class to initialize the StringWriter in
UpdateRequest.getXML with an initial size, and cleared the SolrInputDocument
that holds the reference to the file text. Then I get an OOM as below:


Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.lang.StringCoding.safeTrim(StringCoding.java:64)
    at java.lang.StringCoding.access$300(StringCoding.java:34)
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:251)
    at java.lang.StringCoding.encode(StringCoding.java:272)
    at java.lang.String.getBytes(String.java:947)
    at org.apache.solr.common.util.ContentStreamBase$StringStream.getStream(ContentStreamBase.java:142)
    at org.apache.solr.common.util.ContentStreamBase$StringStream.getReader(ContentStreamBase.java:154)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:61)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
    at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:249)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)


After increasing the heap size to 1250 MB, I get this OOM:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:216)
    at java.lang.StringBuffer.toString(StringBuffer.java:585)
    at com.ctc.wstx.util.TextBuffer.contentsAsString(TextBuffer.java:403)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:821)
    at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:276)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
    at org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:249)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)


So it looks like I won't be able to get past these OOMs.
Is there any way to avoid them? One option I see is to break the file into
chunks, but then I won't be able to search for multiple words if they are
distributed across different documents.
Also, can somebody tell me the minimum heap size required relative to the file
size so that the document gets indexed successfully?
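A rough back-of-envelope based on the traces above (the multipliers below are assumptions, not measurements): the file content sits in a Java String (UTF-16, about 2 bytes per char), UpdateRequest.getXML builds an escaped copy in a StringWriter, String.getBytes() makes yet another copy for the content stream, and StringBuffer capacity doubling adds transient slack on top. That already suggests a peak of several times the file size:

```java
// Hypothetical heap estimate for indexing one large text file via the
// SolrJ XML update path. Each term is a rough multiple of the file size.
public class HeapEstimate {
    public static long estimateBytes(long fileBytes) {
        long asString    = 2 * fileBytes; // content as a Java String (UTF-16)
        long xmlCopy     = 2 * fileBytes; // escaped XML copy in the StringWriter
        long encodedCopy = 1 * fileBytes; // String.getBytes() for the stream
        long growthSlack = 2 * fileBytes; // StringBuffer doubling, transient peaks
        return asString + xmlCopy + encodedCopy + growthSlack;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 150 MB file -> roughly 7x the file size at peak
        System.out.println(HeapEstimate.estimateBytes(150 * mb) / mb + " MB"); // prints "1050 MB"
    }
}
```

Under these assumptions a 150 MB file needs on the order of 1 GB of heap just for the in-flight copies, which would explain why both the 1024 MB and 1250 MB heaps fail once the server-side XML parsing adds its own copies.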

Thanks,
Siddharth
