Yes, please open a JIRA for this, with as much info as possible.

Lance
On Thu, Nov 3, 2011 at 9:48 AM, P Williams <williams.tricia.l...@gmail.com> wrote:

> Hi All,
>
> I'm experiencing a similar problem to the others in this thread.
>
> I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to
> apache-solr-4.0-2011-10-14_08-56-59.war and then
> apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 PDFs, of various
> sizes, using the TikaEntityProcessor. My indexing would run to completion
> and was completely successful under the June build. The only error was
> the readability of the full text in highlighting; that was fixed in
> Tika 0.10 (TIKA-611). I chose the October 14 build of Solr because
> Tika 0.10 had recently been included (SOLR-2372).
>
> On the same machine, without changing any memory settings, my initial
> problem was a PermGen error. Fine, I increased the PermGen space.
>
> I've set the "onError" parameter to "skip" for the TikaEntityProcessor.
> Now I get several (6) pairs of
>
>   SEVERE: Exception thrown while getting data
>   java.net.SocketTimeoutException: Read timed out
>   SEVERE: Exception in entity :
>   tika:org.apache.solr.handler.dataimport.DataImportHandlerException:
>   Exception in invoking url <url removed> # 2975
>
> and after ~3881 documents, with auto commit set unreasonably frequently,
> I consistently get an Out of Memory error:
>
>   SEVERE: Exception while processing: f document :
>   null:org.apache.solr.handler.dataimport.DataImportHandlerException:
>   java.lang.OutOfMemoryError: Java heap space
>
> The stack trace points to
> org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
> and
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718).
>
> The October 30 build performs identically.
>
> Funny thing is that monitoring via JConsole doesn't reveal any memory
> issues.
>
> Because the Out of Memory error did not occur in June, I believe a bug
> has been introduced to the code since then. Should I open an issue in
> JIRA?
>
> Thanks,
> Tricia
>
> On Tue, Aug 30, 2011 at 12:22 PM, Marc Jacobs <jacob...@gmail.com> wrote:
>
> > Hi Erick,
> >
> > I am using Solr 3.3.0, but I saw the same problems with 1.4.1.
> > The connector is a homemade program in the C# programming language
> > that posts via HTTP remote streaming (i.e.
> > http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1).
> > I'm using Tika to extract the content (it comes with Solr Cell).
> >
> > A possible problem is that the file stream needs to be closed by the
> > client application after extracting, but it seems that something goes
> > wrong when there is a Tika exception: the stream never leaves memory.
> > At least that is my assumption.
> >
> > What is the common way to extract content from office files (PDF, DOC,
> > RTF, XLS, etc.) and index them? To write a content extractor/validator
> > yourself? Or is it possible to do this with Solr Cell without the huge
> > memory consumption? Please let me know. Thanks in advance.
> >
> > Marc
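To illustrate the close-on-exception idiom Marc is describing: here is a
minimal client-side extractor sketch in Java. It is only one possible
answer to his "write it yourself" question, not the code path inside
Solr; it assumes Tika 0.10 on the classpath, and the class name and file
path are placeholders.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class SafeExtract {
        /** Extracts plain text; the stream is closed even when parse() throws. */
        public static String extract(File f) throws Exception {
            InputStream in = new FileInputStream(f);
            try {
                // -1 disables BodyContentHandler's default write limit.
                BodyContentHandler text = new BodyContentHandler(-1);
                new AutoDetectParser().parse(in, text, new Metadata(), new ParseContext());
                return text.toString();
            } finally {
                in.close(); // runs on the exception path too, so nothing leaks
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println(extract(new File("/path/to/file.doc"))); // placeholder path
        }
    }

If parse() throws, the finally block still releases the descriptor and
lets the parser's buffers be collected, so a bad document costs at most
one failed request.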
> >
> > 2011/8/30 Erick Erickson <erickerick...@gmail.com>
> >
> > > What version of Solr are you using, and how are you indexing?
> > > DIH? SolrJ?
> > >
> > > I'm guessing you're using Tika, but how?
> > >
> > > Best
> > > Erick
> > >
> > > On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs <jacob...@gmail.com> wrote:
> > > > Hi all,
> > > >
> > > > Currently I'm testing Solr's indexing performance, but unfortunately
> > > > I'm running into memory problems. It looks like Solr is not closing
> > > > the file stream after an exception, but I'm not really sure.
> > > >
> > > > The current system I'm using has 150GB of memory, and while I'm
> > > > indexing, the memory consumption keeps growing (eventually to more
> > > > than 50GB). In the attached graph I indexed about 70k office
> > > > documents (PDF, DOC, XLS, etc.), and between 1 and 2 percent of
> > > > them throw an exception. The commits happen after 64MB, after 60
> > > > seconds, or after a job (there are 6 evenly divided jobs).
> > > >
> > > > After indexing, the memory consumption doesn't drop. Even after an
> > > > optimize command it's still there. What am I doing wrong? I can't
> > > > imagine I'm the only one with this problem. Thanks in advance!
> > > >
> > > > Kind regards,
> > > >
> > > > Marc
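On Marc's question about the common way: one option is to extract in the
client and post plain text with SolrJ instead of streaming the raw file
to /update/extract, so a Tika failure never ties up memory inside Solr.
A minimal sketch, assuming SolrJ 1.4/3.x and reusing the SafeExtract
sketch above; the URL, id, and field name are placeholders that have to
match your schema.xml.

    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexOneFile {
        public static void main(String[] args) throws Exception {
            // CommonsHttpSolrServer is the SolrJ 1.4/3.x HTTP client class.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");                      // placeholder id
            File f = new File("/path/to/file.doc");       // placeholder path
            doc.addField("text", SafeExtract.extract(f)); // "text" must exist in schema.xml

            server.add(doc);
            server.commit();
        }
    }

--
Lance Norskog
goks...@gmail.com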