Yes, please open a JIRA for this, with as much info as possible.

Lance
On Thu, Nov 3, 2011 at 9:48 AM, P Williams <williams.tricia.l...@gmail.com> wrote:

> Hi All,
>
> I'm experiencing a similar problem to the others in this thread.
>
> I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to
> apache-solr-4.0-2011-10-14_08-56-59.war and then
> apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 PDFs, of various
> sizes, using the TikaEntityProcessor. My indexing would run to completion
> and was completely successful under the June build. The only error was
> the readability of the full text in highlighting; that was fixed in
> Tika 0.10 (TIKA-611). I chose the October 14 build of Solr because
> Tika 0.10 had recently been included (SOLR-2372).
>
> On the same machine, without changing any memory settings, my initial
> problem was a PermGen error. Fine, I increased the PermGen space.
>
> I've set the "onError" parameter to "skip" for the TikaEntityProcessor.
> Now I get several (6) pairs of
>
>   SEVERE: Exception thrown while getting data
>   java.net.SocketTimeoutException: Read timed out
>   SEVERE: Exception in entity :
>   tika:org.apache.solr.handler.dataimport.DataImportHandlerException:
>   Exception in invoking url <url removed> # 2975
>
> and after ~3881 documents, with auto commit set unreasonably frequently,
> I consistently get an Out of Memory error:
>
>   SEVERE: Exception while processing: f document :
>   null:org.apache.solr.handler.dataimport.DataImportHandlerException:
>   java.lang.OutOfMemoryError: Java heap space
>
> The stack trace points to
> org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
> and
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718).
>
> The October 30 build performs identically.
>
> Funny thing is that monitoring via JConsole doesn't reveal any memory
> issues.
>
> Because the Out of Memory error did not occur in June, I believe a bug
> has been introduced to the code since then. Should I open an issue in
> JIRA?
>
> Thanks,
> Tricia
>
> On Tue, Aug 30, 2011 at 12:22 PM, Marc Jacobs <jacob...@gmail.com> wrote:
>
> > Hi Erick,
> >
> > I am using Solr 3.3.0, but I saw the same problems with 1.4.1.
> > The connector is a homemade program in the C# programming language
> > that posts via HTTP remote streaming (i.e.
> > http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1).
> > I'm using Tika to extract the content (it comes with Solr Cell).
> >
> > A possible problem is that the file stream needs to be closed by the
> > client application after extracting, but it seems that something goes
> > wrong when there is a Tika exception: the stream never leaves memory.
> > At least that is my assumption.
> >
> > What is the common way to extract content from office files (PDF, DOC,
> > RTF, XLS, etc.) and index them? To write a content extractor/validator
> > yourself? Or is it possible to do this with Solr Cell without the huge
> > memory consumption? Please let me know. Thanks in advance.
> >
> > Marc
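To illustrate the close-on-exception idiom Marc is describing: here is a
minimal client-side extractor sketch in Java. It is only one possible
answer to his "write it yourself" question, not the code path inside
Solr; it assumes Tika 0.10 on the classpath, and the class name and file
path are placeholders.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class SafeExtract {
        /** Extracts plain text; the stream is closed even when parse() throws. */
        public static String extract(File f) throws Exception {
            InputStream in = new FileInputStream(f);
            try {
                // -1 disables BodyContentHandler's default write limit.
                BodyContentHandler text = new BodyContentHandler(-1);
                new AutoDetectParser().parse(in, text, new Metadata(), new ParseContext());
                return text.toString();
            } finally {
                in.close(); // runs on the exception path too, so nothing leaks
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println(extract(new File("/path/to/file.doc"))); // placeholder path
        }
    }

If parse() throws, the finally block still releases the descriptor and
lets the parser's buffers be collected, so a bad document costs at most
one failed request.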
> >
> > 2011/8/30 Erick Erickson <erickerick...@gmail.com>
> >
> > > What version of Solr are you using, and how are you indexing?
> > > DIH? SolrJ?
> > >
> > > I'm guessing you're using Tika, but how?
> > >
> > > Best
> > > Erick
> > >
> > > On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs <jacob...@gmail.com> wrote:
> > > > Hi all,
> > > >
> > > > Currently I'm testing Solr's indexing performance, but unfortunately
> > > > I'm running into memory problems. It looks like Solr is not closing
> > > > the file stream after an exception, but I'm not really sure.
> > > >
> > > > The current system I'm using has 150GB of memory, and while I'm
> > > > indexing, the memory consumption keeps growing (eventually to more
> > > > than 50GB). In the attached graph I indexed about 70k office
> > > > documents (PDF, DOC, XLS, etc.), and between 1 and 2 percent of
> > > > them throw an exception. The commits happen after 64MB, after 60
> > > > seconds, or after a job (there are 6 evenly divided jobs).
> > > >
> > > > After indexing, the memory consumption doesn't drop. Even after an
> > > > optimize command it's still there. What am I doing wrong? I can't
> > > > imagine I'm the only one with this problem. Thanks in advance!
> > > >
> > > > Kind regards,
> > > >
> > > > Marc
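On Marc's question about the common way: one option is to extract in the
client and post plain text with SolrJ instead of streaming the raw file
to /update/extract, so a Tika failure never ties up memory inside Solr.
A minimal sketch, assuming SolrJ 1.4/3.x and reusing the SafeExtract
sketch above; the URL, id, and field name are placeholders that have to
match your schema.xml.

    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexOneFile {
        public static void main(String[] args) throws Exception {
            // CommonsHttpSolrServer is the SolrJ 1.4/3.x HTTP client class.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");                      // placeholder id
            File f = new File("/path/to/file.doc");       // placeholder path
            doc.addField("text", SafeExtract.extract(f)); // "text" must exist in schema.xml

            server.add(doc);
            server.commit();
        }
    }

--
Lance Norskog
goks...@gmail.com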