Hi Erick,

I am using Solr 3.3.0, but I saw the same problems with 1.4.1. The connector is a homemade C# program that posts documents via HTTP remote streaming (e.g. http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1). I'm using Tika, which ships with Solr Cell, to extract the content on the server side.
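To make that concrete, the request side of my connector boils down to something like the sketch below (simplified, not the actual code; WebClient, the localhost URL, and the integer id are stand-ins):

    using System;
    using System.Net;

    class RemoteStreamingConnector
    {
        // Posts one document via remote streaming: Solr opens the file
        // named in stream.file itself, so no stream exists on the client.
        static void IndexFile(string path, int id)
        {
            string url = string.Format(
                "http://localhost:8080/solr/update/extract?stream.file={0}&literal.id={1}",
                Uri.EscapeDataString(path), id);

            using (var client = new WebClient())
            {
                try
                {
                    Console.WriteLine(client.DownloadString(url));
                }
                catch (WebException ex)
                {
                    // A Tika exception comes back as an HTTP 500; any file
                    // handle involved lives inside the Solr JVM, not here.
                    Console.Error.WriteLine("Failed on {0}: {1}", path, ex.Message);
                }
            }
        }
    }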
A possible problem is that the file stream needs to be closed by the client application after extraction, but something seems to go wrong when Tika throws an exception: the stream never leaves memory. At least, that is my assumption.

What is the common way to extract content from office files (PDF, DOC, RTF, XLS, etc.) and index them? Should I write a content extractor/validator myself, for example by uploading the file body from the client so that my own code owns the stream (see the sketch below the quoted mail)? Or is it possible to do this with Solr Cell without the huge memory consumption?

Please let me know. Thanks in advance.

Marc

2011/8/30 Erick Erickson <erickerick...@gmail.com>

> What version of Solr are you using, and how are you indexing?
> DIH? SolrJ?
>
> I'm guessing you're using Tika, but how?
>
> Best
> Erick
>
> On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs <jacob...@gmail.com> wrote:
> > Hi all,
> >
> > Currently I'm testing Solr's indexing performance, but unfortunately
> > I'm running into memory problems. It looks like Solr is not closing
> > the file stream after an exception, but I'm not really sure.
> >
> > The current system I'm using has 150GB of memory, and while I'm
> > indexing, the memory consumption keeps growing (eventually to more
> > than 50GB). In the attached graph I indexed about 70k office
> > documents (PDF, DOC, XLS, etc.), and between 1 and 2 percent of them
> > threw an exception. The commits happen after 64MB, after 60 seconds,
> > or after a job (there are 6 evenly divided jobs).
> >
> > After indexing, the memory consumption doesn't drop; even after an
> > optimize command it's still there. What am I doing wrong? I can't
> > imagine I'm the only one with this problem. Thanks in advance!
> >
> > Kind regards,
> >
> > Marc
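P.S. This is roughly what I meant by uploading the file body myself: a sketch (assuming the same /update/extract handler; the buffer size and octet-stream content type are my own choices) in which the client opens the FileStream in a using block, so the handle is released even when Tika fails on the server:

    using System;
    using System.IO;
    using System.Net;

    class ClientSideUpload
    {
        // Streams the file body to Solr Cell from the client, so the
        // FileStream is owned (and disposed) by this application.
        static void IndexFile(string path, int id)
        {
            string url = "http://localhost:8080/solr/update/extract?literal.id=" + id;

            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "POST";
            request.ContentType = "application/octet-stream";

            using (FileStream file = File.OpenRead(path))
            using (Stream body = request.GetRequestStream())
            {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = file.Read(buffer, 0, buffer.Length)) > 0)
                    body.Write(buffer, 0, read);
            } // the file handle is closed here, whatever Solr does next

            try
            {
                using (var response = (HttpWebResponse)request.GetResponse())
                using (var reader = new StreamReader(response.GetResponseStream()))
                    Console.WriteLine(reader.ReadToEnd());
            }
            catch (WebException ex)
            {
                // A Tika parse error returns HTTP 500; only this request
                // fails, and no stream stays open on the client.
                Console.Error.WriteLine("Failed on {0}: {1}", path, ex.Message);
            }
        }
    }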