Hi Brian, can you send me the email? I would like to play around :-)
Have you opened a JIRA for PDFBox? If not, I will open one if I can reproduce the issue …

Thanks in advance

Siegfried Goeschl

On 25 May 2014, at 04:18, Brian McDowell <brianmc...@gmail.com> wrote:

> Our feeding (indexing) tool halts because Solr becomes unresponsive after
> getting some really bad pdfs. There are levels of pdf "badness." Some just
> will not parse and that's fine, but others are more problematic in that our
> Operations team has to restart Solr because it just hangs and accepts no
> more documents. I actually have identified a pdf that will bring down Solr
> every time. Does anyone think that doing pre-validation using the pdfbox
> jar will work? Or, will trying to validate just hang as well? Any help is
> appreciated.
>
>
> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky
> <j...@basetechnology.com> wrote:
>
>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Siegfried Goeschl
>> Sent: Thursday, May 22, 2014 4:35 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: pdfs
>>
>>
>> Hi folks,
>>
>> for a small customer project I'm running SOLR with embedded Tika.
>>
>> * memory consumption is an issue but can be handled
>> * there is an issue with PDFBox hitting an infinite loop which causes
>> excessive CPU usage - requires SOLR restart but happens only once
>> within 400.000 documents (PDF, Word, etc.) but it seems a little bit
>> erratic since I was never able to track the problem back to a particular
>> PDF document
>>
>> Having said that we wire SOLR with Nagios to get an alarm when CPU
>> consumption goes through the roof
>>
>> If you are doing really serious stuff I would recommend
>> * moving the document extraction stuff out of SOLR
>> * provide monitoring and recovery of stuck document extractions
>> ** killing worker threads
>> ** using external processes and killing them when spinning out of control
>>
>> Cheers,
>>
>> Siegfried Goeschl
>>
>> On 22.05.14 06:46, Jack Krupansky wrote:
>>
>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>> has improved over the years, but still not likely to be perfect.
>>>
>>> That said, I'm not aware of any specific PDF extraction issue that would
>>> bring down Solr - as opposed to causing a 500 status with an exception
>>> in PDF extraction, with the exception of memory usage. Some PDF
>>> documents, especially those which are graphic-intense can require a lot
>>> of memory. The rest of Solr could be adversely affected if all available
>>> JVM heap is consumed. The solution is to give the JVM more heap space.
>>>
>>> So, what is your specific symptom?
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Brian McDowell
>>> Sent: Thursday, May 22, 2014 12:24 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: pdfs
>>>
>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>> Solr completely so that it actually needs to be manually restarted.
>>> We are using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>> problem because the release notes associated with the new tika version and
>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>> would be greatly appreciated. Thank you!
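
To make the "external processes, kill them when they spin out of control" suggestion from the thread concrete, here is a minimal sketch in Python. It is not taken from Solr, Tika or PDFBox. The helper name run_with_timeout and the timeout values are assumptions for illustration, and the command you pass in would be whatever extractor you already run outside Solr.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a child process and give up after timeout_s seconds.

    Returns (ok, output). ok is False if the child timed out or exited
    with a non-zero status. The name and return shape are hypothetical,
    chosen only for this sketch.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising,
        # so a looping extractor does not keep burning CPU.
        return False, b""
    return result.returncode == 0, result.stdout


def demo():
    """Stand-ins for a good and a stuck extraction (not real extractors)."""
    good = [sys.executable, "-c", "print('extracted text')"]
    stuck = [sys.executable, "-c", "while True: pass"]
    return run_with_timeout(good, 30), run_with_timeout(stuck, 1)
```

With something like this in the feeder, a PDF that sends the extractor into an infinite loop costs one timeout and can be logged and skipped, and only documents whose text came back would be posted to Solr.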