Re: OutOfMemoryError for PDF document upload into Solr

Jack Krupansky Fri, 16 Jan 2015 07:44:38 -0800

It would be nice to have a SolrJ-level implementation as well as a
command-line implementation of the extraction request handler, so that app
ingestion code could do the extraction outside of Solr at the app level, or
even in a separate process that streams to the app or to Solr. That would
permit the app to do customization, entity extraction, boilerplate removal,
etc. in app-friendly code before transport to the Solr server.
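
To make the idea concrete, here is a minimal sketch in Python (not SolrJ) of the app-level flow: text that was already extracted outside Solr is cleaned up in app code and packaged as a JSON update for Solr's update handler. The core name "docs", the field names "id" and "content", and the helper names are assumptions for illustration and would have to match your own schema. The request is only built here, not sent.

```python
import json
import re
import urllib.request


def clean_text(raw):
    """App-side customization hook: here it only collapses runs of whitespace."""
    return re.sub(r"\s+", " ", raw).strip()


def build_update_request(solr_url, doc_id, text):
    """Build (but do not send) a JSON update request for one document.

    solr_url is assumed to look like http://localhost:8983/solr/docs,
    where "docs" is a hypothetical core name.
    """
    body = json.dumps([{"id": doc_id, "content": clean_text(text)}]).encode("utf-8")
    return urllib.request.Request(
        solr_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_update_request(
        "http://localhost:8983/solr/docs", "report-1", "Some   extracted\ntext"
    )
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # To actually index the document: urllib.request.urlopen(req)
```

Anything heavier (entity extraction, boilerplate removal) would slot into clean_text, so the Solr server only ever sees plain fields.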


The extraction request handler is a really cool feature and quite
sufficient for a lot of scenarios, but additional architectural flexibility
would be a big win.

-- Jack Krupansky

On Fri, Jan 16, 2015 at 10:21 AM, Charlie Hull <char...@flax.co.uk> wrote:

> On 16/01/2015 04:02, Dan Davis wrote:
>
>> Why re-write all the document conversion in Java ;)  Tika is very slow.
>> 5 GB PDF is very big.
>>
>
> Or you can run Tika in a separate process, or even on a separate machine,
> wrapped with something to cope if it dies due to some horrible input...we
> generally avoid document format translation within Solr and do it
> externally before feeding documents to Solr.
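
A minimal sketch of the "wrapped with something to cope if it dies" part, in Python. It runs an arbitrary extractor command in a child process with a timeout and returns None on any failure, so a crash or hang on a horrible input never takes down the caller. The function name and the timeout default are assumptions for illustration, not anything from Tika or Solr.

```python
import subprocess
import sys


def extract_in_subprocess(command, timeout_seconds=300):
    """Run an external extractor and return its stdout as text, or None on failure.

    A missing binary, a non-zero exit, or a timeout in the child is reported
    as None instead of propagating to the caller.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

With the Tika app jar this could be called as extract_in_subprocess(["java", "-Xmx1g", "-jar", "tika-app.jar", "--text", "big.pdf"]), where the jar and PDF paths are placeholders. The child JVM gets its own heap limit, so an OutOfMemoryError there stays in the child.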
>
> Charlie
>
>
>> If you have a lot of PDFs like that, try pdftotext in HTML and UTF-8 output
>> mode.   The HTML mode captures some metadata that would otherwise be
>> lost.
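
As a rough sketch of that suggestion, assuming the poppler-utils build of pdftotext, where -htmlmeta produces simple HTML including the metadata and -enc sets the output encoding (check pdftotext -h on your install). The function names and file paths are hypothetical.

```python
import subprocess


def pdftotext_command(pdf_path, html_path):
    """Command line for pdftotext with HTML-plus-metadata output in UTF-8."""
    return ["pdftotext", "-htmlmeta", "-enc", "UTF-8", pdf_path, html_path]


def convert(pdf_path, html_path):
    """Run pdftotext; raises CalledProcessError if it exits non-zero."""
    subprocess.run(pdftotext_command(pdf_path, html_path), check=True)
```

Calling convert("big.pdf", "big.html") would write the HTML next to the PDF, ready to be parsed and fed to Solr by the ingestion code.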
>>
>>
>> If you need to go faster still, you can also write some code linked
>> directly against the poppler library.
>>
>> Before you jump down my throat about Tika being slow - I wrote a PDF
>> indexer that ran at 36 MB/s per core.   Different indexer, all C, lots of
>> setjmp/longjmp.   But fast...
>>
>>
>>
>> On Thu, Jan 15, 2015 at 1:54 PM, <ganesh.ya...@sungard.com> wrote:
>>
>>> Siegfried and Michael, thank you for your replies and help.
>>>
>>> -----Original Message-----
>>> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
>>> Sent: Thursday, January 15, 2015 3:45 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: OutOfMemoryError for PDF document upload into Solr
>>>
>>> Hi Ganesh,
>>>
>>> you can increase the heap size but parsing a 4 GB PDF document will very
>>> likely consume A LOT OF memory - I think you need to check if that large
>>> PDF can be parsed at all :-)
>>>
>>> Cheers,
>>>
>>> Siegfried Goeschl
>>>
>>> On 14.01.15 18:04, Michael Della Bitta wrote:
>>>
>>>> Yep, you'll have to increase the heap size for your Tomcat container.
>>>>
>>>> http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly
>>>>
>>>> Michael Della Bitta
>>>>
>>>> Senior Software Engineer
>>>>
>>>> o: +1 646 532 3062
>>>>
>>>> appinions inc.
>>>>
>>>> “The Science of Influence Marketing”
>>>>
>>>> 18 East 41st Street
>>>>
>>>> New York, NY 10017
>>>>
>>>> t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
>>>> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
>>>> w: appinions.com <http://www.appinions.com/>
>>>>
>>>> On Wed, Jan 14, 2015 at 12:00 PM, <ganesh.ya...@sungard.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Can someone pass on some hints to get around the following error? Is there
>>>>> any heap size parameter I can set in Tomcat or in the Solr webapp
>>>>> deployed in it?
>>>>>
>>>>> I am running the Solr webapp inside Tomcat on my local machine, which has
>>>>> 12 GB of RAM. I have a PDF document of up to 4 GB in size that
>>>>> needs to be loaded into Solr.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space
>>>>>           at java.util.AbstractCollection.toArray(Unknown Source)
>>>>>           at java.util.ArrayList.<init>(Unknown Source)
>>>>>           at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
>>>>>           at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
>>>>>           at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
>>>>>           at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
>>>>>           at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
>>>>>           at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
>>>>>           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>>>>           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>>>>           at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>>>>           at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>>>>>           at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>           at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>           at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
>>>>>           at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>>>>>           at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
>>>>>           at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
>>>>>           at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>>>>>           at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
>>>>>           at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
>>>>>           at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
>>>>>           at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
>>>>>           at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
>>>>>           at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
>>>>>           at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
>>>>>           at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
>>>>>           at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
>>>>>           at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
>>>>>           at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
>>>>>           at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2462)
>>>>>           at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2451)
>>>>>
>>>>>
>>>>> Thanks
>>>>> Ganesh
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>
