It would be nice to have a SolrJ-level implementation, as well as a command-line implementation, of the extraction request handler, so that app ingestion code could do the extraction outside of Solr at the app level, or even as a separate process that streams to the app or to Solr. That would permit the app to do customization, entity extraction, boilerplate removal, etc. in app-friendly code before transport to the Solr server.
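As a rough illustration of that app-level flow, here is a minimal Java sketch that cleans already-extracted text and posts it to Solr's JSON update endpoint. Everything named here is an assumption for illustration only: the class and method names, the `content` field, the "Page N of M" rule standing in for real boilerplate removal, and the update URL, which depends on your core name and schema.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class AppSideIngest {
    // Hypothetical stand-in for boilerplate removal: drops blank lines and
    // "Page N" / "Page N of M" lines, and trims the lines it keeps.
    static String stripBoilerplate(String text) {
        StringBuilder sb = new StringBuilder();
        for (String line : text.split("\\R")) {
            String t = line.trim();
            if (t.isEmpty() || t.matches("Page \\d+( of \\d+)?")) continue;
            if (sb.length() > 0) sb.append('\n');
            sb.append(t);
        }
        return sb.toString();
    }

    // Minimal JSON string escaping: quotes, backslashes, control characters.
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') sb.append('\\').append(c);
            else if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
            else sb.append(c);
        }
        return sb.append('"').toString();
    }

    // Builds a one-document JSON update body; "content" is an assumed field name.
    static String toSolrJson(String id, String text) {
        return "[{\"id\":" + jsonString(id) + ",\"content\":" + jsonString(text) + "}]";
    }

    // Posts the body to a Solr update URL and returns the HTTP status code.
    static int postToSolr(String updateUrl, String json) throws IOException, InterruptedException {
        HttpRequest req = HttpRequest.newBuilder(URI.create(updateUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        return HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.discarding())
                .statusCode();
    }

    public static void main(String[] args) throws Exception {
        String extracted = "Annual report\nPage 1 of 2\n\nRevenue grew.";
        String json = toSolrJson("doc1", stripBoilerplate(extracted));
        System.out.println(json);
        // Only contact Solr when a URL is given, for example
        // http://localhost:8983/solr/mycore/update?commit=true (assumed core name).
        if (args.length > 0) System.out.println(postToSolr(args[0], json));
    }
}
```

The point of the split is that the extraction and cleanup steps run in the app's own JVM, so a parser failure there never touches the Solr server's heap.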
The extraction request handler is a really cool feature and quite sufficient for a lot of scenarios, but additional architectural flexibility would be a big win.

-- Jack Krupansky

On Fri, Jan 16, 2015 at 10:21 AM, Charlie Hull <char...@flax.co.uk> wrote:

> On 16/01/2015 04:02, Dan Davis wrote:
>
>> Why re-write all the document conversion in Java ;) Tika is very slow.
>> 5 GB PDF is very big.
>
> Or you can run Tika in a separate process, or even on a separate machine,
> wrapped with something to cope if it dies due to some horrible input...we
> generally avoid document format translation within Solr and do it
> externally before feeding documents to Solr.
>
> Charlie
>
>> If you have a lot of PDF like that try pdftotext in HTML and UTF-8 output
>> mode. The HTML mode captures some meta-data that would otherwise be
>> lost.
>>
>> If you need to go faster still, you can also write some stuff linked
>> directly against poppler library.
>>
>> Before you jump down my throat about Tika being slow - I wrote a PDF
>> indexer that ran at 36 MB/s per core. Different indexer, all C, lots of
>> setjmp/longjmp. But fast...
>>
>> On Thu, Jan 15, 2015 at 1:54 PM, <ganesh.ya...@sungard.com> wrote:
>>
>>> Siegfried and Michael Thank you for your replies and help.
>>>
>>> -----Original Message-----
>>> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
>>> Sent: Thursday, January 15, 2015 3:45 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: OutOfMemoryError for PDF document upload into Solr
>>>
>>> Hi Ganesh,
>>>
>>> you can increase the heap size but parsing a 4 GB PDF document will very
>>> likely consume A LOT OF memory - I think you need to check if that large
>>> PDF can be parsed at all :-)
>>>
>>> Cheers,
>>>
>>> Siegfried Goeschl
>>>
>>> On 14.01.15 18:04, Michael Della Bitta wrote:
>>>
>>>> Yep, you'll have to increase the heap size for your Tomcat container.
>>>>
>>>> http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly
>>>>
>>>> Michael Della Bitta
>>>> Senior Software Engineer
>>>> o: +1 646 532 3062
>>>>
>>>> appinions inc.
>>>> “The Science of Influence Marketing”
>>>> 18 East 41st Street
>>>> New York, NY 10017
>>>>
>>>> t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions
>>>> <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
>>>> w: appinions.com <http://www.appinions.com/>
>>>>
>>>> On Wed, Jan 14, 2015 at 12:00 PM, <ganesh.ya...@sungard.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Can someone pass on the hints to get around following error? Is there
>>>>> any Heap Size parameter I can set in Tomcat or in Solr webApp that
>>>>> gets deployed in Solr?
>>>>>
>>>>> I am running Solr webapp inside Tomcat on my local machine which has
>>>>> RAM of 12 GB. I have PDF document which is 4 GB max in size that
>>>>> needs to be loaded into Solr
>>>>>
>>>>> Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space
>>>>>         at java.util.AbstractCollection.toArray(Unknown Source)
>>>>>         at java.util.ArrayList.<init>(Unknown Source)
>>>>>         at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
>>>>>         at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
>>>>>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
>>>>>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
>>>>>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
>>>>>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
>>>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>>>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>>>>         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>>>>>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>         at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
>>>>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>>>>>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
>>>>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
>>>>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>>>>>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
>>>>>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
>>>>>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
>>>>>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
>>>>>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
>>>>>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
>>>>>         at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
>>>>>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
>>>>>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
>>>>>         at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
>>>>>         at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
>>>>>         at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2462)
>>>>>         at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2451)
>>>>>
>>>>> Thanks
>>>>> Ganesh
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile: +44 (0)7767 825828
> web: www.flax.co.uk
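
The advice in this thread to run extraction in a separate process, wrapped so that the indexer survives a crash or a hang on bad input, could be sketched roughly as below. This is only an illustration under stated assumptions: the class and method names, the 600-second timeout, and the file paths are made up here, and the command assumes the poppler-utils `pdftotext` binary is on the PATH, with `-htmlmeta` and `-enc UTF-8` taken to be the HTML and UTF-8 output mode mentioned above.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.TimeUnit;

class ExternalExtractor {
    // Command line for pdftotext with HTML-with-metadata output in UTF-8.
    static List<String> pdftotextCommand(Path pdf, Path out) {
        return List.of("pdftotext", "-htmlmeta", "-enc", "UTF-8",
                pdf.toString(), out.toString());
    }

    // Runs the command in a child process. Returns true only if it exits
    // with status 0 within the timeout; a missing binary, a crash, or a
    // hang all come back as false instead of taking the caller down.
    static boolean runWithTimeout(List<String> command, long timeoutSeconds) {
        Process p;
        try {
            p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                    .start();
        } catch (IOException e) {
            return false; // binary not found or not executable
        }
        try {
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly(); // hung on some horrible input
                return false;
            }
        } catch (InterruptedException e) {
            p.destroyForcibly();
            Thread.currentThread().interrupt();
            return false;
        }
        return p.exitValue() == 0;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: ExternalExtractor <in.pdf> <out.html>");
            return;
        }
        boolean ok = runWithTimeout(
                pdftotextCommand(Paths.get(args[0]), Paths.get(args[1])), 600);
        System.out.println(ok ? "extracted" : "extraction failed or timed out; skipping document");
    }
}
```

Because the parser's memory lives in the child process, a 4 GB PDF that exhausts it fails that one document rather than the servlet container hosting Solr.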