Code for SolrJ is going to be very dependent on your needs, but the beating
heart of my code is below (note that I do OCR as a separate step before feeding
files into the indexer). The SolrJ and Tika docs should help.

    // Imports used below (Tika 1.x and SolrJ):
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    // filename, title, author, idString, password and the SolrClient
    // "solr" come from the surrounding class.
    File f = new File(filename);
    // Integer.MAX_VALUE lifts the default 100k character limit on the body.
    ContentHandler textHandler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    if (filename.toLowerCase().endsWith(".pdf")) {
      // OCR happens in a separate step, so tell the PDF parser to skip it
      // and not to extract inline images.
      PDFParserConfig pdfConfig = new PDFParserConfig();
      pdfConfig.setExtractInlineImages(false);
      pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
      context.set(PDFParserConfig.class, pdfConfig);
      context.set(Parser.class, parser);
    }
    try (InputStream input = new FileInputStream(f)) {
      parser.parse(input, textHandler, metadata, context);
    } catch (Exception e) {
      Logger.getLogger(JsMapAdminService.class.getName())
            .log(Level.SEVERE, String.format("File %s failed", f.getCanonicalPath()), e);
      writeLog(String.format("File %s failed", f.getCanonicalPath()));
      return false;
    }
    SolrInputDocument up = new SolrInputDocument();
    // Fall back to Tika's extracted metadata when title/author were not supplied.
    if (title == null) title = metadata.get("title");
    if (author == null) author = metadata.get("author");
    up.addField("id", f.getCanonicalPath());
    up.addField("location", idString);
    up.addField("title", title);
    up.addField("author", author);
    // ... and so on for all your fields.
    up.addField("_text_", textHandler.toString());
    UpdateRequest req = new UpdateRequest();
    req.add(up);
    req.setBasicAuthCredentials("solrAdmin", password);
    UpdateResponse ur = req.process(solr, "prindex");
    req.commit(solr, "prindex");

-----Original Message-----
From: Srinivas Kashyap <srini...@bamboorose.com.INVALID>
Sent: Tuesday, 25 August 2020 17:04
To: solr-user@lucene.apache.org
Subject: RE: PDF extraction using Tika

Hi Alexandre,

Yes, these are the same PDF files on both Windows and Linux. There are around
30 PDF files, and I tried indexing a single file but hit the same error. Is it
related to how the PDFs are stored on Linux?

And with regard to DIH and Tika going away, can you share a program that
extracts content from PDFs and pushes it into Solr?

Thanks,
Srinivas Kashyap

-----Original Message-----
From: Alexandre Rafalovitch <arafa...@gmail.com>
Sent: 24 August 2020 20:54
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: PDF extraction using Tika

The issue seems to be with a specific file, and at a level well below Solr's,
possibly even below Tika's:
Caused by: java.io.IOException: expected='>' actual='
' at offset 2383
                at org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)

Are you indexing the same files on Windows and Linux? I am guessing not. I
would try to narrow down which of the files it is. One way would be to get a
standalone Tika (make sure to match the version Solr embeds) and run it over
the documents by itself; it will probably complain with the same error.
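
For instance, something along these lines (a rough sketch, not a finished
tool; the class name and directory argument are made up, and the Tika version
on the classpath should match the one Solr embeds):

    import java.io.InputStream;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class FindBadPdf {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            // Try to parse every PDF in the directory given as args[0]
            // and report which files Tika/PDFBox choke on.
            try (DirectoryStream<Path> dir =
                     Files.newDirectoryStream(Paths.get(args[0]), "*.pdf")) {
                for (Path p : dir) {
                    try (InputStream in = Files.newInputStream(p)) {
                        // -1 disables the extracted-content length limit.
                        parser.parse(in, new BodyContentHandler(-1),
                                     new Metadata(), new ParseContext());
                        System.out.println("OK:     " + p);
                    } catch (Exception e) {
                        System.out.println("FAILED: " + p + " -> " + e);
                    }
                }
            }
        }
    }

Any file that prints FAILED with the same PDFBox error is your culprit.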

Regards,
   Alex.
P.S. Additionally, neither DIH nor embedded Tika is recommended for
production, and both will be going away in future Solr versions. You may end
up with a much less brittle pipeline if you save the structured outputs from
those standalone Tika runs and then index them into Solr, possibly
pre-processed.
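
Step two of such a pipeline might look like the sketch below (all names are
placeholders: the Solr URL, collection, directory and id field are
assumptions, and it presumes the text was already extracted with something
like "java -jar tika-app.jar --text file.pdf > file.txt"):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexExtractedText {
        public static void main(String[] args) throws Exception {
            // Index text files produced earlier by a standalone Tika run.
            try (SolrClient solr = new HttpSolrClient.Builder(
                         "http://localhost:8983/solr").build();
                 DirectoryStream<Path> dir =
                     Files.newDirectoryStream(Paths.get("/data/extracted"), "*.txt")) {
                for (Path p : dir) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", p.getFileName().toString());
                    doc.addField("_text_",
                                 new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                    solr.add("mycollection", doc);
                }
                solr.commit("mycollection");
            }
        }
    }

Because extraction and indexing are decoupled, a single bad PDF fails the
extraction step in isolation instead of aborting the whole import.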

On Mon, 24 Aug 2020 at 11:09, Srinivas Kashyap 
<srini...@bamboorose.com.invalid> wrote:
>
> Hello,
>
> We are using TikaEntityProcessor to extract the content out of PDFs and make
> it searchable.
>
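> Our DIH configuration follows the standard FileListEntityProcessor +
> TikaEntityProcessor pattern, roughly like the sketch below (baseDir and
> field names here are placeholders, not our actual values):
>
>     <dataConfig>
>       <dataSource type="BinFileDataSource"/>
>       <document>
>         <entity name="files" processor="FileListEntityProcessor"
>                 baseDir="/data/pdfs" fileName=".*\.pdf"
>                 rootEntity="false" dataSource="null">
>           <entity name="pdf" processor="TikaEntityProcessor"
>                   url="${files.fileAbsolutePath}" format="text">
>             <field column="text" name="_text_"/>
>           </entity>
>         </entity>
>       </document>
>     </dataConfig>
>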
> When Jetty is run on a Windows machine, we are able to load documents
> successfully using a DIH full import (Tika entity). Here the PDFs are kept
> on the Windows file system.
>
> But when Solr/Jetty is run on a Linux machine and we try to run DIH, we get
> the exception below (here the PDFs are kept on the Linux file system):
>
> Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
>     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
>     at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
>     ... 4 more
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
>     at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
>     at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
>     at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
>     ... 6 more
> Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)
>     ... 10 more
> Caused by: java.io.IOException: expected='>' actual='
> ' at offset 2383
>     at org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:226)
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:163)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:510)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>     at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>     ... 15 more
>
> Can you please suggest how to extract PDF content from a Linux-based file
> system?
>
> Thanks,
> Srinivas Kashyap
