Alternatives to tika for extracting text out of PDFs

Phil Scadden Thu, 07 Dec 2017 16:58:13 -0800

I am indexing PDFs, and a separate process converts any image-only PDFs to 
searchable PDFs before Solr gets near them. I notice that Tika is very slow at 
parsing some PDFs. I don't need any metadata (which I suspect is what slows 
Tika down), just the text. Has anyone used an alternative PDF text extraction 
library in a SolrJ context?
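
To make "just the text" concrete, here is a rough sketch of the kind of extraction step I mean. It is only an illustration under assumptions not stated above: it shells out to Poppler's pdftotext command-line tool (assumed to be installed and on the PATH), and the class and method names are made up for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper class, for illustration only.
class PdfTextOnly {

    // Builds the command line for Poppler's pdftotext (assumed installed).
    // "-enc UTF-8" sets the output encoding; the trailing "-" writes the
    // extracted text to stdout instead of a file.
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs pdftotext and returns the extracted text. No metadata is collected.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(pdf))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore warnings on stderr
                .start();
        // Read stdout fully before waiting, so the child cannot block on a full pipe.
        String text = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with " + exit + " for " + pdf);
        }
        return text;
    }
}
```

The returned string would then go into a text field on a SolrInputDocument and be sent with the usual SolrJ client add call; the field name depends on the schema. Whether this is actually faster than Tika on the slow PDFs is something I would have to measure.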
