[Tika content extraction Content Transformation Component] Additional Options

Konrad Holl Wed, 09 Mar 2016 06:46:07 -0800

Hi,

For a client project, I needed to enable OCR for images embedded in PDFs.
Unfortunately, ManifoldCF does not provide configuration options for this.
It would be nice to have the following options for the Tika content extraction:



1. Enable PDF image extraction for OCR:
   https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29

2. Set the default language for Tesseract:
   https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
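
For illustration, the two options above correspond roughly to the following plain Tika calls. This is only a minimal sketch of the Tika side, not ManifoldCF connector code. It assumes tika-parsers (1.7 or later) is on the classpath and Tesseract is installed locally. The class name OcrPdfExtraction, the helper buildContext, and the language code "deu" are made up for the example.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class; not part of ManifoldCF or Tika.
public class OcrPdfExtraction {

    // Builds a ParseContext carrying the two requested settings.
    static ParseContext buildContext(Parser parser, String ocrLanguage) {
        // Option 1: hand inline PDF images to the embedded-document parsers (and so to OCR).
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        // Option 2: language passed to Tesseract, e.g. "deu" (assumed here) or "eng".
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setLanguage(ocrLanguage);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        // Registering the parser lets Tika recurse into the extracted images.
        context.set(Parser.class, parser);
        return context;
    }

    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = buildContext(parser, "deu");

        // -1 disables the default limit on the amount of extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, handler, new Metadata(), context);
        }
        System.out.println(handler.toString());
    }
}
```

In ManifoldCF, the boolean and the language string would presumably come from the transformation connector's configuration instead of being hard-coded.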

Thanks

-Konrad

KONRAD HOLL
Senior Technical Consultant

M +49 178 8855 553
F  +49 178 99 8855 553
Skype: konrad.holl

Search Technologies GmbH
Theodor-Heuss-Allee 112
60486 Frankfurt am Main

SEARCH TECHNOLOGIES
Find Better Answers.
www.searchtechnologies.com
