Hi, for a client project I needed to enable OCR for images embedded in PDFs. Unfortunately, ManifoldCF does not provide configuration options for this. It would be nice to have the following options for the Tika content extraction:
1. Enable PDF image extraction for OCR: https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
2. Set the default language for Tesseract: https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29

Thanks,
Konrad

Konrad Holl
Senior Technical Consultant
Search Technologies GmbH
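
For reference, here is a minimal sketch of how the two options requested above are set when calling Tika 1.7 directly, outside of ManifoldCF. The class name PdfOcrSketch, the language code "deu", and taking the PDF path from the first command-line argument are illustrative assumptions, and Tesseract itself must be installed on the machine. How ManifoldCF would expose these settings is not shown here.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example class, not part of ManifoldCF or Tika.
public class PdfOcrSketch {
    public static void main(String[] args) throws Exception {
        // Option 1: have the PDF parser extract inline images so they can be passed to OCR.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        // Option 2: set the Tesseract language ("deu" is only an assumed example value).
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setLanguage("deu");

        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        // Register the parser itself so embedded images are parsed recursively.
        context.set(Parser.class, parser);

        // -1 disables the default write limit of the content handler.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, handler, new Metadata(), context);
        }
        System.out.println(handler.toString());
    }
}
```

Ideally the ManifoldCF Tika configuration would offer one boolean setting for option 1 and one string setting for option 2, mapped onto these two calls.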
