OCR'ing PDFs is fiddly at the moment because of Tika, not Solr!  We
have an open ticket (TIKA-2749) to make it "just work", but we aren't
there yet.

You have to tell Tika how you want to process images from PDFs via the
tika-config.xml file.
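
As a rough sketch only, a tika-config.xml that turns on inline image
extraction for PDFs (so that the images can be handed to Tesseract) might
look something like the following. The parameter names are the ones I
recall from the PDFParser wiki page, and whether the Tika version bundled
with your Solr honors them is an assumption you should check:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parsers, but take PDFParser out so it can be
         configured separately below. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Assumption: extract embedded images so they can be OCR'd. -->
        <param name="extractInlineImages" type="bool">true</param>
        <!-- Possible alternative on newer Tika versions (assumption):
             <param name="ocrStrategy" type="string">ocr_only</param> -->
      </params>
    </parser>
  </parsers>
</properties>
```

On the Solr side, the extracting request handler in solrconfig.xml can be
pointed at this file with a tika.config entry, for example
<str name="tika.config">/path/to/tika-config.xml</str> (the path here is
a placeholder).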

You've seen this link in the links you mentioned:
https://wiki.apache.org/tika/TikaOCR

This one is key for PDFs:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR

On Fri, Nov 2, 2018 at 10:30 AM Furkan KAMACI <furkankam...@gmail.com> wrote:
>
> Hi All,
>
> I want to index images and pdf documents which have images into Solr. I
> test it with my Solr 6.3.0.
>
> I've installed tesseract at my computer (Mac). I verify that Tesseract
> works fine to extract text from an image.
>
> I index image into Solr but it has no content. However, as far as I know, I
> don't need to do anything else to integrate Tesseract with Solr.
>
> I've checked these but they were not useful for me:
>
> http://lucene.472066.n3.nabble.com/TIKA-OCR-not-working-td4201834.html
> http://lucene.472066.n3.nabble.com/Fwd-configuring-Solr-with-Tesseract-td4361908.html
>
> My question is, how can I support OCR with Solr?
>
> Kind Regards,
> Furkan KAMACI
