I second Jörn: don't deploy Tesseract + Tika on the same server as Solr.
Tika, especially with OCR enabled, will drain machine resources that
could otherwise be used for indexing and searching. In addition, a
malformed PDF could potentially take down the Solr server. Your best bet is
to run tika-server + Tesseract on a dedicated server/container, use it to
extract the text/OCR output from the documents, and then send that to Solr.
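
To make that setup concrete, here is a minimal sketch in Python (standard
library only). It assumes tika-server is listening on its default port 9998
on a separate machine and that your Solr collection has a text field to
receive the output. The host names, the collection name "mycollection", and
the field name "content" are placeholders I made up, not anything from your
installation.

```python
import json
import urllib.request

# Assumed endpoints; adjust hosts, port, and collection name to your setup.
TIKA_URL = "http://tika-host:9998/tika"
SOLR_UPDATE_URL = "http://solr-host:8983/solr/mycollection/update?commit=true"


def build_tika_request(file_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking tika-server for the plain text of a document."""
    return urllib.request.Request(
        tika_url,
        data=file_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def build_solr_request(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """Build a POST request that sends one document to Solr's JSON update handler."""
    # "content" is a hypothetical field name; it must exist in your schema.
    body = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path):
    """Extract text on the Tika machine, then post the result to Solr."""
    with open(path, "rb") as f:
        file_bytes = f.read()
    with urllib.request.urlopen(build_tika_request(file_bytes)) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(build_solr_request(path, text)) as resp:
        return resp.status
```

Calling index_file("scan.pdf") would then do the heavy OCR work on the Tika
machine and hand Solr only the extracted text, so a bad PDF can at worst
crash the extraction service and not the search server.
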
But to answer your question: Solr embeds Tika, which can be configured to
use Tesseract. It is Tika that knows about Tesseract, not Solr. See
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR for more
information.
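
As a rough illustration of why you found no Tesseract reference in the Solr
sources: as far as I understand it, Tika's OCR parser checks at runtime
whether a tesseract executable can be found and launched, and Solr just
inherits that behaviour. The Python below is not Tika's code, only a sketch
of the same idea; the function names are made up for the example.

```python
import shutil


def find_executable(name):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


def ocr_available():
    """True if a `tesseract` binary can be found on the current PATH."""
    return find_executable("tesseract") is not None
```

On a machine where "yum install tesseract" has been run, ocr_available()
would normally return True, which is roughly the signal Tika acts on.
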
Best regards,
Edward
On Tue, Feb 11, 2020 at 3:26 PM Jörn Franke wrote:
> Honestly, I would not run Tesseract on the same server as Solr. It takes a
> lot of resources and may negatively impact Solr. Just write a small program
> using Tika+Tesseract that runs on a different server / container and posts
> the results to Solr.
>
> About your question: probably Tika (a dependency of Solr) figured it out,
> or, depending on your format, PDFBox (used by Tika).
>
> > On 11.02.2020 at 19:15, Karan Jain wrote:
> >
> > Hi All,
> >
> > Solr version 7.6.0 is running on my local machine. I have installed
> > Tesseract through the following steps:
> > yum install tesseract
> > echo export PATH=$PATH:/usr/share/tesseract >> ~/.bash_profile
> > echo export TESSDATA_PREFIX=/usr/share/tesseract >> ~/.bash_profile
> >
> > Now the deployed Solr supports Tesseract. I searched for TESSDATA_PREFIX
> > in https://github.com/apache/lucene-solr and found no reference there. I
> > could not understand how Solr came to know about the deployed Tesseract.
> > Please tell me the specific Java class in Solr, if possible.
> >
> > Thanks for your time,
> > Best,
> > Karan
>