Re: Support Tesseract in Apache Solr

2020-02-11 Thread Edward Ribeiro
I second Jörn: don't deploy Tesseract + Tika on the same server as Solr.
OCR with Tesseract will drain machine resources that could otherwise be
used for indexing/searching. In addition, any malformed PDF could
potentially bring down the Solr server. Your best bet is to run
tika-server + Tesseract on a dedicated server/container, use it to extract
the text/OCR output from the documents, and then send the result to Solr.
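
A minimal sketch of that split, assuming a tika-server instance reachable at
http://tika-host:9998 and a Solr collection named "documents" with "id" and
"content" fields. The host names, collection name, field names, and helper
names are all hypothetical. The two build_* functions only construct the
HTTP requests; extract_and_index sends them:

```python
import json
import urllib.request

# Hypothetical endpoints; adjust to your own deployment.
TIKA_URL = "http://tika-host:9998/tika"
SOLR_UPDATE_URL = "http://solr-host:8983/solr/documents/update?commit=true"


def build_tika_request(file_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking tika-server for the plain text of a file."""
    return urllib.request.Request(
        tika_url,
        data=file_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def build_solr_request(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """Build a POST request that adds one document to Solr as JSON."""
    # The field names "id" and "content" are assumptions about the schema.
    body = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def extract_and_index(doc_id, file_bytes):
    """Extract text on the Tika host, then post the result to Solr."""
    with urllib.request.urlopen(build_tika_request(file_bytes)) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(build_solr_request(doc_id, text)) as resp:
        return resp.status
```

Committing on every document keeps the sketch short; a real loader would
likely batch documents and commit less often.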

But to answer your question: Solr embeds Tika, which can be configured to
use Tesseract. It is Tika that knows about Tesseract. See
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR for more
information.
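
This is also why TESSDATA_PREFIX does not show up in the Solr sources: Tika
runs the tesseract executable as an external process, and Tesseract itself
reads TESSDATA_PREFIX from the environment. The following is only a rough
Python analogy of that discovery step, not Tika's actual code; the helper
name describe_tesseract is made up for illustration:

```python
import os
import shutil


def describe_tesseract(env=None):
    """Report where a tesseract binary would be found and which tessdata dir applies."""
    env = os.environ if env is None else env
    # Search the PATH from the given environment, much as a process launcher would.
    binary = shutil.which("tesseract", path=env.get("PATH", ""))
    return {
        "binary": binary,  # None when tesseract is not on PATH
        "tessdata_prefix": env.get("TESSDATA_PREFIX"),  # read by Tesseract itself
    }
```

Calling describe_tesseract() from the shell that starts Solr shows what the
embedded Tika would inherit from that environment.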

Best regards,
Edward

On Tue, Feb 11, 2020 at 3:26 PM Jörn Franke wrote:

> Honestly, I would not run Tesseract on the same server as Solr. It takes a
> lot of resources and may negatively impact Solr. Just write a small program
> using Tika+Tesseract that runs on a different server / container and posts
> the results to Solr.
>
> About your question: Probably Tika (a dependency of Solr) figured it out,
> or, depending on your format, PDFBox (used by Tika).
>
> > On 11.02.2020 at 19:15, Karan Jain wrote:
> >
> > Hi All,
> >
> > Solr version 7.6.0 is running on my local machine. I installed
> > Tesseract with the following steps:
> > yum install tesseract
> > echo export PATH=$PATH:/usr/share/tesseract >>~/.bash_profile
> > echo export TESSDATA_PREFIX=/usr/share/tesseract >>~/.bash_profile
> >
> > The deployed Solr now supports Tesseract. I searched for TESSDATA_PREFIX
> > in https://github.com/apache/lucene-solr and found no reference to it. I
> > do not understand how Solr came to know about the installed Tesseract.
> > Please point me to the specific Java class in Solr, if possible.
> >
> > Thanks for your time,
> > Best,
> > Karan
>


Re: Support Tesseract in Apache Solr

2020-02-11 Thread Jörn Franke
Honestly, I would not run Tesseract on the same server as Solr. It takes a lot
of resources and may negatively impact Solr. Just write a small program using
Tika+Tesseract that runs on a different server / container and posts the
results to Solr.

About your question: Probably Tika (a dependency of Solr) figured it out, or,
depending on your format, PDFBox (used by Tika).

> On 11.02.2020 at 19:15, Karan Jain wrote:
> 
> Hi All,
> 
> Solr version 7.6.0 is running on my local machine. I installed
> Tesseract with the following steps:
> yum install tesseract
> echo export PATH=$PATH:/usr/share/tesseract >>~/.bash_profile
> echo export TESSDATA_PREFIX=/usr/share/tesseract >>~/.bash_profile
> 
> The deployed Solr now supports Tesseract. I searched for TESSDATA_PREFIX
> in https://github.com/apache/lucene-solr and found no reference to it. I
> do not understand how Solr came to know about the installed Tesseract.
> Please point me to the specific Java class in Solr, if possible.
> 
> Thanks for your time,
> Best,
> Karan


Support Tesseract in Apache Solr

2020-02-11 Thread Karan Jain
Hi All,

Solr version 7.6.0 is running on my local machine. I installed
Tesseract with the following steps:
yum install tesseract
echo export PATH=$PATH:/usr/share/tesseract >>~/.bash_profile
echo export TESSDATA_PREFIX=/usr/share/tesseract >>~/.bash_profile

The deployed Solr now supports Tesseract. I searched for TESSDATA_PREFIX
in https://github.com/apache/lucene-solr and found no reference to it. I
do not understand how Solr came to know about the installed Tesseract.
Please point me to the specific Java class in Solr, if possible.

Thanks for your time,
Best,
Karan