Re: regarding Extracting text from Images

2020-01-22 Thread Steve Ge
In my experience, enabling Tika at the server level can exhaust the heap under
a high volume of extraction and bring down Solr entirely. This is likely
because the garbage collector cannot keep up with the load; even tuning the
garbage collector didn't resolve the problem completely. Not recommended.
Steve  
 
  On Wed, Oct 23, 2019 at 7:08 PM, suresh pendap wrote: 
  Hi Alex,
Thanks for your reply. How do we integrate tesseract with Solr?  Do we have
to implement Custom update processor or extend the
ExtractingRequestProcessor?

Regards
Suresh

  


Re: regarding Extracting text from Images

2020-01-22 Thread Retro
Good day,
We solved the situation. Here is what was used and changed:
In our installation we used Tesseract version 3.05, Tika version 1.17, and Solr
version 7.4. We actually had Tika version 1.17, not 1.18 as stated earlier.
1. Changed from HOCR to TXT  >>> 
in file parseContext.xml
2. Had to start SOLR as a root user.
Tesseract 4.1.1 is not compatible with Tika 1.17, so we will upgrade Solr to
version 7.7 and Tika to version 1.19, and will then try to install Tesseract 4.1.1.
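
For readers who want to see the TXT versus HOCR difference outside Solr, here is a minimal sketch using the Tesseract command line. It is an illustration only and not the parseContext.xml change described above: the helper name `tesseract_command` is hypothetical, and it assumes the `tesseract` binary is on the PATH. Passing `hocr` as a trailing config name asks Tesseract for hOCR markup, while leaving it out produces plain text.

```python
def tesseract_command(image, output_base, hocr=False, lang="eng"):
    """Build a Tesseract command line (hypothetical helper, not part of Tika or Solr).

    Tesseract appends the file extension itself, so an output base of "test"
    becomes test.txt (plain text) or test.hocr (hOCR markup).
    """
    cmd = ["tesseract", image, output_base, "-l", lang]
    if hocr:
        # "hocr" is a Tesseract config name that switches the output to hOCR.
        cmd.append("hocr")
    return cmd


# Plain-text run:  tesseract test.jpg test -l eng
plain = tesseract_command("test.jpg", "test")
# hOCR run:        tesseract test.jpg test -l eng hocr
markup = tesseract_command("test.jpg", "test", hocr=True)
```

Either list can be handed to `subprocess.run(cmd, check=True)` on a machine where Tesseract is installed, which makes it easy to compare the two outputs before touching the Tika configuration.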
 



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: regarding Extracting text from Images

2020-01-21 Thread Retro
Hello, thank you for the info, I will look into this as well. Yes, we plan to
use it in production, but in the longer run. For the moment I just need to
make it work as a test case.





Re: regarding Extracting text from Images

2020-01-21 Thread Retro
Yes, I did. This manual refers to the standalone version of Tika, while I
have the built-in version.





Re: regarding Extracting text from Images

2020-01-17 Thread Marco Reis
Do you intend to use the solution in production? If so, combining Tika
and Tesseract on the same server may not be a good choice.
Tika and Tesseract are heavy consumers of processing power and will harm the
main service of the solution, in your case the Solr service.
I had the same situation here: the Tika/Tesseract combination on the
production server did not scale once I had many text documents and
images.
An alternative is to use one microservice for text preprocessing and another
one for OCR. You can take some ideas from https://github.com/tleyden/open-ocr.
I have a separate Kubernetes cluster just for this, to extract and OCR
text from binary documents. Now I can scale to a world-class solution.
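
As a rough illustration of keeping OCR in its own process, away from Solr, the sketch below runs Tesseract through a bounded worker pool. It is not the open-ocr design or the Kubernetes setup mentioned above; the function names are hypothetical, and the default OCR function assumes the `tesseract` binary is on the PATH. The pool size caps how much CPU the OCR stage can take at once.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess


def ocr_with_tesseract(image_path):
    """Run the tesseract CLI and return the recognized text."""
    # Using "stdout" as the output base makes tesseract print the text
    # instead of writing a file.
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_batch(image_paths, ocr=ocr_with_tesseract, max_workers=4):
    """OCR many images with a bounded pool.

    Returns {path: text}, with None as the value for images that failed.
    """
    paths = list(image_paths)

    def safe(path):
        try:
            return ocr(path)
        except Exception:
            # One bad image should not stop the whole batch.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its text.
        return dict(zip(paths, pool.map(safe, paths)))
```

Threads are enough here because the heavy work happens in the external `tesseract` process; the `ocr` parameter can be swapped for any other extraction function.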

Marco Reis
Software Engineer
http://marcoreis.net
+55 61 981194620



On Fri, 17 Jan 2020 at 07:17, Jörn Franke  wrote:

> Have you checked this?
>
> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR


Re: regarding Extracting text from Images

2020-01-17 Thread Jörn Franke
Have you checked this?

https://cwiki.apache.org/confluence/display/TIKA/TikaOCR

> On 17.01.2020 at 10:54, Retro wrote:
> 
> Hello, can you please advise me how to configure Solr so that the embedded
> Tika is able to use Tesseract to OCR images? I have installed the
> following software:
> Solr      - 7.4.0
> Tesseract - 4.1.1-rc2-20-g01fb
> Tika      - 1.18
> Tesseract is installed in the following directory:
> /usr/share/tesseract/4/tessdata/
> echo $TESSDATA_PREFIX -> /usr/share/tesseract/4/tessdata/
> tesseract -v
> tesseract 4.1.1-rc2-20-g01fb
> leptonica-1.76.0
> 
> The command “tesseract test.jpg test.txt” produces an accurate text file with
> the OCRed content of test.jpg.
> The current setup lets us index attachments such as structured text files
> (txt, Word, PDF, etc.), but it does not react in any way to attachments such
> as png or jpg. Nor does it work if an image is uploaded directly to Solr
> through its web interface.
> 
> Necessary modifications were made to the following files:
> solrconfig.xml; TesseractOCRConfig.properties; parsecontent.xml;
> PDFparser.properties.
> 
> I would appreciate it if someone could help me with this configuration.


Re: regarding Extracting text from Images

2020-01-17 Thread Retro
Hello, can you please advise me how to configure Solr so that the embedded
Tika is able to use Tesseract to OCR images? I have installed the
following software:
Solr      - 7.4.0
Tesseract - 4.1.1-rc2-20-g01fb
Tika      - 1.18
Tesseract is installed in the following directory:
/usr/share/tesseract/4/tessdata/
echo $TESSDATA_PREFIX -> /usr/share/tesseract/4/tessdata/
tesseract -v
tesseract 4.1.1-rc2-20-g01fb
leptonica-1.76.0

The command “tesseract test.jpg test.txt” produces an accurate text file with
the OCRed content of test.jpg.
The current setup lets us index attachments such as structured text files
(txt, Word, PDF, etc.), but it does not react in any way to attachments such
as png or jpg. Nor does it work if an image is uploaded directly to Solr
through its web interface.

Necessary modifications were made to the following files:
solrconfig.xml; TesseractOCRConfig.properties; parsecontent.xml;
PDFparser.properties.

I would appreciate it if someone could help me with this configuration.
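
One thing that may be worth ruling out in a setup like this is whether the account that runs Solr sees the same Tesseract environment as the interactive shell. The sketch below is only a diagnostic idea, not a known fix: the function name `ocr_environment_report` is hypothetical, and it simply reports whether `tesseract` is found on the PATH and whether the directory in `TESSDATA_PREFIX` is readable for the current user.

```python
import os
import shutil


def ocr_environment_report(environ=None):
    """Report how the Tesseract setup looks from the current process.

    Pass a dict to inspect a specific environment; by default the real
    environment of the running process is used.
    """
    env = os.environ if environ is None else environ
    tessdata = env.get("TESSDATA_PREFIX")
    return {
        # True if a tesseract executable is found on the given PATH.
        "tesseract_on_path": shutil.which("tesseract", path=env.get("PATH")) is not None,
        "tessdata_prefix": tessdata,
        # The directory must exist and be readable and traversable.
        "tessdata_readable": bool(tessdata) and os.access(tessdata, os.R_OK | os.X_OK),
    }


if __name__ == "__main__":
    for key, value in ocr_environment_report().items():
        print(f"{key}: {value}")
```

Running it once from the shell and once as the user that owns the Solr process would show whether the two environments differ.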





Re: regarding Extracting text from Images

2019-10-27 Thread Jörn Franke
Maybe some additional considerations:
If you need to upgrade Solr, then eventually you need to reindex.
If you change or add fields, then you need to reindex.
Both are much faster if you have an external program that converts rich
documents (PDF, Word, OCR) to text once, and you use that text (or hypertext if
you need to keep headings etc.) for reindexing. This will save you a lot of
time, especially for large collections.
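
A minimal sketch of the convert-once idea, assuming the extracted text is cached on disk next to the pipeline: the names `cached_text`, `cache_dir`, and `extract` are hypothetical, and `extract` stands for whatever does the real conversion (Tika, OCR, and so on). The cache is keyed on a hash of the file content, so a reindex only pays for extraction on documents it has not seen before.

```python
import hashlib
from pathlib import Path


def cached_text(source, cache_dir, extract):
    """Return the extracted text for `source`, calling `extract` only on a cache miss.

    `extract` is any callable that takes a path and returns the text.
    """
    source, cache_dir = Path(source), Path(cache_dir)
    # Key the cache on file content, so renamed or copied files reuse the text.
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    cache_file = cache_dir / (digest + ".txt")
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = extract(source)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

On a reindex, the indexing job calls `cached_text` for every document and sends the returned text to Solr; only new or changed files reach the slow extraction step.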

> On 27.10.2019 at 15:13, Erick Erickson wrote:
> 
> I would do neither. I’d put it all on an external server and use _that_,
> then send the finished docs to Solr.
> 
> The problem with putting this all on Solr is at least three-fold:
> 1> you’re talking heavy-duty work here to do the OCR, which takes away from 
> the available resources for searching and indexing
> 2> any problems with either one will potentially blow up Solr
> 3> If you’re processing very many docs, you’ll have to parallelize somehow
> 
> Here’s the long form: 
> https://lucidworks.com/post/indexing-with-solrj/
> 
> Best,
> Erick


Re: regarding Extracting text from Images

2019-10-27 Thread Erick Erickson
I would do neither. I’d put it all on an external server and use _that_, then
send the finished docs to Solr.

The problem with putting this all on Solr is at least three-fold:
1> you’re talking heavy-duty work here to do the OCR, which takes away from the 
available resources for searching and indexing
2> any problems with either one will potentially blow up Solr
3> If you’re processing very many docs, you’ll have to parallelize somehow

Here’s the long form: 
https://lucidworks.com/post/indexing-with-solrj/
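
The linked article uses SolrJ; as a language-neutral illustration of "send the finished docs to Solr", here is a small Python sketch that posts already-extracted text to Solr's JSON update handler. The base URL, the collection name, and the field name `text_txt` are assumptions for the example and need to match the actual installation and schema.

```python
import json
import urllib.request


def build_update_request(solr_url, collection, docs, commit=True):
    """Build a POST to Solr's JSON update handler for a list of dict documents."""
    url = f"{solr_url.rstrip('/')}/{collection}/update"
    if commit:
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send the request; urllib raises HTTPError on an error response."""
    with urllib.request.urlopen(request) as response:
        return response.status


# Example document: the text was extracted elsewhere, Solr only indexes it.
request = build_update_request(
    "http://localhost:8983/solr", "docs",
    [{"id": "scan-001", "text_txt": "text produced by the external OCR step"}],
)
```

Calling `send(request)` against a running Solr performs the update; for many documents, batching several dicts into one request keeps the number of round trips down.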

Best,
Erick

> On Oct 26, 2019, at 12:37 PM, Edward Ribeiro  wrote:
> 
> No. You should install tesseract-ocr on the same box as your Solr instance,
> and configure Solr so that embedded Tika is able to use Tesseract to do the
> ocr of images.
> 
> Best,
> Edward



Re: regarding Extracting text from Images

2019-10-26 Thread Edward Ribeiro
No. You should install tesseract-ocr on the same box as your Solr instance,
and configure Solr so that embedded Tika is able to use Tesseract to do the
ocr of images.

Best,
Edward

On Wed, Oct 23, 2019, 20:08, suresh pendap wrote:

> Hi Alex,
> Thanks for your reply. How do we integrate tesseract with Solr?  Do we have
> to implement Custom update processor or extend the
> ExtractingRequestProcessor?
>
> Regards
> Suresh


Re: regarding Extracting text from Images

2019-10-25 Thread Eric Pugh
Just to stir the pot on this topic, here is an article about why and how to use 
Tika inside of Solr:

https://opensourceconnections.com/blog/2019/10/24/it-s-okay-to-run-tika-inside-of-solr-if-and-only-if/

> On Oct 23, 2019, at 7:21 PM, Erick Erickson  wrote:
> 
> Here’s a blog about why and how to use Tika outside Solr (and an RDBMS too, 
> but you can pull that part out pretty easily):
> https://lucidworks.com/post/indexing-with-solrj/

___
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com  | 
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 





Re: regarding Extracting text from Images

2019-10-23 Thread Erick Erickson
Here’s a blog about why and how to use Tika outside Solr (and an RDBMS too, but 
you can pull that part out pretty easily):
https://lucidworks.com/post/indexing-with-solrj/



> On Oct 23, 2019, at 7:16 PM, Alexandre Rafalovitch  wrote:
> 
> Again, I think it is best to do it outside of Solr.
> 
> But even if you want to get it to work in Solr, I think you should start by
> getting it to work directly in Tika. Then, get the missing libraries and
> configuration into Solr.
> 
> Regards,
>Alex



Re: regarding Extracting text from Images

2019-10-23 Thread Alexandre Rafalovitch
Again, I think it is best to do it outside of Solr.

But even if you want to get it to work in Solr, I think you should start by
getting it to work directly in Tika. Then, get the missing libraries and
configuration into Solr.
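
A minimal sketch of the "get it working directly in Tika first" step, assuming a standalone tika-app jar has been downloaded: the jar path is a placeholder, and the helper names are hypothetical. The `--text` option asks tika-app for plain-text output, so an image that comes back with OCRed text shows that Tika can find and use Tesseract before Solr is involved at all.

```python
import subprocess


def tika_text_command(tika_app_jar, document):
    """Command line for standalone tika-app that prints the extracted plain text."""
    # --text asks tika-app for plain-text output.
    return ["java", "-jar", tika_app_jar, "--text", document]


def extract_with_tika(tika_app_jar, document):
    """Run tika-app and return its stdout (needs Java and the tika-app jar)."""
    result = subprocess.run(
        tika_text_command(tika_app_jar, document),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, `extract_with_tika("/path/to/tika-app.jar", "test.jpg")` should return the recognized text if the standalone Tika picks up Tesseract; an empty result points at the Tika or Tesseract side rather than at Solr.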

Regards,
Alex

On Wed, Oct 23, 2019, 7:08 PM suresh pendap,  wrote:

> Hi Alex,
> Thanks for your reply. How do we integrate tesseract with Solr?  Do we have
> to implement Custom update processor or extend the
> ExtractingRequestProcessor?
>
> Regards
> Suresh


Re: regarding Extracting text from Images

2019-10-23 Thread suresh pendap
Hi Alex,
Thanks for your reply. How do we integrate Tesseract with Solr? Do we have
to implement a custom update processor or extend the
ExtractingRequestProcessor?

Regards
Suresh

On Wed, Oct 23, 2019 at 11:21 AM Alexandre Rafalovitch 
wrote:

> I believe Tika, which powers this, can do so with extra libraries
> (Tesseract?), but Solr does not bundle those extras.
>
> In any case, you may want to run Tika externally so that the
> conversion/extraction process does not become a burden on Solr itself.
>
> Regards,
>  Alex


Re: regarding Extracting text from Images

2019-10-23 Thread Alexandre Rafalovitch
I believe Tika, which powers this, can do so with extra libraries
(Tesseract?), but Solr does not bundle those extras.

In any case, you may want to run Tika externally so that the
conversion/extraction process does not become a burden on Solr itself.

Regards,
 Alex

On Wed, Oct 23, 2019, 1:58 PM suresh pendap,  wrote:

> Hello,
> I am reading the Solr documentation about integration with Tika and the Solr
> Cell framework here:
>
> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
>
> I would like to know whether the Solr Cell framework can also be used to
> extract text from image files.
>
> Regards
> Suresh
>


regarding Extracting text from Images

2019-10-23 Thread suresh pendap
Hello,
I am reading the Solr documentation about integration with Tika and the Solr
Cell framework here:
https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html

I would like to know whether the Solr Cell framework can also be used to
extract text from image files.

Regards
Suresh