Re: regarding Extracting text from Images

Jörn Franke Sun, 27 Oct 2019 10:13:46 -0700

A few additional considerations:
If you upgrade Solr, you will eventually need to reindex.
If you change or add fields, you also need to reindex.
Both are much faster if an external program converts the rich documents
(PDF, Word, OCR'd images) to text once, and you then reuse that text (or
hypertext, if you need to keep headings etc.) for reindexing. This saves a lot
of time, especially for large collections.
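
A minimal sketch of the convert-once idea, in Python. Everything named here is
an assumption for illustration: the cache layout, the `extract` callable (which
would wrap whatever does the slow work, e.g. Tika plus Tesseract), and the
`content_txt` field name. Adapt it to your own schema and tooling.

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterable


def cached_text(source: Path, cache_dir: Path, extract: Callable[[Path], str]) -> str:
    """Return plain text for `source`, running `extract` only on a cache miss."""
    # Key the cache on the file's bytes, so identical content is extracted once.
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = extract(source)  # the slow step (hypothetical Tika/OCR wrapper)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text


def build_solr_docs(
    sources: Iterable[Path], cache_dir: Path, extract: Callable[[Path], str]
) -> list:
    """Build plain dicts to send to Solr; the field names are hypothetical."""
    return [
        {"id": str(path), "content_txt": cached_text(path, cache_dir, extract)}
        for path in sources
    ]
```

On a reindex, the second call to `build_solr_docs` reads the stored text files
and never touches the extractor, so only the (cheap) indexing step is repeated.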


> On 27.10.2019 at 15:13, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> I would do neither. I’d put it all on an external server and use _that_,
> then send the finished docs to Solr.
> 
> The problem with putting this all on Solr is at least three-fold:
> 1> you’re talking heavy-duty work here to do the OCR, which takes away from 
> the available resources for searching and indexing
> 2> any problems with either one will potentially blow up Solr
> 3> If you’re processing very many docs, you’ll have to parallelize somehow
> 
> Here’s the long form: 
> https://lucidworks.com/post/indexing-with-solrj/
> 
> Best,
> Erick
> 
>> On Oct 26, 2019, at 12:37 PM, Edward Ribeiro <edward.ribe...@gmail.com> 
>> wrote:
>> 
>> No. You should install tesseract-ocr on the same box as your Solr instance,
>> and configure Solr so that the embedded Tika can use Tesseract to OCR the
>> images.
>> 
>> Best,
>> Edward
>> 
>> On Wed, Oct 23, 2019 at 20:08, suresh pendap <sureshpen...@gmail.com>
>> wrote:
>> 
>>> Hi Alex,
>>> Thanks for your reply. How do we integrate Tesseract with Solr? Do we have
>>> to implement a custom update processor or extend the
>>> ExtractingRequestProcessor?
>>> 
>>> Regards
>>> Suresh
>>> 
>>> On Wed, Oct 23, 2019 at 11:21 AM Alexandre Rafalovitch
>>> <arafa...@gmail.com> wrote:
>>> 
>>>> I believe Tika, which powers this, can do so with extra libraries
>>>> (tesseract?), but Solr does not bundle those extras.
>>>> 
>>>> In any case, you may want to run Tika externally so that the
>>>> conversion/extraction process does not become a burden on Solr itself.
>>>> 
>>>> Regards,
>>>>    Alex
>>>> 
>>>> On Wed, Oct 23, 2019, 1:58 PM suresh pendap, <sureshpen...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hello,
>>>>> I am reading the Solr documentation about integration with Tika and the
>>>>> Solr Cell framework here:
>>>>>
>>>>> https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
>>>>> 
>>>>> I would like to know whether the Solr Cell framework can also be used to
>>>>> extract text from image files.
>>>>> 
>>>>> Regards
>>>>> Suresh
>>>>> 
>>>> 
>>> 
>
