subject:"Text in images are not extracted and indexed to content"

Re: Text in images are not extracted and indexed to content

2018-04-10 Thread Zheng Lin Edwin Yeo

Thanks for the reply. The cause was a Tesseract OCR problem: I had installed the new Tesseract 4 on my system, and it does not add its path to the environment variables, unlike the older Tesseract 3, which set the path automatically during installation. Regards, Edwin On 10 April
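A quick way to confirm the kind of PATH problem described above is to check whether the Tesseract binary can be resolved from the environment the indexing process runs in. The following is only a minimal sketch: the executable name `tesseract` and the helper names `find_tesseract` and `path_with_dir` are assumptions for illustration, and the thread does not say where Tesseract 4 was installed.

```python
import os
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


def path_with_dir(install_dir, current_path=None):
    """Return a PATH string that includes install_dir (hypothetical helper).

    The directory is appended only if it is not already listed.
    """
    if current_path is None:
        current_path = os.environ.get("PATH", "")
    entries = [p for p in current_path.split(os.pathsep) if p]
    if install_dir not in entries:
        entries.append(install_dir)
    return os.pathsep.join(entries)


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        # Assumption: the OCR step relies on finding the binary via PATH.
        print("tesseract was not found on PATH")
    else:
        print("tesseract found at", location)
```

If the check reports that the binary is missing, adding the install directory to the PATH of the process that runs the indexing (and restarting that process) is the fix the message above points to.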

Re: Text in images are not extracted and indexed to content

2018-04-10 Thread Shamik Sinha

To index text in images, the image needs to be searchable, i.e. the text needs to be overlaid on the image, as in a searchable PDF. You can do this using OCR, but it is somewhat unreliable if the images are scanned copies of written text. On 10-Apr-2018 4:12 PM, "Rahul Singh"

Re: Text in images are not extracted and indexed to content

2018-04-10 Thread Rahul Singh

You may need to extract the text outside Solr and index plain text with an external ingestion process. That gives you much more control over the Tika attributes and behaviors. -- Rahul Singh rahul.si...@anant.us Anant Corporation On Apr 9, 2018, 10:23 PM -0400, Zheng Lin Edwin Yeo , wrote:
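The external ingestion idea suggested above could look roughly like the sketch below. It is not the poster's setup: for brevity it calls the Tesseract command line directly instead of going through Tika, and the Solr base URL, the collection name, the `id` field, and all function names are assumptions for illustration. Only the `content` field name comes from the thread.

```python
import json
import subprocess
import urllib.request


def ocr_image(image_path, tesseract_cmd="tesseract"):
    """Run the Tesseract CLI on one image and return the recognised text.

    Passing "stdout" as the output base makes Tesseract print the text
    instead of writing a .txt file.
    """
    result = subprocess.run(
        [tesseract_cmd, image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def build_update_request(solr_base_url, collection, doc):
    """Build a POST request for Solr's JSON update handler (not sent here)."""
    url = f"{solr_base_url.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_image(image_path, solr_base_url, collection):
    """OCR one image and index the text as a plain document."""
    # Assumption: the schema has an "id" key and a text field named "content".
    doc = {"id": image_path, "content": ocr_image(image_path)}
    request = build_update_request(solr_base_url, collection, doc)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping the OCR step and the request construction in separate functions means either half can be swapped, for example replacing `ocr_image` with a Tika-based extractor, without touching the indexing call.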

Text in images are not extracted and indexed to content

2018-04-09 Thread Zheng Lin Edwin Yeo

Hi, I am currently facing an issue where the text in image files such as jpg and bmp is not being extracted and indexed. After indexing, Tika does extract all the metadata and indexes it under the attr_* fields. However, the content field is always empty for image files. For other types of

4 matches
