Re: regarding Extracting text from Images

2020-01-22 Thread Retro
Good day,
We solved the problem. Here is what we used and what we changed:
Our installation runs Tesseract 3.05, Tika 1.17 and Solr 7.4. (We actually
had Tika 1.17, not 1.18.)
1. Changed the OCR output format from HOCR to TXT  >>> 
in the file parseContext.xml.
2. Had to start Solr as the root user.
Tesseract 4.1.1 is not compatible with Tika 1.17, so we will upgrade Solr to
version 7.7 and Tika to version 1.19, and then try to install Tesseract 4.1.1.
 



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: regarding Extracting text from Images

2020-01-21 Thread Retro
Hello, thank you for the info, I will look into this as well. Yes, we plan to
use it in production, but only in the longer run. For the moment I just need
to make it work as a test case.





Re: regarding Extracting text from Images

2020-01-21 Thread Retro
Yes, I did. That manual refers to the standalone version of Tika, while I
have the built-in version.





Re: regarding Extracting text from Images

2020-01-17 Thread Retro
Hello, can you please advise me how to configure Solr so that the embedded
Tika is able to use Tesseract to OCR images? I have installed the
following software:
SOLR  - 7.4.0
Tesseract - 4.1.1-rc2-20-g01fb
TIKA   - TIKA 1.18 
Tesseract is installed in the following directory:
/usr/share/tesseract/4/tessdata/
echo $TESSDATA_PREFIX -> /usr/share/tesseract/4/tessdata/
tesseract -v
tesseract 4.1.1-rc2-20-g01fb
leptonica-1.76.0
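
Since the Tesseract version matters for compatibility with Tika, here is a minimal Python sketch for reading it from the `tesseract -v` banner shown above. The function names are hypothetical, and it assumes the banner starts with "tesseract" followed by the version number, as in the output above. It only tells you what the calling user sees on its PATH, which may differ from what the Solr process sees.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(banner):
    """Return (major, minor, patch) parsed from `tesseract -v` output."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        raise ValueError("no Tesseract version found in banner")
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return int(major), int(minor), int(patch or 0)


def installed_tesseract_version():
    """Run `tesseract -v`; return None if the binary is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "-v"], capture_output=True, text=True, check=False
    )
    # Assumption: the banner may appear on stdout or stderr depending on
    # the build, so both streams are searched.
    return parse_tesseract_version(result.stdout + result.stderr)


if __name__ == "__main__":
    print(installed_tesseract_version())
```

Running the same check as the user that starts Solr is one way to see whether that user can reach the binary at all.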

The command “tesseract test.jpg test.txt” produces an accurate text file with
the OCRed content of test.jpg.
The current setup lets us index attachments that are text-based documents
(txt, Word, PDF, etc.), but it does not react in any way to image attachments
such as png or jpg. It also does not work when an image is uploaded directly
to Solr through its web interface.
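
To take the web interface out of the picture, the direct-upload test can be sketched in Python against the extracting request handler. This is only a sketch under assumptions: the handler is registered at /update/extract in solrconfig.xml, Solr listens on localhost:8983, and the core name "mycore" and both function names are hypothetical. It uses the handler's `extractOnly` and `extractFormat` parameters so that nothing is indexed and only the extracted text comes back.

```python
import urllib.parse
import urllib.request


def build_extract_url(base_url, core):
    """Build an extract-only URL for the extracting request handler."""
    params = urllib.parse.urlencode(
        {"extractOnly": "true", "extractFormat": "text", "wt": "json"}
    )
    return f"{base_url.rstrip('/')}/solr/{core}/update/extract?{params}"


def extract_only(image_path, content_type,
                 base_url="http://localhost:8983", core="mycore"):
    """POST a file as the raw request body and return Solr's response text."""
    with open(image_path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        build_extract_url(base_url, core),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Needs a running Solr; test.jpg is the image from the command above.
    print(extract_only("test.jpg", "image/jpeg"))
```

If the response contains only metadata and no text for test.jpg, the embedded Tika is most likely not invoking Tesseract, which narrows the problem down to the OCR configuration rather than the indexing side.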

The necessary modifications were made to the following files:
solrconfig.xml, TesseractOCRConfig.properties, parseContext.xml,
PDFParser.properties.

I would appreciate it if someone could help me with this configuration.





RE: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-14 Thread Retro
Hello, thanks for the answer, but let me explain the setup. We are running our
own backup solution for emails (messages from Exchange in MSG format).
The content of these messages is then indexed in Solr. But Solr cannot process
the attachments within those MSG files and cannot OCR them. This is what I
need: to OCR the attachments and get their content indexed in Solr.

Davis, Daniel (NIH/NLM) [C] wrote
> Nuance and ABBYY provide OCR capabilities as well.
> Looking at higher level solutions, both indexengines.com and Comvault can
> do email remediation for legal issues.
>> AJ Weber wrote
>> > There are alternative, paid, libraries to parse and extract attachments
>> > from EML files as well
>> > EML attachments will have a mimetype associated with their metadata.







Re: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-11 Thread Retro
AJ Weber wrote
> There are alternative, paid, libraries to parse and extract attachments 
> from EML files as well
> EML attachments will have a mimetype associated with their metadata.

Hello, can you give me a hint as to which commercial libraries would do
the job? We need to index MSG files and OCR the attachments within them.
Tesseract cannot do this, and I'm having a hard time finding a solution.
Thank you!
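
For reference, the point quoted above about EML attachments carrying a MIME type can be illustrated with Python's standard `email` package. This is a minimal sketch that covers EML only; Outlook MSG is a different format that the standard library does not parse. The function names and the sample message are made up for illustration.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def list_attachments(raw_eml):
    """Return (filename, MIME type) for each attachment in an EML message."""
    message = BytesParser(policy=policy.default).parsebytes(raw_eml)
    return [
        (part.get_filename(), part.get_content_type())
        for part in message.iter_attachments()
    ]


def build_sample_eml():
    """Build a small in-memory EML with one PDF attachment (made-up data)."""
    message = EmailMessage()
    message["Subject"] = "scan"
    message["From"] = "a@example.com"
    message["To"] = "b@example.com"
    message.set_content("See attached.")
    message.add_attachment(
        b"%PDF-1.4 placeholder",
        maintype="application",
        subtype="pdf",
        filename="scan.pdf",
    )
    return message.as_bytes()


if __name__ == "__main__":
    for name, mime_type in list_attachments(build_sample_eml()):
        print(name, mime_type)
```

With the MIME type known, each attachment could be routed separately, for example sending only image/* and application/pdf parts through OCR.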


