Yes, I also think that the only way is to look for the text portions of the 
image. For this reason I am working with OpenCV to detect text zones and 
then applying tesseract for text extraction. The problem is that I can find 
only 1/3 of the total number of words.

On Monday, October 16, 2017 at 2:02:02 PM UTC+2, Art Rhyno wrote:
>
> The height of the sample is definitely challenging. If I use a portion of 
> it, Olena might be able to do a viable job of picking out the text [1]. I 
> am not even sure it’s a proper font, though, so it might make more sense to 
> use something like template matching rather than OCR. There seem to be 
> many instances where the characters touch or overlap each other. 
>
> art
>
> ---
>
> 1. https://drive.google.com/file/d/0B-PK1n92dlzwWmRReVYzdVdBU2M/view?usp=sharing
>
> *From:* [email protected] [mailto:[email protected]] *On Behalf Of* zbgns
> *Sent:* Monday, October 16, 2017 7:11 AM
> *To:* tesseract-ocr <[email protected]>
> *Subject:* [tesseract-ocr] Re: Detection on complex images
>
> I understand that the aim is to obtain a searchable file so that the 
> places where specific words occur in the document can be identified. I 
> would try to do this by creating a searchable PDF and then using “find” 
> in a PDF reader.
>
> However, I identified two main problems with the file you attached.
>
> First of all, the image is too large for tesseract to process (this may 
> be a limit set by the PDF specification: the image is 128 inches high, 
> whereas the limit is probably 45 inches). So the image needs to be cut 
> into 3 pieces before it can be turned into a PDF with tesseract.
>
> You may try to open the file with gImageReader and run OCR on the parts 
> containing letters by using rectangle selection(s). I tried it (using 
> the tesseract 4.00 alpha engine) and it does produce text, but the 
> quality is not satisfactory. This is the second issue: the quality of 
> the image is not sufficient for effective recognition (the shapes of 
> some letters are barely readable), and I don’t think it can be improved 
> in any easy way.
>
> On Friday, October 13, 2017 at 3:54:39 PM UTC+2, Paolo Giannoccaro wrote:
>
> Hi,
>
> I need to detect a fixed set of words in the attached image. Not all of 
> them are part of the standard English dictionary (for example, some words 
> could be acronyms).
>
> I tried detection on the full image and on split sub-images, but the 
> quality of detection is low.
>
> I use Tess4J, and the most important parts of my code are:
>
> // initialize
> ITesseract instance = new Tesseract();
> instance.setTessVariable(VAR_CHAR_WHITELIST, WHITELIST_DEFAULT);
>
> // detect
> int pageIteratorLevel = TessPageIteratorLevel.RIL_WORD;
> List<Word> result = instance.getWords(image, pageIteratorLevel);
>
> Any help?
>
> Thanks a lot
>
> -- 
> You received this message because you are subscribed to the Google Groups 
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/tesseract-ocr/870fa717-09f7-421d-8654-680088001d9d%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
