Yes, I also think that the only way is to look for the text portions of the image. For this reason I am working with OpenCV to detect text zones and then applying Tesseract for text extraction. The problem is that I can find only 1/3 of the total number of words.
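For the size problem discussed in the quoted messages below (the image is too tall to process in one go), here is a minimal sketch of the splitting step in plain Java. It is only an illustration under stated assumptions: the class name StripSplitter, the method names, and the overlap parameter are hypothetical and come from neither Tess4J nor OpenCV. The overlap is my own addition, so that a word lying on a cut line still appears whole in at least one strip.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tess4J or OpenCV.
final class StripSplitter {

    /**
     * Computes full-width horizontal strips that cover an image from top to
     * bottom. Each strip is at most maxStripHeight pixels high and shares
     * `overlap` pixels with the strip above it.
     */
    static List<Rectangle> computeStrips(int width, int height,
                                         int maxStripHeight, int overlap) {
        if (width <= 0 || height <= 0) {
            throw new IllegalArgumentException("image must be non-empty");
        }
        if (overlap < 0 || overlap >= maxStripHeight) {
            throw new IllegalArgumentException("overlap must be in [0, maxStripHeight)");
        }
        List<Rectangle> strips = new ArrayList<>();
        int step = maxStripHeight - overlap;
        int y = 0;
        while (true) {
            int h = Math.min(maxStripHeight, height - y);
            strips.add(new Rectangle(0, y, width, h));
            if (y + h >= height) {
                break; // the last strip reaches the bottom edge
            }
            y += step;
        }
        return strips;
    }

    /** Cuts the image into the strips computed above (views share pixel data). */
    static List<BufferedImage> split(BufferedImage image, int maxStripHeight, int overlap) {
        List<BufferedImage> parts = new ArrayList<>();
        for (Rectangle r : computeStrips(image.getWidth(), image.getHeight(),
                                         maxStripHeight, overlap)) {
            parts.add(image.getSubimage(r.x, r.y, r.width, r.height));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Blank stand-in image; 1280 vs. 450 mirrors the 128 in / 45 in ratio.
        BufferedImage tall = new BufferedImage(100, 1280, BufferedImage.TYPE_BYTE_GRAY);
        for (Rectangle r : computeStrips(tall.getWidth(), tall.getHeight(), 450, 50)) {
            System.out.println(r);
        }
    }
}
```

Each strip could then be passed to getWords as in the original code. Bounding boxes returned for a strip are relative to that strip, so the strip's y offset has to be added back to get positions in the full image, and words found twice in an overlap region need to be de-duplicated.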
On Monday, October 16, 2017 at 2:02:02 PM UTC+2, Art Rhyno wrote:
> The height of the sample is definitely challenging, if I use a portion of it, Olena might be able to do a viable job of picking out the text [1]. I am not even sure it’s a proper font, though, it might make more sense to use something like template matching rather than OCR. There seems to be lots of instances where the characters touch or overlap with each other.
>
> art
>
> ---
> 1. https://drive.google.com/file/d/0B-PK1n92dlzwWmRReVYzdVdBU2M/view?usp=sharing
>
> *From:* [email protected] [mailto:[email protected]] *On Behalf Of* zbgns
> *Sent:* Monday, October 16, 2017 7:11 AM
> *To:* tesseract-ocr <[email protected]>
> *Subject:* [tesseract-ocr] Re: Detection on complex images
>
> I understand that the aim is to obtain searchable file in order to be able to identify places where some specific words occur in the document. I would try to do this by creating searchable pdf and afterwards by using “find” in a pdf reader.
>
> However I identified two main problems with the file attached by you.
>
> First of all the image is too large for tesseract to process it (it may be limitation set by pdf specification – the image is 128 inches high, whereas the limit is probably 45 inches). So the image needs to be cut into 3 pieces before it may be turned into pdf with tesseract.
>
> You may try to open the file with gImageReader and try to perform ocr on parts containing letters by using rectangle selection(s). I tried it (using tesseract 4.00 alpha engine) and it gives a text in output, but the quality is rather not satisfying. This is the second issue. The quality of the image is not sufficient to perform effective recognition (shapes of some letters are hardly readable) and I don’t think it may be improved in any easy way.
>
> On Friday, October 13, 2017 at 3:54:39 PM UTC+2, Paolo Giannoccaro wrote:
>
> Hi,
>
> I need to detect a fixed set of words in the attached image, not all of which are part of a canonical English dictionary (for example, words could be acronyms).
>
> I tried detection on the full image and iterating over split sub-images, but the quality of detection is low.
>
> I use Tess4J, and the most important parts of my code are:
>
> // initialize
> ITesseract instance = new Tesseract();
> instance.setTessVariable(VAR_CHAR_WHITELIST, WHITELIST_DEFAULT);
>
> // detect
> int pageIteratorLevel = TessPageIteratorLevel.RIL_WORD;
> List<Word> result = instance.getWords(image, pageIteratorLevel);
>
> Any help?
>
> Thanks a lot
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/870fa717-09f7-421d-8654-680088001d9d%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

--
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e79cfe73-e0de-41bb-bc88-03b134b17dde%40googlegroups.com.

