Thanks, Art, for your contribution. The words that I have to extract from the attached sample are: ost, stain, stn, resd, o stn (they occur several times; in total there are 20 word occurrences). I am currently using OpenCV to preprocess the image and get a rough detection of the rectangles that contain text. Then I run Tesseract OCR on each rectangle. So far I am able to get 10 of the 20 words.
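To make a figure like "10 of 20" reproducible, the per-rectangle OCR strings can be tallied against the target list. The following is only a minimal sketch under stated assumptions: the class `WordTally`, its method names, and the sample strings in `main` are hypothetical and not part of my actual code, and it uses exact matching after simple normalization (lower-casing, trimming, collapsing whitespace) rather than any Tesseract or OpenCV feature.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: counts how many OCR results match the target words.
final class WordTally {
    // Target vocabulary taken from the message above.
    static final List<String> TARGETS =
            Arrays.asList("ost", "stain", "stn", "resd", "o stn");

    // Lower-case, trim, and collapse whitespace runs so "O  STN\n" becomes "o stn".
    static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Count exact matches per target word after normalization.
    static Map<String, Integer> tally(List<String> ocrResults) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String target : TARGETS) {
            counts.put(target, 0);
        }
        for (String raw : ocrResults) {
            String key = normalize(raw);
            if (counts.containsKey(key)) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Total number of matched occurrences across all target words.
    static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up OCR output, one string per detected rectangle (illustration only).
        List<String> sample = Arrays.asList("OST", " stain\n", "O  STN", "5tn", "resd");
        Map<String, Integer> counts = tally(sample);
        System.out.println(counts + " matched=" + total(counts));
    }
}
```

Comparing the matched total with the expected 20 occurrences gives a rough view of how many words come through both steps, although a missed word could be due to either the rectangle detection or the recognition itself.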
Of course, if I already had bounding boxes for each word, I would have already solved the problem.

On Saturday, October 14, 2017 at 10:29:29 PM UTC+2, Dmitri Silaev wrote:
>
> What are you unhappy with: detection rate or recognition accuracy? All in
> all, there's a ton of reasons why Tess can work poorly here. Some kind of
> preprocessing is definitely needed. What kind? It depends.
>
> I personally would say that I need to know:
> - 5-10 concrete examples of words you are going to look for,
> - their bounding boxes within your sample image.
>
> Once I have it, I might be able to help.
>
> Best regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Fri, Oct 13, 2017 at 9:05 AM, Paolo Giannoccaro <[email protected]> wrote:
>
>> Hi,
>> I need to detect a fixed set of words in the attached image; not all are
>> part of the canonical English dictionary (for example, words could be acronyms).
>>
>> I tried detection on the full image and on split sub-images, but
>> the quality of detection is low.
>>
>> I use Tess4J and the most important parts of my code are:
>>
>> //initialize
>> ITesseract instance = new Tesseract();
>> instance.setTessVariable(VAR_CHAR_WHITELIST, WHITELIST_DEFAULT);
>>
>> //detect
>> int pageIteratorLevel = TessPageIteratorLevel.RIL_WORD;
>> List<Word> result = instance.getWords(image, pageIteratorLevel);
>>
>> Any help?
>> Thanks a lot
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/90295194-26a9-4f31-bd9d-63d61d7bd592%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.

