Sorry, I posted the wrong data. These are the correct word positions inside the image:
43007108190000_sample.tif,stain,304,4643,389,4679
43007108190000_sample.tif,stain,555,4685,634,4717
43007108190000_sample.tif,ost,1037,17303,1135,17341
43007108190000_sample.tif,o stn,910,24353,1049,24395
43007108190000_sample.tif,stn,960,30230,1066,30280
43007108190000_sample.tif,stn,997,31693,1095,31731
43007108190000_sample.tif,resd,749,33140,872,33187
43007108190000_sample.tif,resd,756,33543,873,33585
43007108190000_sample.tif,resd,778,33625,894,33666
43007108190000_sample.tif,resd,774,35233,894,35281
43007108190000_sample.tif,resd,881,38096,1004,38134
43007108190000_sample.tif,stn,1115,39344,1209,39384
43007108190000_sample.tif,resd,1066,39674,1189,39710
43007108190000_sample.tif,resd,883,39751,1001,39791
43007108190000_sample.tif,stn,765,40758,856,40797
43007108190000_sample.tif,stn,765,41079,852,41112
43007108190000_sample.tif,resd,977,42652,1093,42698
43007108190000_sample.tif,resd,885,42976,1011,43024
43007108190000_sample.tif,resd,908,43544,1024,43588
43007108190000_sample.tif,resd,1028,43665,1151,43711

Each row has image name, word, rect coordinates.

Thanks

On Monday, October 16, 2017 at 8:35:12 PM UTC+2, Dmitri Silaev wrote:
>
> I asked for few bounding boxes to let us all locate the required words
> inside the image. Depending on what they are, various methods can work or
> not. Your image is 135 megapixels in size. You should give as much
> information as possible to make life easier for people who are willing to
> help, shouldn't you?
>
> On Mon, Oct 16, 2017 at 2:01 PM, Paolo Giannoccaro <[email protected]> wrote:
>
>> Thank Art for your contribution.
>> The words that I have to extract from the attached sample are: ost,
>> stain, stn, resd, o stn (they occur several times, in total there are 20
>> words).
>> I am currently working with OpenCV to preprocess the image and find a raw
>> detection of rectangles that contain text. Then I use Tesseract to check
>> each rectangle and make ocr. Till now I am able to get 10 of 20 words.
>>
>> Of course if I already could have bounding boxes for each word, I would
>> already solved the problem.
>>
>> On Saturday, October 14, 2017 at 10:29:29 PM UTC+2, Dmitri Silaev wrote:
>>>
>>> What are you unhappy with: detection rate or recognition accuracy? All
>>> in all, there's a ton of reasons why Tess can work poorly here. Some kind
>>> of preprocessing is definitely needed. What kind? It depends.
>>>
>>> I personally would say that I need to know:
>>> - 5-10 concrete examples of words you are going to look for,
>>> - their bounding boxes within your sample image.
>>>
>>> Once I have it, I might be able to help.
>>>
>>> Best regards,
>>> Dmitri Silaev
>>> www.CustomOCR.com
>>>
>>> On Fri, Oct 13, 2017 at 9:05 AM, Paolo Giannoccaro <[email protected]> wrote:
>>>
>>>> Hi,
>>>> I need to detect a fixed set of words in the attached image, not all
>>>> are part of canonical english dictionary (for example words could be
>>>> acronyms).
>>>>
>>>> I tried detection on full image or iterating on splitted sub-images,
>>>> but quality of detection is low.
>>>>
>>>> I use Tess4J and the most important part of my code are:
>>>>
>>>> //initialize
>>>> ITesseract instance = new Tesseract();
>>>> instance.setTessVariable(VAR_CHAR_WHITELIST, WHITELIST_DEFAULT);
>>>>
>>>> //detect
>>>> int pageIteratorLevel = TessPageIteratorLevel.RIL_WORD;
>>>> List<Word> result = instance.getWords(image, pageIteratorLevel);
>>>>
>>>> Any help ?
>>>> Thanks a lot
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/90295194-26a9-4f31-bd9d-63d61d7bd592%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
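
For anyone who wants to work with the word positions posted at the top of this message, here is a minimal sketch of how those rows could be parsed in Java. It assumes the four numbers are left, top, right, bottom in pixels (the message only says "rect coordinates", so that ordering is a guess based on the values). The names WordBoxParser, WordBox, parseRow and parseAll are made up for illustration and are not part of Tess4J or OpenCV.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordBoxParser {

    /** One annotated word: image file name, expected text, and its box. */
    public static final class WordBox {
        public final String image;
        public final String word;
        public final Rectangle rect;

        public WordBox(String image, String word, Rectangle rect) {
            this.image = image;
            this.word = word;
            this.rect = rect;
        }
    }

    /**
     * Parses one row of the form "image,word,x1,y1,x2,y2".
     * ASSUMPTION: the numbers are left, top, right, bottom.
     */
    public static WordBox parseRow(String row) {
        String[] f = row.split(",");
        if (f.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields: " + row);
        }
        int x1 = Integer.parseInt(f[2].trim());
        int y1 = Integer.parseInt(f[3].trim());
        int x2 = Integer.parseInt(f[4].trim());
        int y2 = Integer.parseInt(f[5].trim());
        // java.awt.Rectangle takes x, y, width, height.
        Rectangle rect = new Rectangle(x1, y1, x2 - x1, y2 - y1);
        return new WordBox(f[0].trim(), f[1].trim(), rect);
    }

    /** Parses every non-blank row in the list. */
    public static List<WordBox> parseAll(List<String> rows) {
        List<WordBox> boxes = new ArrayList<>();
        for (String row : rows) {
            if (!row.trim().isEmpty()) {
                boxes.add(parseRow(row));
            }
        }
        return boxes;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "43007108190000_sample.tif,stain,304,4643,389,4679",
            "43007108190000_sample.tif,o stn,910,24353,1049,24395");
        for (WordBox b : parseAll(rows)) {
            System.out.println(b.word + " -> " + b.rect);
        }
    }
}
```

Each resulting Rectangle could then be used to crop the sample image, or, if I remember the Tess4J API correctly, passed to ITesseract.doOCR(BufferedImage, Rectangle) so that recognition runs on one box at a time instead of the full 135-megapixel page.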

