Sorry, I posted the wrong data.
These are the correct word positions inside the image:

43007108190000_sample.tif,stain,304,4643,389,4679 
43007108190000_sample.tif,stain,555,4685,634,4717 
43007108190000_sample.tif,ost,1037,17303,1135,17341 
43007108190000_sample.tif,o stn,910,24353,1049,24395 
43007108190000_sample.tif,stn,960,30230,1066,30280 
43007108190000_sample.tif,stn,997,31693,1095,31731 
43007108190000_sample.tif,resd,749,33140,872,33187 
43007108190000_sample.tif,resd,756,33543,873,33585 
43007108190000_sample.tif,resd,778,33625,894,33666 
43007108190000_sample.tif,resd,774,35233,894,35281 
43007108190000_sample.tif,resd,881,38096,1004,38134 
43007108190000_sample.tif,stn,1115,39344,1209,39384 
43007108190000_sample.tif,resd,1066,39674,1189,39710 
43007108190000_sample.tif,resd,883,39751,1001,39791 
43007108190000_sample.tif,stn,765,40758,856,40797 
43007108190000_sample.tif,stn,765,41079,852,41112 
43007108190000_sample.tif,resd,977,42652,1093,42698 
43007108190000_sample.tif,resd,885,42976,1011,43024 
43007108190000_sample.tif,resd,908,43544,1024,43588 
43007108190000_sample.tif,resd,1028,43665,1151,43711 
Each row contains the image name, the word, and the rectangle coordinates.
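
For anyone following along, here is a minimal sketch of how rows in this format could be turned into rectangles before cropping or OCR. It assumes the four numbers are left, top, right, bottom in pixels (the values are consistent with that reading, but it is not stated above). The names WordBoxes, parseRow and padded, the padding value, and the image size used in main are hypothetical and not part of Tess4J or OpenCV.

```java
import java.awt.Rectangle;

public class WordBoxes {

    /** One row of the list: image name, word, and its rectangle. */
    static final class WordBox {
        final String image;
        final String word;
        final Rectangle rect;

        WordBox(String image, String word, Rectangle rect) {
            this.image = image;
            this.word = word;
            this.rect = rect;
        }
    }

    /**
     * Parses "image,word,x1,y1,x2,y2".
     * Assumption: the last four fields are left, top, right, bottom.
     */
    static WordBox parseRow(String line) {
        String[] f = line.trim().split(",");
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 fields: " + line);
        }
        int x1 = Integer.parseInt(f[2].trim());
        int y1 = Integer.parseInt(f[3].trim());
        int x2 = Integer.parseInt(f[4].trim());
        int y2 = Integer.parseInt(f[5].trim());
        // java.awt.Rectangle stores x, y, width, height
        return new WordBox(f[0].trim(), f[1].trim(),
                new Rectangle(x1, y1, x2 - x1, y2 - y1));
    }

    /** Grows the box by pad pixels on every side, clipped to the image bounds. */
    static Rectangle padded(Rectangle r, int pad, int imageWidth, int imageHeight) {
        Rectangle grown = new Rectangle(r.x - pad, r.y - pad,
                r.width + 2 * pad, r.height + 2 * pad);
        return grown.intersection(new Rectangle(0, 0, imageWidth, imageHeight));
    }

    public static void main(String[] args) {
        WordBox b = parseRow("43007108190000_sample.tif,o stn,910,24353,1049,24395 ");
        // Hypothetical image size; replace with the real width and height.
        Rectangle region = padded(b.rect, 10, 2000, 70000);
        System.out.println(b.word + " -> " + region);
    }
}
```

The padded rectangle could then be handed to whatever crops or recognizes the region, for example Tess4J's doOCR(BufferedImage, Rectangle) overload, if the version in use provides it.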

thanks

On Monday, October 16, 2017 at 8:35:12 PM UTC+2, Dmitri Silaev wrote:
>
> I asked for a few bounding boxes to let us all locate the required words 
> inside the image. Depending on what they are, various methods can work or 
> not. Your image is 135 megapixels in size. You should give as much 
> information as possible to make life easier for people who are willing to 
> help, shouldn't you?
>
>
>
> On Mon, Oct 16, 2017 at 2:01 PM, Paolo Giannoccaro <[email protected]> wrote:
>
>> Thanks, Art, for your contribution.
>> The words that I have to extract from the attached sample are: ost, 
>> stain, stn, resd, o stn (they occur several times; in total there are 20 
>> words).
>> I am currently using OpenCV to preprocess the image and get a rough 
>> detection of rectangles that contain text. Then I use Tesseract to run OCR 
>> on each rectangle. So far I am able to get 10 of the 20 words.
>>
>> Of course, if I already had bounding boxes for each word, I would 
>> have already solved the problem.
>>
>>
>> On Saturday, October 14, 2017 at 10:29:29 PM UTC+2, Dmitri Silaev wrote:
>>>
>>> What are you unhappy with: detection rate or recognition accuracy? All 
>>> in all, there's a ton of reasons why Tess can work poorly here. Some kind 
>>> of preprocessing is definitely needed. What kind? It depends.
>>>
>>> I personally would say that I need to know:
>>> - 5-10 concrete examples of words you are going to look for,
>>> - their bounding boxes within your sample image.
>>>
>>> Once I have it, I might be able to help.
>>>
>>> Best regards,
>>> Dmitri Silaev
>>> www.CustomOCR.com
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Oct 13, 2017 at 9:05 AM, Paolo Giannoccaro <[email protected]> wrote:
>>>
>>>> Hi,
>>>> I need to detect a fixed set of words in the attached image. Not all of 
>>>> them are part of the standard English dictionary (for example, some 
>>>> words could be acronyms).
>>>>
>>>> I tried detection on the full image and on split sub-images, but the 
>>>> quality of detection is low.
>>>>
>>>> I use Tess4J, and the most important parts of my code are:
>>>>
>>>> //initialize
>>>> ITesseract instance = new Tesseract();
>>>> instance.setTessVariable(VAR_CHAR_WHITELIST, WHITELIST_DEFAULT);
>>>>
>>>> //detect
>>>> int pageIteratorLevel = TessPageIteratorLevel.RIL_WORD;
>>>> List<Word> result = instance.getWords(image, pageIteratorLevel);
>>>>
>>>> Any help?
>>>> Thanks a lot
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/90295194-26a9-4f31-bd9d-63d61d7bd592%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/90295194-26a9-4f31-bd9d-63d61d7bd592%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
