I'm still hoping to learn how to use GetComponentImages / SetRectangle better, but I found a workaround to get what I need out of GetIterator / iterate_level. I can't find documentation for BoundingBoxInternal, but I saw a reference to it at <https://zdenop.github.io/tesseract-doc/group___advanced_a_p_i.html#ga56369b1654400ef97e581bb65749ec3d> and decided to see if I could get it to work. I now get bounding boxes from my faster, more accurate method.
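As a side note, if I read the tesserocr bindings correctly, the result iterator also exposes a plain BoundingBox(level) method that returns an (x1, y1, x2, y2) tuple in page coordinates, which may be a documented alternative to BoundingBoxInternal. If you ever need those boxes in the (x, y, w, h) form that GetComponentImages / SetRectangle use, a small conversion helper does it (a sketch; box_to_rect is my own name, not part of the tesserocr API):

```python
def box_to_rect(bbox):
    """Convert an (x1, y1, x2, y2) bounding box -- the form returned by
    the result iterator's BoundingBox(level) -- into the (x, y, w, h)
    dict form used by GetComponentImages and SetRectangle."""
    x1, y1, x2, y2 = bbox
    return {'x': x1, 'y': y1, 'w': x2 - x1, 'h': y2 - y1}
```

For example, box_to_rect((10, 20, 110, 70)) gives {'x': 10, 'y': 20, 'w': 100, 'h': 50}.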
#!/usr/bin/python
from PIL import Image
Image.MAX_IMAGE_PIXELS = 1000000000
from tesserocr import PyTessBaseAPI, RIL, iterate_level

image = Image.open('/Users/gordot2/tess-install/tesseract/scan_2_new.tif')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.Recognize()
    api.SetVariable("save_blob_choices", "T")
    ri = api.GetIterator()
    level = RIL.WORD
    for r in iterate_level(ri, level):
        print r.BoundingBoxInternal(level)
        symbol = r.GetUTF8Text(level)
        conf = r.Confidence(level)
        if symbol:
            print u'symbol {},conf: {}\n'.format(symbol, conf).encode('utf-8')

On Monday, January 2, 2017 at 7:25:39 AM UTC-5, T G wrote:
>
> I've continued to spend a little time each day working on my problem. I've
> found something that fuels my desire to understand what GetComponentImages
> does differently from iterate_level.
>
>     from PIL import Image
>     Image.MAX_IMAGE_PIXELS = 1000000000
>     from tesserocr import PyTessBaseAPI, RIL
>
>     image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
>     with PyTessBaseAPI() as api:
>         api.SetImage(image)
>         api.Recognize()
>         api.SetVariable("save_blob_choices", "T")
>         boxes = api.GetComponentImages(RIL.WORD, True)
>         print 'Found {} textword image components.'.format(len(boxes))
>         print enumerate(boxes)
>         for i, (im, box, _, _) in enumerate(boxes):
>             # api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
>             api.SetRectangle(int(box['x']) - 8, int(box['y']) - 8,
>                              int(box['w']) + 16, int(box['h']) + 16)
>             ocrResult = api.GetUTF8Text()
>             conf = api.MeanTextConf()
>             thresholdedImage = api.GetThresholdedImage()
>             thresholdedImage.save('/Users/chrysrobyn/tess-install/tesseract/scan_2_new_piece' + str(i) + str(ocrResult) + '.tif')
>             if ocrResult:
>                 print u'symbol {},conf: {}\n'.format(ocrResult, conf).encode('utf-8')
>
> I've highlighted in green my debugging steps. 1) I started saving the
> boxed images to see what commonality I could see. It's picking out words
> that work in my Example2.py script below.
> Unfortunately, api.GetUTF8Text() isn't returning anything for the vast
> majority of these boxes (second example below, plus this modified one).
> r.GetUTF8Text() (first example below), however, picks it up with high
> confidence. 2) I started playing with making bigger boxes. This has
> yielded some improvements, but nothing drastic.
>
> I'm again left struggling to understand: what can I do to
> GetComponentImages / SetRectangle / GetUTF8Text to make it match
> GetIterator / iterate_level / GetUTF8Text?
>
> On Friday, December 30, 2016 at 6:21:49 AM UTC-5, T G wrote:
>>
>> I'm trying to learn Python in parallel with the Tesseract API. My end
>> goal is to use the Tesseract API to read a document and do some basic
>> error checking. I've found a few examples that seem to be good places
>> to start, but I'm having trouble understanding the difference between
>> two pieces of code that, while different in behavior, seem to me like
>> they should be equivalent. Both were modified slightly from
>> https://pypi.python.org/pypi/tesserocr .
>>
>> The first example produces this output:
>>
>>     $ time ./GetComponentImagesExample2.py | tail -2
>>     symbol MISSISSIPPI,conf: 88.3686599731
>>
>>     real    0m14.227s
>>     user    0m13.534s
>>     sys     0m0.397s
>>
>> This is accurate and completes in 14 seconds. Reviewing the rest of the
>> output, it is pretty good -- I'm probably a few SetVariable commands
>> away from 99+% accuracy.
>>
>>     $ ./GetComponentImagesExample2.py | wc -l
>>     1289
>>
>> Manually reviewing the results, it appears to get all the text.
>>
>>     #!/usr/bin/python
>>     from PIL import Image
>>     Image.MAX_IMAGE_PIXELS = 1000000000
>>     from tesserocr import PyTessBaseAPI, RIL, iterate_level
>>
>>     image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
>>     with PyTessBaseAPI() as api:
>>         api.SetImage(image)
>>         api.Recognize()
>>         api.SetVariable("save_blob_choices", "T")
>>         ri = api.GetIterator()
>>         level = RIL.WORD
>>         boxes = api.GetComponentImages(RIL.WORD, True)
>>         print 'Found {} textline image components.'.format(len(boxes))
>>         for r in iterate_level(ri, level):
>>             symbol = r.GetUTF8Text(level)
>>             conf = r.Confidence(level)
>>             if symbol:
>>                 print u'symbol {},conf: {}\n'.format(symbol, conf).encode('utf-8')
>>
>> The second example produces this output:
>>
>>     $ time ./GetComponentImagesExample4.py | tail -4
>>     symbol MISSISS IPPI,conf: 85
>>
>>     real    0m17.524s
>>     user    0m16.600s
>>     sys     0m0.427s
>>
>> This is less accurate (an extra space detected inside a word) and
>> slower (17.5 seconds).
>>
>>     $ ./GetComponentImagesExample4.py | wc -l
>>     223
>>
>> This is missing a large amount of text and I don't understand why.
>>
>>     #!/usr/bin/python
>>     from PIL import Image
>>     Image.MAX_IMAGE_PIXELS = 1000000000
>>     from tesserocr import PyTessBaseAPI, RIL
>>
>>     image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
>>     with PyTessBaseAPI() as api:
>>         api.SetImage(image)
>>         api.Recognize()
>>         api.SetVariable("save_blob_choices", "T")
>>         boxes = api.GetComponentImages(RIL.WORD, True)
>>         print 'Found {} textword image components.'.format(len(boxes))
>>         for i, (im, box, _, _) in enumerate(boxes):
>>             api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
>>             ocrResult = api.GetUTF8Text()
>>             conf = api.MeanTextConf()
>>             if ocrResult:
>>                 print u'symbol {},conf: {}\n'.format(ocrResult, conf).encode('utf-8')
>>             # print (u"Box[{0}]: x={x}, y={y}, w={w}, h={h}, "
>>             #        "confidence: {1}, text: {2}").format(i, conf, ocrResult, **box).encode('utf-8')
>>
>> My end goal relies on knowing where text is found in the document, so I
>> need the bounding boxes like the second example. As near as I can tell,
>> iterate_level doesn't expose the coordinates of the found text, so I
>> need GetComponentImages... but the output is not equivalent.
>>
>> Why do these two pieces of code differ in accuracy? Can I get
>> GetComponentImages to match GetIterator?
>>
>> Thanks for any help.
>>
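One more note on the box-padding experiment in the quoted message above: hard-coding -8/+16 can hand SetRectangle negative coordinates for words near the page edge. A padding helper that clamps to the image bounds avoids that (a sketch; pad_box is my own name, not a tesserocr API):

```python
def pad_box(box, pad, img_w, img_h):
    """Grow a GetComponentImages-style box dict by `pad` pixels on every
    side, clamped to the image bounds so the resulting rectangle never
    has negative coordinates or runs off the page.  Returns (x, y, w, h)
    ready to unpack into SetRectangle."""
    x = max(int(box['x']) - pad, 0)
    y = max(int(box['y']) - pad, 0)
    w = min(int(box['x']) + int(box['w']) + pad, img_w) - x
    h = min(int(box['y']) + int(box['h']) + pad, img_h) - y
    return x, y, w, h
```

With img_w and img_h taken from image.size, the call in the loop would become api.SetRectangle(*pad_box(box, 8, img_w, img_h)).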