I'm still hoping to learn how to use GetComponentImages / SetRectangle better, but I found a workaround to get what I need out of GetIterator / iterate_level. I can't find documentation for BoundingBoxInternal, but I saw a reference to it at <https://zdenop.github.io/tesseract-doc/group___advanced_a_p_i.html#ga56369b1654400ef97e581bb65749ec3d> and decided to see if I could get it to work. I now get bounding boxes from my faster, more accurate method.
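As a side note, if I read the tesserocr bindings correctly, the result iterator also exposes a plain BoundingBox(level) method that returns an (x1, y1, x2, y2) tuple in page coordinates, which may be a documented alternative to BoundingBoxInternal. If you ever need those boxes in the (x, y, w, h) form that GetComponentImages / SetRectangle use, a small conversion helper does it (a sketch; box_to_rect is my own name, not part of the tesserocr API):

```python
def box_to_rect(bbox):
    """Convert an (x1, y1, x2, y2) bounding box -- the form returned by
    the result iterator's BoundingBox(level) -- into the (x, y, w, h)
    dict form used by GetComponentImages and SetRectangle."""
    x1, y1, x2, y2 = bbox
    return {'x': x1, 'y': y1, 'w': x2 - x1, 'h': y2 - y1}
```

For example, box_to_rect((10, 20, 110, 70)) gives {'x': 10, 'y': 20, 'w': 100, 'h': 50}.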
#!/usr/bin/python
from PIL import Image
Image.MAX_IMAGE_PIXELS = 1000000000
from tesserocr import PyTessBaseAPI, RIL, iterate_level

image = Image.open('/Users/gordot2/tess-install/tesseract/scan_2_new.tif')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.Recognize()
    api.SetVariable("save_blob_choices", "T")
    ri = api.GetIterator()
    level = RIL.WORD
    for r in iterate_level(ri, level):
        print r.BoundingBoxInternal(level)
        symbol = r.GetUTF8Text(level)
        conf = r.Confidence(level)
        if symbol:
            print u'symbol {},conf: {}\n'.format(symbol, conf).encode('utf-8')

On Monday, January 2, 2017 at 7:25:39 AM UTC-5, T G wrote:
>
> I've continued to spend a little time each day working on my problem. I've
> found something that fuels my desire to understand what GetComponentImages
> does differently from iterate_level.
>
>     from PIL import Image
>     Image.MAX_IMAGE_PIXELS = 1000000000
>     from tesserocr import PyTessBaseAPI, RIL
>
>     image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
>     with PyTessBaseAPI() as api:
>         api.SetImage(image)
>         api.Recognize()
>         api.SetVariable("save_blob_choices", "T")
>         boxes = api.GetComponentImages(RIL.WORD, True)
>         print 'Found {} textword image components.'.format(len(boxes))
>         print enumerate(boxes)
>         for i, (im, box, _, _) in enumerate(boxes):
>             # api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
>             api.SetRectangle(int(box['x']) - 8, int(box['y']) - 8,
>                              int(box['w']) + 16, int(box['h']) + 16)
>             ocrResult = api.GetUTF8Text()
>             conf = api.MeanTextConf()
>             thresholdedImage = api.GetThresholdedImage()
>             thresholdedImage.save('/Users/chrysrobyn/tess-install/tesseract/scan_2_new_piece' + str(i) + str(ocrResult) + '.tif')
>             if ocrResult:
>                 print u'symbol {},conf: {}\n'.format(ocrResult, conf).encode('utf-8')
>
> I've highlighted in green my debugging steps. 1) I started saving the
> boxed images to see what commonality I could see. It's picking out words
> that work in my Example2.py script below.
> Unfortunately, api.GetUTF8Text() isn't returning anything for the vast
> majority of these boxes (second example below, plus this modified one).
> r.GetUTF8Text() (first example below), however, picks it up with high
> confidence. 2) I started playing with making bigger boxes. This has
> yielded some improvements, but nothing drastic.
>
> I'm again left struggling to understand: what can I do to
> GetComponentImages / SetRectangle / GetUTF8Text to make it match
> GetIterator / iterate_level / GetUTF8Text?
>
> On Friday, December 30, 2016 at 6:21:49 AM UTC-5, T G wrote:
>>
>> I'm trying to learn Python in parallel with the Tesseract API. My end
>> goal is to use the Tesseract API to read a document and do some basic
>> error checking. I've found a few examples that seem to be good places
>> to start, but I'm having trouble understanding the difference between
>> two pieces of code that, while different in behavior, seem to me like
>> they should be equivalent. Both were modified slightly from
>> https://pypi.python.org/pypi/tesserocr .
>>
>> The first example produces this output:
>>
>>     $ time ./GetComponentImagesExample2.py | tail -2
>>     symbol MISSISSIPPI,conf: 88.3686599731
>>
>>     real    0m14.227s
>>     user    0m13.534s
>>     sys     0m0.397s
>>
>> This is accurate and completes in 14 seconds. Reviewing the rest of the
>> output, it is pretty good -- I'm probably a few SetVariable commands
>> away from 99+% accuracy.
>>
>>     $ ./GetComponentImagesExample2.py | wc -l
>>     1289
>>
>> Manually reviewing the results, it appears to get all the text.
>>
>>     #!/usr/bin/python
>>     from PIL import Image
>>     Image.MAX_IMAGE_PIXELS = 1000000000
>>     from tesserocr import PyTessBaseAPI, RIL, iterate_level
>>
>>     image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
>>     with PyTessBaseAPI() as api:
>>         api.SetImage(image)
>>         api.Recognize()
>>         api.SetVariable("save_blob_choices", "T")
>>         ri = api.GetIterator()
>>         level = RIL.WORD
>>         boxes = api.GetComponentImages(RIL.WORD, True)
>>         print 'Found {} textline image components.'.format(len(boxes))
>>         for r in iterate_level(ri, level):
>>             symbol = r.GetUTF8Text(level)
>>             conf = r.Confidence(level)
>>             if symbol:
>>                 print u'symbol {},conf: {}\n'.format(symbol, conf).encode('utf-8')
>>
>> The second example produces this output:
>>
>>     $ time ./GetComponentImagesExample4.py | tail -4
>>     symbol MISSISS IPPI,conf: 85
>>
>>     real    0m17.524s
>>     user    0m16.600s
>>     sys     0m0.427s
>>
>> This is less accurate (an extra space detected inside a word) and
>> slower (17.5 seconds).
>>
>>     $ ./GetComponentImagesExample4.py | wc -l
>>     223
>>
>> This is missing a large amount of text and I don't understand why.
>>
>>     #!/usr/bin/python
>>     from PIL import Image
>>     Image.MAX_IMAGE_PIXELS = 1000000000
>>     from tesserocr import PyTessBaseAPI, RIL
>>
>>     image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
>>     with PyTessBaseAPI() as api:
>>         api.SetImage(image)
>>         api.Recognize()
>>         api.SetVariable("save_blob_choices", "T")
>>         boxes = api.GetComponentImages(RIL.WORD, True)
>>         print 'Found {} textword image components.'.format(len(boxes))
>>         for i, (im, box, _, _) in enumerate(boxes):
>>             api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
>>             ocrResult = api.GetUTF8Text()
>>             conf = api.MeanTextConf()
>>             if ocrResult:
>>                 print u'symbol {},conf: {}\n'.format(ocrResult, conf).encode('utf-8')
>>             # print (u"Box[{0}]: x={x}, y={y}, w={w}, h={h}, "
>>             #        "confidence: {1}, text: {2}").format(i, conf, ocrResult, **box).encode('utf-8')
>>
>> My end goal relies on knowing where text is found in the document, so I
>> need the bounding boxes like the second example. As near as I can tell,
>> iterate_level doesn't expose the coordinates of the found text, so I
>> need GetComponentImages... but the output is not equivalent.
>>
>> Why do these two pieces of code differ in accuracy? Can I get
>> GetComponentImages to match GetIterator?
>>
>> Thanks for any help.
>>
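One more note on the box-padding experiment in the quoted message above: hard-coding -8/+16 can hand SetRectangle negative coordinates for words near the page edge. A padding helper that clamps to the image bounds avoids that (a sketch; pad_box is my own name, not a tesserocr API):

```python
def pad_box(box, pad, img_w, img_h):
    """Grow a GetComponentImages-style box dict by `pad` pixels on every
    side, clamped to the image bounds so the resulting rectangle never
    has negative coordinates or runs off the page.  Returns (x, y, w, h)
    ready to unpack into SetRectangle."""
    x = max(int(box['x']) - pad, 0)
    y = max(int(box['y']) - pad, 0)
    w = min(int(box['x']) + int(box['w']) + pad, img_w) - x
    h = min(int(box['y']) + int(box['h']) + pad, img_h) - y
    return x, y, w, h
```

With img_w and img_h taken from image.size, the call in the loop would become api.SetRectangle(*pad_box(box, 8, img_w, img_h)).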