[tesseract-ocr] Bank statement hOCR issues

Andrew Lentvorski Sun, 06 Dec 2015 10:18:07 -0800

I'm trying to OCR a batch of bank statements, and I'm having difficulty
with the hOCR output.  I could use some overall advice as well as help
with a few specific issues.

1) The insertion of tags like <strong> without a corresponding bbox
attribute is really irritating when trying to extract text
programmatically. Can I turn this off somehow without recompiling the
universe? (Personally, I wonder why these aren't part of the attributes
anyway. The value that should really be returned is *weight* or *slant*,
and those don't map cleanly onto HTML without CSS.)

2) My .tif files look a bit ... "fuzzy" after threshold and deskew. Any
suggestions for filtering to help out Tesseract? (Tesseract really fumbles
this, completely missing the 9 and 5 before the decimal.) ImageMagick
tends to be my Swiss-army chainsaw for such operations, but if I need a
different tool, I am open to it.

3) This font is confusing Tesseract a bit (lowercase l (L) is particularly
bad, for obvious reasons). Is there any way to help it out by indicating
font characteristics?
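
For point 1, one workaround that avoids touching Tesseract itself is to
ignore the inline tags while reading the hOCR. The following is only a
minimal sketch using Python's standard html.parser; the names
WordExtractor, extract_words and SAMPLE are hypothetical, and the sample
markup is an assumption about what a word span with nested <strong>/<em>
looks like, not output copied from a real run.

```python
from html.parser import HTMLParser


class WordExtractor(HTMLParser):
    """Collect (title, text) per ocrx_word span, ignoring inline tags inside it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []    # finished (title, text) pairs
        self._depth = 0    # span nesting depth inside the current word; 0 = outside
        self._title = ""
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth == 0:
            classes = (attrs.get("class") or "").split()
            if tag == "span" and "ocrx_word" in classes:
                self._depth = 1
                self._title = attrs.get("title") or ""
                self._chunks = []
        elif tag == "span":
            self._depth += 1
        # Any other tag (<strong>, <em>, ...) inside a word is simply skipped.

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._chunks).strip()
                self.words.append((self._title, text))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_words(hocr_text):
    parser = WordExtractor()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Assumed sample markup, for illustration only.
SAMPLE = """<span class='ocr_line' title='bbox 10 10 300 40'>
<span class='ocrx_word' title='bbox 10 10 90 40; x_wconf 91'><strong>Balance</strong></span>
<span class='ocrx_word' title='bbox 100 10 200 40; x_wconf 88'><strong><em>3,475.56</em></strong></span>
</span>"""

for title, text in extract_words(SAMPLE):
    print(title, "->", text)
```

The word's title attribute still carries the bbox, so the weight/slant
markup is dropped without losing position information.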

Overall, though, things aren't bad. ABBYY is probably about 10% more
accurate on word detection; it seems to work much harder to detect and
preserve clusters of characters as a single word. Tesseract occasionally
splits things like "3,475.56" into "3", "475" and "56" and loses either
the comma or the period. This happens about twice a page, which is
fairly irritating.
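
The split amounts can sometimes be stitched back together after the fact
from the word boxes. The sketch below is a heuristic, not anything
Tesseract provides: it assumes the words of one line arrive in reading
order as (text, bbox) pairs, that fragments of one amount sit within a few
pixels of each other (the max_gap value is a guess), and that amounts are
US-style, so a trailing two-digit fragment follows a period and every
other missing separator is a comma. The function names are hypothetical.

```python
import re

NUMERIC = re.compile(r"^[\d.,]+$")


def join_fragments(frags):
    """Rejoin numeric fragments, guessing any separator that was lost."""
    out = frags[0]
    for i, frag in enumerate(frags[1:], start=1):
        if out[-1] in ".," or frag[0] in ".,":
            sep = ""       # the separator survived on one of the fragments
        elif i == len(frags) - 1 and len(frag) == 2:
            sep = "."      # assumption: trailing two digits are cents
        else:
            sep = ","      # assumption: everything else is a thousands group
        out += sep + frag
    return out


def merge_amounts(words, max_gap=8):
    """words: list of (text, (x0, y0, x1, y1)) for ONE line, left to right."""
    groups = []  # each entry is [list_of_fragments, bbox]
    for text, (x0, y0, x1, y1) in words:
        if groups and NUMERIC.match(text):
            frags, (gx0, gy0, gx1, gy1) = groups[-1]
            close = abs(x0 - gx1) <= max_gap
            if close and all(NUMERIC.match(f) for f in frags):
                frags.append(text)
                groups[-1][1] = (gx0, min(gy0, y0), x1, max(gy1, y1))
                continue
        groups.append([[text], (x0, y0, x1, y1)])
    return [(join_fragments(frags), box) for frags, box in groups]


line = [
    ("Balance", (10, 10, 90, 40)),
    ("3", (100, 10, 112, 40)),
    ("475", (116, 10, 150, 40)),
    ("56", (154, 10, 176, 40)),
    ("12.00", (300, 10, 350, 40)),
]
for text, box in merge_amounts(line):
    print(text, box)
```

The guessed separators are exactly that, guesses, so anything merged this
way probably deserves a sanity check against the statement totals.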

Layout detection is, as with any OCR, a disaster. It's remarkable how hard
it is to code layout. I wonder if it wouldn't be better to just have a
list of the words with their bbox and attributes rather than a bunch of
_area, _line, etc. elements that are all just broken. Maybe if I were
digitizing books this would appeal to me more, as presumably
lines/paragraphs/pictures are easier to detect there.
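
The flat-list idea can be approximated on the consumer side. This is a
rough sketch, assuming each word's title attribute has the form
"bbox x0 y0 x1 y1; x_wconf N" with integer values: it parses the title
into a record and regroups words into rows by vertical position, ignoring
the _area/_line structure entirely. Word, parse_word, group_rows and the
0.5 tolerance are all hypothetical choices for illustration.

```python
import re
from collections import namedtuple

Word = namedtuple("Word", "text x0 y0 x1 y1 conf")

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF = re.compile(r"x_wconf (\d+)")


def parse_word(text, title):
    """Turn a word's text and hOCR title string into a flat Word record."""
    m = BBOX.search(title)
    if m is None:
        return None  # no usable position, caller can skip it
    x0, y0, x1, y1 = map(int, m.groups())
    c = CONF.search(title)
    return Word(text, x0, y0, x1, y1, int(c.group(1)) if c else None)


def group_rows(words, tolerance=0.5):
    """Cluster words into rows by vertical center, then sort rows left to right."""
    rows = []
    for w in sorted(words, key=lambda w: (w.y0 + w.y1) / 2):
        cy = (w.y0 + w.y1) / 2
        height = w.y1 - w.y0
        if rows:
            last = rows[-1]
            last_cy = sum((v.y0 + v.y1) / 2 for v in last) / len(last)
            # Same row if the centers differ by less than a fraction of the height.
            if abs(cy - last_cy) <= tolerance * height:
                last.append(w)
                continue
        rows.append([w])
    return [sorted(row, key=lambda w: w.x0) for row in rows]


words = [
    parse_word("Amount", "bbox 300 12 380 32; x_wconf 93"),
    parse_word("Date", "bbox 10 10 60 30; x_wconf 90"),
    parse_word("12/01", "bbox 10 50 70 70; x_wconf 85"),
    parse_word("3,475.56", "bbox 300 51 380 71; x_wconf 80"),
]
for row in group_rows(words):
    print([w.text for w in row])
```

For tabular pages like statements, rows built from word boxes alone may
hold up better than the detected lines, though skewed scans would need a
looser tolerance.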

Anyway, thanks for all the hard work. I couldn't even have tried to do
this programmatically without Tesseract.
