I'm trying to chew through OCR output for some bank statements, and I'm 
having difficulty with the hOCR.  I could use some overall advice as well 
as help with a few specific issues.

1) The insertion of tags like <strong> without a corresponding bbox 
attribute is really irritating when trying to programmatically extract 
text.  Can I turn this off somehow without recompiling the universe?  
(Also, why aren't these part of the attributes anyway?  The value that 
should really be returned is *weight* or *slant*, and those don't map 
cleanly onto HTML without CSS anyway.)
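
For what it's worth, here is a minimal sketch of a workaround on the parsing side, using only Python's standard-library html.parser.  It assumes the words arrive as <span class='ocrx_word' title='bbox x0 y0 x1 y1; ...'> elements with <strong>/<em> wrapped around the text inside them; the names WordExtractor and extract_words are made up for illustration and are not part of Tesseract.  The inline tags are folded into bold/italic flags, so they never get in the way of the text.

```python
import re
from html.parser import HTMLParser

# Assumption: the bbox is stored in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordExtractor(HTMLParser):
    """Collect ocrx_word spans; turn <strong>/<em> into boolean flags."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished word records
        self._word = None  # record being assembled, or None outside a word
        self._nested = 0   # depth of extra <span>s opened inside the word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._word is None:
            classes = (attrs.get("class") or "").split()
            if tag == "span" and "ocrx_word" in classes:
                m = BBOX_RE.search(attrs.get("title") or "")
                self._word = {
                    "text": "",
                    "bbox": tuple(map(int, m.groups())) if m else None,
                    "bold": False,
                    "italic": False,
                }
        elif tag == "span":
            self._nested += 1
        elif tag in ("strong", "b"):
            self._word["bold"] = True
        elif tag in ("em", "i"):
            self._word["italic"] = True

    def handle_endtag(self, tag):
        if tag != "span" or self._word is None:
            return
        if self._nested:
            self._nested -= 1
        else:
            self.words.append(self._word)
            self._word = None

    def handle_data(self, data):
        # Text between words (whitespace, line breaks) is ignored.
        if self._word is not None:
            self._word["text"] += data


def extract_words(hocr):
    """Return a flat list of word records from an hOCR string."""
    parser = WordExtractor()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Calling extract_words() on the hOCR text gives a list of dicts with text, bbox, bold and italic keys, which sidesteps the missing bbox on the inline tags rather than turning them off.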

2) My .tif files look a bit ... "fuzzy" after thresholding and deskewing.  
Any suggestions for filtering to help out Tesseract?  (Tesseract really 
fumbles this by completely missing the 9 and 5 before the decimal.)  
ImageMagick tends to be my Swiss-army chainsaw for such operations, but if 
I need a different tool, I am open to it.

3) This font is confusing Tesseract a bit (lowercase "l" (L) is 
particularly bad, for obvious reasons).  Is there any way to help it out 
by indicating font characteristics?

Overall, though, things aren't bad.  ABBYY is probably about 10% more 
accurate on word detection; it seems to work much harder to detect and 
preserve clusters of characters as a word.  Tesseract occasionally splits 
things like "3,475.56" into "3", "475" and "56" and loses either the comma 
or the period.  This happens about twice a page, which is fairly 
irritating.
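
One possible post-processing patch for the split amounts, as a rough sketch only: merge neighbouring word boxes that are purely digits/punctuation, sit on the same row, and are separated by a small horizontal gap.  Everything here is an assumption of mine: the input is a list of (text, (x0, y0, x1, y1)) tuples in reading order, the gap threshold of 0.6 x box height is a guess, and the rule "a trailing two-digit piece is the cents" only makes sense for currency amounts like those on a bank statement.  The function names are hypothetical.

```python
import re

# A fragment is a token made only of digits, commas and periods.
FRAGMENT = re.compile(r"^[\d.,]+$")


def same_row(a, b):
    """True if boxes a and b overlap vertically by at least half the shorter one."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    return overlap >= 0.5 * min(a[3] - a[1], b[3] - b[1])


def rebuild(pieces):
    """Join fragments; re-insert separators only in the 'cents' case."""
    if len(pieces) == 1:
        return pieces[0]
    digits = [re.sub(r"\D", "", p) for p in pieces]
    whole = "".join(digits[:-1])
    if len(digits[-1]) == 2 and whole:
        # Heuristic: last piece of exactly two digits is treated as cents.
        # Note that int() drops any leading zeros in the whole part.
        return f"{int(whole):,}.{digits[-1]}"
    return "".join(pieces)  # otherwise do not guess, just concatenate


def merge_numeric_fragments(words, max_gap_ratio=0.6):
    """Merge adjacent numeric fragments that look like one split number."""
    groups = []  # each entry is [list_of_pieces, bounding_box]
    for text, box in words:
        if groups:
            pieces, pbox = groups[-1]
            gap = box[0] - pbox[2]
            height = pbox[3] - pbox[1]
            if (FRAGMENT.match(text)
                    and all(FRAGMENT.match(p) for p in pieces)
                    and same_row(pbox, box)
                    and box[2] > pbox[2]
                    and gap <= max_gap_ratio * height):
                pieces.append(text)
                # Grow the bounding box to cover the new piece.
                groups[-1][1] = (min(pbox[0], box[0]), min(pbox[1], box[1]),
                                 max(pbox[2], box[2]), max(pbox[3], box[3]))
                continue
        groups.append([[text], box])
    return [(rebuild(pieces), box) for pieces, box in groups]
```

This will also glue together two genuinely separate numbers if they are printed very close to each other, so the threshold needs tuning against real pages.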

Layout detection is, as for any OCR, a disaster.  It's remarkable how hard 
it is to code layout.  I wonder if it wouldn't be better to just have a 
list of the words with their bbox and attributes rather than a bunch of 
_area, _line, etc. elements that are all just broken.  Maybe if I were 
digitizing books this would appeal to me more, as presumably 
lines/paragraphs/pictures are easier to detect there.
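
To illustrate what I mean by working from a flat word list: the sketch below ignores the hOCR line/area structure entirely and rebuilds table rows just by clustering words on their vertical centres.  It assumes the same (text, (x0, y0, x1, y1)) tuples as input, and the tolerance of half the median word height is an untested guess.  group_rows is a made-up name, not anything Tesseract provides.

```python
from statistics import median


def group_rows(words, tolerance=0.5):
    """Cluster (text, bbox) tuples into rows by vertical centre.

    Returns a list of rows, top to bottom, each a list of texts
    ordered left to right.
    """
    if not words:
        return []

    def y_center(word):
        _, (_, y0, _, y1) = word
        return (y0 + y1) / 2

    # Scale the tolerance by the typical word height on the page.
    limit = tolerance * median(b[3] - b[1] for _, b in words)

    rows = []  # each row is a list of (text, bbox) tuples
    for word in sorted(words, key=y_center):
        if rows:
            current = rows[-1]
            row_center = sum(y_center(w) for w in current) / len(current)
            if y_center(word) - row_center <= limit:
                current.append(word)
                continue
        rows.append([word])

    # Within each row, order words by their left edge.
    return [[text for text, _ in sorted(row, key=lambda w: w[1][0])]
            for row in rows]
```

For a statement this gives one list per transaction line, which is closer to what I actually need than the detected paragraphs, though it would fall apart on skewed scans or multi-column pages.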

Anyway, thanks for all the hard work.  I couldn't even have tried to do 
this programmatically without Tesseract.
