On 26.05.2012 23:09, Galt wrote:
> Wonderful news, Zdenko!
>
>> Yesterday David Eger committed a patch that should fix tesseract-ocr hOCR output
>> to follow the hOCR spec.
> I wonder what he did?

See [1] and [2]. And today I committed r729... We tested the output with pdfbeads (1.0.9) and ExactImage's hocr2pdf. The resulting PDF was checked in evince (a Linux PDF viewer).
I found out that pdfbeads is not able to work with ocrx_line, so David reverted the code to ocr_line. hocr2pdf produces a warning message for XML declarations, uses the title value in a strange way (for me), and expects all words on one line (e.g. I cannot indent ocrx_words). So there is no title value and no indentation of ocrx_words.

So the current (r729) hOCR output is a compromise (from my point of view) that works in both pdfbeads and ExactImage's hocr2pdf. The output is a valid XHTML 1.0 Transitional document.

[1] http://code.google.com/p/tesseract-ocr/source/diff?spec=svn726&r=726&format=side&path=/trunk/api/baseapi.cpp
[2] http://code.google.com/p/tesseract-ocr/source/diff?spec=svn728&r=728&format=side&path=/trunk/api/baseapi.cpp
[3] http://code.google.com/p/tesseract-ocr/source/detail?r=729

>> A. Spec conformity. As far as I understood this is fixed (no report about
>> non conformity to hOCR spec).
> Good.
>
>> B. Usability in other tools. This is a little bit tricky because it needs
>> support of author of other tools (e.g. pdfbeads). Example: if tesseract-ocr
>> produces a valid hOCR document and some tool is not able to process it, then
>> IMO that tool should be fixed...
>> From my point
>> of view pdfbeads 1.0.9 fixed ocrx_word problem so issue 711 should be
>> closed.
> If one uses 1.0.9, as I noted, it stops the segfault, that's true.
>
> But you end up with a pdf in which the highlighted words
> are anywhere from reduced-accuracy to unusable.

Please send an example image to my e-mail. I could not reproduce this (I tested only one page).

> When I use my patch for use with Tess 3.01,
> it gets the word-start-specific highlights that
> crisply align to the beginning of each word.
>
> It also prevents the more horrible output problems
> where it sometimes went very wrong, like on my
> table of contents page.
>
> I did not make pdfbeads do anything new.
> It used to work fine with word-perfect starts etc.
> on Tess 3.00. All I did is change the code so
> that it uses the Tess 3.01 hocr format.

The 3.00 hOCR output does not follow the hOCR spec. Also, I found out that pdfbeads does not recognize all hOCR tags from the spec (e.g. ocrx_line).

>> C. Other problems/enhancements: e.g. "empty words". This needs to be tested
>> (improved) but I think other tools should be able to process it.
>>
> I recently patched pdfbeads just a little bit more
> to tolerate and ignore empty words or lines.
> Very handy for people who have to hand-tweak
> a few mistakes in the hocr output. After deleting
> some text, a word or line may become empty.

Please send me an image that generates empty words and your latest pdfbeads patch (just to see the expected changes).

BTW: the hOCR patch for tesseract-ocr was sent by user amkryukov (see issue 263 [4]). The pdfbeads author's name [5] is Alexey Kryukov. I guess it is the same person, and IMO this is the reason why the 3.00 hOCR version worked with pdfbeads even though it does not follow the hOCR spec...

[4] http://code.google.com/p/tesseract-ocr/issues/detail?id=263
[5] http://rubygems.org/gems/pdfbeads

--
Zdenko
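
To make the "empty words" point concrete, here is a minimal Python sketch of how a consumer of hOCR could read lines and words while tolerating empty ones. It is only an illustration, not code from pdfbeads, hocr2pdf, or tesseract-ocr. Assumptions: the sample document is hand-written for this sketch (not real tesseract output), the function names are hypothetical, bounding boxes are carried as `bbox x0 y0 x1 y1` in the `title` attribute as the hOCR spec describes, and both `ocr_line` and `ocrx_line` are accepted as line classes.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample for this sketch; NOT real tesseract output.
# It contains one whitespace-only word and one completely empty line.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="ocr_page" title="bbox 0 0 600 800">
<span class="ocr_line" title="bbox 10 10 300 40"><span class="ocrx_word" title="bbox 10 10 90 40">Hello</span> <span class="ocrx_word" title="bbox 100 10 100 40"> </span> <span class="ocrx_word" title="bbox 110 10 200 40">world</span></span>
<span class="ocr_line" title="bbox 10 50 300 80"></span>
</div>
</body>
</html>"""

# Assumption: a consumer may want to accept either line class.
LINE_CLASSES = {"ocr_line", "ocrx_line"}
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def local_name(tag):
    """Strip an XML namespace such as '{http://www.w3.org/1999/xhtml}span'."""
    return tag.rsplit("}", 1)[-1]


def bbox_of(element):
    """Return (x0, y0, x1, y1) from the title attribute, or None if absent."""
    match = BBOX_RE.search(element.get("title", ""))
    return tuple(int(g) for g in match.groups()) if match else None


def read_lines(hocr_text):
    """Return a list of lines; each line is a list of (text, bbox) words.

    Words with no visible text and lines with no remaining words are skipped.
    """
    root = ET.fromstring(hocr_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "span":
            continue
        if element.get("class") not in LINE_CLASSES:
            continue
        words = []
        for child in element.iter():
            if child.get("class") != "ocrx_word":
                continue
            text = "".join(child.itertext()).strip()
            if text:  # ignore empty or whitespace-only words
                words.append((text, bbox_of(child)))
        if words:  # ignore lines that ended up empty
            lines.append(words)
    return lines
```

Calling `read_lines(SAMPLE)` returns a single line with the two words "Hello" and "world"; the whitespace-only word and the empty second line are dropped instead of causing an error. Parsing with an XML parser relies on the output being well-formed XHTML, as stated above for r729.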

