Re: Tess3.01 hocr output not working with pdfbeads

Zdenko Podobný Mon, 28 May 2012 14:18:57 -0700

On 26.05.2012 23:09, Galt wrote:
> Wonderful news, Zdenko!
>
>> Yesterday David Eger committed a patch that should make tesseract-ocr hOCR
>> output follow the hOCR spec.
> I wonder what he did?
See [1] and [2]. Today I also committed r729 [3]. We tested the output with
pdfbeads (1.0.9) and ExactImage's hocr2pdf. The PDF was checked in evince
(a Linux PDF viewer).


I found out that pdfbeads cannot handle ocrx_line, so David reverted the
code to ocr_line.
hocr2pdf prints a warning for XML declarations, uses the title value in a
way that seems strange to me, and expects all words on one line (e.g. I
cannot indent ocrx_words). So there is no title value and no indentation of
ocrx_words.

So the current (r729) hOCR output is, from my point of view, a compromise
that works in both pdfbeads and ExactImage's hocr2pdf. The output is a valid
XHTML 1.0 Transitional document.
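
To illustrate the ocr_line / ocrx_line point, here is a minimal Python sketch
of a reader that accepts both class names. It is not code from pdfbeads or
hocr2pdf; the names iter_lines, parse_bbox and SAMPLE are made up for this
example, and it assumes the input is well-formed XHTML with the usual hOCR
"bbox x0 y0 x1 y1" property in each element's title attribute.

```python
import re
import xml.etree.ElementTree as ET

# Accept both line classes, so either variant of the output is usable.
LINE_CLASSES = {"ocr_line", "ocrx_line"}
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(title or "")
    return tuple(int(g) for g in match.groups()) if match else None


def iter_lines(hocr_text):
    """Yield (bbox, element) for every line element, in document order."""
    root = ET.fromstring(hocr_text)
    # Checking the class attribute sidesteps the XHTML namespace in tag names.
    for elem in root.iter():
        classes = set(elem.get("class", "").split())
        if classes & LINE_CLASSES:
            yield parse_bbox(elem.get("title")), elem


# Hypothetical sample, not real tesseract output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 600 800">
<span class="ocr_line" title="bbox 10 20 300 40"><span class="ocrx_word" title="bbox 10 20 80 40">Hello</span> <span class="ocrx_word" title="bbox 90 20 160 40">world</span></span>
<span class="ocrx_line" title="bbox 10 50 300 70"><span class="ocrx_word" title="bbox 10 50 80 70">again</span></span>
</div></body></html>"""

if __name__ == "__main__":
    for bbox, _ in iter_lines(SAMPLE):
        print(bbox)
```

A reader written this way would not care which of the two line classes a
given tesseract revision emits.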

[1] http://code.google.com/p/tesseract-ocr/source/diff?spec=svn726&r=726&format=side&path=/trunk/api/baseapi.cpp
[2] http://code.google.com/p/tesseract-ocr/source/diff?spec=svn728&r=728&format=side&path=/trunk/api/baseapi.cpp
[3] http://code.google.com/p/tesseract-ocr/source/detail?r=729

>> A. Spec conformity. As far as I understand, this is fixed (no reports of
>> non-conformity to the hOCR spec).
> Good.
>
>> B. Usability in other tools. This is a little bit tricky because it needs
>> support from the authors of other tools (e.g. pdfbeads). Example: if
>> tesseract-ocr produces a valid hOCR document and some tool is not able to
>> process it, then IMO that tool should be fixed...
>> From my point
>> of view pdfbeads 1.0.9 fixed the ocrx_word problem, so issue 711 should be
>> closed.
> If one uses 1.0.9, as I noted, it stops the segfault, that's true.
>
> But you end up with a pdf in which the highlighted words
> are anywhere from reduced-accuracy to unusable.
Please send an example image to my e-mail. I could not reproduce the problem
(I tested only one page).
>
> When I use my patch for use with Tess 3.01,
> it gets the word-start-specific highlights that
> crisply align to the beginning of each word.
>
> It also prevents the more horrible output problems
> where it sometimes went very wrong, like on my
> table of contents page.
>
> I did not make pdfbeads do anything new.
> It used to work fine with word-perfect starts etc.
> on Tess 3.00.  All I did is change the code so
> that it uses the Tess 3.01 hocr format.
The 3.00 hOCR output does not follow the hOCR spec. I also found out that
pdfbeads does not recognize all hOCR tags from the spec (e.g. ocrx_line).
>
>> C. Other problems/enhancements: e.g. "empty words". This needs to be tested
>> (improved), but I think other tools should be able to process it.
>>
> I recently patched pdfbeads just a little bit more
> to tolerate and ignore empty words or lines.
> Very handy for people who have to hand-tweak
> a few mistakes in the hocr output. After deleting
> some text, a word or line may become empty.
Please send me an image that generates empty words, and your latest pdfbeads
patch (just to see the expected changes).
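
For reference, tolerating empty words on the reading side could look roughly
like the Python sketch below. This is only an illustration of the idea, not
the pdfbeads patch mentioned above (pdfbeads is written in Ruby); the
function name non_empty_words and the sample markup are hypothetical, and
well-formed XHTML input is assumed.

```python
import xml.etree.ElementTree as ET


def non_empty_words(hocr_text):
    """Return the text of every ocrx_word that still contains text."""
    root = ET.fromstring(hocr_text)
    words = []
    for elem in root.iter():
        if "ocrx_word" in elem.get("class", "").split():
            # itertext() also collects text nested in child elements.
            text = "".join(elem.itertext()).strip()
            if text:  # skip words emptied by hand-editing
                words.append(text)
    return words


# Hypothetical hand-edited line: the middle words were deleted or blanked.
SAMPLE = """<div class="ocr_page">
<span class="ocr_line" title="bbox 10 20 300 40"><span class="ocrx_word" title="bbox 10 20 80 40">Hello</span> <span class="ocrx_word" title="bbox 85 20 88 40"></span> <span class="ocrx_word" title="bbox 89 20 90 40">  </span> <span class="ocrx_word" title="bbox 90 20 160 40">world</span></span>
</div>"""

if __name__ == "__main__":
    print(non_empty_words(SAMPLE))
```

The same check could be applied to whole lines, dropping a line when none of
its words survive.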

BTW: the hOCR patch for tesseract-ocr was sent by user amkryukov (see issue
263 [4]). The pdfbeads author's name [5] is Alexey Kryukov. I guess it is the
same person, and IMO this is why the 3.00 hOCR version worked with
pdfbeads even though it does not follow the hOCR spec...

[4] http://code.google.com/p/tesseract-ocr/issues/detail?id=263
[5] http://rubygems.org/gems/pdfbeads

--
Zdenko
