Wonderful news, Zdenko!

> Yesterday David Eger committed a patch that should fix tesseract-ocr hOCR
> output to follow the hOCR spec.

I wonder what he did?

> A. Spec conformity. As far as I understood this is fixed (no report about
> non conformity to hOCR spec).

Good.

> B. Usability in other tools. This is a little bit tricky because it needs
> support from the authors of other tools (e.g. pdfbeads). Example: if
> tesseract-ocr produces a valid hOCR document and some tool is not able to
> process it, then IMO that tool should be fixed...

> From my point
> of view pdfbeads 1.0.9 fixed ocrx_word problem so issue 711 should be
> closed.

It's true that, as I noted, using 1.0.9 stops the segfault.

But you end up with a pdf in which the highlighted words
are anywhere from reduced-accuracy to unusable.

When I use my patch for Tess 3.01, I get
word-start-specific highlights that align
crisply with the beginning of each word.

It also prevents the more horrible output problems
where it sometimes went very wrong, like on my
table of contents page.

I did not make pdfbeads do anything new.
It used to work fine, with word-perfect starts etc.,
on Tess 3.00.  All I did was change the code so
that it uses the Tess 3.01 hOCR format.

> C. Other problems/enhancements: e.g. "empty words". This needs to be tested
> (improved) but I think other tools should be able to process it.
>
I recently patched pdfbeads just a little bit more
to tolerate and ignore empty words or lines.
Very handy for people who have to hand-tweak
a few mistakes in the hocr output. After deleting
some text, a word or line may become empty.
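
To show what I mean by "tolerate and ignore", here is a rough Python
sketch. It is not the actual pdfbeads change, only an illustration of the
idea. It assumes words are marked as spans with class ocrx_word and carry a
"bbox x0 y0 x1 y1" entry in the title attribute; the names WordCollector
and non_empty_words are made up for this example.

```python
import re
from html.parser import HTMLParser

# Assumed layout: title="bbox x0 y0 x1 y1", possibly followed by other properties.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class WordCollector(HTMLParser):
    """Collect (text, bbox) pairs, skipping words that have no text."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0      # span nesting inside the current word; 0 = not in a word
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            # A span nested inside a word: just track the nesting.
            self._depth += 1
            return
        a = dict(attrs)
        if "ocrx_word" in (a.get("class") or "").split():
            m = BBOX_RE.search(a.get("title") or "")
            self._bbox = tuple(int(g) for g in m.groups()) if m else None
            self._chunks = []
            self._depth = 1

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            text = "".join(self._chunks).strip()
            # Empty or whitespace-only words are dropped instead of raising.
            if text and self._bbox:
                self.words.append((text, self._bbox))


def non_empty_words(hocr):
    """Return the non-empty words of an hOCR string with their boxes."""
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    return parser.words


SAMPLE = """
<span class='ocr_line' title='bbox 10 10 200 40'>
 <span class='ocrx_word' title='bbox 10 10 60 40'>Table</span>
 <span class='ocrx_word' title='bbox 70 10 90 40'> </span>
 <span class='ocrx_word' title='bbox 100 10 120 40'></span>
 <span class='ocrx_word' title='bbox 130 10 200 40'>Contents</span>
</span>
"""

for text, bbox in non_empty_words(SAMPLE):
    print(text, bbox)
```

The two hand-emptied words in the sample are skipped, and the remaining
words keep their original boxes, so the highlights for the words around
them are not disturbed.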

> --
> Zdenko
>

Thanks for all your help, Zdenko!
