Tess3.01 hocr output not working with pdfbeads

Galt Mon, 21 May 2012 19:48:41 -0700

I should begin by saying that I am grateful and happy to have
a very nice searchable pdf of an old book thanks to Tess.


I found this on the web:

<quote>

https://github.com/steelThread/mimeograph/commit/b29af3338e8f15b22392b4e313c8688d9950e13b

pdfbeads currently doesn't work with hOCR output generated by
tesseract v3.01.
the owner of pdfbeads doesn't want to enhance pdfbeads to work with
the existing tesseract hOCR output because tesseract's hOCR output is
not properly following the hOCR.
while i totally understand that position, tesseract release about once
a year and mimeo needs to work now

</quote>

Can you guys kiss and make up?

Until then I have been forced to make a little hack to pdfbeads so
that it reads the position from ocr_word and the word itself from
ocrx_word, which lets it read the Tess3.01 hocr input.  It seems that
pdfbeads is expecting both attributes to be in ocrx_word (the way it
was in Tess3.0?).
If anyone is interested in my simple hack, let me know.
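
A minimal sketch of the idea behind such a hack, written in Python
rather than pdfbeads' own Ruby, so it is not the actual patch.  It
assumes the Tess3.01 layout described above: an outer ocr_word span
carrying the bbox, wrapping an inner ocrx_word span carrying the text.
The sample markup and the names WordExtractor and extract_words are
made up for illustration.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordExtractor(HTMLParser):
    """Collect (bbox, text) pairs from hOCR word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._stack = []   # (css_class, bbox) for every open <span>
        self._text = None  # text chunks while inside an ocrx_word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        match = BBOX_RE.search(attrs.get("title") or "")
        bbox = tuple(int(g) for g in match.groups()) if match else None
        css = attrs.get("class") or ""
        self._stack.append((css, bbox))
        if css == "ocrx_word":
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        css, bbox = self._stack.pop()
        if css != "ocrx_word":
            return
        if bbox is None:
            # Assumed Tess3.01 layout: take the bbox from the
            # nearest enclosing ocr_word span instead.
            for outer_css, outer_bbox in reversed(self._stack):
                if outer_css == "ocr_word" and outer_bbox:
                    bbox = outer_bbox
                    break
        text = "".join(self._text).strip()
        self._text = None
        if bbox and text:
            self.words.append((bbox, text))


def extract_words(hocr):
    parser = WordExtractor()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hypothetical sample: first word in the nested layout, second word
# with bbox and text on the same ocrx_word span.
SAMPLE = (
    "<span class='ocr_word' id='word_1' title=\"bbox 10 20 60 40\">"
    "<span class='ocrx_word' id='xword_1' title=\"x_wconf -2\">Hello</span>"
    "</span> "
    "<span class='ocrx_word' title='bbox 70 20 120 40'>world</span>"
)

if __name__ == "__main__":
    for bbox, text in extract_words(SAMPLE):
        print(bbox, text)
```

Falling back to the enclosing span only when ocrx_word has no bbox of
its own means the same code accepts both layouts.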

I get a very nice searchable pdf with cut/paste text as the result,
and achieving that was one of the primary reasons that I turned to
Tess.

One little problem: the boxes around the text in the acroread viewer
always begin a word at exactly the right position.  But the end of the
word (judging by the text highlight) sometimes does not extend far
enough, and once in a while extends too far.
I don't know which part is to blame here: tess, pdfbeads, or acroread?
It would be nice to fix it.

It seems like Tess' hocr output is only identifying word positions,
while acroread allows sub-word access to highlight parts of words
(individual letters).
Since it doesn't have access to the actual positions of each letter
within the word, that probably leads to guessed positions which are
not always right.
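
To illustrate why guessed letter positions can miss, here is a small
Python sketch.  It assumes the simplest possible guess, splitting the
word box evenly between the letters.  That is only an assumption for
illustration, not necessarily what acroread or pdfbeads actually do,
and guess_letter_boxes is a made-up name.

```python
def guess_letter_boxes(bbox, text):
    """Split a word's bbox into equal-width boxes, one per letter."""
    if not text:
        return []
    x0, y0, x1, y1 = bbox
    width = (x1 - x0) / len(text)
    return [
        (x0 + i * width, y0, x0 + (i + 1) * width, y1)
        for i in range(len(text))
    ]


if __name__ == "__main__":
    # Every letter of "Wii" gets the same width, although on the page
    # the "W" is much wider than either "i".
    for letter, box in zip("Wii", guess_letter_boxes((0, 0, 30, 10), "Wii")):
        print(letter, box)
```

The first and last edges always match the word box, but the interior
boundaries depend on the guess, so any mismatch between assumed and
real letter widths shows up when part of a word is highlighted.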

Will Tess be providing letter-level hocr output?  That seems like it
would solve the problem.  Judging by some code I saw in pdfbeads, it
looks like cuneiform hocr output is doing letter-by-letter positions.
I guess adding individual letter positions might make the pdf output
file larger.
