Re: Tess3.01 hocr output not working with pdfbeads

zdenko podobny Tue, 22 May 2012 01:05:46 -0700

On Tue, May 22, 2012 at 12:14 AM, Galt <[email protected]> wrote:

> I should begin by saying that I am grateful and happy to have
> a very nice searchable pdf of an old book thanks to Tess.
>
> I found this on the web:
>
> <quote>
>
>
> https://github.com/steelThread/mimeograph/commit/b29af3338e8f15b22392b4e313c8688d9950e13b
>
> pdfbeads currently doesn't work with hOCR output generated by
> tesseract v3.01.
> the owner of pdfbeads doesn't want to enhance pdfbeads to work with
> the existing
> tesseract
> hOCR output because tesseract's hOCR output is not properly following
> the hOCR.
> while i totally understand that position, tesseract release about once
> a year
> and mimeo needs to work now
>
> </quote>
>
> Can you guys kiss and make up?
>
Please open an issue describing what the output is and how it should look.



> Until then I have been forced to make a little hack to pdfbeads to get it
> to read the position
> and word from ocr_word and ocrx_word respectively so that it can read
> the Tess3.01 hocr input.  It seems that pdfbeads is
> expecting both attributes to be in ocrx_word (the way it was in
> Tess3.0?).
> If anyone is interested in my simple hack, let me know.
>
You can attach your hack to the issue so other users can use it until
somebody fixes tesseract-ocr.
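
For readers who want to see the shape of such a workaround, here is a
minimal sketch in Python (pdfbeads itself is Ruby, and this is not Galt's
actual patch). It assumes, as described above, that Tess3.01 puts the bbox
on an outer ocr_word span and the text in a nested ocrx_word span, and it
also accepts the older layout where ocrx_word carries both. The names
parse_bbox, WordBoxParser and extract_words are hypothetical.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs from ocr_word / ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._span_classes = []   # class of every currently open <span>
        self._outer_bbox = None   # bbox taken from an enclosing ocr_word
        self._inner_bbox = None   # bbox taken from the ocrx_word itself
        self._in_xword = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr = dict(attrs)
        cls = attr.get("class") or ""
        self._span_classes.append(cls)
        if cls == "ocr_word":
            self._outer_bbox = parse_bbox(attr.get("title") or "")
        elif cls == "ocrx_word":
            self._inner_bbox = parse_bbox(attr.get("title") or "")
            self._in_xword = True
            self._text = []

    def handle_data(self, data):
        if self._in_xword:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._span_classes:
            return
        cls = self._span_classes.pop()
        if cls == "ocrx_word":
            self._in_xword = False
            text = "".join(self._text).strip()
            # Prefer the word's own bbox, else fall back to the outer span.
            bbox = self._inner_bbox or self._outer_bbox
            if text and bbox is not None:
                self.words.append((text, bbox))
            self._inner_bbox = None
        elif cls == "ocr_word":
            self._outer_bbox = None


def extract_words(hocr):
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Calling extract_words() on the contents of an hOCR file gives a list of
(word, bbox) tuples regardless of which of the two span layouts was used,
which is all a PDF text layer needs per word.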


> I get a very nice searchable pdf with cut/paste text as the result,
> and achieving
> that was one of the primary reasons that I turned to Tess.
>
> One little problem: my boxes around the text in the acroread viewer
> always
> begin a word at exactly the right position.  But the end of the word
> (judging by
> text highlight) is sometimes not extending far enough, and once in a
> while too far.
> I don't know which part is to blame here: tess, pdfbeads, or acroread?
> It would be nice to fix it.
>
I would expect it is a PDF issue. IMO it could be caused by the font used
for the searchable text, which I guess has different metrics than the
original (scanned) text font.
But without your examples it is difficult to give a real explanation.
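
To illustrate the font-metrics guess with some arithmetic: the sketch
below assumes the hidden text is drawn in a substitute font whose glyph
advance widths are given in the usual PDF units of 1/1000 of the font
size. Whether pdfbeads or acroread actually work this way is an
assumption, and the function names and widths are made up for the example.

```python
def horizontal_scale_percent(glyph_widths, font_size, box_width):
    """Percentage by which the text must be stretched to span box_width.

    glyph_widths: advance widths in 1/1000 text-space units.
    font_size, box_width: both in the same unit (e.g. points).
    """
    natural_width = sum(glyph_widths) / 1000.0 * font_size
    if natural_width <= 0:
        raise ValueError("text has no width in the substitute font")
    return 100.0 * box_width / natural_width


def estimated_letter_edges(glyph_widths, box_left, box_width):
    """Letter boundaries a viewer would infer from the substitute font.

    The word box is split in proportion to the font's advance widths,
    not according to where the scanned letters really are.
    """
    total = sum(glyph_widths)
    edges = [box_left]
    running = 0
    for width in glyph_widths:
        running += width
        edges.append(box_left + box_width * running / total)
    return edges
```

If the text is not stretched to the word box, the highlight ends wherever
the substitute font says the word ends, which may be short of or past the
scanned word. Even with stretching, the inner letter edges are only
proportional guesses, so a partial-word selection can look misaligned.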


> It seems like Tess' hocr output is only identifying word-positions
> while the acroread is allowing sub-word access to highlight parts of
> words
> (individual letters).
> Since it doesn't have access to the actual positions of each letter
> within the
> word, that probably leads to the guessed positions which are not
> always right.
>
> Will Tess be providing letter-level hocr output?  That seems like it
> would solve
> the problem.  Judging by some code I saw in pdfbeads, it looks like
> cuneiform
> hocr output is doing letter-by-letter positions.  I guess adding
> individual letter
> positions might make the pdf output file larger.
>
For me it does not make sense (but maybe I am missing something): if it
places the text letter by letter into the PDF, then each letter should be
an individual object (with an individual position), so searching for
words (strings) should not work...

Anyway, I made a simple test[1] with tesseract-ocr where I set page
segmentation to symbol (letter), and I got the worst result compared to
word/line/block segmentation (as far as I remember, the best result I got
was for block, but maybe it depends on the input).

[1] https://groups.google.com/group/tesseract-ocr/msg/e0a5d02702cdac21

-- 
Zdenko
