Re: Tess3.01 hocr output not working with pdfbeads

Galt Tue, 22 May 2012 05:27:51 -0700

>
> > Please create an issue describing what the output is and how it should be...
> > Until then I have been forced to make a little hack to pdfbeads to get it
> > to read the position
> > and word from ocr_word and ocrx_word respectively so that it can read
> > the Tess3.01 hocr input.  It seems that pdfbeads is
> > expecting both attributes to be in ocrx_word (the way it was in
> > Tess3.0?).
> > If anyone is interested in my simple hack, let me know.
>
> You can attach your hack to the issue so other users can use it until
> somebody fixes tesseract-ocr.


OK, I will make an issue for it.
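
For anyone curious in the meantime, here is a rough sketch of the idea. It is
not the actual pdfbeads patch (pdfbeads is Ruby; this is Python for
illustration only). It assumes the Tess3.01 markup nests an ocrx_word span
holding the text inside an ocr_word span whose title holds the bbox; the
sample markup, the class name WordCollector and the function parse_words are
all made up for this example.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

# Assumed shape of the Tess3.01 output; ids and coordinates are invented.
SAMPLE = (
    "<span class='ocr_word' id='word_1' title=\"bbox 10 20 110 50\">"
    "<span class='ocrx_word' id='xword_1' title=\"x_wconf -2\">buaċaill</span>"
    "</span> "
    "<span class='ocr_word' id='word_2' title=\"bbox 120 20 150 50\">"
    "<span class='ocrx_word' id='xword_2' title=\"x_wconf -1\">óg</span>"
    "</span>"
)


class WordCollector(HTMLParser):
    """Collect (text, bbox) pairs, taking the bbox from ocr_word
    and the text from the ocrx_word span nested inside it."""

    def __init__(self):
        super().__init__()
        self.words = []        # list of (text, (x0, y0, x1, y1))
        self._bbox = None      # bbox seen on the enclosing ocr_word
        self._in_text = False  # True while inside an ocrx_word span
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        a = dict(attrs)
        cls = a.get("class") or ""
        m = BBOX_RE.search(a.get("title") or "")
        if cls == "ocr_word" and m:
            self._bbox = tuple(int(g) for g in m.groups())
        elif cls == "ocrx_word":
            if m:
                # Older layout: the bbox sits on ocrx_word itself.
                self._bbox = tuple(int(g) for g in m.groups())
            self._in_text = True
            self._buf = []

    def handle_data(self, data):
        if self._in_text:
            self._buf.append(data)

    def handle_endtag(self, tag):
        # The first closing span after ocrx_word opened ends the word text.
        if tag == "span" and self._in_text:
            self.words.append(("".join(self._buf).strip(), self._bbox))
            self._in_text = False
            self._bbox = None


def parse_words(hocr):
    collector = WordCollector()
    collector.feed(hocr)
    collector.close()
    return collector.words


if __name__ == "__main__":
    for text, bbox in parse_words(SAMPLE):
        print(text, bbox)
```

The same collector also accepts the older layout where ocrx_word carries its
own bbox, which is what pdfbeads seems to expect.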

> > One little problem: my boxes around the text in the acroread viewer
> > always
> > begin a word at exactly the right position.  But the end of the word
> > (judging by
> > text highlight) is sometimes not extending far enough, and once in a
> > while too far.
> > I don't know which part is to blame here: tess, pdfbeads, or acroread?
> > It would be nice to fix it.
>
> I would expect it is a pdf issue. IMO it could be because of the font used
> for the searchable text.
> I guess it has different metrics than the original (scanned) text font.
> But without your examples it is difficult to give a real explanation.
>

That makes sense.
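
If the font metrics are the cause, one way to compensate would be to stretch
or squeeze each hidden word so its width matches the bbox width, for example
with PDF's horizontal scaling (the Tz text operator). I do not know whether
pdfbeads does this. The sketch below is only an illustration: the function
name is made up, and it assumes a monospaced hidden font whose glyphs advance
0.6 em each (as in Courier), with the bbox width already converted to the
same units as the font size.

```python
def horizontal_scale(text, font_size, bbox_width, advance_em=0.6):
    """Return the horizontal scaling percentage that makes `text`,
    set at `font_size`, span exactly `bbox_width`.

    advance_em is the assumed per-glyph advance of the hidden font,
    as a fraction of the font size.
    """
    natural_width = len(text) * advance_em * font_size
    if natural_width <= 0:
        raise ValueError("text must be non-empty and font_size positive")
    return 100.0 * bbox_width / natural_width


if __name__ == "__main__":
    # A value above 100 means the hidden text is too narrow for the
    # scanned word; below 100 means it runs past the end of the word.
    print(horizontal_scale("buaċaill", 10, 60))
```

With a proportional font the natural width would have to come from the real
glyph widths instead of a single constant.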

> > Will Tess be providing letter-level hocr output?  That seems like it
> > would solve
> > the problem.  Judging by some code I saw in pdfbeads, it looks like
> > cuneiform
> > hocr output is doing letter-by-letter positions.  I guess adding
> > individual letter
> > positions might make the pdf output file larger.
>
> For me it does not make sense (but maybe I am missing something) - if it
> places the text letter-by-letter in the pdf, then each letter would be an
> individual object (with its own position), so searching for words
> (strings) should not work...
>

Maybe the hocr standard allows both the word and the letter positions.

I suppose if it only had letter-positions it would have to infer the
word breaks, which could work, but might not.
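
To show what I mean by inferring the word breaks, here is a minimal sketch.
It assumes each letter comes with its left and right x coordinates on one
line, and it starts a new word whenever the gap to the previous letter is
larger than some fraction of the average letter width seen so far in the
current word. The function name, the input layout and the 0.5 threshold are
all my own assumptions, not anything taken from cuneiform or pdfbeads.

```python
def group_letters(letters, gap_factor=0.5):
    """Group (char, x0, x1) tuples, sorted left to right, into words.

    A new word starts when the horizontal gap to the previous letter
    exceeds gap_factor times the mean letter width of the current word.
    """
    words = []
    current = []
    for ch, x0, x1 in letters:
        if current:
            prev_x1 = current[-1][2]
            mean_width = sum(b - a for _, a, b in current) / len(current)
            if x0 - prev_x1 > gap_factor * mean_width:
                words.append(current)
                current = []
        current.append((ch, x0, x1))
    if current:
        words.append(current)
    return ["".join(ch for ch, _, _ in word) for word in words]


if __name__ == "__main__":
    line = [("ó", 0, 10), ("g", 11, 21), ("a", 30, 40), ("b", 41, 51)]
    print(group_letters(line))
```

Tight word spacing or wide letter spacing would fool a fixed threshold like
this, which is why I say it might not always work.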

Using the analogy between words and letters,
if I search for any part of two adjacent words, acroread finds it,
even though it was only given the words as units.

I can search for these successfully
buaċaill
óg
buaċaill óg
buaċaill ó
uaċaill ó

So it is not limited to searching only single full words, which is
great.
