Re: Tess3.01 hocr output not working with pdfbeads

zdenko podobny Tue, 22 May 2012 13:12:59 -0700

On Tue, May 22, 2012 at 2:03 PM, Galt <[email protected]> wrote:

>
> >
> > > Please create issue with description what is output and how it should
> > > be...
> > > Until then I have forced to make a little hack to pdfbeads to get it
> > > to read the position
> > > and word from ocr_word and ocrx_word respectively so that it can read
> > > the Tess3.01 hocr input.  It seems that pdfbeads is
> > > expecting both attributes to be in ocrx_word (the way it was in
> > > Tess3.0?).
> > > If anyone is interested in my simple hack, let me know.
> >
> > You can put your hack to issue so other users can use it until somebody
> > will fix tesseract-ocr.
>
> OK, I will make an issue for it.
>
> > > One little problem: my boxes around the text in the acroread viewer
> > > always
> > > begin a word at exactly the right position.  But the end of the word
> > > (judging by
> > > text highlight) is sometimes not extending far enough, and once in a
> > > while too far.
> > > I don't know which part is to blame here: tess, pdfbeads, or acroread?
> > > It would be nice to fix it.
> >
> > I would expect it is a pdf issue. IMO it could be because of font used
> > for searchable text.
> > And I guess it has different metric than original (scanned) text font.
> > But without your examples it is difficult to give real explanation.
> >
>
> That makes sense.
>
> > > Will Tess be providing letter-level hocr output?  That seems like it
> > > would solve
> > > the problem.  Judging by some code I saw in pdfbeads, it looks like
> > > cuneiform
> > > hocr output is doing letter-by-letter positions.  I guess adding
> > > individual letter
> > > positions might make the pdf output file larger.
> >
> > For me it does not make sense (but maybe I miss something) - If it
> > places letter-by-letter to pdf then each letter should be individual
> > object (with individual position) so search for words (strings) should
> > not work...
> >
>
> Maybe the hocr standard allows both the word and the letter positions.
>
> I suppose if it only had letter-positions it would have to infer the
> word breaks, which could work, but might not.
>
> Using the analogy between words and letters,
> if I search for any part of two adjacent words, acroread finds it,
> even though it was only given the words as units.
>
> I can search for these successfully
> buaċaill
> óg
> buaċaill óg
> buaċaill ó
> uaċaill ó
>
> So it is not limited to searching only single full words, which is
> great.
>
>
>
Thanks for creating issue 711 [1]. IMO this needs some
discussion/clarification first (see the other question [2]). It is not a
problem to change "ocrx_word" to "ocr_word", but after googling I think
that would not be correct.


As far as I know, the official hOCR spec is here [3], and the maintainer
should be Thomas Breuel (see [4]), if nothing has changed since 2009/2010.

He stated:

If there is something engine-specific you need, pick an ocrx_... tag that
doesn't conflict with an existing one.

ocr_... tags are intended to represent engine-independent information, so
for that, it's probably a good idea to talk about it before picking a new
tag.

As Carlos [2] pointed out, the specification does not mention ocr_word, and
ocrx_word (which is mentioned in the spec) is considered engine-specific
markup (see page 7 of the spec). I did not find any discussion about
ocr_word in the hocr group [5] or in ocropus...

My understanding is that using ocr_word breaches the hOCR spec and
tesseract should not do it. So the other software (pdfbeads, cuneiform)
should fix its output/requirements or start a discussion in the hocr group.

Or did I miss something?
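
Until this is settled, a reader of hOCR could simply accept both class
names. Below is a minimal sketch in Python (not the pdfbeads code, which I
have not reproduced here). It assumes the 3.01 layout described above, i.e.
an outer ocr_word span carrying the bbox with an inner ocrx_word span
carrying the text, and also accepts a single ocrx_word span carrying both.
The names HocrWords and parse_hocr_words are made up for this example.

```python
import re
from html.parser import HTMLParser

# Class names treated as "word" spans (assumption: either may appear).
WORD_CLASSES = {"ocr_word", "ocrx_word"}
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from word spans, nested or flat."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1))
        self._stack = []   # one flag per open <span>: is it a word span?
        self._bbox = None  # bbox seen on the current word span(s)
        self._text = []    # text collected inside the current word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        a = dict(attrs)
        is_word = a.get("class") in WORD_CLASSES
        self._stack.append(is_word)
        if is_word:
            # The bbox may sit on the outer or on the inner span.
            m = BBOX_RE.search(a.get("title") or "")
            if m:
                self._bbox = tuple(int(g) for g in m.groups())

    def handle_data(self, data):
        # Only keep text that is inside some word span.
        if any(self._stack):
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        was_word = self._stack.pop()
        # Emit once the outermost word span closes.
        if was_word and not any(self._stack):
            text = "".join(self._text).strip()
            if text and self._bbox:
                self.words.append((text, self._bbox))
            self._bbox, self._text = None, []


def parse_hocr_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

With this, the same caller gets one (text, bbox) pair per word regardless
of which of the two layouts the engine produced, so the downstream tool
does not have to care about the class name dispute.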

[1] http://code.google.com/p/tesseract-ocr/issues/detail?id=711
[2] https://groups.google.com/group/tesseract-ocr/browse_thread/thread/6d30401001068920
[3] http://docs.google.com/View?docid=dfxcv4vc_67g844kf
[4] https://groups.google.com/group/ocropus/browse_thread/thread/797976effef9166f
[5] https://groups.google.com/group/hocr

-- 
Zdenko
