Re: Tess3.01 hocr output not working with pdfbeads

zdenko podobny Tue, 22 May 2012 13:20:00 -0700

On Tue, May 22, 2012 at 10:12 PM, zdenko podobny <[email protected]> wrote:


>
>
> On Tue, May 22, 2012 at 2:03 PM, Galt <[email protected]> wrote:
>
>>
>> >
>> > > Please create issue with description what is output and how it should
>> be...
>> > > Until then I have forced to make a little hack to pdfbeads to get it
>> > > to read the position
>> > > and word from ocr_word and ocrx_word respectively so that it can read
>> > > the Tess3.01 hocr input.  It seems that pdfbeads is
>> > > expecting both attributes to be in ocrx_word (the way it was in
>> > > Tess3.0?).
>> > > If anyone is interested in my simple hack, let me know.
>> >
>> > You can put your hack to issue so other users can use it until somebody
>> > will fix tesseract-ocr.
>>
>> OK, I will make an issue for it.
>>
>> > > One little problem: my boxes around the text in the acroread viewer
>> > > always
>> > > begin a word at exactly the right position.  But the end of the word
>> > > (judging by
>> > > text highlight) is sometimes not extending far enough, and once in a
>> > > while too far.
>> > > I don't know which part is to blame here: tess, pdfbeads, or acroread?
>> > > It would be nice to fix it.
>> >
>> > I would expect it is a PDF issue. IMO it could be because of the font
>> > used for the searchable text, which I guess has different metrics than
>> > the original (scanned) text font. But without your examples it is
>> > difficult to give a real explanation.
>> >
>>
>> That makes sense.
>>
>> > > Will Tess be providing letter-level hocr output?  That seems like it
>> > > would solve
>> > > the problem.  Judging by some code I saw in pdfbeads, it looks like
>> > > cuneiform
>> > > hocr output is doing letter-by-letter positions.  I guess adding
>> > > individual letter
>> > > positions might make the pdf output file larger.
>> >
>> > For me it does not make sense (but maybe I miss something): if it
>> > places letters one by one into the pdf, then each letter should be an
>> > individual object (with an individual position), so searching for words
>> > (strings) should not work...
>> >
>>
>> Maybe the hocr standard allows both the word and the letter positions.
>>
>> I suppose if it only had letter-positions it would have to infer the
>> word breaks, which could work, but might not.
>>
>> Using the analogy between words and letters,
>> if I search for any part of two adjacent words, acroread finds it,
>> even though it was only given the words as units.
>>
>> I can search for these successfully
>> buaċaill
>> óg
>> buaċaill óg
>> buaċaill ó
>> uaċaill ó
>>
>> So it is not limited to searching only single full words, which is
>> great.
>>
>>
>>
> Thanks for creating issue 711 [1]. IMO there should be some
> discussion/clarification about this (see the other question [2]). It
> is not a problem to change "ocrx_word" to "ocr_word", but after googling I
> think that would not be correct.
>
> As far as I know, the official hOCR spec is here [3], and its maintainer
> should be Thomas Breuel (see [4]), if nothing has changed since 2009/2010.
>
> He stated:
>
> If there is something engine-specific you need, pick an ocrx_... tag that
> doesn't conflict with an existing one.
>
> ocr_... tags are intended to represent engine-independent information, so
> for that, it's probably a good idea to talk about it before picking a new
> tag.
>
> As Carlos [2] pointed out, the specification does not mention ocr_word, and
> ocrx_word (which is mentioned in the spec) is considered engine-specific
> markup (see page 7 of the spec). I did not find any discussion about
> ocr_word in the hocr group [5] or ocropus...
>
> My understanding is that using ocr_word breaches the hocr spec and
> tesseract should not do it. So other software (pdfbeads, cuneiform) should
> fix their output/requirements or start a discussion in the hocr group.
>
> Or did I miss something?
>
> [1] http://code.google.com/p/tesseract-ocr/issues/detail?id=711
> [2] https://groups.google.com/group/tesseract-ocr/browse_thread/thread/6d30401001068920
> [3] http://docs.google.com/View?docid=dfxcv4vc_67g844kf
> [4] https://groups.google.com/group/ocropus/browse_thread/thread/797976effef9166f
> [5] https://groups.google.com/group/hocr
>
> --
> Zdenko
>

I think I need to take a break ;-) - I found out that I mixed up the
tesseract and cuneiform output.
Tesseract is the one using ocr_word, so this is not a problem in cuneiform
but in tesseract-ocr.

-- 
Zdenko
