On Tue, May 22, 2012 at 2:03 PM, Galt <[email protected]> wrote:

> > > > Please create issue with description what is output and how it
> > > > should be...
> > >
> > > Until then I have forced to make a little hack to pdfbeads to get it
> > > to read the position and word from ocr_word and ocrx_word
> > > respectively so that it can read the Tess3.01 hocr input. It seems
> > > that pdfbeads is expecting both attributes to be in ocrx_word (the
> > > way it was in Tess3.0?).
> > > If anyone is interested in my simple hack, let me know.
> >
> > You can put your hack to issue so other users can use it until
> > somebody will fix tesseract-ocr.
>
> OK, I will make an issue for it.
>
> > > One little problem: my boxes around the text in the acroread viewer
> > > always begin a word at exactly the right position. But the end of
> > > the word (judging by text highlight) is sometimes not extending far
> > > enough, and once in a while too far.
> > > I don't know which part is to blame here: tess, pdfbeads, or
> > > acroread? It would be nice to fix it.
> >
> > I would expect it is a pdf issue. IMO it could be because of font used
> > for searchable text.
> > And I guess it has different metric than original (scanned) text font.
> > But without your examples it is difficult to give real explanation.
>
> That makes sense.
>
> > > Will Tess be providing letter-level hocr output? That seems like it
> > > would solve the problem. Judging by some code I saw in pdfbeads, it
> > > looks like cuneiform hocr output is doing letter-by-letter
> > > positions. I guess adding individual letter positions might make the
> > > pdf output file larger.
> >
> > For me it does not make sense (but maybe I miss something) - If it
> > places letter-by-letter to pdf than each letter should be individual
> > object (with individual position) so search for words (strings) should
> > not work...
>
> Maybe the hocr standard allows both the word and the letter positions.
>
> I suppose if it only had letter-positions it would have to infer the
> word breaks, which could work, but might not.
>
> Using the analogy between words and letters,
> if I search for any part of two adjacent words, acroread finds it,
> even though it was only given the words as units.
>
> I can search for these successfully
> buaċaill
> óg
> buaċaill óg
> buaċaill ó
> uaċaill ó
>
> So it is not limited to searching only single full words, which is
> great.

Thanks for creating issue 711 [1]. IMO there should be some discussion or clarification about this (see the other question [2]). It is not a problem to change "ocrx_word" to "ocr_word", but after googling I think that would not be correct.
As far as I know, the official hOCR spec is here [3], and the maintainer should be Thomas Breuel (see [4]), if nothing has changed since 2009/2010. He stated:

    If there is something engine-specific you need, pick an ocrx_... tag
    that doesn't conflict with an existing one. ocr_... tags are intended
    to represent engine-independent information, so for that, it's
    probably a good idea to talk about it before picking a new tag.

As Carlos [2] pointed out, the specification does not mention ocr_word, and ocrx_word (which is mentioned in the spec) is considered engine-specific markup (see page 7 of the spec). I did not find any discussion about ocr_word in the hocr group [5] or in ocropus...

My understanding is that using ocr_word breaches the hOCR spec, and tesseract should not do it. So the other software (pdfbeads, cuneiform) should fix its output/requirements or start a discussion in the hocr group. Or did I miss something?

[1] http://code.google.com/p/tesseract-ocr/issues/detail?id=711
[2] https://groups.google.com/group/tesseract-ocr/browse_thread/thread/6d30401001068920
[3] http://docs.google.com/View?docid=dfxcv4vc_67g844kf
[4] https://groups.google.com/group/ocropus/browse_thread/thread/797976effef9166f
[5] https://groups.google.com/group/hocr

--
Zdenko
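
Until the naming question is settled, a consumer can simply accept either class name. Below is a minimal sketch of that idea in Python. It is not Galt's pdfbeads hack and not code from any of the tools mentioned. The names `HocrWordParser` and `extract_words` are made up for illustration, and the sample markup (an outer ocr_word span carrying the bbox, with an inner ocrx_word span carrying the text) is only an assumption based on the description quoted above, not verified Tess3.01 output.

```python
import re
from html.parser import HTMLParser

# Class names treated as "a word"; both spellings are accepted.
WORD_CLASSES = {"ocr_word", "ocrx_word"}
BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []     # finished (text, bbox) tuples
        self._spans = []    # one bool per open <span>: is it a word span?
        self._depth = 0     # number of word spans currently open
        self._bbox = None   # bbox of the word being read
        self._text = []     # text fragments of the word being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        is_word = bool(classes & WORD_CLASSES)
        self._spans.append(is_word)
        if not is_word:
            return
        if self._depth == 0:
            # Outermost word span: start collecting a new word.
            self._bbox, self._text = None, []
        self._depth += 1
        # Take the bbox from whichever word span provides one first.
        match = BBOX_RE.search(attr.get("title") or "")
        if match and self._bbox is None:
            self._bbox = tuple(int(n) for n in match.groups())

    def handle_data(self, data):
        if self._depth > 0:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._spans:
            return
        if self._spans.pop():
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._text).strip()
                if text:
                    self.words.append((text, self._bbox))


def extract_words(hocr):
    """Return a list of (text, (x0, y0, x1, y1)) for every word span."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Assumed nested layout: position on ocr_word, text inside ocrx_word.
nested = ("<span class='ocr_word' title='bbox 10 20 50 40'>"
          "<span class='ocrx_word' title='x_wconf -2'>buaċaill</span></span>")
# Assumed flat layout: position and text both on ocrx_word.
flat = "<span class='ocrx_word' title='bbox 10 20 50 40'>buaċaill</span>"

print(extract_words(nested))
print(extract_words(flat))
```

Both layouts yield the same word and box, so a reader written this way does not care which of the two class names ends up carrying the position. A word span with no bbox in its title comes back with `None` as its box.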

