On Tue, May 22, 2012 at 10:12 PM, zdenko podobny <[email protected]> wrote:
> On Tue, May 22, 2012 at 2:03 PM, Galt <[email protected]> wrote:
>
>> > > Please create an issue with a description of what the output is and
>> > > how it should be...
>> > > Until then I have been forced to make a little hack to pdfbeads to
>> > > get it to read the position and word from ocr_word and ocrx_word
>> > > respectively so that it can read the Tess3.01 hocr input. It seems
>> > > that pdfbeads is expecting both attributes to be in ocrx_word (the
>> > > way it was in Tess3.0?).
>> > > If anyone is interested in my simple hack, let me know.
>> >
>> > You can put your hack in the issue so other users can use it until
>> > somebody fixes tesseract-ocr.
>>
>> OK, I will make an issue for it.
>>
>> > > One little problem: my boxes around the text in the acroread viewer
>> > > always begin a word at exactly the right position. But the end of
>> > > the word (judging by text highlight) is sometimes not extending far
>> > > enough, and once in a while too far.
>> > > I don't know which part is to blame here: tess, pdfbeads, or acroread?
>> > > It would be nice to fix it.
>> >
>> > I would expect it is a pdf issue. IMO it could be because of the font
>> > used for the searchable text.
>> > And I guess it has different metrics than the original (scanned) text
>> > font.
>> > But without your examples it is difficult to give a real explanation.
>> >
>>
>> That makes sense.
>>
>> > > Will Tess be providing letter-level hocr output? That seems like it
>> > > would solve the problem. Judging by some code I saw in pdfbeads, it
>> > > looks like cuneiform hocr output is doing letter-by-letter positions.
>> > > I guess adding individual letter positions might make the pdf output
>> > > file larger.
>> >
>> > For me it does not make sense (but maybe I am missing something). If it
>> > places text letter-by-letter into the pdf, then each letter would be an
>> > individual object (with an individual position), so searching for
>> > words (strings) should not work...
>> >
>>
>> Maybe the hocr standard allows both the word and the letter positions.
>>
>> I suppose if it only had letter-positions it would have to infer the
>> word breaks, which could work, but might not.
>>
>> Using the analogy between words and letters,
>> if I search for any part of two adjacent words, acroread finds it,
>> even though it was only given the words as units.
>>
>> I can search for these successfully
>> buaċaill
>> óg
>> buaċaill óg
>> buaċaill ó
>> uaċaill ó
>>
>> So it is not limited to searching only single full words, which is
>> great.
>>
>
> Thanks for creating issue 711 [1]. IMO there should be some
> discussion/clarification regarding this (see the other question [2]). It
> is not a problem to change "ocrx_word" to "ocr_word", but after googling
> I think it is not correct.
>
> As far as I know the official hOCR spec is here [3] and the maintainer
> should be Thomas Breuel (see [4]), if nothing has changed since 2009/2010.
>
> He stated:
>
> If there is something engine-specific you need, pick an ocrx_... tag that
> doesn't conflict with an existing one.
>
> ocr_... tags are intended to represent engine-independent information, so
> for that, it's probably a good idea to talk about it before picking a new
> tag.
>
> As Carlos [2] pointed out, the specification does not mention ocr_word,
> and ocrx_word (which is mentioned in the spec) is considered
> engine-specific markup (see page 7 of the spec). I did not find any
> discussion about ocr_word in the hocr group [5] or ocropus...
>
> My understanding is that the usage of ocr_word breaches the hocr spec and
> tesseract should not do it. So other sw (pdfbeads, cuneiform) should fix
> their output/requirements or start a discussion at the hocr group.
>
> Or did I miss something?
>
> [1] http://code.google.com/p/tesseract-ocr/issues/detail?id=711
> [2] https://groups.google.com/group/tesseract-ocr/browse_thread/thread/6d30401001068920
> [3] http://docs.google.com/View?docid=dfxcv4vc_67g844kf
> [4] https://groups.google.com/group/ocropus/browse_thread/thread/797976effef9166f
> [5] https://groups.google.com/group/hocr
>
> --
> Zdenko
>

I think I need to take a break ;-) I found out that I mixed up the
tesseract and cuneiform output. Tesseract is the one using ocr_word, so
this is not a problem with cuneiform but with tesseract-ocr.

--
Zdenko
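
For anyone who wants to experiment with the kind of workaround Galt describes in the quoted text, here is a minimal sketch of reading word boxes from hOCR while accepting either class name. It is not Galt's actual pdfbeads patch (that patch is not shown in this thread, and pdfbeads itself is written in Ruby); this is plain Python. The markup layout is an assumption taken from the description in the thread: either an outer ocr_word span carries the bbox and an inner ocrx_word span carries the text, or a single ocrx_word span carries both. The names HocrWords and extract_words are made up for this example.

```python
import re
from html.parser import HTMLParser

# Class names treated as "word" spans; both spellings are accepted.
WORD_CLASSES = {"ocr_word", "ocrx_word"}
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from word-level hOCR spans."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished (text, (x0, y0, x1, y1)) pairs
        self._stack = []   # one bool per open <span>: is it a word span?
        self._bbox = None  # first bbox seen on the current word span(s)
        self._text = []    # text pieces seen inside the current word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        is_word = bool(classes & WORD_CLASSES)
        if is_word and self._bbox is None:
            # The bbox may sit on the outer or on the inner word span.
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
        self._stack.append(is_word)

    def handle_data(self, data):
        # Only keep text that appears inside some word span.
        if any(self._stack):
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        was_word = self._stack.pop()
        # Emit the word once its outermost word span has closed.
        if was_word and not any(self._stack):
            text = "".join(self._text).strip()
            if text and self._bbox is not None:
                self.words.append((text, self._bbox))
            self._bbox, self._text = None, []


def extract_words(hocr):
    """Return a list of (text, bbox) tuples found in an hOCR string."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Calling extract_words() on the contents of an hOCR file returns the words in document order, so a consumer that only understands one layout could be fed from this list regardless of which class the engine put the bbox on.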

