> > > Please create an issue with a description of what the output is and how
> > > it should be...
> >
> > Until then I have been forced to make a little hack to pdfbeads to get it
> > to read the position and the word from ocr_word and ocrx_word respectively,
> > so that it can read the Tess3.01 hocr input. It seems that pdfbeads is
> > expecting both attributes to be in ocrx_word (the way it was in Tess3.0?).
> > If anyone is interested in my simple hack, let me know.
>
> You can put your hack in the issue so other users can use it until somebody
> fixes tesseract-ocr.
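For readers who want to see the kind of workaround described above, here is a minimal sketch in Python. It is not the actual pdfbeads patch (pdfbeads is written in Ruby), and the sample markup in the test is an assumption based on the description in this thread: the bounding box sits on an outer ocr_word span and the text sits in a nested ocrx_word span. The names WordCollector and hocr_words are hypothetical.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class WordCollector(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans.

    Handles both layouts discussed in the thread (assumed markup):
    bbox and text on one ocrx_word span, or bbox on an outer
    ocr_word span with the text in a nested ocrx_word span.
    """

    def __init__(self):
        super().__init__()
        self.words = []    # finished (text, (x0, y0, x1, y1)) pairs
        self._stack = []   # one flag per open <span>: is it a word span?
        self._bbox = None  # bbox seen for the word currently being read
        self._text = []    # text fragments of the current word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        is_word = "ocr_word" in classes or "ocrx_word" in classes
        if is_word:
            match = BBOX_RE.search(attr.get("title") or "")
            if match:
                # If both spans carry a bbox, the innermost one wins.
                self._bbox = tuple(int(v) for v in match.groups())
        self._stack.append(is_word)

    def handle_data(self, data):
        # Only keep text that sits inside some word span.
        if any(self._stack):
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        was_word = self._stack.pop()
        # Emit once the outermost word span has been closed.
        if was_word and not any(self._stack):
            text = "".join(self._text).strip()
            if text and self._bbox is not None:
                self.words.append((text, self._bbox))
            self._bbox, self._text = None, []


def hocr_words(hocr):
    """Return a list of (text, bbox) pairs found in an hOCR string."""
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Calling hocr_words() on the contents of an hOCR file returns the words in document order, each paired with its pixel box, whichever of the two span layouts the file uses.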
OK, I will make an issue for it.

> > One little problem: my boxes around the text in the acroread viewer always
> > begin a word at exactly the right position. But the end of the word
> > (judging by text highlight) sometimes does not extend far enough, and once
> > in a while extends too far.
> > I don't know which part is to blame here: tess, pdfbeads, or acroread?
> > It would be nice to fix it.
>
> I would expect it is a pdf issue. IMO it could be because of the font used
> for the searchable text.
> And I guess it has different metrics than the original (scanned) text font.
> But without your examples it is difficult to give a real explanation.

That makes sense.

> > Will Tess be providing letter-level hocr output? That seems like it would
> > solve the problem. Judging by some code I saw in pdfbeads, it looks like
> > cuneiform hocr output is doing letter-by-letter positions. I guess adding
> > individual letter positions might make the pdf output file larger.
>
> For me it does not make sense (but maybe I miss something) - if it
> places the text letter-by-letter in the pdf, then each letter should be an
> individual object (with an individual position), so searching for words
> (strings) should not work...

Maybe the hocr standard allows both the word and the letter positions.
I suppose if it only had letter positions it would have to infer the word
breaks, which could work, but might not.

Using the analogy between words and letters, if I search for any part of
two adjacent words, acroread finds it, even though it was only given the
words as units. I can search for these successfully:

buaċaill óg
buaċaill óg
buaċaill ó
uaċaill ó

So it is not limited to searching only single full words, which is great.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

