On Tue, May 22, 2012 at 12:14 AM, Galt <[email protected]> wrote:

> I should begin by saying that I am grateful and happy to have
> a very nice searchable pdf of an old book thanks to Tess.
>
> I found this on the web:
>
> <quote>
>
> https://github.com/steelThread/mimeograph/commit/b29af3338e8f15b22392b4e313c8688d9950e13b
>
> pdfbeads currently doesn't work with hOCR output generated by
> tesseract v3.01.
> the owner of pdfbeads doesn't want to enhance pdfbeads to work with
> the existing tesseract hOCR output because tesseract's hOCR output
> is not properly following the hOCR.
> while i totally understand that position, tesseract release about
> once a year and mimeo needs to work now
>
> </quote>
>
> Can you guys kiss and make up?

Please create an issue describing what the output is and what it should be.

> Until then I have been forced to make a little hack to pdfbeads to
> get it to read the position and word from ocr_word and ocrx_word
> respectively so that it can read the Tess3.01 hocr input. It seems
> that pdfbeads is expecting both attributes to be in ocrx_word
> (the way it was in Tess3.0?).
> If anyone is interested in my simple hack, let me know.

You can attach your hack to the issue so other users can use it until somebody fixes tesseract-ocr.

> I get a very nice searchable pdf with cut/paste text as the result,
> and achieving that was one of the primary reasons that I turned to
> Tess.
>
> One little problem: my boxes around the text in the acroread viewer
> always begin a word at exactly the right position. But the end of
> the word (judging by text highlight) is sometimes not extending far
> enough, and once in a while too far.
> I don't know which part is to blame here: tess, pdfbeads, or acroread?
> It would be nice to fix it.

I would expect it is a PDF issue. IMO it could be caused by the font used for the searchable text, which I guess has different metrics than the original (scanned) text font. But without your examples it is difficult to give a real explanation.

> It seems like Tess' hocr output is only identifying word-positions
> while the acroread is allowing sub-word access to highlight parts of
> words (individual letters).
> Since it doesn't have access to the actual positions of each letter
> within the word, that probably leads to the guessed positions which
> are not always right.
>
> Will Tess be providing letter-level hocr output? That seems like it
> would solve the problem. Judging by some code I saw in pdfbeads, it
> looks like cuneiform hocr output is doing letter-by-letter positions.
> I guess adding individual letter positions might make the pdf output
> file larger.

For me it does not make sense (but maybe I am missing something): if it places text letter-by-letter into the PDF, then each letter should be an individual object (with an individual position), so searching for words (strings) should not work...

Anyway, I made a simple test[1] with tesseract-ocr where I set page segmentation to symbol (letter), and I got the worst result compared to word/line/block segmentation (as far as I remember, I got the best result for block, but maybe it depends on the input).

[1] https://groups.google.com/group/tesseract-ocr/msg/e0a5d02702cdac21

--
Zdenko
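
For anyone who wants to experiment before the issue is resolved, here is a minimal Python sketch of the kind of workaround described above: take the position from the outer ocr_word span and the text from the nested ocrx_word span, and fall back to a bbox on ocrx_word itself if one is present. This is not the actual pdfbeads hack (pdfbeads is written in Ruby). The names `parse_bbox`, `HocrWordParser` and `extract_words` are hypothetical, and the `SAMPLE` markup is hand-written to match the layout described in this thread, not verified Tesseract output.

```python
import re
from html.parser import HTMLParser

# hOCR stores the box in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title attribute, or None if absent."""
    m = BBOX_RE.search(title or "")
    return tuple(int(g) for g in m.groups()) if m else None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from ocr_word / ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []          # list of (text, (x0, y0, x1, y1))
        self._open_spans = []    # class of every currently open <span>
        self._outer_bbox = None  # bbox taken from the enclosing ocr_word
        self._inner_bbox = None  # bbox taken from ocrx_word itself, if any
        self._text = None        # text chunks of the current ocrx_word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr = dict(attrs)
        cls = attr.get("class", "")
        self._open_spans.append(cls)
        if cls == "ocr_word":
            self._outer_bbox = parse_bbox(attr.get("title"))
        elif cls == "ocrx_word":
            self._inner_bbox = parse_bbox(attr.get("title"))
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an ocrx_word span.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._open_spans:
            return
        cls = self._open_spans.pop()
        if cls == "ocrx_word":
            text = "".join(self._text or []).strip()
            # Prefer a bbox on ocrx_word; otherwise use the enclosing ocr_word.
            bbox = self._inner_bbox or self._outer_bbox
            if text and bbox is not None:
                self.words.append((text, bbox))
            self._text = None
            self._inner_bbox = None
        elif cls == "ocr_word":
            self._outer_bbox = None


def extract_words(hocr):
    """Parse an hOCR string and return its words with bounding boxes."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample in the nested layout discussed in this thread (assumed).
SAMPLE = """
<span class='ocr_line' id='line_1' title="bbox 10 20 300 50">
<span class='ocr_word' id='word_1' title="bbox 10 20 80 50"><span class='ocrx_word' id='xword_1' title="x_wconf -2">Hello</span></span>
<span class='ocr_word' id='word_2' title="bbox 90 20 170 50"><span class='ocrx_word' id='xword_2' title="x_wconf -3">world</span></span>
</span>
"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

Because the sketch accepts a bbox on either span, the same reader should cope with both layouts mentioned above, so a consumer would not need to know which Tesseract version produced the file.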

