I should begin by saying that I am grateful and happy to have a very nice searchable pdf of an old book thanks to Tess.
I found this on the web:

<quote> https://github.com/steelThread/mimeograph/commit/b29af3338e8f15b22392b4e313c8688d9950e13b pdfbeads currently doesn't work with hOCR output generated by tesseract v3.01. the owner of pdfbeads doesn't want to enhance pdfbeads to work with the existing tesseract hOCR output because tesseract's hOCR output is not properly following the hOCR. while i totally understand that position, tesseract release about once a year and mimeo needs to work now </quote>

Can you guys kiss and make up?

Until then I have been forced to make a little hack to pdfbeads so that it reads the position from ocr_word and the word text from ocrx_word, which lets it read the Tess 3.01 hOCR input. It seems that pdfbeads expects both attributes to be in ocrx_word (the way it was in Tess 3.0?). If anyone is interested in my simple hack, let me know. I get a very nice searchable PDF with cut/paste text as the result, and achieving that was one of the primary reasons that I turned to Tess.

One little problem: in the acroread viewer, the boxes around the text always begin a word at exactly the right position, but the end of the word (judging by the text highlight) sometimes does not extend far enough, and once in a while extends too far. I don't know which part is to blame here: Tess, pdfbeads, or acroread? It would be nice to fix it.

It seems that Tess's hOCR output identifies only word positions, while acroread allows sub-word access to highlight parts of words (individual letters). Since acroread doesn't have access to the actual position of each letter within the word, the letter positions are probably guessed, and the guesses are not always right. Will Tess be providing letter-level hOCR output? That seems like it would solve the problem. Judging by some code I saw in pdfbeads, it looks like cuneiform's hOCR output gives letter-by-letter positions. I guess adding individual letter positions might make the PDF output file larger.
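To make the ocr_word / ocrx_word split concrete, here is a minimal Python sketch of the idea. It is not the pdfbeads hack itself (pdfbeads is written in Ruby). It assumes that each word in the Tess 3.01 hOCR is an outer span of class ocr_word whose title carries the bbox, wrapping an inner span of class ocrx_word that carries the text. The names WordBoxParser, parse_bbox and extract_words are made up for this illustration.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title such as 'bbox 10 20 50 40'."""
    for field in title.split(";"):
        parts = field.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordBoxParser(HTMLParser):
    """Pair the bbox of each ocr_word span with the text of its ocrx_word child.

    Assumption: words are nested as ocr_word (position) > ocrx_word (text).
    """

    def __init__(self):
        super().__init__()
        self.words = []    # collected (text, bbox) pairs
        self._stack = []   # class of every currently open <span>
        self._bbox = None  # bbox of the enclosing ocr_word, if any
        self._text = []    # text pieces inside the current ocrx_word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class", "")
        self._stack.append(cls)
        if cls == "ocr_word":
            self._bbox = parse_bbox(attrs.get("title", ""))
        elif cls == "ocrx_word":
            self._text = []

    def handle_data(self, data):
        # Only keep text whose innermost open span is an ocrx_word.
        if self._stack and self._stack[-1] == "ocrx_word":
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        cls = self._stack.pop()
        if cls == "ocrx_word":
            text = "".join(self._text).strip()
            if text and self._bbox is not None:
                self.words.append((text, self._bbox))
        elif cls == "ocr_word":
            self._bbox = None


def extract_words(hocr):
    """Return a list of (text, bbox) pairs found in an hOCR string."""
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Calling extract_words on the contents of an hOCR file gives one (text, bbox) pair per word. A word whose ocrx_word span is not nested inside an ocr_word span is skipped, so input laid out the other way (both attributes on ocrx_word) would need a different rule.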

