Here is my diff for pdfbuilder.rb.
It contains my fixes to use the Tesseract 3.01-specific hOCR output,
with crisp word-start boundaries,
and to tolerate an empty word or line in the hOCR output.
$ diff pdfbuilder.orig.rb pdfbuilder.rb
480c480
< ocr_words = ocr_line.search("//span[@class='ocrx_word']")
---
> ocr_words = ocr_line.search("//span[@class='ocr_word']")
485a486
>
486a488
>
488,491c490,498
< bbox = elementCoordinates( word,xscale,yscale )
< next if bbox == [0,0,0,0]
< txt = elementText( word,charset )
< units << [txt,bbox]
---
> ocrx_words = word.search("//span[@class='ocrx_word']")
> if ocrx_words.length > 0
> wordx = ocrx_words[0]
> bbox = elementCoordinates( word,xscale,yscale )
> next if bbox == [0,0,0,0] # from 1.0.9 ?
> txt = elementText( wordx,charset )
> next if txt == ""
> units << [txt,bbox]
> end
494c501
< # If 'ocrx_cinfo' data is available (as in Cuneiform) owtput, then split it
---
> # If 'ocrx_cinfo' data is available (as in Cuneiform) output, then split it
575a583
> next if ltxt == ""
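
For anyone who wants the gist without applying the patch, the new word loop does roughly the following: it takes the bounding box from the outer ocr_word span, takes the text from the first ocrx_word span nested inside it, and skips entries with no box or no text. This is only a minimal sketch, not the actual pdfbuilder.rb code. Span, bbox_of and word_units are hypothetical names, and a plain Struct stands in for the parsed hOCR elements. The only part taken from hOCR itself is the "bbox x0 y0 x1 y1" form of the title attribute.

```ruby
# Hypothetical stand-in for a parsed hOCR span element; the real code
# works on parsed HTML elements, not on this Struct.
Span = Struct.new(:css_class, :title, :text, :children)

# Parse "bbox x0 y0 x1 y1" from an hOCR title attribute.
# Returns [0, 0, 0, 0] when no bbox is present.
def bbox_of(span)
  m = span.title.to_s.match(/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)
  m ? m.captures.map(&:to_i) : [0, 0, 0, 0]
end

# Build [text, bbox] units: bbox from the outer ocr_word,
# text from the first nested ocrx_word; skip empty entries.
def word_units(ocr_words)
  units = []
  ocr_words.each do |word|
    inner = word.children.find { |c| c.css_class == 'ocrx_word' }
    next if inner.nil?                 # no nested ocrx_word at all
    bbox = bbox_of(word)
    next if bbox == [0, 0, 0, 0]       # no usable bounding box
    txt = inner.text.to_s
    next if txt == ''                  # tolerate an empty word
    units << [txt, bbox]
  end
  units
end

words = [
  Span.new('ocr_word', 'bbox 10 20 110 40', nil,
           [Span.new('ocrx_word', 'bbox 12 20 108 40', 'Hello', [])]),
  Span.new('ocr_word', 'bbox 120 20 180 40', nil,
           [Span.new('ocrx_word', 'bbox 121 20 179 40', '', [])]),
  Span.new('ocr_word', '', nil,
           [Span.new('ocrx_word', '', 'ghost', [])]),
  Span.new('ocr_word', 'bbox 190 20 250 40', nil, [])
]

p word_units(words)
```

Only the first sample word survives: the second has empty text, the third has no bbox, and the fourth has no nested ocrx_word.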