Here is my pdfbuilder.rb diff.

It contains my fixes to use Tesseract 3.01-specific hOCR output
with crisp word-start boundaries,
and to tolerate empty words or lines in the hOCR output.


$ diff pdfbuilder.orig.rb pdfbuilder.rb
480c480
<     ocr_words = ocr_line.search("//span[@class='ocrx_word']")
---
>     ocr_words = ocr_line.search("//span[@class='ocr_word']")
485a486
>
486a488
>
488,491c490,498
<         bbox = elementCoordinates( word,xscale,yscale )
<         next if bbox == [0,0,0,0]
<         txt = elementText( word,charset )
<         units << [txt,bbox]
---
>         ocrx_words = word.search("//span[@class='ocrx_word']")
>         if ocrx_words.length > 0
>            wordx = ocrx_words[0]
>            bbox = elementCoordinates( word,xscale,yscale )
>            next if bbox == [0,0,0,0] # from 1.0.9 ?
>            txt = elementText( wordx,charset )
>            next if txt == ""
>            units << [txt,bbox]
>          end
494c501
<     # If 'ocrx_cinfo' data is available (as in Cuneiform) owtput, then split it
---
>     # If 'ocrx_cinfo' data is available (as in Cuneiform) output, then split it
575a583
>       next if ltxt == ""
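
For anyone reading the diff without the full file, here is a minimal standalone sketch of the nested-span handling that the new lines at 490-498 appear to implement. It assumes hOCR in which each ocr_word span wraps an ocrx_word span. The sample markup, bbox_of and word_units are hypothetical stand-ins: the real script calls elementCoordinates and elementText with scaling and charset arguments. REXML is used only to keep the sketch self-contained, not because pdfbuilder.rb uses it.

```ruby
require 'rexml/document'

# Hypothetical hOCR line shaped like the Tesseract 3.01 output described above:
# every ocr_word span wraps an ocrx_word span that carries the text.
HOCR_LINE = <<~HTML
  <span class='ocr_line' title='bbox 10 10 300 40'>
    <span class='ocr_word' title='bbox 10 10 80 40'><span class='ocrx_word' title='bbox 12 10 80 40'>Hello</span></span>
    <span class='ocr_word' title='bbox 90 10 160 40'><span class='ocrx_word' title='bbox 92 10 160 40'></span></span>
    <span class='ocr_word' title='bbox 0 0 0 0'><span class='ocrx_word' title='bbox 0 0 0 0'>ghost</span></span>
    <span class='ocr_word' title='bbox 170 10 300 40'><span class='ocrx_word' title='bbox 171 10 300 40'>world</span></span>
  </span>
HTML

# Stand-in for elementCoordinates (assumption): read "bbox x0 y0 x1 y1"
# from the title attribute, without any scaling.
def bbox_of(element)
  match = element.attributes['title'].to_s.match(/bbox (\d+) (\d+) (\d+) (\d+)/)
  match ? match.captures.map(&:to_i) : [0, 0, 0, 0]
end

# Collect [text, bbox] pairs, taking the box from the outer ocr_word span
# and the text from the inner ocrx_word span, as the patched code does.
def word_units(line)
  units = []
  REXML::XPath.each(line, ".//span[@class='ocr_word']") do |word|
    inner = REXML::XPath.first(word, ".//span[@class='ocrx_word']")
    next if inner.nil?                # no ocrx_word inside this ocr_word
    bbox = bbox_of(word)
    next if bbox == [0, 0, 0, 0]      # skip words without a usable box
    txt = inner.text.to_s.strip
    next if txt == ""                 # tolerate empty words
    units << [txt, bbox]
  end
  units
end

line = REXML::Document.new(HOCR_LINE).root
units = word_units(line)
units.each { |txt, bbox| puts "#{txt} #{bbox.inspect}" }
```

Taking the box from the outer span and the text from the inner one mirrors the patch, where elementCoordinates still receives word while elementText receives wordx.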
