I think you are mixing two different things. You can get box output or hOCR output, but not both:
- The box file is, IMO, mainly useful for tesseract training. It contains only the symbols and their positions.
- hOCR is, IMO, focused on page analysis (it identifies blocks, paragraphs, and words), and it shows word confidence (in x_wconf).

Setting both variables does not make sense. If you are not satisfied with the hOCR output, you can create your own output using the tesseract-ocr API.

Zdenko

On Mon, Jun 24, 2013 at 7:10 PM, Perry Horwich <[email protected]> wrote:
> Hi,
>
> Thanks for the awesome opensource OCR application.
>
> I can generate html and box files using a config file like this:
>
> tessedit_char_whitelist abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
> tessedit_create_boxfile 1
> tessedit_create_hocr 1
>
> This does not seem to be producing confidence values, either by word or
> letter.
>
> The box file looks like this:
>
> a 1883 3619 1940 3684 0
> d 1946 3617 2007 3704 0
> e 2014 3618 2069 3684 0
>
> And the <body> of the html hocr file looks identical:
>
> a 1883 3619 1940 3684 0
> d 1946 3617 2007 3704 0
> e 2014 3618 2069 3684 0
>
> Is there a variable I can set in the config file to produce confidence
> values for words or letters?
>
> I am using:
> tesseract 3.02.02
> leptonica-1.69
> libjpeg 8d : libpng 1.5.14 : libtiff 4.0.3 : zlib 1.2.5
>
> ... compiled on a Mac, OS X 10.8.3
> Works great.
>
> Many thanks -
>
> Perry
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
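To illustrate the point about x_wconf: once only tessedit_create_hocr is set, the word confidences can be read back out of the hOCR file with a few lines of Python. This is only a rough sketch using the standard library. The SAMPLE markup, the WordConfidenceParser class, and the word_confidences helper are made up for illustration; the sample assumes each word is a span with class ocrx_word whose title attribute holds "bbox x0 y0 x1 y1; x_wconf N", so check it against what your tesseract version actually writes.

```python
import re
from html.parser import HTMLParser

# Hypothetical hOCR fragment, for illustration only. Real output may differ.
SAMPLE = """<span class='ocr_line' title='bbox 1883 3617 2200 3704'>
<span class='ocrx_word' id='word_1' title='bbox 1883 3617 2069 3704; x_wconf 87'>ade</span>
<span class='ocrx_word' id='word_2' title='bbox 2100 3617 2200 3704; x_wconf 62'>fgh</span>
</span>"""


class WordConfidenceParser(HTMLParser):
    """Collects (text, bbox, confidence) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # (bbox, conf) while inside a word span
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            self._current = (
                tuple(int(v) for v in bbox.groups()) if bbox else None,
                int(conf.group(1)) if conf else None,  # None if no x_wconf
            )
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a word span.
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested <span> elements.
        if tag == "span" and self._current is not None:
            bbox, conf = self._current
            self.words.append(("".join(self._text).strip(), bbox, conf))
            self._current = None


def word_confidences(hocr_text):
    """Return a list of (word, bbox, x_wconf) tuples from hOCR markup."""
    parser = WordConfidenceParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word, bbox, conf in word_confidences(SAMPLE):
        print(word, bbox, conf)
```

If a word span carries no x_wconf entry, the confidence comes back as None rather than 0, so missing values are not mistaken for low confidence. Per-letter confidence is not covered here; for that you would still need the tesseract-ocr API.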

