On Tuesday, June 6, 2017 at 6:08:32 PM UTC-4, John Muccigrosso wrote:
>
> The wiki suggests making sure that the x-height of text is at least 20 px.
> Is there a fairly straightforward way to estimate this without manually
> examining the image? Getting average or median from hocr or something?
>
Months later... It looks like what I want to do is create a box file. After checking the wiki, I modified its instructions to come up with this command, which seems to do what I want:

    tesseract text_image_file output_file_name makebox

The output looks like this:

    C 261 2453 285 2480 0
    A 287 2454 312 2480 0
    P 315 2454 334 2479 0
    I 337 2454 347 2480 0
    T 349 2454 372 2481 0
    O 374 2454 402 2480 0
    L 406 2454 426 2480 0
    I 429 2454 439 2480 0
    N 442 2454 471 2480 0
    E 473 2454 494 2480 0

So now I need to process this output to get the letter heights (the fourth number minus the second number on each line, i.e. top minus bottom) and then grab the median.
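That last step could be sketched in Python roughly as follows. This is only a minimal sketch, assuming each box line has the form "symbol left bottom right top page" as in the output above. The names glyph_heights, median_glyph_height and SAMPLE_BOX are made up for illustration and are not part of tesseract.

```python
from statistics import median

# Sample copied from the makebox output quoted in this post.
SAMPLE_BOX = """\
C 261 2453 285 2480 0
A 287 2454 312 2480 0
P 315 2454 334 2479 0
I 337 2454 347 2480 0
T 349 2454 372 2481 0
O 374 2454 402 2480 0
L 406 2454 426 2480 0
I 429 2454 439 2480 0
N 442 2454 471 2480 0
E 473 2454 494 2480 0
"""


def glyph_heights(box_text):
    """Return top - bottom (in pixels) for every well-formed box line."""
    heights = []
    for line in box_text.splitlines():
        # Split from the right so the five numeric fields are always last,
        # whatever the symbol in the first column is.
        fields = line.rsplit(None, 5)
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        _symbol, _left, bottom, _right, top, _page = fields
        heights.append(int(top) - int(bottom))
    return heights


def median_glyph_height(box_text):
    """Median glyph height; raises StatisticsError if no boxes were found."""
    return median(glyph_heights(box_text))


if __name__ == "__main__":
    print(median_glyph_height(SAMPLE_BOX))
```

For the ten lines quoted above this gives a median of 26 px. To run it on a real page, read the box file the command produced (assumed here to be output_file_name.box) and pass its contents to median_glyph_height. Note that the sample is all capitals, so the figure is a capital-letter height rather than a true x-height. Keeping only lines whose symbol is a lowercase letter without ascenders or descenders (x, a, e and so on) would be one way to get closer, though that filter is an assumption and not something the wiki prescribes.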

