[tesseract-ocr] Small script to generate all boxes for ocrd-train

Lorenzo Bolzani Wed, 18 Sep 2019 03:38:27 -0700

Hi,
I wrote this small script to speed up ocrd-train
<https://github.com/OCR-D/ocrd-train> training startup.


It generates the box files for all the images matching a pattern given on
the command line (it works only for single-line images).
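
As a rough illustration of what ends up in a box file, here is a simplified
sketch of the per-line logic. It is not the attached script's exact code: the
function name box_rows is hypothetical, and it groups consecutive combining
marks into one box. Every character gets a box spanning the whole line image,
a combining mark is merged with the character before it, and a tab entry
closes the line.

```python
import unicodedata


def box_rows(text, width, height):
    """Return the box-file rows for one line of ground-truth text.

    Hypothetical helper for illustration; all boxes span the full image.
    """
    rows = []
    cluster = ""
    for char in text:
        if cluster and unicodedata.combining(char):
            # A combining mark is attached to the preceding character.
            cluster += char
        else:
            if cluster:
                rows.append(u"%s 0 0 %d %d 0" % (cluster, width, height))
            cluster = char
    if cluster:
        rows.append(u"%s 0 0 %d %d 0" % (cluster, width, height))
    # A tab entry just outside the image marks the end of the text line.
    rows.append(u"\t %d %d %d %d 0" % (width, height, width + 1, height + 1))
    return rows


if __name__ == "__main__":
    # "e" + combining acute accent, then "a", on a 100x20 line image
    for row in box_rows(u"e\u0301a", 100, 20):
        print(repr(row))
```

For the input above this gives three rows: one for the accented "e", one for
"a", and the closing tab entry.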

It is a simple conversion of generate_line_box.py from ocrd-train. I have
used it only once, but it seems to work fine.

Currently, box and lstmf generation with ocrd-train is very slow because it
starts a new process for each image.

I execute this script before calling the makefile.

I do the "shell expansion" in Python, so that the script can handle a very
long list of files without it having to fit on the command line.

So you need to call it like this:

python generate_all_line_boxes.py -i 'data/train/*.tif'

with single quotes to prevent shell expansion.
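
A minimal sketch of the idea, assuming a hypothetical helper named
expand_pattern: the quoted pattern reaches Python as a single argument, and
glob.glob expands it inside the process, so the number of matching files is
not constrained by the operating system's limit on command line length.

```python
import glob


def expand_pattern(pattern):
    """Expand a glob pattern inside Python and return the matches, sorted.

    Hypothetical helper for illustration; the pattern must arrive unexpanded,
    which is why it is quoted on the command line.
    """
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Same pattern as in the command above, expanded by Python, not the shell.
    files = expand_pattern("data/train/*.tif")
    print("%d image(s) found" % len(files))
```

Without the quotes the shell would expand the pattern itself and pass every
file name as a separate argument, which can fail with "Argument list too
long" for large training sets.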


BTW, it would be nice to have the same thing for the lstmf files.



Bye

Lorenzo


#!/usr/bin/env python

import argparse
import glob
import io
import os
import sys
import unicodedata

from PIL import Image

#
# command line arguments
#
arg_parser = argparse.ArgumentParser(
    description='Creates tesseract box files for given (line) image text pairs')


# Glob pattern of the image files
# (NOTE: use quotes in the command line to prevent shell expansion)
arg_parser.add_argument('-i', '--images', metavar='PATTERN',
                        help='Quoted glob pattern of the image files', required=True)

args = arg_parser.parse_args()

#
# main
#
files = sorted(glob.glob(args.images))

for image_name in files:

    #print("Processing:", image_name)

    # strip only the file extension, so the image itself is never overwritten
    base_name = os.path.splitext(image_name)[0]

    # load image (only its size is needed)
    with Image.open(image_name) as img:
        width, height = img.size

    # load gt
    gt_txt_name = base_name + ".gt.txt"
    with io.open(gt_txt_name, "r", encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    box_name = base_name + ".box"
    with io.open(box_name, "w", encoding='utf-8') as f:
        for line in lines:
            if len(line) == 0:
                # warn on stderr instead of writing into the box file, then skip
                sys.stderr.write("WARNING: empty line in %s\n" % gt_txt_name)
                continue
            for i in range(1, len(line)):
                char = line[i]
                prev_char = line[i-1]
                if unicodedata.combining(char):
                    # a combining mark shares a box with the preceding character
                    f.write(u"%s %d %d %d %d 0\n" % ((prev_char + char), 0, 0, width, height))
                elif not unicodedata.combining(prev_char):
                    f.write(u"%s %d %d %d %d 0\n" % (prev_char, 0, 0, width, height))
            if not unicodedata.combining(line[-1]):
                f.write(u"%s %d %d %d %d 0\n" % (line[-1], 0, 0, width, height))
            # a tab entry marks the end of the text line
            f.write(u"%s %d %d %d %d 0\n" % ("\t", width, height, width+1, height+1))
