I agree with the suggestion to try template matching. I already did some
experiments with Tesseract, so I will share those here.
Previous threads have brought up issues with part numbers that mix letters
and digits when using the default English training data. The same thing is
happening here: in one of your examples, R5 -> RS and T9 -> TS.
I tried a few experiments to alter the spacing in the original image.
(1) First, I tried increasing the horizontal spacing between characters. A
little bit of extra space does seem to help; however, if I added too much,
there was a "ringing" effect where Tesseract would read in characters that
aren't there. You can see that in some cases "V" got doubled into "Vv".
(2) Next, I tried putting each character on a separate line. Here too,
there was a "ringing" effect with the letter V.
(3) Third, I tried putting each character into its own image. (This is
slower because I believe pytesseract launches a new Tesseract process each
time you call it.)
(4) Finally, I tried running all three approaches and showing the results
side by side.
For each method I had to tune the parameters a little, so it is likely that
it will still fail on some cases in your data set.
For me, it was interesting to play with the different spacing parameters
and see how Tesseract reacts.
I did not experiment much with the Page Segmentation Mode (psm) parameter.
I haven't tried the legacy engine either, which was suggested.
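For anyone who wants to experiment with those settings, they are all passed through the config string that pytesseract hands to the Tesseract command line. A minimal sketch (the helper name `make_config` is mine; `--oem 0` selects the legacy engine and requires the legacy traineddata files):

```python
# Sketch: building Tesseract config strings for psm / engine experiments.
# Note that a character whitelist only takes effect when passed with "-c";
# without the "-c" prefix Tesseract ignores it.
def make_config(psm, oem=3, whitelist=None):
    config = f"--psm {psm} --oem {oem}"
    if whitelist:
        config += f" -c tessedit_char_whitelist={whitelist}"
    return config

single_line = make_config(psm=7)  # treat the image as one text line
legacy_digits = make_config(psm=8, oem=0, whitelist="0123456789")
# Usage (needs an image and a Tesseract install):
#   text = pytesseract.image_to_string(img, config=single_line)
```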
--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
# Usage: python img.py <filename.png> [mode]
# mode is optional and can be:
# 1 : expand spacing between characters
# 2 : put characters on separate lines
# 3 : put characters in separate images
# 4 : ensemble - try all three modes, and show the results
from pytesseract import image_to_string
import pytesseract
import cv2
import re
import sys
import numpy as np
filename = sys.argv[1]
mode = 4
if len(sys.argv) > 2:
    mode = int(sys.argv[2])
img = cv2.imread(filename, cv2.IMREAD_GRAYSCALE)
height, width = img.shape
# Important: remove the line from the bottom as well as the top
# The re-spacing algorithm won't work unless it can find a whitespace gap from top to bottom between characters.
# i.e., this code only works with a single line of text, with no other content.
MARGIN_TOP_FROM_BOTTOM = 41
MARGIN_BOTTOM = 5
MARGIN_LEFT = 2
MARGIN_RIGHT = 2
roi = img[height - MARGIN_TOP_FROM_BOTTOM : height - MARGIN_BOTTOM, MARGIN_LEFT : width - MARGIN_RIGHT]
import char_spacing  # Helper module; its contents are included below
'''
tesseract --help-extra
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
'''
roi = cv2.resize(roi, None, fx=2, fy=2)
# Experimental
def process_horizontal(roi):
    print('Expanding horizontal space between characters.')
    psm = 7
    roi = char_spacing.expand_horizontal_gaps(roi, min_run_length=64, white_threshold=220)
    # Note: a whitelist only takes effect when passed as "-c tessedit_char_whitelist=...";
    # without the "-c" it is ignored, which is why letters still appear in the output below.
    tess_config = f"--psm {psm} --oem 3 tessedit_char_whitelist=0123456789"
    _, roi = cv2.threshold(roi, 128 + 64, 255, cv2.THRESH_BINARY)
    roi = cv2.GaussianBlur(roi, (3, 3), 0)
    text_detected = image_to_string(roi, config=tess_config)
    return text_detected, roi
def process_vertical(roi):
    print('Expanding characters onto separate lines.')
    psm = 6
    roi = char_spacing.one_character_per_line(roi, line_spacing=20, white_threshold=220)
    tess_config = f"--psm {psm} --oem 3 tessedit_char_whitelist=0123456789"
    _, roi = cv2.threshold(roi, 128 + 64, 255, cv2.THRESH_BINARY)
    roi = cv2.GaussianBlur(roi, (3, 3), 0)
    text_detected = image_to_string(roi, config=tess_config)
    text_detected = text_detected.replace('\n', ' ')  # Mode 2 only: rejoin the lines
    return text_detected, roi
def process_separate(roi):
    print('Separating character clusters into separate images')
    psm = 7  # Single text line (or 10, single character)
    images = char_spacing.one_character_per_image(roi, new_margin=8, white_threshold=220)
    tess_config = f"--psm {psm} --oem 3 tessedit_char_whitelist=0123456789"
    text_list = []
    processed_images = []
    for roi in images:
        _, roi = cv2.threshold(roi, 128 + 64, 255, cv2.THRESH_BINARY)
        roi = cv2.GaussianBlur(roi, (3, 3), 0)
        processed_images.append(roi)
        text_detected = image_to_string(roi, config=tess_config)
        text_list.append(text_detected)
    # Use a space separator for readable output; the spaces are removed later
    # before number extraction.
    text_detected = ' '.join(text_list)
    return text_detected, images[0]
def extract_numbers(text_detected):
    print()
    print(text_detected, '(before correction)')
    # Correct common letter/digit confusions from the default English model
    for wrong, right in (('I', '1'), ('i', '1'), ('l', '1'), ('L', '1'),
                         ('Z', '2'), ('S', '5'), ('s', '5'), ('G', '6'),
                         ('O', '0'), ('o', '0')):
        text_detected = text_detected.replace(wrong, right)
    print(text_detected, '(after correction)')
    # Remove spacing before finding numbers, because the preprocessing may have separated digits
    text_detected = text_detected.replace(' ', '')
    numbers = re.findall("[0-9]+", text_detected)
    print(numbers)
    return numbers
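As an aside, the letter-for-digit corrections above can also be expressed as a single translation table. A compact equivalent for reference (`digits_from` is my name for it, not part of the script):

```python
import re

# Same letter -> digit confusion map as extract_numbers uses.
CONFUSIONS = str.maketrans("IilLZSsGOo", "1111255600")

def digits_from(text):
    corrected = text.translate(CONFUSIONS)
    # Drop spaces first, since the preprocessing may have separated digits.
    return re.findall("[0-9]+", corrected.replace(" ", ""))

print(digits_from("R 1 N 3 RS T9 vio R3"))
# -> ['1', '3', '5', '9', '10', '3']  (the 'RS' and 'vio' misreads become 5 and 10)
```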
if mode == 1: text_detected, roi = process_horizontal(roi)
if mode == 2: text_detected, roi = process_vertical(roi)
if mode == 3: text_detected, roi = process_separate(roi)
if mode == 4:
    t1, _ = process_horizontal(roi)
    t2, _ = process_vertical(roi)
    t3, _ = process_separate(roi)
    n1 = extract_numbers(t1)
    n2 = extract_numbers(t2)
    n3 = extract_numbers(t3)
    print('Results:')
    print(n1)
    print(n2)
    print(n3)
if mode != 4:
    print(extract_numbers(text_detected))
TIMEOUT = 45 * 1000
cv2.imshow("roi", roi)
cv2.waitKey(TIMEOUT)
# char_spacing.py -- Experimental
import numpy as np
def expand_horizontal_gaps(bin_img, min_run_length=16, white_threshold=240):
    gaps = []
    in_gap = False
    (height, width) = bin_img.shape
    for x in range(width):
        # Look for runs of all-white columns...
        is_white = True
        for y in range(height):
            if bin_img[y][x] < white_threshold:
                is_white = False
                break
        if is_white and not in_gap:
            in_gap = True
            gap_start = x
        if in_gap and not is_white:
            gaps.append((gap_start, x))
            in_gap = False
    if in_gap:
        gaps.append((gap_start, width))
    # Now 'gaps' contains a list of all gaps...
    # How many columns each gap falls short of min_run_length:
    gap_deficits = []
    for gap in gaps:
        gap_size = gap[1] - gap[0]
        if gap_size < min_run_length:
            gap_deficits.append(min_run_length - gap_size)
        else:
            gap_deficits.append(0)
    total_deficit = sum(gap_deficits)
    gap_centers = []
    for gap in gaps:
        gap_centers.append((gap[0] + gap[1]) // 2)
    gap_centers.append(-1)  # Sentinel: never matches a column index
    gap_index = 0
    WHITE = 255
    new_img = np.zeros((height, width + total_deficit), np.uint8)
    xx = 0
    for x in range(width):
        # Insert extra white columns at the center of a too-narrow gap
        if x == gap_centers[gap_index]:
            for _ in range(gap_deficits[gap_index]):
                for y in range(height):
                    new_img[y][xx] = WHITE
                xx += 1
            gap_index += 1
        # Copy the column over regularly
        for y in range(height):
            new_img[y][xx] = bin_img[y][x]
        xx += 1
    return new_img
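If the per-pixel loops ever become a bottleneck, the same column scan vectorizes nicely with NumPy. A sketch of the same idea under the same assumptions (light background, single text line; the function name is mine):

```python
import numpy as np

def expand_gaps_vectorized(bin_img, min_run_length=16, white_threshold=240):
    # A column is a "gap" column when every pixel in it is at/above the threshold.
    white_cols = (bin_img >= white_threshold).all(axis=0)
    # Find runs of consecutive gap columns from the transitions in the mask.
    padded = np.concatenate(([False], white_cols, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[0::2], edges[1::2]
    pieces, prev = [], 0
    for s, e in zip(starts, ends):
        deficit = max(0, min_run_length - (e - s))
        if deficit:
            center = (s + e) // 2
            # Split at the gap center and pad with all-white columns.
            pieces.append(bin_img[:, prev:center])
            pieces.append(np.full((bin_img.shape[0], deficit), 255, np.uint8))
            prev = center
    pieces.append(bin_img[:, prev:])
    return np.concatenate(pieces, axis=1)
```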
def get_runs(img, white_threshold=240):
    # Return (start, end) column ranges of the non-white (character) clusters.
    runs = []
    in_run = False
    (height, width) = img.shape
    for x in range(width):
        # A column is white only if every pixel in it is above the threshold
        is_white = True
        for y in range(height):
            if img[y][x] < white_threshold:
                is_white = False
                break
        if is_white and in_run:
            in_run = False
            runs.append((run_start, x))
        if not in_run and not is_white:
            run_start = x
            in_run = True
    if in_run:
        runs.append((run_start, width))
    return runs
def one_character_per_image(img, new_margin=16, white_threshold=240):
    images = []
    # Find the margins...
    top, bottom, left, right = get_margins(img, white_threshold)
    # Crop the blank margins away
    img = img[top:img.shape[0] - bottom, left:img.shape[1] - right]
    height = len(img)
    runs = get_runs(img, white_threshold)
    for run in runs:
        x_offset = new_margin
        y_offset = new_margin
        new_width = (run[1] - run[0]) + (2 * new_margin)
        new_height = (2 * new_margin) + height
        new_img = np.ones((new_height, new_width), np.uint8) * 255
        for y in range(height):
            yy = y_offset + y
            for x in range(run[1] - run[0]):
                xx = x + x_offset
                new_img[yy][xx] = img[y][x + run[0]]
        images.append(new_img)
    return images
def one_character_per_line(img, line_spacing=20, white_threshold=240):
    # Find the margins...
    top, bottom, left, right = get_margins(img, white_threshold)
    # Crop the blank margins away
    img = img[top:img.shape[0] - bottom, left:img.shape[1] - right]
    runs = get_runs(img, white_threshold)
    # Calculate the size of the new image
    new_margin = 16
    height = len(img)
    longest_run = max(run[1] - run[0] for run in runs)  # Longest run length
    new_width = longest_run + 2 * new_margin
    new_height = (2 * new_margin) + (height * len(runs)) + (line_spacing * (len(runs) - 1))
    new_img = np.ones((new_height, new_width), np.uint8) * 255
    for row_index, run in enumerate(runs):
        x_offset = new_margin
        y_offset = new_margin + (line_spacing + height) * row_index
        for y in range(height):
            yy = y_offset + y
            for x in range(run[1] - run[0]):
                xx = x + x_offset
                new_img[yy][xx] = img[y][x + run[0]]
    return new_img
def get_margins(gray, shade):
    # Identify margins (top, bottom, left, right) on a grayscale image,
    # using the given shade as a threshold, assuming a light background.
    height, width = gray.shape
    # Top margin:
    top_margin = 0
    for y in range(height):
        blank = True
        for x in range(width):
            if gray[y][x] < shade:
                blank = False
                break
        if blank:
            top_margin += 1
        else:
            break
    # Bottom margin:
    bottom_margin = 0
    for y in range(height - 1, -1, -1):
        blank = True
        for x in range(width):
            if gray[y][x] < shade:
                blank = False
                break
        if blank:
            bottom_margin += 1
        else:
            break
    # Right margin:
    right_margin = 0
    for x in range(width - 1, -1, -1):
        blank = True
        for y in range(height):
            if gray[y][x] < shade:
                blank = False
                break
        if blank:
            right_margin += 1
        else:
            break
    # Left margin:
    left_margin = 0
    for x in range(width):
        blank = True
        for y in range(height):
            if gray[y][x] < shade:
                blank = False
                break
        if blank:
            left_margin += 1
        else:
            break
    return (top_margin, bottom_margin, left_margin, right_margin)
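The four margin scans above can also be reduced to a couple of NumPy reductions. A sketch of the same idea (the function name is mine; an entirely blank image is not handled):

```python
import numpy as np

def get_margins_np(gray, shade):
    # Rows/columns containing any pixel darker than `shade` count as "ink".
    ink_rows = (gray < shade).any(axis=1)
    ink_cols = (gray < shade).any(axis=0)
    top = int(np.argmax(ink_rows))           # first ink row from the top
    bottom = int(np.argmax(ink_rows[::-1]))  # first ink row from the bottom
    left = int(np.argmax(ink_cols))
    right = int(np.argmax(ink_cols[::-1]))
    return (top, bottom, left, right)
```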
The transcript below shows the three methods run in succession, and then all
three results together.
>python img.py 2017-03-26_SecondPie.png
Expanding horizontal space between characters.
Expanding characters onto separate lines.
Separating character clusters into separate images
N 1 Vv N 2 Vv 2 N 2 Vv 2 T R Vv 3 T 3 R 3 R 1 N 8 R2 R 1 T 6 R2 T 2 T 1 (before correction)
N 1 Vv N 2 Vv 2 N 2 Vv 2 T R Vv 3 T 3 R 3 R 1 N 8 R2 R 1 T 6 R2 T 2 T 1 (after correction)
['1', '2', '2', '2', '2', '3', '3', '3', '1', '8', '2', '1', '6', '2', '2', '1']
N 1 Vv N 2 V2 N 2 V2 T R V3 T3 R3 R 1 N8 R2 R 1 T6 R2 T2 T1 (before correction)
N 1 Vv N 2 V2 N 2 V2 T R V3 T3 R3 R 1 N8 R2 R 1 T6 R2 T2 T1 (after correction)
['1', '2', '2', '2', '2', '3', '3', '3', '1', '8', '2', '1', '6', '2', '2', '1']
N 1 Vv N 2 V2 N 2 V2 T R V3 T3 R3 R 1 N8 R2 R 1 T6 R2 T2 T1 (before correction)
N 1 Vv N 2 V2 N 2 V2 T R V3 T3 R3 R 1 N8 R2 R 1 T6 R2 T2 T1 (after correction)
['1', '2', '2', '2', '2', '3', '3', '3', '1', '8', '2', '1', '6', '2', '2', '1']
Results:
['1', '2', '2', '2', '2', '3', '3', '3', '1', '8', '2', '1', '6', '2', '2', '1']
['1', '2', '2', '2', '2', '3', '3', '3', '1', '8', '2', '1', '6', '2', '2', '1']
['1', '2', '2', '2', '2', '3', '3', '3', '1', '8', '2', '1', '6', '2', '2', '1']
===
>python img.py 2007-04-12_SecondPie.png
Expanding horizontal space between characters.
Expanding characters onto separate lines.
Separating character clusters into separate images
R 1 N Vv T N 3 R 5 N 2 Vv 5 N 2 R4 R 1 N 3 T9 N 2 R4 R 1 N 3 Vv 1 0 R 3 (before correction)
R 1 N Vv T N 3 R 5 N 2 Vv 5 N 2 R4 R 1 N 3 T9 N 2 R4 R 1 N 3 Vv 1 0 R 3 (after correction)
['1', '3', '5', '2', '5', '2', '4', '1', '3', '9', '2', '4', '1', '3', '10', '3']
R 1 N Vv T N 3 RS N 2 V5 N 2 R4 R 1 N 3 T9 N 2 R4 R 1 N 3 vio R3 (before correction)
R 1 N Vv T N 3 R5 N 2 V5 N 2 R4 R 1 N 3 T9 N 2 R4 R 1 N 3 v10 R3 (after correction)
['1', '3', '5', '2', '5', '2', '4', '1', '3', '9', '2', '4', '1', '3', '10', '3']
R 1 N Vv T N 3 RS N 2 VS N 2 R4 R 1 N 3 T9 N 2 R4 R 1 N 3 vio R3 (before correction)
R 1 N Vv T N 3 R5 N 2 V5 N 2 R4 R 1 N 3 T9 N 2 R4 R 1 N 3 v10 R3 (after correction)
['1', '3', '5', '2', '5', '2', '4', '1', '3', '9', '2', '4', '1', '3', '10', '3']
Results:
['1', '3', '5', '2', '5', '2', '4', '1', '3', '9', '2', '4', '1', '3', '10', '3']
['1', '3', '5', '2', '5', '2', '4', '1', '3', '9', '2', '4', '1', '3', '10', '3']
['1', '3', '5', '2', '5', '2', '4', '1', '3', '9', '2', '4', '1', '3', '10', '3']