Re: [tesseract-ocr] user patterns with tesserocr python API

Roman Seidel Sat, 02 Mar 2024 01:45:37 -0800

Yes, sure. The input file is an image snippet containing a capital letter
followed by 9 digits. The correct user pattern, according to [1], is:


``\A\d\d\d\d\d\d\d\d\d``

The result of Tesseract (psm 8) is fully correct. Nevertheless, user
patterns are not working in the way described above.

For instance, I have tried to extract only the capital character with user
patterns (not with whitelist), which is:

\A

In this case, Tesseract still returns the capital letter and all digits.
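
To make the expectation concrete, here is a minimal sketch in plain Python
(no Tesseract involved) that checks a recognized string against a pattern.
The mapping of \A, \a, \d and \c to regular-expression classes, including
the German umlauts, is my own assumption based on the comments in [1];
pattern_to_regex and matches_pattern are hypothetical helper names, and
other escapes such as \* are deliberately not handled.

```python
import re

# Assumed mapping from Tesseract pattern classes to regex character classes.
# The umlaut ranges are an illustration for 'deu', not taken from Tesseract.
_CLASS_MAP = {
    "A": "[A-ZÄÖÜ]",          # upper-case letter
    "a": "[a-zäöüß]",         # lower-case letter
    "d": "[0-9]",             # digit
    "c": "[A-Za-zÄÖÜäöüß]",   # any letter
}


def pattern_to_regex(pattern):
    """Translate a user pattern such as r'\\A\\d\\d' into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            if i + 1 >= len(pattern):
                raise ValueError("dangling backslash in pattern")
            key = pattern[i + 1]
            if key not in _CLASS_MAP:
                raise ValueError("escape not covered by this sketch: \\" + key)
            parts.append(_CLASS_MAP[key])
            i += 2
        else:
            # Any other character stands for itself.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def matches_pattern(text, pattern):
    """Return True only if the whole text matches the pattern."""
    return pattern_to_regex(pattern).fullmatch(text) is not None
```

With this check, a string of one capital letter and nine digits satisfies
the nine-digit pattern above but not the single \A pattern, which is the
kind of restriction I was hoping the patterns file would impose.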

I've attached my input file and the corresponding Python snippet for
reading and processing the image with tesserocr [2].


[1]
https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h#L197
[2] https://github.com/sirfz/tesserocr
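
One thing that may be worth checking: as far as I understand, Tesseract
reads user_patterns_file while the dictionaries are loaded during
initialization, so setting it with SetVariable afterwards may come too
late. The sketch below passes it at construction time instead. It assumes
that the installed tesserocr version accepts a variables dict in the
PyTessBaseAPI constructor; build_init_kwargs and ocr_with_patterns are
hypothetical helper names, and this is untested against the attached image.

```python
def build_init_kwargs(tessdata_path, lang, psm, oem, patterns_file):
    """Collect constructor arguments, with the patterns file as an init-time variable."""
    return {
        "path": tessdata_path,
        "lang": lang,
        "psm": int(psm),
        "oem": int(oem),
        # Assumption: 'variables' is handed to Tesseract during initialization.
        "variables": {"user_patterns_file": str(patterns_file)},
    }


def ocr_with_patterns(image_path, **init_kwargs):
    """Run OCR on one image with the given constructor arguments."""
    # Imported lazily so the helper above can be used without tesserocr installed.
    from PIL import Image
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI(**init_kwargs) as api:
        api.SetImage(Image.open(image_path))
        return api.GetUTF8Text().strip()
```

Calling ocr_with_patterns(image_path, **build_init_kwargs(...)) with the
tessdata path, 'deu', psm 8, oem 0 and the deu.patterns path from my
snippet would mirror the setup described in this thread.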



On Fri, 1 Mar 2024 at 18:59, René JM Clais <[email protected]> wrote:

> Can you send an example of an input document, the output of Tesseract,
> and what you expect to get when using the pattern file?
>
> On Thu, 29 Feb 2024 at 21:40, Roman Seidel <[email protected]> wrote:
>
>> Hi all,
>>
>> I am currently trying to use user patterns with the PyTessBaseAPI from
>> tesserocr [1].
>>
>> What I've done is to initialize the API with:
>>
>> with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata', lang=
>> LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:
>>
>> setting the user patterns file with:
>>
>> api.SetVariable('user_patterns_file',
>> '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>
>> Where the user patterns file contains a pattern, e.g.:
>>
>> \A\A\A
>>
>> (which means three capital letters).
>>
>>
>> The results are the same regardless of whether I set the
>> user_patterns_file variable or not. This raises the question of whether
>> tesserocr supports user (and word) patterns.
>>
>> My versions:
>>
>> tesserocr 2.6.2
>> tesseract 5.3.3
>>  leptonica-1.83.1
>>   libpng 1.6.34 : zlib 1.2.11
>>
>> Thanks a lot for your help and best wishes,
>> Roman
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/767cc60f-5325-43d7-a6ef-9cf879f82950n%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/767cc60f-5325-43d7-a6ef-9cf879f82950n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>

deu.patterns
Description: Binary data

import numpy as np
from PIL import Image
import tesserocr
from tesserocr import PyTessBaseAPI, RIL, PSM, OEM


def read_image(input_image):
    image = np.asarray(Image.open(input_image).convert('RGB')) 
    return image


def detect_text(image, psm, whitelist):

    # convert the NumPy array back to a PIL image so tesserocr can read it
    img_arr = np.array(image, dtype=np.uint8)
    new_image = Image.fromarray(img_arr)
    
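    DPI = '300'
    # MeanTextConf() returns an integer between 0 and 100, so this threshold
    # only rejects results with a confidence of 0
    CONF = 0.5
    LANGUAGE = 'deu'
    TOEM = 0  # OEM.TESSERACT_ONLY (legacy engine)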

    box_list = []
    # Alternative with enum names (PSM.SPARSE_TEXT == 11, OEM.TESSERACT_ONLY == 0):
    # with PyTessBaseAPI(lang='deu', psm=PSM.SPARSE_TEXT, oem=OEM.TESSERACT_ONLY) as api:
    with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata', lang=LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:
        #api.SetImageBytes(image.tobytes(), image.shape[1], image.shape[0], 1, image.shape[1])
        api.SetImage(new_image)
        api.SetVariable("tessedit_char_whitelist", str(whitelist))
        api.SetVariable("user_defined_dpi", DPI)
        # user patterns
        # api.SetVariable('user_patterns_file', '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
        
        boxes = api.GetComponentImages(RIL.WORD, True)
        #print('Found {} textline image components.'.format(len(boxes)))
        for i, (im, box, _, _) in enumerate(boxes):
            # im is a PIL image object
            # box is a dict with x, y, w and h keys
            api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
            text = api.GetUTF8Text()
            text = text.replace("\n", "")
            conf = api.MeanTextConf()
            # beautify data
            data = {
                'text': text,
                'x': box['x'], 
                'y': box['y'],
                'w': box['w'], 
                'h': box['h'],
                'c': conf}
            if conf >= CONF:
                print(u"Box[{0}]: x={x}, y={y}, w={w}, h={h}, "
                "confidence: {1}, text: {2}".format(i, conf, text, **box))
                box_list.append(data)
    
    return box_list




def main():

    print(tesserocr.tesseract_version())
    print(tesserocr.get_languages())


    input_image = '/home/roman/Dev_d/playground/user_patterns/betriebsstaette.png'
    image = read_image(input_image)
    #box_list = detect_text(image, 8, "abcdefghijklmnopqrstuvwxyzäöüABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜß0123456789,.;- ")
    box_list = detect_text(image, 8, "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜß0123456789")


    #print(f"box list: {box_list}")



if __name__ == "__main__":
    main()
