Yes, sure. The input file is a snippet showing a capital letter followed by 9 digits. The correct user pattern, according to [1], is:

``\A\d\d\d\d\d\d\d\d\d``

The result from Tesseract (psm 8) is fully correct. Nevertheless, user patterns do not work in the way described above. For instance, I tried to extract only the capital letter using user patterns (not a whitelist), with the pattern:

``\A``

In this case, Tesseract still returns the capital letter and all the digits.

I've attached my input file and the corresponding Python snippet for reading and processing the image with tesserocr [2].

[1] https://github.com/tesseract-ocr/tesseract/blob/main/src/dict/trie.h#L197
[2] https://github.com/sirfz/tesserocr

On Fri, 1 Mar 2024 at 18:59, René JM Clais <[email protected]> wrote:

> Can you send an example of an input document and the output of Tesseract,
> as well as what you would expect when using the pattern file?
>
> On Thu, 29 Feb 2024 at 21:40, Roman Seidel <[email protected]> wrote:
>
>> Hi all,
>>
>> I am currently trying to use user patterns with the PyTessBaseAPI from
>> tesserocr [1].
>>
>> What I've done is initialize the API with:
>>
>> with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata', lang=LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:
>>
>> and set the user patterns file with:
>>
>> api.SetVariable('user_patterns_file', '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
>>
>> where the user patterns file contains a pattern, e.g.:
>>
>> \A\A\A
>>
>> (which means three capital letters).
>>
>> The results are the same regardless of whether I set user_patterns_file
>> or not. This brings me to the question: does tesserocr support
>> user (and word) patterns?
>>
>> My versions:
>>
>> tesserocr 2.6.2
>> tesseract 5.3.3
>> leptonica-1.83.1
>> libpng 1.6.34 : zlib 1.2.11
>>
>> Thanks a lot for your help and best wishes,
>> Roman

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAL%3DSc5v%3DLm8Bf_5qE2yaFGb7sY99%3DLceSWTqEk8DMMR_GYWjeg%40mail.gmail.com.
deu.patterns
Description: Binary data
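The attachment itself is not readable here, so the following is only a sketch of how a patterns file with the pattern quoted in the message could be written and handed to tesserocr at construction time rather than through `SetVariable` afterwards. It rests on two assumptions that are not confirmed anywhere in this thread: that Tesseract reads `user_patterns_file` while the dictionaries are loaded during initialisation, and that `PyTessBaseAPI` accepts a `variables` dict that is applied at that point. The helper names `write_patterns_file`, `build_init_kwargs` and `recognise` are made up for illustration.

```python
from pathlib import Path


def write_patterns_file(path, patterns):
    """Write one user pattern per line and return the file path as a string."""
    path = Path(path)
    path.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return str(path)


def build_init_kwargs(patterns_path, lang="deu", psm=8, oem=0):
    """Collect keyword arguments for PyTessBaseAPI (hypothetical helper).

    Assumption: the `variables` dict is applied during initialisation,
    which is when the dictionary (and any user patterns) would be loaded.
    """
    return {
        "lang": lang,
        "psm": psm,
        "oem": oem,
        "variables": {"user_patterns_file": patterns_path},
    }


def recognise(image_path, init_kwargs):
    """Run OCR on a single image file with the given init arguments."""
    # Imported lazily so the two helpers above work without tesserocr installed.
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI(**init_kwargs) as api:
        api.SetImageFile(image_path)
        return api.GetUTF8Text().strip()
```

A call could look like `recognise('betriebsstaette.png', build_init_kwargs(write_patterns_file('deu.patterns', [r'\A\d\d\d\d\d\d\d\d\d'])))`. Whether this changes the OCR result for the image in question has not been tested.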
import numpy as np
from PIL import Image
import tesserocr
from tesserocr import PyTessBaseAPI, RIL, PSM, OEM


def read_image(input_image):
    """Load an image file as an RGB numpy array."""
    image = np.asarray(Image.open(input_image).convert('RGB'))
    return image


def detect_text(image, psm, whitelist):
    """Run Tesseract word by word and return the recognised boxes."""
    # convert the array back to a PIL image so tesserocr can read it
    img_arr = np.array(image, dtype=np.uint8)
    new_image = Image.fromarray(img_arr)
    DPI = '300'
    # MeanTextConf() returns an integer from 0 to 100,
    # so a threshold of 0.5 only drops zero-confidence results
    CONF = 0.5
    LANGUAGE = 'deu'
    TOEM = 0  # OEM.TESSERACT_ONLY (legacy engine)
    box_list = []
    # alternative: psm 11 (sparse text), oem 0 (legacy engine)
    # with PyTessBaseAPI(lang='deu', psm=PSM.SPARSE_TEXT, oem=OEM.TESSERACT_ONLY) as api:
    with PyTessBaseAPI(path='/usr/share/tesseract-ocr/4.00/tessdata', lang=LANGUAGE, psm=int(psm), oem=int(TOEM)) as api:
        #api.SetImageBytes(image.tobytes(), image.shape[1], image.shape[0], 1, image.shape[1])
        api.SetImage(new_image)
        api.SetVariable("tessedit_char_whitelist", str(whitelist))
        api.SetVariable("user_defined_dpi", DPI)
        # user patterns
        # api.SetVariable('user_patterns_file', '/home/roman/Dev_d/playground/user_patterns/deu.patterns')
        boxes = api.GetComponentImages(RIL.WORD, True)
        #print('Found {} textline image components.'.format(len(boxes)))
        for i, (im, box, _, _) in enumerate(boxes):
            # im is a PIL image object
            # box is a dict with x, y, w and h keys
            api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
            text = api.GetUTF8Text()
            text = text.replace("\n", "")
            conf = api.MeanTextConf()
            # collect text, position and confidence in one record
            data = {
                'text': text,
                'x': box['x'],
                'y': box['y'],
                'w': box['w'],
                'h': box['h'],
                'c': conf}
            if conf >= CONF:
                print(u"Box[{0}]: x={x}, y={y}, w={w}, h={h}, "
                      "confidence: {1}, text: {2}".format(i, conf, text, **box))
                box_list.append(data)
    return box_list


def main():
    print(tesserocr.tesseract_version())
    print(tesserocr.get_languages())
    input_image = '/home/roman/Dev_d/playground/user_patterns/betriebsstaette.png'
    image = read_image(input_image)
    #box_list = detect_text(image, 8, "abcdefghijklmnopqrstuvwxyzäöüABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜß0123456789,.;- ")
    box_list = detect_text(image, 8, "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜß0123456789")
    #print(f"box list: {box_list}")


if __name__ == "__main__":
    main()
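Since a pattern such as `\A` did not restrict what Tesseract returned, one workaround is to check the recognised text against the pattern after OCR. The sketch below translates the character classes listed in the trie.h comment linked as [1] into a Python regular expression. It is not Tesseract's own matching logic: the class mapping is an approximation limited to ASCII plus the German letters from the whitelist in the script, treating `\*` as "zero or more repetitions" is an assumption, and `pattern_to_regex` and `matches_pattern` are names made up for this example.

```python
import re

# Approximate Python equivalents of the user-pattern classes from trie.h.
# Assumption: ASCII plus the German letters used in the whitelist above.
_CLASS_MAP = {
    "c": r"[A-Za-zÄÖÜäöüß]",     # alphabetic character
    "d": r"\d",                  # digit
    "n": r"[0-9A-Za-zÄÖÜäöüß]",  # digit or alphabetic character
    "p": r"[^\w\s]",             # punctuation
    "a": r"[a-zäöüß]",           # lowercase letter
    "A": r"[A-ZÄÖÜ]",            # uppercase letter
}


def pattern_to_regex(pattern):
    """Translate a Tesseract-style user pattern into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            code = pattern[i + 1]
            if code == "*":
                # Assumption: '\*' repeats the preceding element zero or more times.
                if not parts:
                    raise ValueError("'\\*' must follow a character or class")
                parts[-1] += "*"
            elif code in _CLASS_MAP:
                parts.append(_CLASS_MAP[code])
            else:
                # Unknown escape: treated as the literal character (assumption).
                parts.append(re.escape(code))
            i += 2
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def matches_pattern(text, pattern):
    """Return True if the whole text conforms to the user pattern."""
    return pattern_to_regex(pattern).fullmatch(text) is not None
```

With the script above, the boxes could then be filtered with something like `[b for b in box_list if matches_pattern(b['text'], r'\A\d\d\d\d\d\d\d\d\d')]`. This only rejects non-conforming results after the fact and does not steer the recognition itself.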

