I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently misrecognized by Tesseract, so I use these regex replacements to correct those characters and words whenever they come out wrong.
Here is what I have done:

{" ী": "ী", " ্": " ", " ে": " ", "জ্া": "জা", " ": " ", " ": " ", " ": " ", "্প": " ", " য": "র্য", "য": "য", " া": "া", "আা": "আ", "ম্ি": "মি", "স্ু": "সু", "হূ ": "হূ", " ণ": "ণ", "র্্": "র", "চিন্ত ": "চিন্তা ", "ন্া": "না", "সম ূর্ন": "সম্পূর্ণ"}

On Wednesday, 13 September 2023 at 4:18:22 pm UTC+6, desal...@gmail.com wrote:

The problem with regex is that Tesseract is not consistent in its replacements. Suppose the original English training data doesn't contain the letter /u/. What does Tesseract do when it encounters /u/ during actual processing? In some cases it replaces it with closely similar letters such as /v/ or /w/; in other cases it removes it completely. That is what is happening in my case: those characters are sometimes removed entirely, and other times replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3, mdalihu...@gmail.com wrote:

If some specific characters or words are always missing from the OCR result, you can apply regular-expression logic in your application. After OCR, those specific characters or words are replaced by the correct characters or words that you define in your application with regular expressions. This can solve some major problems.

On Wednesday, 13 September 2023 at 3:51:29 pm UTC+6, desal...@gmail.com wrote:

The characters are getting missed even after fine-tuning. I never made any progress, and I tried many different ways. Some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3, mdalihu...@gmail.com wrote:

I think EasyOCR is best for ID cards and similar images, but for document images such as books, Tesseract is better than EasyOCR. I haven't even used EasyOCR myself; you can try it.
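The replacement table at the top of this thread can be applied in a few lines of Python. This is only a minimal sketch: `ocr_postprocess` is a made-up helper name, and the mapping below uses ASCII placeholders rather than the Bengali pairs above. Longer keys are tried first so that multi-character fixes are not pre-empted by shorter ones:

```python
import re

# Hypothetical example mapping; in practice this would hold the
# Bengali character/word pairs listed above.
CORRECTIONS = {
    "teh": "the",
    "vv": "w",   # e.g. one letter the model renders as two similar glyphs
    "rn": "m",
}

def ocr_postprocess(text: str, corrections: dict) -> str:
    # Sort keys longest-first so longer fixes win, and escape them
    # so they are matched literally inside the regex.
    pattern = re.compile(
        "|".join(re.escape(k) for k in sorted(corrections, key=len, reverse=True))
    )
    return pattern.sub(lambda m: corrections[m.group(0)], text)

print(ocr_postprocess("teh vvord", CORRECTIONS))  # → "the word"
```

Note that a plain mapping like this cannot handle the inconsistency desal describes below (the same source character coming out as several different wrong characters, or vanishing entirely); it only helps when the error is stable.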
I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning with a few new characters, as you said ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September 2023 at 3:33:48 pm UTC+6, desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the use of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3, mdalihu...@gmail.com wrote:

How is your training going for Bengali? Mine was nearly good, but I faced spacing problems between words: some word pairs got a space, but most of them had none. I think the problem is in the dataset, but I used the default Bengali training dataset from Tesseract, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Actually, training from scratch is harder than fine-tuning.
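On the dictionary idea discussed above: Tesseract-style wordlists are plain text with one word per line, so building a candidate list from a corpus is straightforward. A minimal sketch (the corpus string and `build_wordlist` helper are made up for illustration; the output would then be written one word per line and merged with Tesseract's wordlist tooling):

```python
from collections import Counter

def build_wordlist(corpus: str, min_count: int = 1) -> list:
    # Count whitespace-separated tokens and keep those seen at least
    # min_count times, most frequent first.
    counts = Counter(corpus.split())
    return [w for w, c in counts.most_common() if c >= min_count]

corpus = "the cat sat on the mat the cat"
print(build_wordlist(corpus, min_count=2))  # → ['the', 'cat']
```

For a real Bengali corpus one would also want to normalize Unicode (e.g. NFC) before counting, so that visually identical words are not split across multiple entries.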
So you can use different datasets to explore. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September 2023 at 1:13:43 pm UTC+6, desal...@gmail.com wrote:

How is your training going for Bengali?

I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run OCR using the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3, mdalihu...@gmail.com wrote:

Yes, he doesn't mention all fonts, only one font. I think that is why he didn't use MODEL_NAME in a separate script file.

Actually, here we train on all the tif, gt.txt, and .box files that are created under MODEL_NAME (i.e. the eng, ben, or oro language code), because when we first create the tif, gt.txt, and .box files, every filename starts with MODEL_NAME. This MODEL_NAME is what we select in the training script for looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September 2023 at 9:42:13 pm UTC+6, desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have done a number of fine-tuning runs with a single font following Garcia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant, and very useful.
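A low training error combined with terrible real-world accuracy, as described above, is easier to diagnose with a held-out evaluation set. The usual metric is character error rate: the edit distance between the OCR output and the ground truth, divided by the ground-truth length. A minimal self-contained sketch (function names are my own):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def cer(ocr_text: str, ground_truth: str) -> float:
    # Character error rate: edits needed per ground-truth character.
    return levenshtein(ocr_text, ground_truth) / max(len(ground_truth), 1)

print(cer("kitten", "sitting"))  # 3 edits / 7 characters ≈ 0.4286
```

Computing CER on pages that were never in the training text would show whether the problem is overfitting to the rendered training lines or something in the data pipeline itself.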
It is all working for me. The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did, except for this script.

The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does that mean my model doesn't train on the fonts (even though the fonts were included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3, mdalihu...@gmail.com wrote:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This is the training command; I have saved it as tesseract_training.py inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial https://www.youtube.com/watch?v=KE4xEzFGSU8 you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file sits in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all fonts in your Linux fonts folder, /usr/share/fonts/, then run sudo apt update and then sudo fc-cache -fv.

After that, you have to add the exact font names to the FontList.py file, as I did.

I have attached two pictures of my folder structure: the first is the main structure, and the second shows the tesstrain folder.

[image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September 2023 at 12:50:03 pm UTC+6, desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question about the other script that you use to train:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3, mdalihu...@gmail.com wrote:

You can use the new script below; it's better than the previous two scripts.
You can create the tif, gt.txt, and .box files with multiple fonts, and it also supports resuming: if VS Code closes (or anything else interrupts the run) while the tif, gt.txt, and .box files are being created, you can use the start/end checkpoint to pick up from where it stopped.

Script for creating the tif, gt.txt, and .box files:

import os
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called FontList.py in the root directory and paste this into it:

class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",  # note: the comma after each name matters; without it Python concatenates adjacent strings
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the script above, as shown.

To run with a checkpoint:

sudo python3 split_training_text.py --start 0 --end 11

and change --start and --end according to where you left off.
And you already know how to resume training from a checkpoint.

On Monday, 11 September 2023 at 1:22:34 am UTC+6, desal...@gmail.com wrote:

Hi mdalihu,
The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3, mdalihu...@gmail.com wrote:

OK, I will try as you said.
One more thing: what should the training_text lines be like? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the words per line?

On Thursday, 10 August 2023 at 1:10:14 am UTC+6, shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.
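Shree's suggestion above amounts to extending the FontList used during data generation so that the fine-tuning set still contains the fonts the base model was trained on, rather than only the new ones. A sketch of the idea (the "default" names here are placeholders, not the actual fonts behind the official ben model, which would need to be checked):

```python
class FontList:
    def __init__(self):
        # Fonts the base model already knows (placeholder names).
        default_fonts = ["Default Bengali Font A", "Default Bengali Font B"]
        # Newly added fonts being fine-tuned in this thread.
        new_fonts = ["Sagar Medium", "Ekushey Lohit Normal"]
        # Rendering ground truth for both groups helps keep the model
        # from forgetting the default fonts while learning the new ones.
        self.fonts = default_fonts + new_fonts

fl = FontList()
print(len(fl.fonts))  # → 4
```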
On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training_text, tessdata_best, and everything else. Everything is good, but the problem is that the default font that was trained before no longer converts text as well as it did previously, while my new fonts work well. I don't understand why this is happening. I am sharing the code so you can see what is going on.

Code for creating the tif, gt.txt, and .box files:

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Resume from the line count stored in the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Set the ending line count
    else:
        end_line_count = min(end_line, len(lines) - 1)

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        # NOTE: end_line_count is computed above but never applied here,
        # so this loop renders every line for each font regardless of --end.
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename
            file_base_name = f'ben_{line_serial}'  # Unique filename for each line
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Update the line_count in the file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

And the training code:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)

Any suggestions for identifying the problem?
Thanks, everyone.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com.
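One way to pin down the kind of regression discussed in this thread (characters being dropped or swapped after fine-tuning) is to compare character frequencies between the ground truth and the OCR output for the degraded font: characters the model now loses show up as large deficits. A minimal sketch on dummy ASCII strings (`char_deficits` is a made-up helper name; in practice it would run over Bengali ground-truth/output pairs):

```python
from collections import Counter

def char_deficits(ground_truth: str, ocr_output: str) -> dict:
    # Positive values: characters that appear fewer times in the OCR
    # output than in the ground truth (i.e. dropped or replaced).
    gt, out = Counter(ground_truth), Counter(ocr_output)
    return {ch: gt[ch] - out[ch] for ch in gt if gt[ch] > out[ch]}

# Dummy example: the model drops or mangles every 'u'.
print(char_deficits("a queue full of umbrellas", "a qeve fll of mbrellas"))  # → {'u': 4}
```

Running this per font over a held-out set would make it clear whether only the default font regressed and exactly which characters it now misses, which in turn suggests whether the fine-tuning data under-represented those glyphs.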