Re: [tesseract-ocr] accuracy problem after trained in fine-tune

Ali hussain Wed, 13 Sep 2023 03:45:59 -0700

I know what you mean, but in some cases it helps me. Tesseract 
consistently fails to recognize certain characters and words, so I use 
these regex replacements to correct those characters and words whenever 
they come out wrong.
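
A post-OCR correction step of this kind could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain literal replacement rather than regular expressions, and the names apply_replacements and REPLACEMENTS, as well as the small sample table, are hypothetical and not taken from any application code in this thread.

```python
def apply_replacements(text, replacements):
    """Apply literal string fixes to OCR output, longest keys first."""
    # Longer keys are applied first so that a short key cannot split up
    # a longer sequence before the longer rule has had a chance to match.
    for wrong in sorted(replacements, key=len, reverse=True):
        text = text.replace(wrong, replacements[wrong])
    return text


# Hypothetical sample table; real entries depend on the errors that
# your own model makes consistently.
REPLACEMENTS = {
    "আা": "আ",
    "   ": " ",
    "  ": " ",
}

if __name__ == "__main__":
    ocr_output = "আামি   ভাত  খাই"
    print(apply_replacements(ocr_output, REPLACEMENTS))
```

Each key is applied in a single pass, so the approach only helps with errors that always appear in the same form; an error that shows up differently from page to page will not be caught by a fixed table.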


see what I have done: 

   " ী": "ী",
    " ্": " ",
    " ে": " ",
    জ্া: "জা",
    "  ": " ",
    "   ": " ",
    "    ": " ",
    "্প": " ",
    " য": "র্য",
    য: "য",
    " া": "া",
    আা: "আ",
    ম্ি: "মি",
    স্ু: "সু",
    "হূ ": "হূ",
    " ণ": "ণ",
    র্্: "র",
    "চিন্ত ": "চিন্তা ",
    ন্া: "না",
    "সম ূর্ন": "সম্পূর্ণ",
On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 desal...@gmail.com 
wrote:

> The problem with regex is that Tesseract is not consistent in its 
> replacements. 
> Suppose the original English training data didn't contain the letter 
> /u/. What does Tesseract do when it encounters /u/ in actual processing?
> In some cases, it replaces it with closely similar letters such as /v/ and 
> /w/. In other cases, it removes it completely. That is what is happening 
> in my case. Those characters are sometimes completely removed; other 
> times, they are replaced by closely resembling characters. Because of this 
> inconsistency, applying regex is very difficult. 
>
> On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 mdalihu...@gmail.com 
> wrote:
>
>> If some specific characters or words are always missing from the OCR 
>> result, you can handle them with regular expressions in your 
>> application. After OCR, those characters or words are replaced by the 
>> correct characters or words that you define in your application. This 
>> can fix some major problems.
>>
>> On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 desal...@gmail.com 
>> wrote:
>>
>>> The characters are getting missed, even after fine-tuning. 
>>> I never made any progress. I tried many different ways. Some  specific 
>>> characters are always missing from the OCR result.  
>>>
>>> On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 
>>> mdalihu...@gmail.com wrote:
>>>
>>>> I think EasyOCR is best for ID cards and similar images, but for 
>>>> document images such as books, Tesseract is better than EasyOCR. I 
>>>> haven't used EasyOCR myself, though; you can try it.
>>>>
>>>> I have added dictionary words, but the result is the same. 
>>>>
>>>> What kind of problem did you face when fine-tuning with a few new 
>>>> characters, as you said (*but, I failed in every possible way to 
>>>> introduce a few new characters into the database.*)?
>>>> On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 desal...@gmail.com 
>>>> wrote:
>>>>
>>>>> Yes, we are new to this. I find the instructions (the manual) very 
>>>>> hard to follow. The video you linked above was really helpful for 
>>>>> getting started. My plan at the beginning was to fine-tune the existing 
>>>>> .traineddata. But, I failed in every possible way to introduce a few new 
>>>>> characters into the database. That is why I started from scratch. 
>>>>>
>>>>> Sure, I will follow Lorenzo's suggestion: I will run more 
>>>>> iterations and see if I can improve. 
>>>>>
>>>>> Another area we need to explore is the use of dictionaries. 
>>>>> Maybe adding millions of words to the dictionary could help Tesseract. 
>>>>> I don't have millions of words, but I am looking into some corpora to 
>>>>> get more words into the dictionary. 
>>>>>
>>>>> If this all fails, EasyOCR (and probably other similar open-source 
>>>>> packages) is probably our next option to try. Sure, sharing 
>>>>> our experiences will be helpful. I will let you know if I make good 
>>>>> progress with any of these options. 
>>>>> On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 
>>>>> mdalihu...@gmail.com wrote:
>>>>>
>>>>>> How is your training going for Bengali? Mine went fairly well, but I 
>>>>>> ran into a spacing problem between words: some words are separated by 
>>>>>> spaces, but most of them are not. I think the problem is in the 
>>>>>> dataset, but I used the default training dataset from Tesseract, the 
>>>>>> one used for ben, so I am confused and have to explore more. By the 
>>>>>> way, you can try what Lorenzo Blz said. Training from scratch is 
>>>>>> actually harder than fine-tuning, so you can use different datasets to 
>>>>>> explore. If you succeed, please let me know how you did the whole 
>>>>>> process. I'm also new to this field.
>>>>>> On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 
>>>>>> desal...@gmail.com wrote:
>>>>>>
>>>>>>> How is your training going for Bengali?
>>>>>>> I have been trying to train from scratch. I made about 64,000 lines 
>>>>>>> of text (which produced about 255,000 files in the end) and ran the 
>>>>>>> training for 150,000 iterations, reaching a training error rate of 
>>>>>>> 0.51. I was hoping to get reasonable accuracy. Unfortunately, when I 
>>>>>>> run the OCR using the resulting .traineddata, the accuracy is 
>>>>>>> absolutely terrible. Do you think I made some mistakes, or is that an 
>>>>>>> expected result?
>>>>>>>
>>>>>>> On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 
>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>
>>>>>>>> Yes, he doesn't mention all the fonts, only one font. That is why he 
>>>>>>>> didn't use *MODEL_NAME* in a separate script file, I think.
>>>>>>>>
>>>>>>>> Actually, here we train on all the *tif, gt.txt, and .box files* 
>>>>>>>> that were created under *MODEL_NAME*, by which I mean the language 
>>>>>>>> code such as *eng, ben, or oro*, because when we first create the 
>>>>>>>> *tif, gt.txt, and .box files*, every file name starts with 
>>>>>>>> *MODEL_NAME*. The *MODEL_NAME* we set in the training script is what 
>>>>>>>> it uses to loop over each of those tif, gt.txt, and .box files.
>>>>>>>>
>>>>>>>> On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 
>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>
>>>>>>>>> Yes, I am familiar with the video and have set up the folder 
>>>>>>>>> structure as you did. Indeed, I have tried a number of fine-tuning 
>>>>>>>>> runs with a single font following Garcia's video. But your script is 
>>>>>>>>> much better because it supports multiple fonts. The whole improvement 
>>>>>>>>> you made is brilliant and very useful. It is all working for me. 
>>>>>>>>> The only part that I didn't understand is the trick you used in 
>>>>>>>>> your tesseract_train.py script. You see, I have been doing exactly 
>>>>>>>>> what you did except for this script. 
>>>>>>>>>
>>>>>>>>> The script seems to have the trick of sending/teaching each of 
>>>>>>>>> the fonts (iteratively) into the model. The script I have been using 
>>>>>>>>> (which I got from Garcia) doesn't mention fonts at all. 
>>>>>>>>>
>>>>>>>>> *TESSDATA_PREFIX=../tesseract/tessdata make training 
>>>>>>>>> MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000*
>>>>>>>>> Does it mean that my model doesn't train on the fonts (even if the 
>>>>>>>>> fonts have been included in the splitting process, in the other 
>>>>>>>>> script)?
>>>>>>>>> On Monday, September 11, 2023 at 10:54:08 AM UTC+3 
>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> import subprocess
>>>>>>>>>>
>>>>>>>>>> # List of font names
>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>
>>>>>>>>>> for font in font_names:
>>>>>>>>>>     # Build the tesstrain make command for this model name
>>>>>>>>>>     command = (
>>>>>>>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>>>>>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>>>>>>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>     )
>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>
>>>>>>>>>> 1. This is the training command. I saved it as 
>>>>>>>>>> 'tesseract_training.py' inside the tesstrain folder.
>>>>>>>>>> 2. The root directory is your main training folder, which contains 
>>>>>>>>>> the langdata, tesseract, and tesstrain folders. If you watch this 
>>>>>>>>>> tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8 , you will 
>>>>>>>>>> understand the folder structure better. I only added 
>>>>>>>>>> tesseract_training.py to the tesstrain folder for training; the 
>>>>>>>>>> FontList.py file sits in the main folder, alongside langdata, 
>>>>>>>>>> tesseract, tesstrain, and split_training_text.py.
>>>>>>>>>> 3. First of all, put all the fonts in your Linux fonts folder, 
>>>>>>>>>> /usr/share/fonts/, then run: sudo apt update, then sudo 
>>>>>>>>>> fc-cache -fv
>>>>>>>>>>
>>>>>>>>>> After that, add the exact font names to the FontList.py file, as I 
>>>>>>>>>> did.
>>>>>>>>>> I have attached two pictures of my folder structure. The first is 
>>>>>>>>>> the main structure and the second is the collapsed tesstrain folder.
>>>>>>>>>>
>>>>>>>>>> [image: Screenshot 2023-09-11 134947.png][image: Screenshot 
>>>>>>>>>> 2023-09-11 135014.png] 
>>>>>>>>>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 
>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you so much for putting out these brilliant scripts. They 
>>>>>>>>>>> make the process  much more efficient.
>>>>>>>>>>>
>>>>>>>>>>> I have one more question on the other script that you use to 
>>>>>>>>>>> train. 
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> import subprocess
>>>>>>>>>>>
>>>>>>>>>>> # List of font names
>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>
>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>     command = (
>>>>>>>>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>>>>>>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>>>>>>>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>     )
>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>
>>>>>>>>>>> Do you have the font names listed in a file in the same/root 
>>>>>>>>>>> directory?
>>>>>>>>>>> How do you set up the font names in the file, if you 
>>>>>>>>>>> don't mind sharing it?
>>>>>>>>>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3 
>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>
>>>>>>>>>>>> You can use the new script below; it's better than the previous 
>>>>>>>>>>>> two scripts. It creates the *tif, gt.txt, and .box files* for 
>>>>>>>>>>>> multiple fonts, and it supports resuming: if VS Code closes or 
>>>>>>>>>>>> anything else interrupts the creation of the *tif, gt.txt, and 
>>>>>>>>>>>> .box files*, you can restart from the line where it stopped.
>>>>>>>>>>>>
>>>>>>>>>>>> Script for creating the *tif, gt.txt, and .box files*:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> import os
>>>>>>>>>>>> import random
>>>>>>>>>>>> import pathlib
>>>>>>>>>>>> import subprocess
>>>>>>>>>>>> import argparse
>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>
>>>>>>>>>>>> def create_training_data(training_text_file, font_list, 
>>>>>>>>>>>> output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>         lines = input_file.readlines()
>>>>>>>>>>>>
>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>
>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>         start_line = 0
>>>>>>>>>>>>
>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>         end_line = len(lines) - 1
>>>>>>>>>>>>
>>>>>>>>>>>>     for font_name in font_list.fonts:
>>>>>>>>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>>>>>>>>             line = lines[line_index].strip()
>>>>>>>>>>>>
>>>>>>>>>>>>             training_text_file_name = pathlib.Path(
>>>>>>>>>>>> training_text_file).stem
>>>>>>>>>>>>
>>>>>>>>>>>>             line_serial = f"{line_index:d}"
>>>>>>>>>>>>
>>>>>>>>>>>>             line_gt_text = os.path.join(
>>>>>>>>>>>>                 output_directory,
>>>>>>>>>>>>                 f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>
>>>>>>>>>>>>             file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>                 f'--font={font_name}',
>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>                 '--ysize=330',
>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>>>>>>>>             ])
>>>>>>>>>>>>
>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>     parser.add_argument('--start', type=int,
>>>>>>>>>>>>                         help='Starting line count (inclusive)')
>>>>>>>>>>>>     parser.add_argument('--end', type=int,
>>>>>>>>>>>>                         help='Ending line count (inclusive)')
>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>
>>>>>>>>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>>>>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>>>>>>>>
>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>
>>>>>>>>>>>>     create_training_data(training_text_file, font_list, 
>>>>>>>>>>>> output_directory, args.start, args.end)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Then create a file called "FontList.py" in the root directory and 
>>>>>>>>>>>> paste this into it:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> class FontList:
>>>>>>>>>>>>     def __init__(self):
>>>>>>>>>>>>         # Font names must match the installed fonts exactly
>>>>>>>>>>>>         self.fonts = [
>>>>>>>>>>>>             "Gerlick",
>>>>>>>>>>>>             "Sagar Medium",
>>>>>>>>>>>>             "Ekushey Lohit Normal",
>>>>>>>>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>>>>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>>>>>>>>             "Ador Orjoma Unicode",
>>>>>>>>>>>>         ]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Then import it in the script above.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *To resume from a checkpoint, run:*
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> sudo python3 split_training_text.py --start 0  --end 11
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Change --start 0 --end 11 to the line range you need.
>>>>>>>>>>>>
>>>>>>>>>>>> *The training checkpoint works as you already know.*
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 
>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi mhalidu, 
>>>>>>>>>>>>> the script you posted here seems much more extensive than the one 
>>>>>>>>>>>>> you posted before: 
>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>>>>>>>>> .
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have been using your earlier script. It is magical. How is 
>>>>>>>>>>>>> this one different from the earlier one?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for posting these scripts, by the way. They have saved 
>>>>>>>>>>>>> me countless hours by running multiple fonts in one sweep. I was 
>>>>>>>>>>>>> not able to find any instructions on how to train for multiple 
>>>>>>>>>>>>> fonts. The official manual is also unclear. Your script helped me 
>>>>>>>>>>>>> to get started. 
>>>>>>>>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 
>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> OK, I will try as you said.
>>>>>>>>>>>>>> One more thing: what role do the lines of the training text 
>>>>>>>>>>>>>> play? I have seen that the Bengali text has long lines of words, 
>>>>>>>>>>>>>> so I want to know how many words or characters per line would be 
>>>>>>>>>>>>>> the better choice for training. And should '--xsize=3600', 
>>>>>>>>>>>>>> '--ysize=350' be set according to the length of the lines?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Include the default fonts also in your fine-tuning list of 
>>>>>>>>>>>>>>> fonts and see if that helps.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <
>>>>>>>>>>>>>>> mdalihu...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have trained some new fonts by fine-tuning for the 
>>>>>>>>>>>>>>>> Bengali language in Tesseract 5, using the official training 
>>>>>>>>>>>>>>>> text, tessdata_best, and so on. Everything works, but the 
>>>>>>>>>>>>>>>> problem is that the default font, which was already trained, 
>>>>>>>>>>>>>>>> is no longer converted to text as well as before, while my new 
>>>>>>>>>>>>>>>> fonts work well. I don't understand why this is happening. I 
>>>>>>>>>>>>>>>> am sharing my code so you can see what is going on.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *codes  for creating tif, gt.txt, .box files:*
>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> def read_line_count():
>>>>>>>>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>>>>>>>>>             return int(file.read())
>>>>>>>>>>>>>>>>     return 0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list, 
>>>>>>>>>>>>>>>> output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>     with open(training_text_file, 'r') as input_file:
>>>>>>>>>>>>>>>>         for line in input_file.readlines():
>>>>>>>>>>>>>>>>             lines.append(line.strip())
>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>         # Set the starting line_count from the file
>>>>>>>>>>>>>>>>         line_count = read_line_count()
>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>         line_count = start_line
>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>         # Set the ending line_count
>>>>>>>>>>>>>>>>         end_line_count = len(lines) - 1
>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>     # Iterate through all the fonts in the font_list
>>>>>>>>>>>>>>>>     for font in font_list.fonts:
>>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>>         for line in lines:
>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.Path(
>>>>>>>>>>>>>>>> training_text_file).stem
>>>>>>>>>>>>>>>>             
>>>>>>>>>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>>>>>>>>>             
>>>>>>>>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(
>>>>>>>>>>>>>>>>                 output_directory,
>>>>>>>>>>>>>>>>                 f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w') as output_file:
>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>             
>>>>>>>>>>>>>>>>             # Image filename
>>>>>>>>>>>>>>>>             # Unique filename for each font
>>>>>>>>>>>>>>>>             file_base_name = f'ben_{line_serial}'
>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>                 f'--font={font}',
>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>                 '--ysize=350',
>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>             
>>>>>>>>>>>>>>>>             line_count += 1
>>>>>>>>>>>>>>>>             font_serial += 1
>>>>>>>>>>>>>>>>         
>>>>>>>>>>>>>>>>         # Reset font_serial for the next font iteration
>>>>>>>>>>>>>>>>         font_serial = 1
>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>     # Update the line_count in the file
>>>>>>>>>>>>>>>>     write_line_count(line_count)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int,
>>>>>>>>>>>>>>>>                         help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int,
>>>>>>>>>>>>>>>>                         help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>>>>>>>>     
>>>>>>>>>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>      
>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, 
>>>>>>>>>>>>>>>> output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *and for training code:*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>     command = (
>>>>>>>>>>>>>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>>>>>>>>>>>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>>>>>>>>>>>>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 "
>>>>>>>>>>>>>>>>         "LANG_TYPE=Indic"
>>>>>>>>>>>>>>>>     )
>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any suggestions for identifying the problem?
>>>>>>>>>>>>>>>> Thanks, everyone.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>> You received this message because you are subscribed to the 
>>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails 
>>>>>>>>>>>>>>>> from it, send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/5ffbaca4-fcfd-4e8a-8cd6-8709aee142a3n%40googlegroups.com.
