Re: [tesseract-ocr] accuracy problem after trained in fine-tune

Des Bw Thu, 19 Oct 2023 10:32:13 -0700

Hi Ali, 
How is your training going?
Did you get good results when training from scratch?


On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote:

> Yes, I saw that two months ago when I started learning OCR. It was very 
> helpful at the beginning.
> On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 desal...@gmail.com 
> wrote:
>
>> Just saw this paper: https://osf.io/b8h7q
>>
>> On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 mdalihu...@gmail.com 
>> wrote:
>>
>>> I will try some changes. thx
>>>
>>> On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 elvi...@gmail.com 
>>> wrote:
>>>
>>>> I also faced that issue on Windows. Apparently, the issue is 
>>>> related to Unicode. You can try your luck by changing "r" to "utf8" in 
>>>> the script.
>>>> I ended up installing Ubuntu because I was getting too many errors on 
>>>> Windows.
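
One reading of the suggestion above is that the text files should be opened with an explicit UTF-8 encoding instead of the platform default, which on Windows is often a legacy code page. A minimal sketch under that assumption; the helper names are made up for illustration, and this may not be the only cause of the "Can't encode transcription" error:

```python
def write_text_utf8(path, text):
    # Pass the encoding explicitly instead of relying on the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_lines_utf8(path):
    # Read the file back as UTF-8 and strip the trailing newline of each line.
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f]
```

In the split script this would amount to adding encoding="utf-8" to both open() calls, the one that reads the training text and the one that writes each .gt.txt file.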
>>>>
>>>> On Thu, Sep 14, 2023, 9:33 AM Ali hussain <mdalihu...@gmail.com> wrote:
>>>>
>>>>> Did you face the "Can't encode transcription" error? If so, how did 
>>>>> you solve it?
>>>>>
>>>>> On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 elvi...@gmail.com 
>>>>> wrote:
>>>>>
>>>>>> I was using my own text
>>>>>>
>>>>>> On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihu...@gmail.com> 
>>>>>> wrote:
>>>>>>
>>>>>>> Are you training on Tesseract's default text data or on text data 
>>>>>>> you collected yourself?
>>>>>>> On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 
>>>>>>> desal...@gmail.com wrote:
>>>>>>>
>>>>>>>> I have now reached 200,000 iterations, and the error rate is stuck 
>>>>>>>> at 0.46. The result is absolute trash: nowhere close to the 
>>>>>>>> default/Ray's training. 
>>>>>>>>
>>>>>>>> On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 
>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> After Tesseract recognizes the text in the images, you can apply 
>>>>>>>>> regex to replace the wrong words with the correct words.
>>>>>>>>> I'm not familiar with PaddleOCR or ScanTailor.
>>>>>>>>>
>>>>>>>>> On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 
>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>
>>>>>>>>>> At what stage are you doing the regex replacement?
>>>>>>>>>> My process has been: Scan (tif)--> ScanTailor --> Tesseract --> 
>>>>>>>>>> pdf
>>>>>>>>>>
>>>>>>>>>> > I think EasyOCR is best for ID cards and similar images, but 
>>>>>>>>>> > for document images such as books, Tesseract is better than 
>>>>>>>>>> > EasyOCR.
>>>>>>>>>>
>>>>>>>>>> How about PaddleOCR? Are you familiar with it?
>>>>>>>>>>
>>>>>>>>>> On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 
>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>
>>>>>>>>>>> I know what you mean, but in some cases it helps me. I have 
>>>>>>>>>>> found that some specific characters and words are never 
>>>>>>>>>>> recognized correctly by Tesseract. So I use these regex 
>>>>>>>>>>> replacements to correct those characters and words when they 
>>>>>>>>>>> come out wrong.
>>>>>>>>>>>
>>>>>>>>>>> Here is what I have done: 
>>>>>>>>>>>
>>>>>>>>>>>    " ী": "ী",
>>>>>>>>>>>     " ্": " ",
>>>>>>>>>>>     " ে": " ",
>>>>>>>>>>>     জ্া: "জা",
>>>>>>>>>>>     "  ": " ",
>>>>>>>>>>>     "   ": " ",
>>>>>>>>>>>     "    ": " ",
>>>>>>>>>>>     "্প": " ",
>>>>>>>>>>>     " য": "র্য",
>>>>>>>>>>>     য: "য",
>>>>>>>>>>>     " া": "া",
>>>>>>>>>>>     আা: "আ",
>>>>>>>>>>>     ম্ি: "মি",
>>>>>>>>>>>     স্ু: "সু",
>>>>>>>>>>>     "হূ ": "হূ",
>>>>>>>>>>>     " ণ": "ণ",
>>>>>>>>>>>     র্্: "র",
>>>>>>>>>>>     "চিন্ত ": "চিন্তা ",
>>>>>>>>>>>     ন্া: "না",
>>>>>>>>>>>     "সম ূর্ন": "সম্পূর্ণ",
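
A table like the one above could be applied after OCR along the following lines. This is only a sketch: the two Bengali entries are copied from the table as an example subset, the function name is hypothetical, and longer keys are tried first so that a short key does not pre-empt a longer one that starts with the same characters.

```python
import re

# Example subset of a correction table: wrong OCR output -> intended text.
CORRECTIONS = {
    "আা": "আ",
    "চিন্ত ": "চিন্তা ",
}


def postprocess(text, corrections=CORRECTIONS):
    """Apply fixed-string corrections, then collapse runs of spaces."""
    if corrections:
        # Longest keys first so a longer wrong form wins over its own prefix.
        keys = sorted(corrections, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(k) for k in keys))
        text = pattern.sub(lambda m: corrections[m.group(0)], text)
    # Replace two or more consecutive spaces with a single space.
    return re.sub(r" {2,}", " ", text)
```

Doing all substitutions in a single pass means a replacement is never re-processed by a later rule, which keeps the result independent of the order of the table.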
>>>>>>>>>>> On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 
>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The problem with regex is that Tesseract is not consistent in 
>>>>>>>>>>>> its replacements. 
>>>>>>>>>>>> Suppose the original English training data didn't contain 
>>>>>>>>>>>> the letter /u/. What does Tesseract do when it encounters /u/ 
>>>>>>>>>>>> in actual processing?
>>>>>>>>>>>> In some cases, it replaces it with closely similar letters such 
>>>>>>>>>>>> as /v/ and /w/. In other cases, it removes it completely. That 
>>>>>>>>>>>> is what is happening in my case. Those characters are sometimes 
>>>>>>>>>>>> completely removed; other times, they are replaced by closely 
>>>>>>>>>>>> resembling characters. Because of this inconsistency, applying 
>>>>>>>>>>>> regex is very difficult. 
>>>>>>>>>>>>
>>>>>>>>>>>> On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 
>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If some specific characters or words are always missing 
>>>>>>>>>>>>> from the OCR result, you can handle them in your application 
>>>>>>>>>>>>> with regular expressions. After OCR, those specific characters 
>>>>>>>>>>>>> or words are replaced by the correct characters or words that 
>>>>>>>>>>>>> you defined in your application. This can help with some of 
>>>>>>>>>>>>> the major problems.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 
>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The characters are still getting missed, even after 
>>>>>>>>>>>>>> fine-tuning. I never made any progress. I tried many different 
>>>>>>>>>>>>>> ways. Some specific characters are always missing from the OCR 
>>>>>>>>>>>>>> result.  
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 
>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think EasyOCR is best for ID cards and similar images, 
>>>>>>>>>>>>>>> but for document images such as books, Tesseract is better 
>>>>>>>>>>>>>>> than EasyOCR. I haven't used EasyOCR myself, though. You can 
>>>>>>>>>>>>>>> try it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have added dictionary words, but the result is the 
>>>>>>>>>>>>>>> same. 
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What kind of problem did you face when fine-tuning for a few 
>>>>>>>>>>>>>>> new characters, as you said (*but, I failed in every 
>>>>>>>>>>>>>>> possible way to introduce a few new characters into the 
>>>>>>>>>>>>>>> database.*)?
>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 
>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, we are new to this. I find the instructions (the 
>>>>>>>>>>>>>>>> manual) very hard to follow. The video you linked above was 
>>>>>>>>>>>>>>>> really helpful for getting started. My plan at the beginning 
>>>>>>>>>>>>>>>> was to fine-tune the existing .traineddata. But I failed in 
>>>>>>>>>>>>>>>> every possible way to introduce a few new characters into 
>>>>>>>>>>>>>>>> the database. That is why I started from scratch. 
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sure, I will follow Lorenzo's suggestion: I will run more 
>>>>>>>>>>>>>>>> iterations and see if I can improve. 
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Another area we need to explore is the use of dictionaries. 
>>>>>>>>>>>>>>>> Maybe adding millions of words to the dictionary could help 
>>>>>>>>>>>>>>>> Tesseract. I don't have millions of words, but I am looking 
>>>>>>>>>>>>>>>> into some corpora to get more words into the dictionary. 
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If all of this fails, EasyOCR (and probably other similar 
>>>>>>>>>>>>>>>> open-source packages) is probably the next option for us to 
>>>>>>>>>>>>>>>> try. Sure, sharing our experiences will be helpful. I will 
>>>>>>>>>>>>>>>> let you know if I make good progress with any of these 
>>>>>>>>>>>>>>>> options. 
>>>>>>>>>>>>>>>> On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 
>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> How is your training going for Bengali?  It was nearly 
>>>>>>>>>>>>>>>>> good, but I faced spacing problems between words: some 
>>>>>>>>>>>>>>>>> words are separated by spaces, but most of them have no 
>>>>>>>>>>>>>>>>> space. I think the problem is in the dataset, but I use the 
>>>>>>>>>>>>>>>>> default training dataset from Tesseract, the one used for 
>>>>>>>>>>>>>>>>> ben. That is why I am confused, so I have to explore more. 
>>>>>>>>>>>>>>>>> By the way, you can try what Lorenzo Blz said. Actually, 
>>>>>>>>>>>>>>>>> training from scratch is harder than fine-tuning, so you 
>>>>>>>>>>>>>>>>> can use different datasets to explore. If you succeed, 
>>>>>>>>>>>>>>>>> please let me know how you did the whole process. I'm also 
>>>>>>>>>>>>>>>>> new to this field.
>>>>>>>>>>>>>>>>> On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 
>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> How is your training going for Bengali?
>>>>>>>>>>>>>>>>>> I have been trying to train from scratch. I made about 
>>>>>>>>>>>>>>>>>> 64,000 lines of text (which produced about 255,000 files 
>>>>>>>>>>>>>>>>>> in the end) and ran the training for 150,000 iterations, 
>>>>>>>>>>>>>>>>>> getting a 0.51 training error rate. I was hoping to get 
>>>>>>>>>>>>>>>>>> reasonable accuracy. Unfortunately, when I run the OCR 
>>>>>>>>>>>>>>>>>> using the resulting .traineddata, the accuracy is 
>>>>>>>>>>>>>>>>>> absolutely terrible. Do you think I made some mistakes, or 
>>>>>>>>>>>>>>>>>> is that an expected result?
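
To put a number on "absolutely terrible" independently of the figure reported during training, one option is to compare the OCR output of a few held-out lines with their ground truth. A minimal sketch of a character error rate based on Levenshtein distance; the function names are my own, and it counts Unicode code points rather than visual characters, which matters for scripts with combining marks such as Bengali:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, char_error_rate(ground_truth_line, ocr_line) gives 0.0 for a perfect line and can exceed 1.0 when the OCR output is much longer than the reference.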
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 
>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, he doesn't mention all fonts but only one font.  
>>>>>>>>>>>>>>>>>>> That is why he didn't use *MODEL_NAME* in a separate 
>>>>>>>>>>>>>>>>>>> script file, I think.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Actually, here we train on all the *tif, gt.txt, and .box 
>>>>>>>>>>>>>>>>>>> files* that were created for *MODEL_NAME* (I mean the 
>>>>>>>>>>>>>>>>>>> language code, such as *eng, ben, or oro*), because when 
>>>>>>>>>>>>>>>>>>> we first create the *tif, gt.txt, and .box files*, every 
>>>>>>>>>>>>>>>>>>> file name starts with *MODEL_NAME*. This is the 
>>>>>>>>>>>>>>>>>>> *MODEL_NAME* we set in the training script, which loops 
>>>>>>>>>>>>>>>>>>> over every tif, gt.txt, and .box file created for that 
>>>>>>>>>>>>>>>>>>> *MODEL_NAME*.
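
As far as I understand it, the fonts reach training only through the files in the ground-truth folder for the given MODEL_NAME, so one way to check that every font is represented is to count the .gt.txt files per font. A rough sketch, assuming the file naming used by the newer split script quoted further down (stem_serial_font.gt.txt, with a stem that contains no underscore); the function name is hypothetical:

```python
import os
from collections import Counter


def count_ground_truth_by_font(ground_truth_dir):
    """Count .gt.txt files per font, based on the font suffix in the name."""
    counts = Counter()
    for name in os.listdir(ground_truth_dir):
        if not name.endswith(".gt.txt"):
            continue
        base = name[: -len(".gt.txt")]
        # Expected pieces: text-file stem, line serial, font name.
        parts = base.split("_", 2)
        font = parts[2] if len(parts) == 3 else "(no font in name)"
        counts[font] += 1
    return counts
```

Calling count_ground_truth_by_font("tesstrain/data/eng-ground-truth") should then list each font with roughly the same number of lines; a font that is missing from the result was never rendered.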
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 
>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Yes, I am familiar with the video and have set up the 
>>>>>>>>>>>>>>>>>>>> folder structure as you did. Indeed, I have tried 
>>>>>>>>>>>>>>>>>>>> fine-tuning a number of times with a single font, 
>>>>>>>>>>>>>>>>>>>> following Garcia's video. But your script is much 
>>>>>>>>>>>>>>>>>>>> better because it supports multiple fonts. The whole 
>>>>>>>>>>>>>>>>>>>> improvement you made is brilliant and very useful. It is 
>>>>>>>>>>>>>>>>>>>> all working for me. 
>>>>>>>>>>>>>>>>>>>> The only part that I didn't understand is the trick you 
>>>>>>>>>>>>>>>>>>>> used in your tesseract_train.py script. You see, I have 
>>>>>>>>>>>>>>>>>>>> been doing exactly what you did except for this script. 
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The script seems to have the trick of sending/teaching 
>>>>>>>>>>>>>>>>>>>> each of the fonts (iteratively) to the model. The script 
>>>>>>>>>>>>>>>>>>>> I have been using (which I got from Garcia) doesn't 
>>>>>>>>>>>>>>>>>>>> mention fonts at all. 
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> *TESSDATA_PREFIX=../tesseract/tessdata make training 
>>>>>>>>>>>>>>>>>>>> MODEL_NAME=oro TESSDATA=../tesseract/tessdata 
>>>>>>>>>>>>>>>>>>>> MAX_ITERATIONS=10000*
>>>>>>>>>>>>>>>>>>>> Does it mean that my model doesn't train on the fonts 
>>>>>>>>>>>>>>>>>>>> (even if the fonts have been included in the splitting 
>>>>>>>>>>>>>>>>>>>> process, in the other script)?
>>>>>>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 10:54:08 AM UTC+3 
>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>     command = (
>>>>>>>>>>>>>>>>>>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>>>>>>>>>>>>>>>>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>>>>>>>>>>>>>>>>>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>>>>>>>     )
>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1. This is the training command. I saved it as 
>>>>>>>>>>>>>>>>>>>>> 'tesseract_training.py' inside the tesstrain folder.
>>>>>>>>>>>>>>>>>>>>> 2. The root directory means your main training folder, 
>>>>>>>>>>>>>>>>>>>>> which contains the langdata, tesseract, and tesstrain 
>>>>>>>>>>>>>>>>>>>>> folders. If you watch this tutorial, 
>>>>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=KE4xEzFGSU8 , you 
>>>>>>>>>>>>>>>>>>>>> will understand the folder structure better. The only 
>>>>>>>>>>>>>>>>>>>>> things I added are tesseract_training.py in the 
>>>>>>>>>>>>>>>>>>>>> tesstrain folder for training, and the FontList.py file 
>>>>>>>>>>>>>>>>>>>>> in the main path, alongside langdata, tesseract, 
>>>>>>>>>>>>>>>>>>>>> tesstrain, and split_training_text.py.
>>>>>>>>>>>>>>>>>>>>> 3. First of all, you have to put all the fonts in your 
>>>>>>>>>>>>>>>>>>>>> Linux fonts folder, /usr/share/fonts/, then run 
>>>>>>>>>>>>>>>>>>>>> sudo apt update and then sudo fc-cache -fv
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> After that, you have to add the exact font names to the 
>>>>>>>>>>>>>>>>>>>>> FontList.py file, as I did.
>>>>>>>>>>>>>>>>>>>>> I have attached two pictures of my folder structure. The 
>>>>>>>>>>>>>>>>>>>>> first shows the main structure and the second shows the 
>>>>>>>>>>>>>>>>>>>>> tesstrain folder.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [image: Screenshot 2023-09-11 134947.png][image: 
>>>>>>>>>>>>>>>>>>>>> Screenshot 2023-09-11 135014.png] 
>>>>>>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 
>>>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thank you so much for putting out these brilliant 
>>>>>>>>>>>>>>>>>>>>>> scripts. They make the process  much more efficient.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I have one more question on the other script that you 
>>>>>>>>>>>>>>>>>>>>>> use to train. 
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>     command = (
>>>>>>>>>>>>>>>>>>>>>>         "TESSDATA_PREFIX=../tesseract/tessdata make training "
>>>>>>>>>>>>>>>>>>>>>>         f"MODEL_NAME={font} START_MODEL=ben "
>>>>>>>>>>>>>>>>>>>>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
>>>>>>>>>>>>>>>>>>>>>>     )
>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Do you have the names of the fonts listed in a file in 
>>>>>>>>>>>>>>>>>>>>>> the same/root directory?
>>>>>>>>>>>>>>>>>>>>>> How do you set up the names of the fonts in that file, 
>>>>>>>>>>>>>>>>>>>>>> if you don't mind sharing it?
>>>>>>>>>>>>>>>>>>>>>> On Monday, September 11, 2023 at 4:27:27 AM UTC+3 
>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> You can use the new script below. It's better than 
>>>>>>>>>>>>>>>>>>>>>>> the previous two scripts. You can create the *tif, 
>>>>>>>>>>>>>>>>>>>>>>> gt.txt, and .box files* for multiple fonts, and you 
>>>>>>>>>>>>>>>>>>>>>>> can also use a checkpoint: if VS Code closes or 
>>>>>>>>>>>>>>>>>>>>>>> anything else interrupts the creation of the *tif, 
>>>>>>>>>>>>>>>>>>>>>>> gt.txt, and .box files*, you can resume from the 
>>>>>>>>>>>>>>>>>>>>>>> line where it stopped.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Script for creating the *tif, gt.txt, and .box files*:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>>>>>>>>     # Read the text as UTF-8 regardless of the platform default
>>>>>>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r', encoding='utf-8') as input_file:
>>>>>>>>>>>>>>>>>>>>>>>         lines = input_file.readlines()
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>     os.makedirs(output_directory, exist_ok=True)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>>>>>>         start_line = 0
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>     # Clamp the end so that --end cannot run past the last line
>>>>>>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>>>>>>         end_line = len(lines) - 1
>>>>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>>>>         end_line = min(end_line, len(lines) - 1)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>     training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>     for font_name in font_list.fonts:
>>>>>>>>>>>>>>>>>>>>>>>         for line_index in range(start_line, end_line + 1):
>>>>>>>>>>>>>>>>>>>>>>>             line = lines[line_index].strip()
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>             # One base name per (line, font) pair
>>>>>>>>>>>>>>>>>>>>>>>             file_base_name = f'{training_text_file_name}_{line_index:d}_{font_name.replace(" ", "_")}'
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>             # Ground-truth text file for this line and font
>>>>>>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{file_base_name}.gt.txt')
>>>>>>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w', encoding='utf-8') as output_file:
>>>>>>>>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>             # Render the line with text2image (writes the .tif and .box files)
>>>>>>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>>>>>>                 f'--font={font_name}',
>>>>>>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>>>>>>                 '--ysize=330',
>>>>>>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/eng.unicharset',
>>>>>>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/eng.training_text'
>>>>>>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/eng-ground-truth'
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Then create a file called "FontList.py" in the root 
>>>>>>>>>>>>>>>>>>>>>>> directory and paste this into it:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> class FontList:
>>>>>>>>>>>>>>>>>>>>>>>     def __init__(self):
>>>>>>>>>>>>>>>>>>>>>>>         # Exact font names as reported by the system
>>>>>>>>>>>>>>>>>>>>>>>         self.fonts = [
>>>>>>>>>>>>>>>>>>>>>>>             "Gerlick",
>>>>>>>>>>>>>>>>>>>>>>>             "Sagar Medium",
>>>>>>>>>>>>>>>>>>>>>>>             "Ekushey Lohit Normal",
>>>>>>>>>>>>>>>>>>>>>>>             "Charukola Round Head Regular, weight=433",
>>>>>>>>>>>>>>>>>>>>>>>             "Charukola Round Head Bold, weight=443",
>>>>>>>>>>>>>>>>>>>>>>>             "Ador Orjoma Unicode",
>>>>>>>>>>>>>>>>>>>>>>>         ]
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> It is then imported by the script above 
>>>>>>>>>>>>>>>>>>>>>>> (from FontList import FontList).
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> *Command with a checkpoint:*
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> sudo python3 split_training_text.py --start 0 --end 11
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Change the checkpoint range (--start 0 --end 11) to 
>>>>>>>>>>>>>>>>>>>>>>> suit your needs.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> *And you already know about the training checkpoint.*
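
For a large training text, the --start/--end pairs for successive runs can be computed rather than typed by hand. A small sketch, assuming the inclusive range semantics of the script above; the function name and the batch size are made up for illustration:

```python
def batch_ranges(total_lines, batch_size):
    """Yield inclusive (start, end) pairs covering lines 0..total_lines-1."""
    for start in range(0, total_lines, batch_size):
        yield start, min(start + batch_size, total_lines) - 1


# Print one command per batch for a 30-line text, 12 lines at a time.
for start, end in batch_ranges(30, 12):
    print(f"python3 split_training_text.py --start {start} --end {end}")
```

If a run is interrupted, only the batch that was in progress needs to be repeated.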
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Monday, 11 September, 2023 at 1:22:34 am UTC+6 
>>>>>>>>>>>>>>>>>>>>>>> desal...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi mhalidu, 
>>>>>>>>>>>>>>>>>>>>>>>> The script you posted here seems much more 
>>>>>>>>>>>>>>>>>>>>>>>> extensive than the one you posted before: 
>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com
>>>>>>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I have been using your earlier script. It is 
>>>>>>>>>>>>>>>>>>>>>>>> magical. How is this one different from the 
>>>>>>>>>>>>>>>>>>>>>>>> earlier one?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thank you for posting these scripts, by the way. 
>>>>>>>>>>>>>>>>>>>>>>>> They have saved me countless hours by running 
>>>>>>>>>>>>>>>>>>>>>>>> multiple fonts in one sweep. I was not able to find 
>>>>>>>>>>>>>>>>>>>>>>>> any instructions on how to train for multiple fonts. 
>>>>>>>>>>>>>>>>>>>>>>>> The official manual is also unclear. Your script 
>>>>>>>>>>>>>>>>>>>>>>>> helped me get started. 
>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 
>>>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> OK, I will try what you said.
>>>>>>>>>>>>>>>>>>>>>>>>> One more thing: what role does the length of the 
>>>>>>>>>>>>>>>>>>>>>>>>> training text lines play? I have seen that the 
>>>>>>>>>>>>>>>>>>>>>>>>> Bengali text has long lines of words, so I want to 
>>>>>>>>>>>>>>>>>>>>>>>>> know how many words or characters per line would be 
>>>>>>>>>>>>>>>>>>>>>>>>> the better choice for training. And should 
>>>>>>>>>>>>>>>>>>>>>>>>> '--xsize=3600','--ysize=350' be set according 
>>>>>>>>>>>>>>>>>>>>>>>>> to the number of words per line?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 
>>>>>>>>>>>>>>>>>>>>>>>>> shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Include the default fonts also in your 
>>>>>>>>>>>>>>>>>>>>>>>>>> fine-tuning list of fonts and see if that helps.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 9, 2023, 2:27 PM Ali hussain <
>>>>>>>>>>>>>>>>>>>>>>>>>> mdalihu...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I have trained some new fonts with the fine-tuning 
>>>>>>>>>>>>>>>>>>>>>>>>>>> method for the Bengali language in Tesseract 5, and 
>>>>>>>>>>>>>>>>>>>>>>>>>>> I have used the official training text, 
>>>>>>>>>>>>>>>>>>>>>>>>>>> tessdata_best, and everything else. Everything is 
>>>>>>>>>>>>>>>>>>>>>>>>>>> good, but the problem is that the default font, 
>>>>>>>>>>>>>>>>>>>>>>>>>>> which was already trained before, no longer 
>>>>>>>>>>>>>>>>>>>>>>>>>>> converts text as well as it did previously, while 
>>>>>>>>>>>>>>>>>>>>>>>>>>> my new fonts work well. I don't understand why this 
>>>>>>>>>>>>>>>>>>>>>>>>>>> is happening. I am sharing my code so you can see 
>>>>>>>>>>>>>>>>>>>>>>>>>>> what is going on.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> *Code for creating the tif, gt.txt, and .box files:*
>>>>>>>>>>>>>>>>>>>>>>>>>>> import os
>>>>>>>>>>>>>>>>>>>>>>>>>>> import random
>>>>>>>>>>>>>>>>>>>>>>>>>>> import pathlib
>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>> import argparse
>>>>>>>>>>>>>>>>>>>>>>>>>>> from FontList import FontList
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> def read_line_count():
>>>>>>>>>>>>>>>>>>>>>>>>>>>     if os.path.exists('line_count.txt'):
>>>>>>>>>>>>>>>>>>>>>>>>>>>         with open('line_count.txt', 'r') as file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>             return int(file.read())
>>>>>>>>>>>>>>>>>>>>>>>>>>>     return 0
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> def write_line_count(line_count):
>>>>>>>>>>>>>>>>>>>>>>>>>>>     with open('line_count.txt', 'w') as file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>         file.write(str(line_count))
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
>>>>>>>>>>>>>>>>>>>>>>>>>>>     lines = []
>>>>>>>>>>>>>>>>>>>>>>>>>>>     with open(training_text_file, 'r', encoding='utf-8') as input_file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>         for line in input_file.readlines():
>>>>>>>>>>>>>>>>>>>>>>>>>>>             lines.append(line.strip())
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>     if not os.path.exists(output_directory):
>>>>>>>>>>>>>>>>>>>>>>>>>>>         os.mkdir(output_directory)
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>     random.shuffle(lines)
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>     if start_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>>>         line_count = read_line_count()  # Set the starting line_count from the file
>>>>>>>>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>>>>>>>>         line_count = start_line
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>     # NOTE: end_line_count is computed but never used below, so --end has no effect here
>>>>>>>>>>>>>>>>>>>>>>>>>>>     if end_line is None:
>>>>>>>>>>>>>>>>>>>>>>>>>>>         end_line_count = len(lines) - 1  # Set the ending line_count
>>>>>>>>>>>>>>>>>>>>>>>>>>>     else:
>>>>>>>>>>>>>>>>>>>>>>>>>>>         end_line_count = min(end_line, len(lines) - 1)
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>     for font in font_list.fonts:  # Iterate through all the fonts in the font_list
>>>>>>>>>>>>>>>>>>>>>>>>>>>         for line in lines:
>>>>>>>>>>>>>>>>>>>>>>>>>>>             training_text_file_name = pathlib.Path(training_text_file).stem
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>             # Generate a unique serial number for each line
>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_serial = f"{line_count:d}"
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>             # GT (Ground Truth) text filename
>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
>>>>>>>>>>>>>>>>>>>>>>>>>>>             with open(line_gt_text, 'w', encoding='utf-8') as output_file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 output_file.writelines([line])
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>             # Image filename
>>>>>>>>>>>>>>>>>>>>>>>>>>>             file_base_name = f'ben_{line_serial}'  # Unique filename for each font
>>>>>>>>>>>>>>>>>>>>>>>>>>>             subprocess.run([
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 'text2image',
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--font={font}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--text={line_gt_text}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 f'--outputbase={output_directory}/{file_base_name}',
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--max_pages=1',
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--strip_unrenderable_words',
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--leading=36',
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--xsize=3600',
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--ysize=350',
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--char_spacing=1.0',
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--exposure=0',
>>>>>>>>>>>>>>>>>>>>>>>>>>>                 '--unicharset_file=langdata/ben.unicharset',
>>>>>>>>>>>>>>>>>>>>>>>>>>>             ])
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>             line_count += 1
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>     write_line_count(line_count)  # Update the line_count in the file
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> if __name__ == "__main__":
>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser = argparse.ArgumentParser()
>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>>>     parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
>>>>>>>>>>>>>>>>>>>>>>>>>>>     args = parser.parse_args()
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>     training_text_file = 'langdata/ben.training_text'
>>>>>>>>>>>>>>>>>>>>>>>>>>>     output_directory = 'tesstrain/data/ben-ground-truth'
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>     # Create an instance of the FontList class
>>>>>>>>>>>>>>>>>>>>>>>>>>>     font_list = FontList()
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>     create_training_data(training_text_file, font_list, output_directory, args.start, args.end)
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> *And the training code:*
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> import subprocess
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> # List of font names
>>>>>>>>>>>>>>>>>>>>>>>>>>> font_names = ['ben']
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> for font in font_names:
>>>>>>>>>>>>>>>>>>>>>>>>>>>     command = (
>>>>>>>>>>>>>>>>>>>>>>>>>>>         "TESSDATA_PREFIX=../tesseract/tessdata "
>>>>>>>>>>>>>>>>>>>>>>>>>>>         f"make training MODEL_NAME={font} START_MODEL=ben "
>>>>>>>>>>>>>>>>>>>>>>>>>>>         "TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 "
>>>>>>>>>>>>>>>>>>>>>>>>>>>         "LANG_TYPE=Indic"
>>>>>>>>>>>>>>>>>>>>>>>>>>>     )
>>>>>>>>>>>>>>>>>>>>>>>>>>>     subprocess.run(command, shell=True)
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Any suggestions for identifying the cause of the 
>>>>>>>>>>>>>>>>>>>>>>>>>>> problem?
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, everyone.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>>>>>>> You received this message because you are 
>>>>>>>>>>>>>>>>>>>>>>>>>>> subscribed to the Google Groups "tesseract-ocr" 
>>>>>>>>>>>>>>>>>>>>>>>>>>> group.
>>>>>>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop 
>>>>>>>>>>>>>>>>>>>>>>>>>>> receiving emails from it, send an email to 
>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract-oc...@googlegroups.com.
>>>>>>>>>>>>>>>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/406cd733-b265-4118-a7ca-de75871cac39n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/949aa119-6aaf-4764-9c4e-0e32af47ee8bn%40googlegroups.com.
