You can test by changing '--char_spacing=1.0'. I think it could also be hurting the accuracy of the result.

On Sunday, 22 October, 2023 at 3:07:16 pm UTC+6 Ali hussain wrote:
I haven't tried cutting the top layer of the network. Can you share what you did when you cut the top layer, or a GitHub project link?

On Sunday, 22 October, 2023 at 12:27:32 pm UTC+6 [email protected] wrote:

That is massive data. Have you tried to train by cutting the top layer of the network? I think that is the most promising approach. I was getting really good results with that, but the result does not carry over to scanned documents; I get the best results with the synthetic data. I am now experimenting with the text2image settings to see if it is possible to emulate scanned documents. I also suspect that the setting '--char_spacing=1.0' in our setup is causing more trouble: scanned documents come with character spacing close to zero. If you are planning to train more, try removing this parameter.

On Sunday, October 22, 2023 at 4:09:46 AM UTC+3 [email protected] wrote:

600,000 lines of text, and the iterations higher than 600,000. But sometimes I got a better result with fewer iterations when fine-tuning, like 100,000 lines of text and only 5,000 to 10,000 iterations.

On Saturday, 21 October, 2023 at 11:37:13 am UTC+6 [email protected] wrote:

How many lines of text and iterations did you use?

On Saturday, October 21, 2023 at 8:36:38 AM UTC+3 Des Bw wrote:

Yeah, that is what I am getting as well. I was able to add the missing letter, but the overall accuracy became lower than the default model.

On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 [email protected] wrote:

Not a good result; that's why I have stopped training for now. The default traineddata is overall better than training from scratch.

On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 [email protected] wrote:

Hi Ali,
How is your training going?
Do you get good results with the training-from-scratch?
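For reference, the "cut the top layer" approach discussed above is done with lstmtraining's --continue_from together with --append_index and --net_spec, which drop the layers above the given index and retrain a fresh top. A minimal sketch in the thread's own subprocess style; the paths, the cut index 5, and the 111 in O1c111 (the size of your unicharset) are placeholder assumptions to adapt:

```python
# Sketch only: builds (but does not run) an lstmtraining invocation that cuts
# the network above layer index 5 and appends a new LSTM + output layer.
# All paths and the O1c111 output size are placeholders.
import subprocess

command = [
    "lstmtraining",
    "--continue_from=tessdata_best/ben.lstm",      # extracted with combine_tessdata -e
    "--traineddata=tessdata_best/ben.traineddata",
    "--append_index=5",                            # discard layers above index 5
    "--net_spec=[Lfx256 O1c111]",                  # new top; 111 = unicharset size
    "--model_output=output/ben_cut",
    "--train_listfile=train.list",
    "--max_iterations=3000",
]
# subprocess.run(command)  # uncomment to launch the actual run
```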
On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote:

Yes, two months ago when I started to learn OCR I saw that. It was very helpful at the beginning.

On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 [email protected] wrote:

Just saw this paper: https://osf.io/b8h7q

On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 [email protected] wrote:

I will try some changes. Thanks.

On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 [email protected] wrote:

I also faced that issue on Windows. Apparently, the issue is related to Unicode. You can try your luck by changing "r" to "utf8" in the script. I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <[email protected]> wrote:

Did you face this error, "Can't encode transcription"? If you did, how did you solve it?

On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 [email protected] wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <[email protected]> wrote:

Are you training from the Tesseract default text data or your own collected text data?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 [email protected] wrote:

I have now got to 200,000 iterations, and the error rate is stuck at 0.46. The result is absolutely trash: nowhere close to the default/Ray's training.
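A note on the "r" to "utf8" fix mentioned above: it refers to passing an explicit UTF-8 encoding when the training scripts open text files, since on Windows open(path, 'r') otherwise decodes with the system code page, which cannot represent Bengali text. A minimal sketch; the file name is illustrative:

```python
# Forcing encoding='utf8' makes the script read the ground-truth text the same
# way on Windows as on Linux; the platform default code page would mangle
# Bengali text and may be behind the failures discussed in this thread.
import os
import tempfile

def read_gt(path):
    with open(path, "r", encoding="utf8") as f:  # was: open(path, "r")
        return f.read()

# demo: round-trip one Bengali ground-truth line
demo_path = os.path.join(tempfile.mkdtemp(), "line_0.gt.txt")
with open(demo_path, "w", encoding="utf8") as f:
    f.write("সম্পূর্ণ")
```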
On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 [email protected] wrote:

After Tesseract recognizes text from images, you can apply regex to replace the wrong words with the correct words. I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 [email protected] wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf

> EasyOCR I think is best for ID cards or something like that image process. but document images like books, here Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 [email protected] wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are never recognized by Tesseract. That is why I use these regexes: to replace those characters and words when they come out incorrect.
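The replacement approach described above is a small post-processing pass over the OCR output. A minimal sketch, using a few entries from the correction map that follows in this thread:

```python
# Post-OCR correction pass: replace characters/words Tesseract gets wrong.
# Only a few illustrative entries from the thread's map are shown here.
CORRECTIONS = {
    "আা": "আ",            # broken vowel combination
    "ন্া": "না",
    "সম ূর্ন": "সম্পূর্ণ",
}

def postprocess(ocr_text):
    # apply longer patterns first so a short rule can't clobber a longer one
    for wrong in sorted(CORRECTIONS, key=len, reverse=True):
        ocr_text = ocr_text.replace(wrong, CORRECTIONS[wrong])
    return ocr_text
```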
See what I have done:

" ী": "ী",
" ্": " ",
" ে": " ",
জ্া: "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
য: "য",
" া": "া",
আা: "আ",
ম্ি: "মি",
স্ু: "সু",
"হূ ": "হূ",
" ণ": "ণ",
র্্: "র",
"চিন্ত ": "চিন্তা ",
ন্া: "না",
"সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 [email protected] wrote:

The problem with regex is that Tesseract is not consistent in its replacements. Suppose the original training of the English data doesn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases, it replaces it with closely similar letters such as /v/ and /w/. In other cases, it completely removes it. That is what is happening in my case: those characters are sometimes completely removed; other times, they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 [email protected] wrote:

If some specific characters or words are always missing from the OCR result,
then you can apply logic with the regular-expressions method in your application. After OCR, those specific characters or words will be replaced by the correct characters or words that you defined in your application with regular expressions. It can fix some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 [email protected] wrote:

The characters are getting missed even after fine-tuning.
I never made any progress. I tried many different ways; some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 [email protected] wrote:

EasyOCR, I think, is best for ID cards and that kind of image processing, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added words to the dictionaries, but the result is the same.

What kind of problem did you face when fine-tuning a few new characters, as you said ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 [email protected] wrote:

Yes, we are new to this.
I find the instructions (the manual) very hard to follow. The video you linked above was really helpful to get started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the usage of dictionaries, actually. Maybe adding millions of words into the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 [email protected] wrote:

> How is your training going for Bengali?

It was nearly good, but I faced space problems between words: some words get spaces, but most of them have no space.
I think the problem is in the dataset, but I use the default training dataset from Tesseract that is used for Bengali, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Actually, training from scratch is harder than fine-tuning, so you can use different datasets to explore. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 [email protected] wrote:

How is your training going for Bengali?

I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping for reasonable accuracy. Unfortunately, when I run the OCR using the .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 [email protected] wrote:

Yes, he doesn't mention all fonts but only one font.
That way, he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we teach all the tif, gt.txt, and .box files that are created under MODEL_NAME (I mean the eng, ben, oro flag, i.e. the language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. This MODEL_NAME is what the training script selects when looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 [email protected] wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tunes with a single font following Gracia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful. It is all working for me.

The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.

The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model.
The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train the fonts (even if the fonts have been included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 [email protected] wrote:

import subprocess

# List of font names
font_names = ['ben']
for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This command is for training the data; I have named the script 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, with the langdata, tesseract, and tesstrain folders inside it. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file is on the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run: sudo apt update, then sudo fc-cache -fv.

After that, you have to add the exact font names in the FontList.py file like I did.

I have added two pictures of my folder structure: the first is the main structure, and the second is the collapsed tesstrain folder.

[image: Screenshot 2023-09-11 134947.png][image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 [email protected] wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question on the other script that you use to train.
import subprocess

# List of font names
font_names = ['ben']
for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 [email protected] wrote:

You can use the new script below; it's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and you can also use a breakpoint: if VS Code closes, or anything happens while creating the tif, gt.txt, and .box files, you can use the checkpoint to get back to where VS Code closed.
Command for the tif, gt.txt, and .box files:

import os
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    # read as UTF-8 (the Unicode fix discussed earlier in the thread)
    with open(training_text_file, 'r', encoding='utf8') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w', encoding='utf8') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this:

class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the above code.

For the breakpoint command:

sudo python3 split_training_text.py --start 0 --end 11
Change the checkpoint values --start 0 --end 11 according to where you stopped.

And the training checkpoint works as you know already.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 [email protected] wrote:

Hi mhalidu,
the script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 [email protected] wrote:

OK, I will try as you said.
One more thing: what is the role of the training-text lines? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the number of words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <[email protected]> wrote:

I have trained some new fonts with fine-tuning methods for the Bengali language in Tesseract 5, and I have used all the official training text, tessdata_best, and the other things as well. Everything is good, but the problem is that the default fonts that were trained before no longer convert text like they did previously, while my new fonts work well. I don't understand why this is happening. I am sharing the code base to help understand what is going on.
*codes for creating tif, gt.txt, .box files:*

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList


def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0


def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))


def create_training_data(training_text_file, font_list, output_directory,
                         start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Resume the line_count from the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)
    # NOTE: end_line_count is computed but never used below; the loop
    # always runs over all lines.

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename (unique for each font)
            file_base_name = f'ben_{line_serial}'
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Persist the line_count for the next run


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int,
                        help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int,
                        help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list,
                         output_directory, args.start, args.end)

*and for training code:*

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = (f"TESSDATA_PREFIX=../tesseract/tessdata make training "
               f"MODEL_NAME={font} START_MODEL=ben "
               f"TESSDATA=../tesseract/tessdata "
               f"MAX_ITERATIONS=10000 LANG_TYPE=Indic")
    subprocess.run(command, shell=True)

Any suggestions on how to narrow down the problem?
thanks, everyone
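On the question above of whether '--xsize' should follow the number of words per line, here is a rough back-of-the-envelope sketch of the relationship. The 12 pt / 300 dpi figures match text2image's usual defaults, and the 0.55 em average glyph advance is an assumed metric, not measured from any Bengali font.

```python
# Estimate how many characters fit across a text2image line image.
# ASSUMPTIONS: 12 pt type rendered at 300 dpi, and an average glyph
# advance of 0.55 em; real fonts (especially Bengali conjuncts) differ.
def max_chars_per_line(xsize_px, ptsize=12, resolution=300, avg_advance_em=0.55):
    em_px = ptsize * resolution / 72     # pixels per em at this resolution
    char_px = em_px * avg_advance_em     # average pixels per character
    return int(xsize_px / char_px)

print(max_chars_per_line(3600))  # 130
```

So at these assumed metrics, '--xsize=3600' leaves room for roughly 130 characters; training lines much longer than that may not fit on a single rendered line.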
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/47cd457c-55da-42e6-8fa4-501ac5197303n%40googlegroups.com.
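To pin down whether the fine-tuned model really got worse on the default fonts, one simple measurement is the character error rate (CER) between each model's output and the .gt.txt text for the same line images. A minimal sketch, using a plain Levenshtein distance rather than any Tesseract tool:

```python
def cer(gt, ocr):
    """Character error rate: edit distance / ground-truth length."""
    m, n = len(gt), len(ocr)
    dp = list(range(n + 1))          # distance row for the empty gt prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i       # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                             # deletion
                        dp[j - 1] + 1,                         # insertion
                        prev + (gt[i - 1] != ocr[j - 1]))      # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(cer("hello", "hxllo"))  # 0.2
```

Run both the fine-tuned model and the stock ben model over the same images with tesseract, then compare the mean CER per model: if the fine-tuned model's CER on default-font images is clearly higher, the fine-tuning has degraded the original fonts' recognition, which is exactly what mixing the default fonts into the fine-tuning set, as suggested above, is meant to avoid.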

