Yes, I saw that two months ago when I started to learn OCR. It was very helpful at the beginning.

On Friday, 15 September 2023 at 4:01:32 pm UTC+6, desal...@gmail.com wrote:
Just saw this paper: https://osf.io/b8h7q

On Thursday, September 14, 2023 at 9:02:22 PM UTC+3, mdalihu...@gmail.com wrote:

I will try some changes. Thanks.

On Thursday, 14 September 2023 at 2:46:36 pm UTC+6, elvi...@gmail.com wrote:

I also faced that issue on Windows. Apparently the issue is related to Unicode: you can try your luck by changing the plain "r" open mode in the script to use UTF-8 encoding. I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <mdalihu...@gmail.com> wrote:

Did you face this error: "Can't encode transcription"? If you did, how did you solve it?

On Thursday, 14 September 2023 at 10:51:52 am UTC+6, elvi...@gmail.com wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <mdalihu...@gmail.com> wrote:

Are you training from Tesseract's default text data or your own collected text data?

On Thursday, 14 September 2023 at 12:19:53 am UTC+6, desal...@gmail.com wrote:

I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The result is absolutely terrible: nowhere close to the default/Ray's training.

On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3, mdalihu...@gmail.com wrote:

After Tesseract recognizes the text from the images, you can apply regex to replace the wrong words with the correct ones. I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September 2023 at 5:06:12 pm UTC+6, desal...@gmail.com wrote:

At what stage are you doing the regex replacement? My process has been: scan (tif) --> ScanTailor --> Tesseract --> pdf.

> EasyOCR I think is best for ID cards or something like that image process. but document images like books, here Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3, mdalihu...@gmail.com wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are consistently not recognized by Tesseract, so I use these replacements to correct those characters and words when they come out wrong.

See what I have done:

    " ী": "ী",
    " ্": " ",
    " ে": " ",
    "জ্া": "জা",
    " ": " ",
    " ": " ",
    " ": " ",
    "্প": " ",
    " য": "র্য",
    "য": "য",
    " া": "া",
    "আা": "আ",
    "ম্ি": "মি",
    "স্ু": "সু",
    "হূ ": "হূ",
    " ণ": "ণ",
    "র্্": "র",
    "চিন্ত ": "চিন্তা ",
    "ন্া": "না",
    "সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September 2023 at 4:18:22 pm UTC+6, desal...@gmail.com wrote:

The problem with regex is that Tesseract is not consistent in its substitutions. Suppose the original English training data didn't contain the letter /u/: what does Tesseract do when it encounters /u/ during actual processing? In some cases it replaces it with a closely similar letter such as /v/ or /w/; in other cases it removes it completely. That is what is happening in my case: those characters are sometimes removed entirely, and other times they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.
On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3, mdalihu...@gmail.com wrote:

If some specific characters or words are always missing from the OCR result, you can apply regular-expression logic in your application: after OCR, the specific characters or words you defined are replaced with the correct ones. That can handle some major problems.

On Wednesday, 13 September 2023 at 3:51:29 pm UTC+6, desal...@gmail.com wrote:

The characters are getting missed even after fine-tuning. I never made any progress, although I tried many different ways. Some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3, mdalihu...@gmail.com wrote:

EasyOCR, I think, is best for ID cards and similar images, but for document images like books Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning for a few new characters, as you said ("but I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September 2023 at 3:33:48 pm UTC+6, desal...@gmail.com wrote:

Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful for getting started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the usage of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful; I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3, mdalihu...@gmail.com wrote:

> How is your training going for Bengali?

It was nearly good, but I faced spacing problems between words: some words get a space, but most of them have none. I think the problem is in the dataset, but I used the default Bengali training dataset from Tesseract, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Actually, training from scratch is harder than fine-tuning.
So you can use different datasets to explore. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September 2023 at 1:13:43 pm UTC+6, desal...@gmail.com wrote:

How is your training going for Bengali?

I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping for reasonable accuracy. Unfortunately, when I run OCR using the resulting .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3, mdalihu...@gmail.com wrote:

Yes, he mentions only one font, not all of them. That is why he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we train on all the .tif, .gt.txt, and .box files created under MODEL_NAME (I mean the eng, ben, or oro language code), because when we first create the .tif, .gt.txt, and .box files, every file name starts with MODEL_NAME.
This MODEL_NAME is what we select in the training script to loop over each of the .tif, .gt.txt, and .box files created under that MODEL_NAME.

On Tuesday, 12 September 2023 at 9:42:13 pm UTC+6, desal...@gmail.com wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tuning runs with a single font following Gracia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful, and it is all working for me.

The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.

Your script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all:

    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train on the fonts (even if the fonts have been included in the splitting process, in the other script)?
On Monday, September 11, 2023 at 10:54:08 AM UTC+3, mdalihu...@gmail.com wrote:

    import subprocess

    # List of font names
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

1. This command is for training the data; I have named the script 'tesseract_training.py' and put it inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file is in the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder (/usr/share/fonts/), then run sudo apt update and then sudo fc-cache -fv. After that, you have to add the exact font names to the FontList.py file, like mine.

I have attached two pictures of my folder structure.
The first is the main structure, and the second is the collapsed tesstrain folder.

[image: Screenshot 2023-09-11 134947.png] [image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September 2023 at 12:50:03 pm UTC+6, desal...@gmail.com wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question about the other script that you use to train:

    import subprocess

    # List of font names
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
        subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3, mdalihu...@gmail.com wrote:

You can use the new script below. It's better than the previous two scripts.
You can create the .tif, .gt.txt, and .box files with multiple fonts, and it also supports a breakpoint: if VS Code closes (or anything else happens) while the .tif, .gt.txt, and .box files are being created, you can use the checkpoint to resume from where it stopped.

Script for the .tif, .gt.txt, and .box files:

    import os
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def create_training_data(training_text_file, font_list, output_directory,
                             start_line=None, end_line=None):
        with open(training_text_file, 'r') as input_file:
            lines = input_file.readlines()

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        if start_line is None:
            start_line = 0
        if end_line is None:
            end_line = len(lines) - 1

        for font_name in font_list.fonts:
            for line_index in range(start_line, end_line + 1):
                line = lines[line_index].strip()
                training_text_file_name = pathlib.Path(training_text_file).stem
                line_serial = f"{line_index:d}"

                # Ground-truth text file for this line
                line_gt_text = os.path.join(
                    output_directory,
                    f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')
                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
                subprocess.run([
                    'text2image',
                    f'--font={font_name}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=330',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/eng.unicharset',
                ])

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/eng.training_text'
        output_directory = 'tesstrain/data/eng-ground-truth'

        font_list = FontList()
        create_training_data(training_text_file, font_list, output_directory,
                             args.start, args.end)

Then create a file called FontList.py in the root directory and paste this:

    class FontList:
        def __init__(self):
            self.fonts = [
                "Gerlick",
                "Sagar Medium",
                "Ekushey Lohit Normal",
                "Charukola Round Head Regular, weight=433",
                "Charukola Round Head Bold, weight=443",
                "Ador Orjoma Unicode",
            ]

It is then imported in the script above.

Command with a breakpoint:

    sudo python3 split_training_text.py --start 0 --end 11

Change the checkpoint range (--start 0 --end 11) as needed. And the training checkpoint works as you already know.

On Monday, 11 September 2023 at 1:22:34 am UTC+6, desal...@gmail.com wrote:

Hi mhalidu, the script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com.
I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3, mdalihu...@gmail.com wrote:

OK, I will try as you said. One more thing: what should the training_text lines look like? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the number of words per line?

On Thursday, 10 August 2023 at 1:10:14 am UTC+6, shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.
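One hedged sketch of how shree's suggestion could be folded into the FontList pattern used earlier in the thread. Every font name below is a placeholder; use the fonts the base model was actually trained on:

```python
# Merge the fonts used by the base model with the new fonts being
# fine-tuned, so training still sees the original fonts.
# All font names here are placeholders, not the real base-model fonts.
DEFAULT_FONTS = ["Lohit Bengali", "Mukti Narrow"]  # assumed base-model fonts
NEW_FONTS = ["Gerlick", "Sagar Medium"]

class FontList:
    def __init__(self, extra_fonts=()):
        # Deduplicate while keeping order: defaults first, then new fonts.
        seen = set()
        self.fonts = []
        for name in [*DEFAULT_FONTS, *extra_fonts]:
            if name not in seen:
                seen.add(name)
                self.fonts.append(name)

font_list = FontList(NEW_FONTS)
print(font_list.fonts)
```

The rest of the pipeline (the split script and text2image loop) can then use font_list.fonts unchanged.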
On Wed, Aug 9, 2023, 2:27 PM Ali hussain <mdalihu...@gmail.com> wrote:

I have trained some new fonts with the fine-tuning method for the Bengali language in Tesseract 5, using the official training text, tessdata_best, and everything else. Everything is good, but the problem is that the default fonts that were trained before no longer convert text as they did previously, while my new fonts work well. I don't understand why this is happening. I am sharing the code so you can see what is going on.

Code for creating the .tif, .gt.txt, and .box files:

    import os
    import random
    import pathlib
    import subprocess
    import argparse
    from FontList import FontList

    def read_line_count():
        if os.path.exists('line_count.txt'):
            with open('line_count.txt', 'r') as file:
                return int(file.read())
        return 0

    def write_line_count(line_count):
        with open('line_count.txt', 'w') as file:
            file.write(str(line_count))

    def create_training_data(training_text_file, font_list, output_directory,
                             start_line=None, end_line=None):
        lines = []
        with open(training_text_file, 'r') as input_file:
            for line in input_file.readlines():
                lines.append(line.strip())

        if not os.path.exists(output_directory):
            os.mkdir(output_directory)

        random.shuffle(lines)

        if start_line is None:
            line_count = read_line_count()  # Resume from the saved checkpoint
        else:
            line_count = start_line

        if end_line is None:
            end_line_count = len(lines) - 1
        else:
            end_line_count = min(end_line, len(lines) - 1)
        # Note: end_line_count is computed, but the loop below still
        # renders every line for every font.

        for font in font_list.fonts:  # Iterate through all fonts in the list
            font_serial = 1
            for line in lines:
                training_text_file_name = pathlib.Path(training_text_file).stem

                # Generate a unique serial number for each line
                line_serial = f"{line_count:d}"

                # GT (ground-truth) text filename
                line_gt_text = os.path.join(
                    output_directory,
                    f'{training_text_file_name}_{line_serial}.gt.txt')
                with open(line_gt_text, 'w') as output_file:
                    output_file.writelines([line])

                # Image filename, unique for each font
                file_base_name = f'ben_{line_serial}'
                subprocess.run([
                    'text2image',
                    f'--font={font}',
                    f'--text={line_gt_text}',
                    f'--outputbase={output_directory}/{file_base_name}',
                    '--max_pages=1',
                    '--strip_unrenderable_words',
                    '--leading=36',
                    '--xsize=3600',
                    '--ysize=350',
                    '--char_spacing=1.0',
                    '--exposure=0',
                    '--unicharset_file=langdata/ben.unicharset',
                ])

                line_count += 1
                font_serial += 1

            # Reset font_serial for the next font
            font_serial = 1

        write_line_count(line_count)  # Update the checkpoint file

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
        parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
        args = parser.parse_args()

        training_text_file = 'langdata/ben.training_text'
        output_directory = 'tesstrain/data/ben-ground-truth'

        # Create an instance of the FontList class
        font_list = FontList()
        create_training_data(training_text_file, font_list, output_directory,
                             args.start, args.end)

And the training code:

    import subprocess

    # List of model names
    font_names = ['ben']

    for font in font_names:
        command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
        subprocess.run(command, shell=True)

Any suggestions for identifying the problem?
Thanks, everyone.

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f57a721f-c8a8-4e86-9664-6a71ff337333n%40googlegroups.com.