You can test by changing '--char_spacing=1.0'. I think it could also be hurting the accuracy of the result.

On Sunday, 22 October, 2023 at 3:07:16 pm UTC+6 Ali hussain wrote:
I haven't tried cutting the top layer of the network. Can you share what you did when you cut the top layer, or a GitHub project link?

On Sunday, 22 October, 2023 at 12:27:32 pm UTC+6 [email protected] wrote:

That is massive data. Have you tried to train by cutting the top layer of the network? I think that is the most promising approach. I was getting really good results with that, but the result does not carry over to scanned documents; I get the best results with the synthetic data. I am now experimenting with the text2image settings to see if it is possible to emulate scanned documents. I also suspect that the setting '--char_spacing=1.0' in our setup is causing more trouble: scanned documents come with character spacing close to zero. If you are planning to train more, try removing this parameter.

On Sunday, October 22, 2023 at 4:09:46 AM UTC+3 [email protected] wrote:

600,000 lines of text, and the iterations higher than 600,000. But sometimes I got a better result with fewer iterations when fine-tuning, like 100,000 lines of text and only 5,000 to 10,000 iterations.

On Saturday, 21 October, 2023 at 11:37:13 am UTC+6 [email protected] wrote:

How many lines of text and iterations did you use?

On Saturday, October 21, 2023 at 8:36:38 AM UTC+3 Des Bw wrote:

Yeah, that is what I am getting as well. I was able to add the missing letter, but the overall accuracy became lower than the default model.

On Saturday, October 21, 2023 at 3:22:44 AM UTC+3 [email protected] wrote:

Not a good result; that's why I have stopped training for now. The default traineddata is overall better than training from scratch.

On Thursday, 19 October, 2023 at 11:32:08 pm UTC+6 [email protected] wrote:

Hi Ali,
How is your training going?
Do you get good results with the training-from-scratch?
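For reference, the "cut the top layer" approach discussed above is done with lstmtraining's --continue_from together with --append_index and --net_spec, which drop the layers above the given index and retrain a fresh top. A minimal sketch in the thread's own subprocess style; the paths, the cut index 5, and the 111 in O1c111 (the size of your unicharset) are placeholder assumptions to adapt:

```python
# Sketch only: builds (but does not run) an lstmtraining invocation that cuts
# the network above layer index 5 and appends a new LSTM + output layer.
# All paths and the O1c111 output size are placeholders.
import subprocess

command = [
    "lstmtraining",
    "--continue_from=tessdata_best/ben.lstm",      # extracted with combine_tessdata -e
    "--traineddata=tessdata_best/ben.traineddata",
    "--append_index=5",                            # discard layers above index 5
    "--net_spec=[Lfx256 O1c111]",                  # new top; 111 = unicharset size
    "--model_output=output/ben_cut",
    "--train_listfile=train.list",
    "--max_iterations=3000",
]
# subprocess.run(command)  # uncomment to launch the actual run
```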
On Friday, September 15, 2023 at 6:42:26 PM UTC+3 tesseract-ocr wrote:

Yes, two months ago when I started to learn OCR I saw that. It was very helpful at the beginning.

On Friday, 15 September, 2023 at 4:01:32 pm UTC+6 [email protected] wrote:

Just saw this paper: https://osf.io/b8h7q

On Thursday, September 14, 2023 at 9:02:22 PM UTC+3 [email protected] wrote:

I will try some changes. Thanks.

On Thursday, 14 September, 2023 at 2:46:36 pm UTC+6 [email protected] wrote:

I also faced that issue on Windows. Apparently, the issue is related to Unicode. You can try your luck by changing "r" to "utf8" in the script. I ended up installing Ubuntu because I was having too many errors on Windows.

On Thu, Sep 14, 2023, 9:33 AM Ali hussain <[email protected]> wrote:

Did you face this error, "Can't encode transcription"? If you did, how did you solve it?

On Thursday, 14 September, 2023 at 10:51:52 am UTC+6 [email protected] wrote:

I was using my own text.

On Thu, Sep 14, 2023, 6:58 AM Ali hussain <[email protected]> wrote:

Are you training from the Tesseract default text data or your own collected text data?

On Thursday, 14 September, 2023 at 12:19:53 am UTC+6 [email protected] wrote:

I have now got to 200,000 iterations, and the error rate is stuck at 0.46. The result is absolutely trash: nowhere close to the default/Ray's training.
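A note on the "r" to "utf8" fix mentioned above: it refers to passing an explicit UTF-8 encoding when the training scripts open text files, since on Windows open(path, 'r') otherwise decodes with the system code page, which cannot represent Bengali text. A minimal sketch; the file name is illustrative:

```python
# Forcing encoding='utf8' makes the script read the ground-truth text the same
# way on Windows as on Linux; the platform default code page would mangle
# Bengali text and may be behind the failures discussed in this thread.
import os
import tempfile

def read_gt(path):
    with open(path, "r", encoding="utf8") as f:  # was: open(path, "r")
        return f.read()

# demo: round-trip one Bengali ground-truth line
demo_path = os.path.join(tempfile.mkdtemp(), "line_0.gt.txt")
with open(demo_path, "w", encoding="utf8") as f:
    f.write("সম্পূর্ণ")
```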
On Wednesday, September 13, 2023 at 2:47:05 PM UTC+3 [email protected] wrote:

After Tesseract recognizes text from images, you can apply regex to replace the wrong words with the correct words. I'm not familiar with PaddleOCR or ScanTailor either.

On Wednesday, 13 September, 2023 at 5:06:12 pm UTC+6 [email protected] wrote:

At what stage are you doing the regex replacement? My process has been: Scan (tif) --> ScanTailor --> Tesseract --> pdf

> EasyOCR I think is best for ID cards or something like that image process. but document images like books, here Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

On Wednesday, September 13, 2023 at 1:45:54 PM UTC+3 [email protected] wrote:

I know what you mean, but in some cases it helps me. I have found that specific characters and words are never recognized by Tesseract. That is why I use these regexes: to replace those characters and words when they come out incorrect.
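The replacement approach described above is a small post-processing pass over the OCR output. A minimal sketch, using a few entries from the correction map that follows in this thread:

```python
# Post-OCR correction pass: replace characters/words Tesseract gets wrong.
# Only a few illustrative entries from the thread's map are shown here.
CORRECTIONS = {
    "আা": "আ",            # broken vowel combination
    "ন্া": "না",
    "সম ূর্ন": "সম্পূর্ণ",
}

def postprocess(ocr_text):
    # apply longer patterns first so a short rule can't clobber a longer one
    for wrong in sorted(CORRECTIONS, key=len, reverse=True):
        ocr_text = ocr_text.replace(wrong, CORRECTIONS[wrong])
    return ocr_text
```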
See what I have done:

" ী": "ী",
" ্": " ",
" ে": " ",
জ্া: "জা",
" ": " ",
" ": " ",
" ": " ",
"্প": " ",
" য": "র্য",
য: "য",
" া": "া",
আা: "আ",
ম্ি: "মি",
স্ু: "সু",
"হূ ": "হূ",
" ণ": "ণ",
র্্: "র",
"চিন্ত ": "চিন্তা ",
ন্া: "না",
"সম ূর্ন": "সম্পূর্ণ",

On Wednesday, 13 September, 2023 at 4:18:22 pm UTC+6 [email protected] wrote:

The problem with regex is that Tesseract is not consistent in its replacements. Suppose the original training of the English data doesn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing? In some cases, it replaces it with closely similar letters such as /v/ and /w/. In other cases, it completely removes it. That is what is happening in my case: those characters are sometimes completely removed; other times, they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

On Wednesday, September 13, 2023 at 1:02:01 PM UTC+3 [email protected] wrote:

If some specific characters or words are always missing from the OCR result,
then you can apply logic with the regular-expressions method in your application. After OCR, those specific characters or words will be replaced by the correct characters or words that you defined in your application with regular expressions. It can fix some major problems.

On Wednesday, 13 September, 2023 at 3:51:29 pm UTC+6 [email protected] wrote:

The characters are getting missed even after fine-tuning.
I never made any progress. I tried many different ways; some specific characters are always missing from the OCR result.

On Wednesday, September 13, 2023 at 12:49:20 PM UTC+3 [email protected] wrote:

EasyOCR, I think, is best for ID cards and that kind of image processing, but for document images like books, Tesseract is better than EasyOCR. I haven't used EasyOCR myself; you can try it.

I have added words to the dictionaries, but the result is the same.

What kind of problem did you face when fine-tuning a few new characters, as you said ("but, I failed in every possible way to introduce a few new characters into the database")?

On Wednesday, 13 September, 2023 at 3:33:48 pm UTC+6 [email protected] wrote:

Yes, we are new to this.
I find the instructions (the manual) very hard to follow. The video you linked above was really helpful to get started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the usage of dictionaries, actually. Maybe adding millions of words into the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

On Wednesday, September 13, 2023 at 12:19:48 PM UTC+3 [email protected] wrote:

> How is your training going for Bengali?

It was nearly good, but I faced space problems between words: some words get spaces, but most of them have no space.
I think the problem is in the dataset, but I use the default training dataset from Tesseract that is used for Bengali, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Actually, training from scratch is harder than fine-tuning, so you can use different datasets to explore. If you succeed, please let me know how you did the whole process. I'm also new to this field.

On Wednesday, 13 September, 2023 at 1:13:43 pm UTC+6 [email protected] wrote:

How is your training going for Bengali?

I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping for reasonable accuracy. Unfortunately, when I run the OCR using the .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

On Tuesday, September 12, 2023 at 11:15:25 PM UTC+3 [email protected] wrote:

Yes, he doesn't mention all fonts but only one font.
That way, he didn't use MODEL_NAME in a separate script file, I think.

Actually, here we teach all the tif, gt.txt, and .box files that are created under MODEL_NAME (I mean the eng, ben, oro flag, i.e. the language code), because when we first create the tif, gt.txt, and .box files, every file name starts with MODEL_NAME. This MODEL_NAME is what the training script selects when looping over each tif, gt.txt, and .box file created under it.

On Tuesday, 12 September, 2023 at 9:42:13 pm UTC+6 [email protected] wrote:

Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tunes with a single font following Gracia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful. It is all working for me.

The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.

The script seems to have the trick of sending/teaching each of the fonts (iteratively) into the model.
The script I have been using (which I got from Garcia) doesn't mention fonts at all:

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train the fonts (even if the fonts have been included in the splitting process, in the other script)?

On Monday, September 11, 2023 at 10:54:08 AM UTC+3 [email protected] wrote:

import subprocess

# List of font names
font_names = ['ben']
for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

1. This command is for training the data; I have named the script 'tesseract_training.py' inside the tesstrain folder.
2. The root directory means your main training folder, with the langdata, tesseract, and tesstrain folders inside it. If you watch this tutorial, https://www.youtube.com/watch?v=KE4xEzFGSU8, you will understand the folder structure better. I only created tesseract_training.py in the tesstrain folder for training; the FontList.py file is on the main path, alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts in your Linux fonts folder, /usr/share/fonts/, then run: sudo apt update, then sudo fc-cache -fv.

After that, you have to add the exact font names in the FontList.py file like I did.

I have added two pictures of my folder structure: the first is the main structure, and the second is the collapsed tesstrain folder.

[image: Screenshot 2023-09-11 134947.png][image: Screenshot 2023-09-11 135014.png]

On Monday, 11 September, 2023 at 12:50:03 pm UTC+6 [email protected] wrote:

Thank you so much for putting out these brilliant scripts. They make the process much more efficient.

I have one more question on the other script that you use to train.
import subprocess

# List of font names
font_names = ['ben']
for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True)

Do you have the names of the fonts listed in a file in the same/root directory? How do you set up the names of the fonts in that file, if you don't mind sharing it?

On Monday, September 11, 2023 at 4:27:27 AM UTC+3 [email protected] wrote:

You can use the new script below; it's better than the previous two scripts. You can create the tif, gt.txt, and .box files with multiple fonts, and you can also use a breakpoint: if VS Code closes, or anything happens while creating the tif, gt.txt, and .box files, you can use the checkpoint to get back to where VS Code closed.
Command for the tif, gt.txt, and .box files:

import os
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    # read as UTF-8 (the Unicode fix discussed earlier in the thread)
    with open(training_text_file, 'r', encoding='utf8') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')

            with open(line_gt_text, 'w', encoding='utf8') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)

Then create a file called "FontList.py" in the root directory and paste this:

class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]

Then import it in the above code.

For the breakpoint command:

sudo python3 split_training_text.py --start 0 --end 11
Change the checkpoint values --start 0 --end 11 according to where you stopped.

And the training checkpoint works as you know already.

On Monday, 11 September, 2023 at 1:22:34 am UTC+6 [email protected] wrote:

Hi mhalidu,
the script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

On Wednesday, August 9, 2023 at 11:00:49 PM UTC+3 [email protected] wrote:

OK, I will try as you said.
One more thing: what is the role of the training-text lines? I have seen that Bengali texts have long lines of words, so I want to know how many words or characters per line would be the better choice for training. And should '--xsize=3600', '--ysize=350' be set according to the number of words per line?

On Thursday, 10 August, 2023 at 1:10:14 am UTC+6 shree wrote:

Include the default fonts also in your fine-tuning list of fonts and see if that helps.

On Wed, Aug 9, 2023, 2:27 PM Ali hussain <[email protected]> wrote:

I have trained some new fonts with fine-tuning methods for the Bengali language in Tesseract 5, and I have used all the official training text, tessdata_best, and the other things as well. Everything is good, but the problem is that the default fonts that were trained before no longer convert text like they did previously, while my new fonts work well. I don't understand why this is happening. I am sharing the code base to help understand what is going on.
*codes for creating tif, gt.txt, .box files:*

import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList


def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0


def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))


def create_training_data(training_text_file, font_list, output_directory,
                         start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    random.shuffle(lines)

    if start_line is None:
        line_count = read_line_count()  # Resume the line_count from the file
    else:
        line_count = start_line

    if end_line is None:
        end_line_count = len(lines) - 1  # Ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)
    # NOTE: end_line_count is computed but never used below; the loop
    # always runs over all lines.

    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem

            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"

            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(
                output_directory,
                f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            # Image filename (unique for each font)
            file_base_name = f'ben_{line_serial}'
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])

            line_count += 1
            font_serial += 1

        # Reset font_serial for the next font iteration
        font_serial = 1

    write_line_count(line_count)  # Persist the line_count for the next run


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int,
                        help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int,
                        help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'

    # Create an instance of the FontList class
    font_list = FontList()

    create_training_data(training_text_file, font_list,
                         output_directory, args.start, args.end)

*and for training code:*

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = (f"TESSDATA_PREFIX=../tesseract/tessdata make training "
               f"MODEL_NAME={font} START_MODEL=ben "
               f"TESSDATA=../tesseract/tessdata "
               f"MAX_ITERATIONS=10000 LANG_TYPE=Indic")
    subprocess.run(command, shell=True)

Any suggestions on how to narrow down the problem?
thanks, everyone
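On the question above of whether '--xsize' should follow the number of words per line, here is a rough back-of-the-envelope sketch of the relationship. The 12 pt / 300 dpi figures match text2image's usual defaults, and the 0.55 em average glyph advance is an assumed metric, not measured from any Bengali font.

```python
# Estimate how many characters fit across a text2image line image.
# ASSUMPTIONS: 12 pt type rendered at 300 dpi, and an average glyph
# advance of 0.55 em; real fonts (especially Bengali conjuncts) differ.
def max_chars_per_line(xsize_px, ptsize=12, resolution=300, avg_advance_em=0.55):
    em_px = ptsize * resolution / 72     # pixels per em at this resolution
    char_px = em_px * avg_advance_em     # average pixels per character
    return int(xsize_px / char_px)

print(max_chars_per_line(3600))  # 130
```

So at these assumed metrics, '--xsize=3600' leaves room for roughly 130 characters; training lines much longer than that may not fit on a single rendered line.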
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/47cd457c-55da-42e6-8fa4-501ac5197303n%40googlegroups.com.
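To pin down whether the fine-tuned model really got worse on the default fonts, one simple measurement is the character error rate (CER) between each model's output and the .gt.txt text for the same line images. A minimal sketch, using a plain Levenshtein distance rather than any Tesseract tool:

```python
def cer(gt, ocr):
    """Character error rate: edit distance / ground-truth length."""
    m, n = len(gt), len(ocr)
    dp = list(range(n + 1))          # distance row for the empty gt prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i       # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                             # deletion
                        dp[j - 1] + 1,                         # insertion
                        prev + (gt[i - 1] != ocr[j - 1]))      # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(cer("hello", "hxllo"))  # 0.2
```

Run both the fine-tuned model and the stock ben model over the same images with tesseract, then compare the mean CER per model: if the fine-tuned model's CER on default-font images is clearly higher, the fine-tuning has degraded the original fonts' recognition, which is exactly what mixing the default fonts into the fine-tuning set, as suggested above, is meant to avoid.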

