[tesseract-ocr] Re: How to finetune tesseract 5 with multiple fonts

Ali hussain Sat, 08 Jul 2023 05:08:24 -0700

Hey, I'm new to OCR and I don't know Python (I actually work in 
JavaScript), but I fixed the problem.


Here is the code for what I did:

*1. Replace the contents of your main 'split_training_text.py' file with 
the code below:*


import os
import random
import pathlib
import subprocess
from tesstrain.tesseract_training import run_tesseract_training

training_text_file = 'langdata/eng.training_text'

fonts = ['lato', 'roboto']

lines = []

with open(training_text_file, 'r') as input_file:
    for line in input_file.readlines():
        lines.append(line.strip())

output_directory = 'tesstrain/data'

if not os.path.exists(output_directory):
    os.mkdir(output_directory)

random.shuffle(lines)

count = 5
lines = lines[:count]
line_count = 0

for font in fonts:
    font_output_directory = os.path.join(
        output_directory, f'{font}-ground-truth')
    if not os.path.exists(font_output_directory):
        os.mkdir(font_output_directory)

    for line in lines:
        training_text_file_name = pathlib.Path(training_text_file).stem
        line_training_text = os.path.join(
            font_output_directory,
            f'{training_text_file_name}_{line_count}.gt.txt')
        with open(line_training_text, 'w') as output_file:
            output_file.writelines([line])

        file_base_name = f'eng_{line_count}'

        subprocess.run([
            'text2image',
            f'--font={font}',
            f'--text={line_training_text}',
            f'--outputbase={os.path.join(font_output_directory, file_base_name)}',
            '--max_pages=1',
            '--strip_unrenderable_words',
            '--leading=32',
            '--xsize=3600',
            '--ysize=480',
            '--char_spacing=1.0',
            '--exposure=0',
            '--unicharset_file=langdata/eng.unicharset'
        ])

        line_count += 1

    run_tesseract_training(font)

Run it with the command:  *python3 split_training_text.py*  (after creating 
the file from step 2, which this script imports).

I have only trained two fonts so far, and the run behaved the same as it does 
with one font, but I have not tested whether the resulting models actually 
work. You can add more fonts to the list and try it.
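
Since the output above was not checked, here is a minimal sketch of one way 
to do it. It assumes text2image wrote a .tif and a .box file next to every 
.gt.txt file, as it does with --outputbase. The function name 
find_incomplete_samples is made up for this example and is not part of 
tesstrain.

```python
from pathlib import Path


def find_incomplete_samples(ground_truth_dir):
    """Return base names that lack a .gt.txt, .tif or .box file."""
    directory = Path(ground_truth_dir)
    required = ('.gt.txt', '.tif', '.box')

    # Collect every sample base name, whichever of the three files exists.
    bases = set()
    for path in directory.iterdir():
        for suffix in required:
            if path.name.endswith(suffix):
                bases.add(path.name[:-len(suffix)])

    # A sample is complete only when all three files are present.
    return sorted(
        base for base in bases
        if not all((directory / f'{base}{suffix}').exists()
                   for suffix in required)
    )


if __name__ == '__main__':
    # Same folder layout as split_training_text.py above.
    for font in ['lato', 'roboto']:
        folder = Path('tesstrain/data') / f'{font}-ground-truth'
        if folder.is_dir():
            print(font, find_incomplete_samples(folder))
```

An empty list for a font means every line has its text, image and box file. 
Anything listed is a line that text2image probably failed to render, for 
example because the font name was not found.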

*2. Create a file called 'tesseract_training.py' in the 'tesstrain' folder 
and paste in the code below:*

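import os
import subprocess

# Folder that contains this file and the tesstrain Makefile
TESSTRAIN_DIR = os.path.dirname(os.path.abspath(__file__))


def run_tesseract_training(font):
    # Fine-tune from the eng model; the new model is named after the font
    command = (
        'TESSDATA_PREFIX=../tesseract/tessdata make training '
        f'MODEL_NAME={font} START_MODEL=eng '
        'TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000'
    )
    # make has to run next to the Makefile, so set the working directory
    subprocess.run(command, shell=True, cwd=TESSTRAIN_DIR, check=True)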

Then run:  *python3 split_training_text.py*
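
Note that the steps above train one model per font. If the goal is a single 
model that has seen all the fonts, one possible approach is to render every 
line in every font into one shared ground-truth folder and run make once. 
This is only a sketch and I have not trained with it. The model name 
'multifont' and the function build_text2image_jobs are made up for the 
example. It relies on tesstrain reading its samples from 
data/MODEL_NAME-ground-truth.

```python
import os


def build_text2image_jobs(lines, fonts, model_name, data_dir='tesstrain/data'):
    """Pair every line with every font inside one shared ground-truth folder.

    Returns a list of (gt_path, text, command) tuples. Nothing is written
    or executed here, so the plan can be inspected first.
    """
    out_dir = os.path.join(data_dir, f'{model_name}-ground-truth')
    jobs = []
    for font in fonts:
        # Put the font in the file name so samples never overwrite each other.
        font_tag = font.replace(' ', '_')
        for index, text in enumerate(lines):
            base = os.path.join(out_dir, f'{font_tag}_{index}')
            gt_path = f'{base}.gt.txt'
            command = [
                'text2image',
                f'--font={font}',
                f'--text={gt_path}',
                f'--outputbase={base}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=32',
                '--xsize=3600',
                '--ysize=480',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ]
            jobs.append((gt_path, text, command))
    return jobs


if __name__ == '__main__':
    # 'multifont' is a hypothetical model name.
    for gt_path, text, command in build_text2image_jobs(
            ['sample line'], ['lato', 'roboto'], 'multifont'):
        print(gt_path, ' '.join(command))
```

To use it, write each text to its gt_path, pass each command to 
subprocess.run, and then run make training once with MODEL_NAME=multifont 
instead of once per font.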

On Wednesday, 18 January, 2023 at 6:52:24 pm UTC+6 Muhammad Hamza wrote:

> I want to fine-tune ell.traineddata with multiple fonts at once. Can 
> anyone tell me the workflow for this scenario?
>
> subprocess.run([
>         'text2image',
>         '--font=OCRA Medium',
>         f'--text={line_training_text}',
>         f'--outputbase={output_directory}/{file_base_name}',
>         '--max_pages=1',
>         '--strip_unrenderable_words',
>         '--leading=32',
>         '--xsize=3600',
>         '--ysize=480',
>         '--char_spacing=1.0',
>         '--exposure=0',
>         '--unicharset_file=langdata/bos.unicharset'
>     ])
>
> Above, only one font is given in --font. Can anyone tell me how I can train 
> with multiple fonts at once?
> Thanks
>
