Hi, I'm new to OCR and don't really know Python (I normally work in
JavaScript), but I fixed the problem.
Here is the code I ended up with:
*1. Replace the contents of your main 'split_training_text.py' file with
the code below:*
import os
import random
import pathlib
import subprocess

from tesstrain.tesseract_training import run_tesseract_training

training_text_file = 'langdata/eng.training_text'
fonts = ['lato', 'roboto']

lines = []
with open(training_text_file, 'r') as input_file:
    for line in input_file.readlines():
        lines.append(line.strip())

output_directory = 'tesstrain/data'
if not os.path.exists(output_directory):
    os.mkdir(output_directory)

random.shuffle(lines)

# Number of lines taken from the shuffled training text
count = 5
lines = lines[:count]

line_count = 0
for font in fonts:
    # One ground-truth folder per font, e.g. tesstrain/data/lato-ground-truth
    font_output_directory = os.path.join(
        output_directory, f'{font}-ground-truth')
    if not os.path.exists(font_output_directory):
        os.mkdir(font_output_directory)

    for line in lines:
        training_text_file_name = pathlib.Path(training_text_file).stem
        line_training_text = os.path.join(
            font_output_directory,
            f'{training_text_file_name}_{line_count}.gt.txt')
        with open(line_training_text, 'w') as output_file:
            output_file.writelines([line])

        file_base_name = f'eng_{line_count}'
        output_base = os.path.join(font_output_directory, file_base_name)
        subprocess.run([
            'text2image',
            f'--font={font}',
            f'--text={line_training_text}',
            f'--outputbase={output_base}',
            '--max_pages=1',
            '--strip_unrenderable_words',
            '--leading=32',
            '--xsize=3600',
            '--ysize=480',
            '--char_spacing=1.0',
            '--exposure=0',
            '--unicharset_file=langdata/eng.unicharset'
        ])
        line_count += 1

    # Train on this font once its ground-truth files have been written
    run_tesseract_training(font)
Run it with the command *python3 split_training_text.py* (after creating
the file from step 2 below).
I trained with only two fonts, and it appeared to work the same way as it
does with one font, but I have not tested whether the result is actually
correct. You can add more fonts to the list and try it.
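Note that the script above trains one model per font. If the goal is a single model that covers all fonts, one possible variation is to write every font's files into one shared ground-truth folder. The sketch below only plans the file names and does not render or train anything. The function name `plan_ground_truth_jobs` and the model name `multifont` are made up for illustration, and it assumes tesstrain picks up every image/.gt.txt pair found in `data/<MODEL_NAME>-ground-truth`.

```python
import os


def plan_ground_truth_jobs(lines, fonts, model_name, data_dir='tesstrain/data'):
    """Return one (font, line, gt_path, output_base) tuple per font/line pair.

    Hypothetical helper: every pair lands in a single
    '<model_name>-ground-truth' folder so one model sees all fonts.
    """
    ground_truth_dir = os.path.join(data_dir, f'{model_name}-ground-truth')
    jobs = []
    for font in fonts:
        # Font names may contain spaces (e.g. 'OCRA Medium'); keep file names simple.
        font_tag = font.replace(' ', '_')
        for index, line in enumerate(lines):
            output_base = os.path.join(ground_truth_dir, f'{font_tag}_{index}')
            jobs.append((font, line, f'{output_base}.gt.txt', output_base))
    return jobs


jobs = plan_ground_truth_jobs(
    ['first line', 'second line'], ['lato', 'roboto'], 'multifont')
for font, line, gt_path, output_base in jobs:
    print(font, gt_path)
```

Each tuple carries what the loop in the script needs: the text to write to `gt_path`, and the `--font` and `--outputbase` values for text2image. Training would then be run once, with MODEL_NAME set to the shared model name.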
*2. Create a file called 'tesseract_training.py' in the 'tesstrain' folder
and paste the code below:*
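import os
import subprocess

# List of font names
font_names = ['lato', 'roboto']


def run_tesseract_training(font):
    command = (
        'TESSDATA_PREFIX=../tesseract/tessdata make training '
        f'MODEL_NAME={font} START_MODEL=eng TESSDATA=../tesseract/tessdata '
        'MAX_ITERATIONS=10000'
    )
    # The paths above are relative to the tesstrain folder, so run make there
    tesstrain_dir = os.path.dirname(os.path.abspath(__file__))
    subprocess.run(command, shell=True, cwd=tesstrain_dir)


if __name__ == '__main__':
    for font in font_names:
        run_tesseract_training(font)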
Then run: *python3 split_training_text.py*
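As a side note, the make command above can also be passed to subprocess as an argument list instead of a shell string, which avoids quoting problems if a model name ever contains spaces. This is only a sketch: `build_training_command` is a hypothetical helper, the variable values simply mirror the command above, and `cwd='tesstrain'` in the final comment assumes the script is started from the folder that contains 'tesstrain'.

```python
import os


def build_training_command(model_name, start_model='eng',
                           tessdata='../tesseract/tessdata',
                           max_iterations=10000):
    """Return (args, env) suitable for subprocess.run, without running anything."""
    args = [
        'make', 'training',
        f'MODEL_NAME={model_name}',
        f'START_MODEL={start_model}',
        f'TESSDATA={tessdata}',
        f'MAX_ITERATIONS={max_iterations}',
    ]
    # Copy the current environment and add TESSDATA_PREFIX on top of it.
    env = dict(os.environ, TESSDATA_PREFIX=tessdata)
    return args, env


args, env = build_training_command('lato')
print(' '.join(args))
# To actually start training (assumed working directory):
# subprocess.run(args, env=env, cwd='tesstrain', check=True)
```

Passing `check=True` makes a failed make run raise an error instead of being silently ignored.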
On Wednesday, 18 January, 2023 at 6:52:24 pm UTC+6 Muhammad Hamza wrote:
> I want to fine-tune ell.traineddata with multiple fonts at once. Can
> anyone explain the workflow for this scenario?
>
> subprocess.run([
> 'text2image',
> '--font=OCRA Medium',
> f'--text={line_training_text}',
> f'--outputbase={output_directory}/{file_base_name}',
> '--max_pages=1',
> '--strip_unrenderable_words',
> '--leading=32',
> '--xsize=3600',
> '--ysize=480',
> '--char_spacing=1.0',
> '--exposure=0',
> '--unicharset_file=langdata/bos.unicharset'
> ])
>
> Above, --font names only one font. Can anyone tell me how I can train with
> multiple fonts at once?
> Thanks
>