For fine-tuning, I like to use the original unicharset along with the
unicharset from the training set so that all characters are included.
Below is a modified Makefile that can be used for this; please
adapt it to your requirements.
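
To illustrate why the two unicharsets are combined, here is a minimal Python sketch that reports which characters of a transcription are not covered by a unicharset. It assumes the usual text layout of a unicharset file (first line is the entry count, each later line starts with the symbol, and the entry NULL stands for the space character). The function names and the tiny sample data are made up for illustration and are not part of Tesseract.

```python
def unicharset_symbols(lines):
    """Return the set of symbols listed in a unicharset file.

    Assumed layout: line 1 is the entry count; every later line starts
    with the symbol, followed by space-separated properties. The literal
    entry NULL is taken to mean the space character.
    """
    symbols = set()
    for line in lines[1:]:
        line = line.rstrip("\n")
        if not line:
            continue
        symbol = line.split(" ", 1)[0]
        symbols.add(" " if symbol == "NULL" else symbol)
    return symbols


def missing_characters(text, symbols):
    """List the characters of `text` that no unicharset entry covers."""
    return sorted({ch for ch in text if ch not in symbols})


# Made-up miniature unicharsets; the property columns are placeholders.
original = ["4\n", "NULL 0\n", "e 3\n", "r 3\n", "1 8\n"]
from_training = ["3\n", "NULL 0\n", "ö 3\n", "$ 0\n"]

text = "ö er 1$"
print(missing_characters(text, unicharset_symbols(original)))

# The union of both sets covers the whole transcription.
merged = unicharset_symbols(original) | unicharset_symbols(from_training)
print(missing_characters(text, merged))
```

This per-character check is only a rough guide: for scripts such as Devanagari the unicharset entries can be multi-codepoint clusters, which the sketch does not handle.
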
export

SHELL := /bin/bash
LOCAL := $(PWD)/usr
PATH := $(LOCAL)/bin:$(PATH)
HOME := /home/ubuntu
TESSDATA = $(HOME)/tessdata_best
LANGDATA = $(HOME)/langdata

# Name of the model to be built
MODEL_NAME = san

# Name of the model to continue from
CONTINUE_FROM = san

# Normalization Mode - see src/training/language_specific.sh for details
NORM_MODE = 2

# Tesseract model repo to use. Default: $(TESSDATA_REPO)
TESSDATA_REPO = _best

# Train directory
TRAIN := data/train

# BEGIN-EVAL makefile-parser --make-help Makefile

help:
	@echo ""
	@echo "  Targets"
	@echo ""
	@echo "    unicharset       Create unicharset"
	@echo "    lists            Create lists of lstmf filenames for training and eval"
	@echo "    training         Start training"
	@echo "    proto-model      Build the proto model"
	@echo "    leptonica        Build leptonica"
	@echo "    tesseract        Build tesseract"
	@echo "    tesseract-langs  Download tesseract-langs"
	@echo "    langdata         Download langdata"
	@echo "    clean            Clean all generated files"
	@echo ""
	@echo "  Variables"
	@echo ""
	@echo "    MODEL_NAME         Name of the model to be built"
	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
	@echo "    TRAIN              Train directory"
	@echo "    RATIO_TRAIN        Ratio of train / eval training data"

# END-EVAL

# Ratio of train / eval training data
RATIO_TRAIN := 0.90

ALL_BOXES = data/all-boxes
ALL_LSTMF = data/all-lstmf

# Create unicharset
unicharset: data/unicharset

# Create lists of lstmf filenames for training and eval
lists: $(ALL_LSTMF) data/list.train data/list.eval

# The first RATIO_TRAIN share of the shuffled list is used for training
data/list.train: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l`; \
	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
	   head -n "$$no" $(ALL_LSTMF) > "$@"

# The eval list starts on the line after the last training line,
# so the two lists do not overlap
data/list.eval: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l`; \
	   no=`echo "$$total * $(RATIO_TRAIN) / 1 + 1" | bc`; \
	   tail -n "+$$no" $(ALL_LSTMF) > "$@"

# Start training
training: data/$(MODEL_NAME).traineddata

data/unicharset: $(ALL_BOXES)
	# Unpack the base model; this yields its .lstm and .lstm-unicharset files
	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
	# Extract the unicharset of the new training data
	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
	# Merge both so that all characters are included
	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"

$(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"

$(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"

$(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"

$(TRAIN)/%.lstmf: $(TRAIN)/%.box
	tesseract $(TRAIN)/$*.tif $(TRAIN)/$* --psm 6 lstm.train

# Build the proto model
proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata

data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
	combine_lang_model \
	  --input_unicharset data/unicharset \
	  --pass_through_recoder \
	  --script_dir $(LANGDATA) \
	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
	  --output_dir data/ \
	  --lang $(MODEL_NAME)

data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
	mkdir -p data/checkpoints
	lstmtraining \
	  --continue_from $(TESSDATA)/$(CONTINUE_FROM).lstm \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output data/checkpoints/$(MODEL_NAME) \
	  --debug_interval -1 \
	  --train_listfile data/list.train \
	  --eval_listfile data/list.eval \
	  --sequential_training \
	  --max_iterations 3000

data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
	lstmtraining \
	  --stop_training \
	  --continue_from $^ \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output $@

# Clean all generated files
clean:
	find data/train -name '*.box' -delete
	find data/train -name '*.lstmf' -delete
	rm -rf data/all-*
	rm -rf data/list.*
	rm -rf data/$(MODEL_NAME)
	rm -rf data/unicharset
	rm -rf data/checkpoints
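
The lists target divides the shuffled list of .lstmf files into a training part and an evaluation part according to RATIO_TRAIN. As a rough illustration of the intended result, here is a minimal Python sketch of a non-overlapping split. The function name and the sample file names are hypothetical; Decimal is used so that the count is truncated the same way integer division in bc would truncate it.

```python
from decimal import Decimal


def split_train_eval(paths, ratio_train="0.90"):
    """Split a list of file names into (train, eval) without overlap."""
    # Truncate towards zero, like `total * ratio / 1` in bc with scale 0.
    n_train = int(Decimal(str(ratio_train)) * len(paths))
    return paths[:n_train], paths[n_train:]


# Hypothetical file names standing in for the shuffled all-lstmf list.
paths = [f"data/train/line{i:02d}.lstmf" for i in range(10)]
train, evaluation = split_train_eval(paths)
print(len(train), len(evaluation))
```

Every file ends up in exactly one of the two lists, which is what you want when the eval error is supposed to reflect unseen lines.
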
On Tue, Sep 4, 2018 at 4:48 PM, Raniem AROUR <[email protected]> wrote:
> Hello..
>
> I am trying to fine-tune dan.traineddata for my specific use case.
> However, the model is overfitting on my data and seems to be forgetting
> the original data it was trained on. I remember reading somewhere that
> this can be solved by showing the original training data to the network so
> that I don't get a regression from the original performance.
>
> I have images and their corresponding ground truth files, so I
> used ocrd-train <https://github.com/OCR-D/ocrd-train> to do the fine
> tuning earlier (using some advice found in this thread
> <https://groups.google.com/forum/#!searchin/tesseract-ocr/fine$20tuning$20english$20language%7Csort:date/tesseract-ocr/be4-rjvY2tQ/32evtMHlAQAJ>,
> thanks to Shree).
> I then mixed my training data with the original training data using
> the hints provided by Shree in this thread
> <https://github.com/tesseract-ocr/tesseract/issues/1172>.
>
> The command I used after updating tesstrain.sh as recommended was:
>
> ~/tesseract/src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang dan --linedata_only \
>   --noextract_font_properties --langdata_dir /home/my_user/ocrd-train/langdata \
>   --tessdata_dir /home/my_user/tesseract/tessdata \
>   --output_dir /home/my_user/my_models/danNew/
>
> Then I tried to run "make training" in the ocrd-train directory as I
> usually do for fine tuning. The fine tuning started, but I got some
> errors that I believe are caused by the original data:
> e.g. Encoding of string failed! Failure bytes: ffffffc3 ffffffb6 20 65 72
> 20 31 2e 34 35 24 2e 20 74 69 64 6c 69 67 65 72 65 20 31 37 2e 20 68 61 76
> 65 20 6d 61 6e 67 65 20 4e 59 20 2d 20 76 ffffffc3 ffffffa6 72 65 20 69 20
> 53 ffffffc3 ffffff85 20 43 61 6e 61 6c 2b 20 6f 67
> Can't encode transcription: 'har Søg butik været blevet Ifö er 1.45$.
> tidligere 17. have mange NY - være i SÅ Canal+ og' in language ''
> Encoding of string failed! Failure bytes: ffffffc3 ffffffb6 20 65 72 20 31
> 2e 34 35 24 2e 20 74 69 64 6c 69 67 65 72 65 20 31 37 2e 20 68 61 76 65 20
> 6d 61 6e 67 65 20 4e 59 20 2d 20 76 ffffffc3 ffffffa6 72 65 20 69 20 53
> ffffffc3 ffffff85 20 43 61 6e 61 6c 2b 20 6f 67
> Can't encode transcription: 'har Søg butik været blevet Ifö er 1.45$.
> tidligere 17. have mange NY - være i SÅ Canal+ og' in language ''
>
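
The failure bytes in the log above can be turned back into text to see where encoding stopped. Values of 0x80 and above are printed sign-extended (ffffffc3 rather than c3), so only the low byte of each value is kept. A minimal Python sketch, with a function name of my own choosing (it is not part of Tesseract):

```python
def decode_failure_bytes(dump):
    """Turn a 'Failure bytes' dump from the training log back into text."""
    # Keep the low 8 bits of each token to undo the sign extension,
    # then interpret the result as UTF-8.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8")


# The first few values from the log quoted above.
dump = "ffffffc3 ffffffb6 20 65 72 20 31 2e 34 35 24 2e"
print(decode_failure_bytes(dump))
```

The dump begins with the bytes for "ö", which suggests that this character is missing from the unicharset in use. That is an inference from the log and I have not verified it against the Danish model.
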
> P.S. I know the box files produced by ocrd-train look different from the
> usual box files used for training Tesseract 4, but they worked for
> fine-tuning other models, and I was wondering whether it is a bad idea
> to mix them this way.
>
> What could have gone wrong in this process? I appreciate every
> suggestion.
>
>
> Kind Regards
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/tesseract-ocr/e9676a7b-7396-4d05-8978-97c9bfbc387f%
> 40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/e9676a7b-7396-4d05-8978-97c9bfbc387f%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com