Hi Ayush,
usually images are denoised much more than this. I think the standard models
are trained on pure black text on a pure white background, maybe with a
little noise. It could still work on these images, especially with fine
tuning, but since this is not the typical training data I'm not surprised
you are having problems.

Anyway, I think your problem here is with segmentation, not with the LSTM
model. I suppose segmentation is done with thresholding and component
analysis, and both are quite sensitive to noise.

I suspect the problem with the SW-something image might be the small
fragments on top, while the 3-M-something image is probably fooled by the
red line at the bottom. You could do component analysis to clear these
fragments, but with this amount of noise it is very hard. If you can, try
to crop tighter (and see if it helps).
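For reference, the component-analysis cleanup could look like this minimal pure-Python sketch over a binarized 0/1 grid. With real images you would binarize first (e.g. Otsu) and likely use OpenCV's connected-components functions instead; `min_size` here is a tuning knob, not a recommended value.

```python
# Drop connected components smaller than min_size from a binary image,
# the usual way small noise specks get cleared before segmentation.
from collections import deque

def remove_small_components(grid, min_size):
    """grid: list of lists with 1 = ink, 0 = background (modified in place)."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    for sy in range(h):
        for sx in range(w):
            if grid[sy][sx] != 1 or seen[sy][sx]:
                continue
            # Flood-fill one 4-connected component, collecting its pixels.
            comp, queue = [], deque([(sy, sx)])
            seen[sy][sx] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Erase components below the size threshold (likely noise).
            if len(comp) < min_size:
                for y, x in comp:
                    grid[y][x] = 0
    return grid
```

Whether this helps depends entirely on how the noise size compares to the stroke size of your text, which is why tighter cropping is often the easier fix.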

About your questions:
1. I suppose the segmentation step during training is different: it should
use the box files rather than doing the page analysis. PSM 6/7 do some
extra cleanup. I do not know why it fails.
2. I do the training with PSM 6 and, for one model, I use 13 at runtime and
it works fine. A fine-tuning run usually takes me less than one hour, so
when I have doubts like these I just try all the alternatives and see what
works best on the eval set.
3. No idea. Check the box file too.
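"Check the box file" can be partly automated. Below is a hedged sketch that compares the text embedded in a box file against the .gt.txt ground truth; it assumes the WordStr-style line boxes that tesstrain's generate_line_box.py emits (a `WordStr left bottom right top page #text` line followed by a tab-prefixed terminator line) — verify that against your actual box files.

```python
# Hypothetical helper: check that a line box file's embedded text matches
# the ground truth before suspecting the lstmf file itself.
def box_text(box_lines):
    """Collect the text stored after '#' on each WordStr line."""
    parts = []
    for line in box_lines:
        if line.startswith("WordStr") and "#" in line:
            parts.append(line.split("#", 1)[1].rstrip("\n"))
    return " ".join(parts)

def box_matches_gt(box_lines, gt_text):
    """True if the box file text equals the stripped .gt.txt content."""
    return box_text(box_lines) == gt_text.strip()
```

The lstmf file itself is a serialized binary, so checking the box file and the image it was built from is the practical sanity check.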

4. I manually do incremental training: 100 iterations, save the model, run
lstmeval, 200 iterations, save the model, run lstmeval, and so on.

See this thread:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/COJ4IjcrL6s/GnvIpZ2uBgAJ
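That manual schedule can be sketched as a small Python driver. All paths, model names, and the checkpoint prefix below are hypothetical placeholders; the lstmtraining/lstmeval flags are the ones already used in the Makefiles in this thread.

```python
# Sketch of the incremental train/eval loop: raise --max_iterations,
# retrain, then score the current checkpoint on the eval list.
import subprocess

CKPT_PREFIX = "data/checkpoints/eng"  # placeholder checkpoint prefix

def training_cmd(max_iter):
    # One lstmtraining invocation; with an existing checkpoint under
    # --model_output, training resumes from it up to max_iter.
    return ["lstmtraining",
            "--continue_from", "tessdata/eng.lstm",
            "--old_traineddata", "tessdata/eng.traineddata",
            "--traineddata", "data/eng/eng.traineddata",
            "--model_output", CKPT_PREFIX,
            "--train_listfile", "data/list.train",
            "--eval_listfile", "data/list.eval",
            "--max_iterations", str(max_iter)]

def eval_cmd():
    # Score the latest checkpoint on the held-out list.
    return ["lstmeval",
            "--model", CKPT_PREFIX + "_checkpoint",
            "--traineddata", "data/eng/eng.traineddata",
            "--eval_listfile", "data/list.eval"]

def run_schedule(steps, run=subprocess.run):
    # e.g. steps = (100, 300, 600): train a bit, evaluate, train more, ...
    for max_iter in steps:
        run(training_cmd(max_iter), check=True)
        run(eval_cmd(), check=True)
```

The `run` parameter is injectable so the loop can be dry-run without the Tesseract binaries installed.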

5. I do not know.


Lorenzo

On Sun, 8 Sep 2019 at 16:05, Ayush Pandey <xapianay...@gmail.com>
wrote:

> Hi Lorenzo, Shree
>
>    - Here is the link to the images for which no lstmf files were
>    generated ->
>    https://drive.google.com/drive/folders/1VDBPB_k-oOXbWUI3zIlB3ljuyIlOkoMK?usp=sharing
>    - Here is the Makefile that I used for generating lstmf files ->
>    https://drive.google.com/open?id=15vvRMM03AOqoHKecEIx8NRTeU0y_kREy. I
>    used Lorenzo's suggestion to create another target "train-lists" to avoid
>    creating the training and the eval list again and again.
>    - Tesseract Version: 4.1.0
>    - I am using
>    
> https://github.com/tesseract-ocr/tesstrain/blob/master/generate_line_box.py to
>    generate .box files.
>    - My images are in .tif format. I am saving my images using OpenCV
>    imwrite.
>
> I have a few questions:
>
>    1. In the link provided by Shree ->
>    
> https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#tesseract-fails-to-create-lstm-files.
>    It says that .lstmf files are not generated for some images if you use
>    the default list.train settings. Using PSM 13 helps build those lstmf
>    files, whereas using PSM 6 or 7 skips them. Any clues as to why that is
>    the case? Tesseract does give me output text for these images with PSM
>    values 6, 7 and 13.
>    2. If I use PSM 13 for generating the lstmf files used for training,
>    will it be okay to use PSM values 6 and 7 while testing?
>    3. How can I check the contents of lstmf files to see if they contain
>    the ground truth text and the image data correctly?
>    4. Side question: lstmtraining saves checkpoints in the following
>    format: loss_iteration. It keeps the checkpoints for the few iterations
>    with the best loss (apart from eng_checkpoint, which contains the
>    metadata, I guess). Is the loss calculated on the training data or the
>    evaluation data? Is there a way to save all checkpoints?
>    5. Side question: Does lstmeval use the PSM value with which the
>    lstmf file was generated for evaluation?
>
> I know it's a lot of questions and doubts. I thank you for your time in
> helping me out.
>
> On Friday, September 6, 2019 at 2:54:49 PM UTC+5:30, Lorenzo Blz wrote:
>>
>> Hi Ayush,
>> psm 6 and 7 do some extra pre-processing of the image, 13 does much less.
>>
>> Unless your image contains text like this:
>>
>> ----
>> ====
>> ....
>>
>> I would not expect much difference between PSM 6/7 and 13. While PSM 13
>> solves some problems, I got more "ghost letter" errors (letters that are
>> repeated more than once, or split into similar variations, like O becoming
>> O0). So this may not be an ideal solution.
>>
>> Also there is no reason why a clean single line of text should not work
>> with 6 or 7.
>>
>> For some single line images with messy background I found that PSM 6
>> works better than 7.
>>
>>
>> Lorenzo
>>
>> On Fri, 6 Sep 2019 at 11:04, Ayush Pandey <xapia...@gmail.com>
>> wrote:
>>
>>> Hi Lorenzo. The empty output was due to the fact that I was using 7 as
>>> the PSM parameter. Using 13 as the PSM parameter completely eliminated
>>> the problem.
>>>
>>> On Friday, September 6, 2019 at 12:34:22 PM UTC+5:30, Lorenzo Blz wrote:
>>>>
>>>> Can you please share an example?
>>>>
>>>> An empty output usually means that it failed to recognize the black
>>>> parts as text. This could be because the text is too big or too small,
>>>> or because of a wrong dpi setting. Or the image is not reasonably clean.
>>>>
>>>> To better understand the problem you can try to downscale the images
>>>> (according to some tests done by a user on this forum, 35/50px is what
>>>> worked best for him), try different dpi settings, remove borders,
>>>> denoise, etc. Compare images that work with the ones that do not.
>>>>
>>>>
>>>>
>>>> Lorenzo
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, 5 Sep 2019 at 10:48, Ayush Pandey <
>>>> xapia...@gmail.com> wrote:
>>>>
>>>>> Hi shree,
>>>>>              Thank you so much for your response. I also wanted to
>>>>> ask: I get an empty output on a lot of images after training, even
>>>>> though the height and width of the images is usually > 100 pixels.
>>>>> Apart from changing the psm value, is there any other way to reduce
>>>>> this?
>>>>>
>>>>> On Thursday, September 5, 2019 at 2:00:20 PM UTC+5:30, shree wrote:
>>>>>>
>>>>>> See
>>>>>> https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#tesseract-fails-to-create-lstm-files
>>>>>>
>>>>>> On Thu, Sep 5, 2019 at 1:25 PM Ayush Pandey <xapia...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Tesseract Version: 4.1.0
>>>>>>>
>>>>>>> I am trying to fine tune tesseract on custom dataset with the
>>>>>>> following Makefile:
>>>>>>>
>>>>>>> export
>>>>>>>
>>>>>>> SHELL := /bin/bash
>>>>>>> HOME := $(PWD)
>>>>>>> TESSDATA = $(HOME)/tessdata
>>>>>>> LANGDATA = $(HOME)/langdata
>>>>>>>
>>>>>>> # Train directory
>>>>>>> # TRAIN := $(HOME)/train_data
>>>>>>> TRAIN := /media/vimaan/Data/OCR/tesseract_train
>>>>>>>
>>>>>>> # Name of the model to be built
>>>>>>> MODEL_NAME = eng
>>>>>>> LANG_CODE = eng
>>>>>>>
>>>>>>> # Name of the model to continue from
>>>>>>> CONTINUE_FROM = eng
>>>>>>>
>>>>>>> TESSDATA_REPO = _best
>>>>>>>
>>>>>>> # Normalization Mode - see src/training/language_specific.sh for details
>>>>>>> NORM_MODE = 1
>>>>>>>
>>>>>>> # BEGIN-EVAL makefile-parser --make-help Makefile
>>>>>>>
>>>>>>> help:
>>>>>>>         @echo ""
>>>>>>>         @echo "  Targets"
>>>>>>>         @echo ""
>>>>>>>         @echo "    unicharset       Create unicharset"
>>>>>>>         @echo "    lists            Create lists of lstmf filenames for training and eval"
>>>>>>>         @echo "    training         Start training"
>>>>>>>         @echo "    proto-model      Build the proto model"
>>>>>>>         @echo "    leptonica        Build leptonica"
>>>>>>>         @echo "    tesseract        Build tesseract"
>>>>>>>         @echo "    tesseract-langs  Download tesseract-langs"
>>>>>>>         @echo "    langdata         Download langdata"
>>>>>>>         @echo "    clean            Clean all generated files"
>>>>>>>         @echo ""
>>>>>>>         @echo "  Variables"
>>>>>>>         @echo ""
>>>>>>>         @echo "    MODEL_NAME         Name of the model to be built"
>>>>>>>         @echo "    CORES              No of cores to use for compiling leptonica/tesseract"
>>>>>>>         @echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
>>>>>>>         @echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
>>>>>>>         @echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
>>>>>>>         @echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
>>>>>>>         @echo "    TRAIN              Train directory"
>>>>>>>         @echo "    RATIO_TRAIN        Ratio of train / eval training data"
>>>>>>>
>>>>>>> # END-EVAL
>>>>>>>
>>>>>>> # Ratio of train / eval training data
>>>>>>> RATIO_TRAIN := 0.90
>>>>>>>
>>>>>>> ALL_BOXES = data/all-boxes
>>>>>>> ALL_LSTMF = data/all-lstmf
>>>>>>>
>>>>>>> # Create unicharset
>>>>>>> unicharset: data/unicharset
>>>>>>>
>>>>>>> # Create lists of lstmf filenames for training and eval
>>>>>>> #lists: $(ALL_LSTMF) data/list.train data/list.eval
>>>>>>> lists: $(ALL_LSTMF)
>>>>>>>
>>>>>>> train-lists: data/list.train data/list.eval
>>>>>>>
>>>>>>> data/list.train: $(ALL_LSTMF)
>>>>>>>         total=`cat $(ALL_LSTMF) | wc -l` \
>>>>>>>            no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
>>>>>>>            head -n "$$no" $(ALL_LSTMF) > "$@"
>>>>>>>
>>>>>>> data/list.eval: $(ALL_LSTMF)
>>>>>>>         total=`cat $(ALL_LSTMF) | wc -l` \
>>>>>>>            no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
>>>>>>>            tail -n "$$no" $(ALL_LSTMF) > "$@"
>>>>>>>
>>>>>>> # Start training
>>>>>>> training: data/$(MODEL_NAME).traineddata
>>>>>>>
>>>>>>> data/unicharset: $(ALL_BOXES)
>>>>>>>         mkdir -p data/$(START_MODEL)
>>>>>>>         combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata  $(TESSDATA)/$(CONTINUE_FROM).
>>>>>>>         unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
>>>>>>>         #merge_unicharsets data/$(START_MODEL)/$(START_MODEL).lstm-unicharset $(GROUND_TRUTH_DIR)/my.unicharset  "$@"
>>>>>>>         merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"
>>>>>>>
>>>>>>> $(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
>>>>>>>         find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"
>>>>>>>
>>>>>>> $(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%.gt.txt
>>>>>>>         python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*.gt.txt" > "$@"
>>>>>>>
>>>>>>> $(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
>>>>>>>         find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"
>>>>>>>
>>>>>>> $(TRAIN)/%.lstmf: $(TRAIN)/%.box
>>>>>>>         tesseract $(TRAIN)/$*.tif $(TRAIN)/$* --dpi 300 --psm 7 lstm.train
>>>>>>>
>>>>>>> # Build the proto model
>>>>>>> proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata
>>>>>>>
>>>>>>> data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
>>>>>>>         combine_lang_model \
>>>>>>>           --input_unicharset data/unicharset \
>>>>>>>           --script_dir $(LANGDATA) \
>>>>>>>           --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
>>>>>>>           --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
>>>>>>>           --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
>>>>>>>           --output_dir data/ \
>>>>>>>           --lang $(MODEL_NAME)
>>>>>>>
>>>>>>> data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset proto-model
>>>>>>>         mkdir -p data/checkpoints
>>>>>>>         lstmtraining \
>>>>>>>           --continue_from   $(TESSDATA)/$(CONTINUE_FROM).lstm \
>>>>>>>           --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
>>>>>>>           --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
>>>>>>>           --model_output data/checkpoints/$(MODEL_NAME) \
>>>>>>>           --debug_interval -1 \
>>>>>>>           --train_listfile data/list.train \
>>>>>>>           --eval_listfile data/list.eval \
>>>>>>>           --sequential_training \
>>>>>>>           --max_iterations 170000
>>>>>>>
>>>>>>> data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
>>>>>>>         lstmtraining \
>>>>>>>         --stop_training \
>>>>>>>         --continue_from $^ \
>>>>>>>         --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
>>>>>>>         --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
>>>>>>>         --model_output $@
>>>>>>>
>>>>>>> # Clean all generated files
>>>>>>> clean:
>>>>>>>         find data/train -name '*.box' -delete
>>>>>>>         find data/train -name '*.lstmf' -delete
>>>>>>>         rm -rf data/all-*
>>>>>>>         rm -rf data/list.*
>>>>>>>         rm -rf data/$(MODEL_NAME)
>>>>>>>         rm -rf data/unicharset
>>>>>>>         rm -rf data/checkpoints
>>>>>>>
>>>>>>> The number of .lstmf files being generated is significantly lower
>>>>>>> than the number of .box files.
>>>>>>> For eg:
>>>>>>> Number of .tif files: 10k
>>>>>>> Number of .gt.txt files: 10k
>>>>>>> Number of .box files: 10k
>>>>>>> Number of .lstmf files: 8k.
>>>>>>> Could anyone point me to possible reasons for this issue?
>>>>>>>
>>>>>>> On Friday, June 29, 2018 at 5:39:09 PM UTC+5:30, shree wrote:
>>>>>>>>
>>>>>>>> I modified the makefile for ocrd-train to do fine-tuning.  It is
>>>>>>>> pasted below:
>>>>>>>>
>>>>>>>> export
>>>>>>>>
>>>>>>>> SHELL := /bin/bash
>>>>>>>> LOCAL := $(PWD)/usr
>>>>>>>> PATH := $(LOCAL)/bin:$(PATH)
>>>>>>>> HOME := /home/ubuntu
>>>>>>>> TESSDATA =  $(HOME)/tessdata_best
>>>>>>>> LANGDATA = $(HOME)/langdata
>>>>>>>>
>>>>>>>> # Name of the model to be built
>>>>>>>> MODEL_NAME = frk
>>>>>>>>
>>>>>>>> # Name of the model to continue from
>>>>>>>> CONTINUE_FROM = frk
>>>>>>>>
>>>>>>>> # Normalization Mode - see src/training/language_specific.sh for
>>>>>>>> details
>>>>>>>> NORM_MODE = 2
>>>>>>>>
>>>>>>>> # Tesseract model repo to use. Default: $(TESSDATA_REPO)
>>>>>>>> TESSDATA_REPO = _best
>>>>>>>>
>>>>>>>> # Train directory
>>>>>>>> TRAIN := data/train
>>>>>>>>
>>>>>>>> # BEGIN-EVAL makefile-parser --make-help Makefile
>>>>>>>>
>>>>>>>> help:
>>>>>>>> 	@echo ""
>>>>>>>> 	@echo "  Targets"
>>>>>>>> 	@echo ""
>>>>>>>> 	@echo "    unicharset       Create unicharset"
>>>>>>>> 	@echo "    lists            Create lists of lstmf filenames for training and eval"
>>>>>>>> 	@echo "    training         Start training"
>>>>>>>> 	@echo "    proto-model      Build the proto model"
>>>>>>>> 	@echo "    leptonica        Build leptonica"
>>>>>>>> 	@echo "    tesseract        Build tesseract"
>>>>>>>> 	@echo "    tesseract-langs  Download tesseract-langs"
>>>>>>>> 	@echo "    langdata         Download langdata"
>>>>>>>> 	@echo "    clean            Clean all generated files"
>>>>>>>> 	@echo ""
>>>>>>>> 	@echo "  Variables"
>>>>>>>> 	@echo ""
>>>>>>>> 	@echo "    MODEL_NAME         Name of the model to be built"
>>>>>>>> 	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
>>>>>>>> 	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
>>>>>>>> 	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
>>>>>>>> 	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
>>>>>>>> 	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
>>>>>>>> 	@echo "    TRAIN              Train directory"
>>>>>>>> 	@echo "    RATIO_TRAIN        Ratio of train / eval training data"
>>>>>>>>
>>>>>>>> # END-EVAL
>>>>>>>>
>>>>>>>> # Ratio of train / eval training data
>>>>>>>> RATIO_TRAIN := 0.90
>>>>>>>>
>>>>>>>> ALL_BOXES = data/all-boxes
>>>>>>>> ALL_LSTMF = data/all-lstmf
>>>>>>>>
>>>>>>>> # Create unicharset
>>>>>>>> unicharset: data/unicharset
>>>>>>>>
>>>>>>>> # Create lists of lstmf filenames for training and eval
>>>>>>>> lists: $(ALL_LSTMF) data/list.train data/list.eval
>>>>>>>>
>>>>>>>> data/list.train: $(ALL_LSTMF)
>>>>>>>> 	total=`cat $(ALL_LSTMF) | wc -l` \
>>>>>>>> 	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
>>>>>>>> 	   head -n "$$no" $(ALL_LSTMF) > "$@"
>>>>>>>>
>>>>>>>> data/list.eval: $(ALL_LSTMF)
>>>>>>>> 	total=`cat $(ALL_LSTMF) | wc -l` \
>>>>>>>> 	   no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
>>>>>>>> 	   tail -n "+$$no" $(ALL_LSTMF) > "$@"
>>>>>>>>
>>>>>>>> # Start training
>>>>>>>> training: data/$(MODEL_NAME).traineddata
>>>>>>>>
>>>>>>>> data/unicharset: $(ALL_BOXES)
>>>>>>>> 	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
>>>>>>>> 	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
>>>>>>>> 	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"
>>>>>>>>
>>>>>>>> $(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
>>>>>>>> 	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"
>>>>>>>>
>>>>>>>> $(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
>>>>>>>> 	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"
>>>>>>>>
>>>>>>>> $(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
>>>>>>>> 	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"
>>>>>>>>
>>>>>>>> $(TRAIN)/%.lstmf: $(TRAIN)/%.box
>>>>>>>> 	tesseract $(TRAIN)/$*.tif $(TRAIN)/$*   --psm 6 lstm.train
>>>>>>>>
>>>>>>>> # Build the proto model
>>>>>>>> proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata
>>>>>>>>
>>>>>>>> data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
>>>>>>>> 	combine_lang_model \
>>>>>>>> 	  --input_unicharset data/unicharset \
>>>>>>>> 	  --script_dir $(LANGDATA) \
>>>>>>>> 	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
>>>>>>>> 	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
>>>>>>>> 	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
>>>>>>>> 	  --output_dir data/ \
>>>>>>>> 	  --lang $(MODEL_NAME)
>>>>>>>>
>>>>>>>> data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
>>>>>>>> 	mkdir -p data/checkpoints
>>>>>>>> 	lstmtraining \
>>>>>>>> 	  --continue_from   $(TESSDATA)/$(CONTINUE_FROM).lstm \
>>>>>>>> 	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
>>>>>>>> 	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
>>>>>>>> 	  --model_output data/checkpoints/$(MODEL_NAME) \
>>>>>>>> 	  --debug_interval -1 \
>>>>>>>> 	  --train_listfile data/list.train \
>>>>>>>> 	  --eval_listfile data/list.eval \
>>>>>>>> 	  --sequential_training \
>>>>>>>> 	  --max_iterations 3000
>>>>>>>>
>>>>>>>> data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
>>>>>>>> 	lstmtraining \
>>>>>>>> 	--stop_training \
>>>>>>>> 	--continue_from $^ \
>>>>>>>> 	--old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
>>>>>>>> 	--traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
>>>>>>>> 	--model_output $@
>>>>>>>>
>>>>>>>> # Clean all generated files
>>>>>>>> clean:
>>>>>>>> 	find data/train -name '*.box' -delete
>>>>>>>> 	find data/train -name '*.lstmf' -delete
>>>>>>>> 	rm -rf data/all-*
>>>>>>>> 	rm -rf data/list.*
>>>>>>>> 	rm -rf data/$(MODEL_NAME)
>>>>>>>> 	rm -rf data/unicharset
>>>>>>>> 	rm -rf data/checkpoints
>>>>>>>>
>>>>>>>> On Fri, Jun 29, 2018 at 5:31 PM Lorenzo Bolzani <l.bo...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I'm trying to do fine tuning of an existing model using line
>>>>>>>>> images and text labels. I'm running this version:
>>>>>>>>>
>>>>>>>>> tesseract 4.0.0-beta.3-56-g5fda
>>>>>>>>>  leptonica-1.76.0
>>>>>>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54
>>>>>>>>> : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
>>>>>>>>>  Found AVX2
>>>>>>>>>  Found AVX
>>>>>>>>>  Found SSE
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I used OCR-D to generate lstmf files for the demo data.
>>>>>>>>>
>>>>>>>>> If I run the make command it works fine.
>>>>>>>>>
>>>>>>>>> make training MODEL_NAME=prova
>>>>>>>>>
>>>>>>>>> Now I isolated this command from the build:
>>>>>>>>>
>>>>>>>>> lstmtraining \
>>>>>>>>>   --traineddata data/prova/prova.traineddata \
>>>>>>>>>   --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256
>>>>>>>>> O1c`head -n1 data/unicharset`]" \
>>>>>>>>>   --model_output data/checkpoints/prova \
>>>>>>>>>   --learning_rate 20e-4 \
>>>>>>>>>   --train_listfile data/list.train \
>>>>>>>>>   --eval_listfile data/list.eval \
>>>>>>>>>   --max_iterations 10000
>>>>>>>>>
>>>>>>>>> and it works fine.
>>>>>>>>>
>>>>>>>>> Now I'm trying to modify it to fine tune the existing eng model. I
>>>>>>>>> made a few attempts, all ending into different errors (see the 
>>>>>>>>> attached
>>>>>>>>> file for full output).
>>>>>>>>>
>>>>>>>>> I used:
>>>>>>>>>
>>>>>>>>> combine_tessdata -e /usr/local/share/tessdata/eng.traineddata
>>>>>>>>> extracted/eng.lstm
>>>>>>>>>
>>>>>>>>> to extract the eng.lstm model.
>>>>>>>>>
>>>>>>>>> This seems to work, but I'm not sure it is correct.
>>>>>>>>>
>>>>>>>>> lstmtraining \
>>>>>>>>>   --continue_from  extracted/eng.lstm \
>>>>>>>>>   --traineddata data/prova/prova.traineddata \
>>>>>>>>>   --old_traineddata extracted/eng.traineddata \
>>>>>>>>>   --model_output data/checkpoints/prova \
>>>>>>>>>   --learning_rate 20e-4 \
>>>>>>>>>   --train_listfile data/list.train \
>>>>>>>>>   --eval_listfile data/list.eval \
>>>>>>>>>   --max_iterations 10000
>>>>>>>>>
>>>>>>>>> (extracted/eng.traineddata is just a copy of eng.traineddata)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The training resumes exactly with the RMS of prova_checkpoint (6%),
>>>>>>>>> so it looks like it is training from that checkpoint, not from
>>>>>>>>> eng.lstm.
>>>>>>>>>
>>>>>>>>> Is this correct? What should I change?
>>>>>>>>> I'm following this guide:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
>>>>>>>>>
>>>>>>>>> I think continue_from and traineddata should refer to the eng
>>>>>>>>> model and old_traineddata should point to prova.traineddata, but if I 
>>>>>>>>> do
>>>>>>>>> that I get a segmentation fault:
>>>>>>>>>
>>>>>>>>> [...]
>>>>>>>>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>>>>>>>>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>>>>>>>>> Segmentation fault
>>>>>>>>>
>>>>>>>>> What am I missing?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks, bye
>>>>>>>>>
>>>>>>>>> Lorenzo
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesser...@googlegroups.com.
>>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLyOJN31PdWQumXPO3JjuAc1Yz2BZYpMd4ftzBHgZkEaxA%40mail.gmail.com
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLyOJN31PdWQumXPO3JjuAc1Yz2BZYpMd4ftzBHgZkEaxA%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> ____________________________________________________________
>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>
