Re: [tesseract-ocr] Fine tuning existing model

2019-09-08 Thread Lorenzo Bolzani
Hi Ayush, usually images are denoised much more. I think the standard models are trained on pure black on pure white background, maybe with a little noise. I think it could work even on these images especially with fine tuning. But this is not the typical training data, I'm not surprised you have

Re: [tesseract-ocr] Fine tuning existing model

2019-09-08 Thread Ayush Pandey
Hi Lorenzo, Shree - Here is the link to the images for which no lstmf files were generated -> https://drive.google.com/drive/folders/1VDBPB_k-oOXbWUI3zIlB3ljuyIlOkoMK?usp=sharing . - Here is the Makefile that I used for generating lstmf files ->

Re: [tesseract-ocr] Fine tuning existing model

2019-09-06 Thread Lorenzo Bolzani
Hi Ayush, PSM 6 and 7 do some extra pre-processing of the image; 13 does much less. Unless your image contains text like this: I would not expect much difference between PSM 6/7 and 13. While PSM 13 solves some problems, I got more "ghost letters" errors (letters that are repeated

Re: [tesseract-ocr] Fine tuning existing model

2019-09-06 Thread Ayush Pandey
Hi Lorenzo. The empty output was because I was using 7 as the PSM parameter. Using 13 as the PSM parameter completely eliminated the problem. On Friday, September 6, 2019 at 12:34:22 PM UTC+5:30, Lorenzo Blz wrote: > > Can you please share an example? > > An empty output usually means that
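The PSM comparison discussed in this exchange can be scripted. The following is a minimal Python sketch, assuming the `tesseract` binary is on the PATH; `line.png` and the `eng` language are placeholders, and `tesseract_commands` is a hypothetical helper name. It only builds the command lines so that the same image can be tried with several page segmentation modes. It does not run them.

```python
import shlex


def tesseract_commands(image_path, psm_values=(6, 7, 13), lang="eng"):
    """Build one `tesseract` command line per page segmentation mode.

    `image_path` and `lang` are placeholders for your own line image
    and traineddata name.
    """
    return [
        ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]
        for psm in psm_values
    ]


# Print the commands so they can be pasted into a shell and compared.
for cmd in tesseract_commands("line.png"):
    print(shlex.join(cmd))
```

Each list can also be passed to `subprocess.run` to collect the output for every mode and spot the images that come back empty with one mode but not another.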

Re: [tesseract-ocr] Fine tuning existing model

2019-09-06 Thread Lorenzo Bolzani
Can you please share an example? An empty output usually means that it failed to recognize the black parts as text. This could be because the text is too big or too small, or because of a wrong DPI setting. Or the image is not reasonably clean. To better understand the problem you can try to downscale the

Re: [tesseract-ocr] Fine tuning existing model

2019-09-05 Thread Ayush Pandey
Hi Shree, Thank you so much for your response. I also wanted to ask: I get an empty output on a lot of images after training. The height and width of the images in pixels is usually > 100. Apart from changing the psm value, is there any other way to reduce this? On Thursday,

Re: [tesseract-ocr] Fine tuning existing model

2019-09-05 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#tesseract-fails-to-create-lstm-files On Thu, Sep 5, 2019 at 1:25 PM Ayush Pandey wrote: > Tesseract Version: 4.1.0 > > I am trying to fine tune tesseract on custom dataset with the following > Makefile: > > export > > SHELL :=

Re: [tesseract-ocr] Fine tuning existing model

2019-09-05 Thread Ayush Pandey
Tesseract Version: 4.1.0

I am trying to fine tune tesseract on custom dataset with the following Makefile:

export

SHELL := /bin/bash
HOME := $(PWD)
TESSDATA = $(HOME)/tessdata
LANGDATA = $(HOME)/langdata

# Train directory
# TRAIN := $(HOME)/train_data
TRAIN :=

Re: [tesseract-ocr] Fine tuning existing model

2019-05-03 Thread Tairen Chen
Thank you for your further explanation, Shree!! On Friday, May 3, 2019 at 2:59:12 AM UTC-7, shree wrote: > > >There are three model sizes: best, normal and fast. Each of these can > also be converted to an integer model. > > Only `best` can be converted to integer and in fact the LSTM models in

Re: [tesseract-ocr] Fine tuning existing model

2019-05-03 Thread Tairen Chen
Hi, Lorenzo, Thank you very much for your reply. It really gives me more clues about the training. All the best, Tairen On Friday, May 3, 2019 at 2:30:12 AM UTC-7, Lorenzo Blz wrote: > > See answer inline. > > On Fri, May 3, 2019 at 03:48, Tairen Chen > wrote

Re: [tesseract-ocr] Fine tuning existing model

2019-05-03 Thread Lorenzo Bolzani
Shree, thanks for the clarification. On Fri, May 3, 2019 at 11:59, Shree Devi Kumar < shreesh...@gmail.com> wrote: > >There are three model sizes: best, normal and fast. Each of these can > also be converted to an integer model. > > Only `best` can be converted to integer and in

Re: [tesseract-ocr] Fine tuning existing model

2019-05-03 Thread Shree Devi Kumar
>There are three model sizes: best, normal and fast. Each of these can also be converted to an integer model. Only `best` can be converted to integer and in fact the LSTM models in `tessdata` are the integer versions of best along with the base/legacy models. `fast` models have been trained with

Re: [tesseract-ocr] Fine tuning existing model

2019-05-03 Thread Lorenzo Bolzani
See answer inline. On Fri, May 3, 2019 at 03:48, Tairen Chen wrote: > > 1. I define the "--max_iterations 2" but the training stops at > 5700, like below: > " At iteration 351/5700/5700, Mean rms=0.117%, delta=0%, char > train=0%, word train=0%, skip ratio=0%,

Re: [tesseract-ocr] Fine tuning existing model

2019-05-02 Thread Tairen Chen
Thank you very much for your quick answer, Lorenzo! You are right, there was an extra space at the beginning of the line where "TESSDATA" is defined, not on the "lstmtraining" line. I still have a few questions I would like to ask. 1. I define the "--max_iterations 2" but the

Re: [tesseract-ocr] Fine tuning existing model

2019-05-02 Thread Lorenzo Bolzani
Hi Tairen, the error is quite clear: "Must provide a --traineddata see training wiki". You say that it works if you run it as a single line, so I suppose there is something wrong in the Makefile, probably a typo. Maybe there is a space or a tab after a "\"? Maybe there are some extra characters
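One of the guesses above, whitespace after a line-continuation "\", is easy to miss by eye and can be checked mechanically. The following is a minimal Python sketch of such a check; `find_broken_continuations` is a hypothetical helper, and the sample recipe and its path are made up for illustration. It covers only the backslash case, not stray spaces in variable definitions.

```python
import re

# A backslash followed by spaces or tabs at the end of a line.
# The backslash then no longer continues the line, so the next
# line is not treated as part of the same command.
TRAILING_AFTER_BACKSLASH = re.compile(r"\\[ \t]+$")


def find_broken_continuations(makefile_text):
    """Return 1-based line numbers where "\\" is followed by trailing whitespace."""
    return [
        number
        for number, line in enumerate(makefile_text.splitlines(), start=1)
        if TRAILING_AFTER_BACKSLASH.search(line)
    ]


# Illustrative recipe: line 1 has a space after the backslash, line 2 does not.
sample = (
    "lstmtraining \\ \n"
    "\t--traineddata data/example.traineddata \\\n"
    "\t--max_iterations 100\n"
)
print(find_broken_continuations(sample))
```

To check a real file, read it with `open("Makefile").read()` and pass the text to the function; an empty list means no continuation line has trailing whitespace.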

Re: [tesseract-ocr] Fine tuning existing model

2019-05-02 Thread Tairen Chen
Hi, Lorenzo and Shree Thanks for sharing. I am trying to repeat what you have done here. I followed your posts and changed the Makefile, but when I run $ make training, I get the following errors: mkdir -p data/checkpoints lstmtraining \

Re: [tesseract-ocr] Fine tuning existing model

2019-02-15 Thread Russia Aiyappa
I am having a hard time training tesseract as I am new to this. Is it possible to get the updated code for fine-tuning now that langdata is not supported? https://github.com/OCR-D/ocrd-train/issues/49 On Friday, 29 June 2018 08:09:09 UTC-4, shree wrote: > > I modified the makefile for ocrd-train

Re: [tesseract-ocr] Fine tuning existing model

2018-09-19 Thread Varun Sab
Thank you so much.. That worked. :) On Tuesday, September 18, 2018 at 9:24:53 PM UTC+5:30, shree wrote: > > If you are getting error > > !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244 > !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244 > > You are probably

Re: [tesseract-ocr] Fine tuning existing model

2018-09-18 Thread Shree Devi Kumar
If you are getting the error !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244 !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244 You are probably using a traineddata file which has an `integer` model. Please use tessdata_best as base for further training. On Tue,

Re: [tesseract-ocr] Fine tuning existing model

2018-07-02 Thread Lorenzo Bolzani
Hi Shree, I replaced the line: merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@" with: cp "$(TRAIN)/my.unicharset" "data/unicharset" (I write this in case someone else is following this thread). And now I have a fine tuned brand new model with only

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
> The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files. Now I can run your script directly. Oh, I remember now. I had changed that for ease in renaming files for some reason. > In this way can I train a model that, for example, only recognize uppercase characters, or

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Lorenzo Bolzani
I think I found the problem. Running the new Makefile directly, I got this error: make: *** No rule to make target 'data/train/alexis_ruhe01_1852_0018_022.box', needed by 'data/all-boxes'. Stop. The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files. Now I can run your script
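A suffix mismatch like this can be fixed either by editing the pattern in the Makefile or by renaming the ground-truth files to match it. The following is a minimal Python sketch of the renaming route; `rename_ground_truth` is a hypothetical helper, and which suffix to convert from and to depends on what your Makefile expects, so both are passed explicitly.

```python
from pathlib import Path


def rename_ground_truth(directory, old_suffix, new_suffix):
    """Rename files ending in `old_suffix` so they end in `new_suffix`.

    Returns the new file names in sorted order. A file is skipped when
    the target name already exists, so nothing is overwritten.
    """
    renamed = []
    for path in sorted(Path(directory).glob(f"*{old_suffix}")):
        target = path.with_name(path.name[: -len(old_suffix)] + new_suffix)
        if target.exists():
            continue
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

For example, `rename_ground_truth("data/train", ".gt.txt", "-gt.txt")` would convert a directory of `.gt.txt` files to the `-gt.txt` form; swapping the last two arguments converts in the other direction. The directory name here is taken from the error message above and may differ in your setup.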

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
You should be able to use the new makefile after you make changes for all the directory locations to match your setup. Change the language from frk to eng, though the sample training text seems to be non-English. In which case it is better for you to use the appropriate language traineddata e.g.

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Lorenzo Bolzani
Hi Shree, thanks for your answer. I tried the script setting: TESSDATA=extracted # here I have the eng.lstm and eng.traineddata LANGDATA=langdata-master # all langdata downloaded by OCR-D MODEL_NAME = eng CONTINUE_FROM = eng First I ran the old Makefile to create the boxes.

Re: [tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Shree Devi Kumar
I modified the makefile for ocrd-train to do fine-tuning. It is pasted below:

export

SHELL := /bin/bash
LOCAL := $(PWD)/usr
PATH := $(LOCAL)/bin:$(PATH)
HOME := /home/ubuntu
TESSDATA = $(HOME)/tessdata_best
LANGDATA = $(HOME)/langdata

# Name of the model to be built
MODEL_NAME = frk

# Name

[tesseract-ocr] Fine tuning existing model

2018-06-29 Thread Lorenzo Bolzani
Hi, I'm trying to do fine tuning of an existing model using line images and text labels. I'm running this version: tesseract 4.0.0-beta.3-56-g5fda leptonica-1.76.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2