[tesseract-ocr] Image too small to scale!! (3x48 vs min width of 3)

2017-10-18 Thread robertyoung0511
Hello, I am trying to manually generate the input data for Tess4.0, which consists of the box and tif files. But when I run the command to generate the .lstmf file, the images are rotated 90 degrees, as shown below:

Re: [tesseract-ocr] The Accuracy improvement of training the chi_sim.traineddata model

2017-09-19 Thread robertyoung0511
OK. Thanks for your reply. On Tuesday, September 19, 2017 at 5:06:57 PM UTC+8, shree wrote: > > Ray is the only one who would know those details. > > Please see > https://github.com/tesseract-ocr/tesseract/issues/590#issuecomment-322020794 > for his comment regarding finetuning. > > ShreeDevi >

Re: [tesseract-ocr] The Accuracy improvement of training the chi_sim.traineddata model

2017-09-19 Thread robertyoung0511
Does fine-tuning update all the parameters in all of the layers? We need to add lots of mathematical symbols and some other special symbols. Maybe we should train from scratch? What character error rate and how many iterations should we expect when training chi_sim (Simplified Chinese) from scratch?

[tesseract-ocr] The Accuracy improvement of training the chi_sim.traineddata model

2017-09-19 Thread robertyoung0511
Hello, I am training my own traineddata model for the chi_sim language with fine-tuning. In my training data, there are some mathematical symbols, such as "∞", "β", "△" and so on, which cannot be recognized by the official chi_sim.traineddata model. So we change the content of the

[tesseract-ocr] Re: Network overfitting processing

2017-09-18 Thread robertyoung0511
On the other hand, the network contains LSTM layers. Does the LSTM in the network learn the word order? I find that the word order in the trained_text file is chaotic. On Monday, September 18, 2017 at 2:30:33 PM UTC+8, roberty...@gmail.com wrote: > > Hello, > > I am using the finetune training to train my

[tesseract-ocr] Network overfitting processing

2017-09-18 Thread robertyoung0511
Hello, I am using fine-tuning to train my model for the chi_sim language with the network [1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]. After analyzing this network, I cannot find any regularization operations in the layers, and there is only one convolution layer
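For readers unfamiliar with the spec string quoted in this post, here is a minimal Python sketch that splits it into layer tokens and annotates each one. The layer meanings follow my reading of Tesseract's VGSL spec documentation. The function name `describe_net_spec` and the wording of the descriptions are my own, and the patterns cover only the layer types that appear in this one spec.

```python
import re

# Subsets of the VGSL codes, limited to what this spec needs (assumption:
# meanings taken from the VGSL spec documentation, not from the post).
NONLINEARITY = {"s": "sigmoid", "t": "tanh", "r": "relu", "l": "linear", "m": "softmax"}
LSTM_DIRECTION = {"f": "forward", "r": "reverse", "b": "bidirectional"}
OUTPUT_KIND = {"l": "logistic", "s": "softmax", "c": "softmax with CTC"}


def describe_net_spec(spec):
    """Return (token, description) pairs for each layer token in a VGSL spec."""
    described = []
    for tok in spec.strip("[] ").split():
        shape = re.fullmatch(r"(\d+),(\d+),(\d+),(\d+)", tok)
        conv = re.fullmatch(r"C([strlm])(\d+),(\d+),(\d+)", tok)
        pool = re.fullmatch(r"Mp(\d+),(\d+)", tok)
        lstm = re.fullmatch(r"L([frb])([xy])(s?)(\d+)", tok)
        out = re.fullmatch(r"O([012])([lsc])(\d+)", tok)
        if shape:
            batch, height, width, depth = shape.groups()
            width = "variable" if width == "0" else width
            desc = f"input: batch {batch}, height {height}, width {width}, depth {depth}"
        elif conv:
            act, y, x, n = conv.groups()
            desc = f"convolution {y}x{x}, {n} outputs, {NONLINEARITY[act]}"
        elif pool:
            desc = "max pooling {}x{}".format(*pool.groups())
        elif lstm:
            direction, axis, summarize, n = lstm.groups()
            desc = f"LSTM {LSTM_DIRECTION[direction]} along {axis}, {n} outputs"
            if summarize:
                desc += ", summarizing (keeps only the last step)"
        elif out:
            dims, kind, n = out.groups()
            desc = f"output: {dims}-d, {OUTPUT_KIND[kind]}, {n} class(es) requested"
        else:
            desc = "not covered by this sketch"
        described.append((tok, desc))
    return described


SPEC = "[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]"
layers = describe_net_spec(SPEC)
for token, description in layers:
    print(f"{token:10s} {description}")

# Count convolution layers to check the "only one convolution layer" remark.
conv_count = sum(1 for _, d in layers if d.startswith("convolution"))
```

Run on this spec, the sketch finds a single convolution token, which agrees with the observation in the post, and none of the tokens it lists is a dedicated regularization layer.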

Re: [tesseract-ocr] ERROR: /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset does not exist or is not readable

2017-09-14 Thread robertyoung0511
Shree, thanks for your reply. But I have another problem in the project that I need your help with: some italic characters in my data need to be recognized, but recognition accuracy on them tends to be low. Can I add some italic characters to train our model? I have observed

[tesseract-ocr] ERROR: /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset does not exist or is not readable

2017-09-14 Thread robertyoung0511
Hello, I'm trying to train my traineddata model with Tess4.0, following the commands in the TrainingTesseract 4.00 tutorial. The first command, which creates the training data, is as follows: training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \

Re: [tesseract-ocr] The net_spec in the chi_sim.traineddata

2017-08-23 Thread robertyoung0511
Yeah, I have seen the built network at the beginning of the training step. Thanks for the reply. The basetrain.log file shows: Built network:[1,48,0,1 [C3,3 Ft16] Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 Fc209] from request [1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1] Some problems for

[tesseract-ocr] The net_spec in the chi_sim.traineddata

2017-08-23 Thread robertyoung0511
Hello, I have extracted the network from chi_sim.traineddata with the command: combine_tessdata -u ../../tessdata/chi_sim.traineddata ../../chi_sim_comp Then I inspected the network shown in the chi_sim_comp file. The network is [1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512

[tesseract-ocr] Training from scratch to re-train the chi_sim.traineddata for studying

2017-08-22 Thread robertyoung0511
Hello, I'm trying to re-train the chi_sim.traineddata model from scratch for study purposes. I use the chi_sim.training_text source data from https://github.com/tesseract-ocr/langdata/tree/master/chi_sim to train the model with the command: training/lstmtraining

[tesseract-ocr] Training from Scratch for chi_sim.traineddata

2017-08-22 Thread robertyoung0511
Hello, I'm trying to re-train the chi_sim.traineddata model from scratch for study purposes. I use the chi_sim.training_text source data from https://github.com/tesseract-ocr/langdata/tree/master/chi_sim to train the model with the command: training/lstmtraining

[tesseract-ocr] Re: Unrecognized characters in the traineddata model

2017-08-17 Thread robertyoung0511
Any other information about these special characters would also help me. If you know anything about them, please leave a comment. Thanks. On Friday, August 18, 2017 at 9:45:11 AM UTC+8, roberty...@gmail.com wrote: > > I have debugged the code, and found that the special characters 'Joined' > and '|Broken|0|1' are added while

[tesseract-ocr] Re: Unrecognized characters in the traineddata model

2017-08-17 Thread robertyoung0511
I have debugged the code, and found that the special characters 'Joined' and '|Broken|0|1' are added while generating the unicharset file. But what is the function of these characters? Can anyone tell me at which stage of the training process these characters play a role? I can't find it. Thx

Re: [tesseract-ocr] Re: Error:Assert failed:in file ../lstm/lstmtrainer.h, line 110

2017-08-15 Thread robertyoung0511
Hi, I haven't encountered this error. But you may want to check whether your traineddata is in the correct directory, as well as the other paths. On Tuesday, August 15, 2017 at 5:45:17 PM UTC+8, Ava Nimaee wrote: > > Hi thanks for your help > i used your link. but i got this error: >

[tesseract-ocr] Unrecognized characters in the traineddata model

2017-08-14 Thread robertyoung0511
Hello, I have extracted all the characters and id numbers from chi_sim.traineddata. All the characters are stored in a txt file as id/character pairs, which looks like this: 0, 1 Joined, 2 |Broken|0|1, 3 S, 4 D, 5 F, 6 8, 7 7, 8 0, 9 K, 10 O, 11 U, 12 H, 13 E, 14
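The first three ids in this listing appear to match the special entries that, as far as I can tell from Tesseract's unicharset code, are reserved at the start of a unicharset: a space (written as NULL in the unicharset file), 'Joined', and '|Broken|0|1'. The Python sketch below reads the leading token of each line of unicharset-style text and maps ids to characters. The sample lines are made up for illustration, with the property columns reduced to a placeholder `0`, and `read_unichars` is a hypothetical helper, not part of Tesseract.

```python
# Illustrative sample only: first line is the entry count, then one entry per
# line. Real unicharset files carry more property columns after each token.
SAMPLE_UNICHARSET = """\
6
NULL 0
Joined 0
|Broken|0|1 0
S 0
D 0
F 0
"""

# Assumption: these two strings are Tesseract's reserved special entries.
SPECIAL = {"Joined", "|Broken|0|1"}


def read_unichars(text):
    """Map each id (line position after the count line) to its character."""
    lines = text.splitlines()
    count = int(lines[0])
    unichars = {}
    for uid, line in enumerate(lines[1:1 + count]):
        token = line.split(" ", 1)[0]
        # Assumption: a literal space is stored under the name NULL.
        unichars[uid] = " " if token == "NULL" else token
    return unichars


table = read_unichars(SAMPLE_UNICHARSET)
special_ids = sorted(uid for uid, ch in table.items() if ch in SPECIAL)
print(special_ids)
```

Under these assumptions, ids 1 and 2 are bookkeeping entries rather than characters that appear in the training text, which would explain why they show up in every extracted listing.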

[tesseract-ocr] Other case Л of л is not in unicharset

2017-08-14 Thread robertyoung0511
Hello, I am using the new tutorial to fine-tune the traineddata. I want to add some specific symbols to the existing chi_sim.traineddata model. First, I use the command: training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only --noextract_font_properties --langdata_dir

[tesseract-ocr] Creation of encoded unicharset failed While constructing LSTM training data.

2017-08-10 Thread robertyoung0511
Hello, I'm trying to fine-tune the eng.traineddata model following the steps at the link: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-%C2%B1-a-few-characters Following the tutorial, I am fine-tuning for ± a few characters. But when I execute

[tesseract-ocr] Failed to continue from: /home/robert/tesstutorial/trainplusminus/eng.lstm

2017-08-07 Thread robertyoung0511
And when I execute the 1st command, I get an error: Failed to read data from: ../langdata/eng/eng.config

[tesseract-ocr] Failed to continue from: /home/robert/tesstutorial/trainplusminus/eng.lstm

2017-08-06 Thread robertyoung0511
Hello, I'm trying to train the traineddata following the new tutorial for fine-tuning: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-%C2%B1-a-few-characters I execute the commands as the tutorial shows, as follows: 1.

[tesseract-ocr] Failed to continue from: /home/robert/tesstutorial/trainplusminus/eng.lstm

2017-08-06 Thread robertyoung0511
Hello, I'm trying to train the traineddata following the new tutorial for fine-tuning: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-%C2%B1-a-few-characters I execute the command as the tutorial shows, as follows: 1.

[tesseract-ocr] Re: Error:Assert failed:in file ../lstm/lstmtrainer.h, line 110

2017-08-04 Thread robertyoung0511
The code seems to have changed a lot, as have the training commands and the corresponding tutorials. For the changes, refer to https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00. On Friday, August 4, 2017 at 2:33:41 PM UTC+8, roberty...@gmail.com wrote: > > Hello, > > I use the 'git pull' command

Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-08-04 Thread robertyoung0511
Hi Shree, I have also tried the new traineddata to recognize Simplified Chinese on Linux (Ubuntu), and it works. But it seems that the new traineddata isn't supported on Windows. With the new traineddata on Ubuntu, there are also some special symbols that cannot be

Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-08-04 Thread robertyoung0511
I have tried the new traineddata on Linux (Ubuntu). It works, but it seems that the new traineddata isn't supported on Windows. On Tuesday, August 1, 2017 at 6:03:13 PM UTC+8, roberty...@gmail.com wrote: > > When I use the new traineddata, it will report an >

[tesseract-ocr] Error:Assert failed:in file ../lstm/lstmtrainer.h, line 110

2017-08-04 Thread robertyoung0511
Hello, I used the 'git pull' command to update the code from https://github.com/tesseract-ocr/tesseract.git, then recompiled and reinstalled Tess4.0. But when I execute the command (shown below) to fine-tune the traineddata, an error appears:

Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-08-01 Thread robertyoung0511
When I use the new traineddata, it reports an error: cannot find the chi_sim.traineddata. Does the new traineddata only support the Tess4.0 alpha release? I use the newest code release. On Tuesday, August 1, 2017 at 4:45:07 PM UTC+8, shree wrote: > > Ray has uploaded new traineddata files in >

Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-08-01 Thread robertyoung0511
OK, I will give it a try. Thanks. On Tuesday, August 1, 2017 at 4:45:07 PM UTC+8, shree wrote: > > Ray has uploaded new traineddata files in > https://github.com/tesseract-ocr/tessdata/tree/master/best > > Why don't you first try recognition with that > > ShreeDevi >

Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-08-01 Thread robertyoung0511
Hello Shree, sorry to bother you, but can I use more than one unicharset, such as chi_sim and eng, for fine-tuning? Some of the special characters may be in other unicharsets. If I find them, maybe I will train my traineddata with more unicharsets, and the special

[tesseract-ocr] How to recognize some specific symbols with Tess4.0

2017-07-31 Thread robertyoung0511
Hello, I'm trying to use Tess4.0 to recognize Simplified Chinese with the following command: argc = 13; argv[1] =

Re: [tesseract-ocr] Could not find font named AR PL UMing Patched Light

2017-07-26 Thread robertyoung0511
OK. Sincere thanks to Shree for the reply. On Wednesday, July 26, 2017 at 2:48:13 PM UTC+8, shree wrote: > > I do not have this font. > > The training is done at Google. They probably use a number of commercial > fonts in addition to freely available fonts. The fonts are not provided as > part of the training

[tesseract-ocr] Could not find font named AR PL UMing Patched Light

2017-07-25 Thread robertyoung0511
Hello, I'm trying to train my own traineddata with Tess4.0 following the tutorial: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replace-Top-Layer When executing the command: training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim \ --training_text

Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-07-25 Thread robertyoung0511
Thanks for the help. I will fine-tune with the new traineddata for all languages after 2-3 weeks, and give feedback on how the specific characters are recognized. On Tuesday, July 25, 2017 at 3:23:08 PM UTC+8, shree wrote: > > That error is because some characters in your training text are not part > of the unicharset of
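The explanation quoted in this reply suggests a quick pre-check before running lstmtraining: compare the characters in the training text against those in the model's unicharset. Below is a minimal Python sketch, assuming the unicharset characters have already been loaded into a set. The helper `missing_chars` and the sample strings are hypothetical. Real unicharset entries can span more than one code point, so a per-character comparison like this is only an approximation.

```python
def missing_chars(training_text, unicharset_chars):
    """Return the sorted non-space characters of training_text absent from the set."""
    used = {ch for ch in training_text if not ch.isspace()}
    return sorted(used - set(unicharset_chars))


# Hypothetical data: a tiny stand-in for the model's unicharset, and a line of
# training text containing the symbols mentioned elsewhere in these threads.
known = set("中文SDF")
text = "中文 β ∞ △"
print(missing_chars(text, known))
```

Any character this reports would need to be added to the unicharset, or removed from the training text, before the transcript can be encoded.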

[tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-07-25 Thread robertyoung0511
Hello, I use the following command to train my own traineddata: lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \ --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \ --train_listfile ~/tesstutorial/chitest/chi.training_files.txt \ --eval_listfile

Re: [tesseract-ocr] Combine_tessdata command error while training Tesseract4.0

2017-07-25 Thread robertyoung0511
I forgot the nor.traineddata. Thanks for the help. On Monday, July 24, 2017 at 7:59:20 PM UTC+8, shree wrote: > > Is your traineddata file present at ../tessdata/nor.traineddata? > Is it 4.00 version? > > ShreeDevi > > भजन - कीर्तन - आरती @

[tesseract-ocr] Combine_tessdata command error while training Tesseract4.0

2017-07-24 Thread robertyoung0511
Hello, I'm trying to train Tesseract4.0 following the steps in the tutorial: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replacing-Top-Layer-Example But when I execute the command: mkdir -p ~/tesstutorial/nor_layer $ combine_tessdata -e