from:"robertyoung0511"

[tesseract-ocr] Image too small to scale!! (3x48 vs min width of 3)

2017-10-18 Thread robertyoung0511

Hello,

I am trying to manually generate the input data for Tess4.0, which consists 
of the box and tif files.

But when I run the command to generate the .lstmf file, the images are 
rotated 90 degrees, as shown below:

[image not preserved in the archive]

So, when I run the "training/lstmtraining" command to fine-tune the 
network, Tess4.0 reports the error "Image too small to scale!! (3x48 
vs min width of 3)".

After analyzing this error, I found that it happens because the image has 
been rotated, so when Tess4.0 scales the image height to 48 pixels, the 
image width becomes too small (only one pixel).
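
The arithmetic behind this can be sketched as follows. This is only an illustration, assuming the line image is scaled proportionally so that its height becomes 48 pixels; the function name scaled_width and the sample sizes are hypothetical, and the real scaling code in Tesseract may round differently.

```python
def scaled_width(width, height, target_height=48):
    """Width after proportionally scaling the height to target_height."""
    # Assumption: plain proportional scaling with rounding.
    return max(1, round(width * target_height / height))

# A normal text line: much wider than it is tall.
print(scaled_width(600, 40))   # 720
# The same line rotated by 90 degrees: much taller than it is wide.
print(scaled_width(40, 600))   # 3
```

With these made-up sizes, the rotated line shrinks to a width of only a few pixels, which is the kind of image the error message complains about.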


For comparison, I also ran the "training/lstmtraining" command with an 
image generated by Tess4.0 itself. In that case the image is not rotated, 
as shown below:

[image not preserved in the archive]

But when I generate the image manually, it is rotated. Why? I cannot 
find anything wrong. Can you help me?

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/aa9773d1-3735-4e5a-97b4-819919f4b3c9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [tesseract-ocr] The Accuracy improvement of training the chi_sim.traineddata model

2017-09-19 Thread robertyoung0511

OK. Thanks for your reply.

On Tuesday, September 19, 2017 at 5:06:57 PM UTC+8, shree wrote:
>
> Ray is the only one who would know those details.
>
> Please see 
> https://github.com/tesseract-ocr/tesseract/issues/590#issuecomment-322020794 
> for his comment regarding finetuning.
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Re: [tesseract-ocr] The Accuracy improvement of training the chi_sim.traineddata model

2017-09-19 Thread robertyoung0511

Does fine-tuning update all the parameters in all of the layers?

We need to add lots of mathematical symbols and some other special symbols. 
Maybe we should train from scratch instead?

What char error rate and iteration count should we expect if we train 
chi_sim (Simplified Chinese) from scratch?



On Tuesday, September 19, 2017 at 4:49:30 PM UTC+8, shree wrote:
>
> As per comments by Ray, for finetune or for plus minus a few letters.
> the number of iterations should be limited to 3000 or so.
>
> It probably won't get to .2% accuracy, but you might have better results 
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


[tesseract-ocr] The Accuracy improvement of training the chi_sim.traineddata model

2017-09-19 Thread robertyoung0511

Hello,

I am fine-tuning my own traineddata model for the chi_sim language. My
training data contains some mathematical symbols, such as "∞", "β", "△"
and so on, which the official chi_sim.traineddata model cannot recognize.

So we changed the content of the chi_sim.training_text file and filled it
with our own training data.

Then we ran the training command:
training/lstmtraining --model_output ~/tesstutorial/trainspecial/special \
--continue_from ~/tesstutorial/trainspecial/chi_sim.lstm \
--traineddata ~/tesstutorial/trainspecial/chi_sim/chi_sim.traineddata \
--old_traineddata tessdata/best/chi_sim.traineddata \
--train_listfile ~/tesstutorial/trainspecial/chi_sim.training_files.txt \
--max_iterations 40

With this command, after 40 iterations, the char error is about 0.2%
and the word error is about 4.2%.
The error rate then started to oscillate and would not go down any
further, so we stopped training and exported the traineddata model.

After testing the exported traineddata model, we found its accuracy
unsatisfactory: it is lower than that of the model provided on the official
website (the tesseract GitHub site).

We hope the recognition accuracy of our trained model can match that of
the official model. How can we further improve the accuracy of the model?

Does anyone know the details of how the official language model was
trained, such as the number of iterations, the lowest char error and word
error, the learning_rate value, and so on?

If you know any of this information, please give us some tips.

Thank you.


[tesseract-ocr] Re: Network overfitting processing

2017-09-18 Thread robertyoung0511

On the other hand, the network contains LSTM layers.

Do the LSTM layers in the network learn the word order? I find that the 
word order in the training_text file is chaotic.






[tesseract-ocr] Network overfitting processing

2017-09-18 Thread robertyoung0511

Hello,

I am fine-tuning my model for the chi_sim language with the network 
[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]

After analyzing this network, I cannot find any regularization operations 
in the layers, and there is only one convolution layer in the network.

How can I optimize the network structure, for example by adding 
regularization, to avoid overfitting during training? Or are there other 
options, such as making the network deeper?

Thanks for your help.


Re: [tesseract-ocr] ERROR: /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset does not exist or is not readable

2017-09-14 Thread robertyoung0511

Shree, thanks for your reply.

But I have another problem in my project that I need your help with:

Some italic characters in my data need to be recognized, but the 
recognition accuracy for these italic characters tends to be low. Can I add 
some italic characters to train our model? 

As far as I can see, italic characters cannot be added to the 
chi_sim.training_text file at 
https://github.com/tesseract-ocr/langdata/tree/master/chi_sim.

How should I train these italic characters?

On Thursday, September 14, 2017 at 4:30:40 PM UTC+8, shree wrote:
>
> It is a known problem with the latest code in github - see 
> https://github.com/tesseract-ocr/tesseract/issues/1114
>
> Waiting for fix from Ray.
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


[tesseract-ocr] ERROR: /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset does not exist or is not readable

2017-09-14 Thread robertyoung0511

Hello,

I'm trying to train my traineddata model with Tess4.0, following the
commands in the *TrainingTesseract 4.00* tutorial. The first command, which
creates the training data, is as follows:

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --fontlist "SIMSUN" --tessdata_dir ./tessdata \
  --output_dir ~/tesstutorial/trainspecial

And the execution log for this command is as follows:

=== Phase I: Generating training images ===
Rendering using SIMSUN
[2017年 09月 14日 星期四 16:01:57 CST] /usr/local/bin/text2image
--fontconfig_tmpdir=/tmp/font_tmp.whlzhytMkp --fonts_dir=/usr/share/fonts
--strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0
--outputbase=/tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0 --max_pages=3
--font=SIMSUN --text=../langdata/chi_sim/chi_sim.training_text
Rendered page 0 to file /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[2017年 09月 14日 星期四 16:01:58 CST] /usr/local/bin/unicharset_extractor
--output_unicharset /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset
--norm_mode 1 /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0.box
Extracting unicharset from box file
/tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0.box
Invalid Unicode codepoint: 0xffe8
IsValidCodepoint(ch):Error:Assert failed:in file normstrngs.cpp, line 225
ERROR: /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset does not exist or is
not readable

But an error occurs during this step, which indicates that extracting
chi_sim.unicharset failed. I have checked the directory
/tmp/tmp.8JcoYdZI17/chi_sim/, and the chi_sim.unicharset file does not exist.

How can I fix this error? Can you help me? Thanks.
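
One way to look for the offending character is to scan the training text for the codepoint named in the log (0xffe8). The sketch below is only an illustration: the helper name find_codepoint and the sample string are made up, and in practice the text would be read from chi_sim.training_text.

```python
def find_codepoint(text, codepoint):
    """Return (line_number, column) pairs where the codepoint occurs."""
    target = chr(codepoint)
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == target:
                hits.append((line_no, col))
    return hits

# Hypothetical sample; replace it with the contents of the training text, e.g.
# text = open("../langdata/chi_sim/chi_sim.training_text", encoding="utf-8").read()
sample = "first line\nsecond \uffe8 line\n"
print(find_codepoint(sample, 0xFFE8))  # [(2, 8)]
```

This only locates the character; whether removing it from the training text is an acceptable workaround is a separate question.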


Re: [tesseract-ocr] The net_spec in the chi_sim.traineddata

2017-08-23 Thread robertyoung0511

Yeah, I have seen the built network at the beginning of the training 
step. Thanks for the reply.

The basetrain.log file shows: Built network:[1,48,0,1 [C3,3 Ft16] 
Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 Fc209] from request [1,48,0,1 Ct3,3,16 
Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]

I have some questions about this built network:

1. The [C3,3 Ft16] layers in the network are enclosed in brackets. Why are 
they enclosed in brackets, and what do the brackets stand for?
2. Fc209, the last layer of this network, is a fully-connected layer. What 
does the 'c' mean in this layer? I cannot find what 'Fc' represents 
in the VGSLSpecs tutorial.

Thanks.



On Wednesday, August 23, 2017 at 3:00:00 PM UTC+8, shree wrote:
>
> I think that number is ignored and the actual number generated from 
> unicharset is used.
>
> Usually there will be a message right at beginning of training showing the 
> number being used.


[tesseract-ocr] The net_spec in the chi_sim.traineddata

2017-08-23 Thread robertyoung0511

Hello,

I have unpacked the network from chi_sim.traineddata with the command:
combine_tessdata -u ../../tessdata/chi_sim.traineddata ../../chi_sim_comp

Then I looked at the network spec shown in the chi_sim_comp file. The 
network is [1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]

From the VGSL Specs language, I infer that the output layer of the network 
is O1c1, which means the output layer produces 1-d (sequence) output, 
trained with CTC, *outputting 1 class*.

Why does the output layer have only one class? Is the network structure 
recorded in chi_sim.traineddata wrong?
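
To make the spec easier to read, the sketch below splits it into layers and attaches a short description to each one. The descriptions follow the VGSL Specs documentation as far as I understand it; the helper describe_spec and its lookup table are hypothetical and cover only the layer types that appear in this spec.

```python
import re

# Descriptions for the layer prefixes used in this spec only (not a full VGSL parser).
LAYER_KINDS = [
    (r"^Ct(\d+),(\d+),(\d+)$", "convolution {0}x{1}, tanh, {2} outputs"),
    (r"^Mp(\d+),(\d+)$", "max-pool {0}x{1}"),
    (r"^Lfys(\d+)$", "forward LSTM along y, summarizing, {0} outputs"),
    (r"^Lfx(\d+)$", "forward LSTM along x, {0} outputs"),
    (r"^Lrx(\d+)$", "reverse LSTM along x, {0} outputs"),
    (r"^O1c(\d+)$", "1-d output, CTC, {0} class(es) requested"),
]

def describe_spec(spec):
    """Split a net spec into tokens and describe each one."""
    tokens = spec.strip("[]").split()
    result = [(tokens[0], "input: batch,height,width,depth")]
    for token in tokens[1:]:
        for pattern, template in LAYER_KINDS:
            match = re.match(pattern, token)
            if match:
                result.append((token, template.format(*match.groups())))
                break
        else:
            # No pattern matched this token.
            result.append((token, "unknown layer"))
    return result

for token, meaning in describe_spec(
        "[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]"):
    print(f"{token:10s} {meaning}")
```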


[tesseract-ocr] Training from scratch to re-train the chi_sim.traineddata for studying

2017-08-22 Thread robertyoung0511

Hello,

I'm trying to re-train the chi_sim.traineddata model from scratch as a
learning exercise.

I use the source data in chi_sim.training_text from
https://github.com/tesseract-ocr/langdata/tree/master/chi_sim to train the
model with the command:

training/lstmtraining --debug_interval 100 \
--traineddata ~/tesstutorial/trainspecial/chi_sim/chi_sim.traineddata \
--net_spec '[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]' \
--model_output ~/tesstutorial/specialoutput/base --learning_rate 20e-4 \
--train_listfile ~/tesstutorial/trainspecial/chi_sim.training_files.txt \
--eval_listfile ~/tesstutorial/evalspecial/chi_sim.training_files.txt \
--max_iterations 3600 &>~/tesstutorial/specialoutput/basetrain.log

The net_spec is the same as that of the official model (chi_sim.traineddata
from the tessdata GitHub repository).

After converting the trained model with lstmtraining --stop_training,
a new chi_sim.traineddata model is generated, which I named
chi_sim_new.traineddata.
I keep the official model's name as chi_sim.traineddata to distinguish
the two.

Then I extracted all the characters from the two traineddata models.

There are 4384 characters in chi_sim.traineddata, but only 2538 characters
in the chi_sim_new.traineddata that I generated.

Why do the two models have different character sets? Has the source data
in chi_sim.training_text not been kept up to date?
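
To see exactly which characters differ, the two unicharset files can be compared as sets. The sketch below assumes the usual unicharset text layout (the first line is the entry count, and each following line starts with the character, followed by a space and its properties); the helper names and the tiny in-memory samples are hypothetical stand-ins for the files unpacked from each traineddata.

```python
def read_unichars(unicharset_text):
    """Return the set of entries listed in a unicharset file's text."""
    lines = unicharset_text.splitlines()
    # Skip the first line (the entry count); keep the first field of each line.
    return {line.split(" ", 1)[0] for line in lines[1:] if line}

def compare(official_text, new_text):
    """Return (entries only in official, entries only in new), both sorted."""
    official = read_unichars(official_text)
    new = read_unichars(new_text)
    return sorted(official - new), sorted(new - official)

# Hypothetical miniature samples standing in for the real files.
official_sample = (
    "4\nNULL 0 NULL 0\n王 5 0,255,0,255 Han 1\n"
    "立 5 0,255,0,255 Han 2\nβ 3 0,255 Greek 3\n"
)
new_sample = "3\nNULL 0 NULL 0\n王 5 0,255,0,255 Han 1\n△ 0 0,255 Common 2\n"

missing, extra = compare(official_sample, new_sample)
print("only in official:", missing)
print("only in new:", extra)
```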


[tesseract-ocr] Training from Scratch for chi_sim.traineddata

2017-08-22 Thread robertyoung0511

Hello,

I'm trying to re-train the chi_sim.traineddata model from scratch as a
learning exercise.

I use the source data in chi_sim.training_text from
https://github.com/tesseract-ocr/langdata/tree/master/chi_sim to train the
model with the command:

The net_spec is the same as that of the official model (chi_sim.traineddata
from the tessdata GitHub repository).

Then I extracted all the characters from the two traineddata models.

There are 4384 characters in chi_sim.traineddata, but only 2538 characters
in the chi_sim_new.traineddata that I generated.

Why do the two models have different character sets? Has the source data
in chi_sim.training_text not been kept up to date?


[tesseract-ocr] Re: Unrecognized characters in the traineddata model

2017-08-17 Thread robertyoung0511

Any other information about these special characters would also help me. 
If you know anything about them, please leave a reply.

Thanks.



[tesseract-ocr] Re: Unrecognized characters in the traineddata model

2017-08-17 Thread robertyoung0511

I have debugged the code and found that the special entries 'Joined' and 
'|Broken|0|1' are added when the unicharset file is generated. 

But what is the function of these entries? Can anyone tell me at which 
stage of the training process they play a role? I can't find it. Thanks 
a lot.

As for the other special entries, such as 'cl', '|d|0|2' and '|d|1|2', what 
is their function? Are they added in the combine_lang_model stage? 

Can you help me?


Thanks sincerely.
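
For anyone reading the list quoted below, here is a minimal Python sketch of one way to sort such entries into groups. It assumes, without confirmation in this thread, that 'Joined' and '|Broken|0|1' are reserved entries and that an entry of the form |text|i|n stands for part i of n of a split character. The function classify_entry and the category names are hypothetical and are not part of Tesseract.

```python
import re

# Assumed shape of a fragment entry: |<text>|<part index>|<total parts>
FRAGMENT_RE = re.compile(r"^\|(.+)\|(\d+)\|(\d+)$")


def classify_entry(entry):
    """Return a rough, illustrative category for one unicharset entry."""
    if entry == "Joined":
        return "special"
    match = FRAGMENT_RE.match(entry)
    if match:
        # '|Broken|0|1' has the fragment shape but is treated as reserved here.
        return "special" if match.group(1) == "Broken" else "fragment"
    if len(entry) > 1:
        # Multi-character entries such as 'rn', 'cl' or 'ki'.
        return "ngram"
    return "character"


for entry in ["Joined", "|Broken|0|1", "|ki|0|2", "ki", "王"]:
    print(entry, "->", classify_entry(entry))
```

Grouping the entries this way only makes the list easier to read; it says nothing about which training stage adds them.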

On Tuesday, August 15, 2017 at 1:47:10 PM UTC+8, roberty...@gmail.com wrote:
>
> Hello,
>
> I have extracted all the characters and ID numbers from 
> chi_sim.traineddata and stored them in a text file, which is shown 
> below:
>
> 0 
> 1Joined
> 2|Broken|0|1
> 3S
> 4D
> 5F
> 68
> 77
> 80
> 9K
> 10O
> 11U
> 12H
> 13E
> 14I
> 154
> 165
> 171
> 189
> 19&
> 20C
> 21W
> 22N
> 23_
> 24P
> 25M
> 26T
> 27V
> 28R
> 29L
> 30A
> 31Y
> 322
> 33J
> 34B
> 35G
> 363
> 376
> 38Z
> 39X
> 40Q
> 41'
> 42+
> 43-
> 44.
> 45#
> 46e
> 47v
> 48a
> 49m
> 50i
> 51z
> 52o
> 53l
> 54s
> 55h
> 56n
> 57d
> 58g
> 59y
> 60u
> 61王
> 62汝
> 63敏
> 64邹
> 65立
> 66健
> 67熊
> ...
> ...
> 4013扔
> 4014嗨
> 4015髋
> 4016「
> 4017[
> 4018』
> 4019瀵
> 4020〕
> 4021掺
> 4022|"|0|2
> 4023|"|1|2
> 4024rn
> 4025|m|0|2
> 4026|m|1|2
> 4027in
> 4028cl
> 4029|d|0|2
> 4030|d|1|2
> 4031rm
> 4032|rm|0|2
> 4033|rm|1|2
> 4034nn
> 4035|nn|0|2
> 4036|nn|1|2
> 4037ri
> 4038|n|0|2
> 4039|n|1|2
> 4040|h|0|2
> 4041|h|1|2
> 4042|u|0|2
> 4043|u|1|2
> 4044|m|0|3
> 4045|m|1|3
> 4046|m|2|3
> 4047|H|0|2
> 4048|H|1|2
> 4049|H|0|3
> 4050|H|1|3
> 4051|H|2|3
> 4052|w|0|2
> 4053|w|1|2
> 4054|W|0|2
> 4055|W|1|2
> 4056fi
> 4057|k|0|2
> 4058|k|1|2
> 4059ki
> 4060|ki|0|2
> 4061|ki|1|2
> 4062|in|0|2
> 4063|in|1|2
> 4064tl
> 4065th
> ...
>
>
> I can recognize most of the characters, such as the Han characters and 
> the Latin alphabet. But I cannot make sense of some entries, such as 
> 'Joined' and '|Broken|0|1' at the top of the file, and |"|0|2 and |m|0|2 
> at the end of the file.
>
> Can you explain what these entries mean?
> 4059ki
> 4060|ki|0|2
> 4061|ki|1|2
> 4062|in|0|2
> 4063|in|1|2
> and so on.
>
>
> Thanks a lot.
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/db617ab0-d486-4792-8782-e722d620e154%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [tesseract-ocr] Re: Error:Assert failed:in file ../lstm/lstmtrainer.h, line 110

2017-08-15 Thread robertyoung0511

Hi, I haven't encountered this error.

But you could check whether your traineddata is in the correct directory, 
and check the other paths as well.

On Tuesday, August 15, 2017 at 5:45:17 PM UTC+8, Ava Nimaee wrote:
>
> Hi, thanks for your help.
> I used your link, but I got this error:
> mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file 
> ../lstm/lstmtrainer.h, line 110
> Segmentation fault (core dumped)
> I want to start training the Persian language, so I'm trying English 
> first. I created the box file and unicharset, then 
> eng.charset_size=110.txt, eng.Times_New_Roman.exp0.lstmf, eng.traineddata, 
> eng.training_files.txt and eng.unicharset.
> All of those were created with this command:
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng \
>   --training_text training/langdata/eng/eng.training_text \
>   --linedata_only \
>   --noextract_font_properties --langdata_dir training/langdata \
>   --tessdata_dir ./tessdata \
>   --fontlist "Times New Roman," --output_dir ~/tesstutorial/engtrian
> And now I get the error I described above.
>
> On Monday, August 14, 2017 at 1:00:02 PM UTC+4:30, roberty...@gmail.com 
> wrote:
>>
>> What problems did you encounter? Please give more information about 
>> them.
>>
>> I later used the new tutorial (
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact)
>> to train, and I didn't have any problems. I hope this helps.
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/17ac0e3b-9a1c-40c3-8500-7bb16825f77d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[tesseract-ocr] Unrecognized characters in the traineddata model

2017-08-14 Thread robertyoung0511

Hello,

I have extracted all the characters and ID numbers from 
chi_sim.traineddata and stored them in a text file, which is shown 
below:

0 
1Joined
2|Broken|0|1
3S
4D
5F
68
77
80
9K
10O
11U
12H
13E
14I
154
165
171
189
19&
20C
21W
22N
23_
24P
25M
26T
27V
28R
29L
30A
31Y
322
33J
34B
35G
363
376
38Z
39X
40Q
41'
42+
43-
44.
45#
46e
47v
48a
49m
50i
51z
52o
53l
54s
55h
56n
57d
58g
59y
60u
61王
62汝
63敏
64邹
65立
66健
67熊
...
...
4013扔
4014嗨
4015髋
4016「
4017[
4018』
4019瀵
4020〕
4021掺
4022|"|0|2
4023|"|1|2
4024rn
4025|m|0|2
4026|m|1|2
4027in
4028cl
4029|d|0|2
4030|d|1|2
4031rm
4032|rm|0|2
4033|rm|1|2
4034nn
4035|nn|0|2
4036|nn|1|2
4037ri
4038|n|0|2
4039|n|1|2
4040|h|0|2
4041|h|1|2
4042|u|0|2
4043|u|1|2
4044|m|0|3
4045|m|1|3
4046|m|2|3
4047|H|0|2
4048|H|1|2
4049|H|0|3
4050|H|1|3
4051|H|2|3
4052|w|0|2
4053|w|1|2
4054|W|0|2
4055|W|1|2
4056fi
4057|k|0|2
4058|k|1|2
4059ki
4060|ki|0|2
4061|ki|1|2
4062|in|0|2
4063|in|1|2
4064tl
4065th
...


I can recognize most of the characters, such as the Han characters and 
the Latin alphabet. But I cannot make sense of some entries, such as 
'Joined' and '|Broken|0|1' at the top of the file, and |"|0|2 and |m|0|2 
at the end of the file.

Can you explain what these entries mean?
4059ki
4060|ki|0|2
4061|ki|1|2
4062|in|0|2
4063|in|1|2
and so on.


Thanks a lot.

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/b042f6e0-7fc9-487b-bcc6-0acf22c343fd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[tesseract-ocr] Other case Л of л is not in unicharset

2017-08-14 Thread robertyoung0511

Hello,

I am using the new tutorial to fine-tune the traineddata. I want to add some 
specific symbols to the existing chi_sim.traineddata model.

First, I ran the following command to create the new training data:

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata --fontlist "SIMSUN" \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/trainspecial

But some specific symbols cannot be added to the unicharset file.

Part of the output is shown below:

=== Phase UP: Generating unicharset and unichar properties files ===
[2017年 08月 14日 星期一 15:59:17 CST] /usr/local/bin/unicharset_extractor -D 
/tmp/tmp.78WyISy4o7/chi_sim/ 
/tmp/tmp.78WyISy4o7/chi_sim/chi_sim.SIMSUN.exp0.box
Extracting unicharset from 
/tmp/tmp.78WyISy4o7/chi_sim/chi_sim.SIMSUN.exp0.box
Wrote unicharset file /tmp/tmp.78WyISy4o7/chi_sim//unicharset.
[2017年 08月 14日 星期一 15:59:17 CST] /usr/local/bin/set_unicharset_properties 
-U /tmp/tmp.78WyISy4o7/chi_sim/chi_sim.unicharset -O 
/tmp/tmp.78WyISy4o7/chi_sim/chi_sim.unicharset -X 
/tmp/tmp.78WyISy4o7/chi_sim/chi_sim.xheights --script_dir=../langdata
Loaded unicharset of size 1129 from file 
/tmp/tmp.78WyISy4o7/chi_sim/chi_sim.unicharset
Setting unichar properties
Other case Л of л is not in unicharset
Other case Υ of υ is not in unicharset
Other case Π of π is not in unicharset
Other case Β of β is not in unicharset
Mirror ∼ of ∽ is not in unicharset
Mirror ⧵ of ∕ is not in unicharset
Other case σ of Σ is not in unicharset
Other case Ρ of ρ is not in unicharset
Mirror 》 of 《 is not in unicharset
Other case j of J is not in unicharset
Mirror 【 of 】 is not in unicharset
Mirror 「 of 」 is not in unicharset
Other case K of k is not in unicharset
Mirror { of } is not in unicharset
Other case q of Q is not in unicharset
Mirror 〗 of 〖 is not in unicharset
Setting script properties
Warning: properties incomplete for index 57 = ）
Warning: properties incomplete for index 60 = ：
Warning: properties incomplete for index 64 = ！
Warning: properties incomplete for index 67 = ？
Warning: properties incomplete for index 73 = ＞
Warning: properties incomplete for index 81 = ；
Warning: properties incomplete for index 82 = ～
Warning: properties incomplete for index 90 = ．
Warning: properties incomplete for index 98 = （
Warning: properties incomplete for index 99 = ゜
Warning: properties incomplete for index 115 = ＜
Warning: properties incomplete for index 190 = ，
Writing unicharset to file /tmp/tmp.78WyISy4o7/chi_sim/chi_sim.unicharset


This shows that some specific symbols, such as 'Л' and '》', cannot be 
added to the unicharset.


How can I add these symbols to the unicharset? Should I add them manually?
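
To collect the affected symbols from a log like the one above, a small Python sketch follows. It only extracts pairs from lines of the two forms shown in this output, and it assumes the first symbol on each line is the one reported as absent. The function name missing_counterparts is made up for illustration, and nothing here modifies the unicharset.

```python
import re

# Matches the two message forms that appear in the output above.
LINE_RE = re.compile(r"^(Other case|Mirror) (\S+) of (\S+) is not in unicharset$")


def missing_counterparts(log_text):
    """Collect (kind, absent_symbol, present_symbol) triples from the log text."""
    found = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            found.append(match.groups())
    return found


sample = """Setting unichar properties
Other case Л of л is not in unicharset
Mirror 》 of 《 is not in unicharset
Setting script properties"""

for kind, absent, present in missing_counterparts(sample):
    print(kind, absent, present)
```

The resulting list can then be compared against the training text to see which of the reported symbols actually occur there.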

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1b2e87fb-ebca-4b92-a561-1a6ccc4a27ba%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[tesseract-ocr] Creation of encoded unicharset failed While constructing LSTM training data.

2017-08-10 Thread robertyoung0511

Hello,

I'm trying to fine-tune the eng.traineddata model by following the steps at this link: 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-%C2%B1-a-few-characters

Following the tutorial, I am fine-tuning for ± a few characters.

But when I execute the first command, which generates the new training and 
eval data:

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/trainplusminus


An error is reported while constructing the LSTM training data: 
Creation of encoded unicharset failed!

For more details, refer to the image.

Can you help me? Thanks.




-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1c40ba47-a6e5-4ec9-bf58-677bcdb2f74b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[tesseract-ocr] Failed to continue from: /home/robert/tesstutorial/trainplusminus/eng.lstm

2017-08-07 Thread robertyoung0511

Also, when I execute the 1st command, an error is reported: Failed to read 
data from: ../langdata/eng/eng.config

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/632d545f-ffb1-4a93-82ab-6bc10fc0011a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[tesseract-ocr] Failed to continue from: /home/robert/tesstutorial/trainplusminus/eng.lstm

2017-08-06 Thread robertyoung0511

Hello,

I'm trying to fine-tune the traineddata by following the new tutorial: 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-%C2%B1-a-few-characters

I executed the commands as the tutorial shows:

1. training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/trainplusminus
2. training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "Impact Condensed" --output_dir ~/tesstutorial/evalplusminus
3. training/combine_tessdata -e tessdata/best/eng.traineddata \
  ~/tesstutorial/trainplusminus/eng.lstm
4. training/lstmtraining --model_output ~/tesstutorial/trainplusminus/plusminus \
  --continue_from ~/tesstutorial/trainplusminus/eng.lstm \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --old_traineddata tessdata/best/eng.traineddata \
  --train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
  --max_iterations 3600


When I execute the 4th command, an error appears:
Failed to continue from: /home/robert/tesstutorial/trainplusminus/eng.lstm

I checked, and the 'eng.lstm' file does exist in 
/home/robert/tesstutorial/trainplusminus.

So why can't Tess4 continue from the file?
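
Before looking further, it may help to rule out path problems. The sketch below is a plain Python check, not a Tesseract tool: it expands ~ and reports files that are missing or empty. The helper name check_inputs is hypothetical, and passing this check does not mean lstmtraining can actually read the file.

```python
from pathlib import Path


def check_inputs(paths):
    """Return a list of problems for input files that are missing or empty."""
    problems = []
    for raw in paths:
        path = Path(raw).expanduser()  # expand a leading ~ to the home directory
        if not path.is_file():
            problems.append(f"missing: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"empty: {path}")
    return problems


# Input files named in the 4th command above.
inputs = [
    "~/tesstutorial/trainplusminus/eng.lstm",
    "~/tesstutorial/trainplusminus/eng/eng.traineddata",
    "tessdata/best/eng.traineddata",
    "~/tesstutorial/trainplusminus/eng.training_files.txt",
]

for problem in check_inputs(inputs):
    print(problem)
```

Relative paths such as tessdata/best/eng.traineddata are resolved against the current directory, so the check should be run from the same directory as the training command.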

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/6e71569c-297d-4a16-8761-b352f25d87bb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[tesseract-ocr] Failed to continue from: /home/robert/tesstutorial/trainplusminus/eng.lstm

2017-08-06 Thread robertyoung0511

Hello,

I'm trying to fine-tune the traineddata by following the new tutorial: 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-%C2%B1-a-few-characters

I executed the commands as the tutorial shows:

1. training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/trainplusminus
2. training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "Impact Condensed" --output_dir ~/tesstutorial/evalplusminus
3. training/combine_tessdata -e tessdata/best/eng.traineddata \
  ~/tesstutorial/trainplusminus/eng.lstm
4. training/lstmtraining --model_output ~/tesstutorial/trainplusminus/plusminus \
  --continue_from ~/tesstutorial/trainplusminus/eng.lstm \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --old_traineddata tessdata/best/eng.traineddata \
  --train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
  --max_iterations 3600


When I execute the 4th command, an error appears:
Failed to continue from: /home/robert/tesstutorial/trainplusminus/eng.lstm

I checked, and the 'eng.lstm' file does exist in 
/home/robert/tesstutorial/trainplusminus.

So why does Tess4 say it cannot continue from the file?

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/726bdeb6-d944-43c0-b2d0-e1a580c90d2e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[tesseract-ocr] Re: Error:Assert failed:in file ../lstm/lstmtrainer.h, line 110

2017-08-04 Thread robertyoung0511

The code seems to have changed a lot, along with the training commands and 
the corresponding tutorials. For the changes, refer to 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00.
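
One visible difference is that the command quoted below has no --traineddata argument, while the lstmtraining command from the newer tutorial passes one. Whether that is what triggers the assertion is an assumption, not something confirmed in this thread. A minimal Python sketch for spotting such a missing flag in an argument list follows; the function name missing_flags is hypothetical.

```python
def missing_flags(argv, required=("--traineddata",)):
    """Return the flags in `required` that do not appear in an argument list."""
    # Accept both '--flag value' and '--flag=value' spellings.
    present = {arg.split("=", 1)[0] for arg in argv if arg.startswith("--")}
    return [flag for flag in required if flag not in present]


# Arguments of the command quoted below (paths shortened for illustration).
old_command = [
    "lstmtraining",
    "--model_output", "chituned",
    "--continue_from", "chi_sim.lstm",
    "--train_listfile", "chi_sim.training_files.txt",
    "--eval_listfile", "chi_sim.training_files.txt",
    "--target_error_rate", "0.01",
]

print(missing_flags(old_command))
```

The check only compares flag names; it does not verify that the file given to a flag exists or is valid.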

On Friday, August 4, 2017 at 2:33:41 PM UTC+8, roberty...@gmail.com wrote:
>
> Hello,
>
> I used 'git pull' to update the code from 
> https://github.com/tesseract-ocr/tesseract.git, then recompiled and 
> reinstalled Tess4.0.
>
> But when I execute the command shown below to fine-tune the 
> traineddata, an error appears: 
> "mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file 
> ../lstm/lstmtrainer.h, line 110"
>
> lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
> --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
> --train_listfile ~/tesstutorial/chitest/chi_sim.training_files.txt \
> --eval_listfile ~/tesstutorial/chitest/chi_sim.training_files.txt \
> --target_error_rate 0.01
>
>
>
> Tess worked fine before I updated the code, but now it crashes with an 
> assertion error. Why? Can you help me?
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/981a9380-bb57-4bcf-b321-d4ebff5f92bd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-08-04 Thread robertyoung0511

Hi, Shree,

I have also tried the new traineddata for recognizing Simplified Chinese 
on Linux (Ubuntu), and it works. But it seems that the new traineddata 
doesn't work on Windows.

With the new traineddata on Ubuntu, there are still some special symbols 
that cannot be recognized, such as '∠', '△', '≌', '≥' and so on.

I would like to improve the recognition of these special symbols, but I 
have no good way to do it at the moment. Can you give me some advice?

Thanks.
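
As a starting point, it can help to list exactly which characters of a transcript fall outside the model's character set, since shree's earlier reply (quoted below) attributes "Can't encode transcript" to such characters. A minimal Python sketch follows. The set known_chars is a made-up toy set, not the real chi_sim unicharset, and the per-character test ignores multi-character entries.

```python
def unencodable_chars(transcript, known_chars):
    """Return the characters of transcript absent from known_chars, in first-seen order."""
    missing = []
    for ch in transcript:
        if ch.isspace() or ch in known_chars or ch in missing:
            continue
        missing.append(ch)
    return missing


# Toy character set for illustration only.
known_chars = set("化简的结果是x23-")

print(unencodable_chars("化简（-x2）3的结果是", known_chars))
```

With a real model, known_chars would have to be filled from the characters listed in its unicharset rather than typed in by hand.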

On Tuesday, August 1, 2017 at 4:45:07 PM UTC+8, shree wrote:
>
> Ray has uploaded new traineddata files in 
> https://github.com/tesseract-ocr/tessdata/tree/master/best
>
> Why don't you first try recognition with that
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Aug 1, 2017 at 1:45 PM,  
> wrote:
>
>> Hello, Shree:
>>
>> Sorry, but can I use more than one unicharset, such as chi_sim and eng, 
>> to fine-tune the training? 
>> Some of the special characters may be in other unicharsets. If I find 
>> them, maybe I can train my traineddata with more unicharsets, so that 
>> the special characters can be encoded.
>>
>> Thanks, and I hope for your reply.
>>
>> On Tuesday, July 25, 2017 at 3:23:08 PM UTC+8, shree wrote:
>>>
>>> That error is because some characters in your training text are not part 
>>> of the unicharset of chi_sim.
>>>
>>> You are trying fine-tune training, which will give an error. Replace top 
>>> layer will work.
>>>
>>> I suggest that you wait 2-3 weeks for Ray to upload new traineddata for 
>>> all languages. 
>>>
>>> You can tell us if there are any specific characters missing from the 
>>> existing traineddata.
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Tue, Jul 25, 2017 at 12:46 PM,  wrote:
>>>
>>>> Hello,
>>>>
>>>> I used this command to train my own traineddata:
>>>>
>>>> lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
>>>>   --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
>>>>   --train_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>>>   --eval_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>>>   --target_error_rate 0.01
>>>>
>>>> Tess4.0 reports an error, shown in the following image. It says 
>>>> "Can't encode transcript" for text content such as 
>>>> "化简（-x2）3的结果是...".
>>>> Why? Can you help me?
>>>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1985a9ff-316f-4e98-bcc6-58880214ab82%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-08-04 Thread robertyoung0511

I have tried the new traineddata on Linux (Ubuntu). It works, but it seems 
that the new traineddata doesn't work on Windows.

On Tuesday, August 1, 2017 at 6:03:13 PM UTC+8, roberty...@gmail.com wrote:
>
> When I use the new traineddata, it reports an error: cannot find 
> chi_sim.traineddata. Does the new traineddata only support the Tess4.0 
> alpha release? I use the newest code release.
>
> On Tuesday, August 1, 2017 at 4:45:07 PM UTC+8, shree wrote:
>>
>> Ray has uploaded new traineddata files in 
>> https://github.com/tesseract-ocr/tessdata/tree/master/best
>>
>> Why don't you first try recognition with that
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, Aug 1, 2017 at 1:45 PM,  wrote:
>>
>>> Hello, Shree:
>>>
>>> Sorry, but can I use more than one unicharset, such as chi_sim and eng, 
>>> to fine-tune the training? 
>>> Some of the special characters may be in other unicharsets. If I find 
>>> them, maybe I can train my traineddata with more unicharsets, so that 
>>> the special characters can be encoded.
>>>
>>> Thanks, and I hope for your reply.
>>>
>>> On Tuesday, July 25, 2017 at 3:23:08 PM UTC+8, shree wrote:
>>>>
>>>> That error is because some characters in your training text are not 
>>>> part of the unicharset of chi_sim.
>>>>
>>>> You are trying fine-tune training, which will give an error. Replace 
>>>> top layer will work.
>>>>
>>>> I suggest that you wait 2-3 weeks for Ray to upload new traineddata for 
>>>> all languages. 
>>>>
>>>> You can tell us if there are any specific characters missing from the 
>>>> existing traineddata.
>>>>
>>>> ShreeDevi
>>>> 
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Tue, Jul 25, 2017 at 12:46 PM,  wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I used this command to train my own traineddata:
>>>>>
>>>>> lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
>>>>>   --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
>>>>>   --train_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>>>>   --eval_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>>>>   --target_error_rate 0.01
>>>>>
>>>>> Tess4.0 reports an error, shown in the following image. It says 
>>>>> "Can't encode transcript" for text content such as 
>>>>> "化简（-x2）3的结果是...".
>>>>> Why? Can you help me?
>>>>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/5651752f-75e9-4d99-a0eb-dce266ad5b3e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[tesseract-ocr] Error:Assert failed:in file ../lstm/lstmtrainer.h, line 110

2017-08-04 Thread robertyoung0511

Hello,

I used 'git pull' to update the code from 
https://github.com/tesseract-ocr/tesseract.git, then recompiled and 
reinstalled Tess4.0.

But when I execute the command shown below to fine-tune the 
traineddata, an error appears: 
"mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file 
../lstm/lstmtrainer.h, line 110"

lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
--continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
--train_listfile ~/tesstutorial/chitest/chi_sim.training_files.txt \
--eval_listfile ~/tesstutorial/chitest/chi_sim.training_files.txt \
--target_error_rate 0.01



Tess worked fine before I updated the code, but now it crashes with an 
assertion error. Why? Can you help me?

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/75ba4766-370a-46c0-88b0-a15456aa7c9f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-08-01 Thread robertyoung0511

When I use the new traineddata, it reports an error: cannot find 
chi_sim.traineddata. Does the new traineddata only support the Tess4.0 
alpha release? I use the newest code release.

On Tuesday, August 1, 2017 at 4:45:07 PM UTC+8, shree wrote:
>
> Ray has uploaded new traineddata files in 
> https://github.com/tesseract-ocr/tessdata/tree/master/best
>
> Why don't you first try recognition with that
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Aug 1, 2017 at 1:45 PM,  
> wrote:
>
>> Hello, Shree:
>>
>> Sorry, but can I use more than one unicharset, such as chi_sim and eng, 
>> to fine-tune the training? 
>> Some of the special characters may be in other unicharsets. If I find 
>> them, maybe I can train my traineddata with more unicharsets, so that 
>> the special characters can be encoded.
>>
>> Thanks, and I hope for your reply.
>>
>> On Tuesday, July 25, 2017 at 3:23:08 PM UTC+8, shree wrote:
>>>
>>> That error is because some characters in your training text are not part 
>>> of the unicharset of chi_sim.
>>>
>>> You are trying fine-tune training, which will give an error. Replace top 
>>> layer will work.
>>>
>>> I suggest that you wait 2-3 weeks for Ray to upload new traineddata for 
>>> all languages. 
>>>
>>> You can tell us if there are any specific characters missing from the 
>>> existing traineddata.
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Tue, Jul 25, 2017 at 12:46 PM,  wrote:
>>>
>>>> Hello,
>>>>
>>>> I used this command to train my own traineddata:
>>>>
>>>> lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
>>>>   --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
>>>>   --train_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>>>   --eval_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>>>   --target_error_rate 0.01
>>>>
>>>> Tess4.0 reports an error, shown in the following image. It says 
>>>> "Can't encode transcript" for text content such as 
>>>> "化简（-x2）3的结果是...".
>>>> Why? Can you help me?
>>>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/f5dc5b16-3082-444a-b298-52867ae61e64%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-08-01 Thread robertyoung0511

OK, I will give it a try. Thanks.

On Tuesday, August 1, 2017 at 4:45:07 PM UTC+8, shree wrote:
>
> Ray has uploaded new traineddata files in 
> https://github.com/tesseract-ocr/tessdata/tree/master/best
>
> Why don't you first try recognition with that
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Aug 1, 2017 at 1:45 PM,  
> wrote:
>
>> Hello, Shree:
>>
>> I'm sorry, but whether can I use more than one unicharset, such as 
>> chi_sim and eng and so on, to finetune the training? 
>> Maybe some special characters can be in other unicharsets. If I find 
>> it/them, maybe I will train my traineddata with more unicharsets, and the 
>> special characters will be encoded at that time.
>>
>> Thanks, and hope for your reply.
>>
>> On Tuesday, July 25, 2017 at 3:23:08 PM UTC+8, shree wrote:
>>>
>>> That error is because some characters in your training text are not part 
>>> of the unicharset of chi_sim.
>>>
>>> You are trying finetune training which will give error. Replace top 
>>> layer will work.
>>>
>>> I suggest that you wait 2-3 weeks for Ray to upload new traineddata for 
>>> all languages. 
>>>
>>> You can tell us if there are any specific characters missing from 
>>> existing traineddata .
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Tue, Jul 25, 2017 at 12:46 PM,  wrote:
>>>
>>>> Hello,
>>>>
>>>> I apply the command to train my own traineddata:
>>>>
>>>> lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
>>>>   --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
>>>>   --train_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>>>   --eval_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>>>   --target_error_rate 0.01
>>>>
>>>> An error appears by Tess4.0 that shown in the following img. The system
>>>> (Tess4.0) says "Can't encode transcript" for text content such as
>>>> "化简（-x2）3的结果是...".
>>>> Why? Can you help me?


 


Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-08-01 Thread robertyoung0511

Hello, Shree:

Sorry to ask again, but can I use more than one unicharset, for example 
chi_sim and eng, when fine-tuning? 
Some of the special characters may be present in other unicharsets. If I 
find them there, I could train my traineddata with several unicharsets, so 
that the special characters can be encoded.

Thanks, and I look forward to your reply.

On Tuesday, July 25, 2017 at 3:23:08 PM UTC+8, shree wrote:
>
> That error is because some characters in your training text are not part 
> of the unicharset of chi_sim.
>
> You are trying finetune training which will give error. Replace top layer 
> will work.
>
> I suggest that you wait 2-3 weeks for Ray to upload new traineddata for 
> all languages. 
>
> You can tell us if there are any specific characters missing from existing 
> traineddata .
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Jul 25, 2017 at 12:46 PM,  
> wrote:
>
>> Hello,
>>
>> I apply the command to train my own traineddata:
>>
>> lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
>>   --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
>>   --train_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>   --eval_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>   --target_error_rate 0.01 
>>
>> An error appears by Tess4.0 that shown in the following img. The system 
>> (Tess4.0) says "Can't encode transcript" for text content such as 
>> "化简（-x2）3的结果是...".
>> Why? Can you help me?
>>
>>
>> 
>>

[tesseract-ocr] How to recognize some specific symbols with Tess4.0

2017-07-31 Thread robertyoung0511

[attached image: input line image for Tess4.0, not preserved in the archive]

Hello,

I'm trying to use Tess4.0 to recognize Simplified Chinese with the 
following command-line arguments:
  argc = 13;
  argv[1] = "E:/数据库/yanghui_results/yanghui_100_0.jpg";
  argv[2] = "E:/sample/01";
  argv[3] = "-l";
  argv[4] = "chi_sim+eng";
  argv[5] = "-psm";
  argv[6] = "7";
  argv[7] = "--oem";
  argv[8] = "OEM_TESSERACT_LSTM_COMBINED";
  argv[9] = "--tessdata-dir";
  argv[10] = "../tessdata";
  argv[11] = "--user-words";
  argv[12] = "../tessdata/chi_sim.user-words";

I have used the chi_sim and eng traineddata as the tessdata languages, but 
some specific symbols, such as '∠' (the angle sign), are not recognized 
correctly.

For example, with the image above as the input to Tess4.0, the result is:
如图， 在口ABCD中， 点E， F在AC上， 且乙ABE=乙CDF， 求证: BE=DF,

In this result, the '∠' symbol has been recognized as '乙', the rhomboid 
symbol as '口', and the '.' period as a ',' comma.

How can these specific symbols be recognized correctly with Tess4.0? Can 
you help me?


Re: [tesseract-ocr] Could not find font named AR PL UMing Patched Light

2017-07-26 Thread robertyoung0511

OK. Thank you sincerely for your reply, Shree.

On Wednesday, July 26, 2017 at 2:48:13 PM UTC+8, shree wrote:
>
> I do not have this font.
>
> The training is done at Google. They probably use a number of commercial 
> fonts in addition to freely available fonts. The fonts are not provided as 
> part of the training data.
>
> You have to get your own set of fonts to train or wait for the new 
> traineddata by Ray (expected in next few weeks).
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Wed, Jul 26, 2017 at 11:09 AM,  
> wrote:
>
>> Yeah, I know that. But I lack the font of AR PL UMing Patched Light, 
>> which cannot be found in the Internet.
>>
>> I'm afraid that I may need to find this package (the font of AR PL UMing 
>> Patched Light) from you. If you don't mind sharing your resources, thanks 
>> sincerely.
>>
>> On Wednesday, July 26, 2017 at 11:31:23 AM UTC+8, shree wrote:
>>>
>>> The training process uses the list of fonts from 
>>> https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh
>>>
>>> You need to update it to match the fonts available with you for the 
>>> script you are training and include the correct location for the fonts 
>>> directory.
>>>
>>> ShreeDevi
>>> 
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Wed, Jul 26, 2017 at 7:17 AM,  wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm trying to train my own traineddata with Tess4.0 following the
>>>> tutorail:
>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replace-Top-Layer
>>>>
>>>> When executing the command:
>>>> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim \
>>>> --training_text ../training_data/part.txt \
>>>> --linedata_only --noextract_font_properties \
>>>> --langdata_dir ../langdata --tessdata_dir ./tessdata \
>>>> --output_dir ~/tesstutorial/chisim
>>>>
>>>> An error appears: "Could not find font named AR PL UMing Patched
>>>> Light", showed in the follow img.
>>>>
>>>> Then I search for the package of "AR PL UMing Patched Light.ttf" with
>>>> Baidu, Google and some other search engines, but cannot find the result.
>>>>
>>>> Can you help me? I don't know if there are other solutions for this
>>>> problem.


 


[tesseract-ocr] Could not find font named AR PL UMing Patched Light

2017-07-25 Thread robertyoung0511

Hello,

I'm trying to train my own traineddata with Tess4.0 following the tutorial: 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replace-Top-Layer

When executing the command:
training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim \
--training_text ../training_data/part.txt \
--linedata_only --noextract_font_properties \
--langdata_dir ../langdata --tessdata_dir ./tessdata \
--output_dir ~/tesstutorial/chisim

An error appears: "Could not find font named AR PL UMing Patched Light" 
(shown in the attached image).

I then searched for the font file "AR PL UMing Patched Light.ttf" with 
Baidu, Google and some other search engines, but could not find it. 

Can you help me? I don't know whether there is another way to solve this 
problem.
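
The reply in this thread suggests matching the font list to the fonts that are actually installed. A minimal sketch of that idea follows, under two assumptions that should be checked against the installed training tools: that `text2image` accepts `--list_available_fonts` together with `--fonts_dir`, and that `tesstrain.sh` accepts a plus-separated `--fontlist`. The helper names and the font name in the usage lines are hypothetical placeholders, not fonts known to be on any system.

```python
def list_fonts_command(fonts_dir):
    """Command that asks text2image which fonts it can see in fonts_dir."""
    return ["text2image", "--list_available_fonts", "--fonts_dir", fonts_dir]


def tesstrain_command(fonts_dir, lang, fontlist, extra_args=()):
    """Build a tesstrain.sh command restricted to the given font names.

    Assumption: --fontlist takes the names joined by '+'.
    """
    return [
        "training/tesstrain.sh",
        "--fonts_dir", fonts_dir,
        "--lang", lang,
        "--fontlist", "+".join(fontlist),
    ] + list(extra_args)


if __name__ == "__main__":
    print(" ".join(list_fonts_command("/usr/share/fonts")))
    # "Some Installed Font" is a placeholder; substitute a name reported
    # by the listing command above.
    print(tesstrain_command(
        "/usr/share/fonts", "chi_sim", ["Some Installed Font"],
        extra_args=["--linedata_only", "--noextract_font_properties"],
    ))
```

Passing an explicit font list this way avoids depending on fonts that are named in the default list but not available locally.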




Re: [tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-07-25 Thread robertyoung0511

Thanks for your help.

I will fine-tune with the new traineddata once it is available in 2-3 
weeks, and report back on the specific characters.

On Tuesday, July 25, 2017 at 3:23:08 PM UTC+8, shree wrote:
>
> That error is because some characters in your training text are not part 
> of the unicharset of chi_sim.
>
> You are trying finetune training which will give error. Replace top layer 
> will work.
>
> I suggest that you wait 2-3 weeks for Ray to upload new traineddata for 
> all languages. 
>
> You can tell us if there are any specific characters missing from existing 
> traineddata .
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Jul 25, 2017 at 12:46 PM,  
> wrote:
>
>> Hello,
>>
>> I apply the command to train my own traineddata:
>>
>> lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
>>   --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
>>   --train_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>   --eval_listfile ~/tesstutorial/chitest/chi.training_files.txt \
>>   --target_error_rate 0.01 
>>
>> An error appears by Tess4.0 that shown in the following img. The system 
>> (Tess4.0) says "Can't encode transcript" for text content such as 
>> "化简（-x2）3的结果是...".
>> Why? Can you help me?
>>
>>
>> 
>>

[tesseract-ocr] "Can't encode transcript" error when using "lstmtraining" command with Tess4.0

2017-07-25 Thread robertyoung0511

Hello,

I ran the following command to train my own traineddata:

lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
  --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
  --train_listfile ~/tesstutorial/chitest/chi.training_files.txt \
  --eval_listfile ~/tesstutorial/chitest/chi.training_files.txt \
  --target_error_rate 0.01 

Tess4.0 reports an error, shown in the attached image. It says 
"Can't encode transcript" for text content such as 
"化简（-x2）3的结果是...".
Why does this happen? Can you help me?
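
The reply in this thread attributes this error to characters in the training text that are not part of the chi_sim unicharset. A rough way to list such characters is sketched below. It assumes the usual unicharset text layout (first line is the entry count, each following line starts with the symbol and then its properties), and it compares single code points only, so it is an approximation of what lstmtraining actually checks rather than a reimplementation. The function names and the file paths in the usage lines are hypothetical.

```python
from pathlib import Path


def load_unichars(unicharset_path):
    """Return the set of symbols listed in a unicharset file.

    Assumed layout: line 1 is the number of entries; every later line
    begins with the symbol, followed by space-separated properties.
    """
    lines = Path(unicharset_path).read_text(encoding="utf-8").splitlines()
    symbols = set()
    for line in lines[1:]:
        first_field = line.split(" ")[0]
        if first_field:
            symbols.add(first_field)
    return symbols


def missing_characters(text, symbols):
    """List non-whitespace characters of text absent from symbols,
    in order of first appearance and without repeats."""
    missing = []
    for ch in text:
        if ch.isspace() or ch in symbols or ch in missing:
            continue
        missing.append(ch)
    return missing


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    symbols = load_unichars("chi_sim.lstm-unicharset")
    text = Path("training_text.txt").read_text(encoding="utf-8")
    print(missing_characters(text, symbols))
```

Any characters it reports (full-width brackets are a plausible candidate in the example line) would need either to be removed from the training text or to be covered by a training approach that changes the character set.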




Re: [tesseract-ocr] Combine_tessdata command error while training Tesseract4.0

2017-07-25 Thread robertyoung0511

I had forgotten the nor.traineddata file. Thanks for your help.

On Monday, July 24, 2017 at 7:59:20 PM UTC+8, shree wrote:
>
> Is your traineddata file present at  ../tessdata/nor.traineddata?
> Is it 4.00 version?
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Mon, Jul 24, 2017 at 1:47 PM,  
> wrote:
>
>>  Hello,
>>
>> I'm trying to train the Tesseract4.0 following the steps in the tutorial: 
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replacing-Top-Layer-Example
>>
>> But when I execute the command:
>>
>> mkdir -p ~/tesstutorial/nor_layer
>> $ combine_tessdata -e ../tessdata/nor.traineddata \
>> >   ~/tesstutorial/nor_layer/nor.lstm
>>
>>
>> An error message is given by the system, which is shown as following: Not 
>> extracting /home/robert/tesstutorial/nor_layer/nor.lstm, since this 
>> component is not present.
>>
>> Why do I receive this error? The message in the tutorial shows: "Wrote 
>> /home/shree/tesstutorial/nor_layer/nor.lstm"  represents nor.lstm will be 
>> written.
>> But why the system hint the nor.lstm file not present? Can you help me... 
>> (Thanks)
>>

[tesseract-ocr] Combine_tessdata command error while training Tesseract4.0

2017-07-24 Thread robertyoung0511

 Hello,

I'm trying to train Tesseract 4.0 following the steps in the tutorial: 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replacing-Top-Layer-Example

But when I execute the command:

mkdir -p ~/tesstutorial/nor_layer
combine_tessdata -e ../tessdata/nor.traineddata \
  ~/tesstutorial/nor_layer/nor.lstm


The system gives the following error message: "Not extracting 
/home/robert/tesstutorial/nor_layer/nor.lstm, since this component is not 
present."

Why do I receive this error? The tutorial shows the message "Wrote 
/home/shree/tesstutorial/nor_layer/nor.lstm", which indicates that nor.lstm 
is written.
Why does the system report that the nor.lstm component is not present? Can 
you help me? Thanks.
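
The reply in this thread asks whether the traineddata file is present at the given path and whether it is a 4.00 version. A minimal sketch of that check is given below. It assumes that the installed `combine_tessdata` supports the `-d` option for listing the components of a traineddata file, which should be confirmed for the build in use; the function name `list_components` is hypothetical.

```python
import os
import subprocess


def list_components(traineddata_path):
    """Return combine_tessdata's component listing for a traineddata file.

    Raises FileNotFoundError if the file itself is missing, which is the
    first thing to rule out. Assumes `combine_tessdata -d` is available.
    """
    if not os.path.isfile(traineddata_path):
        raise FileNotFoundError(traineddata_path)
    result = subprocess.run(
        ["combine_tessdata", "-d", traineddata_path],
        capture_output=True, text=True, check=False,
    )
    # The tool may write its listing to either stream, so return both.
    return result.stdout + result.stderr


if __name__ == "__main__":
    print(list_components("../tessdata/nor.traineddata"))
```

If the file exists but the listing shows no LSTM component, that would suggest an older (3.0x) traineddata file rather than a 4.00 one.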

