Re: [tesseract-ocr] Compute CTC targets failed while training

Khosrobeigy.zohreh Wed, 26 Sep 2018 01:01:31 -0700

I know; I am actually well versed in LSTMs. I want to resolve all the errors
first and then train on a large text.
With the alpha version I trained on about 1000 lines and the result was not
bad, but with beta 4 I get many errors.
With alpha I used the following config:
# Use LSTM
tessedit_ocr_engine_mode 1
tessedit_pageseg_mode 6


# Arabic page layout variables
segment_nonalphabetic_script 1

# Avoid dropping rows
textord_noise_rowratio 20.0
textord_noise_syfract 0.6

textord_min_linesize 2.5

# Avoid over-estimating intra-word spacing at both row and
# block levels when using old to method
tosp_old_to_method T
tosp_old_to_constrain_sp_kn T
tosp_old_sp_kn_th_factor 4.0

tosp_only_small_gaps_for_kern T
tosp_use_pre_chopping T
I used all of these settings, but now my model does not learn.
Has anything changed in beta 4, for example in text2image?
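
To double-check the claim in the quoted message below that every line of the
training text has at most 20 tokens, a minimal sketch follows. It is only an
illustration and not part of the Tesseract tooling: `long_lines` is a
hypothetical helper name, and "token" is assumed to mean a whitespace-separated
word.

```shell
#!/bin/sh
# long_lines FILE [MAX]: hypothetical helper, not a Tesseract tool.
# Prints "line_number token_count" for every line of FILE that has more
# than MAX whitespace-separated tokens (MAX defaults to 20).
long_lines() {
    awk -v max="${2:-20}" 'NF > max { print NR, NF }' "$1"
}
```

Calling `long_lines fas.training_text.txt 20` prints nothing if every line is
within the limit. Words joined by a zero-width non-joiner count as one token
here, because awk only splits on whitespace.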

On Wed, Sep 26, 2018 at 12:53 AM Shree Devi Kumar <shreesh...@gmail.com>
wrote:

>   --fontlist "Arial"
>
> Does that have good coverage for Farsi?
>
>
> --max_iterations 5000
>
> You are trying to train from scratch with 18000 lines of text and only
> 5000 iterations. That will not work.
>
> Ray has trained on hundreds of thousands of lines of text and millions of
> iterations.
>
> On Tue, 25 Sep 2018, 16:20 Zohreh Khosrobeygi, <beigy.zoh...@gmail.com>
> wrote:
>
>> Hi, I use this :
>> tesseract 4.0.0-beta.4
>>  leptonica-1.74.4
>>   libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
>>
>>  Found AVX2
>>  Found AVX
>>  Found SSE
>> I've trained on about 18000 lines for the Persian language. I use this command:
>>
>> bash -x tesstrain.sh --fonts_dir /usr/share/fonts --lang fas \
>>   --training_text /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/fas.training_text.txt \
>>   --wordlist /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/fas.wordlist.txt \
>>   --linedata_only \
>>   --noextract_font_properties \
>>   --langdata_dir /home/zohreh/Desktop/tesseract-master/src/training/langdata \
>>   --tessdata_dir /home/zohreh/Desktop/tesseract-master/tessdata \
>>   --fontlist "Arial" \
>>   --output_dir /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Phase2
>> and then run this:
>> sudo /home/zohreh/Desktop/tesseract-master/src/training/lstmtraining \
>>   --traineddata /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Phase2/fas/fas.traineddata \
>>   --net_spec '[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]' \
>>   --model_output /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Out/base \
>>   --learning_rate 0.001 \
>>   --train_listfile /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Phase2/fas.training_files.txt \
>>   --eval_listfile /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/v/fas.training_files.txt \
>>   --max_iterations 5000 \
>>   &>/home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Out/basetrain.log
>> but it always shows "Compute CTC targets failed" and the model performs
>> poorly.
>> I normalized my text, and each line of the text has at most 20 tokens.
>> Could you please help me?
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/04872dc6-7d92-4f95-9f65-8bb0cbf87c8c%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/04872dc6-7d92-4f95-9f65-8bb0cbf87c8c%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>

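To put the iteration advice quoted above into numbers, here is a small sketch.
It assumes that each lstmtraining iteration processes one training line, which
is an assumption about the trainer and is not stated in this thread;
`passes_over_data` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# passes_over_data ITERATIONS LINES: hypothetical helper.
# Assumes one training line per iteration and prints how many passes
# over the training data the iteration budget amounts to.
passes_over_data() {
    awk -v it="$1" -v n="$2" 'BEGIN { printf "%.2f\n", it / n }'
}

passes_over_data 5000 18000   # prints 0.28
```

Under that assumption, 5000 iterations over 18000 lines cover only about 28%
of the lines once, so most of the training text is never seen.
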

-- 
Zohreh Khosrobeygi
University of Tehran, 2016
Tel: +989196042887
khosrobeygi.zo...@ut.ac.ir <khosrobeygi.zoh...@ut.ac.ir>

