Re: [tesseract-ocr] What is difference between "unicharset file" and "lstm-unicharset file"

ShreeDevi Kumar Thu, 01 Mar 2018 20:18:42 -0800

Please see
https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc#components
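
The components described on that page can be inspected with combine_tessdata itself. Below is a minimal sketch, assuming the Tesseract 4 training tools are installed. The file eng.traineddata and the out/eng. prefix are placeholder paths, and the run helper is hypothetical: it only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands by default; set RUN=1 to execute them
# (executing requires combine_tessdata from the Tesseract training tools).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# List the components packed inside a traineddata file.
list_components() {
    run combine_tessdata -d "$1"
}

# Unpack every component to files named <prefix><component>,
# for example out/eng.unicharset and out/eng.lstm-unicharset.
unpack_components() {
    run combine_tessdata -u "$1" "$2"
}

# eng.traineddata and out/eng. are placeholder paths.
list_components eng.traineddata
unpack_components eng.traineddata out/eng.
```

After unpacking, whichever of (lang).unicharset and (lang).lstm-unicharset the file contains appear as separate files. As that page describes them, the first is the character set for the legacy engine and the second is the one used by the LSTM engine.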




ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Mar 2, 2018 at 6:22 AM, 이경준 <[email protected]> wrote:

>
> Hi. Thank you for looking at my questions.
>
> 1. What is the difference between 'unicharset' and 'lstm-unicharset'?
>
> I know how to make 'unicharset' with this command line: "$ tesseract
> (lang).(filename).exp(num).tif  (lang).(filename).exp(num).box"
>
> But I don't know how to make 'lstm-unicharset'.
>
> cf) .tr -> .lstmf
>
> I changed this command line, "$ tesseract (lang).(filename).exp(num).tif
> (lang).(filename).exp(num) nobatch *box.train*", to "$ tesseract
> (lang).(filename).exp(num).tif (lang).(filename).exp(num) nobatch
> *lstm.train*".
>
> 2. Is this usage right?
>
> Is it possible to use a 'unicharset' as the 'lstm-unicharset'?
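
For comparison, the two invocations differ only in the config name at the end. This is a minimal sketch that prints the command lines rather than running tesseract. The base name eng.myfont.exp0 and the training_cmd helper are hypothetical, and the nobatch argument from the question is left out.

```shell
#!/bin/sh
# Sketch only: prints the command line instead of running tesseract.
# $1 = base name such as eng.myfont.exp0 (placeholder), $2 = config name.
training_cmd() {
    echo "tesseract $1.tif $1 $2"
}

# When actually run, box.train writes a .tr file for the legacy engine.
training_cmd eng.myfont.exp0 box.train
# When actually run, lstm.train writes a .lstmf file for LSTM training.
training_cmd eng.myfont.exp0 lstm.train
```

Neither command produces a unicharset or an lstm-unicharset by itself. In the process quoted below, making the unicharset is a separate, earlier step.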
>
>
>
> 3. In the github wiki passage
>
> Overview of Training Process
>
> The overall training process is similar to training 3.04
> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>.
>
> Conceptually the same:
>
>    1. Prepare training text.
>    <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
>    2. Render text to image + box file. (Or create hand-made box files for
>    existing image data.)
>    3. Make unicharset file. (Can be partially specified, ie created
>    manually).
>    4. Make a starter traineddata from the unicharset and optional
>    dictionary data.
>    <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>
>    5. Run tesseract to process image + box file to make training data set.
>    6. Run training on training data set.
>    7. Combine data files.
>
> The key differences are:
>
>    - The boxes only need to be at the *textline level.* It is thus *far
>    easier* to make training data from existing image data.
>    - The .tr files are replaced by .lstmf data files.
>    - Fonts *can and should be mixed freely* instead of being separate.
>    - The clustering steps (mftraining, cntraining, shapeclustering) are
>    replaced with a single slow lstmtraining step.
>
>
> I think a sentence such as '"unicharset" is replaced by
> "lstm-unicharset"' should be added to the key differences section.
>
> Am I wrong?
>
>
>
> I look forward to everybody's answers.
>
> Thank you. Have a nice day!
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/5730b272-043b-4abe-8d85-b8f4d96aad33%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

