Re: [tesseract-ocr] Training Tesseract 4.0 using images?

2018-04-13 Thread ShreeDevi Kumar
Training Tesseract 4.0 from images is not officially supported. Several
people have had success with LSTM training from box/tiff pairs, but it
requires hacks/programming on their part to create 4.0.0-compatible box
files.

tesstrain.sh creates box/tiff files in the /tmp directory; these are used
to create the lstmf files for LSTM training. tesstrain.sh can create a
3.0x-compatible traineddata or a 4.0.0-compatible starter traineddata,
depending on the options chosen. For 4.0.0, this starter traineddata, along
with the lstmf files, is used for LSTM training.
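
As a rough illustration of the 4.0.0 path described above, the sketch below
assembles a tesstrain.sh command line in Python. The flags are the ones
shown in the 4.0 training wiki; the helper name build_tesstrain_command and
all directory paths are placeholders chosen for this example, not part of
Tesseract.

```python
import shlex


def build_tesstrain_command(lang, fonts_dir, langdata_dir, tessdata_dir,
                            output_dir, lstm=True):
    """Return the argv list for a tesstrain.sh run (hypothetical helper)."""
    cmd = [
        "tesstrain.sh",
        "--fonts_dir", fonts_dir,
        "--lang", lang,
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--output_dir", output_dir,
    ]
    if lstm:
        # --linedata_only requests the lstmf files and starter traineddata
        # for LSTM training rather than a 3.0x-style traineddata.
        cmd += ["--linedata_only", "--noextract_font_properties"]
    return cmd


if __name__ == "__main__":
    # All paths below are assumptions for illustration only.
    argv = build_tesstrain_command(
        lang="eng",
        fonts_dir="/usr/share/fonts",
        langdata_dir="../langdata",
        tessdata_dir="./tessdata",
        output_dir="./engtrain",
    )
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(shlex.quote(a) for a in argv))
```

The command is only printed here; it could be passed to subprocess.run once
the paths match a real checkout.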

The format of traineddata files for 3.0x and 4.0.0 is different.

For the different components of a traineddata file, see

https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc
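
One way to see the difference between the two formats is to unpack a
traineddata file with combine_tessdata -u and look at which components come
out. The sketch below builds that command and sorts the unpacked file names
into LSTM and other components by their "lstm" suffix. The helper names and
the example file names are assumptions made for illustration; the linked
page is the authoritative list of components.

```python
def unpack_command(traineddata_path, output_prefix):
    """Return argv for unpacking all components (hypothetical helper).

    combine_tessdata -u writes files named <output_prefix><component>,
    so the prefix is usually given with a trailing dot, e.g. "eng.".
    """
    return ["combine_tessdata", "-u", traineddata_path, output_prefix]


def split_components(file_names):
    """Split unpacked component file names into (other, lstm) lists."""
    other, lstm = [], []
    for name in file_names:
        # "eng.lstm-unicharset" -> "lstm-unicharset"
        component = name.rsplit(".", 1)[-1]
        if component.startswith("lstm"):
            lstm.append(name)
        else:
            # Legacy (3.0x) components and shared ones such as "version"
            # both end up here; this sketch does not tell them apart.
            other.append(name)
    return other, lstm


if __name__ == "__main__":
    print(" ".join(unpack_command("eng.traineddata", "eng.")))
    # Example names only; a real file may contain a different set.
    names = ["eng.unicharset", "eng.inttemp", "eng.lstm", "eng.lstm-unicharset"]
    print(split_components(names))
```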

For creating 4.0-compatible box files, see

https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375247341

https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#training-tesseract-lstm-engine
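
To give an idea of the kind of programming involved, here is a minimal
sketch that rewrites the per-glyph boxes of a single text line into
line-level boxes. It assumes the 3.0x box line layout
"<symbol> <left> <bottom> <right> <top> <page>" and, as far as I understand
the wiki, that LSTM training accepts boxes carrying the whole text line's
coordinates followed by a tab-marked end-of-line entry. Both function names
are made up for this example. The sketch does not add entries for the
spaces between words or group glyphs into lines; those steps would have to
be handled separately, so check the output against the links above.

```python
def parse_box_line(line):
    """Parse '<symbol> <left> <bottom> <right> <top> <page>' into a tuple."""
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def to_line_level_boxes(char_boxes):
    """Rewrite the glyph boxes of ONE text line as line-level box lines.

    Box coordinates have their origin at the bottom-left of the image,
    so the line's bottom is the minimum and its top is the maximum.
    """
    left = min(b[1] for b in char_boxes)
    bottom = min(b[2] for b in char_boxes)
    right = max(b[3] for b in char_boxes)
    top = max(b[4] for b in char_boxes)
    page = char_boxes[0][5]
    coords = "%d %d %d %d %d" % (left, bottom, right, top, page)

    # Every glyph gets the coordinates of the whole text line.
    lines = ["%s %s" % (b[0], coords) for b in char_boxes]
    # Assumed end-of-line marker: a tab in place of the symbol.
    lines.append("\t %s" % coords)
    return lines


if __name__ == "__main__":
    glyphs = [parse_box_line("a 10 20 30 40 0"),
              parse_box_line("b 32 18 50 42 0")]
    for out_line in to_line_level_boxes(glyphs):
        print(out_line)
```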

Please note that all of these are unsupported options.


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Apr 13, 2018 at 12:09 PM,  wrote:

> Hi all,
>
> I read in a different post that training Tesseract 4.0 from images is not
> supported. Is this true? I have been able to successfully train Tesseract
> 4.0 so far using font data. When using tesstrain.sh, the script creates a
> number of files, including an lstmf file alongside the usual traineddata
> file (and there are some others, like unicharset). I was wondering whether
> it is possible to use the traineddata generation from image and box file
> described in the Tesseract 3.0 training instructions to create these
> training files to train Tesseract 4.0. The Tesseract 3.0 instructions
> already produce a traineddata file; how can I generate the lstmf file (and
> the others), if it is possible?
>
> Thank you,
> Dennis

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUTs%2BZCSOUa6mQ6W%3DqQ9q-r%2BeBPa%3D3qjAss6zowy44nZQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Python wrappers for Tesseract

2018-04-13 Thread Mehul Bhardwaj
Hi, 

I am trying to train Tesseract-3.05 for a new language.

I am processing images using PIL and then need to post-process the
predicted text. For this I am using Python. I use a python-tesseract
wrapper in my Python code to return the box file data in the form of a
dictionary.

My question is this: when I train Tesseract on my local machine, do such
wrappers also give me updated output? Or will training Tesseract
separately have no effect on the output of the wrapper? If it has no
effect, how can I update the wrapper so that it gives me updated output
after every round of training?

I am working on Ubuntu 16.04 with tesseract-3.05 and python-2.7, using
Pytesseract/tesserocr.

Best Regards
Mehul



[tesseract-ocr] Training Tesseract 4.0 using images?

2018-04-13 Thread denniscfeng
Hi all,

I read in a different post that training Tesseract 4.0 from images is not
supported. Is this true? I have been able to successfully train Tesseract
4.0 so far using font data. When using tesstrain.sh, the script creates a
number of files, including an lstmf file alongside the usual traineddata
file (and there are some others, like unicharset). I was wondering whether
it is possible to use the traineddata generation from image and box file
described in the Tesseract 3.0 training instructions to create these
training files to train Tesseract 4.0. The Tesseract 3.0 instructions
already produce a traineddata file; how can I generate the lstmf file (and
the others), if it is possible?

Thank you,
Dennis
