Training Tesseract 4.0 from images is not officially supported. Several people have succeeded in doing LSTM training with box/tiff pairs, but it required hacks or programming on their part to create 4.0.0-compatible box files.
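As a rough illustration of what such a box/tiff workflow might look like, here is a minimal Python sketch. It pairs each .tif with a same-named .box file and builds a tesseract command that uses the lstm.train config, which is how the training scripts ask tesseract to write .lstmf files. The helper name build_lstmf_commands and the directory layout are assumptions made for this example, the box files are assumed to already be in a 4.0.0-compatible format, and like everything else here this is an unsupported route.

```python
from pathlib import Path


def build_lstmf_commands(data_dir):
    """Return one tesseract command per .tif that has a matching .box file.

    Hypothetical helper: it only builds the command lines, it does not
    run tesseract or check that the box files are 4.0.0 compatible.
    """
    commands = []
    for tif in sorted(Path(data_dir).glob("*.tif")):
        box = tif.with_suffix(".box")
        if not box.exists():
            continue  # no box file for this image, so skip it
        # Output base is the image path without ".tif"; tesseract is
        # expected to write <output_base>.lstmf next to the image.
        output_base = tif.with_suffix("")
        commands.append(["tesseract", str(tif), str(output_base), "lstm.train"])
    return commands
```

Each returned list can be printed for inspection or passed to subprocess.run(cmd, check=True) once the box files have been verified by hand.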
tesstrain.sh creates box/tiff files in the /tmp directory; these are used to create the lstmf files for LSTM training. Depending on the options chosen, tesstrain.sh can create either a 3.0x-compatible traineddata or a 4.0.0-compatible starter traineddata. For 4.0.0, this starter traineddata, along with the lstmf files, is used for LSTM training.

The format of traineddata files differs between 3.0x and 4.0.0. For the different components of a traineddata file, see
https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc

For creating 4.0-compatible box files, see
https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375247341
https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#training-tesseract-lstm-engine

Please note that all of these are unsupported options.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Apr 13, 2018 at 12:09 PM, <denniscf...@berkeley.edu> wrote:
> Hi all,
>
> I read in a different post that training Tesseract 4.0 from images is not
> supported, is this true? I have been able to successfully train Tesseract
> 4.0 so far using font data. When using tesstrain.sh, the script creates a
> number of files, including an lstmf file alongside the usual traineddata
> file (and there are some others like unicharset). I was wondering if it is
> possible to use the traineddata generation from image and boxfile described
> in the Tesseract 3.0 training instructions to create these training files
> to train Tesseract 4.0. Tesseract 3.0 instructions already produce a
> traineddata file, how can I generate the lstmf file (and the others) if it
> is possible?
>
> Thank you,
> Dennis
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to firstname.lastname@example.org.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUTs%2BZCSOUa6mQ6W%3DqQ9q-r%2BeBPa%3D3qjAss6zowy44nZQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.