After some work I am able to:
- Use the *lstmbox* config of *tesseract.exe* to obtain the *.box* files for my *.tif* images
- Use the third-party software *JTessBoxEditor* to correct the recognized characters, keeping boxes that cover the full line of text
- Use the *lstm.train* config of *tesseract.exe* to obtain the *.lstmf* files from the *.box* files
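For reference, the two *tesseract.exe* steps above can be sketched as command lines built from Python. This is only an illustration under assumptions: the helper names `box_command` and `lstmf_command` are hypothetical, the executable is assumed to be on the PATH as `tesseract`, and each *.box* file is assumed to sit next to its *.tif* with the same base name. The manual correction in *JTessBoxEditor* happens between the two stages.

```python
from pathlib import Path
import subprocess


def box_command(image, tesseract="tesseract"):
    """Command that writes <base>.box for one line image (lstmbox config)."""
    image = Path(image)
    # Output base is the image path without its extension.
    return [tesseract, str(image), str(image.with_suffix("")), "lstmbox"]


def lstmf_command(image, tesseract="tesseract"):
    """Command that writes <base>.lstmf from the image and its .box file."""
    image = Path(image)
    return [tesseract, str(image), str(image.with_suffix("")), "lstm.train"]


def run_stage(images, build_command):
    """Run one stage for every image and stop at the first failing command."""
    for image in images:
        subprocess.run(build_command(image), check=True)
```

A possible use is `run_stage(images, box_command)`, then correcting the boxes by hand, then `run_stage(images, lstmf_command)`.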
Now, when I try to use *lstmtraining.exe* with *eng.traineddata* as the starting traineddata, I get the following errors:

*Deserialize header failed: [myfile1].lstmf*
*Deserialize header failed: [myfile2].lstmf*
*Deserialize header failed: [myfile3].lstmf*
*Loaded 1/1 lines (1-1) of document [myfile4].lstmf*
*Load of images failed!!*

From this I understand that there is an error either in the process of creating the *.lstmf* files or in the images I have selected. Any suggestion is welcome.

On Tuesday, 14 January 2020 at 17:43:40 UTC+1, Fabio Lugli wrote:
>
> Hello everyone, I'm trying to train Tesseract on handwriting, knowing that
> it's not the best option, using the latest version available for Windows. I
> have access to a huge number of .tif files, each a line of handwritten text.
> I'm able to obtain the .box files, which I later edit to comply with the
> latest requirements (boxes covering the whole line, spaces between words, a
> tab at the end). After that, I did not understand how to improve
> eng.traineddata or how to create my own .traineddata file, even after
> following the instructions at
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00.
> So what are the next steps to obtain a correct training dataset?
>
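One check that may help narrow down the *Deserialize header failed* messages reported above: as far as I can tell, that message is printed when a listed *.lstmf* file cannot be opened or its header cannot be read, so a missing or zero-byte file is one possible cause (not confirmed for this case). The sketch below is a minimal example under that assumption. The function names `split_lstmf` and `write_listfile` are hypothetical, and writing plain LF line endings is a precaution rather than a documented requirement.

```python
from pathlib import Path


def split_lstmf(folder):
    """Return (usable, suspect) lists of .lstmf paths found in folder.

    A file is treated as suspect when it is empty; a non-empty file is not
    guaranteed to be valid, it only passes this basic check.
    """
    usable, suspect = [], []
    for path in sorted(Path(folder).glob("*.lstmf")):
        if path.stat().st_size > 0:
            usable.append(path)
        else:
            suspect.append(path)
    return usable, suspect


def write_listfile(paths, listfile):
    """Write one absolute path per line, UTF-8, with LF line endings."""
    with open(listfile, "w", encoding="utf-8", newline="\n") as handle:
        for path in paths:
            handle.write(str(Path(path).resolve()) + "\n")
```

The resulting text file can then be passed to *lstmtraining.exe* through its `--train_listfile` option, and the suspect files regenerated from their *.tif* and *.box* pairs.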

