I want to try using Tesseract 4 on manuscripts written in old languages
("The Song of Roland" and such).

I have looked at
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
but the steps are very unclear.

I have an image and a text file containing the transcription of each line
of manuscript text. The doc says what to do, but not how.

I first thought I'd need image/box file pairs, but it seems that was for
Tesseract 3 and is now irrelevant...

So I guess my starting point is here:

> There is no tool to create the lstm-recoder directly. Instead there is a
> new tool, combine_lang_model which takes as input an input_unicharset and 
> script_dir (script_dir points to the langdata directory) and optional word
> list files. It creates the lstm-recoder from the input_unicharset and 
> creates all the dawgs, if wordlists are provided, putting everything 
> together into a traineddata file.
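
To make the quoted paragraph concrete, here is a rough sketch of how I understand such a call would be assembled. It only builds and prints the command line and does not run anything. The flag names (--input_unicharset, --script_dir, --output_dir, --lang, --words) are my reading of the training wiki and should be checked against the tool's own usage output. The paths and the language code "fro" are made-up placeholders, and build_combine_lang_model_cmd is a hypothetical helper, not part of Tesseract.

```python
import shlex


def build_combine_lang_model_cmd(unicharset, script_dir, output_dir, lang, words=None):
    """Return the argument list for one combine_lang_model call (nothing is run)."""
    cmd = [
        "combine_lang_model",
        "--input_unicharset", unicharset,  # the file I do not know how to create yet
        "--script_dir", script_dir,        # the langdata directory, per the quote
        "--output_dir", output_dir,
        "--lang", lang,
    ]
    if words is not None:
        # Word lists are optional according to the quoted paragraph.
        cmd += ["--words", words]
    return cmd


# Placeholder paths and language code, purely for illustration.
cmd = build_combine_lang_model_cmd("train/unicharset", "langdata", "out", "fro")
print(shlex.join(cmd))
```

If that is right, the only input I am missing is the unicharset file itself.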

I don't really get this part. How do I make input_unicharset? What is


Guillaume Desforges

