For LSTM training, box files need one additional entry at the end of each
text line: a box whose character is a tab, which marks the end of the line.
If you have existing box/tiff pairs, you can open them in a box editor (such
as jTessBoxEditor), insert a box at the end of each line, and put a tab
character in it.
> On the toolbar, the Character textbox has a built-in conversion function.
> If you enter U+0009 and press the Enter key or click the adjacent Tool icon,
> the escape sequence is converted to the Unicode character. You can also
> enter the tab character via the Alt+09 numpad keys on Windows.
Or add a dummy sequence such as @@@ and then replace it with a tab character
in a text editor.
See attached files as a sample.
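The @@@ replacement can also be scripted instead of done by hand in a text
editor. This is only a minimal sketch: it assumes the placeholder is the
first field of a box line and that fields are separated by single spaces.
The file names and coordinates below are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample: two character boxes followed by an end-of-line box
# whose character field holds the placeholder @@@ (coordinates are invented).
cat > sample.box <<'EOF'
a 10 20 30 40 0
b 32 20 50 40 0
@@@ 52 20 54 40 0
EOF

# Build a literal tab so the sed expression does not depend on \t support.
tab=$(printf '\t')

# Replace a leading "@@@" with a tab character; coordinates stay untouched.
sed "s/^@@@ /${tab} /" sample.box > sample.fixed.box
```

Writing to a second file keeps the original box file intact, so the result
can be inspected before it replaces the original.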
Then modify tesstrain.sh to copy the box/tiff pairs to the training
directory before training starts:
mkdir -p ${TRAINING_DIR}
tlog "\n=== Starting training for language '${LANG_CODE}'"
cp ./*.box "${TRAINING_DIR}/"
cp ./*.tif "${TRAINING_DIR}/"
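Before copying, it may be worth checking that every box file has a matching
image and contains at least one tab end-of-line box. The following is a rough
sketch, assuming pairs share a base name with .box and .tif extensions as in
the cp lines above. The check_pairs function is hypothetical and not part of
tesstrain.sh.

```shell
#!/bin/sh
# check_pairs DIR: report box files in DIR that lack a matching .tif image
# or contain no line starting with a tab (the end-of-line marker).
# Hypothetical helper; not part of tesstrain.sh.
check_pairs() {
    dir=$1
    rc=0
    tab=$(printf '\t')
    for box in "$dir"/*.box; do
        [ -e "$box" ] || continue        # glob matched nothing
        base=${box%.box}
        if [ ! -f "$base.tif" ]; then
            echo "missing image for $box" >&2
            rc=1
        fi
        if ! grep -q "^${tab}" "$box"; then
            echo "no tab end-of-line box in $box" >&2
            rc=1
        fi
    done
    return $rc
}
```

It could be called as check_pairs . just before the cp lines, so that the
copy is skipped when the function returns a non-zero status.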
On Tue, Feb 7, 2017 at 8:27 PM, Kay-Michael Würzner <[email protected]>
wrote:
> +1 for this question. The training documentation for Tesseract 4.0 so far
> covers only training with font files (synthetic materials). What is missing
> is information on training with real data (i.e. manually aligned ground
> truth).
> Any hints on that matter are greatly appreciated.
>
> Cheers,
> Kay
>
> On Wednesday, January 18, 2017 at 12:31:54 AM UTC+1, [email protected]
> wrote:
>>
>> I have a bunch of images containing English words.
>> I would like to generate training data from these images and run the
>> training.
>> How should I do this?
>>
>> Thanks a lot.
>>
--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWSTfJ-EMxpK3ATDdWcH6iTJiBmjaYVAxARk%2BFJxTbw8w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
Attachment: frk.embedsiver.exp0.box (binary data)

