Hi,
tesstrain (https://github.com/tesseract-ocr/tesstrain) works very well. It
is not the same thing as tesstrain.sh; it was previously called ocrd-train
(from the OCR-D project).

tesstrain works only with single-line images. You need only the images and
the corresponding gt.txt files; it will create the tiff, box and lstmf files,
the unicharset and the other files for you. It will even download the stuff
you need.
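
To illustrate the input described above, here is a minimal sketch that checks each line image has a matching transcription. It assumes the image (`line.png` or `line.tif`) sits next to `line.gt.txt` in one ground-truth folder; the function name `find_unpaired` and the suffix list are made up for this example and are not part of tesstrain.

```python
from pathlib import Path

# Assumption: these are the image types in your ground-truth folder.
IMAGE_SUFFIXES = {".png", ".tif"}


def find_unpaired(ground_truth_dir):
    """Return (images without a .gt.txt, .gt.txt files without an image)."""
    folder = Path(ground_truth_dir)
    # "line_001.png" -> "line_001"
    images = {p.stem for p in folder.iterdir()
              if p.suffix.lower() in IMAGE_SUFFIXES}
    # "line_001.gt.txt" -> "line_001"
    texts = {p.name[: -len(".gt.txt")] for p in folder.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)
```

Calling `find_unpaired("data/foo-ground-truth")` (a hypothetical folder name) before training gives two lists that should both be empty.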

About your questions:

1. yes. If you have images with multiple lines, I think there are tools
around to split them automatically; search this forum.
2. single lines
3. the training does not use the vocabulary at all
4. I recommend tesstrain (without .sh, the one with the Makefile).
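
On point 1, splitting a multi-line image into single lines can be sketched with a horizontal projection profile: rows that contain no ink separate one text line from the next. This is only an assumption about how such a tool might work, not the method of any particular program, and it expects clean, unskewed scans. The image is a plain list of grayscale rows so the example has no dependencies; `split_lines` and its parameters are hypothetical names.

```python
def split_lines(pixels, ink_threshold=128, min_height=2):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    pixels: list of rows, each a list of grayscale values (0 = black ink).
    """
    bands = []
    start = None
    for y, row in enumerate(pixels):
        has_ink = any(value < ink_threshold for value in row)
        if has_ink and start is None:
            start = y                       # a text line begins here
        elif not has_ink and start is not None:
            if y - start >= min_height:     # ignore specks thinner than this
                bands.append((start, y))
            start = None
    # A line that touches the bottom edge of the image.
    if start is not None and len(pixels) - start >= min_height:
        bands.append((start, len(pixels)))
    return bands
```

Each returned range can then be cropped out and saved as its own line image with its own gt.txt file.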


The text must be cropped tight: a couple of pixels of margin per side or
none, see what works best. Image height should be 35 to 48 pixels (try a few
and see what works best for your data). No need to fully binarize the images,
but you want strong contrast. See the attached file.
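
The cropping and sizing advice can be sketched as below. This is a rough illustration on a plain list of grayscale rows; a real pipeline would do the crop and resize with an imaging library. The names `tight_box` and `scaled_size` and the default threshold are assumptions for this example; only the 2-pixel margin and the 35 to 48 pixel height come from the text above.

```python
def tight_box(pixels, ink_threshold=128, pad=2):
    """Return (left, top, right, bottom) around the ink, right/bottom exclusive.

    pixels: list of rows of grayscale values (0 = black ink).
    pad: margin in pixels kept on each side, clipped to the image borders.
    """
    height, width = len(pixels), len(pixels[0])
    rows = [y for y, row in enumerate(pixels)
            if any(v < ink_threshold for v in row)]
    cols = [x for x in range(width)
            if any(row[x] < ink_threshold for row in pixels)]
    if not rows:
        return None  # blank image, nothing to crop
    return (max(cols[0] - pad, 0), max(rows[0] - pad, 0),
            min(cols[-1] + 1 + pad, width), min(rows[-1] + 1 + pad, height))


def scaled_size(width, height, target_height=48):
    """Size after scaling to target_height while keeping the aspect ratio."""
    return max(1, round(width * target_height / height)), target_height
```

Trying a few values of `pad` and `target_height` on a sample of your lines is the quickest way to see what works best.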



Bye

Lorenzo









On Fri, 3 Apr 2020 at 17:48, hmaster <[email protected]> wrote:

>
>    1. So essentially, I need to create a box file and ground-truth file
>    for each image I have, and run it with tesstrain repo. Which doesn't
>    work....
>    2. That's what I understood from the README as well.
>    3. Unfortunately, I've tried it already, and have not come too far
>    with that either.
>    4. The documentation and examples are short on explanation, which makes
>    the tools very demanding to use, as can be seen from the sheer number of
>    questions on how to train and how to use them.
>    5.
>    6. *I've spent around 200 hours on this tool so far, and I am no closer
>    to what I need than I was when I started.*
>    7. *Some repos use lstmbox/lstm.train, some use makebox/box.train, and
>    all of them fail at one point or another when following the examples.*
>    8. Lots of the tutorials or explanations are diluted because of the
>    sheer number of versions and differences in how tesseract works.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/5c1a6e36-30f5-47f5-a026-2d86b7addc48%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/5c1a6e36-30f5-47f5-a026-2d86b7addc48%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

