I forgot to mention: the *.box files created by OCR-D are not in the format described on the Making Box Files wiki page (https://github.com/tesseract-ocr/tesseract/wiki/Making-Box-Files---4.0). I know Tesseract 4 boxes only need to cover a text line instead of individual characters, but in the example given at that link every character has different box values, while in the *.box files created by OCR-D they all have the same values.
Is that a problem?

On Wednesday, July 4, 2018 at 11:50:54 UTC-3, Joe wrote:
>
> Hi everybody!
>
> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without
> success so far. Tesseract and Leptonica are installed by the scripts.
> Inspired by the test set provided in that repo, I created pairs of [*.tif,
> *.gt.txt] with binarized characters and TTFs from two fonts (1869 text
> lines in total).
> You can see an example of my set in the attachment, which also contains
> files created by the training process.
>
> My guess is that something is wrong with my data.
> Sometimes I can see the char train value increasing instead of decreasing,
> and the final error rate is still too high (about 60%).
>
> That new training process with LSTM is driving me crazy!
> I would appreciate it if anyone with experience could take a look at my
> data set.
>
> Joe.
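
In case it helps anyone compare the two box styles, here is a minimal Python sketch of the check I have in mind. It assumes each box line has the form `<symbol> <left> <bottom> <right> <top> <page>`; the function names and the two sample strings are made up for illustration and are not real OCR-D output.

```python
from collections import Counter


def parse_box_lines(text):
    """Parse box-file text into (symbol, (left, bottom, right, top), page) tuples.

    Assumes the layout '<symbol> <left> <bottom> <right> <top> <page>'.
    The symbol may itself be a space or a tab, so we split from the right.
    """
    boxes = []
    for raw in text.split("\n"):
        line = raw.rstrip("\r")
        if not line.strip():
            continue  # skip empty lines
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # not a box line under the assumed layout
        symbol = parts[0]
        left, bottom, right, top, page = (int(p) for p in parts[1:])
        boxes.append((symbol, (left, bottom, right, top), page))
    return boxes


def distinct_boxes(boxes):
    """Count how many symbols share each set of coordinates."""
    return Counter(coords for _, coords, _ in boxes)


# Hypothetical samples: one shared box for the whole line vs. one box per character.
line_level_sample = "a 0 0 100 30 0\nb 0 0 100 30 0\n\t 0 0 100 30 0\n"
char_level_sample = "a 0 0 10 30 0\nb 12 0 22 30 0\n"

for name, sample in [("line-level", line_level_sample),
                     ("char-level", char_level_sample)]:
    counts = distinct_boxes(parse_box_lines(sample))
    print(name, "distinct boxes:", len(counts))
```

If a file produces a single distinct box per text line, every symbol on that line shares the same coordinates, which is the situation I am seeing in the OCR-D output.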

