Hi, Lorenzo!
Thank you for your tips!
When I run those check commands I get this:

<https://lh3.googleusercontent.com/-Soz3vfA1HVc/W0DO5ya_HPI/AAAAAAAAAH8/3sJ-_tf0eWslqt9BxHXmRIFqIZYagMr1ACLcBGAs/s1600/tess4eval.JPG>

I'm gathering more data and as soon as I get any result I will share it here.
Have a nice weekend!

Joe.

On Wednesday, 4 July 2018 at 13:39:41 UTC-3, Lorenzo Blz wrote:
>
> I suspect 1800 lines may not be enough data for training from scratch and
> you are simply overfitting. I think 5% refers to the evaluation set, with a
> default 80/20 split, I believe.
>
> Try this to check the accuracy on the training set and the eval set:
>
> lstmeval --model your-model.traineddata --eval_listfile data/list.train
> lstmeval --model your-model.traineddata --eval_listfile data/list.eval
>
> If the training error is much lower than that, like 0.1% or even 2%, you are
> overfitting: too little data and/or a model too large.
>
> If so, you may add more varied data (I guess at least 10 times as much or
> more), and also try some augmentation, even if I think you already do.
>
> Lorenzo
>
> 2018-07-04 18:13 GMT+02:00 Joe <[email protected]>:
>
>> Thank you for your answer, Lorenzo!
>>
>> I was following the sample data provided by ocr-d and I realized every
>> tiff in ocrd-testset.zip has no left or right white border. That's why my
>> tiffs are the same way.
>> Anyway, I'll give it a try with some space and with non-binarized data.
>>
>> I'm training from scratch and I used the 10000 iterations given by
>> default by ocr-d (then I tried with 20K/30K but only with slightly better
>> results). The training process takes about 2-3 hours to complete (4-5h
>> with 20K iterations).
>>
>> This is the best result I got:
>>
>> <https://lh3.googleusercontent.com/-LpN72wYMGOo/WzzxEMcwkjI/AAAAAAAAAHY/GQ7kUm3ekV8PptNwyNh6ObNQe_SsiKqNgCLcBGAs/s1600/tess4lstmEx.JPG>
>>
>> After that, with more iterations the char train value remains almost the
>> same and sometimes it ends up bigger.
>>
>> The thread you mentioned only refers to fine tuning, so I'd
>> probably use it later. Thank you once again!
>>
>> On Wednesday, 4 July 2018 at 12:33:41 UTC-3, Lorenzo Blz wrote:
>>>
>>> I had no problems training with the ocr-d boxes. Looking at the tiffs,
>>> the first thing I'd try is adding some white border on the left and right.
>>>
>>> For my training I used non-binarized (grayscale) data and I think it
>>> could be better (more information is available).
>>>
>>> Are you training from scratch or fine tuning a model? How many epochs
>>> did you do? How long did it run? Maybe you just need to wait more.
>>>
>>> Please have a look at this thread too:
>>>
>>> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ
>>>
>>> Bye
>>>
>>> Lorenzo
>>>
>>> 2018-07-04 17:03 GMT+02:00 Joe <[email protected]>:
>>>
>>>> I forgot to mention:
>>>> The *.box files created by OCR-D are not in the same format as
>>>> described in
>>>> https://github.com/tesseract-ocr/tesseract/wiki/Making-Box-Files---4.0
>>>> I know Tesseract 4 boxes only need to cover a text line instead of
>>>> individual chars, but in the example given in that link every character
>>>> box value is different, while in the *.box files created by OCR-D they
>>>> all have the same values.
>>>>
>>>> Is that a problem?
>>>>
>>>> On Wednesday, 4 July 2018 at 11:50:54 UTC-3, Joe wrote:
>>>>>
>>>>> Hi everybody!
>>>>>
>>>>> I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without
>>>>> success so far. Tesseract and Leptonica are installed by the scripts.
>>>>> Inspired by the test set provided in that repo, I created pairs of
>>>>> [*.tif, *.gt.txt] with binarized chars and TTFs from two fonts (1869
>>>>> text lines in total).
>>>>> You can see an example of my set in the attachment, which also contains
>>>>> files created by the training process.
>>>>>
>>>>> My guess is that something is wrong with my data.
>>>>> Sometimes I can see the char train value increasing instead of
>>>>> decreasing, and the final error rate is still too high (about 60%).
>>>>>
>>>>> That new training process with LSTM is driving me crazy!
>>>>> I would appreciate it if anyone with experience could take a look at my
>>>>> data set.
>>>>>
>>>>> Joe.
>>>>>
>
> --
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e6a29281-0322-40b3-a6ab-7459055a994e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
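
For reference, the train-versus-eval check suggested in the quoted reply can be sketched as a small helper. This is only an illustration, not part of Tesseract or ocrd-train: the function name `diagnose` and both thresholds are hypothetical, and the two error rates are assumed to be copied by hand from the output of the two `lstmeval` runs (one on `data/list.train`, one on `data/list.eval`).

```python
def diagnose(train_error_pct: float, eval_error_pct: float,
             max_ratio: float = 2.0, high_error_pct: float = 10.0) -> str:
    """Classify a pair of character error rates (in percent).

    Both thresholds are illustrative guesses, not values from Tesseract:
    - high_error_pct: above this on BOTH sets, the model has not learned
      the data at all, so overfitting is not the main problem.
    - max_ratio: if the eval error exceeds the train error by more than
      this factor, the gap is treated as a sign of overfitting.
    """
    if train_error_pct > high_error_pct and eval_error_pct > high_error_pct:
        return "not learning"
    # Guard against a train error of exactly zero.
    if eval_error_pct > max_ratio * max(train_error_pct, 1e-9):
        return "overfitting"
    return "ok"


if __name__ == "__main__":
    # Numbers taken from the thread: about 5% on eval versus 0.1% or 2% on train.
    print(diagnose(0.1, 5.0))
    print(diagnose(2.0, 5.0))
    # A run that stalls around 60% error on both sets.
    print(diagnose(60.0, 62.0))
```

With these assumed thresholds, a small train error next to a 5% eval error is flagged as overfitting, while a run stuck near 60% on both sets points to a data or setup problem instead.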

