Thank you for your answer, Lorenzo!

I was following the sample data provided by OCR-D and realized that none of 
the TIFFs in ocrd-testset.zip has a white border on the left or right. That's 
why my TIFFs look the same.
Anyway, I'll give it a try with some margin and with non-binarized data.
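
To make the padding idea concrete, here is a minimal sketch of adding white 
columns to the left and right of a line image. It assumes the image is already 
loaded as a list of grayscale pixel rows with 255 meaning white; 
pad_left_right and its parameters are hypothetical names used for 
illustration, not part of Tesseract or ocrd-train.

```python
def pad_left_right(rows, pad, white=255):
    """Return a copy of a grayscale image with `pad` white columns per side.

    `rows` is a list of pixel rows (lists of 0-255 ints).
    Assumption: 255 represents white in this image.
    """
    if pad < 0:
        raise ValueError("pad must be non-negative")
    border = [white] * pad
    # Build new rows so the input image is left unchanged.
    return [border + list(row) + border for row in rows]


# A tiny 2x3 all-black "text line", padded with 2 white pixels on each side.
line = [[0, 0, 0], [0, 0, 0]]
padded = pad_left_right(line, 2)
```

With real TIFFs the same operation would go through an imaging library; the 
point is only that every row gains the same number of white pixels on each 
side while the height stays the same.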

I'm training from scratch and used the 10000 iterations that OCR-D sets by 
default (I then tried 20K and 30K, but the results were only slightly 
better). The training process takes about 2-3 hours to complete (4-5 hours 
with 20K iterations).

This is the best result I got:

<https://lh3.googleusercontent.com/-LpN72wYMGOo/WzzxEMcwkjI/AAAAAAAAAHY/GQ7kUm3ekV8PptNwyNh6ObNQe_SsiKqNgCLcBGAs/s1600/tess4lstmEx.JPG>

Beyond that, with more iterations the char train value stays almost the 
same and sometimes ends up higher.
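
Since the char train value can rise again, it may help to pull it out of each 
log line and note where it was lowest. The sketch below assumes the training 
log contains lines of the form "At iteration N/N/N, ... char train=X%, ..." 
as lstmtraining normally prints them; best_char_train is a hypothetical 
helper, and the sample lines are made up for illustration, not taken from a 
real run.

```python
import re

# Assumed log shape: "At iteration 200/200/200, ..., char train=61.2%, ..."
CHAR_TRAIN = re.compile(r"At iteration (\d+)/.*?char train=([\d.]+)%")


def best_char_train(log_lines):
    """Return (lowest char train error in %, iteration), or None if no match."""
    best = None
    for line in log_lines:
        match = CHAR_TRAIN.search(line)
        if match is None:
            continue  # not a progress line
        entry = (float(match.group(2)), int(match.group(1)))
        if best is None or entry < best:
            best = entry
    return best


# Illustrative lines only (invented values).
sample_log = [
    "At iteration 100/100/100, Mean rms=6.5%, delta=50.3%, char train=108.5%, word train=100%, skip ratio=0%,",
    "At iteration 200/200/200, Mean rms=5.0%, delta=30.0%, char train=61.2%, word train=90%, skip ratio=0%,",
    "At iteration 300/300/300, Mean rms=5.1%, delta=31.0%, char train=63.0%, word train=91%, skip ratio=0%,",
]
best = best_char_train(sample_log)
```

If the lowest value sits well before the last iteration, the checkpoint 
written around that point is probably the one worth keeping.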

The thread you pointed to only covers fine-tuning, so I'll probably use it 
later. Thank you once again!


On Wednesday, 4 July 2018 at 12:33:41 UTC-3, Lorenzo Blz wrote:
>
>
> I had no problems training with the ocr-d boxes. Looking at the TIFFs, the 
> first thing I'd try is adding some white border on the left and right.
>
> For my training I used non-binarized (grayscale) data and I think it could 
> be better (more information is available).
>
> Are you training from scratch or fine-tuning a model? How many epochs did 
> you do? How long did it run? Maybe you just need to wait longer.
>
> Please, have a look at this thread too:
>
> https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ
>
>
> Bye
>
> Lorenzo
>
>
> 2018-07-04 17:03 GMT+02:00 Joe <[email protected]>:
>
>> I forgot to mention:
>> The *.box files created by OCR-D are not in the format described 
>> in https://github.com/tesseract-ocr/tesseract/wiki/Making-Box-Files---4.0
>> I know Tesseract 4 boxes only need to cover a text line instead of 
>> individual chars, but in the example given at that link every character box 
>> has different values, while in the *.box files created by OCR-D they all 
>> have the same values.
>>
>> Is that a problem?
>>
>>
>> On Wednesday, 4 July 2018 at 11:50:54 UTC-3, Joe wrote:
>>>
>>> Hi everybody!
>>>
>>> I'm trying the tool at https://github.com/OCR-D/ocrd-train/ but without 
>>> success so far. Tesseract and Leptonica are installed by the scripts.
>>> Inspired by the test set provided in that repo, I created pairs of 
>>> [*.tif, *.gt.txt] with binarized chars and TTFs from two fonts (1869 text 
>>> lines in total).
>>> You can see an example of my set in the attachment, which also contains 
>>> the files created by the training process.
>>>
>>> My guess is that something is wrong with my data.
>>> Sometimes I can see the char train value increasing instead of 
>>> decreasing, and the final error rate is still too high (about 60%).
>>>
>>> This new LSTM training process is driving me crazy!
>>> I would appreciate it if anyone with experience could take a look at my 
>>> data set.
>>>
>>>
>>> Joe.
>>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/601364b4-3ebd-4a04-9f6a-3d418ab728ab%40googlegroups.com.
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

