Hi Shree,

The box file you attached seems to contradict the Tesseract 4.0 LSTM
training tutorial, which states that the boxes should be at line level
rather than at character level. Please correct me if I am wrong. I still do
not understand how to train Tesseract on real image data that I have
collected from scanned documents. It would help all of us here to have a
sample video showing how to train Tesseract, or at least the first steps
with the exact commands.

Thanks in advance.
Anubhav


On Tuesday, 7 February 2017 21:04:11 UTC+5:30, shree wrote:
>
> For LSTM training, box files need an additional line for each text line,
> containing a tab character, to indicate a new line.
>
> If you have existing box/tiff pairs, you can use a box editor (such as
> jTessBoxEditor) to insert a box at the end of each line and put a tab
> character in it.
>
> > On the toolbar, the Character textbox has a built-in conversion
> function. If you enter U+0009 and press the Enter key or click the adjacent
> Tool icon, the escape sequence is converted to Unicode. You can also
> enter the tab character via Alt+09 on the numpad keys on Windows.
>
> Or add a dummy sequence such as @@@ and then replace it with a tab
> character in a text editor.
>
> See attached files as a sample.
>
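The placeholder approach described above can also be scripted instead of done by hand in a text editor. This is a minimal sketch. It assumes the dummy sequence is @@@ (as in the message) and that it sits in the symbol field at the start of a box line. The file names and box coordinates are made up for illustration.

```shell
# Hypothetical sample box file: one "symbol left bottom right top page" entry
# per character, plus an @@@ placeholder box at the end of the text line.
workdir=$(mktemp -d)
cat > "${workdir}/sample.box" <<'EOF'
H 10 20 30 50 0
i 32 20 40 50 0
@@@ 41 20 42 50 0
EOF

# Replace the placeholder symbol with a literal tab character.
# printf is used because not every sed understands "\t" in the replacement.
tab=$(printf '\t')
sed "s/^@@@ /${tab} /" "${workdir}/sample.box" > "${workdir}/sample.tabbed.box"
```

The result is written to a new file so the original box file stays available for comparison in a box editor.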
> Then modify tesstrain.sh to copy the box/tiff pairs to the training
> directory before training starts:
>
>
>
> mkdir -p ${TRAINING_DIR}
> tlog "\n=== Starting training for language '${LANG_CODE}'"
>
> cp ./*.box "${TRAINING_DIR}/"
> cp ./*.tif "${TRAINING_DIR}/"
>
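As a hedged variation on the two cp commands above, the sketch below copies only images that have a matching box file, so an unpaired .tif does not end up in the training directory. The function name copy_box_tif_pairs and its two arguments are hypothetical and are not part of tesstrain.sh.

```shell
# Copy only complete .tif/.box pairs from src_dir into dest_dir.
# Hypothetical helper, not part of tesstrain.sh.
copy_box_tif_pairs() {
  src_dir=$1
  dest_dir=$2
  mkdir -p "${dest_dir}"
  for tif in "${src_dir}"/*.tif; do
    [ -e "${tif}" ] || continue        # the glob matched no files
    box="${tif%.tif}.box"              # same base name, .box extension
    if [ -f "${box}" ]; then
      cp "${tif}" "${box}" "${dest_dir}/"
    else
      echo "skipping ${tif}: no matching .box file" >&2
    fi
  done
}
```

Inside tesstrain.sh it could be called as copy_box_tif_pairs . "${TRAINING_DIR}" in place of the two cp lines.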
>
> On Tue, Feb 7, 2017 at 8:27 PM, Kay-Michael Würzner <[email protected]> wrote:
>
>> +1 for this question. The training documentation for Tesseract 4.0
>> currently covers only training with font files (synthetic material). What
>> is missing is information on training with real data (i.e. manually
>> aligned ground truth).
>> Any hints on that matter are greatly appreciated.
>>
>> Cheers,
>> Kay
>>
>> On Wednesday, January 18, 2017 at 12:31:54 AM UTC+1, [email protected] 
>> wrote:
>>>
>>> I have a bunch of images containing English words.
>>> I would like to generate training data from these images and run the
>>> training.
>>> How should I do this?
>>>
>>> Thanks a lot.
>>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/7bffab95-3e6b-4165-929e-a152f1799703%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
