Re: Tesseract Training

Dmitry Silaev Thu, 17 Feb 2011 00:27:01 -0800

Sriranga,

> I presume that the command line (on WinXP) should be as follows:
> "c:\tess\copy  001.tr + 002.tr + 003.tr + oo4.tr > 1234.tr" (or
> Multiimage.tr). Could you kindly confirm this, or else let me know the
> correct command line for concatenating with "copy"?


This command won't do what you want. First, you don't need to put a
path before "copy": it is a built-in command of the command processor
(cmd.exe on WinXP). With a path prepended, it is treated as the name of
an executable in the "c:\tess\" directory, and no such executable exists.
Second, you don't need the ">": it would redirect the informational
output of the "copy" command (not the files' contents) to "1234.tr". The
destination file should be given at the end of the command, after a
space. Therefore your command line should be

copy  001.tr + 002.tr + 003.tr + oo4.tr 1234.tr
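
For anyone who would rather script this step, or who is not on Windows, here is a minimal Python sketch of the same concatenation. It assumes the .tr files can simply be joined byte for byte, as "cat" would do. The function name concat_tr_files is made up for illustration and is not part of Tesseract.

```python
import os
import shutil


def concat_tr_files(inputs, output):
    """Write the bytes of each input file to `output`, in the order given.

    Hypothetical helper, not a Tesseract tool. Files are opened in binary
    mode so that nothing is added to or stripped from their contents.
    """
    out_path = os.path.abspath(output)
    for name in inputs:
        # Refuse to overwrite one of the files we are about to read.
        if os.path.abspath(name) == out_path:
            raise ValueError("output file must not be one of the inputs")

    with open(output, "wb") as dst:
        for name in inputs:
            with open(name, "rb") as src:
                shutil.copyfileobj(src, dst)
    return output
```

With the file names from your example, the call would be concat_tr_files(["001.tr", "002.tr", "003.tr", "oo4.tr"], "1234.tr").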

Warm regards,
Dmitry Silaev





On Thu, Feb 17, 2011 at 9:44 AM, Sriranga(78yrsold)
<[email protected]> wrote:
> Dmitry,
> Thanks for the valuable guidance. However, I could not understand how to
> concatenate (simply "copy") all the resulting .tr files together. I
> presume that the command line (on WinXP) should be as follows:
> "c:\tess\copy  001.tr + 002.tr + 003.tr + oo4.tr > 1234.tr" (or
> Multiimage.tr). Could you kindly confirm this, or else let me know the
> correct command line for concatenating with "copy"?
> With Warmest Regards,
> -sriranga(78yrs)
>
> On Wed, Feb 16, 2011 at 11:58 AM, Dmitry Silaev <[email protected]>
> wrote:
>>
>> Guys,
>>
>> If you have more than one box/tiff pair, you can train (i.e. generate a
>> .tr file) on each of these pairs separately.
>>
>> Then you can concatenate (simply "cat" or "copy") all the resulting .tr
>> files together and run all the training tools on the single final .tr file.
>> This frees you from the 32-file limit.
>>
>> For your convenience, you can write a batch file or shell script that
>> trains, concatenates, clusters, etc. in one run. You should still check all
>> errors carefully, though.
>>
>> Warm regards,
>> Dmitry Silaev
>>
>>
>>
>>
>> On Wed, Feb 16, 2011 at 5:56 AM, Sriranga(78yrsold)
>> <[email protected]> wrote:
>>>
>>> Dmitry,
>>> It appears that Khem did not copy you on this message, so I am forwarding
>>> it for your valuable guidance/comments, which may help me in my Kannada
>>> project.
>>> With regards,
>>> -sriranga(78yrs)
>>>
>>> ---------- Forwarded message ----------
>>> From: KHEM Sochenda <[email protected]>
>>> Date: Wed, Feb 16, 2011 at 7:45 AM
>>> Subject: Re: Tesseract Training
>>> To: "Sriranga(78yrsold)" <[email protected]>
>>>
>>>
>>> Dear Sriranga,
>>>
>>> Below are the steps I followed for the training:
>>>
>>> 1. I created 3 pages of training images, as you can see in the attachments
>>> (khm.limons1.1 is page 1, khm.limons1.2 is page 2, and khm.limons1.3 is
>>> page 3).
>>> 2. I created a box file for every page (khm.limons1.1.box and so on) with
>>> the command line:
>>>
>>> "tesseract khm.limons1.1.tif khm.limons1.1 batch.nochop  makebox" for
>>> page 1, "tesseract khm.limons1.2.tif khm.limons1.2 batch.nochop  makebox"
>>> for page 2, and the same for page 3.
>>>
>>> 3. Then I edited the box files; the final result is in the attachments.
>>> 4. I merged the images into a single file (khm.limons1.0.tif).
>>> 5. I merged the three box files into a single box file with page numbers
>>> assigned (khm.limons1.0.box).
>>>
>>> 6. I ran the command to train the single file: "tesseract khm.limons1.1.tif
>>> khm.limons1.0.tif khm.limons1.0 nobatch box.train". The result looked okay
>>> at this step. (My purpose in merging into one file is that I want a single
>>> font to be in just one .tr file.)
>>>
>>> 7. I then ran the command "unicharset_extractor khm.limons1.0.box" to
>>> extract every single glyph from the box file. The result looked okay.
>>>
>>> 8. Then I tried running "mftraining -U unicharset -O khm.unicharset
>>> khm.limons1.0.tr" and "cntraining khm.limons1.0.tr" to extract the
>>> features. I failed at this step.
>>>
>>>
>>> --------------------------------------------------------------------------------------------------------
>>> Since I have no clue how to get the above idea to work, I omitted steps 4
>>> and 5 and skipped to steps 6, 7, and 8 using the separate box files. That
>>> gave me the traineddata in the attached file. Having three separate .tr
>>> files is not what I want, though.
>>>
>>> Currently I use the resulting trained data for my temporary OCR system.
>>> What I wish to do is add other fonts, but the number of .tr files is
>>> limited to 32. This is my concern.
>>>
>>> Best Regards,
>>>
>>> Sochenda
>>>
>>>
>>>
>>>
>>>
>>
>
>
