Re: Tesseract Training

Ray Smith Sat, 19 Feb 2011 19:12:31 -0800

Sorry to be late on this very long thread, but you guys are making lives
difficult for yourselves by getting hold of the wrong end of the stick.
There is no need to give tesseract a convoluted re-encoding of the units
that you want it to recognize, and then translate it back on output.
Maybe I misunderstand what you were trying to do to start with, but you can
give tesseract any UTF-8 string for each recognizable unit that you train it
with, including multiple Unicode code points if you want. If your original
shapes/recognizable units/aksharas/syllables (call them what you like)
represent multiple code points, then give tesseract all the UTF-8 for those,
and it will be happy. (It currently supports up to 24 bytes of UTF-8 for each
shape.) It will also make life easier when you want to give it a dictionary
to use with the shapes, as it assumes that the words you give it can be made
up of sequences of the codes for the basic shapes.
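
As a rough illustration of the 24-byte limit mentioned above, here is a
minimal Python sketch that measures the UTF-8 length of a few candidate
units. The names MAX_UNIT_BYTES, utf8_len and oversized_units are made up for
this example and are not part of tesseract; the Kannada sample strings are
only illustrative.

```python
# Hypothetical helper, not part of tesseract: check that each recognizable
# unit fits in the 24 bytes of UTF-8 quoted in the message above.
MAX_UNIT_BYTES = 24


def utf8_len(unit: str) -> int:
    """Return the number of bytes the unit occupies when encoded as UTF-8."""
    return len(unit.encode("utf-8"))


def oversized_units(units, limit=MAX_UNIT_BYTES):
    """Return the units whose UTF-8 encoding is longer than the limit."""
    return [u for u in units if utf8_len(u) > limit]


# Illustrative units: a single Kannada letter (KA) and a conjunct built
# from three code points (KA + VIRAMA + SSA).
samples = ["\u0c95", "\u0c95\u0ccd\u0cb7"]
for unit in samples:
    print(unit, utf8_len(unit), "bytes")

print("too long:", oversized_units(samples))
```

Each Kannada code point takes 3 bytes in UTF-8, so a unit made of several
code points still fits comfortably as long as it stays at or under 8 code
points.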


On Thu, Feb 17, 2011 at 12:42 AM, Sriranga(78yrsold) <
[email protected]> wrote:

> Dmitry,
> I am extremely thankful for your valuable guidance. It works for me. I have
> to learn many things from you.
> With warmest Regards,
> -sriranga(78yrs)
>
>
> On Thu, Feb 17, 2011 at 1:56 PM, Dmitry Silaev <[email protected]> wrote:
>
>> Sriranga,
>>
>> > It is
>> > presumed that the command line (WinXP) should be as follows:
>> > e.g. "  c:\tess\copy  001.tr + 002.tr + 003.tr + oo4.tr > 1234.tr or
>> > Multiimage.tr", which may kindly be confirmed. Or the correct command
>> > line for concatenating with the "copy" command may kindly be intimated.
>>
>> This command won't do what you want. First, you don't need to put a path
>> before "copy": it is a built-in command of the Windows command processor
>> (cmd.exe). With a path prepended, it is treated as the name of an
>> executable in the "c:\tess\" directory, and no such executable exists.
>> Second, you don't need the ">", which would redirect the informational
>> output of the "copy" command (not the files' contents) to "1234.tr". The
>> destination file should be specified at the end of the command, after a
>> space. Therefore your command line should be
>>
>> copy  001.tr + 002.tr + 003.tr + oo4.tr 1234.tr
>>
>> Warm regards,
>> Dmitry Silaev
>>
>>
>>
>>
>>
>> On Thu, Feb 17, 2011 at 9:44 AM, Sriranga(78yrsold)
>> <[email protected]> wrote:
>> > Dmitry,
>> > Thanks for the valuable guidance. However, I could not understand how to
>> > concatenate (simply "copy") all the resulting .tr files together. It is
>> > presumed that the command line (WinXP) should be as follows:
>> > e.g. "  c:\tess\copy  001.tr + 002.tr + 003.tr + oo4.tr > 1234.tr or
>> > Multiimage.tr", which may kindly be confirmed. Or the correct command
>> > line for concatenating with the "copy" command may kindly be intimated.
>> > With Warmest Regards,
>> > -sriranga(78yrs)
>> >
>> > On Wed, Feb 16, 2011 at 11:58 AM, Dmitry Silaev <[email protected]>
>> > wrote:
>> >>
>> >> Guys,
>> >>
>> >> If you have more than one box/tiff pair, you can train (i.e. generate a
>> >> .tr file) for each of these pairs separately.
>> >>
>> >> Then you can concatenate (simply "cat" or "copy") all the resulting .tr
>> >> files together and run all the training tools on the single final .tr
>> >> file. This relieves you of the 32-file limit.
>> >>
>> >> For your convenience you can craft a batch file or shell script that
>> >> trains, concatenates, clusters, etc. in one run. You should still
>> >> analyze all errors carefully, though.
>> >>
>> >> Warm regards,
>> >> Dmitry Silaev
>> >>
>> >>
>> >>
>> >>
>> >> On Wed, Feb 16, 2011 at 5:56 AM, Sriranga(78yrsold)
>> >> <[email protected]> wrote:
>> >>>
>> >>> Dmitry,
>> >>> It appears that Khem has not copied you on this, so I am forwarding
>> >>> it for your valuable guidance/comments, which may help me in my
>> >>> Kannada project.
>> >>> with regards,
>> >>> -sriranga(78yrs)
>> >>>
>> >>> ---------- Forwarded message ----------
>> >>> From: KHEM Sochenda <[email protected]>
>> >>> Date: Wed, Feb 16, 2011 at 7:45 AM
>> >>> Subject: Re: Tesseract Training
>> >>> To: "Sriranga(78yrsold)" <[email protected]>
>> >>>
>> >>>
>> >>> Dear Sriranga,
>> >>>
>> >>> Below are the steps I followed for the training:
>> >>>
>> >>> 1. I created 3 pages of training images, as you can see in the
>> >>>    attachments (khm.limons1.1 is page 1, khm.limons1.2 is page 2, and
>> >>>    khm.limons1.3 is page 3).
>> >>> 2. I created a box file for every page (khm.limons1.1.box and so on)
>> >>>    with the command line
>> >>>    "tesseract khm.limons1.1.tif khm.limons1.1 batch.nochop makebox"
>> >>>    for page 1,
>> >>>    "tesseract khm.limons1.2.tif khm.limons1.2 batch.nochop makebox"
>> >>>    for page 2, and the same for page 3.
>> >>> 3. I edited the box files; the final result is in the attachments.
>> >>> 4. I merged the images into a single file (khm.limons1.0.tif).
>> >>> 5. I merged the three box files into a single box file with page
>> >>>    numbers assigned (khm.limons1.0.box).
>> >>> 6. I ran the command to train on the single file: "tesseract
>> >>>    khm.limons1.1.tif khm.limons1.0.tif khm.limons1.0 nobatch
>> >>>    box.train". The result looked okay at this step. (My purpose in
>> >>>    merging into one file is that I want a single font to be in just
>> >>>    one .tr file.)
>> >>> 7. I then ran the command "unicharset_extractor khm.limons1.0.box" to
>> >>>    extract every single glyph from the box files. The result looked
>> >>>    okay.
>> >>> 8. Then I tried running "mftraining -U unicharset -O khm.unicharset
>> >>>    khm.limons1.0.tr" and "cntraining khm.limons1.0.tr" to extract the
>> >>>    features. I failed at this step.
>> >>>
>> >>>
>> >>>
>> >>> --------------------------------------------------------------------
>> >>> Since I have no clue how to get the above idea to work, I omitted
>> >>> steps 4 and 5 and skipped to steps 6, 7, and 8 using the separate box
>> >>> files. I got the traineddata in the attached file. Having three
>> >>> separate .tr files is not what I want, though.
>> >>>
>> >>> Currently I use the resulting trained data for my temporary OCR
>> >>> system. What I wish to do is add other fonts, but the number of .tr
>> >>> files is limited to 32. This is what concerns me.
>> >>>
>> >>> Best Regards,
>> >>>
>> >>> Sochenda
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >
>> >
>>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
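
For anyone who would rather script the concatenation step Dmitry describes
in the quoted thread than type "cat" or "copy" by hand, here is a minimal
Python sketch. The function name concatenate_tr and the file names in the
usage note are assumptions made for this example; it simply copies the bytes
of each .tr file, in order, into one output file and does not validate the
contents.

```python
import shutil


def concatenate_tr(sources, destination):
    """Write the bytes of each source file, in order, to the destination.

    This is a plain byte-for-byte concatenation (the same intent as "cat");
    it does not parse or check the .tr contents.
    """
    with open(destination, "wb") as out:
        for src in sources:
            with open(src, "rb") as part:
                shutil.copyfileobj(part, out)
    return destination
```

It could be called as concatenate_tr(["001.tr", "002.tr", "003.tr"],
"1234.tr"), after which the remaining training tools are run on the single
combined file.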

