Re: Tesseract Training

Eugene Reimer Sat, 19 Feb 2011 20:11:27 -0800

Would a "basic shape" be the same as a "shape", or as a "utf8"? Hmm, perhaps it is a "call them what you like"?


Ray Smith wrote, On 2011-02-19 21:12:

Sorry to be late on this very long thread, but you guys are making lives difficult for yourselves by getting hold of the wrong end of the stick. There is no need to give tesseract a convoluted re-encoding of the recognizable units that you want it to recognize, and translate it on output.

Maybe I misunderstand what you were trying to do to start with, but you can give tesseract any utf-8 string for each recognizable unit that you train it with, including multiple unicodes if you want. If your original shapes/recognizable units/aksharas/syllables (call them what you like) represent multiple unicodes, then give tesseract all the utf8 for those, and it will be happy. (It currently supports up to 24 bytes of utf-8 for each shape.) It will make life easier when you want to give it a dictionary to use with the shapes, as it assumes that the words you give it can be made up of sequences of the codes for the basic shapes.
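
For anyone checking their own units against that limit, here is a rough Python sketch (not Tesseract code). It only measures how many UTF-8 bytes a multi-code-point unit takes. The 24-byte figure is the one quoted above; the helper names, the sample akshara, and the assumption that the limit is inclusive are mine.

```python
# Illustrative only: the 24-byte figure is the one quoted in the message
# above; the helper names and the sample akshara are assumptions.
MAX_UNIT_BYTES = 24


def utf8_length(unit: str) -> int:
    """Number of bytes the unit occupies once encoded as UTF-8."""
    return len(unit.encode("utf-8"))


def fits_in_limit(unit: str, limit: int = MAX_UNIT_BYTES) -> bool:
    """True if the UTF-8 encoding of the unit is at most `limit` bytes."""
    return utf8_length(unit) <= limit


# A Kannada conjunct written as three code points: KA + VIRAMA + SSA.
akshara = "\u0c95\u0ccd\u0cb7"
print(ascii(akshara), utf8_length(akshara), fits_in_limit(akshara))
```

Each code point in the range U+0800 to U+FFFF takes 3 bytes in UTF-8, so a unit made of Kannada code points has room for roughly eight of them under a 24-byte limit.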

On Thu, Feb 17, 2011 at 12:42 AM, Sriranga(78yrsold)<[email protected] <mailto:[email protected]>> wrote:


    Dmitry,
    I am extremely thankful for your valuable guidance. It works for
    me. I have to learn many things
    under you.
    With warmest Regards,
    -sriranga(78yrs)


    On Thu, Feb 17, 2011 at 1:56 PM, Dmitry Silaev
    <[email protected] <mailto:[email protected]>> wrote:

        Sriranga,

        > It is
        > presumed that commandline for (WinXP) should be as follows:
        > eg= "  c:\tess\copy  001.tr + 002.tr + 003.tr + oo4.tr > 1234.tr or
        > Multiimage.tr"  which may kindly be confirmed.  OR correct commandline for
        > concatenate using command "copy" to be used may kindly be intimated.

        This command won't do what you want. First, you don't need to indicate
        a path before "copy": it is a built-in command of the MS-DOS command
        processor, so when it is prepended with a path it is treated as the name
        of an executable in the "c:\tess\" directory, which doesn't exist.
        Second, you don't need the ">", as it would redirect the informational
        output of the "copy" command (not the files' contents) to "1234.tr". The
        destination file should be specified at the end of the command, after a
        space. Therefore your command line should be

        copy  001.tr + 002.tr + 003.tr + oo4.tr 1234.tr

        Warm regards,
        Dmitry Silaev
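
If the shell syntax keeps getting in the way, the same concatenation can be sketched in a few lines of Python, which behaves the same on WinXP and on Unix. This is only an illustration, not part of the Tesseract tools: the function name is made up, and the file names in the usage comment are simply the ones from the command above. It copies bytes as-is; with cmd.exe, "copy /b" is probably the closer equivalent, since "copy" is documented to treat "+" concatenation as text by default.

```python
from pathlib import Path
import shutil


def concat_files(sources, destination):
    """Write the bytes of each source file, in order, into a new destination file.

    The destination is overwritten if it exists and should not be one of the sources.
    """
    destination = Path(destination)
    with destination.open("wb") as out:
        for src in sources:
            with Path(src).open("rb") as inp:
                shutil.copyfileobj(inp, out)  # plain byte copy, no translation
    return destination


# Usage (file names taken from the command above):
# concat_files(["001.tr", "002.tr", "003.tr", "oo4.tr"], "1234.tr")
```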





        On Thu, Feb 17, 2011 at 9:44 AM, Sriranga(78yrsold)
        <[email protected] <mailto:[email protected]>> wrote:
        > Dmitry,
        > Thanks for the valuable guidance. However, I could not understand how to
        > concatenate (simply "copy") all the resulting .tr files together. It is
        > presumed that commandline for (WinXP) should be as follows:
        > eg= "  c:\tess\copy  001.tr + 002.tr + 003.tr + oo4.tr > 1234.tr or
        > Multiimage.tr"  which may kindly be confirmed.  OR correct commandline for
        > concatenate using command "copy" to be used may kindly be intimated.
        > With Warmest Regards,
        > -sriranga(78yrs)
        >
        > On Wed, Feb 16, 2011 at 11:58 AM, Dmitry Silaev
        <[email protected] <mailto:[email protected]>>
        > wrote:
        >>
        >> Guys,
        >>
        >> If you have more than one box/tiff pair, you can train (i.e. generate a
        >> .tr file) for each of these pairs separately.
        >>
        >> Then you can concatenate (simply "cat" or "copy") all resulting .tr files
        >> together and then run all training tools on the single final .tr file. This
        >> relieves you from the 32 file limit.
        >>
        >> For your convenience you can craft a batch file or shell script which
        >> would train, concatenate, cluster, etc. in one run. You should analyze all
        >> errors carefully though.
        >>
        >> Warm regards,
        >> Dmitry Silaev
        >>
        >>
        >>
        >>
        >> On Wed, Feb 16, 2011 at 5:56 AM, Sriranga(78yrsold)
        >> <[email protected] <mailto:[email protected]>>
        wrote:
        >>>
        >>> Dimitry,
        >>> It appears that Khem has not endorsed copy to you as such I am forwarding
        >>> for valuable guidance/comments - which may help me in my Kannada project..
        >>> with regards,
        >>> -sriranga(78yrs)
        >>>
        >>> ---------- Forwarded message ----------
        >>> From: KHEM Sochenda <[email protected]
        <mailto:[email protected]>>
        >>> Date: Wed, Feb 16, 2011 at 7:45 AM
        >>> Subject: Re: Tesseract Training
        >>> To: "Sriranga(78yrsold)" <[email protected]
        <mailto:[email protected]>>
        >>>
        >>>
        >>> Dear Sriranga,
        >>>
        >>> The below are the steps that I did the trainings:
        >>>
        >>> 1. I created 3 pages of training images as you can see in the attachments
        >>>    (khm.limons1.1 is page 1, khm.limons1.2 is page 2, and khm.limons1.3 is
        >>>    page 3).
        >>> 2. I created box files for every page (khm.limons1.1.box and so on) with the
        >>>    command line:
        >>>
        >>>    "tesseract khm.limons1.1.tif khm.limons1.1 batch.nochop makebox" for
        >>>    page 1 and "tesseract khm.limons1.2.tif khm.limons1.2 batch.nochop makebox"
        >>>    for page 2, and the same for page 3.
        >>>
        >>> 3. Then I edited the box files; the final result is in the attachments.
        >>> 4. I merged the images together into a single file (khm.limons1.0.tif).
        >>> 5. I merged the three box files into a single box file with page numbers
        >>>    assigned (khm.limons1.0.box).
        >>> 6. I ran the command to train the single file: "tesseract khm.limons1.1.tif
        >>>    khm.limons1.0.tif khm.limons1.0 nobatch box.train". The result looks okay at
        >>>    this step. (My purpose in merging into one file is that I want a single font
        >>>    to be in just one .tr file.)
        >>> 7. I then ran the command "unicharset_extractor khm.limons1.0.box" to
        >>>    extract every single glyph from the box files. The result looks okay.
        >>> 8. Then I tried running "mftraining -U unicharset -O khm.unicharset
        >>>    khm.limons1.0.tr" and "cntraining khm.limons1.0.tr" to extract the
        >>>    features. I failed at this step.
        >>>
        >>>
        >>>
        
--------------------------------------------------------------------------------------------------------
        >>> Since I have no clue how to get the above idea to work, I omitted steps 4
        >>> and 5 and skipped to points 6, 7, and 8 using the separate box files. I got
        >>> the traineddata as in the attached file. Having three separate .tr files is
        >>> not what I want.
        >>>
        >>> Currently I use the obtained trained data for my temporary OCR system.
        >>> What I wish to do is to add other fonts, but the number of .tr files is
        >>> limited to 32 only... This is what concerns me.
        >>>
        >>> Best Regards,
        >>>
        >>> Sochenda
        >>>
        >>>
        >>>
        >>>
        >>>
        >>
        >
        >
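
Putting the thread's suggestion together (train each box/tiff pair on its own, concatenate the .tr files, then run the remaining tools once on the merged file), the order of commands might look like the sketch below. It only builds and prints the command lines; it does not run anything. The tool names and arguments are the ones quoted in the messages above. The function names and the merged file name "khm.merged.tr" are hypothetical, and the sketch assumes each box.train run writes a .tr file named after its output base.

```python
def box_train_commands(bases):
    """One box.train run per box/tiff pair; each run is expected to write <base>.tr."""
    return [["tesseract", f"{base}.tif", base, "nobatch", "box.train"]
            for base in bases]


def clustering_commands(lang, bases, merged_tr):
    """Steps that run once, after the per-page .tr files have been concatenated."""
    return [
        ["unicharset_extractor"] + [f"{base}.box" for base in bases],
        ["mftraining", "-U", "unicharset", "-O", f"{lang}.unicharset", merged_tr],
        ["cntraining", merged_tr],
    ]


# Page names from Sochenda's message; "khm.merged.tr" is a made-up name.
pages = ["khm.limons1.1", "khm.limons1.2", "khm.limons1.3"]
for cmd in box_train_commands(pages):
    print(" ".join(cmd))
print("# concatenate the per-page .tr files into khm.merged.tr here")
for cmd in clustering_commands("khm", pages, "khm.merged.tr"):
    print(" ".join(cmd))
```

Because mftraining and cntraining then see a single .tr file, adding more fonts or pages only grows that file rather than the number of files passed on the command line.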

    --
    You received this message because you are subscribed to the Google
    Groups "tesseract-ocr" group.
    To post to this group, send email to
    [email protected]
    <mailto:[email protected]>.
    To unsubscribe from this group, send email to
    [email protected]
    <mailto:tesseract-ocr%[email protected]>.
    For more options, visit this group at
    http://groups.google.com/group/tesseract-ocr?hl=en.

