Would a "basic shape" be the same as a "shape", or as a "utf8"? Hmm,
perhaps it is a "call them what you like"?
Ray Smith wrote, On 2011-02-19 21:12:
Sorry to be late on this very long thread, but you guys are making
lives difficult for yourselves by getting hold of the wrong end of the
stick. There is no need to give tesseract a convoluted re-encoding of
the recognizable units that you want it to recognize, and
translate it on output.
Maybe I misunderstand what you were trying to do to start with, but
you can give tesseract any utf-8 string for each recognizable unit
that you train it with, including multiple unicodes if you want. If
your original shapes/recognizable units/aksharas/syllables (call them
what you like) represent multiple unicodes, then give tesseract all
the utf8 for those, and it will be happy. (It currently supports up to
24 bytes of utf-8 for each shape.) It will make life easier when you
want to give it a dictionary to use with the shapes, as it assumes
that the words you give it can be made up of sequences of the codes
for the basic shapes.
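
A minimal Python sketch of how one might check training units against the 24-byte figure quoted above. The constant and the helper names are assumptions made for illustration, not part of tesseract, and the limit itself should be treated as specific to the version being discussed.

```python
# Limit quoted in the message above; treat it as version-specific.
MAX_UTF8_BYTES = 24


def utf8_length(unit: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of `unit`."""
    return len(unit.encode("utf-8"))


def oversized_units(units, limit=MAX_UTF8_BYTES):
    """Return the units whose UTF-8 encoding exceeds `limit` bytes."""
    return [u for u in units if utf8_length(u) > limit]


# Hypothetical sample: a Kannada akshara built from three code points
# (KA + VIRAMA + SSA); each of them takes 3 bytes in UTF-8.
sample = "\u0C95\u0CCD\u0CB7"
print(len(sample), utf8_length(sample))  # prints "3 9"
```

A unit made of several code points is still one string as far as this check is concerned; only its total encoded length matters.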
On Thu, Feb 17, 2011 at 12:42 AM, Sriranga(78yrsold)
<[email protected]> wrote:
Dmitry,
I am extremely thankful for your valuable guidance. It works for
me. I have to learn many things from you.
With warmest Regards,
-sriranga(78yrs)
On Thu, Feb 17, 2011 at 1:56 PM, Dmitry Silaev
<[email protected]> wrote:
Sriranga,
> It is presumed that commandline for (WinXP) should be as follows:
> eg= " c:\tess\copy 001.tr + 002.tr + 003.tr + oo4.tr > 1234.tr or
> Multiimage.tr" which may kindly be confirmed. OR correct commandline
> for concatenate using command "copy" to be used may kindly be intimated.
This command won't do what you want. First, you don't need to put a
path before "copy": it is a built-in command of the MS-DOS command
processor, and when prefixed with a path it is treated as the name of
an executable in the "c:\tess\" directory, which doesn't exist.
Second, you don't need the ">", as it will redirect the informational
output of the "copy" command (not the files' contents) to "1234.tr".
The destination file should be given at the end of the command, after
a space. Therefore your command line should be
copy 001.tr + 002.tr + 003.tr + oo4.tr 1234.tr
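
For anyone who would rather script the concatenation than rely on the shell, here is a minimal Python sketch of the same operation. The function name is hypothetical; it copies the bytes of each input, in order, into one destination file, in binary mode so that nothing is translated along the way.

```python
import shutil
from pathlib import Path


def concatenate_tr(parts, destination):
    """Write the contents of each file in `parts`, in order, to `destination`."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream the bytes unchanged from the input to the output.
                shutil.copyfileobj(src, out)
    return Path(destination)


# Example call, using the file names from the command above:
# concatenate_tr(["001.tr", "002.tr", "003.tr", "oo4.tr"], "1234.tr")
```

The destination is overwritten if it already exists, so it should not be one of the input files.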
Warm regards,
Dmitry Silaev
On Thu, Feb 17, 2011 at 9:44 AM, Sriranga(78yrsold)
<[email protected]> wrote:
> Dmitry,
> Thanks for the valuable guidance. However, I could not understand how
> to concatenate (simply "copy") all the resulting .tr files together.
> It is presumed that commandline for (WinXP) should be as follows:
> eg= " c:\tess\copy 001.tr + 002.tr + 003.tr + oo4.tr > 1234.tr or
> Multiimage.tr" which may kindly be confirmed. OR correct commandline
> for concatenate using command "copy" to be used may kindly be intimated.
> With Warmest Regards,
> -sriranga(78yrs)
>
> On Wed, Feb 16, 2011 at 11:58 AM, Dmitry Silaev <[email protected]>
> wrote:
>>
>> Guys,
>>
>> If you have more than one box/tiff pair, you can train (i.e. generate
>> a .tr file) for each of these pairs separately.
>>
>> Then you can concatenate (simply "cat" or "copy") all the resulting
>> .tr files together and then run all training tools on the single
>> final .tr file. This relieves you from the 32 file limit.
>>
>> For your convenience you can craft a batch file or shell script which
>> would train, concatenate, cluster, etc. in one run. You should analyze
>> all errors carefully though.
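
A rough Python sketch of what such a script could look like. It only builds the command lines and does not run them. The argument forms are the ones quoted elsewhere in this thread and have not been checked against any particular tesseract version; the function name, the base names and "combined.tr" are placeholders.

```python
def training_commands(bases, lang, combined="combined.tr"):
    """Return the training command lines, in order, as lists of arguments."""
    commands = []
    # One box.train run per box/tiff pair; each is expected to write <base>.tr.
    for base in bases:
        commands.append(["tesseract", base + ".tif", base, "nobatch", "box.train"])
    # Build the unicharset from all the box files.
    commands.append(["unicharset_extractor"] + [base + ".box" for base in bases])
    # The per-pair .tr files must be concatenated into `combined` before
    # the two clustering steps below are run.
    commands.append(
        ["mftraining", "-U", "unicharset", "-O", lang + ".unicharset", combined]
    )
    commands.append(["cntraining", combined])
    return commands


for cmd in training_commands(["khm.limons1.1", "khm.limons1.2"], "khm"):
    print(" ".join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True), which stops at the first failing tool so that its error output can be read before going on.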
>>
>> Warm regards,
>> Dmitry Silaev
>>
>>
>>
>>
>> On Wed, Feb 16, 2011 at 5:56 AM, Sriranga(78yrsold)
>> <[email protected]> wrote:
>>>
>>> Dmitry,
>>> It appears that Khem has not copied you on this, so I am forwarding
>>> it for your valuable guidance/comments, which may help me in my
>>> Kannada project.
>>> with regards,
>>> -sriranga(78yrs)
>>>
>>> ---------- Forwarded message ----------
>>> From: KHEM Sochenda <[email protected]>
>>> Date: Wed, Feb 16, 2011 at 7:45 AM
>>> Subject: Re: Tesseract Training
>>> To: "Sriranga(78yrsold)" <[email protected]>
>>>
>>>
>>> Dear Sriranga,
>>>
>>> Below are the steps that I followed for the training:
>>>
>>> 1. I created 3 pages of training images, as you can see in the
>>>    attachments (khm.limons1.1 is page 1, khm.limons1.2 is page 2, and
>>>    khm.limons1.3 is page 3).
>>> 2. I created a box file for every page (khm.limons1.1.box and so on)
>>>    with the command lines
>>>    "tesseract khm.limons1.1.tif khm.limons1.1 batch.nochop makebox"
>>>    for page 1,
>>>    "tesseract khm.limons1.2.tif khm.limons1.2 batch.nochop makebox"
>>>    for page 2, and the same for page 3.
>>> 3. I then edited the box files; the final result is in the attachments.
>>> 4. I merged the images together into a single file (khm.limons1.0.tif).
>>> 5. I merged the three box files into a single box file with page
>>>    numbers assigned (khm.limons1.0.box).
>>> 6. I ran the command to train the single file:
>>>    "tesseract khm.limons1.1.tif khm.limons1.0.tif khm.limons1.0 nobatch box.train"
>>>    The result looked okay at this step. (My purpose in merging into
>>>    one file is that I want a single font to be in just one .tr file.)
>>> 7. I then ran the command "unicharset_extractor khm.limons1.0.box" to
>>>    extract every single glyph from the box files. The result looked
>>>    okay.
>>> 8. Then I tried running "mftraining -U unicharset -O khm.unicharset
>>>    khm.limons1.0.tr" and "cntraining khm.limons1.0.tr" to extract the
>>>    features. I failed at this step.
>>>
>>>
>>>
>>> ------------------------------------------------------------
>>> Since I have no clue how to get the above approach to work, I omitted
>>> steps 4 and 5 and skipped to steps 6, 7, and 8 using the separate box
>>> files, and got the traineddata in the attached file. Having three
>>> separate .tr files is not what I want, though.
>>>
>>> Currently I use the resulting trained data for my temporary OCR
>>> system. What I wish to do is add other fonts, but the number of .tr
>>> files is limited to 32 only... This is what concerns me.
>>>
>>> Best Regards,
>>>
>>> Sochenda
>>>
>>>
>>>
>>>
>>>
>>
>
>
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en.