Re: Using tesseract in CUBE mode

Dmitri Silaev Wed, 20 Apr 2011 06:33:27 -0700

Amrit,

First of all, did you train a new font using your source images? For
the image you showed earlier, training is still crucial to success,
with or without a dictionary. Your postal address font is very
specific.


Simplistically, Tesseract's word matching is almost an exhaustive
enumeration of "chop" points, in other words an enumeration of the
ways to partition the connected components. The pixels between every
pair of chop points are treated as a potential symbol and matched
against the trained templates. The best matches are saved and then
"permuted" using various methods to get possible word choices. The
dictionary is, to some degree, treated as one such "permuter".
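
To make the idea concrete, here is a toy sketch in C++. It is not
Tesseract's actual code: the names Templates, WordChoice, Enumerate and
BestWord are hypothetical, the "image" is just a list of string pieces,
and the "classifier" is a lookup table with made-up costs. It only shows
the shape of the process: try every chop point, match each segment
against the templates, then let a dictionary rank the resulting strings.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for trained templates (hypothetical): a run of joined pieces
// maps to the characters it could be, each with a cost (lower is better).
using Templates =
    std::map<std::string, std::vector<std::pair<char, double>>>;

struct WordChoice {
  std::string text;
  double cost;
  bool in_dict;
};

// Tries every chop point after `start`, matches the segment between the two
// chop points against the templates, and recurses on the remaining pieces.
void Enumerate(const std::vector<std::string>& pieces, std::size_t start,
               const Templates& templates, const std::set<std::string>& dict,
               const std::string& prefix, double cost,
               std::vector<WordChoice>* out) {
  assert(out != nullptr);
  if (start == pieces.size()) {
    out->push_back(WordChoice{prefix, cost, dict.count(prefix) > 0});
    return;
  }
  std::string segment;
  for (std::size_t end = start; end < pieces.size(); ++end) {
    segment += pieces[end];
    Templates::const_iterator it = templates.find(segment);
    if (it == templates.end()) continue;  // No template resembles this segment.
    for (const auto& match : it->second) {
      Enumerate(pieces, end + 1, templates, dict, prefix + match.first,
                cost + match.second, out);
    }
  }
}

// Toy "permuter": a dictionary word beats any non-dictionary string;
// within the same group the lower total cost wins.
WordChoice BestWord(const std::vector<std::string>& pieces,
                    const Templates& templates,
                    const std::set<std::string>& dict) {
  std::vector<WordChoice> choices;
  Enumerate(pieces, 0, templates, dict, "", 0.0, &choices);
  WordChoice best{"", std::numeric_limits<double>::infinity(), false};
  for (const WordChoice& w : choices) {
    const bool better = (w.in_dict && !best.in_dict) ||
                        (w.in_dict == best.in_dict && w.cost < best.cost);
    if (better) best = w;
  }
  return best;
}
```

For example, with the pieces r, n, a, i, l, a template table in which
"rn" can also match 'm', and a dictionary containing "mail", BestWord
prefers "mail" even though "rnail" has the lower raw cost. If the table
has no "rn" entry, "mail" is never proposed at all and the dictionary
has nothing to choose from.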

I've made some basic checks of how the dictionary works in the
current revision, and from what I've seen I think it's fine. But if
your training glyphs are very different from those you are trying to
recognize, the dictionary permuter won't have any chance to come into
play.

Warm regards,
Dmitri Silaev
www.CustomOCR.com





On Mon, Apr 18, 2011 at 11:22 PM, Amrit <[email protected]> wrote:
> Thanks Dmitri,
>                      As mentioned earlier, my expectation for using
> tesseract in cube mode hinges on the possibility of applying some kind
> of grammar/language model restrictions to the word recognition that
> is happening (eng.cube.lm, eng.cube.bigrams, etc.).
>                     My understanding of the tesseract recognition is
> that the image text is segmented at a character level and stored as
> blobs. These blobs are recognized individually with the help of the
> unicharset for the given language. Furthermore, word recognition
> takes place based on the character input inside page_resit (assuming
> iterator) -> page_res -> werd_res (containing the extra information
> about the physical location of the word in the image).
>                     What I am still looking for is this: in the
> decoding of a single word there has to be a grammar/dictionary
> associated with it, so that tesseract validates that whatever it has
> recognized at a character level, when put together, actually
> forms a valid word. If the word is not found in the dict, then the
> result is what is obtained by character-level recognition alone (this
> is when the output is sometimes a group of random characters).
>                      Do correct me if I am wrong in assuming the
> above, but it'll really help me if I can get hold of this grammar/dict,
> if it is being used at all. It will enable me to restrict the random
> results that I am observing in my image OCR output.
>
> Regards,
> Amrit.
>
> On Apr 18, 3:10 am, Dmitri Silaev <[email protected]> wrote:
>> Well, I may know no more than you do. You've probably found this
>> remark yourself, but some time ago Ray Smith casually mentioned
>> "Cube increases the accuracy slightly, but adds a lot of compute
>> time." 
>> (https://groups.google.com/d/msg/tesseract-ocr/0msQtTB_XrI/D1noel9GpPgJ)
>>
>> I don't know if this is currently relevant, but as for me, I wouldn't
>> invest much time in studying Cube's behavior (at least for the
>> moment), as it will certainly undergo many substantial source code
>> corrections (this can even be found in the source code comments), as
>> will the way Tesseract and Cube interact. Currently Tesseract
>> segments everything itself and then passes the segmented results to
>> Cube on a word-by-word basis. Then a selection step decides which of
>> the two did the better OCR: Tess or Cube.
>>
>> However if you still wish to dig, refer to "cube_control.cpp" and the
>> "cube" source directory.
>>
>> HTH
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Apr 15, 2011 at 12:05 AM, Amrit <[email protected]> wrote:
>> > Hi All,
>> >         Pursuing my ongoing work on developing a postal
>> > address recognizer, I was excited to discover the implementation of
>> > CUBE mode, and especially the thought that it might be used to
>> > incorporate some language modelling techniques along with tesseract.
>> >        I believe that one should be able to activate it just by
>> > changing tesseract's initialization mode from
>> > api.Init(argv[0], lang, tesseract::OEM_DEFAULT,
>> >           &(argv[arg]), argc - arg, NULL, NULL, false);
>> > to:
>> > api.Init(argv[0], lang, tesseract::OEM_CUBE_ONLY,
>> >           &(argv[arg]), argc - arg, NULL, NULL, false);
>>
>> > On doing so I have some queries:
>>
>> > 1) I was wondering what exactly the difference is between
>> > OEM_CUBE_ONLY and OEM_TESSERACT_CUBE_COMBINED?
>> > At a high level, from the little material I could find on the cube
>> > implementation, I understand that using tesseract in cube mode can
>> > improve the performance (especially for connected scripts like
>> > Arabic). I am trying to use it to recognize English alphanumeric text
>> > alone, so would it be safe to expect better accuracy?
>> > So far, on the couple of images I tested it on, the results have not
>> > shown any remarkable improvement.
>>
>> > 2) Furthermore, under tessdata I am seeing files such as:
>> >      1) eng.cube.lm - which contains a listing of whitelist characters
>> > that seem to define the grammar space for tesseract to work on.
>> >      2) eng.cube.bigrams and eng.cube.word-freq - not sure how these
>> > are being used currently and to what effect.
>>
>> > 3) Is there a way of customizing the above and using it in tesseract?
>> > (I would assume that these would be part of eng.traineddata, but
>> > when I split it I do not find these files among its members.)
>> > E.g. instead of using the whitelist in the code, can we customize
>> > eng.cube.lm and use that instead to restrict tesseract's character
>> > output?
>>
>> > Regards,
>> > Amrit.
>>
>> > --
>> > You received this message because you are subscribed to the Google Groups 
>> > "tesseract-ocr" group.
>> > To post to this group, send email to [email protected].
>> > To unsubscribe from this group, send email to 
>> > [email protected].
>> > For more options, visit this group at
>> > http://groups.google.com/group/tesseract-ocr?hl=en.
>
