Thanks Dmitri,

As mentioned earlier, my interest in using Tesseract in Cube mode
hinges on the possibility of applying some kind of grammar/language
model restriction to the word recognition (eng.cube.lm,
eng.cube.bigrams, etc.).

My understanding of Tesseract recognition is that the image text is
segmented at the character level and stored as blobs. These blobs are
recognized individually with the help of the unicharset for the given
language. Word recognition then takes place based on the character
input inside page_resit (assuming iterator) -> page_res->werd_res
(containing the extra information about the physical location of the
word in the image).

What I am still looking for is this: in the decoding of a single word
there has to be a grammar/dictionary associated with it, so that
Tesseract can validate that whatever it has recognized at the
character level, when put together, actually forms a valid word. If
the word is not found in the dictionary, the result is whatever
character-level recognition alone produced (this is when the output is
sometimes a group of random characters).

Do correct me if I am wrong in assuming the above, but it would really
help if I could get hold of this grammar/dictionary, if it is being
used at all. It would enable me to restrict the random results I am
observing in my image OCR output.
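
To make the kind of check I have in mind concrete, here is a minimal
standalone sketch. It is not Tesseract code and does not reflect how
Tesseract or Cube validate words internally; the names ToLower and
IsPlausibleWord and the idea of a caller-supplied word list are my own
assumptions for illustration. It accepts a recognized word only if it
appears in the word list or is purely numeric (house numbers and postal
codes cannot be covered by a word list).

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical helper (not part of Tesseract): lower-case an ASCII word.
std::string ToLower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Hypothetical post-check on one recognized word.
// Returns true if the word is all digits, or if its lower-cased form
// is present in the caller-supplied dictionary; false otherwise
// (including for an empty string).
bool IsPlausibleWord(const std::string& word,
                     const std::unordered_set<std::string>& dict) {
  if (word.empty()) return false;
  const bool all_digits =
      std::all_of(word.begin(), word.end(),
                  [](unsigned char c) { return std::isdigit(c) != 0; });
  return all_digits || dict.count(ToLower(word)) > 0;
}
```

With a word list such as {"main", "street", "road"}, this would accept
"Street" and "560001" but reject a character-level result like
"Stre3t", which is the sort of random output I would like to filter.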

Regards,
Amrit.

On Apr 18, 3:10 am, Dmitri Silaev <[email protected]> wrote:
> Well, I may know no more than you do. You've probably found this
> remark yourself, but some time ago Ray Smith casually mentioned
> "Cube increases the accuracy slightly, but adds a lot of compute
> time." 
> (https://groups.google.com/d/msg/tesseract-ocr/0msQtTB_XrI/D1noel9GpPgJ)
>
> I don't know if this is currently relevant, but as for me, I wouldn't
> invest much time in studying Cube's behavior (at least for the
> moment), as it will certainly undergo many substantial source code
> corrections (this can even be found in the source code comments), as
> will the way Tesseract and Cube interact. Currently Tesseract
> segments everything itself and then passes the segmented results to
> Cube on a word-by-word basis. Then a selection is made as to which of
> the two did the better OCR: Tess or Cube.
>
> However if you still wish to dig, refer to "cube_control.cpp" and the
> "cube" source directory.
>
> HTH
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Fri, Apr 15, 2011 at 12:05 AM, Amrit <[email protected]> wrote:
> > Hi All,
> >         Pursuing my ongoing work on developing a postal
> > address recognizer, I was excited to discover the implementation of
> > CUBE mode, and especially the thought that it might be used to
> > incorporate some language modelling techniques along with tesseract.
> >        I believe that one should be able to activate it just by
> > changing tesseract's initialization mode from
> > api.Init(argv[0], lang, tesseract::OEM_DEFAULT,
> >           &(argv[arg]), argc - arg, NULL, NULL, false);
> > to:
> > api.Init(argv[0], lang, tesseract::OEM_CUBE_ONLY,
> >           &(argv[arg]), argc - arg, NULL, NULL, false);
>
> > Having done so, I have some queries:
>
> > 1) I was wondering what exactly the difference is between
> > OEM_CUBE_ONLY and OEM_TESSERACT_CUBE_COMBINED?
> > At a high level, from the little material I could find on the cube
> > implementation, I understand that using tesseract in cube mode can
> > improve the performance (especially for connected character sets
> > like Arabic). I am trying to use it to recognize English alphanumeric
> > text only; would it be safe to expect better accuracy?
> > So far, on the couple of images I tested it on, the results have not
> > shown any remarkable improvement.
>
> > 2) Furthermore, under tessdata I am seeing files such as
> >      1) eng.cube.lm - which contains a listing of whitelist characters
> > that seem to define the grammar space for tesseract to work on.
> >      2) eng.cube.bigrams and eng.cube.word-freq - not sure how these
> > are being used currently and to what effect.
>
> > 3) Is there a way of customizing the above and using it in tesseract?
> > (I would assume that these would be part of eng.traineddata, but
> > when I split it I do not find these files among its members.)
> > E.g., instead of using the whitelist in the code, can we customize
> > eng.cube.lm and use that instead to restrict tesseract's character
> > output?
>
> > Regards,
> > Amrit.
>
> > --
> > You received this message because you are subscribed to the Google Groups 
> > "tesseract-ocr" group.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to 
> > [email protected].
> > For more options, visit this group at
> > http://groups.google.com/group/tesseract-ocr?hl=en.

