Well, I may know no more than you do. You've probably found this
remark yourself, but some time ago Ray Smith casually mentioned
"Cube increases the accuracy slightly, but adds a lot of compute
time." (https://groups.google.com/d/msg/tesseract-ocr/0msQtTB_XrI/D1noel9GpPgJ)

I don't know whether this still holds, but personally I wouldn't
invest much time in studying Cube's behavior (at least for the
moment), as its source code will certainly undergo many substantial
corrections (the source code comments themselves say so), and so will
the way Tesseract and Cube interact. Currently Tesseract segments
everything itself and then passes the segmented results to Cube on a
word-by-word basis. Then a selection step decides which of the two
produced the better OCR result: Tess or Cube.
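
To make that flow a bit more concrete, here is a rough sketch of a
word-by-word combination. It is only an illustration and is not taken
from the Tesseract or Cube sources: the names WordResult, PickBetter
and CombineWords are made up, and the "higher confidence wins" rule is
my assumption; the real selection logic may well be different.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type for illustration only; not part of Tesseract or Cube.
struct WordResult {
  std::string text;
  double confidence;  // assumed convention: higher means more certain
};

// Assumed selection rule: keep the result with the higher confidence.
// Ties go to Tesseract.
inline WordResult PickBetter(const WordResult& tess, const WordResult& cube) {
  return cube.confidence > tess.confidence ? cube : tess;
}

// Word-by-word combination. tess[i] and cube[i] are assumed to describe
// the same word, already segmented by Tesseract. If Cube produced no
// result for a word, the Tesseract result is kept unchanged.
inline std::vector<WordResult> CombineWords(
    const std::vector<WordResult>& tess,
    const std::vector<WordResult>& cube) {
  std::vector<WordResult> out;
  out.reserve(tess.size());
  for (std::size_t i = 0; i < tess.size(); ++i) {
    if (i < cube.size()) {
      out.push_back(PickBetter(tess[i], cube[i]));
    } else {
      out.push_back(tess[i]);
    }
  }
  return out;
}
```

The point of the sketch is only that segmentation happens once, on the
Tesseract side, and the choice between the two engines is made per
word rather than per page.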

However, if you still wish to dig, refer to "cube_control.cpp" and the
"cube" source directory.

HTH

Warm regards,
Dmitri Silaev
www.CustomOCR.com





On Fri, Apr 15, 2011 at 12:05 AM, Amrit <[email protected]> wrote:
> Hi All,
>         Pursuing my ongoing work on trying to develop a postal
> address recognizer, I was excited to discover the implementation of
> CUBE mode and especially the thought that it might be used to
> incorporate some language modelling techniques along with tesseract.
>        I believe that one should be able to activate it just by
> changing the tesseract's initialization mode from
> api.Init(argv[0], lang, tesseract::OEM_DEFAULT,
>           &(argv[arg]), argc - arg, NULL, NULL, false);
> to:
> api.Init(argv[0], lang, tesseract::OEM_CUBE_ONLY,
>           &(argv[arg]), argc - arg, NULL, NULL, false);
>
> On doing so I have some queries:
>
> 1) I was wondering as to what exactly is the difference between
> OEM_CUBE_ONLY and OEM_TESSERACT_CUBE_COMBINED ?
> At a high level, from the little material I could find on the cube
> implementation, I understand that using tesseract in cube mode can
> improve the performance (especially for connected scripts such as
> Arabic). I am trying to use it to recognize English alphanumeric text
> alone, so would it be safe to expect better accuracy?
> So far, on the couple of images I tested it on, the results have not
> shown any remarkable improvement.
>
> 2) Furthermore, under tessdata I am seeing files such as
>      1) eng.cube.lm -  which contains listing of whitelist characters
> which seem to define the grammar space for tesseract to work on.
>      2) eng.cube.bigrams and eng.cube.word-freq - not sure how these
> are being used currently and to what effect.
>
> 3) Is there a way of customizing the above and using it in tesseract
> (I would assume that this will be part of the eng.traineddata , but
> when I split the same I do not find these files as its members)
> e.g. instead of using the whitelist in the code, can we customize
> eng.cube.lm and use that instead to restrict tesseract's character
> output?
>
> Regards,
> Amrit.
>
> --
> You received this message because you are subscribed to the Google Groups 
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to 
> [email protected].
> For more options, visit this group at 
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>
