Hi All,
As part of my ongoing work on a postal address recognizer, I was
excited to discover the implementation of CUBE mode, and especially
the thought that it might be used to incorporate some language
modelling techniques into tesseract.
I believe that one should be able to activate it just by changing
tesseract's initialization mode from:
api.Init(argv[0], lang, tesseract::OEM_DEFAULT,
&(argv[arg]), argc - arg, NULL, NULL, false);
to:
api.Init(argv[0], lang, tesseract::OEM_CUBE_ONLY,
&(argv[arg]), argc - arg, NULL, NULL, false);
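To compare the modes on the same images without recompiling each time, I am thinking of picking the mode at run time. Below is a minimal sketch of that idea. Note the assumptions: ParseEngineMode and the mode names "tesseract", "cube" and "combined" are my own hypothetical choices, and the enum is only a local stand-in whose values I believe match tesseract::OcrEngineMode in the 3.x headers; real code would include baseapi.h and use the library's enum directly.

```cpp
#include <string>

// Local stand-in for tesseract::OcrEngineMode so the sketch compiles on
// its own. The values are assumed to match the Tesseract 3.x headers;
// in real code, include baseapi.h and use the library's enum instead.
enum OcrEngineMode {
  OEM_TESSERACT_ONLY = 0,
  OEM_CUBE_ONLY = 1,
  OEM_TESSERACT_CUBE_COMBINED = 2,
  OEM_DEFAULT = 3
};

// Hypothetical helper: map a command-line word to an engine mode.
// Any unrecognized name falls back to OEM_DEFAULT.
OcrEngineMode ParseEngineMode(const std::string& name) {
  if (name == "tesseract") return OEM_TESSERACT_ONLY;
  if (name == "cube") return OEM_CUBE_ONLY;
  if (name == "combined") return OEM_TESSERACT_CUBE_COMBINED;
  return OEM_DEFAULT;
}
```

The returned value would then be passed as the third argument of api.Init in place of the hard-coded constant.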
Having done so, I have a few questions:
1) What exactly is the difference between OEM_CUBE_ONLY and
OEM_TESSERACT_CUBE_COMBINED?
From the little material I could find on the cube implementation, I
understand at a high level that using tesseract in cube mode can
improve recognition performance (especially for connected scripts
such as Arabic). I am using it to recognize English alphanumeric text
only, so would it be safe to expect better accuracy there as well?
So far, on the couple of images I have tested, the results have not
shown any remarkable improvement.
2) Furthermore, under tessdata I see files such as:
   a) eng.cube.lm - which contains a list of whitelist characters that
      seem to define the grammar space for tesseract to work in.
   b) eng.cube.bigrams and eng.cube.word-freq - I am not sure how these
      are currently being used, or to what effect.
3) Is there a way of customizing the above files and using them in
tesseract? (I assumed they would be part of eng.traineddata, but when
I split that file I do not find them among its members.)
For example, instead of setting the whitelist in the code, can we
customize eng.cube.lm and use that to restrict tesseract's character
output?
Regards,
Amrit.