For a training tool, Python is fine, but C/C++ is essential for the runtime
component.

Thanks,
Ray.
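
To make the training/runtime split concrete, here is a minimal sketch of
what a Python training tool could look like. It is illustrative only and is
not Tesseract code: the bigram model, the tab-separated file layout, and
every function name below (train_bigrams, write_model, read_model,
pick_candidate) are assumptions made for the example. pick_candidate only
mimics the idea of using local context to promote a word; it does not
reflect the real signature of adjust_word.

```python
import collections
import math


def train_bigrams(sentences):
    """Count unigrams and bigrams over an iterable of token lists."""
    unigrams = collections.Counter()
    bigrams = collections.Counter()
    for tokens in sentences:
        # "<s>" marks the sentence start so the first word has a context.
        tokens = ["<s>"] + [t.lower() for t in tokens]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def write_model(path, unigrams, bigrams):
    """Write counts as tab-separated lines (hypothetical format, chosen
    because it would be simple to parse from a C/C++ runtime)."""
    with open(path, "w", encoding="utf-8") as f:
        for word, n in sorted(unigrams.items()):
            f.write(f"U\t{word}\t{n}\n")
        for (prev, word), n in sorted(bigrams.items()):
            f.write(f"B\t{prev}\t{word}\t{n}\n")


def read_model(path):
    """Read the file produced by write_model back into two Counters."""
    unigrams = collections.Counter()
    bigrams = collections.Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "U":
                unigrams[fields[1]] = int(fields[2])
            elif fields[0] == "B":
                bigrams[(fields[1], fields[2])] = int(fields[3])
    return unigrams, bigrams


def bigram_logprob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev)."""
    vocab = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))


def pick_candidate(prev, candidates, unigrams, bigrams):
    """Return the candidate word that best fits the preceding word."""
    return max(candidates,
               key=lambda w: bigram_logprob(prev, w, unigrams, bigrams))
```

The training side would call train_bigrams on a tokenized corpus and
write_model once, offline. The runtime side would then only need to load
the data file and do the lookup that pick_candidate illustrates, and that
part would be reimplemented in C/C++.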

On Tue, Feb 3, 2009 at 5:52 PM, Michael Reimer <[email protected]> wrote:

>
> Hi Ray.  I mean something along those lines, yes, though I'm not
> especially focused on part of speech tags (grammatical approaches to
> this problem don't generally yield much reward relative to the amount
> of processing power they require, unless the grammar is highly
> constrained).  I'm not sure if you want to know about my plans in any
> great detail but I'll be happy to elaborate if so.  In any case your
> reply is applicable and I'll look into method (b).
>
> Whatever I end up with will probably involve a training component that
> uses some corpus to generate a data file which can then be used in
> adjust_word/permute_words.  I'm wondering, is there a strong
> preference as to what language that ends up being written in?  I'm
> comfortable with C/C++, but much prefer Python for string
> manipulation.  Thanks for your time.
>
> On Feb 3, 5:58 pm, Ray Smith <[email protected]> wrote:
> > Hi Michael,
> > I assume by word order modelling you mean part of speech tagging and
> > modelling the order of the parts of speech, perhaps by some HMM-based
> > model?
> >
> > In any case, there are 2 ways to do this:
> >
> > (a) Easier to do, but lighter effect:
> > Modify the adjust_word function to make use of the local context to
> > promote words that fit the model.
> >
> > (b) More difficult to do, but possibly larger effect:
> > Split the dictionary by part of speech into multiple (possibly
> > overlapping) sub-dictionaries. Change permute_words to search each of
> > the sub-dictionaries, so that adjust_word is more likely to have a good
> > range to choose from.
> >
> > We try to incorporate improvements into the mainline code. This has been
> > a bit slow so far, but the turnaround time is improving as I catch up.
> >
> > Ray.
> >
> > On Mon, Feb 2, 2009 at 3:04 PM, Michael Reimer <[email protected]> wrote:
> >
> > > Hello all.  I'm a computational linguistics graduate student, and I'd
> > > like to do some work on Tesseract for credit in a software engineering
> > > course.  My area of interest is word order modelling and I believe
> > > this can be, and has been, used to improve the accuracy of other OCR
> > > systems.  As far as I can tell, Tesseract has nothing similar
> > > currently, so I'm interested in adding it.  Any feedback on that idea
> > > would be appreciated.
> >
> > > I'm also completely new to open source, so provided that my general
> > > goal appeals, I would appreciate any and all advice on how best to get
> > > involved with your community, get myself up to speed technically (I am
> > > reading the "Hacking Tesseract" manual currently), avoid stepping on
> > > toes, and so on.  Thanks.
> >
>

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/tesseract-ocr?hl=en
-~----------~----~----~----~------~----~------~--~---
