Re: Word order modelling

Michael Reimer Tue, 03 Feb 2009 16:52:44 -0800

Hi Ray.  I mean something along those lines, yes, though I'm not
especially focused on part of speech tags (grammatical approaches to
this problem don't generally yield much reward relative to the amount
of processing power they require, unless the grammar is highly
constrained).  I'm not sure if you want to know about my plans in any
great detail, but I'll be happy to elaborate if so.  In any case, your
reply is applicable and I'll look into method (b).


Whatever I end up with will probably involve a training component that
uses some corpus to generate a data file which can then be used in
adjust_word/permute_words.  I'm wondering, is there a strong
preference as to what language that ends up being written in?  I'm
comfortable with C/C++, but much prefer Python for string
manipulation.  Thanks for your time.
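
To make the training idea above concrete, here is a minimal sketch in Python, assuming a simple word-bigram model stored as JSON. Every name in it (train_bigrams, save_model, load_model, pick_candidate) and the file layout are hypothetical and chosen only for illustration. Nothing here calls into Tesseract; wiring the loaded counts into adjust_word or permute_words would be separate work on the C++ side.

```python
import json
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker


def train_bigrams(sentences):
    """Count unigrams and bigrams over an iterable of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = [START] + [t.lower() for t in tokens]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def save_model(path, unigrams, bigrams):
    """Write the counts to a JSON data file (assumed format)."""
    data = {
        "unigrams": dict(unigrams),
        # JSON keys must be strings; tokens are assumed to contain no spaces.
        "bigrams": {f"{a} {b}": n for (a, b), n in bigrams.items()},
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)


def load_model(path):
    """Read the counts back from the JSON data file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    unigrams = Counter(data["unigrams"])
    bigrams = Counter(
        {tuple(key.split(" ")): n for key, n in data["bigrams"].items()}
    )
    return unigrams, bigrams


def bigram_logprob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev)."""
    prev, word = prev.lower(), word.lower()
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


def pick_candidate(prev, candidates, unigrams, bigrams):
    """Return the candidate word that best follows the previous word."""
    return max(
        candidates,
        key=lambda w: bigram_logprob(prev, w, unigrams, bigrams),
    )
```

A recognizer would presumably combine this score with its own per-word confidence rather than use it alone. The sketch only shows the split between an offline training step and a runtime lookup.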

On Feb 3, 5:58 pm, Ray Smith <[email protected]> wrote:
> Hi Michael,
> I assume by word order modelling you mean part of speech tagging and
> modelling the order of the parts of speech, perhaps by some HMM-based
> model?
>
> In any case, there are two ways to do this:
>
> (a) Easier to do, but lighter effect:
> Modify the adjust_word function to make use of the local context to promote
> words that fit the model.
>
> (b) More difficult to do, but possibly larger effect:
> Split the dictionary by part of speech into multiple (possibly overlapping)
> sub-dictionaries. Change permute_words to search each of the
> sub-dictionaries and then adjust_word is more likely to have a good range to
> choose from.
>
> We try to incorporate improvements into the mainline code. This has been a
> bit slow so far, but the turnaround time is improving as I catch up.
>
> Ray.
>
> On Mon, Feb 2, 2009 at 3:04 PM, Michael Reimer 
> <[email protected]> wrote:
>
>
>
> > Hello all.  I'm a computational linguistics graduate student, and I'd
> > like to do some work on Tesseract for credit in a software engineering
> > course.  My area of interest is word order modelling and I believe
> > this can be and has been used to improve the accuracy of other OCR
> > systems.  As far as I can tell, Tesseract has nothing similar
> > currently, so I'm interested in adding it.  Any feedback on that idea
> > would be appreciated.
>
> > I'm also completely new to open source, so provided that my general
> > goal appeals, I would appreciate any and all advice on how best to get
> > involved with your community, get myself up to speed technically (I am
> > reading the "Hacking Tesseract" manual currently), avoid stepping on
> > toes, and so on.  Thanks.