Hi Michael,

I assume that by word order modelling you mean part-of-speech tagging and modelling the order of the parts of speech, perhaps with some HMM-based model?
In any case, there are two ways to do this:

(a) Easier to do, but with a smaller effect: modify the adjust_word function to make use of the local context to promote words that fit the model.

(b) More difficult to do, but possibly with a larger effect: split the dictionary by part of speech into multiple (possibly overlapping) sub-dictionaries. Change permute_words to search each of the sub-dictionaries, so that adjust_word is more likely to have a good range to choose from.

We try to incorporate improvements into the mainline code. This has been a bit slow so far, but the turnaround time is improving as I catch up.

Ray.

On Mon, Feb 2, 2009 at 3:04 PM, Michael Reimer <[email protected]> wrote:
> Hello all. I'm a computational linguistics graduate student, and I'd
> like to do some work on Tesseract for credit in a software engineering
> course. My area of interest is word order modelling and I believe
> this can and has been used to improve the accuracy of other OCR
> systems. As far as I can tell, Tesseract has nothing similar
> currently, so I'm interested in adding it. Any feedback on that idea
> would be appreciated.
>
> I'm also completely new to open source, so provided that my general
> goal appeals, I would appreciate any and all advice on how best to get
> involved with your community, get myself up to speed technically (I am
> reading the "Hacking Tesseract" manual currently), avoid stepping on
> toes, and so on. Thanks.
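As a rough illustration of option (a) in the reply above, here is a minimal Python sketch of promoting candidate words that fit a part-of-speech order model. It is not Tesseract code: the names POS_OF, TRANSITION, FLOOR, rescore and best_word, the toy tags and all probabilities are made up for the example, and it uses a simple bigram tag model in place of a full HMM. How adjust_word actually receives its candidates and context is not shown here.

```python
import math

# Hypothetical toy lexicon mapping words to a single part-of-speech tag.
# This is illustrative data, not Tesseract's dictionary.
POS_OF = {"the": "DET", "cat": "NOUN", "cot": "NOUN", "sat": "VERB", "eat": "VERB"}

# Hypothetical bigram transition probabilities P(tag | previous tag).
TRANSITION = {
    ("DET", "NOUN"): 0.7,
    ("DET", "VERB"): 0.05,
    ("NOUN", "VERB"): 0.6,
    ("NOUN", "NOUN"): 0.2,
}
FLOOR = 0.01  # probability used for tag pairs missing from TRANSITION


def rescore(prev_word, candidates, weight=1.0):
    """Return candidates sorted best-first after adding a tag-order bonus.

    candidates is a list of (word, ocr_log_score) pairs, higher is better.
    weight controls how strongly the tag model influences the ranking.
    """
    prev_tag = POS_OF.get(prev_word)
    rescored = []
    for word, ocr_score in candidates:
        tag = POS_OF.get(word)
        p = TRANSITION.get((prev_tag, tag), FLOOR)
        rescored.append((word, ocr_score + weight * math.log(p)))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def best_word(prev_word, candidates, weight=1.0):
    """Pick the top candidate after rescoring with local context."""
    return rescore(prev_word, candidates, weight)[0][0]
```

With the classifier slightly preferring "eat" over "cat" after the word "the", calling best_word("the", [("eat", -1.0), ("cat", -1.2)]) picks "cat" because a determiner followed by a noun is far more likely under the toy model; setting weight=0.0 disables the context bonus and leaves the OCR ranking unchanged.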

