Hi Ray. I mean something along those lines, yes, though I'm not especially focused on part of speech tags (grammatical approaches to this problem don't generally yield much reward relative to the amount of processing power they require, unless the grammar is highly constrained). I'm not sure if you want to know about my plans in any great detail but I'll be happy to elaborate if so. In any case your reply is applicable and I'll look into method (b).
Whatever I end up with will probably involve a training component that uses some corpus to generate a data file which can then be used in adjust_word/permute_words. I'm wondering, is there a strong preference as to what language that ends up being written in? I'm comfortable with C/C++, but much prefer python for string manipulation.

Thanks for your time.

On Feb 3, 5:58 pm, Ray Smith <[email protected]> wrote:
> Hi Michael,
> I assume by word order modelling you mean part of speech tagging and
> modelling the order of the parts of speech, perhaps by some HMM-based
> model?
>
> In any case, there are 2 ways to do this:
>
> (a) Easier to do, but lighter effect:
> Modify the adjust_word function to make use of the local context to promote
> words that fit the model.
>
> (b) More difficult to do, but possibly larger effect.
> Split the dictionary by part of speech into multiple (possibly overlapping)
> sub-dictionaries. Change permute_words to search each of the
> sub-dictionaries and then adjust_word is more likely to have a good range to
> choose from.
>
> We try to incorporate improvements into the mainline code. This has been a
> bit slow so far, but the turnaround time is improving as I catch up.
>
> Ray.
>
> On Mon, Feb 2, 2009 at 3:04 PM, Michael Reimer
> <[email protected]> wrote:
>
> > Hello all. I'm a computational linguistics graduate student, and I'd
> > like to do some work on Tesseract for credit in a software engineering
> > course. My area of interest is word order modelling and I believe
> > this can and has been used to improve the accuracy of other OCR
> > systems. As far as I can tell, Tesseract has nothing similar
> > currently, so I'm interested in adding it. Any feedback on that idea
> > would be appreciated.
> >
> > I'm also completely new to open source, so provided that my general
> > goal appeals, I would appreciate any and all advice on how best to get
> > involved with your community, get myself up to speed technically (I am
> > reading the "Hacking Tesseract" manual currently), avoid stepping on
> > toes, and so on. Thanks.
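
For illustration only, here is a minimal sketch of what a training step like the one mentioned above could look like in Python: it counts word pairs in a corpus, writes them to a data file, and reads them back to score how well a candidate word fits after the previous word. Everything in it is an assumption made for the example (the function names, the tab-separated file format, the "<s>" start marker, and the choice of a plain add-one smoothed bigram model). How such a file would actually be consumed from adjust_word or permute_words is not shown, since that depends on Tesseract internals.

```python
import collections
import math


def train_bigrams(sentences):
    """Count single words and adjacent word pairs in whitespace-tokenized text."""
    unigrams = collections.Counter()
    bigrams = collections.Counter()
    for sentence in sentences:
        # "<s>" is a hypothetical start-of-sentence marker.
        tokens = ["<s>"] + [w.lower() for w in sentence.split()]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def write_model(path, unigrams, bigrams):
    """Write counts as tab-separated lines (a made-up format for this sketch)."""
    with open(path, "w", encoding="utf-8") as f:
        for word, count in sorted(unigrams.items()):
            f.write(f"U\t{word}\t{count}\n")
        for (prev, word), count in sorted(bigrams.items()):
            f.write(f"B\t{prev}\t{word}\t{count}\n")


def read_model(path):
    """Read the counts back from the file produced by write_model."""
    unigrams = collections.Counter()
    bigrams = collections.Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "U":
                unigrams[fields[1]] = int(fields[2])
            elif fields[0] == "B":
                bigrams[(fields[1], fields[2])] = int(fields[3])
    return unigrams, bigrams


def log_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev); unseen pairs get a small nonzero score."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
```

A caller could then compare candidates by context, for example checking that log_prob("the", "cat", ...) is higher than log_prob("the", "dog", ...) for a model trained on text where "the cat" occurs and "the dog" does not. The training side could stay in Python even if the lookup side had to be rewritten in C/C++, since the two only share the data file.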

