All, I'm currently working with a student, Eugenie Giesbrecht, who is implementing an HMM-based part-of-speech tagger for inclusion in the sandbox. This is 100% Eugenie's original work for Apache, and we'll start checking in code over the next few days.
The only data Eugenie currently has for experimentation is the Brown corpus of American English. If you have any POS-tagged data that we could use for training (English or other languages), please let us know. The usual license restrictions apply: I don't think we can use any data that's only free for research purposes. Please let us know if you have any suggestions or would like to help ;-) Once Eugenie has something running, we'll make an announcement on the user list. Eugenie has an ICLA on file with the ASF.

--Thilo
