RE: Persian PC-Kimmo 0.8 released
Thanks for your reply, Jon.

Thanks for asking. All the words are in tab-separated text files, such as noun.lex, verb.lex, etc. They get converted to Kimmo-usable files such as fa-noun.lex, fa-verb.lex, etc. using the db2lex Perl scripts in the scripts directory. The verb and adjective files use a script written specifically for them; all others use the plain script. Also see the orthography.txt file for the romanization scheme. It also has some other goodies. I would love to add any additions you might make to the lexicon in the next release.

I suppose I can use roman2unicode to convert the roman encoding into readable plain text (I'm not fast at reading the roman notation). That way, I can import the data into Excel, sort it alphabetically, and start adding new stuff...

As you can see, it needs a little more work on the morphophonemic rules, but it should work fine for stemming purposes.

Yes, it's pretty good at recognizing the stem of the word.

Hans Nelson is the man to talk to. He's working on a Kimmo output to XML program. I don't know much about it, but here's his email: [EMAIL PROTECTED]

Thanks for your hint. I'll try to contact him. In case you're interested, I can send the final result of our discussion to you off-list.

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
List Owner: [EMAIL PROTECTED]
[ Email: [EMAIL PROTECTED] ] [ WWW: http://www.beginthread.com/Ehsan ]
___
PersianComputing mailing list
[EMAIL PROTECTED]
http://lists.sharif.edu/mailman/listinfo/persiancomputing
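The sort-then-extend workflow mentioned above does not strictly need a spreadsheet. The following is only a minimal sketch in Python, assuming each line of a lexicon source file is one tab-separated entry and that the first column is the one to sort on. The thread does not describe the actual column layout of noun.lex or verb.lex, so the sample rows and the `sort_lexicon` helper are hypothetical placeholders.

```python
import csv
import io


def sort_lexicon(tsv_text, key_column=0):
    """Return the rows of a tab-separated lexicon, sorted by one column.

    Assumption: one entry per line, fields separated by tabs. Quoting is
    disabled so that romanization symbols are read literally.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip blank lines
    return sorted(rows, key=lambda row: row[key_column])


# Placeholder entries, NOT taken from the real lexicon files.
sample = "form-c\tgloss-c\nform-a\tgloss-a\nform-b\tgloss-b\n"

for row in sort_lexicon(sample):
    print("\t".join(row))
```

To use it on a real file, read the file's contents into a string and pass that to `sort_lexicon`; the sorted rows can then be written back out with tabs before running the conversion scripts.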
RE: Persian PC-Kimmo 0.8 released
For anyone who's interested, Persian PC-Kimmo version 0.8 has just been released. It's available here: http://home.byu.net/jmd56/download/persian-pckimmo-0.8.tar.gz

Thanks, Jon, for releasing this version. It looks a lot better than the previous one!

The biggest thing holding it back from being a 1.0 is a relatively small lexicon (~1350 words). The morphology engine achieves about two-thirds recognition on a corpus of about 3.5 million words. And of course, it's GPL'ed.

Hmmm, do you have a list of the words in the current lexicon? (I'm not familiar with PC-KIMMO specific commands, so I can't parse them on my own.) What should I do to help add more words?

Any helpful feedback would be appreciated.

I find the new tree-style recognition very helpful:

n+mi+]+im    NEG+DUR+come.PRES+1P

1:
           Top
            |
           Verb
      ______|______
VNEGPREFIX      VNStem
    n+        ____|____
   NEG+   VPREFIX    VStem
            mi+        |
            DUR+     V1Stem
                   ____|_____
                V2Stem   VPSUFFIX
                   |        +im
                V3Stem      +1P
                   |
                   V
                   ]
               come.PRES

Top:
[ cat: Top ]

1 parse found

n+mi+]+m    NEG+DUR+come.PRES+1S

1:
           Top
            |
           Verb
      ______|______
VNEGPREFIX      VNStem
    n+        ____|____
   NEG+   VPREFIX    VStem
            mi+        |
            DUR+     V1Stem
                   ____|_____
                V2Stem   VPSUFFIX
                   |        +m
                V3Stem      +1S
                   |
                   V
                   ]
               come.PRES

Top:
[ cat: Top ]

1 parse found

I was wondering if there's some way to retrieve the tree-structured data in a format that is easy to parse (the ASCII style is too difficult for a computer program to parse), something like an XML format maybe?

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
List Owner: [EMAIL PROTECTED]
[ Email: [EMAIL PROTECTED] ] [ WWW: http://www.beginthread.com/Ehsan ]
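To make the XML idea concrete, here is a minimal sketch of what a machine-readable form of the first parse (n+mi+]+im) could look like. It is not PC-Kimmo output and not the converter mentioned elsewhere in this thread. The nested-tuple encoding, the `node` element, and the `cat`, `form`, and `gloss` attribute names are all assumptions made for illustration, and the tree structure is one reading of the ASCII drawing quoted above.

```python
import xml.etree.ElementTree as ET


def to_xml(node):
    """Convert a nested-tuple parse tree into an ElementTree element.

    Hypothetical encoding:
      non-terminal: (category, [child, child, ...])
      leaf:         (category, form, gloss)
    """
    elem = ET.Element("node", {"cat": node[0]})
    if len(node) == 2:
        # Non-terminal: convert each child in left-to-right order.
        for child in node[1]:
            elem.append(to_xml(child))
    else:
        # Leaf: record the surface morpheme and its gloss.
        elem.set("form", node[1])
        elem.set("gloss", node[2])
    return elem


# The parse of n+mi+]+im, transcribed by hand from the ASCII tree.
parse = ("Top", [("Verb", [
    ("VNEGPREFIX", "n+", "NEG+"),
    ("VNStem", [
        ("VPREFIX", "mi+", "DUR+"),
        ("VStem", [("V1Stem", [
            ("V2Stem", [("V3Stem", [("V", "]", "come.PRES")])]),
            ("VPSUFFIX", "+im", "+1P"),
        ])]),
    ]),
])])

root = to_xml(parse)
print(ET.tostring(root, encoding="unicode"))
```

With the tree in this shape, a program can walk it with `root.iter("node")` and pick out the stem or any affix by category, instead of decoding the positions of bars and underscores in the drawing.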