RE: Persian PC-Kimmo 0.8 released

2004-05-13 Thread Ehsan Akhgari
Thanks for your reply, Jon.

 Thanks for asking.   All the words are in
 tab-separated text files, as in noun.lex, verb.lex,
 etc.   They get converted to a kimmo-usable file such
 as fa-noun.lex, fa-verb.lex, etc. using the db2lex perl scripts in the
 scripts directory.  The verb and adjective files use a specific script
 written for them; all others use the plain script.  Also see the
 orthography.txt file for the romanization scheme.  It also has some
 other goodies.

 I would love to include any additions you might make to the lexicon in the
 next release.

I suppose I can use roman2unicode to convert the roman encoding into
readable plain text (I'm not fast at reading the roman notation).  That way,
I can import the data into Excel, sort it alphabetically, and start adding
new stuff...
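
The sort step described above could also be done without Excel. What follows is only a minimal sketch: it assumes the headword sits in the first tab-separated column, which the thread does not confirm, and the sample entries are made-up placeholders rather than lines from the real noun.lex. The function name sort_lexicon is hypothetical.

```python
def sort_lexicon(tsv_text, key_column=0):
    """Return the rows of a tab-separated lexicon, sorted by one column.

    Assumption: each non-empty line is one entry and key_column holds
    the headword. The real .lex column layout may differ.
    """
    rows = [line.split("\t") for line in tsv_text.splitlines() if line.strip()]
    return sorted(rows, key=lambda row: row[key_column])


# Placeholder entries for illustration only (not taken from the release).
sample = "ketAb\tbook\nAb\twater\nmard\tman\n"

for row in sort_lexicon(sample):
    print("\t".join(row))
```

Note that this sorts by code point, so romanized entries will not necessarily come out in Persian alphabetical order; sorting after conversion to Unicode, or with a custom key, would be needed for that.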

 As you can see, it needs a little more work on the morphophonemic
 rules, but it should work fine for stemming purposes.

Yes, it's pretty good at recognizing the stem of the word.

 Hans Nelson is the man to talk to.  He's working on a Kimmo output to
 XML program.  I don't know much about
 it, but here's his email:   [EMAIL PROTECTED]

Thanks for your hint.  I'll try to contact him.  In case you're interested,
I can send the final result of our discussion to you off-list.

-
Ehsan Akhgari

Farda Technology (http://www.farda-tech.com/)

List Owner: [EMAIL PROTECTED]

[ Email: [EMAIL PROTECTED] ]
[ WWW: http://www.beginthread.com/Ehsan ]



___
PersianComputing mailing list
[EMAIL PROTECTED]
http://lists.sharif.edu/mailman/listinfo/persiancomputing


RE: Persian PC-Kimmo 0.8 released

2004-05-11 Thread Ehsan Akhgari
 For anyone who's interested, Persian PC-Kimmo version
 0.8 has just been released.  It's available here:

 http://home.byu.net/jmd56/download/persian-pckimmo-0.8.tar.gz

Thanks, Jon, for releasing this version.  It looks a lot better than the
previous one!

 The biggest thing holding them back from being a 1.0 is a relatively
 small lexicon (~1350 words).  The morphology engine achieves about
 two-thirds recognition on a corpus of about 3.5 million words.
 And of course, it's GPL'ed.

Hmmm, do you have a list of the words in the current lexicon?  (I'm not
familiar with PC-KIMMO specific commands, so I can't parse them on my own.)
What can I do to help add more words?

 Any helpful feedback would be appreciated.

I find the new tree-style recognition very helpful:

n+mi+]+im NEG+DUR+come.PRES+1P

1:
              Top
               |
             Verb
      _________|_________
 VNEGPREFIX           VNStem
     n+         ________|________
    NEG+     VPREFIX          VStem
               mi+              |
              DUR+           V1Stem
                          ______|______
                       V2Stem     VPSUFFIX
                         |          +im
                       V3Stem       +1P
                         |
                         V
                         ]
                     come.PRES

Top:
[ cat:   Top ]

1 parse found

n+mi+]+m NEG+DUR+come.PRES+1S

1:
              Top
               |
             Verb
      _________|_________
 VNEGPREFIX           VNStem
     n+         ________|________
    NEG+     VPREFIX          VStem
               mi+              |
              DUR+           V1Stem
                          ______|______
                       V2Stem     VPSUFFIX
                         |          +m
                       V3Stem       +1S
                         |
                         V
                         ]
                     come.PRES

Top:
[ cat:   Top ]

1 parse found

I was wondering if there's some way to retrieve the tree-structured data in
a format which is easy to parse (the ASCII style is too difficult for a
computer program to parse), something like an XML format maybe?
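
As a rough illustration of what such an XML format might look like, here is a minimal sketch that works only from the flat analysis line (for example "n+mi+]+im NEG+DUR+come.PRES+1P"), not from the ASCII tree. It is not PC-Kimmo's own output and not any existing converter; the element and attribute names (word, morph, form, gloss) and the function name analysis_to_xml are assumptions made up for this example. It also captures only the morpheme segmentation, not the Verb/VNStem/VStem hierarchy.

```python
import xml.etree.ElementTree as ET


def analysis_to_xml(line):
    """Turn 'n+mi+]+im NEG+DUR+come.PRES+1P' into a small XML string.

    Assumption: the line holds exactly two whitespace-separated fields,
    the segmented form and its gloss, each split into parts by '+'.
    """
    form, gloss = line.split()
    morphs = form.split("+")
    glosses = gloss.split("+")
    if len(morphs) != len(glosses):
        raise ValueError("form and gloss have different numbers of parts")

    # Hypothetical schema: one <word> holding one <morph> per segment.
    word = ET.Element("word", form=form, gloss=gloss)
    for morph, tag in zip(morphs, glosses):
        ET.SubElement(word, "morph", form=morph, gloss=tag)
    return ET.tostring(word, encoding="unicode")


print(analysis_to_xml("n+mi+]+im NEG+DUR+come.PRES+1P"))
```

Any XML parser can then read the result back, which avoids having to interpret the column positions of the ASCII tree.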

-
Ehsan Akhgari

Farda Technology (http://www.farda-tech.com/)



