On Wed, 3 Aug 2005, enas khalil wrote:
> I want to build my own Arabic training corpus data and use the NLTK to
> parse and test unknown data

Hi Enas,

By NLTK, I'll assume that you mean the Natural Language Toolkit at:

    http://nltk.sourceforge.net/

Have you gone through the introduction and tutorials from the NLTK web page?

    http://nltk.sourceforge.net/getting_started.html
    http://nltk.sourceforge.net/tutorial/index.html

> how can I build this file and make it available to work with using
> different NLTK classes

Your question is a bit specialized, so we may not be the best people to ask
about this.  The part you may want to think about is how to break a corpus
into a sequence of tokens, since tokens are primarily what the NLTK classes
work with.

This may or may not be immediately easy, depending on how much you can take
advantage of existing NLTK classes.  As the documentation in NLTK mentions:

"""If we turn to languages other than English, segmenting words can be even
more of a challenge.  For example, in Chinese orthography, characters
correspond to monosyllabic morphemes.  Many morphemes are words in their own
right, but many words contain more than one morpheme; most of them consist
of two morphemes.  However, there is no visual representation of word
boundaries in Chinese text."""

I don't know how Arabic works, so I'm not sure if the caveat above is
something we need to worry about.

There are a few built-in NLTK tokenizers that break a corpus into tokens,
including the WhitespaceTokenizer and RegexpTokenizer classes, both
introduced here:

    http://nltk.sourceforge.net/tutorial/tokenization/nochunks.html

For example:

######
>>> import nltk.token
>>> mytext = nltk.token.Token(TEXT="hello world this is a test")
>>> mytext
<hello world this is a test>
######

At the moment, this is a single token.  We can take a naive approach and
break it into words by using whitespace as our delimiter:

######
>>> import nltk.tokenizer
>>> nltk.tokenizer.WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(mytext)
>>> mytext
<[<hello>, <world>, <this>, <is>, <a>, <test>]>
######

Now our text is broken into a sequence of discrete tokens, and we can play
with the 'subtokens' of our text:

######
>>> mytext['WORDS']
[<hello>, <world>, <this>, <is>, <a>, <test>]
>>> len(mytext['WORDS'])
6
######

If Arabic follows conventions that fit closely with the assumptions of those
tokenizers, you should be in good shape.  Otherwise, you'll probably have to
do some work to build your own customized tokenizers.
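As a rough illustration of the kind of splitting a customized tokenizer
would have to do, here is a sketch in plain Python using the standard 're'
module.  This isn't part of the NLTK tokenizer API, just the underlying
idea; each word it produces could then be wrapped up in an nltk.token.Token
like the ones above:

######
>>> import re
>>> text = "hello, world: this is a test."
>>> # keep runs of word characters, dropping whitespace and punctuation
>>> re.findall(r'\w+', text)
['hello', 'world', 'this', 'is', 'a', 'test']
######

For Arabic text you would probably want to work with unicode strings and
pass the re.UNICODE flag, so that \w matches Arabic letters as well; the
exact pattern is something you'd have to tune to your own corpus.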