What is "large"? On Tue, Jul 31, 2012 at 3:22 AM, Dimitrios Jim Piliouras <[email protected]> wrote: > Ok, generally if you 're trying to count the frequency of words first you > need to tokenize your sentences which means that you need to separate the > sentences. you can use openNLP to do both. If we assume that you 've got > your tokenized sentences then it's pretty straight forward to count the > frequencies...for example in Clojure it is simply a matter of : > > (frequencies (clojure.string/split "the cat jumped out the window" #"\s")) > {"the" 2, "cat" 1, "jumped" 1, "out" 1, "window" 1} > > Of course this is the simplistic case where splitting at space is > acceptable. If you 've got a big corpus in english and you need good > results then you must build at least 1 maxent model (the sentence-detector > model should work out of the box for english). In other words you need at > least a tokenizer model unless your corpus contains "news" material in > which case the ready - trained maxent model should work for you...if you've > got a medical or a very specific domain corpus you might need to train your > own model but you will need some training data...an example of how you > would do it from Clojure using clojure-opennlp which wraps the official > openNLP, follows: > > (require [opennlp.nlp :as nlp]) ;;we're using clojure-opennlp > (def sentence-detector (nlp/make-sentence-detector > "models/V1.5/en-sent.bin")) ;;load the sentence-detector model > (def tokenizer (nlp/make-tokenizer "models/V1.5/en-token.bin")) ;;load the > tokenizer model > > (defn process [text] ;;this functions will return a nested collection - > each sentence found in text but tokenized > (->> (sentence-detector text) > (map tokenizer))) > > Once you have the tokens of each sentence you can simply flatten the > nested collection and call 'frequencies' on it to get the same effect as my > first minimal example, which will return a map with all the words and their > frequencies... > > Hope that helps...I apologize for not posting some Java code instead but > there are plenty of examples for sentence detection and tokenization...once > you have the tokenized sentences it really doesn't matter what language you > use... > > Hope that helps... > > Jim > > > On Tue, Jul 31, 2012 at 12:20 PM, saeed farzi <[email protected]> wrote: > >> Hi all, >> I wanna calculate relative frequency of words in a large corpus, plz help >> me how to use open nlp , >> tnx in advance >> >> -- >> S.Farzi, Ph.D. Student >> Natural Language Processing Lab, >> School of Electrical and Computer Eng., >> Tehran University >> Tel: +9821-6111-9719 >>
--
Lance Norskog
[email protected]
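
A minimal end-to-end sketch tying the thread together, assuming clojure-opennlp
is on the classpath and the model files sit at the paths used above; the
relative-frequencies helper is illustrative and not part of the original
thread. It tokenizes the text as described, flattens the per-sentence token
vectors, counts them with frequencies, and divides each count by the total
number of tokens to get the relative frequencies the original question asked
about:

(require '[opennlp.nlp :as nlp])

(def sentence-detector (nlp/make-sentence-detector "models/V1.5/en-sent.bin"))
(def tokenizer (nlp/make-tokenizer "models/V1.5/en-token.bin"))

(defn process [text]
  ;; one vector of tokens per detected sentence
  (->> (sentence-detector text)
       (map tokenizer)))

(defn relative-frequencies [text]
  ;; map from token to (count of that token) / (total number of tokens)
  (let [tokens (flatten (process text))
        total  (double (count tokens))
        counts (frequencies tokens)]
    (into {} (for [[w c] counts] [w (/ c total)]))))

;; usage, e.g.:
;; (relative-frequencies "The cat jumped out the window. The dog did not.")
;; => a map from each token (punctuation included) to its relative frequency

Note that the OpenNLP tokenizer keeps punctuation as separate tokens, so you
may want to filter those out before counting, depending on how you define
relative frequency.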
