OK, generally if you're trying to count the frequency of words, first you
need to tokenize your sentences, which means you also need to separate the
sentences themselves. You can use OpenNLP to do both. If we assume that
you've got your tokenized sentences, then it's pretty straightforward to
count the frequencies... for example, in Clojure it is simply a matter of:

(frequencies (clojure.string/split "the cat jumped out the window" #"\s"))
;;=> {"the" 2, "cat" 1, "jumped" 1, "out" 1, "window" 1}

Of course this is the simplistic case where splitting on whitespace is
acceptable. If you've got a big corpus in English and you need good
results, then you must build at least one maxent model (the
sentence-detector model should work out of the box for English). In other
words, you need at least a tokenizer model, unless your corpus contains
"news" material, in which case the ready-trained maxent model should work
for you... if you've got a medical or otherwise very domain-specific corpus
you might need to train your own model, but for that you will need some
training data (see the training sketch after the code below). An example of
how you would do it from Clojure using clojure-opennlp, which wraps the
official OpenNLP, follows:

(require '[opennlp.nlp :as nlp]) ;;we're using clojure-opennlp

(def sentence-detector
  (nlp/make-sentence-detector "models/V1.5/en-sent.bin")) ;;load the sentence-detector model

(def tokenizer
  (nlp/make-tokenizer "models/V1.5/en-token.bin")) ;;load the tokenizer model

(defn process [text] ;;returns a nested collection: each sentence found in text, tokenized
  (->> (sentence-detector text)
       (map tokenizer)))
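
By the way, if you do end up needing your own tokenizer model for a
domain-specific corpus as mentioned above, clojure-opennlp also wraps
OpenNLP's trainers. A rough sketch, assuming the opennlp.tools.train
namespace from clojure-opennlp; "training/my-domain-token.train" is a
hypothetical file, one sentence per line with <SPLIT> markers wherever a
token boundary isn't plain whitespace:

(require '[opennlp.tools.train :as train])

(def my-token-model ;;hypothetical training file path
  (train/train-tokenizer "training/my-domain-token.train"))

(def my-tokenizer ;;make-tokenizer also accepts the trained model object directly
  (nlp/make-tokenizer my-token-model))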

Once you have the tokens of each sentence, you can simply flatten the
nested collection and call 'frequencies' on it to get the same effect as my
first minimal example: a map with all the words and their counts. Dividing
each count by the total number of tokens then gives you the relative
frequencies from your original question.
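
A minimal sketch of those two steps, building on the 'process' fn above:

(defn word-frequencies [text] ;;flatten the sentences into one token seq and count
  (frequencies (flatten (process text))))

(defn relative-frequencies [text] ;;each count divided by the total token count
  (let [freqs (word-frequencies text)
        total (reduce + (vals freqs))]
    (into {} (map (fn [[w n]] [w (double (/ n total))]) freqs))))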

I apologize for not posting some Java code instead, but there are plenty of
examples out there for sentence detection and tokenization... once you have
the tokenized sentences it really doesn't matter what language you use...

Hope that helps...

Jim


On Tue, Jul 31, 2012 at 12:20 PM, saeed farzi <[email protected]> wrote:

> Hi all,
> I want to calculate the relative frequency of words in a large corpus.
> Please help me with how to use OpenNLP.
> Thanks in advance
>
> --
>            S.Farzi, Ph.D. Student
>     Natural Language Processing Lab,
>   School of Electrical and Computer Eng.,
>                Tehran University
>              Tel: +9821-6111-9719
>
