What is "large"? On Tue, Jul 31, 2012 at 3:22 AM, Dimitrios Jim Piliouras <[email protected]> wrote: > Ok, generally if you 're trying to count the frequency of words first you > need to tokenize your sentences which means that you need to separate the > sentences. you can use openNLP to do both. If we assume that you 've got > your tokenized sentences then it's pretty straight forward to count the > frequencies...for example in Clojure it is simply a matter of : > > (frequencies (clojure.string/split "the cat jumped out the window" #"\s")) > {"the" 2, "cat" 1, "jumped" 1, "out" 1, "window" 1} > > Of course this is the simplistic case where splitting at space is > acceptable. If you 've got a big corpus in english and you need good > results then you must build at least 1 maxent model (the sentence-detector > model should work out of the box for english). In other words you need at > least a tokenizer model unless your corpus contains "news" material in > which case the ready - trained maxent model should work for you...if you've > got a medical or a very specific domain corpus you might need to train your > own model but you will need some training data...an example of how you > would do it from Clojure using clojure-opennlp which wraps the official > openNLP, follows: > > (require [opennlp.nlp :as nlp]) ;;we're using clojure-opennlp > (def sentence-detector (nlp/make-sentence-detector > "models/V1.5/en-sent.bin")) ;;load the sentence-detector model > (def tokenizer (nlp/make-tokenizer "models/V1.5/en-token.bin")) ;;load the > tokenizer model > > (defn process [text] ;;this functions will return a nested collection - > each sentence found in text but tokenized > (->> (sentence-detector text) > (map tokenizer))) > > Once you have the tokens of each sentence you can simply flatten the > nested collection and call 'frequencies' on it to get the same effect as my > first minimal example, which will return a map with all the words and their > frequencies... > > Hope that helps...I apologize for not posting some Java code instead but > there are plenty of examples for sentence detection and tokenization...once > you have the tokenized sentences it really doesn't matter what language you > use... > > Hope that helps... > > Jim > > > On Tue, Jul 31, 2012 at 12:20 PM, saeed farzi <[email protected]> wrote: > >> Hi all, >> I wanna calculate relative frequency of words in a large corpus, plz help >> me how to use open nlp , >> tnx in advance >> >> -- >> S.Farzi, Ph.D. Student >> Natural Language Processing Lab, >> School of Electrical and Computer Eng., >> Tehran University >> Tel: +9821-6111-9719 >>
--
Lance Norskog
[email protected]
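
A minimal end-to-end sketch tying the thread together, assuming clojure-opennlp
is on the classpath and the model files sit at the paths used above; the
relative-frequencies helper is illustrative and not part of the original
thread. It tokenizes the text as described, flattens the per-sentence token
vectors, counts them with frequencies, and divides each count by the total
number of tokens to get the relative frequencies the original question asked
about:

(require '[opennlp.nlp :as nlp])

(def sentence-detector (nlp/make-sentence-detector "models/V1.5/en-sent.bin"))
(def tokenizer (nlp/make-tokenizer "models/V1.5/en-token.bin"))

(defn process [text]
  ;; one vector of tokens per detected sentence
  (->> (sentence-detector text)
       (map tokenizer)))

(defn relative-frequencies [text]
  ;; map from token to (count of that token) / (total number of tokens)
  (let [tokens (flatten (process text))
        total  (double (count tokens))
        counts (frequencies tokens)]
    (into {} (for [[w c] counts] [w (/ c total)]))))

;; usage, e.g.:
;; (relative-frequencies "The cat jumped out the window. The dog did not.")
;; => a map from each token (punctuation included) to its relative frequency

Note that the OpenNLP tokenizer keeps punctuation as separate tokens, so you
may want to filter those out before counting, depending on how you define
relative frequency.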
