Thanks Suneel, I will go through your approach and also learn more about the various APIs you have suggested. I am new to Mahout, so I will need to dig more. :)
Meanwhile, I was thinking of an approach like this:

1. Create the sequence files of the Bag of Words and the input data in different documents.
2. For each individual document, I'll loop through the 100 keywords and count the number of times each keyword occurs in the document.
3. Create a RandomAccessSparseVector to store each keyword and its frequency for each document.

This may not be a good approach because of Step 2, but it can also be implemented using MR. Please provide your thoughts on this.

Thanks
Stuti

-----Original Message-----
From: Suneel Marthi [mailto:[email protected]]
Sent: Tuesday, May 21, 2013 10:21 PM
To: [email protected]
Subject: Re: Feature vector generation from Bag-of-Words

It should be easy to convert the below pseudocode to MapReduce to scale for a large collection of documents.

________________________________
From: Suneel Marthi <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Tuesday, May 21, 2013 12:20 PM
Subject: Re: Feature vector generation from Bag-of-Words

Stuti,

Here's how I would do it.

1. Create a collection of the 100 keywords that are of interest.

    Collection<String> keywords = new ArrayList<String>();
    keywords.addAll(<your 100 keywords>);

2. For each word in each of the text documents create a Multiset (which is a bag of words), retain only those terms of interest from (1) and use Mahout's StaticWordValu

    // Iterate through all the documents
    for document in documents {
        // create a bag of words for each document
        Multiset<String> multiset = HashMultiset.create();
        // create a RandomAccessSparseVector
        Vector v = new RandomAccessSparseVector(100); // 100 features for the 100 keywords
        for term in document.terms {
            multiset.add(term);
        }
        // retain only those keywords that are of interest (from step 1)
        multiset.retainAll(keywords);
        // You now have a bag of words containing only the keywords with their term frequencies
        // Use one of the Feature Encoders, refer to Section 14.3 of Mahout in Action
        // for a more detailed description of this process
        FeatureVectorEncoder encoder = new StaticWordValueEncoder("body");
        for (Multiset.Entry<String> entry : multiset.entrySet()) {
            encoder.addToVector(entry.getElement(), entry.getCount(), v);
        }
    }

________________________________
From: Stuti Awasthi <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Tuesday, May 21, 2013 7:17 AM
Subject: Feature vector generation from Bag-of-Words

Hi all,

I have a query regarding feature vector generation for text documents. I have read Mahout in Action and understood how to convert text documents into feature vectors weighted by the TF or TF-IDF schemes. My use case is slightly different. I have a few keywords, say 100, and I want to create the feature vectors of the text documents using only these 100 keywords. So I would like to calculate the frequency of each keyword in each document and generate a feature vector of the keywords with the frequencies as weights. Is there an existing way to do this, or will I need to write custom code?

Thanks
Stuti Awasthi
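
The keyword-counting idea discussed in this thread can also be sketched in plain Java, without Mahout or Guava on the classpath. This is only an illustration under stated assumptions: the class name KeywordVectorSketch, the methods indexKeywords and toVector, the simple lowercase tokenizer, and the fixed keyword-to-position mapping are all hypothetical and not part of Mahout. In a Mahout job the double[] would presumably be replaced by a RandomAccessSparseVector, and the body of toVector would sit inside a mapper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeywordVectorSketch {

    // Assumption: each distinct keyword gets a fixed position, so every
    // document produces a vector with the same layout.
    public static Map<String, Integer> indexKeywords(List<String> keywords) {
        Map<String, Integer> index = new HashMap<>();
        for (String keyword : keywords) {
            // Duplicates are ignored; a new keyword takes the next free position.
            index.putIfAbsent(keyword.toLowerCase(), index.size());
        }
        return index;
    }

    // Walks the document's terms once and increments the slot of every term
    // that is a keyword. All other terms are skipped.
    public static double[] toVector(String document, Map<String, Integer> keywordIndex) {
        double[] vector = new double[keywordIndex.size()];
        // Hypothetical tokenizer: lowercase, split on anything that is not a letter or digit.
        for (String term : document.toLowerCase().split("[^a-z0-9]+")) {
            Integer position = keywordIndex.get(term);
            if (position != null) {
                vector[position] += 1.0;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> index =
                indexKeywords(Arrays.asList("mahout", "vector", "hadoop"));
        double[] v = toVector("Mahout builds a vector; the vector is sparse.", index);
        System.out.println(Arrays.toString(v));
    }
}
```

Looking each term up in a hash map needs one pass over the document, rather than scanning the document once per keyword as in Step 2 of the approach above. Because each document is processed independently, the same per-document logic should map naturally onto a map-only MapReduce job.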
