That's correct. Also note that SnowballAnalyzer implicitly converts all text to lower case, so you can skip that step in your computation.
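To illustrate that point without pulling in Lucene, here is a minimal plain-Java sketch; the `analyze` helper below is a hypothetical stand-in for an Analyzer's tokenize-and-lowercase pass (a real SnowballAnalyzer would also stem each token):

```java
import java.util.Arrays;
import java.util.List;

public class LowercaseDemo {
    // Hypothetical stand-in for an Analyzer pass: split on non-letters and lowercase.
    static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z]+"));
    }

    public static void main(String[] args) {
        // Because the analyzer already lowercases, "Day" and "day" produce the
        // same token, so a separate lowercasing step over keywords is redundant.
        System.out.println(analyze("Day").equals(analyze("day"))); // true
        System.out.println(analyze("The Day"));                    // [the, day]
    }
}
```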
All of your keywords must first be run through the SnowballAnalyzer, and the same goes for your documents, before you make the call to Multiset.retainAll(keywords). I am assuming that all your documents are English text only; Lucene has language-specific analyzers (some of which implicitly invoke the SnowballFilter) if you have to deal with other languages. While on the topic of Lucene, ensure that you are using the Lucene 4.2.x libraries (that's the Lucene version in Mahout trunk).

________________________________
From: Stuti Awasthi <[email protected]>
To: "'[email protected]'" <[email protected]>
Sent: Thursday, May 30, 2013 8:34 AM
Subject: RE: Feature vector generation from Bag-of-Words

Hey Suneel,

I got the stemming working with SnowballAnalyzer. One more query: to use the Multiset.retainAll(keywords) functionality, all of my keywords must also be generated with the Analyzer, else they won't be retained. Is my understanding correct?

Thanks
Stuti Awasthi

-----Original Message-----
From: Stuti Awasthi
Sent: Thursday, May 30, 2013 3:59 PM
To: [email protected]
Subject: RE: Feature vector generation from Bag-of-Words

Hi Suneel,

Thanks. For point 2, I tried to find out how to achieve this using Lucene but was not able to gather much information. It would be helpful if you could guide me to relevant links or samples through which I can achieve point 2.

Thanks
Stuti Awasthi

-----Original Message-----
From: Suneel Marthi [mailto:[email protected]]
Sent: Wednesday, May 22, 2013 6:13 PM
To: [email protected]
Subject: Re: Feature vector generation from Bag-of-Words

See inline.

________________________________
From: Stuti Awasthi <[email protected]>
To: "'[email protected]'" <[email protected]>
Sent: Wednesday, May 22, 2013 7:02 AM
Subject: RE: Feature vector generation from Bag-of-Words

Hi Suneel,

I implemented your suggested approach. It was simple to implement, and you made the steps very clear. Thank you :).
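Why both sides must go through the same analyzer can be shown with a small plain-Java sketch. The crude `stem` rule below (strip a trailing "ed" or "s") is only a hypothetical stand-in for Lucene's SnowballFilter, and `retainAll` here is `java.util.List.retainAll`, which behaves like Guava's Multiset version for this purpose: it keeps only exact string matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnalyzeBothSides {
    // Hypothetical stand-in for SnowballAnalyzer: lowercase, split, crude suffix stripping.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (t.isEmpty()) continue;
            if (t.endsWith("ed")) t = t.substring(0, t.length() - 2);
            else if (t.endsWith("s")) t = t.substring(0, t.length() - 1);
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> docTokens = analyze("Strongly Recommended books");
        // docTokens is [strongly, recommend, book]

        // Raw, un-analyzed keywords match nothing, because the document
        // terms have already been lowercased and stemmed...
        List<String> kept = new ArrayList<>(docTokens);
        kept.retainAll(new HashSet<>(Arrays.asList("Recommended", "books")));
        System.out.println(kept); // []

        // ...but keywords run through the same analyzer are retained.
        Set<String> analyzedKeywords = new HashSet<>(analyze("Recommended books"));
        List<String> kept2 = new ArrayList<>(docTokens);
        kept2.retainAll(analyzedKeywords);
        System.out.println(kept2); // [recommend, book]
    }
}
```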
I have a few queries about creating features using a Multiset:

1. Can we make keyword matching case-insensitive using the multiset, i.e., my keyword may be "Day" while the document contains "day"?

>> Yes, you can if that's a requirement for you. Convert all keywords to
>> lowercase before storing them in the multiset.

2. Can the multiset contain words that match a keyword pattern rather than the exact keyword, i.e., if the keyword is "Recommend" and the document contains "Recommended", it should still be counted?

>> What you are describing is called 'stemming'. Lucene should be able to help
>> you here.

Any pointers?

Thanks
Stuti Awasthi

-----Original Message-----
From: Stuti Awasthi
Sent: Wednesday, May 22, 2013 12:01 PM
To: [email protected]
Subject: RE: Feature vector generation from Bag-of-Words

Thanks Suneel,

I will go through your approach and will also learn more about the various APIs you have suggested. I am new to Mahout, so I will need to dig more. :)

In the meantime, I was thinking of an approach like this:
1. Create a sequence file of the bag of words and the input data in separate documents.
2. For each individual document, loop through the 100 keywords and count the number of times each keyword occurs in the document.
3. Create a RandomAccessSparseVector to store each keyword and its frequency for each document.

This may not be a good approach, perhaps because of step 2, but it can also be implemented using MapReduce. Please share your thoughts on this.

Thanks
Stuti

-----Original Message-----
From: Suneel Marthi [mailto:[email protected]]
Sent: Tuesday, May 21, 2013 10:21 PM
To: [email protected]
Subject: Re: Feature vector generation from Bag-of-Words

It should be easy to convert the below pseudocode to MapReduce to scale to a large collection of documents.
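For reference, steps 1-3 of the approach above can be sketched in plain Java, with a sorted map standing in for Mahout's RandomAccessSparseVector (only the non-zero keyword frequencies are stored; the class and method names here are illustrative only):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeywordCounting {
    // Step 2: for one document, count occurrences of each keyword.
    // Step 3: the map plays the role of a RandomAccessSparseVector,
    // keeping only non-zero (keyword -> frequency) entries.
    // Note: looping over every keyword per document is O(K * N) per document;
    // a single pass building a multiset of terms is cheaper, which is the
    // concern raised about step 2.
    static Map<String, Integer> countKeywords(List<String> docTerms, List<String> keywords) {
        Map<String, Integer> vector = new TreeMap<>();
        for (String keyword : keywords) {
            int count = 0;
            for (String term : docTerms) {
                if (term.equals(keyword)) count++;
            }
            if (count > 0) vector.put(keyword, count);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("day", "after", "day", "we", "recommend");
        List<String> keywords = Arrays.asList("day", "recommend", "vector");
        System.out.println(countKeywords(doc, keywords)); // {day=2, recommend=1}
    }
}
```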
________________________________
From: Suneel Marthi <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Tuesday, May 21, 2013 12:20 PM
Subject: Re: Feature vector generation from Bag-of-Words

Stuti,

Here's how I would do it.

1. Create a collection of the 100 keywords that are of interest:

    Collection<String> keywords = new ArrayList<String>();
    keywords.addAll(<your 100 keywords>);

2. For each text document, create a Multiset (which is a bag of words) of its terms, retain only the terms of interest from (1), and use Mahout's StaticWordValueEncoder to encode them into a vector:

    // Iterate through all the documents
    for document in documents {
        // create a bag of words for each document
        Multiset<String> multiset = HashMultiset.create();

        // create a RandomAccessSparseVector with 100 features for the 100 keywords
        Vector v = new RandomAccessSparseVector(100);

        for term in document.terms {
            multiset.add(term);
        }

        // retain only those keywords that are of interest (from step 1)
        multiset.retainAll(keywords);

        // You now have a bag of words containing only the keywords with their
        // term frequencies. Use one of the feature encoders; refer to Section 14.3
        // of Mahout in Action for a more detailed description of this process.
        FeatureVectorEncoder encoder = new StaticWordValueEncoder("body");
        for (Multiset.Entry<String> entry : multiset.entrySet()) {
            encoder.addToVector(entry.getElement(), entry.getCount(), v);
        }
    }

________________________________
From: Stuti Awasthi <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Tuesday, May 21, 2013 7:17 AM
Subject: Feature vector generation from Bag-of-Words

Hi all,

I have a query regarding feature vector generation for text documents. I have read Mahout in Action and understood how to convert text documents into feature vectors weighted by TF or TF-IDF schemes. My use case is a little different from that.
I have a few keywords, say 100, and I want to create the feature vectors of the text documents with only these 100 keywords. So I would like to calculate the frequency of each keyword in each document and generate the feature vector of the keywords with the frequencies as weights. Is there an already available way to do this, or will I need to write custom code?

Thanks
Stuti Awasthi
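As discussed upthread, there is no single ready-made tool for exactly this, but the end-to-end idea (tokenize, count, keep only keywords, build a fixed-width frequency vector) can be sketched in plain Java. A double[] stands in for Mahout's RandomAccessSparseVector, each keyword's list position fixes its feature index, and all names here are illustrative; no stemming is applied, which is precisely why "vectors" below fails to match the keyword "vector":

```java
import java.util.Arrays;
import java.util.List;

public class KeywordFeatureVector {
    // Build one feature vector per document: slot i holds the frequency of keywords.get(i).
    static double[] vectorize(String document, List<String> keywords) {
        double[] v = new double[keywords.size()]; // one slot per keyword
        for (String term : document.toLowerCase().split("[^a-z]+")) {
            int idx = keywords.indexOf(term); // -1 for non-keyword terms, which are skipped
            if (idx >= 0) v[idx] += 1.0;
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("day", "recommend", "vector");
        double[] v = vectorize("Day after day we recommend vectors", keywords);
        // "vectors" does not match "vector" without stemming, hence the 0.0
        System.out.println(Arrays.toString(v)); // [2.0, 1.0, 0.0]
    }
}
```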
