Latent Semantic Analysis for Document Categorization

2015-03-26 Thread Hersheeta Chandankar
Hi, I'm working on a document categorization project in which I have some crawled text documents on different topics that I want to sort into pre-decided categories such as travel, sports, education, etc. My current approach is to build a Naive Bayes classification model in Mahout

Re: Latent Semantic Analysis for Document Categorization

2015-03-26 Thread Hersheeta Chandankar
Thank you so much Chirag and David for your suggestions. I'll surely try them. On Thu, Mar 26, 2015 at 6:31 PM, 3316 Chirag Nagpal chiragnagpal_12...@aitpune.edu.in wrote: A better approach I can think of for the aforementioned task is to use Latent Dirichlet Allocation. You can force LDA to

Re: Latent Semantic Analysis for Document Categorization

2015-03-26 Thread 3316 Chirag Nagpal
A better approach I can think of for the aforementioned task is to use Latent Dirichlet Allocation. You can force LDA to learn topics with certain specific words by assigning higher probability values to those words in the initial Dirichlet distribution. That way you will be able to discover
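
A minimal sketch of the seeding idea described above, assuming the LDA implementation in use accepts a per-topic word prior (not every implementation does, so check before relying on it). The function names, the vocabulary, the seed words, and the `base`/`boost` values are all hypothetical and chosen only for illustration. Each topic gets one row of Dirichlet parameters, and the seed words for that topic receive a larger parameter than the rest of the vocabulary.

```python
def seeded_topic_word_prior(vocab, topics, seed_words, base=0.01, boost=1.0):
    """Build one row of Dirichlet parameters per topic.

    Every word starts at `base`; words listed as seeds for a topic get
    `base + boost` in that topic's row. `base` and `boost` are
    illustrative values, not recommended settings.
    """
    index = {word: i for i, word in enumerate(vocab)}
    prior = []
    for topic in topics:
        row = [base] * len(vocab)
        for word in seed_words.get(topic, []):
            if word in index:  # ignore seed words missing from the vocabulary
                row[index[word]] = base + boost
        prior.append(row)
    return prior


def prior_mean(row):
    """Expected word probabilities under a Dirichlet with parameters `row`."""
    total = sum(row)
    return [value / total for value in row]


# Hypothetical toy vocabulary and seed lists for three categories.
vocab = ["flight", "hotel", "match", "score", "exam", "school"]
topics = ["travel", "sports", "education"]
seed_words = {
    "travel": ["flight", "hotel"],
    "sports": ["match", "score"],
    "education": ["exam", "school"],
}

prior = seeded_topic_word_prior(vocab, topics, seed_words)
for topic, row in zip(topics, prior):
    mean = prior_mean(row)
    top = sorted(zip(vocab, mean), key=lambda pair: pair[1], reverse=True)[:2]
    print(topic, [word for word, _ in top])
```

The resulting matrix has one row per topic and one column per vocabulary word. It only biases the starting point: with enough data the learned topics can still drift away from the seed words.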

Re: Latent Semantic Analysis for Document Categorization

2015-03-26 Thread David Starina
Hi, as Chirag said, try LDA. You can also check an implementation of pLSA. It is not part of Mahout, but you can find it here: https://github.com/akopich/dplsa --David On Thu, Mar 26, 2015 at 2:01 PM, 3316 Chirag Nagpal chiragnagpal_12...@aitpune.edu.in wrote: A better approach I can think

Re: demo spark-itemsimilarity; empty output

2015-03-26 Thread Pat Ferrel
Hmm, I just ran into this, thanks for the research. This may cause problems on cluster machines unless it is Mac-specific, so putting it into /usr/lib/java may need to happen on all nodes. Not sure that is the best solution. Let me know if you run into this on a *nix type cluster. On Mar 19, 2015, at

Re: mahout 1.0 on EMR with spark item-similarity

2015-03-26 Thread Pat Ferrel
Finally getting to YARN. Paul, were you trying to run spark-itemsimilarity with the spark-submit script? That shouldn't work; the job is a standalone app and does not require spark-submit, nor is it likely to work with it. Were you able to run on YARN? How? On Jan 29, 2015, at 9:15 AM, Pat Ferrel

Re: Fw: Mahout dataset Vectorization

2015-03-26 Thread Ted Dunning
Raghuveer, I am more confused than before. You say that the destination is on the second line. That seems to imply that your data has more than one line per data point. Is this so? That seems to contradict your previous comments. On Wed, Mar 25, 2015 at 10:20 PM, Raghuveer

Re: Latent Semantic Analysis for Document Categorization

2015-03-26 Thread Ted Dunning
Also, if you can include linking information between documents, you should be able to substantially improve accuracy. Same goes for behavioral data like browsing history. On Thu, Mar 26, 2015 at 6:10 AM, Hersheeta Chandankar hersheetachandan...@gmail.com wrote: Thank you so much Chirag and