running mahout cmdump

2013-05-30 Thread Chandra Mohan, Ananda Vel Murugan
Hi, I am doing text classification using Complementary Naive Bayes in Mahout 0.7 and Hadoop 1.0.2. I want to export the confusion matrix as HTML. I am running the following command: mahout cmdump -i PIPS0-testing/part-m-0 -o PIPS0-testing -ow -html I am getting the following exception

RE: running mahout cmdump

2013-05-30 Thread Chandra Mohan, Ananda Vel Murugan
Hi, I tried vectordump and it created the CSV file. Is there an easy way to convert this file, or the vectors, into a confusion matrix? Please suggest. Regards, Anand.C

RE: Feature vector generation from Bag-of-Words

2013-05-30 Thread Stuti Awasthi
Hi Suneel, Thanks. For point 2, I tried to look into how to achieve this using Lucene but was not able to gather much information. It would be helpful if you could point me to relevant links or samples through which I can achieve point 2. Thanks Stuti Awasthi

Re: Fwd: Re: convert input for SVD

2013-05-30 Thread Rajesh Nikam
Hi Suneel/Dmitriy, I got mahout-examples-0.8-SNAPSHOT-job.jar compiled from trunk. Now I have the -us param you mentioned working for the input set. Steps followed are: mahout arff.vector --input /mnt/cluster/t/PE_EXE/input-set.arff --output /user/hadoop/t/input-set-vector/ --dictOut

Re: Fwd: Re: convert input for SVD

2013-05-30 Thread Suneel Marthi
You should be using 'pca' with ssvd: mahout ssvd -i /user/hadoop/t/input-set-vector/ -o /user/hadoop/t/input-set-svd/ -k 50 --reduceTasks 2 -U true -V false -us true -ow -pca true You should then be using USigma (U*Sigma); this is generated by the 'us' option.

Re: Fwd: Re: convert input for SVD

2013-05-30 Thread Rajesh Nikam
Hello Suneel, Thanks a lot for the quick reply about the missing param. mahout arff.vector --input /mnt/cluster/t/input-set.arff --output /user/hadoop/t/input-set-vector/ --dictOut /mnt/cluster/t/input-set-dict mahout ssvd --input /user/hadoop/t/input-set-vector/ --output /user/hadoop/t/input-set-svd/ -k

RE: running mahout cmdump Solved

2013-05-30 Thread Chandra Mohan, Ananda Vel Murugan
Hi, I got it working. I wrote a utility class which takes the classification output (part-m-0) and creates the confusion matrix. part-m-0 was a sequence file with vectors, and cmdump was trying to convert the Vectors into a Matrix, hence the error I was getting. I don't know whether it is a
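
Anand's utility class is not included in the thread. The following is a hypothetical sketch of one way such a class could look, assuming the part-m-0 file is a SequenceFile of (Text, VectorWritable) pairs where the key carries the true label and the vector holds per-label scores, and assuming a caller-supplied labels array that maps vector indices back to label names:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class ConfusionMatrixBuilder {

      /**
       * Tallies a confusion matrix from a sequence file of (Text trueLabel, VectorWritable scores)
       * pairs: rows = true label, columns = predicted label. 'labels' is a hypothetical helper
       * input mapping each vector index to its label name.
       */
      public static int[][] build(Path partFile, String[] labels, Configuration conf) throws Exception {
        Map<String, Integer> labelToIndex = new HashMap<String, Integer>();
        for (int i = 0; i < labels.length; i++) {
          labelToIndex.put(labels[i], i);
        }
        int[][] matrix = new int[labels.length][labels.length];
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, partFile, conf);
        try {
          Text key = new Text();
          VectorWritable value = new VectorWritable();
          while (reader.next(key, value)) {
            // Assumption: the key holds the true label (possibly with a leading '/').
            String trueLabel = key.toString().replaceFirst("^/", "").split("/")[0];
            Vector scores = value.get();
            int predicted = scores.maxValueIndex();  // index of the highest score = predicted label
            matrix[labelToIndex.get(trueLabel)][predicted]++;
          }
        } finally {
          reader.close();
        }
        return matrix;
      }
    }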

bottom up clustering

2013-05-30 Thread Rajesh Nikam
Hi, I want to do bottom-up clustering (that is, hierarchical clustering) rather than the top-down approach mentioned in https://cwiki.apache.org/MAHOUT/top-down-clustering.html (kmeans - clusterdump - clusterpp, and then kmeans on each cluster). How do I take the centroids from a first phase of canopy and use them for

Re: bottom up clustering

2013-05-30 Thread Suneel Marthi
The input to canopy is your vectors from seq2sparse and not cluster centroids (as you had it), hence the error message you are seeing. The output of canopy can be fed into kmeans as the input centroids.
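
A minimal command-line sketch of the two-step flow Suneel describes, with hypothetical paths and threshold values; the canopy output directory name (clusters-0-final here) may differ between Mahout versions:

    # canopy takes the seq2sparse vectors, not centroids
    mahout canopy -i /path/to/tfidf-vectors -o /path/to/canopy-out \
      -dm org.apache.mahout.common.distance.CosineDistanceMeasure -t1 0.5 -t2 0.3 -ow
    # feed canopy's centroids to kmeans via -c (no -k needed in that case)
    mahout kmeans -i /path/to/tfidf-vectors -c /path/to/canopy-out/clusters-0-final \
      -o /path/to/kmeans-out -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
      -x 10 -ow -cl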

Re: bottom up clustering

2013-05-30 Thread Rajesh Nikam
Hello Suneel, I got it. The next step after canopy is to feed these centroids to kmeans and cluster. However, what I want is to take the centroids from those clusters and cluster them again, so as to find related clusters. Thanks Rajesh

Re: bottom up clustering

2013-05-30 Thread Ted Dunning
Rajesh, the streaming k-means implementation is very much like what you are asking for. The first pass clusters the data into many, many clusters, and a second pass then clusters those clusters.

Re: bottom up clustering

2013-05-30 Thread Suneel Marthi
To add to Ted's reply, streaming k-means was recently added to Mahout (thanks to Dan and Ted). Here's the reference paper on streaming k-means: http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf You have to be working off of trunk to use this; it's not available as part
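
The thread does not show an invocation. As a rough, unverified sketch of what running it from a trunk build might look like (paths and cluster counts are made up; check the option names against mahout streamingkmeans --help on your build):

    mahout streamingkmeans -i /path/to/tfidf-vectors -o /path/to/streaming-kmeans-out \
      -k 100 -km 1000 -ow
    # -k  : final number of clusters
    # -km : estimated number of intermediate (map-side) clusters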

Re: Fwd: Re: convert input for SVD

2013-05-30 Thread Dmitriy Lyubimov
I believe this flow describes how to use Lanczos SVD in Mahout to arrive at the same reduction that ssvd already provides with the pca and USigma options in one step. This flow is irrelevant when working with ssvd; it already does it all internally for you.

Re: Fwd: Re: convert input for SVD

2013-05-30 Thread Dmitriy Lyubimov
I.e., I guess you want to run kmeans directly on the USigma output.

Re: Fwd: Re: convert input for SVD

2013-05-30 Thread Suneel Marthi
Agree with Dmitriy.

Re: Fwd: Re: convert input for SVD

2013-05-30 Thread Rajesh Nikam
Yes, but how do I run canopy/kmeans on the USigma output? What is the connecting step? Please update on the same. Thanks, Rajesh
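
The thread leaves this open. A hypothetical sketch of the connecting step, assuming ssvd wrote its U*Sigma matrix to a USigma subdirectory under the -o path (the directory name and parameters here are guesses, not taken from the thread):

    # point kmeans directly at the USigma vectors produced by ssvd;
    # with -k set, kmeans seeds the -c path with randomly sampled initial centroids
    mahout kmeans -i /user/hadoop/t/input-set-svd/USigma \
      -c /user/hadoop/t/initial-centroids -o /user/hadoop/t/kmeans-on-usigma \
      -k 50 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
      -x 10 -ow -cl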

IRStats Evaluation for Recommender Systems

2013-05-30 Thread Parimi Rohit
Hi All, Is there a way to compute precision and recall values given a file of recommendations and a test file of user preferences? I know there is GenericRecommenderIRStatsEvaluator in Mahout to compute the IR stats, but it takes a RecommenderBuilder object, among others, as a parameter to build a

Re: IRStats Evaluation for Recommender Systems

2013-05-30 Thread Sean Owen
There's nothing direct, but you can probably save yourself time by copying the code that computes these stats and applying it to your pre-computed values. It's not terribly complex, just counting the intersection and union sizes and deriving some stats from them. The split is actually based on value
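
Along the lines Sean suggests, a self-contained sketch (not Mahout's own evaluator code) that computes average precision and recall from two hypothetical in-memory maps, one holding the recommended items per user and one holding the relevant (held-out) test items per user:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class SimpleIRStats {

      /** Prints average precision and recall over all users present in both maps. */
      public static void evaluate(Map<Long, List<Long>> recommendations,
                                  Map<Long, Set<Long>> relevant) {
        double precisionSum = 0.0;
        double recallSum = 0.0;
        int users = 0;
        for (Map.Entry<Long, List<Long>> entry : recommendations.entrySet()) {
          Set<Long> truth = relevant.get(entry.getKey());
          Set<Long> recommended = new HashSet<Long>(entry.getValue());
          if (truth == null || truth.isEmpty() || recommended.isEmpty()) {
            continue;                       // nothing to score for this user
          }
          Set<Long> hits = new HashSet<Long>(recommended);
          hits.retainAll(truth);            // intersection of recommended and relevant items
          precisionSum += (double) hits.size() / recommended.size();
          recallSum += (double) hits.size() / truth.size();
          users++;
        }
        System.out.printf("precision = %.4f, recall = %.4f (over %d users)%n",
            precisionSum / users, recallSum / users, users);
      }
    }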

Re: IRStats Evaluation for Recommender Systems

2013-05-30 Thread Parimi Rohit
On Thu, May 30, 2013 at 12:01 PM, Sean Owen sro...@gmail.com wrote: There's nothing direct, but you can probably save yourself time by copying the code that computes these stats and applying it to your pre-computed values. It's not terribly complex, just counting the intersection and union

Re: Feature vector generation from Bag-of-Words

2013-05-30 Thread Suneel Marthi
That's correct. Also, the SnowballAnalyzer implicitly converts all text to lower case, so you could avoid that step in your computation. All of your keywords would have to be run through the SnowballAnalyzer first, and the same goes for your documents, before you make the call to
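
A minimal sketch of running a piece of text through SnowballAnalyzer, assuming the Lucene 3.x API bundled with Mahout 0.7 (the Version constant and stemmer name here are illustrative):

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class AnalyzeText {

      /** Returns the lower-cased, stemmed terms produced by SnowballAnalyzer for the given text. */
      public static List<String> analyze(String text) throws IOException {
        SnowballAnalyzer analyzer = new SnowballAnalyzer(Version.LUCENE_36, "English");
        TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        List<String> terms = new ArrayList<String>();
        stream.reset();
        while (stream.incrementToken()) {
          terms.add(term.toString());
        }
        stream.end();
        stream.close();
        return terms;
      }
    }

Applying the same analyze() call to both the keyword list and the document text keeps the stemmed terms aligned on both sides.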