Re: Text clustering with SVD

2015-03-30 Thread Suneel Marthi
, but how do I compute U*Sigma? Can I do that with Mahout? Is there an optimal method to determine K? Another question is how do I relate the ssvd output to the words dictionary (real words)? Thank you Donni On Mon, Mar 30, 2015 at 10:04 AM, Suneel Marthi suneel.mar...@gmail.com
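For readers wondering what computing U*Sigma involves: since Sigma is diagonal, the product just scales column j of U by the j-th singular value. The sketch below is plain Python on small in-memory lists, not Mahout code; the function name and the sample values are made up for illustration.

```python
def scale_columns(u, singular_values):
    """Return U * diag(sigma): column j of U multiplied by sigma[j].

    `u` is a list of rows (one row per document) and `singular_values`
    holds one value per retained dimension. Both are hypothetical
    in-memory stand-ins for the SSVD output, not Mahout data types.
    """
    return [
        [value * sigma for value, sigma in zip(row, singular_values)]
        for row in u
    ]


if __name__ == "__main__":
    u = [[1.0, 2.0], [3.0, 4.0]]
    sigma = [10.0, 0.5]
    for row in scale_columns(u, sigma):
        print(row)
```

Each output row is the document's coordinates in the reduced space, which is typically what would be fed to a clustering step.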

Re: mahout failing with -c as required option

2015-03-10 Thread Suneel Marthi
/part-r-0 -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 5 -ow -cl both give the same exception still. Kindly suggest. On Tuesday, March 10, 2015 11:35 AM, Suneel Marthi suneel.mar...@gmail.com wrote

Re: mahout failing with -c as required option

2015-03-10 Thread Suneel Marthi
option now so i get the mentioned exception that -c is mandatory. On Tuesday, March 10, 2015 11:06 AM, Suneel Marthi suneel.mar...@gmail.com wrote: R u still specifying the -c option? It's only needed if u have initial centroids to launch KMeans from; otherwise KMeans picks random

Re: kmeans is throwing IllegalArgumentException

2015-03-09 Thread Suneel Marthi
Yes, that's correct On Mon, Mar 9, 2015 at 1:53 PM, Pat Ferrel p...@occamsmachete.com wrote: I think you don’t want to supply a -c argument unless you have seed vectors in /user/netlog/upload/output4/uscensus-kmeans-centroids/part-randomSeed. Just leave it out and Mahout will use random

Re: mahout failing with -c as required option

2015-03-09 Thread Suneel Marthi
R u still specifying the -c option? It's only needed if u have initial centroids to launch KMeans from; otherwise KMeans picks random centroids. Also CosineDistanceMeasure doesn't make sense with KMeans, which is in Euclidean space - try using SquaredEuclidean or Euclidean distances. On Tue, Mar

Re: mahout output of seq2sparse is empty

2015-03-03 Thread Suneel Marthi
Depends on what u r trying to do. Are u trying classification or clustering? On Wed, Mar 4, 2015 at 1:08 AM, Raghuveer alwaysra...@yahoo.com.invalid wrote: Yes, you are right, it was a directory. I see the part-m-0 file; can you kindly suggest how to run mahout on this file? Should i run

Re: Delete output folder automatically in CRUNCH (FlumeJava)

2015-02-18 Thread Suneel Marthi
Please send the FlumeJava mailing list, this would be better addressed there. On Wed, Feb 18, 2015 at 2:24 AM, unmesha sreeveni unmeshab...@gmail.com wrote: Hi I am new to FlumeJava.I ran wordcount in the same.But how can I automatically delete the outputfolder in the code block. Instead of

Re: why rbm was removed from mahout?

2014-11-09 Thread Suneel Marthi
The algorithm never made it to the codebase and remained a patch for sometime when the original author recalled the patch while we were working on 0.8. It wasn't scalable and the author didn't think it was worth committing to trunk. On Mon, Nov 10, 2014 at 2:34 AM, Ted Dunning

Re: Reply: Mahout documentation

2014-11-03 Thread Suneel Marthi
There is no online documentation for each of the algorithm parameters, AFAIK. The only documentation would be the MiA book which covers details about the algorithms and parameters (without having to look at the code). On Mon, Nov 3, 2014 at 3:49 AM, Sean Farrell drsafarr...@gmail.com wrote: So

Re: Lucene version compatibility

2014-10-30 Thread Suneel Marthi
: org.apache.lucene.index.AtomicReaderContext,org.apache.lucene.util.Bits [ERROR] found: org.apache.lucene.index.AtomicReaderContext,boolean,boolean,nulltype [ERROR] reason: actual and formal argument lists differ in length From: Suneel Marthi smar...@apache.org Sent: 29 October

Re: Lucene version compatibility

2014-10-29 Thread Suneel Marthi
From: Suneel Marthi smar...@apache.org Sent: 28 October 2014 22:33 To: user@mahout.apache.org Subject: Re: Lucene version compatibility Yes it should be possible, and we have been upgrading to the latest and greatest Lucene versions at the point of Release

Re: compatibility of hadoop and mahout version

2014-10-28 Thread Suneel Marthi
on Hadoop 2.x? On 27 October 2014 01:37, Suneel Marthi smar...@apache.org wrote: Mahout 0.9 is not compatible with Hadoop 2.x. Either u can work off present git codebase on Hadoop 2.x or try running Mahout 0.9 on Hadoop 1.2.1 On Mon, Oct 27, 2014 at 1:34 AM, jyotiranjan panda tell2jy

Re: Lucene version compatibility

2014-10-28 Thread Suneel Marthi
Yes it should be possible, and we have been upgrading to the latest and greatest Lucene versions at the point of Release and it may be a trivial change. Just gotta replace all references in the code for 'Version_46' with 'Version_Latest'. Also Lucene >= 4.7 mandates Java 1.7. On Tue, Oct 28, 2014 at

Re: compatibility of hadoop and mahout version

2014-10-26 Thread Suneel Marthi
Mahout 0.9 is not compatible with Hadoop 2.x. Either u can work off present git codebase on Hadoop 2.x or try running Mahout 0.9 on Hadoop 1.2.1 On Mon, Oct 27, 2014 at 1:34 AM, jyotiranjan panda tell2jy...@gmail.com wrote: Hi, I have just started mahout learning last week. I am facing lots of

Re: Invoking Mahout 0.9 with Lucene 4.6.1 ClassNotFoundException

2014-10-23 Thread Suneel Marthi
You can't be using Lucene 4x with Lucene 3x. Lucene 4x is not backward compatible with Lucene 3x. R u trying to set TermVectors and offsets, if so it should be done differently with Lucene 4x, see TestClusterDumper.java for an example. On Thu, Oct 23, 2014 at 7:15 PM, Benjamin Eckstein

Re: Streaming K Means exception without any reason

2014-10-09 Thread Suneel Marthi
Seen this issue happen a few times before, there are few edge conditions that need to be fixed in the Streaming KMeans code and you are right that the generated clusters are different on successive runs given the same input. IIRC this stacktrace is due to BallKMeans failing to read any input

Re: Streaming K Means exception without any reason

2014-10-09 Thread Suneel Marthi
, but it would be a problem if it crashes like this. On Thursday, 09 October 2014 14:54:28 CEST, Suneel Marthi wrote: Seen this issue happen a few times before, there are few edge conditions that need to be fixed in the Streaming KMeans code and you are right that the generated clusters

Re: SSVD: lease conflict due to 2 attempts using the same dir

2014-10-07 Thread Suneel Marthi
Have u tried running with -ow (overwrite) option, that should clear all tmpdir between successive runs ?? The SSVD code does clear the tmpdir when -ow is specified. On Tue, Oct 7, 2014 at 5:55 PM, Yang tedd...@gmail.com wrote: we are running mahout ssvd, with a --tempDir parameter, but we

Re: word weights using BM25

2014-10-01 Thread Suneel Marthi
to understand how I would port that to mr. I ll try to share something if I succeed. Arian Pasquali http://about.me/arianpasquali 2014-09-24 5:12 GMT+01:00 Suneel Marthi suneel.mar...@gmail.com: Lucene 4.x supports okapi-bm25. So it should be easy to implement. On Tue, Sep 23

Re: Streaming K Means

2014-09-29 Thread Suneel Marthi
This was replied to earlier with the details u r looking for, repeating here again: See http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means/18090471#18090471 for how to invoke Streaming Kmeans Also look at examples/bin/cluster-reuters.sh for the Streaming KMeans

Re: Performance of RowSimilarityJob

2014-09-26 Thread Suneel Marthi
What's the Mahout version? Please work off of 0.9, there was a performance issue in RSJ that was fixed in 0.9. On Fri, Sep 26, 2014 at 4:23 PM, Burke Webster bu...@collectiveip.com wrote: I've been implementing the RowSimilarityJob on our 40-node cluster and have run into some serious

Re: Performance of RowSimilarityJob

2014-09-26 Thread Suneel Marthi
I had seen the issue u r reporting when running CooccurrencesMapper on a 2M document corpus on an 80 node cluster. The job would be stuck in cooccurencesMapper forever. This has been fixed in 0.9 (I have not had a chance to try it out on the size and cluster I had before), so it would be good if

Re: word weights using BM25

2014-09-24 Thread Suneel Marthi
/~jperezi/Lucene-BM25/ and the current mahout's tfidf code. Trying to understand how I would port that to mr. I ll try to share something if I succeed. Arian Pasquali http://about.me/arianpasquali 2014-09-24 5:12 GMT+01:00 Suneel Marthi suneel.mar...@gmail.com: Lucene 4.x supports okapi

Re: word weights using BM25

2014-09-23 Thread Suneel Marthi
Lucene 4.x supports okapi-bm25. So it should be easy to implement. On Tue, Sep 23, 2014 at 11:57 PM, Ted Dunning ted.dunn...@gmail.com wrote: Should be pretty easy. I haven't heard of anyone doing it. Sent from my iPhone On Sep 23, 2014, at 18:53, Arian Pasquali ar...@arianpasquali.com
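As a rough illustration of what a BM25 word weight looks like, here is a minimal Python sketch of the commonly cited Okapi BM25 formula. It is not taken from Lucene or Mahout code; the function names are hypothetical, and the defaults k1 = 1.2 and b = 0.75 are conventional values assumed here rather than anything stated in the thread.

```python
import math


def bm25_idf(num_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency; never negative."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_weight(tf: float, doc_len: float, avg_doc_len: float,
                     num_docs: int, doc_freq: int,
                     k1: float = 1.2, b: float = 0.75) -> float:
    """Weight of one term in one document under Okapi BM25.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length over the corpus
    num_docs    -- number of documents in the corpus
    doc_freq    -- number of documents containing the term
    """
    # Longer-than-average documents are penalised, shorter ones boosted.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term frequency saturates: extra occurrences add less and less.
    saturated_tf = tf * (k1 + 1.0) / (tf + k1 * length_norm)
    return bm25_idf(num_docs, doc_freq) * saturated_tf


if __name__ == "__main__":
    print(bm25_term_weight(tf=3, doc_len=120, avg_doc_len=100,
                           num_docs=1000, doc_freq=10))
```

Compared with plain TF-IDF, the per-document inputs are the same counts plus the document length and the corpus average length, so a port mostly needs those two extra statistics.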

Re: Errors while using Mahout 0.9 with Hadoop 2.4.0

2014-09-03 Thread Suneel Marthi
Mahout 0.9 doesn't support Hadoop 2x, work off of present trunk if u r looking to run on Hadoop 2x. On Wed, Sep 3, 2014 at 3:57 AM, Kalmohsen cstudent...@gmail.com wrote: Hello all I am a master student who is willing to implement a scalable recommender system using Mahout, hadoop and spark

Re: lucene2seq and empty fields

2014-08-29 Thread Suneel Marthi
Which Mahout version? On Sat, Aug 30, 2014 at 12:32 AM, Tom LAMPERT t.lamp...@laboquantup.eu wrote: Hi all, I have running into a problem with lucene2seq and I'm wondering whether any of you can help me. I have a Solr index in which the documents contain several fields and some of these

Re: does Mahout support Hadoop 2.5.0 ?

2014-08-21 Thread Suneel Marthi
Mahout 0.9 does not support Hadoop 2x. Period... M-1329 is not part of Mahout 0.9 and has been fixed for 1.0 (see the Fix version in the JIRA) If u wanna run Mahout on Hadoop 2x, work off of present trunk (not 0.9 codebase). On Thu, Aug 21, 2014 at 6:55 PM, Wei Zhang w...@us.ibm.com wrote:

Re: Mahout on Spark: random forest

2014-08-11 Thread Suneel Marthi
There is no Random Forest impl on Spark in Mahout yet. MLlib has a Random Forests impl; why can't u use that instead? On Tue, Aug 12, 2014 at 2:19 AM, Sameer Tilak ssti...@live.com wrote: Hi All, We are currently using Weka. I looked at the site and read briefly about experimental

Re: CSV to Mahout Seqfile

2014-08-08 Thread Suneel Marthi
See http://stackoverflow.com/questions/13663567/mahout-csv-to-vector-and-running-the-program On Fri, Aug 8, 2014 at 11:05 PM, Aniket sankhe@gmail.com wrote: Hi, I am working on project want to run a dataset on mahout for naive bayes classifier. dataset has csv format with columns (

Re: How to get document count for TFIDF calculate method?

2014-07-29 Thread Suneel Marthi
Have been silently following this discussion for some time now. Jonathan, if I understand u right, u r trying to determine the no. of docs in ur corpus. Correct? One of the artifacts from seq2sparse should have the doc count, not sure which one off the top of my head and I am not in front of a computer.

Re: Fpgrowth

2014-07-23 Thread Suneel Marthi
fpgrowth was initially removed and added again for 0.9 because one specific user stepped up to support it (and was never heard from again). Mahout 0.9 should have fpgrowth IIRC. On Thu, Jul 24, 2014 at 1:27 AM, Martin, Nick nimar...@pssd.com wrote: So I know fpgrowth was sent out to pasture a

Re: SparseVectorsFromSequenceFiles: ArrayIndexOutOfBoundsException in DictionaryVectorizer

2014-07-18 Thread Suneel Marthi
Are u running vanilla Mahout 0.9 on Hadoop 2x? While that may not be the issue here, Mahout 0.9 doesn't support Hadoop 2x yet. Its better if u could work against the present trunk and build the code with hadoop 2 profile if that's ur target test bed. On Sat, Jul 12, 2014 at 11:38 AM, Reinis

Re: Random Forest Implementation training is too slow for 2 GB of data

2014-07-11 Thread Suneel Marthi
R u working off of trunk? Mahout version?? Sent from my iPhone On Jul 11, 2014, at 6:53 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi all, I have some 2 GB of data and tried to execute RF with no of trees = 10 and maxsplitsize as 90 MB. The execution takes too much time. I have

Re: Random Forest Implementation training is too slow for 2 GB of data

2014-07-11 Thread Suneel Marthi
Please work off of trunk, few fixes for RDF have gone in that should address this issue. See release notes for details. Sent from my iPhone On Jul 11, 2014, at 7:06 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Mahout 0.7 -Original Message- From: Suneel Marthi [mailto:suneel.mar

Re: Problem with SequenceFilesFromDirectory

2014-06-28 Thread Suneel Marthi
0.7 is not supported anymore, please switch to 0.9 or present trunk Sent from my iPhone On Jun 28, 2014, at 5:05 PM, Matías matias2...@gmail.com wrote: Hi guys, I'm using Mahout 0.7 I'm having a problem with SequenceFilesFromDirectory I have a txt file with ascii encoding in Linux and

Re: Clusterdump in mahout

2014-06-26 Thread Suneel Marthi
me know if I am wrong. Thanks, Venkat On Thu, Jun 26, 2014 at 1:27 PM, Suneel Marthi smar...@apache.org wrote: Its clear from the stacktrace that u have a String as key where an integer was expected. How did u go about building ur clusters from original input ? On Thu, Jun 26

Re: ClassNotFound Exception when running SequenceFilesFromDirectory

2014-06-23 Thread Suneel Marthi
You need to first convert *.sgm from reuters download to text files (this should happen before running seqdirectory). To convert .sgm to text run - $MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters ${WORK_DIR}/reuters-sgm ${WORK_DIR}/reuters-out Then run seqdirectory on the output of the

Re: Interpretation of cluster output

2014-06-20 Thread Suneel Marthi
There was an issue with empty cluster file being created for Canopy which has since been fixed in present trunk. So u may want to work off of present trunk. Also Canopy's been marked for deprecation in future release so whatever u r trying to do, you may want to look at the alternatives. On

Re: a seemingly benign test that fails MahoutTestCase

2014-06-17 Thread Suneel Marthi
Annotate ur test case with the following: @ThreadLeakAction({ThreadLeakAction.Action.WARN}) (This is from Carrot Randomized Test framework, ensure that u have the relevant jars in ur classpath for this to compile) that should throw a Warning as opposed to interrupting the thread. As Ted's said

Re: How to move KMeans clustering from version 0.7 to 0.9

2014-06-17 Thread Suneel Marthi
(1) Mahout 0.7 is not supported anymore and u shouldn't be using it. (2) To get ur code to compile with 0.9 remove the DistanceMeasure arguments in ur call to KMeansDriver.run() WeightedVectorWritable was replaced by WeightedPropertyVectorWritable in 0.9. So change the line of code to

Re: Mahout DistributedRowMatrix run with only one mapper

2014-06-17 Thread Suneel Marthi
DRM is not for demo and is used across several Mahout jobs like RowSimilarityJob etc... a) What's the Mahout version u r working off of? b) Have u tried using MatrixMultiplicationJob which is MapReduce based? On Tue, Jun 17, 2014 at 3:05 AM, Han Fan visaya...@gmail.com wrote: I have a 6kx10k

Re: Determining appropriate t1 and t2 values for CanopyDriver

2014-06-03 Thread Suneel Marthi
This has been asked before several times, if you search the mailing lists you may hit similar posts. There is no clear formula for picking the ideal T1 and T2 values, the problem with using Canopy is that because it runs with a single reducer u r most likely to hit OOME depending on how big the

Re: SparseVectorsFromSequenceFiles StandardAnalyzer ClassNotFoundException issue

2014-06-03 Thread Suneel Marthi
You r missing the Lucene jars from ur classpath. Mahout's presently at Lucene 4.6.1 that's what u should be including. On Tuesday, June 3, 2014 3:40 PM, Terry Blankers te...@amritanet.com wrote: Hello, can anyone please give me a clue as to what I may be missing here? I'm trying to run a

Re: SparseVectorsFromSequenceFiles StandardAnalyzer ClassNotFoundException issue

2014-06-03 Thread Suneel Marthi
, Suneel Marthi wrote: You r missing the Lucene jars from ur classpath. Mahout's presently at Lucene 4.6.1 that's what u should be including. On Tuesday, June 3, 2014 3:40 PM, Terry Blankers te...@amritanet.com wrote: Hello, can anyone please give me a clue as to what I may be missing

Re: Naive Bayes implementation in Java

2014-05-23 Thread Suneel Marthi
Look at the unit tests for reference Sent from my iPhone On May 23, 2014, at 2:52 AM, namit maheshwari namitmaheshwa...@gmail.com wrote: Hello Everyone, I am trying to implement Naive Bayes in Java rather than running it through command line. Could anyone please direct me to examples

Re: Theory behind LogisticRegression in Mahout

2014-05-23 Thread Suneel Marthi
Mahout's impl closely follows http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.177.3514&rep=rep1&type=pdf Mahout's impl On Friday, May 23, 2014 2:50 AM, namit maheshwari namitmaheshwa...@gmail.com wrote: No I didnt find any links in the comments. On Fri, May 23, 2014 at 2:44 AM,

Re: Setting mahout heapsize for rowsimilarity job

2014-05-23 Thread Suneel Marthi
I had seen this issue too with RSJ until 0.8. Switch to using Mahout 0.9, downsampling was introduced in RSJ which should avoid this error. On Fri, May 23, 2014 at 2:59 PM, Mohit Singh mohit1...@gmail.com wrote: Hi, I have a 1M X 6 dimensional matrix stored as sequence file and I am

Re: Theory behind LogisticRegression in Mahout

2014-05-22 Thread Suneel Marthi
See Frank's blog for how Mahout's SGD works http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/ On Thu, May 22, 2014 at 2:44 AM, Peng Zhang pzhang.x...@gmail.com wrote: Namit, I think the theory behind Mahout’s logistic regression is stochastic

Re: How to list all vectors from a cluster

2014-05-22 Thread Suneel Marthi
Look at clusterdump utility Sent from my iPhone On May 22, 2014, at 9:19 AM, Aleksander Sadecki aleksander.sade...@pi.esisar.grenoble-inp.fr wrote: Hi, I have got a piece of code which creates for me few clusters with vectors. When I run it, I can see a log which says that 2 clusters

Re: Problem in applying in patch Mahout-838

2014-05-03 Thread Suneel Marthi
I believe Adam's reply could be a kid messing with his smartphone and hitting reply in error (happens with me sometimes). Anyways coming back to ur question, the patch u mention is a few years (and hence few versions old). Why would u want to try applying the patch in 2014? What r u trying to do?

Re: Future of Frequent Pattern Mining

2014-04-28 Thread Suneel Marthi
+100 to purging this from the codebase. This stuff uses the old MR api and would have to be upgraded not to mention that this was removed from 0.9 and was restored only because one user wanted it who promised to maintain it and has not been heard from. On Mon, Apr 28, 2014 at 2:19 AM,

Re: Difficulties mapping results of CVB/LDA back to corresponding vector keys

2014-04-24 Thread Suneel Marthi
RowId creates a matrix and docIndex which r IntWritable, VectorWritable and IntWritable, Text respectively. Have u looked at LDAPrintTopics.java ? On Thu, Apr 24, 2014 at 7:32 PM, Mohammed Omer beancinemat...@gmail.com wrote: Good evening all. This is my first time working with Mahout, and

Re: Getting error in qualcluster command

2014-04-22 Thread Suneel Marthi
What is the error u r seeing? The output from KMeans is (IntWritable, ClusterWritable) and for Streaming KMeans it's (IntWritable, CentroidWritable). QualCluster may be expecting the latter and hence works for Streaming KMeans. Could u post the error u r seeing? On Tue, Apr 22, 2014 at 9:12 AM,

Re: Installation on Ubuntu

2014-04-18 Thread Suneel Marthi
New API for ? On Friday, April 18, 2014 3:50 PM, Christopher Eugene xriseug...@gmail.com wrote: @sebastian I have version 1.7. @Andrew I plan on using mahout with php since I heard that there is a new API or am I wrong? On Fri, Apr 18, 2014 at 10:45 PM, Andrew Musselman

Re: Grumble about (lack of) warning of deprecation of Canopy KMeans

2014-04-18 Thread Suneel Marthi
On Fri, Apr 18, 2014 at 5:47 PM, Bob Morris morris@gmail.com wrote: I was taken aback that the immensely touted and convenient Canopy KMeans package was today deprecated [1] in the incubating mahout 1.0 with no hint that I could find warned in this, at least back through March. This

Re: lucene2seq error: field does not exist in the index

2014-04-18 Thread Suneel Marthi
Please file a jira for this. Thanks again. Sent from my iPhone On Apr 18, 2014, at 10:34 PM, Terry Blankers te...@amritanet.com wrote: Hi Frank, In working with a small test index, if I change the 'body' field to indexed it indeed does work as expected. It would be great if lucene2seq

Re: lucene2seq error: field does not exist in the index

2014-04-13 Thread Suneel Marthi
Apologies for the delayed response Terry.  Mahout's presently at Lucene 4.6.1 (both 0.9 and trunk).  The practice so far has been to upgrade to the latest Lucene version right before a planned release. Not sure what has changed in Solr/Lucene 4.7.1. You could try either of 2 things:- a) Is

Re: Can any one help

2014-04-08 Thread Suneel Marthi
Its not a Mahout issue, u may need to format ur datanodes and restart Hadoop, Hadoop is not able to replicate. On Tuesday, April 8, 2014 1:23 PM, Neetha netasu...@gmail.com wrote: Hi, I am trying to run Mahout -kmeans clustering on hadoop, but I am getting this error,

Re: Using split without partitioning the data to train/test

2014-03-31 Thread Suneel Marthi
Sent from my iPhone On Mar 31, 2014, at 4:20 PM, Mahmood Naderan nt_mahm...@yahoo.com wrote: Hi, In an old Mahout, I used wikipediaDataSetCreator on an input to create the training data mahout wikipediaDataSetCreator -i wiki-tr/chunks -o tr-input -c labels.txt and then

Re: GC Overhead limit exceed in sequential mode of Mahout Streamingkmeans

2014-03-26 Thread Suneel Marthi
17070 Reducer Xmx is 6GB, running in full Map/Reduce mode. Do you have any other idea what to try? Thanks, Roland On Tue, Mar 25, 2014 at 7:13 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: What's ur value for -km? Based on what you had provided -km should be =  1 * ln(200) = 145090

Re: GC Overhead limit exceed in sequential mode of Mahout Streamingkmeans

2014-03-26 Thread Suneel Marthi
... forgot to ask? How many dimensions r u trying to cluster on? Adding a combiner may address this excessive memory usage issue in the reducer (presently not there). On Wednesday, March 26, 2014 8:10 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Hi Roland, Could u tell me how many

Re: GC Overhead limit exceed in sequential mode of Mahout Streamingkmeans

2014-03-26 Thread Suneel Marthi
. I don’t know how to assign enough memory to mahout sequential job. How about changing configuration in hadoop-env, such as heap_size Or datanode memory size? Will they take effects? Ma -Original Message- From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] Sent: Thursday, March

Re: trainclassifier/trainnb

2014-03-25 Thread Suneel Marthi
If u r looking for an example usage, see examples/bin/classify-20newsgroups.sh Sent from my iPhone On Mar 25, 2014, at 9:28 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: If you need to see which options are available for a given job you can just run $MAHOUT_HOME/bin/mahout

Re: GC Overhead limit exceed in sequential mode of Mahout Streamingkmeans

2014-03-25 Thread Suneel Marthi
What's ur value for -km? Based on what you had provided -km should be =  1 * ln(200) = 145090 Try reducing ur no. of clusters to 1000 and -km = 14509 On Tuesday, March 25, 2014 2:45 AM, fx MA XIAOJUN xiaojun...@fujixerox.co.jp wrote: I am using Mahout Streamingkmeans in
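The arithmetic in this reply appears to have lost digits in the archive. The two results quoted (145090 and 14509) are consistent with the usual Streaming KMeans rule of thumb km = k * ln(n), where k is the number of final clusters and n the number of input points, if n is about 2,000,000 and k is 10000 or 1000. Those values of n and k are inferred from the quoted results, not stated in the visible text. A small Python sketch of the estimate, with a hypothetical function name:

```python
import math


def estimate_km(num_clusters: int, num_points: int) -> int:
    """Rule-of-thumb estimate for the -km option: k * ln(n).

    num_clusters -- desired number of final clusters (k)
    num_points   -- number of input vectors (n)
    """
    return round(num_clusters * math.log(num_points))


if __name__ == "__main__":
    # n = 2,000,000 is an assumption inferred from the figures in the thread.
    for k in (10000, 1000):
        print(k, estimate_km(k, 2_000_000))
```

Because the estimate grows linearly in k, cutting the number of clusters by ten also cuts the intermediate centroids, and the memory needed to hold them, by roughly ten.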

Re: Does Recommender System Overview Demo work?

2014-03-24 Thread Suneel Marthi
It was removed in 0.9 and am not sure if it was there in 0.8. I vaguely remember removing it in 0.9 based on a conversation with Manuel on user@. Manuel, if u could chime in here. On Monday, March 24, 2014 9:44 AM, Sebastian Schelter s...@apache.org wrote: The webapp in Mahout does not

Re: Mahout parallel K-Means - algorithms analysis

2014-03-18 Thread Suneel Marthi
: Re: Mahout parallel K-Means - algorithms analysis From: weish...@gmail.com To: user@mahout.apache.org CC: ted.dunn...@gmail.com You could take a look at org.apache.mahout.clustering.classify/ClusterClassificationMapper Enjoy, Wei Shung On Sat, Mar 15, 2014 at 2:51 PM, Suneel Marthi

Re: reduce is too slow in StreamingKmeans

2014-03-18 Thread Suneel Marthi
: Tuesday, March 18, 2014 10:50 AM To: Suneel Marthi; user@mahout.apache.org Subject: RE: reduce is too slow in StreamingKmeans Thank you for your extremely quick reply. What do u mean by this? kmeans hasn't changed between 0.8 and 0.9. Did u mean Streaming KMeans here? I want to try using -rskm

Re: Naive Bayes classification

2014-03-18 Thread Suneel Marthi
Tharindu, If I understand what u r trying to do:- a) You have a trained Bayes model. b) You would like to classify new documents using this trained model. c) You were trying to use TestNaiveBayesDriver to classify the documents in (b). Option 1: --- You could write a custom MapReduce

Re: clusterdump samplePoints parameter

2014-03-18 Thread Suneel Marthi
Its the max. no. of points to include from each cluster in the clusterdump. If not specified all points would be included. On Tuesday, March 18, 2014 11:25 PM, Terry Blankers te...@amritanet.com wrote: Hi all, Can someone please answer a quick question about the --samplePoints parameter

Re: Text clustering with hashing vector encoders

2014-03-18 Thread Suneel Marthi
+1 to this. We could then use Hamming Distance to compute the distances between Hashed Vectors. We have  the code for HashedVector.java based on Moses Charikar's SimHash paper. On Tuesday, March 18, 2014 7:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: Yes.  Hashing vector encoders
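To make the Hamming distance idea concrete: once each document is reduced to a fixed-width bit signature (as a SimHash-style encoder would produce), the distance between two documents is the number of bit positions where the signatures differ. The sketch below is plain Python and does not reflect the API of Mahout's HashedVector; the signatures are represented as non-negative integers and the 64-bit width is an assumption.

```python
def hamming_distance(sig_a: int, sig_b: int) -> int:
    """Number of differing bits between two non-negative integer signatures."""
    return bin(sig_a ^ sig_b).count("1")


def hamming_similarity(sig_a: int, sig_b: int, num_bits: int = 64) -> float:
    """Fraction of matching bits; 1.0 means identical signatures."""
    return 1.0 - hamming_distance(sig_a, sig_b) / num_bits


if __name__ == "__main__":
    a = 0b1011
    b = 0b0010
    print(hamming_distance(a, b))
    print(hamming_similarity(a, b, num_bits=4))
```

The XOR leaves a 1 wherever the two signatures disagree, so counting set bits gives the distance without looking at the original term vectors.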

Re: reduce is too slow in StreamingKmeans

2014-03-17 Thread Suneel Marthi
, work off of present trunk and build the code with Hadoop 2 profile like below: mvn clean install -Dhadoop2.profile=<hadoop 2.x version> Please give that a try. -Original Message- From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] Sent: Wednesday, February 19, 2014 1:08 AM To: user

Re: Problem with FileSystem in Kmeans

2014-03-17 Thread Suneel Marthi
This problem's specifically to do with Canopy clustering and is not an issue with KMeans. I had seen this behavior with Canopy and looking at the code its indeed an issue wherein cluster-0 is created on the local file system and the remaining clusters land on HDFS. Please file a JIRA for this

Re: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

2014-03-17 Thread Suneel Marthi
R u running on Hadoop 2.x? That seems to be the case here. Compile with hadoop 2 profile: mvn -DskipTests clean install -Dhadoop2.profile=<ur hadoop version> On Monday, March 17, 2014 5:57 AM, Margusja mar...@roo.ee wrote: Hi Here is my output: [speech@h14 ~]$ mahout/bin/mahout

Re: Normalization in Mahout

2014-03-17 Thread Suneel Marthi
What r u trying to do? On Monday, March 17, 2014 7:45 AM, Bikash Gupta bikash.gupt...@gmail.com wrote: Hi, Do we have any utility for Column and Row normalization in Mahout? -- Thanks Regards Bikash Gupta

Re: reduce is too slow in StreamingKmeans

2014-03-17 Thread Suneel Marthi
of incompatibility between hadoop and mahout, I don’t think mahout kmeans can run properly. Is mahout 0.9 compatible with Hadoop 0.20? -Original Message- From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] Sent: Monday, March 17, 2014 6:21 PM To: fx MA XIAOJUN; user@mahout.apache.org Subject

Re: Mahout with Storm/Spark

2014-03-16 Thread Suneel Marthi
!LDA being the other option) thank you On Mar 7, 2014, at 12:36 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: a) Upgrade to the latest Mahout version, please move away from 0.7 a lot of lint was cleaned up since then.  b) Seems like u r running the old LDA algorithm that was replaced

Re: Mahout parallel K-Means - algorithms analysis

2014-03-15 Thread Suneel Marthi
The clustering code is cimapper and cireducer. Following the clustering, there is cluster classification which is mapper only. Not sure about the reference paper, this stuff's been around for long but the documentation for kmeans on mahout.apache.org should explain the approach. Sent from my

Re: Mahout class responsible for testnb

2014-03-14 Thread Suneel Marthi
TestNaiveBayesDriver.java On Friday, March 14, 2014 8:27 AM, Tharindu Rusira tharindurus...@gmail.com wrote: Hello everyone, I'm currently writing an application which uses Mahout's NaiveBayes classification algorithm. In my program, the requirements of my application reflect a typical

Re: lda times out prior to writing topics

2014-03-14 Thread Suneel Marthi
Its not a timeout but an exception that's being thrown while generating ldatopics due to a list of terms being empty.  Looking into it On Friday, March 14, 2014 12:16 PM, Steven Cullens srcull...@gmail.com wrote: Hi, I'm running Mahout 0.9 and Hadoop 1.1.1 and I'm following the

Re: lda times out prior to writing topics

2014-03-14 Thread Suneel Marthi
the issue is that the numTerms in dictionary is 0. learning for LDA on reuters-lda/reuters-matrix/matrix (numTerms: 0), finding 5-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4.  Maximum iterations to run will be 2, unless the change in perplexity is less than 0.0.  Topic

Re: local file input for seqdirectory

2014-03-13 Thread Suneel Marthi
The workaround is to add -xm sequential. A MR version of seqdirectory was introduced in 0.8 and hence the default execution mode is MR if none is specified. On Thursday, March 13, 2014 4:12 PM, Steven Cullens srcull...@gmail.com wrote: Hi, I have a large number of files on the order of

Re: Automation of Canopy Clustering seeding t1 and t2

2014-03-12 Thread Suneel Marthi
Is there any rationale to what u r proposing? It's better to go with Streaming KMeans than the combination of Canopy - KMeans clustering. Moreover, Canopy clustering (due to a single reducer in Canopy Generation phase) is more likely to fail with large datasets and that's a behavior that's

Re: PCA to improve classification performances

2014-03-10 Thread Suneel Marthi
please feel free to comment. Kévin Moulart 2014-03-07 16:23 GMT+01:00 Suneel Marthi suneel_mar...@yahoo.com: Its not clear to me from ur description as to the exact sequence of steps u r running thru, but an SSVD job requires a matrix as input (not a sequencefile of Text, VectorWritables

Re: Few questions about SVM configuration in Mahout

2014-03-10 Thread Suneel Marthi
Mahout presently has no SVM impl.  U could use Logistic Regression (with SGD) for classification. On Monday, March 10, 2014 5:39 AM, Quentin-Gabriel Thurier quentin.thur...@gmail.com wrote: Hi all, Just few questions about the configuration of an SVM in Mahout : - Is it possible to do a

Re: Heap space

2014-03-10 Thread Suneel Marthi
that there should be a more meaningful error message that *who* needs more heap size? Hadoop, Mahout, Java, ?   Regards, Mahmood On Monday, March 10, 2014 1:31 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Mahmood, Firstly thanks for starting this email thread and for highlighting

Re: ClusteringUtils for Kmeans output

2014-03-09 Thread Suneel Marthi
U could call ClusterQualitySummarizer which then calls ClusteringUtils to spew out the different metrics u had specified. For an example, see the Streaming Kmeans section in examples/bin/cluster-reuters.sh.  It calls 'qualcluster' with options -i tf-idf vectors generated from seq2sparse -c

Re: ClusteringUtils for Kmeans output

2014-03-09 Thread Suneel Marthi
PM, Suneel Marthi suneel_mar...@yahoo.com wrote: U could call ClusterQualitySummarizer which then calls ClusteringUtils to spew out the different metrics u had specified. For an example, see the Streaming Kmeans section in examples/bin/cluster-reuters.sh. It calls 'qualcluster

Re: Heap space

2014-03-09 Thread Suneel Marthi
Mahmood, Firstly thanks for starting this email thread and for highlighting the issues with wikipedia example. Since you raised this issue, I updated the new wikipedia examples page at http://mahout.apache.org/users/classification/wikipedia-bayes-example.html and also responded to a similar

Re: ClusteringUtils for Kmeans output

2014-03-09 Thread Suneel Marthi
org.apache.mahout.classifier.sgd.OnlineLogisticRegressionTest On Sun, Mar 9, 2014 at 3:45 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Darn. U r the second guy to report that this week.  Change that line to what ted suggested.  The issue is with guava incompatibility with Hadoop's antiquated guava version. Sent from my

Re: mahout command

2014-03-08 Thread Suneel Marthi
nt_mahm...@yahoo.com wrote: That is rather disappointing b) Work off of present Head and build with Hadoop 2.x profile. Can you explain more?   Regards, Mahmood On Friday, March 7, 2014 8:09 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: The example as documented on the Wiki should work

Re: mahout command

2014-03-08 Thread Suneel Marthi
-distribution-0.9$ Regards, Mahmood   On Saturday, March 8, 2014 7:28 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: mvn clean package -Dhadoop2.version=2.3.0 please give that a try.   On Saturday, March 8, 2014 9:56 AM, Mahmood Naderan nt_mahm...@yahoo.com wrote: mvn clean package

Re: mahout command

2014-03-08 Thread Suneel Marthi
only WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Is there any concern about them? Regards, Mahmood On Saturday, March 8, 2014 11:19 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Thanks Andrew

Re: mahout command

2014-03-07 Thread Suneel Marthi
Mahmood, wikipediaXMLSplitter is not present in driver.classes.default.props. To accomplish what u r trying to do, u can edit src/conf/driver.classes.default.props and add an entry for wikipediaXMLSplitter. org.apache.mahout.text.wikipedia.WikipediaXmlSplitter = wikipediaXmlSplitter :

Re: PCA to improve classification performances

2014-03-07 Thread Suneel Marthi
It's not clear to me from ur description as to the exact sequence of steps u r running thru, but an SSVD job requires a matrix as input (not a sequencefile of Text, VectorWritables). When u try running a seqdumper on ur SSVD output do u see anything? The next step after u create ur

Re: mahout command

2014-03-07 Thread Suneel Marthi
On Friday, March 7, 2014 5:02 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Mahmood, wikipediaXMLSplitter is not present in driver.classes.default.props. To accomplish what u r trying to do, u can edit src/conf/driver.classes.default.props and add an entry for wikipediaXMLSplitter

Re: Welcome Andrew Musselman as new committer

2014-03-07 Thread Suneel Marthi
Congrats Andrew. On Friday, March 7, 2014 12:13 PM, Sebastian Schelter s...@apache.org wrote: Hi, this is to announce that the Project Management Committee (PMC) for Apache Mahout has asked Andrew Musselman to become committer and we are pleased to announce that he has accepted. Being a

Re: Mahout with Storm/Spark

2014-03-07 Thread Suneel Marthi
a) Upgrade to the latest Mahout version; please move away from 0.7, a lot of lint was cleaned up since then. b) Seems like u r running the old LDA algorithm that was replaced by CVB in later versions; try running ur corpus thru CVB once you upgrade to a later version of Mahout. I don't think

Re: Rework our website

2014-03-06 Thread Suneel Marthi
I fixed some of the broken links. For some of the others, e.g. TasteCommandline and Recommendationexamples, either the pages have not been migrated or the links have to be purged? On Thursday, March 6, 2014 9:07 AM, Sebastian Schelter s...@apache.org wrote: Thank you very much! Could you create a

Re: Rework our website

2014-03-06 Thread Suneel Marthi
There is stuff that needs to be migrated over from the old Web site. See Jira for the details. On Thursday, March 6, 2014 9:45 AM, Sebastian Schelter s...@apache.org wrote: Could you add the missing pages to the jira issue? I'll have a look later. On 03/06/2014 03:25 PM, Suneel Marthi

Re: Reuters Example LDA Error (no help anywhere)

2014-03-06 Thread Suneel Marthi
The script needs to be corrected to not call vectordump for LDA, as the vectordump utility (and even clusterdump) is presently not capable of displaying topics and relevant documents. I recall this issue was previously reported by Peyman Faratin after the 0.9 release. Ideally Mahout's missing a
