Decision Tree in Mahout

2013-12-11 Thread unmesha sreeveni
Am I able to run a `Decision tree` from Mahout in Eclipse without installing it? Should I `install` Mahout on my system, or download all the `jar` dependencies and include them in lib? I want to know how the Decision Tree works. Where can I find the `source code` for Mahout's Decision Tree? -- *Thanks & R
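You don't need a system-wide Mahout install to experiment in Eclipse; pulling the jars through Maven is usually enough. A minimal sketch of the dependency (the version shown is an assumption; use whichever release you target):

```xml
<!-- Hypothetical pom.xml fragment: pulls mahout-core and its transitive
     dependencies onto the Eclipse build path via m2e; no install needed -->
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.8</version>
</dependency>
```

For reading the code, the decision-forest implementation of that era lived under the `org.apache.mahout.classifier.df` package in the Mahout source tree.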

RE: Elephant-Bird, Pig, and Mahout

2013-12-11 Thread Sameer Tilak
Hi Andrew et al., I have the following statement in my pig script. AU = FOREACH A GENERATE myparser.myUDF(param1, param2); STORE AU into '/scratch/AU'; AU has the following format: (userid, (item_view_history)) (27,(0,1,1,0,0))(28,(0,0,1,0,0))(29,(0,0,1,0,1))(30,(1,0,1,0,1)) I will have at least

Re: Mahout and Hadoop 2.2.0

2013-12-11 Thread Gokhan Capan
In the meantime, you might apply the patch in MAHOUT-1354, build Mahout using `mvn package -Phadoop2 -DskipTests=true`, use that Mahout version, and see if that works. Gokhan On Wed, Dec 11, 2013 at 10:09 PM, Gokhan Capan wrote: > I apologize, Suneel is right, Counter breaks the binary compatibili
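The suggested workflow, sketched as shell commands (the working directory and patch filename are assumptions; MAHOUT-1354 is the JIRA issue named above):

```
# hypothetical build recipe: apply the MAHOUT-1354 patch to a Mahout
# checkout, then build against the hadoop2 profile, skipping tests
cd mahout
patch -p0 < MAHOUT-1354.patch
mvn package -Phadoop2 -DskipTests=true
```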

Re: Mahout and Hadoop 2.2.0

2013-12-11 Thread Hi There
Here are the full contents of my pom file (XML tags partly stripped by the mail archive): xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" 4.0.0 clustertest clustertest 1.0 jar cluster

Re: Mahout and Hadoop 2.2.0

2013-12-11 Thread Gokhan Capan
I apologize, Suneel is right: Counter breaks the binary compatibility. That said, there is work in progress on building Mahout against hadoop2. Gokhan On Wed, Dec 11, 2013 at 10:03 PM, Hi There wrote: > Here are the full contents of my pom file: > > http://maven.apache.org/POM/4.0.0

Re: Mahout and Hadoop 2.2.0

2013-12-11 Thread Suneel Marthi
Per this link, one notable incompatibility is Counter and CounterGroup. http://hadoop.apache.org/docs/r2.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html On Wednesday, December 11, 2013 2:46 PM, Hi There wrote: I tried to run SparseVe

Re: Mahout and Hadoop 2.2.0

2013-12-11 Thread Gokhan Capan
Could you check the following? Are you sure that your hadoop cluster is hadoop 2.2.0? Are you sure the other dependencies of your project do not have a transitive dependency on hadoop? Gokhan On Wed, Dec 11, 2013 at 9:46 PM, Hi There wrote: > I tried to run SparseVectorsFromSequenceFiles, specify
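A quick way to answer the second question is Maven's dependency tree, which lists transitive dependencies too (a sketch; adjust the grep pattern to your artifact naming):

```
# list every resolved hadoop artifact, including transitive ones;
# anything other than the 2.2.0 artifacts you expect is a suspect
mvn dependency:tree | grep -i hadoop
```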

Re: Mahout and Hadoop 2.2.0

2013-12-11 Thread Hi There
I tried to run SparseVectorsFromSequenceFiles, specifying a directory with sequence files, and I got the following error: java.lang.Exception: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected Here is a relevant snippet of my pom
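That `IncompatibleClassChangeError` is the classic symptom of code compiled against Hadoop 1 (where `Counter` was a class) running against Hadoop 2 (where it became an interface). One hedged workaround on the dependency side is to exclude the Hadoop 1 jar that Mahout pulls in and declare Hadoop 2 yourself; note this only helps once Mahout itself is rebuilt against Hadoop 2, and the versions below are assumptions:

```xml
<!-- hypothetical pom.xml fragment: keep Mahout, swap the Hadoop it pulls in -->
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.8</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.2.0</version>
</dependency>
```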

Re: Mahout and Hadoop 2.2.0

2013-12-11 Thread Gokhan Capan
Hi Zoltan, I am saying that hadoop2-stable and hadoop1 are binary compatible. I don't know which version of hadoop is used in cdh4-mr2, but I guess it was hadoop2 alpha, since Bigtop was at hadoop 2.0.6-alpha last time I checked, which was last week. Just try it and let us know if you experience

Re: Mahout and Hadoop 2.2.0

2013-12-11 Thread Suneel Marthi
Sebastian, are we still using SplitInputJob? It seems like it's been replaced by the much newer SplitInput. Do you think this needs to be purged from the codebase for 0.9? It's been marked as deprecated anyway. On Wednesday, December 11, 2013 2:08 PM, Suneel Marthi wrote: A quick search thru the

Re: Mahout and Hadoop 2.2.0

2013-12-11 Thread Suneel Marthi
A quick search through the codebase shows the following still using the old mapred API: DistributedRowMatrix, SplitInputJob, MatrixMultiplicationJob, BtJob, TransposeJob, TimesSquaredJob, ABtJob, ABtDenseOutJob, BtJob, QJob, QRFirstStep. On Wednesday, December 11, 2013 2:01 PM, Sebastian Schelter wrote: I think t

Re: Mahout and Hadoop 2.2.0

2013-12-11 Thread Sebastian Schelter
I think there are still parts of the code (e.g. in DistributedRowMatrix) that use the old API. --sebastian On 11.12.2013 19:56, Suneel Marthi wrote: > Mahout is using the newer mapreduce API and not the older mapred API. > Was that what u were looking for? > > > > > > On Wednesday, December

Re: Mahout and Hadoop 2.2.0

2013-12-11 Thread Suneel Marthi
Mahout is using the newer mapreduce API and not the older mapred API. Was that what you were looking for? On Wednesday, December 11, 2013 1:53 PM, Zoltan Prekopcsak wrote: Hi Gokhan, Thank you for the clarification. Does it mean that Mahout is using the mapred API everywhere and there is

Re: Mahout and Hadoop 2.2.0

2013-12-11 Thread Zoltan Prekopcsak
Hi Gokhan, Thank you for the clarification. Does it mean that Mahout is using the mapred API everywhere and there is no mapreduce API left? As far as I know, the mapreduce API needs to be recompiled and I remember needing to recompile Mahout for CDH4 when it first came out. Thanks, Zoltan

Re: Avoiding OOM for large datasets

2013-12-11 Thread Ted Dunning
This is not right. The sequential version would have finished long before this for any reasonable value of k. I do note, however, that you have set k = 200,000 where you only have 300,000 documents. Depending on which value you set (I don't have the code handy), this may actually be increased in
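For context on the two values being discussed: streaming k-means keeps on the order of k·log(n) sketch centroids during the streaming pass, so the "km" parameter is usually set near that value rather than near k itself. A hypothetical invocation (flag spellings vary by Mahout version, so treat these as assumptions to check against your release):

```
# illustrative numbers: n = 300,000 docs, k = 1,000 final clusters
# km ~ k * ln(n) ~ 1000 * 12.6 ~ 12,600 sketch clusters
mahout streamingkmeans -i /vectors -o /clusters -k 1000 -km 12600
```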

Re: Avoiding OOM for large datasets

2013-12-11 Thread Amir Mohammad Saied
Hi, I first tried Streaming K-means with about 5000 news stories, and it worked just fine. Then I tried it over 300,000 news stories and gave it 10GB of RAM. After more than 43 hours, it was still in the last merge pass when I eventually decided to stop it. I set K to 20 and KM 2522308 (its f

Re: Mahout for text classification

2013-12-11 Thread tuku
I am currently using naive Bayes for text classification. I prefer NB over SVM because: SVM has a long training time; NB can be trained incrementally; NB can be fully parallelized. The main decisions you should make while using NB are whether to use tf or tfidf, and whether to use binary NB or multinomial if you classify short te
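The incremental and parallel properties come from NB being nothing more than per-class word counts. A toy multinomial NB with Laplace smoothing sketches this (illustrative only, not Mahout's implementation; the class name and training data are made up):

```java
import java.util.*;

// Toy multinomial naive Bayes: training is just count updates, which is
// why NB can be trained incrementally and aggregated in parallel.
class TinyNB {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();
    private int totalDocs = 0;

    // Incremental update: absorb one labeled document's term counts.
    void train(String label, String doc) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        for (String w : doc.toLowerCase().split("\\s+")) {
            wordCounts.computeIfAbsent(label, k -> new HashMap<>()).merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocab.add(w);
        }
    }

    // Pick the class maximizing log prior + sum of log word likelihoods.
    String classify(String doc) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            for (String w : doc.toLowerCase().split("\\s+")) {
                int count = wordCounts.get(label).getOrDefault(w, 0);
                // Laplace smoothing avoids zero probability for unseen words.
                score += Math.log((count + 1.0) / (totalWords.get(label) + vocab.size()));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNB nb = new TinyNB();
        nb.train("spam", "buy cheap pills");
        nb.train("spam", "cheap pills online");
        nb.train("ham", "meeting schedule today");
        nb.train("ham", "project meeting notes");
        System.out.println(nb.classify("cheap pills"));     // spam
        System.out.println(nb.classify("project meeting")); // ham
    }
}
```

Switching the counts from raw tf to tf-idf weights, or clamping counts to 0/1 for the binary variant, are local changes to `train`, which is why those two decisions are the main knobs mentioned above.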