The train and test sets are each in a single file (part-r-00000). The training file is 30MB and the test file is 2MB.
2013/12/2 Fernando Santos <[email protected]>

> Hello Ted,
>
> No, the training also ran on one machine. What happens sometimes is that
> each box executes one job at a time, but not together. For example, if it
> runs 3 jobs, it runs the first job on box1, the next on box2, and the next
> on box1 again.
>
> The full dataset is a CSV of around 70MB. I turned it into a sequence
> file, applied seq2sparse, then split and trained. The training task was
> quite fast, a few minutes to execute. But the test is really slow, as I
> said, and it also runs on one machine.
>
>
> 2013/12/1 Ted Dunning <[email protected]>
>
>> Did the training run use both machines?
>>
>> How large is the input for the test run?
>>
>> Is it contained in a single file?
>>
>>
>> On Sat, Nov 30, 2013 at 11:22 AM, Fernando Santos <
>> [email protected]> wrote:
>>
>>> Hello everyone,
>>>
>>> I'm trying to do a text classification task. My dataset is not that
>>> big; I have around 700,000 small comments.
>>>
>>> Following the 20newsgroups example, I created the vectors from the
>>> text, split them, and trained the model. Now I'm trying to test it, but
>>> it is really slow and I also cannot make it run on the cluster.
>>> Whatever I do, it always runs on just one machine. And the testnb
>>> algorithm is supposed to run using MapReduce, right?
>>>
>>> I also tried the example here (
>>> http://chimpler.wordpress.com/2013/06/24/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages-part-2-distribute-classification-with-hadoop/
>>> ), but there too the other box in the cluster is not executing any
>>> task. In fact, when I execute testnb, or use the MapReduceClassifier
>>> proposed in the tutorial above, I get one job executing one task, and
>>> this task runs really slowly (around 6 minutes to reach 0.13% of the
>>> task).
>>> I think I must be doing something wrong, so that the cluster is not
>>> working the way it is supposed to.
>>>
>>> I have a cluster of 2 boxes configured with Hadoop 0.20.205.0, using
>>> Mahout 0.8.
>>>
>>> I also tried Mahout versions 0.7 and 0.6, but nothing changed.
>>>
>>> Any help would be appreciated.
>>>
>>> The logs I have from this task:
>>>
>>> *stdout logs*
>>>
>>> Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
>>> /usr/local/hadoop/lib/libhadoop.so which might have disabled stack
>>> guard. The VM will try to fix the stack guard now.
>>> It's highly recommended that you fix the library with 'execstack -c
>>> <libfile>', or link it with '-z noexecstack'.
>>>
>>> *syslog logs*
>>>
>>> 2013-11-30 17:09:19,191 WARN org.apache.hadoop.util.NativeCodeLoader:
>>> Unable to load native-hadoop library for your platform... using
>>> builtin-java classes where applicable
>>> 2013-11-30 17:09:19,400 WARN
>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
>>> already exists!
>>> 2013-11-30 17:09:19,472 INFO org.apache.hadoop.util.ProcessTree:
>>> setsid exited with exit code 0
>>> 2013-11-30 17:09:19,474 INFO org.apache.hadoop.mapred.Task: Using
>>> ResourceCalculatorPlugin :
>>> org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5810d963
>>> 2013-11-30 17:09:19,543 INFO org.apache.hadoop.mapred.MapTask:
>>> io.sort.mb = 100
>>> 2013-11-30 17:09:19,569 INFO org.apache.hadoop.mapred.MapTask: data
>>> buffer = 79691776/99614720
>>> 2013-11-30 17:09:19,569 INFO org.apache.hadoop.mapred.MapTask: record
>>> buffer = 262144/327680

--
Fernando Santos
+55 61 8129 8505
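For reference, the pipeline described in the thread (seq2sparse, then split, then trainnb/testnb) can be sketched as below, following the layout of the Mahout 0.8 20newsgroups example script. All HDFS paths and the 80/20 split percentage are assumptions for illustration, not the poster's actual settings.

```shell
# Sketch of the Mahout 0.8 Naive Bayes pipeline described above.
# All paths (comments-seq, comments-vectors, ...) are assumed, not real.

# 1. Vectorize a sequence file of (id, text) pairs into TF-IDF vectors.
mahout seq2sparse \
  -i comments-seq \
  -o comments-vectors \
  -lnorm -nv -wt tfidf

# 2. Split into training and test sets (80/20 here, as an example).
mahout split \
  -i comments-vectors/tfidf-vectors \
  --trainingOutput train-vectors \
  --testOutput test-vectors \
  --randomSelectionPct 20 \
  --overwrite --sequenceFiles -xm sequential

# 3. Train the (complementary) Naive Bayes model, extracting labels
#    from the input keys.
mahout trainnb \
  -i train-vectors \
  -el \
  -o model \
  -li labelindex \
  -ow -c

# 4. Test the model. Without "-xm sequential" this runs as a
#    MapReduce job.
mahout testnb \
  -i test-vectors \
  -m model \
  -l labelindex \
  -ow -o testnb-output -c
```

One point worth noting about the symptom in the thread: with the test input in a single 2MB file, MapReduce will normally create only one input split and therefore only one map task, which would explain why only one box does any work regardless of cluster size.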
