The train and test sets are each in a single file (part-r-00000). The
training file is 30MB and the test file is 2MB.
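A single small file like that normally produces a single input split, and
therefore a single map task, no matter how many nodes are in the cluster —
which would explain the one-task behavior below. A rough sketch of one thing
worth trying (the testnb flags are from Mahout's classify-20newsgroups
example script; the HDFS paths and the mapred.map.tasks value are
placeholders, not taken from this thread):

```shell
# With the old (0.20) API, Hadoop computes goalSize = totalSize / mapred.map.tasks
# when sizing splits, so asking for more map tasks can coax a small file
# into several splits. Paths below are hypothetical; substitute your own.
mahout testnb \
  -Dmapred.map.tasks=4 \
  -i /user/hadoop/test-vectors \
  -m /user/hadoop/model \
  -l /user/hadoop/labelindex \
  -ow -o /user/hadoop/testing
```

Whether this actually yields more splits depends on the input format
honoring the hint (sequence files split only at sync points), so it is a
diagnostic step rather than a guaranteed fix.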


2013/12/2 Fernando Santos <[email protected]>

> Hello Ted,
>
> No, the training also ran on one machine. Sometimes each box executes one
> job at a time, but never both together. For example, if there are 3 jobs to
> run, the first runs on box1, the next on box2, and the next on box1 again.
>
> The full dataset is a CSV of around 70MB. I converted it to a sequence
> file, applied seq2sparse, then split it and trained. The training task was
> quite fast, a few minutes to execute. But the test is really slow, as I
> said, and it also runs on only one machine.
>
>
>
> 2013/12/1 Ted Dunning <[email protected]>
>
>> Did the training run use both machines?
>>
>> How large is the input for the test run?
>>
>> Is it contained in a single file?
>>
>>
>>
>>
>> On Sat, Nov 30, 2013 at 11:22 AM, Fernando Santos <
>> [email protected]> wrote:
>>
>> > Hello everyone,
>> >
>> > I'm trying to do a text classification task. My dataset is not that big:
>> > around 700,000 small comments.
>> >
>> > Following the 20newsgroups example, I created vectors from the text,
>> > split them, and trained the model. Now I'm trying to test it, but it is
>> > really slow and I cannot make it run on the cluster. Whatever I do, it
>> > always runs on just one machine. And the testnb algorithm is supposed to
>> > run using MapReduce, right?
>> >
>> > I also tried this example here (
>> >
>> >
>> http://chimpler.wordpress.com/2013/06/24/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages-part-2-distribute-classification-with-hadoop/
>> > )
>> > but there, too, the other box in the cluster does not execute any task.
>> > In fact, when I run testnb or use the MapReduceClassifier proposed in
>> > the tutorial above, I get one job executing one task, and that task runs
>> > really slowly (about 6 minutes to complete 0.13% of the task).
>> >
>> > I think I must be doing something wrong, since the cluster is not
>> > working the way it is supposed to.
>> >
>> > I have a cluster with 2 boxes configured with Hadoop 0.20.205.0, using
>> > Mahout 0.8.
>> >
>> > I also tried Mahout versions 0.7 and 0.6, but nothing changed.
>> >
>> > Any help would be appreciated.
>> >
>> >
>> > The logs I have from this task:
>> >
>> >
>> > *stdout logs*
>> >
>> > Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
>> > /usr/local/hadoop/lib/libhadoop.so which might have disabled stack
>> > guard. The VM will try to fix the stack guard now.
>> > It's highly recommended that you fix the library with 'execstack -c
>> > <libfile>', or link it with '-z noexecstack'.
>> >
>> >
>> > *syslog logs*
>> >
>> > 2013-11-30 17:09:19,191 WARN org.apache.hadoop.util.NativeCodeLoader:
>> > Unable to load native-hadoop library for your platform... using
>> > builtin-java classes where applicable
>> > 2013-11-30 17:09:19,400 WARN
>> > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
>> > already exists!
>> > 2013-11-30 17:09:19,472 INFO org.apache.hadoop.util.ProcessTree:
>> > setsid exited with exit code 0
>> > 2013-11-30 17:09:19,474 INFO org.apache.hadoop.mapred.Task:  Using
>> > ResourceCalculatorPlugin :
>> > org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5810d963
>> > 2013-11-30 17:09:19,543 INFO org.apache.hadoop.mapred.MapTask:
>> io.sort.mb
>> > = 100
>> > 2013-11-30 17:09:19,569 INFO org.apache.hadoop.mapred.MapTask: data
>> > buffer = 79691776/99614720
>> > 2013-11-30 17:09:19,569 INFO org.apache.hadoop.mapred.MapTask: record
>> > buffer = 262144/327680
>> >
>> >
>> >
>> >
>> >
>> > --
>> > Fernando Santos
>> > +55 61 8129 8505
>> >
>>
>
>
>
> --
> Fernando Santos
> +55 61 8129 8505
>
>


-- 
Fernando Santos
+55 61 8129 8505