Re: Need help in executing SSVD for dimensionality reduction on Mahout
If the rows in the SSVD input are the data points you are trying to project into the reduced space, then the rows of USigma represent those same points in the PCA (reduced) space. The mapping between input rows and output rows is by identical keys in the sequence files. However, your input does not appear to use distinct keys (both rows shown have the key 1), which is not recommended. SSVD will also propagate names if NamedVector is used for the input rows; that is another way to map input rows to their PCA-space rows in USigma. However, the input does not appear to use NamedVectors in this case.

On Mon, Mar 17, 2014 at 10:22 PM, Vijaya Pratap wrote:
> Hi,
>
> I am trying to use SSVD for dimensionality reduction on Mahout. The input
> is sample data in CSV format. Below is a snippet of the input:
>
> 22,2,44,36,5,9,2824,2,4,733,285,169
> 25,1,150,175,3,9,4037,2,18,1822,254,171
>
> I executed the steps below.
>
> 1. Loaded the CSV file and vectorized the data by following the steps
> described at https://github.com/tdunning/pig-vector, with the key as
> TextConverter and the value as VectorWritable. Below is the output of
> this step. I believe the values 420468, 279945 are indices; please correct
> me if I am wrong.
>
> Key: 1: Value:
> {420468:733.0,279945:2.0,607618:285.0,107323:4.0,88330:2.0,263605:9.0,975378:169.0,796003:2824.0,899937:44.0,422862:5.0,723271:22.0,508675:36.0}
> Key: 1: Value:
> {420468:1822.0,279945:2.0,607618:254.0,107323:18.0,88330:1.0,263605:9.0,975378:171.0,796003:4037.0,899937:150.0,422862:3.0,723271:25.0,508675:175.0}
>
> 2. Passed the output of the above command to SSVD as follows:
>
> bin/mahout ssvd -i /user/cloudera/vectorized_data/ -o
> /user/cloudera/reduced_dimensions --rank 7 -us true -V false -U false -pca
> true -ow -t 1
>
> Below is a snippet of the output in the USigma folder:
>
> Key: 1: Value:
> {0:190.78376981262613,1:350.30406212052424,2:78.24932121461198,3:98.67283686605012,4:-122.95056058078157,5:-4.201436498582381,6:-1.4370820809434337}
> Key: 1: Value:
> {0:1295.933786837574,1:-698.5629072274602,2:-24.15996813349674,3:60.936737740013946,4:11.859426028893711,5:-6.379057682687426,6:0.9356299409590896}
>
> Please let me know if my approach is correct, and help me interpret
> the output in the USigma folder.
>
> Thanks in advance
> Pratap
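The key-based mapping described above can be sketched in plain Python, purely to illustrate the join: treat each sequence file as a list of (key, vector) pairs and match input rows to USigma rows by key. The keys and (truncated) values below are taken from the snippets in the question; in real use the keys would have to be distinct per row, which is exactly what the duplicate key 1 in the question breaks.

```python
# Illustrative sketch only (not Mahout code): join input rows to their
# reduced-space rows by sequence-file key, the way SSVD's USigma output
# maps back to its input. Distinct keys (1 and 2 here) make the join
# unambiguous; the duplicate key 1 from the question would not.

input_rows = [
    (1, {723271: 22.0, 279945: 2.0, 899937: 44.0}),   # sparse row, truncated
    (2, {723271: 25.0, 279945: 1.0, 899937: 150.0}),
]
usigma_rows = [
    (1, [190.78, 350.30, 78.25, 98.67, -122.95, -4.20, -1.44]),
    (2, [1295.93, -698.56, -24.16, 60.94, 11.86, -6.38, 0.94]),
]

reduced = dict(usigma_rows)
mapping = {key: (vec, reduced[key]) for key, vec in input_rows}
print(len(mapping[1][1]))  # -> 7, matching --rank 7
```

Each original row thus pairs with a 7-dimensional USigma row under the same key.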
Fwd: Need help in executing SSVD for dimensionality reduction on Mahout
Hi,

I am trying to use SSVD for dimensionality reduction on Mahout. The input is sample data in CSV format. Below is a snippet of the input:

22,2,44,36,5,9,2824,2,4,733,285,169
25,1,150,175,3,9,4037,2,18,1822,254,171

I executed the steps below.

1. Loaded the CSV file and vectorized the data by following the steps described at https://github.com/tdunning/pig-vector, with the key as TextConverter and the value as VectorWritable. Below is the output of this step. I believe the values 420468, 279945 are indices; please correct me if I am wrong.

Key: 1: Value: {420468:733.0,279945:2.0,607618:285.0,107323:4.0,88330:2.0,263605:9.0,975378:169.0,796003:2824.0,899937:44.0,422862:5.0,723271:22.0,508675:36.0}
Key: 1: Value: {420468:1822.0,279945:2.0,607618:254.0,107323:18.0,88330:1.0,263605:9.0,975378:171.0,796003:4037.0,899937:150.0,422862:3.0,723271:25.0,508675:175.0}

2. Passed the output of the above command to SSVD as follows:

bin/mahout ssvd -i /user/cloudera/vectorized_data/ -o /user/cloudera/reduced_dimensions --rank 7 -us true -V false -U false -pca true -ow -t 1

Below is a snippet of the output in the USigma folder:

Key: 1: Value: {0:190.78376981262613,1:350.30406212052424,2:78.24932121461198,3:98.67283686605012,4:-122.95056058078157,5:-4.201436498582381,6:-1.4370820809434337}
Key: 1: Value: {0:1295.933786837574,1:-698.5629072274602,2:-24.15996813349674,3:60.936737740013946,4:11.859426028893711,5:-6.379057682687426,6:0.9356299409590896}

Please let me know if my approach is correct, and help me interpret the output in the USigma folder.

Thanks in advance
Pratap
RE: reduce is too slow in StreamingKmeans
As mahout streamingkmeans has no problems in sequential mode, I would like to try sequential mode. However, "java.lang.OutOfMemoryError" occurs. Where do I set the JVM heap size for sequential mode? Is it the same as for mapreduce mode?

-----Original Message-----
From: fx MA XIAOJUN [mailto:xiaojun...@fujixerox.co.jp]
Sent: Tuesday, March 18, 2014 10:50 AM
To: Suneel Marthi; user@mahout.apache.org
Subject: RE: reduce is too slow in StreamingKmeans

Thank you for your extremely quick reply.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

I want to try using -rskm in streaming kmeans, but in mahout 0.8, setting -rskm to true causes errors. I heard that the bug was fixed in 0.9, so I upgraded 0.8 -> 0.9.

The hadoop I installed is cdh5-MRv1, corresponding to hadoop 0.20, not hadoop 2.x (YARN). cdh5-MRv1 has a compatible version of mahout (mahout-0.8+cdh5.0.0b2+28) compiled by cloudera. So I uninstalled mahout-0.8+cdh5.0.0b2+28 and installed the apache mahout 0.9 distribution.

It turned out that "mahout kmeans" runs very well on mapreduce. However, "mahout streamingkmeans" runs properly in sequential mode but fails in mapreduce mode. If the problem were incompatibility between hadoop and mahout, I don't think "mahout kmeans" could run properly. Is mahout 0.9 compatible with Hadoop 0.20?

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN wrote:

Thank you for your quick reply. As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps run faster than before, but the reduce was still stuck at 76% forever.

>> This has been my experience too, both with 0.8 and 0.9.
So, I uninstalled mahout 0.8 and installed mahout 0.9 in order to use the -rskm option. Mahout kmeans can be executed properly, so I think the installation of mahout 0.9 is successful.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the following errors. The Hadoop I installed is cdh5-beta1-mapreduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

It seems like you are trying to execute on Hadoop 2 while Mahout 0.9 has been built with the Hadoop 1.x profile, hence the error you are seeing. If you would like to test on Hadoop 2, work off of present trunk and build the code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=

Please give that a try.

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMe
Re: reduce is too slow in StreamingKmeans
The -rskm option works only in sequential mode and fails in MR; that's still an issue in present trunk that needs to be fixed. That should explain why Streaming KMeans with -rskm works only in sequential mode for you. Mahout 0.9 has been built with the Hadoop 1.2.1 profile; not sure if that's going to work with 0.20.

On Monday, March 17, 2014 9:50 PM, fx MA XIAOJUN wrote:

Thank you for your extremely quick reply.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

I want to try using -rskm in streaming kmeans, but in mahout 0.8, setting -rskm to true causes errors. I heard that the bug was fixed in 0.9, so I upgraded 0.8 -> 0.9.

The hadoop I installed is cdh5-MRv1, corresponding to hadoop 0.20, not hadoop 2.x (YARN). cdh5-MRv1 has a compatible version of mahout (mahout-0.8+cdh5.0.0b2+28) compiled by cloudera. So I uninstalled mahout-0.8+cdh5.0.0b2+28 and installed the apache mahout 0.9 distribution.

It turned out that "mahout kmeans" runs very well on mapreduce. However, "mahout streamingkmeans" runs properly in sequential mode but fails in mapreduce mode. If the problem were incompatibility between hadoop and mahout, I don't think "mahout kmeans" could run properly. Is mahout 0.9 compatible with Hadoop 0.20?

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN wrote:

Thank you for your quick reply. As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps run faster than before, but the reduce was still stuck at 76% forever.

>> This has been my experience too, both with 0.8 and 0.9.

So, I uninstalled mahout 0.8 and installed mahout 0.9 in order to use the -rskm option.
Mahout kmeans can be executed properly, so I think the installation of mahout 0.9 is successful.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the following errors. The Hadoop I installed is cdh5-beta1-mapreduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

It seems like you are trying to execute on Hadoop 2 while Mahout 0.9 has been built with the Hadoop 1.x profile, hence the error you are seeing. If you would like to test on Hadoop 2, work off of present trunk and build the code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=

Please give that a try.

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans and hence the slow performance that you have been experiencing. How did u c
RE: reduce is too slow in StreamingKmeans
Thank you for your extremely quick reply.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

I want to try using -rskm in streaming kmeans, but in mahout 0.8, setting -rskm to true causes errors. I heard that the bug was fixed in 0.9, so I upgraded 0.8 -> 0.9.

The hadoop I installed is cdh5-MRv1, corresponding to hadoop 0.20, not hadoop 2.x (YARN). cdh5-MRv1 has a compatible version of mahout (mahout-0.8+cdh5.0.0b2+28) compiled by cloudera. So I uninstalled mahout-0.8+cdh5.0.0b2+28 and installed the apache mahout 0.9 distribution.

It turned out that "mahout kmeans" runs very well on mapreduce. However, "mahout streamingkmeans" runs properly in sequential mode but fails in mapreduce mode. If the problem were incompatibility between hadoop and mahout, I don't think "mahout kmeans" could run properly. Is mahout 0.9 compatible with Hadoop 0.20?

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN wrote:

Thank you for your quick reply. As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps run faster than before, but the reduce was still stuck at 76% forever.

>> This has been my experience too, both with 0.8 and 0.9.

So, I uninstalled mahout 0.8 and installed mahout 0.9 in order to use the -rskm option. Mahout kmeans can be executed properly, so I think the installation of mahout 0.9 is successful.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the following errors. The Hadoop I installed is cdh5-beta1-mapreduce version 1.
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

It seems like you are trying to execute on Hadoop 2 while Mahout 0.9 has been built with the Hadoop 1.x profile, hence the error you are seeing. If you would like to test on Hadoop 2, work off of present trunk and build the code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=

Please give that a try.

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans, hence the slow performance that you have been experiencing.

How did you come up with -km 63000? Given that you would like 10,000 clusters (= k) and have 2,000,000 datapoints (= n), k * ln(n) = 10,000 * ln(2 * 10^6) = 145087 (rounded to the nearest integer), and that should be the value of -km in your case (km = k * ln(n), where ln is the natural log, not log10).

Not sure if that's going to fix your reduce being stuck at 76% forever, but it's definitely worth a try. If you would like go to wit
Re: Mahout parallel K-Means - algorithms analysis
You could take a look at org.apache.mahout.clustering.classify/ClusterClassificationMapper.

Enjoy,
Wei Shung

On Sat, Mar 15, 2014 at 2:51 PM, Suneel Marthi wrote:
> The clustering code is CIMapper and CIReducer. Following the clustering,
> there is cluster classification, which is mapper-only.
>
> Not sure about the reference paper; this stuff's been around for long, but
> the documentation for kmeans on mahout.apache.org should explain the approach.
>
> Sent from my iPhone
>
>> On Mar 15, 2014, at 5:36 PM, hiroshi leon wrote:
>>
>> Hello Ted,
>>
>> Thank you so much for your reply. The program that I was checking is the
>> KMeansDriver class with the run function, the buildCluster function in the
>> same class, and then the ClusterIterator class with the iterateMR function.
>>
>> I would like to know where I can check the code that is implemented
>> for the mapper and the reducer. Is it in CIMapper.class and CIReducer.class?
>>
>> Is there a research paper or pseudo-code on which Mahout parallel
>> K-means was based?
>>
>> Thank you so much and have a nice day.
>>
>> Best regards
>>
>>> From: ted.dunn...@gmail.com
>>> Date: Sat, 15 Mar 2014 13:56:56 -0700
>>> Subject: Re: Mahout parallel K-Means - algorithms analysis
>>> To: user@mahout.apache.org
>>>
>>> We would love to help.
>>>
>>> Can you say which program and which classes you are looking at?
>>>
>>> On Sat, Mar 15, 2014 at 12:58 PM, hiroshi leon <hiroshi_8...@hotmail.com> wrote:
>>>
>>>> To whom it may correspond,
>>>>
>>>> Hello, I have been checking the algorithm of Mahout 0.9 k-means
>>>> using MapReduce, and I would like to know where I can check the code of
>>>> what is happening inside the map function and in the reducer.
>>>>
>>>> I was debugging using NetBeans and I was not able to find what is exactly
>>>> implemented in the Map and Reduce functions...
>>>>
>>>> The reason I am doing this is that I would like to know what
>>>> is exactly implemented in Mahout 0.9 in order to see
>>>> which parts were optimized in the K-Means MapReduce algorithm.
>>>>
>>>> Do you know which research paper the Mahout K-means was based on, or
>>>> where can I read the pseudo-code?
>>>>
>>>> Thank you so much!
>>>>
>>>> Best regards!
>>>>
>>>> Hiroshi
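The CIMapper/CIReducer split discussed above follows the standard MapReduce formulation of Lloyd's k-means iteration. A toy sketch of that idea in plain Python (purely illustrative, not the actual Mahout classes): the "map" step assigns each point to its nearest centroid, and the "reduce" step averages each cluster's points to produce new centroids.

```python
# Toy sketch of one k-means MapReduce iteration (not Mahout code):
# map -> emit (cluster_id, point) for the nearest centroid,
# reduce -> recompute each centroid as the mean of its points.
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans_iteration(points, centroids):
    groups = defaultdict(list)
    for p in points:                      # "map" phase
        groups[nearest(p, centroids)].append(p)
    new_centroids = list(centroids)
    for cid, pts in groups.items():       # "reduce" phase
        new_centroids[cid] = [sum(xs) / len(pts) for xs in zip(*pts)]
    return new_centroids

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(kmeans_iteration(points, [[0.0, 0.0], [10.0, 10.0]]))
# -> [[0.0, 0.5], [10.0, 10.5]]
```

Mahout's driver repeats this map/reduce round until convergence or maxIterations, which is what ClusterIterator.iterateMR coordinates.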
Re: Normalization in Mahout
On Monday, March 17, 2014 8:10 AM, Bikash Gupta wrote:

Want to achieve a few things:

1. Normalize input data of clustering and classification algorithms

Not sure what you consider as normalization, but: if you are trying to normalize text, Lucene's analyzers do it while generating term vectors. If you are trying to normalize the term vectors for clustering, the distance measure specified while clustering normalizes the values appropriately based on the chosen distance measure.

2. Normalize output data to plot in a graph

The output from clustering is already normalized based on the specified distanceMeasure (all of the clustered points are).

On Mon, Mar 17, 2014 at 5:32 PM, Suneel Marthi wrote:
> What are you trying to do?
>
> On Monday, March 17, 2014 7:45 AM, Bikash Gupta wrote:
>
> Hi,
>
> Do we have any utility for Column and Row normalization in Mahout?
>
> --
> Thanks & Regards
> Bikash Gupta

--
Thanks & Regards
Bikash Kumar Gupta
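For the row/column normalization asked about in the original question, here is a minimal sketch of what such a utility would compute, using L2 norms on a small dense matrix. This is plain illustrative Python, not an existing Mahout API:

```python
# Minimal sketch of row and column L2 normalization for a dense matrix
# (illustrative only, not a Mahout utility).
import math

def normalize_rows(matrix):
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # guard against /0
        out.append([x / norm for x in row])
    return out

def normalize_columns(matrix):
    cols = list(zip(*matrix))             # transpose, normalize, transpose back
    return [list(r) for r in zip(*normalize_rows(cols))]

m = [[3.0, 0.0], [0.0, 4.0]]
print(normalize_rows(m))  # -> [[1.0, 0.0], [0.0, 1.0]]
```

In practice, the choice of norm (L1, L2, max) should match the distance measure used downstream, as the reply above points out.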
Re: Normalization in Mahout
Want to achieve a few things:

1. Normalize input data of clustering and classification algorithms
2. Normalize output data to plot in a graph

On Mon, Mar 17, 2014 at 5:32 PM, Suneel Marthi wrote:
> What are you trying to do?
>
> On Monday, March 17, 2014 7:45 AM, Bikash Gupta wrote:
>
> Hi,
>
> Do we have any utility for Column and Row normalization in Mahout?
>
> --
> Thanks & Regards
> Bikash Gupta

--
Thanks & Regards
Bikash Kumar Gupta
Re: Normalization in Mahout
What are you trying to do?

On Monday, March 17, 2014 7:45 AM, Bikash Gupta wrote:

Hi,

Do we have any utility for Column and Row normalization in Mahout?

--
Thanks & Regards
Bikash Gupta
Normalization in Mahout
Hi,

Do we have any utility for Column and Row normalization in Mahout?

--
Thanks & Regards
Bikash Gupta
Re: Problem with FileSystem in Kmeans
I have a 3-node cluster of CDH 4.6; however, I have built Mahout 0.9 with the Hadoop 2.x profile. I have also created a mount point for these nodes, and the path URI is the same as HDFS. I have manually configured the filesystem parameters:

conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

The input data (sequence file) and cluster centers (output of Canopy) are present in HDFS. After this I am executing KMeansDriver using ToolRunner, but got the error shown above. After debugging, I found that cluster-0 is created in the mount point and cluster-1 in HDFS if I don't provide a file system scheme. Once I provide the file system scheme, i.e. "hdfs://<<>>/", everything works like a charm.

On Mon, Mar 17, 2014 at 4:24 PM, Suneel Marthi wrote:
> Have not seen that behavior with KMeans; what were your settings again?
> Sorry, joining late onto this thread, hence have not looked at the entire history.
>
> On Monday, March 17, 2014 6:52 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>
> Suneel,
>
> Just for information, I haven't found this issue in Canopy. The Canopy
> cluster-0 was created in HDFS only.
>
> However, the KMeans cluster-0 was created in the local file system and
> cluster-1 in HDFS, and after that it spit an error as it was unable to
> locate cluster-0.
>
> On Mon, Mar 17, 2014 at 3:10 PM, Suneel Marthi wrote:
>
> This problem is specifically to do with Canopy clustering and is not an
> issue with KMeans. I had seen this behavior with Canopy, and looking at the
> code it's indeed an issue wherein cluster-0 is created on the local file
> system and the remaining clusters land on HDFS.
>
> Please file a JIRA for this if not already done so.
>
> On Wednesday, March 12, 2014 3:02 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>
> Hi,
>
> The problem is not with the input path; it's the way KMeans is executed. Let
> me explain.
>
> I have created CSV->Sequence using map-reduce, hence my data is in HDFS.
> After this I have run Canopy MR, hence that data is also in HDFS.
>
> Now these two things are pushed into the KMeans MR job.
>
> If you check the KMeansDriver class, it first tries to create the cluster-0
> folder with data; here, if you don't specify the scheme, it will write to
> the local file system. After that the MR job is started, which expects
> cluster-0 in HDFS.
>
> Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
> ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
> ClusterClassifier prior = new ClusterClassifier(clusters, policy);
> prior.writeToSeqFiles(priorClustersPath);
>
> if (runSequential) {
>   ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);
> } else {
>   ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
> }
>
> Let me know if I am not able to explain clearly.
>
> On Wed, Mar 12, 2014 at 11:53 AM, Sebastian Schelter wrote:
>
>> Hi Bikash,
>>
>> Have you tried adding hdfs:// to your input path? Maybe that helps.
>>
>> --sebastian
>>
>> On 03/11/2014 11:22 AM, Bikash Gupta wrote:
>>
>>> Hi,
>>>
>>> I am running KMeans in a cluster where I am setting the configuration of
>>> fs.hdfs.impl and fs.file.impl beforehand, as mentioned below:
>>>
>>> conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
>>> conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
>>>
>>> The problem is that the cluster-0 directory is created in the local file
>>> system and cluster-1 is created in HDFS, and the KMeans map-reduce job is
>>> unable to find cluster-0. Please see the stack trace below:
>>>
>>> 2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments:
>>> {--clustering=null, --clusters=[/3/clusters-0-final],
>>> --convergenceDelta=[0.1],
>>> --distanceMeasure=[org.apache.mahout.common.distance.EuclideanDistanceMeasure],
>>> --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100],
>>> --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0],
>>> --tempDir=[temp]}
>>> 2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>>> where applicable
>>> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence
>>> Clusters In: /3/clusters-0-final Out: /5
>>> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max
>>> Iterations: 100
>>> 2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser for
>>> parsing the arguments. Applications should implement Tool for the same.
>>> 2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths
>>> to process : 3
>>> 2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job:
>>> job_201403111332_0011
>>> 2014-03-11
Re: Problem with FileSystem in Kmeans
Have not seen that behavior with KMeans; what were your settings again? Sorry, joining late onto this thread, hence have not looked at the entire history.

On Monday, March 17, 2014 6:52 AM, Bikash Gupta wrote:

Suneel,

Just for information, I haven't found this issue in Canopy. The Canopy cluster-0 was created in HDFS only.

However, the KMeans cluster-0 was created in the local file system and cluster-1 in HDFS, and after that it spit an error as it was unable to locate cluster-0.

On Mon, Mar 17, 2014 at 3:10 PM, Suneel Marthi wrote:

> This problem is specifically to do with Canopy clustering and is not an
> issue with KMeans. I had seen this behavior with Canopy, and looking at the
> code it's indeed an issue wherein cluster-0 is created on the local file
> system and the remaining clusters land on HDFS.
>
> Please file a JIRA for this if not already done so.
>
> On Wednesday, March 12, 2014 3:02 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>
> Hi,
>
> The problem is not with the input path; it's the way KMeans is executed. Let
> me explain.
>
> I have created CSV->Sequence using map-reduce, hence my data is in HDFS.
> After this I have run Canopy MR, hence that data is also in HDFS.
>
> Now these two things are pushed into the KMeans MR job.
>
> If you check the KMeansDriver class, it first tries to create the cluster-0
> folder with data; here, if you don't specify the scheme, it will write to
> the local file system. After that the MR job is started, which expects
> cluster-0 in HDFS.
>
> Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
> ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
> ClusterClassifier prior = new ClusterClassifier(clusters, policy);
> prior.writeToSeqFiles(priorClustersPath);
>
> if (runSequential) {
>   ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);
> } else {
>   ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
> }
>
> Let me know if I am not able to explain clearly.
>
> On Wed, Mar 12, 2014 at 11:53 AM, Sebastian Schelter wrote:
>
>> Hi Bikash,
>>
>> Have you tried adding hdfs:// to your input path? Maybe that helps.
>>
>> --sebastian
>>
>> On 03/11/2014 11:22 AM, Bikash Gupta wrote:
>>
>>> Hi,
>>>
>>> I am running KMeans in a cluster where I am setting the configuration of
>>> fs.hdfs.impl and fs.file.impl beforehand, as mentioned below:
>>>
>>> conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
>>> conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
>>>
>>> The problem is that the cluster-0 directory is created in the local file
>>> system and cluster-1 is created in HDFS, and the KMeans map-reduce job is
>>> unable to find cluster-0. Please see the stack trace below:
>>>
>>> 2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments:
>>> {--clustering=null, --clusters=[/3/clusters-0-final],
>>> --convergenceDelta=[0.1],
>>> --distanceMeasure=[org.apache.mahout.common.distance.EuclideanDistanceMeasure],
>>> --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100],
>>> --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0],
>>> --tempDir=[temp]}
>>> 2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>>> where applicable
>>> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence
>>> Clusters In: /3/clusters-0-final Out: /5
>>> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max
>>> Iterations: 100
>>> 2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser for
>>> parsing the arguments. Applications should implement Tool for the same.
>>> 2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths
>>> to process : 3
>>> 2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job:
>>> job_201403111332_0011
>>> 2014-03-11 14:52:20 o.a.h.m.JobClient [INFO] map 0% reduce 0%
>>> 2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id :
>>> attempt_201403111332_0011_m_00_0, Status : FAILED
>>> 2014-03-11 14:52:28 STDIO [ERROR] java.lang.IllegalStateException: /5/clusters-0
>>>     at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:78)
>>>     at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208)
>>>     at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44)
>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138)
>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>     at java.security.AccessController.doPrivileged(Native Method)
Re: Problem with FileSystem in Kmeans
Suneel, just for information: I haven't seen this issue in Canopy. Canopy's cluster-0 was created in HDFS only. However, KMeans created cluster-0 on the local file system and cluster-1 in HDFS, and it then threw an error because it was unable to locate cluster-0.

On Mon, Mar 17, 2014 at 3:10 PM, Suneel Marthi wrote:

> This problem is specifically to do with Canopy clustering and is not an
> issue with KMeans. I had seen this behavior with Canopy, and looking at the
> code it's indeed an issue wherein cluster-0 is created on the local file
> system and the remaining clusters land on HDFS.
>
> Please file a JIRA for this if not already done so.
>
> On Wednesday, March 12, 2014 3:02 AM, Bikash Gupta <
> bikash.gupt...@gmail.com> wrote:
>
> Hi,
>
> The problem is not with the input path, it's the way KMeans is executed.
> Let me explain.
>
> I created the CSV->Sequence files using map-reduce, so my data is in HDFS.
> After this I ran Canopy MR, so that data is also in HDFS.
>
> Now these two things are pushed into the KMeans MR.
>
> If you check the KMeansDriver class, it first tries to create the cluster-0
> folder with data; if you don't specify the scheme here, it will write to
> the local file system. After that the MR job starts, which expects
> cluster-0 in HDFS.
>
> Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
> ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
> ClusterClassifier prior = new ClusterClassifier(clusters, policy);
> prior.writeToSeqFiles(priorClustersPath);
>
> if (runSequential) {
>   ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);
> } else {
>   ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
> }
>
> Let me know if I am not able to explain clearly.
>
> On Wed, Mar 12, 2014 at 11:53 AM, Sebastian Schelter wrote:
>
> > Hi Bikash,
> >
> > Have you tried adding hdfs:// to your input path? Maybe that helps.
> > > > --sebastian > > > > > > On 03/11/2014 11:22 AM, Bikash Gupta wrote: > > > >> Hi, > >> > >> I am running Kmeans in cluster where I am setting the configuration of > >> fs.hdfs.impl and fs.file.impl before hand as mentioned below > >> > >> conf.set("fs.hdfs.impl",org.apache.hadoop.hdfs. > >> DistributedFileSystem.class.getName()); > >> conf.set("fs.file.impl",org.apache.hadoop.fs. > >> LocalFileSystem.class.getName()); > >> > >> Problem is that cluster-0 directory is getting created in local file > >> system > >> and cluster-1 is getting created in HDFS, and Kmeans map reduce job is > >> unable to find cluster-0 . Please see below the stacktrace > >> > >> 2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments: > >> {--clustering=null, --clusters=[/3/clusters-0-final], > >> --convergenceDelta=[0.1], > >> --distanceMeasure=[org.apache.mahout.common.distance. > >> EuclideanDistanceMeasure], > >> --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100], > >> --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0], > >> --tempDir=[temp]} > >> 2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load > >> native-hadoop library for your platform... using builtin-java classes > >> where > >> applicable > >> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence > >> Clusters In: /3/clusters-0-final Out: /5 > >> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max > >> Iterations: 100 > >> 2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser > for > >> parsing the arguments. Applications should implement Tool for the same. 
> >> 2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths > >> to > >> process : 3 > >> 2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job: > >> job_201403111332_0011 > >> 2014-03-11 14:52:20 o.a.h.m.JobClient [INFO] map 0% reduce 0% > >> 2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id : > >> attempt_201403111332_0011_m_00_0, Status : FAILED > >> 2014-03-11 14:52:28 STDIO [ERROR] java.lang.IllegalStateException: > >> /5/clusters-0 > >> at > >> org.apache.mahout.common.iterator.sequencefile. > >> SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable. > >> java:78) > >> at > >> > org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles( > >> ClusterClassifier.java:208) > >> at > >> org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44) > >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138) > >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask. > >> java:672) > >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) > >> at org.apache.hadoop.mapred.Child$4.run(Child.java:268) > >> at java.security.AccessController.doPrivileged(Native Method) > >> at javax.security.auth.Subject.doAs(Subject.java:415) > >> at > >> org.apache.h
Re: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Are you running on Hadoop 2.x? That seems to be the case here. Compile with the hadoop 2 profile:

mvn -DskipTests clean install -Dhadoop2.profile=

On Monday, March 17, 2014 5:57 AM, Margusja wrote:

Hi

Here is my output:

[speech@h14 ~]$ mahout/bin/mahout seqdirectory -c UTF-8 -i /user/speech/demo -o demo-seqfiles
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /home/speech/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
14/03/17 11:47:30 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/speech/demo], --keyPrefix=[], --method=[mapreduce], --output=[demo-seqfiles], --startPhase=[0], --tempDir=[temp]}
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/17 11:47:31 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/03/17 11:47:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/03/17 11:47:32 INFO input.FileInputFormat: Total input paths to process : 10
14/03/17 11:47:32 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 4, size left: 29775
14/03/17 11:47:32 INFO mapreduce.JobSubmitter: number of splits:1
14/03/17 11:47:32 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.compress is deprecated.
Instead, use mapreduce.output.fileoutputformat.compress 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize 14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.working.dir is deprecated. 
Instead, use mapreduce.job.working.dir 14/03/17 11:47:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local42076163_0001 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/03/17 11:47:32 INFO mapreduce.Job: The url to track the job: http://localhost:8080/ 14/03/17 11:47:32 INFO mapreduce.Job: Running job: job_local42076163_0001 14/03/17 11:47:32 INFO ma
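For context on the error in this thread's subject line: it is a binary-compatibility break. org.apache.hadoop.mapreduce.TaskAttemptContext (and JobContext) were concrete classes in Hadoop 1.x but became interfaces in Hadoop 2.x, so a jar compiled against one generation fails at class-load time on the other. A minimal sketch of a reflection probe for which generation is on the classpath (this is an illustration, not Mahout code; it runs even without Hadoop jars, in which case it just reports absence):

```java
// Probe whether the Hadoop on the classpath is 1.x-style (class) or
// 2.x-style (interface). A job jar built against the wrong generation
// throws java.lang.IncompatibleClassChangeError, as seen in this thread.
public class HadoopProbe {
    public static void main(String[] args) {
        String name = "org.apache.hadoop.mapreduce.TaskAttemptContext";
        try {
            Class<?> c = Class.forName(name);
            System.out.println(name + " is "
                + (c.isInterface() ? "an interface (Hadoop 2.x)" : "a class (Hadoop 1.x)"));
        } catch (ClassNotFoundException e) {
            System.out.println("Hadoop is not on the classpath");
        }
    }
}
```

If the probe reports an interface, rebuild Mahout with the Hadoop 2 profile as suggested above.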
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Hi Here is my output: [speech@h14 ~]$ mahout/bin/mahout seqdirectory -c UTF-8 -i /user/speech/demo -o demo-seqfiles MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf MAHOUT-JOB: /home/speech/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar 14/03/17 11:47:30 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/speech/demo], --keyPrefix=[], --method=[mapreduce], --output=[demo-seqfiles], --startPhase=[0], --tempDir=[temp]} 14/03/17 11:47:31 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir 14/03/17 11:47:31 INFO Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress 14/03/17 11:47:31 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir 14/03/17 11:47:31 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id 14/03/17 11:47:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 14/03/17 11:47:32 INFO input.FileInputFormat: Total input paths to process : 10 14/03/17 11:47:32 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 4, size left: 29775 14/03/17 11:47:32 INFO mapreduce.JobSubmitter: number of splits:1 14/03/17 11:47:32 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.jar is deprecated. 
Instead, use mapreduce.job.jar 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name 14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize 14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class 14/03/17 11:47:32 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir 14/03/17 11:47:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local42076163_0001 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/03/17 11:47:32 INFO mapreduce.Job: The url to track the job: http://localhost:8080/ 14/03/17 11:47:32 INFO mapreduce.Job: Running job: job_local42076163_0001 14/03/17 11:47:32 INFO mapred.LocalJobRunner: OutputCommitter set in config null 14/03/17 11:47:33 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter 14/03/17 11:47:33
Re: Problem with FileSystem in Kmeans
This problem is specifically to do with Canopy clustering and is not an issue with KMeans. I had seen this behavior with Canopy, and looking at the code it's indeed an issue wherein cluster-0 is created on the local file system and the remaining clusters land on HDFS.

Please file a JIRA for this if not already done so.

On Wednesday, March 12, 2014 3:02 AM, Bikash Gupta wrote:

Hi,

The problem is not with the input path, it's the way KMeans is executed. Let me explain.

I created the CSV->Sequence files using map-reduce, so my data is in HDFS. After this I ran Canopy MR, so that data is also in HDFS.

Now these two things are pushed into the KMeans MR.

If you check the KMeansDriver class, it first tries to create the cluster-0 folder with data; if you don't specify the scheme here, it will write to the local file system. After that the MR job starts, which expects cluster-0 in HDFS.

Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
prior.writeToSeqFiles(priorClustersPath);

if (runSequential) {
  ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);
} else {
  ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
}

Let me know if I am not able to explain clearly.

On Wed, Mar 12, 2014 at 11:53 AM, Sebastian Schelter wrote:

> Hi Bikash,
>
> Have you tried adding hdfs:// to your input path? Maybe that helps.
>
> --sebastian
>
> On 03/11/2014 11:22 AM, Bikash Gupta wrote:
>
>> Hi,
>>
>> I am running Kmeans in a cluster where I am setting the configuration of
>> fs.hdfs.impl and fs.file.impl beforehand, as mentioned below
>>
>> conf.set("fs.hdfs.impl",org.apache.hadoop.hdfs.
>> DistributedFileSystem.class.getName());
>> conf.set("fs.file.impl",org.apache.hadoop.fs.
>> LocalFileSystem.class.getName()); >> >> Problem is that cluster-0 directory is getting created in local file >> system >> and cluster-1 is getting created in HDFS, and Kmeans map reduce job is >> unable to find cluster-0 . Please see below the stacktrace >> >> 2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments: >> {--clustering=null, --clusters=[/3/clusters-0-final], >> --convergenceDelta=[0.1], >> --distanceMeasure=[org.apache.mahout.common.distance. >> EuclideanDistanceMeasure], >> --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100], >> --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0], >> --tempDir=[temp]} >> 2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load >> native-hadoop library for your platform... using builtin-java classes >> where >> applicable >> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence >> Clusters In: /3/clusters-0-final Out: /5 >> 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max >> Iterations: 100 >> 2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser for >> parsing the arguments. Applications should implement Tool for the same. >> 2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths >> to >> process : 3 >> 2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job: >> job_201403111332_0011 >> 2014-03-11 14:52:20 o.a.h.m.JobClient [INFO] map 0% reduce 0% >> 2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id : >> attempt_201403111332_0011_m_00_0, Status : FAILED >> 2014-03-11 14:52:28 STDIO [ERROR] java.lang.IllegalStateException: >> /5/clusters-0 >> at >> org.apache.mahout.common.iterator.sequencefile. >> SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable. 
>> java:78) >> at >> org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles( >> ClusterClassifier.java:208) >> at >> org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44) >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138) >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask. >> java:672) >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) >> at org.apache.hadoop.mapred.Child$4.run(Child.java:268) >> at java.security.AccessController.doPrivileged(Native Method) >> at javax.security.auth.Subject.doAs(Subject.java:415) >> at >> org.apache.hadoop.security.UserGroupInformation.doAs( >> UserGroupInformation.java:1438) >> at org.apache.hadoop.mapred.Child.main(Child.java:262) >> Caused by: java.io.FileNotFoundException: File /5/clusters-0 >> >> Please suggest!!! >> >> >> > -- Thanks & Regards Bikash Kumar Gupta
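The scheme ambiguity Bikash describes (an unqualified path such as /5/clusters-0 landing on the local file system unless something supplies an hdfs:// scheme) can be sketched with plain java.net.URI, with no Hadoop dependency; Hadoop's Path resolution against the configured default filesystem behaves analogously. The authority namenode:8020 below is a placeholder, not from the thread:

```java
import java.net.URI;

// A path with no scheme is ambiguous: it resolves against whatever default
// filesystem the writing code sees. Qualifying it pins it to HDFS.
public class PathScheme {
    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://namenode:8020/"); // placeholder authority
        URI bare = URI.create("/5/clusters-0");              // no scheme: ambiguous
        URI qualified = defaultFs.resolve(bare);             // scheme + authority from the default
        System.out.println(qualified); // hdfs://namenode:8020/5/clusters-0
    }
}
```

This is why Sebastian's suggestion of prefixing hdfs:// to the paths works around the bug: fully qualified paths are immune to whichever default filesystem KMeansDriver happens to see when it writes cluster-0.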
Re: reduce is too slow in StreamingKmeans
On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN wrote:

Thank you for your quick reply. As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 76% forever.

>> This has been my experience too, both with 0.8 and 0.9.

So I uninstalled Mahout 0.8 and installed Mahout 0.9 in order to use the -rskm option. Mahout kmeans can be executed properly, so I think the installation of Mahout 0.9 is successful.

>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you
>> mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got errors as follows. The Hadoop I installed is cdh5-beta1-mapreduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Seems like you are trying to execute on Hadoop 2 while Mahout 0.9 has been built with the Hadoop 1.x profile, hence the error you are seeing. If you would like to test on Hadoop 2, work off of present trunk and build the code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=

Please give that a try.

-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans, hence the slow performance that you have been experiencing.

How did you come up with -km 63000? Given that you would like 10,000 clusters (= k) and have 2,000,000 datapoints (= n), then k * ln(n) = 10000 * ln(2 * 10^6) = 145087 (rounded to the nearest integer), and that should be the value of -km in your case (km = k * ln(n)).

Not sure if that's gonna fix your reduce being stuck at 76% forever, but it's definitely worth a try.

If you would like to go with the -rskm option, please upgrade to Mahout 0.9. I still think there's an issue with the -rskm option with Mahout 0.9 and trunk today while executing in MR mode, but it definitely works in the nonMR (-xm sequential) mode in 0.9.

On Monday, February 17, 2014 9:05 PM, Sylvia Ma wrote:

I am using Mahout 0.8 embedded in cdh5.0.0 provided by Cloudera and found that the reduce of mahout streamingkmeans is extremely slow.
For example: With a dataset of 2,000,000 objects, 128 variables, I would like to get 10,000 clusters. The command executed is the following.

mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

I have 15 maps which were all completed in 4 hours. However, reduce took over 100 hours and it was still stuck at 76%.

I have tuned the performance of hadoop as follows.

map task jvm = 3g
reduce task jvm = 10g
io.sort.mb = 512
io.sort.factor = 50
mapred.reduce.parallel.copies = 10
mapred.inmem.merge.threshold = 0

I tried to assign enough memory but the reduce is still very very very slow. Why does it take so much time in reduce? And what can I do to speed up the job? I wonder if it will be helpful to set -rskm to be true.
RE: reduce is too slow in StreamingKmeans
Thank you for your quick reply.

As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 76% forever.

So I uninstalled Mahout 0.8 and installed Mahout 0.9 in order to use the -rskm option. Mahout kmeans can be executed properly, so I think the installation of Mahout 0.9 is successful. However, when executing mahout streamingkmeans, I got errors as follows. The Hadoop I installed is cdh5-beta1-mapreduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans, hence the slow performance that you have been experiencing.

How did you come up with -km 63000? Given that you would like 10,000 clusters (= k) and have 2,000,000 datapoints (= n), then k * ln(n) = 10000 * ln(2 * 10^6) = 145087 (rounded to the nearest integer), and that should be the value of -km in your case (km = k * ln(n)).

Not sure if that's gonna fix your reduce being stuck at 76% forever, but it's definitely worth a try.

If you would like to go with the -rskm option, please upgrade to Mahout 0.9. I still think there's an issue with the -rskm option with Mahout 0.9 and trunk today while executing in MR mode, but it definitely works in the nonMR (-xm sequential) mode in 0.9.

On Monday, February 17, 2014 9:05 PM, Sylvia Ma wrote:

I am using Mahout 0.8 embedded in cdh5.0.0 provided by Cloudera and found that the reduce of mahout streamingkmeans is extremely slow.

For example: With a dataset of 2,000,000 objects, 128 variables, I would like to get 10,000 clusters. The command executed is the following.

mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

I have 15 maps which were all completed in 4 hours. However, reduce took over 100 hours and it was still stuck at 76%.

I have tuned the performance of hadoop as follows.
map task jvm = 3g
reduce task jvm = 10g
io.sort.mb = 512
io.sort.factor = 50
mapred.reduce.parallel.copies = 10
mapred.inmem.merge.threshold = 0

I tried to assign enough memory but the reduce is still very very very slow. Why does it take so much time in reduce? And what can I do to speed up the job? I wonder if it will be helpful to set -rskm to be true. The -rskm option has a bug in Mahout 0.8, so I cannot give it a try...

Yours Sincerely,
Sylvia Ma
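Suneel's -km arithmetic can be checked in a few lines. The value k = 10,000 is implied by the thread's own numbers: 63000 ≈ k * log10(2 * 10^6) (the original log10 mix-up) and 145087 = k * ln(2 * 10^6) (the corrected natural-log formula). A quick sketch:

```java
// Recomputing the -km suggestion from the thread: km = k * ln(n).
// k = 10,000 desired clusters, n = 2,000,000 points.
public class KmEstimate {
    public static void main(String[] args) {
        int k = 10_000;
        long n = 2_000_000L;
        long kmNatural = Math.round(k * Math.log(n));   // ln, what Suneel used
        long kmBase10  = Math.round(k * Math.log10(n)); // log10, the earlier mix-up
        System.out.println("k * ln(n)    = " + kmNatural); // 145087
        System.out.println("k * log10(n) = " + kmBase10);  // 63010, ~ the original -km 63000
    }
}
```

The two results reproduce both numbers in the thread, confirming that the original -km 63000 came from using log base 10 instead of the natural log.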