Re: running "clustering of synthetic control data"

Pavan K Narayanan Thu, 26 Sep 2013 23:20:08 -0700

Hi Daniele

I installed Mahout 0.8 on Hadoop 1.2.1 on a different Ubuntu 12.04 LTS machine
(Hadoop is configured properly and Mahout is running) and tried to run almost
all of the examples -- 20 newsgroups, Reuters, synthetic control data -- and I
am getting the following errors.


*For Reuters*: it got stuck on the reduce task for a long time, so I had to
break the operation using Ctrl+C.

bigdata@bigdata-OptiPlex-390:~/mahout-distribution-0.8/examples/bin$
./cluster-reuters.sh
Please select a number to choose the corresponding clustering algorithm
1. kmeans clustering
2. fuzzykmeans clustering
3. dirichlet clustering
4. lda clustering
5. minhash clustering
Enter your choice : 1
ok. You chose 1 and we'll use kmeans Clustering
creating work directory at /tmp/mahout-work-bigdata
Converting to Sequence Files from Directory
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /home/bigdata/hadoop-1.2.1/bin/hadoop and
HADOOP_CONF_DIR=/home/bigdata/hadoop-1.2.1/conf
MAHOUT-JOB:
/home/bigdata/mahout-distribution-0.8/mahout-examples-0.8-job.jar
Warning: $HADOOP_HOME is deprecated.

13/09/27 11:26:54 INFO common.AbstractJob: Command line arguments:
{--charset=[UTF-8], --chunkSize=[5], --endPhase=[2147483647],
--fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
--input=[/tmp/mahout-work-bigdata/reuters-out], --keyPrefix=[],
--method=[mapreduce],
--output=[/tmp/mahout-work-bigdata/reuters-out-seqdir], --startPhase=[0],
--tempDir=[temp]}
13/09/27 11:26:55 INFO input.FileInputFormat: Total input paths to process
: 3743
13/09/27 11:26:56 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
13/09/27 11:26:56 WARN snappy.LoadSnappy: Snappy native library not loaded
13/09/27 11:26:59 INFO mapred.JobClient: Running job: job_201309271028_0001
13/09/27 11:27:00 INFO mapred.JobClient:  map 0% reduce 0%
13/09/27 11:27:16 INFO mapred.JobClient:  map 46% reduce 0%
13/09/27 11:27:19 INFO mapred.JobClient:  map 78% reduce 0%
13/09/27 11:27:22 INFO mapred.JobClient:  map 100% reduce 0%
13/09/27 11:27:22 INFO mapred.JobClient: Job complete: job_201309271028_0001
13/09/27 11:27:22 INFO mapred.JobClient: Counters: 18
13/09/27 11:27:22 INFO mapred.JobClient:   Job Counters
13/09/27 11:27:22 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=18361
13/09/27 11:27:22 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
13/09/27 11:27:22 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
13/09/27 11:27:22 INFO mapred.JobClient:     Launched map tasks=1
13/09/27 11:27:22 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/09/27 11:27:22 INFO mapred.JobClient:   File Output Format Counters
13/09/27 11:27:22 INFO mapred.JobClient:     Bytes Written=1889543
13/09/27 11:27:22 INFO mapred.JobClient:   FileSystemCounters
13/09/27 11:27:22 INFO mapred.JobClient:     HDFS_BYTES_READ=3439773
13/09/27 11:27:22 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=57671
13/09/27 11:27:22 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1889543
13/09/27 11:27:22 INFO mapred.JobClient:   File Input Format Counters
13/09/27 11:27:22 INFO mapred.JobClient:     Bytes Read=0
13/09/27 11:27:22 INFO mapred.JobClient:   Map-Reduce Framework
13/09/27 11:27:22 INFO mapred.JobClient:     Map input records=3742
13/09/27 11:27:22 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=131174400
13/09/27 11:27:22 INFO mapred.JobClient:     Spilled Records=0
13/09/27 11:27:22 INFO mapred.JobClient:     CPU time spent (ms)=9920
13/09/27 11:27:22 INFO mapred.JobClient:     Total committed heap usage
(bytes)=116916224
13/09/27 11:27:22 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=1074016256
13/09/27 11:27:22 INFO mapred.JobClient:     Map output records=3742
13/09/27 11:27:22 INFO mapred.JobClient:     SPLIT_RAW_BYTES=362622
13/09/27 11:27:22 INFO driver.MahoutDriver: Program took 28377 ms (Minutes:
0.47295)
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /home/bigdata/hadoop-1.2.1/bin/hadoop and
HADOOP_CONF_DIR=/home/bigdata/hadoop-1.2.1/conf
MAHOUT-JOB:
/home/bigdata/mahout-distribution-0.8/mahout-examples-0.8-job.jar
Warning: $HADOOP_HOME is deprecated.

13/09/27 11:27:25 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum
n-gram size is: 1
13/09/27 11:27:25 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum
LLR value: 1.0
13/09/27 11:27:25 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of
reduce tasks: 1
13/09/27 11:27:25 INFO vectorizer.SparseVectorsFromSequenceFiles:
Tokenizing documents in /tmp/mahout-work-bigdata/reuters-out-seqdir
13/09/27 11:27:26 INFO input.FileInputFormat: Total input paths to process
: 1
13/09/27 11:27:26 INFO mapred.JobClient: Running job: job_201309271028_0002
13/09/27 11:27:27 INFO mapred.JobClient:  map 0% reduce 0%
13/09/27 11:27:36 INFO mapred.JobClient:  map 100% reduce 0%
13/09/27 11:27:36 INFO mapred.JobClient: Job complete: job_201309271028_0002
13/09/27 11:27:36 INFO mapred.JobClient: Counters: 19
13/09/27 11:27:36 INFO mapred.JobClient:   Job Counters
13/09/27 11:27:36 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=5689
13/09/27 11:27:36 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
13/09/27 11:27:36 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
13/09/27 11:27:36 INFO mapred.JobClient:     Launched map tasks=1
13/09/27 11:27:36 INFO mapred.JobClient:     Data-local map tasks=1
13/09/27 11:27:36 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/09/27 11:27:36 INFO mapred.JobClient:   File Output Format Counters
13/09/27 11:27:36 INFO mapred.JobClient:     Bytes Written=2640631
13/09/27 11:27:36 INFO mapred.JobClient:   FileSystemCounters
13/09/27 11:27:36 INFO mapred.JobClient:     HDFS_BYTES_READ=1889686
13/09/27 11:27:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=57246
13/09/27 11:27:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2640631
13/09/27 11:27:36 INFO mapred.JobClient:   File Input Format Counters
13/09/27 11:27:36 INFO mapred.JobClient:     Bytes Read=1889543
13/09/27 11:27:36 INFO mapred.JobClient:   Map-Reduce Framework
13/09/27 11:27:36 INFO mapred.JobClient:     Map input records=3742
13/09/27 11:27:36 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=125198336
13/09/27 11:27:36 INFO mapred.JobClient:     Spilled Records=0
13/09/27 11:27:36 INFO mapred.JobClient:     CPU time spent (ms)=1580
13/09/27 11:27:36 INFO mapred.JobClient:     Total committed heap usage
(bytes)=123797504
13/09/27 11:27:36 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=1074016256
13/09/27 11:27:36 INFO mapred.JobClient:     Map output records=3742
13/09/27 11:27:36 INFO mapred.JobClient:     SPLIT_RAW_BYTES=143
13/09/27 11:27:36 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating
Term Frequency Vectors
13/09/27 11:27:36 INFO vectorizer.DictionaryVectorizer: Creating dictionary
from
/tmp/mahout-work-bigdata/reuters-out-seqdir-sparse-kmeans/tokenized-documents
and saving at
/tmp/mahout-work-bigdata/reuters-out-seqdir-sparse-kmeans/wordcount
13/09/27 11:27:36 INFO input.FileInputFormat: Total input paths to process
: 1
13/09/27 11:27:38 INFO mapred.JobClient: Running job: job_201309271028_0003
13/09/27 11:27:39 INFO mapred.JobClient:  map 0% reduce 0%
13/09/27 11:27:46 INFO mapred.JobClient:  map 100% reduce 0%

^Cbigdata@bigdata-OptiPlex-390:~/mahout-distribution-0.8/examples/bin$
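
One thing that may be worth checking here, offered as a sketch rather than a diagnosis: pressing Ctrl+C only stops the client script, so a job that was already submitted to the JobTracker can keep running on the cluster. The snippet below uses the Hadoop 1.x `hadoop job` subcommands to list running jobs and print the status of one of them. The `HADOOP` and `JOB_ID` variables and the `inspect_job` helper are names made up for illustration; the default job id is the one from the log above.

```shell
# Sketch: look at a job that may still be running after Ctrl+C.
HADOOP="${HADOOP:-hadoop}"                 # assumption: hadoop is on PATH
JOB_ID="${JOB_ID:-job_201309271028_0003}"  # job id taken from the log above

inspect_job() {
  "$HADOOP" job -list          # jobs the JobTracker still considers running
  "$HADOOP" job -status "$1"   # completion percentages and counters for one job
}

inspect_job "$JOB_ID" || echo "could not query the JobTracker"

# To stop a leftover job for real (not executed here):
#   "$HADOOP" job -kill "$JOB_ID"
```

If the job still shows up in the list with the reduce phase at 0%, the TaskTracker logs for that job are probably the next place to look.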

*Synthetic control data*: the script reports that Hadoop is not running
(Hadoop was running; I ran the jps command to check and also set the
classpath once again).

bigdata@bigdata-OptiPlex-390:~/mahout-distribution-0.8/examples/bin$
./cluster-syntheticcontrol.sh
Please select a number to choose the corresponding clustering algorithm
1. canopy clustering
2. kmeans clustering
3. fuzzykmeans clustering
4. dirichlet clustering
5. meanshift clustering
Enter your choice : 2
ok. You chose 2 and we'll use kmeans Clustering
creating work directory at /tmp/mahout-work-bigdata
Downloading Synthetic control data
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  281k  100  281k    0     0  64314      0  0:00:04  0:00:04 --:--:-- 82742
Checking the health of DFS...
Warning: $HADOOP_HOME is deprecated.

ls: Cannot access .: No such file or directory.
 HADOOP is not running. Please make sure you hadoop is running.
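
A possible explanation, stated as an assumption and not a confirmed diagnosis: the `ls: Cannot access .` line suggests that the health check lists the current user's HDFS home directory (`/user/<username>`), so the check can fail when that directory does not exist yet, even though the daemons are up. The sketch below repeats such a check by hand and prints the `mkdir` command that would create the directory. The `HADOOP` variable and the `dfs_home_ok` helper are hypothetical names used only for illustration.

```shell
# Sketch: repeat the DFS health check by hand.
HADOOP="${HADOOP:-hadoop}"   # assumption: hadoop is on PATH

dfs_home_ok() {
  # With no path argument, "fs -ls" lists "." which HDFS resolves
  # to the user's home directory, /user/<username>.
  "$HADOOP" fs -ls >/dev/null 2>&1
}

if dfs_home_ok; then
  echo "HDFS home directory is reachable"
else
  echo "HDFS home directory is missing or DFS is down"
  echo "if DFS is up, this may help: $HADOOP fs -mkdir /user/$(whoami)"
fi
```

If the listing succeeds after the home directory has been created, rerunning ./cluster-syntheticcontrol.sh should show whether this was the cause.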


I would appreciate your help.

Regards
Pavan


On 26 September 2013 17:13, Darius Miliauskas <[email protected]> wrote:

> Dear Pavan,
>
> There is a newer release of Mahout (0.8). Why do you use 0.6? Have you
> tried ./build-cluster-syntheticcontrol.sh or
> ./cluster-syntheticcontrol.sh from
> ../mahout-distribution-0.8/examples/bin?
>
>
> Ciao,
>
> Darius
>
>
> 2013/9/26 Pavan K Narayanan <[email protected]>
>
> >  Folks,
> >
> > I am currently attempting to run the Synthetic_control data example on
> > Mahout. I have installed Hadoop-1.2.1 and Mahout 0.6 in my Ubuntu.
> >
> > I prepared the following Hadoop runtime configuration file to set all the
> > required paths. These are the contents of hadooprc.sh:
> >
> > *export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386
> > export HADOOP_HOME=/home/hduser/hadoop-1.2.1
> > export MAHOUT_HOME=/home/hduser/mahout-distribution-0.6
> > export PATH=$MAHOUT_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
> > export CLASSPATH=$JAVA_HOME:/home/hduser/hadoop-1.2.1/hadoop-core-1.2.1.jar:$MAHOUT_HOME/mahout-core-0.6.jar
> > *
> > I also tried the following runtime configuration file, as suggested by
> > Saeed Iqbal's blog:
> >
> > *export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386
> > export HADOOP_HOME=/home/hduser/hadoop-1.2.1
> > export HADOOP_CONF_DIR=/home/hduser/hadoop-1.2.1/conf
> > export MAHOUT_HOME=/home/hduser/mahout-distribution-0.6/bin
> > export PATH=$PATH:$MAHOUT_HOME*
> >
> > The above file initializes Mahout, and I followed the commands below to
> > write the synthetic control data into HDFS, taken from this link:
> >
> > https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data
> >
> > $HADOOP_HOME/bin/hadoop fs -mkdir testdata
> > $HADOOP_HOME/bin/hadoop fs -put <PATH TO synthetic_control.data> testdata
> >
> > The mvn clean install step gave me a build failure error, but when I typed
> > maven -version it showed that the latest Maven is installed.
> >
> > I tried to enter this command from mahout_home/bin
> > org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
> > and got the following error:
> >
> > org.apache.mahout.clustering.syntheticcontrol.kmeans.Job command not found
> >
> > Can anyone tell me where I am going wrong and how to fix this? I would
> > really appreciate your help.
> >
> > Regards
> > Pavan
> >
>
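
On the `command not found` error in the quoted message: a Java class name typed at the prompt is not a shell command, so the shell cannot run it. The wiki page linked above passes the class name to the `bin/mahout` launcher instead. The sketch below shows that form; the default `MAHOUT_HOME` path and the `run_job` helper are assumptions made for illustration and need adjusting to the actual install.

```shell
# Sketch: run the synthetic control k-means example through the launcher.
MAHOUT_HOME="${MAHOUT_HOME:-$HOME/mahout-distribution-0.8}"   # assumed install path
JOB_CLASS=org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

run_job() {
  # The launcher script receives the class name as its first argument.
  "$MAHOUT_HOME/bin/mahout" "$JOB_CLASS" "$@"
}

if [ -x "$MAHOUT_HOME/bin/mahout" ]; then
  run_job
else
  echo "no launcher at $MAHOUT_HOME/bin/mahout; adjust MAHOUT_HOME"
fi
```

Note that `MAHOUT_HOME` here points at the distribution directory itself, not at its `bin` subdirectory.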
