Hi,

Admittedly my system was extremely slow, so I can see why such an error would have come up with my Reuters example.
I apologize for posting the code without having run it properly, but I have been checking and rectifying my mistakes. Yet I am still getting this error:

*java.lang.IllegalStateException: No clusters found. Check your -c path.
Exception in thread "main" java.lang.InterruptedException: K-Means Iteration failed processing /tmp/mahout-work-hduser/reuters-kmeans-clusters/part-randomSeed*

I was running the training example under examples/bin/cluster-reuters.sh. I appreciate your help.

Regards

On 27 September 2013 11:48, Pavan K Narayanan <[email protected]> wrote:

> Hi Daniele
>
> I installed Mahout 0.8 with Hadoop 1.2.1 on a different Ubuntu 12.04 LTS
> machine (Hadoop is configured properly and Mahout is running), and I tried
> to run almost all of the examples -- 20 newsgroups, Reuters, synthetic
> control data -- getting the following errors.
>
> *For Reuters*: it got stuck on the reduce task for a long time, so I had
> to break the operation using Ctrl+C.
>
> bigdata@bigdata-OptiPlex-390:~/mahout-distribution-0.8/examples/bin$ ./cluster-reuters.sh
> Please select a number to choose the corresponding clustering algorithm
> 1. kmeans clustering
> 2. fuzzykmeans clustering
> 3. dirichlet clustering
> 4. lda clustering
> 5. minhash clustering
> Enter your choice : 1
> ok. You chose 1 and we'll use kmeans Clustering
> creating work directory at /tmp/mahout-work-bigdata
> Converting to Sequence Files from Directory
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Warning: $HADOOP_HOME is deprecated.
>
> Running on hadoop, using /home/bigdata/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=/home/bigdata/hadoop-1.2.1/conf
> MAHOUT-JOB: /home/bigdata/mahout-distribution-0.8/mahout-examples-0.8-job.jar
> Warning: $HADOOP_HOME is deprecated.
>
> 13/09/27 11:26:54 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[5], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/tmp/mahout-work-bigdata/reuters-out], --keyPrefix=[], --method=[mapreduce], --output=[/tmp/mahout-work-bigdata/reuters-out-seqdir], --startPhase=[0], --tempDir=[temp]}
> 13/09/27 11:26:55 INFO input.FileInputFormat: Total input paths to process : 3743
> 13/09/27 11:26:56 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 13/09/27 11:26:56 WARN snappy.LoadSnappy: Snappy native library not loaded
> 13/09/27 11:26:59 INFO mapred.JobClient: Running job: job_201309271028_0001
> 13/09/27 11:27:00 INFO mapred.JobClient:  map 0% reduce 0%
> 13/09/27 11:27:16 INFO mapred.JobClient:  map 46% reduce 0%
> 13/09/27 11:27:19 INFO mapred.JobClient:  map 78% reduce 0%
> 13/09/27 11:27:22 INFO mapred.JobClient:  map 100% reduce 0%
> 13/09/27 11:27:22 INFO mapred.JobClient: Job complete: job_201309271028_0001
> 13/09/27 11:27:22 INFO mapred.JobClient: Counters: 18
> 13/09/27 11:27:22 INFO mapred.JobClient:   Job Counters
> 13/09/27 11:27:22 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=18361
> 13/09/27 11:27:22 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> 13/09/27 11:27:22 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> 13/09/27 11:27:22 INFO mapred.JobClient:     Launched map tasks=1
> 13/09/27 11:27:22 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 13/09/27 11:27:22 INFO mapred.JobClient:   File Output Format Counters
> 13/09/27 11:27:22 INFO mapred.JobClient:     Bytes Written=1889543
> 13/09/27 11:27:22 INFO mapred.JobClient:   FileSystemCounters
> 13/09/27 11:27:22 INFO mapred.JobClient:     HDFS_BYTES_READ=3439773
> 13/09/27 11:27:22 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=57671
> 13/09/27 11:27:22 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1889543
> 13/09/27 11:27:22 INFO mapred.JobClient:   File Input Format Counters
> 13/09/27 11:27:22 INFO mapred.JobClient:     Bytes Read=0
> 13/09/27 11:27:22 INFO mapred.JobClient:   Map-Reduce Framework
> 13/09/27 11:27:22 INFO mapred.JobClient:     Map input records=3742
> 13/09/27 11:27:22 INFO mapred.JobClient:     Physical memory (bytes) snapshot=131174400
> 13/09/27 11:27:22 INFO mapred.JobClient:     Spilled Records=0
> 13/09/27 11:27:22 INFO mapred.JobClient:     CPU time spent (ms)=9920
> 13/09/27 11:27:22 INFO mapred.JobClient:     Total committed heap usage (bytes)=116916224
> 13/09/27 11:27:22 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1074016256
> 13/09/27 11:27:22 INFO mapred.JobClient:     Map output records=3742
> 13/09/27 11:27:22 INFO mapred.JobClient:     SPLIT_RAW_BYTES=362622
> 13/09/27 11:27:22 INFO driver.MahoutDriver: Program took 28377 ms (Minutes: 0.47295)
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Warning: $HADOOP_HOME is deprecated.
>
> Running on hadoop, using /home/bigdata/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=/home/bigdata/hadoop-1.2.1/conf
> MAHOUT-JOB: /home/bigdata/mahout-distribution-0.8/mahout-examples-0.8-job.jar
> Warning: $HADOOP_HOME is deprecated.
>
> 13/09/27 11:27:25 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
> 13/09/27 11:27:25 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
> 13/09/27 11:27:25 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
> 13/09/27 11:27:25 INFO vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in /tmp/mahout-work-bigdata/reuters-out-seqdir
> 13/09/27 11:27:26 INFO input.FileInputFormat: Total input paths to process : 1
> 13/09/27 11:27:26 INFO mapred.JobClient: Running job: job_201309271028_0002
> 13/09/27 11:27:27 INFO mapred.JobClient:  map 0% reduce 0%
> 13/09/27 11:27:36 INFO mapred.JobClient:  map 100% reduce 0%
> 13/09/27 11:27:36 INFO mapred.JobClient: Job complete: job_201309271028_0002
> 13/09/27 11:27:36 INFO mapred.JobClient: Counters: 19
> 13/09/27 11:27:36 INFO mapred.JobClient:   Job Counters
> 13/09/27 11:27:36 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=5689
> 13/09/27 11:27:36 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> 13/09/27 11:27:36 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> 13/09/27 11:27:36 INFO mapred.JobClient:     Launched map tasks=1
> 13/09/27 11:27:36 INFO mapred.JobClient:     Data-local map tasks=1
> 13/09/27 11:27:36 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 13/09/27 11:27:36 INFO mapred.JobClient:   File Output Format Counters
> 13/09/27 11:27:36 INFO mapred.JobClient:     Bytes Written=2640631
> 13/09/27 11:27:36 INFO mapred.JobClient:   FileSystemCounters
> 13/09/27 11:27:36 INFO mapred.JobClient:     HDFS_BYTES_READ=1889686
> 13/09/27 11:27:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=57246
> 13/09/27 11:27:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2640631
> 13/09/27 11:27:36 INFO mapred.JobClient:   File Input Format Counters
> 13/09/27 11:27:36 INFO mapred.JobClient:     Bytes Read=1889543
> 13/09/27 11:27:36 INFO mapred.JobClient:   Map-Reduce Framework
> 13/09/27 11:27:36 INFO mapred.JobClient:     Map input records=3742
> 13/09/27 11:27:36 INFO mapred.JobClient:     Physical memory (bytes) snapshot=125198336
> 13/09/27 11:27:36 INFO mapred.JobClient:     Spilled Records=0
> 13/09/27 11:27:36 INFO mapred.JobClient:     CPU time spent (ms)=1580
> 13/09/27 11:27:36 INFO mapred.JobClient:     Total committed heap usage (bytes)=123797504
> 13/09/27 11:27:36 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1074016256
> 13/09/27 11:27:36 INFO mapred.JobClient:     Map output records=3742
> 13/09/27 11:27:36 INFO mapred.JobClient:     SPLIT_RAW_BYTES=143
> 13/09/27 11:27:36 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
> 13/09/27 11:27:36 INFO vectorizer.DictionaryVectorizer: Creating dictionary from /tmp/mahout-work-bigdata/reuters-out-seqdir-sparse-kmeans/tokenized-documents and saving at /tmp/mahout-work-bigdata/reuters-out-seqdir-sparse-kmeans/wordcount
> 13/09/27 11:27:36 INFO input.FileInputFormat: Total input paths to process : 1
> 13/09/27 11:27:38 INFO mapred.JobClient: Running job: job_201309271028_0003
> 13/09/27 11:27:39 INFO mapred.JobClient:  map 0% reduce 0%
> 13/09/27 11:27:46 INFO mapred.JobClient:  map 100% reduce 0%
>
> ^Cbigdata@bigdata-OptiPlex-390:~/mahout-distribution-0.8/examples/bin$
>
> *Synthetic control data*: the script reports that Hadoop is not running,
> although Hadoop was in fact running (I ran the jps command to confirm, and
> also set the classpath once again).
>
> bigdata@bigdata-OptiPlex-390:~/mahout-distribution-0.8/examples/bin$ ./cluster-syntheticcontrol.sh
> Please select a number to choose the corresponding clustering algorithm
> 1. canopy clustering
> 2. kmeans clustering
> 3. fuzzykmeans clustering
> 4. dirichlet clustering
> 5. meanshift clustering
> Enter your choice : 2
> ok. You chose 2 and we'll use kmeans Clustering
> creating work directory at /tmp/mahout-work-bigdata
> Downloading Synthetic control data
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100  281k  100  281k    0     0  64314      0  0:00:04  0:00:04 --:--:-- 82742
> Checking the health of DFS...
> Warning: $HADOOP_HOME is deprecated.
>
> ls: Cannot access .: No such file or directory.
> HADOOP is not running. Please make sure you hadoop is running.
>
> I appreciate your help.
>
> Regards
> Pavan
>
> On 26 September 2013 17:13, Darius Miliauskas <[email protected]> wrote:
>
>> Dear Pavan,
>>
>> There is a newer release of Mahout (0.8). Why do you use 0.6? Have you
>> tried ./build-cluster-syntheticcontrol.sh or ./cluster-syntheticcontrol.sh
>> from ../mahout-distribution-0.8/examples/bin?
>>
>> Ciao,
>>
>> Darius
>>
>> 2013/9/26 Pavan K Narayanan <[email protected]>
>>
>> > Folks,
>> >
>> > I am currently attempting to run the synthetic_control data example on
>> > Mahout. I have installed Hadoop 1.2.1 and Mahout 0.6 on my Ubuntu machine.
>> >
>> > I prepared the following Hadoop runtime configuration file to set all
>> > the paths required.
>> > The following are the contents of hadooprc.sh:
>> >
>> > *export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386
>> > export HADOOP_HOME=/home/hduser/hadoop-1.2.1
>> > export MAHOUT_HOME=/home/hduser/mahout-distribution-0.6
>> > export PATH=$MAHOUT_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
>> > export CLASSPATH=$JAVA_HOME:/home/hduser/hadoop-1.2.1/hadoop-core-1.2.1.jar:$MAHOUT_HOME/mahout-core-0.6.jar*
>> >
>> > I also tried the following runtime configuration file, as suggested by
>> > Saeed Iqbal's blog:
>> >
>> > *export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386
>> > export HADOOP_HOME=/home/hduser/hadoop-1.2.1
>> > export HADOOP_CONF_DIR=/home/hduser/hadoop-1.2.1/conf
>> > export MAHOUT_HOME=/home/hduser/mahout-distribution-0.6/bin
>> > export PATH=$PATH:$MAHOUT_HOME*
>> >
>> > The above file initializes Mahout. I then followed the commands below,
>> > from
>> > https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data,
>> > to write the synthetic control data into HDFS:
>> >
>> > $HADOOP_HOME/bin/hadoop fs -mkdir testdata
>> > $HADOOP_HOME/bin/hadoop fs -put <PATH TO synthetic_control.data> testdata
>> >
>> > mvn clean install gave me a build failure error, but when I typed
>> > maven -version I got the latest Maven installed.
>> >
>> > I tried to enter this command from $MAHOUT_HOME/bin:
>> >
>> > org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
>> >
>> > and got the following error:
>> >
>> > org.apache.mahout.clustering.syntheticcontrol.kmeans.Job command not found
>> >
>> > Can anyone tell me where I am going wrong and how to fix it? I really
>> > appreciate your help.
>> >
>> > Regards,
>> > Pavan
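A note on the "command not found" error at the bottom of the thread: `org.apache.mahout.clustering.syntheticcontrol.kmeans.Job` is a Java class name, not an executable, so the shell cannot find it. The examples are normally launched through the `mahout` driver script, or by submitting the examples job jar to Hadoop. A sketch, assuming the Mahout 0.6 layout and environment variables from the thread (the jar name and paths may differ in your distribution):

```shell
# Option 1: let the mahout driver script resolve and run the example class
$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

# Option 2: submit the examples job jar to Hadoop directly
$HADOOP_HOME/bin/hadoop jar \
  $MAHOUT_HOME/mahout-examples-0.6-job.jar \
  org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
```

Both variants expect the input to already be in HDFS under testdata, as in the fs -put step quoted above.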

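A closing note on the *No clusters found. Check your -c path* error at the top of the thread: k-means refuses to iterate when the `-c` directory contains no initial cluster seeds (the `part-randomSeed` file the script normally writes there before iterating). Before re-running, it can help to confirm that directory is actually populated. A minimal sketch for a local run; `check_clusters` is my own hypothetical helper, and the path is just the work directory from the error message:

```shell
# check_clusters: hypothetical helper that reports whether a k-means -c
# directory contains any part files (i.e. initial cluster seeds).
check_clusters() {
  dir="$1"
  if ls "$dir"/part-* >/dev/null 2>&1; then
    echo "clusters present in $dir"
  else
    echo "no clusters in $dir"
  fi
}

# For a local run; if the work dir lives on HDFS, inspect it instead with:
#   hadoop fs -ls /tmp/mahout-work-hduser/reuters-kmeans-clusters
check_clusters /tmp/mahout-work-hduser/reuters-kmeans-clusters
```

If the directory turns out to be empty, the earlier vectorization step most likely failed or wrote to a different work directory, so the seeding step had nothing to sample from.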