Hi all,

I have a problem with clustering Wikipedia articles.
Here is my problem description ( http://stackoverflow.com/questions/11881865/mahout-seqwiki-problems ):

I've downloaded http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

set up Hadoop on a single machine (pseudo-distributed).

downloaded Mahout from git://git.apache.org/mahout.git and built it.
vectorized the articles:

    mahout seqwiki -c categories.txt -i wiki/enwiki-latest-pages-articles.xml -o wiki/seqfiles
    mahout seq2sparse -i wiki/seqfiles -o wiki/vectors-bigram -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq

and tried to cluster them using the k-means algorithm:

    mahout kmeans -i wiki/vectors-bigram/tfidf-vectors/ -c wiki/kmeans-centroids -o wiki/kmeans-clusters -cd 1.0 -k 20 -x 20 -cl -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure

and got an exception:

    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using /home/valden/hadoop-1.0.3/bin//hadoop and HADOOP_CONF_DIR=/home/valden/hadoop-1.0.3/conf
    MAHOUT-JOB: /home/valden/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar
    12/08/08 20:03:43 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[wiki/kmeans-centroids], --convergenceDelta=[1.0], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[wiki/vectors-bigram/tfidf-vectors/], --maxIter=[20], --method=[mapreduce], --numClusters=[20], --output=[wiki/kmeans-clusters], --startPhase=[0], --tempDir=[temp]}
    12/08/08 20:03:44 INFO common.HadoopUtil: Deleting wiki/kmeans-centroids
    12/08/08 20:03:44 INFO kmeans.RandomSeedGenerator: >>> inputPathPattern: wiki/vectors-bigram/tfidf-vectors/*
    12/08/08 20:03:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    12/08/08 20:03:44 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
    12/08/08 20:03:44 INFO compress.CodecPool: Got brand-new compressor
    12/08/08 20:03:44 INFO kmeans.RandomSeedGenerator: >>> inputFilesN: 1
    12/08/08 20:03:44 INFO kmeans.RandomSeedGenerator: >>>> f: hdfs://localhost:9000/user/valden/wiki/vectors-bigram/tfidf-vectors/part-r-00000
    12/08/08 20:03:44 INFO kmeans.RandomSeedGenerator: Wrote 0 Klusters to wiki/kmeans-centroids/part-randomSeed
    12/08/08 20:03:44 INFO kmeans.KMeansDriver: Input: wiki/vectors-bigram/tfidf-vectors Clusters In: wiki/kmeans-centroids/part-randomSeed Out: wiki/kmeans-clusters Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
    12/08/08 20:03:44 INFO kmeans.KMeansDriver: convergence: 1.0 max Iterations: 20 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
    12/08/08 20:03:44 INFO compress.CodecPool: Got brand-new decompressor
    Exception in thread "main" java.lang.IllegalStateException: No input clusters found in wiki/kmeans-centroids/part-randomSeed. Check your -c argument.
        at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:217)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:148)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:107)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:48)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


All INFO lines containing ">>>" are my custom logging. With it I determined that org.apache.mahout.clustering.kmeans.RandomSeedGenerator was not able to load any vectors from wiki/vectors-bigram/tfidf-vectors/part-r-00000. Why? I do not know. Since no vectors were loaded, no random centroids were generated either. After some further tests I determined that wiki/seqfiles/part-r-00000 does not contain any records, i.e. seqwiki produced wrong data. I checked that with this code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Writable;
    import org.apache.mahout.common.Pair;
    import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
    import org.apache.mahout.math.VectorWritable;

    private static void readData(Configuration conf, Path in) {
        log.info("readData...");
        int n = 0;
        for (Pair<Writable, VectorWritable> record
                : new SequenceFileIterable<Writable, VectorWritable>(in, false, conf)) {
            // THIS loop was never entered
            Writable key = record.getFirst();
            VectorWritable value = record.getSecond();
            ++n;
            log.info(">>> " + Integer.toString(n));
        }
    }
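(As a quicker sanity check, and assuming I'm using the tool correctly, Mahout's seqdumper utility should be able to count the records in the sequence file directly; a healthy file would report a non-zero count:)

```shell
# Sketch: count the records in the sequence file produced by seqwiki.
# Assumes mahout is on the PATH and the same HADOOP_CONF_DIR as above.
mahout seqdumper -i wiki/seqfiles/part-r-00000 -c
```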


I've also determined that the last Mahout-Examples-Cluster-Reuters build ( https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/lastBuild/console ) hits the same problem:

.......................

Exception in thread "main" java.lang.IllegalStateException: No input clusters found in /tmp/mahout-work-hudson/reuters-kmeans-clusters/part-randomSeed. Check your -c argument.
    at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:217)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:148)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:107)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:48)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
Build step 'Execute shell' marked build as failure

...............................

Is there a JIRA issue for this bug?
I'd appreciate any help with this.


--
Best regards,
--
Denys Valchuk
Skype: dvalchuk
cell: +380664059110
