Hi all,
I have a problem with clustering Wikipedia articles.
Here is my problem description (also posted at
http://stackoverflow.com/questions/11881865/mahout-seqwiki-problems):
I downloaded
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
and set up Hadoop on a single machine (pseudo-distributed), then cloned
Mahout from git://git.apache.org/mahout.git and built it.
Next I vectorized the articles:
mahout seqwiki -c categories.txt -i
wiki/enwiki-latest-pages-articles.xml -o wiki/seqfiles
mahout seq2sparse -i wiki/seqfiles -o wiki/vectors-bigram -ow -a
org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5
-md 3 -x 90 -ng 2 -ml 50 -seq
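(For reference, the tfidf-vectors output should be a SequenceFile of
<Text, VectorWritable> pairs. One way to check what a part file actually
holds is to read its header; a minimal sketch against the plain Hadoop
1.0.3 API:)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SeqFileHeader {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // e.g. wiki/vectors-bigram/tfidf-vectors/part-r-00000
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // The header records which Writable classes the file was written with.
      System.out.println("key class:   " + reader.getKeyClassName());
      System.out.println("value class: " + reader.getValueClassName());
    } finally {
      reader.close();
    }
  }
}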
Then I tried to cluster the vectors with the k-means algorithm:
mahout kmeans -i wiki/vectors-bigram/tfidf-vectors/ -c
wiki/kmeans-centroids -o wiki/kmeans-clusters -cd 1.0 -k 20 -x 20 -cl
-dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
and got an exception:
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /home/valden/hadoop-1.0.3/bin//hadoop and
HADOOP_CONF_DIR=/home/valden/hadoop-1.0.3/conf
MAHOUT-JOB:
/home/valden/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar
12/08/08 20:03:43 INFO common.AbstractJob: Command line arguments:
{--clustering=null, --clusters=[wiki/kmeans-centroids],
--convergenceDelta=[1.0],
--distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure],
--endPhase=[2147483647], --input=[wiki/vectors-bigram/tfidf-vectors/],
--maxIter=[20], --method=[mapreduce], --numClusters=[20],
--output=[wiki/kmeans-clusters], --startPhase=[0], --tempDir=[temp]}
12/08/08 20:03:44 INFO common.HadoopUtil: Deleting
wiki/kmeans-centroids
12/08/08 20:03:44 INFO kmeans.RandomSeedGenerator: >>>
inputPathPattern: wiki/vectors-bigram/tfidf-vectors/*
12/08/08 20:03:44 INFO util.NativeCodeLoader: Loaded the
native-hadoop library
12/08/08 20:03:44 INFO zlib.ZlibFactory: Successfully loaded &
initialized native-zlib library
12/08/08 20:03:44 INFO compress.CodecPool: Got brand-new compressor
12/08/08 20:03:44 INFO kmeans.RandomSeedGenerator: >>> inputFilesN: 1
12/08/08 20:03:44 INFO kmeans.RandomSeedGenerator: >>>> f:
hdfs://localhost:9000/user/valden/wiki/vectors-bigram/tfidf-vectors/part-r-00000
12/08/08 20:03:44 INFO kmeans.RandomSeedGenerator: Wrote 0 Klusters
to wiki/kmeans-centroids/part-randomSeed
12/08/08 20:03:44 INFO kmeans.KMeansDriver: Input:
wiki/vectors-bigram/tfidf-vectors Clusters In:
wiki/kmeans-centroids/part-randomSeed Out: wiki/kmeans-clusters
Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
12/08/08 20:03:44 INFO kmeans.KMeansDriver: convergence: 1.0 max
Iterations: 20 num Reduce Tasks: org.apache.mahout.math.VectorWritable
Input Vectors: {}
12/08/08 20:03:44 INFO compress.CodecPool: Got brand-new decompressor
Exception in thread "main" java.lang.IllegalStateException: No
input clusters found in wiki/kmeans-centroids/part-randomSeed. Check
your -c argument.
at
org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:217)
at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:148)
at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:107)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at
org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:48)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at
org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at
org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
All of the INFO lines containing ">>>" come from my custom logging. From
them I determined that
org.apache.mahout.clustering.kmeans.RandomSeedGenerator could not load
any vectors from wiki/vectors-bigram/tfidf-vectors/part-r-00000, though I
do not know why. Since no vectors were loaded, no random centroids were
generated either.
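As far as I understand, when -k is given KMeansDriver first calls
RandomSeedGenerator.buildRandom to sample k input vectors as seed
clusters and write them to the -c path (hence the "Wrote 0 Klusters"
line above). That step can be reproduced in isolation; a minimal sketch,
assuming the current trunk signature of buildRandom:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure;

// Sample 20 random seed clusters from the TF-IDF vectors.
Configuration conf = new Configuration();
Path seeds = RandomSeedGenerator.buildRandom(conf,
    new Path("wiki/vectors-bigram/tfidf-vectors"),
    new Path("wiki/kmeans-centroids"),
    20,
    new SquaredEuclideanDistanceMeasure());
// If this again logs "Wrote 0 Klusters", the input vectors are at fault.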
After some further tests I determined that wiki/seqfiles/part-r-00000
does not contain any vectors either, so it looks like seqwiki produced
wrong data. I checked that with this code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.VectorWritable;

private static void readData(Configuration conf, Path in) {
  log.info("readData...");
  int n = 0;
  for (Pair<Writable, VectorWritable> record :
      new SequenceFileIterable<Writable, VectorWritable>(in, false, conf)) {
    // THIS loop was never entered
    Writable key = record.getFirst();          // document id
    VectorWritable value = record.getSecond(); // TF-IDF vector
    n++;
    log.info(">>> " + n);
  }
}
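To rule out a problem in Mahout's SequenceFileIterable itself, the same
file can also be read with the raw Hadoop reader. This minimal sketch
(plain Hadoop 1.0.3 API, nothing Mahout-specific) instantiates whatever
key/value classes the file header declares, so it counts records
regardless of their actual types:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

private static void countRecords(Configuration conf, Path in) throws IOException {
  FileSystem fs = in.getFileSystem(conf);
  SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
  try {
    // Instantiate the key/value types recorded in the file header.
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    int n = 0;
    while (reader.next(key, value)) {
      n++;
    }
    log.info(">>> raw record count: " + n);
  } finally {
    reader.close();
  }
}

If this counts zero records too, the part file really holds no data and
the problem is upstream in seqwiki/seq2sparse; if it does count records,
the earlier loop probably failed on a key/value type mismatch instead.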
I've also determined that the latest Mahout-Examples-Cluster-Reuters
Jenkins build
(https://builds.apache.org/job/Mahout-Examples-Cluster-Reuters/lastBuild/console)
fails with the same problem:
.......................
Exception in thread "main" java.lang.IllegalStateException: No input
clusters found in
/tmp/mahout-work-hudson/reuters-kmeans-clusters/part-randomSeed. Check
your -c argument.
at
org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:217)
at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:148)
at
org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:107)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at
org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:48)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
Build step 'Execute shell' marked build as failure
...............................
Is there a JIRA issue for this bug? I'd appreciate any help with it.
--
Best regards,
--
Denys Valchuk
Skype: dvalchuk
cell: +380664059110