I think I found the root cause, but I'm not sure what needs fixing.
I took out n-gram generation and the vector now looks like this:
Key: https://farfetchers.com/category/collections/source/brice-berard:
Value:
https://farfetchers.com/category/collections/source/brice-berard:{701:0.5484552974788475,1876:0.6020428878306935,3620:0.5802940184767269}
This works in clustering.
It doesn't seem like a malformed vector should crash clustering (it
apparently doesn't in Mahout 0.6), but it looks like something in
seq2sparse's n-gram weighting is producing the malformed vector.
I'll file a JIRA.
On 6/5/12 11:48 AM, Pat Ferrel wrote:
Using seqdumper on the TFIDF vectors, that vector is indeed in the list:
Key: https://farfetchers.com/category/collections/source/brice-berard:
Value: https://farfetchers.com/category/collections/source/brice-berard:{
Looking in the seqfiles we find the document in part-00005 of 10 in no
particular part of the file.
Key: https://farfetchers.com/category/collections/source/brice-berard:
Value: ::Title::
Brice Berard | FarFetchers.com
Blog Posts
On the chance that this originates in seq2sparse, I'll try changing
options until the vector looks different, and then try clustering again.
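A check like the following could flag truncated values before clustering runs. It is only a sketch: it assumes the seqdumper text has been saved with each "Key:" line followed by a "Value:" line, as quoted above, and the class and method names are hypothetical, not part of Mahout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: scans seqdumper-style text for
// vector values that are truncated (like "name:{") or empty (like "name:{}").
class TruncatedVectorScan {

    // True if the value has no braces, no closing brace after the last
    // opening brace, or nothing between the braces.
    static boolean looksMalformed(String value) {
        int open = value.lastIndexOf('{');
        int close = value.lastIndexOf('}');
        return open < 0 || close < open || close == open + 1;
    }

    // Assumes each "Key: ..." line is followed by its "Value: ..." line.
    static List<String> malformedKeys(List<String> dumpLines) {
        List<String> bad = new ArrayList<>();
        String key = null;
        for (String line : dumpLines) {
            if (line.startsWith("Key: ")) {
                key = line.substring(5).trim();
                // Drop the trailing ':' separator shown in the dump.
                if (key.endsWith(":")) {
                    key = key.substring(0, key.length() - 1);
                }
            } else if (line.startsWith("Value: ") && key != null) {
                if (looksMalformed(line.substring(7).trim())) {
                    bad.add(key);
                }
                key = null;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
            "Key: doc-a:",
            "Value: doc-a:{701:0.5,1876:0.6}",
            "Key: doc-b:",
            "Value: doc-b:{");
        System.out.println(malformedKeys(dump)); // prints [doc-b]
    }
}
```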
On 6/5/12 10:43 AM, Pat Ferrel wrote:
I'm not completely sure what I'm looking at but...
In iterateSeq on iteration #1 of processing vectors/tfidf-vectors it
reads
vector =
"https://farfetchers.com/category/collections/source/brice-berard:{"
It's a named vector where the URL is the name and the value is "{",
which looks wrong. When that vector is classified to get a probability,
it gets
probabilities =
"{0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN}"
That makes probabilities.maxValueIndex() return -1 and everything dies.
The vector looks wrong, doesn't it? Truncated?
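Here is a small sketch of why an all-NaN probabilities vector can yield an index of -1. It assumes the argmax starts from -1 and only moves on a strict "greater than" comparison. That is a guess at the behaviour, not a copy of AbstractVector.maxValueIndex(), and the class below is hypothetical.

```java
import java.util.Arrays;

// Hypothetical stand-in for an argmax over a dense vector; not Mahout's code.
class NanArgMaxSketch {

    // Returns the index of the largest value, or -1 if no element ever
    // compares greater than the running maximum.
    static int maxValueIndex(double[] values) {
        int result = -1;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < values.length; i++) {
            // Every comparison against NaN is false, so NaN entries
            // never update result.
            if (values[i] > max) {
                max = values[i];
                result = i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] probabilities = new double[20];
        Arrays.fill(probabilities, Double.NaN);
        System.out.println(maxValueIndex(probabilities));                // prints -1
        System.out.println(maxValueIndex(new double[] {0.2, 0.5, 0.3})); // prints 1
    }
}
```

Under that assumption, the -1 is then used as an index into a vector of size 20, which matches the "Index -1 is outside allowable range of [0,20)" message.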
I went back to try the same on Mahout 0.6, but iterateSeq does not get
called even though I used -xm sequential on both runs. I can't see
kmeans-clusters/clusters-0 being created on Mahout 0.6 either. Is
that part of the refactoring?
On 6/4/12 3:07 PM, Pat Ferrel wrote:
Some things to try:
- Have you verified the contents of your input vectors actually have
data in them?
* YES, from the other email you know that the data works fine in 0.6
- Can you run the cluster dumper on the
b3/kmeans-clusters/clusters-0 contents?
* YES, It is attached from trunk's clusterdump after the failure of
kmeans, of course. A simple data set fortunately.
- Is it possible to run the sequential version (-xm sequential)? If
it is you could run it in a debugger to gain more insight.
* YES, will report back.
On 6/4/12 2:19 PM, Jeff Eastman wrote:
It looks like the probabilities vector returned by
AbstractClusteringPolicy.classify() has no non-zero elements. In
this case, AbstractClusteringPolicy.select()'s call to
AbstractVector.maxValueIndex() is returning -1 and that is causing
the exception.
How could this happen? I'm not exactly sure, but consider that the
probabilities vector is calculated in
AbstractClusteringPolicy.classify() by calling
DistanceMeasureCluster.pdf() on each of the prior clusters in
b3/kmeans-clusters/clusters-0. With a CosineDistanceMeasure I don't
see how this could ever return zero. Certainly, some of your
vectors will match the prior cluster centers exactly (they were
sampled from the input) and those values would return pdf==1. Even
if the cosine distance was 1 the pdf would be 0.5.
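The two values above (pdf==1 at distance 0, pdf==0.5 at distance 1) are consistent with pdf = 1/(1 + distance). That formula is inferred from those two numbers, not taken from the Mahout source. The sketch below uses it together with a plain, unguarded cosine distance (also an assumption) to show the expected range and one input where the arithmetic breaks down: a vector with no non-zero entries gives 0/0.

```java
// Hypothetical sketch; the formulas are assumptions, not Mahout's implementation.
class CosinePdfSketch {

    // Unguarded cosine distance: 1 - (a . b) / (|a| * |b|).
    // Assumption: no special case for vectors whose norm is zero.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Assumed form of the pdf, inferred from the two values quoted above.
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    public static void main(String[] args) {
        double[] center = {1.0, 0.0};
        double[] same = {1.0, 0.0};
        double[] orthogonal = {0.0, 1.0};
        double[] empty = {0.0, 0.0};

        System.out.println(pdf(cosineDistance(same, center)));       // prints 1.0
        System.out.println(pdf(cosineDistance(orthogonal, center))); // prints 0.5
        System.out.println(pdf(cosineDistance(empty, center)));      // prints NaN
    }
}
```

So for any vector with a non-zero norm the value stays positive, while an empty vector would produce NaN rather than zero under these assumptions.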
Some things to try:
- Have you verified the contents of your input vectors actually
have data in them?
- Can you run the cluster dumper on the
b3/kmeans-clusters/clusters-0 contents?
- Is it possible to run the sequential version (-xm sequential)? If
it is you could run it in a debugger to gain more insight.
Jeff
On 6/4/12 12:05 PM, Pat Ferrel wrote:
Running kmeans from the CLI on several trunk versions, I get an error
I don't understand. When the job died, the
b3/canopy-centroids/clusters-0-final contained the random-seeds
file generated by the kmeans driver and the
b3/kmeans-clusters/clusters-0 had several part files but
b3/kmeans-clusters/clusters-1 was empty. When I look through the
code from the trace it doesn't make much sense.
Command line:
mahout kmeans
-i b3/vectors/tfidf-vectors/
-k 20
-c b3/canopy-centroids/clusters-0-final
-cl
-o b3/kmeans-clusters
-ow
-cd 0.01
-x 30
-dm org.apache.mahout.common.distance.CosineDistanceMeasure
Error:
12/06/04 07:55:03 INFO common.AbstractJob: Command line arguments:
{--clustering=null,
--clusters=[b3/canopy-centroids/clusters-0-final],
--convergenceDelta=[0.01],
--distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
--endPhase=[2147483647], --input=[b3/vectors/tfidf-vectors/],
--maxIter=[30], --method=[mapreduce], --numClusters=[20],
--output=[b3/kmeans-clusters], --overwrite=null, --startPhase=[0],
--tempDir=[temp]}
2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm info
from SCDynamicStore
12/06/04 07:55:03 INFO common.HadoopUtil: Deleting
b3/canopy-centroids/clusters-0-final
12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java
classes where applicable
12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20
vectors to b3/canopy-centroids/clusters-0-final/part-randomSeed
12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input:
b3/vectors/tfidf-vectors Clusters In:
b3/canopy-centroids/clusters-0-final/part-randomSeed Out:
b3/kmeans-clusters Distance:
org.apache.mahout.common.distance.CosineDistanceMeasure
12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01 max
Iterations: 30 num Reduce Tasks:
org.apache.mahout.math.VectorWritable Input Vectors: {}
12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new decompressor
Cluster Iterator running iteration 1 over priorPath:
b3/kmeans-clusters/clusters-0
12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths to
process : 1
12/06/04 07:55:05 INFO mapred.JobClient: Running job: job_local_0001
12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
12/06/04 07:55:08 INFO mapred.MapTask: data buffer =
79691776/99614720
12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 262144/327680
12/06/04 07:55:08 INFO mapred.JobClient: map 0% reduce 0%
12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
org.apache.mahout.math.IndexException: Index -1 is outside allowable range of [0,20)
at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
at org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
12/06/04 07:55:09 INFO mapred.JobClient: Job complete: job_local_0001
12/06/04 07:55:09 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.lang.InterruptedException: Cluster Iteration 1 failed processing b3/kmeans-clusters/clusters-1
at org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:229)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:149)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:108)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)