Yes, your new documents have introduced new terms, so the new document vectors 
have a larger cardinality than the cluster centers. If you convert your cluster 
centers to sparse vectors with a cardinality of Integer.MAX_VALUE, you should 
be able to move forward. 
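
To illustrate the idea, here is a minimal, self-contained sketch. It does not use Mahout: SparseVec, withCardinality and CardinalityDemo are hypothetical names made up for this example, and the class only mimics the kind of cardinality check that a dot product performs. It also assumes that both the centers and the new document vectors end up with the same fixed cardinality, since the check requires the two sizes to be equal.

```java
import java.util.HashMap;
import java.util.Map;

public class CardinalityDemo {

    /** Hypothetical minimal sparse vector; NOT Mahout's RandomAccessSparseVector. */
    static final class SparseVec {
        final int cardinality;
        final Map<Integer, Double> values = new HashMap<>();

        SparseVec(int cardinality) {
            this.cardinality = cardinality;
        }

        /** Stores one non-zero entry; only set entries take up memory. */
        SparseVec set(int index, double value) {
            if (index < 0 || index >= cardinality) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            values.put(index, value);
            return this;
        }

        /** Dot product that refuses to combine vectors of different cardinality. */
        double dot(SparseVec other) {
            if (cardinality != other.cardinality) {
                throw new IllegalArgumentException(
                    "Required cardinality " + cardinality + " but got " + other.cardinality);
            }
            double sum = 0.0;
            for (Map.Entry<Integer, Double> e : values.entrySet()) {
                Double o = other.values.get(e.getKey());
                if (o != null) {
                    sum += e.getValue() * o;
                }
            }
            return sum;
        }

        /** Copies the non-zero entries into a vector with a different declared cardinality. */
        SparseVec withCardinality(int newCardinality) {
            SparseVec copy = new SparseVec(newCardinality);
            for (Map.Entry<Integer, Double> e : values.entrySet()) {
                copy.set(e.getKey(), e.getValue());
            }
            return copy;
        }
    }

    public static void main(String[] args) {
        // A center built from a 16-term dictionary, a new document from a 22-term one.
        SparseVec center = new SparseVec(16).set(3, 1.0).set(7, 2.0);
        SparseVec doc = new SparseVec(22).set(3, 0.5).set(20, 4.0);

        try {
            center.dot(doc);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }

        // Re-declare both with the same fixed cardinality; the stored entries are unchanged.
        SparseVec wideCenter = center.withCardinality(Integer.MAX_VALUE);
        SparseVec wideDoc = doc.withCardinality(Integer.MAX_VALUE);
        System.out.println(wideCenter.dot(wideDoc));
    }
}
```

Because the vectors are sparse, declaring a very large cardinality costs nothing extra in storage; only the non-zero entries are kept. In Mahout itself the equivalent step would be to copy the non-zero elements of each center into a new sparse vector created with the larger cardinality, but check the Vector API of your version before relying on that.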

-----Original Message-----
From: David Saile [mailto:[email protected]] 
Sent: Thursday, May 26, 2011 10:36 AM
To: [email protected]
Subject: CardinalityException during data clustering 

Hi list,

As suggested in previous posts, I am trying to use k-means to assign newly 
arriving documents to existing clusters.

However, while trying to assign the vectors corresponding to the new documents 
to the existing clusters (using KMeansDriver.clusterData(...)), I am running 
into an org.apache.mahout.math.CardinalityException.
See below for the complete stack-trace. 

For vector creation I use Mahout's DictionaryVectorizer. 
I assume this exception occurs because the new vectors have a different 
cardinality than the previously computed clusters.

Is there some way to assign a fixed cardinality to all vectors? Or is there any 
other solution for this?

I would really appreciate any help! Thanks,
David

 

java.lang.Exception: org.apache.mahout.math.CardinalityException: Required cardinality 16 but got 22
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:371)
Caused by: org.apache.mahout.math.CardinalityException: Required cardinality 16 but got 22
        at org.apache.mahout.math.RandomAccessSparseVector.dot(RandomAccessSparseVector.java:172)
        at org.apache.mahout.math.NamedVector.dot(NamedVector.java:127)
        at org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure.distance(SquaredEuclideanDistanceMeasure.java:57)
        at org.apache.mahout.clustering.kmeans.KMeansClusterer.outputPointWithClusterInfo(KMeansClusterer.java:140)
        at org.apache.mahout.clustering.kmeans.KMeansClusterMapper.map(KMeansClusterMapper.java:40)
        at org.apache.mahout.clustering.kmeans.KMeansClusterMapper.map(KMeansClusterMapper.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:652)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:238)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:680)
