Yes, your new documents have introduced new terms, which has made the document vectors larger (cardinality 22) than the cluster centers (cardinality 16). If you convert your cluster centers to sparse vectors with the maximum int size (Integer.MAX_VALUE) as their cardinality, you should be able to move forward.
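
To illustrate the idea, here is a minimal, self-contained sketch. It does not use Mahout: SparseVec and widen are hypothetical stand-ins written only for this example, and the real conversion would go through Mahout's own vector classes. It shows a dot product failing on mismatched cardinalities and succeeding once the same entries are copied into vectors of a common, maximal cardinality. In this stand-in both vectors are widened, because its check requires equal sizes; whether the document vectors need the same treatment in Mahout is an assumption to verify.

```java
import java.util.HashMap;
import java.util.Map;

public class CardinalityDemo {

    /** Hypothetical stand-in for a sparse vector; this is NOT a Mahout class. */
    static final class SparseVec {
        final int cardinality;
        final Map<Integer, Double> values = new HashMap<>();

        SparseVec(int cardinality) {
            this.cardinality = cardinality;
        }

        void set(int index, double value) {
            if (index < 0 || index >= cardinality) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            values.put(index, value);
        }

        /** Dot product that rejects operands of a different cardinality. */
        double dot(SparseVec other) {
            if (cardinality != other.cardinality) {
                throw new IllegalArgumentException(
                    "Required cardinality " + cardinality + " but got " + other.cardinality);
            }
            double sum = 0.0;
            for (Map.Entry<Integer, Double> e : values.entrySet()) {
                Double o = other.values.get(e.getKey());
                if (o != null) {
                    sum += e.getValue() * o;
                }
            }
            return sum;
        }
    }

    /** Copies the non-zero entries into a vector with a larger declared cardinality. */
    static SparseVec widen(SparseVec v, int newCardinality) {
        SparseVec out = new SparseVec(newCardinality);
        for (Map.Entry<Integer, Double> e : v.values.entrySet()) {
            out.set(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        SparseVec center = new SparseVec(16); // cluster center built from the old dictionary
        center.set(3, 2.0);
        SparseVec doc = new SparseVec(22);    // new document that introduced new terms
        doc.set(3, 1.5);
        doc.set(20, 4.0);

        try {
            center.dot(doc);
        } catch (IllegalArgumentException e) {
            System.out.println("Mismatch: " + e.getMessage());
        }

        // Only the declared size changes; storage stays proportional to the non-zero entries.
        SparseVec wideCenter = widen(center, Integer.MAX_VALUE);
        SparseVec wideDoc = widen(doc, Integer.MAX_VALUE);
        System.out.println("Dot product after widening: " + wideCenter.dot(wideDoc));
    }
}
```

Because the vectors are sparse, declaring a cardinality of Integer.MAX_VALUE costs nothing in storage, which is why this only makes sense for sparse (not dense) representations.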
-----Original Message-----
From: David Saile [mailto:[email protected]]
Sent: Thursday, May 26, 2011 10:36 AM
To: [email protected]
Subject: CardinalityException during data clustering

Hi list,

As suggested in previous posts, I am trying to use k-means to assign newly arriving documents to existing clusters.

However, while trying to assign the vectors corresponding to the new documents to the existing clusters (using KMeansDriver.clusterData(...)), I am running into an org.apache.mahout.math.CardinalityException. See below for the complete stack-trace.

For vector creation I use Mahout's DictionaryVectorizer. I assume, this exception occurs because the new vectors have a different cardinality than the previously computed clusters.

Is there some way to assign a fixed cardinality to all vectors? Or is there any other solution for this?

I would really appreciate any help!

Thanks,
David

java.lang.Exception: org.apache.mahout.math.CardinalityException: Required cardinality 16 but got 22
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:371)
Caused by: org.apache.mahout.math.CardinalityException: Required cardinality 16 but got 22
    at org.apache.mahout.math.RandomAccessSparseVector.dot(RandomAccessSparseVector.java:172)
    at org.apache.mahout.math.NamedVector.dot(NamedVector.java:127)
    at org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure.distance(SquaredEuclideanDistanceMeasure.java:57)
    at org.apache.mahout.clustering.kmeans.KMeansClusterer.outputPointWithClusterInfo(KMeansClusterer.java:140)
    at org.apache.mahout.clustering.kmeans.KMeansClusterMapper.map(KMeansClusterMapper.java:40)
    at org.apache.mahout.clustering.kmeans.KMeansClusterMapper.map(KMeansClusterMapper.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:652)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:238)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)
