Hi Sascha,

What is the size of your input vectors? 10672? I think you need to use a prototype size of 10672 instead of 10: the model prototype's cardinality has to match the cardinality of the vectors you cluster, otherwise the dot product inside NormalModel.pdf() throws exactly the CardinalityException you are seeing.
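If you don't want to hard-code the number, you can read the cardinality straight from the generated vectors before building the model distribution. A rough sketch (untested; it assumes your tf-vectors are the usual Writable/VectorWritable sequence files, and the helper name readVectorCardinality is just mine):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.VectorWritable;

// Peek at the first vector in a tf-vectors sequence file and return its
// cardinality, so prototypeSize always matches the dictionary size.
private static int readVectorCardinality(Configuration conf, Path vectorFile)
    throws Exception {
  FileSystem fs = FileSystem.get(conf);
  SequenceFile.Reader reader = new SequenceFile.Reader(fs, vectorFile, conf);
  try {
    // The key type is Text for DictionaryVectorizer output, but
    // instantiating it from the reader keeps the helper generic.
    Writable key = (Writable) reader.getKeyClass().newInstance();
    VectorWritable value = new VectorWritable();
    if (reader.next(key, value)) {
      return value.get().size(); // cardinality of the first vector
    }
    throw new IllegalStateException("no vectors found in " + vectorFile);
  } finally {
    reader.close();
  }
}

Then in dirichletClustering() you would do something like

  int prototypeSize = readVectorCardinality(conf, new Path(vectorPath, "part-r-00000"));

(adjust the part-file name to whatever your vectorizer run actually produced) and pass that into DirichletDriver.createModelDistribution() instead of the literal 10.

HTW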
2010/11/26 Sascha Nordquist <[email protected]>:
> Hi,
>
> when I run Dirichlet Clustering I get the following Exception:
>
> org.apache.mahout.math.CardinalityException: Required cardinality 10672 but got 10
>     at org.apache.mahout.math.RandomAccessSparseVector.dot(RandomAccessSparseVector.java:172)
>     at org.apache.mahout.math.NamedVector.dot(NamedVector.java:127)
>     at org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:130)
>     at org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:38)
>     at org.apache.mahout.clustering.dirichlet.DirichletState.adjustedProbability(DirichletState.java:129)
>     at org.apache.mahout.clustering.dirichlet.DirichletClusterer.assignToModel(DirichletClusterer.java:256)
>     at org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:47)
>     at org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:41)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
>
> This is the method for dirichlet clustering:
>
> private void dirichletClustering(Path vectorPath, DistanceMeasure measure,
>     int numClusters, int maxIterations, double alpha0) throws Exception {
>   boolean runSequential = false;
>   Configuration conf = new Configuration();
>   int prototypeSize = 10;
>   boolean emitMostLikely = true;
>   double threshold = 0.1;
>   String modelPrototype = "org.apache.mahout.math.RandomAccessSparseVector";
>   String modelFactory = "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution";
>   Path clusterPath = new Path(outputDir,
>       "dirichletClustering-c" + numClusters + "-alpha" + alpha0 + "-"
>           + measure.getClass().getSimpleName());
>   HadoopUtil.overwriteOutput(clusterPath);
>   Path clusterPointsPath = new Path(clusterPath, AbstractCluster.CLUSTERED_POINTS_DIR);
>   AbstractVectorModelDistribution modelDistribution =
>       DirichletDriver.createModelDistribution(modelFactory, modelPrototype,
>           measure.getClass().getName(), prototypeSize);
>   Path resultPath = DirichletDriver.buildClusters(conf, vectorPath, clusterPath,
>       modelDistribution, numClusters, maxIterations, alpha0, runSequential);
>   DirichletDriver.clusterData(conf, vectorPath, clusterPath, clusterPointsPath,
>       emitMostLikely, threshold, runSequential);
> }
>
> The vectors are created this way:
>
> private void generateVectors() throws Exception {
>   int minSupport = 2;
>   int maxNGramSize = 2;
>   float minLLRValue = 50;
>   float normPower = 2;
>   boolean logNormalize = false;
>   int chunkSizeInMegabytes = 64;
>   int numReducers = 1;
>   boolean sequentialAccessOutput = false;
>   boolean namedVectors = true;
>
>   Configuration conf = new Configuration();
>   String tokenizedDir = preparePath.toString() + "/"
>       + DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER;
>   Path tokenizedPath = new Path(tokenizedDir);
>   HadoopUtil.overwriteOutput(preparePath);
>   DocumentProcessor.tokenizeDocuments(inputPath, DefaultAnalyzer.class, tokenizedPath);
>
>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, preparePath, conf,
>       minSupport, maxNGramSize, minLLRValue, normPower, logNormalize, numReducers,
>       chunkSizeInMegabytes, sequentialAccessOutput, namedVectors);
> }
>
> I already used these tf vectors as input for kmeans and fuzzykmeans, so what's wrong?
>
> Thanks!
