Hi,
when I run Dirichlet clustering, I get the following exception:
org.apache.mahout.math.CardinalityException: Required cardinality 10672 but got 10
    at org.apache.mahout.math.RandomAccessSparseVector.dot(RandomAccessSparseVector.java:172)
    at org.apache.mahout.math.NamedVector.dot(NamedVector.java:127)
    at org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:130)
    at org.apache.mahout.clustering.dirichlet.models.NormalModel.pdf(NormalModel.java:38)
    at org.apache.mahout.clustering.dirichlet.DirichletState.adjustedProbability(DirichletState.java:129)
    at org.apache.mahout.clustering.dirichlet.DirichletClusterer.assignToModel(DirichletClusterer.java:256)
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:47)
    at org.apache.mahout.clustering.dirichlet.DirichletMapper.map(DirichletMapper.java:41)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
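If I read the message correctly, two vectors with different cardinalities are being dotted: my data vectors have 10672 dimensions, but something on the model side has only 10. As a minimal standalone sketch (the sizes are taken from the message, not from my code), the same exception can be reproduced like this:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class CardinalityRepro {
  public static void main(String[] args) {
    Vector data = new RandomAccessSparseVector(10672); // like my tf vectors
    Vector model = new RandomAccessSparseVector(10);   // only 10 dimensions
    // throws CardinalityException: Required cardinality 10672 but got 10
    data.dot(model);
  }
}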
This is my method for Dirichlet clustering:
private void dirichletClustering(Path vectorPath, DistanceMeasure measure,
    int numClusters, int maxIterations, double alpha0) throws Exception {
  boolean runSequential = false;
  Configuration conf = new Configuration();
  int prototypeSize = 10;
  boolean emitMostLikely = true;
  double threshold = 0.1;
  String modelPrototype = "org.apache.mahout.math.RandomAccessSparseVector";
  String modelFactory =
      "org.apache.mahout.clustering.dirichlet.models.NormalModelDistribution";
  Path clusterPath = new Path(outputDir, "dirichletClustering-c" + numClusters
      + "-alpha" + alpha0 + "-" + measure.getClass().getSimpleName());
  HadoopUtil.overwriteOutput(clusterPath);
  Path clusterPointsPath = new Path(clusterPath, AbstractCluster.CLUSTERED_POINTS_DIR);
  AbstractVectorModelDistribution modelDistribution =
      DirichletDriver.createModelDistribution(modelFactory, modelPrototype,
          measure.getClass().getName(), prototypeSize);
  Path resultPath = DirichletDriver.buildClusters(conf, vectorPath, clusterPath,
      modelDistribution, numClusters, maxIterations, alpha0, runSequential);
  DirichletDriver.clusterData(conf, vectorPath, clusterPath, clusterPointsPath,
      emitMostLikely, threshold, runSequential);
}
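Note that prototypeSize is hardcoded to 10 here. For debugging, I could read the actual cardinality off the first tf vector instead; this is only a sketch (the part file name "part-r-00000" is an assumption, and it is not part of my actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.VectorWritable;

private int firstVectorCardinality(Path vectorPath, Configuration conf) throws Exception {
  Path firstPart = new Path(vectorPath, "part-r-00000"); // assumed file name
  FileSystem fs = FileSystem.get(conf);
  SequenceFile.Reader reader = new SequenceFile.Reader(fs, firstPart, conf);
  try {
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    VectorWritable value = new VectorWritable();
    return reader.next(key, value) ? value.get().size() : -1; // -1: empty file
  } finally {
    reader.close();
  }
}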
The vectors are created this way:
private void generateVectors() throws Exception {
  int minSupport = 2;
  int maxNGramSize = 2;
  float minLLRValue = 50;
  float normPower = 2;
  boolean logNormalize = false;
  int chunkSizeInMegabytes = 64;
  int numReducers = 1;
  boolean sequentialAccessOutput = false;
  boolean namedVectors = true;
  Configuration conf = new Configuration();
  Path tokenizedPath = new Path(preparePath,
      DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
  HadoopUtil.overwriteOutput(preparePath);
  DocumentProcessor.tokenizeDocuments(inputPath, DefaultAnalyzer.class, tokenizedPath);
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, preparePath, conf,
      minSupport, maxNGramSize, minLLRValue, normPower, logNormalize,
      numReducers, chunkSizeInMegabytes, sequentialAccessOutput, namedVectors);
}
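For completeness, the tf vectors that DictionaryVectorizer writes end up under the tf-vectors subfolder of preparePath; assuming my wiring (it is not shown above), that directory is what I pass to dirichletClustering as vectorPath:

Path vectorPath = new Path(preparePath,
    DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER); // "tf-vectors"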
I have already used these tf vectors as input for k-means and fuzzy k-means, so what's wrong?
Thanks!