I am trying to run Dirichlet Process Clustering on the co-occurrence matrix produced by RowSimilarityJob. Because RowSimilarityJob creates RandomAccessSparseVectors with a cardinality of Integer.MAX_VALUE, I used the following code to run the clustering:

ModelDistribution<VectorWritable> modelDist =
    new GaussianClusterDistribution(new VectorWritable(new DenseVector(2)));
DistributionDescription description =
    new DistributionDescription(modelDist.getClass().getName(),
                                RandomAccessSparseVector.class.getName(),
                                CosineDistanceMeasure.class.getName(),
                                Integer.MAX_VALUE);

DirichletDriver.run(conf, cooccurrenceMatrixPath, clusteringOutput, description, 10, 20, 1.0, true, true, 0, false);



Using Integer.MAX_VALUE as the cardinality in the DistributionDescription makes heap usage explode. Is there a way to work around this problem?
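
For a sense of scale, here is a rough back-of-the-envelope sketch in plain Java (no Mahout calls). It rests on two assumptions that are not confirmed in this post: that some step materializes a dense double array of the full cardinality for each model, and that the 10 passed to DirichletDriver.run is the number of models. The class name HeapEstimate and the method denseBytes are hypothetical, made up only for this illustration.

```java
public class HeapEstimate {

    // Raw payload in bytes of a dense double[] with the given number of
    // entries (array header and any wrapper-object overhead are ignored).
    public static long denseBytes(long cardinality) {
        return cardinality * Double.BYTES;
    }

    public static void main(String[] args) {
        long cardinality = Integer.MAX_VALUE; // cardinality reported for the RowSimilarityJob vectors
        int numModels = 10;                   // assumed meaning of the 10 in the run(...) call

        long perVector = denseBytes(cardinality);
        double gib = 1L << 30;

        System.out.printf("one dense vector: %.1f GiB%n", perVector / gib);
        System.out.printf("%d dense vectors: %.1f GiB%n", numModels, numModels * perVector / gib);
    }
}
```

Under those assumptions a single dense vector already needs roughly 16 GiB, so any code path that converts the sparse prototype to a dense representation would exhaust a typical heap even though the input vectors themselves are sparse.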
