I am trying to run Dirichlet Process Clustering on the cooccurrence
matrix produced by RowSimilarityJob. Because RowSimilarityJob creates
RandomAccessSparseVectors with a cardinality of Integer.MAX_VALUE, I
used the following code to run the clustering:

    ModelDistribution<VectorWritable> modelDist =
        new GaussianClusterDistribution(new VectorWritable(new DenseVector(2)));

    DistributionDescription description = new DistributionDescription(
        modelDist.getClass().getName(),
        RandomAccessSparseVector.class.getName(),
        CosineDistanceMeasure.class.getName(),
        Integer.MAX_VALUE);

    DirichletDriver.run(conf, cooccurrenceMatrixPath, clusteringOutput,
        description, 10, 20, 1.0, true, true, 0, false);

Passing Integer.MAX_VALUE as the cardinality in the
DistributionDescription makes heap usage explode. Is there a way to
work around this problem?
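
For a sense of scale, here is a back-of-the-envelope sketch of what that cardinality could cost. It assumes (not confirmed against the Mahout source) that some part of the model ends up holding one double per dimension, for example when model parameters are sampled for every index. The class and method names are hypothetical and only illustrate the arithmetic; nothing here calls Mahout.

```java
import java.util.Locale;

public class DenseFootprint {

    // Bytes needed for the raw double payload of a vector that stores
    // every dimension explicitly (object headers are ignored).
    static long denseBytes(int cardinality) {
        return (long) cardinality * Double.BYTES;
    }

    public static void main(String[] args) {
        long bytes = denseBytes(Integer.MAX_VALUE);
        double gib = bytes / (1024.0 * 1024.0 * 1024.0);
        // Locale.ROOT keeps the decimal separator stable across machines.
        System.out.println(String.format(Locale.ROOT,
            "%d bytes (about %.1f GiB) per fully populated vector", bytes, gib));
    }
}
```

Under that assumption a single fully populated vector needs roughly 16 GiB, and each model would need at least one of them, which would explain the heap growth even though the input vectors themselves are sparse.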