DirichletDriver vector cardinality and heap usage

contact Mon, 13 May 2013 03:32:41 -0700

I am trying to run Dirichlet Process Clustering on the cooccurrence matrix output of the RowSimilarityJob. Since RowSimilarityJob creates RandomAccessSparseVectors with a cardinality of Integer.MAX_VALUE, I used the following code to run the clustering:

ModelDistribution<VectorWritable> modelDist =
    new GaussianClusterDistribution(new VectorWritable(new DenseVector(2)));
DistributionDescription description =
    new DistributionDescription(modelDist.getClass().getName(),
        RandomAccessSparseVector.class.getName(),
        CosineDistanceMeasure.class.getName(), Integer.MAX_VALUE);

DirichletDriver.run(conf, cooccurrenceMatrixPath, clusteringOutput,
    description, 10, 20, 1.0, true, true, 0, false);

Using Integer.MAX_VALUE for the DistributionDescription results in exploding heap space usage. Is there a way to circumvent this problem?
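
For a sense of scale, here is a small back-of-the-envelope sketch of what a single dense array of doubles would cost at that cardinality. It is plain Java with no Mahout dependency, and the class and method names (DenseFootprint, denseBytes) are made up for illustration. Whether the clustering code actually allocates a dense structure of this size anywhere is an assumption on my part, not something I have confirmed.

```java
public class DenseFootprint {

    // Bytes needed for the values alone of a dense double array with the
    // given cardinality (object headers and other overhead are ignored).
    public static long denseBytes(int cardinality) {
        // Widen to long first, otherwise the multiplication overflows int.
        return (long) cardinality * Double.BYTES;
    }

    public static void main(String[] args) {
        long bytes = denseBytes(Integer.MAX_VALUE);
        double gib = bytes / (double) (1L << 30);
        System.out.println(String.format("%d bytes (%.1f GiB) per dense array", bytes, gib));
    }
}
```

If that assumption holds, each such array would need roughly 16 GiB, which would be enough on its own to exhaust a typical task heap.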
