Hello all,

I am fairly new to Mahout and have recently been using Mahout KMeans
for some of my tasks. For testing purposes, I would like to generate
the same input data for Mahout KMeans on every run, given a configuration.

I started by modifying the sample at
/examples/src/main/java/org/apache/mahout/clustering/kmeans/GenKMeansDataset.java.
I changed all pseudo-random number generators in that file to
initialize with constant seeds (both MersenneTwisterRNG and
GaussianSampleGenerator). With this change, when I print the sample
seeds, the means and standard deviations, and the initial cluster
values, they remain the same across runs.
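
To illustrate the kind of change I mean, here is a minimal sketch. It
uses java.util.Random as a stand-in for MersenneTwisterRNG, and the
class and method names (SeededSamples, gaussianSamples) are
hypothetical, not taken from GenKMeansDataset.java.

```java
import java.util.Arrays;
import java.util.Random;

public class SeededSamples {
    // Hypothetical helper: draws n Gaussian samples from a generator
    // created with a fixed seed, so the same seed gives the same values.
    // java.util.Random stands in for MersenneTwisterRNG here.
    static double[] gaussianSamples(long seed, int n, double mean, double std) {
        Random rng = new Random(seed);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = mean + std * rng.nextGaussian();
        }
        return out;
    }

    public static void main(String[] args) {
        // Two independent "runs" with the same constant seed.
        double[] first = gaussianSamples(42L, 5, 0.0, 1.0);
        double[] second = gaussianSamples(42L, 5, 0.0, 1.0);
        System.out.println(Arrays.toString(first));
        System.out.println("identical: " + Arrays.equals(first, second));
    }
}
```

In this sketch the two arrays compare equal, which matches what I see
when printing the in-memory values in my modified generator.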

However, for some data (the initial clusters, for example), even
though the contents of two runs are identical, they differ after being
written to sequence files. I checked that the checksums of the files
differ, and the generated output clusters are different as well. So I
would like to know whether anyone has tried to make the random data
generation reproducible, and if so, how you succeeded. Why would
writing to sequence files alter the content? Any feedback on what I
could do to fix my problem would help as well.

Thanks for any feedback!

Jingyi
