Hello all, I am fairly new to Mahout and have recently been using Mahout KMeans for some of my tasks. For testing purposes, I would like to generate the same input data for Mahout KMeans on every run for a given configuration.
I started from the sample at /examples/src/main/java/org/apache/mahout/clustering/kmeans/GenKMeansDataset.java and modified every pseudo-random number generator in that file to initialize with a constant seed (both MersenneTwisterRNG and GaussianSampleGenerator). With this change, the sample seeds, means, standard deviations, and initial cluster values that I print are the same from run to run.

However, for some of the data (the initial clusters, for example), even though the printed content of two runs is identical, the sequence files they are written to differ. I checked that the files' checksums differ, and the generated output clusters are different as well.

Has anyone tried to make the random data generation reproducible, and if so, how did you succeed? Why would writing to sequence files alter the content? Any suggestions on how I could fix this would help as well.

Thanks for any feedback!
Jingyi
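
P.S. In case it helps to see what I mean by seeding with constants, here is a minimal sketch of the idea. It is not the actual GenKMeansDataset code: java.util.Random stands in for MersenneTwisterRNG, and the class and method names (SeededSamples, generate) are made up for illustration. It only shows that two runs with the same seed yield the same values when the values themselves are compared.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch only: java.util.Random is a stand-in for MersenneTwisterRNG, and
// SeededSamples / generate are hypothetical names, not Mahout APIs.
class SeededSamples {

    // Draw `count` Gaussian samples from a generator built with a fixed seed.
    static double[] generate(long seed, int count, double mean, double std) {
        Random rng = new Random(seed);
        double[] samples = new double[count];
        for (int i = 0; i < count; i++) {
            samples[i] = mean + std * rng.nextGaussian();
        }
        return samples;
    }

    public static void main(String[] args) {
        double[] run1 = generate(42L, 5, 0.0, 1.0);
        double[] run2 = generate(42L, 5, 0.0, 1.0);
        // Compare the generated values directly, not a checksum of a container file.
        System.out.println("identical: " + Arrays.equals(run1, run2));
    }
}
```

This is the kind of comparison under which my two runs agree; the difference only shows up once I compare the written files.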
