Just to clarify: Are you saying that the *generated* data is different after it is serialized?
Or that the final outputs are different?

> On Sep 2, 2014, at 1:04 PM, Jingyi Jin <[email protected]> wrote:
>
> Hello all,
>
> I am fairly new to Mahout. Recently I am using the Mahout KMeans for
> some of my tasks. For testing purpose, I would like to generate same
> input data for Mahout KMeans given a configuration.
>
> I started to modify from the sample at
> /examples/src/main/java/org/apache/mahout/clustering/kmeans/GenKMeansDataset.java.
> I modified all pseudo random number generators in that file to
> initialize with constant seeds (initialize MersenneTwisterRNG and
> GaussianSampleGenerator with const seeds). With this change, when I
> print the sample seeds, means and std, and initial cluster values,
> they remain the same for different runs.
>
> However, for some data, initial clusters for example, even the content
> of two runs are the same, they become different after they are written
> into Sequence files. I could check that their check sum become
> different, and the generated output clusters are different. So I would
> like to know if anyone has ever tried the same effort in make the
> random data generation reproducible, and how did you succeed? Why
> writing to sequence files would alter the content? Any feedback on
> what I could do to fix my problem would helps as well.
>
> Thanks for any feedback!
>
> Jingyi
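
In case it helps separate the two cases, here is a minimal sketch in plain Java (not Mahout code) of how one might compare two runs at the value level instead of by file checksum. As far as I know, a Hadoop SequenceFile header carries a sync marker that is generated per file, so byte-level checksums of two files can differ even when the records inside are identical; please treat that as something to rule out, not a diagnosis. The class `ReproducibleSamples` and its methods `generate` and `contentDigest` are hypothetical names for illustration, and `java.util.Random` stands in for the MersenneTwisterRNG used in GenKMeansDataset.java.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration class; not part of Mahout or Hadoop.
class ReproducibleSamples {

    // Draws numPoints Gaussian points of the given dimension from a fixed seed.
    // java.util.Random is an assumption standing in for the generator in the sample.
    public static double[][] generate(long seed, int numPoints, int dim,
                                      double mean, double std) {
        Random rng = new Random(seed);
        double[][] points = new double[numPoints][dim];
        for (int i = 0; i < numPoints; i++) {
            for (int j = 0; j < dim; j++) {
                points[i][j] = mean + std * rng.nextGaussian();
            }
        }
        return points;
    }

    // Hashes the values themselves, so the digest depends only on the data
    // and not on any container-level metadata a file format might add.
    public static String contentDigest(double[][] points) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
        ByteBuffer buf = ByteBuffer.allocate(Double.BYTES);
        for (double[] point : points) {
            for (double value : point) {
                buf.clear();
                buf.putDouble(value);
                md.update(buf.array());
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // Two independent "runs" with the same seed and configuration.
        double[][] run1 = generate(42L, 1000, 3, 0.0, 1.0);
        double[][] run2 = generate(42L, 1000, 3, 0.0, 1.0);
        System.out.println("values equal:  " + Arrays.deepEquals(run1, run2));
        System.out.println("digests equal: "
                + contentDigest(run1).equals(contentDigest(run2)));
    }
}
```

The same idea would apply to the files themselves: read the records back from both sequence files and compare the keys and values, not the checksums of the files. If the records match but the final clusters still differ, the difference is more likely introduced somewhere after data generation.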
