Hello Jay,

The serialized data is different (according to a checksum comparison), and the final outputs are different as well.
This puzzles me a lot too. When I dump the content of the sample seed (mean, std) and the initial centroids to the screen, the values are the same across runs. But once they are serialized into sequence files, the checksums of the sequence files differ between runs, and the final output clusters from kmeans differ as well. What could have modified them?

I imagine that generating reproducible input for kmeans should be a fairly common problem, so I wonder if anyone has succeeded and could share their experience. I need to reuse multiple large datasets at different times, so copying the data into a local directory is not an option for me.

Thanks!
Jingyi

On Tue, Sep 2, 2014 at 1:46 PM, <[email protected]> wrote:
> Just to clarify:
>
> Are you saying that the *generated* data is different after it is serialized?
>
> Or that the final outputs are different?
>
>> On Sep 2, 2014, at 1:04 PM, Jingyi Jin <[email protected]> wrote:
>>
>> Hello all,
>>
>> I am fairly new to Mahout. Recently I have been using Mahout KMeans for
>> some of my tasks. For testing purposes, I would like to generate the same
>> input data for Mahout KMeans given a configuration.
>>
>> I started by modifying the sample at
>> /examples/src/main/java/org/apache/mahout/clustering/kmeans/GenKMeansDataset.java.
>> I modified all pseudo random number generators in that file to
>> initialize with constant seeds (initialize MersenneTwisterRNG and
>> GaussianSampleGenerator with const seeds). With this change, when I
>> print the sample seeds, means and std, and initial cluster values,
>> they remain the same across runs.
>>
>> However, for some data (the initial clusters, for example), even though
>> the content of two runs is the same, it becomes different after it is
>> written into sequence files. I could check that the checksums become
>> different, and the generated output clusters are different.
>> So I would like to know if anyone has ever tried to make the
>> random data generation reproducible, and how you succeeded. Why
>> would writing to sequence files alter the content? Any feedback on
>> what I could do to fix my problem would help as well.
>>
>> Thanks for any feedback!
>>
>> Jingyi
