Just to clarify: 

Are you saying that the *generated* data is different after it is serialized?

Or that the final outputs are different?
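
One way to tell the two cases apart is to compare the generated values directly instead of comparing checksums of the files that hold them. Below is a minimal sketch of that check. It assumes java.util.Random as a stand-in for the MersenneTwisterRNG used in GenKMeansDataset, and the class and method names are hypothetical, not part of Mahout:

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Mahout.
class SeededSamples {

    // Draws n Gaussian samples from a generator initialized with a constant seed.
    // java.util.Random is an assumption here, standing in for MersenneTwisterRNG.
    static double[] generate(long seed, int n) {
        Random rng = new Random(seed);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.nextGaussian();
        }
        return out;
    }

    public static void main(String[] args) {
        double[] first = generate(42L, 5);
        double[] second = generate(42L, 5);
        // Compare the values themselves, not a checksum of a file containing them.
        System.out.println("same values: " + Arrays.equals(first, second));
    }
}
```

If the values match but the files do not, the difference comes from serialization rather than from the generator.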

> On Sep 2, 2014, at 1:04 PM, Jingyi Jin <[email protected]> wrote:
> 
> Hello all,
> 
> I am fairly new to Mahout. Recently I have been using Mahout KMeans
> for some of my tasks. For testing purposes, I would like to generate
> the same input data for Mahout KMeans given a configuration.
> 
> I started by modifying the sample at
> /examples/src/main/java/org/apache/mahout/clustering/kmeans/GenKMeansDataset.java.
> I modified all pseudo random number generators in that file to
> initialize with constant seeds (initialize MersenneTwisterRNG and
> GaussianSampleGenerator with const seeds). With this change, when I
> print the sample seeds, means and std, and initial cluster values,
> they remain the same for different runs.
> 
> However, for some data (the initial clusters, for example), even
> though the contents of two runs are the same, they become different
> after they are written to sequence files. I verified that their
> checksums differ and that the generated output clusters are different.
> Has anyone tried to make the random data generation reproducible, and
> if so, how did you succeed? Why would writing to sequence files alter
> the content? Any feedback on what I could do to fix my problem would
> help as well.
> 
> Thanks for any feedback!
> 
> Jingyi
