Re: Hadoop RDD incorrect data

2013-12-10 Thread Matt Cheah
…user@spark.incubator.apache.org  Subject: Re: Hadoop RDD incorrect data  That data size is sufficiently small for the cluster configuration that you mention. Are you doing the sort in local mode or …

Re: Hadoop RDD incorrect data

2013-12-09 Thread Ashish Rangole
…user@spark.incubator.apache.org  Cc: Mingyu Kim  Subject: Re: Hadoop RDD incorrect data  Hi Matt, The behavior for sequenceFile is there because we reuse the same Writable object when reading elements from the file. This is definitely unintuitive, but if you pass through …

RE: Hadoop RDD incorrect data

2013-12-09 Thread Matt Cheah
… None of these configurations lets me sort the dataset without the cluster collapsing. -Matt Cheah  From: Matei Zaharia [matei.zaha...@gmail.com]  Sent: Monday, December 09, 2013 7:02 PM  To: user@spark.incubator.apache.org  Cc: Mingyu Kim  Subject: Re: Hadoop RDD incorrect data …

Re: Hadoop RDD incorrect data

2013-12-09 Thread Matei Zaharia
Hi Matt, The behavior for sequenceFile is there because we reuse the same Writable object when reading elements from the file. This is definitely unintuitive, but if you pass through each data item only once instead of caching it, it can be more efficient (probably should be off by default, though …
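
The usual workaround for this Writable-reuse behavior is to copy each record into immutable values before caching or collecting. A minimal Scala sketch of that pattern, assuming a SequenceFile of (LongWritable, Text) pairs and a placeholder path:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.SparkContext

    // Hypothetical local[N] context; the path below is a placeholder.
    val sc = new SparkContext("local[4]", "sequencefile-copy-example")
    val raw = sc.sequenceFile("hdfs:///path/to/data", classOf[LongWritable], classOf[Text])

    // sequenceFile reuses the same Writable instances across records, so caching
    // or collecting `raw` directly can leave every element pointing at the last
    // record read. Copy each record into immutable values before reusing the RDD.
    val copied = raw.map { case (k, v) => (k.get(), v.toString) }

    copied.cache()
    val rows = copied.collect() // each element now owns its own data
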

Hadoop RDD incorrect data

2013-12-09 Thread Matt Cheah
Hi, Assume my Spark context is pointing to local[N]. If I have an RDD created with sparkContext.sequenceFile(…), and I call .collect() on it immediately (assume it's small), sometimes I get duplicate rows back. In addition, if I call sparkContext.sequenceFile(…) and immediately call an operation …
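
The symptom described here can be sketched as follows (hypothetical path and key/value classes): collecting the raw pairs returns references to a reused Writable, which shows up as duplicate rows, while mapping to immutable copies first avoids it.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.SparkContext

    // Hypothetical local[N] context and SequenceFile path.
    val sc = new SparkContext("local[2]", "duplicate-rows-repro")
    val rdd = sc.sequenceFile("hdfs:///path/to/data", classOf[LongWritable], classOf[Text])

    // May show the same (last-read) value repeated, because the Writables are reused.
    val suspect = rdd.collect()

    // Copying each record before collect() yields the expected distinct rows.
    val expected = rdd.map { case (k, v) => (k.get(), v.toString) }.collect()
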