Subject: Re: Hadoop RDD incorrect data
That data size is sufficiently small for the cluster configuration that you
mention.
Are you doing the sort in local mode or
0. None of
these configurations lets me sort the dataset without the cluster collapsing.
-Matt Cheah
From: Matei Zaharia [matei.zaha...@gmail.com]
Sent: Monday, December 09, 2013 7:02 PM
To: user@spark.incubator.apache.org
Cc: Mingyu Kim
Subject: Re: Hadoop RDD incorrect data
Hi Matt,
The behavior for sequenceFile is there because we reuse the same Writable
object when reading elements from the file. This is definitely unintuitive, but
if you pass through each data item only once instead of caching it, it can be
more efficient (probably should be off by default, though).
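The mechanism Matei describes can be shown outside Spark. Below is a minimal plain-Python sketch (not Spark or Hadoop code; `Record` and `read_records` are hypothetical names) of a reader that reuses one mutable buffer per row, the way Hadoop reuses a Writable. Collecting the raw iterator produces what looks like duplicate rows, and copying each record as it streams past restores the expected values:

```python
# Illustrative sketch: a reader that reuses a single mutable record object,
# mimicking Hadoop's Writable reuse in sequenceFile.
class Record:
    def __init__(self):
        self.value = None

def read_records(values):
    rec = Record()          # one buffer, reused for every row
    for v in values:
        rec.value = v       # overwritten in place instead of reallocated
        yield rec

# Collecting the raw iterator keeps N references to the SAME object,
# so every "row" shows the last value read -- the duplicate-rows symptom.
raw = list(read_records([1, 2, 3]))
print([r.value for r in raw])   # [3, 3, 3]

# Copying each record before holding on to it gives the expected result,
# at the cost of one allocation per row (which caching must pay anyway).
def read_copied(values):
    for rec in read_records(values):
        fresh = Record()
        fresh.value = rec.value
        yield fresh

ok = list(read_copied([1, 2, 3]))
print([r.value for r in ok])    # [1, 2, 3]
```

In Spark terms, the analogous workaround is a `map` that copies each key/value pair out of the reused Writables before calling `cache()` or `collect()`.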
Hi,
Assume my spark context is pointing to local[N]. If I have an RDD created with
sparkContext.sequenceFile(…), and I call .collect() on it immediately (assume
it's small), sometimes I get duplicate rows back. In addition, if I call
sparkContext.sequenceFile(…) and immediately call an operation