I see, that makes a lot of sense now. It isn't unique to Spark, but it would
be great if it were mentioned in the Spark documentation.
I have been using Hadoop for a while and I wasn't aware of it!
Zheng zheng
On Thu, Jun 11, 2015 at 7:21 PM, Will Briggs wrote:
To be fair, this is a long-standing issue due to optimizations for object reuse
in the Hadoop API, and isn't necessarily a failing in Spark - see this blog
post
(https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/)
from 2011 documenting a
Yep you need to use a transformation of the raw value; use toString for
example.
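The toString fix works because it copies the value out of the reused object before the reader overwrites it. A minimal sketch of that idea, using a hypothetical mutable stand-in so it runs without Spark or Hadoop:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class CopyFixDemo {
    // Hypothetical stand-in for Hadoop's Text: one mutable, reusable holder.
    static class MutableKey {
        String value;
        @Override public String toString() { return value; }
    }

    public static void main(String[] args) {
        MutableKey reused = new MutableKey();
        List<String> copied = new ArrayList<>();
        for (String id : new String[] {"a", "b", "c"}) {
            reused.value = id;             // the reader overwrites the same instance
            copied.add(reused.toString()); // copy the value out immediately
        }
        // Because we kept immutable copies, distinct-style logic works.
        System.out.println(new HashSet<>(copied).size()); // prints 3
    }
}
```

The same principle applies to a `map` over the RDD: transform each reused record into an immutable value (a String) before any operation, like distinct(), that holds on to records.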
On Thu, Jun 11, 2015, 8:54 PM Crystal Xing wrote:
That is a little scary.
So you mean that, in general, we shouldn't use Hadoop's Writable as a key in an RDD?
Zheng zheng
On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen wrote:
Guess: it has something to do with the Text object being reused by Hadoop?
You can't in general keep around refs to them since they change. So you may
have a bunch of copies of one object at the end that become just one in
each partition.
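The reuse Sean describes can be reproduced without Spark or Hadoop at all. A minimal sketch, using a hypothetical mutable class in place of Text, showing how keeping references to a reused object collapses distinct values:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class ReuseDemo {
    // Hypothetical stand-in for Hadoop's Text: a plain mutable holder.
    static class MutableKey {
        String value;
    }

    public static void main(String[] args) {
        // A RecordReader-style loop that reuses ONE object for every record.
        MutableKey reused = new MutableKey();
        List<MutableKey> kept = new ArrayList<>();
        for (String id : new String[] {"a", "b", "c"}) {
            reused.value = id; // each record overwrites the same instance
            kept.add(reused);  // we keep a reference, not a copy
        }
        // All three list entries are the same object, now holding "c",
        // so a distinct-style count comes out wrong.
        System.out.println(new HashSet<>(kept).size()); // prints 1, not 3
    }
}
```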
On Thu, Jun 11, 2015, 8:36 PM Crystal Xing wrote:
I load a list of ids from a text file with NLineInputFormat, and when I do
distinct(), it returns an incorrect count.

JavaRDD<Text> idListData = jvc
    .hadoopFile(idList, NLineInputFormat.class,
        LongWritable.class, Text.class)
    .values()
    .distinct();
I should have 7000K