I've encountered a very strange problem: after doing a union of two RDDs, reduceByKey works incorrectly (unless I'm missing something very basic) and passes two objects with different keys to the reduce function! I rewrote the Java class in Scala so I could test it in spark-shell, and I see the same problem.

I have a SingleIdDailyAggData object (really an Avro-generated object) with the following fields:
long dyid;
com.dy.agg.gen.AggData data;
long lastSeen;
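For reference, here is a plain-Scala stand-in for that class (just a sketch: the real class is Avro-generated, and com.dy.agg.gen.AggData is left as an opaque placeholder here):

```scala
// Hypothetical plain-Scala stand-in for the Avro-generated class.
// The real `data` field is a com.dy.agg.gen.AggData; AnyRef stands in for it.
class SingleIdDailyAggData(
    val dyid: java.lang.Long,
    val data: AnyRef,
    val lastSeen: java.lang.Long) {
  def getDyid(): java.lang.Long = dyid
}
```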

I have AvroReader.readKeys, which reads from S3 files and returns these objects as a JavaRDD<SingleIdDailyAggData>:

val ctx = new JavaSparkContext(sc)
val absent = Optional.absent[Integer]()

val dayInputs = new ArrayList[String]()
dayInputs.add("s3n://xxx/blocks/2015-05-26/8765260/day/")
// JavaRDD of SingleIdDailyAggData
val day = AvroReader.readKeys[SingleIdDailyAggData](
  ctx, dayInputs.asInstanceOf[List[String]], absent)

val prevDayInputs = new ArrayList[String]()
prevDayInputs.add("s3n://xxx/blocks/2015-05-25/8765260/day/")
// JavaRDD of SingleIdDailyAggData
val prevDay = AvroReader.readKeys[SingleIdDailyAggData](
  ctx, prevDayInputs.asInstanceOf[List[String]], absent)

// long -> SingleIdDailyAggData
val dayByKey = day.rdd.keyBy((x: SingleIdDailyAggData) => x.getDyid())

// long -> SingleIdDailyAggData
val prevDayByKey = prevDay.rdd.keyBy((x: SingleIdDailyAggData) => x.getDyid())

//union
val un = dayByKey.union(prevDayByKey)

val r = un.reduceByKey { (v1, v2) =>
  if (v1.getDyid().longValue == v2.getDyid().longValue())
    new SingleIdDailyAggData(v1.getDyid(), null, null)
  else
    throw new IllegalArgumentException(
      v1.getDyid().longValue + " != " + v2.getDyid().longValue())
}
r.count()


Running this fails with:

Task 7 in stage 3.0 failed 1 times, most recent failure: Lost task 7.0 in
stage 3.0 (TID 133, localhost): java.lang.IllegalArgumentException:
522417601156300917 != 2481483430461576405
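To be explicit about the contract I'm relying on: reduceByKey should only ever hand the reduce function two values that share the same key. A minimal plain-collections sketch of that semantics (no Spark involved, names are mine):

```scala
object ReduceByKeyContract {
  // Per-key reduction over plain collections: values are grouped by key
  // first, so the reduce function never sees values from different keys.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    val day     = Seq(1L -> 10, 2L -> 20)
    val prevDay = Seq(1L -> 1, 3L -> 3)
    // "union" the two datasets, then reduce by key
    val merged = reduceByKey(day ++ prevDay)(_ + _)
    assert(merged == Map(1L -> 11, 2L -> 20, 3L -> 3))
  }
}
```

That is what I expected union + reduceByKey to do in Spark as well.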

So the key is not a custom object; it's Java's Long (i.e., this is not a hashCode-vs-equals problem).
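To double-check the key type itself: boxed Longs with equal values compare equal and hash identically, so hash-partitioning on them should be safe (one of the values below is taken from the error message above):

```scala
object LongKeyCheck {
  def main(args: Array[String]): Unit = {
    // Two independently boxed Longs with the same value: equals and
    // hashCode agree, so a hash shuffle should route them to one partition.
    val a = java.lang.Long.valueOf(522417601156300917L)
    val b = java.lang.Long.valueOf("522417601156300917")
    assert(a.equals(b))
    assert(a.hashCode == b.hashCode)
  }
}
```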

Any ideas would be highly appreciated!
Igor



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
