I've encountered a very strange problem: after taking the union of two RDDs, reduceByKey works incorrectly (unless I'm missing something very basic) and passes two objects with *different* keys to the reduce function! I rewrote the Java class in Scala to test it in spark-shell and I see the same problem.

I have a SingleIdDailyAggData object (it's really an Avro object) with the following fields:

    long dyid;
    com.dy.agg.gen.AggData data;
    long lastSeen;
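For anyone who wants to reproduce this without the Avro-generated sources, here is a minimal, hypothetical Scala stand-in for the class (AggData is stubbed; only the getters used below are modeled):

```scala
// Hypothetical simplified model of the Avro-generated class, for
// experimenting in spark-shell without the real generated code.
case class AggData(payload: String) // stub for com.dy.agg.gen.AggData

class SingleIdDailyAggData(val dyid: java.lang.Long,
                           val data: AggData,
                           val lastSeen: java.lang.Long) {
  // Avro-generated classes expose JavaBean-style getters
  def getDyid: java.lang.Long = dyid
  def getData: AggData = data
  def getLastSeen: java.lang.Long = lastSeen
}
```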
I have AvroReader.readKeys that returns these objects as JavaRDD<SingleIdDailyAggData> (reading from S3 files):

    val ctx = new JavaSparkContext(sc)
    val absent = Optional.absent[Integer]()

    val dayInputs = new ArrayList[String]()
    dayInputs.add("s3n://xxx/blocks/2015-05-26/8765260/day/")
    // java rdd of SingleIdDailyAggData
    val day = AvroReader.readKeys[SingleIdDailyAggData](ctx, dayInputs.asInstanceOf[List[String]], absent)

    val prevDayInputs = new ArrayList[String]()
    prevDayInputs.add("s3n://xxx/blocks/2015-05-25/8765260/day/")
    // java rdd of SingleIdDailyAggData
    val prevDay = AvroReader.readKeys[SingleIdDailyAggData](ctx, prevDayInputs.asInstanceOf[List[String]], absent)

    // long -> SingleIdDailyAggData
    val dayByKey = day.rdd.keyBy((x: SingleIdDailyAggData) => x.getDyid())
    // long -> SingleIdDailyAggData
    val prevDayByKey = prevDay.rdd.keyBy((x: SingleIdDailyAggData) => x.getDyid())

    // union
    val un = dayByKey.union(prevDayByKey)
    val r = un.reduceByKey((v1, v2) =>
      if (v1.getDyid().longValue == v2.getDyid().longValue())
        new SingleIdDailyAggData(v1.getDyid(), null, null)
      else
        throw new IllegalArgumentException(v1.getDyid().longValue + " != " + v2.getDyid().longValue())
    )
    r.count()

This fails with:

    Task 7 in stage 3.0 failed 1 times, most recent failure: Lost task 7.0
    in stage 3.0 (TID 133, localhost): java.lang.IllegalArgumentException:
    522417601156300917 != 2481483430461576405

So the key is not a custom object, it's Java's Long (i.e. this is not a hashCode-vs-equals problem). Any ideas would be highly appreciated!

Igor

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
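For reference, the contract reduceByKey is expected to honour is that the reduce function is only ever applied to two values that share the same key. A minimal plain-Scala sketch (no Spark; the keys are the two IDs from the exception above, the string values are made up) of what union followed by reduceByKey should do:

```scala
// Plain-Scala sketch (no Spark) of the expected semantics: union just
// concatenates the keyed datasets, the shuffle co-locates pairs by key,
// and the reduce function folds values within one key only.
def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1)                       // "shuffle": co-locate by key
       .map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

val day     = Seq((522417601156300917L, "a"), (2481483430461576405L, "b"))
val prevDay = Seq((522417601156300917L, "c"))
val un      = day ++ prevDay                // union: plain concatenation

// Each reduce call sees values of a single key, so per-key concatenation:
val r = reduceByKey(un)(_ + _)
```

With correct shuffle behaviour, r maps the shared key to the combined value and the other key to its single value; two different keys can never meet inside the reduce function, which is why the exception in the post points at a shuffle problem rather than a keying mistake.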