Re: [Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Mich Talebzadeh

Hi,

Quick observations from what you have provided:

- The observed discrepancy between rdd.count() and rdd.map(Item::getType).countByValue() in distributed mode suggests a potential aggregation issue with countByValue(). The correct results in local mode give credence to this theory.
- Workarounds

[Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Stuart Fehr

Hello,

I recently encountered a bug with the results from JavaRDD#countByValue that does not reproduce when running locally. For background, we are running a Spark 3.5.0 job on AWS EMR 7.0.0. The code in question is something like this:

    JavaRDD rdd = // ...
    rdd.count(); // 75187
    // Get the
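The report comes down to an invariant that should always hold: the per-value counts must sum to the total element count. The sketch below is a simplified plain-Java model of that check, not Spark's implementation and not the poster's code. The class name CountCheck, the String keys and the sample data are all hypothetical. It counts each "partition" separately and then merges the partial maps by key, which is the general shape of a distributed count by value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountCheck {
    // Count occurrences inside a single partition (models the per-task step).
    public static Map<String, Long> countPartition(List<String> partition) {
        Map<String, Long> counts = new HashMap<>();
        for (String value : partition) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    // Merge the per-partition maps. Keys are matched with equals()/hashCode(),
    // so the merge is only correct if those are stable for the key type.
    public static Map<String, Long> countByValue(List<List<String>> partitions) {
        Map<String, Long> merged = new HashMap<>();
        for (List<String> partition : partitions) {
            countPartition(partition).forEach((k, v) -> merged.merge(k, v, Long::sum));
        }
        return merged;
    }

    // Sum of all per-value counts; should equal the total number of elements.
    public static long total(Map<String, Long> counts) {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Hypothetical data: two partitions of type labels.
        List<List<String>> partitions = List.of(
                List.of("A", "B", "A"),
                List.of("B", "C"));
        Map<String, Long> counts = countByValue(partitions);
        long elements = partitions.stream().mapToLong(List::size).sum();
        System.out.println("counts=" + counts
                + " sum=" + total(counts) + " elements=" + elements);
    }
}
```

In a real job, the same comparison would be made between the sum of the values returned by countByValue() and the result of count() on the same RDD. A mismatch that only appears on a cluster points at the merge step or at how keys are compared across JVMs, rather than at the input data.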