Hello (sorry for the sensationalist title) :)
If I run Spark on files from S3 and do a basic transformation like textFile() -> filter -> groupByKey -> count, I get one number (e.g. 40,000). If I do the same on the same files from HDFS, the number spat out is completely different (VERY different, something like 13,000). What would one do in a situation like this? How do I even go about figuring out what the problem is? This runs on a cluster of 15 instances on Amazon.

Thanks,
Ognen
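
P.S. In case it helps, here's roughly the shape of the job (a simplified sketch, not the real code; the master URL, paths, filter predicate, and key extraction below are placeholders):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicit pair-RDD functions for groupByKey

    object CountCompare {
      def main(args: Array[String]): Unit = {
        // Hypothetical master URL -- substitute the real one.
        val sc = new SparkContext("spark://master:7077", "count-compare")

        // Identical pipeline, parameterized only by the input path.
        def pipeline(path: String): Long =
          sc.textFile(path)
            .filter(line => line.nonEmpty)            // placeholder predicate
            .map(line => (line.split(",")(0), line))  // placeholder: key by first field
            .groupByKey()
            .count()

        // Hypothetical paths -- the real job points at the same files in both stores.
        val s3Count   = pipeline("s3n://my-bucket/data/*")
        val hdfsCount = pipeline("hdfs://namenode:9000/data/*")

        println(s"S3 count: $s3Count, HDFS count: $hdfsCount")
      }
    }

The only thing that changes between the two runs is the input path; everything after textFile() is identical, which is why the differing counts surprise me.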
