Hello,

(Sorry for the sensationalist title) :)

If I run Spark on files from S3 and do basic transformations like:

textFile()
filter
groupByKey
count

I get one number (e.g. 40,000).
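
To be concrete, the job looks roughly like this (the paths, the filter
predicate and the key extraction below are simplified placeholders, not
my actual code):

  val lines    = sc.textFile("s3n://my-bucket/input/*") // or hdfs://... for the HDFS run
  val filtered = lines.filter(line => line.nonEmpty)    // placeholder predicate
  // groupByKey needs (key, value) pairs, so there is a map in between:
  val grouped  = filtered.map(line => (line.split("\t")(0), line)).groupByKey()
  println(grouped.count())  // ~40,000 from S3, ~13,000 from HDFS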

If I run the same job on the same files in HDFS, the number spat out is
completely different (VERY different - something like 13,000).

What would one do in a situation like this? How do I even go about figuring
out what the problem is? This runs on a cluster of 15 instances on Amazon.
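
For example, would counting after each stage on both inputs be a sane way
to narrow down where the numbers start to diverge? Something like this
(again with placeholder paths and predicate):

  def stageCounts(path: String): (Long, Long, Long) = {
    val lines    = sc.textFile(path)
    val filtered = lines.filter(_.nonEmpty)  // same placeholder predicate as above
    val grouped  = filtered.map(l => (l.split("\t")(0), l)).groupByKey()
    (lines.count(), filtered.count(), grouped.count())
  }
  println(stageCounts("s3n://my-bucket/input/*"))
  println(stageCounts("hdfs:///user/ognen/input/*"))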

Thanks,
Ognen
