Is there any non-deterministic code in your filter, such as Random.nextInt()? If so, the program is no longer idempotent, and each run can return a different count. You should give the random generator a fixed seed.
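To illustrate what I mean by a seed, here is a minimal sketch in plain Java, without Spark. The class name SeededFilter, the method sample, and the seed value 42 are made up for the example and are not taken from your job; the only point is that a java.util.Random built from a fixed seed makes the same keep/drop decisions on every run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class SeededFilter {
    // Hypothetical helper: keep each record with probability `fraction`.
    // The generator is created from a fixed seed, so repeated runs over the
    // same records in the same order make identical decisions.
    public static List<String> sample(List<String> records, double fraction, long seed) {
        Random rng = new Random(seed);
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (rng.nextDouble() < fraction) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h");
        List<String> first = sample(records, 0.5, 42L);
        List<String> second = sample(records, 0.5, 42L);
        // Same seed and same input order give the same filtered list.
        System.out.println(first.equals(second)); // prints "true"
    }
}
```

Note that a seed only helps if the records reach the generator in the same order each time. If the two sources split or order the input differently, a seeded filter can still keep different records.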

2014/1/24 Ognen Duzlevski <[email protected]>

> Hello,
>
> (Sorry for the sensationalist title) :)
>
> If I run Spark on files from S3 and do basic transformation like:
>
> textfile()
> filter
> groupByKey
> count
>
> I get one number (e.g. 40,000).
>
> If I do the same on the same files from HDFS, the number spat out is
> completely different (VERY different - something like 13,000).
>
> What would one do in a situation like this? How do I even go about
> figuring out what the problem is? This is run on a cluster of 15 instances
> on Amazon.
>
> Thanks,
> Ognen

--
Best Regards
-----------------------------------
Xusen Yin 尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China
Homepage: http://yinxusen.github.io/
