I can confirm that there is something seriously wrong with this. If I run the spark-shell with local[4] on the same cluster and run the same task on the same hdfs:// files, I get an output like:

res0: Long = 58177

If I run the spark-shell on the cluster with 15 nodes and run the same task, I get:

res0: Long = 14137

This is just crazy.

Ognen
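The task referred to above is the one described in the quoted message below. A minimal spark-shell sketch of it, assuming the Spark 0.9-era API (where groupByKey returns (K, Seq[V])); the paths and the field-extraction step are placeholders, not the actual code from the thread:

    // Hypothetical stand-in for the real split/extract step described below.
    def extractField(line: String): String = line.split(",")(0)   // placeholder

    // Records from the first file are tagged 0, from the second file 1.
    val events  = sc.textFile("hdfs:///data/first.json").map(l => (extractField(l), 0))
    val events1 = sc.textFile("hdfs:///data/second.json").map(l => (extractField(l), 1))

    // Union the two tagged RDDs, group by the extracted field, and keep only
    // groups with more than one element whose first element is the 0 tag,
    // mirroring the filter in the message below.
    val ret = events.union(events1)
    val r   = ret.groupByKey().filter(e => e._2.length > 1 && e._2(0) == 0)
    r.count   // in spark-shell this prints something like: res0: Long = ...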
On Fri, Jan 24, 2014 at 1:39 PM, Ognen Duzlevski <[email protected]> wrote:

> Thanks.
>
> This is a VERY simple example.
>
> I have two 20 GB JSON files. Each line in the files has the same format.
>
> I run: val events = filter(_.split(something)(get the field)).map(field => (field, 0)) on the first file.
> I then run val events1 = the same filter on the second file and do map(field => (field, 1)).
>
> This ensures that events has the form (field, 0) and events1 has the form (field, 1).
>
> I then do val ret = events.union(events1) - this will put all the fields in the same RDD.
>
> Then I do val r = ret.groupByKey().filter(e => e._2.length > 1 && e._2(0) == 0) to make sure all groups with a given key have at least two elements and the first one is a zero (so, for example, an entry in this structure will have the form (field, (0, 1, 1, 1, ...))).
>
> I then just do a simple r.count.
>
> Ognen
>
> On Fri, Jan 24, 2014 at 1:29 PM, 尹绪森 <[email protected]> wrote:
>
>> 1. Is there any in-place operation in your code, such as addi() for DoubleMatrix? That kind of operation modifies the original data.
>>
>> 2. You could try the Spark replay debugger; it has an assert function. Hope that is helpful.
>> http://spark-replay-debugger-overview.readthedocs.org/en/latest/
>>
>> 2014/1/24 Ognen Duzlevski <[email protected]>
>>
>>> No. It is a filter that splits a line in a JSON file and extracts a position from it - every run is the same.
>>>
>>> That's what bothers me about this.
>>>
>>> Ognen
>>>
>>> On Fri, Jan 24, 2014 at 12:40 PM, 尹绪森 <[email protected]> wrote:
>>>
>>>> Is there any non-deterministic code in the filter, such as Random.nextInt()? If so, the program is no longer idempotent. You should specify a seed for it.
>>>>
>>>> 2014/1/24 Ognen Duzlevski <[email protected]>
>>>>
>>>>> Hello,
>>>>>
>>>>> (Sorry for the sensationalist title) :)
>>>>>
>>>>> If I run Spark on files from S3 and do a basic transformation like:
>>>>>
>>>>> textFile()
>>>>> filter
>>>>> groupByKey
>>>>> count
>>>>>
>>>>> I get one number (e.g. 40,000).
>>>>>
>>>>> If I do the same on the same files from HDFS, the number spat out is completely different (VERY different - something like 13,000).
>>>>>
>>>>> What would one do in a situation like this? How do I even go about figuring out what the problem is? This is run on a cluster of 15 instances on Amazon.
>>>>>
>>>>> Thanks,
>>>>> Ognen
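For reference, a minimal spark-shell sketch of the four steps listed in the question above, run once against S3 and once against HDFS. The bucket, paths, filter predicate, and key extraction are placeholders (the thread does not show them), and the predicate is deterministic, in line with the seeding suggestion above:

    // Placeholder predicate and key extraction; only the input path differs
    // between the two runs, so a deterministic pipeline should give one count.
    def countGroups(path: String): Long =
      sc.textFile(path)
        .filter(line => line.nonEmpty)            // placeholder, deterministic predicate
        .map(line => (line.split(",")(0), 1))     // placeholder key extraction
        .groupByKey()
        .count()

    val fromS3   = countGroups("s3n://some-bucket/events/")   // hypothetical bucket
    val fromHdfs = countGroups("hdfs:///events/")             // hypothetical path
    // With identical input data, fromS3 and fromHdfs should be equal.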
--
"Le secret des grandes fortunes sans cause apparente est un crime oublié, parce qu'il a été proprement fait" - Honore de Balzac
