Ahhh. That explains it (groupByKey not keeping order). I was counting on the order, so it made perfect sense to have an extra check where the first day gives you a zero and the next day gives you a 1 (this is day-to-day retention analysis).
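For reference, a minimal order-independent version of that check. This is only a sketch: the sample data and names below are made up and stand in for the real per-day (field, dayTag) pairs; only the union/groupByKey shape comes from the thread.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object RetentionCheckSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "retention-check-sketch")

    // Made-up stand-ins for the real per-day (field, dayTag) pairs.
    val day0 = sc.parallelize(Seq("a", "b", "c")).map(id => (id, 0))
    val day1 = sc.parallelize(Seq("a", "c", "d")).map(id => (id, 1))

    // Keep keys seen on day 0 that show up again on day 1, without assuming
    // anything about the order in which groupByKey collects the values.
    val retained = day0.union(day1)
      .groupByKey()
      .filter { case (_, tags) => tags.exists(_ == 0) && tags.exists(_ == 1) }

    println(retained.count())  // 2 here: "a" and "c"
    sc.stop()
  }
}
```

With membership tests instead of positional checks, the result no longer depends on the order in which groupByKey happens to collect the values.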
Thank you!
Ognen

On Fri, Jan 24, 2014 at 2:44 PM, 尹绪森 <[email protected]> wrote:

> r = ret.groupByKey().filter(e => e._2.length > 1 && e._2(0)==0)
>
> Why choose `e._2(0) == 0`? How about e._2(0) != 0? I am not very sure
> whether groupByKey will keep the order of elements. How about sampling a
> subset of your dataset and logging some information out, e.g. logInfo(e._2)?
>
>
> 2014/1/24 Ognen Duzlevski <[email protected]>
>
>> I can confirm that there is something seriously wrong with this.
>>
>> If I run the spark-shell with local[4] on the same cluster and run the
>> same task on the same hdfs:// files I get an output like
>>
>> res0: Long = 58177
>>
>> If I run the spark-shell on the cluster with 15 nodes, same task, I get
>>
>> res0: Long = 14137
>>
>> This is just crazy.
>>
>> Ognen
>>
>>
>> On Fri, Jan 24, 2014 at 1:39 PM, Ognen Duzlevski <[email protected]> wrote:
>>
>>> Thanks.
>>>
>>> This is a VERY simple example.
>>>
>>> I have two 20 GB json files. Each line in the files has the same format.
>>> I run: val events = filter(_split(something)(get the field)).map(field
>>> => (field, 0)) on the first file.
>>> I then run val events1 = the same filter on the second file and do
>>> map(field => (field, 1)).
>>>
>>> This ensures that events has the form (field, 0) and events1 has the
>>> form (field, 1).
>>>
>>> I then do val ret = events.union(events1) - this puts all the fields
>>> in the same RDD.
>>>
>>> Then I do val r = ret.groupByKey().filter(e => e._2.length > 1 &&
>>> e._2(0)==0) to make sure all groups with key field have at least two
>>> elements and the first one is a zero (so, for example, an entry in this
>>> structure will have the form (field, (0, 1, 1, 1, ...))).
>>>
>>> I then just do a simple r.count.
>>>
>>> Ognen
>>>
>>>
>>> On Fri, Jan 24, 2014 at 1:29 PM, 尹绪森 <[email protected]> wrote:
>>>
>>>> 1. Are there any in-place operations in your code, such as addi() for
>>>> DoubleMatrix? That kind of operation modifies the original data.
>>>>
>>>> 2. You could try the Spark replay debugger; it has an assert function.
>>>> Hope that helps.
>>>> http://spark-replay-debugger-overview.readthedocs.org/en/latest/
>>>>
>>>>
>>>> 2014/1/24 Ognen Duzlevski <[email protected]>
>>>>
>>>>> No. It is a filter that splits a line in a json file and extracts a
>>>>> position from it - every run is the same.
>>>>>
>>>>> That's what bothers me about this.
>>>>>
>>>>> Ognen
>>>>>
>>>>>
>>>>> On Fri, Jan 24, 2014 at 12:40 PM, 尹绪森 <[email protected]> wrote:
>>>>>
>>>>>> Is there any non-deterministic code in the filter, such as
>>>>>> Random.nextInt()? If so, the program is no longer idempotent and you
>>>>>> should specify a seed for it.
>>>>>>
>>>>>>
>>>>>> 2014/1/24 Ognen Duzlevski <[email protected]>
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> (Sorry for the sensationalist title) :)
>>>>>>>
>>>>>>> If I run Spark on files from S3 and do basic transformations like:
>>>>>>>
>>>>>>> textfile()
>>>>>>> filter
>>>>>>> groupByKey
>>>>>>> count
>>>>>>>
>>>>>>> I get one number (e.g. 40,000).
>>>>>>>
>>>>>>> If I do the same on the same files from HDFS, the number spat out is
>>>>>>> completely different (VERY different - something like 13,000).
>>>>>>>
>>>>>>> What would one do in a situation like this? How do I even go about
>>>>>>> figuring out what the problem is? This is run on a cluster of 15
>>>>>>> instances on Amazon.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ognen
>
> --
> Best Regards
> -----------------------------------
> Xusen Yin 尹绪森
> Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
> Beijing University of Posts & Telecommunications
> Intel Labs China
> Homepage: http://yinxusen.github.io/

--
"Le secret des grandes fortunes sans cause apparente est un crime oublié, parce qu'il a été proprement fait" - Honore de Balzac
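For anyone trying to reproduce the discrepancy, here is a rough, self-contained sketch of the pipeline described in the thread, plus the sampling/printing check Xusen suggested. The field extraction and input paths are hypothetical placeholders; the actual split logic is elided in the original mails, and only the union -> groupByKey -> filter -> count shape is taken from the thread.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object UnionGroupCountSketch {
  // Hypothetical stand-in for "split the json line and extract the field".
  def extractField(line: String): String = line.split(",")(0)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "union-group-count-sketch")

    // Hypothetical paths; in the thread these are two 20 GB json files.
    val events  = sc.textFile("hdfs:///data/day0.json").map(l => (extractField(l), 0))
    val events1 = sc.textFile("hdfs:///data/day1.json").map(l => (extractField(l), 1))

    val ret = events.union(events1)

    // Order-dependent check from the thread: at least two values, first one 0.
    // (.size/.head instead of .length/(0) so it also compiles when groupByKey
    // returns an Iterable rather than a Seq.)
    val r = ret.groupByKey().filter(e => e._2.size > 1 && e._2.head == 0)
    println("count = " + r.count())

    // Sampling check suggested in the thread: print a few groups to see what
    // order the values actually arrive in on this deployment.
    ret.groupByKey().take(20).foreach { case (k, vs) =>
      println(k + " -> " + vs.mkString(", "))
    }

    sc.stop()
  }
}
```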
