r = ret.groupByKey().filter(e => e._2.length > 1 && e._2(0)==0)
Why choose `e._2(0) == 0`? What about `e._2(0) != 0`? I am not sure that groupByKey preserves the order of elements within a group. How about sampling a subset of your dataset and logging some information, e.g. logInfo(e._2)?

2014/1/24 Ognen Duzlevski <[email protected]>

> I can confirm that there is something seriously wrong with this.
>
> If I run the spark-shell with local[4] on the same cluster and run the
> same task on the same hdfs:// files I get an output like
>
> res0: Long = 58177
>
> If I run the spark-shell on the cluster with 15 nodes, same task I get
>
> res0: Long = 14137
>
> This is just crazy.
>
> Ognen
>
> On Fri, Jan 24, 2014 at 1:39 PM, Ognen Duzlevski <[email protected]> wrote:
>
>> Thanks.
>>
>> This is a VERY simple example.
>>
>> I have two 20 GB json files. Each line in the files has the same format.
>> I run: val events = filter(_split(something)(get the field)).map(field =>
>> (field, 0)) on the first file
>> I then run val events1 = the same filter on the second file and do
>> map(field => (field, 1))
>>
>> This ensures that events has the form (field, 0) and events1 has the form
>> (field, 1)
>>
>> I then do val ret=events.union(events1) - this will put all the fields in
>> the same RDD
>>
>> Then I do val r = ret.groupByKey().filter(e => e._2.length > 1 &&
>> e._2(0)==0) to make sure all groups with key field have at least two
>> elements and the first one is a zero (so, for example, an entry in this
>> structure will have the form (field, (0, 1, 1, 1, ...)))
>>
>> I then just do a simple r.count
>>
>> Ognen
>>
>> On Fri, Jan 24, 2014 at 1:29 PM, 尹绪森 <[email protected]> wrote:
>>
>>> 1. Is there any in-place operation in your code, such as addi() for
>>> DoubleMatrix? This kind of operation will affect the original data.
>>>
>>> 2. You could try the Spark replay debugger; it has an assert
>>> function. Hope that helps.
>>> http://spark-replay-debugger-overview.readthedocs.org/en/latest/
>>>
>>> 2014/1/24 Ognen Duzlevski <[email protected]>
>>>
>>>> No. It is a filter that splits a line in a json file and extracts a
>>>> position for it - every run is the same.
>>>>
>>>> That's what bothers me about this.
>>>>
>>>> Ognen
>>>>
>>>> On Fri, Jan 24, 2014 at 12:40 PM, 尹绪森 <[email protected]> wrote:
>>>>
>>>>> Is there any non-deterministic code in the filter, such as
>>>>> Random.nextInt()? If so, the program is no longer idempotent. You
>>>>> should specify a seed for it.
>>>>>
>>>>> 2014/1/24 Ognen Duzlevski <[email protected]>
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> (Sorry for the sensationalist title) :)
>>>>>>
>>>>>> If I run Spark on files from S3 and do basic transformation like:
>>>>>>
>>>>>> textfile()
>>>>>> filter
>>>>>> groupByKey
>>>>>> count
>>>>>>
>>>>>> I get one number (e.g. 40,000).
>>>>>>
>>>>>> If I do the same on the same files from HDFS, the number spat out is
>>>>>> completely different (VERY different - something like 13,000).
>>>>>>
>>>>>> What would one do in a situation like this? How do I even go about
>>>>>> figuring out what the problem is? This is run on a cluster of 15
>>>>>> instances on Amazon.
>>>>>>
>>>>>> Thanks,
>>>>>> Ognen
>>>>>
>>>>> --
>>>>> Best Regards
>>>>> -----------------------------------
>>>>> Xusen Yin 尹绪森
>>>>> Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
>>>>> Beijing University of Posts & Telecommunications
>>>>> Intel Labs China
>>>>> Homepage: http://yinxusen.github.io/
>>>>
>>>> --
>>>> "Le secret des grandes fortunes sans cause apparente est un crime
>>>> oublié, parce qu'il a été proprement fait" - Honore de Balzac

--
Best Regards
-----------------------------------
Xusen Yin 尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China
Homepage: http://yinxusen.github.io/
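
To make the ordering concern at the top of this message concrete, here is a minimal sketch that uses plain Scala collections instead of Spark RDDs, so it runs without a cluster. The sample data and the names eventsA, eventsB, group, firstIsZero and inBoth are all hypothetical. It also assumes that the intent of the filter is "keep fields that appear in both files", which is my reading and may not match what you want. It compares the filter from the thread with a check that does not depend on which tag comes first in a group.

```scala
// Sketch only: plain Scala collections stand in for the RDDs in the thread.
object OrderIndependentFilter {
  // The filter from the thread: true only if the group has more than one
  // element AND the first element happens to be the tag 0.
  def firstIsZero(tags: Seq[Int]): Boolean =
    tags.length > 1 && tags(0) == 0

  // Assumed intent: the key was seen in the first file (tag 0) and in the
  // second file (tag 1), regardless of the order of the tags.
  def inBoth(tags: Seq[Int]): Boolean =
    tags.contains(0) && tags.contains(1)

  // Group (field, tag) pairs by field, keeping only the tags per field.
  def group(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  def main(args: Array[String]): Unit = {
    val eventsA = Seq("a", "b", "c").map(f => (f, 0)) // hypothetical first file
    val eventsB = Seq("b", "c", "d").map(f => (f, 1)) // hypothetical second file

    // Same elements, different arrival order, standing in for a different
    // partition or shuffle order between a local run and a cluster run.
    val g1 = group(eventsA ++ eventsB)
    val g2 = group(eventsB ++ eventsA)

    println(g1.values.count(firstIsZero)) // 2
    println(g2.values.count(firstIsZero)) // 0
    println(g1.values.count(inBoth))      // 2
    println(g2.values.count(inBoth))      // 2
  }
}
```

If the counts on your real data become stable with an order-independent check like inBoth, that would suggest the difference between local[4] and the 15-node cluster comes from the order of values within each group rather than from the input files.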
