Hmm, OK, but I'm not seeing why foldByKey is more appropriate than reduceByKey. Specifically, is foldByKey guaranteed to walk the RDD in order, while reduceByKey is not?
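For concreteness, here's how I read the foldByKey suggestion. This is only a sketch; Row and its key field are stand-ins for my actual parsed rows:

import org.apache.spark.rdd.RDD

// Stand-in for one parsed line of the file; "key" is the dedup key.
case class Row(key: String, value: String)

// Fold with a None zero value and keep the first Some per key. Whether
// "first" means file order is exactly what I'm unsure about: merging
// partial results across partitions may not preserve encounter order.
def firstPerKey(rdd: RDD[Row]): RDD[(String, Row)] =
  rdd
    .map(row => (row.key, Option(row)))
    .foldByKey(Option.empty[Row]) {
      case (None, v) => v   // first value encountered wins
      case (acc, _)  => acc // every later value is a no-op
    }
    .flatMap { case (k, opt) => opt.map(k -> _) }

If the (acc, _) branch can see partial results merged out of encounter order, I don't see what guarantees it keeps the row that appeared first in the file.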
On Mon, Sep 21, 2015 at 8:41 PM, Sean Owen <so...@cloudera.com> wrote:
> The zero value here is None. Combining None with any row should yield
> Some(row). After that, combining is a no-op for other rows.
>
> On Tue, Sep 22, 2015 at 4:27 AM, Philip Weaver <philip.wea...@gmail.com> wrote:
> > Hmm, I don't think that's what I want. There's no "zero value" in my
> > use case.
> >
> > On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen <so...@cloudera.com> wrote:
> >> I think foldByKey is much more what you want, as it has more of a
> >> notion of building up some result per key by encountering values
> >> serially. You would take the first and ignore the rest. Note that
> >> "first" depends on your RDD having an ordering to begin with, or else
> >> you rely on however it happens to be ordered after whatever operations
> >> give you a key-value RDD.
> >>
> >> On Tue, Sep 22, 2015 at 1:26 AM, Philip Weaver <philip.wea...@gmail.com> wrote:
> >> > I am processing a single file and want to remove duplicate rows by
> >> > some key, always choosing the first row in the file for that key.
> >> >
> >> > The best solution I could come up with is to zip each row with the
> >> > partition index and local index, like this:
> >> >
> >> > rdd.mapPartitionsWithIndex { case (partitionIndex, rows) =>
> >> >   rows.zipWithIndex.map { case (row, localIndex) =>
> >> >     (row.key, ((partitionIndex, localIndex), row))
> >> >   }
> >> > }
> >> >
> >> > and then using reduceByKey with a min ordering on the
> >> > (partitionIndex, localIndex) pair.
> >> >
> >> > First, can I count on SparkContext.textFile to read the lines such
> >> > that the partition indexes are always increasing, so that the above
> >> > works?
> >> >
> >> > And, is there a better way to accomplish the same effect?
> >> >
> >> > Thanks!
> >> >
> >> > - Philip
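P.S. For reference, here is the zipWithIndex approach from my original message with the reduceByKey step spelled out, using the same Row stand-in as the sketch above:

def firstPerKeyByPosition(rdd: RDD[Row]): RDD[(String, Row)] =
  rdd
    .mapPartitionsWithIndex { case (partitionIndex, rows) =>
      // Tag each row with its (partition, local) position in the file.
      rows.zipWithIndex.map { case (row, localIndex) =>
        (row.key, ((partitionIndex, localIndex), row))
      }
    }
    .reduceByKey { (a, b) =>
      // Keep whichever row carries the smaller position; tuples compare
      // lexicographically, so this is file order provided textFile's
      // partition indexes follow file order.
      if (Ordering[(Int, Int)].lt(a._1, b._1)) a else b
    }
    .mapValues { case (_, row) => row }

Taking the min position is commutative, so this works regardless of the order in which reduceByKey combines values, which is why I'd still like to understand what ordering foldByKey actually guarantees.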