Hi Michael,
I'm not sure I fully understood your question, but I think RDD.aggregate
can be helpful in your case. You can see it as a more general version of
fold.
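For example, here is a quick generic sketch (not tied to your data) of summing an RDD[Float] while also counting its elements - with aggregate, the accumulator type can differ from the element type, which plain fold doesn't allow:

val nums = sc.parallelize(Seq(1.0f, 2.0f, 3.0f))  // RDD[Float]
val (sum, count) = nums.aggregate((0.0f, 0L))(
  (acc, x) => (acc._1 + x, acc._2 + 1),      // seqOp: fold one element into a partition-local accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))      // combOp: merge accumulators across partitions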
Cheng
On 10/16/14 11:15 PM, Michael Misiewicz wrote:
Hi,
I'm working on a problem where I'd like to sum items in an RDD in order (approximately). I am currently trying to implement this
using a fold, but I'm having some issues because the sorting key of my
data is not the same as the folding key for my data. I have data that
looks like this:
user_id, transaction_timestamp, transaction_amount
And I'm interested in doing a foldByKey on user_id to sum transaction amounts - taking care to note approximately when a user surpasses a total transaction threshold. I'm using RangePartitioner to make sure that data is ordered sequentially between partitions, and I'd also like to make sure that data is sorted within partitions, though I'm not sure how to do this exactly (I was going to look at the code for sortByKey to figure this out - I believe sorting in place in a mapPartitions should work; there's a rough sketch of that idea after the sample code). What do you think about the approach? Here's some sample code that demonstrates what I'm thinking:
def myFold(v1: Float, v2: Float): Float = {
  val partialSum = v1 + v2
  if (partialSum >= 500) {
    // make a note of it, do things
  }
  partialSum
}
val rawData = sc.textFile("hdfs://path/to/data").map { x => // load data
  val l = x.split(",")
  // user_id: Long, transaction_timestamp: Long, transaction_amount: Float
  (l(0).toLong, l(1).toLong, l(2).toFloat)
}
// rearrange to make timestamp the key (for sorting), convert to PairRDD
val keyByTimestamp = rawData.map(x => (x._2, (x._1, x._3)))
val sortedByTimestamp = keyByTimestamp.sortByKey()
val partitionedByTimestamp = sortedByTimestamp.partitionBy(
  new org.apache.spark.RangePartitioner(partitions = 500, rdd = sortedByTimestamp)).persist()
// By this point, the RDD should be sorted and partitioned according to the timestamp.
// However, I now need to make user_id the key, because the output must be per user.
// At this point, since I change the keys of the PairRDD, I understand that I lose the
// partitioning; the consequence of this is that I can no longer be sure in my fold
// function that the ordering is retained.
val keyByUser = partitionedByTimestamp.map(x => (x._2._1, x._2._2))
val finalResult = keyByUser.foldByKey(0f)(myFold)
finalResult.saveAsTextFile("hdfs://...")
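As an aside, here is roughly what I had in mind above for the within-partition sort via mapPartitions (untested, and assuming the timestamp is still the key at that point):

val sortedWithinPartitions = partitionedByTimestamp.mapPartitions(
  iter => iter.toArray.sortBy(_._1).iterator,  // sort each partition's rows by timestamp, no shuffle
  preservesPartitioning = true)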
The problem, as you'd expect, takes place in the folding function, after I've rearranged my RDD to no longer be keyed by timestamp (when I produce keyByUser, I lose the correct partitioning). As I've read in the documentation, partitioning is not preserved when keys are changed (which makes sense).
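For example, as I understand it, map (which can change keys) drops the partitioner, while mapValues keeps it - which unfortunately doesn't help me, since I do need to change the key:

partitionedByTimestamp.map(identity).partitioner        // None - partitioner is dropped
partitionedByTimestamp.mapValues(identity).partitioner  // Some(...) - the RangePartitioner is kept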
Reading this thread: https://groups.google.com/forum/#!topic/spark-users/Fx7DNtWiSx4 it appears that one possible solution might be to subclass RDD (à la MappedValuesRDD) to define my own RDD that retains the partitions of its parent. This seems simple enough, but I've never done anything like that before and I'm not sure where to start.
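Something like this is the shape I imagine it would take (a very rough, untested sketch of a map that keeps the parent's partitioner; the class name is mine):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class PartitionPreservingMappedRDD[T, U: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {
  override val partitioner = prev.partitioner  // keep the parent's partitioning
  override def getPartitions: Array[Partition] = prev.partitions
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    prev.iterator(split, context).map(f)
}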
I'm also willing to write my own custom partitioner class, but it appears that the getPartition method only accepts a "key" argument - and since the value I need to partition on in the final step (the timestamp) would be in the value, my partitioner class doesn't have the data it needs to make the right decision. I cannot have the timestamp in my key.
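For concreteness, here's the skeleton I mean (hypothetical, and exactly where I get stuck - only the key ever reaches getPartition):

import org.apache.spark.Partitioner

class UserIdPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val userId = key.asInstanceOf[Long]
    // only user_id is visible here; the timestamp is in the value and out of reach
    (userId % numPartitions).toInt  // assumes non-negative user ids
  }
}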
Alternatively, has anyone else encountered a problem like this (i.e. an approximately ordered sum), and did they find a good solution? Does my approach of subclassing RDD make sense? Would there be some way to finagle a custom partitioner into making this work? Perhaps this might be a job for some other tool, like Spark Streaming?
Thanks,
Michael