Austin,
I made up a mock example; my real use case is more complex. I used
foreach() instead of collect/cache, since foreach() is an action and forces
the accumulable to be evaluated. On another thread Xiangrui pointed me to a
sliding-window RDD in MLlib that is a great alternative (although I did not
switch to using it).
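For reference, a minimal sketch of that alternative, assuming MLlib's
RDDFunctions.sliding (a developer API); the sample values here are made up:

import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(List(1, 2, 3, 5, 8, 11))
// sliding(2) pairs each element with its successor, even across partition
// boundaries (assumes the RDD has at least two elements):
val filled = rdd.sliding(2).flatMap { case Array(a, b) => a until b } ++
  sc.parallelize(Seq(rdd.max()))  // re-append the final element
// filled.collect().sorted => Array(1, 2, 3, ..., 11)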
M
Mohit, if you want to end up with (1 .. N), why don't you skip the logic
for finding missing values and generate it directly?
val max = myCollection.reduce(math.max)
sc.parallelize(1 to max)  // "1 to max" is inclusive, so this yields (1 .. N)
In either case, you don't need to call cache, which would force it into
memory - you can do something like a no-op foreach() instead to trigger
evaluation.
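A minimal illustration (assuming an RDD named rdd; any action would do):

rdd.foreach(_ => ())  // runs a job over the whole RDD without persisting it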
Thanks Brian. This works. I used Accumulable to do the "collect" in step B.
While doing that I found that Accumulable.value is not a Spark "action"; I
needed to call "cache" on the underlying RDD for "value" to work. Not sure
if that is intentional or a bug.
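As an illustration of that point, here is a hedged sketch using the newer
accumulator API rather than Accumulable (names are mine, not from this
thread): the value is only populated once an action has actually run.

val acc = sc.collectionAccumulator[Int]("seen")
rdd.foreach(x => acc.add(x))  // the foreach action is what populates acc
println(acc.value)            // value only reflects actions already run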
The "collect" of Step B can be done as a n
I don't think there's a direct way of bleeding elements across partitions.
But you could write it yourself relatively succinctly:
A) Sort the RDD
B) Look at the sorted RDD's partitions with the .mapPartitionsWithIndex()
method. Map each partition to its partition ID and its maximum element.
Collect these (partition ID, max) pairs back to the driver; a sketch of the
whole approach follows below.
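A hedged sketch of those steps (identifiers such as fillGaps are mine, not
from this thread; assumes an RDD[Int] with no duplicate values):

import org.apache.spark.rdd.RDD

def fillGaps(rdd: RDD[Int]): RDD[Int] = {
  val sorted = rdd.sortBy(identity)                        // A) sort
  val maxes = sorted.mapPartitionsWithIndex { (id, it) =>  // B) (partition ID, max)
    if (it.isEmpty) Iterator.empty else Iterator((id, it.max))
  }.collectAsMap()                                         // small: one entry per non-empty partition
  val bc = sorted.sparkContext.broadcast(maxes)

  sorted.mapPartitionsWithIndex { (id, it) =>
    val elems = it.toArray                                 // buffering is fine for a sketch
    if (elems.isEmpty) Iterator.empty
    else {
      // max of the nearest earlier non-empty partition, if any
      val prev = (0 until id).flatMap(bc.value.get).lastOption
      // bleed across the boundary: run from prev + 1 up to our first element
      val head = prev.map(p => (p + 1) to elems.head).getOrElse(Seq(elems.head))
      // fill the gap between each consecutive pair within this partition
      val body = elems.iterator.sliding(2)
        .collect { case Seq(a, b) => (a + 1) to b }.flatten
      head.iterator ++ body
    }
  }
}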
Hi,
I am trying to find a way to fill in missing values in an RDD. The RDD is a
sorted sequence.
For example, (1, 2, 3, 5, 8, 11, ...)
I need to fill in the missing numbers and get (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11).
One way to do this is to "slide and zip":
val rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
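One hedged way to realize that slide-and-zip (illustrative code, not
necessarily what the original mail went on to show): key each element to its
successor with zipWithIndex and a join, then expand each gap.

val rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11))
val indexed = rdd1.zipWithIndex.map(_.swap)              // (index, value)
val shifted = indexed.map { case (i, v) => (i - 1, v) }  // successor at the same key
val filled = indexed.join(shifted)                       // (i, (a, next))
  .values
  .flatMap { case (a, next) => a until next } ++
  sc.parallelize(Seq(rdd1.max()))                        // re-append the last element
// filled.collect().sorted => Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

Note the join shuffles; the mapPartitionsWithIndex sketch above avoids that
at the cost of one small collect to the driver.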