Re: Assigning input files to spark partitions

Daniel Siegmann Thu, 13 Nov 2014 13:06:05 -0800

On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia <[email protected]
> wrote


>
> No i don't want separate RDD because each of these partitions are being
> processed the same way (in my case, each partition corresponds to HBase
> keys belonging to one region server, and i will do HBase lookups). After
> that i have aggregations too, hence all these partitions should be in the
> same RDD. The reason to follow the partition structure is to limit
> concurrent HBase lookups targeting a single region server.
>

Neither of these is necessarily a barrier to using separate RDDs. You can
define the function you want to use and then pass it to multiple map
methods. Then you could union all the RDDs to do your aggregations. For
example, it might look something like this:

val paths: String = ... // the paths to the files you want to load
def myFunc(t: T) = ... // the function to apply to every RDD
val rdds = paths.map { path =>
    sc.textFile(path).map(myFunc)
}
val completeRdd = sc.union(rdds)

Does that make any sense?

-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: [email protected] W: www.velos.io

Re: Assigning input files to spark partitions

Reply via email to