Hi Maximo, this is a relatively naive answer, but I would consider structuring the RDD into a DataFrame, then saving the 'splits' using something like DataFrame.write.partitionBy('client').parquet(hdfs_path). You could then read a DataFrame back from each resulting Parquet partition directory and do your per-client work from those. You mention re-using the splits, so this solution might be worth the file-writing time.
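For concreteness, here is a rough PySpark sketch of what I mean (the schema, column names, and HDFS path are made up for illustration; it assumes the rows carry a 'client' column):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Illustrative stand-in for the real bigRdd of (client, feature, label) tuples.
bigRdd = sc.parallelize([("acme", 1.0, 0.0), ("globex", 2.0, 1.0)])

# Turn the RDD into a DataFrame with named columns.
df = sqlContext.createDataFrame(bigRdd, ["client", "feature", "label"])

# Write one Parquet subdirectory per distinct client value.
df.write.partitionBy("client").parquet("hdfs:///tmp/training_splits")

# Later, read the splits back; filtering on the partition column only
# touches that client's subdirectory (partition pruning).
splits = sqlContext.read.parquet("hdfs:///tmp/training_splits")
client_df = splits.where(splits.client == "acme")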
Does anyone know of a method that gets a collection of DataFrames, one for each partition value in the partitionBy('client') sense, from a 'big' DataFrame? Basically, the equivalent of writing by partition and creating a DataFrame for each result, but skipping the HDFS step.

On Tue, Sep 8, 2015 at 10:47 AM, Maximo Gurmendez <mgurmen...@dataxu.com> wrote:
> Hi,
> I have an RDD that needs to be split (say, by client) in order to train
> n models (i.e. one for each client). Since most of the classifiers that
> come with ml-lib can only accept an RDD as input (and cannot build multiple
> models in one pass, as I understand it), the only way to train n separate
> models is to create n RDDs (by filtering the original one).
>
> Conceptually:
>
> rdd1, rdd2, rdd3 = splitRdds(bigRdd)
>
> The function splitRdds would use the standard filter mechanism. I would
> then need to submit n training Spark jobs. When I do this, will it mean
> that it will traverse the bigRdd n times? Is there a better way to persist
> the split RDDs (i.e. save them in a cache)?
>
> I could cache the bigRdd, but I'm not sure that would be very efficient
> either, since it will require the same number of passes anyway (I think,
> but I'm relatively new to Spark). Also, I'm planning on reusing the
> individual splits (rdd1, rdd2, etc.), so it would be convenient to have
> them individually cached.
>
> Another problem is that the splits could be very skewed (i.e. one split
> could represent a large percentage of the original bigRdd). So saving the
> split RDDs to disk (at least, naively) could be a challenge.
>
> Is there any better way of doing this?
>
> Thanks!
> Máximo
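For reference on the quoted approach, a minimal sketch of the filter-based splitting Maximo describes (all names illustrative; it assumes the records are keyed by client):

from pyspark import SparkContext

sc = SparkContext()

# Illustrative stand-in for the real bigRdd of (client, record) pairs.
bigRdd = sc.parallelize([("acme", [1.0, 2.0]), ("globex", [3.0, 4.0])])

# Cache the parent once so the n filters below re-scan cached blocks
# rather than re-reading the original input n times.
big = bigRdd.cache()

clients = big.keys().distinct().collect()

# One lazily-filtered RDD per client; bind c as a default argument so each
# lambda keeps its own client value instead of the loop's last one.
per_client = {c: big.filter(lambda kv, c=c: kv[0] == c) for c in clients}

Note that each per-client filter still scans all cached partitions, so this trades the repeated input read for n passes over the cached data, which is exactly the trade-off raised above.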