Thanks for the info. Do you think this would be useful in Spark itself? E.g., add a method to RDD like "assumePartitioner(partitioner: Partitioner, verify: Boolean)", where verify would run a mapPartitionsWithIndex to check that every record is actually in the partition it belongs in?
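To make the idea concrete, here is a rough sketch of what I have in mind. This is untested and the names (AssumedPartitionedRDD, assumePartitioner) are just placeholders; it assumes a thin wrapper RDD that overrides `partitioner` without moving any data is enough to get the narrow dependency:

```scala
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Wrapper RDD that reports the given partitioner but otherwise passes
// partitions through unchanged -- no shuffle, no data movement.
class AssumedPartitionedRDD[K: ClassTag, V: ClassTag](
    prev: RDD[(K, V)],
    part: Partitioner)
  extends RDD[(K, V)](prev) {

  override val partitioner: Option[Partitioner] = Some(part)
  override def getPartitions: Array[Partition] = prev.partitions
  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    prev.iterator(split, context)
}

// Hypothetical helper: optionally verify that every record really is in
// the partition the partitioner would assign it to, then wrap.
def assumePartitioner[K: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)], part: Partitioner, verify: Boolean): RDD[(K, V)] = {
  if (verify) {
    // Eagerly scan each partition; fail fast on the first misplaced key.
    // The count() just forces evaluation of the checks.
    rdd.mapPartitionsWithIndex { (idx, iter) =>
      iter.map { case kv @ (k, _) =>
        require(part.getPartition(k) == idx,
          s"key $k found in partition $idx, expected ${part.getPartition(k)}")
        kv
      }
    }.count()
  }
  new AssumedPartitionedRDD(rdd, part)
}
```

One open question with this sketch: the verify pass costs a full scan of the data, so it would presumably be something you turn on while debugging and off in production.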
I'm surprised this hasn't come up before -- maybe there is a better way to do something similar?

On Tue, Jan 28, 2014 at 12:25 AM, Matei Zaharia <[email protected]> wrote:

> Hey Imran,
>
> You probably have to create a subclass of HadoopRDD to do this, or some
> RDD that wraps around the HadoopRDD. It would be a cool feature, but HDFS
> itself has no information about partitioning, so your application needs to
> track it.
>
> Matei
>
> On Jan 27, 2014, at 11:57 PM, Imran Rashid <[email protected]> wrote:
>
> > Hi,
> >
> > I'm trying to figure out how to get partitioners to work correctly with
> > Hadoop RDDs, so that I can get narrow dependencies and avoid shuffling. I
> > feel like I must be missing something obvious.
> >
> > I can create an RDD with a partitioner of my choosing, shuffle it, and
> > then save it out to HDFS. But I can't figure out how to get it to still
> > have that partitioner after I read it back in from HDFS. HadoopRDD always
> > has the partitioner set to None, and there isn't any way for me to change
> > it.
> >
> > The reason I care is because if I can set the partitioner, then there would
> > be a narrow dependency, so I can avoid a shuffle. I have a big data set
> > I'm saving on HDFS. Then some time later, in a totally independent Spark
> > context, I read a little more data in, shuffle it with the same partitioner,
> > and then want to join it to the previous data that was sitting on HDFS.
> >
> > I guess this can't be done in general, since you don't have any
> > guarantees on how the file was saved in HDFS. But it still seems like
> > there ought to be a way to do this, even if I need to enforce safety at the
> > application level.
> >
> > Sorry if I'm missing something obvious ...
> >
> > Thanks,
> > Imran
