Is there a way to take advantage of the underlying datasource partitions
when generating a DataFrame/SchemaRDD via Catalyst?  From the sql module it
seems the only options are RangePartitioner and HashPartitioner, and further
that those are selected automatically by the code.  It was not apparent
either that the underlying partitioning is carried through to the partitions
exposed by the resulting RDD, or that a custom partitioner can be supplied.
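
For contrast, at the plain RDD level a custom Partitioner can be plugged in
via partitionBy.  A minimal sketch of what I mean is below; the partitioner
class, key type, and partition count are purely illustrative placeholders,
not anything I found exposed through the sql module:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical partitioner that would mirror the datasource's own layout,
// e.g. one Spark partition per source shard/bucket.
class SourceLayoutPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int =
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions  // placeholder routing
}

// On a pair RDD the partitioner can simply be supplied:
def withSourceLayout(pairs: RDD[(String, String)]): RDD[(String, String)] =
  pairs.partitionBy(new SourceLayoutPartitioner(8))

I could not find an equivalent hook on the DataFrame/SchemaRDD side.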

The motivation is to subsequently use df.map (with
preservesPartitioning=true) and/or df.mapPartitions (likewise) to perform
operations that stay within the original datasource partitions, thus
avoiding a shuffle.
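
To make the intent concrete, here is a minimal sketch of the kind of
per-partition work I mean, assuming the RDD's partitions really do line up
with the datasource's partitions (the part I am asking about); the
per-partition count is just a placeholder computation:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Sketch only: assumes df's RDD partitions correspond to the datasource's
// partitions, which is exactly what I would like to confirm or control.
def perPartitionCounts(df: DataFrame): RDD[Long] =
  df.rdd.mapPartitions(
    (rows: Iterator[Row]) => Iterator(rows.size.toLong),  // placeholder per-partition work
    preservesPartitioning = true  // declare that any existing partitioner is still valid
  )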
