Is there a way to take advantage of the underlying datasource partitions
when generating a DataFrame/SchemaRDD via Catalyst?  From the sql module it
seems the only options are RangePartitioner and HashPartitioner, and further
that those are selected automatically by the code.  It was not apparent
either that the underlying partitioning is carried through to the partitions
exposed by the resulting RDD, or that a custom partitioner can be supplied.
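
For contrast, at the plain RDD level a custom Partitioner can be plugged in
via partitionBy.  A minimal sketch of what I mean is below; the partitioner
class, key type, and partition count are purely illustrative placeholders,
not anything I found exposed through the sql module:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical partitioner that would mirror the datasource's own layout,
// e.g. one Spark partition per source shard/bucket.
class SourceLayoutPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int =
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions  // placeholder routing
}

// On a pair RDD the partitioner can simply be supplied:
def withSourceLayout(pairs: RDD[(String, String)]): RDD[(String, String)] =
  pairs.partitionBy(new SourceLayoutPartitioner(8))

I could not find an equivalent hook on the DataFrame/SchemaRDD side.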

The motivation is to subsequently use df.map (with
preservesPartitioning=true) and/or df.mapPartitions (likewise) to perform
operations that stay within the original datasource partitions, thus
avoiding a shuffle.
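
To make the intent concrete, here is a minimal sketch of the kind of
per-partition work I mean, assuming the RDD's partitions really do line up
with the datasource's partitions (the part I am asking about); the
per-partition count is just a placeholder computation:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Sketch only: assumes df's RDD partitions correspond to the datasource's
// partitions, which is exactly what I would like to confirm or control.
def perPartitionCounts(df: DataFrame): RDD[Long] =
  df.rdd.mapPartitions(
    (rows: Iterator[Row]) => Iterator(rows.size.toLong),  // placeholder per-partition work
    preservesPartitioning = true  // declare that any existing partitioner is still valid
  )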
