The in-memory cache does not currently appear to preserve information
about how the data is partitioned, so this optimization will probably
not work.
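
To illustrate the reasoning, here is a toy model in Python. It is not Spark
code and does not reflect the actual planner; every name in it (Dataset,
distribute_by, cache, plan_join) is hypothetical. It assumes only that a
planner can skip the shuffle when it knows both inputs are already
partitioned on the join key, and that caching drops that knowledge.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Dataset:
    name: str
    # Column the rows are known to be hash-partitioned on, or None if unknown.
    partitioned_by: Optional[str] = None


def distribute_by(ds: Dataset, column: str) -> Dataset:
    """Toy stand-in for DISTRIBUTE BY: the result is known to be partitioned on `column`."""
    return Dataset(ds.name, partitioned_by=column)


def cache(ds: Dataset) -> Dataset:
    """Toy stand-in for a cache that keeps the rows but forgets their partitioning."""
    return Dataset(ds.name, partitioned_by=None)


def plan_join(left: Dataset, right: Dataset, key: str) -> List[str]:
    """Return plan steps, adding an Exchange for each side not known to be partitioned on `key`."""
    steps = []
    for side in (left, right):
        if side.partitioned_by != key:
            steps.append(f"Exchange({side.name}, {key})")
    steps.append(f"Join({left.name}, {right.name}, {key})")
    return steps


a = distribute_by(Dataset("a"), "id")
b = distribute_by(Dataset("b"), "id")

# Partitioning is known on both sides: no Exchange is planned.
print(plan_join(a, b, "id"))

# After caching, the partitioning is unknown: both sides are shuffled again.
print(plan_join(cache(a), cache(b), "id"))
```

In this sketch the second plan contains an Exchange for each input even
though the rows were distributed by id before caching, which is the
situation described above.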

On Thu, Dec 4, 2014 at 8:42 PM, nitin <nitin2go...@gmail.com> wrote:

> With some quick googling, I learnt that we can provide "distribute by
> <column_name>" in HiveQL to distribute data based on a column's values. My
> question now is: if I use "distribute by id", will there be any performance
> improvement? Will I be able to avoid data movement in the shuffle (the
> Exchange before the JOIN step) and improve overall performance?
