Hi

For my use case, I need to call a third-party function (which is in-memory
based) on the complete data of each partition. So I am partitioning the RDD
logically, using repartition on the index column, and applying a function f
via mapPartitions(f).
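
A minimal PySpark sketch of this setup, under some assumptions: the RDD
holds (index, value) pairs, key-based partitioning is done with partitionBy
(plain RDD.repartition(n) takes only a partition count and does not group by
key), and third_party_in_memory is a hypothetical stand-in for the real
third-party function.

```python
def third_party_in_memory(rows):
    # Hypothetical stand-in for the third-party function: it needs the
    # whole partition as an in-memory list of (index, value) pairs.
    return [(len(rows), sum(value for _, value in rows))]


def f(iterator):
    # mapPartitions calls f once per partition with an iterator over it.
    rows = list(iterator)  # materialize the complete partition
    if not rows:
        return iter([])  # partitions can be empty
    return iter(third_party_in_memory(rows))


def run(sc, pairs, num_partitions):
    # sc is an existing SparkContext; pairs is a list of (index, value).
    rdd = sc.parallelize(pairs)
    # partitionBy hash-partitions by key, so equal keys share a partition.
    partitioned = rdd.partitionBy(num_partitions)
    return partitioned.mapPartitions(f).collect()
```

Note that with hash partitioning several distinct index values may land in
the same partition, so f may see more than one index per call.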

When I iterate through the mapPartitions iterator, can I assume that one
task will process only one particular partition's complete data (assuming
it is small in size)?

Or, to achieve this, do I need to use glom() after the repartition instead
of mapPartitions?
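
For comparison, the glom() alternative might look like the sketch below.
glom() turns each partition into a single list element, so the per-partition
function receives a list rather than an iterator. The names
process_partition and run_with_glom are hypothetical, and the pair RDD and
partitionBy step are assumptions, not part of the actual job.

```python
def process_partition(rows):
    # Hypothetical stand-in for the third-party call; rows is a plain list
    # holding every (index, value) pair of one partition.
    if not rows:
        return []  # empty partitions produce no output
    return [(len(rows), sum(value for _, value in rows))]


def run_with_glom(sc, pairs, num_partitions):
    # sc is an existing SparkContext; pairs is a list of (index, value).
    partitioned = sc.parallelize(pairs).partitionBy(num_partitions)
    # glom() yields one list per partition; flatMap flattens the results.
    return partitioned.glom().flatMap(process_partition).collect()
```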

And when exactly should I use preservesPartitioning=true on mapPartitions?

Thanks & Regards,
B Anil Kumar.
