In my case I am just writing the data frame back to Hive, so when is the best place to repartition? I did call repartition before running the insert overwrite on the table.
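The advice in the thread below is to apply coalesce/repartition *after* the shuffle-producing work, immediately before the write, so that the earlier stages still run with `spark.sql.shuffle.partitions`-way parallelism and only the final write produces fewer files. A minimal PySpark-style sketch of that idea, assuming a DataFrame `df` is passed in; the helper name `write_with_fewer_files` and the use of `coalesce` (rather than `repartition`, to avoid an extra full shuffle) are illustrative choices, not anything prescribed in the thread:

```python
def write_with_fewer_files(df, table, num_files):
    """Squeeze the output of `df` into `num_files` files when writing to `table`.

    coalesce() is applied as the last step before the write, so all
    upstream transformations keep their spark.sql.shuffle.partitions
    parallelism; only the final write stage is narrowed. Assumes `df`
    is a pyspark.sql.DataFrame and `table` an existing Hive table.
    """
    # mode("overwrite") + insertInto() performs an insert-overwrite
    # into the existing table definition.
    df.coalesce(num_files).write.mode("overwrite").insertInto(table)
```

Note that `coalesce(n)` can reduce parallelism of the stage it is fused into; if the write itself becomes slow, `repartition(n)` trades one extra shuffle for a fully parallel final stage.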
On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <sebastian....@gmail.com> wrote:

> You have to repartition/coalesce *after* the action that is causing the
> shuffle, as that one will take the value you've set.
>
> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>
>> Yes, I still see more part files, exactly the number I have defined
>> in spark.sql.shuffle.partitions.
>>
>> Sent from my iPhone
>>
>> On Oct 17, 2017, at 2:32 PM, Michael Artz <michaelea...@gmail.com> wrote:
>>
>> Have you tried caching it and using a coalesce?
>>
>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote:
>>
>>> I tried repartition, but spark.sql.shuffle.partitions is taking
>>> precedence over repartition or coalesce. How do I get a smaller number
>>> of files with the same performance?
>>>
>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <tushar_adesh...@persistent.com> wrote:
>>>
>>>> You can also try coalesce, as it will avoid a full shuffle.
>>>>
>>>> Regards,
>>>> Tushar Adeshara
>>>> Technical Specialist – Analytics Practice
>>>> Cell: +91-81490 04192
>>>> Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com
>>>>
>>>> ------------------------------
>>>> *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>>> *Sent:* 13 October 2017 09:35
>>>> *To:* user @spark
>>>> *Subject:* Spark - Partitions
>>>>
>>>> Hi,
>>>>
>>>> I am reading a Hive query and writing the data back into Hive after
>>>> doing some transformations.
>>>>
>>>> I changed spark.sql.shuffle.partitions to 2000, and since then the job
>>>> completes fast, but the main problem is that I am getting 2000 files
>>>> for each partition, each file about 10 MB in size.
>>>>
>>>> Is there a way to get the same performance but write fewer files?
>>>>
>>>> I am trying repartition now, but would like to know if there are any
>>>> other options.
>>>>
>>>> Thanks,
>>>> Asmath
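The original report describes 2000 output files of roughly 10 MB each per partition. One common way to pick a better repartition/coalesce count is to size partitions for a target output file size instead of reusing the shuffle-partition count. A small sketch of that calculation; the 128 MB default target (a typical HDFS block size) and the helper name `num_output_partitions` are assumptions for illustration, not anything specified in the thread:

```python
import math

def num_output_partitions(total_output_bytes, target_file_bytes=128 * 1024**2):
    """Partition count so each output file lands near target_file_bytes."""
    return max(1, math.ceil(total_output_bytes / target_file_bytes))

# The thread's case: 2000 files of ~10 MB each, ~20 GB total.
# Targeting ~128 MB files suggests 157 partitions instead of 2000:
n = num_output_partitions(2000 * 10 * 1024**2)  # → 157
```

The resulting `n` would then be fed to `coalesce(n)` (or `repartition(n)`) just before the write, keeping the 2000-way shuffle parallelism for the heavy transformations.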