Change this:

    unionDS.repartition(numPartitions);
    unionDS.createOrReplaceTempView(...)

to:

    unionDS.repartition(numPartitions).createOrReplaceTempView(...)

On Wed, 18 Oct 2017, 03:05 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com> wrote:

> val unionDS = rawDS.union(processedDS)
> //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
> val unionedDS = unionDS.dropDuplicates()
> //val unionedPartitionedDS = unionedDS.repartition(unionedDS("year"), unionedDS("month"), unionedDS("day")).persist(StorageLevel.MEMORY_AND_DISK)
> //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
> unionDS.repartition(numPartitions);
> unionDS.createOrReplaceTempView("datapoint_prq_union_ds_view")
> sparkSession.sql(s"set hive.exec.dynamic.partition.mode=nonstrict")
> val deltaDSQry = "insert overwrite table datapoint PARTITION(year,month,day) select VIN, utctime, description, descriptionuom, providerdesc, dt_map, islocation, latitude, longitude, speed, value, current_date, YEAR, MONTH, DAY from datapoint_prq_union_ds_view"
> println(deltaDSQry)
> sparkSession.sql(deltaDSQry)
>
> Here is the code, along with the properties used in my project.
>
> On Tue, Oct 17, 2017 at 3:38 PM, Sebastian Piu <sebastian....@gmail.com> wrote:
>
>> Can you share some code?
>>
>> On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com> wrote:
>>
>>> In my case I am just writing the data frame back to Hive, so when is
>>> the best time to repartition it?
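The suggested fix works because Dataset transformations in Spark are immutable and lazy: `repartition` returns a *new* Dataset rather than modifying the receiver. A minimal sketch of the before/after, reusing the variable names from the code above (it assumes `unionDS` and `numPartitions` are in scope inside a running Spark job):

```scala
// Broken: the repartitioned Dataset returned here is discarded,
// so the view below is still backed by the original partitioning.
unionDS.repartition(numPartitions)
unionDS.createOrReplaceTempView("datapoint_prq_union_ds_view")

// Fixed: chain the calls (or reassign to a new val), so the view
// is registered on the repartitioned Dataset.
unionDS
  .repartition(numPartitions)
  .createOrReplaceTempView("datapoint_prq_union_ds_view")
```

The same applies to `coalesce`, `persist`, and every other Dataset method: if the return value isn't used, the call has no effect on later operations.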
>>> I did repartition before calling insert overwrite on the table.
>>>
>>> On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <sebastian....@gmail.com> wrote:
>>>
>>>> You have to repartition/coalesce *after* the action that is causing
>>>> the shuffle, as that one will take the value you've set.
>>>>
>>>> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>>>>
>>>>> Yes, I still see a large number of part files — exactly the number I
>>>>> have defined in spark.sql.shuffle.partitions.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Oct 17, 2017, at 2:32 PM, Michael Artz <michaelea...@gmail.com> wrote:
>>>>>
>>>>> Have you tried caching it and using a coalesce?
>>>>>
>>>>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote:
>>>>>
>>>>>> I tried repartition, but spark.sql.shuffle.partitions is taking
>>>>>> precedence over repartition or coalesce. How do I get a smaller
>>>>>> number of files with the same performance?
>>>>>>
>>>>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <tushar_adesh...@persistent.com> wrote:
>>>>>>
>>>>>>> You can also try coalesce, as it will avoid a full shuffle.
>>>>>>>
>>>>>>> Regards,
>>>>>>> *Tushar Adeshara*
>>>>>>> *Technical Specialist – Analytics Practice*
>>>>>>> *Persistent Systems Ltd.* | *Partners in Innovation* | *www.persistentsys.com*
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>>>>>> *Sent:* 13 October 2017 09:35
>>>>>>> *To:* user @spark
>>>>>>> *Subject:* Spark - Partitions
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am reading a Hive query and writing the data back into Hive after
>>>>>>> doing some transformations.
>>>>>>> I have changed the setting spark.sql.shuffle.partitions to 2000,
>>>>>>> and since then the job completes fast, but the main problem is that
>>>>>>> I am getting 2000 files for each partition, each about 10 MB in size.
>>>>>>>
>>>>>>> Is there a way to get the same performance but write a smaller
>>>>>>> number of files?
>>>>>>>
>>>>>>> I am trying repartition now, but would like to know if there are
>>>>>>> any other options.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Asmath
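Putting the thread's advice together: keep `spark.sql.shuffle.partitions` high so the wide operations (joins, `dropDuplicates`) stay parallel, then shrink the partition count only at the end, after the shuffle, so the final write produces fewer files. A sketch under those assumptions (it presumes a live SparkSession `spark` and a transformed DataFrame `df`; the target file count of 50 is illustrative, not from the thread):

```scala
// Wide operations run with many tasks for parallelism.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

val deduped = df.dropDuplicates()   // shuffle executes with 2000 tasks

// coalesce AFTER the shuffle: it is a narrow transformation, so it
// merges the 2000 shuffle partitions into ~50 without another full
// shuffle, and the write emits at most ~50 files per table partition.
deduped
  .coalesce(50)
  .createOrReplaceTempView("datapoint_prq_union_ds_view")

spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql(
  """insert overwrite table datapoint PARTITION(year,month,day)
     select * from datapoint_prq_union_ds_view""")
```

If `coalesce` skews the remaining partitions too unevenly, `repartition(50)` (or `repartition($"year", $"month", $"day")` to align with the Hive partition columns) trades one extra shuffle for balanced output sizes.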