Change this:

    unionDS.repartition(numPartitions);
    unionDS.createOrReplaceTempView(...)

to:

    unionDS.repartition(numPartitions).createOrReplaceTempView(...)

On Wed, 18 Oct 2017, 03:05 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com> wrote:

> val unionDS = rawDS.union(processedDS)
> //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
> val unionedDS = unionDS.dropDuplicates()
> //val unionedPartitionedDS = unionedDS.repartition(unionedDS("year"), unionedDS("month"), unionedDS("day")).persist(StorageLevel.MEMORY_AND_DISK)
> //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
> unionDS.repartition(numPartitions);
> unionDS.createOrReplaceTempView("datapoint_prq_union_ds_view")
> sparkSession.sql(s"set hive.exec.dynamic.partition.mode=nonstrict")
> val deltaDSQry = "insert overwrite table datapoint PARTITION(year,month,day) select VIN, utctime, description, descriptionuom, providerdesc, dt_map, islocation, latitude, longitude, speed, value, current_date, YEAR, MONTH, DAY from datapoint_prq_union_ds_view"
> println(deltaDSQry)
> sparkSession.sql(deltaDSQry)
>
> Here is the code, along with the properties used in my project.
>
> On Tue, Oct 17, 2017 at 3:38 PM, Sebastian Piu <sebastian....@gmail.com> wrote:
>
>> Can you share some code?
>>
>> On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com> wrote:
>>
>>> In my case I am just writing the data frame back to Hive, so when is
>>> the best time to repartition it?
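The suggested fix works because Dataset transformations in Spark are immutable and lazy: `repartition` returns a *new* Dataset rather than modifying the receiver. A minimal sketch of the before/after, reusing the variable names from the code above (it assumes `unionDS` and `numPartitions` are in scope inside a running Spark job):

```scala
// Broken: the repartitioned Dataset returned here is discarded,
// so the view below is still backed by the original partitioning.
unionDS.repartition(numPartitions)
unionDS.createOrReplaceTempView("datapoint_prq_union_ds_view")

// Fixed: chain the calls (or reassign to a new val), so the view
// is registered on the repartitioned Dataset.
unionDS
  .repartition(numPartitions)
  .createOrReplaceTempView("datapoint_prq_union_ds_view")
```

The same applies to `coalesce`, `persist`, and every other Dataset method: if the return value isn't used, the call has no effect on later operations.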
>>> I did repartition before calling insert overwrite on the table.
>>>
>>> On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <sebastian....@gmail.com> wrote:
>>>
>>>> You have to repartition/coalesce *after* the action that is causing
>>>> the shuffle, as that one will take the value you've set.
>>>>
>>>> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>>>>
>>>>> Yes, I still see a large number of part files — exactly the number I
>>>>> have defined in spark.sql.shuffle.partitions.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Oct 17, 2017, at 2:32 PM, Michael Artz <michaelea...@gmail.com> wrote:
>>>>>
>>>>> Have you tried caching it and using a coalesce?
>>>>>
>>>>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> wrote:
>>>>>
>>>>>> I tried repartition, but spark.sql.shuffle.partitions is taking
>>>>>> precedence over repartition or coalesce. How do I get a smaller
>>>>>> number of files with the same performance?
>>>>>>
>>>>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <tushar_adesh...@persistent.com> wrote:
>>>>>>
>>>>>>> You can also try coalesce, as it will avoid a full shuffle.
>>>>>>>
>>>>>>> Regards,
>>>>>>> *Tushar Adeshara*
>>>>>>> *Technical Specialist – Analytics Practice*
>>>>>>> *Persistent Systems Ltd.* | *Partners in Innovation* | *www.persistentsys.com*
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>>>>>> *Sent:* 13 October 2017 09:35
>>>>>>> *To:* user @spark
>>>>>>> *Subject:* Spark - Partitions
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am reading a Hive query and writing the data back into Hive after
>>>>>>> doing some transformations.
>>>>>>> I have changed the setting spark.sql.shuffle.partitions to 2000,
>>>>>>> and since then the job completes fast, but the main problem is that
>>>>>>> I am getting 2000 files for each partition, each about 10 MB in size.
>>>>>>>
>>>>>>> Is there a way to get the same performance but write a smaller
>>>>>>> number of files?
>>>>>>>
>>>>>>> I am trying repartition now, but would like to know if there are
>>>>>>> any other options.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Asmath
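Putting the thread's advice together: keep `spark.sql.shuffle.partitions` high so the wide operations (joins, `dropDuplicates`) stay parallel, then shrink the partition count only at the end, after the shuffle, so the final write produces fewer files. A sketch under those assumptions (it presumes a live SparkSession `spark` and a transformed DataFrame `df`; the target file count of 50 is illustrative, not from the thread):

```scala
// Wide operations run with many tasks for parallelism.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

val deduped = df.dropDuplicates()   // shuffle executes with 2000 tasks

// coalesce AFTER the shuffle: it is a narrow transformation, so it
// merges the 2000 shuffle partitions into ~50 without another full
// shuffle, and the write emits at most ~50 files per table partition.
deduped
  .coalesce(50)
  .createOrReplaceTempView("datapoint_prq_union_ds_view")

spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql(
  """insert overwrite table datapoint PARTITION(year,month,day)
     select * from datapoint_prq_union_ds_view""")
```

If `coalesce` skews the remaining partitions too unevenly, `repartition(50)` (or `repartition($"year", $"month", $"day")` to align with the Hive partition columns) trades one extra shuffle for balanced output sizes.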