Re: Spark - Partitions

2017-10-17 Thread Sebastian Piu
Change this:

    unionDS.repartition(numPartitions);
    unionDS.createOrReplaceTempView(...

to:

    unionDS.repartition(numPartitions).createOrReplaceTempView(...

On Wed, 18 Oct 2017, 03:05 KhajaAsmath Mohammed wrote: > val unionDS = rawDS.union(processedDS) >
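A minimal sketch of why the chained form matters, assuming unionDS is a Dataset and numPartitions is the desired partition count (both names are from the thread; the view name below is hypothetical, since the original is truncated):

    // repartition() returns a new Dataset; it never modifies unionDS in place.
    // The two-statement form discards the repartitioned result:
    //   unionDS.repartition(numPartitions)             // result thrown away
    //   unionDS.createOrReplaceTempView("union_view")  // registers the old partitioning
    // Chaining registers the repartitioned Dataset instead:
    unionDS.repartition(numPartitions).createOrReplaceTempView("union_view")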

Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
    val unionDS = rawDS.union(processedDS)
    //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
    val unionedDS = unionDS.dropDuplicates()
    //val unionedPartitionedDS = unionedDS.repartition(unionedDS("year"), unionedDS("month"), unionedDS("day")).persist(StorageLevel.MEMORY_AND_DISK)

Re: Spark - Partitions

2017-10-17 Thread Sebastian Piu
Can you share some code? On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed wrote: > In my case I am just writing the data frame back to Hive, so when is the > best time to repartition it? I did repartition before calling insert > overwrite on the table > > On Tue, Oct 17,

Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
In my case I am just writing the data frame back to Hive, so when is the best time to repartition it? I did repartition before calling insert overwrite on the table. On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu wrote: > You have to repartition/coalesce *after* the action

Re: Spark - Partitions

2017-10-17 Thread Sebastian Piu
You have to repartition/coalesce *after* the action that is causing the shuffle, as it is that shuffle which takes the value you've set in spark.sql.shuffle.partitions. On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Yes, I still see a large number of part files, exactly the number I have > defined
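A short sketch of this advice applied to the thread's own pipeline (the table name and target count are hypothetical): dropDuplicates() triggers a shuffle, and that shuffle produces spark.sql.shuffle.partitions output partitions regardless of any earlier repartition, so the coalesce has to come after it.

    // The dedup shuffle lands on spark.sql.shuffle.partitions partitions;
    // narrow the result only after that shuffle has happened.
    val unionedDS = rawDS.union(processedDS).dropDuplicates()
    unionedDS
      .coalesce(10)                    // 10 output files; pick your own target
      .write
      .mode("overwrite")
      .insertInto("target_table")      // hypothetical pre-existing Hive table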

Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
Yes, I still see a large number of part files, exactly the number I have defined in spark.sql.shuffle.partitions. > On Oct 17, 2017, at 2:32 PM, Michael Artz wrote: > > Have you tried caching it and using a coalesce? > > > >> On Oct 17, 2017 1:47

Re: Spark - Partitions

2017-10-17 Thread Michael Artz
Have you tried caching it and using a coalesce? On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" wrote: > I tried repartition, but spark.sql.shuffle.partitions is taking > precedence over repartition or coalesce. How can I get a smaller number of > files with the same
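A sketch of the cache-then-coalesce idea, assuming df is the transformed DataFrame and the output path is hypothetical: caching keeps the upstream transformations from being recomputed when the coalesced plan runs.

    val cached = df.cache()
    cached.count()                          // materializes the cache
    println(cached.rdd.getNumPartitions)    // partition count before narrowing
    cached.coalesce(8)                      // 8 output files; value hypothetical
      .write
      .mode("overwrite")
      .parquet("/tmp/out")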

Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
I tried repartition, but spark.sql.shuffle.partitions is taking precedence over repartition or coalesce. How can I get a smaller number of files with the same performance? On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara < tushar_adesh...@persistent.com> wrote: > You can also try coalesce, as it

Re: Spark - Partitions

2017-10-13 Thread Tushar Adeshara
You can also try coalesce, as it will avoid a full shuffle.

Regards,
Tushar Adeshara
Technical Specialist – Analytics Practice
Cell: +91-81490 04192
Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com

From:
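A sketch of the distinction being made here, with df standing in for any DataFrame:

    // repartition(n) always performs a full shuffle across the cluster.
    val reshuffled = df.repartition(10)
    // coalesce(n) only merges existing partitions (a narrow dependency),
    // so it avoids the full shuffle when reducing the partition count.
    val merged = df.coalesce(10)
    // Note: Dataset.coalesce can only decrease the partition count;
    // to increase it, use repartition.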

Re: Spark - Partitions

2017-10-12 Thread Chetan Khatri
Use repartition. On 13-Oct-2017 9:35 AM, "KhajaAsmath Mohammed" wrote: > Hi, > > I am reading a Hive query and writing the data back into Hive after doing > some transformations. > > I have changed the setting spark.sql.shuffle.partitions to 2000, and since then > the job completes
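For reference, a sketch of how that setting is applied, assuming a SparkSession named spark (the value 2000 is the one from the thread):

    // Per-session:
    spark.conf.set("spark.sql.shuffle.partitions", "2000")
    // Or at submit time:
    //   spark-submit --conf spark.sql.shuffle.partitions=2000 ...
    // Every shuffle (joins, aggregations, dropDuplicates) then produces
    // 2000 partitions, which is why 2000 part files appear on write.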

Re: Spark partitions from CassandraRDD

2015-09-04 Thread Ankur Srivastava
Oh, if that is the case then you can try tuning "spark.cassandra.input.split.size":

    spark.cassandra.input.split.size    approx number of Cassandra partitions in a Spark partition    10

Hope this helps. Thanks Ankur On Thu, Sep 3, 2015 at 12:22 PM, Alaa Zubaidi (PDF)
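A sketch of tuning that setting with the 2015-era spark-cassandra-connector (the host, keyspace, table, and value below are all hypothetical); a smaller split size packs fewer Cassandra partitions into each Spark partition, yielding more Spark partitions.

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._   // adds sc.cassandraTable

    val conf = new SparkConf()
      .setAppName("split-size-demo")
      .setMaster("local[*]")
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .set("spark.cassandra.input.split.size", "1000")
    val sc = new SparkContext(conf)

    val rdd = sc.cassandraTable("my_ks", "my_table")
    println(rdd.partitions.length)          // should rise as split.size shrinks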

Re: Spark partitions from CassandraRDD

2015-09-03 Thread Ankur Srivastava
Hi Alaa, Partitioning when using CassandraRDD depends on the partition key of your Cassandra table. If you see only 1 partition in the RDD, it means all the rows you have selected have the same partition_key in C*. Thanks Ankur On Thu, Sep 3, 2015 at 11:54 AM, Alaa Zubaidi (PDF)
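One way to check this hypothesis, reusing sc and the connector import from the earlier sketch and assuming a hypothetical events table whose partition key is user_id:

    // Count the distinct partition-key values in the selection; if there is
    // only one, all selected rows map to a single Cassandra partition,
    // which would explain the single Spark partition.
    val keys = sc.cassandraTable("my_ks", "events")
      .map(row => row.getString("user_id"))
      .distinct()
      .count()
    println(s"distinct partition keys selected: $keys")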

Re: Spark partitions from CassandraRDD

2015-09-03 Thread Alaa Zubaidi (PDF)
Thanks Ankur, but I grabbed some keys from the Spark results and ran "nodetool -h getendpoints ", and it showed the data coming from at least 2 nodes? Regards, Alaa On Thu, Sep 3, 2015 at 12:06 PM, Ankur Srivastava < ankur.srivast...@gmail.com> wrote: > Hi Alaa, > > Partitioning when