Re: Spark - Partitions

2017-10-17 Thread Sebastian Piu
Change this:

unionDS.repartition(numPartitions);
unionDS.createOrReplaceTempView(...

to this:

unionDS.repartition(numPartitions).createOrReplaceTempView(...

repartition returns a new Dataset rather than modifying unionDS in place, so
in the original code the repartitioned result was being discarded.
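Equivalently, the repartitioned Dataset can be captured in its own val (a sketch only, using the names from this thread; numPartitions is assumed to be defined elsewhere):

```scala
// Dataset transformations return new Datasets; keep a reference to the
// result so the repartitioning actually takes effect before the write.
val repartitionedDS = unionDS.repartition(numPartitions)
repartitionedDS.createOrReplaceTempView("datapoint_prq_union_ds_view")
```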



Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
val unionDS = rawDS.union(processedDS)
//unionDS.persist(StorageLevel.MEMORY_AND_DISK)
val unionedDS = unionDS.dropDuplicates()
//val unionedPartitionedDS = unionedDS.repartition(unionedDS("year"), unionedDS("month"), unionedDS("day")).persist(StorageLevel.MEMORY_AND_DISK)
//unionDS.persist(StorageLevel.MEMORY_AND_DISK)
unionDS.repartition(numPartitions);
unionDS.createOrReplaceTempView("datapoint_prq_union_ds_view")
sparkSession.sql(s"set hive.exec.dynamic.partition.mode=nonstrict")
val deltaDSQry = "insert overwrite table datapoint PARTITION(year,month,day) " +
  "select VIN, utctime, description, descriptionuom, providerdesc, dt_map, " +
  "islocation, latitude, longitude, speed, value, current_date, " +
  "YEAR, MONTH, DAY from datapoint_prq_union_ds_view"
println(deltaDSQry)
sparkSession.sql(deltaDSQry)


Here is the code; the properties file used in my project is also attached.




application-datapoint-hdfs-dyn.properties
Description: Binary data

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Spark - Partitions

2017-10-17 Thread Sebastian Piu
Can you share some code?



Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
In my case I am just writing the DataFrame back to Hive, so when is the best
point to repartition it? I did the repartition before calling insert
overwrite on the table.



Re: Spark - Partitions

2017-10-17 Thread Sebastian Piu
You have to repartition/coalesce *after* the operation that causes the
shuffle, since that operation will use the spark.sql.shuffle.partitions
value you've set.
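As an illustration (a sketch only, reusing the variable names from the code shared in this thread; numPartitions would be a deliberately small number):

```scala
// The union/dropDuplicates shuffle still runs with
// spark.sql.shuffle.partitions tasks (e.g. 2000) for speed; repartitioning
// the *result* is what controls how many files the final insert writes.
val deduped = rawDS.union(processedDS).dropDuplicates()
val forWrite = deduped.repartition(numPartitions)  // small partition count
forWrite.createOrReplaceTempView("datapoint_prq_union_ds_view")
```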



Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
Yes, I still see a large number of part files: exactly the number I defined
in spark.sql.shuffle.partitions.

Sent from my iPhone



Re: Spark - Partitions

2017-10-17 Thread Michael Artz
Have you tried caching it and using a coalesce?





Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
I tried repartition, but spark.sql.shuffle.partitions is taking precedence
over repartition/coalesce. How do I get fewer files with the same
performance?



Re: Spark - Partitions

2017-10-13 Thread Tushar Adeshara
You can also try coalesce, as it avoids a full shuffle.
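To sketch the trade-off (illustration only; df stands for any DataFrame, and 200 is an arbitrary target):

```scala
// coalesce(n) merges existing partitions without a full shuffle: cheaper,
// but partition sizes can be uneven and upstream parallelism may shrink.
val cheaper = df.coalesce(200)

// repartition(n) performs a full shuffle: costlier, but produces n
// evenly sized partitions, and hence n similarly sized output files.
val evener = df.repartition(200)
```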


Regards,
Tushar Adeshara
Technical Specialist – Analytics Practice
Cell: +91-81490 04192
Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com



From: KhajaAsmath Mohammed 
Sent: 13 October 2017 09:35
To: user @spark
Subject: Spark - Partitions

Hi,

I am reading from a Hive query and writing the data back into Hive after
doing some transformations.

I have changed spark.sql.shuffle.partitions to 2000, and since then the job
completes fast, but the main problem is that I am getting 2000 files for each
partition, each file about 10 MB in size.

Is there a way to get the same performance but write fewer files?

I am trying repartition now, but would like to know if there are any other
options.

Thanks,
Asmath
DISCLAIMER
==
This e-mail may contain privileged and confidential information which is the 
property of Persistent Systems Ltd. It is intended only for the use of the 
individual or entity to which it is addressed. If you are not the intended 
recipient, you are not authorized to read, retain, copy, print, distribute or 
use this message. If you have received this communication in error, please 
notify the sender and delete all copies of this message. Persistent Systems 
Ltd. does not accept any liability for virus infected mails.


Re: Spark - Partitions

2017-10-12 Thread Chetan Khatri
Use repartition


Re: Spark partitions from CassandraRDD

2015-09-04 Thread Ankur Srivastava
Oh, if that is the case then you can try tuning
"spark.cassandra.input.split.size":

spark.cassandra.input.split.size: approximate number of Cassandra partitions
in a Spark partition (default: 10)

Hope this helps.

Thanks
Ankur
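For example, here is a sketch of where such a setting would go (the property name is the one quoted above; the host and the value 1000 are placeholders, not recommendations):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
  .set("spark.cassandra.input.split.size", "1000")      // placeholder value
val sc = new SparkContext(conf)
```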



Re: Spark partitions from CassandraRDD

2015-09-03 Thread Ankur Srivastava
Hi Alaa,

Partitioning when using a CassandraRDD depends on your partition key in the
Cassandra table.

If you see only 1 partition in the RDD, it means all the rows you have
selected share the same partition_key in C*.

Thanks
Ankur
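One quick way to confirm is to inspect the RDD directly (a sketch reusing the call from the original message; RES, keyspace, res_name, and res_where are assumed to be defined):

```scala
val rdd = sc.cassandraTable[RES](keyspace, res_name).where(res_where)
// partitions.length reports how many Spark partitions the connector built
println(rdd.partitions.length)
```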


On Thu, Sep 3, 2015 at 11:54 AM, Alaa Zubaidi (PDF) 
wrote:

> Hi,
>
> I am testing Spark and Cassandra: Spark 1.4, Cassandra 2.1.7, Cassandra
> Spark connector 1.4, running in standalone mode.
>
> I am getting 4000 rows from Cassandra (4 MB rows), where the row keys are
> random.
> .. sc.cassandraTable[RES](keyspace,res_name).where(res_where).cache
>
> I am expecting it to generate several partitions.
> However, I can see ONLY 1 partition.
> I cached the CassandraRDD, and in the UI storage tab it shows ONLY 1
> partition.
>
> Any idea why I am getting 1 partition?
>
> Thanks,
> Alaa
>


Re: Spark partitions from CassandraRDD

2015-09-03 Thread Alaa Zubaidi (PDF)
Thanks Ankur,

But I grabbed some keys from the Spark results and ran "nodetool -h
getendpoints ", and it showed the data coming from at least 2 nodes.
Regards,
Alaa



-- 

Alaa Zubaidi
PDF Solutions, Inc.
333 West San Carlos Street, Suite 1000
San Jose, CA 95110  USA
Tel: 408-283-5639
fax: 408-938-6479
email: alaa.zuba...@pdf.com

-- 
*This message may contain confidential and privileged information. If it 
has been sent to you in error, please reply to advise the sender of the 
error and then immediately permanently delete it and all attachments to it 
from your systems. If you are not the intended recipient, do not read, 
copy, disclose or otherwise use this message or any attachments to it. The 
sender disclaims any liability for such unauthorized use. PLEASE NOTE that 
all incoming e-mails sent to PDF e-mail accounts will be archived and may 
be scanned by us and/or by external service providers to detect and prevent 
threats to our systems, investigate illegal or inappropriate behavior, 
and/or eliminate unsolicited promotional e-mails (“spam”). If you have any 
concerns about this process, please contact us at *
*legal.departm...@pdf.com* *.*