Spark Partitions Size control

2022-11-27 Thread vijay khatri
Hi Team,

I am reading data from SQL Server tables through PySpark and storing the data
in S3 in Parquet format.

Some tables have a lot of data, so for those tables the files I get in S3 are
several GB in size.

I need help with the following:

I want each partition to be about 128 MB. How can I assign that?

I don't know the data size in the tables up front. Some tables have 2 columns but
billions of records, and some tables have 200 columns but only thousands of records.


Thanks in advance for your help.

Regards,
Vijay
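
A minimal Scala sketch of one common approach (the same option names apply from
PySpark); the server URL, table name, partition column, bounds, and record count
per file are placeholders to be worked out per table, not values from this thread:

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")  // placeholder connection
  .option("dbtable", "dbo.my_table")                                // placeholder table
  .option("user", "spark_user")
  .option("password", "secret")
  .option("partitionColumn", "id")       // must be a numeric, date, or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("numPartitions", "64")         // parallel JDBC reads instead of one big partition
  .load()

// There is no direct "128 MB per output file" setting for Parquet writes; one
// common workaround (Spark 2.2+) is to cap rows per file with a per-table
// estimate of how many rows fit in roughly 128 MB after compression.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 2000000L)
jdbcDF.write.mode("overwrite").parquet("s3a://my-bucket/my_table/")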


On Wed, 23 Nov 2022, 10:05 pm Mitch Shepherd wrote:

> Hello,
>
> I’m wondering if anyone can point me in the right direction for a Spark
> connector developer guide.
>
> I’m looking for information on writing a new connector for Spark to move
> data between Apache Spark and other systems.
>
> Any information would be helpful. I found a similar thing for Kafka but
> haven’t been able to track down documentation for Spark.
>
> Best,
>
> Mitch


Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread lucas.g...@gmail.com
Ok, so for JDBC I presume it defaults to a single partition if you don't
provide partitioning meta data?

Thanks!

Gary

On 26 October 2017 at 13:43, Daniel Siegmann <dsiegm...@securityscorecard.io
> wrote:

> Those settings apply when a shuffle happens, but they don't affect the way
> the data is partitioned when it is initially read, for example by
> spark.read.parquet("path/to/input"). For HDFS / S3 I think it depends
> on how the data is split into chunks; if there are lots of small chunks,
> Spark will automatically merge them into a smaller number of partitions.
> There will be various settings depending on what you're reading from.
>
> val df = spark.read.parquet("path/to/input")  // partitioning will depend on the data
> val df2 = df.groupBy("thing").count()         // a shuffle happened, so shuffle partitioning configuration applies
>
>
> Tip: gzip files can't be split, so if you read a gzip file everything will
> be in one partition. That's a good reason to avoid large gzip files. :-)
>
> If you don't have a shuffle but you want to change how many partitions
> there are, you will need to coalesce or repartition.
>
>
> --
> Daniel Siegmann
> Senior Software Engineer
> *SecurityScorecard Inc.*
> 214 W 29th Street, 5th Floor
> New York, NY 10001
>
>
> On Thu, Oct 26, 2017 at 11:31 AM, lucas.g...@gmail.com <
> lucas.g...@gmail.com> wrote:
>
>> Thanks Daniel!
>>
>> I've been wondering that for ages!
>>
>> IE where my JDBC sourced datasets are coming up with 200 partitions on
>> write to S3.
>>
>> What do you mean for (except for the initial read)?
>>
>> Can you explain that a bit further?
>>
>> Gary Lucas
>>
>> On 26 October 2017 at 11:28, Daniel Siegmann <
>> dsiegm...@securityscorecard.io> wrote:
>>
>>> When working with datasets, Spark uses spark.sql.shuffle.partitions. It
>>> defaults to 200. Between that and the default parallelism you can control
>>> the number of partitions (except for the initial read).
>>>
>>> More info here: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
>>>
>>> I have no idea why it defaults to a fixed 200 (while default parallelism
>>> defaults to a number scaled to your number of cores), or why there are two
>>> separate configuration properties.
>>>
>>>
>>> --
>>> Daniel Siegmann
>>> Senior Software Engineer
>>> *SecurityScorecard Inc.*
>>> 214 W 29th Street, 5th Floor
>>> New York, NY 10001
>>>
>>>
>>> On Thu, Oct 26, 2017 at 9:53 AM, Deepak Sharma <deepakmc...@gmail.com>
>>> wrote:
>>>
>>>> I guess the issue is spark.default.parallelism is ignored when you are
>>>> working with Data frames.It is supposed to work with only raw RDDs.
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <
>>>> noo...@noorul.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have the following spark configuration
>>>>>
>>>>> spark.app.name=Test
>>>>> spark.cassandra.connection.host=127.0.0.1
>>>>> spark.cassandra.connection.keep_alive_ms=5000
>>>>> spark.cassandra.connection.port=1
>>>>> spark.cassandra.connection.timeout_ms=3
>>>>> spark.cleaner.ttl=3600
>>>>> spark.default.parallelism=4
>>>>> spark.master=local[2]
>>>>> spark.ui.enabled=false
>>>>> spark.ui.showConsoleProgress=false
>>>>>
>>>>> Because I am setting spark.default.parallelism to 4, I was expecting
>>>>> only 4 spark partitions. But it looks like it is not the case
>>>>>
>>>>> When I do the following
>>>>>
>>>>> df.foreachPartition { partition =>
>>>>>   val groupedPartition = partition.toList.grouped(3).toList
>>>>>   println("Grouped partition " + groupedPartition)
>>>>> }
>>>>>
>>>>> There are too many print statements with empty list at the top. Only
>>>>> the relevant partitions are at the bottom. Is there a way to control
>>>>> number of partitions?
>>>>>
>>>>> Regards,
>>>>> Noorul
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks
>>>> Deepak
>>>> www.bigdatabig.com
>>>> www.keosha.net
>>>>
>>>
>>>
>>
>


Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
Those settings apply when a shuffle happens, but they don't affect the way
the data is partitioned when it is initially read, for example by
spark.read.parquet("path/to/input"). For HDFS / S3 I think it depends on
how the data is split into chunks; if there are lots of small chunks,
Spark will automatically merge them into a smaller number of partitions.
There will be various settings depending on what you're reading from.

val df = spark.read.parquet("path/to/input")  // partitioning will depend on the data
val df2 = df.groupBy("thing").count()         // a shuffle happened, so shuffle partitioning configuration applies


Tip: gzip files can't be split, so if you read a gzip file everything will
be in one partition. That's a good reason to avoid large gzip files. :-)

If you don't have a shuffle but you want to change how many partitions
there are, you will need to coalesce or repartition.
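
A minimal sketch of the cases described above (the path and column names are placeholders):

val df = spark.read.parquet("path/to/input")
println(df.rdd.getNumPartitions)       // driven by how the input is split, not by shuffle settings

val df2 = df.groupBy("thing").count()  // shuffle: spark.sql.shuffle.partitions applies here

// No shuffle, but a different partition count is wanted:
val more  = df.repartition(200)        // full shuffle, can increase the count
val fewer = df.coalesce(16)            // narrow dependency, can only decrease it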


--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001


On Thu, Oct 26, 2017 at 11:31 AM, lucas.g...@gmail.com <lucas.g...@gmail.com
> wrote:

> Thanks Daniel!
>
> I've been wondering that for ages!
>
> IE where my JDBC sourced datasets are coming up with 200 partitions on
> write to S3.
>
> What do you mean for (except for the initial read)?
>
> Can you explain that a bit further?
>
> Gary Lucas
>
> On 26 October 2017 at 11:28, Daniel Siegmann <dsiegmann@securityscorecard.
> io> wrote:
>
>> When working with datasets, Spark uses spark.sql.shuffle.partitions. It
>> defaults to 200. Between that and the default parallelism you can control
>> the number of partitions (except for the initial read).
>>
>> More info here: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
>>
>> I have no idea why it defaults to a fixed 200 (while default parallelism
>> defaults to a number scaled to your number of cores), or why there are two
>> separate configuration properties.
>>
>>
>> --
>> Daniel Siegmann
>> Senior Software Engineer
>> *SecurityScorecard Inc.*
>> 214 W 29th Street, 5th Floor
>> New York, NY 10001
>>
>>
>> On Thu, Oct 26, 2017 at 9:53 AM, Deepak Sharma <deepakmc...@gmail.com>
>> wrote:
>>
>>> I guess the issue is spark.default.parallelism is ignored when you are
>>> working with Data frames.It is supposed to work with only raw RDDs.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <
>>> noo...@noorul.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have the following spark configuration
>>>>
>>>> spark.app.name=Test
>>>> spark.cassandra.connection.host=127.0.0.1
>>>> spark.cassandra.connection.keep_alive_ms=5000
>>>> spark.cassandra.connection.port=1
>>>> spark.cassandra.connection.timeout_ms=3
>>>> spark.cleaner.ttl=3600
>>>> spark.default.parallelism=4
>>>> spark.master=local[2]
>>>> spark.ui.enabled=false
>>>> spark.ui.showConsoleProgress=false
>>>>
>>>> Because I am setting spark.default.parallelism to 4, I was expecting
>>>> only 4 spark partitions. But it looks like it is not the case
>>>>
>>>> When I do the following
>>>>
>>>> df.foreachPartition { partition =>
>>>>   val groupedPartition = partition.toList.grouped(3).toList
>>>>   println("Grouped partition " + groupedPartition)
>>>> }
>>>>
>>>> There are too many print statements with empty list at the top. Only
>>>> the relevant partitions are at the bottom. Is there a way to control
>>>> number of partitions?
>>>>
>>>> Regards,
>>>> Noorul
>>>>
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>>
>


Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread lucas.g...@gmail.com
Thanks Daniel!

I've been wondering that for ages!

I.e., where my JDBC-sourced datasets are coming up with 200 partitions on
write to S3.

What do you mean by "(except for the initial read)"?

Can you explain that a bit further?

Gary Lucas

On 26 October 2017 at 11:28, Daniel Siegmann <dsiegm...@securityscorecard.io
> wrote:

> When working with datasets, Spark uses spark.sql.shuffle.partitions. It
> defaults to 200. Between that and the default parallelism you can control
> the number of partitions (except for the initial read).
>
> More info here: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
>
> I have no idea why it defaults to a fixed 200 (while default parallelism
> defaults to a number scaled to your number of cores), or why there are two
> separate configuration properties.
>
>
> --
> Daniel Siegmann
> Senior Software Engineer
> *SecurityScorecard Inc.*
> 214 W 29th Street, 5th Floor
> New York, NY 10001
>
>
> On Thu, Oct 26, 2017 at 9:53 AM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> I guess the issue is spark.default.parallelism is ignored when you are
>> working with Data frames.It is supposed to work with only raw RDDs.
>>
>> Thanks
>> Deepak
>>
>> On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <
>> noo...@noorul.com> wrote:
>>
>>> Hi all,
>>>
>>> I have the following spark configuration
>>>
>>> spark.app.name=Test
>>> spark.cassandra.connection.host=127.0.0.1
>>> spark.cassandra.connection.keep_alive_ms=5000
>>> spark.cassandra.connection.port=1
>>> spark.cassandra.connection.timeout_ms=3
>>> spark.cleaner.ttl=3600
>>> spark.default.parallelism=4
>>> spark.master=local[2]
>>> spark.ui.enabled=false
>>> spark.ui.showConsoleProgress=false
>>>
>>> Because I am setting spark.default.parallelism to 4, I was expecting
>>> only 4 spark partitions. But it looks like it is not the case
>>>
>>> When I do the following
>>>
>>> df.foreachPartition { partition =>
>>>   val groupedPartition = partition.toList.grouped(3).toList
>>>   println("Grouped partition " + groupedPartition)
>>> }
>>>
>>> There are too many print statements with empty list at the top. Only
>>> the relevant partitions are at the bottom. Is there a way to control
>>> number of partitions?
>>>
>>> Regards,
>>> Noorul
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>


Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
When working with datasets, Spark uses spark.sql.shuffle.partitions. It
defaults to 200. Between that and the default parallelism you can control
the number of partitions (except for the initial read).

More info here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options

I have no idea why it defaults to a fixed 200 (while default parallelism
defaults to a number scaled to your number of cores), or why there are two
separate configuration properties.
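
A short sketch of where the two properties are typically set (the values are illustrative, not recommendations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-settings")
  .config("spark.sql.shuffle.partitions", "64")  // Dataset/DataFrame shuffles (default 200)
  .config("spark.default.parallelism", "64")     // RDD operations without an explicit partition count
  .getOrCreate()

// The SQL setting can also be changed at runtime:
spark.conf.set("spark.sql.shuffle.partitions", 64)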


--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001


On Thu, Oct 26, 2017 at 9:53 AM, Deepak Sharma <deepakmc...@gmail.com>
wrote:

> I guess the issue is spark.default.parallelism is ignored when you are
> working with Data frames.It is supposed to work with only raw RDDs.
>
> Thanks
> Deepak
>
> On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <
> noo...@noorul.com> wrote:
>
>> Hi all,
>>
>> I have the following spark configuration
>>
>> spark.app.name=Test
>> spark.cassandra.connection.host=127.0.0.1
>> spark.cassandra.connection.keep_alive_ms=5000
>> spark.cassandra.connection.port=1
>> spark.cassandra.connection.timeout_ms=3
>> spark.cleaner.ttl=3600
>> spark.default.parallelism=4
>> spark.master=local[2]
>> spark.ui.enabled=false
>> spark.ui.showConsoleProgress=false
>>
>> Because I am setting spark.default.parallelism to 4, I was expecting
>> only 4 spark partitions. But it looks like it is not the case
>>
>> When I do the following
>>
>> df.foreachPartition { partition =>
>>   val groupedPartition = partition.toList.grouped(3).toList
>>   println("Grouped partition " + groupedPartition)
>> }
>>
>> There are too many print statements with empty list at the top. Only
>> the relevant partitions are at the bottom. Is there a way to control
>> number of partitions?
>>
>> Regards,
>> Noorul
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Deepak Sharma
I guess the issue is that spark.default.parallelism is ignored when you are
working with DataFrames. It is supposed to apply only to raw RDDs.

Thanks
Deepak
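
A small local sketch of that distinction (master and numbers chosen only for illustration; in recent Spark versions adaptive query execution may further coalesce shuffle partitions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("parallelism-vs-shuffle-partitions")
  .config("spark.default.parallelism", "4")
  .getOrCreate()
import spark.implicits._

// A raw RDD with no explicit partition count picks up spark.default.parallelism.
val rdd = spark.sparkContext.parallelize(1 to 100)
println(rdd.getNumPartitions)                            // 4

// A DataFrame shuffle is governed by spark.sql.shuffle.partitions instead (default 200).
val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
println(df.groupBy("key").count().rdd.getNumPartitions)  // 200 unless overridden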

On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <
noo...@noorul.com> wrote:

> Hi all,
>
> I have the following spark configuration
>
> spark.app.name=Test
> spark.cassandra.connection.host=127.0.0.1
> spark.cassandra.connection.keep_alive_ms=5000
> spark.cassandra.connection.port=1
> spark.cassandra.connection.timeout_ms=3
> spark.cleaner.ttl=3600
> spark.default.parallelism=4
> spark.master=local[2]
> spark.ui.enabled=false
> spark.ui.showConsoleProgress=false
>
> Because I am setting spark.default.parallelism to 4, I was expecting
> only 4 spark partitions. But it looks like it is not the case
>
> When I do the following
>
> df.foreachPartition { partition =>
>   val groupedPartition = partition.toList.grouped(3).toList
>   println("Grouped partition " + groupedPartition)
> }
>
> There are too many print statements with empty list at the top. Only
> the relevant partitions are at the bottom. Is there a way to control
> number of partitions?
>
> Regards,
> Noorul
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread lucas.g...@gmail.com
I think we'd need to see the code that loads the df.

Parallelism and partition count are related, but they're not the same. I've
found the documentation fuzzy on this, but it looks like
spark.default.parallelism is what Spark uses for partitioning when it has no
other guidance. I'm also under the impression (and I could be wrong here)
that the data loading step has some impact on partitioning.

In any case, it would be most helpful to see the DataFrame loading code.

Good luck!

Gary Lucas

On 26 October 2017 at 09:35, Noorul Islam Kamal Malmiyoda <noo...@noorul.com
> wrote:

> Hi all,
>
> I have the following spark configuration
>
> spark.app.name=Test
> spark.cassandra.connection.host=127.0.0.1
> spark.cassandra.connection.keep_alive_ms=5000
> spark.cassandra.connection.port=1
> spark.cassandra.connection.timeout_ms=3
> spark.cleaner.ttl=3600
> spark.default.parallelism=4
> spark.master=local[2]
> spark.ui.enabled=false
> spark.ui.showConsoleProgress=false
>
> Because I am setting spark.default.parallelism to 4, I was expecting
> only 4 spark partitions. But it looks like it is not the case
>
> When I do the following
>
> df.foreachPartition { partition =>
>   val groupedPartition = partition.toList.grouped(3).toList
>   println("Grouped partition " + groupedPartition)
> }
>
> There are too many print statements with empty list at the top. Only
> the relevant partitions are at the bottom. Is there a way to control
> number of partitions?
>
> Regards,
> Noorul
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Controlling number of spark partitions in dataframes

2017-10-26 Thread Noorul Islam Kamal Malmiyoda
Hi all,

I have the following spark configuration

spark.app.name=Test
spark.cassandra.connection.host=127.0.0.1
spark.cassandra.connection.keep_alive_ms=5000
spark.cassandra.connection.port=1
spark.cassandra.connection.timeout_ms=3
spark.cleaner.ttl=3600
spark.default.parallelism=4
spark.master=local[2]
spark.ui.enabled=false
spark.ui.showConsoleProgress=false

Because I am setting spark.default.parallelism to 4, I was expecting
only 4 Spark partitions, but it looks like that is not the case.

When I do the following

df.foreachPartition { partition =>
  val groupedPartition = partition.toList.grouped(3).toList
  println("Grouped partition " + groupedPartition)
}

There are too many print statements with empty lists at the top; only
the relevant partitions appear at the bottom. Is there a way to control
the number of partitions?

Regards,
Noorul
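
Two quick checks, sketched as a continuation of the snippet above (so df refers
to the DataFrame in the question): the partition count of a DataFrame comes from
its source and its shuffles, not from spark.default.parallelism, and coalescing
before iterating avoids the mostly empty partitions.

println(df.rdd.getNumPartitions)   // count comes from the source / last shuffle, not spark.default.parallelism

df.coalesce(4).foreachPartition { partition =>
  val groupedPartition = partition.toList.grouped(3).toList
  println("Grouped partition " + groupedPartition)
}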




Re: Spark - Partitions

2017-10-17 Thread Sebastian Piu
Change this
unionDS.repartition(numPartitions);
unionDS.createOrReplaceTempView(...

To

unionDS.repartition(numPartitions).createOrReplaceTempView(...
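
The point is that repartition() returns a new Dataset rather than modifying
unionDS in place, so the unchained call in the code below is effectively a no-op.
An equivalent sketch using the same names:

val repartitionedDS = unionDS.repartition(numPartitions)   // keep the returned Dataset
repartitionedDS.createOrReplaceTempView("datapoint_prq_union_ds_view")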

On Wed, 18 Oct 2017, 03:05 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com>
wrote:

> val unionDS = rawDS.union(processedDS)
>   //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
>   val unionedDS = unionDS.dropDuplicates()
>   //val
> unionedPartitionedDS=unionedDS.repartition(unionedDS("year"),unionedDS("month"),unionedDS("day")).persist(StorageLevel.MEMORY_AND_DISK)
>   //unionDS.persist(StorageLevel.MEMORY_AND_DISK)
>   unionDS.repartition(numPartitions);
>   unionDS.createOrReplaceTempView("datapoint_prq_union_ds_view")
>   sparkSession.sql(s"set hive.exec.dynamic.partition.mode=nonstrict")
>   val deltaDSQry = "insert overwrite table  datapoint
> PARTITION(year,month,day) select VIN, utctime, description, descriptionuom,
> providerdesc, dt_map, islocation, latitude, longitude, speed,
> value,current_date,YEAR, MONTH, DAY from datapoint_prq_union_ds_view"
>   println(deltaDSQry)
>   sparkSession.sql(deltaDSQry)
>
>
> Here is the code and also properties used in my project.
>
>
> On Tue, Oct 17, 2017 at 3:38 PM, Sebastian Piu <sebastian@gmail.com>
> wrote:
>
>> Can you share some code?
>>
>> On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com>
>> wrote:
>>
>>> In my case I am just writing the data frame back to hive. so when is the
>>> best case to repartition it. I did repartition before calling insert
>>> overwrite on table
>>>
>>> On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <sebastian@gmail.com>
>>> wrote:
>>>
>>>> You have to repartition/coalesce *after *the action that is causing
>>>> the shuffle as that one will take the value you've set
>>>>
>>>> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <
>>>> mdkhajaasm...@gmail.com> wrote:
>>>>
>>>>> Yes still I see more number of part files and exactly the number I
>>>>> have defined did spark.sql.shuffle.partitions
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Oct 17, 2017, at 2:32 PM, Michael Artz <michaelea...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Have you tried caching it and using a coalesce?
>>>>>
>>>>>
>>>>>
>>>>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <
>>>>> mdkhajaasm...@gmail.com> wrote:
>>>>>
>>>>>> I tried repartitions but spark.sql.shuffle.partitions is taking up
>>>>>> precedence over repartitions or coalesce. how to get the lesser number of
>>>>>> files with same performance?
>>>>>>
>>>>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
>>>>>> tushar_adesh...@persistent.com> wrote:
>>>>>>
>>>>>>> You can also try coalesce as it will avoid full shuffle.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> *Tushar Adeshara*
>>>>>>>
>>>>>>> *Technical Specialist – Analytics Practice*
>>>>>>>
>>>>>>> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>>>>>>>
>>>>>>> *Persistent Systems** Ltd. **| **Partners in Innovation **|* 
>>>>>>> *www.persistentsys.com
>>>>>>> <http://www.persistentsys.com/>*
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>>>>>> *Sent:* 13 October 2017 09:35
>>>>>>> *To:* user @spark
>>>>>>> *Subject:* Spark - Partitions
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am reading hive query and wiriting the data back into hive after
>>>>>>> doing some transformations.
>>>>>>>
>>>>>>> I have changed setting spark.sql.shuffle.partitions to 2000 and
>>>>>>> since then job completes fast but the main problem is I am getting 2000
>>>>>>> files for each partition
>>>>>>> size of file is 10 MB .
>>>>>>>
>>>>>>> is there a way to get same performance but write lesser number of
>>>>>>> files ?
>>>>>>>
>>>>>>> I am trying repartition now but would like to know if there are any
>>>>>>> other options.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Asmath
>>>>>>>
>>>>>>
>>>>>>
>>>
>


Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
val unionDS = rawDS.union(processedDS)
//unionDS.persist(StorageLevel.MEMORY_AND_DISK)
val unionedDS = unionDS.dropDuplicates()
//val unionedPartitionedDS = unionedDS.repartition(unionedDS("year"), unionedDS("month"), unionedDS("day")).persist(StorageLevel.MEMORY_AND_DISK)
//unionDS.persist(StorageLevel.MEMORY_AND_DISK)
unionDS.repartition(numPartitions);
unionDS.createOrReplaceTempView("datapoint_prq_union_ds_view")
sparkSession.sql(s"set hive.exec.dynamic.partition.mode=nonstrict")
val deltaDSQry = "insert overwrite table datapoint PARTITION(year,month,day) " +
  "select VIN, utctime, description, descriptionuom, providerdesc, dt_map, islocation, " +
  "latitude, longitude, speed, value, current_date, YEAR, MONTH, DAY " +
  "from datapoint_prq_union_ds_view"
println(deltaDSQry)
sparkSession.sql(deltaDSQry)


Here is the code and also properties used in my project.


On Tue, Oct 17, 2017 at 3:38 PM, Sebastian Piu <sebastian@gmail.com>
wrote:

> Can you share some code?
>
> On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com>
> wrote:
>
>> In my case I am just writing the data frame back to hive. so when is the
>> best case to repartition it. I did repartition before calling insert
>> overwrite on table
>>
>> On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <sebastian@gmail.com>
>> wrote:
>>
>>> You have to repartition/coalesce *after *the action that is causing the
>>> shuffle as that one will take the value you've set
>>>
>>> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <
>>> mdkhajaasm...@gmail.com> wrote:
>>>
>>>> Yes still I see more number of part files and exactly the number I have
>>>> defined did spark.sql.shuffle.partitions
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Oct 17, 2017, at 2:32 PM, Michael Artz <michaelea...@gmail.com>
>>>> wrote:
>>>>
>>>> Have you tried caching it and using a coalesce?
>>>>
>>>>
>>>>
>>>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <
>>>> mdkhajaasm...@gmail.com> wrote:
>>>>
>>>>> I tried repartitions but spark.sql.shuffle.partitions is taking up
>>>>> precedence over repartitions or coalesce. how to get the lesser number of
>>>>> files with same performance?
>>>>>
>>>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
>>>>> tushar_adesh...@persistent.com> wrote:
>>>>>
>>>>>> You can also try coalesce as it will avoid full shuffle.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> *Tushar Adeshara*
>>>>>>
>>>>>> *Technical Specialist – Analytics Practice*
>>>>>>
>>>>>> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>>>>>>
>>>>>> *Persistent Systems** Ltd. **| **Partners in Innovation **|* 
>>>>>> *www.persistentsys.com
>>>>>> <http://www.persistentsys.com/>*
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>>>>> *Sent:* 13 October 2017 09:35
>>>>>> *To:* user @spark
>>>>>> *Subject:* Spark - Partitions
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am reading hive query and wiriting the data back into hive after
>>>>>> doing some transformations.
>>>>>>
>>>>>> I have changed setting spark.sql.shuffle.partitions to 2000 and since
>>>>>> then job completes fast but the main problem is I am getting 2000 files 
>>>>>> for
>>>>>> each partition
>>>>>> size of file is 10 MB .
>>>>>>
>>>>>> is there a way to get same performance but write lesser number of
>>>>>> files ?
>>>>>>
>>>>>> I am trying repartition now but would like to know if there are any
>>>>>> other options.
>>>>>>
>>>>>> Thanks,
>>>>>> Asmath
>>>>>>
>>>>>
>>>>>
>>



Re: Spark - Partitions

2017-10-17 Thread Sebastian Piu
Can you share some code?

On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com>
wrote:

> In my case I am just writing the data frame back to hive. so when is the
> best case to repartition it. I did repartition before calling insert
> overwrite on table
>
> On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <sebastian@gmail.com>
> wrote:
>
>> You have to repartition/coalesce *after *the action that is causing the
>> shuffle as that one will take the value you've set
>>
>> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <
>> mdkhajaasm...@gmail.com> wrote:
>>
>>> Yes still I see more number of part files and exactly the number I have
>>> defined did spark.sql.shuffle.partitions
>>>
>>> Sent from my iPhone
>>>
>>> On Oct 17, 2017, at 2:32 PM, Michael Artz <michaelea...@gmail.com>
>>> wrote:
>>>
>>> Have you tried caching it and using a coalesce?
>>>
>>>
>>>
>>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com>
>>> wrote:
>>>
>>>> I tried repartitions but spark.sql.shuffle.partitions is taking up
>>>> precedence over repartitions or coalesce. how to get the lesser number of
>>>> files with same performance?
>>>>
>>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
>>>> tushar_adesh...@persistent.com> wrote:
>>>>
>>>>> You can also try coalesce as it will avoid full shuffle.
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> *Tushar Adeshara*
>>>>>
>>>>> *Technical Specialist – Analytics Practice*
>>>>>
>>>>> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>>>>>
>>>>> *Persistent Systems** Ltd. **| **Partners in Innovation **|* 
>>>>> *www.persistentsys.com
>>>>> <http://www.persistentsys.com/>*
>>>>>
>>>>>
>>>>> --
>>>>> *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>>>> *Sent:* 13 October 2017 09:35
>>>>> *To:* user @spark
>>>>> *Subject:* Spark - Partitions
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am reading hive query and wiriting the data back into hive after
>>>>> doing some transformations.
>>>>>
>>>>> I have changed setting spark.sql.shuffle.partitions to 2000 and since
>>>>> then job completes fast but the main problem is I am getting 2000 files 
>>>>> for
>>>>> each partition
>>>>> size of file is 10 MB .
>>>>>
>>>>> is there a way to get same performance but write lesser number of
>>>>> files ?
>>>>>
>>>>> I am trying repartition now but would like to know if there are any
>>>>> other options.
>>>>>
>>>>> Thanks,
>>>>> Asmath
>>>>>
>>>>
>>>>
>


Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
In my case I am just writing the data frame back to Hive, so when is the
best place to repartition it? I did repartition before calling insert
overwrite on the table.

On Tue, Oct 17, 2017 at 3:07 PM, Sebastian Piu <sebastian@gmail.com>
wrote:

> You have to repartition/coalesce *after *the action that is causing the
> shuffle as that one will take the value you've set
>
> On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Yes still I see more number of part files and exactly the number I have
>> defined did spark.sql.shuffle.partitions
>>
>> Sent from my iPhone
>>
>> On Oct 17, 2017, at 2:32 PM, Michael Artz <michaelea...@gmail.com> wrote:
>>
>> Have you tried caching it and using a coalesce?
>>
>>
>>
>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com>
>> wrote:
>>
>>> I tried repartitions but spark.sql.shuffle.partitions is taking up
>>> precedence over repartitions or coalesce. how to get the lesser number of
>>> files with same performance?
>>>
>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
>>> tushar_adesh...@persistent.com> wrote:
>>>
>>>> You can also try coalesce as it will avoid full shuffle.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> *Tushar Adeshara*
>>>>
>>>> *Technical Specialist – Analytics Practice*
>>>>
>>>> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>>>>
>>>> *Persistent Systems** Ltd. **| **Partners in Innovation **|* 
>>>> *www.persistentsys.com
>>>> <http://www.persistentsys.com/>*
>>>>
>>>>
>>>> --
>>>> *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>>> *Sent:* 13 October 2017 09:35
>>>> *To:* user @spark
>>>> *Subject:* Spark - Partitions
>>>>
>>>> Hi,
>>>>
>>>> I am reading hive query and wiriting the data back into hive after
>>>> doing some transformations.
>>>>
>>>> I have changed setting spark.sql.shuffle.partitions to 2000 and since
>>>> then job completes fast but the main problem is I am getting 2000 files for
>>>> each partition
>>>> size of file is 10 MB .
>>>>
>>>> is there a way to get same performance but write lesser number of files
>>>> ?
>>>>
>>>> I am trying repartition now but would like to know if there are any
>>>> other options.
>>>>
>>>> Thanks,
>>>> Asmath
>>>>
>>>
>>>


Re: Spark - Partitions

2017-10-17 Thread Sebastian Piu
You have to repartition/coalesce *after* the operation that is causing the
shuffle, as that operation will take the value you've set.
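
A sketch of that ordering for the code discussed in this thread (the target of
64 partitions is illustrative; pick it from the data volume and desired file size):

val unionedDS = rawDS.union(processedDS).dropDuplicates()  // shuffle: runs with spark.sql.shuffle.partitions tasks
unionedDS
  .coalesce(64)                                            // applied after the shuffle; narrow, so no extra shuffle
  .createOrReplaceTempView("datapoint_prq_union_ds_view")
// ...then run the same INSERT OVERWRITE ... SELECT against the view as before.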

On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Yes still I see more number of part files and exactly the number I have
> defined did spark.sql.shuffle.partitions
>
> Sent from my iPhone
>
> On Oct 17, 2017, at 2:32 PM, Michael Artz <michaelea...@gmail.com> wrote:
>
> Have you tried caching it and using a coalesce?
>
>
>
> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com>
> wrote:
>
>> I tried repartitions but spark.sql.shuffle.partitions is taking up
>> precedence over repartitions or coalesce. how to get the lesser number of
>> files with same performance?
>>
>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
>> tushar_adesh...@persistent.com> wrote:
>>
>>> You can also try coalesce as it will avoid full shuffle.
>>>
>>>
>>> Regards,
>>>
>>> *Tushar Adeshara*
>>>
>>> *Technical Specialist – Analytics Practice*
>>>
>>> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>>>
>>> *Persistent Systems** Ltd. **| **Partners in Innovation **|* 
>>> *www.persistentsys.com
>>> <http://www.persistentsys.com/>*
>>>
>>>
>>> --
>>> *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>> *Sent:* 13 October 2017 09:35
>>> *To:* user @spark
>>> *Subject:* Spark - Partitions
>>>
>>> Hi,
>>>
>>> I am reading hive query and wiriting the data back into hive after doing
>>> some transformations.
>>>
>>> I have changed setting spark.sql.shuffle.partitions to 2000 and since
>>> then job completes fast but the main problem is I am getting 2000 files for
>>> each partition
>>> size of file is 10 MB .
>>>
>>> is there a way to get same performance but write lesser number of files ?
>>>
>>> I am trying repartition now but would like to know if there are any
>>> other options.
>>>
>>> Thanks,
>>> Asmath
>>>
>>
>>


Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
Yes, I still see a large number of part files: exactly the number I have defined
for spark.sql.shuffle.partitions.

Sent from my iPhone

> On Oct 17, 2017, at 2:32 PM, Michael Artz <michaelea...@gmail.com> wrote:
> 
> Have you tried caching it and using a coalesce? 
> 
> 
> 
>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> 
>> wrote:
>> I tried repartitions but spark.sql.shuffle.partitions is taking up 
>> precedence over repartitions or coalesce. how to get the lesser number of 
>> files with same performance?
>> 
>>> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara 
>>> <tushar_adesh...@persistent.com> wrote:
>>> You can also try coalesce as it will avoid full shuffle.
>>> 
>>> 
>>> Regards,
>>> Tushar Adeshara
>>> 
>>> Technical Specialist – Analytics Practice
>>> 
>>> Cell: +91-81490 04192
>>> 
>>> Persistent Systems Ltd. | Partners in Innovation | www.persistentsys.com
>>> 
>>> 
>>> From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>> Sent: 13 October 2017 09:35
>>> To: user @spark
>>> Subject: Spark - Partitions
>>>  
>>> Hi,
>>> 
>>> I am reading hive query and wiriting the data back into hive after doing 
>>> some transformations.
>>> 
>>> I have changed setting spark.sql.shuffle.partitions to 2000 and since then 
>>> job completes fast but the main problem is I am getting 2000 files for each 
>>> partition 
>>> size of file is 10 MB .
>>> 
>>> is there a way to get same performance but write lesser number of files ?
>>> 
>>> I am trying repartition now but would like to know if there are any other 
>>> options.
>>> 
>>> Thanks,
>>> Asmath
>> 


Re: Spark - Partitions

2017-10-17 Thread Michael Artz
Have you tried caching it and using a coalesce?



On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com>
wrote:

> I tried repartitions but spark.sql.shuffle.partitions is taking up
> precedence over repartitions or coalesce. how to get the lesser number of
> files with same performance?
>
> On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
> tushar_adesh...@persistent.com> wrote:
>
>> You can also try coalesce as it will avoid full shuffle.
>>
>>
>> Regards,
>>
>> *Tushar Adeshara*
>>
>> *Technical Specialist – Analytics Practice*
>>
>> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>>
>> *Persistent Systems** Ltd. **| **Partners in Innovation **|* 
>> *www.persistentsys.com
>> <http://www.persistentsys.com/>*
>>
>>
>> ------
>> *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>> *Sent:* 13 October 2017 09:35
>> *To:* user @spark
>> *Subject:* Spark - Partitions
>>
>> Hi,
>>
>> I am reading hive query and wiriting the data back into hive after doing
>> some transformations.
>>
>> I have changed setting spark.sql.shuffle.partitions to 2000 and since
>> then job completes fast but the main problem is I am getting 2000 files for
>> each partition
>> size of file is 10 MB .
>>
>> is there a way to get same performance but write lesser number of files ?
>>
>> I am trying repartition now but would like to know if there are any other
>> options.
>>
>> Thanks,
>> Asmath
>>
>
>


Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
I tried repartition, but spark.sql.shuffle.partitions is taking
precedence over repartition or coalesce. How can I get a smaller number of
files with the same performance?

On Fri, Oct 13, 2017 at 3:45 AM, Tushar Adeshara <
tushar_adesh...@persistent.com> wrote:

> You can also try coalesce as it will avoid full shuffle.
>
>
> Regards,
>
> *Tushar Adeshara*
>
> *Technical Specialist – Analytics Practice*
>
> *Cell: +91-81490 04192 <+91%2081490%2004192>*
>
> *Persistent Systems** Ltd. **| **Partners in Innovation **|* 
> *www.persistentsys.com
> <http://www.persistentsys.com/>*
>
>
> --
> *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
> *Sent:* 13 October 2017 09:35
> *To:* user @spark
> *Subject:* Spark - Partitions
>
> Hi,
>
> I am reading hive query and wiriting the data back into hive after doing
> some transformations.
>
> I have changed setting spark.sql.shuffle.partitions to 2000 and since then
> job completes fast but the main problem is I am getting 2000 files for each
> partition
> size of file is 10 MB .
>
> is there a way to get same performance but write lesser number of files ?
>
> I am trying repartition now but would like to know if there are any other
> options.
>
> Thanks,
> Asmath
>


Re: Spark - Partitions

2017-10-13 Thread Tushar Adeshara
You can also try coalesce as it will avoid full shuffle.


Regards,
Tushar Adeshara
Technical Specialist – Analytics Practice
Cell: +91-81490 04192
Persistent Systems Ltd. | Partners in Innovation | 
www.persistentsys.com<http://www.persistentsys.com/>



From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
Sent: 13 October 2017 09:35
To: user @spark
Subject: Spark - Partitions

Hi,

I am reading hive query and wiriting the data back into hive after doing some 
transformations.

I have changed setting spark.sql.shuffle.partitions to 2000 and since then job 
completes fast but the main problem is I am getting 2000 files for each 
partition
size of file is 10 MB .

is there a way to get same performance but write lesser number of files ?

I am trying repartition now but would like to know if there are any other 
options.

Thanks,
Asmath


Re: Spark - Partitions

2017-10-12 Thread Chetan Khatri
Use repartition
On 13-Oct-2017 9:35 AM, "KhajaAsmath Mohammed" wrote:

> Hi,
>
> I am reading hive query and wiriting the data back into hive after doing
> some transformations.
>
> I have changed setting spark.sql.shuffle.partitions to 2000 and since then
> job completes fast but the main problem is I am getting 2000 files for each
> partition
> size of file is 10 MB .
>
> is there a way to get same performance but write lesser number of files ?
>
> I am trying repartition now but would like to know if there are any other
> options.
>
> Thanks,
> Asmath
>


Spark - Partitions

2017-10-12 Thread KhajaAsmath Mohammed
Hi,

I am reading a Hive query and writing the data back into Hive after doing
some transformations.

I have changed the setting spark.sql.shuffle.partitions to 2000, and since then
the job completes fast, but the main problem is that I am getting 2000 files for
each partition, each file about 10 MB in size.

Is there a way to get the same performance but write a smaller number of files?

I am trying repartition now but would like to know if there are any other
options.

Thanks,
Asmath
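
One hedged option for this kind of dynamic-partition insert: keep
spark.sql.shuffle.partitions high for the heavy stages, but repartition by the
Hive partition columns just before the write, so each output partition's rows
land in a single task and you get far fewer, larger files. The view and column
names below follow the code quoted elsewhere in this thread; the exact columns
depend on your table.

// "transformed" stands for whatever DataFrame is about to be written;
// year/month/day are assumed to be the table's partition columns.
val transformed = sparkSession.table("datapoint_prq_union_ds_view")
transformed
  .repartition(transformed("year"), transformed("month"), transformed("day"))
  .createOrReplaceTempView("repartitioned_view")
sparkSession.sql(
  "insert overwrite table datapoint partition(year, month, day) " +
  "select * from repartitioned_view")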


Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Umesh Kacha
Hi Lan, thanks for the response. Yes, I know, and I have confirmed in the UI that
it has only 12 partitions because of 12 HDFS blocks; the Hive ORC file stripe
size is 33554432.

On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <ljia...@gmail.com> wrote:

> The partition number should be the same as the HDFS block number instead
> of file number. Did you confirmed from the spark UI that only 12 partitions
> were created? What is your ORC orc.stripe.size?
>
> Lan
>
>
> > On Oct 8, 2015, at 1:13 PM, unk1102 <umesh.ka...@gmail.com> wrote:
> >
> > Hi I have the following code where I read ORC files from HDFS and it
> loads
> > directory which contains 12 ORC files. Now since HDFS directory contains
> 12
> > files it will create 12 partitions by default. These directory is huge
> and
> > when ORC files gets decompressed it becomes around 10 GB how do I
> increase
> > partitions for the below code so that my Spark job runs faster and does
> not
> > hang for long time because of reading 10 GB files through shuffle in 12
> > partitions. Please guide.
> >
> > DataFrame df =
> > hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
> > df.select().groupby(..)
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>


Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
Hmm, that’s odd. 

You can always use repartition(n) to increase the partition number, but then 
there will be a shuffle. How large is your ORC file? Have you used the NameNode UI to 
check how many HDFS blocks each ORC file has?

Lan


> On Oct 8, 2015, at 2:08 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
> 
> Hi Lan, thanks for the response yes I know and I have confirmed in UI that it 
> has only 12 partitions because of 12 HDFS blocks and hive orc file strip size 
> is 33554432.
> 
> On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <ljia...@gmail.com 
> <mailto:ljia...@gmail.com>> wrote:
> The partition number should be the same as the HDFS block number instead of 
> file number. Did you confirmed from the spark UI that only 12 partitions were 
> created? What is your ORC orc.stripe.size?
> 
> Lan
> 
> 
> > On Oct 8, 2015, at 1:13 PM, unk1102 <umesh.ka...@gmail.com 
> > <mailto:umesh.ka...@gmail.com>> wrote:
> >
> > Hi I have the following code where I read ORC files from HDFS and it loads
> > directory which contains 12 ORC files. Now since HDFS directory contains 12
> > files it will create 12 partitions by default. These directory is huge and
> > when ORC files gets decompressed it becomes around 10 GB how do I increase
> > partitions for the below code so that my Spark job runs faster and does not
> > hang for long time because of reading 10 GB files through shuffle in 12
> > partitions. Please guide.
> >
> > DataFrame df =
> > hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
> > df.select().groupby(..)
> >
> >
> >
> >
> > --
> > View this message in context: 
> > http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
> >  
> > <http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html>
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > <mailto:user-unsubscr...@spark.apache.org>
> > For additional commands, e-mail: user-h...@spark.apache.org 
> > <mailto:user-h...@spark.apache.org>
> >
> 
> 



Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Umesh Kacha
Hi Lan, thanks for the reply. I have tried the following, but it did
not increase the partition count:

DataFrame df = hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/").repartition(100);

Yes, I have checked in the NameNode UI: the ORC directory contains 12 files/blocks of
128 MB each, and the ORC data is around 10 GB when decompressed (around 1 GB on disk).

On Fri, Oct 9, 2015 at 12:43 AM, Lan Jiang <ljia...@gmail.com> wrote:

> Hmm, that’s odd.
>
> You can always use repartition(n) to increase the partition number, but
> then there will be shuffle. How large is your ORC file? Have you used
> NameNode UI to check how many HDFS blocks each ORC file has?
>
> Lan
>
>
> On Oct 8, 2015, at 2:08 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>
> Hi Lan, thanks for the response yes I know and I have confirmed in UI that
> it has only 12 partitions because of 12 HDFS blocks and hive orc file strip
> size is 33554432.
>
> On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <ljia...@gmail.com> wrote:
>
>> The partition number should be the same as the HDFS block number instead
>> of file number. Did you confirmed from the spark UI that only 12 partitions
>> were created? What is your ORC orc.stripe.size?
>>
>> Lan
>>
>>
>> > On Oct 8, 2015, at 1:13 PM, unk1102 <umesh.ka...@gmail.com> wrote:
>> >
>> > Hi I have the following code where I read ORC files from HDFS and it
>> loads
>> > directory which contains 12 ORC files. Now since HDFS directory
>> contains 12
>> > files it will create 12 partitions by default. These directory is huge
>> and
>> > when ORC files gets decompressed it becomes around 10 GB how do I
>> increase
>> > partitions for the below code so that my Spark job runs faster and does
>> not
>> > hang for long time because of reading 10 GB files through shuffle in 12
>> > partitions. Please guide.
>> >
>> > DataFrame df =
>> > hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
>> > df.select().groupby(..)
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com
>> .
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>>
>
>


Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Ted Yu
bq. contains 12 files/blocks

Looks like you hit the limit of parallelism these files can provide.

If you had a larger dataset, you would have more partitions.

On Thu, Oct 8, 2015 at 12:21 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi Lan thanks for the reply. I have tried to do the following but it did
> not increase partition
>
> DataFrame df = hiveContext.read().format("orc").load("/hdfs/path/to/orc/
> files/").repartition(100);
>
> Yes I have checked in namenode ui ORC files contains 12 files/blocks of
> 128 MB each and ORC files when decompressed its around 10 GB and its
> uncompressed file size is around 1 GB
>
> On Fri, Oct 9, 2015 at 12:43 AM, Lan Jiang <ljia...@gmail.com> wrote:
>
>> Hmm, that’s odd.
>>
>> You can always use repartition(n) to increase the partition number, but
>> then there will be shuffle. How large is your ORC file? Have you used
>> NameNode UI to check how many HDFS blocks each ORC file has?
>>
>> Lan
>>
>>
>> On Oct 8, 2015, at 2:08 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>
>> Hi Lan, thanks for the response yes I know and I have confirmed in UI
>> that it has only 12 partitions because of 12 HDFS blocks and hive orc file
>> strip size is 33554432.
>>
>> On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <ljia...@gmail.com> wrote:
>>
>>> The partition number should be the same as the HDFS block number instead
>>> of file number. Did you confirmed from the spark UI that only 12 partitions
>>> were created? What is your ORC orc.stripe.size?
>>>
>>> Lan
>>>
>>>
>>> > On Oct 8, 2015, at 1:13 PM, unk1102 <umesh.ka...@gmail.com> wrote:
>>> >
>>> > Hi I have the following code where I read ORC files from HDFS and it
>>> loads
>>> > directory which contains 12 ORC files. Now since HDFS directory
>>> contains 12
>>> > files it will create 12 partitions by default. These directory is huge
>>> and
>>> > when ORC files gets decompressed it becomes around 10 GB how do I
>>> increase
>>> > partitions for the below code so that my Spark job runs faster and
>>> does not
>>> > hang for long time because of reading 10 GB files through shuffle in 12
>>> > partitions. Please guide.
>>> >
>>> > DataFrame df =
>>> > hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
>>> > df.select().groupby(..)
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
>>> > Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>> >
>>>
>>>
>>
>>
>


How to increase Spark partitions for the DataFrame?

2015-10-08 Thread unk1102
Hi, I have the following code where I read ORC files from HDFS; it loads a
directory which contains 12 ORC files, so it will create 12 partitions by default.
This directory is huge, and when the ORC files get decompressed the data is around
10 GB. How do I increase the number of partitions for the code below so that my
Spark job runs faster and does not hang for a long time because it is processing
10 GB through a shuffle in only 12 partitions? Please guide.

DataFrame df =
hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
df.select().groupby(..)
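
A hedged Scala sketch of the repartition route discussed in the replies (the
grouping column is a placeholder; note that repartition returns a new DataFrame
rather than changing df in place, and it costs one shuffle):

val df = hiveContext.read.format("orc").load("/hdfs/path/to/orc/files/")
val wider = df.repartition(100)       // returns a new DataFrame; df itself is unchanged
println(wider.rdd.partitions.length)  // should now report 100
wider.groupBy("someColumn").count()   // "someColumn" stands in for the real grouping key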




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to increase Spark partitions for the DataFrame?

2015-10-08 Thread Lan Jiang
The partition number should match the HDFS block count rather than the 
file count. Did you confirm from the Spark UI that only 12 partitions were 
created? What is your ORC orc.stripe.size?

Lan


> On Oct 8, 2015, at 1:13 PM, unk1102 <umesh.ka...@gmail.com> wrote:
> 
> Hi I have the following code where I read ORC files from HDFS and it loads
> directory which contains 12 ORC files. Now since HDFS directory contains 12
> files it will create 12 partitions by default. These directory is huge and
> when ORC files gets decompressed it becomes around 10 GB how do I increase
> partitions for the below code so that my Spark job runs faster and does not
> hang for long time because of reading 10 GB files through shuffle in 12
> partitions. Please guide. 
> 
> DataFrame df =
> hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
> df.select().groupby(..)
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 





Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-24 Thread Adrian Tanase
RE: # because I already have a bunch of InputSplits, do I still need to specify 
the number of executors to get processing parallelized?

I would say it’s best practice to have as many executors as data nodes, and as
many cores as you can get from the cluster. If YARN has enough resources it
will deploy the executors distributed across the cluster; each of them will
then try to process the data locally (check the Spark UI for NODE_LOCAL), with
as many splits in parallel as you defined in spark.executor.cores.
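
As a sketch only (all values below are illustrative, not a recommendation for
any particular cluster), the relevant settings when dynamic allocation is off
look like:

// Fix the executor count and per-executor parallelism up front; the same keys
// can equally be passed as --num-executors / --executor-cores to spark-submit.
val conf = new org.apache.spark.SparkConf()
  .setAppName("custom-input-format-job")   // placeholder name
  .set("spark.executor.instances", "4")    // e.g. one executor per data node
  .set("spark.executor.cores", "4")        // splits processed in parallel per executor
  .set("spark.executor.memory", "8g")
val sc = new org.apache.spark.SparkContext(conf)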

-adrian

From: Sandy Ryza
Date: Thursday, September 24, 2015 at 2:43 AM
To: Anfernee Xu
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task 
and Yarn containers

Hi Anfernee,

That's correct that each InputSplit will map to exactly a Spark partition.

On YARN, each Spark executor maps to a single YARN container.  Each executor 
can run multiple tasks over its lifetime, both parallel and sequentially.

If you enable dynamic allocation, after the stage including the InputSplits 
gets submitted, Spark will try to request an appropriate number of executors.

The memory in the YARN resource requests is --executor-memory + what's set for 
spark.yarn.executor.memoryOverhead, which defaults to 10% of --executor-memory.

-Sandy

On Wed, Sep 23, 2015 at 2:38 PM, Anfernee Xu <anfernee...@gmail.com> wrote:
Hi Spark experts,

I'm coming across these terminologies and having some confusions, could you 
please help me understand them better?

For instance I have implemented a Hadoop InputFormat to load my external data 
in Spark, in turn my custom InputFormat will create a bunch of InputSplit's, my 
questions is about

# Each InputSplit will exactly map to a Spark partition, is that correct?

# If I run on Yarn, how does Spark executor/task map to Yarn container?

# because I already have a bunch of InputSplits, do I still need to specify the 
number of executors to get processing parallelized?

# How does -executor-memory map to the memory requirement in Yarn's resource 
request?

--
--Anfernee



Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-24 Thread Sabarish Sasidharan
A little caution is needed, as one executor per node may not always be ideal,
especially when your nodes have lots of RAM. But yes, using a smaller number
of executors has benefits like more efficient broadcasts.

Regards
Sab
On 24-Sep-2015 2:57 pm, "Adrian Tanase" <atan...@adobe.com> wrote:

> RE: # because I already have a bunch of InputSplits, do I still need to
> specify the number of executors to get processing parallelized?
>
> I would say it’s best practice to have as many executors as data nodes and
> as many cores as you can get from the cluster – if YARN has enough
>  resources it will deploy the executors distributed across the cluster,
> then each of them will try to process the data locally (check the spark ui
> for NODE_LOCAL), with as many splits in parallel as you defined in
> spark.executor.cores
>
> -adrian
>
> From: Sandy Ryza
> Date: Thursday, September 24, 2015 at 2:43 AM
> To: Anfernee Xu
> Cc: "user@spark.apache.org"
> Subject: Re: Custom Hadoop InputSplit, Spark partitions, spark
> executors/task and Yarn containers
>
> Hi Anfernee,
>
> That's correct that each InputSplit will map to exactly a Spark partition.
>
> On YARN, each Spark executor maps to a single YARN container.  Each
> executor can run multiple tasks over its lifetime, both parallel and
> sequentially.
>
> If you enable dynamic allocation, after the stage including the
> InputSplits gets submitted, Spark will try to request an appropriate number
> of executors.
>
> The memory in the YARN resource requests is --executor-memory + what's set
> for spark.yarn.executor.memoryOverhead, which defaults to 10% of
> --executor-memory.
>
> -Sandy
>
> On Wed, Sep 23, 2015 at 2:38 PM, Anfernee Xu <anfernee...@gmail.com>
> wrote:
>
>> Hi Spark experts,
>>
>> I'm coming across these terminologies and having some confusions, could
>> you please help me understand them better?
>>
>> For instance I have implemented a Hadoop InputFormat to load my external
>> data in Spark, in turn my custom InputFormat will create a bunch of
>> InputSplit's, my questions is about
>>
>> # Each InputSplit will exactly map to a Spark partition, is that correct?
>>
>> # If I run on Yarn, how does Spark executor/task map to Yarn container?
>>
>> # because I already have a bunch of InputSplits, do I still need to
>> specify the number of executors to get processing parallelized?
>>
>> # How does -executor-memory map to the memory requirement in Yarn's
>> resource request?
>>
>> --
>> --Anfernee
>>
>
>


Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-23 Thread Anfernee Xu
Hi Spark experts,

I'm coming across these terminologies and having some confusion; could you
please help me understand them better?

For instance, I have implemented a Hadoop InputFormat to load my external
data into Spark, and in turn my custom InputFormat will create a bunch of
InputSplits. My questions are:

# Each InputSplit will exactly map to a Spark partition, is that correct?

# If I run on Yarn, how does Spark executor/task map to Yarn container?

# because I already have a bunch of InputSplits, do I still need to specify
the number of executors to get processing parallelized?

# How does --executor-memory map to the memory requirement in Yarn's
resource request?

-- 
--Anfernee


Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-23 Thread Sandy Ryza
Hi Anfernee,

That's correct: each InputSplit will map to exactly one Spark partition.

On YARN, each Spark executor maps to a single YARN container. Each executor
can run multiple tasks over its lifetime, both in parallel and sequentially.

If you enable dynamic allocation, after the stage including the InputSplits
gets submitted, Spark will try to request an appropriate number of
executors.

The memory in the YARN resource requests is --executor-memory + what's set
for spark.yarn.executor.memoryOverhead, which defaults to 10% of
--executor-memory.
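
As a rough worked example (numbers are illustrative, and the exact overhead
default differs slightly across Spark versions):

// What YARN is asked for per executor with --executor-memory 8g and the
// default spark.yarn.executor.memoryOverhead (~10% of executor memory):
val executorMemMb = 8 * 1024                     // 8192 MB
val overheadMb    = (executorMemMb * 0.10).toInt // ~819 MB (assumed default)
val containerMb   = executorMemMb + overheadMb   // ~9011 MB, then rounded up to
                                                 // YARN's allocation increment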

-Sandy

On Wed, Sep 23, 2015 at 2:38 PM, Anfernee Xu  wrote:

> Hi Spark experts,
>
> I'm coming across these terminologies and having some confusions, could
> you please help me understand them better?
>
> For instance I have implemented a Hadoop InputFormat to load my external
> data in Spark, in turn my custom InputFormat will create a bunch of
> InputSplit's, my questions is about
>
> # Each InputSplit will exactly map to a Spark partition, is that correct?
>
> # If I run on Yarn, how does Spark executor/task map to Yarn container?
>
> # because I already have a bunch of InputSplits, do I still need to
> specify the number of executors to get processing parallelized?
>
> # How does -executor-memory map to the memory requirement in Yarn's
> resource request?
>
> --
> --Anfernee
>


Re: Spark partitions from CassandraRDD

2015-09-04 Thread Ankur Srivastava
Oh, if that is the case then you can try tuning
"spark.cassandra.input.split.size":

spark.cassandra.input.split.size -- approx number of Cassandra partitions
in a Spark partition (default: 10)
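
A sketch of wiring that in (the keyspace/table names and the value 1000 are
placeholders; this assumes the 1.4-era connector, where the setting counts
Cassandra partitions per Spark partition):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Lower the split size so the same selection spreads over more Spark partitions.
val conf = new SparkConf()
  .setAppName("cassandra-split-test")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.input.split.size", "1000")
val sc = new SparkContext(conf)

val rdd = sc.cassandraTable("my_keyspace", "my_table")
println(s"Spark partitions: ${rdd.partitions.length}")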

Hope this helps.

Thanks
Ankur

On Thu, Sep 3, 2015 at 12:22 PM, Alaa Zubaidi (PDF) 
wrote:

> Thanks Ankur,
>
> But I grabbed some keys from the Spark results and ran "nodetool -h
> getendpoints " and it showed the data is coming from at least 2 nodes?
> Regards,
> Alaa
>
> On Thu, Sep 3, 2015 at 12:06 PM, Ankur Srivastava <
> ankur.srivast...@gmail.com> wrote:
>
>> Hi Alaa,
>>
>> Partition when using CassandraRDD depends on your partition key in
>> Cassandra table.
>>
>> If you see only 1 partition in the RDD it means all the rows you have
>> selected have same partition_key in C*
>>
>> Thanks
>> Ankur
>>
>>
>>> On Thu, Sep 3, 2015 at 11:54 AM, Alaa Zubaidi (PDF) wrote:
>>
>>> Hi,
>>>
>>> I testing Spark and Cassandra, Spark 1.4, Cassandra 2.1.7 cassandra
>>> spark connector 1.4, running in standalone mode.
>>>
>>> I am getting 4000 rows from Cassandra (4mb row), where the row keys are
>>> random.
>>> .. sc.cassandraTable[RES](keyspace,res_name).where(res_where).cache
>>>
>>> I am expecting that it will generate few partitions.
>>> However, I can ONLY see 1 partition.
>>> I cached the CassandraRDD and in the UI storage tab it shows ONLY 1
>>> partition.
>>>
>>> Any idea, why I am getting 1 partition?
>>>
>>> Thanks,
>>> Alaa
>>>
>>>
>>>
>>
>>
>>
>
>
> --
>
> Alaa Zubaidi
> PDF Solutions, Inc.
> 333 West San Carlos Street, Suite 1000
> San Jose, CA 95110  USA
> Tel: 408-283-5639
> fax: 408-938-6479
> email: alaa.zuba...@pdf.com
>
>
>


Re: Spark partitions from CassandraRDD

2015-09-03 Thread Ankur Srivastava
Hi Alaa,

Partitioning when using CassandraRDD depends on your partition key in the
Cassandra table.

If you see only 1 partition in the RDD, it means all the rows you have
selected have the same partition_key in C*.
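
A rough way to sanity-check that from the Spark side (the table and column
names below are placeholders for whatever your schema uses):

import com.datastax.spark.connector._

// How many distinct partition-key values did the selection actually return?
val rows = sc.cassandraTable("my_keyspace", "my_table")
val distinctKeys = rows.map(_.getString("partition_key_col")).distinct().count()
println(s"distinct partition keys: $distinctKeys, Spark partitions: ${rows.partitions.length}")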

Thanks
Ankur


On Thu, Sep 3, 2015 at 11:54 AM, Alaa Zubaidi (PDF) 
wrote:

> Hi,
>
> I testing Spark and Cassandra, Spark 1.4, Cassandra 2.1.7 cassandra spark
> connector 1.4, running in standalone mode.
>
> I am getting 4000 rows from Cassandra (4mb row), where the row keys are
> random.
> .. sc.cassandraTable[RES](keyspace,res_name).where(res_where).cache
>
> I am expecting that it will generate few partitions.
> However, I can ONLY see 1 partition.
> I cached the CassandraRDD and in the UI storage tab it shows ONLY 1
> partition.
>
> Any idea, why I am getting 1 partition?
>
> Thanks,
> Alaa
>
>
>


Spark partitions from CassandraRDD

2015-09-03 Thread Alaa Zubaidi (PDF)
Hi,

I am testing Spark and Cassandra: Spark 1.4, Cassandra 2.1.7, Cassandra Spark
connector 1.4, running in standalone mode.

I am getting 4000 rows from Cassandra (4 MB row), where the row keys are
random.
.. sc.cassandraTable[RES](keyspace,res_name).where(res_where).cache

I am expecting it to generate a few partitions.
However, I can ONLY see 1 partition.
I cached the CassandraRDD, and in the UI Storage tab it shows ONLY 1
partition.

Any idea, why I am getting 1 partition?

Thanks,
Alaa



Re: Spark partitions from CassandraRDD

2015-09-03 Thread Alaa Zubaidi (PDF)
Thanks Ankur,

But I grabbed some keys from the Spark results and ran "nodetool -h
getendpoints " and it showed the data is coming from at least 2 nodes?
Regards,
Alaa

On Thu, Sep 3, 2015 at 12:06 PM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Hi Alaa,
>
> Partition when using CassandraRDD depends on your partition key in
> Cassandra table.
>
> If you see only 1 partition in the RDD it means all the rows you have
> selected have same partition_key in C*
>
> Thanks
> Ankur
>
>
> On Thu, Sep 3, 2015 at 11:54 AM, Alaa Zubaidi (PDF) 
> wrote:
>
>> Hi,
>>
>> I testing Spark and Cassandra, Spark 1.4, Cassandra 2.1.7 cassandra spark
>> connector 1.4, running in standalone mode.
>>
>> I am getting 4000 rows from Cassandra (4mb row), where the row keys are
>> random.
>> .. sc.cassandraTable[RES](keyspace,res_name).where(res_where).cache
>>
>> I am expecting that it will generate few partitions.
>> However, I can ONLY see 1 partition.
>> I cached the CassandraRDD and in the UI storage tab it shows ONLY 1
>> partition.
>>
>> Any idea, why I am getting 1 partition?
>>
>> Thanks,
>> Alaa
>>
>>
>>
>
>
>


-- 

Alaa Zubaidi
PDF Solutions, Inc.
333 West San Carlos Street, Suite 1000
San Jose, CA 95110  USA
Tel: 408-283-5639
fax: 408-938-6479
email: alaa.zuba...@pdf.com



Re: Assigning input files to spark partitions

2014-11-17 Thread Pala M Muthaia
Hi Daniel,

Yes, that should work also. However, is it possible to set this up so that
each RDD has exactly one partition, without repartitioning (and thus incurring
extra cost)? Is there a mechanism similar to MR where we can ensure each
partition is assigned some amount of data by size, by setting some block size
parameter?



On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:

 On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia 
 mchett...@rocketfuelinc.com wrote


 No i don't want separate RDD because each of these partitions are being
 processed the same way (in my case, each partition corresponds to HBase
 keys belonging to one region server, and i will do HBase lookups). After
 that i have aggregations too, hence all these partitions should be in the
 same RDD. The reason to follow the partition structure is to limit
 concurrent HBase lookups targeting a single region server.


 Neither of these is necessarily a barrier to using separate RDDs. You can
 define the function you want to use and then pass it to multiple map
 methods. Then you could union all the RDDs to do your aggregations. For
 example, it might look something like this:

 val paths: Seq[String] = ... // the paths to the files you want to load
 def myFunc(t: T) = ... // the function to apply to every record
 val rdds = paths.map { path =>
   sc.textFile(path).map(myFunc)
 }
 val completeRdd = sc.union(rdds)

 Does that make any sense?

 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io



Re: Assigning input files to spark partitions

2014-11-17 Thread Daniel Siegmann
I'm not aware of any such mechanism.

On Mon, Nov 17, 2014 at 2:55 PM, Pala M Muthaia mchett...@rocketfuelinc.com
 wrote:

 Hi Daniel,

 Yes that should work also. However, is it possible to setup so that each
 RDD has exactly one partition, without repartitioning (and thus incurring
 extra cost)? Is there a mechanism similar to MR where we can ensure each
 partition is assigned some amount of data by size, by setting some block
 size parameter?



 On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann daniel.siegm...@velos.io
  wrote:

 On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia 
 mchett...@rocketfuelinc.com wrote


 No i don't want separate RDD because each of these partitions are being
 processed the same way (in my case, each partition corresponds to HBase
 keys belonging to one region server, and i will do HBase lookups). After
 that i have aggregations too, hence all these partitions should be in the
 same RDD. The reason to follow the partition structure is to limit
 concurrent HBase lookups targeting a single region server.


 Neither of these is necessarily a barrier to using separate RDDs. You can
 define the function you want to use and then pass it to multiple map
 methods. Then you could union all the RDDs to do your aggregations. For
 example, it might look something like this:

 val paths: Seq[String] = ... // the paths to the files you want to load
 def myFunc(t: T) = ... // the function to apply to every record
 val rdds = paths.map { path =>
   sc.textFile(path).map(myFunc)
 }
 val completeRdd = sc.union(rdds)

 Does that make any sense?

 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io





-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
Would it make sense to read each file in as a separate RDD? This way you
would be guaranteed the data is partitioned as you expected.

Possibly you could then repartition each of those RDDs into a single
partition and then union them. I think that would achieve what you expect.
But it would be easy to accidentally screw this up (have some operation
that causes a shuffle), so I think you're better off just leaving them as
separate RDDs.
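
Something like the following sketch (the paths are placeholders; coalesce(1)
avoids the shuffle that repartition(1) would trigger):

// One RDD per input file, each collapsed to a single partition, then unioned.
val paths = Seq("/data/part-a.txt", "/data/part-b.txt")  // illustrative paths
val rdds = paths.map(p => sc.textFile(p).coalesce(1))
val combined = sc.union(rdds)
println(s"partitions: ${combined.partitions.length}")    // one per input file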

On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia 
mchett...@rocketfuelinc.com wrote:

 Hi,

 I have a set of input files for a spark program, with each file
 corresponding to a logical data partition. What is the API/mechanism to
 assign each input file (or a set of files) to a spark partition, when
 initializing RDDs?

 When i create a spark RDD pointing to the directory of files, my
 understanding is it's not guaranteed that each input file will be treated
 as separate partition.

 My job semantics require that the data is partitioned, and i want to
 leverage the partitioning that has already been done, rather than
 repartitioning again in the spark job.

 I tried to lookup online but haven't found any pointers so far.


 Thanks
 pala




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


Re: Assigning input files to spark partitions

2014-11-13 Thread Rishi Yadav
If your data is in HDFS and you are reading it as textFile, and each file is
smaller than the block size, my understanding is you would always get one
partition per file.

On Thursday, November 13, 2014, Daniel Siegmann daniel.siegm...@velos.io
wrote:

 Would it make sense to read each file in as a separate RDD? This way you
 would be guaranteed the data is partitioned as you expected.

 Possibly you could then repartition each of those RDDs into a single
 partition and then union them. I think that would achieve what you expect.
 But it would be easy to accidentally screw this up (have some operation
 that causes a shuffle), so I think you're better off just leaving them as
 separate RDDs.

 On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia 
 mchett...@rocketfuelinc.com wrote:

 Hi,

 I have a set of input files for a spark program, with each file
 corresponding to a logical data partition. What is the API/mechanism to
 assign each input file (or a set of files) to a spark partition, when
 initializing RDDs?

 When i create a spark RDD pointing to the directory of files, my
 understanding is it's not guaranteed that each input file will be treated
 as separate partition.

 My job semantics require that the data is partitioned, and i want to
 leverage the partitioning that has already been done, rather than
 repartitioning again in the spark job.

 I tried to lookup online but haven't found any pointers so far.


 Thanks
 pala




 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io



-- 
- Rishi


Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
I believe Rishi is correct. I wouldn't rely on that though - all it would
take is for one file to exceed the block size and you'd be setting yourself
up for pain. Also, if your files are small - small enough to fit in a
single record - you could use SparkContext.wholeTextFiles.
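
A minimal sketch of that approach (the path and the per-file processing are
placeholders):

// One (path, fileContent) record per file, so per-file grouping is preserved
// regardless of HDFS block size. Only suitable when each file fits in memory.
val perFile = sc.wholeTextFiles("/path/to/input/dir")
val lineCounts = perFile.map { case (path, content) =>
  (path, content.split("\n").length)   // illustrative processing
}
lineCounts.collect().foreach(println)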

On Thu, Nov 13, 2014 at 10:11 AM, Rishi Yadav ri...@infoobjects.com wrote:

 If your data is in hdfs and you are reading as textFile and each file is
 less than block size, my understanding is it would always have one
 partition per file.


 On Thursday, November 13, 2014, Daniel Siegmann daniel.siegm...@velos.io
 wrote:

 Would it make sense to read each file in as a separate RDD? This way you
 would be guaranteed the data is partitioned as you expected.

 Possibly you could then repartition each of those RDDs into a single
 partition and then union them. I think that would achieve what you expect.
 But it would be easy to accidentally screw this up (have some operation
 that causes a shuffle), so I think you're better off just leaving them as
 separate RDDs.

 On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia 
 mchett...@rocketfuelinc.com wrote:

 Hi,

 I have a set of input files for a spark program, with each file
 corresponding to a logical data partition. What is the API/mechanism to
 assign each input file (or a set of files) to a spark partition, when
 initializing RDDs?

 When i create a spark RDD pointing to the directory of files, my
 understanding is it's not guaranteed that each input file will be treated
 as separate partition.

 My job semantics require that the data is partitioned, and i want to
 leverage the partitioning that has already been done, rather than
 repartitioning again in the spark job.

 I tried to lookup online but haven't found any pointers so far.


 Thanks
 pala




 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io



 --
 - Rishi




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


Re: Assigning input files to spark partitions

2014-11-13 Thread Pala M Muthaia
Thanks for the responses Daniel and Rishi.

No, I don't want separate RDDs, because each of these partitions is being
processed the same way (in my case, each partition corresponds to HBase keys
belonging to one region server, and I will do HBase lookups). After that I
have aggregations too, hence all these partitions should be in the same RDD.
The reason to follow the partition structure is to limit concurrent HBase
lookups targeting a single region server.

Not sure what the block size is here (HDFS block size?), but my files may get
large over time, so I cannot depend on the block size assumption. That said,
from your description it seems like I don't have to worry too much, because
Spark assigns files to partitions while maintaining 'locality' (i.e. a given
file's data would fit in ceil(filesize/blocksize) partitions, as opposed to
being spread across numerous partitions).
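
As a quick worked example of that locality point (numbers are illustrative):

// A 1 GB file on HDFS with a 128 MB block size lands in
// ceil(1024 / 128) = 8 partitions, rather than being spread across the job.
val partitionsForFile = math.ceil(1024.0 / 128.0).toInt   // = 8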

Yes, I saw wholeTextFiles(); it won't apply in my case because the input
file sizes can be quite large.

On Thu, Nov 13, 2014 at 8:04 AM, Daniel Siegmann daniel.siegm...@velos.io
wrote:

 I believe Rishi is correct. I wouldn't rely on that though - all it would
 take is for one file to exceed the block size and you'd be setting yourself
 up for pain. Also, if your files are small - small enough to fit in a
 single record - you could use SparkContext.wholeTextFile.

 On Thu, Nov 13, 2014 at 10:11 AM, Rishi Yadav ri...@infoobjects.com
 wrote:

 If your data is in hdfs and you are reading as textFile and each file is
 less than block size, my understanding is it would always have one
 partition per file.


 On Thursday, November 13, 2014, Daniel Siegmann daniel.siegm...@velos.io
 wrote:

 Would it make sense to read each file in as a separate RDD? This way you
 would be guaranteed the data is partitioned as you expected.

 Possibly you could then repartition each of those RDDs into a single
 partition and then union them. I think that would achieve what you expect.
 But it would be easy to accidentally screw this up (have some operation
 that causes a shuffle), so I think you're better off just leaving them as
 separate RDDs.

 On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia 
 mchett...@rocketfuelinc.com wrote:

 Hi,

 I have a set of input files for a spark program, with each file
 corresponding to a logical data partition. What is the API/mechanism to
 assign each input file (or a set of files) to a spark partition, when
 initializing RDDs?

 When i create a spark RDD pointing to the directory of files, my
 understanding is it's not guaranteed that each input file will be treated
 as separate partition.

 My job semantics require that the data is partitioned, and i want to
 leverage the partitioning that has already been done, rather than
 repartitioning again in the spark job.

 I tried to lookup online but haven't found any pointers so far.


 Thanks
 pala




 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io



 --
 - Rishi




 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io



Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia mchett...@rocketfuelinc.com
 wrote


 No i don't want separate RDD because each of these partitions are being
 processed the same way (in my case, each partition corresponds to HBase
 keys belonging to one region server, and i will do HBase lookups). After
 that i have aggregations too, hence all these partitions should be in the
 same RDD. The reason to follow the partition structure is to limit
 concurrent HBase lookups targeting a single region server.


Neither of these is necessarily a barrier to using separate RDDs. You can
define the function you want to use and then pass it to multiple map
methods. Then you could union all the RDDs to do your aggregations. For
example, it might look something like this:

val paths: Seq[String] = ... // the paths to the files you want to load
def myFunc(t: T) = ... // the function to apply to every record
val rdds = paths.map { path =>
  sc.textFile(path).map(myFunc)
}
val completeRdd = sc.union(rdds)

Does that make any sense?

-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


Assigning input files to spark partitions

2014-11-12 Thread Pala M Muthaia
Hi,

I have a set of input files for a Spark program, with each file corresponding
to a logical data partition. What is the API/mechanism to assign each input
file (or a set of files) to a Spark partition when initializing RDDs?

When I create a Spark RDD pointing to the directory of files, my understanding
is that it's not guaranteed that each input file will be treated as a separate
partition.

My job semantics require that the data is partitioned, and I want to leverage
the partitioning that has already been done, rather than repartitioning again
in the Spark job.

I tried to look this up online but haven't found any pointers so far.


Thanks
pala