Spark performance on small dataset

2022-11-20 Thread Prarthi Jain
Hi Everyone,

Spark and the RDD approach it favors assume that most applications run on
big data and need massive parallelism via sharding and concurrent
computation. But some tasks run on small data and do not need or benefit from
RDD parallelism. How are these tasks expected to perform on Spark?
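For context, here is the kind of small job I have in mind (a minimal sketch only; the paths, column name and values are made up): run in local mode and keep partition counts low, so that task-scheduling overhead does not dominate the actual computation.

import org.apache.spark.sql.SparkSession

// Hypothetical small-data job: single JVM, few partitions.
val spark = SparkSession.builder()
  .appName("small-data-job")
  .master("local[*]")                             // no cluster needed for small inputs
  .config("spark.sql.shuffle.partitions", "8")    // the default of 200 is far too many here
  .getOrCreate()

val df = spark.read.parquet("/path/to/small/input")    // illustrative path
val result = df.groupBy("key").count()                 // assumes a column named "key"
result.coalesce(1).write.mode("overwrite").parquet("/path/to/output")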

Looking forward to more insights on this!

Thanks,
Prarthi


Re: A scene with unstable Spark performance

2022-05-18 Thread Chang Chen
This is a case where resources are fixed within the same SparkContext, but the
SQL queries have different priorities.

Some queries are only allowed to execute when there are spare resources; once a
high-priority query comes in, the task sets of those low-priority queries are
either killed or stalled.

If we set the high-priority pool's minShare to a relatively high value,
e.g. 50% or 60% of the total cores, does that make sense?
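For concreteness, a sketch of what that could look like with the built-in fair scheduler (pool names and values below are illustrative, not a tested recommendation):

// conf/fairscheduler.xml (illustrative contents):
//   <allocations>
//     <pool name="high">
//       <schedulingMode>FAIR</schedulingMode>
//       <weight>4</weight>
//       <minShare>48</minShare>   <!-- e.g. ~50% of a hypothetical 96-core cluster -->
//     </pool>
//     <pool name="low">
//       <schedulingMode>FAIR</schedulingMode>
//       <weight>1</weight>
//       <minShare>0</minShare>
//     </pool>
//   </allocations>

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "conf/fairscheduler.xml")
  .getOrCreate()

// Each thread picks its pool via a thread-local property before submitting work:
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "high")
spark.sql("SELECT count(*) FROM some_table").collect()   // runs in the "high" pool; table name is hypothetical

Note that minShare is expressed in CPU cores, so it has to be sized against the total cores of the application, and pools above their minShare still compete by weight.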


Sungwoo Park wrote on Wednesday, May 18, 2022 at 13:28:

> The problem you describe is the motivation for developing Spark on MR3.
> From the blog article (
> https://www.datamonad.com/post/2021-08-18-spark-mr3/):
>
> *The main motivation for developing Spark on MR3 is to allow multiple
> Spark applications to share compute resources such as Yarn containers or
> Kubernetes Pods.*
>
> The problem is due to an architectural limitation of Spark, and I guess
> fixing the problem would require a heavy rewrite of Spark core. When we
> developed Spark on MR3, we were not aware of any attempt being made
> elsewhere (in academia and industry) to address this limitation.
>
> A potential workaround might be to implement a custom Spark application
> that manages the submission of two groups of Spark jobs and controls their
> execution (similarly to Spark Thrift Server). Not sure if this approach
> would fix your problem, though.
>
> If you are interested, see the webpage of Spark on MR3:
> https://mr3docs.datamonad.com/docs/spark/
>
> We have released Spark 3.0.1 on MR3, and Spark 3.2.1 on MR3 is under
> development. For Spark 3.0.1 on MR3, no change is made to Spark and MR3 is
> used as an add-on. The main application of MR3 is Hive on MR3, but Spark on
> MR3 is equally ready for production.
>
> Thank you,
>
> --- Sungwoo
>
>>


Re: A scene with unstable Spark performance

2022-05-17 Thread Sungwoo Park
The problem you describe is the motivation for developing Spark on MR3.
From the blog article (https://www.datamonad.com/post/2021-08-18-spark-mr3/):

*The main motivation for developing Spark on MR3 is to allow multiple Spark
applications to share compute resources such as Yarn containers or
Kubernetes Pods.*

The problem is due to an architectural limitation of Spark, and I guess
fixing the problem would require a heavy rewrite of Spark core. When we
developed Spark on MR3, we were not aware of any attempt being made
elsewhere (in academia and industry) to address this limitation.

A potential workaround might be to implement a custom Spark application
that manages the submission of two groups of Spark jobs and controls their
execution (similarly to Spark Thrift Server). Not sure if this approach
would fix your problem, though.
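A very rough illustration of that idea — a single long-running SparkContext serving both groups, with each group submitted from its own threads and its own fair-scheduler pool (just a sketch under those assumptions; the table names are hypothetical):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// One thread pool per job group; jobs submitted from different threads
// run concurrently inside the same SparkContext.
val fastEc = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))
val slowEc = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

def runInPool(pool: String, query: String)(implicit ec: ExecutionContext): Future[Long] =
  Future {
    // The pool is selected per thread, so each group gets its own scheduling pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
    spark.sql(query).count()
  }

runInPool("fast", "SELECT * FROM small_table")(fastEc)   // interactive-style queries
runInPool("slow", "SELECT * FROM big_table")(slowEc)     // heavy scans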

If you are interested, see the webpage of Spark on MR3:
https://mr3docs.datamonad.com/docs/spark/

We have released Spark 3.0.1 on MR3, and Spark 3.2.1 on MR3 is under
development. For Spark 3.0.1 on MR3, no change is made to Spark and MR3 is
used as an add-on. The main application of MR3 is Hive on MR3, but Spark on
MR3 is equally ready for production.

Thank you,

--- Sungwoo

>


Re: A scene with unstable Spark performance

2022-05-17 Thread Bowen Song
Hi,

Spark dynamic resource allocation cannot solve my problem, because the
resources of the production environment are limited. Under this premise, I want
to reserve resources so that the job tasks of the different groups can be
scheduled in time.

Thank you,
Bowen Song


From: Qian SUN 
Sent: Wednesday, May 18, 2022 9:32
To: Bowen Song 
Cc: user.spark 
Subject: Re: A scene with unstable Spark performance

Hi. I think you need Spark dynamic resource allocation. Please refer to 
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation.
And if you use Spark SQL, AQE may also help.
https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution

Bowen Song <bowen.s...@kyligence.io> wrote on Tuesday, May 17, 2022 at 22:33:

Hi all,



I find that Spark performance is unstable in this scenario: we divided the jobs
into two groups according to job completion time. One group of jobs had an
execution time of less than 10s, and the other group had an execution time of
10s to 300s. The reason for the difference is that the latter group scans more
files, so its jobs have more tasks. When the two groups of jobs were submitted
to Spark together, I found that, due to resource competition, the slower jobs
made the originally faster jobs take longer to return results, which shows up
as unstable Spark performance. The problem I want to solve is: can we reserve
certain resources for each of the two groups, so that the fast jobs can be
scheduled in time, while the slow jobs are not starved because the resources
are completely allocated to the fast jobs?

In this context, I need to group Spark jobs so that the tasks from different
groups can be scheduled using group-reserved resources. At the beginning of
each round of scheduling, tasks in a group are scheduled first; only when there
are no tasks left in that group can its resources be allocated to other groups,
to avoid idle resources.

Considering resource utilization and the overhead of managing multiple
clusters, I would like the jobs to share one Spark cluster rather than creating
private clusters for each group.

I've read the code of the Spark Fair Scheduler, and its implementation doesn't
seem to meet the need to reserve resources for different groups of jobs.

Is there a workaround that can solve this problem through the Spark Fair
Scheduler? If not, would you consider adding a mechanism like capacity
scheduling?



Thank you,

Bowen Song


--
Best!
Qian SUN


Re: A scene with unstable Spark performance

2022-05-17 Thread Qian SUN
Hi. I think you need Spark dynamic resource allocation. Please refer to
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
.
And if you use Spark SQL, AQE may also help.
https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution
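For reference, the settings involved would be roughly the following (a sketch only; the values are placeholders, and on YARN the external shuffle service also has to be enabled):

import org.apache.spark.sql.SparkSession

// Illustrative dynamic-allocation and AQE settings:
val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .config("spark.shuffle.service.enabled", "true")   // required for dynamic allocation on YARN
  .config("spark.sql.adaptive.enabled", "true")      // AQE, as mentioned above
  .getOrCreate()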

Bowen Song wrote on Tuesday, May 17, 2022 at 22:33:

> Hi all,
>
>
>
> I find Spark performance is unstable in this scene: we divided the jobs
> into two groups according to the job completion time. One group of jobs had
> an execution time of less than 10s, and the other group of jobs had an
> execution time from 10s to 300s. The reason for the difference is that the
> latter will scan more files, that is, the number of tasks will be larger.
> When the two groups of jobs were submitted to Spark for execution, I found
> that due to resource competition, the existence of the slower jobs made the
> original faster job take longer to return the result, which manifested as
> unstable Spark performance. The problem I want to solve is: Can we reserve
> certain resources for each of the two groups, so that the fast jobs can be
> scheduled in time, and the slow jobs will not be starved to death because
> the resources are completely allocated to the fast jobs.
>
>
>
> In this context, I need to group spark jobs, and the tasks from different
> groups of jobs can be scheduled using group reserved resources. At the
> beginning of each round of scheduling, tasks in this group will be
> scheduled first, only when there are no tasks in this group to schedule,
> its resources can be allocated to other groups to avoid idling of resources.
>
>
>
> For the consideration of resource utilization and the overhead of managing
> multiple clusters, I hope that the jobs can share the spark cluster, rather
> than creating private clusters for the groups.
>
>
>
> I've read the code for the Spark Fair Scheduler, and the implementation
> doesn't seem to meet the need to reserve resources for different groups of
> job.
>
>
>
> Is there a workaround that can solve this problem through Spark Fair
> Scheduler? If it can't be solved, would you consider adding a mechanism
> like capacity scheduling.
>
>
>
> Thank you,
>
> Bowen Song
>


-- 
Best!
Qian SUN


A scene with unstable Spark performance

2022-05-17 Thread Bowen Song
Hi all,

I find that Spark performance is unstable in this scenario: we divided the jobs
into two groups according to job completion time. One group of jobs had an
execution time of less than 10s, and the other group had an execution time of
10s to 300s. The reason for the difference is that the latter group scans more
files, so its jobs have more tasks. When the two groups of jobs were submitted
to Spark together, I found that, due to resource competition, the slower jobs
made the originally faster jobs take longer to return results, which shows up
as unstable Spark performance. The problem I want to solve is: can we reserve
certain resources for each of the two groups, so that the fast jobs can be
scheduled in time, while the slow jobs are not starved because the resources
are completely allocated to the fast jobs?

In this context, I need to group Spark jobs so that the tasks from different
groups can be scheduled using group-reserved resources. At the beginning of
each round of scheduling, tasks in a group are scheduled first; only when there
are no tasks left in that group can its resources be allocated to other groups,
to avoid idle resources.

Considering resource utilization and the overhead of managing multiple
clusters, I would like the jobs to share one Spark cluster rather than creating
private clusters for each group.

I've read the code of the Spark Fair Scheduler, and its implementation doesn't
seem to meet the need to reserve resources for different groups of jobs.

Is there a workaround that can solve this problem through the Spark Fair
Scheduler? If not, would you consider adding a mechanism like capacity
scheduling?

Thank you,
Bowen Song


RE: Spark performance over S3

2021-04-07 Thread Boris Litvak
Oh, Tzahi, I misread the metrics in the first reply. It’s about reads indeed, 
not writes.

From: Tzahi File 
Sent: Wednesday, 7 April 2021 16:02
To: Hariharan 
Cc: user 
Subject: Re: Spark performance over S3

Hi Hariharan,

Thanks for your reply.

In both cases we are writing the data to S3. The difference is that in the 
first case we read the data from S3 and in the second we read from HDFS.
We are using ListObjectsV2 API in 
S3A<https://issues.apache.org/jira/browse/HADOOP-13421>.

The S3 bucket and the cluster are located at the same AWS region.



On Wed, Apr 7, 2021 at 2:12 PM Hariharan <hariharan...@gmail.com> wrote:
Hi Tzahi,

Comparing the first two cases:

  *   > reads the parquet files from S3 and also writes to S3, it takes 22 min
  *   > reads the parquet files from S3 and writes to its local hdfs, it takes 
the same amount of time (±22 min)

It looks like most of the time is being spent in reading, and the time spent in 
writing is likely negligible (probably you're not writing much output?)

Can you clarify what is the difference between these two?

> reads the parquet files from S3 and writes to its local hdfs, it takes the 
> same amount of time (±22 min)?
> reads the parquet files from S3 (they were copied into the hdfs before) and 
> writes to its local hdfs, the job took 7 min

In the second case, was the data read from hdfs or s3?

Regarding the point from the post you linked to:
1, Enhanced networking does make a 
difference<https://laptrinhx.com/hadoop-with-enhanced-networking-on-aws-1893465489/>,
 but it should be automatically enabled if you're using a compatible instance 
type and an AWS AMI. However if you're using a custom AMI, you might want to 
check if it's enabled for you.
2. VPC endpoints also can make a difference in performance - at least that used 
to be the case a few years ago. Maybe that has changed now.

Couple of other things you might want to check:
1. If your bucket is versioned, you may want to check if you're using the 
ListObjectsV2 API in S3A<https://issues.apache.org/jira/browse/HADOOP-13421>.
2. Also check these recommendations from 
Cloudera<https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/s3-performance.html>
 for optimal use of S3A.

Thanks,
Hariharan


On Wed, Apr 7, 2021 at 12:15 AM Tzahi File <tzahi.f...@ironsrc.com> wrote:

Hi All,

We have a spark cluster on aws ec2 that has 60 X i3.4xlarge.

The spark job running on that cluster reads from an S3 bucket and writes to 
that bucket.

the bucket and the ec2 run in the same region.

As part of our efforts to reduce the runtime of our spark jobs we found there's 
serious latency when reading from S3.

When the job:
· reads the parquet files from S3 and also writes to S3, it takes 22 min
· reads the parquet files from S3 and writes to its local hdfs, it 
takes the same amount of time (±22 min)
· reads the parquet files from S3 (they were copied into the hdfs 
before) and writes to its local hdfs, the job took 7 min

the spark job has the following S3-related configuration:
· spark.hadoop.fs.s3a.connection.establish.timeout=5000
· spark.hadoop.fs.s3a.connection.maximum=200

when reading from S3 we tried to increase the 
spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 or 900 but 
it didn't reduce the S3 latency.

Do you have any idea for the cause of the read latency from S3?

I saw this 
post<https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/>
 to improve the transfer speed, is something here relevant?


Thanks,
Tzahi


--
Tzahi File
Data Engineers Team Lead, ironSource



Re: Spark performance over S3

2021-04-07 Thread Tzahi File
Hi Hariharan,

Thanks for your reply.

In both cases we are writing the data to S3. The difference is that in the
first case we read the data from S3 and in the second we read from HDFS.
We are using the ListObjectsV2 API in S3A.

The S3 bucket and the cluster are located at the same AWS region.



On Wed, Apr 7, 2021 at 2:12 PM Hariharan  wrote:

> Hi Tzahi,
>
> Comparing the first two cases:
>
>- > reads the parquet files from S3 and also writes to S3, it takes 22
>min
>- > reads the parquet files from S3 and writes to its local hdfs, it
>takes the same amount of time (±22 min)
>
>
> It looks like most of the time is being spent in reading, and the time
> spent in writing is likely negligible (probably you're not writing much
> output?)
>
> Can you clarify what is the difference between these two?
>
> > reads the parquet files from S3 and writes to its local hdfs, it takes
> the same amount of time (±22 min)?
> > reads the parquet files from S3 (they were copied into the hdfs before)
> and writes to its local hdfs, the job took 7 min
>
> In the second case, was the data read from hdfs or s3?
>
> Regarding the point from the post you linked to:
> 1, Enhanced networking does make a difference
> ,
> but it should be automatically enabled if you're using a compatible
> instance type and an AWS AMI. However if you're using a custom AMI, you
> might want to check if it's enabled for you.
> 2. VPC endpoints also can make a difference in performance - at least that
> used to be the case a few years ago. Maybe that has changed now.
>
> Couple of other things you might want to check:
> 1. If your bucket is versioned, you may want to check if you're using the 
> ListObjectsV2
> API in S3A .
> 2. Also check these recommendations from Cloudera
> 
> for optimal use of S3A.
>
> Thanks,
> Hariharan
>
>
>
> On Wed, Apr 7, 2021 at 12:15 AM Tzahi File  wrote:
>
>> Hi All,
>>
>> We have a spark cluster on aws ec2 that has 60 X i3.4xlarge.
>>
>> The spark job running on that cluster reads from an S3 bucket and writes
>> to that bucket.
>>
>> the bucket and the ec2 run in the same region.
>>
>> As part of our efforts to reduce the runtime of our spark jobs we found
>> there's serious latency when reading from S3.
>>
>> When the job:
>>
>>- reads the parquet files from S3 and also writes to S3, it takes 22
>>min
>>- reads the parquet files from S3 and writes to its local hdfs, it
>>takes the same amount of time (±22 min)
>>- reads the parquet files from S3 (they were copied into the hdfs
>>before) and writes to its local hdfs, the job took 7 min
>>
>> the spark job has the following S3-related configuration:
>>
>>- spark.hadoop.fs.s3a.connection.establish.timeout=5000
>>- spark.hadoop.fs.s3a.connection.maximum=200
>>
>> when reading from S3 we tried to increase the
>> spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 or 900
>> but it didn't reduce the S3 latency.
>>
>> Do you have any idea for the cause of the read latency from S3?
>>
>> I saw this post
>> 
>>  to
>> improve the transfer speed, is something here relevant?
>>
>>
>> Thanks,
>> Tzahi
>>
>

--
Tzahi File
Data Engineers Team Lead, ironSource


Re: Spark performance over S3

2021-04-07 Thread Vladimir Prus
VPC endpoint can also make a major difference in costs. Without it, access
to S3 incurs data transfer costs and NAT costs, and these can be large.

On Wed, 7 Apr 2021 at 14:13, Hariharan  wrote:

> Hi Tzahi,
>
> Comparing the first two cases:
>
>- > reads the parquet files from S3 and also writes to S3, it takes 22
>min
>- > reads the parquet files from S3 and writes to its local hdfs, it
>takes the same amount of time (±22 min)
>
>
> It looks like most of the time is being spent in reading, and the time
> spent in writing is likely negligible (probably you're not writing much
> output?)
>
> Can you clarify what is the difference between these two?
>
> > reads the parquet files from S3 and writes to its local hdfs, it takes
> the same amount of time (±22 min)?
> > reads the parquet files from S3 (they were copied into the hdfs before)
> and writes to its local hdfs, the job took 7 min
>
> In the second case, was the data read from hdfs or s3?
>
> Regarding the point from the post you linked to:
> 1, Enhanced networking does make a difference
> ,
> but it should be automatically enabled if you're using a compatible
> instance type and an AWS AMI. However if you're using a custom AMI, you
> might want to check if it's enabled for you.
> 2. VPC endpoints also can make a difference in performance - at least that
> used to be the case a few years ago. Maybe that has changed now.
>
> Couple of other things you might want to check:
> 1. If your bucket is versioned, you may want to check if you're using the 
> ListObjectsV2
> API in S3A .
> 2. Also check these recommendations from Cloudera
> 
> for optimal use of S3A.
>
> Thanks,
> Hariharan
>
>
>
> On Wed, Apr 7, 2021 at 12:15 AM Tzahi File  wrote:
>
>> Hi All,
>>
>> We have a spark cluster on aws ec2 that has 60 X i3.4xlarge.
>>
>> The spark job running on that cluster reads from an S3 bucket and writes
>> to that bucket.
>>
>> the bucket and the ec2 run in the same region.
>>
>> As part of our efforts to reduce the runtime of our spark jobs we found
>> there's serious latency when reading from S3.
>>
>> When the job:
>>
>>- reads the parquet files from S3 and also writes to S3, it takes 22
>>min
>>- reads the parquet files from S3 and writes to its local hdfs, it
>>takes the same amount of time (±22 min)
>>- reads the parquet files from S3 (they were copied into the hdfs
>>before) and writes to its local hdfs, the job took 7 min
>>
>> the spark job has the following S3-related configuration:
>>
>>- spark.hadoop.fs.s3a.connection.establish.timeout=5000
>>- spark.hadoop.fs.s3a.connection.maximum=200
>>
>> when reading from S3 we tried to increase the
>> spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 or 900
>> but it didn't reduce the S3 latency.
>>
>> Do you have any idea for the cause of the read latency from S3?
>>
>> I saw this post
>> 
>>  to
>> improve the transfer speed, is something here relevant?
>>
>>
>> Thanks,
>> Tzahi
>>
> --
Vladimir Prus
http://vladimirprus.com


Re: Spark performance over S3

2021-04-07 Thread Hariharan
Hi Tzahi,

Comparing the first two cases:
- > reads the parquet files from S3 and also writes to S3, it takes 22 min
- > reads the parquet files from S3 and writes to its local hdfs, it takes
the same amount of time (±22 min)

It looks like most of the time is being spent in reading, and the time
spent in writing is likely negligible (probably you're not writing much
output?)

Can you clarify what is the difference between these two?

> reads the parquet files from S3 and writes to its local hdfs, it takes
the same amount of time (±22 min)?
> reads the parquet files from S3 (they were copied into the hdfs before)
and writes to its local hdfs, the job took 7 min

In the second case, was the data read from hdfs or s3?

Regarding the points from the post you linked to:
1. Enhanced networking does make a difference,
but it should be automatically enabled if you're using a compatible
instance type and an AWS AMI. However if you're using a custom AMI, you
might want to check if it's enabled for you.
2. VPC endpoints also can make a difference in performance - at least that
used to be the case a few years ago. Maybe that has changed now.

Couple of other things you might want to check:
1. If your bucket is versioned, you may want to check if you're using the
ListObjectsV2 API in S3A.
2. Also check these recommendations from Cloudera for optimal use of S3A.
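If it helps, these can usually be set straight from the Spark side; a sketch (the property names are S3A settings from Hadoop 2.8+, the values are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.list.version", "2")                     // use the ListObjectsV2 API
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")  // random-access policy suits Parquet reads
  .config("spark.hadoop.fs.s3a.readahead.range", "1M")                 // illustrative read-ahead size
  .getOrCreate()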

Thanks,
Hariharan



On Wed, Apr 7, 2021 at 12:15 AM Tzahi File  wrote:

> Hi All,
>
> We have a spark cluster on aws ec2 that has 60 X i3.4xlarge.
>
> The spark job running on that cluster reads from an S3 bucket and writes
> to that bucket.
>
> the bucket and the ec2 run in the same region.
>
> As part of our efforts to reduce the runtime of our spark jobs we found
> there's serious latency when reading from S3.
>
> When the job:
>
>- reads the parquet files from S3 and also writes to S3, it takes 22
>min
>- reads the parquet files from S3 and writes to its local hdfs, it
>takes the same amount of time (±22 min)
>- reads the parquet files from S3 (they were copied into the hdfs
>before) and writes to its local hdfs, the job took 7 min
>
> the spark job has the following S3-related configuration:
>
>- spark.hadoop.fs.s3a.connection.establish.timeout=5000
>- spark.hadoop.fs.s3a.connection.maximum=200
>
> when reading from S3 we tried to increase the
> spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 or 900
> but it didn't reduce the S3 latency.
>
> Do you have any idea for the cause of the read latency from S3?
>
> I saw this post
> 
>  to
> improve the transfer speed, is something here relevant?
>
>
> Thanks,
> Tzahi
>


RE: Spark performance over S3

2021-04-07 Thread Boris Litvak
Hi Tzahi,

I don't know the reasons for that, though I'd check that the fs.s3a implementation
is using multipart uploads, which I assume it does.
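For reference, the S3A properties that govern multipart/upload behaviour can be overridden from the Spark side like this (a sketch only; property names per the Hadoop S3A documentation, values illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.multipart.size", "64M")        // part size for multipart uploads
  .config("spark.hadoop.fs.s3a.multipart.threshold", "128M")  // switch to multipart above this object size
  .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")   // buffer upload parts on disk
  .getOrCreate()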

I would say that none of the comments in the link are relevant to you, as the
VPC endpoint is more of a security feature than a performance one.

I got an answer from AWS support recently saying that they tested this against
S3 access via the public internet, and the differences were negligible.
There is always a chance it was not tested in your region, but that's unlikely.
Anyway, you can provision and test this with the AWS CLI.

You could also compare this with EMRFS performance; I know that requires you to
put in some work.

Boris

From: Gourav Sengupta 
Sent: Tuesday, 6 April 2021 22:24
To: Tzahi File 
Cc: user 
Subject: Re: Spark performance over S3

Hi Tzahi,

that is a huge cost. So that I can understand the question before answering it:
1. what is the SPARK version that you are using?
2. what is the SQL code that you are using to read and write?

There are several other questions that are pertinent, but the above will be a 
great starting point.

Regards,
Gourav Sengupta

On Tue, Apr 6, 2021 at 7:46 PM Tzahi File <tzahi.f...@ironsrc.com> wrote:

Hi All,

We have a spark cluster on aws ec2 that has 60 X i3.4xlarge.

The spark job running on that cluster reads from an S3 bucket and writes to 
that bucket.

the bucket and the ec2 run in the same region.

As part of our efforts to reduce the runtime of our spark jobs we found there's 
serious latency when reading from S3.

When the job:
· reads the parquet files from S3 and also writes to S3, it takes 22 min
· reads the parquet files from S3 and writes to its local hdfs, it 
takes the same amount of time (±22 min)
· reads the parquet files from S3 (they were copied into the hdfs 
before) and writes to its local hdfs, the job took 7 min

the spark job has the following S3-related configuration:
· spark.hadoop.fs.s3a.connection.establish.timeout=5000
· spark.hadoop.fs.s3a.connection.maximum=200

when reading from S3 we tried to increase the 
spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 or 900 but 
it didn't reduce the S3 latency.

Do you have any idea for the cause of the read latency from S3?

I saw this 
post<https://aws.amazon.com/premiumsupport/knowledge-center/s3-transfer-data-bucket-instance/>
 to improve the transfer speed, is something here relevant?


Thanks,
Tzahi


Re: Spark performance over S3

2021-04-06 Thread Gourav Sengupta
Hi Tzahi,

that is a huge cost. So that I can understand the question before answering
it:
1. what is the SPARK version that you are using?
2. what is the SQL code that you are using to read and write?

There are several other questions that are pertinent, but the above will be
a great starting point.

Regards,
Gourav Sengupta

On Tue, Apr 6, 2021 at 7:46 PM Tzahi File  wrote:

> Hi All,
>
> We have a spark cluster on aws ec2 that has 60 X i3.4xlarge.
>
> The spark job running on that cluster reads from an S3 bucket and writes
> to that bucket.
>
> the bucket and the ec2 run in the same region.
>
> As part of our efforts to reduce the runtime of our spark jobs we found
> there's serious latency when reading from S3.
>
> When the job:
>
>- reads the parquet files from S3 and also writes to S3, it takes 22
>min
>- reads the parquet files from S3 and writes to its local hdfs, it
>takes the same amount of time (±22 min)
>- reads the parquet files from S3 (they were copied into the hdfs
>before) and writes to its local hdfs, the job took 7 min
>
> the spark job has the following S3-related configuration:
>
>- spark.hadoop.fs.s3a.connection.establish.timeout=5000
>- spark.hadoop.fs.s3a.connection.maximum=200
>
> when reading from S3 we tried to increase the
> spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 or 900
> but it didn't reduce the S3 latency.
>
> Do you have any idea for the cause of the read latency from S3?
>
> I saw this post
> 
>  to
> improve the transfer speed, is something here relevant?
>
>
> Thanks,
> Tzahi
>


Spark performance over S3

2021-04-06 Thread Tzahi File
Hi All,

We have a Spark cluster on AWS EC2 with 60 x i3.4xlarge instances.

The Spark job running on that cluster reads from an S3 bucket and writes to
that bucket.

The bucket and the EC2 instances are in the same region.

As part of our efforts to reduce the runtime of our Spark jobs, we found
there is serious latency when reading from S3.

When the job:

   - reads the parquet files from S3 and also writes to S3, it takes 22 min
   - reads the parquet files from S3 and writes to its local hdfs, it takes
   the same amount of time (±22 min)
   - reads the parquet files from S3 (they were copied into the hdfs
   before) and writes to its local hdfs, the job took 7 min

The Spark job has the following S3-related configuration:

   - spark.hadoop.fs.s3a.connection.establish.timeout=5000
   - spark.hadoop.fs.s3a.connection.maximum=200

When reading from S3 we tried to increase the
spark.hadoop.fs.s3a.connection.maximum config param from 200 to 400 or 900,
but it didn't reduce the S3 latency.

Do you have any idea what causes the read latency from S3?

I saw this post about improving the transfer speed; is anything there relevant?


Thanks,
Tzahi


Re: How does extending an existing parquet with columns affect impala/spark performance?

2018-04-03 Thread naresh Goud
From the Spark point of view it shouldn't have an effect. It's possible to extend
the columns of new parquet files; it won't affect performance, and you won't need
to change your Spark application code.
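A quick way to convince yourself is to look at the physical plan: thanks to column pruning, a query that doesn't reference the new columns only reads the columns it needs. A hedged sketch (paths and column names are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Existing query, unchanged after new files gain the extra columns.
// mergeSchema is only needed if old and new files should expose one unified schema.
val df = spark.read.option("mergeSchema", "true").parquet("/data/events")

val q = df.select("event_id", "event_time")   // none of the 30 new columns
q.explain()   // the ReadSchema of the file scan should list only event_id and event_time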



On Tue, Apr 3, 2018 at 9:14 AM Vitaliy Pisarev 
wrote:

> This is not strictly a spark question but I'll give it a shot:
>
> have an existing setup of parquet files that are being queried from impala
> and from spark.
>
> I intend to add some 30 relatively 'heavy' columns to the parquet. Each
> column would store an array of structs. Each struct can have from 5 to 20
> fields. An array may have a couple of thousands of structs.
>
> Theoretically, parquet being a columnar storage- extending it with columns
> should not affect performance of *existing* queries (since they are not
> touching these columns).
>
>- Is this premise correct?
>- What should I watch out for doing this move?
>- In general, what are the considerations when deciding on the "width"
>(i.e amount of columns) of a parquet file?
>
>
> --
Thanks,
Naresh
www.linkedin.com/in/naresh-dulam
http://hadoopandspark.blogspot.com/


How does extending an existing parquet with columns affect impala/spark performance?

2018-04-03 Thread Vitaliy Pisarev
This is not strictly a spark question but I'll give it a shot:

I have an existing setup of parquet files that are being queried from Impala
and from Spark.

I intend to add some 30 relatively 'heavy' columns to the parquet. Each
column would store an array of structs. Each struct can have from 5 to 20
fields. An array may have a couple of thousands of structs.

Theoretically, parquet being a columnar storage- extending it with columns
should not affect performance of *existing* queries (since they are not
touching these columns).

   - Is this premise correct?
   - What should I watch out for doing this move?
   - In general, what are the considerations when deciding on the "width"
   (i.e. number of columns) of a parquet file?


Re: GroupBy and Spark Performance issue

2017-01-17 Thread Andy Dang
Repartition wouldn't save you from skewed data unfortunately. The way Spark
works now is that it pulls data of the same key to one single partition,
and Spark, AFAIK, retains the mapping from key to data in memory.

You can use aggregateByKey(), combineByKey(), or reduceByKey() to avoid
this problem, because these functions can be evaluated using map-side
aggregation:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
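A hedged sketch of that for the "maximum per key" case (RDD API; assumes an existing SparkContext named sc and uses toy data):

// groupByKey would pull every value of a key to a single partition first -- painful for skewed keys:
//   val maxPerKey = pairs.groupByKey().mapValues(_.max)

// reduceByKey combines values map-side before the shuffle, so only one value
// per key per partition crosses the network:
val pairs: org.apache.spark.rdd.RDD[(String, Double)] =
  sc.parallelize(Seq(("a", 1.0), ("a", 5.0), ("b", 2.0)))

val maxPerKey = pairs.reduceByKey((x, y) => math.max(x, y))
maxPerKey.collect()   // e.g. Array((a,5.0), (b,2.0))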


---
Regards,
Andy

On Tue, Jan 17, 2017 at 5:39 AM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> I am trying to group by data in spark and find out maximum value for group
> of data. I have to use group by as I need to transpose based on the values.
>
> I tried repartition data by increasing number from 1 to 1.Job gets run
> till the below stage and it takes long time to move ahead. I was never
> successful, job gets killed after somtime with GC overhead limit issues.
>
>
> [image: Inline image 1]
>
> Increased Memory limits too. Not sure what is going wrong, can anyone
> guide me through right approach.
>
> Thanks,
> Asmath
>


GroupBy and Spark Performance issue

2017-01-16 Thread KhajaAsmath Mohammed
Hi,

I am trying to group data in Spark and find the maximum value for each group.
I have to use group by because I need to transpose based on the values.

I tried repartitioning the data by increasing the number from 1 to 1. The job
runs until the stage below and then takes a long time to move ahead. I was never
successful; the job gets killed after some time with GC overhead limit issues.


[image: Inline image 1]

I increased memory limits too. Not sure what is going wrong; can anyone guide
me toward the right approach?

Thanks,
Asmath


Re: SPARK PERFORMANCE TUNING

2016-09-21 Thread Mich Talebzadeh
LOL

I think we should try the crystal ball to answer this question.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com





On 21 September 2016 at 13:14, Jörn Franke <jornfra...@gmail.com> wrote:

> Do you mind sharing what your software does? What is the input data size?
> What is the spark version and apis used? How many nodes? What is the input
> data format? Is compression used?
>
> On 21 Sep 2016, at 13:37, Trinadh Kaja <ktr.hadoo...@gmail.com> wrote:
>
> Hi all,
>
> how to increase spark performance ,i am using pyspark.
>
> cluster info :
>
> Total memory :600gb
> Cores:96
>
> command :
> spark-submit --master  yarn-client --executor-memory 10G --num-executors
> 50 --executor-cores 2 --driver-memory 10g --queue thequeue
>
>
> please help on this
>
> --
> Thanks
> K.Trinadh
> Ph-7348826118
>
>


Re: SPARK PERFORMANCE TUNING

2016-09-21 Thread Jörn Franke
Do you mind sharing what your software does? What is the input data size? What 
is the spark version and apis used? How many nodes? What is the input data 
format? Is compression used?

> On 21 Sep 2016, at 13:37, Trinadh Kaja <ktr.hadoo...@gmail.com> wrote:
> 
> Hi all,
> 
> how to increase spark performance ,i am using pyspark.
> 
> cluster info :
> 
> Total memory :600gb
> Cores:96
> 
> command :
> spark-submit --master  yarn-client --executor-memory 10G --num-executors 50 
> --executor-cores 2 --driver-memory 10g --queue thequeue
>  
> 
> please help on this 
> 
> -- 
> Thanks
> K.Trinadh
> Ph-7348826118


SPARK PERFORMANCE TUNING

2016-09-21 Thread Trinadh Kaja
Hi all,

How can I increase Spark performance? I am using PySpark.

Cluster info:

Total memory: 600 GB
Cores: 96

Command:
spark-submit --master yarn-client --executor-memory 10G --num-executors 50
--executor-cores 2 --driver-memory 10g --queue thequeue

Please help with this.

-- 
Thanks
K.Trinadh
Ph-7348826118


increase spark performance

2016-09-21 Thread Trinadh Kaja
Hi all,

How can I increase Spark performance?

Cluster info:

Total memory: 600 GB
Cores:

-- 
Thanks
K.Trinadh
Ph-7348826118


Re: Spark performance testing

2016-07-09 Thread Mich Talebzadeh
Hi Andrew,

I suggest that you narrow down the scope of your performance testing by using
the same setup and making incremental changes while keeping everything else
the same.

Spark itself can run in local, standalone, YARN client and YARN cluster modes,
so you really need to target a particular setup and a particular type of
application, like SQL, streaming, etc.

Then increment the memory while keeping the cores the same, and so on.

For test data, you can create your own data using Linux shell scripts, etc.
Then the tests will have more meaning.
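A small sketch of generating synthetic test data directly in Spark instead (assumes an existing SparkSession named spark; row count, schema and path are placeholders):

import org.apache.spark.sql.functions._

// Generate n rows of synthetic data and write them out once, so benchmarks are repeatable.
val n = 100000000L
val df = spark.range(n)
  .withColumn("key", (col("id") % 1000).cast("string"))   // 1000 distinct keys
  .withColumn("value", rand())
df.write.mode("overwrite").parquet("/tmp/perf_test_input")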

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com





On 9 July 2016 at 05:28, Andrew Ehrlich <and...@aehrlich.com> wrote:

> Yea, I'm looking for any personal experiences people have had with tools
> like these.
>
> On Jul 8, 2016, at 8:57 PM, charles li <charles.up...@gmail.com> wrote:
>
> Hi, Andrew, I've got lots of materials when asking google for "*spark
> performance test*"
>
>
>- https://github.com/databricks/spark-perf
>-
>
> https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf
>- http://people.cs.vt.edu/~butta/docs/tpctc2015-sparkbench.pdf
>
>
>
> On Sat, Jul 9, 2016 at 11:40 AM, Andrew Ehrlich <and...@aehrlich.com>
> wrote:
>
>> Hi group,
>>
>> What solutions are people using to do performance testing and tuning of
>> spark applications? I have been doing a pretty manual technique where I lay
>> out an Excel sheet of various memory settings and caching parameters and
>> then execute each one by hand. It’s pretty tedious though, so I’m wondering
>> what others do, and if you do performance testing at all.  Also, is anyone
>> generating test data, or just operating on a static set? Is regression
>> testing for performance a thing?
>>
>> Andrew
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> *___*
> Quant | Engineer | Boy
> *___*
> *blog*:http://litaotao.github.io
> *github*: www.github.com/litaotao
>
>


Re: Spark performance testing

2016-07-08 Thread Andrew Ehrlich
Yea, I'm looking for any personal experiences people have had with tools like 
these. 

> On Jul 8, 2016, at 8:57 PM, charles li <charles.up...@gmail.com> wrote:
> 
> Hi, Andrew, I've got lots of materials when asking google for "spark 
> performance test"
> 
> https://github.com/databricks/spark-perf
> https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf
> http://people.cs.vt.edu/~butta/docs/tpctc2015-sparkbench.pdf
> 
> 
>> On Sat, Jul 9, 2016 at 11:40 AM, Andrew Ehrlich <and...@aehrlich.com> wrote:
>> Hi group,
>> 
>> What solutions are people using to do performance testing and tuning of 
>> spark applications? I have been doing a pretty manual technique where I lay 
>> out an Excel sheet of various memory settings and caching parameters and 
>> then execute each one by hand. It’s pretty tedious though, so I’m wondering 
>> what others do, and if you do performance testing at all.  Also, is anyone 
>> generating test data, or just operating on a static set? Is regression 
>> testing for performance a thing?
>> 
>> Andrew
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 
> 
> -- 
> ___
> Quant | Engineer | Boy
> ___
> blog:http://litaotao.github.io
> github: www.github.com/litaotao


Re: Spark performance testing

2016-07-08 Thread charles li
Hi Andrew, I've found lots of material by asking Google for "spark
performance test":


   - https://github.com/databricks/spark-perf
   -
   
https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf
   - http://people.cs.vt.edu/~butta/docs/tpctc2015-sparkbench.pdf



On Sat, Jul 9, 2016 at 11:40 AM, Andrew Ehrlich <and...@aehrlich.com> wrote:

> Hi group,
>
> What solutions are people using to do performance testing and tuning of
> spark applications? I have been doing a pretty manual technique where I lay
> out an Excel sheet of various memory settings and caching parameters and
> then execute each one by hand. It’s pretty tedious though, so I’m wondering
> what others do, and if you do performance testing at all.  Also, is anyone
> generating test data, or just operating on a static set? Is regression
> testing for performance a thing?
>
> Andrew
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
*___*
Quant | Engineer | Boy
*___*
*blog*:http://litaotao.github.io
*github*: www.github.com/litaotao


Spark performance testing

2016-07-08 Thread Andrew Ehrlich
Hi group,

What solutions are people using to do performance testing and tuning of spark 
applications? I have been doing a pretty manual technique where I lay out an 
Excel sheet of various memory settings and caching parameters and then execute 
each one by hand. It’s pretty tedious though, so I’m wondering what others do, 
and if you do performance testing at all.  Also, is anyone generating test 
data, or just operating on a static set? Is regression testing for performance 
a thing?

Andrew
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is that normal spark performance?

2016-06-15 Thread Deepak Goel
ed as bytes in memory (estimated size 2.1 KB, 
> free 21.9 KB)
> [2016-06-15 09:26:01.383] [INFO ] [dispatcher-event-loop-1] 
> [BlockManagerInfo] Added broadcast_1_piece0 in memory on node2:44871 (size: 
> 2.1 KB, free: 2.4 GB)
> [2016-06-15 09:26:01.384] [INFO ] [dag-scheduler-event-loop] [SparkContext] 
> Created broadcast 1 from broadcast at DAGScheduler.scala:1006
> [2016-06-15 09:26:01.385] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Submitting 5 missing tasks from ResultStage 1 (ShuffledRDD[3] at reduceByKey 
> at EquityTCAAnalytics.java:87)
> [2016-06-15 09:26:01.386] [INFO ] [dag-scheduler-event-loop] 
> [TaskSchedulerImpl] Adding task set 1.0 with 5 tasks
> [2016-06-15 09:26:01.390] [INFO ] [dispatcher-event-loop-4] [TaskSetManager] 
> Starting task 0.0 in stage 1.0 (TID 5, node1, partition 0,NODE_LOCAL, 2786 
> bytes)
> [2016-06-15 09:26:01.390] [INFO ] [dispatcher-event-loop-4] [TaskSetManager] 
> Starting task 1.0 in stage 1.0 (TID 6, node1, partition 1,NODE_LOCAL, 2786 
> bytes)
> [2016-06-15 09:26:01.397] [INFO ] [dispatcher-event-loop-4] [TaskSetManager] 
> Starting task 2.0 in stage 1.0 (TID 7, node1, partition 2,NODE_LOCAL, 2786 
> bytes)
> [2016-06-15 09:26:01.398] [INFO ] [dispatcher-event-loop-4] [TaskSetManager] 
> Starting task 3.0 in stage 1.0 (TID 8, node1, partition 3,NODE_LOCAL, 2786 
> bytes)
> [2016-06-15 09:26:01.406] [INFO ] [dispatcher-event-loop-4] [TaskSetManager] 
> Starting task 4.0 in stage 1.0 (TID 9, node1, partition 4,NODE_LOCAL, 2786 
> bytes)
> [2016-06-15 09:26:01.429] [INFO ] [dispatcher-event-loop-4] 
> [BlockManagerInfo] Added broadcast_1_piece0 in memory on node1:36512 (size: 
> 2.1 KB, free: 511.1 MB)
> [2016-06-15 09:26:01.452] [INFO ] [dispatcher-event-loop-6] 
> [MapOutputTrackerMasterEndpoint] Asked to send map output locations for 
> shuffle 0 to node1:41122
> [2016-06-15 09:26:01.456] [INFO ] [dispatcher-event-loop-6] 
> [MapOutputTrackerMaster] Size of output statuses for shuffle 0 is 161 bytes
> [2016-06-15 09:26:01.526] [INFO ] [task-result-getter-1] [TaskSetManager] 
> Finished task 4.0 in stage 1.0 (TID 9) in 128 ms on node1 (1/5)
> [2016-06-15 09:26:01.575] [INFO ] [task-result-getter-3] [TaskSetManager] 
> Finished task 2.0 in stage 1.0 (TID 7) in 184 ms on node1 (2/5)
> [2016-06-15 09:26:01.580] [INFO ] [task-result-getter-2] [TaskSetManager] 
> Finished task 0.0 in stage 1.0 (TID 5) in 193 ms on node1 (3/5)
> [2016-06-15 09:26:01.589] [INFO ] [task-result-getter-3] [TaskSetManager] 
> Finished task 1.0 in stage 1.0 (TID 6) in 199 ms on node1 (4/5)
> [2016-06-15 09:26:01.599] [INFO ] [task-result-getter-2] [TaskSetManager] 
> Finished task 3.0 in stage 1.0 (TID 8) in 200 ms on node1 (5/5)
> [2016-06-15 09:26:01.599] [INFO ] [task-result-getter-2] [TaskSchedulerImpl] 
> Removed TaskSet 1.0, whose tasks have all completed, from pool
> [2016-06-15 09:26:01.599] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> ResultStage 1 (collect at EquityTCAAnalytics.java:88) finished in 0.202 s
> [2016-06-15 09:26:01.612] [INFO ] [main] [DAGScheduler] Job 0 finished: 
> collect at EquityTCAAnalytics.java:88, took 32.496470 s
> [2016-06-15 09:26:01.634] [INFO ] [main] [EquityTCAAnalytics] [((2016-06-10 
> 13:45:00.0,DA),6944), ((2016-06-10 14:25:00.0,B),5241), ..., ((2016-06-10 
> 10:55:00.0,QD),109080), ((2016-06-10 14:55:00.0,A),1300)]
> [2016-06-15 09:26:01.641] [INFO ] [main] [EquityTCAAnalytics] finish
>
> 32.5 s is normal?
> --
> View this message in context: Is that normal spark performance?
> <http://apache-spark-user-list.1001560.n3.nabble.com/Is-that-normal-spark-performance-tp27174.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com
> <http://nabble.com>.
>
>


Re: Is that normal spark performance?

2016-06-15 Thread Jörn Franke
dulerImpl] Adding task set 1.0 with 5 tasks
> [2016-06-15 09:26:01.390] [INFO ] [dispatcher-event-loop-4] [TaskSetManager] 
> Starting task 0.0 in stage 1.0 (TID 5, node1, partition 0,NODE_LOCAL, 2786 
> bytes)
> [2016-06-15 09:26:01.390] [INFO ] [dispatcher-event-loop-4] [TaskSetManager] 
> Starting task 1.0 in stage 1.0 (TID 6, node1, partition 1,NODE_LOCAL, 2786 
> bytes)
> [2016-06-15 09:26:01.397] [INFO ] [dispatcher-event-loop-4] [TaskSetManager] 
> Starting task 2.0 in stage 1.0 (TID 7, node1, partition 2,NODE_LOCAL, 2786 
> bytes)
> [2016-06-15 09:26:01.398] [INFO ] [dispatcher-event-loop-4] [TaskSetManager] 
> Starting task 3.0 in stage 1.0 (TID 8, node1, partition 3,NODE_LOCAL, 2786 
> bytes)
> [2016-06-15 09:26:01.406] [INFO ] [dispatcher-event-loop-4] [TaskSetManager] 
> Starting task 4.0 in stage 1.0 (TID 9, node1, partition 4,NODE_LOCAL, 2786 
> bytes)
> [2016-06-15 09:26:01.429] [INFO ] [dispatcher-event-loop-4] 
> [BlockManagerInfo] Added broadcast_1_piece0 in memory on node1:36512 (size: 
> 2.1 KB, free: 511.1 MB)
> [2016-06-15 09:26:01.452] [INFO ] [dispatcher-event-loop-6] 
> [MapOutputTrackerMasterEndpoint] Asked to send map output locations for 
> shuffle 0 to node1:41122
> [2016-06-15 09:26:01.456] [INFO ] [dispatcher-event-loop-6] 
> [MapOutputTrackerMaster] Size of output statuses for shuffle 0 is 161 bytes
> [2016-06-15 09:26:01.526] [INFO ] [task-result-getter-1] [TaskSetManager] 
> Finished task 4.0 in stage 1.0 (TID 9) in 128 ms on node1 (1/5)
> [2016-06-15 09:26:01.575] [INFO ] [task-result-getter-3] [TaskSetManager] 
> Finished task 2.0 in stage 1.0 (TID 7) in 184 ms on node1 (2/5)
> [2016-06-15 09:26:01.580] [INFO ] [task-result-getter-2] [TaskSetManager] 
> Finished task 0.0 in stage 1.0 (TID 5) in 193 ms on node1 (3/5)
> [2016-06-15 09:26:01.589] [INFO ] [task-result-getter-3] [TaskSetManager] 
> Finished task 1.0 in stage 1.0 (TID 6) in 199 ms on node1 (4/5)
> [2016-06-15 09:26:01.599] [INFO ] [task-result-getter-2] [TaskSetManager] 
> Finished task 3.0 in stage 1.0 (TID 8) in 200 ms on node1 (5/5)
> [2016-06-15 09:26:01.599] [INFO ] [task-result-getter-2] [TaskSchedulerImpl] 
> Removed TaskSet 1.0, whose tasks have all completed, from pool
> [2016-06-15 09:26:01.599] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> ResultStage 1 (collect at EquityTCAAnalytics.java:88) finished in 0.202 s
> [2016-06-15 09:26:01.612] [INFO ] [main] [DAGScheduler] Job 0 finished: 
> collect at EquityTCAAnalytics.java:88, took 32.496470 s
> [2016-06-15 09:26:01.634] [INFO ] [main] [EquityTCAAnalytics] [((2016-06-10 
> 13:45:00.0,DA),6944), ((2016-06-10 14:25:00.0,B),5241), ..., ((2016-06-10 
> 10:55:00.0,QD),109080), ((2016-06-10 14:55:00.0,A),1300)]
> [2016-06-15 09:26:01.641] [INFO ] [main] [EquityTCAAnalytics] finish
> 32.5 s is normal? 
> View this message in context: Is that normal spark performance?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Is that normal spark performance?

2016-06-15 Thread nikita.dobryukha
We use Cassandra 3.5 + Spark 1.6.1 in a 2-node cluster (8 cores and 1g memory
per node). There is the following Cassandra table, and I want to calculate the
percentage of volume: the sum of all volume from trades in the relevant security
during the time period, grouped by exchange and time bar (1 or 5 minutes).
I've created an example. Is 32.5 s normal?
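For context, a sketch of the kind of aggregation described above — volume per exchange and time bar as a percentage of total volume — written against plain DataFrames (the original table definition did not survive the archive, so the schema below, with columns exchange, security, trade_time and volume, is a guess):

import org.apache.spark.sql.functions._

// trades: DataFrame with (exchange, security, trade_time, volume) -- assumed schema
val barSeconds = 300   // 5-minute bars
val withBar = trades.withColumn(
  "bar", (unix_timestamp(col("trade_time")) / barSeconds).cast("long") * barSeconds)

val bars = withBar.groupBy("exchange", "bar").agg(sum("volume").as("bar_volume"))

val total = trades.agg(sum("volume").as("total")).collect()(0).getAs[Long]("total")  // assumes integer volume
val withPct = bars.withColumn("pct_of_volume", col("bar_volume") / lit(total) * 100)
withPct.show()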



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-that-normal-spark-performance-tp27174.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Gajanan Satone
Thanks for sharing,

Please consider me.

Thanks,
Gajanan

On Wed, May 18, 2016 at 8:34 AM, 谭成灶 <tanx...@live.cn> wrote:

> Thanks for your sharing!
> Please include me too
> --
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Sent: 2016/5/18 5:16
> To: user @spark <user@spark.apache.org>
> Subject: Re: My notes on Spark Performance & Tuning Guide
>
> Hi all,
>
> Many thanks for your tremendous interest in the forthcoming notes. I have
> had nearly thirty requests and many supporting kind words from the
> colleagues in this forum.
>
> I will strive to get the first draft ready as soon as possible. Apologies
> for not being more specific. However, hopefully not too long for your
> perusal.
>
>
> Regards,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 12 May 2016 at 11:08, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Hi Al,,
>
>
> Following the threads in spark forum, I decided to write up on
> configuration of Spark including allocation of resources and configuration
> of driver, executors, threads, execution of Spark apps and general
> troubleshooting taking into account the allocation of resources for Spark
> applications and OS tools at the disposal.
>
> Since the most widespread configuration as I notice is with "Spark
> Standalone Mode", I have decided to write these notes starting with
> Standalone and later on moving to Yarn
>
>
>-
>
>*Standalone *– a simple cluster manager included with Spark that makes
>it easy to set up a cluster.
>-
>
>*YARN* – the resource manager in Hadoop 2.
>
>
> I would appreciate if anyone interested in reading and commenting to get
> in touch with me directly on mich.talebza...@gmail.com so I can send the
> write-up for their review and comments.
>
>
> Just to be clear this is not meant to be any commercial proposition or
> anything like that. As I seem to get involved with members troubleshooting
> issues and threads on this topic, I thought it is worthwhile writing a note
> about it to summarise the findings for the benefit of the community.
>
>
> Regards.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread 谭成灶
Thanks for sharing!
Please include me too.

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: 2016/5/18 5:16
To: user @spark <user@spark.apache.org>
Subject: Re: My notes on Spark Performance & Tuning Guide

Hi all,

Many thanks for your tremendous interest in the forthcoming notes. I have
had nearly thirty requests and many supporting kind words from the
colleagues in this forum.

I will strive to get the first draft ready as soon as possible. Apologies
for not being more specific. However, hopefully not too long for your
perusal.


Regards,


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 12 May 2016 at 11:08, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Al,,
>
>
> Following the threads in spark forum, I decided to write up on
> configuration of Spark including allocation of resources and configuration
> of driver, executors, threads, execution of Spark apps and general
> troubleshooting taking into account the allocation of resources for Spark
> applications and OS tools at the disposal.
>
> Since the most widespread configuration as I notice is with "Spark
> Standalone Mode", I have decided to write these notes starting with
> Standalone and later on moving to Yarn
>
>
>-
>
>*Standalone *– a simple cluster manager included with Spark that makes
>it easy to set up a cluster.
>-
>
>*YARN* – the resource manager in Hadoop 2.
>
>
> I would appreciate if anyone interested in reading and commenting to get
> in touch with me directly on mich.talebza...@gmail.com so I can send the
> write-up for their review and comments.
>
>
> Just to be clear this is not meant to be any commercial proposition or
> anything like that. As I seem to get involved with members troubleshooting
> issues and threads on this topic, I thought it is worthwhile writing a note
> about it to summarise the findings for the benefit of the community.
>
>
> Regards.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Jeff Zhang
I think you can write it in a GitBook and share it on the user mailing list;
then everyone can comment on it.

On Wed, May 18, 2016 at 10:12 AM, Vinayak Agrawal <
vinayakagrawa...@gmail.com> wrote:

> Please include me too.
>
> Vinayak Agrawal
> Big Data Analytics
> IBM
>
> "To Strive, To Seek, To Find and Not to Yield!"
> ~Lord Alfred Tennyson
>
> On May 17, 2016, at 2:15 PM, Mich Talebzadeh 
> wrote:
>
> Hi all,
>
> Many thanks for your tremendous interest in the forthcoming notes. I have
> had nearly thirty requests and many supporting kind words from the
> colleagues in this forum.
>
> I will strive to get the first draft ready as soon as possible. Apologies
> for not being more specific. However, hopefully not too long for your
> perusal.
>
>
> Regards,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 12 May 2016 at 11:08, Mich Talebzadeh 
> wrote:
>
>> Hi Al,,
>>
>>
>> Following the threads in spark forum, I decided to write up on
>> configuration of Spark including allocation of resources and configuration
>> of driver, executors, threads, execution of Spark apps and general
>> troubleshooting taking into account the allocation of resources for Spark
>> applications and OS tools at the disposal.
>>
>> Since the most widespread configuration as I notice is with "Spark
>> Standalone Mode", I have decided to write these notes starting with
>> Standalone and later on moving to Yarn
>>
>>
>>-
>>
>>*Standalone *– a simple cluster manager included with Spark that
>>makes it easy to set up a cluster.
>>-
>>
>>*YARN* – the resource manager in Hadoop 2.
>>
>>
>> I would appreciate if anyone interested in reading and commenting to get
>> in touch with me directly on mich.talebza...@gmail.com so I can send the
>> write-up for their review and comments.
>>
>>
>> Just to be clear this is not meant to be any commercial proposition or
>> anything like that. As I seem to get involved with members troubleshooting
>> issues and threads on this topic, I thought it is worthwhile writing a note
>> about it to summarise the findings for the benefit of the community.
>>
>>
>> Regards.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>


-- 
Best Regards

Jeff Zhang


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Vinayak Agrawal
Please include me too. 

Vinayak Agrawal
Big Data Analytics
IBM

"To Strive, To Seek, To Find and Not to Yield!"
~Lord Alfred Tennyson

> On May 17, 2016, at 2:15 PM, Mich Talebzadeh  
> wrote:
> 
> Hi all,
> 
> Many thanks for your tremendous interest in the forthcoming notes. I have had 
> nearly thirty requests and many supporting kind words from the colleagues in 
> this forum.
> 
> I will strive to get the first draft ready as soon as possible. Apologies for 
> not being more specific. However, hopefully not too long for your perusal.
> 
> 
> Regards,
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 12 May 2016 at 11:08, Mich Talebzadeh  wrote:
>> Hi Al,,
>> 
>> 
>> Following the threads in spark forum, I decided to write up on configuration 
>> of Spark including allocation of resources and configuration of driver, 
>> executors, threads, execution of Spark apps and general troubleshooting 
>> taking into account the allocation of resources for Spark applications and 
>> OS tools at the disposal.
>> 
>> Since the most widespread configuration as I notice is with "Spark 
>> Standalone Mode", I have decided to write these notes starting with 
>> Standalone and later on moving to Yarn
>> 
>> Standalone – a simple cluster manager included with Spark that makes it easy 
>> to set up a cluster.
>> YARN – the resource manager in Hadoop 2.
>> 
>> I would appreciate if anyone interested in reading and commenting to get in 
>> touch with me directly on mich.talebza...@gmail.com so I can send the 
>> write-up for their review and comments.
>> 
>> Just to be clear this is not meant to be any commercial proposition or 
>> anything like that. As I seem to get involved with members troubleshooting 
>> issues and threads on this topic, I thought it is worthwhile writing a note 
>> about it to summarise the findings for the benefit of the community.
>> 
>> Regards.
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
> 


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Abi
Please include me too

On May 12, 2016 6:08:14 AM EDT, Mich Talebzadeh  
wrote:
>Hi Al,,
>
>
>Following the threads in spark forum, I decided to write up on
>configuration of Spark including allocation of resources and
>configuration
>of driver, executors, threads, execution of Spark apps and general
>troubleshooting taking into account the allocation of resources for
>Spark
>applications and OS tools at the disposal.
>
>Since the most widespread configuration as I notice is with "Spark
>Standalone Mode", I have decided to write these notes starting with
>Standalone and later on moving to Yarn
>
>
>   -
>
> *Standalone *– a simple cluster manager included with Spark that makes
>   it easy to set up a cluster.
>   -
>
>   *YARN* – the resource manager in Hadoop 2.
>
>
>I would appreciate if anyone interested in reading and commenting to
>get in
>touch with me directly on mich.talebza...@gmail.com so I can send the
>write-up for their review and comments.
>
>
>Just to be clear this is not meant to be any commercial proposition or
>anything like that. As I seem to get involved with members
>troubleshooting
>issues and threads on this topic, I thought it is worthwhile writing a
>note
>about it to summarise the findings for the benefit of the community.
>
>
>Regards.
>
>
>Dr Mich Talebzadeh
>
>
>
>LinkedIn *
>https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>*
>
>
>
>http://talebzadehmich.wordpress.com


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Cesar Flores
Please sent me to me too !


Thanks ! ! !


Cesar Flores

On Tue, May 17, 2016 at 4:55 PM, Femi Anthony  wrote:

> Please send it to me as well.
>
> Thanks
>
> Sent from my iPhone
>
> On May 17, 2016, at 12:09 PM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
> Can you please send me as well.
>
> Thanks
> Raghav
> On 12 May 2016 20:02, "Tom Ellis"  wrote:
>
>> I would like to also Mich, please send it through, thanks!
>>
>> On Thu, 12 May 2016 at 15:14 Alonso Isidoro  wrote:
>>
>>> Me too, send me the guide.
>>>
>>> Sent from my iPhone
>>>
>>> On 12 May 2016, at 12:11, Ashok Kumar >> > wrote:
>>>
>>> Hi Dr Mich,
>>>
>>> I will be very keen to have a look at it and review if possible.
>>>
>>> Please forward me a copy
>>>
>>> Thanking you warmly
>>>
>>>
>>> On Thursday, 12 May 2016, 11:08, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>
>>> Hi Al,,
>>>
>>>
>>> Following the threads in spark forum, I decided to write up on
>>> configuration of Spark including allocation of resources and configuration
>>> of driver, executors, threads, execution of Spark apps and general
>>> troubleshooting taking into account the allocation of resources for Spark
>>> applications and OS tools at the disposal.
>>>
>>> Since the most widespread configuration as I notice is with "Spark
>>> Standalone Mode", I have decided to write these notes starting with
>>> Standalone and later on moving to Yarn
>>>
>>>
>>>- *Standalone *– a simple cluster manager included with Spark that
>>>makes it easy to set up a cluster.
>>>- *YARN* – the resource manager in Hadoop 2.
>>>
>>>
>>> I would appreciate if anyone interested in reading and commenting to get
>>> in touch with me directly on mich.talebza...@gmail.com so I can send
>>> the write-up for their review and comments.
>>>
>>> Just to be clear this is not meant to be any commercial proposition or
>>> anything like that. As I seem to get involved with members troubleshooting
>>> issues and threads on this topic, I thought it is worthwhile writing a note
>>> about it to summarise the findings for the benefit of the community.
>>>
>>> Regards.
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>>


-- 
Cesar Flores


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Femi Anthony
Please send it to me as well.

Thanks

Sent from my iPhone

> On May 17, 2016, at 12:09 PM, Raghavendra Pandey 
>  wrote:
> 
> Can you please send me as well.
> 
> Thanks 
> Raghav
> 
>> On 12 May 2016 20:02, "Tom Ellis"  wrote:
>> I would like to also Mich, please send it through, thanks!
>> 
>>> On Thu, 12 May 2016 at 15:14 Alonso Isidoro  wrote:
>>> Me too, send me the guide.
>>> 
>>> Sent from my iPhone
>>> 
 On 12 May 2016, at 12:11, Ashok Kumar  
 wrote:
 
 Hi Dr Mich,
 
 I will be very keen to have a look at it and review if possible.
 
 Please forward me a copy
 
 Thanking you warmly
 
 
 On Thursday, 12 May 2016, 11:08, Mich Talebzadeh 
  wrote:
 
 
 Hi Al,,
 
 
 Following the threads in spark forum, I decided to write up on 
 configuration of Spark including allocation of resources and configuration 
 of driver, executors, threads, execution of Spark apps and general 
 troubleshooting taking into account the allocation of resources for Spark 
 applications and OS tools at the disposal.
 
 Since the most widespread configuration as I notice is with "Spark 
 Standalone Mode", I have decided to write these notes starting with 
 Standalone and later on moving to Yarn
 
 Standalone – a simple cluster manager included with Spark that makes it 
 easy to set up a cluster.
 YARN – the resource manager in Hadoop 2.
 
 I would appreciate if anyone interested in reading and commenting to get 
 in touch with me directly on mich.talebza...@gmail.com so I can send the 
 write-up for their review and comments.
 
 Just to be clear this is not meant to be any commercial proposition or 
 anything like that. As I seem to get involved with members troubleshooting 
 issues and threads on this topic, I thought it is worthwhile writing a 
 note about it to summarise the findings for the benefit of the community.
 
 Regards.
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 http://talebzadehmich.wordpress.com


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread rakesh sharma
It would be a rare doc. Please share

Get Outlook for Android



On Tue, May 17, 2016 at 9:14 AM -0700, "Natu Lauchande" 
> wrote:

Hi Mich,

I am also interested in the write up.

Regards,
Natu

On Thu, May 12, 2016 at 12:08 PM, Mich Talebzadeh 
> wrote:
Hi Al,,


Following the threads in spark forum, I decided to write up on configuration of 
Spark including allocation of resources and configuration of driver, executors, 
threads, execution of Spark apps and general troubleshooting taking into 
account the allocation of resources for Spark applications and OS tools at the 
disposal.

Since the most widespread configuration as I notice is with "Spark Standalone 
Mode", I have decided to write these notes starting with Standalone and later 
on moving to Yarn


  *   Standalone - a simple cluster manager included with Spark that makes it 
easy to set up a cluster.

  *   YARN - the resource manager in Hadoop 2.


I would appreciate if anyone interested in reading and commenting to get in 
touch with me directly on 
mich.talebza...@gmail.com so I can send the 
write-up for their review and comments.


Just to be clear this is not meant to be any commercial proposition or anything 
like that. As I seem to get involved with members troubleshooting issues and 
threads on this topic, I thought it is worthwhile writing a note about it to 
summarise the findings for the benefit of the community.


Regards.


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com





Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Natu Lauchande
Hi Mich,

I am also interested in the write up.

Regards,
Natu

On Thu, May 12, 2016 at 12:08 PM, Mich Talebzadeh  wrote:

> Hi Al,,
>
>
> Following the threads in spark forum, I decided to write up on
> configuration of Spark including allocation of resources and configuration
> of driver, executors, threads, execution of Spark apps and general
> troubleshooting taking into account the allocation of resources for Spark
> applications and OS tools at the disposal.
>
> Since the most widespread configuration as I notice is with "Spark
> Standalone Mode", I have decided to write these notes starting with
> Standalone and later on moving to Yarn
>
>
>-
>
>*Standalone *– a simple cluster manager included with Spark that makes
>it easy to set up a cluster.
>-
>
>*YARN* – the resource manager in Hadoop 2.
>
>
> I would appreciate if anyone interested in reading and commenting to get
> in touch with me directly on mich.talebza...@gmail.com so I can send the
> write-up for their review and comments.
>
>
> Just to be clear this is not meant to be any commercial proposition or
> anything like that. As I seem to get involved with members troubleshooting
> issues and threads on this topic, I thought it is worthwhile writing a note
> about it to summarise the findings for the benefit of the community.
>
>
> Regards.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Raghavendra Pandey
Can you please send me as well.

Thanks
Raghav
On 12 May 2016 20:02, "Tom Ellis"  wrote:

> I would like to also Mich, please send it through, thanks!
>
> On Thu, 12 May 2016 at 15:14 Alonso Isidoro  wrote:
>
>> Me too, send me the guide.
>>
>> Sent from my iPhone
>>
>> On 12 May 2016, at 12:11, Ashok Kumar > > wrote:
>>
>> Hi Dr Mich,
>>
>> I will be very keen to have a look at it and review if possible.
>>
>> Please forward me a copy
>>
>> Thanking you warmly
>>
>>
>> On Thursday, 12 May 2016, 11:08, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> Hi Al,,
>>
>>
>> Following the threads in spark forum, I decided to write up on
>> configuration of Spark including allocation of resources and configuration
>> of driver, executors, threads, execution of Spark apps and general
>> troubleshooting taking into account the allocation of resources for Spark
>> applications and OS tools at the disposal.
>>
>> Since the most widespread configuration as I notice is with "Spark
>> Standalone Mode", I have decided to write these notes starting with
>> Standalone and later on moving to Yarn
>>
>>
>>- *Standalone *– a simple cluster manager included with Spark that
>>makes it easy to set up a cluster.
>>- *YARN* – the resource manager in Hadoop 2.
>>
>>
>> I would appreciate if anyone interested in reading and commenting to get
>> in touch with me directly on mich.talebza...@gmail.com so I can send the
>> write-up for their review and comments.
>>
>> Just to be clear this is not meant to be any commercial proposition or
>> anything like that. As I seem to get involved with members troubleshooting
>> issues and threads on this topic, I thought it is worthwhile writing a note
>> about it to summarise the findings for the benefit of the community.
>>
>> Regards.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>


Re: My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Tom Ellis
I would like to also Mich, please send it through, thanks!

On Thu, 12 May 2016 at 15:14 Alonso Isidoro  wrote:

> Me too, send me the guide.
>
> Sent from my iPhone
>
> On 12 May 2016, at 12:11, Ashok Kumar  > wrote:
>
> Hi Dr Mich,
>
> I will be very keen to have a look at it and review if possible.
>
> Please forward me a copy
>
> Thanking you warmly
>
>
> On Thursday, 12 May 2016, 11:08, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Hi Al,,
>
>
> Following the threads in spark forum, I decided to write up on
> configuration of Spark including allocation of resources and configuration
> of driver, executors, threads, execution of Spark apps and general
> troubleshooting taking into account the allocation of resources for Spark
> applications and OS tools at the disposal.
>
> Since the most widespread configuration as I notice is with "Spark
> Standalone Mode", I have decided to write these notes starting with
> Standalone and later on moving to Yarn
>
>
>- *Standalone *– a simple cluster manager included with Spark that
>makes it easy to set up a cluster.
>- *YARN* – the resource manager in Hadoop 2.
>
>
> I would appreciate if anyone interested in reading and commenting to get
> in touch with me directly on mich.talebza...@gmail.com so I can send the
> write-up for their review and comments.
>
> Just to be clear this is not meant to be any commercial proposition or
> anything like that. As I seem to get involved with members troubleshooting
> issues and threads on this topic, I thought it is worthwhile writing a note
> about it to summarise the findings for the benefit of the community.
>
> Regards.
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
>
>


Re: My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Alonso Isidoro
Me too, send me the guide.

Sent from my iPhone

> On 12 May 2016, at 12:11, Ashok Kumar  
> wrote:
> 
> Hi Dr Mich,
> 
> I will be very keen to have a look at it and review if possible.
> 
> Please forward me a copy
> 
> Thanking you warmly
> 
> 
> On Thursday, 12 May 2016, 11:08, Mich Talebzadeh  
> wrote:
> 
> 
> Hi Al,,
> 
> 
> Following the threads in spark forum, I decided to write up on configuration 
> of Spark including allocation of resources and configuration of driver, 
> executors, threads, execution of Spark apps and general troubleshooting 
> taking into account the allocation of resources for Spark applications and OS 
> tools at the disposal.
> 
> Since the most widespread configuration as I notice is with "Spark Standalone 
> Mode", I have decided to write these notes starting with Standalone and later 
> on moving to Yarn
> 
> Standalone – a simple cluster manager included with Spark that makes it easy 
> to set up a cluster.
> YARN – the resource manager in Hadoop 2.
> 
> I would appreciate if anyone interested in reading and commenting to get in 
> touch with me directly on mich.talebza...@gmail.com so I can send the 
> write-up for their review and comments.
> 
> Just to be clear this is not meant to be any commercial proposition or 
> anything like that. As I seem to get involved with members troubleshooting 
> issues and threads on this topic, I thought it is worthwhile writing a note 
> about it to summarise the findings for the benefit of the community.
> 
> Regards.
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
> 


Re: My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Ashok Kumar
Hi Dr Mich,
I will be very keen to have a look at it and review if possible.
Please forward me a copy
Thanking you warmly 

On Thursday, 12 May 2016, 11:08, Mich Talebzadeh 
 wrote:
 

 Hi Al,,

Following the threads in spark forum, I decided to write up on configuration of 
Spark including allocation of resources and configuration of driver, executors, 
threads, execution of Spark apps and general troubleshooting taking into 
account the allocation of resources for Spark applications and OS tools at the 
disposal.
Since the most widespread configuration as I notice is with "Spark Standalone 
Mode", I have decided to write these notes starting with Standalone and later 
on moving to Yarn
   
   - Standalone – a simple cluster manager included with Spark that makes it 
easy to set up a cluster.
   - YARN – the resource manager in Hadoop 2.

I would appreciate if anyone interested in reading and commenting to get in 
touch with me directly on mich.talebza...@gmail.com so I can send the write-up 
for their review and comments.
Just to be clear this is not meant to be any commercial proposition or anything 
like that. As I seem to get involved with members troubleshooting issues and 
threads on this topic, I thought it is worthwhile writing a note about it to 
summarise the findings for the benefit of the community.
Regards.
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 

  

My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Mich Talebzadeh
Hi All,


Following the threads in the Spark forum, I decided to write up notes on the
configuration of Spark, including allocation of resources, configuration of the
driver, executors and threads, execution of Spark apps, and general
troubleshooting, taking into account the resources allocated to Spark
applications and the OS tools at our disposal.

Since the most widespread configuration I notice is "Spark
Standalone Mode", I have decided to write these notes starting with
Standalone and later moving on to YARN:


   - *Standalone* – a simple cluster manager included with Spark that makes
   it easy to set up a cluster.
   - *YARN* – the resource manager in Hadoop 2.


I would appreciate it if anyone interested in reading and commenting would get
in touch with me directly at mich.talebza...@gmail.com so I can send them the
write-up for review and comments.


Just to be clear, this is not meant to be a commercial proposition or
anything like that. As I often get involved in helping members troubleshoot
issues and threads on this topic, I thought it worthwhile to write a note
summarising the findings for the benefit of the community.


Regards.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com
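
For a flavour of the kind of settings such a write-up covers, here is a generic
sketch of resource allocation at submit time (the class, jar and master URL are
hypothetical placeholders, and the numbers are examples rather than
recommendations):

    # Standalone mode: cap total cores across the cluster, set per-executor memory
    spark-submit --class com.example.MyApp --master spark://master-host:7077 \
      --driver-memory 4G --executor-memory 6G --total-executor-cores 24 myapp.jar

    # YARN (Hadoop 2): executors are requested explicitly
    spark-submit --class com.example.MyApp --master yarn-cluster \
      --driver-memory 4G --executor-memory 6G --num-executors 10 --executor-cores 2 myapp.jar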


Re: Hive on Spark performance

2016-03-13 Thread Mich Talebzadeh
It depends on the version of Hive on the Spark engine.

As far as I am aware, the latest version of Hive that I am using (Hive 2)
has improvements compared to the previous versions of Hive (0.14, 1.2.1) on
the Spark engine.

As of today I have managed to use Hive 2.0 on Spark version 1.3.1. So it is
not the latest Spark, but it is pretty good.
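
(For reference, pointing a Hive session at the Spark engine is usually a matter
of a few session or hive-site.xml settings; whether these exact properties apply
depends on how Hive and Spark were built, so treat the values below as examples
only:)

    set hive.execution.engine=spark;
    set spark.master=yarn-client;
    set spark.executor.memory=4g;
    set spark.executor.instances=8;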

What specific concerns do you have in mind?

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 13 March 2016 at 23:27, sjayatheertha  wrote:

> Just curious if you could share your experience on the performance of
> spark in your company? How much data do you process? And what's the latency
> you are getting with spark engine?
>
> Vidya


spark performance non-linear response

2015-10-07 Thread Yadid Ayzenberg

Hi All,

I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes 
of data). The RDD is partitioned into 2048 partitions which are more or 
less equal and entirely cached in RAM.
I evaluated the performance on several cluster sizes, and am witnessing 
a non-linear (power) performance improvement as the cluster size 
increases (plot below). Each node has 4 cores and each worker is 
configured to use 10GB of RAM.


[image: Spark performance]

I would expect a more linear response given the number of partitions and 
the fact that all of the data is cached.

Can anyone suggest what I should tweak in order to improve the performance?
Or perhaps provide an explanation as to the behavior I'm witnessing?

Yadid


Re: spark performance non-linear response

2015-10-07 Thread Sean Owen
OK, next question then is: if this is wall-clock time for the whole
process, then, I wonder if you are just measuring the time taken by the
longest single task. I'd expect the time taken by the longest straggler
task to follow a distribution like this. That is, how balanced are the
partitions?

Are you running so many executors that nodes are bottlenecking on CPU, or
swapping?
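
One quick way to check that balance (a sketch, assuming rdd refers to the cached
RDD from the job) is to count the records per partition and look at the spread:

    // Count records in each partition without pulling the data itself to the driver
    val counts = rdd.mapPartitions(it => Iterator(it.size)).collect()
    println(s"partitions=${counts.length} min=${counts.min} max=${counts.max} " +
      s"mean=${counts.sum.toDouble / counts.length}")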


On Wed, Oct 7, 2015 at 4:42 PM, Yadid Ayzenberg <ya...@media.mit.edu> wrote:

> Additional missing relevant information:
>
> Im running a transformation, there are no Shuffles occurring and at the
> end im performing a lookup of 4 partitions on the driver.
>
>
>
>
> On 10/7/15 11:26 AM, Yadid Ayzenberg wrote:
>
> Hi All,
>
> Im using spark 1.4.1 to to analyze a largish data set (several Gigabytes
> of data). The RDD is partitioned into 2048 partitions which are more or
> less equal and entirely cached in RAM.
> I evaluated the performance on several cluster sizes, and am witnessing a
> non linear (power) performance improvement as the cluster size increases
> (plot below). Each node has 4 cores and each worker is configured to use
> 10GB or RAM.
>
> [image: Spark performance]
>
> I would expect a more linear response given the number of partitions and
> the fact that all of the data is cached.
> Can anyone suggest what I should tweak in order to improve the performance?
> Or perhaps provide an explanation as to the behavior Im witnessing?
>
> Yadid
>
>
>


Re: spark performance non-linear response

2015-10-07 Thread Yadid Ayzenberg

Additional missing relevant information:

I'm running a transformation; there are no shuffles occurring, and at the 
end I'm performing a lookup of 4 partitions on the driver.




On 10/7/15 11:26 AM, Yadid Ayzenberg wrote:

Hi All,

Im using spark 1.4.1 to to analyze a largish data set (several 
Gigabytes of data). The RDD is partitioned into 2048 partitions which 
are more or less equal and entirely cached in RAM.
I evaluated the performance on several cluster sizes, and am 
witnessing a non linear (power) performance improvement as the cluster 
size increases (plot below). Each node has 4 cores and each worker is 
configured to use 10GB or RAM.


[image: Spark performance]

I would expect a more linear response given the number of partitions 
and the fact that all of the data is cached.
Can anyone suggest what I should tweak in order to improve the 
performance?

Or perhaps provide an explanation as to the behavior Im witnessing?

Yadid




Re: spark performance non-linear response

2015-10-07 Thread Jonathan Coveney
I've noticed this as well and am curious if there is anything more people
can say.

My theory is that it is just communication overhead. If you only have a
couple of gigabytes (a tiny dataset), then splitting that across 50 nodes
means you'll have a ton of tiny partitions all finishing very quickly, and
thus creating a lot of communication overhead per amount of data processed.
Just a theory though.
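
If that theory holds, one thing worth trying (my suggestion, not something
verified in this thread) is to reduce the partition count for such a small
cached dataset so that each task does a meaningful amount of work, for example:

    // coalesce avoids a full shuffle when only reducing the number of partitions;
    // rdd and the target of 64 partitions are assumptions for illustration
    val coarser = rdd.coalesce(64).cache()
    coarser.count()  // materialise the cache once under the new partitioning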

On Wednesday, October 7, 2015, Yadid Ayzenberg <ya...@media.mit.edu>
wrote:

> Additional missing relevant information:
>
> Im running a transformation, there are no Shuffles occurring and at the
> end im performing a lookup of 4 partitions on the driver.
>
>
>
> On 10/7/15 11:26 AM, Yadid Ayzenberg wrote:
>
> Hi All,
>
> Im using spark 1.4.1 to to analyze a largish data set (several Gigabytes
> of data). The RDD is partitioned into 2048 partitions which are more or
> less equal and entirely cached in RAM.
> I evaluated the performance on several cluster sizes, and am witnessing a
> non linear (power) performance improvement as the cluster size increases
> (plot below). Each node has 4 cores and each worker is configured to use
> 10GB or RAM.
>
> [image: Spark performance]
>
> I would expect a more linear response given the number of partitions and
> the fact that all of the data is cached.
> Can anyone suggest what I should tweak in order to improve the performance?
> Or perhaps provide an explanation as to the behavior Im witnessing?
>
> Yadid
>
>
>


Re: flatmap() and spark performance

2015-09-28 Thread Hemant Bhanawat
You can use spark.executor.memory to specify the memory of the executors
which will hold these intermediate results.

You may want to look at the section "Understanding Memory Management in
Spark" of this link:

https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
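
As a small, generic illustration (the values and the synthetic data are made up
for the example), executor memory is set on the SparkConf, and the larger RDD
produced by flatMap can be repartitioned explicitly:

    import org.apache.spark.{SparkConf, SparkContext}

    // Example values only: executor memory is where the expanded flatMap output lives
    val conf = new SparkConf()
      .setMaster("local[2]")           // example; normally supplied by spark-submit
      .setAppName("flatmap-sizing-example")
      .set("spark.executor.memory", "8g")
    val sc = new SparkContext(conf)

    // ~100k input rows, each expanding to 100 output rows under flatMap
    val input = sc.parallelize(1 to 100000, numSlices = 100)
    val expanded = input.flatMap(i => (1 to 100).map(j => (i, j)))
      .repartition(1000)               // size partitions for the larger result explicitly
    println(expanded.count())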


On Tue, Sep 29, 2015 at 10:51 AM, jeff saremi 
wrote:

> Is there anyway to let spark know ahead of time what size of RDD to expect
> as a result of a flatmap() operation?
> And would that help in terms of performance?
> For instance, if I have an RDD of 1million rows and I know that my
> flatMap() will produce 100million rows, is there a way to indicate that to
> Spark? to say "reserve" space for the resulting RDD?
>
> thanks
> Jeff
>


Re: spark performance - executor computing time

2015-09-17 Thread Adrian Tanase
Something similar happened to our job as well - spark streaming, YARN deployed 
on AWS.
One of the jobs was consistently taking 10–15X longer on one machine. Same 
data volume, data partitioned really well, etc.

Are you running on AWS or on prem?

We were assuming that one of the VMs in Amazon was flaky and decided to restart 
it, leading to a host of other issues (the executor on it was never recreated 
after the machine joined back in YARN as a healthy node…)

-adrian
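
(A related knob, not raised in this thread but sometimes useful when one machine
is intermittently slow, is speculative execution, which re-launches straggler
tasks on other executors; the values below are illustrative only:)

    --conf spark.speculation=true
    --conf spark.speculation.multiplier=2    # tasks slower than 2x the median are candidates
    --conf spark.speculation.quantile=0.9    # start checking once 90% of tasks have finished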

From: Robin East
Date: Wednesday, September 16, 2015 at 7:45 PM
To: patcharee
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: spark performance - executor computing time

Is this repeatable? Do you always get one or two executors that are 6 times as 
slow? It could be that some of your tasks have more work to do (maybe you are 
filtering some records out? If it’s always one particular worker node is there 
something about the machine configuration (e.g. CPU speed) that means the 
processing takes longer.

—
Robin East
Spark GraphX in Action Michael S Malak and Robin East
http://www.manning.com/books/spark-graphx-in-action

On 15 Sep 2015, at 12:35, patcharee 
<patcharee.thong...@uni.no<mailto:patcharee.thong...@uni.no>> wrote:

Hi,

I was running a job (on Spark 1.5 + Yarn + java 8). In a stage that lookup 
(org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:873)) 
there was an executor that took the executor computing time > 6 times of 
median. This executor had almost the same shuffle read size and low gc time as 
others.

What can impact the executor computing time? Any suggestions what parameters I 
should monitor/configure?

BR,
Patcharee







Re: spark performance - executor computing time

2015-09-16 Thread Robin East
Is this repeatable? Do you always get one or two executors that are 6 times as 
slow? It could be that some of your tasks have more work to do (maybe you are 
filtering some records out?). If it’s always one particular worker node, is there 
something about the machine configuration (e.g. CPU speed) that means the 
processing takes longer?

—
Robin East
Spark GraphX in Action Michael S Malak and Robin East
http://www.manning.com/books/spark-graphx-in-action 


> On 15 Sep 2015, at 12:35, patcharee  wrote:
> 
> Hi,
> 
> I was running a job (on Spark 1.5 + Yarn + java 8). In a stage that lookup 
> (org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:873)) 
> there was an executor that took the executor computing time > 6 times of 
> median. This executor had almost the same shuffle read size and low gc time 
> as others.
> 
> What can impact the executor computing time? Any suggestions what parameters 
> I should monitor/configure?
> 
> BR,
> Patcharee
> 
> 
> 
> 



spark performance - executor computing time

2015-09-15 Thread patcharee

Hi,

I was running a job (on Spark 1.5 + YARN + Java 8). In a stage that performs a 
lookup 
(org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:873)) 
there was an executor whose computing time was > 6 times the 
median. This executor had almost the same shuffle read size and low GC 
time as the others.


What can impact the executor computing time? Any suggestions on what 
parameters I should monitor/configure?


BR,
Patcharee






DataFrames in Spark - Performance when interjected with RDDs

2015-09-07 Thread Pallavi Rao
Hello All,
I had a question regarding the performance optimization (Catalyst
Optimizer) of DataFrames. I understand that DataFrames are interoperable
with RDDs. If I switch back and forth between DataFrames and RDDs, does the
performance optimization still kick in? I need to switch to RDDs to reuse
some previously written functions that had been coded up using RDDs.

Are there any recommendations/best practices, in terms of performance
tuning, that need to be followed while using a combination of DataFrames
and RDDs?
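
Roughly speaking, Catalyst optimises the DataFrame operations on either side of
the switch, but the RDD functions themselves are opaque to it, so the plan is
not optimised across that boundary. A sketch of the pattern (the path, columns
and logic are made up; a Spark 1.4-era API is assumed):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
    import sqlContext.implicits._

    // DataFrame side: these operations go through the Catalyst optimiser
    val df = sqlContext.read.parquet("/data/events")
      .select($"id", $"payload")
      .filter($"payload".isNotNull)

    // RDD side: reuse an existing RDD-based function; this part is opaque to Catalyst
    val scored = df.rdd.map(row => (row.getString(0), row.getString(1).length))

    // Back to a DataFrame: later operations are optimised again, but not across the gap
    val scoredDf = scored.toDF("id", "score")
    scoredDf.groupBy("id").max("score").show()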

Thank you for your time.

Regards,
Pallavi.

-- 
_
The information contained in this communication is intended solely for the 
use of the individual or entity to whom it is addressed and others 
authorized to receive it. It may contain confidential or legally privileged 
information. If you are not the intended recipient you are hereby notified 
that any disclosure, copying, distribution or taking any action in reliance 
on the contents of this information is strictly prohibited and may be 
unlawful. If you have received this communication in error, please notify 
us immediately by responding to this email and then delete it from your 
system. The firm is neither liable for the proper and complete transmission 
of the information contained in this communication nor for any delay in its 
receipt.


blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Todd
Hi,
I would like to ask if there are some blogs/articles/videos on how to analyse Spark 
performance at runtime, e.g., tools that can be used or something related.


Re: blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Gourav Sengupta
Excellent resource: http://www.oreilly.com/pub/e/3330

And more amazing is the fact that the presenter actually responds to your
questions.

Regards,
Gourav Sengupta

On Wed, Aug 19, 2015 at 4:12 PM, Todd bit1...@163.com wrote:

 Hi,
 I would ask if there are some blogs/articles/videos on how to analyse
 spark performance during runtime,eg, tools that can be used or something
 related.



Re: blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Igor Berman
you don't need to register, search in youtube for this video...

On 19 August 2015 at 18:34, Gourav Sengupta gourav.sengu...@gmail.com
wrote:

 Excellent resource: http://www.oreilly.com/pub/e/3330

 And more amazing is the fact that the presenter actually responds to your
 questions.

 Regards,
 Gourav Sengupta

 On Wed, Aug 19, 2015 at 4:12 PM, Todd bit1...@163.com wrote:

 Hi,
 I would ask if there are some blogs/articles/videos on how to analyse
 spark performance during runtime,eg, tools that can be used or something
 related.





RE: Spark performance

2015-07-13 Thread Mohammed Guller
Good points, Michael.

The underlying assumption in my statement is that cost is an issue. If cost is 
not an issue and the only requirement is to query structured data, then there 
are several databases such as Teradata, Exadata, and Vertica that can handle 
4-6 TB of data and outperform Spark.

Mohammed

From: Michael Segel [mailto:msegel_had...@hotmail.com]
Sent: Sunday, July 12, 2015 6:59 AM
To: Mohammed Guller
Cc: David Mitchell; Roman Sokolov; user; Ravisankar Mani
Subject: Re: Spark performance

Not necessarily.

It depends on the use case and what you intend to do with the data.

4-6 TB will easily fit on an SMP box and can be efficiently searched by an 
RDBMS.
Again it depends on what you want to do and how you want to do it.

Informix’s IDS engine with its extensibility could still outperform spark in 
some use cases based on the proper use of indexes and amount of parallelism.

There is a lot of cross over… now had you said 100TB+ on unstructured data… 
things may be different.

Please understand that what would make spark more compelling is the TCO of the 
solution when compared to SMP boxes and software licensing.

Its not that I don’t disagree with your statements, because moving from mssql 
or any small RDBMS to spark … doesn’t make a whole lot of sense.
Just wanted to add that the decision isn’t as cut and dry as some think….

On Jul 11, 2015, at 8:47 AM, Mohammed Guller 
moham...@glassbeam.com wrote:

Hi Roman,
Yes, Spark SQL will be a better solution than standard RDBMS databases for 
querying 4-6 TB data. You can pair Spark SQL with HDFS+Parquet to build a 
powerful analytics solution.

Mohammed

From: David Mitchell [mailto:jdavidmitch...@gmail.com]
Sent: Saturday, July 11, 2015 7:10 AM
To: Roman Sokolov
Cc: Mohammed Guller; user; Ravisankar Mani
Subject: Re: Spark performance

You can certainly query over 4 TB of data with Spark.  However, you will get an 
answer in minutes or hours, not in milliseconds or seconds.  OLTP databases are 
used for web applications, and typically return responses in milliseconds.  
Analytic databases tend to operate on large data sets, and return responses in 
seconds, minutes or hours.  When running batch jobs over large data sets, Spark 
can be a replacement for analytic databases like Greenplum or Netezza.



On Sat, Jul 11, 2015 at 8:53 AM, Roman Sokolov 
ole...@gmail.com wrote:
Hello. Had the same question. What if I need to store 4-6 Tb and do queries? 
Can't find any clue in documentation.
On 11.07.2015 03:28, Mohammed Guller 
moham...@glassbeam.com wrote:
Hi Ravi,
First, Neither Spark nor Spark SQL is a database. Both are compute engines, 
which need to be paired with a storage system. Seconds, they are designed for 
processing large distributed datasets. If you have only 100,000 records or even 
a million records, you don’t need Spark. A RDBMS will perform much better for 
that volume of data.

Mohammed

From: Ravisankar Mani [mailto:rrav...@gmail.com]
Sent: Friday, July 10, 2015 3:50 AM
To: user@spark.apache.org
Subject: Spark performance

Hi everyone,
I have planned to move mssql server to spark?.  I have using around 50,000 to 
1l records.
 The spark performance is slow when compared to mssql server.

What is the best data base(Spark or sql) to store or retrieve data around 
50,000 to 1l records ?
regards,
Ravi




--
### Confidential e-mail, for recipient's (or recipients') eyes only, not for 
distribution. ###



Re: Spark performance

2015-07-12 Thread santoshv98
Ravi


Spark (or, for that matter, Big Data solutions like Hive) is suited for large 
analytical loads, where "scaling up" starts to pale in comparison to 
"scaling out" with regard to performance, versatility (types of data) and cost. 
Without going into the details of MSSQL architecture, there is an inflection 
point in terms of cost (licensing), performance and maintainability where an 
open-source commodity platform starts to become viable, albeit sometimes at the 
expense of slower performance. With 1 million records, I am not sure you are 
reaching that point to justify a Spark cluster. So why are you planning to move 
away from MSSQL to Spark as the destination platform?


You said “Spark performance” is slow as compared to MSSql. What kind of load 
are you running and what kind of querying are you performing? There may be 
startup costs associated with running the Map side of the querying.


If you're testing to understand Spark, can you post what you are currently doing 
(queries, table structures, compression and storage optimizations)? That way, 
we could look at suggesting optimizations but again, not to compare with MsSQL, 
but to improve Spark side of things.


Again, to quote someone who answered earlier in the thread, What is your ‘Use 
case’? 


-Santosh






Sent from Windows Mail





From: Jörn Franke
Sent: Saturday, July 11, 2015 8:20 PM
To: Mohammed Guller, Ravisankar Mani, user@spark.apache.org





Honestly you are addressing this wrongly - you do not seem.to have a business 
case for changing - so why do you want to switch 




On Sat, 11 Jul 2015 at 03:28, Mohammed Guller moham...@glassbeam.com wrote:





Hi Ravi,

First, Neither Spark nor Spark SQL is a database. Both are compute engines, 
which need to be paired with a storage system. Seconds, they are designed for 
processing large distributed datasets. If you have only 100,000 records or even 
a million records, you don’t need Spark. A RDBMS will perform much better for 
that volume of data.

 

Mohammed

 

From: Ravisankar Mani [mailto:rrav...@gmail.com] 
Sent: Friday, July 10, 2015 3:50 AM
To: user@spark.apache.org
Subject: Spark performance



 





Hi everyone,


I have planned to move mssql server to spark?.  I have using around 50,000 to 
1l records.


 The spark performance is slow when compared to mssql server.


 

What is the best data base(Spark or sql) to store or retrieve data around 
50,000 to 1l records ?

regards,

Ravi

Re: Spark performance

2015-07-11 Thread Jörn Franke
What is your business case for the move?

On Fri, 10 Jul 2015 at 12:49, Ravisankar Mani rrav...@gmail.com wrote:

 Hi everyone,

 I have planned to move mssql server to spark?.  I have using around 50,000
 to 1l records.
  The spark performance is slow when compared to mssql server.

 What is the best data base(Spark or sql) to store or retrieve data around
 50,000 to 1l records ?

 regards,
 Ravi




Re: Spark performance

2015-07-11 Thread David Mitchell
You can certainly query over 4 TB of data with Spark.  However, you will
get an answer in minutes or hours, not in milliseconds or seconds.  OLTP
databases are used for web applications, and typically return responses in
milliseconds.  Analytic databases tend to operate on large data sets, and
return responses in seconds, minutes or hours.  When running batch jobs
over large data sets, Spark can be a replacement for analytic databases
like Greenplum or Netezza.



On Sat, Jul 11, 2015 at 8:53 AM, Roman Sokolov ole...@gmail.com wrote:

 Hello. Had the same question. What if I need to store 4-6 Tb and do
 queries? Can't find any clue in documentation.
  On 11.07.2015 03:28, Mohammed Guller moham...@glassbeam.com wrote:

  Hi Ravi,

 First, Neither Spark nor Spark SQL is a database. Both are compute
 engines, which need to be paired with a storage system. Seconds, they are
 designed for processing large distributed datasets. If you have only
 100,000 records or even a million records, you don’t need Spark. A RDBMS
 will perform much better for that volume of data.



 Mohammed



 *From:* Ravisankar Mani [mailto:rrav...@gmail.com]
 *Sent:* Friday, July 10, 2015 3:50 AM
 *To:* user@spark.apache.org
 *Subject:* Spark performance



 Hi everyone,

 I have planned to move mssql server to spark?.  I have using around
 50,000 to 1l records.

  The spark performance is slow when compared to mssql server.



 What is the best data base(Spark or sql) to store or retrieve data around
 50,000 to 1l records ?

 regards,

 Ravi






-- 
### Confidential e-mail, for recipient's (or recipients') eyes only, not
for distribution. ###


RE: Spark performance

2015-07-11 Thread Roman Sokolov
Hello. Had the same question. What if I need to store 4-6 Tb and do
queries? Can't find any clue in documentation.
On 11.07.2015 03:28, Mohammed Guller moham...@glassbeam.com wrote:

  Hi Ravi,

 First, Neither Spark nor Spark SQL is a database. Both are compute
 engines, which need to be paired with a storage system. Seconds, they are
 designed for processing large distributed datasets. If you have only
 100,000 records or even a million records, you don’t need Spark. A RDBMS
 will perform much better for that volume of data.



 Mohammed



 *From:* Ravisankar Mani [mailto:rrav...@gmail.com]
 *Sent:* Friday, July 10, 2015 3:50 AM
 *To:* user@spark.apache.org
 *Subject:* Spark performance



 Hi everyone,

 I have planned to move mssql server to spark?.  I have using around 50,000
 to 1l records.

  The spark performance is slow when compared to mssql server.



 What is the best data base(Spark or sql) to store or retrieve data around
 50,000 to 1l records ?

 regards,

 Ravi





RE: Spark performance

2015-07-11 Thread Mohammed Guller
Hi Roman,
Yes, Spark SQL will be a better solution than standard RDBMS databases for 
querying 4-6 TB data. You can pair Spark SQL with HDFS+Parquet to build a 
powerful analytics solution.
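
A rough sketch of that pairing (the paths, table and column names are
hypothetical, and a Spark 1.4-era SQLContext is assumed):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Parquet files already sitting on HDFS
    val logs = sqlContext.read.parquet("hdfs:///warehouse/logs")
    logs.registerTempTable("logs")

    sqlContext.sql(
      "SELECT host, COUNT(*) AS hits FROM logs GROUP BY host ORDER BY hits DESC LIMIT 20"
    ).show()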

Mohammed

From: David Mitchell [mailto:jdavidmitch...@gmail.com]
Sent: Saturday, July 11, 2015 7:10 AM
To: Roman Sokolov
Cc: Mohammed Guller; user; Ravisankar Mani
Subject: Re: Spark performance

You can certainly query over 4 TB of data with Spark.  However, you will get an 
answer in minutes or hours, not in milliseconds or seconds.  OLTP databases are 
used for web applications, and typically return responses in milliseconds.  
Analytic databases tend to operate on large data sets, and return responses in 
seconds, minutes or hours.  When running batch jobs over large data sets, Spark 
can be a replacement for analytic databases like Greenplum or Netezza.



On Sat, Jul 11, 2015 at 8:53 AM, Roman Sokolov 
ole...@gmail.com wrote:

Hello. Had the same question. What if I need to store 4-6 Tb and do queries? 
Can't find any clue in documentation.
On 11.07.2015 03:28, Mohammed Guller 
moham...@glassbeam.com wrote:
Hi Ravi,
First, Neither Spark nor Spark SQL is a database. Both are compute engines, 
which need to be paired with a storage system. Seconds, they are designed for 
processing large distributed datasets. If you have only 100,000 records or even 
a million records, you don’t need Spark. A RDBMS will perform much better for 
that volume of data.

Mohammed

From: Ravisankar Mani [mailto:rrav...@gmail.com]
Sent: Friday, July 10, 2015 3:50 AM
To: user@spark.apache.org
Subject: Spark performance

Hi everyone,
I have planned to move mssql server to spark?.  I have using around 50,000 to 
1l records.
 The spark performance is slow when compared to mssql server.

What is the best data base(Spark or sql) to store or retrieve data around 
50,000 to 1l records ?
regards,
Ravi




--
### Confidential e-mail, for recipient's (or recipients') eyes only, not for 
distribution. ###


Re: Spark performance

2015-07-11 Thread Jörn Franke
Honestly, you are addressing this wrongly - you do not seem to have a
business case for changing - so why do you want to switch?

On Sat, 11 Jul 2015 at 03:28, Mohammed Guller moham...@glassbeam.com
wrote:

  Hi Ravi,

 First, Neither Spark nor Spark SQL is a database. Both are compute
 engines, which need to be paired with a storage system. Seconds, they are
 designed for processing large distributed datasets. If you have only
 100,000 records or even a million records, you don’t need Spark. A RDBMS
 will perform much better for that volume of data.



 Mohammed



 *From:* Ravisankar Mani [mailto:rrav...@gmail.com]
 *Sent:* Friday, July 10, 2015 3:50 AM
 *To:* user@spark.apache.org
 *Subject:* Spark performance



 Hi everyone,

 I have planned to move mssql server to spark?.  I have using around 50,000
 to 1l records.

  The spark performance is slow when compared to mssql server.



 What is the best data base(Spark or sql) to store or retrieve data around
 50,000 to 1l records ?

 regards,

 Ravi





Re: Spark performance

2015-07-11 Thread Jörn Franke
On Sat, 11 Jul 2015 at 14:53, Roman Sokolov ole...@gmail.com wrote:

 Hello. Had the same question. What if I need to store 4-6 Tb and do
 queries? Can't find any clue in documentation.
  On 11.07.2015 03:28, Mohammed Guller moham...@glassbeam.com wrote:

  Hi Ravi,

 First, Neither Spark nor Spark SQL is a database. Both are compute
 engines, which need to be paired with a storage system. Seconds, they are
 designed for processing large distributed datasets. If you have only
 100,000 records or even a million records, you don’t need Spark. A RDBMS
 will perform much better for that volume of data.



 Mohammed



 *From:* Ravisankar Mani [mailto:rrav...@gmail.com]
 *Sent:* Friday, July 10, 2015 3:50 AM
 *To:* user@spark.apache.org
 *Subject:* Spark performance



 Hi everyone,

 I have planned to move mssql server to spark?.  I have using around
 50,000 to 1l records.

  The spark performance is slow when compared to mssql server.



 What is the best data base(Spark or sql) to store or retrieve data around
 50,000 to 1l records ?

 regards,

 Ravi






Spark performance

2015-07-10 Thread Ravisankar Mani
Hi everyone,

I have planned to move from MSSQL Server to Spark. I am using around 50,000
to 1 lakh (100,000) records.
 The Spark performance is slow when compared to MSSQL Server.

What is the best database (Spark or SQL) in which to store or retrieve around
50,000 to 1 lakh records?

regards,
Ravi


RE: Spark performance

2015-07-10 Thread Mohammed Guller
Hi Ravi,
First, neither Spark nor Spark SQL is a database. Both are compute engines, 
which need to be paired with a storage system. Second, they are designed for 
processing large distributed datasets. If you have only 100,000 records or even 
a million records, you don’t need Spark. An RDBMS will perform much better for 
that volume of data.

Mohammed

From: Ravisankar Mani [mailto:rrav...@gmail.com]
Sent: Friday, July 10, 2015 3:50 AM
To: user@spark.apache.org
Subject: Spark performance

Hi everyone,
I have planned to move mssql server to spark?.  I have using around 50,000 to 
1l records.
 The spark performance is slow when compared to mssql server.

What is the best data base(Spark or sql) to store or retrieve data around 
50,000 to 1l records ?
regards,
Ravi



Re: Spark performance issue

2015-07-03 Thread Silvio Fiorito
It’ll help to see the code or at least understand what transformations you’re 
using.

Also, you have 15 nodes but not using all of them, so that means you may be 
losing data locality. You can see this in the job UI for Spark if any jobs do 
not have node or process local.
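
For what it is worth, a submission sized to put an executor on each of the 15
nodes (the class and jar names are placeholders; the other values mirror the
flags quoted below) would look something like:

    spark-submit --class com.example.MyJob --master yarn-cluster \
      --executor-memory 6G --num-executors 15 --executor-cores 2 --driver-memory 6G \
      --conf spark.storage.memoryFraction=0.3 myjob.jar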

From: diplomatic Guru
Date: Friday, July 3, 2015 at 8:58 AM
To: user@spark.apache.org
Subject: Spark performance issue

Hello guys,

I'm after some advice on Spark performance.

I've a MapReduce job that read inputs carry out a simple calculation and write 
the results into HDFS. I've implemented the same logic in Spark job.

When I tried both jobs on same datasets, I'm getting different execution time, 
which is expected.

BUT
..
In my example, MapReduce job is performing much better than Spark.

The difference is that I'm not changing much with the MR job configuration, 
e.g., memory, cores, etc...But this is not the case with Spark as it's very 
flexible. So I'm sure my configuration isn't correct which is why MR is 
outperforming Spark but need your advice.

For example:

Test 1:
4.5GB data -  MR job took ~55 seconds to compute, but Spark took ~3 minutes and 
20 seconds.

Test 2:
25GB data -MR took 2 minutes and 15 seconds, whereas Spark job is still 
running, and it's already been 15 minutes.


I have a cluster of 15 nodes. The maximum memory that I could allocate to each 
executor is 6GB. Therefore, for Test 1, this is the config I used:

--executor-memory 6G --num-executors 4 --driver-memory 6G  --executor-cores 2 
(also I set spark.storage.memoryFraction to 0.3)


For Test 2:
--executor-memory 6G --num-executors 10 --driver-memory 6G  --executor-cores 2 
(also I set spark.storage.memoryFraction to 0.3)

I tried all possible combination but couldn't get better performance. Any 
suggestions will be much appreciated.








Spark performance issue

2015-07-03 Thread diplomatic Guru
Hello guys,

I'm after some advice on Spark performance.

I've a MapReduce job that reads inputs, carries out a simple calculation and
writes the results into HDFS. I've implemented the same logic in a Spark job.

When I tried both jobs on same datasets, I'm getting different execution
time, which is expected.

BUT
..
In my example, MapReduce job is performing much better than Spark.

The difference is that I'm not changing much with the MR job configuration,
e.g., memory, cores, etc. But this is not the case with Spark, as it's very
flexible. So I'm sure my configuration isn't correct, which is why MR is
outperforming Spark, but I need your advice.

For example:

Test 1:
4.5GB data -  MR job took ~55 seconds to compute, but Spark took ~3 minutes
and 20 seconds.

Test 2:
25GB data -MR took 2 minutes and 15 seconds, whereas Spark job is still
running, and it's already been 15 minutes.


I have a cluster of 15 nodes. The maximum memory that I could allocate to
each executor is 6GB. Therefore, for Test 1, this is the config I used:

--executor-memory 6G --num-executors 4 --driver-memory 6G  --executor-cores
2 (also I set spark.storage.memoryFraction to 0.3)


For Test 2:
--executor-memory 6G --num-executors 10 --driver-memory 6G
 --executor-cores 2 (also I set spark.storage.memoryFraction to 0.3)

I tried all possible combination but couldn't get better performance. Any
suggestions will be much appreciated.


Does spark performance really scale out with multiple machines?

2015-06-15 Thread Wang, Ningjun (LNG-NPV)
I am trying to measure how Spark standalone cluster performance scales out with 
multiple machines. I did a test of training an SVM model, which is heavy in 
memory computation. I measured the run time for a Spark standalone cluster of 1 - 
3 nodes; the results are as follows:

1 node: 35 minutes
2 nodes: 30.1 minutes
3 nodes: 30.8 minutes

So the speed does not seem to increase much with more machines. I know there 
is overhead for coordinating tasks among different machines. It seems to me the 
overhead is over 30% of the total run time.

Is this typical? Does anybody see significant performance increase with more 
machines? Is there anything I can tune my spark cluster to make it scale out 
with more machines?

Thanks
Ningjun



Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
There are a lot of variables to consider. I'm not an expert on Spark, and
my ML knowledge is rudimentary at best, but here are some questions whose
answers might help us to help you:

   - What type of Spark cluster are you running (e.g., Stand-alone, Mesos,
   YARN)?
   - What does the HTTP UI tell you in terms of number of stages / tasks,
   number of executors, and task execution time / memory used / amount of data
   shuffled over the network?

As I said, I'm not all that familiar with the ML side of Spark, but in
general, if I were adding more resources, and not seeing an improvement,
here are a few things I would consider:

   1. Is your data set partitioned to allow the parallelism you are
   seeking? Spark's parallelism comes from processing RDD partitions in
   parallel, not processing individual RDD items in parallel; if you don't
   have enough partitions to take advantage of the extra hardware, you will
   see no benefit from adding capacity to your cluster.
   2. Do you have enough Spark executors to process your partitions in
   parallel? This depends on  your configuration and on your cluster type
   (doubtful this is an issue here, since you are adding more executors and
   seeing very little benefit).
   3. Are your partitions small enough (and/or your executor memory
   configuration large enough) so that each partition fits into the memory of
   an executor? If not, you will be constantly spilling to disk, which will
   have a severe impact on performance.
   4. Are you shuffling over the network? If so, how frequently and how
   much? Are you using efficient serialization (e.g., Kryo) and registering
   your serialized classes in order to minimize shuffle overhead?

There are plenty more variables, and some very good performance tuning
documentation https://spark.apache.org/docs/latest/tuning.html is
available. Without any more information to go on, my best guess would be
that you hit your maximum level of parallelism with the addition of the
second node (and even that was not fully utilized), and thus you see no
difference when adding a third node.
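
To make points 1 and 4 concrete, a sketch only (the HDFS path and partition
count are placeholders, and registering MLlib's LabeledPoint is just an example
of a class worth registering for an SVM training job):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.regression.LabeledPoint

    val conf = new SparkConf()
      .setAppName("svm-scaling-test")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[LabeledPoint]))
    val sc = new SparkContext(conf)

    // Aim for a few partitions per core across the cluster (example: 3 nodes x 4 cores)
    val data = sc.textFile("hdfs:///training/data.txt", 48)
    // ...or repartition an RDD that already exists:
    val repartitioned = data.repartition(48)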

Regards,
Will


On Mon, Jun 15, 2015 at 1:29 PM, Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:

  I try to measure how spark standalone cluster performance scale out with
 multiple machines. I did a test of training the SVM model which is heavy in
 memory computation. I measure the run time for spark standalone cluster of
 1 – 3 nodes, the result is following



 1 node: 35 minutes

 2 nodes: 30.1 minutes

 3 nodes: 30.8 minutes



 So the speed does not seems to increase much with more machines. I know
 there are overhead for coordinating tasks among different machines. Seem to
 me the overhead is over 30% of the total run time.



 Is this typical? Does anybody see significant performance increase with
 more machines? Is there anything I can tune my spark cluster to make it
 scale out with more machines?



 Thanks

 Ningjun





Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
I just wanted to clarify - when I said you hit your maximum level of
parallelism, I meant that the default number of partitions might not be
large enough to take advantage of more hardware, not that there was no way
to increase your parallelism - the documentation I linked gives a few
suggestions on how to increase the number of partitions.

-Will



Re: Spark performance in cluster mode using yarn

2015-05-15 Thread Sachin Singh
Hi Ayan,
I am asking about general scenarios for the given info/configuration, from
experts, not specifics.
The Java code does nothing more than get a Hive context and run a select query;
there is no serialization or any other complexity - just about 10 straightforward
lines of code.
Please suggest if anyone has an idea.

Regards
Sachin

On Fri, May 15, 2015 at 6:57 AM, ayan guha guha.a...@gmail.com wrote:

 With this information it is hard to predict. What's the performance you
 are getting? What's your desired performance? Maybe you can post your code
  and experts can suggest improvements?
 On 14 May 2015 15:02, sachin Singh sachin.sha...@gmail.com wrote:

 Hi Friends,
 please someone can give the idea, Ideally what should be time(complete job
 execution) for spark job,

 I have data in a hive table, amount of data would be 1GB , 2 lacs rows for
 whole month,
 I want to do monthly aggregation, using SQL queries,groupby

 I have only one node,1 cluster,below configuration for running job,
 --num-executors 2 --driver-memory 3g --driver-java-options
 -XX:MaxPermSize=1G --executor-memory 2g --executor-cores 2

 how much approximate time require to finish the job,

 or can someone suggest the best way to get quickly results,

 Thanks in advance,







Re: Spark performance in cluster mode using yarn

2015-05-14 Thread ayan guha
With this information it is hard to predict. What's the performance you are
getting? What's your desired performance? Maybe you can post your code and
experts can suggest improvements?
On 14 May 2015 15:02, sachin Singh sachin.sha...@gmail.com wrote:

 Hi Friends,
 please someone can give the idea, Ideally what should be time(complete job
 execution) for spark job,

 I have data in a hive table, amount of data would be 1GB , 2 lacs rows for
 whole month,
 I want to do monthly aggregation, using SQL queries,groupby

 I have only one node,1 cluster,below configuration for running job,
 --num-executors 2 --driver-memory 3g --driver-java-options
 -XX:MaxPermSize=1G --executor-memory 2g --executor-cores 2

 how much approximate time require to finish the job,

 or can someone suggest the best way to get quickly results,

 Thanks in advance,







Spark performance in cluster mode using yarn

2015-05-13 Thread sachin Singh
Hi Friends,
could someone please give an idea of what the time (complete job
execution) for a Spark job should ideally be?

I have data in a Hive table; the amount of data is about 1GB, roughly 2 lakh (200,000) rows for
the whole month, and I want to do a monthly aggregation using SQL queries with
group-by (a sketch follows below).

I have only one node in the cluster, with the configuration below for running the job:
--num-executors 2 --driver-memory 3g --driver-java-options
-XX:MaxPermSize=1G --executor-memory 2g --executor-cores 2

Approximately how much time should the job require to finish?

Or can someone suggest the best way to get results quickly?

Thanks in advance,
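
A rough sketch of how such a monthly group-by might look through HiveContext. The table name, column names, and shuffle-partition value below are illustrative assumptions, not from the original post:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class MonthlyAggregationSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("monthly-aggregation-sketch")
                // For a ~1GB input, the default 200 shuffle partitions is usually far
                // more than needed; a smaller value cuts per-task scheduling overhead.
                .set("spark.sql.shuffle.partitions", "8");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        HiveContext hive = new HiveContext(jsc.sc());

        // Hypothetical table and columns; the aggregation runs as ordinary Spark stages.
        long groups = hive.sql(
                "SELECT month, COUNT(*) AS row_count, SUM(amount) AS total "
                + "FROM events GROUP BY month").count();
        System.out.println("aggregated groups: " + groups);

        jsc.stop();
    }
}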






Re: Spark Performance on Yarn

2015-04-22 Thread Ted Yu
In master branch, overhead is now 10%. 
That would be 500 MB 

FYI



 On Apr 22, 2015, at 8:26 AM, nsalian neeleshssal...@gmail.com wrote:
 
 +1 to executor-memory to 5g.
 Do check the overhead space for both the driver and the executor as per
 Wilfred's suggestion.
 
 Typically, 384 MB should suffice.
 
 
 



Re: Spark Performance on Yarn

2015-04-22 Thread nsalian
+1 to executor-memory to 5g.
Do check the overhead space for both the driver and the executor as per
Wilfred's suggestion.

Typically, 384 MB should suffice.






Re: Spark Performance on Yarn

2015-04-22 Thread Neelesh Salian
Does it still hit the memory limit for the container?

An expensive transformation?

On Wed, Apr 22, 2015 at 8:45 AM, Ted Yu yuzhih...@gmail.com wrote:

 In master branch, overhead is now 10%.
 That would be 500 MB

 FYI



  On Apr 22, 2015, at 8:26 AM, nsalian neeleshssal...@gmail.com wrote:
 
  +1 to executor-memory to 5g.
  Do check the overhead space for both the driver and the executor as per
  Wilfred's suggestion.
 
  Typically, 384 MB should suffice.
 
 
 
 



Re: Spark Performance on Yarn

2015-04-21 Thread hnahak
Try --executor-memory 5g, because you have 8 GB of RAM in each machine.






Re: Spark Performance on Yarn

2015-04-20 Thread Peng Cheng
I got exactly the same problem, except that I'm running on a standalone
master. Can you tell me the counterpart parameter on standalone master for
increasing the same memory overhead?






RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-17 Thread Evo Eftimov
And btw if you suspect this is a YARN issue you can always launch and use
Spark in a Standalone Mode which uses its own embedded cluster resource
manager - this is possible even when Spark has been deployed on CDH under
YARN by the pre-canned install  scripts of CDH

 

To achieve that:

 

1.   Launch spark in a standalone mode using its shell scripts - you may
get some script errors initially because of some mess in the scripts created
by the pre-canned CDH YARN install - which you can fix by editing the spark
standalone scripts - the error messages will guide you 

2.   Submit a spark job to the standalone spark master rather than YARN
and this is it 

3.   Measure and compare the performance under YARN, Spark Standalone on
Cluster and Spark Standalone on a single machine  

 

Bear in mind that running Spark in  Standalone mode while using YARN for all
other apps would not be very appropriate in production because the two
resource managers will be competing for cluster resources - but you can use
this for performance tests  

 

From: Evo Eftimov [mailto:evo.efti...@isecc.com] 
Sent: Thursday, April 16, 2015 6:28 PM
To: 'Manish Gupta 8'; 'user@spark.apache.org'
Subject: RE: General configurations on CDH5 to achieve maximum Spark
Performance

 

Essentially, to change the performance yield of a software cluster
infrastructure platform like Spark, you play with different permutations of:

 

-  Number of CPU cores used by Spark Executors on every cluster node

-  Amount of RAM allocated for each executor   

 

How disk and network IO is used also plays a role, but that is influenced
more by app algorithmic aspects than by the YARN / Spark cluster config
(except rack awareness etc.)

 

When Spark runs under the management of YARN the above is controlled /
allocated by YARN 

 

https://spark.apache.org/docs/latest/running-on-yarn.html 

 

From: Manish Gupta 8 [mailto:mgupt...@sapient.com] 
Sent: Thursday, April 16, 2015 6:21 PM
To: Evo Eftimov; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark
Performance

 

Thanks Evo. Yes, my concern is only regarding the infrastructure
configurations. Basically, configuring Yarn (Node manager) + Spark is must
and default setting never works. And what really happens, is we make changes
as and when an issue is faced because of one of the numerous default
configuration settings. And every time, we have to google a lot to decide on
the right values :)

 

Again, my issue is very centric to running Spark on Yarn in CDH5
environment.

 

If you know a link that talks about optimum configuration settings for
running Spark on Yarn (CDH5), please share the same. 

 

Thanks,

Manish

 

From: Evo Eftimov [mailto:evo.efti...@isecc.com] 
Sent: Thursday, April 16, 2015 10:38 PM
To: Manish Gupta 8; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark
Performance

 

Well there are a number of performance tuning guidelines in dedicated
sections of the spark documentation - have you read and applied them 

 

Secondly any performance problem within a distributed cluster environment
has two aspects:

 

1.   Infrastructure 

2.   App Algorithms 

 

You seem to be focusing only on 1, but what you said about the performance
differences between single laptop and cluster points to potential
algorithmic inefficiency in your app when e.g. distributing and performing
parallel processing and data. On a single laptop data moves instantly
between workers because all worker instances run in the memory of a single
machine ..

 

Regards,

Evo Eftimov  

 

From: Manish Gupta 8 [mailto:mgupt...@sapient.com] 
Sent: Thursday, April 16, 2015 6:03 PM
To: user@spark.apache.org
Subject: General configurations on CDH5 to achieve maximum Spark Performance

 

Hi,

 

Is there a document/link that describes the general configuration settings
to achieve maximum Spark Performance while running on CDH5? In our
environment, we did lot of changes (and still doing it) to get decent
performance otherwise our 6 node dev cluster with default configurations,
lags behind a single laptop running Spark.

 

Having a standard checklist (taking a base node size of 4-CPU, 16GB RAM)
would be really great. Any pointers in this regards will be really helpful.

 

We are running Spark 1.2.0 on CDH 5.3.0.

 

Thanks,

 

Manish Gupta


General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Manish Gupta 8
Hi,

Is there a document/link that describes the general configuration settings to 
achieve maximum Spark Performance while running on CDH5? In our environment, we 
did lot of changes (and still doing it) to get decent performance otherwise our 
6 node dev cluster with default configurations, lags behind a single laptop 
running Spark.

Having a standard checklist (taking a base node size of 4-CPU, 16GB RAM) would 
be really great. Any pointers in this regards will be really helpful.

We are running Spark 1.2.0 on CDH 5.3.0.

Thanks,

Manish Gupta



RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
Well there are a number of performance tuning guidelines in dedicated
sections of the spark documentation - have you read and applied them 

 

Secondly any performance problem within a distributed cluster environment
has two aspects:

 

1.   Infrastructure 

2.   App Algorithms 

 

You seem to be focusing only on 1, but what you said about the performance
differences between single laptop and cluster points to potential
algorithmic inefficiency in your app when e.g. distributing and performing
parallel processing and data. On a single laptop data moves instantly
between workers because all worker instances run in the memory of a single
machine ..

 

Regards,

Evo Eftimov  

 

From: Manish Gupta 8 [mailto:mgupt...@sapient.com] 
Sent: Thursday, April 16, 2015 6:03 PM
To: user@spark.apache.org
Subject: General configurations on CDH5 to achieve maximum Spark Performance

 

Hi,

 

Is there a document/link that describes the general configuration settings
to achieve maximum Spark Performance while running on CDH5? In our
environment, we did lot of changes (and still doing it) to get decent
performance otherwise our 6 node dev cluster with default configurations,
lags behind a single laptop running Spark.

 

Having a standard checklist (taking a base node size of 4-CPU, 16GB RAM)
would be really great. Any pointers in this regards will be really helpful.

 

We are running Spark 1.2.0 on CDH 5.3.0.

 

Thanks,

 

Manish Gupta


 



RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Manish Gupta 8
Thanks Evo. Yes, my concern is only regarding the infrastructure
configurations. Basically, configuring YARN (Node Manager) + Spark is a must, and
the default settings never work. What really happens is that we make changes as and
when an issue is faced because of one of the numerous default configuration
settings, and every time we have to google a lot to decide on the right values
:)

Again, my issue is very centric to running Spark on Yarn in CDH5 environment.

If you know a link that talks about optimum configuration settings for running 
Spark on Yarn (CDH5), please share the same.

Thanks,
Manish

From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 10:38 PM
To: Manish Gupta 8; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance

Well there are a number of performance tuning guidelines in dedicated sections 
of the spark documentation - have you read and applied them

Secondly any performance problem within a distributed cluster environment has 
two aspects:


1.   Infrastructure

2.   App Algorithms

You seem to be focusing only on 1, but what you said about the performance 
differences between single laptop and cluster points to potential algorithmic 
inefficiency in your app when e.g. distributing and performing parallel 
processing and data. On a single laptop data moves instantly between workers 
because all worker instances run in the memory of a single machine 

Regards,
Evo Eftimov

From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Thursday, April 16, 2015 6:03 PM
To: user@spark.apache.orgmailto:user@spark.apache.org
Subject: General configurations on CDH5 to achieve maximum Spark Performance

Hi,

Is there a document/link that describes the general configuration settings to 
achieve maximum Spark Performance while running on CDH5? In our environment, we 
did lot of changes (and still doing it) to get decent performance otherwise our 
6 node dev cluster with default configurations, lags behind a single laptop 
running Spark.

Having a standard checklist (taking a base node size of 4-CPU, 16GB RAM) would 
be really great. Any pointers in this regards will be really helpful.

We are running Spark 1.2.0 on CDH 5.3.0.

Thanks,

Manish Gupta



RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
Essentially, to change the performance yield of a software cluster
infrastructure platform like Spark, you play with different permutations of:

 

-  Number of CPU cores used by Spark Executors on every cluster node

-  Amount of RAM allocated for each executor   

 

How disk and network IO is used also plays a role, but that is influenced
more by app algorithmic aspects than by the YARN / Spark cluster config
(except rack awareness etc.)

 

When Spark runs under the management of YARN the above is controlled /
allocated by YARN 

 

https://spark.apache.org/docs/latest/running-on-yarn.html 
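
As a minimal illustration of those two knobs when submitting under YARN; the numbers are placeholders, not recommendations:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class YarnSizingSketch {
    public static void main(String[] args) {
        // Equivalent to: spark-submit --master yarn --num-executors 4
        //                             --executor-cores 2 --executor-memory 4g ...
        SparkConf conf = new SparkConf()
                .setAppName("yarn-sizing-sketch")
                .set("spark.executor.instances", "4")   // how many executor containers YARN starts
                .set("spark.executor.cores", "2")       // CPU cores per executor
                .set("spark.executor.memory", "4g");    // heap per executor
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... application logic ...
        sc.stop();
    }
}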

 

From: Manish Gupta 8 [mailto:mgupt...@sapient.com] 
Sent: Thursday, April 16, 2015 6:21 PM
To: Evo Eftimov; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark
Performance

 

Thanks Evo. Yes, my concern is only regarding the infrastructure
configurations. Basically, configuring Yarn (Node manager) + Spark is must
and default setting never works. And what really happens, is we make changes
as and when an issue is faced because of one of the numerous default
configuration settings. And every time, we have to google a lot to decide on
the right values :)

 

Again, my issue is very centric to running Spark on Yarn in CDH5
environment.

 

If you know a link that talks about optimum configuration settings for
running Spark on Yarn (CDH5), please share the same. 

 

Thanks,

Manish

 

From: Evo Eftimov [mailto:evo.efti...@isecc.com] 
Sent: Thursday, April 16, 2015 10:38 PM
To: Manish Gupta 8; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark
Performance

 

Well there are a number of performance tuning guidelines in dedicated
sections of the spark documentation - have you read and applied them 

 

Secondly any performance problem within a distributed cluster environment
has two aspects:

 

1.   Infrastructure 

2.   App Algorithms 

 

You seem to be focusing only on 1, but what you said about the performance
differences between single laptop and cluster points to potential
algorithmic inefficiency in your app when e.g. distributing and performing
parallel processing and data. On a single laptop data moves instantly
between workers because all worker instances run in the memory of a single
machine ..

 

Regards,

Evo Eftimov  

 

From: Manish Gupta 8 [mailto:mgupt...@sapient.com] 
Sent: Thursday, April 16, 2015 6:03 PM
To: user@spark.apache.org
Subject: General configurations on CDH5 to achieve maximum Spark Performance

 

Hi,

 

Is there a document/link that describes the general configuration settings
to achieve maximum Spark Performance while running on CDH5? In our
environment, we did lot of changes (and still doing it) to get decent
performance otherwise our 6 node dev cluster with default configurations,
lags behind a single laptop running Spark.

 

Having a standard checklist (taking a base node size of 4-CPU, 16GB RAM)
would be really great. Any pointers in this regards will be really helpful.

 

We are running Spark 1.2.0 on CDH 5.3.0.

 

Thanks,

 

Manish Gupta


 



Re: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Sean Owen
I don't think there's anything specific to CDH that you need to know,
other than it ought to set things up sanely for you.

Sandy did a couple posts about tuning:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

I don't think there's such a thing as one optimal configuration. It
depends very heavily on your workload. First you need to have a look
at your app, really. All the tuning in the world isn't going to make
an unnecessary shuffle as fast as eliminating it.
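
As one concrete example of eliminating a shuffle: if one side of a join is a small lookup table, broadcasting it and joining map-side avoids shuffling the large dataset at all. The names and data below are hypothetical:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import scala.Tuple2;

public class BroadcastJoinSketch {
    // Assumes an existing JavaSparkContext `sc` and a large pair RDD `events` keyed by country code.
    static JavaPairRDD<String, String> joinWithLookup(JavaSparkContext sc,
                                                      JavaPairRDD<String, String> events) {
        // Small dimension table collected once and broadcast to every executor;
        // each task reads it locally, so `events` never has to be shuffled.
        Map<String, String> countryNames = new HashMap<>();
        countryNames.put("DE", "Germany");
        countryNames.put("IN", "India");
        final Broadcast<Map<String, String>> lookup = sc.broadcast(countryNames);

        return events.mapToPair(e -> new Tuple2<String, String>(
                e._1(),
                e._2() + "," + lookup.value().getOrDefault(e._1(), "unknown")));
    }
}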


On Thu, Apr 16, 2015 at 6:02 PM, Manish Gupta 8 mgupt...@sapient.com wrote:
 Hi,



 Is there a document/link that describes the general configuration settings
 to achieve maximum Spark Performance while running on CDH5? In our
 environment, we did lot of changes (and still doing it) to get decent
 performance otherwise our 6 node dev cluster with default configurations,
 lags behind a single laptop running Spark.



 Having a standard checklist (taking a base node size of 4-CPU, 16GB RAM)
 would be really great. Any pointers in this regards will be really helpful.



 We are running Spark 1.2.0 on CDH 5.3.0.



 Thanks,



 Manish Gupta







Spark Performance -Hive or Hbase?

2015-03-25 Thread Siddharth Ubale
Hi,

We have started R&D on Apache Spark to use features such as Spark-SQL and
Spark Streaming. I have two pain points; can any of you address them? They
are as follows:

1.   Does Spark allow us to fetch updated items after an RDD has been mapped
and a schema has been applied, or do we have to perform the RDD mapping and
apply the schema every time we run the query? In this case I am using HBase
tables to map the RDD. (A sketch of one common pattern follows this list.)

2.   Does Spark-SQL provide better performance when used with Hive or with HBase?
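
On the first question, one common pattern is to build the row RDD from HBase once, cache it, and register it as a temporary table, so repeated SQL queries reuse the cached data instead of re-mapping and re-applying the schema each time. The sketch below uses the Spark 1.3+ DataFrame API and hypothetical names:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CachedHBaseTableSketch {

    // Hypothetical bean produced from an HBase scan elsewhere in the application.
    public static class Reading implements java.io.Serializable {
        private String device;
        private double value;
        public String getDevice() { return device; }
        public void setDevice(String device) { this.device = device; }
        public double getValue() { return value; }
        public void setValue(double value) { this.value = value; }
    }

    // Map the HBase rows once, cache the result, register it, then query it repeatedly.
    static void register(SQLContext sql, JavaRDD<Reading> fromHBase) {
        DataFrame readings = sql.createDataFrame(fromHBase, Reading.class);
        readings.cache();                         // keeps the mapped rows in executor memory
        readings.registerTempTable("readings");   // later sql.sql("SELECT ...") calls reuse the cache
    }
}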


Thanks,
Siddharth Ubale,



Need some help on the Spark performance on Hadoop Yarn

2015-03-19 Thread Yi Ming Huang

Dear Spark experts, I would appreciate it if you could look into my problem and give me
some help and suggestions here... Thank you!

I have a simple Spark application to parse and analyze logs, and I can
run it on my Hadoop YARN cluster. My problem is that it
runs quite slowly on the cluster, even slower than running it just on a
single Spark machine.

This is my application sketch (a code sketch of steps 3 and 5 follows this list):
1) read in the log file and use mapToPair to transform the raw logs to my
object - Tuple2<String, LogEntry>. I use a string as the key so later I will
aggregate by the key
2) persist the RDD transformed from step 1; let me call it logObjects
3) use aggregateByKey to calculate the sum and avg value for each key. The
reason I use aggregateByKey instead of reduceByKey is that the output object
type is different
4) persist the RDD from step 3; let me call it aggregatedObjects
5) run several takeOrdered calls to get the top X values that I'm interested in
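
A hedged sketch of steps 3 and 5. The Stats accumulator, the latencyMillis field, and the top-10 ordering are illustrative assumptions, not the poster's actual classes:

import java.io.Serializable;
import java.util.Comparator;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

public class LogAggregationSketch {

    // Minimal stand-in for the poster's LogEntry class (illustrative field only).
    public static class LogEntry implements Serializable {
        public double latencyMillis;
    }

    // Hypothetical per-key accumulator: a count and a running sum, enough to derive an average.
    public static class Stats implements Serializable {
        public long count;
        public double sum;
    }

    // Step 3: fold LogEntry values into Stats per key without building per-key value lists.
    static JavaPairRDD<String, Stats> aggregate(JavaPairRDD<String, LogEntry> logObjects) {
        return logObjects.aggregateByKey(
                new Stats(),
                (stats, entry) -> {               // merge one value into the partition-local accumulator
                    stats.count += 1;
                    stats.sum += entry.latencyMillis;
                    return stats;
                },
                (a, b) -> {                       // merge accumulators coming from different partitions
                    a.count += b.count;
                    a.sum += b.sum;
                    return a;
                });
    }

    // The comparator used by takeOrdered must be Serializable, since it is shipped to executors.
    static class BySumDesc implements Comparator<Tuple2<String, Stats>>, Serializable {
        @Override
        public int compare(Tuple2<String, Stats> x, Tuple2<String, Stats> y) {
            return Double.compare(y._2().sum, x._2().sum);   // descending by aggregated sum
        }
    }

    // Step 5: several cheap top-N queries can then run against the persisted aggregate.
    static List<Tuple2<String, Stats>> topTen(JavaPairRDD<String, Stats> aggregatedObjects) {
        return aggregatedObjects.takeOrdered(10, new BySumDesc());
    }
}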

What surprised me is that even with the persist (MEMORY_ONLY_SER) for the two
major RDDs I'm manipulating later, the processing speed is not improved. It's
even slower than not persisting them... Any idea on that? I logged some data
to stdout and found that the two major actions take more than 1 minute. It's
just a 1GB log though...
Another problem I'm seeing is that it seems to use just two of the DataNodes in my
Hadoop YARN cluster, but I actually have three. Is there any configuration here that
matters?



I attached the stderr output here; please help me analyze it and suggest
where I can improve the speed. Thank you so much!
(See attached file: applicationLog.txt)
Best Regards

Yi Ming Huang(黄毅铭)

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/tmp/hadoop-root/nm-local-dir/usercache/root/filecache/17/spark-assembly-1.2.0-hadoop2.4.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/hadoop-2.4.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/03/19 21:13:46 INFO yarn.ApplicationMaster: Registered signal handlers for 
[TERM, HUP, INT]
15/03/19 21:13:47 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/03/19 21:13:47 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
appattempt_1426685857620_0005_01
15/03/19 21:13:48 INFO spark.SecurityManager: Changing view acls to: root
15/03/19 21:13:48 INFO spark.SecurityManager: Changing modify acls to: root
15/03/19 21:13:48 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root); users with 
modify permissions: Set(root)
15/03/19 21:13:48 INFO yarn.ApplicationMaster: Starting the user JAR in a 
separate Thread
15/03/19 21:13:48 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization
15/03/19 21:13:48 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization ... 0
15/03/19 21:13:48 INFO spark.SecurityManager: Changing view acls to: root
15/03/19 21:13:48 INFO spark.SecurityManager: Changing modify acls to: root
15/03/19 21:13:48 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root); users with 
modify permissions: Set(root)
15/03/19 21:13:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/03/19 21:13:48 INFO Remoting: Starting remoting
15/03/19 21:13:48 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@datanode03:42687]
15/03/19 21:13:48 INFO util.Utils: Successfully started service 'sparkDriver' 
on port 42687.
15/03/19 21:13:48 INFO spark.SparkEnv: Registering MapOutputTracker
15/03/19 21:13:48 INFO spark.SparkEnv: Registering BlockManagerMaster
15/03/19 21:13:48 INFO storage.DiskBlockManager: Created local directory at 
/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1426685857620_0005/spark-local-20150319211348-756f
15/03/19 21:13:48 INFO storage.MemoryStore: MemoryStore started with capacity 
257.8 MB
15/03/19 21:13:48 INFO spark.HttpFileServer: HTTP File server directory is 
/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1426685857620_0005/container_1426685857620_0005_01_01/tmp/spark-8288d778-2bca-4afa-a805-cb3807c40f9f
15/03/19 21:13:48 INFO spark.HttpServer: Starting HTTP Server
15/03/19 21:13:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/19 21:13:49 INFO server.AbstractConnector: Started 
SocketConnector@0.0.0.0:54693
15/03/19 21:13:49 INFO util.Utils: Successfully started service 'HTTP file 
server' on port 54693.
15/03/19 21:13:49 INFO ui.JettyUtils: 

Re: Spark Performance on Yarn

2015-02-23 Thread Lee Bierman
Thanks for the suggestions.

I removed the persist call from the program. Doing so, I started it with:

spark-submit --class com.xxx.analytics.spark.AnalyticsJob --master yarn
/tmp/analytics.jar --input_directory hdfs://ip:8020/flume/events/2015/02/


This takes all the defaults and only runs 2 executors. It runs with no
failures but takes 17 hours.


After this I tried to run it with

spark-submit --class com.extole.analytics.spark.AnalyticsJob
--num-executors 5 --executor-cores 2 --master yarn /tmp/analytics.jar
--input_directory
hdfs://ip-10-142-198-50.ec2.internal:8020/flume/events/2015/02/

This results in lots of executor failures and restarts. I
can't seem to get any kind of parallelism or throughput. The next thing to try
will be setting the YARN memory overhead (see the sketch below).
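
A sketch of that next step, setting the overhead alongside the executor heap. The values are illustrative, the setting must be in place before the SparkContext is created, and the same keys can equally be passed as --conf flags to spark-submit:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class OverheadSketch {
    public static void main(String[] args) {
        // Equivalent to adding: --conf spark.yarn.executor.memoryOverhead=768 to spark-submit.
        // The value (in MB) pads the YARN container above the executor heap so the
        // "running beyond physical memory limits" container kills stop happening.
        SparkConf conf = new SparkConf()
                .setAppName("analytics-job")
                .set("spark.executor.memory", "2g")
                .set("spark.yarn.executor.memoryOverhead", "768");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job ...
        sc.stop();
    }
}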


What other configs should I list to help figure out the sweet spot here?




On Sat, Feb 21, 2015 at 12:29 AM, Davies Liu dav...@databricks.com wrote:

 How many executors do you have per machine? It will be helpful if you
 could list all the configs.

 Could you also try to run it without persist? Caching can hurt more than
 help if you don't have enough memory.

 On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman leebier...@gmail.com wrote:
  Thanks for the suggestions.
  I'm experimenting with different values for spark memoryOverhead and
  explicitly giving the executors more memory, but still have not found the
  golden medium to get it to finish in a proper time frame.
 
  Is my cluster massively undersized at 5 boxes, 8gb 2cpu ?
  Trying to figure out a memory setting and executor setting so it runs on
  many containers in parallel.
 
  I'm still struggling as pig jobs and hive jobs on the same whole data set
  don't take as long. I'm wondering too if the logic in our code is just
 doing
  something silly causing multiple reads of all the data.
 
 
  On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:
 
  If that's the error you're hitting, the fix is to boost
  spark.yarn.executor.memoryOverhead, which will put some extra room in
  between the executor heap sizes and the amount of memory requested for
 them
  from YARN.
 
  -Sandy
 
  On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:
 
  A bit more context on this issue. From the container logs on the
 executor
 
  Given my cluster specs above what would be appropriate parameters to
 pass
  into :
  --num-executors --num-cores --executor-memory
 
  I had tried it with --executor-memory 2500MB
 
  015-02-20 06:50:09,056 WARN
 
 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
  Container
 [pid=23320,containerID=container_1423083596644_0238_01_004160]
  is
  running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
  physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
  container.
  Dump of the process-tree for container_1423083596644_0238_01_004160 :
  |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
  SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
  |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash
 -c
  /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
  -Xms2400m
  -Xmx2400m
 
 
 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp
 
 
 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
  org.apache.spark.executor.CoarseGrainedExecutorBackend
 
  akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
  8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1
 
 
 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
  2
 
 
 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
  |- 23323 23320 23320 23320 (java) 922271 12263 461976
 724218
  /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p
  -Xms2400m
  -Xmx2400m
 
 
 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp
 
 
 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
  org.apache.spark.executor.CoarseGrainedExecutorBackend
  akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse
 
 
 
 
 
 
 



Re: Spark performance tuning

2015-02-22 Thread Akhil Das
You can simply follow these http://spark.apache.org/docs/1.2.0/tuning.html

Thanks
Best Regards

On Sun, Feb 22, 2015 at 1:14 AM, java8964 java8...@hotmail.com wrote:

 Can someone share some ideas about how to tune the GC time?

 Thanks

 --
 From: java8...@hotmail.com
 To: user@spark.apache.org
 Subject: Spark performance tuning
 Date: Fri, 20 Feb 2015 16:04:23 -0500


 Hi,

 I am new to Spark, and I am trying to test Spark SQL performance vs.
 Hive. I set up a standalone box with 24 cores and 64G of memory.

 We have one SQL query in mind to test. Here is the basic setup on this one
 box for the SQL we are trying to run:

 1) Dataset 1, a 6.6G Avro file with snappy compression, which contains a nested
 structure of 3 arrays of structs in Avro
 2) Dataset 2, a 5G Avro file with snappy compression
 3) Dataset 3, a 2.3M Avro file with snappy compression.

 The basic structure of the query is like this:


 (select
 xxx
 from
 dataset1 lateral view outer explode(struct1) lateral view outer
 explode(struct2)
 where x )
 left outer join
 (
 select  from dataset2 lateral view explode(xxx) where 
 )
 on 
 left outer join
 (
 select xxx from dataset3 where )
 on x

 So overall what it does is 2 outer explode on dataset1, left outer join
 with explode of dataset2, then finally left outer join with dataset 3.

 On this standalone box, I installed Hadoop 2.2 and Hive 0.12, and Spark
 1.2.0.

 As a baseline, the above query finishes in around 50 minutes in Hive 0.12, with 6
 mappers and 3 reducers, each with a 1G max heap, in 3 rounds of MR jobs.

 This is a very expensive query running in our production, of course with a
 much bigger data set, every day. Now I want to see how fast Spark can run
 the same query.

 I am using the following settings, based on my understanding of Spark, for
 a fair test between it and Hive:

 export SPARK_WORKER_MEMORY=32g
 export SPARK_DRIVER_MEMORY=2g
 --executor-memory 9g
 --total-executor-cores 9

 I am trying to run the one executor with 9 cores and max 9G heap, to make
 Spark use almost same resource we gave to the MapReduce.
 Here is the result without any additional configuration changes, running
 under Spark 1.2.0, using HiveContext in Spark SQL, to run the exactly same
 query:

 The Spark SQL run generated 5 stages of tasks, shown below (stage id, operation, submitted, duration, tasks, data sizes):
 4   collect at SparkPlan.scala:84        2015/02/20 10:48:46   26 s      200/200
 3   mapPartitions at Exchange.scala:64   2015/02/20 10:32:07   16 min    200/200   1112.3 MB
 2   mapPartitions at Exchange.scala:64   2015/02/20 10:22:06   9 min     40/40     4.7 GB   22.2 GB
 1   mapPartitions at Exchange.scala:64   2015/02/20 10:22:06   1.9 min   50/50     6.2 GB   2.8 GB
 0   mapPartitions at Exchange.scala:64   2015/02/20 10:22:06   6 s       2/2       2.3 MB   156.6 KB

 So the wall time of the whole query is 26s + 16m + 9m + 2m + 6s, around 28
 minutes.

 That is about 56% of the original time, not bad. But I want to know whether any tuning
 of Spark can make it even faster.

 For stage 2 and 3, I observed that GC time is more and more expensive.
 Especially in stage 3, shown below:

 For stage 3:
 Metric          Min      25th percentile   Median   75th percentile   Max
 Duration        20 s     30 s              35 s     39 s              2.4 min
 GC Time         9 s      17 s              20 s     25 s              2.2 min
 Shuffle Write   4.7 MB   4.9 MB            5.2 MB   6.1 MB            8.3 MB

 So at the median, GC took 20s/35s = 57% of the stage time.

 The first change I made was to add the following line to spark-defaults.conf:
 spark.serializer org.apache.spark.serializer.KryoSerializer

 My assumption is that using KryoSerializer, instead of the default Java
 serializer, will lower the memory footprint and should lower the GC pressure
 at runtime. I know I changed the correct spark-defaults.conf,
 because if I add spark.executor.extraJavaOptions -verbose:gc
 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps in the same file, I see
 the GC usage in the stdout file. Of course, in this test, I didn't add
 that, as I want to make only one change at a time.
 The result is almost the same as with the standard Java serializer. The wall
 time is still 28 minutes, and in stage 3 the GC still took around 50 to
 60% of the time, with almost the same min, median, and max as before in stage 3,
 and without any noticeable performance gain.

 Next, based on my understanding, and for this test, I think the
 default spark.storage.memoryFraction is too high for this query, as there
 is no reason to reserve so much memory for caching data, because we don't
 reuse any dataset in this one query. So I added --conf
 spark.storage.memoryFraction=0.3 at the end of the spark-shell command, as I
 want to reserve only half as much memory for caching data as the first time.
 Of course, this time I rolled back the first change, the KryoSerializer.

 The result looks almost the same. The whole query finished in around 28s
 + 14m + 9.6m + 1.9m + 6s = 27 minutes.
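
For the GC question above, a sketch that pulls together the knobs discussed in this thread. The values are illustrative, not recommendations, and the same keys can go into spark-defaults.conf:

import org.apache.spark.SparkConf;

public class GcTuningSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("sql-vs-hive-test")
                // Kryo usually shrinks shuffled/cached bytes, which indirectly reduces GC pressure.
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                // The query caches nothing, so give less heap to the storage region (default 0.6)
                // and leave more for shuffle aggregation and task objects.
                .set("spark.storage.memoryFraction", "0.3")
                .set("spark.shuffle.memoryFraction", "0.4")
                // Surface GC behaviour in the executor stdout so the expensive stages can be inspected.
                .set("spark.executor.extraJavaOptions",
                     "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");
        // ... create the context and run the query ...
    }
}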

Re: Spark Performance on Yarn

2015-02-21 Thread Davies Liu
How many executors do you have per machine? It will be helpful if you
could list all the configs.

Could you also try to run it without persist? Caching can hurt more than
help if you don't have enough memory.

On Fri, Feb 20, 2015 at 5:18 PM, Lee Bierman leebier...@gmail.com wrote:
 Thanks for the suggestions.
 I'm experimenting with different values for spark memoryOverhead and
 explicitly giving the executors more memory, but still have not found the
 golden medium to get it to finish in a proper time frame.

 Is my cluster massively undersized at 5 boxes, 8gb 2cpu ?
 Trying to figure out a memory setting and executor setting so it runs on
 many containers in parallel.

 I'm still struggling as pig jobs and hive jobs on the same whole data set
 don't take as long. I'm wondering too if the logic in our code is just doing
 something silly causing multiple reads of all the data.


 On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 If that's the error you're hitting, the fix is to boost
 spark.yarn.executor.memoryOverhead, which will put some extra room in
 between the executor heap sizes and the amount of memory requested for them
 from YARN.

 -Sandy

 On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the executor

 Given my cluster specs above what would be appropriate parameters to pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=23320,containerID=container_1423083596644_0238_01_004160]
 is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend

 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse







RE: Spark performance tuning

2015-02-21 Thread java8964
Can someone share some ideas about how to tune the GC time?
Thanks


Re: Spark Performance on Yarn

2015-02-20 Thread Sean Owen
None of this really points to the problem. These indicate that workers
died but not why. I'd first go locate executor logs that reveal more
about what's happening. It sounds like a hard-er type of failure, like
JVM crash or running out of file handles, or GC thrashing.

On Fri, Feb 20, 2015 at 4:51 AM, lbierman leebier...@gmail.com wrote:
 I'm a bit new to Spark, but had a question on performance. I suspect a lot of
 my issue is due to tuning and parameters. I have a Hive external table on
 this data, and queries against it run in minutes

 The Job:
 + 40gb of avro events on HDFS (100 million+ avro events)
 + Read in the files from HDFS and dedupe events by key (mapToPair then a
 reduceByKey)
 + RDD returned and persisted (disk and memory)
 + Then passed to a job that takes the RDD, maps it to pairs of new object data,
 and then reduceByKey and foreachPartition do the work

 The issue:
 When I run this in my environment on YARN it takes 20+ hours. Running on
 YARN, we see the first stage run to build the deduped RDD, but then when
 the next stage starts, things fail and data is lost. This results in stage 0
 starting over and over and just dragging it out.

 Errors I see in the driver logs:
 ERROR cluster.YarnClientClusterScheduler: Lost executor 1 on X: remote
 Akka client disassociated

 15/02/20 00:27:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.1
 (TID 1335,): FetchFailed(BlockManagerId(3, i, 33958), shuffleId=1,
 mapId=162, reduceId=0, message=
 org.apache.spark.shuffle.FetchFailedException: Failed to connect
 toX/X:33958

 Also we see this, but I'm suspecting this is because the previous stage
 fails and the next one starts:
 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
 location for shuffle 1

 Cluster:
 5 machines, each 2 core , 8gb machines

 Spark-submit command:
  spark-submit --class com.myco.SparkJob \
 --master yarn \
 /tmp/sparkjob.jar \

 Any thoughts or where to look or how to start approaching this problem or
 more data points to present.

 Thanks..

 Code for the job:
  JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
      context.newAPIHadoopRDD(
          context.hadoopConfiguration(),
          AvroKeyInputFormat.class,
          AvroKey.class,
          NullWritable.class
      ).keys())
      // rebuild the Avro record, drop events without a step key, then dedupe by event
      .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
      .filter(key -> { return
          Optional.ofNullable(key.getStepEventKey()).isPresent(); })
      .mapToPair(event -> new Tuple2<AnalyticsEvent, Integer>(event, 1))
      .reduceByKey((analyticsEvent1, analyticsEvent2) -> analyticsEvent1)
      .map(tuple -> tuple._1());

  events.persist(StorageLevel.MEMORY_AND_DISK_2());
  events.mapToPair(event -> {
      // build per-key running aggregates from the deduped events
      return new Tuple2<T, RunningAggregates>(
          keySelector.select(event),
          new RunningAggregates(
              Optional.ofNullable(event.getVisitors()).orElse(0L),
              Optional.ofNullable(event.getImpressions()).orElse(0L),
              Optional.ofNullable(event.getAmount()).orElse(0.0D),
              Optional.ofNullable(event.getAmountSumOfSquares()).orElse(0.0D)));
      })
      .reduceByKey((left, right) -> { return left.add(right); })
      .foreachPartition(dostuff);









Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Hi Sandy,

I appreciate your clear explanation. Let me try again. It's the best way to
confirm I understand.

spark.executor.memory + spark.yarn.executor.memoryOverhead = the amount of memory
for which YARN will create a JVM

spark.executor.memory = the memory I can actually use in my JVM application
= part of it (spark.storage.memoryFraction) is reserved for caching + part
of it (spark.shuffle.memoryFraction) is reserved for shuffling + the
remainder is for bookkeeping & UDFs

If I am correct above, then one implication is:

(spark.executor.memory + spark.yarn.executor.memoryOverhead) * number of
executors per machine should be configured to be smaller than a single machine's
physical memory

Right? Again, thanks!

Kelvin
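
As a rough illustration of the constraint above (example numbers only; the
overhead value is illustrative, not a recommendation):

    spark.executor.memory               2400m   (executor heap)
    spark.yarn.executor.memoryOverhead  400     (MB requested beyond the heap)
    executors per machine               2
    (2400 + 400) MB * 2 = 5600 MB requested from YARN per machine, which fits
    under 8 GB of physical memory with room left for the OS and the NodeManager.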

On Fri, Feb 20, 2015 at 11:50 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:

 Hi Kelvin,

 spark.executor.memory controls the size of the executor heaps.

 spark.yarn.executor.memoryOverhead is the amount of memory to request from
 YARN beyond the heap size.  This accounts for the fact that JVMs use some
 non-heap memory.

 The Spark heap is divided into spark.storage.memoryFraction (default 0.6)
 and spark.shuffle.memoryFraction (default 0.2), and the rest is for basic
 Spark bookkeeping and anything the user does inside UDFs.

 -Sandy



 On Fri, Feb 20, 2015 at 11:44 AM, Kelvin Chu 2dot7kel...@gmail.com
 wrote:

 Hi Sandy,

 I am also doing memory tuning on YARN. Just want to confirm, is it
 correct to say:

 spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory
 I can actually use in my jvm application

 If it is not, what is the correct relationship? Any other variables or
 config parameters in play? Thanks.

 Kelvin

 On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 If that's the error you're hitting, the fix is to boost
 spark.yarn.executor.memoryOverhead, which will put some extra room in
 between the executor heap sizes and the amount of memory requested for them
 from YARN.

 -Sandy

 On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the
 executor

 Given my cluster specs above what would be appropriate parameters to
 pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container
 [pid=23320,containerID=container_1423083596644_0238_01_004160] is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Spark performance tuning

2015-02-20 Thread java8964
Hi,
I am new to Spark, and I am trying to test Spark SQL performance vs Hive. I
set up a standalone box with 24 cores and 64G memory.
We have one SQL query in mind to test. Here is the basic setup on this one box
for the SQL we are trying to run:
1) Dataset1, a 6.6G AVRO file with snappy compression, which contains a nested
   structure of 3 arrays of struct in AVRO
2) Dataset2, a 5G AVRO file with snappy compression
3) Dataset3, a 2.3M AVRO file with snappy compression.
The basic structure of the query is like this:

(select xxx from dataset1
   lateral view outer explode(struct1)
   lateral view outer explode(struct2)
 where x)
left outer join
(select ... from dataset2 lateral view explode(xxx) where ...) on ...
left outer join
(select xxx from dataset3 where ...) on x
So overall what it does is 2 outer explodes on dataset1, a left outer join with
the explode of dataset2, and finally a left outer join with dataset3.
On this standalone box, I installed Hadoop 2.2, Hive 0.12, and Spark 1.2.0.
As a baseline, the above query finishes in around 50 minutes in Hive 12, with 6
mappers and 3 reducers, each with 1G max heap, in 3 rounds of MR jobs.
This is a very expensive query that runs in our production every day, of course
with a much bigger data set. Now I want to see how fast Spark can run the same
query.
I am using the following settings, based on my understanding of Spark, for a
fair test between it and Hive:

export SPARK_WORKER_MEMORY=32g
export SPARK_DRIVER_MEMORY=2g
--executor-memory 9g --total-executor-cores 9
I am trying to run one executor with 9 cores and a max 9G heap, to make Spark
use almost the same resources we gave to MapReduce. Here is the result without
any additional configuration changes, running under Spark 1.2.0, using
HiveContext in Spark SQL, to run the exact same query.
Spark SQL generated 5 stages of tasks, shown below:

Stage 4  collect at SparkPlan.scala:84        2015/02/20 10:48:46  26 s     200/200
Stage 3  mapPartitions at Exchange.scala:64   2015/02/20 10:32:07  16 min   200/200  1112.3 MB
Stage 2  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  9 min    40/40    4.7 GB   22.2 GB
Stage 1  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  1.9 min  50/50    6.2 GB   2.8 GB
Stage 0  mapPartitions at Exchange.scala:64   2015/02/20 10:22:06  6 s      2/2      2.3 MB   156.6 KB
So the wall time of the whole query is 26s + 16m + 9m + 2m + 6s, around 28 minutes.
That is about 56% of the original time, not bad. But I want to know whether any
tuning of Spark can make it even faster.
For stages 2 and 3, I observed that GC time becomes more and more expensive,
especially in stage 3, shown below:
For stage 3:

Metric         Min     25th percentile  Median  75th percentile  Max
Duration       20 s    30 s             35 s    39 s             2.4 min
GC Time        9 s     17 s             20 s    25 s             2.2 min
Shuffle Write  4.7 MB  4.9 MB           5.2 MB  6.1 MB           8.3 MB

So at the median, GC took 20s/35s = 57% of the task time.
The first change I made was to add the following line to spark-defaults.conf:

spark.serializer org.apache.spark.serializer.KryoSerializer

My assumption is that using KryoSerializer, instead of the default Java
serialization, will lower the memory footprint and should lower the GC pressure
during runtime. I know I changed the correct spark-defaults.conf, because if I
add spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps in the same file, I see the GC output in the stdout
file. Of course, in this test, I didn't add that, as I want to make only one
change at a time.
The result is almost the same as with the standard Java serialization. The wall
time is still 28 minutes, and in stage 3 the GC still took around 50 to 60% of
the time, with almost the same min, median and max as before, without any
noticeable performance gain.
Next, based on my understanding, and for this test, I think the default
spark.storage.memoryFraction is too high for this query, as there is no reason
to reserve so much memory for caching data, because we don't reuse any dataset
in this one query. So I added --conf spark.storage.memoryFraction=0.3 at the
end of the spark-shell command, as I want to reserve only half as much memory
for caching data as in the first run. Of course, this time I rolled back the
first change, the KryoSerializer.
The result looks almost the same. The whole query finished in around 28s +
14m + 9.6m + 1.9m + 6s = 27 minutes.
It looks like Spark is faster than Hive, but are there any steps I can take to
make it even faster? Why does using KryoSerializer make no difference? If I
want to use the same resources as now, is there anything I can do to speed it
up more, especially to lower the GC time?
Thanks
Yong
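
For reference, the two single-change experiments described above amount to
roughly the following settings (a sketch restating the values named in this
message, not additional advice):

    # experiment 1: added to spark-defaults.conf
    spark.serializer  org.apache.spark.serializer.KryoSerializer

    # experiment 2: passed on the spark-shell command line instead (Kryo rolled back)
    spark-shell --executor-memory 9g --total-executor-cores 9 \
      --conf spark.storage.memoryFraction=0.3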
  

Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
That's all correct.

-Sandy

On Fri, Feb 20, 2015 at 1:23 PM, Kelvin Chu 2dot7kel...@gmail.com wrote:

 Hi Sandy,

 I appreciate your clear explanation. Let me try again. It's the best way
 to confirm I understand.

 spark.executor.memory + spark.yarn.executor.memoryOverhead = the memory
 that YARN will allocate for the executor JVM

 spark.executor.memory = the memory I can actually use in my JVM
 application = part of it (spark.storage.memoryFraction) is reserved for
 caching + part of it (spark.shuffle.memoryFraction) is reserved for
 shuffling + the remainder is for bookkeeping & UDFs

 If I am correct above, then one implication from them is:

 (spark.executor.memory + spark.yarn.executor.memoryOverhead) * number of
 executors per machine should be configured to be smaller than a single
 machine's physical memory

 Right? Again, thanks!

 Kelvin

 On Fri, Feb 20, 2015 at 11:50 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 Hi Kelvin,

 spark.executor.memory controls the size of the executor heaps.

 spark.yarn.executor.memoryOverhead is the amount of memory to request
 from YARN beyond the heap size.  This accounts for the fact that JVMs use
 some non-heap memory.

 The Spark heap is divided into spark.storage.memoryFraction (default 0.6)
 and spark.shuffle.memoryFraction (default 0.2), and the rest is for basic
 Spark bookkeeping and anything the user does inside UDFs.

 -Sandy



 On Fri, Feb 20, 2015 at 11:44 AM, Kelvin Chu 2dot7kel...@gmail.com
 wrote:

 Hi Sandy,

 I am also doing memory tuning on YARN. Just want to confirm, is it
 correct to say:

 spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory
 I can actually use in my jvm application

 If it is not, what is the correct relationship? Any other variables or
 config parameters in play? Thanks.

 Kelvin

 On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 If that's the error you're hitting, the fix is to boost
 spark.yarn.executor.memoryOverhead, which will put some extra room in
 between the executor heap sizes and the amount of memory requested for them
 from YARN.

 -Sandy

 On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the
 executor

 Given my cluster specs above what would be appropriate parameters to
 pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container
 [pid=23320,containerID=container_1423083596644_0238_01_004160] is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash
 -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976
 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org








Re: Spark Performance on Yarn

2015-02-20 Thread Lee Bierman
Thanks for the suggestions.
I'm experimenting with different values for the Spark memoryOverhead setting and
explicitly giving the executors more memory, but still have not found the
happy medium that gets it to finish in a reasonable time frame.

Is my cluster massively undersized at 5 boxes, each with 8 GB and 2 CPUs?
I'm trying to figure out memory and executor settings so it runs on
many containers in parallel.

I'm still struggling, as Pig and Hive jobs on the same full data set
don't take as long. I'm also wondering if the logic in our code is just
doing something silly that causes multiple reads of all the data.
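
One possible starting point for a cluster of five 2-core / 8 GB nodes, assuming
one executor per node; the numbers below are only an illustration to experiment
from, not a tuned recommendation:

    spark-submit --class com.myco.SparkJob \
      --master yarn \
      --num-executors 5 \
      --executor-cores 2 \
      --executor-memory 4g \
      --conf spark.yarn.executor.memoryOverhead=768 \
      /tmp/sparkjob.jar

    # each container then asks YARN for roughly 4g + 768m, leaving headroom on
    # an 8 GB node for the OS, the NodeManager, and the application master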


On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 If that's the error you're hitting, the fix is to boost
 spark.yarn.executor.memoryOverhead, which will put some extra room in
 between the executor heap sizes and the amount of memory requested for them
 from YARN.

 -Sandy

 On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the executor

 Given my cluster specs above what would be appropriate parameters to pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=23320,containerID=container_1423083596644_0238_01_004160]
 is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Spark Performance on Yarn

2015-02-20 Thread lbierman
A bit more context on this issue. From the container logs on the executor 

Given my cluster specs above, what would be appropriate parameters to pass
in:
--num-executors --num-cores --executor-memory 

I had tried it with --executor-memory 2500MB

015-02-20 06:50:09,056 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Container [pid=23320,containerID=container_1423083596644_0238_01_004160] is
running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
container.
Dump of the process-tree for container_1423083596644_0238_01_004160 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
/usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms2400m
-Xmx2400m 
-Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/CoarseGrainedScheduler
8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1
/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
2
/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
|- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
/usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms2400m
-Xmx2400m
-Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Are you specifying the executor memory, cores, or number of executors
anywhere?  If not, you won't be taking advantage of the full resources on
the cluster.

-Sandy

On Fri, Feb 20, 2015 at 2:41 AM, Sean Owen so...@cloudera.com wrote:

 None of this really points to the problem. These indicate that workers
 died but not why. I'd first go locate executor logs that reveal more
 about what's happening. It sounds like a hard-er type of failure, like
 JVM crash or running out of file handles, or GC thrashing.

 On Fri, Feb 20, 2015 at 4:51 AM, lbierman leebier...@gmail.com wrote:
  I'm a bit new to Spark, but had a question on performance. I suspect a
 lot of
  my issue is due to tuning and parameters. I have a Hive external table on
  this data and to run queries against it runs in minutes
 
  The Job:
  + 40gb of avro events on HDFS (100 million+ avro events)
  + Read in the files from HDFS and dedupe events by key (mapToPair then a
  reduceByKey)
  + RDD returned and persisted (disk and memory)
  + Then passed to a job that take the RDD and mapToPair of new object data
  and then reduceByKey and foreachpartion do work
 
  The issue:
  When I run this on my environment on Yarn this takes 20+ hours. Running
 on
  yarn we see the first stage runs to do build the RDD deduped, but then
 when
  the next stage starts, things fail and data is lost. This results in
 stage 0
  starting over and over and just dragging it out.
 
  Errors I see in the driver logs:
  ERROR cluster.YarnClientClusterScheduler: Lost executor 1 on X:
 remote
  Akka client disassociated
 
  15/02/20 00:27:36 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
 3.1
  (TID 1335,): FetchFailed(BlockManagerId(3, i, 33958),
 shuffleId=1,
  mapId=162, reduceId=0, message=
  org.apache.spark.shuffle.FetchFailedException: Failed to connect
  toX/X:33958
 
  Also we see this, but I'm suspecting this is because the previous stage
  fails and the next one starts:
  org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
  location for shuffle 1
 
  Cluster:
  5 machines, each 2 core , 8gb machines
 
  Spark-submit command:
   spark-submit --class com.myco.SparkJob \
  --master yarn \
  /tmp/sparkjob.jar \
 
  Any thoughts or where to look or how to start approaching this problem or
  more data points to present.
 
  Thanks..
 
  Code for the job:
   JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
  context.newAPIHadoopRDD(
  context.hadoopConfiguration(),
  AvroKeyInputFormat.class,
  AvroKey.class,
  NullWritable.class
  ).keys())
  .map(event -> AnalyticsEvent.newBuilder(event.datum()).build())
  .filter(key -> { return
  Optional.ofNullable(key.getStepEventKey()).isPresent(); })
  .mapToPair(event -> new Tuple2<AnalyticsEvent, Integer>(event, 1))
  .reduceByKey((analyticsEvent1, analyticsEvent2) -> analyticsEvent1)
  .map(tuple -> tuple._1());
 
  events.persist(StorageLevel.MEMORY_AND_DISK_2());
  events.mapToPair(event -> {
  return new Tuple2<T, RunningAggregates>(
  keySelector.select(event),
  new RunningAggregates(
  Optional.ofNullable(event.getVisitors()).orElse(0L),
  Optional.ofNullable(event.getImpressions()).orElse(0L),
  Optional.ofNullable(event.getAmount()).orElse(0.0D),
  Optional.ofNullable(event.getAmountSumOfSquares()).orElse(0.0D)));
  })
  .reduceByKey((left, right) -> { return left.add(right); })
  .foreachPartition(dostuff)
 
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
If that's the error you're hitting, the fix is to boost
spark.yarn.executor.memoryOverhead, which will put some extra room in
between the executor heap sizes and the amount of memory requested for them
from YARN.

-Sandy
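
For example, with the 2400 MB heap shown in the container logs above (the
overhead value here is just an illustration):

    spark-submit ... --executor-memory 2400m \
      --conf spark.yarn.executor.memoryOverhead=600

    # the container requested from YARN then grows to roughly 2400 + 600 MB,
    # instead of the ~2.7 GB limit the executor was being killed against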

On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the executor

 Given my cluster specs above what would be appropriate parameters to pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=23320,containerID=container_1423083596644_0238_01_004160] is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark Performance on Yarn

2015-02-20 Thread Kelvin Chu
Hi Sandy,

I am also doing memory tuning on YARN. Just want to confirm, is it correct
to say:

spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory I
can actually use in my jvm application

If it is not, what is the correct relationship? Any other variables or
config parameters in play? Thanks.

Kelvin

On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 If that's the error you're hitting, the fix is to boost
 spark.yarn.executor.memoryOverhead, which will put some extra room in
 between the executor heap sizes and the amount of memory requested for them
 from YARN.

 -Sandy

 On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the executor

 Given my cluster specs above what would be appropriate parameters to pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=23320,containerID=container_1423083596644_0238_01_004160]
 is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Spark Performance on Yarn

2015-02-20 Thread Sandy Ryza
Hi Kelvin,

spark.executor.memory controls the size of the executor heaps.

spark.yarn.executor.memoryOverhead is the amount of memory to request from
YARN beyond the heap size.  This accounts for the fact that JVMs use some
non-heap memory.

The Spark heap is divided into spark.storage.memoryFraction (default 0.6)
and spark.shuffle.memoryFraction (default 0.2), and the rest is for basic
Spark bookkeeping and anything the user does inside UDFs.

-Sandy
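
A rough worked example of that split, ignoring the internal safety fractions,
for a 2400 MB executor heap like the one in this thread:

    spark.executor.memory          2400m
    spark.storage.memoryFraction   0.6   ->  about 1440 MB reserved for cached data
    spark.shuffle.memoryFraction   0.2   ->  about  480 MB reserved for shuffle
    remainder                            ->  about  480 MB for Spark bookkeeping and UDFs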



On Fri, Feb 20, 2015 at 11:44 AM, Kelvin Chu 2dot7kel...@gmail.com wrote:

 Hi Sandy,

 I am also doing memory tuning on YARN. Just want to confirm, is it correct
 to say:

 spark.executor.memory - spark.yarn.executor.memoryOverhead = the memory I
 can actually use in my jvm application

 If it is not, what is the correct relationship? Any other variables or
 config parameters in play? Thanks.

 Kelvin

 On Fri, Feb 20, 2015 at 9:45 AM, Sandy Ryza sandy.r...@cloudera.com
 wrote:

 If that's the error you're hitting, the fix is to boost
 spark.yarn.executor.memoryOverhead, which will put some extra room in
 between the executor heap sizes and the amount of memory requested for them
 from YARN.

 -Sandy

 On Fri, Feb 20, 2015 at 9:40 AM, lbierman leebier...@gmail.com wrote:

 A bit more context on this issue. From the container logs on the executor

 Given my cluster specs above what would be appropriate parameters to pass
 into :
 --num-executors --num-cores --executor-memory

 I had tried it with --executor-memory 2500MB

 015-02-20 06:50:09,056 WARN

 org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Container [pid=23320,containerID=container_1423083596644_0238_01_004160]
 is
 running beyond physical memory limits. Current usage: 2.8 GB of 2.7 GB
 physical memory used; 4.4 GB of 5.8 GB virtual memory used. Killing
 container.
 Dump of the process-tree for container_1423083596644_0238_01_004160 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
 SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 23320 23318 23320 23320 (bash) 0 0 108650496 305 /bin/bash -c
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal
 :42535/user/CoarseGrainedScheduler
 8 ip-10-99-162-56.ec2.internal 1 application_1423083596644_0238 1

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stdout
 2

 /var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160/stderr
 |- 23323 23320 23320 23320 (java) 922271 12263 461976 724218
 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p
 -Xms2400m
 -Xmx2400m

 -Djava.io.tmpdir=/dfs/yarn/nm/usercache/root/appcache/application_1423083596644_0238/container_1423083596644_0238_01_004160/tmp

 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1423083596644_0238/container_1423083596644_0238_01_004160
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://sparkDriver@ip-10-168-86-13.ec2.internal:42535/user/Coarse




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p21739.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





