Re: input size too large | Performance issues with Spark

2015-04-05 Thread Ted Yu
Reading Sandy's blog, there seems to be one typo.

bq. Similarly, the heap size can be controlled with the --executor-cores flag
or the spark.executor.memory property.
'--executor-memory' should be the right flag.

BTW

bq. It defaults to max(384, .07 * spark.executor.memory)

The default memory overhead has been increased to 10 percent in the master branch; see SPARK-6085. The change is not in 1.3, though.
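For anyone following along, a minimal sketch of how the two settings are actually passed, assuming YARN mode on Spark 1.x (the values here are illustrative, not recommendations):

```shell
# Heap per executor is set with --executor-memory (equivalently the
# spark.executor.memory property), not --executor-cores.
# The off-heap overhead can also be raised by hand if the default
# max(384, .07 * spark.executor.memory) is too small (YARN mode, value in MB).
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  app.jar
```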

Cheers


Re: input size too large | Performance issues with Spark

2015-04-02 Thread Christian Perez
To Akhil's point, see the "Tuning Data Structures" section of the tuning guide. Avoid the standard collection HashMap.

With fewer machines, try running 4 or 5 cores per executor and only
3-4 executors (1 per node):
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/.
That ought to reduce the shuffle performance hit (can someone else confirm?).

Re #7: see spark.sql.shuffle.partitions (default: 200).
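As a rough sketch, the sizing above could look like this on a YARN cluster of 3-4 of the large nodes described in the thread (all values illustrative only, not tuned recommendations):

```shell
# One fat executor per node, 4-5 cores each; memory kept well below the
# node's 512 GB to leave room for the OS, the AM, and memory overhead.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 5 \
  --executor-memory 32g \
  app.jar
```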



-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: input size too large | Performance issues with Spark

2015-03-29 Thread Akhil Das
Go through this once, if you haven't read it already.
https://spark.apache.org/docs/latest/tuning.html
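One of the guide's first suggestions, switching to Kryo serialization, can be tried without code changes, e.g. (a sketch; for best results the guide also recommends registering your own classes via SparkConf):

```shell
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  app.jar
```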

Thanks
Best Regards



input size too large | Performance issues with Spark

2015-03-28 Thread nsareen
Hi All,

I'm facing performance issues with my Spark implementation, and while briefly
investigating the WebUI logs, I noticed that my RDD size is 55 GB, the
Shuffle Write is 10 GB, and the Input Size is 200 GB. The application is a web
application which does predictive analytics, so we keep most of our data in
memory. This observation covers only 30 minutes of usage of the application by a
single user. We anticipate at least 10-15 users of the application sending
requests in parallel, which makes me a bit nervous.

One constraint we have is that we do not have too many nodes in the cluster;
we may end up with 3-4 machines at best, but they can be scaled up
vertically, each having 24 cores / 512 GB RAM, which can allow us to make
a virtual 10-15 node cluster.

Even then, the input size and shuffle write are too high for my liking. Any
suggestions in this regard will be greatly appreciated, as there aren't many
resources on the net for handling performance issues such as these.

Some pointers on my application's data structures and design:

1) The RDD is a JavaPairRDD, with the Key being a custom POJO containing 3-4
HashMaps and the Value containing 1 HashMap.
2) Data is loaded via JdbcRDD during application startup, which also tends
to take a lot of time, since we massage the data once it is fetched from the DB
and then save it as a JavaPairRDD.
3) Most of the data is structured, but we are still using JavaPairRDD; we have
not explored the option of Spark SQL yet.
4) We have only one SparkContext, which caters to all the requests coming
into the application from various users.
5) During a single user session, a user can send 3-4 parallel stages consisting
of Map / GroupBy / Join / Reduce etc.
6) We have to change the RDD structure using different types of group-by
operations, since the user can drill down / drill up the data
(aggregation at a higher / lower level). This is where we make use of
groupBy, but there is a cost associated with it.
7) We have observed that the initial RDDs we create have 40-odd partitions,
but after some stage executions such as groupBy, the partitions increase to
200 or so; this was odd, and we haven't figured out why it happens.

In summary, we want to use Spark to give us the capability to process our
in-memory data structures very fast, as well as to scale to a larger volume when
required in the future.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/input-size-too-large-Performance-issues-with-Spark-tp22270.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org