Re: input size too large | Performance issues with Spark

2015-04-05 Thread Ted Yu
Reading Sandy's blog, there seems to be one typo.

bq. Similarly, the heap size can be controlled with the --executor-cores flag
or the spark.executor.memory property.
'--executor-memory' should be the right flag.

BTW

bq. It defaults to max(384, .07 * spark.executor.memory)

The default memory overhead has been increased to 10 percent in the master branch; see SPARK-6085. The change is not in 1.3, though.
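For anyone following along, a minimal sketch of how the two settings are actually passed, assuming YARN mode on Spark 1.x (the values here are illustrative, not recommendations):

```shell
# Heap per executor is set with --executor-memory (equivalently the
# spark.executor.memory property), not --executor-cores.
# The off-heap overhead can also be raised by hand if the default
# max(384, .07 * spark.executor.memory) is too small (YARN mode, value in MB).
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  app.jar
```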

Cheers


Re: input size too large | Performance issues with Spark

2015-04-02 Thread Christian Perez
To Akhil's point, see the "Tuning Data Structures" section of the tuning guide. Avoid the standard collection HashMap.

With fewer machines, try running 4 or 5 cores per executor and only
3-4 executors (1 per node):
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/.
That ought to reduce the shuffle performance hit (can someone else confirm?).

Re #7: see spark.sql.shuffle.partitions (default: 200).
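As a rough sketch, the sizing above could look like this on a YARN cluster of 3-4 of the large nodes described in the thread (all values illustrative only, not tuned recommendations):

```shell
# One fat executor per node, 4-5 cores each; memory kept well below the
# node's 512 GB to leave room for the OS, the AM, and memory overhead.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 5 \
  --executor-memory 32g \
  app.jar
```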



-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: input size too large | Performance issues with Spark

2015-03-29 Thread Akhil Das
Go through this once, if you haven't read it already.
https://spark.apache.org/docs/latest/tuning.html
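One of the guide's first suggestions, switching to Kryo serialization, can be tried without code changes, e.g. (a sketch; for best results the guide also recommends registering your own classes via SparkConf):

```shell
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  app.jar
```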

Thanks
Best Regards



input size too large | Performance issues with Spark

2015-03-28 Thread nsareen
Hi All,

I'm facing performance issues with my Spark implementation, and while briefly
investigating the WebUI logs, I noticed that my RDD size is 55 GB, the
Shuffle Write is 10 GB, and the Input Size is 200 GB. The application is a web
application which does predictive analytics, so we keep most of our data in
memory. This observation covers only 30 minutes of usage of the application by a
single user. We anticipate at least 10-15 users of the application sending
requests in parallel, which makes me a bit nervous.

One constraint we have is that we do not have too many nodes in the cluster;
we may end up with 3-4 machines at best, but they can be scaled up
vertically, each having 24 cores / 512 GB RAM, which can allow us to make
a virtual 10-15 node cluster.

Even then, the input size and shuffle write are too high for my liking. Any
suggestions in this regard will be greatly appreciated, as there aren't many
resources on the net for handling performance issues such as these.

Some pointers on my application's data structures and design:

1) The RDD is a JavaPairRDD, with the Key being a custom POJO containing 3-4
HashMaps and the Value containing 1 HashMap.
2) Data is loaded via JdbcRDD during application startup, which also tends
to take a lot of time, since we massage the data once it is fetched from the DB
and then save it as a JavaPairRDD.
3) Most of the data is structured, but we are still using JavaPairRDD; we have
not explored the option of Spark SQL yet.
4) We have only one SparkContext, which caters to all the requests coming
into the application from various users.
5) During a single user session, a user can send 3-4 parallel stages consisting
of Map / GroupBy / Join / Reduce etc.
6) We have to change the RDD structure using different types of group-by
operations, since the user can drill down / drill up the data
(aggregation at a higher / lower level). This is where we make use of
groupBy, but there is a cost associated with it.
7) We have observed that the initial RDDs we create have 40-odd partitions,
but after some stage executions such as groupBy, the partitions increase to
200 or so; this was odd, and we haven't figured out why it happens.

In summary, we want to use Spark to give us the capability to process our
in-memory data structures very fast, as well as to scale to a larger volume when
required in the future.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/input-size-too-large-Performance-issues-with-Spark-tp22270.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org