Reading Sandy's blog, there seems to be one typo:

bq. Similarly, the heap size can be controlled with the --executor-cores flag or the spark.executor.memory property.

'--executor-memory' should be the right flag.
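For reference, the two equivalent ways to set the executor heap size (the application class and the 8g/4 values below are illustrative, not from the thread):

```shell
# Command-line flags to spark-submit (--executor-memory is the corrected flag):
spark-submit \
  --class com.example.MyApp \
  --executor-memory 8g \
  --executor-cores 4 \
  myapp.jar

# Or as a property in conf/spark-defaults.conf:
#   spark.executor.memory   8g
```

The flag and the property set the same thing; the flag wins if both are given on the same submission.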
BTW:

bq. It defaults to max(384, .07 * spark.executor.memory)

The default memory overhead has been increased to 10 percent in the master branch; see SPARK-6085. The change is not in 1.3, though.

Cheers

On Thu, Apr 2, 2015 at 12:55 PM, Christian Perez <christ...@svds.com> wrote:
> To Akhil's point, see Tuning Data Structures. Avoid the standard collection HashMap.
>
> With fewer machines, try running 4 or 5 cores per executor and only
> 3-4 executors (1 per node):
> http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
> That ought to reduce the shuffle performance hit (someone else confirm?).
>
> #7: see default.shuffle.partitions (default: 200)
>
> On Sun, Mar 29, 2015 at 7:57 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> > Go through this once, if you haven't read it already:
> > https://spark.apache.org/docs/latest/tuning.html
> >
> > Thanks
> > Best Regards
> >
> > On Sat, Mar 28, 2015 at 7:33 PM, nsareen <nsar...@gmail.com> wrote:
> >>
> >> Hi All,
> >>
> >> I'm facing performance issues with my Spark implementation. While briefly
> >> investigating the WebUI logs, I noticed that my RDD size is 55 GB, the
> >> Shuffle Write is 10 GB, and the Input Size is 200 GB. The application is
> >> a web application that does predictive analytics, so we keep most of our
> >> data in memory. This observation was for only 30 minutes of usage of the
> >> application by a single user. We anticipate at least 10-15 users of the
> >> application sending requests in parallel, which makes me a bit nervous.
> >>
> >> One constraint we have is that we do not have too many nodes in the
> >> cluster; we may end up with 3-4 machines at best, but they can be scaled
> >> up vertically, each having 24 cores / 512 GB RAM etc., which can allow
> >> us to make a virtual 10-15 node cluster.
> >>
> >> Even then, the input size & shuffle write are too high for my liking.
> >> Any suggestions in this regard will be greatly appreciated, as there
> >> aren't many resources on the net for handling performance issues such
> >> as these.
> >>
> >> Some pointers on my application's data structures & design:
> >>
> >> 1) The RDD is a JavaPairRDD, with the Key a CustomPOJO containing 3-4
> >> HashMaps and the Value containing 1 HashMap.
> >> 2) Data is loaded via JDBCRDD during application startup, which also
> >> tends to take a lot of time, since we massage the data once it is
> >> fetched from the DB and then save it as a JavaPairRDD.
> >> 3) Most of the data is structured, but we are still using JavaPairRDD;
> >> we have not explored the option of Spark SQL, though.
> >> 4) We have only one SparkContext, which caters to all the requests
> >> coming into the application from various users.
> >> 5) During a single session, a user can send 3-4 parallel stages
> >> consisting of Map / Group By / Join / Reduce etc.
> >> 6) We have to change the RDD structure using different types of group-by
> >> operations, since the user can drill down or drill up through the data
> >> (aggregation at a higher / lower level). This is where we make use of
> >> groupBys, but there is a cost associated with this.
> >> 7) We have observed that the initial RDDs we create have 40-odd
> >> partitions, but after some stage executions, like groupBys, the
> >> partitions increase to 200 or so. This was odd, and we haven't figured
> >> out why it happens.
> >>
> >> In summary, we want to use Spark to give us the capability to process
> >> our in-memory data structure very fast, as well as to scale to a larger
> >> volume when required in the future.
> >>
> >> --
> >> View this message in context:
> >> http://apache-spark-user-list.1001560.n3.nabble.com/input-size-too-large-Performance-issues-with-Spark-tp22270.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
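On point 7 above: the jump from ~40 to 200 partitions after a group-by typically comes from the shuffle using a default partition count instead of the parent RDD's. A hedged config sketch (the value 40 is illustrative; check the property names against your Spark version's configuration docs):

```shell
# conf/spark-defaults.conf
# Partition count used by RDD shuffles (groupByKey, join, etc.)
# when no numPartitions argument is passed to the operation:
spark.default.parallelism      40

# Partition count used by Spark SQL shuffles (documented default: 200):
spark.sql.shuffle.partitions   40
```

Alternatively, pass the count explicitly per operation, e.g. `groupByKey(40)` on a JavaPairRDD, which overrides the default for that shuffle only.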
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
>
> --
> Christian Perez
> Silicon Valley Data Science
> Data Analyst
> christ...@svds.com
> @cp_phd
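For readers following the overhead discussion at the top of the thread: on YARN, the container request is the executor heap plus the overhead, where the overhead is max(384, 0.07 * executor memory) in 1.3 and 10 percent on master after SPARK-6085. A quick sketch of the arithmetic (the 8 GB executor size is illustrative):

```java
public class OverheadSketch {
    // Memory overhead in MB, per the formula quoted above.
    // fraction is 0.07 for Spark 1.3, 0.10 on master after SPARK-6085.
    static long overheadMb(long executorMemoryMb, double fraction) {
        return Math.max(384, (long) (fraction * executorMemoryMb));
    }

    public static void main(String[] args) {
        long execMb = 8 * 1024; // an illustrative 8 GB executor

        // Overhead under each default:
        System.out.println("1.3 overhead:    " + overheadMb(execMb, 0.07) + " MB"); // 573 MB
        System.out.println("master overhead: " + overheadMb(execMb, 0.10) + " MB"); // 819 MB

        // Total YARN container request = executor memory + overhead:
        System.out.println("1.3 container:   " + (execMb + overheadMb(execMb, 0.07)) + " MB");
    }
}
```

Note that for small executors the 384 MB floor dominates, so sizing the heap without accounting for the overhead is what usually gets containers killed by YARN.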