Reading Sandy's blog, there seems to be one typo:

bq. Similarly, the heap size can be controlled with the --executor-cores flag or the spark.executor.memory property.

'--executor-memory' should be the right flag.
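For reference, the two equivalent ways to set the executor heap size (the application class and the 8g/4 values below are illustrative, not from the thread):

```shell
# Command-line flags to spark-submit (--executor-memory is the corrected flag):
spark-submit \
  --class com.example.MyApp \
  --executor-memory 8g \
  --executor-cores 4 \
  myapp.jar

# Or as a property in conf/spark-defaults.conf:
#   spark.executor.memory   8g
```

The flag and the property set the same thing; the flag wins if both are given on the same submission.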
BTW:

bq. It defaults to max(384, .07 * spark.executor.memory)

The default memory overhead has been increased to 10 percent in the master branch; see SPARK-6085. The change is not in 1.3, though.

Cheers

On Thu, Apr 2, 2015 at 12:55 PM, Christian Perez <christ...@svds.com> wrote:
> To Akhil's point, see Tuning Data Structures. Avoid the standard collection HashMap.
>
> With fewer machines, try running 4 or 5 cores per executor and only
> 3-4 executors (1 per node):
> http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
> That ought to reduce the shuffle performance hit (someone else confirm?).
>
> #7: see default.shuffle.partitions (default: 200)
>
> On Sun, Mar 29, 2015 at 7:57 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> > Go through this once, if you haven't read it already:
> > https://spark.apache.org/docs/latest/tuning.html
> >
> > Thanks
> > Best Regards
> >
> > On Sat, Mar 28, 2015 at 7:33 PM, nsareen <nsar...@gmail.com> wrote:
> >>
> >> Hi All,
> >>
> >> I'm facing performance issues with my Spark implementation. While briefly
> >> investigating the WebUI logs, I noticed that my RDD size is 55 GB, the
> >> Shuffle Write is 10 GB, and the Input Size is 200 GB. The application is
> >> a web application that does predictive analytics, so we keep most of our
> >> data in memory. This observation was for only 30 minutes of usage of the
> >> application by a single user. We anticipate at least 10-15 users of the
> >> application sending requests in parallel, which makes me a bit nervous.
> >>
> >> One constraint we have is that we do not have too many nodes in the
> >> cluster; we may end up with 3-4 machines at best, but they can be scaled
> >> up vertically, each having 24 cores / 512 GB RAM etc., which can allow
> >> us to make a virtual 10-15 node cluster.
> >>
> >> Even then, the input size & shuffle write are too high for my liking.
> >> Any suggestions in this regard will be greatly appreciated, as there
> >> aren't many resources on the net for handling performance issues such
> >> as these.
> >>
> >> Some pointers on my application's data structures & design:
> >>
> >> 1) The RDD is a JavaPairRDD, with the Key a CustomPOJO containing 3-4
> >> HashMaps and the Value containing 1 HashMap.
> >> 2) Data is loaded via JDBCRDD during application startup, which also
> >> tends to take a lot of time, since we massage the data once it is
> >> fetched from the DB and then save it as a JavaPairRDD.
> >> 3) Most of the data is structured, but we are still using JavaPairRDD;
> >> we have not explored the option of Spark SQL, though.
> >> 4) We have only one SparkContext, which caters to all the requests
> >> coming into the application from various users.
> >> 5) During a single session, a user can send 3-4 parallel stages
> >> consisting of Map / Group By / Join / Reduce etc.
> >> 6) We have to change the RDD structure using different types of group-by
> >> operations, since the user can drill down or drill up through the data
> >> (aggregation at a higher / lower level). This is where we make use of
> >> groupBys, but there is a cost associated with this.
> >> 7) We have observed that the initial RDDs we create have 40-odd
> >> partitions, but after some stage executions, like groupBys, the
> >> partitions increase to 200 or so. This was odd, and we haven't figured
> >> out why it happens.
> >>
> >> In summary, we want to use Spark to give us the capability to process
> >> our in-memory data structure very fast, as well as to scale to a larger
> >> volume when required in the future.
> >>
> >> --
> >> View this message in context:
> >> http://apache-spark-user-list.1001560.n3.nabble.com/input-size-too-large-Performance-issues-with-Spark-tp22270.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
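On point 7 above: the jump from ~40 to 200 partitions after a group-by typically comes from the shuffle using a default partition count instead of the parent RDD's. A hedged config sketch (the value 40 is illustrative; check the property names against your Spark version's configuration docs):

```shell
# conf/spark-defaults.conf
# Partition count used by RDD shuffles (groupByKey, join, etc.)
# when no numPartitions argument is passed to the operation:
spark.default.parallelism      40

# Partition count used by Spark SQL shuffles (documented default: 200):
spark.sql.shuffle.partitions   40
```

Alternatively, pass the count explicitly per operation, e.g. `groupByKey(40)` on a JavaPairRDD, which overrides the default for that shuffle only.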
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
>
> --
> Christian Perez
> Silicon Valley Data Science
> Data Analyst
> christ...@svds.com
> @cp_phd
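For readers following the overhead discussion at the top of the thread: on YARN, the container request is the executor heap plus the overhead, where the overhead is max(384, 0.07 * executor memory) in 1.3 and 10 percent on master after SPARK-6085. A quick sketch of the arithmetic (the 8 GB executor size is illustrative):

```java
public class OverheadSketch {
    // Memory overhead in MB, per the formula quoted above.
    // fraction is 0.07 for Spark 1.3, 0.10 on master after SPARK-6085.
    static long overheadMb(long executorMemoryMb, double fraction) {
        return Math.max(384, (long) (fraction * executorMemoryMb));
    }

    public static void main(String[] args) {
        long execMb = 8 * 1024; // an illustrative 8 GB executor

        // Overhead under each default:
        System.out.println("1.3 overhead:    " + overheadMb(execMb, 0.07) + " MB"); // 573 MB
        System.out.println("master overhead: " + overheadMb(execMb, 0.10) + " MB"); // 819 MB

        // Total YARN container request = executor memory + overhead:
        System.out.println("1.3 container:   " + (execMb + overheadMb(execMb, 0.07)) + " MB");
    }
}
```

Note that for small executors the 384 MB floor dominates, so sizing the heap without accounting for the overhead is what usually gets containers killed by YARN.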