You should also take into account that Spark has different options for representing
data in memory: Java serialized objects, Kryo serialized objects, Tungsten
(columnar, optionally compressed), etc. The footprint of the Tungsten format
depends heavily on the underlying data and its sort order, especially when compressed.
Then you might also think about broadcast data etc.
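One way to see those differences concretely is to cache the same data under different storage levels and compare the resulting sizes in the Storage tab of the Spark UI. A minimal sketch, assuming a local Spark 2.x session; the parquet path and RDD names are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("memory-footprint-sketch")
  .master("local[*]")
  // Use Kryo for RDD serialization instead of default Java serialization.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val df = spark.read.parquet("people.parquet")  // hypothetical input

// Deserialized Java objects on the JVM heap (map(identity) just gives
// each experiment its own distinct RDD so it can be persisted separately).
val asJavaObjects = df.rdd.map(identity)
  .setName("java-objects").persist(StorageLevel.MEMORY_ONLY)
asJavaObjects.count()

// Serialized bytes in memory (Kryo, given the config above).
val asSerialized = df.rdd.map(identity)
  .setName("kryo-serialized").persist(StorageLevel.MEMORY_ONLY_SER)
asSerialized.count()

// Caching the DataFrame itself uses Tungsten's compressed columnar format.
df.persist(StorageLevel.MEMORY_ONLY)
df.count()

// Compare in-memory sizes programmatically instead of via the UI.
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize} bytes in memory")
}
```

The same dataset can occupy very different amounts of memory across the three forms, which is why a single rule of thumb for "memory per GB of parquet" is hard to give.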
As such, I am not aware of a specific guide, but there is also no magic behind
it. Could be a good JIRA task :)
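For rough per-object accounting, Spark also ships a SizeEstimator utility that walks an object graph and estimates its JVM heap footprint, which gives a feel for how much larger deserialized Java objects are than the raw data they carry. A sketch; the case class is made up for illustration:

```scala
import org.apache.spark.util.SizeEstimator

// A hypothetical row type carrying only a few bytes of "raw" data.
case class Person(name: String, age: Int)

// Estimated heap footprint of one instance, including object headers,
// field alignment and the String's internal storage. This is typically
// several times larger than the raw data itself.
println(SizeEstimator.estimate(Person("Alice", 30)))
```

Running this over representative records gives a crude multiplier between on-disk data size and the memory needed for deserialized processing.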
> On 22 Sep 2016, at 08:36, Hemant Bhanawat <hemant9...@gmail.com> wrote:
> I am working on profiling TPC-H queries for Spark 2.0. I see a lot of temporary
> object creation (sometimes as large as the data itself), which is justified
> for the kind of processing Spark does. But, from a production perspective, is
> there a guideline on how much memory should be allocated for processing a
> specific data size of, let's say, parquet data? Also, has someone investigated
> memory usage for the individual SQL operators like Filter, GroupBy, OrderBy,
> Exchange etc.?
> Hemant Bhanawat