Hey spark-devs, I'm in the process of writing a DataSource for what is essentially a Java web service. Each relation we create will consist of a series of queries to this web service, each returning a roughly known amount of data (e.g. 2000 rows of 5 string columns, from which we can calculate a size in bytes).
Now, in creating an RDD for this web service, I have to provide the Partitions by which to break up the RDD. Naively, I can break it up over a handful of the attributes we query by, but since we know roughly how much data each call will fetch, I was wondering if there are any heuristics you would suggest. Using information from the SparkContext about the number of executors, executor memory, free memory, etc., is there anything smart I could do with regard to sizing my partitions? Has anyone done this before, or is it a bad idea? Thanks, Hamel
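For context, the kind of sizing rule I have in mind is something like the sketch below (plain Java, no Spark API; the method and parameter names are my own, and the 128 MB target is just an assumption borrowed from Spark's default spark.sql.files.maxPartitionBytes):

```java
// A minimal sketch of one common heuristic: aim each partition at a target
// byte budget, but never create fewer partitions than the cluster's
// parallelism, so every core gets work. Names here are illustrative only.
public class PartitionSizing {

    /**
     * @param totalRows            rows the web service will return
     * @param bytesPerRow          estimated serialized size of one row
     * @param targetPartitionBytes desired upper bound per partition
     *                             (e.g. 128 MB, mirroring Spark's default
     *                             spark.sql.files.maxPartitionBytes)
     * @param defaultParallelism   e.g. sc.defaultParallelism
     */
    public static int numPartitions(long totalRows,
                                    long bytesPerRow,
                                    long targetPartitionBytes,
                                    int defaultParallelism) {
        long totalBytes = totalRows * bytesPerRow;
        // ceiling division: enough partitions that none exceeds the target
        long bySize = (totalBytes + targetPartitionBytes - 1) / targetPartitionBytes;
        return (int) Math.max(defaultParallelism, bySize);
    }

    public static void main(String[] args) {
        // 2000 rows x 5 string columns at ~50 bytes each is tiny, so the
        // parallelism floor (not data size) dominates:
        System.out.println(numPartitions(2000, 250, 128L << 20, 8)); // prints 8
    }
}
```

For small results like the 2000-row example this just falls back to defaultParallelism; the byte budget only matters once a single relation grows past a few partition-sized chunks.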