Hey spark-devs, I'm in the process of writing a DataSource for what is essentially a Java web service. Each relation we create will consist of a series of queries to this web service, each returning a roughly known amount of data (e.g. 2000 rows of 5 string columns, from which we can calculate a size in bytes).
Now, in creating an RDD for this web service, I have to provide the Partitions by which to break up the RDD. Naively, I can break it up over a handful of the attributes we query by, but since we know roughly how much data each call will fetch, I was wondering if there are any heuristics you would suggest. Using information from the SparkContext about the number of executors, executor memory, free memory, etc., is there anything smart I could do with regard to sizing my partitions? Has anyone done this before, or is it a bad idea? Thanks, Hamel
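For context, the kind of sizing rule I have in mind is something like the sketch below (plain Java, no Spark API; the method and parameter names are my own, and the 128 MB target is just an assumption borrowed from Spark's default spark.sql.files.maxPartitionBytes):

```java
// A minimal sketch of one common heuristic: aim each partition at a target
// byte budget, but never create fewer partitions than the cluster's
// parallelism, so every core gets work. Names here are illustrative only.
public class PartitionSizing {

    /**
     * @param totalRows            rows the web service will return
     * @param bytesPerRow          estimated serialized size of one row
     * @param targetPartitionBytes desired upper bound per partition
     *                             (e.g. 128 MB, mirroring Spark's default
     *                             spark.sql.files.maxPartitionBytes)
     * @param defaultParallelism   e.g. sc.defaultParallelism
     */
    public static int numPartitions(long totalRows,
                                    long bytesPerRow,
                                    long targetPartitionBytes,
                                    int defaultParallelism) {
        long totalBytes = totalRows * bytesPerRow;
        // ceiling division: enough partitions that none exceeds the target
        long bySize = (totalBytes + targetPartitionBytes - 1) / targetPartitionBytes;
        return (int) Math.max(defaultParallelism, bySize);
    }

    public static void main(String[] args) {
        // 2000 rows x 5 string columns at ~50 bytes each is tiny, so the
        // parallelism floor (not data size) dominates:
        System.out.println(numPartitions(2000, 250, 128L << 20, 8)); // prints 8
    }
}
```

For small results like the 2000-row example this just falls back to defaultParallelism; the byte budget only matters once a single relation grows past a few partition-sized chunks.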