apologies for the long answer. 

understanding partitioning at each stage of the the RDD graph/lineage is
important for efficient parallelism and having load balanced. This applies
to working with any sources streaming or static. 
you have tricky situation here of one source kafka with 9 partitions and
static data set 90 partitions. 

before joining both these try to have number of partitions equal for both
you can either repartition kafka source to 90 partitions or coalesce flat
file RDD to 9 partitions
or midway between 9 and 90. 

in general no of tasks that can run in parallel equal to total no of cores
spark job has (no of executors * no of cores per executor).

As an example
if the flat file has 90 partitions and if you set 4 executors each with 5
cores for a total of 20 cores if you have 20+20+20+20+10 tasks gets
scheduled. as you can see at the last you will have only 10 tasks though you
have 20 cores. 

compare this with 6 executors each with 5 cores for a total of 30 cores,
then it would be

ideally no of partitions for each RDD (in the graph lineage) should be a
multiple of total no of available cores for the spark job.

in terms of data locality prefer process-local over node-local over rack
as an example 
5 executors with 4 cores and 4 executors with 5 cores each of this option
will have 20 cores in total.
But with 4 executors its less shuffling more process-local/node-local

need to look at RDD graph for this
df = sqlContext.read.parquet(...)
RDD rdd = df.as[T].rdd

on your final question, you should be able to tune the static RDD without
external store by carefully looking at each batch RDD lineage for that 30
mins before the RDD gets refreshed again. 

if you would like to use external system Apache Ignite is something that you
can use as cache.


Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Reply via email to