Hi Spark experts,

First of all, happy Thanksgiving!

Then comes my question: I have implemented a custom Hadoop InputFormat to
load millions of entities from my data source into Spark (as a JavaRDD,
then transformed to a DataFrame). The approach I took in implementing the
custom Hadoop RDD is to load all IDs of my data entities (each entity has
a unique Long ID) and split the ID list (containing, say, 3 million Long
values) into a configured number of splits, each holding a sub-set of the
IDs; my custom RecordReader then loads the full entity (a plain Java Bean)
from the data source for each ID in its split.
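
To make the structure concrete, here is a minimal sketch of the shape of
my implementation. The names (EntityBean, EntitySplit, EntityRecordReader,
fetchAllIds, loadEntity) and the entity.num.splits property are simplified
placeholders, not my exact code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Placeholder for my real entity bean.
class EntityBean implements java.io.Serializable {}

// Each split carries the sub-set of IDs it is responsible for. It must
// be Writable so it can be serialized and shipped to the tasks.
class EntitySplit extends InputSplit implements Writable {
    long[] ids;
    public EntitySplit() {}                      // no-arg ctor for deserialization
    public EntitySplit(long[] ids) { this.ids = ids; }
    @Override public long getLength() { return ids.length; }
    @Override public String[] getLocations() { return new String[0]; }
    @Override public void write(DataOutput out) throws IOException {
        out.writeInt(ids.length);
        for (long id : ids) out.writeLong(id);
    }
    @Override public void readFields(DataInput in) throws IOException {
        ids = new long[in.readInt()];
        for (int i = 0; i < ids.length; i++) ids[i] = in.readLong();
    }
}

// Loads the full bean from the data source, one ID at a time.
class EntityRecordReader extends RecordReader<LongWritable, EntityBean> {
    private long[] ids;
    private int pos = -1;
    private EntityBean current;

    @Override public void initialize(InputSplit split, TaskAttemptContext ctx) {
        ids = ((EntitySplit) split).ids;
    }
    @Override public boolean nextKeyValue() {
        if (++pos >= ids.length) return false;
        current = loadEntity(ids[pos]);          // one lookup per ID
        return true;
    }
    @Override public LongWritable getCurrentKey() { return new LongWritable(ids[pos]); }
    @Override public EntityBean getCurrentValue() { return current; }
    @Override public float getProgress() {
        return ids.length == 0 ? 1f : (float) pos / ids.length;
    }
    @Override public void close() {}

    private EntityBean loadEntity(long id) {
        /* fetch the full bean from the data source */ return new EntityBean();
    }
}

public class EntityInputFormat extends InputFormat<LongWritable, EntityBean> {
    @Override public List<InputSplit> getSplits(JobContext context) {
        long[] allIds = fetchAllIds();           // the ~3M IDs
        int numSplits = context.getConfiguration().getInt("entity.num.splits", 100);
        int chunk = (allIds.length + numSplits - 1) / numSplits;
        List<InputSplit> splits = new ArrayList<>();
        for (int start = 0; start < allIds.length; start += chunk) {
            splits.add(new EntitySplit(Arrays.copyOfRange(
                allIds, start, Math.min(start + chunk, allIds.length))));
        }
        return splits;
    }
    @Override public RecordReader<LongWritable, EntityBean> createRecordReader(
            InputSplit split, TaskAttemptContext ctx) {
        return new EntityRecordReader();
    }
    private long[] fetchAllIds() {
        /* query all IDs from the data source */ return new long[0];
    }
}

I then load it into Spark with something like (jsc and conf being my
JavaSparkContext and Hadoop Configuration):

JavaPairRDD<LongWritable, EntityBean> rdd = jsc.newAPIHadoopRDD(
        conf, EntityInputFormat.class, LongWritable.class, EntityBean.class);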

My first observation is that some Spark tasks timed out, and it looks like
a Spark broadcast variable is being used to distribute my splits. Is that
correct? If so, what enhancements can I make from a performance perspective?
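
One idea I had, assuming my IDs can be grouped into contiguous ranges, is
to make each split carry only an ID range instead of the materialized ID
array, so each serialized split is just two longs. A rough sketch (same
imports as above):

// Lighter split: carries only a [startId, endId) range instead of the
// full ID array, so far less data is serialized per split.
class EntityRangeSplit extends InputSplit implements Writable {
    long startId, endId;                         // covers [startId, endId)
    public EntityRangeSplit() {}
    public EntityRangeSplit(long startId, long endId) {
        this.startId = startId;
        this.endId = endId;
    }
    @Override public long getLength() { return endId - startId; }
    @Override public String[] getLocations() { return new String[0]; }
    @Override public void write(DataOutput out) throws IOException {
        out.writeLong(startId);
        out.writeLong(endId);
    }
    @Override public void readFields(DataInput in) throws IOException {
        startId = in.readLong();
        endId = in.readLong();
    }
}

The RecordReader would then query the data source for the IDs falling in
its range. Would that kind of change help here, or is the timeout more
likely caused by something else?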

Thanks

-- 
--Anfernee
