Hi Spark experts,

First of all, happy Thanksgiving!
Now to my question: I have implemented a custom Hadoop InputFormat to load millions of entities from my data source into Spark (as a JavaRDD, then transformed to a DataFrame). The approach I took in implementing the custom Hadoop RDD is to first load all IDs of my data entities (each entity has a unique Long ID) and split the ID list (containing, for example, 3 million Long values) into a configured number of splits, each holding a subset of the IDs. My custom RecordReader then loads the full entity (a plain Java Bean) from the data source for each ID in its split.

My first observation is that some Spark tasks timed out, and it looks like a Spark broadcast variable is being used to distribute my splits. Is that correct? If so, what enhancements can I make to improve performance?

Thanks

--
--Anfernee
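To make the setup concrete, here is a minimal sketch of the ID-partitioning step described above: chopping the full ID list into a configured number of splits of near-equal size. The class and method names (`IdSplitter`, `split`) are hypothetical illustrations, not from any actual code; in the real implementation this logic would live inside the custom InputFormat's getSplits(), with each chunk wrapped in an InputSplit that the RecordReader later iterates over.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing the ID-partitioning step only.
// In a real custom InputFormat, each returned chunk would be
// wrapped in a Writable InputSplit handed to a RecordReader.
public class IdSplitter {

    // Partition `ids` into at most `numSplits` chunks of near-equal size.
    public static List<List<Long>> split(List<Long> ids, int numSplits) {
        List<List<Long>> splits = new ArrayList<>();
        // Ceiling division so the last chunk absorbs the remainder.
        int chunk = (ids.size() + numSplits - 1) / numSplits;
        for (int start = 0; start < ids.size(); start += chunk) {
            int end = Math.min(start + chunk, ids.size());
            splits.add(new ArrayList<>(ids.subList(start, end)));
        }
        return splits;
    }
}
```

With 3 million IDs and, say, 300 splits, each split would carry roughly 10,000 Long values for its RecordReader to fetch one entity at a time.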