Hi Ted,

Thanks for your response; perhaps this will help. I am trying to read binary files stored across a series of servers.
Line used to build the RDD:

    val BIN_pairRDD: RDD[(BIN_Key, BIN_Value)] =
      spark.newAPIHadoopFile("not.used", classOf[BIN_InputFormat],
        classOf[BIN_Key], classOf[BIN_Value], config)

To support this, we have the following custom classes:
- BIN_Key and BIN_Value as the paired entry for the RDD
- BIN_RecordReader and BIN_FileSplit to handle the special splits
- BIN_FileSplit overrides getLocations() and getLocationInfo(), and we have verified that the right IP address is being sent to Spark
- BIN_InputFormat queries a database for details about every split to be created: which file to read, and the IP address where that file is local

When it works:
- No problems running a local job
- No problems running in a cluster with one computer as Master and a second computer hosting 3 workers along with the files to process

When it fails:
- Running in a cluster with multiple workers and files spread across multiple computers: jobs are not assigned to the nodes where the files are local

Thanks,
Raajen
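For reference, here is a dependency-free sketch of the pattern described above. The class and member names (BinFileSplitSketch, path, host, length) are hypothetical stand-ins; in the real code BIN_FileSplit would extend org.apache.hadoop.mapreduce.InputSplit and implement org.apache.hadoop.io.Writable, and the method names below match that API. One point worth checking for the locality problem: Spark compares the strings returned by getLocations() against the host identifiers its executors registered with, so a split that advertises an IP address may silently fail to match an executor that registered under a hostname (or vice versa).

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Simplified sketch of a locality-aware split like BIN_FileSplit.
// (Hypothetical names; the real class extends Hadoop's InputSplit
// and implements Writable.)
class BinFileSplitSketch(var path: String, var host: String, var length: Long) {

  // No-arg constructor, required so the framework can deserialize a split.
  def this() = this("", "", 0L)

  def getLength: Long = length

  // Preferred hosts for this split. The scheduler matches these strings
  // against executor host identifiers, so hostname-vs-IP mismatches here
  // defeat locality-aware scheduling.
  def getLocations: Array[String] = Array(host)

  // Writable-style serialization, as splits are shipped to executors.
  def write(out: DataOutputStream): Unit = {
    out.writeUTF(path); out.writeUTF(host); out.writeLong(length)
  }

  def readFields(in: DataInputStream): Unit = {
    path = in.readUTF(); host = in.readUTF(); length = in.readLong()
  }
}

object BinFileSplitSketch {
  // Serialize and deserialize a split, mimicking what happens when
  // the framework ships it to a worker.
  def roundTrip(split: BinFileSplitSketch): BinFileSplitSketch = {
    val bytes = new ByteArrayOutputStream()
    split.write(new DataOutputStream(bytes))
    val copy = new BinFileSplitSketch()
    copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))
    copy
  }
}
```

If the location strings do line up, the other knob to look at is spark.locality.wait: if it is very low (or zero), the scheduler gives up waiting for a node-local slot and runs tasks wherever executors are free.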