I am running into a 'too many open files' issue. Before posting this I searched the web to see if anyone already had a solution to my particular problem, but I did not find anything that helped.
I know I can adjust the max open files allowed by the OS, but I'd rather fix the underlying issue. In HDFS I have a directory that contains hourly files in a structure like this: directoryname/hourly/2015/06/02/13 (where the trailing components are the date in YYYY/MM/DD/HH format). I have a Spark job that roughly does the following:

    val directories = findModifiedFiles()
    directories.foreach(directory => {
      sparkContext.newAPIHadoopFile(directory)
        .filter(...)
        .map(...)
        .reduceByKey(...)
        .foreachPartition(iterator => {
          iterator.foreach(tuple => {
            // send data to Kafka
          })
        })
    })

If only a few directories have been modified, this works pretty well. But when I have the job reprocess all the data (I have 350M of test data with entries for pretty much every hour of every day for a full year), I run out of file handles. I ran the test on a cluster of 2 Hadoop slave nodes, each running an HDFS DataNode and a YARN NodeManager. When I run "lsof -p" on the Spark processes, I see a lot of open files like these:

    java 21196 yarn 3268r REG 8,16   139 533320 /hadoop/hdfs/data/current/BP-479153573-10.240.60.21-1441026862238/current/finalized/subdir0/subdir17/blk_1073746324_5500.meta
    java 21196 yarn 3269r REG 8,16 15004 533515 /hadoop/hdfs/data/current/BP-479153573-10.240.60.21-1441026862238/current/finalized/subdir0/subdir17/blk_1073746422
    java 21196 yarn 3270r REG 8,16   127 533516 /hadoop/hdfs/data/current/BP-479153573-10.240.60.21-1441026862238/current/finalized/subdir0/subdir17/blk_1073746422_5598.meta
    java 21196 yarn 3271r REG 8,16 15583 534081 /hadoop/hdfs/data/current/BP-479153573-10.240.60.21-1441026862238/current/finalized/subdir0/subdir19/blk_1073746704
    java 21196 yarn 3272r REG 8,16   131 534082 /hadoop/hdfs/data/current/BP-479153573-10.240.60.21-1441026862238/current/finalized/subdir0/subdir19/blk_1073746704_5880.meta

While the job is processing directories, I can watch the number of those open files climb steadily until it exhausts the file handles. (It then drops to zero and climbs again until it runs out once more, but that is only because YARN retries the job.) It ends up holding about 4500 file handles to those files per node.

As I said, I know I can increase the number of open file handles, and I will do that, but in my opinion the count should not keep growing. I would have expected Spark to close all the resources it opened for an RDD once I was done with it, i.e. to release the file handles after each iteration of the directories.foreach loop. I looked for a close() method or something similar on the RDD but couldn't find one.

Am I doing something that prevents Spark from closing the file handles? Should I write this job differently?

Thanks,
Sigurd
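
P.S. In case it helps, here is a more complete, self-contained version of the loop. The input format, the key/value types, and the helpers marked with ??? are stand-ins for brevity (my real job uses a different input format and a real Kafka producer), but the structure is the same:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object HourlyReprocess {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hourly-reprocess"))

        // Stand-in for my real findModifiedFiles(); returns HDFS paths like
        // directoryname/hourly/2015/06/02/13
        val directories: Seq[String] = findModifiedFiles()

        directories.foreach { directory =>
          // One RDD per hourly directory; this is where I would expect the
          // input streams to be closed again once the action below finishes.
          sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](directory)
            .filter { case (_, line) => line.toString.nonEmpty }       // real filter elided
            .map { case (_, line) => (extractKey(line.toString), 1L) } // real map elided
            .reduceByKey(_ + _)
            .foreachPartition { iterator =>
              iterator.foreach { tuple =>
                sendToKafka(tuple) // stand-in for the real Kafka producer call
              }
            }
        }
      }

      def findModifiedFiles(): Seq[String] = ???
      def extractKey(line: String): String = ???
      def sendToKafka(tuple: (String, Long)): Unit = ???
    }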