I am running into a 'too many open files' issue, and before posting this I
searched the web to see if anyone already had a solution to my particular
problem, but I did not find anything that helped.

I know I can adjust the maximum number of open files allowed by the OS, but
I'd rather fix the underlying issue.

In HDFS I have a directory that contains hourly files in a directory
structure like this: directoryname/hourly/2015/06/02/13 (where the trailing
numbers are the date and hour in YYYY/MM/DD/HH format).

I have a Spark job that roughly does the following:

val directories = findModifiedFiles()
directories.foreach(directory => {
    sparkContext.newAPIHadoopFile(directory)
        .filter(...)
        .map(...)
        .reduceByKey(...)
        .foreachPartition(iterator => {
            iterator.foreach(tuple => {
                // send data to kafka
            })
        })
})
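
In case it helps with my question at the end about writing the job
differently: one variant I have been considering (just a sketch, I have not
verified whether it changes the file handle behaviour; the key/value types
and TextInputFormat are placeholders for my real input format) would be to
union all the modified directories into a single RDD, so the whole run is
one job instead of one job per directory:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext

// Sketch: read every modified directory into one unioned RDD and process it
// in a single pass, instead of running one small job per directory.
def processAllAtOnce(sc: SparkContext, directories: Seq[String]): Unit = {
  val combined = sc.union(
    directories.map(dir => sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](dir))
  )
  combined
    .map { case (_, line) => (line.toString.take(10), 1L) } // placeholder for my real filter/map
    .reduceByKey(_ + _)                                     // placeholder aggregation
    .foreachPartition { iterator =>
      iterator.foreach { tuple =>
        // send data to Kafka here (omitted)
      }
    }
}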

If only a few directories have been modified, this works pretty well. But
when I have the job reprocess all the data (I have 350M of test data
covering pretty much every hour of every day for a full year), I run out of
file handles.

I executed the test on a test cluster of 2 Hadoop slave nodes, each running
an HDFS DataNode and a YARN NodeManager.

When I run "lsof -p" on the Spark processes, I see a lot of the following
types of open files:

java    21196 yarn 3268r   REG               8,16       139  533320
/hadoop/hdfs/data/current/BP-479153573-10.240.60.21-1441026862238/current/finalized/subdir0/subdir17/blk_1073746324_5500.meta

java    21196 yarn 3269r   REG               8,16     15004  533515
/hadoop/hdfs/data/current/BP-479153573-10.240.60.21-1441026862238/current/finalized/subdir0/subdir17/blk_1073746422

java    21196 yarn 3270r   REG               8,16       127  533516
/hadoop/hdfs/data/current/BP-479153573-10.240.60.21-1441026862238/current/finalized/subdir0/subdir17/blk_1073746422_5598.meta

java    21196 yarn 3271r   REG               8,16     15583  534081
/hadoop/hdfs/data/current/BP-479153573-10.240.60.21-1441026862238/current/finalized/subdir0/subdir19/blk_1073746704

java    21196 yarn 3272r   REG               8,16       131  534082
/hadoop/hdfs/data/current/BP-479153573-10.240.60.21-1441026862238/current/finalized/subdir0/subdir19/blk_1073746704_5880.meta

When I watch the process run, I can see the number of those open files
steadily increasing while it processes directories, until it runs out of
file handles (the count then drops to zero and climbs again until it runs
out once more, but that is just because YARN retries the job). It ends up
opening roughly 4500 file handles to those block files per node.

As I said, I know I can increase the number of open file handles, and I
will do that, but in my opinion the count should not keep growing. I would
have expected that once I am done with an RDD, Spark would close all the
resources it opened for it (so the file handles would be released after
each iteration of the directories.foreach loop). I looked for a close()
method or something similar on the RDD but couldn't find one.
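
To make it concrete, here is roughly what I was hoping to be able to do at
the end of each loop iteration (just a sketch; the types and TextInputFormat
are placeholders for my real job, and as far as I can tell unpersist() is
the nearest existing method but it only drops cached blocks):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext

// Sketch of the cleanup I was looking for after processing one directory.
def processAndRelease(sc: SparkContext, directory: String): Unit = {
  val rdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](directory)
  rdd.foreachPartition { iterator =>
    iterator.foreach { tuple =>
      // filter/map/reduceByKey and the Kafka producer omitted for brevity
    }
  }
  rdd.unpersist() // exists, but as far as I can tell it only drops cached blocks;
                  // there is no close()-like method that releases the input file handles
}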

Am I doing something that is causing Spark not to close the file handles?
Should I write this job differently?
Thanks,

Sigurd
