just 50 lines of each file. Please find the code at the link below:
https://gist.github.com/ashwini-anand/0e468da9b4ab7863dff14833d34de79e
Each file in the directory can be very large in my case, so using the
wholeTextFiles API will be inefficient. Right now
I am reading each file of a directory using wholeTextFiles. After that, I
call a function on each element of the RDD using map. The whole program
uses just 50 lines of each file. The code is as below:

def processFiles(fileNameContentsPair):
    fileName = fileNameContentsPair[0]
    result = "\n\n"+f
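
Since only the first 50 lines of each file are needed, one possible alternative to loading whole files is to read just the head of each file. The following is a minimal sketch, not the code from the gist: the helper names first_n_lines and process_file are hypothetical, and it assumes the files can be opened with plain Python file I/O on whichever machine runs the function.

```python
from itertools import islice
import os
import tempfile


def first_n_lines(path, n=50):
    """Return at most the first n lines of the file at `path` (hypothetical helper)."""
    with open(path, "r") as handle:
        # islice stops iterating after n lines, so the remainder of a
        # large file is never loaded into memory.
        return [line.rstrip("\n") for line in islice(handle, n)]


def process_file(path, n=50):
    """Pair the file name with its first n lines, similar in shape to the
    (name, contents) pairs produced by wholeTextFiles."""
    return (path, first_n_lines(path, n))


# Small local demonstration with a throwaway 1000-line file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w") as out:
        out.write("\n".join("line %d" % i for i in range(1000)))
    demo_name, demo_head = process_file(demo_path)

# Assumption: with Spark, and only if every worker can open the paths
# directly (for example on a shared mount), the same function could be
# mapped over a list of paths:
#   sc.parallelize(paths).map(process_file)
```

The trade-off is that this bypasses Hadoop's input formats, so it would not work as written for paths that are only reachable through HDFS.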
By looking into the source code, I found that for textFile(), the
partitioning is computed by the computeSplitSize() function in the
FileInputFormat class. This function takes into consideration the
minPartitions value passed by the user. As per my understanding, the same thing
for binaryFiles() is comput
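
To make the textFile() case concrete, here is a small Python sketch of the arithmetic as I understand it from Hadoop's old-API (org.apache.hadoop.mapred) FileInputFormat, where the split size is max(minSize, min(goalSize, blockSize)) and goalSize is the total input size divided by the requested number of splits. The Python function names and the example sizes are my own for illustration, and the sketch says nothing about how binaryFiles() behaves.

```python
def goal_size(total_size, num_splits):
    """Total input size divided by the requested split count (0 is treated as 1)."""
    return total_size // (num_splits if num_splits != 0 else 1)


def compute_split_size(goal, min_size, block_size):
    """Python mirror of the old-API formula: max(minSize, min(goalSize, blockSize))."""
    return max(min_size, min(goal, block_size))


MIB = 1024 ** 2
total = 1024 * MIB   # illustrative 1 GiB of input
block = 128 * MIB    # illustrative 128 MiB block size

# minPartitions = 2: the goal (512 MiB) exceeds the block size,
# so the split size is capped at one block.
split_small = compute_split_size(goal_size(total, 2), 1, block)

# minPartitions = 16: the goal (64 MiB) is below the block size,
# so the split size shrinks and more splits are produced.
split_large = compute_split_size(goal_size(total, 16), 1, block)
```

In this sketch a larger minPartitions can only make splits smaller; a small value leaves the block size as the upper bound, which is why minPartitions acts as a lower bound on the number of partitions rather than an exact count.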