Hi all,

We are migrating from MapReduce to Spark and have run into a problem.
Our input files are IIS logs with a file header. Getting the header is easy when we process a single file, e.g.

    val lines = sc.textFile("hdfs://*/u_ex14073011.log")
    val head = lines.take(4)

and then we can write our map method using this head. However, if we input multiple files, each of which may have a different header, how can we get the header for each partition? It seems we have two options:

1. Still use textFile() to get the lines. Since each partition may have a different header, we would have to write a mapPartitionsWithContext method, but we cannot find a way to get the header for each partition. In our former MapReduce program we could simply call

       Path path = ((FileSplit) context.getInputSplit()).getPath();

   but there seems to be no equivalent in Spark, since HadoopPartition, which wraps the InputSplit inside HadoopRDD, is a private class. (See the first sketch below for a workaround we are considering.)

2. Use wholeTextFiles() to get each file's whole contents. That makes it easy to get the header for each file, but according to the documentation this API is preferred for small files. (A sketch of this follows below as well.)

*Any suggestions on how to process these files with headers?*
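To make option 1 concrete, here is a rough sketch of the workaround we are considering. It assumes a Spark version in which HadoopRDD exposes the @DeveloperApi method mapPartitionsWithInputSplit; the path glob is a placeholder:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD

    // Read through hadoopFile() instead of textFile(), so the result is a
    // HadoopRDD and the underlying InputSplit is reachable.
    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://*/logs/*.log")

    // mapPartitionsWithInputSplit hands each partition its InputSplit, from
    // which the file path can be recovered, like FileSplit in MapReduce.
    val linesWithPath = raw.asInstanceOf[HadoopRDD[LongWritable, Text]]
      .mapPartitionsWithInputSplit((split, iter) => {
        val path = split.asInstanceOf[FileSplit].getPath.toString
        // Hadoop reuses Text objects, so materialize each line with toString.
        iter.map { case (_, text) => (path, text.toString) }
      }, preservesPartitioning = true)

Note that this only recovers the file path for each partition; the header lines themselves still live only in each file's first split, so we would still have to collect the per-file headers separately and, say, broadcast them.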
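And for option 2, a minimal sketch, assuming each log file fits comfortably in memory and that the header is always the first 4 lines:

    // Each element of the RDD is (file path, entire file contents).
    val files = sc.wholeTextFiles("hdfs://*/logs/*.log")

    val records = files.flatMap { case (path, content) =>
      val lines = content.split("\r?\n")
      val (head, body) = lines.splitAt(4)   // header assumed to be the first 4 lines
      body.map(line => (head.toSeq, line))  // pair each data line with its header
    }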