What's the format of the file header? Is it possible to filter the header lines out by prefix string matching or a regex?
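For example, if these are standard IIS W3C-format logs where every header line starts with '#' (e.g. #Software:, #Version:, #Date:, #Fields:), something along these lines might work. This is only a sketch under that assumption, not tested against your data:

    // Assumption: header lines (and only header lines) start with '#'.
    val lines = sc.textFile("hdfs://*/u_ex*.log")

    // Drop all header lines and keep just the data records...
    val dataLines = lines.filter(line => !line.startsWith("#"))

    // ...or keep only the "#Fields:" lines if you need the column layout.
    val fieldsLines = lines.filter(_.startsWith("#Fields:"))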
On Wed, Jul 30, 2014 at 1:39 PM, Fengyun RAO <raofeng...@gmail.com> wrote:

> It will certainly cause bad performance, since it reads the whole content
> of a large file into one value, instead of splitting it into partitions.
>
> Typically one file is 1 GB. Suppose we have 3 large files; in this way,
> there would only be 3 key-value pairs, and thus 3 tasks at most.
>
>
> 2014-07-30 12:49 GMT+08:00 Hossein <fal...@gmail.com>:
>
>> You can use SparkContext.wholeTextFiles().
>>
>> Please note that the documentation suggests: "Small files are preferred,
>> large file is also allowable, but may cause bad performance."
>>
>> --Hossein
>>
>>
>> On Tue, Jul 29, 2014 at 9:21 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> This is an interesting question. I'm curious to know as well how this
>>> problem can be approached.
>>>
>>> Is there a way, perhaps, to ensure that each input file matching the
>>> glob expression gets mapped to exactly one partition? Then you could
>>> probably get what you want using RDD.mapPartitions().
>>>
>>> Nick
>>>
>>>
>>> On Wed, Jul 30, 2014 at 12:02 AM, Fengyun RAO <raofeng...@gmail.com>
>>> wrote:
>>>
>>>> Hi, all
>>>>
>>>> We are migrating from MapReduce to Spark, and have encountered a problem.
>>>>
>>>> Our input files are IIS logs with file headers. It's easy to get the
>>>> header if we process only one file, e.g.
>>>>
>>>> val lines = sc.textFile("hdfs://*/u_ex14073011.log")
>>>> val head = lines.take(4)
>>>>
>>>> Then we can write our map method using this header.
>>>>
>>>> However, if we input multiple files, each of which may have a different
>>>> header, how can we get the header for each partition?
>>>>
>>>> It seems we have two options:
>>>>
>>>> 1. Still use textFile() to get lines.
>>>>
>>>> Since each partition may have a different header, we have to write a
>>>> mapPartitionsWithContext method. However, we can't find a way to get
>>>> the header for each partition.
>>>>
>>>> In our former MapReduce program, we could simply use
>>>>
>>>> Path path = ((FileSplit) context.getInputSplit()).getPath()
>>>>
>>>> but there seems to be no way to do this in Spark, since HadoopPartition,
>>>> which wraps the InputSplit inside HadoopRDD, is a private class.
>>>>
>>>> 2. Use wholeTextFiles() to get whole file contents.
>>>>
>>>> It's easy to get the header for each file, but according to the
>>>> documentation, this API is better suited for small files.
>>>>
>>>> *Any suggestions on how to process these files with headers?*
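On Nick's suggestion: if you can guarantee one file per partition (for example, gzipped logs are not splittable, so each .log.gz becomes a single partition), a mapPartitions-based approach could pull the header off the front of each partition and use it to parse the rest. A rough sketch, assuming '#'-prefixed headers with the column layout in a '#Fields:' line; the field handling here is purely illustrative:

    // Assumption: each input file lands in exactly one partition, headers
    // start with '#', and the column names come from the '#Fields:' line.
    val lines = sc.textFile("hdfs://*/u_ex*.log.gz")

    val parsed = lines.mapPartitions { iter =>
      val buffered = iter.buffered
      var fields: Array[String] = Array.empty
      // Consume the leading header lines of this partition's file.
      while (buffered.hasNext && buffered.head.startsWith("#")) {
        val headerLine = buffered.next()
        if (headerLine.startsWith("#Fields:"))
          fields = headerLine.stripPrefix("#Fields:").trim.split(" ")
      }
      // Map each remaining data line to (field name -> value) using this file's header.
      buffered.map(line => fields.zip(line.split(" ")).toMap)
    }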