What's the format of the file header? Is it possible to filter the header lines
out by prefix string matching or a regex?
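
For example, if the header lines all start with a known prefix such as "#"
(as W3C-format IIS logs typically do), a plain filter might already be enough.
A minimal sketch, assuming "#" marks header lines only; the glob path is just
illustrative:

    val lines = sc.textFile("hdfs://.../u_ex*.log")
    // keep only data lines; header lines like "#Software:" or "#Fields:" are dropped
    val dataLines = lines.filter(line => !line.startsWith("#"))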


On Wed, Jul 30, 2014 at 1:39 PM, Fengyun RAO <raofeng...@gmail.com> wrote:

> It will certainly cause bad performance, since it reads the whole content
> of a large file into one value, instead of splitting it into partitions.
>
> Typically one file is about 1 GB. Suppose we have 3 large files; in that case
> there would be only 3 key-value pairs, and thus at most 3 tasks.
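>
> To illustrate (the directory path is made up): wholeTextFiles() yields one
> (path, content) record per file, so parallelism is bounded by the file count:
>
>     val pairs = sc.wholeTextFiles("hdfs://.../logs/")
>     pairs.count()   // 3 records for 3 files, so at most 3 tasks do real work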
>
>
> 2014-07-30 12:49 GMT+08:00 Hossein <fal...@gmail.com>:
>
>> You can use SparkContext.wholeTextFiles().
>>
>> Please note that the documentation suggests: "Small files are preferred,
>> large file is also allowable, but may cause bad performance."
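>>
>> A minimal usage sketch (the path is only an example); each element is a
>> (filePath, entireFileContent) pair:
>>
>>     val files = sc.wholeTextFiles("hdfs://.../logs/*.log")   // RDD[(String, String)]
>>     val heads = files.mapValues(content => content.split("\r?\n").take(4))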
>>
>> --Hossein
>>
>>
>> On Tue, Jul 29, 2014 at 9:21 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> This is an interesting question. I’m curious to know as well how this
>>> problem can be approached.
>>>
>>> Is there a way, perhaps, to ensure that each input file matching the
>>> glob expression gets mapped to exactly one partition? Then you could
>>> probably get what you want using RDD.mapPartitions().
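>>>
>>> A sketch of that idea, assuming each partition really does hold exactly one
>>> whole file (plain textFile() won't guarantee that for 1 GB inputs);
>>> parseFieldsLine and parse are hypothetical helpers:
>>>
>>>     val parsed = lines.mapPartitions { iter =>
>>>       // peel the header lines off the front of this partition
>>>       val (headIter, bodyIter) = iter.span(_.startsWith("#"))
>>>       val head = headIter.toList           // materialize the few header lines first
>>>       val fields = parseFieldsLine(head)   // hypothetical: interpret the "#Fields:" line
>>>       bodyIter.map(line => parse(line, fields))   // hypothetical per-line parser
>>>     }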
>>>
>>> Nick
>>>
>>>
>>> On Wed, Jul 30, 2014 at 12:02 AM, Fengyun RAO <raofeng...@gmail.com>
>>> wrote:
>>>
>>>> Hi, all
>>>>
>>>> We are migrating from MapReduce to Spark and have encountered a problem.
>>>>
>>>> Our input files are IIS logs, each with a file header. It's easy to get the
>>>> header if we process only one file, e.g.
>>>>
>>>> val lines = sc.textFile("hdfs://*/u_ex14073011.log")
>>>> val head = lines.take(4)
>>>>
>>>> Then we can write our map method using this header.
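>>>>
>>>> For a single file that could look roughly like this (a sketch assuming the
>>>> usual space-separated W3C "#Fields:" header line):
>>>>
>>>>     val fieldNames = head.find(_.startsWith("#Fields:"))
>>>>       .map(_.stripPrefix("#Fields:").trim.split(" ").toSeq)
>>>>       .getOrElse(Seq.empty)
>>>>     // turn each data line into a fieldName -> value map
>>>>     val records = lines.filter(!_.startsWith("#"))
>>>>       .map(line => fieldNames.zip(line.split(" ")).toMap)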
>>>>
>>>> However, if we input multiple files, each of which may have a different
>>>> header, how can we get the header for each partition?
>>>>
>>>> It seems we have two options:
>>>>
>>>> 1. Still use textFile() to get lines.
>>>>
>>>> Since each partition may have a different header, we would have to write a
>>>> mapPartitionsWithContext method. However, we can't find a way to get the
>>>> header for each partition.
>>>>
>>>> In our former MapReduce program, we could simply use
>>>>
>>>> Path path = ((FileSplit) context.getInputSplit()).getPath();
>>>>
>>>> but there seems to be no way to do this in Spark, since HadoopPartition,
>>>> which wraps the InputSplit inside HadoopRDD, is a private class.
>>>>
>>>> 2. Use wholeTextFiles() to get whole file contents.
>>>>
>>>> It's easy to get the file header for each file, but according to the
>>>> documentation, this API is better suited to small files.
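>>>>
>>>> A rough sketch of that direction (the path and the parse helper are
>>>> illustrative only):
>>>>
>>>>     val parsed = sc.wholeTextFiles("hdfs://.../logs/*.log").flatMap {
>>>>       case (path, content) =>
>>>>         val fileLines = content.split("\r?\n")
>>>>         // header lines sit at the top of each file and start with "#"
>>>>         val (head, body) = fileLines.span(_.startsWith("#"))
>>>>         body.map(line => parse(line, head))   // hypothetical per-line parser
>>>>     }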
>>>>
>>>>
>>>> *Any suggestions on how to process these files with headers?*
>>>>
>>>
>>>
>>
>
