Hi,

Yes, you are correct. I suggest running an MR job once an hour to merge those 60 files into one file.
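A minimal sketch of such a merge job, assuming plain-text events (one per line) and the Hadoop 2 "mapreduce" API; the class name and argument layout here are illustrative, not something Flume ships:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HourlyMerge {

  // Drop the byte-offset key so output lines match the input lines
  // (TextOutputFormat omits NullWritable keys).
  public static class LineMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(NullWritable.get(), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "hourly-merge");
    job.setJarByClass(HourlyMerge.class);
    job.setMapperClass(LineMapper.class);
    // The base Reducer passes records through unchanged; a single
    // reduce task funnels all of the hour's small files into one output file.
    job.setReducerClass(Reducer.class);
    job.setNumReduceTasks(1);
    job.setMapOutputKeyClass(NullWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // the hour's directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // merged output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that a single reducer does not preserve the original line order; if order matters you would need a real sort key. For plain-text files, FileUtil.copyMerge from the Hadoop FileSystem API is a simpler, non-MapReduce alternative.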
Brock

On Mon, Nov 5, 2012 at 9:49 AM, Pankaj Gupta <[email protected]> wrote:
> Hi Brock,
>
> But then if I rotate frequently, e.g. every minute, the total number of
> files in a single folder of HDFS will go into the thousands very quickly.
> I am not sure how (or if) that will affect HDFS namenode performance, and
> I worry that it may suffer. I don't have a lot of experience with HDFS;
> do you happen to know if having thousands of files in a single directory
> in HDFS is common?
>
> Thanks,
> Pankaj
>
>
> On Nov 5, 2012, at 7:30 AM, Brock Noland <[email protected]> wrote:
>
>> Hi,
>>
>> If you just did not bucket the data at all, the files would be organized
>> by the time the events arrived at the sink.
>>
>> Brock
>>
>> On Fri, Nov 2, 2012 at 6:08 PM, Pankaj Gupta <[email protected]> wrote:
>>> Hi,
>>>
>>> Is it possible to organize files written to HDFS into buckets based on
>>> the time of writing rather than the timestamp in the header?
>>> Alternatively, is it possible to insert the timestamp interceptor just
>>> before the HDFS Sink?
>>>
>>> My use case is to organize files so that they are ordered
>>> chronologically as well as alphabetically by name, and so that only one
>>> file is being written to at a time. This will make it easier to spot
>>> newly available data so that MapReduce jobs can process it.
>>>
>>> Thanks in advance,
>>> Pankaj
>>
>>
>> --
>> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
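For reference, the interceptor question in the original post maps onto a small piece of Flume configuration. This is a sketch, assuming Flume 1.x property-file configuration; the agent, source, and sink names (agent1, src1, sink1) are hypothetical:

# Attach the timestamp interceptor at the source so every event
# carries a "timestamp" header set when it enters the agent.
agent1.sources.src1.interceptors = ts
agent1.sources.src1.interceptors.ts.type = timestamp

agent1.sinks.sink1.type = hdfs
# The escape sequences in hdfs.path are expanded from the event's
# timestamp header, giving one directory per hour.
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d/%H
# Roll the open file once a minute (value is in seconds).
agent1.sinks.sink1.hdfs.rollInterval = 60

One caveat: interceptors attach at the source rather than just before the sink, so the timestamp reflects when an event entered the agent, not when the sink wrote it. For a single-hop agent those are usually close, which approximates the arrival-time bucketing discussed above.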
