Re: HDFS Sink log rotation on the basis of time of writing

Pankaj Gupta Mon, 05 Nov 2012 07:49:43 -0800

Hi Brock,

But then if I rotate frequently, e.g. every minute, the total number of files in
a single HDFS folder will run into the thousands very quickly. I am not sure
how, or if, that will affect HDFS namenode performance, and I worry that it may
suffer. I don't have a lot of experience with HDFS. Do you happen to know if
having thousands of files in a single directory in HDFS is common?
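
To put rough numbers on that concern, here is a back-of-the-envelope sketch. It
assumes a single writer and compares one flat folder per day against hourly
subfolders named by time of writing. The helper names, the /flume/events base
path, and the hourly layout are hypothetical and only for illustration; they
are not Flume settings.

```python
from datetime import datetime

DAY = 24 * 60 * 60
HOUR = 60 * 60

def files_per_directory(roll_seconds, bucket_seconds, writers=1):
    """Upper bound on files that land in one directory during one bucket window."""
    # Each writer closes one file per roll interval.
    return writers * (bucket_seconds // roll_seconds)

def bucket_path(base, written_at, pattern="%Y/%m/%d/%H"):
    """Directory for a file, derived from the time of writing (hypothetical layout)."""
    return f"{base}/{written_at.strftime(pattern)}"

# Rotating every minute into one folder per day:
print(files_per_directory(60, DAY))    # 1440
# Same rotation, but with hourly subfolders:
print(files_per_directory(60, HOUR))   # 60
# Zero-padded path components sort alphabetically in chronological order:
print(bucket_path("/flume/events", datetime(2012, 11, 5, 7, 49)))
# /flume/events/2012/11/05/07
```

So per-minute rotation with no bucketing adds 1440 files per day to the folder,
while hourly folders would cap each directory at 60 files per writer.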


Thanks,
Pankaj


On Nov 5, 2012, at 7:30 AM, Brock Noland <[email protected]> wrote:

> Hi,
> 
> If you just did not bucket the data at all, it would be organized by
> the time it arrived at the sink.
> 
> Brock
> 
> On Fri, Nov 2, 2012 at 6:08 PM, Pankaj Gupta <[email protected]> wrote:
>> Hi,
>> 
>> Is it possible to organize files written to HDFS into buckets based on the
>> time of writing rather than the timestamp in the header? Alternatively, is
>> it possible to insert the timestamp injector just before the HDFS Sink?
>> 
>> My use case is to organize files such that they are ordered
>> chronologically as well as alphabetically by name, and that there is only one
>> file being written to at a time. This will make it easier to look for newly
>> available data so that MapReduce jobs can process it.
>> 
>> Thanks in Advance,
>> Pankaj
>> 
>> 
>> 
> 
> 
> 
> -- 
> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
