Hi,

Yes, you are correct. I suggest running an MR job once an hour to merge those 60 files into one file.
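A minimal sketch of such a merge job, assuming plain-text events (one per line) and the Hadoop 2 "mapreduce" API; the class name and argument layout here are illustrative, not something Flume ships:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HourlyMerge {

  // Drop the byte-offset key so output lines match the input lines
  // (TextOutputFormat omits NullWritable keys).
  public static class LineMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(NullWritable.get(), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "hourly-merge");
    job.setJarByClass(HourlyMerge.class);
    job.setMapperClass(LineMapper.class);
    // The base Reducer passes records through unchanged; a single
    // reduce task funnels all of the hour's small files into one output file.
    job.setReducerClass(Reducer.class);
    job.setNumReduceTasks(1);
    job.setMapOutputKeyClass(NullWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // the hour's directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // merged output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that a single reducer does not preserve the original line order; if order matters you would need a real sort key. For plain-text files, FileUtil.copyMerge from the Hadoop FileSystem API is a simpler, non-MapReduce alternative.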
Brock

On Mon, Nov 5, 2012 at 9:49 AM, Pankaj Gupta <[email protected]> wrote:
> Hi Brock,
>
> But then if I rotate frequently, e.g. every minute, the total number of
> files in a single folder of HDFS will go into the thousands very quickly.
> I am not sure how (or if) that will affect HDFS namenode performance, and
> I worry that it may suffer. I don't have a lot of experience with HDFS;
> do you happen to know if having thousands of files in a single directory
> in HDFS is common?
>
> Thanks,
> Pankaj
>
>
> On Nov 5, 2012, at 7:30 AM, Brock Noland <[email protected]> wrote:
>
>> Hi,
>>
>> If you just did not bucket the data at all, the files would be organized
>> by the time the events arrived at the sink.
>>
>> Brock
>>
>> On Fri, Nov 2, 2012 at 6:08 PM, Pankaj Gupta <[email protected]> wrote:
>>> Hi,
>>>
>>> Is it possible to organize files written to HDFS into buckets based on
>>> the time of writing rather than the timestamp in the header?
>>> Alternatively, is it possible to insert the timestamp interceptor just
>>> before the HDFS Sink?
>>>
>>> My use case is to organize files so that they are ordered
>>> chronologically as well as alphabetically by name, and so that only one
>>> file is being written to at a time. This will make it easier to spot
>>> newly available data so that MapReduce jobs can process it.
>>>
>>> Thanks in advance,
>>> Pankaj
>>
>>
>> --
>> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
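For reference, the interceptor question in the original post maps onto a small piece of Flume configuration. This is a sketch, assuming Flume 1.x property-file configuration; the agent, source, and sink names (agent1, src1, sink1) are hypothetical:

# Attach the timestamp interceptor at the source so every event
# carries a "timestamp" header set when it enters the agent.
agent1.sources.src1.interceptors = ts
agent1.sources.src1.interceptors.ts.type = timestamp

agent1.sinks.sink1.type = hdfs
# The escape sequences in hdfs.path are expanded from the event's
# timestamp header, giving one directory per hour.
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d/%H
# Roll the open file once a minute (value is in seconds).
agent1.sinks.sink1.hdfs.rollInterval = 60

One caveat: interceptors attach at the source rather than just before the sink, so the timestamp reflects when an event entered the agent, not when the sink wrote it. For a single-hop agent those are usually close, which approximates the arrival-time bucketing discussed above.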
