Ya, I read this first; I find it slightly odd that the idleTimeout implementation doesn't persist through the file being closed.
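For reference, here is a minimal sketch of the configuration Juhani describes for FLUME-1850 (the agent/sink names and the path layout are assumptions; the hourly bucketing and the 4000/900 values come from his message):

    a1.sinks.k1.type = hdfs
    # hourly time-bucketed path (assumed layout)
    a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H
    a1.sinks.k1.hdfs.rollInterval = 4000
    a1.sinks.k1.hdfs.idleTimeout = 900

When the hour turns and events move to the next bucket, roughly 400 seconds remain on the roll timer versus 900 on the idle timer, so the roll fires first; the close cancels the idle timer, and the closed writer is never removed from sfWriters.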
On Thu, Jan 17, 2013 at 6:39 PM, Juhani Connolly <[email protected]> wrote:

> I laid out why it was happening in FLUME-1850.
>
> He has hourly bucketing, a 4000-second rollInterval and a 900-second
> idleTimeout.
>
> After an hour, 400 seconds remain on the interval. So the interval gets
> triggered first, which triggers close, which cancels all timers including
> the idleTimeout. Thus the entry in sfWriters remains. His memory dump
> confirms this (he has a huge sfWriters map in memory after 30 days). I
> also confirmed this behaviour of rollInterval when developing the
> idleTimeout feature.
>
> You're right about the limit on the size of sfWriters. With a limit of
> 5000, even if the closed ones stay in the list, they shouldn't be that
> big, since their buffers should be cleaned up.
>
> idleTimeout will indeed result in more files if you don't have a steady
> stream of events. It is most useful with a steady stream of data and
> time-bucketed data. In such situations, I might even recommend not using
> rollInterval at all and having a short idleTimeout (or, if you're not in
> a rush to get your file closed, give it a comfortably long timeout).
>
> On 01/18/2013 11:19 AM, Connor Woodson wrote:
>
>> Whether idleTimeout is lower or higher than rollInterval is a
>> preference. Set it lower and, assuming you get one message right on the
>> turn of the hour, you will have some part of that hour without any
>> bucketwriters; but if you get another message at the end of the hour,
>> you will end up with two files instead of one. Set idleTimeout to be
>> longer and you will get just one file, but also (in the worst case) you
>> will have twice as many bucketwriters open; so it all depends on how
>> many files you want / how much memory you have to spare.
>>
>> - Connor
>>
>> An aside: bucketwriters, after being closed by rollInterval, aren't
>> really a memory leak; they are just very rarely useful to keep around
>> (your path could rely on hostname, and you could use a rollInterval,
>> and then those bucketwriters would still remain useful). And they will
>> get removed eventually; by default, after you've created your 5001st
>> bucketwriter, the first (or whichever was used longest ago) will be
>> removed.
>>
>> And I don't think that's the cause behind FLUME-1850, as he did have an
>> idleTimeout set at 15 minutes.
>>
>> On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly
>> <[email protected]> wrote:
>>
>>> It's also useful if you want files to get promptly closed and renamed
>>> from the .tmp or whatever.
>>>
>>> We use it with something like a 30-second setting (we have a constant
>>> stream of data) and hourly bucketing.
>>>
>>> There is also the issue that files closed by rollInterval are never
>>> removed from the internal linked list, so it actually causes a small
>>> memory leak (which can get big in the long term if you have a lot of
>>> files and hourly renames). I believe this is what is causing the OOM
>>> Mohit is getting in FLUME-1850.
>>>
>>> So I personally would recommend using it (with a setting that will
>>> close files before rollInterval does).
>>>
>>> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>>>
>>>> Ah I see. Again, something useful to have in the Flume user guide.
>>>>
>>>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson
>>>> <[email protected]> wrote:
>>>>
>>>>> The rollInterval will still cause the last 01-17 file to be closed
>>>>> eventually. The way the HDFS sink works with the different files is
>>>>> that each unique path is handled by a different BucketWriter object.
>>>>> The sink can hold as many of these objects as specified by
>>>>> hdfs.maxOpenFiles (default: 5000), and bucketwriters are only
>>>>> removed when you create the 5001st writer (5001st unique path).
>>>>> However, generally, once a writer is closed it is never used again
>>>>> (all of your 01-17 writers will never be used again). To avoid
>>>>> keeping them in the sink's internal list of writers, the idleTimeout
>>>>> is a specified number of seconds during which no data is received by
>>>>> the BucketWriter. After this time, the writer will try to close
>>>>> itself and will then tell the sink to remove it, thus freeing up
>>>>> everything used by the bucketwriter.
>>>>>
>>>>> So the idleTimeout is just a setting to help limit the HDFS sink's
>>>>> memory usage. The ideal time for it is longer than the maximum time
>>>>> between events (capped at the rollInterval). If you know you'll
>>>>> receive a constant stream of events, you might just set it to a
>>>>> minute or so. Or, if you are fine with having multiple files open
>>>>> per hour, you can set it to a lower number, maybe just over the
>>>>> average time between events. For me, in testing, I set it >=
>>>>> rollInterval for the case where no events are received in a given
>>>>> hour (I'd rather keep the object alive for an extra hour than create
>>>>> files every 30 minutes or something).
>>>>>
>>>>> Hope that was helpful,
>>>>>
>>>>> - Connor
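To make the tuning advice above concrete, here is a sketch of the steady-stream setup Juhani describes (the agent/sink names and path are placeholders; the 30-second idleTimeout and hourly bucketing are his, and disabling rollInterval follows his suggestion to skip it entirely):

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H
    # with a constant stream of data, each hour's file goes idle as soon
    # as events move to the next bucket, so a short idleTimeout closes it
    a1.sinks.k1.hdfs.idleTimeout = 30
    # 0 = never roll based on a time interval
    a1.sinks.k1.hdfs.rollInterval = 0

Each writer is then closed, renamed from .tmp, and removed from the sink's writer list shortly after its hour ends, rather than lingering until the 5000-writer cap evicts it.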
>>>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Say I have:
>>>>>>
>>>>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>>>> hdfs.rollInterval = 60
>>>>>>
>>>>>> Now suppose there is a file
>>>>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>>>>> that is not ready to be rolled over yet, i.e. 60 seconds are not
>>>>>> up, and it's now past midnight, i.e. a new day, and events start to
>>>>>> be written to /flume/events/2013-01-18/flume_XXXXXXXX.tmp.
>>>>>>
>>>>>> Will the 2013-01-17 file never be rolled over unless I have
>>>>>> something like hdfs.idleTimeout=60? If so, how do Flume sinks keep
>>>>>> track of files they need to roll over after the idleTimeout?
>>>>>>
>>>>>> In short, what's the exact use of the idleTimeout parameter?
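Restating the thread's answer against the question's config, a sketch (the idleTimeout value is illustrative):

    a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
    a1.sinks.k1.hdfs.rollInterval = 60
    # close a writer that has received no events for 120 seconds, e.g.
    # the previous day's writer just after midnight; closing also removes
    # it from the sink's internal list of writers
    a1.sinks.k1.hdfs.idleTimeout = 120

As Connor notes, rollInterval alone will still close the 2013-01-17 file eventually; idleTimeout additionally frees the stale BucketWriter.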
