Alright, that makes sense. The takeaway from this conversation for everyone else:
If you use idleTimeout, be sure to set rollInterval to 0. And if you don't use idleTimeout, be sure to lower maxOpenFiles to a number proportional to your expected throughput. To use the least memory, you will want to use idleTimeout; the result, though, is that more files are created in HDFS. (Both setups are sketched in the config fragment further down.)

- Connor

On Thu, Jan 17, 2013 at 7:39 PM, Juhani Connolly <[email protected]> wrote:

> That breaks the use case idleTimeout was originally made for: making sure the file is closed promptly after data stops arriving. We use this to make sure the files are ready for our batches, which run quite soon after. The time at which rollInterval triggers is unpredictable, as it resets every time any other type of roll is triggered (event count or size).
>
> By making rollInterval behave properly, all of this is a non-issue. My recommendation to users would be not to use rollInterval if they're bucketing by time (it's redundant behavior).
>
> Documentation could definitely be improved. Once we sort out the approach we want to take, I can write it up to make the difference and usage clearer.
>
> On 01/18/2013 12:24 PM, Connor Woodson wrote:
>
> The way idleTimeout works right now is that it's another rollInterval; it works best when rollInterval is not set. So its use is best when you don't want a rollInterval and just want your bucketwriters to close when no events are coming through (caused by a path change or something else; you can still roll reliably with either count or size).
>
> As such, perhaps it would be clearer if idleTimeout were renamed to idleRoll or some such?
>
> And then change idleTimeout to only count seconds since the writer was closed; if a bucketwriter has been closed for long enough, it automatically removes itself. That kind of idle works well with rollInterval, while the current one doesn't (idleRoll + rollInterval creates two time-based rollers; there are certainly times for that, but not all of the time).
>
> - Connor
>
> On Thu, Jan 17, 2013 at 6:46 PM, Juhani Connolly <[email protected]> wrote:
>
>> It seemed neater at the time. It's only an issue because rollInterval doesn't remove the entry in sfWriters. We could change it so that close doesn't cancel it, and have it check whether or not the writer is already closed, but that'd be kind of ugly.
>>
>> @Mohit:
>>
>> When Flume dies unexpectedly, the .tmp file remains. When it restarts, there is some logic in the HDFS sink to recover it (and continue writing from there); I'm not actually sure of the specifics. You may want to just kill -9 a running Flume process on a test machine, start it up again, look at the logs, and see what happens with the output.
>>
>> If Flume dies cleanly, the file is properly closed.
>>
>> On 01/18/2013 11:23 AM, Connor Woodson wrote:
>>
>> And @ my aside: I hadn't realized that the idleTimeout is canceled when rollInterval fires. That's annoying. So setting a lower idleTimeout, and drastically decreasing maxOpenFiles to at most twice the number of possibly-open files, is probably necessary.
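As a concrete sketch of the two setups from the takeaway at the top (sink-only fragment; the names a1/k1, the path, and the exact values are illustrative assumptions, not from the thread):

    # Option 1: least memory; rely on idleTimeout and disable time-based
    # rolling. The tradeoff is more files in HDFS.
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
    a1.sinks.k1.hdfs.rollInterval = 0
    # close a bucket roughly a minute after events stop arriving
    a1.sinks.k1.hdfs.idleTimeout = 60

    # Option 2: no idleTimeout; instead cap the writer list relative to
    # expected throughput (the default for hdfs.maxOpenFiles is 5000)
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
    a1.sinks.k1.hdfs.rollInterval = 3600
    a1.sinks.k1.hdfs.maxOpenFiles = 50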
>> On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson <[email protected]> wrote:
>>
>>> @Mohit:
>>>
>>> For the HDFS sink, the .tmp files are placed based on the hadoop.tmp.dir property. The default location is /tmp/hadoop-${user.name}. To change this, you can add -Dhadoop.tmp.dir=<path> to your Flume command line call, or you can specify the property in the core-site.xml of wherever your HADOOP_HOME environment variable points.
>>>
>>> - Connor
>>>
>>> On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <[email protected]> wrote:
>>>
>>>> Whether idleTimeout is lower or higher than rollInterval is a preference. Set it lower and, assuming you get one message right at the turn of the hour, you will have some part of that hour without any bucketwriters; but if you get another message at the end of the hour, you will end up with two files instead of one. Set idleTimeout higher and you will get just one file, but also (in the worst case) you will have twice as many bucketwriters open; so it all depends on how many files you want and how much memory you have to spare.
>>>>
>>>> - Connor
>>>>
>>>> An aside: bucketwriters, after being closed by rollInterval, aren't really a memory leak; they are just very rarely useful to keep around (your path could rely on hostname, and you could use a rollInterval, and then those bucketwriters would still remain useful). And they do get removed eventually: by default, after you've created your 5001st bucketwriter, the first (or whichever was used longest ago) is removed.
>>>>
>>>> And I don't think that's the cause behind FLUME-1850, as he did have an idleTimeout set at 15 minutes.
>>>>
>>>> On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <[email protected]> wrote:
>>>>
>>>>> It's also useful if you want files to get promptly closed and renamed from the .tmp or whatever.
>>>>>
>>>>> We use it with something like a 30-second setting (we have a constant stream of data) and hourly bucketing.
>>>>>
>>>>> There is also the issue that files closed by rollInterval are never removed from the internal LinkedList, so it actually causes a small memory leak (which can get big in the long term if you have a lot of files and hourly renames). I believe this is what is causing the OOM Mohit is getting in FLUME-1850.
>>>>>
>>>>> So I personally would recommend using it (with a setting that will close files before rollInterval does).
>>>>>
>>>>> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>>>>>
>>>>>> Ah I see. Again, something useful to have in the Flume user guide.
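Juhani's setup above (hourly bucketing, ~30-second idleTimeout) might look like the following fragment; the names, the path layout, and the choice to disable rollInterval outright are assumptions:

    a1.sinks.k1.type = hdfs
    # bucket by hour
    a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H/
    # constant event stream: close each hour's file ~30s after data stops
    a1.sinks.k1.hdfs.idleTimeout = 30
    # make sure idleTimeout closes files before rollInterval would;
    # here time-based rolling is simply turned off
    a1.sinks.k1.hdfs.rollInterval = 0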
>>>>>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <[email protected]> wrote:
>>>>>>
>>>>>>> The rollInterval will still cause the last 01-17 file to be closed eventually. The way the HDFS sink works with the different files is that each unique path is handled by a different BucketWriter object. The sink can hold as many of these objects as specified by hdfs.maxOpenFiles (default: 5000), and bucketwriters are only removed when you create the 5001st writer (the 5001st unique path). However, once a writer is closed it is generally never used again (all of your 01-17 writers will never be used again). To avoid keeping them in the sink's internal list of writers, the idleTimeout is a specified number of seconds in which no data is received by the BucketWriter. After this time, the writer will try to close itself and will then tell the sink to remove it, thus freeing up everything used by the bucketwriter.
>>>>>>>
>>>>>>> So the idleTimeout is just a setting to help limit memory usage by the HDFS sink. The ideal value for it is longer than the maximum time between events (capped at the rollInterval); if you know you'll receive a constant stream of events, you might just set it to a minute or so. Or, if you are fine with having multiple files open per hour, you can set it to a lower number, maybe just over the average time between events. For me, in testing, I set it >= rollInterval for the cases when no events are received in a given hour (I'd rather keep the object alive for an extra hour than create files every 30 minutes or something).
>>>>>>>
>>>>>>> Hope that was helpful,
>>>>>>>
>>>>>>> - Connor
>>>>>>>
>>>>>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar <[email protected]> wrote:
>>>>>>>
>>>>>>>> Say I have:
>>>>>>>>
>>>>>>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>>>>>> hdfs.rollInterval=60
>>>>>>>>
>>>>>>>> Now, suppose there is a file /flume/events/2013-01-17/flume_XXXXXXXXX.tmp. This file is not ready to be rolled over yet, i.e. 60 seconds are not up, and now it's past midnight, i.e. a new day, and events start to be written to /flume/events/2013-01-18/flume_XXXXXXXX.tmp.
>>>>>>>>
>>>>>>>> Will the 2013-01-17 file never be rolled over unless I have something like hdfs.idleTimeout=60? If so, how do Flume sinks keep track of the files they need to roll over after idleTimeout?
>>>>>>>>
>>>>>>>> In short, what's the exact use of the idleTimeout parameter?
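Pulling the thread together, here is a minimal, self-contained agent sketch for Bhaskar's daily-bucketing scenario that follows the takeaway at the top. The source/channel choices and all concrete values are illustrative assumptions; only the hdfs.* behavior comes from this thread:

    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1
    # stamp events so the %y-%m-%d path escape can be resolved
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = timestamp

    a1.channels.c1.type = memory

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
    # rely on idleTimeout: once the day rolls over and the 01-17 bucket
    # stops receiving events, its writer closes itself (renaming the .tmp
    # file) and is removed from the sink's internal writer list
    a1.sinks.k1.hdfs.rollInterval = 0
    a1.sinks.k1.hdfs.idleTimeout = 60

With rollInterval left at 60 instead, the 01-17 file would still be closed eventually, but its BucketWriter entry would linger in the writer list until evicted, which is the memory concern discussed above.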
