I outlined why it was happening in FLUME-1850.

He has hourly bucketing, a rollInterval of 4000 and an idleTimeout of 900.

After an hour, 400 seconds remain on the rollInterval, which is less than the 900-second idleTimeout. So the rollInterval fires first, which triggers close, which cancels all timers including the idleTimeout. Thus the entry in sfWriters is never removed. His memory dump confirms this (he has a huge sfWriters map in memory after 30 days). I also confirmed this behaviour of rollInterval when developing the idleTimeout feature.
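
To make the arithmetic concrete, here is a minimal Python sketch of the race between the two timers. It is only a model, not Flume code: the function name is made up, and it assumes the file was opened at the top of the hour and that the last write to the bucket happens one hour (3600 seconds) later.

```python
def first_timer_to_fire(roll_interval, idle_timeout, last_write_at):
    """Model which timer closes the file first (hypothetical helper, not Flume API).

    roll_interval -- seconds after the file is opened until the roll timer fires
    idle_timeout  -- seconds after the last write until the idle timer fires
    last_write_at -- seconds after the file was opened when the last event arrived
    """
    roll_remaining = roll_interval - last_write_at
    if roll_remaining < idle_timeout:
        return ("rollInterval", roll_remaining)
    return ("idleTimeout", idle_timeout)


# The FLUME-1850 settings: the roll timer wins with 400 seconds to go,
# so the idle timer (and its removal from sfWriters) never runs.
print(first_timer_to_fire(4000, 900, 3600))

# A shorter idleTimeout would fire first instead.
print(first_timer_to_fire(4000, 300, 3600))
```

With the second set of values the idle timer fires first, which is the ordering recommended further down the thread.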

You're right about the limit on the size of sfWriters. With a limit of 5000, even if the closed ones stay in the list, they shouldn't be that big since buffers should be cleaned up.

idleTimeout will indeed result in more files if you don't have a steady stream of data. It is most useful with a steady stream of data and time-bucketed paths. In such situations, I might even recommend not using rollInterval at all and having a short idleTimeout (or, if you're not in a rush to get your file closed, give it a comfortably long timeout).

On 01/18/2013 11:19 AM, Connor Woodson wrote:
Whether idleTimeout is lower or higher than rollInterval is a preference. Set it lower, and assume you get one message right on the turn of the hour: you will then have some part of that hour without any bucket writers, but if you get another message at the end of the hour, you will end up with two files instead of one. Set idleTimeout to be longer and you will get just one file, but also (at worst) you will have twice as many bucketwriters open. So it all depends on how many files you want and how much memory you have to spare.
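
The file-count side of that tradeoff can be sketched in a few lines of Python. This is a simplified model, not Flume code: `count_files` is a hypothetical helper, it looks at a single bucket, and it assumes only idleTimeout closes files (rollInterval and the size/count rolls are ignored).

```python
def count_files(event_times, idle_timeout):
    """Count files produced in one bucket if only idleTimeout closes them.

    event_times  -- arrival times of events, in seconds from the start of the hour
    idle_timeout -- seconds of silence after which the open file is closed
    """
    files = 0
    last = None
    for t in sorted(event_times):
        # A new file is opened for the first event, and for any event that
        # arrives after the previous file has already idled out.
        if last is None or t - last > idle_timeout:
            files += 1
        last = t
    return files


events = [0, 3590]  # one message on the turn of the hour, one near the end
print(count_files(events, 900))   # short timeout: the first file closes in between
print(count_files(events, 4000))  # long timeout: both messages share one file
```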

- Connor

An aside:
bucketwriters, after being closed by rollInterval, aren't really a memory leak; they are just very rarely useful to keep around (if, for example, your path depends on hostname and you use a rollInterval, those bucketwriters will still remain useful). And they will get removed eventually; by default, after you've created your 5001st bucketwriter, the first (or whichever was used longest ago) will be removed.
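
As a rough illustration of that eviction behaviour, here is a small Python sketch of a least-recently-used writer map. It is an assumption-laden model rather than the sink's actual implementation: the `WriterCache` class and its method names are invented for this example, and the writer objects are placeholders.

```python
from collections import OrderedDict


class WriterCache:
    """Toy model of a bounded, least-recently-used map of path -> writer."""

    def __init__(self, max_open=5000):
        self.max_open = max_open
        self.writers = OrderedDict()

    def get(self, path):
        writer = self.writers.get(path)
        if writer is None:
            writer = {"path": path}  # placeholder standing in for a bucketwriter
            self.writers[path] = writer
            if len(self.writers) > self.max_open:
                # Drop whichever entry was used longest ago.
                self.writers.popitem(last=False)
        else:
            # Mark this path as the most recently used.
            self.writers.move_to_end(path)
        return writer


cache = WriterCache(max_open=3)
for p in ["a", "b", "c"]:
    cache.get(p)
cache.get("a")              # "a" is used again, so "b" is now the oldest
cache.get("d")              # fourth unique path pushes out "b"
print(list(cache.writers))  # paths still held, oldest first
```

Closed writers stay in such a map until enough new paths arrive to push them out, which is why the map can look large in a heap dump even when nothing is leaking without bound.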

And I don't think that's the cause behind 1850 as he did have an idleTimeout set at 15 minutes.


On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <[email protected]> wrote:

    It's also useful if you want files to get promptly closed and
    renamed from the .tmp or whatever.

    We use it with something like a 30-second setting (we have a constant
    stream of data) and hourly bucketing.

    There is also the issue that files closed by rollInterval are
    never removed from the internal linked list, so it actually causes a
    small memory leak (which can get big in the long term if you have a
    lot of files and hourly renames). I believe this is what is
    causing the OOM Mohit is getting in FLUME-1850.

    So I personally would recommend using it (with a setting that will
    close files before rollInterval does).


    On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:

        Ah I see. Again something useful to have in the flume user guide.

        On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson
        <[email protected]> wrote:

            the rollInterval will still cause the last 01-17 file to
            be closed
            eventually. The way the HDFS sink works with the different
            files is each
            unique path is specified by a different BucketWriter
            object. The sink can
            hold as many objects as specified by hdfs.maxOpenFiles
            (default: 5000),
            and bucketwriters are only removed when you create the
            5001st writer (5001st
            unique path). However, generally once a writer is closed
            it is never used
            again (all of your 1-17 writers will never be used again).
            To avoid keeping
            them in the sink's internal list of writers, the
            idleTimeout is a specified
            number of seconds in which no data is received by the
            BucketWriter. After
            this time, the writer will try to close itself and will
            then tell the sink
            to remove it, thus freeing up everything used by the
            bucketwriter.

            So the idleTimeout is just a setting to help limit memory
            usage by the hdfs
            sink. The ideal time for it is longer than the maximum
            time between events
            (capped at the rollInterval) - if you know you'll receive
            a constant stream
            of events you might just set it to a minute or something.
            Or if you are fine
            with having multiple files open per hour, you can set it
            to a lower number;
            maybe just over the average time between events. For me in
            just testing, I
            set it >= rollInterval for the cases when no events are
            received in a given
            hour (I'd rather keep the object alive for an extra hour
            than create files
            every 30 minutes or something).

            Hope that was helpful,

            - Connor


            On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
            <[email protected]> wrote:

                Say I have

                a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/

                hdfs.rollInterval=60

                Now, if there is a file
                /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
                This file is not ready to be rolled over yet, i.e. 60
                seconds are not
                up and now it's past 12 midnight, i.e. new day
                And events start to be written to
                /flume/events/2013-01-18/flume_XXXXXXXX.tmp

                will the file 2013-01-17 never be rolled over, unless
                I have something
                like hdfs.idleTimeout=60  ?
                If so, how do flume sinks keep track of files they need
                to roll over
                after idleTimeout?

                In short, what's the exact use of the idleTimeout parameter?




