It seemed neater at the time. It's only an issue because rollInterval doesn't remove the entry from sfWriters. We could change it so that close doesn't cancel the timer, and have it check whether the writer is already closed, but that'd be kind of ugly.

@Mohit:

When Flume dies unexpectedly, the .tmp file remains. When it restarts, there is some logic in the HDFS sink to recover it (and continue writing from there). I'm not actually sure of the specifics. You may want to just kill -9 a running Flume process on a test machine, start it up again, and look at the logs to see what happens with the output.

If flume dies cleanly the file is properly closed.

On 01/18/2013 11:23 AM, Connor Woodson wrote:
And @ my aside: I hadn't realized that the idleTimeout timer is canceled when rollInterval fires. That's annoying. So setting a lower idleTimeout, and drastically decreasing maxOpenFiles to at most 2 * the number of possibly-open files, is probably necessary.
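
As a rough sketch of what that could look like (the a1/k1 names and the numbers are placeholders, not from this thread): if, say, ten hosts write into hourly buckets, at most ~20 writers should ever be live at once, so something like:

    a1.sinks.k1.hdfs.idleTimeout = 300
    a1.sinks.k1.hdfs.maxOpenFiles = 20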


On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson <[email protected]> wrote:

    @Mohit:

    For the HDFS sink, the .tmp files are placed based on the
    hadoop.tmp.dir property. The default location is
    /tmp/hadoop-${user.name}. To change this you can
    add -Dhadoop.tmp.dir=<path> to your Flume command line call, or
    you can specify the property in the core-site.xml of wherever your
    HADOOP_HOME environment variable points to.
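
    For example (the paths here are just placeholders), either on the
    command line:

        flume-ng agent -n a1 -c conf -f flume.conf -Dhadoop.tmp.dir=/data/flume-tmp

    or in $HADOOP_HOME/conf/core-site.xml:

        <property>
          <name>hadoop.tmp.dir</name>
          <value>/data/flume-tmp</value>
        </property>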

    - Connor


    On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson
    <[email protected] <mailto:[email protected]>> wrote:

        Whether idleTimeout is lower or higher than rollInterval is a
        preference. Set it lower and, if you get one message right on
        the turn of the hour, you will have some part of that hour
        without any open bucket writers; but if you get another
        message at the end of the hour, you will end up with two files
        instead of one. Set idleTimeout to be longer and you will get
        just one file, but in the worst case you will have twice as
        many bucketwriters open; so it all depends on how many files
        you want/how much memory you have to spare.
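
        For example, with hourly rolls (a1/k1 and the numbers are
        placeholder values), a lower idleTimeout closes quiet writers
        early, at the cost of possibly splitting an hour into two
        files:

            a1.sinks.k1.hdfs.rollInterval = 3600
            a1.sinks.k1.hdfs.idleTimeout = 900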

        - Connor

        An aside:
        bucketwriters, after being closed by rollInterval, aren't
        really a memory leak; they are just very rarely useful to keep
        around (though if your path relies on hostname and you use a
        rollInterval, those bucketwriters will still remain useful).
        And they will get removed eventually; by default, after you've
        created your 5001st bucketwriter, the least recently used one
        will be removed.

        And I don't think that's the cause behind FLUME-1850, as he
        did have idleTimeout set to 15 minutes.


        On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly
        <[email protected]> wrote:

            It's also useful if you want files to get promptly closed
            and renamed from the .tmp or whatever.

            We use it with something like a 30-second setting (we have
            a constant stream of data) and hourly bucketing.
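
            In config terms that would be something like (a1/k1 are
            placeholder names):

                a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
                a1.sinks.k1.hdfs.idleTimeout = 30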

            There is also the issue that files closed by rollInterval
            are never removed from the internal linked list, so it
            actually causes a small memory leak (which can get big in
            the long term if you have a lot of files and hourly
            renames). I believe this is what is causing the OOM Mohit
            is getting in FLUME-1850.

            So I personally would recommend using it (with a setting
            that will close files before rollInterval does).


            On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:

                Ah, I see. Again, something useful to have in the
                Flume user guide.

                On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson
                <[email protected]> wrote:

                    the rollInterval will still cause the last 01-17
                    file to be closed eventually. The way the HDFS
                    sink works with the different files is that each
                    unique path is handled by a different BucketWriter
                    object. The sink can hold as many of these objects
                    as specified by hdfs.maxOpenFiles (default: 5000),
                    and bucketwriters are only removed when you create
                    the 5001st writer (the 5001st unique path).
                    However, generally once a writer is closed it is
                    never used again (all of your 01-17 writers will
                    never be used again). To avoid keeping them in the
                    sink's internal list of writers, the idleTimeout
                    is a specified number of seconds during which no
                    data is received by the BucketWriter. After this
                    time, the writer will try to close itself and will
                    then tell the sink to remove it, thus freeing up
                    everything used by the bucketwriter.
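
                    For example, if the path includes the host header
                    (a1/k1 are placeholder names), every distinct
                    host/day combination gets its own BucketWriter:

                        a1.sinks.k1.hdfs.path = /flume/events/%{host}/%Y-%m-%d/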

                    So the idleTimeout is just a setting to help limit
                    memory usage by the HDFS sink. The ideal time for
                    it is longer than the maximum time between events
                    (capped at the rollInterval) - if you know you'll
                    receive a constant stream of events, you might
                    just set it to a minute or so. Or, if you are fine
                    with having multiple files open per hour, you can
                    set it to a lower number, maybe just over the
                    average time between events. For me, in testing,
                    I set it >= rollInterval for the cases when no
                    events are received in a given hour (I'd rather
                    keep the object alive for an extra hour than
                    create files every 30 minutes or something).
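
                    That last choice looks something like this (a1/k1
                    and the numbers are placeholders; the point is
                    just that idleTimeout >= rollInterval):

                        a1.sinks.k1.hdfs.rollInterval = 3600
                        a1.sinks.k1.hdfs.idleTimeout = 3600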

                    Hope that was helpful,

                    - Connor


                    On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V.
                    Karambelkar
                    <[email protected] <mailto:[email protected]>>
                    wrote:

                        Say if I have

                        a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/

                        hdfs.rollInterval=60

                        Now, suppose there is a file
                        /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
                        that is not ready to be rolled over yet, i.e.
                        60 seconds are not up, and now it's past
                        midnight, i.e. a new day, and events start to
                        be written to
                        /flume/events/2013-01-18/flume_XXXXXXXX.tmp

                        will the file 2013-01-17 never be rolled over,
                        unless I have something like
                        hdfs.idleTimeout=60 ?
                        If so, how do Flume sinks keep track of the
                        files they need to roll over after
                        idleTimeout?

                        In short, what's the exact use of the
                        idleTimeout parameter?






