And @ my aside: I hadn't realized that the idleTimeout is
canceled when the rollInterval fires. That's annoying. So
setting a lower idleTimeout, and drastically decreasing
maxOpenFiles to at most 2 * the number of files that can
actually be open at once, is probably necessary.
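Something like this, as a rough sketch (the a1/k1 names and the
numbers are illustrative, assuming your path can only produce
about 10 distinct buckets at a time):

    # close idle writers quickly so closed-but-listed writers don't pile up
    a1.sinks.k1.hdfs.idleTimeout = 60
    # cap the writer list at ~2 * the buckets that can be open at once
    a1.sinks.k1.hdfs.maxOpenFiles = 20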
On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson <[email protected]> wrote:
@Mohit:
For the HDFS Sink, the tmp files are placed based on the
hadoop.tmp.dir property. The default location is
/tmp/hadoop-${user.name}. To change this you can add
-Dhadoop.tmp.dir=<path> to your Flume command line call, or you
can specify the property in the core-site.xml of wherever your
HADOOP_HOME environment variable points to.
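For example (the agent name and paths are placeholders; the
flume-ng script passes -D options through to the JVM):

    bin/flume-ng agent --name a1 --conf conf --conf-file flume.conf -Dhadoop.tmp.dir=/data/tmp/hadoop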
- Connor
On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <[email protected]> wrote:
Whether idleTimeout is lower or higher than rollInterval is a
preference. Set it lower, and assume you get one message right
on the turn of the hour: then you will have some part of that
hour without any open BucketWriters, but if you get another
message at the end of the hour, you will end up with two files
instead of one. Set idleTimeout to be longer and you will get
just one file, but in the worst case you will have twice as
many BucketWriters open. So it all depends on how many files
you want / how much memory you have to spare.
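As a rough sketch with hourly buckets (values illustrative, not
recommendations):

    # lower than the bucket period: writers are freed sooner, but a
    # quiet stretch mid-hour can split the hour into two files
    a1.sinks.k1.hdfs.idleTimeout = 900
    # or higher: one file per hour, but up to ~2x writers held open
    # a1.sinks.k1.hdfs.idleTimeout = 4500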
- Connor
An aside:
BucketWriters, after being closed by rollInterval, aren't
really a memory leak; they are just very rarely useful to keep
around (your path could rely on hostname, you could use a
rollInterval, and then those BucketWriters would still remain
useful). And they will get removed eventually: by default,
after you've created your 5001st BucketWriter, the first (or
whichever was used longest ago) will be removed.
And I don't think that's the cause behind FLUME-1850, as he
did have an idleTimeout set at 15 minutes.
On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <[email protected]> wrote:
It's also useful if you want files to get promptly closed and
renamed to drop the .tmp suffix. We use it with something like
a 30 second setting (we have a constant stream of data) and
hourly bucketing.
There is also the issue that files closed by rollInterval are
never removed from the internal linked list, so it actually
causes a small memory leak (which can get big in the long term
if you have a lot of files and hourly renames). I believe this
is what is causing the OOM Mohit is getting in FLUME-1850.
So I personally would recommend using it (with a setting that
will close files before rollInterval does).
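A sketch of that kind of setup (the a1/k1 names are
placeholders; assumes a steady event stream):

    # hourly buckets; idleTimeout well under rollInterval so files are
    # closed and lose their .tmp suffix promptly once a bucket goes quiet
    a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
    a1.sinks.k1.hdfs.rollInterval = 3600
    a1.sinks.k1.hdfs.idleTimeout = 30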
On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
Ah, I see. Again, something useful to have in the Flume user
guide.
On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <[email protected]> wrote:
The rollInterval will still cause the last 01-17 file to be
closed eventually. The way the HDFS sink works with the
different files is that each unique path is handled by a
different BucketWriter object. The sink can hold as many
objects as specified by hdfs.maxOpenFiles (default: 5000), and
BucketWriters are only removed when you create the 5001st
writer (5001st unique path). However, generally once a writer
is closed it is never used again (all of your 01-17 writers
will never be used again). To avoid keeping them in the sink's
internal list of writers, the idleTimeout is a specified number
of seconds in which no data is received by the BucketWriter.
After this time, the writer will try to close itself and will
then tell the sink to remove it, thus freeing up everything
used by the BucketWriter.

So the idleTimeout is just a setting to help limit memory usage
by the HDFS sink. The ideal time for it is longer than the
maximum time between events (capped at the rollInterval); if
you know you'll receive a constant stream of events, you might
just set it to a minute or so. Or, if you are fine with having
multiple files open per hour, you can set it to a lower number,
maybe just over the average time between events. For me, in
just testing, I set it >= rollInterval for the cases when no
events are received in a given hour (I'd rather keep the object
alive for an extra hour than create files every 30 minutes or
something).
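Applied to the config in the question below (the idleTimeout
value is illustrative):

    # daily buckets with a 60s roll; idleTimeout makes a past day's
    # writer close and be removed from the sink's writer list once
    # it goes idle, instead of lingering there
    a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
    a1.sinks.k1.hdfs.rollInterval = 60
    a1.sinks.k1.hdfs.idleTimeout = 120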
Hope that was helpful,
- Connor
On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar <[email protected]> wrote:
Say I have

a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
hdfs.rollInterval = 60

Now, suppose there is a file
/flume/events/2013-01-17/flume_XXXXXXXXX.tmp that is not ready
to be rolled over yet, i.e. 60 seconds are not up, and it is
now past midnight, i.e. a new day, and events start to be
written to /flume/events/2013-01-18/flume_XXXXXXXX.tmp. Will
the 2013-01-17 file never be rolled over, unless I have
something like hdfs.idleTimeout=60?

If so, how do Flume sinks keep track of files they need to
roll over after idleTimeout?

In short, what's the exact use of the idleTimeout parameter?