HDFS file rolling behaviour

Jagadish Bihani Tue, 18 Sep 2012 05:09:21 -0700

Hi

Does anybody know about the issue described in the mail below?



Update: I have now seen the following behaviour even with time-based rolling.

With time-based rolling I would expect a single file to be created every x seconds.

But in my case some n files are created every x seconds.
Does it have something to do with the HDFS batch size?

Regards,
Jagadish

-------- Original Message --------
Subject:        HDFS file rolling behaviour
Date:   Thu, 13 Sep 2012 14:26:56 +0530
From:   Jagadish Bihani <[email protected]>
To:     [email protected]

Hi

I use two flume agents:

1. flume_agent 1, which is the source (exec source - file channel - avro sink)
2. flume_agent 2, which is the destination (avro source - file channel - HDFS sink)
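
For illustration only (this is not the actual configuration in use), a two-agent topology like the one above could be sketched as follows. The agent and component names (agent1, agent2, execSrc, fileCh, avroSink, avroSrc, hdfsSink), the host names, the port, the tailed command and the HDFS path are all placeholders assumed for the example.

```properties
# Agent 1 (source side): exec source -> file channel -> avro sink
agent1.sources = execSrc
agent1.channels = fileCh
agent1.sinks = avroSink

# Placeholder command; the real command is not given in this mail
agent1.sources.execSrc.type = exec
agent1.sources.execSrc.command = tail -F /path/to/input.log
agent1.sources.execSrc.channels = fileCh

agent1.channels.fileCh.type = file

# Placeholder host and port of agent 2's avro source
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = dest-host.example.com
agent1.sinks.avroSink.port = 4141
agent1.sinks.avroSink.channel = fileCh

# Agent 2 (destination side): avro source -> file channel -> HDFS sink
agent2.sources = avroSrc
agent2.channels = fileCh
agent2.sinks = hdfsSink

agent2.sources.avroSrc.type = avro
agent2.sources.avroSrc.bind = 0.0.0.0
agent2.sources.avroSrc.port = 4141
agent2.sources.avroSrc.channels = fileCh

agent2.channels.fileCh.type = file

# Placeholder HDFS path
agent2.sinks.hdfsSink.type = hdfs
agent2.sinks.hdfsSink.hdfs.path = hdfs://namenode.example.com/flume/events
agent2.sinks.hdfsSink.channel = fileCh
```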

I have observed that when the HDFS sink rolls by *file size/number of events*, it
creates a lot of simultaneous connections to the source's avro sink. But when
rolling by *time interval* it works *one by one*, i.e. it opens 1 HDFS file,
writes to it and then closes it. I would expect the same thing to happen for the
other rolling criteria too, i.e. first open a file, and once x events have
been written to it, roll it and open another, and so on.
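
As a rough sketch of the settings involved (the values are placeholders, and agent2/hdfsSink are assumed names, not taken from the real setup): the HDFS sink has one property per rolling criterion, and setting a criterion to 0 disables it. Rolling purely by number of events would then look something like this:

```properties
# Disable time-based rolling (value is in seconds; 0 = never roll on time)
agent2.sinks.hdfsSink.hdfs.rollInterval = 0
# Disable size-based rolling (value is in bytes; 0 = never roll on size)
agent2.sinks.hdfsSink.hdfs.rollSize = 0
# Roll after this many events have been written to the file (placeholder value)
agent2.sinks.hdfsSink.hdfs.rollCount = 10000
# Number of events written to the file before it is flushed to HDFS (placeholder value)
agent2.sinks.hdfsSink.hdfs.batchSize = 100
```

To roll by file size instead, hdfs.rollCount would be set to 0 and hdfs.rollSize to the desired size in bytes.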

In my case data ingestion works fine with "time" based rolling, but in the other
cases, due to the above behaviour, I get exceptions like:
-- too many open files
-- timeout related exceptions for the file channel, and a few more exceptions.

I can increase the values of the parameters that are giving the exceptions, but
I don't know what adverse effects that may have.

Can somebody throw some light on rolling based on file size/number of events?


Regards,
Jagadish
