Re: HDFSChannel?

Hari Shreedharan Thu, 13 Dec 2012 00:26:10 -0800

There are several reasons we did not want a channel loading events into the 
next hop/final destination.

One of the reasons is to clearly define the responsibilities of each component 
in the system. The responsibility of the channel is to be a buffer, and that 
is it - you can see this from the Channel interface. (It is the same reason 
classes and methods exist - in theory you could put everything into your main 
method and expect it to work - but in reality, that is not something you want 
to do.)

Another important thing to consider is that such an architecture is going to 
hit issues because a transaction is owned by a source thread. Making the same 
transaction responsible for writing to HDFS creates a tight coupling between 
the hop 1 to hop 2 writes and the hop 2 to HDFS writes - which is exactly what 
Flume strives to remove, by providing the channel as a buffer. 
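
To illustrate the decoupling described above, here is a minimal sketch in 
plain Java. It is not Flume code and does not use the Flume Channel or 
Transaction API; the class name ChannelBufferSketch, the run method and its 
parameters are all hypothetical. A bounded queue stands in for the channel, 
one thread plays the source and another plays a slow sink, so the source only 
waits when the buffer is full rather than on every destination write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only: a bounded queue plays the role of a channel.
class ChannelBufferSketch {

    /**
     * Starts a source thread and a sink thread that share nothing but the
     * buffer, and returns the events the sink delivered, in delivery order.
     */
    public static List<String> run(int eventCount, int capacity, long sinkDelayMillis)
            throws InterruptedException {
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(capacity);
        List<String> delivered = new ArrayList<>(); // touched only by the sink thread

        Thread source = new Thread(() -> {
            try {
                for (int i = 0; i < eventCount; i++) {
                    // Blocks only while the buffer is full, not on each sink write.
                    channel.put("event-" + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread sink = new Thread(() -> {
            try {
                while (delivered.size() < eventCount) {
                    String event = channel.poll(1, TimeUnit.SECONDS);
                    if (event == null) {
                        continue; // nothing buffered yet, try again
                    }
                    Thread.sleep(sinkDelayMillis); // stand-in for a slow destination write
                    delivered.add(event);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        sink.start();
        source.join();
        sink.join(); // join makes the sink's writes visible to the caller
        return delivered;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> out = run(10, 4, 5);
        System.out.println(out.size() + " events delivered");
    }
}
```

A failure or delay in the sink thread here never runs inside the source's 
call to put, which is the property a channel that writes to HDFS within the 
source's transaction would give up.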

In addition to this, such a single-threaded source-sink coupling existed in 
Flume OG, which caused major issues and introduced a lot of complexity, making 
things impossible to debug.

In your case, if you have a channel that also does the writes within the same 
transaction, you are going to have complex issues when HDFS writes fail or 
time out (I guarantee you this is going to happen). Handling such issues is 
complex. Now if you have an extra thread within the channel trying to clear 
the data out of the "HDFS channel," it is not any different from an HDFS Sink. 
Having no channel and having just a source+sink is also going to make things 
quite complex, and you are going to have to do a lot of handling if and when 
you hit some failure. 

I don't recommend such an approach. What I'd recommend you use is the File 
channel, and I don't think it is going to hurt your performance too much.
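
For reference, a last-hop agent along these lines could be configured roughly 
as follows. This is a sketch, not a tested configuration: the agent name 
agent1, the component names, the port, and all paths are placeholders. The 
property keys are the standard ones for the Avro source, File channel and 
HDFS sink.

```
# Placeholder names throughout: agent1, avroSrc, fileCh, hdfsSink
agent1.sources = avroSrc
agent1.channels = fileCh
agent1.sinks = hdfsSink

# Receives events from the previous hop
agent1.sources.avroSrc.type = avro
agent1.sources.avroSrc.bind = 0.0.0.0
agent1.sources.avroSrc.port = 4141
agent1.sources.avroSrc.channels = fileCh

# Durable buffer between the source and the HDFS sink
agent1.channels.fileCh.type = file
agent1.channels.fileCh.checkpointDir = /var/flume/checkpoint
agent1.channels.fileCh.dataDirs = /var/flume/data

# Drains the channel into HDFS in its own transactions
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.hdfsSink.channel = fileCh
```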

Hari
-- 
Hari Shreedharan

On Wednesday, December 12, 2012 at 11:34 PM, Guy Peleg wrote:

> Say I have a multi-hop flow, and let's say the last one stores its data in HDFS 
> using the HDFS sink.
> 
> In the last agent, as in every agent, there are the source-channel-sink trio, 
> my question is: why do we need that channel if the only thing that agent does 
> is store the events in HDFS (or other data source)? 
> 
> Won't it be more efficient to have an 'HDFSChannel' that is part of the 
> transaction, and no sink at all? otherwise I might need to use persistent 
> channel (JDBC, File) to make sure that data is not lost before 
> it is moved to the sink, which again, is redundant, since ideally I would 
> like the incoming events, on the 'last agent' to be stored as quickly as 
> possible in their destination without paying the extra channel cost
> 
>
