Hi, thanks for your input.

On 07/09/2012 02:42 PM, Arvind Prabhakar wrote:
Hi,

> It's certainly one possible solution to the issue, though I do
> believe that the current one could be made more friendly
> towards single disk access (e.g. batching writes to the disk
> may well be doable, and I would be curious what someone
> with more familiarity with the implementation thinks).

The implementation of the file channel is that of a write-ahead log, in that it serializes all the actions as they happen. Using these actions, it can reconstruct the state of the channel at any time. There are two mutually exclusive transaction types it supports - a transaction consisting of puts, and one consisting of takes. It may be possible to use the heap to batch the puts and takes and serialize them to disk when the commit occurs.

This approach would minimize the number of disk operations and change the performance characteristics of the channel. Although it will probably improve performance, it is hard to tell for sure unless we test it under load in different scenarios.
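
To make sure I follow, the batching would be something along these
lines - a rough sketch only, with made-up names rather than the actual
FileChannel classes - where nothing touches the log until the commit:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Event;

// Sketch only: puts and takes are buffered on the heap and serialized
// to the log in one sequential pass when the transaction commits.
// (In the real channel a transaction holds only puts or only takes.)
interface LogWriter {
  void appendPut(Event e) throws IOException;
  void appendTake(long eventPointer) throws IOException;
  void appendCommit() throws IOException;
  void sync() throws IOException;    // e.g. force the file to disk
}

class BatchedFileTransaction {
  private final List<Event> pendingPuts = new ArrayList<Event>();
  private final List<Long> pendingTakes = new ArrayList<Long>();
  private final LogWriter log;

  BatchedFileTransaction(LogWriter log) {
    this.log = log;
  }

  void put(Event event) {
    pendingPuts.add(event);          // heap only, no disk I/O yet
  }

  void take(long eventPointer) {
    pendingTakes.add(eventPointer);  // heap only, no disk I/O yet
  }

  void commit() throws IOException {
    // One sequential burst of writes instead of one write per action.
    for (Event e : pendingPuts) {
      log.appendPut(e);
    }
    for (long pointer : pendingTakes) {
      log.appendTake(pointer);
    }
    log.appendCommit();
    log.sync();                      // whole batch made durable at once
  }

  void rollback() {
    pendingPuts.clear();             // nothing was written, nothing to undo
    pendingTakes.clear();
  }
}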


This does sound a lot better to me. I'm not sure there is much demand for restoring the state of an uncommitted set of puts/takes to a file channel after restarting an agent; if the transaction wasn't completed, its current state is not really going to matter after a restart. I'm not very familiar with WAL implementations, but isn't it enough to write the data to be committed before the commit marker/informing of success? I don't think it is necessary to write each piece as it comes in, so long as it is all on disk before success/failure is reported.
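
In other words, on replay anything that never reached its commit record
would simply be dropped. Roughly (again just a sketch - LogRecord,
logRecords and applyToChannelState are hypothetical here, not the
actual replay code):

// Records are buffered per transaction and only applied once a COMMIT
// record for that transaction is seen; a batch that never reached its
// commit marker (agent died mid-transaction) is silently discarded, so
// the individual puts/takes never needed to hit disk before the marker.
void replay(Iterable<LogRecord> logRecords) {
  Map<Long, List<LogRecord>> open = new HashMap<Long, List<LogRecord>>();
  for (LogRecord record : logRecords) {
    if (record.isCommit()) {
      List<LogRecord> batch = open.remove(record.getTransactionId());
      if (batch != null) {
        for (LogRecord r : batch) {
          applyToChannelState(r);    // rebuild the queue of committed events
        }
      }
    } else {
      List<LogRecord> batch = open.get(record.getTransactionId());
      if (batch == null) {
        batch = new ArrayList<LogRecord>();
        open.put(record.getTransactionId(), batch);
      }
      batch.add(record);
    }
  }
  // Anything still left in 'open' was never committed and is ignored.
}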

Another thing I'm curious about is whether we actually need separate files for the data and checkpoints. Could we not add a magic header before each type of entry to differentiate them, and thus guarantee significantly more sequential access? What is killing performance on a single disk right now is the constant seeking. The difficulty would be putting together a file format that still allows seeking quickly to the correct position, and rolling files would be a lot harder. I think this is considerably more difficult and might be more of a long-term target.
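
Something along the lines of a one-byte record type in front of every
entry, so data and checkpoints could live in (and be replayed from) one
sequentially written file - just a sketch of the idea with made-up
record types, not a worked-out format:

import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical single-file layout: a type byte ("magic header") before
// each entry distinguishes data records from checkpoints, and a length
// prefix allows skipping over records when scanning.
final class RecordType {
  static final byte PUT        = 0x01;
  static final byte TAKE       = 0x02;
  static final byte COMMIT     = 0x03;
  static final byte CHECKPOINT = 0x04;
}

class SingleFileWriter {
  private final DataOutputStream out;

  SingleFileWriter(DataOutputStream out) {
    this.out = out;
  }

  void writePut(long txId, byte[] body) throws IOException {
    out.writeByte(RecordType.PUT);
    out.writeLong(txId);
    out.writeInt(body.length);
    out.write(body);
  }

  void writeCheckpoint(byte[] serializedQueue) throws IOException {
    // Checkpoints interleaved with data instead of in a separate file.
    out.writeByte(RecordType.CHECKPOINT);
    out.writeInt(serializedQueue.length);
    out.write(serializedQueue);
  }
}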

Juhani

Regards,
Arvind Prabhakar


On Wed, Jul 4, 2012 at 3:33 AM, Juhani Connolly <[email protected]> wrote:

    It looks good to me as it provides a nice balance between
    reliability and throughput.

    It's certainly one possible solution to the issue, though I do
    believe that the current one could be made more friendly towards
    single disk access (e.g. batching writes to the disk may well be
    doable, and I would be curious what someone with more familiarity
    with the implementation thinks).


    On 07/04/2012 06:36 PM, Jarek Jarcec Cecho wrote:

        We had a related discussion about this "SpillableChannel"
        (working name) on FLUME-1045, and I believe the consensus is
        that we will create something like that. In fact, I'm planning
        to do it myself in the near future - I just need to prioritize
        my todo list first.

        Jarcec

        On Wed, Jul 04, 2012 at 06:13:43PM +0900, Juhani Connolly wrote:

            Yes... I was actually poking around for that issue as I
            remembered seeing it before. I had also previously
            suggested a compound channel that would have worked like
            the buffer store in scribe, but the general opinion was
            that it provided too many mixed configurations, which
            could make testing and verifying correctness difficult.

            On 07/04/2012 04:33 PM, Jarek Jarcec Cecho wrote:

                Hi Juhani,
                a while ago I filed FLUME-1227, where I suggested
                creating some sort of SpillableChannel that would
                behave similarly to scribe. It would normally act as a
                memory channel and would start spilling data to disk
                if it got full (my primary goal here was to handle the
                remote going down, for example during HDFS
                maintenance). Would it be helpful for your case?

                Jarcec

                On Wed, Jul 04, 2012 at 04:07:48PM +0900, Juhani
                Connolly wrote:

                    Evaluating flume on some of our servers, the file
                    channel seems very slow, likely because, like most
                    typical web servers, ours have a single raided
                    disk available for writing to.

                    Quoted below is a suggestion from a previous issue
                    where our poor throughput came up; it turns out
                    that on multiple disks, file channel performance
                    is great.

                    On 06/27/2012 11:01 AM, Mike Percy wrote:

                        We are able to push > 8000 events/sec (2KB per
                        event) through a single file channel if you put
                        the checkpoint on one disk and use 2 other
                        disks for data dirs. Not sure what the limit
                        is. This is using the latest trunk code.
                        Another limitation may be that you need to add
                        additional sinks to your channel to drain it
                        faster, because sinks are single-threaded and
                        sources are multithreaded.

                        Mike
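
(For reference, the layout Mike describes is just the channel's
checkpointDir and dataDirs settings pointed at different disks; a
minimal sketch with made-up paths, configured programmatically here,
though the same keys go in the agent properties file:)

import org.apache.flume.Context;
import org.apache.flume.channel.file.FileChannel;
import org.apache.flume.conf.Configurables;

public class MultiDiskFileChannelExample {
  public static void main(String[] args) {
    Context ctx = new Context();
    // Checkpoint on one disk, data dirs on two others (made-up paths).
    ctx.put("checkpointDir", "/disk1/flume/checkpoint");
    ctx.put("dataDirs", "/disk2/flume/data,/disk3/flume/data");

    FileChannel channel = new FileChannel();
    channel.setName("fc1");
    Configurables.configure(channel, ctx);
    channel.start();
  }
}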

                    For the case where the disks happen to be
                    available on the server, that's fantastic, but I
                    suspect that most use cases are going to be
                    similar to ours, where multiple disks are not
                    available. Our use case isn't unusual as it's
                    primarily aggregating logs from various services.

                    We originally ran our log servers with an
                    exec(tail)->file->avro setup where throughput was
                    very bad (80MB in an hour). We then switched this
                    to a memory channel, which was fine (the peak-time
                    500MB worth of hourly logs went through).
                    Afterwards we switched back to the file channel,
                    but with 5 identical avro sinks. This did not
                    improve throughput (still 80MB).
                    RecoverableMemoryChannel showed very similar
                    characteristics.

                    I presume this is due to the writes going to two
                    separate places, and being further compounded by
                    also writing out and tailing the normal web logs:
                    checking top and iostat, we could confirm we have
                    significant iowait time, far more than we have
                    during typical operation.

                    As it is, we seem to be more or less guaranteeing
                    no loss of logs with the file channel. Perhaps we
                    could look into batching puts/takes for those that
                    do not need 100% data retention but want more
                    reliability than with the MemoryChannel, which can
                    potentially lose the entire capacity on a restart?
                    Another possibility is writing an implementation
                    that writes primarily sequentially. I've been
                    meaning to take a deeper look at the implementation
                    itself to give more informed commentary on its
                    contents, but unfortunately don't have the cycles
                    right now; hopefully someone with a better
                    understanding of the current implementation (along
                    with its interaction with the OS file cache) can
                    comment on this.






