Folks,

We have observed an issue in Mina 2.0-M4 where TCP writes are not
flushed to the network even though the receiver is keeping up and CPU
has low utilization.

In the test scenario, all of the clients initially connect to the
server, then the server starts writing messages to each of the
clients.  The clients connect via localhost to eliminate clock drift
and physical network impact, and we measure the observable latency of
the message between sender and receiver.

Note that the clients do not send any data to the server after
messages start flowing from the server, just ack traffic.

When the problem occurs, some messages are unexpectedly delayed by
approximately 1000ms.

By observing a tcpdump of the wire traffic during the test, we
verified that the receiver was keeping up with message delivery, but
that message delivery would sometimes stall for approximately 1000ms
after which the delayed messages would be written to the wire.

By re-running the same test with a locally modified version of
AbstractPollingIoProcessor, such that the select() timeout was 750ms
instead of 1000ms, we found that the messages were delayed by
approximately 750ms instead.

Further inspection of AbstractPollingIoProcessor.flush(T session)
suggests a potential race condition that would explain this behavior:

    public final void flush(T session) {
        boolean needsWakeup = flushingSessions.isEmpty();
        if (scheduleFlush(session) && needsWakeup) {
            wakeup();
        }
    }

For the purposes of this description, let's say
AbstractPollingIoProcessor.flush(T session) is called by the [writer]
thread, while Selector.select(long timeout) is called by the
[processor] thread.

[writer]       flushingSessions.isEmpty() -> false (last flushing
session being processed)
[processor] flush(currentTime), flushingSessions.poll() empties
flushingSessions, completely write last flushing session, write
request queue empty, no subsequent flush scheduled
[writer]       scheduleFlush(session) returns true, flushingSessions
now non-empty, requires wakeup(), but not called (!)
[processor] stalls until select timeout expires, then flushes

It seems that the intent of needsWakeup is to avoid calling wakeup()
unnecessarily by tracking when flushingSessions becomes non-empty, but
it does not seem to capture the edge case where flushSessions starts
out non-empty, then becomes empty, then becomes non-empty again,
triggering the stall.

We found that the following modification to
AbstractPollingIoProcessor.flush(T session) addressed the issue
without adverse effects.

    public final void flush(T session) {
        if (scheduleFlush(session)) {
            wakeup();
        }
    }

The difference in behavior is as follows:
  1. wakeup() could be called many times between calls to select()
  2. select() might unblock one extra time when the selector is otherwise idle

Case (1) is documented by Selector.select to be equivalent to a single
call to wakeup() between calls to select().  The assumption is that
subsequent calls to wakeup() are cheap because it is latched
internally.
Case (2) can occur as follows:

[writer]       scheduleFlush(session) returns true
[processor] processes all flushingSessions, including most recent
addition by [writer]
[writer]       calls wakeup()
[processor] immediately unblocks next call to select() due to wakeup()
call by [writer]
[processor] blocks again on next call to select()

When the [processor] unblocks select() due to wakeup() call by
[writer], this is not an _extra_ unblock when there is other work to
do in the [processor] loop, so any additional overhead introduced by
eliminating the needsWakeup check is only observed when the selector
is otherwise idle, and is therefore considered harmless.

If you agree that this a bug, we would gladly file a ticket on JIRA to
track the issue, unless it has already been resolved.  We would also
be happy to test other suggested fixes in our environment if
necessary.

Kind Regards,
John Fallows
-- 
>|< Kaazing Corporation >|<
John Fallows | CTO | +1.650.960.8148
888 Villa St, Ste 410 | Mountain View, CA 94041, USA

Reply via email to