[
https://issues.apache.org/jira/browse/YARN-6382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954009#comment-15954009
]
Joep Rottinghuis commented on YARN-6382:
----------------------------------------
Thanks for pointing this out [~haibochen]. Yes, with asynchronous buffering and
size-based flush this can happen.
The periodic flush can cause the same issue.
Here is the scenario (see the sketch after this list):
* Internal buffer in the BufferedMutator is almost full
* Thread A does a write (which we know will cause issues later down the road)
* Thread B does a write.
** This write causes the buffer to fill up, or perhaps thread B calls flush, or a
timer calls flush.
** The earlier put from A caused an issue
** Thread B gets an error back; not knowing exactly which put failed, it can
re-try its own write later
* The buffer is now empty
* Thread A does a flush to confirm that its previous write made it through
* Thread A receives a success status, because there are no further issues
* Thread A incorrectly assumes that its writes were successfully written
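A rough sketch of that sequence against the stock BufferedMutator API (the table
name, buffer size, and the badPut/goodPut variables are made-up placeholders, and
the two "threads" are only marked in comments):
{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;

public class FlushRaceSketch {
  // Sketch only: shows how a flush() by thread A can report success even though A's
  // earlier put failed, because thread B's write already triggered the internal
  // size-based flush that surfaced the error.
  void demonstrate(Connection conn, Put badPut, Put goodPut) throws IOException {
    BufferedMutatorParams params =
        new BufferedMutatorParams(TableName.valueOf("timeline.entity")) // illustrative table
            .writeBufferSize(2L * 1024 * 1024) // size-based flush threshold (illustrative)
            .listener((exception, m) -> {
              // Called when a background flush fails; the failure is observed by whoever
              // happened to trigger that flush, not necessarily by the thread whose put
              // actually caused it.
            });
    try (BufferedMutator mutator = conn.getBufferedMutator(params)) {
      mutator.mutate(badPut);   // thread A: this put will fail on the server side
      mutator.mutate(goodPut);  // thread B: fills the buffer, triggers the internal
                                // flush, and is the one that sees A's failure
      mutator.flush();          // thread A: the buffer is already empty, so this
                                // returns cleanly and A wrongly concludes success
    }
  }
}
{code}
Whether the listener rethrows or swallows the exception does not change the picture
for thread A: by the time A calls flush() there is nothing left in the buffer to fail.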
There seem to be three options to deal with this:
a) Make writes synchronous, i.e. for important writes do not use a
BufferedMutator. The APIs would have to change, and performance might be
significantly impacted, as we saw in tests early on in the application timeline
service development.
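For illustration, option (a) would essentially boil down to something like the
following, assuming a plain Table is used for the writes that must be confirmed
(table, family, and qualifier names are made up):
{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SynchronousWriteSketch {
  // Option (a): bypass the BufferedMutator entirely. Table.put() is synchronous, so
  // it either returns after the server has accepted the mutation or throws, leaving
  // no doubt about which write failed.
  void writeCritical(Connection conn, byte[] row) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("timeline.entity"))) { // illustrative table
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("i"), Bytes.toBytes("created_time"),        // illustrative column
          Bytes.toBytes(System.currentTimeMillis()));
      table.put(put); // blocks until the RPC completes or retries are exhausted
    }
  }
}
{code}
The cost is one RPC per put, which is exactly the overhead the BufferedMutator was
introduced to avoid.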
b) Modify the API for the BufferedMutator (or not use the public API that comes
from instantiating one from the connection, i.e. hackery required). For a put we
would return the batch id (see work on HBASE-17018) to indicate which batch of
writes the put went into. Then for the flush, we'd change the API as well to take
a batch id as an input argument. The (Spooling)BufferedMutator would then have to
keep track of a limited list of recently failed batches from failed flushes. When
threads ask whether their batch failed, we can check the earliest entry in the
failed list against the requested batch and return whether it was successful,
failed, or whether we don't know for sure (due to the limit on the number of
failed batches we want to keep).
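Purely as a hypothetical sketch (none of these methods or types exist today), the
resulting API could look something like this:
{code:java}
import java.io.Closeable;
import java.io.IOException;

import org.apache.hadoop.hbase.client.Mutation;

// Hypothetical interface only, sketching the batch-id idea from option (b).
public interface BatchTrackingBufferedMutator extends Closeable {

  /** Possible outcomes for a given batch of buffered writes. */
  enum BatchStatus { SUCCEEDED, FAILED, UNKNOWN /* evicted from the bounded failure list */ }

  /** Buffer a mutation and return the id of the batch it was assigned to. */
  long mutate(Mutation mutation) throws IOException;

  /**
   * Flush up to and including the given batch and report how that batch fared.
   * Because the implementation only keeps a bounded list of recently failed
   * batches, sufficiently old batches can only be reported as UNKNOWN.
   */
  BatchStatus flush(long batchId) throws IOException;
}
{code}
A caller that needs its writes confirmed would then remember the batch id returned
from mutate() and later ask flush(batchId) about that specific batch.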
This becomes all the more complicated when we start considering spooling, because
the error can happen much later. In the presence of spooling, all we really
"guarantee" is that puts are persisted to a (distributed) filesystem, and that
we'll do our utmost to replay them. Of course, operators of a particular
installation may choose to spool only after an infinite amount of time,
essentially blocking writes until they can be pushed into HBase.
This leads us to the third option to deal with these race conditions:
c) Document the conditions in JavaDoc and/or the external documentation, and
move on for now. Language could be something like:
{noformat}
Under rare circumstances, race conditions between writers and internal buffer
flushing can make a flush appear to succeed even though an earlier write failed.
{noformat}
> Address race condition on TimelineWriter.flush() caused by buffer-sized flush
> -----------------------------------------------------------------------------
>
> Key: YARN-6382
> URL: https://issues.apache.org/jira/browse/YARN-6382
> Project: Hadoop YARN
> Issue Type: Sub-task
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Labels: yarn-5355-merge-blocker
>
> YARN-6376 fixes the race condition between putEntities() and the periodic
> flush() by WriterFlushThread in TimelineCollectorManager, or between
> putEntities() calls in different threads.
> However, BufferedMutator can have internal size-based flush as well. We need
> to address the resulting race condition.