[
https://issues.apache.org/jira/browse/YARN-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Haibo Chen reassigned YARN-6376:
--------------------------------
Assignee: Haibo Chen
> Exceptions caused by synchronous putEntities requests can be swallowed in
> TimelineCollector
> -------------------------------------------------------------------------------------------
>
> Key: YARN-6376
> URL: https://issues.apache.org/jira/browse/YARN-6376
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: ATSv2
> Affects Versions: 3.0.0-alpha2
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Priority: Critical
> Labels: yarn-5355-merge-blocker
>
> TimelineCollector.putEntities() is currently implemented as a call to
> TimelineWriter.write() followed by TimelineWriter.flush(). Because
> HBaseTimelineWriter.write() is an asynchronous operation, TimelineClient can
> send a synchronous putEntities() request for critical data and never get back
> an exception, even though the HBase write request to store the entities may
> have failed.
> This is due to a race condition between the WriterFlushThread in
> TimelineCollectorManager and the web threads handling synchronous putEntities()
> requests. Entities are first put into the buffer by the web thread, but before
> the web thread invokes writer.flush(), the WriterFlushThread may fire and flush
> the writer. If the entities are not successfully written to the backend during
> that flush, the WriterFlushThread simply logs an error, while the web thread
> never gets an exception out of its own writer.flush() invocation. This is bad
> because the whole reason TimelineClient sends putEntities() synchronously is so
> that it can retry upon any exception.
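> To make the race concrete, here is a minimal, self-contained sketch. It is not
> the actual Hadoop code; BufferingWriter and the forced scheduling below are
> illustrative assumptions that mirror TimelineWriter.write()/flush() and the
> WriterFlushThread in TimelineCollectorManager:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> public class FlushRaceSketch {
>
>   /** Buffers writes; failures only surface when the buffer is flushed. */
>   static class BufferingWriter {
>     private final List<String> buffer = new ArrayList<>();
>
>     synchronized void write(String entity) {
>       buffer.add(entity);            // asynchronous-style write: just buffer
>     }
>
>     synchronized void flush() throws Exception {
>       if (buffer.isEmpty()) {
>         return;                      // nothing to flush, nothing to fail
>       }
>       buffer.clear();
>       throw new Exception("backend rejected the buffered entities");
>     }
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     BufferingWriter writer = new BufferingWriter();
>
>     // Background flusher, analogous to WriterFlushThread: it only logs errors.
>     Thread flusher = new Thread(() -> {
>       try {
>         writer.flush();
>       } catch (Exception e) {
>         System.err.println("WriterFlushThread: " + e.getMessage()); // swallowed
>       }
>     });
>
>     // Web thread handling a synchronous putEntities(): write, then flush.
>     writer.write("critical-entity");
>
>     // Force the bad interleaving: the background flush runs before the
>     // web thread's own flush (join() is only to make the demo deterministic).
>     flusher.start();
>     flusher.join();
>
>     try {
>       writer.flush();                // buffer already drained: no exception
>       System.out.println("web thread: flush() returned normally,"
>           + " the client never sees the failure");
>     } catch (Exception e) {
>       System.out.println("web thread: got the exception as intended");
>     }
>   }
> }
> {code}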
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)