Li Lu commented on YARN-3595:

I put some thoughts on this. The problem with simply letting the write process 
and the removal listener calls synchronize with each other is the visibility of 
the connection object. Before we get the connection from the cache, we cannot 
synchronize on it. However, after we get the connection from the cache and 
right before we enter the synchronized block, the connection may already have 
been evicted from the cache and closed. To address this problem, we need to do 
a speculative read on these connections, and "roll back" if we notice we got a 
stale connection from the cache. 

I think we can have a thin "connection wrapper" layer around each connection. 
The wrapper stores a flag to indicate whether the connection inside it is still 
valid. 
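A minimal sketch of such a wrapper (names are illustrative, not from any patch; the flag records whether the wrapped connection may still be used):

```java
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical wrapper pairing a JDBC connection with a validity flag.
// Both the writer and the cache's removal listener synchronize on the
// wrapper itself, so the flag and the connection are never touched
// concurrently.
public class ConnectionWrapper {
  private final Connection connection;
  private boolean valid = true;   // guarded by the wrapper's monitor

  public ConnectionWrapper(Connection connection) {
    this.connection = connection;
  }

  /** Call while synchronized on this wrapper. */
  public boolean isValid() {
    return valid;
  }

  public Connection getConnection() {
    return connection;
  }

  /** Removal-listener side: mark invalid, then close, under the lock. */
  public synchronized void invalidateAndClose() throws SQLException {
    valid = false;
    connection.close();
  }
}
```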

On a speculative get call (from our write(TimelineEntity) method), we do the 
following: 
# get a connection wrapper
# synchronize on the wrapper
# if the wrapper is invalid, try the next round
# do the normal write operations with the connection inside the wrapper

On a removal call, we do the following:
# synchronize on the wrapper
# mark the wrapper as invalid
# close the connection
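Putting the two step lists together, the whole scheme might be sketched as follows (all names are hypothetical; the cache lookup and the Phoenix write are stand-ins for the real calls in the timeline writer):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Sketch of the speculative write loop plus the removal path. cacheGet
// stands in for a Guava cache lookup (e.g. keyed by thread id) and
// writeOp stands in for the actual Phoenix write operations.
public class SpeculativeWrite {

  public static class Wrapper {
    final Connection connection;
    boolean valid = true;            // guarded by the wrapper's monitor
    public Wrapper(Connection c) { this.connection = c; }
  }

  public static void writeWithRetry(Supplier<Wrapper> cacheGet,
                                    Consumer<Connection> writeOp) {
    while (true) {
      Wrapper wrapper = cacheGet.get();      // 1. get a connection wrapper
      synchronized (wrapper) {               // 2. synchronize on the wrapper
        if (!wrapper.valid) {                // 3. stale: try the next round
          continue;
        }
        writeOp.accept(wrapper.connection);  // 4. normal write operations
        return;
      }
    }
  }

  // Removal-listener side: serialize with writers on the same lock,
  // mark the wrapper invalid, then close the connection.
  public static void onRemoval(Wrapper wrapper) {
    synchronized (wrapper) {
      wrapper.valid = false;
      try {
        wrapper.connection.close();
      } catch (SQLException e) {
        // a real implementation would log this
      }
    }
  }
}
```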

Consider the case where a removal call's synchronized block is serialized just 
before a write's. In this case, even if the write call got a stale connection 
wrapper containing a connection about to be closed, it will notice the flag and 
attempt to obtain a newer connection. Concurrent modifications to the same 
connection by write calls and removal calls are of course impossible, since 
both work under the same lock. 

Given the fine-grained synchronization pattern (we only synchronize between one 
writer thread and the Guava cache's cleanup thread), contention should not be a 
big problem. The overhead of acquiring the lock for each write should also be 
acceptable, I assume. The only concern is that, since we perform JDBC 
operations inside a synchronized block, we may block the cache removal process 
for a long time. Ideally we could make this algorithm obstruction-free, but we 
should probably first understand how severe the problem is before making more 
complex algorithm changes. It also appears to be possible to make the removal 
methods asynchronous. 
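For the asynchronous variant, one possible shape (a sketch, assuming a deferred close is acceptable; names are illustrative) is to have the removal listener hand the whole invalidate-and-close step to a background executor, so the cache's cleanup thread never waits behind an in-flight JDBC write:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.ExecutorService;

// Sketch: the removal listener submits the invalidate-and-close work
// to a background executor instead of doing it inline, so the cache
// cleanup thread is not blocked while a writer holds the wrapper lock.
public class AsyncRemoval {

  public static class Wrapper {
    final Connection connection;
    boolean valid = true;          // guarded by the wrapper's monitor
    public Wrapper(Connection c) { this.connection = c; }
  }

  public static void onRemoval(Wrapper wrapper, ExecutorService closer) {
    closer.submit(() -> {
      synchronized (wrapper) {     // still serializes with writers
        wrapper.valid = false;
        try {
          wrapper.connection.close();
        } catch (SQLException e) {
          // a real implementation would log this
        }
      }
    });
  }
}
```

The trade-off is that a stale connection may stay open slightly longer, but writers still see a consistent flag because the flag flip and the close happen under the same wrapper lock.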

> Performance optimization using connection cache of Phoenix timeline writer
> --------------------------------------------------------------------------
>                 Key: YARN-3595
>                 URL: https://issues.apache.org/jira/browse/YARN-3595
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Li Lu
>            Assignee: Li Lu
> The story about the connection cache in Phoenix timeline storage is a little 
> bit long. In YARN-3033 we planned to have a shared writer layer for all 
> collectors in the same collector manager. In this way we can better reuse the 
> same storage layer connection, which is friendlier to conventional storage 
> layers, whose connections are typically heavy-weight. 
> Phoenix, on the other hand, implements its own connection interface layer to 
> be light-weight and thread-unsafe. To make these connections work with our 
> "multiple collector, single writer" model, we're adding a thread indexed 
> connection cache. However, many performance critical factors are yet to be 
> tested. 
> In this JIRA we're tracing performance optimization efforts using this 
> connection cache. Previously we had a draft, but there was one implementation 
> challenge on cache evictions: There may be races between Guava cache's 
> removal listener calls (which close the connection) and normal references to 
> the connection. We need to carefully define the way they synchronize. 
> Performance-wise, at the very beginning stage we may need to understand:
> # whether the current, thread-based indexing is an appropriate approach, or 
> if there are better ways to index the connections. 
> # the best size of the cache, presumably as the proposed default value of a 
> configuration. 
> # how long we need to preserve a connection in the cache. 
> Please feel free to add to this list. 

This message was sent by Atlassian JIRA
