[
https://issues.apache.org/jira/browse/SENTRY-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319585#comment-15319585
]
Colin Patrick McCabe edited comment on SENTRY-872 at 6/7/16 10:36 PM:
--
Thanks for checking out the design doc!

Just to be clear, HIVE-7973 has been committed to Hive already. In fact, it is
already a part of Cloudera's CDH5.8 distribution. While it's true that there
are a few open subtasks remaining on the upstream JIRA, the same could be said
for almost any Hadoop feature. We always have plans to improve things :)

We are planning on using HIVE-7973 for other things besides Sentry HA-- for
example, it is useful for replicating the Hive database. That code will
receive additional testing and attention because of the other uses it is being
put to.

When using HIVE-7973, it doesn't matter which HMS process we talk to--
both of them have access to the notification log stored in SQL. This allows us
to see what is going on in Hive, and the exact order in which it occurred, even
when there are multiple HMS processes involved-- something we currently cannot
do.

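As a rough illustration of why a SQL-backed notification log gives a single
ordering regardless of which HMS wrote an event, here is a minimal sketch. It
uses an in-memory SQLite table as a stand-in; the table name, column names,
and helper functions are all hypothetical and are not the actual HIVE-7973
schema or API.

```python
import sqlite3


def open_log():
    """Create a hypothetical shared notification log table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notification_log ("
        " event_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " hms_id TEXT NOT NULL,"
        " event_type TEXT NOT NULL,"
        " object_name TEXT NOT NULL)"
    )
    return conn


def record_event(conn, hms_id, event_type, object_name):
    """Called by any HMS process; the database assigns the next event_id."""
    cur = conn.execute(
        "INSERT INTO notification_log (hms_id, event_type, object_name)"
        " VALUES (?, ?, ?)",
        (hms_id, event_type, object_name),
    )
    conn.commit()
    return cur.lastrowid


def events_after(conn, last_seen_id):
    """Return every event newer than last_seen_id, in event_id order."""
    rows = conn.execute(
        "SELECT event_id, hms_id, event_type, object_name"
        " FROM notification_log WHERE event_id > ? ORDER BY event_id",
        (last_seen_id,),
    )
    return rows.fetchall()


# Two hypothetical HMS processes write to the same log.
conn = open_log()
record_event(conn, "hms-1", "CREATE_TABLE", "db1.t1")
record_event(conn, "hms-2", "ALTER_TABLE", "db1.t1")
record_event(conn, "hms-1", "DROP_TABLE", "db1.t1")

# A reader sees one total order, no matter which HMS produced each event.
ordered = events_after(conn, 0)
```

A reader only has to remember the last event_id it processed; it does not
need to know which HMS process handled the original request.
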
With an active/active design, all the Sentry daemons would have to request
updates (or be sent updates) from the HMS. This is inefficient because it
multiplies the RPC load on the HMS service, and it gets worse if we have 3
Sentry daemons (for extra redundancy). It also opens the door to divergence
between Sentry daemons, because some of them might receive updates from HMS
earlier or later than others due to network conditions. If we are
persisting the HMS updates in the Sentry SQL database, we must somehow choose
which Sentry daemon does the persisting. They can't all do it, because their
updates would conflict. Choosing one Sentry daemon to do the persistence is
essentially equivalent to choosing a master.

The update log is useful for more than just implementing HA. It can be used as
a generalized mechanism for synchronizing a cache. For example, the HDFS
plugin can read the update log and apply its updates to keep the cache
maintained in the NameNode process in sync with what is going on in Sentry.
This is better than the current mechanism of buffering "deltas" in memory in
the Sentry daemon. The delta mechanism requires lots of heap memory, whereas
the update log mechanism does not. Because the update log is stored in the SQL
database, the HDFS plugin will be able to continue requesting update log
entries even if the Sentry service is restarted or has a failover. In
contrast, the deltas buffered in memory will be lost if either of those events
occurs.

So in conclusion I would say that we do agree that Sentry should move
towards becoming stateless, and we view this design as a stepping stone towards
that.
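
To make the cache-synchronization idea concrete, here is a minimal sketch of a
consumer that tracks the last sequence number it applied and resumes from
there after the service in front of the log is restarted. Everything in it is
an assumption made for illustration: the `update_log` table, the `UpdateLog`
and `PluginCache` classes, and the path-to-role mapping are hypothetical and
do not reflect Sentry's actual schema or the HDFS plugin's real interfaces.

```python
import sqlite3


class UpdateLog:
    """Hypothetical durable update log; the rows live in SQL, not on the heap."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS update_log ("
            " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
            " path TEXT NOT NULL,"
            " role TEXT)"
        )

    def append(self, path, role):
        """Record a change; role=None means the privilege was removed."""
        self.conn.execute(
            "INSERT INTO update_log (path, role) VALUES (?, ?)", (path, role)
        )
        self.conn.commit()

    def read_after(self, seq):
        """Return all entries with a sequence number greater than seq."""
        return self.conn.execute(
            "SELECT seq, path, role FROM update_log WHERE seq > ? ORDER BY seq",
            (seq,),
        ).fetchall()


class PluginCache:
    """Hypothetical stand-in for the cache kept in the NameNode process."""

    def __init__(self):
        self.last_seq = 0
        self.acl = {}

    def sync(self, log):
        """Apply every log entry not yet seen, in order."""
        for seq, path, role in log.read_after(self.last_seq):
            if role is None:
                self.acl.pop(path, None)
            else:
                self.acl[path] = role
            self.last_seq = seq


conn = sqlite3.connect(":memory:")  # stands in for the shared SQL database
log = UpdateLog(conn)
cache = PluginCache()

log.append("/warehouse/t1", "analyst")
cache.sync(log)

# Simulate a service restart: the in-process object is discarded,
# but the log rows in the database are still there.
log = UpdateLog(conn)
log.append("/warehouse/t2", "etl")
log.append("/warehouse/t1", None)
cache.sync(log)
```

The consumer keeps only a sequence number, so nothing has to be buffered on
the service's heap on its behalf, and a restart between the two `sync` calls
loses no updates.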