[ https://issues.apache.org/jira/browse/HADOOP-16284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang updated HADOOP-16284:
-------------------------------------
    Attachment: 4 kms, no KTS patch.png

> KMS Cache Miss Storm
> --------------------
>
>                 Key: HADOOP-16284
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16284
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: kms
>    Affects Versions: 2.6.0
>         Environment: CDH 5.13.1, Kerberized, Cloudera Keytrustee Server
>            Reporter: Wei-Chiu Chuang
>            Priority: Major
>         Attachments: 4 kms, no KTS patch.png
>
>
> We recently stumbled upon a performance issue with KMS, where it occasionally
> exhibited a "No content to map" error (this cluster ran an old version that
> doesn't have HADOOP-14841) and jobs crashed. *We bumped the number of KMSes
> from 2 to 4, and the situation got even worse.*
> Later, we realized this cluster had a few hundred encryption zones and a few
> hundred encryption keys. This is pretty unusual because most of the
> deployments known to us have at most a dozen keys. So in terms of number of
> keys, this cluster is 1-2 orders of magnitude higher than anyone else.
> The high number of encryption keys increases the likelihood of a key cache
> miss in KMS. In Cloudera's setup, each cache miss forces KMS to sync with its
> backend, the Cloudera Keytrustee Server. Plus the high number of KMSes
> amplifies the latency, effectively causing a [cache miss
> storm|https://en.wikipedia.org/wiki/Cache_stampede].
> We were able to reproduce this issue with KMS-o-meter (HDFS-14312) - I will
> come up with a better name later surely - and discovered a scalability bug in
> CKTS. The fix was verified again with the tool.
> Filing this bug so the community is aware of this issue. I don't have a
> solution for now in KMS. But we want to address this scalability problem in
> the near future because we are seeing use cases that require thousands of
> encryption keys.
> ----
> On a side note, 4 KMS doesn't work well without HADOOP-14445 (and subsequent
> fixes).
> A MapReduce job acquires at most 3 KMS delegation tokens, so in cases such as
> distcp, it would fail to reach the 4th KMS on the remote cluster. I imagine
> similar issues exist for other execution engines, but I didn't test.
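
As a rough illustration of the cache-miss reasoning in the description (more keys, and more KMS instances, mean more syncs with the backend), here is a toy model in Python. It is not KMS or Keytrustee code. It assumes that each KMS instance keeps its own independent key cache, that requests are routed to instances uniformly at random, and that cache entries never expire; the function name `count_backend_syncs` and its parameters are hypothetical.

```python
import random


def count_backend_syncs(num_keys, num_kms, num_requests, seed=0):
    """Count cold cache misses in a toy model of several KMS instances.

    Assumptions (not taken from the Hadoop code base): every instance has
    its own cache, each request picks a key and an instance uniformly at
    random, and cached entries never expire.
    """
    rng = random.Random(seed)
    # One independent cache per KMS instance.
    caches = [set() for _ in range(num_kms)]
    syncs = 0
    for _ in range(num_requests):
        key = rng.randrange(num_keys)
        kms = rng.randrange(num_kms)
        if key not in caches[kms]:
            # Cache miss: this instance has to go to the backend.
            syncs += 1
            caches[kms].add(key)
    return syncs


if __name__ == "__main__":
    for kms_count in (2, 4):
        for key_count in (12, 300):
            syncs = count_backend_syncs(key_count, kms_count, 200_000)
            print(f"{kms_count} KMS, {key_count} keys: {syncs} backend syncs")
```

In this model the number of backend syncs is bounded by keys x instances, so going from a dozen keys to a few hundred, and from 2 to 4 instances, multiplies the load on the backend. A real deployment also has cache expiry, which makes the misses recur rather than happen once.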