Apache Hadoop qbt Report: branch2.10+JDK7 on Linux/x86
For more details, see https://builds.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86/692/

No changes
Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86_64
For more details, see https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/148/

[May 20, 2020 2:39:40 AM] (Yiqun Lin) HDFS-15340. RBF: Implement BalanceProcedureScheduler basic framework. Contributed by Jinglun.
[May 20, 2020 6:06:52 AM] (pjoseph) YARN-9606. Set sslfactory for AuthenticatedURL() while creating LogsCLI#webServiceClient.
[May 20, 2020 12:42:25 PM] (Steve Loughran) HADOOP-16900. Very large files can be truncated when written through the S3A FileSystem.
[May 20, 2020 4:23:56 PM] (Eric Yang) YARN-10228. Relax restriction of file path character in yarn.service.am.java.opts.
[May 20, 2020 6:51:48 PM] (noreply) HADOOP-17004. Fixing a formatting issue
[May 21, 2020 1:07:23 AM] (noreply) HDFS-15353. Use sudo instead of su to allow nologin user for secure DataNode (#2018)
Re: [DISCUSS] Secure Hadoop without Kerberos
See my comments inline:

On Wed, May 20, 2020 at 4:50 PM Rajive Chittajallu wrote:

> On Wed, May 20, 2020 at 1:47 PM Eric Yang wrote:
>
> >> > Kerberos was developed a decade before web development became popular. There are some Kerberos limitations which do not work well in Hadoop. A few examples of corner cases:
>
> >> Microsoft Active Directory, which is extensively used in many organizations, is based on Kerberos.
>
> > True, but with the rise of Google and AWS, OIDC seems to be a formidable standard that can replace Kerberos for authentication. I think providing an option for the new standard is good for Hadoop.
>
> I think you are referring to OAuth2, and adoption varies significantly across vendors. When one refers to Kerberos, it's mostly about MIT Kerberos or Microsoft Active Directory. But OAuth2 is a specification, implementations vary and are quite prone to bugs. I would be very careful in making a generic statement such as "formidable standard".
>
> AWS services, at least in the context of data processing / analytics, do not support OAuth2. It's more of a GCP thing. AWS uses signed requests [1].
>
> [1] https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html

Kerberos is a protocol for authentication, and OIDC is also an authentication protocol. MIT Kerberos and OAuth2 are frameworks, not authentication protocols. By no means am I suggesting adopting an OAuth2 framework, because implementing according to the protocol spec is better than hard-wiring to certain libraries. We can adopt existing OIDC libraries like pac4j to reduce the maintenance of implementing the OIDC protocol in Hadoop. AWS has been offering OIDC authentication for EKS and as an IAM identity provider. By offering native OIDC support, Hadoop can access cloud services that are secured by OIDC more easily.

> >> > 1. Kerberos principal doesn't encode port number; it is difficult to know if the principal is coming from an authorized daemon or a hacker container trying to forge a service principal.
>
> >> Clients use ephemeral ports. Not sure of the relevance of this statement.
>
> > Hint: CVE-2020-9492
>
> It's a reserved one. You can help the conversation by describing a threat model.

The Hadoop security mailing list has the problem listed, if you are interested in this area. Hadoop Kerberos security quirks are off topic for decoupling Kerberos from Hadoop.

> >> > 2. Hadoop Kerberos principals are used as high-privileged principals, a form of credential to impersonate the end user.
>
> >> Principals are identities of the user. You can make identities fully qualified, to include the issuing authority if you want to. This is not Kerberos specific.
> >>
> >> Remember, Kerberos is an authentication mechanism; how those assertions are translated to authorization rules is application specific.
> >>
> >> Probably reconsider alternatives to auth_to_local rules.
>
> > Trust must be validated. Hadoop Kerberos principals for services that can perform impersonation are equal to root power. Transporting root power securely without being intercepted is quite difficult when services are running as root instead of as daemons. There is an alternate solution of always forwarding the signed end-user token, so there is no need to validate the proxy user credential. The downside of forwarding signed tokens is that it is difficult to forward multiple tokens of incompatible security mechanisms, because the renewal mechanism and expiration time may not be deciphered by the transport mechanism. This is the reason that using an SSO token is a good way to ensure every library and framework abides by the same security practice and to eliminate confused-deputy problems.
>
> Trust of what? Service principals should not be used for authentication in a client context; they are there for server identification.

The trust refers to a service (Oozie/Hive) impersonating the end user, where the namenode issues a delegation token after checking the proxy user ACL. The form of token presented to the namenode is a service TGT, not an end-user TGT. The service TGT is validated in the proxy user ACL check with the namenode to allow impersonation to happen. If the service TGT is intercepted due to lack of encryption in the RPC or HTTP transport, the service ticket is vulnerable to replay attack.

> OAuth2 (which the OIDC flow is based on) suggests JWT, which are signed tokens. Can you elaborate more on what you mean by "SSO Token"?

SSO token is a JWT token in this context. My advice is that there should only be one token transported, instead of multiple tokens, to prevent out-of-sync expiration date problems across multiple tokens.

> To improve security for doAs use cases, add context to the calls. Just replacing Kerberos with a different authentication mechanism is not going to solve the problem.

The focus is to support alternate security mechanism that may have been
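For readers following the impersonation debate above, here is a minimal sketch of the proxy-user pattern being discussed, using Hadoop's public UserGroupInformation API. The "alice" end user and the path are hypothetical, and the sketch assumes the service has already logged in (for example from a keytab) and is authorized by the proxy-user ACLs.

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ProxyUserSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The service (e.g. Oozie/Hive) logs in with its own credentials.
        UserGroupInformation service = UserGroupInformation.getLoginUser();
        // Impersonate the end user; the NameNode checks the proxy-user ACL
        // against the *service* credentials, which is the trust being debated.
        UserGroupInformation proxy =
            UserGroupInformation.createProxyUser("alice", service); // "alice" is hypothetical
        FileStatus[] listing = proxy.doAs(
            (PrivilegedExceptionAction<FileStatus[]>) () ->
                FileSystem.get(conf).listStatus(new Path("/user/alice")));
        for (FileStatus st : listing) {
          System.out.println(st.getPath());
        }
      }
    }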
Re: [DISCUSS] Secure Hadoop without Kerberos
On Wed, May 20, 2020 at 1:47 PM Eric Yang wrote:

>> > Kerberos was developed a decade before web development became popular. There are some Kerberos limitations which do not work well in Hadoop. A few examples of corner cases:
>>
>> Microsoft Active Directory, which is extensively used in many organizations, is based on Kerberos.
>
> True, but with the rise of Google and AWS, OIDC seems to be a formidable standard that can replace Kerberos for authentication. I think providing an option for the new standard is good for Hadoop.

I think you are referring to OAuth2, and adoption varies significantly across vendors. When one refers to Kerberos, it's mostly about MIT Kerberos or Microsoft Active Directory. But OAuth2 is a specification, implementations vary and are quite prone to bugs. I would be very careful in making a generic statement such as "formidable standard".

AWS services, at least in the context of data processing / analytics, do not support OAuth2. It's more of a GCP thing. AWS uses signed requests [1].

[1] https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html

>> > 1. Kerberos principal doesn't encode port number; it is difficult to know if the principal is coming from an authorized daemon or a hacker container trying to forge a service principal.
>>
>> Clients use ephemeral ports. Not sure of the relevance of this statement.
>
> Hint: CVE-2020-9492

It's a reserved one. You can help the conversation by describing a threat model.

>> > 2. Hadoop Kerberos principals are used as high-privileged principals, a form of credential to impersonate the end user.
>>
>> Principals are identities of the user. You can make identities fully qualified, to include the issuing authority if you want to. This is not Kerberos specific.
>>
>> Remember, Kerberos is an authentication mechanism; how those assertions are translated to authorization rules is application specific.
>>
>> Probably reconsider alternatives to auth_to_local rules.
>
> Trust must be validated. Hadoop Kerberos principals for services that can perform impersonation are equal to root power. Transporting root power securely without being intercepted is quite difficult when services are running as root instead of as daemons. There is an alternate solution of always forwarding the signed end-user token, so there is no need to validate the proxy user credential. The downside of forwarding signed tokens is that it is difficult to forward multiple tokens of incompatible security mechanisms, because the renewal mechanism and expiration time may not be deciphered by the transport mechanism. This is the reason that using an SSO token is a good way to ensure every library and framework abides by the same security practice and to eliminate confused-deputy problems.

Trust of what? Service principals should not be used for authentication in a client context; they are there for server identification.

OAuth2 (which the OIDC flow is based on) suggests JWT, which are signed tokens. Can you elaborate more on what you mean by "SSO Token"?

To improve security for doAs use cases, add context to the calls. Just replacing Kerberos with a different authentication mechanism is not going to solve the problem. And how to improve proxy user use cases varies by application. Asserting an 'on-behalf-of' action when there is an active client on the other end (e.g. HDFS proxy) would be different from one that is initiated per schedule, e.g. Oozie.

>> > 3. Delegation token may allow expired users to continue to run jobs long after they are gone, without rechecking if the end user credentials are still valid.
>>
>> Delegation tokens are a Hadoop-specific implementation, whose lifecycle is outside the scope of Kerberos. Hadoop (NN/RM) can periodically check the respective IDP policy and revoke tokens. Or have a central token management service, similar to KMS.
>>
>> > 4. Passing different forms of tokens does not work well with cloud provider security mechanisms. For example, passing an AWS STS token for an S3 bucket. There is no renewal mechanism, nor a good way to identify when the token would expire.
>>
>> This is outside the scope of Kerberos.
>>
>> Assuming you are using YARN, making RM handle S3 temp credentials, similar to HDFS delegation tokens, is something to consider.
>>
>> > There are companies that work on bridging security mechanisms of different types, but this is not a primary goal for Hadoop. Hadoop can benefit from modernized security using open standards like OpenID Connect, which proposes to unify web applications using SSO. This ensures the client credentials are transported in each stage of client-server interaction. This may improve overall security, and provide a more cloud-native form factor. I wonder if there is any interest in the community to enable Hadoop OpenID Connect integration work?
>>
>> End to end identity assertion is where
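Because the thread keeps contrasting bearer credentials with signed tokens, here is a minimal, JDK-only sketch of verifying an HMAC-signed (HS256) JWT along the chain. The shared secret and the token source are assumptions for illustration; an OIDC deployment would more commonly verify an RS256 signature against the provider's published keys.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Base64;
    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    public class JwtVerifySketch {
      // Returns true if the JWT's HS256 signature matches the shared secret.
      static boolean verifyHs256(String jwt, byte[] secret) throws Exception {
        String[] parts = jwt.split("\\.");
        if (parts.length != 3) {
          return false;
        }
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret, "HmacSHA256"));
        byte[] expected = mac.doFinal(
            (parts[0] + "." + parts[1]).getBytes(StandardCharsets.US_ASCII));
        byte[] provided = Base64.getUrlDecoder().decode(parts[2]);
        // Constant-time comparison to avoid timing side channels.
        return MessageDigest.isEqual(expected, provided);
      }

      public static void main(String[] args) throws Exception {
        String token = System.getenv("JWT"); // hypothetical way to supply a token
        if (token == null) {
          System.err.println("set JWT to a compact-serialized token");
          return;
        }
        byte[] secret = "change-me".getBytes(StandardCharsets.UTF_8); // hypothetical secret
        System.out.println(verifyHs256(token, secret) ? "signature ok" : "signature invalid");
      }
    }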
Re: [DISCUSS] Secure Hadoop without Kerberos
On Wed, May 6, 2020 at 3:32 PM Eric Yang wrote:
>
> Hi all,
>
> Kerberos was developed a decade before web development became popular. There are some Kerberos limitations which do not work well in Hadoop. A few examples of corner cases:

Microsoft Active Directory, which is extensively used in many organizations, is based on Kerberos.

> 1. Kerberos principal doesn't encode port number; it is difficult to know if the principal is coming from an authorized daemon or a hacker container trying to forge a service principal.

Clients use ephemeral ports. Not sure of the relevance of this statement.

> 2. Hadoop Kerberos principals are used as high-privileged principals, a form of credential to impersonate the end user.

Principals are identities of the user. You can make identities fully qualified, to include the issuing authority if you want to. This is not Kerberos specific.

Remember, Kerberos is an authentication mechanism; how those assertions are translated to authorization rules is application specific.

Probably reconsider alternatives to auth_to_local rules.

> 3. Delegation token may allow expired users to continue to run jobs long after they are gone, without rechecking if the end user credentials are still valid.

Delegation tokens are a Hadoop-specific implementation, whose lifecycle is outside the scope of Kerberos. Hadoop (NN/RM) can periodically check the respective IDP policy and revoke tokens. Or have a central token management service, similar to KMS.

> 4. Passing different forms of tokens does not work well with cloud provider security mechanisms. For example, passing an AWS STS token for an S3 bucket. There is no renewal mechanism, nor a good way to identify when the token would expire.

This is outside the scope of Kerberos.

Assuming you are using YARN, making RM handle S3 temp credentials, similar to HDFS delegation tokens, is something to consider.

> There are companies that work on bridging security mechanisms of different types, but this is not a primary goal for Hadoop. Hadoop can benefit from modernized security using open standards like OpenID Connect, which proposes to unify web applications using SSO. This ensures the client credentials are transported in each stage of client-server interaction. This may improve overall security, and provide a more cloud-native form factor. I wonder if there is any interest in the community to enable Hadoop OpenID Connect integration work?

End-to-end identity assertion is something Kerberos in itself does not address. But any implementation should not pass credentials. We need a way to pass signed requests that can be verified along the chain.

> regards,
> Eric
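To make the delegation-token lifecycle in points 3 and 4 concrete, here is a small sketch using the public FileSystem and Token APIs. It assumes a Kerberized HDFS client context, and the choice of the current user as renewer is illustrative only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.security.UserGroupInformation;
    import org.apache.hadoop.security.token.Token;

    public class DelegationTokenSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Name ourselves as the renewer so this same process can renew later.
        String renewer = UserGroupInformation.getCurrentUser().getShortUserName();
        Token<?> token = fs.getDelegationToken(renewer);
        if (token == null) {
          System.out.println("security is off, or this FileSystem issues no tokens");
          return;
        }
        System.out.println("got " + token.getKind() + " for " + token.getService());

        // Renewal pushes the expiry forward up to the max lifetime; note that
        // nothing here re-checks whether the original user still exists, which
        // is the concern raised in point 3 of the quoted thread.
        long newExpiry = token.renew(conf);
        System.out.println("token now expires at " + newExpiry);
      }
    }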
Re: [EXTERNAL] Re: [DISCUSS] Secure Hadoop without Kerberos
I have to strongly disagree with making UGI.doAs() private. Just because you feel that impersonation isn't an important feature does not make it so for all users. There are many valid use cases which require impersonation, and in fact I consider this to be one of the differentiating features of the Hadoop ecosystem. We make use of it heavily to build a variety of services which would not be possible without this.

Also consider that in addition to gateway services such as Knox being broken by this change, you would also cripple job schedulers such as Oozie. Running workloads on YARN as different users is vital to ensure that queue resources are allocated and accounted for properly, as well as file permissions enforced. Without impersonation, all users of a cluster would need to be granted access to talk directly to YARN. Higher-level access points or APIs would not be possible.

Craig Condit

From: Eric Yang
Sent: Wednesday, May 20, 2020 1:57 PM
To: Akira Ajisaka
Cc: Hadoop Common
Subject: [EXTERNAL] Re: [DISCUSS] Secure Hadoop without Kerberos

Hi Akira,

Thank you for the information. Knox plays a main role as a reverse proxy for the Hadoop cluster. I understand the importance of keeping Knox running to centralize the audit log for ingress into the cluster. Other reverse proxy solutions like Nginx are more feature rich for caching static content and load balancing. It would be great to have the ability to use either Knox or Nginx as the reverse proxy solution. A company-wide OIDC provider is likely to run independently from the Hadoop cluster, but it is also possible to run it in a Hadoop cluster. The reverse proxy must have the ability to redirect to OIDC where the exposed endpoint is appropriate. HADOOP-11717 was a good effort to enable SSO integration, except it is written to extend Kerberos authentication, which prevents decoupling from Kerberos from becoming a reality.

I gathered a few design requirements this morning, and you are welcome to contribute:

1. Encryption is mandatory. Server certificate validation is required.
2. Existing token infrastructure for block access tokens remains the same.
3. Replace delegation token transport with OIDC JWT tokens.
4. Patch token renewer logic to renew tokens with the OIDC endpoint before the token expires.
5. Impersonation logic uses service user credentials. A new way to renew service user credentials securely.
6. Replace Hadoop RPC SASL transport with TLS because OIDC works with TLS natively.
7. Command CLI improvements to use environment variables or files for accessing client credentials.

Downgrade the use of UGI.doAs() to private to Hadoop. Services should not run with elevated privileges unless there is a good reason for it (i.e. loading Hive external tables).

I think this is a good starting point, and feedback can help turn these requirements into tasks. Let me know what you think. Thanks

regards,
Eric

On Tue, May 19, 2020 at 9:47 PM Akira Ajisaka wrote:

> Hi Eric, thank you for starting the discussion.
>
> I'm interested in OpenID Connect (OIDC) integration.
>
> In addition to the benefits (security, cloud native), operating costs may be reduced in some companies. We have our company-wide OIDC provider and enable SSO for Hadoop Web UIs via Knox + OIDC in Yahoo! JAPAN. On the other hand, Hadoop administrators have to manage our own KDC servers only for Hadoop ecosystems. If Hadoop and its ecosystem can support OIDC, we don't have to manage KDC and that way operating costs will be reduced.
>
> Regards,
> Akira
>
> On Thu, May 7, 2020 at 7:32 AM Eric Yang wrote:
>
>> Hi all,
>>
>> Kerberos was developed a decade before web development became popular. There are some Kerberos limitations which do not work well in Hadoop. A few examples of corner cases:
>>
>> 1. Kerberos principal doesn't encode port number; it is difficult to know if the principal is coming from an authorized daemon or a hacker container trying to forge a service principal.
>> 2. Hadoop Kerberos principals are used as high-privileged principals, a form of credential to impersonate the end user.
>> 3. Delegation token may allow expired users to continue to run jobs long after they are gone, without rechecking if the end user credentials are still valid.
>> 4. Passing different forms of tokens does not work well with cloud provider security mechanisms. For example, passing an AWS STS token for an S3 bucket. There is no renewal mechanism, nor a good way to identify when the token would expire.
>>
>> There are companies that work on bridging security mechanisms of different types, but this is not a primary goal for Hadoop. Hadoop can benefit from modernized security using open standards like OpenID Connect, which proposes to unify web applications using SSO. This ensures the client credentials are transported in each stage of client-server interaction. This may improve overall security, and provide a more cloud-native form factor. I wonder if there is
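As a concrete illustration of the trust boundary Craig describes: impersonation is only honored for services listed in the proxy-user ACLs, which are ordinary Hadoop configuration keys. The sketch below sets them programmatically only to show the key names; the oozie/knox service accounts, hosts, and groups are placeholders, and real clusters define these in core-site.xml.

    import org.apache.hadoop.conf.Configuration;

    public class ProxyUserAclSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Only the oozie service account, connecting from this gateway host,
        // may impersonate members of these groups (values are placeholders).
        conf.set("hadoop.proxyuser.oozie.hosts", "gateway.example.com");
        conf.set("hadoop.proxyuser.oozie.groups", "analysts,etl");
        // Knox typically gets its own, separately scoped entries.
        conf.set("hadoop.proxyuser.knox.hosts", "knox.example.com");
        conf.set("hadoop.proxyuser.knox.groups", "*");
        System.out.println(conf.get("hadoop.proxyuser.oozie.groups"));
      }
    }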
Re: [DISCUSS] Secure Hadoop without Kerberos
Hi Akira,

Thank you for the information. Knox plays a main role as a reverse proxy for the Hadoop cluster. I understand the importance of keeping Knox running to centralize the audit log for ingress into the cluster. Other reverse proxy solutions like Nginx are more feature rich for caching static content and load balancing. It would be great to have the ability to use either Knox or Nginx as the reverse proxy solution. A company-wide OIDC provider is likely to run independently from the Hadoop cluster, but it is also possible to run it in a Hadoop cluster. The reverse proxy must have the ability to redirect to OIDC where the exposed endpoint is appropriate. HADOOP-11717 was a good effort to enable SSO integration, except it is written to extend Kerberos authentication, which prevents decoupling from Kerberos from becoming a reality.

I gathered a few design requirements this morning, and you are welcome to contribute:

1. Encryption is mandatory. Server certificate validation is required.
2. Existing token infrastructure for block access tokens remains the same.
3. Replace delegation token transport with OIDC JWT tokens.
4. Patch token renewer logic to renew tokens with the OIDC endpoint before the token expires.
5. Impersonation logic uses service user credentials. A new way to renew service user credentials securely.
6. Replace Hadoop RPC SASL transport with TLS because OIDC works with TLS natively.
7. Command CLI improvements to use environment variables or files for accessing client credentials.

Downgrade the use of UGI.doAs() to private to Hadoop. Services should not run with elevated privileges unless there is a good reason for it (i.e. loading Hive external tables).

I think this is a good starting point, and feedback can help turn these requirements into tasks. Let me know what you think. Thanks

regards,
Eric

On Tue, May 19, 2020 at 9:47 PM Akira Ajisaka wrote:

> Hi Eric, thank you for starting the discussion.
>
> I'm interested in OpenID Connect (OIDC) integration.
>
> In addition to the benefits (security, cloud native), operating costs may be reduced in some companies. We have our company-wide OIDC provider and enable SSO for Hadoop Web UIs via Knox + OIDC in Yahoo! JAPAN. On the other hand, Hadoop administrators have to manage our own KDC servers only for Hadoop ecosystems. If Hadoop and its ecosystem can support OIDC, we don't have to manage KDC and that way operating costs will be reduced.
>
> Regards,
> Akira
>
> On Thu, May 7, 2020 at 7:32 AM Eric Yang wrote:
>
>> Hi all,
>>
>> Kerberos was developed a decade before web development became popular. There are some Kerberos limitations which do not work well in Hadoop. A few examples of corner cases:
>>
>> 1. Kerberos principal doesn't encode port number; it is difficult to know if the principal is coming from an authorized daemon or a hacker container trying to forge a service principal.
>> 2. Hadoop Kerberos principals are used as high-privileged principals, a form of credential to impersonate the end user.
>> 3. Delegation token may allow expired users to continue to run jobs long after they are gone, without rechecking if the end user credentials are still valid.
>> 4. Passing different forms of tokens does not work well with cloud provider security mechanisms. For example, passing an AWS STS token for an S3 bucket. There is no renewal mechanism, nor a good way to identify when the token would expire.
>>
>> There are companies that work on bridging security mechanisms of different types, but this is not a primary goal for Hadoop. Hadoop can benefit from modernized security using open standards like OpenID Connect, which proposes to unify web applications using SSO. This ensures the client credentials are transported in each stage of client-server interaction. This may improve overall security, and provide a more cloud-native form factor. I wonder if there is any interest in the community to enable Hadoop OpenID Connect integration work?
>>
>> regards,
>> Eric
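Requirement 7 above (clients picking credentials up from an environment variable or file) could look roughly like the sketch below. The HADOOP_TOKEN_FILE variable name, the gateway URL, and the bearer-token scheme are assumptions for illustration, not an existing Hadoop CLI feature.

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class BearerTokenClientSketch {
      // Reads an OIDC access token from $HADOOP_TOKEN_FILE (hypothetical name).
      static String loadToken() throws IOException {
        String path = System.getenv("HADOOP_TOKEN_FILE");
        return new String(Files.readAllBytes(Paths.get(path))).trim();
      }

      public static void main(String[] args) throws Exception {
        String token = loadToken();
        // Hypothetical TLS endpoint fronted by Knox/Nginx as discussed above.
        URL url = new URL("https://gateway.example.com/webhdfs/v1/tmp?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Authorization", "Bearer " + token);
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
      }
    }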
Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86_64
For more details, see https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/147/

[May 19, 2020 3:45:54 AM] (noreply) HADOOP-17004. ABFS: Improve the ABFS driver documentation
[May 19, 2020 5:27:12 AM] (noreply) HADOOP-17024. ListStatus on ViewFS root (ls "/") should list the linkFallBack root (configured target root). Contributed by Abhishek Das.
[May 19, 2020 5:36:36 AM] (Surendra Singh Lilhore) MAPREDUCE-6826. Job fails with InvalidStateTransitonException: Invalid event: JOB_TASK_COMPLETED at SUCCEEDED/COMMITTING. Contributed by Bilwa S T.
[May 19, 2020 7:30:07 PM] (noreply) Hadoop-17015. ABFS: Handling Rename and Delete idempotency
[May 19, 2020 11:47:04 PM] (noreply) HADOOP-16586. ITestS3GuardFsck, others fails when run using a local metastore. (#1950)
[jira] [Resolved] (HADOOP-16900) Very large files can be truncated when written through S3AFileSystem
[ https://issues.apache.org/jira/browse/HADOOP-16900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran resolved HADOOP-16900.
-------------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Fixed in trunk; rebuilding and retesting branch-3.3 with it too.

> Very large files can be truncated when written through S3AFileSystem
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-16900
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16900
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.2.1
>          Reporter: Andrew Olson
>          Assignee: Mukund Thakur
>          Priority: Major
>           Labels: s3
>           Fix For: 3.4.0
>
> If a written file size exceeds 10,000 * {{fs.s3a.multipart.size}}, a corrupt truncation of the S3 object will occur, as the maximum number of parts in a multipart upload is 10,000 as specified by the S3 API, and there is an apparent bug where this failure is not fatal and the multipart upload is allowed to be marked as completed.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
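To make the limit concrete: S3 caps a multipart upload at 10,000 parts, so the largest safe object is 10,000 times fs.s3a.multipart.size; with an assumed 128 MB part size (the actual default depends on the Hadoop release) that is roughly 1.3 TB, beyond which the affected versions silently truncated the object. The snippet below is only illustrative arithmetic, not the S3A code path.

    public class S3aPartLimitSketch {
      public static void main(String[] args) {
        // S3 allows at most 10,000 parts per multipart upload.
        final long maxParts = 10_000L;
        // Assumed example part size; the real value comes from fs.s3a.multipart.size.
        final long partSizeBytes = 128L * 1024 * 1024;
        final long maxObjectBytes = maxParts * partSizeBytes;
        System.out.printf("With %d MB parts, uploads beyond %.2f TB exceed the part limit%n",
            partSizeBytes >> 20, maxObjectBytes / 1e12);
      }
    }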
[jira] [Created] (HADOOP-17050) Add support for multiple delegation tokens in S3AFilesystem
Gabor Bota created HADOOP-17050:
-----------------------------------

             Summary: Add support for multiple delegation tokens in S3AFilesystem
                 Key: HADOOP-17050
                 URL: https://issues.apache.org/jira/browse/HADOOP-17050
             Project: Hadoop Common
          Issue Type: Sub-task
          Components: fs/s3
            Reporter: Gabor Bota
            Assignee: Gabor Bota

In {{org.apache.hadoop.fs.s3a.auth.delegation.AbstractDelegationTokenBinding}}, the {{createDelegationToken}} should return a list of tokens. With this functionality, the {{AbstractDelegationTokenBinding}} can get two different tokens at the same time. {{AbstractDelegationTokenBinding.TokenSecretManager}} should be extended to retrieve secrets and look up delegation tokens (use the public API for the secret manager in Hadoop).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
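One hypothetical shape for the change described above could be a binding method that returns several tokens at once. The sketch below is illustrative only: MultiTokenBinding and ExampleBinding are invented names, not the real AbstractDelegationTokenBinding API or the eventual patch.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.security.token.Token;

    // Hypothetical interface; not the real AbstractDelegationTokenBinding API.
    interface MultiTokenBinding {
      List<Token<?>> createDelegationTokens(String renewer) throws IOException;
    }

    // Illustrative implementation returning more than one token per bind.
    class ExampleBinding implements MultiTokenBinding {
      @Override
      public List<Token<?>> createDelegationTokens(String renewer) throws IOException {
        List<Token<?>> tokens = new ArrayList<>();
        tokens.add(new Token<>()); // e.g. the S3A session token in a real binding
        tokens.add(new Token<>()); // e.g. a second token for an auxiliary store
        return tokens;
      }
    }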