[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r717597221

## File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

```
@@ -743,11 +775,24 @@ protected void verifyBucketExists()
    */
   @Retries.RetryTranslated
   protected void verifyBucketExistsV2()
       throws UnknownStoreException, IOException {
     if (!invoker.retry("doesBucketExistV2", bucket, true,
         trackDurationOfOperation(getDurationTrackerFactory(),
             STORE_EXISTS_PROBE.getSymbol(),
-            () -> s3.doesBucketExistV2(bucket)))) {
+            () -> {
+              // Bug in SDK always returns `true` for AccessPoint ARNs with `doesBucketExistV2()`
+              // expanding implementation to use ARNs and buckets correctly
+              try {
+                s3.getBucketAcl(bucket);
+              } catch (AmazonServiceException ex) {
+                int statusCode = ex.getStatusCode();
+                if (statusCode == 404 || (statusCode == 403 && accessPoint != null)) {
```

Review comment: I see there's a `SC_404` in internal constants, so I'll use that and add an `SC_403`.
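The status-code handling discussed in this thread — a 404 always means the store is missing, while a 403 only counts as "missing" when an Access Point ARN is in use (the `getBucketAcl()` workaround for the SDK's `doesBucketExistV2()` bug) — can be sketched as a small pure function. This is an illustration of the decision logic only, not the PR's code; the class and method names are made up.

```java
// Hypothetical sketch of the existence-probe decision discussed above.
// In S3A the probe calls getBucketAcl(); on failure, this logic decides
// whether the error means "store not found".
public class ExistenceProbe {

  /** True when the ACL probe's failure should be treated as "store not found". */
  public static boolean treatAsMissing(int statusCode, boolean usingAccessPoint) {
    // 404 is always missing; 403 is only treated as missing behind an Access Point,
    // because access-point ACL probes surface a permission error rather than 404.
    return statusCode == 404 || (statusCode == 403 && usingAccessPoint);
  }

  public static void main(String[] args) {
    System.out.println(treatAsMissing(404, false)); // true
    System.out.println(treatAsMissing(403, true));  // true
    System.out.println(treatAsMissing(403, false)); // false
  }
}
```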
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r717607970

## File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md

@@ -1580,6 +1580,68 @@ Why explicitly declare a bucket bound to the central endpoint? It ensures that if the default endpoint is changed to a new region, data store in US-east is still reachable.

+## Configuring S3 AccessPoints usage with S3A
+S3a now supports [S3 Access Point](https://aws.amazon.com/s3/features/access-points/) usage which
+improves VPC integration with S3 and simplifies your data's permission model because different
+policies can be applied now on the Access Point level. For more information about why to use and
+how to create them make sure to read the official documentation.
+
+Accessing data through an access point, is done by using its ARN, as opposed to just the bucket name.
+You can set the Access Point ARN property using the following per bucket configuration property:
+```xml
+<property>
+  <name>fs.s3a.sample-bucket.accesspoint.arn</name>
+  <value>{ACCESSPOINT_ARN_HERE}</value>
+  <description>Configure S3a traffic to use this AccessPoint</description>
+</property>
+```
+
+This configures access to the `sample-bucket` bucket for S3A, to go through the
+new Access Point ARN. So, for example `s3a://sample-bucket/key` will now use your
+configured ARN when getting data from S3 instead of your bucket.
+
+You can also use an Access Point name as a path URI such as `s3a://finance-team-access/key`, by
+configuring the `.accesspoint.arn` property as a per-bucket override:
+```xml
+<property>
+  <name>fs.s3a.finance-team-access.accesspoint.arn</name>
+  <value>{ACCESSPOINT_ARN_HERE}</value>
+  <description>Configure S3a traffic to use this AccessPoint</description>
+</property>
+```
+
+The `fs.s3a.accesspoint.required` property can also require all access to S3 to go through Access
+Points. This has the advantage of increasing security inside a VPN / VPC as you only allow access
+to known sources of data defined through Access Points. In case there is a need to access a bucket
+directly (without Access Points) then you can use per bucket overrides to disable this setting on a
+bucket by bucket basis i.e. `fs.s3a.{YOUR-BUCKET}.accesspoint.required`.
+
+```xml
+<property>
+  <name>fs.s3a.accesspoint.required</name>
+  <value>true</value>
+</property>
+
+<property>
+  <name>fs.s3a.example-bucket.accesspoint.required</name>
+  <value>false</value>
+</property>
+```
+
+Before using Access Points make sure you're not impacted by the following:
+- `ListObjectsV1` is not supported, this is also deprecated on AWS S3 for performance reasons;
+- The endpoint for S3 requests will automatically change from `s3.amazonaws.com` to use
+`s3-accesspoint.REGION.amazonaws.{com | com.cn}` depending on the Access Point ARN. This **only**
+happens if the `fs.s3a.endpoint` property isn't set. The endpoint property overwrites any changes,
+this is intentional so FIPS or DualStack endpoints can be set. While considering endpoints, if you
+have any custom signers that use the host endpoint property make sure to update them if needed;
+- Access Point names don't have to be globally unique, in the same way that bucket names have to.

Review comment: ✅

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
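The per-bucket override semantics for `fs.s3a.accesspoint.required` described in the documentation under review can be sketched as follows. This is a hypothetical illustration using a plain map rather than Hadoop's `Configuration` and `propagateBucketOptions`; the helper names are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the override resolution: the per-bucket key
// fs.s3a.{bucket}.accesspoint.required, when present, wins over the
// global fs.s3a.accesspoint.required key.
public class AccessPointRequired {

  public static boolean isRequired(Map<String, String> conf, String bucket) {
    String perBucket = conf.get("fs.s3a." + bucket + ".accesspoint.required");
    if (perBucket != null) {
      return Boolean.parseBoolean(perBucket); // per-bucket override wins
    }
    return Boolean.parseBoolean(conf.getOrDefault("fs.s3a.accesspoint.required", "false"));
  }

  // Demo settings matching the doc's example (values are illustrative).
  public static Map<String, String> demoConf() {
    Map<String, String> m = new HashMap<>();
    m.put("fs.s3a.accesspoint.required", "true");
    m.put("fs.s3a.example-bucket.accesspoint.required", "false");
    return m;
  }
}
```

With the demo settings, every bucket requires an Access Point except `example-bucket`, which opts out via its per-bucket key.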
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r717605122

## File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md

+- The endpoint for S3 requests will automatically change from `s3.amazonaws.com` to use
+`s3-accesspoint.REGION.amazonaws.{com | com.cn}` depending on the Access Point ARN. This **only**
+happens if the `fs.s3a.endpoint` property isn't set. The endpoint property overwrites any changes,

Review comment: nope, removing; good catch
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r717603229

## File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

```
@@ -1167,7 +1216,10 @@ public String getBucketLocation(String bucketName) throws IOException {
     final String region = trackDurationAndSpan(
         STORE_EXISTS_PROBE, bucketName, null, () ->
             invoker.retry("getBucketLocation()", bucketName, true, () ->
-                s3.getBucketLocation(bucketName)));
+                // If accessPoint then region is known from Arn
+                accessPoint != null
```

Review comment: I liked the iostats tracking, so it doesn't look like an operation is missing / changed.
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r693923413

## File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

```
@@ -2570,6 +2614,11 @@ protected S3ListResult continueListObjects(S3ListRequest request,
         OBJECT_CONTINUE_LIST_REQUEST,
         () -> {
           if (useListV1) {
+            if (accessPoint != null) {
+              // AccessPoints are not compatible with V1List
+              throw new InvalidRequestException("ListV1 is not supported by AccessPoints");
```

Review comment: If not, remove that and skip all V1 list tests (since they're using V2 anyway).
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r693920808

## File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

```
           if (useListV1) {
+            if (accessPoint != null) {
+              // AccessPoints are not compatible with V1List
+              throw new InvalidRequestException("ListV1 is not supported by AccessPoints");
```

Review comment: I upgraded the list to V2 in the `initialize` method. I'm thinking I should make the logging more explicit there and completely remove this extra check + `throw InvalidRequestException`. That should be enough, right?
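The initialize-time upgrade discussed above — forcing the listing API to V2 whenever an Access Point ARN is configured, instead of failing later in `continueListObjects` — can be sketched as a tiny pure function. The names here are illustrative, not the PR's; S3A reads the configured version from `fs.s3a.list.version`.

```java
// Sketch of the decision: Access Points only support ListObjectsV2, so a
// configured V1 listing is silently upgraded (with a warning log in S3A).
public class ListVersion {

  public static int effectiveListVersion(int configured, boolean usingAccessPoint) {
    if (usingAccessPoint && configured == 1) {
      // V1 list is incompatible with Access Points; upgrade to V2.
      return 2;
    }
    return configured;
  }
}
```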
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r692873679

## File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

```
@@ -822,10 +854,21 @@ private void bindAWSClient(URI name, boolean dtEnabled) throws IOException {
         S3_CLIENT_FACTORY_IMPL, DEFAULT_S3_CLIENT_FACTORY_IMPL,
         S3ClientFactory.class);
+    // If there's no endpoint set, then use the default for bucket or AccessPoint. Overriding is
+    // useful when using FIPS or DualStack S3 endpoints.
+    String endpoint = conf.getTrimmed(ENDPOINT, "");
+    if (endpoint.isEmpty()) {
```

Review comment: So I thought about this a bit more and decided to always use the access point endpoint instead of the above logic. Access points should now work even if you have set a custom endpoint to something different.
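The endpoint selection under discussion in the hunk above — use a custom `fs.s3a.endpoint` if one is set, otherwise derive the regional Access Point endpoint from the ARN — can be sketched as below. Note the comment says the final PR always uses the Access Point endpoint; this sketch shows the earlier "only when unset" logic for illustration, and the names are assumptions.

```java
// Illustrative sketch of the "default endpoint when unset" logic discussed above.
public class EndpointChooser {

  public static String chooseEndpoint(String configured, String accessPointRegion) {
    if (configured != null && !configured.isEmpty()) {
      return configured; // explicit setting (e.g. FIPS or DualStack endpoint) wins
    }
    if (accessPointRegion != null) {
      // Regional access-point endpoint derived from the ARN's region.
      return "s3-accesspoint." + accessPointRegion + ".amazonaws.com";
    }
    return "s3.amazonaws.com"; // plain bucket default
  }
}
```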
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r684168185

## File path: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/S3ATestUtils.java

```
@@ -1507,4 +1507,13 @@ public static void skipIfKmsKeyIdIsNotSet(Configuration configuration) {
     }
   }

+  /**
+   * Skip if a test doesn't use CSE.
+   */
+  public static void skipIfCSEIsNotEnabled(Configuration configuration) {
+    String encryption = configuration.get(Constants.SERVER_SIDE_ENCRYPTION_ALGORITHM, "");
+    if (!encryption.equals(S3AEncryptionMethods.CSE_KMS.getMethod())) {
```

Review comment: Sorry, I don't understand. Why would I change this method to skip if CSE *is* enabled? And why change the places where it's called? (Even though I added it, you have the better context, so I'm trying to understand.)
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r684165520

## File path: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AEncryptionSSEKMSUserDefinedKey.java

```
@@ -39,12 +41,10 @@ protected Configuration createConfiguration() {
     // get the KMS key for this test.
     Configuration c = new Configuration();
     String kmsKey = c.get(SERVER_SIDE_ENCRYPTION_KEY);
-    if (StringUtils.isBlank(kmsKey) || !c.get(SERVER_SIDE_ENCRYPTION_ALGORITHM)
-        .equals(S3AEncryptionMethods.CSE_KMS.name())) {
-      skip(SERVER_SIDE_ENCRYPTION_KEY + " is not set for " +
-          SSE_KMS.getMethod() + " or CSE-KMS algorithm is used instead of "
-          + "SSE-KMS");
-    }
+
+    skipIfKmsKeyIdIsNotSet(c);
+    skipIfCSEIsNotEnabled(c);
```

Review comment: Updated
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r684164823

## File path: hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/auth/ITestCustomSigner.java

```
@@ -214,6 +220,31 @@ public void sign(SignableRequest request, AWSCredentials credentials) {
     }
   }

+  private String parseBucketFromHost(String host) {
+    // host: {bucket || accesspoint}.{s3 || s3-accesspoint}.{region}.amazonaws.com
+    String[] hostBits = host.split("\\.");
+    String bucketName = hostBits[0];
+    String service = hostBits[1];
+
+    if (service.contains("s3-accesspoint") || service.contains("s3-outposts") ||
+        service.contains("s3-object-lambda")) {
+      // If AccessPoint then bucketName is of format `accessPoint-accountId`;
+      String[] accessPointBits = hostBits[0].split("\\-");
```

Review comment: Good catch
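The "good catch" above presumably concerns splitting the host's first label on `-`: that label has the shape `{accessPointName}-{accountId}`, and access point names may themselves contain hyphens (e.g. `finance-team-access`), so taking the first split fragment drops part of the name. A safer sketch splits on the *last* hyphen, since the 12-digit account id never contains one. This is a hypothetical helper, not the PR's code.

```java
// Sketch: extract the access point name from the first host label of
// {accessPointName}-{accountId}.s3-accesspoint.{region}.amazonaws.com
public class HostParser {

  public static String accessPointName(String hostLabel) {
    // Split at the LAST hyphen so hyphenated access point names survive.
    int idx = hostLabel.lastIndexOf('-');
    return idx < 0 ? hostLabel : hostLabel.substring(0, idx);
  }
}
```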
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r684160558

## File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

```
@@ -822,10 +854,21 @@ private void bindAWSClient(URI name, boolean dtEnabled) throws IOException {
+    // If there's no endpoint set, then use the default for bucket or AccessPoint. Overriding is
+    // useful when using FIPS or DualStack S3 endpoints.
+    String endpoint = conf.getTrimmed(ENDPOINT, "");
+    if (endpoint.isEmpty()) {
```

Review comment: No, what I initially intended is to say "if you're not setting the endpoint then I'll provide a default Access Point endpoint". This is because I don't know what endpoint the user wants to point it to, which is also why your tests fail when you set the endpoint to `ap-south-1`. I'm open to adding another `fs.s3a.accesspoint-endpoint` configuration if it's better to provide an option to override only the access point endpoint.
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r684147891

## File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md

@@ -1576,6 +1576,81 @@ Why explicitly declare a bucket bound to the central endpoint? It ensures that if the default endpoint is changed to a new region, data store in US-east is still reachable.

+## Configuring S3 AccessPoints usage with S3a
+S3a now supports [S3 Access Point](https://aws.amazon.com/s3/features/access-points/) usage which
+improves VPC integration with S3 and simplifies your data's permission model because different
+policies can be applied now on the Access Point level. For more information about why to use them
+make sure to read the official documentation.
+
+Accessing data through an access point, is done by using its ARN, as opposed to just the bucket name.
+You can set the Access Point ARN property using the following configuration property:
+```xml
+<property>
+  <name>fs.s3a.accesspoint.arn</name>
+  <value>{ACCESSPOINT_ARN_HERE}</value>
+  <description>Configure S3a traffic to use this AccessPoint</description>
+</property>
+```
+
+Be mindful that this configures **all access** to S3a, and in turn S3, to go through that ARN.
+So for example `s3a://yourbucket/key` will now use your configured ARN when getting data from S3
+instead of your bucket. The flip side to this is that if you're working with multiple buckets
+`s3a://yourbucket` and `s3a://yourotherbucket` both of their requests will go through the same
+Access Point ARN. To configure different Access Point ARNs, per bucket overrides can be used with
+access point names instead of bucket names as such:
+
+- Let's assume you have an existing workflow with the following paths `s3a://data-bucket`,
+`s3a://output-bucket` and you want to work with a new Access Point called `finance-accesspoint`. All
+you would then need to add is the following per bucket configuration change:
+```xml
+<property>
+  <name>fs.s3a.bucket.finance-accesspoint.accesspoint.arn</name>
+  <value>arn:aws:s3:eu-west-1:123456789101:accesspoint/finance-accesspoint</value>
+</property>
+```
+
+While keeping the global `accesspoint.arn` property set to empty `" "` which is the default.

Review comment: Yeah, my fault for mixing it up. I thought the default was `" "`, not `""`, for properties.
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r684147618

## File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

```
@@ -400,6 +410,14 @@ public void initialize(URI name, Configuration originalConf)
     LOG.debug("Initializing S3AFileSystem for {}", bucket);
     // clone the configuration into one with propagated bucket options
     Configuration conf = propagateBucketOptions(originalConf, bucket);
+
+    String apArn = conf.getTrimmed(ACCESS_POINT_ARN, "");
```

Review comment: Yup, makes sense
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260: URL: https://github.com/apache/hadoop/pull/3260#discussion_r684147473

## File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md

+```xml
+<property>
+  <name>fs.s3a.bucket.finance-accesspoint.accesspoint.arn</name>

Review comment: Yes! That would be the desired outcome.
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260:
URL: https://github.com/apache/hadoop/pull/3260#discussion_r682739268

## File path: hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md

@@ -1576,6 +1576,81 @@ Why explicitly declare a bucket bound to the central endpoint? It ensures that if the default endpoint is changed to a new region, data stored in US-east is still reachable.

+## Configuring S3 AccessPoints usage with S3a
+S3a now supports [S3 Access Point](https://aws.amazon.com/s3/features/access-points/) usage, which
+improves VPC integration with S3 and simplifies your data's permission model, because different
+policies can now be applied at the Access Point level. For more information about why to use them,
+make sure to read the official documentation.
+
+Accessing data through an access point is done by using its ARN, as opposed to just the bucket name.
+You can set the Access Point ARN property using the following configuration property:
+```xml
+<property>
+  <name>fs.s3a.accesspoint.arn</name>
+  <value>{ACCESSPOINT_ARN_HERE}</value>
+  <description>Configure S3a traffic to use this AccessPoint</description>
+</property>
+```
+
+Be mindful that this configures **all access** to S3a, and in turn S3, to go through that ARN.
+So for example `s3a://yourbucket/key` will now use your configured ARN when getting data from S3
+instead of your bucket. The flip side to this is that if you're working with multiple buckets

Review comment: I like it. Will update PR
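The global `fs.s3a.accesspoint.arn` property discussed above interacts with S3A's general per-bucket configuration convention, in which `fs.s3a.<bucket>.<option>` takes precedence over the bare `fs.s3a.<option>` key (this is what `propagateBucketOptions` handles in the real code). A minimal sketch of that resolution order, using a hypothetical `resolve` helper rather than the actual Hadoop implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-bucket option resolution, assuming the documented S3A
// convention that fs.s3a.<bucket>.<option> overrides fs.s3a.<option>.
// Hypothetical helper, not the actual propagateBucketOptions code.
final class PerBucketConfigSketch {

  static String resolve(Map<String, String> conf, String bucket, String option) {
    // option is the suffix after "fs.s3a.", e.g. "accesspoint.arn"
    String perBucket = conf.get("fs.s3a." + bucket + "." + option);
    if (perBucket != null) {
      return perBucket;  // per-bucket value wins
    }
    return conf.get("fs.s3a." + option);  // fall back to the global value
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("fs.s3a.accesspoint.arn",
        "arn:aws:s3:us-east-1:111122223333:accesspoint/global-ap");
    conf.put("fs.s3a.sample-bucket.accesspoint.arn",
        "arn:aws:s3:eu-west-1:111122223333:accesspoint/sample-ap");
    // sample-bucket gets its own ARN; any other bucket falls back to the global one
    System.out.println(resolve(conf, "sample-bucket", "accesspoint.arn"));
    System.out.println(resolve(conf, "other-bucket", "accesspoint.arn"));
  }
}
```

This is why the "all access goes through one ARN" concern raised in the review can be addressed with per-bucket overrides.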
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260:
URL: https://github.com/apache/hadoop/pull/3260#discussion_r682757870

## File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

@@ -400,6 +410,14 @@ public void initialize(URI name, Configuration originalConf)
     LOG.debug("Initializing S3AFileSystem for {}", bucket);
     // clone the configuration into one with propagated bucket options
     Configuration conf = propagateBucketOptions(originalConf, bucket);
+
+    String apArn = conf.getTrimmed(ACCESS_POINT_ARN, "");
+    if (!apArn.isEmpty()) {
+      accessPoint = ArnResource.accessPointFromArn(apArn);
+      LOG.info("Using AccessPoint ARN \"{}\" for bucket {}", apArn, bucket);

Review comment: good point
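The `ArnResource.accessPointFromArn(apArn)` call in the diff above turns the configured string into its region/account/name components. As a rough illustration of what parsing an access point ARN of the form `arn:aws:s3:<region>:<account-id>:accesspoint/<name>` involves (the `AccessPointArnSketch` class below is hypothetical, not the actual `org.apache.hadoop.fs.s3a.ArnResource` implementation):

```java
// Hypothetical sketch of parsing an S3 Access Point ARN of the form
// arn:aws:s3:<region>:<account-id>:accesspoint/<name>.
// Not the actual org.apache.hadoop.fs.s3a.ArnResource code.
final class AccessPointArnSketch {

  final String region;
  final String accountId;
  final String name;

  private AccessPointArnSketch(String region, String accountId, String name) {
    this.region = region;
    this.accountId = accountId;
    this.name = name;
  }

  static AccessPointArnSketch fromArn(String arn) {
    // ARN components are colon-separated: arn:partition:service:region:account:resource
    String[] parts = arn.split(":", 6);
    if (parts.length != 6 || !parts[5].startsWith("accesspoint/")) {
      throw new IllegalArgumentException("Not an access point ARN: " + arn);
    }
    String name = parts[5].substring("accesspoint/".length());
    return new AccessPointArnSketch(parts[3], parts[4], name);
  }

  public static void main(String[] args) {
    AccessPointArnSketch ap = fromArn(
        "arn:aws:s3:eu-west-1:123456789012:accesspoint/finance-team-access");
    // prints: eu-west-1 123456789012 finance-team-access
    System.out.println(ap.region + " " + ap.accountId + " " + ap.name);
  }
}
```

Because the region is embedded in the ARN, an access point configuration also tells the client which region to talk to without a `getBucketLocation` call.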
[GitHub] [hadoop] bogthe commented on a change in pull request #3260: HADOOP-17198 Support S3 AccessPoint
bogthe commented on a change in pull request #3260:
URL: https://github.com/apache/hadoop/pull/3260#discussion_r682740179

## File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

@@ -2570,6 +2614,11 @@ protected S3ListResult continueListObjects(S3ListRequest request,
         OBJECT_CONTINUE_LIST_REQUEST, () -> {
           if (useListV1) {
+            if (accessPoint != null) {
+              // AccessPoints are not compatible with V1List
+              throw new InvalidRequestException("ListV1 is not supported by AccessPoints");

Review comment: Yep, good idea, upgrading it is!
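The check in the diff above fails fast when a caller asks for a legacy ListObjects (V1) request while an Access Point is configured, since Access Points only support the V2 listing API. A standalone sketch of that guard, with illustrative names (`checkListVersion` is hypothetical; the real code throws `InvalidRequestException` inside `continueListObjects`):

```java
import java.io.IOException;

// Sketch of the list-version guard: V1 list requests are rejected
// up front when an Access Point ARN is configured. Names here are
// illustrative, not the S3AFileSystem implementation.
final class ListVersionGuardSketch {

  static void checkListVersion(boolean useListV1, String accessPointArn)
      throws IOException {
    if (useListV1 && accessPointArn != null) {
      // AccessPoints are not compatible with V1 list requests
      throw new IOException("ListV1 is not supported by AccessPoints");
    }
  }

  public static void main(String[] args) throws IOException {
    // V2 listing with an access point is fine
    checkListVersion(false, "arn:aws:s3:eu-west-1:123456789012:accesspoint/ap");
    // V1 listing without an access point is fine
    checkListVersion(true, null);
    System.out.println("guards passed");
  }
}
```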