Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-10 Thread Jack Ye
Yes, the intention is to allow S3FileIO to be used to temporarily unblock
users who are using an S3-compatible storage service or framework and can
directly use it to make requests through the AWS S3 SDK. We have seen this
need arise repeatedly for MinIO, Dell EMC ECS, and GCS. But I think we should
always encourage people to contribute new FileIOs that can leverage the
native features of the storage service for optimized performance and native
configurations in areas like access control and encryption.

I will update the S3FileIO documentation to make this clear.

-Jack

On Fri, Dec 10, 2021 at 9:44 AM Ryan Blue  wrote:


Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-10 Thread Ryan Blue
I think there's some confusion here. The change doesn't make S3FileIO the
handler for gs URIs. All it does is allow gs URIs when you've configured
S3FileIO for your catalog. That's why #3656 is "remove S3 URI scheme
restrictions".

I think we do want to have a native GCSFileIO implementation. And the
proposal for updating ResolvingFileIO is to allow choosing the
implementation for URI schemes.

Ryan
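The per-scheme resolution Ryan describes can be sketched roughly as follows. This is a hypothetical shape only; the class and method names are illustrative and the actual ResolvingFileIO in Iceberg differs:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-scheme FileIO resolution.
// Names here are illustrative, not Iceberg's real API.
public class SchemeResolver {
    private final Map<String, String> schemeToImpl = new HashMap<>();
    private final String defaultImpl;

    public SchemeResolver(String defaultImpl) {
        this.defaultImpl = defaultImpl;
    }

    // Allow catalog configuration to override the implementation per scheme.
    public void register(String scheme, String implClass) {
        schemeToImpl.put(scheme, implClass);
    }

    // Pick an implementation class name based on the location's URI scheme.
    public String resolve(String location) {
        String scheme = URI.create(location).getScheme();
        return schemeToImpl.getOrDefault(scheme, defaultImpl);
    }

    public static void main(String[] args) {
        SchemeResolver resolver = new SchemeResolver("HadoopFileIO");
        resolver.register("s3", "S3FileIO");
        resolver.register("gs", "GCSFileIO");
        System.out.println(resolver.resolve("s3://bucket/path/file.parquet"));  // S3FileIO
        System.out.println(resolver.resolve("gs://bucket/path/file.parquet"));  // GCSFileIO
        System.out.println(resolver.resolve("hdfs://nn/path/file.parquet"));    // HadoopFileIO
    }
}
```

The open question in the thread is exactly the `register` step: whether the s3 -> S3FileIO and gs -> S3FileIO (or gs -> GCSFileIO) mappings should be user-configurable in catalog properties.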

On Fri, Dec 10, 2021 at 9:37 AM Daniel Weeks 
wrote:


Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-10 Thread Daniel Weeks
Hey Mayur and Laurent,

As an alternative to using S3FileIO to talk to GCS, I just posted a
native GCSFileIO
implementation <https://github.com/apache/iceberg/pull/3711> and would
really appreciate feedback. I'd prefer to go this route, which has a number
of advantages (like using gRPC eventually) and more native support for some
of the GCS features (like streaming transport).

It would be great if someone has a chance to try this out in a real Google
Cloud environment and help improve it.

-Dan

On Fri, Dec 3, 2021 at 7:48 AM Laurent Goujon  wrote:


Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-03 Thread Laurent Goujon
To be clear, the reasons for using S3FileIO over HadoopFileIO are totally
reasonable. My issue is with making gs:// an alias for s3://, which I don't
believe it is. Even assuming that GCS has an endpoint so one can use an S3
API to access data, you would need to configure this endpoint, and you
would need to create an S3 access key/secret pair (which is not the regular
mode of operations for GCS) in order to access the data. So personally, if I
wanted to access GCS data through the S3 endpoint, I would be
better off using an s3:// URL and configuring the endpoint in the properties
(although I have to say I didn't find any property for it, so does any
alternative S3 server need to provide a specific AWS S3 client to use with
S3FileIO?)

I also noticed that https:// is an alias for s3://, but again, isn't this
breaking expectations about what the URI is supposed to represent?
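For context, the kind of endpoint/credentials configuration Laurent is asking about would look something like the following as Spark catalog properties. The property names are assumptions based on Iceberg's AWS module (and the GCS HMAC interoperability endpoint); check the S3FileIO documentation for the exact keys in a given release:

```
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.my_catalog.s3.endpoint=https://storage.googleapis.com
spark.sql.catalog.my_catalog.s3.access-key-id=<hmac-access-key>
spark.sql.catalog.my_catalog.s3.secret-access-key=<hmac-secret>
```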

On Fri, Dec 3, 2021 at 6:44 AM Ryan Murray  wrote:


Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-03 Thread Ryan Murray
Echoing Laurent and Igor, I wonder what the consequence of adding the
'gs://' scheme to S3FileIO is if that scheme is already used by the Hadoop
GCS connector. Do we want to overload that scheme? I would almost think it
should be an s3:// scheme or so, right?

Best,
Ryan

On Fri, Dec 3, 2021 at 9:26 AM Mayur Srivastava <
[email protected]> wrote:


RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-03 Thread Mayur Srivastava
Jack, https://github.com/apache/iceberg/pull/3656 is enough for my use case 
(because we are creating our own S3Client).

Thanks,
Mayur

From: Igor Dvorzhak 
Sent: Thursday, December 2, 2021 8:12 PM
To: [email protected]
Subject: Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

As long as the proposed changes will not prevent Iceberg from using the GCS connector
(https://github.com/GoogleCloudDataproc/hadoop-connectors) via
HCFS/HadoopFileIO to access GCS, I think that it is OK to allow users to use
S3FileIO with GCS.

On Thu, Dec 2, 2021 at 3:15 PM Laurent Goujon <[email protected]> wrote:
What about credentials? Sure, GCS has an S3 compatibility mode, but the gs://
URI used by Hadoop is native GCS support with Google authentication mechanisms
(the GCS Hadoop filesystem is actually out of tree ->
https://github.com/GoogleCloudDataproc/hadoop-connectors)

Laurent

On Thu, Dec 2, 2021 at 3:05 PM Jack Ye <[email protected]> wrote:
Also https://github.com/apache/iceberg/pull/3658.

Please let me know if these are enough; we can discuss in the PRs. It would
also be great if there are users of systems like MinIO to confirm.

-Jack

On Thu, Dec 2, 2021 at 1:18 PM Mayur Srivastava <[email protected]> wrote:
Looks like Jack is already on the top of the problem 
(https://github.com/apache/iceberg/pull/3656). Thanks Jack!

From: Mayur Srivastava <[email protected]>
Sent: Thursday, December 2, 2021 4:16 PM
To: [email protected]
Subject: RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

There are three reasons why we want to use S3FileIO over HadoopFileIO:

1.  We want access to the S3Client in our service to support some special
handling of the auth. This is not possible with the HadoopFileIO because the
S3Client is not exposed.

2.  We would like to improve upon the S3FileIO in the future by introducing a
vectorized IO mechanism, and it makes it easier if we are already using
S3FileIO. I’ll post my thoughts about the vectorized IO in a later email in
the upcoming weeks.

3.  As Ryan mentioned earlier, we are seeing very high memory usage with the
HadoopFileIO in the case of highly concurrent commits. I reported that in
another thread.

To move forward:

Can we start by adding ‘gs’ to the S3URI’s valid prefixes?

One of Jack’s suggestions was to remove any scheme check from the S3URI. Given
we are building ResolvingFileIO, I think removing the scheme check in the
individual implementations is not a bad idea.

Either solution will work for us.

Thanks,
Mayur

From: Ryan Blue <[email protected]>
Sent: Thursday, December 2, 2021 11:37 AM
To: Iceberg Dev List <[email protected]>
Subject: Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

I think the advantage of S3FileIO over HadoopFileIO with s3a is that it
doesn't hit the memory consumption problem that Mayur posted to the list.
That's a fairly big advantage, so I think it's reasonable to try to support
this in 0.13.0.

It should be easy enough to add the gs scheme, and then we can figure out how
we want to handle ResolvingFileIO. Jack's plan seems reasonable to me, so I
guess we'll be adding scheme-to-implementation customization sooner than I
thought!

Ryan

On Thu, Dec 2, 2021 at 1:24 AM Piotr Findeisen <[email protected]> wrote:
Hi

I agree that endpoint, credentials, path-style access, etc. should be
configurable.
There are storages which are primarily used as "S3 compatible", and they need
these settings to make them work.
We've seen these being used to access MinIO, Ceph, and even S3 with some
gateway (I am light on details, sorry).
In all these cases, users seem to use s3:// URLs even if not talking to the
actual AWS S3 service.

If this is sufficient for GCS, we could create GCSFileIO, or GCSS3FileIO, just
by accepting the gs:// protocol and delegating to S3FileIO for now.
In the long term, I would recommend using the native GCS client though, or the
Hadoop file system implementation provided by Google.

BTW, Mayur, what is the advantage of using S3FileIO for Google storage vs
HadoopFileIO?

BR
PF




On Thu, Dec 2, 2021 at 1:30 AM Jack Ye <[email protected]> wrote:
And here is a proposal of what I think could be the best way to go for both 
worlds:
(1) remove URI restrictions in S3FileIO (or allow configuration of additional 
accepted schemes), and allow direct user configuration of endpoint, 
credentials, etc. to make S3 configuration simpler without the need to 
reconfigure the entire client.
(2) configure ResolvingFileIO to map s3 -> S3FileIO, gs -> S3FileIO, others -> 
HadoopFileIO
(3) for s3 and gs, ResolvingFileIO needs to develop the ability to initialize 
S3FileIO differently, and users should be able to configure them differently in 
catalog properties
(4) for users that n
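Jack's item (1) above — letting the S3 URI parser accept additional (or configurable) schemes — could be sketched as follows. This is illustrative only, not the actual org.apache.iceberg.aws.s3.S3URI code, and the default scheme set shown is an assumption:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of scheme-configurable S3 URI parsing;
// not the real Iceberg S3URI implementation.
public class FlexibleS3URI {
    private static final String SCHEME_DELIM = "://";
    private static final String PATH_DELIM = "/";

    private final String bucket;
    private final String key;

    public FlexibleS3URI(String location, Set<String> allowedSchemes) {
        // Split "scheme://rest" and validate against the configured scheme set.
        String[] schemeSplit = location.split(SCHEME_DELIM, 2);
        if (schemeSplit.length != 2 || !allowedSchemes.contains(schemeSplit[0])) {
            throw new IllegalArgumentException("Invalid or disallowed scheme: " + location);
        }
        // First path segment is the bucket; the remainder is the object key.
        String[] authoritySplit = schemeSplit[1].split(PATH_DELIM, 2);
        this.bucket = authoritySplit[0];
        this.key = authoritySplit.length > 1 ? authoritySplit[1] : "";
    }

    public String bucket() { return bucket; }
    public String key() { return key; }

    public static void main(String[] args) {
        // "gs" is accepted only because the allowed set is configurable.
        Set<String> schemes = new HashSet<>(Arrays.asList("s3", "s3a", "s3n", "gs"));
        FlexibleS3URI uri = new FlexibleS3URI("gs://my-bucket/warehouse/tbl/data.parquet", schemes);
        System.out.println(uri.bucket()); // my-bucket
        System.out.println(uri.key());    // warehouse/tbl/data.parquet
    }
}
```

Dropping the scheme check entirely (the other option Mayur mentions) amounts to skipping the `allowedSchemes.contains` test.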

Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-02 Thread Laurent Goujon
What about credentials? Sure, GCS has an S3 compatibility mode, but the
gs:// URI used by Hadoop is native GCS support with Google authentication
mechanisms (the GCS Hadoop filesystem is actually out of tree ->
https://github.com/GoogleCloudDataproc/hadoop-connectors)

Laurent

On Thu, Dec 2, 2021 at 3:05 PM Jack Ye  wrote:


Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-02 Thread Jack Ye
Also https://github.com/apache/iceberg/pull/3658.

Please let me know if these are enough; we can discuss in the PRs. It would
also be great if there are users of systems like MinIO to confirm.

-Jack

On Thu, Dec 2, 2021 at 1:18 PM Mayur Srivastava <
[email protected]> wrote:

> Looks like Jack is already on the top of the problem (
> https://github.com/apache/iceberg/pull/3656). Thanks Jack!
>
>
>
> *From:* Mayur Srivastava 
> *Sent:* Thursday, December 2, 2021 4:16 PM
> *To:* [email protected]
> *Subject:* RE: Supporting gs:// prefix in S3URI for Google Cloud S3
> Storage
>
>
>
> There are three reasons why we want to use S3FileIO over HadoopFileIO:
>
> 1.  We want access to the S3Client in our service so support some
> special handling of the auth. This is not possible with the HadoopFileIO
> because the S3Client is not exposed.
>
> 2.  We would like to improve upon the S3FileIO in the future, by
> introducing a vectorized IO mechanism and it makes is easier if we are
> already using S3FileIO. I’ll post my thoughts about the vectorized IO in a
> later email in upcoming weeks.
>
> 3.  As Ryan mentioned earlier, we are seeing very high memory usage
> with the HadoopFileIO in case of high concurrent commits. I reported that
> in another thread.
>
>
>
> To moving forward:
>
>
>
> Can we start by adding ‘gs’ to the S3URI’s valid prefixes?
>
>
>
> One of Jack’s suggestion was to remove any scheme check from the S3URI.
> Given we are building ResolvingFileIO, I think removing scheme check in the
> individual implementation is not a bad idea.
>
>
>
> Either solution will work for us.
>
>
>
> Thanks,
>
> Mayur
>
>
>
> *From:* Ryan Blue 
> *Sent:* Thursday, December 2, 2021 11:37 AM
> *To:* Iceberg Dev List 
> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3
> Storage
>
>
>
> I think the advantage of S3FileIO over HadoopFileIO with s3a is that it
> doesn't hit the memory consumption problem that Mayur posted to the list. That's a
> fairly big advantage so I think it's reasonable to try to support this in
> 0.13.0.
>
>
>
> It should be easy enough to add the gs scheme and then we can figure out
> how we want to handle ResolvingFileIO. Jack's plan seems reasonable to me,
> so I guess we'll be adding scheme-to-implementation customization sooner
> than I thought!
>
>
>
> Ryan
>
>
>
> On Thu, Dec 2, 2021 at 1:24 AM Piotr Findeisen 
> wrote:
>
> Hi
>
>
>
> I agree that endpoint, credentials, path style access etc. should be
> configurable.
>
> There are storages which are primarily used as "s3 compatible" and they
> need these settings to make them work.
>
> We've seen these being used to access MinIO, Ceph, and even S3 with some
> gateway (I am light on details, sorry).
>
> In all these cases, users seem to use s3:// URLs even if they are not
> talking to the actual AWS S3 service.
>
>
>
> If this is sufficient for GCS, we could create GCSFileIO, or GCSS3FileIO,
> just by accepting the gs:// protocol and delegating to S3FileIO for now.
>
> In the long term, I would recommend using the native GCS client though, or
> the Hadoop file system implementation provided by Google.
>
>
>
> BTW, Mayur what is the advantage of using S3FileIO for google storage
> vs HadoopFileIO?
>
>
>
> BR
>
> PF
>
>
>
>
>
>
>
>
>
> On Thu, Dec 2, 2021 at 1:30 AM Jack Ye  wrote:
>
> And here is a proposal of what I think could be the best way to go for
> both worlds:
>
> (1) remove URI restrictions in S3FileIO (or allow configuration of
> additional accepted schemes), and allow direct user configuration of
> endpoint, credentials, etc. to make S3 configuration simpler without the
> need to reconfigure the entire client.
>
> (2) configure ResolvingFileIO to map s3 -> S3FileIO, gs -> S3FileIO,
> others -> HadoopFileIO
>
> (3) for s3 and gs, ResolvingFileIO needs to develop the ability to
> initialize S3FileIO differently, and users should be able to configure them
> differently in catalog properties
>
> (4) for users that need special GCS unique features, a GCSFileIO could
> eventually be developed, and then people can choose to map gs -> GCSFileIO
> in ResolvingFileIO
>
>
>
> -Jack
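Jack's four-point plan is concrete enough to sketch. Below is a minimal, hypothetical illustration of the scheme mapping in point (2), in plain Java; the resolver class and method names are invented for illustration, and only the FileIO implementation class names are real Iceberg classes:

```java
import java.net.URI;
import java.util.Map;

// Hypothetical sketch of scheme-based FileIO resolution (point 2 above).
// Not the actual ResolvingFileIO; it only shows the mapping idea.
public class SchemeResolverSketch {
    // point (2): s3 and gs both map to S3FileIO for now
    private static final Map<String, String> SCHEME_TO_IMPL = Map.of(
        "s3", "org.apache.iceberg.aws.s3.S3FileIO",
        "gs", "org.apache.iceberg.aws.s3.S3FileIO");

    static String implFor(String location) {
        String scheme = URI.create(location).getScheme();
        // anything unmapped falls back to HadoopFileIO, as in the proposal
        return SCHEME_TO_IMPL.getOrDefault(scheme,
            "org.apache.iceberg.hadoop.HadoopFileIO");
    }
}
```

Point (3) would then let the catalog hand a different property set to each resolved implementation, which is where per-scheme endpoints and credentials would come in.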
>
>
>
>
>
> On Wed, Dec 1, 2021 at 4:14 PM Jack Ye  wrote:
>
> Thanks for the confirmation; this is as I expected. We had a similar case
> for Dell EMC ECS recently, where they published a version of their FileIO
> that works through S3FileIO (https://github.com/apache/iceberg/pull/2807)
> and the only thing needed was to override the endpoint, region, and credentials.


RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-02 Thread Mayur Srivastava

On Wed, Dec 1, 2021 at 4:14 PM Jack Ye wrote:
Thanks for the confirmation, this is as I expected. We had a similar case for 
Dell EMC ECS recently, where they published a version of their FileIO that 
works through S3FileIO (https://github.com/apache/iceberg/pull/2807) and the 
only thing needed was to override the endpoint, region and credentials. They 
also proposed some specialization because their object storage service is 
specialized with the Append operation when writing data. However, in the end 
they ended up just creating another FileIO 
(https://github.com/apache/iceberg/pull/3376) using their own SDK to better 
support the specialization.

I believe the recent addition of ResolvingFileIO was to support using multiple 
FileIOs and switch between them based on the file scheme. If we continue that 
path, it feels more reasonable to me that we will have specialized FileIOs for 
each implementation and allow them to evolve independently. Users will be able 
to set whatever specialized configurations for each implementation and take 
advantage of all of them.

On the other hand, if we can support using S3FileIO as the new standard FileIO 
that works with multiple storage providers, the advantages I see are:
(1) simple from the user's perspective because the least common denominator of 
all storages needed by many cloud storage service providers is S3. It's more 
work to configure and maintain multiple FileIOs.
(2) we can avoid the current check in ResolvingFileIO of the file scheme for 
each

Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-02 Thread Ryan Blue
port the
>>> feature. The concern is that we will end up like Hadoop that had to develop
>>> another sub-layer of FileSystem interface to accommodate different unique
>>> features of different storage providers when the specialized feature
>>> request comes, and at that time there is no difference from the dedicated
>>> FileIO + ResolvingFileIO architecture.
>>>
>>> I wonder what Daniel thinks about this since I believe he is more
>>> interested in multi-cloud support.
>>>
>>> -Jack
>>>
>>> On Wed, Dec 1, 2021 at 3:18 PM Mayur Srivastava <
>>> [email protected]> wrote:
>>>
>>>> Hi Jack, Daniel,
>>>>
>>>>
>>>>
>>>> We use several S3-compatible backends with Iceberg; these include S3,
>>>> GCS, and others. Currently, S3FileIO provides all the functionality we
>>>> need for Iceberg to talk to these backends. The way we create S3FileIO is via
>>>> the constructor and provide the S3Client as the constructor param; we do
>>>> not use the initialize(Map<String, String>) method in FileIO. Our custom
>>>> catalog accepts the FileIO object at creation time. To talk to GCS, we
>>>> create the S3Client with a few overrides (described below) and pass it to
>>>> S3FileIO. After that, the rest of the S3FileIO code works as is. The only
>>>> exception is that “gs” (used by GCS URIs) needs to be accepted as a valid
>>>> S3 prefix. This is the reason I sent the email.
>>>>
>>>>
>>>>
>>>> The reason why we want to use S3FileIO to talk to GCS is that S3FileIO
>>>> almost works out of the box and contains all the functionality needed to
>>>> talk to GCS. The only special requirements are the creation of the S3Client
>>>> and allowing the “gs” prefix in the URIs. Based on our early experiments and
>>>> benchmarks, S3FileIO provides all the functionality we need and performs
>>>> well, so we didn’t see a need to create a native GCS FileIO. Iceberg
>>>> operations that we need are create, drop, read and write objects from S3
>>>> and S3FileIO provides this functionality.
>>>>
>>>>
>>>>
>>>> We are managing ACLs (IAM in the case of GCS) at the bucket level and that
>>>> happens in our custom catalog. GCS has ACLs, but IAM is preferred. I’ve
>>>> not experimented with ACLs or encryption with S3FileIO and that is a good
>>>> question whether it works with GCS. But, if these features are not enabled
>>>> via default settings, S3FileIO works just fine with GCS.
>>>>
>>>>
>>>>
>>>> I think there is a case for supporting S3-compatible backends in
>>>> S3FileIO because a lot of the code is common. The question is whether we
>>>> can cleanly expose the common S3FileIO code to work with these backends and
>>>> separate out any specialization (if required), OR whether we want to have a
>>>> different FileIO implementation for each of the other S3-compatible
>>>> backends such as GCS? I’m eager to hear more from the community about this.
>>>> I’m happy to discuss and follow long-term design direction of the Iceberg
>>>> community.
>>>>
>>>>
>>>>
>>>> The S3Client for GCS is created as follows (currently the code is not
>>>> open source so I’m sharing the steps only):
>>>>
>>>> 1. Create S3ClientBuilder.
>>>>
>>>> 2. Set GCS endpoint URI and region.
>>>>
>>>> 3. Set a credentials provider that returns null. You can set
>>>> credentials here if you have static credentials.
>>>>
>>>> 4. Set ClientOverrideConfiguration with interceptors in the
>>>> overrideConfiguration(). The interceptors are used to set up the authorization
>>>> header in requests (setting projectId, auth tokens, etc.) and to do header
>>>> translation for requests and responses.
>>>>
>>>> 5. Build the S3Client.
>>>>
>>>> 6. Pass the S3Client to S3FileIO.
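Step 4 above carries most of the trick. Here is a pure-JDK sketch of the header-translation idea; in the real client this logic would live in an AWS SDK v2 ExecutionInterceptor registered through ClientOverrideConfiguration, but the header names and token handling below are placeholders, not Mayur's actual (closed-source) code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Sketch of an interceptor-style hook that rewrites request headers before
// they are sent to the GCS endpoint. The specific header names are
// illustrative assumptions, not taken from the thread.
class GcsHeaderInterceptor implements UnaryOperator<Map<String, String>> {
    private final String projectId;
    private final String token;

    GcsHeaderInterceptor(String projectId, String token) {
        this.projectId = projectId;
        this.token = token;
    }

    @Override
    public Map<String, String> apply(Map<String, String> headers) {
        Map<String, String> out = new HashMap<>(headers);
        out.put("Authorization", "Bearer " + token);  // replace SigV4 auth
        out.put("x-goog-project-id", projectId);      // GCS-specific header
        out.remove("x-amz-content-sha256");           // drop an AWS-only header
        return out;
    }
}
```

The same shape, implemented against the SDK's ExecutionInterceptor interface and attached via ClientOverrideConfiguration.builder().addExecutionInterceptor(...), would cover both the request and response translation Mayur describes.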
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Mayur
>>>>
>>>>
>>>>
>>>> *From:* Jack Ye 
>>>> *Sent:* Wednesday, December 1, 2021 1:16 PM
>>>> *To:* Iceberg Dev List 
>>>> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3
>>>> Storage
>>>>
>>>>
>>>>


Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Jack Ye
>> *From:* Jack Ye 
>> *Sent:* Wednesday, December 1, 2021 1:16 PM
>> *To:* Iceberg Dev List 
>> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3
>> Storage
>>
>>
>>
>> Hi Mayur,
>>
>>
>>
>> I know many object storage services have allowed communication using the
>> Amazon S3 client by implementing the same protocol, like recently the Dell
>> EMC ECS and Aliyun OSS. But ultimately there are functionality differences
>> that could be optimized with a native FileIO, and the 2 examples I listed
>> before both contributed their own FileIO implementations to Iceberg
>> recently. I would imagine some native S3 features like ACL or SSE would not
>> work for GCS, and some GCS features would not be supported in S3FileIO, so I
>> think a specific GCS FileIO would likely be better for GCS support in the
>> long term.
>>
>>
>>
>> Could you describe how you configure S3FileIO to talk to GCS? Do you need
>> to override the S3 endpoint or have any other configurations?
>>
>>
>>
>> And I am not an expert on GCS; do you see using S3FileIO for GCS as a
>> feasible long-term solution? Are there any GCS-specific features that you
>> might need that could not be done through S3FileIO, and how widely used are
>> those features?
>>
>>
>>
>> Best,
>>
>> Jack Ye
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Dec 1, 2021 at 8:50 AM Daniel Weeks 
>> wrote:
>>
>> The S3FileIO does use the AWS S3 V2 Client libraries and while there
>> appears to be some level of compatibility, it's not clear to me how far
>> that currently extends (some AWS features like encryption, IAM, etc. may
>> not have full support).
>>
>>
>>
>> I think it's great that there may be a path for more native GCS FileIO
>> support, but it might be a little early to rename the classes and expect
>> that everything will work cleanly.
>>
>>
>>
>> Thanks for pointing this out, Mayur.  It's really an interesting
>> development.
>>
>>
>>
>> -Dan
>>
>>
>>
>> On Wed, Dec 1, 2021 at 8:12 AM Piotr Findeisen 
>> wrote:
>>
>> if S3FileIO is supposed to be used with other file systems, we should
>> consider proper class renames.
>>
>> just my 2c
>>
>>
>>
>> On Wed, Dec 1, 2021 at 5:07 PM Mayur Srivastava <
>> [email protected]> wrote:
>>
>> Hi,
>>
>>
>>
>> We are using S3FileIO to talk to the GCS backend. GCS URIs are compatible
>> with the AWS S3 SDKs and if they are added to the list of supported
>> prefixes, they work with S3FileIO.
>>
>>
>>
>> Thanks,
>>
>> Mayur
>>
>>
>>
>> *From:* Piotr Findeisen 
>> *Sent:* Wednesday, December 1, 2021 10:58 AM
>> *To:* Iceberg Dev List 
>> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3
>> Storage
>>
>>
>>
>> Hi
>>
>>
>>
>> Just curious. S3URI seems AWS S3-specific. What would be the goal of
>> using S3URI with Google Cloud Storage URLs?
>>
>> What problem are we solving?
>>
>>
>>
>> PF
>>
>>
>>
>>
>>
>> On Wed, Dec 1, 2021 at 4:56 PM Russell Spitzer 
>> wrote:
>>
>> Sounds reasonable to me if they are compatible
>>
>>
>>
>> On Wed, Dec 1, 2021 at 8:27 AM Mayur Srivastava <
>> [email protected]> wrote:
>>
>> Hi,
>>
>>
>>
>> We have URIs starting with gs:// representing objects on GCS. Currently,
>> S3URI doesn’t support gs:// prefix (see
>> https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41).
> Is there an existing JIRA for supporting this? Any objections to adding “gs”
> to the list of S3 prefixes?
>>
>>
>>
>> Thanks,
>>
>> Mayur
>>
>>
>>
>>
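The change Mayur is asking for above is small. For reference, here is a hypothetical sketch of an S3URI-style parser whose accepted schemes include "gs"; this is not the actual org.apache.iceberg.aws.s3.S3URI class, only an illustration of relaxing the scheme check:

```java
import java.util.Set;

// Illustrative parser: same bucket/key split an S3 URI class performs,
// with "gs" added to the accepted schemes. Names are invented for the sketch.
class S3UriSketch {
    private static final String SCHEME_DELIM = "://";
    private static final Set<String> VALID_SCHEMES = Set.of("s3", "s3a", "s3n", "gs");

    final String scheme;
    final String bucket;
    final String key;

    S3UriSketch(String location) {
        int schemeSplit = location.indexOf(SCHEME_DELIM);
        if (schemeSplit < 0) {
            throw new IllegalArgumentException("Invalid S3 URI: " + location);
        }
        this.scheme = location.substring(0, schemeSplit);
        if (!VALID_SCHEMES.contains(scheme)) {
            throw new IllegalArgumentException("Invalid scheme: " + scheme);
        }
        String path = location.substring(schemeSplit + SCHEME_DELIM.length());
        int bucketSplit = path.indexOf('/');
        this.bucket = bucketSplit < 0 ? path : path.substring(0, bucketSplit);
        this.key = bucketSplit < 0 ? "" : path.substring(bucketSplit + 1);
    }
}
```

Jack's alternative of dropping the scheme check entirely would simply delete the VALID_SCHEMES test and leave validation to the resolving layer.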



RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Mayur Srivastava
Hi Jack, Daniel,

We use several S3-compatible backends with Iceberg, these include S3, GCS, and 
others. Currently, S3FileIO provides us all the functionality we need Iceberg 
to talk to these backends. The way we create S3FileIO is via the constructor 
and provide the S3Client as the constructor param; we do not use the 
initialize(Map) method in FileIO. Our custom catalog accepts the 
FileIO object at creation time. To talk to GCS, we create the S3Client with a 
few overrides (described below) and pass it to S3FileIO. After that, the rest 
of the S3FileIO code works as is. The only exception is that “gs” (used by GCS 
URIs) needs to be accepted as a valid S3 prefix. This is the reason I sent the 
email.

The reason why we want to use S3FileIO to talk to GCS is that S3FileIO almost 
works out of the box and contains all the functionality needed to talk to GCS. 
The only special requirement is the creation of the S3Client and allow “gs” 
prefix in the URIs. Based on our early experiments and benchmarks, S3FileIO 
provides all the functionality we need and performs well, so we didn’t see a 
need to create a native GCS FileIO. Iceberg operations that we need are create, 
drop, read and write objects from S3 and S3FileIO provides this functionality.

We are managing ACLs (IAM in case of GCS) at the bucket level and that happens 
in our custom catalog. GCS has ACLs but IAMs are preferred. I’ve not 
experimented with ACLs or encryption with S3FileIO and that is a good question 
whether it works with GCS. But, if these features are not enabled via default 
settings, S3FileIO works just fine with GCS.

I think there is a case for supporting S3-compatible backends in S3FileIO 
because a lot of the code is common. The question is whether we can cleanly 
expose the common S3FileIO code to work with these backends and separate out 
any specialization (if required), or whether we want a different FileIO 
implementation for each S3-compatible backend such as GCS. I’m eager to hear 
more from the community about this, and happy to discuss and follow the 
long-term design direction of the Iceberg community.

The S3Client for GCS is created as follows (currently the code is not open 
source so I’m sharing the steps only):
1. Create S3ClientBuilder.
2. Set GCS endpoint URI and region.
3. Set a credentials provider that returns null. You can set credentials here 
if you have static credentials.
4. Set a ClientOverrideConfiguration with interceptors via 
overrideConfiguration(). The interceptors set up the authorization header on 
requests (projectId, auth tokens, etc.) and translate headers on requests and 
responses.
5. Build the S3Client.
6. Pass the S3Client to S3FileIO.
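
The steps above can be sketched roughly as follows. This is an illustrative
sketch, not the code Mayur described (which is not open source): the endpoint,
region, token handling, and the inline interceptor body are all assumptions,
and fetchGcsToken() is a hypothetical placeholder for obtaining a GCS OAuth
token. It assumes the AWS SDK for Java v2 and the Iceberg AWS module on the
classpath.

```java
import java.net.URI;

import org.apache.iceberg.aws.s3.S3FileIO;

import software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.interceptor.Context;
import software.amazon.awssdk.core.interceptor.ExecutionAttributes;
import software.amazon.awssdk.core.interceptor.ExecutionInterceptor;
import software.amazon.awssdk.http.SdkHttpRequest;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class GcsBackedS3FileIO {

  public static S3FileIO create() {
    // Steps 1-3: build an S3Client pointed at the GCS endpoint, with a
    // region and a no-op credentials provider (credentials are injected
    // by the interceptor below instead).
    S3Client s3 = S3Client.builder()
        .endpointOverride(URI.create("https://storage.googleapis.com"))
        .region(Region.US_EAST_1)
        .credentialsProvider(AnonymousCredentialsProvider.create())
        // Step 4: an interceptor that rewrites the Authorization header;
        // a real implementation would also translate S3 <-> GCS headers.
        .overrideConfiguration(ClientOverrideConfiguration.builder()
            .addExecutionInterceptor(new ExecutionInterceptor() {
              @Override
              public SdkHttpRequest modifyHttpRequest(
                  Context.ModifyHttpRequest context,
                  ExecutionAttributes attributes) {
                return context.httpRequest().toBuilder()
                    .putHeader("Authorization", "Bearer " + fetchGcsToken())
                    .build();
              }
            })
            .build())
        .build();  // Step 5

    // Step 6: hand the prebuilt client to S3FileIO via a client supplier.
    return new S3FileIO(() -> s3);
  }

  private static String fetchGcsToken() {
    // Placeholder: obtain an OAuth2 access token for GCS here.
    return "";
  }
}
```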

Thanks,
Mayur

Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Jack Ye
Hi Mayur,

I know many object storage services allow communication using the
Amazon S3 client by implementing the same protocol, like recently Dell
EMC ECS and Aliyun OSS. But ultimately there are functionality differences
that could be optimized with a native FileIO, and the two examples I listed
before both contributed their own FileIO implementations to Iceberg
recently. I would expect some native S3 features like ACL or SSE not to
work for GCS, and some GCS features not to be supported in S3FileIO, so I
think a specific GCS FileIO would likely be better for GCS support in the
long term.

Could you describe how you configure S3FileIO to talk to GCS? Do you need
to override the S3 endpoint or have any other configurations?

And since I am not an expert on GCS: do you see using S3FileIO for GCS as a
feasible long-term solution? Are there any GCS-specific features that you
might need that could not be provided through S3FileIO, and how widely used
are those features?

Best,
Jack Ye





Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Daniel Weeks
The S3FileIO does use the AWS S3 V2 Client libraries and while there
appears to be some level of compatibility, it's not clear to me how far
that currently extends (some AWS features like encryption, IAM, etc. may
not have full support).

I think it's great that there may be a path for more native GCS FileIO
support, but it might be a little early to rename the classes and expect
that everything will work cleanly.

Thanks for pointing this out, Mayur.  It's really an interesting
development.

-Dan



Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Piotr Findeisen
if S3FileIO is supposed to be used with other file systems, we should
consider proper class renames.
just my 2c



RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Mayur Srivastava
Hi,

We are using S3FileIO to talk to the GCS backend. GCS URIs are compatible with 
the AWS S3 SDKs, and if they are added to the list of supported prefixes, they 
work with S3FileIO.
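
The change being asked for amounts to admitting "gs" in a scheme allow-list
like the one in the S3URI class linked in this thread. The following is an
illustrative, self-contained sketch of that kind of check; the class name
SchemeCheck and the exact set of schemes are assumptions, not the actual
Iceberg source.

```java
import java.util.Locale;
import java.util.Set;

// Illustrative sketch of a URI-scheme allow-list check, with "gs"
// admitted alongside the S3 schemes. Not the actual S3URI code.
public class SchemeCheck {
  private static final Set<String> VALID_SCHEMES =
      Set.of("https", "s3", "s3a", "s3n", "gs");

  public static boolean isValid(String location) {
    int idx = location.indexOf("://");
    if (idx <= 0) {
      return false;  // no scheme separator found
    }
    String scheme = location.substring(0, idx).toLowerCase(Locale.ROOT);
    return VALID_SCHEMES.contains(scheme);
  }

  public static void main(String[] args) {
    System.out.println(isValid("gs://bucket/path/file.parquet"));  // true
    System.out.println(isValid("wasb://bucket/path"));             // false
  }
}
```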

Thanks,
Mayur




Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Piotr Findeisen
Hi

Just curious: S3URI seems AWS S3-specific. What would be the goal of using
S3URI with Google Cloud Storage URLs?
What problem are we solving?

PF




Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Russell Spitzer
Sounds reasonable to me if they are compatible

On Wed, Dec 1, 2021 at 8:27 AM Mayur Srivastava <
[email protected]> wrote:

> Hi,
>
>
>
> We have URIs starting with gs:// representing objects on GCS. Currently,
> S3URI doesn’t support gs:// prefix (see
> https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41).
> Is there an existing JIRA for supporting this? Any objections to add “gs”
> to the list of S3 prefixes?
>
>
>
> Thanks,
>
> Mayur
>
>
>