Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
I don't really understand how Iceberg and the Hadoop libraries can coexist in a
deployment.

The latest Spark (3.5.1) base image contains hadoop-client*-3.3.4.jar. The
AWS SDK v2 is only supported in hadoop*-3.4.0.jar and onward, yet the Iceberg
AWS integration states that the AWS SDK v2 is required
<https://iceberg.apache.org/docs/latest/aws/>.

Does anyone have a working combination of PySpark, Iceberg and Hadoop? Or is
there an alternative way to use PySpark to
spark.read.parquet("s3a:///.parquet") such that I don't need the
Hadoop dependencies?

Kind regards,
Dan
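
One combination that may be worth trying, offered as a sketch rather than a
verified matrix: keep hadoop-aws on the same version as the hadoop-client jars
in the Spark image (3.3.4 for the 3.5.1 image mentioned above) and let
iceberg-aws-bundle supply the AWS SDK v2 for Iceberg's own S3 access. The v1
and v2 SDKs live in different Java packages (com.amazonaws and
software.amazon.awssdk), so both can normally sit on one classpath. The helper
name build_packages and the Iceberg version 1.5.0 are assumptions for
illustration and do not come from the thread.

```python
def build_packages(spark_minor, scala, hadoop, iceberg):
    """Return a value for spark.jars.packages with hadoop-aws pinned to `hadoop`."""
    coords = [
        # Should match the hadoop-client*.jar version already in the Spark image.
        f"org.apache.hadoop:hadoop-aws:{hadoop}",
        f"org.apache.iceberg:iceberg-spark-runtime-{spark_minor}_{scala}:{iceberg}",
        # Bundles the AWS SDK v2 classes that Iceberg's AWS integration needs.
        f"org.apache.iceberg:iceberg-aws-bundle:{iceberg}",
    ]
    return ",".join(coords)


# Assumed versions: Hadoop 3.3.4 from the Spark 3.5.1 image, Iceberg 1.5.0.
packages = build_packages("3.5", "2.12", hadoop="3.3.4", iceberg="1.5.0")
print(packages)
```

The resulting string can be passed as spark.jars.packages when building the
session, for example SparkSession.builder.config("spark.jars.packages",
packages). hadoop-aws 3.3.x should then bring in the v1 SDK bundle it was built
against as a transitive dependency.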


Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
Swapping out the iceberg-aws-bundle for the very latest AWS-provided SDK
('software.amazon.awssdk:bundle:2.25.23') produces an incompatibility from a
slightly different code path:

java.lang.NoSuchMethodError: 'void org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(java.util.concurrent.ExecutorService, int, boolean, org.apache.hadoop.fs.statistics.DurationTrackerFactory)'
at org.apache.hadoop.fs.s3a.S3AFileSystem.executeOpen(S3AFileSystem.java:1767)
at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1717)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:774)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:658)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:53)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:429)
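
A NoSuchMethodError raised from inside S3AFileSystem usually points at
hadoop-aws and hadoop-common (or the hadoop-client jars in the Spark image)
being at different versions, since hadoop-aws is built against one specific
Hadoop release. A minimal sketch for spotting that from a list of jar file
names follows. The helper name hadoop_versions and the sample jar list are
hypothetical.

```python
import re
from collections import defaultdict

# Matches names such as hadoop-aws-3.4.0.jar or hadoop-client-api-3.3.4.jar.
_HADOOP_JAR = re.compile(r"^(hadoop-[a-z-]+?)-(\d+\.\d+\.\d+)\.jar$")


def hadoop_versions(jar_names):
    """Group hadoop-* jar names by the version embedded in the file name."""
    by_version = defaultdict(list)
    for name in jar_names:
        match = _HADOOP_JAR.match(name)
        if match:
            by_version[match.group(2)].append(match.group(1))
    return dict(by_version)


# Hypothetical contents of a Spark jars directory.
jars = [
    "hadoop-client-api-3.3.4.jar",
    "hadoop-client-runtime-3.3.4.jar",
    "hadoop-aws-3.4.0.jar",
    "iceberg-aws-bundle-1.4.1.jar",
]
found = hadoop_versions(jars)
mixed = len(found) > 1  # True when more than one Hadoop version is present
print(found, mixed)
```

If mixed is true, aligning hadoop-aws with the version of the other Hadoop jars
is a reasonable first thing to try.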





T. Rowe Price International Ltd (registered number 3957748) is registered in 
England and Wales with its registered office at Warwick Court, 5 Paternoster 
Square, London EC4M 7DX. T. Rowe Price International Ltd is authorised and 
regulated by the Financial Conduct Authority. The company has a branch in Dubai 
International Financial Centre (regulated by the DFSA as a Representative 
Office).

T. Rowe Price (including T. Rowe Price International Ltd and its affiliates) 
and its associates do not provide legal or tax advice. Any tax-related 
discussion contained in this e-mail, including any attachments, is not intended 
or written to be used, and cannot be used, for the purpose of (i) avoiding any 
tax penalties or (ii) promoting, marketing, or recommending to any other party 
any transaction or matter addressed herein. Please consult your independent 
legal counsel and/or professional tax advisor regarding any legal or tax issues 
raised in this e-mail.

The contents of this e-mail and any attachments are intended solely for the use 
of the named addressee(s) and may contain confidential and/or privileged 
information. Any unauthorized use, copying, disclosure, or distribution of the 
contents of this e-mail is strictly prohibited by the sender and may be 
unlawful. If you are not the intended recipient, please notify the sender 
immediately and delete this e-mail.


Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan

[sorry; replying all this time]

With hadoop-*-3.3.6 in place of 3.4.0 I get
java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException

I think that the iceberg-aws-bundle version listed in my original message
(1.4.1) supplies the v2 SDK.

Dan
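
com/amazonaws/AmazonClientException belongs to the AWS SDK v1, whose classes
live under com.amazonaws, while the v2 SDK uses software.amazon.awssdk. The
error therefore suggests that hadoop-aws 3.3.6 is looking for the v1 SDK and
only the v2 SDK is on the classpath. The small sketch below classifies a
missing class by SDK generation. The function name sdk_generation is
hypothetical.

```python
def sdk_generation(class_name):
    """Return 'v1', 'v2', or None for a JVM class name in dot or slash form."""
    dotted = class_name.replace("/", ".")
    if dotted.startswith("com.amazonaws."):
        return "v1"  # e.g. shipped in com.amazonaws:aws-java-sdk-bundle
    if dotted.startswith("software.amazon.awssdk."):
        return "v2"  # e.g. shipped in software.amazon.awssdk:bundle
    return None


missing = "com/amazonaws/AmazonClientException"
print(sdk_generation(missing))
```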





Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Aaron Grubb
Downgrade to hadoop-*:3.3.x. Hadoop 3.4.x is based on the AWS SDK v2 and should
probably be considered a breaking change for tools that build on Hadoop < 3.4.0
while using AWS.




[Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
Hi all,

I've struggled with this for quite some time.
My requirement is to read a Parquet file from S3 into a DataFrame and then
append it to an existing Iceberg table.

To read the Parquet file I need the hadoop-aws dependency for s3a://. To write
to Iceberg I need the Iceberg dependency. Both of these dependencies have a
transitive dependency on the AWS SDK. I can't find versions for Spark 3.4 that
work together.


Current Versions:
Spark 3.4.1
iceberg-spark-runtime-3.4_2.12:1.4.1
iceberg-aws-bundle:1.4.1
hadoop-aws:3.4.0
hadoop-common:3.4.0

I've tried a number of combinations of the above and their respective versions,
but they all fail on their assumptions about the AWS SDK version, with
class-not-found or method-not-found errors.

Is there a compatibility matrix somewhere that someone could point me to?

Thanks
Dan