Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
I don't really understand how Iceberg and the hadoop libraries can coexist in a 
deployment.

The latest spark (3.5.1) base image contains the hadoop-client*-3.3.4.jar. The 
AWS v2 SDK is only supported in hadoop*-3.4.0.jar and onward.
Iceberg AWS integration states AWS v2 SDK is 
required<https://iceberg.apache.org/docs/latest/aws/>

Does anyone have a working combination of pyspark, iceberg and hadoop? Or, is 
there an alternative way to use pyspark to 
spark.read.parquet("s3a:///.parquet") such that I don't need the 
hadoop dependencies?
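
For reference, the shape of thing I am trying to get working looks roughly like the
following. This is a PySpark sketch only; the version numbers, catalog name, table
name and path are my own guesses/placeholders, not a combination I have verified:

from pyspark.sql import SparkSession

# Sketch only: hadoop-aws stays on the same 3.3.x line as the hadoop-client jars
# shipped with Spark 3.5.1 (together with the v1 SDK bundle it was built against),
# while iceberg-aws-bundle supplies the v2 SDK for Iceberg's own S3FileIO, so the
# two SDKs serve different code paths. All versions below are assumptions.
spark = (
    SparkSession.builder
    .appName("parquet-to-iceberg")
    .config("spark.jars.packages", ",".join([
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0",
        "org.apache.iceberg:iceberg-aws-bundle:1.5.0",
        "org.apache.hadoop:hadoop-aws:3.3.4",
        "com.amazonaws:aws-java-sdk-bundle:1.12.262",
    ]))
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "glue")  # placeholder catalog choice
    .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/path/file.parquet")  # placeholder path
df.writeTo("my_catalog.db.my_table").append()                 # placeholder table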

Kind regards,
Dan

From: Oxlade, Dan 
Sent: 03 April 2024 15:49
To: Oxlade, Dan ; Aaron Grubb 
; user@spark.apache.org 
Subject: Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility 
matrix

Swapping out the iceberg-aws-bundle for the very latest aws provided sdk 
('software.amazon.awssdk:bundle:2.25.23') produces an incompatibility from a 
slightly different code path:

java.lang.NoSuchMethodError: 'void org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(java.util.concurrent.ExecutorService, int, boolean, org.apache.hadoop.fs.statistics.DurationTrackerFactory)'
at org.apache.hadoop.fs.s3a.S3AFileSystem.executeOpen(S3AFileSystem.java:1767)
at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1717)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:774)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:658)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:53)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:429)




From: Oxlade, Dan 
Sent: 03 April 2024 14:33
To: Aaron Grubb ; user@spark.apache.org 
Subject: Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility 
matrix


[sorry; replying all this time]

With hadoop-*-3.3.6 in place of the 3.4.0 below I get 
java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException

I think that the below iceberg-aws-bundle version supplies the v2 sdk.

Dan


From: Aaron Grubb 
Sent: 03 April 2024 13:52
To: user@spark.apache.org 
Subject: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility 
matrix

Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should 
probably be considered as breaking for tools that build on < 3.4.0 while using 
AWS.

From: Oxlade, Dan 
Sent: Wednesday, April 3, 2024 2:41:11 PM
To: user@spark.apache.org 
Subject: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix


Hi all,



I’ve struggled with this for quite some time.

My requirement is to read a parquet file from s3 to a Dataframe then append to 
an existing iceberg table.



In order to read the parquet I need the hadoop-aws dependency for s3a://. In
order to write to iceberg I need the iceberg dependency. Both of these
dependencies have a transitive dependency on the aws SDK. I can't find
versions for Spark 3.4 that work together.

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
Swapping out the iceberg-aws-bundle for the very latest aws provided sdk 
('software.amazon.awssdk:bundle:2.25.23') produces an incompatibility from a 
slightly different code path:

java.lang.NoSuchMethodError: 'void org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(java.util.concurrent.ExecutorService, int, boolean, org.apache.hadoop.fs.statistics.DurationTrackerFactory)'
at org.apache.hadoop.fs.s3a.S3AFileSystem.executeOpen(S3AFileSystem.java:1767)
at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1717)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
at 
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:69)
at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:774)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:658)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:53)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:429)




From: Oxlade, Dan 
Sent: 03 April 2024 14:33
To: Aaron Grubb ; user@spark.apache.org 
Subject: Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility 
matrix


[sorry; replying all this time]

With hadoop-*-3.3.6 in place of the 3.4.0 below I get 
java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException

I think that the below iceberg-aws-bundle version supplies the v2 sdk.

Dan


From: Aaron Grubb 
Sent: 03 April 2024 13:52
To: user@spark.apache.org 
Subject: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility 
matrix

Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should 
probably be considered as breaking for tools that build on < 3.4.0 while using 
AWS.

From: Oxlade, Dan 
Sent: Wednesday, April 3, 2024 2:41:11 PM
To: user@spark.apache.org 
Subject: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix


Hi all,



I’ve struggled with this for quite some time.

My requirement is to read a parquet file from s3 to a Dataframe then append to 
an existing iceberg table.



In order to read the parquet I need the hadoop-aws dependency for s3a:// . In 
order to write to iceberg I need the iceberg dependency. Both of these 
dependencies have a transitive dependency on the aws SDK. I can’t find versions 
for Spark 3.4 that work together.





Current Versions:

Spark 3.4.1

iceberg-spark-runtime-3.4-2.12:1.4.1

iceberg-aws-bundle:1.4.1

hadoop-aws:3.4.0

hadoop-common:3.4.0



I’ve tried a number of combinations of the above and their respective versions 
but all fall over with their assumptions on the aws sdk version with class not 
found exceptions or method not found etc.



Is there a compatibility matrix somewhere that someone could point me to?



Thanks

Dan

T. Rowe Price International Ltd (registered number 3957748) is registered in 
England and Wales with its registered office at Warwick Court, 5 Paternoster 
Square, London EC4M 7DX. T. Rowe Price International Ltd is authorised and 
regulated by the Financial Conduct Authority. The company has a branch in Dubai 
International Financial Centre (regulated by the DFSA as a Representative 
Office).

T. Rowe Price (including T. Rowe Price International Ltd and its affiliates) 
and its associates do not provide legal or tax advice. Any tax-related 
discussion contained in this e-mail, including any attachments, is not intended 
or written to be used, and cannot be used, for the purpose of (i) avoiding any 
tax penalties or (ii) promoting, marketing, or recommending to any other party 
any transaction or matter addressed herein. Please consult your independent 
legal counsel and/or professional tax advisor regarding any legal or tax issues 
raised in this e-mail.

The contents of this e-mail and any attachments are intended solely for the use 
of the named addressee(s) and may contain confidential and/or privileged 
information. Any unauthorized use, copying, disclosure, or distribution of the 
contents of this e-mail is strictly prohibited by the sender and may be 
unlawful. If you are not the intended recipient, please notify the sender 
immediately and delete this e-mail.


Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan

[sorry; replying all this time]

With hadoop-*-3.3.6 in place of the 3.4.0 below I get 
java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException

I think that the below iceberg-aws-bundle version supplies the v2 sdk.

Dan


From: Aaron Grubb 
Sent: 03 April 2024 13:52
To: user@spark.apache.org 
Subject: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility 
matrix

Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should 
probably be considered as breaking for tools that build on < 3.4.0 while using 
AWS.

From: Oxlade, Dan 
Sent: Wednesday, April 3, 2024 2:41:11 PM
To: user@spark.apache.org 
Subject: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix


Hi all,



I’ve struggled with this for quite some time.

My requirement is to read a parquet file from s3 to a Dataframe then append to 
an existing iceberg table.



In order to read the parquet I need the hadoop-aws dependency for s3a:// . In 
order to write to iceberg I need the iceberg dependency. Both of these 
dependencies have a transitive dependency on the aws SDK. I can’t find versions 
for Spark 3.4 that work together.





Current Versions:

Spark 3.4.1

iceberg-spark-runtime-3.4-2.12:1.4.1

iceberg-aws-bundle:1.4.1

hadoop-aws:3.4.0

hadoop-common:3.4.0



I’ve tried a number of combinations of the above and their respective versions 
but all fall over with their assumptions on the aws sdk version with class not 
found exceptions or method not found etc.



Is there a compatibility matrix somewhere that someone could point me to?



Thanks

Dan



Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Aaron Grubb
Downgrade to hadoop-*:3.3.x; Hadoop 3.4.x is based on the AWS SDK v2 and should
probably be considered breaking for tools that build on < 3.4.0 while using
AWS.
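
Concretely, for the Spark 3.4.1 stack in the original mail below, the downgrade I
have in mind looks something like this (a sketch only, with assumed versions, not a
set I have verified):

from pyspark.sql import SparkSession

# Sketch: pin hadoop-aws to the 3.3.x line matching the Hadoop client bundled with
# Spark 3.4.1, plus the v1 SDK bundle it was built against, and keep the Iceberg
# 1.4.1 artifacts from the original mail (iceberg-aws-bundle carries the v2 SDK
# that Iceberg itself needs).
spark = (
    SparkSession.builder
    .config("spark.jars.packages", ",".join([
        "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.1",
        "org.apache.iceberg:iceberg-aws-bundle:1.4.1",
        "org.apache.hadoop:hadoop-aws:3.3.4",
        "com.amazonaws:aws-java-sdk-bundle:1.12.262",
    ]))
    .getOrCreate()
)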

From: Oxlade, Dan 
Sent: Wednesday, April 3, 2024 2:41:11 PM
To: user@spark.apache.org 
Subject: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix


Hi all,



I’ve struggled with this for quite some time.

My requirement is to read a parquet file from s3 to a Dataframe then append to 
an existing iceberg table.



In order to read the parquet I need the hadoop-aws dependency for s3a:// . In 
order to write to iceberg I need the iceberg dependency. Both of these 
dependencies have a transitive dependency on the aws SDK. I can’t find versions 
for Spark 3.4 that work together.





Current Versions:

Spark 3.4.1

iceberg-spark-runtime-3.4-2.12:1.4.1

iceberg-aws-bundle:1.4.1

hadoop-aws:3.4.0

hadoop-common:3.4.0



I’ve tried a number of combinations of the above and their respective versions 
but all fall over with their assumptions on the aws sdk version with class not 
found exceptions or method not found etc.



Is there a compatibility matrix somewhere that someone could point me to?



Thanks

Dan



[Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
Hi all,

I've struggled with this for quite some time.
My requirement is to read a parquet file from s3 into a DataFrame and then
append it to an existing iceberg table.

In order to read the parquet I need the hadoop-aws dependency for s3a://. In
order to write to iceberg I need the iceberg dependency. Both of these
dependencies have a transitive dependency on the aws SDK. I can't find
versions for Spark 3.4 that work together.


Current Versions:
Spark 3.4.1
iceberg-spark-runtime-3.4-2.12:1.4.1
iceberg-aws-bundle:1.4.1
hadoop-aws:3.4.0
hadoop-common:3.4.0

I've tried a number of combinations of the above and their respective
versions, but they all fall over on their assumptions about the aws sdk
version, with class-not-found or method-not-found exceptions etc.

Is there a compatibility matrix somewhere that someone could point me to?

Thanks
Dan


Re: automatically/dinamically renew aws temporary token

2023-10-24 Thread Carlos Aguni
hi all,

thank you for your reply.

> Can’t you attach the cross account permission to the glue job role? Why
the detour via AssumeRole ?
Yes Jörn, I also believe this is the best approach, but here we're dealing
with company policies and all the bureaucracy that comes along.
In parallel I'm trying to argue for that path; by now even requesting an
increase in the session duration is a struggle.
But at the moment, since I was only allowed the AssumeRole approach, I'm
figuring out a way through this path.

> https://github.com/zillow/aws-custom-credential-provider
thank you Pol. I'll take a look into the project.

regards,c.

On Mon, Oct 23, 2023 at 7:03 AM Pol Santamaria  wrote:

> Hi Carlos!
>
> Take a look at this project, it's 6 years old but the approach is still
> valid:
>
> https://github.com/zillow/aws-custom-credential-provider
>
> The credential provider gets called each time an S3 or Glue Catalog is
> accessed, and then you can decide whether to use a cached token or renew.
>
> Best,
>
> *Pol Santamaria*
>
>
> On Mon, Oct 23, 2023 at 8:08 AM Jörn Franke  wrote:
>
>> Can’t you attach the cross account permission to the glue job role? Why
>> the detour via AssumeRole ?
>>
>> Assumerole can make sense if you use an AWS IAM user and STS
>> authentication, but this would make no sense within AWS for cross-account
>> access as attaching the permissions to the Glue job role is more secure (no
>> need for static credentials, automatically renew permissions in shorter
>> time without any specific configuration in Spark).
>>
>> Have you checked with AWS support?
>>
>> Am 22.10.2023 um 21:14 schrieb Carlos Aguni :
>>
>> 
>> hi all,
>>
>> i've a scenario where I need to assume a cross account role to have S3
>> bucket access.
>>
>> the problem is that this role only allows for 1h time span (no
>> negotiation).
>>
>> that said.
>> does anyone know a way to tell spark to automatically renew the token
>> or to dinamically renew the token on each node?
>> i'm currently using spark on AWS glue.
>>
>> wonder what options do I have.
>>
>> regards,c.
>>
>>


Re: automatically/dinamically renew aws temporary token

2023-10-23 Thread Pol Santamaria
Hi Carlos!

Take a look at this project, it's 6 years old but the approach is still
valid:

https://github.com/zillow/aws-custom-credential-provider

The credential provider gets called each time an S3 or Glue Catalog is
accessed, and then you can decide whether to use a cached token or renew.
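
If staying on the stock S3A connector is an option, Hadoop also ships a built-in
provider that is meant to assume a role and re-assume it as the STS session
expires. A rough sketch of wiring it up from PySpark (the role ARN and duration
are placeholders, and I have not checked how this interacts with Glue's own
credential handling):

from pyspark.sql import SparkSession

# Sketch only: S3A assumes the cross-account role itself and renews the temporary
# credentials as they expire; the values below are placeholders.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
    .config("spark.hadoop.fs.s3a.assumed.role.arn",
            "arn:aws:iam::123456789012:role/cross-account-role")
    .config("spark.hadoop.fs.s3a.assumed.role.session.duration", "1h")
    .getOrCreate()
)

df = spark.read.parquet("s3a://cross-account-bucket/some/prefix/")  # placeholder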

Best,

*Pol Santamaria*


On Mon, Oct 23, 2023 at 8:08 AM Jörn Franke  wrote:

> Can’t you attach the cross account permission to the glue job role? Why
> the detour via AssumeRole ?
>
> Assumerole can make sense if you use an AWS IAM user and STS
> authentication, but this would make no sense within AWS for cross-account
> access as attaching the permissions to the Glue job role is more secure (no
> need for static credentials, automatically renew permissions in shorter
> time without any specific configuration in Spark).
>
> Have you checked with AWS support?
>
> Am 22.10.2023 um 21:14 schrieb Carlos Aguni :
>
> 
> hi all,
>
> i've a scenario where I need to assume a cross account role to have S3
> bucket access.
>
> the problem is that this role only allows for 1h time span (no
> negotiation).
>
> that said.
> does anyone know a way to tell spark to automatically renew the token
> or to dinamically renew the token on each node?
> i'm currently using spark on AWS glue.
>
> wonder what options do I have.
>
> regards,c.
>
>


Re: automatically/dinamically renew aws temporary token

2023-10-23 Thread Jörn Franke
Can’t you attach the cross account permission to the glue job role? Why the 
detour via AssumeRole ?

Assumerole can make sense if you use an AWS IAM user and STS authentication, 
but this would make no sense within AWS for cross-account access as attaching 
the permissions to the Glue job role is more secure (no need for static 
credentials, automatically renew permissions in shorter time without any 
specific configuration in Spark).

Have you checked with AWS support?

> Am 22.10.2023 um 21:14 schrieb Carlos Aguni :
> 
> 
> hi all,
> 
> i've a scenario where I need to assume a cross account role to have S3 bucket 
> access.
> 
> the problem is that this role only allows for 1h time span (no negotiation).
> 
> that said.
> does anyone know a way to tell spark to automatically renew the token
> or to dinamically renew the token on each node?
> i'm currently using spark on AWS glue.
> 
> wonder what options do I have.
> 
> regards,c.


automatically/dinamically renew aws temporary token

2023-10-22 Thread Carlos Aguni
hi all,

I've a scenario where I need to assume a cross-account role to have S3 bucket
access.

The problem is that this role only allows for a 1h time span (no negotiation).

That said, does anyone know a way to tell spark to automatically renew the
token, or to dynamically renew the token on each node?
I'm currently using spark on AWS glue.

I wonder what options I have.

regards,c.


Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-14 Thread Mich Talebzadeh
OK, I managed to load the zipped Python package and the .py file to run onto s3
for AWS EKS to work.

It is a bit of a nightmare compared to the same on the Google SDK, which is simpler.

Anyhow you will require additional jar files to be added to
$SPARK_HOME/jars. These two files will be picked up after you build the
docker image and will be available to pods.


   1. hadoop-aws-3.2.0.jar
   2. aws-java-sdk-bundle-1.11.375.jar

Then build your docker image and push the image to ecr registry on AWS.

This will allow you to refer to both the zipped package and your source
file as

 spark-submit --verbose \
   --master k8s://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --py-files s3a://spark-on-k8s/codes/spark_on_eks.zip \
   s3a://1spark-on-k8s/codes/

Note that you refer to the bucket as *s3a* rather than *s3*.
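
As an aside, if you would rather not bake those two jars into the image, the same
artifacts can in principle be pulled at submit time via spark.jars.packages instead.
A sketch I have not run in this exact setup (coordinates match the jar versions
listed above):

from pyspark.sql import SparkSession

# Sketch only: resolves hadoop-aws and the matching v1 AWS SDK bundle at startup
# instead of baking them into the docker image.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.2.0,"
            "com.amazonaws:aws-java-sdk-bundle:1.11.375")
    .getOrCreate()
)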

Output from driver log

kubectl logs   -n spark

Started at
14/04/2023 15:08:11.11
starting at ID =  1 ,ending on =  100
root
 |-- ID: integer (nullable = false)
 |-- CLUSTERED: float (nullable = true)
 |-- SCATTERED: float (nullable = true)
 |-- RANDOMISED: float (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)
 |-- op_type: integer (nullable = false)
 |-- op_time: timestamp (nullable = false)

+---+---------+---------+----------+--------------------------------------------------+--------+-------+-------+-----------------------+
|ID |CLUSTERED|SCATTERED|RANDOMISED|RANDOM_STRING                                     |SMALL_VC|PADDING|op_type|op_time                |
+---+---------+---------+----------+--------------------------------------------------+--------+-------+-------+-----------------------+
|1  |0.0      |0.0      |17.0      |KZWeqhFWCEPyYngFbyBMWXaSCrUZoLgubbbPIayRnBUbHoWCFJ|1       |xx     |1      |2023-04-14 15:08:15.534|
|2  |0.01     |1.0      |7.0       |ffxkVZQtqMnMcLRkBOzZUGxICGrcbxDuyBHkJlpobluliGGxGR|2       |xx     |1      |2023-04-14 15:08:15.534|
|3  |0.02     |2.0      |30.0      |LIixMEOLeMaEqJomTEIJEzOjoOjHyVaQXekWLctXbrEMUyTYBz|3       |xx     |1      |2023-04-14 15:08:15.534|
|4  |0.03     |3.0      |30.0      |tgUzEjfebzJsZWdoHIxrXlgqnbPZqZrmktsOUxfMvQyGplpErf|4       |xx     |1      |2023-04-14 15:08:15.534|
|5  |0.04     |4.0      |79.0      |qVwYSVPHbDXpPdkhxEpyIgKpaUnArlXykWZeiNNCiiaanXnkks|5       |xx     |1      |2023-04-14 15:08:15.534|
|6  |0.05     |5.0      |73.0      |fFWqcajQLEWVxuXbrFZmUAIIRgmKJSZUqQZNRfBvfxZAZqCSgW|6       |xx     |1      |2023-04-14 15:08:15.534|
|7  |0.06     |6.0      |41.0      |jzPdeIgxLdGncfBAepfJBdKhoOOLdKLzdocJisAjIhKtJRlgLK|7       |xx     |1      |2023-04-14 15:08:15.534|
|8  |0.07     |7.0      |29.0      |xyimTcfipZGnzPbDFDyFKmzfFoWbSrHAEyUhQqgeyNygQdvpSf|8       |xx     |1      |2023-04-14 15:08:15.534|
|9  |0.08     |8.0      |59.0      |NxrilRavGDMfvJNScUykTCUBkkpdhiGLeXSyYVgsnRoUYAfXrn|9       |xx     |1      |2023-04-14 15:08:15.534|
|10 |0.09     |9.0      |73.0      |cBEKanDFrPZkcHFuepVxcAiMwyAsRqDlRtQxiDXpCNycLapimt|10      |xx     |1      |2023-04-14 15:08:15.534|
+---+---------+---------+----------+--------------------------------------------------+--------+-------+-------+-----------------------+
only showing top 10 rows

Finished at
14/04/2023 15:08:16.16

I will provide the details under the section *spark-on-aws* in
http://sparkcommunitytalk.slack.com/

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 12 Apr 2023 at 19:04, Mich Talebzadeh 
wrote:

> Thanks! I will have a look.
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 12 Apr 2023 at 18:26, Bjørn Jørgensen 
> wrote:
>
>> Yes, it looks inside the Docker container's folders. It will work if you
>> are using s3 or gs.
>>
>> On Wed, 12 Apr 2023 at 18:02, Mich Talebzadeh wrote:

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
Thanks! I will have a look.

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 12 Apr 2023 at 18:26, Bjørn Jørgensen 
wrote:

> Yes, it looks inside the Docker container's folders. It will work if you are
> using s3 or gs.
>
> On Wed, 12 Apr 2023 at 18:02, Mich Talebzadeh wrote:
>
>> Hi,
>>
>> In my spark-submit to eks cluster, I use the standard code to submit to
>> the cluster as below:
>>
>> spark-submit --verbose \
>>--master k8s://$KUBERNETES_MASTER_IP:443 \
>>--deploy-mode cluster \
>>--name sparkOnEks \
>>--py-files local://$CODE_DIRECTORY/spark_on_eks.zip \
>>   local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
>>
>> In Google Kubernetes Engine (GKE) I simply load them from gs:// storage
>> bucket.and it works fine.
>>
>> I am getting the following error in driver pod
>>
>>  + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
>> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
>> "$@")
>> + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
>> spark.driver.bindAddress=192.168.39.251 --deploy-mode client 
>> --properties-file /opt/spark/conf/spark.properties --class 
>> org.apache.spark.deploy.PythonRunner 
>> local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
>> 23/04/11 23:07:23 WARN NativeCodeLoader: Unable to load native-hadoop 
>> library for your platform... using builtin-java classes where applicable
>> /usr/bin/python3: can't open file 
>> '/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py': [Errno 
>> 2] No such file or directory
>> log4j:WARN No appenders could be found for logger 
>> (org.apache.spark.util.ShutdownHookManager).
>> It says  can't open file 
>> '/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py':
>>
>>
>> [Errno 2] No such file or directory but it is there!
>>
>> ls -l /home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
>> -rw-rw-rw- 1 hduser hadoop 5060 Mar 18 14:16 
>> /home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
>> So not sure what is going on. I have suspicion that it is looking inside the 
>> docker itself for this file?
>>
>>
>> Is that a correct assumption?
>>
>>
>> Thanks
>>
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Bjørn Jørgensen
Yes, it looks inside the Docker container's folders. It will work if you are
using s3 or gs.

On Wed, 12 Apr 2023 at 18:02, Mich Talebzadeh wrote:

> Hi,
>
> In my spark-submit to eks cluster, I use the standard code to submit to
> the cluster as below:
>
> spark-submit --verbose \
>--master k8s://$KUBERNETES_MASTER_IP:443 \
>--deploy-mode cluster \
>--name sparkOnEks \
>--py-files local://$CODE_DIRECTORY/spark_on_eks.zip \
>   local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
>
> In Google Kubernetes Engine (GKE) I simply load them from gs:// storage
> bucket.and it works fine.
>
> I am getting the following error in driver pod
>
>  + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
> "$@")
> + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress=192.168.39.251 --deploy-mode client 
> --properties-file /opt/spark/conf/spark.properties --class 
> org.apache.spark.deploy.PythonRunner 
> local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
> 23/04/11 23:07:23 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> /usr/bin/python3: can't open file 
> '/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py': [Errno 
> 2] No such file or directory
> log4j:WARN No appenders could be found for logger 
> (org.apache.spark.util.ShutdownHookManager).
> It says  can't open file 
> '/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py':
>
>
> [Errno 2] No such file or directory but it is there!
>
> ls -l /home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
> -rw-rw-rw- 1 hduser hadoop 5060 Mar 18 14:16 
> /home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
> So not sure what is going on. I have suspicion that it is looking inside the 
> docker itself for this file?
>
>
> Is that a correct assumption?
>
>
> Thanks
>
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
Hi,

In my spark-submit to eks cluster, I use the standard code to submit to the
cluster as below:

spark-submit --verbose \
   --master k8s://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name sparkOnEks \
   --py-files local://$CODE_DIRECTORY/spark_on_eks.zip \
  local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py

In Google Kubernetes Engine (GKE) I simply load them from gs:// storage
bucket.and it works fine.

I am getting the following error in driver pod

 + CMD=("$SPARK_HOME/bin/spark-submit" --conf
"spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode
client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf
spark.driver.bindAddress=192.168.39.251 --deploy-mode client
--properties-file /opt/spark/conf/spark.properties --class
org.apache.spark.deploy.PythonRunner
local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
23/04/11 23:07:23 WARN NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
/usr/bin/python3: can't open file
'/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py':
[Errno 2] No such file or directory
log4j:WARN No appenders could be found for logger
(org.apache.spark.util.ShutdownHookManager).
It says  can't open file
'/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py':


[Errno 2] No such file or directory but it is there!

ls -l /home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
-rw-rw-rw- 1 hduser hadoop 5060 Mar 18 14:16
/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
So I am not sure what is going on. I suspect that it is looking
inside the docker image itself for this file.


Is that a correct assumption?


Thanks


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: [Spark Structured Streaming] Do spark structured streaming is support sink to AWS Kinesis currently and how to handle if achieve quotas of kinesis?

2023-03-06 Thread Mich Talebzadeh
Spark Structured Streaming can write to anything as long as an appropriate
API or JDBC connection exists.

I have not tried Kinesis, but have you thought about how you want to write
it as a sink?

Those quota limitations, much like quotas set by the vendors (say Google on
BigQuery writes etc.), are defaults and can be negotiated with the vendor to
increase them.

What facts have you established so far?

HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 6 Mar 2023 at 04:20, hueiyuan su  wrote:

> *Component*: Spark Structured Streaming
> *Level*: Advanced
> *Scenario*: How-to
>
> 
> *Problems Description*
> 1. I currently would like to use pyspark structured streaming to
> write data to kinesis. But it seems like does not have corresponding
> connector can use. I would confirm whether have another method in addition
> to this solution
> <https://repost.aws/questions/QUP_OJomilTO6oIgvK00VHEA/writing-data-to-kinesis-stream-from-py-spark>
> 2. Because aws kinesis have quota limitation (like 1MB/s and 1000
> records/s), if spark structured streaming micro batch size too large, how
> can we handle this?
>
> --
> Best Regards,
>
> Mars Su
> *Phone*: 0988-661-013
> *Email*: hueiyua...@gmail.com
>


[Spark Structured Streaming] Do spark structured streaming is support sink to AWS Kinesis currently and how to handle if achieve quotas of kinesis?

2023-03-05 Thread hueiyuan su
*Component*: Spark Structured Streaming
*Level*: Advanced
*Scenario*: How-to


*Problems Description*
1. I currently would like to use pyspark structured streaming to write data
to kinesis, but it seems there is no corresponding connector I can use. I
would like to confirm whether there is another method in addition to this
solution
<https://repost.aws/questions/QUP_OJomilTO6oIgvK00VHEA/writing-data-to-kinesis-stream-from-py-spark>
2. Because aws kinesis has quota limitations (like 1MB/s and 1000
records/s), how can we handle this if the spark structured streaming
micro-batch size is too large?

-- 
Best Regards,

Mars Su
*Phone*: 0988-661-013
*Email*: hueiyua...@gmail.com


Re: [Spark Structured Streaming] Do spark structured streaming is support sink to AWS Kinesis currently?

2023-02-16 Thread Vikas Kumar
Doesn't directly answer your question but there are ways in scala and
pyspark - See if this helps:
https://repost.aws/questions/QUP_OJomilTO6oIgvK00VHEA/writing-data-to-kinesis-stream-from-py-spark
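
The usual pattern in those examples is a foreachBatch sink that calls the Kinesis
API directly. A rough PySpark sketch (df is the streaming DataFrame you already
have; the stream name, region, key field and checkpoint path are placeholders, and
the chunking below only respects the 500-records-per-PutRecords limit, not the
per-shard throughput limits):

import json
import boto3

def write_to_kinesis(batch_df, batch_id):
    # Runs per micro-batch on the driver; collect() is only reasonable for small batches.
    client = boto3.client("kinesis", region_name="us-east-1")  # placeholder region
    records = [
        {"Data": json.dumps(row.asDict(), default=str).encode("utf-8"),
         "PartitionKey": str(row["id"])}                       # placeholder key field
        for row in batch_df.collect()
    ]
    # PutRecords accepts at most 500 records per call; retry/backoff handling for
    # shard-level throttling still needs to be added.
    for i in range(0, len(records), 500):
        client.put_records(StreamName="my-stream", Records=records[i:i + 500])

query = (
    df.writeStream
      .foreachBatch(write_to_kinesis)
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/kinesis")  # placeholder
      .start()
)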

On Thu, Feb 16, 2023, 8:27 PM hueiyuan su  wrote:

> *Component*: Spark Structured Streaming
> *Level*: Advanced
> *Scenario*: How-to
>
> 
> *Problems Description*
> I would like to implement witeStream data to AWS Kinesis with Spark
> structured Streaming, but I do not find related connector jar can be used.
> I want to check whether fully support write stream to AWS Kinesis. If you
> have any ideas, please let me know. I will be appreciate it for your answer.
>
> --
> Best Regards,
>
> Mars Su
> *Phone*: 0988-661-013
> *Email*: hueiyua...@gmail.com
>


[Spark Structured Streaming] Do spark structured streaming is support sink to AWS Kinesis currently?

2023-02-16 Thread hueiyuan su
*Component*: Spark Structured Streaming
*Level*: Advanced
*Scenario*: How-to


*Problems Description*
I would like to implement writeStream to AWS Kinesis with Spark
Structured Streaming, but I cannot find a related connector jar that can be
used. I want to check whether writing a stream to AWS Kinesis is fully
supported. If you have any ideas, please let me know. I would appreciate
your answer.

-- 
Best Regards,

Mars Su
*Phone*: 0988-661-013
*Email*: hueiyua...@gmail.com


Re: Need help with the configuration for AWS glue jobs

2022-06-23 Thread Sid
Where can I find information on the size of the datasets supported by AWS
Glue? I didn't see it in the documentation.

Also, if I want to process TBs of data, e.g. 1TB, what should the ideal
EMR cluster configuration be?

Could you please guide me on this?

Thanks,
Sid.


On Thu, 23 Jun 2022, 23:44 Gourav Sengupta, 
wrote:

> Please use EMR, Glue is not made for heavy processing jobs.
>
> On Thu, Jun 23, 2022 at 6:36 AM Sid  wrote:
>
>> Hi Team,
>>
>> Could anyone help me in the below problem:
>>
>>
>> https://stackoverflow.com/questions/72724999/how-to-calculate-number-of-g-1-workers-in-aws-glue-for-processing-1tb-data
>>
>> Thanks,
>> Sid
>>
>


Re: Need help with the configuration for AWS glue jobs

2022-06-23 Thread Gourav Sengupta
Please use EMR, Glue is not made for heavy processing jobs.
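
If it helps, a back-of-envelope way to think about sizing (every number below is an
assumption to adjust, not AWS guidance, and it applies equally to Glue workers or
EMR executors):

# Rough sizing sketch for ~1 TB of input; all figures are assumptions.
input_gb = 1024                       # ~1 TB
partition_mb = 128                    # assumed target size per task/partition
tasks = input_gb * 1024 // partition_mb          # about 8192 tasks to schedule

workers = 50                          # candidate cluster size (assumed)
cores_per_worker = 4                  # concurrent tasks per worker (assumed)
parallel_tasks = workers * cores_per_worker      # 200 tasks in flight
waves = tasks / parallel_tasks                   # about 41 waves of tasks

print(f"{tasks} tasks, {parallel_tasks} in parallel -> about {waves:.0f} waves")
# Whether that wall-clock time is acceptable, and whether each task's data and any
# shuffle fit in worker memory, is what really drives worker count and instance type.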

On Thu, Jun 23, 2022 at 6:36 AM Sid  wrote:

> Hi Team,
>
> Could anyone help me in the below problem:
>
>
> https://stackoverflow.com/questions/72724999/how-to-calculate-number-of-g-1-workers-in-aws-glue-for-processing-1tb-data
>
> Thanks,
> Sid
>


Need help with the configuration for AWS glue jobs

2022-06-22 Thread Sid
Hi Team,

Could anyone help me in the below problem:

https://stackoverflow.com/questions/72724999/how-to-calculate-number-of-g-1-workers-in-aws-glue-for-processing-1tb-data

Thanks,
Sid


Re: AWS EMR SPARK 3.1.1 date issues

2021-08-29 Thread Gourav Sengupta
Hi Nicolas,

thanks a ton for your kind response, I will surely try this out.

Regards,
Gourav Sengupta

On Sun, Aug 29, 2021 at 11:01 PM Nicolas Paris 
wrote:

> as a workaround turn off pruning :
>
> spark.sql.hive.metastorePartitionPruning false
> spark.sql.hive.convertMetastoreParquet false
>
> see
> https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/45
>
> On Tue Aug 24, 2021 at 9:18 AM CEST, Gourav Sengupta wrote:
> > Hi,
> >
> > I received a response from AWS, this is an issue with EMR, and they are
> > working on resolving the issue I believe.
> >
> > Thanks and Regards,
> > Gourav Sengupta
> >
> > On Mon, Aug 23, 2021 at 1:35 PM Gourav Sengupta <
> > gourav.sengupta.develo...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > the query still gives the same error if we write "SELECT * FROM
> table_name
> > > WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS".
> > >
> > > Also the queries work fine in SPARK 3.0.x, or in EMR 6.2.0.
> > >
> > >
> > > Thanks and Regards,
> > > Gourav Sengupta
> > >
> > > On Mon, Aug 23, 2021 at 1:16 PM Sean Owen  wrote:
> > >
> > >> Date handling was tightened up in Spark 3. I think you need to
> compare to
> > >> a date literal, not a string literal.
> > >>
> > >> On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <
> > >> gourav.sengupta.develo...@gmail.com> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as
> "SELECT
> > >>> * FROM  WHERE  > '2021-03-01'" the
> query
> > >>> is failing with error:
> > >>>
> > >>>
> ---
> > >>> pyspark.sql.utils.AnalysisException:
> > >>> org.apache.hadoop.hive.metastore.api.InvalidObjectException:
> Unsupported
> > >>> expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400;
> Error
> > >>> Code: InvalidInputException; Request ID:
> > >>> dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
> > >>>
> > >>>
> ---
> > >>>
> > >>> The above query works fine in all previous versions of SPARK.
> > >>>
> > >>> Is this the expected behaviour in SPARK 3.1.1? If so can someone
> please
> > >>> let me know how to write this query.
> > >>>
> > >>> Also if this is the expected behaviour I think that a lot of users
> will
> > >>> have to make these changes in their existing code making transition
> to
> > >>> SPARK 3.1.1 expensive I think.
> > >>>
> > >>> Regards,
> > >>> Gourav Sengupta
> > >>>
> > >>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: AWS EMR SPARK 3.1.1 date issues

2021-08-29 Thread Nicolas Paris
as a workaround turn off pruning :

spark.sql.hive.metastorePartitionPruning false
spark.sql.hive.convertMetastoreParquet false

see 
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/45
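
e.g. applied at session start (a sketch; note that disabling convertMetastoreParquet
also switches Parquet tables to the Hive SerDe read path, so it can cost some read
performance):

from pyspark.sql import SparkSession

# Sketch of the workaround settings above applied when building the session.
spark = (
    SparkSession.builder
    .config("spark.sql.hive.metastorePartitionPruning", "false")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .enableHiveSupport()
    .getOrCreate()
)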

On Tue Aug 24, 2021 at 9:18 AM CEST, Gourav Sengupta wrote:
> Hi,
>
> I received a response from AWS, this is an issue with EMR, and they are
> working on resolving the issue I believe.
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Mon, Aug 23, 2021 at 1:35 PM Gourav Sengupta <
> gourav.sengupta.develo...@gmail.com> wrote:
>
> > Hi,
> >
> > the query still gives the same error if we write "SELECT * FROM table_name
> > WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS".
> >
> > Also the queries work fine in SPARK 3.0.x, or in EMR 6.2.0.
> >
> >
> > Thanks and Regards,
> > Gourav Sengupta
> >
> > On Mon, Aug 23, 2021 at 1:16 PM Sean Owen  wrote:
> >
> >> Date handling was tightened up in Spark 3. I think you need to compare to
> >> a date literal, not a string literal.
> >>
> >> On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <
> >> gourav.sengupta.develo...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as "SELECT
> >>> * FROM  WHERE  > '2021-03-01'" the query
> >>> is failing with error:
> >>>
> >>> ---
> >>> pyspark.sql.utils.AnalysisException:
> >>> org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported
> >>> expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400; Error
> >>> Code: InvalidInputException; Request ID:
> >>> dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
> >>>
> >>> ---
> >>>
> >>> The above query works fine in all previous versions of SPARK.
> >>>
> >>> Is this the expected behaviour in SPARK 3.1.1? If so can someone please
> >>> let me know how to write this query.
> >>>
> >>> Also if this is the expected behaviour I think that a lot of users will
> >>> have to make these changes in their existing code making transition to
> >>> SPARK 3.1.1 expensive I think.
> >>>
> >>> Regards,
> >>> Gourav Sengupta
> >>>
> >>


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: AWS EMR SPARK 3.1.1 date issues

2021-08-24 Thread Gourav Sengupta
Hi,

I received a response from AWS, this is an issue with EMR, and they are
working on resolving the issue I believe.

Thanks and Regards,
Gourav Sengupta

On Mon, Aug 23, 2021 at 1:35 PM Gourav Sengupta <
gourav.sengupta.develo...@gmail.com> wrote:

> Hi,
>
> the query still gives the same error if we write "SELECT * FROM table_name
> WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS".
>
> Also the queries work fine in SPARK 3.0.x, or in EMR 6.2.0.
>
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Mon, Aug 23, 2021 at 1:16 PM Sean Owen  wrote:
>
>> Date handling was tightened up in Spark 3. I think you need to compare to
>> a date literal, not a string literal.
>>
>> On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <
>> gourav.sengupta.develo...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as "SELECT
>>> * FROM  WHERE  > '2021-03-01'" the query
>>> is failing with error:
>>>
>>> ---
>>> pyspark.sql.utils.AnalysisException:
>>> org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported
>>> expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400; Error
>>> Code: InvalidInputException; Request ID:
>>> dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
>>>
>>> ---
>>>
>>> The above query works fine in all previous versions of SPARK.
>>>
>>> Is this the expected behaviour in SPARK 3.1.1? If so can someone please
>>> let me know how to write this query.
>>>
>>> Also if this is the expected behaviour I think that a lot of users will
>>> have to make these changes in their existing code making transition to
>>> SPARK 3.1.1 expensive I think.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>


Re: AWS EMR SPARK 3.1.1 date issues

2021-08-23 Thread Gourav Sengupta
Hi,

the query still gives the same error if we write "SELECT * FROM table_name
WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS".

Also the queries work fine in SPARK 3.0.x, or in EMR 6.2.0.


Thanks and Regards,
Gourav Sengupta

On Mon, Aug 23, 2021 at 1:16 PM Sean Owen  wrote:

> Date handling was tightened up in Spark 3. I think you need to compare to
> a date literal, not a string literal.
>
> On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <
> gourav.sengupta.develo...@gmail.com> wrote:
>
>> Hi,
>>
>> while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as "SELECT *
>> FROM  WHERE  > '2021-03-01'" the query is
>> failing with error:
>>
>> ---
>> pyspark.sql.utils.AnalysisException:
>> org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported
>> expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400; Error
>> Code: InvalidInputException; Request ID:
>> dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
>>
>> ---
>>
>> The above query works fine in all previous versions of SPARK.
>>
>> Is this the expected behaviour in SPARK 3.1.1? If so can someone please
>> let me know how to write this query.
>>
>> Also if this is the expected behaviour I think that a lot of users will
>> have to make these changes in their existing code making transition to
>> SPARK 3.1.1 expensive I think.
>>
>> Regards,
>> Gourav Sengupta
>>
>


Re: AWS EMR SPARK 3.1.1 date issues

2021-08-23 Thread Sean Owen
Date handling was tightened up in Spark 3. I think you need to compare to a
date literal, not a string literal.
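
For example (a sketch with placeholder table/column names; whether it also gets
past the Glue-side pushdown error on EMR 6.3.0 I cannot confirm):

# Spark SQL form with an explicit DATE literal instead of a bare string:
df = spark.sql("""
    SELECT *
    FROM my_table
    WHERE partition_col > DATE '2021-03-01'
""")

# Equivalent DataFrame form, casting the literal to a date:
from pyspark.sql import functions as F
df2 = (
    spark.table("my_table")
         .where(F.col("partition_col") > F.lit("2021-03-01").cast("date"))
)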

On Mon, Aug 23, 2021 at 5:12 AM Gourav Sengupta <
gourav.sengupta.develo...@gmail.com> wrote:

> Hi,
>
> while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as "SELECT *
> FROM  WHERE  > '2021-03-01'" the query is
> failing with error:
> ---
> pyspark.sql.utils.AnalysisException:
> org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported
> expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400; Error
> Code: InvalidInputException; Request ID:
> dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
> ---
>
> The above query works fine in all previous versions of SPARK.
>
> Is this the expected behaviour in SPARK 3.1.1? If so can someone please
> let me know how to write this query.
>
> Also if this is the expected behaviour I think that a lot of users will
> have to make these changes in their existing code making transition to
> SPARK 3.1.1 expensive I think.
>
> Regards,
> Gourav Sengupta
>


AWS EMR SPARK 3.1.1 date issues

2021-08-23 Thread Gourav Sengupta
Hi,

while I am running in EMR 6.3.0 (SPARK 3.1.1) a simple query as "SELECT *
FROM  WHERE  > '2021-03-01'" the query is
failing with error:
---
pyspark.sql.utils.AnalysisException:
org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported
expression '2021 - 03 - 01' (Service: AWSGlue; Status Code: 400; Error
Code: InvalidInputException; Request ID:
dd3549c2-2eeb-4616-8dc5-5887ba43dd22; Proxy: null)
---

The above query works fine in all previous versions of SPARK.

Is this the expected behaviour in SPARK 3.1.1? If so can someone please let
me know how to write this query.

Also, if this is the expected behaviour, I think that a lot of users will
have to make these changes in their existing code, making the transition to
SPARK 3.1.1 expensive.

Regards,
Gourav Sengupta


Bursting Your On-Premises Data Lake Analytics and AI Workloads on AWS

2021-02-18 Thread Bin Fan
Hi everyone!

I am sharing this article about running Spark / Presto workloads on
AWS: Bursting
On-Premise Datalake Analytics and AI Workloads on AWS
<https://bit.ly/3qA1Tom> published on AWS blog. Hope you enjoy it. Feel
free to discuss with me here <https://alluxio.io/slack>.

- Bin Fan
Powered by Alluxio <https://www.alluxio.io/powered-by-alluxio/> | Alluxio
Slack Channel <https://alluxio.io/slack> | Data Orchestration Summit 2020
<https://www.alluxio.io/data-orchestration-summit-2020/>


Spark in hybrid cloud in AWS & GCP

2020-12-07 Thread Bin Fan
Dear Spark users,

Are you interested in running Spark in a hybrid cloud? Check out talks from
AWS & GCP at the virtual Data Orchestration Summit
<https://www.alluxio.io/data-orchestration-summit-2020/> on Dec. 8-9, 2020,
and register for free <https://www.alluxio.io/data-orchestration-summit-2020/>.

The summit's speaker lineup spans creators and committers of Alluxio,
Spark, Presto, Tensorflow, and K8s, as well as data engineers and software
engineers building cloud-native data and AI platforms at Amazon, Alibaba,
Comcast, Facebook, Google, ING Bank, Microsoft, Tencent, and more!


- Bin Fan


Re: Spark Job Fails with Unknown Error writing to S3 from AWS EMR

2020-07-22 Thread Shriraj Bhardwaj
We faced a similar situation with JRE 8u262; try reverting back...

On Thu, Jul 23, 2020, 5:18 AM koti reddy  wrote:

> Hi,
>
> Can someone help to resolve this issue?
> Thank you in advance.
>
> Error logs :
>
> java.io.EOFException: Unexpected EOF while trying to read response from server
>   at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
>   at 
> org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1073)
> 20/07/22 22:43:37 WARN DataStreamer: Error Recovery for 
> BP-439833631-172.19.222.143-1595381416559:blk_1073742309_1498 in pipeline 
> [DatanodeInfoWithStorage[172.19.222.182:50010,DS-7783002b-d57a-43a3-9d91-9934e2d063f8,DISK],
>  
> DatanodeInfoWithStorage[172.19.223.27:50010,DS-1bc8cea7-9c28-4869-aada-55b7d0b0680c,DISK],
>  
> DatanodeInfoWithStorage[172.19.223.199:50010,DS-880d5121-16a2-465e-ad20-ca99f4287770,DISK]]:
>  datanode 
> 0(DatanodeInfoWithStorage[172.19.222.182:50010,DS-7783002b-d57a-43a3-9d91-9934e2d063f8,DISK])
>  is bad.
> 20/07/22 22:44:58 WARN DataStreamer: Exception for 
> BP-439833631-172.19.222.143-1595381416559:blk_1073742309_1499
> java.io.EOFException: Unexpected EOF while trying to read response from server
>   at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
>   at 
> org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1073)
> 20/07/22 22:44:58 WARN DataStreamer: Error Recovery for 
> BP-439833631-172.19.222.143-1595381416559:blk_1073742309_1499 in pipeline 
> [DatanodeInfoWithStorage[172.19.223.27:50010,DS-1bc8cea7-9c28-4869-aada-55b7d0b0680c,DISK],
>  
> DatanodeInfoWithStorage[172.19.223.199:50010,DS-880d5121-16a2-465e-ad20-ca99f4287770,DISK],
>  
> DatanodeInfoWithStorage[172.19.222.180:50010,DS-40b1c81b-18d1-4d8d-ab49-11904f3dd23c,DISK]]:
>  datanode 
> 0(DatanodeInfoWithStorage[172.19.223.27:50010,DS-1bc8cea7-9c28-4869-aada-55b7d0b0680c,DISK])
>  is bad.
> 20/07/22 22:47:00 WARN DataStreamer: Exception for 
> BP-439833631-172.19.222.143-1595381416559:blk_1073742309_1500
> java.io.EOFException: Unexpected EOF while trying to read response from server
>   at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
>   at 
> org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1073)
> 20/07/22 22:47:00 WARN DataStreamer: Error Recovery for 
> BP-439833631-172.19.222.143-1595381416559:blk_1073742309_1500 in pipeline 
> [DatanodeInfoWithStorage[172.19.223.199:50010,DS-880d5121-16a2-465e-ad20-ca99f4287770,DISK],
>  
> DatanodeInfoWithStorage[172.19.222.180:50010,DS-40b1c81b-18d1-4d8d-ab49-11904f3dd23c,DISK],
>  
> DatanodeInfoWithStorage[172.19.223.55:50010,DS-3e5dc677-cd1d-49fc-b50a-4b058ae298aa,DISK]]:
>  datanode 
> 0(DatanodeInfoWithStorage[172.19.223.199:50010,DS-880d5121-16a2-465e-ad20-ca99f4287770,DISK])
>  is bad.
> 20/07/22 22:47:03 INFO MemoryStore: Block broadcast_4 stored as values in 
> memory (estimated size 18.0 GB, free 24.5 GB)
>
> --
> Thanks,
> Koti Reddy Nusum,
> +1-(660) 541-5623.
>
>


Spark Job Fails with Unknown Error writing to S3 from AWS EMR

2020-07-22 Thread koti reddy
Hi,

Can someone help to resolve this issue?
Thank you in advance.

Error logs :

java.io.EOFException: Unexpected EOF while trying to read response from server
at 
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
at 
org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1073)
20/07/22 22:43:37 WARN DataStreamer: Error Recovery for
BP-439833631-172.19.222.143-1595381416559:blk_1073742309_1498 in
pipeline 
[DatanodeInfoWithStorage[172.19.222.182:50010,DS-7783002b-d57a-43a3-9d91-9934e2d063f8,DISK],
DatanodeInfoWithStorage[172.19.223.27:50010,DS-1bc8cea7-9c28-4869-aada-55b7d0b0680c,DISK],
DatanodeInfoWithStorage[172.19.223.199:50010,DS-880d5121-16a2-465e-ad20-ca99f4287770,DISK]]:
datanode 
0(DatanodeInfoWithStorage[172.19.222.182:50010,DS-7783002b-d57a-43a3-9d91-9934e2d063f8,DISK])
is bad.
20/07/22 22:44:58 WARN DataStreamer: Exception for
BP-439833631-172.19.222.143-1595381416559:blk_1073742309_1499
java.io.EOFException: Unexpected EOF while trying to read response from server
at 
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
at 
org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1073)
20/07/22 22:44:58 WARN DataStreamer: Error Recovery for
BP-439833631-172.19.222.143-1595381416559:blk_1073742309_1499 in
pipeline 
[DatanodeInfoWithStorage[172.19.223.27:50010,DS-1bc8cea7-9c28-4869-aada-55b7d0b0680c,DISK],
DatanodeInfoWithStorage[172.19.223.199:50010,DS-880d5121-16a2-465e-ad20-ca99f4287770,DISK],
DatanodeInfoWithStorage[172.19.222.180:50010,DS-40b1c81b-18d1-4d8d-ab49-11904f3dd23c,DISK]]:
datanode 
0(DatanodeInfoWithStorage[172.19.223.27:50010,DS-1bc8cea7-9c28-4869-aada-55b7d0b0680c,DISK])
is bad.
20/07/22 22:47:00 WARN DataStreamer: Exception for
BP-439833631-172.19.222.143-1595381416559:blk_1073742309_1500
java.io.EOFException: Unexpected EOF while trying to read response from server
at 
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
at 
org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1073)
20/07/22 22:47:00 WARN DataStreamer: Error Recovery for
BP-439833631-172.19.222.143-1595381416559:blk_1073742309_1500 in
pipeline 
[DatanodeInfoWithStorage[172.19.223.199:50010,DS-880d5121-16a2-465e-ad20-ca99f4287770,DISK],
DatanodeInfoWithStorage[172.19.222.180:50010,DS-40b1c81b-18d1-4d8d-ab49-11904f3dd23c,DISK],
DatanodeInfoWithStorage[172.19.223.55:50010,DS-3e5dc677-cd1d-49fc-b50a-4b058ae298aa,DISK]]:
datanode 
0(DatanodeInfoWithStorage[172.19.223.199:50010,DS-880d5121-16a2-465e-ad20-ca99f4287770,DISK])
is bad.
20/07/22 22:47:03 INFO MemoryStore: Block broadcast_4 stored as values
in memory (estimated size 18.0 GB, free 24.5 GB)

-- 
Thanks,
Koti Reddy Nusum,
+1-(660) 541-5623.


AWS EMR slow write to HDFS

2019-06-11 Thread Femi Anthony


I'm writing a large dataset in Parquet format to HDFS using Spark and it runs 
rather slowly on EMR versus, say, Databricks. I realize that if I were able to use 
Hadoop 3.1, it would be much more performant because it has a high-performance 
output committer. Is this the case, and if so, when will there be a version of 
EMR that uses Hadoop 3.1? The current version I'm using is 5.21.
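
For what it's worth, one knob that is often tried while still on Hadoop 2.x is the
v2 FileOutputCommitter algorithm, which parallelizes the commit-time renames. A
hedged PySpark sketch follows; it mainly helps object stores and relaxes
job-commit atomicity, so treat it as something to test, not a guaranteed fix:

from pyspark.sql import SparkSession

# Hedged sketch: commit algorithm version 2 moves task output during task commit
# instead of in one serial pass at job commit. Validate it for your workload.
spark = (
    SparkSession.builder
    .appName("parquet-write-v2-committer")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

df = spark.range(1000000)  # placeholder for the real large dataset
df.write.mode("overwrite").parquet("hdfs:///tmp/large_dataset_parquet")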
Sent from my iPhone
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Aws

2019-02-08 Thread Pedro Tuero
Hi Noritaka,

I start clusters from the Java API.
Clusters running on 5.16 have no manual configuration in the EMR console
Configuration tab, so I assume the value of this property is the default on
5.16.
I enabled maximize resource allocation because otherwise the number of cores
automatically assigned (without setting spark.executor.cores manually) was
always one per executor.

I already use the same configuration: the same scripts and configuration files,
running the same job on the same input data, only swapping in the binaries with
my own code, which launches the clusters using the EMR 5.20 release label.

Anyway, setting maximize resource allocation seems to have helped enough with
the core distribution.
Some jobs even take less time than before.
Now I'm stuck analyzing a case where the number of tasks created seems to be
the problem. I posted another thread about that in this forum recently.

Regards,
Pedro


On Thu, Feb 7, 2019 at 21:37, Noritaka Sekiyama (moomind...@gmail.com) wrote:

> Hi Pedro,
>
> It seems that you disabled maximize resource allocation in 5.16, but
> enabled in 5.20.
> This config can be different based on how you start EMR cluster (via quick
> wizard, advanced wizard in console, or CLI/API).
> You can see that in EMR console Configuration tab.
>
> Please compare spark properties (especially spark.executor.cores,
> spark.executor.memory, spark.dynamicAllocation.enabled, etc.)  between
> your two Spark cluster with different version of EMR.
> You can see them from Spark web UI’s environment tab or log files.
> Then please try with the same properties against the same dataset with the
> same deployment mode (cluster or client).
>
> Even in EMR, you can configure num of cores and memory of driver/executors
> in config files, arguments in spark-submit, and inside Spark app if you
> need.
>
>
> Warm regards,
> Nori
>
> 2019年2月8日(金) 8:16 Hiroyuki Nagata :
>
>> Hi,
>> thank you Pedro
>>
>> I tested maximizeResourceAllocation option. When it's enabled, it seems
>> Spark utilized their cores fully. However the performance is not so
>> different from default setting.
>>
>> I consider to use s3-distcp for uploading files. And, I think
>> table(dataframe) caching is also effectiveness.
>>
>> Regards,
>> Hiroyuki
>>
>> 2019年2月2日(土) 1:12 Pedro Tuero :
>>
>>> Hi Hiroyuki, thanks for the answer.
>>>
>>> I found a solution for the cores per executor configuration:
>>> I set this configuration to true:
>>>
>>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
>>> Probably it was true by default at version 5.16, but I didn't find when
>>> it has changed.
>>> In the same link, it says that dynamic allocation is true by default. I
>>> thought it would do the trick but reading again I think it is related to
>>> the number of executors rather than the number of cores.
>>>
>>> But the jobs are still taking more than before.
>>> Watching application history,  I see these differences:
>>> For the same job, the same kind of instances types, default (aws
>>> managed) configuration for executors, cores, and memory:
>>> Instances:
>>> 6 r5.xlarge :  4 vCpu , 32gb of mem. (So there is 24 cores: 6 instances
>>> * 4 cores).
>>>
>>> With 5.16:
>>> - 24 executors  (4 in each instance, including the one who also had the
>>> driver).
>>> - 4 cores each.
>>> - 2.7  * 2 (Storage + on-heap storage) memory each.
>>> - 1 executor per core, but at the same time  4 cores per executor (?).
>>> - Total Mem in executors per Instance : 21.6 (2.7 * 2 * 4)
>>> - Total Elapsed Time: 6 minutes
>>> With 5.20:
>>> - 5 executors (1 in each instance, 0 in the instance with the driver).
>>> - 4 cores each.
>>> - 11.9  * 2 (Storage + on-heap storage) memory each.
>>> - Total Mem  in executors per Instance : 23.8 (11.9 * 2 * 1)
>>> - Total Elapsed Time: 8 minutes
>>>
>>>
>>> I don't understand the configuration of 5.16, but it works better.
>>> It seems that in 5.20, a full instance is wasted with the driver only,
>>> while it could also contain an executor.
>>>
>>>
>>> Regards,
>>> Pedro.
>>>
>>>
>>>
>>> l jue., 31 de ene. de 2019 20:16, Hiroyuki Nagata 
>>> escribió:
>>>
>>>> Hi, Pedro
>>>>
>>>>
>>>> I also start using AWS EMR, with Spark 2.4.0. I'm seeking methods 

Re: Aws

2019-02-07 Thread Noritaka Sekiyama
Hi Pedro,

It seems that you disabled maximize resource allocation in 5.16 but enabled it
in 5.20.
This config can differ based on how you start the EMR cluster (via the quick
wizard, the advanced wizard in the console, or the CLI/API).
You can see it in the EMR console Configuration tab.

Please compare the Spark properties (especially spark.executor.cores,
spark.executor.memory, spark.dynamicAllocation.enabled, etc.) between your
two Spark clusters with different versions of EMR.
You can see them from Spark web UI’s environment tab or log files.
Then please try with the same properties against the same dataset with the
same deployment mode (cluster or client).

Even on EMR, you can configure the number of cores and the memory for the
driver/executors in config files, via spark-submit arguments, or inside the
Spark app if you need to.
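
To make that concrete, a minimal PySpark sketch of the in-app route; every value
below is illustrative rather than a recommendation:

from pyspark.sql import SparkSession

# Hedged sketch: size executors explicitly instead of relying on EMR defaults.
# The numbers are placeholders; pick them to match your instance type.
spark = (
    SparkSession.builder
    .appName("explicit-executor-sizing")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "10g")
    .config("spark.driver.memory", "4g")
    .config("spark.dynamicAllocation.enabled", "true")
    .getOrCreate()
)

print(spark.sparkContext.getConf().get("spark.executor.cores"))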


Warm regards,
Nori

On Fri, Feb 8, 2019 at 8:16, Hiroyuki Nagata wrote:

> Hi,
> thank you Pedro
>
> I tested maximizeResourceAllocation option. When it's enabled, it seems
> Spark utilized their cores fully. However the performance is not so
> different from default setting.
>
> I consider to use s3-distcp for uploading files. And, I think
> table(dataframe) caching is also effectiveness.
>
> Regards,
> Hiroyuki
>
> 2019年2月2日(土) 1:12 Pedro Tuero :
>
>> Hi Hiroyuki, thanks for the answer.
>>
>> I found a solution for the cores per executor configuration:
>> I set this configuration to true:
>>
>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
>> Probably it was true by default at version 5.16, but I didn't find when
>> it has changed.
>> In the same link, it says that dynamic allocation is true by default. I
>> thought it would do the trick but reading again I think it is related to
>> the number of executors rather than the number of cores.
>>
>> But the jobs are still taking more than before.
>> Watching application history,  I see these differences:
>> For the same job, the same kind of instances types, default (aws managed)
>> configuration for executors, cores, and memory:
>> Instances:
>> 6 r5.xlarge :  4 vCpu , 32gb of mem. (So there is 24 cores: 6 instances *
>> 4 cores).
>>
>> With 5.16:
>> - 24 executors  (4 in each instance, including the one who also had the
>> driver).
>> - 4 cores each.
>> - 2.7  * 2 (Storage + on-heap storage) memory each.
>> - 1 executor per core, but at the same time  4 cores per executor (?).
>> - Total Mem in executors per Instance : 21.6 (2.7 * 2 * 4)
>> - Total Elapsed Time: 6 minutes
>> With 5.20:
>> - 5 executors (1 in each instance, 0 in the instance with the driver).
>> - 4 cores each.
>> - 11.9  * 2 (Storage + on-heap storage) memory each.
>> - Total Mem  in executors per Instance : 23.8 (11.9 * 2 * 1)
>> - Total Elapsed Time: 8 minutes
>>
>>
>> I don't understand the configuration of 5.16, but it works better.
>> It seems that in 5.20, a full instance is wasted with the driver only,
>> while it could also contain an executor.
>>
>>
>> Regards,
>> Pedro.
>>
>>
>>
>> l jue., 31 de ene. de 2019 20:16, Hiroyuki Nagata 
>> escribió:
>>
>>> Hi, Pedro
>>>
>>>
>>> I also start using AWS EMR, with Spark 2.4.0. I'm seeking methods for
>>> performance tuning.
>>>
>>> Do you configure dynamic allocation ?
>>>
>>> FYI:
>>>
>>> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>>
>>> I've not tested it yet. I guess spark-submit needs to specify number of
>>> executors.
>>>
>>> Regards,
>>> Hiroyuki
>>>
>>> 2019年2月1日(金) 5:23、Pedro Tuero さん(tuerope...@gmail.com)のメッセージ:
>>>
>>>> Hi guys,
>>>> I use to run spark jobs in Aws emr.
>>>> Recently I switch from aws emr label  5.16 to 5.20 (which use Spark
>>>> 2.4.0).
>>>> I've noticed that a lot of steps are taking longer than before.
>>>> I think it is related to the automatic configuration of cores by
>>>> executor.
>>>> In version 5.16, some executors toke more cores if the instance allows
>>>> it.
>>>> Let say, if an instance had 8 cores and 40gb of ram, and ram configured
>>>> by executor was 10gb, then aws emr automatically assigned 2 cores by
>>>> executor.
>>>> Now in label 5.20, unless I configure the number of cores manually,
>>>> only one core is assigned per executor.
>>>>
>>>> I don't know if it is related to Spark 2.4.0 or if it is something
>>>> managed by aws...
>>>> Does anyone know if there is a way to automatically use more cores when
>>>> it is physically possible?
>>>>
>>>> Thanks,
>>>> Peter.
>>>>
>>>


Re: Aws

2019-02-07 Thread Hiroyuki Nagata
Hi,
thank you Pedro

I tested the maximizeResourceAllocation option. When it's enabled, Spark seems
to utilize the cores fully. However, the performance is not much different from
the default setting.

I'm considering using s3-distcp for uploading files. And I think
table (DataFrame) caching is also effective.

Regards,
Hiroyuki

On Sat, Feb 2, 2019 at 1:12, Pedro Tuero wrote:

> Hi Hiroyuki, thanks for the answer.
>
> I found a solution for the cores per executor configuration:
> I set this configuration to true:
>
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
> Probably it was true by default at version 5.16, but I didn't find when it
> has changed.
> In the same link, it says that dynamic allocation is true by default. I
> thought it would do the trick but reading again I think it is related to
> the number of executors rather than the number of cores.
>
> But the jobs are still taking more than before.
> Watching application history,  I see these differences:
> For the same job, the same kind of instances types, default (aws managed)
> configuration for executors, cores, and memory:
> Instances:
> 6 r5.xlarge :  4 vCpu , 32gb of mem. (So there is 24 cores: 6 instances *
> 4 cores).
>
> With 5.16:
> - 24 executors  (4 in each instance, including the one who also had the
> driver).
> - 4 cores each.
> - 2.7  * 2 (Storage + on-heap storage) memory each.
> - 1 executor per core, but at the same time  4 cores per executor (?).
> - Total Mem in executors per Instance : 21.6 (2.7 * 2 * 4)
> - Total Elapsed Time: 6 minutes
> With 5.20:
> - 5 executors (1 in each instance, 0 in the instance with the driver).
> - 4 cores each.
> - 11.9  * 2 (Storage + on-heap storage) memory each.
> - Total Mem  in executors per Instance : 23.8 (11.9 * 2 * 1)
> - Total Elapsed Time: 8 minutes
>
>
> I don't understand the configuration of 5.16, but it works better.
> It seems that in 5.20, a full instance is wasted with the driver only,
> while it could also contain an executor.
>
>
> Regards,
> Pedro.
>
>
>
> l jue., 31 de ene. de 2019 20:16, Hiroyuki Nagata 
> escribió:
>
>> Hi, Pedro
>>
>>
>> I also start using AWS EMR, with Spark 2.4.0. I'm seeking methods for
>> performance tuning.
>>
>> Do you configure dynamic allocation ?
>>
>> FYI:
>>
>> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>>
>> I've not tested it yet. I guess spark-submit needs to specify number of
>> executors.
>>
>> Regards,
>> Hiroyuki
>>
>> 2019年2月1日(金) 5:23、Pedro Tuero さん(tuerope...@gmail.com)のメッセージ:
>>
>>> Hi guys,
>>> I use to run spark jobs in Aws emr.
>>> Recently I switch from aws emr label  5.16 to 5.20 (which use Spark
>>> 2.4.0).
>>> I've noticed that a lot of steps are taking longer than before.
>>> I think it is related to the automatic configuration of cores by
>>> executor.
>>> In version 5.16, some executors toke more cores if the instance allows
>>> it.
>>> Let say, if an instance had 8 cores and 40gb of ram, and ram configured
>>> by executor was 10gb, then aws emr automatically assigned 2 cores by
>>> executor.
>>> Now in label 5.20, unless I configure the number of cores manually, only
>>> one core is assigned per executor.
>>>
>>> I don't know if it is related to Spark 2.4.0 or if it is something
>>> managed by aws...
>>> Does anyone know if there is a way to automatically use more cores when
>>> it is physically possible?
>>>
>>> Thanks,
>>> Peter.
>>>
>>


Re: Aws

2019-02-01 Thread Pedro Tuero
Hi Hiroyuki, thanks for the answer.

I found a solution for the cores per executor configuration:
I set this configuration to true:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
Probably it was true by default in version 5.16, but I couldn't find when it
changed.
The same link says that dynamic allocation is true by default. I thought it
would do the trick, but reading it again I think it relates to the number of
executors rather than the number of cores.
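
For reference, a sketch of passing that classification when launching the
cluster via boto3 (the Java SDK call is analogous); all names and instance
values below are illustrative:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster definition; the Configurations block is the relevant part.
response = emr.run_job_flow(
    Name="spark-maximize-resource-allocation",
    ReleaseLabel="emr-5.20.0",
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {
            "Classification": "spark",
            "Properties": {"maximizeResourceAllocation": "true"},
        }
    ],
    Instances={
        "MasterInstanceType": "r5.xlarge",
        "SlaveInstanceType": "r5.xlarge",
        "InstanceCount": 6,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])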

But the jobs are still taking longer than before.
Looking at the application history, I see these differences:
For the same job, the same instance types, and the default (AWS-managed)
configuration for executors, cores, and memory:
Instances:
6 r5.xlarge: 4 vCPUs, 32 GB of memory each (so there are 24 cores: 6 instances *
4 cores).

With 5.16:
- 24 executors  (4 in each instance, including the one who also had the
driver).
- 4 cores each.
- 2.7  * 2 (Storage + on-heap storage) memory each.
- 1 executor per core, but at the same time  4 cores per executor (?).
- Total Mem in executors per Instance : 21.6 (2.7 * 2 * 4)
- Total Elapsed Time: 6 minutes
With 5.20:
- 5 executors (1 in each instance, 0 in the instance with the driver).
- 4 cores each.
- 11.9  * 2 (Storage + on-heap storage) memory each.
- Total Mem  in executors per Instance : 23.8 (11.9 * 2 * 1)
- Total Elapsed Time: 8 minutes


I don't understand the 5.16 configuration, but it works better.
It seems that in 5.20 a full instance is wasted on the driver alone, when it
could also host an executor.


Regards,
Pedro.



On Thu, Jan 31, 2019 at 20:16, Hiroyuki Nagata  wrote:

> Hi, Pedro
>
>
> I also start using AWS EMR, with Spark 2.4.0. I'm seeking methods for
> performance tuning.
>
> Do you configure dynamic allocation ?
>
> FYI:
>
> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>
> I've not tested it yet. I guess spark-submit needs to specify number of
> executors.
>
> Regards,
> Hiroyuki
>
> 2019年2月1日(金) 5:23、Pedro Tuero さん(tuerope...@gmail.com)のメッセージ:
>
>> Hi guys,
>> I use to run spark jobs in Aws emr.
>> Recently I switch from aws emr label  5.16 to 5.20 (which use Spark
>> 2.4.0).
>> I've noticed that a lot of steps are taking longer than before.
>> I think it is related to the automatic configuration of cores by executor.
>> In version 5.16, some executors toke more cores if the instance allows it.
>> Let say, if an instance had 8 cores and 40gb of ram, and ram configured
>> by executor was 10gb, then aws emr automatically assigned 2 cores by
>> executor.
>> Now in label 5.20, unless I configure the number of cores manually, only
>> one core is assigned per executor.
>>
>> I don't know if it is related to Spark 2.4.0 or if it is something
>> managed by aws...
>> Does anyone know if there is a way to automatically use more cores when
>> it is physically possible?
>>
>> Thanks,
>> Peter.
>>
>


Re: Aws

2019-01-31 Thread Hiroyuki Nagata
Hi, Pedro


I also started using AWS EMR, with Spark 2.4.0, and I'm looking for
performance-tuning methods.

Do you configure dynamic allocation?

FYI:
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

I've not tested it yet. I guess spark-submit needs to specify the number of
executors.
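
For reference, a hedged sketch of those dynamic allocation settings applied from
inside a PySpark app; all numbers are placeholders:

from pyspark.sql import SparkSession

# Hedged sketch: dynamic allocation on YARN also needs the external shuffle
# service (EMR enables it by default). All numbers are placeholders.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)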

Regards,
Hiroyuki

On Fri, Feb 1, 2019 at 5:23, Pedro Tuero (tuerope...@gmail.com) wrote:

> Hi guys,
> I use to run spark jobs in Aws emr.
> Recently I switch from aws emr label  5.16 to 5.20 (which use Spark 2.4.0).
> I've noticed that a lot of steps are taking longer than before.
> I think it is related to the automatic configuration of cores by executor.
> In version 5.16, some executors toke more cores if the instance allows it.
> Let say, if an instance had 8 cores and 40gb of ram, and ram configured by
> executor was 10gb, then aws emr automatically assigned 2 cores by executor.
> Now in label 5.20, unless I configure the number of cores manually, only
> one core is assigned per executor.
>
> I don't know if it is related to Spark 2.4.0 or if it is something managed
> by aws...
> Does anyone know if there is a way to automatically use more cores when it
> is physically possible?
>
> Thanks,
> Peter.
>


Aws

2019-01-31 Thread Pedro Tuero
Hi guys,
I usually run Spark jobs on AWS EMR.
Recently I switched from EMR release label 5.16 to 5.20 (which uses Spark 2.4.0).
I've noticed that a lot of steps are taking longer than before.
I think it is related to the automatic configuration of cores per executor.
In version 5.16, some executors took more cores if the instance allowed it.
Let's say an instance had 8 cores and 40 GB of RAM, and the RAM configured per
executor was 10 GB; then EMR automatically assigned 2 cores per executor.
Now on label 5.20, unless I configure the number of cores manually, only one
core is assigned per executor.

I don't know if it is related to Spark 2.4.0 or if it is something managed by
AWS...
Does anyone know if there is a way to automatically use more cores when it
is physically possible?

Thanks,
Peter.


Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Riccardo Ferrari
Hi Aakash,

Can you share how you are adding those jars? Are you using the --packages
method? I assume you're running on a cluster, and those dependencies might not
have been properly distributed.

How are you submitting your app? What kind of resource manager are you
using: standalone, YARN, ...?
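
If it is a dependency-distribution issue, here is a hedged sketch of the
--packages route, expressed as a Spark conf; the hadoop-aws version must match
the Hadoop build bundled with your Spark, so treat 2.7.3 below as an assumption:

from pyspark.sql import SparkSession

# Hedged sketch: spark.jars.packages must be set before the SparkContext starts
# (i.e. not in an already-running shell). Spark then resolves hadoop-aws and its
# matching aws-java-sdk from Maven on both the driver and the executors.
spark = (
    SparkSession.builder
    .appName("s3a-packages-example")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
    .getOrCreate()
)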

Best,

On Fri, Dec 21, 2018 at 1:18 PM Aakash Basu 
wrote:

> Any help, anyone?
>
> On Fri, Dec 21, 2018 at 2:21 PM Aakash Basu 
> wrote:
>
>> Hey Shuporno,
>>
>> With the updated config too, I am getting the same error. While trying to
>> figure that out, I found this link which says I need aws-java-sdk (which I
>> already have):
>> https://github.com/amazon-archives/kinesis-storm-spout/issues/8
>>
>> Now, this is my java details:
>>
>> java version "1.8.0_181"
>>
>> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>>
>> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>>
>>
>>
>> Is it due to some java version mismatch then or is it something else I am
>> missing out? What do you think?
>>
>> Thanks,
>> Aakash.
>>
>> On Fri, Dec 21, 2018 at 1:43 PM Shuporno Choudhury <
>> shuporno.choudh...@gmail.com> wrote:
>>
>>> Hi,
>>> I don't know whether the following config (that you have tried) are
>>> correct:
>>> fs.s3a.awsAccessKeyId
>>> fs.s3a.awsSecretAccessKey
>>>
>>> The correct ones probably are:
>>> fs.s3a.access.key
>>> fs.s3a.secret.key
>>>
>>> On Fri, 21 Dec 2018 at 13:21, Aakash Basu-2 [via Apache Spark User List]
>>>  wrote:
>>>
>>>> Hey Shuporno,
>>>>
>>>> Thanks for a prompt reply. Thanks for noticing the silly mistake, I
>>>> tried this out, but still getting another error, which is related to
>>>> connectivity it seems.
>>>>
>>>> >>> hadoop_conf.set("fs.s3a.awsAccessKeyId", "abcd")
>>>>> >>> hadoop_conf.set("fs.s3a.awsSecretAccessKey", "123abc")
>>>>> >>> a =
>>>>> spark.read.csv("s3a:///test-bucket/breast-cancer-wisconsin.csv",
>>>>> header=True)
>>>>> Traceback (most recent call last):
>>>>>   File "", line 1, in 
>>>>>   File
>>>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
>>>>> line 441, in csv
>>>>> return
>>>>> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>>>>>   File
>>>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>>>>> line 1257, in __call__
>>>>>   File
>>>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
>>>>> line 63, in deco
>>>>> return f(*a, **kw)
>>>>>   File
>>>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>>>>> line 328, in get_return_value
>>>>> py4j.protocol.Py4JJavaError: An error occurred while calling o220.csv.
>>>>> : java.lang.NoClassDefFoundError:
>>>>> com/amazonaws/auth/AWSCredentialsProvider
>>>>> at java.lang.Class.forName0(Native Method)
>>>>> at java.lang.Class.forName(Class.java:348)
>>>>> at
>>>>> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
>>>>> at
>>>>> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
>>>>> at
>>>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
>>>>> at
>>>>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
>>>>> at
>>>>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>>>>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>>>>> at
>>>>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>>>>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>>>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>>>>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>>>>> at
>>>>> org.apache.spark.sql.execution.streaming.FileStreamS

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Aakash Basu
Any help, anyone?

On Fri, Dec 21, 2018 at 2:21 PM Aakash Basu 
wrote:

> Hey Shuporno,
>
> With the updated config too, I am getting the same error. While trying to
> figure that out, I found this link which says I need aws-java-sdk (which I
> already have):
> https://github.com/amazon-archives/kinesis-storm-spout/issues/8
>
> Now, this is my java details:
>
> java version "1.8.0_181"
>
> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>
> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>
>
>
> Is it due to some java version mismatch then or is it something else I am
> missing out? What do you think?
>
> Thanks,
> Aakash.
>
> On Fri, Dec 21, 2018 at 1:43 PM Shuporno Choudhury <
> shuporno.choudh...@gmail.com> wrote:
>
>> Hi,
>> I don't know whether the following config (that you have tried) are
>> correct:
>> fs.s3a.awsAccessKeyId
>> fs.s3a.awsSecretAccessKey
>>
>> The correct ones probably are:
>> fs.s3a.access.key
>> fs.s3a.secret.key
>>
>> On Fri, 21 Dec 2018 at 13:21, Aakash Basu-2 [via Apache Spark User List] <
>> ml+s1001560n34217...@n3.nabble.com> wrote:
>>
>>> Hey Shuporno,
>>>
>>> Thanks for a prompt reply. Thanks for noticing the silly mistake, I
>>> tried this out, but still getting another error, which is related to
>>> connectivity it seems.
>>>
>>> >>> hadoop_conf.set("fs.s3a.awsAccessKeyId", "abcd")
>>>> >>> hadoop_conf.set("fs.s3a.awsSecretAccessKey", "123abc")
>>>> >>> a =
>>>> spark.read.csv("s3a:///test-bucket/breast-cancer-wisconsin.csv",
>>>> header=True)
>>>> Traceback (most recent call last):
>>>>   File "", line 1, in 
>>>>   File
>>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
>>>> line 441, in csv
>>>> return
>>>> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>>>>   File
>>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>>>> line 1257, in __call__
>>>>   File
>>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
>>>> line 63, in deco
>>>> return f(*a, **kw)
>>>>   File
>>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>>>> line 328, in get_return_value
>>>> py4j.protocol.Py4JJavaError: An error occurred while calling o220.csv.
>>>> : java.lang.NoClassDefFoundError:
>>>> com/amazonaws/auth/AWSCredentialsProvider
>>>> at java.lang.Class.forName0(Native Method)
>>>> at java.lang.Class.forName(Class.java:348)
>>>> at
>>>> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
>>>> at
>>>> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
>>>> at
>>>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
>>>> at
>>>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
>>>> at
>>>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>>>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>>>> at
>>>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>>>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>>>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>>>> at
>>>> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
>>>> at
>>>> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
>>>> at
>>>> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>>>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>> at
>>>> sun.reflect.DelegatingMe

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Aakash Basu
Hey Shuporno,

With the updated config too, I am getting the same error. While trying to
figure that out, I found this link which says I need aws-java-sdk (which I
already have):
https://github.com/amazon-archives/kinesis-storm-spout/issues/8

Now, these are my Java details:

java version "1.8.0_181"

Java(TM) SE Runtime Environment (build 1.8.0_181-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)



Is it due to some Java version mismatch, then, or is it something else I am
missing? What do you think?

Thanks,
Aakash.

On Fri, Dec 21, 2018 at 1:43 PM Shuporno Choudhury <
shuporno.choudh...@gmail.com> wrote:

> Hi,
> I don't know whether the following config (that you have tried) are
> correct:
> fs.s3a.awsAccessKeyId
> fs.s3a.awsSecretAccessKey
>
> The correct ones probably are:
> fs.s3a.access.key
> fs.s3a.secret.key
>
> On Fri, 21 Dec 2018 at 13:21, Aakash Basu-2 [via Apache Spark User List] <
> ml+s1001560n34217...@n3.nabble.com> wrote:
>
>> Hey Shuporno,
>>
>> Thanks for a prompt reply. Thanks for noticing the silly mistake, I tried
>> this out, but still getting another error, which is related to connectivity
>> it seems.
>>
>> >>> hadoop_conf.set("fs.s3a.awsAccessKeyId", "abcd")
>>> >>> hadoop_conf.set("fs.s3a.awsSecretAccessKey", "123abc")
>>> >>> a = spark.read.csv("s3a:///test-bucket/breast-cancer-wisconsin.csv",
>>> header=True)
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
>>> line 441, in csv
>>> return
>>> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>>> line 1257, in __call__
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
>>> line 63, in deco
>>> return f(*a, **kw)
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>>> line 328, in get_return_value
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o220.csv.
>>> : java.lang.NoClassDefFoundError:
>>> com/amazonaws/auth/AWSCredentialsProvider
>>> at java.lang.Class.forName0(Native Method)
>>> at java.lang.Class.forName(Class.java:348)
>>> at
>>> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
>>> at
>>> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
>>> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
>>> at
>>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
>>> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>>> at
>>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>>> at
>>> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
>>> at
>>> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
>>> at
>>> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>> at py4j.Gateway.invoke(Gateway.java:282)
>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>> at py4j.Ga

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Shuporno Choudhury
Hi,
I don't know whether the following config keys (that you have tried) are correct:
fs.s3a.awsAccessKeyId
fs.s3a.awsSecretAccessKey

The correct ones probably are:
fs.s3a.access.key
fs.s3a.secret.key
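
A minimal PySpark sketch using those keys (the credentials are placeholders; on
a real cluster an instance profile or environment credentials are usually
preferable to hard-coded keys):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-read-poc").getOrCreate()

# Placeholder credentials; set the s3a keys on the Hadoop configuration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "AKIA...")
hadoop_conf.set("fs.s3a.secret.key", "...")

df = spark.read.csv("s3a://test-bucket/breast-cancer-wisconsin.csv", header=True)
df.show(5)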

On Fri, 21 Dec 2018 at 13:21, Aakash Basu-2 [via Apache Spark User List] <
ml+s1001560n34217...@n3.nabble.com> wrote:

> Hey Shuporno,
>
> Thanks for a prompt reply. Thanks for noticing the silly mistake, I tried
> this out, but still getting another error, which is related to connectivity
> it seems.
>
> >>> hadoop_conf.set("fs.s3a.awsAccessKeyId", "abcd")
>> >>> hadoop_conf.set("fs.s3a.awsSecretAccessKey", "123abc")
>> >>> a = spark.read.csv("s3a:///test-bucket/breast-cancer-wisconsin.csv",
>> header=True)
>> Traceback (most recent call last):
>>   File "", line 1, in 
>>   File
>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
>> line 441, in csv
>> return
>> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>>   File
>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>> line 1257, in __call__
>>   File
>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
>> line 63, in deco
>> return f(*a, **kw)
>>   File
>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>> line 328, in get_return_value
>> py4j.protocol.Py4JJavaError: An error occurred while calling o220.csv.
>> : java.lang.NoClassDefFoundError:
>> com/amazonaws/auth/AWSCredentialsProvider
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:348)
>> at
>> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
>> at
>> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
>> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
>> at
>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
>> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>> at
>> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
>> at
>> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
>> at
>> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>> at py4j.Gateway.invoke(Gateway.java:282)
>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>> at py4j.GatewayConnection.run(GatewayConnection.java:238)
>> at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.lang.ClassNotFoundException:
>> com.amazonaws.auth.AWSCredentialsProvider
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 28 more
>
>
>
> Thanks,
> Aakash.
>
> On Fri, Dec 21, 2018 at 12:51 PM Shuporno Choudhury <[hidden email]
> <http:///user/SendEmail.jtp?type=node=34217=0>> wrote:
>
>>
>>
>> On Fri, 21 Dec 2018 at 12:47, Shuporno Choudhury <[hidden email]
>> <http:///user/SendEmail.jtp?type=node=34217=1>> wrote:
>>
>>> Hi,
>>> Your connection config uses 's3n' but your read command uses 's3a'.
>>> The config for 

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Aakash Basu
Hey Shuporno,

Thanks for the prompt reply, and for noticing the silly mistake. I tried
this out, but I'm still getting another error, which seems related to
connectivity.

>>> hadoop_conf.set("fs.s3a.awsAccessKeyId", "abcd")
> >>> hadoop_conf.set("fs.s3a.awsSecretAccessKey", "123abc")
> >>> a = spark.read.csv("s3a:///test-bucket/breast-cancer-wisconsin.csv",
> header=True)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
> line 441, in csv
> return
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 1257, in __call__
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
> line 63, in deco
> return f(*a, **kw)
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o220.csv.
> : java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at
> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.ClassNotFoundException:
> com.amazonaws.auth.AWSCredentialsProvider
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 28 more



Thanks,
Aakash.

On Fri, Dec 21, 2018 at 12:51 PM Shuporno Choudhury <
shuporno.choudh...@gmail.com> wrote:

>
>
> On Fri, 21 Dec 2018 at 12:47, Shuporno Choudhury <
> shuporno.choudh...@gmail.com> wrote:
>
>> Hi,
>> Your connection config uses 's3n' but your read command uses 's3a'.
>> The config for s3a are:
>> spark.hadoop.fs.s3a.access.key
>> spark.hadoop.fs.s3a.secret.key
>>
>> I feel this should solve the problem.
>>
>> On Fri, 21 Dec 2018 at 12:09, Aakash Basu-2 [via Apache Spark User List] <
>> ml+s1001560n34215...@n3.nabble.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to connect to AWS S3 and read a csv file (running POC) from
>>> a bucket.
>>>
>>> I have s3cmd and and being able to run ls and other operation from cli.
>>>
>>> *Present Configuration:*
>>> Python 3.7
>>> Spark 2.3.1
>>>
>>> *JARs added:*
>>> hadoop-aws-2.7.3.jar (in sync with the hadoop version used with spark)
>>> 

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Shuporno Choudhury
On Fri, 21 Dec 2018 at 12:47, Shuporno Choudhury <
shuporno.choudh...@gmail.com> wrote:

> Hi,
> Your connection config uses 's3n' but your read command uses 's3a'.
> The config for s3a are:
> spark.hadoop.fs.s3a.access.key
> spark.hadoop.fs.s3a.secret.key
>
> I feel this should solve the problem.
>
> On Fri, 21 Dec 2018 at 12:09, Aakash Basu-2 [via Apache Spark User List] <
> ml+s1001560n34215...@n3.nabble.com> wrote:
>
>> Hi,
>>
>> I am trying to connect to AWS S3 and read a csv file (running POC) from a
>> bucket.
>>
>> I have s3cmd and and being able to run ls and other operation from cli.
>>
>> *Present Configuration:*
>> Python 3.7
>> Spark 2.3.1
>>
>> *JARs added:*
>> hadoop-aws-2.7.3.jar (in sync with the hadoop version used with spark)
>> aws-java-sdk-1.11.472.jar
>>
>> Trying out the following code:
>>
>> >>> sc=spark.sparkContext
>>>
>>> >>> hadoop_conf=sc._jsc.hadoopConfiguration()
>>>
>>> >>> hadoop_conf.set("fs.s3n.awsAccessKeyId", "abcd")
>>>
>>> >>> hadoop_conf.set("fs.s3n.awsSecretAccessKey", "xyz123")
>>>
>>> >>> a = spark.read.csv("s3a://test-bucket/breast-cancer-wisconsin.csv",
>>> header=True)
>>>
>>> Traceback (most recent call last):
>>>
>>>   File "", line 1, in 
>>>
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
>>> line 441, in csv
>>>
>>> return
>>> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>>>
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>>> line 1257, in __call__
>>>
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
>>> line 63, in deco
>>>
>>> return f(*a, **kw)
>>>
>>>   File
>>> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>>> line 328, in get_return_value
>>>
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o33.csv.
>>>
>>> : java.lang.NoClassDefFoundError:
>>> com/amazonaws/auth/AWSCredentialsProvider
>>>
>>> at java.lang.Class.forName0(Native Method)
>>>
>>> at java.lang.Class.forName(Class.java:348)
>>>
>>> at
>>> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
>>>
>>> at
>>> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
>>>
>>> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
>>>
>>> at
>>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
>>>
>>> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>>>
>>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>>>
>>> at
>>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>>>
>>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>>>
>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>>>
>>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>>>
>>> at
>>> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
>>>
>>> at
>>> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
>>>
>>> at
>>> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>>>
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>>>
>>> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>>>
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>
>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>>
>>

Connection issue with AWS S3 from PySpark 2.3.1

2018-12-20 Thread Aakash Basu
Hi,

I am trying to connect to AWS S3 and read a CSV file (running a POC) from a
bucket.

I have s3cmd installed and am able to run ls and other operations from the CLI.

*Present Configuration:*
Python 3.7
Spark 2.3.1

*JARs added:*
hadoop-aws-2.7.3.jar (in sync with the hadoop version used with spark)
aws-java-sdk-1.11.472.jar

Trying out the following code:

>>> sc=spark.sparkContext
>
> >>> hadoop_conf=sc._jsc.hadoopConfiguration()
>
> >>> hadoop_conf.set("fs.s3n.awsAccessKeyId", "abcd")
>
> >>> hadoop_conf.set("fs.s3n.awsSecretAccessKey", "xyz123")
>
> >>> a = spark.read.csv("s3a://test-bucket/breast-cancer-wisconsin.csv",
> header=True)
>
> Traceback (most recent call last):
>
>   File "", line 1, in 
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
> line 441, in csv
>
> return
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 1257, in __call__
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
> line 63, in deco
>
> return f(*a, **kw)
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
> line 328, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o33.csv.
>
> : java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:348)
>
> at
> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
>
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
>
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
>
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
>
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>
> at
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
>
> at
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
>
> at
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:498)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>
> at py4j.Gateway.invoke(Gateway.java:282)
>
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
>
> at java.lang.Thread.run(Thread.java:748)
>
> Caused by: java.lang.ClassNotFoundException:
> com.amazonaws.auth.AWSCredentialsProvider
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 28 more
>
>
> >>> a = spark.read.csv("s3a://test-bucket/breast-cancer-wisconsin.csv",
> header=True)
>
> Traceback (most recent call last):
>
>   File "", line 1, in 
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
> line 441, in csv
>
> return
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>
>   File
> "/Users/aakash/Downloads/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-11-15 Thread Holden Karau
If folks are interested, while it's not on Amazon, I've got a live stream
of getting client mode with a Jupyter notebook to work on GCP/GKE:
https://www.youtube.com/watch?v=eMj0Pv1-Nfo=3=PLRLebp9QyZtZflexn4Yf9xsocrR_aSryx
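
For anyone who wants the shape of it in code, a hedged PySpark sketch of Spark
2.4 client mode on Kubernetes; the image, namespace, service name, and port
below are placeholders, and the headless Service must select the pod the driver
runs in:

from pyspark.sql import SparkSession

# Hedged sketch: assumes a headless Service "spark-driver-headless" in namespace
# "spark" that selects this pod (e.g. a Jupyter kernel pod), so executors can
# connect back to the driver.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")
    .appName("client-mode-on-k8s")
    .config("spark.kubernetes.namespace", "spark")
    .config("spark.kubernetes.container.image", "my-registry/spark-py:2.4.0")
    .config("spark.executor.instances", "2")
    .config("spark.driver.host", "spark-driver-headless.spark.svc.cluster.local")
    .config("spark.driver.port", "29413")
    .getOrCreate()
)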

On Wed, Oct 31, 2018 at 5:55 PM Zhang, Yuqi  wrote:

> Hi Li,
>
>
>
> Thank you very much for your reply!
>
>
>
> > Did you make the headless service that reflects the driver pod name?
>
> I am not sure but I used “app” in the headless service as selector which
> is the same app name for the StatefulSet that will create the spark driver
> pod.
>
> For your reference, I attached the yaml file for making headless service
> and StatefulSet. Could you please help me take a look at it if you have
> time?
>
>
>
> I appreciate for your help & have a good day!
>
>
>
> Best Regards,
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
> [image: signature_147554612] <http://www.teradata.com/>
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com <http://www.teradata.com>
>
> This e-mail is from Teradata Corporation and may contain information that
> is confidential or proprietary. If you are not the intended recipient, do
> not read, copy or distribute the e-mail or any attachments. Instead, please
> notify the sender and delete the e-mail and any attachments. Thank you.
>
> Please consider the environment before printing.
>
>
>
>
>
>
>
> *From: *Li Gao 
> *Date: *Thursday, November 1, 2018 4:56
> *To: *"Zhang, Yuqi" 
> *Cc: *Gourav Sengupta , "user@spark.apache.org"
> , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> Hi Yuqi,
>
>
>
> Yes we are running Jupyter Gateway and kernels on k8s and using Spark
> 2.4's client mode to launch pyspark. In client mode your driver is running
> on the same pod where your kernel runs.
>
>
>
> I am planning to write some blog post on this on some future date. Did you
> make the headless service that reflects the driver pod name? Thats one of
> critical pieces we automated in our custom code that makes the client mode
> works.
>
>
>
> -Li
>
>
>
>
>
> On Wed, Oct 31, 2018 at 8:13 AM Zhang, Yuqi 
> wrote:
>
> Hi Li,
>
>
>
> Thank you for your reply.
>
> Do you mean running Jupyter client on k8s cluster to use spark 2.4?
> Actually I am also trying to set up JupyterHub on k8s to use spark, that’s
> why I would like to know how to run spark client mode on k8s cluster. If
> there is any related documentation on how to set up the Jupyter on k8s to
> use spark, could you please share with me?
>
>
>
> Thank you for your help!
>
>
>
> Best Regards,
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
> [image: signature_147554612] <http://www.teradata.com/>
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com <http://www.teradata.com>
>
> This e-mail is from Teradata Corporation and may contain information that
> is confidential or proprietary. If you are not the intended recipient, do
> not read, copy or distribute the e-mail or any attachments. Instead, please
> notify the sender and delete the e-mail and any attachments. Thank you.
>
> Please consider the environment before printing.
>
>
>
>
>
>
>
> *From: *Li Gao 
> *Date: *Thursday, November 1, 2018 0:07
> *To: *"Zhang, Yuqi" 
> *Cc: *"gourav.sengu...@gmail.com" , "
> user@spark.apache.org" , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> Yuqi,
>
>
>
> Your error seems unrelated to headless service config you need to enable.
> For headless service you need to create a headless service that matches to
> your driver pod name exactly in order for spark 2.4 RC to work under client
> mode. We have this running for a while now using Jupyter kernel as the
> driver client.
>
>
>
> -Li
>
>
>
>
>
> On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi 
> wrote:
>
> Hi Gourav,
>
>
>
> Thank you for your reply.
>
>
>
> I haven’t try glue or EMK, but I guess it’s integrating kubernetes on aws
> instances?
>
> I could set up the k8s cluster on AWS, but my problem is don’t know how to
> run spark-shell on kubernetes…
>
> Since spark only support client mode on k8s from 2.4 version which is not
>

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Zhang, Yuqi
Hi Li,

Thank you very much for your reply!

> Did you make the headless service that reflects the driver pod name?
I am not sure, but I used “app” as the selector in the headless service, which is 
the same app name as in the StatefulSet that will create the Spark driver pod.
For your reference, I attached the YAML file for creating the headless service 
and the StatefulSet. Could you please take a look at it if you have time?

I appreciate your help & have a good day!

Best Regards,
--
Yuqi Zhang
Software Engineer
m: 090-6725-6573





From: Li Gao 
Date: Thursday, November 1, 2018 4:56
To: "Zhang, Yuqi" 
Cc: Gourav Sengupta , "user@spark.apache.org" 
, "Nogami, Masatsugu" 
Subject: Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation 
regarding how to run spark-shell on k8s cluster?

Hi Yuqi,

Yes we are running Jupyter Gateway and kernels on k8s and using Spark 2.4's 
client mode to launch pyspark. In client mode your driver is running on the 
same pod where your kernel runs.

I am planning to write some blog post on this on some future date. Did you make 
the headless service that reflects the driver pod name? Thats one of critical 
pieces we automated in our custom code that makes the client mode works.

-Li


On Wed, Oct 31, 2018 at 8:13 AM Zhang, Yuqi 
mailto:yuqi.zh...@teradata.com>> wrote:
Hi Li,

Thank you for your reply.
Do you mean running Jupyter client on k8s cluster to use spark 2.4? Actually I 
am also trying to set up JupyterHub on k8s to use spark, that’s why I would 
like to know how to run spark client mode on k8s cluster. If there is any 
related documentation on how to set up the Jupyter on k8s to use spark, could 
you please share with me?

Thank you for your help!

Best Regards,
--
Yuqi Zhang
Software Engineer
m: 090-6725-6573





From: Li Gao mailto:ligao...@gmail.com>>
Date: Thursday, November 1, 2018 0:07
To: "Zhang, Yuqi" 
Cc: "gourav.sengu...@gmail.com<mailto:gourav.sengu...@gmail.com>" 
mailto:gourav.sengu...@gmail.com>>, 
"user@spark.apache.org<mailto:user@spark.apache.org>" 
mailto:user@spark.apache.org>>, "Nogami, Masatsugu" 

Subject: Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation 
regarding how to run spark-shell on k8s cluster?

Yuqi,

Your error seems unrelated to headless service config you need to enable. For 
headless service you need to create a headless service that matches to your 
driver pod name exactly in order for spark 2.4 RC to work under client mode. We 
have this running for a while now using Jupyter kernel as the driver client.

-Li


On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi 
mailto:yuqi.zh...@teradata.com>> wrote:
Hi Gourav,

Thank you for your reply.

I haven’t try glue or EMK, but I guess it’s integrating kubernetes on aws 
instances?
I could set up the k8s cluster on AWS, but my problem is don’t know how to run 
spark-shell on kubernetes…
Since spark only support client mode on k8s from 2.4 version which is not 
officially released yet, I would like to ask if there is more detailed 
documentation regarding the way to run spark-shell on k8s cluster?

Thank you in advance & best regards!

--
Yuqi Zhang
Software Engineer
m: 090-6725-6573





From: Gourav Sengupta 
mailto:gourav.sengu...@gmail.com>>
Date: Wednesday, October 31, 2018 18:34
To: "Zhang, Yuqi" 
Cc: user mailto:user@spark.ap

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Li Gao
Hi Yuqi,

Yes we are running Jupyter Gateway and kernels on k8s and using Spark 2.4's
client mode to launch pyspark. In client mode your driver is running on the
same pod where your kernel runs.

I am planning to write a blog post on this at some future date. Did you
make the headless service that reflects the driver pod name? That's one of the
critical pieces we automated in our custom code that makes client mode
work.
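
(For context, a rough sketch of what the client-mode session from the kernel pod can look like in PySpark; the master URL, image, service name and ports below are assumptions, not the configuration described in this thread.)

# Rough sketch of Spark 2.4 client mode from inside the kernel/driver pod.
# The headless service name must match this pod's name; values are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")
    .config("spark.submit.deployMode", "client")
    .config("spark.kubernetes.namespace", "spark")
    .config("spark.kubernetes.container.image", "my-registry/spark-py:2.4.0")  # hypothetical image
    # Executors connect back to the driver through the headless service:
    .config("spark.driver.host", "jupyter-driver-0.spark.svc.cluster.local")
    .config("spark.driver.port", "7078")
    .config("spark.driver.blockManager.port", "7079")
    .config("spark.executor.instances", "2")
    .appName("client-mode-sketch")
    .getOrCreate()
)

print(spark.range(1000).count())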

-Li


On Wed, Oct 31, 2018 at 8:13 AM Zhang, Yuqi  wrote:

> Hi Li,
>
>
>
> Thank you for your reply.
>
> Do you mean running Jupyter client on k8s cluster to use spark 2.4?
> Actually I am also trying to set up JupyterHub on k8s to use spark, that’s
> why I would like to know how to run spark client mode on k8s cluster. If
> there is any related documentation on how to set up the Jupyter on k8s to
> use spark, could you please share with me?
>
>
>
> Thank you for your help!
>
>
>
> Best Regards,
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
>
>
>
>
>
>
>
> *From: *Li Gao 
> *Date: *Thursday, November 1, 2018 0:07
> *To: *"Zhang, Yuqi" 
> *Cc: *"gourav.sengu...@gmail.com" , "
> user@spark.apache.org" , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> Yuqi,
>
>
>
> Your error seems unrelated to headless service config you need to enable.
> For headless service you need to create a headless service that matches to
> your driver pod name exactly in order for spark 2.4 RC to work under client
> mode. We have this running for a while now using Jupyter kernel as the
> driver client.
>
>
>
> -Li
>
>
>
>
>
> On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi 
> wrote:
>
> Hi Gourav,
>
>
>
> Thank you for your reply.
>
>
>
> I haven’t try glue or EMK, but I guess it’s integrating kubernetes on aws
> instances?
>
> I could set up the k8s cluster on AWS, but my problem is don’t know how to
> run spark-shell on kubernetes…
>
> Since spark only support client mode on k8s from 2.4 version which is not
> officially released yet, I would like to ask if there is more detailed
> documentation regarding the way to run spark-shell on k8s cluster?
>
>
>
> Thank you in advance & best regards!
>
>
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
>
>
>
>
>
>
>
> *From: *Gourav Sengupta 
> *Date: *Wednesday, October 31, 2018 18:34
> *To: *"Zhang, Yuqi" 
> *Cc: *user , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> [External Email]
> ------
>
> Just out of curiosity why would you not use Glue (which is Spark on
> kubernetes) or EMR?
>
>
>
> Regards,
>
> Gourav Sengupta
>
>
>
> On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi 
> wrote:
>
> Hello guys,
>
>
>
> I am Yuqi from Teradata Tokyo. Sorry to disturb but I have some problem
> regarding using spark 2.4 client mode function on kubernetes cluster, so I
> would like to ask if there is some solution to my problem.
>
>
>
> The problem is when I am trying to run spark-shell on kubernetes v1.11.3
> cluster on AWS environment, I couldn’t successfully run stateful set using
> the docker image built from spark 2.4. The error message is showing below.
> The version I

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Zhang, Yuqi
Hi Li,

Thank you for your reply.
Do you mean running the Jupyter client on a k8s cluster to use Spark 2.4? Actually I 
am also trying to set up JupyterHub on k8s to use Spark, which is why I would 
like to know how to run Spark client mode on a k8s cluster. If there is any 
related documentation on how to set up Jupyter on k8s to use Spark, could 
you please share it with me?

Thank you for your help!

Best Regards,
--
Yuqi Zhang
Software Engineer
m: 090-6725-6573





From: Li Gao 
Date: Thursday, November 1, 2018 0:07
To: "Zhang, Yuqi" 
Cc: "gourav.sengu...@gmail.com" , 
"user@spark.apache.org" , "Nogami, Masatsugu" 

Subject: Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation 
regarding how to run spark-shell on k8s cluster?

Yuqi,

Your error seems unrelated to headless service config you need to enable. For 
headless service you need to create a headless service that matches to your 
driver pod name exactly in order for spark 2.4 RC to work under client mode. We 
have this running for a while now using Jupyter kernel as the driver client.

-Li


On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi 
mailto:yuqi.zh...@teradata.com>> wrote:
Hi Gourav,

Thank you for your reply.

I haven’t try glue or EMK, but I guess it’s integrating kubernetes on aws 
instances?
I could set up the k8s cluster on AWS, but my problem is don’t know how to run 
spark-shell on kubernetes…
Since spark only support client mode on k8s from 2.4 version which is not 
officially released yet, I would like to ask if there is more detailed 
documentation regarding the way to run spark-shell on k8s cluster?

Thank you in advance & best regards!

--
Yuqi Zhang
Software Engineer
m: 090-6725-6573





From: Gourav Sengupta 
mailto:gourav.sengu...@gmail.com>>
Date: Wednesday, October 31, 2018 18:34
To: "Zhang, Yuqi" 
Cc: user mailto:user@spark.apache.org>>, "Nogami, 
Masatsugu" 
Subject: Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation 
regarding how to run spark-shell on k8s cluster?

[External Email]

Just out of curiosity why would you not use Glue (which is Spark on kubernetes) 
or EMR?

Regards,
Gourav Sengupta

On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi 
mailto:yuqi.zh...@teradata.com>> wrote:
Hello guys,

I am Yuqi from Teradata Tokyo. Sorry to disturb but I have some problem 
regarding using spark 2.4 client mode function on kubernetes cluster, so I 
would like to ask if there is some solution to my problem.

The problem is when I am trying to run spark-shell on kubernetes v1.11.3 
cluster on AWS environment, I couldn’t successfully run stateful set using the 
docker image built from spark 2.4. The error message is showing below. The 
version I am using is spark v2.4.0-rc3.

Also, I wonder if there is more documentation on how to use client-mode or 
integrate spark-shell on kubernetes cluster. From the documentation on 
https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md 
there is only a brief description. I understand it’s not the official released 
version yet, but If there is some more documentation, could you please share 
with me?

Thank you very much for your help!


Error msg:
+ env
+ sed 's/[^=]*=\(.*\)/\1/g'
+ sort -t_ -k4 -n
+ grep SPARK_JAVA_OPT_
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf 
"spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress= --deploy-mode client
Error: Missing application resource.
Usage: spark-submit [options]  [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submiss

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Li Gao
Yuqi,

Your error seems unrelated to the headless service config you need to enable. For
headless service you need to create a headless service that matches your
driver pod name exactly in order for Spark 2.4 RC to work under client mode. We
have had this running for a while now, using a Jupyter kernel as the driver client.
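
(Not the automation referred to above, but one way such a check might look: the kernel pod derives its own name and verifies that a matching headless service exists. Namespace handling and names are assumptions.)

# Illustrative only: the driver pod looks up its own name (the pod hostname)
# and checks that a headless Service with exactly that name exists.
import os
from kubernetes import client, config

config.load_incluster_config()
pod_name = os.environ["HOSTNAME"]          # pod name == hostname by default
namespace = open(
    "/var/run/secrets/kubernetes.io/serviceaccount/namespace"
).read().strip()

v1 = client.CoreV1Api()
try:
    svc = v1.read_namespaced_service(pod_name, namespace)
    assert svc.spec.cluster_ip == "None", "service exists but is not headless"
except client.rest.ApiException as exc:
    raise RuntimeError(
        f"expected a headless service named {pod_name!r} in {namespace!r}"
    ) from exc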

-Li


On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi  wrote:

> Hi Gourav,
>
>
>
> Thank you for your reply.
>
>
>
> I haven’t try glue or EMK, but I guess it’s integrating kubernetes on aws
> instances?
>
> I could set up the k8s cluster on AWS, but my problem is don’t know how to
> run spark-shell on kubernetes…
>
> Since spark only support client mode on k8s from 2.4 version which is not
> officially released yet, I would like to ask if there is more detailed
> documentation regarding the way to run spark-shell on k8s cluster?
>
>
>
> Thank you in advance & best regards!
>
>
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
>
>
>
>
>
>
>
> *From: *Gourav Sengupta 
> *Date: *Wednesday, October 31, 2018 18:34
> *To: *"Zhang, Yuqi" 
> *Cc: *user , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> [External Email]
> --
>
> Just out of curiosity why would you not use Glue (which is Spark on
> kubernetes) or EMR?
>
>
>
> Regards,
>
> Gourav Sengupta
>
>
>
> On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi 
> wrote:
>
> Hello guys,
>
>
>
> I am Yuqi from Teradata Tokyo. Sorry to disturb but I have some problem
> regarding using spark 2.4 client mode function on kubernetes cluster, so I
> would like to ask if there is some solution to my problem.
>
>
>
> The problem is when I am trying to run spark-shell on kubernetes v1.11.3
> cluster on AWS environment, I couldn’t successfully run stateful set using
> the docker image built from spark 2.4. The error message is showing below.
> The version I am using is spark v2.4.0-rc3.
>
>
>
> Also, I wonder if there is more documentation on how to use client-mode or
> integrate spark-shell on kubernetes cluster. From the documentation on
> https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md
> there is only a brief description. I understand it’s not the official
> released version yet, but If there is some more documentation, could you
> please share with me?
>
>
>
> Thank you very much for your help!
>
>
>
>
>
> Error msg:
>
> + env
>
> + sed 's/[^=]*=\(.*\)/\1/g'
>
> + sort -t_ -k4 -n
>
> + grep SPARK_JAVA_OPT_
>
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
>
> + '[' -n '' ']'
>
> + '[' -n '' ']'
>
> + PYSPARK_ARGS=
>
> + '[' -n '' ']'
>
> + R_ARGS=
>
> + '[' -n '' ']'
>
> + '[' '' == 2 ']'
>
> + '[' '' == 3 ']'
>
> + case "$SPARK_K8S_CMD" in
>
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client
> "$@")
>
> + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf
> spark.driver.bindAddress= --deploy-mode client
>
> Error: Missing application resource.
>
> Usage: spark-submit [options]  [app
> arguments]
>
> Usage: spark-submit --kill [submission ID] --master [spark://...]
>
> Usage: spark-submit --status [submission ID] --master [spark://...]
>
> Usage: spark-submit run-example [options] example-class [example args]
>
>
>
>
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
>
>
>
>
>
>


Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Zhang, Yuqi
Hi Gourav,

Thank you for your reply.

I haven’t tried Glue or EMR, but I guess it’s integrating Kubernetes on AWS 
instances?
I could set up the k8s cluster on AWS, but my problem is that I don’t know how to run 
spark-shell on Kubernetes…
Since Spark only supports client mode on k8s from version 2.4, which is not 
officially released yet, I would like to ask if there is more detailed 
documentation on how to run spark-shell on a k8s cluster?

Thank you in advance & best regards!

--
Yuqi Zhang
Software Engineer
m: 090-6725-6573





From: Gourav Sengupta 
Date: Wednesday, October 31, 2018 18:34
To: "Zhang, Yuqi" 
Cc: user , "Nogami, Masatsugu" 

Subject: Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation 
regarding how to run spark-shell on k8s cluster?

[External Email]

Just out of curiosity why would you not use Glue (which is Spark on kubernetes) 
or EMR?

Regards,
Gourav Sengupta

On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi 
mailto:yuqi.zh...@teradata.com>> wrote:
Hello guys,

I am Yuqi from Teradata Tokyo. Sorry to disturb but I have some problem 
regarding using spark 2.4 client mode function on kubernetes cluster, so I 
would like to ask if there is some solution to my problem.

The problem is when I am trying to run spark-shell on kubernetes v1.11.3 
cluster on AWS environment, I couldn’t successfully run stateful set using the 
docker image built from spark 2.4. The error message is showing below. The 
version I am using is spark v2.4.0-rc3.

Also, I wonder if there is more documentation on how to use client-mode or 
integrate spark-shell on kubernetes cluster. From the documentation on 
https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md 
there is only a brief description. I understand it’s not the official released 
version yet, but If there is some more documentation, could you please share 
with me?

Thank you very much for your help!


Error msg:
+ env
+ sed 's/[^=]*=\(.*\)/\1/g'
+ sort -t_ -k4 -n
+ grep SPARK_JAVA_OPT_
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf 
"spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress= --deploy-mode client
Error: Missing application resource.
Usage: spark-submit [options]  [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]


--
Yuqi Zhang
Software Engineer
m: 090-6725-6573






Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Biplob Biswas
Hi Yuqi,

Just curious, can you share the spark-submit script and what you are passing
as the --master argument?

Thanks & Regards
Biplob Biswas


On Wed, Oct 31, 2018 at 10:34 AM Gourav Sengupta 
wrote:

> Just out of curiosity why would you not use Glue (which is Spark on
> kubernetes) or EMR?
>
> Regards,
> Gourav Sengupta
>
> On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi 
> wrote:
>
>> Hello guys,
>>
>>
>>
>> I am Yuqi from Teradata Tokyo. Sorry to disturb but I have some problem
>> regarding using spark 2.4 client mode function on kubernetes cluster, so I
>> would like to ask if there is some solution to my problem.
>>
>>
>>
>> The problem is when I am trying to run spark-shell on kubernetes v1.11.3
>> cluster on AWS environment, I couldn’t successfully run stateful set using
>> the docker image built from spark 2.4. The error message is showing below.
>> The version I am using is spark v2.4.0-rc3.
>>
>>
>>
>> Also, I wonder if there is more documentation on how to use client-mode
>> or integrate spark-shell on kubernetes cluster. From the documentation on
>> https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md
>> there is only a brief description. I understand it’s not the official
>> released version yet, but If there is some more documentation, could you
>> please share with me?
>>
>>
>>
>> Thank you very much for your help!
>>
>>
>>
>>
>>
>> Error msg:
>>
>> + env
>>
>> + sed 's/[^=]*=\(.*\)/\1/g'
>>
>> + sort -t_ -k4 -n
>>
>> + grep SPARK_JAVA_OPT_
>>
>> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
>>
>> + '[' -n '' ']'
>>
>> + '[' -n '' ']'
>>
>> + PYSPARK_ARGS=
>>
>> + '[' -n '' ']'
>>
>> + R_ARGS=
>>
>> + '[' -n '' ']'
>>
>> + '[' '' == 2 ']'
>>
>> + '[' '' == 3 ']'
>>
>> + case "$SPARK_K8S_CMD" in
>>
>> + CMD=("$SPARK_HOME/bin/spark-submit" --conf
>> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client
>> "$@")
>>
>> + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf
>> spark.driver.bindAddress= --deploy-mode client
>>
>> Error: Missing application resource.
>>
>> Usage: spark-submit [options]  [app
>> arguments]
>>
>> Usage: spark-submit --kill [submission ID] --master [spark://...]
>>
>> Usage: spark-submit --status [submission ID] --master [spark://...]
>>
>> Usage: spark-submit run-example [options] example-class [example args]
>>
>>
>>
>>
>>
>> --
>>
>> Yuqi Zhang
>>
>> Software Engineer
>>
>> m: 090-6725-6573
>>
>>
>>
>>
>>
>>
>>
>


Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Gourav Sengupta
Just out of curiosity why would you not use Glue (which is Spark on
kubernetes) or EMR?

Regards,
Gourav Sengupta

On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi  wrote:

> Hello guys,
>
>
>
> I am Yuqi from Teradata Tokyo. Sorry to disturb but I have some problem
> regarding using spark 2.4 client mode function on kubernetes cluster, so I
> would like to ask if there is some solution to my problem.
>
>
>
> The problem is when I am trying to run spark-shell on kubernetes v1.11.3
> cluster on AWS environment, I couldn’t successfully run stateful set using
> the docker image built from spark 2.4. The error message is showing below.
> The version I am using is spark v2.4.0-rc3.
>
>
>
> Also, I wonder if there is more documentation on how to use client-mode or
> integrate spark-shell on kubernetes cluster. From the documentation on
> https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md
> there is only a brief description. I understand it’s not the official
> released version yet, but If there is some more documentation, could you
> please share with me?
>
>
>
> Thank you very much for your help!
>
>
>
>
>
> Error msg:
>
> + env
>
> + sed 's/[^=]*=\(.*\)/\1/g'
>
> + sort -t_ -k4 -n
>
> + grep SPARK_JAVA_OPT_
>
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
>
> + '[' -n '' ']'
>
> + '[' -n '' ']'
>
> + PYSPARK_ARGS=
>
> + '[' -n '' ']'
>
> + R_ARGS=
>
> + '[' -n '' ']'
>
> + '[' '' == 2 ']'
>
> + '[' '' == 3 ']'
>
> + case "$SPARK_K8S_CMD" in
>
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client
> "$@")
>
> + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf
> spark.driver.bindAddress= --deploy-mode client
>
> Error: Missing application resource.
>
> Usage: spark-submit [options]  [app
> arguments]
>
> Usage: spark-submit --kill [submission ID] --master [spark://...]
>
> Usage: spark-submit --status [submission ID] --master [spark://...]
>
> Usage: spark-submit run-example [options] example-class [example args]
>
>
>
>
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
>
>
>
>
>


[Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-28 Thread Zhang, Yuqi
Hello guys,

I am Yuqi from Teradata Tokyo. Sorry to disturb, but I have a problem 
using the Spark 2.4 client mode feature on a Kubernetes cluster, so I 
would like to ask if there is a solution to my problem.

The problem is that when I try to run spark-shell on a Kubernetes v1.11.3 
cluster in an AWS environment, I cannot successfully run a StatefulSet using the 
Docker image built from Spark 2.4. The error message is shown below. The 
version I am using is Spark v2.4.0-rc3.

Also, I wonder if there is more documentation on how to use client mode or 
integrate spark-shell on a Kubernetes cluster. The documentation at 
https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md 
gives only a brief description. I understand it’s not the officially released 
version yet, but if there is some more documentation, could you please share 
it with me?

Thank you very much for your help!


Error msg:
+ env
+ sed 's/[^=]*=\(.*\)/\1/g'
+ sort -t_ -k4 -n
+ grep SPARK_JAVA_OPT_
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf 
"spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress= --deploy-mode client
Error: Missing application resource.
Usage: spark-submit [options]  [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
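
(The trace shows the image's default entrypoint exec'ing spark-submit with no application and an empty spark.driver.bindAddress, which is what produces "Missing application resource". One possible workaround, sketched here with placeholder names rather than the original YAML: override the container command so the pod just idles, then start Spark in client mode from inside it.)

# Hedged workaround sketch, not from the original thread: bypass the default
# entrypoint by overriding the container command, so the pod stays up for an
# interactive client-mode driver. StatefulSet, namespace and container names
# are placeholders.
from kubernetes import client, config

config.load_kube_config()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "spark-driver",   # must match the container name in the spec
                        "command": ["/bin/sh", "-c"],
                        "args": ["trap : TERM INT; sleep infinity & wait"],
                    }
                ]
            }
        }
    }
}

client.AppsV1Api().patch_namespaced_stateful_set(
    name="spark-driver", namespace="spark", body=patch
)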


--
Yuqi Zhang
Software Engineer
m: 090-6725-6573






Re: AWS credentials needed while trying to read a model from S3 in Spark

2018-05-09 Thread Srinath C
You could use IAM roles in AWS to access the data in S3 without credentials.
See this link
<https://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_s3.html>
and
this link
<http://parthicloud.com/how-to-access-s3-bucket-from-application-on-amazon-ec2-without-access-credentials/>
for an example.
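
(As a rough illustration of the IAM-role approach, with a placeholder bucket and model path: point the s3a connector at the instance-profile credentials so no keys appear anywhere.)

# Sketch: rely on the EC2 instance profile instead of embedding keys.
# Bucket/paths are placeholders; requires hadoop-aws plus a matching AWS SDK
# on the classpath.
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = (
    SparkSession.builder
    .appName("load-model-from-s3")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    .getOrCreate()
)

# No access key / secret key in the URI: the IAM role supplies credentials.
model = PipelineModel.load("s3a://my-bucket/models/my-pipeline-model")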

Regards,
Srinath.


On Thu, May 10, 2018 at 7:04 AM, Mina Aslani <aslanim...@gmail.com> wrote:

> Hi,
>
> I am trying to load a ML model from AWS S3 in my spark app running in a
> docker container, however I need to pass the AWS credentials.
> My questions is, why do I need to pass the credentials in the path?
> And what is the workaround?
>
> Best regards,
> Mina
>


AWS credentials needed while trying to read a model from S3 in Spark

2018-05-09 Thread Mina Aslani
Hi,

I am trying to load a ML model from AWS S3 in my spark app running in a
docker container, however I need to pass the AWS credentials.
My question is: why do I need to pass the credentials in the path?
And what is the workaround?
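
(One common alternative to embedding credentials in the path, sketched with placeholder values: pass them to the s3a connector through the Hadoop configuration, or better, use an IAM role as suggested in the reply.)

# Sketch: keep credentials out of the s3a:// URI by passing them as Hadoop conf.
# Values are placeholders pulled from the environment.
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-credentials-example")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/some/data")  # clean URI, no embedded keys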

Best regards,
Mina


Spark Structured Streaming how to read data from AWS SQS

2017-12-11 Thread Bogdan Cojocar
For Spark Streaming there are connectors that can achieve this functionality.

Unfortunately, for Spark Structured Streaming I couldn't find any, as it's a
newer technology. Is there a way to connect to a source using a Spark
Streaming connector? Or is there a way to create a custom connector, similar
to the way one can be created in a Spark Streaming application?
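
(Absent a native source, one workaround sketch, with a placeholder queue URL and landing directory: drain SQS with boto3 into files that a Structured Streaming file source then picks up.)

# Hedged workaround sketch: no native SQS source, so drain SQS into JSON files
# and let Structured Streaming's file source pick them up. Names are placeholders.
import json, time, uuid
import boto3
from pyspark.sql import SparkSession

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder
LANDING_DIR = "/tmp/sqs-landing"                                         # placeholder

def drain_sqs_forever():
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=10
        )
        messages = resp.get("Messages", [])
        if messages:
            path = f"{LANDING_DIR}/{uuid.uuid4()}.json"
            with open(path, "w") as f:
                for m in messages:
                    f.write(json.dumps({"body": m["Body"]}) + "\n")
            for m in messages:
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])
        time.sleep(1)

spark = SparkSession.builder.appName("sqs-bridge").getOrCreate()
query = (
    spark.readStream.schema("body STRING").json(LANDING_DIR)
    .writeStream.format("console").start()
)
# drain_sqs_forever() would run in a separate process/thread alongside the query.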


Many thanks,

Bogdan Cojocar


Re: Quick one... AWS SDK version?

2017-10-08 Thread Jonathan Kelly
Tushar,

Yes, the hadoop-aws jar installed on an emr-5.8.0 cluster was built with
AWS Java SDK 1.11.160, if that’s what you mean.

~ Jonathan
On Sun, Oct 8, 2017 at 8:42 AM Tushar Sudake <etusha...@gmail.com> wrote:

> Hi Jonathan,
>
> Does that mean Hadoop-AWS 2.7.3 too is built against AWS SDK 1.11.160 and
> not 1.7.4?
>
> Thanks.
>
>
> On Oct 7, 2017 3:50 PM, "Jean Georges Perrin" <j...@jgp.net> wrote:
>
>
> Hey Marco,
>
> I am actually reading from S3 and I use 2.7.3, but I inherited the project
> and they use some AWS API from Amazon SDK, which version is like from
> yesterday :) so it’s confused and AMZ is changing its version like crazy so
> it’s a little difficult to follow. Right now I went back to 2.7.3 and SDK
> 1.7.4...
>
> jg
>
>
> On Oct 7, 2017, at 15:34, Marco Mistroni <mmistr...@gmail.com> wrote:
>
> Hi JG
>  out of curiosity what's ur usecase? are you writing to S3? you could use
> Spark to do that , e.g using hadoop package
> org.apache.hadoop:hadoop-aws:2.7.1 ..that will download the aws client
> which is in line with hadoop 2.7.1?
>
> hth
>  marco
>
> On Fri, Oct 6, 2017 at 10:58 PM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
>
>> Note: EMR builds Hadoop, Spark, et al, from source against specific
>> versions of certain packages like the AWS Java SDK, httpclient/core,
>> Jackson, etc., sometimes requiring some patches in these applications in
>> order to work with versions of these dependencies that differ from what the
>> applications may support upstream.
>>
>> For emr-5.8.0, we have built Hadoop and Spark (the Spark Kinesis
>> connector, that is, since that's the only part of Spark that actually
>> depends upon the AWS Java SDK directly) against AWS Java SDK 1.11.160
>> instead of the much older version that vanilla Hadoop 2.7.3 would otherwise
>> depend upon.
>>
>> ~ Jonathan
>>
>> On Wed, Oct 4, 2017 at 7:17 AM Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>>> On 3 Oct 2017, at 21:37, JG Perrin <jper...@lumeris.com> wrote:
>>>
>>> Sorry Steve – I may not have been very clear: thinking about
>>> aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled
>>> with Spark.
>>>
>>>
>>>
>>> I know, but if you are talking to s3 via the s3a client, you will need
>>> the SDK version to match the hadoop-aws JAR of the same version of Hadoop
>>> your JARs have. Similarly, if you were using spark-kinesis, it needs to be
>>> in sync there.
>>>
>>>
>>> *From:* Steve Loughran [mailto:ste...@hortonworks.com
>>> <ste...@hortonworks.com>]
>>> *Sent:* Tuesday, October 03, 2017 2:20 PM
>>> *To:* JG Perrin <jper...@lumeris.com>
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: Quick one... AWS SDK version?
>>>
>>>
>>>
>>> On 3 Oct 2017, at 02:28, JG Perrin <jper...@lumeris.com> wrote:
>>>
>>> Hey Sparkians,
>>>
>>> What version of AWS Java SDK do you use with Spark 2.2? Do you stick
>>> with the Hadoop 2.7.3 libs?
>>>
>>>
>>> You generally to have to stick with the version which hadoop was built
>>> with I'm afraid...very brittle dependency.
>>>
>>>
>
>


Re: Quick one... AWS SDK version?

2017-10-08 Thread Tushar Sudake
Hi Jonathan,

Does that mean Hadoop-AWS 2.7.3 too is built against AWS SDK 1.11.160 and
not 1.7.4?

Thanks.


On Oct 7, 2017 3:50 PM, "Jean Georges Perrin" <j...@jgp.net> wrote:


Hey Marco,

I am actually reading from S3 and I use 2.7.3, but I inherited the project
and they use some AWS API from Amazon SDK, which version is like from
yesterday :) so it’s confused and AMZ is changing its version like crazy so
it’s a little difficult to follow. Right now I went back to 2.7.3 and SDK
1.7.4...

jg


On Oct 7, 2017, at 15:34, Marco Mistroni <mmistr...@gmail.com> wrote:

Hi JG
 out of curiosity what's ur usecase? are you writing to S3? you could use
Spark to do that , e.g using hadoop package  org.apache.hadoop:hadoop-aws:2.7.1
..that will download the aws client which is in line with hadoop 2.7.1?

hth
 marco

On Fri, Oct 6, 2017 at 10:58 PM, Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> Note: EMR builds Hadoop, Spark, et al, from source against specific
> versions of certain packages like the AWS Java SDK, httpclient/core,
> Jackson, etc., sometimes requiring some patches in these applications in
> order to work with versions of these dependencies that differ from what the
> applications may support upstream.
>
> For emr-5.8.0, we have built Hadoop and Spark (the Spark Kinesis
> connector, that is, since that's the only part of Spark that actually
> depends upon the AWS Java SDK directly) against AWS Java SDK 1.11.160
> instead of the much older version that vanilla Hadoop 2.7.3 would otherwise
> depend upon.
>
> ~ Jonathan
>
> On Wed, Oct 4, 2017 at 7:17 AM Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>> On 3 Oct 2017, at 21:37, JG Perrin <jper...@lumeris.com> wrote:
>>
>> Sorry Steve – I may not have been very clear: thinking about
>> aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled
>> with Spark.
>>
>>
>>
>> I know, but if you are talking to s3 via the s3a client, you will need
>> the SDK version to match the hadoop-aws JAR of the same version of Hadoop
>> your JARs have. Similarly, if you were using spark-kinesis, it needs to be
>> in sync there.
>>
>>
>> *From:* Steve Loughran [mailto:ste...@hortonworks.com
>> <ste...@hortonworks.com>]
>> *Sent:* Tuesday, October 03, 2017 2:20 PM
>> *To:* JG Perrin <jper...@lumeris.com>
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Quick one... AWS SDK version?
>>
>>
>>
>> On 3 Oct 2017, at 02:28, JG Perrin <jper...@lumeris.com> wrote:
>>
>> Hey Sparkians,
>>
>> What version of AWS Java SDK do you use with Spark 2.2? Do you stick with
>> the Hadoop 2.7.3 libs?
>>
>>
>> You generally to have to stick with the version which hadoop was built
>> with I'm afraid...very brittle dependency.
>>
>>


Re: Quick one... AWS SDK version?

2017-10-07 Thread Jean Georges Perrin

Hey Marco,

I am actually reading from S3 and I use 2.7.3, but I inherited the project and 
they use some AWS API from the Amazon SDK, whose version is like from yesterday :) 
so it’s confusing, and AMZ is changing its version like crazy, so it’s a little 
difficult to follow. Right now I went back to 2.7.3 and SDK 1.7.4...

jg


> On Oct 7, 2017, at 15:34, Marco Mistroni <mmistr...@gmail.com> wrote:
> 
> Hi JG
>  out of curiosity what's ur usecase? are you writing to S3? you could use 
> Spark to do that , e.g using hadoop package  
> org.apache.hadoop:hadoop-aws:2.7.1 ..that will download the aws client which 
> is in line with hadoop 2.7.1?
> 
> hth
>  marco
> 
>> On Fri, Oct 6, 2017 at 10:58 PM, Jonathan Kelly <jonathaka...@gmail.com> 
>> wrote:
>> Note: EMR builds Hadoop, Spark, et al, from source against specific versions 
>> of certain packages like the AWS Java SDK, httpclient/core, Jackson, etc., 
>> sometimes requiring some patches in these applications in order to work with 
>> versions of these dependencies that differ from what the applications may 
>> support upstream.
>> 
>> For emr-5.8.0, we have built Hadoop and Spark (the Spark Kinesis connector, 
>> that is, since that's the only part of Spark that actually depends upon the 
>> AWS Java SDK directly) against AWS Java SDK 1.11.160 instead of the much 
>> older version that vanilla Hadoop 2.7.3 would otherwise depend upon.
>> 
>> ~ Jonathan
>> 
>>> On Wed, Oct 4, 2017 at 7:17 AM Steve Loughran <ste...@hortonworks.com> 
>>> wrote:
>>>> On 3 Oct 2017, at 21:37, JG Perrin <jper...@lumeris.com> wrote:
>>>> 
>>>> Sorry Steve – I may not have been very clear: thinking about 
>>>> aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled 
>>>> with Spark.
>>> 
>>> 
>>> I know, but if you are talking to s3 via the s3a client, you will need the 
>>> SDK version to match the hadoop-aws JAR of the same version of Hadoop your 
>>> JARs have. Similarly, if you were using spark-kinesis, it needs to be in 
>>> sync there. 
>>>>  
>>>> From: Steve Loughran [mailto:ste...@hortonworks.com] 
>>>> Sent: Tuesday, October 03, 2017 2:20 PM
>>>> To: JG Perrin <jper...@lumeris.com>
>>>> Cc: user@spark.apache.org
>>>> Subject: Re: Quick one... AWS SDK version?
>>>>  
>>>>  
>>>> On 3 Oct 2017, at 02:28, JG Perrin <jper...@lumeris.com> wrote:
>>>>  
>>>> Hey Sparkians,
>>>>  
>>>> What version of AWS Java SDK do you use with Spark 2.2? Do you stick with 
>>>> the Hadoop 2.7.3 libs?
>>>>  
>>>> You generally to have to stick with the version which hadoop was built 
>>>> with I'm afraid...very brittle dependency. 
> 


Re: Quick one... AWS SDK version?

2017-10-07 Thread Marco Mistroni
Hi JG
 out of curiosity, what's your use case? Are you writing to S3? You could use
Spark to do that, e.g. using the hadoop package
org.apache.hadoop:hadoop-aws:2.7.1, which will download the AWS client
that is in line with Hadoop 2.7.1.
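
(To make that concrete, a PySpark sketch pairing hadoop-aws with the AWS SDK line it was built against; versions and the bucket path are assumptions to adjust for your Hadoop build.)

# Sketch: pull in hadoop-aws together with the AWS SDK version it was built
# against (the 2.7.x line pairs with aws-java-sdk 1.7.4). Values are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-with-matching-jars")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4",
    )
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/path/")  # placeholder path
df.printSchema()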

hth
 marco

On Fri, Oct 6, 2017 at 10:58 PM, Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> Note: EMR builds Hadoop, Spark, et al, from source against specific
> versions of certain packages like the AWS Java SDK, httpclient/core,
> Jackson, etc., sometimes requiring some patches in these applications in
> order to work with versions of these dependencies that differ from what the
> applications may support upstream.
>
> For emr-5.8.0, we have built Hadoop and Spark (the Spark Kinesis
> connector, that is, since that's the only part of Spark that actually
> depends upon the AWS Java SDK directly) against AWS Java SDK 1.11.160
> instead of the much older version that vanilla Hadoop 2.7.3 would otherwise
> depend upon.
>
> ~ Jonathan
>
> On Wed, Oct 4, 2017 at 7:17 AM Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>> On 3 Oct 2017, at 21:37, JG Perrin <jper...@lumeris.com> wrote:
>>
>> Sorry Steve – I may not have been very clear: thinking about
>> aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled
>> with Spark.
>>
>>
>>
>> I know, but if you are talking to s3 via the s3a client, you will need
>> the SDK version to match the hadoop-aws JAR of the same version of Hadoop
>> your JARs have. Similarly, if you were using spark-kinesis, it needs to be
>> in sync there.
>>
>>
>> *From:* Steve Loughran [mailto:ste...@hortonworks.com
>> <ste...@hortonworks.com>]
>> *Sent:* Tuesday, October 03, 2017 2:20 PM
>> *To:* JG Perrin <jper...@lumeris.com>
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Quick one... AWS SDK version?
>>
>>
>>
>> On 3 Oct 2017, at 02:28, JG Perrin <jper...@lumeris.com> wrote:
>>
>> Hey Sparkians,
>>
>> What version of AWS Java SDK do you use with Spark 2.2? Do you stick with
>> the Hadoop 2.7.3 libs?
>>
>>
>> You generally to have to stick with the version which hadoop was built
>> with I'm afraid...very brittle dependency.
>>
>>


Re: Quick one... AWS SDK version?

2017-10-06 Thread Jonathan Kelly
Note: EMR builds Hadoop, Spark, et al, from source against specific
versions of certain packages like the AWS Java SDK, httpclient/core,
Jackson, etc., sometimes requiring some patches in these applications in
order to work with versions of these dependencies that differ from what the
applications may support upstream.

For emr-5.8.0, we have built Hadoop and Spark (the Spark Kinesis connector,
that is, since that's the only part of Spark that actually depends upon the
AWS Java SDK directly) against AWS Java SDK 1.11.160 instead of the much
older version that vanilla Hadoop 2.7.3 would otherwise depend upon.

~ Jonathan

On Wed, Oct 4, 2017 at 7:17 AM Steve Loughran <ste...@hortonworks.com>
wrote:

> On 3 Oct 2017, at 21:37, JG Perrin <jper...@lumeris.com> wrote:
>
> Sorry Steve – I may not have been very clear: thinking about
> aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled
> with Spark.
>
>
>
> I know, but if you are talking to s3 via the s3a client, you will need the
> SDK version to match the hadoop-aws JAR of the same version of Hadoop your
> JARs have. Similarly, if you were using spark-kinesis, it needs to be in
> sync there.
>
>
> *From:* Steve Loughran [mailto:ste...@hortonworks.com
> <ste...@hortonworks.com>]
> *Sent:* Tuesday, October 03, 2017 2:20 PM
> *To:* JG Perrin <jper...@lumeris.com>
> *Cc:* user@spark.apache.org
> *Subject:* Re: Quick one... AWS SDK version?
>
>
>
> On 3 Oct 2017, at 02:28, JG Perrin <jper...@lumeris.com> wrote:
>
> Hey Sparkians,
>
> What version of AWS Java SDK do you use with Spark 2.2? Do you stick with
> the Hadoop 2.7.3 libs?
>
>
> You generally to have to stick with the version which hadoop was built
> with I'm afraid...very brittle dependency.
>
>


Re: Quick one... AWS SDK version?

2017-10-04 Thread Steve Loughran

On 3 Oct 2017, at 21:37, JG Perrin 
<jper...@lumeris.com<mailto:jper...@lumeris.com>> wrote:

Sorry Steve – I may not have been very clear: thinking about 
aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled with 
Spark.


I know, but if you are talking to s3 via the s3a client, you will need the SDK 
version to match the hadoop-aws JAR of the same version of Hadoop your JARs 
have. Similarly, if you were using spark-kinesis, it needs to be in sync there.

From: Steve Loughran [mailto:ste...@hortonworks.com]
Sent: Tuesday, October 03, 2017 2:20 PM
To: JG Perrin <jper...@lumeris.com<mailto:jper...@lumeris.com>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Quick one... AWS SDK version?


On 3 Oct 2017, at 02:28, JG Perrin 
<jper...@lumeris.com<mailto:jper...@lumeris.com>> wrote:

Hey Sparkians,

What version of AWS Java SDK do you use with Spark 2.2? Do you stick with the 
Hadoop 2.7.3 libs?

You generally have to stick with the version which Hadoop was built with, I'm 
afraid... very brittle dependency.



RE: Quick one... AWS SDK version?

2017-10-03 Thread JG Perrin
Sorry Steve - I may not have been very clear: thinking about 
aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled with 
Spark.

From: Steve Loughran [mailto:ste...@hortonworks.com]
Sent: Tuesday, October 03, 2017 2:20 PM
To: JG Perrin <jper...@lumeris.com>
Cc: user@spark.apache.org
Subject: Re: Quick one... AWS SDK version?


On 3 Oct 2017, at 02:28, JG Perrin 
<jper...@lumeris.com<mailto:jper...@lumeris.com>> wrote:

Hey Sparkians,

What version of AWS Java SDK do you use with Spark 2.2? Do you stick with the 
Hadoop 2.7.3 libs?

You generally to have to stick with the version which hadoop was built with I'm 
afraid...very brittle dependency.


RE: Quick one... AWS SDK version?

2017-10-03 Thread JG Perrin
Thanks Yash… this is helpful!

From: Yash Sharma [mailto:yash...@gmail.com]
Sent: Tuesday, October 03, 2017 1:02 AM
To: JG Perrin <jper...@lumeris.com>; user@spark.apache.org
Subject: Re: Quick one... AWS SDK version?


Hi JG,
Here are my cluster configs if it helps.

Cheers.

EMR: emr-5.8.0
Hadoop distribution: Amazon 2.7.3
AWS sdk: /usr/share/aws/aws-java-sdk/aws-java-sdk-1.11.160.jar
Applications:
Hive 2.3.0
Spark 2.2.0
Tez 0.8.4

On Tue, 3 Oct 2017 at 12:29 JG Perrin 
<jper...@lumeris.com<mailto:jper...@lumeris.com>> wrote:
Hey Sparkians,

What version of AWS Java SDK do you use with Spark 2.2? Do you stick with the 
Hadoop 2.7.3 libs?

Thanks!

jg


Re: Quick one... AWS SDK version?

2017-10-03 Thread Steve Loughran

On 3 Oct 2017, at 02:28, JG Perrin 
<jper...@lumeris.com<mailto:jper...@lumeris.com>> wrote:

Hey Sparkians,

What version of AWS Java SDK do you use with Spark 2.2? Do you stick with the 
Hadoop 2.7.3 libs?

You generally have to stick with the version which Hadoop was built with, I'm 
afraid... very brittle dependency.


Re: Quick one... AWS SDK version?

2017-10-03 Thread Yash Sharma
Hi JG,
Here are my cluster configs if it helps.

Cheers.

EMR: emr-5.8.0
Hadoop distribution: Amazon 2.7.3
AWS sdk: /usr/share/aws/aws-java-sdk/aws-java-sdk-1.11.160.jar

Applications:
Hive 2.3.0
Spark 2.2.0
Tez 0.8.4


On Tue, 3 Oct 2017 at 12:29 JG Perrin <jper...@lumeris.com> wrote:

> Hey Sparkians,
>
>
>
> What version of AWS Java SDK do you use with Spark 2.2? Do you stick with
> the Hadoop 2.7.3 libs?
>
>
>
> Thanks!
>
>
>
> jg
>


Quick one... AWS SDK version?

2017-10-02 Thread JG Perrin
Hey Sparkians,

What version of AWS Java SDK do you use with Spark 2.2? Do you stick with the 
Hadoop 2.7.3 libs?

Thanks!

jg


Spark ES Connector -- AWS Managed ElasticSearch Services

2017-08-01 Thread Deepak Sharma
I am trying to connect to the AWS managed ES service using the Spark ES Connector,
but am not able to.

I am passing es.nodes and es.port along with es.nodes.wan.only set to true.
But it fails with below error:

34 ERROR NetworkClient: Node [x.x.x.x:443] failed (The server x.x.x.x
failed to respond); no other nodes left - aborting...

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES
version - typically this happens if the network/Elasticsearch cluster is
not accessible or when targeting a WAN/Cloud instance without the proper
setting 'es.nodes.wan.only'

  at
org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:294)

  at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:103)

  at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:79)

  at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:76)

  ... 50 elided

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException:
Connection error (check network and/or proxy settings)- all nodes failed;
tried [[x.x.x.x:443]]

I just wanted to check if anyone has already connected to the managed ES
service of AWS from Spark, and how it can be done?
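
(For reference, a hedged sketch of the options usually involved for a hosted HTTPS endpoint on port 443; the endpoint, index and connector version are placeholders, and the connector version still has to match the cluster.)

# Sketch: writing to an AWS-managed Elasticsearch endpoint over HTTPS with the
# elasticsearch-hadoop connector. Endpoint, index and versions are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("es-write-sketch")
    .config("spark.jars.packages", "org.elasticsearch:elasticsearch-spark-20_2.11:6.2.4")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "vpc-my-domain-abc123.us-east-1.es.amazonaws.com")  # placeholder
   .option("es.port", "443")
   .option("es.net.ssl", "true")
   .option("es.nodes.wan.only", "true")   # needed for a hosted / WAN endpoint
   .mode("append")
   .save("myindex/docs"))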
-- 
Thanks
Deepak


Re: Running Spark and YARN on AWS EMR

2017-07-17 Thread Takashi Sasaki
Hi Josh,


As you say, I also recognize the problem. I feel I got a warning when
specifying a huge data set.


We also adjust the partition size, but we do it through command-line options
or in code rather than relying on the default settings.


Regards,

Takashi

2017-07-18 6:48 GMT+09:00 Josh Holbrook <josh.holbr...@fusion.net>:
> I just ran into this issue! Small world.
>
> As far as I can tell, by default spark on EMR is completely untuned, but it
> comes with a flag that you can set to tell EMR to autotune spark. In your
> configuration.json file, you can add something like:
>
>   {
> "Classification": "spark",
> "Properties": {
>   "maximizeResourceAllocation": "true"
> }
>   },
>
> but keep in mind that, again as far as I can tell, the default parallelism
> with this config is merely twice the number of executor cores--so for a 10
> machine cluster w/ 3 active cores each, 60 partitions. This is pretty low,
> so you'll likely want to adjust this--I'm currently using the following
> because spark chokes on datasets that are bigger than about 2g per
> partition:
>
>   {
> "Classification": "spark-defaults",
> "Properties": {
>   "spark.default.parallelism": "1000"
> }
>   }
>
> Good luck, and I hope this is helpful!
>
> --Josh
>
>
> On Mon, Jul 17, 2017 at 4:59 PM, Takashi Sasaki <tsasaki...@gmail.com>
> wrote:
>>
>> Hi Pascal,
>>
>> The error also occurred frequently in our project.
>>
>> As a solution, it was effective to specify the memory size directly
>> with spark-submit command.
>>
>> eg. spark-submit executor-memory 2g
>>
>>
>> Regards,
>>
>> Takashi
>>
>> > 2017-07-18 5:18 GMT+09:00 Pascal Stammer <stam...@deichbrise.de>:
>> >> Hi,
>> >>
>> >> I am running a Spark 2.1.x Application on AWS EMR with YARN and get
>> >> following error that kill my application:
>> >>
>> >> AM Container for appattempt_1500320286695_0001_01 exited with
>> >> exitCode:
>> >> -104
>> >> For more detailed output, check application tracking
>> >>
>> >> page:http://ip-172-31-35-192.eu-central-1.compute.internal:8088/cluster/app/application_1500320286695_0001Then,
>> >> click on links to logs of each attempt.
>> >> Diagnostics: Container
>> >> [pid=9216,containerID=container_1500320286695_0001_01_01] is
>> >> running
>> >> beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical
>> >> memory used; 3.3 GB of 6.9 GB virtual memory used. Killing container.
>> >>
>> >>
>> >> I already change spark.yarn.executor.memoryOverhead but the error still
>> >> occurs. Does anybody have a hint for me which parameter or
>> >> configuration I
>> >> have to adapt.
>> >>
>> >> Thank you very much.
>> >>
>> >> Regards,
>> >>
>> >> Pascal Stammer
>> >>
>> >>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Running Spark and YARN on AWS EMR

2017-07-17 Thread Josh Holbrook
I just ran into this issue! Small world.

As far as I can tell, by default spark on EMR is completely untuned, but it
comes with a flag that you can set to tell EMR to autotune spark. In your
configuration.json file, you can add something like:

  {
"Classification": "spark",
"Properties": {
  "maximizeResourceAllocation": "true"
}
  },

but keep in mind that, again as far as I can tell, the default parallelism
with this config is merely twice the number of executor cores--so for a 10
machine cluster w/ 3 active cores each, 60 partitions. This is pretty low,
so you'll likely want to adjust this--I'm currently using the following
because spark chokes on datasets that are bigger than about 2g per
partition:

  {
"Classification": "spark-defaults",
"Properties": {
  "spark.default.parallelism": "1000"
}
  }
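
(If you would rather not hard-code a number, a rough sketch of the same idea done in code; the 512 MB target and paths are assumptions, not EMR settings.)

# Rough sketch: size the partition count from the input size instead of a fixed
# spark.default.parallelism. The 512 MB target is an arbitrary assumption.
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sizing-sketch").getOrCreate()

path = "s3a://my-bucket/big-dataset/"          # placeholder
target_bytes = 512 * 1024 * 1024               # keep partitions well under ~2 GB

# Estimate input size via the Hadoop FS API (uses Spark's internal JVM gateway).
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(jvm.java.net.URI(path), conf)
total_bytes = fs.getContentSummary(jvm.org.apache.hadoop.fs.Path(path)).getLength()

num_partitions = max(200, math.ceil(total_bytes / target_bytes))
df = spark.read.parquet(path).repartition(num_partitions)
print(num_partitions, df.rdd.getNumPartitions())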

Good luck, and I hope this is helpful!

--Josh


On Mon, Jul 17, 2017 at 4:59 PM, Takashi Sasaki <tsasaki...@gmail.com>
wrote:

> Hi Pascal,
>
> The error also occurred frequently in our project.
>
> As a solution, it was effective to specify the memory size directly
> with spark-submit command.
>
> eg. spark-submit executor-memory 2g
>
>
> Regards,
>
> Takashi
>
> > 2017-07-18 5:18 GMT+09:00 Pascal Stammer <stam...@deichbrise.de>:
> >> Hi,
> >>
> >> I am running a Spark 2.1.x Application on AWS EMR with YARN and get
> >> following error that kill my application:
> >>
> >> AM Container for appattempt_1500320286695_0001_01 exited with
> exitCode:
> >> -104
> >> For more detailed output, check application tracking
> >> page:http://ip-172-31-35-192.eu-central-1.compute.internal:
> 8088/cluster/app/application_1500320286695_0001Then,
> >> click on links to logs of each attempt.
> >> Diagnostics: Container
> >> [pid=9216,containerID=container_1500320286695_0001_01_01] is
> running
> >> beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical
> >> memory used; 3.3 GB of 6.9 GB virtual memory used. Killing container.
> >>
> >>
> >> I already change spark.yarn.executor.memoryOverhead but the error still
> >> occurs. Does anybody have a hint for me which parameter or
> configuration I
> >> have to adapt.
> >>
> >> Thank you very much.
> >>
> >> Regards,
> >>
> >> Pascal Stammer
> >>
> >>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Running Spark and YARN on AWS EMR

2017-07-17 Thread Pascal Stammer

Hi Takashi,

thanks for your help. After further investigation, I figured out that the 
killed container was the driver process. After setting 
spark.yarn.driver.memoryOverhead instead of spark.yarn.executor.memoryOverhead, 
the error was gone and the application ran without error. Maybe it will 
help you as well.
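
(Since these are launch-time settings, they have to go on spark-submit rather than into the running SparkConf. A sketch with placeholder values and the Spark 2.1-era property names:)

# Sketch: overhead settings passed at submit time. Values and the job file are
# placeholders, not the actual job from this thread.
import subprocess

cmd = [
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "--driver-memory", "2g",
    "--executor-memory", "2g",
    # Spark 2.1-era names; newer releases use spark.driver/executor.memoryOverhead.
    "--conf", "spark.yarn.driver.memoryOverhead=1024",
    "--conf", "spark.yarn.executor.memoryOverhead=2048",
    "my_job.py",
]
subprocess.run(cmd, check=True)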

Regards,

Pascal 




> Am 17.07.2017 um 22:59 schrieb Takashi Sasaki <tsasaki...@gmail.com>:
> 
> Hi Pascal,
> 
> The error also occurred frequently in our project.
> 
> As a solution, it was effective to specify the memory size directly
> with spark-submit command.
> 
> eg. spark-submit executor-memory 2g
> 
> 
> Regards,
> 
> Takashi
> 
>> 2017-07-18 5:18 GMT+09:00 Pascal Stammer <stam...@deichbrise.de>:
>>> Hi,
>>> 
>>> I am running a Spark 2.1.x Application on AWS EMR with YARN and get
>>> following error that kill my application:
>>> 
>>> AM Container for appattempt_1500320286695_0001_01 exited with exitCode:
>>> -104
>>> For more detailed output, check application tracking
>>> page:http://ip-172-31-35-192.eu-central-1.compute.internal:8088/cluster/app/application_1500320286695_0001Then,
>>> click on links to logs of each attempt.
>>> Diagnostics: Container
>>> [pid=9216,containerID=container_1500320286695_0001_01_01] is running
>>> beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical
>>> memory used; 3.3 GB of 6.9 GB virtual memory used. Killing container.
>>> 
>>> 
>>> I already change spark.yarn.executor.memoryOverhead but the error still
>>> occurs. Does anybody have a hint for me which parameter or configuration I
>>> have to adapt.
>>> 
>>> Thank you very much.
>>> 
>>> Regards,
>>> 
>>> Pascal Stammer
>>> 
>>> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 



Re: Running Spark and YARN on AWS EMR

2017-07-17 Thread Takashi Sasaki
Hi Pascal,

The error also occurred frequently in our project.

As a solution, it was effective to specify the memory size directly
with the spark-submit command.

eg. spark-submit --executor-memory 2g


Regards,

Takashi

> 2017-07-18 5:18 GMT+09:00 Pascal Stammer <stam...@deichbrise.de>:
>> Hi,
>>
>> I am running a Spark 2.1.x Application on AWS EMR with YARN and get
>> following error that kill my application:
>>
>> AM Container for appattempt_1500320286695_0001_01 exited with exitCode:
>> -104
>> For more detailed output, check application tracking
>> page:http://ip-172-31-35-192.eu-central-1.compute.internal:8088/cluster/app/application_1500320286695_0001Then,
>> click on links to logs of each attempt.
>> Diagnostics: Container
>> [pid=9216,containerID=container_1500320286695_0001_01_01] is running
>> beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical
>> memory used; 3.3 GB of 6.9 GB virtual memory used. Killing container.
>>
>>
>> I already change spark.yarn.executor.memoryOverhead but the error still
>> occurs. Does anybody have a hint for me which parameter or configuration I
>> have to adapt.
>>
>> Thank you very much.
>>
>> Regards,
>>
>> Pascal Stammer
>>
>>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Running Spark and YARN on AWS EMR

2017-07-17 Thread Pascal Stammer
Hi,

I am running a Spark 2.1.x application on AWS EMR with YARN and get the following 
error that kills my application:

AM Container for appattempt_1500320286695_0001_01 exited with exitCode: -104
For more detailed output, check application tracking 
page:http://ip-172-31-35-192.eu-central-1.compute.internal:8088/cluster/app/application_1500320286695_0001Then,
 click on links to logs of each attempt.
Diagnostics: Container 
[pid=9216,containerID=container_1500320286695_0001_01_01] is running beyond 
physical memory limits. Current usage: 1.4 GB of 1.4 GB physical memory used; 
3.3 GB of 6.9 GB virtual memory used. Killing container.


I already changed spark.yarn.executor.memoryOverhead but the error still occurs. 
Does anybody have a hint for me as to which parameter or configuration I have to 
adapt?

Thank you very much.

Regards,

Pascal Stammer




Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread lucas.g...@gmail.com
"Building data products is a very different discipline from that of
building software."

That is a fundamentally incorrect assumption.

There will always be a need for figuring out how to apply said principles,
but saying 'we're different' has always turned out to be incorrect and I
have seen no reason to think otherwise for data products.

At some point it always comes down to 'how do I get this to my customer, in
a reliable and repeatable fashion'.  The CI/CD patterns that we've come to
rely on are designed to optimize that process.

I have seen no evidence that 'data products' don't benefit from those
practices and I have definitely seen evidence that not following those
patterns has had substantial costs.

Of course there's always a balancing act in the early phases of discovery,
but at some point the needle swings from: "Do I have a valuable product"
to: "How do I get this to customers"

Gary Lucas

On 12 April 2017 at 10:46, Steve Loughran  wrote:

>
> On 12 Apr 2017, at 17:25, Gourav Sengupta 
> wrote:
>
> Hi,
>
> Your answer is like saying, I know how to code in assembly level language
> and I am going to build the next GUI in assembly level code and I think
> that there is a genuine functional requirement to see a color of a button
> in green on the screen.
>
>
> well, I reserve the right to have incomplete knowledge, and look forward
> to improving it.
>
> Perhaps it may be pertinent to read the first preface of a CI/ CD book and
> realize to what kind of software development disciplines is it applicable
> to.
>
>
> the original introduction on CI was probably Fowler's Cruise Control
> article,
> https://martinfowler.com/articles/originalContinuousIntegration.html
>
> "The key is to automate absolutely everything and run the process so often
> that integration errors are found quickly"
>
> Java Development with Ant, 2003, looks at Cruise Control, Anthill and
> Gump, again, with that focus on team coding and automated regression
> testing, both of unit tests, and, with things like HttpUnit, web UIs.
> There's no discussion of "Data" per-se, though databases are implicit.
>
> Apache Gump [Sam Ruby, 2001] was designed to address a single problem "get
> the entire ASF project portfolio to build and test against the latest build
> of everything else". Lots of finger pointing there, especially when
> something foundational like Ant or Xerces did bad.
>
> AFAIK, The earliest known in-print reference to Continuous Deployme3nt is
> the HP Labs 2002 paper, *Making Web Services that Work*. That introduced
> the concept with a focus on automating deployment, staging testing and
> treating ops problems as use cases for which engineers could often write
> tests for, and, perhaps, even design their applications to support. "We are
> exploring extending this model to one we term Continuous Deployment —after
> passing the local test suite, a service can be automatically deployed to a
> public staging server for stress and acceptance testing by physically
> remote calling parties"
>
> At this time, the applications weren't modern "big data" apps as they
> didn't have affordable storage or the tools to schedule work over it. It
> wasn't that the people writing the books and papers looked at big data and
> said "not for us", it just wasn't on their horizons. 1TB was a lot of
> storage in those days, not a high-end SSD.
>
> Otherwise your approach is just another line of defense in saving your job
> by applying an impertinent, incorrect, and outdated skill and tool to a
> problem.
>
>
> please be a bit more constructive here, the ASF code of conduct encourages
> empathy and coillaboration. https://www.apache.org/foundation/
> policies/conduct . Thanks.,
>
>
> Building data products is a very different discipline from that of
> building software.
>
>
> Which is why we ned to consider how to take what are core methodologies
> for software and apply them, and, where appropriate, supercede them with
> new workflows, ideas, technologies. But doing so with an understanding of
> the reasoning behind today's tools and workflows. I'm really interested in
> how do we get from experimental notebook code to something usable in
> production, pushing it out, finding the dirty-data-problems before it goes
> live, etc, etc. I do think today's tools have been outgrown by the
> applications we now build, and am thinking not so much "which tools to
> use', but one step further, "what are the new tools and techniques to
> use?".
>
> I look forward to whatever insight people have here.
>
>
> My genuine advice to everyone in all spheres of activities will be to
> first understand the problem to solve before solving it and definitely
> before selecting the tools to solve it, otherwise you will land up with a
> bowl of soup and fork in hand and argue that CI/ CD is still applicable to
> building data products and data warehousing.
>
>
> I concur
>
> Regards,
> Gourav
>
>
> -Steve
>
> On 

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Steve Loughran

On 12 Apr 2017, at 17:25, Gourav Sengupta 
> wrote:

Hi,

Your answer is like saying, I know how to code in assembly level language and I 
am going to build the next GUI in assembly level code and I think that there is 
a genuine functional requirement to see a color of a button in green on the 
screen.


well, I reserve the right to have incomplete knowledge, and look forward to 
improving it.

Perhaps it may be pertinent to read the first preface of a CI/ CD book and 
realize to what kind of software development disciplines is it applicable to.

the original introduction on CI was probably Fowler's Cruise Control article,
https://martinfowler.com/articles/originalContinuousIntegration.html

"The key is to automate absolutely everything and run the process so often that 
integration errors are found quickly"

Java Development with Ant, 2003, looks at Cruise Control, Anthill and Gump, 
again, with that focus on team coding and automated regression testing, both of 
unit tests, and, with things like HttpUnit, web UIs. There's no discussion of 
"Data" per-se, though databases are implicit.

Apache Gump [Sam Ruby, 2001] was designed to address a single problem "get the 
entire ASF project portfolio to build and test against the latest build of 
everything else". Lots of finger pointing there, especially when something 
foundational like Ant or Xerces did bad.

AFAIK, the earliest known in-print reference to Continuous Deployment is the 
HP Labs 2002 paper, Making Web Services that Work. That introduced the concept 
with a focus on automating deployment, staging testing and treating ops 
problems as use cases for which engineers could often write tests for, and, 
perhaps, even design their applications to support. "We are exploring extending 
this model to one we term Continuous Deployment —after passing the local test 
suite, a service can be automatically deployed to a public staging server for 
stress and acceptance testing by physically remote calling parties"

At this time, the applications weren't modern "big data" apps as they didn't 
have affordable storage or the tools to schedule work over it. It wasn't that 
the people writing the books and papers looked at big data and said "not for 
us", it just wasn't on their horizons. 1TB was a lot of storage in those days, 
not a high-end SSD.

Otherwise your approach is just another line of defense in saving your job by 
applying an impertinent, incorrect, and outdated skill and tool to a problem.


please be a bit more constructive here, the ASF code of conduct encourages 
empathy and collaboration. https://www.apache.org/foundation/policies/conduct .
Thanks.


Building data products is a very different discipline from that of building 
software.


Which is why we need to consider how to take what are core methodologies for 
software and apply them, and, where appropriate, supersede them with new 
workflows, ideas, technologies. But doing so with an understanding of the 
reasoning behind today's tools and workflows. I'm really interested in how do 
we get from experimental notebook code to something usable in production, 
pushing it out, finding the dirty-data-problems before it goes live, etc, etc. 
I do think today's tools have been outgrown by the applications we now build, 
and am thinking not so much "which tools to use', but one step further, "what 
are the new tools and techniques to use?".

I look forward to whatever insight people have here.


My genuine advice to everyone in all spheres of activities will be to first 
understand the problem to solve before solving it and definitely before 
selecting the tools to solve it, otherwise you will land up with a bowl of soup 
and fork in hand and argue that CI/ CD is still applicable to building data 
products and data warehousing.


I concur

Regards,
Gourav


-Steve

On Wed, Apr 12, 2017 at 12:42 PM, Steve Loughran 
> wrote:

On 11 Apr 2017, at 20:46, Gourav Sengupta 
> wrote:

And once again JAVA programmers are trying to solve a data analytics and data 
warehousing problem using programming paradigms. It genuinely a pain to see 
this happen.



While I'm happy to be faulted for treating things as software processes, having 
a full automated mechanism for testing the latest code before production is 
something I'd consider foundational today. This is what "Continuous Deployment" 
was about when it was first conceived. Does it mean you should blindly deploy 
that way? well, not if you worry about security, but having that review process 
and then a final manual "deploy" button can address that.

Cloud infras let you integrate cluster instantiation to the process; which 
helps you automate things like "stage the deployment in some new VMs, run 
acceptance tests (*), then switch the load balancer over to the new cluster, 
being 

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Gourav Sengupta
Hi,

Your answer is like saying, I know how to code in assembly level language
and I am going to build the next GUI in assembly level code and I think
that there is a genuine functional requirement to see a color of a button
in green on the screen.

Perhaps it may be pertinent to read the first preface of a CI/ CD book and
realize to what kind of software development disciplines is it applicable
to. Otherwise your approach is just another line of defense in saving your
job by applying an impertinent, incorrect, and outdated skill and tool to a
problem.

Building data products is a very different discipline from that of building
software.

My genuine advice to everyone in all spheres of activities will be to first
understand the problem to solve before solving it and definitely before
selecting the tools to solve it, otherwise you will land up with a bowl of
soup and fork in hand and argue that CI/ CD is still applicable to building
data products and data warehousing.

Regards,
Gourav

On Wed, Apr 12, 2017 at 12:42 PM, Steve Loughran 
wrote:

>
> On 11 Apr 2017, at 20:46, Gourav Sengupta 
> wrote:
>
> And once again JAVA programmers are trying to solve a data analytics and
> data warehousing problem using programming paradigms. It genuinely a pain
> to see this happen.
>
>
>
> While I'm happy to be faulted for treating things as software processes,
> having a full automated mechanism for testing the latest code before
> production is something I'd consider foundational today. This is what
> "Contiunous Deployment" was about when it was first conceived. Does it mean
> you should blindly deploy that way? well, not if you worry about security,
> but having that review process and then a final manual "deploy" button can
> address that.
>
> Cloud infras let you integrate cluster instantiation to the process; which
> helps you automate things like "stage the deployment in some new VMs, run
> acceptance tests (*), then switch the load balancer over to the new
> cluster, being ready to switch back if you need. I've not tried that with
> streaming apps though; I don't know how to do it there. Boot the new
> cluster off checkpointed state requires deserialization to work, which
> can't be guaranteed if you are changing the objects which get serialized.
>
> I'd argue then, it's not a problem which has already been solved by data
> analystics/warehousing —though if you've got pointers there, I'd be
> grateful. Always good to see work by others. Indeed, the telecoms industry
> have led the way in testing and HA deployment: if you look at Erlang you
> can see a system designed with hot upgrades in mind, the way java code "add
> a JAR to a web server" never was.
>
> -Steve
>
>
> (*) do always make sure this is the test cluster with a snapshot of test
> data, not production machines/data. There are always horror stories there.
>
>
> Regards,
> Gourav
>
> On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin 
> wrote:
>
>> Hi Steve
>>
>>
>> Thanks for the detailed response, I think this problem doesn't have an
>> industry standard solution as of yet and I am sure a lot of people would
>> benefit from the discussion
>>
>> I realise now what you are saying so thanks for clarifying, that said let
>> me try and explain how we approached the problem
>>
>> There are 2 problems you highlighted, the first if moving the code from
>> SCM to prod, and the other is enusiring the data your code uses is correct.
>> (using the latest data from prod)
>>
>>
>> *"how do you get your code from SCM into production?"*
>>
>> We currently have our pipeline being run via airflow, we have our dags in
>> S3, with regards to how we get our code from SCM to production
>>
>> 1) Jenkins build that builds our spark applications and runs tests
>> 2) Once the first build is successful we trigger another build to copy
>> the dags to an s3 folder
>>
>> We then routinely sync this folder to the local airflow dags folder every
>> X amount of mins
>>
>> Re test data
>> *" but what's your strategy for test data: that's always the
>> troublespot."*
>>
>> Our application is using versioning against the data, so we expect the
>> source data to be in a certain version and the output data to also be in a
>> certain version
>>
>> We have a test resources folder that we have following the same
>> convention of versioning - this is the data that our application tests use
>> - to ensure that the data is in the correct format
>>
>> so for example if we have Table X with version 1 that depends on data
>> from Table A and B also version 1, we run our spark application then ensure
>> the transformed table X has the correct columns and row values
>>
>> Then when we have a new version 2 of the source data or adding a new
>> column in Table X (version 2), we generate a new version of the data and
>> ensure the tests are updated
>>
>> That way we ensure any new version of the data has tests against it
>>
>> 

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-12 Thread Steve Loughran

On 11 Apr 2017, at 20:46, Gourav Sengupta 
> wrote:

And once again JAVA programmers are trying to solve a data analytics and data 
warehousing problem using programming paradigms. It genuinely a pain to see 
this happen.



While I'm happy to be faulted for treating things as software processes, having 
a full automated mechanism for testing the latest code before production is 
something I'd consider foundational today. This is what "Continuous Deployment" 
was about when it was first conceived. Does it mean you should blindly deploy 
that way? well, not if you worry about security, but having that review process 
and then a final manual "deploy" button can address that.

Cloud infras let you integrate cluster instantiation to the process; which 
helps you automate things like "stage the deployment in some new VMs, run 
acceptance tests (*), then switch the load balancer over to the new cluster, 
being ready to switch back if you need. I've not tried that with streaming apps 
though; I don't know how to do it there. Booting the new cluster off checkpointed 
state requires deserialization to work, which can't be guaranteed if you are 
changing the objects which get serialized.

I'd argue then, it's not a problem which has already been solved by data 
analytics/warehousing —though if you've got pointers there, I'd be grateful. 
Always good to see work by others. Indeed, the telecoms industry have led the 
way in testing and HA deployment: if you look at Erlang you can see a system 
designed with hot upgrades in mind, the way java code "add a JAR to a web 
server" never was.

-Steve


(*) do always make sure this is the test cluster with a snapshot of test data, 
not production machines/data. There are always horror stories there.


Regards,
Gourav

On Tue, Apr 11, 2017 at 2:20 PM, Sam Elamin 
> wrote:
Hi Steve


Thanks for the detailed response, I think this problem doesn't have an industry 
standard solution as of yet and I am sure a lot of people would benefit from 
the discussion

I realise now what you are saying so thanks for clarifying, that said let me 
try and explain how we approached the problem

There are 2 problems you highlighted: the first is moving the code from SCM to 
prod, and the other is ensuring the data your code uses is correct (using the 
latest data from prod).


"how do you get your code from SCM into production?"

We currently have our pipeline being run via airflow, we have our dags in S3, 
with regards to how we get our code from SCM to production

1) Jenkins build that builds our spark applications and runs tests
2) Once the first build is successful we trigger another build to copy the dags 
to an s3 folder

We then routinely sync this folder to the local airflow dags folder every X 
minutes
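
For illustration only, that sync can be a plain cron entry (the bucket name,
local path and interval here are assumptions, not the actual setup):

# crontab entry: pull the latest DAG definitions from S3 every 5 minutes
*/5 * * * * aws s3 sync s3://my-pipeline-bucket/dags /home/airflow/airflow/dags --delete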

Re test data
" but what's your strategy for test data: that's always the troublespot."

Our application is using versioning against the data, so we expect the source 
data to be in a certain version and the output data to also be in a certain 
version

We have a test resources folder that follows the same convention of 
versioning - this is the data that our application tests use - to ensure that 
the data is in the correct format

so for example if we have Table X with version 1 that depends on data from 
Table A and B also version 1, we run our spark application then ensure the 
transformed table X has the correct columns and row values

Then when we have a new version 2 of the source data or adding a new column in 
Table X (version 2), we generate a new version of the data and ensure the tests 
are updated

That way we ensure any new version of the data has tests against it

"I've never seen any good strategy there short of "throw it at a copy of the 
production dataset"."

I agree which is why we have a sample of the production data and version the 
schemas we expect the source and target data to look like.

If people are interested I am happy writing a blog about it in the hopes this 
helps people build more reliable pipelines


Love to see that.

Kind Regards
Sam



Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Sumona Routh
etting code into production, either at the press
> of a button or as part of a scheduled "we push out an update every night,
> rerun the deployment tests and then switch over to the new installation"
> mech.
>
> Put differently: how do you get your code from SCM into production? Not
> just for CI, but what's your strategy for test data: that's always the
> troublespot. Random selection of rows may work, although it will skip the
> odd outlier (high-unicode char in what should be a LATIN-1 field, time set
> to 0, etc), and for work joining > 1 table, you need rows which join well.
> I've never seen any good strategy there short of "throw it at a copy of the
> production dataset".
>
>
> -Steve
>
>
>
>
>
>
> On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> Hi Steve,
>
> Why would you ever do that? You are suggesting the use of a CI tool as a
> workflow and orchestration engine.
>
> Regards,
> Gourav Sengupta
>
> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
> If you have Jenkins set up for some CI workflow, that can do scheduled
> builds and tests. Works well if you can do some build test before even
> submitting it to a remote cluster
>
> On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
> Hi Shyla
>
> You have multiple options really some of which have been already listed
> but let me try and clarify
>
> Assuming you have a spark application in a jar you have a variety of
> options
>
> You have to have an existing spark cluster that is either running on EMR
> or somewhere else.
>
> *Super simple / hacky*
> Cron job on EC2 that calls a simple shell script that does a spart submit
> to a Spark Cluster OR create or add step to an EMR cluster
>
> *More Elegant*
> Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will
> do the above step but have scheduling and potential backfilling and error
> handling(retries,alerts etc)
>
> AWS are coming out with glue <https://aws.amazon.com/glue/> soon that
> does some Spark jobs but I do not think its available worldwide just yet
>
> Hope I cleared things up
>
> Regards
> Sam
>
>
> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <gourav.sengu...@gmail.com
> > wrote:
>
> Hi Shyla,
>
> why would you want to schedule a spark job in EC2 instead of EMR?
>
> Regards,
> Gourav
>
> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <deshpandesh...@gmail.com>
> wrote:
>
> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
> easiest way to do this. Thanks
>
>
>
>
>
>
>
>


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Gourav Sengupta
hat's always the
>> troublespot. Random selection of rows may work, although it will skip the
>> odd outlier (high-unicode char in what should be a LATIN-1 field, time set
>> to 0, etc), and for work joining > 1 table, you need rows which join well.
>> I've never seen any good strategy there short of "throw it at a copy of the
>> production dataset".
>>
>>
>> -Steve
>>
>>
>>
>>
>>
>>
>> On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <gourav.sengu...@gmail.com>
>> wrote:
>>
>>> Hi Steve,
>>>
>>> Why would you ever do that? You are suggesting the use of a CI tool as a
>>> workflow and orchestration engine.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <ste...@hortonworks.com>
>>> wrote:
>>>
>>>> If you have Jenkins set up for some CI workflow, that can do scheduled
>>>> builds and tests. Works well if you can do some build test before even
>>>> submitting it to a remote cluster
>>>>
>>>> On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>>>
>>>> Hi Shyla
>>>>
>>>> You have multiple options really some of which have been already listed
>>>> but let me try and clarify
>>>>
>>>> Assuming you have a spark application in a jar you have a variety of
>>>> options
>>>>
>>>> You have to have an existing spark cluster that is either running on
>>>> EMR or somewhere else.
>>>>
>>>> *Super simple / hacky*
>>>> Cron job on EC2 that calls a simple shell script that does a spart
>>>> submit to a Spark Cluster OR create or add step to an EMR cluster
>>>>
>>>> *More Elegant*
>>>> Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that
>>>> will do the above step but have scheduling and potential backfilling and
>>>> error handling(retries,alerts etc)
>>>>
>>>> AWS are coming out with glue <https://aws.amazon.com/glue/> soon that
>>>> does some Spark jobs but I do not think its available worldwide just yet
>>>>
>>>> Hope I cleared things up
>>>>
>>>> Regards
>>>> Sam
>>>>
>>>>
>>>> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <
>>>> gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi Shyla,
>>>>>
>>>>> why would you want to schedule a spark job in EC2 instead of EMR?
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <
>>>>> deshpandesh...@gmail.com> wrote:
>>>>>
>>>>>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is
>>>>>> the easiest way to do this. Thanks
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Sam Elamin
4:07 PM, Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>>> If you have Jenkins set up for some CI workflow, that can do scheduled
>>> builds and tests. Works well if you can do some build test before even
>>> submitting it to a remote cluster
>>>
>>> On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>>
>>> Hi Shyla
>>>
>>> You have multiple options really some of which have been already listed
>>> but let me try and clarify
>>>
>>> Assuming you have a spark application in a jar you have a variety of
>>> options
>>>
>>> You have to have an existing spark cluster that is either running on EMR
>>> or somewhere else.
>>>
>>> *Super simple / hacky*
>>> Cron job on EC2 that calls a simple shell script that does a spart
>>> submit to a Spark Cluster OR create or add step to an EMR cluster
>>>
>>> *More Elegant*
>>> Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that
>>> will do the above step but have scheduling and potential backfilling and
>>> error handling(retries,alerts etc)
>>>
>>> AWS are coming out with glue <https://aws.amazon.com/glue/> soon that
>>> does some Spark jobs but I do not think its available worldwide just yet
>>>
>>> Hope I cleared things up
>>>
>>> Regards
>>> Sam
>>>
>>>
>>> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi Shyla,
>>>>
>>>> why would you want to schedule a spark job in EC2 instead of EMR?
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <
>>>> deshpandesh...@gmail.com> wrote:
>>>>
>>>>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
>>>>> easiest way to do this. Thanks
>>>>>
>>>>
>>>>
>>>
>>>
>>
>


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-11 Thread Steve Loughran

On 7 Apr 2017, at 18:40, Sam Elamin 
<hussam.ela...@gmail.com<mailto:hussam.ela...@gmail.com>> wrote:

Definitely agree with gourav there. I wouldn't want jenkins to run my work 
flow. Seems to me that you would only be using jenkins for its scheduling 
capabilities


Maybe I was just looking at this differently

Yes you can run tests but you wouldn't want it to run your orchestration of jobs

What happens if jenkijs goes down for any particular reason. How do you have 
the conversation with your stakeholders that your pipeline is not working and 
they don't have data because the build server is going through an upgrade or 
going through an upgrade



Well, I wouldn't use it as a replacement for Oozie, but I'd certainly consider 
it as the pipeline for getting your code out to the cluster, so you don't have to 
explain why you just pushed out something broken

As example, here's Renault's pipeline as discussed last week in Munich 
https://flic.kr/p/Tw3Emu

However to be fair I understand what you are saying Steve if someone is in a 
place where you only have access to jenkins and have to go through hoops to 
setup:get access to new instances then engineers will do what they always do, 
find ways to game the system to get their work done



This isn't about trying to "Game the system", this is about what makes a 
replicable workflow for getting code into production, either at the press of a 
button or as part of a scheduled "we push out an update every night, rerun the 
deployment tests and then switch over to the new installation" mech.

Put differently: how do you get your code from SCM into production? Not just 
for CI, but what's your strategy for test data: that's always the troublespot. 
Random selection of rows may work, although it will skip the odd outlier 
(high-unicode char in what should be a LATIN-1 field, time set to 0, etc), and 
for work joining > 1 table, you need rows which join well. I've never seen any 
good strategy there short of "throw it at a copy of the production dataset".


-Steve






On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta 
<gourav.sengu...@gmail.com<mailto:gourav.sengu...@gmail.com>> wrote:
Hi Steve,

Why would you ever do that? You are suggesting the use of a CI tool as a 
workflow and orchestration engine.

Regards,
Gourav Sengupta

On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran 
<ste...@hortonworks.com<mailto:ste...@hortonworks.com>> wrote:
If you have Jenkins set up for some CI workflow, that can do scheduled builds 
and tests. Works well if you can do some build test before even submitting it 
to a remote cluster

On 7 Apr 2017, at 10:15, Sam Elamin 
<hussam.ela...@gmail.com<mailto:hussam.ela...@gmail.com>> wrote:

Hi Shyla

You have multiple options really some of which have been already listed but let 
me try and clarify

Assuming you have a spark application in a jar you have a variety of options

You have to have an existing spark cluster that is either running on EMR or 
somewhere else.

Super simple / hacky
Cron job on EC2 that calls a simple shell script that does a spart submit to a 
Spark Cluster OR create or add step to an EMR cluster

More Elegant
Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will do 
the above step but have scheduling and potential backfilling and error 
handling(retries,alerts etc)

AWS are coming out with glue<https://aws.amazon.com/glue/> soon that does some 
Spark jobs but I do not think its available worldwide just yet

Hope I cleared things up

Regards
Sam


On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta 
<gourav.sengu...@gmail.com<mailto:gourav.sengu...@gmail.com>> wrote:
Hi Shyla,

why would you want to schedule a spark job in EC2 instead of EMR?

Regards,
Gourav

On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande 
<deshpandesh...@gmail.com<mailto:deshpandesh...@gmail.com>> wrote:
I want to run a spark batch job maybe hourly on AWS EC2 .  What is the easiest 
way to do this. Thanks







Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Sam Elamin
Definitely agree with gourav there. I wouldn't want jenkins to run my work
flow. Seems to me that you would only be using jenkins for its scheduling
capabilities

Yes you can run tests but you wouldn't want it to run your orchestration of
jobs

What happens if Jenkins goes down for any particular reason? How do you
have the conversation with your stakeholders that your pipeline is not
working and they don't have data because the build server is going through
an upgrade?

However, to be fair, I understand what you are saying Steve: if someone is in
a place where you only have access to Jenkins and has to go through hoops
to set up or get access to new instances, then engineers will do what they
always do, find ways to game the system to get their work done




On Fri, 7 Apr 2017 at 16:17, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi Steve,
>
> Why would you ever do that? You are suggesting the use of a CI tool as a
> workflow and orchestration engine.
>
> Regards,
> Gourav Sengupta
>
> On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>> If you have Jenkins set up for some CI workflow, that can do scheduled
>> builds and tests. Works well if you can do some build test before even
>> submitting it to a remote cluster
>>
>> On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>> Hi Shyla
>>
>> You have multiple options really some of which have been already listed
>> but let me try and clarify
>>
>> Assuming you have a spark application in a jar you have a variety of
>> options
>>
>> You have to have an existing spark cluster that is either running on EMR
>> or somewhere else.
>>
>> *Super simple / hacky*
>> Cron job on EC2 that calls a simple shell script that does a spart submit
>> to a Spark Cluster OR create or add step to an EMR cluster
>>
>> *More Elegant*
>> Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will
>> do the above step but have scheduling and potential backfilling and error
>> handling(retries,alerts etc)
>>
>> AWS are coming out with glue <https://aws.amazon.com/glue/> soon that
>> does some Spark jobs but I do not think its available worldwide just yet
>>
>> Hope I cleared things up
>>
>> Regards
>> Sam
>>
>>
>> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi Shyla,
>>>
>>> why would you want to schedule a spark job in EC2 instead of EMR?
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
>>>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
>>>> easiest way to do this. Thanks
>>>>
>>>
>>>
>>
>>
>


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Gourav Sengupta
Hi Steve,

Why would you ever do that? You are suggesting the use of a CI tool as a
workflow and orchestration engine.

Regards,
Gourav Sengupta

On Fri, Apr 7, 2017 at 4:07 PM, Steve Loughran <ste...@hortonworks.com>
wrote:

> If you have Jenkins set up for some CI workflow, that can do scheduled
> builds and tests. Works well if you can do some build test before even
> submitting it to a remote cluster
>
> On 7 Apr 2017, at 10:15, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
> Hi Shyla
>
> You have multiple options really some of which have been already listed
> but let me try and clarify
>
> Assuming you have a spark application in a jar you have a variety of
> options
>
> You have to have an existing spark cluster that is either running on EMR
> or somewhere else.
>
> *Super simple / hacky*
> Cron job on EC2 that calls a simple shell script that does a spart submit
> to a Spark Cluster OR create or add step to an EMR cluster
>
> *More Elegant*
> Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will
> do the above step but have scheduling and potential backfilling and error
> handling(retries,alerts etc)
>
> AWS are coming out with glue <https://aws.amazon.com/glue/> soon that
> does some Spark jobs but I do not think its available worldwide just yet
>
> Hope I cleared things up
>
> Regards
> Sam
>
>
> On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <gourav.sengu...@gmail.com
> > wrote:
>
>> Hi Shyla,
>>
>> why would you want to schedule a spark job in EC2 instead of EMR?
>>
>> Regards,
>> Gourav
>>
>> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <deshpandesh...@gmail.com
>> > wrote:
>>
>>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
>>> easiest way to do this. Thanks
>>>
>>
>>
>
>


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Steve Loughran
If you have Jenkins set up for some CI workflow, that can do scheduled builds 
and tests. Works well if you can do some build test before even submitting it 
to a remote cluster

On 7 Apr 2017, at 10:15, Sam Elamin 
<hussam.ela...@gmail.com<mailto:hussam.ela...@gmail.com>> wrote:

Hi Shyla

You have multiple options really some of which have been already listed but let 
me try and clarify

Assuming you have a spark application in a jar you have a variety of options

You have to have an existing spark cluster that is either running on EMR or 
somewhere else.

Super simple / hacky
Cron job on EC2 that calls a simple shell script that does a spart submit to a 
Spark Cluster OR create or add step to an EMR cluster

More Elegant
Airflow/Luigi/AWS Data Pipeline (Which is just CRON in the UI ) that will do 
the above step but have scheduling and potential backfilling and error 
handling(retries,alerts etc)

AWS are coming out with glue<https://aws.amazon.com/glue/> soon that does some 
Spark jobs but I do not think its available worldwide just yet

Hope I cleared things up

Regards
Sam


On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta 
<gourav.sengu...@gmail.com<mailto:gourav.sengu...@gmail.com>> wrote:
Hi Shyla,

why would you want to schedule a spark job in EC2 instead of EMR?

Regards,
Gourav

On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande 
<deshpandesh...@gmail.com<mailto:deshpandesh...@gmail.com>> wrote:
I want to run a spark batch job maybe hourly on AWS EC2 .  What is the easiest 
way to do this. Thanks





Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Sam Elamin
Hi Shyla

You have multiple options really some of which have been already listed but
let me try and clarify

Assuming you have a spark application in a jar you have a variety of options

You have to have an existing spark cluster that is either running on EMR or
somewhere else.

*Super simple / hacky*
Cron job on EC2 that calls a simple shell script that does a spark-submit
to a Spark Cluster, OR create or add a step to an EMR cluster
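
A rough sketch of that approach (the class name, jar location, paths and
schedule are illustrative assumptions):

#!/bin/bash
# run_hourly_job.sh - submit the batch job to an existing YARN/EMR cluster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.HourlyBatchJob \
  s3://my-bucket/jars/hourly-batch-job.jar

# crontab entry: run the script at the top of every hour
0 * * * * /home/hadoop/run_hourly_job.sh >> /var/log/hourly-batch-job.log 2>&1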

*More Elegant*
Airflow/Luigi/AWS Data Pipeline (which is just CRON in the UI) that will
do the above step but have scheduling and potential backfilling and error
handling (retries, alerts, etc.)
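
As a sketch of the Airflow route (the DAG id, schedule, command and jar are
illustrative assumptions, using the standard BashOperator):

# dags/hourly_spark_job.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"owner": "data-eng", "retries": 2, "retry_delay": timedelta(minutes=10)}

# run once an hour; retries and alerting come from default_args / email settings
dag = DAG("hourly_spark_job", default_args=default_args,
          start_date=datetime(2017, 4, 1), schedule_interval="@hourly")

submit = BashOperator(
    task_id="spark_submit",
    bash_command=("spark-submit --master yarn --deploy-mode cluster "
                  "--class com.example.HourlyBatchJob s3://my-bucket/jars/hourly-batch-job.jar"),
    dag=dag,
)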

AWS are coming out with glue <https://aws.amazon.com/glue/> soon that does
some Spark jobs but I do not think it's available worldwide just yet

Hope I cleared things up

Regards
Sam


On Fri, Apr 7, 2017 at 6:05 AM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi Shyla,
>
> why would you want to schedule a spark job in EC2 instead of EMR?
>
> Regards,
> Gourav
>
> On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <deshpandesh...@gmail.com>
> wrote:
>
>> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
>> easiest way to do this. Thanks
>>
>
>


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread Gourav Sengupta
Hi Shyla,

why would you want to schedule a spark job in EC2 instead of EMR?

Regards,
Gourav

On Fri, Apr 7, 2017 at 1:04 AM, shyla deshpande <deshpandesh...@gmail.com>
wrote:

> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
> easiest way to do this. Thanks
>


Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread Yash Sharma
Hi Shyla,
We could suggest based on what you're trying to do exactly. But with the
given information - If you have your spark job ready you could schedule it
via any scheduling framework like Airflow or Celery or Cron based on how
simple/complex you want your work flow to be.

Cheers,
Yash



On Fri, 7 Apr 2017 at 10:04 shyla deshpande <deshpandesh...@gmail.com>
wrote:

> I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
> easiest way to do this. Thanks
>


What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread shyla deshpande
I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
easiest way to do this. Thanks


What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-06 Thread shyla deshpande
I want to run a spark batch job maybe hourly on AWS EC2 .  What is the
easiest way to do this. Thanks


Consuming AWS Cloudwatch logs from Kinesis into Spark

2017-04-05 Thread Tim Smith
I am sharing this code snippet since I spent quite some time figuring it
out and I couldn't find any examples online. Between the Kinesis
documentation, tutorial on AWS site and other code snippets on the
Internet, I was confused about structure/format of the messages that Spark
fetches from Kinesis - base64 encoded, json, gzipped - which one first and
what order.

I tested this on EMR-5.4.0, Amazon Hadoop 2.7.3 and Spark 2.1.0. Hope it
helps others googling for similar info. I tried using Structured Streaming
but (1) it's in Alpha and (2) despite including what I thought were all the
dependencies, it complained of not finding DataSource.Kinesis. You probably
do not need all the libs but I am just too lazy to redact ones you don't
require for the snippet below :)

import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kinesis._
import
com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.rdd.RDD
import java.util.Base64
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.explode
import org.apache.commons.math3.stat.descriptive._
import java.io.File
import java.net.InetAddress
import scala.util.control.NonFatal
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SaveMode
import java.util.Properties;
import org.json4s._
import org.json4s.jackson.JsonMethods._
import java.io.{ByteArrayOutputStream, ByteArrayInputStream}
import java.util.zip.{GZIPOutputStream, GZIPInputStream}
import scala.util.Try


//sc.setLogLevel("INFO")

val ssc = new StreamingContext(sc, Seconds(30))

val kinesisStreams = (0 until 2).map { i => KinesisUtils.createStream(ssc,
"myApp", "cloudwatchlogs",
"https://kinesis.us-east-1.amazonaws.com","us-east-1;,
InitialPositionInStream.LATEST , Seconds(30),
StorageLevel.MEMORY_AND_DISK_2,"myId","mySecret") }

val unionStreams = ssc.union(kinesisStreams)
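
// Records arriving on the unioned stream are gzip-compressed JSON payloads (the
// format CloudWatch Logs uses when writing to Kinesis): the block below
// decompresses each record, parses the JSON, and flattens the logEvents.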

unionStreams.foreachRDD((rdd: RDD[Array[Byte]], time: Time) => {
  if(rdd.count() > 0) {
  val json = rdd.map(input => {
  val inputStream = new GZIPInputStream(new ByteArrayInputStream(input))
  val record = scala.io.Source.fromInputStream(inputStream).mkString
  compact(render(parse(record)))
  })

  val df = spark.sqlContext.read.json(json)
  val preDF =
df.select($"logGroup",explode($"logEvents").as("events_flat"))
  val penDF = preDF.select($"logGroup",$"events_flat.extractedFields")
  val finalDF =
penDF.select($"logGroup".as("cluster"),$"extractedFields.*")
  finalDF.printSchema()
  finalDF.show()
 }
})

ssc.start



--
Thanks,

Tim


Spark is inventing its own AWS secret key

2017-03-08 Thread Jonhy Stack
Hi,

I'm trying to read an s3 bucket from Spark and up until today Spark always
complained that the request returned 403

hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "ACCESSKEY")
hadoopConf.set("fs.s3a.secret.key", "SECRETKEY")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
logs = spark_context.textFile("s3a://mybucket/logs/*")

Spark was saying  Invalid Access key [ACCESSKEY]

However with the same ACCESSKEY and SECRETKEY this was working with aws-cli

aws s3 ls mybucket/logs/

and in python boto3 this was working

resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "logs/text.py") \
.put(Body=open("text.py", "rb"),ContentType="text/x-py")

so my credentials are NOT invalid and the problem is definitely something with
Spark.

Today I decided to turn on the "DEBUG" log for the entire Spark job and to my
surprise... Spark is NOT using the [SECRETKEY] I have provided but
instead... adds a random one???

17/03/08 10:40:04 DEBUG request: Sending Request: HEAD
https://mybucket.s3.amazonaws.com / Headers: (Authorization: AWS
ACCESSKEY:**[RANDON-SECRET-KEY]**, User-Agent: aws-sdk-java/1.7.4
Mac_OS_X/10.11.6 Java_HotSpot(TM)_64-Bit_Server_VM/25.65-b01/1.8.0_65,
Date: Wed, 08 Mar 2017 10:40:04 GMT, Content-Type:
application/x-www-form-urlencoded; charset=utf-8, )

This is why it still returns 403! Spark is not using the key I provide with
fs.s3a.secret.key but instead invents a random one EACH time (every time I
submit the job the random secret key is different)

For the record I'm running this locally on my machine (OSX) with this
command

spark-submit --packages
com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3
test.py

Could some one enlighten me on this?


Re: Custom log4j.properties on AWS EMR

2017-02-28 Thread Prithish
Thanks for your response Jonathan. Yes, this works. I also added another
way of achieving this to the Stackoverflow post. Thanks for the help.

On Tue, Feb 28, 2017 at 11:58 PM, Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> Prithish,
>
> I saw you posted this on SO, so I responded there just now. See
> http://stackoverflow.com/questions/42452622/custom-
> log4j-properties-on-aws-emr/42516161#42516161
>
> In short, an hdfs:// path can't be used to configure log4j because log4j
> knows nothing about hdfs. Instead, since you are using EMR, you should use
> the Configuration API when creating your cluster to configure the
> spark-log4j configuration classification. See http://docs.aws.amazon.
> com/emr/latest/ReleaseGuide/emr-configure-apps.html for more info.
>
> ~ Jonathan
>
> On Sun, Feb 26, 2017 at 8:17 PM Prithish <prith...@gmail.com> wrote:
>
>> Steve, I tried that, but didn't work. Any other ideas?
>>
>> On Mon, Feb 27, 2017 at 1:42 AM, Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>> try giving a resource of a file in the JAR, e.g add a file
>> "log4j-debugging.properties into the jar, and give a config option of
>> -Dlog4j.configuration=/log4j-debugging.properties   (maybe also try
>> without the "/")
>>
>>
>> On 26 Feb 2017, at 16:31, Prithish <prith...@gmail.com> wrote:
>>
>> Hoping someone can answer this.
>>
>> I am unable to override and use a Custom log4j.properties on Amazon EMR.
>> I am running Spark on EMR (Yarn) and have tried all the below combinations
>> in the Spark-Submit to try and use the custom log4j.
>>
>> In Client mode
>> --driver-java-options "-Dlog4j.configuration=hdfs://
>> host:port/user/hadoop/log4j.properties"
>>
>> In Cluster mode
>> --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:
>> port/user/hadoop/log4j.properties"
>>
>> I have also tried picking from local filesystem using file://// instead
>> of hdfs. None of this seem to work. However, I can get this working when
>> running on my local Yarn setup.
>>
>> Any ideas?
>>
>> I have also posted on Stackoverflow (link below)
>> http://stackoverflow.com/questions/42452622/custom-
>> log4j-properties-on-aws-emr
>>
>>
>>
>>


Re: Custom log4j.properties on AWS EMR

2017-02-28 Thread Jonathan Kelly
Prithish,

I saw you posted this on SO, so I responded there just now. See
http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr/42516161#42516161

In short, an hdfs:// path can't be used to configure log4j because log4j
knows nothing about hdfs. Instead, since you are using EMR, you should use
the Configuration API when creating your cluster to configure the
spark-log4j configuration classification. See
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html for
more info.
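
As a sketch, that classification could look something like this when passed via
the EMR Configuration API (the entries under "Properties" are just illustrative
values, not a recommendation):

[
  {
    "Classification": "spark-log4j",
    "Properties": {
      "log4j.rootCategory": "WARN, console",
      "log4j.logger.com.example": "DEBUG"
    }
  }
]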

~ Jonathan

On Sun, Feb 26, 2017 at 8:17 PM Prithish <prith...@gmail.com> wrote:

> Steve, I tried that, but didn't work. Any other ideas?
>
> On Mon, Feb 27, 2017 at 1:42 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
> try giving a resource of a file in the JAR, e.g add a file
> "log4j-debugging.properties into the jar, and give a config option of
> -Dlog4j.configuration=/log4j-debugging.properties   (maybe also try without
> the "/")
>
>
> On 26 Feb 2017, at 16:31, Prithish <prith...@gmail.com> wrote:
>
> Hoping someone can answer this.
>
> I am unable to override and use a Custom log4j.properties on Amazon EMR. I
> am running Spark on EMR (Yarn) and have tried all the below combinations in
> the Spark-Submit to try and use the custom log4j.
>
> In Client mode
> --driver-java-options "-Dlog4j.configuration=
> hdfs://host:port/user/hadoop/log4j.properties"
>
> In Cluster mode
> --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=
> hdfs://host:port/user/hadoop/log4j.properties"
>
> I have also tried picking from local filesystem using file: instead
> of hdfs. None of this seem to work. However, I can get this working when
> running on my local Yarn setup.
>
> Any ideas?
>
> I have also posted on Stackoverflow (link below)
>
> http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr
>
>
>
>


Re: Custom log4j.properties on AWS EMR

2017-02-26 Thread Prithish
Steve, I tried that, but didn't work. Any other ideas?

On Mon, Feb 27, 2017 at 1:42 AM, Steve Loughran <ste...@hortonworks.com>
wrote:

> try giving a resource of a file in the JAR, e.g add a file
> "log4j-debugging.properties into the jar, and give a config option of
> -Dlog4j.configuration=/log4j-debugging.properties   (maybe also try
> without the "/")
>
>
> On 26 Feb 2017, at 16:31, Prithish <prith...@gmail.com> wrote:
>
> Hoping someone can answer this.
>
> I am unable to override and use a Custom log4j.properties on Amazon EMR. I
> am running Spark on EMR (Yarn) and have tried all the below combinations in
> the Spark-Submit to try and use the custom log4j.
>
> In Client mode
> --driver-java-options "-Dlog4j.configuration=hdfs://
> host:port/user/hadoop/log4j.properties"
>
> In Cluster mode
> --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:
> port/user/hadoop/log4j.properties"
>
> I have also tried picking from local filesystem using file: instead
> of hdfs. None of this seem to work. However, I can get this working when
> running on my local Yarn setup.
>
> Any ideas?
>
> I have also posted on Stackoverflow (link below)
> http://stackoverflow.com/questions/42452622/custom-
> log4j-properties-on-aws-emr
>
>
>


Re: Custom log4j.properties on AWS EMR

2017-02-26 Thread Steve Loughran
try giving a resource of a file in the JAR, e.g. add a file 
"log4j-debugging.properties" into the jar, and give a config option of 
-Dlog4j.configuration=/log4j-debugging.properties   (maybe also try without the 
"/")


On 26 Feb 2017, at 16:31, Prithish 
<prith...@gmail.com<mailto:prith...@gmail.com>> wrote:

Hoping someone can answer this.

I am unable to override and use a Custom log4j.properties on Amazon EMR. I am 
running Spark on EMR (Yarn) and have tried all the below combinations in the 
Spark-Submit to try and use the custom log4j.

In Client mode
--driver-java-options 
"-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"

In Cluster mode
--conf 
"spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"

I have also tried picking from local filesystem using file: instead of 
hdfs. None of this seem to work. However, I can get this working when running 
on my local Yarn setup.

Any ideas?

I have also posted on Stackoverflow (link below)
http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr



Custom log4j.properties on AWS EMR

2017-02-26 Thread Prithish
Hoping someone can answer this.

I am unable to override and use a Custom log4j.properties on Amazon EMR. I
am running Spark on EMR (Yarn) and have tried all the below combinations in
the Spark-Submit to try and use the custom log4j.

In Client mode
--driver-java-options
"-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"

In Cluster mode
--conf
"spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"

I have also tried picking from local filesystem using file: instead of
hdfs. None of this seem to work. However, I can get this working when
running on my local Yarn setup.

Any ideas?

I have also posted on Stackoverflow (link below)
http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr


Spark streaming on AWS EC2 error . Please help

2017-02-20 Thread shyla deshpande
I am running Spark streaming on AWS EC2 in standalone mode.

When I do a spark-submit, I get the following message. I am subscribing to
3 kafka topics and it is reading and processing just 2 topics. Works fine
in local mode.
Appreciate your help. Thanks

Exception in thread "pool-26-thread-132" java.lang.NullPointerException
at
org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:225)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Re: Spark Read from Google store and save in AWS s3

2017-01-10 Thread A Shaikh
This should help
https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example

On 8 January 2017 at 03:49, neil90 <neilp1...@icloud.com> wrote:

> Here is how you would read from Google Cloud Storage(note you need to
> create
> a service account key) ->
>
> os.environ['PYSPARK_SUBMIT_ARGS'] = """--jars
> /home/neil/Downloads/gcs-connector-latest-hadoop2.jar pyspark-shell"""
>
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession, SQLContext
>
> conf = SparkConf()\
> .setMaster("local[8]")\
> .setAppName("GS")
>
> sc = SparkContext(conf=conf)
>
> sc._jsc.hadoopConfiguration().set("fs.gs.impl",
> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
> sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl",
> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
> sc._jsc.hadoopConfiguration().set("fs.gs.project.id", "PUT UR GOOGLE
> PROJECT
> ID HERE")
>
> sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.email",
> "testa...@sparkgcs.iam.gserviceaccount.com")
> sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable",
> "true")
> sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.keyfile",
> "sparkgcs-96bd21691c29.p12")
>
> spark = SparkSession.builder\
> .config(conf=sc.getConf())\
> .getOrCreate()
>
> dfTermRaw = spark.read.format("csv")\
> .option("header", "true")\
> .option("delimiter" ,"\t")\
> .option("inferSchema", "true")\
> .load("gs://bucket_test/sample.tsv")
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-Read-from-Google-store-and-
> save-in-AWS-s3-tp28278p28286.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark Read from Google store and save in AWS s3

2017-01-07 Thread neil90
Here is how you would read from Google Cloud Storage(note you need to create
a service account key) ->

os.environ['PYSPARK_SUBMIT_ARGS'] = """--jars
/home/neil/Downloads/gcs-connector-latest-hadoop2.jar pyspark-shell"""

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext

conf = SparkConf()\
.setMaster("local[8]")\
.setAppName("GS")   

sc = SparkContext(conf=conf)

sc._jsc.hadoopConfiguration().set("fs.gs.impl",
"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl",
"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
sc._jsc.hadoopConfiguration().set("fs.gs.project.id", "PUT UR GOOGLE PROJECT
ID HERE")

sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.email",
"testa...@sparkgcs.iam.gserviceaccount.com")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable",
"true")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.keyfile",
"sparkgcs-96bd21691c29.p12")

spark = SparkSession.builder\
.config(conf=sc.getConf())\
.getOrCreate()

dfTermRaw = spark.read.format("csv")\
.option("header", "true")\
.option("delimiter" ,"\t")\
.option("inferSchema", "true")\
.load("gs://bucket_test/sample.tsv")




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Read-from-Google-store-and-save-in-AWS-s3-tp28278p28286.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Read from Google store and save in AWS s3

2017-01-06 Thread Steve Loughran

On 5 Jan 2017, at 20:07, Manohar Reddy 
> wrote:

Hi Steve,
Thanks for the reply and below is follow-up help needed from you.
Do you mean we can set up two native file systems on a single SparkContext, so 
that based on the URL prefix (gs://bucket/path and dest s3a://bucket-on-s3/path2) 
it will identify and read/write to the appropriate cloud?

Is that my understanding right?


I wouldn't use the term "native FS", as they are all just client libraries to 
talk to the relevant object stores. You'd still have to have the cluster 
"default" FS.

but yes, you can use them: get your classpath right and they are all just URLs 
you use in your code
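
As a rough sketch (the bucket names are placeholders, and it assumes the GCS
connector and the hadoop-aws/S3A jars plus credentials for both stores are
already on the classpath and configured):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-to-s3").getOrCreate()

# read from Google Cloud Storage via the gs:// scheme
df = spark.read.parquet("gs://bucket/path")

# write the same data out to S3 via the s3a:// scheme
df.write.mode("overwrite").parquet("s3a://bucket-on-s3/path2")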

