Spark as an application server cache

2021-02-10 Thread javaguy Java
Hi,

I was just curious: has anyone ever used Spark as an application server cache?

My use case is:
 * I have large datasets which need to be updated/inserted (upserted) into the
database.
 * I have found that it is much easier to run a spark-submit job that pulls the
existing data from the database, compares it in memory with the incoming new
data, and upserts only the necessary rows (removing all duplicates).

I was thinking that if I keep the Spark DataFrame in memory in a long-running
Spark session, I can speed this process up further, since I can drop the
database query on each batch run.

I have a data pipeline in which I'm subscribed to what is essentially a
firehose of information. I want to save everything, but I don't want to update
or save any duplicate data, and I would like to eliminate the duplicates in
memory before making the database I/O call.
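
Roughly, the dedup step looks like the sketch below (simplified; the table /
column names and the JDBC options are placeholders, not my real ones):

```
import org.apache.spark.sql.{DataFrame, SparkSession}

// Keep only the incoming rows that are not already in the database.
def filterNewRows(spark: SparkSession, incoming: DataFrame): DataFrame = {
  // Pull the existing keys from the database once per run.
  val existingKeys = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb") // placeholder connection details
    .option("dbtable", "events")                          // placeholder table
    .option("user", "app_user")
    .option("password", "***")
    .load()
    .select("id")                                         // placeholder key column
    .cache() // this is the read I'd like to avoid by keeping it cached in a long-running session

  incoming
    .dropDuplicates("id")                        // drop duplicates within the incoming batch
    .join(existingKeys, Seq("id"), "left_anti")  // drop rows that already exist in the database
}
```

Only the rows that survive this filter get written back, which is why keeping
the existing data cached in a long-running session would let me skip the
database read on each batch.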

If anyone has used Spark like this, I would appreciate your input, or a
different solution if Spark is not appropriate.

Thx


Re: Issue with accessing S3 from EKS spark pod

2021-02-10 Thread Rishabh Jain
It seemed like I was not able to connect to sts.amazonaws.com; I fixed that error.
Now the Spark write to S3 is able to create the folder structure on S3, but the
final file write fails with the big error below:

org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:226)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:178)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:897)
Exception occurred while running transaction extracts job: Job aborted.
at com.gpn.batch.writer.S3Writer.write(S3Writer.java:9)
at com.gpn.batch.PostedTransactionsJob.main(PostedTransactionsJob.java:47)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 6.0 failed 4 times, most recent failure: Lost task 1.3 in stage 6.0 (TID 17, 10.37.2.40, executor 1): java.nio.file.AccessDeniedException: s3a://gpn-corebatch-posting-extracts/totals-extract-1612978376492/_temporary/0/_temporary/attempt_20210210173339_0006_m_01_17/part-1-43be031c-5f3d-4b4f-bd2d-dc19ed99c7b4-c000.txt: getFileStatus on s3a://gpn-corebatch-posting-extracts/totals-extract-1612978376492/_temporary/0/_temporary/attempt_20210210173339_0006_m_01_17/part-1-43be031c-5f3d-4b4f-bd2d-dc19ed99c7b4-c000.txt: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 86B9CEF5EDA607F8; S3 Extended Request ID: 1XOprWwxqw0OV9mhb4wFkB3cOhwcI/kaFHctXEgGaovT8VTRWjnW6DwaMyO0laeCNUmn1nTbQYY=; Proxy: null), S3 Extended Request ID: 1XOprWwxqw0OV9mhb4wFkB3cOhwcI/kaFHctXEgGaovT8VTRWjnW6DwaMyO0laeCNUmn1nTbQYY=:403 Forbidden
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:230)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:151)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2198)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2163)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2102)
at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:752)
at org.apache.hado

Re: Spark Kubernetes 3.0.1 | podcreationTimeout not working

2021-02-10 Thread Jacek Laskowski
Hi Ranju,

Can you show the pods and their state? Does the situation happen at the
very beginning of spark-submit or some time later (once you've got a couple
of executors)? My understanding is that either the driver or the executors
cannot start up due to a lack of resources, and in either case the pods are
not deleted because they simply wait forever. I might be mistaken here, though.

Which property are you referring to with "this timeout of 60 sec."?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Wed, Feb 10, 2021 at 2:03 PM Ranju Jain 
wrote:

> Hi,
>
>
>
> I submitted the Spark job and the pods go into the Pending state because of
> insufficient resources.
>
> But they are not getting deleted after this timeout of 60 sec. Please help
> me understand.
>
>
>
> Regards
>
> Ranju
>



Spark Kubernetes 3.0.1 | podcreationTimeout not working

2021-02-10 Thread Ranju Jain
Hi,

I submitted the Spark job and the pods go into the Pending state because of
insufficient resources.
But they are not getting deleted after this timeout of 60 sec. Please help me
understand.

Regards
Ranju


Re: Issue with accessing S3 from EKS spark pod

2021-02-10 Thread Rishabh Jain
Hi,

I tried doing what Vladimir suggested, but no luck there either. My guess
is that it has something to do with securityContext.fsGroup. I am trying to
pass the YAML file path along with the spark-submit command. My YAML file
content is:
```

apiVersion: v1
kind: Pod
spec:
  securityContext:
    fsGroup: 65534
  serviceAccount: 
  serviceAccountName: 

```


Is there anything wrong with this yaml file?


~
Thanks,

Rishabh Jain
Application Developer
Email rishabh.j...@thoughtworks.com
Telephone +91 6264277897





On Tue, Feb 9, 2021 at 10:44 PM Vladimir Prus 
wrote:

>
>
> On 9 Feb 2021, at 19:46, Rishabh Jain 
> wrote:
>
> Hi,
>
> We are trying to access S3 from spark job running on EKS cluster pod. I
> have a service account that has an IAM role attached with full S3
> permission. We are using DefaultCredentialsProviderChain.  But still we are
> getting 403 Forbidden from S3.
>
>
> It’s hard to say without any information, but here are some things you might
> want to double-check:
>
> - Make sure the Spark job is using a sufficiently new AWS SDK, so that IAM
> for service accounts is supported
> - Modify your job to print the effective role, e.g.
>
> val stsClient =
> AWSSecurityTokenServiceClientBuilder.standard().build();
> val request = new GetCallerIdentityRequest()
> val identity = stsClient.getCallerIdentity(request)
> println(identity.getArn())
>
> - If the above does not print the expected role, verify that the pods
> actually have the right service account, and
> that  AWS_ROLE_ARN/AWS_WEB_IDENTITY_TOKEN_FILE variables are set on the
> pod, and that
>   the assume policy for the role does allow EKS to assume that role.
> - If the above prints the expected role, then a 403 error means you did not
> set up IAM policies on your role/bucket.
>
>
> Is there anything wrong with our approach?
>
> Generally speaking, IAM for service accounts in EKS + Spark works; it's
> just that there are a lot of things that can go wrong the first time you do it.
>
>
> HTH,
>


Re: Issue with accessing S3 from EKS spark pod

2021-02-10 Thread Vladimir Prus
Hi,

the fsGroup setting should match the id Spark is running as. When building
from source, that id is 185, and you can use "docker inspect <image>"
to double-check.
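
For example, a pod template with that id would look roughly like this (the
service account name is just a placeholder):

```
apiVersion: v1
kind: Pod
spec:
  securityContext:
    fsGroup: 185                    # match the id the Spark image runs as
  serviceAccount: my-spark-sa       # placeholder; use your own service account
  serviceAccountName: my-spark-sa   # placeholder
```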

On Wed, Feb 10, 2021 at 11:43 AM Rishabh Jain 
wrote:

> Hi,
>
> I tried doing what Vladimir suggested. But no luck there either. My guess
> is that it has something to do with securityContext.fsGroup. I am trying to
> pass yaml file path along with spark submit command. My yaml file content
> is
> ```
>
> apiVersion: v1
> kind: Pod
> spec:
>   securityContext:
>     fsGroup: 65534
>   serviceAccount: 
>   serviceAccountName: 
>
> ```
>
>
> Is there anything wrong with this yaml file?
>
>
> ~
> Thanks,
>
> Rishabh Jain
> Application Developer
> Email rishabh.j...@thoughtworks.com
> Telephone +91 6264277897
> 

-- 
Vladimir Prus
http://vladimirprus.com