Spark salesforce connector

2021-11-24 Thread Atlas - Samir Souidi
Dear all,
Do you know if there is any Spark connector for Salesforce?
Thanks
Sam

Sent from Mail for Windows



Re: [Spark] Does Spark support backward and forward compatibility?

2021-11-24 Thread Lalwani, Jayesh
One thing to point out is that you never bundle the Spark client with your code. You 
compile against a Spark version, bundle your code (without the Spark jars) in an uber 
jar, and deploy that uber jar to Spark. Spark already ships with the jars required to 
submit jobs to the scheduler. At runtime, your code uses the jars bundled in the 
instance of Spark that your application is running in.
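For illustration, a minimal sketch of how a typical Maven POM expresses this: the Spark artifacts are declared with provided scope so they are on the compile classpath but excluded from the uber jar (the coordinates below are just the spark-sql artifact already used elsewhere in this thread):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.0</version>
    <!-- provided: available at compile time, supplied by the Spark runtime on the cluster -->
    <scope>provided</scope>
</dependency>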

Spark is backward compatible; i.e., a jar compiled against 3.1.x will run on a Spark 
3.2.0 cluster.
As Sean mentioned, Spark is not guaranteed to be forward compatible; i.e., a jar 
compiled against 3.2.1 may not run on a Spark 2.4.0 cluster. It might work if the 
functions called from your code are available in 2.4.0, but it will fail if you call 
an API that was introduced after 2.4.0.

So the question of "Can I use an older version of the client to submit jobs to a 
newer version of Spark?" is moot. You never do that.

From: Amin Borjian 
Date: Wednesday, November 24, 2021 at 2:44 PM
To: Sean Owen 
Cc: "user@spark.apache.org" 
Subject: RE: [EXTERNAL] [Spark] Does Spark support backward and forward 
compatibility?




Thanks again for the reply.

Personally, I think the whole cluster should have a single version. What mattered 
most to me was how much the version of the client that submits jobs to the scheduler 
matters; it sounds like we can hope everything works well across small version 
changes (i.e., changes below a major version).

From: Sean Owen
Sent: Wednesday, November 24, 2021 10:48 PM
To: Amin Borjian
Cc: user@spark.apache.org
Subject: Re: [Spark] Does Spark support backward and forward compatibility?

I think/hope that it goes without saying you can't mix Spark versions within a 
cluster.
Forwards compatibility is something you don't generally expect as a default 
from any piece of software, so not sure there is something to document 
explicitly.
Backwards compatibility is important, and this is documented extensively where 
it doesn't hold in the Spark docs and release notes.


On Wed, Nov 24, 2021 at 1:16 PM Amin Borjian <borjianami...@outlook.com> wrote:
Thank you very much for the reply you sent. It would be great if these items 
were mentioned in the Spark document (for example, the download page or 
something else)

If I understand correctly, it means that we can compile the client (for example 
Java, etc.) with a newer version (for example 3.2.0) within the range of a 
major version against older server (for example 3.1.x) and do not see any 
problem in most cases. Am I right? (Because the issue of backward-compatibility 
can be expressed from both the server and the client view, I repeated the 
sentence to make sure I got it right.)

But what happens if we update the server to 3.2.x while our client is still on version 
3.1.x? Can the client work with the newer cluster version, given that it only uses 
old server features? (Maybe that is what you meant, and my previous sentence was 
wrong and I misunderstood.)

From: Sean Owen
Sent: Wednesday, November 24, 2021 5:38 PM
To: Amin Borjian
Cc: user@spark.apache.org
Subject: Re: [Spark] Does Spark support backward and forward compatibility?

Can you mix different Spark versions on driver and executor? no.
Can you compile against a different version of Spark than you run on? That 
typically works within a major release, though forwards compatibility may not 
work (you can't use a feature that doesn't exist in the version on the 
cluster). Compiling vs 3.2.0 and running on 3.1.x for example should work fine 
in 99% of cases.

On Wed, Nov 24, 2021 at 8:04 AM Amin Borjian <borjianami...@outlook.com> wrote:

I have a simple question about using Spark, which although most tools usually 
explain this question explicitly (in important text, such as a specific format 
or a separate page), I did not find it anywhere. Maybe my search was not 
enough, but I thought it was good that I ask this question in the hope that 
maybe the answer will benefit other people as well.

Spark binary is usually downloaded from the following link and installed and 
configured on the cluster: Download Apache 
Spark

If, for example, we use the Java language for programming (although it can be 
other supported languages), we need the following dependencies to communicate 
with Spark:



<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.0</version>
</dependency>



As is clear, both the Spark cluster (binary of Spark) and the dependencies used 
on the application side have a specific version. In my opinion, it is obvious 
that if the version used is the same on both the application side and the 
server side, everything will most likely work in its ideal state without any 
problems.

RE: [Spark] Does Spark support backward and forward compatibility?

2021-11-24 Thread Amin Borjian
Thanks again for the reply.

Personally, I think the whole cluster should have a single version. What mattered 
most to me was how much the version of the client that submits jobs to the scheduler 
matters; it sounds like we can hope everything works well across small version 
changes (i.e., changes below a major version).

From: Sean Owen
Sent: Wednesday, November 24, 2021 10:48 PM
To: Amin Borjian
Cc: user@spark.apache.org
Subject: Re: [Spark] Does Spark support backward and forward compatibility?

I think/hope that it goes without saying you can't mix Spark versions within a 
cluster.
Forwards compatibility is something you don't generally expect as a default 
from any piece of software, so not sure there is something to document 
explicitly.
Backwards compatibility is important, and this is documented extensively where 
it doesn't hold in the Spark docs and release notes.


On Wed, Nov 24, 2021 at 1:16 PM Amin Borjian <borjianami...@outlook.com> wrote:
Thank you very much for the reply you sent. It would be great if these items 
were mentioned in the Spark document (for example, the download page or 
something else)

If I understand correctly, it means that we can compile the client (for example 
Java, etc.) with a newer version (for example 3.2.0) within the range of a 
major version against older server (for example 3.1.x) and do not see any 
problem in most cases. Am I right? (Because the issue of backward-compatibility 
can be expressed from both the server and the client view, I repeated the 
sentence to make sure I got it right.)

But what happens if we update the server to 3.2.x while our client is still on version 
3.1.x? Can the client work with the newer cluster version, given that it only uses 
old server features? (Maybe that is what you meant, and my previous sentence was 
wrong and I misunderstood.)

From: Sean Owen
Sent: Wednesday, November 24, 2021 5:38 PM
To: Amin Borjian
Cc: user@spark.apache.org
Subject: Re: [Spark] Does Spark support backward and forward compatibility?

Can you mix different Spark versions on driver and executor? no.
Can you compile against a different version of Spark than you run on? That 
typically works within a major release, though forwards compatibility may not 
work (you can't use a feature that doesn't exist in the version on the 
cluster). Compiling vs 3.2.0 and running on 3.1.x for example should work fine 
in 99% of cases.

On Wed, Nov 24, 2021 at 8:04 AM Amin Borjian <borjianami...@outlook.com> wrote:

I have a simple question about using Spark, which although most tools usually 
explain this question explicitly (in important text, such as a specific format 
or a separate page), I did not find it anywhere. Maybe my search was not 
enough, but I thought it was good that I ask this question in the hope that 
maybe the answer will benefit other people as well.

Spark binary is usually downloaded from the following link and installed and 
configured on the cluster: Download Apache 
Spark

If, for example, we use the Java language for programming (although it can be 
other supported languages), we need the following dependencies to communicate 
with Spark:



<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.0</version>
</dependency>



As is clear, both the Spark cluster (binary of Spark) and the dependencies used 
on the application side have a specific version. In my opinion, it is obvious 
that if the version used is the same on both the application side and the 
server side, everything will most likely work in its ideal state without any 
problems.

But the question is, what if the two versions are not the same? Is it possible 
to have compatibility between the server and the application in specific number 
of conditions (such as not changing major version)? Or, for example, if the 
client is always ahead, is it not a problem? Or if the server is always ahead, 
is it not a problem?

The argument is that there may be a library that I did not write and it is an 
old version, but I want to update my cluster (server version). Or it may not be 
possible for me to update the server version and all the applications version 
at the same time, so I want to update each one separately. As a result, the 
application-server version differs in a period of time. (maybe short or long 
period) I want to know exactly how Spark works in this situation.




Re: [Spark] Does Spark support backward and forward compatibility?

2021-11-24 Thread Martin Wunderlich

Hi Amin,

This might be only marginally relevant to your question, but in my 
project I also noticed the following: The trained and exported Spark 
models (i.e. pipelines saved to binary files) are also not compatible 
between versions, at least between major versions. I noticed this when 
trying to load a model built with Spark 2.4.4 after updating to 3.2.0. 
This didn't work.


Cheers,

Martin

Am 24.11.21 um 20:18 schrieb Sean Owen:
I think/hope that it goes without saying you can't mix Spark versions 
within a cluster.
Forwards compatibility is something you don't generally expect as a 
default from any piece of software, so not sure there is something to 
document explicitly.
Backwards compatibility is important, and this is documented 
extensively where it doesn't hold in the Spark docs and release notes.



On Wed, Nov 24, 2021 at 1:16 PM Amin Borjian 
 wrote:


Thank you very much for the reply you sent. It would be great if
these items were mentioned in the Spark document (for example, the
download page or something else)

If I understand correctly, it means that we can compile the client
(for example Java, etc.) with a newer version (for example 3.2.0)
within the range of a major version against older server (for
example 3.1.x) and do not see any problem in most cases. Am I
right?(Because the issue of backward-compatibility can be
expressed from both the server and the client view, I repeated the
sentence to make sure I got it right.)

But what happens if we update the server to 3.2.x while our client is still
on version 3.1.x? Can the client work with the newer cluster version,
given that it only uses old server features? (Maybe that is what you
meant, and my previous sentence was wrong and I misunderstood.)

*From: *Sean Owen 
*Sent: *Wednesday, November 24, 2021 5:38 PM
*To: *Amin Borjian 
*Cc: *user@spark.apache.org
*Subject: *Re: [Spark] Does Spark support backward and forward
compatibility?

Can you mix different Spark versions on driver and executor? no.

Can you compile against a different version of Spark than you run
on? That typically works within a major release, though forwards
compatibility may not work (you can't use a feature that doesn't
exist in the version on the cluster). Compiling vs 3.2.0 and
running on 3.1.x for example should work fine in 99% of cases.

On Wed, Nov 24, 2021 at 8:04 AM Amin Borjian
 wrote:

I have a simple question about using Spark, which although
most tools usually explain this question explicitly (in
important text, such as a specific format or a separate page),
I did not find it anywhere. Maybe my search was not enough,
but I thought it was good that I ask this question in the hope
that maybe the answer will benefit other people as well.

Spark binary is usually downloaded from the following link and
installed and configured on the cluster: Download Apache Spark


If, for example, we use the Java language for programming
(although it can be other supported languages), we need the
following dependencies to communicate with Spark:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

As is clear, both the Spark cluster (binary of Spark) and the
dependencies used on the application side have a specific
version. In my opinion, it is obvious that if the version used
is the same on both the application side and the server side,
everything will most likely work in its ideal state without
any problems.

But the question is, what if the two versions are not the
same? Is it possible to have compatibility between the server
and the application in specific number of conditions (such as
not changing major version)? Or, for example, if the client is
always ahead, is it not a problem? Or if the server is always
ahead, is it not a problem?

The argument is that there may be a library that I did not write
and it is an old version, but I want to update my cluster (server
version). Or it may not be possible for me to update the server
version and all the applications version at the same time, so I
want to update each one separately. As a result, the
application-server version differs in a period of time. (maybe
short or long period) I want to know exactly how Spark works in
this situation.


Re: [Spark] Does Spark support backward and forward compatibility?

2021-11-24 Thread Sean Owen
I think/hope that it goes without saying you can't mix Spark versions
within a cluster.
Forwards compatibility is something you don't generally expect as a default
from any piece of software, so not sure there is something to document
explicitly.
Backwards compatibility is important, and this is documented extensively
where it doesn't hold in the Spark docs and release notes.


On Wed, Nov 24, 2021 at 1:16 PM Amin Borjian 
wrote:

> Thank you very much for the reply you sent. It would be great if these
> items were mentioned in the Spark document (for example, the download page
> or something else)
>
>
>
> If I understand correctly, it means that we can compile the client (for
> example Java, etc.) with a newer version (for example 3.2.0) within the
> range of a major version against older server (for example 3.1.x) and do
> not see any problem in most cases. Am I right? (Because the issue of
> backward-compatibility can be expressed from both the server and the client
> view, I repeated the sentence to make sure I got it right.)
>
>
>
> But what happens if we update the server to 3.2.x while our client is still on
> version 3.1.x? Can the client work with the newer cluster version, given that
> it only uses old server features? (Maybe that is what you meant, and my
> previous sentence was wrong and I misunderstood.)
>
>
>
> *From: *Sean Owen 
> *Sent: *Wednesday, November 24, 2021 5:38 PM
> *To: *Amin Borjian 
> *Cc: *user@spark.apache.org
> *Subject: *Re: [Spark] Does Spark support backward and forward
> compatibility?
>
>
>
> Can you mix different Spark versions on driver and executor? no.
>
> Can you compile against a different version of Spark than you run on? That
> typically works within a major release, though forwards compatibility may
> not work (you can't use a feature that doesn't exist in the version on the
> cluster). Compiling vs 3.2.0 and running on 3.1.x for example should work
> fine in 99% of cases.
>
>
>
> On Wed, Nov 24, 2021 at 8:04 AM Amin Borjian 
> wrote:
>
> I have a simple question about using Spark, which although most tools
> usually explain this question explicitly (in important text, such as a
> specific format or a separate page), I did not find it anywhere. Maybe my
> search was not enough, but I thought it was good that I ask this question
> in the hope that maybe the answer will benefit other people as well.
>
> Spark binary is usually downloaded from the following link and installed
> and configured on the cluster: Download Apache Spark
> 
>
> If, for example, we use the Java language for programming (although it can
> be other supported languages), we need the following dependencies to
> communicate with Spark:
>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.12</artifactId>
>     <version>3.2.0</version>
> </dependency>
>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.12</artifactId>
>     <version>3.2.0</version>
> </dependency>
>
> As is clear, both the Spark cluster (binary of Spark) and the dependencies
> used on the application side have a specific version. In my opinion, it is
> obvious that if the version used is the same on both the application side
> and the server side, everything will most likely work in its ideal state
> without any problems.
>
> But the question is, what if the two versions are not the same? Is it
> possible to have compatibility between the server and the application in
> specific number of conditions (such as not changing major version)? Or, for
> example, if the client is always ahead, is it not a problem? Or if the
> server is always ahead, is it not a problem?
>
> The argument is that there may be a library that I did not write and it is
> an old version, but I want to update my cluster (server version). Or it may
> not be possible for me to update the server version and all the
> applications version at the same time, so I want to update each one
> separately. As a result, the application-server version differs in a period
> of time. (maybe short or long period) I want to know exactly how Spark
> works in this situation.
>
>
>


RE: [Spark] Does Spark support backward and forward compatibility?

2021-11-24 Thread Amin Borjian
Thank you very much for the reply. It would be great if these points were mentioned 
in the Spark documentation (for example, on the download page or somewhere else).

If I understand correctly, it means that we can compile the client (for example in 
Java, etc.) against a newer version (for example 3.2.0) within the range of a major 
version and run it against an older server (for example 3.1.x) without seeing any 
problem in most cases. Am I right? (Because backward compatibility can be expressed 
from both the server view and the client view, I repeated the sentence to make sure 
I got it right.)

But what happens if we update the server to 3.2.x while our client is still on version 
3.1.x? Can the client work with the newer cluster version, given that it only uses 
old server features? (Maybe that is what you meant, and my previous sentence was 
wrong and I misunderstood.)

From: Sean Owen
Sent: Wednesday, November 24, 2021 5:38 PM
To: Amin Borjian
Cc: user@spark.apache.org
Subject: Re: [Spark] Does Spark support backward and forward compatibility?

Can you mix different Spark versions on driver and executor? no.
Can you compile against a different version of Spark than you run on? That 
typically works within a major release, though forwards compatibility may not 
work (you can't use a feature that doesn't exist in the version on the 
cluster). Compiling vs 3.2.0 and running on 3.1.x for example should work fine 
in 99% of cases.

On Wed, Nov 24, 2021 at 8:04 AM Amin Borjian <borjianami...@outlook.com> wrote:

I have a simple question about using Spark, which although most tools usually 
explain this question explicitly (in important text, such as a specific format 
or a separate page), I did not find it anywhere. Maybe my search was not 
enough, but I thought it was good that I ask this question in the hope that 
maybe the answer will benefit other people as well.

Spark binary is usually downloaded from the following link and installed and 
configured on the cluster: Download Apache 
Spark

If, for example, we use the Java language for programming (although it can be 
other supported languages), we need the following dependencies to communicate 
with Spark:



<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.0</version>
</dependency>



As is clear, both the Spark cluster (binary of Spark) and the dependencies used 
on the application side have a specific version. In my opinion, it is obvious 
that if the version used is the same on both the application side and the 
server side, everything will most likely work in its ideal state without any 
problems.

But the question is, what if the two versions are not the same? Is it possible 
to have compatibility between the server and the application in specific number 
of conditions (such as not changing major version)? Or, for example, if the 
client is always ahead, is it not a problem? Or if the server is always ahead, 
is it not a problem?

The argument is that there may be a library that I did not write and it is an 
old version, but I want to update my cluster (server version). Or it may not be 
possible for me to update the server version and all the applications version 
at the same time, so I want to update each one separately. As a result, the 
application-server version differs in a period of time. (maybe short or long 
period) I want to know exactly how Spark works in this situation.



Listening to ExternalCatalogEvent in Spark 3

2021-11-24 Thread Khai Tran
Hello community,
Previously, in Spark 2.4, we listened for and captured ExternalCatalogEvent in the
"onOtherEvent()" method of SparkListener, but with Spark 3 we no longer
see those events.

Just wondering if there is any behavior change in how ExternalCatalogEvent is
emitted in Spark 3, and if yes, where should I get those events in Spark 3 now?

Thank you,
Khai


Re: [issue] not able to add external libs to pyspark job while using spark-submit

2021-11-24 Thread Mich Talebzadeh
I am not sure about that. However, with Kubernetes and a Docker image for
PySpark, I build the packages into the image itself, as below in the
Dockerfile:

RUN pip install pyyaml numpy cx_Oracle

and that will add those packages so that you can reference them in your py script

import yaml
import cx_Oracle

HTH







   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 24 Nov 2021 at 17:44, Bode, Meikel, NMA-CFD <
meikel.b...@bertelsmann.de> wrote:

> Can we add Python dependencies as we can do for mvn coordinates? So that
> we run sth like pip install  or download from pypi index?
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Mittwoch, 24. November 2021 18:28
> *Cc:* user@spark.apache.org
> *Subject:* Re: [issue] not able to add external libs to pyspark job while
> using spark-submit
>
>
>
> The easiest way to set this up is to create dependencies.zip file.
>
>
>
> Assuming that you have a virtual environment already set-up, where there
> is directory called site-packages, go to that directory and just create a
> minimal a shell script  say package_and_zip_dependencies.sh to do it for
> you
>
>
>
> Example:
>
>
>
> cat package_and_zip_dependencies.sh
>
>
>
> #!/bin/bash
>
> # https://blog.danielcorin.com/posts/2015-11-09-pyspark/
> 
>
> zip -r ../dependencies.zip .
>
> ls -l ../dependencies.zip
>
> exit 0
>
>
>
> Once created, create an environment variable called DEPENDENCIES
>
>
>
> export DEPENDENCIES="/usr/src/Python-3.7.3/airflow_virtualenv/lib/python3.7/dependencies.zip"
>
>
>
> Then in spark-submit you can do this
>
>
>
> spark-submit --master yarn --deploy-mode client --driver-memory xG
> --executor-memory yG --num-executors m --executor-cores n --py-files
> $DEPENDENCIES --jars $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.0.jar
>
>
>
> Also check this link as well
> https://blog.danielcorin.com/posts/2015-11-09-pyspark/
> 
>
>
>
> HTH
>
>
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Wed, 24 Nov 2021 at 14:03, Atheer Alabdullatif 
> wrote:
>
> Dear Spark team,
>
> hope my email finds you well
>
>
>
>
>
> I am using pyspark 3.0 and facing an issue with adding external library
> [configparser] while running the job using [spark-submit] & [yarn]
>
> issue:
>
>
>
> import configparser
>
> ImportError: No module named configparser
>
> 21/11/24 08:54:38 INFO util.ShutdownHookManager: Shutdown hook called
>
> solutions I tried:
>
> 1- installing library src files and adding it to the session using
> [addPyFile]:
>
>- files structure:
>
> -- main dir
>
>-- subdir
>
>   -- libs
>
>  -- configparser-5.1.0
>
> -- src
>
>-- configparser.py
>
>  -- configparser.zip
>
>   -- sparkjob.py
>
> 1.a zip file:
>
> spark = SparkSession.builder.appName(jobname + '_' + table).config(
>
> "spark.mongodb.input.uri", uri +
>
> "." +
>
> table +
>
> "").config(
>
> "spark.mongodb.input.sampleSize",
>
> 990).getOrCreate()
>
>
>
> spark.sparkContext.addPyFile('/maindir/subdir/libs/configparser.zip')

RE: [issue] not able to add external libs to pyspark job while using spark-submit

2021-11-24 Thread Bode, Meikel, NMA-CFD
Can we add Python dependencies the way we can for mvn coordinates, so that we run 
something like pip install or download from the PyPI index?

From: Mich Talebzadeh 
Sent: Mittwoch, 24. November 2021 18:28
Cc: user@spark.apache.org
Subject: Re: [issue] not able to add external libs to pyspark job while using 
spark-submit

The easiest way to set this up is to create dependencies.zip file.

Assuming that you have a virtual environment already set-up, where there is 
directory called site-packages, go to that directory and just create a minimal 
a shell script  say package_and_zip_dependencies.sh to do it for you

Example:

cat package_and_zip_dependencies.sh

#!/bin/bash
# 
https://blog.danielcorin.com/posts/2015-11-09-pyspark/
zip -r ../dependencies.zip .
ls -l ../dependencies.zip
exit 0

Once created, create an environment variable called DEPENDENCIES

export DEPENDENCIES="/usr/src/Python-3.7.3/airflow_virtualenv/lib/python3.7/dependencies.zip"

Then in spark-submit you can do this

spark-submit --master yarn --deploy-mode client --driver-memory xG 
--executor-memory yG --num-executors m --executor-cores n --py-files 
$DEPENDENCIES --jars $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.0.jar

Also check this link as well  
https://blog.danielcorin.com/posts/2015-11-09-pyspark/

HTH



 
   view my Linkedin 
profile



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Wed, 24 Nov 2021 at 14:03, Atheer Alabdullatif <a.alabdulla...@lean.sa> wrote:
Dear Spark team,
hope my email finds you well



I am using pyspark 3.0 and facing an issue with adding external library 
[configparser] while running the job using [spark-submit] & [yarn]

issue:



import configparser

ImportError: No module named configparser

21/11/24 08:54:38 INFO util.ShutdownHookManager: Shutdown hook called

solutions I tried:

1- installing library src files and adding it to the session using [addPyFile]:

  *   files structure:

-- main dir

   -- subdir

  -- libs

 -- configparser-5.1.0

-- src

   -- configparser.py

 -- configparser.zip

  -- sparkjob.py

1.a zip file:

spark = SparkSession.builder.appName(jobname + '_' + table).config(

"spark.mongodb.input.uri", uri +

"." +

table +

"").config(

"spark.mongodb.input.sampleSize",

990).getOrCreate()



spark.sparkContext.addPyFile('/maindir/subdir/libs/configparser.zip')

df = spark.read.format("mongo").load()

1.b python file

spark = SparkSession.builder.appName(jobname + '_' + table).config(

"spark.mongodb.input.uri", uri +

"." +

table +

"").config(

"spark.mongodb.input.sampleSize",

990).getOrCreate()



spark.sparkContext.addPyFile('maindir/subdir/libs/configparser-5.1.0/src/configparser.py')

df = spark.read.format("mongo").load()



2- using os library

def install_libs():

'''

this function used to install external python libs in yarn

'''

os.system("pip3 install configparser")



if __name__ == "__main__":



# install libs

install_libs()



we value your support

best,

Atheer Alabdullatif




Re: [issue] not able to add external libs to pyspark job while using spark-submit

2021-11-24 Thread Mich Talebzadeh
The easiest way to set this up is to create dependencies.zip file.

Assuming that you have a virtual environment already set-up, where there is
directory called site-packages, go to that directory and just create a
minimal a shell script  say package_and_zip_dependencies.sh to do it for you

Example:

cat package_and_zip_dependencies.sh

#!/bin/bash
# https://blog.danielcorin.com/posts/2015-11-09-pyspark/
zip -r ../dependencies.zip .
ls -l ../dependencies.zip
exit 0

Once created, create an environment variable called DEPENDENCIES

export DEPENDENCIES="/usr/src/Python-3.7.3/airflow_virtualenv/lib/python3.7/dependencies.zip"

Then in spark-submit you can do this

spark-submit --master yarn --deploy-mode client --driver-memory xG
--executor-memory yG --num-executors m --executor-cores n --py-files
$DEPENDENCIES --jars $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.0.jar

Also check this link as well
https://blog.danielcorin.com/posts/2015-11-09-pyspark/
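As a minimal sketch of the job side (assuming dependencies.zip contains a pure-Python package such as yaml and was shipped with --py-files as above):

# sparkjob.py -- sketch only; assumes dependencies.zip was passed via --py-files
from pyspark.sql import SparkSession

import yaml  # example module assumed to be inside dependencies.zip

spark = SparkSession.builder.appName("deps_demo").getOrCreate()
# the import above is resolved from the shipped zip on the driver and executors
print(yaml.safe_dump({"dependencies": "resolved"}))
spark.stop()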

HTH



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 24 Nov 2021 at 14:03, Atheer Alabdullatif 
wrote:

> Dear Spark team,
> hope my email finds you well
>
>
> I am using pyspark 3.0 and facing an issue with adding external library
> [configparser] while running the job using [spark-submit] & [yarn]
>
> issue:
>
>
> import configparser
> ImportError: No module named configparser
> 21/11/24 08:54:38 INFO util.ShutdownHookManager: Shutdown hook called
>
> solutions I tried:
>
> 1- installing library src files and adding it to the session using
> [addPyFile]:
>
>
>- files structure:
>
> -- main dir
>-- subdir
>   -- libs
>  -- configparser-5.1.0
> -- src
>-- configparser.py
>  -- configparser.zip
>   -- sparkjob.py
>
> 1.a zip file:
>
> spark = SparkSession.builder.appName(jobname + '_' + table).config(
> "spark.mongodb.input.uri", uri +
> "." +
> table +
> "").config(
> "spark.mongodb.input.sampleSize",
> 990).getOrCreate()
>
> spark.sparkContext.addPyFile('/maindir/subdir/libs/configparser.zip')
> df = spark.read.format("mongo").load()
>
> 1.b python file
>
> spark = SparkSession.builder.appName(jobname + '_' + table).config(
> "spark.mongodb.input.uri", uri +
> "." +
> table +
> "").config(
> "spark.mongodb.input.sampleSize",
> 990).getOrCreate()
>
> spark.sparkContext.addPyFile('maindir/subdir/libs/configparser-5.1.0/src/configparser.py')
> df = spark.read.format("mongo").load()
>
>
> 2- using os library
>
> def install_libs():
> '''
> this function used to install external python libs in yarn
> '''
> os.system("pip3 install configparser")
> if __name__ == "__main__":
>
> # install libs
> install_libs()
>
>
> we value your support
>
> best,
>
> Atheer Alabdullatif
>
>
>
>
>
>
>
>
>
> **Confidentiality & Disclaimer Notice**
> This e-mail message, including any attachments, is for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information or otherwise protected by law. If you are not the intended
> recipient, please immediately notify the sender, delete the e-mail, and do
> not retain any copies of it. It is prohibited to use, disseminate or
> distribute the content of this e-mail, directly or indirectly, without
> prior written consent. Lean accepts no liability for damage caused by any
> virus that may be transmitted by this Email.
>
>
>
>
>


Re: [issue] not able to add external libs to pyspark job while using spark-submit

2021-11-24 Thread Atheer Alabdullatif
Hello Owen,
Thank you for your prompt reply!
We will check it out.

best,
Atheer Alabdullatif

From: Sean Owen 
Sent: Wednesday, November 24, 2021 5:06 PM
To: Atheer Alabdullatif 
Cc: user@spark.apache.org ; Data Engineering 

Subject: Re: [issue] not able to add external libs to pyspark job while using 
spark-submit

That's not how you add a library. From the docs: 
https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html

On Wed, Nov 24, 2021 at 8:02 AM Atheer Alabdullatif <a.alabdulla...@lean.sa> wrote:
Dear Spark team,
hope my email finds you well



I am using pyspark 3.0 and facing an issue with adding external library 
[configparser] while running the job using [spark-submit] & [yarn]

issue:


import configparser
ImportError: No module named configparser
21/11/24 08:54:38 INFO util.ShutdownHookManager: Shutdown hook called

solutions I tried:

1- installing library src files and adding it to the session using [addPyFile]:

  *   files structure:

-- main dir
   -- subdir
  -- libs
 -- configparser-5.1.0
-- src
   -- configparser.py
 -- configparser.zip
  -- sparkjob.py

1.a zip file:

spark = SparkSession.builder.appName(jobname + '_' + table).config(
"spark.mongodb.input.uri", uri +
"." +
table +
"").config(
"spark.mongodb.input.sampleSize",
990).getOrCreate()

spark.sparkContext.addPyFile('/maindir/subdir/libs/configparser.zip')
df = spark.read.format("mongo").load()

1.b python file

spark = SparkSession.builder.appName(jobname + '_' + table).config(
"spark.mongodb.input.uri", uri +
"." +
table +
"").config(
"spark.mongodb.input.sampleSize",
990).getOrCreate()

spark.sparkContext.addPyFile('maindir/subdir/libs/configparser-5.1.0/src/configparser.py')
df = spark.read.format("mongo").load()


2- using os library

def install_libs():
'''
this function used to install external python libs in yarn
'''
os.system("pip3 install configparser")

if __name__ == "__main__":

# install libs
install_libs()


we value your support

best,

Atheer Alabdullatif









*Confidentiality & Disclaimer Notice*
This e-mail message, including any attachments, is for the sole use of the 
intended recipient(s) and may contain confidential and privileged information 
or otherwise protected by law. If you are not the intended recipient, please 
immediately notify the sender, delete the e-mail, and do not retain any copies 
of it. It is prohibited to use, disseminate or distribute the content of this 
e-mail, directly or indirectly, without prior written consent. Lean accepts no 
liability for damage caused by any virus that may be transmitted by this Email.






Re: [Spark] Does Spark support backward and forward compatibility?

2021-11-24 Thread Sean Owen
Can you mix different Spark versions on driver and executor? no.
Can you compile against a different version of Spark than you run on? That
typically works within a major release, though forwards compatibility may
not work (you can't use a feature that doesn't exist in the version on the
cluster). Compiling vs 3.2.0 and running on 3.1.x for example should work
fine in 99% of cases.

On Wed, Nov 24, 2021 at 8:04 AM Amin Borjian 
wrote:

> I have a simple question about using Spark, which although most tools
> usually explain this question explicitly (in important text, such as a
> specific format or a separate page), I did not find it anywhere. Maybe my
> search was not enough, but I thought it was good that I ask this question
> in the hope that maybe the answer will benefit other people as well.
>
> Spark binary is usually downloaded from the following link and installed
> and configured on the cluster: Download Apache Spark
> 
>
> If, for example, we use the Java language for programming (although it can
> be other supported languages), we need the following dependencies to
> communicate with Spark:
>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.12</artifactId>
>     <version>3.2.0</version>
> </dependency>
>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.12</artifactId>
>     <version>3.2.0</version>
> </dependency>
>
> As is clear, both the Spark cluster (binary of Spark) and the dependencies
> used on the application side have a specific version. In my opinion, it is
> obvious that if the version used is the same on both the application side
> and the server side, everything will most likely work in its ideal state
> without any problems.
>
> But the question is, what if the two versions are not the same? Is it
> possible to have compatibility between the server and the application in
> specific number of conditions (such as not changing major version)? Or, for
> example, if the client is always ahead, is it not a problem? Or if the
> server is always ahead, is it not a problem?
>
> The argument is that there may be a library that I did not write and it is
> an old version, but I want to update my cluster (server version). Or it may
> not be possible for me to update the server version and all the
> applications version at the same time, so I want to update each one
> separately. As a result, the application-server version differs in a period
> of time. (maybe short or long period) I want to know exactly how Spark
> works in this situation.
>


Re: [issue] not able to add external libs to pyspark job while using spark-submit

2021-11-24 Thread Sean Owen
That's not how you add a library. From the docs:
https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
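For example, a rough sketch of the py-files route described on that page; deps.zip and mylib are placeholders, and the key detail is that the importable module or package must sit at the root of the archive:

# sketch: ship a pure-Python dependency to the driver and executors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pkg_demo").getOrCreate()

# deps.zip is assumed to contain mylib.py (or a mylib/ package) at its root;
# the same archive could instead be passed to spark-submit with --py-files deps.zip
spark.sparkContext.addPyFile("/path/to/deps.zip")

import mylib  # hypothetical module, now importable from the shipped archive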

On Wed, Nov 24, 2021 at 8:02 AM Atheer Alabdullatif 
wrote:

> Dear Spark team,
> hope my email finds you well
>
>
> I am using pyspark 3.0 and facing an issue with adding external library
> [configparser] while running the job using [spark-submit] & [yarn]
>
> issue:
>
>
> import configparser
> ImportError: No module named configparser
> 21/11/24 08:54:38 INFO util.ShutdownHookManager: Shutdown hook called
>
> solutions I tried:
>
> 1- installing library src files and adding it to the session using
> [addPyFile]:
>
>
>- files structure:
>
> -- main dir
>-- subdir
>   -- libs
>  -- configparser-5.1.0
> -- src
>-- configparser.py
>  -- configparser.zip
>   -- sparkjob.py
>
> 1.a zip file:
>
> spark = SparkSession.builder.appName(jobname + '_' + table).config(
> "spark.mongodb.input.uri", uri +
> "." +
> table +
> "").config(
> "spark.mongodb.input.sampleSize",
> 990).getOrCreate()
>
> spark.sparkContext.addPyFile('/maindir/subdir/libs/configparser.zip')
> df = spark.read.format("mongo").load()
>
> 1.b python file
>
> spark = SparkSession.builder.appName(jobname + '_' + table).config(
> "spark.mongodb.input.uri", uri +
> "." +
> table +
> "").config(
> "spark.mongodb.input.sampleSize",
> 990).getOrCreate()
>
> spark.sparkContext.addPyFile('maindir/subdir/libs/configparser-5.1.0/src/configparser.py')
> df = spark.read.format("mongo").load()
>
>
> 2- using os library
>
> def install_libs():
> '''
> this function used to install external python libs in yarn
> '''
> os.system("pip3 install configparser")
> if __name__ == "__main__":
>
> # install libs
> install_libs()
>
>
> we value your support
>
> best,
>
> Atheer Alabdullatif
>
>
>
>
>
>
>
>
>
> **Confidentiality & Disclaimer Notice**
> This e-mail message, including any attachments, is for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information or otherwise protected by law. If you are not the intended
> recipient, please immediately notify the sender, delete the e-mail, and do
> not retain any copies of it. It is prohibited to use, disseminate or
> distribute the content of this e-mail, directly or indirectly, without
> prior written consent. Lean accepts no liability for damage caused by any
> virus that may be transmitted by this Email.
>
>
>
>
>


[Spark] Does Spark support backward and forward compatibility?

2021-11-24 Thread Amin Borjian
I have a simple question about using Spark which, although most tools usually answer 
explicitly (in prominent text, such as a dedicated section or a separate page), I 
could not find addressed anywhere. Maybe my search was not thorough enough, but I 
thought it would be good to ask in the hope that the answer may benefit other people 
as well.

Spark binary is usually downloaded from the following link and installed and 
configured on the cluster: Download Apache 
Spark

If, for example, we use the Java language for programming (although it can be 
other supported languages), we need the following dependencies to communicate 
with Spark:



<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.0</version>
</dependency>



As is clear, both the Spark cluster (binary of Spark) and the dependencies used 
on the application side have a specific version. In my opinion, it is obvious 
that if the version used is the same on both the application side and the 
server side, everything will most likely work in its ideal state without any 
problems.

But the question is: what if the two versions are not the same? Is compatibility 
between the server and the application guaranteed under specific conditions (such as 
not changing the major version)? For example, is it a problem if the client is always 
ahead? Or if the server is always ahead?

The motivation is that there may be a library I did not write that is built against 
an old version, but I want to update my cluster (server version). Or it may not be 
possible for me to update the server version and all the application versions at the 
same time, so I want to update each one separately. As a result, the application and 
server versions differ for a period of time (maybe short, maybe long). I want to know 
exactly how Spark behaves in this situation.


[issue] not able to add external libs to pyspark job while using spark-submit

2021-11-24 Thread Atheer Alabdullatif
Dear Spark team,
hope my email finds you well



I am using pyspark 3.0 and facing an issue with adding external library 
[configparser] while running the job using [spark-submit] & [yarn]

issue:


import configparser
ImportError: No module named configparser
21/11/24 08:54:38 INFO util.ShutdownHookManager: Shutdown hook called

solutions I tried:

1- installing library src files and adding it to the session using [addPyFile]:

  *   files structure:

-- main dir
   -- subdir
  -- libs
 -- configparser-5.1.0
-- src
   -- configparser.py
 -- configparser.zip
  -- sparkjob.py

1.a zip file:

spark = SparkSession.builder.appName(jobname + '_' + table).config(
    "spark.mongodb.input.uri", uri + "." + table + "").config(
    "spark.mongodb.input.sampleSize", 990).getOrCreate()

spark.sparkContext.addPyFile('/maindir/subdir/libs/configparser.zip')
df = spark.read.format("mongo").load()

1.b python file

spark = SparkSession.builder.appName(jobname + '_' + table).config(
    "spark.mongodb.input.uri", uri + "." + table + "").config(
    "spark.mongodb.input.sampleSize", 990).getOrCreate()

spark.sparkContext.addPyFile('maindir/subdir/libs/configparser-5.1.0/src/configparser.py')
df = spark.read.format("mongo").load()


2- using os library

def install_libs():
    '''
    this function used to install external python libs in yarn
    '''
    os.system("pip3 install configparser")


if __name__ == "__main__":
    # install libs
    install_libs()


we value your support

best,

Atheer Alabdullatif




*Confidentiality & Disclaimer Notice*
This e-mail message, including any attachments, is for the sole use of the 
intended recipient(s) and may contain confidential and privileged information 
or otherwise protected by law. If you are not the intended recipient, please 
immediately notify the sender, delete the e-mail, and do not retain any copies 
of it. It is prohibited to use, disseminate or distribute the content of this 
e-mail, directly or indirectly, without prior written consent. Lean accepts no 
liability for damage caused by any virus that may be transmitted by this Email.




Re: Choosing architecture for on-premise Spark & HDFS on Kubernetes cluster

2021-11-24 Thread Mich Talebzadeh
Just to clarify, it should say "The current Spark Kubernetes model ...".


You will also need to build or get the Spark Docker image that you are going to use
in the k8s cluster, based on the Spark version, Java version, Scala version, OS and
so forth. Are you going to use Hive as your main storage?
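For reference, a rough sketch of building such an image with the docker-image-tool.sh helper that ships in the Spark distribution (registry name and tag are placeholders):

# run from the root of an unpacked Spark distribution; names are illustrative
./bin/docker-image-tool.sh -r example-registry/spark -t 3.2.0 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
./bin/docker-image-tool.sh -r example-registry/spark -t 3.2.0 push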


HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 23 Nov 2021 at 19:39, Mich Talebzadeh 
wrote:

> OK  to your point below
>
> "... We are going to deploy 20 physical Linux servers for use as an
> on-premise Spark & HDFS on Kubernetes cluster..
>
>  Kubernetes is really a cloud-native technology. However, the
> cloud-native concept does not exclude the use of on-premises infrastructure
> in cases where it makes sense. So the question is are you going to use a
> mesh structure to integrate these microservices together, including
> on-premise and in cloud?
> Now you have 20 tin boxes on-prem that you want to deploy for
> building your Spark & HDFS stack on top of them. You will gain benefit from
> Kubernetes and your microservices by simplifying the deployment by
> decoupling the dependencies and abstracting your infra-structure away with
> the ability to port these infrastructures. As you have your hardware
> (your Linux servers),running k8s on bare metal will give you native
> hardware performance. However, with 20 linux servers, you may limit your
> scalability (your number of k8s nodes). If you go this way, you will need
> to invest in a bare metal automation platform such as platform9
>  . The likelihood is that  you may
> decide to move to the public cloud at some point or integrate with the
> public cloud. My advice would be to look at something like GKE on-prem
> 
>
>
> Back to Spark: the current Kubernetes model works on the basis of the
> "one-container-per-Pod" model, meaning that one node of the cluster runs the
> driver and each remaining node runs one executor. My question would be: will
> you be integrating with a public cloud (AWS, GCP etc.) at some point? In that
> case you should look at mesh technologies like Istio
> 
>
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 23 Nov 2021 at 14:09, JHI Star  wrote:
>
>> We are going to deploy 20 physical Linux servers for use as an on-premise
>> Spark & HDFS on Kubernetes cluster. My question is: within this
>> architecture, is it best to have the pods run directly on bare metal or
>> under VMs or system containers like LXC and/or under an on-premise instance
>> of something like OpenStack - or something else altogether ?
>>
>> I am looking to garner any experience around this question relating
>> directly to the specific use case of Spark & HDFS on Kuberenetes - I know
>> there are also general points to consider regardless of the use case.
>>
>