UDF issues with Spark

2017-12-08 Thread Afshin, Bardia
Using the pyspark CLI on Spark 2.1.1, I’m getting out-of-memory issues when
running a UDF over a recordset of just 10 records, mapping each record to the
same value (arbitrary, for testing purposes). This is on Amazon EMR, release
label 5.6.0, with the following hardware specs:

m4.4xlarge
32 vCPU, 64 GiB memory, EBS-only storage
EBS storage: 100 GiB
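
For reference, a rough reconstruction of the scenario described above (an assumed sketch, not the original code; the app name is illustrative):

    # Minimal sketch: a UDF that maps each of 10 records to the same
    # arbitrary value, as described above.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-oom-test").getOrCreate()

    df = spark.range(10)                                  # recordset count of 10
    constant = udf(lambda _: "same-value", StringType())  # arbitrary constant mapping
    df.withColumn("mapped", constant(df["id"])).show()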

Help?



Best way of shipping self-contained pyspark jobs with 3rd-party dependencies

2017-12-08 Thread Sergey Zhemzhitsky
Hi PySparkers,

What is currently the best way of shipping self-contained pyspark jobs
with 3rd-party dependencies?
There are some open JIRA issues [1], [2] with corresponding PRs [3], [4],
as well as articles [5], [6], [7] on setting up the Python environment with
conda and virtualenv respectively. I believe [7] is misleading, because it
relies on unsupported Spark options such as
spark.pyspark.virtualenv.enabled and spark.pyspark.virtualenv.requirements.

So I'm wondering what the community does when it's necessary to
- prevent Python package/module version conflicts between different jobs
- avoid updating all the nodes of the cluster whenever a job introduces new dependencies
- track which dependencies are introduced on a per-job basis


[1] https://issues.apache.org/jira/browse/SPARK-13587
[2] https://issues.apache.org/jira/browse/SPARK-16367
[3] https://github.com/apache/spark/pull/13599
[4] https://github.com/apache/spark/pull/14180
[5] https://www.anaconda.com/blog/developer-blog/conda-spark
[6] http://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv
[7] https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html
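
For illustration, one pattern that sidesteps the unsupported options above is to bundle pure-Python dependencies into a zip (for example built with pip install -r requirements.txt -t deps/) and ship that zip with each job, either via spark-submit --py-files deps.zip or at runtime as in the sketch below (file and package names are illustrative assumptions):

    # Sketch: ship a per-job zip of pure-Python dependencies with the job,
    # so versions cannot clash across jobs and cluster nodes need no
    # global installs. "deps.zip" and "somepackage" are illustrative names.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("self-contained-job").getOrCreate()

    # Fetched to every node and added to the Python path of the driver
    # and the executors.
    spark.sparkContext.addPyFile("deps.zip")

    import somepackage  # importable only after addPyFile has run

This only covers pure-Python packages; for dependencies with native extensions, the conda/virtualenv approaches described in [5] and [6] are the usual route.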




Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-08 Thread bsikander
Qiao, Richard wrote
> Comparing #1 and #3, my understanding of “submitted” is “the jar is
> submitted to executors”. With this concept, you may define your own
> status.

In SparkLauncher, SUBMITTED means that the Driver was able to acquire cores
from the Spark cluster and the Launcher is waiting for the Driver to connect
back. Once it connects back, the Driver's state changes to CONNECTED.
As Marcelo mentioned, the Launcher can only tell me about the Driver's state;
it cannot determine the state of the application (executors). For the state of
the executors we can use a SparkListener.

By combining the Launcher and the Listener, I have a solution. As you
mentioned, the state changes to RUNNING as soon as even one executor is
allocated to the application. So in my application, I change the status of my
job to RUNNING only if I receive RUNNING from the Launcher and an
onExecutorAdded event from the SparkListener.
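
For context, a rough Scala sketch of this Launcher + Listener combination (Scala because these are JVM APIs; class names, paths, and the way the listener signals back to the launcher process are illustrative assumptions, not the actual code):

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}
    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

    // Launcher side: observe driver state transitions (SUBMITTED, RUNNING, ...).
    val handle = new SparkLauncher()
      .setAppResource("/path/to/my-job.jar")    // illustrative path
      .setMainClass("com.example.MyJob")        // illustrative class
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"launcher state: ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })

    // Driver side: register with spark.sparkContext.addSparkListener(...)
    // and signal back (e.g. via a queue) once the first executor joins.
    class ExecutorTracker(onFirstExecutor: () => Unit) extends SparkListener {
      override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
        onFirstExecutor()
    }

The job would then be reported as RUNNING only once the launcher state is RUNNING and the tracker has fired at least once.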






Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-08 Thread bsikander
Qiao, Richard wrote
> For your question of example, the answer is yes.

Perfect. I assume this holds for Spark standalone, YARN, and Mesos.



