[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-08 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-58429854
  
We don't support IPython 1.0 anymore, so it seems reasonable to make Python 
2.7 the default when using IPython (since IPython 2.0 requires at least Python 
2.7).  It seems like there's a growing list of use-cases that we'd like to 
support:

1. The same custom Python version (non-IPython) used everywhere (e.g. I 
want to use PyPy).
2. Stock IPython (`ipython`) with a different Python version on the workers.
3. Custom IPython (#2167) with a different Python version on the workers.
4. IPython (custom or stock) everywhere, including the workers.

In Spark 1.1, we support 1 and 2 (via `IPYTHON=1`), but not 3 or 4.  In 
#2167, we tried to add support for 3 (Custom IPython) but ended up actually 
providing 4 (which was broken until #2554 and this PR).  

For 3, we need a way to specify the driver's Python executable 
independently of the worker's executable.  Currently, `PYSPARK_PYTHON` 
affects all Python processes.  What if we added a `PYSPARK_DRIVER_PYTHON` 
option that only affected the driver's Python (and defaulted to 
`PYSPARK_PYTHON` if unset)?  I think this would provide enough mechanism to 
support all four use-cases.
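
A rough sketch of the mechanism (hypothetical: `PYSPARK_DRIVER_PYTHON` doesn't 
exist yet, and the ipython path below is made up):

```bash
# Inside bin/pyspark (sketch): the driver uses PYSPARK_DRIVER_PYTHON if set,
# otherwise it falls back to whatever interpreter the workers use.
PYSPARK_DRIVER_PYTHON="${PYSPARK_DRIVER_PYTHON:-$PYSPARK_PYTHON}"

# Use-case 3: a custom IPython on the driver, plain python2.7 on the workers.
PYSPARK_PYTHON=python2.7 PYSPARK_DRIVER_PYTHON=/opt/ipython/bin/ipython ./bin/pyspark
```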

As far as defaults are concerned, maybe we could try `which python2.7` to 
check whether Python 2.7 is installed, and fall back to `python` if it's not 
available (this would only be used if `PYSPARK_PYTHON` wasn't set).  I've noticed 
that certain programs run _way_ faster under 2.7 (programs that use `json`, for 
example), so we should probably try to make `python2.7` the default if it's 
installed.
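
Something along these lines in `bin/pyspark` (just a sketch, assuming `which` 
is available on the PATH):

```bash
# Sketch: prefer python2.7 when the user hasn't chosen an interpreter,
# otherwise fall back to plain python.
if [ -z "$PYSPARK_PYTHON" ]; then
  if which python2.7 > /dev/null 2>&1; then
    PYSPARK_PYTHON=python2.7
  else
    PYSPARK_PYTHON=python
  fi
fi
export PYSPARK_PYTHON
```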

Does this sound like a reasonable approach?


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-08 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-58444611
  
Updated based on offline discussion.  The key changes:

- Introduce `PYSPARK_DRIVER_PYTHON` to allow the driver's Python executable 
to be different from the workers'.
- Attempt to use `python2.7` as the default Python version and fall back to 
`python` if it's not available.
- Refuse to launch IPython without `python2.7`, unless `PYSPARK_PYTHON` is 
explicitly set (see the sketch below).
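
For that last point, the check could look roughly like this (a sketch, not the 
exact code in the patch; `USER_SET_PYSPARK_PYTHON` is just an illustrative 
helper name):

```bash
# Sketch: remember whether the user picked a worker interpreter explicitly,
# before the python2.7/python defaulting runs.
USER_SET_PYSPARK_PYTHON="${PYSPARK_PYTHON:+yes}"

# ... defaulting of PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON happens here ...

# Refuse to start IPython if we had to fall back to plain `python` and the
# user didn't override PYSPARK_PYTHON themselves.
if [[ "$PYSPARK_DRIVER_PYTHON" == *ipython* && "$PYSPARK_PYTHON" != "python2.7" \
      && -z "$USER_SET_PYSPARK_PYTHON" ]]; then
  echo "IPython requires Python 2.7+; please install python2.7 or set PYSPARK_PYTHON" 1>&2
  exit 1
fi
```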


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-58445007
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21490/consoleFull) 
for PR 2651 at commit [`7b8eb86`](https://github.com/apache/spark/commit/7b8eb86c303a83d83f99e7cd18dab020a76e001b).
 * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-08 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-58445558
  
LGTM, thanks!


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-58449456
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21490/consoleFull) 
for PR 2651 at commit [`7b8eb86`](https://github.com/apache/spark/commit/7b8eb86c303a83d83f99e7cd18dab020a76e001b).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-58449460
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21490/


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-07 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-58273093
  
@davies Good points; we should definitely discuss this before 1.2.  I guess 
that the script also loads `spark-env.sh`, but this is less of an issue since 
we moved most of the configuration to SparkConf.

Since we still need to support `spark-submit` and maintain 
backwards-compatibility with existing uses of the `pyspark` script, what do you 
think about this PR's approach?  At a minimum, we need to preserve the old 
behavior of the `IPYTHON=` setting.  Does `PYSPARK_DRIVER_PYTHON_OPTS` seem 
like a reasonable thing to add to this script?
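
For context, a hedged usage example (the `notebook` option set is arbitrary, and 
the driver interpreter is selected via the `PYSPARK_DRIVER_PYTHON` variable 
discussed elsewhere in this thread):

```bash
# Run the driver under the IPython notebook; workers keep their default interpreter.
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark
```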

I'd still like to discuss the rest of your proposal, but I'd like to try to 
get the fixes here merged first because the current master instructions are 
broken and we need to re-introduce backwards-compatibility.


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-07 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-58276170
  
The current patch looks good to me.

One question: should we figure out which Python version IPython is running on, 
and then use the matching Python on the workers? In the most common case, users 
have IPython (on Python 2.7) on the driver but no Python 2.7 on the workers, so 
we need to tell them how to make it work.
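
If we wanted to detect that automatically, one possibility (just a sketch, 
assuming `ipython` is on the PATH; the variable name is purely illustrative) is 
to ask ipython which interpreter it runs on and reuse that for the workers:

```bash
# Sketch: find out which interpreter the driver-side ipython runs on, so we can
# tell the user (or default) what the workers would need.
DRIVER_SYS_EXECUTABLE=$(ipython -c 'import sys; print(sys.executable)' 2>/dev/null | tail -n 1)
echo "Workers need an interpreter compatible with: $DRIVER_SYS_EXECUTABLE"
```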


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-04 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-57927210
  
Before the 1.2 release, maybe it's time to rethink how we run the pyspark shell 
or scripts; using bin/pyspark or spark-submit is not very friendly for users, 
and maybe we could simplify it.

Most of what the pyspark script does is set up SPARK_HOME and PYTHONPATH, so I 
do that in my .bashrc:
```
export SPARK_HOME=
export PYTHONPATH=${PYTHONPATH}:${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip
```
Then I can run any Python script that uses pyspark (most of them for testing), 
and I can easily choose which Python to run it with, such as ipython:
```
ipython python/pyspark/tests.py
```
If the Python version I use for the driver is not binary compatible with 
the default one, then I need to set PYSPARK_PYTHON:
```
PYSPARK_PYTHON=pypy pypy rdd.py
```
I think we could figure out the correct version to use for the workers 
automatically; PYSPARK_PYTHON shouldn't be needed in most cases. For example, if 
PYSPARK_PYTHON is not set, we could default it to the path of the Python used by 
the driver. We could also handle special cases: for ipython, we could use the 
python2.7 that ipython itself runs on (rough sketch below).
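
If this lived in a launch script, the defaulting might look roughly like this 
(`DRIVER_PYTHON` is purely an illustrative name for whatever interpreter runs 
the driver):

```bash
# Sketch: reuse the driver's interpreter for the workers unless the user
# explicitly picks a different one, so both sides stay binary-compatible.
DRIVER_PYTHON="${DRIVER_PYTHON:-python}"
export PYSPARK_PYTHON="${PYSPARK_PYTHON:-$DRIVER_PYTHON}"
```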

Also, we could set up SPARK_HOME and PYTHONPATH for the user when Spark is 
installed.

bin/pyspark could be renamed to sparkshell.py; then users could easily choose 
whichever version of Python to run it with.

bin/spark-submit is still useful for submitting jobs to a cluster or adding 
files. At the same time, maybe we could introduce some default arguments for 
general pyspark scripts, such as:
```bash
daviesliu@dm:~/work/spark$ ipython wc.py -h
Usage: wc.py [options] [args]

Options:
  -h, --help            show this help message and exit
  -q, --quiet
  -v, --verbose

  PySpark Options:
    -m MASTER, --master=MASTER
    -p PARALLEL, --parallel=PARALLEL   number of processes
    -c CPUS, --cpus=CPUS               cpus used per task
    -M MEM, --mem=MEM                  memory used per task
    --conf=CONF                        path for configuration file
    --profile                          do profiling
```



[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-03 Thread JoshRosen
GitHub user JoshRosen opened a pull request:

https://github.com/apache/spark/pull/2651

[SPARK-3772] Allow `ipython` to be used by Pyspark workers; IPython fixes:

This pull request addresses a few issues related to PySpark's IPython 
support:

- Fix the remaining uses of the '-u' flag, which IPython doesn't support 
(see SPARK-3772).
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old 
name is reserved in case we ever want to allow the worker Python options to be 
customized (this variable was introduced in #2554 and hasn't landed in a 
release yet, so this doesn't break any compatibility).
- Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid 
breaking existing example programs).

There are more details in a large block comment in `bin/pyspark`.
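
For illustration (hedged; the `notebook` options are just an example, and the 
driver-interpreter variable is the one discussed elsewhere in this thread), both 
of these invocations are intended to work after this change:

```bash
# Legacy form, preserved so existing example programs keep working:
IPYTHON=1 IPYTHON_OPTS="notebook" ./bin/pyspark

# Newer form using the renamed driver-only options variable:
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark
```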

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JoshRosen/spark SPARK-3772

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2651.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2651


commit c4f57780ac03065ee8cecb71c3a950a68277f82f
Author: Josh Rosen joshro...@apache.org
Date:   2014-10-03T22:43:16Z

[SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes:

- Fix the remaining uses of the '-u' flag, which IPython doesn't support.
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old
  name is reserved in case we ever want to allow the worker Python options
  to be customized.




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-03 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-57891560
  
/cc @davies @cocotomo @robbles for reviews / feedback.


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-03 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-57891571
  
/cc @davies @cocoatomo @robbles for reviews / feedback.


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-57891695
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21284/consoleFull) 
for PR 2651 at commit [`c4f5778`](https://github.com/apache/spark/commit/c4f57780ac03065ee8cecb71c3a950a68277f82f).
 * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-57893128
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21284/consoleFull) 
for PR 2651 at commit [`c4f5778`](https://github.com/apache/spark/commit/c4f57780ac03065ee8cecb71c3a950a68277f82f).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2651#issuecomment-57893129
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21284/

