[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15659





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-16 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r88254128
  
--- Diff: python/setup.py ---
@@ -0,0 +1,209 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported for pip 
installed PySpark.",
+  file=sys.stderr)
+exit(-1)
+
+try:
+    exec(open('pyspark/version.py').read())
+except IOError:
+    print("Failed to load PySpark version file for packaging. You must be in Spark's python dir.",
+          file=sys.stderr)
+    sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+
+# Provide guidance about how to use setup.py
+incorrect_invocation_message = """
+If you are installing pyspark from spark source, you must first build Spark and
+run sdist.
+
+To build Spark with maven you can run:
+  ./build/mvn -DskipTests clean package
+Building the source dist is done in the Python directory:
+  cd python
+  python setup.py sdist
+  pip install dist/*.tar.gz"""
+
+# Figure out where the jars are we need to package with PySpark.
+JARS_PATH = glob.glob(os.path.join(SPARK_HOME, "assembly/target/scala-*/jars/"))
--- End diff --

It might not be defined if someone is just building their own sdist or 
manually installing from source rather than with the packaging scripts, so I'd 
rather avoid assuming `$SPARK_SCALA_VERSION` is present.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-15 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r88045780
  
--- Diff: python/MANIFEST.in ---
@@ -0,0 +1,21 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+recursive-include deps/jars *.jar
+graft deps/bin
+recursive-include deps/examples *.py
+recursive-include lib *.zip
+include README.md
--- End diff --

Actually even then it shouldn't happen "normally" (since we use 
recursive-include *.py as the inclusion rule for the python directory and our 
own graft directory is the bin directory). But still better to have the 
exclusion rule in case someone has pyc files in bin and is rolling their own 
package. Thanks for the suggestion :)
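For illustration, an exclusion rule of this sort in `MANIFEST.in` could look like the snippet below; the exact patterns are an assumption about what was eventually added, not a quote of the merged file:

```
# Keep stray bytecode and OS cruft out of the sdist, even for hand-rolled packages
global-exclude *.py[cod] __pycache__ .DS_Store
```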





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-15 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r88041574
  
--- Diff: python/MANIFEST.in ---
@@ -0,0 +1,21 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+recursive-include deps/jars *.jar
+graft deps/bin
+recursive-include deps/examples *.py
+recursive-include lib *.zip
+include README.md
--- End diff --

So it wouldn't happen with the make release scripts since they use a fresh 
copy of the source, but if we're making the packages by hand those could 
certainly show up. I'll add the exclusion rule since it shouldn't break 
anything.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87686589
  
--- Diff: pom.xml ---
@@ -26,6 +26,7 @@
   
   org.apache.spark
   spark-parent_2.11
+  
--- End diff --

@JoshRosen so we already update this implicitly using release-tag.sh - this 
is just the version for dev builds.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87686249
  
--- Diff: dev/create-release/release-build.sh ---
@@ -187,10 +208,10 @@ if [[ "$1" == "package" ]]; then
   # We increment the Zinc port each time to avoid OOM's and other 
craziness if multiple builds
   # share the same Zinc server.
   FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos"
-  make_binary_release "hadoop2.3" "-Phadoop2.3 $FLAGS" "3033" &
-  make_binary_release "hadoop2.4" "-Phadoop2.4 $FLAGS" "3034" &
-  make_binary_release "hadoop2.6" "-Phadoop2.6 $FLAGS" "3035" &
-  make_binary_release "hadoop2.7" "-Phadoop2.7 $FLAGS" "3036" &
+  make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
+  make_binary_release "hadoop2.4" "-Phadoop-2.4 $FLAGS" "3034" &
+  make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" &
+  make_binary_release "hadoop2.7" "-Phadoop-2.7 $FLAGS" "3036" &
--- End diff --

Done - https://github.com/apache/spark/pull/15860





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87678827
  
--- Diff: dev/run-pip-tests-2 ---
@@ -0,0 +1,105 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Stop on error
+set -e
+# Set nullglob for when we are checking existence based on globs
+shopt -s nullglob
+
+FWDIR="$(cd "`dirname $0`"/..; pwd)"
+cd "$FWDIR"
+# Some systems don't have pip or virtualenv - in those cases our tests 
won't work.
+if ! hash virtualenv 2>/dev/null; then
+  echo "Missing virtualenv skipping pip installability tests."
+  exit 0
+fi
+if ! hash pip 2>/dev/null; then
+  echo "Missing pip, skipping pip installability tests."
+  exit 0
+fi
+
+# Figure out which Python execs we should test pip installation with
+PYTHON_EXECS=()
+if hash python 2>/dev/null; then
+  # We do this since we are testing with virtualenv and the default 
virtual env python
+  # is in /usr/bin/python
+  PYTHON_EXECS+=('python')
--- End diff --

One slight oddity in AMPLab Jenkins is that `python` might actually point 
to `python3`. Given this, I think that it might be worth trying to use 
`python2` or `python2.7` or `python2.6` first and then only fall back on adding 
`python` as a last resort in order to guarantee that we're testing with a 
Python 2 environment.
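A sketch of that interpreter-preference order, assuming bash and that these interpreter names are what the Jenkins workers expose:

```
# Prefer an explicit Python 2 interpreter; only fall back to plain `python` last
PYTHON_EXECS=()
for py in python2 python2.7 python2.6 python; do
  if hash "$py" 2>/dev/null; then
    PYTHON_EXECS+=("$py")
    break
  fi
done
# Still test Python 3 when available
if hash python3 2>/dev/null; then
  PYTHON_EXECS+=('python3')
fi
```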





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87679919
  
--- Diff: dev/create-release/release-build.sh ---
@@ -187,10 +208,10 @@ if [[ "$1" == "package" ]]; then
   # We increment the Zinc port each time to avoid OOM's and other 
craziness if multiple builds
   # share the same Zinc server.
   FLAGS="-Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos"
-  make_binary_release "hadoop2.3" "-Phadoop2.3 $FLAGS" "3033" &
-  make_binary_release "hadoop2.4" "-Phadoop2.4 $FLAGS" "3034" &
-  make_binary_release "hadoop2.6" "-Phadoop2.6 $FLAGS" "3035" &
-  make_binary_release "hadoop2.7" "-Phadoop2.7 $FLAGS" "3036" &
+  make_binary_release "hadoop2.3" "-Phadoop-2.3 $FLAGS" "3033" &
+  make_binary_release "hadoop2.4" "-Phadoop-2.4 $FLAGS" "3034" &
+  make_binary_release "hadoop2.6" "-Phadoop-2.6 $FLAGS" "3035" &
+  make_binary_release "hadoop2.7" "-Phadoop-2.7 $FLAGS" "3036" &
--- End diff --

I think this is a new issue which was introduced in 
https://github.com/apache/spark/pull/14637/files#diff-01ca42240614718522afde4d4885b40dL189.
 I'd be in favor of fixing this separately. Do you mind splitting this change 
into a separate small PR which I'll merge right away?





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87678482
  
--- Diff: dev/run-pip-tests ---
@@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+FWDIR="$(cd "`dirname $0`"/..; pwd)"
--- End diff --

```
In dev/run-pip-tests line 21:
FWDIR="$(cd "`dirname $0`"/..; pwd)"
 ^-- SC2164: Use cd ... || exit in case cd fails.
 ^-- SC2006: Use $(..) instead of legacy `..`.
  ^-- SC2086: Double quote to prevent globbing and word 
splitting.


In dev/run-pip-tests line 22:
cd "$FWDIR"
^-- SC2164: Use cd ... || exit in case cd fails.


In dev/run-pip-tests line 26:
$FWDIR/dev/run-pip-tests-2
^-- SC2086: Double quote to prevent globbing and word splitting.


In dev/run-pip-tests line 31:
  rm -rf `cat ./virtual_env_temp_dir`
 ^-- SC2046: Quote this to prevent word splitting.
 ^-- SC2006: Use $(..) instead of legacy `..`.
```
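Addressing those warnings might look roughly like this; a sketch against the quoted lines, not the committed fix:

```
# SC2006/SC2086/SC2164: modern command substitution, quoting, and a guarded cd
FWDIR="$(cd "$(dirname "$0")"/.. || exit; pwd)"
cd "$FWDIR" || exit

# SC2086: quote the path when invoking the helper script
"$FWDIR/dev/run-pip-tests-2"

# SC2046/SC2006: quote the substitution so the temp dir path stays one word
rm -rf "$(cat ./virtual_env_temp_dir)"
```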





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87677806
  
--- Diff: bin/beeline ---
@@ -25,7 +25,7 @@ set -o posix
 
 # Figure out if SPARK_HOME is set
 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source `dirname $0`/find-spark-home
--- End diff --

```
In bin/beeline line 28:
  source `dirname $0`/find-spark-home
  ^-- SC1090: Can't follow non-constant source. Use a directive to specify 
location.
 ^-- SC2046: Quote this to prevent word splitting.
 ^-- SC2006: Use $(..) instead of legacy `..`.
  ^-- SC2086: Double quote to prevent globbing and word 
splitting.
```
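One way to quiet those complaints, sketched under the assumption that the non-constant `source` warning is silenced with a directive rather than restructured:

```
# shellcheck source=/dev/null
source "$(dirname "$0")"/find-spark-home
```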





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87679615
  
--- Diff: python/setup.py ---
@@ -0,0 +1,180 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported for pip 
installed PySpark.",
+  file=sys.stderr)
+exit(-1)
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging you must be 
in Spark's python dir.",
+  file=sys.stderr)
+sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+# Check and see if we are under the spark path in which case we need to 
build the symlink farm.
+# This is important because we only want to build the symlink farm while 
under Spark otherwise we
+# want to use the symlink farm. And if the symlink farm exists under while 
under Spark (e.g. a
+# partially built sdist) we should error and have the user sort it out.
+in_spark = 
(os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
+(os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1))
+
+if (in_spark):
+# Construct links for setup
+try:
+os.mkdir(TEMP_PATH)
+except:
+print("Temp path for symlink to parent already exists %s" % 
TEMP_PATH, file=sys.stderr)
+exit(-1)
+
+try:
+if (in_spark):
+# Construct the symlink farm - this is necessary since we can't 
refer to the path above the
+# package root and we need to copy the jars and scripts which are 
up above the python root.
+if getattr(os, "symlink", None) is not None:
+os.symlink(JARS_PATH, JARS_TARGET)
+os.symlink(SCRIPTS_PATH, SCRIPTS_TARGET)
+os.symlink(EXAMPLES_PATH, EXAMPLES_TARGET)
+else:
+# For windows fall back to the slower copytree
+copytree(JARS_PATH, JARS_TARGET)
+copytree(SCRIPTS_PATH, SCRIPTS_TARGET)
+copytree(EXAMPLES_PATH, EXAMPLES_TARGET)
+else:
+# If we are not inside of SPARK_HOME verify we have the required 
symlink farm
+if not os.path.exists(JARS_TARGET):
+print("To build packaging must be in the python directory 
under the SPARK_HOME.",
+  file=sys.stderr)
+# We copy the shell script to be under pyspark/python/pyspark so 
that the launcher scripts
+# find it where expected. The rest of the files aren't copied 
because they are accessed
+# using Python imports instead which will be resolved correctly.
+try:
+os.makedirs("pyspark/python/pyspark")
+except OSError:
+# Don't worry if the directory already exists.
+True
+copyfile("pyspark/shell.py", "pyspark/python/pyspark/shell.py")
+
+if not os.path.isdir(SCRIPTS_TARGET):
+print("You must first create a source dist and install that source 
dist.", file=sys.stderr)
+ 

[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87678665
  
--- Diff: dev/run-pip-tests-2 ---
@@ -0,0 +1,105 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Stop on error
+set -e
+# Set nullglob for when we are checking existence based on globs
+shopt -s nullglob
+
+FWDIR="$(cd "`dirname $0`"/..; pwd)"
+cd "$FWDIR"
+# Some systems don't have pip or virtualenv - in those cases our tests 
won't work.
+if ! hash virtualenv 2>/dev/null; then
+  echo "Missing virtualenv skipping pip installability tests."
+  exit 0
+fi
+if ! hash pip 2>/dev/null; then
+  echo "Missing pip, skipping pip installability tests."
+  exit 0
+fi
+
+# Figure out which Python execs we should test pip installation with
+PYTHON_EXECS=()
+if hash python 2>/dev/null; then
+  # We do this since we are testing with virtualenv and the default 
virtual env python
+  # is in /usr/bin/python
+  PYTHON_EXECS+=('python')
+fi
+if hash python3 2>/dev/null; then
+  PYTHON_EXECS+=('python3')
+fi
+
+echo "Constucting virtual env for testing"
+mktemp -d > ./virtual_env_temp_dir
+VIRTUALENV_BASE=`cat ./virtual_env_temp_dir`
+
+# Determine which version of PySpark we are building for archive name
+PYSPARK_VERSION=`python -c 
"exec(open('python/pyspark/version.py').read());print __version__"`
+PYSPARK_DIST="$FWDIR/python/dist/pyspark-$PYSPARK_VERSION.tar.gz"
+# The pip install options we use for all the pip commands
+PIP_OPTIONS="--upgrade --no-cache-dir --force-reinstall "
+# Test both regular user and edit/dev install modes.
+PIP_COMMANDS=("pip install $PIP_OPTIONS $PYSPARK_DIST"
+ "pip install $PIP_OPTIONS -e python/")
+
+for python in "${PYTHON_EXECS[@]}"; do
+  for install_command in "${PIP_COMMANDS[@]}"; do
+echo "Testing pip installation with python $python"
+# Create a temp directory for us to work in and save its name to a 
file for cleanup
+echo "Using $VIRTUALENV_BASE for virtualenv"
+VIRTUALENV_PATH=$VIRTUALENV_BASE/$python
+rm -rf $VIRTUALENV_PATH
+mkdir -p $VIRTUALENV_PATH
+virtualenv --python=$python $VIRTUALENV_PATH
+source $VIRTUALENV_PATH/bin/activate
+# Upgrade pip
+pip install --upgrade pip
+
+echo "Creating pip installable source dist"
+cd $FWDIR/python
--- End diff --

I think we need double quotes here and around the other substitutions in 
this file to avoid problems for folks with spaces in their directory names.
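Illustrative quoting for the substitutions in question, reusing the variable names from the quoted script:

```
# Double-quote every expansion so paths with spaces survive word splitting
VIRTUALENV_PATH="$VIRTUALENV_BASE/$python"
rm -rf "$VIRTUALENV_PATH"
mkdir -p "$VIRTUALENV_PATH"
virtualenv --python="$python" "$VIRTUALENV_PATH"
source "$VIRTUALENV_PATH/bin/activate"
cd "$FWDIR/python"
```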





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87676870
  
--- Diff: pom.xml ---
@@ -26,6 +26,7 @@
   
   org.apache.spark
   spark-parent_2.11
+  
--- End diff --

I think that the easiest way to automate this would be to add a new `sed` 
replacement near the existing logic for updating hardcoded versions in 
`dev/create-release/release-tag.sh`:


https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/dev/create-release/release-tag.sh#L75





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87677850
  
--- Diff: bin/load-spark-env.sh ---
@@ -23,7 +23,7 @@
 
 # Figure out where Spark is installed
 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source `dirname $0`/find-spark-home
--- End diff --

Let's also apply the same fix for Shellcheck complaints here and at all 
other occurrences of this line.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87679034
  
--- Diff: dev/run-tests.py ---
@@ -583,6 +589,7 @@ def main():
 modules_with_python_tests = [m for m in test_modules if 
m.python_test_goals]
 if modules_with_python_tests:
 run_python_tests(modules_with_python_tests, opts.parallelism)
+run_python_packaging_tests()
--- End diff --

+1 as well; this seems cheap to run and it's better to err on the side of 
running things more often.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87680011
  
--- Diff: dev/create-release/release-build.sh ---
@@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then
 export ZINC_PORT=$ZINC_PORT
 echo "Creating distribution: $NAME ($FLAGS)"
 
+# Write out the NAME and VERSION to PySpark version info we rewrite the - into a . and SNAPSHOT
+# to dev0 to be closer to PEP440. We use the NAME as a "local version".
+PYSPARK_VERSION=`echo "$SPARK_VERSION+$NAME" | sed -r "s/-/./" | sed -r "s/SNAPSHOT/dev0/"`
+echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
+
 # Get maven home set by MVN
 MVN_HOME=`$MVN -version 2>&1 | grep 'Maven home' | awk '{print $NF}'`
 
+echo "Creating distribution"
 ./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz 
$FLAGS \
   -DzincPort=$ZINC_PORT 2>&1 >  ../binary-release-$NAME.log
 cd ..
-cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .
 
+echo "Copying and signing python distribution"
+PYTHON_DIST_NAME=pyspark-$PYSPARK_VERSION.tar.gz
--- End diff --

We shouldn't port the release bash scripts to Windows. It's going to be a 
huge pain with little obvious benefit. Windows users who want to make release 
builds can just run this version of the script in a `*nix` VM.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87677407
  
--- Diff: bin/find-spark-home ---
@@ -0,0 +1,41 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Attempts to find a proper value for SPARK_HOME. Should be included using 
"source" directive.
+
+FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "`dirname "$0"`"; 
pwd)/find_spark_home.py"
+
+# Short cirtuit if the user already has this set.
+if [ ! -z "${SPARK_HOME}" ]; then
+   exit 0
+elif [ ! -f $FIND_SPARK_HOME_PYTHON_SCRIPT ]; then
--- End diff --

Actually, I'll just paste `shellcheck`'s full output:

```
In find-spark-home line 22:
FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "`dirname "$0"`"; 
pwd)/find_spark_home.py"
 ^-- SC2164: Use cd ... || exit in case cd 
fails.
 ^-- SC2006: Use $(..) instead of 
legacy `..`.


In find-spark-home line 27:
elif [ ! -f $FIND_SPARK_HOME_PYTHON_SCRIPT ]; then
^-- SC2086: Double quote to prevent globbing and word splitting.


In find-spark-home line 33:
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
 ^-- SC2155: Declare and assign separately to avoid masking return 
values.
   ^-- SC2164: Use cd ... || exit in case cd fails.
   ^-- SC2006: Use $(..) instead of legacy `..`.


In find-spark-home line 40:
  export SPARK_HOME=`$PYSPARK_DRIVER_PYTHON $FIND_SPARK_HOME_PYTHON_SCRIPT`
 ^-- SC2155: Declare and assign separately to avoid masking return 
values.
^-- SC2006: Use $(..) instead of legacy `..`.
^-- SC2086: Double quote to 
prevent globbing and word splitting.
```

Some of these aren't super-important, but the word-splitting ones are.
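Put together, the quoted `find-spark-home` lines with the word-splitting and legacy-backtick issues addressed might read as follows; this is a sketch that abbreviates the surrounding logic, not the merged script:

```
FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")" || exit; pwd)/find_spark_home.py"

# Short circuit if the user already has SPARK_HOME set
if [ ! -z "${SPARK_HOME}" ]; then
  exit 0
elif [ ! -f "$FIND_SPARK_HOME_PYTHON_SCRIPT" ]; then
  # SC2155: declare and assign separately so a failing substitution isn't masked
  SPARK_HOME="$(cd "$(dirname "$0")"/.. || exit; pwd)"
  export SPARK_HOME
else
  SPARK_HOME="$("$PYSPARK_DRIVER_PYTHON" "$FIND_SPARK_HOME_PYTHON_SCRIPT")"
  export SPARK_HOME
fi
```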





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87678217
  
--- Diff: dev/create-release/release-build.sh ---
@@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then
 export ZINC_PORT=$ZINC_PORT
 echo "Creating distribution: $NAME ($FLAGS)"
 
+# Write out the NAME and VERSION to PySpark version info we rewrite the - into a . and SNAPSHOT
+# to dev0 to be closer to PEP440. We use the NAME as a "local version".
+PYSPARK_VERSION=`echo "$SPARK_VERSION+$NAME" | sed -r "s/-/./" | sed -r "s/SNAPSHOT/dev0/"`
+echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
--- End diff --

If we can, I'd like to consolidate this logic into the `release-tag` shell 
script mentioned upthread.
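If that consolidation happened, the PEP 440 rewrite quoted above could move into `release-tag.sh` in roughly this form; this reuses the sed logic from the quoted diff, assumes `$RELEASE_VERSION` is available in that script, and sets aside the per-package "local version" suffix, so it is a sketch rather than the actual change:

```
# Sketch: derive the PEP 440 style version once at tag time and write version.py there
PYSPARK_VERSION=$(echo "$RELEASE_VERSION" | sed -r "s/-/./" | sed -r "s/SNAPSHOT/dev0/")
echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
```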





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87679242
  
--- Diff: python/setup.py ---
@@ -0,0 +1,170 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
--- End diff --

I don't think that the `setuptools` requirement is a big deal; my 
impression is that there are many Python packages whose installation will fail 
if `setuptools` is missing and it's not hard for users to figure out what went 
wrong and fix this.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87678635
  
--- Diff: dev/run-pip-tests ---
@@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+FWDIR="$(cd "`dirname $0`"/..; pwd)"
+cd "$FWDIR"
+
+# Run the tests, we wrap the underlying test script for cleanup and 
because early exit
+# doesn't always properly exit a virtualenv.
+$FWDIR/dev/run-pip-tests-2
+export success=$?
+
+# Clean up the virtual env enviroment used if we created one.
+if [ -f ./virtual_env_tmp_dir ]; then
--- End diff --

I think that you could combine both this and the `run-pip-tests-2` into a 
single script if you used Bash exit traps, e.g.

```
function delete_virtualenv() {
  echo "Deleting temp directory $tmpdir"
  rm -rf "$tmpdir"
}
trap delete_virtualenv EXIT
```

and putting that at the top of the script before you actually create the 
temporary directory / virtualenv.
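To make the suggestion concrete, the top of a combined script could look roughly like this, assuming the virtualenvs live under a single temp directory:

```
set -e

function delete_virtualenv() {
  # tmpdir may still be empty if we exit before creating it
  if [ -n "$tmpdir" ]; then
    echo "Deleting temp directory $tmpdir"
    rm -rf "$tmpdir"
  fi
}
# The trap fires on success, failure, and early exits alike
trap delete_virtualenv EXIT

tmpdir="$(mktemp -d)"
```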





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87677119
  
--- Diff: bin/find-spark-home ---
@@ -0,0 +1,41 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Attempts to find a proper value for SPARK_HOME. Should be included using 
"source" directive.
+
+FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "`dirname "$0"`"; 
pwd)/find_spark_home.py"
+
+# Short cirtuit if the user already has this set.
+if [ ! -z "${SPARK_HOME}" ]; then
+   exit 0
+elif [ ! -f $FIND_SPARK_HOME_PYTHON_SCRIPT ]; then
--- End diff --

`shellcheck` complains that `FIND_SPARK_HOME_PYTHON_SCRIPT` should be 
quoted here to avoid word splitting issues:

```
In find-spark-home line 27:
elif [ ! -f $FIND_SPARK_HOME_PYTHON_SCRIPT ]; then
^-- SC2086: Double quote to prevent globbing and word splitting.
```





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87677950
  
--- Diff: bin/spark-class ---
@@ -36,7 +36,7 @@ else
 fi
 
 # Find Spark jars.
-if [ -f "${SPARK_HOME}/RELEASE" ]; then
+if [ -d "${SPARK_HOME}/jars" ]; then
--- End diff --

Makes sense. This seems reasonable to me.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87642242
  
--- Diff: python/MANIFEST.in ---
@@ -0,0 +1,23 @@
+#!/usr/bin/env python
--- End diff --

Oh yeah, not needed; probably from copying the license header for RAT from 
another Python file (and skipped since it starts with a `#`). Removed :)





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-11 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87641702
  
--- Diff: python/MANIFEST.in ---
@@ -0,0 +1,23 @@
+#!/usr/bin/env python
--- End diff --

?





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-08 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r87125921
  
--- Diff: python/setup.py ---
@@ -38,11 +38,22 @@
 # A temporary path so we can access above the Python project root and fetch scripts and jars we need
 TEMP_PATH = "deps"
 SPARK_HOME = os.path.abspath("../")
-JARS_PATH = os.path.join(SPARK_HOME, "assembly/target/scala-2.11/jars/")
 
-# Use the release jars path if we are in release mode.
-if (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1):
+# Figure out where the jars are we need to package with PySpark.
+JARS_PATH = glob.glob(os.path.join(SPARK_HOME, "assembly/target/scala-*/jars/"))
+
+if len(JARS_PATH) == 1:
+    JARS_PATH = JARS_PATH[0]
+elif (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1):
+    # Release mode puts the jars in a jars directory
     JARS_PATH = os.path.join(SPARK_HOME, "jars")
+elif len(JARS_PATH) > 1:
+    print("Assembly jars exist for multiple scalas, please cleanup assembly/target",
+          file=sys.stderr)
+    sys.exit(-1)
+elif len(JARS_PATH) == 0 and not os.path.exists("deps"):
--- End diff --

nit: "deps" -> TEMP_PATH, I think?





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-07 Thread rgbkrk
Github user rgbkrk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86895291
  
--- Diff: python/setup.py ---
@@ -0,0 +1,179 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported for pip 
installed PySpark.",
+  file=sys.stderr)
+exit(-1)
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging. You must be 
in Spark's python dir.",
+  file=sys.stderr)
+sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = os.path.join(SPARK_HOME, "assembly/target/scala-2.11/jars/")
--- End diff --

Probably `glob.glob` would be good to use here.
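A sketch of what the `glob.glob` approach could look like here; it mirrors the direction the later revision of the diff takes, but the error handling shown is illustrative rather than quoted:

```
from __future__ import print_function
import glob
import os
import sys

SPARK_HOME = os.path.abspath("../")

# Let the Scala version float instead of hardcoding scala-2.11
candidates = glob.glob(os.path.join(SPARK_HOME, "assembly/target/scala-*/jars/"))
if len(candidates) != 1:
    print("Expected exactly one assembly jars directory, found: %s" % candidates,
          file=sys.stderr)
    sys.exit(-1)
JARS_PATH = candidates[0]
```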





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-07 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86888019
  
--- Diff: python/setup.py ---
@@ -0,0 +1,179 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported for pip 
installed PySpark.",
+  file=sys.stderr)
+exit(-1)
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging. You must be 
in Spark's python dir.",
+  file=sys.stderr)
+sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = os.path.join(SPARK_HOME, "assembly/target/scala-2.11/jars/")
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = os.path.join(SPARK_HOME, "jars")
+
+EXAMPLES_PATH = os.path.join(SPARK_HOME, "examples/src/main/python")
+SCRIPTS_PATH = os.path.join(SPARK_HOME, "bin")
+SCRIPTS_TARGET = os.path.join(TEMP_PATH, "bin")
+JARS_TARGET = os.path.join(TEMP_PATH, "jars")
+EXAMPLES_TARGET = os.path.join(TEMP_PATH, "examples")
+
+# Check and see if we are under the spark path in which case we need to 
build the symlink farm.
+# This is important because we only want to build the symlink farm while 
under Spark otherwise we
+# want to use the symlink farm. And if the symlink farm exists under while 
under Spark (e.g. a
+# partially built sdist) we should error and have the user sort it out.
+in_spark = 
(os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
+(os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1))
+
+if (in_spark):
+# Construct links for setup
+try:
+os.mkdir(TEMP_PATH)
+except:
+print("Temp path for symlink to parent already exists 
{0}".format(TEMP_PATH),
+  file=sys.stderr)
+exit(-1)
+
+try:
+# We copy the shell script to be under pyspark/python/pyspark so that 
the launcher scripts
+# find it where expected. The rest of the files aren't copied because 
they are accessed
+# using Python imports instead which will be resolved correctly.
+try:
+os.makedirs("pyspark/python/pyspark")
+except OSError:
+# Don't worry if the directory already exists.
+pass
+copyfile("pyspark/shell.py", "pyspark/python/pyspark/shell.py")
+
+if (in_spark):
+# Construct the symlink farm - this is necessary since we can't 
refer to the path above the
+# package root and we need to copy the jars and scripts which are 
up above the python root.
+if getattr(os, "symlink", None) is not None:
+os.symlink(JARS_PATH, JARS_TARGET)
+os.symlink(SCRIPTS_PATH, SCRIPTS_TARGET)
+os.symlink(EXAMPLES_PATH, EXAMPLES_TARGET)
+else:
+# For windows fall back to the slower copytree
+copytree(JARS_PATH, JARS_TARGET)
+copytree(SCRIPTS_PATH, SCRIPTS_TARGET)
+copytree(EXAMPLES_PATH, EXAMPLES_TARGET)
+else:
+# If we are not inside of SPARK_HOME verify we have the required 
symlink farm
+if not os.path.exists(JARS_TARGET):
+print("To build packaging must be in the python directory 
under the SPARK_HOME.",
+  file=sys.stderr)
+
+if not os.path.isdir(SCRIPTS_TARGET):
+print("You must first create a 

[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-07 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86885069
  
--- Diff: python/setup.py ---
@@ -0,0 +1,179 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported for pip 
installed PySpark.",
+  file=sys.stderr)
+exit(-1)
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging. You must be 
in Spark's python dir.",
+  file=sys.stderr)
+sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = os.path.join(SPARK_HOME, "assembly/target/scala-2.11/jars/")
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = os.path.join(SPARK_HOME, "jars")
+
+EXAMPLES_PATH = os.path.join(SPARK_HOME, "examples/src/main/python")
+SCRIPTS_PATH = os.path.join(SPARK_HOME, "bin")
+SCRIPTS_TARGET = os.path.join(TEMP_PATH, "bin")
+JARS_TARGET = os.path.join(TEMP_PATH, "jars")
+EXAMPLES_TARGET = os.path.join(TEMP_PATH, "examples")
+
+# Check and see if we are under the spark path in which case we need to 
build the symlink farm.
+# This is important because we only want to build the symlink farm while 
under Spark otherwise we
+# want to use the symlink farm. And if the symlink farm exists under while 
under Spark (e.g. a
+# partially built sdist) we should error and have the user sort it out.
+in_spark = 
(os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
+(os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1))
+
+if (in_spark):
+# Construct links for setup
+try:
+os.mkdir(TEMP_PATH)
+except:
+print("Temp path for symlink to parent already exists 
{0}".format(TEMP_PATH),
+  file=sys.stderr)
+exit(-1)
+
+try:
+# We copy the shell script to be under pyspark/python/pyspark so that 
the launcher scripts
+# find it where expected. The rest of the files aren't copied because 
they are accessed
+# using Python imports instead which will be resolved correctly.
+try:
+os.makedirs("pyspark/python/pyspark")
+except OSError:
+# Don't worry if the directory already exists.
+pass
+copyfile("pyspark/shell.py", "pyspark/python/pyspark/shell.py")
+
+if (in_spark):
+# Construct the symlink farm - this is necessary since we can't 
refer to the path above the
+# package root and we need to copy the jars and scripts which are 
up above the python root.
+if getattr(os, "symlink", None) is not None:
+os.symlink(JARS_PATH, JARS_TARGET)
+os.symlink(SCRIPTS_PATH, SCRIPTS_TARGET)
+os.symlink(EXAMPLES_PATH, EXAMPLES_TARGET)
+else:
+# For windows fall back to the slower copytree
+copytree(JARS_PATH, JARS_TARGET)
+copytree(SCRIPTS_PATH, SCRIPTS_TARGET)
+copytree(EXAMPLES_PATH, EXAMPLES_TARGET)
+else:
+# If we are not inside of SPARK_HOME verify we have the required 
symlink farm
+if not os.path.exists(JARS_TARGET):
+print("To build packaging must be in the python directory 
under the SPARK_HOME.",
+  file=sys.stderr)
+
+if not os.path.isdir(SCRIPTS_TARGET):
+print("You must first create a 

[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-07 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86884208
  
--- Diff: python/setup.py ---
@@ -0,0 +1,179 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported for pip 
installed PySpark.",
+  file=sys.stderr)
+exit(-1)
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging. You must be 
in Spark's python dir.",
+  file=sys.stderr)
+sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = os.path.join(SPARK_HOME, "assembly/target/scala-2.11/jars/")
--- End diff --

Should we not hardcode Scala 2.11 here?
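
A minimal sketch of one way to avoid the hardcoded version, assuming the usual `assembly/target/scala-*/jars` layout (the names mirror the variables in the diff above, but the exact approach is only illustrative):

```python
import glob
import os

SPARK_HOME = os.path.abspath("../")

# Look for whichever Scala version the assembly was built with rather
# than pinning scala-2.11.
candidates = glob.glob(os.path.join(SPARK_HOME, "assembly/target/scala-*/jars/"))
if len(candidates) == 1:
    JARS_PATH = candidates[0]
else:
    # Zero or several builds found; surface the ambiguity instead of guessing.
    print("Could not determine a unique jars directory, found: {0}".format(candidates))
```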


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-07 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86882629
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,74 @@
+#!/usr/bin/python
--- End diff --

/usr/bin/env python


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-07 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86795440
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
--- End diff --

Ah, that makes more sense - but it's not quite the desired search set: one is 
relative to the pwd and the other is relative to the script location (they most 
likely are not the same).
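
For what it's worth, a sketch that makes both candidates explicit, one anchored to the current working directory and one to the script's own location (the names here are illustrative, not taken from the PR):

```python
import os
import sys

# Parent of the directory the user launched from.
cwd_parent = os.path.dirname(os.path.abspath(os.getcwd()))

# Parent of the directory the script itself lives in.
script_dir = os.path.dirname(os.path.realpath(sys.argv[0]))
script_parent = os.path.dirname(script_dir)

paths = [cwd_parent, script_parent]
```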


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86714156
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,73 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars")) or
+ os.path.isdir(os.path.join(path, "assembly"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
--- End diff --

I guess we could normalize the path of the pwd first and then use 
os.path.dirname on it. Is this something you think would make a difference?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86712504
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,73 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars")) or
+ os.path.isdir(os.path.join(path, "assembly"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
--- End diff --

To be clear, it's going to the parent of the pwd and the parent of the script's dirname, respectively.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86706618
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -59,15 +59,15 @@ def is_spark_home(path):
 paths.append(os.path.join(module_home, "../../"))
 except ImportError:
 # Not pip installed no worries
-True
+pass
 
 # Normalize the paths
 paths = [os.path.abspath(p) for p in paths]
 
 try:
 return next(path for path in paths if is_spark_home(path))
 except StopIteration:
-print("Could not find valid SPARK_HOME while searching 
%s".format(paths), file=sys.stderr)
+print("Could not find valid SPARK_HOME while searching 
{0}".format(paths), file=sys.stderr)
--- End diff --

Added, sorry (it got left out when I was working on the change for edit 
mode tests and I reset part of the way through).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86699002
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,73 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars")) or
+ os.path.isdir(os.path.join(path, "assembly"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
+
+# Add the path of the PySpark module if it exists
+if sys.version < "3":
+import imp
+try:
+module_home = imp.find_module("pyspark")[1]
+paths.append(module_home)
+# If we are installed in edit mode also look two dirs up
+paths.append(os.path.join(module_home, "../../"))
+except ImportError:
+# Not pip installed no worries
+True
+else:
+from importlib.util import find_spec
+try:
+module_home = os.path.dirname(find_spec("pyspark").origin)
+paths.append(module_home)
+# If we are installed in edit mode also look two dirs up
+paths.append(os.path.join(module_home, "../../"))
+except ImportError:
+# Not pip installed no worries
+True
--- End diff --

Same nit about `pass`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86698782
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,73 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars")) or
+ os.path.isdir(os.path.join(path, "assembly"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
+
+# Add the path of the PySpark module if it exists
+if sys.version < "3":
+import imp
+try:
+module_home = imp.find_module("pyspark")[1]
+paths.append(module_home)
+# If we are installed in edit mode also look two dirs up
+paths.append(os.path.join(module_home, "../../"))
+except ImportError:
+# Not pip installed no worries
+True
--- End diff --

Nit: The idiom in Python for "do nothing" is usually `pass`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86699184
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,73 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars")) or
+ os.path.isdir(os.path.join(path, "assembly"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
+
+# Add the path of the PySpark module if it exists
+if sys.version < "3":
+import imp
+try:
+module_home = imp.find_module("pyspark")[1]
+paths.append(module_home)
+# If we are installed in edit mode also look two dirs up
+paths.append(os.path.join(module_home, "../../"))
+except ImportError:
+# Not pip installed no worries
+True
+else:
+from importlib.util import find_spec
+try:
+module_home = os.path.dirname(find_spec("pyspark").origin)
+paths.append(module_home)
+# If we are installed in edit mode also look two dirs up
+paths.append(os.path.join(module_home, "../../"))
+except ImportError:
+# Not pip installed no worries
+True
+
+# Normalize the paths
+paths = [os.path.abspath(p) for p in paths]
+
+try:
+return next(path for path in paths if is_spark_home(path))
+except StopIteration:
+print("Could not find valid SPARK_HOME while searching 
%s".format(paths), file=sys.stderr)
--- End diff --

Hmm, did a commit get gobbled up accidentally? This line still uses `%` and 
is missing an `exit(1)`. I see you changed it for another file, so I assume you 
meant to do it here too.
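
For reference, a sketch of the shape of the fix being asked for (assuming `paths` is the list being searched; this is not the committed code):

```python
from __future__ import print_function
import sys

paths = ["/some/candidate"]  # illustrative placeholder
print("Could not find valid SPARK_HOME while searching {0}".format(paths),
      file=sys.stderr)
sys.exit(1)
```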


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86698987
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,73 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars")) or
+ os.path.isdir(os.path.join(path, "assembly"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
--- End diff --

I guess if this works we don't have to change it, but to clarify my earlier 
comment about why `dirname()` is better than joining to `'../'`:

```
>>> os.path.join('/example/path', '../')
'/example/path/../'
>>> os.path.dirname('/example/path')
'/example'
```

There are a few places where this could be changed, but it's not a big deal.
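
A small worked example of why the current spelling still behaves, since `os.path.abspath()` normalizes the trailing `..` anyway (expected output shown in comments):

```python
import os

print(os.path.join('/example/path', '../'))
# -> /example/path/../
print(os.path.abspath(os.path.join('/example/path', '../')))
# -> /example
print(os.path.dirname('/example/path'))
# -> /example
```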


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86692076
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
+
+# Add the path of the PySpark module if it exists
+if sys.version < "3":
+import imp
+try:
+paths.append(imp.find_module("pyspark")[1])
+except ImportError:
+# Not pip installed no worries
+True
+else:
+from importlib.util import find_spec
+try:
+paths.append(os.path.dirname(find_spec("pyspark").origin))
+except ImportError:
+# Not pip installed no worries
+True
+
+# Normalize the paths
+paths = map(lambda path: os.path.abspath(path), paths)
+
+try:
+return next(path for path in paths if is_spark_home(path))
+except StopIteration:
+print("Could not find valid SPARK_HOME while searching %s" % 
paths, file=sys.stderr)
--- End diff --

Exit sounds reasonable; I'll follow up with the change to `format()` once we 
fix the find_spark_home issues inside of a Python 3 venv.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86692033
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
--- End diff --

I meant you are probably looking for

```python
paths = [THIS_DIR, os.path.dirname(THIS_DIR)]
```

The signature of `os.path.dirname()` is the same in Python 3.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86691906
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
--- End diff --

So in Python 2.7 os.path.dirname only expects one argument, but I can do 
part 1.
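
(A tiny sketch of the chaining being suggested, assuming this lives in a script so `__file__` is defined; `os.path.dirname()` takes a single path on both 2.7 and 3.x, so going up a level is just another call:)

```python
import os

# Directory this file lives in, then its parent.
THIS_DIR = os.path.dirname(os.path.realpath(__file__))
PARENT_DIR = os.path.dirname(THIS_DIR)
```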


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86690907
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
+
+# Add the path of the PySpark module if it exists
+if sys.version < "3":
+import imp
+try:
+paths.append(imp.find_module("pyspark")[1])
+except ImportError:
+# Not pip installed no worries
+True
+else:
+from importlib.util import find_spec
+try:
+paths.append(os.path.dirname(find_spec("pyspark").origin))
+except ImportError:
+# Not pip installed no worries
+True
+
+# Normalize the paths
+paths = map(lambda path: os.path.abspath(path), paths)
--- End diff --

```python
paths = [os.path.abspath(p) for p in paths]
```

This is more Pythonic and eliminates the need to call `list()` on the 
output of `map()` later, because `map()` returns an iterator.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86690854
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
+
+# Add the path of the PySpark module if it exists
+if sys.version < "3":
+import imp
+try:
+paths.append(imp.find_module("pyspark")[1])
+except ImportError:
+# Not pip installed no worries
+True
+else:
+from importlib.util import find_spec
+try:
+paths.append(os.path.dirname(find_spec("pyspark").origin))
+except ImportError:
+# Not pip installed no worries
+True
+
+# Normalize the paths
+paths = map(lambda path: os.path.abspath(path), paths)
+
+try:
+return next(path for path in paths if is_spark_home(path))
+except StopIteration:
+print("Could not find valid SPARK_HOME while searching %s" % 
paths, file=sys.stderr)
--- End diff --

```python
print("Could not find valid SPARK_HOME while searching {}".format(paths), 
file=sys.stderr)
```

Minor point, but `%` is discouraged these days in favor of `format()`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86691246
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
--- End diff --

Couple of comments here:

1. A better way to get a directory relative to the current file is to have 
something like this at the top of the file and refer to it as necessary:

   ```
   THIS_DIR = os.path.dirname(os.path.realpath(__file__))
   ```

2. The correct way to go up one directory is to just call `dirname()` 
again. `os.path.join(..., '../')` will just append `'../'` to the end of the 
path, which may not work as expected later on.

  So I think you're looking for `THIS_DIR` and `os.path.dirname(THIS_DIR)`.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-06 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86690957
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
+
+# Add the path of the PySpark module if it exists
+if sys.version < "3":
+import imp
+try:
+paths.append(imp.find_module("pyspark")[1])
+except ImportError:
+# Not pip installed no worries
+True
+else:
+from importlib.util import find_spec
+try:
+paths.append(os.path.dirname(find_spec("pyspark").origin))
+except ImportError:
+# Not pip installed no worries
+True
+
+# Normalize the paths
+paths = map(lambda path: os.path.abspath(path), paths)
+
+try:
+return next(path for path in paths if is_spark_home(path))
+except StopIteration:
+print("Could not find valid SPARK_HOME while searching %s" % 
paths, file=sys.stderr)
--- End diff --

We should raise an exception here or `exit(1)` since this is a fatal error.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-05 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86674922
  
--- Diff: python/setup.py ---
@@ -0,0 +1,180 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported for pip 
installed PySpark.",
+  file=sys.stderr)
+exit(-1)
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging you must be 
in Spark's python dir.",
+  file=sys.stderr)
+sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+# Check and see if we are under the spark path in which case we need to 
build the symlink farm.
+# This is important because we only want to build the symlink farm while 
under Spark otherwise we
+# want to use the symlink farm. And if the symlink farm exists under while 
under Spark (e.g. a
+# partially built sdist) we should error and have the user sort it out.
+in_spark = 
(os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
+(os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1))
+
+if (in_spark):
+# Construct links for setup
+try:
+os.mkdir(TEMP_PATH)
+except:
+print("Temp path for symlink to parent already exists %s" % 
TEMP_PATH, file=sys.stderr)
+exit(-1)
+
+try:
+if (in_spark):
+# Construct the symlink farm - this is necessary since we can't 
refer to the path above the
+# package root and we need to copy the jars and scripts which are 
up above the python root.
+if getattr(os, "symlink", None) is not None:
+os.symlink(JARS_PATH, JARS_TARGET)
+os.symlink(SCRIPTS_PATH, SCRIPTS_TARGET)
+os.symlink(EXAMPLES_PATH, EXAMPLES_TARGET)
+else:
+# For windows fall back to the slower copytree
+copytree(JARS_PATH, JARS_TARGET)
+copytree(SCRIPTS_PATH, SCRIPTS_TARGET)
+copytree(EXAMPLES_PATH, EXAMPLES_TARGET)
+else:
+# If we are not inside of SPARK_HOME verify we have the required 
symlink farm
+if not os.path.exists(JARS_TARGET):
+print("To build packaging must be in the python directory 
under the SPARK_HOME.",
+  file=sys.stderr)
+# We copy the shell script to be under pyspark/python/pyspark so 
that the launcher scripts
+# find it where expected. The rest of the files aren't copied 
because they are accessed
+# using Python imports instead which will be resolved correctly.
+try:
+os.makedirs("pyspark/python/pyspark")
+except OSError:
+# Don't worry if the directory already exists.
+True
+copyfile("pyspark/shell.py", "pyspark/python/pyspark/shell.py")
+
+if not os.path.isdir(SCRIPTS_TARGET):
+print("You must first create a source dist and install that source 
dist.", file=sys.stderr)
+   

[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-05 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86674883
  
--- Diff: docs/building-spark.md ---
@@ -259,6 +259,14 @@ or
 Java 8 tests are automatically enabled when a Java 8 JDK is detected.
 If you have JDK 8 installed but it is not the system default, you can set 
JAVA_HOME to point to JDK 8 before running the tests.
 
+## PySpark pip installable
+
+If you are building Spark for use in a Python environment and you wish to 
pip install it, you will first need to build the Spark JARs as described above. 
Then you can construct an sdist package suitable for setup.py and pip 
installable package.
+
+cd python; python setup.py sdist
--- End diff --

So `make-distribution.sh` will copy the files into the distribution 
directory. If you just want to test the pip install part, `./build/sbt 
package; cd python; python setup.py sdist; cd dist; pip install *.tar.gz` should 
do the trick.
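
Spelled out, the same sequence (run from the Spark source root; these are just the commands from the comment above, not additional tooling):

```
./build/sbt package
cd python
python setup.py sdist
cd dist
pip install *.tar.gz
```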



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-05 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86668198
  
--- Diff: docs/building-spark.md ---
@@ -259,6 +259,14 @@ or
 Java 8 tests are automatically enabled when a Java 8 JDK is detected.
 If you have JDK 8 installed but it is not the system default, you can set 
JAVA_HOME to point to JDK 8 before running the tests.
 
+## PySpark pip installable
+
+If you are building Spark for use in a Python environment and you wish to 
pip install it, you will first need to build the Spark JARs as described above. 
Then you can construct an sdist package suitable for setup.py and pip 
installable package.
+
+cd python; python setup.py sdist
--- End diff --

Just to confirm, if I run this:

```
./dev/make-distribution.sh --pip
```

It should take care of both building the right JARs _and_ building the 
Python package.

Then I just run:

```
pip install -e ./python/
```

to install Spark into my Python environment.

Is that all correct?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-05 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86668059
  
--- Diff: python/setup.py ---
@@ -0,0 +1,180 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported for pip 
installed PySpark.",
+  file=sys.stderr)
+exit(-1)
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging you must be 
in Spark's python dir.",
+  file=sys.stderr)
+sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+# Check and see if we are under the spark path in which case we need to 
build the symlink farm.
+# This is important because we only want to build the symlink farm while 
under Spark otherwise we
+# want to use the symlink farm. And if the symlink farm exists under while 
under Spark (e.g. a
+# partially built sdist) we should error and have the user sort it out.
+in_spark = 
(os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
+(os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1))
+
+if (in_spark):
+# Construct links for setup
+try:
+os.mkdir(TEMP_PATH)
+except:
+print("Temp path for symlink to parent already exists %s" % 
TEMP_PATH, file=sys.stderr)
+exit(-1)
+
+try:
+if (in_spark):
+# Construct the symlink farm - this is necessary since we can't 
refer to the path above the
+# package root and we need to copy the jars and scripts which are 
up above the python root.
+if getattr(os, "symlink", None) is not None:
+os.symlink(JARS_PATH, JARS_TARGET)
+os.symlink(SCRIPTS_PATH, SCRIPTS_TARGET)
+os.symlink(EXAMPLES_PATH, EXAMPLES_TARGET)
+else:
+# For windows fall back to the slower copytree
+copytree(JARS_PATH, JARS_TARGET)
+copytree(SCRIPTS_PATH, SCRIPTS_TARGET)
+copytree(EXAMPLES_PATH, EXAMPLES_TARGET)
+else:
+# If we are not inside of SPARK_HOME verify we have the required 
symlink farm
+if not os.path.exists(JARS_TARGET):
+print("To build packaging must be in the python directory 
under the SPARK_HOME.",
+  file=sys.stderr)
+# We copy the shell script to be under pyspark/python/pyspark so 
that the launcher scripts
+# find it where expected. The rest of the files aren't copied 
because they are accessed
+# using Python imports instead which will be resolved correctly.
+try:
+os.makedirs("pyspark/python/pyspark")
+except OSError:
+# Don't worry if the directory already exists.
+True
+copyfile("pyspark/shell.py", "pyspark/python/pyspark/shell.py")
+
+if not os.path.isdir(SCRIPTS_TARGET):
+print("You must first create a source dist and install that source 
dist.", file=sys.stderr)
+  

[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-05 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86667967
  
--- Diff: python/setup.py ---
@@ -0,0 +1,180 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported for pip 
installed PySpark.",
+  file=sys.stderr)
+exit(-1)
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging you must be 
in Spark's python dir.",
--- End diff --

Seems like there is a missing sentence break somewhere here. :)





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-03 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86365769
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
+
+# Add the path of the PySpark module if it exists
+if sys.version < "3":
+import imp
+try:
+paths.append(imp.find_module("pyspark")[1])
+except ImportError:
+# Not pip installed no worries
+True
+else:
+from importlib.util import find_spec
+try:
+paths.append(os.path.dirname(find_spec("pyspark").origin))
+except ImportError:
--- End diff --

Same as above: this is to handle the case where it can't find PySpark, not a 
missing importlib.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-02 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86287258
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
+
+# Add the path of the PySpark module if it exists
+if sys.version < "3":
+import imp
+try:
+paths.append(imp.find_module("pyspark")[1])
+except ImportError:
+# Not pip installed no worries
+True
+else:
+from importlib.util import find_spec
+try:
+paths.append(os.path.dirname(find_spec("pyspark").origin))
+except ImportError:
--- End diff --

Same as above.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-02 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86287252
  
--- Diff: python/pyspark/find_spark_home.py ---
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script attempt to determine the correct setting for SPARK_HOME given
+# that Spark may have been installed on the system with pip.
+
+from __future__ import print_function
+import os
+import sys
+
+
+def _find_spark_home():
+"""Find the SPARK_HOME."""
+# If the enviroment has SPARK_HOME set trust it.
+if "SPARK_HOME" in os.environ:
+return os.environ["SPARK_HOME"]
+
+def is_spark_home(path):
+"""Takes a path and returns true if the provided path could be a 
reasonable SPARK_HOME"""
+return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and
+(os.path.isdir(os.path.join(path, "jars"
+
+paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")]
+
+# Add the path of the PySpark module if it exists
+if sys.version < "3":
+import imp
+try:
+paths.append(imp.find_module("pyspark")[1])
+except ImportError:
--- End diff --

Is this `ImportError` meant to handle the `import imp` above? If so, I think 
that import should be moved into the `try` block?
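
A minimal, self-contained sketch of the restructure being suggested here 
(illustrative only, not the PR's final code) - with `import imp` inside the try 
block, a missing `imp` and a pyspark that isn't pip installed are handled by the 
same except clause:

    paths = []  # stand-in for the candidate SPARK_HOME paths built up in find_spark_home.py

    try:
        import imp
        # imp.find_module raises ImportError when pyspark is not pip installed.
        paths.append(imp.find_module("pyspark")[1])
    except ImportError:
        # Not pip installed - no worries.
        pass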





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-02 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86286268
  
--- Diff: dev/run-pip-tests-2 ---
@@ -0,0 +1,94 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Stop on error
+set -e
+# Set nullglob for when we are checking existence based on globs
+shopt -s nullglob
+
+FWDIR="$(cd "`dirname $0`"/..; pwd)"
+cd "$FWDIR"
+# Some systems don't have pip or virtualenv - in those cases our tests 
won't work.
+if ! hash virtualenv 2>/dev/null; then
+  echo "Missing virtualenv skipping pip installability tests."
+  exit 0
+fi
+if ! hash pip 2>/dev/null; then
+  echo "Missing pip, skipping pip installability tests."
+  exit 0
+fi
+
+if [ -d ~/.cache/pip/wheels/ ]; then
+  echo "Cleaning up pip wheel cache so we install the fresh package"
+  rm -rf ~/.cache/pip/wheels/
+fi
+
+# Figure out which Python execs we should test pip installation with
+PYTHON_EXECS=()
+if hash python 2>/dev/null; then
+  # We do this since we are testing with virtualenv and the default 
virtual env python
+  # is in /usr/bin/python
+  PYTHON_EXECS+=('python')
+fi
+if hash python3 2>/dev/null; then
+  PYTHON_EXECS+=('python3')
+fi
+
+for python in $PYTHON_EXECS; do
+  echo "Testing pip installation with python $python"
+  # Create a temp directory for us to work in and save its name to a file 
for cleanup
+  echo "Constucting virtual env for testing"
+  mktemp -d > ./virtual_env_temp_dir
+  VIRTUALENV_BASE=`cat ./virtual_env_temp_dir`
+  echo "Using $VIRTUALENV_BASE for virtualenv"
+  virtualenv --python=$python $VIRTUALENV_BASE
+  source $VIRTUALENV_BASE/bin/activate
+  # Upgrade pip
+  pip install --upgrade pip
+
+  echo "Creating pip installable source dist"
+  cd $FWDIR/python
+  $python setup.py sdist
+
+
+  echo "Installing dist into virtual env"
+  cd dist
+  # Verify that the dist directory only contains one thing to install
+  sdists=(*.tar.gz)
+  if [ ${#sdists[@]} -ne 1 ]; then
+echo "Unexpected number of targets found in dist directory - please 
cleanup existing sdists first."
+exit -1
+  fi
+  # Do the actual installation
+  pip install --upgrade --force-reinstall *.tar.gz
+
+  cd /
+
+  echo "Run basic sanity check on pip installed version with spark-submit"
+  spark-submit $FWDIR/dev/pip-sanity-check.py
+  echo "Run basic sanity check with import based"
+  python $FWDIR/dev/pip-sanity-check.py
--- End diff --

yeah. nvm. `virtualenv` will take care of that.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-02 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86286233
  
--- Diff: dev/run-pip-tests-2 ---
@@ -0,0 +1,94 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Stop on error
+set -e
+# Set nullglob for when we are checking existence based on globs
+shopt -s nullglob
+
+FWDIR="$(cd "`dirname $0`"/..; pwd)"
+cd "$FWDIR"
+# Some systems don't have pip or virtualenv - in those cases our tests 
won't work.
+if ! hash virtualenv 2>/dev/null; then
+  echo "Missing virtualenv skipping pip installability tests."
+  exit 0
+fi
+if ! hash pip 2>/dev/null; then
+  echo "Missing pip, skipping pip installability tests."
+  exit 0
+fi
+
+if [ -d ~/.cache/pip/wheels/ ]; then
+  echo "Cleaning up pip wheel cache so we install the fresh package"
+  rm -rf ~/.cache/pip/wheels/
+fi
+
+# Figure out which Python execs we should test pip installation with
+PYTHON_EXECS=()
+if hash python 2>/dev/null; then
+  # We do this since we are testing with virtualenv and the default 
virtual env python
+  # is in /usr/bin/python
+  PYTHON_EXECS+=('python')
+fi
+if hash python3 2>/dev/null; then
+  PYTHON_EXECS+=('python3')
+fi
+
+for python in "${PYTHON_EXECS[@]}"; do
+  echo "Testing pip installation with python $python"
+  # Create a temp directory for us to work in and save its name to a file 
for cleanup
+  echo "Constucting virtual env for testing"
+  mktemp -d > ./virtual_env_temp_dir
+  VIRTUALENV_BASE=`cat ./virtual_env_temp_dir`
+  echo "Using $VIRTUALENV_BASE for virtualenv"
+  virtualenv --python=$python $VIRTUALENV_BASE
+  source $VIRTUALENV_BASE/bin/activate
+  # Upgrade pip
+  pip install --upgrade pip
--- End diff --

oh. nvm. `virtualenv` will take care of it.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-02 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86285663
  
--- Diff: dev/run-pip-tests-2 ---
@@ -0,0 +1,94 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Stop on error
+set -e
+# Set nullglob for when we are checking existence based on globs
+shopt -s nullglob
+
+FWDIR="$(cd "`dirname $0`"/..; pwd)"
+cd "$FWDIR"
+# Some systems don't have pip or virtualenv - in those cases our tests 
won't work.
+if ! hash virtualenv 2>/dev/null; then
+  echo "Missing virtualenv skipping pip installability tests."
+  exit 0
+fi
+if ! hash pip 2>/dev/null; then
+  echo "Missing pip, skipping pip installability tests."
+  exit 0
+fi
+
+if [ -d ~/.cache/pip/wheels/ ]; then
+  echo "Cleaning up pip wheel cache so we install the fresh package"
+  rm -rf ~/.cache/pip/wheels/
+fi
+
+# Figure out which Python execs we should test pip installation with
+PYTHON_EXECS=()
+if hash python 2>/dev/null; then
+  # We do this since we are testing with virtualenv and the default 
virtual env python
+  # is in /usr/bin/python
+  PYTHON_EXECS+=('python')
+fi
+if hash python3 2>/dev/null; then
+  PYTHON_EXECS+=('python3')
+fi
+
+for python in "${PYTHON_EXECS[@]}"; do
+  echo "Testing pip installation with python $python"
+  # Create a temp directory for us to work in and save its name to a file 
for cleanup
+  echo "Constucting virtual env for testing"
+  mktemp -d > ./virtual_env_temp_dir
+  VIRTUALENV_BASE=`cat ./virtual_env_temp_dir`
+  echo "Using $VIRTUALENV_BASE for virtualenv"
+  virtualenv --python=$python $VIRTUALENV_BASE
+  source $VIRTUALENV_BASE/bin/activate
+  # Upgrade pip
+  pip install --upgrade pip
--- End diff --

When we want to test python3, shall we use pip3?





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-02 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86285288
  
--- Diff: dev/run-pip-tests-2 ---
@@ -0,0 +1,94 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Stop on error
+set -e
+# Set nullglob for when we are checking existence based on globs
+shopt -s nullglob
+
+FWDIR="$(cd "`dirname $0`"/..; pwd)"
+cd "$FWDIR"
+# Some systems don't have pip or virtualenv - in those cases our tests 
won't work.
+if ! hash virtualenv 2>/dev/null; then
+  echo "Missing virtualenv skipping pip installability tests."
+  exit 0
+fi
+if ! hash pip 2>/dev/null; then
+  echo "Missing pip, skipping pip installability tests."
+  exit 0
+fi
+
+if [ -d ~/.cache/pip/wheels/ ]; then
+  echo "Cleaning up pip wheel cache so we install the fresh package"
+  rm -rf ~/.cache/pip/wheels/
+fi
+
+# Figure out which Python execs we should test pip installation with
+PYTHON_EXECS=()
+if hash python 2>/dev/null; then
+  # We do this since we are testing with virtualenv and the default 
virtual env python
+  # is in /usr/bin/python
+  PYTHON_EXECS+=('python')
+fi
+if hash python3 2>/dev/null; then
+  PYTHON_EXECS+=('python3')
+fi
+
+for python in $PYTHON_EXECS; do
+  echo "Testing pip installation with python $python"
+  # Create a temp directory for us to work in and save its name to a file 
for cleanup
+  echo "Constucting virtual env for testing"
+  mktemp -d > ./virtual_env_temp_dir
+  VIRTUALENV_BASE=`cat ./virtual_env_temp_dir`
+  echo "Using $VIRTUALENV_BASE for virtualenv"
+  virtualenv --python=$python $VIRTUALENV_BASE
+  source $VIRTUALENV_BASE/bin/activate
+  # Upgrade pip
+  pip install --upgrade pip
+
+  echo "Creating pip installable source dist"
+  cd $FWDIR/python
+  $python setup.py sdist
+
+
+  echo "Installing dist into virtual env"
+  cd dist
+  # Verify that the dist directory only contains one thing to install
+  sdists=(*.tar.gz)
+  if [ ${#sdists[@]} -ne 1 ]; then
+echo "Unexpected number of targets found in dist directory - please 
cleanup existing sdists first."
+exit -1
+  fi
+  # Do the actual installation
+  pip install --upgrade --force-reinstall *.tar.gz
+
+  cd /
+
+  echo "Run basic sanity check on pip installed version with spark-submit"
+  spark-submit $FWDIR/dev/pip-sanity-check.py
+  echo "Run basic sanity check with import based"
+  python $FWDIR/dev/pip-sanity-check.py
--- End diff --

I think I'm missing something - it does work with a python3 virtualenv. What 
are you suggesting with the most recent comment?





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-02 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86284252
  
--- Diff: dev/run-pip-tests-2 ---
@@ -0,0 +1,94 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Stop on error
+set -e
+# Set nullglob for when we are checking existence based on globs
+shopt -s nullglob
+
+FWDIR="$(cd "`dirname $0`"/..; pwd)"
+cd "$FWDIR"
+# Some systems don't have pip or virtualenv - in those cases our tests 
won't work.
+if ! hash virtualenv 2>/dev/null; then
+  echo "Missing virtualenv skipping pip installability tests."
+  exit 0
+fi
+if ! hash pip 2>/dev/null; then
+  echo "Missing pip, skipping pip installability tests."
+  exit 0
+fi
+
+if [ -d ~/.cache/pip/wheels/ ]; then
+  echo "Cleaning up pip wheel cache so we install the fresh package"
+  rm -rf ~/.cache/pip/wheels/
+fi
+
+# Figure out which Python execs we should test pip installation with
+PYTHON_EXECS=()
+if hash python 2>/dev/null; then
+  # We do this since we are testing with virtualenv and the default 
virtual env python
+  # is in /usr/bin/python
+  PYTHON_EXECS+=('python')
+fi
+if hash python3 2>/dev/null; then
+  PYTHON_EXECS+=('python3')
+fi
+
+for python in $PYTHON_EXECS; do
+  echo "Testing pip installation with python $python"
+  # Create a temp directory for us to work in and save its name to a file 
for cleanup
+  echo "Constucting virtual env for testing"
+  mktemp -d > ./virtual_env_temp_dir
+  VIRTUALENV_BASE=`cat ./virtual_env_temp_dir`
+  echo "Using $VIRTUALENV_BASE for virtualenv"
+  virtualenv --python=$python $VIRTUALENV_BASE
+  source $VIRTUALENV_BASE/bin/activate
+  # Upgrade pip
+  pip install --upgrade pip
+
+  echo "Creating pip installable source dist"
+  cd $FWDIR/python
+  $python setup.py sdist
+
+
+  echo "Installing dist into virtual env"
+  cd dist
+  # Verify that the dist directory only contains one thing to install
+  sdists=(*.tar.gz)
+  if [ ${#sdists[@]} -ne 1 ]; then
+echo "Unexpected number of targets found in dist directory - please 
cleanup existing sdists first."
+exit -1
+  fi
+  # Do the actual installation
+  pip install --upgrade --force-reinstall *.tar.gz
+
+  cd /
+
+  echo "Run basic sanity check on pip installed version with spark-submit"
+  spark-submit $FWDIR/dev/pip-sanity-check.py
+  echo "Run basic sanity check with import based"
+  python $FWDIR/dev/pip-sanity-check.py
--- End diff --

Shall we use `$python` here too?





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-02 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86196393
  
--- Diff: dev/create-release/release-build.sh ---
@@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then
 export ZINC_PORT=$ZINC_PORT
 echo "Creating distribution: $NAME ($FLAGS)"
 
+# Write out the NAME and VERSION to PySpark version info we rewrite 
the - into a . and SNAPSHOT
+# to dev0 to be closer to PEP440. We use the NAME as a "local version".
+PYSPARK_VERSION=`echo "$SPARK_VERSION+$NAME" |  sed -r "s/-/./" | sed 
-r "s/SNAPSHOT/dev0/"`
+echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
+
 # Get maven home set by MVN
 MVN_HOME=`$MVN -version 2>&1 | grep 'Maven home' | awk '{print $NF}'`
 
+echo "Creating distribution"
 ./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz 
$FLAGS \
   -DzincPort=$ZINC_PORT 2>&1 >  ../binary-release-$NAME.log
 cd ..
-cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .
 
+echo "Copying and signing python distribution"
+PYTHON_DIST_NAME=pyspark-$PYSPARK_VERSION.tar.gz
--- End diff --

The goal is to eventually support building sdists on Windows - but I think 
porting the entire release process to Windows is out of scope.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-02 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86163124
  
--- Diff: python/setup.py ---
@@ -0,0 +1,180 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported for pip 
installed PySpark.",
+  file=sys.stderr)
+exit(-1)
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging you must be 
in Spark's python dir.",
+  file=sys.stderr)
+sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+# Check and see if we are under the spark path in which case we need to 
build the symlink farm.
+# This is important because we only want to build the symlink farm while 
under Spark otherwise we
+# want to use the symlink farm. And if the symlink farm exists under while 
under Spark (e.g. a
+# partially built sdist) we should error and have the user sort it out.
+in_spark = 
(os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
+(os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1))
+
+if (in_spark):
+# Construct links for setup
+try:
+os.mkdir(TEMP_PATH)
+except:
+print("Temp path for symlink to parent already exists %s" % 
TEMP_PATH, file=sys.stderr)
+exit(-1)
+
+try:
+if (in_spark):
+# Construct the symlink farm - this is necessary since we can't 
refer to the path above the
+# package root and we need to copy the jars and scripts which are 
up above the python root.
+if getattr(os, "symlink", None) is not None:
+os.symlink(JARS_PATH, JARS_TARGET)
+os.symlink(SCRIPTS_PATH, SCRIPTS_TARGET)
+os.symlink(EXAMPLES_PATH, EXAMPLES_TARGET)
+else:
+# For windows fall back to the slower copytree
+copytree(JARS_PATH, JARS_TARGET)
+copytree(SCRIPTS_PATH, SCRIPTS_TARGET)
+copytree(EXAMPLES_PATH, EXAMPLES_TARGET)
+else:
+# If we are not inside of SPARK_HOME verify we have the required 
symlink farm
+if not os.path.exists(JARS_TARGET):
+print("To build packaging must be in the python directory 
under the SPARK_HOME.",
+  file=sys.stderr)
+# We copy the shell script to be under pyspark/python/pyspark so 
that the launcher scripts
+# find it where expected. The rest of the files aren't copied 
because they are accessed
+# using Python imports instead which will be resolved correctly.
+try:
+os.makedirs("pyspark/python/pyspark")
+except OSError:
+# Don't worry if the directory already exists.
+True
+copyfile("pyspark/shell.py", "pyspark/python/pyspark/shell.py")
+
+if not os.path.isdir(SCRIPTS_TARGET):
+print("You must first create a source dist and install that source 
dist.", file=sys.stderr)
+   

[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86073517
  
--- Diff: python/setup.py ---
@@ -0,0 +1,170 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+exec(open('pyspark/version.py').read())
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported.", 
file=sys.stderr)
+exit(-1)
+
+# Check and see if we are under the spark path in which case we need to 
build the symlink farm.
+# This is important because we only want to build the symlink farm while 
under Spark otherwise we
+# want to use the symlink farm. And if the symlink farm exists under while 
under Spark (e.g. a
+# partially built sdist) we should error and have the user sort it out.
+in_spark = 
(os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
--- End diff --

I mean, we haven't moved SparkContext.scala, and it's the clearest marker that 
will only exist in a Spark tree. I'm open to checking for something else if 
people have a suggestion.
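
For reference, a self-contained sketch of the check as quoted above, assuming it 
runs from the python/ directory of a Spark checkout or unpacked release (paths 
are relative, as in setup.py):

    import glob
    import os

    # True inside a Spark source tree, or inside an unpacked release that ships
    # exactly one spark-core jar next to the RELEASE file.
    in_spark = (
        os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
        (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1))
    print(in_spark)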





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86073446
  
--- Diff: python/setup.py ---
@@ -0,0 +1,180 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging you must be 
in Spark's python dir.",
+  file=sys.stderr)
+sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+if sys.version_info < (2, 7):
--- End diff --

I can move it up, though not quite to the top (it needs the sys import first, of course).
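
A short sketch of the reordering being discussed - the interpreter check sits 
right after the imports it needs, before any of the packaging logic:

    from __future__ import print_function
    import sys

    if sys.version_info < (2, 7):
        print("Python versions prior to 2.7 are not supported for pip installed PySpark.",
              file=sys.stderr)
        sys.exit(-1)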





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86073257
  
--- Diff: python/pyspark/version.py ---
@@ -0,0 +1,19 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+__version__ = '2.1.0.dev1'
--- End diff --

Well, we want to indicate it's the equivalent of a SNAPSHOT, and following 
PEP 440 (which we need for eventual PyPI publishing) we swapped `SNAPSHOT` 
to `dev`.
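
A rough sketch of the intended mapping, using a hypothetical SPARK_VERSION and 
NAME (the release script does the same thing with sed):

    spark_version = "2.1.0-SNAPSHOT"  # hypothetical Maven-style version
    name = "hadoop2.7"                # hypothetical distribution name, used as the local version label
    pyspark_version = ("%s+%s" % (spark_version, name)).replace("-", ".", 1).replace("SNAPSHOT", "dev0")
    print(pyspark_version)  # 2.1.0.dev0+hadoop2.7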





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86073111
  
--- Diff: python/pyspark/version.py ---
@@ -0,0 +1,19 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+__version__ = '2.1.0.dev1'
--- End diff --

does this need to go as "2.1.0"?





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86072893
  
--- Diff: dev/run-tests.py ---
@@ -583,6 +589,7 @@ def main():
 modules_with_python_tests = [m for m in test_modules if 
m.python_test_goals]
 if modules_with_python_tests:
 run_python_tests(modules_with_python_tests, opts.parallelism)
+run_python_packaging_tests()
--- End diff --

I would +1 on this, given the logic in setup.py that should be checked.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86072643
  
--- Diff: python/setup.py ---
@@ -0,0 +1,180 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging you must be 
in Spark's python dir.",
+  file=sys.stderr)
+sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+if sys.version_info < (2, 7):
--- End diff --

nit: move this to the top of the file?





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r85935215
  
--- Diff: python/setup.py ---
@@ -0,0 +1,179 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging you must be 
in Spark's python dir.",
+  file=sys.stderr)
+sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported.", 
file=sys.stderr)
+exit(-1)
--- End diff --

Indentation :).





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r85933041
  
--- Diff: dev/create-release/release-build.sh ---
@@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then
 export ZINC_PORT=$ZINC_PORT
 echo "Creating distribution: $NAME ($FLAGS)"
 
+# Write out the NAME and VERSION to PySpark version info we rewrite 
the - into a . and SNAPSHOT
+# to dev0 to be closer to PEP440. We use the NAME as a "local version".
+PYSPARK_VERSION=`echo "$SPARK_VERSION+$NAME" |  sed -r "s/-/./" | sed 
-r "s/SNAPSHOT/dev0/"`
+echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
+
 # Get maven home set by MVN
 MVN_HOME=`$MVN -version 2>&1 | grep 'Maven home' | awk '{print $NF}'`
 
+echo "Creating distribution"
 ./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz 
$FLAGS \
   -DzincPort=$ZINC_PORT 2>&1 >  ../binary-release-$NAME.log
 cd ..
-cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .
 
+echo "Copying and signing python distribution"
+PYTHON_DIST_NAME=pyspark-$PYSPARK_VERSION.tar.gz
--- End diff --

Yeah, actually I asked this question because `setup.py` has a few code paths 
for Windows.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-10-31 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r85783215
  
--- Diff: dev/create-release/release-build.sh ---
@@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then
 export ZINC_PORT=$ZINC_PORT
 echo "Creating distribution: $NAME ($FLAGS)"
 
+# Write out the NAME and VERSION to PySpark version info we rewrite 
the - into a . and SNAPSHOT
+# to dev0 to be closer to PEP440. We use the NAME as a "local version".
+PYSPARK_VERSION=`echo "$SPARK_VERSION+$NAME" |  sed -r "s/-/./" | sed 
-r "s/SNAPSHOT/dev0/"`
+echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
+
 # Get maven home set by MVN
 MVN_HOME=`$MVN -version 2>&1 | grep 'Maven home' | awk '{print $NF}'`
 
+echo "Creating distribution"
 ./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz 
$FLAGS \
   -DzincPort=$ZINC_PORT 2>&1 >  ../binary-release-$NAME.log
 cd ..
-cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .
 
+echo "Copying and signing python distribution"
+PYTHON_DIST_NAME=pyspark-$PYSPARK_VERSION.tar.gz
--- End diff --

I think packaging on Windows can be considered a future TODO.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-10-31 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r85783126
  
--- Diff: python/MANIFEST.in ---
@@ -0,0 +1,23 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+recursive-include deps/jars *.jar
--- End diff --

Yes, we do: if the manifest file is empty then package_data is used, but if 
the manifest file isn't empty then package_data isn't used to generate the 
manifest.
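
A stripped-down sketch of the interaction being described, with placeholder 
names rather than PySpark's actual setup() arguments - package_data covers the 
install step, while a non-empty MANIFEST.in has to name the same files for the 
sdist:

    from setuptools import setup

    # Hypothetical layout: jars bundled under example_pkg/deps/jars/.
    setup(
        name="example-pkg",
        version="0.0.1",
        packages=["example_pkg"],
        # Copied alongside the Python code when the package is installed.
        package_data={"example_pkg": ["deps/jars/*.jar"]},
    )

    # MANIFEST.in (a separate file, as in this PR) would then also carry e.g.:
    #   recursive-include deps/jars *.jar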





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-10-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r85686524
  
--- Diff: python/MANIFEST.in ---
@@ -0,0 +1,23 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+recursive-include deps/jars *.jar
--- End diff --

These files look like they are already included via `package_data`. Do we still 
need to specify them in MANIFEST.in?





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-10-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r85686116
  
--- Diff: dev/create-release/release-build.sh ---
@@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then
 export ZINC_PORT=$ZINC_PORT
 echo "Creating distribution: $NAME ($FLAGS)"
 
+# Write out the NAME and VERSION to PySpark version info we rewrite 
the - into a . and SNAPSHOT
+# to dev0 to be closer to PEP440. We use the NAME as a "local version".
+PYSPARK_VERSION=`echo "$SPARK_VERSION+$NAME" |  sed -r "s/-/./" | sed 
-r "s/SNAPSHOT/dev0/"`
+echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
+
 # Get maven home set by MVN
 MVN_HOME=`$MVN -version 2>&1 | grep 'Maven home' | awk '{print $NF}'`
 
+echo "Creating distribution"
 ./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz 
$FLAGS \
   -DzincPort=$ZINC_PORT 2>&1 >  ../binary-release-$NAME.log
 cd ..
-cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .
 
+echo "Copying and signing python distribution"
+PYTHON_DIST_NAME=pyspark-$PYSPARK_VERSION.tar.gz
--- End diff --

Without specifying a format, it seems the output distribution will be a zip file
on Windows. Since `setup.py` supports Windows, I am wondering whether this is an
issue.
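
For reference, the default sdist archive format in distutils is platform dependent (gztar on POSIX, zip on Windows), so if that ever matters for the release script the format can be pinned, either by passing `--formats=gztar` to `setup.py sdist` or by overriding the sdist command. A minimal sketch of the latter (an illustration of stock distutils/setuptools behaviour, not something this PR adds):

    # Sketch: force a gzipped tarball even on Windows, where the sdist
    # default format would otherwise be zip.
    from setuptools import setup
    from setuptools.command.sdist import sdist as _sdist

    class gztar_sdist(_sdist):
        def initialize_options(self):
            _sdist.initialize_options(self)
            self.formats = ["gztar"]  # override the platform-dependent default

    setup(
        name="pyspark",              # illustrative metadata only
        version="2.1.0.dev0",
        packages=["pyspark"],
        cmdclass={"sdist": gztar_sdist},
    )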


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-10-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r85685534
  
--- Diff: docs/index.md ---
@@ -14,7 +14,9 @@ It also supports a rich set of higher-level tools 
including [Spark SQL](sql-prog
 
 Get Spark from the [downloads 
page](http://spark.apache.org/downloads.html) of the project website. This 
documentation is for Spark version {{site.SPARK_VERSION}}. Spark uses Hadoop's 
client libraries for HDFS and YARN. Downloads are pre-packaged for a handful of 
popular Hadoop versions.
 Users can also download a "Hadoop free" binary and run Spark with any 
Hadoop version
-[by augmenting Spark's classpath](hadoop-provided.html). 
+[by augmenting Spark's classpath](hadoop-provided.html).
+Scala and Java users can include Spark in their projects using it's maven 
cooridnates and in the future Python users can also install Spark from PyPI.
--- End diff --

nit: not very sure, but shouldn't `it's` be `its`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-10-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r85685141
  
--- Diff: dev/create-release/release-build.sh ---
@@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then
 export ZINC_PORT=$ZINC_PORT
 echo "Creating distribution: $NAME ($FLAGS)"
 
+# Write out the NAME and VERSION to PySpark version info we rewrite the - into a . and SNAPSHOT
+# to dev0 to be closer to PEP440. We use the NAME as a "local version".
+PYSPARK_VERSION=`echo "$SPARK_VERSION+$NAME" |  sed -r "s/-/./" | sed -r "s/SNAPSHOT/dev0/"`
+echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
+
 # Get maven home set by MVN
 MVN_HOME=`$MVN -version 2>&1 | grep 'Maven home' | awk '{print $NF}'`
 
+echo "Creating distribution"
 ./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz $FLAGS \
   -DzincPort=$ZINC_PORT 2>&1 >  ../binary-release-$NAME.log
 cd ..
-cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .
 
+echo "Copying and signing python distribution"
+PYTHON_DIST_NAME=pyspark-$PYSPARK_VERSION.tar.gz
+cp spark-$SPARK_VERSION-bin-$NAME/python/dist/$PYTHON_DIST_NAME .
+
+echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --armour \
+  --output $PYTHON_DIST_NAME.asc \
+  --detach-sig $PYTHON_DIST_NAME
+echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --print-md \
+  MD5 $PYTHON_DIST_NAME.gz > \
--- End diff --

Do we have a wrongly appended `.gz` here?
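
As a side note on the version rewriting quoted above: with hypothetical inputs `SPARK_VERSION=2.1.0-SNAPSHOT` and `NAME=hadoop2.7`, the two sed calls turn `2.1.0-SNAPSHOT+hadoop2.7` into `2.1.0.dev0+hadoop2.7`, i.e. the distribution name becomes a PEP 440 local version label. A quick Python rendering of the same transformation, for illustration only:

    import re

    # Hypothetical values mirroring what release-build.sh would see.
    spark_version = "2.1.0-SNAPSHOT"
    name = "hadoop2.7"

    # Same steps as the sed pipeline: the first "-" becomes ".", then SNAPSHOT becomes dev0.
    pyspark_version = re.sub("SNAPSHOT", "dev0",
                             re.sub("-", ".", "%s+%s" % (spark_version, name), count=1),
                             count=1)
    print(pyspark_version)  # 2.1.0.dev0+hadoop2.7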


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-10-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r85676793
  
--- Diff: python/setup.py ---
@@ -0,0 +1,170 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+exec(open('pyspark/version.py').read())
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported.", 
file=sys.stderr)
+exit(-1)
+
+# Check and see if we are under the spark path in which case we need to 
build the symlink farm.
+# This is important because we only want to build the symlink farm while 
under Spark otherwise we
+# want to use the symlink farm. And if the symlink farm exists under while 
under Spark (e.g. a
+# partially built sdist) we should error and have the user sort it out.
+in_spark = (os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
--- End diff --

OK. Looks good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-10-30 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r85675496
  
--- Diff: python/setup.py ---
@@ -0,0 +1,170 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+exec(open('pyspark/version.py').read())
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported.", 
file=sys.stderr)
+exit(-1)
+
+# Check and see if we are under the spark path in which case we need to 
build the symlink farm.
+# This is important because we only want to build the symlink farm while 
under Spark otherwise we
+# want to use the symlink farm. And if the symlink farm exists under while 
under Spark (e.g. a
+# partially built sdist) we should error and have the user sort it out.
+in_spark = (os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
--- End diff --

I've added an error message for when someone runs sdist from an unexpected
location (it ends up showing up earlier, during the version check).
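
For readers following the thread, the kind of guard being described is roughly the following (a sketch only, not the exact code added in the PR): check for the version file up front and exit with a pointer to the expected working directory.

    from __future__ import print_function
    import os
    import sys

    # Sketch: fail fast with guidance if setup.py is run from outside Spark's
    # python/ directory (e.g. an sdist attempted from the wrong place).
    if not os.path.isfile("pyspark/version.py"):
        print("Could not find pyspark/version.py. Build Spark first, then run "
              "sdist from the python/ directory of the Spark source tree.",
              file=sys.stderr)
        sys.exit(-1)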


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-10-30 Thread rgbkrk
Github user rgbkrk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r85662141
  
--- Diff: docs/building-spark.md ---
@@ -263,6 +263,8 @@ If you have JDK 8 installed but it is not the system 
default, you can set JAVA_H
 
 If your are building Spark for use in a Python environment and you wish to 
pip install it, you will first need to build the Spark JARs as described above. 
Then you can construct an sdist package suitable for setup.py and pip 
installable package.
 
--- End diff --

I just noticed a typo above here:

...If you are...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org