[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20089


---




[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-27 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20089#discussion_r158799303
  
--- Diff: python/README.md ---
@@ -29,4 +29,4 @@ The Python packaging for Spark is not intended to replace all of the other use c
 
 ## Python Requirements
 
-At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas).
+At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages might have their own requirements declared as "Extras" (including numpy, pandas, and pyarrow). You can install the requirements by specifying their extra names.
--- End diff --

Let's use the simple one you suggested and leave the detailed description for future PRs.
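
As an aside, here is a small sketch of how the extra names mentioned in the README change can be inspected for an installed distribution. It assumes pyspark has already been installed via pip and uses `pkg_resources` (bundled with setuptools); the printed values are illustrative only.

```python
# Minimal sketch: list the extra names a pip-installed pyspark declares.
import pkg_resources

dist = pkg_resources.get_distribution("pyspark")
print(dist.extras)                      # e.g. ['ml', 'mllib', 'sql']
print(dist.requires(extras=("sql",)))   # core deps plus the 'sql' extra's deps
```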


---




[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20089#discussion_r158797489
  
--- Diff: python/README.md ---
@@ -29,4 +29,4 @@ The Python packaging for Spark is not intended to replace all of the other use c
 
 ## Python Requirements
 
-At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas).
+At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages might have their own requirements declared as "Extras" (including numpy, pandas, and pyarrow). You can install the requirements by specifying their extra names.
--- End diff --

Not a big deal anyway. I am actually fine with it as is too if you prefer, @ueshin.


---




[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20089#discussion_r158796151
  
--- Diff: python/README.md ---
@@ -29,4 +29,4 @@ The Python packaging for Spark is not intended to replace all of the other use c
 
 ## Python Requirements
 
-At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas).
+At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages might have their own requirements declared as "Extras" (including numpy, pandas, and pyarrow). You can install the requirements by specifying their extra names.
--- End diff --

Ah, I see. How about simply:

```
At its core PySpark depends on Py4J (currently version 0.10.6), but some additional sub-packages have their own extra requirements for some features (including numpy, pandas, and pyarrow).
```

for now? I just noticed we are a bit unclear on this (e.g., I have actually been under the impression so far that NumPy is required for ML/MLlib), but I think this roughly describes it correctly and is good enough.

I will maybe make a follow-up to describe it fully later. This PR targets PyArrow anyway.
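
For context, a minimal sketch of how such optional requirements are declared with setuptools' `extras_require`; the extra names and version pins mirror the setup.py diff in this PR, while the rest of the `setup()` call is illustrative only.

```python
# Illustrative setup.py fragment, not Spark's actual file. Packages listed
# under extras_require are NOT installed by a plain `pip install pyspark`;
# they are pulled in only when the extra is requested, e.g.
# `pip install pyspark[sql]`.
from setuptools import setup

setup(
    name='pyspark',
    version='0.0.0',              # placeholder version for this sketch
    install_requires=['py4j'],    # core dependency, always installed
    extras_require={
        'ml': ['numpy>=1.7'],
        'mllib': ['numpy>=1.7'],
        'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0'],
    },
)
```

Installing with `pip install pyspark[sql]` would then pull in pandas and pyarrow alongside the core py4j requirement.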


---




[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-27 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20089#discussion_r158789947
  
--- Diff: python/README.md ---
@@ -29,4 +29,4 @@ The Python packaging for Spark is not intended to replace all of the other use c
 
 ## Python Requirements
 
-At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas).
+At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy, pandas, and pyarrow).
--- End diff --

I added some more details. WDYT?


---




[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20089#discussion_r158774077
  
--- Diff: python/README.md ---
@@ -29,4 +29,4 @@ The Python packaging for Spark is not intended to replace all of the other use c
 
 ## Python Requirements
 
-At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas).
+At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy, pandas, and pyarrow).
--- End diff --

Yeah, Pandas and PyArrow are optional. Maybe it would be nicer if we had some more details here too.


---




[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20089#discussion_r158773775
  
--- Diff: python/setup.py ---
@@ -201,7 +201,7 @@ def _supports_symlinks():
 extras_require={
 'ml': ['numpy>=1.7'],
 'mllib': ['numpy>=1.7'],
-'sql': ['pandas>=0.19.2']
+'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0']
--- End diff --

Nope, `extras_require` does not do anything in a normal install, but those extras can be installed together by specifying the extra name via pip IIRC.
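
To illustrate the opt-in nature described above, a hedged sketch of the usual guarded-import pattern for an optional dependency; `require_pyarrow` is a hypothetical helper name for this example, not an actual PySpark function.

```python
def require_pyarrow():
    """Hypothetical helper: raise a clear error if the optional pyarrow
    dependency (declared under the 'sql' extra) is missing."""
    try:
        import pyarrow  # noqa: F401  # present only if the user opted in
    except ImportError:
        raise ImportError(
            "pyarrow is required for this feature; install it with "
            "`pip install pyspark[sql]` or `pip install pyarrow`."
        )
```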


---




[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20089#discussion_r158773551
  
--- Diff: python/setup.py ---
@@ -201,7 +201,7 @@ def _supports_symlinks():
 extras_require={
 'ml': ['numpy>=1.7'],
 'mllib': ['numpy>=1.7'],
-'sql': ['pandas>=0.19.2']
+'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0']
--- End diff --

If pyarrow is not installed, will setup force users to install it?


---




[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20089#discussion_r158773507
  
--- Diff: python/README.md ---
@@ -29,4 +29,4 @@ The Python packaging for Spark is not intended to replace all of the other use c
 
 ## Python Requirements
 
-At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy and pandas).
+At its core PySpark depends on Py4J (currently version 0.10.6), but additional sub-packages have their own requirements (including numpy, pandas, and pyarrow).
--- End diff --

This sounds mandatory, but I think pyarrow is still optional. Right?


---




[GitHub] spark pull request #20089: [SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setu...

2017-12-26 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/20089

[SPARK-22324][SQL][PYTHON][FOLLOW-UP] Update setup.py file.

## What changes were proposed in this pull request?

This is a follow-up PR of #19884, updating the setup.py file to add the pyarrow dependency.

## How was this patch tested?

Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-22324/fup1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20089.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20089


commit 36614af4d8e00bb9564ef834a341859a0e96dfe4
Author: Takuya UESHIN 
Date:   2017-12-27T04:33:59Z

Add pyarrow to setup.py.




---
