[GitHub] spark pull request #16854: [WIP][SPARK-15463][SQL] Add an API to load DataFr...

2017-02-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r100229312
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -361,6 +362,41 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   }

   /**
+   * Loads a `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`.
+   *
+   * Unless the schema is specified using the `schema` function, this function goes through the
+   * input once to determine the input schema.
+   *
+   * @param csvDataset input Dataset with one CSV row per record
+   * @since 2.2.0
+   */
+  def csv(csvDataset: Dataset[String]): DataFrame = {
+    val parsedOptions: CSVOptions = new CSVOptions(extraOptions.toMap)
--- End diff --

Just to help with the review: there is a similar code path at 
https://github.com/apache/spark/blob/3d314d08c9420e74b4bb687603cdd11394eccab5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L105-L125
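For readers following along, here is a minimal usage sketch of the proposed API (a sketch under the assumption that the signature lands as in the diff above; the data and the `header` option are illustrative):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// One CSV row per record. Without an explicit `schema`, the input is
// scanned once to infer the schema.
val csvLines: Dataset[String] = Seq("a,b", "1,x", "2,y").toDS()
val df = spark.read.option("header", "true").csv(csvLines)
df.show()
```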





[GitHub] spark issue #16854: [SPARK-15463][SQL] Add an API to load DataFrame from Dat...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16854
  
cc @cloud-fan, do you mind if I ask whether you think it is worth adding this API?





[GitHub] spark issue #7963: [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/7963
  
Ping @MechCoder, are you able to proceed with this PR and address the comments 
above? If not, it might be good to close it for now.





[GitHub] spark issue #8374: [SPARK-10101] [SQL] Add maxlength to JDBC field metadata ...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/8374
  
ping @rama-mullapudi





[GitHub] spark issue #8384: [SPARK-8510] [CORE] [PYSPARK] NumPy matrices as values in...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/8384
  
@paberline, can we then close this for now? I take that as a soft yes for 
closing.





[GitHub] spark issue #8785: [Spark-10625] [SQL] Spark SQL JDBC read/write is unable t...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/8785
  
ping @tribbloid. Are you able to address the review comments? If not, it'd 
be better to close this for now.





[GitHub] spark issue #10262: [SPARK-12270][SQL]remove empty space after getString fro...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/10262
  
Hi @huaxingao, would this be better closed for now?





[GitHub] spark issue #11192: [SPARK-13257] [Improvement] Scala naive Bayes example: c...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/11192
  
Hi @movelikeriver, are you able to take this further? If not, maybe it'd 
be better to close it for now.





[GitHub] spark issue #11211: [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated t...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/11211
  
(ping @zjffdu)





[GitHub] spark issue #11374: [SPARK-12042] Python API for mllib.stat.test.StreamingTe...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/11374
  
Hi @yinxusen, are you able to take this further? If not, it might be 
better to close it for now.





[GitHub] spark issue #11692: [SPARK-13852][YARN]handle the InterruptedException cause...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/11692
  
Hi @vanzin, would this then be a soft suggestion to close this if there 
is no objection for about a week?





[GitHub] spark issue #11887: [SPARK-13041][Mesos]add driver sandbox uri to the dispat...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/11887
  
Hi @skonto, are you able to take this PR further? If not, it might be 
better to close it for now.





[GitHub] spark issue #12243: [SPARK-14467][SQL] Interleave CPU and IO better in FileS...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/12243
  
Hi @nongli, I just happened to look at this PR. It seems to have been 
inactive for a few months without responses to the review comments. Would it be 
better to close it for now?





[GitHub] spark issue #12335: [SPARK-11321] [SQL] Python non null udfs

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/12335
  
Hi @kevincox, I happened to look at this PR. It seems to have been inactive for 
a few months while there are outstanding review comments. Would it be better to close 
it for now if you are currently unable to take it further?





[GitHub] spark issue #12337: [SPARK-15566] Expose null checking function to Python la...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/12337
  
@kevincox, this seems to have been inactive for a few months. Should it maybe 
be closed for now if you are currently unable to take it further?





[GitHub] spark issue #12398: [SPARK-5929][PYSPARK] Context addPyPackage and addPyRequ...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/12398
  
ping @buckhx





[GitHub] spark issue #12620: [SPARK-14859][PYSPARK] Make Lambda Serializer Configurab...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/12620
  
@njwhite, this seems to have been inactive for a few months. Would it be better 
to close it for now if you are currently unable to take it further?





[GitHub] spark issue #12675: [SPARK-14894][PySpark] Add result summary api to Gaussia...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/12675
  
Hi @GayathriMurali, this seems to have been inactive on the review comments for 
a few months. Would it be better to close it for now if you are unable to take it 
further?





[GitHub] spark issue #12697: [SPARK-14754][SPARK CORE] Metrics as logs are not coming...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/12697
  
gentle ping @mihir6692





[GitHub] spark issue #12800: [SPARK-15024] NoClassDefFoundError in spark-examples due...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/12800
  
Hi @atokhy, this seems inactive without an answer to the comment above. 
Would it be better to close it if you are unable to take it further?





[GitHub] spark issue #12933: [Spark-15155][Mesos] Optionally ignore default role reso...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/12933
  
Hi @hellertime, this PR seems to have been inactive for a few months since the 
last review comments above. Would it be better to close it if you are not currently 
able to work on it further?





[GitHub] spark issue #13467: [SPARK-15642][SQL][WIP] Metadata gets lost when selectin...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/13467
  
Hi @zommerfelds, do you mind if I ask whether you are still working on this? It 
seems to have been inactive for more than half a year. Maybe it'd be better to close it 
for now if you are currently unable to work on it further.





[GitHub] spark issue #13715: [SPARK-15992] [MESOS] Refactor MesosCoarseGrainedSchedul...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/13715
  
(ping @drcrallen)





[GitHub] spark issue #13891: [SPARK-6685][MLLIB]Use DSYRK to compute AtA in ALS

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/13891
  
@hqzizania, if you check the log, there are some guides on how to. Should 
we maybe rebase this and check the logs?





[GitHub] spark issue #14266: [SPARK-16526][SQL] Benchmarking Performance for Fast Has...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14266
  
(gentle ping @ooq)





[GitHub] spark issue #14321: [SPARK-8971][ML] Add stratified sampling to ML CrossVali...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14321
  
I just happened to look at this PR. Is this still WIP, or waiting for more 
review comments? If it is simply that the author is not currently able to 
take it further, then maybe it'd be better to close it for now.





[GitHub] spark issue #14461: [SPARK-16856] [WEBUI] [CORE] Link the application's exec...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14461
  
(gentle ping @nblintao)





[GitHub] spark pull request #14517: [SPARK-16931][PYTHON] PySpark APIS for bucketBy a...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14517#discussion_r100310447
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -747,16 +800,25 @@ def _test():
     except py4j.protocol.Py4JError:
         spark = SparkSession(sc)

+    seed = int(time() * 1000)
--- End diff --

@GregBowyer, ping. Let me propose closing this after a week.





[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14579
  
Is there any reason why it is not merged yet? I personally like this too.





[GitHub] spark issue #14601: [SPARK-13979][Core] Killed executor is re spawned withou...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14601
  
(gentle ping @agsachin)





[GitHub] spark issue #14936: [SPARK-7877][MESOS] Allow configuration of framework tim...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14936
  
(gentle ping @philipphoffmann)





[GitHub] spark issue #15159: [SPARK-17605][SPARK_SUBMIT] Add option spark.usePython a...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15159
  
(gentle ping @zjffdu)





[GitHub] spark issue #15209: replace function type with function isinstance

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15209
  
@frankfqchen, could you follow up on the comment above? If you are not able to 
proceed further, I think it might be better to close this for now. Actually, I am not 
too sure it is worth sweeping these.





[GitHub] spark issue #15267: [SPARK-17667] [YARN][WIP]Make locking fine grained in Ya...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15267
  
Hi @ashwinshankar77, if you are not currently able to work on this further, 
maybe it should be closed for now. It seems to have been inactive for a few months.





[GitHub] spark issue #15861: [SPARK-18294][CORE] Implement commit protocol to support...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15861
  
(gentle ping @jiangxb1987)





[GitHub] spark issue #15871: [SPARK-17116][Pyspark] Allow parameters to be {string,va...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15871
  
Hi @aditya1702, do you mind if I ask whether you are still working on this? 
Maybe it should be closed for now if you are currently unable to work on it 
further. It seems to have been inactive for a few months.





[GitHub] spark issue #16199: [SPARK-18772][SQL] NaN/Infinite float parsing in JSON is...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16199
  
What do you think about my suggestion, @NathanHowell?





[GitHub] spark issue #16278: [SPARK-18779][STREAMING][KAFKA] Messages being received ...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16278
  
(@pnakhe gentle ping, I am curious too)





[GitHub] spark issue #16324: [SPARK-18910][SQL]Resolve faile to use UDF that jar file...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16324
  
(gentle ping @henh062326)





[GitHub] spark issue #16319: [WiP][SPARK-18699] SQL - parsing CSV should return null ...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16319
  
Hi @kubatyszko, are you still working on this? If you are currently unable 
to proceed further, maybe it should be closed for now. It seems to have been 
inactive for a few months.





[GitHub] spark issue #16083: [SPARK-18097][SQL] Add exception catch to handle corrupt...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16083
  
Hi @jayadevanmurali, are you still working on this? It seems to have been inactive 
for a few months. Maybe it would be better to close it for now if you are currently 
unable to proceed further.





[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16386#discussion_r100332529
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -298,22 +312,22 @@ class JacksonParser(
     // Here, we pass empty `PartialFunction` so that this case can be
     // handled as a failed conversion. It will throw an exception as
     // long as the value is not null.
-    parseJsonToken(parser, dataType)(PartialFunction.empty[JsonToken, Any])
+    parseJsonToken[AnyRef](parser, dataType)(PartialFunction.empty[JsonToken, AnyRef])
   }

   /**
    * This method skips `FIELD_NAME`s at the beginning, and handles nulls ahead before trying
    * to parse the JSON token using given function `f`. If the `f` failed to parse and convert the
    * token, call `failedConversion` to handle the token.
    */
-  private def parseJsonToken(
+  private def parseJsonToken[R >: Null](
--- End diff --

Yes, I said +1 because it explicitly expresses that the type should be nullable, 
and I _assumed_ (because I did not check the bytecode myself and I might be 
wrong) that it gives a hint to the compiler, because `Null` is `null`able (I 
remember googling and playing with some references for several whole days back 
when I was investigating another null-related PR).
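(Not from the PR itself, just a minimal sketch of why the `>: Null` lower bound matters:)

```scala
// Without a lower bound, `null` is not a value of an arbitrary R
// (R could be a value type such as Int), so this does not compile:
//   def emptyResult[R]: R = null
// Bounding R from below by `Null` restricts R to nullable reference
// types, so returning null type-checks:
def emptyResult[R >: Null]: R = null

val s: String = emptyResult[String]  // s == null
```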





[GitHub] spark issue #16777: [SPARK-19435][SQL] Type coercion between ArrayTypes

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16777
  
cc @cloud-fan, WDYT?





[GitHub] spark pull request #16882: [SPARK-19544][SQL] Improve error message when som...

2017-02-09 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/16882

[SPARK-19544][SQL] Improve error message when some column types are 
compatible and others are not in set/union operations

## What changes were proposed in this pull request?

This PR proposes to fix the error message when some data types are compatible and others are not in set/union operations.

```scala
Seq((1,("a", 1))).toDF.union(Seq((1L,("a", "b"))).toDF)
```

**Before**

```
Union can only be performed on tables with the compatible column types.
LongType <> IntegerType at the first column of the second table;;
```

**After**

```
Union can only be performed on tables with the compatible column types. StructType(StructField(_1,StringType,true), StructField(_2,StringType,true)) <> StructType(StructField(_1,StringType,true), StructField(_2,IntegerType,false)) at the second column of the second table;;
```

## How was this patch tested?

Unit tests in `AnalysisErrorSuite` and manual tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-19544

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16882.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16882


commit 07e698415bb6a48e60cd2359cd9d412c2f61e48b
Author: hyukjinkwon 
Date:   2017-02-10T05:01:16Z

Improve error message when some column types are compatible and others are 
not in set/union operations







[GitHub] spark issue #16882: [SPARK-19544][SQL] Improve error message when some colum...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16882
  
cc @hvanhovell could you please take a look?





[GitHub] spark pull request #16881: [SPARK-19543] from_json fails when the input row ...

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16881#discussion_r100477537
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -496,7 +496,7 @@ case class JsonToStruct(schema: StructType, options: Map[String, String], child:
   override def dataType: DataType = schema

   override def nullSafeEval(json: Any): Any = {
-    try parser.parse(json.toString).head catch {
+    try parser.parse(json.toString).headOption.orNull catch {
--- End diff --

(Not for this PR, but maybe loosely related, I guess.) I was thinking it is a 
bit odd that we only support reading a single row when the input is a JSON array. 
For example:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schema = StructType(StructField("a", IntegerType) :: Nil)
Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("struct").select(from_json(col("struct"), schema)).show()
+--------------------+
|jsontostruct(struct)|
+--------------------+
|                 [1]|
+--------------------+
```

I think maybe we should not support this in that function, or it should work 
like a generator expression.
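(As a standalone aside, a tiny sketch of what the `headOption.orNull` change in the diff above buys, using `Seq[String]` as a stand-in for the parser's result:)

```scala
// An empty result stands in for a parse that produced no rows.
val rows = Seq.empty[String]

// rows.head                                // throws NoSuchElementException
val first: String = rows.headOption.orNull  // yields null instead of throwing
println(first == null)                      // prints true
```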





[GitHub] spark issue #16882: [SPARK-19544][SQL] Improve error message when some colum...

2017-02-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16882
  
retest this please





[GitHub] spark pull request #16881: [SPARK-19543] from_json fails when the input row ...

2017-02-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16881#discussion_r100521142
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -496,7 +496,7 @@ case class JsonToStruct(schema: StructType, options: Map[String, String], child:
   override def dataType: DataType = schema

   override def nullSafeEval(json: Any): Any = {
-    try parser.parse(json.toString).head catch {
+    try parser.parse(json.toString).headOption.orNull catch {
--- End diff --

Thank you for confirming.





[GitHub] spark issue #16882: [SPARK-19544][SQL] Improve error message when some colum...

2017-02-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16882
  
Oh, sure. Let me give it a shot. Thank you for your quick review!





[GitHub] spark issue #16777: [SPARK-19435][SQL] Type coercion between ArrayTypes

2017-02-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16777
  
Let me check other DBMSs and get back.





[GitHub] spark issue #16890: when colum is use alias ,the order by result is wrong

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16890
  
Could you please close this and ask the question on the Spark user mailing list?





[GitHub] spark issue #16278: [SPARK-18779][STREAMING][KAFKA] Messages being received ...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16278
  
Aha, thanks for the details. Then is this PR/JIRA maybe closable?





[GitHub] spark issue #13467: [SPARK-15642][SQL][WIP] Metadata gets lost when selectin...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/13467
  
Aha, then strictly speaking it is not WIP but just waiting for feedback. How 
about closing this and pinging, in the JIRA, the people who touched this code path 
lately?





[GitHub] spark issue #11692: [SPARK-13852][YARN]handle the InterruptedException cause...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/11692
  
Let me propose closing this after a week if the author seems inactive 
on it.





[GitHub] spark issue #8785: [Spark-10625] [SQL] Spark SQL JDBC read/write is unable t...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/8785
  
I am not the one to decide what to merge, but I left the comment because I just 
found this seems inactive with respect to the review comments, and I assumed this PR 
is currently abandoned because the author happened to be unable to proceed 
further for now.

I'd rebase, address the review comments, and keep pinging the related people 
here.





[GitHub] spark issue #16083: [SPARK-18097][SQL] Add exception catch to handle corrupt...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16083
  
Actually, I don't know (or have not heard of) how the schema gets corrupted. I guess 
any committer should verify this before merging it. How about asking on the 
dev/user mailing list if you are not sure? If this is expected to stay open without 
reproducible steps for a long time (like a few weeks or months), I guess we 
should close it for now and reopen it later.





[GitHub] spark issue #16777: [SPARK-19435][SQL] Type coercion between ArrayTypes

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16777
  

**Postgres**

```
postgres=# SELECT greatest(array[1], array[0.1]);
 greatest
----------
 {1}
(1 row)

postgres=# SELECT least(array[1], array[0.1]);
 least
-------
 {0.1}
(1 row)
```

```
postgres=# SELECT * FROM (values (array[1]), (array[0.1])) as foo;
 column1
---------
 {1}
 {0.1}
(2 rows)
```

```
postgres=# SELECT array[1] UNION SELECT array[0.1];
 array
-------
 {0.1}
 {1}
(2 rows)
```

```
postgres=# SELECT CASE WHEN TRUE THEN array[0.1] ELSE array[1] END;
 array
-------
 {0.1}
(1 row)
```

(Sorry, I could not find a proper way to test this with `IF`, so I used 
`CASE`/`WHEN` in Postgres.)


**Hive** - does not support this type coercion.

```
SELECT least(array(1), array(1D));
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments '1': 
least only takes primitive types, got array<int>
```
least/greatest: seem to support only primitive types.

```
SELECT inline(array(struct(array(0)), struct(array(1D))));
FAILED: SemanticException [Error 10016]: Line 1:38 Argument type mismatch 
'1D': Argument type "struct<col1:array<double>>" is different from preceding 
arguments. Previous type was "struct<col1:array<int>>"
```

```
SELECT array(1) UNION SELECT array(1D);
FAILED: SemanticException Schema of both sides of union should match: 
Column _c0 is of type array<int> on first table and type array<double> on 
second table. Cannot tell the position of null AST.
```

```
SELECT IF(1=1, array(1), array(1D));
FAILED: SemanticException [Error 10016]: Line 1:25 Argument type mismatch 
'1D': The second and the third arguments of function IF should have the same 
type, but they are different: "array<int>" and "array<double>"
```

**MySQL**

Seems not to support arrays.






[GitHub] spark issue #16777: [SPARK-19435][SQL] Type coercion between ArrayTypes

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16777
  
@cloud-fan, to cut this short: it seems Postgres supports this whereas Hive 
does not.

It seems we now support implicit casts between `ArrayType`s via 
[SPARK-18624](https://issues.apache.org/jira/browse/SPARK-18624), for example 
as below:

```scala
sql("SELECT percentile_approx(10.0, array('1', '1', '1'), 100)").show()
```

Wouldn't it be more reasonable to allow this case?





[GitHub] spark pull request #16854: [SPARK-15463][SQL] Add an API to load DataFrame f...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16854#discussion_r100667530
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -361,6 +362,41 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   }

   /**
+   * Loads a `Dataset[String]` storing CSV rows and returns the result as a `DataFrame`.
+   *
+   * Unless the schema is specified using the `schema` function, this function goes through the
+   * input once to determine the input schema.
+   *
+   * @param csvDataset input Dataset with one CSV row per record
+   * @since 2.2.0
+   */
+  def csv(csvDataset: Dataset[String]): DataFrame = {
--- End diff --

Sure. Actually, there is a JIRA and a closed PR, 
https://github.com/apache/spark/pull/13460 and 
[SPARK-15615](https://issues.apache.org/jira/browse/SPARK-15615), where I was 
negative because it can easily be worked around.

However, I am fine with it if we are promoting the use of Datasets instead of RDDs 
for advantages such as 
[SPARK-18362](https://issues.apache.org/jira/browse/SPARK-18362).

cc @pjfanning, could you reopen and proceed with your PR if we are all fine with that?





[GitHub] spark issue #16611: [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for arra...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16611
  
@rxin, does that look okay to you? I am worried about whether

> **SQL** - array-like form of integer, decimal, string and boolean

sounds okay to you.







[GitHub] spark pull request #16882: [SPARK-19544][SQL] Improve error message when som...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16882#discussion_r100667599
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -321,12 +321,12 @@ trait CheckAnalysis extends PredicateHelper {
   // Check if the data types match.
   dataTypes(child).zip(ref).zipWithIndex.foreach { case ((dt1, dt2), ci) =>
     // SPARK-18058: we shall not care about the nullability of columns
-    if (!dt1.sameType(dt2)) {
+    if (TypeCoercion.findWiderTypeForTwo(dt1.asNullable, dt2.asNullable).isEmpty) {
       failAnalysis(
         s"""
           |${operator.nodeName} can only be performed on tables with the compatible
-          |column types. $dt1 <> $dt2 at the ${ordinalNumber(ci)} column of
-          |the ${ordinalNumber(ti + 1)} table
+          |column types. ${dt1.simpleString} <> ${dt2.simpleString} at the
--- End diff --

(I used `simpleString` for consistency with other code in this file.)
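(A small sketch, not from the PR, of the difference this makes in the message; the struct is illustrative and assumes a Spark shell:)

```scala
import org.apache.spark.sql.types._

val dt = StructType(
  StructField("_1", StringType) ::
  StructField("_2", IntegerType, nullable = false) :: Nil)

// Default rendering, as in the old message:
println(dt)               // StructType(StructField(_1,StringType,true), StructField(_2,IntegerType,false))
// Compact rendering used by the new message:
println(dt.simpleString)  // struct<_1:string,_2:int>
```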





[GitHub] spark issue #13467: [SPARK-15642][SQL][WIP] Metadata gets lost when selectin...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/13467
  
Yes, exactly. That's usually what I do.





[GitHub] spark pull request #16777: [SPARK-19435][SQL] Type coercion between ArrayTyp...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16777#discussion_r100668061
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -101,13 +101,13 @@ object TypeCoercion {
     case _ => None
   }

-  /** Similar to [[findTightestCommonType]], but can promote all the way to StringType. */
-  def findTightestCommonTypeToString(left: DataType, right: DataType): Option[DataType] = {
-    findTightestCommonType(left, right).orElse((left, right) match {
-      case (StringType, t2: AtomicType) if t2 != BinaryType && t2 != BooleanType => Some(StringType)
-      case (t1: AtomicType, StringType) if t1 != BinaryType && t1 != BooleanType => Some(StringType)
-      case _ => None
-    })
+  /**
+   * Promotes all the way to StringType.
+   */
+  private def stringPromotion: (DataType, DataType) => Option[DataType] = {
--- End diff --

Oh, I will fix it.
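
For context, here is a minimal sketch of the rule being factored out, adapted from the removed `findTightestCommonTypeToString` shown in the diff (simplified to a plain method; note `AtomicType` is package-private to Spark's `sql` package, so this only compiles there, as the original file does):

```scala
import org.apache.spark.sql.types._

// String promotion as a standalone rule: any atomic type other than
// binary and boolean widens to StringType when paired with a string.
def stringPromotion(t1: DataType, t2: DataType): Option[DataType] = (t1, t2) match {
  case (StringType, t: AtomicType) if t != BinaryType && t != BooleanType => Some(StringType)
  case (t: AtomicType, StringType) if t != BinaryType && t != BooleanType => Some(StringType)
  case _ => None
}

// Rules that return Option compose with orElse, so the first match wins, e.g.:
// findTightestCommonType(t1, t2).orElse(stringPromotion(t1, t2))
```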





[GitHub] spark issue #16824: [SPARK-18069][PYTHON] Make PySpark doctests for SQL self...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16824
  
gentle ping @holdenk (somehow writing `@holdenk` does not show your name ... screenshot: https://cloud.githubusercontent.com/assets/6477701/22854544/3d5ec930-f0b4-11e6-82c8-195a725caaf4.png)






[GitHub] spark pull request #16882: [SPARK-19544][SQL] Improve error message when som...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16882#discussion_r100683242
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -116,7 +116,7 @@ object TypeCoercion {
* i.e. the main difference with [[findTightestCommonType]] is that here 
we allow some
* loss of precision when widening decimal and double, and promotion to 
string.
*/
-  private def findWiderTypeForTwo(t1: DataType, t2: DataType): 
Option[DataType] = (t1, t2) match {
--- End diff --

Sure!





[GitHub] spark pull request #16882: [SPARK-19544][SQL] Improve error message when som...

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16882#discussion_r100683913
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -116,7 +116,7 @@ object TypeCoercion {
* i.e. the main difference with [[findTightestCommonType]] is that here 
we allow some
* loss of precision when widening decimal and double, and promotion to 
string.
*/
-  private def findWiderTypeForTwo(t1: DataType, t2: DataType): 
Option[DataType] = (t1, t2) match {
--- End diff --

(Added)





[GitHub] spark issue #16777: [SPARK-19435][SQL] Type coercion between ArrayTypes

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16777
  
@cloud-fan, I just addressed your comments and tested a build with Scala 2.10.





[GitHub] spark issue #16890: when colum is use alias ,the order by result is wrong

2017-02-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16890
  
@muyannian Could you click the "Close pull request" button below?





[GitHub] spark pull request #16777: [SPARK-19435][SQL] Type coercion between ArrayTyp...

2017-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16777#discussion_r100690775
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala
 ---
@@ -379,6 +386,67 @@ class TypeCoercionSuite extends PlanTest {
 widenTest(ArrayType(IntegerType), StructType(Seq()), None)
   }
 
+  test("wider common type for decimal and array") {
+def widenTestWithStringPromotion(
+t1: DataType,
+t2: DataType,
+expected: Option[DataType]): Unit = {
+  checkWidenType(TypeCoercion.findWiderTypeForTwo, t1, t2, expected)
+}
+
+def widenTestWithoutStringPromotion(
+t1: DataType,
+t2: DataType,
+expected: Option[DataType]): Unit = {
+  
checkWidenType(TypeCoercion.findWiderTypeWithoutStringPromotionForTwo, t1, t2, 
expected)
+}
+
+// Decimal
+widenTestWithStringPromotion(
+  DecimalType(2, 1), DecimalType(3, 2), Some(DecimalType(3, 2)))
+widenTestWithStringPromotion(
+  DecimalType(2, 1), DoubleType, Some(DoubleType))
+widenTestWithStringPromotion(
+  DecimalType(2, 1), IntegerType, Some(DecimalType(11, 1)))
+widenTestWithStringPromotion(
+  DoubleType, DecimalType(2, 1), Some(DoubleType))
+widenTestWithStringPromotion(
+  LongType, DecimalType(2, 1), Some(DecimalType(21, 1)))
+
+// ArrayType
+widenTestWithStringPromotion(
+  ArrayType(ShortType, containsNull = true),
+  ArrayType(DoubleType, containsNull = false),
+  Some(ArrayType(DoubleType, containsNull = true)))
+widenTestWithStringPromotion(
+  ArrayType(TimestampType, containsNull = false),
+  ArrayType(StringType, containsNull = true),
+  Some(ArrayType(StringType, containsNull = true)))
+widenTestWithStringPromotion(
+  ArrayType(ArrayType(IntegerType), containsNull = false),
+  ArrayType(ArrayType(LongType), containsNull = false),
+  Some(ArrayType(ArrayType(LongType), containsNull = false)))
+
+// Without string promotion
+widenTestWithoutStringPromotion(IntegerType, StringType, None)
--- End diff --

`LongType` test was removed.





[GitHub] spark issue #16777: [SPARK-19435][SQL] Type coercion between ArrayTypes

2017-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16777
  
Thanks @cloud-fan for your detailed review. I will keep in mind those 
comments.





[GitHub] spark pull request #16898: [SPARK-19563][SQL] avoid unnecessary sort in Fil...

2017-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16898#discussion_r100692348
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
 ---
@@ -134,8 +142,26 @@ object FileFormatWriter extends Logging {
   // prepares the job, any exception thrown from here shouldn't cause 
abortJob() to be called.
   committer.setupJob(job)
 
+  val bucketIdExpression = bucketSpec.map { spec =>
+// Use `HashPartitioning.partitionIdExpression` as our bucket id 
expression, so that we can
+// guarantee the data distribution is same between shuffle and 
bucketed data source, which
+// enables us to only shuffle one side when join a bucketed table 
and a normal one.
+HashPartitioning(bucketColumns, 
spec.numBuckets).partitionIdExpression
+  }
+  // We should first sort by partition columns, then bucket id, and 
finally sorting columns.
+  val requiredOrdering = (partitionColumns ++ bucketIdExpression ++ 
sortColumns)
+.map(SortOrder(_, Ascending))
+  val actualOrdering = queryExecution.executedPlan.outputOrdering
+  // We can still avoid the sort if the required ordering is [partCol] 
and the actual ordering
+  // is [partCol, anotherCol].
+  val rdd = if (requiredOrdering == 
actualOrdering.take(requiredOrdering.length)) {
+queryExecution.toRdd
+  } else {
+SortExec(requiredOrdering, global = false, 
queryExecution.executedPlan).execute()
--- End diff --

Oh, I ran into this case before, IIRC. This complains in Scala 2.10. I guess it
should be

```
SortExec(requiredOrdering, global = false, child = queryExecution.executedPlan).execute()
```

because it seems the compiler gets confused by the mix of positional and named
arguments. (Mixing them this way is actually invalid syntax in Python.)






[GitHub] spark pull request #16898: [SPARK-19563][SQL] avoid unnecessary sort in Fil...

2017-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16898#discussion_r100692415
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
 ---
@@ -134,8 +142,26 @@ object FileFormatWriter extends Logging {
   // prepares the job, any exception thrown from here shouldn't cause 
abortJob() to be called.
   committer.setupJob(job)
 
+  val bucketIdExpression = bucketSpec.map { spec =>
+// Use `HashPartitioning.partitionIdExpression` as our bucket id 
expression, so that we can
+// guarantee the data distribution is same between shuffle and 
bucketed data source, which
+// enables us to only shuffle one side when join a bucketed table 
and a normal one.
+HashPartitioning(bucketColumns, 
spec.numBuckets).partitionIdExpression
+  }
+  // We should first sort by partition columns, then bucket id, and 
finally sorting columns.
+  val requiredOrdering = (partitionColumns ++ bucketIdExpression ++ 
sortColumns)
+.map(SortOrder(_, Ascending))
+  val actualOrdering = queryExecution.executedPlan.outputOrdering
+  // We can still avoid the sort if the required ordering is [partCol] 
and the actual ordering
+  // is [partCol, anotherCol].
+  val rdd = if (requiredOrdering == 
actualOrdering.take(requiredOrdering.length)) {
+queryExecution.toRdd
+  } else {
+SortExec(requiredOrdering, global = false, 
queryExecution.executedPlan).execute()
--- End diff --

Yea, it seems it complains.

```
[error] 
.../spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala:160:
 not enough arguments for method apply: (sortOrder: 
Seq[org.apache.spark.sql.catalyst.expressions.SortOrder], global: Boolean, 
child: org.apache.spark.sql.execution.SparkPlan, testSpillFrequency: 
Int)org.apache.spark.sql.execution.SortExec in object SortExec.
[error] Unspecified value parameter child.
[error] SortExec(requiredOrdering, global = false, 
queryExecution.executedPlan).execute()
[error]
```





[GitHub] spark pull request #16777: [SPARK-19435][SQL] Type coercion between ArrayTyp...

2017-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16777#discussion_r100716252
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -116,48 +114,66 @@ object TypeCoercion {
* i.e. the main difference with [[findTightestCommonType]] is that here 
we allow some
* loss of precision when widening decimal and double, and promotion to 
string.
*/
-  private def findWiderTypeForTwo(t1: DataType, t2: DataType): 
Option[DataType] = (t1, t2) match {
-case (t1: DecimalType, t2: DecimalType) =>
-  Some(DecimalPrecision.widerDecimalType(t1, t2))
-case (t: IntegralType, d: DecimalType) =>
-  Some(DecimalPrecision.widerDecimalType(DecimalType.forType(t), d))
-case (d: DecimalType, t: IntegralType) =>
-  Some(DecimalPrecision.widerDecimalType(DecimalType.forType(t), d))
-case (_: FractionalType, _: DecimalType) | (_: DecimalType, _: 
FractionalType) =>
-  Some(DoubleType)
-case _ =>
-  findTightestCommonTypeToString(t1, t2)
+  def findWiderTypeForTwo(t1: DataType, t2: DataType): Option[DataType] = {
+findTightestCommonType(t1, t2)
--- End diff --

Yes, it is true that the type dispatch order was changed, but `findTightestCommonType` does not handle `DecimalType`, so the results should be the same.





[GitHub] spark issue #16777: [SPARK-19435][SQL] Type coercion between ArrayTypes

2017-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16777
  
Do you mean two PRs, one for cleaning up the logic here and one for supporting array type coercion?





[GitHub] spark pull request #16777: [SPARK-19435][SQL] Type coercion between ArrayTyp...

2017-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16777#discussion_r100716751
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -116,48 +114,66 @@ object TypeCoercion {
* i.e. the main difference with [[findTightestCommonType]] is that here 
we allow some
* loss of precision when widening decimal and double, and promotion to 
string.
*/
-  private def findWiderTypeForTwo(t1: DataType, t2: DataType): 
Option[DataType] = (t1, t2) match {
-case (t1: DecimalType, t2: DecimalType) =>
-  Some(DecimalPrecision.widerDecimalType(t1, t2))
-case (t: IntegralType, d: DecimalType) =>
-  Some(DecimalPrecision.widerDecimalType(DecimalType.forType(t), d))
-case (d: DecimalType, t: IntegralType) =>
-  Some(DecimalPrecision.widerDecimalType(DecimalType.forType(t), d))
-case (_: FractionalType, _: DecimalType) | (_: DecimalType, _: 
FractionalType) =>
-  Some(DoubleType)
-case _ =>
-  findTightestCommonTypeToString(t1, t2)
+  def findWiderTypeForTwo(t1: DataType, t2: DataType): Option[DataType] = {
+findTightestCommonType(t1, t2)
--- End diff --

@cloud-fan refactored this logic recently and I believe he didn't miss this part.





[GitHub] spark pull request #16777: [SPARK-19435][SQL] Type coercion between ArrayTyp...

2017-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16777#discussion_r100723132
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -116,48 +114,66 @@ object TypeCoercion {
* i.e. the main difference with [[findTightestCommonType]] is that here 
we allow some
* loss of precision when widening decimal and double, and promotion to 
string.
*/
-  private def findWiderTypeForTwo(t1: DataType, t2: DataType): 
Option[DataType] = (t1, t2) match {
-case (t1: DecimalType, t2: DecimalType) =>
-  Some(DecimalPrecision.widerDecimalType(t1, t2))
-case (t: IntegralType, d: DecimalType) =>
-  Some(DecimalPrecision.widerDecimalType(DecimalType.forType(t), d))
-case (d: DecimalType, t: IntegralType) =>
-  Some(DecimalPrecision.widerDecimalType(DecimalType.forType(t), d))
-case (_: FractionalType, _: DecimalType) | (_: DecimalType, _: 
FractionalType) =>
-  Some(DoubleType)
-case _ =>
-  findTightestCommonTypeToString(t1, t2)
+  def findWiderTypeForTwo(t1: DataType, t2: DataType): Option[DataType] = {
+findTightestCommonType(t1, t2)
--- End diff --

Aha, thank you for correcting me. I overlooked that, but the result should still be the same, shouldn't it?

- `DecimalType.isWiderThan`:

```
(p1 - s1) >= (p2 - s2) && s1 >= s2
```

- `DecimalPrecision.widerDecimalType`:

```
precision = max(s1, s2) + max(p1 - s1, p2 - s2), scale = max(s1, s2)
```

If they actually differ, then we were already applying different type coercion rules between `findWiderTypeWithoutStringPromotion` and `findWiderTypeForTwo`, and I guess we should align them, given https://github.com/apache/spark/pull/14439 ?
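
For what it's worth, a minimal standalone sketch of the two formulas above; `Dec`, `isWiderThan`, and `wider` are simplified stand-ins I made up for this illustration, not Spark's actual API:

```scala
// Simplified stand-ins for DecimalType, DecimalType.isWiderThan and
// DecimalPrecision.widerDecimalType, using the formulas quoted above.
case class Dec(p: Int, s: Int)

def isWiderThan(a: Dec, b: Dec): Boolean =
  (a.p - a.s) >= (b.p - b.s) && a.s >= b.s

def wider(a: Dec, b: Dec): Dec = {
  val scale = math.max(a.s, b.s)
  val range = math.max(a.p - a.s, b.p - b.s)
  Dec(range + scale, scale) // precision = integer digits + scale
}

// When a is already wider than b, widening gives back a itself.
val a = Dec(11, 1)
val b = Dec(2, 1)
assert(isWiderThan(a, b))
assert(wider(a, b) == a)
```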






[GitHub] spark issue #16777: [SPARK-19435][SQL] Type coercion between ArrayTypes

2017-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16777
  
I see what you mean. The code paths are now different. Let me try to 
investigate it and split them. 





[GitHub] spark issue #16777: [SPARK-19435][SQL] Type coercion between ArrayTypes

2017-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16777
  
@gatorsmile, can we get this merged and then add test cases for these separately? It seems the results are the same. I ran the two tests below:

```scala
// Assumed imports (the snippet is meant to run inside a test suite in the catalyst module):
import scala.util.control.Exception.allCatch

import org.apache.spark.sql.catalyst.analysis.DecimalPrecision
import org.apache.spark.sql.types._

val integralTypes =
  IndexedSeq(
ByteType,
ShortType,
IntegerType,
LongType)

val decimals = (-38 to 38).flatMap { p =>
  (-38 to 38).flatMap(s => allCatch opt DecimalType(p, s))
}

assert(decimals.nonEmpty)
integralTypes.foreach { it =>
  test(s"$it test") {
decimals.foreach { d =>

  // From TypeCoercion.findWiderTypeForTwo
  val maybeType1 = (d, it) match {
case (d: DecimalType, t: IntegralType) =>
  Some(DecimalPrecision.widerDecimalType(DecimalType.forType(t), d))
case _ => None
  }

  // From TypeCoercion.findTightestCommonType
  val maybeType2 = (d, it) match {
case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) =>
  Some(t1)
case _ => None
  }

  if (maybeType2.isDefined) {
val t1 = maybeType1.get
val t2 = maybeType2.get
assert(t1 == t2)
  }
}
  }
}
```

```scala
// Same assumed imports as in the previous snippet.
import scala.util.control.Exception.allCatch

import org.apache.spark.sql.catalyst.analysis.DecimalPrecision
import org.apache.spark.sql.types._

val integralTypes =
  IndexedSeq(
ByteType,
ShortType,
IntegerType,
LongType)

val decimals = (-38 to 38).flatMap { p =>
  (-38 to 38).flatMap(s => allCatch opt DecimalType(p, s))
}
  
assert(decimals.nonEmpty)
  
integralTypes.foreach { it =>
  test(s"$it test") {
val widenDecimals = decimals.flatMap { d =>
  // From TypeCoercion.findWiderTypeForTwo
  (d, it) match {
case (d: DecimalType, t: IntegralType) =>
  Some(DecimalPrecision.widerDecimalType(DecimalType.forType(t), d))
case _ => None
  }
}.toSet

val tightDecimals = decimals.flatMap { d =>
  // From TypeCoercion.findTightestCommonType
  (d, it) match {
case (t1: DecimalType, t2: IntegralType) if t1.isWiderThan(t2) =>
  Some(t1)
case _ => None
  }
}.toSet

assert(widenDecimals.nonEmpty)
assert(tightDecimals.nonEmpty)
assert(tightDecimals.subsetOf(widenDecimals))
  }
}
```





[GitHub] spark pull request #16777: [SPARK-19435][SQL] Type coercion between ArrayTyp...

2017-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16777#discussion_r100730262
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -116,48 +114,66 @@ object TypeCoercion {
* i.e. the main difference with [[findTightestCommonType]] is that here 
we allow some
* loss of precision when widening decimal and double, and promotion to 
string.
*/
-  private def findWiderTypeForTwo(t1: DataType, t2: DataType): 
Option[DataType] = (t1, t2) match {
-case (t1: DecimalType, t2: DecimalType) =>
-  Some(DecimalPrecision.widerDecimalType(t1, t2))
-case (t: IntegralType, d: DecimalType) =>
-  Some(DecimalPrecision.widerDecimalType(DecimalType.forType(t), d))
-case (d: DecimalType, t: IntegralType) =>
-  Some(DecimalPrecision.widerDecimalType(DecimalType.forType(t), d))
-case (_: FractionalType, _: DecimalType) | (_: DecimalType, _: 
FractionalType) =>
-  Some(DoubleType)
-case _ =>
-  findTightestCommonTypeToString(t1, t2)
+  def findWiderTypeForTwo(t1: DataType, t2: DataType): Option[DataType] = {
+findTightestCommonType(t1, t2)
--- End diff --

I see. Thank you for catching it.





[GitHub] spark pull request #16882: [SPARK-19544][SQL] Improve error message when som...

2017-02-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16882#discussion_r100779764
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -321,12 +321,12 @@ trait CheckAnalysis extends PredicateHelper {
   // Check if the data types match.
   dataTypes(child).zip(ref).zipWithIndex.foreach { case ((dt1, 
dt2), ci) =>
 // SPARK-18058: we shall not care about the nullability of 
columns
-if (!dt1.sameType(dt2)) {
+if (TypeCoercion.findWiderTypeForTwo(dt1.asNullable, 
dt2.asNullable).isEmpty) {
   failAnalysis(
 s"""
   |${operator.nodeName} can only be performed on 
tables with the compatible
-  |column types. $dt1 <> $dt2 at the 
${ordinalNumber(ci)} column of
-  |the ${ordinalNumber(ti + 1)} table
+  |column types. ${dt1.simpleString} <> 
${dt2.simpleString} at the
--- End diff --

Sure, let me change.





[GitHub] spark issue #16882: [SPARK-19544][SQL] Improve error message when some colum...

2017-02-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16882
  
Thank you @hvanhovell





[GitHub] spark issue #16777: [SPARK-19435][SQL] Type coercion between ArrayTypes

2017-02-13 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16777
  
(I just rebased and added `private[analysis]` for consistency)





[GitHub] spark pull request #16741: [SPARK-19402][DOCS] Support LaTex inline formula ...

2017-02-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16741#discussion_r100924150
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -135,13 +135,13 @@ abstract class MLWriter extends BaseReadWrite with 
Logging {
 }
 
 /**
- * Trait for classes that provide [[MLWriter]].
+ * Trait for classes that provide `MLWriter`.
--- End diff --

Hmm. That's weird. At the least, it should warn, because I copied and pasted each class name from the messages and went to the reported line numbers. Let me double-check and get back. Maybe I made mistakes on a few.





[GitHub] spark pull request #16741: [SPARK-19402][DOCS] Support LaTex inline formula ...

2017-02-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16741#discussion_r100958021
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -135,13 +135,13 @@ abstract class MLWriter extends BaseReadWrite with 
Logging {
 }
 
 /**
- * Trait for classes that provide [[MLWriter]].
+ * Trait for classes that provide `MLWriter`.
--- End diff --

@jkbradley, yes, it seems fine for both.

I think I changed them given `sealed trait BaseReadWrite`...

```
[warn] .../spark/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala:143: Could not find any member to link for "MLWriter".
[warn]   /**
[warn]   ^
```

and I think I swept the rest here because they looked identical. I am okay with reviving the ones identified as working.

My (maybe nitpicking) concern is that we should be able to identify these cases explicitly, which I guess I and many others have failed to do, and to know what to do in each case. Otherwise, I think we should prefer backticks, because even if reviving the links works for now in some cases, we could easily introduce other breaks again when we change the code.








[GitHub] spark issue #16533: [SPARK-19160][PYTHON][SQL] Add udf decorator

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16533
  
Thanks for cc'ing me. I like this pythonic way. +1, and it looks okay to me too despite the trick in the argument checking.





[GitHub] spark issue #16777: [SPARK-19435][SQL] Type coercion between ArrayTypes

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16777
  
Thank you @gatorsmile 





[GitHub] spark pull request #16776: [SPARK-19436][SQL] Add missing tests for approxQu...

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16776#discussion_r100986732
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -58,49 +58,54 @@ final class DataFrameStatFunctions private[sql](df: 
DataFrame) {
* @param probabilities a list of quantile probabilities
*   Each number must belong to [0, 1].
*   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
-   * @param relativeError The relative target precision to achieve 
(greater or equal to 0).
+   * @param relativeError The relative target precision to achieve 
(greater than or equal to 0).
*   If set to zero, the exact quantiles are computed, which could be 
very expensive.
*   Note that values greater than 1 are accepted but give the same 
result as 1.
* @return the approximate quantiles at the given probabilities
*
-   * @note NaN values will be removed from the numerical column before 
calculation
+   * @note null and NaN values will be removed from the numerical column 
before calculation. If
+   *   the dataframe is empty or all rows contain null or NaN, null is 
returned.
*
* @since 2.0.0
*/
   def approxQuantile(
   col: String,
   probabilities: Array[Double],
   relativeError: Double): Array[Double] = {
-StatFunctions.multipleApproxQuantiles(df.select(col).na.drop(),
-  Seq(col), probabilities, relativeError).head.toArray
+val res = approxQuantile(Array(col), probabilities, relativeError)
+Option(res).map(_.head).orNull
   }
 
   /**
* Calculates the approximate quantiles of numerical columns of a 
DataFrame.
-   * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* 
approxQuantile]] for
-   * detailed description.
+   * @see `[[DataFrameStatsFunctions.approxQuantile(col:Str* 
approxQuantile]]` for detailed
--- End diff --

nit: `DataFrameStatsFunctions` -> `DataFrameStatFunctions`, or remove it.

For example, just

```
`approxQuantile(String, Array[Double], Double)`
```

We could just wrap these in backticks without `[[ ... ]]` in general. It seems the Scaladoc-specific annotation also fails to disambiguate the argument types once javadoc runs on the generated sources:

```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error]    * @see {@link DataFrameStatFunctions.approxQuantile(col:Str* approxQuantile)} for
[error]      ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:45: error: unexpected text
[error]    * @see #approxQuantile(String, Array[Double], Double) for detailed description.
[error]      ^
```

I guess it does not necessarily have to be a link if making it one breaks the build.








[GitHub] spark pull request #16776: [SPARK-19436][SQL] Add missing tests for approxQu...

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16776#discussion_r100987139
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -58,49 +58,54 @@ final class DataFrameStatFunctions private[sql](df: 
DataFrame) {
* @param probabilities a list of quantile probabilities
*   Each number must belong to [0, 1].
*   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
-   * @param relativeError The relative target precision to achieve 
(greater or equal to 0).
+   * @param relativeError The relative target precision to achieve 
(greater than or equal to 0).
*   If set to zero, the exact quantiles are computed, which could be 
very expensive.
*   Note that values greater than 1 are accepted but give the same 
result as 1.
* @return the approximate quantiles at the given probabilities
*
-   * @note NaN values will be removed from the numerical column before 
calculation
+   * @note null and NaN values will be removed from the numerical column 
before calculation. If
+   *   the dataframe is empty or all rows contain null or NaN, null is 
returned.
*
* @since 2.0.0
*/
   def approxQuantile(
   col: String,
   probabilities: Array[Double],
   relativeError: Double): Array[Double] = {
-StatFunctions.multipleApproxQuantiles(df.select(col).na.drop(),
-  Seq(col), probabilities, relativeError).head.toArray
+val res = approxQuantile(Array(col), probabilities, relativeError)
+Option(res).map(_.head).orNull
   }
 
   /**
* Calculates the approximate quantiles of numerical columns of a 
DataFrame.
-   * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* 
approxQuantile]] for
-   * detailed description.
+   * @see `[[DataFrameStatsFunctions.approxQuantile(col:Str* 
approxQuantile]]` for detailed
--- End diff --

It seems the doc breaks have queued up a bit. Let me sweep them soon.





[GitHub] spark pull request #16776: [SPARK-19436][SQL] Add missing tests for approxQu...

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16776#discussion_r100992640
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -58,49 +58,52 @@ final class DataFrameStatFunctions private[sql](df: 
DataFrame) {
* @param probabilities a list of quantile probabilities
*   Each number must belong to [0, 1].
*   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
-   * @param relativeError The relative target precision to achieve 
(greater or equal to 0).
+   * @param relativeError The relative target precision to achieve 
(greater than or equal to 0).
*   If set to zero, the exact quantiles are computed, which could be 
very expensive.
*   Note that values greater than 1 are accepted but give the same 
result as 1.
* @return the approximate quantiles at the given probabilities
*
-   * @note NaN values will be removed from the numerical column before 
calculation
+   * @note null and NaN values will be removed from the numerical column 
before calculation. If
+   *   the dataframe is empty or all rows contain null or NaN, null is 
returned.
*
* @since 2.0.0
*/
   def approxQuantile(
   col: String,
   probabilities: Array[Double],
   relativeError: Double): Array[Double] = {
-StatFunctions.multipleApproxQuantiles(df.select(col).na.drop(),
-  Seq(col), probabilities, relativeError).head.toArray
+val res = approxQuantile(Array(col), probabilities, relativeError)
+Option(res).map(_.head).orNull
   }
 
   /**
* Calculates the approximate quantiles of numerical columns of a 
DataFrame.
-   * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* 
approxQuantile]] for
-   * detailed description.
*
-   * Note that rows containing any null or NaN values values will be 
removed before
-   * calculation.
* @param cols the names of the numerical columns
* @param probabilities a list of quantile probabilities
*   Each number must belong to [0, 1].
*   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
-   * @param relativeError The relative target precision to achieve (>= 0).
+   * @param relativeError The relative target precision to achieve 
(greater than or equal to 0).
*   If set to zero, the exact quantiles are computed, which could be 
very expensive.
*   Note that values greater than 1 are accepted but give the same 
result as 1.
* @return the approximate quantiles at the given probabilities of each 
column
*
-   * @note Rows containing any NaN values will be removed before 
calculation
+   * @note Rows containing any null or NaN values will be removed before 
calculation. If
+   *   the dataframe is empty or all rows contain null or NaN, null is 
returned.
*
* @since 2.2.0
*/
   def approxQuantile(
   cols: Array[String],
   probabilities: Array[Double],
   relativeError: Double): Array[Array[Double]] = {
-StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): 
_*).na.drop(), cols,
-  probabilities, relativeError).map(_.toArray).toArray
+// TODO: Update NaN/null handling to keep consistent with the 
single-column version
--- End diff --

Should we turn this TODO into a JIRA? I guess it is generally better to file TODOs as JIRAs rather than leave them in the code.





[GitHub] spark pull request #16776: [SPARK-19436][SQL] Add missing tests for approxQu...

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16776#discussion_r100996427
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -58,49 +58,52 @@ final class DataFrameStatFunctions private[sql](df: 
DataFrame) {
* @param probabilities a list of quantile probabilities
*   Each number must belong to [0, 1].
*   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
-   * @param relativeError The relative target precision to achieve 
(greater or equal to 0).
+   * @param relativeError The relative target precision to achieve 
(greater than or equal to 0).
*   If set to zero, the exact quantiles are computed, which could be 
very expensive.
*   Note that values greater than 1 are accepted but give the same 
result as 1.
* @return the approximate quantiles at the given probabilities
*
-   * @note NaN values will be removed from the numerical column before 
calculation
+   * @note null and NaN values will be removed from the numerical column 
before calculation. If
+   *   the dataframe is empty or all rows contain null or NaN, null is 
returned.
*
* @since 2.0.0
*/
   def approxQuantile(
   col: String,
   probabilities: Array[Double],
   relativeError: Double): Array[Double] = {
-StatFunctions.multipleApproxQuantiles(df.select(col).na.drop(),
-  Seq(col), probabilities, relativeError).head.toArray
+val res = approxQuantile(Array(col), probabilities, relativeError)
+Option(res).map(_.head).orNull
   }
 
   /**
* Calculates the approximate quantiles of numerical columns of a 
DataFrame.
-   * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* 
approxQuantile]] for
--- End diff --

I am sorry. Actually, I meant removing `DataFrameStatFunctions` while leaving the method reference as it is, since it is in the same class. Nevertheless, FWIW, I am fine with removing it as is, given the other functions here.





[GitHub] spark pull request #16776: [SPARK-19436][SQL] Add missing tests for approxQu...

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16776#discussion_r100997576
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -58,49 +58,52 @@ final class DataFrameStatFunctions private[sql](df: 
DataFrame) {
* @param probabilities a list of quantile probabilities
*   Each number must belong to [0, 1].
*   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
-   * @param relativeError The relative target precision to achieve 
(greater or equal to 0).
+   * @param relativeError The relative target precision to achieve 
(greater than or equal to 0).
*   If set to zero, the exact quantiles are computed, which could be 
very expensive.
*   Note that values greater than 1 are accepted but give the same 
result as 1.
* @return the approximate quantiles at the given probabilities
*
-   * @note NaN values will be removed from the numerical column before 
calculation
+   * @note null and NaN values will be removed from the numerical column 
before calculation. If
+   *   the dataframe is empty or all rows contain null or NaN, null is 
returned.
*
* @since 2.0.0
*/
   def approxQuantile(
   col: String,
   probabilities: Array[Double],
   relativeError: Double): Array[Double] = {
-StatFunctions.multipleApproxQuantiles(df.select(col).na.drop(),
-  Seq(col), probabilities, relativeError).head.toArray
+val res = approxQuantile(Array(col), probabilities, relativeError)
+Option(res).map(_.head).orNull
   }
 
   /**
* Calculates the approximate quantiles of numerical columns of a 
DataFrame.
-   * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* 
approxQuantile]] for
-   * detailed description.
*
-   * Note that rows containing any null or NaN values values will be 
removed before
-   * calculation.
* @param cols the names of the numerical columns
* @param probabilities a list of quantile probabilities
*   Each number must belong to [0, 1].
*   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
-   * @param relativeError The relative target precision to achieve (>= 0).
+   * @param relativeError The relative target precision to achieve 
(greater than or equal to 0).
*   If set to zero, the exact quantiles are computed, which could be 
very expensive.
*   Note that values greater than 1 are accepted but give the same 
result as 1.
* @return the approximate quantiles at the given probabilities of each 
column
*
-   * @note Rows containing any NaN values will be removed before 
calculation
+   * @note Rows containing any null or NaN values will be removed before 
calculation. If
+   *   the dataframe is empty or all rows contain null or NaN, null is 
returned.
*
* @since 2.2.0
*/
   def approxQuantile(
   cols: Array[String],
   probabilities: Array[Double],
   relativeError: Double): Array[Array[Double]] = {
-StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): 
_*).na.drop(), cols,
-  probabilities, relativeError).map(_.toArray).toArray
+// TODO: Update NaN/null handling to keep consistent with the 
single-column version
--- End diff --

I just saw your comment above.





[GitHub] spark pull request #16926: [MINOR] Fix javadoc8 break

2017-02-14 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/16926

[MINOR] Fix javadoc8 break

## What changes were proposed in this pull request?

The errors below seem to be caused by unidoc, which does not understand doubly-commented blocks.

```
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: 
class, interface, or enum expected
[error]  * MapGroupsWithStateFunction<String, Integer, Integer, 
String> mappingFunction =
[error]  ^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: 
class, interface, or enum expected
[error]  * MapGroupsWithStateFunction<String, Integer, Integer, 
String> mappingFunction =
[error] 
  ^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: 
class, interface, or enum expected
[error]  *new MapGroupsWithStateFunction<String, Integer, Integer, 
String>() {
[error] ^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: 
class, interface, or enum expected
[error]  *new MapGroupsWithStateFunction<String, Integer, Integer, 
String>() {
[error] 
^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: 
illegal character: '#'
[error]  *  @Override
[error]  ^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: 
class, interface, or enum expected
[error]  *  @Override
[error]  ^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: 
class, interface, or enum expected
[error]  *  public String call(String key, Iterator<Integer> 
value, KeyedState<Integer> state) {
[error]^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: 
class, interface, or enum expected
[error]  *  public String call(String key, Iterator<Integer> 
value, KeyedState<Integer> state) {
[error]^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: 
class, interface, or enum expected
[error]  *  public String call(String key, Iterator<Integer> 
value, KeyedState<Integer> state) {
[error]^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: 
class, interface, or enum expected
[error]  *  public String call(String key, Iterator<Integer> 
value, KeyedState<Integer> state) {
[error] 
^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: 
class, interface, or enum expected
[error]  *  public String call(String key, Iterator<Integer> 
value, KeyedState<Integer> state) {
[error] 
^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:76: error: 
class, interface, or enum expected
[error]  *  boolean shouldRemove = ...; // Decide whether to remove 
the state
[error]  ^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:77: error: 
class, interface, or enum expected
[error]  *  if (shouldRemove) {
[error]  ^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:79: error: 
class, interface, or enum expected
[error]  *  } else {
[error]  ^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:81: error: 
class, interface, or enum expected
[error]  *state.update(newState); // Set the new state
[error]  ^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:82: error: 
class, interface, or enum expected
[error]  *  }
[error]  ^
[error] 
.../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:85: 
error: class, interface, or enum expected
[error]  *  state.update(initialState);
[error]  ^
[error] 
.../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:86: 
error: class, interface, or enum expected
[error]  *}
[error]  ^
[error] 
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:90: error: 
class, interface, or enum expected
[error]  * 
[error]  ^
[error] 
.../spark/s
```
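
For context, the Scaladoc these errors point at looks roughly like the following; this is reconstructed from the log above, so the actual `KeyedState` source may differ. Once unidoc copies such a comment into the generated Java stubs, javadoc 8 rejects the generics and the `@Override` annotation:

```scala
/**
 * (Reconstructed from the javadoc errors above; elisions kept as `...`.)
 * {{{
 *   MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
 *     new MapGroupsWithStateFunction<String, Integer, Integer, String>() {
 *       @Override
 *       public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
 *         boolean shouldRemove = ...; // Decide whether to remove the state
 *         if (shouldRemove) {
 *           ...
 *         } else {
 *           state.update(newState); // Set the new state
 *         }
 *         ...
 *         state.update(initialState);
 *       }
 *     };
 * }}}
 */
```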

[GitHub] spark issue #16926: [MINOR] Fix javadoc8 break

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16926
  
Note that this many errors seem to hide further errors in the messages. Since they were starting to mask more errors, I proposed this PR.





[GitHub] spark issue #16926: [MINOR] Fix javadoc8 break

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16926
  
cc @srowen.





[GitHub] spark pull request #16927: [WIP][SPARK-19571][R] Fix SparkR test break on Wi...

2017-02-14 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/16927

[WIP][SPARK-19571][R] Fix SparkR test break on Windows via AppVeyor

## What changes were proposed in this pull request?

It seems winutils for Hadoop 2.6.5 does not exist for now in
https://github.com/steveloughran/winutils.

This breaks the tests in SparkR on Windows, so this PR proposes to use the
winutils built for Hadoop 2.6.4 for now.

## How was this patch tested?

Manually via AppVeyor

**Before**

https://ci.appveyor.com/project/spark-test/spark/build/627-r-test-break

**After**

https://ci.appveyor.com/project/spark-test/spark/build/629-r-test-break

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark spark-r-windows-break

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16927.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16927


commit 58af8fc830bb4c51bd80bc46da669125ec6339f6
Author: hyukjinkwon 
Date:   2017-02-14T14:48:37Z

Fix SparkR test break on Windows via AppVeyor







[GitHub] spark issue #16927: [SPARK-19571][R] Fix SparkR test break on Windows via Ap...

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16927
  
cc @felixcheung and @shivaram





[GitHub] spark issue #16927: [SPARK-19571][R] Fix SparkR test break on Windows via Ap...

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16927
  
(I think I should cc @steveloughran too just FYI)





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Do not allow json array in fro...

2017-02-14 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/16929

[SPARK-19595][SQL] Do not allow json array in from_json

## What changes were proposed in this pull request?

Currently, it only reads a single row when the input is a JSON array. So, the code below:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schema = StructType(StructField("a", IntegerType) :: Nil)
Seq(("""[{"a": 1}, {"a": 
2}]""")).toDF("struct").select(from_json(col("struct"), schema)).show()
```
prints 

```
+--------------------+
|jsontostruct(struct)|
+--------------------+
|                 [1]|
+--------------------+
```
We may consider supporting this as a generator expression, but I guess that would be arguable. So, this PR simply proposes disallowing JSON arrays in `from_json` for now.

**After**

```
+--------------------+
|jsontostruct(struct)|
+--------------------+
|                null|
+--------------------+
``` 
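
As a usage-level sketch of the new behavior (assuming the change as described; runnable in spark-shell, where `toDF` is in scope), callers can filter out the nulls produced for array inputs:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = StructType(StructField("a", IntegerType) :: Nil)
val df = Seq("""[{"a": 1}, {"a": 2}]""", """{"a": 3}""").toDF("struct")

// Array inputs now parse to null, so they can be filtered out explicitly.
df.select(from_json(col("struct"), schema).as("parsed"))
  .where(col("parsed").isNotNull)
  .show()
```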

## How was this patch tested?

Unit test in `JsonExpressionsSuite` and manual test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark disallow-array

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16929.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16929


commit acbce26cd983c4e3510a8db707196e3cd848aba2
Author: hyukjinkwon 
Date:   2017-02-14T15:37:00Z

Do not allow json array in from_json







[GitHub] spark issue #16927: [SPARK-19571][R] Fix SparkR test break on Windows via Ap...

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16927
  
Yea, I agree that the error message was hard to read. Let me try to raise this issue in a JIRA.





[GitHub] spark issue #16927: [SPARK-19571][R] Fix SparkR test break on Windows via Ap...

2017-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16927
  
cc @srowen too.




