[GitHub] spark issue #17458: [SPARK-20127][CORE] few warning have been fixed which In...

2017-03-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17458
  
Probably, I guess this should be fine. In my experience, IntelliJ's 
inspections have been quite okay, except for the case that broke Scala 2.10. It might be 
better if these changes could be manually tested via the UI, though.





[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix

2017-03-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17435#discussion_r108437512
  
--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):
 
 @classmethod
 def typeName(cls):
-return cls.__name__[:-4].lower()
+typeTypeNameMap = {"DataType": "data",
+   "NullType": "null",
+   "StringType": "string",
+   "BinaryType": "binary",
+   "BooleanType": "boolean",
+   "DateType": "date",
+   "TimestampType": "timestamp",
+   "DecimalType": "decimal",
+   "DoubleType": "double",
+   "FloatType": "float",
+   "ByteType": "byte",
+   "IntegerType": "integer",
+   "LongType": "long",
+   "ShortType": "short",
+   "ArrayType": "array",
+   "MapType": "map",
+   "StructField": "struct",
--- End diff --

I guess it would not produce the correct field types if `struct` is printed for a 
`StructField`.

```python
>>> from pyspark.sql import Row
>>>
>>> df = spark.createDataFrame([[Row(a=1), 2]])
>>> cols = []
>>> for i in df.schema:
...     cols.append("`" + i.name + "`" + "\t" + i.typeName())
...
>>> print ",\n".join(cols)
`_1`struct,
`_2`struct
```

because the actual types are as below:

```python
>>> df.schema.simpleString()
'struct<_1:struct<a:bigint>,_2:bigint>'
```

How about the one below?

```python
from pyspark.sql import Row

df = spark.createDataFrame([[Row(a=1), 2]])
cols = []
for i in df.schema:
    cols.append("`" + i.name + "`" + "\t" + i.dataType.simpleString())

print ",\n".join(cols)
```

prints

```
`_1`struct<a:bigint>,
`_2`bigint
```





[GitHub] spark issue #17458: [SPARK-20127][CORE] few warning have been fixed which In...

2017-03-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17458
  
They are suggestions from my point of view. It doesn't necessarily mean you 
should follow them if there are reasons not to.





[GitHub] spark pull request #17458: [SPARK-20127][CORE] few warning have been fixed w...

2017-03-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17458#discussion_r108560956
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -103,7 +103,7 @@ private[ui] class StagePage(parent: StagesTab) extends 
WebUIPage("stage") {
   val taskSortColumn = Option(parameterTaskSortColumn).map { 
sortColumn =>
 UIUtils.decodeURLParameter(sortColumn)
   }.getOrElse("Index")
-  val taskSortDesc = 
Option(parameterTaskSortDesc).map(_.toBoolean).getOrElse(false)
+  val taskSortDesc = Option(parameterTaskSortDesc).exists(_.toBoolean)
--- End diff --

If my opinion can be helpful, I personally prefer the previous one because it 
explicitly shows the default. Actually, I intentionally use this pattern in several 
places.
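
As a small, standalone illustration of the two equivalent spellings being compared here (the option value below is made up):

```scala
// Both spellings return the same value; the first just spells out the default.
val parameterTaskSortDesc: Option[String] = Some("true")

val sortDescExplicit = parameterTaskSortDesc.map(_.toBoolean).getOrElse(false) // default shown explicitly
val sortDescShort = parameterTaskSortDesc.exists(_.toBoolean)                  // default left implicit

assert(sortDescExplicit == sortDescShort)
```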





[GitHub] spark issue #17466: Added getter for impurityStats

2017-03-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17466
  
Hi @shaynativ, this PR looks like a non-trivial change that needs a JIRA. Please 
refer to http://spark.apache.org/contributing.html.





[GitHub] spark issue #17467: Ysharma/spark kinesis retries

2017-03-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17467
  
Hi @shaynativ, this PR looks like a non-trivial change that needs a JIRA. Please 
refer to http://spark.apache.org/contributing.html.





[GitHub] spark issue #17467: Ysharma/spark kinesis retries

2017-03-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17467
  
Hi @yssharma, this PR looks like a non-trivial change that needs a JIRA. Please 
refer to http://spark.apache.org/contributing.html.





[GitHub] spark issue #17467: Ysharma/spark kinesis retries

2017-03-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17467
  
You could just edit the title. I think closing this and opening a new one is 
also fine and an option.





[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix

2017-03-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17435#discussion_r108685381
  
--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):
 
 @classmethod
 def typeName(cls):
-return cls.__name__[:-4].lower()
+typeTypeNameMap = {"DataType": "data",
+   "NullType": "null",
+   "StringType": "string",
+   "BinaryType": "binary",
+   "BooleanType": "boolean",
+   "DateType": "date",
+   "TimestampType": "timestamp",
+   "DecimalType": "decimal",
+   "DoubleType": "double",
+   "FloatType": "float",
+   "ByteType": "byte",
+   "IntegerType": "integer",
+   "LongType": "long",
+   "ShortType": "short",
+   "ArrayType": "array",
+   "MapType": "map",
+   "StructField": "struct",
--- End diff --

Yup, I think we should still fix it, and overriding it is good enough.





[GitHub] spark pull request #17468: [SPARK-20143][SQL] DataType.fromJson should throw...

2017-03-29 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17468

[SPARK-20143][SQL] DataType.fromJson should throw an exception with better 
message

## What changes were proposed in this pull request?

Currently, `DataType.fromJson` throws `scala.MatchError` or 
`java.util.NoSuchElementException` in some cases when the JSON input is invalid 
as below:

```scala
DataType.fromJson(""""abcd"""")
```

```
java.util.NoSuchElementException: key not found: abcd
  at ...
```

```scala
DataType.fromJson("""{"abcd":"a"}""")
```

```
scala.MatchError: JObject(List((abcd,JString(a)))) (of class org.json4s.JsonAST$JObject)
  at ...
```

```scala
DataType.fromJson("""{"fields": [{"a":123}], "type": "struct"}""")
```

```
scala.MatchError: JObject(List((a,JInt(123)))) (of class org.json4s.JsonAST$JObject)
  at ...
```

After this PR, 

```scala
DataType.fromJson(""""abcd"""")
```

```
java.lang.IllegalArgumentException: Failed to convert the JSON string 
'abcd' to a data type.
  at ...
```

```scala
DataType.fromJson("""{"abcd":"a"}""")
```

```
java.lang.IllegalArgumentException: Failed to convert the JSON string 
'{"abcd":"a"}' to a data type.
  at ...
```


```scala
DataType.fromJson("""{"fields": [{"a":123}], "type": "struct"}""")
```

```
java.lang.IllegalArgumentException: Failed to convert the JSON string '{"a":123}' to a field.
  at ...
```
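
A minimal sketch of the error-wrapping pattern described above (an illustration, not the actual change in `DataType`; the helper name is made up):

```scala
import scala.util.control.NonFatal

// Catch low-level failures such as scala.MatchError or NoSuchElementException
// and rethrow them with the offending JSON string in the message.
def convertWithBetterMessage[T](json: String)(convert: String => T): T = {
  try {
    convert(json)
  } catch {
    case NonFatal(_) =>
      throw new IllegalArgumentException(
        s"Failed to convert the JSON string '$json' to a data type.")
  }
}
```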

## How was this patch tested?

Unit test added in `DataTypeSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark fromjson_exception

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17468.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17468


commit 75ebd181950a556b264267a73a452f9e75601439
Author: hyukjinkwon 
Date:   2017-03-29T14:17:35Z

DataType.fromJson should throw an exception with better message







[GitHub] spark issue #17468: [SPARK-20143][SQL] DataType.fromJson should throw an exc...

2017-03-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17468
  
(BTW, I believe this does not conflict with PR 17406)





[GitHub] spark issue #17375: [SPARK-19019][PYTHON][BRANCH-1.6] Fix hijacked `collecti...

2017-03-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17375
  
Yea, it might be less important, but I guess it is still a valid backport.





[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...

2017-03-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17469#discussion_r108835449
  
--- Diff: python/pyspark/sql/column.py ---
@@ -124,6 +124,35 @@ def _(self, other):
 return _
 
 
+like_doc = """ Return a Boolean :class:`Column` based on a SQL LIKE 
match.\n
+   :param other: a SQL LIKE pattern\n
+   See :func:`pyspark.sql.Column.rlike` for a regex version\n
+
+   >>> df.filter( df.name.like('Al%') ).collect()
+   [Row(name=u'Alice', age=1)]
+"""
+rlike_doc = """ Return a Boolean :class:`Column` based on a regex match.\n
+:param other: an extended regex expression\n
+
+>>> df.filter( df.name.rlike('ice$') ).collect()
+[Row(name=u'Alice', age=1)]
+"""
+endswith_doc = ''' Return a Boolean :class:`Column` based on matching end 
of string.\n
+   :param other: string at end of line (do not use a regex 
`$`)\n
+   >>> df.filter(df.name.endswith('ice')).collect()
+   [Row(name=u'Alice', age=1)]
+   >>> df.filter(df.name.endswith('ice$')).collect()
+   []
+   '''
+startswith_doc = ''' Return a Boolean :class:`Column` based on a string 
match.\n
--- End diff --

Mind adding `_` as a prefix to this variable to indicate it is a private one?





[GitHub] spark issue #17467: [SPARK-20140][DStream] Remove hardcoded kinesis retry wa...

2017-03-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17467
  
I am not familiar with Kinesis. I usually click the blame button and check both the 
recent code modifiers and the committer, e.g., 
https://github.com/yssharma/spark/blame/4b589adeaef540f6227266ecc628ad41ef0733c3/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala

I assume @tdas and @brkyvz are experts in this area.





[GitHub] spark pull request #17477: [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc bui...

2017-03-29 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17477

[SPARK-18692][BUILD][DOCS] Test Java 8 unidoc build on Jenkins

## What changes were proposed in this pull request?

This PR proposes to run Spark unidoc to test the Javadoc 8 build, as Javadoc 8 
is easily re-broken.

There are several problems with it:

- It introduces a little extra time to run the tests. In my case, it 
took about 1.5 minutes more (`Elapsed :[94.8746569157]`). How this was measured is described 
in "How was this patch tested?".

- > One problem that I noticed was that Unidoc appeared to be processing 
test sources: if we can find a way to exclude those from being processed in the 
first place then that might significantly speed things up.

  (see  @joshrosen's 
[comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))

To make this automated build pass, this PR also fixes the existing Javadoc 
breaks, including ones introduced by test code as described above.

These fixes are similar to instances previously fixed. Please refer to 
https://github.com/apache/spark/pull/15999 and 
https://github.com/apache/spark/pull/16013.

Note that this only fixes **errors**, not **warnings**. Please see my 
observation in https://github.com/apache/spark/pull/17389#issuecomment-288438704 
regarding the spurious errors for warnings.

## How was this patch tested?

Manually via `jekyll build` for building tests. Also, tested via running 
`dev/run-tests.py`. 

This was tested via manually adding `time.time()` as below:

```diff
 profiles_and_goals = build_profiles + sbt_goals

 print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with 
these arguments: ",
   " ".join(profiles_and_goals))

+import time
+st = time.time()
 exec_sbt(profiles_and_goals)
+print("Elapsed :[%s]" % str(time.time() - st))
```

produces

```
...

Building Unidoc API Documentation

...
[info] Main Java API documentation successful.
...
Elapsed :[94.8746569157]
...
...
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-18692

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17477.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17477


commit 7ddb6eb11ed17c87355826db5bf2512b785042f5
Author: hyukjinkwon 
Date:   2017-03-30T03:55:09Z

Test Java 8 unidoc build on Jenkins







[GitHub] spark pull request #17477: [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc bui...

2017-03-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17477#discussion_r108852891
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala ---
@@ -230,7 +230,9 @@ class PipelineSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defaul
 }
 
 
-/** Used to test [[Pipeline]] with [[MLWritable]] stages */
+/**
+ * Used to test [[Pipeline]] with `MLWritable` stages
--- End diff --

We should avoid the single-line comment form when there are code blocks (`` ` ... ` ``). 
See https://github.com/apache/spark/pull/16050.





[GitHub] spark pull request #17477: [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc bui...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17477#discussion_r108853043
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
@@ -74,7 +74,7 @@ abstract class Classifier[
* and features (`Vector`).
* @param numClasses  Number of classes label can take.  Labels must be 
integers in the range
*[0, numClasses).
-   * @throws SparkException  if any label is not an integer >= 0
+   * @note Throws `SparkException` if any label is not an integer is 
greater than or equal to 0
--- End diff --

This case throws an error as below:

```
[error] 
.../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:28:
 error: reference not found
[error]* @throws SparkException  if any label is not an integer >= 0
[error]  ^
```
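
For reference, a hedged sketch of the doc-comment pattern applied here (the object and method below are hypothetical, not Spark code): a `@throws` tag with a class reference that genjavadoc cannot resolve breaks the Javadoc 8 build, while a `@note` using only inline code does not.

```scala
object LabelCheck {
  /**
   * Checks that every label is a non-negative integer.
   *
   * @note Throws `IllegalArgumentException` if any label is not an integer
   * greater than or equal to 0.
   */
  def validate(labels: Seq[Double]): Unit = {
    require(labels.forall(l => l >= 0 && l == l.toInt),
      "labels must be integers greater than or equal to 0")
  }
}
```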





[GitHub] spark issue #17477: [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc build on J...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17477
  
FYI, if I haven't missed something, all the cases here are instances of 
the ones previously fixed. cc @joshrosen, @srowen and @jkbradley. Could you 
take a look and see if it makes sense?





[GitHub] spark pull request #17477: [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc bui...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17477#discussion_r108857749
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -704,12 +704,12 @@ private[spark] object TaskSchedulerImpl {
* Used to balance containers across hosts.
*
* Accepts a map of hosts to resource offers for that host, and returns 
a prioritized list of
-   * resource offers representing the order in which the offers should be 
used.  The resource
+   * resource offers representing the order in which the offers should be 
used. The resource
* offers are ordered such that we'll allocate one container on each 
host before allocating a
* second container on any host, and so on, in order to reduce the 
damage if a host fails.
*
-   * For example, given <h1, [o1, o2, o3]>, <h2, [o4]>, <h3, [o5, o6]>, returns
-   * [o1, o5, o4, 02, o6, o3]
+   * For example, given maps from h1 to [o1, o2, o3], from h2 to [o4] and from h3 to [o5, o6],
+   * returns a list, [o1, o5, o4, o2, o6, o3].
--- End diff --

There look to be a few typos here: 02 -> o2 and h1 -> h3.
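
For anyone reading along, a self-contained sketch of the ordering this docstring describes (an illustration only, not Spark's actual `prioritizeContainers` implementation):

```scala
object ContainerBalance {
  // Visit hosts in decreasing order of remaining offers and take one offer per
  // host per round, so every host gets one container before any host gets two.
  def prioritize[T](offersByHost: Map[String, Seq[T]]): List[T] = {
    val queues = offersByHost.values.map(_.toList).toList.sortBy(-_.length)
    val rounds = if (queues.isEmpty) 0 else queues.head.length
    (0 until rounds).flatMap(i => queues.flatMap(_.lift(i))).toList
  }

  def main(args: Array[String]): Unit = {
    val offers = Map(
      "h1" -> Seq("o1", "o2", "o3"),
      "h2" -> Seq("o4"),
      "h3" -> Seq("o5", "o6"))
    println(prioritize(offers)) // List(o1, o5, o4, o2, o6, o3)
  }
}
```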





[GitHub] spark issue #17479: [SPARK-20154][Web UI]In web ui,http://ip:4040/executors/...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17479
  
I think a UI change requires a screenshot, as written above. It seems trivial, 
though.





[GitHub] spark issue #17468: [SPARK-20143][SQL] DataType.fromJson should throw an exc...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17468
  
@gatorsmile, could you take a look please?





[GitHub] spark pull request #17458: [SPARK-20127][CORE] few warning have been fixed w...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17458#discussion_r108874489
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagesTab.scala ---
@@ -35,7 +35,7 @@ private[ui] class StagesTab(parent: SparkUI) extends 
SparkUITab(parent, "stages"
   attachPage(new StagePage(this))
   attachPage(new PoolPage(this))
 
-  def isFairScheduler: Boolean = progressListener.schedulingMode == 
Some(SchedulingMode.FAIR)
+  def isFairScheduler: Boolean = 
progressListener.schedulingMode.contains(SchedulingMode.FAIR)
--- End diff --

It seems this has not been reverted yet.





[GitHub] spark pull request #17458: [SPARK-20127][CORE] few warning have been fixed w...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17458#discussion_r108874623
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/UIData.scala ---
@@ -180,8 +180,8 @@ private[spark] object UIData {
 speculative = taskInfo.speculative
   )
   newTaskInfo.gettingResultTime = taskInfo.gettingResultTime
-  newTaskInfo.setAccumulables(taskInfo.accumulables.filter {
-accum => !accum.internal && accum.metadata != 
Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER)
+  newTaskInfo.setAccumulables(taskInfo.accumulables.filter { acc =>
+!acc.internal && 
!acc.metadata.contains(AccumulatorContext.SQL_ACCUM_IDENTIFIER)
--- End diff --

We should revert this one too, for the same reason as in 
https://github.com/apache/spark/pull/17458/files#r108432518.





[GitHub] spark pull request #17458: [SPARK-20127][CORE] few warning have been fixed w...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17458#discussion_r108874836
  
--- Diff: core/src/main/scala/org/apache/spark/ui/storage/RDDPage.scala ---
@@ -42,7 +42,7 @@ private[ui] class RDDPage(parent: StorageTab) extends 
WebUIPage("rdd") {
 
 val blockPage = Option(parameterBlockPage).map(_.toInt).getOrElse(1)
 val blockSortColumn = 
Option(parameterBlockSortColumn).getOrElse("Block Name")
-val blockSortDesc = 
Option(parameterBlockSortDesc).map(_.toBoolean).getOrElse(false)
+val blockSortDesc = Option(parameterBlockSortDesc).exists(_.toBoolean)
--- End diff --

This also looks like the same instance. Let's revert this change too.





[GitHub] spark pull request #17479: [SPARK-20154][Web UI]In web ui,http://ip:4040/exe...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17479#discussion_r108879364
  
--- Diff: 
core/src/main/resources/org/apache/spark/ui/static/executorspage-template.html 
---
@@ -24,7 +24,7 @@ Summary
 
 RDD Blocks
 Storage Memory
+  title="Memory used / total available memory for storage 
of data like RDD partitions cached in memory. ">Storage Memory used/total
--- End diff --

Let's post before/after screenshots here and defer to committers. 
Personally, I am not sure either, because the tooltip explains it quite clearly.

BTW, I think we should capitalise it, just to be consistent with the other pages.

![2017-03-30 6 21 
24](https://cloud.githubusercontent.com/assets/6477701/24497335/0f62dabc-1576-11e7-9ec4-ad0f3cd8206c.png)
![2017-03-30 6 21 
18](https://cloud.githubusercontent.com/assets/6477701/24497339/15032e36-1576-11e7-99d7-8493eea3ae92.png)
![2017-03-30 6 21 
15](https://cloud.githubusercontent.com/assets/6477701/24497341/1810be18-1576-11e7-907f-858f692b522b.png)
![2017-03-30 6 21 
12](https://cloud.githubusercontent.com/assets/6477701/24497344/1b4013ae-1576-11e7-9599-9f91e4f89317.png)






[GitHub] spark pull request #17477: [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc bui...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17477#discussion_r108882566
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -704,12 +704,12 @@ private[spark] object TaskSchedulerImpl {
* Used to balance containers across hosts.
*
* Accepts a map of hosts to resource offers for that host, and returns 
a prioritized list of
-   * resource offers representing the order in which the offers should be 
used.  The resource
+   * resource offers representing the order in which the offers should be 
used. The resource
* offers are ordered such that we'll allocate one container on each 
host before allocating a
* second container on any host, and so on, in order to reduce the 
damage if a host fails.
*
-   * For example, given <h1, [o1, o2, o3]>, <h2, [o4]>, <h3, [o5, o6]>, returns
-   * [o1, o5, o4, 02, o6, o3]
+   * For example, given a map consisting of h1 to [o1, o2, o3], h2 to [o4] and h3 to [o5, o6],
+   * returns a list, [o1, o5, o4, o2, o6, o3].
--- End diff --

There look to be a few typos here: 02 -> o2 and h1 -> h3.





[GitHub] spark issue #17458: [SPARK-20127][CORE] few warning have been fixed which In...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17458
  
It looks good to me too.





[GitHub] spark issue #17468: [SPARK-20143][SQL] DataType.fromJson should throw an exc...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17468
  
Thank you @gatorsmile.





[GitHub] spark issue #17489: [SPARK-20166][SQL] Use XXX for ISO timezone instead of Z...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17489
  
cc @srowen and @ueshin, could you take a look please?





[GitHub] spark pull request #17489: [SPARK-20166][SQL] Use XXX for ISO timezone inste...

2017-03-30 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17489

[SPARK-20166][SQL] Use XXX for ISO timezone instead of ZZ (FastDateFormat 
specific) in CSV/JSON timeformat options

## What changes were proposed in this pull request?

This PR proposes to use the `XXX` format instead of `ZZ`. `ZZ` seems to be 
`FastDateFormat`-specific.

`ZZ` supports "ISO 8601 extended format time zones" but it seems to be a 
`FastDateFormat`-specific option. 
I misunderstood it as a format compatible with `SimpleDateFormat` when this 
change was introduced.
Please see [SimpleDateFormat documentation]( 
https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#iso8601timezone)
 and [FastDateFormat 
documentation](https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/FastDateFormat.html).

It seems we had better replace `ZZ` with `XXX` because they look to be using the same 
strategy - 
[FastDateParser.java#L930](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L930),
 [FastDateParser.java#L932-L951 
](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L932-L951)
 and 
[FastDateParser.java#L596-L601](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L596-L601).
 

I also checked the code and manually debugged it to be sure. It seems both 
cases use the same pattern `( Z|(?:[+-]\\d{2}(?::)\\d{2}))`.

_Note that this should rather be a documentation fix and not a behaviour change, 
because `ZZ` seems to be an invalid date format in `SimpleDateFormat` 
as documented in `DataFrameReader`, and both `ZZ` and `XXX` look to work 
identically._

Current documentation is as below:

```
   * `timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets the string that
   * indicates a timestamp format. Custom date formats follow the formats at
   * `java.text.SimpleDateFormat`. This applies to timestamp type.
```
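
As a hedged usage sketch in a spark-shell session (the input path is hypothetical; the option name and pattern follow the documentation above):

```scala
// Reading JSON with an explicit ISO 8601 timestamp format, using XXX for the
// timezone so the pattern is valid for both SimpleDateFormat and FastDateFormat.
val df = spark.read
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
  .json("/path/to/events.json")
```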

## How was this patch tested?

Existing tests should cover this. Also, manually tested as below (BTW, I 
don't think these are worth being added as tests within Spark):

**Parse**

```scala
scala> new 
java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
res4: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala>  new 
java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
res10: java.util.Date = Tue Mar 21 09:00:00 KST 2017

scala> new 
java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000-11:00"
  at java.text.DateFormat.parse(DateFormat.java:366)
  ... 48 elided
scala>  new 
java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000Z"
  at java.text.DateFormat.parse(DateFormat.java:366)
  ... 48 elided
```

```scala
scala> 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
res7: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
res1: java.util.Date = Tue Mar 21 09:00:00 KST 2017

scala> 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
res8: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
res2: java.util.Date = Tue Mar 21 09:00:00 KST 2017
```

**Format**

```scala
scala> new 
java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSXXX").format(new 
java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00"))
res6: String = 2017-03-21T20:00:00.000+09:00
```

```scala
scala> val fd = 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSZZ")
fd: org.apache.commons.lang3.ti
```

[GitHub] spark issue #17492: [SPARK-19641][SQL] JSON schema inference in DROPMALFORME...

2017-03-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17492
  
cc @NathanHowell and @cloud-fan.





[GitHub] spark pull request #17492: [SPARK-19641][SQL] JSON schema inference in DROPM...

2017-03-30 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17492

[SPARK-19641][SQL] JSON schema inference in DROPMALFORMED mode produces 
incorrect schema for non-array/object JSONs

## What changes were proposed in this pull request?

Currently, when we infer the types for valid JSON strings other than objects or 
arrays, we produce empty schemas regardless of the parse mode, as below:

```scala
scala> spark.read.option("mode", "DROPMALFORMED").json(Seq("""{"a": 1}""", 
""""a"""").toDS).printSchema()
root


scala> spark.read.option("mode", "FAILFAST").json(Seq("""{"a": 1}""", 
""""a"""").toDS).printSchema()
root
```

This PR proposes to handle parse modes in type inference.

After this PR,

```scala

scala> spark.read.option("mode", "DROPMALFORMED").json(Seq("""{"a": 1}""", 
""""a"""").toDS).printSchema()
root
 |-- a: long (nullable = true)
```

```
scala> spark.read.option("mode", "FAILFAST").json(Seq("""{"a": 1}""", 
""""a"""").toDS).printSchema()
java.lang.RuntimeException: Failed to infer a common schema. Struct types 
are expected but string was found.
```

This PR is based on 
https://github.com/NathanHowell/spark/commit/e233fd03346a73b3b447fa4c24f3b12c8b2e53ae, 
and @NathanHowell and I talked about this in 
https://issues.apache.org/jira/browse/SPARK-19641.
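
A minimal, self-contained sketch of the idea (this is not Spark's `JsonInferSchema`; it only illustrates how the parse mode decides what happens to a non-object top-level value during inference):

```scala
sealed trait ParseMode
case object Permissive extends ParseMode
case object DropMalformed extends ParseMode
case object FailFast extends ParseMode

// Toy "inference": a record whose top-level value is a JSON object maps to a
// struct; anything else (e.g. a bare string) has a non-struct root type.
def rootType(record: String): String =
  if (record.trim.startsWith("{")) "struct" else "string"

// Decide whether a record participates in schema inference.
def keepForInference(record: String, mode: ParseMode): Boolean =
  rootType(record) match {
    case "struct" => true
    case other if mode == FailFast =>
      throw new RuntimeException(
        s"Failed to infer a common schema. Struct types are expected but $other was found.")
    case _ if mode == DropMalformed => false // drop the record, keep inferring from the rest
    case _ => true                           // PERMISSIVE keeps it for the caller to handle
  }
```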


## How was this patch tested?

Unit tests in `JsonSuite` for both `DROPMALFORMED` and `FAILFAST` modes.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-19641

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17492.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17492


commit dc24522502b3f7816d539da2c8efb09a3515bb70
Author: hyukjinkwon 
Date:   2017-03-31T04:52:27Z

JSON schema inference in DROPMALFORMED mode produces incorrect schema







[GitHub] spark issue #17492: [SPARK-19641][SQL] JSON schema inference in DROPMALFORME...

2017-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17492
  
retest this please





[GitHub] spark issue #15751: [SPARK-18246][SQL] Throws an exception before execution ...

2017-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15751
  
I will close this for now and make a new one soon.





[GitHub] spark pull request #15751: [SPARK-18246][SQL] Throws an exception before exe...

2017-03-31 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/15751





[GitHub] spark issue #17489: [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead...

2017-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17489
  
> ZZ means something different to FastDateFormat, and what it means 
actually matches what XXX means to SimpleDateFormat?

Yes, it seems so, given my tests and a check of the code.

>  But FastDateFormat also supports XXX and supports it the same way?

Yes, it does.

> I think you need to check cases like "-1100" and "11".

To my knowledge, those cases should be `XX` and `X`, respectively. It seems 
to throw an exception as below if we use `XXX` or `ZZ`.

```scala
scala> 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-1100")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-1100
  at 
org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at 
org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

scala> 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-11
  at 
org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at 
org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

scala> 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-1100")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-1100
  at 
org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at 
org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

scala> 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-11
  at 
org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at 
org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided
```

Tests with `XX` and `X` for both cases are as below:

```scala
scala> 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSXX").parse("2017-03-21T00:00:00.000-1100")
res18: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSX").parse("2017-03-21T00:00:00.000-11")
res19: java.util.Date = Tue Mar 21 20:00:00 KST 2017
```

```scala
scala> new 
java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSXX").parse("2017-03-21T00:00:00.000-1100")
res20: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> new 
java.text.SimpleDateFormat("-MM-dd'T'HH:mm:ss.SSSX").parse("2017-03-21T00:00:00.000-11")
res21: java.util.Date = Tue Mar 21 20:00:00 KST 2017
```

> Is there any case that worked before but not after the change?

I understand this concern, and that's why I checked the code. They look 
to be using the same timezone pattern for both `ZZ` and `XXX` - 
[FastDateParser.java#L930](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L930),
 [FastDateParser.java#L932-L951 
](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L932-L951)
 and 
[FastDateParser.java#L596-L601](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L596-L601).
 






[GitHub] spark issue #17489: [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead...

2017-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17489
  
I tested "-1100" cases 
[before](https://issues.apache.org/jira/browse/SPARK-17545?focusedCommentId=15509110&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15509110)
 with `ZZ`.

I _guess_ this was rather a bug in commons-lang that was fixed after we upgraded 
commons-lang in [SPARK-17985](https://issues.apache.org/jira/browse/SPARK-17985). 
I just downloaded Spark 2.0.0 and tested the code below:

```scala
scala> 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-1100")
res2: java.util.Date = Tue Mar 21 20:00:00 KST 2017
```

Spark 2.1.0:

```scala
scala> 
org.apache.commons.lang3.time.FastDateFormat.getInstance("-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-1100")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-1100
  at 
org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at 
org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided
```







[GitHub] spark issue #17489: [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead...

2017-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17489
  
Let me just fix up the documentation if you are worried about it. I am fine 
with that.





[GitHub] spark issue #17489: [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead...

2017-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17489
  
Otherwise, let me ask the Apache Commons user mailing list whether `ZZ` and 
`XXX` are really the same.





[GitHub] spark issue #17489: [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead...

2017-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17489
  
Yes. I asked a question on the Commons mailing list about this for clarity. Let 
me update you when I have an answer.





[GitHub] spark issue #17489: [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead...

2017-03-31 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17489
  
@srowen, I got a great answer about `ZZ` - 
[here](http://mail-archives.apache.org/mod_mbox/commons-user/201704.mbox/). 
It seems okay to use `XXX` instead.





[GitHub] spark issue #17489: [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead...

2017-04-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17489
  
I tried to grep for this pattern. I think these are all of them, if I haven't missed 
any. Thanks for approving it.





[GitHub] spark issue #17468: [SPARK-20143][SQL] DataType.fromJson should throw an exc...

2017-04-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17468
  
@gatorsmile, could this get merged maybe?





[GitHub] spark pull request #17492: [SPARK-19641][SQL] JSON schema inference in DROPM...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17492#discussion_r109304123
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala
 ---
@@ -217,26 +221,43 @@ private[sql] object JsonInferSchema {
 }
   }
 
+  private def withParseMode(
--- End diff --

Sure.





[GitHub] spark issue #17477: [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc build on J...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17477
  
(gentle ping @joshrosen).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17375: [SPARK-19019][PYTHON][BRANCH-1.6] Fix hijacked `collecti...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17375
  
(gentle ping @holdenk and @davies)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14660: [SPARK-17071][SQL] Add an option to support for reading ...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14660
  
@rxin, I think there is no downside compared to what it improves. Would you mind 
reconsidering this?

If you think it is still not worth adding, I will close this, resolve the 
JIRA, and leave comments on similar PRs proposing this change in the 
future.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17489: [SPARK-20166][SQL] Use XXX for ISO 8601 timezone ...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17489#discussion_r109305593
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -363,7 +363,7 @@ def csv(self, path, schema=None, sep=None, 
encoding=None, quote=None, escape=Non
 :param timestampFormat: sets the string that indicates a timestamp 
format. Custom date
 formats follow the formats at 
``java.text.SimpleDateFormat``.
 This applies to timestamp type. If None is 
set, it uses the
-default value, 
``-MM-dd'T'HH:mm:ss.SSSZZ``.
+default value, 
``-MM-dd'T'HH:mm:ss.SSSXXX``.
--- End diff --

Yea, they look like similar instances. However, it seems the doc states 

> ... the timestamp of that moment in the current system time zone ...

> ...  using the default timezone and the default locale, ...

so this is probably fine, up to my knowledge.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17489: [SPARK-20166][SQL] Use XXX for ISO 8601 timezone ...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17489#discussion_r109305610
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -363,7 +363,7 @@ def csv(self, path, schema=None, sep=None, 
encoding=None, quote=None, escape=Non
 :param timestampFormat: sets the string that indicates a timestamp 
format. Custom date
 formats follow the formats at 
``java.text.SimpleDateFormat``.
 This applies to timestamp type. If None is 
set, it uses the
-default value, 
``-MM-dd'T'HH:mm:ss.SSSZZ``.
+default value, 
``-MM-dd'T'HH:mm:ss.SSSXXX``.
--- End diff --

I will double check this path and related references and then open a JIRA 
if it looks needed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17492: [SPARK-19641][SQL] JSON schema inference in DROPM...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17492#discussion_r109307361
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala
 ---
@@ -202,41 +206,54 @@ private[sql] object JsonInferSchema {
 
   private def withCorruptField(
   struct: StructType,
-  columnNameOfCorruptRecords: String): StructType = {
-if (!struct.fieldNames.contains(columnNameOfCorruptRecords)) {
-  // If this given struct does not have a column used for corrupt 
records,
-  // add this field.
-  val newFields: Array[StructField] =
-StructField(columnNameOfCorruptRecords, StringType, nullable = 
true) +: struct.fields
-  // Note: other code relies on this sorting for correctness, so don't 
remove it!
-  java.util.Arrays.sort(newFields, structFieldComparator)
-  StructType(newFields)
-} else {
-  // Otherwise, just return this struct.
+  other: DataType,
+  columnNameOfCorruptRecords: String,
+  parseMode: ParseMode) = parseMode match {
+case PermissiveMode =>
+  // If we see any other data type at the root level, we get records 
that cannot be
+  // parsed. So, we use the struct as the data type and add the 
corrupt field to the schema.
+  if (!struct.fieldNames.contains(columnNameOfCorruptRecords)) {
+// If this given struct does not have a column used for corrupt 
records,
+// add this field.
+val newFields: Array[StructField] =
+  StructField(columnNameOfCorruptRecords, StringType, nullable = 
true) +: struct.fields
+// Note: other code relies on this sorting for correctness, so 
don't remove it!
+java.util.Arrays.sort(newFields, structFieldComparator)
+StructType(newFields)
+  } else {
+// Otherwise, just return this struct.
+struct
+  }
+
+case DropMalformedMode =>
+  // If corrupt record handling is disabled we retain the valid schema 
and discard the other.
   struct
-}
+
+case FailFastMode =>
--- End diff --

It looks possible to hit this line. I added a test at 1936L. In more 
detail, if the JSON is valid but is not an object or an array of objects, the 
inferred type per record will not be a `StructType`. If one of the root types is 
not a struct type and FAILFAST mode is enabled, we will hit this line.
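
As a rough, hypothetical sketch of that scenario (assuming a `SparkSession` named `spark` and a `DataFrameReader.json` that accepts a `Dataset[String]`; this is not the PR's test):

```scala
import spark.implicits._

// One record is a JSON object and one is a bare primitive, so the per-record
// root types (a StructType and a LongType) cannot be merged into one struct.
val records = Seq("""{"a": 1}""", "1").toDS()

// With FAILFAST, schema inference itself should fail here (no collect() needed),
// with a message along the lines of "Failed to infer a common schema".
spark.read.option("mode", "FAILFAST").json(records)
```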

I am outside now. I will double check when I get to my computer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17492: [SPARK-19641][SQL] JSON schema inference in DROPM...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17492#discussion_r109307493
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1903,9 +1932,8 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
   .option("wholeFile", true)
   .option("mode", "FAILFAST")
   .json(path)
-  .collect()
   }
-  assert(exceptionOne.getMessage.contains("Failed to parse a value"))
+  assert(exceptionOne.getMessage.contains("Failed to infer a common 
schema"))
--- End diff --

Up to my knowledge, this test case previously threw an exception when actually 
parsing the data. Now, I intend to check the exception during schema 
inference, as the data-parsing case looks covered below.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17492: [SPARK-19641][SQL] JSON schema inference in DROPM...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17492#discussion_r109307548
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1041,7 +1041,6 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
   spark.read
 .option("mode", "FAILFAST")
 .json(corruptRecords)
-.collect()
--- End diff --

I intended to remove this line to make sure the exception happens during schema inference. 
I think this is not related to the change here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17469#discussion_r109309786
  
--- Diff: python/pyspark/sql/column.py ---
@@ -250,11 +250,41 @@ def __iter__(self):
 raise TypeError("Column is not iterable")
 
 # string methods
+_rlike_doc = """ Return a Boolean :class:`Column` based on a regex 
match.\n
+ :param other: an extended regex expression
+
+ >>> df.filter( df.name.rlike('ice$') ).collect()
--- End diff --

Let's make sure there is no extra space inside the parentheses in `(...)`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17469#discussion_r109309540
  
--- Diff: python/pyspark/sql/column.py ---
@@ -303,8 +333,25 @@ def isin(self, *cols):
 desc = _unary_op("desc", "Returns a sort expression based on the"
  " descending order of the given column name.")
 
-isNull = _unary_op("isNull", "True if the current expression is null.")
-isNotNull = _unary_op("isNotNull", "True if the current expression is 
not null.")
+_isNull_doc = ''' True if the current expression is null. Often 
combined with 
--- End diff --

It seems the key is consistency in this case. We could use `"""` here if 
there is no other specific reason.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17469#discussion_r109309750
  
--- Diff: python/pyspark/sql/column.py ---
@@ -250,11 +250,39 @@ def __iter__(self):
 raise TypeError("Column is not iterable")
 
 # string methods
+_rlike_doc = """ Return a Boolean :class:`Column` based on a regex 
match.\n
--- End diff --

I just manually tried replacing `\n` with a newline, then running `make clean` 
and `make html`. It seems fine to use a newline (not `\n`). Could you double-check 
and replace it if I am not wrong?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17469#discussion_r109309888
  
--- Diff: python/pyspark/sql/column.py ---
@@ -303,8 +333,25 @@ def isin(self, *cols):
 desc = _unary_op("desc", "Returns a sort expression based on the"
  " descending order of the given column name.")
 
-isNull = _unary_op("isNull", "True if the current expression is null.")
-isNotNull = _unary_op("isNotNull", "True if the current expression is 
not null.")
+_isNull_doc = ''' True if the current expression is null. Often 
combined with 
+  :func:`DataFrame.filter` to select rows with null 
values.
+
+  >>> df2.collect()
+  [Row(name=u'Tom', height=80), Row(name=u'Alice', 
height=None)]
+  >>> df2.filter( df2.height.isNull ).collect()
+  [Row(name=u'Alice', height=None)]
+  '''
+_isNotNull_doc = ''' True if the current expression is null. Often 
combined with 
--- End diff --

BTW, just a question: do we need the leading space here in the 
documentation? I think we should remove it if it is unnecessary.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17469: [SPARK-20132][Docs] Add documentation for column string ...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17469
  
Also, you could run the Python tests via `./python/run-tests.py --module 
pyspark-sql` after building. It is fine as is because I guess 
Jenkins will catch this, but it might be nicer if they could be tested 
locally ahead of time if possible.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17517: [MINOR][DOCS] Replace non-breaking space to norma...

2017-04-02 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17517

[MINOR][DOCS] Replace non-breaking space to normal spaces that breaks 
rendering markdown

## What changes were proposed in this pull request?

It seems several non-breaking spaces were inserted into `.md` files, and 
they appear to break the markdown rendering, as below.

The two space characters look the same but are different. For example, this can be checked via `python` as below:

```python
>>> " "
'\xc2\xa0'
>>> " "
' '
```

**Before**

![2017-04-03 12 36 
57](https://cloud.githubusercontent.com/assets/6477701/24594654/50a855e6-186a-11e7-94e2-661e56544b0f.png)
![2017-04-03 12 37 
17](https://cloud.githubusercontent.com/assets/6477701/24594655/50aaba02-186a-11e7-80bb-d34b17a3398a.png)

**After**

![2017-04-03 12 36 
46](https://cloud.githubusercontent.com/assets/6477701/24594657/53c2545c-186a-11e7-9a73-00529afbfd75.png)
![2017-04-03 12 36 
31](https://cloud.githubusercontent.com/assets/6477701/24594658/53c286c0-186a-11e7-99c9-e66b1f510fe7.png)

## How was this patch tested?

Manually checking.



These instances were found via 

```
grep --include=*.scala --include=*.python --include=*.java --include=*.r 
--include=*.R --include=*.md --include=*.r -r -I " " .
```

on macOS.

It seems there are several instances more as below:

```
./docs/sql-programming-guide.md:│   ├── ...
./docs/sql-programming-guide.md:│   │
./docs/sql-programming-guide.md:│   ├── country=US
./docs/sql-programming-guide.md:│   │   └── 
data.parquet
./docs/sql-programming-guide.md:│   ├── country=CN
./docs/sql-programming-guide.md:│   │   └── 
data.parquet
./docs/sql-programming-guide.md:│   └── ...
./docs/sql-programming-guide.md:    ├── ...
./docs/sql-programming-guide.md:    │
./docs/sql-programming-guide.md:    ├── country=US
./docs/sql-programming-guide.md:    │   └── data.parquet
./docs/sql-programming-guide.md:    ├── country=CN
./docs/sql-programming-guide.md:    │   └── data.parquet
./docs/sql-programming-guide.md:    └── ...
./sql/core/src/test/README.md:│   ├── *.avdl  # 
Testing Avro IDL(s)
./sql/core/src/test/README.md:│   └── *.avpr  # 
!! NO TOUCH !! Protocol files generated from Avro IDL(s)
./sql/core/src/test/README.md:│   ├── gen-avro.sh # 
Script used to generate Java code for Avro
./sql/core/src/test/README.md:│   └── gen-thrift.sh   # 
Script used to generate Java code for Thrift
```

These seem to be generated via the `tree` command, which inserts non-breaking spaces. 
They do not look like they cause any rendering problem, and I did not fix them to 
avoid the overhead of manually replacing them whenever the output is regenerated via the `tree` 
command in the future.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark non-breaking-space

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17517.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17517


commit 4b0e56a590a9a73e26e6c5c2cff2a20942fdb908
Author: hyukjinkwon 
Date:   2017-04-03T03:35:10Z

Replace non-breaking space to normal spaces




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17517: [MINOR][DOCS] Replace non-breaking space to norma...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17517#discussion_r109336173
  
--- Diff: docs/monitoring.md ---
@@ -257,7 +257,7 @@ In the API, an application is referenced by its 
application ID, `[app-id]`.
 When running on YARN, each application may have multiple attempts, but 
there are attempt IDs
 only for applications in cluster mode, not applications in client mode. 
Applications in YARN cluster mode
 can be identified by their `[attempt-id]`. In the API listed below, when 
running in YARN cluster mode,
-`[app-id]` will actually be `[base-app-id]/[attempt-id]`, where 
`[base-app-id]` is the YARN application ID.
+`[app-id]` will actually be `[base-app-id]/[attempt-id]`, where 
`[base-app-id]` is the YARN application ID.
--- End diff --

These seem to have been mistakenly added, so I decided to fix them here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17517: [MINOR][DOCS] Replace non-breaking space to normal space...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17517
  
cc @srowen. Could you take a look please?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17517: [MINOR][DOCS] Replace non-breaking space to norma...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17517#discussion_r109336405
  
--- Diff: docs/monitoring.md ---
@@ -257,7 +257,7 @@ In the API, an application is referenced by its 
application ID, `[app-id]`.
 When running on YARN, each application may have multiple attempts, but 
there are attempt IDs
 only for applications in cluster mode, not applications in client mode. 
Applications in YARN cluster mode
 can be identified by their `[attempt-id]`. In the API listed below, when 
running in YARN cluster mode,
-`[app-id]` will actually be `[base-app-id]/[attempt-id]`, where 
`[base-app-id]` is the YARN application ID.
+`[app-id]` will actually be `[base-app-id]/[attempt-id]`, where 
`[base-app-id]` is the YARN application ID.
--- End diff --

Open `vi`, copy it, and print it within Python:

```
>>> "`[app-id]` will actually be `[base-app-id]/[attempt-id]`, where 
`[base-app-id]` is the YARN application ID"
'`[app-id]`\xc2\xa0will actually be\xc2\xa0`[base-app-id]/[attempt-id]`, 
where `[base-app-id]`\xc2\xa0is the YARN application ID'
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17517: [MINOR][DOCS] Replace non-breaking space to norma...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17517#discussion_r109336446
  
--- Diff: docs/building-spark.md ---
@@ -154,7 +154,7 @@ Developers who compile Spark frequently may want to 
speed up compilation; e.g.,
 developers who build with SBT).  For more information about how to do 
this, refer to the
 [Useful Developer Tools 
page](http://spark.apache.org/developer-tools.html#reducing-build-times).
 
-## Encrypted Filesystems
+## Encrypted Filesystems
--- End diff --

```
>>> "## Encrypted Filesystems"
'##\xc2\xa0Encrypted Filesystems'
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17517: [MINOR][DOCS] Replace non-breaking space to norma...

2017-04-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17517#discussion_r109336475
  
--- Diff: README.md ---
@@ -97,7 +97,7 @@ building for particular Hive and Hive Thriftserver 
distributions.
 Please refer to the [Configuration 
Guide](http://spark.apache.org/docs/latest/configuration.html)
 in the online documentation for an overview on how to configure Spark.
 
-## Contributing
+## Contributing
--- End diff --

```python
>>> "## Contributing"
'##\xc2\xa0Contributing'
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17489: [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead...

2017-04-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17489
  
Thank you for asking about the details and merging it, @srowen.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17492: [SPARK-19641][SQL] JSON schema inference in DROPMALFORME...

2017-04-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17492
  
Thank you @cloud-fan.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17523: [SPARK-20064][PySpark]

2017-04-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17523
  
(it would be nicer if the title were fixed to indicate briefly what it 
proposes)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...

2017-04-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17469#discussion_r109574278
  
--- Diff: python/pyspark/sql/column.py ---
@@ -303,8 +333,25 @@ def isin(self, *cols):
 desc = _unary_op("desc", "Returns a sort expression based on the"
  " descending order of the given column name.")
 
-isNull = _unary_op("isNull", "True if the current expression is null.")
-isNotNull = _unary_op("isNotNull", "True if the current expression is 
not null.")
+_isNull_doc = ''' True if the current expression is null. Often 
combined with 
+  :func:`DataFrame.filter` to select rows with null 
values.
+
+  >>> df2.collect()
+  [Row(name=u'Tom', height=80), Row(name=u'Alice', 
height=None)]
+  >>> df2.filter( df2.height.isNull ).collect()
+  [Row(name=u'Alice', height=None)]
+  '''
+_isNotNull_doc = ''' True if the current expression is null. Often 
combined with 
--- End diff --

Up to my knowledge, both docstring styles comply with PEP 8,
```
""" ...
"""
```

or 

```
"""
...
"""
```

but in this case it is a separate variable. Personally, I prefer

```python
_isNull_doc = """
True if the current expression is null. Often combined with
:func:`DataFrame.filter` to select rows with null values.

>>> df2.collect()
[Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)]
>>> df2.filter( df2.height.isNull ).collect()
[Row(name=u'Alice', height=None)]
"""
```

but I could not find a formal reference to support this preference (for the case where 
it is a separate variable), and it is not up to me to decide this. So, I am fine either way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...

2017-04-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17469#discussion_r109574284
  
--- Diff: python/pyspark/sql/column.py ---
@@ -303,8 +333,25 @@ def isin(self, *cols):
 desc = _unary_op("desc", "Returns a sort expression based on the"
  " descending order of the given column name.")
 
-isNull = _unary_op("isNull", "True if the current expression is null.")
-isNotNull = _unary_op("isNotNull", "True if the current expression is 
not null.")
+_isNull_doc = ''' True if the current expression is null. Often 
combined with 
--- End diff --

I just found a good reference in pep8

> For triple-quoted strings, always use double quote characters to be 
consistent with the docstring convention in PEP 257

https://www.python.org/dev/peps/pep-0008/#string-quotes


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17469: [SPARK-20132][Docs] Add documentation for column string ...

2017-04-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17469
  
It might be better to run `./dev/lint-python` locally if possible. That 
will catch more minor nits ahead of time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...

2017-04-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17469#discussion_r109575023
  
--- Diff: python/pyspark/sql/column.py ---
@@ -250,11 +250,39 @@ def __iter__(self):
 raise TypeError("Column is not iterable")
 
 # string methods
+_rlike_doc = """ Return a Boolean :class:`Column` based on a regex 
match.\n
--- End diff --

Could you maybe give this patch a shot:
https://github.com/map222/spark/compare/patterson-documentation...HyukjinKwon:rlike-docstring.patch
?

I double-checked that it produces 

![2017-04-04 1 23 
30](https://cloud.githubusercontent.com/assets/6477701/24641412/84765e9c-193a-11e7-85d5-9745ea151c12.png)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17469: [SPARK-20132][Docs] Add documentation for column ...

2017-04-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17469#discussion_r109575589
  
--- Diff: python/pyspark/sql/column.py ---
@@ -303,8 +333,25 @@ def isin(self, *cols):
 desc = _unary_op("desc", "Returns a sort expression based on the"
  " descending order of the given column name.")
 
-isNull = _unary_op("isNull", "True if the current expression is null.")
-isNotNull = _unary_op("isNotNull", "True if the current expression is 
not null.")
+_isNull_doc = ''' True if the current expression is null. Often 
combined with 
+  :func:`DataFrame.filter` to select rows with null 
values.
+
+  >>> df2.collect()
+  [Row(name=u'Tom', height=80), Row(name=u'Alice', 
height=None)]
+  >>> df2.filter( df2.height.isNull ).collect()
+  [Row(name=u'Alice', height=None)]
+  '''
+_isNotNull_doc = ''' True if the current expression is null. Often 
combined with 
--- End diff --

^ cc @holdenk


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17528: [MINOR][R] Reorder `Collate` fields in DESCRIPTIO...

2017-04-04 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17528

[MINOR][R] Reorder `Collate` fields in DESCRIPTION file

## What changes were proposed in this pull request?

It seems the CRAN check script corrects `R/pkg/DESCRIPTION` and reorders the 
files in the `Collate` field.

This PR proposes to fix `catalog.R`'s order so that running this script 
does not produce a diff in this file every time.

## How was this patch tested?

Manually via `./R/check-cran.sh`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark minor-reorder-description

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17528.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17528


commit 64035148e6944724d604589cfdeb1824409f6372
Author: hyukjinkwon 
Date:   2017-04-04T14:55:50Z

[MINOR][R] Reorder `Collate` fields in DESCRIPTION file




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17528: [MINOR][R] Reorder `Collate` fields in DESCRIPTION file

2017-04-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17528
  
@felixcheung, it is probably not worth being a separate PR. I am fine if 
you add this to any of your PRs that is going to be merged soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-04-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r109826847
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -386,7 +386,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder {
 "LOCATION and 'path' in OPTIONS are both used to indicate the 
custom table path, " +
   "you can only specify one of them.", ctx)
 }
-val customLocation = storage.locationUri.orElse(location)
+val customLocation = 
storage.locationUri.orElse(location.map(CatalogUtils.stringToURI(_)))
--- End diff --

It seems this has a double-de/encoding problem when the input path is a 
URI.

```scala
scala> new org.apache.hadoop.fs.Path(new java.io.File("a b").toURI.toString).toUri.toString
res1: String = file:/.../a%2520b
```

A space character in a URI is encoded as `%20`, and the `%` character is encoded 
as `%25`. The URI already has a `%20` in it, and it gets URL-encoded again, turning `%20` 
into `%2520`.
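
As a side note, here is a hedged sketch of one way to avoid the second round of encoding, by using `Path`'s `URI` constructor instead of the URI's string form (illustration only):

```scala
import org.apache.hadoop.fs.Path

val uri = new java.io.File("a b").toURI      // file:/.../a%20b
new Path(uri.toString).toUri.toString        // file:/.../a%2520b  (encoded twice)
new Path(uri).toUri.toString                 // file:/.../a%20b    (kept as-is)
```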



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-04-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r109826890
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -386,7 +386,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder {
 "LOCATION and 'path' in OPTIONS are both used to indicate the 
custom table path, " +
   "you can only specify one of them.", ctx)
 }
-val customLocation = storage.locationUri.orElse(location)
+val customLocation = 
storage.locationUri.orElse(location.map(CatalogUtils.stringToURI(_)))
--- End diff --

Also, it seems we have a util similar to `CatalogUtils.stringToURI` in 
`org.apache.spark.util.Utils.resolveURI`. Could we consolidate them? I did not 
replace it when I identified this, because some existing tests in 
`org.apache.spark.util.UtilsSuite` broke, but I guess it would be fine if 
there is a coherent reason. Those broken cases might be bugs.

@windpiger, could you double-check my comments and open a follow-up if I am 
not wrong?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-04-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r109827045
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -386,7 +386,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder {
 "LOCATION and 'path' in OPTIONS are both used to indicate the 
custom table path, " +
   "you can only specify one of them.", ctx)
 }
-val customLocation = storage.locationUri.orElse(location)
+val customLocation = 
storage.locationUri.orElse(location.map(CatalogUtils.stringToURI(_)))
 
--- End diff --

Also, it seems we have a util similar to `CatalogUtils.stringToURI` in 
`org.apache.spark.util.Utils.resolveURI`. Could we consolidate them? I did not 
replace it when I identified this, because some existing tests in 
`org.apache.spark.util.UtilsSuite` broke, but I guess it would be fine if 
there is a coherent reason. Those broken cases might be bugs.

@windpiger, could you double-check my comments and open a follow-up if I am 
not wrong?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-04-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r109827042
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -386,7 +386,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder {
 "LOCATION and 'path' in OPTIONS are both used to indicate the 
custom table path, " +
   "you can only specify one of them.", ctx)
 }
-val customLocation = storage.locationUri.orElse(location)
+val customLocation = 
storage.locationUri.orElse(location.map(CatalogUtils.stringToURI(_)))
 
--- End diff --

It seems this has a double-de/encoding problem when the input path is a 
URI.

```scala
scala> new org.apache.hadoop.fs.Path(new java.io.File("a b").toURI.toString).toUri.toString
res1: String = file:/.../a%2520b
```

A space character in a URI is encoded as `%20`, and the `%` character is encoded 
as `%25`. The URI already has a `%20` in it, and it gets URL-encoded again, turning `%20` 
into `%2520`.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-04-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r109827145
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -386,7 +386,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder {
 "LOCATION and 'path' in OPTIONS are both used to indicate the 
custom table path, " +
   "you can only specify one of them.", ctx)
 }
-val customLocation = storage.locationUri.orElse(location)
+val customLocation = 
storage.locationUri.orElse(location.map(CatalogUtils.stringToURI(_)))
 
--- End diff --

> I did not replace it when I identified this because some existing tests 
in org.apache.spark.util.UtilsSuite were broken but I guess it would be fine if 
there is a coherent reason. These broken cases might be bugs.

FYI.. cc @sarutak 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-04-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r109854005
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -386,7 +386,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder {
 "LOCATION and 'path' in OPTIONS are both used to indicate the 
custom table path, " +
   "you can only specify one of them.", ctx)
 }
-val customLocation = storage.locationUri.orElse(location)
+val customLocation = 
storage.locationUri.orElse(location.map(CatalogUtils.stringToURI(_)))
 
--- End diff --

Hm.. is the input path always expected to be a path? I thought we support 
both URI and path forms.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-04-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r109863558
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -386,7 +386,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder {
 "LOCATION and 'path' in OPTIONS are both used to indicate the 
custom table path, " +
   "you can only specify one of them.", ctx)
 }
-val customLocation = storage.locationUri.orElse(location)
+val customLocation = 
storage.locationUri.orElse(location.map(CatalogUtils.stringToURI(_)))
 
--- End diff --

Up to my knowledge, HDFS's fully qualified path is in URI form. If we do not 
support this, that virtually means we are going to disallow fully qualified 
paths. I understand it sounds ambiguous, but disallowing them does not look like a good 
solution.

Also, if users want to access local files or S3 while the default 
scheme is `hdfs`, I guess that requires changing the default scheme every time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-04-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r109865610
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -386,7 +386,7 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder {
 "LOCATION and 'path' in OPTIONS are both used to indicate the 
custom table path, " +
   "you can only specify one of them.", ctx)
 }
-val customLocation = storage.locationUri.orElse(location)
+val customLocation = 
storage.locationUri.orElse(location.map(CatalogUtils.stringToURI(_)))
 
--- End diff --

In addition, I guess we already have a lot of tests with URI input paths, and 
I changed many of them to make the tests pass on Windows, which I guess implicitly means some 
committers do not disagree with this.

IMHO, URIs should be supported correctly first, and the local path form 
should be supported as an abbreviation, because a local path is an 
abbreviation of the fully qualified path.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17466: Added getter for impurityStats

2017-04-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17466
  
If you think that is the right JIRA, fix the title to `[SPARK-14681][ML] Added 
getter for impurityStats`. That would link to the JIRA automatically.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-04-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r109892051
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -285,7 +285,7 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
 // compatible format, which means the data source is file-based 
and must have a `path`.
 require(table.storage.locationUri.isDefined,
   "External file-based data source table must have a `path` entry 
in storage properties.")
-Some(new Path(table.location).toUri.toString)
--- End diff --

I see, it seems the problem itself existed before.

I found this while running related tests on Windows, and it looks related 
to this PR (the special character stuff), so I think this is a proper JIRA to open 
a follow-up against, with a PR to note this.

It seems we do have tests that use URIs already, whether that was mistakenly 
supported or not. Should we ask on the dev mailing list? I think this is an 
important decision.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...

2017-04-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17527
  
I support this idea in general. I can at least identify several references, 
for example, 
https://hibernate.atlassian.net/plugins/servlet/mobile#issue/HHH-9722, 
`https://github.com/hibernate/hibernate-orm/pull/931` and 
https://www.google.co.kr/amp/s/garygregory.wordpress.com/2015/11/03/java-lowercase-conversion-turkey/amp/

Let me investigate possible downsides within today.
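
For reference, a minimal sketch of the Turkish-locale pitfall those references describe (illustration only, not Spark code):

```scala
import java.util.Locale

val turkish = new Locale("tr", "TR")
// Under the Turkish locale, the uppercase 'I' lowercases to the dotless 'ı'.
println("INTERVAL".toLowerCase(turkish))      // ınterval
// Locale.ROOT keeps the result locale-independent.
println("INTERVAL".toLowerCase(Locale.ROOT))  // interval
```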


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-04-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r110060244
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -285,7 +285,7 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
 // compatible format, which means the data source is file-based 
and must have a `path`.
 require(table.storage.locationUri.isDefined,
   "External file-based data source table must have a `path` entry 
in storage properties.")
-Some(new Path(table.location).toUri.toString)
--- End diff --

Thank you for confirming. Let me send a mail soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110064613
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -407,7 +408,7 @@ public UTF8String toLowerCase() {
   }
 
   private UTF8String toLowerCaseSlow() {
-return fromString(toString().toLowerCase());
+return fromString(toString().toLowerCase(Locale.ROOT));
--- End diff --

It seems there are a few cases exposed to users. For example, this seems to be used 
in the `Lower` and `InitCap` expressions, where, up to my knowledge, the lowercased 
values are exposed to users.

In this case, IIRC the same argument (a consistent base plus options for 
variants) was applied in 
[SPARK-18076](https://issues.apache.org/jira/browse/SPARK-18076). So, I think 
it might be fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...

2017-04-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17527
  
@viirya, I think it is possible, in the case of `Lower`, `Upper` and `InitCap` 
as an example, maybe.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...

2017-04-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17527
  
Yea, that's the concern. The downside is when these are exposed to users. 
However, it might be an advantage as well: the behavior no longer depends on 
the default JVM locale and is consistent. I think the argument is the same as in 
[SPARK-18076](https://issues.apache.org/jira/browse/SPARK-18076).

If it is easy to leave out the exposed ones, I think that's also an option.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String t...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17527
  
Thank you for clarifying it, @srowen and @viirya . I am okay with it too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r110114252
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -285,7 +285,7 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
 // compatible format, which means the data source is file-based 
and must have a `path`.
 require(table.storage.locationUri.isDefined,
   "External file-based data source table must have a `path` entry 
in storage properties.")
-Some(new Path(table.location).toUri.toString)
--- End diff --

Actually, I did a bit more searching into what `org.apache.hadoop.fs.Path` 
expects. It seems the 
[documentation](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/Path.html)
 says:

> Construct a path from a String. Path strings are URIs, but with unescaped 
elements and some additional normalization.

It seems those strings are expected to be unescaped. So, it seems we 
support URIs with unescaped characters, a behavior inherited from `Path`.

I want to be sure about this because I have fixed many tests to use URIs so 
they pass on Windows, and I am about to fix more of them in the same way. 
@steveloughran, do you mind if I cc you and ask you to take a look and help 
confirm this, please? I don't know anyone who knows Hadoop better.
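
To be concrete, the kind of quick check I have in mind is below (a sketch only; 
I have not run it here, and the expected outputs are my assumption from the 
documentation above):

```scala
scala> import org.apache.hadoop.fs.Path

scala> new Path("/tmp/dir with space/part-0").toUri.toString
// expecting: /tmp/dir%20with%20space/part-0 (the unescaped input gets escaped by toUri)

scala> new Path("/tmp/dir%20with%20space/part-0").toUri.toString
// expecting: /tmp/dir%2520with%2520space/part-0 (an already-escaped string gets escaped again)
```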
 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17315: [SPARK-19949][SQL] unify bad record handling in CSV and ...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17315
  
Currently, the corrupt record field has to be set explicitly, if I haven't 
missed some changes in the related code path. Please refer to the test here - 
https://github.com/apache/spark/blob/364b0db75308ddd346b4ab1e032680e8eb4c1753/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala#L1007-L1016

This PR is about restructuring the logic for handling corrupt records, so 
ideally it does not introduce a behavior change.
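
For reference, setting it explicitly looks roughly like the linked test (the 
column name, path and schema below are only illustrative):

```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("a", IntegerType)
  .add("_corrupt_record", StringType)  // the corrupt record column is declared explicitly

val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("/path/to/malformed.csv")
```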


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110315394
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala
 ---
@@ -82,8 +84,8 @@ case class OptimizeMetadataOnlyQuery(
   private def getPartitionAttrs(
   partitionColumnNames: Seq[String],
   relation: LogicalPlan): Seq[Attribute] = {
-val partColumns = partitionColumnNames.map(_.toLowerCase).toSet
-relation.output.filter(a => partColumns.contains(a.name.toLowerCase))
+val partColumns = 
partitionColumnNames.map(_.toLowerCase(Locale.ROOT)).toSet
+relation.output.filter(a => 
partColumns.contains(a.name.toLowerCase(Locale.ROOT)))
--- End diff --

I am a little bit worried about this change as well. For example,

Before

```scala
scala> java.util.Locale.setDefault(new java.util.Locale("tr"))

scala>  val partColumns = Seq("I").map(_.toLowerCase).toSet
partColumns: scala.collection.immutable.Set[String] = Set(ı)

scala> Seq("a", "ı", "I").filter(a => partColumns.contains(a.toLowerCase))
res13: Seq[String] = List(ı, I)
```

After

```scala
scala> val partColumns = 
Seq("I").map(_.toLowerCase(java.util.Locale.ROOT)).toSet
partColumns: scala.collection.immutable.Set[String] = Set(i)

scala> Seq("a", "ı", "I").filter(a => 
partColumns.contains(a.toLowerCase(java.util.Locale.ROOT)))
res14: Seq[String] = List(I)

```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110314541
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CaseInsensitiveMap.scala
 ---
@@ -26,11 +28,12 @@ package org.apache.spark.sql.catalyst.util
 class CaseInsensitiveMap[T] private (val originalMap: Map[String, T]) 
extends Map[String, T]
   with Serializable {
 
-  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = 
kv._1.toLowerCase))
+  val keyLowerCasedMap = originalMap.map(kv => kv.copy(_1 = 
kv._1.toLowerCase(Locale.ROOT)))
--- End diff --

Maybe this is nitpicking, and it is rarely possible I guess. However, to my 
knowledge, this will affect the options users set via `spark.read.option(...)`. 
Namely, I think a case like the one below is possible:

```scala
scala> java.util.Locale.setDefault(new java.util.Locale("tr"))

scala> val originalMap = Map("ı" -> 1, "I" -> 2)
originalMap: scala.collection.immutable.Map[String,Int] = Map(ı -> 1, I -> 
2)
```

Before

```scala
scala> originalMap.map(kv => kv.copy(_1 = kv._1.toLowerCase))
res6: scala.collection.immutable.Map[String,Int] = Map(ı -> 2)
```

After

```scala
scala> originalMap.map(kv => kv.copy(_1 = 
kv._1.toLowerCase(java.util.Locale.ROOT)))
res7: scala.collection.immutable.Map[String,Int] = Map(ı -> 1, i -> 2)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110317272
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala
 ---
@@ -52,7 +54,11 @@ case class HadoopFsRelation(
 
   val schema: StructType = {
 val getColName: (StructField => String) =
-  if (sparkSession.sessionState.conf.caseSensitiveAnalysis) _.name 
else _.name.toLowerCase
+  if (sparkSession.sessionState.conf.caseSensitiveAnalysis) {
+_.name
+  } else {
+_.name.toLowerCase(Locale.ROOT)
+  }
--- End diff --

I think we should leave this one out. It seems `dataSchema` means the schema 
from the source, which is exposed to users. I think this could cause a problem, 
for example as below:

Before

```scala
import collection.mutable

import org.apache.spark.sql.types._

java.util.Locale.setDefault(new java.util.Locale("tr"))

val partitionSchema: StructType = StructType(StructField("I", StringType) 
:: Nil)
val dataSchema: StructType = StructType(StructField("ı", StringType) :: 
Nil)

val getColName: (StructField => String) = _.name.toLowerCase

val overlappedPartCols = mutable.Map.empty[String, StructField]
partitionSchema.foreach { partitionField =>
  if (dataSchema.exists(getColName(_) == getColName(partitionField))) {
overlappedPartCols += getColName(partitionField) -> partitionField
  }
}

val schema = StructType(dataSchema.map(f => 
overlappedPartCols.getOrElse(getColName(f), f)) ++
  partitionSchema.filterNot(f => 
overlappedPartCols.contains(getColName(f

schema.fieldNames
```

prints

```scala
Array[String] = Array(I)
```

After

```scala
import collection.mutable

import org.apache.spark.sql.types._

java.util.Locale.setDefault(new java.util.Locale("tr"))

val partitionSchema: StructType = StructType(StructField("I", StringType) 
:: Nil)
val dataSchema: StructType = StructType(StructField("ı", StringType) :: 
Nil)

val getColName: (StructField => String) = 
_.name.toLowerCase(java.util.Locale.ROOT)

val overlappedPartCols = mutable.Map.empty[String, StructField]
partitionSchema.foreach { partitionField =>
  if (dataSchema.exists(getColName(_) == getColName(partitionField))) {
overlappedPartCols += getColName(partitionField) -> partitionField
  }
}

val schema = StructType(dataSchema.map(f => 
overlappedPartCols.getOrElse(getColName(f), f)) ++
  partitionSchema.filterNot(f => 
overlappedPartCols.contains(getColName(f

schema.fieldNames
```

prints

```scala
Array[String] = Array(ı, I)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110317441
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
 ---
@@ -128,7 +128,8 @@ object PartitioningUtils {
   //   "hdfs://host:9000/invalidPath"
   //   "hdfs://host:9000/path"
   // TODO: Selective case sensitivity.
-  val discoveredBasePaths = 
optDiscoveredBasePaths.flatten.map(_.toString.toLowerCase())
+  val discoveredBasePaths =
+  
optDiscoveredBasePaths.flatten.map(_.toString.toLowerCase(Locale.ROOT))
--- End diff --

I am worried about this one too. It sounds like the path could contain Turkish 
characters, I guess.
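
For example (with a made-up base path):

```scala
scala> java.util.Locale.setDefault(new java.util.Locale("tr"))

scala> Seq("hdfs://host:9000/ISTANBUL").map(_.toLowerCase)
res0: Seq[String] = List(hdfs://host:9000/ıstanbul)

scala> Seq("hdfs://host:9000/ISTANBUL").map(_.toLowerCase(java.util.Locale.ROOT))
res1: Seq[String] = List(hdfs://host:9000/istanbul)
```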


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110314669
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringKeyHashMap.scala
 ---
@@ -25,7 +27,7 @@ object StringKeyHashMap {
   def apply[T](caseSensitive: Boolean): StringKeyHashMap[T] = if 
(caseSensitive) {
 new StringKeyHashMap[T](identity)
   } else {
-new StringKeyHashMap[T](_.toLowerCase)
+new StringKeyHashMap[T](_.toLowerCase(Locale.ROOT))
--- End diff --

This only seems to be used in `SimpleFunctionRegistry`. I don't think we have 
Turkish characters in function names, and I don't think users will use other 
languages in function names. So it is probably fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110298557
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala
 ---
@@ -396,7 +397,7 @@ object PartitioningAwareFileIndex extends Logging {
   sessionOpt: Option[SparkSession]): Seq[FileStatus] = {
 logTrace(s"Listing $path")
 val fs = path.getFileSystem(hadoopConf)
-val name = path.getName.toLowerCase
+val name = path.getName.toLowerCase(Locale.ROOT)
--- End diff --

(This variable seems to be unused.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110317695
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala 
---
@@ -222,7 +225,7 @@ case class PreprocessTableCreation(sparkSession: 
SparkSession) extends Rule[Logi
 val columnNames = if 
(sparkSession.sessionState.conf.caseSensitiveAnalysis) {
   schema.map(_.name)
 } else {
-  schema.map(_.name.toLowerCase)
+  schema.map(_.name.toLowerCase(Locale.ROOT))
--- End diff --

Maybe it is not good to point out all of the similar instances, but let me just 
point this one out as the change looks big. This may be a similar instance to 
https://github.com/apache/spark/pull/17527/files#r110317272.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17527: [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java S...

2017-04-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17527#discussion_r110317549
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala
 ---
@@ -328,7 +329,7 @@ object PartitioningUtils {
 } else {
   // TODO: Selective case sensitivity.
   val distinctPartColNames =
-
pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct
+
pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase(Locale.ROOT))).distinct
--- End diff --

I think this might cause a similar problem to 
https://github.com/apache/spark/pull/17527/files#r110317272.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r10990
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2969,11 +2969,27 @@ object functions {
   }
 
   /**
-   * (Java-specific) Parses a column containing a JSON string into a 
`StructType` with the
+   * (Scala-specific) Parses a column containing a JSON array string into 
a `ArrayType` with the
* specified schema. Returns `null`, in the case of an unparseable 
string.
*
-   * @param e a string column containing JSON data.
-   * @param schema the schema to use when parsing the json string
+   * @param e a string column containing JSON array data.
+   * @param schema the schema to use when parsing the json array string
+   * @param options options to control how the json is parsed. accepts the 
same options and the
+   *json data source.
+   *
+   * @group collection_funcs
+   * @since 2.2.0
+   */
+  def from_json(e: Column, schema: ArrayType, options: Map[String, 
String]): Column = withExpr {
--- End diff --

I thought changing `StructType` to `DataType` breaks binary compatibility, as 
the method signature changes and an app compiled against 2.1.0 would not run on 
2.2.0.
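
To illustrate the concern, here is a sketch only (the forwarding overload is 
just one common way to keep the old signature around, not necessarily what this 
PR should end up with):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.{DataType, StructType}

object FromJsonSketch {
  // 2.1.0-era signature: apps compiled against it reference a method descriptor
  // that mentions StructType explicitly.
  def from_json(e: Column, schema: StructType, options: Map[String, String]): Column = {
    val widened: DataType = schema   // dispatch to the DataType overload below
    from_json(e, widened, options)
  }

  // Widening the parameter to DataType changes the descriptor, so an app compiled
  // against the old signature fails at runtime with NoSuchMethodError unless the
  // overload above is kept as well.
  def from_json(e: Column, schema: DataType, options: Map[String, String]): Column =
    ???  // the real parsing logic lives in Spark's functions.scala; omitted here
}
```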


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


