[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

2018-07-14 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/18139
  
Hi @holdenk, yup, I still have a to-do to address it in a clean way; I got 
sidetracked with other things. Helping Python since the 18th century!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21566: [Python] Fix typo in serializer exception

2018-06-14 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/21566
  
@HyukjinKwon sure, no problem. I'm reading through pieces of the source for a 
presentation on PySpark and this one jumped out quickly. I'll try to collect a 
few more 👍 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21566: [Python] Fix typo in serializer exception

2018-06-14 Thread rberenguel
GitHub user rberenguel opened a pull request:

https://github.com/apache/spark/pull/21566

[Python] Fix typo in serializer exception

## What changes were proposed in this pull request?

Fix typo in exception raised in Python serializer

## How was this patch tested?

No functional changes; only the text of an exception message was corrected, so 
no new tests were added.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rberenguel/spark fix_typo_pyspark_serializers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21566.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21566


commit bd8627c879131a4af364ef667f4ac1209ea6909a
Author: Ruben Berenguel Montoro 
Date:   2018-06-14T11:11:14Z

Fix typo in serializer exception




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...

2018-05-06 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/20949
  
I've been taking a look at this PR (I've hit this problem in the past and had 
a chat with @crafty-coder about it and his fixes, too). Is there anything we 
could do to move it forward?

Also, is there any way to trigger a rebuild on Jenkins without adding a dummy 
commit? It looks like the JVM on this test run blew the heap, so a re-run 
should be enough (cc @holdenk @HyukjinKwon )


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

2017-10-30 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/18139
  
Yes, most solutions are concerned only with strftime (virtualtime and 
matplotlib above). Since calendar.timegm can handle this better (or at least 
for pre-1900 dates), it is just a matter of not messing up horribly with how 
dates work (it's also problematic in Java, after all)
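
As a minimal sketch of the approach being discussed (my own simplification, not 
the exact patch): route pre-1900 naive datetimes through `calendar.timegm`, 
which works on a plain struct_time, and keep the `time.mktime` path for 
everything else:

```python
import calendar
import time
from datetime import datetime


def to_epoch_seconds(dt):
    # Sketch only: tz-aware datetimes (and, here, anything before 1900) go
    # through calendar.timegm; everything else keeps the original local-time
    # time.mktime path. The real patch uses a slightly different year cutoff.
    if dt.tzinfo is not None or dt.year < 1900:
        return calendar.timegm(dt.utctimetuple())
    return time.mktime(dt.timetuple())


print(to_epoch_seconds(datetime(1750, 6, 13)))   # pre-1900: timegm path
print(to_epoch_seconds(datetime(2017, 10, 30)))  # modern date: mktime path
```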


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

2017-10-30 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/18139
  
@HyukjinKwon I would definitely not mind; I'm actually curious to see how 
they do it. I also saw it's not that big (and we don't need all the 
functionality). I'll extract the pieces we need for our case.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

2017-10-30 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/18139
  
For instance, [matplotlib handles strftime as a special case for 
pre-1900](https://github.com/matplotlib/matplotlib/blob/c166fa1afc40cf5fc211b584f8abc107744e9417/lib/matplotlib/dates.py#L573)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

2017-10-30 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/18139
  
More ideas welcome indeed. We could just replicate what others are doing to 
fix the problem, if we can't introduce external dependencies.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

2017-10-30 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/18139
  
@holdenk @ueshin Digging a bit into other solutions to this problem (it's not 
specifically Spark-related when you dig deeper, so other libraries have had to 
find their own way to fix it), maybe adding [this 
dependency](https://pypi.python.org/pypi/virtualtime) might be the best way to 
fix it for good (while also not reinventing wheels that already exist in the 
Python ecosystem)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17100: [SPARK-13947][SQL] The error message from using an inval...

2017-10-25 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/17100
  
Thanks @gatorsmile @holdenk @viirya @wzhfy for all the help


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-24 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146667191
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +270,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter { missing =>
+  o.inputSet.exists(input => resolver(missing.name, 
input.name))
+}
+val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+  val sameNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""Attribute(s) with the same name appear in the operation: 
`$sameNames`.
+  |Please check if the right attribute(s) are 
used.""".stripMargin
+} else {
+  ""
+}
+
 val missingAttributes = o.missingInput.mkString(",")
 val input = o.inputSet.mkString(",")
 
-failAnalysis(
-  s"resolved attribute(s) $missingAttributes missing from 
$input " +
-s"in operator ${operator.simpleString}")
+val msg = s"""Resolved attribute(s) $missingAttributes missing 
from $input
+  |in operator 
${operator.simpleString}.""".stripMargin
+
+failAnalysis(if (repeatedNameHint.nonEmpty) msg + "\n" + 
repeatedNameHint else msg)
--- End diff --

Yup, thanks, it is more orderly this way


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-24 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146667141
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +270,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter { missing =>
+  o.inputSet.exists(input => resolver(missing.name, 
input.name))
+}
+val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+  val sameNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""Attribute(s) with the same name appear in the operation: 
`$sameNames`.
--- End diff --

Ok


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-24 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146625735
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +270,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter { missing =>
+  o.inputSet.exists(input => resolver(missing.name, 
input.name))
+}
+val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+  val sameNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""Attribute(s) with the same name appear in the operation: 
`$sameNames`.
--- End diff --

You mean wrapping each column name in backticks?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-24 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146544863
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +270,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter { missing =>
+  o.inputSet.exists(input => resolver(missing.name, 
input.name))
+}
+val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+  val sameNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""Attribute(s) with the same name appear in the operation: 
`$sameNames`.
+  |Please check if the right attribute(s) are 
used.""".stripMargin
--- End diff --

I don't have a strong case for having two lines here, and I see the point of 
keeping it as one line. I guess the best way to have it as one line is 
splitting it into 
```scala
val attributeRepetitionMsg = s"Attribute(s) with the same name appear in the operation: `$sameNames`"
val checkAttributesMsg = s"Please check if the right attribute(s) are used."
```
to avoid hitting the 100-character line-length lint rule?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-24 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146469410
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 ---
@@ -408,16 +408,23 @@ class AnalysisErrorSuite extends AnalysisTest {
 // CheckAnalysis should throw AnalysisException when Aggregate 
contains missing attribute(s)
 // Since we manually construct the logical plan at here and Sum only 
accept
 // LongType, DoubleType, and DecimalType. We use LongType as the type 
of a.
-val plan =
-  Aggregate(
+val attrA = AttributeReference("a", LongType)(exprId = ExprId(1))
+val otherA = AttributeReference("a", LongType)(exprId = ExprId(2))
+val bAlias = Alias(sum(attrA), "b")() :: Nil
+val plan = Aggregate(
 Nil,
-Alias(sum(AttributeReference("a", LongType)(exprId = ExprId(1))), 
"b")() :: Nil,
-LocalRelation(
-  AttributeReference("a", LongType)(exprId = ExprId(2
+bAlias,
+LocalRelation(otherA))
 
 assert(plan.resolved)
 
-assertAnalysisError(plan, "resolved attribute(s) a#1L missing from 
a#2L" :: Nil)
+val errorMsg = s"""Resolved attribute(s) ${attrA.toString} missing 
from ${otherA.toString}
+ |in operator !Aggregate [${bAlias.mkString("#")}].
+ |Please check attribute(s) `a`, they seem to appear 
in two
+ |different input operators, with the same 
name.""".stripMargin
+
+
--- End diff --

Fixed


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-24 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146469160
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 ---
@@ -408,16 +408,23 @@ class AnalysisErrorSuite extends AnalysisTest {
 // CheckAnalysis should throw AnalysisException when Aggregate 
contains missing attribute(s)
 // Since we manually construct the logical plan at here and Sum only 
accept
 // LongType, DoubleType, and DecimalType. We use LongType as the type 
of a.
-val plan =
-  Aggregate(
+val attrA = AttributeReference("a", LongType)(exprId = ExprId(1))
+val otherA = AttributeReference("a", LongType)(exprId = ExprId(2))
+val bAlias = Alias(sum(attrA), "b")() :: Nil
+val plan = Aggregate(
--- End diff --

Added another attribute


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-24 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146469083
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +270,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter { missing =>
+  o.inputSet.exists(input => resolver(missing.name, 
input.name))
+}
+val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""|Please check attribute(s) `$commonNames`, they seem to 
appear in two
+  |different input operators, with the same 
name.""".stripMargin
--- End diff --

I have changed it; it now looks better no matter how many attributes appear


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-24 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146468977
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +270,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter { missing =>
+  o.inputSet.exists(input => resolver(missing.name, 
input.name))
+}
+val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
--- End diff --

Done


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146252936
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +270,27 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(missing =>
+  o.inputSet.exists(input => resolver(missing.name, 
input.name)))
--- End diff --

Thanks @wzhfy 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146234832
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +270,27 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(missing =>
+  o.inputSet.exists(input => resolver(missing.name, 
input.name)))
+val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""|Please check attribute(s) `$commonNames`, they seem to 
appear in two
+  |different input operators, with the same 
name.""".stripMargin
+} else {
+  ""
+}
+
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
 
 failAnalysis(
-  s"resolved attribute(s) $missingAttributes missing from 
$input " +
-s"in operator ${operator.simpleString}")
+  s"""Some resolved attribute(s) are not present among the 
available attributes
+|for a query.
+|$missingAttributes is not in $availableAttributes.
+|$repeatedNameHint
+|The failed query was for operator
+|${operator.simpleString}""".stripMargin)
--- End diff --

Hmmm, seems reasonable indeed, and it avoids that horrible trailing newline. 
Thanks a lot :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146234085
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +270,27 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(missing =>
+  o.inputSet.exists(input => resolver(missing.name, 
input.name)))
+val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""|Please check attribute(s) `$commonNames`, they seem to 
appear in two
+  |different input operators, with the same 
name.""".stripMargin
+} else {
+  ""
+}
+
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
--- End diff --

Okies


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146233808
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +270,27 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(missing =>
+  o.inputSet.exists(input => resolver(missing.name, 
input.name)))
--- End diff --

Done. Do you know if there is any style rule for when to use parentheses vs 
braces in this kind of lambda function?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146211895
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +270,27 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(missing =>
+  o.inputSet.exists(input => resolver(missing.name, 
input.name)))
+val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""|Please check attribute(s) `$commonNames`, they seem to 
appear in two
+  |different input operators, with the same 
name.""".stripMargin
+} else {
+  ""
+}
+
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
 
 failAnalysis(
-  s"resolved attribute(s) $missingAttributes missing from 
$input " +
-s"in operator ${operator.simpleString}")
+  s"""Some resolved attribute(s) are not present among the 
available attributes
+|for a query.
+|$missingAttributes is not in $availableAttributes.
+|$repeatedNameHint
+|The failed query was for operator
+|${operator.simpleString}""".stripMargin)
--- End diff --

This is good for me; having the `repeatedNameHint` is the whole point, in 
the end. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146211291
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (attrsWithSameName.size > 0) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""\n|Attribute(s) `$commonNames` seem to appear in two
+  |different datasets, with the same name."""
+} else {
+  ""
+}
 
 failAnalysis(
-  s"resolved attribute(s) $missingAttributes missing from 
$input " +
-s"in operator ${operator.simpleString}")
+  s"""Some resolved attribute(s) are not present among the 
available attributes
+|for a query.
+|$missingAttributes is not in 
$availableAttributes.$repeatedNameHint
--- End diff --

Dang, I knew there was a reason for the weird newline back when I wrote 
this. I'm putting the `\n` back; do you have any idea for something that 
doesn't look as weird as this does, though?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146208143
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 ---
@@ -408,16 +408,28 @@ class AnalysisErrorSuite extends AnalysisTest {
 // CheckAnalysis should throw AnalysisException when Aggregate 
contains missing attribute(s)
 // Since we manually construct the logical plan at here and Sum only 
accept
 // LongType, DoubleType, and DecimalType. We use LongType as the type 
of a.
-val plan =
-  Aggregate(
+val attrA = AttributeReference("a", LongType)(exprId = ExprId(1))
+val aId = Array[String](attrA.name, attrA.exprId.id.toString)
+val otherA = AttributeReference("a", LongType)(exprId = ExprId(2))
+val otherAId = Array[String](otherA.name, otherA.exprId.id.toString)
+val bAlias = Alias(sum(attrA), "b")() :: Nil
+val plan = Aggregate(
 Nil,
-Alias(sum(AttributeReference("a", LongType)(exprId = ExprId(1))), 
"b")() :: Nil,
-LocalRelation(
-  AttributeReference("a", LongType)(exprId = ExprId(2
+bAlias,
+LocalRelation(otherA))
 
 assert(plan.resolved)
 
-assertAnalysisError(plan, "resolved attribute(s) a#1L missing from 
a#2L" :: Nil)
+val errorMsg = s"""Some resolved attribute(s) are not present among 
the available attributes
+ |for a query.
+ |${aId.mkString("#")}L is not in 
${otherAId.mkString("#")}L.
--- End diff --

Oh, I did it wrongly, so it did not work (I used aId.toString and got the 
unreadable Java array object representation instead of its contents). I'm 
changing it, and also regenerating the golden files, which seemed to be what 
was making the test fail


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146177319
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (attrsWithSameName.size > 0) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""\n|Attribute(s) `$commonNames` seem to appear in two
+  |different datasets, with the same name."""
+} else {
+  ""
+}
 
 failAnalysis(
-  s"resolved attribute(s) $missingAttributes missing from 
$input " +
-s"in operator ${operator.simpleString}")
+  s"""Some resolved attribute(s) are not present among the 
available attributes
+|for a query.
+|$missingAttributes is not in 
$availableAttributes.$repeatedNameHint
--- End diff --

Oh, thanks. I didn't see it


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146177284
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -24,6 +24,7 @@ import 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
 import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.internal.SQLConf
--- End diff --

Thanks


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146177263
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (attrsWithSameName.size > 0) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""\n|Attribute(s) `$commonNames` seem to appear in two
+  |different datasets, with the same name."""
--- End diff --

I have changed it. I agree that "datasets" may not be the best word for it


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146177196
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (attrsWithSameName.size > 0) {
--- End diff --

I don't understand where I should move `missing` or `available`; aren't they 
already right after `attrsWithSameName`? But I moved it after 
`repeatedNameHint`, since it's only needed for the `failAnalysis` argument


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146176760
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
--- End diff --

Thanks


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146176626
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -24,6 +24,7 @@ import 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
 import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.internal.SQLConf
--- End diff --

Thanks


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...

2017-10-23 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146176609
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = plan.conf.resolver
+val attrsWithSameName = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (attrsWithSameName.size > 0) {
+  val commonNames = attrsWithSameName.map(_.name).mkString(",")
+  s"""\n|Attribute(s) `$commonNames` seem to appear in two
+  |different datasets, with the same name."""
--- End diff --

What do you suggest instead?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17100: [SPARK-13947][SQL] The error message from using an inval...

2017-10-22 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/17100
  
Applied all changes @viirya & @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17100: [SPARK-13947][SQL] PySpark DataFrames: The error message...

2017-10-22 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/17100
  
@viirya Hi, I based the title of the PR on the original ticket 
(https://issues.apache.org/jira/browse/SPARK-13947), but indeed it has nothing 
to do with Python. I see now the ticket has been "moved" to SQL, so I'll 
rewrite the title when I send the fixes (it might take me a while to re-run 
the test suite) 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...

2017-10-22 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146128109
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +273,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = conf.resolver
+val commonAttrs = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (commonAttrs.size > 0) {
+  val commonNames = commonAttrs.map(_.name).mkString(",")
+  s"""\n|Observe that attribute(s) $commonNames appear in two
--- End diff --

I'll rewrite the wording a bit so it hints at this as the cause (it's the 
most common one, I suspect). The newline is for readability of the error 
message. I could remove it.
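
For context, a minimal PySpark sketch of the typical situation this hint 
targets (my own illustration, assuming a local SparkSession; the exact message 
and attribute ids differ by Spark version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("missing-attribute").getOrCreate()

df = spark.createDataFrame([(1,), (2,)], ["a"])
# A derived DataFrame whose column is also called "a", but is a *different* attribute.
df2 = df.select((df.a + 1).alias("a"))

# Filtering df2 with the column object taken from the original df references an
# attribute that is missing from df2's inputs, so analysis fails with the
# "resolved attribute(s) a#... missing from a#..." message this hint is appended to.
df2.filter(df.a > 1).show()
```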


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...

2017-10-22 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146128111
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +273,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = conf.resolver
+val commonAttrs = o.missingInput.filter(x =>
+  o.inputSet.exists(y => resolver(x.name, y.name)))
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val repeatedNameHint = if (commonAttrs.size > 0) {
+  val commonNames = commonAttrs.map(_.name).mkString(",")
+  s"""\n|Observe that attribute(s) $commonNames appear in two
+  |different datasets, with the same name"""
--- End diff --

Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...

2017-10-22 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146128114
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +273,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = conf.resolver
+val commonAttrs = o.missingInput.filter(x =>
--- End diff --

Indeed!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...

2017-10-22 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146128090
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -125,7 +128,7 @@ trait CheckAnalysis extends PredicateHelper {
 }
 
   case s: SubqueryExpression =>
-checkSubqueryExpression(operator, s)
+checkAnalysisWithConf(s.plan, conf)
--- End diff --

Yup, sorry. I was too eager to merge master back in


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...

2017-10-22 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146128092
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 ---
@@ -408,16 +408,28 @@ class AnalysisErrorSuite extends AnalysisTest {
 // CheckAnalysis should throw AnalysisException when Aggregate 
contains missing attribute(s)
 // Since we manually construct the logical plan at here and Sum only 
accept
 // LongType, DoubleType, and DecimalType. We use LongType as the type 
of a.
-val plan =
-  Aggregate(
+val attrA = AttributeReference("a", LongType)(exprId = ExprId(1))
+val aId = Array[String](attrA.name, attrA.exprId.id.toString)
+val otherA = AttributeReference("a", LongType)(exprId = ExprId(2))
+val otherAId = Array[String](otherA.name, otherA.exprId.id.toString)
+val bAlias = Alias(sum(attrA), "b")() :: Nil
+val plan = Aggregate(
 Nil,
-Alias(sum(AttributeReference("a", LongType)(exprId = ExprId(1))), 
"b")() :: Nil,
-LocalRelation(
-  AttributeReference("a", LongType)(exprId = ExprId(2
+bAlias,
+LocalRelation(otherA))
 
 assert(plan.resolved)
 
-assertAnalysisError(plan, "resolved attribute(s) a#1L missing from 
a#2L" :: Nil)
+val errorMsg = s"""Some resolved attribute(s) are not present among 
the available attributes
+ |for a query.
+ |${aId.mkString("#")}L is not in 
${otherAId.mkString("#")}L.
--- End diff --

Yes, thanks. I discovered this recently but didn't come back to this PR. 
Will fix


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...

2017-10-22 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r146128078
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 ---
@@ -270,12 +273,25 @@ trait CheckAnalysis extends PredicateHelper {
 
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+val resolver = conf.resolver
--- End diff --

Oh, much handier. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18164: [SPARK-19732][SQL][PYSPARK] Add fill functions for nulls...

2017-06-02 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/18164
  
@HyukjinKwon I changed it; does it look any clearer? I have always thought 
of `na` in terms of Python rather than R anyway :) 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18164: [SPARK-19732][SQL][PYSPARK] fillna bools

2017-06-02 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/18164
  
@ueshin @HyukjinKwon thanks for giving it a very thorough look, and sorry 
for my previous comment, which was terribly unclear. I was confused because the 
AppVeyor tick mark was green for commit 076ebed and I had run the tests 
locally (forgot linting, though), so I was pretty sure the test was right and I 
was confused about how the wrong subset still had a passing test. 

I probably skipped the wrong step when running the Python tests (I'm still 
figuring out which corners I can cut to avoid a full compile/build cycle for 
the whole project, which takes ages for me), so I didn't see my local test 
failing, but the remote one was more puzzling; I guess AppVeyor had a hiccup 
here. Sorry again for the confused and confusing statements above :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools

2017-06-02 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/18164#discussion_r119813043
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala ---
@@ -124,6 +134,13 @@ class DataFrameNaFunctionsSuite extends QueryTest with 
SharedSQLContext {
 Row("Nina") :: Row("Amy") :: Row("unknown") :: Nil)
 assert(input.na.fill("unknown").columns.toSeq === input.columns.toSeq)
 
+// boolean
+checkAnswer(
+  boolInput.na.fill(true).select("spy"),
+  Row(false) :: Row(true) :: Row(true) ::
--- End diff --

Sorry, what do you mean by inlined here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools

2017-06-02 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/18164#discussion_r119812943
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala ---
@@ -36,6 +36,15 @@ class DataFrameNaFunctionsSuite extends QueryTest with 
SharedSQLContext {
   ).toDF("name", "age", "height")
   }
 
+  def createBooleanDF(): DataFrame = {
--- End diff --

Yup, right. I added it on top to keep both together, but it's only used for 
the boolean tests


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools

2017-06-02 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/18164#discussion_r119812842
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala ---
@@ -196,6 +196,24 @@ final class DataFrameNaFunctions private[sql](df: 
DataFrame) {
   def fill(value: String, cols: Seq[String]): DataFrame = fillValue(value, 
cols)
 
   /**
+   * Returns a new `DataFrame` that replaces null values in boolean 
columns with `value`.
+   */
+  def fill(value: Boolean): DataFrame = fill(value, df.columns)
+
+  /**
+   * (Scala-specific) Returns a new `DataFrame` that replaces null or NaN 
values in specified
--- End diff --

Oh, right. I copied the defs and docs from the double version, as it shows. 
Will change; NaN booleans would indeed be weird


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools

2017-06-02 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/18164#discussion_r119812666
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala ---
@@ -196,6 +196,24 @@ final class DataFrameNaFunctions private[sql](df: 
DataFrame) {
   def fill(value: String, cols: Seq[String]): DataFrame = fillValue(value, 
cols)
 
   /**
+   * Returns a new `DataFrame` that replaces null values in boolean 
columns with `value`.
+   */
+  def fill(value: Boolean): DataFrame = fill(value, df.columns)
--- End diff --

I wasn't sure about this and actually wanted to ask. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools

2017-06-02 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/18164#discussion_r119812590
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1303,9 +1312,11 @@ def fillna(self, value, subset=None):
 +---+--+---+
 """
 if not isinstance(value, (float, int, long, basestring, dict)):
--- End diff --

I omitted it just because it wasn't failing for this `if`, but indeed, I lean 
towards putting it in, even if just for completeness. It makes reading the 
code much saner if we have the `if` for bool


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools

2017-06-02 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/18164#discussion_r119812631
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1303,9 +1312,11 @@ def fillna(self, value, subset=None):
 +---+--+---+
 """
 if not isinstance(value, (float, int, long, basestring, dict)):
-raise ValueError("value should be a float, int, long, string, 
or dict")
+raise ValueError("value should be a float, int, long, string, 
boolean or dict")
--- End diff --

Thanks, will change


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools

2017-06-02 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/18164#discussion_r119812409
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1303,9 +1312,11 @@ def fillna(self, value, subset=None):
 +---+--+---+
 """
 if not isinstance(value, (float, int, long, basestring, dict)):
-raise ValueError("value should be a float, int, long, string, 
or dict")
+raise ValueError("value should be a float, int, long, string, 
boolean or dict")
 
-if isinstance(value, (int, long)):
+if isinstance(value, bool):
+pass
+elif isinstance(value, (int, long)):
--- End diff --

Thanks, it indeed makes sense and is a bit nicer than having a `pass`. 
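
A tiny illustration of why the ordering matters (not from the patch, just the 
underlying Python rule): `bool` is a subclass of `int`, so without an explicit 
bool branch a boolean fill value would be swallowed by the int branch:

```python
value = True

print(isinstance(value, int))   # True: bool is a subclass of int,
print(isinstance(value, bool))  # so the bool check must come before the int one,
print(float(value))             # otherwise the int branch would coerce True to 1.0
```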


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools

2017-06-02 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/18164#discussion_r119798778
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -1697,40 +1697,56 @@ def test_fillna(self):
 schema = StructType([
 StructField("name", StringType(), True),
 StructField("age", IntegerType(), True),
-StructField("height", DoubleType(), True)])
+StructField("height", DoubleType(), True),
+StructField("spy", BooleanType(), True)])
 
 # fillna shouldn't change non-null values
-row = self.spark.createDataFrame([(u'Alice', 10, 80.1)], 
schema).fillna(50).first()
+row = self.spark.createDataFrame([(u'Alice', 10, 80.1, True)], 
schema).fillna(50).first()
 self.assertEqual(row.age, 10)
 
 # fillna with int
-row = self.spark.createDataFrame([(u'Alice', None, None)], 
schema).fillna(50).first()
+row = self.spark.createDataFrame([(u'Alice', None, None, None)], 
schema).fillna(50).first()
 self.assertEqual(row.age, 50)
 self.assertEqual(row.height, 50.0)
 
 # fillna with double
-row = self.spark.createDataFrame([(u'Alice', None, None)], 
schema).fillna(50.1).first()
+row = self.spark.createDataFrame([(u'Alice', None, None, None)], 
schema).fillna(50.1).first()
 self.assertEqual(row.age, 50)
 self.assertEqual(row.height, 50.1)
 
+# fillna with bool
+row = self.spark.createDataFrame([(u'Alice', None, None, None)], 
schema).fillna(True).first()
+self.assertEqual(row.age, None)
+self.assertEqual(row.spy, True)
+
 # fillna with string
-row = self.spark.createDataFrame([(None, None, None)], 
schema).fillna("hello").first()
+row = self.spark.createDataFrame([(None, None, None, None)], 
schema).fillna("hello").first()
 self.assertEqual(row.name, u"hello")
 self.assertEqual(row.age, None)
 
 # fillna with subset specified for numeric cols
 row = self.spark.createDataFrame(
-[(None, None, None)], schema).fillna(50, subset=['name', 
'age']).first()
+[(None, None, None, None)], schema).fillna(50, subset=['name', 
'age']).first()
 self.assertEqual(row.name, None)
 self.assertEqual(row.age, 50)
 self.assertEqual(row.height, None)
+self.assertEqual(row.spy, None)
 
-# fillna with subset specified for numeric cols
+# fillna with subset specified for string cols
 row = self.spark.createDataFrame(
-[(None, None, None)], schema).fillna("haha", subset=['name', 
'age']).first()
+[(None, None, None, None)], schema).fillna("haha", 
subset=['name', 'age']).first()
 self.assertEqual(row.name, "haha")
 self.assertEqual(row.age, None)
 self.assertEqual(row.height, None)
+self.assertEqual(row.spy, None)
+
+# fillna with subset specified for bool cols
+row = self.spark.createDataFrame(
+[(None, None, None, None)], schema).fillna(True, 
subset=['name', 'age']).first()
+self.assertEqual(row.name, None)
+self.assertEqual(row.age, None)
+self.assertEqual(row.height, None)
+self.assertEqual(row.spy, True)
--- End diff --

Hi @ueshin, indeed! Thanks for catching this, I have modified the test. BUT, 
shouldn't this test, as it stood in your comment, have failed? The subset 
should not have been applied to `spy` (so `spy` should have been None and the 
assertion should have failed), yet either the test passed or it didn't run, if 
I understood correctly how fillna's subsetting works. This is weird, since I 
didn't change any of the internals of how it works; I just created the methods 
to enable it. 
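
For reference, a small sketch of the behaviour we both expect with the boolean 
fill support from this PR (my own example, assuming a local SparkSession): 
`spy` is outside the subset so it stays None, and a boolean value cannot be 
applied to the string/int columns either:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (BooleanType, IntegerType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.master("local[1]").getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("spy", BooleanType(), True),
])
df = spark.createDataFrame([(None, None, None)], schema)

# "spy" is not in the subset, so fillna(True, ...) should leave it as None;
# "name" and "age" are not boolean columns, so they stay None as well.
row = df.fillna(True, subset=["name", "age"]).first()
print(row.name, row.age, row.spy)  # expected: None None None
```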


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

2017-06-01 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/18139
  
Thanks @ueshin , I forgot to run `dev/lint-python` on the tests (still 
getting used to which speed tricks are useful to avoid a full test-suite run 
locally, which takes more than an hour on my machine)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

2017-06-01 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/18139
  
@j-bennet I'm not sure about that change. My suggested approach only fixes 
what the Python API offers (as in, mktime has documented limitations from 
Python's point of view, independent of what the underlying C mktime might 
do). I guess a better way would be a platform-dependent check for the 
1900-1901-1902 problem (which I suspect should be a separate issue from this 
one, though, even if it touches the same area of the code)
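
As a rough sketch of what such a platform-dependent check could look like 
(illustrative only, not part of this PR):

```python
# Hedged sketch: probe whether the local time.mktime accepts a pre-1900
# datetime instead of hard-coding the 100-1899 range.
import time
import datetime as dt


def mktime_handles_pre_1900():
    try:
        time.mktime(dt.datetime(1899, 12, 31).timetuple())
        return True
    except (OverflowError, ValueError):
        return False
```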


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

2017-06-01 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/18139#discussion_r119583612
  
--- Diff: python/pyspark/sql/types.py ---
@@ -187,8 +187,12 @@ def needConversion(self):
 
 def toInternal(self, dt):
 if dt is not None:
-seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
-   else time.mktime(dt.timetuple()))
+# Avoiding the invalid range of years (100-1899) for mktime in Python < 3
+if dt.year > 1899 or dt.year <= 100:
+seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
+   else time.mktime(dt.timetuple()))
+else:
+seconds = calendar.timegm(dt.utctimetuple())
--- End diff --

Well, this was the previous behaviour, and I'm not 100% sure of the 
rationale for doing it this way either. If there is no `tzinfo` and we use 
`dt.timetuple` instead of `utctimetuple`, I guess `mktime` interprets it as 
local time via the underlying C implementation, or at least that's what the 
Python docs seem to suggest, though it's not that clear. So I'm not 
convinced we need to offset, since it would be a platform/locale dependent 
offset
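
To make that concrete, a small sketch (assuming a machine whose local 
timezone is not UTC; illustrative only) of why the two branches can disagree 
for naive datetimes:

```python
# For a naive datetime, utctimetuple() carries the same fields as
# timetuple(), but timegm interprets them as UTC while mktime interprets
# them as local time, so the results differ by the local UTC offset.
import calendar
import time
import datetime as dt

naive = dt.datetime(2017, 6, 1, 12, 0, 0)  # no tzinfo

as_utc = calendar.timegm(naive.utctimetuple())
as_local = time.mktime(naive.timetuple())

print(as_utc - as_local)  # the local UTC offset in seconds at that date
```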


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

2017-06-01 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/18139#discussion_r119583288
  
--- Diff: python/pyspark/sql/types.py ---
@@ -187,8 +187,12 @@ def needConversion(self):
 
 def toInternal(self, dt):
 if dt is not None:
-seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
-   else time.mktime(dt.timetuple()))
+# Avoiding the invalid range of years (100-1899) for mktime in Python < 3
+if dt.year > 1899 or dt.year <= 100:
--- End diff --

Uh, right, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18164: [Spark-19732][SQL][PYSPARK] fillna bools

2017-05-31 Thread rberenguel
GitHub user rberenguel opened a pull request:

https://github.com/apache/spark/pull/18164

[Spark-19732][SQL][PYSPARK] fillna bools

## What changes were proposed in this pull request?

Allow fill/replace of NAs with booleans, both in Python and Scala

## How was this patch tested?

Unit tests, doctests

This PR is original work from me and I license this work to the Spark 
project

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rberenguel/spark SPARK-19732-fillna-bools

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18164


commit 60b60e307acae467b40d6dd8880918134e367140
Author: Ruben Berenguel Montoro <ru...@mostlymaths.net>
Date:   2017-05-30T23:22:44Z

SPARK-19732 fillna for booleans in place. Tomorrow testing for it and 
adding to PySpark

commit fb0904b21a048f40a35da72ff518822e8e3fb6c4
Author: Ruben Berenguel Montoro <ru...@mostlymaths.net>
Date:   2017-05-31T15:57:03Z

SPARK-19732 Fillna for booleans (Spark and Scala)




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

2017-05-31 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/18139#discussion_r119350488
  
--- Diff: python/pyspark/sql/types.py ---
@@ -187,8 +187,11 @@ def needConversion(self):
 
 def toInternal(self, dt):
 if dt is not None:
-seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
-   else time.mktime(dt.timetuple()))
+seconds = calendar.timegm(dt.utctimetuple())
+# Avoiding the invalid range of years (100-1899) for mktime in Python < 3
+if dt.year > 1899 or dt.year <= 100:
+seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
+   else time.mktime(dt.timetuple()))
--- End diff --

Makes sense. I always forget that `if` blocks don't create new scopes in Python
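
A tiny illustration of that point, for the record:

```python
# An `if` block does not open a new scope in Python, so a name bound in
# either branch is still visible after the block.
def to_seconds(flag):
    if flag:
        seconds = 42
    else:
        seconds = 0
    return seconds  # in scope regardless of which branch ran


print(to_seconds(True), to_seconds(False))  # 42 0
```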


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...

2017-05-31 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/18139#discussion_r119344936
  
--- Diff: python/pyspark/sql/types.py ---
@@ -187,8 +187,11 @@ def needConversion(self):
 
 def toInternal(self, dt):
 if dt is not None:
-seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
-   else time.mktime(dt.timetuple()))
+seconds = calendar.timegm(dt.utctimetuple())
+# Avoiding the invalid range of years (100-1899) for mktime in Python < 3
+if dt.year not in range(100, 1900):
--- End diff --

Thanks! Indeed, I thought about it, but creating PRs too late in the 
afternoon dizzies my optimisation logic. Great catch and check, although I 
think `100 >= dt.year > 1900` is a chained (and) condition, so it can never 
be true, not just valid only for years larger than 1900. I changed it to an 
or (the `dt.year <= 100` and `dt.year > 1899` parts)
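
A quick sketch of why the chained form cannot work and what the corrected 
check looks like (mirroring the diff above):

```python
# A chained comparison expands to an `and`, so `100 >= year > 1900` can
# never be true; the corrected check uses `or` to select the years that
# are safe for time.mktime.
for year in (50, 1850, 1899, 1900, 2017):
    chained = 100 >= year > 1900              # (100 >= year) and (year > 1900): always False
    mktime_safe = year > 1899 or year <= 100  # the check used in the PR
    print(year, chained, mktime_safe)
```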


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18139: Spark 20787 invalid years

2017-05-29 Thread rberenguel
GitHub user rberenguel opened a pull request:

https://github.com/apache/spark/pull/18139

Spark 20787 invalid years

`time.mktime` can't handle dates with years in the 100-1899 range; according 
to the documentation, this is by design. `calendar.timegm` is equivalent in 
the shared cases, but can also handle those years.

## What changes were proposed in this pull request?

Change `time.mktime` to the more capable `calendar.timegm` to address cases 
like:
```python
import datetime as dt

sqlContext.createDataFrame(sc.parallelize([[dt.datetime(1899,12,31)]])).count()
```
which fails due to an internal conversion error when there is no timezone 
information in the datetime object. When timezone information is present, 
`calendar` was already being used.

## How was this patch tested?

The existing test cases cover this change, since it does not alter any 
existing functionality. Added a test to confirm it works in the problematic 
range.

This PR is original work from me and I license this work to the Spark 
project

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rberenguel/spark SPARK-20787-invalid-years

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18139.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18139


commit 6c0312f94e3fce2bf4d6a30055bd747be535bb0f
Author: Ruben Berenguel Montoro <ru...@mostlymaths.net>
Date:   2017-05-29T15:46:21Z

SPARK-20787 time.mktime can’t handle dates from 1899-100, by 
construction. calendar.timegm is equivalent in shared cases, but can handle 
those

commit d3c41b5f18971168870524ad3a5fac876859bf4b
Author: Ruben Berenguel Montoro <ru...@mostlymaths.net>
Date:   2017-05-29T19:42:54Z

SPARK-20787 Technically a hack. Using gmtime everywhere does not work well 
with DST shifts. So, for timeranges that don’t work well with mktime, use 
gmtime




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18137: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...

2017-05-29 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/18137
  
Closing while I fight with an issue seemingly related to DST between gmtime 
and mktime
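
For context, a small sketch of the kind of mismatch involved (assuming a 
machine whose timezone observes DST; illustrative only):

```python
# time.mktime applies the DST rules of the local zone while
# calendar.timegm does not, so the gap between the two is not constant
# across the year.
import calendar
import time
import datetime as dt

summer = dt.datetime(2017, 7, 1, 12, 0, 0)   # usually inside DST
winter = dt.datetime(2017, 1, 1, 12, 0, 0)   # usually outside DST

for d in (summer, winter):
    gap = calendar.timegm(d.utctimetuple()) - time.mktime(d.timetuple())
    print(d, gap)  # the gaps typically differ by the DST shift (often 3600s)
```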


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18137: [SPARK-20787][PYTHON] PySpark can't handle dateti...

2017-05-29 Thread rberenguel
Github user rberenguel closed the pull request at:

https://github.com/apache/spark/pull/18137


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18137: [SPARK-20787][PYTHON] PySpark can't handle dateti...

2017-05-29 Thread rberenguel
GitHub user rberenguel opened a pull request:

https://github.com/apache/spark/pull/18137

[SPARK-20787][PYTHON] PySpark can't handle datetimes before 1900 

`time.mktime` can't handle dates with years in the 100-1899 range; according 
to the documentation, this is by design. `calendar.timegm` is equivalent in 
the shared cases, but can also handle those years.

## What changes were proposed in this pull request?

Change `time.mktime` to the more capable `calendar.timegm` to address cases 
like:
```python
import datetime as dt

sqlContext.createDataFrame(sc.parallelize([[dt.datetime(1899,12,31)]])).count()
```
which fails due to an internal conversion error when there is no timezone 
information in the datetime object. When timezone information is present, 
`calendar` was already being used.

## How was this patch tested?

The existing test cases should cover this change, since it should not 
change any existing functionality.

This PR is original work from me and I license this work to the Spark 
project

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rberenguel/spark SPARK-20787-invalid-years

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18137.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18137


commit 6c0312f94e3fce2bf4d6a30055bd747be535bb0f
Author: Ruben Berenguel Montoro <ru...@mostlymaths.net>
Date:   2017-05-29T15:46:21Z

SPARK-20787 time.mktime can’t handle dates from 1899-100, by 
construction. calendar.timegm is equivalent in shared cases, but can handle 
those




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17100: [SPARK-13947][SQL] PySpark DataFrames: The error message...

2017-05-14 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/17100
  
Can you take a look at the changes @gatorsmile when you have a few spare 
moments? Also, I'm not sure how the conflict appeared: it was merging 
cleanly when the CI test suite ran (I will obviously fix it, but first I 
want to confirm this is the right way of doing it now)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17100: [SPARK-13947][SQL] PySpark DataFrames: The error message...

2017-05-02 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/17100
  
@gatorsmile Thanks for the pointers, I finally found some time to come back 
to this. I'm not sure if my approach to getting the `SQLConf` into 
`checkAnalysis` is the correct one in my current local changes (since it 
seems to change a possible API endpoint). I renamed the current 
implementation in the trait to `def checkAnalysisWithConf(plan: LogicalPlan, 
conf: SQLConf): Unit` and added an abstract method `def checkAnalysis(plan: 
LogicalPlan): Unit` that is then implemented in `Analyzer` (where we have a 
`conf` we can pass around). I haven't fixed all the rest yet; I was puzzled 
enough about the correctness of this part for now ;) Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...

2017-03-22 Thread rberenguel
Github user rberenguel commented on a diff in the pull request:

https://github.com/apache/spark/pull/17100#discussion_r107354055
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -337,11 +337,29 @@ trait CheckAnalysis extends PredicateHelper {
 operator match {
   case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
 val missingAttributes = o.missingInput.mkString(",")
-val input = o.inputSet.mkString(",")
+val availableAttributes = o.inputSet.mkString(",")
+val missingAttributesHelper = o.missingInput.map(a => (a.name, a.exprId.id))
+
+val availableAttributesHelper = o.inputSet.map(a => (a.name, a.exprId.id))
+
+val common = (missingAttributesHelper ++ availableAttributesHelper).groupBy(_._1)
+val commonNames = common.map(x => (x._1, x._2.toSet))
+  .filter(_._2.size > 1)
+  .map(_._1)
+
+val repeatedNameHint = if (commonNames.size > 0) {
+  s"\n|Observe that attribute(s) ${commonNames.mkString(",")} appear in your " +
+"query with at least two different hashes, but same name."
--- End diff --

Good point!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17100: [SPARK-13947][PYTHON][SQL] PySpark DataFrames: The error...

2017-03-01 Thread rberenguel
Github user rberenguel commented on the issue:

https://github.com/apache/spark/pull/17100
  
@holdenk added! And I suspect I have a local issue with coursier when 
running the tests... They supposedly work (as in, no errors/failures in the 
output?) about 1 in 3 times; the other 2 runs hit coursier cache misses. But 
I think the no-errors run is actually the suite giving up rather than 
passing. I'll look into this (since it's pretty bad that I can't rely on the 
local test suite!) and fix the failing test, which is pretty straightforward.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17100: [SPARK-13947][PYTHON] PySpark DataFrames: The err...

2017-02-28 Thread rberenguel
GitHub user rberenguel opened a pull request:

https://github.com/apache/spark/pull/17100

[SPARK-13947][PYTHON] PySpark DataFrames: The error message from using an 
invalid table reference is not clear

## What changes were proposed in this pull request?

Rewritten error message for clarity. Added extra information in case of an 
attribute name collision, hinting the user to double-check whether they are 
referencing two different tables.
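
As a hedged illustration (a sketch assuming a local SparkSession named 
`spark`; not code from this PR), the kind of query that triggers the 
message, where the same column name shows up with two different expression 
ids:

```python
# Referencing a column that belongs to a different DataFrame makes the
# analyzer report a missing resolved attribute whose name collides with an
# available one, which is the situation the extra hint targets.
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.master("local[1]").getOrCreate()

df1 = spark.createDataFrame([(1, "a")], ["id", "x"])
df2 = spark.createDataFrame([(1, "b")], ["id", "y"])

try:
    # `id` exists in both frames, but df2's attribute is not in df1's plan
    df1.select(df2.id).show()
except AnalysisException as e:
    print(e)
```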

## How was this patch tested?

No functional changes; only the final message has changed. It has been 
tested manually against the situation described in the JIRA ticket. The 
automated tests in the repository pass.

This PR is original work from me and I license this work to the Spark 
project

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rberenguel/spark SPARK-13947-error-message

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17100.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17100


commit 981224bd9631a0ac633b6967e639197794894fcf
Author: Ruben Berenguel Montoro <ru...@mostlymaths.net>
Date:   2017-02-28T13:40:47Z

SPARK-13947 Rewritten error message for clarity. Added extra information in 
case of attribute name collision




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org