[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/18139

Hi @holdenk yup, I still have a to-do to address it in a clean way, I got sidetracked with other things. Helping Python since the 18th century!

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21566: [Python] Fix typo in serializer exception
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/21566

@HyukjinKwon sure, no problem. I'm reading pieces of the source for a presentation on PySpark and this one jumped out quickly. I'll try to collect a few more
[GitHub] spark pull request #21566: [Python] Fix typo in serializer exception
GitHub user rberenguel opened a pull request: https://github.com/apache/spark/pull/21566

[Python] Fix typo in serializer exception

## What changes were proposed in this pull request?

Fix typo in exception raised in Python serializer

## How was this patch tested?

No code changes

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rberenguel/spark fix_typo_pyspark_serializers

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21566.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21566

commit bd8627c879131a4af364ef667f4ac1209ea6909a
Author: Ruben Berenguel Montoro
Date: 2018-06-14T11:11:14Z

    Fix typo in serializer exception
[GitHub] spark issue #20949: [SPARK-19018][SQL] Add support for custom encoding on cs...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/20949

I've been taking a look at this PR (I've hit this problem in the past and had a chat with @crafty-coder about it and his fixes, too): is there anything we could do to move it forward? Also, is there any way to trigger a rebuild on Jenkins without adding a dummy commit? It looks like the JVM on this test run blew the heap, so just a re-run should be enough (cc @holdenk @HyukjinKwon)
[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/18139

Yes, most solutions are concerned only with strftime (virtualtime and matplotlib above). Since calendar.gmtime can handle this better (or at least for pre-1900 dates) it is just a matter of... not messing up horribly with how dates work (it's also problematic in Java, after all)
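For context, the limitation the thread keeps circling back to came from Python 2's `datetime.strftime`, which refused years before 1900 (Python 3 lifted that restriction). A minimal sketch of the kind of fallback being discussed — the helper name `safe_format` and the fallback to `isoformat` are illustrative assumptions, not PySpark's actual fix:

```python
from datetime import datetime

def safe_format(dt):
    # Illustrative helper, not PySpark's code: on Python 2, strftime raised
    # ValueError for years < 1900, so fall back to isoformat(), which
    # handles any year the datetime type supports.
    try:
        return dt.strftime("%Y-%m-%d %H:%M:%S")
    except ValueError:
        return dt.isoformat(sep=" ")

# A date from the 18th century, the era joked about earlier in the thread
print(safe_format(datetime(1776, 7, 4, 12, 0, 0)))
```

Libraries like matplotlib and virtualtime, mentioned below, special-case this path in much the same spirit.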
[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/18139

@HyukjinKwon I would definitely not mind, I'm actually curious to see how they do it. I also saw it's not that big (and we don't need all the functionality). I'll extract the pieces we need for our case
[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/18139

For instance, [matplotlib handles strftime as a special case for pre-1900](https://github.com/matplotlib/matplotlib/blob/c166fa1afc40cf5fc211b584f8abc107744e9417/lib/matplotlib/dates.py#L573)
[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/18139

More ideas welcome indeed. We could just replicate what others are doing to fix the problem, if we can't introduce external dependencies.
[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/18139

@holdenk @ueshin Digging a bit into other solutions to this problem (since it's not specifically Spark related when you dig deeper, other libraries have had to find their way around it too), maybe adding [this dependency](https://pypi.python.org/pypi/virtualtime) might be the best way to fix it for good (while also not reinventing wheels that already exist in the Python ecosystem)
[GitHub] spark issue #17100: [SPARK-13947][SQL] The error message from using an inval...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/17100

Thanks @gatorsmile @holdenk @viirya @wzhfy for all the help
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146667191

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +270,25 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter { missing =>
+          o.inputSet.exists(input => resolver(missing.name, input.name))
+        }
+        val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+          val sameNames = attrsWithSameName.map(_.name).mkString(",")
+          s"""Attribute(s) with the same name appear in the operation: `$sameNames`.
+             |Please check if the right attribute(s) are used.""".stripMargin
+        } else {
+          ""
+        }
+
         val missingAttributes = o.missingInput.mkString(",")
         val input = o.inputSet.mkString(",")
-        failAnalysis(
-          s"resolved attribute(s) $missingAttributes missing from $input " +
-            s"in operator ${operator.simpleString}")
+        val msg = s"""Resolved attribute(s) $missingAttributes missing from $input
+                     |in operator ${operator.simpleString}.""".stripMargin
+
+        failAnalysis(if (repeatedNameHint.nonEmpty) msg + "\n" + repeatedNameHint else msg)
--- End diff --

Yup, thanks, it's more orderly this way
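For readers following the review, the logic in the hunk above — flag missing attributes whose name also appears in the input, and append a hint to the error message — can be sketched outside Spark. This is a hypothetical stand-alone re-telling in Python, not Spark's Scala code: representing attributes as `(name, exprId)` pairs and using a case-insensitive resolver are simplifying assumptions.

```python
def missing_attrs_message(missing, available, operator):
    # Attributes are (name, expr_id) pairs: an attribute is "missing" when
    # its expr_id is absent from the input, even if its name matches.
    resolver = lambda a, b: a.lower() == b.lower()  # case-insensitive match
    same_name = [name for name, _ in missing
                 if any(resolver(name, n) for n, _ in available)]
    fmt = lambda attrs: ",".join(f"{n}#{i}" for n, i in attrs)
    msg = (f"Resolved attribute(s) {fmt(missing)} missing from "
           f"{fmt(available)} in operator {operator}.")
    if same_name:
        # The hint that motivated this PR: same name, different expr_id,
        # which typically comes from referencing a column of the wrong plan.
        msg += ("\nAttribute(s) with the same name appear in the operation: "
                f"`{','.join(same_name)}`. "
                "Please check if the right attribute(s) are used.")
    return msg

print(missing_attrs_message([("a", 1)], [("a", 2)], "!Aggregate"))
```

The hint only fires when a name collision exists; for genuinely unrelated missing attributes the message stays as before.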
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146667141

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +270,25 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter { missing =>
+          o.inputSet.exists(input => resolver(missing.name, input.name))
+        }
+        val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+          val sameNames = attrsWithSameName.map(_.name).mkString(",")
+          s"""Attribute(s) with the same name appear in the operation: `$sameNames`.
--- End diff --

Ok
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146625735

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +270,25 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter { missing =>
+          o.inputSet.exists(input => resolver(missing.name, input.name))
+        }
+        val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+          val sameNames = attrsWithSameName.map(_.name).mkString(",")
+          s"""Attribute(s) with the same name appear in the operation: `$sameNames`.
--- End diff --

You mean wrapping each column name with the ticks?
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146544863

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +270,25 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter { missing =>
+          o.inputSet.exists(input => resolver(missing.name, input.name))
+        }
+        val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+          val sameNames = attrsWithSameName.map(_.name).mkString(",")
+          s"""Attribute(s) with the same name appear in the operation: `$sameNames`.
+             |Please check if the right attribute(s) are used.""".stripMargin
--- End diff --

I don't have a strong case for having two lines here, and I see the point of seeing it as one line. I guess the best way to have it as one line is splitting it into

```scala
val attributeRepetitionMsg =
  s"Attribute(s) with the same name appear in the operation: `$sameNames`"
val checkAttributesMsg = s"Please check if the right attribute(s) are used."
```

to avoid hitting the < 100 chars linting rule?
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146469410

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala ---
@@ -408,16 +408,23 @@ class AnalysisErrorSuite extends AnalysisTest {
     // CheckAnalysis should throw AnalysisException when Aggregate contains missing attribute(s)
     // Since we manually construct the logical plan at here and Sum only accept
     // LongType, DoubleType, and DecimalType. We use LongType as the type of a.
-    val plan =
-      Aggregate(
+    val attrA = AttributeReference("a", LongType)(exprId = ExprId(1))
+    val otherA = AttributeReference("a", LongType)(exprId = ExprId(2))
+    val bAlias = Alias(sum(attrA), "b")() :: Nil
+    val plan = Aggregate(
       Nil,
-      Alias(sum(AttributeReference("a", LongType)(exprId = ExprId(1))), "b")() :: Nil,
-      LocalRelation(
-        AttributeReference("a", LongType)(exprId = ExprId(2
+      bAlias,
+      LocalRelation(otherA))
     assert(plan.resolved)
-    assertAnalysisError(plan, "resolved attribute(s) a#1L missing from a#2L" :: Nil)
+    val errorMsg = s"""Resolved attribute(s) ${attrA.toString} missing from ${otherA.toString}
+                      |in operator !Aggregate [${bAlias.mkString("#")}].
+                      |Please check attribute(s) `a`, they seem to appear in two
+                      |different input operators, with the same name.""".stripMargin
+
--- End diff --

Fixed
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146469160

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala ---
@@ -408,16 +408,23 @@ class AnalysisErrorSuite extends AnalysisTest {
     // CheckAnalysis should throw AnalysisException when Aggregate contains missing attribute(s)
    // Since we manually construct the logical plan at here and Sum only accept
     // LongType, DoubleType, and DecimalType. We use LongType as the type of a.
-    val plan =
-      Aggregate(
+    val attrA = AttributeReference("a", LongType)(exprId = ExprId(1))
+    val otherA = AttributeReference("a", LongType)(exprId = ExprId(2))
+    val bAlias = Alias(sum(attrA), "b")() :: Nil
+    val plan = Aggregate(
--- End diff --

Added another attribute
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146469083

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +270,25 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter { missing =>
+          o.inputSet.exists(input => resolver(missing.name, input.name))
+        }
+        val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+          val commonNames = attrsWithSameName.map(_.name).mkString(",")
+          s"""|Please check attribute(s) `$commonNames`, they seem to appear in two
+              |different input operators, with the same name.""".stripMargin
--- End diff --

I have changed it, it now looks better no matter how many appear
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146468977

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +270,25 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter { missing =>
+          o.inputSet.exists(input => resolver(missing.name, input.name))
+        }
+        val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+          val commonNames = attrsWithSameName.map(_.name).mkString(",")
--- End diff --

Done
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146252936

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +270,27 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter(missing =>
+          o.inputSet.exists(input => resolver(missing.name, input.name)))
--- End diff --

Thanks @wzhfy
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146234832

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +270,27 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter(missing =>
+          o.inputSet.exists(input => resolver(missing.name, input.name)))
+        val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+          val commonNames = attrsWithSameName.map(_.name).mkString(",")
+          s"""|Please check attribute(s) `$commonNames`, they seem to appear in two
+              |different input operators, with the same name.""".stripMargin
+        } else {
+          ""
+        }
+
         val missingAttributes = o.missingInput.mkString(",")
-        val input = o.inputSet.mkString(",")
+        val availableAttributes = o.inputSet.mkString(",")
         failAnalysis(
-          s"resolved attribute(s) $missingAttributes missing from $input " +
-            s"in operator ${operator.simpleString}")
+          s"""Some resolved attribute(s) are not present among the available attributes
+             |for a query.
+             |$missingAttributes is not in $availableAttributes.
+             |$repeatedNameHint
+             |The failed query was for operator
+             |${operator.simpleString}""".stripMargin)
--- End diff --

Hmmm, seems reasonable indeed, and it avoids that horrible trailing newline. Thanks a lot :)
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146234085

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +270,27 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter(missing =>
+          o.inputSet.exists(input => resolver(missing.name, input.name)))
+        val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+          val commonNames = attrsWithSameName.map(_.name).mkString(",")
+          s"""|Please check attribute(s) `$commonNames`, they seem to appear in two
+              |different input operators, with the same name.""".stripMargin
+        } else {
+          ""
+        }
+
         val missingAttributes = o.missingInput.mkString(",")
-        val input = o.inputSet.mkString(",")
+        val availableAttributes = o.inputSet.mkString(",")
--- End diff --

Okies
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146233808

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +270,27 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter(missing =>
+          o.inputSet.exists(input => resolver(missing.name, input.name)))
--- End diff --

Done. Do you know if there is any style rule for when to use parentheses vs braces in these kinds of lambda functions?
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146211895

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +270,27 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
      case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter(missing =>
+          o.inputSet.exists(input => resolver(missing.name, input.name)))
+        val repeatedNameHint = if (attrsWithSameName.nonEmpty) {
+          val commonNames = attrsWithSameName.map(_.name).mkString(",")
+          s"""|Please check attribute(s) `$commonNames`, they seem to appear in two
+              |different input operators, with the same name.""".stripMargin
+        } else {
+          ""
+        }
+
         val missingAttributes = o.missingInput.mkString(",")
-        val input = o.inputSet.mkString(",")
+        val availableAttributes = o.inputSet.mkString(",")
         failAnalysis(
-          s"resolved attribute(s) $missingAttributes missing from $input " +
-            s"in operator ${operator.simpleString}")
+          s"""Some resolved attribute(s) are not present among the available attributes
+             |for a query.
+             |$missingAttributes is not in $availableAttributes.
+             |$repeatedNameHint
+             |The failed query was for operator
+             |${operator.simpleString}""".stripMargin)
--- End diff --

This is good for me, having the `repeatedNameHint` is the whole point, at the end.
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146211291

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter(x =>
+          o.inputSet.exists(y => resolver(x.name, y.name)))
         val missingAttributes = o.missingInput.mkString(",")
-        val input = o.inputSet.mkString(",")
+        val availableAttributes = o.inputSet.mkString(",")
+        val repeatedNameHint = if (attrsWithSameName.size > 0) {
+          val commonNames = attrsWithSameName.map(_.name).mkString(",")
+          s"""\n|Attribute(s) `$commonNames` seem to appear in two
+               |different datasets, with the same name."""
+        } else {
+          ""
+        }
         failAnalysis(
-          s"resolved attribute(s) $missingAttributes missing from $input " +
-            s"in operator ${operator.simpleString}")
+          s"""Some resolved attribute(s) are not present among the available attributes
+             |for a query.
+             |$missingAttributes is not in $availableAttributes.$repeatedNameHint
--- End diff --

Dang, I knew there was a reason for the weird newline back when I wrote this. I'm putting the `\n` back; do you have any idea that doesn't look as weird as this does, though?
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146208143

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala ---
@@ -408,16 +408,28 @@ class AnalysisErrorSuite extends AnalysisTest {
     // CheckAnalysis should throw AnalysisException when Aggregate contains missing attribute(s)
     // Since we manually construct the logical plan at here and Sum only accept
     // LongType, DoubleType, and DecimalType. We use LongType as the type of a.
-    val plan =
-      Aggregate(
+    val attrA = AttributeReference("a", LongType)(exprId = ExprId(1))
+    val aId = Array[String](attrA.name, attrA.exprId.id.toString)
+    val otherA = AttributeReference("a", LongType)(exprId = ExprId(2))
+    val otherAId = Array[String](otherA.name, otherA.exprId.id.toString)
+    val bAlias = Alias(sum(attrA), "b")() :: Nil
+    val plan = Aggregate(
       Nil,
-      Alias(sum(AttributeReference("a", LongType)(exprId = ExprId(1))), "b")() :: Nil,
-      LocalRelation(
-        AttributeReference("a", LongType)(exprId = ExprId(2
+      bAlias,
+      LocalRelation(otherA))
     assert(plan.resolved)
-    assertAnalysisError(plan, "resolved attribute(s) a#1L missing from a#2L" :: Nil)
+    val errorMsg = s"""Some resolved attribute(s) are not present among the available attributes
+                      |for a query.
+                      |${aId.mkString("#")}L is not in ${otherAId.mkString("#")}L.
--- End diff --

Oh, I did it wrong and so it did not work (I did aId.toString and got a horrible Java String object view). Changing it, and also regenerating the golden files, which seemed to make the test fail
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146177319

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter(x =>
+          o.inputSet.exists(y => resolver(x.name, y.name)))
         val missingAttributes = o.missingInput.mkString(",")
-        val input = o.inputSet.mkString(",")
+        val availableAttributes = o.inputSet.mkString(",")
+        val repeatedNameHint = if (attrsWithSameName.size > 0) {
+          val commonNames = attrsWithSameName.map(_.name).mkString(",")
+          s"""\n|Attribute(s) `$commonNames` seem to appear in two
+               |different datasets, with the same name."""
+        } else {
+          ""
+        }
         failAnalysis(
-          s"resolved attribute(s) $missingAttributes missing from $input " +
-            s"in operator ${operator.simpleString}")
+          s"""Some resolved attribute(s) are not present among the available attributes
+             |for a query.
+             |$missingAttributes is not in $availableAttributes.$repeatedNameHint
--- End diff --

Oh, thanks. I didn't see it
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146177284

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -24,6 +24,7 @@
 import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
 import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.internal.SQLConf
--- End diff --

Thanks
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146177263

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter(x =>
+          o.inputSet.exists(y => resolver(x.name, y.name)))
         val missingAttributes = o.missingInput.mkString(",")
-        val input = o.inputSet.mkString(",")
+        val availableAttributes = o.inputSet.mkString(",")
+        val repeatedNameHint = if (attrsWithSameName.size > 0) {
+          val commonNames = attrsWithSameName.map(_.name).mkString(",")
+          s"""\n|Attribute(s) `$commonNames` seem to appear in two
+               |different datasets, with the same name."""
--- End diff --

I have changed it. I agree "datasets" may not be the best word to describe it
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146177196

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter(x =>
+          o.inputSet.exists(y => resolver(x.name, y.name)))
         val missingAttributes = o.missingInput.mkString(",")
-        val input = o.inputSet.mkString(",")
+        val availableAttributes = o.inputSet.mkString(",")
+        val repeatedNameHint = if (attrsWithSameName.size > 0) {
--- End diff --

I don't understand where I should move missing or available: isn't it already right after `attrsWithSameName`? But I moved it after repeatedNameHint, since it's needed only for the `failAnalysis` argument
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146176760

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter(x =>
+          o.inputSet.exists(y => resolver(x.name, y.name)))
--- End diff --

Thanks
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146176626

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -24,6 +24,7 @@
 import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
 import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.internal.SQLConf
--- End diff --

Thanks
[GitHub] spark pull request #17100: [SPARK-13947][SQL] The error message from using a...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146176609

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -270,12 +271,25 @@ trait CheckAnalysis extends PredicateHelper {
     operator match {
       case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
+        val resolver = plan.conf.resolver
+        val attrsWithSameName = o.missingInput.filter(x =>
+          o.inputSet.exists(y => resolver(x.name, y.name)))
         val missingAttributes = o.missingInput.mkString(",")
-        val input = o.inputSet.mkString(",")
+        val availableAttributes = o.inputSet.mkString(",")
+        val repeatedNameHint = if (attrsWithSameName.size > 0) {
+          val commonNames = attrsWithSameName.map(_.name).mkString(",")
+          s"""\n|Attribute(s) `$commonNames` seem to appear in two
+               |different datasets, with the same name."""
--- End diff --

What do you suggest instead?
[GitHub] spark issue #17100: [SPARK-13947][SQL] The error message from using an inval...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/17100 Applied all changes @viirya & @gatorsmile
[GitHub] spark issue #17100: [SPARK-13947][SQL] PySpark DataFrames: The error message...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/17100 @viirya Hi, I based the title of the PR off the original ticket (https://issues.apache.org/jira/browse/SPARK-13947), but indeed it has nothing to do with Python. I see now the ticket has been "moved" to SQL, so I'll rewrite the title when I send the fixes (it might take me a while to re-run the test suite)
[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146128109 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -270,12 +273,25 @@ trait CheckAnalysis extends PredicateHelper { operator match { case o if o.children.nonEmpty && o.missingInput.nonEmpty => +val resolver = conf.resolver +val commonAttrs = o.missingInput.filter(x => + o.inputSet.exists(y => resolver(x.name, y.name))) val missingAttributes = o.missingInput.mkString(",") -val input = o.inputSet.mkString(",") +val availableAttributes = o.inputSet.mkString(",") +val repeatedNameHint = if (commonAttrs.size > 0) { + val commonNames = commonAttrs.map(_.name).mkString(",") + s"""\n|Observe that attribute(s) $commonNames appear in two --- End diff -- I'll rewrite the wording a bit so it hints at this as the cause (it's the most common one, I suspect). The newline is for readability of the error message. I could remove it.
[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146128111 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -270,12 +273,25 @@ trait CheckAnalysis extends PredicateHelper { operator match { case o if o.children.nonEmpty && o.missingInput.nonEmpty => +val resolver = conf.resolver +val commonAttrs = o.missingInput.filter(x => + o.inputSet.exists(y => resolver(x.name, y.name))) val missingAttributes = o.missingInput.mkString(",") -val input = o.inputSet.mkString(",") +val availableAttributes = o.inputSet.mkString(",") +val repeatedNameHint = if (commonAttrs.size > 0) { + val commonNames = commonAttrs.map(_.name).mkString(",") + s"""\n|Observe that attribute(s) $commonNames appear in two + |different datasets, with the same name""" --- End diff -- Thanks!
[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146128114 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -270,12 +273,25 @@ trait CheckAnalysis extends PredicateHelper { operator match { case o if o.children.nonEmpty && o.missingInput.nonEmpty => +val resolver = conf.resolver +val commonAttrs = o.missingInput.filter(x => --- End diff -- Indeed!
[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146128090 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -125,7 +128,7 @@ trait CheckAnalysis extends PredicateHelper { } case s: SubqueryExpression => -checkSubqueryExpression(operator, s) +checkAnalysisWithConf(s.plan, conf) --- End diff -- Yup, sorry. I was too eager to merge back master
[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146128092 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala --- @@ -408,16 +408,28 @@ class AnalysisErrorSuite extends AnalysisTest { // CheckAnalysis should throw AnalysisException when Aggregate contains missing attribute(s) // Since we manually construct the logical plan at here and Sum only accept // LongType, DoubleType, and DecimalType. We use LongType as the type of a. -val plan = - Aggregate( +val attrA = AttributeReference("a", LongType)(exprId = ExprId(1)) +val aId = Array[String](attrA.name, attrA.exprId.id.toString) +val otherA = AttributeReference("a", LongType)(exprId = ExprId(2)) +val otherAId = Array[String](otherA.name, otherA.exprId.id.toString) +val bAlias = Alias(sum(attrA), "b")() :: Nil +val plan = Aggregate( Nil, -Alias(sum(AttributeReference("a", LongType)(exprId = ExprId(1))), "b")() :: Nil, -LocalRelation( - AttributeReference("a", LongType)(exprId = ExprId(2 +bAlias, +LocalRelation(otherA)) assert(plan.resolved) -assertAnalysisError(plan, "resolved attribute(s) a#1L missing from a#2L" :: Nil) +val errorMsg = s"""Some resolved attribute(s) are not present among the available attributes + |for a query. + |${aId.mkString("#")}L is not in ${otherAId.mkString("#")}L. --- End diff -- Yes, thanks, I discovered this recently but didn't come back to this PR. Will fix
[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r146128078 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -270,12 +273,25 @@ trait CheckAnalysis extends PredicateHelper { operator match { case o if o.children.nonEmpty && o.missingInput.nonEmpty => +val resolver = conf.resolver --- End diff -- Oh, much handier. Thanks!
[GitHub] spark issue #18164: [SPARK-19732][SQL][PYSPARK] Add fill functions for nulls...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/18164 @HyukjinKwon I changed it, does it look any clearer? I have always thought of `na` in terms of Python and not R anyway :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark issue #18164: [SPARK-19732][SQL][PYSPARK] fillna bools
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/18164 @ueshin @HyukjinKwon thanks for giving it a very thorough look, and sorry for my previous comment, which was terribly unclear. I was confused because the Appveyor tick mark was green for commit 076ebed and I had run the tests locally (forgot linting, though), so I was pretty sure the test was right and was puzzled about how the wrong subset could still have a passing test. I probably skipped the wrong step when running the Python tests (I'm still figuring out which corners I can cut to avoid a full compile/build cycle for the whole project, which takes ages for me), so I didn't see my local test failing; the remote one was more puzzling, so I guess Appveyor had a hiccup here. Sorry again for the confused and confusing statements above :)
[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/18164#discussion_r119813043 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala --- @@ -124,6 +134,13 @@ class DataFrameNaFunctionsSuite extends QueryTest with SharedSQLContext { Row("Nina") :: Row("Amy") :: Row("unknown") :: Nil) assert(input.na.fill("unknown").columns.toSeq === input.columns.toSeq) +// boolean +checkAnswer( + boolInput.na.fill(true).select("spy"), + Row(false) :: Row(true) :: Row(true) :: --- End diff -- Sorry, what do you mean by inlined here?
[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/18164#discussion_r119812943 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameNaFunctionsSuite.scala --- @@ -36,6 +36,15 @@ class DataFrameNaFunctionsSuite extends QueryTest with SharedSQLContext { ).toDF("name", "age", "height") } + def createBooleanDF(): DataFrame = { --- End diff -- Yup, right. I added it at the top to keep both together, but it's only used for the boolean tests
[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/18164#discussion_r119812842 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala --- @@ -196,6 +196,24 @@ final class DataFrameNaFunctions private[sql](df: DataFrame) { def fill(value: String, cols: Seq[String]): DataFrame = fillValue(value, cols) /** + * Returns a new `DataFrame` that replaces null values in boolean columns with `value`. + */ + def fill(value: Boolean): DataFrame = fill(value, df.columns) + + /** + * (Scala-specific) Returns a new `DataFrame` that replaces null or NaN values in specified --- End diff -- Oh, right. I copied the defs and docs from the double version, as it shows. Will change; NaN booleans would be weird indeed
[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/18164#discussion_r119812666 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala --- @@ -196,6 +196,24 @@ final class DataFrameNaFunctions private[sql](df: DataFrame) { def fill(value: String, cols: Seq[String]): DataFrame = fillValue(value, cols) /** + * Returns a new `DataFrame` that replaces null values in boolean columns with `value`. + */ + def fill(value: Boolean): DataFrame = fill(value, df.columns) --- End diff -- I wasn't sure about this and actually wanted to ask. Thanks!
[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/18164#discussion_r119812590 --- Diff: python/pyspark/sql/dataframe.py --- @@ -1303,9 +1312,11 @@ def fillna(self, value, subset=None): +---+--+---+ """ if not isinstance(value, (float, int, long, basestring, dict)): --- End diff -- I omitted it just because it wasn't failing for this if, but indeed, I'm a bit more on the side of putting it in, even if just for completeness. It makes reading the code much saner if we have the explicit check for bool
[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/18164#discussion_r119812631 --- Diff: python/pyspark/sql/dataframe.py --- @@ -1303,9 +1312,11 @@ def fillna(self, value, subset=None): +---+--+---+ """ if not isinstance(value, (float, int, long, basestring, dict)): -raise ValueError("value should be a float, int, long, string, or dict") +raise ValueError("value should be a float, int, long, string, boolean or dict") --- End diff -- Thanks, will change
[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/18164#discussion_r119812409 --- Diff: python/pyspark/sql/dataframe.py --- @@ -1303,9 +1312,11 @@ def fillna(self, value, subset=None): +---+--+---+ """ if not isinstance(value, (float, int, long, basestring, dict)): -raise ValueError("value should be a float, int, long, string, or dict") +raise ValueError("value should be a float, int, long, string, boolean or dict") -if isinstance(value, (int, long)): +if isinstance(value, bool): +pass +elif isinstance(value, (int, long)): --- End diff -- Thanks, that indeed makes sense and is a bit nicer than having a pass
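The reason the bool branch has to come first in the diff above is that in Python bool is a subclass of int, so `isinstance(True, int)` is true and an int check alone would swallow booleans. A minimal illustration, with a hypothetical `classify` helper standing in for fillna's type dispatch:

```python
# bool is a subclass of int, so an int check alone matches True/False too.
assert isinstance(True, int)

def classify(value):
    # Hypothetical stand-in for fillna's dispatch: check bool before int
    # so booleans are not treated as numerics.
    if isinstance(value, bool):
        return "bool"
    elif isinstance(value, (int, float)):
        return "numeric"
    elif isinstance(value, str):
        return "string"
    raise ValueError("value should be a float, int, string, bool or dict")

print(classify(True))   # bool, not numeric
print(classify(1))      # numeric
```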
[GitHub] spark pull request #18164: [SPARK-19732][SQL][PYSPARK] fillna bools
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/18164#discussion_r119798778 --- Diff: python/pyspark/sql/tests.py --- @@ -1697,40 +1697,56 @@ def test_fillna(self): schema = StructType([ StructField("name", StringType(), True), StructField("age", IntegerType(), True), -StructField("height", DoubleType(), True)]) +StructField("height", DoubleType(), True), +StructField("spy", BooleanType(), True)]) # fillna shouldn't change non-null values -row = self.spark.createDataFrame([(u'Alice', 10, 80.1)], schema).fillna(50).first() +row = self.spark.createDataFrame([(u'Alice', 10, 80.1, True)], schema).fillna(50).first() self.assertEqual(row.age, 10) # fillna with int -row = self.spark.createDataFrame([(u'Alice', None, None)], schema).fillna(50).first() +row = self.spark.createDataFrame([(u'Alice', None, None, None)], schema).fillna(50).first() self.assertEqual(row.age, 50) self.assertEqual(row.height, 50.0) # fillna with double -row = self.spark.createDataFrame([(u'Alice', None, None)], schema).fillna(50.1).first() +row = self.spark.createDataFrame([(u'Alice', None, None, None)], schema).fillna(50.1).first() self.assertEqual(row.age, 50) self.assertEqual(row.height, 50.1) +# fillna with bool +row = self.spark.createDataFrame([(u'Alice', None, None, None)], schema).fillna(True).first() +self.assertEqual(row.age, None) +self.assertEqual(row.spy, True) + # fillna with string -row = self.spark.createDataFrame([(None, None, None)], schema).fillna("hello").first() +row = self.spark.createDataFrame([(None, None, None, None)], schema).fillna("hello").first() self.assertEqual(row.name, u"hello") self.assertEqual(row.age, None) # fillna with subset specified for numeric cols row = self.spark.createDataFrame( -[(None, None, None)], schema).fillna(50, subset=['name', 'age']).first() +[(None, None, None, None)], schema).fillna(50, subset=['name', 'age']).first() self.assertEqual(row.name, None) self.assertEqual(row.age, 50) 
self.assertEqual(row.height, None) +self.assertEqual(row.spy, None) -# fillna with subset specified for numeric cols +# fillna with subset specified for string cols row = self.spark.createDataFrame( -[(None, None, None)], schema).fillna("haha", subset=['name', 'age']).first() +[(None, None, None, None)], schema).fillna("haha", subset=['name', 'age']).first() self.assertEqual(row.name, "haha") self.assertEqual(row.age, None) self.assertEqual(row.height, None) +self.assertEqual(row.spy, None) + +# fillna with subset specified for bool cols +row = self.spark.createDataFrame( +[(None, None, None, None)], schema).fillna(True, subset=['name', 'age']).first() +self.assertEqual(row.name, None) +self.assertEqual(row.age, None) +self.assertEqual(row.height, None) +self.assertEqual(row.spy, True) --- End diff -- Hi @ueshin indeed! Thanks for catching this, I have modified the test. BUT, this test, as it stands in your comment, should have failed, shouldn't it? The subset should not have been applied to spy (so spy should have been None and the assertion should have failed, but either the test passed or the test didn't run), if I understood correctly how subsetting in fillna works. But this is weird, since I didn't change any internals of how it works; I just created the methods to enable it.
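The behaviour under discussion (`spy` staying None because it is outside the subset, and a bool value filling only bool columns) can be modeled without Spark. Below is a simplified, hypothetical model of fillna's subset semantics; the real implementation also coerces between numeric types, which is omitted here.

```python
def fillna_row(row, schema, value, subset=None):
    # Simplified model: fill nulls only in columns named in `subset`
    # (all columns when subset is None) whose declared type matches
    # the type of `value` exactly.
    cols = set(subset) if subset is not None else set(row)
    return {c: (value if row[c] is None and c in cols
                and schema[c] is type(value) else row[c])
            for c in row}

schema = {"name": str, "age": int, "height": float, "spy": bool}
row = {"name": None, "age": None, "height": None, "spy": None}

# fillna(True, subset=['name', 'age']): spy is outside the subset, and
# neither name nor age is a bool column, so nothing is filled.
print(fillna_row(row, schema, True, subset=["name", "age"]))

# fillna(True) with no subset fills only the bool column.
print(fillna_row(row, schema, True))
```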
[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/18139 Thanks @ueshin, I forgot to run `dev/lint-python` on the tests (still getting used to which speed tricks are useful to avoid a full test suite run locally, which takes more than an hour on my machine)
[GitHub] spark issue #18139: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/18139 @j-bennet I'm not sure about that change. My suggested approach is just fixing what the Python API offers (as in, mktime has known limitations from Python's point of view, independent of what the platform's mktime might do). I guess a better way would be to have some platform-dependent check for the 1900-1901-1902 problem (which I suspect should be a separate issue from this one, though, even if they might touch the same area of the code)
[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/18139#discussion_r119583612 --- Diff: python/pyspark/sql/types.py --- @@ -187,8 +187,12 @@ def needConversion(self): def toInternal(self, dt): if dt is not None: -seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo - else time.mktime(dt.timetuple())) +# Avoiding the invalid range of years (100-1899) for mktime in Python < 3 +if dt.year > 1899 or dt.year <= 100: +seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo + else time.mktime(dt.timetuple())) +else: +seconds = calendar.timegm(dt.utctimetuple()) --- End diff -- Well, this was the previous functionality; I'm not 100% sure of the rationale used to do it this way either. If we don't have `tzinfo` and use `dt.timetuple` instead of `utctimetuple`, I guess it's using local time via the underlying C implementation of `mktime`, or that's what the Python docs seem to suggest, but it's not that clear. So I'm not convinced we need to offset, since it would be a platform/locale-dependent offset
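Pieced together, the patched `toInternal` logic in the diff above amounts to the following standalone sketch (stdlib only; the year bounds are taken from the diff, and `to_posix_seconds` is a hypothetical name):

```python
import calendar
import time
from datetime import datetime

def to_posix_seconds(dt):
    # Outside the range mktime cannot handle, keep the original behaviour:
    # timegm for tz-aware datetimes, local-time mktime otherwise.
    if dt.year > 1899 or dt.year <= 100:
        return (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                else time.mktime(dt.timetuple()))
    # Years 101-1899: fall back to timegm, which accepts them.
    return calendar.timegm(dt.utctimetuple())

print(to_posix_seconds(datetime(1899, 12, 31)))  # negative: before the epoch
```

Note the trade-off being discussed: the fallback interprets the naive datetime as UTC rather than local time, which is why it is restricted to the range mktime rejects instead of being used everywhere.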
[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/18139#discussion_r119583288 --- Diff: python/pyspark/sql/types.py --- @@ -187,8 +187,12 @@ def needConversion(self): def toInternal(self, dt): if dt is not None: -seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo - else time.mktime(dt.timetuple())) +# Avoiding the invalid range of years (100-1899) for mktime in Python < 3 +if dt.year > 1899 or dt.year <= 100: --- End diff -- Uh, right, thanks!
[GitHub] spark pull request #18164: [Spark-19732][SQL][PYSPARK] fillna bools
GitHub user rberenguel opened a pull request: https://github.com/apache/spark/pull/18164 [Spark-19732][SQL][PYSPARK] fillna bools ## What changes were proposed in this pull request? Allow fill/replace of NAs with booleans, both in Python and Scala ## How was this patch tested? Unit tests, doctests This PR is original work from me and I license this work to the Spark project You can merge this pull request into a Git repository by running: $ git pull https://github.com/rberenguel/spark SPARK-19732-fillna-bools Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18164.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18164 commit 60b60e307acae467b40d6dd8880918134e367140 Author: Ruben Berenguel Montoro <ru...@mostlymaths.net> Date: 2017-05-30T23:22:44Z SPARK-19732 fillna for booleans in place. Tomorrow testing for it and adding to PySpark commit fb0904b21a048f40a35da72ff518822e8e3fb6c4 Author: Ruben Berenguel Montoro <ru...@mostlymaths.net> Date: 2017-05-31T15:57:03Z SPARK-19732 Fillna for booleans (Spark and Scala)
[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/18139#discussion_r119350488 --- Diff: python/pyspark/sql/types.py --- @@ -187,8 +187,11 @@ def needConversion(self): def toInternal(self, dt): if dt is not None: -seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo - else time.mktime(dt.timetuple())) +seconds = calendar.timegm(dt.utctimetuple()) +# Avoiding the invalid range of years (100-1899) for mktime in Python < 3 +if dt.year > 1899 or dt.year <= 100: +seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo + else time.mktime(dt.timetuple())) --- End diff -- Makes sense. I always forget that `if` blocks don't create new scopes in Python
[GitHub] spark pull request #18139: [SPARK-20787][PYTHON] PySpark can't handle dateti...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/18139#discussion_r119344936 --- Diff: python/pyspark/sql/types.py --- @@ -187,8 +187,11 @@ def needConversion(self): def toInternal(self, dt): if dt is not None: -seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo - else time.mktime(dt.timetuple())) +seconds = calendar.timegm(dt.utctimetuple()) +# Avoiding the invalid range of years (100-1899) for mktime in Python < 3 +if dt.year not in range(100, 1900): --- End diff -- Thanks! Indeed, I thought about it, but creating PRs too late in the afternoon dizzies my optimisation logic. Great catch and check, although I think `100 >= dt.year > 1900` is an and condition, so it can never be true. I changed it to an or (the <= 100 and the > 1899 parts)
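The point about `100 >= dt.year > 1900` being an and condition is worth pinning down: Python chains comparisons with an implicit conjunction, so that expression can never be true, whereas the or-form adopted in the fix covers both ends of the range.

```python
# Chained comparison: `100 >= y > 1900` means `100 >= y and y > 1900`,
# which no y satisfies.
assert not any(100 >= y > 1900 for y in (-5, 50, 100, 1000, 1900, 1950))

# The or-form from the fix selects years outside mktime's problematic range.
def outside_mktime_range(year):
    return year > 1899 or year <= 100

assert outside_mktime_range(1900) and outside_mktime_range(100)
assert not outside_mktime_range(1899) and not outside_mktime_range(101)
```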
[GitHub] spark pull request #18139: Spark 20787 invalid years
GitHub user rberenguel opened a pull request: https://github.com/apache/spark/pull/18139 Spark 20787 invalid years `time.mktime` can't handle dates in the years 100-1899, by design according to the documentation. `calendar.timegm` is equivalent in shared cases, but can handle those years. ## What changes were proposed in this pull request? Change `time.mktime` for the more able `calendar.timegm` to address cases like: ```python import datetime as dt sqlContext.createDataFrame(sc.parallelize([[dt.datetime(1899,12,31)]])).count() ``` failing due to internal conversion failure when there is no timezone information in the time object. In the case there is information, `calendar` was used instead. ## How was this patch tested? The existing test cases cover this change, since it does not change any existing functionality. Added a test to confirm it working in the problematic range. This PR is original work from me and I license this work to the Spark project You can merge this pull request into a Git repository by running: $ git pull https://github.com/rberenguel/spark SPARK-20787-invalid-years Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18139.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18139 commit 6c0312f94e3fce2bf4d6a30055bd747be535bb0f Author: Ruben Berenguel Montoro <ru...@mostlymaths.net> Date: 2017-05-29T15:46:21Z SPARK-20787 time.mktime can't handle dates from 1899-100, by construction. calendar.timegm is equivalent in shared cases, but can handle those commit d3c41b5f18971168870524ad3a5fac876859bf4b Author: Ruben Berenguel Montoro <ru...@mostlymaths.net> Date: 2017-05-29T19:42:54Z SPARK-20787 Technically a hack. Using gmtime everywhere does not work well with DST shifts.
So, for timeranges that donât work well with mktime, use gmtime --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18137: [SPARK-20787][PYTHON] PySpark can't handle datetimes bef...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/18137 Closing while I fight with an issue seemingly related to DST between gmtime and mktime
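The gmtime/mktime mismatch being fought here can be reproduced in a few lines of plain Python (a minimal illustration, not Spark code):

```python
import calendar
import time

t = 1_000_000_000  # 2001-09-09T01:46:40Z

# calendar.timegm is the inverse of time.gmtime: both are UTC-based, so the
# round trip is exact no matter the local timezone or DST rules.
assert calendar.timegm(time.gmtime(t)) == t

# time.mktime instead interprets a struct_time as *local* time, so feeding
# it a UTC tuple shifts the result by the local UTC offset -- and under DST
# that offset differs between summer and winter, which is the kind of
# mismatch described above.
offset = t - time.mktime(time.gmtime(t))
print(offset)  # local UTC offset in seconds on that date; 0.0 only in UTC
```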
[GitHub] spark pull request #18137: [SPARK-20787][PYTHON] PySpark can't handle dateti...
Github user rberenguel closed the pull request at: https://github.com/apache/spark/pull/18137
[GitHub] spark pull request #18137: [SPARK-20787][PYTHON] PySpark can't handle dateti...
GitHub user rberenguel opened a pull request: https://github.com/apache/spark/pull/18137 [SPARK-20787][PYTHON] PySpark can't handle datetimes before 1900

`time.mktime` can't handle dates in years 100-1899; according to the documentation, this is by design. `calendar.timegm` is equivalent in the shared cases, but can handle those years.

## What changes were proposed in this pull request?

Change `time.mktime` for the more capable `calendar.timegm` to address cases like:

```python
import datetime as dt
sqlContext.createDataFrame(sc.parallelize([[dt.datetime(1899,12,31)]])).count()
```

failing due to an internal conversion failure when there is no timezone information in the time object. In the case where there is timezone information, `calendar` was already used.

## How was this patch tested?

The existing test cases should cover this change, since it should not change any existing functionality.

This PR is original work from me and I license this work to the Spark project.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rberenguel/spark SPARK-20787-invalid-years

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18137.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18137

commit 6c0312f94e3fce2bf4d6a30055bd747be535bb0f
Author: Ruben Berenguel Montoro <ru...@mostlymaths.net>
Date: 2017-05-29T15:46:21Z

    SPARK-20787 time.mktime can't handle dates in years 100-1899, by construction. calendar.timegm is equivalent in shared cases, but can handle those
[GitHub] spark issue #17100: [SPARK-13947][SQL] PySpark DataFrames: The error message...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/17100 Could you take a look at the changes, @gatorsmile, when you have a few spare moments? Also, I'm not sure how the conflict appeared: the branch was merging cleanly when the CI test suite ran. (I will obviously fix it, but first I want to confirm this is the right way of doing it now.)
[GitHub] spark issue #17100: [SPARK-13947][SQL] PySpark DataFrames: The error message...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/17100 @gatorsmile Thanks for the pointers, I finally found some time to come back to this. I'm not sure whether my approach to getting the `SQLConf` into `checkAnalysis` in my current local changes is the correct one (since it seems to change a possible API endpoint). I renamed the current implementation in the trait to `def checkAnalysisWithConf(plan: LogicalPlan, conf: SQLConf): Unit` and added an abstract method `def checkAnalysis(plan: LogicalPlan): Unit` that is then implemented in `Analyzer` (where we have a `conf` we can pass around). I haven't fixed all the rest yet; I was puzzled enough with the correctness of this for now ;) Thanks
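The refactoring described in the comment above can be mirrored in a short Python sketch (hypothetical names, not the Scala trait itself): the shared logic takes the config explicitly, and a subclass that owns a config implements the no-argument entry point by delegating to it.

```python
from abc import ABC, abstractmethod

class CheckAnalysis(ABC):
    # Shared checking logic that receives the config explicitly,
    # analogous to the proposed checkAnalysisWithConf(plan, conf).
    def check_analysis_with_conf(self, plan, conf):
        return f"checked {plan} with {conf}"

    # Abstract entry point, analogous to checkAnalysis(plan),
    # implemented where a config is available.
    @abstractmethod
    def check_analysis(self, plan):
        ...

class Analyzer(CheckAnalysis):
    def __init__(self, conf):
        self.conf = conf

    def check_analysis(self, plan):
        # The subclass holds the conf, so it can pass it along.
        return self.check_analysis_with_conf(plan, self.conf)

print(Analyzer("sql_conf").check_analysis("plan"))
```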
[GitHub] spark pull request #17100: [SPARK-13947][SQL] PySpark DataFrames: The error ...
Github user rberenguel commented on a diff in the pull request: https://github.com/apache/spark/pull/17100#discussion_r107354055

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala ---
@@ -337,11 +337,29 @@ trait CheckAnalysis extends PredicateHelper {
       operator match {
         case o if o.children.nonEmpty && o.missingInput.nonEmpty =>
           val missingAttributes = o.missingInput.mkString(",")
-          val input = o.inputSet.mkString(",")
+          val availableAttributes = o.inputSet.mkString(",")
+          val missingAttributesHelper = o.missingInput.map(a => (a.name, a.exprId.id))
+
+          val availableAttributesHelper = o.inputSet.map(a => (a.name, a.exprId.id))
+
+          val common = (missingAttributesHelper ++ availableAttributesHelper).groupBy(_._1)
+          val commonNames = common.map(x => (x._1, x._2.toSet))
+            .filter(_._2.size > 1)
+            .map(_._1)
+
+          val repeatedNameHint = if (commonNames.size > 0) {
+            s"\n|Observe that attribute(s) ${commonNames.mkString(",")} appear in your " +
+              "query with at least two different hashes, but same name."
--- End diff --

Good point!
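The hint logic in the diff above boils down to: attributes are (name, exprId) pairs, and a name that appears with more than one exprId across the missing and available attribute sets triggers the warning. A pure-Python sketch of that grouping (hypothetical helper names, not Spark code):

```python
def repeated_names(missing, available):
    # Group (name, expr_id) pairs by name across both sets; a name mapped
    # to more than one distinct expr_id appears "with at least two
    # different hashes, but same name".
    by_name = {}
    for name, expr_id in list(missing) + list(available):
        by_name.setdefault(name, set()).add(expr_id)
    return sorted(name for name, ids in by_name.items() if len(ids) > 1)

missing = [("id", 7)]
available = [("id", 3), ("value", 3)]
print(repeated_names(missing, available))  # ['id']
```

This is the situation that arises when the same column name is pulled from two different tables, which is exactly what the new error message hints at.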
[GitHub] spark issue #17100: [SPARK-13947][PYTHON][SQL] PySpark DataFrames: The error...
Github user rberenguel commented on the issue: https://github.com/apache/spark/pull/17100 @holdenk added! And I suspect I have some issue locally with coursier when running the tests... They supposedly work (as in, no errors or failures reported?) one time in three; the other two runs cause coursier cache misses. But I think the "no errors" run is actually "I give up on running the test suite". I'll take a look at this (since it's pretty bad that I can't rely on the local test suite!) and fix the failing test; it's pretty straightforward.
[GitHub] spark pull request #17100: [SPARK-13947][PYTHON] PySpark DataFrames: The err...
GitHub user rberenguel opened a pull request: https://github.com/apache/spark/pull/17100 [SPARK-13947][PYTHON] PySpark DataFrames: The error message from using an invalid table reference is not clear

## What changes were proposed in this pull request?

Rewrote the error message for clarity. Added extra information in the case of an attribute name collision, hinting the user to double-check whether they are referencing two different tables.

## How was this patch tested?

No functional changes, only the final message has changed. It has been tested manually against the situation proposed in the JIRA ticket. Automated tests in the repository pass.

This PR is original work from me and I license this work to the Spark project.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rberenguel/spark SPARK-13947-error-message

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17100.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17100

commit 981224bd9631a0ac633b6967e639197794894fcf
Author: Ruben Berenguel Montoro <ru...@mostlymaths.net>
Date: 2017-02-28T13:40:47Z

    SPARK-13947 Rewritten error message for clarity. Added extra information in case of attribute name collision