[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1496311839 Thanks a lot @cloud-fan for the guidance and support in getting this issue fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1495354346 Gentle ping @cloud-fan
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1492817996 @cloud-fan I have made the change. All tests have passed. Can you please review?
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1491954910
> If you really worry about regression, we can add a legacy config to fall back to the old code. I don't agree to make code changes that only fix the problem in one particular code path, while we know other code paths have the same problem as well.
OK, I will update the PR with the suggested change.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1491795763
> according to the [code in 2.3](https://github.com/apache/spark/blob/branch-2.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L190), I think we should call `distinct` in line 345
@cloud-fan Yes, that should also work, but making the change there would extend its impact to many more scenarios, whereas the place where I have applied `distinct` keeps the scope very limited.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1486232865 @cloud-fan Can you please check my last comments?
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1482819187 FWIW, both use cases worked fine in Spark 2.3.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1482792058
> I think case 1 works by accident. It's not an intentional design. I don't think it's a bug that case 2 doesn't work.
As I said in my previous comment: I am not sure about the earlier deduplication, but even if it happens at some stage, in the second use case the column name might not have been converted to lowercase by that time, which is why the id and ID columns would still be treated as different. Only in the final result of the column match do we see that both matches are the same id#17.
The speculation was right. The dedup happens in the unique method.
For case 1:
unique before:: Map(col3 -> Vector(col3#18571), col2 -> Vector(col2#18570), id -> Vector(id#18569, id#18569), col5 -> Vector(col5#18573), col4 -> Vector(col4#18572))
after before:: Map(col3 -> Vector(col3#18571), col2 -> Vector(col2#18570), id -> Vector(id#18569), col5 -> Vector(col5#18573), col4 -> Vector(col4#18572))
For case 2:
unique before:: Map(col3 -> Vector(col3#18610), col2 -> Vector(col2#18609), id -> Vector(id#18608, ID#18608), col5 -> Vector(col5#18612), col4 -> Vector(col4#18611))
after before:: Map(col3 -> Vector(col3#18610), col2 -> Vector(col2#18609), id -> Vector(id#18608, ID#18608), col5 -> Vector(col5#18612), col4 -> Vector(col4#18611))
In most places we call unique before returning the result. So what negative impact do you think it would have if we also returned unique results for the column match? One positive effect is that it fixes this wrong ambiguous error, which is thrown only because the match result contains two duplicate values.
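The difference between the two debug outputs above can be reproduced outside Spark with a minimal sketch. `NamedRef` and `UniqueStep.unique` are hypothetical stand-ins, not Spark classes; the sketch assumes that equality of an attribute includes its name, which is what the case 2 output suggests.

```scala
import java.util.Locale

// Hypothetical stand-in for an attribute reference: display name plus expression ID.
final case class NamedRef(name: String, exprId: Long)

object UniqueStep {
  // Group references by lower-cased name, then drop exact duplicates in each group.
  // Case-class equality compares the name as well as the ID, so "id" and "ID"
  // with the same exprId are NOT collapsed into one entry.
  def unique(refs: Seq[NamedRef]): Map[String, Seq[NamedRef]] =
    refs
      .groupBy(_.name.toLowerCase(Locale.ROOT))
      .map { case (key, group) => key -> group.distinct }
}
```

Under this assumption, two `id` entries with the same ID collapse to one (case 1), while `id` and `ID` with the same ID both survive (case 2).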
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1482368131
> > It works because the resolved column has just one match
> But there are two id columns. Does Spark already do deduplication somewhere?
I am not sure about the earlier deduplication, but even if it happens at some stage, in the second use case the column name might not have been converted to lowercase by that time, which is why the id and ID columns would still be treated as different. Only in the final result of the column match do we see that both matches are the same id#17.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1480595916
> @shrprasa do you know how the case 1 works?
Yes. It works because the resolved column has just one match:
attributes: Vector(id#17)
For the second case, the match result is:
attributes: Vector(id#17, id#17)
Since there is more than one value, even though both are exactly the same, it fails. This PR proposes to fix that by taking the distinct values of the match result.
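The idea of taking distinct values before the ambiguity check can be sketched in plain Scala. This is only an illustration under stated assumptions: `Attr` and `AmbiguityCheck.pick` are hypothetical names, not the code changed in the PR.

```scala
// Hypothetical stand-in for a matched attribute: a name plus an expression ID.
final case class Attr(name: String, exprId: Long) {
  override def toString: String = s"$name#$exprId"
}

object AmbiguityCheck {
  // Returns the single remaining candidate, or an error message when zero or
  // more than one candidate is left. With dedupe = true, exact duplicates are
  // removed before the check.
  def pick(matches: Seq[Attr], dedupe: Boolean): Either[String, Attr] = {
    val candidates = if (dedupe) matches.distinct else matches
    candidates match {
      case Seq(single) => Right(single)
      case Seq()       => Left("no match")
      case many        => Left(s"ambiguous: ${many.mkString(", ")}")
    }
  }
}
```

In this sketch, `Vector(id#17, id#17)` is rejected without deduplication and accepted with it, while two candidates with different expression IDs are still reported as ambiguous.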
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1480574606
> df3.select("id").show()
@cloud-fan The example you shared will behave the same even after this fix: it will give the ambiguous error. The use case the fix is trying to solve is different. Can you please try these two cases?
Case 1, which works fine:
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df3 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
df3.select("id").show()
Case 2, which does not work and which the fix is meant to solve:
val df2 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df4 = df2.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df4.select("id").show()
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1480532873
> I second @srowen's view. cc @cloud-fan
Thanks @yaooqinn for replying. Can you please explain why you think it's not the right fix? The fix only proposes to remove duplicates from the resolved columns, since it is incorrect to treat a single column match as ambiguous just because it occurs more than once in the resolved column list.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1480217186 Gentle ping @dongjoon-hyun @mridulm @HyukjinKwon @yaooqinn Can you please review this PR or direct it to someone who can review this PR.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1476233946 Gentle ping @dongjoon-hyun @mridulm @HyukjinKwon @yaooqinn Can you please review this PR or direct it to someone who can review this PR.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1466394445
> That's a "no" from me, per the logic above
Thanks @srowen. But it seems I am not able to explain the change to you, so it's better to get a review from someone who is qualified to review the change and is familiar with this code.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1465517104 Gentle ping @dongjoon-hyun @mridulm @HyukjinKwon @yaooqinn Can you please review this PR or direct it to someone who is aware of this code.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1463230011
> Hm, I just don't see the logic in that. It isn't how SQL works either, as far as I understand. Here's maybe another example, imagine a DataFrame defined by `SELECT 3 as id, 3 as ID`. Would you also say selecting "id" is unambiguous? and it makes sense to you if I change a 3 to a 4 that this query is no longer semantically valid?
If it's valid as per the plan, then yes.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1461303721 It's very much relevant, as this is the only case that requires the fix. If the columns do not come from the same source, the plan will reflect that, and the ambiguous error will still be thrown even after this fix.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1461288036
> Hm, how is it not ambiguous? When case insensitive, 'id' could mean one of two different columns
It's not ambiguous because, when we select using the list of column names, both id and ID get their value from the same column 'id' in the source dataframe df1.
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df3.select("id").show()
df3.explain()
== Physical Plan ==
*(1) Project [_1#6 AS id#17, _2#7 AS col2#18, _3#8 AS col3#19, _4#9 AS col4#20, _5#10 AS col5#21, _1#6 AS ID#17]
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1461262849
> I don't get it, it is due to case sensitivity; that's why it becomes ambiguous and that's what you see. The issue is that the error isn't super helpful because it shows the lower-cased column right? that's what I was saying. Or: does your change still result in an error without case sensitivity? it should
The issue is not with the error message. The problem is that no error should be thrown in this case; the select query should return a result. After this change, the ambiguous error will not be thrown, as we are fixing the duplicate attribute match.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1459668923
> You first defined a case-sensitive data set, then queried in a case-insensitive way, I guess the error is expected.
In the physical plan, both the id and ID columns are projected from the same column in the dataframe, _1#6:
_1#6 AS id#17, _1#6 AS ID#17
So there is no ambiguity. Also, the matched attributes are the same:
attributes: Vector(id#17, id#17)
The reference is considered ambiguous only because the matched result contains duplicates. If the matched attribute result were Vector(id#17, ID#17), the error would have been valid.
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1459652942
> Can you try `set spark.sql.caseSensitive=true`?
Yes, I have tried it. With caseSensitive set to true, it works, because id and ID are then treated as separate columns. The issue arises when column names are supposed to be treated as case insensitive.
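How the case-sensitivity setting changes the set of matching columns can be illustrated with a small plain-Scala sketch. `NameResolver.matches` is a hypothetical helper written for this illustration; it is not Spark's resolver and only mimics the effect of `spark.sql.caseSensitive` on name comparison.

```scala
object NameResolver {
  // Returns the columns whose names match `name`, either exactly
  // (case-sensitive) or ignoring case (case-insensitive).
  def matches(columns: Seq[String], name: String, caseSensitive: Boolean): Seq[String] =
    if (caseSensitive) columns.filter(_ == name)
    else columns.filter(_.equalsIgnoreCase(name))
}
```

With case sensitivity on, a lookup of "id" among "id", "col2" and "ID" yields a single column; with it off, both "id" and "ID" match, which is the situation discussed in this thread.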
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1459535564 Gentle ping @dongjoon-hyun @mridulm @HyukjinKwon @yaooqinn Can you please review this PR?
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1458159825
> I'm not sure about the change, not sure I'm qualified to review it. I think at best the error message should change; I am not clear that the result is 'wrong'
Thanks for replying. Can you please tag someone who would be the right person to review this change?
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL]:Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1457585690 Gentle Ping @srowen @dongjoon-hyun
[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL]:Incorrect ambiguous column reference error
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1454348175 @srowen @dongjoon-hyun Can you please review this PR?