[GitHub] spark issue #19776: [SPARK-22548][SQL] Incorrect nested AND expression pushe...
Github user jliwork commented on the issue: https://github.com/apache/spark/pull/19776 @gatorsmile @cloud-fan @viirya @HyukjinKwon Thanks a lot! =) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r152429647

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.test.SharedSQLContext
+
+class DataSourceStrategySuite extends PlanTest with SharedSQLContext {
+
+  test("translate simple expression") {
+    val attrInt = 'cint.int
+    val attrStr = 'cstr.string
+
+    testTranslateFilter(EqualTo(attrInt, 1), Some(sources.EqualTo("cint", 1)))
+    testTranslateFilter(EqualTo(1, attrInt), Some(sources.EqualTo("cint", 1)))
+
+    testTranslateFilter(EqualNullSafe(attrStr, Literal(null)),
+      Some(sources.EqualNullSafe("cstr", null)))
+    testTranslateFilter(EqualNullSafe(Literal(null), attrStr),
+      Some(sources.EqualNullSafe("cstr", null)))
+
+    testTranslateFilter(GreaterThan(attrInt, 1), Some(sources.GreaterThan("cint", 1)))
+    testTranslateFilter(GreaterThan(1, attrInt), Some(sources.LessThan("cint", 1)))
+
+    testTranslateFilter(LessThan(attrInt, 1), Some(sources.LessThan("cint", 1)))
+    testTranslateFilter(LessThan(1, attrInt), Some(sources.GreaterThan("cint", 1)))
+
+    testTranslateFilter(GreaterThanOrEqual(attrInt, 1), Some(sources.GreaterThanOrEqual("cint", 1)))
+    testTranslateFilter(GreaterThanOrEqual(1, attrInt), Some(sources.LessThanOrEqual("cint", 1)))
+
+    testTranslateFilter(LessThanOrEqual(attrInt, 1), Some(sources.LessThanOrEqual("cint", 1)))
+    testTranslateFilter(LessThanOrEqual(1, attrInt), Some(sources.GreaterThanOrEqual("cint", 1)))
+
+    testTranslateFilter(InSet(attrInt, Set(1, 2, 3)), Some(sources.In("cint", Array(1, 2, 3))))
+
+    testTranslateFilter(In(attrInt, Seq(1, 2, 3)), Some(sources.In("cint", Array(1, 2, 3))))
+
+    testTranslateFilter(IsNull(attrInt), Some(sources.IsNull("cint")))
+    testTranslateFilter(IsNotNull(attrInt), Some(sources.IsNotNull("cint")))
+
+    // cint > 1 AND cint < 10
+    testTranslateFilter(And(
+      GreaterThan(attrInt, 1),
+      LessThan(attrInt, 10)),
+      Some(sources.And(
+        sources.GreaterThan("cint", 1),
+        sources.LessThan("cint", 10))))
+
+    // cint >= 8 OR cint <= 2
+    testTranslateFilter(Or(
+      GreaterThanOrEqual(attrInt, 8),
+      LessThanOrEqual(attrInt, 2)),
+      Some(sources.Or(
+        sources.GreaterThanOrEqual("cint", 8),
+        sources.LessThanOrEqual("cint", 2))))
+
+    testTranslateFilter(Not(GreaterThanOrEqual(attrInt, 8)),
+      Some(sources.Not(sources.GreaterThanOrEqual("cint", 8))))
+
+    testTranslateFilter(StartsWith(attrStr, "a"), Some(sources.StringStartsWith("cstr", "a")))
+
+    testTranslateFilter(EndsWith(attrStr, "a"), Some(sources.StringEndsWith("cstr", "a")))
+
+    testTranslateFilter(Contains(attrStr, "a"), Some(sources.StringContains("cstr", "a")))
+  }
+
+  test("translate complex expression") {
+    val attrInt = 'cint.int
+
+    // ABS(cint) - 2 = 1
+    testTranslateFilter(LessThanOrEqual(
+      // Expressions are not supported

--- End diff --

Good catch @_@ fixed the typo. Thanks!
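For readers skimming the diff: the flipped expectations (e.g. ```GreaterThan(1, attrInt)``` translating to ```sources.LessThan("cint", 1)```) come from normalizing comparisons so the attribute always ends up on the left. A minimal, self-contained sketch of that normalization, using toy types rather than Spark's actual classes:

```scala
// Toy stand-ins for Catalyst expressions and data source filters.
// All names here are illustrative, not Spark's real API.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class Gt(left: Expr, right: Expr) extends Expr
case class Lt(left: Expr, right: Expr) extends Expr

sealed trait Filter
case class GreaterThan(attr: String, value: Int) extends Filter
case class LessThan(attr: String, value: Int) extends Filter

// Normalize so the attribute is on the left: "1 > cint" becomes "cint < 1".
def translate(e: Expr): Option[Filter] = e match {
  case Gt(Attr(a), Lit(v)) => Some(GreaterThan(a, v))
  case Gt(Lit(v), Attr(a)) => Some(LessThan(a, v))    // literal on the left flips the operator
  case Lt(Attr(a), Lit(v)) => Some(LessThan(a, v))
  case Lt(Lit(v), Attr(a)) => Some(GreaterThan(a, v)) // same flip for <
  case _ => None                                      // unsupported expression
}
```

With these toy types, ```translate(Gt(Lit(1), Attr("cint")))``` yields ```Some(LessThan("cint", 1))```, mirroring the suite's expectation.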
[GitHub] spark issue #19776: [SPARK-22548][SQL] Incorrect nested AND expression pushe...
Github user jliwork commented on the issue: https://github.com/apache/spark/pull/19776 Thanks for everyone's comments! I have polished the test cases.
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r152421950

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala ---
@@ -0,0 +1,307 @@
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.test.SharedSQLContext
+
+
+class DataSourceStrategySuite extends PlanTest with SharedSQLContext {
+
+  test("translate simple expression") {
+    val attrInt = 'cint.int
+    val attrStr = 'cstr.string
+
+    assertResult(Some(sources.EqualTo("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.EqualTo(attrInt, 1))
+    }

--- End diff --

Thanks! I've followed your suggestion and the test suite looks cleaner now.
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r152422048

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala ---
@@ -0,0 +1,307 @@
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.test.SharedSQLContext
+
+
+class DataSourceStrategySuite extends PlanTest with SharedSQLContext {
+
+  test("translate simple expression") {
+    val attrInt = 'cint.int
+    val attrStr = 'cstr.string
+
+    assertResult(Some(sources.EqualTo("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.EqualTo(attrInt, 1))
+    }
+    assertResult(Some(sources.EqualTo("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.EqualTo(1, attrInt))
+    }
+
+    assertResult(Some(sources.EqualNullSafe("cstr", null))) {
+      DataSourceStrategy.translateFilter(
+        expressions.EqualNullSafe(attrStr, Literal(null)))
+    }
+    assertResult(Some(sources.EqualNullSafe("cstr", null))) {
+      DataSourceStrategy.translateFilter(
+        expressions.EqualNullSafe(Literal(null), attrStr))
+    }
+
+    assertResult(Some(sources.GreaterThan("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.GreaterThan(attrInt, 1))
+    }
+    assertResult(Some(sources.GreaterThan("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.LessThan(1, attrInt))
+    }
+
+    assertResult(Some(sources.LessThan("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.LessThan(attrInt, 1))
+    }
+    assertResult(Some(sources.LessThan("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.GreaterThan(1, attrInt))
+    }
+
+    assertResult(Some(sources.GreaterThanOrEqual("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.GreaterThanOrEqual(attrInt, 1))
+    }
+    assertResult(Some(sources.GreaterThanOrEqual("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.LessThanOrEqual(1, attrInt))
+    }
+
+    assertResult(Some(sources.LessThanOrEqual("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.LessThanOrEqual(attrInt, 1))
+    }
+    assertResult(Some(sources.LessThanOrEqual("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.GreaterThanOrEqual(1, attrInt))
+    }
+
+    assertResult(Some(sources.In("cint", Array(1, 2, 3)))) {
+      DataSourceStrategy.translateFilter(
+        expressions.InSet(attrInt, Set(1, 2, 3)))
+    }
+
+    assertResult(Some(sources.In("cint", Array(1, 2, 3)))) {
+      DataSourceStrategy.translateFilter(
+        expressions.In(attrInt, Seq(1, 2, 3)))
+    }
+
+    assertResult(Some(sources.IsNull("cint"))) {
+      DataSourceStrategy.translateFilter(
+        expressions.IsNull(attrInt))
+    }
+    assertResult(Some(sources.IsNotNull("cint"))) {
+      DataSourceStrategy.translateFilter(
+        expressions.IsNotNull(attrInt))
+    }
+
+    assertResult(Some(sources.And(
+      sources.GreaterThan("cint", 1),
+      sources.LessThan("cint", 10)))) {
+      DataSourceStrategy.translateFilter(expressions.And(
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r152421982

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ---
@@ -296,9 +296,15 @@ class JDBCSuite extends SparkFunSuite
     // The older versions of spark have this kind of bugs in parquet data source.
     val df1 = sql("SELECT * FROM foobar WHERE NOT (THEID != 2 AND NAME != 'mary')")

--- End diff --

Fixed.
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r152421773

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala ---
@@ -0,0 +1,307 @@
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions._

--- End diff --

Thanks for the suggestion. Fixed.
[GitHub] spark issue #19776: [SPARK-22548][SQL] Incorrect nested AND expression pushe...
Github user jliwork commented on the issue: https://github.com/apache/spark/pull/19776 @cloud-fan Thank you for your comments! I have updated the test cases as you suggested.
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r152193847

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala ---
@@ -0,0 +1,306 @@
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types._
+
+
+class DataSourceStrategySuite extends PlanTest with SharedSQLContext {
+
+  test("translate simple expression") {
+    val attrInt = AttributeReference("cint", IntegerType)()
+    val attrStr = AttributeReference("cstr", StringType)()

--- End diff --

Fixed. Thanks!
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r152193833

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala ---
@@ -0,0 +1,305 @@
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.{sources, QueryTest}
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types._
+
+
+class DataSourceStrategySuite extends QueryTest with SharedSQLContext {
+
+  test("translate simple expression") {
+    val attrInt = AttributeReference("cint", IntegerType)()
+    val attrStr = AttributeReference("cstr", StringType)()
+
+    assertResult(Some(sources.EqualTo("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.EqualTo(attrInt, Literal(1)))

--- End diff --

Fixed. Thanks!
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r152191430

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala ---
@@ -0,0 +1,305 @@
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.{sources, QueryTest}
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types._
+
+
+class DataSourceStrategySuite extends QueryTest with SharedSQLContext {

--- End diff --

Fixed. Thanks!
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r152186116

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ---
@@ -296,8 +296,33 @@ class JDBCSuite extends SparkFunSuite
     // The older versions of spark have this kind of bugs in parquet data source.
     val df1 = sql("SELECT * FROM foobar WHERE NOT (THEID != 2 AND NAME != 'mary')")
     val df2 = sql("SELECT * FROM foobar WHERE NOT (THEID != 2) OR NOT (NAME != 'mary')")
+    val df3 = sql("SELECT * FROM foobar WHERE (THEID > 0 AND NAME = 'mary') OR (NAME = 'fred')")
+    val df4 = sql("SELECT * FROM foobar " +
+      "WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME = 'fred')")
+    val df5 = sql("SELECT * FROM foobar " +
+      "WHERE THEID > 0 AND TRIM(NAME) = 'mary' AND LENGTH(NAME) > 3")
+    val df6 = sql("SELECT * FROM foobar " +
+      "WHERE THEID < 0 OR NAME = 'mary' OR NAME = 'fred'")
+    val df7 = sql("SELECT * FROM foobar " +
+      "WHERE THEID < 0 OR TRIM(NAME) = 'mary' OR NAME = 'fred'")
+    val df8 = sql("SELECT * FROM foobar " +
+      "WHERE NOT((THEID < 0 OR NAME != 'mary') AND (THEID != 1 OR NAME != 'fred'))")
+    val df9 = sql("SELECT * FROM foobar " +
+      "WHERE NOT((THEID < 0 OR NAME != 'mary') AND (THEID != 1 OR TRIM(NAME) != 'fred'))")
+    val df10 = sql("SELECT * FROM foobar " +
+      "WHERE (NOT(THEID < 0 OR TRIM(NAME) != 'mary')) OR (THEID = 1 AND NAME = 'fred')")
+
     assert(df1.collect.toSet === Set(Row("mary", 2)))
     assert(df2.collect.toSet === Set(Row("mary", 2)))
+    assert(df3.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df4.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df5.collect.toSet === Set(Row("mary", 2)))
+    assert(df6.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df7.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df8.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df9.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df10.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))

--- End diff --

I went ahead and added a new ```DataSourceStrategySuite``` to test ```translateFilter```. Please feel free to let me know of any further comments. Thanks!
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r152158653

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ---

--- End diff --

@gatorsmile Thanks for the comments, Sean =) Do you mean you'd like to see test cases calling ```DataSourceStrategy.translateFilter``` directly, or end-to-end ones like ```sql("SELECT * FROM foobar WHERE THEID < 1")```, which go through the translateFilter code path? I'm fine either way, just want to clarify.
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r152137096

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ---

--- End diff --

Sure. I can help.
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r152136470

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -497,7 +497,19 @@ object DataSourceStrategy {
       Some(sources.IsNotNull(a.name))

     case expressions.And(left, right) =>
-      (translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
+      // See SPARK-12218 for detailed discussion
+      // It is not safe to just convert one side if we do not understand the
+      // other side. Here is an example used to explain the reason.
+      // Let's say we have (a = 2 AND trim(b) = 'blah') OR (c > 0)
+      // and we do not understand how to convert trim(b) = 'blah'.
+      // If we only convert a = 2, we will end up with
+      // (a = 2) OR (c > 0), which will generate wrong results.
+      // Pushing one leg of AND down is only safe to do at the top level.
+      // You can see ParquetFilters' createFilter for more details.
+      for {
+        leftFilter <- translateFilter(left)
+        rightFilter <- translateFilter(right)
+      } yield sources.And(leftFilter, rightFilter)

--- End diff --

I would think so. SPARK-12218 put fixes into ParquetFilters.createFilter and OrcFilters.createFilter. They're similar to DataSourceStrategy.translateFilter but have different signatures customized for Parquet and ORC. For all data sources, including JDBC, Parquet, etc., translateFilter is called to determine whether a predicate Expression can be pushed down as a Filter. Next, for Parquet and ORC, Filters get mapped to Parquet- or ORC-specific filters with their own createFilter method. So this PR does help all data sources get the correct set of push-down predicates. Without this PR we simply got lucky with Parquet and ORC in terms of result correctness, because 1) it looks like we always apply a Filter on top of the scan, and 2) with one leg missing from an AND we end up with the same number of rows or more. The JDBC data source does not always come with a Filter on top of the scan and therefore exposed the bug.
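To make the correctness argument in the code comment above concrete, here is a small, self-contained Scala sketch (row data and names invented purely for illustration) that evaluates the example predicate both ways:

```scala
// The example from the comment: (a = 2 AND trim(b) = 'blah') OR (c > 0),
// where trim(b) = 'blah' is assumed untranslatable by the data source.
case class TestRow(a: Int, b: String, c: Int)

val rows = Seq(
  TestRow(2, " blah ", -1), // matches via the full AND leg
  TestRow(2, "nope", -1),   // matches NEITHER leg of the full predicate
  TestRow(5, "x", 1))       // matches via c > 0

// The full predicate, as the user wrote it.
def full(r: TestRow): Boolean = (r.a == 2 && r.b.trim == "blah") || r.c > 0

// Buggy pushdown: trim(b) = 'blah' silently dropped from the AND,
// leaving (a = 2) OR (c > 0).
def buggy(r: TestRow): Boolean = r.a == 2 || r.c > 0

val fullResult  = rows.filter(full)  // 2 rows
val buggyResult = rows.filter(buggy) // 3 rows: wrongly includes TestRow(2, "nope", -1)
```

Inside an AND, the dropped conjunct only ever widens the result set, which is why a re-applied Filter on top of the scan masks the bug; pushed below an OR, as here, the widened set leaks into the final answer.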
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r151922970

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -497,7 +497,11 @@ object DataSourceStrategy {
       Some(sources.IsNotNull(a.name))

     case expressions.And(left, right) =>
-      (translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
+      // See SPARK-12218 and PR 10362 for detailed discussion

--- End diff --

@viirya I see. Thanks, Simon! I've removed the PR number from the comment.
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r151920447

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -497,7 +497,10 @@ object DataSourceStrategy {
       Some(sources.IsNotNull(a.name))

     case expressions.And(left, right) =>
-      (translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
+      for {

--- End diff --

Thanks. Just did that as you suggested.
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r151920126

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -497,7 +497,11 @@ object DataSourceStrategy {
       Some(sources.IsNotNull(a.name))

     case expressions.And(left, right) =>
-      (translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
+      // See SPARK-12218 and PR 10362 for detailed discussion

--- End diff --

Sure. I have added more comments there with an example. Thanks, Sean!
[GitHub] spark issue #19776: [SPARK-22548][SQL] Incorrect nested AND expression pushe...
Github user jliwork commented on the issue: https://github.com/apache/spark/pull/19776 @viirya I'm fine with a backport to 2.2 unless anyone objects.
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/19776#discussion_r151901160

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -497,7 +497,10 @@ object DataSourceStrategy {
       Some(sources.IsNotNull(a.name))

     case expressions.And(left, right) =>
-      (translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
+      for {

--- End diff --

Sure. Will do. Thanks.
[GitHub] spark issue #19776: [SPARK-22548][SQL] Incorrect nested AND expression pushe...
Github user jliwork commented on the issue: https://github.com/apache/spark/pull/19776 @viirya Thanks for letting me know, Simon. I've fixed the title. Could someone please help trigger the tests?
[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...
GitHub user jliwork opened a pull request: https://github.com/apache/spark/pull/19776 [SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDB… …C data source ## What changes were proposed in this pull request? Let's say I have a nested AND expression shown below, where p2 cannot be pushed down: (p1 AND p2) OR p3 In the current Spark code, during data source filter translation, (p1 AND p2) is returned as p1 only, and p2 is simply lost. This issue occurs with the JDBC data source and is similar to SPARK-12218 for Parquet. When we have an AND nested below another expression, we should either push down both legs or nothing. Note that 1) the current Spark code will always split a conjunctive predicate before it determines whether a predicate can be pushed down or not; 2) the same translation method is also called by Data Source V2. ## How was this patch tested? Added new unit test cases to JDBCSuite. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jliwork/spark spark-22548 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19776.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19776 commit 58de88c21210d469b8ef14b1f23764c31ca5651e Author: Jia Li <ji...@us.ibm.com> Date: 2017-11-18T01:15:01Z [SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data source
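The diff snippets in this thread show the shape of the fix: the `++`/`reduceOption` combination is replaced by a `for` comprehension so an AND translates only when both legs translate. Here is a self-contained sketch of the idea using made-up stand-in types (`Pred`, `AndE`, `OrE`), not Spark's actual Expression/Filter classes:

```scala
// Stand-in expression and filter trees (hypothetical, for illustration only).
sealed trait Expr
case class Pred(name: String, pushable: Boolean) extends Expr
case class AndE(l: Expr, r: Expr) extends Expr
case class OrE(l: Expr, r: Expr) extends Expr

sealed trait Filter
case class FPred(name: String) extends Filter
case class FAnd(l: Filter, r: Filter) extends Filter
case class FOr(l: Filter, r: Filter) extends Filter

def translate(e: Expr): Option[Filter] = e match {
  case Pred(n, true)  => Some(FPred(n))
  case Pred(_, false) => None // predicate the data source cannot handle
  case AndE(l, r) =>
    // Before the fix: (translate(l) ++ translate(r)).reduceOption(FAnd)
    // silently dropped a leg that failed to translate. Requiring both legs:
    for { lf <- translate(l); rf <- translate(r) } yield FAnd(lf, rf)
  case OrE(l, r) =>
    for { lf <- translate(l); rf <- translate(r) } yield FOr(lf, rf)
}

// (p1 AND p2) OR p3 with p2 untranslatable must yield no pushed filter at all;
// pushing (p1 OR p3) instead would incorrectly drop rows.
val bad = OrE(AndE(Pred("p1", pushable = true), Pred("p2", pushable = false)),
              Pred("p3", pushable = true))
println(translate(bad)) // None
```

With the old behavior the AND leg would have degenerated to just p1, turning the whole predicate into (p1 OR p3) and filtering out rows that only satisfy p3 via other paths.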
[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...
Github user jliwork commented on the pull request: https://github.com/apache/spark/pull/9920#issuecomment-218837337 @holdenk Definitely :-) Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...
Github user jliwork commented on the pull request: https://github.com/apache/spark/pull/9920#issuecomment-218826896 Yes, I'd like to continue working on this. But since this PR is obsolete, I will close it and open a new one instead.
[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...
Github user jliwork closed the pull request at: https://github.com/apache/spark/pull/9920
[GitHub] spark pull request: [SPARK-14548][SQL] Support not greater than an...
Github user jliwork commented on the pull request: https://github.com/apache/spark/pull/12316#issuecomment-214459899 Awesome. Thank you! :-)
[GitHub] spark pull request: [SPARK-14548][SQL] Support not greater than an...
Github user jliwork commented on the pull request: https://github.com/apache/spark/pull/12316#issuecomment-212281661 @rxin gentle ping
[GitHub] spark pull request: [SPARK-14548][SQL] Support not greater than an...
Github user jliwork commented on the pull request: https://github.com/apache/spark/pull/12316#issuecomment-209606539 @rxin Microsoft SQL Server, IBM DB2 and Sybase Adaptive Server support them. https://msdn.microsoft.com/en-us/library/ms188074.aspx http://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r746.html?lang=en http://infocenter.sybase.com/archive/index.jsp?topic=/com.sybase.dc36271_36272_36273_36274_1250/html/refman/refman237.htm
[GitHub] spark pull request: [SPARK-14548][SQL] Support not greater than an...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/12316#discussion_r59323956 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -1432,4 +1432,15 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { getMessage() assert(e1.startsWith("Path does not exist")) } + + test("test !< and !> comparator") { --- End diff -- Hi @rxin, Thanks for your comment! I've removed this test case from DataFrameSuite. Please feel free to let me know of any other comments or questions.
[GitHub] spark pull request: [SPARK-14548][SQL] Support not greater than an...
GitHub user jliwork opened a pull request: https://github.com/apache/spark/pull/12316 [SPARK-14548][SQL] Support not greater than and not less than operator in Spark SQL ## What changes were proposed in this pull request? !< means not less than, which is equivalent to >=. !> means not greater than, which is equivalent to <=. I'd like to create a PR to support these two operators. ## How was this patch tested? I've added new test cases in: DataFrameSuite, ExpressionParserSuite, JDBCSuite, PlanParserSuite, SQLQuerySuite @dilipbiswal @viirya @gatorsmile You can merge this pull request into a Git repository by running: $ git pull https://github.com/jliwork/spark SPARK-14548 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12316.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12316 commit 6792fe59990fcc925eb86bcbcd297f6f57505c8e Author: jliwork <ji...@us.ibm.com> Date: 2016-04-12T01:33:50Z support not greater than and not less than operator in Spark SQL
[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...
Github user jliwork commented on the pull request: https://github.com/apache/spark/pull/9920#issuecomment-160546154 @jkbradley @holdenk Could you please take a look at my latest fix and let me know of any further comments? Thanks!
[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...
GitHub user jliwork opened a pull request: https://github.com/apache/spark/pull/9920 [SPARK-11569] [ML] Fix StringIndexer to handle null value properly I was having some problems with rebase on https://github.com/apache/spark/pull/9709, so I had to close that PR and create a new pull request with my latest fix. Thanks to @jkbradley and @holdenk for your comments. I have updated my fix so that it allows the user to configure whether to filter out null values or throw an error, via the StringIndexer.setHandleInvalid("skip") API. The default is StringIndexer.setHandleInvalid("error"). Please let me know what you think. Thanks again! You can merge this pull request into a Git repository by running: $ git pull https://github.com/jliwork/spark SPARK-11569-new Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9920.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9920 commit 19de57454cc751074fbbc6e73a3be25f1429e40b Author: Jia Li <ji...@us.ibm.com> Date: 2015-11-23T23:05:26Z fix SPARK-11569 on new branch
[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...
Github user jliwork closed the pull request at: https://github.com/apache/spark/pull/9709
[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...
Github user jliwork commented on the pull request: https://github.com/apache/spark/pull/9709#issuecomment-159098271 @dragos Thanks for pointing it out. I am having some problems with rebase and will close this PR.
[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...
Github user jliwork commented on the pull request: https://github.com/apache/spark/pull/9709#issuecomment-157152669 @mengxr @jkbradley Can you please help take a look at my fix? Thanks!
[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...
GitHub user jliwork opened a pull request: https://github.com/apache/spark/pull/9709 [SPARK-11569] [ML] Fix StringIndexer to handle null value properly The root cause of this bug is that StringIndexer is unable to handle input column with null value. With my fix, null value can also be indexed just like any other string value. @jkbradley @holdenk Please let me know of any comments. Thanks! You can merge this pull request into a Git repository by running: $ git pull https://github.com/jliwork/spark SPARK-11569 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9709.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9709 commit f360172322a71478ecaceedfa97bfeaec65c592c Author: Jia Li <ji...@us.ibm.com> Date: 2015-11-12T16:45:47Z Fix for SPARK-11569
[GitHub] spark pull request: [SPARK-11277] [SQL] sort_array throws exceptio...
Github user jliwork commented on the pull request: https://github.com/apache/spark/pull/9247#issuecomment-151274333 @marmbrus @cloud-fan Thank you for your comment. My current fix takes care of arrays of NullType. I should be able to give an error to block sort_array on struct types as well if you like. I just looked into Spark's ORDER BY on struct types. Since I'm new to Spark, however, it feels a little challenging for me to enhance sort_array to include struct types at this point. Would you be OK if I open another JIRA to support sort_array on struct types, so that someone in the community might be able to help? Thanks,
[GitHub] spark pull request: [SPARK-11277] [SQL] sort_array throws exceptio...
Github user jliwork commented on a diff in the pull request: https://github.com/apache/spark/pull/9247#discussion_r43074419 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionFunctionsSuite.scala --- @@ -64,6 +65,21 @@ class CollectionFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper { checkEvaluation(new SortArray(a3, Literal(false)), Seq("b", "a", null)) checkEvaluation(Literal.create(null, ArrayType(StringType)), null) +checkEvaluation(new SortArray(a4), Seq(null, null)) + +val typeAS = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) +val arrayStruct = Literal.create(Seq(create_row(2), create_row(1)), typeAS) +val sorted = new SortArray(arrayStruct) + +def getArrayStructFields(expr: Expression, fieldName: String): GetArrayStructFields = { + expr.dataType match { +case ArrayType(StructType(fields), containsNull) => + val field = fields.find(_.name == fieldName).get + GetArrayStructFields(expr, field, fields.indexOf(field), fields.length, containsNull) + } +} + +checkEvaluation(getArrayStructFields(sorted, "a"), Seq(1, 2)) --- End diff -- I was just trying to find a way to compare the results. Your approach is much simpler. I have updated the test case based on your comment. Thank you for your comment!
[GitHub] spark pull request: [SPARK-11277] [SQL] sort_array throws exceptio...
Github user jliwork commented on the pull request: https://github.com/apache/spark/pull/9247#issuecomment-151341426 Thank you all. I have updated my fix based on your comments. It should now be able to: 1) skip sorting an array of NULLs 2) sort an array of StructType @cloud-fan @marmbrus @yhuai Please let me know if you have any further questions or comments. Thanks again,
[GitHub] spark pull request: [SPARK-11277] [SQL] sort_array throws exceptio...
Github user jliwork commented on the pull request: https://github.com/apache/spark/pull/9247#issuecomment-150992372 @cloud-fan Thank you very much for your comment! Sorting on an array of NULL type does not make sense to me, either. That's why my fix is to block this kind of sorting for sort_array. 1) I have checked Hive's behavior. Hive does not throw any exception when sorting an array of NULLs. It will simply return the array of NULLs back. With my fix, Spark's sort_array will return the exact same result as Hive. 2) Regarding sorting on structs, I could not find any existing UDF like sort_struct. Hive does not seem to offer such a UDF, either. I'm not very familiar with dataframe's struct type as I'm new to Spark and still learning. Please let me know if you have any other questions or concerns. Thanks again.
[GitHub] spark pull request: [SPARK-11277] [SQL] sort_array throws exceptio...
Github user jliwork commented on the pull request: https://github.com/apache/spark/pull/9247#issuecomment-151030490 @cloud-fan Thanks for the clarification. You have a very good point. This bug's exception occurred because SortArray's Comparators expect AtomicType only. According to the source code docs, AtomicType types 'represent everything that is not null, UDTs, arrays, structs, and maps'. My current fix takes care of NullType only. StructType still causes an exception, so you're right :-) SortArray actually comes with checkInputDataTypes(), which lets NullType and StructType slip through because it uses RowOrdering.isOrderable, which returns true for AtomicType, NullType and StructType. I checked Hive's behavior for struct type elements. For example: select sort_array(array(struct(3, 2, 1), struct(2, 3, 4))) from src Hive does not support sort_array with struct types involved. It returns an error for this query, saying function SORT_ARRAY must be array, but array<struct<col1:int,col2:int,col3:int>> was found. QUESTION - What do you expect the behavior of the query above to be for Spark SQL? I would prefer to match Hive's behavior, which is to block this kind of sorting on struct type elements, since its use case does not seem very meaningful to me. Thanks,
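The null-handling behavior discussed in this thread (an array of only NULLs is returned unchanged, and otherwise NULLs sort first in ascending order, matching the test expectations like Seq(null, null)) can be sketched with plain Scala collections. This is an illustrative model with a made-up sortArray helper over boxed integers, not Spark's actual SortArray expression:

```scala
// Illustrative model only: `sortArray` is a made-up helper, not Spark's
// SortArray. Boxed Integers stand in for nullable array elements.
def sortArray(xs: Seq[Integer]): Seq[Integer] =
  if (xs.forall(_ == null)) xs // an array consisting entirely of NULLs is returned as-is
  else {
    val (nulls, rest) = xs.partition(_ == null)
    // NULLs sort first in ascending order
    nulls ++ rest.sortBy(_.intValue)
  }

println(sortArray(Seq(null, null))) // List(null, null)
println(sortArray(Seq(3, null, 1))) // List(null, 1, 3)
```

The first case mirrors the checkEvaluation(new SortArray(a4), Seq(null, null)) test added in the PR: the all-NULL array never reaches a comparator that expects an AtomicType.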
[GitHub] spark pull request: [SPARK-11277] [SQL] sort_array throws exceptio...
GitHub user jliwork opened a pull request: https://github.com/apache/spark/pull/9247 [SPARK-11277] [SQL] sort_array throws exception scala.MatchError I'm new to Spark. I was trying out the sort_array function and hit this exception. I looked into the Spark source code and found the root cause is that sort_array does not check for an array of NULLs. It's not meaningful to sort an array consisting entirely of NULLs anyway. I'm adding a check on the input array type to SortArray. If the array consists entirely of NULLs, there is no need to sort such an array. I have also added a test case for this. Please help review my fix. Thanks! You can merge this pull request into a Git repository by running: $ git pull https://github.com/jliwork/spark SPARK-11277 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9247.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9247 commit 998b8b84f2667b1d16c732d63ba70ab6bfc5b56e Author: Jia Li <ji...@us.ibm.com> Date: 2015-10-23T09:06:26Z [SPARK-11277] [SQL] sort_array throws exception scala.MatchError