[GitHub] spark issue #19776: [SPARK-22548][SQL] Incorrect nested AND expression pushe...

2017-11-21 Thread jliwork
Github user jliwork commented on the issue:

https://github.com/apache/spark/pull/19776
  
@gatorsmile @cloud-fan @viirya @HyukjinKwon Thanks a lot! =)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-21 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r152429647
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
 ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.test.SharedSQLContext
+
+class DataSourceStrategySuite extends PlanTest with SharedSQLContext {
+
+  test("translate simple expression") {
+    val attrInt = 'cint.int
+    val attrStr = 'cstr.string
+
+    testTranslateFilter(EqualTo(attrInt, 1), Some(sources.EqualTo("cint", 1)))
+    testTranslateFilter(EqualTo(1, attrInt), Some(sources.EqualTo("cint", 1)))
+
+    testTranslateFilter(EqualNullSafe(attrStr, Literal(null)),
+      Some(sources.EqualNullSafe("cstr", null)))
+    testTranslateFilter(EqualNullSafe(Literal(null), attrStr),
+      Some(sources.EqualNullSafe("cstr", null)))
+
+    testTranslateFilter(GreaterThan(attrInt, 1), Some(sources.GreaterThan("cint", 1)))
+    testTranslateFilter(GreaterThan(1, attrInt), Some(sources.LessThan("cint", 1)))
+
+    testTranslateFilter(LessThan(attrInt, 1), Some(sources.LessThan("cint", 1)))
+    testTranslateFilter(LessThan(1, attrInt), Some(sources.GreaterThan("cint", 1)))
+
+    testTranslateFilter(GreaterThanOrEqual(attrInt, 1), Some(sources.GreaterThanOrEqual("cint", 1)))
+    testTranslateFilter(GreaterThanOrEqual(1, attrInt), Some(sources.LessThanOrEqual("cint", 1)))
+
+    testTranslateFilter(LessThanOrEqual(attrInt, 1), Some(sources.LessThanOrEqual("cint", 1)))
+    testTranslateFilter(LessThanOrEqual(1, attrInt), Some(sources.GreaterThanOrEqual("cint", 1)))
+
+    testTranslateFilter(InSet(attrInt, Set(1, 2, 3)), Some(sources.In("cint", Array(1, 2, 3))))
+
+    testTranslateFilter(In(attrInt, Seq(1, 2, 3)), Some(sources.In("cint", Array(1, 2, 3))))
+
+    testTranslateFilter(IsNull(attrInt), Some(sources.IsNull("cint")))
+    testTranslateFilter(IsNotNull(attrInt), Some(sources.IsNotNull("cint")))
+
+    // cint > 1 AND cint < 10
+    testTranslateFilter(And(
+      GreaterThan(attrInt, 1),
+      LessThan(attrInt, 10)),
+      Some(sources.And(
+        sources.GreaterThan("cint", 1),
+        sources.LessThan("cint", 10))))
+
+    // cint >= 8 OR cint <= 2
+    testTranslateFilter(Or(
+      GreaterThanOrEqual(attrInt, 8),
+      LessThanOrEqual(attrInt, 2)),
+      Some(sources.Or(
+        sources.GreaterThanOrEqual("cint", 8),
+        sources.LessThanOrEqual("cint", 2))))
+
+    testTranslateFilter(Not(GreaterThanOrEqual(attrInt, 8)),
+      Some(sources.Not(sources.GreaterThanOrEqual("cint", 8))))
+
+    testTranslateFilter(StartsWith(attrStr, "a"), Some(sources.StringStartsWith("cstr", "a")))
+
+    testTranslateFilter(EndsWith(attrStr, "a"), Some(sources.StringEndsWith("cstr", "a")))
+
+    testTranslateFilter(Contains(attrStr, "a"), Some(sources.StringContains("cstr", "a")))
+  }
+
+  test("translate complex expression") {
+    val attrInt = 'cint.int
+
+    // ABS(cint) - 2 = 1
+    testTranslateFilter(LessThanOrEqual(
+      // Expressions are not supported
--- End diff --

good catch @_@ fixed the typo. Thanks!


---




[GitHub] spark issue #19776: [SPARK-22548][SQL] Incorrect nested AND expression pushe...

2017-11-21 Thread jliwork
Github user jliwork commented on the issue:

https://github.com/apache/spark/pull/19776
  
Thanks for everyone's comments! I have polished the test cases. 


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-21 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r152421950
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
 ---
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.test.SharedSQLContext
+
+
+class DataSourceStrategySuite extends PlanTest with SharedSQLContext {
+
+  test("translate simple expression") {
+    val attrInt = 'cint.int
+    val attrStr = 'cstr.string
+
+    assertResult(Some(sources.EqualTo("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.EqualTo(attrInt, 1))
+    }
--- End diff --

Thanks! I've followed your suggestion and the test suite looks cleaner now. 


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-21 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r152422048
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
 ---
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.test.SharedSQLContext
+
+
+class DataSourceStrategySuite extends PlanTest with SharedSQLContext {
+
+  test("translate simple expression") {
+    val attrInt = 'cint.int
+    val attrStr = 'cstr.string
+
+    assertResult(Some(sources.EqualTo("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.EqualTo(attrInt, 1))
+    }
+    assertResult(Some(sources.EqualTo("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.EqualTo(1, attrInt))
+    }
+
+    assertResult(Some(sources.EqualNullSafe("cstr", null))) {
+      DataSourceStrategy.translateFilter(
+        expressions.EqualNullSafe(attrStr, Literal(null)))
+    }
+    assertResult(Some(sources.EqualNullSafe("cstr", null))) {
+      DataSourceStrategy.translateFilter(
+        expressions.EqualNullSafe(Literal(null), attrStr))
+    }
+
+    assertResult(Some(sources.GreaterThan("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.GreaterThan(attrInt, 1))
+    }
+    assertResult(Some(sources.GreaterThan("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.LessThan(1, attrInt))
+    }
+
+    assertResult(Some(sources.LessThan("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.LessThan(attrInt, 1))
+    }
+    assertResult(Some(sources.LessThan("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.GreaterThan(1, attrInt))
+    }
+
+    assertResult(Some(sources.GreaterThanOrEqual("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.GreaterThanOrEqual(attrInt, 1))
+    }
+    assertResult(Some(sources.GreaterThanOrEqual("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.LessThanOrEqual(1, attrInt))
+    }
+
+    assertResult(Some(sources.LessThanOrEqual("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.LessThanOrEqual(attrInt, 1))
+    }
+    assertResult(Some(sources.LessThanOrEqual("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.GreaterThanOrEqual(1, attrInt))
+    }
+
+    assertResult(Some(sources.In("cint", Array(1, 2, 3)))) {
+      DataSourceStrategy.translateFilter(
+        expressions.InSet(attrInt, Set(1, 2, 3)))
+    }
+
+    assertResult(Some(sources.In("cint", Array(1, 2, 3)))) {
+      DataSourceStrategy.translateFilter(
+        expressions.In(attrInt, Seq(1, 2, 3)))
+    }
+
+    assertResult(Some(sources.IsNull("cint"))) {
+      DataSourceStrategy.translateFilter(
+        expressions.IsNull(attrInt))
+    }
+    assertResult(Some(sources.IsNotNull("cint"))) {
+      DataSourceStrategy.translateFilter(
+        expressions.IsNotNull(attrInt))
+    }
+
+    assertResult(Some(sources.And(
+      sources.GreaterThan("cint", 1),
+      sources.LessThan("cint", 10)))) {
+      DataSourceStrategy.translateFilter(expressions.And(

[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-21 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r152421982
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala 
---
@@ -296,9 +296,15 @@ class JDBCSuite extends SparkFunSuite
     // The older versions of spark have this kind of bugs in parquet data source.
     val df1 = sql("SELECT * FROM foobar WHERE NOT (THEID != 2 AND NAME != 'mary')")
--- End diff --

Fixed.


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-21 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r152421773
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
 ---
@@ -0,0 +1,307 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions._
--- End diff --

Thanks for the suggestion. Fixed. 


---




[GitHub] spark issue #19776: [SPARK-22548][SQL] Incorrect nested AND expression pushe...

2017-11-21 Thread jliwork
Github user jliwork commented on the issue:

https://github.com/apache/spark/pull/19776
  
@cloud-fan Thank you for your comments! I have updated the test cases as 
you suggested. 


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-20 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r152193847
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
 ---
@@ -0,0 +1,306 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.sources
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types._
+
+
+class DataSourceStrategySuite extends PlanTest with SharedSQLContext {
+
+  test("translate simple expression") {
+    val attrInt = AttributeReference("cint", IntegerType)()
+    val attrStr = AttributeReference("cstr", StringType)()
--- End diff --

Fixed. Thanks!


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-20 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r152193833
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
 ---
@@ -0,0 +1,305 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.{sources, QueryTest}
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types._
+
+
+class DataSourceStrategySuite extends QueryTest with SharedSQLContext {
+
+  test("translate simple expression") {
+    val attrInt = AttributeReference("cint", IntegerType)()
+    val attrStr = AttributeReference("cstr", StringType)()
+
+    assertResult(Some(sources.EqualTo("cint", 1))) {
+      DataSourceStrategy.translateFilter(
+        expressions.EqualTo(attrInt, Literal(1)))
--- End diff --

Fixed. Thanks!


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-20 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r152191430
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
 ---
@@ -0,0 +1,305 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.{sources, QueryTest}
+import org.apache.spark.sql.catalyst.expressions
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types._
+
+
+class DataSourceStrategySuite extends QueryTest with SharedSQLContext {
--- End diff --

Fixed. Thanks!


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-20 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r152186116
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala 
---
@@ -296,8 +296,33 @@ class JDBCSuite extends SparkFunSuite
     // The older versions of spark have this kind of bugs in parquet data source.
     val df1 = sql("SELECT * FROM foobar WHERE NOT (THEID != 2 AND NAME != 'mary')")
     val df2 = sql("SELECT * FROM foobar WHERE NOT (THEID != 2) OR NOT (NAME != 'mary')")
+    val df3 = sql("SELECT * FROM foobar WHERE (THEID > 0 AND NAME = 'mary') OR (NAME = 'fred')")
+    val df4 = sql("SELECT * FROM foobar " +
+      "WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME = 'fred')")
+    val df5 = sql("SELECT * FROM foobar " +
+      "WHERE THEID > 0 AND TRIM(NAME) = 'mary' AND LENGTH(NAME) > 3")
+    val df6 = sql("SELECT * FROM foobar " +
+      "WHERE THEID < 0 OR NAME = 'mary' OR NAME = 'fred'")
+    val df7 = sql("SELECT * FROM foobar " +
+      "WHERE THEID < 0 OR TRIM(NAME) = 'mary' OR NAME = 'fred'")
+    val df8 = sql("SELECT * FROM foobar " +
+      "WHERE NOT((THEID < 0 OR NAME != 'mary') AND (THEID != 1 OR NAME != 'fred'))")
+    val df9 = sql("SELECT * FROM foobar " +
+      "WHERE NOT((THEID < 0 OR NAME != 'mary') AND (THEID != 1 OR TRIM(NAME) != 'fred'))")
+    val df10 = sql("SELECT * FROM foobar " +
+      "WHERE (NOT(THEID < 0 OR TRIM(NAME) != 'mary')) OR (THEID = 1 AND NAME = 'fred')")
+
     assert(df1.collect.toSet === Set(Row("mary", 2)))
     assert(df2.collect.toSet === Set(Row("mary", 2)))
+    assert(df3.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df4.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df5.collect.toSet === Set(Row("mary", 2)))
+    assert(df6.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df7.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df8.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df9.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df10.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
--- End diff --

I went ahead and added a new ```DataSourceStrategySuite``` to test 
```translateFilter``` directly. Please feel free to let me know of any further 
comments. Thanks! 


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-20 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r152158653
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala 
---
@@ -296,8 +296,33 @@ class JDBCSuite extends SparkFunSuite
     // The older versions of spark have this kind of bugs in parquet data source.
     val df1 = sql("SELECT * FROM foobar WHERE NOT (THEID != 2 AND NAME != 'mary')")
     val df2 = sql("SELECT * FROM foobar WHERE NOT (THEID != 2) OR NOT (NAME != 'mary')")
+    val df3 = sql("SELECT * FROM foobar WHERE (THEID > 0 AND NAME = 'mary') OR (NAME = 'fred')")
+    val df4 = sql("SELECT * FROM foobar " +
+      "WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME = 'fred')")
+    val df5 = sql("SELECT * FROM foobar " +
+      "WHERE THEID > 0 AND TRIM(NAME) = 'mary' AND LENGTH(NAME) > 3")
+    val df6 = sql("SELECT * FROM foobar " +
+      "WHERE THEID < 0 OR NAME = 'mary' OR NAME = 'fred'")
+    val df7 = sql("SELECT * FROM foobar " +
+      "WHERE THEID < 0 OR TRIM(NAME) = 'mary' OR NAME = 'fred'")
+    val df8 = sql("SELECT * FROM foobar " +
+      "WHERE NOT((THEID < 0 OR NAME != 'mary') AND (THEID != 1 OR NAME != 'fred'))")
+    val df9 = sql("SELECT * FROM foobar " +
+      "WHERE NOT((THEID < 0 OR NAME != 'mary') AND (THEID != 1 OR TRIM(NAME) != 'fred'))")
+    val df10 = sql("SELECT * FROM foobar " +
+      "WHERE (NOT(THEID < 0 OR TRIM(NAME) != 'mary')) OR (THEID = 1 AND NAME = 'fred')")
+
     assert(df1.collect.toSet === Set(Row("mary", 2)))
     assert(df2.collect.toSet === Set(Row("mary", 2)))
+    assert(df3.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df4.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df5.collect.toSet === Set(Row("mary", 2)))
+    assert(df6.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df7.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df8.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df9.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df10.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
--- End diff --

@gatorsmile Thanks for the comments, Sean =) Do you mean you'd like to see test 
cases calling ```DataSourceStrategy.translateFilter``` directly, or end-to-end 
queries like ```sql("SELECT * FROM foobar WHERE THEID < 1")``` that go through the 
translateFilter code path? I'm fine either way, just want to clarify. 


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-20 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r152137096
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala 
---
@@ -296,8 +296,33 @@ class JDBCSuite extends SparkFunSuite
     // The older versions of spark have this kind of bugs in parquet data source.
     val df1 = sql("SELECT * FROM foobar WHERE NOT (THEID != 2 AND NAME != 'mary')")
     val df2 = sql("SELECT * FROM foobar WHERE NOT (THEID != 2) OR NOT (NAME != 'mary')")
+    val df3 = sql("SELECT * FROM foobar WHERE (THEID > 0 AND NAME = 'mary') OR (NAME = 'fred')")
+    val df4 = sql("SELECT * FROM foobar " +
+      "WHERE (THEID > 0 AND TRIM(NAME) = 'mary') OR (NAME = 'fred')")
+    val df5 = sql("SELECT * FROM foobar " +
+      "WHERE THEID > 0 AND TRIM(NAME) = 'mary' AND LENGTH(NAME) > 3")
+    val df6 = sql("SELECT * FROM foobar " +
+      "WHERE THEID < 0 OR NAME = 'mary' OR NAME = 'fred'")
+    val df7 = sql("SELECT * FROM foobar " +
+      "WHERE THEID < 0 OR TRIM(NAME) = 'mary' OR NAME = 'fred'")
+    val df8 = sql("SELECT * FROM foobar " +
+      "WHERE NOT((THEID < 0 OR NAME != 'mary') AND (THEID != 1 OR NAME != 'fred'))")
+    val df9 = sql("SELECT * FROM foobar " +
+      "WHERE NOT((THEID < 0 OR NAME != 'mary') AND (THEID != 1 OR TRIM(NAME) != 'fred'))")
+    val df10 = sql("SELECT * FROM foobar " +
+      "WHERE (NOT(THEID < 0 OR TRIM(NAME) != 'mary')) OR (THEID = 1 AND NAME = 'fred')")
+
     assert(df1.collect.toSet === Set(Row("mary", 2)))
     assert(df2.collect.toSet === Set(Row("mary", 2)))
+    assert(df3.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df4.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df5.collect.toSet === Set(Row("mary", 2)))
+    assert(df6.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df7.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df8.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df9.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
+    assert(df10.collect.toSet === Set(Row("fred", 1), Row("mary", 2)))
--- End diff --

Sure. I can help.


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-20 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r152136470
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 ---
@@ -497,7 +497,19 @@ object DataSourceStrategy {
         Some(sources.IsNotNull(a.name))
 
       case expressions.And(left, right) =>
-        (translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
+        // See SPARK-12218 for detailed discussion
+        // It is not safe to just convert one side if we do not understand the
+        // other side. Here is an example used to explain the reason.
+        // Let's say we have (a = 2 AND trim(b) = 'blah') OR (c > 0)
+        // and we do not understand how to convert trim(b) = 'blah'.
+        // If we only convert a = 2, we will end up with
+        // (a = 2) OR (c > 0), which will generate wrong results.
+        // Pushing one leg of AND down is only safe to do at the top level.
+        // You can see ParquetFilters' createFilter for more details.
+        for {
+          leftFilter <- translateFilter(left)
+          rightFilter <- translateFilter(right)
+        } yield sources.And(leftFilter, rightFilter)
--- End diff --

I would think so. SPARK-12218 put fixes into ParquetFilters.createFilter 
and OrcFilters.createFilter. They're similar to 
DataSourceStrategy.translateFilter but have different signatures customized for 
Parquet and ORC. For all data sources, including JDBC, Parquet, etc., 
translateFilter is called to determine whether a predicate Expression can be 
pushed down as a Filter. Then, for Parquet and ORC, those Filters get mapped to 
Parquet- or ORC-specific filters by their own createFilter methods. 

So this PR helps all data sources get the correct set of push-down 
predicates. Without this PR we simply got lucky with Parquet and ORC in terms 
of result correctness, because 1) it looks like we always apply a Filter on top 
of the scan; and 2) with one leg missing from an AND, we end up with the same 
number of rows or more. 

The JDBC data source does not always apply a Filter on top of the scan, and 
therefore exposed the bug. 


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-20 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r151922970
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 ---
@@ -497,7 +497,11 @@ object DataSourceStrategy {
         Some(sources.IsNotNull(a.name))
 
       case expressions.And(left, right) =>
-        (translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
+        // See SPARK-12218 and PR 10362 for detailed discussion
--- End diff --

@viirya I see. Thanks, Simon! I've removed the PR number from the comment.


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-19 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r151920447
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 ---
@@ -497,7 +497,10 @@ object DataSourceStrategy {
         Some(sources.IsNotNull(a.name))
 
       case expressions.And(left, right) =>
-        (translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
+        for {
--- End diff --

Thanks. Just did that as you suggested.


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-19 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r151920126
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 ---
@@ -497,7 +497,11 @@ object DataSourceStrategy {
         Some(sources.IsNotNull(a.name))
 
       case expressions.And(left, right) =>
-        (translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
+        // See SPARK-12218 and PR 10362 for detailed discussion
--- End diff --

Sure. I have added more comments there with an example. Thanks, Sean!


---




[GitHub] spark issue #19776: [SPARK-22548][SQL] Incorrect nested AND expression pushe...

2017-11-19 Thread jliwork
Github user jliwork commented on the issue:

https://github.com/apache/spark/pull/19776
  
@viirya I'm fine with backport to 2.2 unless anyone objects.


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-19 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/19776#discussion_r151901160
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -497,7 +497,10 @@ object DataSourceStrategy {
 Some(sources.IsNotNull(a.name))
 
   case expressions.And(left, right) =>
-(translateFilter(left) ++ translateFilter(right)).reduceOption(sources.And)
+for {
--- End diff --

Sure. Will do. Thanks.


---




[GitHub] spark issue #19776: [SPARK-22548][SQL] Incorrect nested AND expression pushe...

2017-11-19 Thread jliwork
Github user jliwork commented on the issue:

https://github.com/apache/spark/pull/19776
  
@viirya Thanks for letting me know, Simon. I've fixed the title. Could 
someone please help trigger the tests?


---




[GitHub] spark pull request #19776: [SPARK-22548][SQL] Incorrect nested AND expressio...

2017-11-17 Thread jliwork
GitHub user jliwork opened a pull request:

https://github.com/apache/spark/pull/19776

[SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data source

## What changes were proposed in this pull request?

Let's say I have the nested AND expression shown below, where p2 cannot be 
pushed down:

(p1 AND p2) OR p3

In the current Spark code, during data source filter translation, (p1 AND p2) 
is returned as p1 only and p2 is simply lost. This issue occurs with the JDBC 
data source and is similar to SPARK-12218 for Parquet. When an AND is nested 
below another expression, we should push down either both legs or nothing. 
Note that 1) the current Spark code always splits a conjunctive predicate 
before determining whether a predicate can be pushed down; 2) the same 
translation method is also called by Data Source V2.
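The issue can be sketched with a toy model of the filter translator (all names here, such as Pred, translateBuggy and translateFixed, are simplified stand-ins for illustration, not Spark's actual DataSourceStrategy API):

```scala
// Toy expression tree: a predicate is either pushable to the source or not.
sealed trait Expr
case class Pred(name: String, pushable: Boolean) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

// Pre-fix behavior: And keeps whichever legs translate,
// silently dropping an untranslatable leg.
def translateBuggy(e: Expr): Option[String] = e match {
  case Pred(n, ok) => if (ok) Some(n) else None
  case And(l, r) =>
    (translateBuggy(l) ++ translateBuggy(r)).reduceOption((a, b) => s"($a AND $b)")
  case Or(l, r) =>
    for { ls <- translateBuggy(l); rs <- translateBuggy(r) } yield s"($ls OR $rs)"
}

// Fixed behavior: both legs of an And must translate, or the whole
// And is rejected and Spark evaluates the filter itself.
def translateFixed(e: Expr): Option[String] = e match {
  case Pred(n, ok) => if (ok) Some(n) else None
  case And(l, r) =>
    for { ls <- translateFixed(l); rs <- translateFixed(r) } yield s"($ls AND $rs)"
  case Or(l, r) =>
    for { ls <- translateFixed(l); rs <- translateFixed(r) } yield s"($ls OR $rs)"
}

// (p1 AND p2) OR p3, where p2 cannot be pushed down.
val expr = Or(And(Pred("p1", pushable = true), Pred("p2", pushable = false)),
              Pred("p3", pushable = true))

println(translateBuggy(expr)) // drops p2: the pushed filter is too permissive
println(translateFixed(expr)) // pushes nothing: correct, just less efficient
```

The buggy path turns the whole expression into (p1 OR p3), which returns rows the original filter would reject; the fixed path refuses the pushdown entirely, trading efficiency for correctness.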

## How was this patch tested?

Added new unit test cases to JDBCSuite


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jliwork/spark spark-22548

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19776.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19776


commit 58de88c21210d469b8ef14b1f23764c31ca5651e
Author: Jia Li <ji...@us.ibm.com>
Date:   2017-11-18T01:15:01Z

[SPARK-22548][SQL] Incorrect nested AND expression pushed down to JDBC data 
source




---




[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...

2016-05-12 Thread jliwork
Github user jliwork commented on the pull request:

https://github.com/apache/spark/pull/9920#issuecomment-218837337
  
@holdenk Definitely :-) Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...

2016-05-12 Thread jliwork
Github user jliwork commented on the pull request:

https://github.com/apache/spark/pull/9920#issuecomment-218826896
  
Yes, I'd like to continue working on this. But since this PR is obsolete, I 
will close it and open a new one instead.


---



[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...

2016-05-12 Thread jliwork
Github user jliwork closed the pull request at:

https://github.com/apache/spark/pull/9920


---



[GitHub] spark pull request: [SPARK-14548][SQL] Support not greater than an...

2016-04-25 Thread jliwork
Github user jliwork commented on the pull request:

https://github.com/apache/spark/pull/12316#issuecomment-214459899
  
Awesome. Thank you! :-)


---



[GitHub] spark pull request: [SPARK-14548][SQL] Support not greater than an...

2016-04-20 Thread jliwork
Github user jliwork commented on the pull request:

https://github.com/apache/spark/pull/12316#issuecomment-212281661
  
@rxin gentle ping


---



[GitHub] spark pull request: [SPARK-14548][SQL] Support not greater than an...

2016-04-13 Thread jliwork
Github user jliwork commented on the pull request:

https://github.com/apache/spark/pull/12316#issuecomment-209606539
  
@rxin Microsoft SQL Server, IBM DB2 and Sybase Adaptive Server support 
them. 

https://msdn.microsoft.com/en-us/library/ms188074.aspx

http://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r746.html?lang=en

http://infocenter.sybase.com/archive/index.jsp?topic=/com.sybase.dc36271_36272_36273_36274_1250/html/refman/refman237.htm



---



[GitHub] spark pull request: [SPARK-14548][SQL] Support not greater than an...

2016-04-11 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/12316#discussion_r59323956
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
@@ -1432,4 +1432,15 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
   getMessage()
 assert(e1.startsWith("Path does not exist"))
   }
+
+  test("test !< and !> comparator") {
--- End diff --

Hi @rxin, 
Thanks for your comment! I've removed this test case from DataFrameSuite. 
Please feel free to let me know of any other comments or questions.


---



[GitHub] spark pull request: [SPARK-14548][SQL] Support not greater than an...

2016-04-11 Thread jliwork
GitHub user jliwork opened a pull request:

https://github.com/apache/spark/pull/12316

[SPARK-14548][SQL] Support not greater than and not less than operator in 
Spark SQL

## What changes were proposed in this pull request?

!< means not less than, which is equivalent to >=.
!> means not greater than, which is equivalent to <=.

I'd like to create a PR to support these two operators.
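Since the two operators are pure aliases, the rewrite can be sketched as a trivial desugaring (a hypothetical illustration with made-up names; the actual change lives in Spark's SQL parser rules):

```scala
// Hypothetical sketch: each new comparator desugars to an existing one,
// so no new expression node is needed anywhere downstream.
def desugarComparator(op: String): String = op match {
  case "!<"  => ">=" // "not less than"
  case "!>"  => "<=" // "not greater than"
  case other => other // every other comparator is left untouched
}

println(desugarComparator("!<"))
println(desugarComparator("!>"))
```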

## How was this patch tested?

I've added new test cases in: DataFrameSuite, ExpressionParserSuite, 
JDBCSuite, PlanParserSuite, SQLQuerySuite

@dilipbiswal @viirya @gatorsmile 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jliwork/spark SPARK-14548

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12316.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12316


commit 6792fe59990fcc925eb86bcbcd297f6f57505c8e
Author: jliwork <ji...@us.ibm.com>
Date:   2016-04-12T01:33:50Z

support not greater than and not less than operator in Spark SQL




---



[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...

2015-11-29 Thread jliwork
Github user jliwork commented on the pull request:

https://github.com/apache/spark/pull/9920#issuecomment-160546154
  
@jkbradley @holdenk Could you please take a look at my latest fix and let me 
know of any further comments? Thanks!


---



[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...

2015-11-23 Thread jliwork
GitHub user jliwork opened a pull request:

https://github.com/apache/spark/pull/9920

[SPARK-11569] [ML] Fix StringIndexer to handle null value properly

I was having some problems with rebase on 
https://github.com/apache/spark/pull/9709, so I had to close that PR and 
create a new pull request with my latest fix. 

Thanks to @jkbradley and @holdenk for your comments. I have updated my fix 
so that it allows the user to configure whether to filter out null values or 
throw an error, via the StringIndexer.setHandleInvalid("skip") API. The 
default is StringIndexer.setHandleInvalid("error").
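The skip/error semantics can be sketched with a toy model ("skip" filters nulls out, "error" raises). This is an illustration of the behavior only; indexLabels and its frequency-based ordering are simplified stand-ins, not Spark ML's actual StringIndexer code:

```scala
// Toy model of handleInvalid: "skip" drops nulls, anything else (the
// "error" default) rejects input that contains a null.
def indexLabels(values: Seq[String], handleInvalid: String = "error"): Seq[Int] = {
  val cleaned = handleInvalid match {
    case "skip" => values.filter(_ != null)
    case _ =>
      require(!values.contains(null), "StringIndexer encountered NULL value")
      values
  }
  // Assign indices by descending label frequency (most frequent label -> 0).
  val labels = cleaned.distinct.sortBy(v => -cleaned.count(_ == v))
  cleaned.map(labels.indexOf)
}

println(indexLabels(Seq("a", null, "b", "a"), handleInvalid = "skip"))
```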

Please let me know what you think. Thanks again!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jliwork/spark SPARK-11569-new

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9920.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9920


commit 19de57454cc751074fbbc6e73a3be25f1429e40b
Author: Jia Li <ji...@us.ibm.com>
Date:   2015-11-23T23:05:26Z

fix SPARK-11569 on new branch




---



[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...

2015-11-23 Thread jliwork
Github user jliwork closed the pull request at:

https://github.com/apache/spark/pull/9709


---



[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...

2015-11-23 Thread jliwork
Github user jliwork commented on the pull request:

https://github.com/apache/spark/pull/9709#issuecomment-159098271
  
@dragos Thanks for pointing it out. I'm having some problems with rebase 
and will close this PR.


---



[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...

2015-11-16 Thread jliwork
Github user jliwork commented on the pull request:

https://github.com/apache/spark/pull/9709#issuecomment-157152669
  
@mengxr @jkbradley Can you please help to take a look at my fix? Thanks!


---



[GitHub] spark pull request: [SPARK-11569] [ML] Fix StringIndexer to handle...

2015-11-13 Thread jliwork
GitHub user jliwork opened a pull request:

https://github.com/apache/spark/pull/9709

[SPARK-11569] [ML] Fix StringIndexer to handle null value properly

The root cause of this bug is that StringIndexer is unable to handle an input 
column with null values. With my fix, null values can be indexed just like 
any other string value.

@jkbradley @holdenk Please let me know of any comments. Thanks!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jliwork/spark SPARK-11569

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9709.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9709


commit f360172322a71478ecaceedfa97bfeaec65c592c
Author: Jia Li <ji...@us.ibm.com>
Date:   2015-11-12T16:45:47Z

Fix for SPARK-11569




---



[GitHub] spark pull request: [SPARK-11277] [SQL] sort_array throws exceptio...

2015-10-26 Thread jliwork
Github user jliwork commented on the pull request:

https://github.com/apache/spark/pull/9247#issuecomment-151274333
  
@marmbrus @cloud-fan Thank you for your comments. My current fix takes care 
of arrays of NullType. I should be able to raise an error to block sort_array 
on struct type as well if you like. I just looked into Spark's ORDER BY on 
struct type. Since I'm new to Spark, however, enhancing sort_array to cover 
struct type feels a little challenging for me at this point. Would you be OK 
if I open another JIRA to support sort_array on struct type, so that someone 
in the community might be able to help?

Thanks,


---



[GitHub] spark pull request: [SPARK-11277] [SQL] sort_array throws exceptio...

2015-10-26 Thread jliwork
Github user jliwork commented on a diff in the pull request:

https://github.com/apache/spark/pull/9247#discussion_r43074419
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionFunctionsSuite.scala ---
@@ -64,6 +65,21 @@ class CollectionFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper {
 checkEvaluation(new SortArray(a3, Literal(false)), Seq("b", "a", null))
 
 checkEvaluation(Literal.create(null, ArrayType(StringType)), null)
+checkEvaluation(new SortArray(a4), Seq(null, null))
+
+val typeAS = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
+val arrayStruct = Literal.create(Seq(create_row(2), create_row(1)), typeAS)
+val sorted = new SortArray(arrayStruct)
+
+def getArrayStructFields(expr: Expression, fieldName: String): GetArrayStructFields = {
+  expr.dataType match {
+case ArrayType(StructType(fields), containsNull) =>
+  val field = fields.find(_.name == fieldName).get
+  GetArrayStructFields(expr, field, fields.indexOf(field), fields.length, containsNull)
+  }
+}
+
+checkEvaluation(getArrayStructFields(sorted, "a"), Seq(1, 2))
--- End diff --

I was just trying to find a way to compare the results. Your approach is 
much simpler. I have updated the test case based on your suggestion. Thank 
you!


---



[GitHub] spark pull request: [SPARK-11277] [SQL] sort_array throws exceptio...

2015-10-26 Thread jliwork
Github user jliwork commented on the pull request:

https://github.com/apache/spark/pull/9247#issuecomment-151341426
  
Thank you all. I have updated my fix based on your comments. It should now 
be able to:
1) skip sorting an array of NULLs 
2) sort an array of StructType
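The two behaviors can be sketched with a toy model, treating structs as plain Seq[Int] rows (stand-ins for Spark's internal rows; this is an illustration only, not the actual SortArray implementation):

```scala
// Toy sketch: 1) an array consisting entirely of NULLs is returned as-is;
// 2) struct elements compare field by field, left to right.
def sortStructArray(arr: Seq[Seq[Int]]): Seq[Seq[Int]] =
  if (arr.forall(_ == null)) arr // nothing meaningful to sort
  else arr.sortWith { (a, b) =>
    // lexicographic comparison on the first differing field
    a.zip(b).find { case (x, y) => x != y }.exists { case (x, y) => x < y }
  }

println(sortStructArray(Seq(Seq(3, 2, 1), Seq(2, 3, 4))))
println(sortStructArray(Seq(null, null)))
```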

@cloud-fan @marmbrus @yhuai 
Please let me know if you have any further questions or comments. Thanks 
again,



---



[GitHub] spark pull request: [SPARK-11277] [SQL] sort_array throws exceptio...

2015-10-25 Thread jliwork
Github user jliwork commented on the pull request:

https://github.com/apache/spark/pull/9247#issuecomment-150992372
  
@cloud-fan Thank you very much for your comment! Sorting on an array of NULL 
type does not make sense to me, either. That's why my fix blocks this kind of 
sorting for sort_array. 

1) I have checked Hive's behavior. Hive does not throw any exception when 
sorting an array of NULLs; it simply returns the array of NULLs back. With 
my fix, Spark's sort_array will return the exact same result as Hive. 
2) Regarding sorting on struct, I could not find any existing UDF like 
sort_struct. Hive does not seem to offer such a UDF, either. I'm not very 
familiar with DataFrame's struct type as I'm new to Spark and still learning.

Please let me know if you have any other questions or concerns. Thanks 
again. 


---



[GitHub] spark pull request: [SPARK-11277] [SQL] sort_array throws exceptio...

2015-10-25 Thread jliwork
Github user jliwork commented on the pull request:

https://github.com/apache/spark/pull/9247#issuecomment-151030490
  
@cloud-fan Thanks for the clarification. You have a very good point. This 
bug's exception occurred because SortArray's comparators expect AtomicType 
only. According to the source code docs, AtomicType 'represent everything 
that is not null, UDTs, arrays, structs, and maps'. My current fix takes 
care of NullType only. StructType still causes an exception, so you're 
right :-)

SortArray actually comes with checkInputDataTypes(), which lets NullType and 
StructType slip through because it uses RowOrdering.isOrderable, which 
returns true for AtomicType, NullType and StructType. I checked Hive's 
behavior for struct type elements. For example, 

select sort_array(array(struct(3, 2, 1), struct(2, 3, 4))) from src

Hive does not support sort_array with struct type involved. It returns an 
error for this query, saying the argument of function SORT_ARRAY must be 
array, but array<struct<col1:int,col2:int,col3:int>> was found. 

QUESTION - What do you expect the behavior of the query above to be for 
Spark SQL? I would prefer to match Hive's behavior, which is to block this 
kind of sorting on struct type elements, since its use case does not seem 
very meaningful to me.

Thanks,


---



[GitHub] spark pull request: [SPARK-11277] [SQL] sort_array throws exceptio...

2015-10-23 Thread jliwork
GitHub user jliwork opened a pull request:

https://github.com/apache/spark/pull/9247

[SPARK-11277] [SQL] sort_array throws exception scala.MatchError

I'm new to Spark. I was trying out the sort_array function and hit this 
exception. I looked into the Spark source code and found the root cause: 
sort_array does not check for an array of NULLs. It's not meaningful to sort 
an array consisting entirely of NULLs anyway.

I'm adding a check on the input array type to SortArray. If the array 
consists entirely of NULLs, there is no need to sort it. I have also added a 
test case for this.

Please help to review my fix. Thanks!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jliwork/spark SPARK-11277

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9247.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9247


commit 998b8b84f2667b1d16c732d63ba70ab6bfc5b56e
Author: Jia Li <ji...@us.ibm.com>
Date:   2015-10-23T09:06:26Z

[SPARK-11277] [SQL] sort_array throws exception scala.MatchError




---