[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...

2018-11-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20163


---




[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...

2018-01-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r161677327
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
 }
 
 case StringType => (obj: Any) => nullSafeConvert(obj) {
+  case _: Calendar => null
   case _ => UTF8String.fromString(obj.toString)
--- End diff --

btw, the array case seems a bit weird?


---




[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...

2018-01-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r161455317
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
 }
 
 case StringType => (obj: Any) => nullSafeConvert(obj) {
+  case _: Calendar => null
   case _ => UTF8String.fromString(obj.toString)
--- End diff --

looks good


---




[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...

2018-01-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r161365496
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
 }
 
 case StringType => (obj: Any) => nullSafeConvert(obj) {
+  case _: Calendar => null
   case _ => UTF8String.fromString(obj.toString)
--- End diff --

@cloud-fan, how about something like this then?

```scala
case StringType => (obj: Any) => nullSafeConvert(obj) {
  // Shortcut for string conversion
  case c: String => UTF8String.fromString(c)

  // Here, we return null for 'array', 'tuple', 'dict', 'list', 'datetime.datetime',
  // 'datetime.date' and 'datetime.time' because those string conversions are
  // not quite consistent with the SQL string representation of data.
  case _: java.util.Calendar | _: net.razorvine.pickle.objects.Time |
       _: java.util.List[_] | _: java.util.Map[_, _] =>
    null
  case c if c.getClass.isArray => null

  // Here, we keep the string conversion fallback for compatibility.
  // TODO: We should revisit this and rewrite the type conversion logic in Spark 3.x.
  case other => UTF8String.fromString(other.toString)
}
```

A few tests:

`datetime.time`:

```
from pyspark.sql.functions import udf
from datetime import time

f = udf(lambda x: time(0, 0), "string")
spark.range(1).select(f("id")).show()
```

```
+--------------------+
|        <lambda>(id)|
+--------------------+
|Time: 0 hours, 0 ...|
+--------------------+
```

`array`:

```
from pyspark.sql.functions import udf
import array

f = udf(lambda x: array.array("c", "aaa"), "string")
spark.range(1).select(f("id")).show()
```

```
+--------------------+
|        <lambda>(id)|
+--------------------+
|         [C@11618d9e|
+--------------------+
```

`tuple`:

```
from pyspark.sql.functions import udf

f = udf(lambda x: (x,), "string")
spark.range(1).select(f("id")).show()
```

```
+--------------------+
|        <lambda>(id)|
+--------------------+
|[Ljava.lang.Objec...|
+--------------------+
```

`list`:

```
from pyspark.sql.functions import udf
from datetime import datetime

f = udf(lambda x: [datetime(1990, 1, 1)], "string")
spark.range(1).select(f("id")).show()
```

```
+--------------------+
|        <lambda>(id)|
+--------------------+
|[java.util.Gregor...|
+--------------------+
```





---




[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...

2018-01-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r161363813
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
 }
 
 case StringType => (obj: Any) => nullSafeConvert(obj) {
+  case _: Calendar => null
   case _ => UTF8String.fromString(obj.toString)
--- End diff --

I think there is no perfect solution... I think https://github.com/apache/spark/pull/20163#discussion_r161363004 sounds good enough as a fix for this issue for now.


---




[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...

2018-01-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r161363630
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
 }
 
 case StringType => (obj: Any) => nullSafeConvert(obj) {
+  case _: Calendar => null
   case _ => UTF8String.fromString(obj.toString)
--- End diff --

BTW, it seems there is another hole when we actually do the internal conversion with unexpected types:

```python
>>> from pyspark.sql.functions import udf
>>> f = udf(lambda x: x, "date")
>>> spark.range(1).select(f("id")).show()
```

```
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "./python/pyspark/worker.py", line 229, in main
    process()
  File "./python/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "./python/pyspark/worker.py", line 149, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "./python/pyspark/worker.py", line 72, in <lambda>
    return lambda *a: toInternal(f(*a))
  File "/.../pyspark/sql/types.py", line 175, in toInternal
    return d.toordinal() - self.EPOCH_ORDINAL
AttributeError: 'int' object has no attribute 'toordinal'
```
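
For contrast, a minimal sketch of the matched case (assuming an active `spark` session): when the return value is an actual `datetime.date`, `toInternal` succeeds.

```python
from pyspark.sql.functions import udf
import datetime

# Matched case: datetime.date corresponds to DateType, so toInternal converts cleanly.
f = udf(lambda x: datetime.date(1990, 1, 1), "date")
spark.range(1).select(f("id")).show()
```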



---




[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...

2018-01-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r161363023
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
 }
 
 case StringType => (obj: Any) => nullSafeConvert(obj) {
+  case _: Calendar => null
   case _ => UTF8String.fromString(obj.toString)
--- End diff --

Oh, I didn't see the comment above when I wrote my comment.


---




[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...

2018-01-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r161363004
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
 }
 
 case StringType => (obj: Any) => nullSafeConvert(obj) {
+  case _: Calendar => null
   case _ => UTF8String.fromString(obj.toString)
--- End diff --

So, for now, I think it's fine as a small fix as is. We are going to document that the return type and return value should match anyway.

So, expected return values will be:

```python
# Mapping Python types to Spark SQL DataType
_type_mappings = {
    type(None): NullType,
    bool: BooleanType,
    int: LongType,
    float: DoubleType,
    str: StringType,
    bytearray: BinaryType,
    decimal.Decimal: DecimalType,
    datetime.date: DateType,
    datetime.datetime: TimestampType,
    datetime.time: TimestampType,
}
```
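
A minimal sketch of that mapping in action (`_infer_type` is the private helper in `pyspark/sql/types.py` that consults `_type_mappings`, so this is illustrative only):

```python
from pyspark.sql.types import _infer_type
import datetime
import decimal

# Each Python value infers the Spark SQL type listed in _type_mappings above.
print(_infer_type(datetime.date(2018, 1, 12)))  # DateType
print(_infer_type(decimal.Decimal("1.0")))      # DecimalType(38,18)
print(_infer_type(bytearray(b"ab")))            # BinaryType
```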

It seems we can also check whether the string conversion looks reasonable, and blacklist `net.razorvine.pickle.objects.Time` if not.

How does this sound to you @cloud-fan and @rednaxelafx?


---




[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...

2018-01-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r161362994
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
 }
 
 case StringType => (obj: Any) => nullSafeConvert(obj) {
+  case _: Calendar => null
   case _ => UTF8String.fromString(obj.toString)
--- End diff --

> check if the string conversion looks reasonably consistent by `obj.toString`. If not, we add it in the blacklist.

hmm, this seems weird, as the type mismatch would now be defined by the Pyrolite object's `toString` behavior...


---




[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...

2018-01-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r161362902
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
 }
 
 case StringType => (obj: Any) => nullSafeConvert(obj) {
+  case _: Calendar => null
   case _ => UTF8String.fromString(obj.toString)
--- End diff --

To be thorough, I think we should check all the types from https://github.com/irmen/Pyrolite:

```
PYTHON                                      ---->   JAVA
------                                              ----
None                                                null
bool                                                boolean
int                                                 int
long                                                long or BigInteger (depending on size)
string                                              String
unicode                                             String
complex                                             net.razorvine.pickle.objects.ComplexNumber
datetime.date                                       java.util.Calendar
datetime.datetime                                   java.util.Calendar
datetime.time                                       net.razorvine.pickle.objects.Time
datetime.timedelta                                  net.razorvine.pickle.objects.TimeDelta
float                                               double (float isn't used)
array.array                                         array of appropriate primitive type (char, int, short, long, float, double)
list                                                java.util.List
tuple                                               Object[]
set                                                 java.util.Set
dict                                                java.util.Map
bytes                                               byte[]
bytearray                                           byte[]
decimal                                             BigDecimal
custom class                                        Map<String, Object> (dict with class attributes including its name in "__class__")
Pyro4.core.URI                                      net.razorvine.pyro.PyroURI
Pyro4.core.Proxy                                    net.razorvine.pyro.PyroProxy
Pyro4.errors.*                                      net.razorvine.pyro.PyroException
Pyro4.utils.flame.FlameBuiltin                      net.razorvine.pyro.FlameBuiltin
Pyro4.utils.flame.FlameModule                       net.razorvine.pyro.FlameModule
Pyro4.utils.flame.RemoteInteractiveConsole          net.razorvine.pyro.FlameRemoteConsole
```

and then check whether the string conversion looks reasonably consistent via `obj.toString`. If not, we add it to the blacklist.

Another possibility is to whitelist `String`, but I guess that would be a rather radical change.
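
As a quick probe in the same style as the tests above (a sketch, assuming an active `spark` session), a Python `set` comes back as a `java.util.Set`, whose `toString` is not a SQL-style string either:

```python
from pyspark.sql.functions import udf

# A set is unpickled by Pyrolite as java.util.Set; its toString (e.g. "[0]")
# is not a SQL string representation of the value.
f = udf(lambda x: {x}, "string")
spark.range(1).select(f("id")).show()
```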


---




[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...

2018-01-12 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r161287717
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
 }
 
 case StringType => (obj: Any) => nullSafeConvert(obj) {
+  case _: Calendar => null
   case _ => UTF8String.fromString(obj.toString)
--- End diff --

I was pounding on that yesterday, too... somehow I have this feeling that no matter which direction we take, there's no good answer to type-mismatch situations.

Let's say we blacklist more types: should we document the list so that it's clear what will definitely NOT work?


---




[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...

2018-01-12 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r161268321
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
 }
 
 case StringType => (obj: Any) => nullSafeConvert(obj) {
+  case _: Calendar => null
   case _ => UTF8String.fromString(obj.toString)
--- End diff --

shall we blacklist more types? e.g. if a UDF returns a decimal and marks the return type as string, is it a mismatch?
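
A quick way to try that case (a sketch, assuming an active `spark` session): Pyrolite unpickles `decimal.Decimal` to `java.math.BigDecimal`, whose `toString` happens to look like a plain number, so the mismatch would be silent.

```python
from pyspark.sql.functions import udf
import decimal

# BigDecimal.toString gives "1.5" here, so the type mismatch is not visible
# in the output.
f = udf(lambda x: decimal.Decimal("1.5"), "string")
spark.range(1).select(f("id")).show()
```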


---
