[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489546#comment-16489546
 ] 

Hyukjin Kwon commented on SPARK-24358:
--

Yea, I know the differences and I understand the rationale here. We would need
strong evidence and good reasons to accept the divergence. We also need to look
through the PySpark code base and check for similar divergences.
FWIW, I was trying to look into and fix the differences among bytes, str and
unicode, but I am currently stuck because I am swamped with other work.



> createDataFrame in Python 3 should be able to infer bytes type as Binary type
> -
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Joel Croteau
> Priority: Minor
> Labels: Python3
>
> createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes 
> is just the immutable, hashable version of this same structure, it makes 
> sense for the same thing to apply there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-24 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489507#comment-16489507
 ] 

Joel Croteau commented on SPARK-24358:
--

This does mean that the current implementation has some compatibility issues
with Python 3. In Python 2, a bytes object is inferred as StringType,
regardless of content. StringType and BinaryType are functionally identical, as
both are just arbitrary arrays of bytes, and Python 2 handles any value of
either just fine. In Python 3, attempting to infer the type of a bytes object
is an error, and Python 3 converts a StringType value to Unicode. Since not
every byte string is valid Unicode, some errors may occur when processing
StringType values in Python 3 that worked fine in Python 2.
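
To make the proposal concrete, here is a minimal sketch of the inference rule
being asked for. It is not PySpark's actual inference code: infer_type_name is
a hypothetical helper, and the type names are returned as plain strings so the
example runs on Python 3 without Spark installed.

```python
def infer_type_name(value):
    """Hypothetical stand-in for schema inference (not PySpark's code).

    Returns the name of the Spark SQL type the proposal would pick.
    """
    # Proposed rule: treat bytes the same way bytearray is already treated.
    if isinstance(value, (bytes, bytearray)):
        return "BinaryType"
    if isinstance(value, str):
        return "StringType"
    raise TypeError("not supported type: %s" % type(value))


print(infer_type_name(bytearray(b"\x00\x01")))
print(infer_type_name(b"\xff\xfe"))
print(infer_type_name("spark"))
```

Note that on Python 2, where bytes is str, the first branch would also catch
ordinary strings. That is the Python 2 / Python 3 divergence discussed in this
thread.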




[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488261#comment-16488261
 ] 

Hyukjin Kwon commented on SPARK-24358:
--

Yea, that's exactly what I thought. There are some differences between Python 2
and 3, but to my knowledge PySpark has so far behaved consistently on both.




[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-23 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488090#comment-16488090
 ] 

Joel Croteau commented on SPARK-24358:
--

This may be trickier than I first thought. In Python 2, bytes is an alias for
str, so a bytes object resolves as a StringType. In Python 3, they are
different types and are not in general freely convertible, since a str in
Python 3 is Unicode, and an arbitrary byte string represented by a bytes object
may not be valid Unicode. This means that a bytes object in Python 2 will need
to be resolved to a different schema type than a bytes object in Python 3. Not
sure how significant that is.
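
A small Python 3 illustration of the point above, as a sketch only:
decode_utf8_or_none is a helper name made up for this example, and UTF-8 is
assumed as the target encoding.

```python
from typing import Optional


def decode_utf8_or_none(data: bytes) -> Optional[str]:
    """Return data decoded as UTF-8, or None if it is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


ascii_payload = b"spark"          # happens to be valid UTF-8
binary_payload = b"\xff\xfe\x80"  # 0xff can never start a UTF-8 sequence

# On Python 3, bytes is not str, so this prints False.
print(isinstance(ascii_payload, str))
print(decode_utf8_or_none(ascii_payload))
print(decode_utf8_or_none(binary_payload))
```

The second payload has no str equivalent under UTF-8, which is why a bytes
value cannot in general be mapped to StringType on Python 3.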




[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-22 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486594#comment-16486594
 ] 

Joel Croteau commented on SPARK-24358:
--

Done.
