[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
    [ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023481#comment-17023481 ]

Dongjoon Hyun commented on SPARK-27612:
---------------------------------------

I also double-checked that this is still not required in branch-2.4.
To distinguish this from the other correctness issue, I set `Target Version` to `3.0.0`.

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-27612
>                 URL: https://issues.apache.org/jira/browse/SPARK-27612
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Bryan Cutler
>            Assignee: Hyukjin Kwon
>            Priority: Blocker
>              Labels: correctness
>             Fix For: 3.0.0
>
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there ends up being rows that are filled with None.
>
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), True))
> In [3]: df.distinct().collect()
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:]
> Out[5]:
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
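The issue description locates the bad rows by eye from `df.collect()[-5:]`. A minimal sketch of a helper that reports the positions of all-None arrays is given here for illustration only. The function name `find_all_none_rows` is hypothetical, and the list `collected` is a hand-built stand-in for `[row.value for row in df.collect()]`, with the corrupted positions copied from the report rather than produced by Spark.

```python
from typing import List, Optional, Sequence


def find_all_none_rows(rows: Sequence[Sequence[Optional[int]]]) -> List[int]:
    """Return the positions of rows whose array is non-empty and holds only None."""
    return [
        i for i, values in enumerate(rows)
        if len(values) > 0 and all(v is None for v in values)
    ]


# Stand-in for `[row.value for row in df.collect()]` (an assumption, not
# real Spark output); positions 97 and 98 mirror the ones in the report.
collected = [[1, 2, 3, 4]] * 100
collected[97] = [None, None, None, None]
collected[98] = [None, None, None, None]

print(find_all_none_rows(collected))  # [97, 98]
```

Against a live session the same helper could presumably be fed the collected values directly, which would show whether the affected positions stay at 97 and 98 when the input size changes.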
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
    [ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832712#comment-16832712 ]

Bryan Cutler commented on SPARK-27612:
--------------------------------------

Thanks for checking this out [~viirya] and [~hyukjin.kwon]. I agree that if we can fix it in cloudpickle and do another upgrade before 3.0.0, that would be best. The last upgrade to 0.6.2 has not been in any released versions of Spark right?
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
    [ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831680#comment-16831680 ]

Liang-Chi Hsieh commented on SPARK-27612:
-----------------------------------------

Yeah, it seems the issue happens when the Python object gets pickled...
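One rough way to probe the pickling hypothesis outside Spark is to round-trip the same data through a pickler in batches and compare each element before and after. The sketch below is only an illustration: it uses the standard-library `pickle` module, not cloudpickle or PySpark's own serializers, and the function name `pickle_round_trip_mismatches` and the `batch_size` value are hypothetical. It is not expected to reproduce the bug by itself; it only shows the shape of such a check.

```python
import pickle
from typing import Any, List, Sequence


def pickle_round_trip_mismatches(items: Sequence[Any], batch_size: int = 10) -> List[int]:
    """Pickle the items in batches, unpickle them, and return the positions
    whose value changed during the round trip."""
    mismatches = []
    for start in range(0, len(items), batch_size):
        batch = list(items[start:start + batch_size])
        # Standard-library pickler with its default protocol (an assumption;
        # the serializer Spark actually uses is not exercised here).
        restored = pickle.loads(pickle.dumps(batch))
        for offset, (before, after) in enumerate(zip(batch, restored)):
            if before != after:
                mismatches.append(start + offset)
    return mismatches


data = [[1, 2, 3, 4]] * 100
print(pickle_round_trip_mismatches(data))  # [] with the standard-library pickler
```

An empty result here would only say that plain `pickle` handles this input; swapping in the serializer under suspicion is the step that would actually test the hypothesis.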
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
    [ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831623#comment-16831623 ]

Hyukjin Kwon commented on SPARK-27612:
--------------------------------------

Argh, this happens after we upgraded cloudpickle to 0.6.2:
https://github.com/apache/spark/commit/75ea89ad94ca76646e4697cf98c78d14c6e2695f#diff-19fd865e0dd0d7e6b04b3b1e047dcda7

Upgrading cloudpickle to 0.8.1 still doesn't solve the problem. I think we should fix it in cloudpickle, make a cloudpickle release, and port that change into Spark.
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
    [ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831602#comment-16831602 ]

Hyukjin Kwon commented on SPARK-27612:
--------------------------------------

Argh, seems to be a regression.

{code}
>>> from pyspark.sql.types import ArrayType, IntegerType
>>> df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), True))
>>> df.distinct().collect()
[Row(value=[1, 2, 3, 4])]
{code}

Doesn't happen in Spark 2.4.1 and Spark 2.3.3
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
    [ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831600#comment-16831600 ]

Liang-Chi Hsieh commented on SPARK-27612:
-----------------------------------------

Yup, I can reproduce it too. No worries [~bryanc]. :) I will take some time to look into it.
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
    [ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831596#comment-16831596 ]

Hyukjin Kwon commented on SPARK-27612:
--------------------------------------

haha, you're not crazy

{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 09:23:15)
SparkSession available as 'spark'.
>>> from pyspark.sql.types import ArrayType, IntegerType
>>> df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), True))
>>> df.distinct().collect()
[Row(value=[None, None]), Row(value=[1, 2, 3, 4])]
{code}
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
    [ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831098#comment-16831098 ]

Bryan Cutler commented on SPARK-27612:
--------------------------------------

Also cc [~viirya] [~hyukjin.kwon], this is a little strange.. I hope I'm not crazy
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
    [ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831097#comment-16831097 ]

Marco Gaido commented on SPARK-27612:
-------------------------------------

I don't have a python3 env, sorry...
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
    [ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831092#comment-16831092 ]

Bryan Cutler commented on SPARK-27612:
--------------------------------------

Thanks [~mgaido], it seems the problem does not happen for me with Python 2, so it only affects my Python 3 environments. Would you be able to check with Python 3?
[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
    [ https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830945#comment-16830945 ]

Marco Gaido commented on SPARK-27612:
-------------------------------------

I am not able to reproduce...

{code}
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Python version 2.7.10 (default, Oct 6 2017 22:29:07)
SparkSession available as 'spark'.
>>> from pyspark.sql.types import ArrayType, IntegerType
>>> df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), True))
>>> df.distinct().collect()
[Row(value=[1, 2, 3, 4])]
>>>
{code}