Bryan Cutler created SPARK-27612:
------------------------------------

             Summary: Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None
                 Key: SPARK-27612
                 URL: https://issues.apache.org/jira/browse/SPARK-27612
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 3.0.0
            Reporter: Bryan Cutler
When creating a DataFrame with type {{ArrayType(IntegerType(), True)}}, some rows end up filled with None:

{code:java}
In [1]: from pyspark.sql.types import ArrayType, IntegerType

In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), True))

In [3]: df.distinct().collect()
Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
{code}

From this example, the corruption consistently appears at rows 97 and 98:

{code}
In [5]: df.collect()[-5:]
Out[5]:
[Row(value=[1, 2, 3, 4]),
 Row(value=[1, 2, 3, 4]),
 Row(value=[None, None, None, None]),
 Row(value=[None, None, None, None]),
 Row(value=[1, 2, 3, 4])]
{code}

This also happens with a nested type of {{ArrayType(ArrayType(IntegerType(), True))}}.