[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
[ https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727763#comment-15727763 ]

Dongjoon Hyun commented on SPARK-18709:
---------------------------------------

@srowen. The type verification was introduced by SPARK-14945 (https://issues.apache.org/jira/browse/SPARK-14945), when `session.py` was created for 2.0.0.

> Automatic null conversion bug (instead of throwing error) when creating a
> Spark DataFrame with incompatible types for fields.
> -------------------------------------------------------------------------
>
>                 Key: SPARK-18709
>                 URL: https://issues.apache.org/jira/browse/SPARK-18709
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 1.6.3
>            Reporter: Amogh Param
>              Labels: bug
>             Fix For: 2.0.0
>
> When converting an RDD with a `float` type field to a Spark dataframe with an
> `IntegerType` / `LongType` schema field, Spark 1.6.2 and 1.6.3 silently
> convert the field values to nulls instead of throwing an error like `LongType
> can not accept object ___ in type <type 'float'>`. However, this seems to be
> fixed in Spark 2.0.2.
>
> The following example should make the problem clear:
> {code}
> from pyspark.sql.types import StructField, StructType, LongType, DoubleType
>
> schema = StructType([
>     StructField("0", LongType(), True),
>     StructField("1", DoubleType(), True),
> ])
> data = [[1.0, 1.0], [float("nan"), 2.0]]
> spark_df = sqlContext.createDataFrame(sc.parallelize(data), schema)
> spark_df.show()
> {code}
> Instead of throwing an error like:
> {code}
> LongType can not accept object 1.0 in type <type 'float'>
> {code}
> Spark converts all the values in the first column to nulls. Running
> `spark_df.show()` gives:
> {code}
> +----+---+
> |   0|  1|
> +----+---+
> |null|1.0|
> |null|2.0|
> +----+---+
> {code}
> For the purposes of my computation, I'm running `mapPartitions` on a Spark
> data frame. For each partition, I convert it into a pandas data frame, do a
> few computations on it, and return a list of lists, which `mapPartitions`
> turns back into an RDD. This RDD is then converted into a Spark dataframe
> much like the example above, using `sqlContext.createDataFrame(rdd, schema)`.
> The RDD has a column that should become a `LongType` in the Spark data frame,
> but since it has missing values, it is a `float` type. When Spark tries to
> create the data frame, it converts all the values in that column to nulls
> instead of throwing an error that there is a type mismatch.
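For anyone stuck on 1.6.x, a practical workaround is to normalize such a column before applying the schema, so that missing values become explicit nulls rather than relying on version-specific coercion. The following is a minimal sketch under the repro's assumptions (`sc` and `sqlContext` already exist); the helper `to_nullable_long` is made up for illustration:

{code}
import math

from pyspark.sql.types import StructField, StructType, LongType, DoubleType

schema = StructType([
    StructField("0", LongType(), True),
    StructField("1", DoubleType(), True),
])

def to_nullable_long(x):
    # Hypothetical helper: map NaN (and None) to None so the LongType
    # column gets a real SQL null; cast clean floats to int so that
    # LongType accepts them.
    if x is None or math.isnan(x):
        return None
    return int(x)

data = [[1.0, 1.0], [float("nan"), 2.0]]
rows = sc.parallelize(data).map(lambda r: (to_nullable_long(r[0]), r[1]))
spark_df = sqlContext.createDataFrame(rows, schema)
spark_df.show()
# +----+---+
# |   0|  1|
# +----+---+
# |   1|1.0|
# |null|2.0|
# +----+---+
{code}

The same map can run inside the reporter's `mapPartitions` step, before the rows are handed back to `createDataFrame`.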
[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
[ https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15724144#comment-15724144 ]

Dongjoon Hyun commented on SPARK-18709:
---------------------------------------

I'll check which commit added the guard condition.
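The guard condition in question is a per-field type check applied while rows are converted. A simplified Python 2 sketch of such a check (the era of Spark 1.6/2.0) — illustrative only, not the actual `session.py` code — might look like:

{code}
def verify_long(field_name, obj):
    # Simplified sketch of a per-field type guard: reject values that
    # LongType cannot accept instead of silently nulling them.
    if obj is None:
        return  # nullable field: null is always acceptable
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(obj, bool) or not isinstance(obj, (int, long)):
        raise TypeError("LongType can not accept object %r in type %s"
                        % (obj, type(obj)))

verify_long("0", 1)     # passes
verify_long("0", None)  # passes (nullable)
try:
    verify_long("0", 1.0)
except TypeError as e:
    print(e)  # LongType can not accept object 1.0 in type <type 'float'>
{code}

If 1.6 skipped a check like this on the RDD path, incompatible floats would be nullified by the row converter instead of rejected, which would match the behavior reported here.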
[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
[ https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15724101#comment-15724101 ]

Sean Owen commented on SPARK-18709:
-----------------------------------

BTW, do you know what change fixed this, by any chance?
[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
[ https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723236#comment-15723236 ]

Amogh Param commented on SPARK-18709:
-------------------------------------

Thanks, I'll close the ticket.
[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
[ https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723201#comment-15723201 ]

Dongjoon Hyun commented on SPARK-18709:
---------------------------------------

Yes. It will not be in 1.6.4 (if that release ever exists).
[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
[ https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723189#comment-15723189 ]

Amogh Param commented on SPARK-18709:
-------------------------------------

[~dongjoon] Thanks for the fix. Just to clarify, does this mean the fix will only be in 2.0.0 and not in 1.6.4 (assuming there will be a 1.6.4 update)?
[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
[ https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723100#comment-15723100 ]

Shixiong Zhu commented on SPARK-18709:
--------------------------------------

[~dongjoon] Is it already resolved? If so, could you close this ticket?
[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
[ https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15720691#comment-15720691 ]

Dongjoon Hyun commented on SPARK-18709:
---------------------------------------

Also, I updated the fix version to 2.0.0 after testing on 2.0.0.
[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.
[ https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15720687#comment-15720687 ]

Dongjoon Hyun commented on SPARK-18709:
---------------------------------------

Hi, [~amogh.91]. I removed the target version since 1.6.3 is already released. In addition, only committers decide the target version.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org