[ https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124161#comment-16124161 ]

Xiao Li commented on SPARK-21698:
---------------------------------

{{insertInto}} is resolved by position, not by column name. If you print out 
the schema of the DataFrame {{data3}} and the schema of the table, you will 
see that they do not match. As a result, the data are inserted into the wrong 
columns, which have different data types.
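
For example (a sketch based on the test case below, using its {{data3}} and 
{{table_name}}): a DataFrame created from a list of dicts orders its columns 
alphabetically ({{count, id, name}}), while the table created with 
{{partitionBy("count")}} stores the partition column last ({{id, name, count}}). 
By position, the {{count}} values land in {{id}}, the {{id}} values land in 
{{name}}, and the {{name}} strings land in the bigint partition column 
{{count}}, where they cannot be cast and show up as null.
{code:title=Inspecting the mismatch (sketch)|borderStyle=solid}
# Schema inferred from dicts is alphabetical: count, id, name
data3.printSchema()
# Table schema puts the partition column last: id, name, count
self.spark.read.table(table_name).printSchema()
{code}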

The simplest workaround is to use {{saveAsTable}} with append mode.
Change this line
{{data3.write.insertInto(table_name)}}
to
{{data3.write.partitionBy("count").mode("append").saveAsTable(table_name)}}
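
Alternatively, if you want to keep {{insertInto}}, reordering the DataFrame 
columns to match the table's positional layout should also work (a sketch, 
assuming the table schema is {{id, name, count}}):
{code:title=Reordering before insertInto (sketch)|borderStyle=solid}
# Select the columns in the table's order so positional resolution lines up
data3.select("id", "name", "count").write.insertInto(table_name)
{code}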




> write.partitionBy() is giving me garbage data
> ---------------------------------------------
>
>                 Key: SPARK-21698
>                 URL: https://issues.apache.org/jira/browse/SPARK-21698
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1, 2.2.0
>         Environment: Linux Ubuntu 17.04.  Python 3.5.
>            Reporter: Luis
>
> Spark partitionBy is causing some data corruption. I am doing three super 
> simple writes. Below is the code to reproduce the problem.
> {code:title=Program Output|borderStyle=solid}
> 17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
> /usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring 
> schema from dict is deprecated,please use pyspark.sql.Row instead
>   warnings.warn("inferring schema from dict is deprecated,"
> +---+----+-----+
> | id|name|count|
> +---+----+-----+
> |  1|   1|    1|
> |  2|   2|    2|
> |  3|   3|    3|
> +---+----+-----+
> 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
> 17/08/10 16:05:07 WARN log: Updated size to 545
> 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
> 17/08/10 16:05:07 WARN log: Updated size to 545
> 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
> 17/08/10 16:05:07 WARN log: Updated size to 545
> +---+----+-----+
> | id|name|count|
> +---+----+-----+
> |  1|   1|    1|
> |  2|   2|    2|
> |  3|   3|    3|
> |  4|   4|    4|
> |  5|   5|    5|
> |  6|   6|    6|
> +---+----+-----+
> +---+----+-----+
> | id|name|count|
> +---+----+-----+
> |  9|   4| null|
> | 10|   6| null|
> |  7|   1| null|
> |  8|   2| null|
> |  1|   1|    1|
> |  2|   2|    2|
> |  3|   3|    3|
> |  4|   4|    4|
> |  5|   5|    5|
> |  6|   6|    6|
> +---+----+-----+
> {code}
> In the last show(), I see that the data is null.
> {code:title=spark init|borderStyle=solid}
>         self.spark = SparkSession \
>                 .builder \
>                 .master("spark://localhost:7077") \
>                 .enableHiveSupport() \
>                 .getOrCreate()
> {code}
> {code:title=Code for the test case|borderStyle=solid}
>     def test_clean_insert_table(self):
>         table_name = "data"
>         data0 = [
>             {"id": 1, "name": "1", "count": 1},
>             {"id": 2, "name": "2", "count": 2},
>             {"id": 3, "name": "3", "count": 3},
>         ]
>         df_data0 = self.spark.createDataFrame(data0)
>         df_data0.write.partitionBy("count").mode("overwrite").saveAsTable(table_name)
>         df_return = self.spark.read.table(table_name)
>         df_return.show()
>         data1 = [
>             {"id": 4, "name": "4", "count": 4},
>             {"id": 5, "name": "5", "count": 5},
>             {"id": 6, "name": "6", "count": 6},
>         ]
>         df_data1 = self.spark.createDataFrame(data1)
>         df_data1.write.insertInto(table_name)
>         df_return = self.spark.read.table(table_name)
>         df_return.show()
>         data3 = [
>             {"id": 1, "name": "one", "count": 7},
>             {"id": 2, "name": "two", "count": 8},
>             {"id": 4, "name": "three", "count": 9},
>             {"id": 6, "name": "six", "count": 10}
>         ]
>         data3 = self.spark.createDataFrame(data3)
>         data3.write.insertInto(table_name)
>         df_return = self.spark.read.table(table_name)
>         df_return.show()
> {code}


