Robert Joseph Evans created SPARK-38955:
-------------------------------------------

             Summary: from_csv can corrupt surrounding lines if a lineSep is in the data
                 Key: SPARK-38955
                 URL: https://issues.apache.org/jira/browse/SPARK-38955
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Robert Joseph Evans


I don't know how critical this is. While doing some general testing to 
understand {{from_csv}}, I found that if the input data happens to contain a 
{{lineSep}}, the next row appears to be corrupted. {{multiLine}} does not 
appear to fix it. Because this is data corruption I am inclined to mark this 
as CRITICAL or BLOCKER, but it is an odd corner case, so I'm not going to set 
it myself.

{code}
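import org.apache.spark.sql.functions.{col, from_csv}
import org.apache.spark.sql.types._

// Default options: the first row contains "\n".
Seq[String]("1,\n2,3,4,5", "6,7,8,9,10", "11,12,13,14,15", null).toDF
  .select(
    col("value"),
    from_csv(
      col("value"),
      StructType(Seq(StructField("a", LongType), StructField("b", StringType))),
      Map[String, String]()))
  .show()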
+--------------+---------------+
|         value|from_csv(value)|
+--------------+---------------+
|   1,\n2,3,4,5|      {1, null}|
|    6,7,8,9,10|      {null, 8}|
|11,12,13,14,15|       {11, 12}|
|          null|           null|
+--------------+---------------+
{code}

{code}
// lineSep set to ":" and the first row contains ":".
Seq[String]("1,:2,3,4,5", "6,7,8,9,10", "11,12,13,14,15", null).toDF
  .select(
    col("value"),
    from_csv(
      col("value"),
      StructType(Seq(StructField("a", LongType), StructField("b", StringType))),
      Map[String, String]("lineSep" -> ":")))
  .show()
+--------------+---------------+
|         value|from_csv(value)|
+--------------+---------------+
|    1,:2,3,4,5|      {1, null}|
|    6,7,8,9,10|      {null, 8}|
|11,12,13,14,15|       {11, 12}|
|          null|           null|
+--------------+---------------+
{code}

{code}
// lineSep set to ":" but the first row contains "\n" instead: no corruption.
Seq[String]("1,\n2,3,4,5", "6,7,8,9,10", "11,12,13,14,15", null).toDF
  .select(
    col("value"),
    from_csv(
      col("value"),
      StructType(Seq(StructField("a", LongType), StructField("b", StringType))),
      Map[String, String]("lineSep" -> ":")))
  .show()
+--------------+---------------+
|         value|from_csv(value)|
+--------------+---------------+
|   1,\n2,3,4,5|       {1, \n2}|
|    6,7,8,9,10|         {6, 7}|
|11,12,13,14,15|       {11, 12}|
|          null|           null|
+--------------+---------------+
{code}
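
The three outputs above suggest that the following row is only affected when 
the input string contains the configured {{lineSep}}. A minimal sketch of a 
check for such rows, in plain Scala without Spark. {{LineSepGuard}} and 
{{containsLineSep}} are hypothetical names, not part of Spark, and the "\n" 
default is an assumption based on the first example:

```scala
// Hypothetical helper, not part of Spark: reports whether a CSV string
// contains the configured line separator.
object LineSepGuard {
  // "\n" is assumed as the default separator, matching the first example.
  // A null input is treated as not containing the separator.
  def containsLineSep(value: String, lineSep: String = "\n"): Boolean =
    value != null && value.contains(lineSep)

  def main(args: Array[String]): Unit = {
    val rows = Seq("1,\n2,3,4,5", "6,7,8,9,10", "11,12,13,14,15", null)
    // Count the rows that contain the separator and may trigger the problem.
    val suspect = rows.count(r => containsLineSep(r))
    println(s"rows containing the separator: $suspect")
  }
}
```

Inside Spark, the same condition could probably be expressed as 
{{col("value").contains("\n")}} to flag rows before calling {{from_csv}}. This 
only detects the input pattern; it does not fix the parsing itself.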



