[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema

2022-10-17 Thread ohad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ohad updated SPARK-40808:
-
Attachment: test_csv.py

> Infer schema for CSV files - wrong behavior using header + merge schema
> ---
>
> Key: SPARK-40808
> URL: https://issues.apache.org/jira/browse/SPARK-40808
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.2
> Reporter: ohad
> Priority: Major
> Labels: CSVReader, csv, csvparser
> Attachments: test_csv.py
>
>
> Hello.
> I am writing unit tests for functionality in my application that reads data 
> from CSV files using Spark.
> I read the data with these options:
> {code:java}
> header=True
> mergeSchema=True
> inferSchema=True{code}
> When I read this single file:
> {code:java}
> File1:
> "int_col","string_col","decimal_col","date_col"
> 1,"hello",1.43,2022-02-23
> 2,"world",5.534,2021-05-05
> 3,"my name",86.455,2011-08-15
> 4,"is ohad",6.234,2002-03-22{code}
> I get this schema:
> {code:java}
> int_col=int
> string_col=string
> decimal_col=double
> date_col=string{code}
> When I duplicate this file, I get the same schema.
> The strange part is that when I add a file with a new int column, Spark seems 
> to get confused and treats columns it had already inferred as numeric as 
> string:
> {code:java}
> File1:
> "int_col","string_col","decimal_col","date_col"
> 1,"hello",1.43,2022-02-23
> 2,"world",5.534,2021-05-05
> 3,"my name",86.455,2011-08-15
> 4,"is ohad",6.234,2002-03-22
> File2:
> "int_col","string_col","decimal_col","date_col","int2_col"
> 1,"hello",1.43,2022-02-23,234
> 2,"world",5.534,2021-05-05,5
> 3,"my name",86.455,2011-08-15,32
> 4,"is ohad",6.234,2002-03-22,2
> {code}
> result:
> {code:java}
> int_col=string
> string_col=string
> decimal_col=string
> date_col=string
> int2_col=int{code}
> When I read only the second file, the schema looks fine:
> {code:java}
> File2:
> "int_col","string_col","decimal_col","date_col","int2_col"
> 1,"hello",1.43,2022-02-23,234
> 2,"world",5.534,2021-05-05,5
> 3,"my name",86.455,2011-08-15,32
> 4,"is ohad",6.234,2002-03-22,2{code}
> result:
> {code:java}
> int_col=int
> string_col=string
> decimal_col=double
> date_col=string
> int2_col=int{code}
> In conclusion, there appears to be a bug when two features are combined: 
> header recognition and schema merging.
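
The scenario described in the report can be sketched as a small Python fixture. This is only an illustration, not the attached test_csv.py: the file names `file1.csv` and `file2.csv` and the helper names `write_fixtures` and `inferred_schema` are hypothetical, and the sketch assumes PySpark is installed and that a `SparkSession` is passed in. The `mergeSchema` option is set because the report sets it; whether the CSV reader actually honors it is not confirmed here.

```python
import tempfile
from pathlib import Path

# File contents copied from the report; the file names below are hypothetical.
FILE1 = '''"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
'''

FILE2 = '''"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
'''


def write_fixtures(directory):
    """Write both CSV files into `directory` and return their paths."""
    directory = Path(directory)
    paths = [directory / "file1.csv", directory / "file2.csv"]
    for path, content in zip(paths, (FILE1, FILE2)):
        path.write_text(content, encoding="utf-8")
    return paths


def inferred_schema(spark, paths):
    """Read the CSV paths with the report's options; return {column: type}."""
    reader = (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        # Set as in the report; not verified to have any effect on CSV reads.
        .option("mergeSchema", True)
    )
    df = reader.csv([str(p) for p in paths])
    return dict(df.dtypes)


# Usage, assuming pyspark is installed and `spark` is an active SparkSession:
#   paths = write_fixtures(tempfile.mkdtemp())
#   print(inferred_schema(spark, paths[:1]))  # first file only
#   print(inferred_schema(spark, paths))      # both files together
```

Reading the first file alone and then both files together with `inferred_schema` should make it possible to compare the two schemas on a given Spark version.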



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema

2022-10-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40808:
-
Component/s: SQL
 (was: Spark Core)




[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema

2022-10-16 Thread ohad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ohad updated SPARK-40808:
-
Description: 
Hello.
I am writing unit tests for functionality in my application that reads data 
from CSV files using Spark.

I read the data with these options:
{code:java}
header=True
mergeSchema=True
inferSchema=True{code}
When I read this single file:
{code:java}
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22{code}
I get this schema:
{code:java}
int_col=int
string_col=string
decimal_col=double
date_col=string{code}

When I duplicate this file, I get the same schema.

The strange part is that when I add a file with a new int column, Spark seems 
to get confused and treats columns it had already inferred as numeric as 
string:
{code:java}
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
{code}
result:
{code:java}
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int{code}

When I read only the second file, the schema looks fine:
{code:java}
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2{code}
result:
{code:java}
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int{code}
In conclusion, there appears to be a bug when two features are combined: 
header recognition and schema merging.

  was:
Hello. 
I am writing unit-tests to some functionality in my application that reading 
data from CSV files using Spark.

I am reading the data using:
```
header=True
mergeSchema=True
inferSchema=True
```

When I am reading this single file:
```
Fi
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
```

I am getting this schema:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
```

When I am duplicating this file, I am getting the same schema.

The strange part is when I am adding new int column, it looks like spark is 
getting confused and think that the column that already identified as int are 
now string:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22

File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
```

When I am reading only the second file, it looks fine:
```
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
```

For conclusion, it looks like there is a bug mixing the two features: header 
recognition and merge schema.
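
The difference between the reported schemas can be made explicit with a small comparison helper. This is a sketch for illustration only: the three dictionaries are transcribed from the results listed in the description, and the names `SINGLE_FILE`, `SECOND_FILE_ONLY`, `BOTH_FILES` and `changed_columns` are hypothetical.

```python
# Schemas transcribed from the report (column name -> inferred type).
SINGLE_FILE = {
    "int_col": "int",
    "string_col": "string",
    "decimal_col": "double",
    "date_col": "string",
}
SECOND_FILE_ONLY = {
    "int_col": "int",
    "string_col": "string",
    "decimal_col": "double",
    "date_col": "string",
    "int2_col": "int",
}
BOTH_FILES = {
    "int_col": "string",
    "string_col": "string",
    "decimal_col": "string",
    "date_col": "string",
    "int2_col": "int",
}


def changed_columns(expected, actual):
    """Return {column: (expected_type, actual_type)} for shared columns that differ."""
    return {
        name: (expected[name], actual[name])
        for name in expected
        if name in actual and expected[name] != actual[name]
    }


if __name__ == "__main__":
    # Columns whose type differs between reading File2 alone and reading both files.
    for name, (before, after) in changed_columns(SECOND_FILE_ONLY, BOTH_FILES).items():
        print(f"{name}: {before} -> {after}")
```

In a unit test, a helper like this could be used to assert that the schema of the combined read does not degrade any column type relative to the single-file read.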



[jira] [Updated] (SPARK-40808) Infer schema for CSV files - wrong behavior using header + merge schema

2022-10-16 Thread ohad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ohad updated SPARK-40808:
-
Description: 
Hello. 
I am writing unit-tests to some functionality in my application that reading 
data from CSV files using Spark.

I am reading the data using:
```
header=True
mergeSchema=True
inferSchema=True
```

When I am reading this single file:
```
Fi
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
```

I am getting this schema:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
```

When I am duplicating this file, I am getting the same schema.

The strange part is when I am adding new int column, it looks like spark is 
getting confused and think that the column that already identified as int are 
now string:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22

File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
```

When I am reading only the second file, it looks fine:
```
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
```

For conclusion, it looks like there is a bug mixing the two features: header 
recognition and merge schema.

  was:
Hello. 
I am writing some unit-tests to some functionality in my application that 
reading data from CSV files using Spark.

I am reading the data using:
```
header=True
mergeSchema=True
inferSchema=True
```

When I am reading this single file:
```
Fi
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
```

I am getting this schema:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
```

When I am duplicating this file, I am getting the same schema.

The strange part is when I am adding new int column, it looks like spark is 
getting confused and think that the column that already identified as int are 
now string:
```
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22

File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
```

When I am reading only the second file, it looks fine:
```
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
```

result:
```
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
```

For conclusion, it looks like there is a bug mixing the two features: header 
recognition and merge schema.

