[jira] [Updated] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
[ https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Rubtsov updated SPARK-19228:
---
Description:
The current FastDateFormat parser cannot properly parse dates and timestamps and does not conform to ISO 8601.
For example, I need to process a user.csv file like this:
{code:java}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add the date format options:
{code:java}
Dataset users = spark.read().format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema should be:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date is processed as a string and the "dateFormat" option is ignored.
If I add the option
{code:java}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}

was:
The current FastDateFormat cannot properly parse dates and timestamps and does not conform to ISO 8601.
That is why there is no support for inferring DateType or for a custom "dateFormat" option in CSV parsing.
For example, I need to process a user.csv file like this:
{code:java}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add the date format options:
{code:java}
Dataset users = spark.read().format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema should be:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date is processed as a string and the "dateFormat" option is ignored.
If I add the option
{code:java}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}
I think the issue is somewhere in object CSVInferSchema, function inferField, lines 80-97: a "tryParseDate" method needs to be added before or after "tryParseTimestamp", or the date/timestamp processing logic needs to be changed.

> inferSchema function processed csv date column as string and "dateFormat"
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 2.1.0
> Reporter: Sergey Rubtsov
> Priority: Major
> Labels: easyfix
> Original Estimate: 6h
> Remaining Estimate: 6h
[jira] [Updated] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
[ https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Rubtsov updated SPARK-19228:
---
Description:
The current FastDateFormat cannot properly parse dates and timestamps and does not conform to ISO 8601.
That is why there is no support for inferring DateType or for a custom "dateFormat" option in CSV parsing.
For example, I need to process a user.csv file like this:
{code:java}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add the date format options:
{code:java}
Dataset users = spark.read().format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema should be:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date is processed as a string and the "dateFormat" option is ignored.
If I add the option
{code:java}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}
I think the issue is somewhere in object CSVInferSchema, function inferField, lines 80-97: a "tryParseDate" method needs to be added before or after "tryParseTimestamp", or the date/timestamp processing logic needs to be changed.

was:
I need to process a user.csv file like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add the date format options:
{code}
Dataset users = spark.read().format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema should be:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date is processed as a string and the "dateFormat" option is ignored.
If I add the option
{code}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}
I think the issue is somewhere in object CSVInferSchema, function inferField, lines 80-97: a "tryParseDate" method needs to be added before or after "tryParseTimestamp", or the date/timestamp processing logic needs to be changed.

> inferSchema function processed csv date column as string and "dateFormat"
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 2.1.0
> Reporter: Sergey Rubtsov
> Priority: Major
> Labels: easyfix
> Original Estimate: 6h
> Remaining Estimate: 6h
[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
[ https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16478811#comment-16478811 ]

Sergey Rubtsov commented on SPARK-19228:

Java 8 includes the new java.time module, which could also fix an old bug with parsing a string to an SQL timestamp value with microsecond accuracy:
https://issues.apache.org/jira/browse/SPARK-10681

> inferSchema function processed csv date column as string and "dateFormat"
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 2.1.0
> Reporter: Sergey Rubtsov
> Priority: Major
> Labels: easyfix
> Original Estimate: 6h
> Remaining Estimate: 6h

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
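As a rough illustration of the java.time point in the comment above, the following standalone Java sketch parses a timestamp string with six fractional digits and keeps the microsecond part intact. The class name, the helper name, the pattern "yyyy-MM-dd HH:mm:ss.SSSSSS", and the sample value are all assumptions made for this example; none of this is taken from Spark's code or from the linked issue.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MicrosParseSketch {
    // Hypothetical helper: returns the microsecond-of-second part of a parsed timestamp.
    static long microsOfSecond(String text) {
        // In java.time, 'S' means fraction-of-second, so six letters read six fractional digits.
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS");
        LocalDateTime parsed = LocalDateTime.parse(text, fmt);
        // getNano() is the nanosecond-of-second; dividing by 1000 gives microseconds.
        return parsed.getNano() / 1_000L;
    }

    public static void main(String[] args) {
        // Sample value invented for the example.
        System.out.println(microsOfSecond("2015-08-20 14:57:00.123456")); // prints 123456
    }
}
```

The seconds field of the parsed value is left untouched here, because the fractional digits are read as a fraction of a second and not as a count of milliseconds.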
[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
[ https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925775#comment-15925775 ]

Sergey Rubtsov commented on SPARK-19228:

Hi [~hyukjin.kwon],
I have updated the pull request:
https://github.com/apache/spark/pull/16735
Please take a look.
I couldn't run the tests in CSVSuite locally on my Windows OS, so I apologize for any possible test failures.

> inferSchema function processed csv date column as string and "dateFormat"
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 2.1.0
> Reporter: Sergey Rubtsov
> Labels: easyfix
> Original Estimate: 6h
> Remaining Estimate: 6h
[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
[ https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825860#comment-15825860 ]

Sergey Rubtsov commented on SPARK-19228:

Okay, I will do it.

> inferSchema function processed csv date column as string and "dateFormat"
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 2.1.0
> Reporter: Sergey Rubtsov
> Labels: easyfix
> Original Estimate: 6h
> Remaining Estimate: 6h
[jira] [Updated] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
[ https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Rubtsov updated SPARK-19228:
---
Description:
I need to process a user.csv file like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add the date format options:
{code}
Dataset users = spark.read().format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema should be:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date is processed as a string and the "dateFormat" option is ignored.
If I add the option
{code}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}
I think the issue is somewhere in object CSVInferSchema, function inferField, lines 80-97: a "tryParseDate" method needs to be added before or after "tryParseTimestamp", or the date/timestamp processing logic needs to be changed.

was:
I need to process a user.csv file like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add the date format options:
{code}
Dataset users = spark.read().format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema should be:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date is processed as a string and the "dateFormat" option is ignored and the date is processed as a string.
If I add the option
{code}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}
I think the issue is somewhere in object CSVInferSchema, function inferField, lines 80-97: a "tryParseDate" method needs to be added before or after "tryParseTimestamp", or the date/timestamp processing logic needs to be changed.

> inferSchema function processed csv date column as string and "dateFormat"
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 2.1.0
> Reporter: Sergey Rubtsov
> Labels: easyfix
> Original Estimate: 6h
> Remaining Estimate: 6h
[jira] [Updated] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
[ https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Rubtsov updated SPARK-19228:
---
Description:
I need to process a user.csv file like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add the date format options:
{code}
Dataset users = spark.read().format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema should be:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date is processed as a string and the "dateFormat" option is ignored and the date is processed as a string.
If I add the option
{code}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}
I think the issue is somewhere in object CSVInferSchema, function inferField, lines 80-97: a "tryParseDate" method needs to be added before or after "tryParseTimestamp", or the date/timestamp processing logic needs to be changed.

was:
I need to process a user.csv file like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add the date format options:
{code}
Dataset users = spark.read().format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema should be:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
This means that the date is processed as a string and the "dateFormat" option is ignored and the date is processed as a string.
If I add the option
{code}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}
I think the issue is somewhere in object CSVInferSchema, function inferField, lines 80-97: a "tryParseDate" method needs to be added before or after "tryParseTimestamp", or the date/timestamp processing logic needs to be changed.

> inferSchema function processed csv date column as string and "dateFormat"
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, SQL
> Affects Versions: 2.1.0
> Reporter: Sergey Rubtsov
> Labels: easyfix
> Original Estimate: 6h
> Remaining Estimate: 6h
[jira] [Created] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
Sergey Rubtsov created SPARK-19228:
--
Summary: inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
Key: SPARK-19228
URL: https://issues.apache.org/jira/browse/SPARK-19228
Project: Spark
Issue Type: Bug
Components: Input/Output, SQL
Affects Versions: 2.1.0
Reporter: Sergey Rubtsov

I need to process a user.csv file like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add the date format options:
{code}
Dataset users = spark.read().format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema should be:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that the date is processed as a string and the "dateFormat" option is ignored and the date is processed as a string.
If I add the option
{code}
.option("timestampFormat", "dd/MM/")
{code}
the result is:
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}
I think the issue is somewhere in object CSVInferSchema, function inferField, lines 80-97: a "tryParseDate" method needs to be added before or after "tryParseTimestamp", or the date/timestamp processing logic needs to be changed.
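The reordering suggested in the report, attempting a date parse before the timestamp parse during field inference, might look roughly like the sketch below. This is a standalone Java illustration built on java.time and is not Spark's actual CSVInferSchema code; the class name, the string type labels, and the complete patterns "dd/MM/yyyy" and "dd/MM/yyyy HH:mm:ss" are assumptions for the example, and the helper names only echo those mentioned in the report.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class InferOrderSketch {
    // Hypothetical helper: true if the whole field parses as a date with the given format.
    static boolean tryParseDate(String field, DateTimeFormatter fmt) {
        try {
            LocalDate.parse(field, fmt);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Hypothetical helper: true if the whole field parses as a date-time with the given format.
    static boolean tryParseTimestamp(String field, DateTimeFormatter fmt) {
        try {
            LocalDateTime.parse(field, fmt);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    // Tries the date format first, then the timestamp format, and falls back to string.
    static String inferField(String field, String datePattern, String timestampPattern) {
        DateTimeFormatter dateFmt = DateTimeFormatter.ofPattern(datePattern);
        DateTimeFormatter tsFmt = DateTimeFormatter.ofPattern(timestampPattern);
        if (tryParseDate(field, dateFmt)) {
            return "date";
        }
        if (tryParseTimestamp(field, tsFmt)) {
            return "timestamp";
        }
        return "string";
    }

    public static void main(String[] args) {
        String datePattern = "dd/MM/yyyy";               // assumed full pattern
        String timestampPattern = "dd/MM/yyyy HH:mm:ss"; // assumed full pattern
        System.out.println(inferField("12/12/2012", datePattern, timestampPattern));          // date
        System.out.println(inferField("12/12/2012 08:30:00", datePattern, timestampPattern)); // timestamp
        System.out.println(inferField("project0", datePattern, timestampPattern));            // string
    }
}
```

Trying the narrower date format first means a value such as 12/12/2012 is labeled a date, while a value with a time component fails the date parse and falls through to the timestamp check.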