[GitHub] [spark] xiaonanyang-db commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on PR #37933: URL: https://github.com/apache/spark/pull/37933#issuecomment-1253325002 > I think the logic in this PR seems reasonable. If prefersDate = true and we have date and timestamp strings, make the column StringType. If prefersDate was not set, then this could be inferred as timestamp if possible. > > I am just not sure about hardcoding date formats in the CSVInferSchema parser. Thanks @sadikovi. Understand your concern, it's a trade off on introducing performance degradation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiaonanyang-db commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on PR #37933: URL: https://github.com/apache/spark/pull/37933#issuecomment-1253319740 > There are many cases to consider here: 1) the CSV data is pure date, pure timestamp, or a mixture. 2) the user specifies datetime pattern or not. > > 1. pure date + no datetime pattern: infer as date type > 2. pure timestamp + no datetime pattern: infer as timestamp type > 3. mixture + no datetime pattern: infer as timestamp type > 4. pure date + datetime pattern: if pattern matches, infer as date type, otherwise string type > 5. pure timestamp + datetime pattern: if pattern matches, infer as timestamp type, otherwise string type > 6. mixture + datetime pattern: I think this is where the problem occurs. We will first parse the data as date, if can't, try parse as timestamp. This is very slow as we invoke the formatter twice. I think we shouldn't support mixture of date and timestamp in this case. If `prefersDate` is true, only try to infer as date type, otherwise only try to infer as timestamp. Thanks @cloud-fan Case 1, 2, 4, 5 are already supported, case 3 is also already supported but this PR adjusts the implementation. For case 6, the behavior after this PR is that we will not always "first parse the data as date, if can't, try parse as timestamp". - When typeSoFar is `DateType` or some other tighter type, we will "first parse the data as date, if can't, try parse as timestamp" - However, when typeSoFar is already `TimestampType` and we encounter another date value, we will directly parse it as timestamp, which would fail and then finalize as `StringType`. In reality, we would not invoke the formatter twice in most cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xiaonanyang-db commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference
xiaonanyang-db commented on PR #37933: URL: https://github.com/apache/spark/pull/37933#issuecomment-1251869330 > Can you update the description to list all of the semantics of the change? You can remove the point where we need to merge them to TimestampType if this is not what the PR implements and replace it with "merging to StringType" instead. > > Is it correct that the date inference is still controlled by "prefersDate"? Sure! Yes, it's still controlled by "prefersDate". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org