[GitHub] [spark] xiaonanyang-db commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


xiaonanyang-db commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1253325002

   > I think the logic in this PR seems reasonable. If prefersDate = true and 
we have date and timestamp strings, make the column StringType. If prefersDate 
was not set, then this could be inferred as timestamp if possible.
   > 
   > I am just not sure about hardcoding date formats in the CSVInferSchema 
parser.
   
   Thanks @sadikovi. Understand your concern, it's a trade off on introducing 
performance degradation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-21 Thread GitBox


xiaonanyang-db commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1253319740

   > There are many cases to consider here: 1) the CSV data is pure date, pure 
timestamp, or a mixture. 2) the user specifies datetime pattern or not.
   > 
   > 1. pure date + no datetime pattern: infer as date type
   > 2. pure timestamp + no datetime pattern: infer as timestamp type
   > 3. mixture + no datetime pattern: infer as timestamp type
   > 4. pure date + datetime pattern: if pattern matches, infer as date type, 
otherwise string type
   > 5. pure timestamp + datetime pattern: if pattern matches, infer as 
timestamp type, otherwise string type
   > 6. mixture + datetime pattern: I think this is where the problem occurs. 
We will first parse the data as date, if can't, try parse as timestamp. This is 
very slow as we invoke the formatter twice. I think we shouldn't support 
mixture of date and timestamp in this case. If `prefersDate` is true, only try 
to infer as date type, otherwise only try to infer as timestamp.
   
   Thanks @cloud-fan
   Case 1, 2, 4, 5 are already supported, case 3 is also already supported but 
this PR adjusts the implementation.
   For case 6, the behavior after this PR is that we will not always "first 
parse the data as date, if can't, try parse as timestamp". - When typeSoFar is 
`DateType` or some other tighter type, we will "first parse the data as date, 
if can't, try parse as timestamp"
   - However, when typeSoFar is already `TimestampType` and we encounter 
another date value, we will directly parse it as timestamp, which would fail 
and then finalize as `StringType`. 
   In reality, we would not invoke the formatter twice in most cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xiaonanyang-db commented on pull request #37933: [SPARK-40474][SQL] Infer columns with mixed date and timestamp as String in CSV schema inference

2022-09-19 Thread GitBox


xiaonanyang-db commented on PR #37933:
URL: https://github.com/apache/spark/pull/37933#issuecomment-1251869330

   > Can you update the description to list all of the semantics of the change? 
You can remove the point where we need to merge them to TimestampType if this 
is not what the PR implements and replace it with "merging to StringType" 
instead.
   > 
   > Is it correct that the date inference is still controlled by "prefersDate"?
   
   Sure!
   Yes, it's still controlled by "prefersDate".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org