[ https://issues.apache.org/jira/browse/SPARK-45440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772724#comment-17772724 ]
Bruce Robbins commented on SPARK-45440:
---------------------------------------

I added {{inferSchema=true}} as a datasource option in your example and I got the expected answer. Otherwise it's doing a max and min on a string (not a number).

> Incorrect summary counts from a CSV file
> ----------------------------------------
>
>                 Key: SPARK-45440
>                 URL: https://issues.apache.org/jira/browse/SPARK-45440
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 3.5.0
>         Environment: Pyspark version 3.5.0
>            Reporter: Evan Volgas
>            Priority: Major
>              Labels: aggregation, bug, pyspark
>
> I am using pip-installed Pyspark version 3.5.0 inside the context of an
> IPython shell. The task is straightforward: take [this CSV
> file|https://gist.githubusercontent.com/evanvolgas/e5cb082673ec947239658291f2251de4/raw/a9c5e9866ac662a816f9f3828a2d184032f604f0/AAPL.csv]
> of AAPL stock prices and compute the minimum and maximum volume weighted
> average price for the entire file.
> My code is
> [here|https://gist.github.com/evanvolgas/e4aa75fec4179bb7075a5283867f127c].
> I've also performed the same computation in DuckDB because I noticed that the
> results of the Spark code are wrong.
> Literally, the exact same SQL in DuckDB and in Spark yield different results,
> and Spark's are wrong.
> I have never seen this behavior in a Spark release before. I'm very confused
> by it, and curious if anyone else can replicate this behavior.
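
For anyone who finds this thread later, here is a minimal plain-Python sketch of why a min/max over strings differs from a min/max over numbers. It does not use Spark, and the values are made up for illustration (they are not taken from the AAPL file). It only mirrors the effect described in the comment: when a column is compared as strings, ordering is character by character rather than numeric.

```python
# Hypothetical sample values, stored as text the way an untyped CSV column would be.
raw_values = ["9.5", "10.25", "100.0"]

# String comparison is lexicographic: "1..." sorts before "9...",
# and "10." sorts before "100" because "." precedes "0".
string_min = min(raw_values)
string_max = max(raw_values)

# Converting to numbers first gives the ordering most people expect.
numeric_values = [float(v) for v in raw_values]
numeric_min = min(numeric_values)
numeric_max = max(numeric_values)

print("as strings:", string_min, string_max)
print("as numbers:", numeric_min, numeric_max)
```

In the string case the smallest value is "10.25" and the largest is "9.5", which is the kind of surprising result the issue reports. Setting {{inferSchema=true}} (or supplying an explicit schema) makes the comparison numeric.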