[jira] [Created] (SPARK-45440) Incorrect summary counts from a CSV file

Evan Volgas (Jira) Fri, 06 Oct 2023 13:52:20 -0700

Evan Volgas created SPARK-45440:
-----------------------------------

             Summary: Incorrect summary counts from a CSV file
                 Key: SPARK-45440
                 URL: https://issues.apache.org/jira/browse/SPARK-45440
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 3.5.0
         Environment: PySpark version 3.5.0
            Reporter: Evan Volgas



I am using pip-installed PySpark 3.5.0 in an IPython shell. The task is 
straightforward: take [this CSV 
file|https://gist.githubusercontent.com/evanvolgas/e5cb082673ec947239658291f2251de4/raw/a9c5e9866ac662a816f9f3828a2d184032f604f0/AAPL.csv] 
of AAPL stock prices and compute the minimum and maximum volume-weighted 
average price (VWAP) for the entire file.
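
For readers who want an engine-independent reference value to compare against, 
here is a minimal sketch of a VWAP min/max computation in plain Python. It is 
not the code from the gist. The column names (Date, Close, Volume), the 
grouping by date, and the sample rows are all assumptions made for 
illustration and are not taken from the AAPL file.

```python
import csv
import io

# Hypothetical sample rows for illustration only; NOT taken from the AAPL file.
# Column names Date, Close, and Volume are assumed, not confirmed by the report.
SAMPLE_CSV = """Date,Close,Volume
2023-01-03,10.0,100
2023-01-03,20.0,300
2023-01-04,30.0,200
2023-01-04,50.0,200
"""


def vwap_by_date(csv_text):
    """Return {date: sum(price * volume) / sum(volume)} for each date."""
    notional = {}  # running sum of price * volume per date
    volume = {}    # running sum of volume per date
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Parse explicitly as numbers; csv yields every field as a string.
        price = float(row["Close"])
        vol = float(row["Volume"])
        key = row["Date"]
        notional[key] = notional.get(key, 0.0) + price * vol
        volume[key] = volume.get(key, 0.0) + vol
    return {d: notional[d] / volume[d] for d in notional}


def vwap_min_max(csv_text):
    """Return (min, max) of the per-date VWAP values."""
    values = list(vwap_by_date(csv_text).values())
    return min(values), max(values)


print(vwap_min_max(SAMPLE_CSV))  # (17.5, 40.0)
```

Running the same function over the downloaded file, with the column names 
adjusted to match it, gives a third result to set beside the Spark and DuckDB 
numbers.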

My code is 
[here|https://gist.github.com/evanvolgas/e4aa75fec4179bb7075a5283867f127c]. 
I've also performed the same computation in DuckDB because I noticed that the 
results of the Spark code are wrong.

The exact same SQL yields different results in DuckDB and in Spark, and 
Spark's results are wrong.

I have never seen this behavior in a Spark release before. I'm very confused by 
it, and curious whether anyone else can reproduce it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

