Try it without using a CTE. A SQL CTE is temporary, so you are probably working on two different datasets.
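A minimal sketch of that idea, assuming PySpark and reusing the modifiedData name from your post (the query text, table names, and JDBC options are placeholders, not from your job): persist the result once, then let both the JDBC write and the count() read the same materialized data instead of re-evaluating the query against S3 files that may have changed in between.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical rewrite of the CTE as a plain subquery; your actual
    # query text is not shown in the thread.
    modifiedData = spark.sql("""
        SELECT *
        FROM (SELECT * FROM source_table WHERE some_flag = 1) t
    """)

    # Persist so the write and the count() evaluate the same materialized
    # data instead of re-reading the underlying S3 files twice.
    modifiedData = modifiedData.persist()

    # Trigger materialization once and reuse the result for both actions.
    row_count = modifiedData.count()

    # JDBC connection details below are placeholders.
    (modifiedData.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://<host>:5432/<db>")
        .option("dbtable", "public.modified_data")
        .option("user", "<user>")
        .option("password", "<password>")
        .mode("append")
        .save())

    print(row_count)

    modifiedData.unpersist()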
On Fri, 6 May 2022 at 10:32, Sid <flinkbyhe...@gmail.com> wrote:

> Hi Team,
>
> I am trying to display the counts of a DF which is created by running
> one Spark SQL query with a CTE pattern.
>
> Everything is working as expected. I was able to write the DF to Postgres
> RDS. However, when I try to display the counts using a simple count()
> action, it leads to the below error:
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o321.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 1301 in stage 35.0 failed 4 times, most recent failure: Lost task 1301.3 in
> stage 35.0 (TID 7889, 10.100.6.148, executor 1):
> java.io.FileNotFoundException: File not present on S3
> It is possible the underlying files have been updated. You can explicitly
> invalidate the cache in Spark by running 'REFRESH TABLE tableName' command
> in SQL or by recreating the Dataset/DataFrame involved.
>
> So, I tried something like the below:
>
> print(modifiedData.repartition(modifiedData.rdd.getNumPartitions()).count())
>
> There are 80 partitions being formed for this DF, and the count written
> to the table is 92,665. However, it didn't match the count displayed
> after repartitioning, which was 91,183.
>
> I am not sure why there is this gap. Why are the counts not matching?
> Also, what could be the possible reason for that simple count error?
>
> Environment:
> AWS Glue 1.X
> 10 workers
> Spark 2.4.3
>
> Thanks,
> Sid

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge
+47 480 94 297