Try it without using a CTE. A SQL CTE is temporary, so you are probably working on two different datasets.
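A minimal sketch of that idea, assuming PySpark and reusing the modifiedData name from your post (the query text, table names, and JDBC options are placeholders, not from your job): persist the result once, then let both the JDBC write and the count() read the same materialized data instead of re-evaluating the query against S3 files that may have changed in between.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical rewrite of the CTE as a plain subquery; your actual
    # query text is not shown in the thread.
    modifiedData = spark.sql("""
        SELECT *
        FROM (SELECT * FROM source_table WHERE some_flag = 1) t
    """)

    # Persist so the write and the count() evaluate the same materialized
    # data instead of re-reading the underlying S3 files twice.
    modifiedData = modifiedData.persist()

    # Trigger materialization once and reuse the result for both actions.
    row_count = modifiedData.count()

    # JDBC connection details below are placeholders.
    (modifiedData.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://<host>:5432/<db>")
        .option("dbtable", "public.modified_data")
        .option("user", "<user>")
        .option("password", "<password>")
        .mode("append")
        .save())

    print(row_count)

    modifiedData.unpersist()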
On Fri, 6 May 2022 at 10:32, Sid <flinkbyhe...@gmail.com> wrote:

> Hi Team,
>
> I am trying to display the counts of a DF which is created by running
> one Spark SQL query with a CTE pattern.
>
> Everything is working as expected. I was able to write the DF to Postgres
> RDS. However, when I try to display the counts using a simple count()
> action, it leads to the below error:
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o321.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 1301 in stage 35.0 failed 4 times, most recent failure: Lost task 1301.3 in
> stage 35.0 (TID 7889, 10.100.6.148, executor 1):
> java.io.FileNotFoundException: File not present on S3
> It is possible the underlying files have been updated. You can explicitly
> invalidate the cache in Spark by running 'REFRESH TABLE tableName' command
> in SQL or by recreating the Dataset/DataFrame involved.
>
> So, I tried something like the below:
>
> print(modifiedData.repartition(modifiedData.rdd.getNumPartitions()).count())
>
> There are 80 partitions being formed for this DF, and the count written
> to the table is 92,665. However, it didn't match the count displayed
> after repartitioning, which was 91,183.
>
> I am not sure why there is this gap. Why are the counts not matching?
> Also, what could be the possible reason for that simple count error?
>
> Environment:
> AWS Glue 1.X
> 10 workers
> Spark 2.4.3
>
> Thanks,
> Sid

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge
+47 480 94 297