Re: Questions about count() performance with dataframes and parquet files

2020-02-13 Thread Ashley Hoff
Hi, Thank you both for your suggestions! These have been eye-openers for me. Just to clarify, I need the counts for logging and auditing purposes; otherwise I would exclude the step. I should have also mentioned that while I am processing around 30 GB of raw data, the individual outputs are

Re: Questions about count() performance with dataframes and parquet files

2020-02-13 Thread Enrico Minack
Ashley, I want to suggest a few optimizations. The problem might go away, but at least performance should improve. The freeze problems could have many causes; the Spark UI SQL pages and stage detail pages would be useful for diagnosing them. You can send them privately, if you wish. 1. the repartition(1)

Re: Environment variable for deleting .sparkStaging

2020-02-13 Thread mailfordebu
Any feedback please? Thanks, Debu > On 13-Feb-2020, at 6:36 PM, Debabrata Ghosh wrote: > Greetings All! I have got plenty of application directories lying around sparkStaging, such as .sparkStaging/application_1580703507814_0074 Would you please be able

Environment variable for deleting .sparkStaging

2020-02-13 Thread Debabrata Ghosh
Greetings All! I have got plenty of application directories lying around sparkStaging, such as .sparkStaging/application_1580703507814_0074 Would you please advise me which variable I need to set in spark-env.sh so that the sparkStaging applications aren't preserved after the