Hi, I am having trouble debugging my driver. It runs correctly on smaller data sets but fails on large ones, and it is very hard to figure out what the bug is. I suspect it may have something to do with the way Spark is installed and configured. I am using PySpark on Google Cloud Platform Dataproc.
The log messages are not helpful. The error message is typically just "User application exited with status 1", along with:

    jsonPayload: {
        class: "server.TThreadPoolServer"
        filename: "hive-server2.log"
        message: "Error occurred during processing of message."
        thread: "HiveServer2-Handler-Pool: Thread-40"
    }

I am able to access the Spark history server; however, it does not capture anything if the driver crashes. I have not been able to figure out how to access the Spark web UI.

My driver program looks something like the pseudocode below: a long list of transformations with a single action (i.e. write) at the end. Adding log messages is not helpful because of lazy evaluation. I am tempted to add something like

    Logger.warn("DEBUG df.count(): {}".format(df.count()))

to try to inline some sort of diagnostic message. What do you think? Is there a better way to debug this?

Kind regards,

Andy

    def run():
        listOfDF = []
        for filePath in listOfFiles:
            df = spark.read.load(filePath, ...)
            listOfDF.append(df)

        list2OfDF = []
        for df in listOfDF:
            df2 = df.select(....)
            list2OfDF.append(df2)

        # will setting the list to None free the cache,
        # or just driver memory?
        listOfDF = None

        df3 = list2OfDF[0]
        for i in range(1, len(list2OfDF)):
            df = list2OfDF[i]
            df3 = df3.join(df, ...)

        # will setting the list to None free the cache,
        # or just driver memory?
        list2OfDF = None

        # lots of narrow transformations on df3
        return df3

    def main():
        df = run()
        df.write()
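P.S. To make the count-based diagnostic I mention above concrete, here is a rough sketch of the helper I had in mind. The name count_checkpoint and the label argument are my own invention; it only assumes the object passed in has a .count() method (as a PySpark DataFrame does), so calling it forces evaluation of the lazy plan up to that point:

```python
def count_checkpoint(df, label):
    # Calling count() is an action, so it forces Spark to evaluate the
    # lazy plan built so far. If the job crashes, the last label that
    # made it into the logs tells me which stage the failure belongs to.
    n = df.count()
    print("DEBUG {}: count = {}".format(label, n))
    # Return the DataFrame unchanged so the call can be threaded inline.
    return df
```

I would then thread it between stages, e.g. df3 = count_checkpoint(df3, "after join {}".format(i)), accepting that each checkpoint triggers an extra (possibly expensive) evaluation.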