Interesting. I noticed that my driver log messages include a time stamp and
function name but no line number. However, log messages from my other Python
files only contain the message. All of my Python code is in a single zip file
that I pass as a job submit argument.
2022-01-21 19:45:02 WARN __main__:? - sparkConfig: ('spark.sql.cbo.enabled', 'true')
2022-01-21 19:48:34 WARN __main__:? - readsSparkDF.rdd.getNumPartitions():1698
__init__ BEGIN
__init__ END
run BEGIN
run rawCountsSparkDF numRows:5387495 numCols:10409
My guess is that somehow I need to change the way log4j is configured on the workers?
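For what it's worth, the `?` in `__main__:?` is log4j's placeholder when location information is unavailable: `%F`/`%L` are computed from the JVM stack trace, and a call that arrives from Python over py4j has no usable Python call site, so reconfiguring the workers will not recover Python line numbers. For reference, a minimal log4j 1.x fragment (the format Spark 3.2 and earlier read from `log4j.properties`; the appender name here is illustrative) looks like:

```properties
# Illustrative log4j.properties fragment for the driver/executors.
# %d = timestamp, %p = level, %c = logger name, %m = message.
# %F/%L would show the py4j call site inside the JVM, not the Python file/line.
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
```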
Kind regards
Andy
From: Andrew Davidson
Date: Thursday, January 20, 2022 at 2:32 PM
To: "user @spark"
Subject: How to configure log4j in pyspark to get log level, file name, and
line number
Hi
When I use Python logging for my unit tests, I am able to control the output
format: I get the log level, the file name and line number, then the message.
[INFO testEstimatedScalingFactors.py:166 - test_B_convertCountsToInts()] BEGIN
In my Spark driver I am able to get the log4j logger:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("estimatedScalingFactors") \
    .getOrCreate()

# https://medium.com/@lubna_22592/building-production-pyspark-jobs-5480d03fd71e
# initialize logger for YARN cluster logs
log4jLogger = spark.sparkContext._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger(__name__)
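One workaround (a sketch, not from the thread) is to keep using Python's own `logging` module for Python-side messages, since it records `%(filename)s` and `%(lineno)d` from the Python frame that made the call; the helper name and in-memory stream below are purely illustrative:

```python
import io
import logging

def make_logger(name="estimatedScalingFactors"):
    """Return a logger that writes to an in-memory stream (for demonstration)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    # filename, lineno, and funcName come from the Python frame
    # that made the logging call, unlike log4j's JVM-side %F/%L
    handler.setFormatter(logging.Formatter(
        "[%(levelname)s %(filename)s:%(lineno)d - %(funcName)s()] %(message)s"))
    logger.addHandler(handler)
    return logger, stream

logger, stream = make_logger()
logger.info("BEGIN")
print(stream.getvalue().strip())
```

On the driver this prints the Python file and line, e.g. something of the shape `[INFO myjob.py:18 - <module>()] BEGIN`; only the py4j/log4j path loses that information.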
However, it only outputs the message; as a hack I have been adding the function
name to the message myself.
I wonder if this is because of the way I make my Python code available. When I
submit my job using

$ gcloud dataproc jobs submit pyspark

I pass my Python files in a zip file:

--py-files ${extraPkg}
I use level WARN because the driver's INFO logs are very verbose.
###
from functools import reduce
from operator import add
from pyspark.sql.functions import col

def rowSums(self, countsSparkDF, columnNames):
    self.logger.warn("rowSums BEGIN")

    # https://stackoverflow.com/a/54283997/4586180
    # fill nulls, then fold the named columns into a single sum expression
    retDF = countsSparkDF.na.fill(0).withColumn(
        "rowSum", reduce(add, [col(x) for x in columnNames]))

    self.logger.warn("rowSums retDF numRows:{} numCols:{}"
                     .format(retDF.count(), len(retDF.columns)))
    self.logger.warn("rowSums END\n")
    return retDF
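The `reduce(add, ...)` call folds the list of Column expressions into a single sum expression. The same fold can be sanity-checked on plain Python lists (a pure-Python sketch, no Spark required; the sample values are made up):

```python
from functools import reduce
from operator import add

# Each inner list stands in for one row's count columns.
rows = [[1, 2, 3], [0, 5, 0]]

# Fold each row's values with the same reduce(add, ...) pattern
# that rowSums() applies to Spark Column expressions.
row_sums = [reduce(add, r) for r in rows]
print(row_sums)  # [6, 5]
```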
Kind regards
Andy