Hi
When I use Python logging in my unit tests, I am able to control the output
format. I get the log level, the file name and line number, the function name,
and then the message:
[INFO testEstimatedScalingFactors.py:166 - test_B_convertCountsToInts()] BEGIN
In my Spark driver I am able to get the log4j logger:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("estimatedScalingFactors")\
    .getOrCreate()

# https://medium.com/@lubna_22592/building-production-pyspark-jobs-5480d03fd71e
# initialize logger for yarn cluster logs
log4jLogger = spark.sparkContext._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger(__name__)
However, it only outputs the message. As a hack, I have been adding the
function name to each message.
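The hack could be automated on the Python side. As far as I understand, log4j only sees JVM stack frames (the call arrives through Py4J), so it cannot report the Python file, line, or function by itself. The sketch below is one possible workaround, not a Spark or log4j API: the hypothetical CallerTaggingLogger wraps any object that has info/warn/error methods and prepends the Python caller's location. FakeLog4j is a stand-in so the example runs without Spark.

```python
import inspect
import os


class CallerTaggingLogger:
    """Hypothetical wrapper: prefixes each message with the Python caller's
    file name, line number, and function name before delegating."""

    def __init__(self, wrapped):
        self._wrapped = wrapped  # e.g. the log4j logger obtained from the JVM

    def _tag(self, msg):
        # currentframe() is _tag, f_back is info/warn/error,
        # and f_back.f_back is the code that called the logger.
        caller = inspect.currentframe().f_back.f_back
        return "[{}:{} - {}()] {}".format(
            os.path.basename(caller.f_code.co_filename),
            caller.f_lineno,
            caller.f_code.co_name,
            msg,
        )

    def info(self, msg):
        self._wrapped.info(self._tag(msg))

    def warn(self, msg):
        self._wrapped.warn(self._tag(msg))

    def error(self, msg):
        self._wrapped.error(self._tag(msg))


class FakeLog4j:
    """Stand-in for the JVM logger so the sketch runs without Spark."""

    def __init__(self):
        self.messages = []

    def info(self, msg):
        self.messages.append(("INFO", msg))

    def warn(self, msg):
        self.messages.append(("WARN", msg))

    def error(self, msg):
        self.messages.append(("ERROR", msg))


def rowSums(logger):
    logger.warn("BEGIN")  # no function name typed into the message


fake = FakeLog4j()
rowSums(CallerTaggingLogger(fake))
level, taggedMsg = fake.messages[0]
print(level, taggedMsg)
```

In the driver, the wrapper would be applied to the real logger, for example CallerTaggingLogger(log4jLogger.LogManager.getLogger(__name__)).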
I wonder if this is because of the way I make my Python code available. When I
submit my job using
$ gcloud dataproc jobs submit pyspark
I pass my Python files in a zip file:
--py-files ${extraPkg}
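One way to probe this hypothesis on the Python side is sketched below. It assumes that --py-files effectively puts the zip on the Python import path; the zip name, module name, and logger name are invented for the test. It only exercises Python's own logging module, not log4j, so it can show whether importing from a zip loses caller information in Python but says nothing about the JVM side.

```python
import io
import logging
import os
import sys
import tempfile
import zipfile

# Source of a tiny module that logs from inside a function.
MODULE_SOURCE = (
    "import logging\n"
    "def zippedFunc():\n"
    "    logging.getLogger('zipDemo').warning('hello from zip')\n"
)

# Build a zip containing that one module.
tmpDir = tempfile.mkdtemp()
zipPath = os.path.join(tmpDir, "extraPkgDemo.zip")
with zipfile.ZipFile(zipPath, "w") as zf:
    zf.writestr("zippedmod.py", MODULE_SOURCE)

# A zip file on sys.path is imported through Python's zipimport machinery.
sys.path.insert(0, zipPath)
import zippedmod  # noqa: E402  (import must follow the sys.path change)

# Capture the formatted record in memory.
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter(
    "[%(levelname)s %(filename)s:%(lineno)d - %(funcName)s()] %(message)s"))
demoLogger = logging.getLogger("zipDemo")
demoLogger.addHandler(handler)
demoLogger.propagate = False

zippedmod.zippedFunc()
zipLogLine = buf.getvalue().strip()
print(zipLogLine)
```

If the printed line still carries the file, line, and function, then zip packaging alone does not explain the missing fields, at least for Python logging.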
I log at WARN level because the driver's INFO logs are very verbose:
from functools import reduce
from operator import add
from pyspark.sql.functions import col

###############################################################################
def rowSums( self, countsSparkDF, columnNames ):
    self.logger.warn( "rowSums BEGIN" )

    # https://stackoverflow.com/a/54283997/4586180
    # replace nulls with 0, then add the named columns into a new "rowSum" column
    retDF = countsSparkDF.na.fill( 0 ).withColumn( "rowSum",
                reduce( add, [col( x ) for x in columnNames] ) )

    self.logger.warn( "rowSums retDF numRows:{} numCols:{}"\
                      .format( retDF.count(), len( retDF.columns ) ) )

    self.logger.warn( "rowSums END\n" )
    return retDF
kind regards
Andy