Re: How to configure log4j in pyspark to get log level, file name, and line number

2022-01-21 Thread Andrew Davidson
Interesting. I noticed that my driver log messages have a timestamp and function
name but no line number. However, log messages from the other Python files only
contain the message itself. All of my Python code is in a single zip file; the zip
file is passed as a job submit argument.

2022-01-21 19:45:02 WARN  __main__:? - sparkConfig: ('spark.sql.cbo.enabled', 'true')
2022-01-21 19:48:34 WARN  __main__:? - readsSparkDF.rdd.getNumPartitions():1698
__init__ BEGIN
__init__ END
run BEGIN
run rawCountsSparkDF numRows:5387495 numCols:10409

My guess is that I somehow need to change the way log4j is configured on the workers?
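
A possible workaround sketch, assuming plain Python logging (rather than log4j) is
acceptable: the standard logging module records the Python file name and line number
itself, so a format string with %(filename)s and %(lineno)d gives the missing
information in every module shipped in the zip file. On executors the output would
land in the executor stderr logs rather than the driver log.

import logging

# %(filename)s / %(lineno)d come from the Python call site, so every module
# in the --py-files zip gets the file name and line number for free
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(levelname)s %(filename)s:%(lineno)d - %(message)s")
logger = logging.getLogger(__name__)
logger.warning("rowSums BEGIN")
# output shape (file name and values illustrative):
# 2022-01-21 19:45:02,123 WARNING estimatedScalingFactors.py:42 - rowSums BEGIN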

Kind regards

Andy

From: Andrew Davidson 
Date: Thursday, January 20, 2022 at 2:32 PM
To: "user @spark" 
Subject: How to configure log4j in pyspark to get log level, file name, and 
line number

Hi

When I use Python logging for my unit tests, I am able to control the output
format. I get the log level, the file name and line number, and then the message:

[INFO testEstimatedScalingFactors.py:166 - test_B_convertCountsToInts()] BEGIN

In my Spark driver I am able to get the log4j logger:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("estimatedScalingFactors")\
    .getOrCreate()

#
# https://medium.com/@lubna_22592/building-production-pyspark-jobs-5480d03fd71e
# initialize logger for yarn cluster logs
#
log4jLogger = spark.sparkContext._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger(__name__)
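
In case it is useful, a hedged sketch of adjusting the console appender's layout
from the driver through py4j. The appender name "console" is an assumption based on
Spark's default log4j.properties, and the %F/%L file/line conversion characters
would presumably report the Java/py4j call site rather than the Python one, so this
mainly controls the timestamp, level, and logger-name parts of the pattern:

# hedged sketch: swap the root logger's console appender layout via py4j
# (assumes log4j 1.x and an appender named "console", as in Spark's defaults)
jvm = spark.sparkContext._jvm
layout = jvm.org.apache.log4j.PatternLayout("%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n")
rootLogger = jvm.org.apache.log4j.Logger.getRootLogger()
consoleAppender = rootLogger.getAppender("console")
if consoleAppender is not None:
    consoleAppender.setLayout(layout)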

However, it only outputs the message. As a hack I have been adding the function
names to the message.



I wonder if this is because of the way I make my Python code available. When I
submit my job using

‘$ gcloud dataproc jobs submit pyspark’

I pass my Python files in a zip file:

 --py-files ${extraPkg}

I use level WARN because the driver INFO logs are very verbose.
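
As a related sketch, the driver verbosity can also be capped programmatically;
SparkContext.setLogLevel overrides the root log4j level for the current context:

# valid levels include ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
spark.sparkContext.setLogLevel("WARN")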


###
# requires: from functools import reduce
#           from operator import add
#           from pyspark.sql.functions import col
###
def rowSums( self, countsSparkDF, columnNames ):
    self.logger.warn( "rowSums BEGIN" )

    # https://stackoverflow.com/a/54283997/4586180
    retDF = countsSparkDF.na.fill( 0 ).withColumn( "rowSum",
                reduce( add, [col( x ) for x in columnNames] ) )

    self.logger.warn( "rowSums retDF numRows:{} numCols:{}"
                      .format( retDF.count(), len( retDF.columns ) ) )

    self.logger.warn( "rowSums END\n" )
    return retDF
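
For reference, a self-contained sketch of the same row-sum pattern outside the
class; the column names and values are made up, just to show the null fill plus
reduce(add, [col(x) ...]) trick:

from functools import reduce
from operator import add
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("rowSumDemo").getOrCreate()
df = spark.createDataFrame([(1, 2, None), (4, 5, 6)], ["a", "b", "c"])
columnNames = ["a", "b", "c"]
# nulls filled with 0, then the columns are summed into a new "rowSum" column
demoDF = df.na.fill(0).withColumn("rowSum", reduce(add, [col(x) for x in columnNames]))
demoDF.show()   # rowSum is 3 for the first row and 15 for the second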

kind regards

Andy

