How to configure log4j in pyspark to get log level, file name, and line number

Andrew Davidson Thu, 20 Jan 2022 14:32:54 -0800

Hi

When I use Python logging in my unit tests, I am able to control the output 
format. I get the log level, the file name and line number, and then the message:


[INFO testEstimatedScalingFactors.py:166 - test_B_convertCountsToInts()] BEGIN

In my Spark driver I am able to get the log4j logger:

        spark = SparkSession\
                    .builder\
                    .appName("estimatedScalingFactors")\
                    .getOrCreate()

        #
        # https://medium.com/@lubna_22592/building-production-pyspark-jobs-5480d03fd71e
        # initialize logger for yarn cluster logs
        #
        log4jLogger = spark.sparkContext._jvm.org.apache.log4j
        logger = log4jLogger.LogManager.getLogger(__name__)

However, it only outputs the message. As a hack, I have been adding the function 
names to the message by hand.
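
One way to automate that hack is sketched below: a thin wrapper that looks up the Python caller's file name, line number, and function name and prepends them before handing the message to the underlying logger. The class name LocatedLogger and the stub logger are hypothetical, and the sketch only assumes that the wrapped object has a warn( msg ) method, as the log4j logger above does. It rests on the assumption that log4j's own location patterns are computed from the JVM call stack, so they would not report Python file names or line numbers.

```python
import inspect
import os


class LocatedLogger:
    """Hypothetical wrapper that prefixes messages with the Python call site."""

    def __init__(self, logger):
        # logger: any object with a warn(msg) method, e.g. a log4j logger
        self._logger = logger

    @staticmethod
    def _format(level, msg, caller):
        # caller is the frame object of the code that called warn()
        fileName = os.path.basename(caller.f_code.co_filename)
        return "[{} {}:{} - {}()] {}".format(
            level, fileName, caller.f_lineno, caller.f_code.co_name, msg)

    def warn(self, msg):
        # f_back of the current frame is the frame that called warn()
        caller = inspect.currentframe().f_back
        self._logger.warn(self._format("WARN", msg, caller))


class StubLogger:
    """Stand-in for the log4j logger so the sketch runs without Spark."""

    def __init__(self):
        self.messages = []

    def warn(self, msg):
        self.messages.append(msg)


def rowSums(logger):
    logger.warn("rowSums BEGIN")


stub = StubLogger()
rowSums(LocatedLogger(stub))
print(stub.messages[0])
```

In the driver, the wrapper would be built around the real logger, for example LocatedLogger( log4jLogger.LogManager.getLogger( __name__ ) ). If the log4j pattern already prints the level, the level in the prefix is redundant and can be dropped.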



I wonder if this is because of the way I make my Python code available. When I 
submit my job using

        $ gcloud dataproc jobs submit pyspark

I pass my Python files in a zip file:

        --py-files ${extraPkg}

I log at level warn because the driver's info logs are very verbose:


###############################################################################

# imports needed at the top of the module
from functools import reduce
from operator import add

from pyspark.sql.functions import col

def rowSums( self, countsSparkDF, columnNames ):
    self.logger.warn( "rowSums BEGIN" )

    # https://stackoverflow.com/a/54283997/4586180
    # replace nulls with 0, then add the named columns together row by row
    retDF = countsSparkDF.na.fill( 0 ).withColumn(
        "rowSum", reduce( add, [col( x ) for x in columnNames] ) )

    self.logger.warn( "rowSums retDF numRows:{} numCols:{}"
                      .format( retDF.count(), len( retDF.columns ) ) )

    self.logger.warn( "rowSums END\n" )
    return retDF

kind regards

Andy
