PySpark version: 3.1.3

*Question 1:* What is DataFilters in the Spark physical plan? How is it different from PushedFilters?

*Question 2:* When joining two datasets, why is the filter isnotnull applied twice on the join key column? In the physical plan it appears once as a PushedFilter, and then it is explicitly applied again right after the scan. Why is that so?
code:

import os
import pandas as pd
import numpy as np
import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()
save_loc = "gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/"

df1 = spark.createDataFrame(pd.DataFrame({
    'a': np.random.choice([1, 2, None], size=1000, p=[0.47, 0.48, 0.05]),
    'b': np.random.random(1000)}))
df2 = spark.createDataFrame(pd.DataFrame({
    'a': np.random.choice([1, 2, None], size=1000, p=[0.47, 0.48, 0.05]),
    'b': np.random.random(1000)}))

df1.write.parquet(os.path.join(save_loc, "dfl_key_int"))
df2.write.parquet(os.path.join(save_loc, "dfr_key_int"))

dfl_int = spark.read.parquet(os.path.join(save_loc, "dfl_key_int"))
dfr_int = spark.read.parquet(os.path.join(save_loc, "dfr_key_int"))

dfl_int.join(dfr_int, on='a', how='inner').explain()

output:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [a#23L, b#24, b#28]
   +- BroadcastHashJoin [a#23L], [a#27L], Inner, BuildRight, false
      :- Filter isnotnull(a#23L)
      :  +- FileScan parquet [a#23L,b#24] Batched: true, DataFilters: [isnotnull(a#23L)], Format: Parquet, Location: InMemoryFileIndex[gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/dfl_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,b:double>
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75]
         +- Filter isnotnull(a#27L)
            +- FileScan parquet [a#27L,b#28] Batched: true, DataFilters: [isnotnull(a#27L)], Format: Parquet, Location: InMemoryFileIndex[gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/dfr_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,b:double>

--
Regards,
Nitin