Hi, I have a use case where there are large files in HDFS.
The file is 3 GB. This is existing code in production, and I am trying to improve the performance of the job.

Sample code:

    textDF = dataframe  # DataFrame created from the HDFS path
    logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))  # prints 1
    textDF.repartition(100)
    logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))  # still prints 1

Any suggestions on why the partition count does not change?

The next block of code, which is the slow part, filters lines whose length does not match the expected value:

    rdd.filter(lambda line: len(line) != collistlenth)

Is there any way to parallelize and speed up this step?

Thanks,
Asmath
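P.S. To make the setup easier to reproduce, here is a rough, self-contained sketch of the job as described above. The SparkSession setup, the HDFS path, and the COL_LIST_LENGTH value are placeholders I am using for illustration, not the real names from the production code:

    import logging

    from pyspark.sql import SparkSession

    logging.basicConfig(level=logging.INFO)

    spark = SparkSession.builder.appName("partition-check").getOrCreate()

    # Read the large text file from HDFS (path is hypothetical).
    textDF = spark.read.text("hdfs:///data/large_file.txt")
    logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))

    # repartition() as it is called in the job today; the result is not
    # reassigned, mirroring the snippet above.
    textDF.repartition(100)
    logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))

    # The slow step: find lines whose length does not match the expected value.
    COL_LIST_LENGTH = 20  # hypothetical expected length (collistlenth in the job)
    rdd = textDF.rdd.map(lambda row: row[0])
    bad_lines = rdd.filter(lambda line: len(line) != COL_LIST_LENGTH)
    logging.info("Mismatched lines: " + str(bad_lines.count()))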