Hi,

I have a use case where there are large files in HDFS.

The size of the file is 3 GB.

It is existing code in production, and I am trying to improve the
performance of the job.

Sample Code:
textDF = dataframe  # DataFrame created from an HDFS path
logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))
--> Prints 1
textDF.repartition(100)
logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))
--> Prints 1
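
For reference, here is a minimal, self-contained sketch of the same pattern,
assuming the DataFrame is loaded with spark.read.text from a hypothetical HDFS
path (the production job may build textDF differently):

import logging
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition_check").getOrCreate()

# Hypothetical path and reader; placeholders for however textDF is built in production.
textDF = spark.read.text("hdfs:///tmp/large_file.txt")

logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))
textDF.repartition(100)  # called exactly as in the job above
logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))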

Any suggestions on why this is happening?

Next block of the code, which takes a long time:
rdd.filter(lambda line: len(line) != collistlenth)
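
For context, a sketch of how that filter step fits with the DataFrame above,
assuming collistlenth holds the expected line length and the RDD is derived
from textDF (both placeholders on my part, not the production logic):

# Assumption: collistlenth is the expected line length; the value is a placeholder.
collistlenth = 10

# Assumption: the RDD comes from the same text DataFrame as above.
filtered_rdd = textDF.rdd.map(lambda row: row.value).filter(
    lambda line: len(line) != collistlenth
)
logging.info("Filtered record count: " + str(filtered_rdd.count()))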

Is there any way to parallelize this step and speed up the process?

Thanks,
Asmath
