To be concrete, say we have a folder with thousands of tab-delimited csv files (each about 10GB) in the following format:

    id  name  address  city  ...
    1   Matt  add1     LA    ...
    2   Will  add2     LA    ...
    3   Lucy  add3     SF    ...
    ...

And we have a lookup table keyed on the "name" column above:

    name  gender
    Matt  M
    Lucy  F
    ...

Now we would like to output the top 1000 rows of each csv file in the following format:

    id  name  gender
    1   Matt  M
    ...

Can we use pyspark to handle this efficiently?
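
To pin down the result we are after, here is a minimal plain-Python sketch of the transformation for a single file. It is only a reference for the expected output, not a PySpark solution. The function name `head_with_gender`, the in-memory sample data, and the choice to emit an empty gender for names missing from the lookup table (such as Will) are assumptions made for illustration.

```python
import csv
import io
from itertools import islice


def head_with_gender(data_file, lookup_file, n=1000):
    """Return the first n data rows as (id, name, gender) tuples."""
    # Build a name -> gender dict from the tab-delimited lookup table.
    lookup_rows = csv.DictReader(lookup_file, delimiter="\t")
    gender_by_name = {row["name"]: row["gender"] for row in lookup_rows}

    # islice stops after n rows, so the rest of a large file is never read.
    data_rows = csv.DictReader(data_file, delimiter="\t")
    return [
        # Assumption: names absent from the lookup get an empty gender.
        (row["id"], row["name"], gender_by_name.get(row["name"], ""))
        for row in islice(data_rows, n)
    ]


# Tiny in-memory stand-ins for one data file and the lookup table.
data = io.StringIO(
    "id\tname\taddress\tcity\n"
    "1\tMatt\tadd1\tLA\n"
    "2\tWill\tadd2\tLA\n"
    "3\tLucy\tadd3\tSF\n"
)
lookup = io.StringIO("name\tgender\nMatt\tM\nLucy\tF\n")

result = head_with_gender(data, lookup, n=2)
print(result)
```

The key property is that only the first 1000 rows of each file are ever needed, so a good solution should avoid scanning the full 10GB of every file, and the lookup table is small enough to be held in memory on every worker.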