To be concrete, say we have a folder with thousands of tab-delimited CSV
files, each about 10 GB, with the following columns:

    id    name    address    city...
    1    Matt    add1    LA...
    2    Will    add2    LA...
    3    Lucy    add3    SF...
    ...

We also have a lookup table keyed on the "name" column above:

    name    gender
    Matt    M
    Lucy    F
    ...

Now we want to take the first 1000 rows of each CSV file and output them in
the following format:

    id    name    gender
    1    Matt    M
    ...

Can PySpark handle this efficiently?
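
To pin down the per-file step I have in mind, here is a minimal plain-Python sketch (not PySpark). The function name `head_with_gender`, the `limit` parameter, and the choice to emit an empty gender for names missing from the lookup table are my own assumptions for illustration, and the sample uses only the four columns shown above.

```python
import csv
import io
from itertools import islice


def head_with_gender(lines, gender_by_name, limit=1000):
    """Return [id, name, gender] rows for the first `limit` data rows.

    `lines` is any iterable of tab-delimited text lines with a header row.
    `gender_by_name` is the lookup table as a dict (hypothetical helper input).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    out = []
    # islice stops after `limit` rows, so the rest of a large file is never read.
    for row in islice(reader, limit):
        # Assumption: names absent from the lookup get an empty gender.
        gender = gender_by_name.get(row["name"], "")
        out.append([row["id"], row["name"], gender])
    return out


# Tiny in-memory stand-in for one of the large files.
sample = io.StringIO(
    "id\tname\taddress\tcity\n"
    "1\tMatt\tadd1\tLA\n"
    "2\tWill\tadd2\tLA\n"
    "3\tLucy\tadd3\tSF\n"
)
lookup = {"Matt": "M", "Lucy": "F"}
rows = head_with_gender(sample, lookup, limit=2)
print(rows)
```

Since only the first 1000 rows of each file matter, the work per file is small; what I am unsure about is how best to distribute this step over thousands of files without scanning each 10 GB file in full.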
