Re: [pyspark] Read multiple files in parallel into a single dataframe

2018-05-04 Thread Irving Duran
I could be wrong, but I think you can do a wildcard:

df = spark.read.format('csv').load('/path/to/file*.csv.gz')

Thank You,
Irving Duran

On Fri, May 4, 2018 at 4:38 AM Shuporno Choudhury <shuporno.choudh...@gmail.com> wrote:
> Hi,
>
> I want to read multiple files parallely into 1 dataframe
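As a minimal sketch of how the glob pattern behaves (Spark expands such patterns through the underlying Hadoop file listing, so every match lands in one dataframe), here is the matching logic illustrated with Python's fnmatch; the file names and path are hypothetical:

```python
import fnmatch

# Hypothetical file names sitting in /path/to/ (placeholder directory).
names = ['file01.csv.gz', 'file02.csv.gz', 'notes.txt']

# The same pattern Spark would expand when listing the directory.
matched = [n for n in names if fnmatch.fnmatch(n, 'file*.csv.gz')]
print(matched)  # ['file01.csv.gz', 'file02.csv.gz']

# With a live SparkSession the pattern reads all matches into one
# DataFrame (not executed here; path is a placeholder):
# df = spark.read.format('csv').load('/path/to/file*.csv.gz')
```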

[pyspark] Read multiple files in parallel into a single dataframe

2018-05-04 Thread Shuporno Choudhury
Hi,

I want to read multiple files in parallel into 1 dataframe. But the files have random names and don't conform to any pattern (so I can't use a wildcard). Also, the files can be in different directories.

If I provide the file names in a list to the dataframe reader, it reads them sequentially.
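For files with arbitrary names scattered across directories, one common approach is to collect the paths yourself and hand the whole list to the reader in a single call, since DataFrameReader.csv accepts a list of paths and produces one dataframe. A sketch under those assumptions, using a temporary directory tree as stand-in data:

```python
import os
import tempfile

# Build a sample tree: CSV files with arbitrary names in different
# subdirectories (stands in for the real, unpredictable layout).
root = tempfile.mkdtemp()
for sub, name in [('a', 'xq7.csv'), ('b', 'k2.csv'), ('b', 'zz9.csv')]:
    d = os.path.join(root, sub)
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, name), 'w') as f:
        f.write('id,val\n1,foo\n')

# Collect every CSV path regardless of directory or file name.
paths = [os.path.join(dirpath, fn)
         for dirpath, _, files in os.walk(root)
         for fn in files if fn.endswith('.csv')]
print(len(paths))  # 3

# Passing the list in one call yields a single DataFrame; the read
# itself is distributed across executors (requires a SparkSession,
# so shown as a comment):
# df = spark.read.csv(paths, header=True)
```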