Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Ben Kaylor
...grouped(100).toList.par.map(groupedParts =>
  spark.read.parquet(groupedParts: _*))

val finalDF = dfs.seq.grouped(100).toList.par
  .map(dfgroup => dfgroup.reduce(_ union _))
  .reduce(_ union _)
  .coalesce(2000)

From: Ben Kaylor
Date: Tuesday, March 16, 2021 at 3:2
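For anyone following along: the quoted code batches a long list of parquet paths into groups of 100, reads each group in parallel with a Scala parallel collection, then unions the batches. A minimal, self-contained sketch of that pattern (the bucket name and path list are made up for illustration, not from the thread):

import org.apache.spark.sql.{DataFrame, SparkSession}
// On Scala 2.13+, parallel collections also need:
// import scala.collection.parallel.CollectionConverters._

val spark = SparkSession.builder().appName("grouped-parquet-read").getOrCreate()

// Hypothetical list of S3 paths produced by an earlier listing step.
val paths: Seq[String] = (0 until 10000).map(i => s"s3a://my-bucket/data/part-$i")

// Read the paths in batches of 100; .par overlaps the per-batch S3
// metadata latency by running the reads concurrently on the driver.
val dfs: Seq[DataFrame] = paths.grouped(100).toList.par
  .map(batch => spark.read.parquet(batch: _*))
  .seq

// Union the per-batch frames (again in parallel batches) and shrink
// the partition count before any downstream write.
val finalDF: DataFrame = dfs.grouped(100).toList.par
  .map(group => group.reduce(_ union _))
  .seq
  .reduce(_ union _)
  .coalesce(2000)

The batching matters because spark.read.parquet with thousands of individual paths, or a reduce over thousands of single-file unions, builds a very deep plan; grouping keeps both the read calls and the union tree manageable.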

Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-16 Thread Ben Kaylor
P.S.: 3. If fast updates are required, one way would be capturing S3
events & putting the paths/modification dates/etc. of the paths into
DynamoDB/your DB of choice.

From: Boris Litvak
Sent: Tuesday, 16 March 2021 9:03
To: Ben Kaylor;
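For concreteness, a hedged sketch of that event-capture approach, assuming the bucket is configured to send ObjectCreated notifications to a Lambda function that writes each key into DynamoDB. The table name s3-object-index and its attribute names are my assumptions, not anything from the thread:

import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.S3Event
import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, PutItemRequest}
import scala.jdk.CollectionConverters._

// Lambda entry point: invoked with a batch of S3 event notifications.
class S3IndexHandler extends RequestHandler[S3Event, Unit] {
  private val ddb = DynamoDbClient.create()

  override def handleRequest(event: S3Event, context: Context): Unit =
    event.getRecords.asScala.foreach { rec =>
      val item = Map(
        // Assumed schema: partition key "bucket", sort key "key".
        // Note: keys in S3 events arrive URL-encoded; decode before use.
        "bucket"    -> AttributeValue.builder().s(rec.getS3.getBucket.getName).build(),
        "key"       -> AttributeValue.builder().s(rec.getS3.getObject.getKey).build(),
        "eventTime" -> AttributeValue.builder().s(rec.getEventTime.toString).build(),
        "eventName" -> AttributeValue.builder().s(rec.getEventName).build()
      ).asJava
      ddb.putItem(PutItemRequest.builder().tableName("s3-object-index").item(item).build())
    }
}

A Spark job can then query the table for paths modified since its last run instead of listing millions of keys through the S3 API.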

Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Ben Kaylor
Not sure of the answer on this, but I am solving similar issues, so I am looking for additional feedback on how to do this. My thought: if this can't be done via Spark and the S3 boto commands, then have the apps self-report those changes, where instead of having just mappers discovering the keys, you have services self
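A minimal sketch of that self-reporting idea, reusing the assumed s3-object-index table from the sketch above: each producing service registers the key it just wrote, and readers query the index rather than listing the bucket (error handling and result pagination omitted):

import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, PutItemRequest, QueryRequest}
import java.time.Instant
import scala.jdk.CollectionConverters._

object KeyRegistry {
  private val ddb   = DynamoDbClient.create()
  private val table = "s3-object-index" // assumed table name

  // Producers call this right after a successful S3 write, so the key
  // is discoverable without any ListObjects pass.
  def register(bucket: String, key: String): Unit = {
    val item = Map(
      "bucket"     -> AttributeValue.builder().s(bucket).build(),
      "key"        -> AttributeValue.builder().s(key).build(),
      "modifiedAt" -> AttributeValue.builder().s(Instant.now().toString).build()
    ).asJava
    ddb.putItem(PutItemRequest.builder().tableName(table).item(item).build())
  }

  // Readers fetch the known keys for a bucket from the index
  // (single page shown; a real job would follow lastEvaluatedKey).
  def keysFor(bucket: String): Seq[String] = {
    val req = QueryRequest.builder()
      .tableName(table)
      .keyConditionExpression("bucket = :b")
      .expressionAttributeValues(Map(":b" -> AttributeValue.builder().s(bucket).build()).asJava)
      .build()
    ddb.query(req).items().asScala.map(_.get("key").s()).toSeq
  }
}

The trade-off versus the event-capture approach is that this requires touching every writer, but it avoids the eventual-consistency and delivery-latency caveats of S3 notifications.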