filter rows by all columns

2017-01-16 Thread Shawn Wan
I need to filter out outliers from a dataframe by all columns. I can list every column manually, like: df.filter(x => math.abs(x.get(0).toString().toDouble - means(0)) <= 3 * stddevs(0)).filter(x => math.abs(x.get(1).toString().toDouble - means(1)) <= 3 * stddevs(1)) ... But I want to turn it into a
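
One possible way to avoid listing each column by hand is to check every column index inside a single predicate. This is only a sketch: `OutlierFilter` and `withinSigma` are hypothetical names, and it assumes `means` and `stddevs` are indexed in the same order as the dataframe's columns and that every value parses as a double (nulls would fail, just as in the manual version).

```scala
object OutlierFilter {
  // Hypothetical helper: true when every value lies within k standard
  // deviations of the mean of its column.
  def withinSigma(values: Seq[Any],
                  means: Seq[Double],
                  stddevs: Seq[Double],
                  k: Double = 3.0): Boolean =
    values.indices.forall { i =>
      // Same per-column test as the hand-written filters, applied by index.
      math.abs(values(i).toString.toDouble - means(i)) <= k * stddevs(i)
    }
}

// With a Spark DataFrame (assumption: means/stddevs already computed per column):
// val cleaned = df.filter(row => OutlierFilter.withinSigma(row.toSeq, means, stddevs))
```

Because `forall` stops at the first failing column, a row is dropped as soon as one value falls outside the 3-sigma band, which matches the effect of chaining one filter per column.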

load large number of files from s3

2016-11-11 Thread Shawn Wan
Hi, we have 30 million small files (100k each) on S3. I want to know how bad it is to load them directly from S3 (e.g. driver memory, I/O, executor memory, S3 reliability) before merging them or running distcp. Does anybody have experience with this? Thanks in advance! Regards, Shawn