from:"Allen George"

DropDuplicates Behavior

2016-04-29 Thread Allen George

I'd like to echo a question that was asked earlier this year: if we do a global sort of a DataFrame (with two columns: col_1, col_2) by (col_1, col_2 desc) and then call dropDuplicates on col_1, will it retain the first row of each sorted group? That is, will it return the row with the greatest value of co
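
To make the expected result concrete, here is a minimal plain-Python sketch of the semantics the question is hoping for: sort by (col_1, col_2 descending), then keep the first row of each col_1 group. It does not use Spark and makes no claim about what dropDuplicates actually guarantees. The function name first_row_per_key and the sample rows are hypothetical, and col_2 is assumed to be numeric.

```python
from itertools import groupby
from operator import itemgetter


def first_row_per_key(rows):
    """Return one (col_1, col_2) row per col_1 value.

    Hypothetical helper: models "sort, then keep the first row of each
    group" on plain tuples. Assumes col_2 is numeric so it can be negated.
    """
    # Sort by col_1 ascending, then by col_2 descending.
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    # groupby yields consecutive runs sharing the same col_1;
    # the first element of each run is the row the question wants kept.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0))]


if __name__ == "__main__":
    sample = [("a", 1), ("b", 5), ("a", 3), ("b", 2)]
    print(first_row_per_key(sample))
```

Because the grouping step here runs over an explicitly sorted list, the "first row" is well defined. Whether a distributed deduplication step preserves that order is exactly what the question asks.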

Restarting an executor during execution causes it to lose AWS credentials (anyone seen this?)

2016-03-20 Thread Allen George

Hi guys, I'm having a problem where respawning a failed executor during a job that reads/writes Parquet on S3 causes subsequent tasks to fail because of missing AWS keys. Setup: I'm using Spark 1.5.2 with Hadoop 2.7 and running experiments on a simple standalone cluster: 1 master, 2 workers. My