Re: PySpark on Yarn - how group by data properly

2014-09-16 Thread Oleg Ruchovets
I expanded my data set and am running PySpark on YARN. I noticed that only 2 processes were processing the data:

14210 yarn 20 0 2463m 2.0g 9708 R 100.0 4.3 8:22.63 python2.7
32467 yarn 20 0 2519m 2.1g 9720 R 99.3 4.4 7:16.97 python2.7

*Question:* *how to configure

PySpark on Yarn - how group by data properly

2014-09-09 Thread Oleg Ruchovets
Hi, I come from a map/reduce background and am trying to do something quite trivial. I have a lot of files (on HDFS) in this format:

1 , 2 , 3
2 , 3 , 5
1 , 3, 5
2, 3 , 4
2 , 5, 1

I actually need to group by key (first column):

key values
1 -- (2,3),(3,5)
2 --

Re: PySpark on Yarn - how group by data properly

2014-09-09 Thread Davies Liu
On Tue, Sep 9, 2014 at 9:56 AM, Oleg Ruchovets oruchov...@gmail.com wrote:

Hi, I come from a map/reduce background and am trying to do something quite trivial. I have a lot of files (on HDFS) in this format:

1 , 2 , 3
2 , 3 , 5
1 , 3, 5
2, 3 , 4
2 , 5, 1

I actually need
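
For readers following the thread, here is a minimal sketch of the grouping described in the question. It is not taken from any reply in this thread. It assumes every line holds three comma-separated integers, as in the sample. The helper names `parse_line`, `group_by_first_column`, and `main`, the application name, and the input path passed on the command line are all hypothetical.

```python
import sys


def parse_line(line):
    """Turn a line such as '1 , 2 , 3' into (1, (2, 3))."""
    fields = [int(field.strip()) for field in line.split(",")]
    return fields[0], tuple(fields[1:])


def group_by_first_column(lines_rdd):
    """Group the parsed rows of an RDD of text lines by their first column."""
    return (lines_rdd
            .filter(lambda line: line.strip())  # skip blank lines
            .map(parse_line)                    # (key, (v1, v2))
            .groupByKey()                       # key -> iterable of value tuples
            .mapValues(list))                   # materialize each group as a list


def main(input_path):
    # Imported here so parse_line can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="GroupByFirstColumn")  # hypothetical app name
    try:
        grouped = group_by_first_column(sc.textFile(input_path))
        # collect() pulls everything to the driver; only suitable for small results.
        for key, values in grouped.collect():
            print(key, values)
    finally:
        sc.stop()


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])  # input path is supplied by the caller
```

Note that `groupByKey` shuffles every value across the cluster. If only an aggregate per key is needed (a count or a sum, say), `reduceByKey` usually moves less data.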