Hi guys, I'm running the following function with spark-submit, and the OS is killing my process:
def getRdd(self, date, provider):
    path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
    log2 = self.sqlContext.jsonFile(path)
    log2.registerTempTable('log_test')
    log2.cache()
    # Note the space before AND: the original string concatenation
    # produced "...'AND country", which is invalid SQL.
    out = self.sqlContext.sql(
        "SELECT user, tax FROM log_test "
        "WHERE provider = '" + provider + "' AND country <> ''"
    ).map(lambda row: (row.user, row.tax))
    print "out1"
    return map(lambda (x, y): (x, list(y)),
               sorted(out.groupByKey(2000).collect()))

The input dataset has 57 gzip files (2 GB total). The same process completes successfully on a smaller dataset.

Any ideas on how to debug this are welcome.

Regards,
Eduardo
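For context, the last line of the function collects every (user, tax) pair to the driver and groups them there; a plain-Python sketch of that final grouping step (assuming the query yields (user, tax) pairs, and with a hypothetical helper name `group_and_sort`) looks like this:

```python
from itertools import groupby
from operator import itemgetter

def group_and_sort(pairs):
    # Equivalent of sorted(out.groupByKey(...).collect()) followed by
    # map(lambda (x, y): (x, list(y)), ...): sort by key, then collect
    # each key's values into a list.
    pairs = sorted(pairs, key=itemgetter(0))
    return [(user, [tax for _, tax in grp])
            for user, grp in groupby(pairs, key=itemgetter(0))]

print(group_and_sort([("a", 1), ("b", 2), ("a", 3)]))
# → [('a', [1, 3]), ('b', [2])]
```

Because collect() materializes the entire grouped result in the driver process, this is a likely place for a 2 GB compressed input to exhaust driver memory and trigger the OS out-of-memory killer.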