Hi guys, I am running the following function with spark-submit and the OS is
killing my process:


  def getRdd(self, date, provider):
    # Read every gzipped JSON log for the given date from S3.
    path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
    log2 = self.sqlContext.jsonFile(path)
    log2.registerTempTable('log_test')
    log2.cache()
    query = ("SELECT user, tax FROM log_test WHERE provider = '" + provider +
             "' AND country <> ''")
    out = self.sqlContext.sql(query).map(lambda row: (row.user, row.tax))
    print "out1"
    # collect() brings every group back to the driver before sorting.
    return map((lambda (x, y): (x, list(y))),
               sorted(out.groupByKey(2000).collect()))
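
To make the intended result concrete, here is a minimal plain-Python sketch
(no Spark involved) of the shape the function is meant to return: a list of
(user, [tax, ...]) pairs sorted by user. The helper name group_sorted and the
sample rows are hypothetical, made up only for illustration. Unlike this
sketch, Spark's groupByKey does not guarantee the order of values within a
group.

```python
from collections import defaultdict

def group_sorted(pairs):
    """Group (key, value) pairs by key and return them sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

# Hypothetical sample rows, not taken from the real logs.
sample = [("u2", 5.0), ("u1", 1.5), ("u2", 0.5)]
result = group_sorted(sample)
print(result)  # [('u1', [1.5]), ('u2', [5.0, 0.5])]
```

In the real function the whole of this list is built on the driver, since
collect() returns every group to it.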



The input dataset has 57 gzipped files (2 GB).

The same process completed successfully with a smaller dataset.

Any ideas on how to debug this are welcome.

Regards
Eduardo
