With batchSize = 1, I think it will become even worse. I'd suggest going with 1.3 to get a taste of the new DataFrame API.
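For example, a rough sketch of that same query with the 1.3 DataFrame API (untested, and assuming the same sqlContext, AWS_BUCKET, date, provider, and log schema as in your getRdd() quoted below) could look like:

    # Sketch only, assuming Spark 1.3 and the schema (user, tax, provider, country)
    # used in the quoted getRdd() below.
    path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
    df = sqlContext.jsonFile(path)   # returns a DataFrame in 1.3

    out = (df.filter((df.provider == provider) & (df.country != ''))
             .select(df.user, df.tax))

    # mirror the original groupByKey + collect on the driver, but with fewer partitions
    grouped = (out.rdd
                  .map(lambda row: (row.user, row.tax))
                  .groupByKey(100)
                  .mapValues(list)
                  .sortByKey()
                  .collect())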
On Wed, Mar 25, 2015 at 11:49 AM, Eduardo Cusa <eduardo.c...@usmediaconsulting.com> wrote:
> Hi Davies, I'm running 1.1.0.
>
> Now I'm following this thread, which recommends using the batchsize parameter = 1:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html
>
> If this does not work, I will install 1.2.1 or 1.3.
>
> Regards
>
>
> On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu <dav...@databricks.com> wrote:
>>
>> What's the version of Spark you are running?
>>
>> There is a bug in the SQL Python API [1]; it's fixed in 1.2.1 and 1.3.
>>
>> [1] https://issues.apache.org/jira/browse/SPARK-6055
>>
>> On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
>> <eduardo.c...@usmediaconsulting.com> wrote:
>> > Hi guys, I'm running the following function with spark-submit and the OS is
>> > killing my process:
>> >
>> > def getRdd(self, date, provider):
>> >     path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
>> >     log2 = self.sqlContext.jsonFile(path)
>> >     log2.registerTempTable('log_test')
>> >     log2.cache()
>>
>> You only visit the table once, so cache() does not help here.
>>
>> >     out = self.sqlContext.sql("SELECT user, tax FROM log_test WHERE provider = '"
>> >         + provider + "' AND country <> ''").map(lambda row: (row.user, row.tax))
>> >     print "out1"
>> >     return map(lambda (x, y): (x, list(y)),
>> >                sorted(out.groupByKey(2000).collect()))
>>
>> 100 partitions (or fewer) will be enough for a 2 GB dataset (see the sketch below the quoted thread).
>>
>> > The input dataset is 57 gzipped files (2 GB).
>> >
>> > The same process with a smaller dataset completed successfully.
>> >
>> > Any ideas for debugging are welcome.
>> >
>> > Regards,
>> > Eduardo
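For reference, here is a minimal sketch of the quoted getRdd() with both suggestions applied (no cache(), since the table is only scanned once, and 100 partitions instead of 2000). It keeps the SchemaRDD-style API from the original 1.1 code; AWS_BUCKET and the log schema are assumed to be as in the original:

    def getRdd(self, date, provider):
        path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
        log2 = self.sqlContext.jsonFile(path)
        log2.registerTempTable('log_test')
        # no cache(): the table is only read once, so caching just adds memory pressure

        out = self.sqlContext.sql(
            "SELECT user, tax FROM log_test "
            "WHERE provider = '" + provider + "' AND country <> ''"
        ).map(lambda row: (row.user, row.tax))

        # 100 partitions is plenty for ~2 GB of input
        return sorted(out.groupByKey(100).mapValues(list).collect())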