With batchSize = 1, I think it will become even worse.
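
For reference, a minimal sketch of where that parameter goes in PySpark (the master and appName values here are just placeholders); batchSize is the number of Python objects serialized into a single Java object, so batchSize=1 disables batching and adds per-record overhead:

    from pyspark import SparkContext

    # batchSize=1 disables batching: each Python object becomes its own
    # Java object, which increases serialization and JVM overhead
    sc = SparkContext(master="local[*]", appName="log-test", batchSize=1)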

I'd suggest going with 1.3 to get a taste of the new DataFrame API.
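
If you do upgrade, here is a rough, untested sketch of what that query could look like with the 1.3 DataFrame API (sqlContext, path and provider are the names from your snippet below):

    # assumes sqlContext, path and provider are defined as in your getRdd()
    df = sqlContext.jsonFile(path)
    out = (df.where((df.provider == provider) & (df.country != ''))
             .select('user', 'tax')
             .map(lambda row: (row.user, row.tax)))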

On Wed, Mar 25, 2015 at 11:49 AM, Eduardo Cusa
<eduardo.c...@usmediaconsulting.com> wrote:
> Hi Davies, I'm running 1.1.0.
>
> Now I'm following this thread, which recommends setting the batchsize
> parameter to 1:
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html
>
> If this does not work I will install 1.2.1 or 1.3.
>
> Regards
>
>
>
>
>
>
> On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu <dav...@databricks.com> wrote:
>>
>> What's the version of Spark you are running?
>>
>> There is a bug in the SQL Python API [1]; it's fixed in 1.2.1 and 1.3:
>>
>> [1] https://issues.apache.org/jira/browse/SPARK-6055
>>
>> On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
>> <eduardo.c...@usmediaconsulting.com> wrote:
>> > Hi guys, I'm running the following function with spark-submit and the
>> > OS is killing my process:
>> >
>> >
>> >   def getRdd(self, date, provider):
>> >     path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
>> >     log2 = self.sqlContext.jsonFile(path)
>> >     log2.registerTempTable('log_test')
>> >     log2.cache()
>>
>> You only query the table once, so cache() does not help here.
>>
>> >     out = self.sqlContext.sql(
>> >         "SELECT user, tax FROM log_test "
>> >         "WHERE provider = '" + provider + "' AND country <> ''"
>> >     ).map(lambda row: (row.user, row.tax))
>> >     print "out1"
>> >     return map(lambda (x, y): (x, list(y)),
>> >                sorted(out.groupByKey(2000).collect()))
>>
>> 100 partitions (or fewer) will be enough for a 2 GB dataset.
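>>
>> An untested sketch of those two changes together (no cache(), since the
>> table is only queried once, and a smaller partition count for
>> groupByKey; 100 is just an illustrative value):
>>
>>     log2 = self.sqlContext.jsonFile(path)
>>     log2.registerTempTable('log_test')  # no cache() needed
>>     out = self.sqlContext.sql(
>>         "SELECT user, tax FROM log_test "
>>         "WHERE provider = '" + provider + "' AND country <> ''"
>>     ).map(lambda row: (row.user, row.tax))
>>     return map(lambda (x, y): (x, list(y)),
>>                sorted(out.groupByKey(100).collect()))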
>>
>> >
>> >
> The input dataset has 57 gzipped log files (2 GB total).
>> >
>> > The same process with a smaller dataset completed successfully
>> >
> Any ideas for debugging this are welcome.
>> >
>> > Regards
>> > Eduardo
>> >
>> >
>
>
