Could be lots of things - implementations change, caching may have changed, etc. The size of the input doesn't translate directly to heap usage. Here you just need a bit more memory.
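For example, something along these lines rather than more GC flags (the
sizes are illustrative assumptions, not a recommendation):

    # Illustrative only: a larger heap per executor, plus explicit
    # off-heap overhead (spark.executor.memoryOverhead is the Spark 2.3+
    # name for the YARN overhead setting).
    spark.executor.memory=14g
    spark.executor.memoryOverhead=2g
    spark.executor.instances=20
    spark.executor.cores=1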
On Mon, Jul 29, 2019 at 9:03 AM Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote:
>
> Hi Sean,
>
> Yeah, I checked the heap; it's almost full. I checked the GC logs in the
> executors, where I found that GC cycles are kicking in frequently. The
> Executors tab shows red in "Total Time/GC Time".
>
> Also, the data I am dealing with is quite small (~4 GB), and the
> cluster is quite big for such high GC.
>
> But what's troubling me is that this issue doesn't occur in Spark 2.2
> at all. What could be the reason behind such behaviour?
>
> Regards,
> Dhrub
>
> On Mon, Jul 29, 2019 at 6:45 PM Sean Owen <sro...@gmail.com> wrote:
>>
>> -dev@
>>
>> Yep, high GC activity means '(almost) out of memory'. I don't see
>> that you've checked heap usage - is it nearly full?
>> The answer isn't tuning but more heap.
>> (Sometimes with really big heaps the problem is big pauses, but
>> that's not the case here.)
>>
>> On Mon, Jul 29, 2019 at 1:26 AM Dhrubajyoti Hati
>> <dhruba.w...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > We were running Logistic Regression in Spark 2.2.x and then tried
>> > to see how it does in Spark 2.3.x. Now we are facing an issue while
>> > running a Logistic Regression model in Spark 2.3.x on top of YARN
>> > (GCP Dataproc). The treeAggregate step takes a huge amount of time
>> > due to very high GC activity. I have tuned the GC, created
>> > different-sized clusters, tried a higher Spark version (2.4.x), and
>> > smaller data, but nothing helps. The GC time is 100-1000 times the
>> > processing time, on average, per iteration.
>> >
>> > The strange part is that in Spark 2.2 this doesn't happen at all:
>> > same code, same cluster sizing, same data in both cases.
>> >
>> > I was wondering if someone can explain this behaviour and help me
>> > resolve it. How can the same code behave so differently in two
>> > Spark versions, especially the newer ones?
>> >
>> > Here are the configs I used:
>> >
>> > spark.serializer=org.apache.spark.serializer.KryoSerializer
>> >
>> > # GC tuning
>> > spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintFlagsFinal
>> >   -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails
>> >   -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy
>> >   -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark
>> >   -Xms9000m -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
>> >
>> > spark.executor.instances=20
>> > spark.executor.cores=1
>> > spark.executor.memory=9010m
>> >
>> > Regards,
>> > Dhrub
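For concreteness, here is a minimal sketch of the kind of job described
above (the input path, schema, and hyperparameters are assumptions, not
the poster's actual code). spark.ml's LogisticRegression runs each
optimizer iteration's gradient aggregation through treeAggregate, which
is the step where the high GC time was observed:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.sql.SparkSession

    object LRGcRepro {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("lr-gc-repro").getOrCreate()

        // Assumed input: a Parquet dataset with a "features" vector
        // column and a "label" column, as spark.ml expects.
        val train = spark.read.parquet("gs://some-bucket/train.parquet")

        val lr = new LogisticRegression()
          .setMaxIter(100)
          .setRegParam(0.01)

        // Each iteration aggregates gradients across partitions via
        // treeAggregate; this is the stage that showed the GC overhead.
        val model = lr.fit(train)
        println(s"Intercept: ${model.intercept}")

        spark.stop()
      }
    }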