I think we figured out the issue. It was Cassandra: in that environment, one of the nodes was making writes very slow. We fixed the cluster, and now it's much better.
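
For anyone reading the quoted log further down: the "time 61905ms" on the blocked operator can be related to TIMEOUT_WINDOW_COUNT with some rough arithmetic. This is only a sketch. It assumes the Apex defaults are 120 for TIMEOUT_WINDOW_COUNT and 500 ms for the streaming window, which you should verify against your version, and the class and method names are hypothetical helpers, not part of Apex.

```java
// Hypothetical helper, not part of Apex: estimates how long an operator may
// stay blocked before the AppMaster decides to kill its container.
final class BlockedOperatorBudget {

    // Assumed defaults; verify both against the Apex version you run.
    static final int DEFAULT_TIMEOUT_WINDOW_COUNT = 120;
    static final int DEFAULT_STREAMING_WINDOW_MILLIS = 500;

    // Budget in milliseconds = number of windows * duration of one window.
    static long timeoutMillis(int timeoutWindowCount, int streamingWindowMillis) {
        return (long) timeoutWindowCount * streamingWindowMillis;
    }

    // True when the observed blocked time is past the budget.
    static boolean exceedsBudget(long blockedMillis, int timeoutWindowCount,
                                 int streamingWindowMillis) {
        return blockedMillis > timeoutMillis(timeoutWindowCount, streamingWindowMillis);
    }

    public static void main(String[] args) {
        long blockedMillis = 61905L; // value taken from the AppMaster log in this thread
        System.out.println("default budget ms: "
                + timeoutMillis(DEFAULT_TIMEOUT_WINDOW_COUNT, DEFAULT_STREAMING_WINDOW_MILLIS));
        System.out.println("exceeds default budget: "
                + exceedsBudget(blockedMillis, DEFAULT_TIMEOUT_WINDOW_COUNT,
                        DEFAULT_STREAMING_WINDOW_MILLIS));
        // 600 is an arbitrary illustrative value, not a recommendation.
        System.out.println("exceeds budget at 600 windows: "
                + exceedsBudget(blockedMillis, 600, DEFAULT_STREAMING_WINDOW_MILLIS));
    }
}
```

Under those assumptions, raising the attribute only buys time. In our case the real fix was the slow Cassandra node.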

On 2017-02-28 13:09 (-0800), Sandesh Hegde <[email protected]> wrote:
> Can you please attach the stack trace of the operator?
>
> You can increase the attribute TIMEOUT_WINDOW_COUNT; the AppMaster uses it
> to decide when to kill the blocked operator.
>
> For taking a stack trace, see this blog post:
> https://www.datatorrent.com/blog/getting-stack-traces-apache-apex-applications/
>
> On Tue, Feb 28, 2017 at 12:59 PM Sunil Parmar <[email protected]> wrote:
>
> > Ashwin,
> > I don't see such a warning. I'll PM you the entire log file.
> >
> > On 2017-02-28 12:16 (-0800), Ashwin Chandra Putta <[email protected]> wrote:
> > > Sunil,
> > > This might be related to checkpointing. See:
> > > https://github.com/apache/apex-core/blob/master/engine/src/main/java/com/datatorrent/stram/StreamingContainerManager.java#L2211-L2217
> > >
> > > Also check this piece of code:
> > > https://github.com/apache/apex-core/blob/master/engine/src/main/java/com/datatorrent/stram/StreamingContainerManager.java#L2031-L2044
> > >
> > > Can you paste the output of the warning from the code above, which starts
> > > with 'Marking operator '?
> > >
> > > Regards,
> > > Ashwin.
> > >
> > > On Tue, Feb 28, 2017 at 12:03 PM, Sunil Parmar <[email protected]> wrote:
> > >
> > > > That doesn't seem to be the case. We do see the window ID moving in the
> > > > UI as well.
> > > >
> > > > On 2017-02-28 11:19 (-0800), Munagala Ramanath <[email protected]> wrote:
> > > > > It likely means that the operator is taking too long to return from
> > > > > one of the callbacks like beginWindow(), endWindow(), emitTuples(),
> > > > > etc. Do you have any potentially blocking calls to external systems
> > > > > in any of those callbacks?
> > > > >
> > > > > Ram
> > > > >
> > > > > On Tue, Feb 28, 2017 at 11:09 AM, Sunil Parmar <[email protected]> wrote:
> > > > >
> > > > > > 2017-02-27 19:43:21,926 INFO com.datatorrent.stram.StreamingContainerManager: Blocked operator PTOperator[id=3,name=eventUpdatesFormatter] container PTContainer[id=1(container_1487310232732_0027_02_000111),state=ACTIVE] time 61905ms
> > > > > > 2017-02-27 19:43:22,928 INFO com.datatorrent.stram.StreamingAppMasterService: Completed containerId=container_1487310232732_0027_02_000111, state=COMPLETE, exitStatus=-105, diagnostics=Container killed by the ApplicationMaster.
> > > > > > Container killed on request. Exit code is 143
> > > > > > Container exited with a non-zero exit code 143
> > > > > >
> > > > > > Can anyone help us understand this error? We see that one of the
> > > > > > operators keeps restarting the container; the above error is from
> > > > > > the AppMaster log.
> > > > > >
> > > > > > Thanks,
> > > > > > Sunil
