Thank you for pointing out that LIST_NOTES is broadcast to every client; I'm not sure that is what was meant to happen in that case.
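A minimal sketch of that distinction, using hypothetical names (`ClientConnection`, `RecordingConnection`, `NoteListSketch`, `unicastNoteList`) rather than Zeppelin's actual classes: a read-only request is answered on the requesting connection only, and a broadcast is kept for changes that other clients need to see.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for a websocket connection; not Zeppelin's NotebookSocket.
interface ClientConnection {
    void send(String message);
}

// Records what it was sent, so the sketch can be exercised without a real socket.
class RecordingConnection implements ClientConnection {
    final List<String> received = new ArrayList<>();

    @Override
    public void send(String message) {
        received.add(message);
    }
}

// Hypothetical server-side helper illustrating unicast versus broadcast.
class NoteListSketch {
    // Every connected client; CopyOnWriteArrayList allows iteration without holding a lock.
    private final List<ClientConnection> connections = new CopyOnWriteArrayList<>();

    void connect(ClientConnection conn) {
        connections.add(conn);
    }

    // Read-only request: answer only the client that asked.
    void unicastNoteList(ClientConnection requester, String noteList) {
        requester.send(noteList);
    }

    // State change (for example a note was created or removed): tell every client.
    void broadcastNoteList(String noteList) {
        for (ClientConnection conn : connections) {
            conn.send(noteList);
        }
    }
}
```

Whether the real LIST_NOTES handler can safely switch to the unicast form depends on what else relies on that broadcast, which this sketch does not model.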
I have never seen the behavior you describe; it looks like a race condition on a run note message. Did you have a chance to try applying only the first part of the changes you described earlier, keeping the synchronized noteSocketMap?

--
Alex

On Fri, Apr 8, 2016 at 12:47 PM, Prasad Wagle <prasadwa...@gmail.com> wrote:

> Thanks Alex. I understand the reason for synchronization of
> note<->client_connection. However, I don't think I understand why, if I
> request LIST_NOTES, which does not involve any changes, the server sends the
> list of notes to all clients using broadcastNoteList(), which uses
> broadcastAll.
>
> After deploying the changes I mentioned earlier, the server ran fine for
> 18 hours before running into a deadlock (jstack output below). We could
> download the top level page and notes but not run any paragraphs. Server
> restart fixed the problem. Do you think this is a result of my changes or a
> separate issue?
>
> Found one Java-level deadlock:
> =============================
> "qtp873175411-3443":
>   waiting to lock monitor 0x00000000031e6158 (object 0x00000006c3b1fba8, a java.util.HashMap),
>   which is held by "DefaultQuartzScheduler_Worker-4"
> "DefaultQuartzScheduler_Worker-4":
>   waiting to lock monitor 0x000000000268ad58 (object 0x00000006c34a12c0, a java.util.ArrayList),
>   which is held by "DefaultQuartzScheduler_Worker-2"
> "DefaultQuartzScheduler_Worker-2":
>   waiting to lock monitor 0x00000000031e6158 (object 0x00000006c3b1fba8, a java.util.HashMap),
>   which is held by "DefaultQuartzScheduler_Worker-4"
>
> Java stack information for the threads listed above:
> ===================================================
> "qtp873175411-3443":
>     at org.apache.zeppelin.interpreter.InterpreterFactory.getNoteInterpreterSettingBinding(InterpreterFactory.java:502)
>     - waiting to lock <0x00000006c3b1fba8> (a java.util.HashMap)
>     at org.apache.zeppelin.notebook.NoteInterpreterLoader.getInterpreterSettings(NoteInterpreterLoader.java:60)
>     at org.apache.zeppelin.socket.NotebookServer.sendAllAngularObjects(NotebookServer.java:951)
>     at org.apache.zeppelin.socket.NotebookServer.sendNote(NotebookServer.java:437)
>     at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:123)
>     at org.apache.zeppelin.socket.NotebookSocket.onMessage(NotebookSocket.java:70)
>     at org.eclipse.jetty.websocket.WebSocketConnectionRFC6455$WSFrameHandler.onFrame(WebSocketConnectionRFC6455.java:835)
>     at org.eclipse.jetty.websocket.WebSocketParserRFC6455.parseNext(WebSocketParserRFC6455.java:349)
>     at org.eclipse.jetty.websocket.WebSocketConnectionRFC6455.handle(WebSocketConnectionRFC6455.java:225)
>     at org.eclipse.jetty.io.nio.SslConnection.handle(SslConnection.java:196)
>     at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>     at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>     at java.lang.Thread.run(Thread.java:745)
> "DefaultQuartzScheduler_Worker-4":
>     at org.apache.zeppelin.notebook.Note.getParagraphs(Note.java:441)
>     - waiting to lock <0x00000006c34a12c0> (a java.util.ArrayList)
>     at org.apache.zeppelin.search.LuceneSearch.updateIndexDoc(LuceneSearch.java:172)
>     at org.apache.zeppelin.notebook.Note.persist(Note.java:463)
>     at org.apache.zeppelin.socket.NotebookServer$ParagraphJobListener.afterStatusChange(NotebookServer.java:935)
>     at org.apache.zeppelin.scheduler.Job.setStatus(Job.java:143)
>     at org.apache.zeppelin.notebook.Paragraph.jobAbort(Paragraph.java:271)
>     at org.apache.zeppelin.scheduler.Job.abort(Job.java:232)
>     at org.apache.zeppelin.interpreter.InterpreterFactory.stopJobAllInterpreter(InterpreterFactory.java:593)
>     at org.apache.zeppelin.interpreter.InterpreterFactory.restart(InterpreterFactory.java:547)
>     - locked <0x00000006c3b1fba8> (a java.util.HashMap)
>     at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:440)
>     at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
>     at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
>     - locked <0x00000006c3ac3dc0> (a java.lang.Object)
> "DefaultQuartzScheduler_Worker-2":
>     at org.apache.zeppelin.interpreter.InterpreterFactory.getNoteInterpreterSettingBinding(InterpreterFactory.java:502)
>     - waiting to lock <0x00000006c3b1fba8> (a java.util.HashMap)
>     at org.apache.zeppelin.notebook.NoteInterpreterLoader.getInterpreterSettings(NoteInterpreterLoader.java:60)
>     at org.apache.zeppelin.notebook.NoteInterpreterLoader.get(NoteInterpreterLoader.java:77)
>     at org.apache.zeppelin.notebook.Note.runAll(Note.java:409)
>     - locked <0x00000006c34a12c0> (a java.util.ArrayList)
>     at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:419)
>     at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
>     at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
>     - locked <0x00000006c3abd630> (a java.lang.Object)
>
> Found 1 deadlock.
>
> On Thu, Apr 7, 2016 at 6:46 PM, Alexander Bezzubov <b...@apache.org> wrote:
>
>> Hi,
>>
>> thank you Eric, upgrading Jetty sounds like a great idea!
>>
>> Prasad, I think broadcastAll and synchronization of
>> note<->client_connection are used by default so that multiple people can
>> collaborate on an analysis in the same Note in real time: they notify
>> all other clients who have this Note open about the changes that you
>> made in your browser tab (as you can see with 2 different tabs).
>>
>> I believe it might be possible to replace the map with a concurrent
>> implementation to avoid excessive synchronization though, as we did in [1]
>> before. If the same behaviour persists after upgrading to Jetty 9, could you
>> please create a separate issue for that? I will be happy to help and look
>> more into it.
>>
>> Thanks!
>>
>> 1. https://issues.apache.org/jira/browse/ZEPPELIN-312
>>
>> --
>> Alex
>>
>>
>> On Fri, Apr 8, 2016 at 1:28 AM, Prasad Wagle <prasadwa...@gmail.com> wrote:
>>
>>> Thanks Eric! I created
>>> https://issues.apache.org/jira/browse/ZEPPELIN-798 - Migrate to Jetty
>>> version 9 that has fix for websocket deadlock bug causing Zeppelin server
>>> hangs. This is pretty important for us so please let me know how I can help.
>>>
>>> For now, I have made some changes to reduce websocket communications and
>>> probability of hangs:
>>>
>>> - For the LIST_NOTES operation, I use broadcastNoteList(conn) that
>>>   sends note list to the current connection instead of using broadcastAll.
>>>   What is the reason for using broadcastAll?
>>> - I removed synchronized (noteSocketMap) from broadcast so that one
>>>   bad socket does not hang the server. Do you think this can cause serious
>>>   problems?
>>>
>>>
>>> On Thu, Apr 7, 2016 at 3:06 AM, Eric Charles <e...@apache.org> wrote:
>>>
>>>> On 07/04/16 07:18, Prasad Wagle wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We experienced three Zeppelin server hangs today. I have included one of
>>>>> the stack traces below. It is similar to the stack trace in a websocket
>>>>> deadlock bug in Jetty 8. From the bug report
>>>>> <https://bugs.eclipse.org/bugs/show_bug.cgi?id=389645>:
>>>>>
>>>>>     However, Jetty 9 has already refactored the low level read/write on
>>>>>     a socket heavily to compensate for websocket, spdy, and http/2
>>>>>     Marking this as WONTFIX for Jetty 7/8
>>>>>     Use Jetty 9
>>>>>
>>>>>
>>>>> Is there a workaround? Has anyone tried using Jetty 9 in Zeppelin? What
>>>>> is the effort involved?
>>>>>
>>>>
>>>>
>>>> I have upgraded the source code to Jetty 9 which implies a few
>>>> different constructs.
>>>>
>>>> Could you open a JIRA?
>>>> I will then submit a PR.
>>>>
>>>>
>>>>> Thanks,
>>>>> Prasad
>>>>>
>>>>>
>>>>> *Stack trace*
>>>>>
>>>>>
>>>>> "pool-1-thread-10" #141 prio=5 os_prio=0 tid=0x0000000001513000 nid=0x6749 in Object.wait() [0x00007fdab6ff4000]
>>>>>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>>>     at java.lang.Object.wait(Native Method)
>>>>>     at org.eclipse.jetty.io.nio.SelectChannelEndPoint.blockWritable(SelectChannelEndPoint.java:494)
>>>>>     - locked <0x00000006c50d9b48> (a org.eclipse.jetty.io.nio.SelectChannelEndPoint)
>>>>>     at org.eclipse.jetty.io.nio.SslConnection$SslEndPoint.blockWritable(SslConnection.java:723)
>>>>>     at org.eclipse.jetty.websocket.WebSocketGeneratorRFC6455.flush(WebSocketGeneratorRFC6455.java:248)
>>>>>     at org.eclipse.jetty.websocket.WebSocketGeneratorRFC6455.addFrame(WebSocketGeneratorRFC6455.java:114)
>>>>>     at org.eclipse.jetty.websocket.WebSocketConnectionRFC6455$WSFrameConnection.sendMessage(WebSocketConnectionRFC6455.java:439)
>>>>>     at org.apache.zeppelin.socket.NotebookSocket.send(NotebookSocket.java:89)
>>>>>     at org.apache.zeppelin.socket.NotebookServer.broadcast(NotebookServer.java:286)
>>>>>     - locked <0x00000006c3a1cd08> (a java.util.HashMap)
>>>>>     at org.apache.zeppelin.socket.NotebookServer.broadcastNote(NotebookServer.java:370)
>>>>>     at org.apache.zeppelin.socket.NotebookServer$ParagraphJobListener.afterStatusChange(NotebookServer.java:945)
>>>>>     at org.apache.zeppelin.scheduler.Job.setStatus(Job.java:143)
>>>>>     at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.afterStatusChange(RemoteScheduler.java:379)
>>>>>     at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.getStatus(RemoteScheduler.java:261)
>>>>>     - locked <0x00000006c5885178> (a org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller)
>>>>>     at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:335)
>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>
>
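For reference, the cycle in the Java-level deadlock report quoted above is a lock-order inversion: Worker-4 holds the HashMap monitor and waits for the ArrayList one, while Worker-2 holds the ArrayList monitor and waits for the HashMap one. Below is a minimal sketch of one common remedy, always taking the two monitors in the same order. The names (`LockOrderSketch`, `restart`, `runAll`, `paragraphCount`) are hypothetical and only mimic the shape of the trace; this is not Zeppelin's code and not necessarily the right fix for it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical miniature of the two monitors seen in the jstack output.
class LockOrderSketch {
    private final Map<String, String> settings = new HashMap<>(); // plays the HashMap monitor
    private final List<String> paragraphs = new ArrayList<>();    // plays the ArrayList monitor

    // Shaped like the restart path: locks settings first, then paragraphs.
    void restart(String noteId) {
        synchronized (settings) {
            synchronized (paragraphs) {
                settings.put(noteId, "restarted");
                paragraphs.add("aborted:" + noteId);
            }
        }
    }

    // Shaped like the run-all path, but it also locks settings first.
    // Taking paragraphs first here would recreate the cycle from the trace.
    void runAll(String noteId) {
        synchronized (settings) {
            synchronized (paragraphs) {
                String state = settings.getOrDefault(noteId, "fresh");
                paragraphs.add("ran:" + noteId + ":" + state);
            }
        }
    }

    int paragraphCount() {
        synchronized (paragraphs) {
            return paragraphs.size();
        }
    }
}
```

With a single global order (settings before paragraphs) no thread can hold the second monitor while waiting for the first, so the circular wait cannot form. The cost is that both paths now serialize on the settings monitor.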