Sounds good. The current code checks port availability every 0.5 seconds for up to 5 seconds. If the port becomes available before the timeout, it breaks out of the loop and continues, so a larger timeout costs nothing when the interpreter starts quickly. It would therefore be okay to set this number to something larger, like 30 or 60 seconds, for all interpreters. Making this value configurable through a conf variable makes sense. I have created an issue: https://issues.apache.org/jira/browse/ZEPPELIN-124
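
To make the polling behaviour described above concrete, here is a minimal sketch in Java. It is not the code in RemoteInterpreterProcess.java; the class name PortWaiter, both method names, and the system property example.interpreter.connect.timeout.ms are hypothetical and exist only for this illustration. It assumes that "port is available" means "a TCP connect to the port succeeds".

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Illustrative helper, not Zeppelin code: polls a TCP port until it accepts connections. */
final class PortWaiter {

    /** Hypothetical property name, used only for this example. */
    static final String TIMEOUT_PROPERTY = "example.interpreter.connect.timeout.ms";

    /** Reads the timeout from a JVM system property, falling back to 30 seconds. */
    static long configuredTimeoutMillis() {
        return Long.getLong(TIMEOUT_PROPERTY, 30_000L);
    }

    /**
     * Tries to connect to host:port every pollMillis until timeoutMillis has elapsed.
     * Returns true as soon as one connect succeeds, false if the deadline passes first.
     * pollMillis must be positive (a connect timeout of 0 would block indefinitely).
     */
    static boolean waitForPort(String host, int port, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), (int) pollMillis);
                return true; // something is listening, stop waiting early
            } catch (IOException e) {
                // Connection refused or timed out: the process is not listening yet.
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up after the configured timeout
            }
            Thread.sleep(pollMillis);
        }
    }
}
```

A call such as PortWaiter.waitForPort("localhost", 54930, PortWaiter.configuredTimeoutMillis(), 500) would return as soon as the port answers, so a generous default only delays the failure case, not the normal one.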

Thanks,
moon

On Sat, Jun 20, 2015 at 3:09 PM John Omernik <j...@omernik.com> wrote:

> That was exactly the issue. I moved it (hard coded) to 10 seconds, and now
> all my interpreters start as expected with no issues.
>
> So given this, perhaps 5 seconds, hard coded, isn't a good idea long term
> here. Some options:
>
> 1. Provide a conf variable that can be used, default to 5, and allow it to
> be set globally to something else.
> 2. Set it per interpreter. Some interpreters may just need a little more
> time. This seems like more work, but also more flexible.
> 3. Provide a check before trying to connect to see if the port is
> listening. Perhaps check after 5, then wait 5 more. If it goes longer than
> X timeout value (with X being a variable in the config, with perhaps a
> default of 30) then error out.
>
> A side note, the restarting of the interpreter seems out of whack. You
> would think if the connection failed, that I could restart the interpreter
> and try again, but every time that happened, I had to restart Zeppelin
> before I could even attempt again.
>
> Thanks for the pointer, and glad I could find something here. I'd be
> interested in your thoughts on how to address.
>
> John
>
> On Sat, Jun 20, 2015 at 4:51 PM, moon soo Lee <m...@apache.org> wrote:
>
>> Thanks for the explanation.
>> Zeppelin server daemon is creating a remote process and waits for the
>> interpreter process port to become available for 5 seconds.
>> So, there is a possibility that if your interpreter process is not created
>> and listening on the port within 5 seconds, it would get a connection
>> refused error.
>>
>> https://github.com/apache/incubator-zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterProcess.java#L116
>>
>> This is the related source code. I think you can try increasing the number
>> from 5*1000 to something bigger, and see how it works.
>>
>> Thanks,
>> moon
>>
>> On Sat, Jun 20, 2015 at 7:37 AM John Omernik <j...@omernik.com> wrote:
>>
>>> Thanks for the email Moon, I have gone through some pretty logical
>>> troubleshooting steps, but I can't seem to get this bug to occur
>>> consistently. Like I said, this is an interesting setup in that sometimes
>>> things work normally, sometimes they don't.
>>>
>>> When they don't start, and I check the interpreter logs, they say they
>>> are starting fine, say on port xyz. When I check xyz (this is all after the
>>> error) in netstat, I see it listening properly, and I even see a connection
>>> from localhost to it, but in the interface, I can't run any more paragraphs
>>> with that interpreter. Even if I refresh the whole page.
>>>
>>> One thought I had, and maybe you could help me on this... what is the
>>> process/timeout to connect to a new interpreter? I.e.
>>>
>>> Step 1: Paragraph with interpreter that is not running is executed,
>>> Zeppelin sees it not running and it kicks off the new JVM with the
>>> interpreter
>>> Step 2: Interpreter starts
>>> Step 3: Zeppelin connects to the Interpreter
>>>
>>> I guess what is the process to go from Step 2 to Step 3? Is there a delay
>>> in connection? Is there a retry? I.e. if the interpreter is starting, and
>>> let's say Zeppelin takes 2 seconds after it starts the interpreter and tries
>>> to connect. If the interpreter isn't quite ready does it throw an error?
>>> Does it retry? Does it wait until the interpreter is 100% started before
>>> trying to connect?
>>>
>>> Given the inconsistency, I was thinking timing may be an issue. These
>>> are servers that have quite a bit going on them, thus perhaps my
>>> interpreter starting is taking longer than Zeppelin would expect?
>>>
>>> On Fri, Jun 19, 2015 at 12:49 PM, moon soo Lee <m...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks for sharing the problem.
>>>>
>>>> Zeppelin runs each interpreter instance as a separate JVM process and
>>>> communicates through Thrift. In a little more detail, the Zeppelin server
>>>> daemon invokes the interpreter JVM process with a specific port, and the
>>>> server daemon connects to that port. Your error is that the Zeppelin server
>>>> cannot connect to the interpreter JVM process. Do you see any possibility
>>>> that this process can cause a problem on your system?
>>>>
>>>> About the same variable name in markdown and hive interpreter, it won't
>>>> be a problem.
>>>>
>>>> Thanks,
>>>> moon
>>>>
>>>> On Fri, Jun 19, 2015 at 9:34 AM John Omernik <j...@omernik.com> wrote:
>>>>
>>>>> Another thing that may or may not be related is on the server running
>>>>> Zeppelin, I have multiple interfaces. It "appears" the interpreter binds
>>>>> on all interfaces, but what about the connection? Does that come from a
>>>>> specific interface? Could that be causing the connection refused? (I have
>>>>> two eth interfaces and a docker0 interface on this node)
>>>>>
>>>>> John
>>>>>
>>>>> On Fri, Jun 19, 2015 at 8:02 AM, John Omernik <j...@omernik.com> wrote:
>>>>>
>>>>>> I am not an expert in Java, but could there be an issue using the
>>>>>> markdown and the hive interpreters together because they share a variable
>>>>>> name (md = markdown object in %markdown and md = metadata in %hive)
>>>>>>
>>>>>> markdown:
>>>>>>
>>>>>> public void open() { md = new Markdown4jProcessor(); }
>>>>>>
>>>>>> hive:
>>>>>>
>>>>>> try {
>>>>>>   ResultSetMetaData md = res.getMetaData();
>>>>>>   for (int i = 1; i < md.getColumnCount() + 1; i++) {
>>>>>>     if (i == 1) {
>>>>>>       msg.append(md.getColumnName(i));
>>>>>>     } else {
>>>>>>       msg.append("\t" + md.getColumnName(i));
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>> On Fri, Jun 19, 2015 at 6:56 AM, John Omernik <j...@omernik.com> wrote:
>>>>>>
>>>>>>> Hey all,
>>>>>>>
>>>>>>> I am working with three primary interpreters, %md, %pyspark, and
>>>>>>> %hive. What I am noticing is with my current config, sometimes an
>>>>>>> interpreter will start, other times I'll get the errors below. I wish I
>>>>>>> could say what the rhyme or reason was.
>>>>>>>
>>>>>>> If I get the errors, then I have to restart Zeppelin before it will
>>>>>>> work (or even attempt to work). I've tried clicking "restart
>>>>>>> interpreter" in the interpreters tab, it seems to work, but when I go
>>>>>>> back to a notebook I get "Scheduler already terminated"
>>>>>>>
>>>>>>> What's interesting here is, other than a restart, I can run the
>>>>>>> cells (I have three, one for each interpreter) in different orders and
>>>>>>> get different results. Sometimes if I run %hive first, it works, then
>>>>>>> %pyspark, that will work too, then %md will fail. (Note these are the
>>>>>>> SAME commands, on the same server, same config etc).
>>>>>>>
>>>>>>> Other times, I can get them to run no matter what. It's very
>>>>>>> inconsistent, and combined with the fact that once an interpreter fails,
>>>>>>> there is no getting it back until the whole server is restarted.
>>>>>>>
>>>>>>> Also of note here: I am running a recently compiled version of this
>>>>>>> (I downloaded this on Wed using git clone)
>>>>>>>
>>>>>>> Any help would be appreciated in determining how to troubleshoot
>>>>>>> this!
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>> Example from %md
>>>>>>>
>>>>>>> *In Notebook error*
>>>>>>>
>>>>>>> %md
>>>>>>> #For the Love of Jeezy Pete
>>>>>>>
>>>>>>> org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:135)
>>>>>>> org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:249)
>>>>>>> org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:104)
>>>>>>> org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:202)
>>>>>>> org.apache.zeppelin.scheduler.Job.run(Job.java:170)
>>>>>>> org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:296)
>>>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>> java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>>>>>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>> java.lang.Thread.run(Thread.java:745)
>>>>>>>
>>>>>>> *In Running Shell Window (where I ran bin/zeppelin.sh)*
>>>>>>>
>>>>>>> org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:135)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:249)
>>>>>>>     at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:104)
>>>>>>>     at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:202)
>>>>>>>     at org.apache.zeppelin.scheduler.Job.run(Job.java:170)
>>>>>>>     at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:296)
>>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>> Caused by: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
>>>>>>>     at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
>>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
>>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
>>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:138)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:133)
>>>>>>>     ... 12 more
>>>>>>> Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>>     at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
>>>>>>>     ... 19 more
>>>>>>> Caused by: java.net.ConnectException: Connection refused
>>>>>>>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>>>>>>>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>>>>>>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>>>>>>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>>>>>>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>>>>>>     at java.net.Socket.connect(Socket.java:579)
>>>>>>>     at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
>>>>>>>     ... 20 more
>>>>>>>
>>>>>>> *from interpreter log file:*
>>>>>>>
>>>>>>> INFO [2015-06-19 06:44:29,134] ({Thread-0} RemoteInterpreterServer.java[run]:95) - Starting remote interpreter server on port 54930
>>>>>>>
>>>>>>> *From Zeppelin Log file:*
>>>>>>>
>>>>>>> INFO [2015-06-19 06:44:19,329] ({pool-1-thread-2} SchedulerFactory.java[jobStarted]:132) - Job paragraph_1434713440246_1991176208 started by scheduler remoteinterpreter_328619575
>>>>>>>
>>>>>>> INFO [2015-06-19 06:44:19,331] ({pool-1-thread-2} Paragraph.java[jobRun]:194) - run paragraph 20150619-063040_649381067 using md org.apache.zeppelin.interpreter.LazyOpenInterpreter@38946f29
>>>>>>>
>>>>>>> INFO [2015-06-19 06:44:19,341] ({pool-1-thread-2} RemoteInterpreterProcess.java[reference]:107) - Run interpreter process /mapr/brewpot/mesos/zeppelin/0.5.0-incubating-SNAPSHOT/bin/interpreter.sh -d /mapr/brewpot/mesos/zeppelin/0.5.0-incubating-SNAPSHOT/interpreter/md -p 54930
>>>>>>>
>>>>>>> ERROR [2015-06-19 06:44:24,399] ({Thread-35} RemoteScheduler.java[getStatus]:226) - Can't get status information
>>>>>>>
>>>>>>> org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
>>>>>>>     at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
>>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
>>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
>>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:138)
>>>>>>>     at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.getStatus(RemoteScheduler.java:224)
>>>>>>>     at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.run(RemoteScheduler.java:183)
>>>>>>> Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>>     at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
>>>>>>>     ... 8 more
>>>>>>> Caused by: java.net.ConnectException: Connection refused
>>>>>>>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>>>>>>>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>>>>>>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>>>>>>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>>>>>>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>>>>>>     at java.net.Socket.connect(Socket.java:579)
>>>>>>>     at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
>>>>>>>     ... 9 more
>>>>>>>
>>>>>>> ERROR [2015-06-19 06:44:24,399] ({pool-1-thread-2} Job.java[run]:183) - Job failed
>>>>>>>
>>>>>>> org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:135)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:249)
>>>>>>>     at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:104)
>>>>>>>     at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:202)
>>>>>>>     at org.apache.zeppelin.scheduler.Job.run(Job.java:170)
>>>>>>>     at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:296)
>>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>> Caused by: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:53)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:37)
>>>>>>>     at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:60)
>>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:861)
>>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
>>>>>>>     at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.getClient(RemoteInterpreterProcess.java:138)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:133)
>>>>>>>     ... 12 more
>>>>>>> Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>>>>>>     at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
>>>>>>>     at org.apache.zeppelin.interpreter.remote.ClientFactory.create(ClientFactory.java:51)
>>>>>>>     ... 19 more
>>>>>>> Caused by: java.net.ConnectException: Connection refused
>>>>>>>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>>>>>>>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>>>>>>>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>>>>>>>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>>>>>>>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>>>>>>     at java.net.Socket.connect(Socket.java:579)
>>>>>>>     at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
>>>>>>>     ... 20 more