Hi!
A couple of days ago I tested Zeppelin on my laptop, with Cloudera Hadoop
in pseudo-distributed mode and Spark Standalone. I ran into a
fasterxml.jackson problem. Eric Charles said that he had hit a similar
problem and advised removing the jackson-*.jar libraries from the lib
folder, so I did. I also adjusted the parameters in zeppelin-env.sh to
make Zeppelin work locally.
On Monday, when I got to work, it became clear that the configuration
parameters for a local installation and a real cluster installation
differ greatly. And I got this Thrift Transport Exception.
Over two days I rebuilt Zeppelin several times, checked all the
parameters, and checked & changed my network. Finally, when I received
your letter, I checked the MASTER variable. And I remembered those
deleted *.jar files; they turned out to be links in the chain. I copied
them back to the lib folder, and Spark began to work!
But Spark SQL doesn't work: DataFrames can't load & write ORC files. I
get a HiveContext error connected to the metastore_db (Derby). Either
Hive itself (which sits on the same edge node as Zeppelin) has its own
Derby metastore_db, or I should delete the metastore_db from
$ZEPPELIN_HOME/bin. Should I?
The code is

%spark
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

The import succeeds; then I get the error.
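For reference, a common workaround here (a sketch; it assumes Hive's client config lives in /etc/hive/conf, which may differ on your node) is to give Zeppelin the cluster's hive-site.xml, so HiveContext talks to the real metastore instead of creating a local Derby metastore_db in whatever directory Zeppelin was launched from:

```shell
# Illustrative paths -- adjust to your installation.
# Make Zeppelin's HiveContext pick up the cluster metastore settings:
cp /etc/hive/conf/hive-site.xml "$ZEPPELIN_HOME/conf/"
# Restart Zeppelin so the spark interpreter re-reads its configuration:
"$ZEPPELIN_HOME/bin/zeppelin-daemon.sh" restart
```

With that in place, the stray metastore_db directory under $ZEPPELIN_HOME/bin should no longer be used at all.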
On 11/24/2015 07:39 PM, moon soo Lee wrote:
Basically, if SPARK_HOME/bin/spark-shell works, then exporting SPARK_HOME
in conf/zeppelin-env.sh and setting the 'master' property in the
Interpreter menu on the Zeppelin GUI should be enough to connect
successfully to a Spark standalone cluster.
Do you see any new exception in your log file when you set the 'master'
property in the Interpreter menu and still get the 'Scheduler already
Terminated' error? If you can share it, that would be helpful.
Zeppelin does not use HiveThriftServer2, and once it has been built it
needs no dependency other than a JVM to run.
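A minimal conf/zeppelin-env.sh along those lines might look like this (paths are illustrative, taken from the configs quoted later in this thread):

```shell
# conf/zeppelin-env.sh -- point Zeppelin at the local Spark install
export SPARK_HOME=/usr/spark
# Optional: extra JVM flags for the interpreter process, e.g.
# export ZEPPELIN_JAVA_OPTS="-Djava.net.preferIPv4Stack=true"
```

The master URL itself (the "Spark Master at spark://..." value from the Master UI) then goes into the 'master' property in the Interpreter menu, spark section.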
Thanks,
moon
On Tue, Nov 24, 2015 at 11:37 PM Timur Shenkao <t...@timshenkao.su> wrote:
One more question: what should be installed on the server? What are
Zeppelin's dependencies?
Node.js, npm, bower? Scala?
On Tue, Nov 24, 2015 at 5:34 PM, Timur Shenkao <t...@timshenkao.su> wrote:
> I also checked the Spark workers. There are no traces, folders, or logs
> about Zeppelin on them.
> There are Zeppelin logs only on the Spark Master server, where Zeppelin
> is launched.
>
> For example, H2O creates logs on every worker in folders
> /usr/spark/work/app-.....-... Is that correct?
>
> I also launched the Thrift server via
> /usr/spark/sbin/start-thriftserver.sh on the Spark Master. Does Zeppelin
> use org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 ?
>
> For the terminated scheduler, I got
> INFO [2015-11-24 16:26:16,610] ({pool-1-thread-2} SchedulerFactory.java[jobFinished]:138) - Job paragraph_1448346$
> ERROR [2015-11-24 16:26:17,658] ({Thread-34} JobProgressPoller.java[run]:57) - Can not get or update progress
> org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:302)
>     at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getProgress(LazyOpenInterpreter.java:110)
>     at org.apache.zeppelin.notebook.Paragraph.progress(Paragraph.java:174)
>     at org.apache.zeppelin.scheduler.JobProgressPoller.run(JobProgressPoller.java:54)
> Caused by: org.apache.thrift.transport.TTransportException
>     at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
>     at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>     at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
>     at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
>     at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
>     at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
>     at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getProgress(RemoteInterpret$
>     at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getProgress(RemoteInterpreterSer$
> INFO [2015-11-24 16:26:52,617] ({qtp982007015-52} InterpreterRestApi.java[updateSetting]:104) - Update interprete$
> INFO [2015-11-24 16:27:56,319] ({qtp982007015-48} InterpreterRestApi.java[restartSetting]:143) - Restart interpre$
> ERROR [2015-11-24 16:28:09,603] ({qtp982007015-48} NotebookServer.java[runParagraph]:661) - Exception from run
> java.lang.RuntimeException: Scheduler already terminated
>     at org.apache.zeppelin.scheduler.RemoteScheduler.submit(RemoteScheduler.java:124)
>     at org.apache.zeppelin.notebook.Note.run(Note.java:326)
>     at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:659)
>     at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)
>     at org.apache.zeppelin.socket.NotebookSocket.onMessage(NotebookSocket.java:56)
>     at org.eclipse.jetty.websocket.WebSocketConnectionRFC6455$WSFrameHandler.onFrame(WebSocketConnectionRFC645$
>     at org.eclipse.jetty.websocket.WebSocketParserRFC6455.parseNext(WebSocketParserRFC6455.java:349)
>     at org.eclipse.jetty.websocket.WebSocketConnectionRFC6455.handle(WebSocketConnectionRFC6455.java:225)
>     at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
>     at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>     at java.lang.Thread.run(Thread.java:745)
> ERROR [2015-11-24 16:28:36,906] ({qtp982007015-50} NotebookServer.java[runParagraph]:661) - Exception from run
> java.lang.RuntimeException: Scheduler already terminated
>     at org.apache.zeppelin.scheduler.RemoteScheduler.submit(RemoteScheduler.java:124)
>     at org.apache.zeppelin.notebook.Note.run(Note.java:326)
>     at org.apache.zeppelin.socket.NotebookServer.runParagraph(NotebookServer.java:659)
>     at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:126)
>     at org.apache.zeppelin.socket.NotebookSocket.onMessage(NotebookSocket.java:56)
>
>
>
>
> On Tue, Nov 24, 2015 at 4:50 PM, Timur Shenkao <t...@timshenkao.su> wrote:
>
>> Hello!
>>
>> There is no Kerberos and no security in my cluster; it's on an internal
>> network.
>>
>> The %hive and %sh interpreters work: I can create tables, drop them,
>> run pwd, etc. So the problem is in the integration with Spark.
>>
>> In /usr/spark/conf/spark-env.sh on the master node I set / unset in turn
>> MASTER=spark://localhost:7077, MASTER=spark://192.168.58.10:7077 and
>> MASTER=spark://127.0.0.1:7077. On the slaves I set / unset
>> MASTER=spark://192.168.58.10:7077 in different combinations.
>>
>> Zeppelin is installed on the same machine as the Spark Master. So in
>> zeppelin-env.sh I set / unset MASTER=spark://localhost:7077,
>> MASTER=spark://192.168.58.10:7077 and MASTER=spark://127.0.0.1:7077.
>> Yes, I can connect to 192.168.58 and see
>> URL: spark://192.168.58:7077
>> REST URL: spark://192.168.58:6066 (cluster mode)
>>
>> Does the TCP type matter? On my laptop, in pseudo-distributed mode, all
>> connections are IPv4 (tcp), and there are only IPv4 lines in /etc/hosts.
>> On the cluster, Spark automatically, for unknown reasons, uses IPv6
>> (tcp6), and there are IPv6 lines in /etc/hosts.
>> Right now I'm trying to make Spark use IPv4.
>>
>> I switched Spark to IPv4 via -Djava.net.preferIPv4Stack=true
>>
>> It seems that Zeppelin uses / answers the following ports on the Master
>> server (ps axu | grep zeppelin; then, for each PID, netstat -natp | grep
>> ...):
>> 41303
>> 46971
>> 59007
>> 35781
>> 53637
>> 34860
>> 59793
>> 46971
>> 50676
>> 50677
>>
>> 44341
>> 50805
>> 50803
>> 50802
>>
>> 60886
>> 43345
>> 48415
>> 48417
>> 10000
>> 48416
>>
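A quick way to probe whether any of the ports above actually accept connections (a sketch assuming bash and coreutils' `timeout`; the host and port numbers below are only examples):

```shell
# Small helper using bash's built-in /dev/tcp redirection: returns 0 if
# HOST PORT accepts a TCP connection within 2 seconds, non-zero otherwise.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Probe a couple of the ports from the list above on the local host.
for p in 10000 46971; do
  if port_open 127.0.0.1 "$p"; then
    echo "port $p answers"
  else
    echo "port $p does not answer"
  fi
done
```

Run from the Zeppelin host against the Spark Master's address, this separates "port not reachable" problems from Zeppelin-side configuration problems.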
>> Best regards
>>
>> P.S. I put the precise address from the Spark page into zeppelin-env.sh
>> and into the Spark interpreter configuration in the web UI:
>> MASTER=spark://192.168.58.10:7077.
>> Earlier I got a Java error stacktrace in the web UI; after that I BEGAN
>> to receive "Scheduler already terminated".
>>
>> On Tue, Nov 24, 2015 at 12:56 PM, moon soo Lee <m...@apache.org> wrote:
>>
>>> Thanks for sharing the problem.
>>>
>>> Based on your log file, it looks like your Spark master address is
>>> somehow not configured correctly.
>>>
>>> Can you confirm that you have also set the 'master' property in the
>>> Interpreter menu on the GUI, in the spark section?
>>>
>>> If not, you can open the Spark Master UI in your web browser and look
>>> at the first line, "Spark Master at spark://....". That value should
>>> go into the 'master' property in the Interpreter menu on the GUI, in
>>> the spark section.
>>>
>>> Hope this helps
>>>
>>> Best,
>>> moon
>>>
>>> On Tue, Nov 24, 2015 at 3:07 AM Timur Shenkao <t...@timshenkao.su> wrote:
>>>
>>>> Hi!
>>>>
>>>> A new error has appeared: TTransportException.
>>>> I use CentOS 6.7 + Spark 1.5.2 Standalone + Cloudera Hadoop 5.4.8 on
>>>> the same cluster. I can't use Mesos or Spark on YARN.
>>>> I built Zeppelin 0.6.0 like this:
>>>> mvn clean package -DskipTests -Pspark-1.5 -Phadoop-2.6 -Pyarn
>>>> -Ppyspark -Pbuild-distr
>>>>
>>>> I constantly get errors like
>>>> ERROR [2015-11-23 18:14:33,404] ({pool-1-thread-4} Job.java[run]:183) - Job failed
>>>> org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
>>>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:237)
>>>>
>>>> or
>>>>
>>>> ERROR [2015-11-23 18:07:26,535] ({Thread-11} RemoteInterpreterEventPoller.java[run]:72) - Can't get RemoteInterpreterEvent
>>>> org.apache.thrift.transport.TTransportException
>>>>
>>>> I changed several parameters in zeppelin-env.sh and in the Spark
>>>> configs. Whatever I do, these errors keep coming. At the same time,
>>>> when I use local Zeppelin with Hadoop in pseudo-distributed mode +
>>>> Spark Standalone (Master + workers on the same machine), everything
>>>> works.
>>>>
>>>> What configuration (memory, network, CPU cores) is needed for
>>>> Zeppelin to work?
>>>>
>>>> I run H2O on this cluster, and it works.
>>>> Spark Master config:
>>>> SPARK_MASTER_WEBUI_PORT=18080
>>>> HADOOP_CONF_DIR=/etc/hadoop/conf
>>>> SPARK_HOME=/usr/spark
>>>>
>>>> Spark Worker config:
>>>> export HADOOP_CONF_DIR=/etc/hadoop/conf
>>>> export MASTER=spark://192.168.58.10:7077
>>>> export SPARK_HOME=/usr/spark
>>>>
>>>> SPARK_WORKER_INSTANCES=1
>>>> SPARK_WORKER_CORES=4
>>>> SPARK_WORKER_MEMORY=32G
>>>>
>>>>
>>>> I attach the Spark configs + Zeppelin configs & logs for local mode,
>>>> plus the Zeppelin configs & logs from when I defined the IP address
>>>> of the Spark Master explicitly.
>>>> Thank you.
>>>>
>>>
>>
>