so... i made some progress after much copying of jar files around (as alluded to by Gopal previously on this thread).
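(for reference, Spark does expose classpath knobs in $SPARK_HOME/conf/spark-defaults.conf: spark.driver.extraClassPath and spark.executor.extraClassPath. a sketch of what that could look like is below; it is untested here, and the path is only an example:)

    # $SPARK_HOME/conf/spark-defaults.conf
    # prepend extra entries to the driver and executor classpaths
    # (example path; point it at wherever the missing jars actually live)
    spark.driver.extraClassPath    /usr/lib/extra-jars/*
    spark.executor.extraClassPath  /usr/lib/extra-jars/*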
following the instructions here: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started and building Spark as instructed there will leave off about a dozen or so jar files that Spark will need:

    ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

i ended up copying the missing jars to $SPARK_HOME/jars, but i would have preferred to just add a path (or paths) to the Spark classpath. i did not find any effective way to do that. in Hive you can specify HIVE_AUX_JARS_PATH, but i don't see an analogous var in Spark, and i don't think it inherits the Hive classpath.

anyway, a simple query is now working under Hive on Spark, so i think i might be over the hump. now it's a matter of comparing the performance with Tez.

Cheers,
Stephen.

On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> ok.. getting further. seems now i have to deploy hive to all nodes in the
> cluster. don't think i had to do that before, but it's not a big deal to
> do it now.
>
> for me:
>
>     HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
>     SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
>
> on all three nodes now.
>
> i started the spark master on the namenode and i started the spark slaves
> (2) on two datanodes of the cluster.
>
> so far so good.
>
> now i run my usual test command:
>
>     $ hive --hiveconf hive.root.logger=DEBUG,console -e 'set
>     hive.execution.engine=spark; select date_key, count(*) from
>     fe_inventory.merged_properties_hist group by 1 order by 1;'
>
> i get a little further now. i found the stderr via the Spark Web UI
> (nice) and it reports this:
>
>     17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to spark://Worker@172.19.79.127:40145
>     Exception in thread "main" java.lang.reflect.InvocationTargetException
>             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>             at java.lang.reflect.Method.invoke(Method.java:483)
>             at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>             at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
>     Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
>             at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
>             at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
>             at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
>             ... 6 more
>
> searching around the internet i find this is probably a compatibility
> issue.
>
> i know. i know. no surprise here.
>
> so i guess i just got to the point where everybody else is... build spark
> w/o hive.
>
> lemme see what happens next.
>
> On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> thanks. I haven't had a chance to dig into this again today, but i do
>> appreciate the pointer. I'll keep you posted.
>>
>> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar <takiar.sa...@gmail.com> wrote:
>>
>>> You can try increasing the value of hive.spark.client.connect.timeout.
>>> Would also suggest taking a look at the HoS Remote Driver logs. The
>>> driver gets launched in a YARN container (assuming you are running
>>> Spark in yarn-client mode), so you just have to find the logs for that
>>> container.
>>>
>>> --Sahil
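(an aside for anyone following along: Sahil's two suggestions might look like this in practice. the timeout value and the application id below are placeholders; hive.spark.client.connect.timeout takes a time-unit suffix and defaults to 1000ms:)

    hive> set hive.spark.client.connect.timeout=10000ms;

    # fetch the remote driver's container logs (placeholder application id)
    $ yarn logs -applicationId application_XXXXXXXXXXXXX_NNNN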
>>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>
>>>> i _seem_ to be getting closer. maybe it's just wishful thinking.
>>>> here's where i'm at now.
>>>>
>>>> 2017-09-26T21:10:38,892 INFO  [stderr-redir-1] client.SparkClientImpl: 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with CreateSubmissionResponse:
>>>> 2017-09-26T21:10:38,892 INFO  [stderr-redir-1] client.SparkClientImpl: {
>>>> 2017-09-26T21:10:38,892 INFO  [stderr-redir-1] client.SparkClientImpl:   "action" : "CreateSubmissionResponse",
>>>> 2017-09-26T21:10:38,892 INFO  [stderr-redir-1] client.SparkClientImpl:   "message" : "Driver successfully submitted as driver-20170926211038-0003",
>>>> 2017-09-26T21:10:38,892 INFO  [stderr-redir-1] client.SparkClientImpl:   "serverSparkVersion" : "2.2.0",
>>>> 2017-09-26T21:10:38,892 INFO  [stderr-redir-1] client.SparkClientImpl:   "submissionId" : "driver-20170926211038-0003",
>>>> 2017-09-26T21:10:38,892 INFO  [stderr-redir-1] client.SparkClientImpl:   "success" : true
>>>> 2017-09-26T21:10:38,892 INFO  [stderr-redir-1] client.SparkClientImpl: }
>>>> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr: closed
>>>> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr: stopped, remaining connections 0
>>>> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e main] client.SparkClientImpl: Timed out waiting for client to connect.
>>>> Possible reasons include network issues, errors in remote driver or the cluster has no available resources, etc.
>>>> Please check YARN or Spark driver's logs for further information.
>>>> java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
>>>>         at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37) ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
>>>>         at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:108) [hive-exec-2.3.0.jar:2.3.0]
>>>>         at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80) [hive-exec-2.3.0.jar:2.3.0]
>>>>         at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:101) [hive-exec-2.3.0.jar:2.3.0]
>>>>         at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:97) [hive-exec-2.3.0.jar:2.3.0]
>>>>         at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:73) [hive-exec-2.3.0.jar:2.3.0]
>>>>         at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62) [hive-exec-2.3.0.jar:2.3.0]
>>>>         at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115) [hive-exec-2.3.0.jar:2.3.0]
>>>>         at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
>>>>         at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236) [hive-exec-2.3.0.jar:2.3.0]
>>>>
>>>> i'll dig some more tomorrow.
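(worth noting: there are about 88 seconds between the submission at 21:10:38 and the ERROR at 21:12:06, which lines up with the 90000ms default of hive.spark.client.server.connect.timeout rather than the 1000ms client connect timeout. so if you need more headroom while debugging, that is probably the knob to turn; the value below is only an example:)

    hive> set hive.spark.client.server.connect.timeout=300000ms;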
>>>> On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>
>>>>> oh. i missed Gopal's reply. oy... that sounds foreboding. I'll keep
>>>>> you posted on my progress.
>>>>>
>>>>> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
>>>>>> spark session: org.apache.hadoop.hive.ql.metadata.HiveException:
>>>>>> Failed to create spark client.
>>>>>>
>>>>>> I get inexplicable errors with Hive-on-Spark unless I do a three step
>>>>>> build.
>>>>>>
>>>>>> Build Hive first, use that version to build Spark, use that Spark
>>>>>> version to rebuild Hive.
>>>>>>
>>>>>> I have to do this to make it work because Spark contains Hive jars
>>>>>> and Hive contains Spark jars in the class-path.
>>>>>>
>>>>>> And specifically I have to edit the pom.xml files, instead of passing
>>>>>> in params with -Dspark.version, because the installed pom files don't
>>>>>> get replacements from the build args.
>>>>>>
>>>>>> Cheers,
>>>>>> Gopal
>>>
>>> --
>>> Sahil Takiar
>>> Software Engineer at Cloudera
>>> takiar.sa...@gmail.com | (510) 673-0309
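(for anyone who wants to try Gopal's three step build, a rough sketch follows. the maven flags and the pom edits are assumptions, not a recipe from this thread; only the make-distribution.sh line is the one quoted above:)

    # step 1: build and install hive so its artifacts land in the local maven repo
    cd hive
    mvn clean install -DskipTests

    # step 2: build spark without hive, after editing spark's pom.xml to point
    # at the hive version from step 1 (per Gopal, -Dspark.version-style build
    # args don't get baked into the installed pom files)
    cd ../spark
    ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz \
        "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

    # step 3: rebuild hive, after editing hive's pom.xml to point at the spark
    # version built in step 2
    cd ../hive
    mvn clean install -DskipTests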