You should try with TEZ+LLAP. Additionally, you will need to compare different configurations. Finally, any comparison on its own is meaningless: you should benchmark with the queries, data and file formats that your users will actually be using later.
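As a rough sketch of what that comparison can look like (reusing the test query that appears further down this thread; your own queries, data volumes and file formats should take its place):

    # same statement run once per engine; compare wall-clock times afterwards
    hive -e "set hive.execution.engine=tez;   select date_key, count(*) from fe_inventory.merged_properties_hist group by 1 order by 1;"
    hive -e "set hive.execution.engine=spark; select date_key, count(*) from fe_inventory.merged_properties_hist group by 1 order by 1;"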
> On 2. Oct 2017, at 03:06, Stephen Sprague <sprag...@gmail.com> wrote:
>
> so... i made some progress after much copying of jar files around (as
> alluded to by Gopal previously on this thread).
>
> following the instructions here:
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>
> and doing this as instructed will leave off about a dozen or so jar files
> that spark will need:
>
>     ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz
>         "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
>
> i ended up copying the missing jars to $SPARK_HOME/jars, but i would have
> preferred to just add a path (or paths) to the spark classpath. i did not
> find any effective way to do that. in hive you can specify
> HIVE_AUX_JARS_PATH, but i don't see the analogous var in spark - i don't
> think it inherits the hive classpath.
>
> anyway, a simple query is now working under Hive on Spark, so i think i
> might be over the hump. now it's a matter of comparing the performance
> with Tez.
>
> Cheers,
> Stephen.
>
>> On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>> ok.. getting further. seems now i have to deploy hive to all nodes in the
>> cluster - don't think i had to do that before, but not a big deal to do it
>> now.
>>
>> for me:
>>     HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
>>     SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
>> on all three nodes now.
>>
>> i started the spark master on the namenode and the spark slaves (2) on two
>> datanodes of the cluster.
>>
>> so far so good.
>>
>> now i run my usual test command:
>>
>>     $ hive --hiveconf hive.root.logger=DEBUG,console -e 'set
>>       hive.execution.engine=spark; select date_key, count(*) from
>>       fe_inventory.merged_properties_hist group by 1 order by 1;'
>>
>> i get a little further now and find the stderr from the Spark Web UI
>> (nice), and it reports this:
>>
>>     17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to spark://Worker@172.19.79.127:40145
>>     Exception in thread "main" java.lang.reflect.InvocationTargetException
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:483)
>>         at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>>         at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
>>     Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
>>         at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
>>         at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
>>         at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
>>         ... 6 more
>>
>> searching around the internet, i find this is probably a compatibility issue.
>>
>> i know. i know. no surprise here.
>>
>> so i guess i just got to the point where everybody else is... build spark
>> w/o hive.
>>
>> lemme see what happens next.
>>
>>> On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>> thanks. I haven't had a chance to dig into this again today, but i do
>>> appreciate the pointer. I'll keep you posted.
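On the classpath question in the quoted message above: instead of copying the missing jars into $SPARK_HOME/jars, one option that may work is to point Spark's extra-classpath settings (or, for a "hadoop-provided" build, SPARK_DIST_CLASSPATH) at a directory of auxiliary jars. This is only a sketch - the /usr/lib/hive-aux-jars directory is purely illustrative:

    # conf/spark-defaults.conf: prepend a jar directory to the driver and executor classpaths
    spark.driver.extraClassPath    /usr/lib/hive-aux-jars/*
    spark.executor.extraClassPath  /usr/lib/hive-aux-jars/*

    # or in conf/spark-env.sh, for a distribution built with -Phadoop-provided
    export SPARK_DIST_CLASSPATH="$(hadoop classpath):/usr/lib/hive-aux-jars/*"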
>>>
>>>> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar <takiar.sa...@gmail.com> wrote:
>>>> You can try increasing the value of hive.spark.client.connect.timeout.
>>>> Would also suggest taking a look at the HoS Remote Driver logs. The driver
>>>> gets launched in a YARN container (assuming you are running Spark in
>>>> yarn-client mode), so you just have to find the logs for that container.
>>>>
>>>> --Sahil
>>>>
>>>>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>> i _seem_ to be getting closer. maybe it's just wishful thinking. here's
>>>>> where i'm at now.
>>>>>
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl: 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with CreateSubmissionResponse:
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl: {
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "action" : "CreateSubmissionResponse",
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "message" : "Driver successfully submitted as driver-20170926211038-0003",
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "serverSparkVersion" : "2.2.0",
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "submissionId" : "driver-20170926211038-0003",
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "success" : true
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl: }
>>>>>     2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr: closed
>>>>>     2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr: stopped, remaining connections 0
>>>>>     2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e main] client.SparkClientImpl: Timed out waiting for client to connect.
>>>>>     Possible reasons include network issues, errors in remote driver or the cluster has no available resources, etc.
>>>>>     Please check YARN or Spark driver's logs for further information.
>>>>>     java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
>>>>>         at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37) ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
>>>>>         at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:108) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:101) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:97) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:73) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236) [hive-exec-2.3.0.jar:2.3.0]
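For reference, a minimal sketch of the two things Sahil suggests (the timeout value and the application id below are placeholders, and the yarn commands only apply when the Spark master is YARN rather than standalone):

    # raise the Hive-on-Spark client connect timeout (in milliseconds) for this session
    hive --hiveconf hive.spark.client.connect.timeout=60000 \
         -e 'set hive.execution.engine=spark; select date_key, count(*) from fe_inventory.merged_properties_hist group by 1 order by 1;'

    # locate the application hosting the remote driver, then pull its container logs
    yarn application -list
    yarn logs -applicationId application_1506567890123_0042   # placeholder id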
>>>>>
>>>>> i'll dig some more tomorrow.
>>>>>
>>>>>> On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>> oh. i missed Gopal's reply. oy... that sounds foreboding. I'll keep
>>>>>> you posted on my progress.
>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
>>>>>>> > spark session: org.apache.hadoop.hive.ql.metadata.HiveException:
>>>>>>> > Failed to create spark client.
>>>>>>>
>>>>>>> I get inexplicable errors with Hive-on-Spark unless I do a three-step build.
>>>>>>>
>>>>>>> Build Hive first, use that version to build Spark, then use that Spark
>>>>>>> version to rebuild Hive.
>>>>>>>
>>>>>>> I have to do this to make it work because Spark contains Hive jars and
>>>>>>> Hive contains Spark jars in the class-path.
>>>>>>>
>>>>>>> And specifically I have to edit the pom.xml files, instead of passing
>>>>>>> in params with -Dspark.version, because the installed pom files don't
>>>>>>> get replacements from the build args.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Gopal
>>>>
>>>> --
>>>> Sahil Takiar
>>>> Software Engineer at Cloudera
>>>> takiar.sa...@gmail.com | (510) 673-0309
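For anyone hitting the same wall, a rough sketch of the three-step build Gopal describes could look like the following (the directory layout is illustrative, and the pom.xml edits are done by hand, as he notes, rather than via -D properties):

    # 1. build Hive
    cd hive && mvn clean install -DskipTests

    # 2. edit spark/pom.xml so its Hive version property matches the Hive built above,
    #    then build a Spark distribution without bundled Hive jars
    cd ../spark
    ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz \
        "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

    # 3. edit hive/pom.xml so its spark.version matches the Spark built above,
    #    then rebuild (and re-deploy) Hive against it
    cd ../hive && mvn clean install -DskipTests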