You should try with TEZ+LLAP. Additionally, you will need to compare different configurations. Finally, any comparison on its own is meaningless: you should benchmark with the queries, data and file formats that your users will actually be using later.
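As a rough sketch of what that comparison can look like (reusing the test query that appears further down this thread; your own queries, data volumes and file formats should take its place):

    # same statement run once per engine; compare wall-clock times afterwards
    hive -e "set hive.execution.engine=tez;   select date_key, count(*) from fe_inventory.merged_properties_hist group by 1 order by 1;"
    hive -e "set hive.execution.engine=spark; select date_key, count(*) from fe_inventory.merged_properties_hist group by 1 order by 1;"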
> On 2. Oct 2017, at 03:06, Stephen Sprague <sprag...@gmail.com> wrote:
>
> so... i made some progress after much copying of jar files around (as
> alluded to by Gopal previously on this thread).
>
> following the instructions here:
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>
> and doing this as instructed will leave off about a dozen or so jar files
> that spark will need:
>
>     ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz
>         "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
>
> i ended up copying the missing jars to $SPARK_HOME/jars, but i would have
> preferred to just add a path (or paths) to the spark classpath. i did not
> find any effective way to do that. in hive you can specify
> HIVE_AUX_JARS_PATH, but i don't see the analogous var in spark - i don't
> think it inherits the hive classpath.
>
> anyway, a simple query is now working under Hive on Spark, so i think i
> might be over the hump. now it's a matter of comparing the performance
> with Tez.
>
> Cheers,
> Stephen.
>
>> On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>> ok.. getting further. seems now i have to deploy hive to all nodes in the
>> cluster - don't think i had to do that before, but not a big deal to do it
>> now.
>>
>> for me:
>>     HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
>>     SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
>> on all three nodes now.
>>
>> i started the spark master on the namenode and the spark slaves (2) on two
>> datanodes of the cluster.
>>
>> so far so good.
>>
>> now i run my usual test command:
>>
>>     $ hive --hiveconf hive.root.logger=DEBUG,console -e 'set
>>       hive.execution.engine=spark; select date_key, count(*) from
>>       fe_inventory.merged_properties_hist group by 1 order by 1;'
>>
>> i get a little further now and find the stderr from the Spark Web UI
>> (nice), and it reports this:
>>
>>     17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to spark://Worker@172.19.79.127:40145
>>     Exception in thread "main" java.lang.reflect.InvocationTargetException
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:483)
>>         at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>>         at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
>>     Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
>>         at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
>>         at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
>>         at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
>>         ... 6 more
>>
>> searching around the internet, i find this is probably a compatibility issue.
>>
>> i know. i know. no surprise here.
>>
>> so i guess i just got to the point where everybody else is... build spark
>> w/o hive.
>>
>> lemme see what happens next.
>>
>>> On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>> thanks. I haven't had a chance to dig into this again today, but i do
>>> appreciate the pointer. I'll keep you posted.
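On the classpath question in the quoted message above: instead of copying the missing jars into $SPARK_HOME/jars, one option that may work is to point Spark's extra-classpath settings (or, for a "hadoop-provided" build, SPARK_DIST_CLASSPATH) at a directory of auxiliary jars. This is only a sketch - the /usr/lib/hive-aux-jars directory is purely illustrative:

    # conf/spark-defaults.conf: prepend a jar directory to the driver and executor classpaths
    spark.driver.extraClassPath    /usr/lib/hive-aux-jars/*
    spark.executor.extraClassPath  /usr/lib/hive-aux-jars/*

    # or in conf/spark-env.sh, for a distribution built with -Phadoop-provided
    export SPARK_DIST_CLASSPATH="$(hadoop classpath):/usr/lib/hive-aux-jars/*"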
>>>
>>>> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar <takiar.sa...@gmail.com> wrote:
>>>> You can try increasing the value of hive.spark.client.connect.timeout.
>>>> Would also suggest taking a look at the HoS Remote Driver logs. The driver
>>>> gets launched in a YARN container (assuming you are running Spark in
>>>> yarn-client mode), so you just have to find the logs for that container.
>>>>
>>>> --Sahil
>>>>
>>>>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>> i _seem_ to be getting closer. maybe it's just wishful thinking. here's
>>>>> where i'm at now.
>>>>>
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl: 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with CreateSubmissionResponse:
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl: {
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "action" : "CreateSubmissionResponse",
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "message" : "Driver successfully submitted as driver-20170926211038-0003",
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "serverSparkVersion" : "2.2.0",
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "submissionId" : "driver-20170926211038-0003",
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl:   "success" : true
>>>>>     2017-09-26T21:10:38,892 INFO [stderr-redir-1] client.SparkClientImpl: }
>>>>>     2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr: closed
>>>>>     2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr: stopped, remaining connections 0
>>>>>     2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e main] client.SparkClientImpl: Timed out waiting for client to connect.
>>>>>     Possible reasons include network issues, errors in remote driver or the cluster has no available resources, etc.
>>>>>     Please check YARN or Spark driver's logs for further information.
>>>>>     java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
>>>>>         at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37) ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
>>>>>         at org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:108) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:101) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:97) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:73) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
>>>>>         at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236) [hive-exec-2.3.0.jar:2.3.0]
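For reference, a minimal sketch of the two things Sahil suggests (the timeout value and the application id below are placeholders, and the yarn commands only apply when the Spark master is YARN rather than standalone):

    # raise the Hive-on-Spark client connect timeout (in milliseconds) for this session
    hive --hiveconf hive.spark.client.connect.timeout=60000 \
         -e 'set hive.execution.engine=spark; select date_key, count(*) from fe_inventory.merged_properties_hist group by 1 order by 1;'

    # locate the application hosting the remote driver, then pull its container logs
    yarn application -list
    yarn logs -applicationId application_1506567890123_0042   # placeholder id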
>>>>>
>>>>> i'll dig some more tomorrow.
>>>>>
>>>>>> On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>> oh. i missed Gopal's reply. oy... that sounds foreboding. I'll keep
>>>>>> you posted on my progress.
>>>>>>
>>>>>>> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
>>>>>>> > spark session: org.apache.hadoop.hive.ql.metadata.HiveException:
>>>>>>> > Failed to create spark client.
>>>>>>>
>>>>>>> I get inexplicable errors with Hive-on-Spark unless I do a three-step build.
>>>>>>>
>>>>>>> Build Hive first, use that version to build Spark, then use that Spark
>>>>>>> version to rebuild Hive.
>>>>>>>
>>>>>>> I have to do this to make it work because Spark contains Hive jars and
>>>>>>> Hive contains Spark jars in the class-path.
>>>>>>>
>>>>>>> And specifically I have to edit the pom.xml files, instead of passing
>>>>>>> in params with -Dspark.version, because the installed pom files don't
>>>>>>> get replacements from the build args.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Gopal
>>>>
>>>> --
>>>> Sahil Takiar
>>>> Software Engineer at Cloudera
>>>> takiar.sa...@gmail.com | (510) 673-0309
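For anyone hitting the same wall, a rough sketch of the three-step build Gopal describes could look like the following (the directory layout is illustrative, and the pom.xml edits are done by hand, as he notes, rather than via -D properties):

    # 1. build Hive
    cd hive && mvn clean install -DskipTests

    # 2. edit spark/pom.xml so its Hive version property matches the Hive built above,
    #    then build a Spark distribution without bundled Hive jars
    cd ../spark
    ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz \
        "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

    # 3. edit hive/pom.xml so its spark.version matches the Spark built above,
    #    then rebuild (and re-deploy) Hive against it
    cd ../hive && mvn clean install -DskipTests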