Re: hive on spark - why is it so hard?

2017-10-02 Thread Jörn Franke
You should try with TEZ+LLAP.

Additionally, you will need to compare different configurations.

Finally, an arbitrary comparison is meaningless: you should use the queries, data and file formats that your users will actually be running later.
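
For example, run the same test query once per engine and compare wall clock times. A rough sketch only, assuming TEZ is already configured for your Hive build:

  hive -e 'set hive.execution.engine=tez;   select date_key, count(*) from fe_inventory.merged_properties_hist group by 1 order by 1;'
  hive -e 'set hive.execution.engine=spark; select date_key, count(*) from fe_inventory.merged_properties_hist group by 1 order by 1;'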

> On 2. Oct 2017, at 03:06, Stephen Sprague  wrote:
> 
> so...  i made some progress after much copying of jar files around (as 
> alluded to by Gopal previously on this thread).
> 
> 
> following the instructions here: 
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
> 
> and doing this as instructed will leave off about a dozen or so jar files 
> that spark'll need:
>   ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz 
> "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
> 
> i ended up copying the missing jars to $SPARK_HOME/jars. i would have 
> preferred to just add a path (or paths) to the spark classpath, but i did not 
> find any effective way to do that. In hive you can specify HIVE_AUX_JARS_PATH but 
> i don't see the analogous var in spark - i don't think it inherits the hive 
> classpath.
> 
> anyway a simple query is now working under Hive on Spark so i think i might 
> be over the hump.  Now it's a matter of comparing the performance with Tez.
> 
> Cheers,
> Stephen.
> 
> 
>> On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague  wrote:
>> ok.. getting further.  seems now i have to deploy hive to all nodes in the 
>> cluster - don't think i had to do that before but not a big deal to do it 
>> now.
>> 
>> for me:
>> HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
>> SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
>> 
>> on all three nodes now.
>> 
>> i started spark master on the namenode and i started spark slaves (2) on two 
>> datanodes of the cluster. 
>> 
>> so far so good.
>> 
>> now i run my usual test command.
>> 
>> $ hive --hiveconf hive.root.logger=DEBUG,console -e 'set 
>> hive.execution.engine=spark; select date_key, count(*) from 
>> fe_inventory.merged_properties_hist group by 1 order by 1;'
>> 
>> i get a little further now and find the stderr from the Spark Web UI 
>> interface (nice) and it reports this:
>> 
>> 17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to 
>> spark://Worker@172.19.79.127:40145
>> Exception in thread "main" java.lang.reflect.InvocationTargetException
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>  at java.lang.reflect.Method.invoke(Method.java:483)
>>  at 
>> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>>  at 
>> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
>> Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
>>  at 
>> org.apache.hive.spark.client.rpc.RpcConfiguration.(RpcConfiguration.java:47)
>>  at 
>> org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:134)
>>  at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
>>  ... 6 more
>> 
>> 
>> searching around the internet i find this is probably a compatibility issue.
>> 
>> i know. i know. no surprise here.  
>> 
>> so i guess i just got to the point where everybody else is... build spark 
>> w/o hive. 
>> 
>> lemme see what happens next.
>> 
>> 
>> 
>> 
>> 
>>> On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague  wrote:
>>> thanks.  I haven't had a chance to dig into this again today but i do 
>>> appreciate the pointer.  I'll keep you posted.
>>> 
 On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar  
 wrote:
 You can try increasing the value of hive.spark.client.connect.timeout. 
 Would also suggest taking a look at the HoS Remote Driver logs. The driver 
 gets launched in a YARN container (assuming you are running Spark in 
 yarn-client mode), so you just have to find the logs for that container.
 
 --Sahil
 
> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague  
> wrote:
> i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's 
> where i'm at now.
> 
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: 
> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:   
> "action" : "CreateSubmissionResponse",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:   
> "message" : "Driver successfully submitted as driver-20170926211038-0003",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:   
> "serverSparkVersion" : "2.2.0",
> 

Re: hive on spark - why is it so hard?

2017-10-01 Thread Stephen Sprague
so...  i made some progress after much copying of jar files around (as
alluded to by Gopal previously on this thread).


following the instructions here:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

and doing this as instructed will leave off about a dozen or so jar files
that spark'll need:
  ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

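(side note: since that profile also leaves out the hadoop jars, spark's
"hadoop free" build docs say to point SPARK_DIST_CLASSPATH at your hadoop
install, e.g. in $SPARK_HOME/conf/spark-env.sh:

  export SPARK_DIST_CLASSPATH=$(hadoop classpath)

i'm assuming a working hadoop client on each node here - treat that as a
sketch, not a verified config.)
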
i ended up copying the missing jars to $SPARK_HOME/jars. i would have
preferred to just add a path (or paths) to the spark classpath, but i did not
find any effective way to do that. In hive you can specify HIVE_AUX_JARS_PATH
but i don't see the analogous var in spark - i don't think it inherits the
hive classpath.
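
(for the record, the closest spark equivalents i'm aware of are
spark.driver.extraClassPath and spark.executor.extraClassPath in
$SPARK_HOME/conf/spark-defaults.conf, e.g. something like:

  spark.driver.extraClassPath    /usr/lib/hive-on-spark-extra-jars/*
  spark.executor.extraClassPath  /usr/lib/hive-on-spark-extra-jars/*

where /usr/lib/hive-on-spark-extra-jars is a hypothetical directory holding
the dozen-odd missing jars - i haven't verified that route with hive-on-spark,
copying them into $SPARK_HOME/jars is what actually worked for me.)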

anyway a simple query is now working under Hive on Spark so i think i might
be over the hump.  Now it's a matter of comparing the performance with Tez.

Cheers,
Stephen.


On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague  wrote:

> ok.. getting further.  seems now i have to deploy hive to all nodes in the
> cluster - don't think i had to do that before but not a big deal to do it
> now.
>
> for me:
> HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
> SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
>
> on all three nodes now.
>
> i started spark master on the namenode and i started spark slaves (2) on
> two datanodes of the cluster.
>
> so far so good.
>
> now i run my usual test command.
>
> $ hive --hiveconf hive.root.logger=DEBUG,console -e 'set
> hive.execution.engine=spark; select date_key, count(*) from
> fe_inventory.merged_properties_hist group by 1 order by 1;'
>
> i get a little further now and find the stderr from the Spark Web UI
> interface (nice) and it reports this:
>
> 17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to 
> spark://Worker@172.19.79.127:40145
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
>   at 
> org.apache.hive.spark.client.rpc.RpcConfiguration.(RpcConfiguration.java:47)
>   at 
> org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:134)
>   at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
>   ... 6 more
>
>
>
> searching around the internet i find this is probably a compatibility
> issue.
>
> i know. i know. no surprise here.
>
> so i guess i just got to the point where everybody else is... build spark
> w/o hive.
>
> lemme see what happens next.
>
>
>
>
>
> On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague 
> wrote:
>
>> thanks.  I haven't had a chance to dig into this again today but i do
>> appreciate the pointer.  I'll keep you posted.
>>
>> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar 
>> wrote:
>>
>>> You can try increasing the value of hive.spark.client.connect.timeout.
>>> Would also suggest taking a look at the HoS Remote Driver logs. The driver
>>> gets launched in a YARN container (assuming you are running Spark in
>>> yarn-client mode), so you just have to find the logs for that container.
>>>
>>> --Sahil
>>>
>>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague 
>>> wrote:
>>>
 i _seem_ to be getting closer.  Maybe its just wishful thinking.
 Here's where i'm at now.

 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
 CreateSubmissionResponse:
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
   "action" : "CreateSubmissionResponse",
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
   "message" : "Driver successfully submitted as 
 driver-20170926211038-0003",
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
   "serverSparkVersion" : "2.2.0",
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
   "submissionId" : "driver-20170926211038-0003",
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
   "success" : true
 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
 dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
 Client 

Re: hive on spark - why is it so hard?

2017-09-27 Thread Stephen Sprague
ok.. getting further.  seems now i have to deploy hive to all nodes in the
cluster - don't think i had to do that before but not a big deal to do it
now.

for me:
HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6

on all three nodes now.

i started spark master on the namenode and i started spark slaves (2) on
two datanodes of the cluster.

so far so good.

now i run my usual test command.

$ hive --hiveconf hive.root.logger=DEBUG,console -e 'set
hive.execution.engine=spark; select date_key, count(*) from
fe_inventory.merged_properties_hist group by 1 order by 1;'

i get a little further now and find the stderr from the Spark Web UI
interface (nice) and it reports this:

17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to
spark://Worker@172.19.79.127:40145
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at 
org.apache.hive.spark.client.rpc.RpcConfiguration.(RpcConfiguration.java:47)
at 
org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
... 6 more



searching around the internet i find this is probably a compatibility issue.

i know. i know. no surprise here.

so i guess i just got to the point where everybody else is... build spark
w/o hive.

lemme see what happens next.





On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague  wrote:

> thanks.  I haven't had a chance to dig into this again today but i do
> appreciate the pointer.  I'll keep you posted.
>
> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar 
> wrote:
>
>> You can try increasing the value of hive.spark.client.connect.timeout.
>> Would also suggest taking a look at the HoS Remote Driver logs. The driver
>> gets launched in a YARN container (assuming you are running Spark in
>> yarn-client mode), so you just have to find the logs for that container.
>>
>> --Sahil
>>
>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague 
>> wrote:
>>
>>> i _seem_ to be getting closer.  Maybe its just wishful thinking.
>>> Here's where i'm at now.
>>>
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
>>> CreateSubmissionResponse:
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "action" : "CreateSubmissionResponse",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "message" : "Driver successfully submitted as driver-20170926211038-0003",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "serverSparkVersion" : "2.2.0",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "submissionId" : "driver-20170926211038-0003",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "success" : true
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
>>> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
>>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>>> Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.1
>>> 9.73.136:8020 from dwr: closed
>>> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
>>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>>> Clien
>>> t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
>>> from dwr: stopped, remaining connections 0
>>> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e
>>> main] client.SparkClientImpl: Timed out waiting for client to connect.
>>> Possible reasons include network issues, errors in remote driver or the
>>> cluster has no available resources, etc.
>>> Please check YARN or Spark driver's logs for further information.
>>> java.util.concurrent.ExecutionException: 
>>> java.util.concurrent.TimeoutException:
>>> Timed out waiting for client connection.
>>> at 
>>> io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
>>> ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
>>> at 
>>> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:108)
>>> [hive-exec-2.3.0.jar:2.3.0]
>>> at 
>>> 

Re: hive on spark - why is it so hard?

2017-09-27 Thread Stephen Sprague
thanks.  I haven't had a chance to dig into this again today but i do
appreciate the pointer.  I'll keep you posted.

On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar 
wrote:

> You can try increasing the value of hive.spark.client.connect.timeout.
> Would also suggest taking a look at the HoS Remote Driver logs. The driver
> gets launched in a YARN container (assuming you are running Spark in
> yarn-client mode), so you just have to find the logs for that container.
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague 
> wrote:
>
>> i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's
>> where i'm at now.
>>
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
>> CreateSubmissionResponse:
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "action" : "CreateSubmissionResponse",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "message" : "Driver successfully submitted as driver-20170926211038-0003",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "serverSparkVersion" : "2.2.0",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "submissionId" : "driver-20170926211038-0003",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "success" : true
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
>> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>> Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.1
>> 9.73.136:8020 from dwr: closed
>> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>> Clien
>> t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
>> from dwr: stopped, remaining connections 0
>> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e
>> main] client.SparkClientImpl: Timed out waiting for client to connect.
>> Possible reasons include network issues, errors in remote driver or the
>> cluster has no available resources, etc.
>> Please check YARN or Spark driver's logs for further information.
>> java.util.concurrent.ExecutionException: 
>> java.util.concurrent.TimeoutException:
>> Timed out waiting for client connection.
>> at 
>> io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
>> ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
>> at 
>> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:108)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at 
>> org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.c
>> reateRemoteClient(RemoteHiveSparkClient.java:101)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<
>> init>(RemoteHiveSparkClient.java:97) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.
>> createHiveSparkClient(HiveSparkClientFactory.java:73)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImp
>> l.open(SparkSessionImpl.java:62) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionMan
>> agerImpl.getSession(SparkSessionManagerImpl.java:115)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSpark
>> Session(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
>> [hive-exec-2.3.0.jar:2.3.0]
>>
>>
>> i'll dig some more tomorrow.
>>
>> On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague 
>> wrote:
>>
>>> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep
>>> you posted on my progress.
>>>
>>> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan >> > wrote:
>>>
 Hi,

 > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
 spark session: org.apache.hadoop.hive.ql.metadata.HiveException:
 Failed to create spark client.

 I get inexplicable errors with Hive-on-Spark unless I do a three step
 build.

 Build Hive first, use that version to build Spark, use that Spark
 version to rebuild Hive.

 I have to do this to make it work because Spark contains Hive jars and
 Hive contains Spark jars in the class-path.

 And specifically I have to edit the pom.xml files, 

Re: hive on spark - why is it so hard?

2017-09-27 Thread Sahil Takiar
You can try increasing the value of hive.spark.client.connect.timeout.
Would also suggest taking a look at the HoS Remote Driver logs. The driver
gets launched in a YARN container (assuming you are running Spark in
yarn-client mode), so you just have to find the logs for that container.
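
For example (values are milliseconds, and just a starting point to experiment
with, not a recommendation):

  hive --hiveconf hive.spark.client.connect.timeout=30000 --hiveconf hive.spark.client.server.connect.timeout=300000 ...

or set the same properties in hive-site.xml.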

--Sahil

On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague  wrote:

> i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's
> where i'm at now.
>
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
> CreateSubmissionResponse:
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "action" : "CreateSubmissionResponse",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "message" : "Driver successfully submitted as driver-20170926211038-0003",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "serverSparkVersion" : "2.2.0",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "submissionId" : "driver-20170926211038-0003",
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
> "success" : true
> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
> Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.
> 19.73.136:8020 from dwr: closed
> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
> Clien
> t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
> from dwr: stopped, remaining connections 0
> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e main]
> client.SparkClientImpl: Timed out waiting for client to connect.
> Possible reasons include network issues, errors in remote driver or the
> cluster has no available resources, etc.
> Please check YARN or Spark driver's logs for further information.
> java.util.concurrent.ExecutionException: 
> java.util.concurrent.TimeoutException:
> Timed out waiting for client connection.
> at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
> ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
> at 
> org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:108)
> [hive-exec-2.3.0.jar:2.3.0]
> at 
> org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.
> createRemoteClient(RemoteHiveSparkClient.java:101)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.
> RemoteHiveSparkClient.(RemoteHiveSparkClient.java:97)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.
> createHiveSparkClient(HiveSparkClientFactory.java:73)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.session.
> SparkSessionImpl.open(SparkSessionImpl.java:62)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.session.
> SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115)
> [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.
> getSparkSession(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
> at org.apache.hadoop.hive.ql.optimizer.spark.
> SetSparkReducerParallelism.getSparkMemoryAndCores(
> SetSparkReducerParallelism.java:236) [hive-exec-2.3.0.jar:2.3.0]
>
>
> i'll dig some more tomorrow.
>
> On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague 
> wrote:
>
>> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
>> posted on my progress.
>>
>> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan 
>> wrote:
>>
>>> Hi,
>>>
>>> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
>>> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
>>> to create spark client.
>>>
>>> I get inexplicable errors with Hive-on-Spark unless I do a three step
>>> build.
>>>
>>> Build Hive first, use that version to build Spark, use that Spark
>>> version to rebuild Hive.
>>>
>>> I have to do this to make it work because Spark contains Hive jars and
>>> Hive contains Spark jars in the class-path.
>>>
>>> And specifically I have to edit the pom.xml files, instead of passing in
>>> params with -Dspark.version, because the installed pom files don't get
>>> replacements from the build args.
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>
>


-- 
Sahil Takiar
Software Engineer at Cloudera
takiar.sa...@gmail.com | (510) 673-0309


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
i _seem_ to be getting closer.  Maybe it's just wishful thinking.   Here's
where i'm at now.

2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
CreateSubmissionResponse:
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"action" : "CreateSubmissionResponse",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"message" : "Driver successfully submitted as driver-20170926211038-0003",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"serverSparkVersion" : "2.2.0",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"submissionId" : "driver-20170926211038-0003",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"success" : true
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
from dwr: closed
2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Clien
t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
from dwr: stopped, remaining connections 0
2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e main]
client.SparkClientImpl: Timed out waiting for client to connect.
Possible reasons include network issues, errors in remote driver or the
cluster has no available resources, etc.
Please check YARN or Spark driver's logs for further information.
java.util.concurrent.ExecutionException:
java.util.concurrent.TimeoutException: Timed out waiting for client
connection.
at
io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
~[netty-all-4.0.29.Final.jar:4.0.29.Final]
at
org.apache.hive.spark.client.SparkClientImpl.(SparkClientImpl.java:108)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:101)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.(RemoteHiveSparkClient.java:97)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:73)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
[hive-exec-2.3.0.jar:2.3.0]


i'll dig some more tomorrow.

On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague  wrote:

> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
> posted on my progress.
>
> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan 
> wrote:
>
>> Hi,
>>
>> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
>> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
>> to create spark client.
>>
>> I get inexplicable errors with Hive-on-Spark unless I do a three step
>> build.
>>
>> Build Hive first, use that version to build Spark, use that Spark version
>> to rebuild Hive.
>>
>> I have to do this to make it work because Spark contains Hive jars and
>> Hive contains Spark jars in the class-path.
>>
>> And specifically I have to edit the pom.xml files, instead of passing in
>> params with -Dspark.version, because the installed pom files don't get
>> replacements from the build args.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
posted on my progress.

On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan 
wrote:

> Hi,
>
> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
> to create spark client.
>
> I get inexplicable errors with Hive-on-Spark unless I do a three step
> build.
>
> Build Hive first, use that version to build Spark, use that Spark version
> to rebuild Hive.
>
> I have to do this to make it work because Spark contains Hive jars and
> Hive contains Spark jars in the class-path.
>
> And specifically I have to edit the pom.xml files, instead of passing in
> params with -Dspark.version, because the installed pom files don't get
> replacements from the build args.
>
> Cheers,
> Gopal
>
>
>


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
well this is the spark-submit line from above:

   2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
main] client.SparkClientImpl: Running client driver with argv:
/usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit

and that's pretty clearly v2.2


I do have other versions of spark on the namenode so lemme remove those and
see what happens


A-HA! dang it!

$ echo $SPARK_HOME
/usr/local/spark

well that clearly needs to be: /usr/lib/spark-2.2.0-bin-hadoop2.6

how did i miss that? unbelievable.
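
(i.e. something along these lines in the environment the hive cli runs under:

  export SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6

assuming nothing re-exports it to /usr/local/spark afterwards.)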


Thank you Sahil!   Let's see what happens next!

Cheers,
Stephen


On Tue, Sep 26, 2017 at 4:12 PM, Sahil Takiar 
wrote:

> Are you sure you are using Spark 2.2.0? Based on the stack-trace it looks
> like your call to spark-submit is using an older version of Spark (looks
> like some early 1.x version). Do you have SPARK_HOME set locally? Do you
> have older versions of Spark installed locally?
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 3:33 PM, Stephen Sprague 
> wrote:
>
>> thanks Sahil.  here it is.
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/scheduler/SparkListenerInterface
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:344)
>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.
>> scala:318)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:
>> 75)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.scheduler.SparkListenerInterface
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 5 more
>>
>> at 
>> org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:212)
>> ~[hive-exec-2.3.0.jar:2.3.0]
>> at 
>> org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:500)
>> ~[hive-exec-2.3.0.jar:2.3.0]
>> at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_25]
>> FAILED: SemanticException Failed to get a spark session:
>> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
>> client.
>> 2017-09-26T14:04:46,470 ERROR [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
>> main] ql.Driver: FAILED: SemanticException Failed to get a spark session:
>> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
>> client.
>> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark
>> session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to
>> create spark client.
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:240)
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.process(SetSparkReducerParallelism.java:173)
>> at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch
>> (DefaultRuleDispatcher.java:90)
>> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAnd
>> Return(DefaultGraphWalker.java:105)
>> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(De
>> faultGraphWalker.java:89)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:56)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:61)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:61)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:61)
>> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalkin
>> g(DefaultGraphWalker.java:120)
>> at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runSetRe
>> ducerParallelism(SparkCompiler.java:288)
>> at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimize
>> OperatorPlan(SparkCompiler.java:122)
>> at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCom
>> piler.java:140)
>> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInte
>> rnal(SemanticAnalyzer.java:11253)
>> at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeIntern
>> al(CalcitePlanner.java:286)
>> at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze
>> (BaseSemanticAnalyzer.java:258)
>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:511)
>> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java
>> :1316)
>> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1456)
>> at 

Re: hive on spark - why is it so hard?

2017-09-26 Thread Gopal Vijayaraghavan
Hi,

> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark 
> session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create 
> spark client.
 
I get inexplicable errors with Hive-on-Spark unless I do a three step build.

Build Hive first, use that version to build Spark, use that Spark version to 
rebuild Hive.

I have to do this to make it work because Spark contains Hive jars and Hive 
contains Spark jars in the class-path.

And specifically I have to edit the pom.xml files, instead of passing in params 
with -Dspark.version, because the installed pom files don't get replacements 
from the build args.
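
In outline (the exact build commands here are illustrative, not a recipe):

  # 1. build hive
  mvn clean install -DskipTests
  # 2. build spark against that hive, with hadoop provided and no bundled hive
  #    (editing the pom.xml files by hand as described above)
  ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
  # 3. point hive's pom.xml spark.version at that spark and rebuild hive
  mvn clean install -DskipTests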

Cheers,
Gopal




Re: hive on spark - why is it so hard?

2017-09-26 Thread Sahil Takiar
Are you sure you are using Spark 2.2.0? Based on the stack-trace it looks
like your call to spark-submit is using an older version of Spark (looks
like some early 1.x version). Do you have SPARK_HOME set locally? Do you
have older versions of Spark installed locally?

--Sahil

On Tue, Sep 26, 2017 at 3:33 PM, Stephen Sprague  wrote:

> thanks Sahil.  here it is.
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/scheduler/SparkListenerInterface
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:344)
> at org.apache.spark.deploy.SparkSubmit$.launch(
> SparkSubmit.scala:318)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.
> SparkListenerInterface
> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 5 more
>
> at 
> org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:212)
> ~[hive-exec-2.3.0.jar:2.3.0]
> at 
> org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:500)
> ~[hive-exec-2.3.0.jar:2.3.0]
> at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_25]
> FAILED: SemanticException Failed to get a spark session:
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
> client.
> 2017-09-26T14:04:46,470 ERROR [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main]
> ql.Driver: FAILED: SemanticException Failed to get a spark session:
> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
> client.
> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark
> session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to
> create spark client.
> at org.apache.hadoop.hive.ql.optimizer.spark.
> SetSparkReducerParallelism.getSparkMemoryAndCores(
> SetSparkReducerParallelism.java:240)
> at org.apache.hadoop.hive.ql.optimizer.spark.
> SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:173)
> at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(
> DefaultRuleDispatcher.java:90)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.
> dispatchAndReturn(DefaultGraphWalker.java:105)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(
> DefaultGraphWalker.java:89)
> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(
> PreOrderWalker.java:56)
> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(
> PreOrderWalker.java:61)
> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(
> PreOrderWalker.java:61)
> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(
> PreOrderWalker.java:61)
> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(
> DefaultGraphWalker.java:120)
> at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.
> runSetReducerParallelism(SparkCompiler.java:288)
> at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.
> optimizeOperatorPlan(SparkCompiler.java:122)
> at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(
> TaskCompiler.java:140)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.
> analyzeInternal(SemanticAnalyzer.java:11253)
> at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(
> CalcitePlanner.java:286)
> at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.
> analyze(BaseSemanticAnalyzer.java:258)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:511)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.
> java:1316)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1456)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1236)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1226)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(
> CliDriver.java:233)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(
> CliDriver.java:184)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(
> CliDriver.java:403)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(
> CliDriver.java:336)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(
> CliDriver.java:787)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 

Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
thanks Sahil.  here it is.

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/scheduler/SparkListenerInterface
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:344)
at
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:318)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.scheduler.SparkListenerInterface
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more

at
org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:212)
~[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:500)
~[hive-exec-2.3.0.jar:2.3.0]
at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_25]
FAILED: SemanticException Failed to get a spark session:
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
client.
2017-09-26T14:04:46,470 ERROR [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main]
ql.Driver: FAILED: SemanticException Failed to get a spark session:
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
client.
org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark
session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create
spark client.
at
org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:240)
at
org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.process(SetSparkReducerParallelism.java:173)
at
org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
at
org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:56)
at
org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
at
org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
at
org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWalker.java:61)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
at
org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runSetReducerParallelism(SparkCompiler.java:288)
at
org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:122)
at
org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:140)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11253)
at
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:286)
at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:511)
at
org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1316)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1456)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1236)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1226)
at
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
at
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:787)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)


It bugs me that that class is in spark-core_2.11-2.2.0.jar yet so seemingly
out of reach. :(



On Tue, Sep 26, 2017 at 2:44 PM, Sahil Takiar 
wrote:

> Hey Stephen,
>
> Can you send the full stack 

Re: hive on spark - why is it so hard?

2017-09-26 Thread Sahil Takiar
Hey Stephen,

Can you send the full stack trace for the NoClassDefFoundError? For Hive
2.3.0, we only support Spark 2.0.0. Hive may work with more recent versions
of Spark, but we only test with Spark 2.0.0.
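
A quick sanity check is to run the spark-submit that Hive actually invokes and
see what it reports, e.g.:

  $SPARK_HOME/bin/spark-submit --version

(--version just prints the Spark version banner and exits, so it's a cheap way
to rule out a stray older install.)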

--Sahil

On Tue, Sep 26, 2017 at 2:35 PM, Stephen Sprague  wrote:

> * i've installed hive 2.3 and spark 2.2
>
> * i've read this doc plenty of times -> https://cwiki.apache.org/
> confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
>
> * i run this query:
>
>hive --hiveconf hive.root.logger=DEBUG,console -e 'set
> hive.execution.engine=spark; select date_key, count(*) from
> fe_inventory.merged_properties_hist group by 1 order by 1;'
>
>
> * i get this error:
>
> *   Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/scheduler/SparkListenerInterface*
>
>
> * this class in:
>   /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar
>
> * i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars
>
> * i have updated hive-site.xml to set spark.yarn.jars to it.
>
> * i see this is the console:
>
> 2017-09-26T13:34:15,505  INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3 main]
> spark.HiveSparkClientFactory: load spark property from hive configuration
> (spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.com:8020/spark-2.2-jars/*
> ).
>
> * i see this on the console
>
> 2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main]
> client.SparkClientImpl: Running client driver with argv:
> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file
> /tmp/spark-submit.6105784757200912217.properties --class
> org.apache.hive.spark.client.RemoteDriver 
> /usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar
> --remote-host dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf
> hive.spark.client.connect.timeout=1000 --conf 
> hive.spark.client.server.connect.timeout=9
> --conf hive.spark.client.channel.log.level=null --conf
> hive.spark.client.rpc.max.size=52428800 --conf
> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
> --conf hive.spark.client.rpc.server.address=null
>
> * i even print out CLASSPATH in this script: /usr/lib/spark-2.2.0-bin-
> hadoop2.6/bin/spark-submit
>
> and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar is
> in it.
>
> so i ask... what am i missing?
>
> thanks,
> Stephen
>
>
>
>
>
>


-- 
Sahil Takiar
Software Engineer at Cloudera
takiar.sa...@gmail.com | (510) 673-0309