Re: Serde moved? version 2.3.0

2017-10-25 Thread Stephen Sprague
thanks guys for the clarification.  i'll work on perhaps finding the source
code and re-implementing if i can.  good to know.
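
(As a reference point for that re-implementation: porting one of these plugins generally means extending AbstractSerDe instead of implementing the removed SerDe interface and recompiling against hive-serde 2.3.0. Below is a rough, untested sketch only: the package, class name and method bodies are hypothetical placeholders, and the exact set of abstract methods should be double-checked against the hive-serde 2.3.0 javadoc.)

package com.example.serde;  // hypothetical package for a third-party plugin

import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Port of a plugin that used to "implement SerDe": extend AbstractSerDe instead
// and recompile. All method bodies below are stubs for illustration only.
public class FooSerDe extends AbstractSerDe {

  @Override
  public void initialize(Configuration conf, Properties tblProps) throws SerDeException {
    // parse table properties and build the row ObjectInspector here
  }

  @Override
  public ObjectInspector getObjectInspector() throws SerDeException {
    return null;  // return the row ObjectInspector built in initialize()
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    return null;  // turn the raw Writable (e.g. a line of JSON/XML) into a row object
  }

  @Override
  public Class<? extends Writable> getSerializedClass() {
    return Text.class;
  }

  @Override
  public Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException {
    return new Text();  // turn a row object back into a Writable
  }

  @Override
  public SerDeStats getSerDeStats() {
    return null;  // no statistics collected in this stub
  }
}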



On Wed, Oct 25, 2017 at 6:32 PM, Owen O'Malley <owen.omal...@gmail.com>
wrote:

> I considered that, but it won't work.
>
> The Hive 2.2 code looks like:
>
> public interface SerDe { ... }
> public abstract class AbstractSerDe implements SerDe { ... }
>
> ... a lot of code using SerDe ...
>
> The Hive 2.3 code looks like:
>
> public abstract class AbstractSerDe {  }
>
> ... a lot of code using AbstractSerDe ...
>
> You could create a SerDe interface to avoid the ClassNotFound exception,
> but I assume the external classes do:
>
> public class FooSerDe implements SerDe { ... }
>
> when Hive 2.3 tries to use the instance as an AbstractSerDe, it will fail.
>
> With SerDe being an interface and AbstractSerDe being an abstract class,
> we also can't play games with inverting the class hierarchy.
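
(To make that failure mode concrete, here is a tiny self-contained sketch; the nested classes are stand-ins for the real Hive types, not actual Hive code.)

// A "bridge" jar that merely recreates the SerDe interface fixes the
// ClassNotFound problem, but Hive 2.3 still casts the loaded plugin to
// AbstractSerDe, and a class that only implements the recreated interface
// fails that cast at runtime.
public class BridgeSketch {

  interface SerDe { }                          // recreated marker interface
  static abstract class AbstractSerDe { }      // what Hive 2.3 actually expects

  static class FooSerDe implements SerDe { }   // the third-party plugin, unchanged

  public static void main(String[] args) {
    Object plugin = new FooSerDe();                // class loading now succeeds...
    AbstractSerDe serde = (AbstractSerDe) plugin;  // ...but this throws ClassCastException
    System.out.println(serde);
  }
}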
>
> .. Owen
>
> On Wed, Oct 25, 2017 at 4:38 PM, Matt Burgess <mattyb...@gmail.com> wrote:
>
>> Perhaps a fourth option is a “bridge” JAR, to implement the original
>> class by subclassing AbstractSerDe, unless the API has changed such that
>> such a mapping cannot be done.
>>
>> Regards,
>> Matt
>>
>>
>>
>> On Oct 25, 2017, at 7:31 PM, Owen O'Malley <owen.omal...@gmail.com>
>> wrote:
>>
>>
>> On Wed, Oct 25, 2017 at 3:20 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> i see. interesting.  i think this breaks a ton of open-source SerDes
>>> we've all downloaded off the internet and have been using for years.  openx
>>> json serde and ibm xml serde come to mind. does this change render all
>>> these incompatible now with 2.3.0?
>>>
>>
>> Unfortunately, yes. With Hive 2.2.0, the SerDe class is still there. It
>> was removed in 2.3.0. I suspect the developer didn't realize there were
>> plugins that would break.
>>
>>
>>>
>>> I mean all i've done in the past is download the jar file and put it in the
>>> lib (or auxlib) dir and i was good to go.  That clearly isn't going to fly
>>> anymore then.
>>>
>>> hmmm.  again i'm not a java weenie so rewriting java code isn't
>>> something i would be proficient at. But it sounds like from a devops
>>> standpoint there's no magic jar i can put in the class path to make these
>>> work?  that kinda is a deal-breaker to upgrade then.
>>>
>>
>> Unfortunately, not. Looking at the change, the code now requires the
>> user's serde to implement AbstractSerDe. If it doesn't, it will break.
>>
>> Effectively, the choices seem to be:
>>
>>1. Change the plugin and recompile it.
>>2. Upgrade to Hive 2.2 instead of 2.3.
>>3. Make a case for reverting the change. I'm not sure what the
>>original motivation of the change was. It seems like it was effectively a
>>clean up.
>>
>> .. Owen
>>
>>
>>> Am i interpreting this correctly?
>>>
>>> On Wed, Oct 25, 2017 at 2:31 PM, Owen O'Malley <owen.omal...@gmail.com>
>>> wrote:
>>>
>>>> SerDe was removed by https://issues.apache.org/jira/browse/HIVE-15167
>>>>
>>>> You should use AbstractSerDe instead.
>>>>
>>>> .. Owen
>>>>
>>>> On Oct 25, 2017, at 2:18 PM, Stephen Sprague <sprag...@gmail.com>
>>>> wrote:
>>>>
>>>> hey guys,
>>>>
>>>> could be a dumb question but not being a java type of guy i'm not quite
>>>> sure about it. I'm upgrading from 2.1.0 to 2.3.0 and encountering this
>>>> error:
>>>>
>>>> class not found: org/apache/hadoop/hive/serde2/SerDe
>>>>
>>>> so in hive 2.1.0 i see it in this jar:
>>>>
>>>> * hive-serde-2.1.0.jar
>>>> org/apache/hadoop/hive/serde2/SerDe.class
>>>>
>>>>
>>>> but in hive 2.3.0 i don't see it in hive-serde-2.3.0.jar.
>>>>
>>>>
>>>> so i ask where did it go in version 2.3.0?
>>>>
>>>> thanks,
>>>> Stephen
>>>>
>>>>
>>>>
>>>
>>
>


Fwd: Serde moved? version 2.3.0

2017-10-25 Thread Stephen Sprague
i see. interesting.  i think this breaks a ton of open-source SerDes we've
all downloaded off the internet and have been using for years.  openx json
serde and ibm xml serde come to mind. does this change render all these
incompatible now with 2.3.0?

I mean all i've done in the past is download the jar file and put it in the
lib (or auxlib) dir and i was good to go.  That clearly isn't going to fly
anymore then.

hmmm.  again i'm not a java weenie so rewriting java code isn't something i
would be proficient at. But it sounds like from a devops standpoint there's
no magic jar i can put in the class path to make these work?  that kinda is
a deal-breaker to upgrade then.


Am i interpreting this correctly?

On Wed, Oct 25, 2017 at 2:31 PM, Owen O'Malley <owen.omal...@gmail.com>
wrote:

> SerDe was removed by https://issues.apache.org/jira/browse/HIVE-15167
>
> You should use AbstractSerDe instead.
>
> .. Owen
>
> On Oct 25, 2017, at 2:18 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>
> hey guys,
>
> could be a dumb question but not being a java type of guy i'm not quite
> sure about it. I'm upgrading from 2.1.0 to 2.3.0 and encountering this
> error:
>
> class not found: org/apache/hadoop/hive/serde2/SerDe
>
> so in hive 2.1.0 i see it in this jar:
>
> * hive-serde-2.1.0.jar
> org/apache/hadoop/hive/serde2/SerDe.class
>
>
> but in hive 2.3.0 i don't see it in hive-serde-2.3.0.jar.
>
>
> so i ask where did it go in version 2.3.0?
>
> thanks,
> Stephen
>
>
>


Serde moved? version 2.3.0

2017-10-25 Thread Stephen Sprague
hey guys,

could be a dumb question but not being a java type of guy i'm not quite
sure about it. I'm upgrading from 2.1.0 to 2.3.0 and encountering this
error:

class not found: org/apache/hadoop/hive/serde2/SerDe

so in hive 2.1.0 i see it in this jar:

* hive-serde-2.1.0.jar
org/apache/hadoop/hive/serde2/SerDe.class


but in hive 2.3.0 i don't see it in hive-serde-2.3.0.jar.


so i ask where did it go in version 2.3.0?

thanks,
Stephen


Re: hive on spark - why is it so hard?

2017-10-01 Thread Stephen Sprague
so...  i made some progress after much copying of jar files around (as
alluded to by Gopal previously on this thread).


following the instructions here:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

and doing this as instructed will leave off about a dozen or so jar files
that spark'll need:
  ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

i ended up copying the missing jars to $SPARK_HOME/jars, but i would have
preferred to just add a path (or paths) to the spark classpath; i did not find
any effective way to do that. In hive you can specify HIVE_AUX_JARS_PATH
but i don't see the analogous var in spark - i don't think it inherits the
hive classpath.

anyway a simple query is now working under Hive On Spark so i think i might
be over the hump.  Now its a matter of comparing the performance with Tez.

Cheers,
Stephen.


On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> ok.. getting further.  seems now i have to deploy hive to all nodes in the
> cluster - don't think i had to do that before but not a big deal to do it
> now.
>
> for me:
> HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
> SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
>
> on all three nodes now.
>
> i started spark master on the namenode and i started spark slaves (2) on
> two datanodes of the cluster.
>
> so far so good.
>
> now i run my usual test command.
>
> $ hive --hiveconf hive.root.logger=DEBUG,console -e 'set
> hive.execution.engine=spark; select date_key, count(*) from
> fe_inventory.merged_properties_hist group by 1 order by 1;'
>
> i get a little further now and find the stderr from the Spark Web UI
> interface (nice) and it reports this:
>
> 17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to 
> spark://Worker@172.19.79.127:40145
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
>   at 
> org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:47)
>   at 
> org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
>   at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
>   ... 6 more
>
>
>
> searching around the internet i find this is probably a compatibility
> issue.
>
> i know. i know. no surprise here.
>
> so i guess i just got to the point where everybody else is... build spark
> w/o hive.
>
> lemme see what happens next.
>
>
>
>
>
> On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> thanks.  I haven't had a chance to dig into this again today but i do
>> appreciate the pointer.  I'll keep you posted.
>>
>> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar <takiar.sa...@gmail.com>
>> wrote:
>>
>>> You can try increasing the value of hive.spark.client.connect.timeout.
>>> Would also suggest taking a look at the HoS Remote Driver logs. The driver
>>> gets launched in a YARN container (assuming you are running Spark in
>>> yarn-client mode), so you just have to find the logs for that container.
>>>
>>> --Sahil
>>>
>>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague <sprag...@gmail.com>
>>> wrote:
>>>
>>>> i _seem_ to be getting closer.  Maybe its just wishful thinking.
>>>> Here's where i'm at now.
>>>>
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>>> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
>>>> CreateSubmissionResponse:
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>>>   "action" : "CreateSubmissionResponse",
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>>>   "message" : "Driver successfully submitted as 
>>>> driver-20170926211038-0003",
>>>> 2017-09-2

Re: hive on spark - why is it so hard?

2017-09-27 Thread Stephen Sprague
ok.. getting further.  seems now i have to deploy hive to all nodes in the
cluster - don't think i had to do that before but not a big deal to do it
now.

for me:
HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6

on all three nodes now.

i started spark master on the namenode and i started spark slaves (2) on
two datanodes of the cluster.

so far so good.

now i run my usual test command.

$ hive --hiveconf hive.root.logger=DEBUG,console -e 'set
hive.execution.engine=spark; select date_key, count(*) from
fe_inventory.merged_properties_hist group by 1 order by 1;'

i get a little further now and find the stderr from the Spark Web UI
interface (nice) and it reports this:

17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to
spark://Worker@172.19.79.127:40145
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at 
org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:47)
at 
org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
... 6 more



searching around the internet i find this is probably a compatibility issue.

i know. i know. no surprise here.

so i guess i just got to the point where everybody else is... build spark
w/o hive.

lemme see what happens next.





On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> thanks.  I haven't had a chance to dig into this again today but i do
> appreciate the pointer.  I'll keep you posted.
>
> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar <takiar.sa...@gmail.com>
> wrote:
>
>> You can try increasing the value of hive.spark.client.connect.timeout.
>> Would also suggest taking a look at the HoS Remote Driver logs. The driver
>> gets launched in a YARN container (assuming you are running Spark in
>> yarn-client mode), so you just have to find the logs for that container.
>>
>> --Sahil
>>
>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> i _seem_ to be getting closer.  Maybe its just wishful thinking.
>>> Here's where i'm at now.
>>>
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
>>> CreateSubmissionResponse:
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "action" : "CreateSubmissionResponse",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "message" : "Driver successfully submitted as driver-20170926211038-0003",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "serverSparkVersion" : "2.2.0",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "submissionId" : "driver-20170926211038-0003",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "success" : true
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
>>> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
>>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>>> Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.1
>>> 9.73.136:8020 from dwr: closed
>>> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
>>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>>> Clien
>>> t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
>>> from dwr: stopped, remaining connections 0
>>> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e
>>> main] client.SparkClientImpl: Timed out waiting for client to connect.
>>> *Possible reasons include network issues, errors in remote driver or the
>>> cluster has no available resou

Re: hive on spark - why is it so hard?

2017-09-27 Thread Stephen Sprague
thanks.  I haven't had a chance to dig into this again today but i do
appreciate the pointer.  I'll keep you posted.

On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar <takiar.sa...@gmail.com>
wrote:

> You can try increasing the value of hive.spark.client.connect.timeout.
> Would also suggest taking a look at the HoS Remote Driver logs. The driver
> gets launched in a YARN container (assuming you are running Spark in
> yarn-client mode), so you just have to find the logs for that container.
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's
>> where i'm at now.
>>
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
>> CreateSubmissionResponse:
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "action" : "CreateSubmissionResponse",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "message" : "Driver successfully submitted as driver-20170926211038-0003",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "serverSparkVersion" : "2.2.0",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "submissionId" : "driver-20170926211038-0003",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "success" : true
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
>> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>> Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.1
>> 9.73.136:8020 from dwr: closed
>> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>> Clien
>> t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
>> from dwr: stopped, remaining connections 0
>> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e
>> main] client.SparkClientImpl: Timed out waiting for client to connect.
>> *Possible reasons include network issues, errors in remote driver or the
>> cluster has no available resources, etc.*
>> *Please check YARN or Spark driver's logs for further information.*
>> java.util.concurrent.ExecutionException: 
>> java.util.concurrent.TimeoutException:
>> Timed out waiting for client connection.
>> at 
>> io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
>> ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
>> at 
>> org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:108)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at 
>> org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.c
>> reateRemoteClient(RemoteHiveSparkClient.java:101)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<
>> init>(RemoteHiveSparkClient.java:97) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.
>> createHiveSparkClient(HiveSparkClientFactory.java:73)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImp
>> l.open(SparkSessionImpl.java:62) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionMan
>> agerImpl.getSession(SparkSessionManagerImpl.java:115)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSpark
>> Session(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
>> [hive-exec-2.3.0.jar:2.3.0]
>>
>>
>> i'll dig some more tomorrow.
>>
>> On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep
>>> you posted on my progress.
>>>
>>> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan <gop...@apache.org
&

Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's
where i'm at now.

2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
CreateSubmissionResponse:
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"action" : "CreateSubmissionResponse",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"message" : "Driver successfully submitted as driver-20170926211038-0003",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"serverSparkVersion" : "2.2.0",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"submissionId" : "driver-20170926211038-0003",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"success" : true
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
from dwr: closed
2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Clien
t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
from dwr: stopped, remaining connections 0
2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e main]
client.SparkClientImpl: Timed out waiting for client to connect.
*Possible reasons include network issues, errors in remote driver or the
cluster has no available resources, etc.*
*Please check YARN or Spark driver's logs for further information.*
java.util.concurrent.ExecutionException:
java.util.concurrent.TimeoutException: Timed out waiting for client
connection.
at
io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
~[netty-all-4.0.29.Final.jar:4.0.29.Final]
at
org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:108)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:101)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:97)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:73)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
[hive-exec-2.3.0.jar:2.3.0]


i'll dig some more tomorrow.

On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
> posted on my progress.
>
> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>> Hi,
>>
>> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
>> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
>> to create spark client.
>>
>> I get inexplicable errors with Hive-on-Spark unless I do a three step
>> build.
>>
>> Build Hive first, use that version to build Spark, use that Spark version
>> to rebuild Hive.
>>
>> I have to do this to make it work because Spark contains Hive jars and
>> Hive contains Spark jars in the class-path.
>>
>> And specifically I have to edit the pom.xml files, instead of passing in
>> params with -Dspark.version, because the installed pom files don't get
>> replacements from the build args.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
posted on my progress.

On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan 
wrote:

> Hi,
>
> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
> to create spark client.
>
> I get inexplicable errors with Hive-on-Spark unless I do a three step
> build.
>
> Build Hive first, use that version to build Spark, use that Spark version
> to rebuild Hive.
>
> I have to do this to make it work because Spark contains Hive jars and
> Hive contains Spark jars in the class-path.
>
> And specifically I have to edit the pom.xml files, instead of passing in
> params with -Dspark.version, because the installed pom files don't get
> replacements from the build args.
>
> Cheers,
> Gopal
>
>
>


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
well this is the spark-submit line from above:

   2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
main] client.SparkClientImpl: Running client driver with argv:
/usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit

and that's pretty clearly v2.2


I do have other versions of spark on the namenode so lemme remove those and
see what happens


A-HA! dang it!

$ echo $SPARK_HOME
/usr/local/spark

well that clearly needs to be: /usr/lib/spark-2.2.0-bin-hadoop2.6

how did i miss that? unbelievable.


Thank you Sahil!   Lets see what happens next!

Cheers,
Stephen


On Tue, Sep 26, 2017 at 4:12 PM, Sahil Takiar <takiar.sa...@gmail.com>
wrote:

> Are you sure you are using Spark 2.2.0? Based on the stack-trace it looks
> like your call to spark-submit is using an older version of Spark (looks
> like some early 1.x version). Do you have SPARK_HOME set locally? Do you
> have older versions of Spark installed locally?
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 3:33 PM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> thanks Sahil.  here it is.
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/scheduler/SparkListenerInterface
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:344)
>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.
>> scala:318)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:
>> 75)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.scheduler.SparkListenerInterface
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 5 more
>>
>> at 
>> org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:212)
>> ~[hive-exec-2.3.0.jar:2.3.0]
>> at 
>> org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:500)
>> ~[hive-exec-2.3.0.jar:2.3.0]
>> at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_25]
>> FAILED: SemanticException Failed to get a spark session:
>> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
>> client.
>> 2017-09-26T14:04:46,470 ERROR [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
>> main] ql.Driver: FAILED: SemanticException Failed to get a spark session:
>> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
>> client.
>> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark
>> session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to
>> create spark client.
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:240)
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.process(SetSparkReducerParallelism.java:173)
>> at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch
>> (DefaultRuleDispatcher.java:90)
>> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAnd
>> Return(DefaultGraphWalker.java:105)
>> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(De
>> faultGraphWalker.java:89)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:56)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:61)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:61)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:61)
>> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalkin
>> g(DefaultGraphWalker.java:120)
>> at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runSetRe
>> ducerParallelism(SparkCompiler.java:288)
>> at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimize
>> OperatorPlan(SparkCompiler.java:122)
>> at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCom
>> piler.java:140)
>> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInte
>> rnal(SemanticAnalyzer.java:11253)
>> at org.apache.hadoop.hive.ql.parse.

Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
> Can you send the full stack trace for the NoClassDefFoundError? For Hive
> 2.3.0, we only support Spark 2.0.0. Hive may work with more recent versions
> of Spark, but we only test with Spark 2.0.0.
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 2:35 PM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> * i've installed hive 2.3 and spark 2.2
>>
>> * i've read this doc plenty of times -> https://cwiki.apache.org/confl
>> uence/display/Hive/Hive+on+Spark%3A+Getting+Started
>>
>> * i run this query:
>>
>>hive --hiveconf hive.root.logger=DEBUG,console -e 'set
>> hive.execution.engine=spark; select date_key, count(*) from
>> fe_inventory.merged_properties_hist group by 1 order by 1;'
>>
>>
>> * i get this error:
>>
>> *   Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/scheduler/SparkListenerInterface*
>>
>>
>> * this class is in:
>>   /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar
>>
>> * i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars
>>
>> * i have updated hive-site.xml to set spark.yarn.jars to it.
>>
>> * i see this in the console:
>>
>> 2017-09-26T13:34:15,505  INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3
>> main] spark.HiveSparkClientFactory: load spark property from hive
>> configuration (spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.co
>> m:8020/spark-2.2-jars/*).
>>
>> * i see this on the console
>>
>> 2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
>> main] client.SparkClientImpl: Running client driver with argv:
>> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file
>> /tmp/spark-submit.6105784757200912217.properties --class
>> org.apache.hive.spark.client.RemoteDriver 
>> /usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar
>> --remote-host dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf
>> hive.spark.client.connect.timeout=1000 --conf
>> hive.spark.client.server.connect.timeout=9 --conf
>> hive.spark.client.channel.log.level=null --conf
>> hive.spark.client.rpc.max.size=52428800 --conf
>> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
>> --conf hive.spark.client.rpc.server.address=null
>>
>> * i even print out CLASSPATH in this script:
>> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit
>>
>> and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar is
>> in it.
>>
>> ​so i ask... what am i missing?
>>
>> thanks,
>> Stephen​
>>
>>
>>
>>
>>
>>
>
>
> --
> Sahil Takiar
> Software Engineer at Cloudera
> takiar.sa...@gmail.com | (510) 673-0309
>


hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
* i've installed hive 2.3 and spark 2.2

* i've read this doc plenty of times ->
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

* i run this query:

   hive --hiveconf hive.root.logger=DEBUG,console -e 'set
hive.execution.engine=spark; select date_key, count(*) from
fe_inventory.merged_properties_hist group by 1 order by 1;'


* i get this error:

*   Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/scheduler/SparkListenerInterface*


* this class is in:
  /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar

* i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars

* i have updated hive-site.xml to set spark.yarn.jars to it.

* i see this in the console:

2017-09-26T13:34:15,505  INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3 main]
spark.HiveSparkClientFactory: load spark property from hive configuration
(spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.com:8020/spark-2.2-jars/*).

* i see this on the console

2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main]
client.SparkClientImpl: Running client driver with argv:
/usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file
/tmp/spark-submit.6105784757200912217.properties --class
org.apache.hive.spark.client.RemoteDriver
/usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar --remote-host
dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf
hive.spark.client.connect.timeout=1000 --conf
hive.spark.client.server.connect.timeout=9 --conf
hive.spark.client.channel.log.level=null --conf
hive.spark.client.rpc.max.size=52428800 --conf
hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
--conf hive.spark.client.rpc.server.address=null

* i even print out CLASSPATH in this script:
/usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit

and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar is in
it.

​so i ask... what am i missing?

thanks,
Stephen​


group by + two nulls in a row = bug?

2017-06-27 Thread Stephen Sprague
i'm running hive version 2.1.0 and found this interesting. i've broken it
down into a trivial test case below.

i run this:

 select a.date_key,
a.property_id,
cast(NULL as bigint) as malone_id,
cast(NULL as bigint) as zpid,
sum(coalesce(a.sum_365,0)) as sum_365
 from
(select 20170626 as date_key, 123 as property_id, 10 as
sum_365 ) a

 group by
1
,2
,3
,4

and i get:

+----------+-------------+-----------+--------+---------+
| date_key | property_id | malone_id | zpid   | sum_365 |
+----------+-------------+-----------+--------+---------+
| 20170626 | 123         | [NULL]    | [NULL] | [NULL]  |
+----------+-------------+-----------+--------+---------+

why do you suppose sum_365 is null?


then i run this:

 select a.date_key,
a.property_id,
cast(NULL as bigint) as malone_id,
sum(coalesce(a.sum_365,0)) as sum_365
 from
(select 20170626 as date_key, 123 as property_id, 10 as
sum_365 ) a

 group by
1
,2
,3

and i get:


+----------+-------------+-----------+---------+
| date_key | property_id | malone_id | sum_365 |
+----------+-------------+-----------+---------+
| 20170626 | 123         | [NULL]    | 10      |
+----------+-------------+-----------+---------+


what do you think? is it me? or is it hive? (looks to me like grouping by two
NULLs in a row causes a problem.)


any hive release imminent?

2017-06-19 Thread Stephen Sprague
Hey guys,
Is there any word out on the street about a timeframe for the next 2.x hive
release? Looks like Dec 2016 was the last one.

The natives are getting restless i think. :)

thanks,
Stephen.


Re: How to setup the max memory for my big Hive SQL which is on MapReduce of Yarn

2017-06-06 Thread Stephen Sprague
have you researched the yarn schedulers?  namely the capacity and fair
schedulers? those are the places where resource limits can be easily
defined.

On Mon, Jun 5, 2017 at 9:25 PM, Chang.Wu <583424...@qq.com> wrote:

> My Hive engine is MapReduce and Yarn. What my urgent need is to limit the
> memory usage of my big sql so that my bigger sql will run a longer time
> instead of using up all the resource of queue or even all the resource of
> my whole yarn cluster at once.
> But I cannot find any solution to define my sql resource-usage upper
> threshold.
> Can anyone give me some suggestions?
>
> 583424...@qq.com
>


Re: drop table - external - aws

2017-05-17 Thread Stephen Sprague
yeah. that's a potential idea too.  gotta put the time in to script it with
200+ tables though.

On Wed, May 17, 2017 at 10:07 AM, Furcy Pin <furcy@flaminem.com> wrote:

> Did you try ALTER TABLE ... SET LOCATION ... ? maybe it could have worked.
>
>
> On Wed, May 17, 2017 at 6:57 PM, Vihang Karajgaonkar <vih...@cloudera.com>
> wrote:
>
>> This is interesting and possibly a bug. Did you try changing them to
>> managed tables and then dropping or truncating them? How do we reproduce
>> this on our setup?
>>
>> On Tue, May 16, 2017 at 6:38 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> fwiw. i ended up re-creating the ec2 cluster with that same host name
>>> just so i could drop those tables from the metastore.
>>>
>>> note to self.  be careful - be real careful - with "sharing" hive
>>> metastores between different compute paradigms.
>>>
>>> Regards,
>>> Stephen.
>>>
>>> On Tue, May 16, 2017 at 6:38 AM, Stephen Sprague <sprag...@gmail.com>
>>> wrote:
>>>
>>>> hey guys,
>>>> here's something bizarre.   i created about 200 external tables with a
>>>> location something like this 'hdfs://<host>/path'.  this was three
>>>> months ago and now i'm revisiting and want to drop these tables.
>>>>
>>>> ha! no can do!
>>>>
>>>> that <host> is long gone.
>>>>
>>>> Upon issuing the drop table command i get this:
>>>>
>>>> Error while processing statement: FAILED: Execution Error, return code
>>>> 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
>>>> MetaException(message:java.lang.IllegalArgumentException:
>>>> java.net.UnknownHostException: <host>)
>>>>
>>>> where <host> is that old host name.
>>>>
>>>> so i ask is there a work around for this?  given they are external
>>>> tables i'm surprised it "checks" that that location exists (or not.)
>>>>
>>>> thanks,
>>>> Stephen
>>>>
>>>
>>>
>>
>


Re: drop table - external - aws

2017-05-17 Thread Stephen Sprague
hi vihang,
here's how to duplicate (hive version 2.1.0).

1. install hadoop on a host called "test-nn"

2. install hive on a host called  "hive-client" and use a metastore on
"hive-client" (or anywhere else except "test-nn").

3. using hive on "hive-client" do this: create external table foo(a int)
location 'hdfs://test-nn/hive_root';

4. shutdown host "test-nn" or rename it.

5. using hive on "hive-client" try dropping table foo.

pretty sure step 5 is a metastore-only operation and hence does not need an
active hadoop cluster to work. if it were a managed table then yeah we'd
need to bring up another cluster.

but you get the idea, right?

On Wed, May 17, 2017 at 9:57 AM, Vihang Karajgaonkar <vih...@cloudera.com>
wrote:

> This is interesting and possibly a bug. Did you try changing them to
> managed tables and then dropping or truncating them? How do we reproduce
> this on our setup?
>
> On Tue, May 16, 2017 at 6:38 PM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> fwiw. i ended up re-creating the ec2 cluster with that same host name
>> just so i could drop those tables from the metastore.
>>
>> note to self.  be careful - be real careful - with "sharing" hive
>> metastores between different compute paradigms.
>>
>> Regards,
>> Stephen.
>>
>> On Tue, May 16, 2017 at 6:38 AM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> hey guys,
>>> here's something bizarre.   i created about 200 external tables with a
>>> location something like this 'hdfs://<host>/path'.  this was three
>>> months ago and now i'm revisiting and want to drop these tables.
>>>
>>> ha! no can do!
>>>
>>> that <host> is long gone.
>>>
>>> Upon issuing the drop table command i get this:
>>>
>>> Error while processing statement: FAILED: Execution Error, return code 1
>>> from org.apache.hadoop.hive.ql.exec.DDLTask.
>>> MetaException(message:java.lang.IllegalArgumentException:
>>> java.net.UnknownHostException: <host>)
>>>
>>> where <host> is that old host name.
>>>
>>> so i ask is there a work around for this?  given they are external
>>> tables i'm surprised it "checks" that that location exists (or not.)
>>>
>>> thanks,
>>> Stephen
>>>
>>
>>
>


Re: drop table - external - aws

2017-05-16 Thread Stephen Sprague
fwiw. i ended up re-creating the ec2 cluster with that same host name just
so i could drop those tables from the metastore.

note to self.  be careful - be real careful - with "sharing" hive
metastores between different compute paradigms.

Regards,
Stephen.

On Tue, May 16, 2017 at 6:38 AM, Stephen Sprague <sprag...@gmail.com> wrote:

> hey guys,
> here's something bizarre.   i created about 200 external tables with a
> location something like this 'hdfs://<host>/path'.  this was three
> months ago and now i'm revisiting and want to drop these tables.
>
> ha! no can do!
>
> that <host> is long gone.
>
> Upon issuing the drop table command i get this:
>
> Error while processing statement: FAILED: Execution Error, return code 1
> from org.apache.hadoop.hive.ql.exec.DDLTask. 
> MetaException(message:java.lang.IllegalArgumentException:
> java.net.UnknownHostException: <host>)
>
> where <host> is that old host name.
>
> so i ask is there a work around for this?  given they are external tables
> i'm surprised it "checks" that that location exists (or not.)
>
> thanks,
> Stephen
>


Re: hive on spark - version question

2017-03-17 Thread Stephen Sprague
yeah but... is the glass half-full or half-empty?  sure this might suck but
keep your head high, bro! Lots of it (hive) does work. :)


On Fri, Mar 17, 2017 at 2:25 PM, hernan saab <hernan_javier_s...@yahoo.com>
wrote:

> Stephen,
>
> Thanks for the response.
>
> The one thing that I don't appreciate from those who promote and DOCUMENT
> spark on hive is that, seemingly, there is absolutely no evidence seen that
> says that hive on spark WORKS.
> As a matter of fact, after a lot of pain, I noticed it is not supported by
> just about anybody.
>
> If someone dares to document Hive on Spark (see link
> https://cwiki.apache.org/confluence/display/Hive/Hive+
> on+Spark%3A+Getting+Started)  why can't they have the decency to mention
> what specific combo of Hadoop/Spark/Hive versions used that works? Have a
> git repo included in a doc with all the right versions and libraries. Why
> not? We can start from there and progressively use newer libraries in case
> the doc becomes stale. I am not really asking much, I just want to know
> what the documenter used to claim that Hive on Spark works, that's it.
>
> Clearly, for most cases, this setup is broken and it misleads people to
> waste time on a broken setup.
>
> I love this tech. But I do notice that there is some mean spirited or very
> negligent actions made by the apache development community. Documenting
> hive on spark while knowing it won't work for most cases means apache
> developers don't give a crap about the time wasted by people like us.
>
>
>
>
> On Friday, March 17, 2017 1:14 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>
>
>
> On Fri, Mar 17, 2017 at 2:56 PM, hernan saab <hernan_javier_s...@yahoo.com
> > wrote:
>
> I have been in a similar world of pain. Basically, I tried to use an
> external Hive to have user access controls with a spark engine.
> At the end, I realized that it was a better idea to use apache tez instead
> of a spark engine for my particular case.
>
> But the journey is what I want to share with you.
> The big data apache tools and libraries such as Hive, Tez, Spark, Hadoop ,
> Parquet etc etc are not interchangeable as we would like to think. There
> are very limited combinations for very specific versions. This is why tools
> like Ambari can be useful. Ambari sets a path of combos of versions known
> to work and the dirty work is done under the UI.
>
> More often than not, when you try a version that few people tried, you
> will get error messages that will derail you and cause you to waste a lot
> of time.
>
> In addition, this group, as well as many other apache big data user
> groups,  provides extremely poor support for users. The answers you usually
> get are not even hints to a solution. Their answers usually translate to
> "there is nothing I am willing to do about your problem. If I did, I should
> get paid" in many cryptic ways.
>
> If you ask your question to the Spark group they will take you to the Hive
> group and viceversa (I can almost guarantee it based on previous
> experiences)
>
> But in hindsight, people who work on this kind of thing typically make
> more money than the average developer. If you make more $$s it makes sense
> learning this stuff is supposed to be harder.
>
> Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if
> you are querying large files.
>
>
>
> On Friday, March 17, 2017 11:33 AM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>
> :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work
> with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker
> to me, alas.
>
> thanks in advance.
>
> Cheers,
> Stephen.
>
> On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
> hi guys,
> wondering where we stand with Hive On Spark these days?
>
> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
> versions) and running up against this class not found:
>
> java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener
>
>
> searching the Cyber i find this:
> 1. http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive
>
> which pretty much describes my situation too and it references this:
>
>
> 2. https://issues.apache.org/jira/browse/SPARK-17563
>
> which indicates a "won't fix" - but does reference this:
>
>
> 3. https://issues.apache.org/jira/browse/HIVE-14029
>

Re: hive on spark - version question

2017-03-17 Thread Stephen Sprague
thanks for the comments and for sure all relevant. And yeah I feel the pain
just like the next guy but that's the part of the opensource "life style"
you subscribe to when using it.

The upside payoff has gotta be worth the downside risk - or else forget
about it right? Here in the Hive world in my experience anyway its been
great.  Gotta roll with it, be courteous, be persistent and sometimes
things just work out.

Getting back to Spark and Tez yes by all means i'm a big Tez user already so
i was hoping to see what Spark brought to the table and i didn't want to diddle
around with Spark < 2.0.   That's cool. I can live with that not being
nailed down yet. I'll just wait for hive 2.2 and rattle the cage again! ha!


All good!

Cheers,
Stephen.

On Fri, Mar 17, 2017 at 1:14 PM, Edward Capriolo <edlinuxg...@gmail.com>
wrote:

>
>
> On Fri, Mar 17, 2017 at 2:56 PM, hernan saab <hernan_javier_s...@yahoo.com
> > wrote:
>
>> I have been in a similar world of pain. Basically, I tried to use an
>> external Hive to have user access controls with a spark engine.
>> At the end, I realized that it was a better idea to use apache tez
>> instead of a spark engine for my particular case.
>>
>> But the journey is what I want to share with you.
>> The big data apache tools and libraries such as Hive, Tez, Spark, Hadoop
>> , Parquet etc etc are not interchangeable as we would like to think. There
>> are very limited combinations for very specific versions. This is why tools
>> like Ambari can be useful. Ambari sets a path of combos of versions known
>> to work and the dirty work is done under the UI.
>>
>> More often than not, when you try a version that few people tried, you
>> will get error messages that will derail you and cause you to waste a lot
>> of time.
>>
>> In addition, this group, as well as many other apache big data user
>> groups,  provides extremely poor support for users. The answers you usually
>> get are not even hints to a solution. Their answers usually translate to
>> "there is nothing I am willing to do about your problem. If I did, I should
>> get paid" in many cryptic ways.
>>
>> If you ask your question to the Spark group they will take you to the
>> Hive group and viceversa (I can almost guarantee it based on previous
>> experiences)
>>
>> But in hindsight, people who work on this kind of thing typically make
>> more money than the average developer. If you make more $$s it makes sense
>> learning this stuff is supposed to be harder.
>>
>> Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if
>> you are querying large files.
>>
>>
>>
>> On Friday, March 17, 2017 11:33 AM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>
>> :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will
>> work with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal
>> breaker to me, alas.
>>
>> thanks in advance.
>>
>> Cheers,
>> Stephen.
>>
>> On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>> hi guys,
>> wondering where we stand with Hive On Spark these days?
>>
>> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
>> versions) and running up against this class not found:
>>
>> java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener
>>
>>
>> searching the Cyber i find this:
>> 1. http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive
>>
>> which pretty much describes my situation too and it references this:
>>
>>
>> 2. https://issues.apache.org/jira/browse/SPARK-17563
>>
>> which indicates a "won't fix" - but does reference this:
>>
>>
>> 3. https://issues.apache.org/jira/browse/HIVE-14029
>>
>> which looks to be fixed in hive 2.2 - which is not released yet.
>>
>>
>> so if i want to use spark 2.1.0 with hive am i out of luck - until hive
>> 2.2?
>>
>> thanks,
>> Stephen.
>>
>>
>>
>>
>>
> Stephen,
>
> I understand some of your frustration.  Remember that many in open source
> are volunteering their time. This is why if you pay a vendor for support of
> some software you might pay 50K a year or $2

Re: hive on spark - version question

2017-03-17 Thread Stephen Sprague
:(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work
with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker
to me, alas.

thanks in advance.

Cheers,
Stephen.

On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague <sprag...@gmail.com>
wrote:

> hi guys,
> wondering where we stand with Hive On Spark these days?
>
> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
> versions) and running up against this class not found:
>
> java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener
>
>
> searching the Cyber i find this:
> 1. http://stackoverflow.com/questions/41953688/setting-
> spark-as-default-execution-engine-for-hive
>
> which pretty much describes my situation too and it references this:
>
>
> 2. https://issues.apache.org/jira/browse/SPARK-17563
>
> which indicates a "won't fix" - but does reference this:
>
>
> 3. https://issues.apache.org/jira/browse/HIVE-14029
>
> which looks to be fixed in hive 2.2 - which is not released yet.
>
>
> so if i want to use spark 2.1.0 with hive am i out of luck - until hive
> 2.2?
>
> thanks,
> Stephen.
>
>


hive on spark - version question

2017-03-14 Thread Stephen Sprague
hi guys,
wondering where we stand with Hive On Spark these days?

i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
versions) and running up against this class not found:

java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener


searching the Cyber i find this:
1.
http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive

which pretty much describes my situation too and it references this:


2. https://issues.apache.org/jira/browse/SPARK-17563

which indicates a "won't fix" - but does reference this:


3. https://issues.apache.org/jira/browse/HIVE-14029

which looks to be fixed in hive 2.2 - which is not released yet.


so if i want to use spark 2.1.0 with hive am i out of luck - until hive 2.2?

thanks,
Stephen.


random KILL's in YARN

2017-01-18 Thread Stephen Sprague
hey guys,
I have a question on why Hiveserver2 would issue a "killjob" signal.

We run Yarn on Hadoop 5.6 with the HiveServer2 process. It uses the
fair-scheduler. Pre-emption is turned off. At least twice a day we have
jobs that are randomly killed.  they can be big jobs, they can be small
ones. they can be Tez jobs, they can be MR jobs.  I can't find any pattern
and i can't find *WHY* this is occurring so i'm reaching out.

i have the RM process running at DEBUG level logging as well as all the NM
processes so i have pretty detailed logs.  Still i can't find a reason -
all i see is something like this:

2017-01-18 11:18:15,732 DEBUG [IPC Server handler 0 on 42807]
org.apache.hadoop.ipc.Server: IPC Server handler 0 on 42807:
org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.killJob from
172.19.75.137:39623 Call#5981979 Retry#0 for RpcKind RPC_PROTOCOL_BUFFER
2017-01-18 11:18:15,732 DEBUG [IPC Server handler 0 on 42807]
org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:dwr
(auth:SIMPLE) from:org.apache.hadoop.ipc.Server$Handler.run(Server.
java:2038)
*2017-01-18 11:18:15,736 INFO [IPC Server handler 0 on 42807]
org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job
job_1484692657978_3312 received from dwr (auth:SIMPLE) at 172.19.75.137*
2017-01-18 11:18:15,736 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
org.apache.hadoop.mapreduce.v2.app.job.event.JobDiagnosticsUpdateEvent.EventType:
JOB_DIAGNOSTIC_UPDATE
2017-01-18 11:18:15,736 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Processing
job_1484692657978_3312 of type JOB_DIAGNOSTIC_UPDATE
2017-01-18 11:18:15,736 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
org.apache.hadoop.mapreduce.v2.app.job.event.JobEvent.EventType: JOB_KILL
2017-01-18 11:18:15,736 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Processing
job_1484692657978_3312 of type JOB_KILL
2017-01-18 11:18:15,737 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
job_1484692657978_3312Job Transitioned from RUNNING to KILL_WAIT
2017-01-18 11:18:15,737 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
org.apache.hadoop.mapreduce.v2.app.job.event.TaskEvent.EventType: T_KILL
2017-01-18 11:18:15,737 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Processing
task_1484692657978_3312_m_00 of type T_KILL
2017-01-18 11:18:15,737 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
org.apache.hadoop.mapreduce.v2.app.job.event.TaskEvent.EventType: T_KILL
2017-01-18 11:18:15,737 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Processing
task_1484692657978_3312_m_01 of type T_KILL




* 172.19.75.137 is the node where the HiveServer2 process is running - so i'm
pretty sure this is the process issuing the kill.  but why?  why kill the
job?  makes no sense to me.  what would cause this to happen?



The Hiveserver2 log doesn't have anything useful in it at the INFO level
and i can't seem to get it to log at the DEBUG level. the bit in red below
doesn't seem to work so any suggestions here gladly accepted too.

$ ps -ef | grep -i hiveserver2

dwr  21927  1785  8 Jan14 ?10:10:11
/usr/lib/jvm/java-8-oracle/jre//bin/java -Xmx12288m
-Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/usr/lib/hadoop/logs
-Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop
-Dhadoop.id.str= -Dhadoop.root.logger=INFO,console
-Djava.library.path=/usr/lib/hadoop/lib/native
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
-Xmx268435456 -Xmx10G -Dlog4j.configurationFile=hive-log4j2.properties
-Dorg.apache.logging.log4j.simplelog.StatusLogger.level=DEBUG
-Dlog4j.configurationFile=hive-log4j2.properties
-Djava.util.logging.config.file=/usr/lib/apache-hive-2.1.0-bin/bin/../conf/parquet-logging.properties
-Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar
/usr/lib/apache-hive-2.1.0-bin/lib/hive-service-2.1.0.jar
org.apache.hive.service.server.HiveServer2


*--hiveconf hive.root.logger=DEBUG,console*
thanks,
Stephen


Re: tez + union stmt

2016-12-25 Thread Stephen Sprague
Thanks Elliot.  Nice christmas present.   Those settings in that
stackoverflow link look to me to be exactly what i need to set for MR jobs
to pick up the data that Tez created.

Cheers,
Stephen.

On Sun, Dec 25, 2016 at 2:45 AM, Elliot West <tea...@gmail.com> wrote:

> I believe that tez will generate subfolders for unioned data. As far as I
> know, this is the expected behaviour and there is no alternative.
> Presumably this is to prevent multiple tasks from attempting to write the
> same file?
>
> We've experienced issues when switching from mr to tez; downstream jobs
> weren't expecting subfolders and had trouble reading previously accessible
> datasets.
>
> Apparently there are workarounds within Hive:
> http://stackoverflow.com/questions/39511585/hive-
> create-table-not-insert-data
>
> Merry Christmas,
>
> Elliot.
>
> On Sun, 25 Dec 2016 at 03:11, Rajesh Balamohan <rbalamo...@apache.org>
> wrote:
>
>> Are there any exceptions in hive.log?. Is tmp_pv_v4* table part of the
>> select query?
>>
>> Assuming you are creating the table in staging.db, it would have created
>> the table location as staging.db/foo (as you have not specified the
>> location).
>>
>> Adding user@hive.apache.org as this is hive related.
>>
>>
>> ~Rajesh.B
>>
>> On Sun, Dec 25, 2016 at 12:08 AM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>> all,
>>
>> i'm running tez with the sql pattern:
>>
>> * create table foo as select * from (select... UNION select... UNION
>> select...)
>>
>> in the logs the final step is this:
>>
>> * Moving data to directory hdfs://dwrnn1.sv2.trulia.com:
>> 8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4 from hdfs://
>> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/
>> staging.db/.hive-staging_hive_2016-12-24_10-05-40_048_
>> 4896412314807355668-899/-ext-10002
>>
>>
>> when querying the table i got zero rows returned which made me curious.
>> so i queried the hdfs location and see this:
>>
>>   $ hdfs dfs -ls hdfs://dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/
>> staging.db/tmp_pv_v4c__loc_4
>>
>>   Found 3 items
>>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:05 hdfs://
>> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/
>> staging.db/tmp_pv_v4c__loc_4/1
>>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:06 hdfs://
>> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/
>> staging.db/tmp_pv_v4c__loc_4/2
>>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:06 hdfs://
>> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/
>> staging.db/tmp_pv_v4c__loc_4/3
>>
>> and yes the data files are under these three dirs.
>>
>> so i ask... i'm not used to seeing sub-directories under the tablename
>> unless the table is partitioned. is this legit? might there be some config
>> settings i need to set to see this data via sql?
>>
>> thanks,
>> Stephen.
>>
>>
>>
>>
>>
>>
>>
>>


Re: Maintaining big and complex Hive queries

2016-12-21 Thread Stephen Sprague
my 2 cents. :)

as soon as you say "complex query" i would submit you've lost the upper
hand and you're behind the eight-ball right off the bat.  And you know this
too, otherwise you wouldn't have posted here. ha!

i use cascading CTAS statements so that i can examine the intermediate
tables.  Another approach is to use CTEs, but while that makes things
easier to read it's still one big query and you don't get insight into the
"work" tables.

yes, it could take longer to execute if those intermediate tables can't be
built in parallel, but that's a small price to pay compared to human debug
time, in my book anyway.

thoughts?

Cheers,
Stephen.





On Wed, Dec 21, 2016 at 10:07 AM, Saumitra Shahapure <
saumitra.offic...@gmail.com> wrote:

> Hi Elliot,
>
> Thanks for letting me know. HPL-SQL sounded particularly interesting. But
> in the documentation I could not see any way to pass output generated by
> one Hive query to the next one. The tool looks good as a homogeneous PL-SQL
> platform for multiple big-data systems (http://www.hplsql.org/about).
>
> However in order to break single complex hive query, DDLs look to be only
> way in HPL-SQL too. Or is there any alternate way that I might have missed?
>
> -- Saumitra S. Shahapure
>
> On Thu, Dec 15, 2016 at 6:21 PM, Elliot West  wrote:
>
>> I notice that HPL/SQL is not mentioned on the page I referenced, however
>> I expect that is another approach that you could use to modularise:
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
>> http://www.hplsql.org/doc
>>
>> On 15 December 2016 at 17:17, Elliot West  wrote:
>>
>>> Some options are covered here, although there is no definitive guidance
>>> as far as I know:
>>>
>>> https://cwiki.apache.org/confluence/display/Hive/Unit+Testin
>>> g+Hive+SQL#UnitTestingHiveSQL-Modularisation
>>>
>>> On 15 December 2016 at 17:08, Saumitra Shahapure <
>>> saumitra.offic...@gmail.com> wrote:
>>>
 Hello,

 We are running and maintaining quite big and complex Hive SELECT query
 right now. It's basically a single SELECT query which performs JOIN of
 about ten other SELECT query outputs.

 A simplest way to refactor that we can think of is to break this query
 down into multiple views and then join the views. There is similar
 possibility to create intermediate tables.

 However creating multiple DDLs in order to maintain a single DML is not
 very smooth. We would end up polluting metadata database by creating views
 / intermediate tables which are used in just this ETL.

 What are the other efficient ways to maintain complex SQL queries
 written in Hive? Are there better ways to break Hive query into multiple
 modules?

 -- Saumitra S. Shahapure

>>>
>>>
>>
>


Re: [ANNOUNCE] Apache Hive 2.1.1 Released

2016-12-08 Thread Stephen Sprague
Ahh.  thank you.

On Thu, Dec 8, 2016 at 3:19 PM, Alan Gates <alanfga...@gmail.com> wrote:

> Apache keeps just the latest version of each release on the mirrors.  You
> can find all Hive releases at https://archive.apache.org/dist/hive/ if
> you need 2.1.0.
>
> Alan.
>
> > On Dec 8, 2016, at 14:40, Stephen Sprague <sprag...@gmail.com> wrote:
> >
> > out of curiosity any reason why release 2.1.0 disappeared from
> apache.claz.org/hive ?   apologies if i missed the conversation about
> it.  thanks.
> >
> >
> > 
> >
> > On Thu, Dec 8, 2016 at 9:58 AM, Jesus Camacho Rodriguez <
> jcama...@apache.org> wrote:
> > The Apache Hive team is proud to announce the release of Apache Hive
> > version 2.1.1.
> >
> > The Apache Hive (TM) data warehouse software facilitates querying and
> > managing large datasets residing in distributed storage. Built on top
> > of Apache Hadoop (TM), it provides, among others:
> >
> > * Tools to enable easy data extract/transform/load (ETL)
> >
> > * A mechanism to impose structure on a variety of data formats
> >
> > * Access to files stored either directly in Apache HDFS (TM) or in other
> >   data storage systems such as Apache HBase (TM)
> >
> > * Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.
> >
> > For Hive release details and downloads, please visit:
> > https://hive.apache.org/downloads.html
> >
> > Hive 2.1.1 Release Notes are available here:
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> projectId=12310843=12335838
> >
> > We would like to thank the many contributors who made this release
> > possible.
> >
> > Regards,
> >
> > The Apache Hive Team
> >
> >
> >
>
>


Re: [ANNOUNCE] Apache Hive 2.1.1 Released

2016-12-08 Thread Stephen Sprague
out of curiosity any reason why release 2.1.0 disappeared from
apache.claz.org/hive ?   apologies if i missed the conversation about it.
thanks.


[image: Inline image 1]

On Thu, Dec 8, 2016 at 9:58 AM, Jesus Camacho Rodriguez  wrote:

> The Apache Hive team is proud to announce the release of Apache Hive
> version 2.1.1.
>
> The Apache Hive (TM) data warehouse software facilitates querying and
> managing large datasets residing in distributed storage. Built on top
> of Apache Hadoop (TM), it provides, among others:
>
> * Tools to enable easy data extract/transform/load (ETL)
>
> * A mechanism to impose structure on a variety of data formats
>
> * Access to files stored either directly in Apache HDFS (TM) or in other
>   data storage systems such as Apache HBase (TM)
>
> * Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.
>
> For Hive release details and downloads, please visit:
> https://hive.apache.org/downloads.html
>
> Hive 2.1.1 Release Notes are available here:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> projectId=12310843=12335838
>
> We would like to thank the many contributors who made this release
> possible.
>
> Regards,
>
> The Apache Hive Team
>
>
>


Re: s3a and hive

2016-11-15 Thread Stephen Sprague
just for the record...

this config "hive.exec.stagingdir" determines that ".hive_staging"
sub-directory. when it defaults to the table path and the table path is in
s3 that's where i get the exception:

Failed with exception java.io.IOException: rename for src path:
s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/
date_key=20161113/.hive- staging_hive_2016-11-15_04-57-
52_085_7825126612479617470-1/-ext-1/00_0 to dest
path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
returned false

i got the tip from this stackoverflow:

http://stackoverflow.com/questions/39547001/why-hive-staging-file-is-missing-in-aws-emr


that said setting "hive.metastore.warehouse.dir" to an S3 location is
something totally different and per Elliot's comments could be a risky
adventure - and unrelated to my error.

Anyway, I reset that back to hdfs and was inserting into an external table
located in s3 and *still* got that error above much to my consternation.
however, by playing with "hive.exec.stagingdir" (and reading that
stackoverflow) i was able to overcome the error.
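
for reference, the knob in question looks like this - the exact value you
want will depend on your layout, but the point is a staging location off
the s3 table path (mine was roughly this shape):

   -- default is ".hive-staging", created relative to the target table/partition dir
   set hive.exec.stagingdir=/tmp/hive/.hive-staging;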

YMMV.

Cheers,
Stephen.


On Tue, Nov 15, 2016 at 7:53 AM, Stephen Sprague <sprag...@gmail.com> wrote:

> Thanks Elliot. I think you might be onto something there. :)
>
> Making that tiny little switch sure seemed attractive but judging from the
> Jira's out there the ramifications of that setting are far more involved
> and nuanced than i thought.
>
> awright. you make some convincing arguments there. looks like the smart
> money is on hdfs for the time-being.
>
> thanks for replying! Good stuff.
>
> Cheers,
> Stephen.
>
> On Tue, Nov 15, 2016 at 7:29 AM, Elliot West <tea...@gmail.com> wrote:
>
>> My gut feeling is that this is not something you should do (except for
>> fun!) I'm fairly confident that somewhere in Hive, MR, or Tez, you'll hit
>> some code that requires consistent, atomic move/copy/list/overwrite
>> semantics from the warehouse filesystem. This is not something that the
>> vanilla S3AFileSystem can provide. Even if you get to the point where
>> everything appears functionally sound, I expect you'll encounter unusual
>> and inconsistent behavior if you use this in the long term.
>>
>> Solutions to Hive on S3 include:
>>
>>- Use S3Guard (not yet available): https://issues.apa
>>che.org/jira/browse/HADOOP-13345
>><https://issues.apache.org/jira/browse/HADOOP-13345>
>>- Use Hive on EMR with Amazon's S3 filesystem implementation and
>>EMRFS. Note that this confusingly requires and overloads the 's3://' 
>> scheme.
>>
>> Hope this helps, and please report back with any findings as we are doing
>> quite a bit of Hive in AWS too.
>>
>> Cheers - Elliot.
>>
>> On 15 November 2016 at 15:19, Stephen Sprague <sprag...@gmail.com> wrote:
>>
>>> no. permissions are good.  i believe the case to be that s3a does not
>>> have a "move" and/or "rename" semantic but i can't be the first one to
>>> encounter this. somebody out there has to have gone done this path way
>>> before me surely.
>>>
>>> searching the cyber i find this:
>>>
>>>https://issues.apache.org/jira/browse/HIVE-14270
>>>
>>> which is part of a even more work with s3 (see the related jira's that
>>> that jira comes under) especially the Hadoop Uber-Jira.
>>>
>>>
>>> so after digging though those jira's lemme ask:
>>>
>>> has anyone set hive.metastore.warehouse.dir to a s3a location with
>>> success?
>>>
>>> seems to me hive 2.2.0 and perhaps hadoop 2.7 or 2.8 are the only
>>> chances of success but i'm happy to be told i'm wrong.
>>>
>>> thanks,
>>> Stephen.
>>>
>>>
>>>
>>> On Mon, Nov 14, 2016 at 10:25 PM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> Is it a permission issue on the folder?
>>>>
>>>> On 15 Nov 2016, at 06:28, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>
>>>> so i figured i try and set hive.metastore.warehouse.dir=s3a://bucket/hive
>>>> and see what would happen.
>>>>
>>>> running this query:
>>>>
>>>> insert overwrite table omniture.hit_data_aws partition
>>>> (date_key=20161113) select * from staging.hit_data_aws_ext_20161113
>>>> limit 1;
>>>>
>>>> yields this error:
>>>>
>>>>Failed with exception java.io.IOException: rename for src path:
>>

Re: s3a and hive

2016-11-15 Thread Stephen Sprague
Thanks Elliot. I think you might be onto something there. :)

Making that tiny little switch sure seemed attractive but judging from the
Jira's out there the ramifications of that setting are far more involved
and nuanced than i thought.

awright. you make some convincing arguments there. looks like the smart
money is on hdfs for the time-being.

thanks for replying! Good stuff.

Cheers,
Stephen.

On Tue, Nov 15, 2016 at 7:29 AM, Elliot West <tea...@gmail.com> wrote:

> My gut feeling is that this is not something you should do (except for
> fun!) I'm fairly confident that somewhere in Hive, MR, or Tez, you'll hit
> some code that requires consistent, atomic move/copy/list/overwrite
> semantics from the warehouse filesystem. This is not something that the
> vanilla S3AFileSystem can provide. Even if you get to the point where
> everything appears functionally sound, I expect you'll encounter unusual
> and inconsistent behavior if you use this in the long term.
>
> Solutions to Hive on S3 include:
>
>- Use S3Guard (not yet available): https://issues.apa
>che.org/jira/browse/HADOOP-13345
><https://issues.apache.org/jira/browse/HADOOP-13345>
>- Use Hive on EMR with Amazon's S3 filesystem implementation and
>EMRFS. Note that this confusingly requires and overloads the 's3://' 
> scheme.
>
> Hope this helps, and please report back with any findings as we are doing
> quite a bit of Hive in AWS too.
>
> Cheers - Elliot.
>
> On 15 November 2016 at 15:19, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> no. permissions are good.  i believe the case to be that s3a does not
>> have a "move" and/or "rename" semantic but i can't be the first one to
>> encounter this. somebody out there has to have gone done this path way
>> before me surely.
>>
>> searching the cyber i find this:
>>
>>https://issues.apache.org/jira/browse/HIVE-14270
>>
>> which is part of a even more work with s3 (see the related jira's that
>> that jira comes under) especially the Hadoop Uber-Jira.
>>
>>
>> so after digging though those jira's lemme ask:
>>
>> has anyone set hive.metastore.warehouse.dir to a s3a location with
>> success?
>>
>> seems to me hive 2.2.0 and perhaps hadoop 2.7 or 2.8 are the only chances
>> of success but i'm happy to be told i'm wrong.
>>
>> thanks,
>> Stephen.
>>
>>
>>
>> On Mon, Nov 14, 2016 at 10:25 PM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>> Is it a permission issue on the folder?
>>>
>>> On 15 Nov 2016, at 06:28, Stephen Sprague <sprag...@gmail.com> wrote:
>>>
>>> so i figured i try and set hive.metastore.warehouse.dir=s3a://bucket/hive
>>> and see what would happen.
>>>
>>> running this query:
>>>
>>> insert overwrite table omniture.hit_data_aws partition
>>> (date_key=20161113) select * from staging.hit_data_aws_ext_20161113
>>> limit 1;
>>>
>>> yields this error:
>>>
>>>Failed with exception java.io.IOException: rename for src path:
>>> s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/.hive-
>>> staging_hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
>>> to dest 
>>> path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
>>> returned false
>>> FAILED: Execution Error, return code 1 from
>>> org.apache.hadoop.hive.ql.exec.MoveTask. java.io.IOException: rename
>>> for src path: s3a://trulia-dwr-cluster-dev/h
>>> ive/omniture.db/hit_data_aws/date_key=20161113/.hive-staging
>>> _hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
>>> to dest 
>>> path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
>>> returned false
>>>
>>>
>>> is there any workaround?   i'm running hive 2.1.0 and hadoop version
>>> 2.6.0-cdh5.7.1  .
>>>
>>>
>>> thanks,
>>> Stephen.
>>>
>>>
>>
>


Re: s3a and hive

2016-11-15 Thread Stephen Sprague
no. permissions are good.  i believe the case to be that s3a does not have
a "move" and/or "rename" semantic, but i can't be the first one to
encounter this. somebody out there has to have gone down this path before
me, surely.

searching the cyber i find this:

   https://issues.apache.org/jira/browse/HIVE-14270

which is part of a larger body of work with s3 (see the related jiras that
that jira comes under), especially the Hadoop Uber-Jira.


so after digging through those jiras, lemme ask:

has anyone set hive.metastore.warehouse.dir to a s3a location with success?

seems to me hive 2.2.0 and perhaps hadoop 2.7 or 2.8 are the only chances
of success but i'm happy to be told i'm wrong.

thanks,
Stephen.



On Mon, Nov 14, 2016 at 10:25 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Is it a permission issue on the folder?
>
> On 15 Nov 2016, at 06:28, Stephen Sprague <sprag...@gmail.com> wrote:
>
> so i figured i try and set hive.metastore.warehouse.dir=s3a://bucket/hive
> and see what would happen.
>
> running this query:
>
> insert overwrite table omniture.hit_data_aws partition
> (date_key=20161113) select * from staging.hit_data_aws_ext_20161113 limit
> 1;
>
> yields this error:
>
>Failed with exception java.io.IOException: rename for src path:
> s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/.hive-
> staging_hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
> to dest 
> path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
> returned false
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.MoveTask.
> java.io.IOException: rename for src path: s3a://trulia-dwr-cluster-dev/h
> ive/omniture.db/hit_data_aws/date_key=20161113/.hive-staging
> _hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
> to dest 
> path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
> returned false
>
>
> is there any workaround?   i'm running hive 2.1.0 and hadoop version
> 2.6.0-cdh5.7.1  .
>
>
> thanks,
> Stephen.
>
>


s3a and hive

2016-11-14 Thread Stephen Sprague
so i figured i'd try setting hive.metastore.warehouse.dir=s3a://bucket/hive
and see what would happen.

running this query:

insert overwrite table omniture.hit_data_aws partition
(date_key=20161113) select * from staging.hit_data_aws_ext_20161113 limit 1;

yields this error:

   Failed with exception java.io.IOException: rename for src path:
s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/.hive-
staging_hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
to dest
path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
returned false
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask. java.io.IOException: rename for
src path:
s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/.hive-staging_hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
to dest
path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
returned false


is there any workaround?   i'm running hive 2.1.0 and hadoop version
2.6.0-cdh5.7.1  .


thanks,
Stephen.


Re: a GROUP BY that is not fully grouping

2016-11-03 Thread Stephen Sprague
ha!  kinda shows how the tech stack boundaries now are getting blurred,
eh?  well at least for us amateurs! :o

On Thu, Nov 3, 2016 at 5:00 AM, Donald Matthews  wrote:

> |Spark calls its SQL part HiveContext, but it is not related to this
> list
>
> Oof, I didn't realize that. Thanks for letting me know.
>


Re: hiveserver2 GC overhead limit exceeded

2016-10-23 Thread Stephen Sprague
ok. i'll bite.

let's see the output of this command on the host where HiveServer2 is running.

$ ps -ef | grep -i hiveserver2

this'll show us all the command line parameters HS2 was (ultimately)
invoked with.
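
(what we're after is the effective -Xmx on that line - when multiple -Xmx
flags are present the last one wins. if it turns out to be small, bumping
the heap in hive-env.sh before starting HS2 is the usual fix; a sketch,
using the variable name from the stock hive-env.sh.template:)

   # conf/hive-env.sh
   export HADOOP_HEAPSIZE=4096    # MB for the JVM started by the hive scripts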

Cheers,
Stephen

On Sun, Oct 23, 2016 at 6:46 AM, patcharee 
wrote:

> Hi,
>
> I use beeline to connect to hiveserver2. I tested with a simple command
> and got an error GC overhead limit exceeded
>
> 0: jdbc:hive2://service-10-1:10010/default> drop table
> testhivedrivertable;
> Error: Error while processing statement: FAILED: Execution Error, return
> code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. GC overhead limit
> exceeded (state=08S01,code=1)
>
> How to solve this? How to identify if this error is from the client
> (beeline) or from hiveserver2?
>
> Thanks,
>
> Patcharee
>
>


hiveserver2 and KILLJOB

2016-10-05 Thread Stephen Sprague
hey guys,
this is a long shot but i'll ask anyway.   We're running YARN and
HiveServer2 (v2.1.0) and noticing "random" kills - which look to me to be
issued by HiveServer2.

we've turned DEBUG log level on for the Application Master container and
see the following in the logs:

2016-10-05 02:06:18,365 DEBUG [IPC Server handler 0 on 34013]
org.apache.hadoop.ipc.Server: IPC Server handler 0 on 34013:
*org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.killJob
from 172.19.75.137:35616 * Call#6603343 Retry#0
for RpcKind RPC_PROTOCOL_BUFFER

2016-10-05 02:06:18,365 DEBUG [IPC Server handler 0 on 34013]
org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:dwr
(auth:SIMPLE)
from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)

2016-10-05 02:06:18,368 INFO [IPC Server handler 0 on 34013]
org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job
job_1475268554936_14906 received from dwr (auth:SIMPLE) at 172.19.75.137


host 172.19.75.137 is where the HiveServer2 process is running.


i'm not seeing anything in the HiveServer2 logs at all that is related to
this job. that's a whole 'nother problem.  also, rerunning the M/R job
works fine.

so it's very weird that, out of the blue, HS2 would issue a KILL.

this happens three or four times every day with random jobs.

Any thoughts greatly appreciated.

Cheers,
Stephen.


Re: How do I determine a library mismatch between jdbc client and server?

2016-09-28 Thread Stephen Sprague
you might just end up using your own heuristics.  if the port is "alive"
(i.e. you can list it via netstat or telnet to it) but you can't connect...
then you've got yourself a problem.
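
something as crude as this is the level of check i mean (host is an
example; 10000 is the usual HS2 default port):

   $ nc -z -w 5 hs2-host.example.com 10000 && echo "port is listening"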

kinda like a bootstrapping problem, eh? you need to connect to get the
version but you can't connect if you don't have the right version.

my 2 cents.

On Wed, Sep 28, 2016 at 4:13 PM, Bear Giles  wrote:

> Hi, I'm trying to do development in an environment where we have a mixed
> bag of clusters. Some Cloudera, some Hortonworks, different versions of
> each, etc.
>
> (I don't know if we'll see this mix in the field but we need the variety
> for testing our software.)
>
> Sometimes the poor overworked developer (cough) grabs the wrong jars for
> the server. In at least one case there was no indication of what the
> problem was - all I saw was a connection timeout. That caused some
> confusion since mismatched jars was pretty far down my list of likely
> causes for that error.
>
> Is there a standard way to query for the remote version before trying to
> establish a JDBC connection? I know that I can check the Connection's
> DatabaseMetaData information once I've established a connection but that
> doesn't help me when I'm getting nothing but a timeout exception.
>
>
>
> Bear Giles
>
> Sr. Java Application Engineer
> bgi...@snaplogic.com
> Mobile: 720-354-0766
> SnapLogic.com  | We're Hiring
> !
> 
> 
>   
>   
> 
>
>
>
> 
>
> SnapLogic Inc | 929 Pearl St #200 | Boulder | 80302 | Colorado
>
> SnapLogic Inc | 2 W Fifth Avenue Fourth Floor | San Mateo | 94402 |
> California
>
> This message is confidential. It may also be privileged or otherwise
> protected by work product immunity or other legal rules. If you have
> received it by mistake, please let us know by e-mail reply and delete it
> from your system; you may not copy this message or disclose its contents to
> anyone. The integrity and security of this message cannot be guaranteed on
> the Internet.
>


Re: Hive queries rejected under heavy load

2016-09-28 Thread Stephen Sprague
gotta start by looking at the logs and running the local client to
eliminate HS2.   perhaps running hive as such:

$ hive -hiveconf hive.root.logger=DEBUG,console

do you see any smoking gun?

On Wed, Sep 28, 2016 at 7:34 AM, Jose Rozanec  wrote:

> Hi,
>
> We have a Hive cluster (Hive 2.1.0+Tez 0.8.4) which works well for most
> queries. Though for some heavy ones we observe that sometimes are able to
> execute and sometimes get rejected. We are not sure why we get a rejection
> instead of getting them enqueued and wait for execution until resources in
> cluster are available again. We notice that the connection waits for a
> minute, and if fails to assign resources, will drop the query.
> Looking at configuration parameters, is not clear to us if this can be
> changed. Did anyone had a similar experience and can provide us some
> guidance?
>
> Thank you in advance,
>
> Joze.
>
>
>


Re: Hive 2.x usage

2016-09-14 Thread Stephen Sprague
> * Are you using Hive-2.x at your org and at what scale?

yes. we're using 2.1.0.  1.5PB.  30 node cluster.  ~1000 jobs a day.  And
yeah, hive 2.1.0 has some issues and can require some finesse wrt the
hive-site.xml settings.

> * Is the release stable enough? Did you notice any correctness issues?

yes.  we did notice correctness issues."NOT IN" against a partition key
silently fails.  and trouble with UNION statements also got us.
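
the shape of query that bit us on the first one was roughly this (table and
column names made up):

   -- NOT IN on the partition column was the pattern that misbehaved for us
   select count(*)
   from   fact_events
   where  date_key not in (20160901, 20160902);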

> * MR is deprecated in Hive-2.x (Though almost all the qtests still use
MR). Are you still using MR with Hive-2.x?

we still use MR on YARN and have no issues with it.

> * Are you using the apache release or HDP ?

we're using Apache/Hive 2.1.0 on a CDH distro.  does cloudera certify it?
probably not.  Do i care? no. :)

bottom line IMHO is Hive 2.1.0 is still in an early adopter phase.
however, if you're the do-it-yourself kind of guy then i'm sure you can
adapt - and reap the rewards of your know-how.  If you're the type who
needs somebody else to go to when it fails then i wouldn't jump on it just
yet.

just one opinion!  :o

Cheers,
Stephen.


On Wed, Sep 14, 2016 at 3:53 PM, Jörn Franke  wrote:

> If you are using a distribution (which you should if you go to production
> - Apache releases should not be used due to the maintainability, complexity
> and interaction with other components, such as Hadoop etc) then wait until
> a distribution with 2.x is out. As far as i am aware there is currently no
> such distribution. As far as i know , Hortonworks and probably also
> Cloudera test their distributions on large scale real production systems
> beforehand.
>
> I would not use MR even with 1.x and go for TEZ (except you are using some
> very specific outdated functionality). Spark is another option, but i do
> not see Hive on Spark as stable and less functionality - this may change in
> the future.
>
> On 14 Sep 2016, at 22:36, RD  wrote:
>
> Hi Folks,
>   We  (at my org) are currently planning our move to Hive-2.x. As part of
> this I wanted to get a sense of how stable the Hive-2.x release is.  I
> thought it would be good to conduct a brief survey on this. I've added a
> few questions below. It would really be a ton of help if folks could
> provide their feedback
>
> * Are you using Hive-2.x at your org and at what scale?
> * Is the release stable enough? Did you notice any correctness issues?
> * MR is deprecated in Hive-2.x (Though almost all the qtests still use
> MR). Are you still using MR with Hive-2.x?
> * Are you using the apache release or HDP ?
>
> -Best,
>
>
>
>
>


Re: Re: load data Failed with exception java.lang.IndexOutOfBoundsException

2016-09-08 Thread Stephen Sprague
>at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.
validateInput(OrcInputFormat.java:508)

would it be safe to assume that you are trying to load a text file into a
table stored as ORC?

your create table doesn't specify that explicitly so that means you have a
setting in your configs that says new tables are to be stored as ORC if not
specified otherwise.
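
if that's what's going on, two ways out come to mind (a sketch with made-up
columns and delimiter):

   -- either declare the table as text so LOAD DATA of a text file is legit...
   create table ods.loadtest_txt (c1 int, c2 string)
     row format delimited fields terminated by ','
     stored as textfile;
   load data local inpath '1.dat' overwrite into table ods.loadtest_txt;

   -- ...or keep the ORC table and populate it via insert-select from the text table
   insert overwrite table ods.loadtest select * from ods.loadtest_txt;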

too bad there isn't an error message like: "loading text data into a
non-TEXTFILE table generally isn't a good idea". :)

then again maybe somebody knows something i don't.

Cheers,
Stephen.





On Thu, Sep 8, 2016 at 7:37 PM, C R  wrote:

>
> Yes, based on my testing,it is wrong from 0 to 99 with the content of file 
> 1.dat,
> whether the column type is string or int.
>
> hive.log:
>
> 2016-09-09T09:10:40,978 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: CliDriver (SessionState.java:printInfo(
> 1029)) - Hive-on-MR is deprecated in Hive 2 and may
> not be available in the future versions. Consider
> using a different execution engine (i.e. tez, spark) or
> using Hive 1.X releases.
> 2016-09-09T09:11:17,433 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: conf.HiveConf (HiveConf.java:
> getLogIdVar(3177)) - Using the default value passed in
> for log id: d1e08abd-5f8b-4149-a679-00ba6b4f4ab9
> 2016-09-09T09:11:17,462 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:compile(409)) -
> Compiling command(queryId=hadoop_20160909091117_2f9e8e3b-b2e8-4312-b473-
> 535881c1d726): LOAD DATA LOCAL INPATH '1.dat' overwrite INTO TABLE ODS.
> loadtest
> 2016-09-09T09:11:18,016 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: metastore.HiveMetaStore (HiveMetaStore.
> java:logInfo(670)) - 0: get_table : db=ODS tbl=loadtest
> 2016-09-09T09:11:18,016 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: HiveMetaStore.audit (HiveMetaStore.java:
> logAuditEvent(280)) - ugi=hadoop   ip=unknown-ip-addr
>cmd=get_table : db=ODS tbl=loadtest
> 2016-09-09T09:11:18,162 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:compile(479)) -
> Semantic Analysis Completed
> 2016-09-09T09:11:18,163 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:getSchema(251)) -
>  Returning Hive schema: Schema(fieldSchemas:null, properties:null)
> 2016-09-09T09:11:18,167 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:compile(551)) -
> Completed compiling command(queryId=hadoop_20160909091117_
> 2f9e8e3b-b2e8-4312-b473-535881c1d726); Time taken: 0.725 seconds
> 2016-09-09T09:11:18,167 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:checkConcurrency(
> 171)) - Concurrency mode is disabled, not creating a lock manager
> 2016-09-09T09:11:18,167 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:execute(1493)) -
> Executing command(queryId=hadoop_20160909091117_2f9e8e3b-b2e8-4312-b473-
> 535881c1d726): LOAD DATA LOCAL INPATH '1.dat' overwrite INTO TABLE ODS.
> loadtest
> 2016-09-09T09:11:18,172 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:launchTask(1832))
>  - Starting task [Stage-0:MOVE] in serial mode
> 2016-09-09T09:11:18,172 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: exec.Task (SessionState.java:printInfo(
> 1029)) - Loading data to table ods.loadtest from file:1.dat
> 2016-09-09T09:11:18,172 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: metastore.HiveMetaStore (HiveMetaStore.
> java:logInfo(670)) - 0: get_table : db=ods tbl=loadtest
> 2016-09-09T09:11:18,173 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: HiveMetaStore.audit (HiveMetaStore.java:
> logAuditEvent(280)) - ugi=hadoop   ip=unknown-ip-addr
>cmd=get_table : db=ods tbl=loadtest
> 2016-09-09T09:11:18,320 ERROR [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: exec.Task (SessionState.java:printError(
> 1038)) - Failed with exception java.lang.IndexOutOfBoundsException
> org.apache.hadoop.hive.ql.metadata.HiveException: java.
> lang.IndexOutOfBoundsException
> at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.
> checkInputFormat(HiveFileFormatUtils.java:195)
> at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.
> checkTextInputFormat(HiveFileFormatUtils.java:217)
> at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.
> checkInputFormat(HiveFileFormatUtils.java:182)
> at org.apache.hadoop.hive.ql.exec.MoveTask.execute(
> MoveTask.java:306)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:158)
> at org.apache.hadoop.hive.ql.exec.TaskRunner.
> runSequential(TaskRunner.java:101)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1834)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1578)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1355)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1178)
> at 

Re: hive.root.logger influencing query plan?? so it's not so

2016-09-04 Thread Stephen Sprague
another case of a query hangin' in v2.1.0.

hive> select 1 from raw_logs.fact_mweb_datekey_by_hour where
date_key=20160904 and hour=15 limit 1;

* date_key is a partition key and 20160904 exists.
* hour is a bucketed column, and hour=15 has no matching rows. not sure it
matters whether it's bucketed or not for the query to hang.
* so empty result expected.


as Gopal mentioned previously this does indeed fix it:

* set hive.fetch.task.conversion=none;

but not sure it's the right thing to set globally just yet.
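
(a possibly gentler knob to look at is the size threshold for the
conversion rather than disabling it outright - untested on this particular
case though:)

   -- only allow the single fetch-task plan when the input is under N bytes
   set hive.fetch.task.conversion.threshold=268435456;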

Anyhoo users beware.

Regards,
Stephen





On Wed, Aug 31, 2016 at 7:01 AM, Stephen Sprague <sprag...@gmail.com> wrote:

> hi guys,
>
> Vlad: good suggestion however in my case its a 5 second query (when it
> works)
>
> Gopal: Thanks for the explanation on the effect logging can have on the
> execution path. somewhat counter-intuitive i must say and as you can
> imagine a tad more challenging to debug - when debugging influences the
> observation. :)
>
> that said adding:
>
>set hive.fetch.task.conversion=none;
>
> forced the query plans to be the same and that made the difference. it
> worked.
>
> now i need to research that setting and see what the implications are.
> I'm sure it doesn't come for free.
>
> and as always a big thanks!  lemme know if i can provide any details.  the
> partition size is >1G btw. and this is using hive 2.1.0.
>
> Cheers,
> Stephen.
> ps here's the hdfs info - the table is bucketed by 24 fwiw.
>
>
> $ hdfs dfs -du -h /user/hive/warehouse/raw_logs.
> db/fact_www_access_datekey_by_hour/date_key=20160828
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/00_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/01_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/02_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/03_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/04_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/05_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/06_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/07_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/08_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/09_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/10_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/11_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/12_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/13_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/14_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/15_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/16_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/17_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/18_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/19_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/20_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/21_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/22_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/23_0
>
>
>
>
> On Wed, Aug 31, 2016 at 12:19 AM, Vladimir Kroz <vladimir.k...@gmail.com>
> wrote:
>
>> One of the causes could be that on long-running queries your terminal
>> session gets disconnected and client process terminate (appearing like
>>  query hangs).
>>
>> When debug messages are on, they will keep terminal session alive and
>> hence allowing your query to complete.
>>
>> I'm not sure if this is your case, but based on synopsis your provided it
>&

Re: Beeline throws OOM on large input query

2016-09-02 Thread Stephen Sprague
hmmm.  so beeline blew up *before* the query was even submitted to the
execution engine?   one would think 16G would be plenty for an 8M-character
sql statement.

some suggestions if you feel like going further down the rabbit hole.

1.  confirm your beeline java process is indeed running with expanded
memory (ps -ef | grep beeline and look for the last Xmx setting on the line)

2.  try the hive-cli (or the python one even.)  or "beeline -u
jdbc:hive2://" (local beeline - maybe that's different)

3.  chop down your 6K points to 3K or something smaller to see just where
the breaking point is.  does 1K points even work?  i.e. determine how close
to the edge you are.

Cheers,
Stephen.

PS. i had never heard of a "theta" join before so i searched it and found
this:
https://cwiki.apache.org/confluence/display/Hive/Theta+Join  and
this: https://issues.apache.org/jira/browse/HIVE-556 (looks like this came
first)

and still in "open" status i see. well you're not alone if that's any
solace!

maybe ping that Jira and see if Edward or Brock (or others) have any news
on the topic, as supporting theta joins sounds like the proper solution to
this whole rigamarole you find yourself in.

On Fri, Sep 2, 2016 at 6:12 AM, Adam  wrote:

> I set the heap size using HADOOP_CLIENT_OPTS all the way to 16g and still
> no luck.
>
> I tried to go down the table join route but the problem is that the
> relation is not an equality so it would be a theta join which is not
> supported in Hive.
> Basically what I am doing is a geographic intersection against 6,000
> points so the where clause has 6000 points in it (I use a custom UDF for
> the intersection).
>
> To avoid the problem I ended up writing another version of the UDF that
> reads the point list from an HDFS file.
>
> It's a low priority I'm sure but I bet there are some inefficiencies in
> the query string handling that could be fixed.  When I traced the code it
> was doing all kinds of StringBuffer and String += type stuff.
>
> Regards,
>


Re: Beeline throws OOM on large input query

2016-09-01 Thread Stephen Sprague
lemme guess.  your query contains an 'in' clause with 1 million static
values? :)

* brute force solution is to set:

 HADOOP_CLIENT_OPTS=-Xmx8G (or whatever)

before you run beeline to force a larger memory size

(i'm pretty sure beeline uses that env var though i didn't actually check
the script)


* the more scalable solution (if my premise above is correct) would be to
join to a table containing your values. But hey, maybe you actually have an
8MM-byte sql query that doesn't contain static data - that could be a world
record!
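
i.e. something along these lines (names and paths made up):

   -- load the static values once into a small table...
   create table filter_vals (val string);
   load data local inpath '/tmp/filter_vals.txt' into table filter_vals;

   -- ...then join to it instead of carrying a giant IN list in the query text
   select f.*
   from   some_fact   f
   join   filter_vals v on f.some_col = v.val;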

Cheers,
Stephen



On Thu, Sep 1, 2016 at 6:21 AM, Adam  wrote:

> Hive Version: 2.1.0
> I have a very large, multi-line input query (8,668,519 chars) and I have
> gone up to 16g heap and still get the same OOM.
>
>
> Error: Error running query: java.lang.OutOfMemoryError: Java heap space
> (state=,code=0)
> org.apache.hive.service.cli.HiveSQLException: Error running query:
> java.lang.OutOfMemoryError: Java heap space
> at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:264)
> at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:
> 250)
> at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStat
> ement.java:309)
> at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.
> java:250)
> at org.apache.hive.beeline.Commands.executeInternal(Commands.
> java:977)
> at org.apache.hive.beeline.Commands.execute(Commands.java:1148)
> at org.apache.hive.beeline.Commands.sql(Commands.java:1063)
> at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1134)
> at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:965)
> at org.apache.hive.beeline.BeeLine.executeFile(BeeLine.java:940)
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:867)
> at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(Bee
> Line.java:499)
> at org.apache.hive.beeline.BeeLine.main(BeeLine.java:482)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
> ssorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
> thodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error running
> query: java.lang.OutOfMemoryError: Java heap space
> at org.apache.hive.service.cli.operation.SQLOperation.prepare(
> SQLOperation.java:218)
> at org.apache.hive.service.cli.operation.SQLOperation.runIntern
> al(SQLOperation.java:269)
> at org.apache.hive.service.cli.operation.Operation.run(Operatio
> n.java:324)
> at org.apache.hive.service.cli.session.HiveSessionImpl.executeS
> tatementInternal(HiveSessionImpl.java:460)
> at org.apache.hive.service.cli.session.HiveSessionImpl.executeS
> tatementAsync(HiveSessionImpl.java:447)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
> ssorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
> thodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(
> HiveSessionProxy.java:78)
> at org.apache.hive.service.cli.session.HiveSessionProxy.access$
> 000(HiveSessionProxy.java:36)
> at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(
> HiveSessionProxy.java:63)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGro
> upInformation.java:1657)
> at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(
> HiveSessionProxy.java:59)
> at com.sun.proxy.$Proxy33.executeStatementAsync(Unknown Source)
> at org.apache.hive.service.cli.CLIService.executeStatementAsync
> (CLIService.java:294)
> at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteS
> tatement(ThriftCLIService.java:497)
> at org.apache.hive.service.rpc.thrift.TCLIService$Processor$Exe
> cuteStatement.getResult(TCLIService.java:1437)
> at org.apache.hive.service.rpc.thrift.TCLIService$Processor$Exe
> cuteStatement.getResult(TCLIService.java:1422)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.
> java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.
> java:39)
> at org.apache.hive.service.auth.TSetIpAddressProcessor.process(
> TSetIpAddressProcessor.java:56)
> at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.
> run(TThreadPoolServer.java:286)
> at 

Re: Quota for rogue ad-hoc queries

2016-09-01 Thread Stephen Sprague
> rogue queries

so this really isn't limited to just hive is it?  any dbms system perhaps
has to contend with this.  even malicious rogue queries as a matter of fact.

timeouts are a cheap way systems handle this - assuming time is related to
resource use. i'm sure beeline or whatever client you use has a timeout
feature.

maybe one could write a separate service - say a governor - that watches
over YARN (or hdfs or whatever resource is scarce) and terminates the
process if it goes beyond a threshold.  think OOM killer.

but, yeah, i admittedly don't know of something out there already that you
can just tap into, but YARN's Resource Manager seems to be the place i'd
research for starters. just look at its name. :)

my unsolicited 2 cents.
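
fwiw, on the original quota question - plain hdfs space quotas on the
scratch dirs are a blunt instrument but they do exist (path and size below
are just examples):

   $ hdfs dfsadmin -setSpaceQuota 1t /tmp/hive    # cap the hive scratch dir
   $ hdfs dfs -count -q /tmp/hive                 # check quota vs usage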



On Wed, Aug 31, 2016 at 10:24 PM, ravi teja  wrote:

> Thanks Mich,
>
> Unfortunately we have many insert queries.
> Are there any other ways?
>
> Thanks,
> Ravi
>
> On Wed, Aug 31, 2016 at 9:45 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Trt this
>>
>> hive.limit.optimize.fetch.max
>>
>>- Default Value: 5
>>- Added In: Hive 0.8.0
>>
>> Maximum number of rows allowed for a smaller subset of data for simple
>> LIMIT, if it is a fetch query. Insert queries are not restricted by this
>> limit.
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 31 August 2016 at 13:42, ravi teja  wrote:
>>
>>> Hi Community,
>>>
>>> Many users run adhoc hive queries on our platform.
>>> Some rogue queries managed to fill up the hdfs space and causing
>>> mainstream queries to fail.
>>>
>>> We wanted to limit the data generated by these adhoc queries.
>>> We are aware of strict param which limits the data being scanned, but it
>>> is of less help as huge number of user tables aren't partitioned.
>>>
>>> Is there a way we can limit the data generated from hive per query, like
>>> a hve parameter for setting HDFS quotas for job level *scratch*
>>> directory or any other approach?
>>> What's the general approach to gaurdrail such multi-tenant cases.
>>>
>>> Thanks in advance,
>>> Ravi
>>>
>>
>>
>


Re: hive.root.logger influencing query plan?? so it's not so

2016-08-31 Thread Stephen Sprague
hi guys,

Vlad: good suggestion, however in my case it's a 5 second query (when it
works)

Gopal: Thanks for the explanation on the effect logging can have on the
execution path. somewhat counter-intuitive i must say and as you can
imagine a tad more challenging to debug - when debugging influences the
observation. :)

that said adding:

   set hive.fetch.task.conversion=none;

forced the query plans to be the same and that made the difference. it
worked.

now i need to research that setting and see what the implications are.  I'm
sure it doesn't come for free.

and as always a big thanks!  lemme know if i can provide any details.  the
partition size is >1G btw. and this is using hive 2.1.0.

Cheers,
Stephen.
ps here's the hdfs info - the table is bucketed by 24 fwiw.


$ hdfs dfs -du -h
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/00_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/01_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/02_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/03_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/04_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/05_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/06_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/07_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/08_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/09_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/10_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/11_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/12_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/13_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/14_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/15_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/16_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/17_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/18_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/19_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/20_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/21_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/22_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/23_0




On Wed, Aug 31, 2016 at 12:19 AM, Vladimir Kroz <vladimir.k...@gmail.com>
wrote:

> One of the causes could be that on long-running queries your terminal
> session gets disconnected and client process terminate (appearing like
>  query hangs).
>
> When debug messages are on, they will keep terminal session alive and
> hence allowing your query to complete.
>
> I'm not sure if this is your case, but based on synopsis your provided it
> could be one of the reasons.
>
> Cheers,
> Vlad
>
> ---
> From: Stephen Sprague <sprag...@gmail.com>
> To: "user@hive.apache.org" <user@hive.apache.org>
> Cc:
> Date: Tue, 30 Aug 2016 20:28:50 -0700
> Subject: hive.root.logger influencing query plan?? so it's not so
> Hi guys,
> I've banged my head on this one all day and i need to surrender.  I have a
> query that hangs (never returns). However, when i turn on logging to DEBUG
> level it works.  I'm stumped.   I include here the query, the different
> query plans (with the only thing different being the log level) and a
> traceback on the query that hangs.
>
> * the query
>
> #!/bin/bash
>
> sql="
> explain SELECT *
> FROM   raw_logs.fact_www_access_datekey_by_hour
> WHERE  date_key = 20160828  -- partition key
> and lower(http_string) = 'foo'  -- always evaluates to false
> limit 1 -- if i remove limit 1 it works.  I expect
> an empty result set.
> ;
> "
>
> #hive -hiveconf hive.

hive.root.logger influencing query plan?? so it's not so

2016-08-30 Thread Stephen Sprague
Hi guys,
I've banged my head on this one all day and i need to surrender.  I have a
query that hangs (never returns). However, when i turn on logging to DEBUG
level it works.  I'm stumped.   I include here the query, the different
query plans (with the only thing different being the log level) and a
traceback on the query that hangs.

* the query

#!/bin/bash

sql="
explain SELECT *
FROM   raw_logs.fact_www_access_datekey_by_hour
WHERE  date_key = 20160828  -- partition key
and lower(http_string) = 'foo'  -- always evaluates to false
limit 1 -- if i remove limit 1 it works.  I expect
an empty result set.
;
"

#hive -hiveconf hive.root.logger=ERROR,console -e "$sql" >stdout.bad
2>stderr.bad  #hangs
hive -hiveconf hive.root.logger=DEBUG,console -e "$sql" >stdout.good
2>stderr.good #works



* explain plan for the one that hangs (ERROR,console)

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
Fetch Operator
  limit: 1
  Processor Tree:
TableScan
  alias: fact_www_access_datekey_by_hour
  Statistics: Num rows: 144355924 Data size: 158755813572 Basic
stats: COMPLETE Column stats: NONE
  Filter Operator
predicate: (lower(http_string) = 'foo') (type: boolean)
Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
Select Operator
  expressions: payload_year (type: int), payload_month (type:
int), payload_day (type: int), payload_time (type: string),
payload_gmt_offset (type: string), perf_time_micros (type: int),
http_string (type: string), http_type (type: string), http_version (type:
string), http_rc (type: int), http_size (type: int), referrer (type:
string), user_agent (type: string), bot_flag (type: string), ip_list (type:
array), scalar_tags (type: map), url_components
(type: array), user_id (type: string), lt_session_id (type:
string), session_id (type: string), log_name (type: string), log_lineno
(type: int), log_line (type: string), web_server (type: string),
location_info (type: map), dp_main_url_type (type: string),
dp_inventory_type (type: string), dp_main_url (type:
map), dp_tag_url (type: map),
dp_referrer_url (type: map), hour (type: int),
20160828 (type: int)
  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
_col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15,
_col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24,
_col25, _col26, _col27, _col28, _col29, _col30, _col31
  Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
  Limit
Number of rows: 1
Statistics: Num rows: 1 Data size: 1099 Basic stats:
COMPLETE Column stats: NONE
ListSink

* query plan for the one that works (DEBUG,console)

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Map Operator Tree:
  TableScan
alias: fact_www_access_datekey_by_hour
Statistics: Num rows: 144355924 Data size: 158755813572 Basic
stats: COMPLETE Column stats: NONE
Filter Operator
  predicate: (lower(http_string) = 'foo') (type: boolean)
  Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: payload_year (type: int), payload_month (type:
int), payload_day (type: int), payload_time (type: string),
payload_gmt_offset (type: string), perf_time_micros (type: int),
http_string (type: string), http_type (type: string), http_version (type:
string), http_rc (type: int), http_size (type: int), referrer (type:
string), user_agent (type: string), bot_flag (type: string), ip_list (type:
array), scalar_tags (type: map), url_components
(type: array), user_id (type: string), lt_session_id (type:
string), session_id (type: string), log_name (type: string), log_lineno
(type: int), log_line (type: string), web_server (type: string),
location_info (type: map), dp_main_url_type (type: string),
dp_inventory_type (type: string), dp_main_url (type:
map), dp_tag_url (type: map),
dp_referrer_url (type: map), hour (type: int),
20160828 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4,
_col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14,
_col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23,
_col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31
Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
Limit
  Number of rows: 1
  Statistics: Num rows: 1 Data 

Re: hive 2.1.0 + drop view

2016-08-26 Thread Stephen Sprague
just to cap this discussion...  thank you Ashutosh for that link that was
very helpful.

I did the following based on my reading of it.


1. added the following to hive-site.xml


<property>
  <name>datanucleus.rdbms.initializeColumnInfo</name>
  <value>NONE</value>
</property>



this allows one to create and drop views; however, it does not allow you
to drop views created previously without that setting.

so...

2. did a show create table on all the views and saved to file.


3. surgically went into the hive metastore and deleted the views from table
"TBLS" (but first had to delete from "TABLE_PARAMS" and "TBL_PRIVS" due to
ref constraints - a rough sketch of the SQL appears below)


4. recreated the views as best as possible, but given that some views are
dependent on other views, multiple passes are needed


That was my workaround anyway.
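
for step 3, a sketch of the kind of SQL involved (postgres, using hive's
quoted identifiers; back up the metastore before trying anything like it,
and note your schema version may have other referencing tables):

   delete from "TBL_PRIVS"    where "TBL_ID" in
      (select "TBL_ID" from "TBLS" where "TBL_TYPE" = 'VIRTUAL_VIEW');
   delete from "TABLE_PARAMS" where "TBL_ID" in
      (select "TBL_ID" from "TBLS" where "TBL_TYPE" = 'VIRTUAL_VIEW');
   delete from "TBLS"         where "TBL_TYPE" = 'VIRTUAL_VIEW';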


Cheers,
Stephen
PS. altering the table to 'varchar' did nothing on postgres - that's just a
synonym for 'character varying'

On Fri, Aug 26, 2016 at 1:40 PM, Ashutosh Chauhan <hashut...@apache.org>
wrote:

> Its a bug in DataNucleus. See discussion on : https://issues.apache.org/
> jira/browse/HIVE-14322
>
> On Fri, Aug 26, 2016 at 1:34 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Actually I don't understand why they have defined TBL_NAME and TBL_TYPE
>> as NVARCHAR (this is from Sybase similar to yours)
>>
>> [image: Inline images 1]
>>
>> Oracle seems to be correct.
>>
>> And if we look further
>>
>> Use the fixed-length datatype, *nchar(n)* , and the variable-length
>> datatype, *nvarchar(n)*, for both single-byte and multibyte character
>> sets, such as Japanese. The difference between *nchar(n)* and *char(n)*
>> and *nvarchar(n)* and *varchar(n)* is that both *nchar(n)* and
>> *nvarchar(n)* allocate storage based on *n* times the number of bytes
>> per character (based on the default character set). *char(n)* and
>> *varchar(n)* allocate *n* bytes of storage.
>>
>> What character set are you using for your server/database?
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 26 August 2016 at 21:03, Stephen Sprague <sprag...@gmail.com> wrote:
>>
>>> thanks.  what i gotta try is altering the table and changing "character
>>> varying(767)" to "varchar(767)" - I think.
>>>
>>> On Fri, Aug 26, 2016 at 12:59 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> You don't really want to mess around with the schema.
>>>>
>>>> This is what I have in Oracle 12c schema for TBLS. The same as yours
>>>>
>>>>
>>>> [image: Inline images 1]
>>>>
>>>> But this is Oracle, a serious database :)
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 26 August 2016 at 20:32, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>
>>>>> yeah... so after the hive upgrade scripts ran we have this in pg for
>>>>> table "TABLS"
>>>>>
>>>>> {quote}
>>>>> dwr_prod_2_1_0=> \d "TBLS"
>>>>>   Table "pub

Re: hive 2.1.0 + drop view

2016-08-26 Thread Stephen Sprague
thanks.  what i gotta try is altering the table and changing "character
varying(767)" to "varchar(767)" - I think.

On Fri, Aug 26, 2016 at 12:59 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> You don't really want to mess around with the schema.
>
> This is what I have in Oracle 12c schema for TBLS. The same as yours
>
>
> [image: Inline images 1]
>
> But this is Oracle, a serious database :)
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 August 2016 at 20:32, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> yeah... so after the hive upgrade scripts ran we have this in pg for
>> table "TABLS"
>>
>> {quote}
>> dwr_prod_2_1_0=> \d "TBLS"
>>   Table "public.TBLS"
>>Column   |  Type  |Modifiers
>> ++--
>> ---
>>  TBL_ID | bigint | not null
>>  CREATE_TIME| bigint | not null
>>  DB_ID  | bigint |
>>  LAST_ACCESS_TIME   | bigint | not null
>>  OWNER  | character varying(767) | default NULL::character
>> varying
>>  RETENTION  | bigint | not null
>>  SD_ID  | bigint |
>>  TBL_NAME   | character varying(128) | default NULL::character
>> varying
>>  TBL_TYPE   | character varying(128) | default NULL::character
>> varying
>>  VIEW_EXPANDED_TEXT | text   |
>>  VIEW_ORIGINAL_TEXT | text   |
>>
>> {quote}
>>
>> wonder if i can perform some surgery here. :o  do i feel lucky?
>>
>> On Fri, Aug 26, 2016 at 12:28 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> well that doesn't bode well. :(
>>>
>>> we definitely need to use a remote metastore given this is a prod env
>>> with 100's of users.  i wasn't able to see anything in the metastore log
>>> though so i'm  wondering what logger to run to get that?  don't think its
>>> hive.root.logger.
>>>
>>> thanks,
>>> Stephen.
>>> just toggling hive.metastore.try.direct.sql between true or false which
>>> seemed like it should influence the metastore access behaviour did not
>>> change anything.  I guess this is a postgres incompatiblity with jdbc4
>>> (this "character varying" thing.)
>>>
>>> On Fri, Aug 26, 2016 at 8:55 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Sounds like there are a number of issues with Hive metastore on
>>>> Postgres. There have been a number of reports on this.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 26 August 2016 at 16:43, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>
>>>>> thanks Gopal.  you're right our metastore is using Postgres. very
>>>>> interesting you were able to intuit that!
>>>>>
>>>>> lemme give your suggestions a try

Re: hive 2.1.0 + drop view

2016-08-26 Thread Stephen Sprague
yeah... so after the hive upgrade scripts ran we have this in pg for table
"TABLS"

{quote}
dwr_prod_2_1_0=> \d "TBLS"
  Table "public.TBLS"
   Column   |  Type  |Modifiers
--------------------+------------------------+----------------------------------
 TBL_ID | bigint | not null
 CREATE_TIME| bigint | not null
 DB_ID  | bigint |
 LAST_ACCESS_TIME   | bigint | not null
 OWNER  | character varying(767) | default NULL::character
varying
 RETENTION  | bigint | not null
 SD_ID  | bigint |
 TBL_NAME   | character varying(128) | default NULL::character
varying
 TBL_TYPE   | character varying(128) | default NULL::character
varying
 VIEW_EXPANDED_TEXT | text   |
 VIEW_ORIGINAL_TEXT | text   |

{quote}

wonder if i can perform some surgery here. :o  do i feel lucky?

On Fri, Aug 26, 2016 at 12:28 PM, Stephen Sprague <sprag...@gmail.com>
wrote:

> well that doesn't bode well. :(
>
> we definitely need to use a remote metastore given this is a prod env
> with 100's of users.  i wasn't able to see anything in the metastore log
> though so i'm  wondering what logger to run to get that?  don't think its
> hive.root.logger.
>
> thanks,
> Stephen.
> just toggling hive.metastore.try.direct.sql between true or false which
> seemed like it should influence the metastore access behaviour did not
> change anything.  I guess this is a postgres incompatiblity with jdbc4
> (this "character varying" thing.)
>
> On Fri, Aug 26, 2016 at 8:55 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Sounds like there are a number of issues with Hive metastore on Postgres.
>> There have been a number of reports on this.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 26 August 2016 at 16:43, Stephen Sprague <sprag...@gmail.com> wrote:
>>
>>> thanks Gopal.  you're right our metastore is using Postgres. very
>>> interesting you were able to intuit that!
>>>
>>> lemme give your suggestions a try and i'll post back.
>>>
>>> thanks!
>>> Stephen
>>>
>>> On Fri, Aug 26, 2016 at 8:32 AM, Gopal Vijayaraghavan <gop...@apache.org
>>> > wrote:
>>>
>>>> > NULL::character%20varying)
>>>> ...
>>>> > i want to say this is somehow related to a java version (we're using
>>>> 8)
>>>> >but i'm not sure.
>>>>
>>>> The "character varying" looks like a lot like a Postgres issue to me
>>>> (though character varying could be the real term for varchar in another
>>>> DB).
>>>>
>>>> The hive-metastore.log should have the real backtrace.
>>>>
>>>> You can try doing
>>>>
>>>> set hive.metastore.uris=;
>>>> set hive.metastore.try.direct.sql=false;
>>>>
>>>>
>>>> (i.e switch to embedded metastore + disable direct sql, in Hive CLI -
>>>> provided you have all the password stuff for the metastore in the
>>>> regular
>>>> hive-site.xml)
>>>>
>>>> https://github.com/apache/hive/blob/master/metastore/src/jav
>>>> a/org/apache/ha
>>>> doop/hive/metastore/MetaStoreDirectSql.java#L887
>>>> <https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L887>
>>>>
>>>>
>>>> Cheers,
>>>> Gopal
>>>>
>>>>
>>>>
>>>
>>
>


Re: hive 2.1.0 + drop view

2016-08-26 Thread Stephen Sprague
well that doesn't bode well. :(

we definitely need to use a remote metastore given this is a prod env with
100's of users.  i wasn't able to see anything in the metastore log though,
so i'm wondering what logger to turn on to get that?  i don't think it's
hive.root.logger.

thanks,
Stephen.
just toggling hive.metastore.try.direct.sql between true and false, which
seemed like it should influence the metastore access behaviour, did not
change anything.  I guess this is a postgres incompatibility with jdbc4
(this "character varying" thing.)

On Fri, Aug 26, 2016 at 8:55 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Sounds like there are a number of issues with Hive metastore on Postgres.
> There have been a number of reports on this.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 August 2016 at 16:43, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> thanks Gopal.  you're right our metastore is using Postgres. very
>> interesting you were able to intuit that!
>>
>> lemme give your suggestions a try and i'll post back.
>>
>> thanks!
>> Stephen
>>
>> On Fri, Aug 26, 2016 at 8:32 AM, Gopal Vijayaraghavan <gop...@apache.org>
>> wrote:
>>
>>> > NULL::character%20varying)
>>> ...
>>> > i want to say this is somehow related to a java version (we're using 8)
>>> >but i'm not sure.
>>>
>>> The "character varying" looks like a lot like a Postgres issue to me
>>> (though character varying could be the real term for varchar in another
>>> DB).
>>>
>>> The hive-metastore.log should have the real backtrace.
>>>
>>> You can try doing
>>>
>>> set hive.metastore.uris=;
>>> set hive.metastore.try.direct.sql=false;
>>>
>>>
>>> (i.e switch to embedded metastore + disable direct sql, in Hive CLI -
>>> provided you have all the password stuff for the metastore in the regular
>>> hive-site.xml)
>>>
>>> https://github.com/apache/hive/blob/master/metastore/src/jav
>>> a/org/apache/ha
>>> doop/hive/metastore/MetaStoreDirectSql.java#L887
>>> <https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L887>
>>>
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>
>


Re: hive 2.1.0 + drop view

2016-08-26 Thread Stephen Sprague
thanks Gopal.  you're right our metastore is using Postgres. very
interesting you were able to intuit that!

lemme give your suggestions a try and i'll post back.

thanks!
Stephen

On Fri, Aug 26, 2016 at 8:32 AM, Gopal Vijayaraghavan 
wrote:

> > NULL::character%20varying)
> ...
> > i want to say this is somehow related to a java version (we're using 8)
> >but i'm not sure.
>
> The "character varying" looks like a lot like a Postgres issue to me
> (though character varying could be the real term for varchar in another
> DB).
>
> The hive-metastore.log should have the real backtrace.
>
> You can try doing
>
> set hive.metastore.uris=;
> set hive.metastore.try.direct.sql=false;
>
>
> (i.e switch to embedded metastore + disable direct sql, in Hive CLI -
> provided you have all the password stuff for the metastore in the regular
> hive-site.xml)
>
> https://github.com/apache/hive/blob/master/metastore/
> src/java/org/apache/ha
> doop/hive/metastore/MetaStoreDirectSql.java#L887
>
>
> Cheers,
> Gopal
>
>
>


hive 2.1.0 + drop view

2016-08-26 Thread Stephen Sprague
hey guys,
this one's a little more strange.

hive> create view foo_vw as select * from foo;
OK
Time taken: 0.376 seconds

hive> drop view foo_vw;
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:java.lang.IllegalArgumentException:
java.net.URISyntaxException: Relative path in absolute URI:
NULL::character%20varying)

i want to say this is somehow related to a java version (we're using 8) but
i'm not sure.

any insights on how to go from here?

cheers,
Stephen

PS setting hive.root.logger=DEBUG,console here's the traceback.

Caused by: MetaException(message:java.lang.IllegalArgumentException:
java.net.URISyntaxException: Relative path in absolute URI:
NULL::character%20varying)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$drop_table_with_environment_context_result$drop_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:47552)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$drop_table_with_environment_context_result$drop_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:47529)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$drop_table_with_environment_context_result.read(ThriftHiveMetastore.java:47471)
at
org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_drop_table_with_environment_context(ThriftHiveMetastore.java:1358)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.drop_table_with_environment_context(ThriftHiveMetastore.java:1342)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.drop_table_with_environment_context(HiveMetaStoreClient.java:2329)
at
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.drop_table_with_environment_context(SessionHiveMetaStoreClient.java:114)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:1074)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:1010)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154)
at com.sun.proxy.$Proxy21.dropTable(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2254)
at com.sun.proxy.$Proxy21.dropTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:1172)


Re: hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

2016-08-25 Thread Stephen Sprague
awesome!  i'm not crazy after all!

so the workaround we devised here for this pattern:

   partition_key not in ('foo')

is:

   not array_contains(array('foo'), partition_key)
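
for example, in context (table name invented, etl_source_database being the
partition key from the original query):

   select count(*)
   from   some_fact_table
   where  not array_contains(array('foo'), etl_source_database);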

thanks,
Stephen.

On Thu, Aug 25, 2016 at 6:31 PM, Sergey Shelukhin <ser...@hortonworks.com>
wrote:

> I can repro this on master. I’ll file a bug...
>
> From: Stephen Sprague <sprag...@gmail.com>
> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
> Date: Thursday, August 25, 2016 at 13:34
> To: "user@hive.apache.org" <user@hive.apache.org>
> Subject: Re: hive 2.1.0 and "NOT IN ( list )" and column is a
> partition_key
>
> Hi Gopal,
> Thank you for this insight.  good stuff.   The thing is there is no 'foo'
> for etl_database_source so that filter if anything should be
> short-circuited to 'true'.  ie. double nots.   1. not in  2. and foo not
> present.
>
> it doesn't matter what what i put in that "not in" clause the filter
> always comes back false if the column is a partition_key of course.
>
> thanks for the tip on explain extended that's some crazy output so i'm
> sifting for clues in that now.   i hear you though - something in there
> with the metastore is at play.
>
> Cheers,
> Stephen.
>
> On Thu, Aug 25, 2016 at 1:12 PM, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>>
>> > anybody run up against this one?  hive 2.1.0 + using a  "not in" on a
>> >list + the column is a partition key participant.
>>
>> The partition filters are run before the plan is generated.
>>
>> >AND etl_source_database not in ('foo')
>>
>> Is there a 'foo' in etl_source_database?
>>
>> > predicate: false (type: boolean)   this kills any hope
>> >of the query returning anything.
>> ...
>> >  Select Operator###doesn't even mention a filter
>>
>> This is probably good news, because that's an optimization.
>>
>> PrunedPartitionList getPartitionsFromServer(Table tab, final
>> ExprNodeGenericFuncDesc compactExpr ...) {
>> ...
>>   hasUnknownPartitions = Hive.get().getPartitionsByExpr(
>>   tab, compactExpr, conf, partitions);
>> }
>>
>>
>> goes into the metastore and evaluates the IN and NOT IN for partitions
>> ahead of time.
>>
>>
>> So, this could mean that the partition pruning evaluation returned no
>> partitions at all (or just exactly matched partitions only, skipping the
>> filter per-row).
>>
>> In 2.x, you might notice it does a bit fancier things there as well, like
>>
>> select count(1) from table where year*1 + month*100 + day >= 20160101;
>>
>> https://github.com/apache/hive/blob/master/ql/src/java/org/
>> apache/hadoop/hi
>> ve/ql/optimizer/ppr/PartitionPruner.java#L468
>> <https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionPruner.java#L468>
>>
>>
>> You can try "explain extended" and see which partitions are selected (&
>> validate that the filter removed was applied already).
>>
>> Cheers,
>> Gopal
>>
>>
>>
>>
>>
>


Re: hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

2016-08-25 Thread Stephen Sprague
Hi Gopal,
Thank you for this insight.  good stuff.   The thing is there is no 'foo'
for etl_source_database, so that filter, if anything, should be
short-circuited to 'true'.  i.e. double nots: 1. not in  2. and 'foo' not
present.

it doesn't matter what i put in that "not in" clause, the filter always
comes back false if the column is a partition_key of course.

thanks for the tip on explain extended - that's some crazy output so i'm
sifting for clues in it now.   i hear you though - something in there
with the metastore is at play.

Cheers,
Stephen.

On Thu, Aug 25, 2016 at 1:12 PM, Gopal Vijayaraghavan 
wrote:

>
> > anybody run up against this one?  hive 2.1.0 + using a  "not in" on a
> >list + the column is a partition key participant.
>
> The partition filters are run before the plan is generated.
>
> >AND etl_source_database not in ('foo')
>
> Is there a 'foo' in etl_source_database?
>
> > predicate: false (type: boolean)   this kills any hope
> >of the query returning anything.
> ...
> >  Select Operator###doesn't even mention a filter
>
> This is probably good news, because that's an optimization.
>
> PrunedPartitionList getPartitionsFromServer(Table tab, final
> ExprNodeGenericFuncDesc compactExpr ...) {
> ...
>   hasUnknownPartitions = Hive.get().getPartitionsByExpr(
>   tab, compactExpr, conf, partitions);
> }
>
>
> goes into the metastore and evaluates the IN and NOT IN for partitions
> ahead of time.
>
>
> So, this could mean that the partition pruning evaluation returned no
> partitions at all (or just exactly matched partitions only, skipping the
> filter per-row).
>
> In 2.x, you might notice it does a bit fancier things there as well, like
>
> select count(1) from table where year*1 + month*100 + day >= 20160101;
>
> https://github.com/apache/hive/blob/master/ql/src/java/
> org/apache/hadoop/hi
> ve/ql/optimizer/ppr/PartitionPruner.java#L468
>
>
> You can try "explain extended" and see which partitions are selected (&
> validate that the filter removed was applied already).
>
> Cheers,
> Gopal
>
>
>
>
>


Re: hive throws ConcurrentModificationException when executing insert overwrite table

2016-08-17 Thread Stephen Sprague
indeed +1 to Gopal on that explanation! That was huge.

On Wed, Aug 17, 2016 at 12:58 AM, 明浩 冯  wrote:

> Hi Gopal,
>
>
> It works when I disabled the dfs.namenode.acls.
>
> For the data loss, it doesn't affect me too much currently. But I will
> track the issue in Kylin.
>
> Thank you very much for your detailed explain and solution.  You saved me!
>
>
> Best Regards,
>
> Minghao Feng
> --
> *From:* Gopal Vijayaraghavan  on behalf of Gopal
> Vijayaraghavan 
> *Sent:* Wednesday, August 17, 2016 1:18:54 PM
> *To:* user@hive.apache.org
> *Subject:* Re: hive throws ConcurrentModificationException when executing
> insert overwrite table
>
>
> > Yes, Kylin generated the query. I'm using Kylin 1.5.3.
>
> I would report a bug to Kylin about DISTRIBUTE BY RAND().
>
> This is what happens when a node which ran a Map task fails and the whole
> task is retried.
>
> Assume that the first attempt of the Map task0 wrote value1 into
> reducer-99, because RAND() returned 99.
>
> Now the task succeeds and then reducer starts, running reducer-0
> successfully, which write _0.
>
> But before reducer-99 runs, the node which ran Map task0 crashes.
>
> So, the engine re-runs Map task0 on another node. Except because RAND() is
> completely random, it may give 0 as the output of RAND() for "value1".
>
> The reducer-0 output from Map task0 now has "value1", except there's no
> task which will ever read that out or write that out.
>
> In short, the output of the table will not contain "value1", despite the
> input and the shuffle outputs containing "value1".
>
> I would replace the DISTRIBUTE BY RAND() with SORT BY 0, for a random
> distribution without data loss.
>
> > But I still not sure how can I fix the problem. I'm a beginner of Hive
> >and Kylin, Can the problem be fixed by just change the hive or kylin
> >settings?
>
> If you're just experimenting with Kylin right now, I recommend just
> disabling the ACL settings in HDFS (this is not permissions btw, ACLs are
> permissions++).
>
> Set dfs.namenode.acls.enabled=false in core-site.xml and wherever else in
> your /etc/hadoop/conf it shows up and you should be good to avoid the race
> condition.
>
> Cheers,
> Gopal
>
>
>
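
for anyone else who lands on this thread, the rewrite Gopal describes is
basically the following (table and column names invented):

   -- what the generated query effectively does today; a retried map task
   -- can re-roll rand() and silently lose rows:
   insert overwrite table kylin_intermediate_foo
   select id, name from src_foo
   distribute by rand();

   -- Gopal's suggested replacement (deterministic per row):
   insert overwrite table kylin_intermediate_foo
   select id, name from src_foo
   sort by 0;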


Re: JsonSerDe and mapping tweet's user structure error

2016-08-16 Thread Stephen Sprague
stackoverflow is your friend.

that said have a peek at the doc even :)  cf.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Keywords,Non-reservedKeywordsandReservedKeywords
paying close attention to this paragraph:

{quote}
Reserved keywords are permitted as identifiers if you quote them as
described in Supporting Quoted Identifiers in Column Names (version 0.13.0
and later, see HIVE-6013). Most of the keywords are reserved through
HIVE-6617 in order to reduce the ambiguity in grammar (version 1.2.0 and
later). There are two ways if the user still would like to use those
reserved keywords as identifiers: (1) use quoted identifiers, (2) set
hive.support.sql11.reserved.keywords=false.

{quote}
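
so concretely, option (1) for your DDL would look something like the below
(just a sketch from memory, with only the relevant columns kept):

  create external table tweets (
    id bigint,
    `user` struct<id:bigint, name:string, screen_name:string>,
    text string
  )
  partitioned by (mmdd int)
  row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
  location 'my location';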



On Tue, Aug 16, 2016 at 3:23 PM, Igor Kravzov 
wrote:

> Hi,
>
> I am storing tweets in HDFS and creating external table on top of these
> tweets..
> The command I am using is bellow:
>
> *create external table tweets (*
> *  id bigint,*
> *  in_reply_to_screen_name string,*
> *  in_reply_to_status_id bigint,*
> *  user struct<*
> *  id: bigint, *
> *  name: string,*
> *  screen_name: string>,*
> *  retweeted_status struct <*
> *  id:bigint>,*
> *   text string*
> * )*
> *partitioned by (mmdd int)*
> *ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'*
> *location 'my location';*
>
> But I am getting the following error:
>
> org.apache.ambari.view.hive.client.HiveInvalidQueryException: Error while 
> compiling statement: FAILED: ParseException line 5:2 Failed to recognize 
> predicate 'user'. Failed rule: 'identifier' in column specification 
> [ERROR_STATUS]
>
> Without "user" definition"  query runs fine.
>
> Can it be because 'user' is reserved keyword? Or there is something else?
> "user" as part of tweet structure. How can I name column differently?
>
> Thank in advance.
>


Re: hiver errors

2016-08-10 Thread Stephen Sprague
this error message says everything you need to know:

>Likely cause: new client talking to old server. Continuing without it.

when you upgrade hive you also need to upgrade the metastore schema.
failing to do that can trigger the message you're getting.

On Wed, Aug 10, 2016 at 6:41 AM, Mich Talebzadeh 
wrote:

> which version of hive it it?
>
> can you log in to hive via hive cli
>
>
>
> *$HIVE_HOME/bin/hive*
> Logging initialized using configuration in file:/usr/lib/hive/conf/hive-
> log4j2.properties
> hive>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 10 August 2016 at 14:32, johny casanova 
> wrote:
>
>> helo,
>>
>>
>> I just installed hive and getting this errors:
>>
>> in the Hive metastore server:
>>
>>
>>
>> Metastore thrift server threw an exception
>>
>> metaexception(message:Version information not found in metastore.)
>>
>>
>>
>>
>>
>>
>>
>> then on the hiveserver2
>>
>>
>>
>>
>>
>> WARN hive.metastore: set_ugi() not successful, Likely cause: new client
>> talking to old server. Continuing without it.
>> org.apache.thrift.transport.TTransportException: java.net
>> .SocketTimeoutException: Read timed out
>>
>> I'm using CDH 5.7.0
>>
>
>


Re: beeline/hiveserver2 + logging

2016-08-10 Thread Stephen Sprague
Hi Gopal,
Aha!  thank you for the background behind this.  that makes things much more
understandable.

and ~3000 queries across 10 HS2 servers. sweet. now that's what i call
pushing the edge. I like it!

Thanks again,
Stephen.

On Tue, Aug 9, 2016 at 10:29 PM, Gopal Vijayaraghavan 
wrote:

> > not get the progress messages back until the query finishes which
> >somewhat defeats the purpose of interactive usage.
>
> That happens entirely on the client side btw.
>
> So to avoid a hard sleep() + check loop causing pointless HTTP traffic,
> HiveServer2 now does a long poll on the server side.
>
> hive.server2.long.polling.timeout", "5000ms"
>
>
> This means that it is edge-triggered to return whenever the query finishes
> instead of adding extra time when the results are ready but beeline
> doesn't know about.
>
>
> However, the get_logs() synchronizes on the same HiveStatement and is
> mutexed out by the long poll for getting results.
>
> You can escape this on a low-concurrency cluster by changing the
> long.polling.timeout to 0.5s instead of 5s & restarting HS2.
>
> However as the total # of concurrent queries goes up, the current setting
> does very well due to the reduction in total # of "Nope, come back" http
> noise (largest parallel workload I've seen is about ~3000 queries on 10
> HS2 nodes load-balanced).
>
> Cheers,
> Gopal
>
>
>


Re: beeline/hiveserver2 + logging

2016-08-09 Thread Stephen Sprague
well, well. i just found this:
https://issues.apache.org/jira/browse/HIVE-14183   seems something changed
between 1.2.1 and 2.1.0.

i'll see if the Rx as prescribed in that ticket does indeed work for me.

Thanks,
Stephen.

On Tue, Aug 9, 2016 at 5:12 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> hey guys,
> try as i might i cannot seem to get beeline (via jdbc) to log information
> back from hiveserver2 like job_id, progress and that kind of information
> (similiar to what the local beeline or hive clients do.)
>
> i see this ticket that is closed: https://issues.apache.org/
> jira/browse/HIVE-7615 which leads me to believe things should work.
>
> info:
>* running hive v2.1.0
>* i have enabled all the hive.hiveserver2.logging settings
>
>
> do other people have this working? whats' the secret?
>
> Thanks,
> stephen.
>
> 1. beeline -u jdbc:hive2://  #local beeline-cli) reports progress
> 2. hive --service cli  #local hive-cli reports progress
>
> 3. beeline -u jdbc:hive2://dwrdevnn1:10001/default;auth=noSasl  #remote
> beeline client *does not* report progress
>
> i have Hiveserver2 running in binary mode (not http) not sure that makes
> any difference.
>
>
>


beeline/hiveserver2 + logging

2016-08-09 Thread Stephen Sprague
hey guys,
try as i might i cannot seem to get beeline (via jdbc) to log information
back from hiveserver2 like job_id, progress and that kind of information
(similar to what the local beeline or hive clients do.)

i see this ticket that is closed:
https://issues.apache.org/jira/browse/HIVE-7615 which leads me to believe
things should work.

info:
   * running hive v2.1.0
   * i have enabled all the hive.hiveserver2.logging settings
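
     (by "all the settings" i mean roughly the two below - names from memory,
     shown as session-style set commands for illustration; in practice they
     live in hive-site.xml on the HS2 host)

     set hive.server2.logging.operation.enabled=true;
     set hive.server2.logging.operation.level=EXECUTION;  -- or VERBOSE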


do other people have this working? what's the secret?

Thanks,
stephen.

1. beeline -u jdbc:hive2://  #local beeline-cli reports progress
2. hive --service cli  #local hive-cli reports progress

3. beeline -u jdbc:hive2://dwrdevnn1:10001/default;auth=noSasl  #remote
beeline client *does not* report progress

i have Hiveserver2 running in binary mode (not http) not sure that makes
any difference.


Re: msck repair table and hive v2.1.0

2016-07-14 Thread Stephen Sprague
oach by setting
> "hive.mv.files.thread=0"?  You mentioned that your tables tables are in s3,
> but the external table created was pointing to HDFS. Was that intentional?
>
> ~Rajesh.B
>
> On Fri, Jul 15, 2016 at 6:58 AM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> in the meantime given my tables are in s3 i've written a utility to do a
>> 'aws s3 ls' on the bucket and folder in question, change the folder syntax
>> to partition syntax and then issued my own 'alter table ... add partition'
>> for each partition.
>>
>>
>> so essentially it does what msck repair tables does but in a non-portable
>> way.  oh well.  gotta do what ya gotta do.
>>
>> On Wed, Jul 13, 2016 at 9:29 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> hey guys,
>>> i'm using hive version 2.1.0 and i can't seem to get msck repair table
>>> to work.  no matter what i try i get the 'ol NPE.  I've set the log level
>>> to 'DEBUG' but yet i still am not seeing any smoking gun.
>>>
>>> would anyone here have any pointers or suggestions to figure out what's
>>> going wrong?
>>>
>>> thanks,
>>> Stephen.
>>>
>>>
>>>
>>> hive> create external table foo (a int) partitioned by (date_key bigint)
>>> location 'hdfs:/tmp/foo';
>>> OK
>>> Time taken: 3.359 seconds
>>>
>>> hive> msck repair table foo;
>>> FAILED: Execution Error, return code 1 from
>>> org.apache.hadoop.hive.ql.exec.DDLTask
>>>
>>>
>>> from the log...
>>>
>>> 2016-07-14T04:08:02,431 DEBUG [MSCK-GetPaths-1]:
>>> httpclient.RestStorageService (:()) - Found 13 objects in one batch
>>> 2016-07-14T04:08:02,431 DEBUG [MSCK-GetPaths-1]:
>>> httpclient.RestStorageService (:()) - Found 0 common prefixes in one batch
>>> 2016-07-14T04:08:02,433 ERROR [main]: metadata.HiveMetaStoreChecker
>>> (:()) - java.lang.NullPointerException
>>> 2016-07-14T04:08:02,434 WARN  [main]: exec.DDLTask (:()) - Failed to run
>>> metacheck:
>>> org.apache.hadoop.hive.ql.metadata.HiveException:
>>> java.lang.NullPointerException
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:444)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:448)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:388)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.findUnknownPartitions(HiveMetaStoreChecker.java:309)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:285)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:230)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkMetastore(HiveMetaStoreChecker.java:109)
>>> at org.apache.hadoop.hive.ql.exec.DDLTask.msck(DDLTask.java:1814)
>>> at
>>> org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:403)
>>> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
>>> at
>>> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>>> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1858)
>>> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1562)
>>> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1313)
>>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1084)
>>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1072)
>>> at
>>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
>>> at
>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
>>> at
>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
>>> at
>>> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
>>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
>>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>
>>
>>
>
>
> --
> ~Rajesh.B
>


Re: msck repair table and hive v2.1.0

2016-07-14 Thread Stephen Sprague
in the meantime, given my tables are in s3, i've written a utility to do an
'aws s3 ls' on the bucket and folder in question, change the folder syntax
to partition syntax and then issue my own 'alter table ... add partition'
for each partition.


so essentially it does what msck repair table does, but in a non-portable
way.  oh well.  gotta do what ya gotta do.
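
the statements it ends up emitting look roughly like this (bucket and paths
invented):

   alter table fact_foo add if not exists
     partition (date_key=20160713)
     location 's3n://my-bucket/warehouse/fact_foo/date_key=20160713';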

On Wed, Jul 13, 2016 at 9:29 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> hey guys,
> i'm using hive version 2.1.0 and i can't seem to get msck repair table to
> work.  no matter what i try i get the 'ol NPE.  I've set the log level to
> 'DEBUG' but yet i still am not seeing any smoking gun.
>
> would anyone here have any pointers or suggestions to figure out what's
> going wrong?
>
> thanks,
> Stephen.
>
>
>
> hive> create external table foo (a int) partitioned by (date_key bigint)
> location 'hdfs:/tmp/foo';
> OK
> Time taken: 3.359 seconds
>
> hive> msck repair table foo;
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask
>
>
> from the log...
>
> 2016-07-14T04:08:02,431 DEBUG [MSCK-GetPaths-1]:
> httpclient.RestStorageService (:()) - Found 13 objects in one batch
> 2016-07-14T04:08:02,431 DEBUG [MSCK-GetPaths-1]:
> httpclient.RestStorageService (:()) - Found 0 common prefixes in one batch
> 2016-07-14T04:08:02,433 ERROR [main]: metadata.HiveMetaStoreChecker (:())
> - java.lang.NullPointerException
> 2016-07-14T04:08:02,434 WARN  [main]: exec.DDLTask (:()) - Failed to run
> metacheck:
> org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.NullPointerException
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:444)
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:448)
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:388)
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.findUnknownPartitions(HiveMetaStoreChecker.java:309)
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:285)
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:230)
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkMetastore(HiveMetaStoreChecker.java:109)
> at org.apache.hadoop.hive.ql.exec.DDLTask.msck(DDLTask.java:1814)
> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:403)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1858)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1562)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1313)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1084)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1072)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
> at
> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
> at
> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>


msck repair table and hive v2.1.0

2016-07-13 Thread Stephen Sprague
hey guys,
i'm using hive version 2.1.0 and i can't seem to get msck repair table to
work.  no matter what i try i get the 'ol NPE.  I've set the log level to
'DEBUG' but yet i still am not seeing any smoking gun.

would anyone here have any pointers or suggestions to figure out what's
going wrong?

thanks,
Stephen.



hive> create external table foo (a int) partitioned by (date_key bigint)
location 'hdfs:/tmp/foo';
OK
Time taken: 3.359 seconds

hive> msck repair table foo;
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask


from the log...

2016-07-14T04:08:02,431 DEBUG [MSCK-GetPaths-1]:
httpclient.RestStorageService (:()) - Found 13 objects in one batch
2016-07-14T04:08:02,431 DEBUG [MSCK-GetPaths-1]:
httpclient.RestStorageService (:()) - Found 0 common prefixes in one batch
2016-07-14T04:08:02,433 ERROR [main]: metadata.HiveMetaStoreChecker (:()) -
java.lang.NullPointerException
2016-07-14T04:08:02,434 WARN  [main]: exec.DDLTask (:()) - Failed to run
metacheck:
org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.NullPointerException
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:444)
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:448)
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:388)
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.findUnknownPartitions(HiveMetaStoreChecker.java:309)
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:285)
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:230)
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkMetastore(HiveMetaStoreChecker.java:109)
at org.apache.hadoop.hive.ql.exec.DDLTask.msck(DDLTask.java:1814)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:403)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1858)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1562)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1313)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1084)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1072)
at
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)


Tez issues with beeline via HS2

2016-02-17 Thread Stephen Sprague
Hi guys,
it was suggested i post to the user@hive group rather than the user@tez
group for this one.  Here's my issue. My query hangs when using beeline via
HS2 (but works with the local beeline client).  I'd like to overcome that.

This is my query:

   beeline -u 'jdbc:hive2://
dwrdevnn1.sv2.truila.com:10001/default;auth=noSasl sprague nopwd
org.apache.hive.jdbc.HiveDriver' 

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Stephen Sprague
i refuse to take anybody seriously who has a sig file longer than one line
and  that there is just plain repugnant.

On Wed, Feb 3, 2016 at 1:47 PM, Mich Talebzadeh  wrote:

> I just did some further tests joining a 5 million rows FACT tables with 2
> DIMENSION tables.
>
>
>
> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS
> TotalSales
>
> FROM sales s, times t, channels c
>
> WHERE s.time_id = t.time_id
>
> AND   s.channel_id = c.channel_id
>
> GROUP BY t.calendar_month_desc, c.channel_desc
>
> ;
>
>
>
>
>
> Hive on Spark crashes, Hive with MR finishes in 85 sec and Spark on Hive
> finishes in 267 sec. I am trying to understand this behaviour
>
>
>
> OK I changed the three below parameters as suggested by Jeff
>
>
>
> export SPARK_EXECUTOR_CORES=12 ##, Number of cores for the workers
> (Default: 1).
>
> export SPARK_EXECUTOR_MEMORY=5G ## , Memory per Worker (e.g. 1000M, 2G)
> (Default: 1G)
>
> export SPARK_DRIVER_MEMORY=2G ## , Memory for Master (e.g. 1000M, 2G)
> (Default: 512 Mb)
>
>
>
>
>
> *1)**Hive 1.2.1 on Spark 1.3.1*
>
> It fails. Never completes.
>
>
>
> ERROR : Status: Failed
>
> Error: Error while processing statement: FAILED: Execution Error, return
> code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
> (state=08S01,code=3)
>
>
>
> *2)**Hive 1.2.1 on MR engine Looks good and completes in 85 sec*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> SELECT t.calendar_month_desc,
> c.channel_desc, SUM(s.amount_sold) AS TotalSales
>
> 0: jdbc:hive2://rhes564:10010/default> FROM sales s, times t, channels c
>
> 0: jdbc:hive2://rhes564:10010/default> WHERE s.time_id = t.time_id
>
> 0: jdbc:hive2://rhes564:10010/default> AND   s.channel_id = c.channel_id
>
> 0: jdbc:hive2://rhes564:10010/default> GROUP BY t.calendar_month_desc,
> c.channel_desc
>
> 0: jdbc:hive2://rhes564:10010/default> ;
>
> INFO  : Execution completed successfully
>
> INFO  : MapredLocal task succeeded
>
> INFO  : Number of reduce tasks not specified. Estimated from input data
> size: 1
>
> INFO  : In order to change the average load for a reducer (in bytes):
>
> INFO  :   set hive.exec.reducers.bytes.per.reducer=
>
> INFO  : In order to limit the maximum number of reducers:
>
> INFO  :   set hive.exec.reducers.max=
>
> INFO  : In order to set a constant number of reducers:
>
> INFO  :   set mapreduce.job.reduces=
>
> WARN  : Hadoop command-line option parsing not performed. Implement the
> Tool interface and execute your application with ToolRunner to remedy this.
>
> INFO  : number of splits:1
>
> INFO  : Submitting tokens for job: job_1454534517374_0002
>
> INFO  : The url to track the job:
> http://rhes564:8088/proxy/application_1454534517374_0002/
>
> INFO  : Starting Job = job_1454534517374_0002, Tracking URL =
> http://rhes564:8088/proxy/application_1454534517374_0002/
>
> INFO  : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
> job_1454534517374_0002
>
> INFO  : Hadoop job information for Stage-3: number of mappers: 1; number
> of reducers: 1
>
> INFO  : 2016-02-03 21:25:17,769 Stage-3 map = 0%,  reduce = 0%
>
> INFO  : 2016-02-03 21:25:29,103 Stage-3 map = 2%,  reduce = 0%, Cumulative
> CPU 7.52 sec
>
> INFO  : 2016-02-03 21:25:32,205 Stage-3 map = 5%,  reduce = 0%, Cumulative
> CPU 10.19 sec
>
> INFO  : 2016-02-03 21:25:35,295 Stage-3 map = 7%,  reduce = 0%, Cumulative
> CPU 12.69 sec
>
> INFO  : 2016-02-03 21:25:38,392 Stage-3 map = 10%,  reduce = 0%,
> Cumulative CPU 15.2 sec
>
> INFO  : 2016-02-03 21:25:41,502 Stage-3 map = 13%,  reduce = 0%,
> Cumulative CPU 17.31 sec
>
> INFO  : 2016-02-03 21:25:44,600 Stage-3 map = 16%,  reduce = 0%,
> Cumulative CPU 21.55 sec
>
> INFO  : 2016-02-03 21:25:47,691 Stage-3 map = 20%,  reduce = 0%,
> Cumulative CPU 24.32 sec
>
> INFO  : 2016-02-03 21:25:50,786 Stage-3 map = 23%,  reduce = 0%,
> Cumulative CPU 26.3 sec
>
> INFO  : 2016-02-03 21:25:52,858 Stage-3 map = 27%,  reduce = 0%,
> Cumulative CPU 28.52 sec
>
> INFO  : 2016-02-03 21:25:55,948 Stage-3 map = 31%,  reduce = 0%,
> Cumulative CPU 30.65 sec
>
> INFO  : 2016-02-03 21:25:59,032 Stage-3 map = 35%,  reduce = 0%,
> Cumulative CPU 32.7 sec
>
> INFO  : 2016-02-03 21:26:02,120 Stage-3 map = 40%,  reduce = 0%,
> Cumulative CPU 34.69 sec
>
> INFO  : 2016-02-03 21:26:05,217 Stage-3 map = 43%,  reduce = 0%,
> Cumulative CPU 36.67 sec
>
> INFO  : 2016-02-03 21:26:08,310 Stage-3 map = 47%,  reduce = 0%,
> Cumulative CPU 38.78 sec
>
> INFO  : 2016-02-03 21:26:11,408 Stage-3 map = 52%,  reduce = 0%,
> Cumulative CPU 40.7 sec
>
> INFO  : 2016-02-03 21:26:14,512 Stage-3 map = 56%,  reduce = 0%,
> Cumulative CPU 42.69 sec
>
> INFO  : 2016-02-03 21:26:17,607 Stage-3 map = 60%,  reduce = 0%,
> Cumulative CPU 44.69 sec
>
> INFO  : 2016-02-03 21:26:20,722 Stage-3 map = 64%,  reduce = 0%,
> Cumulative CPU 46.83 sec
>
> INFO  : 2016-02-03 21:26:22,787 Stage-3 map = 100%,  reduce = 0%,
> Cumulative CPU 48.46 sec
>
> INFO  : 2016-02-03 21:26:29,030 Stage-3 map = 100%,  reduce = 

Re: Hive job name

2015-04-07 Thread Stephen Sprague
Gotta say i don't particularly like this either.   We run thousands of
queries a day, and by putting a comment before the query you could control
the job.name hive would use.   Now, like the OP stated, that's ignored -
every job is named hive-exec-0.13.1-cdh5.3.1.jar.   Sure, setting
mapreduce.job.name explicitly is a workaround but... that's a boatload of
code changes!

Wouldn't there be a fix to roll this back to how it derived the job.name
before?
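
(the per-query workaround itself is trivial - something like the below, with
a made-up job name - it's the thousands of scripts that would need touching
that hurts:)

   set mapreduce.job.name=dw_nightly_foo_rollup;
   insert overwrite table foo_daily
   select date_key, count(*) from foo group by date_key;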

Thanks,
Stephen Sprague

On Wed, Mar 11, 2015 at 1:38 PM, Viral Bajaria viral.baja...@gmail.com
wrote:

 I haven't used Tez but it's in my list of experimentation...

 Thanks,
 Viral

 On Tue, Mar 10, 2015 at 8:38 AM, P lva ruvi...@gmail.com wrote:

 Hey Viral,

 Is there a similar config for tez ?

 Thanks




 On Mon, Mar 9, 2015 at 6:36 PM, Viral Bajaria viral.baja...@gmail.com
 wrote:

 We use the hive.job.name property to set meaningful job names. Look
 into using that before submitting queries.

 Thanks,
 Viral


 On Mon, Mar 9, 2015 at 2:47 PM, Alex Bohr a...@gradientx.com wrote:

 Since going from Hive 0.11 to 0.13 the Job Names, in the JobTracker UI,
 for my Hive queries in the job tracker changed.

 It used to be the first 200 characters or so of the Hive query, now
 it's just hive-exec-0.13.1-cdh5.3.1.jar.

 Seeing the beginning of the query in the job Name was super helpful to
 figure out which queries are taking super long or using heavy # of slots.
 Now I need to drill down to the job config file and check the 
 hive.query.string
 key.

 Is there a config setting I can use to make the Hive Job name reflect
 the query being executed?

 I'm still using the JobTracker, haven't switched to Yarn yet.  Does
 this change with Yarn?

 Thanks







Re: bug in hive

2014-09-20 Thread Stephen Sprague
great policy. install open source software that's not even version 1.0 into
production and then not allow the ability to improve it (but of course reap
all the rewards of its benefits.)  so instead of actually fixing the
problem the right way introduce a super-hack work-around cuz, you know,
that's much more stable.

Gotta luv it.   Good luck.

On Sat, Sep 20, 2014 at 8:00 AM, Shushant Arora shushantaror...@gmail.com
wrote:

 Hi Alan

 I have 0.10 version of hive deployed in my org's cluster, I cannot update
 that because of org's policy.
 How can I achieve exclusive lock functionality while inserting in dynamic
 partition on hive 0.10 ?
 Does calling hive scripts via some sort of java api with patched jar
 included will help ?
 Moreover hive does not release locks in 0.10 when hive session is killed .
 User has to explicitly unlock a table.
 Can i specify any sort of max expiry time while taking a lock.

 Thanks
 Shushant

 On Sat, Sep 20, 2014 at 8:11 PM, Alan Gates ga...@hortonworks.com wrote:

 Up until Hive 0.13 locks in Hive were really advisory only, since as you
 note any user can remove any other user's lock.  In Hive 0.13 a new type of
 locking was introduced, see
 https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-LockManager
 This new locking is automatic and ignores both LOCK and UNLOCK commands.
 Note that it is off by default, you have to configure Hive to use the new
 DbTxnManager to get turn on this locking.  In 0.13 it still has the bug you
 describe as far as acquiring the wrong lock for dynamic partitioning, but I
 believe I've fixed that in 0.14.

 Alan.

   Shushant Arora shushantaror...@gmail.com
  September 20, 2014 at 5:39

 Hive version 0.9 and later has a bug



 While inserting in a hive table Hive takes an exclusive lock. But if
 table is partitioned , and insert is in dynamic partition , it will take
 shared lock on table but if all partitions are static then hive takes
 exclusive lock on partitions in which data is being inserted

 and shared lock on table.

 https://issues.apache.org/jira/browse/HIVE-3509


 1.What if I want to take exclusive lock on table while inserting in
 dynamic partition ?


 I tried to take explicit lock using :

 LOCK TABLE tablename EXCLUSIVE;


 But it made table to be disabled.

 I cannot even read from table anymore even is same session until I do

 unlock table tablename in another session;


 2. moreover whats lock level in hive , I mean any user can remove any
 other users lock. that too seems buggy.


 Thanks

 Shushant




 --
 Sent with Postbox http://www.getpostbox.com

 CONFIDENTIALITY NOTICE
 NOTICE: This message is intended for the use of the individual or entity
 to which it is addressed and may contain information that is confidential,
 privileged and exempt from disclosure under applicable law. If the reader
 of this message is not the intended recipient, you are hereby notified that
 any printing, copying, dissemination, distribution, disclosure or
 forwarding of this communication is strictly prohibited. If you have
 received this communication in error, please contact the sender immediately
 and delete it from your system. Thank You.





Re: Mysql - Hive Sync

2014-09-06 Thread Stephen Sprague
interesting. thanks Muthu.

a colleague of mine pointed out this one too, linkedin's databus (
https://github.com/linkedin/databus/wiki).  this one looks extremely
heavyweight and again i'm not sure it's worth the headache.

i like the idea of a trigger on the mysql table and then broadcasting the
data to another app via a udp message.

cf. https://code.google.com/p/mysql-message-api/

the thing is you'll need to batch the records over say 5 minutes (or
whatever) then write the batch as one file to hdfs.

This seems infinitely simpler and more maintainable to me. :)




On Fri, Sep 5, 2014 at 11:53 PM, Muthu Pandi muthu1...@gmail.com wrote:

 Yeah installing Mysql hadoop applier took lot of time when building and
 installing GCC 4.6, and its working but its not serving the exact purpose.
 So now am trying with my own python scripting.

 Idea is reading insert query from binlog and save it under hive warehouse
 as table and query from there.



 *RegardsMuthupandi.K*

 [image: Picture (Device Independent Bitmap)]



 On Sat, Sep 6, 2014 at 4:47 AM, Stephen Sprague sprag...@gmail.com
 wrote:

 great find, Muthu.  I would be interested in hearing any about any
 success or failures using this adapter. almost sounds too good to be true.

 After reading the blog (
 http://innovating-technology.blogspot.com/2013/04/mysql-hadoop-applier-part-2.html)
 about it i see it comes with caveats and it looks a little rough around the
 edges for installing.  Not sure i'd bet the farm on this product but YMMV.

 Anyway, curious to know how it works out for you.



 On Tue, Sep 2, 2014 at 11:03 PM, Muthu Pandi muthu1...@gmail.com wrote:

 This cant be done since insert update delete are not supported in hive.

 Mysql Applier for Hadoop package servers the same purpose of the
 prototype tool which i intended to develop.

 link for Mysql Applier for Hadoop
 http://dev.mysql.com/tech-resources/articles/mysql-hadoop-applier.html



 *Regards Muthupandi.K*

  [image: Picture (Device Independent Bitmap)]



 On Wed, Sep 3, 2014 at 10:35 AM, Muthu Pandi muthu1...@gmail.com
 wrote:

 Yeah but we cant make it to work as near real time. Also my table
 doesnt have like 'ID' to use for --check-column that's why opted out of
 sqoop.



 *Regards Muthupandi.K*

  [image: Picture (Device Independent Bitmap)]



 On Wed, Sep 3, 2014 at 10:28 AM, Nitin Pawar nitinpawar...@gmail.com
 wrote:

 have you looked at sqoop?


 On Wed, Sep 3, 2014 at 10:15 AM, Muthu Pandi muthu1...@gmail.com
 wrote:

 Dear All

  Am developing a prototype of syncing tables from mysql to Hive
 using python and JDBC. Is it a good idea using the JDBC for this purpose.

 My usecase will be generating the sales report using the hive, data
 pulled from mysql using the prototype tool.My data will be around 
 2GB/day.



 *Regards Muthupandi.K*

  [image: Picture (Device Independent Bitmap)]




 --
 Nitin Pawar








Re: Mysql - Hive Sync

2014-09-05 Thread Stephen Sprague
great find, Muthu.  I would be interested in hearing about any successes
or failures using this adapter. almost sounds too good to be true.

After reading the blog (
http://innovating-technology.blogspot.com/2013/04/mysql-hadoop-applier-part-2.html)
about it i see it comes with caveats and it looks a little rough around the
edges for installing.  Not sure i'd bet the farm on this product but YMMV.

Anyway, curious to know how it works out for you.



On Tue, Sep 2, 2014 at 11:03 PM, Muthu Pandi muthu1...@gmail.com wrote:

 This cant be done since insert update delete are not supported in hive.

 Mysql Applier for Hadoop package serves the same purpose as the prototype
 tool which i intended to develop.

 link for Mysql Applier for Hadoop
 http://dev.mysql.com/tech-resources/articles/mysql-hadoop-applier.html



 *Regards Muthupandi.K*

  [image: Picture (Device Independent Bitmap)]



 On Wed, Sep 3, 2014 at 10:35 AM, Muthu Pandi muthu1...@gmail.com wrote:

 Yeah but we cant make it to work as near real time. Also my table doesnt
 have like 'ID' to use for --check-column that's why opted out of sqoop.



 *Regards Muthupandi.K*

  [image: Picture (Device Independent Bitmap)]



 On Wed, Sep 3, 2014 at 10:28 AM, Nitin Pawar nitinpawar...@gmail.com
 wrote:

 have you looked at sqoop?


 On Wed, Sep 3, 2014 at 10:15 AM, Muthu Pandi muthu1...@gmail.com
 wrote:

 Dear All

  Am developing a prototype of syncing tables from mysql to Hive
 using python and JDBC. Is it a good idea using the JDBC for this purpose.

 My usecase will be generating the sales report using the hive, data
 pulled from mysql using the prototype tool.My data will be around 2GB/day.



 *Regards Muthupandi.K*

  [image: Picture (Device Independent Bitmap)]




 --
 Nitin Pawar






Re: Altering the Metastore on EC2

2014-08-14 Thread Stephen Sprague
i'll take a stab at this.

- probably no reason.

- if you can. is there a derby client s/t you can issue the command: alter
table COLUMNS_V2 modify TYPE_NAME varchar(32672). otherwise maybe use the
mysql or postgres metastores (instead of derby) and run that alter command
after the install.

- the schema only exists in one place and that's the metastore (which is
probably on your namenode for derby.) for mysql or postgres it can be
anywhere you want but again examples will probably show localhost (the
namenode)

that's a mighty big schema! you don't just want to use string type and use
get_json_object to pull data out of it dynamically? not as elegant as using
static syntax like nested structs but its better than nothing. something to
think about anyway.
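
For example, a minimal sketch of that string-plus-get_json_object approach
(the column name and JSON paths are made up):

{code}
-- store the whole document as one string column
CREATE TABLE events_raw (json_doc STRING);

-- pull individual fields out at query time instead of modelling 17KB of structs
SELECT get_json_object(json_doc, '$.user.id')              AS user_id,
       get_json_object(json_doc, '$.payload.items[0].sku') AS first_sku
FROM events_raw;
{code}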

i'm guessing given a nested struct that large you'll get over one hump only
to be faced with another one. hive needs to do some crazy mapping there for
every record. hopefully that's optimized. :)

Good luck! I'd be curious how it goes.


On Mon, Aug 11, 2014 at 5:52 PM, David Beveridge dbeveri...@cylance.com
wrote:

  We are creating an Hive schema for reading massive JSON files. Our JSON
 schema is rather large, and we have found that the default metastore schema
 for Hive cannot work for us as-is.

 To be specific, one field in our schema has about 17KB of nested structs
 within it. Unfortunately, it appears that Hive has a limit of varchar(4000)
 for the field that stores the resulting definition:



 CREATE TABLE COLUMNS_V2 (

 CD_ID bigint NOT NULL,

 COMMENT varchar(4000),

 COLUMN_NAME varchar(128) NOT NULL,

 TYPE_NAME varchar(4000),

 INTEGER_IDX INTEGER NOT NULL,

 PRIMARY KEY (CD_ID, COLUMN_NAME)

 );



 We are running this on Amazon MapReduce (v0.11 with default Derby
 metastore)



 So, our initial questions are:

 · Is there a reason that the TYPE_NAME is being limited to 4000
 (IIUC, varchar on derby can grow to 32672, which would be sufficient for
 a long time)

 · Can we alter the metastore schema without hacking/reinstalling
 Hive? (if so, how?)

 · If so, is there a proper way to update the schema on all nodes?





 Thanks in advance!

 --DB



Re: Predicate pushdown optimisation not working for ORC

2014-04-03 Thread Stephen Sprague
wow. good find. i hope these config settings are well documented and that
you didn't have to spend a lot of time searching for that.  Interesting that
the default isn't true for this one.
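
For reference, the two settings in play in this thread (values as reported
below; worth double-checking the defaults on your own release):

{code}
set hive.optimize.ppd=true;           -- predicate pushdown
set hive.optimize.index.filter=true;  -- the one that made the ORC pushdown predicate appear
{code}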


On Wed, Apr 2, 2014 at 11:00 PM, Abhay Bansal abhaybansal.1...@gmail.comwrote:

 I was able to resolve the issue by setting hive.optimize.index.filter to
 true.

 In the hadoop logs
 syslog:2014-04-03 05:44:51,204 INFO
 org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included column ids =
 3,8,13
 syslog:2014-04-03 05:44:51,204 INFO
 org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included columns names =
 sourceipv4address,sessionid,url
 syslog:2014-04-03 05:44:51,216 INFO
 org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: ORC pushdown predicate:
 leaf-0 = (EQUALS sourceipv4address 1809657989)

 I can now see the ORC pushdown predicate.

 Thanks,
 -Abhay


 On Thu, Apr 3, 2014 at 11:14 AM, Stephen Boesch java...@gmail.com wrote:

 HI Abhay,
   What is the DDL for your test table?


 2014-04-02 22:36 GMT-07:00 Abhay Bansal abhaybansal.1...@gmail.com:

 I am new to Hive, apologise for asking such a basic question.

 Following exercise was done with hive .12 and hadoop 0.20.203

 I created a ORC file form java, and pushed it into a table with the same
 schema. I checked the conf
 property <property><name>hive.optimize.ppd</name><value>true</value></property>
 which should ideally use the ppd optimisation.

 I ran a query select sourceipv4address,sessionid,url from test where
 sourceipv4address=dummy;

 Just to see if the ppd optimization is working I checked the hadoop logs
 where I found

 ./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
 05:01:39,913 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included
 column ids = 3,8,13
 ./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
 05:01:39,914 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included
 columns names = sourceipv4address,sessionid,url
 ./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
 05:01:39,914 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: *No
 ORC pushdown predicate*

  I am not sure which part of it I missed. Any help would be appreciated.

 Thanks,
 -Abhay






Re: Partitioned table to partitioned table

2014-03-26 Thread Stephen Sprague
the error message is correct.  remember the partition columns are not
stored with the data, but with a select * you're selecting them anyway.  And this
has nothing to do with ORC either; it's a Hive thing. :)

so your second approach was close.  just omit the partition columns yr, mo,
day.
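
Something along these lines (untested sketch; the column names are placeholders):

{code}
FROM table_a a
INSERT INTO TABLE table_b PARTITION (yr=2014, mo='01', day='01')
SELECT a.column1, a.column2, a.column3   -- every column except yr, mo, day
WHERE a.yr='2014' AND a.mo='01' AND a.day='01';
{code}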


On Wed, Mar 26, 2014 at 8:18 AM, P lva ruvi...@gmail.com wrote:

 Hello,

 I'm trying to convert managed partitioned text table into compressed orc
 partitioned table.

 I created the a new table with the same schema but when I try inserting
 data the errors says there are different number of columns.

 I tried doing

 From table a insert into table b(yr=2014, mo=01, day=01 ) select * where
 yr='2014' and month='01' and day='01'

 Also tried doing this

 From table a insert into table b(yr=2014, mo=01, day=01) select a.column1,
 a.column2, a.column3, ... where yr='2014' and month='01' and day='01'

 which didnt work either

 How does one copy data from a partitioned table to another along with
 partition information ?

 Thanks











Re: computing median and percentiles

2014-03-20 Thread Stephen Sprague
the short answer is there is no native hive UDF that solves your unique
case.  That means you have to solve it.


i searched for something like you were looking for myself and found this
general recipe:

http://www.onlinemathlearning.com/median-frequency-table.html

off the top of my head i'm not sure how easy this would be in SQL but i
imagine using ROW_NUMBER() function a clever person could do it. And if not
SQL then perhaps a custom UDF but ultimately you have to do the work and
now you have a potential recipe to follow. :)
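
For example, one way to phrase that recipe with a window function (untested
sketch; needs Hive 0.11+ for windowing, and the table/column names are
placeholders):

{code}
-- running total of counts; the 25th percentile is the smallest value whose
-- cumulative count reaches 25% of the grand total
SELECT MIN(t.value) AS p25
FROM (
  SELECT value,
         SUM(cnt) OVER (ORDER BY value) AS cum_cnt
  FROM   freq
) t
CROSS JOIN (SELECT SUM(cnt) AS total_cnt FROM freq) tot
WHERE t.cum_cnt >= 0.25 * tot.total_cnt;
{code}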




On Wed, Mar 19, 2014 at 9:37 PM, Seema Datar sda...@yahoo-inc.com wrote:

  Not really. If it was a single column with no counters, Hive provides an
 option to use percentile. So basically if the data was like -

  100
 100
 200
 200
 200
 200
 300

  But if we have 2 columns, one that maintain the value and the other that
 maintains the count, how can Hive be used to derive the percentile?

  Value Count
 100  2
 200  4
 300  1

  Thanks,
 Seema

   From: Stephen Sprague sprag...@gmail.com
 Reply-To: user@hive.apache.org user@hive.apache.org
 Date: Thursday, March 20, 2014 5:28 AM
 To: user@hive.apache.org user@hive.apache.org
 Subject: Re: computing median and percentiles

   not a hive question is it?   its more like a math question.



 On Wed, Mar 19, 2014 at 1:30 PM, Seema Datar sda...@yahoo-inc.com wrote:



   I understand the percentile function is supported in Hive in the
 latest versions. However, how does once calculate percentiles when the data
 is across two columns. So say -

  Value  Count

  100 2   ( so basically 100 occurred twice)
 200 4
 300 1
 400 6
 500 3


  I want to find out the 0.25 percentile for the value distribution. How
 can I do it using the Hive percentile function?






Re: Improving self join time

2014-03-20 Thread Stephen Sprague
hmm. would this not fall under the general problem of identifying
duplicates?

Would something like this meet your needs? (untested)

select  -- outer query finds the ids for the duplicates
key

from (  -- inner query lists duplicate values
 select
   count(*) as cnt,
   value
 from
foo
 group by
value
 having
   count(*) > 1
 ) z
 join foo a on (a.value = z.value)
;

table foo is your table elements
key is your id,
value is your element


On Thu, Mar 20, 2014 at 7:03 AM, Jeff Storey storey.j...@gmail.com wrote:

 I have a table with 10 million rows and 2 columns - id (int) and element
 (string). I am trying to do a self join that finds any ids where the
 element values are the same, and my query looks like:

 select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1 JOIN
 elements e2 on e1.element = e2.element WHERE e1.id <> e2.id;

 I tested this at a smaller scale and it works well. The problem is that
 with 10 million rows, this becomes a bit large and I've let it run for 90
 minutes and it was up to 80GB of disk space and still going. The original
 input data was only 500MB.

 Is this something I can optimize in hive? Or should I be considering a
 different approach to the problem instead?

 Any guidance here would be helpful. Thank you.



Re: Improving self join time

2014-03-20 Thread Stephen Sprague
I agree with your assessment of the inner query. why stop there though?
Doesn't the outer query fetch the ids of the tags that the inner query
identified?



On Thu, Mar 20, 2014 at 9:54 AM, Jeff Storey storey.j...@gmail.com wrote:

 I don't think this quite fits here..I think the inner query will give me a
 list of duplicate elements and their counts, but it loses the information
 as to what id had these elements.

 I'm trying to find which pairs of ids have any duplicate tags.


 On Thu, Mar 20, 2014 at 11:57 AM, Stephen Sprague sprag...@gmail.comwrote:

 hmm. would this not fall under the general problem of identifying
 duplicates?

 Would something like this meet your needs? (untested)

 select  -- outer query finds the ids for the duplicates
 key

 from (  -- inner query lists duplicate values
   select
count(*) as cnt,
value
  from
 foo
  group by
 value
  having
 count(*) > 1
  ) z
  join foo a on (a.value = z.value)
 ;

 table foo is your table elements
 key is your id,
 value is your element


 On Thu, Mar 20, 2014 at 7:03 AM, Jeff Storey storey.j...@gmail.comwrote:

 I have a table with 10 million rows and 2 columns - id (int) and element
 (string). I am trying to do a self join that finds any ids where the
 element values are the same, and my query looks like:

 select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1
 JOIN elements e2 on e1.element = e2.element WHERE e1.id <> e2.id;

 I tested this at a smaller scale and it works well. The problem is that
 with 10 million rows, this becomes a bit large and I've let it run for 90
 minutes and it was up to 80GB of disk space and still going. The original
 input data was only 500MB.

 Is this something I can optimize in hive? Or should I be considering a
 different approach to the problem instead?

 Any guidance here would be helpful. Thank you.






Re: Improving self join time

2014-03-20 Thread Stephen Sprague
so that's your final assessment, eh? :)  What is your comment about the
outer query _joining on value_ to get the key?
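
To spell that out, a sketch of the same idea extended to return the id pairs
(untested):

{code}
SELECT a.id AS id1, b.id AS id2, a.element
FROM (
  SELECT element
  FROM   elements
  GROUP BY element
  HAVING COUNT(*) > 1           -- elements that occur more than once
) dup
JOIN elements a ON (a.element = dup.element)
JOIN elements b ON (b.element = dup.element)
WHERE a.id < b.id;              -- report each pair once
{code}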




On Thu, Mar 20, 2014 at 12:26 PM, Jeff Storey storey.j...@gmail.com wrote:

 I don't think so since the inner result doesn't have the key field in it.
 It ends up being

 select key from (query result that doesn't contain the key field) ...


 On Thu, Mar 20, 2014 at 1:28 PM, Stephen Sprague sprag...@gmail.comwrote:

 I agree with your assessment of the inner query. why stop there though?
 Doesn't the outer query fetch the ids of the tags that the inner query
 identified?



 On Thu, Mar 20, 2014 at 9:54 AM, Jeff Storey storey.j...@gmail.comwrote:

 I don't think this quite fits here..I think the inner query will give me
 a list of duplicate elements and their counts, but it loses the information
 as to what id had these elements.

 I'm trying to find which pairs of ids have any duplicate tags.


 On Thu, Mar 20, 2014 at 11:57 AM, Stephen Sprague sprag...@gmail.comwrote:

 hmm. would this not fall under the general problem of identifying
 duplicates?

 Would something like this meet your needs? (untested)

 select  -- outer query finds the ids for the duplicates
 key

 from (  -- inner query lists duplicate values
   select
count(*) as cnt,
value
  from
 foo
  group by
 value
  having
 count(*) > 1
  ) z
  join foo a on (a.value = z.value)
 ;

 table foo is your table elements
 key is your id,
 value is your element


 On Thu, Mar 20, 2014 at 7:03 AM, Jeff Storey storey.j...@gmail.comwrote:

 I have a table with 10 million rows and 2 columns - id (int) and
 element (string). I am trying to do a self join that finds any ids where
 the element values are the same, and my query looks like:

 select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1
 JOIN elements e2 on e1.element = e2.element WHERE e1.id <> e2.id;

 I tested this at a smaller scale and it works well. The problem is
 that with 10 million rows, this becomes a bit large and I've let it run 
 for
 90 minutes and it was up to 80GB of disk space and still going. The
 original input data was only 500MB.

 Is this something I can optimize in hive? Or should I be considering a
 different approach to the problem instead?

 Any guidance here would be helpful. Thank you.








Re: computing median and percentiles

2014-03-19 Thread Stephen Sprague
not a hive question is it?   its more like a math question.



On Wed, Mar 19, 2014 at 1:30 PM, Seema Datar sda...@yahoo-inc.com wrote:



   I understand the percentile function is supported in Hive in the latest
 versions. However, how does once calculate percentiles when the data is
 across two columns. So say -

  Value  Count

  100 2   ( so basically 100 occurred twice)
 200 4
 300 1
 400 6
 500 3


  I want to find out the 0.25 percentile for the value distribution. How
 can I do it using the Hive percentile function?





Re: Hive - Sorting on the Partition Column data type Int . Output is Alphabetic Sort

2014-03-14 Thread Stephen Sprague
i agree. back in the days of v0.7 who knows if this was a bug or not. What
i do know is i have int partitions all over the place and they sort in
numeric context just fine every day.  i'm running v0.12.  pretty sure this
worked fine at v0.8 as well - that's when i started in hive.

good luck!


On Fri, Mar 14, 2014 at 4:21 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 Can you first try updating hive to atleast 0.11 if you can not move to
 0.12 ?


 On Fri, Mar 14, 2014 at 4:49 PM, Arafat, Moiz moiz.ara...@teamaol.comwrote:

  My comments inline



 *From:* Stephen Sprague [mailto:sprag...@gmail.com]
 *Sent:* Friday, March 14, 2014 12:23 AM

 *To:* user@hive.apache.org
 *Subject:* Re: Hive - Sorting on the Partition Column data type Int .
 Output is Alphabetic Sort



 wow. its still sorting based on string context.  ok, some followups.

 1.did you start clean?  ie. did you do a drop table
 moiz_partition_test before you started?

 Yes I did

 2.lets see the output of show create table moiz_partition_test
 (if that doesn't work [its hive v0.11 i think] lets see desc
 moiz_partition_test )

 hive desc moiz_partition_test;

 OK

 event_dt                string

 partition_hr            int

 Time taken: 1.967 seconds

   3.what version of hive are you running?

 0.7.1

   pretty bizarre.





 On Thu, Mar 13, 2014 at 3:48 AM, Arafat, Moiz moiz.ara...@teamaol.com
 wrote:

  Hi Stephen,



 I followed your approach and still got the same result



 1) hive CREATE TABLE moiz_partition_test

  (EVENT_DT STRING) partitioned by

  (

  PARTITION_HR INT

  )

  ROW FORMAT DELIMITED

  FIELDS TERMINATED BY '09'

  location '/user/moiztcs/moiz_partition_test'

  ;



 2)

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=0)  ;

 OK

 Time taken: 2.421 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=1)  ;

 OK

 Time taken: 0.132 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=2)  ;

 OK

 Time taken: 0.226 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=10)  ;

 OK

 Time taken: 0.177 seconds



 3)

 $ hadoop fs -ls /user/moiztcs/moiz_partition_test

 Found 4 items

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:40
 /user/moiztcs/moiz_partition_test/partition_hr=0

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:41
 /user/moiztcs/moiz_partition_test/partition_hr=1

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:42
 /user/moiztcs/moiz_partition_test/partition_hr=10

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:41
 /user/moiztcs/moiz_partition_test/partition_hr=2



 4)

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=0

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=1

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=10

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=2



 5) hive select distinct partition_hr from moiz_partition_test order by
 partition_hr;



 OK

 0

 1

 10

 2



 Thanks,

 Moiz



 *From:* Stephen Sprague [mailto:sprag...@gmail.com]
 *Sent:* Wednesday, March 12, 2014 9:58 PM


 *To:* user@hive.apache.org
 *Subject:* Re: Hive - Sorting on the Partition Column data type Int .
 Output is Alphabetic Sort



 there you go.   I think you're inflicting too much of your own will onto
 hive with specifying the partition directories as 00, 01, 02.

 In my experience hive expects the partition name followed by an equal
 sign followed by the value.

 I'd stick with this kind of hdfs topology:

 /user/moiztcs/moiz_partition_test/partition_hr=00/
 /user/moiztcs/moiz_partition_test/partition_hr=01/
 /user/moiztcs/moiz_partition_test/partition_hr=10/

 By omitting the location clause on your alter table statements you should
 get above layout which can be

 confirmed by issuing the following command:

 $ hdfs dfs -ls /user/moiztc/moiz_partition_test

 Can you try this?









 On Wed, Mar 12, 2014 at 12:10 AM, Arafat, Moiz moiz.ara...@teamaol.com
 wrote:

  Hi,



 Here are the steps I followed . Please let me know If I did something
 wrong.



 1)  Create table

 hive CREATE TABLE moiz_partition_test

  (EVENT_DT STRING) partitioned by

  (

  PARTITION_HR INT

  )

   ROW FORMAT DELIMITED

   FIELDS TERMINATED BY '09'

   location '/user/moiztcs/moiz_partition_test'

  ;



 2)  Add partitions

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=0)  location '/user/moiztcs/moiz_partition_test/00';

 OK

 Time taken: 0.411 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=1)  location '/user/moiztcs/moiz_partition_test/01';

 OK

 Time taken: 0.193 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr

Re: Writing data to LOCAL with Hive Server2

2014-03-14 Thread Stephen Sprague
re: HiveServer2

this is not natively possible (this falls under the export rubric.)
similarly, you can't load a file directly from your client using native
syntax (import.)
Believe me, you're not the only one who'd like both of these
functions. :)

I'd search this list for import or export utilities as i remember not long
ago someone pointed to page in the doc about these.

as an alternative i just employ rsync to push or pull the file to where i
ultimately need it.  its one line of code which is about as easy as it
gets.


On Fri, Mar 14, 2014 at 5:43 AM, Kumar V kumarbuyonl...@yahoo.com wrote:

 Hi,
Is it possible to write data to a local file system using Hive server 2
 ?
 When I use 'insert overwrite local directory /tmp/hivedata
 ...', it writes data in the directory on the box on which hive
 server 2 is running.
 What can I do to write this data to the file system of the box I am
 running this Java/Python program from ?

 Other option was to write this data in HDFS.  But I would like to like to
  write to local file if I can.

 Please let me know if it is possible.

 Thanks.



Re: Writing data to LOCAL with Hive Server2

2014-03-14 Thread Stephen Sprague
well that does complicate matters then. :)  how to get a file off a host
you don't have access to? when you say no access i presume you mean you
have no login access.

I can think of some ideas:

   1. make a case to get access so rsync can connect. (best)

   2. can you run a hdfs dfs -get command from your client? (next best; see the sketch below)

   3. run your query to stdout and redirect that to your file. eg. $ hive
-host foo -port 1 -e select foo from bar; > my_local_file (least
desirable)
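
A minimal sketch of option 2, writing the result to HDFS first (the paths are
made up):

{code}
hive> INSERT OVERWRITE DIRECTORY '/tmp/hivedata_hdfs' SELECT foo FROM bar;
$ hdfs dfs -get /tmp/hivedata_hdfs ./hivedata     # run this part from your own box
{code}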

the page i was thinking of was this:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
don't think it helps your situation though.


Let us know how you make out!

Cheers,
Stephen.


On Fri, Mar 14, 2014 at 10:56 AM, Kumar V kumarbuyonl...@yahoo.com wrote:

 Thanks Stephen.
 Yes, rsync would be easy.  But we don't have access to the servers the
 hiveserver2 is running on.  Basically, there is no way to reach those files
 from our boxes. That's why I was asking about writing it locally.
 I'll check this list for import/export like you mentioned.

 Thanks.


   On Friday, March 14, 2014 12:23 PM, Stephen Sprague sprag...@gmail.com
 wrote:
  re: HiveServer2

 this is not natively possible (this falls under the export rubric.)
 similarly, you can't load a file directly from your client using native
 syntax (import.)
 Believe me, you're not the only one who'd like both of these
 functions. :)

 I'd search this list for import or export utilities as i remember not long
 ago someone pointed to page in the doc about these.

  as an alternative i just employ rsync to push or pull the file to where
 i ultimately need it.  its one line of code which is about as easy as it
 gets.


 On Fri, Mar 14, 2014 at 5:43 AM, Kumar V kumarbuyonl...@yahoo.com wrote:

  Hi,
Is it possible to write data to a local file system using Hive server 2
 ?
 When I use 'insert overwrite local directory /tmp/hivedata
 ...', it writes data in the directory on the box on which hive
 server 2 is running.
 What can I do to write this data to the file system of the box I am
 running this Java/Python program from ?

 Other option was to write this data in HDFS.  But I would like to like to
  write to local file if I can.

 Please let me know if it is possible.

 Thanks.







Re: Hive - Sorting on the Partition Column data type Int . Output is Alphabetic Sort

2014-03-13 Thread Stephen Sprague
wow. its still sorting based on string context.  ok, some followups.

1.  did you start clean?  ie. did you do a drop table moiz_partition_test
before you started?

2.  lets see the output of show create table moiz_partition_test  (if
that doesn't work [its hive v0.11 i think] lets see desc
moiz_partition_test )


3.  what version of hive are you running?


pretty bizarre.




On Thu, Mar 13, 2014 at 3:48 AM, Arafat, Moiz moiz.ara...@teamaol.comwrote:

  Hi Stephen,



 I followed your approach and still got the same result



 1) hive CREATE TABLE moiz_partition_test

  (EVENT_DT STRING) partitioned by

  (

  PARTITION_HR INT

  )

  ROW FORMAT DELIMITED

  FIELDS TERMINATED BY '09'

  location '/user/moiztcs/moiz_partition_test'

  ;



 2)

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=0)  ;

 OK

 Time taken: 2.421 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=1)  ;

 OK

 Time taken: 0.132 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=2)  ;

 OK

 Time taken: 0.226 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=10)  ;

 OK

 Time taken: 0.177 seconds



 3)

 $ hadoop fs -ls /user/moiztcs/moiz_partition_test

 Found 4 items

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:40
 /user/moiztcs/moiz_partition_test/partition_hr=0

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:41
 /user/moiztcs/moiz_partition_test/partition_hr=1

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:42
 /user/moiztcs/moiz_partition_test/partition_hr=10

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:41
 /user/moiztcs/moiz_partition_test/partition_hr=2



 4)

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=0

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=1

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=10

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=2



 5) hive select distinct partition_hr from moiz_partition_test order by
 partition_hr;



 OK

 0

 1

 10

 2



 Thanks,

 Moiz



 *From:* Stephen Sprague [mailto:sprag...@gmail.com]
 *Sent:* Wednesday, March 12, 2014 9:58 PM

 *To:* user@hive.apache.org
 *Subject:* Re: Hive - Sorting on the Partition Column data type Int .
 Output is Alphabetic Sort



 there you go.   I think you're inflicting too much of your own will onto
 hive with specifying the partition directories as 00, 01, 02.

 In my experience hive expects the partition name followed by an equal sign
 followed by the value.

 I'd stick with this kind of hdfs topology:

 /user/moiztcs/moiz_partition_test/partition_hr=00/
 /user/moiztcs/moiz_partition_test/partition_hr=01/
 /user/moiztcs/moiz_partition_test/partition_hr=10/

 By omitting the location clause on your alter table statements you should
 get above layout which can be

 confirmed by issuing the following command:

 $ hdfs dfs -ls /user/moiztc/moiz_partition_test

 Can you try this?









 On Wed, Mar 12, 2014 at 12:10 AM, Arafat, Moiz moiz.ara...@teamaol.com
 wrote:

  Hi,



 Here are the steps I followed . Please let me know If I did something
 wrong.



 1)  Create table

 hive CREATE TABLE moiz_partition_test

  (EVENT_DT STRING) partitioned by

  (

  PARTITION_HR INT

  )

   ROW FORMAT DELIMITED

   FIELDS TERMINATED BY '09'

   location '/user/moiztcs/moiz_partition_test'

  ;



 2)  Add partitions

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=0)  location '/user/moiztcs/moiz_partition_test/00';

 OK

 Time taken: 0.411 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=1)  location '/user/moiztcs/moiz_partition_test/01';

 OK

 Time taken: 0.193 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=2)  location '/user/moiztcs/moiz_partition_test/02';

 OK

 Time taken: 0.182 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=10)  location '/user/moiztcs/moiz_partition_test/10';

 OK

 Time taken: 0.235 seconds



 3)  Copy data into the directories

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/00

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/01

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/02

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/10



 4)  Ran the sql

 hive select distinct partition_hr from moiz_partition_test order by
 partition_hr;

 Ended Job

 OK

 0

 1

 10

 2



 Thanks,

 Moiz

 *From:* Stephen Sprague [mailto:sprag...@gmail.com]
 *Sent:* Wednesday, March 12, 2014 12:55 AM
 *To:* user@hive.apache.org
 *Subject:* Re: Hive - Sorting on the Partition

Re: Hive - Sorting on the Partition Column data type Int . Output is Alphabetic Sort

2014-03-12 Thread Stephen Sprague
there you go.   I think you're inflicting too much of your own will onto
hive with specifying the partition directories as 00, 01, 02.

In my experience hive expects the partition name followed by an equal sign
followed by the value.

I'd stick with this kind of hdfs topology:

/user/moiztcs/moiz_partition_test/partition_hr=00/
/user/moiztcs/moiz_partition_test/partition_hr=01/
/user/moiztcs/moiz_partition_test/partition_hr=10/

By omitting the location clause on your alter table statements you should
get above layout which can be
confirmed by issuing the following command:

$ hdfs dfs -ls /user/moiztc/moiz_partition_test

Can you try this?







On Wed, Mar 12, 2014 at 12:10 AM, Arafat, Moiz moiz.ara...@teamaol.comwrote:

  Hi,



 Here are the steps I followed . Please let me know If I did something
 wrong.



 1)  Create table

 hive CREATE TABLE moiz_partition_test

  (EVENT_DT STRING) partitioned by

  (

  PARTITION_HR INT

  )

   ROW FORMAT DELIMITED

   FIELDS TERMINATED BY '09'

   location '/user/moiztcs/moiz_partition_test'

  ;



 2)  Add partitions

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=0)  location '/user/moiztcs/moiz_partition_test/00';

 OK

 Time taken: 0.411 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=1)  location '/user/moiztcs/moiz_partition_test/01';

 OK

 Time taken: 0.193 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=2)  location '/user/moiztcs/moiz_partition_test/02';

 OK

 Time taken: 0.182 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=10)  location '/user/moiztcs/moiz_partition_test/10';

 OK

 Time taken: 0.235 seconds



 3)  Copy data into the directories

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/00

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/01

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/02

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/10



 4)  Ran the sql

 hive select distinct partition_hr from moiz_partition_test order by
 partition_hr;

 Ended Job

 OK

 0

 1

 10

 2



 Thanks,

 Moiz

 *From:* Stephen Sprague [mailto:sprag...@gmail.com]
 *Sent:* Wednesday, March 12, 2014 12:55 AM
 *To:* user@hive.apache.org
 *Subject:* Re: Hive - Sorting on the Partition Column data type Int .
 Output is Alphabetic Sort



 that makes no sense. if the column is an int it isn't going to sort like a
 string.  I smell a user error somewhere.



 On Tue, Mar 11, 2014 at 6:21 AM, Arafat, Moiz moiz.ara...@teamaol.com
 wrote:

 Hi ,

 I have a table that has a partition column partition_hr . Data Type is int
 (partition_hr  int) . When i run a sort on this column the output is
 like this.

 0
 1
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 2
 20
 21
 22
 23
 3
 4
 5
 6
 7
 8
 9

 I expected the output like this  .

 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 .
 .
 and so on.

 It works fine for non-partition columns. Please advise.

 Thanks,
 Moiz





Re: full outer join result

2014-03-12 Thread Stephen Sprague
interesting.don't know the answer but could you change the UNION in the
Postgres to UNION ALL?  I'd be curious if the default is UNION DISTINCT on
that platform. That would at least partially explain postgres behaviour
leaving hive the odd man out.



On Wed, Mar 12, 2014 at 6:47 AM, Martin Kudlej mkud...@redhat.com wrote:

 Hi all,

 I've tried BigTop test for join_filters:
 CREATE TABLE myinput1(key int, value int);
 LOAD DATA LOCAL INPATH 'seed_data_files/in3.txt' INTO TABLE myinput1;

 where seed_data_files/in3.txt:
 12  35
 NULL40
 48  NULL
 100 100

 I've tried:
 SELECT * FROM myinput1 a FULL OUTER JOIN myinput1 b on a.key > 40 AND
 a.value > 50 AND a.key = a.value AND b.key > 40 AND b.value > 50 AND b.key
 = b.value ORDER BY a.key, a.value, b.key, b.value;

 and expected result in test is:
 NULL  NULL  NULL  40
 NULL  NULL  NULL  40
 NULL  NULL  NULL  40
 NULL  NULL  NULL  40
 NULL  NULL  12  35
 NULL  NULL  12  35
 NULL  NULL  12  35
 NULL  NULL  12  35
 NULL  NULL  48  NULL
 NULL  NULL  48  NULL
 NULL  NULL  48  NULL
 NULL  NULL  48  NULL
 NULL  40  NULL  NULL
 NULL  40  NULL  NULL
 NULL  40  NULL  NULL
 NULL  40  NULL  NULL
 12  35  NULL  NULL
 12  35  NULL  NULL
 12  35  NULL  NULL
 12  35  NULL  NULL
 48  NULL  NULL  NULL
 48  NULL  NULL  NULL
 48  NULL  NULL  NULL
 48  NULL  NULL  NULL
 100 100 NULL  NULL
 100 100 NULL  NULL
 100 100 NULL  NULL
 100 100 100 100


 but real hive result is:
 NULLNULLNULL40
 NULLNULL12  35
 NULLNULL48  NULL
 NULL40  NULLNULL
 12  35  NULLNULL
 48  NULLNULLNULL
 100 100 100 100

 btw. result from postgresql is:
 (SELECT *
   FROM myinput1 a
 LEFT JOIN
   myinput1 b on
 a.key > 40 AND
 a.value > 50 AND
 a.key = a.value AND
 b.key > 40 AND
 b.value > 50 AND
 b.key = b.value  ORDER BY a.key, a.value, b.key, b.value)
 UNION (SELECT *
   FROM myinput1 a
 RIGHT JOIN
   myinput1 b on
 a.key > 40 AND
 a.value > 50 AND
 a.key = a.value AND
 b.key > 40 AND
 b.value > 50 AND
 b.key = b.value
 ORDER BY a.key, a.value, b.key, b.value);
  |   |  12 |35
   12 |35 | |
  |   |  48 |
   48 |   | |
  |40 | |
  |   | |40
  100 |   100 | 100 |   100

 so it's the same like in hive.

 What is the right result for this full outer join in HiveQL, please?

 --
 Best Regards,
 Martin Kudlej.
 MRG/Grid  RHS-Hadoop Senior Quality Assurance Engineer
 Red Hat Czech s.r.o.

 Phone: +420 532 294 155
 E-mail:mkudlej at redhat.com
 IRC:   mkudlej at #brno, #messaging, #grid, #rhs, #distcomp



Re: full outer join result

2014-03-12 Thread Stephen Sprague
well. i had some free time to search it.  from here:
http://www.postgresql.org/docs/9.3/static/sql-select.html#SQL-UNION you'll
see the default is indeed UNION DISTINCT.  so changing it to UNION ALL
you'll get different results - are they the ones you're expecting?
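
i.e. the quoted Postgres emulation re-run with only that keyword changed
(sketch; note that a LEFT JOIN followed by UNION ALL of a RIGHT JOIN
double-counts the matched rows, which is itself different from a true FULL
OUTER JOIN):

{code}
(SELECT * FROM myinput1 a LEFT JOIN myinput1 b
    ON a.key > 40 AND a.value > 50 AND a.key = a.value
   AND b.key > 40 AND b.value > 50 AND b.key = b.value)
UNION ALL
(SELECT * FROM myinput1 a RIGHT JOIN myinput1 b
    ON a.key > 40 AND a.value > 50 AND a.key = a.value
   AND b.key > 40 AND b.value > 50 AND b.key = b.value)
ORDER BY 1, 2, 3, 4;
{code}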


On Wed, Mar 12, 2014 at 9:36 AM, Stephen Sprague sprag...@gmail.com wrote:

 interesting.don't know the answer but could you change the UNION in
 the Postgres to UNION ALL?  I'd be curious if the default is UNION DISTINCT
 on that platform. That would at least partially explain postgres behaviour
 leaving hive the odd man out.



 On Wed, Mar 12, 2014 at 6:47 AM, Martin Kudlej mkud...@redhat.com wrote:

 Hi all,

 I've tried BigTop test for join_filters:
 CREATE TABLE myinput1(key int, value int);
 LOAD DATA LOCAL INPATH 'seed_data_files/in3.txt' INTO TABLE myinput1;

 where seed_data_files/in3.txt:
 12  35
 NULL40
 48  NULL
 100 100

 I've tried:
 SELECT * FROM myinput1 a FULL OUTER JOIN myinput1 b on a.key > 40 AND
 a.value > 50 AND a.key = a.value AND b.key > 40 AND b.value > 50 AND b.key
 = b.value ORDER BY a.key, a.value, b.key, b.value;

 and expected result in test is:
 NULL  NULL  NULL  40
 NULL  NULL  NULL  40
 NULL  NULL  NULL  40
 NULL  NULL  NULL  40
 NULL  NULL  12  35
 NULL  NULL  12  35
 NULL  NULL  12  35
 NULL  NULL  12  35
 NULL  NULL  48  NULL
 NULL  NULL  48  NULL
 NULL  NULL  48  NULL
 NULL  NULL  48  NULL
 NULL  40  NULL  NULL
 NULL  40  NULL  NULL
 NULL  40  NULL  NULL
 NULL  40  NULL  NULL
 12  35  NULL  NULL
 12  35  NULL  NULL
 12  35  NULL  NULL
 12  35  NULL  NULL
 48  NULL  NULL  NULL
 48  NULL  NULL  NULL
 48  NULL  NULL  NULL
 48  NULL  NULL  NULL
 100 100 NULL  NULL
 100 100 NULL  NULL
 100 100 NULL  NULL
 100 100 100 100


 but real hive result is:
 NULLNULLNULL40
 NULLNULL12  35
 NULLNULL48  NULL
 NULL40  NULLNULL
 12  35  NULLNULL
 48  NULLNULLNULL
 100 100 100 100

 btw. result from postgresql is:
 (SELECT *
   FROM myinput1 a
 LEFT JOIN
   myinput1 b on
 a.key > 40 AND
 a.value > 50 AND
 a.key = a.value AND
 b.key > 40 AND
 b.value > 50 AND
 b.key = b.value  ORDER BY a.key, a.value, b.key, b.value)
 UNION (SELECT *
   FROM myinput1 a
 RIGHT JOIN
   myinput1 b on
 a.key > 40 AND
 a.value > 50 AND
 a.key = a.value AND
 b.key > 40 AND
 b.value > 50 AND
 b.key = b.value
 ORDER BY a.key, a.value, b.key, b.value);
  |   |  12 |35
   12 |35 | |
  |   |  48 |
   48 |   | |
  |40 | |
  |   | |40
  100 |   100 | 100 |   100

 so it's the same like in hive.

 What is the right result for this full outer join in HiveQL, please?

 --
 Best Regards,
 Martin Kudlej.
 MRG/Grid  RHS-Hadoop Senior Quality Assurance Engineer
 Red Hat Czech s.r.o.

 Phone: +420 532 294 155
 E-mail:mkudlej at redhat.com
 IRC:   mkudlej at #brno, #messaging, #grid, #rhs, #distcomp





additional hive functions

2014-03-12 Thread Stephen Sprague
just a public service announcement.

I had a case where i had a nested json array in a string and i needed that
to act like a first class array in hive. natively, you can pull it out but
it'll just a string. woe is me.

I searched around the web and found this:

http://stackoverflow.com/questions/19867155/array-contains-equivalent-for-json-in-hiveql

follow the link for the brickhouse libs in the above url and you'll find a
veritable treasure trove of useful functions.


Cheers,
Stephen.
PS I have zero affiliation to brickhouse.


Re: Hive - Sorting on the Partition Column data type Int . Output is Alphabetic Sort

2014-03-11 Thread Stephen Sprague
that makes no sense. if the column is an int it isn't going to sort like a
string.  I smell a user error somewhere.


On Tue, Mar 11, 2014 at 6:21 AM, Arafat, Moiz moiz.ara...@teamaol.comwrote:

 Hi ,

 I have a table that has a partition column partition_hr . Data Type is int
 (partition_hr  int) . When i run a sort on this column the output is
 like this.

 0
 1
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 2
 20
 21
 22
 23
 3
 4
 5
 6
 7
 8
 9

 I expected the output like this  .

 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 .
 .
 and so on.

 It works fine for non-partition columns. Please advise.

 Thanks,
 Moiz




Re: bucketed table problems

2014-03-07 Thread Stephen Sprague
yeah. that's not right.

1. lets see the output of show create table foo

2. what version of hive are you using.


On Fri, Mar 7, 2014 at 11:46 AM, Keith Wiley kwi...@keithwiley.com wrote:

 I want to convert a table to a bucketed table, so I made a new table with
 the same schema as the old table and specified a cluster column:

 create table foo_bucketed
 (
 a string,
 b int,
 c float
 )
 clustered by (b) into 10 buckets;

 Then I populate it from my original table:

 set hive.enforce.bucketing = true;
 insert overwrite table foo_bucketed
 select * from foo;

 All of the data goes into the first bucket, leaving the remaining 9
 buckets empty (in the file system, the remaining 9 files are 0 size).
  Furthermore, the cluster column is now NULL.  Its values have been
 completely erased by the insertion (which might explain how they all ended
 up in a single bucket of course).


 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com

 Yet mark his perfect self-contentment, and hence learn his lesson, that
 to be
 self-contented is to be vile and ignorant, and that to aspire is better
 than to
 be blindly and impotently happy.
--  Edwin A. Abbott, Flatland

 




Re: bucketed table problems

2014-03-07 Thread Stephen Sprague
short answer: it's by position.
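
i.e. the select list has to line up with the target's column order. A sketch
against the DDL quoted in the previous message (untested):

{code}
set hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE foo_bucketed
SELECT a, b, c        -- matched to foo_bucketed(a, b, c) by position, not by name
FROM foo;
{code}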


Re: HIVE QUERY HELP:: HOW TO IMPLEMENT THIS CASE

2014-03-04 Thread Stephen Sprague
Let's just say this.  Coercing hive into doing something it's not meant to
do is kinda a waste of time. Sure you can rewrite any update as a
delete/insert but that's not the point of Hive.

Seems like you're going down a path here that's not optimal for your
situation.

You know, I could buy a Tesla and bury it in the ground and use it as a
root cellar - but why?

Cheers,
Stephen


On Mon, Mar 3, 2014 at 10:45 PM, yogesh dhari yogeshh...@gmail.com wrote:

 Hello All,

 I have a use case in RDBMS query which I have implemented in
 HIVE as..



 *1.1) Update statement* *in RDBMS*

 update  TABLE1
 set
 Age= case when isnull(age,'') ='' then 'A= Current' else '240+ Days' end,
 Prev_Age=case when isnull(prev_age,'') ='' then 'A= Current' else '240+
 Days' end;
 *1.2) Update statement* *in HIVE*

 create table  TABLE2 as select
 a.* ,
 case when coalesce(a.age,'')='' then 'A=Current' else '240+ Days' end as
 Age,
 case when coalesce(a.prev_age,'')='' then 'A=Current' else '240+ Days' end
 as Prev_age from TABLE1 a ;





 *Now I have a case statement in which I have a join condition*.



 *2) Join in RDBMS*
 update  TABLE1
 set a.Age = c.sys_buscd_item_desc1
 from  TABLE1 a
 join  TABLE2 c
 on c.sys_type_cd='AGE'
 where isnull(a.age,'00')=c.sys_item;
 commit;





 How can I implement this query into Hive, Pls help and suggest.



 Thanks In Advance

 Yogesh Kumar









Re: HIVE QUERY HELP:: HOW TO IMPLEMENT THIS CASE

2014-03-04 Thread Stephen Sprague
ok. my conscience got the best of me. maybe for worse though. :)

This to me is like giving you a rope and a stool and i don't think it'll
end well.

That said consider something like this:

{code}
select
a.foo1,
a.foo2,

--column to be updated. you need to position it properly
--if null then no match. use value from table a
case
when z.sys_buscd_item_desc1 is null then
   a.age   -- keep TABLE1's existing value

--if not null then match. use value from table z
else
   z.sys_buscd_item_desc1
end,

a.fooN_1,
a.fooN
from

   (
   select
   c.sys_item,             -- join key
   c.sys_buscd_item_desc1  -- replacement value
   from
TABLE1 a
join  TABLE2 c on coalesce(a.age,'00')=c.sys_item
   where
   c.sys_type_cd='AGE'
   ) z

   RIGHT OUTER join TABLE1 a on (z.sys_item = coalesce(a.age,'00') )

-- no where clause needed on 'AGE' since its part of the where clause in the
-- derived table.
{code}

i switched your ON clause and WHERE clause so be sure to take that under
consideration. And finally its not tested.

Best of luck.

Cheers,
Stephen




On Tue, Mar 4, 2014 at 7:49 AM, Stephen Sprague sprag...@gmail.com wrote:


 Let's just say this.  Coercing hive into doing something it's not meant to
 do is kinda a waste of time. Sure you can rewrite any update as a
 delete/insert but that's not the point of Hive.

 Seems like you're going down a path here that's not optimal for your
 situation.

 You know, I could buy a Tesla and bury it in the ground and use it as a
 root cellar - but why?

 Cheers,
 Stephen


 On Mon, Mar 3, 2014 at 10:45 PM, yogesh dhari yogeshh...@gmail.comwrote:

 Hello All,

 I have a use case in RDBMS query which I have implemented in
 HIVE as..



 *1.1) Update statement* *in RDBMS*

 update  TABLE1
 set
 Age= case when isnull(age,'') ='' then 'A= Current' else '240+ Days' end,
 Prev_Age=case when isnull(prev_age,'') ='' then 'A= Current' else '240+
 Days' end;
 *1.2) Update statement* *in HIVE*

 create table  TABLE2 as select
 a.* ,
 case when coalesce(a.age,'')='' then 'A=Current' else '240+ Days' end as
 Age,
 case when coalesce(a.prev_age,'')='' then 'A=Current' else '240+ Days'
 end as Prev_age from TABLE1 a ;





 *Now I have a case statement in which I have a join condition*.



 *2) Join in RDBMS*
 update  TABLE1
 set a.Age = c.sys_buscd_item_desc1
 from  TABLE1 a
 join  TABLE2 c
 on c.sys_type_cd='AGE'
 where isnull(a.age,'00')=c.sys_item;
 commit;





 How can I implement this query into Hive, Pls help and suggest.



 Thanks In Advance

 Yogesh Kumar











Re: move hive tables from one cluster to another cluster

2014-02-28 Thread Stephen Sprague
this is a FAQ.  see the doc on: msck repair table <table>

this will scan hdfs and create the corresponding partitions in the
metastore.
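
e.g. (the table name is a placeholder):

{code}
hive> msck repair table my_table;
hive> show partitions my_table;    -- sanity-check what it discovered
{code}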


On Fri, Feb 28, 2014 at 12:59 AM, shashwat shriparv 
dwivedishash...@gmail.com wrote:

 Where was your meta data in derby or MySql?


 *Warm Regards_**∞_*
 * Shashwat Shriparv*
  [image: 
 http://www.linkedin.com/pub/shashwat-shriparv/19/214/2a9]http://www.linkedin.com/pub/shashwat-shriparv/19/214/2a9[image:
 https://twitter.com/shriparv] https://twitter.com/shriparv[image:
 https://www.facebook.com/shriparv] https://www.facebook.com/shriparv[image:
 http://google.com/+ShashwatShriparv] 
 http://google.com/+ShashwatShriparv[image:
 http://www.youtube.com/user/sShriparv/videos]http://www.youtube.com/user/sShriparv/videos[image:
 http://profile.yahoo.com/SWXSTW3DVSDTF2HHSRM47AV6DI/] shrip...@yahoo.com



 On Fri, Feb 28, 2014 at 1:28 PM, soniya B soniya.bigd...@gmail.comwrote:

 Hi,

 I have moved warehouse file to another cluster. but still i didn't see
 the tables on the other cluster. How to rebuild the metadata?

 Thanks
 Soniya

 On Fri, Feb 28, 2014 at 9:26 AM, Krishnan K kkrishna...@gmail.comwrote:

 1. you can use distcp to copy the files to the new cluster
 2. rebuild metadata


 On Thu, Feb 27, 2014 at 8:07 PM, soniya B soniya.bigd...@gmail.comwrote:

 Dear experts,

 I want to move my hive tables from one cluster to another cluster. how
 can i do it?

 Thanks
 Soniya.







Re: Hive + Flume

2014-02-28 Thread Stephen Sprague
if you can configure flume to create temporary files that start with an
underscore (_) i believe hive will safely ignore them. otherwise you have to
write a script to move them out.
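
With the stock Flume HDFS sink that should be the hdfs.inUsePrefix property
(sketch; the agent and sink names are made up, so map them onto your own
flume.conf):

{code}
# in-flight files get a leading underscore, which hive/mapreduce ignore until they are renamed
agent1.sinks.hdfs-sink1.hdfs.inUsePrefix = _
{code}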


On Fri, Feb 28, 2014 at 11:09 AM, P lva ruvi...@gmail.com wrote:

 Hi,

 I'm have a flume stream that stores data in a directory which is source
 for an external table in hive.
 However flume creates tmp file that cannot be read by hive and query
 breaks.
 Is there any way to avoid hive reading this tmp file ?

 Thanks



Re: move hive tables from one cluster to another cluster

2014-02-28 Thread Stephen Sprague
that advice is way overcomplicating something that is very easy. instead,
please take this approach.

1. run the ddl to create the table on the new cluster
2. distcp the hdfs data into the appropriate hdfs directory.
3. run msck repair table <table> in hive to discover the partitions and
populate the metastore accordingly.








On Fri, Feb 28, 2014 at 1:41 PM, Abdelrahman Shettia 
ashet...@hortonworks.com wrote:

 If you are using Mysql as a database, you need to take a mysql dump of the
 metastore database and search/replace the old namenode hostname with the
 new namenode hostname. Once that is completed, import the metastore
 database.

 Thanks


 On Fri, Feb 28, 2014 at 12:59 AM, shashwat shriparv 
 dwivedishash...@gmail.com wrote:

 Where was your meta data in derby or MySql?


 *Warm Regards_**∞_*
 * Shashwat Shriparv*
  [image: 
 http://www.linkedin.com/pub/shashwat-shriparv/19/214/2a9]http://www.linkedin.com/pub/shashwat-shriparv/19/214/2a9[image:
 https://twitter.com/shriparv] https://twitter.com/shriparv[image:
 https://www.facebook.com/shriparv] https://www.facebook.com/shriparv[image:
 http://google.com/+ShashwatShriparv]http://google.com/+ShashwatShriparv[image:
 http://www.youtube.com/user/sShriparv/videos]http://www.youtube.com/user/sShriparv/videos[image:
 http://profile.yahoo.com/SWXSTW3DVSDTF2HHSRM47AV6DI/]shrip...@yahoo.com



 On Fri, Feb 28, 2014 at 1:28 PM, soniya B soniya.bigd...@gmail.comwrote:

 Hi,

 I have moved warehouse file to another cluster. but still i didn't see
 the tables on the other cluster. How to rebuild the metadata?

 Thanks
 Soniya

 On Fri, Feb 28, 2014 at 9:26 AM, Krishnan K kkrishna...@gmail.comwrote:

 1. you can use distcp to copy the files to the new cluster
 2. rebuild metadata


 On Thu, Feb 27, 2014 at 8:07 PM, soniya B soniya.bigd...@gmail.comwrote:

 Dear experts,

 I want to move my hive tables from one cluster to another cluster. how
 can i do it?

 Thanks
 Soniya.








Re: Metastore performance on HDFS-backed table with 15000+ partitions

2014-02-22 Thread Stephen Sprague
yeah. That traceback pretty much spells it out - its metastore related and
that's where the partitions are stored.

I'm with the others on this. HiveServer2 is still a little jankey on memory
management.  I bounce mine once a day at midnight just to play it safe (and
because i can.)

Again, for me, i use the hive local client for production jobs and remote
client for adhoc stuff.

you may wish to confirm the local hive client has no problem with your
query.

other than that you either increase your heap size on the HS2 process and
hope for the best and/or file a bug report.

bottom line hiveserver2 isn't production bullet proof just yet, IMHO.
Others may disagree.

Regards,
Stephen.



On Sat, Feb 22, 2014 at 9:50 AM, Norbert Burger norbert.bur...@gmail.comwrote:

 Thanks all for the quick feedback.

 I'm a bit surprised to learn 15k is considered too much, but we can work
 around it.  I guess I'm also curious why the query planner needs to know
 about all partitions even in the case of simple select/limit queries, where
 the query might target only a single partition.

 Here's the client-side OOME with HADOOP_HEAPSIZE=2048:


 https://gist.githubusercontent.com/nburger/3286d2052060e2efe161/raw/dc30231086803c1d33b9137b5844d2d0e20e350d/gistfile1.txt

 This was from a CDH4.3.0 client hitting HIveServer2.  Any idea what's
 consuming the heap?

 Norbert


 On Sat, Feb 22, 2014 at 10:32 AM, Edward Capriolo 
 edlinuxg...@gmail.comwrote:

 Don't make tables with that many partitions. It is an anti-pattern. I have
 tables with 2000 partitions a day and that is really too many. Hive needs to
 load that information into memory to plan the query.


 On Saturday, February 22, 2014, Terje Marthinussen 
 tmarthinus...@gmail.com wrote:
  Query optimizer in hive is awful on memory consumption. 15k partitions
 sounds a bit early for it to fail though..
 
  What is your heap size?
 
  Regards,
  Terje
 
  On 22 Feb 2014, at 12:05, Norbert Burger norbert.bur...@gmail.com
 wrote:
 
  Hi folks,
 
  We are running CDH 4.3.0 Hive (0.10.0+121) with a MySQL metastore.
 
  In Hive, we have an external table backed by HDFS which has a 3-level
 partitioning scheme that currently has 15000+ partitions.
 
  Within the last day or so, queries against this table have started
 failing.  A simple query which shouldn't take very long at all (select *
 from ... limit 10) fails after several minutes with a client OOME.  I get
 the same outcome on count(*) queries (which I thought wouldn't send any
 data back to the client).  Increasing heap on both client and server JVMs
 (via HADOOP_HEAPSIZE) doesn't have any impact.
 
  We were only able to work around the client OOMEs by reducing the
 number of partitions in the table.
 
  Looking at the MySQL querylog, my thought is that the Hive client is
 quite busy making requests for partitions that doesn't contribute to the
 query.  Has anyone else had similar experience against tables this size?
 
  Thanks,
  Norbert
 

 --
 Sorry this was sent from mobile. Will do less grammar and spell check
 than usual.





Re: Metastore performance on HDFS-backed table with 15000+ partitions

2014-02-21 Thread Stephen Sprague
most interesting.  we had an issue recently with querying a table with 15K
columns and running out of heap storage but not 15K partitions.

15K partitions shouldn't be causing a problem in my humble estimation.
Maybe a million but not 15K. :)

So is there a traceback we can look at? or its not heap but real memory?

and this is the local hive client? or the hiveserver?

Thanks,
Stephen.



On Fri, Feb 21, 2014 at 7:05 PM, Norbert Burger norbert.bur...@gmail.comwrote:

 Hi folks,

 We are running CDH 4.3.0 Hive (0.10.0+121) with a MySQL metastore.

 In Hive, we have an external table backed by HDFS which has a 3-level
 partitioning scheme that currently has 15000+ partitions.

 Within the last day or so, queries against this table have started
 failing.  A simple query which shouldn't take very long at all (select *
 from ... limit 10) fails after several minutes with a client OOME.  I get
 the same outcome on count(*) queries (which I thought wouldn't send any
 data back to the client).  Increasing heap on both client and server JVMs
 (via HADOOP_HEAPSIZE) doesn't have any impact.

 We were only able to work around the client OOMEs by reducing the number
 of partitions in the table.

 Looking at the MySQL querylog, my thought is that the Hive client is quite
 busy making requests for partitions that doesn't contribute to the query.
  Has anyone else had similar experience against tables this size?

 Thanks,
 Norbert



Re: Slow performance on queries with aggregation function

2014-02-21 Thread Stephen Sprague
Hi Jone,
um.  i can say for sure something is wrong. :)

i would _start_ by going to the tasktracker. this is your friend.  find
your job and look for failed reducers.  That's the starting point anyway,
IMHO.



On Fri, Feb 21, 2014 at 11:35 AM, Jone Lura jone.l...@ecc.no wrote:

 Hi,

 I have tried some variations of queries with aggregation function such as
 the following query;

 select max(total) from my_table;

 and

 select id, sum(total) from my_table group by id

 In my junit tests, I only have two rows with data, but the queries are
 extremely slow.

 The job detail output shows me the following;

 Hadoop job information for Stage-1: number of mappers: 1; number of
 reducers: 1
 2014-02-21 17:31:42,544 Stage-1 map = 0%,  reduce = 0%
 2014-02-21 17:31:45,548 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:31:46,899 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:31:55,446 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:32:34,358 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:32:40,040 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:32:45,653 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:32:46,999 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:32:55,544 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:33:34,454 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:33:40,130 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:33:45,742 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:33:47,093 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:33:55,632 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:27:48,005 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:27:48,461 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:27:48,311 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:27:48,574 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:27:48,932 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:28:48,915 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:28:48,915 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:28:48,933 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:28:48,933 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:28:49,727 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:29:47,995 Stage-1 map = 100%,  reduce = 100%
 2014-02-21 19:29:48,997 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:29:49,018 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:29:49,019 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:29:49,824 Stage-1 map = 100%,  reduce = 0%

 I am relatively new to Hadoop and Hive and I do not know if this is
 normal, or if I have missed some configuration details.

 In my application I am expecting to have 500M or more rows.

 Best regards,

 Jone



Re: Issue with Hive and table with lots of column

2014-02-19 Thread Stephen Sprague
ok. thanks.

so given everything we know the choices i see are:

1. increase your heapsize some more. (And of course confirm your process
that your reported the -Xmx8192M is the HiveServer2 process.)

2. modify your query such that it doesn't use select *

3. modify your query such that it does its own buffering.  maybe stream it?

4. create a jira ticket and request that the internal buffer size that the
hiveserver2 uses for staging results be configurable.

That's all _i_ got left in the tank for this issue.  I think we need an SME
who is familiar with the code now.

Regards,
Stephen.


On Tue, Feb 18, 2014 at 10:57 AM, David Gayou david.ga...@kxen.com wrote:

 Sorry, i reported it badly. It's 8192M

 Thanks,

 David.
 On Tue, Feb 18, 2014 at 6:37 PM, Stephen Sprague sprag...@gmail.com wrote:

 oh. i just noticed the -Xmx value you reported.

 there's no M or G after that number??  I'd like to see -Xmx8192M or
 -Xmx8G.  That *is* very important.

 thanks,
 Stephen.


 On Tue, Feb 18, 2014 at 9:22 AM, Stephen Sprague sprag...@gmail.comwrote:

 thanks.

 re #1.  we need to find that Hiveserver2 process. For all i know the one
 you reported is hiveserver1 (which works.) chances are they use the same
 -Xmx value but we really shouldn't make any assumptions.

 try wide format on the ps command (eg. ps -efw | grep -i Hiveserver2)

 re #2.  okay.  so that tells us it's not the number of columns blowing the
 heap but rather the combination of rows + columns.  There's no way it
 stores the full result set on the heap even under normal circumstances, so
 my guess is there's an internal number of rows it buffers.  sorta like how
 unix buffers stdout.  How and where that's set is out of my league.
 However, maybe you can get around it by upping your heapsize again, if you
 have the available memory of course.


 On Tue, Feb 18, 2014 at 8:39 AM, David Gayou david.ga...@kxen.comwrote:


 1. I have no process with hiveserver2 ...

 ps -ef | grep -i hive  returns some pretty long command with a
 -Xmx8192 and that's the value set in hive-env.sh


 2. The select * from table limit 1 or even 100 is working correctly.


 David.


 On Tue, Feb 18, 2014 at 4:16 PM, Stephen Sprague sprag...@gmail.comwrote:

 He lives on after all! and thanks for the continued feedback.

 We need the answers to these questions using HS2:



1. what is the output of ps -ef | grep -i hiveserver2 on your
 system? in particular what is the value of -Xmx ?

2. does select * from table limit 1 work?

 Thanks,
 Stephen.



 On Tue, Feb 18, 2014 at 6:32 AM, David Gayou david.ga...@kxen.comwrote:

 I'm so sorry, i wrote an answer, and i forgot to send it
 And i haven't been able to work on this for a few days.


 So far :

 I have a 15k columns table and 50k rows.

 I do not see any changes if i change the storage.


 *Hive 12.0*

 My test query is select * from bigtable


 If i use the hive cli, it works fine.

 If i use hiveserver1 + ODBC : it works fine

 If i use hiveserver2 + odbc or hiveserver2 + beeline, i have this
 java exception:

 2014-02-18 13:22:22,571 ERROR thrift.ProcessFunction
 (ProcessFunction.java:process(41)) - Internal error processing 
 FetchResults

 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2734)
 at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
 at java.util.ArrayList.add(ArrayList.java:351)
  at
 org.apache.hive.service.cli.thrift.TRow.addToColVals(TRow.java:160)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
 at
 org.apache.hive.service.cli.operation.SQLOperation.prepareFromRow(SQLOperation.java:270)
 at
 org.apache.hive.service.cli.operation.SQLOperation.decode(SQLOperation.java:262)
 at
 org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:246)
 at
 org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:171)
 at
 org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:438)
 at
 org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:346)
 at
 org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:407)
 at
 org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)




 *From the SVN trunk* : (for the HIVE-3746)

 With the maven change, most of the documentation and wiki are out of
 date.
 Compiling from trunk was not that easy and i may have gotten some
 steps wrong, but:

 It has the same behavior. It works in CLI and hiveserver1.
 It fails with hiveserver 2.


 Regards

 David Gayou





 On Thu, Feb 13, 2014 at 3:11 AM, Navis류승우 navis@nexr.com wrote:

 With HIVE-3746, which will be included in hive-0.13, HiveServer2
 takes less memory than before.

 Could you try it with the version in trunk?

Re: Issue with Hive and table with lots of column

2014-02-18 Thread Stephen Sprague
He lives on after all! and thanks for the continued feedback.

We need the answers to these questions using HS2:


   1. what is the output of ps -ef | grep -i hiveserver2 on your system?
in particular what is the value of -Xmx ?

   2. does select * from table limit 1 work?

Thanks,
Stephen.



On Tue, Feb 18, 2014 at 6:32 AM, David Gayou david.ga...@kxen.com wrote:

 I'm so sorry, i wrote an answer, and i forgot to send it
 And i haven't been able to work on this for a few days.


 So far :

 I have a 15k columns table and 50k rows.

 I do not see any changes if i change the storage.


 *Hive 12.0*

 My test query is select * from bigtable


 If i use the hive cli, it works fine.

 If i use hiveserver1 + ODBC : it works fine

 If i use hiveserver2 + odbc or hiveserver2 + beeline, i have this java
 exception:

 2014-02-18 13:22:22,571 ERROR thrift.ProcessFunction
 (ProcessFunction.java:process(41)) - Internal error processing FetchResults

 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2734)
 at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
 at java.util.ArrayList.add(ArrayList.java:351)
  at
 org.apache.hive.service.cli.thrift.TRow.addToColVals(TRow.java:160)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
 at
 org.apache.hive.service.cli.operation.SQLOperation.prepareFromRow(SQLOperation.java:270)
 at
 org.apache.hive.service.cli.operation.SQLOperation.decode(SQLOperation.java:262)
 at
 org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:246)
 at
 org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:171)
 at
 org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:438)
 at
 org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:346)
 at
 org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:407)
 at
 org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)




 *From the SVN trunk* : (for the HIVE-3746)

 With the maven change, most of the documentation and wiki are out of date.
 Compiling from trunk was not that easy and i may have gotten some steps
 wrong, but:

 It has the same behavior. It works in CLI and hiveserver1.
 It fails with hiveserver 2.


 Regards

 David Gayou





 On Thu, Feb 13, 2014 at 3:11 AM, Navis류승우 navis@nexr.com wrote:

 With HIVE-3746, which will be included in hive-0.13, HiveServer2 takes
 less memory than before.

 Could you try it with the version in trunk?


 2014-02-13 10:49 GMT+09:00 Stephen Sprague sprag...@gmail.com:

 question to the original poster.  closure appreciated!


 On Fri, Jan 31, 2014 at 12:22 PM, Stephen Sprague sprag...@gmail.comwrote:

 thanks Ed. And on a separate tack, let's look at Hiveserver2.


 @OP

 *I've tried to look around on how i can change the thrift heap size but
 haven't found anything.*


 looking at my hiveserver2 i find this:

$ ps -ef | grep -i hiveserver2
dwr   9824 20479  0 12:11 pts/1    00:00:00 grep -i hiveserver2
dwr  28410     1  0 00:05 ?        00:01:04
 /usr/lib/jvm/java-6-sun/jre/bin/java
 *-Xmx256m* -Dhadoop.log.dir=/usr/lib/hadoop/logs
 -Dhadoop.log.file=hadoop.log
 -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=
 -Dhadoop.root.logger=INFO,console
 -Djava.library.path=/usr/lib/hadoop/lib/native
 -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
 -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar
 /usr/lib/hive/lib/hive-service-0.12.0.jar
 org.apache.hive.service.server.HiveServer2




 questions:

1. what is the output of ps -ef | grep -i hiveserver2 on your
 system? in particular what is the value of -Xmx ?

2. can you restart your hiveserver with -Xmx1g? or some value that
 makes sense to your system?



 Lots of questions now.  we await your answers! :)



 On Fri, Jan 31, 2014 at 11:51 AM, Edward Capriolo 
 edlinuxg...@gmail.com wrote:

 Final table compression should not affect the deserialized size of
 the data over the wire.


 On Fri, Jan 31, 2014 at 2:49 PM, Stephen Sprague 
 sprag...@gmail.comwrote:

 Excellent progress, David.   So.  The most important thing we learned here
 is that it works (!) when running hive in local mode, and that this error is
 a limitation in HiveServer2.  That's important.

 so textfile storage handler and having issues converting it to ORC.
 hmmm.

 follow-ups.

 1. what is your query that fails?

 2. can you add a limit 1 to the end of your query and tell us if
 that works? this'll tell us if it's column or row bound.

 3. bonus points. run these in local mode:
set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK

Re: Hive Query :: Implementing case statement

2014-02-18 Thread Stephen Sprague
maybe consider something along these lines. nb. not tested.

-- temp table holding new balances + key

create table NEW_BALS as
  select * from (
  select b.prev as NEW_BALANCE, a.key from TABLE_SQL a join TABLE_SQL_2
b on (a.key=b.key) where a.code='1'
  UNION ALL
  select b.prev as NEW_BALANCE, a.key from TABLE_SQL a join TABLE_SQL_3
b on (a.key=b.key) where a.code='2'
  ) z
  ;

-- i gave the table a new name but you could use TABLE_SQL instead of
-- NEW_TABLE_SQL for the true update.

insert overwrite table NEW_TABLE_SQL

   select * from (

-- intersected rows - you'll have to line-up the columns correctly
-- substituting b.NEW_BALANCE into the right position
   select  a.col1, b.NEW_BALANCE, a.col2, ... from TABLE_SQL a join
NEW_BALS b on (a.key=b.key)

   UNION ALL

-- non-intersected rows (left outer join so the is-null test actually works)
   select a.* from TABLE_SQL a left outer join NEW_BALS b on (a.key=b.key)
where b.key is null

   ) z
;

there's probably some typos in there but hopefully you get the idea and can
take it from here.
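
and for the true in-place swap mentioned above -- assuming you built the
staging table as NEW_TABLE_SQL -- the last step would just be something
like this (untested, and only once you're happy with what landed there):

   -- overwrite the original table with the corrected rows
   insert overwrite table TABLE_SQL
   select * from NEW_TABLE_SQL;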





On Tue, Feb 18, 2014 at 4:39 AM, yogesh dhari yogeshh...@gmail.com wrote:

 Hello All,

 I have a use case where a table, say TABLE_SQL, is getting updated like this:


 1st Update Command

 update TABLE_SQL a
 set BALANCE = b.prev
 from TABLE_SQL_2 b
 where a.key = b.key and a.code = 1


 2nd Update Command

 update TABLE_SQL a
 set BALANCE = b.prev
 from TABLE_SQL_3 b
 where a.key = b.key and a.code = 2


 the same column is getting updated twice in the sql table.

 I have a table in Hive, say TABLE_HIVE.

 How can I perform this kind of operation in Hive?

 Pls help,

 Thanks in Advance
 Yogesh Kumar













Re: Issue with Hive and table with lots of column

2014-02-18 Thread Stephen Sprague
thanks.

re #1.  we need to find that Hiveserver2 process. For all i know the one
you reported is hiveserver1 (which works.) chances are they use the same
-Xmx value but we really shouldn't make any assumptions.

try wide format on the ps command (eg. ps -efw | grep -i Hiveserver2)

re #2.  okay.  so that tells us it's not the number of columns blowing the
heap but rather the combination of rows + columns.  There's no way it
stores the full result set on the heap even under normal circumstances, so
my guess is there's an internal number of rows it buffers.  sorta like how
unix buffers stdout.  How and where that's set is out of my league.
However, maybe you can get around it by upping your heapsize again, if you
have the available memory of course.


On Tue, Feb 18, 2014 at 8:39 AM, David Gayou david.ga...@kxen.com wrote:


 1. I have no process with hiveserver2 ...

 ps -ef | grep -i hive  returns some pretty long command with a -Xmx8192
 and that's the value set in hive-env.sh


 2. The select * from table limit 1 or even 100 is working correctly.


 David.


 On Tue, Feb 18, 2014 at 4:16 PM, Stephen Sprague sprag...@gmail.comwrote:

 He lives on after all! and thanks for the continued feedback.

 We need the answers to these questions using HS2:



1. what is the output of ps -ef | grep -i hiveserver2 on your
 system? in particular what is the value of -Xmx ?

2. does select * from table limit 1 work?

 Thanks,
 Stephen.



 On Tue, Feb 18, 2014 at 6:32 AM, David Gayou david.ga...@kxen.comwrote:

 I'm so sorry, i wrote an answer, and i forgot to send it
 And i haven't been able to work on this for a few days.


 So far :

 I have a 15k columns table and 50k rows.

 I do not see any changes if i change the storage.


 *Hive 12.0*

 My test query is select * from bigtable


 If i use the hive cli, it works fine.

 If i use hiveserver1 + ODBC : it works fine

 If i use hiveserver2 + odbc or hiveserver2 + beeline, i have this java
 exception:

 2014-02-18 13:22:22,571 ERROR thrift.ProcessFunction
 (ProcessFunction.java:process(41)) - Internal error processing FetchResults

 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2734)
 at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
 at java.util.ArrayList.add(ArrayList.java:351)
  at
 org.apache.hive.service.cli.thrift.TRow.addToColVals(TRow.java:160)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
 at
 org.apache.hive.service.cli.operation.SQLOperation.prepareFromRow(SQLOperation.java:270)
 at
 org.apache.hive.service.cli.operation.SQLOperation.decode(SQLOperation.java:262)
 at
 org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:246)
 at
 org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:171)
 at
 org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:438)
 at
 org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:346)
 at
 org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:407)
 at
 org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)




 *From the SVN trunk* : (for the HIVE-3746)

 With the maven change, most of the documentation and wiki are out of
 date.
 Compiling from trunk was not that easy and i may have gotten some steps
 wrong, but:

 It has the same behavior. It works in CLI and hiveserver1.
 It fails with hiveserver 2.


 Regards

 David Gayou





 On Thu, Feb 13, 2014 at 3:11 AM, Navis류승우 navis@nexr.com wrote:

 With HIVE-3746, which will be included in hive-0.13, HiveServer2 takes
 less memory than before.

 Could you try it with the version in trunk?


 2014-02-13 10:49 GMT+09:00 Stephen Sprague sprag...@gmail.com:

 question to the original poster.  closure appreciated!


 On Fri, Jan 31, 2014 at 12:22 PM, Stephen Sprague 
 sprag...@gmail.comwrote:

 thanks Ed. And on a separate tack, let's look at Hiveserver2.


 @OP

 *I've tried to look around on how i can change the thrift heap size
 but haven't found anything.*


 looking at my hiveserver2 i find this:

$ ps -ef | grep -i hiveserver2
dwr   9824 20479  0 12:11 pts/1    00:00:00 grep -i hiveserver2
dwr  28410     1  0 00:05 ?        00:01:04
 /usr/lib/jvm/java-6-sun/jre/bin/java
 *-Xmx256m* -Dhadoop.log.dir=/usr/lib/hadoop/logs
 -Dhadoop.log.file=hadoop.log
 -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=
 -Dhadoop.root.logger=INFO,console
 -Djava.library.path=/usr/lib/hadoop/lib/native
 -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
 -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar
 /usr/lib/hive/lib/hive-service-0.12.0.jar

Re: Issue with Hive and table with lots of column

2014-02-18 Thread Stephen Sprague
oh. i just noticed the -Xmx value you reported.

there's no M or G after that number??  I'd like to see -Xmx8192M or
-Xmx8G.  That *is* very important.

thanks,
Stephen.


On Tue, Feb 18, 2014 at 9:22 AM, Stephen Sprague sprag...@gmail.com wrote:

 thanks.

 re #1.  we need to find that Hiveserver2 process. For all i know the one
 you reported is hiveserver1 (which works.) chances are they use the same
 -Xmx value but we really shouldn't make any assumptions.

 try wide format on the ps command (eg. ps -efw | grep -i Hiveserver2)

 re #2.  okay.  so that tells us it's not the number of columns blowing the
 heap but rather the combination of rows + columns.  There's no way it
 stores the full result set on the heap even under normal circumstances, so
 my guess is there's an internal number of rows it buffers.  sorta like how
 unix buffers stdout.  How and where that's set is out of my league.
 However, maybe you can get around it by upping your heapsize again, if you
 have the available memory of course.


 On Tue, Feb 18, 2014 at 8:39 AM, David Gayou david.ga...@kxen.com wrote:


 1. I have no process with hiveserver2 ...

 ps -ef | grep -i hive  returns some pretty long command with a -Xmx8192
 and that's the value set in hive-env.sh


 2. The select * from table limit 1 or even 100 is working correctly.


 David.


 On Tue, Feb 18, 2014 at 4:16 PM, Stephen Sprague sprag...@gmail.comwrote:

 He lives on after all! and thanks for the continued feedback.

 We need the answers to these questions using HS2:



1. what is the output of ps -ef | grep -i hiveserver2 on your
 system? in particular what is the value of -Xmx ?

2. does select * from table limit 1 work?

 Thanks,
 Stephen.



 On Tue, Feb 18, 2014 at 6:32 AM, David Gayou david.ga...@kxen.comwrote:

 I'm so sorry, i wrote an answer, and i forgot to send it
 And i haven't been able to work on this for a few days.


 So far :

 I have a 15k columns table and 50k rows.

 I do not see any changes if i change the storage.


 *Hive 12.0*

 My test query is select * from bigtable


 If i use the hive cli, it works fine.

 If i use hiveserver1 + ODBC : it works fine

 If i use hiveserver2 + odbc or hiveserver2 + beeline, i have this java
 exception:

 2014-02-18 13:22:22,571 ERROR thrift.ProcessFunction
 (ProcessFunction.java:process(41)) - Internal error processing FetchResults

 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2734)
 at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
 at java.util.ArrayList.add(ArrayList.java:351)
  at
 org.apache.hive.service.cli.thrift.TRow.addToColVals(TRow.java:160)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
 at
 org.apache.hive.service.cli.operation.SQLOperation.prepareFromRow(SQLOperation.java:270)
 at
 org.apache.hive.service.cli.operation.SQLOperation.decode(SQLOperation.java:262)
 at
 org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:246)
 at
 org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:171)
 at
 org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:438)
 at
 org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:346)
 at
 org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:407)
 at
 org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)




 *From the SVN trunk* : (for the HIVE-3746)

 With the maven change, most of the documentation and wiki are out of
 date.
 Compiling from trunk was not that easy and i may have gotten some steps
 wrong, but:

 It has the same behavior. It works in CLI and hiveserver1.
 It fails with hiveserver 2.


 Regards

 David Gayou





 On Thu, Feb 13, 2014 at 3:11 AM, Navis류승우 navis@nexr.com wrote:

 With HIVE-3746, which will be included in hive-0.13, HiveServer2 takes
 less memory than before.

 Could you try it with the version in trunk?


 2014-02-13 10:49 GMT+09:00 Stephen Sprague sprag...@gmail.com:

 question to the original poster.  closure appreciated!


 On Fri, Jan 31, 2014 at 12:22 PM, Stephen Sprague sprag...@gmail.com
  wrote:

 thanks Ed. And on a separate tack, let's look at Hiveserver2.


 @OP

 *I've tried to look around on how i can change the thrift heap size
 but haven't found anything.*


 looking at my hiveserver2 i find this:

$ ps -ef | grep -i hiveserver2
dwr   9824 20479  0 12:11 pts/1    00:00:00 grep -i
 hiveserver2
dwr  28410     1  0 00:05 ?        00:01:04
 /usr/lib/jvm/java-6-sun/jre/bin/java
 *-Xmx256m* -Dhadoop.log.dir=/usr/lib/hadoop/logs
 -Dhadoop.log.file=hadoop.log
 -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=
 -Dhadoop.root.logger=INFO,console

Re: Views and partitions and performance

2014-02-11 Thread Stephen Sprague
great questions, Burak.  Personally, I had not seen the create view
... partitioned on construct before. Not that that means anything, but thanks
for bringing it out into the forefront!

So, yeah, do we have an SME out there that would like to elaborate on this
beyond the aforementioned url?
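
in the meantime, one comparison that may be worth running (a sketch against
the myview / _mm_dd / hh names from the definitions below -- untested here):

   -- explicit partition predicate on the view; per the report below this is
   -- what lets the pruner kick in
   select count(*) from myview where _mm_dd = '2014-02-11' and hh = '12';

   -- versus no predicate at all, which reportedly falls back to a full scan
   -- select * from myview limit 10;

diffing explain extended output for the two should at least show whether the
pruner ever sees the view's partition columns.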


On Tue, Feb 11, 2014 at 8:04 AM, Burak Gürsoy burakgur...@gmx.net wrote:

 Hi,

 Some background information:

 We have a really large base table (~1P) -- (lets call it raw_table)
 partitioned like _mm_dd=2014-02-11/hh=12/flag=1 (or
 _mm_dd=2014-02-11/hh=12/flag=0)

 I'm testing creating a view on that table to be able to have a consistent
 user-accessible interface instead of letting the users use the actual
 table (we've changed the format of the table multiple times in the past
 and we'll possibly do so again).

 I am creating the view like this[1] and I've also tested view
 partitions[2].

 The version of Hive is 0.10.0

 The issues I came across so far are:

 1) I couldn't figure out the actual purpose of the view partitions.
 The document[3] lists several use cases and mentioned
 infer view partitions automatically based on the partitions of the
 underlying tables
 which leads to an open ticket[4] which in turn lists another one
 (HIVE-1941)
 which is closed. However I fail to see how to create a derived view
 partition.
 And with the definition I've used[2] they don't seem to have the same
 effect.

 2) You need to specify the partitions in the query to the view explicitly
 to have the partition pruning come into effect. However, when I
 say select * from raw_table limit 10, it selects a subset somehow
 (explain and explain extended do not show any targeted partitions with
 such queries -- is this a bug?). But doing the same query on the view leads
 to a full table scan.

 2.1) I see a similar behaviour with things like
 select ... from myview tablesample (bucket 1 out of $bucketsize on $field)

 I'm wondering if this is just the way the views work at the moment or is
 there a workaround to improve, let's say, silly queries? As the interface
 seems inconsistent between the table and view access.

 Thanks,
 Burak



 =[1]=

 CREATE VIEW myview AS SELECT  `combined`.`column1`,
 `combined`.`column2`,
 `combined`.`_mm_dd`,
 `combined`.`hh`,
 `combined`.`column3`,
 `combined`.`column4`,
 `combined`.`column5`
 FROM
 (
 SELECT
 `raw`.`column1`,
 `raw`.`column2`,
 `raw`.`_mm_dd`,
 `raw`.`hh`,
 `raw`.`column3`,
 `raw`.`column4`,
 `raw`.`column5`
 FROM  `default`.`raw_table` `raw`
 ) `combined`

 =[2]=

 CREATE VIEW myview
 PARTITIONED ON (
 _mm_dd,
 hh,
 type,
 persona,
 dc
 )
 AS ( ... )


 =[3]=

 https://cwiki.apache.org/confluence/display/Hive/PartitionedViews





Re: FUNCTION HIVE to DAYS OF WEEK

2014-02-10 Thread Stephen Sprague
oddly enough i don't see one here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions

however, you're not the only one finding something like this useful. cf.
https://issues.apache.org/jira/browse/HIVE-6046

in the meantime it appears as though you'll have to write your own UDF or
perhaps set up a streaming job via the transform() function.

Cheers,
Stephen.

PS. The transform() function is _really_ easy to use so don't be scared
off.  cf.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
(the last example is the simplest.)
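
also, one thing that might tide you over without a custom UDF -- untested,
and it leans on from_unixtime() accepting SimpleDateFormat patterns, so
double-check it on your version:

   -- 'EEEE' is the SimpleDateFormat pattern for the full day name
   select from_unixtime(unix_timestamp('2014-02-10 08:33:00'), 'EEEE');
   -- should print 'Monday'

   -- for a real column (my_table / my_ts are placeholders), cast the
   -- timestamp to a string first if unix_timestamp() is picky about types:
   select from_unixtime(unix_timestamp(cast(my_ts as string)), 'EEEE')
   from my_table;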


On Mon, Feb 10, 2014 at 8:33 AM, Eduardo Parra Valdes | BEEVA 
eduardo.pa...@beeva.com wrote:

 Hello! all
 I wish to find a hive function that returns the day of the week (Monday,
 Tuesday, etc.) given a parameter (timestamp).
 Anyone have an idea of how to do it?

 --

 *Eduardo Parra Valdés*
 eduardo.pa...@beeva.com

 BEE OUR CLIENT

 WWW.BEEVA.COM http://www.beeva.com/

 Clara del Rey 26, 4ª planta.

 [28002] Madrid.
  https://www.facebook.com/somosBEEVA  http://www.linkedin.com/company/beeva  https://twitter.com/beeva_es

  Before you print this message, please consider whether it is really necessary.
  Disclaimer:
 This message, its content and any file attached thereto is for the
 intended recipient only and is confidential. If you have received this
 e-mail in error or had access to it, you should note that the information
 in it is private and any use thereof is unauthorised. In such an event
 please notify us by e-mail. Any reproduction of this e-mail by whatsoever
 means and any transmission or dissemination thereof to other persons is
 prohibited. It should be deleted immediately from your system.


