Re: Serde moved? version 2.3.0

2017-10-25 Thread Stephen Sprague
thanks guys for the clarification.  i'll work on perhaps finding the source
code and re-implementing if i can.  good to know.
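
(As a reference point for that re-implementation: porting one of these plugins generally means extending AbstractSerDe instead of implementing the removed SerDe interface and recompiling against hive-serde 2.3.0. Below is a rough, untested sketch only: the package, class name and method bodies are hypothetical placeholders, and the exact set of abstract methods should be double-checked against the hive-serde 2.3.0 javadoc.)

package com.example.serde;  // hypothetical package for a third-party plugin

import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Port of a plugin that used to "implement SerDe": extend AbstractSerDe instead
// and recompile. All method bodies below are stubs for illustration only.
public class FooSerDe extends AbstractSerDe {

  @Override
  public void initialize(Configuration conf, Properties tblProps) throws SerDeException {
    // parse table properties and build the row ObjectInspector here
  }

  @Override
  public ObjectInspector getObjectInspector() throws SerDeException {
    return null;  // return the row ObjectInspector built in initialize()
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    return null;  // turn the raw Writable (e.g. a line of JSON/XML) into a row object
  }

  @Override
  public Class<? extends Writable> getSerializedClass() {
    return Text.class;
  }

  @Override
  public Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException {
    return new Text();  // turn a row object back into a Writable
  }

  @Override
  public SerDeStats getSerDeStats() {
    return null;  // no statistics collected in this stub
  }
}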



On Wed, Oct 25, 2017 at 6:32 PM, Owen O'Malley <owen.omal...@gmail.com>
wrote:

> I considered that, but it won't work.
>
> The Hive 2.2 code looks like:
>
> public interface SerDe { ... }
> public abstract class AbstractSerDe implements SerDe { ... }
>
> ... a lot of code using SerDe ...
>
> The Hive 2.3 code looks like:
>
> public abstract class AbstractSerDe {  }
>
> ... a lot of code using AbstractSerDe ...
>
> You could create a SerDe interface to avoid the ClassNotFound exception,
> but I assume the external classes do:
>
> public class FooSerDe implements SerDe { ... }
>
> when Hive 2.3 tries to use the instance as an AbstractSerDe, it will fail.
>
> With SerDe being an interface and AbstractSerDe being an abstract class,
> we also can't play games with inverting the class hierarchy.
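
(To make that failure mode concrete, here is a tiny self-contained sketch; the nested classes are stand-ins for the real Hive types, not actual Hive code.)

// A "bridge" jar that merely recreates the SerDe interface fixes the
// ClassNotFound problem, but Hive 2.3 still casts the loaded plugin to
// AbstractSerDe, and a class that only implements the recreated interface
// fails that cast at runtime.
public class BridgeSketch {

  interface SerDe { }                          // recreated marker interface
  static abstract class AbstractSerDe { }      // what Hive 2.3 actually expects

  static class FooSerDe implements SerDe { }   // the third-party plugin, unchanged

  public static void main(String[] args) {
    Object plugin = new FooSerDe();                // class loading now succeeds...
    AbstractSerDe serde = (AbstractSerDe) plugin;  // ...but this throws ClassCastException
    System.out.println(serde);
  }
}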
>
> .. Owen
>
> On Wed, Oct 25, 2017 at 4:38 PM, Matt Burgess <mattyb...@gmail.com> wrote:
>
>> Perhaps a fourth option is a “bridge” JAR, to implement the original
>> class by subclassing AbstractSerDe, unless the API has changed such that
>> such a mapping cannot be done.
>>
>> Regards,
>> Matt
>>
>>
>>
>> On Oct 25, 2017, at 7:31 PM, Owen O'Malley <owen.omal...@gmail.com>
>> wrote:
>>
>>
>> On Wed, Oct 25, 2017 at 3:20 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> i see. interesting.  i think this breaks a ton of open-source SerDes
>>> we've all downloaded off the internet and have been using for years.  openx
>>> json serde and ibm xml serde come to mind. does this change render all
>>> these incompatible now with 2.3.0?
>>>
>>
>> Unfortunately, yes. With Hive 2.2.0, the SerDe class is still there. It
>> was removed in 2.3.0. I suspect the developer didn't realize there were
>> plugins that would break.
>>
>>
>>>
>>> I mean all i've done in the past is download the jar file and put it in the
>>> lib (or auxlib) dir and i was good to go.  That clearly isn't going to fly
>>> anymore then.
>>>
>>> hmmm.  again i'm not a java weenie so rewriting java code isn't
>>> something i would be proficient at. But it sounds like from a devops
>>> standpoint there's no magic jar i can put in the class path to make these
>>> work?  that kinda is a deal-breaker to upgrade then.
>>>
>>
>> Unfortunately, not. Looking at the change, the code now requires the
>> user's serde to implement AbstractSerDe. If it doesn't, it will break.
>>
>> Effectively, the choices seem to be:
>>
>>1. Change the plugin and recompile it.
>>2. Upgrade to Hive 2.2 instead of 2.3.
>>3. Make a case for reverting the change. I'm not sure what the
>>original motivation of the change was. It seems like it was effectively a
>>clean up.
>>
>> .. Owen
>>
>>
>>> Am i interpreting this correctly?
>>>
>>> On Wed, Oct 25, 2017 at 2:31 PM, Owen O'Malley <owen.omal...@gmail.com>
>>> wrote:
>>>
>>>> SerDe was removed by https://issues.apache.org/jira/browse/HIVE-15167
>>>>
>>>> You should use AbstractSerDe instead.
>>>>
>>>> .. Owen
>>>>
>>>> On Oct 25, 2017, at 2:18 PM, Stephen Sprague <sprag...@gmail.com>
>>>> wrote:
>>>>
>>>> hey guys,
>>>>
>>>> could be a dumb question but not being a java type of guy i'm not quite
>>>> sure about it. I'm upgrading from 2.1.0 to 2.3.0 and encountering this
>>>> error:
>>>>
>>>> class not found: org/apache/hadoop/hive/serde2/SerDe
>>>>
>>>> so in hive 2.1.0 i see it in this jar:
>>>>
>>>> * hive-serde-2.1.0.jar
>>>> org/apache/hadoop/hive/serde2/SerDe.class
>>>>
>>>>
>>>> but in hive 2.3.0 i don't see it in hive-serde-2.3.0.jar.
>>>>
>>>>
>>>> so i ask where did it go in version 2.3.0?
>>>>
>>>> thanks,
>>>> Stephen
>>>>
>>>>
>>>>
>>>
>>
>


Fwd: Serde moved? version 2.3.0

2017-10-25 Thread Stephen Sprague
i see. interesting.  i think this breaks a ton of open-source SerDes we've
all downloaded off the internet and have been using for years.  openx json
serde and ibm xml serde come to mind. does this change render all these
incompatible now with 2.3.0?

I mean all i've done in the past is download the jar file and put it in the
lib (or auxlib) dir and i was good to go.  That clearly isn't going to fly
anymore then.

hmmm.  again i'm not a java weenie so rewriting java code isn't something i
would be proficient at. But it sounds like from a devops standpoint there's
no magic jar i can put in the class path to make these work?  that kinda is
a deal-breaker to upgrade then.


Am i interpreting this correctly?

On Wed, Oct 25, 2017 at 2:31 PM, Owen O'Malley <owen.omal...@gmail.com>
wrote:

> SerDe was removed by https://issues.apache.org/jira/browse/HIVE-15167
>
> You should use AbstractSerDe instead.
>
> .. Owen
>
> On Oct 25, 2017, at 2:18 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>
> hey guys,
>
> could be a dumb question but not being a java type of guy i'm not quite
> sure about it. I'm upgrading from 2.1.0 to 2.3.0 and encountering this
> error:
>
> class not found: org/apache/hadoop/hive/serde2/SerDe
>
> so in hive 2.1.0 i see it in this jar:
>
> * hive-serde-2.1.0.jar
> org/apache/hadoop/hive/serde2/SerDe.class
>
>
> but in hive 2.3.0 i don't see it in hive-serde-2.3.0.jar.
>
>
> so i ask where did it go in version 2.3.0?
>
> thanks,
> Stephen
>
>
>


Serde moved? version 2.3.0

2017-10-25 Thread Stephen Sprague
hey guys,

could be a dumb question but not being a java type of guy i'm not quite
sure about it. I'm upgrading from 2.1.0 to 2.3.0 and encountering this
error:

class not found: org/apache/hadoop/hive/serde2/SerDe

so in hive 2.1.0 i see it in this jar:

* hive-serde-2.1.0.jar
org/apache/hadoop/hive/serde2/SerDe.class


but in hive 2.3.0 i don't see it in hive-serde-2.3.0.jar.


so i ask where did it go in version 2.3.0?

thanks,
Stephen


Re: hive on spark - why is it so hard?

2017-10-01 Thread Stephen Sprague
so...  i made some progress after much copying of jar files around (as
alluded to by Gopal previously on this thread).


following the instructions here:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

and doing this as instructed will leave off about a dozen or so jar files
that spark'll need:
  ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

i ended up copying the missing jars to $SPARK_HOME/jars, but i would have
preferred to just add a path (or paths) to the spark classpath; i did not find
any effective way to do that. In hive you can specify HIVE_AUX_JARS_PATH
but i don't see the analogous var in spark - i don't think it inherits the
hive classpath.

anyway a simple query is now working under Hive On Spark so i think i might
be over the hump.  Now its a matter of comparing the performance with Tez.

Cheers,
Stephen.


On Wed, Sep 27, 2017 at 9:37 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> ok.. getting further.  seems now i have to deploy hive to all nodes in the
> cluster - don't think i had to do that before but not a big deal to do it
> now.
>
> for me:
> HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
> SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6
>
> on all three nodes now.
>
> i started spark master on the namenode and i started spark slaves (2) on
> two datanodes of the cluster.
>
> so far so good.
>
> now i run my usual test command.
>
> $ hive --hiveconf hive.root.logger=DEBUG,console -e 'set
> hive.execution.engine=spark; select date_key, count(*) from
> fe_inventory.merged_properties_hist group by 1 order by 1;'
>
> i get a little further now and find the stderr from the Spark Web UI
> interface (nice) and it reports this:
>
> 17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to 
> spark://Worker@172.19.79.127:40145
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
>   at 
> org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:47)
>   at 
> org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
>   at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
>   ... 6 more
>
>
>
> searching around the internet i find this is probably a compatibility
> issue.
>
> i know. i know. no surprise here.
>
> so i guess i just got to the point where everybody else is... build spark
> w/o hive.
>
> lemme see what happens next.
>
>
>
>
>
> On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> thanks.  I haven't had a chance to dig into this again today but i do
>> appreciate the pointer.  I'll keep you posted.
>>
>> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar <takiar.sa...@gmail.com>
>> wrote:
>>
>>> You can try increasing the value of hive.spark.client.connect.timeout.
>>> Would also suggest taking a look at the HoS Remote Driver logs. The driver
>>> gets launched in a YARN container (assuming you are running Spark in
>>> yarn-client mode), so you just have to find the logs for that container.
>>>
>>> --Sahil
>>>
>>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague <sprag...@gmail.com>
>>> wrote:
>>>
>>>> i _seem_ to be getting closer.  Maybe its just wishful thinking.
>>>> Here's where i'm at now.
>>>>
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>>> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
>>>> CreateSubmissionResponse:
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>>>   "action" : "CreateSubmissionResponse",
>>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>>>   "message" : "Driver successfully submitted as 
>>>> driver-20170926211038-0003",
>>>> 2017-09-2

Re: hive on spark - why is it so hard?

2017-09-27 Thread Stephen Sprague
ok.. getting further.  seems now i have to deploy hive to all nodes in the
cluster - don't think i had to do that before but not a big deal to do it
now.

for me:
HIVE_HOME=/usr/lib/apache-hive-2.3.0-bin/
SPARK_HOME=/usr/lib/spark-2.2.0-bin-hadoop2.6

on all three nodes now.

i started spark master on the namenode and i started spark slaves (2) on
two datanodes of the cluster.

so far so good.

now i run my usual test command.

$ hive --hiveconf hive.root.logger=DEBUG,console -e 'set
hive.execution.engine=spark; select date_key, count(*) from
fe_inventory.merged_properties_hist group by 1 order by 1;'

i get a little further now and find the stderr from the Spark Web UI
interface (nice) and it reports this:

17/09/27 20:47:35 INFO WorkerWatcher: Successfully connected to
spark://Worker@172.19.79.127:40145
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at 
org.apache.hive.spark.client.rpc.RpcConfiguration.<init>(RpcConfiguration.java:47)
at 
org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
... 6 more



searching around the internet i find this is probably a compatibility issue.

i know. i know. no surprise here.

so i guess i just got to the point where everybody else is... build spark
w/o hive.

lemme see what happens next.





On Wed, Sep 27, 2017 at 7:41 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> thanks.  I haven't had a chance to dig into this again today but i do
> appreciate the pointer.  I'll keep you posted.
>
> On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar <takiar.sa...@gmail.com>
> wrote:
>
>> You can try increasing the value of hive.spark.client.connect.timeout.
>> Would also suggest taking a look at the HoS Remote Driver logs. The driver
>> gets launched in a YARN container (assuming you are running Spark in
>> yarn-client mode), so you just have to find the logs for that container.
>>
>> --Sahil
>>
>> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> i _seem_ to be getting closer.  Maybe its just wishful thinking.
>>> Here's where i'm at now.
>>>
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
>>> CreateSubmissionResponse:
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "action" : "CreateSubmissionResponse",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "message" : "Driver successfully submitted as driver-20170926211038-0003",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "serverSparkVersion" : "2.2.0",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "submissionId" : "driver-20170926211038-0003",
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>>> "success" : true
>>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
>>> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
>>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>>> Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.1
>>> 9.73.136:8020 from dwr: closed
>>> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
>>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>>> Clien
>>> t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
>>> from dwr: stopped, remaining connections 0
>>> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e
>>> main] client.SparkClientImpl: Timed out waiting for client to connect.
>>> *Possible reasons include network issues, errors in remote driver or the
>>> cluster has no available resou

Re: hive on spark - why is it so hard?

2017-09-27 Thread Stephen Sprague
thanks.  I haven't had a chance to dig into this again today but i do
appreciate the pointer.  I'll keep you posted.

On Wed, Sep 27, 2017 at 10:14 AM, Sahil Takiar <takiar.sa...@gmail.com>
wrote:

> You can try increasing the value of hive.spark.client.connect.timeout.
> Would also suggest taking a look at the HoS Remote Driver logs. The driver
> gets launched in a YARN container (assuming you are running Spark in
> yarn-client mode), so you just have to find the logs for that container.
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 9:17 PM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's
>> where i'm at now.
>>
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> 17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
>> CreateSubmissionResponse:
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "action" : "CreateSubmissionResponse",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "message" : "Driver successfully submitted as driver-20170926211038-0003",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "serverSparkVersion" : "2.2.0",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "submissionId" : "driver-20170926211038-0003",
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
>> "success" : true
>> 2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
>> 2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>> Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.1
>> 9.73.136:8020 from dwr: closed
>> 2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
>> dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
>> Clien
>> t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
>> from dwr: stopped, remaining connections 0
>> 2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e
>> main] client.SparkClientImpl: Timed out waiting for client to connect.
>> *Possible reasons include network issues, errors in remote driver or the
>> cluster has no available resources, etc.*
>> *Please check YARN or Spark driver's logs for further information.*
>> java.util.concurrent.ExecutionException: 
>> java.util.concurrent.TimeoutException:
>> Timed out waiting for client connection.
>> at 
>> io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
>> ~[netty-all-4.0.29.Final.jar:4.0.29.Final]
>> at 
>> org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:108)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at 
>> org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.c
>> reateRemoteClient(RemoteHiveSparkClient.java:101)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<
>> init>(RemoteHiveSparkClient.java:97) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.
>> createHiveSparkClient(HiveSparkClientFactory.java:73)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImp
>> l.open(SparkSessionImpl.java:62) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionMan
>> agerImpl.getSession(SparkSessionManagerImpl.java:115)
>> [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSpark
>> Session(SparkUtilities.java:126) [hive-exec-2.3.0.jar:2.3.0]
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
>> [hive-exec-2.3.0.jar:2.3.0]
>>
>>
>> i'll dig some more tomorrow.
>>
>> On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep
>>> you posted on my progress.
>>>
>>> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan <gop...@apache.org
&

Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
i _seem_ to be getting closer.  Maybe its just wishful thinking.   Here's
where i'm at now.

2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
17/09/26 21:10:38 INFO rest.RestSubmissionClient: Server responded with
CreateSubmissionResponse:
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: {
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"action" : "CreateSubmissionResponse",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"message" : "Driver successfully submitted as driver-20170926211038-0003",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"serverSparkVersion" : "2.2.0",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"submissionId" : "driver-20170926211038-0003",
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl:
"success" : true
2017-09-26T21:10:38,892  INFO [stderr-redir-1] client.SparkClientImpl: }
2017-09-26T21:10:45,701 DEBUG [IPC Client (425015667) connection to
dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC
Client (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
from dwr: closed
2017-09-26T21:10:45,702 DEBUG [IPC Client (425015667) connection to
dwrdevnn1.sv2.trulia.com/172.19.73.136:8020 from dwr] ipc.Client: IPC Clien
t (425015667) connection to dwrdevnn1.sv2.trulia.com/172.19.73.136:8020
from dwr: stopped, remaining connections 0
2017-09-26T21:12:06,719 ERROR [2337b36e-86ca-47cd-b1ae-f0b32571b97e main]
client.SparkClientImpl: Timed out waiting for client to connect.
*Possible reasons include network issues, errors in remote driver or the
cluster has no available resources, etc.*
*Please check YARN or Spark driver's logs for further information.*
java.util.concurrent.ExecutionException:
java.util.concurrent.TimeoutException: Timed out waiting for client
connection.
at
io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
~[netty-all-4.0.29.Final.jar:4.0.29.Final]
at
org.apache.hive.spark.client.SparkClientImpl.<init>(SparkClientImpl.java:108)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hive.spark.client.SparkClientFactory.createClient(SparkClientFactory.java:80)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.createRemoteClient(RemoteHiveSparkClient.java:101)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:97)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:73)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:62)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:115)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:126)
[hive-exec-2.3.0.jar:2.3.0]
at
org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerParallelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:236)
[hive-exec-2.3.0.jar:2.3.0]


i'll dig some more tomorrow.

On Tue, Sep 26, 2017 at 8:23 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
> posted on my progress.
>
> On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>> Hi,
>>
>> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
>> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
>> to create spark client.
>>
>> I get inexplicable errors with Hive-on-Spark unless I do a three step
>> build.
>>
>> Build Hive first, use that version to build Spark, use that Spark version
>> to rebuild Hive.
>>
>> I have to do this to make it work because Spark contains Hive jars and
>> Hive contains Spark jars in the class-path.
>>
>> And specifically I have to edit the pom.xml files, instead of passing in
>> params with -Dspark.version, because the installed pom files don't get
>> replacements from the build args.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
oh. i missed Gopal's reply.  oy... that sounds foreboding.  I'll keep you
posted on my progress.

On Tue, Sep 26, 2017 at 4:40 PM, Gopal Vijayaraghavan 
wrote:

> Hi,
>
> > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a
> spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed
> to create spark client.
>
> I get inexplicable errors with Hive-on-Spark unless I do a three step
> build.
>
> Build Hive first, use that version to build Spark, use that Spark version
> to rebuild Hive.
>
> I have to do this to make it work because Spark contains Hive jars and
> Hive contains Spark jars in the class-path.
>
> And specifically I have to edit the pom.xml files, instead of passing in
> params with -Dspark.version, because the installed pom files don't get
> replacements from the build args.
>
> Cheers,
> Gopal
>
>
>


Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
well this is the spark-submit line from above:

   2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
main] client.SparkClientImpl: Running client driver with argv:
/usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit

and that's pretty clearly v2.2


I do have other versions of spark on the namenode so lemme remove those and
see what happens


A-HA! dang it!

$ echo $SPARK_HOME
/usr/local/spark

well that clearly needs to be: /usr/lib/spark-2.2.0-bin-hadoop2.6

how did i miss that? unbelievable.


Thank you Sahil!   Lets see what happens next!

Cheers,
Stephen


On Tue, Sep 26, 2017 at 4:12 PM, Sahil Takiar <takiar.sa...@gmail.com>
wrote:

> Are you sure you are using Spark 2.2.0? Based on the stack-trace it looks
> like your call to spark-submit is using an older version of Spark (looks
> like some early 1.x version). Do you have SPARK_HOME set locally? Do you
> have older versions of Spark installed locally?
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 3:33 PM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> thanks Sahil.  here it is.
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/scheduler/SparkListenerInterface
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:344)
>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.
>> scala:318)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:
>> 75)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.scheduler.SparkListenerInterface
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 5 more
>>
>> at 
>> org.apache.hive.spark.client.rpc.RpcServer.cancelClient(RpcServer.java:212)
>> ~[hive-exec-2.3.0.jar:2.3.0]
>> at 
>> org.apache.hive.spark.client.SparkClientImpl$3.run(SparkClientImpl.java:500)
>> ~[hive-exec-2.3.0.jar:2.3.0]
>> at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_25]
>> FAILED: SemanticException Failed to get a spark session:
>> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
>> client.
>> 2017-09-26T14:04:46,470 ERROR [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
>> main] ql.Driver: FAILED: SemanticException Failed to get a spark session:
>> org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark
>> client.
>> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark
>> session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to
>> create spark client.
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.getSparkMemoryAndCores(SetSparkReducerParallelism.java:240)
>> at org.apache.hadoop.hive.ql.optimizer.spark.SetSparkReducerPar
>> allelism.process(SetSparkReducerParallelism.java:173)
>> at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch
>> (DefaultRuleDispatcher.java:90)
>> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAnd
>> Return(DefaultGraphWalker.java:105)
>> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(De
>> faultGraphWalker.java:89)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:56)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:61)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:61)
>> at org.apache.hadoop.hive.ql.lib.PreOrderWalker.walk(PreOrderWa
>> lker.java:61)
>> at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalkin
>> g(DefaultGraphWalker.java:120)
>> at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.runSetRe
>> ducerParallelism(SparkCompiler.java:288)
>> at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimize
>> OperatorPlan(SparkCompiler.java:122)
>> at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCom
>> piler.java:140)
>> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInte
>> rnal(SemanticAnalyzer.java:11253)
>> at org.apache.hadoop.hive.ql.parse.

Re: hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
> Can you send the full stack trace for the NoClassDefFoundError? For Hive
> 2.3.0, we only support Spark 2.0.0. Hive may work with more recent versions
> of Spark, but we only test with Spark 2.0.0.
>
> --Sahil
>
> On Tue, Sep 26, 2017 at 2:35 PM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> * i've installed hive 2.3 and spark 2.2
>>
>> * i've read this doc plenty of times -> https://cwiki.apache.org/confl
>> uence/display/Hive/Hive+on+Spark%3A+Getting+Started
>>
>> * i run this query:
>>
>>hive --hiveconf hive.root.logger=DEBUG,console -e 'set
>> hive.execution.engine=spark; select date_key, count(*) from
>> fe_inventory.merged_properties_hist group by 1 order by 1;'
>>
>>
>> * i get this error:
>>
>> *   Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/scheduler/SparkListenerInterface*
>>
>>
>> * this class is in:
>>   /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar
>>
>> * i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars
>>
>> * i have updated hive-site.xml to set spark.yarn.jars to it.
>>
>> * i see this in the console:
>>
>> 2017-09-26T13:34:15,505  INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3
>> main] spark.HiveSparkClientFactory: load spark property from hive
>> configuration (spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.co
>> m:8020/spark-2.2-jars/*).
>>
>> * i see this on the console
>>
>> 2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3
>> main] client.SparkClientImpl: Running client driver with argv:
>> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file
>> /tmp/spark-submit.6105784757200912217.properties --class
>> org.apache.hive.spark.client.RemoteDriver 
>> /usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar
>> --remote-host dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf
>> hive.spark.client.connect.timeout=1000 --conf
>> hive.spark.client.server.connect.timeout=9 --conf
>> hive.spark.client.channel.log.level=null --conf
>> hive.spark.client.rpc.max.size=52428800 --conf
>> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
>> --conf hive.spark.client.rpc.server.address=null
>>
>> * i even print out CLASSPATH in this script:
>> /usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit
>>
>> and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar is
>> in it.
>>
>> ​so i ask... what am i missing?
>>
>> thanks,
>> Stephen​
>>
>>
>>
>>
>>
>>
>
>
> --
> Sahil Takiar
> Software Engineer at Cloudera
> takiar.sa...@gmail.com | (510) 673-0309
>


hive on spark - why is it so hard?

2017-09-26 Thread Stephen Sprague
* i've installed hive 2.3 and spark 2.2

* i've read this doc plenty of times ->
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

* i run this query:

   hive --hiveconf hive.root.logger=DEBUG,console -e 'set
hive.execution.engine=spark; select date_key, count(*) from
fe_inventory.merged_properties_hist group by 1 order by 1;'


* i get this error:

*   Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/scheduler/SparkListenerInterface*


* this class is in:
  /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar

* i have copied all the spark jars to hdfs://dwrdevnn1/spark-2.2-jars

* i have updated hive-site.xml to set spark.yarn.jars to it.

* i see this in the console:

2017-09-26T13:34:15,505  INFO [334aa7db-ad0c-48c3-9ada-467aaf05cff3 main]
spark.HiveSparkClientFactory: load spark property from hive configuration
(spark.yarn.jars -> hdfs://dwrdevnn1.sv2.trulia.com:8020/spark-2.2-jars/*).

* i see this on the console

2017-09-26T14:04:45,678  INFO [4cb82b6d-9568-4518-8e00-f0cf7ac58cd3 main]
client.SparkClientImpl: Running client driver with argv:
/usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit --properties-file
/tmp/spark-submit.6105784757200912217.properties --class
org.apache.hive.spark.client.RemoteDriver
/usr/lib/apache-hive-2.3.0-bin/lib/hive-exec-2.3.0.jar --remote-host
dwrdevnn1.sv2.trulia.com --remote-port 53393 --conf
hive.spark.client.connect.timeout=1000 --conf
hive.spark.client.server.connect.timeout=9 --conf
hive.spark.client.channel.log.level=null --conf
hive.spark.client.rpc.max.size=52428800 --conf
hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
--conf hive.spark.client.rpc.server.address=null

* i even print out CLASSPATH in this script:
/usr/lib/spark-2.2.0-bin-hadoop2.6/bin/spark-submit

and /usr/lib/spark-2.2.0-bin-hadoop2.6/jars/spark-core_2.11-2.2.0.jar is in
it.

​so i ask... what am i missing?

thanks,
Stephen​


group by + two nulls in a row = bug?

2017-06-27 Thread Stephen Sprague
i'm running hive version 2.1.0 and found this interesting. i've broken it
down into a trivial test case below.

i run this:

 select a.date_key,
a.property_id,
cast(NULL as bigint) as malone_id,
cast(NULL as bigint) as zpid,
sum(coalesce(a.sum_365,0)) as sum_365
 from
(select 20170626 as date_key, 123 as property_id, 10 as
sum_365 ) a

 group by
1
,2
,3
,4

and i get:

+----------+-------------+-----------+--------+---------+
| date_key | property_id | malone_id | zpid   | sum_365 |
+----------+-------------+-----------+--------+---------+
| 20170626 | 123         | [NULL]    | [NULL] | [NULL]  |
+----------+-------------+-----------+--------+---------+

why do you suppose sum_365 is null?


then i run this:

 select a.date_key,
a.property_id,
cast(NULL as bigint) as malone_id,
sum(coalesce(a.sum_365,0)) as sum_365
 from
(select 20170626 as date_key, 123 as property_id, 10 as
sum_365 ) a

 group by
1
,2
,3

and i get:


+----------+-------------+-----------+---------+
| date_key | property_id | malone_id | sum_365 |
+----------+-------------+-----------+---------+
| 20170626 | 123         | [NULL]    | 10      |
+----------+-------------+-----------+---------+


what do you think? is it me? or is it hive? (looks to me like grouping by two
NULLs in a row causes a problem.)


any hive release imminent?

2017-06-19 Thread Stephen Sprague
Hey guys,
Is there any word out on the street about a timeframe for the next 2.x hive
release? Looks like Dec 2016 was the last one.

The natives are getting restless i think. :)

thanks,
Stephen.


Re: How to setup the max memory for my big Hive SQL which is on MapReduce of Yarn

2017-06-06 Thread Stephen Sprague
have you researched the yarn schedulers?  namely the capacity and fair
schedulers? those are the places where resource limits can be easily
defined.

On Mon, Jun 5, 2017 at 9:25 PM, Chang.Wu <583424...@qq.com> wrote:

> My Hive engine is MapReduce and Yarn. What my urgent need is to limit the
> memory usage of my big sql so that my bigger sql will run a longer time
> instead of using up all the resource of queue or even all the resource of
> my whole yarn cluster at once.
> But I cannot find any solution to define my sql resource-usage upper
> threshold.
> Can anyone give me some suggestions?
>
> 583424...@qq.com
>


Re: drop table - external - aws

2017-05-17 Thread Stephen Sprague
yeah. that's a potential idea too.  gotta put the time in to script it with
200+ tables though.

On Wed, May 17, 2017 at 10:07 AM, Furcy Pin <furcy@flaminem.com> wrote:

> Did you try ALTER TABLE ... SET LOCATION ... ? maybe it could have worked.
>
>
> On Wed, May 17, 2017 at 6:57 PM, Vihang Karajgaonkar <vih...@cloudera.com>
> wrote:
>
>> This is interesting and possibly a bug. Did you try changing them to
>> managed tables and then dropping or truncating them? How do we reproduce
>> this on our setup?
>>
>> On Tue, May 16, 2017 at 6:38 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> fwiw. i ended up re-creating the ec2 cluster with that same host name
>>> just so i could drop those tables from the metastore.
>>>
>>> note to self.  be careful - be real careful - with "sharing" hive
>>> metastores between different compute paradigms.
>>>
>>> Regards,
>>> Stephen.
>>>
>>> On Tue, May 16, 2017 at 6:38 AM, Stephen Sprague <sprag...@gmail.com>
>>> wrote:
>>>
>>>> hey guys,
>>>> here's something bizarre.   i created about 200 external tables with a
>>>> location something like this 'hdfs://<host>/path'.  this was three
>>>> months ago and now i'm revisiting and want to drop these tables.
>>>>
>>>> ha! no can do!
>>>>
>>>> that <host> is long gone.
>>>>
>>>> Upon issuing the drop table command i get this:
>>>>
>>>> Error while processing statement: FAILED: Execution Error, return code
>>>> 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
>>>> MetaException(message:java.lang.IllegalArgumentException:
>>>> java.net.UnknownHostException: <host>)
>>>>
>>>> where <host> is that old host name.
>>>>
>>>> so i ask is there a work around for this?  given they are external
>>>> tables i'm surprised it "checks" that that location exists (or not.)
>>>>
>>>> thanks,
>>>> Stephen
>>>>
>>>
>>>
>>
>


Re: drop table - external - aws

2017-05-17 Thread Stephen Sprague
hi vihang,
here's how to duplicate (hive version 2.1.0).

1. install hadoop on a host called "test-nn"

2. install hive on a host called  "hive-client" and use a metastore on
"hive-client" (or anywhere else except "test-nn").

3. using hive on "hive-client" do this: create external table foo(a int)
location 'hdfs://test-nn/hive_root';

4. shutdown host "test-nn" or rename it.

5. using hive on "hive-client" try dropping table foo.

pretty sure step 5 is a metastore-only operation and hence does not need an
active hadoop cluster to work. if it were a managed table then yeah we'd
need to bring up another cluster.

but you get the idea, right?

On Wed, May 17, 2017 at 9:57 AM, Vihang Karajgaonkar <vih...@cloudera.com>
wrote:

> This is interesting and possibly a bug. Did you try changing them to
> managed tables and then dropping or truncating them? How do we reproduce
> this on our setup?
>
> On Tue, May 16, 2017 at 6:38 PM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> fwiw. i ended up re-creating the ec2 cluster with that same host name
>> just so i could drop those tables from the metastore.
>>
>> note to self.  be careful - be real careful - with "sharing" hive
>> metastores between different compute paradigms.
>>
>> Regards,
>> Stephen.
>>
>> On Tue, May 16, 2017 at 6:38 AM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> hey guys,
>>> here's something bizarre.   i created about 200 external tables with a
>>> location something like this 'hdfs://<host>/path'.  this was three
>>> months ago and now i'm revisiting and want to drop these tables.
>>>
>>> ha! no can do!
>>>
>>> that <host> is long gone.
>>>
>>> Upon issuing the drop table command i get this:
>>>
>>> Error while processing statement: FAILED: Execution Error, return code 1
>>> from org.apache.hadoop.hive.ql.exec.DDLTask.
>>> MetaException(message:java.lang.IllegalArgumentException:
>>> java.net.UnknownHostException: <host>)
>>>
>>> where <host> is that old host name.
>>>
>>> so i ask is there a work around for this?  given they are external
>>> tables i'm surprised it "checks" that that location exists (or not.)
>>>
>>> thanks,
>>> Stephen
>>>
>>
>>
>


Re: drop table - external - aws

2017-05-16 Thread Stephen Sprague
fwiw. i ended up re-creating the ec2 cluster with that same host name just
so i could drop those tables from the metastore.

note to self.  be careful - be real careful - with "sharing" hive
metastores between different compute paradigms.

Regards,
Stephen.

On Tue, May 16, 2017 at 6:38 AM, Stephen Sprague <sprag...@gmail.com> wrote:

> hey guys,
> here's something bizarre.   i created about 200 external tables with a
> location something like this 'hdfs://<host>/path'.  this was three
> months ago and now i'm revisiting and want to drop these tables.
>
> ha! no can do!
>
> that <host> is long gone.
>
> Upon issuing the drop table command i get this:
>
> Error while processing statement: FAILED: Execution Error, return code 1
> from org.apache.hadoop.hive.ql.exec.DDLTask. 
> MetaException(message:java.lang.IllegalArgumentException:
> java.net.UnknownHostException: <host>)
>
> where <host> is that old host name.
>
> so i ask is there a work around for this?  given they are external tables
> i'm surprised it "checks" that that location exists (or not.)
>
> thanks,
> Stephen
>


Re: hive on spark - version question

2017-03-17 Thread Stephen Sprague
yeah but... is the glass half-full or half-empty?  sure this might suck but
keep your head high, bro! Lots of it (hive) does work. :)


On Fri, Mar 17, 2017 at 2:25 PM, hernan saab <hernan_javier_s...@yahoo.com>
wrote:

> Stephen,
>
> Thanks for the response.
>
> The one thing that I don't appreciate from those who promote and DOCUMENT
> spark on hive is that, seemingly, there is absolutely no evidence seen that
> says that hive on spark WORKS.
> As a matter of fact, after a lot of pain, I noticed it is not supported by
> just about anybody.
>
> If someone dares to document Hive on Spark (see link
> https://cwiki.apache.org/confluence/display/Hive/Hive+
> on+Spark%3A+Getting+Started)  why can't they have the decency to mention
> what specific combo of Hadoop/Spark/Hive versions used that works? Have a
> git repo included in a doc with all the right versions and libraries. Why
> not? We can start from there and progressively use newer libraries in case
> the doc becomes stale. I am not really asking much, I just want to know
> what the documenter used to claim that Hive on Spark works, that's it.
>
> Clearly, for most cases, this setup is broken and it misleads people to
> waste time on a broken setup.
>
> I love this tech. But I do notice that there is some mean spirited or very
> negligent actions made by the apache development community. Documenting
> hive on spark while knowing it won't work for most cases means apache
> developers don't give a crap about the time wasted by people like us.
>
>
>
>
> On Friday, March 17, 2017 1:14 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>
>
>
> On Fri, Mar 17, 2017 at 2:56 PM, hernan saab <hernan_javier_s...@yahoo.com
> > wrote:
>
> I have been in a similar world of pain. Basically, I tried to use an
> external Hive to have user access controls with a spark engine.
> At the end, I realized that it was a better idea to use apache tez instead
> of a spark engine for my particular case.
>
> But the journey is what I want to share with you.
> The big data apache tools and libraries such as Hive, Tez, Spark, Hadoop ,
> Parquet etc etc are not interchangeable as we would like to think. There
> are very limited combinations for very specific versions. This is why tools
> like Ambari can be useful. Ambari sets a path of combos of versions known
> to work and the dirty work is done under the UI.
>
> More often than not, when you try a version that few people tried, you
> will get error messages that will derail you and cause you to waste a lot
> of time.
>
> In addition, this group, as well as many other apache big data user
> groups,  provides extremely poor support for users. The answers you usually
> get are not even hints to a solution. Their answers usually translate to
> "there is nothing I am willing to do about your problem. If I did, I should
> get paid" in many cryptic ways.
>
> If you ask your question to the Spark group they will take you to the Hive
> group and viceversa (I can almost guarantee it based on previous
> experiences)
>
> But in hindsight, people who work on this kind of thing typically make
> more money than the average developer. If you make more $$s it makes sense
> learning this stuff is supposed to be harder.
>
> Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if
> you are querying large files.
>
>
>
> On Friday, March 17, 2017 11:33 AM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>
> :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work
> with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker
> to me, alas.
>
> thanks in advance.
>
> Cheers,
> Stephen.
>
> On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
> hi guys,
> wondering where we stand with Hive On Spark these days?
>
> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
> versions) and running up against this class not found:
>
> java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener
>
>
> searching the Cyber i find this:
> 1. http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive
>
> which pretty much describes my situation too and it references this:
>
>
> 2. https://issues.apache.org/jira/browse/SPARK-17563
>
> which indicates a "won't fix" - but does reference this:
>
>
> 3. https://issues.apache.org/jira/browse/HIVE-14029
>

Re: hive on spark - version question

2017-03-17 Thread Stephen Sprague
thanks for the comments and for sure all relevant. And yeah I feel the pain
just like the next guy but that's the part of the opensource "life style"
you subscribe to when using it.

The upside payoff has gotta be worth the downside risk - or else forget
about it right? Here in the Hive world in my experience anyway its been
great.  Gotta roll with it, be courteous, be persistent and sometimes
things just work out.

Getting back to Spark and Tez yes by all means i'm a big Tez user already so
i was hoping to see what Spark brought to the table and i didn't want to diddle
around with Spark < 2.0.   That's cool. I can live with that not being
nailed down yet. I'll just wait for hive 2.2 and rattle the cage again! ha!


All good!

Cheers,
Stephen.

On Fri, Mar 17, 2017 at 1:14 PM, Edward Capriolo <edlinuxg...@gmail.com>
wrote:

>
>
> On Fri, Mar 17, 2017 at 2:56 PM, hernan saab <hernan_javier_s...@yahoo.com
> > wrote:
>
>> I have been in a similar world of pain. Basically, I tried to use an
>> external Hive to have user access controls with a spark engine.
>> At the end, I realized that it was a better idea to use apache tez
>> instead of a spark engine for my particular case.
>>
>> But the journey is what I want to share with you.
>> The big data apache tools and libraries such as Hive, Tez, Spark, Hadoop
>> , Parquet etc etc are not interchangeable as we would like to think. There
>> are very limited combinations for very specific versions. This is why tools
>> like Ambari can be useful. Ambari sets a path of combos of versions known
>> to work and the dirty work is done under the UI.
>>
>> More often than not, when you try a version that few people tried, you
>> will get error messages that will derail you and cause you to waste a lot
>> of time.
>>
>> In addition, this group, as well as many other apache big data user
>> groups,  provides extremely poor support for users. The answers you usually
>> get are not even hints to a solution. Their answers usually translate to
>> "there is nothing I am willing to do about your problem. If I did, I should
>> get paid" in many cryptic ways.
>>
>> If you ask your question to the Spark group they will take you to the
>> Hive group and viceversa (I can almost guarantee it based on previous
>> experiences)
>>
>> But in hindsight, people who work on this kind of thing typically make
>> more money than the average developer. If you make more $$s it makes sense
>> learning this stuff is supposed to be harder.
>>
>> Conclusion, don't try it. Or try using Tez/Hive instead of Spark/Hive  if
>> you are querying large files.
>>
>>
>>
>> On Friday, March 17, 2017 11:33 AM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>
>> :(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will
>> work with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal
>> breaker to me, alas.
>>
>> thanks in advance.
>>
>> Cheers,
>> Stephen.
>>
>> On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>> hi guys,
>> wondering where we stand with Hive On Spark these days?
>>
>> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
>> versions) and running up against this class not found:
>>
>> java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener
>>
>>
>> searching the Cyber i find this:
>> 1. http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive
>>
>> which pretty much describes my situation too and it references this:
>>
>>
>> 2. https://issues.apache.org/jira/browse/SPARK-17563
>>
>> which indicates a "won't fix" - but does reference this:
>>
>>
>> 3. https://issues.apache.org/jira/browse/HIVE-14029
>>
>> which looks to be fixed in hive 2.2 - which is not released yet.
>>
>>
>> so if i want to use spark 2.1.0 with hive am i out of luck - until hive
>> 2.2?
>>
>> thanks,
>> Stephen.
>>
>>
>>
>>
>>
> Stephen,
>
> I understand some of your frustration.  Remember that many in open source
> are volunteering their time. This is why if you pay a vendor for support of
> some software you might pay 50K a year or $2

Re: hive on spark - version question

2017-03-17 Thread Stephen Sprague
:(  gettin' no love on this one.   any SME's know if Spark 2.1.0 will work
with Hive 2.1.0 ?  That JavaSparkListener class looks like a deal breaker
to me, alas.

thanks in advance.

Cheers,
Stephen.

On Mon, Mar 13, 2017 at 10:32 PM, Stephen Sprague <sprag...@gmail.com>
wrote:

> hi guys,
> wondering where we stand with Hive On Spark these days?
>
> i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
> versions) and running up against this class not found:
>
> java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener
>
>
> searching the Cyber i find this:
> 1. http://stackoverflow.com/questions/41953688/setting-
> spark-as-default-execution-engine-for-hive
>
> which pretty much describes my situation too and it references this:
>
>
> 2. https://issues.apache.org/jira/browse/SPARK-17563
>
> which indicates a "won't fix" - but does reference this:
>
>
> 3. https://issues.apache.org/jira/browse/HIVE-14029
>
> which looks to be fixed in hive 2.2 - which is not released yet.
>
>
> so if i want to use spark 2.1.0 with hive am i out of luck - until hive
> 2.2?
>
> thanks,
> Stephen.
>
>


hive on spark - version question

2017-03-14 Thread Stephen Sprague
hi guys,
wondering where we stand with Hive On Spark these days?

i'm trying to run Spark 2.1.0 with Hive 2.1.0 (purely coincidental
versions) and running up against this class not found:

java.lang.NoClassDefFoundError: org/apache/spark/JavaSparkListener


searching the Cyber i find this:
1.
http://stackoverflow.com/questions/41953688/setting-spark-as-default-execution-engine-for-hive

which pretty much describes my situation too and it references this:


2. https://issues.apache.org/jira/browse/SPARK-17563

which indicates a "won't fix" - but does reference this:


3. https://issues.apache.org/jira/browse/HIVE-14029

which looks to be fixed in hive 2.2 - which is not released yet.


so if i want to use spark 2.1.0 with hive am i out of luck - until hive 2.2?

thanks,
Stephen.


random KILL's in YARN

2017-01-18 Thread Stephen Sprague
hey guys,
I have a question on why Hiveserver2 would issue a "killjob" signal.

We run Yarn on Hadoop 5.6 with the HiveServer2 process. It uses the
fair-scheduler. Pre-emption is turned off. At least twice a day we have
jobs that are randomly killed.  they can be big jobs, they can be small
ones. they can be Tez jobs, they can be MR jobs.  I can't find any pattern
and i can't find *WHY* this is occurring so i'm reaching out.

i have the RM process running at DEBUG level logging as well as all the NM
processes so i have pretty detailed logs.  Still i can't find a reason -
all i see is something like this:

2017-01-18 11:18:15,732 DEBUG [IPC Server handler 0 on 42807]
org.apache.hadoop.ipc.Server: IPC Server handler 0 on 42807:
org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.killJob from
172.19.75.137:39623 Call#5981979 Retry#0 for RpcKind RPC_PROTOCOL_BUFFER
2017-01-18 11:18:15,732 DEBUG [IPC Server handler 0 on 42807]
org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:dwr
(auth:SIMPLE) from:org.apache.hadoop.ipc.Server$Handler.run(Server.
java:2038)
*2017-01-18 11:18:15,736 INFO [IPC Server handler 0 on 42807]
org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job
job_1484692657978_3312 received from dwr (auth:SIMPLE) at 172.19.75.137*
2017-01-18 11:18:15,736 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
org.apache.hadoop.mapreduce.v2.app.job.event.JobDiagnosticsUpdateEvent.EventType:
JOB_DIAGNOSTIC_UPDATE
2017-01-18 11:18:15,736 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Processing
job_1484692657978_3312 of type JOB_DIAGNOSTIC_UPDATE
2017-01-18 11:18:15,736 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
org.apache.hadoop.mapreduce.v2.app.job.event.JobEvent.EventType: JOB_KILL
2017-01-18 11:18:15,736 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Processing
job_1484692657978_3312 of type JOB_KILL
2017-01-18 11:18:15,737 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
job_1484692657978_3312Job Transitioned from RUNNING to KILL_WAIT
2017-01-18 11:18:15,737 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
org.apache.hadoop.mapreduce.v2.app.job.event.TaskEvent.EventType: T_KILL
2017-01-18 11:18:15,737 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Processing
task_1484692657978_3312_m_00 of type T_KILL
2017-01-18 11:18:15,737 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
org.apache.hadoop.mapreduce.v2.app.job.event.TaskEvent.EventType: T_KILL
2017-01-18 11:18:15,737 DEBUG [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Processing
task_1484692657978_3312_m_01 of type T_KILL




* 172.19.75.137 is the node where the HiveServer2 process is running - so i'm
pretty sure this is the process issuing the kill.  but why?  why kill the
job?  makes no sense to me.  what would cause this to happen?



The Hiveserver2 log doesn't have anything useful in it at the INFO level
and i can't seem to get it to log at the DEBUG level. the bit in red below
doesn't seem to work so any suggestions here gladly accepted too.

$ ps -ef | grep -i hiveserver2

dwr  21927  1785  8 Jan14 ?10:10:11
/usr/lib/jvm/java-8-oracle/jre//bin/java -Xmx12288m
-Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/usr/lib/hadoop/logs
-Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop
-Dhadoop.id.str= -Dhadoop.root.logger=INFO,console
-Djava.library.path=/usr/lib/hadoop/lib/native
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
-Xmx268435456 -Xmx10G -Dlog4j.configurationFile=hive-log4j2.properties
-Dorg.apache.logging.log4j.simplelog.StatusLogger.level=DEBUG
-Dlog4j.configurationFile=hive-log4j2.properties
-Djava.util.logging.config.file=/usr/lib/apache-hive-2.1.0-bin/bin/../conf/parquet-logging.properties
-Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar
/usr/lib/apache-hive-2.1.0-bin/lib/hive-service-2.1.0.jar
org.apache.hive.service.server.HiveServer2


*--hiveconf hive.root.logger=DEBUG,console*
thanks,
Stephen


Re: tez + union stmt

2016-12-25 Thread Stephen Sprague
Thanks Elliot.  Nice christmas present.   Those settings in that
stackoverflow link look to me to be exactly what i need to set for MR jobs
to pick up the data that Tez created.

Cheers,
Stephen.

On Sun, Dec 25, 2016 at 2:45 AM, Elliot West <tea...@gmail.com> wrote:

> I believe that tez will generate subfolders for unioned data. As far as I
> know, this is the expected behaviour and there is no alternative.
> Presumably this is to prevent multiple tasks from attempting to write the
> same file?
>
> We've experienced issues when switching from mr to tez; downstream jobs
> weren't expecting subfolders and had trouble reading previously accessible
> datasets.
>
> Apparently there are workarounds within Hive:
> http://stackoverflow.com/questions/39511585/hive-
> create-table-not-insert-data
>
> Merry Christmas,
>
> Elliot.
>
> On Sun, 25 Dec 2016 at 03:11, Rajesh Balamohan <rbalamo...@apache.org>
> wrote:
>
>> Are there any exceptions in hive.log?. Is tmp_pv_v4* table part of the
>> select query?
>>
>> Assuming you are creating the table in staging.db, it would have created
>> the table location as staging.db/foo (as you have not specified the
>> location).
>>
>> Adding user@hive.apache.org as this is hive related.
>>
>>
>> ~Rajesh.B
>>
>> On Sun, Dec 25, 2016 at 12:08 AM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>> all,
>>
>> i'm running tez with the sql pattern:
>>
>> * create table foo as select * from (select... UNION select... UNION
>> select...)
>>
>> in the logs the final step is this:
>>
>> * Moving data to directory hdfs://dwrnn1.sv2.trulia.com:
>> 8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4 from hdfs://
>> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/
>> staging.db/.hive-staging_hive_2016-12-24_10-05-40_048_
>> 4896412314807355668-899/-ext-10002
>>
>>
>> when querying the table i got zero rows returned which made me curious.
>> so i queried the hdfs location and see this:
>>
>>   $ hdfs dfs -ls hdfs://dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/
>> staging.db/tmp_pv_v4c__loc_4
>>
>>   Found 3 items
>>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:05 hdfs://
>> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/
>> staging.db/tmp_pv_v4c__loc_4/1
>>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:06 hdfs://
>> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/
>> staging.db/tmp_pv_v4c__loc_4/2
>>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:06 hdfs://
>> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/
>> staging.db/tmp_pv_v4c__loc_4/3
>>
>> and yes the data files are under these three dirs.
>>
>> so i ask... i'm not used to seeing sub-directories under the tablename
>> unless the table is partitioned. is this legit? might there be some config
>> settings i need to set to see this data via sql?
>>
>> thanks,
>> Stephen.
>>
>>
>>
>>
>>
>>
>>
>>


Re: Maintaining big and complex Hive queries

2016-12-21 Thread Stephen Sprague
my 2 cents. :)

as soon as you say "complex query" i would submit you've lost the upper
hand and you're behind the eight-ball right off the bat.  And you know this
too, otherwise you wouldn't have posted here. ha!

i use cascading CTAS statements so that i can examine the intermediate
tables.  Another approach is to use CTEs, but while that makes things
easier to read it's still one big query and you don't get insight into the
"work" tables.

yes, it could take longer to execute if those intermediate tables can't be
built in parallel, but that's a small price to pay compared to human debug
time, in my book anyway.

thoughts?

Cheers,
Stephen.





On Wed, Dec 21, 2016 at 10:07 AM, Saumitra Shahapure <
saumitra.offic...@gmail.com> wrote:

> Hi Elliot,
>
> Thanks for letting me know. HPL-SQL sounded particularly interesting. But
> in the documentation I could not see any way to pass output generated by
> one Hive query to the next one. The tool looks good as a homogeneous PL-SQL
> platform for multiple big-data systems (http://www.hplsql.org/about).
>
> However in order to break single complex hive query, DDLs look to be only
> way in HPL-SQL too. Or is there any alternate way that I might have missed?
>
> -- Saumitra S. Shahapure
>
> On Thu, Dec 15, 2016 at 6:21 PM, Elliot West  wrote:
>
>> I notice that HPL/SQL is not mentioned on the page I referenced, however
>> I expect that is another approach that you could use to modularise:
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
>> http://www.hplsql.org/doc
>>
>> On 15 December 2016 at 17:17, Elliot West  wrote:
>>
>>> Some options are covered here, although there is no definitive guidance
>>> as far as I know:
>>>
>>> https://cwiki.apache.org/confluence/display/Hive/Unit+Testin
>>> g+Hive+SQL#UnitTestingHiveSQL-Modularisation
>>>
>>> On 15 December 2016 at 17:08, Saumitra Shahapure <
>>> saumitra.offic...@gmail.com> wrote:
>>>
 Hello,

 We are running and maintaining quite big and complex Hive SELECT query
 right now. It's basically a single SELECT query which performs JOIN of
 about ten other SELECT query outputs.

 A simplest way to refactor that we can think of is to break this query
 down into multiple views and then join the views. There is similar
 possibility to create intermediate tables.

 However creating multiple DDLs in order to maintain a single DML is not
 very smooth. We would end up polluting metadata database by creating views
 / intermediate tables which are used in just this ETL.

 What are the other efficient ways to maintain complex SQL queries
 written in Hive? Are there better ways to break Hive query into multiple
 modules?

 -- Saumitra S. Shahapure

>>>
>>>
>>
>


Re: [ANNOUNCE] Apache Hive 2.1.1 Released

2016-12-08 Thread Stephen Sprague
Ahh.  thank you.

On Thu, Dec 8, 2016 at 3:19 PM, Alan Gates <alanfga...@gmail.com> wrote:

> Apache keeps just the latest version of each release on the mirrors.  You
> can find all Hive releases at https://archive.apache.org/dist/hive/ if
> you need 2.1.0.
>
> Alan.
>
> > On Dec 8, 2016, at 14:40, Stephen Sprague <sprag...@gmail.com> wrote:
> >
> > out of curiosity any reason why release 2.1.0 disappeared from
> apache.claz.org/hive ?   apologies if i missed the conversation about
> it.  thanks.
> >
> >
> > 
> >
> > On Thu, Dec 8, 2016 at 9:58 AM, Jesus Camacho Rodriguez <
> jcama...@apache.org> wrote:
> > The Apache Hive team is proud to announce the release of Apache Hive
> > version 2.1.1.
> >
> > The Apache Hive (TM) data warehouse software facilitates querying and
> > managing large datasets residing in distributed storage. Built on top
> > of Apache Hadoop (TM), it provides, among others:
> >
> > * Tools to enable easy data extract/transform/load (ETL)
> >
> > * A mechanism to impose structure on a variety of data formats
> >
> > * Access to files stored either directly in Apache HDFS (TM) or in other
> >   data storage systems such as Apache HBase (TM)
> >
> > * Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.
> >
> > For Hive release details and downloads, please visit:
> > https://hive.apache.org/downloads.html
> >
> > Hive 2.1.1 Release Notes are available here:
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> projectId=12310843=12335838
> >
> > We would like to thank the many contributors who made this release
> > possible.
> >
> > Regards,
> >
> > The Apache Hive Team
> >
> >
> >
>
>


Re: [ANNOUNCE] Apache Hive 2.1.1 Released

2016-12-08 Thread Stephen Sprague
out of curiosity any reason why release 2.1.0 disappeared from
apache.claz.org/hive ?   apologies if i missed the conversation about it.
thanks.


[image: Inline image 1]

On Thu, Dec 8, 2016 at 9:58 AM, Jesus Camacho Rodriguez  wrote:

> The Apache Hive team is proud to announce the release of Apache Hive
> version 2.1.1.
>
> The Apache Hive (TM) data warehouse software facilitates querying and
> managing large datasets residing in distributed storage. Built on top
> of Apache Hadoop (TM), it provides, among others:
>
> * Tools to enable easy data extract/transform/load (ETL)
>
> * A mechanism to impose structure on a variety of data formats
>
> * Access to files stored either directly in Apache HDFS (TM) or in other
>   data storage systems such as Apache HBase (TM)
>
> * Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.
>
> For Hive release details and downloads, please visit:
> https://hive.apache.org/downloads.html
>
> Hive 2.1.1 Release Notes are available here:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> projectId=12310843=12335838
>
> We would like to thank the many contributors who made this release
> possible.
>
> Regards,
>
> The Apache Hive Team
>
>
>


Re: s3a and hive

2016-11-15 Thread Stephen Sprague
just for the record...

this config "hive.exec.stagingdir" determines that ".hive_staging"
sub-directory. when it defaults to the table path and the table path is in
s3 that's where i get the exception:

Failed with exception java.io.IOException: rename for src path:
s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/
date_key=20161113/.hive- staging_hive_2016-11-15_04-57-
52_085_7825126612479617470-1/-ext-1/00_0 to dest
path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
returned false

i got the tip from this stackoverflow:

http://stackoverflow.com/questions/39547001/why-hive-staging-file-is-missing-in-aws-emr


that said setting "hive.metastore.warehouse.dir" to an S3 location is
something totally different and per Elliot's comments could be a risky
adventure - and unrelated to my error.

Anyway, I reset that back to hdfs and was inserting into an external table
located in s3 and *still* got that error above much to my consternation.
however, by playing with "hive.exec.stagingdir" (and reading that
stackoverflow) i was able to overcome the error.
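
for reference, the knob in question looks like this - the exact value you
want will depend on your layout, but the point is a staging location off
the s3 table path (mine was roughly this shape):

   -- default is ".hive-staging", created relative to the target table/partition dir
   set hive.exec.stagingdir=/tmp/hive/.hive-staging;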

YMMV.

Cheers,
Stephen.


On Tue, Nov 15, 2016 at 7:53 AM, Stephen Sprague <sprag...@gmail.com> wrote:

> Thanks Elliot. I think you might be onto something there. :)
>
> Making that tiny little switch sure seemed attractive but judging from the
> Jira's out there the ramifications of that setting are far more involved
> and nuanced than i thought.
>
> awright. you make some convincing arguments there. looks like the smart
> money is on hdfs for the time-being.
>
> thanks for replying! Good stuff.
>
> Cheers,
> Stephen.
>
> On Tue, Nov 15, 2016 at 7:29 AM, Elliot West <tea...@gmail.com> wrote:
>
>> My gut feeling is that this is not something you should do (except for
>> fun!) I'm fairly confident that somewhere in Hive, MR, or Tez, you'll hit
>> some code that requires consistent, atomic move/copy/list/overwrite
>> semantics from the warehouse filesystem. This is not something that the
>> vanilla S3AFileSystem can provide. Even if you get to the point where
>> everything appears functionally sound, I expect you'll encounter unusual
>> and inconsistent behavior if you use this in the long term.
>>
>> Solutions to Hive on S3 include:
>>
>>- Use S3Guard (not yet available): https://issues.apa
>>che.org/jira/browse/HADOOP-13345
>><https://issues.apache.org/jira/browse/HADOOP-13345>
>>- Use Hive on EMR with Amazon's S3 filesystem implementation and
>>EMRFS. Note that this confusingly requires and overloads the 's3://' 
>> scheme.
>>
>> Hope this helps, and please report back with any findings as we are doing
>> quite a bit of Hive in AWS too.
>>
>> Cheers - Elliot.
>>
>> On 15 November 2016 at 15:19, Stephen Sprague <sprag...@gmail.com> wrote:
>>
>>> no. permissions are good.  i believe the case to be that s3a does not
>>> have a "move" and/or "rename" semantic but i can't be the first one to
>>> encounter this. somebody out there has to have gone done this path way
>>> before me surely.
>>>
>>> searching the cyber i find this:
>>>
>>>https://issues.apache.org/jira/browse/HIVE-14270
>>>
>>> which is part of a even more work with s3 (see the related jira's that
>>> that jira comes under) especially the Hadoop Uber-Jira.
>>>
>>>
>>> so after digging though those jira's lemme ask:
>>>
>>> has anyone set hive.metastore.warehouse.dir to a s3a location with
>>> success?
>>>
>>> seems to me hive 2.2.0 and perhaps hadoop 2.7 or 2.8 are the only
>>> chances of success but i'm happy to be told i'm wrong.
>>>
>>> thanks,
>>> Stephen.
>>>
>>>
>>>
>>> On Mon, Nov 14, 2016 at 10:25 PM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> Is it a permission issue on the folder?
>>>>
>>>> On 15 Nov 2016, at 06:28, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>
>>>> so i figured i try and set hive.metastore.warehouse.dir=s3a://bucket/hive
>>>> and see what would happen.
>>>>
>>>> running this query:
>>>>
>>>> insert overwrite table omniture.hit_data_aws partition
>>>> (date_key=20161113) select * from staging.hit_data_aws_ext_20161113
>>>> limit 1;
>>>>
>>>> yields this error:
>>>>
>>>>Failed with exception java.io.IOException: rename for src path:
>>

Re: s3a and hive

2016-11-15 Thread Stephen Sprague
Thanks Elliot. I think you might be onto something there. :)

Making that tiny little switch sure seemed attractive but judging from the
Jira's out there the ramifications of that setting are far more involved
and nuanced than i thought.

awright. you make some convincing arguments there. looks like the smart
money is on hdfs for the time-being.

thanks for replying! Good stuff.

Cheers,
Stephen.

On Tue, Nov 15, 2016 at 7:29 AM, Elliot West <tea...@gmail.com> wrote:

> My gut feeling is that this is not something you should do (except for
> fun!) I'm fairly confident that somewhere in Hive, MR, or Tez, you'll hit
> some code that requires consistent, atomic move/copy/list/overwrite
> semantics from the warehouse filesystem. This is not something that the
> vanilla S3AFileSystem can provide. Even if you get to the point where
> everything appears functionally sound, I expect you'll encounter unusual
> and inconsistent behavior if you use this in the long term.
>
> Solutions to Hive on S3 include:
>
>- Use S3Guard (not yet available): https://issues.apa
>che.org/jira/browse/HADOOP-13345
><https://issues.apache.org/jira/browse/HADOOP-13345>
>- Use Hive on EMR with Amazon's S3 filesystem implementation and
>EMRFS. Note that this confusingly requires and overloads the 's3://' 
> scheme.
>
> Hope this helps, and please report back with any findings as we are doing
> quite a bit of Hive in AWS too.
>
> Cheers - Elliot.
>
> On 15 November 2016 at 15:19, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> no. permissions are good.  i believe the case to be that s3a does not
>> have a "move" and/or "rename" semantic but i can't be the first one to
>> encounter this. somebody out there has to have gone done this path way
>> before me surely.
>>
>> searching the cyber i find this:
>>
>>https://issues.apache.org/jira/browse/HIVE-14270
>>
>> which is part of a even more work with s3 (see the related jira's that
>> that jira comes under) especially the Hadoop Uber-Jira.
>>
>>
>> so after digging though those jira's lemme ask:
>>
>> has anyone set hive.metastore.warehouse.dir to a s3a location with
>> success?
>>
>> seems to me hive 2.2.0 and perhaps hadoop 2.7 or 2.8 are the only chances
>> of success but i'm happy to be told i'm wrong.
>>
>> thanks,
>> Stephen.
>>
>>
>>
>> On Mon, Nov 14, 2016 at 10:25 PM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>> Is it a permission issue on the folder?
>>>
>>> On 15 Nov 2016, at 06:28, Stephen Sprague <sprag...@gmail.com> wrote:
>>>
>>> so i figured i try and set hive.metastore.warehouse.dir=s3a://bucket/hive
>>> and see what would happen.
>>>
>>> running this query:
>>>
>>> insert overwrite table omniture.hit_data_aws partition
>>> (date_key=20161113) select * from staging.hit_data_aws_ext_20161113
>>> limit 1;
>>>
>>> yields this error:
>>>
>>>Failed with exception java.io.IOException: rename for src path:
>>> s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/.hive-
>>> staging_hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
>>> to dest 
>>> path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
>>> returned false
>>> FAILED: Execution Error, return code 1 from
>>> org.apache.hadoop.hive.ql.exec.MoveTask. java.io.IOException: rename
>>> for src path: s3a://trulia-dwr-cluster-dev/h
>>> ive/omniture.db/hit_data_aws/date_key=20161113/.hive-staging
>>> _hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
>>> to dest 
>>> path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
>>> returned false
>>>
>>>
>>> is there any workaround?   i'm running hive 2.1.0 and hadoop version
>>> 2.6.0-cdh5.7.1  .
>>>
>>>
>>> thanks,
>>> Stephen.
>>>
>>>
>>
>


Re: s3a and hive

2016-11-15 Thread Stephen Sprague
no. permissions are good.  i believe the case to be that s3a does not have
a "move" and/or "rename" semantic, but i can't be the first one to
encounter this. somebody out there has to have gone down this path before
me, surely.

searching the cyber i find this:

   https://issues.apache.org/jira/browse/HIVE-14270

which is part of a larger body of work with s3 (see the related jiras that
that jira comes under), especially the Hadoop Uber-Jira.


so after digging through those jiras, lemme ask:

has anyone set hive.metastore.warehouse.dir to a s3a location with success?

seems to me hive 2.2.0 and perhaps hadoop 2.7 or 2.8 are the only chances
of success but i'm happy to be told i'm wrong.

thanks,
Stephen.



On Mon, Nov 14, 2016 at 10:25 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Is it a permission issue on the folder?
>
> On 15 Nov 2016, at 06:28, Stephen Sprague <sprag...@gmail.com> wrote:
>
> so i figured i try and set hive.metastore.warehouse.dir=s3a://bucket/hive
> and see what would happen.
>
> running this query:
>
> insert overwrite table omniture.hit_data_aws partition
> (date_key=20161113) select * from staging.hit_data_aws_ext_20161113 limit
> 1;
>
> yields this error:
>
>Failed with exception java.io.IOException: rename for src path:
> s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/.hive-
> staging_hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
> to dest 
> path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
> returned false
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.MoveTask.
> java.io.IOException: rename for src path: s3a://trulia-dwr-cluster-dev/h
> ive/omniture.db/hit_data_aws/date_key=20161113/.hive-staging
> _hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
> to dest 
> path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
> returned false
>
>
> is there any workaround?   i'm running hive 2.1.0 and hadoop version
> 2.6.0-cdh5.7.1  .
>
>
> thanks,
> Stephen.
>
>


s3a and hive

2016-11-14 Thread Stephen Sprague
so i figured i'd try setting hive.metastore.warehouse.dir=s3a://bucket/hive
and see what would happen.

running this query:

insert overwrite table omniture.hit_data_aws partition
(date_key=20161113) select * from staging.hit_data_aws_ext_20161113 limit 1;

yields this error:

   Failed with exception java.io.IOException: rename for src path:
s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/.hive-
staging_hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
to dest
path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
returned false
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask. java.io.IOException: rename for
src path:
s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/.hive-staging_hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
to dest
path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
returned false


is there any workaround?   i'm running hive 2.1.0 and hadoop version
2.6.0-cdh5.7.1  .


thanks,
Stephen.


Re: a GROUP BY that is not fully grouping

2016-11-03 Thread Stephen Sprague
ha!  kinda shows how the tech stack boundaries now are getting blurred,
eh?  well at least for us amateurs! :o

On Thu, Nov 3, 2016 at 5:00 AM, Donald Matthews  wrote:

> |Spark calls its SQL part HiveContext, but it is not related to this
> list
>
> Oof, I didn't realize that. Thanks for letting me know.
>


Re: hiveserver2 GC overhead limit exceeded

2016-10-23 Thread Stephen Sprague
ok. i'll bite.

let's see the output of this command on the host where HiveServer2 is running.

$ ps -ef | grep -i hiveserver2

this'll show us all the command line parameters HS2 was (ultimately)
invoked with.
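
(what we're after is the effective -Xmx on that line - when multiple -Xmx
flags are present the last one wins. if it turns out to be small, bumping
the heap in hive-env.sh before starting HS2 is the usual fix; a sketch,
using the variable name from the stock hive-env.sh.template:)

   # conf/hive-env.sh
   export HADOOP_HEAPSIZE=4096    # MB for the JVM started by the hive scripts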

Cheers,
Stephen

On Sun, Oct 23, 2016 at 6:46 AM, patcharee 
wrote:

> Hi,
>
> I use beeline to connect to hiveserver2. I tested with a simple command
> and got an error GC overhead limit exceeded
>
> 0: jdbc:hive2://service-10-1:10010/default> drop table
> testhivedrivertable;
> Error: Error while processing statement: FAILED: Execution Error, return
> code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. GC overhead limit
> exceeded (state=08S01,code=1)
>
> How to solve this? How to identify if this error is from the client
> (beeline) or from hiveserver2?
>
> Thanks,
>
> Patcharee
>
>


hiveserver2 and KILLJOB

2016-10-05 Thread Stephen Sprague
hey guys,
this is a long shot but i'll ask anyway.   We're running YARN and
HiveServer2 (v2.1.0) and noticing "random" kills - which look to me to be
issued by HiveServer2.

we've turned DEBUG log level on for the Application Master container and
see the following in the logs:

2016-10-05 02:06:18,365 DEBUG [IPC Server handler 0 on 34013]
org.apache.hadoop.ipc.Server: IPC Server handler 0 on 34013:
*org.apache.hadoop.mapreduce.v2.api.MRClientProtocolPB.killJob
from 172.19.75.137:35616 * Call#6603343 Retry#0
for RpcKind RPC_PROTOCOL_BUFFER

2016-10-05 02:06:18,365 DEBUG [IPC Server handler 0 on 34013]
org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:dwr
(auth:SIMPLE)
from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)

2016-10-05 02:06:18,368 INFO [IPC Server handler 0 on 34013]
org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job
job_1475268554936_14906 received from dwr (auth:SIMPLE) at 172.19.75.137


host 172.19.75.137 is where the HiveServer2 process is running.


i'm not seeing anything in the HiveServer2 logs at all that is related to
this job. that's a whole 'nother problem.  also, rerunning the M/R job
works fine.

so it's very weird that, out of the blue, HS2 would issue a KILL.

this happens three or four times every day with random jobs.

Any thoughts greatly appreciated.

Cheers,
Stephen.


Re: How do I determine a library mismatch between jdbc client and server?

2016-09-28 Thread Stephen Sprague
you might just end up using your own heuristics.  if the port is "alive"
(i.e. you can list it via netstat or telnet to it) but you can't connect...
then you've got yourself a problem.
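
something as crude as this is the level of check i mean (host is an
example; 10000 is the usual HS2 default port):

   $ nc -z -w 5 hs2-host.example.com 10000 && echo "port is listening"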

kinda like a bootstrapping problem, eh? you need to connect to get the
version but you can't connect if you don't have the right version.

my 2 cents.

On Wed, Sep 28, 2016 at 4:13 PM, Bear Giles  wrote:

> Hi, I'm trying to do development in an environment where we have a mixed
> bag of clusters. Some Cloudera, some Hortonworks, different versions of
> each, etc.
>
> (I don't know if we'll see this mix in the field but we need the variety
> for testing our software.)
>
> Sometimes the poor overworked developer (cough) grabs the wrong jars for
> the server. In at least one case there was no indication of what the
> problem was - all I saw was a connection timeout. That caused some
> confusion since mismatched jars was pretty far down my list of likely
> causes for that error.
>
> Is there a standard way to query for the remote version before trying to
> establish a JDBC connection? I know that I can check the Connection's
> DatabaseMetaData information once I've established a connection but that
> doesn't help me when I'm getting nothing but a timeout exception.
>
>
>
> Bear Giles
>
> Sr. Java Application Engineer
> bgi...@snaplogic.com
> Mobile: 720-354-0766
> SnapLogic.com  | We're Hiring
> !
> 
> 
>   
>   
> 
>
>
>
> 
>
> SnapLogic Inc | 929 Pearl St #200 | Boulder | 80302 | Colorado
>
> SnapLogic Inc | 2 W Fifth Avenue Fourth Floor | San Mateo | 94402 |
> California
>
> This message is confidential. It may also be privileged or otherwise
> protected by work product immunity or other legal rules. If you have
> received it by mistake, please let us know by e-mail reply and delete it
> from your system; you may not copy this message or disclose its contents to
> anyone. The integrity and security of this message cannot be guaranteed on
> the Internet.
>


Re: Hive queries rejected under heavy load

2016-09-28 Thread Stephen Sprague
gotta start by looking at the logs and running the local client to
eliminate HS2.   perhaps running hive as such:

$ hive -hiveconf hive.root.logger=DEBUG,console

do you see any smoking gun?

On Wed, Sep 28, 2016 at 7:34 AM, Jose Rozanec  wrote:

> Hi,
>
> We have a Hive cluster (Hive 2.1.0+Tez 0.8.4) which works well for most
> queries. Though for some heavy ones we observe that sometimes are able to
> execute and sometimes get rejected. We are not sure why we get a rejection
> instead of getting them enqueued and wait for execution until resources in
> cluster are available again. We notice that the connection waits for a
> minute, and if fails to assign resources, will drop the query.
> Looking at configuration parameters, is not clear to us if this can be
> changed. Did anyone had a similar experience and can provide us some
> guidance?
>
> Thank you in advance,
>
> Joze.
>
>
>


Re: Hive 2.x usage

2016-09-14 Thread Stephen Sprague
> * Are you using Hive-2.x at your org and at what scale?

yes. we're using 2.1.0.  1.5PB.  30 node cluster.  ~1000 jobs a day.  And
yeah, hive 2.1.0 has some issues and can require some finesse wrt the
hive-site.xml settings.

> * Is the release stable enough? Did you notice any correctness issues?

yes.  we did notice correctness issues."NOT IN" against a partition key
silently fails.  and trouble with UNION statements also got us.
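
the shape of query that bit us on the first one was roughly this (table and
column names made up):

   -- NOT IN on the partition column was the pattern that misbehaved for us
   select count(*)
   from   fact_events
   where  date_key not in (20160901, 20160902);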

> * MR is deprecated in Hive-2.x (Though almost all the qtests still use
MR). Are you still using MR with Hive-2.x?

we still use MR on YARN and have no issues with it.

> * Are you using the apache release or HDP ?

we're using Apache/Hive 2.1.0 on a CDH distro.  does cloudera certify it?
probably not.  Do i care? no. :)

bottom line IMHO is Hive 2.1.0 is still in an early adopter phase.
however, if you're the do-it-yourself kind of guy then i'm sure you can
adapt - and reap the rewards of your know-how.  If you're the type who
needs somebody else to go to when it fails then i wouldn't jump on it just
yet.

just one opinion!  :o

Cheers,
Stephen.


On Wed, Sep 14, 2016 at 3:53 PM, Jörn Franke  wrote:

> If you are using a distribution (which you should if you go to production
> - Apache releases should not be used due to the maintainability, complexity
> and interaction with other components, such as Hadoop etc) then wait until
> a distribution with 2.x is out. As far as i am aware there is currently no
> such distribution. As far as i know , Hortonworks and probably also
> Cloudera test their distributions on large scale real production systems
> beforehand.
>
> I would not use MR even with 1.x and go for TEZ (except you are using some
> very specific outdated functionality). Spark is another option, but i do
> not see Hive on Spark as stable and less functionality - this may change in
> the future.
>
> On 14 Sep 2016, at 22:36, RD  wrote:
>
> Hi Folks,
>   We  (at my org) are currently planning our move to Hive-2.x. As part of
> this I wanted to get a sense of how stable the Hive-2.x release is.  I
> thought it would be good to conduct a brief survey on this. I've added a
> few questions below. It would really be a ton of help if folks could
> provide their feedback
>
> * Are you using Hive-2.x at your org and at what scale?
> * Is the release stable enough? Did you notice any correctness issues?
> * MR is deprecated in Hive-2.x (Though almost all the qtests still use
> MR). Are you still using MR with Hive-2.x?
> * Are you using the apache release or HDP ?
>
> -Best,
>
>
>
>
>


Re: Re: load data Failed with exception java.lang.IndexOutOfBoundsException

2016-09-08 Thread Stephen Sprague
>at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.
validateInput(OrcInputFormat.java:508)

would it be safe to assume that you are trying to load a text file into a
table stored as ORC?

your create table doesn't specify that explicitly so that means you have a
setting in your configs that says new tables are to be stored as ORC if not
specified otherwise.
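
if that's what's going on, two ways out come to mind (a sketch with made-up
columns and delimiter):

   -- either declare the table as text so LOAD DATA of a text file is legit...
   create table ods.loadtest_txt (c1 int, c2 string)
     row format delimited fields terminated by ','
     stored as textfile;
   load data local inpath '1.dat' overwrite into table ods.loadtest_txt;

   -- ...or keep the ORC table and populate it via insert-select from the text table
   insert overwrite table ods.loadtest select * from ods.loadtest_txt;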

too bad there isn't an error message like: "loading text data into a
non-TEXTFILE table generally isn't a good idea". :)

then again maybe somebody knows something i don't.

Cheers,
Stephen.





On Thu, Sep 8, 2016 at 7:37 PM, C R  wrote:

>
> Yes, based on my testing,it is wrong from 0 to 99 with the content of file 
> 1.dat,
> whether the column type is string or int.
>
> hive.log:
>
> 2016-09-09T09:10:40,978 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: CliDriver (SessionState.java:printInfo(
> 1029)) - Hive-on-MR is deprecated in Hive 2 and may
> not be available in the future versions. Consider
> using a different execution engine (i.e. tez, spark) or
> using Hive 1.X releases.
> 2016-09-09T09:11:17,433 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: conf.HiveConf (HiveConf.java:
> getLogIdVar(3177)) - Using the default value passed in
> for log id: d1e08abd-5f8b-4149-a679-00ba6b4f4ab9
> 2016-09-09T09:11:17,462 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:compile(409)) -
> Compiling command(queryId=hadoop_20160909091117_2f9e8e3b-b2e8-4312-b473-
> 535881c1d726): LOAD DATA LOCAL INPATH '1.dat' overwrite INTO TABLE ODS.
> loadtest
> 2016-09-09T09:11:18,016 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: metastore.HiveMetaStore (HiveMetaStore.
> java:logInfo(670)) - 0: get_table : db=ODS tbl=loadtest
> 2016-09-09T09:11:18,016 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: HiveMetaStore.audit (HiveMetaStore.java:
> logAuditEvent(280)) - ugi=hadoop   ip=unknown-ip-addr
>cmd=get_table : db=ODS tbl=loadtest
> 2016-09-09T09:11:18,162 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:compile(479)) -
> Semantic Analysis Completed
> 2016-09-09T09:11:18,163 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:getSchema(251)) -
>  Returning Hive schema: Schema(fieldSchemas:null, properties:null)
> 2016-09-09T09:11:18,167 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:compile(551)) -
> Completed compiling command(queryId=hadoop_20160909091117_
> 2f9e8e3b-b2e8-4312-b473-535881c1d726); Time taken: 0.725 seconds
> 2016-09-09T09:11:18,167 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:checkConcurrency(
> 171)) - Concurrency mode is disabled, not creating a lock manager
> 2016-09-09T09:11:18,167 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:execute(1493)) -
> Executing command(queryId=hadoop_20160909091117_2f9e8e3b-b2e8-4312-b473-
> 535881c1d726): LOAD DATA LOCAL INPATH '1.dat' overwrite INTO TABLE ODS.
> loadtest
> 2016-09-09T09:11:18,172 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: ql.Driver (Driver.java:launchTask(1832))
>  - Starting task [Stage-0:MOVE] in serial mode
> 2016-09-09T09:11:18,172 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: exec.Task (SessionState.java:printInfo(
> 1029)) - Loading data to table ods.loadtest from file:1.dat
> 2016-09-09T09:11:18,172 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: metastore.HiveMetaStore (HiveMetaStore.
> java:logInfo(670)) - 0: get_table : db=ods tbl=loadtest
> 2016-09-09T09:11:18,173 INFO  [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: HiveMetaStore.audit (HiveMetaStore.java:
> logAuditEvent(280)) - ugi=hadoop   ip=unknown-ip-addr
>cmd=get_table : db=ods tbl=loadtest
> 2016-09-09T09:11:18,320 ERROR [d1e08abd-5f8b-4149-a679-
> 00ba6b4f4ab9 main]: exec.Task (SessionState.java:printError(
> 1038)) - Failed with exception java.lang.IndexOutOfBoundsException
> org.apache.hadoop.hive.ql.metadata.HiveException: java.
> lang.IndexOutOfBoundsException
> at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.
> checkInputFormat(HiveFileFormatUtils.java:195)
> at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.
> checkTextInputFormat(HiveFileFormatUtils.java:217)
> at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.
> checkInputFormat(HiveFileFormatUtils.java:182)
> at org.apache.hadoop.hive.ql.exec.MoveTask.execute(
> MoveTask.java:306)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:158)
> at org.apache.hadoop.hive.ql.exec.TaskRunner.
> runSequential(TaskRunner.java:101)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1834)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1578)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1355)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1178)
> at 

Re: hive.root.logger influencing query plan?? so it's not so

2016-09-04 Thread Stephen Sprague
another case of a query hangin' in v2.1.0.

hive> select 1 from raw_logs.fact_mweb_datekey_by_hour where
date_key=20160904 and hour=15 limit 1;

* date_key is a partition key and 20160904 exists.
* hour is a bucketed column, and hour=15 has no matching rows. not sure it
matters whether it's bucketed or not for the query to hang.
* so empty result expected.


as Gopal mentioned previously this does indeed fix it:

* set hive.fetch.task.conversion=none;

but not sure it's the right thing to set globally just yet.
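
(a possibly gentler knob to look at is the size threshold for the
conversion rather than disabling it outright - untested on this particular
case though:)

   -- only allow the single fetch-task plan when the input is under N bytes
   set hive.fetch.task.conversion.threshold=268435456;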

Anyhoo users beware.

Regards,
Stephen





On Wed, Aug 31, 2016 at 7:01 AM, Stephen Sprague <sprag...@gmail.com> wrote:

> hi guys,
>
> Vlad: good suggestion however in my case its a 5 second query (when it
> works)
>
> Gopal: Thanks for the explanation on the effect logging can have on the
> execution path. somewhat counter-intuitive i must say and as you can
> imagine a tad more challenging to debug - when debugging influences the
> observation. :)
>
> that said adding:
>
>set hive.fetch.task.conversion=none;
>
> forced the query plans to be the same and that made the difference. it
> worked.
>
> now i need to research that setting and see what the implications are.
> I'm sure it doesn't come for free.
>
> and as always a big thanks!  lemme know if i can provide any details.  the
> partition size is >1G btw. and this is using hive 2.1.0.
>
> Cheers,
> Stephen.
> ps here's the hdfs info - the table is bucketed by 24 fwiw.
>
>
> $ hdfs dfs -du -h /user/hive/warehouse/raw_logs.
> db/fact_www_access_datekey_by_hour/date_key=20160828
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/00_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/01_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/02_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/03_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/04_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/05_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/06_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/07_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/08_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/09_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/10_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/11_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/12_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/13_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/14_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/15_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/16_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/17_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/18_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/19_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/20_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/21_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/22_0
> 1.1 G  3.2 G  /user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_
> hour/date_key=20160828/23_0
>
>
>
>
> On Wed, Aug 31, 2016 at 12:19 AM, Vladimir Kroz <vladimir.k...@gmail.com>
> wrote:
>
>> One of the causes could be that on long-running queries your terminal
>> session gets disconnected and client process terminate (appearing like
>>  query hangs).
>>
>> When debug messages are on, they will keep terminal session alive and
>> hence allowing your query to complete.
>>
>> I'm not sure if this is your case, but based on synopsis your provided it
>&

Re: Beeline throws OOM on large input query

2016-09-02 Thread Stephen Sprague
hmmm.  so beeline blew up *before* the query was even submitted to the
execution engine?   one would think 16G would be plenty for an 8M-character
sql statement.

some suggestions if you feel like going further down the rabbit hole.

1.  confirm your beeline java process is indeed running with expanded
memory (ps -ef | grep beeline and look for the last Xmx setting on the line)

2.  try the hive-cli (or the python one even.)  or "beeline -u
jdbc:hive2://" (local beeline - maybe that's different)

3.  chop down your 6K points to 3K or something smaller to see just where
the breaking point is.  does 1K points even work?  i.e. determine how close
to the edge you are.

Cheers,
Stephen.

PS. i had never heard of a "theta" join before so i searched it and found
this:
https://cwiki.apache.org/confluence/display/Hive/Theta+Join  and
this: https://issues.apache.org/jira/browse/HIVE-556 (looks like this came
first)

and still in "open" status i see. well you're not alone if that's any
solace!

maybe ping that Jira and see if Edward or Brock (or others) have any news
on the topic, as supporting theta joins sounds like the proper solution to
this whole rigamarole you find yourself in.

On Fri, Sep 2, 2016 at 6:12 AM, Adam  wrote:

> I set the heap size using HADOOP_CLIENT_OPTS all the way to 16g and still
> no luck.
>
> I tried to go down the table join route but the problem is that the
> relation is not an equality so it would be a theta join which is not
> supported in Hive.
> Basically what I am doing is a geographic intersection against 6,000
> points so the where clause has 6000 points in it (I use a custom UDF for
> the intersection).
>
> To avoid the problem I ended up writing another version of the UDF that
> reads the point list from an HDFS file.
>
> It's a low priority I'm sure but I bet there are some inefficiencies in
> the query string handling that could be fixed.  When I traced the code it
> was doing all kinds of StringBuffer and String += type stuff.
>
> Regards,
>


Re: Beeline throws OOM on large input query

2016-09-01 Thread Stephen Sprague
lemme guess.  your query contains an 'in' clause with 1 million static
values? :)

* brute force solution is to set:

 HADOOP_CLIENT_OPTS=-Xmx8G (or whatever)

before you run beeline to force a larger memory size

(i'm pretty sure beeline uses that env var though i didn't actually check
the script)


* the more scalable solution (if my premise above is correct) would be to
join to a table containing your values. But hey, maybe you actually have an
8MM-byte sql query that doesn't contain static data - that could be a world
record!
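
i.e. something along these lines (names and paths made up):

   -- load the static values once into a small table...
   create table filter_vals (val string);
   load data local inpath '/tmp/filter_vals.txt' into table filter_vals;

   -- ...then join to it instead of carrying a giant IN list in the query text
   select f.*
   from   some_fact   f
   join   filter_vals v on f.some_col = v.val;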

Cheers,
Stephen



On Thu, Sep 1, 2016 at 6:21 AM, Adam  wrote:

> Hive Version: 2.1.0
> I have a very large, multi-line input query (8,668,519 chars) and I have
> gone up to 16g heap and still get the same OOM.
>
>
> Error: Error running query: java.lang.OutOfMemoryError: Java heap space
> (state=,code=0)
> org.apache.hive.service.cli.HiveSQLException: Error running query:
> java.lang.OutOfMemoryError: Java heap space
> at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:264)
> at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:
> 250)
> at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStat
> ement.java:309)
> at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.
> java:250)
> at org.apache.hive.beeline.Commands.executeInternal(Commands.
> java:977)
> at org.apache.hive.beeline.Commands.execute(Commands.java:1148)
> at org.apache.hive.beeline.Commands.sql(Commands.java:1063)
> at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1134)
> at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:965)
> at org.apache.hive.beeline.BeeLine.executeFile(BeeLine.java:940)
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:867)
> at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(Bee
> Line.java:499)
> at org.apache.hive.beeline.BeeLine.main(BeeLine.java:482)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
> ssorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
> thodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error running
> query: java.lang.OutOfMemoryError: Java heap space
> at org.apache.hive.service.cli.operation.SQLOperation.prepare(
> SQLOperation.java:218)
> at org.apache.hive.service.cli.operation.SQLOperation.runIntern
> al(SQLOperation.java:269)
> at org.apache.hive.service.cli.operation.Operation.run(Operatio
> n.java:324)
> at org.apache.hive.service.cli.session.HiveSessionImpl.executeS
> tatementInternal(HiveSessionImpl.java:460)
> at org.apache.hive.service.cli.session.HiveSessionImpl.executeS
> tatementAsync(HiveSessionImpl.java:447)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
> ssorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
> thodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(
> HiveSessionProxy.java:78)
> at org.apache.hive.service.cli.session.HiveSessionProxy.access$
> 000(HiveSessionProxy.java:36)
> at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(
> HiveSessionProxy.java:63)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGro
> upInformation.java:1657)
> at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(
> HiveSessionProxy.java:59)
> at com.sun.proxy.$Proxy33.executeStatementAsync(Unknown Source)
> at org.apache.hive.service.cli.CLIService.executeStatementAsync
> (CLIService.java:294)
> at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteS
> tatement(ThriftCLIService.java:497)
> at org.apache.hive.service.rpc.thrift.TCLIService$Processor$Exe
> cuteStatement.getResult(TCLIService.java:1437)
> at org.apache.hive.service.rpc.thrift.TCLIService$Processor$Exe
> cuteStatement.getResult(TCLIService.java:1422)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.
> java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.
> java:39)
> at org.apache.hive.service.auth.TSetIpAddressProcessor.process(
> TSetIpAddressProcessor.java:56)
> at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.
> run(TThreadPoolServer.java:286)
> at 

Re: Quota for rogue ad-hoc queries

2016-09-01 Thread Stephen Sprague
> rogue queries

so this really isn't limited to just hive is it?  any dbms system perhaps
has to contend with this.  even malicious rogue queries as a matter of fact.

timeouts are a cheap way systems handle this - assuming time is related to
resource use. i'm sure beeline or whatever client you use has a timeout
feature.

maybe one could write a separate service - say a governor - that watches
over YARN (or hdfs or whatever resource is scarce) and terminates the
process if it goes beyond a threshold.  think OOM killer.

but, yeah, i admittedly don't know of something out there already that you
can just tap into, but YARN's Resource Manager seems to be the place i'd
research for starters. just look at its name. :)

my unsolicited 2 cents.
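
fwiw, on the original quota question - plain hdfs space quotas on the
scratch dirs are a blunt instrument but they do exist (path and size below
are just examples):

   $ hdfs dfsadmin -setSpaceQuota 1t /tmp/hive    # cap the hive scratch dir
   $ hdfs dfs -count -q /tmp/hive                 # check quota vs usage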



On Wed, Aug 31, 2016 at 10:24 PM, ravi teja  wrote:

> Thanks Mich,
>
> Unfortunately we have many insert queries.
> Are there any other ways?
>
> Thanks,
> Ravi
>
> On Wed, Aug 31, 2016 at 9:45 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Trt this
>>
>> hive.limit.optimize.fetch.max
>>
>>- Default Value: 5
>>- Added In: Hive 0.8.0
>>
>> Maximum number of rows allowed for a smaller subset of data for simple
>> LIMIT, if it is a fetch query. Insert queries are not restricted by this
>> limit.
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 31 August 2016 at 13:42, ravi teja  wrote:
>>
>>> Hi Community,
>>>
>>> Many users run adhoc hive queries on our platform.
>>> Some rogue queries managed to fill up the hdfs space and causing
>>> mainstream queries to fail.
>>>
>>> We wanted to limit the data generated by these adhoc queries.
>>> We are aware of strict param which limits the data being scanned, but it
>>> is of less help as huge number of user tables aren't partitioned.
>>>
>>> Is there a way we can limit the data generated from hive per query, like
>>> a hve parameter for setting HDFS quotas for job level *scratch*
>>> directory or any other approach?
>>> What's the general approach to gaurdrail such multi-tenant cases.
>>>
>>> Thanks in advance,
>>> Ravi
>>>
>>
>>
>


Re: hive.root.logger influencing query plan?? so it's not so

2016-08-31 Thread Stephen Sprague
hi guys,

Vlad: good suggestion, however in my case it's a 5 second query (when it
works)

Gopal: Thanks for the explanation on the effect logging can have on the
execution path. somewhat counter-intuitive i must say and as you can
imagine a tad more challenging to debug - when debugging influences the
observation. :)

that said adding:

   set hive.fetch.task.conversion=none;

forced the query plans to be the same and that made the difference. it
worked.

now i need to research that setting and see what the implications are.  I'm
sure it doesn't come for free.

and as always a big thanks!  lemme know if i can provide any details.  the
partition size is >1G btw. and this is using hive 2.1.0.

Cheers,
Stephen.
ps here's the hdfs info - the table is bucketed by 24 fwiw.


$ hdfs dfs -du -h
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/00_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/01_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/02_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/03_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/04_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/05_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/06_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/07_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/08_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/09_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/10_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/11_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/12_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/13_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/14_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/15_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/16_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/17_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/18_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/19_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/20_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/21_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/22_0
1.1 G  3.2 G
/user/hive/warehouse/raw_logs.db/fact_www_access_datekey_by_hour/date_key=20160828/23_0




On Wed, Aug 31, 2016 at 12:19 AM, Vladimir Kroz <vladimir.k...@gmail.com>
wrote:

> One of the causes could be that on long-running queries your terminal
> session gets disconnected and client process terminate (appearing like
>  query hangs).
>
> When debug messages are on, they will keep terminal session alive and
> hence allowing your query to complete.
>
> I'm not sure if this is your case, but based on synopsis your provided it
> could be one of the reasons.
>
> Cheers,
> Vlad
>
> ---
> From: Stephen Sprague <sprag...@gmail.com>
> To: "user@hive.apache.org" <user@hive.apache.org>
> Cc:
> Date: Tue, 30 Aug 2016 20:28:50 -0700
> Subject: hive.root.logger influencing query plan?? so it's not so
> Hi guys,
> I've banged my head on this one all day and i need to surrender.  I have a
> query that hangs (never returns). However, when i turn on logging to DEBUG
> level it works.  I'm stumped.   I include here the query, the different
> query plans (with the only thing different being the log level) and a
> traceback on the query that hangs.
>
> * the query
>
> #!/bin/bash
>
> sql="
> explain SELECT *
> FROM   raw_logs.fact_www_access_datekey_by_hour
> WHERE  date_key = 20160828  -- partition key
> and lower(http_string) = 'foo'  -- always evaluates to false
> limit 1 -- if i remove limit 1 it works.  I expect
> an empty result set.
> ;
> "
>
> #hive -hiveconf hive.

hive.root.logger influencing query plan?? so it's not so

2016-08-30 Thread Stephen Sprague
Hi guys,
I've banged my head on this one all day and i need to surrender.  I have a
query that hangs (never returns). However, when i turn on logging to DEBUG
level it works.  I'm stumped.   I include here the query, the different
query plans (with the only thing different being the log level) and a
traceback on the query that hangs.

* the query

#!/bin/bash

sql="
explain SELECT *
FROM   raw_logs.fact_www_access_datekey_by_hour
WHERE  date_key = 20160828  -- partition key
and lower(http_string) = 'foo'  -- always evaluates to false
limit 1 -- if i remove limit 1 it works.  I expect
an empty result set.
;
"

#hive -hiveconf hive.root.logger=ERROR,console -e "$sql" >stdout.bad
2>stderr.bad  #hangs
hive -hiveconf hive.root.logger=DEBUG,console -e "$sql" >stdout.good
2>stderr.good #works



* explain plan for the one that hangs (ERROR,console)

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
Fetch Operator
  limit: 1
  Processor Tree:
TableScan
  alias: fact_www_access_datekey_by_hour
  Statistics: Num rows: 144355924 Data size: 158755813572 Basic
stats: COMPLETE Column stats: NONE
  Filter Operator
predicate: (lower(http_string) = 'foo') (type: boolean)
Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
Select Operator
  expressions: payload_year (type: int), payload_month (type:
int), payload_day (type: int), payload_time (type: string),
payload_gmt_offset (type: string), perf_time_micros (type: int),
http_string (type: string), http_type (type: string), http_version (type:
string), http_rc (type: int), http_size (type: int), referrer (type:
string), user_agent (type: string), bot_flag (type: string), ip_list (type:
array), scalar_tags (type: map), url_components
(type: array), user_id (type: string), lt_session_id (type:
string), session_id (type: string), log_name (type: string), log_lineno
(type: int), log_line (type: string), web_server (type: string),
location_info (type: map), dp_main_url_type (type: string),
dp_inventory_type (type: string), dp_main_url (type:
map), dp_tag_url (type: map),
dp_referrer_url (type: map), hour (type: int),
20160828 (type: int)
  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
_col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15,
_col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24,
_col25, _col26, _col27, _col28, _col29, _col30, _col31
  Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
  Limit
Number of rows: 1
Statistics: Num rows: 1 Data size: 1099 Basic stats:
COMPLETE Column stats: NONE
ListSink

* query plan for the one that works (DEBUG,console)

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Map Operator Tree:
  TableScan
alias: fact_www_access_datekey_by_hour
Statistics: Num rows: 144355924 Data size: 158755813572 Basic
stats: COMPLETE Column stats: NONE
Filter Operator
  predicate: (lower(http_string) = 'foo') (type: boolean)
  Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: payload_year (type: int), payload_month (type:
int), payload_day (type: int), payload_time (type: string),
payload_gmt_offset (type: string), perf_time_micros (type: int),
http_string (type: string), http_type (type: string), http_version (type:
string), http_rc (type: int), http_size (type: int), referrer (type:
string), user_agent (type: string), bot_flag (type: string), ip_list (type:
array), scalar_tags (type: map), url_components
(type: array), user_id (type: string), lt_session_id (type:
string), session_id (type: string), log_name (type: string), log_lineno
(type: int), log_line (type: string), web_server (type: string),
location_info (type: map), dp_main_url_type (type: string),
dp_inventory_type (type: string), dp_main_url (type:
map), dp_tag_url (type: map),
dp_referrer_url (type: map), hour (type: int),
20160828 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4,
_col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14,
_col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23,
_col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31
Statistics: Num rows: 72177962 Data size: 79377906786 Basic
stats: COMPLETE Column stats: NONE
Limit
  Number of rows: 1
  Statistics: Num rows: 1 Data 

Re: hive 2.1.0 + drop view

2016-08-26 Thread Stephen Sprague
just to cap this discussion...  thank you Ashutosh for that link that was
very helpful.

I did the following based on my reading of it.


1. added the following to hive-site.xml


<property>
  <name>datanucleus.rdbms.initializeColumnInfo</name>
  <value>NONE</value>
</property>



this allows one to create and drop views; however, it does not allow you
to drop views created previously without that setting.

so...

2. did a show create table on all the views and saved to file.


3. surgically went into the hive metastore and deleted the views from table
"TBLS" (but first had to delete from "TABLE_PARAMS" and "TBL_PRIVS" due to
ref constraints - a rough sketch of the SQL appears below)


4. recreated the views as best as possible, but given that some views are
dependent on other views, multiple passes are needed


That was my workaround anyway.
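
for step 3, a sketch of the kind of SQL involved (postgres, using hive's
quoted identifiers; back up the metastore before trying anything like it,
and note your schema version may have other referencing tables):

   delete from "TBL_PRIVS"    where "TBL_ID" in
      (select "TBL_ID" from "TBLS" where "TBL_TYPE" = 'VIRTUAL_VIEW');
   delete from "TABLE_PARAMS" where "TBL_ID" in
      (select "TBL_ID" from "TBLS" where "TBL_TYPE" = 'VIRTUAL_VIEW');
   delete from "TBLS"         where "TBL_TYPE" = 'VIRTUAL_VIEW';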


Cheers,
Stephen
PS. altering the table to 'varchar' did nothing on postgres - that's just a
synonym for 'character varying'

On Fri, Aug 26, 2016 at 1:40 PM, Ashutosh Chauhan <hashut...@apache.org>
wrote:

> Its a bug in DataNucleus. See discussion on : https://issues.apache.org/
> jira/browse/HIVE-14322
>
> On Fri, Aug 26, 2016 at 1:34 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Actually I don't understand why they have defined TBL_NAME and TBL_TYPE
>> as NVARCHAR (this is from Sybase similar to yours)
>>
>> [image: Inline images 1]
>>
>> Oracle seems to be correct.
>>
>> And if we look further
>>
>> Use the fixed-length datatype, *nchar(n)* , and the variable-length
>> datatype, *nvarchar(n)*, for both single-byte and multibyte character
>> sets, such as Japanese. The difference between *nchar(n)* and *char(n)*
>> and *nvarchar(n)* and *varchar(n)* is that both *nchar(n)* and
>> *nvarchar(n)* allocate storage based on *n* times the number of bytes
>> per character (based on the default character set). *char(n)* and
>> *varchar(n)* allocate *n* bytes of storage.
>>
>> What character set are you using for your server/database?
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 26 August 2016 at 21:03, Stephen Sprague <sprag...@gmail.com> wrote:
>>
>>> thanks.  what i gotta try is altering the table and changing "character
>>> varying(767)" to "varchar(767)" - I think.
>>>
>>> On Fri, Aug 26, 2016 at 12:59 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> You don't really want to mess around with the schema.
>>>>
>>>> This is what I have in Oracle 12c schema for TBLS. The same as yours
>>>>
>>>>
>>>> [image: Inline images 1]
>>>>
>>>> But this is Oracle, a serious database :)
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 26 August 2016 at 20:32, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>
>>>>> yeah... so after the hive upgrade scripts ran we have this in pg for
>>>>> table "TABLS"
>>>>>
>>>>> {quote}
>>>>> dwr_prod_2_1_0=> \d "TBLS"
>>>>>   Table "pub

Re: hive 2.1.0 + drop view

2016-08-26 Thread Stephen Sprague
thanks.  what i gotta try is altering the table and changing "character
varying(767)" to "varchar(767)" - I think.

On Fri, Aug 26, 2016 at 12:59 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> You don't really want to mess around with the schema.
>
> This is what I have in Oracle 12c schema for TBLS. The same as yours
>
>
> [image: Inline images 1]
>
> But this is Oracle, a serious database :)
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 August 2016 at 20:32, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> yeah... so after the hive upgrade scripts ran we have this in pg for
>> table "TABLS"
>>
>> {quote}
>> dwr_prod_2_1_0=> \d "TBLS"
>>   Table "public.TBLS"
>>Column   |  Type  |Modifiers
>> ++--
>> ---
>>  TBL_ID | bigint | not null
>>  CREATE_TIME| bigint | not null
>>  DB_ID  | bigint |
>>  LAST_ACCESS_TIME   | bigint | not null
>>  OWNER  | character varying(767) | default NULL::character
>> varying
>>  RETENTION  | bigint | not null
>>  SD_ID  | bigint |
>>  TBL_NAME   | character varying(128) | default NULL::character
>> varying
>>  TBL_TYPE   | character varying(128) | default NULL::character
>> varying
>>  VIEW_EXPANDED_TEXT | text   |
>>  VIEW_ORIGINAL_TEXT | text   |
>>
>> {quote}
>>
>> wonder if i can perform some surgery here. :o  do i feel lucky?
>>
>> On Fri, Aug 26, 2016 at 12:28 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> well that doesn't bode well. :(
>>>
>>> we definitely need to use a remote metastore given this is a prod env
>>> with 100's of users.  i wasn't able to see anything in the metastore log
>>> though so i'm  wondering what logger to run to get that?  don't think its
>>> hive.root.logger.
>>>
>>> thanks,
>>> Stephen.
>>> just toggling hive.metastore.try.direct.sql between true or false which
>>> seemed like it should influence the metastore access behaviour did not
>>> change anything.  I guess this is a postgres incompatiblity with jdbc4
>>> (this "character varying" thing.)
>>>
>>> On Fri, Aug 26, 2016 at 8:55 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Sounds like there are a number of issues with Hive metastore on
>>>> Postgres. There have been a number of reports on this.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 26 August 2016 at 16:43, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>
>>>>> thanks Gopal.  you're right our metastore is using Postgres. very
>>>>> interesting you were able to intuit that!
>>>>>
>>>>> lemme give your suggestions a try

Re: hive 2.1.0 + drop view

2016-08-26 Thread Stephen Sprague
yeah... so after the hive upgrade scripts ran we have this in pg for table
"TABLS"

{quote}
dwr_prod_2_1_0=> \d "TBLS"
  Table "public.TBLS"
   Column   |  Type  |Modifiers
--------------------+------------------------+----------------------------------
 TBL_ID | bigint | not null
 CREATE_TIME| bigint | not null
 DB_ID  | bigint |
 LAST_ACCESS_TIME   | bigint | not null
 OWNER  | character varying(767) | default NULL::character
varying
 RETENTION  | bigint | not null
 SD_ID  | bigint |
 TBL_NAME   | character varying(128) | default NULL::character
varying
 TBL_TYPE   | character varying(128) | default NULL::character
varying
 VIEW_EXPANDED_TEXT | text   |
 VIEW_ORIGINAL_TEXT | text   |

{quote}

wonder if i can perform some surgery here. :o  do i feel lucky?

On Fri, Aug 26, 2016 at 12:28 PM, Stephen Sprague <sprag...@gmail.com>
wrote:

> well that doesn't bode well. :(
>
> we definitely need to use a remote metastore given this is a prod env
> with 100's of users.  i wasn't able to see anything in the metastore log
> though so i'm  wondering what logger to run to get that?  don't think its
> hive.root.logger.
>
> thanks,
> Stephen.
> just toggling hive.metastore.try.direct.sql between true or false which
> seemed like it should influence the metastore access behaviour did not
> change anything.  I guess this is a postgres incompatiblity with jdbc4
> (this "character varying" thing.)
>
> On Fri, Aug 26, 2016 at 8:55 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Sounds like there are a number of issues with Hive metastore on Postgres.
>> There have been a number of reports on this.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 26 August 2016 at 16:43, Stephen Sprague <sprag...@gmail.com> wrote:
>>
>>> thanks Gopal.  you're right our metastore is using Postgres. very
>>> interesting you were able to intuit that!
>>>
>>> lemme give your suggestions a try and i'll post back.
>>>
>>> thanks!
>>> Stephen
>>>
>>> On Fri, Aug 26, 2016 at 8:32 AM, Gopal Vijayaraghavan <gop...@apache.org
>>> > wrote:
>>>
>>>> > NULL::character%20varying)
>>>> ...
>>>> > i want to say this is somehow related to a java version (we're using
>>>> 8)
>>>> >but i'm not sure.
>>>>
>>>> The "character varying" looks like a lot like a Postgres issue to me
>>>> (though character varying could be the real term for varchar in another
>>>> DB).
>>>>
>>>> The hive-metastore.log should have the real backtrace.
>>>>
>>>> You can try doing
>>>>
>>>> set hive.metastore.uris=;
>>>> set hive.metastore.try.direct.sql=false;
>>>>
>>>>
>>>> (i.e switch to embedded metastore + disable direct sql, in Hive CLI -
>>>> provided you have all the password stuff for the metastore in the
>>>> regular
>>>> hive-site.xml)
>>>>
>>>> https://github.com/apache/hive/blob/master/metastore/src/jav
>>>> a/org/apache/ha
>>>> doop/hive/metastore/MetaStoreDirectSql.java#L887
>>>> <https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L887>
>>>>
>>>>
>>>> Cheers,
>>>> Gopal
>>>>
>>>>
>>>>
>>>
>>
>


Re: hive 2.1.0 + drop view

2016-08-26 Thread Stephen Sprague
well that doesn't bode well. :(

we definitely need to use a remote metastore given this is a prod env with
100's of users.  i wasn't able to see anything in the metastore log though,
so i'm wondering what logger to turn on to get that?  i don't think it's
hive.root.logger.

thanks,
Stephen.
just toggling hive.metastore.try.direct.sql between true and false, which
seemed like it should influence the metastore access behaviour, did not
change anything.  I guess this is a postgres incompatibility with jdbc4
(this "character varying" thing.)

On Fri, Aug 26, 2016 at 8:55 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Sounds like there are a number of issues with Hive metastore on Postgres.
> There have been a number of reports on this.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 26 August 2016 at 16:43, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> thanks Gopal.  you're right our metastore is using Postgres. very
>> interesting you were able to intuit that!
>>
>> lemme give your suggestions a try and i'll post back.
>>
>> thanks!
>> Stephen
>>
>> On Fri, Aug 26, 2016 at 8:32 AM, Gopal Vijayaraghavan <gop...@apache.org>
>> wrote:
>>
>>> > NULL::character%20varying)
>>> ...
>>> > i want to say this is somehow related to a java version (we're using 8)
>>> >but i'm not sure.
>>>
>>> The "character varying" looks like a lot like a Postgres issue to me
>>> (though character varying could be the real term for varchar in another
>>> DB).
>>>
>>> The hive-metastore.log should have the real backtrace.
>>>
>>> You can try doing
>>>
>>> set hive.metastore.uris=;
>>> set hive.metastore.try.direct.sql=false;
>>>
>>>
>>> (i.e switch to embedded metastore + disable direct sql, in Hive CLI -
>>> provided you have all the password stuff for the metastore in the regular
>>> hive-site.xml)
>>>
>>> https://github.com/apache/hive/blob/master/metastore/src/jav
>>> a/org/apache/ha
>>> doop/hive/metastore/MetaStoreDirectSql.java#L887
>>> <https://github.com/apache/hive/blob/master/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L887>
>>>
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>
>


Re: hive 2.1.0 + drop view

2016-08-26 Thread Stephen Sprague
thanks Gopal.  you're right our metastore is using Postgres. very
interesting you were able to intuit that!

lemme give your suggestions a try and i'll post back.

thanks!
Stephen

On Fri, Aug 26, 2016 at 8:32 AM, Gopal Vijayaraghavan 
wrote:

> > NULL::character%20varying)
> ...
> > i want to say this is somehow related to a java version (we're using 8)
> >but i'm not sure.
>
> The "character varying" looks like a lot like a Postgres issue to me
> (though character varying could be the real term for varchar in another
> DB).
>
> The hive-metastore.log should have the real backtrace.
>
> You can try doing
>
> set hive.metastore.uris=;
> set hive.metastore.try.direct.sql=false;
>
>
> (i.e switch to embedded metastore + disable direct sql, in Hive CLI -
> provided you have all the password stuff for the metastore in the regular
> hive-site.xml)
>
> https://github.com/apache/hive/blob/master/metastore/
> src/java/org/apache/ha
> doop/hive/metastore/MetaStoreDirectSql.java#L887
>
>
> Cheers,
> Gopal
>
>
>


hive 2.1.0 + drop view

2016-08-26 Thread Stephen Sprague
hey guys,
this one's a little more strange.

hive> create view foo_vw as select * from foo;
OK
Time taken: 0.376 seconds

hive> drop view foo_vw;
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:java.lang.IllegalArgumentException:
java.net.URISyntaxException: Relative path in absolute URI:
NULL::character%20varying)

i want to say this is somehow related to a java version (we're using 8) but
i'm not sure.

any insights on how to go from here?

cheers,
Stephen

PS setting hive.root.logger=DEBUG,console here's the traceback.

Caused by: MetaException(message:java.lang.IllegalArgumentException:
java.net.URISyntaxException: Relative path in absolute URI:
NULL::character%20varying)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$drop_table_with_environment_context_result$drop_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:47552)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$drop_table_with_environment_context_result$drop_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:47529)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$drop_table_with_environment_context_result.read(ThriftHiveMetastore.java:47471)
at
org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_drop_table_with_environment_context(ThriftHiveMetastore.java:1358)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.drop_table_with_environment_context(ThriftHiveMetastore.java:1342)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.drop_table_with_environment_context(HiveMetaStoreClient.java:2329)
at
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.drop_table_with_environment_context(SessionHiveMetaStoreClient.java:114)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:1074)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropTable(HiveMetaStoreClient.java:1010)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154)
at com.sun.proxy.$Proxy21.dropTable(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2254)
at com.sun.proxy.$Proxy21.dropTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:1172)


Re: hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

2016-08-25 Thread Stephen Sprague
awesome!  i'm not crazy after all!

so the workaround we devised here for this pattern:

   partition_key not in ('foo')

is:

   not array_contains(array('foo'), partition_key)
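
for example, in context (table name invented, etl_source_database being the
partition key from the original query):

   select count(*)
   from   some_fact_table
   where  not array_contains(array('foo'), etl_source_database);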

thanks,
Stephen.

On Thu, Aug 25, 2016 at 6:31 PM, Sergey Shelukhin <ser...@hortonworks.com>
wrote:

> I can repro this on master. I’ll file a bug...
>
> From: Stephen Sprague <sprag...@gmail.com>
> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
> Date: Thursday, August 25, 2016 at 13:34
> To: "user@hive.apache.org" <user@hive.apache.org>
> Subject: Re: hive 2.1.0 and "NOT IN ( list )" and column is a
> partition_key
>
> Hi Gopal,
> Thank you for this insight.  good stuff.   The thing is there is no 'foo'
> for etl_database_source so that filter if anything should be
> short-circuited to 'true'.  ie. double nots.   1. not in  2. and foo not
> present.
>
> it doesn't matter what what i put in that "not in" clause the filter
> always comes back false if the column is a partition_key of course.
>
> thanks for the tip on explain extended that's some crazy output so i'm
> sifting for clues in that now.   i hear you though - something in there
> with the metastore is at play.
>
> Cheers,
> Stephen.
>
> On Thu, Aug 25, 2016 at 1:12 PM, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>>
>> > anybody run up against this one?  hive 2.1.0 + using a  "not in" on a
>> >list + the column is a partition key participant.
>>
>> The partition filters are run before the plan is generated.
>>
>> >AND etl_source_database not in ('foo')
>>
>> Is there a 'foo' in etl_source_database?
>>
>> > predicate: false (type: boolean)   this kills any hope
>> >of the query returning anything.
>> ...
>> >  Select Operator###doesn't even mention a filter
>>
>> This is probably good news, because that's an optimization.
>>
>> PrunedPartitionList getPartitionsFromServer(Table tab, final
>> ExprNodeGenericFuncDesc compactExpr ...) {
>> ...
>>   hasUnknownPartitions = Hive.get().getPartitionsByExpr(
>>   tab, compactExpr, conf, partitions);
>> }
>>
>>
>> goes into the metastore and evaluates the IN and NOT IN for partitions
>> ahead of time.
>>
>>
>> So, this could mean that the partition pruning evaluation returned no
>> partitions at all (or just exactly matched partitions only, skipping the
>> filter per-row).
>>
>> In 2.x, you might notice it does a bit fancier things there as well, like
>>
>> select count(1) from table where year*1 + month*100 + day >= 20160101;
>>
>> https://github.com/apache/hive/blob/master/ql/src/java/org/
>> apache/hadoop/hi
>> ve/ql/optimizer/ppr/PartitionPruner.java#L468
>> <https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionPruner.java#L468>
>>
>>
>> You can try "explain extended" and see which partitions are selected (&
>> validate that the filter removed was applied already).
>>
>> Cheers,
>> Gopal
>>
>>
>>
>>
>>
>


Re: hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

2016-08-25 Thread Stephen Sprague
Hi Gopal,
Thank you for this insight.  good stuff.   The thing is there is no 'foo'
for etl_source_database, so that filter, if anything, should be
short-circuited to 'true'.  i.e. double nots: 1. not in  2. and 'foo' not
present.

it doesn't matter what i put in that "not in" clause, the filter always
comes back false if the column is a partition_key of course.

thanks for the tip on explain extended - that's some crazy output so i'm
sifting for clues in it now.   i hear you though - something in there
with the metastore is at play.

Cheers,
Stephen.

On Thu, Aug 25, 2016 at 1:12 PM, Gopal Vijayaraghavan 
wrote:

>
> > anybody run up against this one?  hive 2.1.0 + using a  "not in" on a
> >list + the column is a partition key participant.
>
> The partition filters are run before the plan is generated.
>
> >AND etl_source_database not in ('foo')
>
> Is there a 'foo' in etl_source_database?
>
> > predicate: false (type: boolean)   this kills any hope
> >of the query returning anything.
> ...
> >  Select Operator###doesn't even mention a filter
>
> This is probably good news, because that's an optimization.
>
> PrunedPartitionList getPartitionsFromServer(Table tab, final
> ExprNodeGenericFuncDesc compactExpr ...) {
> ...
>   hasUnknownPartitions = Hive.get().getPartitionsByExpr(
>   tab, compactExpr, conf, partitions);
> }
>
>
> goes into the metastore and evaluates the IN and NOT IN for partitions
> ahead of time.
>
>
> So, this could mean that the partition pruning evaluation returned no
> partitions at all (or just exactly matched partitions only, skipping the
> filter per-row).
>
> In 2.x, you might notice it does a bit fancier things there as well, like
>
> select count(1) from table where year*1 + month*100 + day >= 20160101;
>
> https://github.com/apache/hive/blob/master/ql/src/java/
> org/apache/hadoop/hi
> ve/ql/optimizer/ppr/PartitionPruner.java#L468
>
>
> You can try "explain extended" and see which partitions are selected (&
> validate that the filter removed was applied already).
>
> Cheers,
> Gopal
>
>
>
>
>


Re: hive throws ConcurrentModificationException when executing insert overwrite table

2016-08-17 Thread Stephen Sprague
indeed +1 to Gopal on that explanation! That was huge.

On Wed, Aug 17, 2016 at 12:58 AM, 明浩 冯  wrote:

> Hi Gopal,
>
>
> It works when I disabled the dfs.namenode.acls.
>
> For the data loss, it doesn't affect me too much currently. But I will
> track the issue in Kylin.
>
> Thank you very much for your detailed explain and solution.  You saved me!
>
>
> Best Regards,
>
> Minghao Feng
> --
> *From:* Gopal Vijayaraghavan  on behalf of Gopal
> Vijayaraghavan 
> *Sent:* Wednesday, August 17, 2016 1:18:54 PM
> *To:* user@hive.apache.org
> *Subject:* Re: hive throws ConcurrentModificationException when executing
> insert overwrite table
>
>
> > Yes, Kylin generated the query. I'm using Kylin 1.5.3.
>
> I would report a bug to Kylin about DISTRIBUTE BY RAND().
>
> This is what happens when a node which ran a Map task fails and the whole
> task is retried.
>
> Assume that the first attempt of the Map task0 wrote value1 into
> reducer-99, because RAND() returned 99.
>
> Now the task succeeds and then reducer starts, running reducer-0
> successfully, which write _0.
>
> But before reducer-99 runs, the node which ran Map task0 crashes.
>
> So, the engine re-runs Map task0 on another node. Except because RAND() is
> completely random, it may give 0 as the output of RAND() for "value1".
>
> The reducer-0 output from Map task0 now has "value1", except there's no
> task which will ever read that out or write that out.
>
> In short, the output of the table will not contain "value1", despite the
> input and the shuffle outputs containing "value1".
>
> I would replace the DISTRIBUTE BY RAND() with SORT BY 0, for a random
> distribution without data loss.
>
> > But I still not sure how can I fix the problem. I'm a beginner of Hive
> >and Kylin, Can the problem be fixed by just change the hive or kylin
> >settings?
>
> If you're just experimenting with Kylin right now, I recommend just
> disabling the ACL settings in HDFS (this is not permissions btw, ACLs are
> permissions++).
>
> Set dfs.namenode.acls.enabled=false in core-site.xml and wherever else in
> your /etc/hadoop/conf it shows up and you should be good to avoid the race
> condition.
>
> Cheers,
> Gopal
>
>
>
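
for anyone else who lands on this thread, the rewrite Gopal describes is
basically the following (table and column names invented):

   -- what the generated query effectively does today; a retried map task
   -- can re-roll rand() and silently lose rows:
   insert overwrite table kylin_intermediate_foo
   select id, name from src_foo
   distribute by rand();

   -- Gopal's suggested replacement (deterministic per row):
   insert overwrite table kylin_intermediate_foo
   select id, name from src_foo
   sort by 0;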


Re: JsonSerDe and mapping tweet's user structure error

2016-08-16 Thread Stephen Sprague
stackoverflow is your friend.

that said have a peek at the doc even :)  cf.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Keywords,Non-reservedKeywordsandReservedKeywords
paying close attention to this paragraph:

{quote}
Reserved keywords are permitted as identifiers if you quote them as
described in Supporting Quoted Identifiers in Column Names (version 0.13.0
and later, see HIVE-6013). Most of the keywords are reserved through
HIVE-6617 in order to reduce the ambiguity in grammar (version 1.2.0 and
later). There are two ways if the user still would like to use those
reserved keywords as identifiers: (1) use quoted identifiers, (2) set
hive.support.sql11.reserved.keywords=false.

{quote}
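
so concretely, option (1) for your DDL would look something like the below
(just a sketch from memory, with only the relevant columns kept):

  create external table tweets (
    id bigint,
    `user` struct<id:bigint, name:string, screen_name:string>,
    text string
  )
  partitioned by (mmdd int)
  row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
  location 'my location';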



On Tue, Aug 16, 2016 at 3:23 PM, Igor Kravzov 
wrote:

> Hi,
>
> I am storing tweets in HDFS and creating external table on top of these
> tweets..
> The command I am using is bellow:
>
> *create external table tweets (*
> *  id bigint,*
> *  in_reply_to_screen_name string,*
> *  in_reply_to_status_id bigint,*
> *  user struct<*
> *  id: bigint, *
> *  name: string,*
> *  screen_name: string>,*
> *  retweeted_status struct <*
> *  id:bigint>,*
> *   text string*
> * )*
> *partitioned by (mmdd int)*
> *ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'*
> *location 'my location';*
>
> But I am getting the following error:
>
> org.apache.ambari.view.hive.client.HiveInvalidQueryException: Error while 
> compiling statement: FAILED: ParseException line 5:2 Failed to recognize 
> predicate 'user'. Failed rule: 'identifier' in column specification 
> [ERROR_STATUS]
>
> Without "user" definition"  query runs fine.
>
> Can it be because 'user' is reserved keyword? Or there is something else?
> "user" as part of tweet structure. How can I name column differently?
>
> Thank in advance.
>


Re: hiver errors

2016-08-10 Thread Stephen Sprague
this error message says everything you need to know:

>Likely cause: new client talking to old server. Continuing without it.

when you upgrade hive you also need to upgrade the metastore schema.
failing to do that can trigger the message you're getting.

On Wed, Aug 10, 2016 at 6:41 AM, Mich Talebzadeh 
wrote:

> which version of hive it it?
>
> can you log in to hive via hive cli
>
>
>
> *$HIVE_HOME/bin/hive*
> Logging initialized using configuration in file:/usr/lib/hive/conf/hive-
> log4j2.properties
> hive>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 10 August 2016 at 14:32, johny casanova 
> wrote:
>
>> helo,
>>
>>
>> I just installed hive and getting this errors:
>>
>> in the Hive metastore server:
>>
>>
>>
>> Metastore thrift server threw an exception
>>
>> metaexception(message:Version information not found in metastore.)
>>
>>
>>
>>
>>
>>
>>
>> then on the hiveserver2
>>
>>
>>
>>
>>
>> WARN hive.metastore: set_ugi() not successful, Likely cause: new client
>> talking to old server. Continuing without it.
>> org.apache.thrift.transport.TTransportException: java.net
>> .SocketTimeoutException: Read timed out
>>
>> I'm using CDH 5.7.0
>>
>
>


Re: beeline/hiveserver2 + logging

2016-08-10 Thread Stephen Sprague
Hi Gopal,
Aha!  thank you for the background behind this.  that makes things much more
understandable.

and ~3000 queries across 10 HS2 servers. sweet. now that's what i call
pushing the edge. I like it!

Thanks again,
Stephen.

On Tue, Aug 9, 2016 at 10:29 PM, Gopal Vijayaraghavan 
wrote:

> > not get the progress messages back until the query finishes which
> >somewhat defeats the purpose of interactive usage.
>
> That happens entirely on the client side btw.
>
> So to avoid a hard sleep() + check loop causing pointless HTTP traffic,
> HiveServer2 now does a long poll on the server side.
>
> hive.server2.long.polling.timeout", "5000ms"
>
>
> This means that it is edge-triggered to return whenever the query finishes
> instead of adding extra time when the results are ready but beeline
> doesn't know about.
>
>
> However, the get_logs() synchronizes on the same HiveStatement and is
> mutexed out by the long poll for getting results.
>
> You can escape this on a low-concurrency cluster by changing the
> long.polling.timeout to 0.5s instead of 5s & restarting HS2.
>
> However as the total # of concurrent queries goes up, the current setting
> does very well due to the reduction in total # of "Nope, come back" http
> noise (largest parallel workload I've seen is about ~3000 queries on 10
> HS2 nodes load-balanced).
>
> Cheers,
> Gopal
>
>
>


Re: beeline/hiveserver2 + logging

2016-08-09 Thread Stephen Sprague
well, well. i just found this:
https://issues.apache.org/jira/browse/HIVE-14183   seems something changed
between 1.2.1 and 2.1.0.

i'll see if the Rx as prescribed in that ticket does indeed work for me.

Thanks,
Stephen.

On Tue, Aug 9, 2016 at 5:12 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> hey guys,
> try as i might i cannot seem to get beeline (via jdbc) to log information
> back from hiveserver2 like job_id, progress and that kind of information
> (similiar to what the local beeline or hive clients do.)
>
> i see this ticket that is closed: https://issues.apache.org/
> jira/browse/HIVE-7615 which leads me to believe things should work.
>
> info:
>* running hive v2.1.0
>* i have enabled all the hive.hiveserver2.logging settings
>
>
> do other people have this working? whats' the secret?
>
> Thanks,
> stephen.
>
> 1. beeline -u jdbc:hive2://  #local beeline-cli) reports progress
> 2. hive --service cli  #local hive-cli reports progress
>
> 3. beeline -u jdbc:hive2://dwrdevnn1:10001/default;auth=noSasl  #remote
> beeline client *does not* report progress
>
> i have Hiveserver2 running in binary mode (not http) not sure that makes
> any difference.
>
>
>


beeline/hiveserver2 + logging

2016-08-09 Thread Stephen Sprague
hey guys,
try as i might i cannot seem to get beeline (via jdbc) to log information
back from hiveserver2 like job_id, progress and that kind of information
(similar to what the local beeline or hive clients do.)

i see this ticket that is closed:
https://issues.apache.org/jira/browse/HIVE-7615 which leads me to believe
things should work.

info:
   * running hive v2.1.0
   * i have enabled all the hive.hiveserver2.logging settings
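
     (by "all the settings" i mean roughly the two below - names from memory,
     shown as session-style set commands for illustration; in practice they
     live in hive-site.xml on the HS2 host)

     set hive.server2.logging.operation.enabled=true;
     set hive.server2.logging.operation.level=EXECUTION;  -- or VERBOSE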


do other people have this working? what's the secret?

Thanks,
stephen.

1. beeline -u jdbc:hive2://  #local beeline-cli reports progress
2. hive --service cli  #local hive-cli reports progress

3. beeline -u jdbc:hive2://dwrdevnn1:10001/default;auth=noSasl  #remote
beeline client *does not* report progress

i have Hiveserver2 running in binary mode (not http) not sure that makes
any difference.


Re: msck repair table and hive v2.1.0

2016-07-14 Thread Stephen Sprague
oach by setting
> "hive.mv.files.thread=0"?  You mentioned that your tables tables are in s3,
> but the external table created was pointing to HDFS. Was that intentional?
>
> ~Rajesh.B
>
> On Fri, Jul 15, 2016 at 6:58 AM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> in the meantime given my tables are in s3 i've written a utility to do a
>> 'aws s3 ls' on the bucket and folder in question, change the folder syntax
>> to partition syntax and then issued my own 'alter table ... add partition'
>> for each partition.
>>
>>
>> so essentially it does what msck repair tables does but in a non-portable
>> way.  oh well.  gotta do what ya gotta do.
>>
>> On Wed, Jul 13, 2016 at 9:29 PM, Stephen Sprague <sprag...@gmail.com>
>> wrote:
>>
>>> hey guys,
>>> i'm using hive version 2.1.0 and i can't seem to get msck repair table
>>> to work.  no matter what i try i get the 'ol NPE.  I've set the log level
>>> to 'DEBUG' but yet i still am not seeing any smoking gun.
>>>
>>> would anyone here have any pointers or suggestions to figure out what's
>>> going wrong?
>>>
>>> thanks,
>>> Stephen.
>>>
>>>
>>>
>>> hive> create external table foo (a int) partitioned by (date_key bigint)
>>> location 'hdfs:/tmp/foo';
>>> OK
>>> Time taken: 3.359 seconds
>>>
>>> hive> msck repair table foo;
>>> FAILED: Execution Error, return code 1 from
>>> org.apache.hadoop.hive.ql.exec.DDLTask
>>>
>>>
>>> from the log...
>>>
>>> 2016-07-14T04:08:02,431 DEBUG [MSCK-GetPaths-1]:
>>> httpclient.RestStorageService (:()) - Found 13 objects in one batch
>>> 2016-07-14T04:08:02,431 DEBUG [MSCK-GetPaths-1]:
>>> httpclient.RestStorageService (:()) - Found 0 common prefixes in one batch
>>> 2016-07-14T04:08:02,433 ERROR [main]: metadata.HiveMetaStoreChecker
>>> (:()) - java.lang.NullPointerException
>>> 2016-07-14T04:08:02,434 WARN  [main]: exec.DDLTask (:()) - Failed to run
>>> metacheck:
>>> org.apache.hadoop.hive.ql.metadata.HiveException:
>>> java.lang.NullPointerException
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:444)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:448)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:388)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.findUnknownPartitions(HiveMetaStoreChecker.java:309)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:285)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:230)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkMetastore(HiveMetaStoreChecker.java:109)
>>> at org.apache.hadoop.hive.ql.exec.DDLTask.msck(DDLTask.java:1814)
>>> at
>>> org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:403)
>>> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
>>> at
>>> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>>> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1858)
>>> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1562)
>>> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1313)
>>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1084)
>>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1072)
>>> at
>>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
>>> at
>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
>>> at
>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
>>> at
>>> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
>>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
>>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>
>>
>>
>
>
> --
> ~Rajesh.B
>


Re: msck repair table and hive v2.1.0

2016-07-14 Thread Stephen Sprague
in the meantime, given my tables are in s3, i've written a utility to do an
'aws s3 ls' on the bucket and folder in question, change the folder syntax
to partition syntax and then issue my own 'alter table ... add partition'
for each partition.


so essentially it does what msck repair table does, but in a non-portable
way.  oh well.  gotta do what ya gotta do.
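
the statements it ends up emitting look roughly like this (bucket and paths
invented):

   alter table fact_foo add if not exists
     partition (date_key=20160713)
     location 's3n://my-bucket/warehouse/fact_foo/date_key=20160713';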

On Wed, Jul 13, 2016 at 9:29 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> hey guys,
> i'm using hive version 2.1.0 and i can't seem to get msck repair table to
> work.  no matter what i try i get the 'ol NPE.  I've set the log level to
> 'DEBUG' but yet i still am not seeing any smoking gun.
>
> would anyone here have any pointers or suggestions to figure out what's
> going wrong?
>
> thanks,
> Stephen.
>
>
>
> hive> create external table foo (a int) partitioned by (date_key bigint)
> location 'hdfs:/tmp/foo';
> OK
> Time taken: 3.359 seconds
>
> hive> msck repair table foo;
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask
>
>
> from the log...
>
> 2016-07-14T04:08:02,431 DEBUG [MSCK-GetPaths-1]:
> httpclient.RestStorageService (:()) - Found 13 objects in one batch
> 2016-07-14T04:08:02,431 DEBUG [MSCK-GetPaths-1]:
> httpclient.RestStorageService (:()) - Found 0 common prefixes in one batch
> 2016-07-14T04:08:02,433 ERROR [main]: metadata.HiveMetaStoreChecker (:())
> - java.lang.NullPointerException
> 2016-07-14T04:08:02,434 WARN  [main]: exec.DDLTask (:()) - Failed to run
> metacheck:
> org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.NullPointerException
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:444)
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:448)
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:388)
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.findUnknownPartitions(HiveMetaStoreChecker.java:309)
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:285)
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:230)
> at
> org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkMetastore(HiveMetaStoreChecker.java:109)
> at org.apache.hadoop.hive.ql.exec.DDLTask.msck(DDLTask.java:1814)
> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:403)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1858)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1562)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1313)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1084)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1072)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
> at
> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
> at
> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>


msck repair table and hive v2.1.0

2016-07-13 Thread Stephen Sprague
hey guys,
i'm using hive version 2.1.0 and i can't seem to get msck repair table to
work.  no matter what i try i get the 'ol NPE.  I've set the log level to
'DEBUG' but yet i still am not seeing any smoking gun.

would anyone here have any pointers or suggestions to figure out what's
going wrong?

thanks,
Stephen.



hive> create external table foo (a int) partitioned by (date_key bigint)
location 'hdfs:/tmp/foo';
OK
Time taken: 3.359 seconds

hive> msck repair table foo;
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask


from the log...

2016-07-14T04:08:02,431 DEBUG [MSCK-GetPaths-1]:
httpclient.RestStorageService (:()) - Found 13 objects in one batch
2016-07-14T04:08:02,431 DEBUG [MSCK-GetPaths-1]:
httpclient.RestStorageService (:()) - Found 0 common prefixes in one batch
2016-07-14T04:08:02,433 ERROR [main]: metadata.HiveMetaStoreChecker (:()) -
java.lang.NullPointerException
2016-07-14T04:08:02,434 WARN  [main]: exec.DDLTask (:()) - Failed to run
metacheck:
org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.NullPointerException
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:444)
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:448)
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.getAllLeafDirs(HiveMetaStoreChecker.java:388)
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.findUnknownPartitions(HiveMetaStoreChecker.java:309)
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:285)
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:230)
at
org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkMetastore(HiveMetaStoreChecker.java:109)
at org.apache.hadoop.hive.ql.exec.DDLTask.msck(DDLTask.java:1814)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:403)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:197)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1858)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1562)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1313)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1084)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1072)
at
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:232)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
at
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)


Tez issues with beeline via HS2

2016-02-17 Thread Stephen Sprague
Hi guys,
it was suggested i post to the user@hive group rather than the user@tez
group for this one.  Here's my issue. My query hangs when using beeline via
HS2 (but works with the local beeline client).  I'd like to overcome that.

This is my query:

   beeline -u 'jdbc:hive2://
dwrdevnn1.sv2.truila.com:10001/default;auth=noSasl sprague nopwd
org.apache.hive.jdbc.HiveDriver' 

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-03 Thread Stephen Sprague
i refuse to take anybody seriously who has a sig file longer than one line
and  that there is just plain repugnant.

On Wed, Feb 3, 2016 at 1:47 PM, Mich Talebzadeh  wrote:

> I just did some further tests joining a 5 million rows FACT tables with 2
> DIMENSION tables.
>
>
>
> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS
> TotalSales
>
> FROM sales s, times t, channels c
>
> WHERE s.time_id = t.time_id
>
> AND   s.channel_id = c.channel_id
>
> GROUP BY t.calendar_month_desc, c.channel_desc
>
> ;
>
>
>
>
>
> Hive on Spark crashes, Hive with MR finishes in 85 sec and Spark on Hive
> finishes in 267 sec. I am trying to understand this behaviour
>
>
>
> OK I changed the three below parameters as suggested by Jeff
>
>
>
> export SPARK_EXECUTOR_CORES=12 ##, Number of cores for the workers
> (Default: 1).
>
> export SPARK_EXECUTOR_MEMORY=5G ## , Memory per Worker (e.g. 1000M, 2G)
> (Default: 1G)
>
> export SPARK_DRIVER_MEMORY=2G ## , Memory for Master (e.g. 1000M, 2G)
> (Default: 512 Mb)
>
>
>
>
>
> *1)**Hive 1.2.1 on Spark 1.3.1*
>
> It fails. Never completes.
>
>
>
> ERROR : Status: Failed
>
> Error: Error while processing statement: FAILED: Execution Error, return
> code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
> (state=08S01,code=3)
>
>
>
> *2)**Hive 1.2.1 on MR engine Looks good and completes in 85 sec*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> SELECT t.calendar_month_desc,
> c.channel_desc, SUM(s.amount_sold) AS TotalSales
>
> 0: jdbc:hive2://rhes564:10010/default> FROM sales s, times t, channels c
>
> 0: jdbc:hive2://rhes564:10010/default> WHERE s.time_id = t.time_id
>
> 0: jdbc:hive2://rhes564:10010/default> AND   s.channel_id = c.channel_id
>
> 0: jdbc:hive2://rhes564:10010/default> GROUP BY t.calendar_month_desc,
> c.channel_desc
>
> 0: jdbc:hive2://rhes564:10010/default> ;
>
> INFO  : Execution completed successfully
>
> INFO  : MapredLocal task succeeded
>
> INFO  : Number of reduce tasks not specified. Estimated from input data
> size: 1
>
> INFO  : In order to change the average load for a reducer (in bytes):
>
> INFO  :   set hive.exec.reducers.bytes.per.reducer=
>
> INFO  : In order to limit the maximum number of reducers:
>
> INFO  :   set hive.exec.reducers.max=
>
> INFO  : In order to set a constant number of reducers:
>
> INFO  :   set mapreduce.job.reduces=
>
> WARN  : Hadoop command-line option parsing not performed. Implement the
> Tool interface and execute your application with ToolRunner to remedy this.
>
> INFO  : number of splits:1
>
> INFO  : Submitting tokens for job: job_1454534517374_0002
>
> INFO  : The url to track the job:
> http://rhes564:8088/proxy/application_1454534517374_0002/
>
> INFO  : Starting Job = job_1454534517374_0002, Tracking URL =
> http://rhes564:8088/proxy/application_1454534517374_0002/
>
> INFO  : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
> job_1454534517374_0002
>
> INFO  : Hadoop job information for Stage-3: number of mappers: 1; number
> of reducers: 1
>
> INFO  : 2016-02-03 21:25:17,769 Stage-3 map = 0%,  reduce = 0%
>
> INFO  : 2016-02-03 21:25:29,103 Stage-3 map = 2%,  reduce = 0%, Cumulative
> CPU 7.52 sec
>
> INFO  : 2016-02-03 21:25:32,205 Stage-3 map = 5%,  reduce = 0%, Cumulative
> CPU 10.19 sec
>
> INFO  : 2016-02-03 21:25:35,295 Stage-3 map = 7%,  reduce = 0%, Cumulative
> CPU 12.69 sec
>
> INFO  : 2016-02-03 21:25:38,392 Stage-3 map = 10%,  reduce = 0%,
> Cumulative CPU 15.2 sec
>
> INFO  : 2016-02-03 21:25:41,502 Stage-3 map = 13%,  reduce = 0%,
> Cumulative CPU 17.31 sec
>
> INFO  : 2016-02-03 21:25:44,600 Stage-3 map = 16%,  reduce = 0%,
> Cumulative CPU 21.55 sec
>
> INFO  : 2016-02-03 21:25:47,691 Stage-3 map = 20%,  reduce = 0%,
> Cumulative CPU 24.32 sec
>
> INFO  : 2016-02-03 21:25:50,786 Stage-3 map = 23%,  reduce = 0%,
> Cumulative CPU 26.3 sec
>
> INFO  : 2016-02-03 21:25:52,858 Stage-3 map = 27%,  reduce = 0%,
> Cumulative CPU 28.52 sec
>
> INFO  : 2016-02-03 21:25:55,948 Stage-3 map = 31%,  reduce = 0%,
> Cumulative CPU 30.65 sec
>
> INFO  : 2016-02-03 21:25:59,032 Stage-3 map = 35%,  reduce = 0%,
> Cumulative CPU 32.7 sec
>
> INFO  : 2016-02-03 21:26:02,120 Stage-3 map = 40%,  reduce = 0%,
> Cumulative CPU 34.69 sec
>
> INFO  : 2016-02-03 21:26:05,217 Stage-3 map = 43%,  reduce = 0%,
> Cumulative CPU 36.67 sec
>
> INFO  : 2016-02-03 21:26:08,310 Stage-3 map = 47%,  reduce = 0%,
> Cumulative CPU 38.78 sec
>
> INFO  : 2016-02-03 21:26:11,408 Stage-3 map = 52%,  reduce = 0%,
> Cumulative CPU 40.7 sec
>
> INFO  : 2016-02-03 21:26:14,512 Stage-3 map = 56%,  reduce = 0%,
> Cumulative CPU 42.69 sec
>
> INFO  : 2016-02-03 21:26:17,607 Stage-3 map = 60%,  reduce = 0%,
> Cumulative CPU 44.69 sec
>
> INFO  : 2016-02-03 21:26:20,722 Stage-3 map = 64%,  reduce = 0%,
> Cumulative CPU 46.83 sec
>
> INFO  : 2016-02-03 21:26:22,787 Stage-3 map = 100%,  reduce = 0%,
> Cumulative CPU 48.46 sec
>
> INFO  : 2016-02-03 21:26:29,030 Stage-3 map = 100%,  reduce = 

Re: Hive job name

2015-04-07 Thread Stephen Sprague
Gotta say i don't particularly like this either.   We run thousands of
queries a day, and by putting a comment before the query you could control
the job.name hive would use.   Now, like the OP stated, that's ignored -
every job is named hive-exec-0.13.1-cdh5.3.1.jar.   Sure, setting
mapreduce.job.name explicitly is a workaround but... that's a boatload of
code changes!

Wouldn't there be a fix to roll this back to how it derived the job.name
before?
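
(the per-query workaround itself is trivial - something like the below, with
a made-up job name - it's the thousands of scripts that would need touching
that hurts:)

   set mapreduce.job.name=dw_nightly_foo_rollup;
   insert overwrite table foo_daily
   select date_key, count(*) from foo group by date_key;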

Thanks,
Stephen Sprague

On Wed, Mar 11, 2015 at 1:38 PM, Viral Bajaria viral.baja...@gmail.com
wrote:

 I haven't used Tez but it's in my list of experimentation...

 Thanks,
 Viral

 On Tue, Mar 10, 2015 at 8:38 AM, P lva ruvi...@gmail.com wrote:

 Hey Viral,

 Is there a similar config for tez ?

 Thanks




 On Mon, Mar 9, 2015 at 6:36 PM, Viral Bajaria viral.baja...@gmail.com
 wrote:

 We use the hive.job.name property to set meaningful job names. Look
 into using that before submitting queries.

 Thanks,
 Viral


 On Mon, Mar 9, 2015 at 2:47 PM, Alex Bohr a...@gradientx.com wrote:

 Since going from Hive 0.11 to 0.13 the Job Names, in the JobTracker UI,
 for my Hive queries in the job tracker changed.

 It used to be the first 200 characters or so of the Hive query, now
 it's just hive-exec-0.13.1-cdh5.3.1.jar.

 Seeing the beginning of the query in the job Name was super helpful to
 figure out which queries are taking super long or using heavy # of slots.
 Now I need to drill down to the job config file and check the 
 hive.query.string
 key.

 Is there a config setting I can use to make the Hive Job name reflect
 the query being executed?

 I'm still using the JobTracker, haven't switched to Yarn yet.  Does
 this change with Yarn?

 Thanks







Re: bug in hive

2014-09-20 Thread Stephen Sprague
great policy. install open source software that's not even version 1.0 into
production and then not allow the ability to improve it (but of course reap
all the rewards of its benefits.)  so instead of actually fixing the
problem the right way introduce a super-hack work-around cuz, you know,
that's much more stable.

Gotta luv it.   Good luck.

On Sat, Sep 20, 2014 at 8:00 AM, Shushant Arora shushantaror...@gmail.com
wrote:

 Hi Alan

 I have 0.10 version of hive deployed in my org's cluster, I cannot update
 that because of org's policy.
 How can I achieve exclusive lock functionality while inserting in dynamic
 partition on hive 0.10 ?
 Does calling hive scripts via some sort of java api with patched jar
 included will help ?
 Moreover hive does not release locks in 0.10 when hive session is killed .
 User has to explicitly unlock a table.
 Can i specify any sort of max expiry time while taking a lock.

 Thanks
 Shushant

 On Sat, Sep 20, 2014 at 8:11 PM, Alan Gates ga...@hortonworks.com wrote:

 Up until Hive 0.13 locks in Hive were really advisory only, since as you
 note any user can remove any other user's lock.  In Hive 0.13 a new type of
 locking was introduced, see
 https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-LockManager
 This new locking is automatic and ignores both LOCK and UNLOCK commands.
 Note that it is off by default, you have to configure Hive to use the new
 DbTxnManager to get turn on this locking.  In 0.13 it still has the bug you
 describe as far as acquiring the wrong lock for dynamic partitioning, but I
 believe I've fixed that in 0.14.

 Alan.

   Shushant Arora shushantaror...@gmail.com
  September 20, 2014 at 5:39

 Hive version 0.9 and later has a bug



 While inserting in a hive table Hive takes an exclusive lock. But if
 table is partitioned , and insert is in dynamic partition , it will take
 shared lock on table but if all partitions are static then hive takes
 exclusive lock on partitions in which data is being inserted

 and shared lock on table.

 https://issues.apache.org/jira/browse/HIVE-3509


 1.What if I want to take exclusive lock on table while inserting in
 dynamic partition ?


 I tried to take explicit lock using :

 LOCK TABLE tablename EXCLUSIVE;


 But it made table to be disabled.

 I cannot even read from table anymore even is same session until I do

 unlock table tablename in another session;


 2. moreover whats lock level in hive , I mean any user can remove any
 other users lock. that too seems buggy.


 Thanks

 Shushant




 --
 Sent with Postbox http://www.getpostbox.com

 CONFIDENTIALITY NOTICE
 NOTICE: This message is intended for the use of the individual or entity
 to which it is addressed and may contain information that is confidential,
 privileged and exempt from disclosure under applicable law. If the reader
 of this message is not the intended recipient, you are hereby notified that
 any printing, copying, dissemination, distribution, disclosure or
 forwarding of this communication is strictly prohibited. If you have
 received this communication in error, please contact the sender immediately
 and delete it from your system. Thank You.





Re: Mysql - Hive Sync

2014-09-06 Thread Stephen Sprague
interesting. thanks Muthu.

a colleague of mine pointed out this one too, linkedin's databus (
https://github.com/linkedin/databus/wiki).  this one looks extremely
heavyweight and again i'm not sure it's worth the headache.

i like the idea of a trigger on the mysql table and then broadcasting the
data to another app via a udp message.

cf. https://code.google.com/p/mysql-message-api/

the thing is you'll need to batch the records over say 5 minutes (or
whatever) then write the batch as one file to hdfs.

This seems infinitely simpler and more maintainable to me. :)




On Fri, Sep 5, 2014 at 11:53 PM, Muthu Pandi muthu1...@gmail.com wrote:

 Yeah installing Mysql hadoop applier took lot of time when building and
 installing GCC 4.6, and its working but its not serving the exact purpose.
 So now am trying with my own python scripting.

 Idea is reading insert query from binlog and save it under hive warehouse
 as table and query from there.



 *RegardsMuthupandi.K*

 [image: Picture (Device Independent Bitmap)]



 On Sat, Sep 6, 2014 at 4:47 AM, Stephen Sprague sprag...@gmail.com
 wrote:

 great find, Muthu.  I would be interested in hearing any about any
 success or failures using this adapter. almost sounds too good to be true.

 After reading the blog (
 http://innovating-technology.blogspot.com/2013/04/mysql-hadoop-applier-part-2.html)
 about it i see it comes with caveats and it looks a little rough around the
 edges for installing.  Not sure i'd bet the farm on this product but YMMV.

 Anyway, curious to know how it works out for you.



 On Tue, Sep 2, 2014 at 11:03 PM, Muthu Pandi muthu1...@gmail.com wrote:

 This cant be done since insert update delete are not supported in hive.

 Mysql Applier for Hadoop package servers the same purpose of the
 prototype tool which i intended to develop.

 link for Mysql Applier for Hadoop
 http://dev.mysql.com/tech-resources/articles/mysql-hadoop-applier.html



 *Regards Muthupandi.K*

  [image: Picture (Device Independent Bitmap)]



 On Wed, Sep 3, 2014 at 10:35 AM, Muthu Pandi muthu1...@gmail.com
 wrote:

 Yeah but we cant make it to work as near real time. Also my table
 doesnt have like 'ID' to use for --check-column that's why opted out of
 sqoop.



 *Regards Muthupandi.K*

  [image: Picture (Device Independent Bitmap)]



 On Wed, Sep 3, 2014 at 10:28 AM, Nitin Pawar nitinpawar...@gmail.com
 wrote:

 have you looked at sqoop?


 On Wed, Sep 3, 2014 at 10:15 AM, Muthu Pandi muthu1...@gmail.com
 wrote:

 Dear All

  Am developing a prototype of syncing tables from mysql to Hive
 using python and JDBC. Is it a good idea using the JDBC for this purpose.

 My usecase will be generating the sales report using the hive, data
 pulled from mysql using the prototype tool.My data will be around 
 2GB/day.



 *Regards Muthupandi.K*

  [image: Picture (Device Independent Bitmap)]




 --
 Nitin Pawar








Re: Mysql - Hive Sync

2014-09-05 Thread Stephen Sprague
great find, Muthu.  I would be interested in hearing about any successes
or failures using this adapter. almost sounds too good to be true.

After reading the blog (
http://innovating-technology.blogspot.com/2013/04/mysql-hadoop-applier-part-2.html)
about it i see it comes with caveats and it looks a little rough around the
edges for installing.  Not sure i'd bet the farm on this product but YMMV.

Anyway, curious to know how it works out for you.



On Tue, Sep 2, 2014 at 11:03 PM, Muthu Pandi muthu1...@gmail.com wrote:

 This cant be done since insert update delete are not supported in hive.

 Mysql Applier for Hadoop package serves the same purpose as the prototype
 tool which i intended to develop.

 link for Mysql Applier for Hadoop
 http://dev.mysql.com/tech-resources/articles/mysql-hadoop-applier.html



 *Regards Muthupandi.K*

  [image: Picture (Device Independent Bitmap)]



 On Wed, Sep 3, 2014 at 10:35 AM, Muthu Pandi muthu1...@gmail.com wrote:

 Yeah but we cant make it to work as near real time. Also my table doesnt
 have like 'ID' to use for --check-column that's why opted out of sqoop.



 *Regards Muthupandi.K*

  [image: Picture (Device Independent Bitmap)]



 On Wed, Sep 3, 2014 at 10:28 AM, Nitin Pawar nitinpawar...@gmail.com
 wrote:

 have you looked at sqoop?


 On Wed, Sep 3, 2014 at 10:15 AM, Muthu Pandi muthu1...@gmail.com
 wrote:

 Dear All

  Am developing a prototype of syncing tables from mysql to Hive
 using python and JDBC. Is it a good idea using the JDBC for this purpose.

 My usecase will be generating the sales report using the hive, data
 pulled from mysql using the prototype tool.My data will be around 2GB/day.



 *Regards Muthupandi.K*

  [image: Picture (Device Independent Bitmap)]




 --
 Nitin Pawar






Re: Altering the Metastore on EC2

2014-08-14 Thread Stephen Sprague
i'll take a stab at this.

- probably no reason.

- if you can. is there a derby client s/t you can issue the command: alter
table COLUMNS_V2 modify TYPE_NAME varchar(32672). otherwise maybe use the
mysql or postgres metastores (instead of derby) and run that alter command
after the install.

- the schema only exists in one place and that's the metastore (which is
probably on your namenode for derby.) for mysql or postgres it can be
anywhere you want but again examples will probably show localhost (the
namenode)

that's a mighty big schema! you don't just want to use string type and use
get_json_object to pull data out of it dynamically? not as elegant as using
static syntax like nested structs but its better than nothing. something to
think about anyway.
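
For example, a minimal sketch of that string-plus-get_json_object approach
(the column name and JSON paths are made up):

{code}
-- store the whole document as one string column
CREATE TABLE events_raw (json_doc STRING);

-- pull individual fields out at query time instead of modelling 17KB of structs
SELECT get_json_object(json_doc, '$.user.id')              AS user_id,
       get_json_object(json_doc, '$.payload.items[0].sku') AS first_sku
FROM events_raw;
{code}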

i'm guessing given a nested struct that large you'll get over one hump only
to be faced with another one. hive needs to do some crazy mapping there for
every record. hopefully that's optimized. :)

Good luck! I'd be curious how it goes.


On Mon, Aug 11, 2014 at 5:52 PM, David Beveridge dbeveri...@cylance.com
wrote:

  We are creating an Hive schema for reading massive JSON files. Our JSON
 schema is rather large, and we have found that the default metastore schema
 for Hive cannot work for us as-is.

 To be specific, one field in our schema has about 17KB of nested structs
 within it. Unfortunately, it appears that Hive has a limit of varchar(4000)
 for the field that stores the resulting definition:



 CREATE TABLE COLUMNS_V2 (

 CD_ID bigint NOT NULL,

 COMMENT varchar(4000),

 COLUMN_NAME varchar(128) NOT NULL,

 TYPE_NAME varchar(4000),

 INTEGER_IDX INTEGER NOT NULL,

 PRIMARY KEY (CD_ID, COLUMN_NAME)

 );



 We are running this on Amazon MapReduce (v0.11 with default Derby
 metastore)



 So, our initial questions are:

 · Is there a reason that the TYPE_NAME is being limited to 4000
 (IIUC, varchar on derby can grow to 32672, which would be sufficient for
 a long time)

 · Can we alter the metastore schema without hacking/reinstalling
 Hive? (if so, how?)

 · If so, is there a proper way to update the schema on all nodes?





 Thanks in advance!

 --DB



Re: Predicate pushdown optimisation not working for ORC

2014-04-03 Thread Stephen Sprague
wow. good find. i hope these config settings are well documented and that
you didn't have to spend a lot of time searching for that.  Interesting that
the default isn't true for this one.
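
For reference, the two settings in play in this thread (values as reported
below; worth double-checking the defaults on your own release):

{code}
set hive.optimize.ppd=true;           -- predicate pushdown
set hive.optimize.index.filter=true;  -- the one that made the ORC pushdown predicate appear
{code}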


On Wed, Apr 2, 2014 at 11:00 PM, Abhay Bansal abhaybansal.1...@gmail.comwrote:

 I was able to resolve the issue by setting hive.optimize.index.filter to
 true.

 In the hadoop logs
 syslog:2014-04-03 05:44:51,204 INFO
 org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included column ids =
 3,8,13
 syslog:2014-04-03 05:44:51,204 INFO
 org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included columns names =
 sourceipv4address,sessionid,url
 syslog:2014-04-03 05:44:51,216 INFO
 org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: ORC pushdown predicate:
 leaf-0 = (EQUALS sourceipv4address 1809657989)

 I can now see the ORC pushdown predicate.

 Thanks,
 -Abhay


 On Thu, Apr 3, 2014 at 11:14 AM, Stephen Boesch java...@gmail.com wrote:

 HI Abhay,
   What is the DDL for your test table?


 2014-04-02 22:36 GMT-07:00 Abhay Bansal abhaybansal.1...@gmail.com:

 I am new to Hive, apologise for asking such a basic question.

 Following exercise was done with hive .12 and hadoop 0.20.203

 I created a ORC file form java, and pushed it into a table with the same
 schema. I checked the conf
 property <property><name>hive.optimize.ppd</name><value>true</value></property>
 which should ideally use the ppd optimisation.

 I ran a query select sourceipv4address,sessionid,url from test where
 sourceipv4address=dummy;

 Just to see if the ppd optimization is working I checked the hadoop logs
 where I found

 ./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
 05:01:39,913 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included
 column ids = 3,8,13
 ./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
 05:01:39,914 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: included
 columns names = sourceipv4address,sessionid,url
 ./userlogs/job_201404010833_0036/attempt_201404010833_0036_m_00_0/syslog:2014-04-03
 05:01:39,914 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: *No
 ORC pushdown predicate*

  I am not sure which part of it I missed. Any help would be appreciated.

 Thanks,
 -Abhay






Re: Partitioned table to partitioned table

2014-03-26 Thread Stephen Sprague
the error message is correct.  remember the partition columns are not
stored with the data, but with a select * you're selecting them anyway.  And this
has nothing to do with ORC either; it's a Hive thing. :)

so your second approach was close.  just omit the partition columns yr, mo,
day.
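
Something along these lines (untested sketch; the column names are placeholders):

{code}
FROM table_a a
INSERT INTO TABLE table_b PARTITION (yr=2014, mo='01', day='01')
SELECT a.column1, a.column2, a.column3   -- every column except yr, mo, day
WHERE a.yr='2014' AND a.mo='01' AND a.day='01';
{code}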


On Wed, Mar 26, 2014 at 8:18 AM, P lva ruvi...@gmail.com wrote:

 Hello,

 I'm trying to convert managed partitioned text table into compressed orc
 partitioned table.

 I created the a new table with the same schema but when I try inserting
 data the errors says there are different number of columns.

 I tried doing

 From table a insert into table b(yr=2014, mo=01, day=01 ) select * where
 yr='2014' and month='01' and day='01'

 Also tried doing this

 From table a insert into table b(yr=2014, mo=01, day=01) select a.column1,
 a.column2, a.column3, ... where yr='2014' and month='01' and day='01'

 which didnt work either

 How does one copy data from a partitioned table to another along with
 partition information ?

 Thanks











Re: computing median and percentiles

2014-03-20 Thread Stephen Sprague
the short answer is there is no native hive UDF that solves your unique
case.  That means you have to solve it.


i searched for something like you were looking for myself and found this
general recipe:

http://www.onlinemathlearning.com/median-frequency-table.html

off the top of my head i'm not sure how easy this would be in SQL but i
imagine using ROW_NUMBER() function a clever person could do it. And if not
SQL then perhaps a custom UDF but ultimately you have to do the work and
now you have a potential recipe to follow. :)
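
For example, one way to phrase that recipe with a window function (untested
sketch; needs Hive 0.11+ for windowing, and the table/column names are
placeholders):

{code}
-- running total of counts; the 25th percentile is the smallest value whose
-- cumulative count reaches 25% of the grand total
SELECT MIN(t.value) AS p25
FROM (
  SELECT value,
         SUM(cnt) OVER (ORDER BY value) AS cum_cnt
  FROM   freq
) t
CROSS JOIN (SELECT SUM(cnt) AS total_cnt FROM freq) tot
WHERE t.cum_cnt >= 0.25 * tot.total_cnt;
{code}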




On Wed, Mar 19, 2014 at 9:37 PM, Seema Datar sda...@yahoo-inc.com wrote:

  Not really. If it was a single column with no counters, Hive provides an
 option to use percentile. So basically if the data was like -

  100
 100
 200
 200
 200
 200
 300

  But if we have 2 columns, one that maintain the value and the other that
 maintains the count, how can Hive be used to derive the percentile?

  Value Count
 100  2
 200  4
 300  1

  Thanks,
 Seema

   From: Stephen Sprague sprag...@gmail.com
 Reply-To: user@hive.apache.org user@hive.apache.org
 Date: Thursday, March 20, 2014 5:28 AM
 To: user@hive.apache.org user@hive.apache.org
 Subject: Re: computing median and percentiles

   not a hive question is it?   its more like a math question.



 On Wed, Mar 19, 2014 at 1:30 PM, Seema Datar sda...@yahoo-inc.com wrote:



   I understand the percentile function is supported in Hive in the
 latest versions. However, how does once calculate percentiles when the data
 is across two columns. So say -

  Value  Count

  100 2   ( so basically 100 occurred twice)
 200 4
 300 1
 400 6
 500 3


  I want to find out the 0.25 percentile for the value distribution. How
 can I do it using the Hive percentile function?






Re: Improving self join time

2014-03-20 Thread Stephen Sprague
hmm. would this not fall under the general problem of identifying
duplicates?

Would something like this meet your needs? (untested)

select  -- outer query finds the ids for the duplicates
key

from (  -- inner query lists duplicate values
 select
   count(*) as cnt,
   value
 from
foo
 group by
value
 having
   count(*) > 1
 ) z
 join foo a on (a.value = z.value)
;

table foo is your table elements
key is your id,
value is your element


On Thu, Mar 20, 2014 at 7:03 AM, Jeff Storey storey.j...@gmail.com wrote:

 I have a table with 10 million rows and 2 columns - id (int) and element
 (string). I am trying to do a self join that finds any ids where the
 element values are the same, and my query looks like:

 select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1 JOIN
 elements e2 on e1.element = e2.element WHERE e1.id <> e2.id;

 I tested this at a smaller scale and it works well. The problem is that
 with 10 million rows, this becomes a bit large and I've let it run for 90
 minutes and it was up to 80GB of disk space and still going. The original
 input data was only 500MB.

 Is this something I can optimize in hive? Or should I be considering a
 different approach to the problem instead?

 Any guidance here would be helpful. Thank you.



Re: Improving self join time

2014-03-20 Thread Stephen Sprague
I agree with your assessment of the inner query. why stop there though?
Doesn't the outer query fetch the ids of the tags that the inner query
identified?



On Thu, Mar 20, 2014 at 9:54 AM, Jeff Storey storey.j...@gmail.com wrote:

 I don't think this quite fits here..I think the inner query will give me a
 list of duplicate elements and their counts, but it loses the information
 as to what id had these elements.

 I'm trying to find which pairs of ids have any duplicate tags.


 On Thu, Mar 20, 2014 at 11:57 AM, Stephen Sprague sprag...@gmail.comwrote:

 hmm. would this not fall under the general problem of identifying
 duplicates?

 Would something like this meet your needs? (untested)

 select  -- outer query finds the ids for the duplicates
 key

 from (  -- inner query lists duplicate values
   select
count(*) as cnt,
value
  from
 foo
  group by
 value
  having
 count(*) > 1
  ) z
  join foo a on (a.value = z.value)
 ;

 table foo is your table elements
 key is your id,
 value is your element


 On Thu, Mar 20, 2014 at 7:03 AM, Jeff Storey storey.j...@gmail.comwrote:

 I have a table with 10 million rows and 2 columns - id (int) and element
 (string). I am trying to do a self join that finds any ids where the
 element values are the same, and my query looks like:

 select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1
 JOIN elements e2 on e1.element = e2.element WHERE e1.id <> e2.id;

 I tested this at a smaller scale and it works well. The problem is that
 with 10 million rows, this becomes a bit large and I've let it run for 90
 minutes and it was up to 80GB of disk space and still going. The original
 input data was only 500MB.

 Is this something I can optimize in hive? Or should I be considering a
 different approach to the problem instead?

 Any guidance here would be helpful. Thank you.






Re: Improving self join time

2014-03-20 Thread Stephen Sprague
so that's your final assessment, eh? :)  What is your comment about the
outer query _joining on value_ to get the key?
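
To spell that out, a sketch of the same idea extended to return the id pairs
(untested):

{code}
SELECT a.id AS id1, b.id AS id2, a.element
FROM (
  SELECT element
  FROM   elements
  GROUP BY element
  HAVING COUNT(*) > 1           -- elements that occur more than once
) dup
JOIN elements a ON (a.element = dup.element)
JOIN elements b ON (b.element = dup.element)
WHERE a.id < b.id;              -- report each pair once
{code}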




On Thu, Mar 20, 2014 at 12:26 PM, Jeff Storey storey.j...@gmail.com wrote:

 I don't think so since the inner result doesn't have the key field in it.
 It ends up being

 select key from (query result that doesn't contain the key field) ...


 On Thu, Mar 20, 2014 at 1:28 PM, Stephen Sprague sprag...@gmail.comwrote:

 I agree with your assessment of the inner query. why stop there though?
 Doesn't the outer query fetch the ids of the tags that the inner query
 identified?



 On Thu, Mar 20, 2014 at 9:54 AM, Jeff Storey storey.j...@gmail.comwrote:

 I don't think this quite fits here..I think the inner query will give me
 a list of duplicate elements and their counts, but it loses the information
 as to what id had these elements.

 I'm trying to find which pairs of ids have any duplicate tags.


 On Thu, Mar 20, 2014 at 11:57 AM, Stephen Sprague sprag...@gmail.comwrote:

 hmm. would this not fall under the general problem of identifying
 duplicates?

 Would something like this meet your needs? (untested)

 select  -- outer query finds the ids for the duplicates
 key

 from (  -- inner query lists duplicate values
   select
count(*) as cnt,
value
  from
 foo
  group by
 value
  having
 count(*) > 1
  ) z
  join foo a on (a.value = z.value)
 ;

 table foo is your table elements
 key is your id,
 value is your element


 On Thu, Mar 20, 2014 at 7:03 AM, Jeff Storey storey.j...@gmail.comwrote:

 I have a table with 10 million rows and 2 columns - id (int) and
 element (string). I am trying to do a self join that finds any ids where
 the element values are the same, and my query looks like:

 select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1
 JOIN elements e2 on e1.element = e2.element WHERE e1.id <> e2.id;

 I tested this at a smaller scale and it works well. The problem is
 that with 10 million rows, this becomes a bit large and I've let it run 
 for
 90 minutes and it was up to 80GB of disk space and still going. The
 original input data was only 500MB.

 Is this something I can optimize in hive? Or should I be considering a
 different approach to the problem instead?

 Any guidance here would be helpful. Thank you.








Re: computing median and percentiles

2014-03-19 Thread Stephen Sprague
not a hive question is it?   its more like a math question.



On Wed, Mar 19, 2014 at 1:30 PM, Seema Datar sda...@yahoo-inc.com wrote:



   I understand the percentile function is supported in Hive in the latest
 versions. However, how does once calculate percentiles when the data is
 across two columns. So say -

  Value  Count

  100 2   ( so basically 100 occurred twice)
 200 4
 300 1
 400 6
 500 3


  I want to find out the 0.25 percentile for the value distribution. How
 can I do it using the Hive percentile function?





Re: Hive - Sorting on the Partition Column data type Int . Output is Alphabetic Sort

2014-03-14 Thread Stephen Sprague
i agree. back in the days of v0.7 who knows if this was a bug or not. What
i do know is i have int partitions all over the place and they sort in
numeric context just fine every day.  i'm running v0.12.  pretty sure this
worked fine at v0.8 as well - that's when i started in hive.

good luck!


On Fri, Mar 14, 2014 at 4:21 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 Can you first try updating hive to atleast 0.11 if you can not move to
 0.12 ?


 On Fri, Mar 14, 2014 at 4:49 PM, Arafat, Moiz moiz.ara...@teamaol.comwrote:

  My comments inline



 *From:* Stephen Sprague [mailto:sprag...@gmail.com]
 *Sent:* Friday, March 14, 2014 12:23 AM

 *To:* user@hive.apache.org
 *Subject:* Re: Hive - Sorting on the Partition Column data type Int .
 Output is Alphabetic Sort



 wow. its still sorting based on string context.  ok, some followups.

 1.did you start clean?  ie. did you do a drop table
 moiz_partition_test before you started?

 Yes I did

 2.lets see the output of show create table moiz_partition_test
 (if that doesn't work [its hive v0.11 i think] lets see desc
 moiz_partition_test )

 hive desc moiz_partition_test;

 OK

 event_dt                string

 partition_hr            int

 Time taken: 1.967 seconds

   3.what version of hive are you running?

 0.7.1

   pretty bizarre.





 On Thu, Mar 13, 2014 at 3:48 AM, Arafat, Moiz moiz.ara...@teamaol.com
 wrote:

  Hi Stephen,



 I followed your approach and still got the same result



 1) hive CREATE TABLE moiz_partition_test

  (EVENT_DT STRING) partitioned by

  (

  PARTITION_HR INT

  )

  ROW FORMAT DELIMITED

  FIELDS TERMINATED BY '09'

  location '/user/moiztcs/moiz_partition_test'

  ;



 2)

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=0)  ;

 OK

 Time taken: 2.421 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=1)  ;

 OK

 Time taken: 0.132 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=2)  ;

 OK

 Time taken: 0.226 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=10)  ;

 OK

 Time taken: 0.177 seconds



 3)

 $ hadoop fs -ls /user/moiztcs/moiz_partition_test

 Found 4 items

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:40
 /user/moiztcs/moiz_partition_test/partition_hr=0

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:41
 /user/moiztcs/moiz_partition_test/partition_hr=1

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:42
 /user/moiztcs/moiz_partition_test/partition_hr=10

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:41
 /user/moiztcs/moiz_partition_test/partition_hr=2



 4)

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=0

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=1

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=10

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=2



 5) hive select distinct partition_hr from moiz_partition_test order by
 partition_hr;



 OK

 0

 1

 10

 2



 Thanks,

 Moiz



 *From:* Stephen Sprague [mailto:sprag...@gmail.com]
 *Sent:* Wednesday, March 12, 2014 9:58 PM


 *To:* user@hive.apache.org
 *Subject:* Re: Hive - Sorting on the Partition Column data type Int .
 Output is Alphabetic Sort



 there you go.   I think you're inflicting too much of your own will onto
 hive with specifying the partition directories as 00, 01, 02.

 In my experience hive expects the partition name followed by an equal
 sign followed by the value.

 I'd stick with this kind of hdfs topology:

 /user/moiztcs/moiz_partition_test/partition_hr=00/
 /user/moiztcs/moiz_partition_test/partition_hr=01/
 /user/moiztcs/moiz_partition_test/partition_hr=10/

 By omitting the location clause on your alter table statements you should
 get above layout which can be

 confirmed by issuing the following command:

 $ hdfs dfs -ls /user/moiztc/moiz_partition_test

 Can you try this?









 On Wed, Mar 12, 2014 at 12:10 AM, Arafat, Moiz moiz.ara...@teamaol.com
 wrote:

  Hi,



 Here are the steps I followed . Please let me know If I did something
 wrong.



 1)  Create table

 hive CREATE TABLE moiz_partition_test

  (EVENT_DT STRING) partitioned by

  (

  PARTITION_HR INT

  )

   ROW FORMAT DELIMITED

   FIELDS TERMINATED BY '09'

   location '/user/moiztcs/moiz_partition_test'

  ;



 2)  Add partitions

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=0)  location '/user/moiztcs/moiz_partition_test/00';

 OK

 Time taken: 0.411 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=1)  location '/user/moiztcs/moiz_partition_test/01';

 OK

 Time taken: 0.193 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr

Re: Writing data to LOCAL with Hive Server2

2014-03-14 Thread Stephen Sprague
re: HiveServer2

this is not natively possible (this falls under the export rubric.)
similarly, you can't load a file directly from your client using native
syntax (import.)
Believe me, you're not the only one who'd like both of these
functions. :)

I'd search this list for import or export utilities as i remember not long
ago someone pointed to page in the doc about these.

as an alternative i just employ rsync to push or pull the file to where i
ultimately need it.  its one line of code which is about as easy as it
gets.


On Fri, Mar 14, 2014 at 5:43 AM, Kumar V kumarbuyonl...@yahoo.com wrote:

 Hi,
Is it possible to write data to a local file system using Hive server 2
 ?
 When I use 'insert overwrite local directory /tmp/hivedata
 ...', it writes data in the directory on the box on which hive
 server 2 is running.
 What can I do to write this data to the file system of the box I am
 running this Java/Python program from ?

 Other option was to write this data in HDFS.  But I would like to like to
  write to local file if I can.

 Please let me know if it is possible.

 Thanks.



Re: Writing data to LOCAL with Hive Server2

2014-03-14 Thread Stephen Sprague
well that does complicate matters then. :)  how to get a file off a host
you don't have access to? when you say no access i presume you mean you
have no login access.

I can think of some ideas:

   1. make a case to get access so rsync can connect. (best)

   2. can you run a hdfs dfs -get command from your client? (next best; see the sketch below)

   3. run your query to stdout and redirect that to your file. eg. $ hive
-host foo -port 1 -e select foo from bar; > my_local_file (least
desirable)
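
A minimal sketch of option 2, writing the result to HDFS first (the paths are
made up):

{code}
hive> INSERT OVERWRITE DIRECTORY '/tmp/hivedata_hdfs' SELECT foo FROM bar;
$ hdfs dfs -get /tmp/hivedata_hdfs ./hivedata     # run this part from your own box
{code}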

the page i was thinking of was this:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
don't think it helps your situation though.


Let us know how you make out!

Cheers,
Stephen.


On Fri, Mar 14, 2014 at 10:56 AM, Kumar V kumarbuyonl...@yahoo.com wrote:

 Thanks Stephen.
 Yes, rsync would be easy.  But we don't have access to the servers the
 hiveserver2 is running on.  Basically, there is no way to reach those files
 from our boxes. That's why I was asking about writing it locally.
 I'll check this list for import/export like you mentioned.

 Thanks.


   On Friday, March 14, 2014 12:23 PM, Stephen Sprague sprag...@gmail.com
 wrote:
  re: HiveServer2

 this is not natively possible (this falls under the export rubric.)
 similarly, you can't load a file directly from your client using native
 syntax (import.)
 Believe me, you're not the only one who'd like both of these
 functions. :)

 I'd search this list for import or export utilities as i remember not long
 ago someone pointed to page in the doc about these.

  as an alternative i just employ rsync to push or pull the file to where
 i ultimately need it.  its one line of code which is about as easy as it
 gets.


 On Fri, Mar 14, 2014 at 5:43 AM, Kumar V kumarbuyonl...@yahoo.com wrote:

  Hi,
Is it possible to write data to a local file system using Hive server 2
 ?
 When I use 'insert overwrite local directory /tmp/hivedata
 ...', it writes data in the directory on the box on which hive
 server 2 is running.
 What can I do to write this data to the file system of the box I am
 running this Java/Python program from ?

 Other option was to write this data in HDFS.  But I would like to like to
  write to local file if I can.

 Please let me know if it is possible.

 Thanks.







Re: Hive - Sorting on the Partition Column data type Int . Output is Alphabetic Sort

2014-03-13 Thread Stephen Sprague
wow. its still sorting based on string context.  ok, some followups.

1.  did you start clean?  ie. did you do a drop table moiz_partition_test
before you started?

2.  lets see the output of show create table moiz_partition_test  (if
that doesn't work [its hive v0.11 i think] lets see desc
moiz_partition_test )


3.  what version of hive are you running?


pretty bizarre.




On Thu, Mar 13, 2014 at 3:48 AM, Arafat, Moiz moiz.ara...@teamaol.comwrote:

  Hi Stephen,



 I followed your approach and still got the same result



 1) hive CREATE TABLE moiz_partition_test

  (EVENT_DT STRING) partitioned by

  (

  PARTITION_HR INT

  )

  ROW FORMAT DELIMITED

  FIELDS TERMINATED BY '09'

  location '/user/moiztcs/moiz_partition_test'

  ;



 2)

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=0)  ;

 OK

 Time taken: 2.421 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=1)  ;

 OK

 Time taken: 0.132 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=2)  ;

 OK

 Time taken: 0.226 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=10)  ;

 OK

 Time taken: 0.177 seconds



 3)

 $ hadoop fs -ls /user/moiztcs/moiz_partition_test

 Found 4 items

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:40
 /user/moiztcs/moiz_partition_test/partition_hr=0

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:41
 /user/moiztcs/moiz_partition_test/partition_hr=1

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:42
 /user/moiztcs/moiz_partition_test/partition_hr=10

 drwxr-xr-x   - cdidw aolmis  0 2014-03-13 06:41
 /user/moiztcs/moiz_partition_test/partition_hr=2



 4)

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=0

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=1

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=10

 $ hadoop fs -copyFromLocal test.dat
 /user/moiztcs/moiz_partition_test/partition_hr=2



 5) hive select distinct partition_hr from moiz_partition_test order by
 partition_hr;



 OK

 0

 1

 10

 2



 Thanks,

 Moiz



 *From:* Stephen Sprague [mailto:sprag...@gmail.com]
 *Sent:* Wednesday, March 12, 2014 9:58 PM

 *To:* user@hive.apache.org
 *Subject:* Re: Hive - Sorting on the Partition Column data type Int .
 Output is Alphabetic Sort



 there you go.   I think you're inflicting too much of your own will onto
 hive with specifying the partition directories as 00, 01, 02.

 In my experience hive expects the partition name followed by an equal sign
 followed by the value.

 I'd stick with this kind of hdfs topology:

 /user/moiztcs/moiz_partition_test/partition_hr=00/
 /user/moiztcs/moiz_partition_test/partition_hr=01/
 /user/moiztcs/moiz_partition_test/partition_hr=10/

 By omitting the location clause on your alter table statements you should
 get above layout which can be

 confirmed by issuing the following command:

 $ hdfs dfs -ls /user/moiztc/moiz_partition_test

 Can you try this?









 On Wed, Mar 12, 2014 at 12:10 AM, Arafat, Moiz moiz.ara...@teamaol.com
 wrote:

  Hi,



 Here are the steps I followed . Please let me know If I did something
 wrong.



 1)  Create table

 hive CREATE TABLE moiz_partition_test

  (EVENT_DT STRING) partitioned by

  (

  PARTITION_HR INT

  )

   ROW FORMAT DELIMITED

   FIELDS TERMINATED BY '09'

   location '/user/moiztcs/moiz_partition_test'

  ;



 2)  Add partitions

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=0)  location '/user/moiztcs/moiz_partition_test/00';

 OK

 Time taken: 0.411 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=1)  location '/user/moiztcs/moiz_partition_test/01';

 OK

 Time taken: 0.193 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=2)  location '/user/moiztcs/moiz_partition_test/02';

 OK

 Time taken: 0.182 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=10)  location '/user/moiztcs/moiz_partition_test/10';

 OK

 Time taken: 0.235 seconds



 3)  Copy data into the directories

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/00

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/01

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/02

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/10



 4)  Ran the sql

 hive select distinct partition_hr from moiz_partition_test order by
 partition_hr;

 Ended Job

 OK

 0

 1

 10

 2



 Thanks,

 Moiz

 *From:* Stephen Sprague [mailto:sprag...@gmail.com]
 *Sent:* Wednesday, March 12, 2014 12:55 AM
 *To:* user@hive.apache.org
 *Subject:* Re: Hive - Sorting on the Partition

Re: Hive - Sorting on the Partition Column data type Int . Output is Alphabetic Sort

2014-03-12 Thread Stephen Sprague
there you go.   I think you're inflicting too much of your own will onto
hive with specifying the partition directories as 00, 01, 02.

In my experience hive expects the partition name followed by an equal sign
followed by the value.

I'd stick with this kind of hdfs topology:

/user/moiztcs/moiz_partition_test/partition_hr=00/
/user/moiztcs/moiz_partition_test/partition_hr=01/
/user/moiztcs/moiz_partition_test/partition_hr=10/

By omitting the location clause on your alter table statements you should
get above layout which can be
confirmed by issuing the following command:

$ hdfs dfs -ls /user/moiztc/moiz_partition_test

Can you try this?







On Wed, Mar 12, 2014 at 12:10 AM, Arafat, Moiz moiz.ara...@teamaol.comwrote:

  Hi,



 Here are the steps I followed . Please let me know If I did something
 wrong.



 1)  Create table

 hive CREATE TABLE moiz_partition_test

  (EVENT_DT STRING) partitioned by

  (

  PARTITION_HR INT

  )

   ROW FORMAT DELIMITED

   FIELDS TERMINATED BY '09'

   location '/user/moiztcs/moiz_partition_test'

  ;



 2)  Add partitions

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=0)  location '/user/moiztcs/moiz_partition_test/00';

 OK

 Time taken: 0.411 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=1)  location '/user/moiztcs/moiz_partition_test/01';

 OK

 Time taken: 0.193 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=2)  location '/user/moiztcs/moiz_partition_test/02';

 OK

 Time taken: 0.182 seconds

 hive alter table  moiz_partition_test add IF NOT EXISTS partition
 (partition_hr=10)  location '/user/moiztcs/moiz_partition_test/10';

 OK

 Time taken: 0.235 seconds



 3)  Copy data into the directories

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/00

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/01

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/02

 hadoop fs -copyFromLocal test.dat  /user/moiztcs/moiz_partition_test/10



 4)  Ran the sql

 hive select distinct partition_hr from moiz_partition_test order by
 partition_hr;

 Ended Job

 OK

 0

 1

 10

 2



 Thanks,

 Moiz

 *From:* Stephen Sprague [mailto:sprag...@gmail.com]
 *Sent:* Wednesday, March 12, 2014 12:55 AM
 *To:* user@hive.apache.org
 *Subject:* Re: Hive - Sorting on the Partition Column data type Int .
 Output is Alphabetic Sort



 that makes no sense. if the column is an int it isn't going to sort like a
 string.  I smell a user error somewhere.



 On Tue, Mar 11, 2014 at 6:21 AM, Arafat, Moiz moiz.ara...@teamaol.com
 wrote:

 Hi ,

 I have a table that has a partition column partition_hr . Data Type is int
 (partition_hr  int) . When i run a sort on this column the output is
 like this.

 0
 1
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 2
 20
 21
 22
 23
 3
 4
 5
 6
 7
 8
 9

 I expected the output like this  .

 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 .
 .
 and so on.

 It works fine for non-partition columns. Please advise.

 Thanks,
 Moiz





Re: full outer join result

2014-03-12 Thread Stephen Sprague
interesting.don't know the answer but could you change the UNION in the
Postgres to UNION ALL?  I'd be curious if the default is UNION DISTINCT on
that platform. That would at least partially explain postgres behaviour
leaving hive the odd man out.



On Wed, Mar 12, 2014 at 6:47 AM, Martin Kudlej mkud...@redhat.com wrote:

 Hi all,

 I've tried BigTop test for join_filters:
 CREATE TABLE myinput1(key int, value int);
 LOAD DATA LOCAL INPATH 'seed_data_files/in3.txt' INTO TABLE myinput1;

 where seed_data_files/in3.txt:
 12  35
 NULL40
 48  NULL
 100 100

 I've tried:
 SELECT * FROM myinput1 a FULL OUTER JOIN myinput1 b on a.key > 40 AND
 a.value > 50 AND a.key = a.value AND b.key > 40 AND b.value > 50 AND b.key
 = b.value ORDER BY a.key, a.value, b.key, b.value;

 and expected result in test is:
 NULL  NULL  NULL  40
 NULL  NULL  NULL  40
 NULL  NULL  NULL  40
 NULL  NULL  NULL  40
 NULL  NULL  12  35
 NULL  NULL  12  35
 NULL  NULL  12  35
 NULL  NULL  12  35
 NULL  NULL  48  NULL
 NULL  NULL  48  NULL
 NULL  NULL  48  NULL
 NULL  NULL  48  NULL
 NULL  40  NULL  NULL
 NULL  40  NULL  NULL
 NULL  40  NULL  NULL
 NULL  40  NULL  NULL
 12  35  NULL  NULL
 12  35  NULL  NULL
 12  35  NULL  NULL
 12  35  NULL  NULL
 48  NULL  NULL  NULL
 48  NULL  NULL  NULL
 48  NULL  NULL  NULL
 48  NULL  NULL  NULL
 100 100 NULL  NULL
 100 100 NULL  NULL
 100 100 NULL  NULL
 100 100 100 100


 but real hive result is:
 NULLNULLNULL40
 NULLNULL12  35
 NULLNULL48  NULL
 NULL40  NULLNULL
 12  35  NULLNULL
 48  NULLNULLNULL
 100 100 100 100

 btw. result from postgresql is:
 (SELECT *
   FROM myinput1 a
 LEFT JOIN
   myinput1 b on
 a.key > 40 AND
 a.value > 50 AND
 a.key = a.value AND
 b.key > 40 AND
 b.value > 50 AND
 b.key = b.value  ORDER BY a.key, a.value, b.key, b.value)
 UNION (SELECT *
   FROM myinput1 a
 RIGHT JOIN
   myinput1 b on
 a.key > 40 AND
 a.value > 50 AND
 a.key = a.value AND
 b.key > 40 AND
 b.value > 50 AND
 b.key = b.value
 ORDER BY a.key, a.value, b.key, b.value);
  |   |  12 |35
   12 |35 | |
  |   |  48 |
   48 |   | |
  |40 | |
  |   | |40
  100 |   100 | 100 |   100

 so it's the same like in hive.

 What is the right result for this full outer join in HiveQL, please?

 --
 Best Regards,
 Martin Kudlej.
 MRG/Grid  RHS-Hadoop Senior Quality Assurance Engineer
 Red Hat Czech s.r.o.

 Phone: +420 532 294 155
 E-mail:mkudlej at redhat.com
 IRC:   mkudlej at #brno, #messaging, #grid, #rhs, #distcomp



Re: full outer join result

2014-03-12 Thread Stephen Sprague
well. i had some free time to search it.  from here:
http://www.postgresql.org/docs/9.3/static/sql-select.html#SQL-UNION you'll
see the default is indeed UNION DISTINCT.  so changing it to UNION ALL
you'll get different results - are they the ones you're expecting?
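
i.e. the quoted Postgres emulation re-run with only that keyword changed
(sketch; note that a LEFT JOIN followed by UNION ALL of a RIGHT JOIN
double-counts the matched rows, which is itself different from a true FULL
OUTER JOIN):

{code}
(SELECT * FROM myinput1 a LEFT JOIN myinput1 b
    ON a.key > 40 AND a.value > 50 AND a.key = a.value
   AND b.key > 40 AND b.value > 50 AND b.key = b.value)
UNION ALL
(SELECT * FROM myinput1 a RIGHT JOIN myinput1 b
    ON a.key > 40 AND a.value > 50 AND a.key = a.value
   AND b.key > 40 AND b.value > 50 AND b.key = b.value)
ORDER BY 1, 2, 3, 4;
{code}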


On Wed, Mar 12, 2014 at 9:36 AM, Stephen Sprague sprag...@gmail.com wrote:

 interesting.don't know the answer but could you change the UNION in
 the Postgres to UNION ALL?  I'd be curious if the default is UNION DISTINCT
 on that platform. That would at least partially explain postgres behaviour
 leaving hive the odd man out.



 On Wed, Mar 12, 2014 at 6:47 AM, Martin Kudlej mkud...@redhat.com wrote:

 Hi all,

 I've tried BigTop test for join_filters:
 CREATE TABLE myinput1(key int, value int);
 LOAD DATA LOCAL INPATH 'seed_data_files/in3.txt' INTO TABLE myinput1;

 where seed_data_files/in3.txt:
 12  35
 NULL40
 48  NULL
 100 100

 I've tried:
 SELECT * FROM myinput1 a FULL OUTER JOIN myinput1 b on a.key > 40 AND
 a.value > 50 AND a.key = a.value AND b.key > 40 AND b.value > 50 AND b.key
 = b.value ORDER BY a.key, a.value, b.key, b.value;

 and expected result in test is:
 NULL  NULL  NULL  40
 NULL  NULL  NULL  40
 NULL  NULL  NULL  40
 NULL  NULL  NULL  40
 NULL  NULL  12  35
 NULL  NULL  12  35
 NULL  NULL  12  35
 NULL  NULL  12  35
 NULL  NULL  48  NULL
 NULL  NULL  48  NULL
 NULL  NULL  48  NULL
 NULL  NULL  48  NULL
 NULL  40  NULL  NULL
 NULL  40  NULL  NULL
 NULL  40  NULL  NULL
 NULL  40  NULL  NULL
 12  35  NULL  NULL
 12  35  NULL  NULL
 12  35  NULL  NULL
 12  35  NULL  NULL
 48  NULL  NULL  NULL
 48  NULL  NULL  NULL
 48  NULL  NULL  NULL
 48  NULL  NULL  NULL
 100 100 NULL  NULL
 100 100 NULL  NULL
 100 100 NULL  NULL
 100 100 100 100


 but real hive result is:
 NULLNULLNULL40
 NULLNULL12  35
 NULLNULL48  NULL
 NULL40  NULLNULL
 12  35  NULLNULL
 48  NULLNULLNULL
 100 100 100 100

 btw. result from postgresql is:
 (SELECT *
   FROM myinput1 a
 LEFT JOIN
   myinput1 b on
 a.key > 40 AND
 a.value > 50 AND
 a.key = a.value AND
 b.key > 40 AND
 b.value > 50 AND
 b.key = b.value  ORDER BY a.key, a.value, b.key, b.value)
 UNION (SELECT *
   FROM myinput1 a
 RIGHT JOIN
   myinput1 b on
 a.key > 40 AND
 a.value > 50 AND
 a.key = a.value AND
 b.key > 40 AND
 b.value > 50 AND
 b.key = b.value
 ORDER BY a.key, a.value, b.key, b.value);
  |   |  12 |35
   12 |35 | |
  |   |  48 |
   48 |   | |
  |40 | |
  |   | |40
  100 |   100 | 100 |   100

 so it's the same like in hive.

 What is the right result for this full outer join in HiveQL, please?

 --
 Best Regards,
 Martin Kudlej.
 MRG/Grid  RHS-Hadoop Senior Quality Assurance Engineer
 Red Hat Czech s.r.o.

 Phone: +420 532 294 155
 E-mail:mkudlej at redhat.com
 IRC:   mkudlej at #brno, #messaging, #grid, #rhs, #distcomp





additional hive functions

2014-03-12 Thread Stephen Sprague
just a public service announcement.

I had a case where i had a nested json array in a string and i needed that
to act like a first class array in hive. natively, you can pull it out but
it'll just a string. woe is me.

I searched around the web and found this:

http://stackoverflow.com/questions/19867155/array-contains-equivalent-for-json-in-hiveql

follow the link for the brickhouse libs in the above url and you'll find a
veritable treasure trove of useful functions.


Cheers,
Stephen.
PS I have zero affiliation to brickhouse.


Re: Hive - Sorting on the Partition Column data type Int . Output is Alphabetic Sort

2014-03-11 Thread Stephen Sprague
that makes no sense. if the column is an int it isn't going to sort like a
string.  I smell a user error somewhere.


On Tue, Mar 11, 2014 at 6:21 AM, Arafat, Moiz moiz.ara...@teamaol.comwrote:

 Hi ,

 I have a table that has a partition column partition_hr . Data Type is int
 (partition_hr  int) . When i run a sort on this column the output is
 like this.

 0
 1
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 2
 20
 21
 22
 23
 3
 4
 5
 6
 7
 8
 9

 I expected the output like this  .

 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
 10
 .
 .
 and so on.

 It works fine for non-partition columns. Please advise.

 Thanks,
 Moiz




Re: bucketed table problems

2014-03-07 Thread Stephen Sprague
yeah. that's not right.

1. lets see the output of show create table foo

2. what version of hive are you using.


On Fri, Mar 7, 2014 at 11:46 AM, Keith Wiley kwi...@keithwiley.com wrote:

 I want to convert a table to a bucketed table, so I made a new table with
 the same schema as the old table and specified a cluster column:

 create table foo_bucketed
 (
 a string,
 b int,
 c float
 )
 clustered by (b) into 10 buckets;

 Then I populate it from my original table:

 set hive.enforce.bucketing = true;
 insert overwrite table foo_bucketed
 select * from foo;

 All of the data goes into the first bucket, leaving the remaining 9
 buckets empty (in the file system, the remaining 9 files are 0 size).
  Furthermore, the cluster column is now NULL.  Its values have been
 completely erased by the insertion (which might explain how they all ended
 up in a single bucket of course).


 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com

 Yet mark his perfect self-contentment, and hence learn his lesson, that
 to be
 self-contented is to be vile and ignorant, and that to aspire is better
 than to
 be blindly and impotently happy.
--  Edwin A. Abbott, Flatland

 




Re: bucketed table problems

2014-03-07 Thread Stephen Sprague
short answer: it's by position.
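
i.e. the select list has to line up with the target's column order. A sketch
against the DDL quoted in the previous message (untested):

{code}
set hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE foo_bucketed
SELECT a, b, c        -- matched to foo_bucketed(a, b, c) by position, not by name
FROM foo;
{code}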


Re: HIVE QUERY HELP:: HOW TO IMPLEMENT THIS CASE

2014-03-04 Thread Stephen Sprague
Let's just say this.  Coercing hive into doing something it's not meant to
do is kinda a waste of time. Sure you can rewrite any update as a
delete/insert but that's not the point of Hive.

Seems like you're going down a path here that's not optimal for your
situation.

You know, I could buy a Tesla and bury it in the ground and use it as a
root cellar - but why?

Cheers,
Stephen


On Mon, Mar 3, 2014 at 10:45 PM, yogesh dhari yogeshh...@gmail.com wrote:

 Hello All,

 I have a use case in RDBMS query which I have implemented in
 HIVE as..



 *1.1) Update statement* *in RDBMS*

 update  TABLE1
 set
 Age= case when isnull(age,'') ='' then 'A= Current' else '240+ Days' end,
 Prev_Age=case when isnull(prev_age,'') ='' then 'A= Current' else '240+
 Days' end;
 *1.2) Update statement* *in HIVE*

 create table  TABLE2 as select
 a.* ,
 case when coalesce(a.age,'')='' then 'A=Current' else '240+ Days' end as
 Age,
 case when coalesce(a.prev_age,'')='' then 'A=Current' else '240+ Days' end
 as Prev_age from TABLE1 a ;





 *Now I have a case statement in which I have a join condition*.



 *2) Join in RDBMS*
 update  TABLE1
 set a.Age = c.sys_buscd_item_desc1
 from  TABLE1 a
 join  TABLE2 c
 on c.sys_type_cd='AGE'
 where isnull(a.age,'00')=c.sys_item;
 commit;





 How can I implement this query into Hive, Pls help and suggest.



 Thanks In Advance

 Yogesh Kumar









Re: HIVE QUERY HELP:: HOW TO IMPLEMENT THIS CASE

2014-03-04 Thread Stephen Sprague
ok. my conscience got the best of me. maybe for worse though. :)

This to me is like giving you a rope and a stool and i don't think it'll
end well.

That said consider something like this:

{code}
select
a.foo1,
a.foo2,

--column to be updated. you need to position it properly
--if null then no match. use value from table a
case
when z.sys_buscd_item_desc1 is null then
   a.age   -- keep TABLE1's existing value

--if not null then match. use value from table z
else
   z.sys_buscd_item_desc1
end,

a.fooN_1,
a.fooN
from

   (
   select
   c.sys_item,             -- join key
   c.sys_buscd_item_desc1  -- replacement value
   from
TABLE1 a
join  TABLE2 c on coalesce(a.age,'00')=c.sys_item
   where
   c.sys_type_cd='AGE'
   ) z

   RIGHT OUTER join TABLE1 a on (z.sys_item = coalesce(a.age,'00') )

-- no where clause needed on 'AGE' since its part of the where clause in the
-- derived table.
{code}

i switched your ON clause and WHERE clause so be sure to take that under
consideration. And finally its not tested.

Best of luck.

Cheers,
Stephen




On Tue, Mar 4, 2014 at 7:49 AM, Stephen Sprague sprag...@gmail.com wrote:


 Let's just say this.  Coercing hive into doing something it's not meant to
 do is kinda a waste of time. Sure you can rewrite any update as a
 delete/insert but that's not the point of Hive.

 Seems like you're going down a path here that's not optimal for your
 situation.

 You know, I could buy a Tesla and bury it in the ground and use it as a
 root cellar - but why?

 Cheers,
 Stephen


 On Mon, Mar 3, 2014 at 10:45 PM, yogesh dhari yogeshh...@gmail.comwrote:

 Hello All,

 I have a use case in RDBMS query which I have implemented in
 HIVE as..



 *1.1) Update statement* *in RDBMS*

 update  TABLE1
 set
 Age= case when isnull(age,'') ='' then 'A= Current' else '240+ Days' end,
 Prev_Age=case when isnull(prev_age,'') ='' then 'A= Current' else '240+
 Days' end;
 *1.2) Update statement* *in HIVE*

 create table  TABLE2 as select
 a.* ,
 case when coalesce(a.age,'')='' then 'A=Current' else '240+ Days' end as
 Age,
 case when coalesce(a.prev_age,'')='' then 'A=Current' else '240+ Days'
 end as Prev_age from TABLE1 a ;





 *Now I have a case statement in which I have a join condition*.



 *2) Join in RDBMS*
 update  TABLE1
 set a.Age = c.sys_buscd_item_desc1
 from  TABLE1 a
 join  TABLE2 c
 on c.sys_type_cd='AGE'
 where isnull(a.age,'00')=c.sys_item;
 commit;





 How can I implement this query into Hive, Pls help and suggest.



 Thanks In Advance

 Yogesh Kumar











Re: move hive tables from one cluster to another cluster

2014-02-28 Thread Stephen Sprague
this is a FAQ.  see the doc on: msck repair table <table>

this will scan hdfs and create the corresponding partitions in the
metastore.
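
e.g. (the table name is a placeholder):

{code}
hive> msck repair table my_table;
hive> show partitions my_table;    -- sanity-check what it discovered
{code}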


On Fri, Feb 28, 2014 at 12:59 AM, shashwat shriparv 
dwivedishash...@gmail.com wrote:

 Where was your meta data in derby or MySql?


 *Warm Regards_**∞_*
 * Shashwat Shriparv*
  [image: 
 http://www.linkedin.com/pub/shashwat-shriparv/19/214/2a9]http://www.linkedin.com/pub/shashwat-shriparv/19/214/2a9[image:
 https://twitter.com/shriparv] https://twitter.com/shriparv[image:
 https://www.facebook.com/shriparv] https://www.facebook.com/shriparv[image:
 http://google.com/+ShashwatShriparv] 
 http://google.com/+ShashwatShriparv[image:
 http://www.youtube.com/user/sShriparv/videos]http://www.youtube.com/user/sShriparv/videos[image:
 http://profile.yahoo.com/SWXSTW3DVSDTF2HHSRM47AV6DI/] shrip...@yahoo.com



 On Fri, Feb 28, 2014 at 1:28 PM, soniya B soniya.bigd...@gmail.comwrote:

 Hi,

 I have moved warehouse file to another cluster. but still i didn't see
 the tables on the other cluster. How to rebuild the metadata?

 Thanks
 Soniya

 On Fri, Feb 28, 2014 at 9:26 AM, Krishnan K kkrishna...@gmail.comwrote:

 1. you can use distcp to copy the files to the new cluster
 2. rebuild metadata


 On Thu, Feb 27, 2014 at 8:07 PM, soniya B soniya.bigd...@gmail.comwrote:

 Dear experts,

 I want to move my hive tables from one cluster to another cluster. how
 can i do it?

 Thanks
 Soniya.







Re: Hive + Flume

2014-02-28 Thread Stephen Sprague
if you can configure flume to create temporary files that start with an
underscore (_) i believe hive will safely ignore them. otherwise you have to
write a script to move them out.
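
With the stock Flume HDFS sink that should be the hdfs.inUsePrefix property
(sketch; the agent and sink names are made up, so map them onto your own
flume.conf):

{code}
# in-flight files get a leading underscore, which hive/mapreduce ignore until they are renamed
agent1.sinks.hdfs-sink1.hdfs.inUsePrefix = _
{code}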


On Fri, Feb 28, 2014 at 11:09 AM, P lva ruvi...@gmail.com wrote:

 Hi,

 I'm have a flume stream that stores data in a directory which is source
 for an external table in hive.
 However flume creates tmp file that cannot be read by hive and query
 breaks.
 Is there any way to avoid hive reading this tmp file ?

 Thanks



Re: move hive tables from one cluster to another cluster

2014-02-28 Thread Stephen Sprague
that advice is way overcomplicating something that is very easy. instead,
please take this approach.

1. run the ddl to create the table on the new cluster
2. distcp the hdfs data into the appropriate hdfs directory.
3. run msck repair table <table> in hive to discover the partitions and
populate the metastore accordingly.








On Fri, Feb 28, 2014 at 1:41 PM, Abdelrahman Shettia 
ashet...@hortonworks.com wrote:

 If you are using Mysql as a database, you need to take a mysql dump of the
 metastore database and search/replace the old namenode hostname with the
 new namenode hostname. Once that is completed, import the metastore
 database.

 Thanks


 On Fri, Feb 28, 2014 at 12:59 AM, shashwat shriparv 
 dwivedishash...@gmail.com wrote:

 Where was your meta data in derby or MySql?


 *Warm Regards_**∞_*
 * Shashwat Shriparv*
  [image: 
 http://www.linkedin.com/pub/shashwat-shriparv/19/214/2a9]http://www.linkedin.com/pub/shashwat-shriparv/19/214/2a9[image:
 https://twitter.com/shriparv] https://twitter.com/shriparv[image:
 https://www.facebook.com/shriparv] https://www.facebook.com/shriparv[image:
 http://google.com/+ShashwatShriparv]http://google.com/+ShashwatShriparv[image:
 http://www.youtube.com/user/sShriparv/videos]http://www.youtube.com/user/sShriparv/videos[image:
 http://profile.yahoo.com/SWXSTW3DVSDTF2HHSRM47AV6DI/]shrip...@yahoo.com



 On Fri, Feb 28, 2014 at 1:28 PM, soniya B soniya.bigd...@gmail.comwrote:

 Hi,

 I have moved warehouse file to another cluster. but still i didn't see
 the tables on the other cluster. How to rebuild the metadata?

 Thanks
 Soniya

 On Fri, Feb 28, 2014 at 9:26 AM, Krishnan K kkrishna...@gmail.comwrote:

 1. you can use distcp to copy the files to the new cluster
 2. rebuild metadata


 On Thu, Feb 27, 2014 at 8:07 PM, soniya B soniya.bigd...@gmail.comwrote:

 Dear experts,

 I want to move my hive tables from one cluster to another cluster. how
 can i do it?

 Thanks
 Soniya.








Re: Metastore performance on HDFS-backed table with 15000+ partitions

2014-02-22 Thread Stephen Sprague
yeah. That traceback pretty much spells it out - its metastore related and
that's where the partitions are stored.

I'm with the others on this. HiveServer2 is still a little jankey on memory
management.  I bounce mine once a day at midnight just to play it safe (and
because i can.)

Again, for me, i use the hive local client for production jobs and remote
client for adhoc stuff.

you may wish to confirm the local hive client has no problem with your
query.

other than that you either increase your heap size on the HS2 process and
hope for the best and/or file a bug report.

bottom line hiveserver2 isn't production bullet proof just yet, IMHO.
Others may disagree.

Regards,
Stephen.



On Sat, Feb 22, 2014 at 9:50 AM, Norbert Burger norbert.bur...@gmail.comwrote:

 Thanks all for the quick feedback.

 I'm a bit surprised to learn 15k is considered too much, but we can work
 around it.  I guess I'm also curious why the query planner needs to know
 about all partitions even in the case of simple select/limit queries, where
 the query might target only a single partition.

 Here's the client-side OOME with HADOOP_HEAPSIZE=2048:


 https://gist.githubusercontent.com/nburger/3286d2052060e2efe161/raw/dc30231086803c1d33b9137b5844d2d0e20e350d/gistfile1.txt

 This was from a CDH4.3.0 client hitting HIveServer2.  Any idea what's
 consuming the heap?

 Norbert


 On Sat, Feb 22, 2014 at 10:32 AM, Edward Capriolo 
 edlinuxg...@gmail.comwrote:

 Don't make tables with that many partitions. It is an anti-pattern. I have
 tables with 2000 partitions a day and that is really too many. Hive needs to
 load that information into memory to plan the query.


 On Saturday, February 22, 2014, Terje Marthinussen 
 tmarthinus...@gmail.com wrote:
  Query optimizer in hive is awful on memory consumption. 15k partitions
 sounds a bit early for it to fail though..
 
  What is your heap size?
 
  Regards,
  Terje
 
  On 22 Feb 2014, at 12:05, Norbert Burger norbert.bur...@gmail.com
 wrote:
 
  Hi folks,
 
  We are running CDH 4.3.0 Hive (0.10.0+121) with a MySQL metastore.
 
  In Hive, we have an external table backed by HDFS which has a 3-level
 partitioning scheme that currently has 15000+ partitions.
 
  Within the last day or so, queries against this table have started
 failing.  A simple query which shouldn't take very long at all (select *
 from ... limit 10) fails after several minutes with a client OOME.  I get
 the same outcome on count(*) queries (which I thought wouldn't send any
 data back to the client).  Increasing heap on both client and server JVMs
 (via HADOOP_HEAPSIZE) doesn't have any impact.
 
  We were only able to work around the client OOMEs by reducing the
 number of partitions in the table.
 
  Looking at the MySQL querylog, my thought is that the Hive client is
 quite busy making requests for partitions that doesn't contribute to the
 query.  Has anyone else had similar experience against tables this size?
 
  Thanks,
  Norbert
 

 --
 Sorry this was sent from mobile. Will do less grammar and spell check
 than usual.





Re: Metastore performance on HDFS-backed table with 15000+ partitions

2014-02-21 Thread Stephen Sprague
most interesting.  we had an issue recently with querying a table with 15K
columns and running out of heap storage but not 15K partitions.

15K partitions shouldn't be causing a problem in my humble estimation.
Maybe a million but not 15K. :)

So is there a traceback we can look at? or its not heap but real memory?

and this is the local hive client? or the hiveserver?

Thanks,
Stephen.



On Fri, Feb 21, 2014 at 7:05 PM, Norbert Burger norbert.bur...@gmail.comwrote:

 Hi folks,

 We are running CDH 4.3.0 Hive (0.10.0+121) with a MySQL metastore.

 In Hive, we have an external table backed by HDFS which has a 3-level
 partitioning scheme that currently has 15000+ partitions.

 Within the last day or so, queries against this table have started
 failing.  A simple query which shouldn't take very long at all (select *
 from ... limit 10) fails after several minutes with a client OOME.  I get
 the same outcome on count(*) queries (which I thought wouldn't send any
 data back to the client).  Increasing heap on both client and server JVMs
 (via HADOOP_HEAPSIZE) doesn't have any impact.

 We were only able to work around the client OOMEs by reducing the number
 of partitions in the table.

 Looking at the MySQL querylog, my thought is that the Hive client is quite
 busy making requests for partitions that doesn't contribute to the query.
  Has anyone else had similar experience against tables this size?

 Thanks,
 Norbert



Re: Slow performance on queries with aggregation function

2014-02-21 Thread Stephen Sprague
Hi Jone,
um.  i can say for sure something is wrong. :)

i would _start_ by going to the tasktracker. this is your friend.  find
your job and look for failed reducers.  That's the starting point anyway,
IMHO.



On Fri, Feb 21, 2014 at 11:35 AM, Jone Lura jone.l...@ecc.no wrote:

 Hi,

 I have tried some variations of queries with aggregation function such as
 the following query;

 select max(total) from my_table;

 and

 select id, sum(total) from my_table group by id

 In my junit tests, I only have two rows with data, but the queries are
 extremely slow.

 The job detail output shows me the following;

 Hadoop job information for Stage-1: number of mappers: 1; number of
 reducers: 1
 2014-02-21 17:31:42,544 Stage-1 map = 0%,  reduce = 0%
 2014-02-21 17:31:45,548 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:31:46,899 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:31:55,446 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:32:34,358 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:32:40,040 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:32:45,653 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:32:46,999 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:32:55,544 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:33:34,454 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:33:40,130 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:33:45,742 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:33:47,093 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 17:33:55,632 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:27:48,005 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:27:48,461 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:27:48,311 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:27:48,574 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:27:48,932 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:28:48,915 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:28:48,915 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:28:48,933 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:28:48,933 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:28:49,727 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:29:47,995 Stage-1 map = 100%,  reduce = 100%
 2014-02-21 19:29:48,997 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:29:49,018 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:29:49,019 Stage-1 map = 100%,  reduce = 0%
 2014-02-21 19:29:49,824 Stage-1 map = 100%,  reduce = 0%

 I am relatively new to Hadoop and Hive and I do not know if this is
 normal, or if I have missed some configuration details.

 In my application I am expecting to have 500M or more rows.

 Best regards,

 Jone



Re: Issue with Hive and table with lots of column

2014-02-19 Thread Stephen Sprague
ok. thanks.

so given everything we know the choices i see are:

1. increase your heapsize some more. (And of course confirm your process
that your reported the -Xmx8192M is the HiveServer2 process.)

2. modify your query such that it doesn't use select *

3. modify your query such that it does its own buffering.  maybe stream it?

4. create a jira ticket and request that the internal buffer size that the
hiveserver2 uses for staging results be configurable.

That's all _i_ got left in the tank for this issue.  I think we need an SME
who is familiar with the code now.

Regards,
Stephen.


On Tue, Feb 18, 2014 at 10:57 AM, David Gayou david.ga...@kxen.com wrote:

 Sorry, i reported it badly. It's 8192M

 Thanks,

 David.
 On Tue, Feb 18, 2014 at 6:37 PM, Stephen Sprague sprag...@gmail.com wrote:

 oh. i just noticed the -Xmx value you reported.

 there's no M or G after that number??  I'd like to see -Xmx8192M or
 -Xmx8G.  That *is* very important.

 thanks,
 Stephen.


 On Tue, Feb 18, 2014 at 9:22 AM, Stephen Sprague sprag...@gmail.comwrote:

 thanks.

 re #1.  we need to find that Hiveserver2 process. For all i know the one
 you reported is hiveserver1 (which works.) chances are they use the same
 -Xmx value but we really shouldn't make any assumptions.

 try wide format on the ps command (eg. ps -efw | grep -i Hiveserver2)

 re #2.  okay.  so that tells us it's not the number of columns blowing the
 heap but rather the combination of rows + columns.  There's no way it
 stores the full result set on the heap even under normal circumstances, so
 my guess is there's an internal number of rows it buffers.  sorta like how
 unix buffers stdout.  How and where that's set is out of my league.
 However, maybe you can get around it by upping your heapsize again, if you
 have the available memory of course.


 On Tue, Feb 18, 2014 at 8:39 AM, David Gayou david.ga...@kxen.comwrote:


 1. I have no process with hiveserver2 ...

 ps -ef | grep -i hive  returns some pretty long command with a
 -Xmx8192 and that's the value set in hive-env.sh


 2. The select * from table limit 1 or even 100 is working correctly.


 David.


 On Tue, Feb 18, 2014 at 4:16 PM, Stephen Sprague sprag...@gmail.comwrote:

 He lives on after all! and thanks for the continued feedback.

 We need the answers to these questions using HS2:



1. what is the output of ps -ef | grep -i hiveserver2 on your
 system? in particular what is the value of -Xmx ?

2. does select * from table limit 1 work?

 Thanks,
 Stephen.



 On Tue, Feb 18, 2014 at 6:32 AM, David Gayou david.ga...@kxen.comwrote:

 I'm so sorry, i wrote an answer, and i forgot to send it
 And i haven't been able to work on this for a few days.


 So far :

 I have a 15k columns table and 50k rows.

 I do not see any changes if i change the storage.


 *Hive 12.0*

 My test query is select * from bigtable


 If i use the hive cli, it works fine.

 If i use hiveserver1 + ODBC : it works fine

 If i use hiveserver2 + odbc or hiveserver2 + beeline, i have this
 java exception:

 2014-02-18 13:22:22,571 ERROR thrift.ProcessFunction
 (ProcessFunction.java:process(41)) - Internal error processing 
 FetchResults

 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2734)
 at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
 at java.util.ArrayList.add(ArrayList.java:351)
  at
 org.apache.hive.service.cli.thrift.TRow.addToColVals(TRow.java:160)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
 at
 org.apache.hive.service.cli.operation.SQLOperation.prepareFromRow(SQLOperation.java:270)
 at
 org.apache.hive.service.cli.operation.SQLOperation.decode(SQLOperation.java:262)
 at
 org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:246)
 at
 org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:171)
 at
 org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:438)
 at
 org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:346)
 at
 org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:407)
 at
 org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)




 *From the SVN trunk* : (for the HIVE-3746)

 With the maven change, most of the documentation and wiki are out of
 date.
 Compiling from trunk was not that easy and i may have gotten some
 steps wrong, but:

 It has the same behavior. It works in CLI and hiveserver1.
 It fails with hiveserver 2.


 Regards

 David Gayou





 On Thu, Feb 13, 2014 at 3:11 AM, Navis류승우 navis@nexr.com wrote:

 With HIVE-3746, which will be included in hive-0.13, HiveServer2
 takes less memory than before.

 Could you try it with the version in trunk?

Re: Issue with Hive and table with lots of column

2014-02-18 Thread Stephen Sprague
He lives on after all! and thanks for the continued feedback.

We need the answers to these questions using HS2:


   1. what is the output of ps -ef | grep -i hiveserver2 on your system?
in particular what is the value of -Xmx ?

   2. does select * from table limit 1 work?

Thanks,
Stephen.



On Tue, Feb 18, 2014 at 6:32 AM, David Gayou david.ga...@kxen.com wrote:

 I'm so sorry, i wrote an answer, and i forgot to send it
 And i haven't been able to work on this for a few days.


 So far :

 I have a 15k columns table and 50k rows.

 I do not see any changes if i change the storage.


 *Hive 12.0*

 My test query is select * from bigtable


 If i use the hive cli, it works fine.

 If i use hiveserver1 + ODBC : it works fine

 If i use hiveserver2 + odbc or hiveserver2 + beeline, i have this java
 exception:

 2014-02-18 13:22:22,571 ERROR thrift.ProcessFunction
 (ProcessFunction.java:process(41)) - Internal error processing FetchResults

 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2734)
 at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
 at java.util.ArrayList.add(ArrayList.java:351)
  at
 org.apache.hive.service.cli.thrift.TRow.addToColVals(TRow.java:160)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
 at
 org.apache.hive.service.cli.operation.SQLOperation.prepareFromRow(SQLOperation.java:270)
 at
 org.apache.hive.service.cli.operation.SQLOperation.decode(SQLOperation.java:262)
 at
 org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:246)
 at
 org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:171)
 at
 org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:438)
 at
 org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:346)
 at
 org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:407)
 at
 org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)




 *From the SVN trunk* : (for the HIVE-3746)

 With the maven change, most of the documentation and wiki are out of date.
 Compiling from trunk was not that easy and i may have gotten some steps
 wrong, but:

 It has the same behavior. It works in CLI and hiveserver1.
 It fails with hiveserver 2.


 Regards

 David Gayou





 On Thu, Feb 13, 2014 at 3:11 AM, Navis류승우 navis@nexr.com wrote:

 With HIVE-3746, which will be included in hive-0.13, HiveServer2 takes
 less memory than before.

 Could you try it with the version in trunk?


 2014-02-13 10:49 GMT+09:00 Stephen Sprague sprag...@gmail.com:

 question to the original poster.  closure appreciated!


 On Fri, Jan 31, 2014 at 12:22 PM, Stephen Sprague sprag...@gmail.comwrote:

 thanks Ed. And on a separate tack, let's look at Hiveserver2.


 @OP

 *I've tried to look around on how i can change the thrift heap size but
 haven't found anything.*


 looking at my hiveserver2 i find this:

$ ps -ef | grep -i hiveserver2
dwr   9824 20479  0 12:11 pts/1    00:00:00 grep -i hiveserver2
dwr  28410     1  0 00:05 ?        00:01:04
 /usr/lib/jvm/java-6-sun/jre/bin/java
 *-Xmx256m* -Dhadoop.log.dir=/usr/lib/hadoop/logs
 -Dhadoop.log.file=hadoop.log
 -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=
 -Dhadoop.root.logger=INFO,console
 -Djava.library.path=/usr/lib/hadoop/lib/native
 -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
 -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar
 /usr/lib/hive/lib/hive-service-0.12.0.jar
 org.apache.hive.service.server.HiveServer2




 questions:

1. what is the output of ps -ef | grep -i hiveserver2 on your
 system? in particular what is the value of -Xmx ?

2. can you restart your hiveserver with -Xmx1g? or some value that
 makes sense to your system?



 Lots of questions now.  we await your answers! :)



 On Fri, Jan 31, 2014 at 11:51 AM, Edward Capriolo 
 edlinuxg...@gmail.com wrote:

 Final table compression should not affect the deserialized size of
 the data over the wire.


 On Fri, Jan 31, 2014 at 2:49 PM, Stephen Sprague 
 sprag...@gmail.comwrote:

 Excellent progress, David.   So.  The most important thing we learned here
 is that it works (!) when running hive in local mode, and that this error is
 a limitation in HiveServer2.  That's important.

 so textfile storage handler and having issues converting it to ORC.
 hmmm.

 follow-ups.

 1. what is your query that fails?

 2. can you add a limit 1 to the end of your query and tell us if
 that works? this'll tell us if it's column or row bound.

 3. bonus points. run these in local mode:
set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK

Re: Hive Query :: Implementing case statement

2014-02-18 Thread Stephen Sprague
maybe consider something along these lines. nb. not tested.

-- temp table holding new balances + key

create table NEW_BALS as
  select * from (
  select b.prev as NEW_BALANCE, a.key from TABLE_SQL a join TABLE_SQL_2
b on (a.key=b.key) where a.code='1'
  UNION ALL
  select b.prev as NEW_BALANCE, a.key from TABLE_SQL a join TABLE_SQL_3
b on (a.key=b.key) where a.code='2'
  ) z
  ;

-- i gave the table a new name but you could use TABLE_SQL instead of
-- NEW_TABLE_SQL for the true update.

insert overwrite table NEW_TABLE_SQL

   select * from (

-- intersected rows - you'll have to line-up the columns correctly
-- substituting b.NEW_BALANCE into the right position
   select  a.col1, b.NEW_BALANCE, a.col2, ... from TABLE_SQL a join
NEW_BALS b on (a.key=b.key)

   UNION ALL

-- non-intersected rows (left outer join so the is-null test actually works)
   select a.* from TABLE_SQL a left outer join NEW_BALS b on (a.key=b.key)
where b.key is null

   ) z
;

there's probably some typos in there but hopefully you get the idea and can
take it from here.
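
and for the true in-place swap mentioned above -- assuming you built the
staging table as NEW_TABLE_SQL -- the last step would just be something
like this (untested, and only once you're happy with what landed there):

   -- overwrite the original table with the corrected rows
   insert overwrite table TABLE_SQL
   select * from NEW_TABLE_SQL;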





On Tue, Feb 18, 2014 at 4:39 AM, yogesh dhari yogeshh...@gmail.com wrote:

 Hello All,

 I have a use case where a table, say TABLE_SQL, is getting updated like this:


 1st Update Command

 update TABLE_SQL a
 set BALANCE = b.prev
 from TABLE_SQL_2 b
 where a.key = b.key and a.code = 1


 2nd Update Command

 update TABLE_SQL a
 set BALANCE = b.prev
 from TABLE_SQL_3 b
 where a.key = b.key and a.code = 2


 the same column is getting updated twice in the sql table.

 I have a table in Hive, say TABLE_HIVE.

 How can I perform this kind of operation in Hive?

 Pls help,

 Thanks in Advance
 Yogesh Kumar













Re: Issue with Hive and table with lots of column

2014-02-18 Thread Stephen Sprague
thanks.

re #1.  we need to find that Hiveserver2 process. For all i know the one
you reported is hiveserver1 (which works.) chances are they use the same
-Xmx value but we really shouldn't make any assumptions.

try wide format on the ps command (eg. ps -efw | grep -i Hiveserver2)

re #2.  okay.  so that tells us it's not the number of columns blowing the
heap but rather the combination of rows + columns.  There's no way it
stores the full result set on the heap even under normal circumstances, so
my guess is there's an internal number of rows it buffers.  sorta like how
unix buffers stdout.  How and where that's set is out of my league.
However, maybe you can get around it by upping your heapsize again, if you
have the available memory of course.


On Tue, Feb 18, 2014 at 8:39 AM, David Gayou david.ga...@kxen.com wrote:


 1. I have no process with hiveserver2 ...

 ps -ef | grep -i hive  returns some pretty long command with a -Xmx8192
 and that's the value set in hive-env.sh


 2. The select * from table limit 1 or even 100 is working correctly.


 David.


 On Tue, Feb 18, 2014 at 4:16 PM, Stephen Sprague sprag...@gmail.comwrote:

 He lives on after all! and thanks for the continued feedback.

 We need the answers to these questions using HS2:



1. what is the output of ps -ef | grep -i hiveserver2 on your
 system? in particular what is the value of -Xmx ?

2. does select * from table limit 1 work?

 Thanks,
 Stephen.



 On Tue, Feb 18, 2014 at 6:32 AM, David Gayou david.ga...@kxen.comwrote:

 I'm so sorry, i wrote an answer, and i forgot to send it
 And i haven't been able to work on this for a few days.


 So far :

 I have a 15k columns table and 50k rows.

 I do not see any changes if i change the storage.


 *Hive 12.0*

 My test query is select * from bigtable


 If i use the hive cli, it works fine.

 If i use hiveserver1 + ODBC : it works fine

 If i use hiveserver2 + odbc or hiveserver2 + beeline, i have this java
 exception:

 2014-02-18 13:22:22,571 ERROR thrift.ProcessFunction
 (ProcessFunction.java:process(41)) - Internal error processing FetchResults

 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2734)
 at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
 at java.util.ArrayList.add(ArrayList.java:351)
  at
 org.apache.hive.service.cli.thrift.TRow.addToColVals(TRow.java:160)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
 at
 org.apache.hive.service.cli.operation.SQLOperation.prepareFromRow(SQLOperation.java:270)
 at
 org.apache.hive.service.cli.operation.SQLOperation.decode(SQLOperation.java:262)
 at
 org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:246)
 at
 org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:171)
 at
 org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:438)
 at
 org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:346)
 at
 org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:407)
 at
 org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)




 *From the SVN trunk* : (for the HIVE-3746)

 With the maven change, most of the documentation and wiki are out of
 date.
 Compiling from trunk was not that easy and i may have gotten some steps
 wrong, but:

 It has the same behavior. It works in CLI and hiveserver1.
 It fails with hiveserver 2.


 Regards

 David Gayou





 On Thu, Feb 13, 2014 at 3:11 AM, Navis류승우 navis@nexr.com wrote:

 With HIVE-3746, which will be included in hive-0.13, HiveServer2 takes
 less memory than before.

 Could you try it with the version in trunk?


 2014-02-13 10:49 GMT+09:00 Stephen Sprague sprag...@gmail.com:

 question to the original poster.  closure appreciated!


 On Fri, Jan 31, 2014 at 12:22 PM, Stephen Sprague 
 sprag...@gmail.comwrote:

 thanks Ed. And on a separate tack, let's look at Hiveserver2.


 @OP

 *I've tried to look around on how i can change the thrift heap size
 but haven't found anything.*


 looking at my hiveserver2 i find this:

$ ps -ef | grep -i hiveserver2
dwr   9824 20479  0 12:11 pts/1    00:00:00 grep -i hiveserver2
dwr  28410     1  0 00:05 ?        00:01:04
 /usr/lib/jvm/java-6-sun/jre/bin/java
 *-Xmx256m* -Dhadoop.log.dir=/usr/lib/hadoop/logs
 -Dhadoop.log.file=hadoop.log
 -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=
 -Dhadoop.root.logger=INFO,console
 -Djava.library.path=/usr/lib/hadoop/lib/native
 -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
 -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar
 /usr/lib/hive/lib/hive-service-0.12.0.jar

Re: Issue with Hive and table with lots of column

2014-02-18 Thread Stephen Sprague
oh. i just noticed the -Xmx value you reported.

there's no M or G after that number??  I'd like to see -Xmx8192M or
-Xmx8G.  That *is* very important.

thanks,
Stephen.


On Tue, Feb 18, 2014 at 9:22 AM, Stephen Sprague sprag...@gmail.com wrote:

 thanks.

 re #1.  we need to find that Hiveserver2 process. For all i know the one
 you reported is hiveserver1 (which works.) chances are they use the same
 -Xmx value but we really shouldn't make any assumptions.

 try wide format on the ps command (eg. ps -efw | grep -i Hiveserver2)

 re #2.  okay.  so that tells us it's not the number of columns blowing the
 heap but rather the combination of rows + columns.  There's no way it
 stores the full result set on the heap even under normal circumstances, so
 my guess is there's an internal number of rows it buffers.  sorta like how
 unix buffers stdout.  How and where that's set is out of my league.
 However, maybe you can get around it by upping your heapsize again, if you
 have the available memory of course.


 On Tue, Feb 18, 2014 at 8:39 AM, David Gayou david.ga...@kxen.com wrote:


 1. I have no process with hiveserver2 ...

 ps -ef | grep -i hive  returns some pretty long command with a -Xmx8192
 and that's the value set in hive-env.sh


 2. The select * from table limit 1 or even 100 is working correctly.


 David.


 On Tue, Feb 18, 2014 at 4:16 PM, Stephen Sprague sprag...@gmail.comwrote:

 He lives on after all! and thanks for the continued feedback.

 We need the answers to these questions using HS2:



1. what is the output of ps -ef | grep -i hiveserver2 on your
 system? in particular what is the value of -Xmx ?

2. does select * from table limit 1 work?

 Thanks,
 Stephen.



 On Tue, Feb 18, 2014 at 6:32 AM, David Gayou david.ga...@kxen.comwrote:

 I'm so sorry, i wrote an answer, and i forgot to send it
 And i haven't been able to work on this for a few days.


 So far :

 I have a 15k columns table and 50k rows.

 I do not see any changes if i change the storage.


 *Hive 12.0*

 My test query is select * from bigtable


 If i use the hive cli, it works fine.

 If i use hiveserver1 + ODBC : it works fine

 If i use hiveserver2 + odbc or hiveserver2 + beeline, i have this java
 exception:

 2014-02-18 13:22:22,571 ERROR thrift.ProcessFunction
 (ProcessFunction.java:process(41)) - Internal error processing FetchResults

 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2734)
 at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
 at java.util.ArrayList.add(ArrayList.java:351)
  at
 org.apache.hive.service.cli.thrift.TRow.addToColVals(TRow.java:160)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
 at
 org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
 at
 org.apache.hive.service.cli.operation.SQLOperation.prepareFromRow(SQLOperation.java:270)
 at
 org.apache.hive.service.cli.operation.SQLOperation.decode(SQLOperation.java:262)
 at
 org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:246)
 at
 org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:171)
 at
 org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:438)
 at
 org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:346)
 at
 org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:407)
 at
 org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)




 *From the SVN trunk* : (for the HIVE-3746)

 With the maven change, most of the documentation and wiki are out of
 date.
 Compiling from trunk was not that easy and i may have gotten some steps
 wrong, but:

 It has the same behavior. It works in CLI and hiveserver1.
 It fails with hiveserver 2.


 Regards

 David Gayou





 On Thu, Feb 13, 2014 at 3:11 AM, Navis류승우 navis@nexr.com wrote:

 With HIVE-3746, which will be included in hive-0.13, HiveServer2 takes
 less memory than before.

 Could you try it with the version in trunk?


 2014-02-13 10:49 GMT+09:00 Stephen Sprague sprag...@gmail.com:

 question to the original poster.  closure appreciated!


 On Fri, Jan 31, 2014 at 12:22 PM, Stephen Sprague sprag...@gmail.com
  wrote:

 thanks Ed. And on a separate tack, let's look at Hiveserver2.


 @OP

 *I've tried to look around on how i can change the thrift heap size
 but haven't found anything.*


 looking at my hiveserver2 i find this:

$ ps -ef | grep -i hiveserver2
dwr   9824 20479  0 12:11 pts/1    00:00:00 grep -i
 hiveserver2
dwr  28410     1  0 00:05 ?        00:01:04
 /usr/lib/jvm/java-6-sun/jre/bin/java
 *-Xmx256m* -Dhadoop.log.dir=/usr/lib/hadoop/logs
 -Dhadoop.log.file=hadoop.log
 -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=
 -Dhadoop.root.logger=INFO,console

Re: Views and partitions and performance

2014-02-11 Thread Stephen Sprague
great questions, Burak.  Personally, I had not seen the create view
... partitioned on construct before. Not that that means anything, but thanks
for bringing it out into the forefront!

So, yeah, do we have an SME out there that would like to elaborate on this
beyond the aforementioned url?
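
in the meantime, one comparison that may be worth running (a sketch against
the myview / _mm_dd / hh names from the definitions below -- untested here):

   -- explicit partition predicate on the view; per the report below this is
   -- what lets the pruner kick in
   select count(*) from myview where _mm_dd = '2014-02-11' and hh = '12';

   -- versus no predicate at all, which reportedly falls back to a full scan
   -- select * from myview limit 10;

diffing explain extended output for the two should at least show whether the
pruner ever sees the view's partition columns.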


On Tue, Feb 11, 2014 at 8:04 AM, Burak Gürsoy burakgur...@gmx.net wrote:

 Hi,

 Some background information:

 We have a really large base table (~1P) -- (lets call it raw_table)
 partitioned like _mm_dd=2014-02-11/hh=12/flag=1 (or
 _mm_dd=2014-02-11/hh=12/flag=0)

 I'm testing creating a view on that table to be able to have a consistent
 user-accessible interface instead of letting the users use the actual
 table (we've changed the format of the table multiple times in the past
 and we'll possibly do so again).

 I am creating the view like this[1] and I've also tested view
 partitions[2].

 The version of Hive is 0.10.0

 The issues I came across so far are:

 1) I couldn't figure out the actual purpose of the view partitions.
 The document[3] lists several use cases and mentioned
 infer view partitions automatically based on the partitions of the
 underlying tables
 which leads to an open ticket[4] which in turn lists another one
 (HIVE-1941)
 which is closed. However I fail to see how to create a derived view
 partition.
 And with the definition I've used[2] they don't seem to have the same
 effect.

 2) You need to specify the partitions in the query to the view explicitly
 to have the partition pruning come into effect. However, when I
 say select * from raw_table limit 10, it selects a subset somehow
 (explain and explain extended do not show any targeted partitions with
 such queries -- is this a bug?). But doing the same query on the view leads
 to a full table scan.

 2.1) I see a similar behaviour with things like
 select ... from myview tablesample (bucket 1 out of $bucketsize on $field)

 I'm wondering if this is just the way the views work at the moment or is
 there a workaround to improve, let's say, silly queries? As the interface
 seems inconsistent between the table and view access.

 Thanks,
 Burak



 =[1]=

 CREATE VIEW myview AS SELECT  `combined`.`column1`,
 `combined`.`column2`,
 `combined`.`_mm_dd`,
 `combined`.`hh`,
 `combined`.`column3`,
 `combined`.`column4`,
 `combined`.`column5`
 FROM
 (
 SELECT
 `raw`.`column1`,
 `raw`.`column2`,
 `raw`.`_mm_dd`,
 `raw`.`hh`,
 `raw`.`column3`,
 `raw`.`column4`,
 `raw`.`column5`
 FROM  `default`.`raw_table` `raw`
 ) `combined`

 =[2]=

 CREATE VIEW myview
 PARTITIONED ON (
 _mm_dd,
 hh,
 type,
 persona,
 dc
 )
 AS ( ... )


 =[3]=

 https://cwiki.apache.org/confluence/display/Hive/PartitionedViews





Re: FUNCTION HIVE to DAYS OF WEEK

2014-02-10 Thread Stephen Sprague
oddly enough i don't see one here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions

however, you're not the only one finding something like this useful. cf.
https://issues.apache.org/jira/browse/HIVE-6046

in the meantime it appears as though you'll have to write your own UDF or
perhaps set up a streaming job via the transform() function.

Cheers,
Stephen.

PS. The transform() function is _really_ easy to use so don't be scared
off.  cf.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
(the last example is the simplest.)
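
also, one thing that might tide you over without a custom UDF -- untested,
and it leans on from_unixtime() accepting SimpleDateFormat patterns, so
double-check it on your version:

   -- 'EEEE' is the SimpleDateFormat pattern for the full day name
   select from_unixtime(unix_timestamp('2014-02-10 08:33:00'), 'EEEE');
   -- should print 'Monday'

   -- for a real column (my_table / my_ts are placeholders), cast the
   -- timestamp to a string first if unix_timestamp() is picky about types:
   select from_unixtime(unix_timestamp(cast(my_ts as string)), 'EEEE')
   from my_table;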


On Mon, Feb 10, 2014 at 8:33 AM, Eduardo Parra Valdes | BEEVA 
eduardo.pa...@beeva.com wrote:

 Hello! all
 I wish to find a hive function that returns the day of the week (Monday,
 Tuesday, etc.) given a parameter (timestamp).
 Anyone have an idea of how to do it?

 --

 *Eduardo Parra Valdés*
 eduardo.pa...@beeva.com

 BEE OUR CLIENT

 WWW.BEEVA.COM http://www.beeva.com/

 Clara del Rey 26, 4ª planta.

 [28002] Madrid.
  https://www.facebook.com/somosBEEVA  http://www.linkedin.com/company/beeva  https://twitter.com/beeva_es

  Before you print this message, please consider whether it is really necessary.
  Disclaimer:
 This message, its content and any file attached thereto is for the
 intended recipient only and is confidential. If you have received this
 e-mail in error or had access to it, you should note that the information
 in it is private and any use thereof is unauthorised. In such an event
 please notify us by e-mail. Any reproduction of this e-mail by whatsoever
 means and any transmission or dissemination thereof to other persons is
 prohibited. It should be deleted immediately from your system.


