Joining Integer To String Can Produce Duplicates

2018-02-02 Thread Ryan C. Kleck
Hey all,  I’m on Hive 1.2.2 at work and I found some unfavorable behavior in 
one of my joins, and I wanted to see what you all think.  Below is an example:
https://github.secureserver.net/gist/rkleck/258a9a7b3dd3c915f94a53234e422a1a

WITH string_key_setup AS (
    SELECT   CAST('1234  ' AS STRING)   AS my_key
    UNION ALL
    SELECT   CAST('1234' AS STRING)     AS my_key
)
, group_setup AS (
    SELECT   my_key                     AS my_key
    FROM     string_key_setup
    GROUP BY my_key
)
, string_key AS (
    SELECT   CAST('1234' AS STRING)     AS my_key
)
, integer_key AS (
    SELECT   CAST('1234' AS INT)        AS my_key
)
SELECT   'String To String Join'        AS join_type
       , COUNT(1)                       AS num_rows
FROM     string_key t1
JOIN     group_setup t2
  ON     t1.my_key = t2.my_key
UNION ALL
SELECT   'Integer To String Join'       AS join_type
       , COUNT(1)                       AS num_rows
FROM     integer_key t1
JOIN     group_setup t2
  ON     t1.my_key = t2.my_key
;


This query returns the following:

join_type                  num_rows
Integer To String Join     2
String To String Join      1

I feel it’s unfavorable because the GROUP BY does not TRIM the trailing space 
from the string (so '1234' and '1234  ' stay separate keys), but when we join 
against an integer it effectively does trim the space, so both rows match.  
This query should not produce multiple rows.
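
Presumably this is Hive's implicit type promotion at work (an assumption on my 
part, not verified against the source): when a STRING is compared to an INT, 
both sides appear to be promoted to DOUBLE, and the numeric parse ignores the 
trailing spaces, so '1234  ' and 1234 compare equal.  A quick sanity check of 
that theory:

SELECT   CAST('1234  ' AS DOUBLE)                        AS padded_as_double    -- expect 1234.0 if the spaces are ignored
       , CAST('1234  ' AS DOUBLE) = CAST(1234 AS DOUBLE) AS keys_compare_equal  -- expect true under that promotion
;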

In my opinion, there should be another check when comparing a string to an int 
to make sure the lengths of the string and the integer are the same (so in this 
example the row with key ‘1234  ‘ would be filtered out).  Furthermore, the 
original query that produces these dupes has MANY more joins.  I could fix it by 
CASTing/TRIMming, but that would require me to know all the data types for the 
columns in the tables involved in the join (and casting a string to an int may 
lose some rows, and you can’t TRIM an INT).
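
For example, a sketch of the CAST-based fix I have in mind for the join above 
(it assumes I already know which side holds the string, which is exactly the 
problem with the bigger query):

SELECT   'Integer To String Join (explicit cast)' AS join_type
       , COUNT(1)                                 AS num_rows
FROM     integer_key t1
JOIN     group_setup t2
  ON     CAST(t1.my_key AS STRING) = t2.my_key    -- compared as strings, '1234  ' no longer matches
;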

Thoughts?



Ryan Kleck
Data Engineer IV
Advanced Analytics
480-505-8800 xt. 4024



Re: Question on accessing LLAP as data cache from external containers

2018-02-02 Thread Gopal Vijayaraghavan

> For example, a Hive job may start Tez containers, which then retrieve data 
> from LLAP running concurrently. In the current implementation, this is 
> unrealistic

That is how LLAP was built - to push work from Tez to LLAP vertex by vertex, 
instead of an all-or-nothing implementation.

Here are the slides describing how that is plugged in LLAP from Hadoop Summit 
2015.

https://www.slideshare.net/Hadoop_Summit/llap-longlived-execution-in-hive/21

The flag in question is hive.llap.execution.mode - the most common use-case 
imagined for it was something like mode=map, where only the table scan and the 
secure operators (i.e. no temporary UDFs) are run inside LLAP (to take advantage 
of the cache).
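
As an illustration, here's a hedged sketch of how that mode might be exercised 
in a session (the property name and its documented values - auto, none, map, 
all, only - are real; the table and query are placeholders, and whether your 
build lets you flip the mode per session is an assumption):

SET hive.llap.execution.mode=map;
-- With mode=map, the table scan (and other cache-friendly operators) would run
-- inside the LLAP daemons, while the rest of the DAG runs in plain Tez containers.
SELECT   category, COUNT(*) AS cnt    -- hypothetical table/columns
FROM     sales
GROUP BY category;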

LLAP can shuffle data to a Tez container, but it cannot shuffle data from a Tez 
container back into the daemon (& that's not very useful, since it won't be 
cached).

Here's the class that decides the hybrid execution tree and plans the split 
between LLAP and Tez in the same query DAG.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/LlapDecider.java#L81

If you want to consume the LLAP cached rows from something like GPUs running 
Caffe, you can access the LLAP cache via the SparkSQL data-source APIs.

https://github.com/hortonworks/spark-llap-release/blob/HDP-2.6.3.0-235-tag/examples/src/main/python/spark_llap_dsl.py

This is faster than directly reading off cloud filesystems (because of LLAP's 
SSD cache).  Even with a perf penalty on-prem, it is very useful for restricting 
the Spark ML side[1] to certain columns (i.e. you can expose lat/long from a 
table which has other PII data) without having to make a complete copy of the 
projected data to share from the EDW end of the shop to the ML side, even if the 
entire data-set is HDFS encrypted.

Cheers,
Gopal
[1] - https://hortonworks.com/blog/row-column-level-control-apache-spark/




Loading data from one Swift container into an external table pointing to another container fails

2018-02-02 Thread anup ahire
Hello,


Loading data from one Swift container into an external table pointing to
another container is failing for me.


hive> load data inpath 'swift://mycontainer.default/source_test' into table
my_test;
Loading data to table default.my_test
Failed with exception copyFiles: error while moving files!!! Cannot move
swift://mycontainer.default/source_test/AK.TXT to swift://test
.default/my_test/AK.TXT
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask
hive>



Here is the complete stack trace:

2018-02-02 21:27:28,309 ERROR [main]: exec.Task
(SessionState.java:printError(962)) - Failed with exception copyFiles:
error while moving files!!! Cannot move
swift://mycontainer.default/source_test/AK.TXT
to swift://test.default/my_test/AK.TXT
org.apache.hadoop.hive.ql.metadata.HiveException: copyFiles: error while
moving files!!! Cannot move swift://mycontainer.default/source_test/AK.TXT
to swift://test.default/my_test/AK.TXT
   at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2784)
   at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1664)
   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:298)
   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
   at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1116)
   at
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:216)
   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:168)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:379)
   at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:739)
   at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
   at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:624)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException:Cannot move
swift://mycontainer.default/source_test/AK.TXT
to swift://test.default/my_test/AK.TXT
   at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2777)
   ... 21 more

hive version = 1.2.1
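
For what it's worth, a hedged sketch of one possible workaround (assumptions: 
my_test's schema is known - a single placeholder column is shown here - and the 
source files are readable in place): expose the source container through an 
external table and copy the rows with INSERT ... SELECT, so no cross-container 
file move is attempted.

-- Placeholder schema (col1) must be replaced with my_test's real columns;
-- ROW FORMAT / SERDE are omitted for brevity and should match the source files.
CREATE EXTERNAL TABLE my_test_staging (col1 STRING)
LOCATION 'swift://mycontainer.default/source_test';

INSERT INTO TABLE my_test
SELECT * FROM my_test_staging;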

Any help is appreciated.

Thanks


Re: Hiveserver2 - Beeline How to enable progress bar (Tez)

2018-02-02 Thread Margusja
Does it depend on what kind of engine you are using - Tez versus MR?

Br margus

> On 1 Feb 2018, at 17:57, Shankar Mane  wrote:
> 
> Any update? 
> 
> I am not getting the progress bar with the Beeline client. I tried both 
> embedded and remote connections.
> 
> It's working properly with Hive CLI.
> 
> 
> On 30-Jan-2018 6:01 PM, "Shankar Mane" wrote:
> May I know how to enable the Tez progress bar in Beeline? I am using Tez as 
> the execution engine.
> 
> I have followed some steps but they did not help. The steps are:
> set hive.async.log.enable=false
> set hive.server2.in.place.progress=true
> 
> 
> Env Details:
> RHEL 
> hive 2.3.2
> hadoop 2.9.0
> Tez 0.9.1
> Apache Distribution
> 
> Also, while connecting to the HiveServer, I am getting the error below:
> 
> 
> 2018-01-30T16:08:46,204  INFO [main] http.HttpServer: Started 
> HttpServer[hiveserver2] on port 10002
> Exception in thread 
> "org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@2489e84a" 
> java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.elapsed(Ljava/util/concurrent/TimeUnit;)J
> at 
> org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor.run(JvmPauseMonitor.java:185)
> at java.lang.Thread.run(Thread.java:748)
> 
> 
> 
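
For reference, a hedged sketch of the settings usually tied to the in-place 
progress display (the property names below are as documented for Hive 2.x; 
whether they fix this particular setup is an assumption, and some may need to 
go into hive-site.xml before HiveServer2 starts rather than being set per 
session):

SET hive.async.log.enabled=false;          -- note the trailing "d"; the name quoted above omits it
SET hive.server2.in.place.progress=true;   -- HiveServer2 / Beeline side
SET hive.tez.exec.inplace.progress=true;   -- Hive CLI side, shown for comparison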



Re: Hiveserver2 - Beeline How to enable progress bar (Tez)

2018-02-02 Thread Shankar Mane
I have set Tez as the default execution engine.

On 02-Feb-2018 2:59 PM, "Margusja"  wrote:

Does it depend on what kind of engine you are using - Tez versus MR?

Br margus

