Re: Hive Not Returning YARN Application Results Correctly Nor Inserting Into Local Tables

2019-11-06 Thread Sungwoo Park
For the problem of the result not being returned to the console, I think it
occurs because the default file system is set to the local file system rather
than HDFS. Perhaps hive.exec.scratchdir is already set to /tmp/hive, but if the
default file system is local, FileSinkOperator writes the final result to the
local file system of the container where it is running. HiveServer2 then tries
to read from a subdirectory under /tmp/hive on its own local file system, and
so returns an empty result. (The plain 'select * from ...' query works because
it is handled by HiveServer2 itself.)
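
A quick way to check this is to ask the Hadoop client what the default file
system resolves to (this assumes a standard installation with the hdfs command
on the path of the HiveServer2 host and the worker nodes):

[root #] hdfs getconf -confKey fs.defaultFS

If it prints file:/// (the built-in Hadoop default when fs.defaultFS is not set
in core-site.xml), that would line up with the 'New Final Path: FS file:/tmp/...'
line you see from FileSinkOperator, i.e. the final result file is landing on the
container's local disk.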

I can think of two solutions: 1) set the default file system to HDFS (e.g., by
setting fs.defaultFS in core-site.xml); 2) embed the file system directly in
hive.exec.scratchdir (e.g., by setting it to a fully qualified URI such as
hdfs://<namenode-host>:<port>/tmp/hive).
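
Something like the following should work; the host and port are placeholders
for your actual NameNode address (8020 is just the common default RPC port).
For option 1, in core-site.xml:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://your-namenode-host:8020</value>
</property>

and for option 2, in hive-site.xml:

<property>
  <name>hive.exec.scratchdir</name>
  <value>hdfs://your-namenode-host:8020/tmp/hive</value>
</property>

Either change should make HiveServer2 and the Tez containers agree on where the
scratch directory lives.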

--- gla


Re: INSERT OVERWRITE Failure Safety

2019-11-06 Thread David M
Thanks, Shawn. That was my suspicion but I'm glad to hear it from someone else. 
Do you have any links or documentation I can give to a manager to help them 
feel more confident in this?

Also, I misspoke, I'm running Hive 1.2.1.



From: Shawn Weeks 
Sent: Wednesday, November 6, 2019 5:35:17 PM
To: user@hive.apache.org 
Subject: Re: INSERT OVERWRITE Failure Safety


I'm not sure specifically about Hive 1.3, but in other versions the data is written to
a temp location, and at the end of the query the previous data is deleted and
the new data is renamed/moved. Something to watch out for: if the query
returns no rows, the old data isn't removed.

Thanks

Shawn




Re: INSERT OVERWRITE Failure Safety

2019-11-06 Thread Shawn Weeks
I'm not sure specifically about Hive 1.3, but in other versions the data is written to
a temp location, and at the end of the query the previous data is deleted and
the new data is renamed/moved. Something to watch out for: if the query
returns no rows, the old data isn't removed.
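
If you want to convince yourself on a throwaway table, one way is to force a
runtime failure in the middle of an INSERT OVERWRITE and check that the old
rows survive. A rough sketch (the table names are made up; assert_true() is a
built-in Hive UDF that throws when its argument is false):

hive> create table overwrite_src (val int);
hive> create table overwrite_tgt (val int);
hive> insert into table overwrite_src values (1), (2), (3);
hive> insert overwrite table overwrite_tgt select val from overwrite_src;
hive> insert overwrite table overwrite_tgt select val from overwrite_src where 
assert_true(val < 0) is null;
hive> select * from overwrite_tgt;

The second overwrite fails at runtime because assert_true(val < 0) throws for
every row, and the final select should still return the three rows written by
the first overwrite. If memory serves, the temp location in 1.x is governed by
hive.exec.stagingdir (default .hive-staging, created under the target table's
directory), so during a long-running INSERT OVERWRITE you can also watch the
staged files appear there and see that the old files are only replaced at the
end.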

Thanks
Shawn




INSERT OVERWRITE Failure Safety

2019-11-06 Thread David M
All,

I have a Hive 1.3 cluster running in production, and there was a question about 
INSERT OVERWRITE queries on tables. If I perform an INSERT OVERWRITE query on a 
table and the query fails halfway through, will the old data still exist in 
the table? I'm not completely clear on the exact process that INSERT OVERWRITE 
follows, but I believe it puts the data into the staging folder and then does 
a remove and move, which should be safe. However, it could also just wipe the 
folder before the query starts, which would cause issues if the query itself 
failed. Can someone give me a definitive answer on this? Pointers to the source 
code or documentation that explains this would be even better.

Thanks!

David McGinnis



Hive Not Returning YARN Application Results Correctly Nor Inserting Into Local Tables

2019-11-06 Thread Aaron Grubb
Hello all,

I'm running a from-scratch cluster on AWS EC2. I have an external table 
(partitioned) defined with data on S3. I'm able to query this table and receive 
results to the console with a simple select * statement:


hive> set hive.execution.engine=tez;
hive> select * from external_table where partition_1='1' and partition_2='2';
[correct results returned]


Running a query that requires Tez doesn't return the results to the console:


hive> set hive.execution.engine=tez;
hive> select count(*) from external_table where partition_1='1' and 
partition_2='2';
Status: Running (Executing on YARN cluster with App id 
application_1572972524483_0012)

OK
+--+
| _c0 |
+--+
+--+
No rows selected (8.902 seconds)


However, if I dig in the logs and on the filesystem, I can find the results 
from that query:


(yarn.resourcemanager.log) 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root 
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS 
APPID=application_1572972524483_0022 
CONTAINERID=container_1572972524483_0022_01_02 RESOURCE=<memory:..., vCores:1> QUEUENAME=default
(container_folder/syslog_attempt) [TezChild] |exec.FileSinkOperator|: New Final 
Path: FS file:/tmp/[REALLY LONG FILE PATH]/00_0
[root #] cat /tmp/[REALLY LONG FILE PATH]/00_0
SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Textl▒ꩇ1som}▒▒j¹▒ 
2060


2060 is the correct count for the partition.
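
(Side note: the SEQ / BytesWritable / Text prefix in that cat output is just a
Hadoop SequenceFile header, which is why the raw bytes look garbled in the
terminal. Something like the following should print the value cleanly, assuming
it is run on the node that holds the file:)

[root #] hdfs dfs -text file:///tmp/[REALLY LONG FILE PATH]/00_0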

Now, oddly enough, I'm able to get the results from the application if I insert 
overwrite directory on HDFS:


hive> set hive.execution.engine=tez;
hive> INSERT OVERWRITE DIRECTORY '/tmp/local_out' select count(*) from 
external_table where partition_1='1' and partition_2='2';
[root #] hdfs dfs -cat /tmp/local_out/00_0
2060


However, attempting to insert overwrite local directory fails:


hive> set hive.execution.engine=tez;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' select count(*) from 
external_table where partition_1='1' and partition_2='2';
[root #] cat /tmp/local_out/00_0
cat: /tmp/local_out/00_0: No such file or directory


If I cat the container result file for this query, it's only the number, no 
class name or special characters:


[root #] cat /tmp/[REALLY LONG FILE PATH]/00_0
2060


The only out-of-place log message I can find comes from the YARN 
ResourceManager log:


(yarn.resourcemanager.log) INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root 
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS 
APPID=application_1572972524483_0023 
CONTAINERID=container_1572972524483_0023_01_04 RESOURCE= QUEUENAME=default
(yarn.resourcemanager.log) WARN 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root IP=NMIP 
OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE 
DESCRIPTION=Trying to release container not owned by app or with invalid id. 
PERMISSIONS=Unauthorized access or invalid container 
APPID=application_1572972524483_0023 
CONTAINERID=container_1572972524483_0023_01_04


I've also tried creating a table and inserting data into it. The table creates 
just fine, but when I try to insert data, it throws an error:


hive> set hive.execution.engine=tez;
hive> insert into test_table (test_col) values ('blah'), ('blahblah');
Query ID = root_20191106172949_5301b127-7219-46d1-8fd2-dc80ca7e96ee
Total jobs = 1
Launching Job 1 out of 1
Status: Failed
Vertex failed, ve