Hi Gla,

Thanks for your suggestions. The problem was indeed that the results file was
being written to the local filesystem of the machine running the container
instead of to HDFS. Neither of the settings you suggested had any effect, but I
dug through the source code and found that mapreduce.framework.name=local (the
default in Hadoop 3.2.1) caused the container to use the local filesystem for
everything. Setting mapreduce.framework.name=yarn solved the problem.
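
For anyone who hits this later, the corresponding mapred-site.xml entry looks
like this:

--------------------------------------------------------------------------------------------------------
<!-- mapred-site.xml: submit jobs to YARN instead of the local job runner,
     which was making the container use the local filesystem -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
--------------------------------------------------------------------------------------------------------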

Thanks,
Aaron

From: Sungwoo Park <glap...@gmail.com>
Sent: Wednesday, November 6, 2019 8:59 PM
To: user@hive.apache.org
Subject: Re: Hive Not Returning YARN Application Results Correctly Nor 
Inserting Into Local Tables

For the problem of not returning the result to the console, I think it occurs
because the default file system is set to the local file system, not to HDFS.
Perhaps hive.exec.scratchdir is already set to /tmp/hive, but if the default 
file system is local, FileSinkOperator writes the final result to the local 
file system of the container where it is running. Then HiveServer2 tries to 
read from a subdirectory under /tmp/hive of its own local file system, thus 
returning an empty result. (The query 'select * from ...' works okay because it 
is taken care of by HiveServer2 itself.)
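
(A quick way to check what the default file system currently resolves to,
assuming the hdfs client is on the path:)

--------------------------------------------------------------------------------------------------------
# prints the effective default file system; an unconfigured install
# typically reports file:/// rather than an hdfs:// URI
$ hdfs getconf -confKey fs.defaultFS
--------------------------------------------------------------------------------------------------------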

I can think of two solutions: 1) set the default file system to HDFS (e.g., by
updating core-site.xml); 2) embed the file system directly into
hive.exec.scratchdir (e.g., by setting it to
hdfs://<namenode-host>:<port>/tmp/hive).
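
For 1), the core-site.xml entry would look something like this (the namenode
host and port below are placeholders for your setup):

--------------------------------------------------------------------------------------------------------
<!-- core-site.xml: make HDFS the default file system so that unqualified
     paths such as /tmp/hive resolve to HDFS -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:8020</value>  <!-- placeholder host:port -->
</property>
--------------------------------------------------------------------------------------------------------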

--- gla

On Thu, Nov 7, 2019 at 3:12 AM Aaron Grubb <aaron.gr...@clearpier.com> wrote:
Hello all,

I'm running a from-scratch cluster on AWS EC2, with a partitioned external
table defined over data in S3. I'm able to query this table and receive
results on the console with a simple select * statement:

--------------------------------------------------------------------------------------------------------
hive> set hive.execution.engine=tez;
hive> select * from external_table where partition_1='1' and partition_2='2';
[correct results returned]
--------------------------------------------------------------------------------------------------------

Running a query that actually launches a YARN application, however, doesn't
return the results to the console:

--------------------------------------------------------------------------------------------------------
hive> set hive.execution.engine=tez;
hive> select count(*) from external_table where partition_1='1' and 
partition_2='2';
Status: Running (Executing on YARN cluster with App id 
application_1572972524483_0012)

OK
+------+
| _c0 |
+------+
+------+
No rows selected (8.902 seconds)
--------------------------------------------------------------------------------------------------------

However, if I dig through the logs and the filesystem, I can find the results
of that query:

--------------------------------------------------------------------------------------------------------
(yarn.resourcemanager.log) 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root 
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS 
APPID=application_1572972524483_0022 
CONTAINERID=container_1572972524483_0022_01_000002 RESOURCE=<memory:1024, 
vCores:1> QUEUENAME=default
(container_folder/syslog_attempt) [TezChild] |exec.FileSinkOperator|: New Final 
Path: FS file:/tmp/[REALLY LONG FILE PATH]/000000_0
[root #] cat /tmp/[REALLY LONG FILE PATH]/000000_0
SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Textl▒ꩇ1som}▒▒j¹▒ 
2060
--------------------------------------------------------------------------------------------------------

2060 is the correct count for the partition.

Now, oddly enough, I'm able to get the results from the application if I use
INSERT OVERWRITE DIRECTORY to write to HDFS:

--------------------------------------------------------------------------------------------------------
hive> set hive.execution.engine=tez;
hive> INSERT OVERWRITE DIRECTORY '/tmp/local_out' select count(*) from 
external_table where partition_1='1' and partition_2='2';
[root #] hdfs dfs -cat /tmp/local_out/000000_0
2060
--------------------------------------------------------------------------------------------------------

However, attempting to insert overwrite local directory fails:

--------------------------------------------------------------------------------------------------------
hive> set hive.execution.engine=tez;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' select count(*) from 
external_table where partition_1='1' and partition_2='2';
[root #] cat /tmp/local_out/000000_0
cat: /tmp/local_out/000000_0: No such file or directory
--------------------------------------------------------------------------------------------------------

If I cat the container result file for this query, it contains only the
number, with none of the class names or special characters:

--------------------------------------------------------------------------------------------------------
[root #] cat /tmp/[REALLY LONG FILE PATH]/000000_0
2060
--------------------------------------------------------------------------------------------------------

The only out-of-place log message I can find comes from the YARN 
ResourceManager log:

--------------------------------------------------------------------------------------------------------
(yarn.resourcemanager.log) INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root 
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS 
APPID=application_1572972524483_0023 
CONTAINERID=container_1572972524483_0023_01_000004 RESOURCE=<memory:1024, 
vCores:1> QUEUENAME=default
(yarn.resourcemanager.log) WARN 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root IP=NMIP 
OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE 
DESCRIPTION=Trying to release container not owned by app or with invalid id. 
PERMISSIONS=Unauthorized access or invalid container 
APPID=application_1572972524483_0023 
CONTAINERID=container_1572972524483_0023_01_000004
--------------------------------------------------------------------------------------------------------

I've also tried creating a table and inserting data into it. The table creates
just fine, but inserting data throws an error:

--------------------------------------------------------------------------------------------------------
hive> set hive.execution.engine=tez;
hive> insert into test_table (test_col) values ('blah'), ('blahblah');
Query ID = root_20191106172949_5301b127-7219-46d1-8fd2-dc80ca7e96ee
Total jobs = 1
Launching Job 1 out of 1
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1573060958692_0001_1_00, 
diagnostics=[Vertex vertex_1573060958692_0001_1_00 [Map 1] killed/failed due 
to:ROOT_INPUT_INIT_FAILURE, Vertex Input: _dummy_table initializer failed, 
vertex=vertex_1573060958692_0001_1_00 [Map 1], 
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist: 
file:/tmp/root/a9b76683-8e19-446a-be74-7a5daedf70e5/hive_2019-11-06_17-29-49_820_224977921325223208-2/dummy_path
        at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:332)
        at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
        at 
org.apache.hadoop.hive.shims.Hadoop23Shims$1.listStatus(Hadoop23Shims.java:134)
        at 
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217)
        at 
org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:76)
        at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:321)
        at 
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:444)
        at 
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:564)
        at 
org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateOldSplits(MRInputHelpers.java:488)
        at 
org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(MRInputHelpers.java:337)
        at 
org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:122)
        at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
        at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
        at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
        at 
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
--------------------------------------------------------------------------------------------------------

My versions are as follows:

Hadoop 3.2.1
Hive 3.1.2
Tez 0.9.2

Any help is much appreciated!
