RE: Hive 0.12 MAPJOIN hangs sometimes

Yong Zhang Fri, 11 Mar 2016 07:34:13 -0800

I understand the Hive version problem.
We are using IBM BigInsights V3.0.0.2, which comes with Hadoop 2.2.0 and Hive 
0.12. It is extremely difficult to upgrade to BigInsights V4.x, as IBM made V4 
totally different from V3. We are looking into upgrade options, but it won't 
be quick.
The query and log are very big, so I have attached them as a file.
Thanks 
Yong

From: [email protected]
Subject: Re: Hive 0.12 MAPJOIN hangs sometimes
Date: Fri, 11 Mar 2016 15:55:42 +0100
To: [email protected]

Honestly, 0.12 is a no-go: you miss a lot of performance improvements. Probably 
your query would execute in less than a minute. If your Hadoop vendor does not 
support smooth upgrades, then change vendors. Hive 1.2.1 is the absolute minimum, 
together with ORC or Parquet as the table format and Tez (preferred) or Spark 
as the execution engine.
To your questions: it seems that the logger is configured wrongly, which is why 
you may be missing some messages.
What is the exact join query? Older Hive versions needed a special syntax if 
you wanted to benefit from partition pruning.
Which Hadoop version are you using?

On 11 Mar 2016, at 15:43, Yong Zhang <[email protected]> wrote:

Hi, Hive users:
Currently our Hadoop vendor ships Hive 0.12. I know it is a rather old 
version, but upgrading is still a long way off.
Right now, we are facing an issue in Hive 0.12.
We have an ETL step implemented in Hive, and due to the data volume 
in this step, we know that MAPJOIN is the right way to go, as one side of the 
data is very small, but the other side is much larger.
Below is the query example:
set hive.exec.compress.output=true;
set parquet.compression=snappy;
set mapred.reduce.tasks=1;
set mapred.reduce.child.java.opts=-Xms1560m -Xmx4096m;
set mapred.task.timeout=7200000;
set mapred.map.tasks.speculative.execution=false;
set hive.ignore.mapjoin.hint=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
insert overwrite table a(dt='${hiveconf:run_date}', source='ip')
select /*+ MAPJOIN(trial_event) */ xxxx
The above query usually finishes in around 10 minutes each day, which we are 
very happy with. But sometimes the query hangs for hours in the ETL, until 
we manually kill it.
I enabled debug logging in Hive and found the following messages:
2016-03-11 09:11:52 Starting to launch local task to process map join;  maximum memory = 536870912
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
16/03/11 09:11:55 DEBUG ipc.Client: IPC Client (-1284813870) connection to namenode/10.20.95.130:9000 from etl: closed
16/03/11 09:11:55 DEBUG ipc.Client: IPC Client (-1284813870) connection to namenode/10.20.95.130:9000 from etl: stopped, remaining connections 0
After that, there is no more log output for hours.
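
The "maximum memory = 536870912" value in the first log line is a byte count. A minimal Python sketch of the arithmetic follows. It assumes the local map-join task is limited by hive.mapjoin.localtask.max.memory.usage with a default of 0.90; that default should be checked against the installed Hive 0.12 configuration, and the function name is hypothetical. Whether this limit is related to the hang is not established by the log.

```python
def local_task_budget(max_memory_bytes, usage_fraction=0.90):
    """Convert the logged byte count to MiB and apply a usage fraction.

    usage_fraction is an assumed value for
    hive.mapjoin.localtask.max.memory.usage; verify it on the cluster.
    """
    max_mib = max_memory_bytes / 2**20
    return max_mib, max_mib * usage_fraction


# Value taken from the log line above.
max_mib, budget_mib = local_task_budget(536870912)
print(f"local task heap: {max_mib:.0f} MiB, assumed usable: {budget_mib:.1f} MiB")
```

Under that assumption, the small side of the join has to fit in well under 512 MiB once loaded into the in-memory hash table.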
If we don't use MAPJOIN, we don't hit this issue, but the query takes 2.5 
hours.
When this happens, I can see that the NameNode works fine: I can run all kinds 
of HDFS operations without any issue while this query is hanging. What does 
"IPC Client remaining connections 0" mean? If we cannot upgrade our Hive 
version for now, what workarounds do we have?
Thanks
Yong
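
One possible stopgap for the manual kill described above is to run the Hive CLI under a watchdog with a time limit. The sketch below is only an illustration under stated assumptions: it assumes a POSIX host, the function name and the script name etl_step.hql are hypothetical, and the 30-minute limit is an arbitrary margin over the usual 10-minute runtime. It kills the client process group (including a local map-join child JVM, if one was started), but it does not cancel MapReduce jobs already submitted to the cluster, and it does not address the cause of the hang.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd in its own process group; kill the whole group on timeout.

    Returns (finished, returncode); returncode is None after a timeout.
    """
    # start_new_session=True makes the child a group leader, so its pid
    # is also the process group id used by os.killpg below.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return True, proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the killed process
        return False, None


# Hypothetical usage for the ETL step (script name is an assumption):
# finished, rc = run_with_timeout(
#     ["hive", "-hiveconf", "run_date=2016-03-11", "-f", "etl_step.hql"],
#     timeout_s=30 * 60,
# )
# if not finished:
#     ...  # retry once, or fall back to the query without the MAPJOIN hint
```

Running the command in its own process group matters here because killing only the parent CLI process could leave a child JVM behind.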

hive_12.log.gz
Description: GNU Zip compressed data
