Honestly, 0.12 is a no-go - you are missing a lot of performance improvements, and your query would probably execute in less than a minute on a current release. If your Hadoop vendor does not support a smooth upgrade, change the vendor. Hive 1.2.1 is the absolute minimum, together with ORC or Parquet as the table format and Tez (preferred) or Spark as the execution engine.
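As a rough, minimal sketch of what that setup could look like on Hive 1.2.1 (the table name a_orc and the columns are placeholders, not taken from your job):

set hive.execution.engine=tez;
set hive.auto.convert.join=true;

create table a_orc (
  col1 string,   -- placeholder columns
  col2 bigint
)
partitioned by (dt string, source string)
stored as orc
tblproperties ('orc.compress'='SNAPPY');

With ORC or Parquet and a recent optimizer, eligible joins are converted to map joins automatically, so the /*+ MAPJOIN */ hint is usually no longer needed.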
To your questions: it looks like the logger is configured incorrectly (the SLF4J binding is missing), which is why you may be missing messages. What is the exact join query? Older Hive versions needed a special syntax if you wanted to benefit from partition pruning. Which Hadoop version are you using? In the meantime, a sketch of settings you could try as a workaround is below the quoted message.

> On 11 Mar 2016, at 15:43, Yong Zhang <java8...@hotmail.com> wrote:
>
> Hi, Hive users:
>
> Currently our Hadoop vendor comes with Hive 0.12. I know it is a kind of old
> version, but the upgrade still has a long path to go.
>
> Right now, we are facing an issue in Hive 0.12.
>
> We have one ETL kind of step implemented in Hive, and due to the data volume
> in this step, we know that MAPJOIN is the right way to go, as one side of
> the data is very small, but the other side is much larger.
>
> So below is the query example:
>
> set hive.exec.compress.output=true;
> set parquet.compression=snappy;
> set mapred.reduce.tasks=1;
> set mapred.reduce.child.java.opts=-Xms1560m -Xmx4096m;
> set mapred.task.timeout=7200000;
> set mapred.map.tasks.speculative.execution=false;
> set hive.ignore.mapjoin.hint=false;
> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
>
> insert overwrite table a(dt='${hiveconf:run_date}', source='ip')
> select
> /*+ MAPJOIN(trial_event) */
> xxxx
>
> The above query normally finishes in around 10 minutes daily, which we are
> very happy about. But sometimes the query hangs for hours in the ETL, until
> we manually kill it.
>
> I added debug logging in Hive, and found the following messages:
>
> 2016-03-11 09:11:52 Starting to launch local task to process map join;
> maximum memory = 536870912
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> 16/03/11 09:11:55 DEBUG ipc.Client: IPC Client (-1284813870) connection to
> namenode/10.20.95.130:9000 from etl: closed
> 16/03/11 09:11:55 DEBUG ipc.Client: IPC Client (-1284813870) connection to
> namenode/10.20.95.130:9000 from etl: stopped, remaining connections 0
>
> Then there is no more log output after that for hours.
>
> If we don't use MAPJOIN, we don't face this issue, but the query takes
> 2.5 hours.
>
> When this happens, I can see the NameNode works fine; I can run all kinds of
> HDFS operations without any issue while this query is hanging. What does
> this "IPC Client remaining connections 0" mean? If we cannot upgrade our Hive
> version for now, is there any workaround?
>
> Thanks
>
> Yong
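As a stopgap on 0.12 (untested here; the property names are stock Apache Hive, so check what your distribution actually ships): the hang starts right after "Starting to launch local task to process map join; maximum memory = 536870912", i.e. the local task that builds the in-memory hash table only has about 512 MB of heap, and because SLF4J falls back to the NOP logger you never see its progress or its failure. You could let Hive decide the join strategy and make the local task fail fast instead of hanging:

set hive.auto.convert.join=true;
-- the local task aborts itself once its heap usage crosses this fraction;
-- lower it if the task hangs instead of failing
set hive.mapjoin.localtask.max.memory.usage=0.8;
-- only attempt a map join when the small table is below this size (bytes)
set hive.mapjoin.smalltable.filesize=25000000;

You could also start the Hive CLI with a larger client-side heap (for example HADOOP_HEAPSIZE=2048 in the environment; if I remember correctly the local task heap is derived from the client heap or hive.mapred.local.mem, but verify this for your distribution). And put an SLF4J binding such as slf4j-log4j12 on the classpath so the local task's hash-table progress messages are actually written - that should tell you whether it is stuck loading the small table or stuck talking to the NameNode.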