Can you check the ulimit for your user? That might be what is causing this.
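A quick way to check is below (a sketch, run as root; "zslf023" is inferred from the /tmp/hive-zslf023 path in your logs, so substitute the account that actually runs the jobs):

  # All limits for a fresh login shell of that user; nofile (max open
  # files) is the usual culprit when a job touches thousands of files.
  su - zslf023 -c 'ulimit -a'
  su - zslf023 -c 'ulimit -n'

  # Limits of an already-running JVM (the pgrep lookup is illustrative):
  cat /proc/"$(pgrep -u zslf023 -f java | head -n 1)"/limits

  # To raise the open-file limit persistently, add lines like these to
  # /etc/security/limits.conf and start a new session:
  #   zslf023  soft  nofile  32768
  #   zslf023  hard  nofile  65536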
On Aug 2, 2014 8:54 PM, "Ana Gillan" <[email protected]> wrote:

> Hi everyone,
>
> I am having an issue with MapReduce jobs running through Hive being killed
> after 600s timeouts, and with very simple jobs taking over 3 hours (or just
> failing) for a set of files with a compressed size of only 1-2 GB. I will
> try and provide as much information as I can here, so if someone can help,
> that would be really great.
>
> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>
> • Master node:
>   – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>   – 64GB DDR3 SDRAM
>   – 8 x 2TB SAS 600 hard drives (arranged as RAID 1 and RAID 5)
>
> • Slave nodes (each):
>   – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>   – 32GB DDR3 SDRAM
>   – 4 x 2TB SATA-3 hard drives
>
> • Operating system on all nodes: openSUSE Linux 13.1
>
> We have the Apache BigTop package version 0.7, with Hadoop version
> 2.0.6-alpha and Hive version 0.11. YARN has been configured as per these
> recommendations:
> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>
> I also set the following additional settings before running jobs:
>
> set yarn.nodemanager.resource.cpu-vcores=4;
> set mapred.tasktracker.map.tasks.maximum=4;
> set hive.hadoop.supports.splittable.combineinputformat=true;
> set hive.merge.mapredfiles=true;
>
> No one else uses this cluster while I am working.
>
> What I’m trying to do:
> I have a bunch of XML files on HDFS, which I am reading into Hive using
> this SerDe: https://github.com/dvasilen/Hive-XML-SerDe. I then want to
> create a series of tables from these files and finally run a Python script
> on one of them to perform some scientific calculations. The files are in
> .xml.gz format and (uncompressed) are only about 4 MB in size each.
> hive.input.format is set to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the
> “small files problem.”
>
> Problems:
> My HQL statements work perfectly for up to 1000 of these files. Even for
> much larger numbers, doing select * works fine, which means the files are
> being read properly, but if I do something as simple as selecting just one
> column from the whole table for a larger number of files, containers start
> being killed and jobs fail with this error in the container logs:
>
> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
> Failed to close file
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0:
> File does not exist. Holder
> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
> any open files.
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
>
> Killed jobs show the above and also the following message:
>
> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600 secs
> Container killed by the ApplicationMaster.
>
> Also, in the node logs, I get a lot of pings like this:
>
> INFO [IPC Server handler 17 on 40961]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
> attempt_1403771939632_0362_m_000002_0
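The "Timed out after 600 secs" message above is mapreduce.task.timeout kicking in: the ApplicationMaster kills any task attempt that does not report progress within that window (600000 ms by default). Raising it for one session is a cheap way to distinguish a genuinely hung task from one that is merely slow. A minimal sketch, assuming the Hive CLI is on the PATH; the table and column names are placeholders, not your actual schema:

  # Re-run the failing single-column query with a 30-minute task timeout.
  # "one_column" and "xml_table" are placeholders for your own names.
  hive -e "
    set mapreduce.task.timeout=1800000;
    select one_column from xml_table;
  "

If the job then completes, the tasks are slow rather than stuck, and the question becomes why (I/O, container placement) instead of what is killing them.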
> For 5000 files (1 GB compressed), the selection of a single column
> finishes, but takes over 3 hours. For 10,000 files, the job hangs at about
> 4% map and then errors out.
>
> While the jobs are running, I notice that the containers are not evenly
> distributed across the cluster. Some nodes lie idle, while the application
> master node runs 7 containers, maxing out the 28 GB of RAM allocated to
> Hadoop on each slave node.
>
> This is the output of netstat -i while the column selection is running:
>
> Kernel Interface table
> Iface   MTU  Met    RX-OK RX-ERR  RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0   1500    0 79515196      0 2265807      0 45694758      0      0      0 BMRU
> eth1   1500    0 77410508      0       0      0 40815746      0      0      0 BMRU
> lo    65536    0 16593808      0       0      0 16593808      0      0      0 LRU
>
> Are there some settings I am missing that mean the cluster isn’t
> processing this data as efficiently as it can?
>
> I am very new to Hadoop and there are so many logs, etc., that
> troubleshooting can be a bit overwhelming. Where else should I be looking
> to try and diagnose what is wrong?
>
> Thanks in advance for any help you can give!
>
> Kind regards,
> Ana
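One more place to look, since you ask where else to check: you can watch container placement directly while a query runs, which would make the skew toward the application-master node visible from the command line. A sketch, assuming the stock yarn and mapred client scripts from the BigTop 0.7 packages are on the PATH:

  # NodeManagers with their state and number of running containers:
  yarn node -list

  # Applications currently running, with queue and progress:
  yarn application -list

  # Running MapReduce jobs:
  mapred job -list

If every container keeps landing on one node, that points at scheduler configuration or data locality rather than at the data itself.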
