Can you check the ulimit for your user? That might be what is causing this.
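A quick way to check is below (a sketch, run as root; "zslf023" is inferred from the /tmp/hive-zslf023 path in your logs, so substitute the account that actually runs the jobs):

  # All limits for a fresh login shell of that user; nofile (max open
  # files) is the usual culprit when a job touches thousands of files.
  su - zslf023 -c 'ulimit -a'
  su - zslf023 -c 'ulimit -n'

  # Limits of an already-running JVM (the pgrep lookup is illustrative):
  cat /proc/"$(pgrep -u zslf023 -f java | head -n 1)"/limits

  # To raise the open-file limit persistently, add lines like these to
  # /etc/security/limits.conf and start a new session:
  #   zslf023  soft  nofile  32768
  #   zslf023  hard  nofile  65536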
On Aug 2, 2014 8:54 PM, "Ana Gillan" <[email protected]> wrote:

> Hi everyone,
>
> I am having an issue with MapReduce jobs running through Hive being killed
> after 600s timeouts, and with very simple jobs taking over 3 hours (or just
> failing) for a set of files with a compressed size of only 1-2 GB. I will
> try and provide as much information as I can here, so if someone can help,
> that would be really great.
>
> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>
> • Master node:
>   – 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>   – 64GB DDR3 SDRAM
>   – 8 x 2TB SAS 600 hard drives (arranged as RAID 1 and RAID 5)
>
> • Slave nodes (each):
>   – Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>   – 32GB DDR3 SDRAM
>   – 4 x 2TB SATA-3 hard drives
>
> • Operating system on all nodes: openSUSE Linux 13.1
>
> We have the Apache BigTop package version 0.7, with Hadoop version
> 2.0.6-alpha and Hive version 0.11. YARN has been configured as per these
> recommendations:
> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>
> I also set the following additional settings before running jobs:
>
> set yarn.nodemanager.resource.cpu-vcores=4;
> set mapred.tasktracker.map.tasks.maximum=4;
> set hive.hadoop.supports.splittable.combineinputformat=true;
> set hive.merge.mapredfiles=true;
>
> No one else uses this cluster while I am working.
>
> What I’m trying to do:
> I have a bunch of XML files on HDFS, which I am reading into Hive using
> this SerDe: https://github.com/dvasilen/Hive-XML-SerDe. I then want to
> create a series of tables from these files and finally run a Python script
> on one of them to perform some scientific calculations. The files are in
> .xml.gz format and (uncompressed) are only about 4 MB in size each.
> hive.input.format is set to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the
> “small files problem.”
>
> Problems:
> My HQL statements work perfectly for up to 1000 of these files. Even for
> much larger numbers, doing select * works fine, which means the files are
> being read properly, but if I do something as simple as selecting just one
> column from the whole table for a larger number of files, containers start
> being killed and jobs fail with this error in the container logs:
>
> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
> Failed to close file
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0:
> File does not exist. Holder
> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
> any open files.
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
>
> Killed jobs show the above and also the following message:
>
> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600 secs
> Container killed by the ApplicationMaster.
>
> Also, in the node logs, I get a lot of pings like this:
>
> INFO [IPC Server handler 17 on 40961]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
> attempt_1403771939632_0362_m_000002_0
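The "Timed out after 600 secs" message above is mapreduce.task.timeout kicking in: the ApplicationMaster kills any task attempt that does not report progress within that window (600000 ms by default). Raising it for one session is a cheap way to distinguish a genuinely hung task from one that is merely slow. A minimal sketch, assuming the Hive CLI is on the PATH; the table and column names are placeholders, not your actual schema:

  # Re-run the failing single-column query with a 30-minute task timeout.
  # "one_column" and "xml_table" are placeholders for your own names.
  hive -e "
    set mapreduce.task.timeout=1800000;
    select one_column from xml_table;
  "

If the job then completes, the tasks are slow rather than stuck, and the question becomes why (I/O, container placement) instead of what is killing them.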
> For 5000 files (1 GB compressed), the selection of a single column
> finishes, but takes over 3 hours. For 10,000 files, the job hangs at about
> 4% map and then errors out.
>
> While the jobs are running, I notice that the containers are not evenly
> distributed across the cluster. Some nodes lie idle, while the application
> master node runs 7 containers, maxing out the 28 GB of RAM allocated to
> Hadoop on each slave node.
>
> This is the output of netstat -i while the column selection is running:
>
> Kernel Interface table
> Iface   MTU  Met    RX-OK RX-ERR  RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0   1500    0 79515196      0 2265807      0 45694758      0      0      0 BMRU
> eth1   1500    0 77410508      0       0      0 40815746      0      0      0 BMRU
> lo    65536    0 16593808      0       0      0 16593808      0      0      0 LRU
>
> Are there some settings I am missing that mean the cluster isn’t
> processing this data as efficiently as it can?
>
> I am very new to Hadoop and there are so many logs, etc., that
> troubleshooting can be a bit overwhelming. Where else should I be looking
> to try and diagnose what is wrong?
>
> Thanks in advance for any help you can give!
>
> Kind regards,
> Ana
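One more place to look, since you ask where else to check: you can watch container placement directly while a query runs, which would make the skew toward the application-master node visible from the command line. A sketch, assuming the stock yarn and mapred client scripts from the BigTop 0.7 packages are on the PATH:

  # NodeManagers with their state and number of running containers:
  yarn node -list

  # Applications currently running, with queue and progress:
  yarn application -list

  # Running MapReduce jobs:
  mapred job -list

If every container keeps landing on one node, that points at scheduler configuration or data locality rather than at the data itself.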
