For my own user? It is as follows:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 483941
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 800
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
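Of those, open files (1024) and max user processes (800) are the usual suspects on a Hadoop node. In case it helps, here is a sketch of how they could be raised (assuming PAM's pam_limits is in use, as on a stock openSUSE install; the values are illustrative starting points, not recommendations):

# Check the two limits that most often bite Hadoop/Hive jobs
ulimit -n   # open files
ulimit -u   # max user processes

# Persistent change: add lines like these to /etc/security/limits.conf
# (as root) for the user that runs the Hive/Hadoop jobs, then log in again:
#   <user>  soft  nofile  65536
#   <user>  hard  nofile  65536
#   <user>  soft  nproc   32768
#   <user>  hard  nproc   32768

# Session-only change (a soft limit cannot be raised above the hard limit):
ulimit -n 65536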
From: hadoop hive <[email protected]>
Reply-To: <[email protected]>
Date: Saturday, 2 August 2014 16:34
To: <[email protected]>
Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Can you check the ulimit for your user? That might be causing this.

On Aug 2, 2014 8:54 PM, "Ana Gillan" <[email protected]> wrote:
> Hi everyone,
>
> I am having an issue with MapReduce jobs running through Hive being killed
> after 600s timeouts, and with very simple jobs taking over 3 hours (or just
> failing) for a set of files with a compressed size of only 1-2GB. I will try
> to provide as much information as I can here, so if someone can help, that
> would be really great.
>
> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>> Master node:
>> 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>> 64GB DDR3 SDRAM
>> 8 x 2TB SAS 600 hard drives (arranged as RAID 1 and RAID 5)
>>
>> Slave nodes (each):
>> Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>> 32GB DDR3 SDRAM
>> 4 x 2TB SATA-3 hard drives
>>
>> Operating system on all nodes: openSUSE Linux 13.1
>
> We have the Apache BigTop package version 0.7, with Hadoop version
> 2.0.6-alpha and Hive version 0.11.
> YARN has been configured as per these recommendations:
> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>
> I also set the following additional settings before running jobs:
> set yarn.nodemanager.resource.cpu-vcores=4;
> set mapred.tasktracker.map.tasks.maximum=4;
> set hive.hadoop.supports.splittable.combineinputformat=true;
> set hive.merge.mapredfiles=true;
>
> No one else uses this cluster while I am working.
>
> What I'm trying to do:
> I have a bunch of XML files on HDFS, which I am reading into Hive using this
> SerDe: https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a
> series of tables from these files and finally run a Python script on one of
> them to perform some scientific calculations. The files are in .xml.gz format
> and are only about 4MB each uncompressed. hive.input.format is set to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the
> "small files problem."
>
> Problems:
> My HQL statements work perfectly for up to 1000 of these files. Even for much
> larger numbers, doing select * works fine, which means the files are being
> read properly, but if I do something as simple as selecting just one column
> from the whole table for a larger number of files, containers start being
> killed and jobs fail with this error in the container logs:
>
> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
> Failed to close file
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0:
> File does not exist.
> Holder DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not
> have any open files.
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
>
> Killed jobs show the above and also the following message:
> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600 secs
> Container killed by the ApplicationMaster.
>
> Also, in the node logs, I get a lot of pings like this:
> INFO [IPC Server handler 17 on 40961]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
> attempt_1403771939632_0362_m_000002_0
>
> For 5000 files (1GB compressed), the selection of a single column finishes,
> but takes over 3 hours. For 10,000 files, the job hangs at about 4% map and
> then errors out.
>
> While the jobs are running, I notice that the containers are not evenly
> distributed across the cluster. Some nodes lie idle, while the application
> master node runs 7 containers, maxing out the 28GB of RAM allocated to Hadoop
> on each slave node.
>
> This is the output of netstat -i while the column selection is running:
> Kernel Interface table
> Iface   MTU  Met     RX-OK  RX-ERR   RX-DRP  RX-OVR     TX-OK  TX-ERR  TX-DRP  TX-OVR  Flg
> eth0   1500    0  79515196       0  2265807       0  45694758       0       0       0  BMRU
> eth1   1500    0  77410508       0        0       0  40815746       0       0       0  BMRU
> lo    65536    0  16593808       0        0       0  16593808       0       0       0  LRU
>
> Are there some settings I am missing that mean the cluster isn't processing
> this data as efficiently as it can?
>
> I am very new to Hadoop, and there are so many logs, etc., that
> troubleshooting can be a bit overwhelming. Where else should I be looking to
> try to diagnose what is wrong?
>
> Thanks in advance for any help you can give!
>
> Kind regards,
> Ana
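A note on the small-files side of the quoted message: each .xml.gz file is unsplittable, so with ~10,000 of them CombineHiveInputFormat only helps if it is also told how large the combined splits may grow; otherwise it can still end up scheduling roughly one map task per file. A sketch of the split-size settings it honours in Hive 0.11 (byte values, purely illustrative, to be tuned against block size and task memory):

set mapred.max.split.size=268435456;
set mapred.min.split.size.per.node=134217728;
set mapred.min.split.size.per.rack=134217728;

With a 256MB cap per combined split, thousands of tiny gzip files would collapse into far fewer, larger map tasks.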
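One more note, on the quoted "Timed out after 600 secs": that matches the default mapreduce.task.timeout of 600000 ms, after which the ApplicationMaster kills any attempt that has not read input, written output, or reported status. As far as I can tell, the LeaseExpiredException is then a secondary symptom: the killed attempt (or a retry of it) tries to close its _tmp output file after the NameNode has already revoked or reassigned the lease. If the maps are genuinely slow rather than stuck, the timeout can be raised per session (value illustrative):

set mapreduce.task.timeout=1800000;

Raising it only buys time, though; combining the inputs into larger splits is more likely to address the root cause.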
