> On Dec 1, 2015, at 3:27 PM, Aladdin Elston <[email protected]> wrote:
>
> Thank you,
> Aladdin
>
> From: Robert Schmidtke
> Date: Tuesday, December 1, 2015 at 7:21 AM
> To: "[email protected]"
> Subject: Re: HDFS TestDFSIO performance with 1 mapper/file of 16G per node
>
> Hi Andreas,
>
> thanks for your reply. I'm glad I'm not the only one running into this issue.
> I'm currently in the process of profiling HDFS carefully, and hopefully I'll
> come across something. It would seem that the OS caches all writes to disk,
> and there may be problems with HDFS being mostly on-heap (wild guess).
>
> I am running on bare metal here, with Yarn being the only layer in between.
>
> Robert
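A quick way to check whether that memory is really just the Linux page cache,
rather than HDFS heap, is to compare the Cached/Buffers and AnonPages fields of
/proc/meminfo on a DataNode while the benchmark runs. A minimal sketch in
Python, assuming a Linux kernel that exposes the usual /proc/meminfo fields:

    #!/usr/bin/env python
    # Report how much of the "used" RAM is reclaimable page cache versus
    # anonymous memory (process heaps, including the DataNode/NodeManager JVMs).

    def meminfo_kb():
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                info[key] = int(value.strip().split()[0])  # values are in kB
        return info

    m = meminfo_kb()
    gb = 1024.0 * 1024.0
    print("total        %6.1f GB" % (m["MemTotal"] / gb))
    print("page cache   %6.1f GB" % ((m["Cached"] + m.get("Buffers", 0)) / gb))
    print("anon (heaps) %6.1f GB" % (m.get("AnonPages", 0) / gb))

If most of the apparent usage shows up as page cache, the kernel is simply
keeping recently written blocks in memory and will reclaim them under pressure;
only the anonymous number reflects actual JVM heap growth.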
>> On Tue, Dec 1, 2015 at 11:07 AM, Andreas Fritzler <[email protected]> wrote:
>>
>> Hi Robert,
>>
>> I'm experiencing the same effect regarding the RAM consumption when running
>> TestDFSIO with huge files. I couldn't really find out yet why this is
>> happening, though.
>>
>> One question regarding your setup: Are you running your cluster on bare
>> metal or virtualized?
>>
>> Regards,
>> Andreas
>>
>>> On Fri, Nov 27, 2015 at 5:07 PM, Robert Schmidtke <[email protected]> wrote:
>>>
>>> Hi everyone,
>>>
>>> I cannot seem to wrap my head around the TestDFSIO benchmark results. I
>>> have a cluster of 8 nodes; the first one runs the NameNode and the
>>> ResourceManager, the others each run a DataNode and a NodeManager. Each
>>> node has 64G of RAM. I am using an HDFS block size of 128M and a
>>> replication factor of 3. I'm running the benchmark with 7 map tasks, each
>>> processing one file of 16G. The results are as follows:
>>>
>>> For the write phase:
>>>
>>> 15/11/27 14:50:32 INFO mapreduce.Job: Counters: 49
>>>     File System Counters
>>>         FILE: Number of bytes read=626
>>>         FILE: Number of bytes written=929305
>>>         FILE: Number of read operations=0
>>>         FILE: Number of large read operations=0
>>>         FILE: Number of write operations=0
>>>         HDFS: Number of bytes read=1673
>>>         HDFS: Number of bytes written=120259084369
>>>         HDFS: Number of read operations=31
>>>         HDFS: Number of large read operations=0
>>>         HDFS: Number of write operations=9
>>>     Job Counters
>>>         Launched map tasks=7
>>>         Launched reduce tasks=1
>>>         Data-local map tasks=7
>>>         Total time spent by all maps in occupied slots (ms)=3816135
>>>         Total time spent by all reduces in occupied slots (ms)=277826
>>>         Total time spent by all map tasks (ms)=3816135
>>>         Total time spent by all reduce tasks (ms)=277826
>>>         Total vcore-seconds taken by all map tasks=3816135
>>>         Total vcore-seconds taken by all reduce tasks=277826
>>>         Total megabyte-seconds taken by all map tasks=3907722240
>>>         Total megabyte-seconds taken by all reduce tasks=284493824
>>>     Map-Reduce Framework
>>>         Map input records=7
>>>         Map output records=35
>>>         Map output bytes=550
>>>         Map output materialized bytes=662
>>>         Input split bytes=889
>>>         Combine input records=0
>>>         Combine output records=0
>>>         Reduce input groups=5
>>>         Reduce shuffle bytes=662
>>>         Reduce input records=35
>>>         Reduce output records=5
>>>         Spilled Records=70
>>>         Shuffled Maps =7
>>>         Failed Shuffles=0
>>>         Merged Map outputs=7
>>>         GC time elapsed (ms)=73040
>>>         CPU time spent (ms)=754470
>>>         Physical memory (bytes) snapshot=2070585344
>>>         Virtual memory (bytes) snapshot=7448678400
>>>         Total committed heap usage (bytes)=1346895872
>>>     Shuffle Errors
>>>         BAD_ID=0
>>>         CONNECTION=0
>>>         IO_ERROR=0
>>>         WRONG_LENGTH=0
>>>         WRONG_MAP=0
>>>         WRONG_REDUCE=0
>>>     File Input Format Counters
>>>         Bytes Read=784
>>>     File Output Format Counters
>>>         Bytes Written=81
>>> 15/11/27 14:50:32 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
>>> 15/11/27 14:50:32 INFO fs.TestDFSIO: Date & time: Fri Nov 27 14:50:32 CET 2015
>>> 15/11/27 14:50:32 INFO fs.TestDFSIO: Number of files: 7
>>> 15/11/27 14:50:32 INFO fs.TestDFSIO: Total MBytes processed: 114688.0
>>> 15/11/27 14:50:32 INFO fs.TestDFSIO: Throughput mb/sec: 30.260630125485257
>>> 15/11/27 14:50:32 INFO fs.TestDFSIO: Average IO rate mb/sec: 31.812650680541992
>>> 15/11/27 14:50:32 INFO fs.TestDFSIO: IO rate std deviation: 6.862839589024802
>>> 15/11/27 14:50:32 INFO fs.TestDFSIO: Test exec time sec: 705.559
>>>
>>> For the read phase:
>>>
>>> 15/11/27 14:56:46 INFO mapreduce.Job: Counters: 51
>>>     File System Counters
>>>         FILE: Number of bytes read=637
>>>         FILE: Number of bytes written=929311
>>>         FILE: Number of read operations=0
>>>         FILE: Number of large read operations=0
>>>         FILE: Number of write operations=0
>>>         HDFS: Number of bytes read=120259085961
>>>         HDFS: Number of bytes written=84
>>>         HDFS: Number of read operations=38
>>>         HDFS: Number of large read operations=0
>>>         HDFS: Number of write operations=2
>>>     Job Counters
>>>         Killed map tasks=4
>>>         Launched map tasks=11
>>>         Launched reduce tasks=1
>>>         Data-local map tasks=9
>>>         Rack-local map tasks=2
>>>         Total time spent by all maps in occupied slots (ms)=2096182
>>>         Total time spent by all reduces in occupied slots (ms)=272737
>>>         Total time spent by all map tasks (ms)=2096182
>>>         Total time spent by all reduce tasks (ms)=272737
>>>         Total vcore-seconds taken by all map tasks=2096182
>>>         Total vcore-seconds taken by all reduce tasks=272737
>>>         Total megabyte-seconds taken by all map tasks=2146490368
>>>         Total megabyte-seconds taken by all reduce tasks=279282688
>>>     Map-Reduce Framework
>>>         Map input records=7
>>>         Map output records=35
>>>         Map output bytes=561
>>>         Map output materialized bytes=673
>>>         Input split bytes=889
>>>         Combine input records=0
>>>         Combine output records=0
>>>         Reduce input groups=5
>>>         Reduce shuffle bytes=673
>>>         Reduce input records=35
>>>         Reduce output records=5
>>>         Spilled Records=70
>>>         Shuffled Maps =7
>>>         Failed Shuffles=0
>>>         Merged Map outputs=7
>>>         GC time elapsed (ms)=1407
>>>         CPU time spent (ms)=96580
>>>         Physical memory (bytes) snapshot=1945346048
>>>         Virtual memory (bytes) snapshot=7420182528
>>>         Total committed heap usage (bytes)=1421869056
>>>     Shuffle Errors
>>>         BAD_ID=0
>>>         CONNECTION=0
>>>         IO_ERROR=0
>>>         WRONG_LENGTH=0
>>>         WRONG_MAP=0
>>>         WRONG_REDUCE=0
>>>     File Input Format Counters
>>>         Bytes Read=784
>>>     File Output Format Counters
>>>         Bytes Written=84
>>> 15/11/27 14:56:46 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
>>> 15/11/27 14:56:46 INFO fs.TestDFSIO: Date & time: Fri Nov 27 14:56:46 CET 2015
>>> 15/11/27 14:56:46 INFO fs.TestDFSIO: Number of files: 7
>>> 15/11/27 14:56:46 INFO fs.TestDFSIO: Total MBytes processed: 114688.0
>>> 15/11/27 14:56:46 INFO fs.TestDFSIO: Throughput mb/sec: 101.66069077999725
>>> 15/11/27 14:56:46 INFO fs.TestDFSIO: Average IO rate mb/sec: 153.29443359375
>>> 15/11/27 14:56:46 INFO fs.TestDFSIO: IO rate std deviation: 115.23795112362669
>>> 15/11/27 14:56:46 INFO fs.TestDFSIO: Test exec time sec: 370.396
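One note on reading the three summary lines, since they are not computed the
same way (this is my reading of the benchmark, so treat it as a sketch rather
than gospel): "Throughput" divides the total data volume by the sum of all task
times, while "Average IO rate" and the standard deviation are taken over the
individual per-task MB/s values. With made-up per-task times that mimic a read
phase where some tasks are served from the page cache and others from disk:

    import math

    # Hypothetical (size_mb, time_sec) per map task: a few fast (cached/local)
    # tasks and a few slow (disk/remote) ones. Not measured values.
    tasks = [(16384.0, 40.0), (16384.0, 45.0), (16384.0, 60.0), (16384.0, 150.0),
             (16384.0, 170.0), (16384.0, 290.0), (16384.0, 320.0)]

    rates = [size / t for size, t in tasks]  # per-task MB/s
    throughput = sum(s for s, _ in tasks) / sum(t for _, t in tasks)
    avg_rate = sum(rates) / len(rates)
    std_dev = math.sqrt(abs(sum(r * r for r in rates) / len(rates) - avg_rate ** 2))

    print("Throughput mb/sec:      %.2f" % throughput)  # total MB / total seconds
    print("Average IO rate mb/sec: %.2f" % avg_rate)    # mean of per-task rates
    print("IO rate std deviation:  %.2f" % std_dev)

A wide spread between fast and slow tasks inflates both the gap between the two
averages and the standard deviation, which looks consistent with the read
numbers above.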
>>> I usually average over a number of runs, but this is just a single one. I
>>> have seen much greater execution times for both the write and the read
>>> phases (up to 2x the ones stated above). I have compared the performance to
>>> XtreemFS, another distributed file system, and it achieves an average write
>>> rate that is around 6x faster and an average read rate that is around 2.5x
>>> faster, though with a much lower standard deviation. Note that HDFS
>>> outperforms XtreemFS in the TestDFSIO benchmark with smaller files (1G). I
>>> also don't understand why map tasks are being killed.
>>>
>>> I noticed that quite some time is spent on GC, so I plotted some system
>>> metrics for one node running a DataNode/NodeManager (the write phase stops
>>> and the read phase starts at around 14:51):
>>>
>>> <datanode.png>
>>>
>>> I am wondering why the system RAM usage goes all the way up to the Yarn
>>> container limit of around 60G, when only one 16G file is being written
>>> to/read from on this node.
>>>
>>> So, to sum up, the questions I would like your advice on are:
>>>
>>> - Has anyone observed similar behavior/performance with HDFS and large
>>>   files?
>>> - Which configuration parameters would I need to tweak to increase
>>>   performance, or which system properties should I examine and report so
>>>   that you can give me helpful advice here?
>>> - Why is the RAM consumption so high?
>>> - Why is the reported standard deviation during read phases so high?
>>>
>>> Thanks in advance for your time and effort!
>>>
>>> Robert
>>>
>>> --
>>> My GPG Key ID: 336E2680
>
> --
> My GPG Key ID: 336E2680
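One way to take the page cache out of the picture for the read phase (and to
test whether it explains both the RAM usage and the large read-side standard
deviation) would be to drop the cache on every DataNode between the write and
the read run, so the reads start cold. A minimal sketch, assuming root access
on Linux nodes that expose /proc/sys/vm/drop_caches; it would need to be run on
each DataNode, e.g. via ssh or pdsh:

    import subprocess

    # Flush dirty pages to disk first, then ask the kernel to drop the clean
    # page cache plus dentries and inodes. Requires root.
    subprocess.check_call(["sync"])
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

With a cold cache, the read-phase throughput should be closer to what the disks
actually deliver, and the per-task spread (and hence the reported standard
deviation) should shrink if caching was the cause.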
