[jira] [Commented] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently
[ https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13959934#comment-13959934 ]

Nikola Vujic commented on MAPREDUCE-5791:

To check the performance impact of this patch, I tested it on an 80-node Hadoop cluster on Windows. Here are the results:

Terasort 5 TB, 1540 map tasks, 770 reduce tasks

                     Elapsed   Avg Map Time   Avg Reduce Time   Avg Shuffle Time   Avg Merge Time
default              3194.67   402.33         248.67            1471.33            12.33
optimized shuffle    2411.00   392.00         689.67            674.67             17.17
default/optimized    1.33      1.03           0.36              2.18               0.72

* The optimized shuffle is configured to use a 512K buffer for the buffer-copy shuffle.
** The presented numbers are averages of at least 3 runs.

The optimized shuffle version is 1.33x faster than the default version. The gain in the shuffle phase alone is 2.18x.

Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently

Key: MAPREDUCE-5791
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5791
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: client
Affects Versions: 3.0.0, 2.3.0
Reporter: Nikola Vujic
Assignee: Nikola Vujic
Fix For: 3.0.0, 2.4.0
Attachments: MAPREDUCE-5791.patch, MAPREDUCE-5791.patch, MAPREDUCE-5791.patch

The transferTo method in org.apache.hadoop.mapred.FadvisedFileRegion uses the transferTo method of a FileChannel to transfer data from a disk to a socket. This performs slowly on Windows, slower than on Linux. The reason is that the java.nio transferTo method issues 32K I/O requests all the time. On Windows, these 32K transfers are not optimal and we don't get the best performance from the underlying I/O subsystem. To achieve better performance when reading from the drives, we need to read data in bigger chunks, 512K for example.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently
[ https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13959944#comment-13959944 ]

Nikola Vujic commented on MAPREDUCE-5791:

I forgot to format the table. Here it is:

|| ||Elapsed||Avg Map Time||Avg Reduce time||Avg Shuffle Time||Avg Merge Time||
|default|3194.67|402.33|248.67|1471.33|12.33|
|optimized shuffle|2411.00|392.00|689.67|674.67|17.17|
|default/optimized|1.33|1.03|0.36|2.18|0.72|
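The last row of the table is each default time divided by the corresponding optimized time. A minimal Java sketch of that arithmetic, using only the numbers from the table (the class and method names are made up for illustration):

```java
import java.util.Arrays;

final class ShuffleSpeedup {
    // Values copied from the table, in column order:
    // Elapsed, Avg Map Time, Avg Reduce Time, Avg Shuffle Time, Avg Merge Time.
    static final double[] DEFAULT = {3194.67, 402.33, 248.67, 1471.33, 12.33};
    static final double[] OPTIMIZED = {2411.00, 392.00, 689.67, 674.67, 17.17};

    /** Returns default/optimized for each column, rounded to two decimals. */
    static double[] ratios() {
        double[] out = new double[DEFAULT.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.round(DEFAULT[i] / OPTIMIZED[i] * 100.0) / 100.0;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(ratios()));
    }
}
```

A ratio above 1 means the optimized run was faster in that column (elapsed, map, shuffle); a ratio below 1 means it was slower (reduce, merge).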
[jira] [Commented] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently
[ https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13945407#comment-13945407 ]

Nikola Vujic commented on MAPREDUCE-5791:

Hi [~cnauroth],

I have applied all fixes except for the if-else in {{FadvisedFileRegion}}. The edge case is reading the last chunk of data from a file. {{customShuffleTransfer}} must read {{actualCount}} bytes from a file, starting from {{this.position}}. This is done in the while loop, and the {{trans}} variable is used to track the number of remaining bytes. {{fileChannel.read}} returns the number of bytes read. For the last chunk of data this number can be higher than the number of bytes that remain to be read. In that case we cannot use {{Buffer#flip}}.

For example, suppose that we have a 128-byte buffer and we want to read 200 bytes starting at position 1000 in a file (file size bigger than 1256 bytes). At least two iterations of the while loop will be done:
1. Iteration 1: {{fileChannel.read(byteBuffer, 1000+0)}} => 128 bytes are read => all 128 bytes are needed => target.write
2. Iteration 2: {{fileChannel.read(byteBuffer, 1000+128)}} => 128 bytes are read because the file is big enough, but only the first 72 bytes are needed => {{byteBuffer.limit(72)}} => target.write

In the else block we don't set the limit to the current position but to a number lower than the current position. Updating the local {{position}} variable is needed so that the next iterations of the loop read data starting from the proper position. Does that make sense?

Regarding the resource leak in the test, I applied the change you suggested and did the same with the {{fileRegion}} in order to eliminate one try block. I changed {{customShuffleTransferCornerCases}} from public to private.
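The last-chunk case described in this comment can be sketched as follows. This is an illustrative reconstruction written from the description, not the code in the attached patch; the class name, method names, and parameters are hypothetical. The demo repeats the example from the comment: a 128-byte buffer, 200 bytes wanted, starting at position 1000.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class LastChunkDemo {

    /** Copies exactly count bytes starting at startPosition; returns bytes copied. */
    static long copyRange(FileChannel file, WritableByteChannel target,
                          long startPosition, long count, int bufferSize) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(bufferSize);
        long remaining = count;
        long position = startPosition;
        int read;
        while (remaining > 0 && (read = file.read(buffer, position)) > 0) {
            if (read < remaining) {
                // Every byte just read is needed: flip() exposes all of them.
                remaining -= read;
                position += read;
                buffer.flip();
            } else {
                // Last chunk: the read may hold more bytes than are still needed,
                // so flip() would expose too much. Cap the limit by hand instead.
                buffer.limit((int) remaining);
                buffer.position(0);
                position += remaining;
                remaining = 0;
            }
            while (buffer.hasRemaining()) {
                target.write(buffer);
            }
            buffer.clear();
        }
        return count - remaining;
    }

    /** Runs the 128-byte-buffer / 200-byte example against a 1300-byte temp file. */
    static byte[] demo() {
        try {
            byte[] data = new byte[1300];
            for (int i = 0; i < data.length; i++) {
                data[i] = (byte) (i % 251);
            }
            Path tmp = Files.createTempFile("last-chunk", ".bin");
            try {
                Files.write(tmp, data);
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                try (FileChannel file = FileChannel.open(tmp, StandardOpenOption.READ);
                     WritableByteChannel target = Channels.newChannel(out)) {
                    copyRange(file, target, 1000, 200, 128);
                }
                return out.toByteArray();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes copied: " + demo().length);
    }
}
```

Calling `limit` with a value below the buffer's current position also moves the position down to that value, which is why the sketch resets the position to 0 explicitly before writing.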
[jira] [Updated] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently
[ https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikola Vujic updated MAPREDUCE-5791:

Attachment: MAPREDUCE-5791.patch
[jira] [Updated] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently
[ https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikola Vujic updated MAPREDUCE-5791:

Attachment: MAPREDUCE-5791.patch

I have submitted a new patch, fixed according to your comments.
[jira] [Updated] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently
[ https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikola Vujic updated MAPREDUCE-5791:

Attachment: MAPREDUCE-5791.patch
[jira] [Updated] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently
[ https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikola Vujic updated MAPREDUCE-5791:

Status: Patch Available (was: Open)

The patch contains an implementation of a new function to do the data transfer. The existing implementation relies on the nio transferTo method, which is slow on Windows. The new function does a simple transfer: it uses an intermediate buffer in memory to read data from a disk and send it to a socket. The size of the intermediate buffer determines the size of the I/O requests, so this makes it possible to control the I/O request size in the shuffle phase. Controlling the size of the I/O requests turns out to be important for performance on Windows machines. I observed that the new code improves Avg Shuffle Time on Windows by 1.8x. The end-to-end improvement in a 100 GB Terasort is 1.3x when the new code is used (tested on a cluster with 4 datanodes).
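The buffer-copy approach described above can be sketched as below. This is a simplified illustration, not the patch itself: the class and method names are hypothetical, and an in-memory sink stands in for the socket. It copies a whole file through a heap buffer and counts the read calls made from Java, which is only a proxy for the I/O requests the operating system actually issues.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class BufferCopyTransfer {

    /**
     * Copies the whole file to target through a buffer of bufferSize bytes.
     * Returns {bytesTransferred, readCalls}.
     */
    static long[] transfer(FileChannel source, WritableByteChannel target, int bufferSize)
            throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(bufferSize);
        long bytes = 0;
        long reads = 0;
        // Each read asks for at most bufferSize bytes, so the buffer size
        // bounds the size of a single read request.
        while (source.read(buffer) > 0) {
            reads++;
            buffer.flip();
            while (buffer.hasRemaining()) {
                bytes += target.write(buffer);
            }
            buffer.clear();
        }
        return new long[] {bytes, reads};
    }

    /** Writes a zero-filled temp file of fileSize bytes and copies it to an in-memory sink. */
    static long[] run(int fileSize, int bufferSize) {
        try {
            Path tmp = Files.createTempFile("buffer-copy", ".bin");
            try {
                Files.write(tmp, new byte[fileSize]);
                ByteArrayOutputStream sink = new ByteArrayOutputStream(fileSize);
                try (FileChannel source = FileChannel.open(tmp, StandardOpenOption.READ);
                     WritableByteChannel target = Channels.newChannel(sink)) {
                    return transfer(source, target, bufferSize);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] small = run(2 * 1024 * 1024, 32 * 1024);
        long[] large = run(2 * 1024 * 1024, 512 * 1024);
        System.out.println("32K buffer:  " + small[1] + " reads for " + small[0] + " bytes");
        System.out.println("512K buffer: " + large[1] + " reads for " + large[0] + " bytes");
    }
}
```

For a 2 MB file, a 32K buffer needs at least 64 read calls and a 512K buffer at least 4. Whether the larger requests are actually faster depends on the platform and disks, as the measurements in this thread suggest for Windows.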
[jira] [Commented] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently
[ https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13933835#comment-13933835 ]

Nikola Vujic commented on MAPREDUCE-5791:

Hi @Chris Nauroth,

No, the original code is not calling it with the count parameter set to 32K. It seems that java.nio.transferTo is chopping a larger transfer into multiple I/O requests of 32K each. I didn't find a way to configure that transfer size for java.nio. I think that java.nio has a native implementation for transferTo on Linux (direct transfer with DMA), but on Windows that implementation is missing. The JDK is then probably taking a slow path on Windows.

Btw, java.nio.transferTo does not always use 32K transfers, but it seems that this is not under user control. At least, I didn't find a way to control it.
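For context, the caller-side pattern under discussion looks roughly like the sketch below. The caller asks {{FileChannel.transferTo}} for the whole remaining range, and how that request is split into individual I/O operations is decided inside the JDK, not by the caller. This is a generic illustration, not the Hadoop code; the class and method names are hypothetical, and an in-memory sink stands in for the socket.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class TransferToLoop {

    /** Sends the whole file to target with transferTo; returns bytes sent. */
    static long transferAll(FileChannel source, WritableByteChannel target) throws IOException {
        long position = 0;
        long size = source.size();
        while (position < size) {
            // Request everything that is left; transferTo may send fewer bytes,
            // and the caller has no parameter for the underlying I/O size.
            long sent = source.transferTo(position, size - position, target);
            if (sent <= 0) {
                break; // nothing more could be transferred
            }
            position += sent;
        }
        return position;
    }

    /** Writes a patterned temp file of fileSize bytes and returns what the sink received. */
    static byte[] demo(int fileSize) {
        try {
            byte[] data = new byte[fileSize];
            for (int i = 0; i < data.length; i++) {
                data[i] = (byte) (i % 251);
            }
            Path tmp = Files.createTempFile("transfer-to", ".bin");
            try {
                Files.write(tmp, data);
                ByteArrayOutputStream sink = new ByteArrayOutputStream(fileSize);
                try (FileChannel source = FileChannel.open(tmp, StandardOpenOption.READ);
                     WritableByteChannel target = Channels.newChannel(sink)) {
                    transferAll(source, target);
                }
                return sink.toByteArray();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes received: " + demo(100_000).length);
    }
}
```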
[jira] [Commented] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently
[ https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13934321#comment-13934321 ]

Nikola Vujic commented on MAPREDUCE-5791:

I agree about the root cause. I saw that code too. Since I use Oracle JDK, I wanted to verify that the native implementation does not exist in Oracle JDK, but I couldn't find the source code for nio.dll in Oracle JDK. So I ran the test with OpenJDK in order to compare performance, and I observed the same behavior. Both JDKs behave the same in the shuffle, which means that Oracle JDK is also missing a native implementation of the zero-copy transfer on Windows (assuming that the zero-copy transfer would work at least as fast as the buffer copy).

It is a good idea to try a JNI call to TransmitFile. Actually, we may get a perf boost from TransmitFile, because the shuffle phase is now CPU bound (CPU is at 100% during shuffle with buffer copy). I will have to try it.
[jira] [Created] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently
Nikola Vujic created MAPREDUCE-5791:

Summary: Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently
Key: MAPREDUCE-5791
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5791
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Nikola Vujic
Assignee: Nikola Vujic

The transferTo method in org.apache.hadoop.mapred.FadvisedFileRegion uses the transferTo method of a FileChannel to transfer data from a disk to a socket. This performs slowly on Windows, slower than on Linux. The reason is that the java.nio transferTo method issues 32K I/O requests all the time. On Windows, these 32K transfers are not optimal and we don't get the best performance from the underlying I/O subsystem. To achieve better performance when reading from the drives, we need to read data in bigger chunks, 512K for example.