[jira] [Commented] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently

2014-04-04 Thread Nikola Vujic (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959934#comment-13959934
 ] 

Nikola Vujic commented on MAPREDUCE-5791:
-

To check the performance impact of this patch, I tested it on an 80-node 
Hadoop cluster running Windows. Here are the results:

Terasort 5 TB
1540 map tasks
770 reduce tasks

                   Elapsed   Avg Map Time  Avg Reduce Time  Avg Shuffle Time  Avg Merge Time
default            3194.67   402.33        248.67           1471.33           12.33
optimized shuffle  2411.00   392.00        689.67           674.67            17.17

default/optimized  1.33      1.03          0.36             2.18              0.72

* The optimized shuffle is configured to use a 512K buffer for the buffer-copy 
transfer.
** The numbers shown are averages of at least 3 runs.
 
The optimized shuffle version is 1.33x faster than the default version overall. 
The gain in the shuffle phase alone is 2.18x.
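
The ratios in the last row can be rechecked from the two rows of times. The following is only a small arithmetic sketch; the class and method names are made up for illustration and are not part of the patch.

```java
import java.util.Locale;

// Hypothetical helper for rechecking the default/optimized row of the table.
final class ShuffleSpeedup {
    /** Default time divided by optimized time; above 1.0 means the optimized run was faster. */
    public static double ratio(double defaultTime, double optimizedTime) {
        return defaultTime / optimizedTime;
    }

    /** Formats a ratio with two decimals, as in the table. */
    public static String fmt(double value) {
        return String.format(Locale.ROOT, "%.2f", value);
    }

    public static void main(String[] args) {
        // The times below are copied from the results table.
        System.out.println("Elapsed:          " + fmt(ratio(3194.67, 2411.00)));
        System.out.println("Avg Map Time:     " + fmt(ratio(402.33, 392.00)));
        System.out.println("Avg Reduce Time:  " + fmt(ratio(248.67, 689.67)));
        System.out.println("Avg Shuffle Time: " + fmt(ratio(1471.33, 674.67)));
        System.out.println("Avg Merge Time:   " + fmt(ratio(12.33, 17.17)));
    }
}
```

A ratio below 1.0 (Avg Reduce Time and Avg Merge Time here) means that column was higher in the optimized run than in the default run.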


 Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not 
 read disks efficiently
 

 Key: MAPREDUCE-5791
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5791
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: client
Affects Versions: 3.0.0, 2.3.0
Reporter: Nikola Vujic
Assignee: Nikola Vujic
 Fix For: 3.0.0, 2.4.0

 Attachments: MAPREDUCE-5791.patch, MAPREDUCE-5791.patch, 
 MAPREDUCE-5791.patch


 The transferTo method in org.apache.hadoop.mapred.FadvisedFileRegion uses the 
 transferTo method of a FileChannel to transfer data from disk to a socket. 
 This performs slowly on Windows, slower than on Linux. The reason is that 
 the java.nio transferTo method issues 32K IO requests all the time. 
 On Windows, these 32K transfers are not optimal and we don't get the best 
 performance from the underlying IO subsystem. To achieve better 
 performance when reading from the drives, we need to read data in bigger 
 chunks, 512K for example.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently

2014-04-04 Thread Nikola Vujic (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959944#comment-13959944
 ] 

Nikola Vujic commented on MAPREDUCE-5791:
-

I forgot to format the table. Here it is:
|| ||Elapsed||Avg Map Time||Avg Reduce time||Avg Shuffle Time||Avg Merge Time||
|default|3194.67|402.33|248.67|1471.33|12.33|
|optimized shuffle|2411.00|392.00|689.67|674.67|17.17|
|default/optimized|1.33|1.03|0.36|2.18|0.72|




[jira] [Commented] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently

2014-03-24 Thread Nikola Vujic (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945407#comment-13945407
 ] 

Nikola Vujic commented on MAPREDUCE-5791:
-

Hi [~cnauroth],

I have applied all fixes except for the if-else in {{FadvisedFileRegion}}. The 
edge case is reading the last chunk of data from a file. 
{{customShuffleTransfer}} must read {{actualCount}} bytes from a file, starting 
from {{this.position}}. This is done in the while loop, and the {{trans}} 
variable is used to track the number of remaining bytes. {{fileChannel.read}} 
returns the number of bytes read. For the last chunk of data, this number can 
be higher than the number of bytes that remain to be read. In that case we 
cannot use {{Buffer#flip}}, because it sets the limit to the current position 
and would expose all of the bytes read. 

For example, suppose that we have a 128-byte buffer and we want to read 
200 bytes starting at position 1000 in a file (with a file size bigger than 
1256 bytes). At least two iterations of the while loop will be done: 
1. Iteration 1: {{fileChannel.read(byteBuffer, 1000+0)}} = 128 bytes are read 
= all 128 bytes are needed = target.write
2. Iteration 2: {{fileChannel.read(byteBuffer, 1000+128)}} = 128 bytes are 
read because the file is big enough, but only the first 72 bytes 
are needed = {{byteBuffer.limit(72)}} = target.write

In the else block we don't set the limit to the current position but to a 
number lower than the current position. Updating the local {{position}} 
variable is needed in order to read data starting from the proper position in 
the next iterations of the loop. Does that make sense?

Regarding the resource leak in the test, I applied the change you suggested, and 
I did the same with the {{fileRegion}} in order to eliminate one try block.

I changed {{customShuffleTransferCornerCases}} to private. It was public.
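
The last-chunk case from the example above can be sketched in isolation. This is not the code from the patch: {{prepareForWrite}} and the class name are hypothetical, and the sketch only shows how the buffer limit differs between the flip branch and the else branch.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of the if-else discussed above; not taken from the patch.
final class LastChunkTrim {
    /**
     * Prepares a buffer that was just filled by a read so that a following
     * write sends at most `remaining` bytes.
     */
    public static ByteBuffer prepareForWrite(ByteBuffer buffer, long remaining) {
        int bytesRead = buffer.position();
        if (bytesRead <= remaining) {
            // Every byte read is needed: flip sets limit = position, position = 0.
            buffer.flip();
        } else {
            // Last chunk: more bytes were read than are needed, so the limit
            // must be set below the current position instead of flipping.
            buffer.limit((int) remaining);
            buffer.position(0);
        }
        return buffer;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(128);

        // Iteration 1 of the example: 128 bytes read, 200 still needed.
        buffer.put(new byte[128]);
        System.out.println(prepareForWrite(buffer, 200).remaining());

        // Iteration 2 of the example: 128 bytes read, only 72 still needed.
        buffer.clear();
        buffer.put(new byte[128]);
        System.out.println(prepareForWrite(buffer, 72).remaining());
    }
}
```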



[jira] [Updated] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently

2014-03-24 Thread Nikola Vujic (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Vujic updated MAPREDUCE-5791:


Attachment: MAPREDUCE-5791.patch



[jira] [Updated] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently

2014-03-14 Thread Nikola Vujic (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Vujic updated MAPREDUCE-5791:


Attachment: MAPREDUCE-5791.patch

I have submitted a new patch, fixed according to your comments.



[jira] [Updated] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently

2014-03-13 Thread Nikola Vujic (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Vujic updated MAPREDUCE-5791:


Attachment: MAPREDUCE-5791.patch



[jira] [Updated] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently

2014-03-13 Thread Nikola Vujic (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikola Vujic updated MAPREDUCE-5791:


Status: Patch Available  (was: Open)

The patch contains an implementation of a new function to do the data transfer. 
The existing implementation relies on the nio transferTo method, which is slow 
on Windows. The new function does a simple transfer, using an intermediate 
buffer in memory to read data from disk and send it to a socket. The size of 
the intermediate buffer determines the size of the IO requests, so it becomes 
possible to control the size of the IO requests in the shuffle phase. 
Controlling the size of the IO requests turns out to be important for 
performance on Windows machines.

I observed that the new code improves Avg Shuffle Time on Windows by 1.8x. The 
end-to-end improvement in a 100 GB Terasort is 1.3x when the new code is used 
(tested on a cluster with 4 datanodes).
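
The buffer-copy idea described above can be sketched roughly as follows. This is a minimal illustration and not the patch itself: the class name, the method name and the {{bufferSize}} parameter are assumptions, and for simplicity the sketch caps the buffer limit before the last read instead of trimming it afterwards.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of a buffer-copy transfer; names are not from the patch.
final class BufferCopyTransfer {
    /**
     * Copies up to `count` bytes from `source`, starting at `position`, to
     * `target` through an intermediate buffer. The buffer size bounds the
     * size of each read request. Returns the number of bytes written.
     */
    public static long transfer(FileChannel source, long position, long count,
                                WritableByteChannel target, int bufferSize)
            throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(bufferSize);
        long transferred = 0;
        while (transferred < count) {
            buffer.clear();
            long remaining = count - transferred;
            if (remaining < bufferSize) {
                buffer.limit((int) remaining); // do not read past the requested range
            }
            int read = source.read(buffer, position + transferred);
            if (read <= 0) {
                break; // end of file reached before `count` bytes were available
            }
            buffer.flip();
            while (buffer.hasRemaining()) {
                target.write(buffer);
            }
            transferred += read;
        }
        return transferred;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("buffer-copy", ".bin");
        Files.write(file, new byte[4 * 1024 * 1024]);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            // 512K is the buffer size mentioned in the issue description.
            long sent = transfer(channel, 0, channel.size(),
                    Channels.newChannel(sink), 512 * 1024);
            System.out.println("bytes transferred: " + sent);
        } finally {
            Files.delete(file);
        }
    }
}
```

In a real shuffle handler the target would be a socket channel rather than an in-memory sink, and the buffer size would come from configuration.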



[jira] [Commented] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently

2014-03-13 Thread Nikola Vujic (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13933835#comment-13933835
 ] 

Nikola Vujic commented on MAPREDUCE-5791:
-

Hi @Chris Nauroth,

No, the original code is not calling it with the count parameter set to 32K. It 
seems that java.nio.transferTo is chopping a larger transfer into multiple I/O 
requests of 32K each. I didn't find a way to configure that transfer size in 
java.nio. I think that java.nio has a native implementation of transferTo on 
Linux (direct transfer with DMA), but on Windows that implementation is 
missing, so the JDK is probably taking a slow path on Windows.

Btw, java.nio.transferTo does not always use 32K transfers, but it seems that 
this is not under user control. At least, I didn't find a way to control 
this.
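
To illustrate the point about control: a caller of {{FileChannel#transferTo}} passes only a position, a byte count and a target, and there is no parameter for the size of the underlying I/O requests. The loop below is a generic sketch of such a caller with a hypothetical class name; it is not the code in {{FadvisedFileRegion}}.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical caller of FileChannel#transferTo; not taken from Hadoop.
final class TransferToLoop {
    /**
     * Asks transferTo for the whole remaining range on each call. How the JDK
     * splits that range into I/O requests is not visible at this level.
     */
    public static long transferAll(FileChannel source, long position, long count,
                                   WritableByteChannel target) throws IOException {
        long done = 0;
        while (done < count) {
            long n = source.transferTo(position + done, count - done, target);
            if (n <= 0) {
                break; // nothing more could be transferred
            }
            done += n;
        }
        return done;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("transfer-to", ".bin");
        Files.write(file, new byte[1024 * 1024]);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            long sent = transferAll(channel, 0, channel.size(), Channels.newChannel(sink));
            System.out.println("bytes transferred: " + sent);
        } finally {
            Files.delete(file);
        }
    }
}
```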





[jira] [Commented] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently

2014-03-13 Thread Nikola Vujic (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934321#comment-13934321
 ] 

Nikola Vujic commented on MAPREDUCE-5791:
-

I agree about the root cause. I saw that code too. Since I use the Oracle JDK, 
I wanted to verify that the native implementation does not exist in the Oracle 
JDK, but I couldn't find the source code for nio.dll in the Oracle JDK. So I ran 
the test with OpenJDK in order to compare performance, and I observed the same 
behavior. Both JDKs behave the same in the shuffle, which means that the Oracle 
JDK is also missing a native implementation of the zero-copy transfer on 
Windows (assuming that a zero-copy transfer would be at least as fast as the 
buffer copy).

It is a good idea to try a JNI call to TransmitFile. Actually, we may get a 
perf boost from TransmitFile because the shuffle phase is now CPU bound (CPU 
is at 100% during shuffle with buffer copy). I will have to try it.





[jira] [Created] (MAPREDUCE-5791) Shuffle phase is slow in Windows - FadviseFileRegion::transferTo does not read disks efficiently

2014-03-11 Thread Nikola Vujic (JIRA)
Nikola Vujic created MAPREDUCE-5791:
---

 Summary: Shuffle phase is slow in Windows - 
FadviseFileRegion::transferTo does not read disks efficiently
 Key: MAPREDUCE-5791
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5791
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: Nikola Vujic
Assignee: Nikola Vujic


The transferTo method in org.apache.hadoop.mapred.FadvisedFileRegion uses the 
transferTo method of a FileChannel to transfer data from disk to a socket. 
This performs slowly on Windows, slower than on Linux. The reason is that 
the java.nio transferTo method issues 32K IO requests all the time. On 
Windows, these 32K transfers are not optimal and we don't get the best 
performance from the underlying IO subsystem. To achieve better 
performance when reading from the drives, we need to read data in bigger 
chunks, 512K for example.


