Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, > I'll do it tomorrow : Today is President's Day in the US, and I am > spending the day with my family. thank you. Enjoy your day. Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Mon, 2014-02-17 at 16:32 +0100, Thomas Glanzmann wrote: > Hello Eric, > > > Unfortunately you did not had good results with the MSG_MORE applied > > to the page fragments. > > I agree. We should submit only the submit the patch from this message: > > Message-ID: > <1391886759.10160.114.ca...@edumazet-glaptop2.roam.corp.google.com> > http://mid.gmane.org/1391886759.10160.114.ca...@edumazet-glaptop2.roam.corp.google.com > > > I think I'll submit the part only dealing with the metadata. > > May I submit the patch or do you want to do it yourself? I'll do it tomorrow : Today is President's Day in the US, and I am spending the day with my family. Thanks ! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, > Unfortunately you did not had good results with the MSG_MORE applied > to the page fragments. I agree. We should submit only the submit the patch from this message: Message-ID: <1391886759.10160.114.ca...@edumazet-glaptop2.roam.corp.google.com> http://mid.gmane.org/1391886759.10160.114.ca...@edumazet-glaptop2.roam.corp.google.com > I think I'll submit the part only dealing with the metadata. May I submit the patch or do you want to do it yourself? Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Mon, 2014-02-17 at 15:08 +0100, Thomas Glanzmann wrote: > Hello Eric, > may submit your latest patch for upstream? Or do you plan on doing that > yourself? Unfortunately you did not had good results with the MSG_MORE applied to the page fragments. I think I'll submit the part only dealing with the metadata. Then later we might take care of the page themselves. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, may submit your latest patch for upstream? Or do you plan on doing that yourself? Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, may submit your latest patch for upstream? Or do you plan on doing that yourself? Cheers, Thomas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Mon, 2014-02-17 at 15:08 +0100, Thomas Glanzmann wrote: Hello Eric, may submit your latest patch for upstream? Or do you plan on doing that yourself? Unfortunately you did not had good results with the MSG_MORE applied to the page fragments. I think I'll submit the part only dealing with the metadata. Then later we might take care of the page themselves. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, Unfortunately you did not had good results with the MSG_MORE applied to the page fragments. I agree. We should submit only the submit the patch from this message: Message-ID: 1391886759.10160.114.ca...@edumazet-glaptop2.roam.corp.google.com http://mid.gmane.org/1391886759.10160.114.ca...@edumazet-glaptop2.roam.corp.google.com I think I'll submit the part only dealing with the metadata. May I submit the patch or do you want to do it yourself? Cheers, Thomas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Mon, 2014-02-17 at 16:32 +0100, Thomas Glanzmann wrote: Hello Eric, Unfortunately you did not had good results with the MSG_MORE applied to the page fragments. I agree. We should submit only the submit the patch from this message: Message-ID: 1391886759.10160.114.ca...@edumazet-glaptop2.roam.corp.google.com http://mid.gmane.org/1391886759.10160.114.ca...@edumazet-glaptop2.roam.corp.google.com I think I'll submit the part only dealing with the metadata. May I submit the patch or do you want to do it yourself? I'll do it tomorrow : Today is President's Day in the US, and I am spending the day with my family. Thanks ! -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, I'll do it tomorrow : Today is President's Day in the US, and I am spending the day with my family. thank you. Enjoy your day. Cheers, Thomas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, > Yes, this is much better : 2 frames per request/response, instead of 4. perfect. I send out the page to the iscsi target list in your name since you did the work and I added me as signed off I hope that is how it is handled or should I have added my name to the from line and mentioned in the description of the patch that you did the heavy lifting? Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 22:36 +0100, Thomas Glanzmann wrote: > Hello Eric, > > > I was simply thinking about something like : > > (might need further changes, but I guess this should solve your case) > > thank you for your patch. It did not apply on top of Linux tip, so I put > in the changes manually and fixed up another call to tx_data that your > forgot in your initial patch to make it apply. > > I gave it another run, can you confirm that it now behaves better? > > https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched_tcp_more.pcap.bz2 > > And look at that roundtrip graph it is perfect. Also filesystem is now > created in 3 seconds instead of 4. Yes, this is much better : 2 frames per request/response, instead of 4. 13:32:04.665367 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 384:432, ack 2529, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.665483 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 2529:3089, ack 432, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.665642 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 432:480, ack 3089, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.665756 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 3089:3649, ack 480, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.665933 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 480:528, ack 3649, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.666046 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 3649:4209, ack 528, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.666214 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 528:576, ack 4209, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.666333 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 4209:4769, ack 576, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.78 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 576:624, ack 4769, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.666790 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 4769:5329, ack 624, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.666983 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 624:672, ack 5329, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.667097 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 5329:5889, ack 672, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.667280 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 672:720, ack 5889, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.667324 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 5889:6449, ack 720, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.667500 IP 10.101.0.12.43418 > 10.101.99.5.3260: Flags [P.], seq 720:768, ack 6449, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.667540 IP 10.101.99.5.3260 > 10.101.0.12.43418: Flags [P.], seq 6449:7009, ack 768, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, > I was simply thinking about something like : > (might need further changes, but I guess this should solve your case) thank you for your patch. It did not apply on top of Linux tip, so I put in the changes manually and fixed up another call to tx_data that your forgot in your initial patch to make it apply. I gave it another run, can you confirm that it now behaves better? https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched_tcp_more.pcap.bz2 And look at that roundtrip graph it is perfect. Also filesystem is now created in 3 seconds instead of 4. https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-22:34:57.png Nab, do you consider this patch for upstream? Would you take if I clean it up? Cheers, Thomas PS: I'm asleep for the next 8 hours. diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c index e655b04..0eb9681 100644 --- a/drivers/target/iscsi/iscsi_target_util.c +++ b/drivers/target/iscsi/iscsi_target_util.c @@ -1168,7 +1168,7 @@ send_data: iov_count = cmd->iov_misc_count; } - tx_sent = tx_data(conn, [0], iov_count, tx_size); + tx_sent = tx_data(conn, [0], iov_count, tx_size, 0); if (tx_size != tx_sent) { if (tx_sent == -EAGAIN) { pr_err("tx_data() returned -EAGAIN\n"); @@ -1199,7 +1199,8 @@ send_hdr: iov.iov_base = cmd->pdu; iov.iov_len = tx_hdr_size; - tx_sent = tx_data(conn, , 1, tx_hdr_size); + data_len = cmd->tx_size - tx_hdr_size - cmd->padding; +tx_sent = tx_data(conn, , 1, tx_hdr_size, data_len ? MSG_MORE : 0); if (tx_hdr_size != tx_sent) { if (tx_sent == -EAGAIN) { pr_err("tx_data() returned -EAGAIN\n"); @@ -1208,7 +1209,6 @@ send_hdr: return -1; } - data_len = cmd->tx_size - tx_hdr_size - cmd->padding; /* * Set iov_off used by padding and data digest tx_data() calls below * in order to determine proper offset into cmd->iov_data[] @@ -1252,7 +1252,8 @@ send_padding: if (cmd->padding) { struct kvec *iov_p = >iov_data[iov_off++]; - tx_sent = tx_data(conn, iov_p, 1, cmd->padding); + tx_sent = tx_data(conn, iov_p, 1, cmd->padding, + conn->conn_ops->DataDigest ? MSG_MORE : 0); if (cmd->padding != tx_sent) { if (tx_sent == -EAGAIN) { pr_err("tx_data() returned -EAGAIN\n"); @@ -1266,7 +1267,7 @@ send_datacrc: if (conn->conn_ops->DataDigest) { struct kvec *iov_d = >iov_data[iov_off]; - tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN); + tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN, 0); if (ISCSI_CRC_LEN != tx_sent) { if (tx_sent == -EAGAIN) { pr_err("tx_data() returned -EAGAIN\n"); @@ -1352,11 +1353,13 @@ static int iscsit_do_rx_data( static int iscsit_do_tx_data( struct iscsi_conn *conn, - struct iscsi_data_count *count) + struct iscsi_data_count *count, + int flags) { int data = count->data_length, total_tx = 0, tx_loop = 0, iov_len; struct kvec *iov_p; struct msghdr msg; +struct msghdr msg = { .msg_flags = flags }; if (!conn || !conn->sock || !conn->conn_ops) return -1; @@ -1366,8 +1369,6 @@ static int iscsit_do_tx_data( return -1; } - memset(, 0, sizeof(struct msghdr)); - iov_p = count->iov; iov_len = count->iov_count; @@ -1411,7 +1412,8 @@ int tx_data( struct iscsi_conn *conn, struct kvec *iov, int iov_count, - int data) + int data, + int flags) { struct iscsi_data_count c; @@ -1424,7 +1426,7 @@ int tx_data( c.data_length = data; c.type = ISCSI_TX_DATA; - return iscsit_do_tx_data(conn, ); + return iscsit_do_tx_data(conn, , flags); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 18:15 +0100, Thomas Glanzmann wrote: > The iSCSI target uses one function to send all outbound data. So in > order to do it right every function that is sending data in multiple > chunks need to mark it correctly. Of course someone could also do some > wild guessing and saying that everything that is below 512 Bytes gets > pushed out. I wonder what Nab has to say about this? I was simply thinking about something like : (might need further changes, but I guess this should solve your case) diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c index 0819e688a398..44f0d62a88d6 100644 --- a/drivers/target/iscsi/iscsi_target_util.c +++ b/drivers/target/iscsi/iscsi_target_util.c @@ -1165,7 +1165,7 @@ send_data: iov_count = cmd->iov_misc_count; } - tx_sent = tx_data(conn, [0], iov_count, tx_size); + tx_sent = tx_data(conn, [0], iov_count, tx_size, 0); if (tx_size != tx_sent) { if (tx_sent == -EAGAIN) { pr_err("tx_data() returned -EAGAIN\n"); @@ -1196,7 +1196,8 @@ send_hdr: iov.iov_base = cmd->pdu; iov.iov_len = tx_hdr_size; - tx_sent = tx_data(conn, , 1, tx_hdr_size); + data_len = cmd->tx_size - tx_hdr_size - cmd->padding; + tx_sent = tx_data(conn, , 1, tx_hdr_size, data_len ? MSG_MORE : 0); if (tx_hdr_size != tx_sent) { if (tx_sent == -EAGAIN) { pr_err("tx_data() returned -EAGAIN\n"); @@ -1205,7 +1206,6 @@ send_hdr: return -1; } - data_len = cmd->tx_size - tx_hdr_size - cmd->padding; /* * Set iov_off used by padding and data digest tx_data() calls below * in order to determine proper offset into cmd->iov_data[] @@ -1249,7 +1249,8 @@ send_padding: if (cmd->padding) { struct kvec *iov_p = >iov_data[iov_off++]; - tx_sent = tx_data(conn, iov_p, 1, cmd->padding); + tx_sent = tx_data(conn, iov_p, 1, cmd->padding, + conn->conn_ops->DataDigest ? MSG_MORE : 0); if (cmd->padding != tx_sent) { if (tx_sent == -EAGAIN) { pr_err("tx_data() returned -EAGAIN\n"); @@ -1263,7 +1264,7 @@ send_datacrc: if (conn->conn_ops->DataDigest) { struct kvec *iov_d = >iov_data[iov_off]; - tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN); + tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN, 0); if (ISCSI_CRC_LEN != tx_sent) { if (tx_sent == -EAGAIN) { pr_err("tx_data() returned -EAGAIN\n"); @@ -1349,11 +1350,12 @@ static int iscsit_do_rx_data( static int iscsit_do_tx_data( struct iscsi_conn *conn, - struct iscsi_data_count *count) + struct iscsi_data_count *count, + int flags) { int data = count->data_length, total_tx = 0, tx_loop = 0, iov_len; struct kvec *iov_p; - struct msghdr msg; + struct msghdr msg = { .msg_flags = flags }; if (!conn || !conn->sock || !conn->conn_ops) return -1; @@ -1363,8 +1365,6 @@ static int iscsit_do_tx_data( return -1; } - memset(, 0, sizeof(struct msghdr)); - iov_p = count->iov; iov_len = count->iov_count; @@ -1408,7 +1408,8 @@ int tx_data( struct iscsi_conn *conn, struct kvec *iov, int iov_count, - int data) + int data, + int flags) { struct iscsi_data_count c; @@ -1421,7 +1422,7 @@ int tx_data( c.data_length = data; c.type = ISCSI_TX_DATA; - return iscsit_do_tx_data(conn, ); + return iscsit_do_tx_data(conn, , flags); } void iscsit_collect_login_stats( diff --git a/drivers/target/iscsi/iscsi_target_util.h b/drivers/target/iscsi/iscsi_target_util.h index e4fc34a02f57..1b4f06801adc 100644 --- a/drivers/target/iscsi/iscsi_target_util.h +++ b/drivers/target/iscsi/iscsi_target_util.h @@ -54,7 +54,7 @@ extern int iscsit_print_dev_to_proc(char *, char **, off_t, int); extern int iscsit_print_sessions_to_proc(char *, char **, off_t, int); extern int iscsit_print_tpg_to_proc(char *, char **, off_t, int); extern int rx_data(struct iscsi_conn *, struct kvec *, int, int); -extern int tx_data(struct iscsi_conn *, struct kvec *, int, int); +extern int tx_data(struct iscsi_conn *, struct kvec *, int, int, int); extern void iscsit_collect_login_stats(struct iscsi_conn *, u8, u8); extern struct iscsi_tiqn *iscsit_snmp_get_tiqn(struct iscsi_conn *); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, > Yep, but the problem (at least on your pcap), is about sending the 48 > bytes headers in TCP segment of its own, then the 512 byte payload in > a separate segment. I agree. > I suspect the sendpage() is only used for the payload. No need for > MSG_MORE here. I see. > The MSG_MORE would need to be set on the first part (48 bytes header), > so that TCP stack will defer the push of the segment at the time the 512 > bytes payload is added. The iSCSI target uses one function to send all outbound data. So in order to do it right every function that is sending data in multiple chunks need to mark it correctly. Of course someone could also do some wild guessing and saying that everything that is below 512 Bytes gets pushed out. I wonder what Nab has to say about this? Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 17:57 +0100, Thomas Glanzmann wrote: > Hello Eric, > > > Note : We did some patches in the MSG_MORE logic for sendpage(), but > > in your case I do not think its related > > (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious > > thank you for the pointer. The iSCSI target code actually uses sendpage > whenever it can. Yep, but the problem (at least on your pcap), is about sending the 48 bytes headers in TCP segment of its own, then the 512 byte payload in a separate segment. I suspect the sendpage() is only used for the payload. No need for MSG_MORE here. The MSG_MORE would need to be set on the first part (48 bytes header), so that TCP stack will defer the push of the segment at the time the 512 bytes payload is added. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, > Note : We did some patches in the MSG_MORE logic for sendpage(), but > in your case I do not think its related > (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious thank you for the pointer. The iSCSI target code actually uses sendpage whenever it can. Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 16:00 +0100, Thomas Glanzmann wrote: > Hello Eric, > > > Idea would be to set this flag when calling sendmsg() of the 48 bytes > > of the header, and not set it on the sendmsg() of the 512 bytes of the > > payload. > > I see. > > > iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but > > it would be nice to add a new _initial_ flags parameter to > > iscsi_sw_tcp_xmit_segment() > > This is for the iscsi initiator implementation. I'm interested in iSCSI > target code, but I already found it and experiemented a little bit, but > I need to dig deeper if I want to prepare a patch. Fantastic ! Let me know if you want some help. Note : We did some patches in the MSG_MORE logic for sendpage(), but in your case I do not think its related (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, > Idea would be to set this flag when calling sendmsg() of the 48 bytes > of the header, and not set it on the sendmsg() of the 512 bytes of the > payload. I see. > iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but > it would be nice to add a new _initial_ flags parameter to > iscsi_sw_tcp_xmit_segment() This is for the iscsi initiator implementation. I'm interested in iSCSI target code, but I already found it and experiemented a little bit, but I need to dig deeper if I want to prepare a patch. Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 15:19 +0100, Thomas Glanzmann wrote: > Hello Eric, > I get the idea. However I'm a little bit confused, when I do a 'git grep > MSG_MORE' I don't see much references in the Linux kernel who use it at > all. So do you have an example for me where this flags needs to be > applied? Idea would be to set this flag when calling sendmsg() of the 48 bytes of the header, and not set it on the sendmsg() of the 512 bytes of the payload. iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but it would be nice to add a new _initial_ flags parameter to iscsi_sw_tcp_xmit_segment() -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, > > BTW this problem demonstrates there is room for improvement in iCSCI, > > using MSG_MORE to avoid sending two small segments in separate frames. > With the fix, new pcap is more explicit about this suboptimal behavior : > 05:34:16.280900 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack > 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 > 05:34:16.280949 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq > 5328:5376, ack 54353, win 514, options [nop,nop,TS val 1732452 ecr > 4294935370], length 48 > 05:34:16.280982 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq > 54353:54401, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr > 1732452], length 48 > 05:34:16.281000 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq > 54401:54913, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr > 1732452], length 512 > 05:34:16.281107 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack > 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 > 05:34:16.281157 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq > 5376:5424, ack 54913, win 514, options [nop,nop,TS val 1732452 ecr > 4294935370], length 48 > 05:34:16.281190 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq > 54913:54961, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr > 1732452], length 48 > 05:34:16.281208 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq > 54961:55473, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr > 1732452], length 512 > 05:34:16.281337 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack > 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 > 05:34:16.281390 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq > 5424:5472, ack 55473, win 514, options [nop,nop,TS val 1732452 ecr > 4294935370], length 48 > 05:34:16.281423 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq > 55473:55521, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr > 1732452], length 48 > 05:34:16.281440 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq > 55521:56033, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr > 1732452], length 512 I get the idea. However I'm a little bit confused, when I do a 'git grep MSG_MORE' I don't see much references in the Linux kernel who use it at all. So do you have an example for me where this flags needs to be applied? Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 05:50 -0800, Eric Dumazet wrote: > On Sat, 2014-02-08 at 05:33 -0800, Eric Dumazet wrote: > > On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote: > > > Here is the combined patch, could you test it ? > > > > Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d > > ("tcp: autocork should not hold first packet in write queue") > > in your tree. > > > > > > BTW this problem demonstrates there is room for improvement in iCSCI, > using MSG_MORE to avoid sending two small segments in separate frames. > > [1] 00:32:35.726568 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq > 145:193, ack 144, win 235, options [nop,nop,TS val 4294960733 ecr 385385], > length 48 > [2] 00:32:35.838074 IP 10.101.0.13.27778 > 10.101.99.5.3260: Flags [.], ack > 193, win 514, options [nop,nop,TS val 385396 ecr 4294960733], length 0 > [3] 00:32:35.838099 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq > 193:705, ack 144, win 235, options [nop,nop,TS val 4294960761 ecr 385396], > length 512 > > [1] & [3] could be coalesced, and [2] would be avoided. > With the fix, new pcap is more explicit about this suboptimal behavior : 05:34:16.280900 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 05:34:16.280949 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5328:5376, ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 05:34:16.280982 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54353:54401, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 05:34:16.281000 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54401:54913, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 05:34:16.281107 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 05:34:16.281157 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5376:5424, ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 05:34:16.281190 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54913:54961, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 05:34:16.281208 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 54961:55473, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 05:34:16.281337 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [.], ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 05:34:16.281390 IP 10.101.0.13.41531 > 10.101.99.5.3260: Flags [P.], seq 5424:5472, ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 05:34:16.281423 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55473:55521, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 05:34:16.281440 IP 10.101.99.5.3260 > 10.101.0.13.41531: Flags [P.], seq 55521:56033, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, [RESEND: dropped CC accidently] > 10.101.99.5 or 10.101.0.13? 10.101.99.5 (iSCSI Target) tcpdump -i bond0.101 -s 0 -w /tmp/tcp_auto_corking_on_patched.pcap host esx-03.v101.campusvl.de Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 14:37 +0100, Thomas Glanzmann wrote: > > It fixes my case but if you look at the round trip time it is not even > close what it used to be. So while this fixes my problem I'm still for > disabling it by default. > > https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched.pcap.bz2 This pcap was taken on which host ? 10.101.99.5 or 10.101.0.13 ? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, > What is your NIC model and driver? I have four Intel Corporation I350 Gigabit Network Connection (rev 01). (node-62) [~/work/linux-2.6] lspci -v | pbot http://pbot.rmdir.de/rgu6yHMBDVQpflMmbcJACg (node-62) [~/work/linux-2.6] ip a s | pbot http://pbot.rmdir.de/xJjRT8u-ekC6mrWgl09ZtQ (node-62) [~/work/linux-2.6] dmesg | pbot http://pbot.rmdir.de/MigrSPtxGmp0fI1CRgXsHw I do 802.3ad link aggregation layer 2 hash with two network cards to one switch. I'm running: Linux node-62 3.14.0-rc1+ #23 SMP Sat Feb 8 14:27:47 CET 2014 x86_64 GNU/Linux Driver: igb If you need remote access to the machine let me know. Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 14:37 +0100, Thomas Glanzmann wrote: > Hello Eric, > > It fixes my case but if you look at the round trip time it is not even > close what it used to be. So while this fixes my problem I'm still for > disabling it by default. > > https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched.pcap.bz2 > https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-14:36:25.png Very nice. Now we have to check your NIC and how TX completion is performed. What is your NIC model and driver ? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 05:33 -0800, Eric Dumazet wrote: > On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote: > > Here is the combined patch, could you test it ? > > Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d > ("tcp: autocork should not hold first packet in write queue") > in your tree. > > BTW this problem demonstrates there is room for improvement in iCSCI, using MSG_MORE to avoid sending two small segments in separate frames. [1] 00:32:35.726568 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq 145:193, ack 144, win 235, options [nop,nop,TS val 4294960733 ecr 385385], length 48 [2] 00:32:35.838074 IP 10.101.0.13.27778 > 10.101.99.5.3260: Flags [.], ack 193, win 514, options [nop,nop,TS val 385396 ecr 4294960733], length 0 [3] 00:32:35.838099 IP 10.101.99.5.3260 > 10.101.0.13.27778: Flags [P.], seq 193:705, ack 144, win 235, options [nop,nop,TS val 4294960761 ecr 385396], length 512 [1] & [3] could be coalesced, and [2] would be avoided. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, > Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d > ("tcp: autocork should not hold first packet in write queue") > in your tree. confirmed: (node-62) [~/work/linux-2.6] git show a181ceb501b31b4bf8812a5c84c716cc31d82c2d | head commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d Author: Eric Dumazet Date: Tue Dec 17 09:58:30 2013 -0800 tcp: autocork should not hold first packet in write queue Willem noticed a TCP_RR regression caused by TCP autocorking on a Mellanox test bed. MLX4_EN_TX_COAL_TIME is 16 us, which can be right above RTT between hosts. Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, > > tcp corking kills iSCSI performance > Here is the combined patch, could you test it? the patch did not apply, so I edited by hand. Here is the resulting patch: diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 03d26b8..40d1958 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -698,7 +698,8 @@ static void tcp_tsq_handler(struct sock *sk) if ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING | TCPF_CLOSE_WAIT | TCPF_LAST_ACK)) - tcp_write_xmit(sk, tcp_current_mss(sk), 0, 0, GFP_ATOMIC); + tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle, + 0, GFP_ATOMIC); } /* * One tasklet per cpu tries to send more skbs. @@ -1904,7 +1905,16 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, if (atomic_read(>sk_wmem_alloc) > limit) { set_bit(TSQ_THROTTLED, >tsq_flags); - break; + /* It is possible TX completion already happened +* before we set TSQ_THROTTLED, so we must +* test again the condition. +* We abuse smp_mb__after_clear_bit() because +* there is no smp_mb__after_set_bit() yet +*/ + smp_mb__after_clear_bit(); + if (atomic_read(>sk_wmem_alloc) > limit) + break; + } limit = mss_now; -- cut here -- It fixes my case but if you look at the round trip time it is not even close what it used to be. So while this fixes my problem I'm still for disabling it by default. https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-14:36:25.png Cheers, Thomas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote: > Here is the combined patch, could you test it ? Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d ("tcp: autocork should not hold first packet in write queue") in your tree. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 10:38 +0100, Thomas Glanzmann wrote: > Hello Eric, > > [RESEND: the time it took the VMFS was created was switched between > on/off so with on it took over 2 minutes with off it took less than 4 > seconds] > > [RESEND 2: The throughput graphs were switched as well ;-(] > > > * Thomas Glanzmann [2014-02-07 08:55]: > > > Creating a 4 TB VMFS filesystem over iSCSI takes 24 seconds on 3.12 > > > and 15 minutes on 3.14.0-rc2+. > > * Nicholas A. Bellinger [2014-02-07 20:30]: > > Would it be possible to try a couple of different stable kernel > > versions to help track this down? > > I bisected[1] it and found the offending commit f54b311 tcp auto corking > [2] 'if we have a small send and a previous packet is already in the > qdisc or device queue, defer until TX completion or we get more data.' > - Description by David S. Miller > > I gathered a pcap with tcp_autocorking on and off. > > On: - took 2 minutes 24 seconds to create a 500 GB VMFS file system > https://thomas.glanzmann.de/tmp/tcp_auto_corking_on.pcap.bz2 > https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:46:34.png > https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:52:28.png > > Off: - took 4 seconds to create a 500 GB VMFS file system > sysctl net.ipv4.tcp_autocorking=0 > https://thomas.glanzmann.de/tmp/tcp_auto_corking_off.pcap.bz2 > https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:45:43.png > https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png > > First graph can be generated by opening bunziping the file, opening it > in wireshark and select Statistics > IO Grap and change the unit to > Bytes/Tick. The second graph can be generated by selecting Statistics > > TCP Stream Graph > Round Trip Time. > > You can also see that the round trip time increases by factor 25 at > least. > > I once saw a similar problem with dealyed ACK packets of the > paravirtulized network driver in xen it caused that the tcp window > filled up and slowed down the throughput from 30 MB/s to less than 100 > KB/s the symptom was that the login to a Windows desktop took more than > 10 minutes while it used to be below 30 seconds because the profile of > the user was loaded slowly from a CIFS server. At that time the culprit > were also delayed small packets: ACK packets in the CIFS case. However I > only proofed iSCSI regression so far for tcp auto corking but assume we > will see many others if we leave it enabled. > > I found the problem by doing the following: > - I compiled kernel by executing the following commands: > yes '' | make oldconfig > time make -j 24 > / make modules_install > / mkinitramfs -o /boot/initrd.img-bisect > > - I cleaned the iSCSI configuration after each test by issuing: > /etc/init.d/target stop > rm /iscsi?/* /etc/target/* > > - I configured iSCSI after each reboot > cat > lio-v101.conf < set global auto_cd_after_create=false > /backstores/fileio create shared-01.v101.campusvl.de > /iscsi1/shared-01.v101.campusvl.de size=500G buffered=true > > /iscsi create iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de > /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals > create 10.101.99.4 > /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals > create 10.101.99.5 > /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals > create 10.102.99.4 > /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals > create 10.102.99.5 > /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/luns > create /backstores/fileio/shared-01.v101.campusvl.de lun=10 > /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/ > set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 > cache_dynamic_acls=1 > > saveconfig > yes > EOF > targetcli < lio-v101.conf > And configured a fresh booted ESXi 5.5.0 1331820 via > autodeploy > to the iSCSI target, configured the portal, rescanned and > created a 500 GB VMFS 5 filesystem and noticed the time if it > was longer than 2 minutes it was bad if it was below 10 > seconds > it was good. > git bisect good/bad > > My network config is: > > auto bond0 > iface bond0 inet static >address 10.100.4.62 >netmask 255.255.0.0 >gateway 10.100.0.1 >slaves eth0 eth1 >bond-mode 802.3ad >bond-miimon 100 > > auto bond0.101 > iface bond0.101 inet static >address 10.101.99.4 >netmask 255.255.0.0 > > auto bond1 > iface bond1 inet static >address 10.100.5.62 >netmask 255.255.0.0 >slaves eth2 eth3 >bond-mode 802.3ad >
REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, [RESEND: the time it took the VMFS was created was switched between on/off so with on it took over 2 minutes with off it took less than 4 seconds] [RESEND 2: The throughput graphs were switched as well ;-(] > * Thomas Glanzmann [2014-02-07 08:55]: > > Creating a 4 TB VMFS filesystem over iSCSI takes 24 seconds on 3.12 > > and 15 minutes on 3.14.0-rc2+. * Nicholas A. Bellinger [2014-02-07 20:30]: > Would it be possible to try a couple of different stable kernel > versions to help track this down? I bisected[1] it and found the offending commit f54b311 tcp auto corking [2] 'if we have a small send and a previous packet is already in the qdisc or device queue, defer until TX completion or we get more data.' - Description by David S. Miller I gathered a pcap with tcp_autocorking on and off. On: - took 2 minutes 24 seconds to create a 500 GB VMFS file system https://thomas.glanzmann.de/tmp/tcp_auto_corking_on.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:46:34.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:52:28.png Off: - took 4 seconds to create a 500 GB VMFS file system sysctl net.ipv4.tcp_autocorking=0 https://thomas.glanzmann.de/tmp/tcp_auto_corking_off.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:45:43.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png First graph can be generated by opening bunziping the file, opening it in wireshark and select Statistics > IO Grap and change the unit to Bytes/Tick. The second graph can be generated by selecting Statistics > TCP Stream Graph > Round Trip Time. You can also see that the round trip time increases by factor 25 at least. I once saw a similar problem with dealyed ACK packets of the paravirtulized network driver in xen it caused that the tcp window filled up and slowed down the throughput from 30 MB/s to less than 100 KB/s the symptom was that the login to a Windows desktop took more than 10 minutes while it used to be below 30 seconds because the profile of the user was loaded slowly from a CIFS server. At that time the culprit were also delayed small packets: ACK packets in the CIFS case. However I only proofed iSCSI regression so far for tcp auto corking but assume we will see many others if we leave it enabled. I found the problem by doing the following: - I compiled kernel by executing the following commands: yes '' | make oldconfig time make -j 24 / make modules_install / mkinitramfs -o /boot/initrd.img-bisect - I cleaned the iSCSI configuration after each test by issuing: /etc/init.d/target stop rm /iscsi?/* /etc/target/* - I configured iSCSI after each reboot cat > lio-v101.conf
REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, [RESEND: the time it took the VMFS was created was switched between on/off so with on it took over 2 minutes with off it took less than 4 seconds] > * Thomas Glanzmann [2014-02-07 08:55]: > > Creating a 4 TB VMFS filesystem over iSCSI takes 24 seconds on 3.12 > > and 15 minutes on 3.14.0-rc2+. * Nicholas A. Bellinger [2014-02-07 20:30]: > Would it be possible to try a couple of different stable kernel > versions to help track this down? I bisected[1] it and found the offending commit f54b311 tcp auto corking [2] 'if we have a small send and a previous packet is already in the qdisc or device queue, defer until TX completion or we get more data.' - Description by David S. Miller I gathered a pcap with tcp_autocorking on and off. On: - took 2 minutes 24 seconds to create a 500 GB VMFS file system https://thomas.glanzmann.de/tmp/tcp_auto_corking_on.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:45:43.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:52:28.png Off: - took 4 seconds to create a 500 GB VMFS file system sysctl net.ipv4.tcp_autocorking=0 https://thomas.glanzmann.de/tmp/tcp_auto_corking_off.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:46:34.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png First graph can be generated by opening bunziping the file, opening it in wireshark and select Statistics > IO Grap and change the unit to Bytes/Tick. The second graph can be generated by selecting Statistics > TCP Stream Graph > Round Trip Time. You can also see that the round trip time increases by factor 25 at least. I once saw a similar problem with dealyed ACK packets of the paravirtulized network driver in xen it caused that the tcp window filled up and slowed down the throughput from 30 MB/s to less than 100 KB/s the symptom was that the login to a Windows desktop took more than 10 minutes while it used to be below 30 seconds because the profile of the user was loaded slowly from a CIFS server. At that time the culprit were also delayed small packets: ACK packets in the CIFS case. However I only proofed iSCSI regression so far for tcp auto corking but assume we will see many others if we leave it enabled. I found the problem by doing the following: - I compiled kernel by executing the following commands: yes '' | make oldconfig time make -j 24 / make modules_install / mkinitramfs -o /boot/initrd.img-bisect - I cleaned the iSCSI configuration after each test by issuing: /etc/init.d/target stop rm /iscsi?/* /etc/target/* - I configured iSCSI after each reboot cat > lio-v101.conf
REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, > * Thomas Glanzmann [2014-02-07 08:55]: > > Creating a 4 TB VMFS filesystem over iSCSI takes 24 seconds on 3.12 > > and 15 minutes on 3.14.0-rc2+. * Nicholas A. Bellinger [2014-02-07 20:30]: > Would it be possible to try a couple of different stable kernel > versions to help track this down? I bisected[1] it and found the offending commit f54b311 tcp auto corking [2] 'if we have a small send and a previous packet is already in the qdisc or device queue, defer until TX completion or we get more data.' - Description by David S. Miller I gathered a pcap with tcp_autocorking on and off. On: - took 4 seconds to create a 500 GB VMFS file system https://thomas.glanzmann.de/tmp/tcp_auto_corking_on.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:45:43.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:52:28.png Off: - took 2 minutes 24 seconds to create a 500 GB VMFS file system sysctl net.ipv4.tcp_autocorking=0 https://thomas.glanzmann.de/tmp/tcp_auto_corking_off.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:46:34.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png First graph can be generated by opening bunziping the file, opening it in wireshark and select Statistics > IO Grap and change the unit to Bytes/Tick. The second graph can be generated by selecting Statistics > TCP Stream Graph > Round Trip Time. You can also see that the round trip time increases by factor 25 at least. I once saw a similar problem with dealyed ACK packets of the paravirtulized network driver in xen it caused that the tcp window filled up and slowed down the throughput from 30 MB/s to less than 100 KB/s the symptom was that the login to a Windows desktop took more than 10 minutes while it used to be below 30 seconds because the profile of the user was loaded slowly from a CIFS server. At that time the culprit were also delayed small packets: ACK packets in the CIFS case. However I only proofed iSCSI regression so far for tcp auto corking but assume we will see many others if we leave it enabled. I found the problem by doing the following: - I compiled kernel by executing the following commands: yes '' | make oldconfig time make -j 24 / make modules_install / mkinitramfs -o /boot/initrd.img-bisect - I cleaned the iSCSI configuration after each test by issuing: /etc/init.d/target stop rm /iscsi?/* /etc/target/* - I configured iSCSI after each reboot cat > lio-v101.conf
REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, * Thomas Glanzmann tho...@glanzmann.de [2014-02-07 08:55]: Creating a 4 TB VMFS filesystem over iSCSI takes 24 seconds on 3.12 and 15 minutes on 3.14.0-rc2+. * Nicholas A. Bellinger n...@linux-iscsi.org [2014-02-07 20:30]: Would it be possible to try a couple of different stable kernel versions to help track this down? I bisected[1] it and found the offending commit f54b311 tcp auto corking [2] 'if we have a small send and a previous packet is already in the qdisc or device queue, defer until TX completion or we get more data.' - Description by David S. Miller I gathered a pcap with tcp_autocorking on and off. On: - took 4 seconds to create a 500 GB VMFS file system https://thomas.glanzmann.de/tmp/tcp_auto_corking_on.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:45:43.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:52:28.png Off: - took 2 minutes 24 seconds to create a 500 GB VMFS file system sysctl net.ipv4.tcp_autocorking=0 https://thomas.glanzmann.de/tmp/tcp_auto_corking_off.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:46:34.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png First graph can be generated by opening bunziping the file, opening it in wireshark and select Statistics IO Grap and change the unit to Bytes/Tick. The second graph can be generated by selecting Statistics TCP Stream Graph Round Trip Time. You can also see that the round trip time increases by factor 25 at least. I once saw a similar problem with dealyed ACK packets of the paravirtulized network driver in xen it caused that the tcp window filled up and slowed down the throughput from 30 MB/s to less than 100 KB/s the symptom was that the login to a Windows desktop took more than 10 minutes while it used to be below 30 seconds because the profile of the user was loaded slowly from a CIFS server. At that time the culprit were also delayed small packets: ACK packets in the CIFS case. However I only proofed iSCSI regression so far for tcp auto corking but assume we will see many others if we leave it enabled. I found the problem by doing the following: - I compiled kernel by executing the following commands: yes '' | make oldconfig time make -j 24 / make modules_install / mkinitramfs -o /boot/initrd.img-bisect version - I cleaned the iSCSI configuration after each test by issuing: /etc/init.d/target stop rm /iscsi?/* /etc/target/* - I configured iSCSI after each reboot cat lio-v101.conf EOF set global auto_cd_after_create=false /backstores/fileio create shared-01.v101.campusvl.de /iscsi1/shared-01.v101.campusvl.de size=500G buffered=true /iscsi create iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.4 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.5 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.4 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.5 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/luns create /backstores/fileio/shared-01.v101.campusvl.de lun=10 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1 saveconfig yes EOF targetcli lio-v101.conf And configured a fresh booted ESXi 5.5.0 1331820 via autodeploy to the iSCSI target, configured the portal, rescanned and created a 500 GB VMFS 5 filesystem and noticed the time if it was longer than 2 minutes it was bad if it was below 10 seconds it was good. git bisect good/bad My network config is: auto bond0 iface bond0 inet static address 10.100.4.62 netmask 255.255.0.0 gateway 10.100.0.1 slaves eth0 eth1 bond-mode 802.3ad bond-miimon 100 auto bond0.101 iface bond0.101 inet static address 10.101.99.4 netmask 255.255.0.0 auto bond1 iface bond1 inet static address 10.100.5.62 netmask 255.255.0.0 slaves eth2 eth3 bond-mode 802.3ad bond-miimon 100 auto bond1.101 iface bond1.101 inet static address 10.101.99.5 netmask 255.255.0.0 I propose to disable tcp_autocorking by default because it obviously degrades iSCSI performance and probably many other protocols. Also the commit mentions that applications can explicitly disable auto corking we probably should do that for the iSCSI target, but I don't know how. Anyone? [1] http://pbot.rmdir.de/a65q6MjgV36tZnn5jS-DUQ [2]
REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, [RESEND: the time it took the VMFS was created was switched between on/off so with on it took over 2 minutes with off it took less than 4 seconds] * Thomas Glanzmann tho...@glanzmann.de [2014-02-07 08:55]: Creating a 4 TB VMFS filesystem over iSCSI takes 24 seconds on 3.12 and 15 minutes on 3.14.0-rc2+. * Nicholas A. Bellinger n...@linux-iscsi.org [2014-02-07 20:30]: Would it be possible to try a couple of different stable kernel versions to help track this down? I bisected[1] it and found the offending commit f54b311 tcp auto corking [2] 'if we have a small send and a previous packet is already in the qdisc or device queue, defer until TX completion or we get more data.' - Description by David S. Miller I gathered a pcap with tcp_autocorking on and off. On: - took 2 minutes 24 seconds to create a 500 GB VMFS file system https://thomas.glanzmann.de/tmp/tcp_auto_corking_on.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:45:43.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:52:28.png Off: - took 4 seconds to create a 500 GB VMFS file system sysctl net.ipv4.tcp_autocorking=0 https://thomas.glanzmann.de/tmp/tcp_auto_corking_off.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:46:34.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png First graph can be generated by opening bunziping the file, opening it in wireshark and select Statistics IO Grap and change the unit to Bytes/Tick. The second graph can be generated by selecting Statistics TCP Stream Graph Round Trip Time. You can also see that the round trip time increases by factor 25 at least. I once saw a similar problem with dealyed ACK packets of the paravirtulized network driver in xen it caused that the tcp window filled up and slowed down the throughput from 30 MB/s to less than 100 KB/s the symptom was that the login to a Windows desktop took more than 10 minutes while it used to be below 30 seconds because the profile of the user was loaded slowly from a CIFS server. At that time the culprit were also delayed small packets: ACK packets in the CIFS case. However I only proofed iSCSI regression so far for tcp auto corking but assume we will see many others if we leave it enabled. I found the problem by doing the following: - I compiled kernel by executing the following commands: yes '' | make oldconfig time make -j 24 / make modules_install / mkinitramfs -o /boot/initrd.img-bisect version - I cleaned the iSCSI configuration after each test by issuing: /etc/init.d/target stop rm /iscsi?/* /etc/target/* - I configured iSCSI after each reboot cat lio-v101.conf EOF set global auto_cd_after_create=false /backstores/fileio create shared-01.v101.campusvl.de /iscsi1/shared-01.v101.campusvl.de size=500G buffered=true /iscsi create iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.4 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.5 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.4 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.5 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/luns create /backstores/fileio/shared-01.v101.campusvl.de lun=10 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1 saveconfig yes EOF targetcli lio-v101.conf And configured a fresh booted ESXi 5.5.0 1331820 via autodeploy to the iSCSI target, configured the portal, rescanned and created a 500 GB VMFS 5 filesystem and noticed the time if it was longer than 2 minutes it was bad if it was below 10 seconds it was good. git bisect good/bad My network config is: auto bond0 iface bond0 inet static address 10.100.4.62 netmask 255.255.0.0 gateway 10.100.0.1 slaves eth0 eth1 bond-mode 802.3ad bond-miimon 100 auto bond0.101 iface bond0.101 inet static address 10.101.99.4 netmask 255.255.0.0 auto bond1 iface bond1 inet static address 10.100.5.62 netmask 255.255.0.0 slaves eth2 eth3 bond-mode 802.3ad bond-miimon 100 auto bond1.101 iface bond1.101 inet static address 10.101.99.5 netmask 255.255.0.0 I propose to disable tcp_autocorking by default because it obviously degrades iSCSI performance and probably many other protocols. Also the commit mentions that applications can explicitly disable auto
REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, [RESEND: the time it took the VMFS was created was switched between on/off so with on it took over 2 minutes with off it took less than 4 seconds] [RESEND 2: The throughput graphs were switched as well ;-(] * Thomas Glanzmann tho...@glanzmann.de [2014-02-07 08:55]: Creating a 4 TB VMFS filesystem over iSCSI takes 24 seconds on 3.12 and 15 minutes on 3.14.0-rc2+. * Nicholas A. Bellinger n...@linux-iscsi.org [2014-02-07 20:30]: Would it be possible to try a couple of different stable kernel versions to help track this down? I bisected[1] it and found the offending commit f54b311 tcp auto corking [2] 'if we have a small send and a previous packet is already in the qdisc or device queue, defer until TX completion or we get more data.' - Description by David S. Miller I gathered a pcap with tcp_autocorking on and off. On: - took 2 minutes 24 seconds to create a 500 GB VMFS file system https://thomas.glanzmann.de/tmp/tcp_auto_corking_on.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:46:34.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:52:28.png Off: - took 4 seconds to create a 500 GB VMFS file system sysctl net.ipv4.tcp_autocorking=0 https://thomas.glanzmann.de/tmp/tcp_auto_corking_off.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:45:43.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png First graph can be generated by opening bunziping the file, opening it in wireshark and select Statistics IO Grap and change the unit to Bytes/Tick. The second graph can be generated by selecting Statistics TCP Stream Graph Round Trip Time. You can also see that the round trip time increases by factor 25 at least. I once saw a similar problem with dealyed ACK packets of the paravirtulized network driver in xen it caused that the tcp window filled up and slowed down the throughput from 30 MB/s to less than 100 KB/s the symptom was that the login to a Windows desktop took more than 10 minutes while it used to be below 30 seconds because the profile of the user was loaded slowly from a CIFS server. At that time the culprit were also delayed small packets: ACK packets in the CIFS case. However I only proofed iSCSI regression so far for tcp auto corking but assume we will see many others if we leave it enabled. I found the problem by doing the following: - I compiled kernel by executing the following commands: yes '' | make oldconfig time make -j 24 / make modules_install / mkinitramfs -o /boot/initrd.img-bisect version - I cleaned the iSCSI configuration after each test by issuing: /etc/init.d/target stop rm /iscsi?/* /etc/target/* - I configured iSCSI after each reboot cat lio-v101.conf EOF set global auto_cd_after_create=false /backstores/fileio create shared-01.v101.campusvl.de /iscsi1/shared-01.v101.campusvl.de size=500G buffered=true /iscsi create iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.4 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.5 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.4 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.5 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/luns create /backstores/fileio/shared-01.v101.campusvl.de lun=10 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1 saveconfig yes EOF targetcli lio-v101.conf And configured a fresh booted ESXi 5.5.0 1331820 via autodeploy to the iSCSI target, configured the portal, rescanned and created a 500 GB VMFS 5 filesystem and noticed the time if it was longer than 2 minutes it was bad if it was below 10 seconds it was good. git bisect good/bad My network config is: auto bond0 iface bond0 inet static address 10.100.4.62 netmask 255.255.0.0 gateway 10.100.0.1 slaves eth0 eth1 bond-mode 802.3ad bond-miimon 100 auto bond0.101 iface bond0.101 inet static address 10.101.99.4 netmask 255.255.0.0 auto bond1 iface bond1 inet static address 10.100.5.62 netmask 255.255.0.0 slaves eth2 eth3 bond-mode 802.3ad bond-miimon 100 auto bond1.101 iface bond1.101 inet static address 10.101.99.5 netmask 255.255.0.0 I propose to disable tcp_autocorking by default because it obviously degrades iSCSI performance and probably many other protocols. Also the
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 10:38 +0100, Thomas Glanzmann wrote: Hello Eric, [RESEND: the time it took the VMFS was created was switched between on/off so with on it took over 2 minutes with off it took less than 4 seconds] [RESEND 2: The throughput graphs were switched as well ;-(] * Thomas Glanzmann tho...@glanzmann.de [2014-02-07 08:55]: Creating a 4 TB VMFS filesystem over iSCSI takes 24 seconds on 3.12 and 15 minutes on 3.14.0-rc2+. * Nicholas A. Bellinger n...@linux-iscsi.org [2014-02-07 20:30]: Would it be possible to try a couple of different stable kernel versions to help track this down? I bisected[1] it and found the offending commit f54b311 tcp auto corking [2] 'if we have a small send and a previous packet is already in the qdisc or device queue, defer until TX completion or we get more data.' - Description by David S. Miller I gathered a pcap with tcp_autocorking on and off. On: - took 2 minutes 24 seconds to create a 500 GB VMFS file system https://thomas.glanzmann.de/tmp/tcp_auto_corking_on.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:46:34.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:52:28.png Off: - took 4 seconds to create a 500 GB VMFS file system sysctl net.ipv4.tcp_autocorking=0 https://thomas.glanzmann.de/tmp/tcp_auto_corking_off.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:45:43.png https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-09:53:17.png First graph can be generated by opening bunziping the file, opening it in wireshark and select Statistics IO Grap and change the unit to Bytes/Tick. The second graph can be generated by selecting Statistics TCP Stream Graph Round Trip Time. You can also see that the round trip time increases by factor 25 at least. I once saw a similar problem with dealyed ACK packets of the paravirtulized network driver in xen it caused that the tcp window filled up and slowed down the throughput from 30 MB/s to less than 100 KB/s the symptom was that the login to a Windows desktop took more than 10 minutes while it used to be below 30 seconds because the profile of the user was loaded slowly from a CIFS server. At that time the culprit were also delayed small packets: ACK packets in the CIFS case. However I only proofed iSCSI regression so far for tcp auto corking but assume we will see many others if we leave it enabled. I found the problem by doing the following: - I compiled kernel by executing the following commands: yes '' | make oldconfig time make -j 24 / make modules_install / mkinitramfs -o /boot/initrd.img-bisect version - I cleaned the iSCSI configuration after each test by issuing: /etc/init.d/target stop rm /iscsi?/* /etc/target/* - I configured iSCSI after each reboot cat lio-v101.conf EOF set global auto_cd_after_create=false /backstores/fileio create shared-01.v101.campusvl.de /iscsi1/shared-01.v101.campusvl.de size=500G buffered=true /iscsi create iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.4 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.101.99.5 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.4 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/portals create 10.102.99.5 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/luns create /backstores/fileio/shared-01.v101.campusvl.de lun=10 /iscsi/iqn.2013-03.de.campusvl.v101.storage:shared-01.v101.campusvl.de/tpgt1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1 saveconfig yes EOF targetcli lio-v101.conf And configured a fresh booted ESXi 5.5.0 1331820 via autodeploy to the iSCSI target, configured the portal, rescanned and created a 500 GB VMFS 5 filesystem and noticed the time if it was longer than 2 minutes it was bad if it was below 10 seconds it was good. git bisect good/bad My network config is: auto bond0 iface bond0 inet static address 10.100.4.62 netmask 255.255.0.0 gateway 10.100.0.1 slaves eth0 eth1 bond-mode 802.3ad bond-miimon 100 auto bond0.101 iface bond0.101 inet static address 10.101.99.4 netmask 255.255.0.0 auto bond1 iface bond1 inet static address 10.100.5.62 netmask 255.255.0.0 slaves eth2 eth3 bond-mode 802.3ad bond-miimon 100 auto bond1.101 iface bond1.101 inet static
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote: Here is the combined patch, could you test it ? Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d (tcp: autocork should not hold first packet in write queue) in your tree. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, tcp corking kills iSCSI performance Here is the combined patch, could you test it? the patch did not apply, so I edited by hand. Here is the resulting patch: diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 03d26b8..40d1958 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -698,7 +698,8 @@ static void tcp_tsq_handler(struct sock *sk) if ((1 sk-sk_state) (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING | TCPF_CLOSE_WAIT | TCPF_LAST_ACK)) - tcp_write_xmit(sk, tcp_current_mss(sk), 0, 0, GFP_ATOMIC); + tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)-nonagle, + 0, GFP_ATOMIC); } /* * One tasklet per cpu tries to send more skbs. @@ -1904,7 +1905,16 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, if (atomic_read(sk-sk_wmem_alloc) limit) { set_bit(TSQ_THROTTLED, tp-tsq_flags); - break; + /* It is possible TX completion already happened +* before we set TSQ_THROTTLED, so we must +* test again the condition. +* We abuse smp_mb__after_clear_bit() because +* there is no smp_mb__after_set_bit() yet +*/ + smp_mb__after_clear_bit(); + if (atomic_read(sk-sk_wmem_alloc) limit) + break; + } limit = mss_now; -- cut here -- It fixes my case but if you look at the round trip time it is not even close what it used to be. So while this fixes my problem I'm still for disabling it by default. https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-14:36:25.png Cheers, Thomas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d (tcp: autocork should not hold first packet in write queue) in your tree. confirmed: (node-62) [~/work/linux-2.6] git show a181ceb501b31b4bf8812a5c84c716cc31d82c2d | head commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d Author: Eric Dumazet eduma...@google.com Date: Tue Dec 17 09:58:30 2013 -0800 tcp: autocork should not hold first packet in write queue Willem noticed a TCP_RR regression caused by TCP autocorking on a Mellanox test bed. MLX4_EN_TX_COAL_TIME is 16 us, which can be right above RTT between hosts. Cheers, Thomas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 05:33 -0800, Eric Dumazet wrote: On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote: Here is the combined patch, could you test it ? Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d (tcp: autocork should not hold first packet in write queue) in your tree. BTW this problem demonstrates there is room for improvement in iCSCI, using MSG_MORE to avoid sending two small segments in separate frames. [1] 00:32:35.726568 IP 10.101.99.5.3260 10.101.0.13.27778: Flags [P.], seq 145:193, ack 144, win 235, options [nop,nop,TS val 4294960733 ecr 385385], length 48 [2] 00:32:35.838074 IP 10.101.0.13.27778 10.101.99.5.3260: Flags [.], ack 193, win 514, options [nop,nop,TS val 385396 ecr 4294960733], length 0 [3] 00:32:35.838099 IP 10.101.99.5.3260 10.101.0.13.27778: Flags [P.], seq 193:705, ack 144, win 235, options [nop,nop,TS val 4294960761 ecr 385396], length 512 [1] [3] could be coalesced, and [2] would be avoided. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 14:37 +0100, Thomas Glanzmann wrote: Hello Eric, It fixes my case but if you look at the round trip time it is not even close what it used to be. So while this fixes my problem I'm still for disabling it by default. https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched.pcap.bz2 https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-14:36:25.png Very nice. Now we have to check your NIC and how TX completion is performed. What is your NIC model and driver ? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, What is your NIC model and driver? I have four Intel Corporation I350 Gigabit Network Connection (rev 01). (node-62) [~/work/linux-2.6] lspci -v | pbot http://pbot.rmdir.de/rgu6yHMBDVQpflMmbcJACg (node-62) [~/work/linux-2.6] ip a s | pbot http://pbot.rmdir.de/xJjRT8u-ekC6mrWgl09ZtQ (node-62) [~/work/linux-2.6] dmesg | pbot http://pbot.rmdir.de/MigrSPtxGmp0fI1CRgXsHw I do 802.3ad link aggregation layer 2 hash with two network cards to one switch. I'm running: Linux node-62 3.14.0-rc1+ #23 SMP Sat Feb 8 14:27:47 CET 2014 x86_64 GNU/Linux Driver: igb If you need remote access to the machine let me know. Cheers, Thomas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 14:37 +0100, Thomas Glanzmann wrote: It fixes my case but if you look at the round trip time it is not even close what it used to be. So while this fixes my problem I'm still for disabling it by default. https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched.pcap.bz2 This pcap was taken on which host ? 10.101.99.5 or 10.101.0.13 ? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, [RESEND: dropped CC accidently] 10.101.99.5 or 10.101.0.13? 10.101.99.5 (iSCSI Target) tcpdump -i bond0.101 -s 0 -w /tmp/tcp_auto_corking_on_patched.pcap host esx-03.v101.campusvl.de Cheers, Thomas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 05:50 -0800, Eric Dumazet wrote: On Sat, 2014-02-08 at 05:33 -0800, Eric Dumazet wrote: On Sat, 2014-02-08 at 05:14 -0800, Eric Dumazet wrote: Here is the combined patch, could you test it ? Also make sure you have commit a181ceb501b31b4bf8812a5c84c716cc31d82c2d (tcp: autocork should not hold first packet in write queue) in your tree. BTW this problem demonstrates there is room for improvement in iCSCI, using MSG_MORE to avoid sending two small segments in separate frames. [1] 00:32:35.726568 IP 10.101.99.5.3260 10.101.0.13.27778: Flags [P.], seq 145:193, ack 144, win 235, options [nop,nop,TS val 4294960733 ecr 385385], length 48 [2] 00:32:35.838074 IP 10.101.0.13.27778 10.101.99.5.3260: Flags [.], ack 193, win 514, options [nop,nop,TS val 385396 ecr 4294960733], length 0 [3] 00:32:35.838099 IP 10.101.99.5.3260 10.101.0.13.27778: Flags [P.], seq 193:705, ack 144, win 235, options [nop,nop,TS val 4294960761 ecr 385396], length 512 [1] [3] could be coalesced, and [2] would be avoided. With the fix, new pcap is more explicit about this suboptimal behavior : 05:34:16.280900 IP 10.101.0.13.41531 10.101.99.5.3260: Flags [.], ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 05:34:16.280949 IP 10.101.0.13.41531 10.101.99.5.3260: Flags [P.], seq 5328:5376, ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 05:34:16.280982 IP 10.101.99.5.3260 10.101.0.13.41531: Flags [P.], seq 54353:54401, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 05:34:16.281000 IP 10.101.99.5.3260 10.101.0.13.41531: Flags [P.], seq 54401:54913, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 05:34:16.281107 IP 10.101.0.13.41531 10.101.99.5.3260: Flags [.], ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 05:34:16.281157 IP 10.101.0.13.41531 10.101.99.5.3260: Flags [P.], seq 5376:5424, ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 05:34:16.281190 IP 10.101.99.5.3260 10.101.0.13.41531: Flags [P.], seq 54913:54961, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 05:34:16.281208 IP 10.101.99.5.3260 10.101.0.13.41531: Flags [P.], seq 54961:55473, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 05:34:16.281337 IP 10.101.0.13.41531 10.101.99.5.3260: Flags [.], ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 05:34:16.281390 IP 10.101.0.13.41531 10.101.99.5.3260: Flags [P.], seq 5424:5472, ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 05:34:16.281423 IP 10.101.99.5.3260 10.101.0.13.41531: Flags [P.], seq 55473:55521, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 05:34:16.281440 IP 10.101.99.5.3260 10.101.0.13.41531: Flags [P.], seq 55521:56033, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, BTW this problem demonstrates there is room for improvement in iCSCI, using MSG_MORE to avoid sending two small segments in separate frames. With the fix, new pcap is more explicit about this suboptimal behavior : 05:34:16.280900 IP 10.101.0.13.41531 10.101.99.5.3260: Flags [.], ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 05:34:16.280949 IP 10.101.0.13.41531 10.101.99.5.3260: Flags [P.], seq 5328:5376, ack 54353, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 05:34:16.280982 IP 10.101.99.5.3260 10.101.0.13.41531: Flags [P.], seq 54353:54401, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 05:34:16.281000 IP 10.101.99.5.3260 10.101.0.13.41531: Flags [P.], seq 54401:54913, ack 5376, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 05:34:16.281107 IP 10.101.0.13.41531 10.101.99.5.3260: Flags [.], ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 05:34:16.281157 IP 10.101.0.13.41531 10.101.99.5.3260: Flags [P.], seq 5376:5424, ack 54913, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 05:34:16.281190 IP 10.101.99.5.3260 10.101.0.13.41531: Flags [P.], seq 54913:54961, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 05:34:16.281208 IP 10.101.99.5.3260 10.101.0.13.41531: Flags [P.], seq 54961:55473, ack 5424, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 05:34:16.281337 IP 10.101.0.13.41531 10.101.99.5.3260: Flags [.], ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 0 05:34:16.281390 IP 10.101.0.13.41531 10.101.99.5.3260: Flags [P.], seq 5424:5472, ack 55473, win 514, options [nop,nop,TS val 1732452 ecr 4294935370], length 48 05:34:16.281423 IP 10.101.99.5.3260 10.101.0.13.41531: Flags [P.], seq 55473:55521, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 48 05:34:16.281440 IP 10.101.99.5.3260 10.101.0.13.41531: Flags [P.], seq 55521:56033, ack 5472, win 235, options [nop,nop,TS val 4294935370 ecr 1732452], length 512 I get the idea. However I'm a little bit confused, when I do a 'git grep MSG_MORE' I don't see much references in the Linux kernel who use it at all. So do you have an example for me where this flags needs to be applied? Cheers, Thomas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 15:19 +0100, Thomas Glanzmann wrote: Hello Eric, I get the idea. However I'm a little bit confused, when I do a 'git grep MSG_MORE' I don't see much references in the Linux kernel who use it at all. So do you have an example for me where this flags needs to be applied? Idea would be to set this flag when calling sendmsg() of the 48 bytes of the header, and not set it on the sendmsg() of the 512 bytes of the payload. iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but it would be nice to add a new _initial_ flags parameter to iscsi_sw_tcp_xmit_segment() -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, Idea would be to set this flag when calling sendmsg() of the 48 bytes of the header, and not set it on the sendmsg() of the 512 bytes of the payload. I see. iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but it would be nice to add a new _initial_ flags parameter to iscsi_sw_tcp_xmit_segment() This is for the iscsi initiator implementation. I'm interested in iSCSI target code, but I already found it and experiemented a little bit, but I need to dig deeper if I want to prepare a patch. Cheers, Thomas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 16:00 +0100, Thomas Glanzmann wrote: Hello Eric, Idea would be to set this flag when calling sendmsg() of the 48 bytes of the header, and not set it on the sendmsg() of the 512 bytes of the payload. I see. iscsi_sw_tcp_xmit_segment() already adds MSG_MORE, but it would be nice to add a new _initial_ flags parameter to iscsi_sw_tcp_xmit_segment() This is for the iscsi initiator implementation. I'm interested in iSCSI target code, but I already found it and experiemented a little bit, but I need to dig deeper if I want to prepare a patch. Fantastic ! Let me know if you want some help. Note : We did some patches in the MSG_MORE logic for sendpage(), but in your case I do not think its related (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, Note : We did some patches in the MSG_MORE logic for sendpage(), but in your case I do not think its related (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious thank you for the pointer. The iSCSI target code actually uses sendpage whenever it can. Cheers, Thomas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 17:57 +0100, Thomas Glanzmann wrote: Hello Eric, Note : We did some patches in the MSG_MORE logic for sendpage(), but in your case I do not think its related (git grep -n MSG_SENDPAGE_NOTLAST ) if you are curious thank you for the pointer. The iSCSI target code actually uses sendpage whenever it can. Yep, but the problem (at least on your pcap), is about sending the 48 bytes headers in TCP segment of its own, then the 512 byte payload in a separate segment. I suspect the sendpage() is only used for the payload. No need for MSG_MORE here. The MSG_MORE would need to be set on the first part (48 bytes header), so that TCP stack will defer the push of the segment at the time the 512 bytes payload is added. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, Yep, but the problem (at least on your pcap), is about sending the 48 bytes headers in TCP segment of its own, then the 512 byte payload in a separate segment. I agree. I suspect the sendpage() is only used for the payload. No need for MSG_MORE here. I see. The MSG_MORE would need to be set on the first part (48 bytes header), so that TCP stack will defer the push of the segment at the time the 512 bytes payload is added. The iSCSI target uses one function to send all outbound data. So in order to do it right every function that is sending data in multiple chunks need to mark it correctly. Of course someone could also do some wild guessing and saying that everything that is below 512 Bytes gets pushed out. I wonder what Nab has to say about this? Cheers, Thomas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 18:15 +0100, Thomas Glanzmann wrote: The iSCSI target uses one function to send all outbound data. So in order to do it right every function that is sending data in multiple chunks need to mark it correctly. Of course someone could also do some wild guessing and saying that everything that is below 512 Bytes gets pushed out. I wonder what Nab has to say about this? I was simply thinking about something like : (might need further changes, but I guess this should solve your case) diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c index 0819e688a398..44f0d62a88d6 100644 --- a/drivers/target/iscsi/iscsi_target_util.c +++ b/drivers/target/iscsi/iscsi_target_util.c @@ -1165,7 +1165,7 @@ send_data: iov_count = cmd-iov_misc_count; } - tx_sent = tx_data(conn, iov[0], iov_count, tx_size); + tx_sent = tx_data(conn, iov[0], iov_count, tx_size, 0); if (tx_size != tx_sent) { if (tx_sent == -EAGAIN) { pr_err(tx_data() returned -EAGAIN\n); @@ -1196,7 +1196,8 @@ send_hdr: iov.iov_base = cmd-pdu; iov.iov_len = tx_hdr_size; - tx_sent = tx_data(conn, iov, 1, tx_hdr_size); + data_len = cmd-tx_size - tx_hdr_size - cmd-padding; + tx_sent = tx_data(conn, iov, 1, tx_hdr_size, data_len ? MSG_MORE : 0); if (tx_hdr_size != tx_sent) { if (tx_sent == -EAGAIN) { pr_err(tx_data() returned -EAGAIN\n); @@ -1205,7 +1206,6 @@ send_hdr: return -1; } - data_len = cmd-tx_size - tx_hdr_size - cmd-padding; /* * Set iov_off used by padding and data digest tx_data() calls below * in order to determine proper offset into cmd-iov_data[] @@ -1249,7 +1249,8 @@ send_padding: if (cmd-padding) { struct kvec *iov_p = cmd-iov_data[iov_off++]; - tx_sent = tx_data(conn, iov_p, 1, cmd-padding); + tx_sent = tx_data(conn, iov_p, 1, cmd-padding, + conn-conn_ops-DataDigest ? MSG_MORE : 0); if (cmd-padding != tx_sent) { if (tx_sent == -EAGAIN) { pr_err(tx_data() returned -EAGAIN\n); @@ -1263,7 +1264,7 @@ send_datacrc: if (conn-conn_ops-DataDigest) { struct kvec *iov_d = cmd-iov_data[iov_off]; - tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN); + tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN, 0); if (ISCSI_CRC_LEN != tx_sent) { if (tx_sent == -EAGAIN) { pr_err(tx_data() returned -EAGAIN\n); @@ -1349,11 +1350,12 @@ static int iscsit_do_rx_data( static int iscsit_do_tx_data( struct iscsi_conn *conn, - struct iscsi_data_count *count) + struct iscsi_data_count *count, + int flags) { int data = count-data_length, total_tx = 0, tx_loop = 0, iov_len; struct kvec *iov_p; - struct msghdr msg; + struct msghdr msg = { .msg_flags = flags }; if (!conn || !conn-sock || !conn-conn_ops) return -1; @@ -1363,8 +1365,6 @@ static int iscsit_do_tx_data( return -1; } - memset(msg, 0, sizeof(struct msghdr)); - iov_p = count-iov; iov_len = count-iov_count; @@ -1408,7 +1408,8 @@ int tx_data( struct iscsi_conn *conn, struct kvec *iov, int iov_count, - int data) + int data, + int flags) { struct iscsi_data_count c; @@ -1421,7 +1422,7 @@ int tx_data( c.data_length = data; c.type = ISCSI_TX_DATA; - return iscsit_do_tx_data(conn, c); + return iscsit_do_tx_data(conn, c, flags); } void iscsit_collect_login_stats( diff --git a/drivers/target/iscsi/iscsi_target_util.h b/drivers/target/iscsi/iscsi_target_util.h index e4fc34a02f57..1b4f06801adc 100644 --- a/drivers/target/iscsi/iscsi_target_util.h +++ b/drivers/target/iscsi/iscsi_target_util.h @@ -54,7 +54,7 @@ extern int iscsit_print_dev_to_proc(char *, char **, off_t, int); extern int iscsit_print_sessions_to_proc(char *, char **, off_t, int); extern int iscsit_print_tpg_to_proc(char *, char **, off_t, int); extern int rx_data(struct iscsi_conn *, struct kvec *, int, int); -extern int tx_data(struct iscsi_conn *, struct kvec *, int, int); +extern int tx_data(struct iscsi_conn *, struct kvec *, int, int, int); extern void iscsit_collect_login_stats(struct iscsi_conn *, u8, u8); extern struct iscsi_tiqn *iscsit_snmp_get_tiqn(struct iscsi_conn *); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, I was simply thinking about something like : (might need further changes, but I guess this should solve your case) thank you for your patch. It did not apply on top of Linux tip, so I put in the changes manually and fixed up another call to tx_data that your forgot in your initial patch to make it apply. I gave it another run, can you confirm that it now behaves better? https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched_tcp_more.pcap.bz2 And look at that roundtrip graph it is perfect. Also filesystem is now created in 3 seconds instead of 4. https://thomas.glanzmann.de/tmp/screenshot-mini-2014-02-08-22:34:57.png Nab, do you consider this patch for upstream? Would you take if I clean it up? Cheers, Thomas PS: I'm asleep for the next 8 hours. diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c index e655b04..0eb9681 100644 --- a/drivers/target/iscsi/iscsi_target_util.c +++ b/drivers/target/iscsi/iscsi_target_util.c @@ -1168,7 +1168,7 @@ send_data: iov_count = cmd-iov_misc_count; } - tx_sent = tx_data(conn, iov[0], iov_count, tx_size); + tx_sent = tx_data(conn, iov[0], iov_count, tx_size, 0); if (tx_size != tx_sent) { if (tx_sent == -EAGAIN) { pr_err(tx_data() returned -EAGAIN\n); @@ -1199,7 +1199,8 @@ send_hdr: iov.iov_base = cmd-pdu; iov.iov_len = tx_hdr_size; - tx_sent = tx_data(conn, iov, 1, tx_hdr_size); + data_len = cmd-tx_size - tx_hdr_size - cmd-padding; +tx_sent = tx_data(conn, iov, 1, tx_hdr_size, data_len ? MSG_MORE : 0); if (tx_hdr_size != tx_sent) { if (tx_sent == -EAGAIN) { pr_err(tx_data() returned -EAGAIN\n); @@ -1208,7 +1209,6 @@ send_hdr: return -1; } - data_len = cmd-tx_size - tx_hdr_size - cmd-padding; /* * Set iov_off used by padding and data digest tx_data() calls below * in order to determine proper offset into cmd-iov_data[] @@ -1252,7 +1252,8 @@ send_padding: if (cmd-padding) { struct kvec *iov_p = cmd-iov_data[iov_off++]; - tx_sent = tx_data(conn, iov_p, 1, cmd-padding); + tx_sent = tx_data(conn, iov_p, 1, cmd-padding, + conn-conn_ops-DataDigest ? MSG_MORE : 0); if (cmd-padding != tx_sent) { if (tx_sent == -EAGAIN) { pr_err(tx_data() returned -EAGAIN\n); @@ -1266,7 +1267,7 @@ send_datacrc: if (conn-conn_ops-DataDigest) { struct kvec *iov_d = cmd-iov_data[iov_off]; - tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN); + tx_sent = tx_data(conn, iov_d, 1, ISCSI_CRC_LEN, 0); if (ISCSI_CRC_LEN != tx_sent) { if (tx_sent == -EAGAIN) { pr_err(tx_data() returned -EAGAIN\n); @@ -1352,11 +1353,13 @@ static int iscsit_do_rx_data( static int iscsit_do_tx_data( struct iscsi_conn *conn, - struct iscsi_data_count *count) + struct iscsi_data_count *count, + int flags) { int data = count-data_length, total_tx = 0, tx_loop = 0, iov_len; struct kvec *iov_p; struct msghdr msg; +struct msghdr msg = { .msg_flags = flags }; if (!conn || !conn-sock || !conn-conn_ops) return -1; @@ -1366,8 +1369,6 @@ static int iscsit_do_tx_data( return -1; } - memset(msg, 0, sizeof(struct msghdr)); - iov_p = count-iov; iov_len = count-iov_count; @@ -1411,7 +1412,8 @@ int tx_data( struct iscsi_conn *conn, struct kvec *iov, int iov_count, - int data) + int data, + int flags) { struct iscsi_data_count c; @@ -1424,7 +1426,7 @@ int tx_data( c.data_length = data; c.type = ISCSI_TX_DATA; - return iscsit_do_tx_data(conn, c); + return iscsit_do_tx_data(conn, c, flags); } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
On Sat, 2014-02-08 at 22:36 +0100, Thomas Glanzmann wrote: Hello Eric, I was simply thinking about something like : (might need further changes, but I guess this should solve your case) thank you for your patch. It did not apply on top of Linux tip, so I put in the changes manually and fixed up another call to tx_data that your forgot in your initial patch to make it apply. I gave it another run, can you confirm that it now behaves better? https://thomas.glanzmann.de/tmp/tcp_auto_corking_on_patched_tcp_more.pcap.bz2 And look at that roundtrip graph it is perfect. Also filesystem is now created in 3 seconds instead of 4. Yes, this is much better : 2 frames per request/response, instead of 4. 13:32:04.665367 IP 10.101.0.12.43418 10.101.99.5.3260: Flags [P.], seq 384:432, ack 2529, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.665483 IP 10.101.99.5.3260 10.101.0.12.43418: Flags [P.], seq 2529:3089, ack 432, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.665642 IP 10.101.0.12.43418 10.101.99.5.3260: Flags [P.], seq 432:480, ack 3089, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.665756 IP 10.101.99.5.3260 10.101.0.12.43418: Flags [P.], seq 3089:3649, ack 480, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.665933 IP 10.101.0.12.43418 10.101.99.5.3260: Flags [P.], seq 480:528, ack 3649, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.666046 IP 10.101.99.5.3260 10.101.0.12.43418: Flags [P.], seq 3649:4209, ack 528, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.666214 IP 10.101.0.12.43418 10.101.99.5.3260: Flags [P.], seq 528:576, ack 4209, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.666333 IP 10.101.99.5.3260 10.101.0.12.43418: Flags [P.], seq 4209:4769, ack 576, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.78 IP 10.101.0.12.43418 10.101.99.5.3260: Flags [P.], seq 576:624, ack 4769, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.666790 IP 10.101.99.5.3260 10.101.0.12.43418: Flags [P.], seq 4769:5329, ack 624, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.666983 IP 10.101.0.12.43418 10.101.99.5.3260: Flags [P.], seq 624:672, ack 5329, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.667097 IP 10.101.99.5.3260 10.101.0.12.43418: Flags [P.], seq 5329:5889, ack 672, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.667280 IP 10.101.0.12.43418 10.101.99.5.3260: Flags [P.], seq 672:720, ack 5889, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.667324 IP 10.101.99.5.3260 10.101.0.12.43418: Flags [P.], seq 5889:6449, ack 720, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 13:32:04.667500 IP 10.101.0.12.43418 10.101.99.5.3260: Flags [P.], seq 720:768, ack 6449, win 514, options [nop,nop,TS val 1576981 ecr 4294913967], length 48 13:32:04.667540 IP 10.101.99.5.3260 10.101.0.12.43418: Flags [P.], seq 6449:7009, ack 768, win 235, options [nop,nop,TS val 4294913967 ecr 1576981], length 560 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]
Hello Eric, Yes, this is much better : 2 frames per request/response, instead of 4. perfect. I send out the page to the iscsi target list in your name since you did the work and I added me as signed off I hope that is how it is handled or should I have added my name to the from line and mentioned in the description of the patch that you did the heavy lifting? Cheers, Thomas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/