[GitHub] [spark] turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait URL: https://github.com/apache/spark/pull/27943#issuecomment-603876706 Thanks, I have updated the description. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait URL: https://github.com/apache/spark/pull/27943#issuecomment-603730142 The UT had passed before; the latest test run was killed manually. cc @cloud-fan @Ngone51
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait URL: https://github.com/apache/spark/pull/27943#issuecomment-603163006 Agreed, the fast-fail time window should be a little shorter than conf.ioRetryWaitTimeMs().

> The only other question I have is connections not going through the RetryingBlockFetcher; this could potentially fail them much faster. If it's a one-time fetch, is that what we want? I would need to look a bit more at the usages there.

I think we should only fast fail the connection if conf.maxIORetries > 0.

> Also I'm curious: could this lead to the scenario that, when you have two tasks and only one client, the connection request from the second task fails fast every time it tries to connect (because the connection from the first task always fails beforehand)?

I think if the service address is still unreachable after maxIORetries retries, it does not matter that the other task also fails.
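The window check discussed above could look roughly like the following sketch. This is an illustrative Python model, not Spark's actual Java TransportClientFactory; the names `FastFailClientFactory`, `fast_fail_window_ms`, `io_retry_wait_ms`, and `max_io_retries` are hypothetical stand-ins for conf.ioRetryWaitTimeMs() and conf.maxIORetries().

```python
class FastFailClientFactory:
    """Hypothetical sketch of the fast-fail window check discussed above.

    Mirrors the intent of spark.shuffle.io.retryWait / spark.shuffle.io.maxRetries,
    but is illustrative, not Spark's TransportClientFactory.
    """

    def __init__(self, io_retry_wait_ms, max_io_retries):
        self.io_retry_wait_ms = io_retry_wait_ms
        self.max_io_retries = max_io_retries
        # The window is kept a little shorter than the retry wait, so a
        # retry scheduled after ioRetryWaitTimeMs is never rejected.
        self.fast_fail_window_ms = io_retry_wait_ms * 0.95
        self.last_failure_ms = None  # last connect failure to this address

    def should_fast_fail(self, now_ms):
        # Only fast fail when retries are enabled; a one-time fetch
        # (maxIORetries == 0) must still make a real connection attempt.
        if self.max_io_retries <= 0:
            return False
        if self.last_failure_ms is None:
            return False
        return (now_ms - self.last_failure_ms) < self.fast_fail_window_ms

    def record_failure(self, now_ms):
        self.last_failure_ms = now_ms
```

With a 5-second retry wait, a connection attempted right after a failure is rejected, while one attempted after the (slightly shorter) window has elapsed goes through normally.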
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait URL: https://github.com/apache/spark/pull/27943#issuecomment-602625438 Thanks for the reply @tgravescs, and sorry for the unclear description. The `all connections` I mentioned above are the request connections sent to the same unreachable address. It was my mistake not to recognize that there may be several clients for the same address; maybe we need to keep a lastConnectionFailedTime variable per clientPool. The problem is that a single task may issue several request connections to the same address. In particular, for a shuffle read task with only one client in the client pool, that client is always picked by every connection that wants to reach the same ESS. If this address is unreachable, these connections block each other (inside createClient). They belong to the same task and target the same ESS, so if the ESS stays unreachable it costs connectionNum \* connectionTimeOut \* maxRetry to exhaust the retries and fail the task. Ideally, the task should fail within connectionTimeOut \* maxRetry.
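The per-pool timestamp suggested here might be sketched as follows. This is a hypothetical Python model; in Spark the real pool lives inside the Java TransportClientFactory, and `last_connection_failed_time_ms` is the illustrative analogue of the proposed lastConnectionFailedTime.

```python
class ClientPool:
    """Simplified stand-in for the per-address client pool in Spark's
    TransportClientFactory; only the proposed field is of interest."""

    def __init__(self, num_connections_per_peer):
        self.clients = [None] * num_connections_per_peer
        # Shared by every client in the pool, so a failure observed by one
        # connection attempt is immediately visible to the others.
        self.last_connection_failed_time_ms = None


class ClientFactory:
    """Keeps one ClientPool per remote address (hypothetical sketch)."""

    def __init__(self, num_connections_per_peer):
        self.num_connections_per_peer = num_connections_per_peer
        self.pools = {}

    def pool_for(self, address):
        if address not in self.pools:
            self.pools[address] = ClientPool(self.num_connections_per_peer)
        return self.pools[address]

    def record_failure(self, address, now_ms):
        # One timestamp per address (pool), not per client.
        self.pool_for(address).last_connection_failed_time_ms = now_ms
```

Keeping the timestamp on the pool rather than on an individual client is what makes a failure seen through one client apply to all `numConnectionsPerPeer` clients for that address.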
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait URL: https://github.com/apache/spark/pull/27943#issuecomment-602592168

> > > Currently we just run and time out 3 times, and this PR proposes to fail fast.
> >
> > We should not be failing without retrying. Is that really what this does? I'd have to take a closer look, but I thought the RetryingBlockFetcher caught this and did its normal retries within it; that was my question yesterday to confirm.
>
> @cloud-fan @tgravescs IIUC, this PR only fails fast within a single connection attempt but will still retry if it's a `RetryingBlockFetcher`.

In fact, the current implementation in this patch fast fails all connections. I only proposed the compromise solution of fast failing a single connection in the comments. I prefer to fast fail all connections related to the unreachable ESS.
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait URL: https://github.com/apache/spark/pull/27943#issuecomment-601209554 Just to attach the example mentioned in the description:

> For example: there are two request connections, rc1 and rc2. In particular, io.numConnectionsPerPeer is 1 and the connection timeout is 2 minutes.
> 1. rc1 holds the client lock and times out after 2 minutes.
> 2. rc2 holds the client lock and times out after 2 minutes.
> 3. rc1 starts its second retry, holds the lock, and times out after 2 minutes.
> 4. rc2 starts its second retry, holds the lock, and times out after 2 minutes.
> 5. rc1 starts its third retry, holds the lock, and times out after 2 minutes.
> 6. rc2 starts its third retry, holds the lock, and times out after 2 minutes.
> It wastes a lot of time.

The concern is that, in some cases, these request connections block each other. If rc1's connect times out, we fast *break* the first retry of rc2 but do not increase rc2's retry count. Then rc1 waits one IO retry wait, starts its second retry, and times out again; we again fast break rc2 without increasing its retry count. Then rc1 waits one IO retry wait, starts its third retry, times out, and throws a fetch-failed exception. I think this is better than letting the request connections block each other.
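The arithmetic behind this timeline, under the example's assumed settings (2-minute connect timeout, 3 attempts per request connection, 2 request connections sharing one client), works out as follows; the variable names are illustrative.

```python
# Assumed settings from the example above (illustrative names).
CONNECT_TIMEOUT_MIN = 2   # connection timeout, in minutes
NUM_ATTEMPTS = 3          # first try plus retries per request connection
NUM_REQUESTS = 2          # rc1 and rc2 share one client (numConnectionsPerPeer = 1)

# Without fast-fail, every attempt of every request holds the client lock
# in turn and times out, as in steps 1-6 of the timeline:
blocked_total_min = CONNECT_TIMEOUT_MIN * NUM_ATTEMPTS * NUM_REQUESTS

# With the proposed fast-fail, rc2's attempts inside the retry-wait window
# are broken immediately (without consuming a retry), so rc1 reaches its
# fetch-failed exception after only its own timeouts (retry waits ignored
# here for simplicity):
fast_fail_total_min = CONNECT_TIMEOUT_MIN * NUM_ATTEMPTS

print(blocked_total_min, fast_fail_total_min)  # prints "12 6"
```

So in this small example the time to the first fetch-failed exception drops from 12 minutes of serialized timeouts to roughly 6 minutes plus the retry waits.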
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait URL: https://github.com/apache/spark/pull/27943#issuecomment-601027932 How about this: if the last connection failed within the last retry IO wait, the new connection is broken immediately, but its retry count is not increased.
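The "break without counting a retry" idea could be modeled with this minimal sketch; `FastFailBreak` and `fetch_with_retries` are hypothetical names, not Spark APIs.

```python
class FastFailBreak(Exception):
    """Hypothetical: raised when a connection attempt is broken early
    because another connection to the same address just failed."""


def fetch_with_retries(connect, max_retries):
    """Retry-loop sketch: a real connect failure consumes a retry, while a
    fast-fail break does not. In the real proposal the peer connection that
    caused the break eventually times out or succeeds, so the loop does not
    spin forever."""
    retries = 0
    while True:
        try:
            return connect()
        except FastFailBreak:
            continue  # broken early; retry count is unchanged
        except OSError:
            retries += 1
            if retries > max_retries:
                raise
```

A fetcher broken twice by fast-fail and once by a real timeout still succeeds on its next real attempt with max_retries=1, because only the timeout counted against the budget.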
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait URL: https://github.com/apache/spark/pull/27943#issuecomment-600432565 I think it may happen in the cases below:
- NodeManager GC
- NodeManager crash
- a temporary network issue
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait URL: https://github.com/apache/spark/pull/27943#issuecomment-600431305 Thanks for the reply. We hit this issue when the ESS (NodeManager) is busy with a full GC: the task spends a long time (connectionTimeout * maxRetry * (number of requests to the same ESS)) and then fails with a fetch failure. We expect the task to fail fast instead of wasting so much time waiting for the client lock.
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait URL: https://github.com/apache/spark/pull/27943#issuecomment-600396691 cc @cloud-fan