[GitHub] [spark] turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait

2020-03-25 Thread GitBox
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection 
while last connection failed in the last retry IO wait
URL: https://github.com/apache/spark/pull/27943#issuecomment-603876706
 
 
   Thanks, I have updated the description.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait

2020-03-25 Thread GitBox
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection 
while last connection failed in the last retry IO wait
URL: https://github.com/apache/spark/pull/27943#issuecomment-603730142
 
 
   The UTs had passed before; the latest test run was killed manually.
   cc @cloud-fan  @Ngone51 





[GitHub] [spark] turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait

2020-03-24 Thread GitBox
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection 
while last connection failed in the last retry IO wait
URL: https://github.com/apache/spark/pull/27943#issuecomment-603163006
 
 
   Agreed, the fast-fail time window should be a little shorter than 
conf.ioRetryWaitTimeMs().
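A rough sketch of that check, assuming a per-pool failure timestamp (the names `FastFailWindow`, `io_retry_wait_ms`, and `margin_ms` are illustrative, not the actual Spark fields):

```python
import time


class FastFailWindow:
    """Fail new connection attempts fast while a recent failure is fresh.

    The window is kept slightly shorter than the IO retry wait, so a retry
    that is scheduled after a full retry wait is never fast-failed.
    """

    def __init__(self, io_retry_wait_ms, margin_ms=50):
        self.window_ms = io_retry_wait_ms - margin_ms
        self.last_failure_ms = None  # one per client pool / remote address

    def record_failure(self):
        self.last_failure_ms = time.monotonic() * 1000

    def should_fast_fail(self):
        if self.last_failure_ms is None:
            return False
        elapsed_ms = time.monotonic() * 1000 - self.last_failure_ms
        return elapsed_ms < self.window_ms
```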
   
   
   
   > The only other question I have is connections not going through the 
retryingblockfetcher, this could potentially fail them much faster, if its a 
one time fetch is that what we want. I would need to look a bit more at the 
usages there.
   
   I think we should only fast fail the connection when conf.maxIORetries > 0.
   
   
   
   >Also I'm curious could this lead to the scenario that, when you have two 
tasks and only one client, the connection request from the second task may fail 
fast every time it tries to connect (because the connection from the first task 
always fail beforehand)?
   
   I think if the service address is still unreachable after maxIORetries 
retries, it is acceptable to fail the other task as well.
   





[GitHub] [spark] turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait

2020-03-23 Thread GitBox
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection 
while last connection failed in the last retry IO wait
URL: https://github.com/apache/spark/pull/27943#issuecomment-602625438
 
 
   Thanks for the reply @tgravescs 
   Sorry for the unclear description. 
   
   By `All connections` above, I meant the request connections sent to the 
same unreachable address.
   
   It was my mistake not to recognize that there may be several clients for 
the same address; maybe we need to keep a lastConnectionFailedTime variable 
per client pool.
   
   The problem is that, for a single task, there may be several request 
connections to the same address.
   In particular, for a shuffle read task there may be only one client in the 
client pool, and it is always picked by every connection that wants to reach 
the same ESS.
   If that address is unreachable, these connections block each other (inside 
createClient).
   These connections belong to the same task and target the same ESS, so if 
the ESS stays unreachable it costs connectionNum \* connectionTimeOut \* 
maxRetry to retry before the task fails.
   Ideally, the task would fail within connectionTimeOut \* maxRetry.
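With illustrative numbers (a 2-minute connection timeout, 3 retries, and 5 blocked connections from one task to the same ESS — example values, not Spark defaults), the gap looks like this:

```python
connection_timeout_min = 2   # per-attempt connect timeout (minutes)
max_retry = 3                # IO retry attempts per connection
num_connections = 5          # request connections from one task to one ESS

# Without fast fail: connections serialize on the client lock, and each
# one exhausts its own retries against the unreachable ESS.
worst_case_min = num_connections * connection_timeout_min * max_retry

# Ideal: the task fails after a single connection's retries.
ideal_min = connection_timeout_min * max_retry

print(worst_case_min, ideal_min)  # 30 6
```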
   
   
   
   
   
   

   





[GitHub] [spark] turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait

2020-03-23 Thread GitBox
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection 
while last connection failed in the last retry IO wait
URL: https://github.com/apache/spark/pull/27943#issuecomment-602592168
 
 
   > > > Currently we just run and timeout 3 times, and this PR proposes to 
fail fast.
   > > 
   > > 
   > > We should not be failing without retrying. Is that really what this 
does? I'd have to take a closer look but I thought the RetryingBlockFetcher 
caught this and did its normal retries within it, but that was my question 
yesterday to confirm?
   > 
   > @cloud-fan @tgravescs IIUC, this PR only fail fast in a single one 
connection try but will still retry if it's a `RetryingBlockFetcher`.
   
   In fact, the current implementation in this patch fast fails all 
connections.
   The compromise I proposed in the comments is to fast fail just a single 
connection.
   
   I prefer to fast fail all connections related to the unreachable ESS.





[GitHub] [spark] turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait

2020-03-19 Thread GitBox
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection 
while last connection failed in the last retry IO wait
URL: https://github.com/apache/spark/pull/27943#issuecomment-601209554
 
 
   Just attaching the example mentioned in the description.
   > For example: there are two request connections, rc1 and rc2.
   Here, io.numConnectionsPerPeer is 1 and the connection timeout is 2 
minutes.
   1: rc1 holds the client lock and times out after 2 minutes.
   2: rc2 holds the client lock and times out after 2 minutes.
   3: rc1 starts its second retry, holds the lock, and times out after 2 minutes.
   4: rc2 starts its second retry, holds the lock, and times out after 2 minutes.
   5: rc1 starts its third retry, holds the lock, and times out after 2 minutes.
   6: rc2 starts its third retry, holds the lock, and times out after 2 minutes.
   It wastes a lot of time.
   
   The concern is that, in some cases, these request connections block each 
other.
   If rc1's connect times out, we fast *break* the first retry of rc2 but do 
not increase rc2's retry count.
   rc1 then waits one IO retry wait, starts its second retry, and times out 
again; we fast break rc2 again without increasing its retry count.
   rc1 then waits another IO retry wait, starts its third retry, times out, 
and throws a fetch failed exception.
   
   
   I think this is better than letting the request connections block each other.
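The retry loop in that walkthrough could be sketched roughly as follows (a hypothetical shape; `FastFailException` and `fetch_with_retry` are illustrative names, not the PR's actual code):

```python
class FastFailException(Exception):
    """An attempt was broken fast because a peer just failed."""


class ConnectTimeout(Exception):
    """The connect attempt itself timed out."""


def fetch_with_retry(connect, max_retries):
    """Retry connect(); a fast-failed attempt does not consume a retry."""
    retry_count = 0
    while True:
        try:
            return connect()
        except FastFailException:
            # Broken fast because another connection to the same address
            # failed within the retry IO wait: do NOT count this attempt
            # against retry_count, just try again.
            continue
        except ConnectTimeout:
            retry_count += 1
            if retry_count > max_retries:
                raise  # surfaces as the fetch failed exception
```

In the real code each loop iteration would also sleep for the IO retry wait before reconnecting; that is elided here for brevity.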
   
   
   





[GitHub] [spark] turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait

2020-03-19 Thread GitBox
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection 
while last connection failed in the last retry IO wait
URL: https://github.com/apache/spark/pull/27943#issuecomment-601027932
 
 
   How about this: if the last connection failed within the last retry IO 
wait, the new connection is broken fast, but its retry count is not increased.





[GitHub] [spark] turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait

2020-03-17 Thread GitBox
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection 
while last connection failed in the last retry IO wait
URL: https://github.com/apache/spark/pull/27943#issuecomment-600432565
 
 
   I think it may happen in the cases below:
   - NodeManager (NM) GC
   - NodeManager crash
   - a temporary network issue





[GitHub] [spark] turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait

2020-03-17 Thread GitBox
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection 
while last connection failed in the last retry IO wait
URL: https://github.com/apache/spark/pull/27943#issuecomment-600431305
 
 
   Thanks for the reply.
   We hit this issue when the ESS (NodeManager) is busy with a full GC: the 
task takes a long time (connectionTimeout*maxRetry*(number of requests to the 
same ESS)) and then fails with a fetch failure.
   We expect the task to fail fast instead of wasting so much time waiting 
for the client lock.
   





[GitHub] [spark] turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection while last connection failed in the last retry IO wait

2020-03-17 Thread GitBox
turboFei commented on issue #27943: [SPARK-31179] Fast fail the connection 
while last connection failed in the last retry IO wait
URL: https://github.com/apache/spark/pull/27943#issuecomment-600396691
 
 
   cc @cloud-fan 

