Re: Review Request 36664: Patch for KAFKA-2353
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92848 --- Ship it! Ship It! - Guozhang Wang On July 23, 2015, 12:51 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 23, 2015, 12:51 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Addressed Gwen's comments. Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
On July 22, 2015, 10:40 a.m., Ismael Juma wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 401 https://reviews.apache.org/r/36664/diff/2/?file=1018375#file1018375line401 Is it intentional to ignore `java.lang.Error` too? Jiangjie Qin wrote: I think java.lang.Error is a subclass of throwables. Ismael Juma wrote: Yes, it is and hence the question. To be clear: I am asking, do we want to catch java.lang.Error exceptions like `OutOfMemoryError` and `ThreadDeath` and swallow them instead of letting them propagate? Maybe this is what we want, but I want to make sure that's understood. Jiangjie Qin wrote: It largely depends why out of memory occured. For example, a highly compressable message might coming and eat huge amount of memory after decompression. In this case, we might see OOM. In this case, the broker will just throw away the message, return an error to client and move on. Or maybe the huge message get through and got written to disk so the memory got release, in that case we also wants to move on. We see such kind of situation from time to time. There might be some other OOM but generally speaking, we just want to skip that and continue. OK. It may be worth also adding a note about OOM and java.lang.Error to prevent other people from changing it in the future without understanding the consequences. - Ismael --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92574 --- On July 23, 2015, 12:51 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 23, 2015, 12:51 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Addressed Gwen's comments. Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
On July 22, 2015, 10:40 a.m., Ismael Juma wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 401 https://reviews.apache.org/r/36664/diff/2/?file=1018375#file1018375line401 Is it intentional to ignore `java.lang.Error` too? Jiangjie Qin wrote: I think java.lang.Error is a subclass of throwables. Ismael Juma wrote: Yes, it is and hence the question. To be clear: I am asking, do we want to catch java.lang.Error exceptions like `OutOfMemoryError` and `ThreadDeath` and swallow them instead of letting them propagate? Maybe this is what we want, but I want to make sure that's understood. It largely depends why out of memory occured. For example, a highly compressable message might coming and eat huge amount of memory after decompression. In this case, we might see OOM. In this case, the broker will just throw away the message, return an error to client and move on. Or maybe the huge message get through and got written to disk so the memory got release, in that case we also wants to move on. We see such kind of situation from time to time. There might be some other OOM but generally speaking, we just want to skip that and continue. - Jiangjie --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92574 --- On July 23, 2015, 12:51 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 23, 2015, 12:51 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Addressed Gwen's comments. Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
On July 22, 2015, 10:40 a.m., Ismael Juma wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 467 https://reviews.apache.org/r/36664/diff/2/?file=1018375#file1018375line467 As far as I can see `ClosedChannelException`, `IllegalStateException` and `IllegalArgumentException` are enough? Also, you would it be better to use `IOException` instead of `ClosedChannelException`? What happens if other exceptions are thrown? Will we still have a socket leak? Gwen Shapira wrote: Yeah, perhaps in addition to listing the expected cases, we should also handle nonFatal(e)? (https://tersesystems.com/2012/12/27/error-handling-in-scala/) If we handle `NonFatal(e)`, it may be that we don't want to list all the other exceptions as it's just boilerplate. On July 22, 2015, 10:40 a.m., Ismael Juma wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 401 https://reviews.apache.org/r/36664/diff/2/?file=1018375#file1018375line401 Is it intentional to ignore `java.lang.Error` too? Jiangjie Qin wrote: I think java.lang.Error is a subclass of throwables. Yes, it is and hence the question. To be clear: I am asking, do we want to catch java.lang.Error exceptions like `OutOfMemoryError` and `ThreadDeath` and swallow them instead of letting them propagate? Maybe this is what we want, but I want to make sure that's understood. - Ismael --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92574 --- On July 23, 2015, 12:51 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 23, 2015, 12:51 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Addressed Gwen's comments. Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
On July 22, 2015, 10:40 a.m., Ismael Juma wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 467 https://reviews.apache.org/r/36664/diff/2/?file=1018375#file1018375line467 As far as I can see `ClosedChannelException`, `IllegalStateException` and `IllegalArgumentException` are enough? Also, you would it be better to use `IOException` instead of `ClosedChannelException`? What happens if other exceptions are thrown? Will we still have a socket leak? Gwen Shapira wrote: Yeah, perhaps in addition to listing the expected cases, we should also handle nonFatal(e)? (https://tersesystems.com/2012/12/27/error-handling-in-scala/) Ismael Juma wrote: If we handle `NonFatal(e)`, it may be that we don't want to list all the other exceptions as it's just boilerplate. OK, this is how it was done in the end. Good. :) - Ismael --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92574 --- On July 23, 2015, 12:51 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 23, 2015, 12:51 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Addressed Gwen's comments. Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 465 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line465 Turns out that catching Throwable is a really bad idea: https://www.sumologic.com/2014/05/05/why-you-should-never-catch-throwable-in-scala/ Jiangjie Qin wrote: Ah... Didn't know that before. I explicitly listed the exceptions. Guozhang Wang wrote: Searching : Throwable gives me 180+ cases in code base :P Though many of them are from unit tests (which, arguably maybe OK) there are still a lot in the core package. I agree that we should avoid catching Throwable whenever possible, which will also help enforcing the developers to think about possible checked exceptions in the calling trace. Gwen Shapira wrote: I know :( I'm not sure if going over and converting everything is worth the effort. Although it can be a nice newbie jira. Jiangjie Qin wrote: Maybe we can simply change that to catch Throwables except ControlThrowables? That might be a simple search and replace. Gwen Shapira wrote: possible. definitely not in this JIRA though :) It's worth noting that the `ControlThrowable` thing is only an issue if `return` is used inside a closure or for comprehension and `return` is something to be avoided in Scala. Still, it would be nice to fix the issue. If we do, we should not add an additional clause everywhere. Instead, we should add an extractor like `NonFatal` (maybe called `NonControl`) to avoid unnecessary boilerplate. - Ismael --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92496 --- On July 23, 2015, 12:51 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 23, 2015, 12:51 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Addressed Gwen's comments. Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 400 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line400 So in case of unexpected exception, we log an error and keep running? Isn't it better to kill the processor, since we don't know what's the state of the system? If the acceptor keeps placing messages in the queue for a dead processor, isn't it a separate issue? Jiangjie Qin wrote: This part I'm not quite sure. I am not very experienced in the error handling in such case, so please correct me if I missed something. Here is what I thought. The way it currently works is that the acceptor will 1. accept new connection request and create new socket channel 2. choose a processor and put the socket channel into the processor's new connection queue The processor will just take the socket channels from the queue and register it to the selector. If the processor runs and get an uncaught exception, there are several possibilities. Case 1: The exception was from one socket channel. Case 2: The exception was associated with a bad request. In case 1, ideally we should just disconnect that socket channel without affecting other socket channels. In case 2, I think we should log the error and skip the message - assuming client will retry sending data if no response was received for a given peoriod of time. I am not sure if letting processor exit is a good idea because this will lead to the result of a badly behaving client screw the entire cluster - it might screw processors one by one. Comparing with that, I kind of leaning towards keeping the processor running and serving other normal TCP connections if possible, but log the error so monitoring system can detect and see if human intervention is needed. Also, I don't know what to do here to prevent the thread from exiting without catching all the throwables. According to this blog http://www.tzavellas.com/techblog/2010/09/20/catching-throwable-in-scala/ I guess I can rethrow all the ControlThrowables, but intercept the rests? I would also prefer not to close the Processor thread upon exceptions, mainly for avoid one bad client killing a shared Kafka cluster. In the past we have seen such issues like DDoS MetadataRequest killing the cluster and all other clients gets affected, etc, and the quota work is towards preventing it. Since Processor threads are shared (8 by default on a broker), it should not be closed by a single socket / bad client request. On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 465 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line465 Turns out that catching Throwable is a really bad idea: https://www.sumologic.com/2014/05/05/why-you-should-never-catch-throwable-in-scala/ Jiangjie Qin wrote: Ah... Didn't know that before. I explicitly listed the exceptions. Searching : Throwable gives me 180+ cases in code base :P Though many of them are from unit tests (which, arguably maybe OK) there are still a lot in the core package. I agree that we should avoid catching Throwable whenever possible, which will also help enforcing the developers to think about possible checked exceptions in the calling trace. - Guozhang --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92496 --- On July 22, 2015, 5:02 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 22, 2015, 5:02 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 465 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line465 Turns out that catching Throwable is a really bad idea: https://www.sumologic.com/2014/05/05/why-you-should-never-catch-throwable-in-scala/ Jiangjie Qin wrote: Ah... Didn't know that before. I explicitly listed the exceptions. Guozhang Wang wrote: Searching : Throwable gives me 180+ cases in code base :P Though many of them are from unit tests (which, arguably maybe OK) there are still a lot in the core package. I agree that we should avoid catching Throwable whenever possible, which will also help enforcing the developers to think about possible checked exceptions in the calling trace. I know :( I'm not sure if going over and converting everything is worth the effort. Although it can be a nice newbie jira. On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 400 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line400 So in case of unexpected exception, we log an error and keep running? Isn't it better to kill the processor, since we don't know what's the state of the system? If the acceptor keeps placing messages in the queue for a dead processor, isn't it a separate issue? Jiangjie Qin wrote: This part I'm not quite sure. I am not very experienced in the error handling in such case, so please correct me if I missed something. Here is what I thought. The way it currently works is that the acceptor will 1. accept new connection request and create new socket channel 2. choose a processor and put the socket channel into the processor's new connection queue The processor will just take the socket channels from the queue and register it to the selector. If the processor runs and get an uncaught exception, there are several possibilities. Case 1: The exception was from one socket channel. Case 2: The exception was associated with a bad request. In case 1, ideally we should just disconnect that socket channel without affecting other socket channels. In case 2, I think we should log the error and skip the message - assuming client will retry sending data if no response was received for a given peoriod of time. I am not sure if letting processor exit is a good idea because this will lead to the result of a badly behaving client screw the entire cluster - it might screw processors one by one. Comparing with that, I kind of leaning towards keeping the processor running and serving other normal TCP connections if possible, but log the error so monitoring system can detect and see if human intervention is needed. Also, I don't know what to do here to prevent the thread from exiting without catching all the throwables. According to this blog http://www.tzavellas.com/techblog/2010/09/20/catching-throwable-in-scala/ I guess I can rethrow all the ControlThrowables, but intercept the rests? Guozhang Wang wrote: I would also prefer not to close the Processor thread upon exceptions, mainly for avoid one bad client killing a shared Kafka cluster. In the past we have seen such issues like DDoS MetadataRequest killing the cluster and all other clients gets affected, etc, and the quota work is towards preventing it. Since Processor threads are shared (8 by default on a broker), it should not be closed by a single socket / bad client request. I like your thinking around cases #1 and #2. I think this should go as a code comment somewhere, so when people improve / extend SocketServer they will keep this logic in mind. Maybe even specify in specific catch clauses if they are handling possible errors in request level or channel level. My concern is with possible case #3: Each processor has an o.a.k.common.network.Selector. I'm concerned about the possibility of something going wrong in the state of the selector, which will possibly be an issue for all channels. For example failure to register could be an issue with the channel.register call, but also perhaps an issue with keys.put (just an example - I'm not sure something can actually break keys table). I'd like to be able to identify cases where the Selector state may have gone wrong and close the processor in that case. Does that make any sense? Or am I being too paranoid? - Gwen --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92496 --- On July 22, 2015, 5:02 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit:
Re: Review Request 36664: Patch for KAFKA-2353
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92607 --- LGTM overall. Could you address Ismael's comments as well before check-in? - Guozhang Wang On July 22, 2015, 5:02 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 22, 2015, 5:02 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
Since I've been dealing with the fallout of this particular problem all week, I'll add a few thoughts... On Wed, Jul 22, 2015 at 10:51 AM, Gwen Shapira gshap...@cloudera.com wrote: On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 465 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line465 Turns out that catching Throwable is a really bad idea: https://www.sumologic.com/2014/05/05/why-you-should-never-catch-throwable-in-scala/ Jiangjie Qin wrote: Ah... Didn't know that before. I explicitly listed the exceptions. Guozhang Wang wrote: Searching : Throwable gives me 180+ cases in code base :P Though many of them are from unit tests (which, arguably maybe OK) there are still a lot in the core package. I agree that we should avoid catching Throwable whenever possible, which will also help enforcing the developers to think about possible checked exceptions in the calling trace. I know :( I'm not sure if going over and converting everything is worth the effort. Although it can be a nice newbie jira. On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 400 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line400 So in case of unexpected exception, we log an error and keep running? Isn't it better to kill the processor, since we don't know what's the state of the system? If the acceptor keeps placing messages in the queue for a dead processor, isn't it a separate issue? Jiangjie Qin wrote: This part I'm not quite sure. I am not very experienced in the error handling in such case, so please correct me if I missed something. Here is what I thought. The way it currently works is that the acceptor will 1. accept new connection request and create new socket channel 2. choose a processor and put the socket channel into the processor's new connection queue The processor will just take the socket channels from the queue and register it to the selector. If the processor runs and get an uncaught exception, there are several possibilities. Case 1: The exception was from one socket channel. Case 2: The exception was associated with a bad request. In case 1, ideally we should just disconnect that socket channel without affecting other socket channels. In case 2, I think we should log the error and skip the message - assuming client will retry sending data if no response was received for a given peoriod of time. I am not sure if letting processor exit is a good idea because this will lead to the result of a badly behaving client screw the entire cluster - it might screw processors one by one. Comparing with that, I kind of leaning towards keeping the processor running and serving other normal TCP connections if possible, but log the error so monitoring system can detect and see if human intervention is needed. Also, I don't know what to do here to prevent the thread from exiting without catching all the throwables. According to this blog http://www.tzavellas.com/techblog/2010/09/20/catching-throwable-in-scala/ I guess I can rethrow all the ControlThrowables, but intercept the rests? Guozhang Wang wrote: I would also prefer not to close the Processor thread upon exceptions, mainly for avoid one bad client killing a shared Kafka cluster. In the past we have seen such issues like DDoS MetadataRequest killing the cluster and all other clients gets affected, etc, and the quota work is towards preventing it. Since Processor threads are shared (8 by default on a broker), it should not be closed by a single socket / bad client request. I like your thinking around cases #1 and #2. I think this should go as a code comment somewhere, so when people improve / extend SocketServer they will keep this logic in mind. Maybe even specify in specific catch clauses if they are handling possible errors in request level or channel level. My concern is with possible case #3: Each processor has an o.a.k.common.network.Selector. I'm concerned about the possibility of something going wrong in the state of the selector, which will possibly be an issue for all channels. For example failure to register could be an issue with the channel.register call, but also perhaps an issue with keys.put (just an example - I'm not sure something can actually break keys table). I'd like to be able to identify cases where the Selector state may have gone wrong and close the processor in that case. Does that make any sense? Or am I being too paranoid? If there are error cases that are not associated with a specific connection or request, then I agree that we should handle that a little differently. What we need to keep in mind is that close the processor is actually terminate the
Re: Review Request 36664: Patch for KAFKA-2353
On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 465 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line465 Turns out that catching Throwable is a really bad idea: https://www.sumologic.com/2014/05/05/why-you-should-never-catch-throwable-in-scala/ Jiangjie Qin wrote: Ah... Didn't know that before. I explicitly listed the exceptions. Guozhang Wang wrote: Searching : Throwable gives me 180+ cases in code base :P Though many of them are from unit tests (which, arguably maybe OK) there are still a lot in the core package. I agree that we should avoid catching Throwable whenever possible, which will also help enforcing the developers to think about possible checked exceptions in the calling trace. Gwen Shapira wrote: I know :( I'm not sure if going over and converting everything is worth the effort. Although it can be a nice newbie jira. Maybe we can simply change that to catch Throwables except ControlThrowables? That might be a simple search and replace. On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 400 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line400 So in case of unexpected exception, we log an error and keep running? Isn't it better to kill the processor, since we don't know what's the state of the system? If the acceptor keeps placing messages in the queue for a dead processor, isn't it a separate issue? Jiangjie Qin wrote: This part I'm not quite sure. I am not very experienced in the error handling in such case, so please correct me if I missed something. Here is what I thought. The way it currently works is that the acceptor will 1. accept new connection request and create new socket channel 2. choose a processor and put the socket channel into the processor's new connection queue The processor will just take the socket channels from the queue and register it to the selector. If the processor runs and get an uncaught exception, there are several possibilities. Case 1: The exception was from one socket channel. Case 2: The exception was associated with a bad request. In case 1, ideally we should just disconnect that socket channel without affecting other socket channels. In case 2, I think we should log the error and skip the message - assuming client will retry sending data if no response was received for a given peoriod of time. I am not sure if letting processor exit is a good idea because this will lead to the result of a badly behaving client screw the entire cluster - it might screw processors one by one. Comparing with that, I kind of leaning towards keeping the processor running and serving other normal TCP connections if possible, but log the error so monitoring system can detect and see if human intervention is needed. Also, I don't know what to do here to prevent the thread from exiting without catching all the throwables. According to this blog http://www.tzavellas.com/techblog/2010/09/20/catching-throwable-in-scala/ I guess I can rethrow all the ControlThrowables, but intercept the rests? Guozhang Wang wrote: I would also prefer not to close the Processor thread upon exceptions, mainly for avoid one bad client killing a shared Kafka cluster. In the past we have seen such issues like DDoS MetadataRequest killing the cluster and all other clients gets affected, etc, and the quota work is towards preventing it. Since Processor threads are shared (8 by default on a broker), it should not be closed by a single socket / bad client request. Gwen Shapira wrote: I like your thinking around cases #1 and #2. I think this should go as a code comment somewhere, so when people improve / extend SocketServer they will keep this logic in mind. Maybe even specify in specific catch clauses if they are handling possible errors in request level or channel level. My concern is with possible case #3: Each processor has an o.a.k.common.network.Selector. I'm concerned about the possibility of something going wrong in the state of the selector, which will possibly be an issue for all channels. For example failure to register could be an issue with the channel.register call, but also perhaps an issue with keys.put (just an example - I'm not sure something can actually break keys table). I'd like to be able to identify cases where the Selector state may have gone wrong and close the processor in that case. Does that make any sense? Or am I being too paranoid? Hi Gwen, I think what you said makes sense. Maybe I see this more from a failure boundary point of view. Actually we might need to do more if we let a processor exit. We need to stop the acceptor from
Re: Review Request 36664: Patch for KAFKA-2353
On July 22, 2015, 10:40 a.m., Ismael Juma wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 467 https://reviews.apache.org/r/36664/diff/2/?file=1018375#file1018375line467 As far as I can see `ClosedChannelException`, `IllegalStateException` and `IllegalArgumentException` are enough? Also, you would it be better to use `IOException` instead of `ClosedChannelException`? What happens if other exceptions are thrown? Will we still have a socket leak? Yeah, perhaps in addition to listing the expected cases, we should also handle nonFatal(e)? (https://tersesystems.com/2012/12/27/error-handling-in-scala/) - Gwen --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92574 --- On July 22, 2015, 5:02 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 22, 2015, 5:02 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92574 --- core/src/main/scala/kafka/network/SocketServer.scala (line 401) https://reviews.apache.org/r/36664/#comment146775 Is it intentional to ignore `java.lang.Error` too? core/src/main/scala/kafka/network/SocketServer.scala (line 463) https://reviews.apache.org/r/36664/#comment146773 As far as I can see `ClosedChannelException`, `IllegalStateException` and `IllegalArgumentException` are enough? Also, you would it be better to use `IOException` instead of `ClosedChannelException`? What happens if other exceptions are thrown? Will we still have a socket leak? - Ismael Juma On July 22, 2015, 5:02 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 22, 2015, 5:02 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
On July 22, 2015, 10:40 a.m., Ismael Juma wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 401 https://reviews.apache.org/r/36664/diff/2/?file=1018375#file1018375line401 Is it intentional to ignore `java.lang.Error` too? I think java.lang.Error is a subclass of throwables. - Jiangjie --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92574 --- On July 23, 2015, 12:51 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 23, 2015, 12:51 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Addressed Gwen's comments. Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
On July 22, 2015, 5:13 p.m., Guozhang Wang wrote: LGTM overall. Could you address Ismael's comments as well before check-in? Thanks, Guozhang. I updated patch to address Ismael's comments. - Jiangjie --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92607 --- On July 23, 2015, 12:51 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 23, 2015, 12:51 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Addressed Gwen's comments. Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 465 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line465 Turns out that catching Throwable is a really bad idea: https://www.sumologic.com/2014/05/05/why-you-should-never-catch-throwable-in-scala/ Jiangjie Qin wrote: Ah... Didn't know that before. I explicitly listed the exceptions. Guozhang Wang wrote: Searching : Throwable gives me 180+ cases in code base :P Though many of them are from unit tests (which, arguably maybe OK) there are still a lot in the core package. I agree that we should avoid catching Throwable whenever possible, which will also help enforcing the developers to think about possible checked exceptions in the calling trace. Gwen Shapira wrote: I know :( I'm not sure if going over and converting everything is worth the effort. Although it can be a nice newbie jira. Jiangjie Qin wrote: Maybe we can simply change that to catch Throwables except ControlThrowables? That might be a simple search and replace. possible. definitely not in this JIRA though :) On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 400 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line400 So in case of unexpected exception, we log an error and keep running? Isn't it better to kill the processor, since we don't know what's the state of the system? If the acceptor keeps placing messages in the queue for a dead processor, isn't it a separate issue? Jiangjie Qin wrote: This part I'm not quite sure. I am not very experienced in the error handling in such case, so please correct me if I missed something. Here is what I thought. The way it currently works is that the acceptor will 1. accept new connection request and create new socket channel 2. choose a processor and put the socket channel into the processor's new connection queue The processor will just take the socket channels from the queue and register it to the selector. If the processor runs and get an uncaught exception, there are several possibilities. Case 1: The exception was from one socket channel. Case 2: The exception was associated with a bad request. In case 1, ideally we should just disconnect that socket channel without affecting other socket channels. In case 2, I think we should log the error and skip the message - assuming client will retry sending data if no response was received for a given peoriod of time. I am not sure if letting processor exit is a good idea because this will lead to the result of a badly behaving client screw the entire cluster - it might screw processors one by one. Comparing with that, I kind of leaning towards keeping the processor running and serving other normal TCP connections if possible, but log the error so monitoring system can detect and see if human intervention is needed. Also, I don't know what to do here to prevent the thread from exiting without catching all the throwables. According to this blog http://www.tzavellas.com/techblog/2010/09/20/catching-throwable-in-scala/ I guess I can rethrow all the ControlThrowables, but intercept the rests? Guozhang Wang wrote: I would also prefer not to close the Processor thread upon exceptions, mainly for avoid one bad client killing a shared Kafka cluster. In the past we have seen such issues like DDoS MetadataRequest killing the cluster and all other clients gets affected, etc, and the quota work is towards preventing it. Since Processor threads are shared (8 by default on a broker), it should not be closed by a single socket / bad client request. Gwen Shapira wrote: I like your thinking around cases #1 and #2. I think this should go as a code comment somewhere, so when people improve / extend SocketServer they will keep this logic in mind. Maybe even specify in specific catch clauses if they are handling possible errors in request level or channel level. My concern is with possible case #3: Each processor has an o.a.k.common.network.Selector. I'm concerned about the possibility of something going wrong in the state of the selector, which will possibly be an issue for all channels. For example failure to register could be an issue with the channel.register call, but also perhaps an issue with keys.put (just an example - I'm not sure something can actually break keys table). I'd like to be able to identify cases where the Selector state may have gone wrong and close the processor in that case. Does that make any sense? Or am I being too paranoid? Jiangjie Qin wrote: Hi Gwen, I think what you said makes sense. Maybe I see this more from a failure boundary point of
Re: Review Request 36664: Patch for KAFKA-2353
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 23, 2015, 12:51 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description (updated) --- Addressed Gwen's comments Addressed Gwen's comments. Diffs (updated) - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: Thanks for looking into that. Exception handling was the most challenging part of rewriting SocketServer, so I'm glad to see more eyes on this implementation. I have a concern regarding the right way to handle an unexpected exceptions. Hi Gwen, thanks for the quick review. I replied to your comments below. Mind taking another look? - Jiangjie --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92496 --- On July 22, 2015, 5:02 a.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 22, 2015, 5:02 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Addressed Gwen's comments Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 22, 2015, 5:02 a.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description (updated) --- Addressed Gwen's comments Diffs (updated) - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 465 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line465 Turns out that catching Throwable is a really bad idea: https://www.sumologic.com/2014/05/05/why-you-should-never-catch-throwable-in-scala/ Ah... Didn't know that before. I explicitly listed the exceptions. On July 21, 2015, 11:15 p.m., Gwen Shapira wrote: core/src/main/scala/kafka/network/SocketServer.scala, line 400 https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line400 So in case of unexpected exception, we log an error and keep running? Isn't it better to kill the processor, since we don't know what's the state of the system? If the acceptor keeps placing messages in the queue for a dead processor, isn't it a separate issue? This part I'm not quite sure. I am not very experienced in the error handling in such case, so please correct me if I missed something. Here is what I thought. The way it currently works is that the acceptor will 1. accept new connection request and create new socket channel 2. choose a processor and put the socket channel into the processor's new connection queue The processor will just take the socket channels from the queue and register it to the selector. If the processor runs and get an uncaught exception, there are several possibilities. Case 1: The exception was from one socket channel. Case 2: The exception was associated with a bad request. In case 1, ideally we should just disconnect that socket channel without affecting other socket channels. In case 2, I think we should log the error and skip the message - assuming client will retry sending data if no response was received for a given peoriod of time. I am not sure if letting processor exit is a good idea because this will lead to the result of a badly behaving client screw the entire cluster - it might screw processors one by one. Comparing with that, I kind of leaning towards keeping the processor running and serving other normal TCP connections if possible, but log the error so monitoring system can detect and see if human intervention is needed. Also, I don't know what to do here to prevent the thread from exiting without catching all the throwables. According to this blog http://www.tzavellas.com/techblog/2010/09/20/catching-throwable-in-scala/ I guess I can rethrow all the ControlThrowables, but intercept the rests? - Jiangjie --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92496 --- On July 21, 2015, 11:03 p.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 21, 2015, 11:03 p.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Catch exception in kafka.network.Processor to avoid socket leak and exiting unexpectedly. Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Review Request 36664: Patch for KAFKA-2353
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Catch exception in kafka.network.Processor to avoid socket leak and exiting unexpectedly. Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin
Re: Review Request 36664: Patch for KAFKA-2353
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/#review92496 --- Thanks for looking into that. Exception handling was the most challenging part of rewriting SocketServer, so I'm glad to see more eyes on this implementation. I have a concern regarding the right way to handle an unexpected exceptions. core/src/main/scala/kafka/network/SocketServer.scala (line 400) https://reviews.apache.org/r/36664/#comment146707 So in case of unexpected exception, we log an error and keep running? Isn't it better to kill the processor, since we don't know what's the state of the system? If the acceptor keeps placing messages in the queue for a dead processor, isn't it a separate issue? core/src/main/scala/kafka/network/SocketServer.scala (line 461) https://reviews.apache.org/r/36664/#comment146708 Turns out that catching Throwable is a really bad idea: https://www.sumologic.com/2014/05/05/why-you-should-never-catch-throwable-in-scala/ - Gwen Shapira On July 21, 2015, 11:03 p.m., Jiangjie Qin wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/36664/ --- (Updated July 21, 2015, 11:03 p.m.) Review request for kafka. Bugs: KAFKA-2353 https://issues.apache.org/jira/browse/KAFKA-2353 Repository: kafka Description --- Catch exception in kafka.network.Processor to avoid socket leak and exiting unexpectedly. Diffs - core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9 Diff: https://reviews.apache.org/r/36664/diff/ Testing --- Thanks, Jiangjie Qin