Re: [vdsm] flowID schema
On 02/19/2012 05:32 AM, Ayal Baron wrote: FlowID does absolutely nothing towards that end. What is required for that (and should indeed be done) is work on improving the existing logs: 1. reduce verbosity 2. make sure that required points in the flow are logged 3. improve log messages Don't forget the MSGID change that Keith promised to push as well. Keith hasn't forgotten. He is just trying to get a standardized format and tooling that are *easy* to use and which is minimally intrusive to development. Had a few meetings in Brno last week to discuss logging and I think that we have a fairly good plan now. As soon as I finish the wiki, I'll send a link. Cheers, Keith ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
Re: [vdsm] flowID schema
Pinning down specific moments where stuff went horribly wrong is usually quite simple. The only reason I can't think of for someone to track a problem is if there was a bug in the storage subsystem that cause a corruption that only became apparent later on. But this can't benefit from flowID. I'm going to to day this again. I want to hear someone to give a case where this is useful. Don't say things like. It will allow me to do X or I know of a guy who spent 2 years doing X to VDSM\Engine logs I want an example like: The user complained about X. So I had to do Y to figure out what is wrong and it was a pain. The Only reason I can think of is someone trying to figure out what wrong knowing nothing about how VDSM\Engine works. While I just jump from place to place because I know what is going on, other people would just want to go step by step to get the complete picture. But even with that I don't see why you would want to keep going from the Engine down to VDSM and back out again. Other then that I just simply can't imagine a use case where I'd need to do the stuff you talk about. But again, I usually get my bug reports from QA and QE, and they are more skilled at bug reporting then the general public. Further more, WHAT IS A FLOW? When does it start? When does it end? create Image is a flow? is the connect to the domain included? isn't it all just part of a big flow to create a VM? Does the engine even track it as a flow? Flows depend on the debugger the problem and the scope. Throwing another useless ID in the pile will give you nothing. What you want is to be able to map sophisticated connections between resources and operations inside every component and between them. For instance, in VDSM. To track a resource locking issues you use the resourceID. It crosses flows. To debug connection issues you use the connection information (and soon, the reference ID). An it crosses flows as well. I'm not saying that debugging Ovirt is easy. I'm just saying that this is not the solution IMHO. Good anchors to resources like taskIDs, connection reference IDs, domain IDs and resource IDs give you the ability to track whatever you want. You just need better tools to cross reference them across log files so that the log tricks I *know* are implemented in a way that everyone can use them. This is when good tools come in to play. Think about it like a DB. Should something be a table with an index or should it just be a view. (more inline) - Original Message - From: Simon Grinberg si...@redhat.com To: Saggi Mizrahi smizr...@redhat.com Cc: vdsm-devel@lists.fedorahosted.org, Dan Yasny dya...@redhat.com, Ayal Baron aba...@redhat.com Sent: Thursday, February 16, 2012 11:07:34 AM Subject: Re: [vdsm] flowID schema - Original Message - From: Saggi Mizrahi smizr...@redhat.com To: Simon Grinberg si...@redhat.com Cc: vdsm-devel@lists.fedorahosted.org, Dan Yasny dya...@redhat.com, Ayal Baron aba...@redhat.com Sent: Thursday, February 16, 2012 5:18:26 PM Subject: Re: [vdsm] flowID schema You could just cross check flowID (that is printed in RHEV-M to all the call IDs in that flow, It's a simple enough tool to make). Also if you look at the way VDSM is heading you will see that just grepping for the flow ID will gradually give you a smaller and smaller picture of the actual flow. - The new connection management means that the actual connect is not a direct part of the flow anymore. - We plan to have a similar mechanism for domains. - The new image code will do a lot of things out of band. - There will be more flows using multiple hosts because of SDM and the increased safety gained by sanlock and they will not share the flow ID in internal communication. - Engine actually polls all tasks in a single thread (with it's own flow ID? No flowID?) so even the actual result for async tasks might have a different flowID in the VDSM log. OK so they may be a terminology mismatch here: A flow as an end user would see it is: Everything that happened since the moment he clicked OK and to the moment the operation failed succeeded. Same goes for internal/auto triggered actions in response to an event (Storage failure etc). It may be composed as you notes above of several connections, multiple tasks etc. The flow ID as I see it is the linkage that makes sense in all of these. Everything that you've wrote above just convinces me that such an ID is a must. I understood that. The point is that the system is too complex to be able to find certain operations to flows. If you connect in FlowX but use a resource provided by the connection in flowY. The connection breaks and you loose the resource. As long as you somehow pinned the flowID value to the connection object it might be logged as flowX. The actual problem happened in flowY. My point is that the logical leaps required to extract only relevant data to a flow
Re: [vdsm] flowID schema
Hi Ayal, actual use case where someone would have actually benefited from this. Sorry joining that late into the discussion. Forget the automatic analyses/past issues recognition that may or may not be possible if we do have flow ID. (Personally I believe it's possible for known issues) Think of large setups, not the few hosts setups that are running in the labs. Dozens of hosts, hundreds of VMs and tasks, running for few months. --- Huge logs (yes we even got 50G sized logs) Now try to do what you and Saggy are suggesting said, yes it will work!!! Sometimes it is the only approach that will be possible. But what about efficiency? Some times I found myself spending 30 minutes just getting to the files that I need to view and to the approximate location of the error. If you have a flow ID you can have a tool that will extract all the relevant information from all the hosts participating in that flow and correlate that with the rhevm.log as the anchor. (Valdik actually wrote a tool that follows the SPM and creates an SPM log no matter on which host it resides - this allowed tracking flows and proved itself useful and time saving, especially since when storage issues begin SPM tends to move around :)) In addition to that think about skill levels, For you it's very easy to go and debug, you wrote the code, you know what to expect and what is the normal behavior (most of the time). But what about others? When encountering an issue within a flow for a front line or a customer self debug the easiest way will be to actually compare it to the same flow that succeed (if exist) and understand what was supposed to happen next (or before) and did not. I actually used that technique a lot. Again a flow ID will save time. All you have to do is extract an identical flow that worked using the flow-extractor utility and then compare. I agree that when you get to the really tough cases, they will span across flows and your only alternative may be just to dig in. However for many others, flows tracking will save support calls, support escalations and will considerably help with education. For 2.2 we had Marina going into the code and documenting flows, so support people can learn what to expect and what to look for in the logs. I believe it will also help engineering, lot's of times I had to gather 3 engineers and sat together just to understand a flow in order to debug an issue or understand what to expect if someone complained about a weird behavior. Since each of them only new what his part is doing - all had to go into their code and discuss. With flow ID you can just extract the flow and easily see what is being done (you may find out as happened before that it's not exactly what they meant to do) In short, if we aim to look at RHEV as a system, flows are helpful in many aspects. Instead of having discrete components and requiring a high level of expertise and spending a lot of time (finding the right spot) to debug even minor issues you can look at a system and easily get (most of the time), understand what to expect and get to the most relevant logs fairly fast. Simon. - Original Message - From: Dan Yasny dya...@redhat.com To: Ayal Baron aba...@redhat.com Cc: Simon Grinberg si...@redhat.com, vdsm-devel@lists.fedorahosted.org Sent: Friday, February 10, 2012 4:01:01 PM Subject: Re: [vdsm] flowID schema - Original Message - From: Ayal Baron aba...@redhat.com To: Dan Yasny dya...@redhat.com Cc: Simon Grinberg si...@redhat.com, vdsm-devel@lists.fedorahosted.org Sent: Friday, 10 February, 2012 3:51:01 PM Subject: Re: [vdsm] flowID schema - Original Message - - Original Message - From: Ayal Baron aba...@redhat.com To: Dan Yasny dya...@redhat.com Cc: Simon Grinberg si...@redhat.com, vdsm-devel@lists.fedorahosted.org Sent: Friday, 10 February, 2012 2:55:46 PM Subject: Re: [vdsm] flowID schema - Original Message - - Original Message - From: Ayal Baron aba...@redhat.com To: Dan Yasny dya...@redhat.com Cc: Simon Grinberg si...@redhat.com, vdsm-devel@lists.fedorahosted.org Sent: Friday, 10 February, 2012 12:50:04 AM Subject: Re: [vdsm] flowID schema - Original Message - From: Saggi Mizrahi smizr...@redhat.com To: Keith Robertson krobe...@redhat.com Cc: VDSM Project Development vdsm-devel@lists.fedorahosted.org Sent: Thursday, February 9, 2012 2:24:44 PM Subject: Re: [vdsm] flowID schema -1 I agree that for messaging environment having a Message ID is a must because you sometimes don't have a particular target so when you get a response you need to know what this node is actually
Re: [vdsm] flowID schema
On Mon, Feb 13, 2012 at 12:06:46PM -0500, Ayal Baron wrote: - Original Message - On 02/13/2012 02:28 PM, Ayal Baron wrote: ... is that it (ab)uses an http header for carrying FlowID, Yes, it certainly does appear to overload it. I would be nice to have something formal given to it by engine, but I can appreciate the difficulty implementing such a scheme. Technically I disagree, this is a cross cutting concern which has nothing to do with any specific call hence it should be passed as a header, that is actually rather elegant. To the specific matter at hand though. what would really be nice is solving the real problem properly, and not contaminating the API and the log with things which have marginal benefit if at all. going back to the 'grep' issue. vdsm logs are verbose. they are multi-threaded as well. I think this should be more than just about finding the entry point of the flow, then identifying for this specific log format how to trace it, which would require writing a log analyzer with plugins for each component. having all lines which are relevant to a flow with a flowid logged in them would make it much easier to get all (or most) of relevant parts of the flow (most, since something orthogonal to the flow may have happened affecting it, like loss of network) I'm sorry but what you're proposing is to make the log even more difficult to read for absolutely NO reason. I haven't seen 1 good reason to add more to the log. What we should be focusing on is: 1. adding the relevant data that is needed to the engine log so that most of the time users wouldn't need to go the host 2. reducing the verbosity of the vdsm log and increasing readability (the flow ID does exactly the opposite). As opposed to most people here who are thinking that this sounds like a good idea, I actually have debugged at least dozens of issues in engine and vdsm and can assure you that not once would this have been beneficial to me. When I debug an Engine-related issue, I tend to find a silly API call in Vdsm. Then I have to start correlating this to Engine logs. This step can be made quicker and less error-prone by logging FlowID both Engine and Vdsm. To me, this is the 1 good reason for logging FlowID on API entry in Vdsm. However, I find adding FlowID to each and every log line a bit excessive. Our log is too cluttered as it is. Logging FlowID whenever a new thread is spawned makes more sense to me. What was mostly missing in the engine logs was understanding what thread in engine called what operation in vdsm and what vdsm's response was. In 3.0 my understanding is that engine fixed this so this entire feature will be counter productive (will make logs less readable and harder to decipher, adds complexity to the API and adds complexity to users of the rest API). All cross hosts issues stem from *different* flows, so this would not help in this case and single host issues are easily traceable today (and you *never* need to follow an entire flow, it's entirely redundant and inefficient). I'm more than willing to show this on any set of logs by the way and would be happy to be proven wrong. More often than not by the way, the issue is that inside a specific call (i.e. 1 verb, not a flow) people are not proficient in finding the offending line (which is why I wrote the 'how to read the vdsm log' wiki). ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
Re: [vdsm] flowID schema
- Original Message - On 02/12/2012 02:50 AM, Dan Kenigsberg wrote: On Thu, Feb 09, 2012 at 03:23:19PM -0500, Ayal Baron wrote: - Original Message - -1 I agree that for messaging environment having a Message ID is a must because you sometimes don't have a particular target so when you get a response you need to know what this node is actually responding to. The message ID could be composed withFLOWIDMSGID so you can reuse the field. But that is all besides the point. I understand that someone might find it fun to go on following the entire flow in the Engine and in VDSM. But I would like to hear an actual use case where someone would have actually benefited from this. As I see it having VSDM return the task ID with every response (and not just for async tasks) is a lot more useful and correct. A generic debugging scenario as I see it. 1. Something went wrong 2. You go looking in the ENGINE log trying to figure out what happend. 3. You see that ENGINE got SomeError. 4. Check to see if this error makes sense imagining that VDSM is always right and is a black box. 5. You did your digging and now you think that VDSM is as fault. 6. Go look for the call that failed. (If we returned the taskID it's pretty simple to find that call). 7. Look around the call to check VDSM state. 8. Profit. There is never a point where you want to follow a whole flow call by call going back and forth, and even if you did having the VDSM taskID is a better anchor then flowID. VDSM is built in a way that every call takes in to account the current state only. Debugging it with an engine flow mindset is just wrong and distracting. I see it doing more harm the good by reinforcing bad debugging practices. I don't know about harm, but, today the engine logs every call and return value to and from vdsm. This means that all the info that is needed to follow a flow is already present in the engine log (which was not the case previously) so I believe that the flow id is redundant. In addition, instead of focusing on how to track a flow between components, we should focus on how to improve the engine log so that the users don't need to go to the hosts in the first place. My biggest problem with it is that it changes each and every verb in the API and makes the log itself also more verbose and less readable. The good thing about the currently suggested implementation http://gerrit.ovirt.org/#patch,sidebyside,1221,6,vdsm/BindingXMLRPC.py is that it (ab)uses an http header for carrying FlowID, Yes, it certainly does appear to overload it. I would be nice to have something formal given to it by engine, but I can appreciate the difficulty implementing such a scheme. Technically I disagree, this is a cross cutting concern which has nothing to do with any specific call hence it should be passed as a header, that is actually rather elegant. To the specific matter at hand though. what would really be nice is solving the real problem properly, and not contaminating the API and the log with things which have marginal benefit if at all. thus keeping the formal API intact. FlowID is logged only on API entry point, so it would not clutter the logs too much. Keith, Dan, I will need your support in Gerrit for the patch. +1 if I am asked to review. Dan. ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
Re: [vdsm] flowID schema
- Original Message - From: Itamar Heim ih...@redhat.com To: Ayal Baron aba...@redhat.com Cc: VDSM Project Development vdsm-devel@lists.fedorahosted.org Sent: Monday, 13 February, 2012 6:55:41 PM Subject: Re: [vdsm] flowID schema On 02/13/2012 02:28 PM, Ayal Baron wrote: ... is that it (ab)uses an http header for carrying FlowID, Yes, it certainly does appear to overload it. I would be nice to have something formal given to it by engine, but I can appreciate the difficulty implementing such a scheme. Technically I disagree, this is a cross cutting concern which has nothing to do with any specific call hence it should be passed as a header, that is actually rather elegant. To the specific matter at hand though. what would really be nice is solving the real problem properly, and not contaminating the API and the log with things which have marginal benefit if at all. going back to the 'grep' issue. vdsm logs are verbose. they are multi-threaded as well. I think this should be more than just about finding the entry point of the flow, then identifying for this specific log format how to trace it, which would require writing a log analyzer with plugins for each component. having all lines which are relevant to a flow with a flowid logged in them would make it much easier to get all (or most) of relevant parts of the flow +1 (most, since something orthogonal to the flow may have happened affecting it, like loss of network) ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel -- Regards, Dan Yasny Red Hat Israel +972 9769 2280 ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
Re: [vdsm] flowID schema
On 02/13/2012 12:52 AM, Ayal Baron wrote: ... if the FlowID isn't logged for logs which are part of the action, it will be easy to detect the entry point. but still wouldn't make it easy to grep all related logs to a flow If the flow id is present in all lines it would make the log a lot less readable. Writing a script that accepts the flowid and 'grep's the 'flow' should be simple enough. However, if this is to become (remotely) useful, the flowID should always contain an engine generated generation. The thing is, really complex debug scenarios which span multiple hosts never stem from the same flow, so the only added benefit here would be if we could at least know what happened before what from logs across multiple hosts. To achieve this we would need the engine to add a generation id to each unique flowID (the problem being that currently the flow ID would be determined by the UI and not by engine). Engine would have to generate such an ID for all internal flows as well! I thought it is clear the UI/api can pass the flow id for the rest of the components (engine and below) I still have not heard even one good reason for having this though. ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
Re: [vdsm] flowID schema
- Original Message - On 02/13/2012 02:28 PM, Ayal Baron wrote: ... is that it (ab)uses an http header for carrying FlowID, Yes, it certainly does appear to overload it. I would be nice to have something formal given to it by engine, but I can appreciate the difficulty implementing such a scheme. Technically I disagree, this is a cross cutting concern which has nothing to do with any specific call hence it should be passed as a header, that is actually rather elegant. To the specific matter at hand though. what would really be nice is solving the real problem properly, and not contaminating the API and the log with things which have marginal benefit if at all. going back to the 'grep' issue. vdsm logs are verbose. they are multi-threaded as well. I think this should be more than just about finding the entry point of the flow, then identifying for this specific log format how to trace it, which would require writing a log analyzer with plugins for each component. having all lines which are relevant to a flow with a flowid logged in them would make it much easier to get all (or most) of relevant parts of the flow (most, since something orthogonal to the flow may have happened affecting it, like loss of network) I'm sorry but what you're proposing is to make the log even more difficult to read for absolutely NO reason. I haven't seen 1 good reason to add more to the log. What we should be focusing on is: 1. adding the relevant data that is needed to the engine log so that most of the time users wouldn't need to go the host 2. reducing the verbosity of the vdsm log and increasing readability (the flow ID does exactly the opposite). As opposed to most people here who are thinking that this sounds like a good idea, I actually have debugged at least dozens of issues in engine and vdsm and can assure you that not once would this have been beneficial to me. What was mostly missing in the engine logs was understanding what thread in engine called what operation in vdsm and what vdsm's response was. In 3.0 my understanding is that engine fixed this so this entire feature will be counter productive (will make logs less readable and harder to decipher, adds complexity to the API and adds complexity to users of the rest API). All cross hosts issues stem from *different* flows, so this would not help in this case and single host issues are easily traceable today (and you *never* need to follow an entire flow, it's entirely redundant and inefficient). I'm more than willing to show this on any set of logs by the way and would be happy to be proven wrong. More often than not by the way, the issue is that inside a specific call (i.e. 1 verb, not a flow) people are not proficient in finding the offending line (which is why I wrote the 'how to read the vdsm log' wiki). ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
Re: [vdsm] flowID schema
On 02/12/2012 09:50 AM, Dan Kenigsberg wrote: On Thu, Feb 09, 2012 at 03:23:19PM -0500, Ayal Baron wrote: - Original Message - -1 I agree that for messaging environment having a Message ID is a must because you sometimes don't have a particular target so when you get a response you need to know what this node is actually responding to. The message ID could be composed withFLOWIDMSGID so you can reuse the field. But that is all besides the point. I understand that someone might find it fun to go on following the entire flow in the Engine and in VDSM. But I would like to hear an actual use case where someone would have actually benefited from this. As I see it having VSDM return the task ID with every response (and not just for async tasks) is a lot more useful and correct. A generic debugging scenario as I see it. 1. Something went wrong 2. You go looking in the ENGINE log trying to figure out what happend. 3. You see that ENGINE got SomeError. 4. Check to see if this error makes sense imagining that VDSM is always right and is a black box. 5. You did your digging and now you think that VDSM is as fault. 6. Go look for the call that failed. (If we returned the taskID it's pretty simple to find that call). 7. Look around the call to check VDSM state. 8. Profit. There is never a point where you want to follow a whole flow call by call going back and forth, and even if you did having the VDSM taskID is a better anchor then flowID. VDSM is built in a way that every call takes in to account the current state only. Debugging it with an engine flow mindset is just wrong and distracting. I see it doing more harm the good by reinforcing bad debugging practices. I don't know about harm, but, today the engine logs every call and return value to and from vdsm. This means that all the info that is needed to follow a flow is already present in the engine log (which was not the case previously) so I believe that the flow id is redundant. In addition, instead of focusing on how to track a flow between components, we should focus on how to improve the engine log so that the users don't need to go to the hosts in the first place. My biggest problem with it is that it changes each and every verb in the API and makes the log itself also more verbose and less readable. The good thing about the currently suggested implementation http://gerrit.ovirt.org/#patch,sidebyside,1221,6,vdsm/BindingXMLRPC.py is that it (ab)uses an http header for carrying FlowID, thus keeping the formal API intact. FlowID is logged only on API entry point, so it would not clutter the logs too much. if the FlowID isn't logged for logs which are part of the action, it will be easy to detect the entry point. but still wouldn't make it easy to grep all related logs to a flow ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
Re: [vdsm] flowID schema
- Original Message - On 02/12/2012 09:50 AM, Dan Kenigsberg wrote: On Thu, Feb 09, 2012 at 03:23:19PM -0500, Ayal Baron wrote: - Original Message - -1 I agree that for messaging environment having a Message ID is a must because you sometimes don't have a particular target so when you get a response you need to know what this node is actually responding to. The message ID could be composed withFLOWIDMSGID so you can reuse the field. But that is all besides the point. I understand that someone might find it fun to go on following the entire flow in the Engine and in VDSM. But I would like to hear an actual use case where someone would have actually benefited from this. As I see it having VSDM return the task ID with every response (and not just for async tasks) is a lot more useful and correct. A generic debugging scenario as I see it. 1. Something went wrong 2. You go looking in the ENGINE log trying to figure out what happend. 3. You see that ENGINE got SomeError. 4. Check to see if this error makes sense imagining that VDSM is always right and is a black box. 5. You did your digging and now you think that VDSM is as fault. 6. Go look for the call that failed. (If we returned the taskID it's pretty simple to find that call). 7. Look around the call to check VDSM state. 8. Profit. There is never a point where you want to follow a whole flow call by call going back and forth, and even if you did having the VDSM taskID is a better anchor then flowID. VDSM is built in a way that every call takes in to account the current state only. Debugging it with an engine flow mindset is just wrong and distracting. I see it doing more harm the good by reinforcing bad debugging practices. I don't know about harm, but, today the engine logs every call and return value to and from vdsm. This means that all the info that is needed to follow a flow is already present in the engine log (which was not the case previously) so I believe that the flow id is redundant. In addition, instead of focusing on how to track a flow between components, we should focus on how to improve the engine log so that the users don't need to go to the hosts in the first place. My biggest problem with it is that it changes each and every verb in the API and makes the log itself also more verbose and less readable. The good thing about the currently suggested implementation http://gerrit.ovirt.org/#patch,sidebyside,1221,6,vdsm/BindingXMLRPC.py is that it (ab)uses an http header for carrying FlowID, thus keeping the formal API intact. FlowID is logged only on API entry point, so it would not clutter the logs too much. if the FlowID isn't logged for logs which are part of the action, it will be easy to detect the entry point. but still wouldn't make it easy to grep all related logs to a flow If the flow id is present in all lines it would make the log a lot less readable. Writing a script that accepts the flowid and 'grep's the 'flow' should be simple enough. However, if this is to become (remotely) useful, the flowID should always contain an engine generated generation. The thing is, really complex debug scenarios which span multiple hosts never stem from the same flow, so the only added benefit here would be if we could at least know what happened before what from logs across multiple hosts. To achieve this we would need the engine to add a generation id to each unique flowID (the problem being that currently the flow ID would be determined by the UI and not by engine). Engine would have to generate such an ID for all internal flows as well! I still have not heard even one good reason for having this though. ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
Re: [vdsm] flowID schema
On Thu, Feb 09, 2012 at 03:23:19PM -0500, Ayal Baron wrote: - Original Message - -1 I agree that for messaging environment having a Message ID is a must because you sometimes don't have a particular target so when you get a response you need to know what this node is actually responding to. The message ID could be composed with FLOWIDMSGID so you can reuse the field. But that is all besides the point. I understand that someone might find it fun to go on following the entire flow in the Engine and in VDSM. But I would like to hear an actual use case where someone would have actually benefited from this. As I see it having VSDM return the task ID with every response (and not just for async tasks) is a lot more useful and correct. A generic debugging scenario as I see it. 1. Something went wrong 2. You go looking in the ENGINE log trying to figure out what happend. 3. You see that ENGINE got SomeError. 4. Check to see if this error makes sense imagining that VDSM is always right and is a black box. 5. You did your digging and now you think that VDSM is as fault. 6. Go look for the call that failed. (If we returned the taskID it's pretty simple to find that call). 7. Look around the call to check VDSM state. 8. Profit. There is never a point where you want to follow a whole flow call by call going back and forth, and even if you did having the VDSM taskID is a better anchor then flowID. VDSM is built in a way that every call takes in to account the current state only. Debugging it with an engine flow mindset is just wrong and distracting. I see it doing more harm the good by reinforcing bad debugging practices. I don't know about harm, but, today the engine logs every call and return value to and from vdsm. This means that all the info that is needed to follow a flow is already present in the engine log (which was not the case previously) so I believe that the flow id is redundant. In addition, instead of focusing on how to track a flow between components, we should focus on how to improve the engine log so that the users don't need to go to the hosts in the first place. My biggest problem with it is that it changes each and every verb in the API and makes the log itself also more verbose and less readable. The good thing about the currently suggested implementation http://gerrit.ovirt.org/#patch,sidebyside,1221,6,vdsm/BindingXMLRPC.py is that it (ab)uses an http header for carrying FlowID, thus keeping the formal API intact. FlowID is logged only on API entry point, so it would not clutter the logs too much. Keith, Dan, I will need your support in Gerrit for the patch. Dan. ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
Re: [vdsm] flowID schema
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 02/09/2012 05:18 PM, Andrew Cathrow wrote: The question is, what problem is this really trying to solve and is there a simpler and less obtrusive solution to that problem? correlating logs between ovirt engine and potentially multiple vdsm nodes is a nightmare. It requires a lot skill to follow a transaction through from the front end all the way to the node, and even multiple nodes (eg actions on spm, then actions on other node to run a vm). Having a way to correlate the logs and follow a single event/flow is vital. +1 anything that allows us to easily follow flows between engine, vdsm and back again would be welcomed by those supporting it! Lee -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQEcBAEBAgAGBQJPNAiFAAoJELymbjP2ci12PFYH/09f/c7xlSMZ1QKJo6QHgBmC fucAUiKDWFxpJ3zyOJsoJfCL6TmXdvdUAJQXhmy2Its3KGYdhKZLezqJHpIGM/AR f694/O5a1jUYGJnxtQknr1H4wJar1ot0gIJEuiDxEM9haR4XTTYbhsyl8ApQ8wHP EERHi6eTzLY0j3ohbDOqRSrGdsofTYQ653MNQXggfBa41R1KbOPkp8XIf4RJ7+0J 8e7xGqU3SwITDeJ9c7yj5bwWdpuzUGvwPZif8VeuR0RajjKHCagiexWbRoLmoGYi dqc+QcxOL8zs1e9mOiB1Cgfc36bbCjGFBNYe8N3gpnod9NCQZ1f9FzbQ82+Dosg= =CMZJ -END PGP SIGNATURE- ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
Re: [vdsm] flowID schema
-1 I agree that for messaging environment having a Message ID is a must because you sometimes don't have a particular target so when you get a response you need to know what this node is actually responding to. The message ID could be composed with FLOWIDMSGID so you can reuse the field. But that is all besides the point. I understand that someone might find it fun to go on following the entire flow in the Engine and in VDSM. But I would like to hear an actual use case where someone would have actually benefited from this. As I see it having VSDM return the task ID with every response (and not just for async tasks) is a lot more useful and correct. A generic debugging scenario as I see it. 1. Something went wrong 2. You go looking in the ENGINE log trying to figure out what happend. 3. You see that ENGINE got SomeError. 4. Check to see if this error makes sense imagining that VDSM is always right and is a black box. 5. You did your digging and now you think that VDSM is as fault. 6. Go look for the call that failed. (If we returned the taskID it's pretty simple to find that call). 7. Look around the call to check VDSM state. 8. Profit. There is never a point where you want to follow a whole flow call by call going back and forth, and even if you did having the VDSM taskID is a better anchor then flowID. VDSM is built in a way that every call takes in to account the current state only. Debugging it with an engine flow mindset is just wrong and distracting. I see it doing more harm the good by reinforcing bad debugging practices. - Original Message - From: Keith Robertson krobe...@redhat.com To: VDSM Project Development vdsm-devel@lists.fedorahosted.org Sent: Thursday, February 9, 2012 1:34:43 PM Subject: Re: [vdsm] flowID schema On 02/09/2012 12:18 PM, Andrew Cathrow wrote: - Original Message - From: Ayal Baronaba...@redhat.com To: Dan Kenigsbergdan...@redhat.com Cc: VDSM Project Developmentvdsm-devel@lists.fedorahosted.org Sent: Monday, February 6, 2012 10:35:54 AM Subject: Re: [vdsm] flowID schema - Original Message - On Thu, Feb 02, 2012 at 10:32:49AM -0500, Saggi Mizrahi wrote: flowID makes no sense after the initial API call as stuff like cacheing\threadpools\samplingtasks\resources\asyncTasks so flowing a flow like that will not give you the entire picture while debugging. Also adding it now will make everything even more ugly. You know what, just imagine I wrote one of my long rambles about why I don't agree with doing this. I cannot imagine you write anything like that. Really. I do not understand why you object logging flowID on API entry point. The question is, what problem is this really trying to solve and is there a simpler and less obtrusive solution to that problem? correlating logs between ovirt engine and potentially multiple vdsm nodes is a nightmare. It requires a lot skill to follow a transaction through from the front end all the way to the node, and even multiple nodes (eg actions on spm, then actions on other node to run a vm). Having a way to correlate the logs and follow a single event/flow is vital. +1 Knowing what command caused a sequence of events in VDSM would be really helpful particularly in a threaded environment. Further, wouldn't such an ID be helpful in an asynchronous request/response model? I'm not sure what the plans are for AMQP or even if there are plans, but I'd think that something like this would be crucial for an async response. So, if you implemented it you might be killing 2 birds with 1 stone. FYI: If you want to see examples of other systems that use similar concepts, take a look at the correlation ID in JMS. Cheers, Keith ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
Re: [vdsm] flowID schema
- Original Message - -1 I agree that for messaging environment having a Message ID is a must because you sometimes don't have a particular target so when you get a response you need to know what this node is actually responding to. The message ID could be composed with FLOWIDMSGID so you can reuse the field. But that is all besides the point. I understand that someone might find it fun to go on following the entire flow in the Engine and in VDSM. But I would like to hear an actual use case where someone would have actually benefited from this. As I see it having VSDM return the task ID with every response (and not just for async tasks) is a lot more useful and correct. A generic debugging scenario as I see it. 1. Something went wrong 2. You go looking in the ENGINE log trying to figure out what happend. 3. You see that ENGINE got SomeError. 4. Check to see if this error makes sense imagining that VDSM is always right and is a black box. 5. You did your digging and now you think that VDSM is as fault. 6. Go look for the call that failed. (If we returned the taskID it's pretty simple to find that call). 7. Look around the call to check VDSM state. 8. Profit. There is never a point where you want to follow a whole flow call by call going back and forth, and even if you did having the VDSM taskID is a better anchor then flowID. VDSM is built in a way that every call takes in to account the current state only. Debugging it with an engine flow mindset is just wrong and distracting. I see it doing more harm the good by reinforcing bad debugging practices. I don't know about harm, but, today the engine logs every call and return value to and from vdsm. This means that all the info that is needed to follow a flow is already present in the engine log (which was not the case previously) so I believe that the flow id is redundant. In addition, instead of focusing on how to track a flow between components, we should focus on how to improve the engine log so that the users don't need to go to the hosts in the first place. My biggest problem with it is that it changes each and every verb in the API and makes the log itself also more verbose and less readable. - Original Message - From: Keith Robertson krobe...@redhat.com To: VDSM Project Development vdsm-devel@lists.fedorahosted.org Sent: Thursday, February 9, 2012 1:34:43 PM Subject: Re: [vdsm] flowID schema On 02/09/2012 12:18 PM, Andrew Cathrow wrote: - Original Message - From: Ayal Baronaba...@redhat.com To: Dan Kenigsbergdan...@redhat.com Cc: VDSM Project Developmentvdsm-devel@lists.fedorahosted.org Sent: Monday, February 6, 2012 10:35:54 AM Subject: Re: [vdsm] flowID schema - Original Message - On Thu, Feb 02, 2012 at 10:32:49AM -0500, Saggi Mizrahi wrote: flowID makes no sense after the initial API call as stuff like cacheing\threadpools\samplingtasks\resources\asyncTasks so flowing a flow like that will not give you the entire picture while debugging. Also adding it now will make everything even more ugly. You know what, just imagine I wrote one of my long rambles about why I don't agree with doing this. I cannot imagine you write anything like that. Really. I do not understand why you object logging flowID on API entry point. The question is, what problem is this really trying to solve and is there a simpler and less obtrusive solution to that problem? correlating logs between ovirt engine and potentially multiple vdsm nodes is a nightmare. It requires a lot skill to follow a transaction through from the front end all the way to the node, and even multiple nodes (eg actions on spm, then actions on other node to run a vm). Having a way to correlate the logs and follow a single event/flow is vital. +1 Knowing what command caused a sequence of events in VDSM would be really helpful particularly in a threaded environment. Further, wouldn't such an ID be helpful in an asynchronous request/response model? I'm not sure what the plans are for AMQP or even if there are plans, but I'd think that something like this would be crucial for an async response. So, if you implemented it you might be killing 2 birds with 1 stone. FYI: If you want to see examples of other systems that use similar concepts, take a look at the correlation ID in JMS. Cheers, Keith ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org
Re: [vdsm] flowID schema
From: Saggi Mizrahi smizr...@redhat.com To: Keith Robertson krobe...@redhat.com Cc: VDSM Project Development vdsm-devel@lists.fedorahosted.org Sent: Thursday, February 9, 2012 2:24:44 PM Subject: Re: [vdsm] flowID schema -1 I agree that for messaging environment having a Message ID is a must because you sometimes don't have a particular target so when you get a response you need to know what this node is actually responding to. The message ID could be composed with FLOWIDMSGID so you can reuse the field. But that is all besides the point. I understand that someone might find it fun to go on following the entire flow in the Engine and in VDSM. But I would like to hear an actual use case where someone would have actually benefited from this. As I see it having VSDM return the task ID with every response (and not just for async tasks) is a lot more useful and correct. Actually, the only way to understand what happened in a certain flow is to follow it through. From the engine log where an action was initiated, down to the hosts that did the execution. Everything RHEV does is a flow, and with no correlation between hosts executing parts of the same flow, troubleshooting turns into guesswork, because the only contact point left is time, which is useless when you're talking about vdsm - there are sometimes hundreds of log records in a single second, and not every host is in absolute sync with every other. A generic debugging scenario as I see it. 1. Something went wrong 2. You go looking in the ENGINE log trying to figure out what happend. 3. You see that ENGINE got SomeError. ok, the rest are all downhill. 4. You follow the failure back to the start of the flow, then go with the flow to the point where the engine exited to vdsm 5. switch over to vdsm logs, make sure you have the timing right (with no flow ID that's the olny orientation after all) 6. find the start of the vdsm-side flow, follow it to the failure, pray the error makes sense. In many cases the answer is not in the vdsm failure traceback but somewhere in the middle of the flow, with no errors reported, this is why we need a way to easily follow things through. Moreover, the logs should be readable enough to make sense to a typical sysadmin, and not a RHEV expert. 4. Check to see if this error makes sense imagining that VDSM is always right and is a black box. 5. You did your digging and now you think that VDSM is as fault. 6. Go look for the call that failed. (If we returned the taskID it's pretty simple to find that call). 7. Look around the call to check VDSM state. 8. Profit. There is never a point where you want to follow a whole flow call by call going back and forth, and even if you did having the VDSM taskID is a better anchor then flowID. not everything is a task, flow IDs would unify entire flows, and make following them easy. VDSM is built in a way that every call takes in to account the current state only. Debugging it with an engine flow mindset is just wrong and distracting. I see it doing more harm the good by reinforcing bad debugging practices. Maybe you're right, though I can't see how from my experience so far, but following the flows is the only thing that got cases resolved. Not event IDs making every possible error, and not task IDs (though these do have their uses) - slow and meticulous mapping of flows to log records. - Original Message - From: Keith Robertson krobe...@redhat.com To: VDSM Project Development vdsm-devel@lists.fedorahosted.org Sent: Thursday, February 9, 2012 1:34:43 PM Subject: Re: [vdsm] flowID schema On 02/09/2012 12:18 PM, Andrew Cathrow wrote: - Original Message - From: Ayal Baronaba...@redhat.com To: Dan Kenigsbergdan...@redhat.com Cc: VDSM Project Developmentvdsm-devel@lists.fedorahosted.org Sent: Monday, February 6, 2012 10:35:54 AM Subject: Re: [vdsm] flowID schema - Original Message - On Thu, Feb 02, 2012 at 10:32:49AM -0500, Saggi Mizrahi wrote: flowID makes no sense after the initial API call as stuff like cacheing\threadpools\samplingtasks\resources\asyncTasks so flowing a flow like that will not give you the entire picture while debugging. Also adding it now will make everything even more ugly. You know what, just imagine I wrote one of my long rambles about why I don't agree with doing this. I cannot imagine you write anything like that. Really. I do not understand why you object logging flowID on API entry point. The question is, what problem is this really trying to solve and is there a simpler and less obtrusive solution to that problem? correlating logs between ovirt engine and potentially multiple vdsm nodes is a nightmare. It requires a lot skill to follow a transaction through from the front end all the way to the node, and even multiple nodes (eg actions on spm, then actions
Re: [vdsm] flowID schema
- Original Message - From: Saggi Mizrahi smizr...@redhat.com To: Keith Robertson krobe...@redhat.com Cc: VDSM Project Development vdsm-devel@lists.fedorahosted.org Sent: Thursday, February 9, 2012 2:24:44 PM Subject: Re: [vdsm] flowID schema -1 I agree that for messaging environment having a Message ID is a must because you sometimes don't have a particular target so when you get a response you need to know what this node is actually responding to. The message ID could be composed with FLOWIDMSGID so you can reuse the field. But that is all besides the point. I understand that someone might find it fun to go on following the entire flow in the Engine and in VDSM. But I would like to hear an actual use case where someone would have actually benefited from this. As I see it having VSDM return the task ID with every response (and not just for async tasks) is a lot more useful and correct. Actually, the only way to understand what happened in a certain flow is to follow it through. From the engine log where an action was initiated, down to the hosts that did the execution. Everything RHEV does is a flow, and with no correlation between hosts executing parts of the same flow, troubleshooting turns into guesswork, because the only contact point left is time, which is useless when you're talking about vdsm - there are sometimes hundreds of log records in a single second, and not every host is in absolute sync with every other. What are you talking about? you know exactly what operation the engine ran at vdsm level. If it's a task then you also have a task id which is a uuid so you don't need anything else. In addition, now that engine logs results, you can just grep that instead of a flow id and land at the exact correct command and not have to figure out which out of the 5 run in this flow is the relevant one. If you could give a real example where this would be beneficial (i.e. log excerpts, how you correlated them and how flow id would have eased your job) that would be great. Note that I've also discussed this with Yaniv from qe who said they don't really need it. A generic debugging scenario as I see it. 1. Something went wrong 2. You go looking in the ENGINE log trying to figure out what happend. 3. You see that ENGINE got SomeError. ok, the rest are all downhill. 4. You follow the failure back to the start of the flow, then go with the flow to the point where the engine exited to vdsm 5. switch over to vdsm logs, make sure you have the timing right (with no flow ID that's the olny orientation after all) 6. find the start of the vdsm-side flow, follow it to the failure, pray the error makes sense. In many cases the answer is not in the vdsm failure traceback but somewhere in the middle of the flow, with no errors reported, this is why we need a way to easily follow things through. Moreover, the logs should be readable enough to make sense to a typical sysadmin, and not a RHEV expert. 4. Check to see if this error makes sense imagining that VDSM is always right and is a black box. 5. You did your digging and now you think that VDSM is as fault. 6. Go look for the call that failed. (If we returned the taskID it's pretty simple to find that call). 7. Look around the call to check VDSM state. 8. Profit. There is never a point where you want to follow a whole flow call by call going back and forth, and even if you did having the VDSM taskID is a better anchor then flowID. not everything is a task, flow IDs would unify entire flows, and make following them easy. VDSM is built in a way that every call takes in to account the current state only. Debugging it with an engine flow mindset is just wrong and distracting. I see it doing more harm the good by reinforcing bad debugging practices. Maybe you're right, though I can't see how from my experience so far, but following the flows is the only thing that got cases resolved. Not event IDs making every possible error, and not task IDs (though these do have their uses) - slow and meticulous mapping of flows to log records. - Original Message - From: Keith Robertson krobe...@redhat.com To: VDSM Project Development vdsm-devel@lists.fedorahosted.org Sent: Thursday, February 9, 2012 1:34:43 PM Subject: Re: [vdsm] flowID schema On 02/09/2012 12:18 PM, Andrew Cathrow wrote: - Original Message - From: Ayal Baronaba...@redhat.com To: Dan Kenigsbergdan...@redhat.com Cc: VDSM Project Developmentvdsm-devel@lists.fedorahosted.org Sent: Monday, February 6, 2012 10:35:54 AM Subject: Re: [vdsm] flowID schema - Original Message - On Thu, Feb 02, 2012 at 10:32:49AM -0500, Saggi Mizrahi wrote: flowID makes no sense after the initial API call as stuff like cacheing\threadpools\samplingtasks\resources
Re: [vdsm] flowID schema
On Thu, Feb 02, 2012 at 12:00:44AM -0500, Douglas Landgraf wrote: Hello, flowID is schema that we will be including in vdsm API to oVirt Engine people share the ID of engine transaction to vdsm. With this in hands, we will add the ID of transactions to our log. I would like to know your opinion how we could do that without break our API, like include new parameters to our calls. Should we add at BindingXMLRPC.py - wrapper() a code to search for a 'flowID' key into functions which use dict as parameter (like create)? [1] Maybe change at other level inside BindingXMLRPC ? Moti came up an idea from that department: we can abuse one of the http headers (say, X-FlowID) and it in BindingXMLRPC._createXMLRPCServer, class LoggingHandle: def parse_request(self): ... threadLocal.flowID = self.headers.get('X-FlowID') I hate this crossing of layers, but it could keep our formal API intact. Dan. ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
Re: [vdsm] flowID schema
flowID makes no sense after the initial API call as stuff like cacheing\threadpools\samplingtasks\resources\asyncTasks so flowing a flow like that will not give you the entire picture while debugging. Also adding it now will make everything even more ugly. You know what, just imagine I wrote one of my long rambles about why I don't agree with doing this. As you plan on going on anyway here is my suggestion on how to push this in. XMLRPC doesn't support named parameter, this means that you can't just ad-hoc a new arg to all the API calls called flow-id. For simplicity's sake lets assume they always pass the last arg as flowID if it is a string that starts with __FLOWID__. What you do then is in dispatcher take the last arg and put it on the task object. Have the logger print this value even when the task is in prepare next to the threadID. You will have to make the clientIF calls use *another* dispatcher but the same task thread pool to have this supported at the clientIF verbs as well but I think it should have been done anyways. - Original Message - From: Douglas Landgraf dougsl...@redhat.com To: VDSM Project Development vdsm-devel@lists.fedorahosted.org Sent: Thursday, February 2, 2012 12:00:44 AM Subject: [vdsm] flowID schema Hello, flowID is schema that we will be including in vdsm API to oVirt Engine people share the ID of engine transaction to vdsm. With this in hands, we will add the ID of transactions to our log. I would like to know your opinion how we could do that without break our API, like include new parameters to our calls. Should we add at BindingXMLRPC.py - wrapper() a code to search for a 'flowID' key into functions which use dict as parameter (like create)? [1] Maybe change at other level inside BindingXMLRPC ? Ideas/Thoughts? [1] http://gerrit.ovirt.org/#patch,sidebyside,1221,3,vdsm/BindingXMLRPC.py Thanks! -- Cheers Douglas ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel
[vdsm] flowID schema
Hello, flowID is schema that we will be including in vdsm API to oVirt Engine people share the ID of engine transaction to vdsm. With this in hands, we will add the ID of transactions to our log. I would like to know your opinion how we could do that without break our API, like include new parameters to our calls. Should we add at BindingXMLRPC.py - wrapper() a code to search for a 'flowID' key into functions which use dict as parameter (like create)? [1] Maybe change at other level inside BindingXMLRPC ? Ideas/Thoughts? [1] http://gerrit.ovirt.org/#patch,sidebyside,1221,3,vdsm/BindingXMLRPC.py Thanks! -- Cheers Douglas ___ vdsm-devel mailing list vdsm-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/vdsm-devel