Re: linux-next: manual merge of the edac-amd tree with the edac tree
On Thu, Dec 01, 2016 at 07:15:01PM +0100, Borislav Petkov wrote: > On Thu, Dec 01, 2016 at 11:02:04AM -0500, Yazen Ghannam wrote: > > A deferred error is an uncorrectable error whose handling can be > > deferred, i.e. it's not urgent. This affects the system behavior, but > > I'm now thinking that this shouldn't affect users' behavior. I think it > > would be simpler to just classify deferred errors as uncorrectable > > errors so that users treat them as such. > > Why would we want to lie about deferred errors being uncorrectable? > They are uncorrectable errors that can be handled differently. If you can't handle them then there's not much difference. > And I believe deferred errors can be handled differently like freeze the > process using the page instead of killing it. And so on... > If deferred errors can be handled differently in userspace, then you're right we should maintain the distinction. I was thinking we'd only handle them in the kernel. > Why aren't you simply adding the documentation about > HW_EVENT_ERR_DEFERRED and be done with it? The downstream path like > tracepoint and all can handle all that just fine. > Okay, will do. > > Boris, > > Can we drop or revert commit d12a969ebbfc? > > No can do. It is a public branch and there's no touching it. > Okay, got it. Thanks, Yazen
Re: linux-next: manual merge of the edac-amd tree with the edac tree
On Thu, Dec 01, 2016 at 11:02:04AM -0500, Yazen Ghannam wrote: > A deferred error is an uncorrectable error whose handling can be > deferred, i.e. it's not urgent. This affects the system behavior, but > I'm now thinking that this shouldn't affect users' behavior. I think it > would be simpler to just classify deferred errors as uncorrectable > errors so that users treat them as such. Why would we want to lie about deferred errors being uncorrectable? And I believe deferred errors can be handled differently like freeze the process using the page instead of killing it. And so on... Why aren't you simply adding the documentation about HW_EVENT_ERR_DEFERRED and be done with it? The downstream path like tracepoint and all can handle all that just fine. > Boris, > Can we drop or revert commit d12a969ebbfc? No can do. It is a public branch and there's no touching it. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: linux-next: manual merge of the edac-amd tree with the edac tree
On Thu, Dec 01, 2016 at 10:06:17AM -0200, Mauro Carvalho Chehab wrote: > > However, rebasing over your tree showed a new documentation gap: > ./include/linux/edac.h:144: warning: Enum value 'HW_EVENT_ERR_DEFERRED' > not described in enum 'hw_event_mc_err_type' > > With was introduced by this commit: > > commit d12a969ebbfcfc25853c4147d42b388f758e8784 > Author: Yazen Ghannam > Date: Thu Nov 17 17:57:32 2016 -0500 > > EDAC, amd64: Add Deferred Error type > > Currently, deferred errors are classified as correctable in EDAC. Add a > new error type for deferred errors so that they are correctly reported > to the user. > > Signed-off-by: Yazen Ghannam > Cc: Aravind Gopalakrishnan > Cc: linux-edac > Link: > http://lkml.kernel.org/r/1479423463-8536-7-git-send-email-yazen.ghan...@amd.com > Signed-off-by: Borislav Petkov > > > Yazen introduced a "deferred error" code (whatever it means), but didn't > document what's that. Unfortunately, the patch description is also > not clear enough about what a "deferred error" means or how userspace > is supposed to handle it. > > Yazen, > > Could you please send us a patch adding a proper description for this > new error code? > Hi Mauro, A deferred error is an uncorrectable error whose handling can be deferred, i.e. it's not urgent. This affects the system behavior, but I'm now thinking that this shouldn't affect users' behavior. I think it would be simpler to just classify deferred errors as uncorrectable errors so that users treat them as such. Boris, Can we drop or revert commit d12a969ebbfc? And can we apply a fixup like this to commit 713ad54675fd? --- From: Yazen Ghannam Date: Thu, 1 Dec 2016 08:54:49 -0600 Subject: [PATCH] fixup! EDAC, amd64: Define and register UMC error decode function Signed-off-by: Yazen Ghannam --- drivers/edac/amd64_edac.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c index 991b36c..245b9a0 100644 --- a/drivers/edac/amd64_edac.c +++ b/drivers/edac/amd64_edac.c @@ -2480,8 +2480,9 @@ static void decode_umc_error(int node_id, struct mce *m) memset(&err, 0, sizeof(err)); + /* Log deferred errors as uncorrectable errors. */ if (m->status & MCI_STATUS_DEFERRED) - ecc_type = 3; + ecc_type = 1; err.channel = find_umc_channel(pvt, m); if (err.channel < 0) { -- 2.7.4 --- Thanks, Yazen
Re: linux-next: manual merge of the edac-amd tree with the edac tree
Em Thu, 1 Dec 2016 11:48:46 +0100 Borislav Petkov escreveu: > On Wed, Nov 30, 2016 at 08:50:13AM -0200, Mauro Carvalho Chehab wrote: > > Fixed. If you have a stable branch, I can rebase it on the top > > of your patches, in order to avoid the confict at linux-next. > > http://git.kernel.org/cgit/linux/kernel/git/bp/bp.git/log/?h=for-next Thanks! > but with that fixed, shouldn't the merge work without a conflict? Well, we're removing some lines from one file and adding them on another one. Git is not smart enough to solve such conflicts automatically: CONFLICT (content): Merge conflict in drivers/edac/edac_mc.c Recorded preimage for 'drivers/edac/edac_mc.c' Automatic merge failed; fix conflicts and then commit the result. Anyway, I rebased my work on the top of your branch. The conflict is now gone: https://git.kernel.org/cgit/linux/kernel/git/mchehab/linux-edac.git/log/?h=linux_next You can see the generated documentation at: https://mchehab.fedorapeople.org/kernel_docs/admin-guide/ras.html (user and admin guide) and: https://mchehab.fedorapeople.org/kernel_docs/driver-api/edac.html (kernel API) However, rebasing over your tree showed a new documentation gap: ./include/linux/edac.h:144: warning: Enum value 'HW_EVENT_ERR_DEFERRED' not described in enum 'hw_event_mc_err_type' With was introduced by this commit: commit d12a969ebbfcfc25853c4147d42b388f758e8784 Author: Yazen Ghannam Date: Thu Nov 17 17:57:32 2016 -0500 EDAC, amd64: Add Deferred Error type Currently, deferred errors are classified as correctable in EDAC. Add a new error type for deferred errors so that they are correctly reported to the user. Signed-off-by: Yazen Ghannam Cc: Aravind Gopalakrishnan Cc: linux-edac Link: http://lkml.kernel.org/r/1479423463-8536-7-git-send-email-yazen.ghan...@amd.com Signed-off-by: Borislav Petkov Yazen introduced a "deferred error" code (whatever it means), but didn't document what's that. Unfortunately, the patch description is also not clear enough about what a "deferred error" means or how userspace is supposed to handle it. Yazen, Could you please send us a patch adding a proper description for this new error code? Thanks! Mauro
Re: linux-next: manual merge of the edac-amd tree with the edac tree
On Wed, Nov 30, 2016 at 08:50:13AM -0200, Mauro Carvalho Chehab wrote: > Fixed. If you have a stable branch, I can rebase it on the top > of your patches, in order to avoid the confict at linux-next. http://git.kernel.org/cgit/linux/kernel/git/bp/bp.git/log/?h=for-next but with that fixed, shouldn't the merge work without a conflict? -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: linux-next: manual merge of the edac-amd tree with the edac tree
Em Mon, 28 Nov 2016 09:27:35 +0100 Borislav Petkov escreveu: > On Mon, Nov 28, 2016 at 02:37:26PM +1100, Stephen Rothwell wrote: > > Hi Borislav, > > > > Today's linux-next merge of the edac-amd tree got a conflict in: > > > > drivers/edac/edac_mc.c > > > > between commit: > > > > ef91afa61088 ("edac: move documentation from edac_mc.c to edac_core.h") > > > > from the edac tree and commit: > > > > c73e8833bec5 ("EDAC, mc: Fix locking around mc_devices list") > > > > from the edac-amd tree. > > > > I fixed it up (see below - there may be more fixes needed in > > edac_core.h) and can carry the fix as necessary. This is now fixed as > > far as linux-next is concerned, but any non trivial conflicts should be > > mentioned to your upstream maintainer when your tree is submitted for > > merging. You may also want to consider cooperating with the maintainer > > of the conflicting tree to minimise any particularly complex conflicts. > > Just one issue which has nothing to do with linux-next. There's still > that in ef91afa61088: > > > +/** > > + * edac_mc_find: Search for a mem_ctl_info structure whose index is @idx. > > + * > > + * @idx: index to be seek > > + * > > + * If found, return a pointer to the structure. > > + * Else return NULL. > > + * > > + * Caller must hold mem_ctls_mutex. > > + */ > > That last sentence in the comment is not true anymore - edac_mc_find() > is grabbing the mutex itself as it should be. Mauro, please fix that in > your tree. Fixed. If you have a stable branch, I can rebase it on the top of your patches, in order to avoid the confict at linux-next. Regards, Mauro
Re: linux-next: manual merge of the edac-amd tree with the edac tree
On Mon, Nov 28, 2016 at 02:37:26PM +1100, Stephen Rothwell wrote: > Hi Borislav, > > Today's linux-next merge of the edac-amd tree got a conflict in: > > drivers/edac/edac_mc.c > > between commit: > > ef91afa61088 ("edac: move documentation from edac_mc.c to edac_core.h") > > from the edac tree and commit: > > c73e8833bec5 ("EDAC, mc: Fix locking around mc_devices list") > > from the edac-amd tree. > > I fixed it up (see below - there may be more fixes needed in > edac_core.h) and can carry the fix as necessary. This is now fixed as > far as linux-next is concerned, but any non trivial conflicts should be > mentioned to your upstream maintainer when your tree is submitted for > merging. You may also want to consider cooperating with the maintainer > of the conflicting tree to minimise any particularly complex conflicts. Just one issue which has nothing to do with linux-next. There's still that in ef91afa61088: > +/** > + * edac_mc_find: Search for a mem_ctl_info structure whose index is @idx. > + * > + * @idx: index to be seek > + * > + * If found, return a pointer to the structure. > + * Else return NULL. > + * > + * Caller must hold mem_ctls_mutex. > + */ That last sentence in the comment is not true anymore - edac_mc_find() is grabbing the mutex itself as it should be. Mauro, please fix that in your tree. Thanks. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.