from:"Jens Axboe"

Re: [PATCH v2 0/4] eventfd: simplify signal helpers

2023-11-23 Thread Jens Axboe

On 11/22/23 5:48 AM, Christian Brauner wrote:
> Hey everyone,
> 
> This simplifies the eventfd_signal() and eventfd_signal_mask() helpers
> significantly. They can be made void and not take any unnecessary
> arguments.
> 
> I've added a few more simplifications based on Sean's suggestion.
> 
> Signed-off-by: Christian Brauner 
> 
> Changes in v2:
> - further simplify helpers
> - Link to v1: 
> https://lore.kernel.org/r/20230713-vfs-eventfd-signal-v1-0-7fda6c5d2...@kernel.org

Only oddity I spotted was the kerneldoc, which someone else already
brought up.

Reviewed-by: Jens Axboe 

-- 
Jens Axboe

Re: [PATCH] powerpc: Don't clobber fr0/vs0 during fp|altivec register save

2023-11-19 Thread Jens Axboe

On 11/18/23 4:45 PM, Timothy Pearson wrote:
> During floating point and vector save to thread data fr0/vs0 are clobbered
> by the FPSCR/VSCR store routine.  This leads to userspace register corruption
> and application data corruption / crash under the following rare condition:
> 
>  * A userspace thread is executing with VSX/FP mode enabled
>  * The userspace thread is making active use of fr0 and/or vs0
>  * An IPI is taken in kernel mode, forcing the userspace thread to reschedule
>  * The userspace thread is interrupted by the IPI before accessing data it
>previously stored in fr0/vs0
>  * The thread being switched in by the IPI has a pending signal
> 
> If these exact criteria are met, then the following sequence happens:
> 
>  * The existing thread FP storage is still valid before the IPI, due to a
>prior call to save_fpu() or store_fp_state().  Note that the current
>fr0/vs0 registers have been clobbered, so the FP/VSX state in registers
>is now invalid pending a call to restore_fp()/restore_altivec().
>  * IPI -- FP/VSX register state remains invalid
>  * interrupt_exit_user_prepare_main() calls do_notify_resume(),
>due to the pending signal
>  * do_notify_resume() eventually calls save_fpu() via giveup_fpu(), which
>merrily reads and saves the invalid FP/VSX state to thread local storage.
>  * interrupt_exit_user_prepare_main() calls restore_math(), writing the 
> invalid
>FP/VSX state back to registers.
>  * Execution is released to userspace, and the application crashes or corrupts
>data.

What an epic bug hunt! Hats off to you for seeing it through and getting
to the bottom of it. Particularly difficult as the commit that made it
easier to trigger was in no way related to where the actual bug was.

I ran this on the vm I have access to, and it survived 2x500 iterations.
Happy to call that good:

Tested-by: Jens Axboe 

-- 
Jens Axboe

Re: [PATCH v8 0/3] generic and PowerPC SED Opal keystore

2023-10-17 Thread Jens Axboe



On Wed, 04 Oct 2023 15:19:54 -0500, gjo...@linux.vnet.ibm.com wrote:
> This patchset has gone through numerous rounds of review and
> all comments/suggetions have been addressed. The reviews have
> covered all relevant areas including reviews by block and keyring
> developers as well as the SED Opal maintainer.
> 
> TCG SED Opal is a specification from The Trusted Computing Group
> that allows self encrypting storage devices (SED) to be locked at
> power on and require an authentication key to unlock the drive.
> 
> [...]

Applied, thanks!

[1/3] block:sed-opal: SED Opal keystore
  commit: 96ff37ceb203426b1bcebbae42399686110b0130
[2/3] block: sed-opal: keystore access for SED Opal keys
  commit: 5dd339722f5f612f349b068e8da6d6710fd0e460
[3/3] powerpc/pseries: PLPKS SED Opal keystore support
  commit: ec8cf230ceccfcc2bd29990c2902be168a92dee4

Best regards,
-- 
Jens Axboe

Re: [PATCH v8 0/3] generic and PowerPC SED Opal keystore

2023-10-17 Thread Jens Axboe

On 10/17/23 9:09 AM, Greg Joyce wrote:
> 
> Hi Jens,
> 
> I've addressed all the comments/issues on v7 of the patchset and
> haven't received any feedback on v8. Is there anything else that you'd
> like to see before this can be included?

Let's give it another shot!

-- 
Jens Axboe

Re: [PATCH v7 3/3 RESEND] powerpc/pseries: PLPKS SED Opal keystore support

2023-09-13 Thread Jens Axboe

On 9/13/23 12:59 PM, Nathan Chancellor wrote:
> Hi Greg,
> 
> On Fri, Sep 08, 2023 at 10:30:56AM -0500, gjo...@linux.vnet.ibm.com wrote:
>> From: Greg Joyce 
>>
>> Define operations for SED Opal to read/write keys
>> from POWER LPAR Platform KeyStore(PLPKS). This allows
>> non-volatile storage of SED Opal keys.
>>
>> Signed-off-by: Greg Joyce 
>> Reviewed-by: Jonathan Derrick 
>> Reviewed-by: Hannes Reinecke 
> 
> After this change in -next as commit 9f2c7411ada9 ("powerpc/pseries:
> PLPKS SED Opal keystore support"), I see the following crash when
> booting some distribution configurations, such as OpenSUSE's [1] (the
> rootfs is available at [2] if necessary):

I'll drop the series for now - I didn't push out the main branch just
yet as I don't publish the block next tree until at least at -rc2 time,
so it's just in a private branch for now.

-- 
Jens Axboe

Re: [PATCH v7 0/3 RESEND] generic and PowerPC SED Opal keystore

2023-09-11 Thread Jens Axboe

On 9/8/23 9:30 AM, gjo...@linux.vnet.ibm.com wrote:
> From: Greg Joyce 
> 
> This patchset extends the capabilites incorporated into for-6.6/block
> (https://git.kernel.dk/cgit/linux/commit/?h=for-6.6/block=3bfeb61256643281ac4be5b8a57e9d9da3db4335)
>  by allowing the SED Opal key to be seeded into
> the keyring from a secure permanent keystore.
> 
> It has gone through numerous rounds of review and all comments/suggetions
> have been addressed. The reviews have covered all relevant areas including
> reviews by block and keyring developers as well as the SED Opal
> maintainer. The last patchset submission has not solicited any responses
> in the six weeks since it was last distributed. The changes are
> generally useful and ready for inclusion.
> 
> TCG SED Opal is a specification from The Trusted Computing Group
> that allows self encrypting storage devices (SED) to be locked at
> power on and require an authentication key to unlock the drive.
> 
> Generic functions have been defined for accessing SED Opal keys.
> The generic functions are defined as weak so that they may be superseded
> by keystore specific versions.
> 
> PowerPC/pseries versions of these functions provide read/write access
> to SED Opal keys in the PLPKS keystore.
> 
> The SED block driver has been modified to read the SED Opal
> keystore to populate a key in the SED Opal keyring. Changes to the
> SED Opal key will be written to the SED Opal keystore.

Applied for 6.7, thanks.

-- 
Jens Axboe

Re: [PATCH v7 0/3 RESEND] generic and PowerPC SED Opal keystore

2023-09-08 Thread Jens Axboe

On 9/8/23 9:30 AM, gjo...@linux.vnet.ibm.com wrote:
> From: Greg Joyce 
> 
> This patchset extends the capabilites incorporated into for-6.6/block
> (https://git.kernel.dk/cgit/linux/commit/?h=for-6.6/block=3bfeb61256643281ac4be5b8a57e9d9da3db4335)
>  by allowing the SED Opal key to be seeded into
> the keyring from a secure permanent keystore.
> 
> It has gone through numerous rounds of review and all comments/suggetions
> have been addressed. The reviews have covered all relevant areas including
> reviews by block and keyring developers as well as the SED Opal
> maintainer. The last patchset submission has not solicited any responses
> in the six weeks since it was last distributed. The changes are
> generally useful and ready for inclusion.

Best time to resend is generally once the merge window is closed again,
as I won't start applying patches for the next release before that
happens. I'll try to remember to pick this one up for 6.7.

-- 
Jens Axboe

Re: [PATCH v5 0/3 RESEND] sed-opal: keyrings, discovery, revert, key store

2023-08-22 Thread Jens Axboe



On Fri, 21 Jul 2023 16:15:31 -0500, gjo...@linux.vnet.ibm.com wrote:
> This patchset has gone through numerous rounds of review and
> all comments/suggetions have been addressed. The reviews have
> covered all relevant areas including reviews by block and keyring
> developers as well as the SED Opal maintainer. The last
> patchset submission has not solicited any responses in the
> six weeks since it was last distributed. The changes are
> generally useful and ready for inclusion.
> 
> [...]

Applied, thanks!

[1/3] block: sed-opal: Implement IOC_OPAL_DISCOVERY
  commit: 9fb10726ecc5145550180aec4fd0adf0a7b1d634
[2/3] block: sed-opal: Implement IOC_OPAL_REVERT_LSP
  commit: 5c82efc1aee8eb0919aa67a0d2559de5a326bd7c
[3/3] block: sed-opal: keyring support for SED keys
  commit: 3bfeb61256643281ac4be5b8a57e9d9da3db4335

Best regards,
-- 
Jens Axboe

Re: [PATCH v4 RESEND 0/3] sed-opal: keyrings, discovery, revert, key store

2023-06-05 Thread Jens Axboe

On 6/1/23 4:37 PM, gjo...@linux.vnet.ibm.com wrote:
> From: Greg Joyce 
> 
> This patchset has gone through numerous rounds of review and
> all comments/suggetions have been addressed. I believe that
> this patchset is ready for inclusion.
> 
> TCG SED Opal is a specification from The Trusted Computing Group
> that allows self encrypting storage devices (SED) to be locked at
> power on and require an authentication key to unlock the drive.
> 
> The current SED Opal implementation in the block driver
> requires that authentication keys be provided in an ioctl
> so that they can be presented to the underlying SED
> capable drive. Currently, the key is typically entered by
> a user with an application like sedutil or sedcli. While
> this process works, it does not lend itself to automation
> like unlock by a udev rule.
> 
> The SED block driver has been extended so it can alternatively
> obtain a key from a sed-opal kernel keyring. The SED ioctls
> will indicate the source of the key, either directly in the
> ioctl data or from the keyring.
> 
> Two new SED ioctls have also been added. These are:
>   1) IOC_OPAL_REVERT_LSP to revert LSP state
>   2) IOC_OPAL_DISCOVERY to discover drive capabilities/state
> 
> change log v4:
> - rebase to 6.3-rc7
>   - replaced "255" magic number with U8_MAX

None of this applies for for-6.5/block, and I'm a little puzzled
as to why you'd rebase to an old kernel rather than a 6.4-rc at
least?

Please resend one that is current.

-- 
Jens Axboe

Re: Memory coherency issue with IO thread offloading?

2023-03-28 Thread Jens Axboe

On 3/28/23 6:51?AM, Michael Ellerman wrote:
> Jens Axboe  writes:
>>>> Can the  queueing cause the creation of an IO thread (if one does not
>>>> exist, or all blocked?)
>>>
>>> Yep
>>>
>>> Since writing this email, I've gone through a lot of different tests.
>>> Here's a rough listing of what I found:
>>>
>>> - Like using the hack patch, if I just limit the number of IO thread
>>>   workers to 1, it seems to pass. At least longer than before, does 1000
>>>   iterations.
>>>
>>> - If I pin each IO worker to a single CPU, it also passes.
>>>
>>> - If I liberally sprinkle smp_mb() for the io-wq side, test still fails.
>>>   I've added one before queueing the work item, and after. One before
>>>   the io-wq worker grabs a work item and one after. Eg full hammer
>>>   approach. This still fails.
>>>
>>> Puzzling... For the "pin each IO worker to a single CPU" I added some
>>> basic code around trying to ensure that a work item queued on CPU X
>>> would be processed by a worker on CPU X, and too a large degree, this
>>> does happen. But since the work list is a normal list, it's quite
>>> possible that some other worker finishes its work on CPU Y just in time
>>> to grab the one from cpu X. I checked and this does happen in the test
>>> case, yet it still passes. This may be because I got a bit lucky, but
>>> seems suspect with thousands of passes of the test case.
>>>
>>> Another theory there is that it's perhaps related to an io-wq worker
>>> being rescheduled on a different CPU. Though again puzzled as to why the
>>> smp_mb sprinkling didn't fix that then. I'm going to try and run the
>>> test case with JUST the io-wq worker pinning and not caring about where
>>> the work is processed to see if that does anything.
>>
>> Just pinning each worker to whatever CPU they got created on seemingly
>> fixes the issue too. This does not mean that each worker will process
>> work on the CPU on which it was queued, just that each worker will
>> remain on whatever CPU it originally got created on.
>>
>> Puzzling...
>>
>> Note that it is indeed quite possible that this isn't a ppc issue at
>> all, just shows on ppc. It could be page cache related, or it could even
>> be a bug in mariadb itself.
> 
> I tried binary patching every lwsync to hwsync (read/write to full
> barrier) in mariadbd and all the libaries it links. It didn't fix the
> problem.
> 
> I also tried switching all the kernel barriers/spin locks to using a
> hwsync, but that also didn't fix it.
> 
> It's still possible there's somewhere that currently has no barrier at
> all that needs one, the above would only fix the problem if we have a
> read/write barrier that actually needs to be a full barrier.
> 
> 
> I also looked at making all TLB invalidates broadcast, regardless of
> whether we think the thread has only been on a single CPU. That didn't
> help, but I'm not sure I got all places where we do TLB invalidates, so
> I'll look at that some more tomorrow.

Thanks, appreciate your testing! I have no new data points since
yesterday, but the key point from then still seems to be that if an io
worker never reschedules onto a different CPU, then the problem doesn't
occur. This could very well be a page cache issue, if it isn't an issue
on the powerpc side...

-- 
Jens Axboe

Re: [PATCH] powerpc: don't try to copy ppc for task with NULL pt_regs

2023-03-28 Thread Jens Axboe

On 3/28/23 5:32?AM, Michael Ellerman wrote:
> Jens Axboe  writes:
>> Powerpc sets up PF_KTHREAD and PF_IO_WORKER with a NULL pt_regs, which
>> from my (arguably very short) checking is not commonly done for other
>> archs. This is fine, except when PF_IO_WORKER's have been created and
>> the task does something that causes a coredump to be generated.
> 
> Do kthread's ever core dump? I didn't think they did, but I can't find
> any logic to prevent it.

kthreads aren't associated with the original task, they just exist by
themselves. They also can't take signals. Eg they cannot core dump, just
oops :-)

This is different than io workers that do show up as threads, but they
still don't exit to userspace. That is why it ended being a problem.

> As Nick said we should probably have a non-NULL regs for PF_IO_WORKERS,
> but I'll still take this as a nice backportable fix for the immediate
> crash.
> 
> I tagged it as Fixes: pointing back at the commit that added ppr_get(),
> even though I don't know for sure the bug was triggerable back then
> (v4.8).

Thanks!

-- 
Jens Axboe

Re: Memory coherency issue with IO thread offloading?

2023-03-27 Thread Jens Axboe

>> Can the  queueing cause the creation of an IO thread (if one does not
>> exist, or all blocked?)
> 
> Yep
> 
> Since writing this email, I've gone through a lot of different tests.
> Here's a rough listing of what I found:
> 
> - Like using the hack patch, if I just limit the number of IO thread
>   workers to 1, it seems to pass. At least longer than before, does 1000
>   iterations.
> 
> - If I pin each IO worker to a single CPU, it also passes.
> 
> - If I liberally sprinkle smp_mb() for the io-wq side, test still fails.
>   I've added one before queueing the work item, and after. One before
>   the io-wq worker grabs a work item and one after. Eg full hammer
>   approach. This still fails.
> 
> Puzzling... For the "pin each IO worker to a single CPU" I added some
> basic code around trying to ensure that a work item queued on CPU X
> would be processed by a worker on CPU X, and too a large degree, this
> does happen. But since the work list is a normal list, it's quite
> possible that some other worker finishes its work on CPU Y just in time
> to grab the one from cpu X. I checked and this does happen in the test
> case, yet it still passes. This may be because I got a bit lucky, but
> seems suspect with thousands of passes of the test case.
> 
> Another theory there is that it's perhaps related to an io-wq worker
> being rescheduled on a different CPU. Though again puzzled as to why the
> smp_mb sprinkling didn't fix that then. I'm going to try and run the
> test case with JUST the io-wq worker pinning and not caring about where
> the work is processed to see if that does anything.

Just pinning each worker to whatever CPU they got created on seemingly
fixes the issue too. This does not mean that each worker will process
work on the CPU on which it was queued, just that each worker will
remain on whatever CPU it originally got created on.

Puzzling...

Note that it is indeed quite possible that this isn't a ppc issue at
all, just shows on ppc. It could be page cache related, or it could even
be a bug in mariadb itself.

-- 
Jens Axboe

Re: [PATCH] powerpc: don't try to copy ppc for task with NULL pt_regs

2023-03-27 Thread Jens Axboe

On 3/27/23 7:54?AM, Michael Ellerman wrote:
> "Nicholas Piggin"  writes:
>> On Mon Mar 27, 2023 at 8:15 AM AEST, Jens Axboe wrote:
>>> Powerpc sets up PF_KTHREAD and PF_IO_WORKER with a NULL pt_regs, which
>>> from my (arguably very short) checking is not commonly done for other
>>> archs. This is fine, except when PF_IO_WORKER's have been created and
>>> the task does something that causes a coredump to be generated. Then we
>>> get this crash:
>>
>> Hey Jens,
>>
>> Thanks for the testing and the patch.
>>
>> I think your patch would work, but I'd be inclined to give the IO worker
>> a pt_regs so it looks more like other archs and a regular user thread.
>>
>> Your IO worker bug reminded me to resurrect some copy_thread patches I
>> had and I think they should do that
>>
>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2023-March/256271.html
>>
>> I wouldn't ask you to test it until I've at least tried, do you have a
>> test case that triggers this?
> 
> I hit it once on one machine while running the mtr test from the other
> thread. I'm not sure what leads to it failing that way rather than the
> usual case of the mariadb test just printing an error.

That's how I hit it first too, then I wrote the reproducer to verify and
be able to test a patch.

-- 
Jens Axboe

Re: [PATCH] powerpc: don't try to copy ppc for task with NULL pt_regs

2023-03-27 Thread Jens Axboe

On 3/27/23 12:36?AM, Nicholas Piggin wrote:
> On Mon Mar 27, 2023 at 8:15 AM AEST, Jens Axboe wrote:
>> Powerpc sets up PF_KTHREAD and PF_IO_WORKER with a NULL pt_regs, which
>> from my (arguably very short) checking is not commonly done for other
>> archs. This is fine, except when PF_IO_WORKER's have been created and
>> the task does something that causes a coredump to be generated. Then we
>> get this crash:
> 
> Hey Jens,
> 
> Thanks for the testing and the patch.
> 
> I think your patch would work, but I'd be inclined to give the IO worker
> a pt_regs so it looks more like other archs and a regular user thread.

Yep I think that'd be a better idea. No better way to get a good patch
than to send out a bad one :-)

> Your IO worker bug reminded me to resurrect some copy_thread patches I
> had and I think they should do that
> 
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2023-March/256271.html
> 
> I wouldn't ask you to test it until I've at least tried, do you have a
> test case that triggers this?

I can test them pretty easily. I did write a test case that is 100%
reliable for me, attached. Just do:

$ gcc -Wall -o ppc-crash ppc-crash.c -luring
$ ulimit -c1000
$ ./ppc-crash

and it'll bomb while trying to write that coredump.

-- 
Jens Axboe
#include 
#include 
#include 

#include 

int main(int argc, char *argv[])
{
	struct io_uring_sqe *sqe;
	struct io_uring ring;
	unsigned long *ptr = NULL;
	char buf[16384];
	char fname[32];
	int fd[4];
	int i;

	for (i = 0; i < 4; i++) {
		sprintf(fname, "/dev/shm/test.%d", i);
		fd[i] = open(fname, O_RDWR | O_CREAT, 0644);
		if (fd[i] < 0) {
			perror("open");
			return 1;
		}
	}

	io_uring_queue_init(32, , 0);

	for (i = 0; i < 32; i++) {
		unsigned long off = 16384 * (i / 4);
		int index = i & 3;

		sqe = io_uring_get_sqe();
		io_uring_prep_write(sqe, fd[index], buf, sizeof(buf), off);
	}

	io_uring_submit();
	usleep(1000);

	*ptr = 0x1234;
}

Re: Memory coherency issue with IO thread offloading?

2023-03-27 Thread Jens Axboe

On 3/26/23 10:22?PM, Nicholas Piggin wrote:
> On Sat Mar 25, 2023 at 11:20 AM AEST, Jens Axboe wrote:
>> On 3/24/23 7:15?PM, Jens Axboe wrote:
>>>> Are there any CONFIG options I'd need to trip this?
>>>
>>> I don't think you need any special CONFIG options. I'll attach my config
>>> here, and I know the default distro one hits it too. But perhaps the
>>> mariadb version is not new enough? I think you need 10.6 or above, as
>>> will use io_uring by default. What version are you running?
>>
>> And here's the .config and the patch for using queue_work().
> 
> So if you *don't* apply this patch, the work gets queued up with an IO
> thread? In io-wq.c? Does that worker end up just doing an io_write()
> same as this one?

Right, without this patch, it gets added to the io-wq work pool. If a
thread is available to run it, it will. If one is not, then one is
created. Eg either event can happen.

That thread does the exact same io_write() again.

> Can the  queueing cause the creation of an IO thread (if one does not
> exist, or all blocked?)

Yep

Since writing this email, I've gone through a lot of different tests.
Here's a rough listing of what I found:

- Like using the hack patch, if I just limit the number of IO thread
  workers to 1, it seems to pass. At least longer than before, does 1000
  iterations.

- If I pin each IO worker to a single CPU, it also passes.

- If I liberally sprinkle smp_mb() for the io-wq side, test still fails.
  I've added one before queueing the work item, and after. One before
  the io-wq worker grabs a work item and one after. Eg full hammer
  approach. This still fails.

Puzzling... For the "pin each IO worker to a single CPU" I added some
basic code around trying to ensure that a work item queued on CPU X
would be processed by a worker on CPU X, and too a large degree, this
does happen. But since the work list is a normal list, it's quite
possible that some other worker finishes its work on CPU Y just in time
to grab the one from cpu X. I checked and this does happen in the test
case, yet it still passes. This may be because I got a bit lucky, but
seems suspect with thousands of passes of the test case.

Another theory there is that it's perhaps related to an io-wq worker
being rescheduled on a different CPU. Though again puzzled as to why the
smp_mb sprinkling didn't fix that then. I'm going to try and run the
test case with JUST the io-wq worker pinning and not caring about where
the work is processed to see if that does anything.

> I'm wondering what the practical differences are between this patch and
> upstream.
> 
> kthread_use_mm() should be basically the same as context switching to an
> IO thread. There is maybe a difference in that kthread_switch_mm() has
> a 'sync' instruction *after* the MMU is switched to the new thread from
> the membarrier code, but a regular context switch might not. The MMU
> switch does have an isync() after it though, so loads *should* be
> prohibited from moving ahead of that.
> 
> Something like this adds a sync roughly where kthread_use_mm() has one.
> It's a pretty unlikely shot in the dark though. I'm more inclined to
> think the work submission to the IO thread might have a problem.

Didn't seem to change anything, fails pretty quickly:

[...]
encryption.innodb_encryption 'innodb,undo0' [ 38 pass ]   3083
encryption.innodb_encryption 'innodb,undo0' [ 39 pass ]   3135
encryption.innodb_encryption 'innodb,undo0' [ 40 fail ]
Test ended at 2023-03-27 12:20:46

CURRENT_TEST: encryption.innodb_encryption
mysqltest: At line 11: query 'SET @start_global_value = 
@@global.innodb_encryption_threads' failed: ER_UNKNOWN_SYSTEM_VARIABLE (1193): 
Unknown system variable 'innodb_encryption_threads'

The result from queries just before the failure was:
SET @start_global_value = @@global.innodb_encryption_threads;

 - saving '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/' to 
'/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/'
***Warnings generated in error logs during shutdown after running tests: 
encryption.innodb_encryption

2023-03-27 12:20:45 0 [Warning] Plugin 'example_key_management' is of maturity 
level experimental while the server is gamma
2023-03-27 12:20:45 0 [ERROR] InnoDB: Database page corruption on disk or a 
failed read of file './ibdata1' page [page id: space=0, page number=214]. You 
may have to recover from a backup.
2023-03-27 12:20:45 0 [ERROR] InnoDB: File './ibdata1' is corrupted
2023-03-27 12:20:45 0 [ERROR] InnoDB: Plugin initialization aborted with error 
Page read from tablespace is corrupted.
2023-03-27 12:20:45 0 [ERROR] Plugin 'InnoDB' init function returned error.
2023-03-27 12:20:45 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE 
failed.

-- 
Jens Axboe

[PATCH] powerpc: don't try to copy ppc for task with NULL pt_regs

2023-03-26 Thread Jens Axboe

Powerpc sets up PF_KTHREAD and PF_IO_WORKER with a NULL pt_regs, which
from my (arguably very short) checking is not commonly done for other
archs. This is fine, except when PF_IO_WORKER's have been created and
the task does something that causes a coredump to be generated. Then we
get this crash:

Kernel attempted to read user page (160) - exploit attempt? (uid: 1000)
BUG: Kernel NULL pointer dereference on read at 0x0160
Faulting instruction address: 0xc00c3a60
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=32 NUMA pSeries
Modules linked in: bochs drm_vram_helper drm_kms_helper xts binfmt_misc ecb ctr 
syscopyarea sysfillrect cbc sysimgblt drm_ttm_helper aes_generic ttm sg libaes 
evdev joydev virtio_balloon vmx_crypto gf128mul drm dm_mod fuse loop configfs 
drm_panel_orientation_quirks ip_tables x_tables autofs4 hid_generic usbhid hid 
xhci_pci xhci_hcd usbcore usb_common sd_mod
CPU: 1 PID: 1982 Comm: ppc-crash Not tainted 6.3.0-rc2+ #88
Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf05 
of:SLOF,HEAD hv:linux,kvm pSeries
NIP:  c00c3a60 LR: c0039944 CTR: c00398e0
REGS: c41833b0 TRAP: 0300   Not tainted  (6.3.0-rc2+)
MSR:  8280b033   CR: 88082828  XER: 
200400f8
CFAR: c00c386c DAR: 0160 DSISR: 4000 IRQMASK: 0
GPR00: c0039920 c4183650 c175d600 c40f9800
GPR04: 0160 0008 c0039920 
GPR08:   0003fe06 2000
GPR12: c00398e0 c003f200 c15edbc0 cba2f648
GPR16: cba2f600 c1616ea8 0004 
GPR20: 0048 c4183918 c1410f00 c1410ef8
GPR24: c40f9800 c40f9800 c41837b8 c00398e0
GPR28: ccc4cb80 c40f9800 0008 0008
NIP [c00c3a60] memcpy_power7+0x200/0x7d0
LR [c0039944] ppr_get+0x64/0xb0
Call Trace:
[c4183650] [c0039920] ppr_get+0x40/0xb0 (unreliable)
[c4183690] [c01e5e80] __regset_get+0x180/0x1f0
[c4183700] [c01e5f94] regset_get_alloc+0x64/0x90
[c4183740] [c07ae638] elf_core_dump+0xb98/0x1b60
[c41839c0] [c07bb564] do_coredump+0x1c34/0x24a0
[c4183ba0] [c01acf0c] get_signal+0x71c/0x1410
[c4183ce0] [c00228a0] do_notify_resume+0x140/0x6f0
[c4183db0] [c00353bc] 
interrupt_exit_user_prepare_main+0x29c/0x320
[c4183e20] [c003579c] interrupt_exit_user_prepare+0x6c/0xa0
[c4183e50] [c000c6f4] interrupt_return_srr_user+0x8/0x138
--- interrupt: 300 at 0x183ee09e0
NIP:  000183ee09e0 LR: 000183ee09dc CTR: 8280f033
REGS: c4183e80 TRAP: 0300   Not tainted  (6.3.0-rc2+)
MSR:  8000d033   CR: 22002848  XER: 00f8
CFAR: 7ffe6d746aa8 DAR:  DSISR: 4200 IRQMASK: 0
GPR00: 000183ee09dc 720d37c0 000183f07f00 
GPR04:  720d37a8  7ffe6d9eae00
GPR08: 720d3710   
GPR12:  7ffe6d9eae00  
GPR16:    
GPR20:    000183ee0860
GPR24: 7ffe6d9df820 7ffe6d9e 720d7d98 0001
GPR28: 000183ee0c60 720d7924 720d7820 
NIP [000183ee09e0] 0x183ee09e0
LR [000183ee09dc] 0x183ee09dc
--- interrupt: 300
Code: f9030018 38630020 409f001c e804 e8c40008 38840010 f803 f8c30008 
38630010 78a50720 7cb01120 409c001c <8004> 80c40004 38840008 9003
---[ end trace  ]---

note: ppc-crash[1982] exited with irqs disabled

because ppr_get() is trying to copy from a PF_IO_WORKER with a NULL
pt_regs.

Check for a valid pt_regs in both ppc_get/ppr_set, and return an error
if not set. The actual error value doesn't seem to be important here,
so just pick -EINVAL.

Signed-off-by: Jens Axboe 

diff --git a/arch/powerpc/kernel/ptrace/ptrace-view.c 
b/arch/powerpc/kernel/ptrace/ptrace-view.c
index 2087a785f05f..80b699dd0d7f 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-view.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
@@ -290,6 +290,8 @@ static int gpr_set(struct task_struct *target, const struct 
user_regset *regset,
 static int ppr_get(struct task_struct *target, const struct user_regset 
*regset,
   struct membuf to)
 {
+   if (!target->thread.regs)
+   return -EINVAL;
return membuf_write(, >thread.regs->ppr, sizeof(u64));
 }
 
@@ -297,6 +299,8 @@ static int ppr_set(struct task_struct *target, const struct 
user_regset *regset,
   unsigned int pos, uns

Re: Memory coherency issue with IO thread offloading?

2023-03-24 Thread Jens Axboe

On 3/24/23 6:42?PM, Michael Ellerman wrote:
> Jens Axboe  writes:
>> Hi,
> 
> Hi Jens,
> 
> Thanks for the report.
> 
>> I got a report sent to me from mariadb, in where 5.10.158 works fine and
>> 5.10.162 is broken. And in fact, current 6.3-rc also fails the test
>> case. Beware that this email is long, as I'm trying to include
>> everything that may be relevant...
>>
>> The test case in question is pretty simple. On debian testing, do:
>>
>> $ sudo apt-get install mariadb-test
>> $ cd /usr/share/mysql/mysql-test
>> $ ./mtr --mysqld=--innodb-flush-method=fsync 
>> --mysqld=--innodb-use-native-aio=1 --vardir=/dev/shm/mysql  --force 
>> encryption.innodb_encryption,innodb,undo0 --repeat=200
> 
> I mostly use Fedora, the package name is the same but the mtr binary
> ends up in /usr/share/mysql.
> 
>> and if it fails, you'll see something like:
>>
>> encryption.innodb_encryption 'innodb,undo0' [ 6 pass ]   3120
>> encryption.innodb_encryption 'innodb,undo0' [ 7 pass ]   3123
>> encryption.innodb_encryption 'innodb,undo0' [ 8 pass ]   3042
>> encryption.innodb_encryption 'innodb,undo0' [ 9 fail ]
>> Test ended at 2023-03-23 16:55:17
> 
> I haven't been able to get this to fail yet. I've done several runs with
> --repeat=500 and haven't seen any errors yet.
> 
> Are there any CONFIG options I'd need to trip this?

I don't think you need any special CONFIG options. I'll attach my config
here, and I know the default distro one hits it too. But perhaps the
mariadb version is not new enough? I think you need 10.6 or above, as
will use io_uring by default. What version are you running?

> ...
>> Today I finally gave up and ran a basic experiment, which simply
>> offloads the writes to a kthread. Since powerpc has an interesting
>> memory coherency model, my suspicion was that the work involved with
>> switching MMs for the kthread could just be the main difference here.
>> The patch is really dumb and simple - rather than queue the write to an
>> IO thread, it just offloads it to a kthread that then does
>> kthread_use_mm(), perform write with the same write handler,
>> kthread_unuse_mm(). AND THIS WORKS! Usually the above mtr test would
>> fail in 2..20 loops, I've now done 200 and 500 loops and it's fine.
> 
> Can you share the patch that does that? It would help me track down
> where exactly in the io_uring code you're talking about.

Shoot yes, I actually meant to attach it but then forgot. Below!

>> Which then leads me to the question, what about the IO thread offload
>> makes this fail on powerpc (and no other arch I've tested on, including
>> x86/x86-64/aarch64/hppa64)? The offload should be equivalent to having a
>> thread in userspace in the application, and having that thread just
>> perform the writes. Is there some magic involved with the kthread mm
>> use/unuse that makes this sufficiently consistent on powerpc? I've tried
>> any mix of isync()/mb and making the flush_dcache_page() unconditionally
>> done in the filemap read/write helpers, and it still falls flat on its
>> face with the offload to an IO thread.
> 
> My first guess would be that there's some missing barriers between the
> thread that queues the IO and the IO worker thread. 

That was my guess too, and I consulted Paul McKenney as well on that.
And he had some ideas of course, in terms of ordering of the CQ ring.
But tried it all out, and it still failed in the same way...

> I think you're using schedule_work() for that though, which should be a
> full barrier. Could it be on the completion side?

queue_work() for the patch, before that it's io-wq which is an internal
IO thread worker pool. The latter just needs a spin_lock() around
queueing the work, and then a wake of the task. Typing this out, maybe
this is where a barrier is now missing? If the IO thread is already
running rather than sleeping?

> I can't think of any magic in kthread_use_mm() other than extra
> barriers. In particular kthread_unuse_mm() has an
> smp_mb__after_spinlock() which is a full memory barrier on powerpc but
> is a nop on some other architectures, x86 at least.

Yeah, I did poke at kthread_use_mm() and the related powerpc bits, but
didn't immediately find anything that seemed promising in this regard.

>> I must clearly be missing something here, which is why I'm emailing the
>> powerpc Gods for help :-)
> 
> Unfortunately the true God of powerpc memory ordering has left us and
> ascended into the Metaverse ;)

;-)

-- 
Jens Axboe

Re: Memory coherency issue with IO thread offloading?

2023-03-24 Thread Jens Axboe

On 3/24/23 6:15 PM, Michael Ellerman wrote:
> Jens Axboe  writes:
>> On 3/24/23 1:27?AM, Christophe Leroy wrote:
>>> Le 23/03/2023 ? 19:54, Jens Axboe a ?crit :
>>>> I got a report sent to me from mariadb, in where 5.10.158 works fine and
>>>> 5.10.162 is broken. And in fact, current 6.3-rc also fails the test
>>>> case. Beware that this email is long, as I'm trying to include
>>>> everything that may be relevant...
>>>
>>> Which variant of powerpc ? 32 or 64 bits ? Book3S or BookE ?
>>
>> I knew I'd forget something important... It's power9:
>>
>> processor: 0
>> cpu  : POWER9 (architected), altivec supported
>> clock: 2200.00MHz
>> revision : 2.2 (pvr 004e 1202)
> 
> Believe it or not there's still more variables in play, Power9 has two
> different MMUs, and Linux can run on two different hypervisors as well
> as on bare metal P9 :)

Just my luck :-)

> 
> Can you paste the last ~10 lines of /proc/cpuinfo, with the "machine",
> "firmware" and "MMU" lines, that should tell us everything we need to
> know.

timebase: 51200
platform: pSeries
model   : IBM pSeries (emulated by qemu)
machine : CHRP IBM pSeries (emulated by qemu)
MMU     : Radix

I know it reproduces bare metal as well, but no details on what
that box is. I just know I was able to reproduce it in this vm that
I was able to get my hands on.

-- 
Jens Axboe

Re: Memory coherency issue with IO thread offloading?

2023-03-24 Thread Jens Axboe

On 3/24/23 1:27?AM, Christophe Leroy wrote:
> Hi,
> 
> Le 23/03/2023 ? 19:54, Jens Axboe a ?crit :
>> Hi,
>>
>> I got a report sent to me from mariadb, in where 5.10.158 works fine and
>> 5.10.162 is broken. And in fact, current 6.3-rc also fails the test
>> case. Beware that this email is long, as I'm trying to include
>> everything that may be relevant...
> 
> Which variant of powerpc ? 32 or 64 bits ? Book3S or BookE ?

I knew I'd forget something important... It's power9:

processor   : 0
cpu : POWER9 (architected), altivec supported
clock   : 2200.00MHz
revision: 2.2 (pvr 004e 1202)

-- 
Jens Axboe

Memory coherency issue with IO thread offloading?

2023-03-23 Thread Jens Axboe

Hi,

I got a report sent to me from mariadb, in where 5.10.158 works fine and
5.10.162 is broken. And in fact, current 6.3-rc also fails the test
case. Beware that this email is long, as I'm trying to include
everything that may be relevant...

The test case in question is pretty simple. On debian testing, do:

$ sudo apt-get install mariadb-test
$ cd /usr/share/mysql/mysql-test
$ ./mtr --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-use-native-aio=1 
--vardir=/dev/shm/mysql  --force encryption.innodb_encryption,innodb,undo0 
--repeat=200

and if it fails, you'll see something like:

encryption.innodb_encryption 'innodb,undo0' [ 6 pass ]   3120
encryption.innodb_encryption 'innodb,undo0' [ 7 pass ]   3123
encryption.innodb_encryption 'innodb,undo0' [ 8 pass ]   3042
encryption.innodb_encryption 'innodb,undo0' [ 9 fail ]
Test ended at 2023-03-23 16:55:17

CURRENT_TEST: encryption.innodb_encryption
mysqltest: At line 11: query 'SET @start_global_value = 
@@global.innodb_encryption_threads' failed: ER_UNKNOWN_SYSTEM_VARIABLE (1193): 
Unknown system variable 'innodb_encryption_threads'

The result from queries just before the failure was:
SET @start_global_value = @@global.innodb_encryption_threads;

 - saving '/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/' to 
'/dev/shm/mysql/log/encryption.innodb_encryption-innodb,undo0/'
***Warnings generated in error logs during shutdown after running tests: 
encryption.innodb_encryption

2023-03-23 16:55:17 0 [Warning] Plugin 'example_key_management' is of maturity 
level experimental while the server is stable
2023-03-23 16:55:17 0 [ERROR] InnoDB: Database page corruption on disk or a 
failed read of file './ibdata1' page [page id: space=0, page number=221]. You 
may have to recover from a backup.

where data read was not as expected.

Now, there are a number of io_uring changes between .158 and .162, as it
includes the backport that brought 5.10-stable into line with what
5.15-stable includes. I'll spare you all the digging I did to vet those
changes, but the key thing is that it STILL happens on 6.3-git on
powerpc.

After ruling out many things, one key difference between 158 and 162 is
that the former offloaded requests that could not be done nonblocking to
a kthread, and 162 and newer offloads to an IO thread. An IO thread is
just a normal thread created from the application submitting IO, the
only difference is that it never exits to userspace. An IO thread has
the same mm/files/you-name-it from the original task. It really is the
same as a userspace thread created by the application The switch to IO
threads was done exactly because of that, rather than rely on a fragile
scheme of having the kthread worker assume all sorts of identify from
the original task. surprises if things were missed. This is what caused
most of the io_uring security issues in the past.

The IO that mariadb does in this test is pretty simple - a bunch of
largish buffered writes with IORING_OP_WRITEV, and some smallish (16K)
buffered reads with IORING_OP_READV.

Today I finally gave up and ran a basic experiment, which simply
offloads the writes to a kthread. Since powerpc has an interesting
memory coherency model, my suspicion was that the work involved with
switching MMs for the kthread could just be the main difference here.
The patch is really dumb and simple - rather than queue the write to an
IO thread, it just offloads it to a kthread that then does
kthread_use_mm(), perform write with the same write handler,
kthread_unuse_mm(). AND THIS WORKS! Usually the above mtr test would
fail in 2..20 loops, I've now done 200 and 500 loops and it's fine.

Which then leads me to the question, what about the IO thread offload
makes this fail on powerpc (and no other arch I've tested on, including
x86/x86-64/aarch64/hppa64)? The offload should be equivalent to having a
thread in userspace in the application, and having that thread just
perform the writes. Is there some magic involved with the kthread mm
use/unuse that makes this sufficiently consistent on powerpc? I've tried
any mix of isync()/mb and making the flush_dcache_page() unconditionally
done in the filemap read/write helpers, and it still falls flat on its
face with the offload to an IO thread.

I must clearly be missing something here, which is why I'm emailing the
powerpc Gods for help :-)

-- 
Jens Axboe

Re: [PATCH] ps3vram: remove bio splitting

2023-01-23 Thread Jens Axboe



On Mon, 23 Jan 2023 08:47:18 +0100, Christoph Hellwig wrote:
> ps3vram iterates over the bio one segment, that is page aligned and max
> page sized chunk, a time.  Because of that there is no point in
> calling bio_split_to_limits, or explicitly setting the default limits
> that are only used by bio_split_to_limits.
> 
> 

Applied, thanks!

[1/1] ps3vram: remove bio splitting
  commit: 2192a93eb4ac63eeb37ec5ec5cfa0db92ded5e3c

Best regards,
-- 
Jens Axboe

Re: [PATCH] block: move from strlcpy with unused retval to strscpy

2022-09-21 Thread Jens Axboe

On Thu, 18 Aug 2022 22:59:57 +0200, Wolfram Sang wrote:
> Follow the advice of the below link and prefer 'strscpy' in this
> subsystem. Conversion is 1:1 because the return value is not used.
> Generated by a coccinelle script.
> 
> 

Applied, thanks!

[1/1] block: move from strlcpy with unused retval to strscpy
  commit: e55e1b4831563e5766f88fa821f5bfac0d6c298c

Best regards,
-- 
Jens Axboe

Re: [PATCH 06/13] nvdimm/blk: avoid calling del_gendisk() on early failures

2021-11-02 Thread Jens Axboe

On 11/2/21 6:49 PM, Dan Williams wrote:
> On Tue, Nov 2, 2021 at 5:10 PM Luis Chamberlain  wrote:
>>
>> On Fri, Oct 15, 2021 at 05:13:48PM -0700, Dan Williams wrote:
>>> On Fri, Oct 15, 2021 at 4:53 PM Luis Chamberlain  wrote:
>>>>
>>>> If nd_integrity_init() fails we'd get del_gendisk() called,
>>>> but that's not correct as we should only call that if we're
>>>> done with device_add_disk(). Fix this by providing unwinding
>>>> prior to the devm call being registered and moving the devm
>>>> registration to the very end.
>>>>
>>>> This should fix calling del_gendisk() if nd_integrity_init()
>>>> fails. I only spotted this issue through code inspection. It
>>>> does not fix any real world bug.
>>>>
>>>
>>> Just fyi, I'm preparing patches to delete this driver completely as it
>>> is unused by any shipping platform. I hope to get that removal into
>>> v5.16.
>>
>> Curious if are you going to nuking it on v5.16? Otherwise it would stand
>> in the way of the last few patches to add __must_check for the final
>> add_disk() error handling changes.
> 
> True, I don't think I can get it nuked in time, so you can add my
> Reviewed-by for this one.

Luis, I lost track of the nv* patches from this discussion. If you want
them in 5.16 and they are reviewed, please do resend and I'll pick them
up for the middle-of-merge-window push.

-- 
Jens Axboe

Re: (subset) [PATCH 00/13] block: add_disk() error handling stragglers

2021-10-30 Thread Jens Axboe

On Fri, 15 Oct 2021 16:52:06 -0700, Luis Chamberlain wrote:
> This patch set consists of al the straggler drivers for which we have
> have no patch reviews done for yet. I'd like to ask for folks to please
> consider chiming in, specially if you're the maintainer for the driver.
> Additionally if you can specify if you'll take the patch in yourself or
> if you want Jens to take it, that'd be great too.
> 
> Luis Chamberlain (13):
>   block/brd: add error handling support for add_disk()
>   nvme-multipath: add error handling support for add_disk()
>   nvdimm/btt: do not call del_gendisk() if not needed
>   nvdimm/btt: use goto error labels on btt_blk_init()
>   nvdimm/btt: add error handling support for add_disk()
>   nvdimm/blk: avoid calling del_gendisk() on early failures
>   nvdimm/blk: add error handling support for add_disk()
>   zram: add error handling support for add_disk()
>   z2ram: add error handling support for add_disk()
>   ps3disk: add error handling support for add_disk()
>   ps3vram: add error handling support for add_disk()
>   block/sunvdc: add error handling support for add_disk()
>   mtd/ubi/block: add error handling support for add_disk()
> 
> [...]

Applied, thanks!

[01/13] block/brd: add error handling support for add_disk()
commit: e1528830bd4ebf435d91c154e309e6e028336210

Best regards,
-- 
Jens Axboe

Re: (subset) [PATCH 00/13] block: add_disk() error handling stragglers

2021-10-30 Thread Jens Axboe

On Fri, 15 Oct 2021 16:52:06 -0700, Luis Chamberlain wrote:
> This patch set consists of al the straggler drivers for which we have
> have no patch reviews done for yet. I'd like to ask for folks to please
> consider chiming in, specially if you're the maintainer for the driver.
> Additionally if you can specify if you'll take the patch in yourself or
> if you want Jens to take it, that'd be great too.
> 
> Luis Chamberlain (13):
>   block/brd: add error handling support for add_disk()
>   nvme-multipath: add error handling support for add_disk()
>   nvdimm/btt: do not call del_gendisk() if not needed
>   nvdimm/btt: use goto error labels on btt_blk_init()
>   nvdimm/btt: add error handling support for add_disk()
>   nvdimm/blk: avoid calling del_gendisk() on early failures
>   nvdimm/blk: add error handling support for add_disk()
>   zram: add error handling support for add_disk()
>   z2ram: add error handling support for add_disk()
>   ps3disk: add error handling support for add_disk()
>   ps3vram: add error handling support for add_disk()
>   block/sunvdc: add error handling support for add_disk()
>   mtd/ubi/block: add error handling support for add_disk()
> 
> [...]

Applied, thanks!

[08/13] zram: add error handling support for add_disk()
commit: 5e2e1cc4131cf4d21629c94331f2351b7dc8b87c
[10/13] ps3disk: add error handling support for add_disk()
commit: ff4cbe0fcf5d749f76040f782f0618656cd23e33
[11/13] ps3vram: add error handling support for add_disk()
commit: 3c30883acab1d20ecbd3c48dc12b147b51548742

Best regards,
-- 
Jens Axboe

Re: [PATCH v2 00/10] block: fourth batch of add_disk() error handling conversions

2021-09-27 Thread Jens Axboe

On 9/27/21 4:01 PM, Luis Chamberlain wrote:
> This is the fourth batch of add_disk() error handling driver
> conversions. This set along with the entire 7 set of driver conversions
> can be found on my 20210927-for-axboe-add-disk-error-handling branch
> [0].

Applied 1-2, 6, 8-9, thanks.

-- 
Jens Axboe

Re: [PATCH] arch: Kconfig: clean up obsolete use of HAVE_IDE

2021-07-28 Thread Jens Axboe

On 7/28/21 12:21 PM, Lukas Bulwahn wrote:
> The arch-specific Kconfig files use HAVE_IDE to indicate if IDE is
> supported.
> 
> As IDE support and the HAVE_IDE config vanishes with commit b7fb14d3ac63
> ("ide: remove the legacy ide driver"), there is no need to mention
> HAVE_IDE in all those arch-specific Kconfig files.
> 
> The issue was identified with ./scripts/checkkconfigsymbols.py.

Thanks, let's queue this for 5.14 to avoid any future conflicts with
it.

-- 
Jens Axboe

Re: switch the block layer to use kmap_local_page v3

2021-07-27 Thread Jens Axboe

On 7/26/21 11:56 PM, Christoph Hellwig wrote:
> Hi all,
> 
> this series switches the core block layer code and all users of the
> existing bvec kmap helpers to use kmap_local_page.  Drivers that
> currently use open coded kmap_atomic calls will converted in a follow
> on series.
> 
> To do so a new kunmap variant is added that calls
> flush_kernel_dcache_page.  I'm not entirely sure where to call
> flush_dcache_page vs flush_kernel_dcache_page, so I've tried to follow
> the documentation here, but additional feedback would be welcome.
> 
> Note that the ps3disk has a minir conflict with the
> flush_kernel_dcache_page removal in linux-next through the -mm tree.
> I had hoped that change would go into 5.14, but it seems like it is
> being held for 5.15.

Applied for 5.15, thanks.

-- 
Jens Axboe

Re: simplify gendisk and request_queue allocation for blk-mq based drivers

2021-06-11 Thread Jens Axboe

On 6/2/21 12:53 AM, Christoph Hellwig wrote:
> Hi all,
> 
> this series is the scond part of cleaning up lifetimes and allocation of
> the gendisk and request_queue structure.  It adds a new interface to
> allocate the disk and queue together for blk based drivers, and uses that
> in all drivers that do not have any caveats in their gendisk and
> request_queue lifetime rules.

Applied, thanks.

-- 
Jens Axboe

Re: [PATCH v2] libnvdimm/pmem: Fix blk_cleanup_disk() usage

2021-06-08 Thread Jens Axboe

On 6/7/21 5:52 PM, Dan Williams wrote:
> The queue_to_disk() helper can not be used after del_gendisk()
> communicate @disk via the pgmap->owner.
> 
> Otherwise, queue_to_disk() returns NULL resulting in the splat below.
> 
>  Kernel attempted to read user page (330) - exploit attempt? (uid: 0)
>  BUG: Kernel NULL pointer dereference on read at 0x0330
>  Faulting instruction address: 0xc0906344
>  Oops: Kernel access of bad area, sig: 11 [#1]
>  [..]
>  NIP [c0906344] pmem_pagemap_cleanup+0x24/0x40
>  LR [c04701d4] memunmap_pages+0x1b4/0x4b0
>  Call Trace:
>  [c00022cbb9c0] [c09063c8] pmem_pagemap_kill+0x28/0x40 
> (unreliable)
>  [c00022cbb9e0] [c04701d4] memunmap_pages+0x1b4/0x4b0
>  [c00022cbba90] [c08b28a0] devm_action_release+0x30/0x50
>  [c00022cbbab0] [c08b39c8] release_nodes+0x2f8/0x3e0
>  [c00022cbbb60] [c08ac440] 
> device_release_driver_internal+0x190/0x2b0
>  [c00022cbbba0] [c000008a8450] unbind_store+0x130/0x170

Applied, thanks.

-- 
Jens Axboe

Re: simplify gendisk and request_queue allocation for bio based drivers

2021-06-01 Thread Jens Axboe

On 5/20/21 11:50 PM, Christoph Hellwig wrote:
> Hi all,
> 
> this series is the first part of cleaning up lifetimes and allocation of
> the gendisk and request_queue structure.  It adds a new interface to
> allocate the disk and queue together for bio based drivers, and a helper
> for cleanup/free them when a driver is unloaded or a device is removed.
> 
> Together this removes the need to treat the gendisk and request_queue
> as separate entities for bio based drivers.

Applied, thanks.

-- 
Jens Axboe

Re: linux-next: boot failure in today's linux-next

2021-04-26 Thread Jens Axboe

ea808] 
>> __vio_register_driver+0x68/0x90
>> [1.859754][T1] [c63a3bb0] [c10cee74] 
>> ibmvscsi_module_init+0xa4/0xdc
>> [1.859931][T1] [c63a3bf0] [c0012190] 
>> do_one_initcall+0x60/0x2c0
>> [1.860071][T1] [c63a3cc0] [c10846e4] 
>> kernel_init_freeable+0x300/0x3a0
>> [1.860207][T1] [c63a3da0] [c0012764] 
>> kernel_init+0x2c/0x168
>> [1.860336][T1] [c63a3e10] [c000d5ec] 
>> ret_from_kernel_thread+0x5c/0x70
>> [1.860690][T1] Instruction dump:
>> [1.861072][T1] fba10038 7cbb2b78 7c7d1b78 7cfc3b78 a1440048 2c2a 
>> 4082008c a13f004a 
>> [1.861328][T1] 7c095040 40810110 e93f0008 811f0028  
>> e9290050 812903d8 7d3e4850 
>> [1.863000][T1] ---[ end trace c49ca2d91ee47d7f ]---
>> [1.879456][T1] 
>> [2.880941][T1] Kernel panic - not syncing: Attempted to kill init! 
>> exitcode=0x000b
>>
>> I don't know what caused this, but it is some change since Friday.
>>
>> I have left it like this.
> 
> Bisections leads to commit
> 
>   42fb54fbc707 ("bio: limit bio max size")
> 
> from the block tree.  Reverting that commit on top of today's
> linux-next allows to the boot to work again.

The patch has been dropped, thanks Stephen.

-- 
Jens Axboe

Re: [PATCH] swim3: support highmem

2021-04-06 Thread Jens Axboe

On 4/6/21 12:18 AM, Christoph Hellwig wrote:
> swim3 only uses the virtual address of a bio to stash it into the data
> transfer using virt_to_bus.  But the ppc32 virt_to_bus just uses the
> physical address with an offset.  Replace virt_to_bus with a local hack
> that performs the equivalent transformation and stop asking for block
> layer bounce buffering.

Applied, thanks.

-- 
Jens Axboe

Re: [PATCH] xsysace: Remove SYSACE driver

2021-03-23 Thread Jens Axboe

On 3/23/21 10:25 AM, Michal Simek wrote:
> 
> 
> On 3/23/21 5:23 PM, Jens Axboe wrote:
>> On 3/22/21 6:04 PM, Davidlohr Bueso wrote:
>>> Hi,
>>>
>>> On Mon, 09 Nov 2020, Michal Simek wrote:
>>>
>>>> Sysace IP is no longer used on Xilinx PowerPC 405/440 and Microblaze
>>>> systems. The driver is not regularly tested and very likely not working for
>>>> quite a long time that's why remove it.
>>>
>>> Is there a reason this patch was never merged? can the driver be
>>> removed? I ran into this as a potential tasklet user that can be
>>> replaced/removed.
>>
>> I'd be happy to merge it for 5.13.
>>
> 
> Can you just take this version? Or do you want me to send it again?

Minor edits needed for fuzz, but I've applied this version.

-- 
Jens Axboe

Re: [PATCH] xsysace: Remove SYSACE driver

2021-03-23 Thread Jens Axboe

On 3/22/21 6:04 PM, Davidlohr Bueso wrote:
> Hi,
> 
> On Mon, 09 Nov 2020, Michal Simek wrote:
> 
>> Sysace IP is no longer used on Xilinx PowerPC 405/440 and Microblaze
>> systems. The driver is not regularly tested and very likely not working for
>> quite a long time that's why remove it.
> 
> Is there a reason this patch was never merged? can the driver be
> removed? I ran into this as a potential tasklet user that can be
> replaced/removed.

I'd be happy to merge it for 5.13.

-- 
Jens Axboe

Re: [PATCH] ide: fix warning comparing pointer to 0

2021-03-11 Thread Jens Axboe

On 3/11/21 2:48 AM, Jiapeng Chong wrote:
> Fix the following coccicheck warning:
> 
> ./drivers/ide/pmac.c:1680:38-39: WARNING comparing pointer to 0.
> 
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 
> ---
>  drivers/ide/pmac.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/ide/pmac.c b/drivers/ide/pmac.c
> index ea0b064..d5171e0 100644
> --- a/drivers/ide/pmac.c
> +++ b/drivers/ide/pmac.c
> @@ -1677,7 +1677,7 @@ static int pmac_ide_init_dma(ide_hwif_t *hwif, const 
> struct ide_port_info *d)
>   /* We won't need pci_dev if we switch to generic consistent
>* DMA routines ...
>*/
> - if (dev == NULL || pmif->dma_regs == 0)
> + if (!dev || pmif->dma_regs)
>   return -ENODEV;

This looks utterly broken - the warning is most definitely about
dma_regs, not dev, and you swapped the condition (failing on dma_regs
being set, not NULL).

I'd just leave this one alone, drivers/ide/ should be going away soon.

-- 
Jens Axboe

Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING

2021-02-04 Thread Jens Axboe

On 2/4/21 10:01 AM, Aneesh Kumar K.V wrote:
> On 2/4/21 10:23 PM, Jens Axboe wrote:
>> On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote:
>>> On 2/2/21 11:50 AM, Christophe Leroy wrote:
>>>>
>>>>
>>>> Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit :
>>>>> On 2/2/21 11:32 AM, Christophe Leroy wrote:
>>>>>>
>>>>>>
>>>>>> Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit :
>>>>>>> Aneesh Kumar K.V  writes:
>>>>>>>
>>>>>>>> Nicholas Piggin  writes:
>>>>>>>>
>>>>>>>>> Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm:
>>>>>>>>>> Christophe Leroy  writes:
>>>>>>>>>>> +Aneesh
>>>>>>>>>>>
>>>>>>>>>>> Le 29/01/2021 à 07:52, Zorro Lang a écrit :
>>>>>>>>>> ..
>>>>>>>>>>>> [   96.200296] [ cut here ]
>>>>>>>>>>>> [   96.200304] Bug: Read fault blocked by KUAP!
>>>>>>>>>>>> [   96.200309] WARNING: CPU: 3 PID: 1876 at
>>>>>>>>>>>> arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310
>>>>>>>>>>>
>>>>>>>>>>>> [   96.200734] NIP [c0849424]
>>>>>>>>>>>> fault_in_pages_readable+0x104/0x350
>>>>>>>>>>>> [   96.200741] LR [c084952c]
>>>>>>>>>>>> fault_in_pages_readable+0x20c/0x350
>>>>>>>>>>>> [   96.200747] --- interrupt: 300
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Problem happens in a section where userspace access is supposed
>>>>>>>>>>> to be granted, so the patch you
>>>>>>>>>>> proposed is definitely not the right fix.
>>>>>>>>>>>
>>>>>>>>>>> c0849408:2c 01 00 4c isync
>>>>>>>>>>> c084940c:a6 03 3d 7d mtspr   29,r9  <== granting
>>>>>>>>>>> userspace access permission
>>>>>>>>>>> c0849410:2c 01 00 4c isync
>>>>>>>>>>> c0849414:00 00 36 e9 ld  r9,0(r22)
>>>>>>>>>>> c0849418:20 00 29 81 lwz r9,32(r9)
>>>>>>>>>>> c084941c:00 02 29 71 andi.   r9,r9,512
>>>>>>>>>>> c0849420:78 d3 5e 7f mr  r30,r26
>>>>>>>>>>> ==> c0849424:00 00 bf 8b lbz r29,0(r31)  <==
>>>>>>>>>>> accessing userspace
>>>>>>>>>>> c0849428:10 00 82 41 beq c0849438
>>>>>>>>>>> 
>>>>>>>>>>> c084942c:2c 01 00 4c isync
>>>>>>>>>>> c0849430:a6 03 bd 7e mtspr   29,r21  <==
>>>>>>>>>>> clearing userspace access permission
>>>>>>>>>>> c0849434:2c 01 00 4c isync
>>>>>>>>>>>
>>>>>>>>>>> My first guess is that the problem is linked to the following
>>>>>>>>>>> function, see the comment
>>>>>>>>>>>
>>>>>>>>>>> /*
>>>>>>>>>>> * For kernel thread that doesn't have thread.regs return
>>>>>>>>>>> * default AMR/IAMR values.
>>>>>>>>>>> */
>>>>>>>>>>> static inline u64 current_thread_amr(void)
>>>>>>>>>>> {
>>>>>>>>>>>  if (current->thread.regs)
>>>>>>>>>>>  return current->thread.regs->amr;
>>>>>>>>>>>  return AMR_KUAP_BLOCKED;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Above function was introduced by commit 48a8ab4eeb82
>>>>>>>>>>> ("powerpc/book3s64/pke

Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING

2021-02-04 Thread Jens Axboe

plemented using memory protection keys, it 
>>>>>>>> depends
>>>>>>>> on the value of the AMR register, which is not part of the mm, 
>>>>>>>> it's in
>>>>>>>> thread.regs->amr.
>>>>>>>>
>>>>>>>> And what's worse by the time we're in kthread_use_mm() we no 
>>>>>>>> longer have
>>>>>>>> access to the thread.regs->amr of the original process that 
>>>>>>>> submitted
>>>>>>>> the IO.
>>>>>>>>
>>>>>>>> We also can't simply move the AMR into the mm, precisely because 
>>>>>>>> it's
>>>>>>>> per thread, not per mm.
>>>>>>>>
>>>>>>>> So TBH I don't know how we're going to fix this.
>>>>>>>>
>>>>>>>> I guess we could return AMR=unblocked for kernel threads, but that's
>>>>>>>> arguably a bug because it allows a process to circumvent memory 
>>>>>>>> keys by
>>>>>>>> asking the kernel to do the access.
>>>>>>>
>>>>>>> We shouldn't need to inherit AMR should we? We only need it to be 
>>>>>>> locked
>>>>>>> for kernel threads until it's explicitly unlocked -- nothing mm 
>>>>>>> specific
>>>>>>> there. I think current_thread_amr could return 0 for kernel 
>>>>>>> threads? Or
>>>>>>> I would even avoid using that function for allow_user_access and open
>>>>>>> code the kthread case and remove it from current_thread_amr().
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nick
>>>>>>
>>>>>
>>>>> updated one
>>>>>
>>>>>  From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001
>>>>> From: "Aneesh Kumar K.V" 
>>>>> Date: Tue, 2 Feb 2021 09:23:38 +0530
>>>>> Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access 
>>>>> userspace
>>>>>   after kthread_use_mm
>>>>>
>>>>> This fix the bad fault reported by KUAP when io_wqe_worker access 
>>>>> userspace.
>>>>>
>>>>>   Bug: Read fault blocked by KUAP!
>>>>>   WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 
>>>>> __do_page_fault+0x6b4/0xcd0
>>>>>   NIP [c009e7e4] __do_page_fault+0x6b4/0xcd0
>>>>>   LR [c009e7e0] __do_page_fault+0x6b0/0xcd0
>>>>> ..
>>>>>   Call Trace:
>>>>>   [c00016367330] [c009e7e0] __do_page_fault+0x6b0/0xcd0 
>>>>> (unreliable)
>>>>>   [c000163673e0] [c009ee3c] do_page_fault+0x3c/0x120
>>>>>   [c00016367430] [c000c848] handle_page_fault+0x10/0x2c
>>>>>   --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
>>>>> ..
>>>>>   NIP [c08e8228] iov_iter_fault_in_readable+0x148/0x6f0
>>>>>   LR [c08e834c] iov_iter_fault_in_readable+0x26c/0x6f0
>>>>>   interrupt: 300
>>>>>   [c000163677e0] [c07154a0] iomap_write_actor+0xc0/0x280
>>>>>   [c00016367880] [c070fc94] iomap_apply+0x1c4/0x780
>>>>>   [c00016367990] [c0710330] 
>>>>> iomap_file_buffered_write+0xa0/0x120
>>>>>   [c000163679e0] [c0080040791c] 
>>>>> xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
>>>>>   [c00016367a90] [c000006d74bc] io_write+0x10c/0x460
>>>>>   [c00016367bb0] [c06d80e4] io_issue_sqe+0x8d4/0x1200
>>>>>   [c00016367c70] [c06d8ad0] io_wq_submit_work+0xc0/0x250
>>>>>   [c00016367cb0] [c06e2578] 
>>>>> io_worker_handle_work+0x498/0x800
>>>>>   [c00016367d40] [c06e2cdc] io_wqe_worker+0x3fc/0x4f0
>>>>>   [c00016367da0] [c01cb0a4] kthread+0x1c4/0x1d0
>>>>>   [c00016367e10] [c000dbf0] 
>>>>> ret_from_kernel_thread+0x5c/0x6c
>>>>>
>>>>> The kernel consider thread AMR value for kernel thread to be
>>>>> AMR_KUAP_BLOCKED. Hence access to userspace is denied. This
>>>>> of course not correct and we should allow userspace access after
>>>>> kthread_use_mm(). To be precise, kthread_use_mm() should inherit the
>>>>> AMR value of the operating address space. But, the AMR value is
>>>>> thread-specific and we inherit the address space and not thread
>>>>> access restrictions. Because of this ignore AMR value when accessing
>>>>> userspace via kernel thread.
>>>>>
>>>>> Signed-off-by: Aneesh Kumar K.V 
>>>>> ---
>>>>> Changes from v1:
>>>>> * Address review feedback from Nick
>>>>>
>>>>>   arch/powerpc/include/asm/book3s/64/kup.h | 8 +++-
>>>>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
>>>>> b/arch/powerpc/include/asm/book3s/64/kup.h
>>>>> index f50f72e535aa..95f4df99249e 100644
>>>>> --- a/arch/powerpc/include/asm/book3s/64/kup.h
>>>>> +++ b/arch/powerpc/include/asm/book3s/64/kup.h
>>>>> @@ -384,7 +384,13 @@ static __always_inline void 
>>>>> allow_user_access(void __user *to, const void __user
>>>>>   // This is written so we can resolve to a single case at build 
>>>>> time
>>>>>   BUILD_BUG_ON(!__builtin_constant_p(dir));
>>>>> -if (mmu_has_feature(MMU_FTR_PKEY))
>>>>> +/*
>>>>> + * if it is a kthread that did kthread_use_mm() don't
>>>>> + * use current_thread_amr().
>>>>
>>>> According to include/linux/sched.h, PF_KTHREAD means /* I am a kernel 
>>>> thread */
>>>> It doesn't seem to be related to kthread_use_mm()
>>>
>>> That should be a sufficient check here. if we did reach here without 
>>> calling kthread_user_mm, we will crash on access because we don't have 
>>> a mm attached to the current process.  a kernel thread with 
>>> kthread_use_mm has
>>
>> Ok but then the comment doesn't match the check.
> 
> 
> I was trying to be explict in the comment that we expect the thread to 
> have done kthread_use_mm().
> 
>>
>> And also the comment in current_thread_amr() is then misleading.
>>
>> Why not do the current->flags & PF_KTHREAD check in current_thread_amr() 
>> and return 0 in that case instead of BLOCKED ?
> 
> In my view currrent_thread_amr() is more generic and we want to be 
> explicit there that a kernel thread AMR is KUAP_BLOCKED. Only when we 
> call allow user access, we relax the AMR value.

Just following up on this, as I'd hate to have 5.11 released with this
bug in it for powerpc. It'll obviously also affect other cases of a
kernel thread faulting after having done kthread_use_mm(), though I'm
not sure how widespread that is. In any case, it'll leave io_uring
mostly broken on powerpc if this isn't patched for release.

-- 
Jens Axboe

Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING

2021-01-28 Thread Jens Axboe

On 1/28/21 6:52 AM, Zorro Lang wrote:
> On Wed, Jan 27, 2021 at 08:06:37PM -0700, Jens Axboe wrote:
>> On 1/27/21 8:13 PM, Zorro Lang wrote:
>>> On Thu, Jan 28, 2021 at 10:18:07AM +1000, Nicholas Piggin wrote:
>>>> Excerpts from Jens Axboe's message of January 28, 2021 5:29 am:
>>>>> On 1/27/21 9:38 AM, Christophe Leroy wrote:
>>>>>>
>>>>>>
>>>>>> Le 27/01/2021 à 15:56, Zorro Lang a écrit :
>>>>>>> On powerpc, io_uring test hit below KUAP fault on __do_page_fault.
>>>>>>> The fail source line is:
>>>>>>>
>>>>>>>if (unlikely(!is_user && bad_kernel_fault(regs, error_code, address, 
>>>>>>> is_write)))
>>>>>>>return SIGSEGV;
>>>>>>>
>>>>>>> The is_user() is based on user_mod(regs) only. This's not suit for
>>>>>>> io_uring, where the helper thread can assume the user app identity
>>>>>>> and could perform this fault just fine. So turn to use mm to decide
>>>>>>> if this is valid or not.
>>>>>>
>>>>>> I don't understand why testing is_user would be an issue. KUAP purpose
>>>>>> it to block any unallowed access from kernel to user memory
>>>>>> (Equivalent to SMAP on x86). So it really must be based on MSR_PR bit,
>>>>>> that is what is_user provides.
>>>>>>
>>>>>> If the kernel access is legitimate, kernel should have opened
>>>>>> userspace access then you shouldn't get this "Bug: Read fault blocked
>>>>>> by KUAP!".
>>>>>>
>>>>>> As far as I understand, the fault occurs in
>>>>>> iov_iter_fault_in_readable() which calls fault_in_pages_readable() And
>>>>>> fault_in_pages_readable() uses __get_user() so it is a legitimate
>>>>>> access and you really should get a KUAP fault.
>>>>>>
>>>>>> So the problem is somewhere else, I think you proposed patch just
>>>>>> hides the problem, it doesn't fix it.
>>>>>
>>>>> If we do kthread_use_mm(), can we agree that the user access is valid?
>>>>
>>>> Yeah the io uring code is fine, provided it uses the uaccess primitives 
>>>> like any other kernel code. It's looking more like a an arch/powerpc bug.
>>>>
>>>>> We should be able to copy to/from user space, and including faults, if
>>>>> that's been done and the new mm assigned. Because it really should be.
>>>>> If SMAP was a problem on x86, we would have seen it long ago.
>>>>>
>>>>> I'm assuming this may be breakage related to the recent uaccess changes
>>>>> related to set_fs and friends? Or maybe recent changes on the powerpc
>>>>> side?
>>>>>
>>>>> Zorro, did 5.10 work?
>>>>
>>>> Would be interesting to know.
>>>
>>> Sure Nick and Jens, which 5.10 rc? version do you want to know ? Or any git
>>> commit(be the HEAD) in 5.10 phase?
>>
>> I forget which versions had what series of this, but 5.10 final - and if
>> that fails, then 5.9 final. IIRC, 5.9 was pre any of these changes, and
>> 5.10 definitely has them.
> 
> I justed built linux v5.10 with same .config file, and gave it same test.
> v5.10 (HEAD=2c85ebc57b Linux 5.10) can't reproduce this bug:
> 
> # ./check generic/013 generic/051
> FSTYP -- xfs (non-debug)
> PLATFORM  -- Linux/ppc64le ibm-p9z-xxx- 5.10.0 #3 SMP Thu Jan 28 
> 04:12:14 EST 2021
> MKFS_OPTIONS  -- -f -m 
> crc=1,finobt=1,reflink=1,rmapbt=1,bigtime=1,inobtcount=1 /dev/sda3
> MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/sda3 
> /mnt/xfstests/scratch
> 
> generic/013 138s ...  77s
> generic/051 103s ...  143s
> Ran: generic/013 generic/051
> Passed all 2 tests

Thanks for testing that, so I think it's safe to conclude that there's a
regression in powerpc fault handling for kthreads that use
kthread_use_mm in this release. A bisect would definitely find it, but
might be pointless if Christophe or Nick already have an idea of what it
is.

-- 
Jens Axboe

Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING

2021-01-27 Thread Jens Axboe

On 1/27/21 8:13 PM, Zorro Lang wrote:
> On Thu, Jan 28, 2021 at 10:18:07AM +1000, Nicholas Piggin wrote:
>> Excerpts from Jens Axboe's message of January 28, 2021 5:29 am:
>>> On 1/27/21 9:38 AM, Christophe Leroy wrote:
>>>>
>>>>
>>>> Le 27/01/2021 à 15:56, Zorro Lang a écrit :
>>>>> On powerpc, io_uring test hit below KUAP fault on __do_page_fault.
>>>>> The fail source line is:
>>>>>
>>>>>if (unlikely(!is_user && bad_kernel_fault(regs, error_code, address, 
>>>>> is_write)))
>>>>>return SIGSEGV;
>>>>>
>>>>> The is_user() is based on user_mod(regs) only. This's not suit for
>>>>> io_uring, where the helper thread can assume the user app identity
>>>>> and could perform this fault just fine. So turn to use mm to decide
>>>>> if this is valid or not.
>>>>
>>>> I don't understand why testing is_user would be an issue. KUAP purpose
>>>> it to block any unallowed access from kernel to user memory
>>>> (Equivalent to SMAP on x86). So it really must be based on MSR_PR bit,
>>>> that is what is_user provides.
>>>>
>>>> If the kernel access is legitimate, kernel should have opened
>>>> userspace access then you shouldn't get this "Bug: Read fault blocked
>>>> by KUAP!".
>>>>
>>>> As far as I understand, the fault occurs in
>>>> iov_iter_fault_in_readable() which calls fault_in_pages_readable() And
>>>> fault_in_pages_readable() uses __get_user() so it is a legitimate
>>>> access and you really should get a KUAP fault.
>>>>
>>>> So the problem is somewhere else, I think you proposed patch just
>>>> hides the problem, it doesn't fix it.
>>>
>>> If we do kthread_use_mm(), can we agree that the user access is valid?
>>
>> Yeah the io uring code is fine, provided it uses the uaccess primitives 
>> like any other kernel code. It's looking more like a an arch/powerpc bug.
>>
>>> We should be able to copy to/from user space, and including faults, if
>>> that's been done and the new mm assigned. Because it really should be.
>>> If SMAP was a problem on x86, we would have seen it long ago.
>>>
>>> I'm assuming this may be breakage related to the recent uaccess changes
>>> related to set_fs and friends? Or maybe recent changes on the powerpc
>>> side?
>>>
>>> Zorro, did 5.10 work?
>>
>> Would be interesting to know.
> 
> Sure Nick and Jens, which 5.10 rc? version do you want to know ? Or any git
> commit(be the HEAD) in 5.10 phase?

I forget which versions had what series of this, but 5.10 final - and if
that fails, then 5.9 final. IIRC, 5.9 was pre any of these changes, and
5.10 definitely has them.

-- 
Jens Axboe

Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING

2021-01-27 Thread Jens Axboe

On 1/27/21 9:38 AM, Christophe Leroy wrote:
> 
> 
> Le 27/01/2021 à 15:56, Zorro Lang a écrit :
>> On powerpc, io_uring test hit below KUAP fault on __do_page_fault.
>> The fail source line is:
>>
>>if (unlikely(!is_user && bad_kernel_fault(regs, error_code, address, 
>> is_write)))
>>return SIGSEGV;
>>
>> The is_user() is based on user_mod(regs) only. This's not suit for
>> io_uring, where the helper thread can assume the user app identity
>> and could perform this fault just fine. So turn to use mm to decide
>> if this is valid or not.
> 
> I don't understand why testing is_user would be an issue. KUAP purpose
> it to block any unallowed access from kernel to user memory
> (Equivalent to SMAP on x86). So it really must be based on MSR_PR bit,
> that is what is_user provides.
> 
> If the kernel access is legitimate, kernel should have opened
> userspace access then you shouldn't get this "Bug: Read fault blocked
> by KUAP!".
> 
> As far as I understand, the fault occurs in
> iov_iter_fault_in_readable() which calls fault_in_pages_readable() And
> fault_in_pages_readable() uses __get_user() so it is a legitimate
> access and you really should get a KUAP fault.
> 
> So the problem is somewhere else, I think you proposed patch just
> hides the problem, it doesn't fix it.

If we do kthread_use_mm(), can we agree that the user access is valid?
We should be able to copy to/from user space, and including faults, if
that's been done and the new mm assigned. Because it really should be.
If SMAP was a problem on x86, we would have seen it long ago.

I'm assuming this may be breakage related to the recent uaccess changes
related to set_fs and friends? Or maybe recent changes on the powerpc
side?

Zorro, did 5.10 work?

-- 
Jens Axboe

Re: [PATCH] powerpc: add support for TIF_NOTIFY_SIGNAL

2020-10-29 Thread Jens Axboe

On 10/29/20 6:48 PM, Michael Ellerman wrote:
> Jens Axboe  writes:
>> Wire up TIF_NOTIFY_SIGNAL handling for powerpc.
>>
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Signed-off-by: Jens Axboe 
>> ---
>>
>> 5.11 has support queued up for TIF_NOTIFY_SIGNAL, see this posting
>> for details:
>>
>> https://lore.kernel.org/io-uring/20201026203230.386348-1-ax...@kernel.dk/
>>
>> As part of that work, I'm adding TIF_NOTIFY_SIGNAL support to all archs,
>> as that will enable a set of cleanups once all of them support it. I'm
>> happy carrying this patch if need be, or it can be funelled through the
>> arch tree. Let me know.
> 
> Happy for you to take it along with the rest of the series.
> 
> Acked-by: Michael Ellerman 

Great, thanks Michael! Added.

-- 
Jens Axboe

[PATCH] powerpc: add support for TIF_NOTIFY_SIGNAL

2020-10-29 Thread Jens Axboe

Wire up TIF_NOTIFY_SIGNAL handling for powerpc.

Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Jens Axboe 
---

5.11 has support queued up for TIF_NOTIFY_SIGNAL, see this posting
for details:

https://lore.kernel.org/io-uring/20201026203230.386348-1-ax...@kernel.dk/

As part of that work, I'm adding TIF_NOTIFY_SIGNAL support to all archs,
as that will enable a set of cleanups once all of them support it. I'm
happy carrying this patch if need be, or it can be funelled through the
arch tree. Let me know.

 arch/powerpc/include/asm/thread_info.h | 5 -
 arch/powerpc/kernel/signal.c   | 2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index 46a210b03d2b..53115ae61495 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -90,6 +90,7 @@ void arch_setup_new_exec(void);
 #define TIF_SYSCALL_TRACE  0   /* syscall trace active */
 #define TIF_SIGPENDING 1   /* signal pending */
 #define TIF_NEED_RESCHED   2   /* rescheduling necessary */
+#define TIF_NOTIFY_SIGNAL  3   /* signal notifications exist */
 #define TIF_SYSCALL_EMU4   /* syscall emulation active */
 #define TIF_RESTORE_TM 5   /* need to restore TM FP/VEC/VSX */
 #define TIF_PATCH_PENDING  6   /* pending live patching update */
@@ -115,6 +116,7 @@ void arch_setup_new_exec(void);
 #define _TIF_SYSCALL_TRACE (1<thread.regs);
do_signal(current);
}
-- 
2.29.0

-- 
Jens Axboe

Re: [REGRESSION] mm: process_vm_readv testcase no longer works after compat_prcoess_vm_readv removed

2020-10-26 Thread Jens Axboe

On 10/26/20 6:05 PM, Al Viro wrote:
> On Mon, Oct 26, 2020 at 05:56:11PM -0600, Jens Axboe wrote:
>> On 10/26/20 4:55 PM, Kyle Huey wrote:
>>> A test program from the rr[0] test suite, vm_readv_writev[1], no
>>> longer works on 5.10-rc1 when compiled as a 32 bit binary and executed
>>> on a 64 bit kernel. The first process_vm_readv call (on line 35) now
>>> fails with EFAULT. I have bisected this to
>>> c3973b401ef2b0b8005f8074a10e96e3ea093823.
>>>
>>> It should be fairly straightforward to extract the test case from our
>>> repository into a standalone program.
>>
>> Can you check with this applied?
>>
>> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
>> index fd12da80b6f2..05676722d9cd 100644
>> --- a/mm/process_vm_access.c
>> +++ b/mm/process_vm_access.c
>> @@ -273,7 +273,8 @@ static ssize_t process_vm_rw(pid_t pid,
>>  return rc;
>>  if (!iov_iter_count())
>>  goto free_iov_l;
>> -iov_r = iovec_from_user(rvec, riovcnt, UIO_FASTIOV, iovstack_r, false);
>> +iov_r = iovec_from_user(rvec, riovcnt, UIO_FASTIOV, iovstack_r,
>> +in_compat_syscall());
> 
> _ouch_
> 
> There's a bug, all right, but I'm not sure that this is all there is
> to it. For now it's probably the right fix, but...  Consider the fun
> trying to use that from 32bit process to access the memory of 64bit
> one.  IOW, we might want to add an explicit flag for "force 64bit
> addresses/sizes in rvec".

Ouch yes good point, nice catch.

-- 
Jens Axboe

Re: [REGRESSION] mm: process_vm_readv testcase no longer works after compat_prcoess_vm_readv removed

2020-10-26 Thread Jens Axboe

On 10/26/20 4:55 PM, Kyle Huey wrote:
> A test program from the rr[0] test suite, vm_readv_writev[1], no
> longer works on 5.10-rc1 when compiled as a 32 bit binary and executed
> on a 64 bit kernel. The first process_vm_readv call (on line 35) now
> fails with EFAULT. I have bisected this to
> c3973b401ef2b0b8005f8074a10e96e3ea093823.
> 
> It should be fairly straightforward to extract the test case from our
> repository into a standalone program.

Can you check with this applied?

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index fd12da80b6f2..05676722d9cd 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -273,7 +273,8 @@ static ssize_t process_vm_rw(pid_t pid,
return rc;
if (!iov_iter_count())
goto free_iov_l;
-   iov_r = iovec_from_user(rvec, riovcnt, UIO_FASTIOV, iovstack_r, false);
+   iov_r = iovec_from_user(rvec, riovcnt, UIO_FASTIOV, iovstack_r,
+   in_compat_syscall());
if (IS_ERR(iov_r)) {
rc = PTR_ERR(iov_r);
    goto free_iov_l;

-- 
Jens Axboe

Re: rename ->make_request_fn and move it to the block_device_operations v2

2020-07-01 Thread Jens Axboe

On 7/1/20 2:59 AM, Christoph Hellwig wrote:
> Hi Jens,
> 
> this series moves the make_request_fn method into block_device_operations
> with the much more descriptive ->submit_bio name.  It then also gives
> generic_make_request a more descriptive name, and further optimize the
> path to issue to blk-mq, removing the need for the direct_make_request
> bypass.

Thanks, I'll try this again.

-- 
Jens Axboe

Re: rename ->make_request_fn and move it to the block_device_operations

2020-06-30 Thread Jens Axboe

On 6/30/20 12:21 PM, Jens Axboe wrote:
> On 6/30/20 12:19 PM, Christoph Hellwig wrote:
>> On Tue, Jun 30, 2020 at 09:43:31AM -0600, Jens Axboe wrote:
>>> On 6/30/20 7:57 AM, Jens Axboe wrote:
>>>> On 6/29/20 1:39 PM, Christoph Hellwig wrote:
>>>>> Hi Jens,
>>>>>
>>>>> this series moves the make_request_fn method into block_device_operations
>>>>> with the much more descriptive ->submit_bio name.  It then also gives
>>>>> generic_make_request a more descriptive name, and further optimize the
>>>>> path to issue to blk-mq, removing the need for the direct_make_request
>>>>> bypass.
>>>>
>>>> Looks good to me, and it's a nice cleanup as well. Applied.
>>>
>>> Dropped, insta-crashes with dm:
>>
>> Hmm.  Can you send me what is at "submit_bio_noacct+0x1f6" from gdb?
>> Or your .config?
> 
> I'd have to apply and compile again. But it's a bad RIP, so I'm guessing
> it's ->submit_bio == NULL. Let me know if you really need it, and I can
> re-generate the OOPS and have the vmlinux too.

Here's the .config

-- 
Jens Axboe

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 5.8.0-rc1 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc (Ubuntu 10.1.0-2ubuntu1~18.04) 10.1.0"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=100100
CONFIG_LD_VERSION=23000
CONFIG_CLANG_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_USELIB is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
# CONFIG_SCHED_THERMAL_PRESSURE is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

# CONFIG_CPU_ISOLATION is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=18
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# CONFIG_UCLAMP_TASK is not set
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_ARCH_SUPPORTS_INT128=y
CONFIG_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
CONFIG_CGROUPS=y
# CONFIG_MEMCG is not set
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BA

Re: rename ->make_request_fn and move it to the block_device_operations

2020-06-30 Thread Jens Axboe

On 6/30/20 12:19 PM, Christoph Hellwig wrote:
> On Tue, Jun 30, 2020 at 09:43:31AM -0600, Jens Axboe wrote:
>> On 6/30/20 7:57 AM, Jens Axboe wrote:
>>> On 6/29/20 1:39 PM, Christoph Hellwig wrote:
>>>> Hi Jens,
>>>>
>>>> this series moves the make_request_fn method into block_device_operations
>>>> with the much more descriptive ->submit_bio name.  It then also gives
>>>> generic_make_request a more descriptive name, and further optimize the
>>>> path to issue to blk-mq, removing the need for the direct_make_request
>>>> bypass.
>>>
>>> Looks good to me, and it's a nice cleanup as well. Applied.
>>
>> Dropped, insta-crashes with dm:
> 
> Hmm.  Can you send me what is at "submit_bio_noacct+0x1f6" from gdb?
> Or your .config?

I'd have to apply and compile again. But it's a bad RIP, so I'm guessing
it's ->submit_bio == NULL. Let me know if you really need it, and I can
re-generate the OOPS and have the vmlinux too.

-- 
Jens Axboe

Re: rename ->make_request_fn and move it to the block_device_operations

2020-06-30 Thread Jens Axboe

On 6/30/20 7:57 AM, Jens Axboe wrote:
> On 6/29/20 1:39 PM, Christoph Hellwig wrote:
>> Hi Jens,
>>
>> this series moves the make_request_fn method into block_device_operations
>> with the much more descriptive ->submit_bio name.  It then also gives
>> generic_make_request a more descriptive name, and further optimize the
>> path to issue to blk-mq, removing the need for the direct_make_request
>> bypass.
> 
> Looks good to me, and it's a nice cleanup as well. Applied.

Dropped, insta-crashes with dm:

[   10.240134] BUG: kernel NULL pointer dereference, address: 
[   10.241000] #PF: supervisor instruction fetch in kernel mode
[   10.241666] #PF: error_code(0x0010) - not-present page
[   10.242280] PGD 0 P4D 0 
[   10.242600] Oops: 0010 [#1] PREEMPT SMP
[   10.243073] CPU: 1 PID: 2110 Comm: systemd-udevd Not tainted 5.8.0-rc3+ #6655
[   10.243939] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.13.0-1ubuntu1 04/01/2014
[   10.245012] RIP: 0010:0x0
[   10.245322] Code: Bad RIP value.
[   10.245695] RSP: 0018:c92f7af8 EFLAGS: 00010246
[   10.246333] RAX: 81c83520 RBX: 8881b805dea8 RCX: 88819e844070
[   10.247227] RDX:  RSI:  RDI: 88819e844070
[   10.248112] RBP: c92f7b48 R08: 8881b6f38800 R09: 88818ff0ea58
[   10.248994] R10:  R11: 88818ff0ea58 R12: 88819e844070
[   10.250077] R13:  R14:  R15: 888107812948
[   10.251168] FS:  7f5c3ed66a80() GS:8881b9c8() 
knlGS:
[   10.252161] CS:  0010 DS:  ES:  CR0: 80050033
[   10.253189] CR2: ffd6 CR3: 0001b2953003 CR4: 001606e0
[   10.254157] DR0:  DR1:  DR2: 
[   10.255279] DR3:  DR6: fffe0ff0 DR7: 0400
[   10.256365] Call Trace:
[   10.256781]  submit_bio_noacct+0x1f6/0x3d0
[   10.257297]  submit_bio+0x37/0x130
[   10.257780]  ? guard_bio_eod+0x2e/0x70
[   10.258418]  mpage_readahead+0x13c/0x180
[   10.259096]  ? blkdev_direct_IO+0x490/0x490
[   10.259654]  read_pages+0x68/0x2d0
[   10.260051]  page_cache_readahead_unbounded+0x1b7/0x220
[   10.260818]  generic_file_buffered_read+0x865/0xc80
[   10.261587]  ? _copy_to_user+0x6d/0x80
[   10.262171]  ? cp_new_stat+0x119/0x130
[   10.262680]  new_sync_read+0xfe/0x170
[   10.263155]  vfs_read+0xc8/0x180
[   10.263647]  ksys_read+0x53/0xc0
[   10.264209]  do_syscall_64+0x3c/0x70
[   10.264759]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   10.265200] RIP: 0033:0x7f5c3fcc9ab2
[   10.265510] Code: Bad RIP value.
[   10.265775] RSP: 002b:7ffc8e0cf9c8 EFLAGS: 0246 ORIG_RAX: 

[   10.266426] RAX: ffda RBX: 55d5eca76c68 RCX: 7f5c3fcc9ab2
[   10.267012] RDX: 0040 RSI: 55d5eca76c78 RDI: 0006
[   10.267591] RBP: 55d5eca44890 R08: 55d5eca76c50 R09: 7f5c3fd99a40
[   10.268168] R10: 0008 R11: 0246 R12: 3bd9
[   10.268744] R13: 0040 R14: 55d5eca76c50 R15: 55d5eca448e0
[   10.269319] Modules linked in:
[   10.269562] CR2: 
[   10.269845] ---[ end trace f09b8963e5a3593b ]---

-- 
Jens Axboe

Re: rename ->make_request_fn and move it to the block_device_operations

2020-06-30 Thread Jens Axboe

On 6/29/20 1:39 PM, Christoph Hellwig wrote:
> Hi Jens,
> 
> this series moves the make_request_fn method into block_device_operations
> with the much more descriptive ->submit_bio name.  It then also gives
> generic_make_request a more descriptive name, and further optimize the
> path to issue to blk-mq, removing the need for the direct_make_request
> bypass.

Looks good to me, and it's a nice cleanup as well. Applied.

-- 
Jens Axboe

Re: [PATCH 11/20] fs: remove a weird comment in submit_bh_wbc

2020-06-30 Thread Jens Axboe

On 6/29/20 1:39 PM, Christoph Hellwig wrote:
> All bios can get remapped if submitted to partitions.  No need to
> comment on that.

I'm pretty sure that comment is from me, dating back to when the bio
code was introduced in 2001. The point wasn't the remapping, just
that from here on down the IO was purely bio based, not buffer_heads.
Anyway, totally agree that it should just die, it's not that
interesting or useful anymore.

-- 
Jens Axboe

Re: Build failure on latest powerpc/merge (311ae9e159d8 io_uring: fix dead-hung for non-iter fixed rw)

2019-11-29 Thread Jens Axboe


On 11/29/19 10:07 AM, Pavel Begunkov wrote:

On 29/11/2019 20:16, Jens Axboe wrote:

On 11/29/19 8:14 AM, Christophe Leroy wrote:


Reverting commit 311ae9e159d8 ("io_uring: fix dead-hung for non-iter
fixed rw") clears the failure.

Most likely an #include is missing.


Huh weird how the build bots didn't catch that. Does the below work?


Yes it works, thanks.


Thanks for reporting and testing, I've queued it up with your reported
and tested-by.


My bad, thanks for the report and fixing.


No worries, usually the build bots are great at finding these before
patches go upstream. They have been unreliable lately, unfortunately.

--
Jens Axboe

Re: Build failure on latest powerpc/merge (311ae9e159d8 io_uring: fix dead-hung for non-iter fixed rw)

2019-11-29 Thread Jens Axboe


On 11/29/19 8:14 AM, Christophe Leroy wrote:



Le 29/11/2019 à 17:04, Jens Axboe a écrit :

On 11/29/19 6:53 AM, Christophe Leroy wrote:

 CC  fs/io_uring.o
fs/io_uring.c: In function ‘loop_rw_iter’:
fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
[-Werror=implicit-function-declaration]
   iovec.iov_base = kmap(iter->bvec->bv_page)
^
fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
without a cast [-Wint-conversion]
   iovec.iov_base = kmap(iter->bvec->bv_page)
  ^
fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
[-Werror=implicit-function-declaration]
   kunmap(iter->bvec->bv_page);
   ^


Reverting commit 311ae9e159d8 ("io_uring: fix dead-hung for non-iter
fixed rw") clears the failure.

Most likely an #include is missing.


Huh weird how the build bots didn't catch that. Does the below work?


Yes it works, thanks.


Thanks for reporting and testing, I've queued it up with your reported
and tested-by.

--
Jens Axboe

Re: Build failure on latest powerpc/merge (311ae9e159d8 io_uring: fix dead-hung for non-iter fixed rw)

2019-11-29 Thread Jens Axboe


On 11/29/19 6:53 AM, Christophe Leroy wrote:

CC  fs/io_uring.o
fs/io_uring.c: In function ‘loop_rw_iter’:
fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
[-Werror=implicit-function-declaration]
  iovec.iov_base = kmap(iter->bvec->bv_page)
   ^
fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
without a cast [-Wint-conversion]
  iovec.iov_base = kmap(iter->bvec->bv_page)
 ^
fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
[-Werror=implicit-function-declaration]
  kunmap(iter->bvec->bv_page);
  ^


Reverting commit 311ae9e159d8 ("io_uring: fix dead-hung for non-iter
fixed rw") clears the failure.

Most likely an #include is missing.


Huh weird how the build bots didn't catch that. Does the below work?


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2c2e8c25da01..745eb005fefe 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -69,6 +69,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS

 #include 

--
Jens Axboe

Re: [PATCH v6 15/24] fs/io_uring: set FOLL_PIN via pin_user_pages()

2019-11-19 Thread Jens Axboe

On 11/19/19 1:16 AM, John Hubbard wrote:
> Convert fs/io_uring to use the new pin_user_pages() call, which sets
> FOLL_PIN. Setting FOLL_PIN is now required for code that requires
> tracking of pinned pages, and therefore for any code that calls
> put_user_page().
> 
> In partial anticipation of this work, the io_uring code was already
> calling put_user_page() instead of put_page(). Therefore, in order to
> convert from the get_user_pages()/put_page() model, to the
> pin_user_pages()/put_user_page() model, the only change required
> here is to change get_user_pages() to pin_user_pages().
> 
> Reviewed-by: Jan Kara 
> Signed-off-by: John Hubbard 

You dropped my reviewed-by now... Given the file, you'd probably want
to keep that.

-- 
Jens Axboe

Re: [PATCH 10/19] fs/io_uring: set FOLL_PIN via pin_user_pages()

2019-11-01 Thread Jens Axboe

On 10/30/19 4:49 PM, John Hubbard wrote:
> Convert fs/io_uring to use the new pin_user_pages() call, which sets
> FOLL_PIN. Setting FOLL_PIN is now required for code that requires
> tracking of pinned pages, and therefore for any code that calls
> put_user_page().

Reviewed-by: Jens Axboe 

-- 
Jens Axboe

Re: [PATCH] block/ps3vram: Use %llu to format sector_t after LBDAF removal

2019-06-13 Thread Jens Axboe

On 6/13/19 1:30 AM, Geert Uytterhoeven wrote:
> The removal of CONFIG_LBDAF changed the type of sector_t from "unsigned
> long" to "u64" aka "unsigned long long" on 64-bit platforms, leading to
> a compiler warning regression:
> 
>  drivers/block/ps3vram.c: In function ‘ps3vram_probe’:
>  drivers/block/ps3vram.c:770:23: warning: format ‘%lu’ expects argument 
> of type ‘long unsigned int’, but argument 4 has type ‘sector_t {aka long long 
> unsigned int}’ [-Wformat=]
> 
> Fix this by using "%llu" instead.

Applied, thanks.

-- 
Jens Axboe

Re: [PATCH 2/2] arch: add pidfd and io_uring syscalls everywhere

2019-04-03 Thread Jens Axboe

On 4/3/19 9:49 AM, Will Deacon wrote:
> On Wed, Apr 03, 2019 at 09:39:52AM -0600, Jens Axboe wrote:
>> On 4/3/19 9:19 AM, Will Deacon wrote:
>>> On Wed, Apr 03, 2019 at 07:49:26AM -0600, Jens Axboe wrote:
>>>> On 4/3/19 5:11 AM, Will Deacon wrote:
>>>>> will@autoplooker:~/liburing/test$ ./io_uring_register 
>>>>> RELIMIT_MEMLOCK: 67108864 (67108864)
>>>>> [   35.477875] Unable to handle kernel NULL pointer dereference at 
>>>>> virtual address 0070
>>>>> [   35.478969] Mem abort info:
>>>>> [   35.479296]   ESR = 0x9604
>>>>> [   35.479785]   Exception class = DABT (current EL), IL = 32 bits
>>>>> [   35.480528]   SET = 0, FnV = 0
>>>>> [   35.480980]   EA = 0, S1PTW = 0
>>>>> [   35.481345] Data abort info:
>>>>> [   35.481680]   ISV = 0, ISS = 0x0004
>>>>> [   35.482267]   CM = 0, WnR = 0
>>>>> [   35.482618] user pgtable: 4k pages, 48-bit VAs, pgdp = (ptrval)
>>>>> [   35.483486] [0070] pgd=
>>>>> [   35.484041] Internal error: Oops: 9604 [#1] PREEMPT SMP
>>>>> [   35.484788] Modules linked in:
>>>>> [   35.485311] CPU: 113 PID: 3973 Comm: io_uring_regist Not tainted 
>>>>> 5.1.0-rc3-00012-g40b114779944 #1
>>>>> [   35.486712] Hardware name: linux,dummy-virt (DT)
>>>>> [   35.487450] pstate: 2045 (nzCv daif +PAN -UAO)
>>>>> [   35.488228] pc : link_pwq+0x10/0x60
>>>>> [   35.488794] lr : apply_wqattrs_commit+0xe0/0x118
>>>>> [   35.489550] sp : ffff000017e2bbc0
>>>>
>>>> Huh, this looks odd, it's crashing inside the wq setup.
>>>
>>> Enabling KASAN seems to indicate a double-free, which may well be related.
>>
>> Does this help?
> 
> Yes, thanks for the quick patch. Feel free to add:
> 
> Reported-by: Will Deacon 
> Tested-by: Will Deacon 
> 
> if you spin a proper patch.

Great, thanks for reporting/testing.

-- 
Jens Axboe

Re: [PATCH 2/2] arch: add pidfd and io_uring syscalls everywhere

2019-04-03 Thread Jens Axboe

On 4/3/19 9:19 AM, Will Deacon wrote:
> Hi Jens,
> 
> On Wed, Apr 03, 2019 at 07:49:26AM -0600, Jens Axboe wrote:
>> On 4/3/19 5:11 AM, Will Deacon wrote:
>>> will@autoplooker:~/liburing/test$ ./io_uring_register 
>>> RELIMIT_MEMLOCK: 67108864 (67108864)
>>> [   35.477875] Unable to handle kernel NULL pointer dereference at virtual 
>>> address 0070
>>> [   35.478969] Mem abort info:
>>> [   35.479296]   ESR = 0x9604
>>> [   35.479785]   Exception class = DABT (current EL), IL = 32 bits
>>> [   35.480528]   SET = 0, FnV = 0
>>> [   35.480980]   EA = 0, S1PTW = 0
>>> [   35.481345] Data abort info:
>>> [   35.481680]   ISV = 0, ISS = 0x0004
>>> [   35.482267]   CM = 0, WnR = 0
>>> [   35.482618] user pgtable: 4k pages, 48-bit VAs, pgdp = (ptrval)
>>> [   35.483486] [0070] pgd=
>>> [   35.484041] Internal error: Oops: 9604 [#1] PREEMPT SMP
>>> [   35.484788] Modules linked in:
>>> [   35.485311] CPU: 113 PID: 3973 Comm: io_uring_regist Not tainted 
>>> 5.1.0-rc3-00012-g40b114779944 #1
>>> [   35.486712] Hardware name: linux,dummy-virt (DT)
>>> [   35.487450] pstate: 2045 (nzCv daif +PAN -UAO)
>>> [   35.488228] pc : link_pwq+0x10/0x60
>>> [   35.488794] lr : apply_wqattrs_commit+0xe0/0x118
>>> [   35.489550] sp : 17e2bbc0
>>
>> Huh, this looks odd, it's crashing inside the wq setup.
> 
> Enabling KASAN seems to indicate a double-free, which may well be related.

Does this help?


diff --git a/fs/io_uring.c b/fs/io_uring.c
index bbdbd56cf2ac..07d6ef195d05 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2215,6 +2215,7 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, 
void __user *arg,
fput(ctx->user_files[i]);
 
kfree(ctx->user_files);
+   ctx->user_files = NULL;
ctx->nr_user_files = 0;
return ret;
}

-- 
Jens Axboe

Re: [PATCH 2/2] arch: add pidfd and io_uring syscalls everywhere

2019-04-03 Thread Jens Axboe

On 4/3/19 5:11 AM, Will Deacon wrote:
> Hi Michael,
> 
> On Wed, Apr 03, 2019 at 01:47:50PM +1100, Michael Ellerman wrote:
>> Arnd Bergmann  writes:
>>> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl 
>>> b/arch/powerpc/kernel/syscalls/syscall.tbl
>>> index b18abb0c3dae..00f5a63c8d9a 100644
>>> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
>>> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
>>> @@ -505,3 +505,7 @@
>>>  42132  rt_sigtimedwait_time64  sys_rt_sigtimedwait 
>>> compat_sys_rt_sigtimedwait_time64
>>>  42232  futex_time64sys_futex   
>>> sys_futex
>>>  42332  sched_rr_get_interval_time64
>>> sys_sched_rr_get_interval   sys_sched_rr_get_interval
>>> +424common  pidfd_send_signal   sys_pidfd_send_signal
>>> +425common  io_uring_setup  sys_io_uring_setup
>>> +426common  io_uring_enter  sys_io_uring_enter
>>> +427common  io_uring_register   sys_io_uring_register
>>
>> Acked-by: Michael Ellerman  (powerpc)
>>
>> Lightly tested.
>>
>> The pidfd_test selftest passes.
> 
> That reports pass for me too, although it fails to unshare the pid ns, which I
> assume is benign.
> 
>> Ran the io_uring example from fio, which prints lots of:
> 
> How did you invoke that? I had a play with the tests in:

It's t/io_uring from the fio repo:

git://git.kernel.dk/fio

and you just run it ala:

# make t/io_uring
# t/io_uring /dev/some_device

>   git://git.kernel.dk/liburing
> 
> but I quickly ran into the kernel oops below.
> 
> Will
> 
> --->8
> 
> will@autoplooker:~/liburing/test$ ./io_uring_register 
> RELIMIT_MEMLOCK: 67108864 (67108864)
> [   35.477875] Unable to handle kernel NULL pointer dereference at virtual 
> address 0070
> [   35.478969] Mem abort info:
> [   35.479296]   ESR = 0x9604
> [   35.479785]   Exception class = DABT (current EL), IL = 32 bits
> [   35.480528]   SET = 0, FnV = 0
> [   35.480980]   EA = 0, S1PTW = 0
> [   35.481345] Data abort info:
> [   35.481680]   ISV = 0, ISS = 0x0004
> [   35.482267]   CM = 0, WnR = 0
> [   35.482618] user pgtable: 4k pages, 48-bit VAs, pgdp = (ptrval)
> [   35.483486] [0070] pgd=
> [   35.484041] Internal error: Oops: 9604 [#1] PREEMPT SMP
> [   35.484788] Modules linked in:
> [   35.485311] CPU: 113 PID: 3973 Comm: io_uring_regist Not tainted 
> 5.1.0-rc3-00012-g40b114779944 #1
> [   35.486712] Hardware name: linux,dummy-virt (DT)
> [   35.487450] pstate: 2045 (nzCv daif +PAN -UAO)
> [   35.488228] pc : link_pwq+0x10/0x60
> [   35.488794] lr : apply_wqattrs_commit+0xe0/0x118
> [   35.489550] sp : 17e2bbc0

Huh, this looks odd, it's crashing inside the wq setup.


-- 
Jens Axboe

Re: [PATCH] block/swim3: Fix regression on PowerBook G3

2018-12-31 Thread Jens Axboe

On 12/30/18 10:44 PM, Finn Thain wrote:
> As of v4.20, the swim3 driver crashes when loaded on a PowerBook G3
> (Wallstreet).
> 
> MacIO PCI driver attached to Gatwick chipset
> MacIO PCI driver attached to Heathrow chipset
> swim3 0.00015000:floppy: [fd0] SWIM3 floppy controller in media bay
> 0.00013020:ch-a: ttyS0 at MMIO 0xf3013020 (irq = 16, base_baud = 230400) is a 
> Z85c30 ESCC - Serial port
> 0.00013000:ch-b: ttyS1 at MMIO 0xf3013000 (irq = 17, base_baud = 230400) is a 
> Z85c30 ESCC - Infrared port
> macio: fixed media-bay irq on gatwick
> macio: fixed left floppy irqs
> swim3 1.00015000:floppy: [fd1] Couldn't request interrupt
> Unable to handle kernel paging request for data at address 0x0024
> Faulting instruction address: 0xc02652f8
> Oops: Kernel access of bad area, sig: 11 [#1]
> BE SMP NR_CPUS=2 PowerMac
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.20.0 #2
> NIP:  c02652f8 LR: c026915c CTR: c0276d1c
> REGS: df43ba10 TRAP: 0300   Not tainted  (4.20.0)
> MSR:  9032   CR: 28228288  XER: 0100
> DAR: 0024 DSISR: 4000
> GPR00: c026915c df43bac0 df439060 c0731524 df494700  c06e1c08 0001
> GPR08: 0001  df5ff220 1032 28228282  c0004ca4 
> GPR16:    c073144c dfffe064 c0731524 0120 c0586108
> GPR24: c073132c c073143c c073143c  c0731524 df67cd70 df494700 0001
> NIP [c02652f8] blk_mq_free_rqs+0x28/0xf8
> LR [c026915c] blk_mq_sched_tags_teardown+0x58/0x84
> Call Trace:
> [df43bac0] [c0045f50] flush_workqueue_prep_pwqs+0x178/0x1c4 (unreliable)
> [df43bae0] [c026915c] blk_mq_sched_tags_teardown+0x58/0x84
> [df43bb00] [c02697f0] blk_mq_exit_sched+0x9c/0xb8
> [df43bb20] [c0252794] elevator_exit+0x84/0xa4
> [df43bb40] [c0256538] blk_exit_queue+0x30/0x50
> [df43bb50] [c0256640] blk_cleanup_queue+0xe8/0x184
> [df43bb70] [c034732c] swim3_attach+0x330/0x5f0
> [df43bbb0] [c034fb24] macio_device_probe+0x58/0xec
> [df43bbd0] [c032ba88] really_probe+0x1e4/0x2f4
> [df43bc00] [c032bd28] driver_probe_device+0x64/0x204
> [df43bc20] [c0329ac4] bus_for_each_drv+0x60/0xac
> [df43bc50] [c032b824] __device_attach+0xe8/0x160
> [df43bc80] [c032ab38] bus_probe_device+0xa0/0xbc
> [df43bca0] [c0327338] device_add+0x3d8/0x630
> [df43bcf0] [c0350848] macio_add_one_device+0x444/0x48c
> [df43bd50] [c03509f8] macio_pci_add_devices+0x168/0x1bc
> [df43bd90] [c03500ec] macio_pci_probe+0xc0/0x10c
> [df43bda0] [c02ad884] pci_device_probe+0xd4/0x184
> [df43bdd0] [c032ba88] really_probe+0x1e4/0x2f4
> [df43be00] [c032bd28] driver_probe_device+0x64/0x204
> [df43be20] [c032bfcc] __driver_attach+0x104/0x108
> [df43be40] [c0329a00] bus_for_each_dev+0x64/0xb4
> [df43be70] [c032add8] bus_add_driver+0x154/0x238
> [df43be90] [c032ca24] driver_register+0x84/0x148
> [df43bea0] [c0004aa0] do_one_initcall+0x40/0x188
> [df43bf00] [c0690100] kernel_init_freeable+0x138/0x1d4
> [df43bf30] [c0004cbc] kernel_init+0x18/0x10c
> [df43bf40] [c00121e4] ret_from_kernel_thread+0x14/0x1c
> Instruction dump:
> 5484d97e 4bfff4f4 9421ffe0 7c0802a6 bf410008 7c9e2378 90010024 8124005c
> 2f89 419e0078 81230004 7c7c1b78 <81290024> 2f89 419e0064 8144
> ---[ end trace 12025ab921a9784c ]---
> 
> Reverting commit 8ccb8cb1892b ("swim3: convert to blk-mq") resolves the
> problem.
> 
> That commit added a struct blk_mq_tag_set to struct floppy_state and
> initialized it with a blk_mq_init_sq_queue() call. Unfortunately, there
> is a memset() in swim3_add_device() that subsequently clears the
> floppy_state struct. That means fs->tag_set->ops is a NULL pointer, and
> it gets dereferenced by blk_mq_free_rqs() which gets called in the
> request_irq() error path. Move the memset() to fix this bug.
> 
> BTW, the request_irq() failure for the left mediabay floppy (fd1) is not
> a regression. I don't know why it happens. The right media bay floppy
> (fd0) works fine however.

Applied, thanks.

-- 
Jens Axboe

Re: [PATCH] block/swim3: Fix -EBUSY error when re-opening device after unmount

2018-12-31 Thread Jens Axboe

On 12/30/18 10:44 PM, Finn Thain wrote:
> When the block device is opened with FMODE_EXCL, ref_count is set to -1.
> This value doesn't get reset when the device is closed which means the
> device cannot be opened again. Fix this by checking for refcount <= 0
> in the release method.

Applied, thanks.


-- 
Jens Axboe

Re: [PATCH] block/swim3: Remove dead return statement

2018-12-31 Thread Jens Axboe

On 12/30/18 10:44 PM, Finn Thain wrote:
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Finn Thain 

Applied, thanks.

-- 
Jens Axboe

Re: [Intel-wired-lan] [PATCH V2] mm: Replace all open encodings for NUMA_NO_NODE

2018-11-26 Thread Jens Axboe

On 11/26/18 10:56 AM, Jeff Kirsher wrote:
> On Mon, 2018-11-26 at 17:56 +0530, Anshuman Khandual wrote:
>> At present there are multiple places where invalid node number is
>> encoded
>> as -1. Even though implicitly understood it is always better to have
>> macros
>> in there. Replace these open encodings for an invalid node number
>> with the
>> global macro NUMA_NO_NODE. This helps remove NUMA related assumptions
>> like
>> 'invalid node' from various places redirecting them to a common
>> definition.
>>
>> Signed-off-by: Anshuman Khandual 
> 
> For the 'ixgbe' driver changes.
> 
> Acked-by: Jeff Kirsher 

Lost the original, but for mtip32xx:

Acked-by: Jens Axboe 

-- 
Jens Axboe

Re: System not booting since dm changes? (was Linux 4.20-rc1)

2018-11-05 Thread Jens Axboe

On 11/5/18 7:35 AM, Satheesh Rajendran wrote:
> On Mon, Nov 05, 2018 at 08:51:57AM -0500, Mike Snitzer wrote:
>> On Mon, Nov 05 2018 at  5:25am -0500,
>> Michael Ellerman  wrote:
>>
>>> Linus Torvalds  writes:
>>> ...
>>>> Mike Snitzer (1):
>>>> device mapper updates
>>>
>>> Hi Mike,
>>>
>>> Replying here because I can't find the device-mapper pull or the patch
>>> in question on LKML. I guess I should be subscribed to dm-devel.
>>>
>>> We have a box that doesn't boot any more, bisect points at one of:
>>>
>>>   cef6f55a9fb4 Mike Snitzer   dm table: require that request-based DM 
>>> be layered on blk-mq devices 
>>>   953923c09fe8 Mike Snitzer   dm: rename DM_TYPE_MQ_REQUEST_BASED to 
>>> DM_TYPE_REQUEST_BASED 
>>>   6a23e05c2fe3 Jens Axboe dm: remove legacy request-based IO path 
>>>
>>>
>>> It's a Power8 system running Rawhide, it does have multipath, but I'm
>>> told it was setup by the Fedora installer, ie. nothing fancy.
>>>
>>> The symptom is the system can't find its root filesystem and drops into
>>> the initramfs shell. The dmesg includes a bunch of errors like below:
>>>
>>>   [   43.263460] localhost multipathd[1344]: sdb: fail to get serial
>>>   [   43.268762] localhost multipathd[1344]: mpatha: failed in domap for 
>>> addition of new path sdb
>>>   [   43.268762] localhost multipathd[1344]: uevent trigger error
>>>   [   43.282065] localhost kernel: device-mapper: table: table load 
>>> rejected: not all devices are blk-mq request-stackable
>> ...
>>>
>>> Any ideas what's going wrong here?
>>
>> "table load rejected: not all devices are blk-mq request-stackable"
>> speaks to the fact that you aren't using blk-mq for scsi (aka scsi-mq).
>>
>> You need to use scsi_mod.use_blk_mq=Y on the kernel commandline (or set
>> CONFIG_SCSI_MQ_DEFAULT in your kernel config)
> 
> Thanks Mike!, above solution worked and the system booted fine now:-)

This quirk will go away for the next kernel, fwiw, since the non-mq
path for SCSI will be dropped as well.

-- 
Jens Axboe

Re: [PATCH] [RFC v2] Drop all 00-INDEX files from Documentation/

2018-09-04 Thread Jens Axboe

On 9/3/18 4:15 PM, Henrik Austad wrote:
> This is a respin with a wider audience (all that get_maintainer returned)
> and I know this spams a *lot* of people. Not sure what would be the correct
> way, so my apologies for ruining your inbox.
> 
> The 00-INDEX files are supposed to give a summary of all files present
> in a directory, but these files are horribly out of date and their
> usefulness is brought into question. Often a simple "ls" would reveal
> the same information as the filenames are generally quite descriptive as
> a short introduction to what the file covers (it should not surprise
> anyone what Documentation/sched/sched-design-CFS.txt covers)
> 
> A few years back it was mentioned that these files were no longer really
> needed, and they have since then grown further out of date, so perhaps
> it is time to just throw them out.
> 
> A short status yields the following _outdated_ 00-INDEX files, first
> counter is files listed in 00-INDEX but missing in the directory, last
> is files present but not listed in 00-INDEX.

For the block related bits:

Reviewed-by: Jens Axboe 

-- 
Jens Axboe

Re: Oops in kmem_cache_free() via bioset_exit() (was Re: [next-20180601][nvme][ppc] Kernel Oops is triggered when creating lvm snapshots on nvme disks)

2018-06-28 Thread Jens Axboe

On 6/28/18 8:42 AM, Michael Ellerman wrote:
> Kent, Jens,
> 
> This looks like it might be related to the recent bioset changes?
> 
> cheers
> 
> Abdul Haleem  writes:
>> On Tue, 2018-06-26 at 23:36 +1000, Michael Ellerman wrote:
>>> Abdul Haleem  writes:
> ...
>> I was able to reproduce again with slub_debug=FZP and DEBUG_INFO enabled
>> on 4.17.0-rc7-next-20180601, but not much traces other than the Oops stack 
>> trace
> 
> Are you still testing on that revision? It's nearly a month old.
> 
> Please try to reproduce on mainline or today's linux-next.
> 
> 
>> the faulty instruction points to below code path :
>>
>> gdb -batch vmlinux -ex 'list *(0xc0304fe0)'
>> 0xc0304fe0 is in kmem_cache_free (mm/slab.h:231).
>> 226  }
>> 227  
>> 228  static inline bool slab_equal_or_root(struct kmem_cache *s,
>> 229struct kmem_cache *p)
>> 230  {
>> 231  return p == s || p == s->memcg_params.root_cache;
>> 232  }
> 
> And s is NULL.
> 
> Called via:
>   kmem_cache_free+0x210/0x2a0
>   mempool_free_slab+0x24/0x40
>   mempool_exit+0x50/0x90
>   bioset_exit+0x40/0x1d0
>   dm_io_client_destroy+0x2c/0x50
>   dm_bufio_client_destroy+0x1fc/0x2d0 [dm_bufio]
>   persistent_read_metadata+0x430/0x660 [dm_snapshot]
>   snapshot_ctr+0x5c8/0x7a0 [dm_snapshot]
>   dm_table_add_target+0x19c/0x3c0
>   table_load+0x104/0x450
>   ctl_ioctl+0x1f8/0x570
>   dm_ctl_ioctl+0x18/0x30
>   do_vfs_ioctl+0xcc/0x9e0
>   ksys_ioctl+0x5c/0xe0
>   sys_ioctl+0x20/0x80
>   system_call+0x58/0x6c
> 
> So looks like we did:
> 
>   kmem_cache_free(NULL
> 
> Probably a bad error path that frees before the cache has been allocated.
> 
> mempool_init_node() calls mempool_exit() on a partially initialised
> mempool, which looks fishy, though you're not hitting that patch AFAICS.

The slab cache is setup elsewhere, it's pending_cache. So if pending_cache
is NULL, then yeah and exit there will barf. I'd try something like the
below, but from the trace, we already basically see the path.


diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 0c964ac107c2..ebfa2f89ffdd 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -59,6 +59,7 @@ void mempool_free_slab(void *element, void *pool_data);
 static inline int
 mempool_init_slab_pool(mempool_t *pool, int min_nr, struct kmem_cache *kc)
 {
+   BUG_ON(!kc);
return mempool_init(pool, min_nr, mempool_alloc_slab,
mempool_free_slab, (void *) kc);
 }
diff --git a/mm/mempool.c b/mm/mempool.c
index b54f2c20e5e0..060f44acd0df 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -508,7 +508,9 @@ EXPORT_SYMBOL(mempool_alloc_slab);
 void mempool_free_slab(void *element, void *pool_data)
 {
struct kmem_cache *mem = pool_data;
-   kmem_cache_free(mem, element);
+
+   if (!WARN_ON(!mem))
+   kmem_cache_free(mem, element);
 }
 EXPORT_SYMBOL(mempool_free_slab);
 

-- 
Jens Axboe

Re: Constant ata messages on console with commit 28361c403683 ("libata: add extra internal command") (was Re: [GIT PULL 2/2] libata changes for v4.18-rc1)

2018-06-19 Thread Jens Axboe

On 6/19/18 5:35 PM, Michael Ellerman wrote:
> Jens Axboe  writes:
>> On 6/19/18 1:29 AM, Michael Ellerman wrote:
>>> Jens Axboe  writes:
>>>> On 6/18/18 1:33 AM, Michael Ellerman wrote:
>>>>> Tejun Heo  writes:
>>>>> ...
>>>>>> Jens Axboe (10):
>>>>>>   libata: introduce notion of separate hardware tags
>>>>>>   libata: convert core and drivers to ->hw_tag usage
>>>>>>   libata: bump ->qc_active to a 64-bit type
>>>>>>   libata: use ata_tag_internal() consistently
>>>>>>   libata: remove assumption that ATA_MAX_QUEUE - 1 is the max
>>>>>>   sata_nv: set host can_queue count appropriately
>>>>>>   libata: add extra internal command
>>>>>
>>>>> Replying here because I can't find the original mail.
>>>>>
>>>>> The above commit is causing one of my machines to constantly spew ata
>>>>> messages on the console, according to bisect:
>>>>>
>>>>> # first bad commit: [28361c403683c2b00d4f5e76045f3ccd299bf99d] libata: 
>>>>> add extra internal command
>>>>>
>>>>> To get it to boot I have to also apply:
>>>>>
>>>>>   88e10092f6a6 ("sata_fsl: use the right type for tag bitshift")
>>>>>
>>>>>
>>>>> The system boots OK and seems fine, except that it's just printing
>>>>> multiple of these per second:
>>>>>
>>>>>   ata2: Signature Update detected @ 0 msecs
>>>>>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>>>   ata2.00: configured for UDMA/100
>>>>>   ata2: Signature Update detected @ 0 msecs
>>>>>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>>>   ata2.00: configured for UDMA/100
>>>>>   ata2: Signature Update detected @ 0 msecs
>>>>>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>>>   ata2.00: configured for UDMA/100
>>>>>   ata2: Signature Update detected @ 0 msecs
>>>>>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>>>   ata2.00: configured for UDMA/100
>>>>>   ata2: Signature Update detected @ 0 msecs
> ...
>>
>> Actually, just try this one on top of current -git.
> 
> Yep that fixes it.
> 
> No more message spam, and when I try to mount sr0 it says "no medium".
> 
> I'll have to go into the office to actually put a disc in the drive
> to check it's really working, but it seems likely.
> 
> Thanks for debugging it, here's a tested-by if you like:
> 
> Tested-by: Michael Ellerman 

Thanks, but I packaged it a little differently, see the other
series I CC'ed you on. It'll work the same, though. It's in Tejun's
tree now, so should end up with Linus some time this week I think.

Thanks for reporting and testing the fix!

-- 
Jens Axboe

Re: Constant ata messages on console with commit 28361c403683 ("libata: add extra internal command") (was Re: [GIT PULL 2/2] libata changes for v4.18-rc1)

2018-06-19 Thread Jens Axboe

On 6/19/18 1:29 AM, Michael Ellerman wrote:
> Jens Axboe  writes:
>> On 6/18/18 1:33 AM, Michael Ellerman wrote:
>>> Tejun Heo  writes:
>>> ...
>>>> Jens Axboe (10):
>>>>   libata: introduce notion of separate hardware tags
>>>>   libata: convert core and drivers to ->hw_tag usage
>>>>   libata: bump ->qc_active to a 64-bit type
>>>>   libata: use ata_tag_internal() consistently
>>>>   libata: remove assumption that ATA_MAX_QUEUE - 1 is the max
>>>>   sata_nv: set host can_queue count appropriately
>>>>   libata: add extra internal command
>>>
>>> Replying here because I can't find the original mail.
>>>
>>> The above commit is causing one of my machines to constantly spew ata
>>> messages on the console, according to bisect:
>>>
>>> # first bad commit: [28361c403683c2b00d4f5e76045f3ccd299bf99d] libata: add 
>>> extra internal command
>>>
>>> To get it to boot I have to also apply:
>>>
>>>   88e10092f6a6 ("sata_fsl: use the right type for tag bitshift")
>>>
>>>
>>> The system boots OK and seems fine, except that it's just printing
>>> multiple of these per second:
>>>
>>>   ata2: Signature Update detected @ 0 msecs
>>>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>   ata2.00: configured for UDMA/100
>>>   ata2: Signature Update detected @ 0 msecs
>>>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>   ata2.00: configured for UDMA/100
>>>   ata2: Signature Update detected @ 0 msecs
>>>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>   ata2.00: configured for UDMA/100
>>>   ata2: Signature Update detected @ 0 msecs
>>>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>   ata2.00: configured for UDMA/100
>>>   ata2: Signature Update detected @ 0 msecs
>>>
>>> And it never seems to stop.
>>>
>>> The machine is a Freescale/NXP P5020ds, using the sata_fsl driver
>>> presumably. Any ideas?
>>
>> Hmm that's odd. Can you include the boot log from a working boot as
>> well? Would be nice to see what devices are on the sata adapter.
>> The above just looks like a hardreset loop.
> 
> Ah yep. I stupidly assumed it was working, because the machine booted,
> but that's because the root disk is on ata1.
> 
> Booting the good kernel:
> 
>   ba80c3a572f4 ("sata_nv: set host can_queue count appropriately")
> 
> I see:
> 
>   root@p5020ds:~# ls -l /sys/class/ata_port/
>   total 0
>   lrwxrwxrwx 1 root root 0 Jun 19 06:49 ata1 -> 
> ../../devices/platform/ffe00.soc/ffe22.sata/ata1/ata_port/ata1
>   lrwxrwxrwx 1 root root 0 Jun 19 17:06 ata2 -> 
> ../../devices/platform/ffe00.soc/ffe221000.sata/ata2/ata_port/ata2
> 
>   root@p5020ds:~# ls -l /sys/class/block/ | grep ata
>   lrwxrwxrwx 1 root root 0 Jun 19 17:11 sda -> 
> ../../devices/platform/ffe00.soc/ffe22.sata/ata1/host0/target0:0:0/0:0:0:0/block/sda
>   lrwxrwxrwx 1 root root 0 Jun 19 17:11 sda1 -> 
> ../../devices/platform/ffe00.soc/ffe22.sata/ata1/host0/target0:0:0/0:0:0:0/block/sda/sda1
>   lrwxrwxrwx 1 root root 0 Jun 19 17:11 sda2 -> 
> ../../devices/platform/ffe00.soc/ffe22.sata/ata1/host0/target0:0:0/0:0:0:0/block/sda/sda2
>   lrwxrwxrwx 1 root root 0 Jun 19 17:11 sda5 -> 
> ../../devices/platform/ffe00.soc/ffe22.sata/ata1/host0/target0:0:0/0:0:0:0/block/sda/sda5
>   lrwxrwxrwx 1 root root 0 Jun 19 17:11 sr0 -> 
> ../../devices/platform/ffe00.soc/ffe221000.sata/ata2/host1/target1:0:0/1:0:0:0/block/sr0
> 
> So it's the DVD drive.
> 
>   
> root@p5020ds:/sys/devices/platform/ffe00.soc/ffe221000.sata/ata2/host1/target1:0:0/1:0:0:0/scsi_device/1:0:0:0/device#
>  cat vendor 
>   Optiarc 
>   
> root@p5020ds:/sys/devices/platform/ffe00.soc/ffe221000.sata/ata2/host1/target1:0:0/1:0:0:0/scsi_device/1:0:0:0/device#
>  cat model
>   DVD RW AD-7260S 
> 
> 
> Full boot log from a good boot attached if that's helpful.
> 
> All of the above looks the same when I boot with the broken setup, it
> just spams dmesg constantly.
> 
> One thing that is different, on the good kernel I see:
> 
>   root@p5020ds:~# mount /dev/sr0 /mnt
>   mount: no medium found on /dev/sr0
> 
> vs bad (88e10092f6a6):
> 
>   root@p5020ds:~# mount /dev/sr0 /mnt
>   mount: /dev/sr0 is already mounted or /mnt busy
> 
> cheers

Actually, just try this one on top of current -git.

diff --git a/drivers/ata/sat

Re: Constant ata messages on console with commit 28361c403683 ("libata: add extra internal command") (was Re: [GIT PULL 2/2] libata changes for v4.18-rc1)

2018-06-19 Thread Jens Axboe

On 6/19/18 1:29 AM, Michael Ellerman wrote:
> Jens Axboe  writes:
>> On 6/18/18 1:33 AM, Michael Ellerman wrote:
>>> Tejun Heo  writes:
>>> ...
>>>> Jens Axboe (10):
>>>>   libata: introduce notion of separate hardware tags
>>>>   libata: convert core and drivers to ->hw_tag usage
>>>>   libata: bump ->qc_active to a 64-bit type
>>>>   libata: use ata_tag_internal() consistently
>>>>   libata: remove assumption that ATA_MAX_QUEUE - 1 is the max
>>>>   sata_nv: set host can_queue count appropriately
>>>>   libata: add extra internal command
>>>
>>> Replying here because I can't find the original mail.
>>>
>>> The above commit is causing one of my machines to constantly spew ata
>>> messages on the console, according to bisect:
>>>
>>> # first bad commit: [28361c403683c2b00d4f5e76045f3ccd299bf99d] libata: add 
>>> extra internal command
>>>
>>> To get it to boot I have to also apply:
>>>
>>>   88e10092f6a6 ("sata_fsl: use the right type for tag bitshift")
>>>
>>>
>>> The system boots OK and seems fine, except that it's just printing
>>> multiple of these per second:
>>>
>>>   ata2: Signature Update detected @ 0 msecs
>>>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>   ata2.00: configured for UDMA/100
>>>   ata2: Signature Update detected @ 0 msecs
>>>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>   ata2.00: configured for UDMA/100
>>>   ata2: Signature Update detected @ 0 msecs
>>>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>   ata2.00: configured for UDMA/100
>>>   ata2: Signature Update detected @ 0 msecs
>>>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>>>   ata2.00: configured for UDMA/100
>>>   ata2: Signature Update detected @ 0 msecs
>>>
>>> And it never seems to stop.
>>>
>>> The machine is a Freescale/NXP P5020ds, using the sata_fsl driver
>>> presumably. Any ideas?
>>
>> Hmm that's odd. Can you include the boot log from a working boot as
>> well? Would be nice to see what devices are on the sata adapter.
>> The above just looks like a hardreset loop.
> 
> Ah yep. I stupidly assumed it was working, because the machine booted,
> but that's because the root disk is on ata1.
> 
> Booting the good kernel:
> 
>   ba80c3a572f4 ("sata_nv: set host can_queue count appropriately")
> 
> I see:
> 
>   root@p5020ds:~# ls -l /sys/class/ata_port/
>   total 0
>   lrwxrwxrwx 1 root root 0 Jun 19 06:49 ata1 -> 
> ../../devices/platform/ffe00.soc/ffe22.sata/ata1/ata_port/ata1
>   lrwxrwxrwx 1 root root 0 Jun 19 17:06 ata2 -> 
> ../../devices/platform/ffe00.soc/ffe221000.sata/ata2/ata_port/ata2
> 
>   root@p5020ds:~# ls -l /sys/class/block/ | grep ata
>   lrwxrwxrwx 1 root root 0 Jun 19 17:11 sda -> 
> ../../devices/platform/ffe00.soc/ffe22.sata/ata1/host0/target0:0:0/0:0:0:0/block/sda
>   lrwxrwxrwx 1 root root 0 Jun 19 17:11 sda1 -> 
> ../../devices/platform/ffe00.soc/ffe22.sata/ata1/host0/target0:0:0/0:0:0:0/block/sda/sda1
>   lrwxrwxrwx 1 root root 0 Jun 19 17:11 sda2 -> 
> ../../devices/platform/ffe00.soc/ffe22.sata/ata1/host0/target0:0:0/0:0:0:0/block/sda/sda2
>   lrwxrwxrwx 1 root root 0 Jun 19 17:11 sda5 -> 
> ../../devices/platform/ffe00.soc/ffe22.sata/ata1/host0/target0:0:0/0:0:0:0/block/sda/sda5
>   lrwxrwxrwx 1 root root 0 Jun 19 17:11 sr0 -> 
> ../../devices/platform/ffe00.soc/ffe221000.sata/ata2/host1/target1:0:0/1:0:0:0/block/sr0
> 
> So it's the DVD drive.
> 
>   
> root@p5020ds:/sys/devices/platform/ffe00.soc/ffe221000.sata/ata2/host1/target1:0:0/1:0:0:0/scsi_device/1:0:0:0/device#
>  cat vendor 
>   Optiarc 
>   
> root@p5020ds:/sys/devices/platform/ffe00.soc/ffe221000.sata/ata2/host1/target1:0:0/1:0:0:0/scsi_device/1:0:0:0/device#
>  cat model
>   DVD RW AD-7260S 
> 
> 
> Full boot log from a good boot attached if that's helpful.
> 
> All of the above looks the same when I boot with the broken setup, it
> just spams dmesg constantly.
> 
> One thing that is different, on the good kernel I see:
> 
>   root@p5020ds:~# mount /dev/sr0 /mnt
>   mount: no medium found on /dev/sr0
> 
> vs bad (88e10092f6a6):
> 
>   root@p5020ds:~# mount /dev/sr0 /mnt
>   mount: /dev/sr0 is already mounted or /mnt busy

Can you try the below patch, on both 4.17 and on current -git? Might
help shed some light

Re: Constant ata messages on console with commit 28361c403683 ("libata: add extra internal command") (was Re: [GIT PULL 2/2] libata changes for v4.18-rc1)

2018-06-18 Thread Jens Axboe

On 6/18/18 1:33 AM, Michael Ellerman wrote:
> Tejun Heo  writes:
> ...
>> Jens Axboe (10):
>>   libata: introduce notion of separate hardware tags
>>   libata: convert core and drivers to ->hw_tag usage
>>   libata: bump ->qc_active to a 64-bit type
>>   libata: use ata_tag_internal() consistently
>>   libata: remove assumption that ATA_MAX_QUEUE - 1 is the max
>>   sata_nv: set host can_queue count appropriately
>>   libata: add extra internal command
> 
> Replying here because I can't find the original mail.
> 
> The above commit is causing one of my machines to constantly spew ata
> messages on the console, according to bisect:
> 
> # first bad commit: [28361c403683c2b00d4f5e76045f3ccd299bf99d] libata: add 
> extra internal command
> 
> To get it to boot I have to also apply:
> 
>   88e10092f6a6 ("sata_fsl: use the right type for tag bitshift")
> 
> 
> The system boots OK and seems fine, except that it's just printing
> multiple of these per second:
> 
>   ata2: Signature Update detected @ 0 msecs
>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>   ata2.00: configured for UDMA/100
>   ata2: Signature Update detected @ 0 msecs
>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>   ata2.00: configured for UDMA/100
>   ata2: Signature Update detected @ 0 msecs
>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>   ata2.00: configured for UDMA/100
>   ata2: Signature Update detected @ 0 msecs
>   ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>   ata2.00: configured for UDMA/100
>   ata2: Signature Update detected @ 0 msecs
> 
> And it never seems to stop.
> 
> The machine is a Freescale/NXP P5020ds, using the sata_fsl driver
> presumably. Any ideas?

Hmm that's odd. Can you include the boot log from a working boot as
well? Would be nice to see what devices are on the sata adapter.
The above just looks like a hardreset loop.

-- 
Jens Axboe

Re: Fwd: [powerpc/Baremetal]Kernel OOPS while executing memory hotplug on Power8 baremetal

2018-06-07 Thread Jens Axboe

On 6/7/18 8:45 AM, Jens Axboe wrote:
> On 6/7/18 4:37 AM, vrbagal1 wrote:
>> On 2018-06-07 13:12, Bart Van Assche wrote:
>>> On Thu, 2018-06-07 at 12:56 +0530, Venkat Rao B wrote:
>>>> On Thursday 07 June 2018 12:46 PM, Bart Van Assche wrote:
>>>>> On Thu, 2018-06-07 at 12:38 +0530, vrbagal1 wrote:
>>>>>> Observing Kernel oops and machine reboots while executing memory hotplug
>>>>>> test case, on Power8 Baremetal machine.
>>>>>>
>>>>>> I see this is introduced some where between rc6 and 4.17.
>>>>>
>>>>> Please provide the exact versions (git commit IDs) of the kernel versions
>>>>> you have tested.
>>>>
>>>> Commit Id ---> 5037be168f
>>>
>>> The reason I was asking for the commit ID is because I saw that 
>>> clone_endio()
>>> occurs in the oops which means that the dm driver is involved. An 
>>> important fix
>>> for the dm driver went upstream recently, namely d37753540568 ("dm: Use 
>>> kzalloc
>>> for all structs with embedded biosets/mempools"). Can you double check 
>>> whether
>>> that commit it present in your tree? If it is not present, please 
>>> update to the
>>> latest master and retest. If it is present, please report how to 
>>> reproduce
>>> this oops to Kent Overstreet, Jens Axboe, linux-block and Mike Snitzer.
>>>
>>> Thanks,
>>>
>>> Bart.
>>
>>
>> Yes, the fix is present in the tree, which I have tested.
>>
>> Steps to reproduce:
>>
>> Step1: Clone and Install avocado git clone 
>> https://github.com/avocado-framework/avocado.git
>> Step2: Clone 
>> https://github.com/avocado-framework-tests/avocado-misc-tests.git
>> Test case is 
>> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/memhotplug.py
>> Step3: Command to run the test is avocado run 
>> avocado-misc-tests/memory/memhotplug.py
> 
> Can you try with the below? Not a fully formed fix since I'd prefer
> if the dm bioset copy stuff was changed instead, but worth a shot.

This is closer to an actual fix, please try that instead.


diff --git a/block/bio.c b/block/bio.c
index 595663e0281a..0616d86b15c6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1967,6 +1967,21 @@ int bioset_init(struct bio_set *bs,
 }
 EXPORT_SYMBOL(bioset_init);
 
+int bioset_init_from_src(struct bio_set *new, struct bio_set *src)
+{
+   unsigned int pool_size = src->bio_pool.min_nr;
+   int flags;
+
+   flags = 0;
+   if (src->bvec_pool.min_nr)
+   flags |= BIOSET_NEED_BVECS;
+   if (src->rescue_workqueue)
+   flags |= BIOSET_NEED_RESCUER;
+
+   return bioset_init(new, pool_size, src->front_pad, flags);
+}
+EXPORT_SYMBOL(bioset_init_from_src);
+
 #ifdef CONFIG_BLK_CGROUP
 
 /**
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 98dff36b89a3..20a8d63754bf 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1953,9 +1953,10 @@ static void free_dev(struct mapped_device *md)
kvfree(md);
 }
 
-static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
+static int __bind_mempools(struct mapped_device *md, struct dm_table *t)
 {
struct dm_md_mempools *p = dm_table_get_md_mempools(t);
+   int ret = 0;
 
if (dm_table_bio_based(t)) {
/*
@@ -1982,13 +1983,16 @@ static void __bind_mempools(struct mapped_device *md, 
struct dm_table *t)
   bioset_initialized(>bs) ||
   bioset_initialized(>io_bs));
 
-   md->bs = p->bs;
-   memset(>bs, 0, sizeof(p->bs));
-   md->io_bs = p->io_bs;
-   memset(>io_bs, 0, sizeof(p->io_bs));
+   ret = bioset_init_from_src(>bs, >bs);
+   if (ret)
+   goto out;
+   ret = bioset_init_from_src(>io_bs, >io_bs);
+   if (ret)
+   bioset_exit(>bs);
 out:
/* mempool bind completed, no longer need any mempools in the table */
dm_table_free_md_mempools(t);
+   return ret;
 }
 
 /*
@@ -2033,6 +2037,7 @@ static struct dm_table *__bind(struct mapped_device *md, 
struct dm_table *t,
struct request_queue *q = md->queue;
bool request_based = dm_table_request_based(t);
sector_t size;
+   int ret;
 
lockdep_assert_held(>suspend_lock);
 
@@ -2068,7 +2073,11 @@ static struct dm_table *__bind(struct mapped_device *md, 
struct dm_table *t,
md->immutable_target = dm_table_get_immutable_target(t);
}
 
-   __bind_mempools(md, t);
+   ret = __bind_mempools(md, t);
+   if (ret) {
+   old_map = ERR_PTR(ret);
+   g

Re: Fwd: [powerpc/Baremetal]Kernel OOPS while executing memory hotplug on Power8 baremetal

2018-06-07 Thread Jens Axboe

On 6/7/18 4:37 AM, vrbagal1 wrote:
> On 2018-06-07 13:12, Bart Van Assche wrote:
>> On Thu, 2018-06-07 at 12:56 +0530, Venkat Rao B wrote:
>>> On Thursday 07 June 2018 12:46 PM, Bart Van Assche wrote:
>>>> On Thu, 2018-06-07 at 12:38 +0530, vrbagal1 wrote:
>>>>> Observing Kernel oops and machine reboots while executing memory hotplug
>>>>> test case, on Power8 Baremetal machine.
>>>>>
>>>>> I see this is introduced some where between rc6 and 4.17.
>>>>
>>>> Please provide the exact versions (git commit IDs) of the kernel versions
>>>> you have tested.
>>>
>>> Commit Id ---> 5037be168f
>>
>> The reason I was asking for the commit ID is because I saw that 
>> clone_endio()
>> occurs in the oops which means that the dm driver is involved. An 
>> important fix
>> for the dm driver went upstream recently, namely d37753540568 ("dm: Use 
>> kzalloc
>> for all structs with embedded biosets/mempools"). Can you double check 
>> whether
>> that commit it present in your tree? If it is not present, please 
>> update to the
>> latest master and retest. If it is present, please report how to 
>> reproduce
>> this oops to Kent Overstreet, Jens Axboe, linux-block and Mike Snitzer.
>>
>> Thanks,
>>
>> Bart.
> 
> 
> Yes, the fix is present in the tree, which I have tested.
> 
> Steps to reproduce:
> 
> Step1: Clone and Install avocado git clone 
> https://github.com/avocado-framework/avocado.git
> Step2: Clone 
> https://github.com/avocado-framework-tests/avocado-misc-tests.git
> Test case is 
> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/memhotplug.py
> Step3: Command to run the test is avocado run 
> avocado-misc-tests/memory/memhotplug.py

Can you try with the below? Not a fully formed fix since I'd prefer
if the dm bioset copy stuff was changed instead, but worth a shot.


diff --git a/block/bio.c b/block/bio.c
index 595663e0281a..45bdee67d28b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1967,6 +1967,27 @@ int bioset_init(struct bio_set *bs,
 }
 EXPORT_SYMBOL(bioset_init);
 
+void bioset_move(struct bio_set *dst, struct bio_set *src)
+{
+   dst->bio_slab = src->bio_slab;
+   dst->front_pad = src->front_pad;
+   mempool_move(>bio_pool, >bio_pool);
+   mempool_move(>bvec_pool, >bvec_pool);
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+   mempool_move(>bio_integrity_pool, >bio_integrity_pool);
+   mempool_move(>bvec_integrity_pool, >bvec_integrity_pool);
+#endif
+   BUG_ON(!bio_list_empty(>rescue_list));
+   BUG_ON(work_pending(>rescue_work));
+   spin_lock_init(>rescue_lock);
+   bio_list_init(>rescue_list);
+   INIT_WORK(>rescue_work, bio_alloc_rescue);
+   dst->rescue_workqueue = src->rescue_workqueue;
+
+   memset(src, 0, sizeof(*src));
+}
+EXPORT_SYMBOL(bioset_move);
+
 #ifdef CONFIG_BLK_CGROUP
 
 /**
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 98dff36b89a3..87f636815baf 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1982,10 +1982,8 @@ static void __bind_mempools(struct mapped_device *md, 
struct dm_table *t)
   bioset_initialized(>bs) ||
   bioset_initialized(>io_bs));
 
-   md->bs = p->bs;
-   memset(>bs, 0, sizeof(p->bs));
-   md->io_bs = p->io_bs;
-   memset(>io_bs, 0, sizeof(p->io_bs));
+   bioset_move(>bs, >bs);
+   bioset_move(>io_bs, >io_bs);
 out:
/* mempool bind completed, no longer need any mempools in the table */
dm_table_free_md_mempools(t);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 810a8bee8f85..7581231dd0a3 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -417,6 +417,7 @@ enum {
 extern int bioset_init(struct bio_set *, unsigned int, unsigned int, int 
flags);
 extern void bioset_exit(struct bio_set *);
 extern int biovec_init_pool(mempool_t *pool, int pool_entries);
+extern void bioset_move(struct bio_set *dst, struct bio_set *src);
 
 extern struct bio *bio_alloc_bioset(gfp_t, unsigned int, struct bio_set *);
 extern void bio_put(struct bio *);
diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 0c964ac107c2..20818919180c 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -47,6 +47,7 @@ extern int mempool_resize(mempool_t *pool, int new_min_nr);
 extern void mempool_destroy(mempool_t *pool);
 extern void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) __malloc;
 extern void mempool_free(void *element, mempool_t *pool);
+extern void mempool_move(mempool_t *dst, mempool_t *src);
 
 /*
  * A mempool_alloc_t and mempool_free_t that get the memory from
diff --git a/mm

Re: make a few block drivers highmem safe

2018-05-11 Thread Jens Axboe

On 5/9/18 7:59 AM, Christoph Hellwig wrote:
> Hi all,
> 
> this series converts a few random block drivers to be highmem safe,
> in preparation of eventually getting rid of the block layer bounce
> buffering support.

Applied, thanks.

-- 
Jens Axboe

Re: make a few block drivers highmem safe

2018-05-09 Thread Jens Axboe

On 5/9/18 7:59 AM, Christoph Hellwig wrote:
> Hi all,
> 
> this series converts a few random block drivers to be highmem safe,
> in preparation of eventually getting rid of the block layer bounce
> buffering support.

Series looks good to me.

-- 
Jens Axboe

Re: [linux-next][XFS][trinity] WARNING: CPU: 32 PID: 31369 at fs/iomap.c:993

2017-09-18 Thread Jens Axboe

On 09/18/2017 09:43 AM, Al Viro wrote:
> On Mon, Sep 18, 2017 at 05:39:47PM +0200, Christoph Hellwig wrote:
>> On Mon, Sep 18, 2017 at 09:28:55AM -0600, Jens Axboe wrote:
>>> If it's expected, why don't we kill the WARN_ON_ONCE()? I get it all
>>> the time running xfstests as well.
>>
>> Dave insisted on it to decourage users/applications from mixing
>> mmap and direct I/O.
>>
>> In many ways a tracepoint might be the better way to diagnose these.
> 
> sysctl suppressing those two, perhaps?

I'd rather just make it a trace point, but don't care too much.

The code doesn't even have a comment as to why that WARN_ON() is
there or expected. Seems pretty sloppy to me, not a great way
to "discourage" users to mix mmap/dio.

-- 
Jens Axboe

Re: [linux-next][XFS][trinity] WARNING: CPU: 32 PID: 31369 at fs/iomap.c:993

2017-09-18 Thread Jens Axboe

On 09/18/2017 09:27 AM, Christoph Hellwig wrote:
> On Mon, Sep 18, 2017 at 08:26:05PM +0530, Abdul Haleem wrote:
>> Hi,
>>
>> A warning is triggered from:
>>
>> file fs/iomap.c in function iomap_dio_rw
>>
>> if (ret)
>> goto out_free_dio;
>>
>> ret = invalidate_inode_pages2_range(mapping,
>> start >> PAGE_SHIFT, end >> PAGE_SHIFT);
>>>>  WARN_ON_ONCE(ret);
>> ret = 0;
>>
>> inode_dio_begin(inode);
> 
> This is expected and an indication of a problematic workload - which
> may be triggered by a fuzzer.

If it's expected, why don't we kill the WARN_ON_ONCE()? I get it all
the time running xfstests as well.

-- 
Jens Axboe

Re: [PATCH 0/3] Minor updates for PS3

2017-08-08 Thread Jens Axboe

On 08/08/2017 04:16 AM, Michael Ellerman wrote:
> Geoff Levand <ge...@infradead.org> writes:
> 
>> Hi Michael,
>>
>> A few very minor updates for PS3.  Please apply.
> 
> Jens do you want to take the block ones, or should I just take the lot?

Up to you, I'm fine either way.

-- 
Jens Axboe

Re: blk_mq_sched_insert_request: inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage

2017-08-01 Thread Jens Axboe

On 08/01/2017 12:55 AM, Michael Ellerman wrote:
> Jens Axboe <ax...@kernel.dk> writes:
> ...
>>
>> Can you try the below fix? Should be more palatable than the previous
>> one. Brian, maybe you can take a look at the IRQ issue mentioned above?
> 
> Given the patch from Brian fixed the lockdep warning, do you still want
> me to try and test this one?

Nope, we don't have to do that. I'd much rather just add a WARN_ON()
or similar to make sure we catch buggy users earlier. scsi_run_queue()
needs a

WARN_ON(in_interrupt());

but it might be better to put that in __blk_mq_run_hw_queue().

-- 
Jens Axboe

Re: blk_mq_sched_insert_request: inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage

2017-07-28 Thread Jens Axboe

On 07/28/2017 09:13 AM, Bart Van Assche wrote:
> On Fri, 2017-07-28 at 08:25 -0600, Jens Axboe wrote:
>> On 07/28/2017 12:19 AM, Michael Ellerman wrote:
>>> OK, so the resolution is "fix it in IPR" ?
>>
>> I'll leave that to the SCSI crew. But at least one bug is in IPR, if you
>> look at the call trace:
>>
>> - timer function triggers, runs ipr_reset_timer_done(), which grabs the
>>   host lock AND disables interrupts.
>> - further down in the call path, ipr_ioa_bringdown_done() uncondtionally
>>   enables interrupts:
>>
>> spin_unlock_irq(ioa_cfg->host->host_lock);
>> scsi_unblock_requests(ioa_cfg->host);
>> spin_lock_irq(ioa_cfg->host->host_lock); 
>>
>> And the call to scsi_unblock_requests() is the one that ultimately runs
>> the queue. The IRQ issue aside here, scsi_unblock_requests() could run
>> the queue async, and we could retain the normal sync run otherwise.
>>
>> Can you try the below fix? Should be more palatable than the previous
>> one. Brian, maybe you can take a look at the IRQ issue mentioned above?
>>
>> [ ... ]
> 
> Hello Jens,
> 
> Are there other block drivers that can call blk_mq_start_hw_queues()
> from interrupt context? I'm currently working on converting the skd
> driver (drivers/block/skd_main.c) from a single queue block driver
> into a scsi-mq driver. The skd driver calls blk_start_queue() from
> interrupt context. As we know it is not safe to call
> blk_mq_start_hw_queues() from interrupt context.  Can you recommend me
> how I should proceed: should I implement a solution in the skd driver
> or should perhaps the blk-mq core be modified?

Great that you a converting that driver! If there's a need for it, we
could always expose the sync/async need in blk_mq_start_hw_queues().
>From a quick look at the driver, it's using start queue very liberally.
Would probably make sense to see which ones of those are actually
needed. For resource management, we've got better interfaces on the
blk-mq side, for instance.

Since this is a conversion, might make sense to not modify
blk_mq_start_hw_queues() and simply provide an alternative
blk_mq_start_hw_queues_async(). That will keep the conversion straight
forward. Then the next step could be to fixup skd, and then we could
drop the _async() variant again, hopefully.

-- 
Jens Axboe

Re: blk_mq_sched_insert_request: inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage

2017-07-28 Thread Jens Axboe

On 07/28/2017 12:19 AM, Michael Ellerman wrote:
> Jens Axboe <ax...@kernel.dk> writes:
>> On 07/27/2017 08:47 AM, Bart Van Assche wrote:
>>> On Thu, 2017-07-27 at 08:02 -0600, Jens Axboe wrote:
>>>> The bug looks like SCSI running the queue inline from IRQ
>>>> context, that's not a good idea.
> ...
>>>
>>> scsi_run_queue() works fine if no scheduler is configured. Additionally, 
>>> that
>>> code predates the introduction of blk-mq I/O schedulers. I think it is
>>> nontrivial for block driver authors to figure out that a queue has to be run
>>> from process context if a scheduler has been configured that does not 
>>> support
>>> to be run from interrupt context.
>>
>> No it doesn't, you could never run the queue from interrupt context with
>> async == false. So I don't think that's confusing at all, you should
>> always be aware of the context.
>>
>>> How about adding WARN_ON_ONCE(in_interrupt()) to
>>> blk_mq_start_hw_queue() or replacing the above patch by the following:
>>
>> No, I hate having dependencies like that, because they always just catch
>> one of them. Looks like the IPR path that hits this should just offload
>> to a workqueue or similar, you don't have to make any scsi_run_queue()
>> async.
> 
> OK, so the resolution is "fix it in IPR" ?

I'll leave that to the SCSI crew. But at least one bug is in IPR, if you
look at the call trace:

- timer function triggers, runs ipr_reset_timer_done(), which grabs the
  host lock AND disables interrupts.
- further down in the call path, ipr_ioa_bringdown_done() uncondtionally
  enables interrupts:

spin_unlock_irq(ioa_cfg->host->host_lock);
scsi_unblock_requests(ioa_cfg->host);
spin_lock_irq(ioa_cfg->host->host_lock); 

And the call to scsi_unblock_requests() is the one that ultimately runs
the queue. The IRQ issue aside here, scsi_unblock_requests() could run
the queue async, and we could retain the normal sync run otherwise.

Can you try the below fix? Should be more palatable than the previous
one. Brian, maybe you can take a look at the IRQ issue mentioned above?

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..dfb89596af81 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -481,13 +481,14 @@ static void scsi_starved_list_run(struct Scsi_Host *shost)
  * Purpose:Select a proper request queue to serve next
  *
  * Arguments:  q   - last request's queue
+ * async   - run queues async, if we need to
  *
  * Returns: Nothing
  *
  * Notes:  The previous command was completely finished, start
  * a new one if possible.
  */
-static void scsi_run_queue(struct request_queue *q)
+static void scsi_run_queue(struct request_queue *q, bool async)
 {
struct scsi_device *sdev = q->queuedata;
 
@@ -497,7 +498,7 @@ static void scsi_run_queue(struct request_queue *q)
scsi_starved_list_run(sdev->host);
 
if (q->mq_ops)
-   blk_mq_run_hw_queues(q, false);
+   blk_mq_run_hw_queues(q, async);
else
blk_run_queue(q);
 }
@@ -509,7 +510,7 @@ void scsi_requeue_run_queue(struct work_struct *work)
 
sdev = container_of(work, struct scsi_device, requeue_work);
q = sdev->request_queue;
-   scsi_run_queue(q);
+   scsi_run_queue(q, false);
 }
 
 /*
@@ -543,17 +544,22 @@ static void scsi_requeue_command(struct request_queue *q, 
struct scsi_cmnd *cmd)
blk_requeue_request(q, req);
spin_unlock_irqrestore(q->queue_lock, flags);
 
-   scsi_run_queue(q);
+   scsi_run_queue(q, true);
 
put_device(>sdev_gendev);
 }
 
-void scsi_run_host_queues(struct Scsi_Host *shost)
+static void __scsi_run_host_queues(struct Scsi_Host *shost, bool async)
 {
struct scsi_device *sdev;
 
shost_for_each_device(sdev, shost)
-   scsi_run_queue(sdev->request_queue);
+   scsi_run_queue(sdev->request_queue, async);
+}
+
+void scsi_run_host_queues(struct Scsi_Host *shost)
+{
+   __scsi_run_host_queues(shost, false);
 }
 
 static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
@@ -671,7 +677,7 @@ static bool scsi_end_request(struct request *req, 
blk_status_t error,
blk_finish_request(req, error);
spin_unlock_irqrestore(q->queue_lock, flags);
 
-   scsi_run_queue(q);
+   scsi_run_queue(q, false);
}
 
put_device(>sdev_gendev);
@@ -2293,7 +2299,7 @@ EXPORT_SYMBOL(scsi_block_requests);
 void scsi_unblock_requests(struct Scsi_Host *shost)
 {
shost->host_self_blocked = 0;
-   scsi_run_host_queues(shost);
+   __scsi_run_host_queues(shost, true);
 }
 EX

Re: blk_mq_sched_insert_request: inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage

2017-07-27 Thread Jens Axboe

On 07/27/2017 08:47 AM, Bart Van Assche wrote:
> On Thu, 2017-07-27 at 08:02 -0600, Jens Axboe wrote:
>> The bug looks like SCSI running the queue inline from IRQ
>> context, that's not a good idea. Can you confirm the below works for
>> you?
>>
>>
>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>> index f6097b89d5d3..78740ebf966c 100644
>> --- a/drivers/scsi/scsi_lib.c
>> +++ b/drivers/scsi/scsi_lib.c
>> @@ -497,7 +497,7 @@ static void scsi_run_queue(struct request_queue *q)
>>  scsi_starved_list_run(sdev->host);
>>  
>>  if (q->mq_ops)
>> -blk_mq_run_hw_queues(q, false);
>> +blk_mq_run_hw_queues(q, true);
>>  else
>>  blk_run_queue(q);
>>  }
> 
> Hello Jens,
> 
> scsi_run_queue() works fine if no scheduler is configured. Additionally, that
> code predates the introduction of blk-mq I/O schedulers. I think it is
> nontrivial for block driver authors to figure out that a queue has to be run
> from process context if a scheduler has been configured that does not support
> to be run from interrupt context.

No it doesn't, you could never run the queue from interrupt context with
async == false. So I don't think that's confusing at all, you should
always be aware of the context.

> How about adding WARN_ON_ONCE(in_interrupt()) to
> blk_mq_start_hw_queue() or replacing the above patch by the following:

No, I hate having dependencies like that, because they always just catch
one of them. Looks like the IPR path that hits this should just offload
to a workqueue or similar, you don't have to make any scsi_run_queue()
async.

-- 
Jens Axboe

Re: blk_mq_sched_insert_request: inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage

2017-07-27 Thread Jens Axboe

ivers/scsi/scsi_lib.c
index f6097b89d5d3..78740ebf966c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -497,7 +497,7 @@ static void scsi_run_queue(struct request_queue *q)
scsi_starved_list_run(sdev->host);
 
if (q->mq_ops)
-   blk_mq_run_hw_queues(q, false);
+   blk_mq_run_hw_queues(q, true);
else
blk_run_queue(q);
 }

-- 
Jens Axboe


-- 
Jens Axboe

Re: linux-next: build failure after merge of the block tree

2017-06-28 Thread Jens Axboe

On 06/28/2017 08:01 AM, Jens Axboe wrote:
> On 06/28/2017 06:43 AM, Jens Axboe wrote:
>> On 06/28/2017 02:04 AM, Stephen Rothwell wrote:
>>> Hi Jens,
>>>
>>> After merging the block tree, today's linux-next build (powerpc
>>> allnoconfig) failed like this:
>>>
>>> fs/fcntl.o: In function `do_fcntl': 
>>> fcntl.c:(.text+0x6d4): undefined reference to `__get_user_bad'
>>> fcntl.c:(.text+0x730): undefined reference to `__get_user_bad'
>>>
>>> Probably caused by commit
>>>
>>>   c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life 
>>> time hints")
>>>
>>> On powerpc (at least) you cannot use get_user() to fetch anything larger
>>> than "unsigned long" i.e. 32 bits on 32 bit powerpc.
>>>
>>> This has been discussed before (and, I think, a fix attempted).
>>
>> Gah, thanks for letting me know. I'll test your patch and queue it
>> up to fix this issue.
> 
> But put_user() is fine? Just checking here, since the change adds
> both a u64 put and get user.

I just changed all 4, at least that provides some symmetry in how
we copy things in and out for that set of fcntls.

-- 
Jens Axboe

Re: linux-next: build failure after merge of the block tree

2017-06-28 Thread Jens Axboe

On 06/28/2017 06:43 AM, Jens Axboe wrote:
> On 06/28/2017 02:04 AM, Stephen Rothwell wrote:
>> Hi Jens,
>>
>> After merging the block tree, today's linux-next build (powerpc
>> allnoconfig) failed like this:
>>
>> fs/fcntl.o: In function `do_fcntl': 
>> fcntl.c:(.text+0x6d4): undefined reference to `__get_user_bad'
>> fcntl.c:(.text+0x730): undefined reference to `__get_user_bad'
>>
>> Probably caused by commit
>>
>>   c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life 
>> time hints")
>>
>> On powerpc (at least) you cannot use get_user() to fetch anything larger
>> than "unsigned long" i.e. 32 bits on 32 bit powerpc.
>>
>> This has been discussed before (and, I think, a fix attempted).
> 
> Gah, thanks for letting me know. I'll test your patch and queue it
> up to fix this issue.

But put_user() is fine? Just checking here, since the change adds
both a u64 put and get user.

-- 
Jens Axboe

Re: linux-next: build failure after merge of the block tree

2017-06-28 Thread Jens Axboe

On 06/28/2017 02:04 AM, Stephen Rothwell wrote:
> Hi Jens,
> 
> After merging the block tree, today's linux-next build (powerpc
> allnoconfig) failed like this:
> 
> fs/fcntl.o: In function `do_fcntl': 
> fcntl.c:(.text+0x6d4): undefined reference to `__get_user_bad'
> fcntl.c:(.text+0x730): undefined reference to `__get_user_bad'
> 
> Probably caused by commit
> 
>   c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life 
> time hints")
> 
> On powerpc (at least) you cannot use get_user() to fetch anything larger
> than "unsigned long" i.e. 32 bits on 32 bit powerpc.
> 
> This has been discussed before (and, I think, a fix attempted).

Gah, thanks for letting me know. I'll test your patch and queue it
up to fix this issue.

-- 
Jens Axboe

Re: [linux-next][bock] WARNING: CPU: 22 PID: 0 at block/blk-core.c:2655 .blk_update_request+0x4f8/0x500

2017-05-08 Thread Jens Axboe

On 05/08/2017 01:13 AM, Abdul Haleem wrote:
> On Fri, 2017-05-05 at 08:02 -0600, Jens Axboe wrote:
>> On 05/05/2017 12:25 AM, Abdul Haleem wrote:
>>> Hi,
>>>
>>> 4.11.0 Linus mainline booted with Warnings on PowerPC.
>>>
>>> We did not see this on next-20170407 but on next-20170410 and later.
>>
>> Have you tried current Linus -git? Both of the -next versions you list
>> are rather old.
>>
> 
> Hi Jens, 
> 
> Warning is still seen with next-20170505 and also with today's mainline.
> 
> It was first seen on next-20170410, so the last good was next-20170407.

The log between the known good and first bad version, condensed a bit for
primary suspects, is below.

Christoph Hellwig (4):
  sd: split sd_setup_discard_cmnd
  sd: implement REQ_OP_WRITE_ZEROES
  sd: implement unmapping Write Zeroes
  block: remove the discard_zeroes_data flag

Martin K. Petersen (2):
  scsi: sd: Separate zeroout and discard command choices
  scsi: sd: Remove LBPRZ dependency for discards

Christoph Hellwig (7):
  block: implement splitting of REQ_OP_WRITE_ZEROES bios
  block: stop using blkdev_issue_write_same for zeroing
  block: add a flags argument to (__)blkdev_issue_zeroout
  block: add a REQ_NOUNMAP flag for REQ_OP_WRITE_ZEROES
  block: add a new BLKDEV_ZERO_NOFALLBACK flag
  block: stop using discards for zeroing
  block: remove the discard_zeroes_data flag

Christoph, Martin - any ideas? Trace from Abdul below.

WARNING: CPU: 12 PID: 0 at block/blk-core.c:2651 .blk_update_request+0x4cc/0x4e0
Modules linked in: sg(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) 
sunrpc(E) binfmt_misc(E) ip_tables(E) ext4(E) mbcache(E) jbd2(E) sd_mod(E) 
ibmvscsi(E) scsi_transport_srp(E) ibmveth(E)
CPU: 12 PID: 0 Comm: swapper/12 Tainted: GE   4.11.0-autotest #1
task: c009f455ee80 task.stack: c009fb2e8000
NIP: c050bd1c LR: c050b8ec CTR: c05114b0
REGS: c013fff73740 TRAP: 0700   Tainted: GE(4.11.0-autotest)
MSR: 80029032 <SF,EE,ME,IR,DR,RI>
  CR: 48042048  XER: 0001
CFAR: c050bb34 SOFTE: 1 
GPR00: c050b8ec c013fff739c0 c1389c00 c009eca9c800
GPR04:   0001 0060 
GPR08: 00067887  c009eca9c800 de5f7e30 
GPR12: 88044044 ce9f6c00 c009fb2ebf90 00200042 
GPR16: 9367 c013fff7  c0df4100 
GPR20: c13c3b00 c0df4100  0005 
GPR24: 2ee0 c17789f8   
GPR28:  c38ba400  c009eca9c800 
NIP [c050bd1c] .blk_update_request+0x4cc/0x4e0
LR [c050b8ec] .blk_update_request+0x9c/0x4e0
Call Trace:
[c013fff739c0] [c050b8ec] .blk_update_request+0x9c/0x4e0 
(unreliable)
[c013fff73a60] [c06b06fc] .scsi_end_request+0x4c/0x240
[c013fff73b10] [c06b4564] .scsi_io_completion+0x1d4/0x6c0
[c013fff73be0] [c06a8cd0] .scsi_finish_command+0x100/0x1b0
[c013fff73c70] [c06b3978] .scsi_softirq_done+0x188/0x1e0
[c013fff73d00] [c0516b44] .blk_done_softirq+0xc4/0xf0
[c013fff73d90] [c00daef8] .__do_softirq+0x158/0x3b0
[c013fff73e90] [c00db5b8] .irq_exit+0x1a8/0x1c0
[c013fff73f10] [c0014f84] .__do_irq+0x94/0x1f0
[c013fff73f90] [c0026cbc] .call_do_irq+0x14/0x24
[c009fb2eb7f0] [c001516c] .do_IRQ+0x8c/0x100
[c009fb2eb890] [c0008bf4] hardware_interrupt_common+0x114/0x120
--- interrupt: 501 at .plpar_hcall_norets+0x14/0x20
LR = .check_and_cede_processor+0x24/0x40
[c009fb2ebb80] [0002] 0x2 (unreliable)
[c009fb2ebbf0] [c07c360c] .dedicated_cede_loop+0x4c/0x150
[c009fb2ebc70] [c07c1040] .cpuidle_enter_state+0xb0/0x3b0
[c009fb2ebd20] [c012d1bc] .call_cpuidle+0x3c/0x70
[c009fb2ebd90] [c012d550] .do_idle+0x280/0x2e0
[c009fb2ebe50] [c012d768] .cpu_startup_entry+0x28/0x40
[c009fb2ebed0] [c00428a4] .start_secondary+0x304/0x350
[c009fb2ebf90] [c000aa6c] start_secondary_prolog+0x10/0x14
Instruction dump:
3f82ff90 3b9cc190 4bfffd8c 3f82ff90 3b9cc1a8 4bfffd80 61290040 b13f0018
4bfffbd4 3cc2ff8b 38c63160 4bfffd9c <0fe0> 4bfffe18 6000 6000 
---[ end trace 0f80359f8fb9c5f4 ]---
EXT4-fs (sda3): Delayed block allocation failed for inode 11011467 at logical 
offset 0 with max blocks 7 with error 121
EXT4-fs (sda3): This should not happen!! Data will be lost

-- 
Jens Axboe

Re: [linux-next][bock] WARNING: CPU: 22 PID: 0 at block/blk-core.c:2655 .blk_update_request+0x4f8/0x500

2017-05-05 Thread Jens Axboe

On 05/05/2017 12:25 AM, Abdul Haleem wrote:
> Hi,
> 
> 4.11.0 Linus mainline booted with Warnings on PowerPC.
> 
> We did not see this on next-20170407 but on next-20170410 and later.

Have you tried current Linus -git? Both of the -next versions you list
are rather old.

-- 
Jens Axboe

Re: 4.11.0-rc1 boot resulted in WARNING: CPU: 14 PID: 1722 at fs/sysfs/dir.c:31 .sysfs_warn_dup+0x78/0xb0

2017-03-11 Thread Jens Axboe

On 03/09/2017 05:59 AM, Brian Foster wrote:
> cc linux-block
> 
> On Thu, Mar 09, 2017 at 04:20:06PM +0530, Abdul Haleem wrote:
>> On Wed, 2017-03-08 at 08:17 -0500, Brian Foster wrote:
>>> On Tue, Mar 07, 2017 at 10:01:04PM +0530, Abdul Haleem wrote:
>>>>
>>>> Hi,
>>>>
>>>> Today's mainline (4.11.0-rc1) booted with warnings on Power7 LPAR.
>>>>
>>>> Issue is not reproducible all the time.

Is that still the case with -git as of yesterday? Check that you
have this merge:

34bbce9e344b47e8871273409632f525973afad4

in your tree.

-- 
Jens Axboe

Re: [PATCH] axonram: Fix gendisk handling

2017-03-08 Thread Jens Axboe

On 03/08/2017 06:56 AM, Jan Kara wrote:
> It is invalid to call del_gendisk() when disk->queue is NULL. Fix error
> handling in axon_ram_probe() to avoid doing that.
> 
> Also del_gendisk() does not drop a reference to gendisk allocated by
> alloc_disk(). That has to be done by put_disk(). Add that call where
> needed.

Applied, thanks.

-- 
Jens Axboe

Re: [PATCH] direct-io: don't introduce another read of inode->i_blkbits

2017-01-10 Thread Jens Axboe

On 01/09/2017 10:27 PM, Chandan Rajendra wrote:
> On Monday, January 09, 2017 04:42:58 PM Jeff Moyer wrote:
>> Commit 20ce44d545844 ("do_direct_IO: Use inode->i_blkbits to compute
>> block count to be cleaned") introduced a regression: if the block size
>> of the block device is changed while a direct I/O request is being
>> setup, it can result in a panic.  See commit ab73857e354ab ("direct-io:
>> don't read inode->i_blkbits multiple times") for the reasoning, and
>> commit b87570f5d3496 ("Fix a crash when block device is read and block
>> size is changed at the same time") for a more detailed problem
>> description and reproducer.
>>
>> Fixes: 20ce44d545844
>> Signed-off-by: Jeff Moyer <jmo...@redhat.com>
>>
>> ---
>> Chandan, can you please test this to ensure this still fixes your problem?
> 
> This patch fixes the failure,
> 
> Tested-by: Chandan Rajendra <chan...@linux.vnet.ibm.com>
> 

I've updated the patch, thanks guys.

-- 
Jens Axboe

Re: [PATCH] direct-io: don't introduce another read of inode->i_blkbits

2017-01-09 Thread Jens Axboe

On 01/09/2017 02:42 PM, Jeff Moyer wrote:
> Commit 20ce44d545844 ("do_direct_IO: Use inode->i_blkbits to compute
> block count to be cleaned") introduced a regression: if the block size
> of the block device is changed while a direct I/O request is being
> setup, it can result in a panic.  See commit ab73857e354ab ("direct-io:
> don't read inode->i_blkbits multiple times") for the reasoning, and
> commit b87570f5d3496 ("Fix a crash when block device is read and block
> size is changed at the same time") for a more detailed problem
> description and reproducer.
> 
> Fixes: 20ce44d545844
> Signed-off-by: Jeff Moyer <jmo...@redhat.com>
> 
> ---
> Chandan, can you please test this to ensure this still fixes your problem?

Please, that would be great. And if so, then let's fold the two.

-- 
Jens Axboe

Re: [PATCH] do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned

2017-01-09 Thread Jens Axboe

On 01/08/2017 07:47 AM, Chandan Rajendra wrote:
> The code currently uses sdio->blkbits to compute the number of blocks to
> be cleaned. However sdio->blkbits is derived from the logical block size
> of the underlying block device (Refer to the definition of
> do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
> fail when executed on an ext4 filesystem with 64k as the block size and
> when using a virtio based disk (having 512 byte as the logical block
> size) inside a kvm guest.
> 
> This commit fixes the bug by using inode->i_blkbits to compute the
> number of blocks to be cleaned.

Thanks, added for 4.10.

-- 
Jens Axboe

Re: ext4 filesystem corruption with 4.10-rc2 on ppc64le

2017-01-04 Thread Jens Axboe

On 01/04/2017 08:28 AM, Theodore Ts'o wrote:
> On Wed, Jan 04, 2017 at 11:32:42AM +0530, Chandan Rajendra wrote:
>> On Wednesday, January 04, 2017 04:18:08 PM Anton Blanchard wrote:
>>> I'm consistently seeing ext4 filesystem corruption using a mainline
>>> kernel. It doesn't take much to trigger it - download a ppc64le Ubuntu
>>> cloud image, boot it in KVM and run:
>>>
>>> sudo apt-get update
>>> sudo apt-get dist-upgrade
>>> sudo reboot
>>>
>>> And it never makes it back up, dying with rather severe filesystem
>>> corruption.
>>
>> The patch at https://patchwork.kernel.org/patch/9488235/ should fix the
>> bug.
> 
> It looks like this patch is already queued up on the "for-linus"
> branch on the linux-block.git tree.
> 
> Chandra, thanks for pointing this out!  I had missed your e-mail from
> Christmas day, and it was on my todo list to figure out why I was
> seeing lots of 1k block regressions on gce-xfstests post-merge window
> that wasn't showing up on the ext4.git tree before I sent my pull
> request to Linus.
> 
> Jens, could you expedite a pull request to Linus?  This is affecting
> ext4 on 1k block file systems on x86/x86_64, so this is not a ppc-only
> regression.  

Yes, it'll go out this morning.

-- 
Jens Axboe

Re: ext4 filesystem corruption with 4.10-rc2 on ppc64le

2017-01-04 Thread Jens Axboe

On 01/03/2017 10:18 PM, Anton Blanchard wrote:
> Hi,
> 
> I'm consistently seeing ext4 filesystem corruption using a mainline
> kernel. It doesn't take much to trigger it - download a ppc64le Ubuntu
> cloud image, boot it in KVM and run:
> 
> sudo apt-get update
> sudo apt-get dist-upgrade
> sudo reboot
> 
> And it never makes it back up, dying with rather severe filesystem
> corruption.
> 
> I've narrowed it down to:
> 
> 64e1c57fa474 ("ext4: Use clean_bdev_aliases() instead of iteration")
> e64855c6cfaa ("fs: Add helper to clean bdev aliases under a bh and use it")
> ce98321bf7d2 ("fs: Remove unmap_underlying_metadata")
> 
> Backing these patches out fixes the issue.

Fix is going out today, I see Chandan already pointed you at it. For the
other reporter, it's not an LE vs BE thing, it's a fs blocksize < page
size problem.

-- 
Jens Axboe

Re: kernel BUG at drivers/scsi/scsi_lib.c:1096!

2015-12-04 Thread Jens Axboe

 that should fix it. You need commit a88d32af18b8 or newer.


--
Jens Axboe

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: kernel BUG at drivers/scsi/scsi_lib.c:1096!

2015-11-25 Thread Jens Axboe


On 11/25/2015 02:04 AM, Hannes Reinecke wrote:

On 11/20/2015 04:28 PM, Ewan Milne wrote:

On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:

Can't we have a joint effort here?
I've been spending a _LOT_ of time trying to debug things here, but
none of the ideas I've come up with have been able to fix anything.


Yes.  I'm not the one primarily looking at it, and we don't have a
reproducer in-house.  We just have the one dump right now.



I'm almost tempted to increase the count from scsi_alloc_sgtable()
by one and be done with ...



That might not fix it if it is a problem with the merge code, though.


And indeed, it doesn't.
Seems I finally found the culprit.

What happens is this:
We have two paths, with these seg_boundary_masks:

path-1:seg_boundary_mask = 65535,
path-2:seg_boundary_mask = 4294967295,

consequently the DM request queue has this:

md-1:seg_boundary_mask = 65535,

What happens now is that a request is being formatted, and sent
to path 2. During submission req->nr_phys_segments is formatted
with the limits of path 2, arriving at a count of 3.
Now the request gets retried on path 1, but as the NOMERGE request
flag is set req->nr_phys_segments is never updated.
But blk_rq_map_sg() ignores all counters, and just uses the
bi_vec directly, resulting in a count of 4 -> boom.

So the culprit here is the NOMERGE flag, which is evaluated
via
->dm_dispatch_request()
   ->blk_insert_cloned_request()
 ->blk_rq_check_limits()

If the above assessment is correct, the following patch should
fix it:

diff --git a/block/blk-core.c b/block/blk-core.c
index 801ced7..12cccd6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
   */
  int blk_rq_check_limits(struct request_queue *q, struct request *rq)
  {
-   if (!rq_mergeable(rq))
+   if (rq->cmd_type != REQ_TYPE_FS)
 return 0;

 if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
rq->cmd_flags)) {


Mike? Jens?
Can you comment on it?


We only support merging on REQ_TYPE_FS already, so how is the above 
making it any different? In general, NOMERGE being set or not should not 
make a difference. It's only a hint that we need not check further if we 
should be merging on this request, since we already tried it once, found 
we'd exceed various limits, then set NOMERGE to reflect that.



--
Jens Axboe

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: kernel BUG at drivers/scsi/scsi_lib.c:1096!

2015-11-25 Thread Jens Axboe


On 11/25/2015 12:10 PM, Hannes Reinecke wrote:

On 11/25/2015 06:56 PM, Jens Axboe wrote:

On 11/25/2015 02:04 AM, Hannes Reinecke wrote:

On 11/20/2015 04:28 PM, Ewan Milne wrote:

On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:

Can't we have a joint effort here?
I've been spending a _LOT_ of time trying to debug things here, but
none of the ideas I've come up with have been able to fix anything.


Yes.  I'm not the one primarily looking at it, and we don't have a
reproducer in-house.  We just have the one dump right now.



I'm almost tempted to increase the count from scsi_alloc_sgtable()
by one and be done with ...



That might not fix it if it is a problem with the merge code, though.


And indeed, it doesn't.
Seems I finally found the culprit.

What happens is this:
We have two paths, with these seg_boundary_masks:

path-1:seg_boundary_mask = 65535,
path-2:seg_boundary_mask = 4294967295,

consequently the DM request queue has this:

md-1:seg_boundary_mask = 65535,

What happens now is that a request is being formatted, and sent
to path 2. During submission req->nr_phys_segments is formatted
with the limits of path 2, arriving at a count of 3.
Now the request gets retried on path 1, but as the NOMERGE request
flag is set req->nr_phys_segments is never updated.
But blk_rq_map_sg() ignores all counters, and just uses the
bi_vec directly, resulting in a count of 4 -> boom.

So the culprit here is the NOMERGE flag, which is evaluated
via
->dm_dispatch_request()
   ->blk_insert_cloned_request()
 ->blk_rq_check_limits()

If the above assessment is correct, the following patch should
fix it:

diff --git a/block/blk-core.c b/block/blk-core.c
index 801ced7..12cccd6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
   */
  int blk_rq_check_limits(struct request_queue *q, struct request *rq)
  {
-   if (!rq_mergeable(rq))
+   if (rq->cmd_type != REQ_TYPE_FS)
 return 0;

 if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
rq->cmd_flags)) {


Mike? Jens?
Can you comment on it?


We only support merging on REQ_TYPE_FS already, so how is the above
making it any different? In general, NOMERGE being set or not should not
make a difference. It's only a hint that we need not check further if we
should be merging on this request, since we already tried it once, found
we'd exceed various limits, then set NOMERGE to reflect that.


The problem is that NOMERGE does too much, as it inhibits _any_ merging.


Right, that is the point of the flag from the block layer view, where it 
was originally added for the case mentioned.



Unfortunately, the req->nr_phys_segments value is evaluated in the final
_driver_ context _after_ the merging happend; cf
scsi_lib.c:scsi_init_sgtable().
As nr_phys_segments is inherited from the original request (and never
recalculated with the new request queue limits) the following
blk_rq_map_sg() call might end up at a different calculation, especially
after retrying a request on another path.


That all sounds pretty horrible. Why is blk_rq_check_limits() checking 
for mergeable at all? If merging is disabled on the request, I'm 
assuming that's an attempt at an optimization since we know it won't 
change. But that should be tracked separately, like how it's done on the 
bio.



--
Jens Axboe

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 2/3] block/ps3vram: Remove obsolete reference to MTD

2015-06-11 Thread Jens Axboe


On 06/11/2015 07:04 AM, Geert Uytterhoeven wrote:

On Wed, Jun 10, 2015 at 8:00 PM, Geert Uytterhoeven
ge...@linux-m68k.org did not write:

The ps3vram driver is a plain block device driver since commit
f507cd22035fdadd5dbb476dd05e9e7ee21c3b84 (ps3/block: Replace mtd/ps3vram
by block/ps3vram).

Signed-off-by: Geert Uytterhoeven ge...@linux-m68k.org
Signed-off-by: Geoff Levand ge...@infradead.org


For the record: while I did send out this patch back in 2013, I didn't resend it
yesterday.

Geoff:
   1. Please fix your scripts (start using git-send-email?), to avoid sending
  fake email. Thanks!


Indeed, 2/3 of these patches ended up in my spam.


   2. I am grateful you didn't forget about this patch, and that it got applied.


I'll add:

3. Your git base should be the base of where you want this to be pulled 
in, or prior. Yours was 4.1-rc7, and I'm not pulling in tons of 
unrelated changes for 3 patches. If you are targeting the next release 
and don't depend on existing block changes, then your base should be 
4.0. If you are dependent on changes in the block tree, the block tree 
should be your base.


--
Jens Axboe

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 0/3] block/ps3vram: Minor updates and fixes

2015-06-10 Thread Jens Axboe


On 06/10/2015 12:00 PM, Geoff Levand wrote:

Hi Jens,

Here are a few minor updates for the ps3vram driver.  The third patch adds me
as co-maintainer of the driver, which I think is fitting as I have been
maintaining it for the last few years and expect I would be involved in any
future inquiries regarding it.

Please apply, thanks.


Applied to for-4.2/drivers, thanks.

--
Jens Axboe

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

1 2 >

1 - 100 of 153 matches

Mail list logo