Re: [PATCH] raid5-cache: add crc32c Kconfig dependency

2015-11-05 Thread Neil Brown
On Wed, Nov 04 2015, Arnd Bergmann wrote:

> The recent change of the raid5-cache code to use crc32c instead
> of crc32 causes link errors when CONFIG_LIBCRC32C is disabled:
>
> drivers/built-in.o: In function crc32c'
> core.c:(.text+0x1c6060): undefined reference to `crc32c'
>
> This adds an explicit 'select' statement like all other users
> of this function do.
>
> Signed-off-by: Arnd Bergmann 
> Fixes: 5cb2fbd6ea0d ("raid5-cache: use crc32c checksum")
>
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 3e01e6fb3424..7913fdcfc849 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -123,6 +123,7 @@ config MD_RAID456
>   tristate "RAID-4/RAID-5/RAID-6 mode"
>   depends on BLK_DEV_MD
>   select RAID6_PQ
> + select LIBCRC32C
>   select ASYNC_MEMCPY
>   select ASYNC_XOR
>   select ASYNC_PQ

Applied, thanks,

NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] md/raid5: remove redundant check in stripe_add_to_batch_list()

2015-11-05 Thread Neil Brown
On Mon, Nov 02 2015, Roman Gushchin wrote:

> The stripe_add_to_batch_list() function is called only if
> stripe_can_batch() returned true, so there is no need for double check.
>
> Signed-off-by: Roman Gushchin 
> Cc: Neil Brown 
> Cc: linux-r...@vger.kernel.org
> ---
>  drivers/md/raid5.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 45933c1..f829ebc 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -768,8 +768,6 @@ static void stripe_add_to_batch_list(struct r5conf *conf, 
> struct stripe_head *sh
>   int hash;
>   int dd_idx;
>  
> - if (!stripe_can_batch(sh))
> - return;
>   /* Don't cross chunks, so stripe pd_idx/qd_idx is the same */
>   tmp_sec = sh->sector;
>   if (!sector_div(tmp_sec, conf->chunk_sectors))
> -- 
> 2.4.3

applied, thanks.

(for future reference, public mailing lists like linux-raid aren't
usually mentioned on a 'Cc:' line in a patch.
Documentation/SubmittingPatches suggests:

  If a person has had the opportunity to comment on a patch, but has not
  provided such comments, you may optionally add a "Cc:" tag to the patch.
  This is the only tag which might be added without an explicit action by the
  person it names - but it should indicate that this person was copied on the
  patch.  This tag documents that potentially interested parties
  have been included in the discussion.

so it is primarily for third-party individuals
)

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] md/raid5: remove redundant check in stripe_add_to_batch_list()

2015-11-05 Thread Neil Brown
On Mon, Nov 02 2015, Roman Gushchin wrote:

> The stripe_add_to_batch_list() function is called only if
> stripe_can_batch() returned true, so there is no need for double check.
>
> Signed-off-by: Roman Gushchin <kl...@yandex-team.ru>
> Cc: Neil Brown <ne...@suse.com>
> Cc: linux-r...@vger.kernel.org
> ---
>  drivers/md/raid5.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 45933c1..f829ebc 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -768,8 +768,6 @@ static void stripe_add_to_batch_list(struct r5conf *conf, 
> struct stripe_head *sh
>   int hash;
>   int dd_idx;
>  
> - if (!stripe_can_batch(sh))
> - return;
>   /* Don't cross chunks, so stripe pd_idx/qd_idx is the same */
>   tmp_sec = sh->sector;
>   if (!sector_div(tmp_sec, conf->chunk_sectors))
> -- 
> 2.4.3

applied, thanks.

(for future reference, public mailing lists like linux-raid aren't
usually mentioned on a 'Cc:' line in a patch.
Documentation/SubmittingPatches suggests:

  If a person has had the opportunity to comment on a patch, but has not
  provided such comments, you may optionally add a "Cc:" tag to the patch.
  This is the only tag which might be added without an explicit action by the
  person it names - but it should indicate that this person was copied on the
  patch.  This tag documents that potentially interested parties
  have been included in the discussion.

so it is primarily for third-party individuals
)

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] raid5-cache: add crc32c Kconfig dependency

2015-11-05 Thread Neil Brown
On Wed, Nov 04 2015, Arnd Bergmann wrote:

> The recent change of the raid5-cache code to use crc32c instead
> of crc32 causes link errors when CONFIG_LIBCRC32C is disabled:
>
> drivers/built-in.o: In function crc32c'
> core.c:(.text+0x1c6060): undefined reference to `crc32c'
>
> This adds an explicit 'select' statement like all other users
> of this function do.
>
> Signed-off-by: Arnd Bergmann 
> Fixes: 5cb2fbd6ea0d ("raid5-cache: use crc32c checksum")
>
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 3e01e6fb3424..7913fdcfc849 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -123,6 +123,7 @@ config MD_RAID456
>   tristate "RAID-4/RAID-5/RAID-6 mode"
>   depends on BLK_DEV_MD
>   select RAID6_PQ
> + select LIBCRC32C
>   select ASYNC_MEMCPY
>   select ASYNC_XOR
>   select ASYNC_PQ

Applied, thanks,

NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH/RFC] make btrfs subvol mounts appear in /proc/mounts

2015-11-02 Thread Neil Brown
On Tue, Nov 03 2015, Chris Mason wrote:

> On Mon, Nov 02, 2015 at 03:50:12PM -0500, J. Bruce Fields wrote:
>> On Wed, Oct 28, 2015 at 07:25:10AM +0900, Neil Brown wrote:
>> > 
>> > If you create a subvolume in btrfs and access it (by name) without
>> > mounting it, then the subvolume looks like a separate mount to some
>> > extent, returning a different st_dev to stat(), but it doesn't look like
>> > a separate mount in that it isn't listed in /proc/mounts. This
>> > inconsistency can confuse tools.
>> > 
>> > This patch causes these subvolumes to become separate mounts by using
>> > the VFS' automount functionality, much like NFS uses automount when it
>> > discovered mountpoints on the server.
>> > 
>> > The VFS currently makes it impossible to auto-mount a directory on to 
>> > itself
>> > (i.e. a bind mount).  For NFS this isn't a problem as a new superblock
>> > is created for the child filesystem so there are two separate dentries
>> > (and inodes) for the one directory: one in the parent filesystem, one in
>> > the child (note that the two superblocks share a common connection to
>> > the server so there is still a lot of commonality).
>> > 
>> > BTRFS has chosen instead to use a single superblock for all subvolumes.
>> 
>> Naive question: was there a reason for that choice?
>
> They are really all part of the same FS, the single super better fits.
> Or said another way, it felt like there would be dramatically more duct
> tape around supers-per-subvolume than there was abusing st_dev.
>
> Neil's patch came up after I told him a few of us had tried to do the
> same thing and failed to find clean vfs changes to make it possible...he
> took it as a challenge.  Now I have to remember what it was about our
> past attempts that I didn't like.
>
> I'll test this and queue for 4.5 if it all works out, thanks Neil!

I'd rather resend with proper documentation updates and s-o-b before it
gets queued if that is OK.  So once you are happy, please let me know
and I'll do it "properly".

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH/RFC] make btrfs subvol mounts appear in /proc/mounts

2015-11-02 Thread Neil Brown
On Tue, Nov 03 2015, Chris Mason wrote:

> On Mon, Nov 02, 2015 at 03:50:12PM -0500, J. Bruce Fields wrote:
>> On Wed, Oct 28, 2015 at 07:25:10AM +0900, Neil Brown wrote:
>> > 
>> > If you create a subvolume in btrfs and access it (by name) without
>> > mounting it, then the subvolume looks like a separate mount to some
>> > extent, returning a different st_dev to stat(), but it doesn't look like
>> > a separate mount in that it isn't listed in /proc/mounts. This
>> > inconsistency can confuse tools.
>> > 
>> > This patch causes these subvolumes to become separate mounts by using
>> > the VFS' automount functionality, much like NFS uses automount when it
>> > discovered mountpoints on the server.
>> > 
>> > The VFS currently makes it impossible to auto-mount a directory on to 
>> > itself
>> > (i.e. a bind mount).  For NFS this isn't a problem as a new superblock
>> > is created for the child filesystem so there are two separate dentries
>> > (and inodes) for the one directory: one in the parent filesystem, one in
>> > the child (note that the two superblocks share a common connection to
>> > the server so there is still a lot of commonality).
>> > 
>> > BTRFS has chosen instead to use a single superblock for all subvolumes.
>> 
>> Naive question: was there a reason for that choice?
>
> They are really all part of the same FS, the single super better fits.
> Or said another way, it felt like there would be dramatically more duct
> tape around supers-per-subvolume than there was abusing st_dev.
>
> Neil's patch came up after I told him a few of us had tried to do the
> same thing and failed to find clean vfs changes to make it possible...he
> took it as a challenge.  Now I have to remember what it was about our
> past attempts that I didn't like.
>
> I'll test this and queue for 4.5 if it all works out, thanks Neil!

I'd rather resend with proper documentation updates and s-o-b before it
gets queued if that is OK.  So once you are happy, please let me know
and I'll do it "properly".

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH 1/3] power, vfs: move away from PF_KTHREAD freezing in favor of fs freezing

2015-11-01 Thread Neil Brown
On Sat, Oct 31 2015, Oliver Neukum wrote:

> On Fri, 2015-10-30 at 14:47 +0100, Jiri Kosina wrote:
>> Basically the main argument why kthread freezer is not needed boils
>> down to
>> this: the only facility that is needed during suspend: "no persistent
>> fs
>> changes are allowed from now on".
>
> Is that true? Drivers of character devices also may assume
> that IO and suspend() wouldn't race (except in fairly clear exceptions)

Anything that is a device can hook in to the suspend using the standard
call back and make sure no problematic races happen.

Filesystems are different partly because the "know" in memory some
details of what is persisting on storage, and partly because they aren't
devices.

Maybe filesystems should be devices  though that probably wouldn't
help much.


I would suggest that "the only facility that is needed during suspend"
is really "well defined states and well defined transitions".
Filesystems need them as well as devices, and frozen kthreads might have
been a kludge to try to provide them.

NeilBrown


signature.asc
Description: PGP signature


[GIT PULL REQUEST] md updates for 4.4

2015-11-01 Thread Neil Brown
The following changes since commit 25cb62b76430a91cc6195f902e61c2cb84ade622:

  Linux 4.3-rc5 (2015-10-11 11:09:45 -0700)

are available in the git repository at:

  git://neil.brown.name/md tags/md/4.4

for you to fetch changes up to 339421def582abb14c2217aa8c8f28bb2e299174:

  MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW (2015-11-01 
13:48:29 +1100)


md updates for 4.4.

Two major components to this update.

1/ the clustered-raid1 support from SUSE is nearly
  complete.  There are a few outstanding issues being
  worked on.  Maybe half a dozen patches will bring
  this to a usable state.

2/ The first stage of journalled-raid5 support from
   Facebook makes an appearance.  With a journal
   device configured (typically NVRAM or SSD), the
   "RAID5 write hole" should be closed - a crash
   during degraded operations cannot result in data
   corruption.

   The next stage will be to use the journal as a
   write-behind cache so that latency can be reduced
   and in some cases throughput increased by
   performing more full-stripe writes.


Christoph Hellwig (12):
  raid5-cache: move functionality out of __r5l_set_io_unit_state
  raid5-cache: free I/O units earlier
  raid5-cache: rename flushed_ios to finished_ios
  raid5-cache: factor out a helper to run all stripes for an I/O unit
  raid5-cache: simplify state machine when caches flushes are not needed
  raid5-cache: clean up r5l_get_meta
  raid5-cache: refactor bio allocation
  raid5-cache: take rdev->data_offset into account early on
  raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
  raid5-cache: new helper: r5_reserve_log_entry
  raid5-cache: small log->seq cleanup
  raid5-cache: use bio chaining

Goldwyn Rodrigues (11):
  md-cluster: complete all write requests before adding suspend_info
  md: Increment version for clustered bitmaps
  md-cluster: Use a small window for resync
  md-cluster: Wake up suspended process
  md: remove_and_add_spares() to activate specific rdev
  md-cluster: Improve md_reload_sb to be less error prone
  md-cluster: Perform a lazy update
  md-cluster: Perform resync/recovery under a DLM lock
  md-cluster: Fix adding of new disk with new reload code
  md-cluster: Do not printk() every received message
  md-cluster: Call update_raid_disks() if another node --grow's raid_disks

Guoqing Jiang (9):
  md-cluster: send BITMAP_NEEDS_SYNC when node is leaving cluster
  md-cluster: make other members of cluster_msg is handled by little endian 
funcs
  md-cluster: remove unnecessary setting for slot
  md-cluster: make sure the node do not receive it's own msg
  md-cluster: zero cmsg before it was sent
  md-cluster: Add 'SUSE' as author for md-cluster.c
  md-cluster: only call kick_rdev_from_array after remove disk successfully
  md: check the return value for metadata_update_start
  md-cluster: Fix warnings when build with CF=-D__CHECK_ENDIAN__

NeilBrown (5):
  Merge branch 'md-next' of git://github.com/goldwynr/linux into for-next
  md-cluster: metadata_update_finish: consistently use cmsg.raid_slot as 
le32
  md-cluster: discard unused sb_mutex.
  md-cluster: don't cast void pointers when assigning them.
  md-cluster: remove mddev arg from add_resync_info()

Shaohua Li (24):
  md: override md superblock recovery_offset for journal device
  raid5: export some functions
  raid5: add a new state for stripe log handling
  raid5: add basic stripe log
  raid5: log reclaim support
  raid5: log recovery
  raid5-cache: use crc32c checksum
  raid5: disable batch with log enabled
  raid5: don't allow resize/reshape with cache(log) support
  raid5: enable log for raid array with cache disk
  raid5-cache: switching to state machine for log disk cache flush
  raid5-cache: fix a user-after-free bug
  raid5-cache: optimize FLUSH IO with log enabled
  md: skip resync for raid array with journal
  raid5-cache: check stripe finish out of order
  raid5-cache: don't delay stripe captured in log
  md: show journal for journal disk in disk state sysfs
  raid5-cache: move reclaim stop to quiesce
  MD: fix info output for journal disk
  raid5-cache: add trim support for log
  raid5: journal disk can't be removed
  raid5-cache: IO error handling
  raid5-cache: start raid5 readonly if journal is missing
  MD: set journal disk ->raid_disk

Song Liu (6):
  MD: replace special disk roles with macros
  MD: add a new disk role to present write journal device
  skip match_mddev_units check for special roles
  MD: add new bit to indicate raid array with journal
  MD: kick out journal disk if it's not fresh
  MD: when RAID journal is missing/faulty, 

Re: [PATCH 1/3] power, vfs: move away from PF_KTHREAD freezing in favor of fs freezing

2015-11-01 Thread Neil Brown
On Sat, Oct 31 2015, Oliver Neukum wrote:

> On Fri, 2015-10-30 at 14:47 +0100, Jiri Kosina wrote:
>> Basically the main argument why kthread freezer is not needed boils
>> down to
>> this: the only facility that is needed during suspend: "no persistent
>> fs
>> changes are allowed from now on".
>
> Is that true? Drivers of character devices also may assume
> that IO and suspend() wouldn't race (except in fairly clear exceptions)

Anything that is a device can hook in to the suspend using the standard
call back and make sure no problematic races happen.

Filesystems are different partly because the "know" in memory some
details of what is persisting on storage, and partly because they aren't
devices.

Maybe filesystems should be devices  though that probably wouldn't
help much.


I would suggest that "the only facility that is needed during suspend"
is really "well defined states and well defined transitions".
Filesystems need them as well as devices, and frozen kthreads might have
been a kludge to try to provide them.

NeilBrown


signature.asc
Description: PGP signature


[GIT PULL REQUEST] md updates for 4.4

2015-11-01 Thread Neil Brown
The following changes since commit 25cb62b76430a91cc6195f902e61c2cb84ade622:

  Linux 4.3-rc5 (2015-10-11 11:09:45 -0700)

are available in the git repository at:

  git://neil.brown.name/md tags/md/4.4

for you to fetch changes up to 339421def582abb14c2217aa8c8f28bb2e299174:

  MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW (2015-11-01 
13:48:29 +1100)


md updates for 4.4.

Two major components to this update.

1/ the clustered-raid1 support from SUSE is nearly
  complete.  There are a few outstanding issues being
  worked on.  Maybe half a dozen patches will bring
  this to a usable state.

2/ The first stage of journalled-raid5 support from
   Facebook makes an appearance.  With a journal
   device configured (typically NVRAM or SSD), the
   "RAID5 write hole" should be closed - a crash
   during degraded operations cannot result in data
   corruption.

   The next stage will be to use the journal as a
   write-behind cache so that latency can be reduced
   and in some cases throughput increased by
   performing more full-stripe writes.


Christoph Hellwig (12):
  raid5-cache: move functionality out of __r5l_set_io_unit_state
  raid5-cache: free I/O units earlier
  raid5-cache: rename flushed_ios to finished_ios
  raid5-cache: factor out a helper to run all stripes for an I/O unit
  raid5-cache: simplify state machine when caches flushes are not needed
  raid5-cache: clean up r5l_get_meta
  raid5-cache: refactor bio allocation
  raid5-cache: take rdev->data_offset into account early on
  raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
  raid5-cache: new helper: r5_reserve_log_entry
  raid5-cache: small log->seq cleanup
  raid5-cache: use bio chaining

Goldwyn Rodrigues (11):
  md-cluster: complete all write requests before adding suspend_info
  md: Increment version for clustered bitmaps
  md-cluster: Use a small window for resync
  md-cluster: Wake up suspended process
  md: remove_and_add_spares() to activate specific rdev
  md-cluster: Improve md_reload_sb to be less error prone
  md-cluster: Perform a lazy update
  md-cluster: Perform resync/recovery under a DLM lock
  md-cluster: Fix adding of new disk with new reload code
  md-cluster: Do not printk() every received message
  md-cluster: Call update_raid_disks() if another node --grow's raid_disks

Guoqing Jiang (9):
  md-cluster: send BITMAP_NEEDS_SYNC when node is leaving cluster
  md-cluster: make other members of cluster_msg is handled by little endian 
funcs
  md-cluster: remove unnecessary setting for slot
  md-cluster: make sure the node do not receive it's own msg
  md-cluster: zero cmsg before it was sent
  md-cluster: Add 'SUSE' as author for md-cluster.c
  md-cluster: only call kick_rdev_from_array after remove disk successfully
  md: check the return value for metadata_update_start
  md-cluster: Fix warnings when build with CF=-D__CHECK_ENDIAN__

NeilBrown (5):
  Merge branch 'md-next' of git://github.com/goldwynr/linux into for-next
  md-cluster: metadata_update_finish: consistently use cmsg.raid_slot as 
le32
  md-cluster: discard unused sb_mutex.
  md-cluster: don't cast void pointers when assigning them.
  md-cluster: remove mddev arg from add_resync_info()

Shaohua Li (24):
  md: override md superblock recovery_offset for journal device
  raid5: export some functions
  raid5: add a new state for stripe log handling
  raid5: add basic stripe log
  raid5: log reclaim support
  raid5: log recovery
  raid5-cache: use crc32c checksum
  raid5: disable batch with log enabled
  raid5: don't allow resize/reshape with cache(log) support
  raid5: enable log for raid array with cache disk
  raid5-cache: switching to state machine for log disk cache flush
  raid5-cache: fix a user-after-free bug
  raid5-cache: optimize FLUSH IO with log enabled
  md: skip resync for raid array with journal
  raid5-cache: check stripe finish out of order
  raid5-cache: don't delay stripe captured in log
  md: show journal for journal disk in disk state sysfs
  raid5-cache: move reclaim stop to quiesce
  MD: fix info output for journal disk
  raid5-cache: add trim support for log
  raid5: journal disk can't be removed
  raid5-cache: IO error handling
  raid5-cache: start raid5 readonly if journal is missing
  MD: set journal disk ->raid_disk

Song Liu (6):
  MD: replace special disk roles with macros
  MD: add a new disk role to present write journal device
  skip match_mddev_units check for special roles
  MD: add new bit to indicate raid array with journal
  MD: kick out journal disk if it's not fresh
  MD: when RAID journal is missing/faulty, 

[GIT PULL REQUEST] last minute bug fixes for md.

2015-10-30 Thread Neil Brown

they just keep coming in :-(

The following changes since commit 8bce6d35b308d73cdb2ee273c95d711a55be688c:

  md/raid10: fix the 'new' raid10 layout to work correctly. (2015-10-24 
16:24:25 +1100)

are available in the git repository at:

  git://neil.brown.name/md tags/md/4.3-rc7-fixes

for you to fetch changes up to d01552a76d71f9879af448e9142389ee9be6e95b:

  Revert "md: allow a partially recovered device to be hot-added to an array." 
(2015-10-31 11:00:56 +1100)


two more bug fixes for md.

One bugfix for a list corruption in raid5 because of incorrect
locking.

Other for possible data corruption when a recovering device is failed,
removed, and re-added.

Both tagged for -stable.


NeilBrown (1):
  Revert "md: allow a partially recovered device to be hot-added to an 
array."

Roman Gushchin (1):
  md/raid5: fix locking in handle_stripe_clean_event()

 drivers/md/md.c| 3 +--
 drivers/md/raid5.c | 6 --
 2 files changed, 5 insertions(+), 4 deletions(-)


signature.asc
Description: PGP signature


Re: [PATCH] md/raid5: fix locking in handle_stripe_clean_event()

2015-10-30 Thread Neil Brown
On Sat, Oct 31 2015, Shaohua Li wrote:

> On Fri, Oct 30, 2015 at 05:02:47PM +0300, Roman Gushchin wrote:
>> > Isn't the 4.1 fix just:
>> >
>> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> > index e5befa356dbe..6e4350a78257 100644
>> > --- a/drivers/md/raid5.c
>> > +++ b/drivers/md/raid5.c
>> > @@ -3522,16 +3522,16 @@ returnbi:
>> >   * no updated data, so remove it from hash list and the 
>> > stripe
>> >   * will be reinitialized
>> >   */
>> > - spin_lock_irq(>device_lock);
>> >  unhash:
>> > + spin_lock_irq(conf->hash_locks + sh->hash_lock_index);
>> >  remove_hash(sh);
>> > + spin_unlock_irq(conf->hash_locks + sh->hash_lock_index);
>> >  if (head_sh->batch_head) {
>> >  sh = list_first_entry(>batch_list,
>> >    struct stripe_head, 
>> > batch_list);
>> >  if (sh != head_sh)
>> >  goto unhash;
>> >  }
>> > - spin_unlock_irq(>device_lock);
>> >  sh = head_sh;
>> >
>> >  if (test_bit(STRIPE_SYNC_REQUESTED, >state))
>> >
>> > ??
>> 
>> In my opion, this patch looks correct, although it seems to me, that there 
>> is an another issue here.
>> 
>> >  if (head_sh->batch_head) {
>> >  sh = list_first_entry(>batch_list,
>> >    struct stripe_head, 
>> > batch_list);
>> >  if (sh != head_sh)
>> >  goto unhash;
>> >  }
>>  
>> With a patch above this code will be executed without taking any locks. It 
>> it correct?
>> In my opinion, we need to take at least sh->stripe_lock, which protects 
>> sh->batch_head.
>> Or do I miss something?
>> 
>> If you want, we can handle this issue separately.
>
> The batch_list list doesn't need the protection. Only the remove_hash() need 
> it.

Yes, that's my understanding too.  The key to understanding is that
comment you (helpfully!) put in clear_batch_ready():

/*
 * BATCH_READY is cleared, no new stripes can be added.
 * batch_list can be accessed without lock
 */

I'll wrangle some patches...

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] md/raid5: fix locking in handle_stripe_clean_event()

2015-10-30 Thread Neil Brown
On Sat, Oct 31 2015, Shaohua Li wrote:

> On Fri, Oct 30, 2015 at 05:02:47PM +0300, Roman Gushchin wrote:
>> > Isn't the 4.1 fix just:
>> >
>> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> > index e5befa356dbe..6e4350a78257 100644
>> > --- a/drivers/md/raid5.c
>> > +++ b/drivers/md/raid5.c
>> > @@ -3522,16 +3522,16 @@ returnbi:
>> >   * no updated data, so remove it from hash list and the 
>> > stripe
>> >   * will be reinitialized
>> >   */
>> > - spin_lock_irq(>device_lock);
>> >  unhash:
>> > + spin_lock_irq(conf->hash_locks + sh->hash_lock_index);
>> >  remove_hash(sh);
>> > + spin_unlock_irq(conf->hash_locks + sh->hash_lock_index);
>> >  if (head_sh->batch_head) {
>> >  sh = list_first_entry(>batch_list,
>> >    struct stripe_head, 
>> > batch_list);
>> >  if (sh != head_sh)
>> >  goto unhash;
>> >  }
>> > - spin_unlock_irq(>device_lock);
>> >  sh = head_sh;
>> >
>> >  if (test_bit(STRIPE_SYNC_REQUESTED, >state))
>> >
>> > ??
>> 
>> In my opion, this patch looks correct, although it seems to me, that there 
>> is an another issue here.
>> 
>> >  if (head_sh->batch_head) {
>> >  sh = list_first_entry(>batch_list,
>> >    struct stripe_head, 
>> > batch_list);
>> >  if (sh != head_sh)
>> >  goto unhash;
>> >  }
>>  
>> With a patch above this code will be executed without taking any locks. It 
>> it correct?
>> In my opinion, we need to take at least sh->stripe_lock, which protects 
>> sh->batch_head.
>> Or do I miss something?
>> 
>> If you want, we can handle this issue separately.
>
> The batch_list list doesn't need the protection. Only the remove_hash() need 
> it.

Yes, that's my understanding too.  The key to understanding is that
comment you (helpfully!) put in clear_batch_ready():

/*
 * BATCH_READY is cleared, no new stripes can be added.
 * batch_list can be accessed without lock
 */

I'll wrangle some patches...

Thanks,
NeilBrown


signature.asc
Description: PGP signature


[GIT PULL REQUEST] last minute bug fixes for md.

2015-10-30 Thread Neil Brown

they just keep coming in :-(

The following changes since commit 8bce6d35b308d73cdb2ee273c95d711a55be688c:

  md/raid10: fix the 'new' raid10 layout to work correctly. (2015-10-24 
16:24:25 +1100)

are available in the git repository at:

  git://neil.brown.name/md tags/md/4.3-rc7-fixes

for you to fetch changes up to d01552a76d71f9879af448e9142389ee9be6e95b:

  Revert "md: allow a partially recovered device to be hot-added to an array." 
(2015-10-31 11:00:56 +1100)


two more bug fixes for md.

One bugfix for a list corruption in raid5 because of incorrect
locking.

Other for possible data corruption when a recovering device is failed,
removed, and re-added.

Both tagged for -stable.


NeilBrown (1):
  Revert "md: allow a partially recovered device to be hot-added to an 
array."

Roman Gushchin (1):
  md/raid5: fix locking in handle_stripe_clean_event()

 drivers/md/md.c| 3 +--
 drivers/md/raid5.c | 6 --
 2 files changed, 5 insertions(+), 4 deletions(-)


signature.asc
Description: PGP signature


Re: [PATCH] md/raid5: fix locking in handle_stripe_clean_event()

2015-10-29 Thread Neil Brown
On Fri, Oct 30 2015, Roman Gushchin wrote:

> 29.10.2015, 03:35, "Neil Brown" :
>> On Wed, Oct 28 2015, Roman Gushchin wrote:
>>
>>>  After commit 566c09c53455 ("raid5: relieve lock contention in 
>>> get_active_stripe()")
>>>  __find_stripe() is called under conf->hash_locks + hash.
>>>  But handle_stripe_clean_event() calls remove_hash() under
>>>  conf->device_lock.
>>>
>>>  Under some cirscumstances the hash chain can be circuited,
>>>  and we get an infinite loop with disabled interrupts and locked hash
>>>  lock in __find_stripe(). This leads to hard lockup on multiple CPUs
>>>  and following system crash.
>>>
>>>  I was able to reproduce this behavior on raid6 over 6 ssd disks.
>>>  The devices_handle_discard_safely option should be set to enable trim
>>>  support. The following script was used:
>>>
>>>  for i in `seq 1 32`; do
>>>  dd if=/dev/zero of=large$i bs=10M count=100 &
>>>  done
>>>
>>>  Signed-off-by: Roman Gushchin 
>>>  Cc: Neil Brown 
>>>  Cc: Shaohua Li 
>>>  Cc: linux-r...@vger.kernel.org
>>>  Cc:  # 3.10 - 3.19
>>
>> Hi Roman,
>>  thanks for reporting this and providing a fix.
>>
>> I'm a bit confused by that stable range: 3.10 - 3.19
>>
>> The commit you identify as introducing the bug was added in 3.13, so
>> presumably 3.10, 3.11, 3.12 are not affected.
>
> Sure, it's my mistake. Correct range is 3.13 - 3.19. Sorry.
>
>> Also the bug is still present in mainline, so 4.0, 4.1, 4.2 are also
>> affected, though the patch needs to be revised a bit for 4.1 and later.
>
> Yes, exactly, but things are a bit more complicated in mainline.
> I'll try to prepare a patch for mainline in a couple of days.
>
Thanks for the confirmation.

Isn't the 4.1 fix just:

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e5befa356dbe..6e4350a78257 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3522,16 +3522,16 @@ returnbi:
 * no updated data, so remove it from hash list and the stripe
 * will be reinitialized
 */
-   spin_lock_irq(>device_lock);
 unhash:
+   spin_lock_irq(conf->hash_locks + sh->hash_lock_index);
remove_hash(sh);
+   spin_unlock_irq(conf->hash_locks + sh->hash_lock_index);
if (head_sh->batch_head) {
sh = list_first_entry(>batch_list,
  struct stripe_head, batch_list);
if (sh != head_sh)
goto unhash;
}
-   spin_unlock_irq(>device_lock);
sh = head_sh;
 
if (test_bit(STRIPE_SYNC_REQUESTED, >state))

??

Or maybe
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e5befa356dbe..704ef7fcfbf8 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3509,6 +3509,7 @@ returnbi:
 
if (!discard_pending &&
test_bit(R5_Discard, >dev[sh->pd_idx].flags)) {
+   int hash;
clear_bit(R5_Discard, >dev[sh->pd_idx].flags);
clear_bit(R5_UPTODATE, >dev[sh->pd_idx].flags);
if (sh->qd_idx >= 0) {
@@ -3522,16 +3523,17 @@ returnbi:
 * no updated data, so remove it from hash list and the stripe
 * will be reinitialized
 */
-   spin_lock_irq(>device_lock);
 unhash:
+   hash = sh->hash_lock_index;
+   spin_lock_irq(conf->hash_locks + hash);
remove_hash(sh);
+   spin_unlock_irq(conf->hash_locks + hash);
if (head_sh->batch_head) {
sh = list_first_entry(>batch_list,
  struct stripe_head, batch_list);
if (sh != head_sh)
goto unhash;
}
-   spin_unlock_irq(>device_lock);
sh = head_sh;
 
if (test_bit(STRIPE_SYNC_REQUESTED, >state))


For personal reasons I would like to get this resolved today or
tomorrow, though it would be silly to rush if there is any uncertainty.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] md/raid5: fix locking in handle_stripe_clean_event()

2015-10-29 Thread Neil Brown
On Fri, Oct 30 2015, Roman Gushchin wrote:

> 29.10.2015, 03:35, "Neil Brown" <ne...@suse.de>:
>> On Wed, Oct 28 2015, Roman Gushchin wrote:
>>
>>>  After commit 566c09c53455 ("raid5: relieve lock contention in 
>>> get_active_stripe()")
>>>  __find_stripe() is called under conf->hash_locks + hash.
>>>  But handle_stripe_clean_event() calls remove_hash() under
>>>  conf->device_lock.
>>>
>>>  Under some cirscumstances the hash chain can be circuited,
>>>  and we get an infinite loop with disabled interrupts and locked hash
>>>  lock in __find_stripe(). This leads to hard lockup on multiple CPUs
>>>  and following system crash.
>>>
>>>  I was able to reproduce this behavior on raid6 over 6 ssd disks.
>>>  The devices_handle_discard_safely option should be set to enable trim
>>>  support. The following script was used:
>>>
>>>  for i in `seq 1 32`; do
>>>  dd if=/dev/zero of=large$i bs=10M count=100 &
>>>  done
>>>
>>>  Signed-off-by: Roman Gushchin <kl...@yandex-team.ru>
>>>  Cc: Neil Brown <ne...@suse.de>
>>>  Cc: Shaohua Li <s...@kernel.org>
>>>  Cc: linux-r...@vger.kernel.org
>>>  Cc: <sta...@vger.kernel.org> # 3.10 - 3.19
>>
>> Hi Roman,
>>  thanks for reporting this and providing a fix.
>>
>> I'm a bit confused by that stable range: 3.10 - 3.19
>>
>> The commit you identify as introducing the bug was added in 3.13, so
>> presumably 3.10, 3.11, 3.12 are not affected.
>
> Sure, it's my mistake. Correct range is 3.13 - 3.19. Sorry.
>
>> Also the bug is still present in mainline, so 4.0, 4.1, 4.2 are also
>> affected, though the patch needs to be revised a bit for 4.1 and later.
>
> Yes, exactly, but things are a bit more complicated in mainline.
> I'll try to prepare a patch for mainline in a couple of days.
>
Thanks for the confirmation.

Isn't the 4.1 fix just:

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e5befa356dbe..6e4350a78257 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3522,16 +3522,16 @@ returnbi:
 * no updated data, so remove it from hash list and the stripe
 * will be reinitialized
 */
-   spin_lock_irq(>device_lock);
 unhash:
+   spin_lock_irq(conf->hash_locks + sh->hash_lock_index);
remove_hash(sh);
+   spin_unlock_irq(conf->hash_locks + sh->hash_lock_index);
if (head_sh->batch_head) {
sh = list_first_entry(>batch_list,
  struct stripe_head, batch_list);
if (sh != head_sh)
goto unhash;
}
-   spin_unlock_irq(>device_lock);
sh = head_sh;
 
if (test_bit(STRIPE_SYNC_REQUESTED, >state))

??

Or maybe
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e5befa356dbe..704ef7fcfbf8 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3509,6 +3509,7 @@ returnbi:
 
if (!discard_pending &&
test_bit(R5_Discard, >dev[sh->pd_idx].flags)) {
+   int hash;
clear_bit(R5_Discard, >dev[sh->pd_idx].flags);
clear_bit(R5_UPTODATE, >dev[sh->pd_idx].flags);
if (sh->qd_idx >= 0) {
@@ -3522,16 +3523,17 @@ returnbi:
 * no updated data, so remove it from hash list and the stripe
 * will be reinitialized
 */
-   spin_lock_irq(>device_lock);
 unhash:
+   hash = sh->hash_lock_index;
+   spin_lock_irq(conf->hash_locks + hash);
remove_hash(sh);
+   spin_unlock_irq(conf->hash_locks + hash);
if (head_sh->batch_head) {
sh = list_first_entry(>batch_list,
  struct stripe_head, batch_list);
if (sh != head_sh)
goto unhash;
}
-   spin_unlock_irq(>device_lock);
sh = head_sh;
 
if (test_bit(STRIPE_SYNC_REQUESTED, >state))


For personal reasons I would like to get this resolved today or
tomorrow, though it would be silly to rush if there is any uncertainty.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] md/raid5: fix locking in handle_stripe_clean_event()

2015-10-28 Thread Neil Brown
On Wed, Oct 28 2015, Roman Gushchin wrote:

> After commit 566c09c53455 ("raid5: relieve lock contention in 
> get_active_stripe()")
> __find_stripe() is called under conf->hash_locks + hash.
> But handle_stripe_clean_event() calls remove_hash() under
> conf->device_lock.
>
> Under some cirscumstances the hash chain can be circuited,
> and we get an infinite loop with disabled interrupts and locked hash
> lock in __find_stripe(). This leads to hard lockup on multiple CPUs
> and following system crash.
>
> I was able to reproduce this behavior on raid6 over 6 ssd disks.
> The devices_handle_discard_safely option should be set to enable trim
> support. The following script was used:
>
> for i in `seq 1 32`; do
> dd if=/dev/zero of=large$i bs=10M count=100 &
> done
>
> Signed-off-by: Roman Gushchin 
> Cc: Neil Brown 
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc:  # 3.10 - 3.19

Hi Roman,
 thanks for reporting this and providing a fix.

I'm a bit confused by that stable range: 3.10 - 3.19

The commit you identify as introducing the bug was added in 3.13, so
presumably 3.10, 3.11, 3.12 are not affected.
Also the bug is still present in mainline, so 4.0, 4.1, 4.2 are also
affected, though the patch needs to be revised a bit for 4.1 and later.

Does that match your understanding?  Or is there something that I am
missing?

Thanks,
NeilBrown

> ---
>  drivers/md/raid5.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index e421016..5fa7549 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3060,6 +3060,8 @@ static void handle_stripe_clean_event(struct r5conf 
> *conf,
>   }
>   if (!discard_pending &&
>   test_bit(R5_Discard, >dev[sh->pd_idx].flags)) {
> + int hash = sh->hash_lock_index;
> +
>   clear_bit(R5_Discard, >dev[sh->pd_idx].flags);
>   clear_bit(R5_UPTODATE, >dev[sh->pd_idx].flags);
>   if (sh->qd_idx >= 0) {
> @@ -3073,9 +3075,9 @@ static void handle_stripe_clean_event(struct r5conf 
> *conf,
>* no updated data, so remove it from hash list and the stripe
>* will be reinitialized
>*/
> - spin_lock_irq(>device_lock);
> + spin_lock_irq(conf->hash_locks + hash);
>   remove_hash(sh);
> - spin_unlock_irq(>device_lock);
> + spin_unlock_irq(conf->hash_locks + hash);
>   if (test_bit(STRIPE_SYNC_REQUESTED, >state))
>   set_bit(STRIPE_HANDLE, >state);
>  
> -- 
> 2.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


signature.asc
Description: PGP signature


Re: [PATCH] md/raid5: fix locking in handle_stripe_clean_event()

2015-10-28 Thread Neil Brown
On Wed, Oct 28 2015, Roman Gushchin wrote:

> After commit 566c09c53455 ("raid5: relieve lock contention in 
> get_active_stripe()")
> __find_stripe() is called under conf->hash_locks + hash.
> But handle_stripe_clean_event() calls remove_hash() under
> conf->device_lock.
>
> Under some cirscumstances the hash chain can be circuited,
> and we get an infinite loop with disabled interrupts and locked hash
> lock in __find_stripe(). This leads to hard lockup on multiple CPUs
> and following system crash.
>
> I was able to reproduce this behavior on raid6 over 6 ssd disks.
> The devices_handle_discard_safely option should be set to enable trim
> support. The following script was used:
>
> for i in `seq 1 32`; do
> dd if=/dev/zero of=large$i bs=10M count=100 &
> done
>
> Signed-off-by: Roman Gushchin <kl...@yandex-team.ru>
> Cc: Neil Brown <ne...@suse.de>
> Cc: Shaohua Li <s...@kernel.org>
> Cc: linux-r...@vger.kernel.org
> Cc: <sta...@vger.kernel.org> # 3.10 - 3.19

Hi Roman,
 thanks for reporting this and providing a fix.

I'm a bit confused by that stable range: 3.10 - 3.19

The commit you identify as introducing the bug was added in 3.13, so
presumably 3.10, 3.11, 3.12 are not affected.
Also the bug is still present in mainline, so 4.0, 4.1, 4.2 are also
affected, though the patch needs to be revised a bit for 4.1 and later.

Does that match your understanding?  Or is there something that I am
missing?

Thanks,
NeilBrown

> ---
>  drivers/md/raid5.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index e421016..5fa7549 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3060,6 +3060,8 @@ static void handle_stripe_clean_event(struct r5conf 
> *conf,
>   }
>   if (!discard_pending &&
>   test_bit(R5_Discard, >dev[sh->pd_idx].flags)) {
> + int hash = sh->hash_lock_index;
> +
>   clear_bit(R5_Discard, >dev[sh->pd_idx].flags);
>   clear_bit(R5_UPTODATE, >dev[sh->pd_idx].flags);
>   if (sh->qd_idx >= 0) {
> @@ -3073,9 +3075,9 @@ static void handle_stripe_clean_event(struct r5conf 
> *conf,
>* no updated data, so remove it from hash list and the stripe
>* will be reinitialized
>*/
> - spin_lock_irq(>device_lock);
> + spin_lock_irq(conf->hash_locks + hash);
>   remove_hash(sh);
> - spin_unlock_irq(>device_lock);
> + spin_unlock_irq(conf->hash_locks + hash);
>   if (test_bit(STRIPE_SYNC_REQUESTED, >state))
>   set_bit(STRIPE_HANDLE, >state);
>  
> -- 
> 2.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


signature.asc
Description: PGP signature


Re: [PATCHv2] Documentation: add new description of path-name lookup.

2015-10-27 Thread Neil Brown
On Wed, Oct 28 2015, J. Bruce Fields wrote:

> On Mon, Oct 26, 2015 at 03:35:54PM +0900, Neil Brown wrote:
>> From c38784b876a181eda9a5687e618749157dc96a0e Mon Sep 17 00:00:00 2001
>> From: NeilBrown 
>> Date: Sat, 25 Jul 2015 10:24:41 +1000
>> Subject: [PATCH] Documentation: add new description of path-name lookup.
>> 
>> This document is based on three recent lwn.net articles.
>> Some of the introductory material and linkage between articles
>> has been removed, and some time-based descriptions have been
>> revised.
>
> Thanks for doing this!  Nit:
>
>> +End of the road
>> +---
>> +
>> +Despite its complexity, all this pathname lookup code appears to be
>> +in good shape - various parts are certainly easier to understand now
>> +than even a couple of releases ago.  But that doesn't mean it is
>> +"finished".   As already mentioned, RCU-walk currently only follows
>> +symlinks that are stored in the inode so, while it handles many ext4
>> +symlinks, it doesn't help with NFS, XFS, or Btrfs.  That support
>> +is not likely to be long delayed.
>
> This looks likely to go stale quickly.  Maybe just drop it?
>
> I'd personally also probably drop the introduction.  (I think of
> Documetation/ as reference material with less of a need to tell a
> "story".)  But, I could be wrong.

Allow me to try to convince you that you are wrong :-)

When I want reference material, I look at the code.  Code doesn't get
out of date.
Ideally there should be comments next to lock declarations outlining what
the lock protects, and comments next to function pointer declarations
describing the use and behaviour of that function.  But that sort of
detail - whether as code or comments - should live with the code.

When I go to Documentation/ it is to get the big picture.  To get
answers to the questions that I didn't even know I should ask.  To find
out the "why" and the caveats and the history.  In sort, I want to be
told the story of the particular topic. (Unfortunately, that isn't
usually what I find).

Don't underestimate story.  Story has been the most effective teaching
tool we've had since time immemorial (or so the story goes).  We
disregard it at our peril.

I've had a couple of experiences recently that reflect on the teaching
of mathematics, one concerning trigonometry

  https://plus.google.com/+LinusTorvalds/posts/Y6i9Mnk3sno

and one concerning quadratic equations.  Both cases could be viewed as
an over emphasis on specific details (important though they are) without
enough time being spent on the big-picture story of why these things are
*everywhere*.

There is certainly value in structuring the story so that a reader can
come back and easily find the detail they need to be reminded about.
But that first time, when you have no idea how things work, but need to
fix a bug or understand why you cannot find a way to add that feature
you so desperately need, having a story to read (and the patience to
read it) can be a big help.

NeilBrown


signature.asc
Description: PGP signature


[PATCH/RFC] make btrfs subvol mounts appear in /proc/mounts

2015-10-27 Thread Neil Brown

If you create a subvolume in btrfs and access it (by name) without
mounting it, then the subvolume looks like a separate mount to some
extent, returning a different st_dev to stat(), but it doesn't look like
a separate mount in that it isn't listed in /proc/mounts. This
inconsistency can confuse tools.

This patch causes these subvolumes to become separate mounts by using
the VFS' automount functionality, much like NFS uses automount when it
discovered mountpoints on the server.

The VFS currently makes it impossible to auto-mount a directory on to itself
(i.e. a bind mount).  For NFS this isn't a problem as a new superblock
is created for the child filesystem so there are two separate dentries
(and inodes) for the one directory: one in the parent filesystem, one in
the child (note that the two superblocks share a common connection to
the server so there is still a lot of commonality).

BTRFS has chosen instead to use a single superblock for all subvolumes.
This results in a single dentry for the subvol-root.  A dentry which
must be auto-mounted on itself.

This creates 2 problems.
Firstly, we want to be selective about when the dentry triggers an
automount.  When the dentry is not the root of a vfsmount, we do want to
trigger an automount.  When it is the root, we don't.
I have modified follow_managed() to understand that if ->d_manage()
returns 1, then an automount should be suppressed when
  path->mnt->mnt_root == path->dentry.

Secondly, finish_automount() explicitly tests for the case of mounting a
dentry on top of itself and fails with -ELOOP.  I've changed this to
only fail if the mounted-on directory is the root in its own vfsmount.
So now mounting a dentry on itself can be OK, but mounting the same
dentry on top of that will cause -ELOOP.

With those two VFS changes, it is straight forward to enable automounts for
btrfs subvolumes and to use clone_private_mount to create the required
vfsmount.

I have not added infrastructure to time-out the submounts after a period
of inactivity in the way that NFS does.  If that is desirable (I'm not
sure of the benefits) it can easily be added as a subsequent patch.

If this approach is acceptable, I will resubmit addressing any issues
raised and with proper updates for the automount documentation.

Thanks for any comments,
NeilBrown


diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 611b66d73e80..e96c53590f72 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5621,6 +5621,23 @@ static void btrfs_dentry_release(struct dentry *dentry)
kfree(dentry->d_fsdata);
 }
 
+static int btrfs_dentry_manage(struct dentry *dentry, bool in_rcu)
+{
+   /* This is a 'rebind automount'.  So only trigger automount
+* when the dentry isn't the root of a mountpoint.
+*/
+   return 1;
+}
+
+static struct vfsmount *btrfs_dentry_automount(struct path *path)
+{
+   struct vfsmount *mnt;
+   mnt = clone_private_mount(path);
+   if (!IS_ERR(mnt))
+   mntget(mnt);
+   return mnt;
+}
+
 static struct dentry *btrfs_lookup(struct inode *dir, struct dentry *dentry,
   unsigned int flags)
 {
@@ -5633,6 +5650,8 @@ static struct dentry *btrfs_lookup(struct inode *dir, 
struct dentry *dentry,
else
return ERR_CAST(inode);
}
+   if (inode && inode->i_ino == BTRFS_FIRST_FREE_OBJECTID)
+   dentry->d_flags |= DCACHE_MANAGE_TRANSIT | 
DCACHE_NEED_AUTOMOUNT;
 
return d_splice_alias(inode, dentry);
 }
@@ -9990,4 +10009,6 @@ static const struct inode_operations 
btrfs_symlink_inode_operations = {
 const struct dentry_operations btrfs_dentry_operations = {
.d_delete   = btrfs_dentry_delete,
.d_release  = btrfs_dentry_release,
+   .d_manage   = btrfs_dentry_manage,
+   .d_automount= btrfs_dentry_automount,
 };
diff --git a/fs/namei.c b/fs/namei.c
index 33e9495a3129..07e4bbbadae1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1178,6 +1178,7 @@ static int follow_managed(struct path *path, struct 
nameidata *nd)
   unlikely(managed != 0)) {
/* Allow the filesystem to manage the transit without i_mutex
 * being held. */
+   ret = 0;
if (managed & DCACHE_MANAGE_TRANSIT) {
BUG_ON(!path->dentry->d_op);
BUG_ON(!path->dentry->d_op->d_manage);
@@ -1207,7 +1208,12 @@ static int follow_managed(struct path *path, struct 
nameidata *nd)
 
/* Handle an automount point */
if (managed & DCACHE_NEED_AUTOMOUNT) {
-   ret = follow_automount(path, nd, _mntput);
+   if (ret == 1 && path->mnt->mnt_root == path->dentry) {
+   /* only automount when not the root */
+   ret = 0;
+   break;
+   } else
+  

[PATCH/RFC] make btrfs subvol mounts appear in /proc/mounts

2015-10-27 Thread Neil Brown

If you create a subvolume in btrfs and access it (by name) without
mounting it, then the subvolume looks like a separate mount to some
extent, returning a different st_dev to stat(), but it doesn't look like
a separate mount in that it isn't listed in /proc/mounts. This
inconsistency can confuse tools.

This patch causes these subvolumes to become separate mounts by using
the VFS' automount functionality, much like NFS uses automount when it
discovered mountpoints on the server.

The VFS currently makes it impossible to auto-mount a directory on to itself
(i.e. a bind mount).  For NFS this isn't a problem as a new superblock
is created for the child filesystem so there are two separate dentries
(and inodes) for the one directory: one in the parent filesystem, one in
the child (note that the two superblocks share a common connection to
the server so there is still a lot of commonality).

BTRFS has chosen instead to use a single superblock for all subvolumes.
This results in a single dentry for the subvol-root.  A dentry which
must be auto-mounted on itself.

This creates 2 problems.
Firstly, we want to be selective about when the dentry triggers an
automount.  When the dentry is not the root of a vfsmount, we do want to
trigger an automount.  When it is the root, we don't.
I have modified follow_managed() to understand that if ->d_manage()
returns 1, then an automount should be suppressed when
  path->mnt->mnt_root == path->dentry.

Secondly, finish_automount() explicitly tests for the case of mounting a
dentry on top of itself and fails with -ELOOP.  I've changed this to
only fail if the mounted-on directory is the root in its own vfsmount.
So now mounting a dentry on itself can be OK, but mounting the same
dentry on top of that will cause -ELOOP.

With those two VFS changes, it is straight forward to enable automounts for
btrfs subvolumes and to use clone_private_mount to create the required
vfsmount.

I have not added infrastructure to time-out the submounts after a period
of inactivity in the way that NFS does.  If that is desirable (I'm not
sure of the benefits) it can easily be added as a subsequent patch.

If this approach is acceptable, I will resubmit addressing any issues
raised and with proper updates for the automount documentation.

Thanks for any comments,
NeilBrown


diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 611b66d73e80..e96c53590f72 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5621,6 +5621,23 @@ static void btrfs_dentry_release(struct dentry *dentry)
kfree(dentry->d_fsdata);
 }
 
+static int btrfs_dentry_manage(struct dentry *dentry, bool in_rcu)
+{
+   /* This is a 'rebind automount'.  So only trigger automount
+* when the dentry isn't the root of a mountpoint.
+*/
+   return 1;
+}
+
+static struct vfsmount *btrfs_dentry_automount(struct path *path)
+{
+   struct vfsmount *mnt;
+   mnt = clone_private_mount(path);
+   if (!IS_ERR(mnt))
+   mntget(mnt);
+   return mnt;
+}
+
 static struct dentry *btrfs_lookup(struct inode *dir, struct dentry *dentry,
   unsigned int flags)
 {
@@ -5633,6 +5650,8 @@ static struct dentry *btrfs_lookup(struct inode *dir, 
struct dentry *dentry,
else
return ERR_CAST(inode);
}
+   if (inode && inode->i_ino == BTRFS_FIRST_FREE_OBJECTID)
+   dentry->d_flags |= DCACHE_MANAGE_TRANSIT | 
DCACHE_NEED_AUTOMOUNT;
 
return d_splice_alias(inode, dentry);
 }
@@ -9990,4 +10009,6 @@ static const struct inode_operations 
btrfs_symlink_inode_operations = {
 const struct dentry_operations btrfs_dentry_operations = {
.d_delete   = btrfs_dentry_delete,
.d_release  = btrfs_dentry_release,
+   .d_manage   = btrfs_dentry_manage,
+   .d_automount= btrfs_dentry_automount,
 };
diff --git a/fs/namei.c b/fs/namei.c
index 33e9495a3129..07e4bbbadae1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1178,6 +1178,7 @@ static int follow_managed(struct path *path, struct 
nameidata *nd)
   unlikely(managed != 0)) {
/* Allow the filesystem to manage the transit without i_mutex
 * being held. */
+   ret = 0;
if (managed & DCACHE_MANAGE_TRANSIT) {
BUG_ON(!path->dentry->d_op);
BUG_ON(!path->dentry->d_op->d_manage);
@@ -1207,7 +1208,12 @@ static int follow_managed(struct path *path, struct 
nameidata *nd)
 
/* Handle an automount point */
if (managed & DCACHE_NEED_AUTOMOUNT) {
-   ret = follow_automount(path, nd, _mntput);
+   if (ret == 1 && path->mnt->mnt_root == path->dentry) {
+   /* only automount when not the root */
+   ret = 0;
+   break;
+   } else
+  

Re: [PATCHv2] Documentation: add new description of path-name lookup.

2015-10-27 Thread Neil Brown
On Wed, Oct 28 2015, J. Bruce Fields wrote:

> On Mon, Oct 26, 2015 at 03:35:54PM +0900, Neil Brown wrote:
>> From c38784b876a181eda9a5687e618749157dc96a0e Mon Sep 17 00:00:00 2001
>> From: NeilBrown <n...@brown.name>
>> Date: Sat, 25 Jul 2015 10:24:41 +1000
>> Subject: [PATCH] Documentation: add new description of path-name lookup.
>> 
>> This document is based on three recent lwn.net articles.
>> Some of the introductory material and linkage between articles
>> has been removed, and some time-based descriptions have been
>> revised.
>
> Thanks for doing this!  Nit:
>
>> +End of the road
>> +---
>> +
>> +Despite its complexity, all this pathname lookup code appears to be
>> +in good shape - various parts are certainly easier to understand now
>> +than even a couple of releases ago.  But that doesn't mean it is
>> +"finished".   As already mentioned, RCU-walk currently only follows
>> +symlinks that are stored in the inode so, while it handles many ext4
>> +symlinks, it doesn't help with NFS, XFS, or Btrfs.  That support
>> +is not likely to be long delayed.
>
> This looks likely to go stale quickly.  Maybe just drop it?
>
> I'd personally also probably drop the introduction.  (I think of
> Documetation/ as reference material with less of a need to tell a
> "story".)  But, I could be wrong.

Allow me to try to convince you that you are wrong :-)

When I want reference material, I look at the code.  Code doesn't get
out of date.
Ideally there should be comments next to lock declarations outlining what
the lock protects, and comments next to function pointer declarations
describing the use and behaviour of that function.  But that sort of
detail - whether as code or comments - should live with the code.

When I go to Documentation/ it is to get the big picture.  To get
answers to the questions that I didn't even know I should ask.  To find
out the "why" and the caveats and the history.  In sort, I want to be
told the story of the particular topic. (Unfortunately, that isn't
usually what I find).

Don't underestimate story.  Story has been the most effective teaching
tool we've had since time immemorial (or so the story goes).  We
disregard it at our peril.

I've had a couple of experiences recently that reflect on the teaching
of mathematics, one concerning trigonometry

  https://plus.google.com/+LinusTorvalds/posts/Y6i9Mnk3sno

and one concerning quadratic equations.  Both cases could be viewed as
an over emphasis on specific details (important though they are) without
enough time being spent on the big-picture story of why these things are
*everywhere*.

There is certainly value in structuring the story so that a reader can
come back and easily find the detail they need to be reminded about.
But that first time, when you have no idea how things work, but need to
fix a bug or understand why you cannot find a way to add that feature
you so desperately need, having a story to read (and the patience to
read it) can be a big help.

NeilBrown


signature.asc
Description: PGP signature


[PATCHv2] Documentation: add new description of path-name lookup.

2015-10-26 Thread Neil Brown
From c38784b876a181eda9a5687e618749157dc96a0e Mon Sep 17 00:00:00 2001
From: NeilBrown 
Date: Sat, 25 Jul 2015 10:24:41 +1000
Subject: [PATCH] Documentation: add new description of path-name lookup.

This document is based on three recent lwn.net articles.
Some of the introductory material and linkage between articles
has been removed, and some time-based descriptions have been
revised.

Also all links to code have been removed as the code is very close by.

Contains corrections and improvements from Randy Dunlap .

Signed-off-by: NeilBrown 

---

I meant to re-send this weeks ago... but didn't.  Sorry.

NeilBrown


diff --git a/Documentation/filesystems/path-lookup.md 
b/Documentation/filesystems/path-lookup.md
new file mode 100644
index ..1b39e084a2b2
--- /dev/null
+++ b/Documentation/filesystems/path-lookup.md
@@ -0,0 +1,1297 @@
+
+ p { max-width:50em} ol, ul {max-width: 40em}
+
+
+Pathname lookup in Linux.
+=
+
+This write-up is based on three articles published at lwn.net:
+
+- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
+- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
+- <https://lwn.net/Articles/650786/> A walk among the symlinks
+
+Written by Neil Brown with help from Al Viro and Jon Corbet.
+
+Introduction
+
+
+The most obvious aspect of pathname lookup, which very little
+exploration is needed to discover, is that it is complex.  There are
+many rules, special cases, and implementation alternatives that all
+combine to confuse the unwary reader.  Computer science has long been
+acquainted with such complexity and has tools to help manage it.  One
+tool that we will make extensive use of is "divide and conquer".  For
+the early parts of the analysis we will divide off symlinks - leaving
+them until the final part.  Well before we get to symlinks we have
+another major division based on the VFS's approach to locking which
+will allow us to review "REF-walk" and "RCU-walk" separately.  But we
+are getting ahead of ourselves.  There are some important low level
+distinctions we need to clarify first.
+
+There are two sorts of ...
+--
+
+[`openat()`]: http://man7.org/linux/man-pages/man2/openat.2.html
+
+Pathnames (sometimes "file names"), used to identify objects in the
+filesystem, will be familiar to most readers.  They contain two sorts
+of elements: "slashes" that are sequences of one or more "`/`"
+characters, and "components" that are sequences of one or more
+non-"`/`" characters.  These form two kinds of paths.  Those that
+start with slashes are "absolute" and start from the filesystem root.
+The others are "relative" and start from the current directory, or
+from some other location specified by a file descriptor given to a
+"xxx`at`" system call such as "[`openat()`]".
+
+[`execveat()`]: http://man7.org/linux/man-pages/man2/execveat.2.html
+
+It is tempting to describe the second kind as starting with a
+component, but that isn't always accurate: a pathname can lack both
+slashes and components, it can be empty, in other words.  This is
+generally forbidden in POSIX, but some of those "xxx`at`" system calls
+in Linux permit it when the `AT_EMPTY_PATH` flag is given.  For
+example, if you have an open file descriptor on an executable file you
+can execute it by calling [`execveat()`] passing the file descriptor,
+an empty path, and the `AT_EMPTY_PATH` flag.
+
+These paths can be divided into two sections: the final component and
+everything else.  The "everything else" is the easy bit.  In all cases
+it must identify a directory that already exists, otherwise an error
+such as `ENOENT` or `ENOTDIR` will be reported.
+
+The final component is not so simple.  Not only do different system
+calls interpret it quite differently (e.g. some create it, some do
+not), but it might not even exist: neither the empty pathname nor the
+pathname that is just slashes have a final component.  If it does
+exist, it could be "`.`" or "`..`" which are handled quite differently
+from other components.
+
+[POSIX]: 
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
+
+If a pathname ends with a slash, such as "`/tmp/foo/`" it might be
+tempting to consider that to have an empty final component.  In many
+ways that would lead to correct results, but not always.  In
+particular, `mkdir()` and `rmdir()` each create or remove a directory named
+by the final component, and they are required to work with pathnames
+ending in "`/`".  According to [POSIX]
+
+> A pathname that contains at least one non- slash> character and
+> that ends with one or more trailing slash> characters shall not
+> be resolved successfully unless the last pathname component before
+

[PATCHv2] Documentation: add new description of path-name lookup.

2015-10-26 Thread Neil Brown
From c38784b876a181eda9a5687e618749157dc96a0e Mon Sep 17 00:00:00 2001
From: NeilBrown <n...@brown.name>
Date: Sat, 25 Jul 2015 10:24:41 +1000
Subject: [PATCH] Documentation: add new description of path-name lookup.

This document is based on three recent lwn.net articles.
Some of the introductory material and linkage between articles
has been removed, and some time-based descriptions have been
revised.

Also all links to code have been removed as the code is very close by.

Contains corrections and improvements from Randy Dunlap <rdun...@infradead.org>.

Signed-off-by: NeilBrown <n...@brown.name>

---

I meant to re-send this weeks ago... but didn't.  Sorry.

NeilBrown


diff --git a/Documentation/filesystems/path-lookup.md 
b/Documentation/filesystems/path-lookup.md
new file mode 100644
index ..1b39e084a2b2
--- /dev/null
+++ b/Documentation/filesystems/path-lookup.md
@@ -0,0 +1,1297 @@
+
+ p { max-width:50em} ol, ul {max-width: 40em}
+
+
+Pathname lookup in Linux.
+=
+
+This write-up is based on three articles published at lwn.net:
+
+- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
+- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
+- <https://lwn.net/Articles/650786/> A walk among the symlinks
+
+Written by Neil Brown with help from Al Viro and Jon Corbet.
+
+Introduction
+
+
+The most obvious aspect of pathname lookup, which very little
+exploration is needed to discover, is that it is complex.  There are
+many rules, special cases, and implementation alternatives that all
+combine to confuse the unwary reader.  Computer science has long been
+acquainted with such complexity and has tools to help manage it.  One
+tool that we will make extensive use of is "divide and conquer".  For
+the early parts of the analysis we will divide off symlinks - leaving
+them until the final part.  Well before we get to symlinks we have
+another major division based on the VFS's approach to locking which
+will allow us to review "REF-walk" and "RCU-walk" separately.  But we
+are getting ahead of ourselves.  There are some important low level
+distinctions we need to clarify first.
+
+There are two sorts of ...
+--
+
+[`openat()`]: http://man7.org/linux/man-pages/man2/openat.2.html
+
+Pathnames (sometimes "file names"), used to identify objects in the
+filesystem, will be familiar to most readers.  They contain two sorts
+of elements: "slashes" that are sequences of one or more "`/`"
+characters, and "components" that are sequences of one or more
+non-"`/`" characters.  These form two kinds of paths.  Those that
+start with slashes are "absolute" and start from the filesystem root.
+The others are "relative" and start from the current directory, or
+from some other location specified by a file descriptor given to a
+"xxx`at`" system call such as "[`openat()`]".
+
+[`execveat()`]: http://man7.org/linux/man-pages/man2/execveat.2.html
+
+It is tempting to describe the second kind as starting with a
+component, but that isn't always accurate: a pathname can lack both
+slashes and components, it can be empty, in other words.  This is
+generally forbidden in POSIX, but some of those "xxx`at`" system calls
+in Linux permit it when the `AT_EMPTY_PATH` flag is given.  For
+example, if you have an open file descriptor on an executable file you
+can execute it by calling [`execveat()`] passing the file descriptor,
+an empty path, and the `AT_EMPTY_PATH` flag.
+
+These paths can be divided into two sections: the final component and
+everything else.  The "everything else" is the easy bit.  In all cases
+it must identify a directory that already exists, otherwise an error
+such as `ENOENT` or `ENOTDIR` will be reported.
+
+The final component is not so simple.  Not only do different system
+calls interpret it quite differently (e.g. some create it, some do
+not), but it might not even exist: neither the empty pathname nor the
+pathname that is just slashes have a final component.  If it does
+exist, it could be "`.`" or "`..`" which are handled quite differently
+from other components.
+
+[POSIX]: 
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
+
+If a pathname ends with a slash, such as "`/tmp/foo/`" it might be
+tempting to consider that to have an empty final component.  In many
+ways that would lead to correct results, but not always.  In
+particular, `mkdir()` and `rmdir()` each create or remove a directory named
+by the final component, and they are required to work with pathnames
+ending in "`/`".  According to [POSIX]
+
+> A pathname that contains at least one non- slash> character and
+> that ends with one or more trailing slash> characters sha

Re: [Ksummit-discuss] Linux Foundation Technical Advisory Board Elections and Nomination process

2015-10-22 Thread Neil Brown
Kristen Accardi  writes:

> On Oct 22, 2015 7:04 PM, "Neil Brown"  wrote:
>>
>> Darren Hart  writes:
>>
>> >
>> > Is there a good description of what is expected of a TAB member? How
> much time
>> > is involved? What makes a great TAB member?
>> >
>> > I've found:
> http://www.linuxfoundation.org/programs/advisory-councils/tab
>> >
>> > I've read the charter and scanned some of the minutes, but I'd still
> like to
>> > hear from some of the "incumbants". Specifically, what makes you
> successful as a
>> > member of the TAB?
>>
>> Also on the topic of the TAB and elections, I cannot figure out what:
>>
>>   "Only one vote per candidate will be counted."
>>
>> means in:
>>
>> Voting:
>> ---
>> Each person present at the time of the election shall be handed a printed
>> ballot.  Voters may vote for as many candidates as they wish. Only one
>> vote per candidate will be counted. A space will be provided where
>> they can fill out the ballot in private and place it into a box to be
>> counted in private.
>>
>>
>> in
>>
> http://www.linuxfoundation.org/collaborate/workgroups/technical-advisory-board-tab/minutes-july-2015-tab-meeting
>>
>> Is it "Only one vote per candidate per voter will be counted"?
>
> Sorry this was a bit confusing.  Basically, it means you can vote for as
> many candidates as you like, but if you can't vote multiple times for the
> same candidate on your ballot.

Thanks.  That was the only explanation I could come up with that made
any sense, but it definitely isn't obvious (to me) that that is the
meaning of the words.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: nfs race condition

2015-10-22 Thread Neil Brown
Nicolas Parpandet  writes:

> Hello,
>
> I did some more investigations, (wireshark)
>
> The unlink of /tmp/crontab.vYPoHR/crontab is tranlated into a rename
> nfs call ? (.nfs file)

That happens if the file is still open.

>
> NFS   670 V3 READDIRPLUS Reply (Call In 731) crontab .. .
> NFS   210 V3 LOOKUP Call (Reply In 734), DH: 
> 0x038abb3d/.nfs0098008a0019
> NFS   194 V3 LOOKUP Reply (Call In 733) Error: NFS3ERR_NOENT
> NFS   254 V3 RENAME Call (Reply In 736), From DH: 0x038abb3d/crontab To 
> DH: 0x038abb3d/.nfs0098008a0019
> NFS   338 V3 RENAME Reply (Call In 735)
>
> and after, it tries to remove the directory /tmp/crontab.0wuoNx :
> NFS   198 V3 RMDIR Call (Reply In 738), DH: 0x954c28ec/crontab.0wuoNx
> NFS   222 V3 RMDIR Reply (Call In 737) Error: NFS3ERR_NOTEMPTY
>
> but the file /tmp/crontab.0wuoNx/crontab was renamed instead of being removed 
> (nfs network, I cannot see .nfs000 file in fact),
> this prevent the removing of the directory  /tmp/crontab.0wuoNx (notempty)
>
> It seems a race condition, see "nfs silly rename" keyword, but I have
> the problem with mysql temp files, and crontab temp file (crontab -e),

I don't think it is a race condition exactly.  More of an ordering
problem.

Unlinking a file before closing it is rarely useful (sometimes it is,
but not here).

> am I the only one doing mysql and crontabs over NFS !!!?!?,

Probably.

> does anybody have a solution ?

Fix 'crontab' to close before unlinking...

Is this the Debian/Ubuntu crontab program it looks a bit like it,
but this bug was fixed 6 years ago.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=413962

Or use a tmpfs for /tmp/ rather than using NFS???

NeilBrown

>
>
> Regards
>
> Nicolas
>
>
> - Mail original -
> De: "npa" 
> À: linux-kernel@vger.kernel.org
> Cc: "pcoustillas" 
> Envoyé: Dimanche 4 Octobre 2015 09:57:26
> Objet: nfs race condition
>
> Hello, 
>
> I'm doing some " crontab -e, exit with :x" on NFS mount point, 
> and there is a race condition whent removing temporary files : 
>
>
> 7701 write(2, "No modification made\n", 21) = 21 
> 7701 open("/tmp/crontab.vYPoHR", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 
> 4 
> 7701 getdents(4, {{d_ino=73336409, d_off=1, d_reclen=24, d_name="."} 
> {d_ino=73335455, d_off=2, d_reclen=24, d_name=".."} {d_ino=73340547, d_off=3, 
> d_reclen=32, d_name="crontab"}}, 32768) = 80 
> 7701 unlink("/tmp/crontab.vYPoHR/crontab") = 0 
> 7701 getdents(4, {}, 32768) = 0 
> 7701 close(4) = 0 
> 7701 rmdir("/tmp/crontab.vYPoHR") = -1 ENOTEMPTY (Directory not empty) 
> 7701 write(2, "/tmp/crontab.vYPoHR: Directory n"..., 41) = 41 
> 7701 socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 4
>
> The unlink is really done after the rmdir,
> nfsv4 & nfsv3 over ipv6, so the rmdir fails and the directory isn't removed.
> I have same problem with mysql temporary files.
>
> If I mount with "nocto,noac,lookupcache=none" : same problem
> kernel version on server : 3.16.0-4-amd64 (debian8)
> kernel version on client : 2.6.32-39-pve (proxmox), same thing with 
> 3.13.0-63-generic (ubuntu)
>
> Please answer me in CC, I'm not subscribed to the list.
>
> Regards
>
> Nicolas
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


signature.asc
Description: PGP signature


Re: [Ksummit-discuss] Linux Foundation Technical Advisory Board Elections and Nomination process

2015-10-22 Thread Neil Brown
Darren Hart  writes:

>
> Is there a good description of what is expected of a TAB member? How much time
> is involved? What makes a great TAB member?
>
> I've found: http://www.linuxfoundation.org/programs/advisory-councils/tab
>
> I've read the charter and scanned some of the minutes, but I'd still like to
> hear from some of the "incumbants". Specifically, what makes you successful 
> as a
> member of the TAB?

Also on the topic of the TAB and elections, I cannot figure out what:

  "Only one vote per candidate will be counted."

means in:

Voting:
---
Each person present at the time of the election shall be handed a printed
ballot.  Voters may vote for as many candidates as they wish. Only one
vote per candidate will be counted. A space will be provided where
they can fill out the ballot in private and place it into a box to be
counted in private.


in
http://www.linuxfoundation.org/collaborate/workgroups/technical-advisory-board-tab/minutes-july-2015-tab-meeting

Is it "Only one vote per candidate per voter will be counted"?

NeilBrown


signature.asc
Description: PGP signature


Re: [Ksummit-discuss] Linux Foundation Technical Advisory Board Elections and Nomination process

2015-10-22 Thread Neil Brown
Darren Hart  writes:

>
> Is there a good description of what is expected of a TAB member? How much time
> is involved? What makes a great TAB member?
>
> I've found: http://www.linuxfoundation.org/programs/advisory-councils/tab
>
> I've read the charter and scanned some of the minutes, but I'd still like to
> hear from some of the "incumbants". Specifically, what makes you successful 
> as a
> member of the TAB?

Also on the topic of the TAB and elections, I cannot figure out what:

  "Only one vote per candidate will be counted."

means in:

Voting:
---
Each person present at the time of the election shall be handed a printed
ballot.  Voters may vote for as many candidates as they wish. Only one
vote per candidate will be counted. A space will be provided where
they can fill out the ballot in private and place it into a box to be
counted in private.


in
http://www.linuxfoundation.org/collaborate/workgroups/technical-advisory-board-tab/minutes-july-2015-tab-meeting

Is it "Only one vote per candidate per voter will be counted"?

NeilBrown


signature.asc
Description: PGP signature


Re: [Ksummit-discuss] Linux Foundation Technical Advisory Board Elections and Nomination process

2015-10-22 Thread Neil Brown
Darren Hart  writes:

>
> Is there a good description of what is expected of a TAB member? How much time
> is involved? What makes a great TAB member?
>
> I've found: http://www.linuxfoundation.org/programs/advisory-councils/tab
>
> I've read the charter and scanned some of the minutes, but I'd still like to
> hear from some of the "incumbants". Specifically, what makes you successful 
> as a
> member of the TAB?

Also on the topic of the TAB and elections, I cannot figure out what:

  "Only one vote per candidate will be counted."

means in:

Voting:
---
Each person present at the time of the election shall be handed a printed
ballot.  Voters may vote for as many candidates as they wish. Only one
vote per candidate will be counted. A space will be provided where
they can fill out the ballot in private and place it into a box to be
counted in private.


in
http://www.linuxfoundation.org/collaborate/workgroups/technical-advisory-board-tab/minutes-july-2015-tab-meeting

Is it "Only one vote per candidate per voter will be counted"?

NeilBrown


signature.asc
Description: PGP signature


Re: nfs race condition

2015-10-22 Thread Neil Brown
Nicolas Parpandet  writes:

> Hello,
>
> I did some more investigations, (wireshark)
>
> The unlink of /tmp/crontab.vYPoHR/crontab is tranlated into a rename
> nfs call ? (.nfs file)

That happens if the file is still open.

>
> NFS   670 V3 READDIRPLUS Reply (Call In 731) crontab .. .
> NFS   210 V3 LOOKUP Call (Reply In 734), DH: 
> 0x038abb3d/.nfs0098008a0019
> NFS   194 V3 LOOKUP Reply (Call In 733) Error: NFS3ERR_NOENT
> NFS   254 V3 RENAME Call (Reply In 736), From DH: 0x038abb3d/crontab To 
> DH: 0x038abb3d/.nfs0098008a0019
> NFS   338 V3 RENAME Reply (Call In 735)
>
> and after, it tries to remove the directory /tmp/crontab.0wuoNx :
> NFS   198 V3 RMDIR Call (Reply In 738), DH: 0x954c28ec/crontab.0wuoNx
> NFS   222 V3 RMDIR Reply (Call In 737) Error: NFS3ERR_NOTEMPTY
>
> but the file /tmp/crontab.0wuoNx/crontab was renamed instead of being removed 
> (nfs network, I cannot see .nfs000 file in fact),
> this prevent the removing of the directory  /tmp/crontab.0wuoNx (notempty)
>
> It seems a race condition, see "nfs silly rename" keyword, but I have
> the problem with mysql temp files, and crontab temp file (crontab -e),

I don't think it is a race condition exactly.  More of an ordering
problem.

Unlinking a file before closing it is rarely useful (sometimes it is,
but not here).

> am I the only one doing mysql and crontabs over NFS !!!?!?,

Probably.

> does anybody have a solution ?

Fix 'crontab' to close before unlinking...

Is this the Debian/Ubuntu crontab program it looks a bit like it,
but this bug was fixed 6 years ago.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=413962

Or use a tmpfs for /tmp/ rather than using NFS???

NeilBrown

>
>
> Regards
>
> Nicolas
>
>
> - Mail original -
> De: "npa" 
> À: linux-kernel@vger.kernel.org
> Cc: "pcoustillas" 
> Envoyé: Dimanche 4 Octobre 2015 09:57:26
> Objet: nfs race condition
>
> Hello, 
>
> I'm doing some " crontab -e, exit with :x" on NFS mount point, 
> and there is a race condition whent removing temporary files : 
>
>
> 7701 write(2, "No modification made\n", 21) = 21 
> 7701 open("/tmp/crontab.vYPoHR", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 
> 4 
> 7701 getdents(4, {{d_ino=73336409, d_off=1, d_reclen=24, d_name="."} 
> {d_ino=73335455, d_off=2, d_reclen=24, d_name=".."} {d_ino=73340547, d_off=3, 
> d_reclen=32, d_name="crontab"}}, 32768) = 80 
> 7701 unlink("/tmp/crontab.vYPoHR/crontab") = 0 
> 7701 getdents(4, {}, 32768) = 0 
> 7701 close(4) = 0 
> 7701 rmdir("/tmp/crontab.vYPoHR") = -1 ENOTEMPTY (Directory not empty) 
> 7701 write(2, "/tmp/crontab.vYPoHR: Directory n"..., 41) = 41 
> 7701 socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 4
>
> The unlink is really done after the rmdir,
> nfsv4 & nfsv3 over ipv6, so the rmdir fails and the directory isn't removed.
> I have same problem with mysql temporary files.
>
> If I mount with "nocto,noac,lookupcache=none" : same problem
> kernel version on server : 3.16.0-4-amd64 (debian8)
> kernel version on client : 2.6.32-39-pve (proxmox), same thing with 
> 3.13.0-63-generic (ubuntu)
>
> Please answer me in CC, I'm not subscribed to the list.
>
> Regards
>
> Nicolas
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


signature.asc
Description: PGP signature


Re: [Ksummit-discuss] Linux Foundation Technical Advisory Board Elections and Nomination process

2015-10-22 Thread Neil Brown
Darren Hart  writes:

>
> Is there a good description of what is expected of a TAB member? How much time
> is involved? What makes a great TAB member?
>
> I've found: http://www.linuxfoundation.org/programs/advisory-councils/tab
>
> I've read the charter and scanned some of the minutes, but I'd still like to
> hear from some of the "incumbants". Specifically, what makes you successful 
> as a
> member of the TAB?

Also on the topic of the TAB and elections, I cannot figure out what:

  "Only one vote per candidate will be counted."

means in:

Voting:
---
Each person present at the time of the election shall be handed a printed
ballot.  Voters may vote for as many candidates as they wish. Only one
vote per candidate will be counted. A space will be provided where
they can fill out the ballot in private and place it into a box to be
counted in private.


in
http://www.linuxfoundation.org/collaborate/workgroups/technical-advisory-board-tab/minutes-july-2015-tab-meeting

Is it "Only one vote per candidate per voter will be counted"?

NeilBrown


signature.asc
Description: PGP signature


Re: [Ksummit-discuss] Linux Foundation Technical Advisory Board Elections and Nomination process

2015-10-22 Thread Neil Brown
Kristen Accardi <kacca...@gmail.com> writes:

> On Oct 22, 2015 7:04 PM, "Neil Brown" <ne...@suse.com> wrote:
>>
>> Darren Hart <dvh...@infradead.org> writes:
>>
>> >
>> > Is there a good description of what is expected of a TAB member? How
> much time
>> > is involved? What makes a great TAB member?
>> >
>> > I've found:
> http://www.linuxfoundation.org/programs/advisory-councils/tab
>> >
>> > I've read the charter and scanned some of the minutes, but I'd still
> like to
>> > hear from some of the "incumbants". Specifically, what makes you
> successful as a
>> > member of the TAB?
>>
>> Also on the topic of the TAB and elections, I cannot figure out what:
>>
>>   "Only one vote per candidate will be counted."
>>
>> means in:
>>
>> Voting:
>> ---
>> Each person present at the time of the election shall be handed a printed
>> ballot.  Voters may vote for as many candidates as they wish. Only one
>> vote per candidate will be counted. A space will be provided where
>> they can fill out the ballot in private and place it into a box to be
>> counted in private.
>>
>>
>> in
>>
> http://www.linuxfoundation.org/collaborate/workgroups/technical-advisory-board-tab/minutes-july-2015-tab-meeting
>>
>> Is it "Only one vote per candidate per voter will be counted"?
>
> Sorry this was a bit confusing.  Basically, it means you can vote for as
> many candidates as you like, but if you can't vote multiple times for the
> same candidate on your ballot.

Thanks.  That was the only explanation I could come up with that made
any sense, but it definitely isn't obvious (to me) that that is the
meaning of the words.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: Possible bug in DM-RAID.

2015-10-20 Thread Neil Brown

Added dm-devel, which is probably the more appropriate list for dm
things.

NeilBrown

Austin S Hemmelgarn  writes:

> I think I've stumbled upon a bug in DM-RAID.  The primary symptom is that when
> creating a new DM-RAID based device (using either LVM or dmsetup) in a RAID1
> configuration, it very quickly claims one by one that all of the disks failed
> except the first, and goes degraded.  When this happens on a given system, the
> disks always 'fail' in the reverse of the order of the mirror numbers.  All of
> the other RAID profiles work just fine.  Curiously, it also only seems to
> happen for 'big' devices (I haven't been able to determine exactly what the
> minimum size is, but I see it 100% of the time with 32G devices, never with 
> 16G
> ones, and only intermittently with 24G).
>
> Here's what I got from dmesg when creating a 32G LVM volume that exhibited
> this issue:
> [66318.401295] device-mapper: raid: Superblocks created for new array
> [66318.450452] md/raid1:mdX: active with 2 out of 2 mirrors
> [66318.450467] Choosing daemon_sleep default (5 sec)
> [66318.450482] created bitmap (32 pages) for device mdX
> [66318.450495] attempt to access beyond end of device
> [66318.450501] dm-91: rw=13329, want=0, limit=8192
> [66318.450506] md: super_written gets error=-5, uptodate=0
> [66318.450513] md/raid1:mdX: Disk failure on dm-92, disabling device.
>md/raid1:mdX: Operation continuing on 1 devices.
> [66318.459815] attempt to access beyond end of device
> [66318.459819] dm-89: rw=13329, want=0, limit=8192
> [66318.459822] md: super_written gets error=-5, uptodate=0
> [66318.492852] attempt to access beyond end of device
> [66318.492862] dm-89: rw=13329, want=0, limit=8192
> [66318.492868] md: super_written gets error=-5, uptodate=0
> [66318.627183] mdX: bitmap file is out of date, doing full recovery
> [66318.714107] mdX: bitmap initialized from disk: read 3 pages, set 65536 of 
> 65536 bits
> [66318.782045] RAID1 conf printout:
> [66318.782054]  --- wd:1 rd:2
> [66318.782061]  disk 0, wo:0, o:1, dev:dm-90
> [66318.782068]  disk 1, wo:1, o:0, dev:dm-92
> [66318.836598] RAID1 conf printout:
> [66318.836607]  --- wd:1 rd:2
> [66318.836614]  disk 0, wo:0, o:1, dev:dm-90
>
> And here's output for a 24G LVM volume that didn't display the issue.
> [66343.407954] device-mapper: raid: Superblocks created for new array
> [66343.479065] md/raid1:mdX: active with 2 out of 2 mirrors
> [66343.479078] Choosing daemon_sleep default (5 sec)
> [66343.479101] created bitmap (24 pages) for device mdX
> [66343.629329] mdX: bitmap file is out of date, doing full recovery
> [66343.677374] mdX: bitmap initialized from disk: read 2 pages, set 49152 of 
> 49152 bits
>
> I'm using a lightly patched version of 4.2.3
> (the source can be found at https://github.com/ferroin/linux)
> but none of the patches I'm using come anywhere near anything in the block 
> layer,
> let alone the DM/MD code.
>
> I've attempted to bisect this, although it got kind of complicated.  So far 
> I've
> determined that the first commit that I see this issue on is d3b178a: md: 
> Skip cluster setup for dm-raid
> Prior to that commit, I can't initialize any dm-raid devices due to the bug 
> it fixes.
> I have not tested anything prior to d51e4fe (the merge commit that pulled in 
> the md-cluster code),
> but I do distinctly remember that I did not see this issue in 3.19.
>
> I'll be happy to provide more info if needed.


signature.asc
Description: PGP signature


Re: Possible bug in DM-RAID.

2015-10-20 Thread Neil Brown

Added dm-devel, which is probably the more appropriate list for dm
things.

NeilBrown

Austin S Hemmelgarn  writes:

> I think I've stumbled upon a bug in DM-RAID.  The primary symptom is that when
> creating a new DM-RAID based device (using either LVM or dmsetup) in a RAID1
> configuration, it very quickly claims one by one that all of the disks failed
> except the first, and goes degraded.  When this happens on a given system, the
> disks always 'fail' in the reverse of the order of the mirror numbers.  All of
> the other RAID profiles work just fine.  Curiously, it also only seems to
> happen for 'big' devices (I haven't been able to determine exactly what the
> minimum size is, but I see it 100% of the time with 32G devices, never with 
> 16G
> ones, and only intermittently with 24G).
>
> Here's what I got from dmesg when creating a 32G LVM volume that exhibited
> this issue:
> [66318.401295] device-mapper: raid: Superblocks created for new array
> [66318.450452] md/raid1:mdX: active with 2 out of 2 mirrors
> [66318.450467] Choosing daemon_sleep default (5 sec)
> [66318.450482] created bitmap (32 pages) for device mdX
> [66318.450495] attempt to access beyond end of device
> [66318.450501] dm-91: rw=13329, want=0, limit=8192
> [66318.450506] md: super_written gets error=-5, uptodate=0
> [66318.450513] md/raid1:mdX: Disk failure on dm-92, disabling device.
>md/raid1:mdX: Operation continuing on 1 devices.
> [66318.459815] attempt to access beyond end of device
> [66318.459819] dm-89: rw=13329, want=0, limit=8192
> [66318.459822] md: super_written gets error=-5, uptodate=0
> [66318.492852] attempt to access beyond end of device
> [66318.492862] dm-89: rw=13329, want=0, limit=8192
> [66318.492868] md: super_written gets error=-5, uptodate=0
> [66318.627183] mdX: bitmap file is out of date, doing full recovery
> [66318.714107] mdX: bitmap initialized from disk: read 3 pages, set 65536 of 
> 65536 bits
> [66318.782045] RAID1 conf printout:
> [66318.782054]  --- wd:1 rd:2
> [66318.782061]  disk 0, wo:0, o:1, dev:dm-90
> [66318.782068]  disk 1, wo:1, o:0, dev:dm-92
> [66318.836598] RAID1 conf printout:
> [66318.836607]  --- wd:1 rd:2
> [66318.836614]  disk 0, wo:0, o:1, dev:dm-90
>
> And here's output for a 24G LVM volume that didn't display the issue.
> [66343.407954] device-mapper: raid: Superblocks created for new array
> [66343.479065] md/raid1:mdX: active with 2 out of 2 mirrors
> [66343.479078] Choosing daemon_sleep default (5 sec)
> [66343.479101] created bitmap (24 pages) for device mdX
> [66343.629329] mdX: bitmap file is out of date, doing full recovery
> [66343.677374] mdX: bitmap initialized from disk: read 2 pages, set 49152 of 
> 49152 bits
>
> I'm using a lightly patched version of 4.2.3
> (the source can be found at https://github.com/ferroin/linux)
> but none of the patches I'm using come anywhere near anything in the block 
> layer,
> let alone the DM/MD code.
>
> I've attempted to bisect this, although it got kind of complicated.  So far 
> I've
> determined that the first commit that I see this issue on is d3b178a: md: 
> Skip cluster setup for dm-raid
> Prior to that commit, I can't initialize any dm-raid devices due to the bug 
> it fixes.
> I have not tested anything prior to d51e4fe (the merge commit that pulled in 
> the md-cluster code),
> but I do distinctly remember that I did not see this issue in 3.19.
>
> I'll be happy to provide more info if needed.


signature.asc
Description: PGP signature


Re: [RFC PATCH 0/3] pwm: omap: Add PWM support using dual-mode timers

2015-10-18 Thread Neil Brown
Neil Armstrong  writes:

> This patch is based on an earlier patch by NeilBrown which is based on
> a older patch from Grant Erickson which provided PWM devices using
> the 'legacy' interface.
>
> The pwm driver was renamed to not be confused with the OMAP4 PWM dedicated
> hardware and was cleaned with the review from Thierry Reding.
>
> The first patch introduces a way to select to dmtimer clock source via a
> clocks binding and a dedicated function wit the legacy fallback.
>
> In order to prepare for the future form of the dmtimer (clksource or whatever)
> the first patch introduces the PWM driver with all the dmtimer calls into a
> platform data structure.
>
> The structure is then filled in plat-omap and added as auxdata for the
> ti,pwm-dmtimer-omap compatible nodes.
>
> Cc: Grant Erickson 
> Cc: NeilBrown 
> Cc: Joachim Eastwood 
> Suggested-by: Tony Lindgren 
>
> Neil Armstrong (3):
>   arm: plat-omap: dmtimer: Add clock source from DT
>   pwm: Add PWM driver for OMAP using dual-mode timers
>   arm: plat-omap: Add PWM dmtimer platforma data quirks
>
>  .../devicetree/bindings/pwm/pwm-omap-dmtimer.txt   |  18 ++
>  arch/arm/mach-omap2/pdata-quirks.c |  23 ++
>  arch/arm/plat-omap/dmtimer.c   |  32 +-
>  drivers/pwm/Kconfig|   9 +
>  drivers/pwm/Makefile   |   1 +
>  drivers/pwm/pwm-omap-dmtimer.c | 322 
> +
>  include/linux/platform_data/pwm_omap_dmtimer.h |  69 +
>  7 files changed, 472 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/pwm/pwm-omap-dmtimer.txt
>  create mode 100644 drivers/pwm/pwm-omap-dmtimer.c
>  create mode 100644 include/linux/platform_data/pwm_omap_dmtimer.h
>

Thanks for much for doing this!
I haven't looked and wont have a chance for a least a couple of weeks,
but I'm very encouraged that someone is pursuing it.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [RFC PATCH 0/3] pwm: omap: Add PWM support using dual-mode timers

2015-10-18 Thread Neil Brown
Neil Armstrong  writes:

> This patch is based on an earlier patch by NeilBrown which is based on
> a older patch from Grant Erickson which provided PWM devices using
> the 'legacy' interface.
>
> The pwm driver was renamed to not be confused with the OMAP4 PWM dedicated
> hardware and was cleaned with the review from Thierry Reding.
>
> The first patch introduces a way to select to dmtimer clock source via a
> clocks binding and a dedicated function wit the legacy fallback.
>
> In order to prepare for the future form of the dmtimer (clksource or whatever)
> the first patch introduces the PWM driver with all the dmtimer calls into a
> platform data structure.
>
> The structure is then filled in plat-omap and added as auxdata for the
> ti,pwm-dmtimer-omap compatible nodes.
>
> Cc: Grant Erickson 
> Cc: NeilBrown 
> Cc: Joachim Eastwood 
> Suggested-by: Tony Lindgren 
>
> Neil Armstrong (3):
>   arm: plat-omap: dmtimer: Add clock source from DT
>   pwm: Add PWM driver for OMAP using dual-mode timers
>   arm: plat-omap: Add PWM dmtimer platforma data quirks
>
>  .../devicetree/bindings/pwm/pwm-omap-dmtimer.txt   |  18 ++
>  arch/arm/mach-omap2/pdata-quirks.c |  23 ++
>  arch/arm/plat-omap/dmtimer.c   |  32 +-
>  drivers/pwm/Kconfig|   9 +
>  drivers/pwm/Makefile   |   1 +
>  drivers/pwm/pwm-omap-dmtimer.c | 322 
> +
>  include/linux/platform_data/pwm_omap_dmtimer.h |  69 +
>  7 files changed, 472 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/pwm/pwm-omap-dmtimer.txt
>  create mode 100644 drivers/pwm/pwm-omap-dmtimer.c
>  create mode 100644 include/linux/platform_data/pwm_omap_dmtimer.h
>

Thanks for much for doing this!
I haven't looked and wont have a chance for a least a couple of weeks,
but I'm very encouraged that someone is pursuing it.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-15 Thread Neil Brown
"J. Bruce Fields"  writes:

> On Thu, Oct 15, 2015 at 11:44:20AM +, Kosuke Tatsukawa wrote:
>> Tatsukawa Kosuke wrote:
>> > J. Bruce Fields wrote:
>> >> Thanks for the detailed investigation.
>> >> 
>> >> I think it would be worth adding a comment if that might help someone
>> >> having to reinvestigate this again some day.
>> > 
>> > It would be nice, but I find it difficult to write a comment in the
>> > sunrpc layer why a memory barrier isn't necessary, using the knowledge
>> > of how nfsd uses it, and the current implementation of the network code.
>> > 
>> > Personally, I would prefer removing the call to waitqueue_active() which
>> > would make the memory barrier totally unnecessary at the cost of a
>> > spin_lock + spin_unlock by unconditionally calling
>> > wake_up_interruptible.
>> 
>> On second thought, the callbacks will be called frequently from the tcp
>> code, so it wouldn't be a good idea.
>
> So, I was even considering documenting it like this, if it's not
> overkill.
>
> Hmm... but if this is right, then we may as well ask why we're doing the
> wakeups at all.  Might be educational to test the code with them
> removed.
>
> --b.
>
> commit 0882cfeb39e0
> Author: J. Bruce Fields 
> Date:   Thu Oct 15 16:53:41 2015 -0400
>
> svcrpc: document lack of some memory barriers.
> 
> Kosuke Tatsukawa points out an odd lack of memory barriers in some sites
> here.  I think the code's correct, but it's probably worth documenting.
> 
> Reported-by: Kosuke Tatsukawa 
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 856407fa085e..90480993ec4a 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -399,6 +399,25 @@ static int svc_sock_secure_port(struct svc_rqst *rqstp)
>   return svc_port_is_privileged(svc_addr(rqstp));
>  }
>  
> +static void svc_no_smp_mb(void)
> +{
> + /*
> +  * Kosuke Tatsukawa points out there should normally be an
> +  * smp_mb() at the callsites of this function.  (Either that or
> +  * we could just drop the waitqueue_active() checks.)
> +  *
> +  * It appears they aren't currently necessary, though, basically
> +  * because nfsd does non-blocking reads from these sockets, so
> +  * the only places we wait on this waitqueue is in sendpage and
> +  * sendmsg, which won't be waiting for wakeups on newly arrived
> +  * data.
> +  *
> +  * Maybe we should add the memory barriers anyway, but these are
> +  * hot paths so we'd need to be convinced there's no sigificant
> +  * penalty.
> +  */
> +}
> +
>  /*
>   * INET callback when data has been received on the socket.
>   */
> @@ -414,7 +433,7 @@ static void svc_udp_data_ready(struct sock *sk)
>   set_bit(XPT_DATA, >sk_xprt.xpt_flags);
>   svc_xprt_enqueue(>sk_xprt);
>   }
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }
> @@ -433,7 +452,7 @@ static void svc_write_space(struct sock *sk)
>   svc_xprt_enqueue(>sk_xprt);
>   }
>  
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq)) {
>   dprintk("RPC svc_write_space: someone sleeping on %p\n",
>  svsk);
> @@ -789,7 +808,7 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
>   }
>  
>   wq = sk_sleep(sk);
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible_all(wq);
>  }
> @@ -811,7 +830,7 @@ static void svc_tcp_state_change(struct sock *sk)
>   set_bit(XPT_CLOSE, >sk_xprt.xpt_flags);
>   svc_xprt_enqueue(>sk_xprt);
>   }
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible_all(wq);
>  }
> @@ -827,7 +846,7 @@ static void svc_tcp_data_ready(struct sock *sk)
>   set_bit(XPT_DATA, >sk_xprt.xpt_flags);
>   svc_xprt_enqueue(>sk_xprt);
>   }
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }
> @@ -1599,7 +1618,7 @@ static void svc_sock_detach(struct svc_xprt *xprt)
>   sk->sk_write_space = svsk->sk_owspace;
>  
>   wq = sk_sleep(sk);
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }

I would feel a lot more comfortable if you instead created:

static inline bool sunrpc_waitqueue_active(struct wait_queue_head *wq)
{
if (!wq)
return false;
/* long comment abot not needing a memory barrier */
return waitqueue_active(wq);
}

and then replace various "if (wq && waitqueue_active(wq))" calls with
if (sunrpc_waitqueue_active(wq))"

The comment seems readable and seems to make sense.

NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-15 Thread Neil Brown
"J. Bruce Fields"  writes:

> On Thu, Oct 15, 2015 at 11:44:20AM +, Kosuke Tatsukawa wrote:
>> Tatsukawa Kosuke wrote:
>> > J. Bruce Fields wrote:
>> >> Thanks for the detailed investigation.
>> >> 
>> >> I think it would be worth adding a comment if that might help someone
>> >> having to reinvestigate this again some day.
>> > 
>> > It would be nice, but I find it difficult to write a comment in the
>> > sunrpc layer why a memory barrier isn't necessary, using the knowledge
>> > of how nfsd uses it, and the current implementation of the network code.
>> > 
>> > Personally, I would prefer removing the call to waitqueue_active() which
>> > would make the memory barrier totally unnecessary at the cost of a
>> > spin_lock + spin_unlock by unconditionally calling
>> > wake_up_interruptible.
>> 
>> On second thought, the callbacks will be called frequently from the tcp
>> code, so it wouldn't be a good idea.
>
> So, I was even considering documenting it like this, if it's not
> overkill.
>
> Hmm... but if this is right, then we may as well ask why we're doing the
> wakeups at all.  Might be educational to test the code with them
> removed.
>
> --b.
>
> commit 0882cfeb39e0
> Author: J. Bruce Fields 
> Date:   Thu Oct 15 16:53:41 2015 -0400
>
> svcrpc: document lack of some memory barriers.
> 
> Kosuke Tatsukawa points out an odd lack of memory barriers in some sites
> here.  I think the code's correct, but it's probably worth documenting.
> 
> Reported-by: Kosuke Tatsukawa 
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 856407fa085e..90480993ec4a 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -399,6 +399,25 @@ static int svc_sock_secure_port(struct svc_rqst *rqstp)
>   return svc_port_is_privileged(svc_addr(rqstp));
>  }
>  
> +static void svc_no_smp_mb(void)
> +{
> + /*
> +  * Kosuke Tatsukawa points out there should normally be an
> +  * smp_mb() at the callsites of this function.  (Either that or
> +  * we could just drop the waitqueue_active() checks.)
> +  *
> +  * It appears they aren't currently necessary, though, basically
> +  * because nfsd does non-blocking reads from these sockets, so
> +  * the only places we wait on this waitqueue is in sendpage and
> +  * sendmsg, which won't be waiting for wakeups on newly arrived
> +  * data.
> +  *
> +  * Maybe we should add the memory barriers anyway, but these are
> +  * hot paths so we'd need to be convinced there's no sigificant
> +  * penalty.
> +  */
> +}
> +
>  /*
>   * INET callback when data has been received on the socket.
>   */
> @@ -414,7 +433,7 @@ static void svc_udp_data_ready(struct sock *sk)
>   set_bit(XPT_DATA, >sk_xprt.xpt_flags);
>   svc_xprt_enqueue(>sk_xprt);
>   }
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }
> @@ -433,7 +452,7 @@ static void svc_write_space(struct sock *sk)
>   svc_xprt_enqueue(>sk_xprt);
>   }
>  
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq)) {
>   dprintk("RPC svc_write_space: someone sleeping on %p\n",
>  svsk);
> @@ -789,7 +808,7 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
>   }
>  
>   wq = sk_sleep(sk);
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible_all(wq);
>  }
> @@ -811,7 +830,7 @@ static void svc_tcp_state_change(struct sock *sk)
>   set_bit(XPT_CLOSE, >sk_xprt.xpt_flags);
>   svc_xprt_enqueue(>sk_xprt);
>   }
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible_all(wq);
>  }
> @@ -827,7 +846,7 @@ static void svc_tcp_data_ready(struct sock *sk)
>   set_bit(XPT_DATA, >sk_xprt.xpt_flags);
>   svc_xprt_enqueue(>sk_xprt);
>   }
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }
> @@ -1599,7 +1618,7 @@ static void svc_sock_detach(struct svc_xprt *xprt)
>   sk->sk_write_space = svsk->sk_owspace;
>  
>   wq = sk_sleep(sk);
> - smp_mb();
> + svc_no_smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }

I would feel a lot more comfortable if you instead created:

static inline bool sunrpc_waitqueue_active(struct wait_queue_head *wq)
{
if (!wq)
return false;
/* long comment abot not needing a memory barrier */
return waitqueue_active(wq);
}

and then replace various "if (wq && waitqueue_active(wq))" calls with
if (sunrpc_waitqueue_active(wq))"

The comment seems readable and seems to make sense.

NeilBrown


signature.asc
Description: 

Re: [PATCH] md: fix 32-bit build warning

2015-10-14 Thread Neil Brown
Arnd Bergmann  writes:

> On Monday 12 October 2015 15:59:27 Neil Brown wrote:
>> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > index 7fff1e6884d6..e13f72a3b561 100644
>> > --- a/drivers/md/md.c
>> > +++ b/drivers/md/md.c
>> > @@ -8987,9 +8987,9 @@ static void check_sb_changes(struct mddev *mddev, 
>> > struct md_rdev *rdev)
>> >  
>> >   /* recovery_cp changed */
>> >   if (le64_to_cpu(sb->resync_offset) != mddev->recovery_cp) {
>> > - pr_info("%s:%d recovery_cp changed from %lu to %lu\n", 
>> > __func__,
>> > - __LINE__, mddev->recovery_cp,
>> > - (unsigned long) 
>> > le64_to_cpu(sb->resync_offset));
>> > + pr_info("%s:%d recovery_cp changed from %llu to %llu\n", 
>> > __func__,
>> > + __LINE__, (u64)mddev->recovery_cp,
>> > + (u64) le64_to_cpu(sb->resync_offset));
>> >   mddev->recovery_cp = le64_to_cpu(sb->resync_offset);
>> >   }
>> >  
>> 
>> Thanks, but is this really right?
>> I think u64 is "unsigned long" on 64bit.
>> I have always used (unsigned long long) when I want to use %llu on
>> sector_t.
>> 
>> How confident are you of using "u64" ?
>
> Very confident ;-)
>
> This used to not work until some linux-2.6 version when we changed all
> architectures to use asm-generic/int-ll64.h in the kernel, because
> a lot of code relied on printing u64 variables using %lld.
>
> I tend to use u64 for things like this because it's shorter than
> 'unsigned long long'.
>

Ahh.. good to know - thanks.

It seems that we've since removed those 'pr_info' lines, so there is
nothing to fix any more.  I'll remember that about using "u64" though -
using "unsigned long long" always felt so clumsy.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] md: fix 32-bit build warning

2015-10-14 Thread Neil Brown
Arnd Bergmann <a...@arndb.de> writes:

> On Monday 12 October 2015 15:59:27 Neil Brown wrote:
>> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > index 7fff1e6884d6..e13f72a3b561 100644
>> > --- a/drivers/md/md.c
>> > +++ b/drivers/md/md.c
>> > @@ -8987,9 +8987,9 @@ static void check_sb_changes(struct mddev *mddev, 
>> > struct md_rdev *rdev)
>> >  
>> >   /* recovery_cp changed */
>> >   if (le64_to_cpu(sb->resync_offset) != mddev->recovery_cp) {
>> > - pr_info("%s:%d recovery_cp changed from %lu to %lu\n", 
>> > __func__,
>> > - __LINE__, mddev->recovery_cp,
>> > - (unsigned long) 
>> > le64_to_cpu(sb->resync_offset));
>> > + pr_info("%s:%d recovery_cp changed from %llu to %llu\n", 
>> > __func__,
>> > + __LINE__, (u64)mddev->recovery_cp,
>> > + (u64) le64_to_cpu(sb->resync_offset));
>> >   mddev->recovery_cp = le64_to_cpu(sb->resync_offset);
>> >   }
>> >  
>> 
>> Thanks, but is this really right?
>> I think u64 is "unsigned long" on 64bit.
>> I have always used (unsigned long long) when I want to use %llu on
>> sector_t.
>> 
>> How confident are you of using "u64" ?
>
> Very confident ;-)
>
> This used to not work until some linux-2.6 version when we changed all
> architectures to use asm-generic/int-ll64.h in the kernel, because
> a lot of code relied on printing u64 variables using %lld.
>
> I tend to use u64 for things like this because it's shorter than
> 'unsigned long long'.
>

Ahh.. good to know - thanks.

It seems that we've since removed those 'pr_info' lines, so there is
nothing to fix any more.  I'll remember that about using "u64" though -
using "unsigned long long" always felt so clumsy.

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] md: fix 32-bit build warning

2015-10-11 Thread Neil Brown
Arnd Bergmann  writes:

> On 32-bit architectures, the md code produces this warning when CONFIG_LDAF
> is set:
>
> drivers/md/md.c: In function 'check_sb_changes':
> drivers/md/md.c:8990:10: warning: format '%lu' expects argument of type 'long 
> unsigned int', but argument 4 has type 'sector_t {aka long long unsigned 
> int}' [-Wformat=]
>pr_info("%s:%d recovery_cp changed from %lu to %lu\n", __func__,
>
> The code was only recently introduced, and uses the wrong format string
> for sector_t. As a workaround, this patch adds an explicit cast to 'u64'
> so we can use the %llu format string on all architectures.
>
> Signed-off-by: Arnd Bergmann 
> Fixes: e0212320066e ("md-cluster: Improve md_reload_sb to be less error 
> prone")
> Cc: Goldwyn Rodrigues 
> ---
>
> I also noticed that some commmits in md/for-next including the one causing
> the problem lack a Signed-off-by line. It might make sense to just fold this
> patch and add the lines at the same time.
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 7fff1e6884d6..e13f72a3b561 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8987,9 +8987,9 @@ static void check_sb_changes(struct mddev *mddev, 
> struct md_rdev *rdev)
>  
>   /* recovery_cp changed */
>   if (le64_to_cpu(sb->resync_offset) != mddev->recovery_cp) {
> - pr_info("%s:%d recovery_cp changed from %lu to %lu\n", __func__,
> - __LINE__, mddev->recovery_cp,
> - (unsigned long) le64_to_cpu(sb->resync_offset));
> + pr_info("%s:%d recovery_cp changed from %llu to %llu\n", 
> __func__,
> + __LINE__, (u64)mddev->recovery_cp,
> + (u64) le64_to_cpu(sb->resync_offset));
>   mddev->recovery_cp = le64_to_cpu(sb->resync_offset);
>   }
>  

Thanks, but is this really right?
I think u64 is "unsigned long" on 64bit.
I have always used (unsigned long long) when I want to use %llu on
sector_t.

How confident are you of using "u64" ?

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] md: fix 32-bit build warning

2015-10-11 Thread Neil Brown
Arnd Bergmann  writes:

> On 32-bit architectures, the md code produces this warning when CONFIG_LDAF
> is set:
>
> drivers/md/md.c: In function 'check_sb_changes':
> drivers/md/md.c:8990:10: warning: format '%lu' expects argument of type 'long 
> unsigned int', but argument 4 has type 'sector_t {aka long long unsigned 
> int}' [-Wformat=]
>pr_info("%s:%d recovery_cp changed from %lu to %lu\n", __func__,
>
> The code was only recently introduced, and uses the wrong format string
> for sector_t. As a workaround, this patch adds an explicit cast to 'u64'
> so we can use the %llu format string on all architectures.
>
> Signed-off-by: Arnd Bergmann 
> Fixes: e0212320066e ("md-cluster: Improve md_reload_sb to be less error 
> prone")
> Cc: Goldwyn Rodrigues 
> ---
>
> I also noticed that some commmits in md/for-next including the one causing
> the problem lack a Signed-off-by line. It might make sense to just fold this
> patch and add the lines at the same time.
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 7fff1e6884d6..e13f72a3b561 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8987,9 +8987,9 @@ static void check_sb_changes(struct mddev *mddev, 
> struct md_rdev *rdev)
>  
>   /* recovery_cp changed */
>   if (le64_to_cpu(sb->resync_offset) != mddev->recovery_cp) {
> - pr_info("%s:%d recovery_cp changed from %lu to %lu\n", __func__,
> - __LINE__, mddev->recovery_cp,
> - (unsigned long) le64_to_cpu(sb->resync_offset));
> + pr_info("%s:%d recovery_cp changed from %llu to %llu\n", 
> __func__,
> + __LINE__, (u64)mddev->recovery_cp,
> + (u64) le64_to_cpu(sb->resync_offset));
>   mddev->recovery_cp = le64_to_cpu(sb->resync_offset);
>   }
>  

Thanks, but is this really right?
I think u64 is "unsigned long" on 64bit.
I have always used (unsigned long long) when I want to use %llu on
sector_t.

How confident are you of using "u64" ?

Thanks,
NeilBrown


signature.asc
Description: PGP signature


[GIT PULL REQUEST] md bugfix

2015-10-10 Thread Neil Brown

The following changes since commit da6fb7a9e5bd6f04f7e15070f630bdf1ea502841:

  md/bitmap: don't pass -1 to bitmap_storage_alloc. (2015-10-02 17:24:13 +1000)

are available in the git repository at:

  git://neil.brown.name/md tags/md/4.3-rc4-fix

for you to fetch changes up to a452744bcbf706eac65abb4c98496a366820c60a:

  crash in md-raid1 and md-raid10 due to incorrect list manipulation 
(2015-10-09 08:33:46 +1100)


One bug fix for raid1/raid10.

Very careless bug earler in 4.3-rc, now fixed :-)


Mikulas Patocka (1):
  crash in md-raid1 and md-raid10 due to incorrect list manipulation

 drivers/md/raid1.c  | 4 ++--
 drivers/md/raid10.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)




signature.asc
Description: PGP signature


[GIT PULL REQUEST] md bugfix

2015-10-10 Thread Neil Brown

The following changes since commit da6fb7a9e5bd6f04f7e15070f630bdf1ea502841:

  md/bitmap: don't pass -1 to bitmap_storage_alloc. (2015-10-02 17:24:13 +1000)

are available in the git repository at:

  git://neil.brown.name/md tags/md/4.3-rc4-fix

for you to fetch changes up to a452744bcbf706eac65abb4c98496a366820c60a:

  crash in md-raid1 and md-raid10 due to incorrect list manipulation 
(2015-10-09 08:33:46 +1100)


One bug fix for raid1/raid10.

Very careless bug earler in 4.3-rc, now fixed :-)


Mikulas Patocka (1):
  crash in md-raid1 and md-raid10 due to incorrect list manipulation

 drivers/md/raid1.c  | 4 ++--
 drivers/md/raid10.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)




signature.asc
Description: PGP signature


Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-08 Thread Neil Brown
Kosuke Tatsukawa  writes:

> There are several places in net/sunrpc/svcsock.c which calls
> waitqueue_active() without calling a memory barrier.  Add a memory
> barrier just as in wq_has_sleeper().
>
> I found this issue when I was looking through the linux source code
> for places calling waitqueue_active() before wake_up*(), but without
> preceding memory barriers, after sending a patch to fix a similar
> issue in drivers/tty/n_tty.c  (Details about the original issue can be
> found here: https://lkml.org/lkml/2015/9/28/849).

hi,
this feels like the wrong approach to the problem.  It requires extra
'smb_mb's to be spread around which are hard to understand as easy to
forget.

A quick look seems to suggest that (nearly) every waitqueue_active()
will need an smb_mb.  Could we just put the smb_mb() inside
waitqueue_active()??

Thanks,
NeilBrown


>
> Signed-off-by: Kosuke Tatsukawa 
> ---
> v2:
>   - Fixed compiler warnings caused by type mismatch
> v1:
>   - https://lkml.org/lkml/2015/10/8/993
> ---
>  net/sunrpc/svcsock.c |6 ++
>  1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 0c81202..ec19444 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -414,6 +414,7 @@ static void svc_udp_data_ready(struct sock *sk)
>   set_bit(XPT_DATA, >sk_xprt.xpt_flags);
>   svc_xprt_enqueue(>sk_xprt);
>   }
> + smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }


signature.asc
Description: PGP signature


Re: [PATCH] crash in md-raid1 and md-raid10 due to incorrect list manipulation

2015-10-08 Thread Neil Brown
Mikulas Patocka  writes:

> The commit 55ce74d4bfe1b936264c637f39a152d1e5ac (md/raid1: ensure 
> device failure recorded before write request returns) is causing crash in 
> the LVM2 testsuite test shell/lvchange-raid.sh. For me the crash is 100% 
> reproducible.
>
> The reason for the crash is that the newly added code in raid1d moves the 
> list from conf->bio_end_io_list to tmp, then tests if tmp is non-empty and 
> then incorrectly pops the bio from conf->bio_end_io_list (which is empty 
> because the list was alrady moved).
>
> Raid-10 has a similar bug.

Ouch.  I can't have been thinking when I wrote that code!

Thanks for finding and fixing this.  Patch will be sent to Linus in time
for next -rc.

Thanks,
NeilBrown

>
> Kernel Fault: Code=15 regs=6ccb8640 (Addr=0001)
> CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35
> task: 6cc1f258 ti: 6ccb8000 task.ti: 6ccb8000
>
>  YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> PSW: 11001110 Not tainted
> r00-03  00ff0804fe0f 1059d000 1059f818 7f16be38
> r04-07  1059d000 7f16be08 00200200 0001
> r08-11  6ccb8260 7b7934d0 0001 
> r12-15  4056f320  00013dd0 
> r16-19  f0d00ae0   0001
> r20-23  080f 42200390  
> r24-27  0001 080f 7f16be08 1059d000
> r28-31  0001 6ccb8560 6ccb8640 
> sr00-03  00249800   00249800
> sr04-07     
>
> IASQ:   IAOQ: 1059f61c 
> 1059f620
>  IIR: 0f8010c6ISR:   IOR: 0001
>  CPU:3   CR30: 6ccb8000 CR31: 
>  ORIG_R28: 1059d000
>  IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1]
>  IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1]
>  RP(r2): raid_end_bio_io+0x88/0x168 [raid1]
> Backtrace:
>  [<1059f818>] raid_end_bio_io+0x88/0x168 [raid1]
>  [<105a4f64>] raid1d+0x144/0x1640 [raid1]
>  [<4017fd5c>] kthread+0x144/0x160
>
> Signed-off-by: Mikulas Patocka 
> Fixes: 55ce74d4bfe1 ("md/raid1: ensure device failure recorded before write 
> request returns.")
> Fixes: 95af587e95aa ("md/raid10: ensure device failure recorded before write 
> request returns.")
>
> ---
>  drivers/md/raid1.c  |4 ++--
>  drivers/md/raid10.c |4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/drivers/md/raid1.c
> ===
> --- linux-2.6.orig/drivers/md/raid1.c 2015-10-01 21:10:05.0 +0200
> +++ linux-2.6/drivers/md/raid1.c  2015-10-01 21:10:58.0 +0200
> @@ -2437,8 +2437,8 @@ static void raid1d(struct md_thread *thr
>   }
>   spin_unlock_irqrestore(>device_lock, flags);
>   while (!list_empty()) {
> - r1_bio = list_first_entry(>bio_end_io_list,
> -   struct r1bio, retry_list);
> + r1_bio = list_first_entry(, struct r1bio,
> +   retry_list);
>   list_del(_bio->retry_list);
>   raid_end_bio_io(r1_bio);
>   }
> Index: linux-2.6/drivers/md/raid10.c
> ===
> --- linux-2.6.orig/drivers/md/raid10.c2015-10-01 21:11:02.0 
> +0200
> +++ linux-2.6/drivers/md/raid10.c 2015-10-01 21:11:19.0 +0200
> @@ -2804,8 +2804,8 @@ static void raid10d(struct md_thread *th
>   }
>   spin_unlock_irqrestore(>device_lock, flags);
>   while (!list_empty()) {
> - r10_bio = list_first_entry(>bio_end_io_list,
> -   struct r10bio, retry_list);
> + r10_bio = list_first_entry(, struct r10bio,
> +retry_list);
>   list_del(_bio->retry_list);
>   raid_end_bio_io(r10_bio);
>   }


signature.asc
Description: PGP signature


Re: [PATCH] crash in md-raid1 and md-raid10 due to incorrect list manipulation

2015-10-08 Thread Neil Brown
Mikulas Patocka  writes:

> The commit 55ce74d4bfe1b936264c637f39a152d1e5ac (md/raid1: ensure 
> device failure recorded before write request returns) is causing crash in 
> the LVM2 testsuite test shell/lvchange-raid.sh. For me the crash is 100% 
> reproducible.
>
> The reason for the crash is that the newly added code in raid1d moves the 
> list from conf->bio_end_io_list to tmp, then tests if tmp is non-empty and 
> then incorrectly pops the bio from conf->bio_end_io_list (which is empty 
> because the list was alrady moved).
>
> Raid-10 has a similar bug.

Ouch.  I can't have been thinking when I wrote that code!

Thanks for finding and fixing this.  Patch will be sent to Linus in time
for next -rc.

Thanks,
NeilBrown

>
> Kernel Fault: Code=15 regs=6ccb8640 (Addr=0001)
> CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35
> task: 6cc1f258 ti: 6ccb8000 task.ti: 6ccb8000
>
>  YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> PSW: 11001110 Not tainted
> r00-03  00ff0804fe0f 1059d000 1059f818 7f16be38
> r04-07  1059d000 7f16be08 00200200 0001
> r08-11  6ccb8260 7b7934d0 0001 
> r12-15  4056f320  00013dd0 
> r16-19  f0d00ae0   0001
> r20-23  080f 42200390  
> r24-27  0001 080f 7f16be08 1059d000
> r28-31  0001 6ccb8560 6ccb8640 
> sr00-03  00249800   00249800
> sr04-07     
>
> IASQ:   IAOQ: 1059f61c 
> 1059f620
>  IIR: 0f8010c6ISR:   IOR: 0001
>  CPU:3   CR30: 6ccb8000 CR31: 
>  ORIG_R28: 1059d000
>  IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1]
>  IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1]
>  RP(r2): raid_end_bio_io+0x88/0x168 [raid1]
> Backtrace:
>  [<1059f818>] raid_end_bio_io+0x88/0x168 [raid1]
>  [<105a4f64>] raid1d+0x144/0x1640 [raid1]
>  [<4017fd5c>] kthread+0x144/0x160
>
> Signed-off-by: Mikulas Patocka 
> Fixes: 55ce74d4bfe1 ("md/raid1: ensure device failure recorded before write 
> request returns.")
> Fixes: 95af587e95aa ("md/raid10: ensure device failure recorded before write 
> request returns.")
>
> ---
>  drivers/md/raid1.c  |4 ++--
>  drivers/md/raid10.c |4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/drivers/md/raid1.c
> ===
> --- linux-2.6.orig/drivers/md/raid1.c 2015-10-01 21:10:05.0 +0200
> +++ linux-2.6/drivers/md/raid1.c  2015-10-01 21:10:58.0 +0200
> @@ -2437,8 +2437,8 @@ static void raid1d(struct md_thread *thr
>   }
>   spin_unlock_irqrestore(>device_lock, flags);
>   while (!list_empty()) {
> - r1_bio = list_first_entry(>bio_end_io_list,
> -   struct r1bio, retry_list);
> + r1_bio = list_first_entry(, struct r1bio,
> +   retry_list);
>   list_del(_bio->retry_list);
>   raid_end_bio_io(r1_bio);
>   }
> Index: linux-2.6/drivers/md/raid10.c
> ===
> --- linux-2.6.orig/drivers/md/raid10.c2015-10-01 21:11:02.0 
> +0200
> +++ linux-2.6/drivers/md/raid10.c 2015-10-01 21:11:19.0 +0200
> @@ -2804,8 +2804,8 @@ static void raid10d(struct md_thread *th
>   }
>   spin_unlock_irqrestore(>device_lock, flags);
>   while (!list_empty()) {
> - r10_bio = list_first_entry(>bio_end_io_list,
> -   struct r10bio, retry_list);
> + r10_bio = list_first_entry(, struct r10bio,
> +retry_list);
>   list_del(_bio->retry_list);
>   raid_end_bio_io(r10_bio);
>   }


signature.asc
Description: PGP signature


Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-08 Thread Neil Brown
Kosuke Tatsukawa  writes:

> There are several places in net/sunrpc/svcsock.c which calls
> waitqueue_active() without calling a memory barrier.  Add a memory
> barrier just as in wq_has_sleeper().
>
> I found this issue when I was looking through the linux source code
> for places calling waitqueue_active() before wake_up*(), but without
> preceding memory barriers, after sending a patch to fix a similar
> issue in drivers/tty/n_tty.c  (Details about the original issue can be
> found here: https://lkml.org/lkml/2015/9/28/849).

hi,
this feels like the wrong approach to the problem.  It requires extra
'smb_mb's to be spread around which are hard to understand as easy to
forget.

A quick look seems to suggest that (nearly) every waitqueue_active()
will need an smb_mb.  Could we just put the smb_mb() inside
waitqueue_active()??

Thanks,
NeilBrown


>
> Signed-off-by: Kosuke Tatsukawa 
> ---
> v2:
>   - Fixed compiler warnings caused by type mismatch
> v1:
>   - https://lkml.org/lkml/2015/10/8/993
> ---
>  net/sunrpc/svcsock.c |6 ++
>  1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 0c81202..ec19444 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -414,6 +414,7 @@ static void svc_udp_data_ready(struct sock *sk)
>   set_bit(XPT_DATA, >sk_xprt.xpt_flags);
>   svc_xprt_enqueue(>sk_xprt);
>   }
> + smp_mb();
>   if (wq && waitqueue_active(wq))
>   wake_up_interruptible(wq);
>  }


signature.asc
Description: PGP signature


[GIT PULL REQUEST] md fixes for 4.3

2015-10-04 Thread Neil Brown

Hi Linus,
 a few md bug fixes.
Thanks,
NeilBrown

The following changes since commit bcee19f424a0d8c26ecf2607b73c690802658b29:

  Merge branch 'for-4.3-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup (2015-09-21 18:26:54 
-0700)

are available in the git repository at:

  git://neil.brown.name/md tags/md/4.3-fixes

for you to fetch changes up to da6fb7a9e5bd6f04f7e15070f630bdf1ea502841:

  md/bitmap: don't pass -1 to bitmap_storage_alloc. (2015-10-02 17:24:13 +1000)


Assorted fixes for md in 4.3-rc

Two tagged for -stable
One is really a cleanup to match and improve kmemcache interface.


Jes Sorensen (1):
  md/raid1: Avoid raid1 resync getting stuck

Julia Lawall (1):
  md: drop null test before destroy functions

NeilBrown (4):
  md: wait for pending superblock updates before switching to read-only
  md/raid5: don't index beyond end of array in need_this_block().
  md/raid0: apply base queue limits *before* disk_stack_limits
  md/bitmap: don't pass -1 to bitmap_storage_alloc.

Shaohua Li (2):
  raid5: update analysis state for failed stripe
  md: clear CHANGE_PENDING in readonly array

 drivers/md/bitmap.c|  3 ++-
 drivers/md/md.c|  5 +
 drivers/md/multipath.c |  3 +--
 drivers/md/raid0.c | 12 ++--
 drivers/md/raid1.c | 11 ---
 drivers/md/raid10.c|  9 +++--
 drivers/md/raid5.c | 11 +++
 7 files changed, 28 insertions(+), 26 deletions(-)


signature.asc
Description: PGP signature


[GIT PULL REQUEST] md fixes for 4.3

2015-10-04 Thread Neil Brown

Hi Linus,
 a few md bug fixes.
Thanks,
NeilBrown

The following changes since commit bcee19f424a0d8c26ecf2607b73c690802658b29:

  Merge branch 'for-4.3-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup (2015-09-21 18:26:54 
-0700)

are available in the git repository at:

  git://neil.brown.name/md tags/md/4.3-fixes

for you to fetch changes up to da6fb7a9e5bd6f04f7e15070f630bdf1ea502841:

  md/bitmap: don't pass -1 to bitmap_storage_alloc. (2015-10-02 17:24:13 +1000)


Assorted fixes for md in 4.3-rc

Two tagged for -stable
One is really a cleanup to match and improve kmemcache interface.


Jes Sorensen (1):
  md/raid1: Avoid raid1 resync getting stuck

Julia Lawall (1):
  md: drop null test before destroy functions

NeilBrown (4):
  md: wait for pending superblock updates before switching to read-only
  md/raid5: don't index beyond end of array in need_this_block().
  md/raid0: apply base queue limits *before* disk_stack_limits
  md/bitmap: don't pass -1 to bitmap_storage_alloc.

Shaohua Li (2):
  raid5: update analysis state for failed stripe
  md: clear CHANGE_PENDING in readonly array

 drivers/md/bitmap.c|  3 ++-
 drivers/md/md.c|  5 +
 drivers/md/multipath.c |  3 +--
 drivers/md/raid0.c | 12 ++--
 drivers/md/raid1.c | 11 ---
 drivers/md/raid10.c|  9 +++--
 drivers/md/raid5.c | 11 +++
 7 files changed, 28 insertions(+), 26 deletions(-)


signature.asc
Description: PGP signature


Re: [PATCH 10/13] twl4030_charger: add software controlled linear charging mode.

2015-10-01 Thread Neil Brown
Pavel Machek  writes:

> On Thu 2015-07-30 10:11:24, NeilBrown wrote:
>> 
>> Add a 'continuous' option for usb charging which enables
>> the "linear" charging mode of the twl4030.
>> 
>> Linear charging does a good job with not-so-reliable power sources.
>> Auto mode does not work well as it switches off when voltage drops
>> momentarily.  Care must be taken not to over-charge.
>
> Can you explain how the user can "care not to over-charge"?

The following text reads:

It was used with a bike hub dynamo since a year or so. In that case
there are automatically charging stops when the cyclist needs a break.

so: take a break from cycling occasionally.

>
>> @@ -750,6 +784,17 @@ static int twl4030_bci_get_property(struct power_supply 
>> *psy,
>>  is_charging = state & TWL4030_MSTATEC_USB;
>>  else
>>  is_charging = state & TWL4030_MSTATEC_AC;
>> +if (!is_charging) {
>> +u8 s;
>> +twl4030_bci_read(TWL4030_BCIMDEN, );
>> +if (psy->desc->type == POWER_SUPPLY_TYPE_USB)
>> +is_charging = s & 1;
>> +else
>> +is_charging = s & 2;
>> +if (is_charging)
>> +/* A little white lie */
>> +state = TWL4030_MSTATEC_QUICK1;
>
> I'm not sure... can't this white lie turn into black smoke?
>
> Like.. normally, when battery is below something (like 3.5V) it must
> not be quick-charged (because something is very wrong with it). Are
> you just forcing the quick charge here?

No.  That value is only used for reporting a status via sysfs.
That 'lie' gets translated to "POWER_SUPPLY_STATUS_CHARGING" which is
not a lie.

Thanks,
NeilBrown


>
>   Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: PGP signature


Re: [PATCH 10/13] twl4030_charger: add software controlled linear charging mode.

2015-10-01 Thread Neil Brown
Pavel Machek  writes:

> On Thu 2015-07-30 10:11:24, NeilBrown wrote:
>> 
>> Add a 'continuous' option for usb charging which enables
>> the "linear" charging mode of the twl4030.
>> 
>> Linear charging does a good job with not-so-reliable power sources.
>> Auto mode does not work well as it switches off when voltage drops
>> momentarily.  Care must be taken not to over-charge.
>
> Can you explain how the user can "care not to over-charge"?

The following text reads:

It was used with a bike hub dynamo since a year or so. In that case
there are automatically charging stops when the cyclist needs a break.

so: take a break from cycling occasionally.

>
>> @@ -750,6 +784,17 @@ static int twl4030_bci_get_property(struct power_supply 
>> *psy,
>>  is_charging = state & TWL4030_MSTATEC_USB;
>>  else
>>  is_charging = state & TWL4030_MSTATEC_AC;
>> +if (!is_charging) {
>> +u8 s;
>> +twl4030_bci_read(TWL4030_BCIMDEN, );
>> +if (psy->desc->type == POWER_SUPPLY_TYPE_USB)
>> +is_charging = s & 1;
>> +else
>> +is_charging = s & 2;
>> +if (is_charging)
>> +/* A little white lie */
>> +state = TWL4030_MSTATEC_QUICK1;
>
> I'm not sure... can't this white lie turn into black smoke?
>
> Like.. normally, when battery is below something (like 3.5V) it must
> not be quick-charged (because something is very wrong with it). Are
> you just forcing the quick charge here?

No.  That value is only used for reporting a status via sysfs.
That 'lie' gets translated to "POWER_SUPPLY_STATUS_CHARGING" which is
not a lie.

Thanks,
NeilBrown


>
>   Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: PGP signature


Re: [PATCH 1/3] BTRFS: support NFSv2 export

2015-09-23 Thread Neil Brown
David Sterba  writes:

> On Fri, May 08, 2015 at 10:16:23AM +1000, NeilBrown wrote:
>> The "fh_len" passed to ->fh_to_* is not guaranteed to be that same as
>> that returned by encode_fh - it may be larger.
>> 
>> With NFSv2, the filehandle is fixed length, so it may appear longer
>> than expected and be zero-padded.
>> 
>> So we must test that fh_len is at least some value, not exactly equal
>> to it.
>> 
>> Signed-off-by: NeilBrown 
>
> Acked-by: David Sterba 

Thanks.  However I just checked mainline and this still hasn't been
applied.
Should I resend it to someone?  Who?

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: defects for uses of abs(u64) (was: Re: Regression: can't apply frequency offsets above 1000ppm)

2015-09-23 Thread Neil Brown
Joe Perches  writes:

> On Fri, 2015-09-04 at 18:00 -0700, John Stultz wrote:
>> On Fri, Sep 4, 2015 at 5:57 PM, John Stultz  wrote:
>> > On Thu, Sep 3, 2015 at 4:26 AM, Miroslav Lichvar  
>> > wrote:
>> >> On Wed, Sep 02, 2015 at 04:16:00PM -0700, John Stultz wrote:
>> >>> On Tue, Sep 1, 2015 at 6:14 PM, Nuno Gonçalves  wrote:
>> >>> > And just installing chrony from the feeds. With any kernel from 3.17
>> >>> > you'll have wrong estimates at chronyc sourcestats.
>> >>>
>> >>> Wrong estimates? Could you be more specific about what the failure
>> >>> you're seeing is here? The
>> >>>
>> >>> I installed the image above, which comes with a 4.1.6 kernel, and
>> >>> chrony seems to have gotten my BBB into ~1ms sync w/ servers over the
>> >>> internet fairly quickly (at least according to chronyc tracking).
>> >>
>> >> To see the bug with chronyd the initial offset shouldn't be very close
>> >> to zero, so it's forced to correct the offset by adjusting the
>> >> frequency in a larger step.
>> >>
>> >> I'm attaching a simple C program that prints the frequency offset
>> >> as measured between the REALTIME and MONOTONIC_RAW clocks when the
>> >> adjtimex tick is set to 9000. It should show values close to -10
>> >> ppm and I suspect on the BBB it will be much smaller.
>> >
>> > So I spent some time on this late last night and this afternoon.
>> >
>> > It was a little odd because things don't seem totally broken, but
>> > something isn't quite right.
>> >
>> > Digging around it seems the iterative logrithmic approximation done in
>> > timekeeping_freqadjust() wasn't working right. Instead of making
>> > smaller order alternating positive and negative adjustments, it was
>> > doing strange growing adjustments for the same value that wern't large
>> > enough to actually correct things very quickly. This made it much
>> > slower to adapt to specified frequency values.
>> >
>> > The odd bit, is it seems to come down to:
>> > tick_error = abs(tick_error);
>> >
>> > Haven't chased down why yet, but apparently abs() isn't doing what one
>> > would think when passed a s64 value.
>> 
>> Well.. chasing it down wasn't hard.. from include/linux/kernel.h:
>> /*
>>  * abs() handles unsigned and signed longs, ints, shorts and chars.  For all
>>  * input types abs() returns a signed long.
>>  * abs() should not be used for 64-bit types (s64, u64, long long) - use 
>> abs64()
>>  * for those.
>>  */
>> 
>> Ouch.
>
> Here's a little cocci script that finds more of these in:

Thanks.

Maybe we should also:

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 5582410727cb..aa7d69afdcac 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -208,6 +208,7 @@ extern int _cond_resched(void);
  */
 #define abs(x) ({  \
long ret;   \
+   BUILD_BUG_ON(sizeof(x) > sizeof(long)); \
if (sizeof(x) == sizeof(long)) {\
long __x = (x); \
ret = (__x < 0) ? -__x : __x;   \


so that people won't make the same mistake again.
That finds bugs in
 driver/md/raid10.c
 drivers/gpu/drm/radeon/radeon_display.c
 kernel/time/clocksource.c
 kernel/time/timekeeping.c
 fs/ext4/mballoc.c
 
that your cocci scripted missed.  All "abs(x - y)".

As sector_t can be 32bit and can be 64bit, I wonder if abs_sector()
would be a good idea ... probably not.

Thoughts?

NeilBrown


signature.asc
Description: PGP signature


Re: defects for uses of abs(u64) (was: Re: Regression: can't apply frequency offsets above 1000ppm)

2015-09-23 Thread Neil Brown
Joe Perches  writes:

> On Fri, 2015-09-04 at 18:00 -0700, John Stultz wrote:
>> On Fri, Sep 4, 2015 at 5:57 PM, John Stultz  wrote:
>> > On Thu, Sep 3, 2015 at 4:26 AM, Miroslav Lichvar  
>> > wrote:
>> >> On Wed, Sep 02, 2015 at 04:16:00PM -0700, John Stultz wrote:
>> >>> On Tue, Sep 1, 2015 at 6:14 PM, Nuno Gonçalves  wrote:
>> >>> > And just installing chrony from the feeds. With any kernel from 3.17
>> >>> > you'll have wrong estimates at chronyc sourcestats.
>> >>>
>> >>> Wrong estimates? Could you be more specific about what the failure
>> >>> you're seeing is here? The
>> >>>
>> >>> I installed the image above, which comes with a 4.1.6 kernel, and
>> >>> chrony seems to have gotten my BBB into ~1ms sync w/ servers over the
>> >>> internet fairly quickly (at least according to chronyc tracking).
>> >>
>> >> To see the bug with chronyd the initial offset shouldn't be very close
>> >> to zero, so it's forced to correct the offset by adjusting the
>> >> frequency in a larger step.
>> >>
>> >> I'm attaching a simple C program that prints the frequency offset
>> >> as measured between the REALTIME and MONOTONIC_RAW clocks when the
>> >> adjtimex tick is set to 9000. It should show values close to -10
>> >> ppm and I suspect on the BBB it will be much smaller.
>> >
>> > So I spent some time on this late last night and this afternoon.
>> >
>> > It was a little odd because things don't seem totally broken, but
>> > something isn't quite right.
>> >
>> > Digging around it seems the iterative logrithmic approximation done in
>> > timekeeping_freqadjust() wasn't working right. Instead of making
>> > smaller order alternating positive and negative adjustments, it was
>> > doing strange growing adjustments for the same value that wern't large
>> > enough to actually correct things very quickly. This made it much
>> > slower to adapt to specified frequency values.
>> >
>> > The odd bit, is it seems to come down to:
>> > tick_error = abs(tick_error);
>> >
>> > Haven't chased down why yet, but apparently abs() isn't doing what one
>> > would think when passed a s64 value.
>> 
>> Well.. chasing it down wasn't hard.. from include/linux/kernel.h:
>> /*
>>  * abs() handles unsigned and signed longs, ints, shorts and chars.  For all
>>  * input types abs() returns a signed long.
>>  * abs() should not be used for 64-bit types (s64, u64, long long) - use 
>> abs64()
>>  * for those.
>>  */
>> 
>> Ouch.
>
> Here's a little cocci script that finds more of these in:

Thanks.

Maybe we should also:

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 5582410727cb..aa7d69afdcac 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -208,6 +208,7 @@ extern int _cond_resched(void);
  */
 #define abs(x) ({  \
long ret;   \
+   BUILD_BUG_ON(sizeof(x) > sizeof(long)); \
if (sizeof(x) == sizeof(long)) {\
long __x = (x); \
ret = (__x < 0) ? -__x : __x;   \


so that people won't make the same mistake again.
That finds bugs in
 driver/md/raid10.c
 drivers/gpu/drm/radeon/radeon_display.c
 kernel/time/clocksource.c
 kernel/time/timekeeping.c
 fs/ext4/mballoc.c
 
that your cocci scripted missed.  All "abs(x - y)".

As sector_t can be 32bit and can be 64bit, I wonder if abs_sector()
would be a good idea ... probably not.

Thoughts?

NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH 1/3] BTRFS: support NFSv2 export

2015-09-23 Thread Neil Brown
David Sterba  writes:

> On Fri, May 08, 2015 at 10:16:23AM +1000, NeilBrown wrote:
>> The "fh_len" passed to ->fh_to_* is not guaranteed to be that same as
>> that returned by encode_fh - it may be larger.
>> 
>> With NFSv2, the filehandle is fixed length, so it may appear longer
>> than expected and be zero-padded.
>> 
>> So we must test that fh_len is at least some value, not exactly equal
>> to it.
>> 
>> Signed-off-by: NeilBrown 
>
> Acked-by: David Sterba 

Thanks.  However I just checked mainline and this still hasn't been
applied.
Should I resend it to someone?  Who?

Thanks,
NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH 17/39] md: drop null test before destroy functions

2015-09-15 Thread Neil Brown
Julia Lawall  writes:

> Remove unneeded NULL test.
>
> The semantic patch that makes this change is as follows:
> (http://coccinelle.lip6.fr/)
>
> // 
> @@ expression x; @@
> -if (x != NULL)
>   \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
> // 
>
> Signed-off-by: Julia Lawall 
>
> ---
>  drivers/md/multipath.c |3 +--
>  drivers/md/raid1.c |6 ++
>  drivers/md/raid10.c|9 +++--
>  drivers/md/raid5.c |3 +--
>  4 files changed, 7 insertions(+), 14 deletions(-)

Applied, thanks.

NeilBrown


>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 15ef2c6..09a12d7 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2271,8 +2271,7 @@ static void shrink_stripes(struct r5conf *conf)
>  drop_one_stripe(conf))
>   ;
>  
> - if (conf->slab_cache)
> - kmem_cache_destroy(conf->slab_cache);
> + kmem_cache_destroy(conf->slab_cache);
>   conf->slab_cache = NULL;
>  }
>  
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 4517f06..5f4f553 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2843,8 +2843,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
>  
>   abort:
>   if (conf) {
> - if (conf->r1bio_pool)
> - mempool_destroy(conf->r1bio_pool);
> + mempool_destroy(conf->r1bio_pool);
>   kfree(conf->mirrors);
>   safe_put_page(conf->tmppage);
>   kfree(conf->poolinfo);
> @@ -2946,8 +2945,7 @@ static void raid1_free(struct mddev *mddev, void *priv)
>  {
>   struct r1conf *conf = priv;
>  
> - if (conf->r1bio_pool)
> - mempool_destroy(conf->r1bio_pool);
> + mempool_destroy(conf->r1bio_pool);
>   kfree(conf->mirrors);
>   safe_put_page(conf->tmppage);
>   kfree(conf->poolinfo);
> diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
> index d222522..d132f06 100644
> --- a/drivers/md/multipath.c
> +++ b/drivers/md/multipath.c
> @@ -470,8 +470,7 @@ static int multipath_run (struct mddev *mddev)
>   return 0;
>  
>  out_free_conf:
> - if (conf->pool)
> - mempool_destroy(conf->pool);
> + mempool_destroy(conf->pool);
>   kfree(conf->multipaths);
>   kfree(conf);
>   mddev->private = NULL;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 0fc33eb..7c99a40 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3486,8 +3486,7 @@ static struct r10conf *setup_conf(struct mddev *mddev)
>   printk(KERN_ERR "md/raid10:%s: couldn't allocate memory.\n",
>  mdname(mddev));
>   if (conf) {
> - if (conf->r10bio_pool)
> - mempool_destroy(conf->r10bio_pool);
> + mempool_destroy(conf->r10bio_pool);
>   kfree(conf->mirrors);
>   safe_put_page(conf->tmppage);
>   kfree(conf);
> @@ -3682,8 +3681,7 @@ static int run(struct mddev *mddev)
>  
>  out_free_conf:
>   md_unregister_thread(>thread);
> - if (conf->r10bio_pool)
> - mempool_destroy(conf->r10bio_pool);
> + mempool_destroy(conf->r10bio_pool);
>   safe_put_page(conf->tmppage);
>   kfree(conf->mirrors);
>   kfree(conf);
> @@ -3696,8 +3694,7 @@ static void raid10_free(struct mddev *mddev, void *priv)
>  {
>   struct r10conf *conf = priv;
>  
> - if (conf->r10bio_pool)
> - mempool_destroy(conf->r10bio_pool);
> + mempool_destroy(conf->r10bio_pool);
>   safe_put_page(conf->tmppage);
>   kfree(conf->mirrors);
>   kfree(conf->mirrors_old);


signature.asc
Description: PGP signature


Re: [PATCH 17/39] md: drop null test before destroy functions

2015-09-15 Thread Neil Brown
Julia Lawall  writes:

> Remove unneeded NULL test.
>
> The semantic patch that makes this change is as follows:
> (http://coccinelle.lip6.fr/)
>
> // 
> @@ expression x; @@
> -if (x != NULL)
>   \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
> // 
>
> Signed-off-by: Julia Lawall 
>
> ---
>  drivers/md/multipath.c |3 +--
>  drivers/md/raid1.c |6 ++
>  drivers/md/raid10.c|9 +++--
>  drivers/md/raid5.c |3 +--
>  4 files changed, 7 insertions(+), 14 deletions(-)

Applied, thanks.

NeilBrown


>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 15ef2c6..09a12d7 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2271,8 +2271,7 @@ static void shrink_stripes(struct r5conf *conf)
>  drop_one_stripe(conf))
>   ;
>  
> - if (conf->slab_cache)
> - kmem_cache_destroy(conf->slab_cache);
> + kmem_cache_destroy(conf->slab_cache);
>   conf->slab_cache = NULL;
>  }
>  
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 4517f06..5f4f553 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2843,8 +2843,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
>  
>   abort:
>   if (conf) {
> - if (conf->r1bio_pool)
> - mempool_destroy(conf->r1bio_pool);
> + mempool_destroy(conf->r1bio_pool);
>   kfree(conf->mirrors);
>   safe_put_page(conf->tmppage);
>   kfree(conf->poolinfo);
> @@ -2946,8 +2945,7 @@ static void raid1_free(struct mddev *mddev, void *priv)
>  {
>   struct r1conf *conf = priv;
>  
> - if (conf->r1bio_pool)
> - mempool_destroy(conf->r1bio_pool);
> + mempool_destroy(conf->r1bio_pool);
>   kfree(conf->mirrors);
>   safe_put_page(conf->tmppage);
>   kfree(conf->poolinfo);
> diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
> index d222522..d132f06 100644
> --- a/drivers/md/multipath.c
> +++ b/drivers/md/multipath.c
> @@ -470,8 +470,7 @@ static int multipath_run (struct mddev *mddev)
>   return 0;
>  
>  out_free_conf:
> - if (conf->pool)
> - mempool_destroy(conf->pool);
> + mempool_destroy(conf->pool);
>   kfree(conf->multipaths);
>   kfree(conf);
>   mddev->private = NULL;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 0fc33eb..7c99a40 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -3486,8 +3486,7 @@ static struct r10conf *setup_conf(struct mddev *mddev)
>   printk(KERN_ERR "md/raid10:%s: couldn't allocate memory.\n",
>  mdname(mddev));
>   if (conf) {
> - if (conf->r10bio_pool)
> - mempool_destroy(conf->r10bio_pool);
> + mempool_destroy(conf->r10bio_pool);
>   kfree(conf->mirrors);
>   safe_put_page(conf->tmppage);
>   kfree(conf);
> @@ -3682,8 +3681,7 @@ static int run(struct mddev *mddev)
>  
>  out_free_conf:
>   md_unregister_thread(>thread);
> - if (conf->r10bio_pool)
> - mempool_destroy(conf->r10bio_pool);
> + mempool_destroy(conf->r10bio_pool);
>   safe_put_page(conf->tmppage);
>   kfree(conf->mirrors);
>   kfree(conf);
> @@ -3696,8 +3694,7 @@ static void raid10_free(struct mddev *mddev, void *priv)
>  {
>   struct r10conf *conf = priv;
>  
> - if (conf->r10bio_pool)
> - mempool_destroy(conf->r10bio_pool);
> + mempool_destroy(conf->r10bio_pool);
>   safe_put_page(conf->tmppage);
>   kfree(conf->mirrors);
>   kfree(conf->mirrors_old);


signature.asc
Description: PGP signature


[GIT PULL REQUEST] md updates for 4.3

2015-09-05 Thread Neil Brown

Please pull these updates.  I've already merged with the 'block' tree
to resolve a few simple conflicts.

Thanks,
NeilBrown


The following changes since commit 1081230b748de8f03f37f80c53dfa89feda9b8de:

  Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block (2015-09-02 
13:10:25 -0700)

are available in the git repository at:

  git://neil.brown.name/md tags/md/4.3

for you to fetch changes up to e89c6fdf9e0eb1b5a03574d4ca73e83eae8deb91:

  Merge linux-block/for-4.3/core into md/for-linux (2015-09-05 11:08:32 +0200)


md updates for 4.3

- An assortment of little fixes, several for minor races only likely
  to be hit during testing
- further cluster-md-raid1 development, not ready for real use yet.
- new RAID6 syndrome code for ARM NEON
- fix a race where a write can return before failure of one device
  is properly recorded in metadata, so an immediate crash might result
  in that write being lost.


Ard Biesheuvel (1):
  md/raid6: delta syndrome for ARM NEON

Benjamin Randazzo (1):
  md: simplify get_bitmap_file now that "file" is zeroed.

Guoqing Jiang (11):
  md-cluster: use %pU to print UUIDs
  md-cluster: split recover_slot for future code reuse
  md-cluster: transfer the resync ownership to another node
  md-cluster: fix deadlock issue on message lock
  md-cluster: init completion within lockres_init
  md-cluster: add the error check if failed to get dlm lock
  md-cluster: init suspend_list and suspend_lock early in join
  md-cluster: remove the unused sb_lock
  md-cluster: add missed lockres_free
  md-cluster: only call complete(>completion) when node join cluster
  md-cluster: Read the disk bitmap sb and check if it needs recovery

NeilBrown (20):
  md/raid0: update queue parameter in a safer location.
  md: Keep /proc/mdstat reporting recovery until fully DONE.
  md: close some races between setting and checking sync_action.
  md/raid5: consider updating reshape_position at start of reshape.
  md/raid10: fix a few typos in comments
  md/raid5: always set conf->prev_chunk_sectors and ->prev_algo
  md/raid5: switch to use conf->chunk_sectors in place of 
mddev->chunk_sectors where possible
  md/raid5: strengthen check on reshape_position at run.
  md/raid5: remove incorrect "min_t()" when calculating writepos.
  md: set MD_RECOVERY_RECOVER when starting a degraded array.
  md: be careful when testing resync_max against curr_resync_completed.
  md: sync sync_completed has correct value as recovery finishes.
  md/raid5: handle possible race as reshape completes.
  md: extend spinlock protection in register_md_cluster_operations
  md-cluster: remove inappropriate try_module_get from join()
  md/raid1: ensure device failure recorded before write request returns.
  md/raid10: ensure device failure recorded before write request returns.
  md/raid5: use bio_list for the list of bios to return.
  md/raid5: ensure device failure recorded before write request returns.
  Merge linux-block/for-4.3/core into md/for-linux

Sasha Levin (1):
  md: setup safemode_timer before it's being used

 Documentation/md-cluster.txt |   4 +-
 drivers/md/md-cluster.c  | 159 ---
 drivers/md/md.c  | 110 +++---
 drivers/md/raid0.c   |  75 ++--
 drivers/md/raid1.c   |  30 +++-
 drivers/md/raid1.h   |   5 ++
 drivers/md/raid10.c  |  33 -
 drivers/md/raid10.h  |   6 ++
 drivers/md/raid5.c   | 140 ++---
 drivers/md/raid5.h   |   5 +-
 lib/raid6/neon.c |  13 +++-
 lib/raid6/neon.uc|  46 +
 12 files changed, 433 insertions(+), 193 deletions(-)


signature.asc
Description: PGP signature


[GIT PULL REQUEST] md updates for 4.3

2015-09-05 Thread Neil Brown

Please pull these updates.  I've already merged with the 'block' tree
to resolve a few simple conflicts.

Thanks,
NeilBrown


The following changes since commit 1081230b748de8f03f37f80c53dfa89feda9b8de:

  Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block (2015-09-02 
13:10:25 -0700)

are available in the git repository at:

  git://neil.brown.name/md tags/md/4.3

for you to fetch changes up to e89c6fdf9e0eb1b5a03574d4ca73e83eae8deb91:

  Merge linux-block/for-4.3/core into md/for-linux (2015-09-05 11:08:32 +0200)


md updates for 4.3

- An assortment of little fixes, several for minor races only likely
  to be hit during testing
- further cluster-md-raid1 development, not ready for real use yet.
- new RAID6 syndrome code for ARM NEON
- fix a race where a write can return before failure of one device
  is properly recorded in metadata, so an immediate crash might result
  in that write being lost.


Ard Biesheuvel (1):
  md/raid6: delta syndrome for ARM NEON

Benjamin Randazzo (1):
  md: simplify get_bitmap_file now that "file" is zeroed.

Guoqing Jiang (11):
  md-cluster: use %pU to print UUIDs
  md-cluster: split recover_slot for future code reuse
  md-cluster: transfer the resync ownership to another node
  md-cluster: fix deadlock issue on message lock
  md-cluster: init completion within lockres_init
  md-cluster: add the error check if failed to get dlm lock
  md-cluster: init suspend_list and suspend_lock early in join
  md-cluster: remove the unused sb_lock
  md-cluster: add missed lockres_free
  md-cluster: only call complete(>completion) when node join cluster
  md-cluster: Read the disk bitmap sb and check if it needs recovery

NeilBrown (20):
  md/raid0: update queue parameter in a safer location.
  md: Keep /proc/mdstat reporting recovery until fully DONE.
  md: close some races between setting and checking sync_action.
  md/raid5: consider updating reshape_position at start of reshape.
  md/raid10: fix a few typos in comments
  md/raid5: always set conf->prev_chunk_sectors and ->prev_algo
  md/raid5: switch to use conf->chunk_sectors in place of 
mddev->chunk_sectors where possible
  md/raid5: strengthen check on reshape_position at run.
  md/raid5: remove incorrect "min_t()" when calculating writepos.
  md: set MD_RECOVERY_RECOVER when starting a degraded array.
  md: be careful when testing resync_max against curr_resync_completed.
  md: sync sync_completed has correct value as recovery finishes.
  md/raid5: handle possible race as reshape completes.
  md: extend spinlock protection in register_md_cluster_operations
  md-cluster: remove inappropriate try_module_get from join()
  md/raid1: ensure device failure recorded before write request returns.
  md/raid10: ensure device failure recorded before write request returns.
  md/raid5: use bio_list for the list of bios to return.
  md/raid5: ensure device failure recorded before write request returns.
  Merge linux-block/for-4.3/core into md/for-linux

Sasha Levin (1):
  md: setup safemode_timer before it's being used

 Documentation/md-cluster.txt |   4 +-
 drivers/md/md-cluster.c  | 159 ---
 drivers/md/md.c  | 110 +++---
 drivers/md/raid0.c   |  75 ++--
 drivers/md/raid1.c   |  30 +++-
 drivers/md/raid1.h   |   5 ++
 drivers/md/raid10.c  |  33 -
 drivers/md/raid10.h  |   6 ++
 drivers/md/raid5.c   | 140 ++---
 drivers/md/raid5.h   |   5 +-
 lib/raid6/neon.c |  13 +++-
 lib/raid6/neon.uc|  46 +
 12 files changed, 433 insertions(+), 193 deletions(-)


signature.asc
Description: PGP signature


Re: [PATCH 03/13] twl4030_charger: correctly handle -EPROBE_DEFER from devm_usb_get_phy_by_node

2015-09-02 Thread Neil Brown
Kevin Hilman  writes:

> ping... this boot failure has now landed in mainline

sorry, I'm on leave at the moment and travelling so I'm unlikely to be
able to look at this properly.  I should be able to examine this issue
before the end of the month but cannot promise sooner than that (though
it is not impossible).

Maybe it would be best to just revert it until a proper analysis can be
done.

NeilBrown


>
> On Thu, Aug 27, 2015 at 1:51 PM, Kevin Hilman  wrote:
>> On Wed, Jul 29, 2015 at 5:11 PM, NeilBrown  wrote:
>>> Now that twl4030_bci_probe can safely return -EPROBE_DEFER,
>>> do so when devm_usb_get_phy_by_node returns that error.
>>>
>>> Signed-off-by: NeilBrown 
>>
>> This patch has hit linux-next in the form of coommit 3fc3895e4fe1
>> (twl4030_charger: correctly handle -EPROBE_DEFER from
>> devm_usb_get_phy_by_node) and kernelci.org found a regression on
>> omap3-beagle-xm[1].  Bisecting[2] this boot failure pointed at this
>> commit, and I verified that reverting it on top of next-20150827 gets
>> the board booting again.  I haven't debugged any further.
>>
>> Kevin
>>
>> [1] 
>> http://storage.kernelci.org/next/next-20150827/arm-omap2plus_defconfig/lab-khilman/boot-omap3-beagle-xm.html
>> [2] https://ci.linaro.org/view/people/job/tbaker-boot-bisect-bot/88/console


signature.asc
Description: PGP signature


Re: [PATCH 03/13] twl4030_charger: correctly handle -EPROBE_DEFER from devm_usb_get_phy_by_node

2015-09-02 Thread Neil Brown
Kevin Hilman  writes:

> ping... this boot failure has now landed in mainline

sorry, I'm on leave at the moment and travelling so I'm unlikely to be
able to look at this properly.  I should be able to examine this issue
before the end of the month but cannot promise sooner than that (though
it is not impossible).

Maybe it would be best to just revert it until a proper analysis can be
done.

NeilBrown


>
> On Thu, Aug 27, 2015 at 1:51 PM, Kevin Hilman  wrote:
>> On Wed, Jul 29, 2015 at 5:11 PM, NeilBrown  wrote:
>>> Now that twl4030_bci_probe can safely return -EPROBE_DEFER,
>>> do so when devm_usb_get_phy_by_node returns that error.
>>>
>>> Signed-off-by: NeilBrown 
>>
>> This patch has hit linux-next in the form of coommit 3fc3895e4fe1
>> (twl4030_charger: correctly handle -EPROBE_DEFER from
>> devm_usb_get_phy_by_node) and kernelci.org found a regression on
>> omap3-beagle-xm[1].  Bisecting[2] this boot failure pointed at this
>> commit, and I verified that reverting it on top of next-20150827 gets
>> the board booting again.  I haven't debugged any further.
>>
>> Kevin
>>
>> [1] 
>> http://storage.kernelci.org/next/next-20150827/arm-omap2plus_defconfig/lab-khilman/boot-omap3-beagle-xm.html
>> [2] https://ci.linaro.org/view/people/job/tbaker-boot-bisect-bot/88/console


signature.asc
Description: PGP signature


Re: clustered MD

2015-06-10 Thread Neil Brown
On Wed, 10 Jun 2015 16:07:44 -0500
David Teigland  wrote:

> On Thu, Jun 11, 2015 at 06:31:31AM +1000, Neil Brown wrote:
> > What is your interest in this?  I'm always happy for open discussion and
> > varied input, but it would help to know to what extent you are a stake
> > holder?
> 
> Using the dlm correctly is non-trivial and should be reviewed.
> If the dlm is misused, some part of that may fall in my lap, if
> only so far as having to debug problems to distinguish between dlm
> bugs or md-cluster bugs.  This has been learned the hard way.
> 
> I have yet to find time to look up the previous review discussion.
> I will be more than happy if I find the dlm usage has already been
> thoroughly reviewed.

The DLM usage is the part that I am least comfortable with and I would
certainly welcome review.   There was a recent discussion of some issue that I
haven't had a chance to go over yet, but apart from that it has mostly just
been a few developers trying to figure out what we need and how that can be
implemented.

There are (as I recall) two main aspects of the DLM usage.
One is fairly idiomatic locking of the multiple write-intend bitmaps.
Each bitmap can be "active" or "idle".  When "idle" all bits are clear.
When "active", one node will usually have an exclusive lock.  If/when that node
dies, all other nodes must find out and at least one takes remedial action.
Once the remedial action is taken the bitmap becomes idle.  In that state
a new node can claim it.  When that happens all other nodes must find out so
they transition to the "watching an active bitmap" state.
This seems to fit well with the shared/exclusive reclaimable locks of DLM.

The other usage is to provide synchronous broadcast message passing between
nodes.  When one nodes makes a configuration change it needs to tell all other
nodes and wait for them to acknowledge before the change (such as adding a
spare) is committed.  There is a small collections of locks which represent
different states in a broadcast/acknowledge protocol.
This is the part I'm least confident of, but it seems to make sense and seems
to work.


Separately:

> Reading those messages again I see what you mean, they don't sound very
> nice, so sorry about that.  I'll repeat the one positive note, which is
> that the brief things I've noticed make it look much better than the dm
> approach from several years ago.

Thanks :-)
In part this effort is a response to "clvm" - which is a completely adequate
solution of clustering when you just need volume management (growing and
shrinking and striping volumes) but doesn't extend very well to RAID.

Look forward to any review comments you find time for:-)

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: clustered MD

2015-06-10 Thread Neil Brown
On Wed, 10 Jun 2015 10:01:51 -0500
David Teigland  wrote:

> Isn't this process what staging is for?

No it isn't.
Staging is useful for code drops.  i.e. multiple other developers want to
collaborate to improve some code that the maintainer doesn't want to accept.
So it goes into staging, "the community" gets a chance to use the kernel
development workflow to fix it up, and then it either is accepted by the
maintainer, or is eventually discarded.

cluster-MD has followed a very different process.  The relevant maintainer (me)
has been involved from the start, has provided input to the design, and
reviewed various initial implementations, and now thinks that the code is close
enough to ready for it to be included in the kernel (with suitable warnings).
That way I only need to review incremental changes, not the whole set of
patches from scratch each time.

What is your interest in this?  I'm always happy for open discussion and
varied input, but it would help to know to what extent you are a stake holder?
Also a slightly less adversarial tone would make me feel more comfortable,
though maybe I'm misreading your intent.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: clustered MD

2015-06-10 Thread Neil Brown
On Wed, 10 Jun 2015 10:01:51 -0500
David Teigland teigl...@redhat.com wrote:

 Isn't this process what staging is for?

No it isn't.
Staging is useful for code drops.  i.e. multiple other developers want to
collaborate to improve some code that the maintainer doesn't want to accept.
So it goes into staging, the community gets a chance to use the kernel
development workflow to fix it up, and then it either is accepted by the
maintainer, or is eventually discarded.

cluster-MD has followed a very different process.  The relevant maintainer (me)
has been involved from the start, has provided input to the design, and
reviewed various initial implementations, and now thinks that the code is close
enough to ready for it to be included in the kernel (with suitable warnings).
That way I only need to review incremental changes, not the whole set of
patches from scratch each time.

What is your interest in this?  I'm always happy for open discussion and
varied input, but it would help to know to what extent you are a stake holder?
Also a slightly less adversarial tone would make me feel more comfortable,
though maybe I'm misreading your intent.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: clustered MD

2015-06-10 Thread Neil Brown
On Wed, 10 Jun 2015 16:07:44 -0500
David Teigland teigl...@redhat.com wrote:

 On Thu, Jun 11, 2015 at 06:31:31AM +1000, Neil Brown wrote:
  What is your interest in this?  I'm always happy for open discussion and
  varied input, but it would help to know to what extent you are a stake
  holder?
 
 Using the dlm correctly is non-trivial and should be reviewed.
 If the dlm is misused, some part of that may fall in my lap, if
 only so far as having to debug problems to distinguish between dlm
 bugs or md-cluster bugs.  This has been learned the hard way.
 
 I have yet to find time to look up the previous review discussion.
 I will be more than happy if I find the dlm usage has already been
 thoroughly reviewed.

The DLM usage is the part that I am least comfortable with and I would
certainly welcome review.   There was a recent discussion of some issue that I
haven't had a chance to go over yet, but apart from that it has mostly just
been a few developers trying to figure out what we need and how that can be
implemented.

There are (as I recall) two main aspects of the DLM usage.
One is fairly idiomatic locking of the multiple write-intend bitmaps.
Each bitmap can be active or idle.  When idle all bits are clear.
When active, one node will usually have an exclusive lock.  If/when that node
dies, all other nodes must find out and at least one takes remedial action.
Once the remedial action is taken the bitmap becomes idle.  In that state
a new node can claim it.  When that happens all other nodes must find out so
they transition to the watching an active bitmap state.
This seems to fit well with the shared/exclusive reclaimable locks of DLM.

The other usage is to provide synchronous broadcast message passing between
nodes.  When one nodes makes a configuration change it needs to tell all other
nodes and wait for them to acknowledge before the change (such as adding a
spare) is committed.  There is a small collections of locks which represent
different states in a broadcast/acknowledge protocol.
This is the part I'm least confident of, but it seems to make sense and seems
to work.


Separately:

 Reading those messages again I see what you mean, they don't sound very
 nice, so sorry about that.  I'll repeat the one positive note, which is
 that the brief things I've noticed make it look much better than the dm
 approach from several years ago.

Thanks :-)
In part this effort is a response to clvm - which is a completely adequate
solution of clustering when you just need volume management (growing and
shrinking and striping volumes) but doesn't extend very well to RAID.

Look forward to any review comments you find time for:-)

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 6/8] drivers/md/md.c: Use strreplace

2015-06-09 Thread Neil Brown
On Tue,  9 Jun 2015 01:26:54 +0200
Rasmus Villemoes  wrote:

> There's no point in starting over when we meet a '/'. This also
> eliminates a stack variable and a little .text.
> 
> Signed-off-by: Rasmus Villemoes 
> ---
> v2: no changes.
> 
>  drivers/md/md.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 27506302eb7a..2ea2f28551c5 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2024,7 +2024,6 @@ static int bind_rdev_to_array(struct md_rdev
> *rdev, struct mddev *mddev) {
>   char b[BDEVNAME_SIZE];
>   struct kobject *ko;
> - char *s;
>   int err;
>  
>   /* prevent duplicates */
> @@ -2070,8 +2069,7 @@ static int bind_rdev_to_array(struct md_rdev
> *rdev, struct mddev *mddev) return -EBUSY;
>   }
>   bdevname(rdev->bdev,b);
> - while ( (s=strchr(b, '/')) != NULL)
> - *s = '!';
> + strreplace(b, '/', '!');
>  
>   rdev->mddev = mddev;
>   printk(KERN_INFO "md: bind<%s>\n", b);


Acked-by: NeilBrown 

I'm happy for Andrew to merge this.
Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 6/8] drivers/md/md.c: Use strreplace

2015-06-09 Thread Neil Brown
On Tue,  9 Jun 2015 01:26:54 +0200
Rasmus Villemoes li...@rasmusvillemoes.dk wrote:

 There's no point in starting over when we meet a '/'. This also
 eliminates a stack variable and a little .text.
 
 Signed-off-by: Rasmus Villemoes li...@rasmusvillemoes.dk
 ---
 v2: no changes.
 
  drivers/md/md.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)
 
 diff --git a/drivers/md/md.c b/drivers/md/md.c
 index 27506302eb7a..2ea2f28551c5 100644
 --- a/drivers/md/md.c
 +++ b/drivers/md/md.c
 @@ -2024,7 +2024,6 @@ static int bind_rdev_to_array(struct md_rdev
 *rdev, struct mddev *mddev) {
   char b[BDEVNAME_SIZE];
   struct kobject *ko;
 - char *s;
   int err;
  
   /* prevent duplicates */
 @@ -2070,8 +2069,7 @@ static int bind_rdev_to_array(struct md_rdev
 *rdev, struct mddev *mddev) return -EBUSY;
   }
   bdevname(rdev-bdev,b);
 - while ( (s=strchr(b, '/')) != NULL)
 - *s = '!';
 + strreplace(b, '/', '!');
  
   rdev-mddev = mddev;
   printk(KERN_INFO md: bind%s\n, b);


Acked-by: NeilBrown ne...@suse.de

I'm happy for Andrew to merge this.
Thanks,
NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/28] Swap over NFS -v16

2008-02-25 Thread Neil Brown
On Saturday February 23, [EMAIL PROTECTED] wrote:
> On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > Another posting of the full swap over NFS series. 
> 
> Well I looked.  There's rather a lot of it and I wouldn't pretend to
> understand it.

But pretending is fun :-)

> 
> What is the NFS and net people's take on all of this?

Well I'm only vaguely an NFS person, barely a net person, sporadically
an mm person, but I've had a look and it seems to mostly make sense.

We introduce a new "emergency" concept for page allocation.
The size of the emergency pool is set by various reservations by
different potential users.
If the number of free pages is below the "emergency" size, then only
users with a "MEMALLOC" flag get to allocate pages.  Further, those
pages get a "reserve" flag set which propagates into slab/slub so
kmalloc/kmemalloc only return memory from those pages to MEMALLOC
users. 
MEMALLOC users are those that set PF_MEMALLOC.  A socket can get
SOCK_MEMALLOC set which will cause certain pieces of code to
temporarily set PF_MEMALLOC while working on that socket.

The upshot is that providing any MEMALLOC user reserves an appropriate
amount of emergency space, returns the emergency memory promptly, and
sets PF_MEMALLOC whenever allocating memory, it's memory allocations
should never fail.

As memory is requested is small units, but allocated as pages, there
needs to be a conversion from small-units to pages.  One of the
patches does this and appears to err on the side of be over-generous,
which is the right thing to do.


Memory reservations are organised in a tree.  I really don't
understand the tree.  Is it just to make /proc/reserve_info look more
helpful?
Certainly all the individual reservations need to be recorded, and the
cumulative reservation needs also to be recorded (currently in the
root of the tree) but what are all the other levels used for?

Reservations are used for all the transient memory that might be used
by the network stack.  This particularly includes the route cache and
skbs for incoming messages.  I have no idea if there is anything else
that needs to be allowed for.

Filesystems can advertise (via address_space_operations) that files
may be used as swap file.  They then provide swapout/swapin methods
which are like writepage/readpage but may behave differently and have
a different way to get credentials from a 'struct file'.


So in general, the patch set looks to have the right sort of shape.  I
cannot be very authoritative on the details as there are a lot of
them, and they touch code that I'm not very familiar with.

Some specific comments on patches:


reserve-slub.patch

   Please avoid irrelevant reformatting in patches.  It makes them
   harder to read.  e.g.:

-static void setup_object(struct kmem_cache *s, struct page *page,
-   void *object)
+static void setup_object(struct kmem_cache *s, struct page *page, void *object)


mm-kmem_estimate_pages.patch

   This introduces
 kestimate
 kestimate_single
 kmem_estimate_pages

   The last obviously returns a number of pages.  The contrast seems
   to suggest the others don't.   But they do...
   I don't think the names are very good, but I concede that it is
   hard to choose good names here.  Maybe:
  kmalloc_estimate_variable
  kmalloc_estimate_fixed
  kmem_alloc_estimate
   ???

mm-reserve.patch

   I'm confused by __mem_reserve_add.

+   reserve = mem_reserve_root.pages;
+   __calc_reserve(res, pages, 0);
+   reserve = mem_reserve_root.pages - reserve;

   __calc_reserve will always add 'pages' to mem_reserve_root.pages.
   So this is a complex way of doing
reserve = pages;
__calc_reserve(res, pages, 0);

And as you can calculate reserve before calling __calc_reserve
(which seems odd when stated that way), the whole function looks
like it could become:

   ret = adjust_memalloc_reserve(pages);
   if (!ret)
__calc_reserve(res, pages, limit);
   return ret;

What am I missing?

Also, mem_reserve_disconnect really should be a "void" function.
Just put a BUG_ON(ret) and don't return anything.

Finally, I'll just repeat that the purpose of the tree structure
eludes me.

net-sk_allocation.patch

Why are the "GFP_KERNEL" call sites just changed to
"sk->sk_allocation" rather than "sk_allocation(sk, GFP_KERNEL)" ??

I assume there is a good reason, and seeing it in the change log
would educate me and make the patch more obviously correct.

netvm-reserve.patch

Function names again:

 sk_adjust_memalloc
 sk_set_memalloc

sound similar.  Purpose is completely different.

mm-page_file_methods.patch

This makes page_offset and others more expensive by adding a
conditional jump to a function call that is not usually made.

Why do swap pages have a different index to 

Re: [PATCH 00/28] Swap over NFS -v16

2008-02-25 Thread Neil Brown
On Saturday February 23, [EMAIL PROTECTED] wrote:
 On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
  Another posting of the full swap over NFS series. 
 
 Well I looked.  There's rather a lot of it and I wouldn't pretend to
 understand it.

But pretending is fun :-)

 
 What is the NFS and net people's take on all of this?

Well I'm only vaguely an NFS person, barely a net person, sporadically
an mm person, but I've had a look and it seems to mostly make sense.

We introduce a new emergency concept for page allocation.
The size of the emergency pool is set by various reservations by
different potential users.
If the number of free pages is below the emergency size, then only
users with a MEMALLOC flag get to allocate pages.  Further, those
pages get a reserve flag set which propagates into slab/slub so
kmalloc/kmemalloc only return memory from those pages to MEMALLOC
users. 
MEMALLOC users are those that set PF_MEMALLOC.  A socket can get
SOCK_MEMALLOC set which will cause certain pieces of code to
temporarily set PF_MEMALLOC while working on that socket.

The upshot is that providing any MEMALLOC user reserves an appropriate
amount of emergency space, returns the emergency memory promptly, and
sets PF_MEMALLOC whenever allocating memory, it's memory allocations
should never fail.

As memory is requested is small units, but allocated as pages, there
needs to be a conversion from small-units to pages.  One of the
patches does this and appears to err on the side of be over-generous,
which is the right thing to do.


Memory reservations are organised in a tree.  I really don't
understand the tree.  Is it just to make /proc/reserve_info look more
helpful?
Certainly all the individual reservations need to be recorded, and the
cumulative reservation needs also to be recorded (currently in the
root of the tree) but what are all the other levels used for?

Reservations are used for all the transient memory that might be used
by the network stack.  This particularly includes the route cache and
skbs for incoming messages.  I have no idea if there is anything else
that needs to be allowed for.

Filesystems can advertise (via address_space_operations) that files
may be used as swap file.  They then provide swapout/swapin methods
which are like writepage/readpage but may behave differently and have
a different way to get credentials from a 'struct file'.


So in general, the patch set looks to have the right sort of shape.  I
cannot be very authoritative on the details as there are a lot of
them, and they touch code that I'm not very familiar with.

Some specific comments on patches:


reserve-slub.patch

   Please avoid irrelevant reformatting in patches.  It makes them
   harder to read.  e.g.:

-static void setup_object(struct kmem_cache *s, struct page *page,
-   void *object)
+static void setup_object(struct kmem_cache *s, struct page *page, void *object)


mm-kmem_estimate_pages.patch

   This introduces
 kestimate
 kestimate_single
 kmem_estimate_pages

   The last obviously returns a number of pages.  The contrast seems
   to suggest the others don't.   But they do...
   I don't think the names are very good, but I concede that it is
   hard to choose good names here.  Maybe:
  kmalloc_estimate_variable
  kmalloc_estimate_fixed
  kmem_alloc_estimate
   ???

mm-reserve.patch

   I'm confused by __mem_reserve_add.

+   reserve = mem_reserve_root.pages;
+   __calc_reserve(res, pages, 0);
+   reserve = mem_reserve_root.pages - reserve;

   __calc_reserve will always add 'pages' to mem_reserve_root.pages.
   So this is a complex way of doing
reserve = pages;
__calc_reserve(res, pages, 0);

And as you can calculate reserve before calling __calc_reserve
(which seems odd when stated that way), the whole function looks
like it could become:

   ret = adjust_memalloc_reserve(pages);
   if (!ret)
__calc_reserve(res, pages, limit);
   return ret;

What am I missing?

Also, mem_reserve_disconnect really should be a void function.
Just put a BUG_ON(ret) and don't return anything.

Finally, I'll just repeat that the purpose of the tree structure
eludes me.

net-sk_allocation.patch

Why are the GFP_KERNEL call sites just changed to
sk-sk_allocation rather than sk_allocation(sk, GFP_KERNEL) ??

I assume there is a good reason, and seeing it in the change log
would educate me and make the patch more obviously correct.

netvm-reserve.patch

Function names again:

 sk_adjust_memalloc
 sk_set_memalloc

sound similar.  Purpose is completely different.

mm-page_file_methods.patch

This makes page_offset and others more expensive by adding a
conditional jump to a function call that is not usually made.

Why do swap pages have a different index to everyone else?

nfs-swap_ops.patch

Re: [dm-devel] Re: [PATCH] Implement barrier support for single device DM devices

2008-02-20 Thread Neil Brown
On Tuesday February 19, [EMAIL PROTECTED] wrote:
> On Mon, Feb 18, 2008 at 04:24:27PM +0300, Michael Tokarev wrote:
> > First, I still don't understand why in God's sake barriers are "working"
> > while regular cache flushes are not.  Almost no consumer-grade hard drive
> > supports write barriers, but they all support regular cache flushes, and
> > the latter should be enough (while not the most speed-optimal) to ensure
> > data safety.  Why to require write cache disable (like in XFS FAQ) instead
> > of going the flush-cache-when-appropriate (as opposed to write-barrier-
> > when-appropriate) way?
> 
> Devil's advocate:
> 
> Why should we need to support multiple different block layer APIs
> to do the same thing? Surely any hardware that doesn't support barrier
> operations can emulate them with cache flushes when they receive a
> barrier I/O from the filesystem

The simple answer to "why multiple APIs" is "different performance
trade-offs". 
If barriers are implemented in at the end of the pipeline, they can
presumably be reasonably cheap.
If they have to be implemented at the top of the pipeline, thus
stalling the whole pipeline, they are likely to be more expensive.

A filesystem may be able to mitigate the expense if it knows something
about the purpose of the data.
e.g. ext3 in data=writeback mode could wait only for journal writes to
complete before submitting the (would-be) barrier write of the commit
block, and would not bother to wait for data writes.

However, consistent APIs are also a good thing.
I would easily accept an argument that a BIO_RW_BARRER request must
*always* be correctly ordered around all other requests to the same
device.  If a layered device cannot get the service it requires from
lower level devices, it must do that flush/write/wait itself.

That should be paired with a way for the upper levels to find out how
efficient barriers are.  I guess the three levels of barrier
efficiency are:
  1/ handled above the elevator - least efficient
  2/ handled between elevator and device (by 'flush request'), medium
  3/ handled inside device (e.g. ordered SCSI request) most efficient.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [PATCH] Implement barrier support for single device DM devices

2008-02-20 Thread Neil Brown
On Monday February 18, [EMAIL PROTECTED] wrote:
> 
> 
> I'll put it even more strongly.  My experience is that disabling write
> cache plus disabling barriers is often much faster than enabling both
> barriers and write cache enabled, when doing metadata intensive
> operations, as long as you have a drive that is good at CTQ/NCQ.

Doesn't this speed gain come at a correctness cost?  Barriers aren't
only about flushed the write cache.  They are also about preventing
re-ordering, both at the "elevator" layer and inside the device.
If the device does CTQ, it could re-order requests could it not?
Then two writes sent from the fs could make it to media in the wrong
order, and a power failure in between could corrupt your data.

Or am I misunderstanding something ??

(of course this only applies to XFS where disabling barriers means
"hope for the best", as opposed to ext3 where disabling barriers means
"don't trust the device, impose explicit ordering").

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [PATCH] Implement barrier support for single device DM devices

2008-02-20 Thread Neil Brown
On Monday February 18, [EMAIL PROTECTED] wrote:
 
 
 I'll put it even more strongly.  My experience is that disabling write
 cache plus disabling barriers is often much faster than enabling both
 barriers and write cache enabled, when doing metadata intensive
 operations, as long as you have a drive that is good at CTQ/NCQ.

Doesn't this speed gain come at a correctness cost?  Barriers aren't
only about flushed the write cache.  They are also about preventing
re-ordering, both at the elevator layer and inside the device.
If the device does CTQ, it could re-order requests could it not?
Then two writes sent from the fs could make it to media in the wrong
order, and a power failure in between could corrupt your data.

Or am I misunderstanding something ??

(of course this only applies to XFS where disabling barriers means
hope for the best, as opposed to ext3 where disabling barriers means
don't trust the device, impose explicit ordering).

NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [PATCH] Implement barrier support for single device DM devices

2008-02-20 Thread Neil Brown
On Tuesday February 19, [EMAIL PROTECTED] wrote:
 On Mon, Feb 18, 2008 at 04:24:27PM +0300, Michael Tokarev wrote:
  First, I still don't understand why in God's sake barriers are working
  while regular cache flushes are not.  Almost no consumer-grade hard drive
  supports write barriers, but they all support regular cache flushes, and
  the latter should be enough (while not the most speed-optimal) to ensure
  data safety.  Why to require write cache disable (like in XFS FAQ) instead
  of going the flush-cache-when-appropriate (as opposed to write-barrier-
  when-appropriate) way?
 
 Devil's advocate:
 
 Why should we need to support multiple different block layer APIs
 to do the same thing? Surely any hardware that doesn't support barrier
 operations can emulate them with cache flushes when they receive a
 barrier I/O from the filesystem

The simple answer to why multiple APIs is different performance
trade-offs. 
If barriers are implemented in at the end of the pipeline, they can
presumably be reasonably cheap.
If they have to be implemented at the top of the pipeline, thus
stalling the whole pipeline, they are likely to be more expensive.

A filesystem may be able to mitigate the expense if it knows something
about the purpose of the data.
e.g. ext3 in data=writeback mode could wait only for journal writes to
complete before submitting the (would-be) barrier write of the commit
block, and would not bother to wait for data writes.

However, consistent APIs are also a good thing.
I would easily accept an argument that a BIO_RW_BARRER request must
*always* be correctly ordered around all other requests to the same
device.  If a layered device cannot get the service it requires from
lower level devices, it must do that flush/write/wait itself.

That should be paired with a way for the upper levels to find out how
efficient barriers are.  I guess the three levels of barrier
efficiency are:
  1/ handled above the elevator - least efficient
  2/ handled between elevator and device (by 'flush request'), medium
  3/ handled inside device (e.g. ordered SCSI request) most efficient.

NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [2.6 patch] fs/drop_caches.c: make 2 functions static

2008-02-18 Thread Neil Brown
On Monday February 18, [EMAIL PROTECTED] wrote:
> On Sun, 17 Feb 2008 10:17:46 +0200 Adrian Bunk <[EMAIL PROTECTED]> wrote:
> 
> > This patch makes the following needlessly global functions static:
> > - drop_pagecache()
> > - drop_slab()
> > 
> > Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
..
> 
> This was originally done because Neil said that he planned to use these in
> nfsd.  I guess that didn't happen, and an additional patch would be needed
> here anyway to do the module exports.  So  zap.

Did I?  I wonder why.  Neither my memory or my mail archive gives me
any hints, so I suspect it must have just been a passing fancy that
resolved into a really dumb idea (as so many passing fancies do).

Zap-supported-by: NeilBrown <[EMAIL PROTECTED]>

Hmmm... I guess that should really be

  Acked-By: NeilBrown <[EMAIL PROTECTED]>
;-)

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [2.6 patch] fs/drop_caches.c: make 2 functions static

2008-02-18 Thread Neil Brown
On Monday February 18, [EMAIL PROTECTED] wrote:
 On Sun, 17 Feb 2008 10:17:46 +0200 Adrian Bunk [EMAIL PROTECTED] wrote:
 
  This patch makes the following needlessly global functions static:
  - drop_pagecache()
  - drop_slab()
  
  Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
..
 
 This was originally done because Neil said that he planned to use these in
 nfsd.  I guess that didn't happen, and an additional patch would be needed
 here anyway to do the module exports.  So  zap.

Did I?  I wonder why.  Neither my memory or my mail archive gives me
any hints, so I suspect it must have just been a passing fancy that
resolved into a really dumb idea (as so many passing fancies do).

Zap-supported-by: NeilBrown [EMAIL PROTECTED]

Hmmm... I guess that should really be

  Acked-By: NeilBrown [EMAIL PROTECTED]
;-)

NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Documentation/patch-tags, one more time

2008-02-17 Thread Neil Brown
On Sunday February 17, [EMAIL PROTECTED] wrote:
> Neil responding to Linus:
> > > "From:" is not a tag. It's a special marker at the *top*
> >
> > You may be right, but when I email patches to akpm
> 
> Linus wasn't saying you don't need a 'From:' line in this case (as the
> *top* line of patches you didn't author).  He's saying it's not an
> instance of the type "tag" line, such as the "Signed-off-by:" line that
> go after the patch explanation.  The "From:" line is a different kind
> of line, with different rules and position.

If it looks like a duck and quacks like a duck.

When Linus used "From" "Subject" and "Date" together in the context of
"mail", it sounded a lot like he was talking about the headers section
of a standard mail item.  I guess he wasn't.

I don't really care if "From:" gets called a 'tag' or what, but given
that it is an important piece of metadata associated with a patch, it
seems good to document it in the same place as other important pieces
of metadata that are associated with patches.

NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Documentation/patch-tags, one more time

2008-02-17 Thread Neil Brown
On Friday February 8, [EMAIL PROTECTED] wrote:
> 
> 
> On Fri, 8 Feb 2008, Jonathan Corbet wrote:
> > +
> > +These tags are:
> > +
> > +From:  The original author of the patch.  This tag will ensure
> > +   that credit is properly given when somebody other than the
> > +   original author submits the patch.
> 
> "From:" is not a tag. It's a special marker at the *top* (and _only_ the 
> top) of the email, which indicates authorship. It goes along with markers 
> like "Date:" and "Subject:", and has nothing to do with the sign-off-like 
> tags at the end.

You may be right, but when I email patches to akpm that I did not
author, and that don't have a From: tag in the body saying who did
author them, then akpm tells me off.  And I don't like that :-(

Maybe we need "tags in git commit logs" and "tags in email sent to
akpm" and ... :-)

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Documentation/patch-tags, one more time

2008-02-17 Thread Neil Brown
On Friday February 8, [EMAIL PROTECTED] wrote:
 
 
 On Fri, 8 Feb 2008, Jonathan Corbet wrote:
  +
  +These tags are:
  +
  +From:  The original author of the patch.  This tag will ensure
  +   that credit is properly given when somebody other than the
  +   original author submits the patch.
 
 From: is not a tag. It's a special marker at the *top* (and _only_ the 
 top) of the email, which indicates authorship. It goes along with markers 
 like Date: and Subject:, and has nothing to do with the sign-off-like 
 tags at the end.

You may be right, but when I email patches to akpm that I did not
author, and that don't have a From: tag in the body saying who did
author them, then akpm tells me off.  And I don't like that :-(

Maybe we need tags in git commit logs and tags in email sent to
akpm and ... :-)

NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Documentation/patch-tags, one more time

2008-02-17 Thread Neil Brown
On Sunday February 17, [EMAIL PROTECTED] wrote:
 Neil responding to Linus:
   From: is not a tag. It's a special marker at the *top*
 
  You may be right, but when I email patches to akpm
 
 Linus wasn't saying you don't need a 'From:' line in this case (as the
 *top* line of patches you didn't author).  He's saying it's not an
 instance of the type tag line, such as the Signed-off-by: line that
 go after the patch explanation.  The From: line is a different kind
 of line, with different rules and position.

If it looks like a duck and quacks like a duck.

When Linus used From Subject and Date together in the context of
mail, it sounded a lot like he was talking about the headers section
of a standard mail item.  I guess he wasn't.

I don't really care if From: gets called a 'tag' or what, but given
that it is an important piece of metadata associated with a patch, it
seems good to document it in the same place as other important pieces
of metadata that are associated with patches.

NeilBrown

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ext3 can fail badly when device stops accepting BIO_RW_BARRIER requests.

2008-02-06 Thread Neil Brown

Some devices - notably dm and md - can change their behaviour in
response to BIO_RW_BARRIER requests.  They might start out accepting
such requests but on reconfiguration, they find out that they cannot
any more.

ext3 (and other filesystems) deal with this by always testing if
BIO_RW_BARRIER requests fail with EOPNOTSUPP, and retrying the write
requests without the barrier (probably after waiting for any pending
writes to complete).

However there is a bug in the handling for this for ext3.

When ext3 (jbd actually) decides to submit a BIO_RW_BARRIER request,
it sets the buffer_ordered flag on the buffer head.
If the request completes successfully, the flag STAYS SET.

Other code might then write the same buffer_head after the device has
been reconfigured to not accept barriers.  This write will then fail,
but the "other code" is not ready to handle EOPNOTSUPP errors and the
error will be treated as fatal.

This can be seen without having to reconfigure a device at exactly the
wrong time by putting:

if (buffer_ordered(bh))
printk("OH DEAR, and ordered buffer\n");


in the while loop in "commit phase 5" of journal_commit_transaction.

If it ever prints the "OH DEAR ..." message (as it does sometimes for
me), then that request could (in different circumstances) have failed
with EOPNOTSUPP, but that isn't tested for.

My proposed fix is to clear the buffer_ordered flag after it has been
used, as in the following patch.

Thanks,
NeilBrown

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

diff .prev/fs/jbd/commit.c ./fs/jbd/commit.c
--- .prev/fs/jbd/commit.c   2008-02-07 10:01:57.0 +1100
+++ ./fs/jbd/commit.c   2008-02-07 10:04:58.0 +1100
@@ -131,6 +131,8 @@ static int journal_write_commit_record(j
barrier_done = 1;
}
ret = sync_dirty_buffer(bh);
+   if (barrier_done)
+   clear_buffer_ordered(bh);
/* is it possible for another commit to fail at roughly
 * the same time as this one?  If so, we don't want to
 * trust the barrier flag in the super, but instead want
@@ -148,7 +150,6 @@ static int journal_write_commit_record(j
spin_unlock(>j_state_lock);
 
/* And try again, without the barrier */
-   clear_buffer_ordered(bh);
set_buffer_uptodate(bh);
set_buffer_dirty(bh);
ret = sync_dirty_buffer(bh);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ext3 can fail badly when device stops accepting BIO_RW_BARRIER requests.

2008-02-06 Thread Neil Brown

Some devices - notably dm and md - can change their behaviour in
response to BIO_RW_BARRIER requests.  They might start out accepting
such requests but on reconfiguration, they find out that they cannot
any more.

ext3 (and other filesystems) deal with this by always testing if
BIO_RW_BARRIER requests fail with EOPNOTSUPP, and retrying the write
requests without the barrier (probably after waiting for any pending
writes to complete).

However there is a bug in the handling for this for ext3.

When ext3 (jbd actually) decides to submit a BIO_RW_BARRIER request,
it sets the buffer_ordered flag on the buffer head.
If the request completes successfully, the flag STAYS SET.

Other code might then write the same buffer_head after the device has
been reconfigured to not accept barriers.  This write will then fail,
but the other code is not ready to handle EOPNOTSUPP errors and the
error will be treated as fatal.

This can be seen without having to reconfigure a device at exactly the
wrong time by putting:

if (buffer_ordered(bh))
printk(OH DEAR, and ordered buffer\n);


in the while loop in commit phase 5 of journal_commit_transaction.

If it ever prints the OH DEAR ... message (as it does sometimes for
me), then that request could (in different circumstances) have failed
with EOPNOTSUPP, but that isn't tested for.

My proposed fix is to clear the buffer_ordered flag after it has been
used, as in the following patch.

Thanks,
NeilBrown

Signed-off-by: Neil Brown [EMAIL PROTECTED]

diff .prev/fs/jbd/commit.c ./fs/jbd/commit.c
--- .prev/fs/jbd/commit.c   2008-02-07 10:01:57.0 +1100
+++ ./fs/jbd/commit.c   2008-02-07 10:04:58.0 +1100
@@ -131,6 +131,8 @@ static int journal_write_commit_record(j
barrier_done = 1;
}
ret = sync_dirty_buffer(bh);
+   if (barrier_done)
+   clear_buffer_ordered(bh);
/* is it possible for another commit to fail at roughly
 * the same time as this one?  If so, we don't want to
 * trust the barrier flag in the super, but instead want
@@ -148,7 +150,6 @@ static int journal_write_commit_record(j
spin_unlock(journal-j_state_lock);
 
/* And try again, without the barrier */
-   clear_buffer_ordered(bh);
set_buffer_uptodate(bh);
set_buffer_dirty(bh);
ret = sync_dirty_buffer(bh);
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Monthly md check == hung machine; how do I debug?

2008-02-05 Thread Neil Brown
On Tuesday February 5, [EMAIL PROTECTED] wrote:
> 
> I was able to solve the problem, however, like so:
> 
> 132c133
> < # CONFIG_PREEMPT_NONE is not set
> ---
> > CONFIG_PREEMPT_NONE=y
> 134,135c135,136
> < CONFIG_PREEMPT=y
> < CONFIG_PREEMPT_BKL=y
> ---
> > # CONFIG_PREEMPT is not set
> > # CONFIG_PREEMPT_BKL is not set
> 

This suggests that there is some sort of race.
Given that I've never hit it on SMP machines, it is probably a very
small window that opens immediately after some event that triggers
kernel preemption.

The only "mdadm --monitor" does in the kernel is read /proc/mdstat and
maybe make some GET_ARRAY_INFO/ GET_DISK_INFO ioctl calls.

They don't do much more than grab the reconfig_mutex.

What sort of hardware do you have?  x86?  SMP or uni-processor?
Also, exactly what kernel are you running?

I might see if I can reproduce it... so if you can send me the broken
.config, that might help too.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Monthly md check == hung machine; how do I debug?

2008-02-05 Thread Neil Brown
On Tuesday February 5, [EMAIL PROTECTED] wrote:
 
 I was able to solve the problem, however, like so:
 
 132c133
  # CONFIG_PREEMPT_NONE is not set
 ---
  CONFIG_PREEMPT_NONE=y
 134,135c135,136
  CONFIG_PREEMPT=y
  CONFIG_PREEMPT_BKL=y
 ---
  # CONFIG_PREEMPT is not set
  # CONFIG_PREEMPT_BKL is not set
 

This suggests that there is some sort of race.
Given that I've never hit it on SMP machines, it is probably a very
small window that opens immediately after some event that triggers
kernel preemption.

The only mdadm --monitor does in the kernel is read /proc/mdstat and
maybe make some GET_ARRAY_INFO/ GET_DISK_INFO ioctl calls.

They don't do much more than grab the reconfig_mutex.

What sort of hardware do you have?  x86?  SMP or uni-processor?
Also, exactly what kernel are you running?

I might see if I can reproduce it... so if you can send me the broken
.config, that might help too.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock

2008-01-23 Thread Neil Brown
On Tuesday January 15, [EMAIL PROTECTED] wrote:
> 
> This message describes the details about md-RAID1 issue found by
> testing the md RAID1 using the SCSI fault injection framework.
> 
> Abstract:
> Both the error handler for md RAID1 and write access request to the md RAID1
> use raid1d kernel thread. The nr_pending flag could cause a race condition
> in raid1d, results in a raid1d deadlock.

Thanks for finding and reporting this.

I believe the following patch should fix the deadlock.

If you are able to repeat your test and confirm this I would
appreciate it.

Thanks,
NeilBrown



Fix deadlock in md/raid1 when handling a read error.

When handling a read error, we freeze the array to stop any other
IO while attempting to over-write with correct data.

This is done in the raid1d thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the
retry queue - these are counted in ->nr_queue and will stay there during
a freeze).

However write requests need attention from raid1d as bitmap updates
might be required.  This can cause a deadlock as raid1 is waiting for
requests to finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention,
and call it in freeze_array to be sure that we aren't waiting on raid1d.

Thanks to "K.Tanaka" <[EMAIL PROTECTED]> for finding and reporting
this problem.

Cc: "K.Tanaka" <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid1.c |   66 ++-
 1 file changed, 45 insertions(+), 21 deletions(-)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c2008-01-18 11:19:09.0 +1100
+++ ./drivers/md/raid1.c2008-01-24 14:21:55.0 +1100
@@ -592,6 +592,37 @@ static int raid1_congested(void *data, i
 }
 
 
+static int flush_pending_writes(conf_t *conf)
+{
+   /* Any writes that have been queue but are awaiting
+* bitmap updates get flushed here.
+* We return 1 if any requests were actually submitted.
+*/
+   int rv = 0;
+
+   spin_lock_irq(>device_lock);
+
+   if (conf->pending_bio_list.head) {
+   struct bio *bio;
+   bio = bio_list_get(>pending_bio_list);
+   blk_remove_plug(conf->mddev->queue);
+   spin_unlock_irq(>device_lock);
+   /* flush any pending bitmap writes to
+* disk before proceeding w/ I/O */
+   bitmap_unplug(conf->mddev->bitmap);
+
+   while (bio) { /* submit pending writes */
+   struct bio *next = bio->bi_next;
+   bio->bi_next = NULL;
+   generic_make_request(bio);
+   bio = next;
+   }
+   rv = 1;
+   } else
+   spin_unlock_irq(>device_lock);
+   return rv;
+}
+
 /* Barriers
  * Sometimes we need to suspend IO while we do something else,
  * either some resync/recovery, or reconfigure the array.
@@ -678,10 +709,14 @@ static void freeze_array(conf_t *conf)
spin_lock_irq(>resync_lock);
conf->barrier++;
conf->nr_waiting++;
+   spin_unlock_irq(>resync_lock);
+
+   spin_lock_irq(>resync_lock);
wait_event_lock_irq(conf->wait_barrier,
conf->barrier+conf->nr_pending == conf->nr_queued+2,
conf->resync_lock,
-   raid1_unplug(conf->mddev->queue));
+   ({ flush_pending_writes(conf);
+  raid1_unplug(conf->mddev->queue); }));
spin_unlock_irq(>resync_lock);
 }
 static void unfreeze_array(conf_t *conf)
@@ -907,6 +942,9 @@ static int make_request(struct request_q
blk_plug_device(mddev->queue);
spin_unlock_irqrestore(>device_lock, flags);
 
+   /* In case raid1d snuck into freeze_array */
+   wake_up(>wait_barrier);
+
if (do_sync)
md_wakeup_thread(mddev->thread);
 #if 0
@@ -1473,28 +1511,14 @@ static void raid1d(mddev_t *mddev)

for (;;) {
char b[BDEVNAME_SIZE];
-   spin_lock_irqsave(>device_lock, flags);
-
-   if (conf->pending_bio_list.head) {
-   bio = bio_list_get(>pending_bio_list);
-   blk_remove_plug(mddev->queue);
-   spin_unlock_irqrestore(>device_lock, flags);
-   /* flush any pending bitmap writes to disk before 
proceeding w/ I/O */
-   bitmap_unplug(mddev->bitmap);
-
-   while (bio) { /* submit pending writes */
-   struct bio *ne

Re: [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock

2008-01-23 Thread Neil Brown
On Tuesday January 15, [EMAIL PROTECTED] wrote:
 
 This message describes the details about md-RAID1 issue found by
 testing the md RAID1 using the SCSI fault injection framework.
 
 Abstract:
 Both the error handler for md RAID1 and write access request to the md RAID1
 use raid1d kernel thread. The nr_pending flag could cause a race condition
 in raid1d, results in a raid1d deadlock.

Thanks for finding and reporting this.

I believe the following patch should fix the deadlock.

If you are able to repeat your test and confirm this I would
appreciate it.

Thanks,
NeilBrown



Fix deadlock in md/raid1 when handling a read error.

When handling a read error, we freeze the array to stop any other
IO while attempting to over-write with correct data.

This is done in the raid1d thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the
retry queue - these are counted in -nr_queue and will stay there during
a freeze).

However write requests need attention from raid1d as bitmap updates
might be required.  This can cause a deadlock as raid1 is waiting for
requests to finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention,
and call it in freeze_array to be sure that we aren't waiting on raid1d.

Thanks to K.Tanaka [EMAIL PROTECTED] for finding and reporting
this problem.

Cc: K.Tanaka [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid1.c |   66 ++-
 1 file changed, 45 insertions(+), 21 deletions(-)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c2008-01-18 11:19:09.0 +1100
+++ ./drivers/md/raid1.c2008-01-24 14:21:55.0 +1100
@@ -592,6 +592,37 @@ static int raid1_congested(void *data, i
 }
 
 
+static int flush_pending_writes(conf_t *conf)
+{
+   /* Any writes that have been queue but are awaiting
+* bitmap updates get flushed here.
+* We return 1 if any requests were actually submitted.
+*/
+   int rv = 0;
+
+   spin_lock_irq(conf-device_lock);
+
+   if (conf-pending_bio_list.head) {
+   struct bio *bio;
+   bio = bio_list_get(conf-pending_bio_list);
+   blk_remove_plug(conf-mddev-queue);
+   spin_unlock_irq(conf-device_lock);
+   /* flush any pending bitmap writes to
+* disk before proceeding w/ I/O */
+   bitmap_unplug(conf-mddev-bitmap);
+
+   while (bio) { /* submit pending writes */
+   struct bio *next = bio-bi_next;
+   bio-bi_next = NULL;
+   generic_make_request(bio);
+   bio = next;
+   }
+   rv = 1;
+   } else
+   spin_unlock_irq(conf-device_lock);
+   return rv;
+}
+
 /* Barriers
  * Sometimes we need to suspend IO while we do something else,
  * either some resync/recovery, or reconfigure the array.
@@ -678,10 +709,14 @@ static void freeze_array(conf_t *conf)
spin_lock_irq(conf-resync_lock);
conf-barrier++;
conf-nr_waiting++;
+   spin_unlock_irq(conf-resync_lock);
+
+   spin_lock_irq(conf-resync_lock);
wait_event_lock_irq(conf-wait_barrier,
conf-barrier+conf-nr_pending == conf-nr_queued+2,
conf-resync_lock,
-   raid1_unplug(conf-mddev-queue));
+   ({ flush_pending_writes(conf);
+  raid1_unplug(conf-mddev-queue); }));
spin_unlock_irq(conf-resync_lock);
 }
 static void unfreeze_array(conf_t *conf)
@@ -907,6 +942,9 @@ static int make_request(struct request_q
blk_plug_device(mddev-queue);
spin_unlock_irqrestore(conf-device_lock, flags);
 
+   /* In case raid1d snuck into freeze_array */
+   wake_up(conf-wait_barrier);
+
if (do_sync)
md_wakeup_thread(mddev-thread);
 #if 0
@@ -1473,28 +1511,14 @@ static void raid1d(mddev_t *mddev)

for (;;) {
char b[BDEVNAME_SIZE];
-   spin_lock_irqsave(conf-device_lock, flags);
-
-   if (conf-pending_bio_list.head) {
-   bio = bio_list_get(conf-pending_bio_list);
-   blk_remove_plug(mddev-queue);
-   spin_unlock_irqrestore(conf-device_lock, flags);
-   /* flush any pending bitmap writes to disk before 
proceeding w/ I/O */
-   bitmap_unplug(mddev-bitmap);
-
-   while (bio) { /* submit pending writes */
-   struct bio *next = bio-bi_next;
-   bio-bi_next = NULL;
-   generic_make_request(bio);
-   bio = next

Re: do_md_run returned -22 [Was: 2.6.24-rc8-mm1]

2008-01-17 Thread Neil Brown
On Thursday January 17, [EMAIL PROTECTED] wrote:
> On Thu, 17 Jan 2008 16:23:30 +0100 Jiri Slaby <[EMAIL PROTECTED]> wrote:
> 
> > On 01/17/2008 11:35 AM, Andrew Morton wrote:
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc8/2.6.24-rc8-mm1/
> > 
> > still the same md issue (do_md_run returns -22=EINVAL) as in -rc6-mm1 
> > reported 
> > by Thorsten here:
> > http://lkml.org/lkml/2007/12/27/45
> 
> hm, I must have been asleep when that was reported.  Neil, did you see it?

No, even though it was Cc:ed to me - sorry.
Maybe a revised subject line would have helped... maybe not.

> 
> > Is there around any fix for this?
> 
> Well, we could bitbucket md-allow-devices-to-be-shared-between-md-arrays.patch

Yeah, do that.  I'll send you something new.
I'll move that chunk into a different patch and add the extra bits
needed to make that test correct in *all* cases rather than just the
ones I was thinking about at the time.
My test suit does try in-kernel-autodetect (the problem case) but it
didn't catch this bug due to another bug.  I'll fix that too.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: do_md_run returned -22 [Was: 2.6.24-rc8-mm1]

2008-01-17 Thread Neil Brown
On Thursday January 17, [EMAIL PROTECTED] wrote:
 On Thu, 17 Jan 2008 16:23:30 +0100 Jiri Slaby [EMAIL PROTECTED] wrote:
 
  On 01/17/2008 11:35 AM, Andrew Morton wrote:
   ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc8/2.6.24-rc8-mm1/
  
  still the same md issue (do_md_run returns -22=EINVAL) as in -rc6-mm1 
  reported 
  by Thorsten here:
  http://lkml.org/lkml/2007/12/27/45
 
 hm, I must have been asleep when that was reported.  Neil, did you see it?

No, even though it was Cc:ed to me - sorry.
Maybe a revised subject line would have helped... maybe not.

 
  Is there around any fix for this?
 
 Well, we could bitbucket md-allow-devices-to-be-shared-between-md-arrays.patch

Yeah, do that.  I'll send you something new.
I'll move that chunk into a different patch and add the extra bits
needed to make that test correct in *all* cases rather than just the
ones I was thinking about at the time.
My test suit does try in-kernel-autodetect (the problem case) but it
didn't catch this bug due to another bug.  I'll fix that too.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread Neil Brown
On Thursday January 17, [EMAIL PROTECTED] wrote:
> 
> On Jan 17 2008 00:43, Karel Zak wrote:
> >> 
> >> Seems like a plain bad idea to me.  There will be any number of home-made
> >> /proc/mounts parsers and we don't know what they do.
> >
> > So, let's use /proc/mounts_v2  ;-)
> 
> Was not it like "don't use /proc for new things"?

I thought it was "don't use /proc for new things that aren't process
related".

And as the mount table is per process..

A host has a bunch of mounted filesystems (struct super_block), and
each process has some subset of these stitched together into a mount
tree (struct vfsmount / struct namespace).

There needs to be something in /proc that exposes the vfsmount tree.

Arguably there should be something else - maybe in sysfs - that
provides access to the "struct superblock" object.

And there needs to be a clear way to relate information from one with
information from the other.

In the tradition of stat, statm, status, maybe the former should be
 /proc/$PID/mountm
:-)

Hey, I just found /proc/X/mountstats.  How does this fit in to the big
picture?

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-16 Thread Neil Brown
On Tuesday January 15, [EMAIL PROTECTED] wrote:
> On Wed, 16 Jan 2008 00:09:31 -0700 "Dan Williams" <[EMAIL PROTECTED]> wrote:
> 
> > > heheh.
> > >
> > > it's really easy to reproduce the hang without the patch -- i could
> > > hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
> > > i'll try with ext3... Dan's experiences suggest it won't happen with ext3
> > > (or is even more rare), which would explain why this has is overall a
> > > rare problem.
> > >
> > 
> > Hmmm... how rare?
> > 
> > http://marc.info/?l=linux-kernel=119461747005776=2
> > 
> > There is nothing specific that prevents other filesystems from hitting
> > it, perhaps XFS is just better at submitting large i/o's.  -stable
> > should get some kind of treatment.  I'll take altered performance over
> > a hung system.
> 
> We can always target 2.6.25-rc1 then 2.6.24.1 if Neil is still feeling
> wimpy.

I am feeling wimpy.  There've been a few too many raid5 breakages
recently and it is very hard to really judge the performance impact of
this change.  I even have a small uncertainty of correctness - could
it still hang in some other way?  I don't think so, but this is
complex code...

If it were really common I would have expected more noise on the
mailing list.  Sure, there has been some, but not much.  However maybe
people are searching the archives and finding the "increase stripe
cache size" trick, and not reporting anything  seems unlikely
though.

How about we queue it for 2.6.25-rc1 and then about when -rc2 comes
out, we queue it for 2.6.24.y?  Any one (or any distro) that really
needs it can of course grab the patch them selves...

??

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-16 Thread Neil Brown
On Tuesday January 15, [EMAIL PROTECTED] wrote:
 On Wed, 16 Jan 2008 00:09:31 -0700 Dan Williams [EMAIL PROTECTED] wrote:
 
   heheh.
  
   it's really easy to reproduce the hang without the patch -- i could
   hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
   i'll try with ext3... Dan's experiences suggest it won't happen with ext3
   (or is even more rare), which would explain why this has is overall a
   rare problem.
  
  
  Hmmm... how rare?
  
  http://marc.info/?l=linux-kernelm=119461747005776w=2
  
  There is nothing specific that prevents other filesystems from hitting
  it, perhaps XFS is just better at submitting large i/o's.  -stable
  should get some kind of treatment.  I'll take altered performance over
  a hung system.
 
 We can always target 2.6.25-rc1 then 2.6.24.1 if Neil is still feeling
 wimpy.

I am feeling wimpy.  There've been a few too many raid5 breakages
recently and it is very hard to really judge the performance impact of
this change.  I even have a small uncertainty of correctness - could
it still hang in some other way?  I don't think so, but this is
complex code...

If it were really common I would have expected more noise on the
mailing list.  Sure, there has been some, but not much.  However maybe
people are searching the archives and finding the increase stripe
cache size trick, and not reporting anything  seems unlikely
though.

How about we queue it for 2.6.25-rc1 and then about when -rc2 comes
out, we queue it for 2.6.24.y?  Any one (or any distro) that really
needs it can of course grab the patch them selves...

??

NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread Neil Brown
On Thursday January 17, [EMAIL PROTECTED] wrote:
 
 On Jan 17 2008 00:43, Karel Zak wrote:
  
  Seems like a plain bad idea to me.  There will be any number of home-made
  /proc/mounts parsers and we don't know what they do.
 
  So, let's use /proc/mounts_v2  ;-)
 
 Was not it like don't use /proc for new things?

I thought it was don't use /proc for new things that aren't process
related.

And as the mount table is per process..

A host has a bunch of mounted filesystems (struct super_block), and
each process has some subset of these stitched together into a mount
tree (struct vfsmount / struct namespace).

There needs to be something in /proc that exposes the vfsmount tree.

Arguably there should be something else - maybe in sysfs - that
provides access to the struct superblock object.

And there needs to be a clear way to relate information from one with
information from the other.

In the tradition of stat, statm, status, maybe the former should be
 /proc/$PID/mountm
:-)

Hey, I just found /proc/X/mountstats.  How does this fit in to the big
picture?

NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.

2008-01-13 Thread Neil Brown
On Monday January 14, [EMAIL PROTECTED] wrote:
> 
> Thanks.  I'll see what I can some up with.

How about this, against current -mm

On both the read and write path for an rdev attribute, we
call mddev_lock, first checking that mddev is not NULL.
Once we get the lock, we check again.
If rdev->mddev is not NULL, we know it will stay that way as it only
gets cleared under the same lock.

While in the rdev show/store routines, we know that the mddev cannot
get freed, do to the kobject relationships.

rdev_size_store is awkward because it has to drop the lock.  So we
take a copy of rdev->mddev before the drop, and we are safe...

Comments?

NeilBrown

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c |   35 ++-
 1 file changed, 26 insertions(+), 9 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2008-01-14 12:26:15.0 +1100
+++ ./drivers/md/md.c   2008-01-14 17:05:53.0 +1100
@@ -1998,9 +1998,11 @@ rdev_size_store(mdk_rdev_t *rdev, const 
char *e;
unsigned long long size = simple_strtoull(buf, , 10);
unsigned long long oldsize = rdev->size;
+   mddev_t *my_mddev = rdev->mddev;
+
if (e==buf || (*e && *e != '\n'))
return -EINVAL;
-   if (rdev->mddev->pers)
+   if (my_mddev->pers)
return -EBUSY;
rdev->size = size;
if (size > oldsize && rdev->mddev->external) {
@@ -2013,7 +2015,7 @@ rdev_size_store(mdk_rdev_t *rdev, const 
int overlap = 0;
struct list_head *tmp, *tmp2;
 
-   mddev_unlock(rdev->mddev);
+   mddev_unlock(my_mddev);
for_each_mddev(mddev, tmp) {
mdk_rdev_t *rdev2;
 
@@ -2033,7 +2035,7 @@ rdev_size_store(mdk_rdev_t *rdev, const 
break;
}
}
-   mddev_lock(rdev->mddev);
+   mddev_lock(my_mddev);
if (overlap) {
/* Someone else could have slipped in a size
 * change here, but doing so is just silly.
@@ -2045,8 +2047,8 @@ rdev_size_store(mdk_rdev_t *rdev, const 
return -EBUSY;
}
}
-   if (size < rdev->mddev->size || rdev->mddev->size == 0)
-   rdev->mddev->size = size;
+   if (size < my_mddev->size || my_mddev->size == 0)
+   my_mddev->size = size;
return len;
 }
 
@@ -2067,10 +2069,21 @@ rdev_attr_show(struct kobject *kobj, str
 {
struct rdev_sysfs_entry *entry = container_of(attr, struct 
rdev_sysfs_entry, attr);
mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
+   mddev_t *mddev = rdev->mddev;
+   ssize_t rv;
 
if (!entry->show)
return -EIO;
-   return entry->show(rdev, page);
+
+   rv = mddev ? mddev_lock(mddev) : -EBUSY;
+   if (!rv) {
+   if (rdev->mddev == NULL)
+   rv = -EBUSY;
+   else
+   rv = entry->show(rdev, page);
+   mddev_unlock(mddev);
+   }
+   return rv;
 }
 
 static ssize_t
@@ -2079,15 +2092,19 @@ rdev_attr_store(struct kobject *kobj, st
 {
struct rdev_sysfs_entry *entry = container_of(attr, struct 
rdev_sysfs_entry, attr);
mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
-   int rv;
+   ssize_t rv;
+   mddev_t *mddev = rdev->mddev;
 
if (!entry->store)
return -EIO;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
-   rv = mddev_lock(rdev->mddev);
+   rv = mddev ? mddev_lock(mddev): -EBUSY;
if (!rv) {
-   rv = entry->store(rdev, page, length);
+   if (rdev->mddev == NULL)
+   rv = -EBUSY;
+   else
+   rv = entry->store(rdev, page, length);
mddev_unlock(rdev->mddev);
}
return rv;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.

2008-01-13 Thread Neil Brown
On Monday January 14, [EMAIL PROTECTED] wrote:
> On Mon, Jan 14, 2008 at 02:21:45PM +1100, Neil Brown wrote:
> 
> > Maybe it isn't there any more
> > 
> > Once upon a time, when I 
> >echo remove > /sys/block/mdX/md/dev-YYY/state
> 
> Egads.  And just what will protect you from parallel callers
> of state_store()?  buffer->mutex does *not* do that - it only
> gives you exclusion on given struct file.  Run the command
> above from several shells and you've got independent open
> from each redirect => different struct file *and* different
> buffer for each => no exclusion whatsoever.

well in -mm, rdev_attr_store gets a lock on
rdev->mddev->reconfig_mutex. 
It doesn't test is rdev->mddev is NULL though, so if the write happens
after unbind_rdev_from_array, we lose.
A test for NULL would be easy enough.  And I think that the mddev
won't actually disappear until the rdevs are all gone (you subsequent
comment about kobject_del ordering seems to confirm that) so a simple test
for NULL should be sufficient.

> 
> And _that_ is present right in the mainline tree - it's unrelated
> to -mm kobject changes.
> 
> BTW, yes, you do have a deadlock there - kobject_del() will try to evict
> children, which will include waiting for currently running ->store()
> to finish, which will include the caller since .../state *is* a child of
> that sucker.
> 
> The real problem is the lack of any kind of exclusion considerations in
> md.c itself, AFAICS.  Fun with ordering is secondary (BTW, yes, it is
> a problem - will sysfs ->store() to attribute between export_rdev() and
> kobject_del() work correctly?)

Probably not.  The possibility that rdev->mddev could be NULL would
break a lot of these.  Maybe I should delay setting rdev->mddev to
NULL until after the kobject_del.  Then audit them all.

Thanks.  I'll see what I can some up with.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.

2008-01-13 Thread Neil Brown
On Monday January 14, [EMAIL PROTECTED] wrote:
> On Mon, Jan 14, 2008 at 12:45:31PM +1100, NeilBrown wrote:
> > 
> > Due to possible deadlock issues we need to use a schedule work to
> > kobject_del an 'rdev' object from a different thread.
> > 
> > A recent change means that kobject_add no longer gets a refernce, and
> > kobject_del doesn't put a reference.  Consequently, we need to
> > explicitly hold a reference to ensure that the last reference isn't
> > dropped before the scheduled work get a chance to call kobject_del.
> > 
> > Also, rename delayed_delete to md_delayed_delete to that it is more
> > obvious in a stack trace which code is to blame.
> 
> I don't know...  You still get kobject_del() and export_rdev()
> in unpredictable order; sure, it won't be freed under you, but...

I cannot see that that would matter.
kobject_del deletes the object from the kobj tree and free sysfs.
export_rdev disconnects the objects from md structures and releases
the connection with the device.  They are quite independent.

> 
> What is that deadlock problem, anyway?  I don't see anything that
> would look like an obvious candidate in the stuff you are delaying...

Maybe it isn't there any more

Once upon a time, when I 
   echo remove > /sys/block/mdX/md/dev-YYY/state

sysfs_write_file would hold buffer->sem while calling my store
handler.
When my store handler tried to delete the relevant kobject, it would
eventually call orphan_all_buffers which would try to take buf->sem
and deadlock.

orphan_all_buffers doesn't exist any more, so maybe the deadlock is
gone too.
However the comment at the top of sysfs_schedule_callback in
sysfs/file.c says:

 *
 * sysfs attribute methods must not unregister themselves or their parent
 * kobject (which would amount to the same thing).  Attempts to do so will
 * deadlock, since unregistration is mutually exclusive with driver
 * callbacks.
 *

so I'm included to leave the code as it is  ofcourse the comment
could be well out of date.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-13 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
> On Thu, Jan 10, 2008 at 03:13:48PM +1100, Neil Brown wrote:
> > > What guarantees that it doesn't happen before we get to callback?  AFAICS,
> > > nothing whatsoever...
> > 
> > Yes, that's bad isn't it :-)
> > 
> > I think I should be using sysfs_schedule_callback here.  That makes the 
> > required 'get' and 'put' calls but it can fail with -ENOMEM.  I
> > wonder what I do if -ENOMEM???  Maybe I'll just continue to roll my
> > one :-( 
> 
> How about this instead (completely untested)
> 
>   * split failure exits
>   * switch to kick_rdev_from_array()
>   * fold unbind_rdev_from_array() into it (no other callers anymore)
>   * take export_rdev() into failure case in bind_rdev_to_array()
>   * in kick_rdev_from_array() do what export_rdev() does sans
> kobject_put() and do that before schedule_work().  Take kobject_put() into
> delayed_delete().

While there are probably some good ideas in there, I think fixing this
particular bug is much simpler.  Just take a reference to the object
before scheduling the worker, and drop it when the worker has done
its work.

I have a closer look at the idea of no required export_rdev after a
failed bind_rdev_to_array.  On the surface it does seem to make the
code nicer.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/5] NLM: Have lockd call try_to_freeze

2008-01-13 Thread Neil Brown
On Sunday January 13, [EMAIL PROTECTED] wrote:
> On Thu, 10 Jan 2008 13:01:34 -0500
> Jeff Layton <[EMAIL PROTECTED]> wrote:
> 
> > lockd makes itself freezable, but never calls try_to_freeze(). Have it
> > call try_to_freeze() within the main loop.
> > 
> > Signed-off-by: Jeff Layton <[EMAIL PROTECTED]>
> > ---
> >  fs/lockd/svc.c |3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
> > index 82e2192..6ee8bed 100644
> > --- a/fs/lockd/svc.c
> > +++ b/fs/lockd/svc.c
> > @@ -155,6 +155,9 @@ lockd(struct svc_rqst *rqstp)
> > long timeout = MAX_SCHEDULE_TIMEOUT;
> > char buf[RPC_MAX_ADDRBUFLEN];
> >  
> > +   if (try_to_freeze())
> > +   continue;
> > +
> > if (signalled()) {
> > flush_signals(current);
> > if (nlmsvc_ops) {
> 
> 
> I was looking over svc_recv today and noticed that it calls
> try_to_freeze a couple of times. Given that, the above patch may be
> unnecessary. I don't think it hurts anything though. Should we keep
> this patch or drop it?

I would suggest dropping it.
Having unnecessary code is likely to be confusing.

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/5] NLM: Have lockd call try_to_freeze

2008-01-13 Thread Neil Brown
On Sunday January 13, [EMAIL PROTECTED] wrote:
 On Thu, 10 Jan 2008 13:01:34 -0500
 Jeff Layton [EMAIL PROTECTED] wrote:
 
  lockd makes itself freezable, but never calls try_to_freeze(). Have it
  call try_to_freeze() within the main loop.
  
  Signed-off-by: Jeff Layton [EMAIL PROTECTED]
  ---
   fs/lockd/svc.c |3 +++
   1 files changed, 3 insertions(+), 0 deletions(-)
  
  diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
  index 82e2192..6ee8bed 100644
  --- a/fs/lockd/svc.c
  +++ b/fs/lockd/svc.c
  @@ -155,6 +155,9 @@ lockd(struct svc_rqst *rqstp)
  long timeout = MAX_SCHEDULE_TIMEOUT;
  char buf[RPC_MAX_ADDRBUFLEN];
   
  +   if (try_to_freeze())
  +   continue;
  +
  if (signalled()) {
  flush_signals(current);
  if (nlmsvc_ops) {
 
 
 I was looking over svc_recv today and noticed that it calls
 try_to_freeze a couple of times. Given that, the above patch may be
 unnecessary. I don't think it hurts anything though. Should we keep
 this patch or drop it?

I would suggest dropping it.
Having unnecessary code is likely to be confusing.

NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Top 10 kernel oopses for the week ending January 5th, 2008

2008-01-13 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
 On Thu, Jan 10, 2008 at 03:13:48PM +1100, Neil Brown wrote:
   What guarantees that it doesn't happen before we get to callback?  AFAICS,
   nothing whatsoever...
  
  Yes, that's bad isn't it :-)
  
  I think I should be using sysfs_schedule_callback here.  That makes the 
  required 'get' and 'put' calls but it can fail with -ENOMEM.  I
  wonder what I do if -ENOMEM???  Maybe I'll just continue to roll my
  one :-( 
 
 How about this instead (completely untested)
 
   * split failure exits
   * switch to kick_rdev_from_array()
   * fold unbind_rdev_from_array() into it (no other callers anymore)
   * take export_rdev() into failure case in bind_rdev_to_array()
   * in kick_rdev_from_array() do what export_rdev() does sans
 kobject_put() and do that before schedule_work().  Take kobject_put() into
 delayed_delete().

While there are probably some good ideas in there, I think fixing this
particular bug is much simpler.  Just take a reference to the object
before scheduling the worker, and drop it when the worker has done
its work.

I have a closer look at the idea of no required export_rdev after a
failed bind_rdev_to_array.  On the surface it does seem to make the
code nicer.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.

2008-01-13 Thread Neil Brown
On Monday January 14, [EMAIL PROTECTED] wrote:
 On Mon, Jan 14, 2008 at 12:45:31PM +1100, NeilBrown wrote:
  
  Due to possible deadlock issues we need to use a schedule work to
  kobject_del an 'rdev' object from a different thread.
  
  A recent change means that kobject_add no longer gets a refernce, and
  kobject_del doesn't put a reference.  Consequently, we need to
  explicitly hold a reference to ensure that the last reference isn't
  dropped before the scheduled work get a chance to call kobject_del.
  
  Also, rename delayed_delete to md_delayed_delete to that it is more
  obvious in a stack trace which code is to blame.
 
 I don't know...  You still get kobject_del() and export_rdev()
 in unpredictable order; sure, it won't be freed under you, but...

I cannot see that that would matter.
kobject_del deletes the object from the kobj tree and free sysfs.
export_rdev disconnects the objects from md structures and releases
the connection with the device.  They are quite independent.

 
 What is that deadlock problem, anyway?  I don't see anything that
 would look like an obvious candidate in the stuff you are delaying...

Maybe it isn't there any more

Once upon a time, when I 
   echo remove  /sys/block/mdX/md/dev-YYY/state

sysfs_write_file would hold buffer-sem while calling my store
handler.
When my store handler tried to delete the relevant kobject, it would
eventually call orphan_all_buffers which would try to take buf-sem
and deadlock.

orphan_all_buffers doesn't exist any more, so maybe the deadlock is
gone too.
However the comment at the top of sysfs_schedule_callback in
sysfs/file.c says:

 *
 * sysfs attribute methods must not unregister themselves or their parent
 * kobject (which would amount to the same thing).  Attempts to do so will
 * deadlock, since unregistration is mutually exclusive with driver
 * callbacks.
 *

so I'm included to leave the code as it is  ofcourse the comment
could be well out of date.

NeilBrown
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   4   5   6   7   8   9   10   >