Re: [Cluster-devel] [PATCH 03/13] GFS2: Eliminate a goto in finish_xmote

2018-11-19 Thread Bob Peterson
Hi,

- Original Message -
> 
> 
> On 19/11/18 13:29, Bob Peterson wrote:
> > This is another baby step toward a better glock state machine.
> > This patch eliminates a goto in function finish_xmote so we can
> > begin unraveling the cryptic logic with later patches.
> >
> > Signed-off-by: Bob Peterson 
> > ---
> >   fs/gfs2/glock.c | 11 +--
> >   1 file changed, 5 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> > index 5f2156f15f05..6e9d53583b73 100644
> > --- a/fs/gfs2/glock.c
> > +++ b/fs/gfs2/glock.c
> > @@ -472,11 +472,11 @@ static void finish_xmote(struct gfs2_glock *gl,
> > unsigned int ret)
> > list_move_tail(&gh->gh_list, 
> > &gl->gl_holders);
> > gh = find_first_waiter(gl);
> > gl->gl_target = gh->gh_state;
> > -   goto retry;
> > -   }
> > -   /* Some error or failed "try lock" - report it */
> > -   if ((ret & LM_OUT_ERROR) ||
> > -   (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB))) {
> > +   state = LM_ST_UNLOCKED;
> I'm not sure what you are trying to achieve here, but setting the state
> to LM_ST_UNLOCKED when it is quite possible that it is not that state,
> doesn't really seem to improve anything. Indeed, it looks more confusing
> to me, at least it was fairly clear before that the intent was to retry
> the operation which has been canceled.

When finish_xmote hits this affected section of code, it's because the dlm
returned a state different from the intended state. Before this patch, it
did "goto retry", which jumped to the label inside the switch statement that
handles LM_ST_UNLOCKED, after which it simply unlocks and returns.

Changing local variable "state" merely forces the code to take the same
codepath, in which it calls do_xmote, unlocking and returning as it does today,
but without the goto. This makes the function more suitable for the new
autonomous state machine, which is added in a later patch.

The addition of "else if" is needed so it doesn't go down the wrong code path
at the comment: /* Some error or failed "try lock" - report it */
The logic is a bit tricky here, but is preserved from the original.
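
To make that concrete, here is roughly how this section of finish_xmote reads
after the patch (condensed from the hunk; not a complete listing):

	if (unlikely(state != gl->gl_target)) {
		if (gh && !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags)) {
			if (ret & LM_OUT_CANCELED) {
				/* canceled: move the holder to the back of
				 * the queue and pick the next waiter ... */
				gh = find_first_waiter(gl);
				gl->gl_target = gh->gh_state;
				/* ... and force the retry path below */
				state = LM_ST_UNLOCKED;
			} else if ((ret & LM_OUT_ERROR) ||
				   (gh->gh_flags & (LM_FLAG_TRY |
						    LM_FLAG_TRY_1CB))) {
				/* An error or failed "try lock" - report it */
				gl->gl_target = gl->gl_state;
				do_error(gl, ret);
				goto out;
			}
		}
		switch (state) {
		/* Unlocked due to conversion deadlock (or the value forced
		 * above) - the same path the old "goto retry" took */
		case LM_ST_UNLOCKED:
			do_xmote(gl, gl->gl_target);
			break;
		/* LM_ST_SHARED, LM_ST_DEFERRED and the default case are
		 * unchanged */
		}
	}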

Most of the subsequent patches aren't quite as mind-bending, I promise. :)

Regards,

Bob Peterson



Re: [Cluster-devel] [PATCH 02/13] GFS2: Make do_xmote determine its own gh parameter

2018-11-19 Thread Bob Peterson
Hi Steve,

- Original Message -
> 
> 
> On 19/11/18 13:29, Bob Peterson wrote:
> > This is another baby step toward a better glock state machine.
> > Before this patch, do_xmote was called with a gh parameter, but
> > only for promotes, not demotes. This patch allows do_xmote to
> > determine the gh autonomously.
> >
> > Signed-off-by: Bob Peterson 

(snip)

> Since gh is apparently only used to get the lock flags, it would make
> more sense just to pass the lock flags rather than add in an additional
> find_first_waiter() call,
> 
> Steve.

Perhaps I didn't put enough info into the comments for this patch.

I need to get rid of the gh parameter in order to make the glock
state machine fully autonomous. In other words, function do_xmote will
become a state in the (stand-alone) state machine, which itself does not
require a gh parameter and may be called from several places under
several conditions. The state of the glock will determine that it needs
to call do_xmote, but do_xmote needs to figure it out on its own.

Before this patch, the caller does indeed know the gh pointer, but in
the future, it will be replaced by a generic call to the state machine,
which will not know it.
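
For reference, the heart of the patch is just that do_xmote now looks up the
waiter itself (excerpted from the diff):

	static void do_xmote(struct gfs2_glock *gl, unsigned int target)
	__releases(&gl->gl_lockref.lock)
	__acquires(&gl->gl_lockref.lock)
	{
		const struct gfs2_glock_operations *glops = gl->gl_ops;
		struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
		/* look up the first waiter here instead of taking a gh
		 * argument, so a future state-machine caller need not
		 * supply one */
		struct gfs2_holder *gh = find_first_waiter(gl);
		unsigned int lck_flags = (unsigned int)(gh ? gh->gh_flags : 0);
		int ret;
		...
	}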

Regards,

Bob Peterson



[Cluster-devel] [PATCH 07/13] GFS2: Add blocking and non-blocking demote to state machine

2018-11-19 Thread Bob Peterson
Before this patch, function run_queue would do special processing
before calling the state machine for the blocking and non-blocking
demote-in-progress cases. This patch rolls that processing into
the state machine, which will allow us to simplify further with
later patches.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 44 +---
 fs/gfs2/glock.h |  2 ++
 2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 8dc98d069afa..023e2186b374 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -631,21 +631,9 @@ __acquires(&gl->gl_lockref.lock)
 
if (test_bit(GLF_DEMOTE, &gl->gl_flags) &&
gl->gl_demote_state != gl->gl_state) {
-   if (find_first_holder(gl)) {
-   clear_bit(GLF_LOCK, &gl->gl_flags);
-   smp_mb__after_atomic();
-   return;
-   }
-   if (nonblock) {
-   clear_bit(GLF_LOCK, &gl->gl_flags);
-   smp_mb__after_atomic();
-   gl->gl_lockref.count++;
-   __gfs2_glock_queue_work(gl, 0);
-   return;
-   }
-   set_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags);
-   GLOCK_BUG_ON(gl, gl->gl_demote_state == LM_ST_EXCLUSIVE);
-   gl->gl_target = gl->gl_demote_state;
+   __state_machine(gl, nonblock ? GL_ST_DEMOTE_NONBLOCK :
+   GL_ST_BLOCKING_DEMOTE);
+   return;
} else {
if (test_bit(GLF_DEMOTE, &gl->gl_flags))
gfs2_demote_wake(gl);
@@ -713,6 +701,32 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state)
if (ret == -EAGAIN)
gl->gl_mch = GL_ST_FINISH_XMOTE;
break;
+
+   case GL_ST_BLOCKING_DEMOTE:
+   gl->gl_mch = GL_ST_IDLE;
+   if (find_first_holder(gl)) {
+   clear_bit(GLF_LOCK, &gl->gl_flags);
+   smp_mb__after_atomic();
+   break;
+   }
+   set_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags);
+   GLOCK_BUG_ON(gl, gl->gl_demote_state == 
LM_ST_EXCLUSIVE);
+   gl->gl_target = gl->gl_demote_state;
+   gl->gl_mch = GL_ST_DO_XMOTE;
+   break;
+
+   case GL_ST_DEMOTE_NONBLOCK:
+   gl->gl_mch = GL_ST_IDLE;
+   if (find_first_holder(gl)) {
+   clear_bit(GLF_LOCK, &gl->gl_flags);
+   smp_mb__after_atomic();
+   break;
+   }
+   clear_bit(GLF_LOCK, &gl->gl_flags);
+   smp_mb__after_atomic();
+   gl->gl_lockref.count++;
+   __gfs2_glock_queue_work(gl, 0);
+   break;
}
} while (gl->gl_mch != GL_ST_IDLE);
 }
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 8333bbc5a197..c8b704a73638 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -126,6 +126,8 @@ enum gl_machine_states {
GL_ST_DO_XMOTE = 2, /* Promote or demote a waiter */
GL_ST_DO_XMOTE_UNLOCK = 3, /* Promote or demote a waiter after a failed
  state change attempt from dlm. */
+   GL_ST_DEMOTE_NONBLOCK = 4, /* Demote is in progress - non-blocking */
+   GL_ST_BLOCKING_DEMOTE = 5, /* Demote is in progress - blocking */
 };
 
 struct lm_lockops {
-- 
2.19.1



[Cluster-devel] [PATCH 09/13] GFS2: Replace run_queue with new GL_ST_RUN state in state machine

2018-11-19 Thread Bob Peterson
Now we replace the run_queue function with a new state in the glock
state machine which does the same thing.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 60 +
 fs/gfs2/glock.h |  1 +
 2 files changed, 27 insertions(+), 34 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 6048867b2d66..82c4614793b3 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -60,9 +60,6 @@ struct gfs2_glock_iter {
 
 typedef void (*glock_examiner) (struct gfs2_glock * gl);
 
-static void __state_machine(struct gfs2_glock *gl, int new_state);
-static void state_machine(struct gfs2_glock *gl, int new_state);
-
 static struct dentry *gfs2_root;
 static struct workqueue_struct *glock_workqueue;
 struct workqueue_struct *gfs2_delete_workqueue;
@@ -610,34 +607,11 @@ static inline struct gfs2_holder *find_first_holder(const 
struct gfs2_glock *gl)
return NULL;
 }
 
-/**
- * run_queue - do all outstanding tasks related to a glock
- * @gl: The glock in question
- * @nonblock: True if we must not block in run_queue
- *
- */
-
-static void run_queue(struct gfs2_glock *gl, const int nonblock)
-__releases(&gl->gl_lockref.lock)
-__acquires(&gl->gl_lockref.lock)
-{
-   if (test_and_set_bit(GLF_LOCK, &gl->gl_flags))
-   return;
-
-   GLOCK_BUG_ON(gl, test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags));
-
-   if (test_bit(GLF_DEMOTE, &gl->gl_flags) &&
-   gl->gl_demote_state != gl->gl_state)
-   __state_machine(gl, nonblock ? GL_ST_DEMOTE_NONBLOCK :
-   GL_ST_BLOCKING_DEMOTE);
-   else
-   __state_machine(gl, GL_ST_PROMOTE);
-}
-
 /**
  * __state_machine - the glock state machine
  * @gl: pointer to the glock we are transitioning
  * @new_state: The new state we need to execute
+ * @nonblock: True if we must not block while demoting
  *
  * This function handles state transitions for glocks.
  * When the state_machine is called, it's given a new state that needs to be
@@ -646,7 +620,8 @@ __acquires(&gl->gl_lockref.lock)
  * The lock might be released inside some of the states, so we may need react
  * to state changes from other calls.
  */
-static void __state_machine(struct gfs2_glock *gl, int new_state)
+static void __state_machine(struct gfs2_glock *gl, int new_state,
+   const int nonblock)
 {
struct gfs2_holder *gh = NULL;
int ret;
@@ -727,6 +702,22 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state)
do_error(gl, 0); /* Fail queued try locks */
gl->gl_mch = GL_ST_DO_XMOTE;
break;
+
+   case GL_ST_RUN:
+   gl->gl_mch = GL_ST_IDLE;
+   if (test_and_set_bit(GLF_LOCK, &gl->gl_flags))
+   break;
+
+   GLOCK_BUG_ON(gl, test_bit(GLF_DEMOTE_IN_PROGRESS,
+ &gl->gl_flags));
+
+   if (test_bit(GLF_DEMOTE, &gl->gl_flags) &&
+   gl->gl_demote_state != gl->gl_state)
+   gl->gl_mch = nonblock ? GL_ST_DEMOTE_NONBLOCK :
+   GL_ST_BLOCKING_DEMOTE;
+   else
+   gl->gl_mch = GL_ST_PROMOTE;
+   break;
}
} while (gl->gl_mch != GL_ST_IDLE);
 }
@@ -738,12 +729,13 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state)
  *
  * Just like __state_machine but it acquires the gl_lockref lock
  */
-static void state_machine(struct gfs2_glock *gl, int new_state)
+static void state_machine(struct gfs2_glock *gl, int new_state,
+ const int nonblock)
 __releases(&gl->gl_lockref.lock)
 __acquires(&gl->gl_lockref.lock)
 {
spin_lock(&gl->gl_lockref.lock);
-   __state_machine(gl, new_state);
+   __state_machine(gl, new_state, nonblock);
spin_unlock(&gl->gl_lockref.lock);
 }
 
@@ -776,7 +768,7 @@ static void glock_work_func(struct work_struct *work)
unsigned int drop_refs = 1;
 
if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags)) {
-   state_machine(gl, GL_ST_FINISH_XMOTE);
+   state_machine(gl, GL_ST_FINISH_XMOTE, 0);
drop_refs++;
}
spin_lock(&gl->gl_lockref.lock);
@@ -794,7 +786,7 @@ static void glock_work_func(struct work_struct *work)
set_bit(GLF_DEMOTE, &gl->gl_flags);
}
}
-   run_queue(gl, 0);
+   __state_machine(gl, GL_ST_RUN, 0);
if (delay) {
/* Keep one glock reference for the work we requeue. */
drop_refs--;
@@ -1189,7 +1181,7 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
gl->gl_lockref.count++;
__gfs2_glock_queue_work(gl, 0);
}
-   run_queue(gl, 1);
+ 

[Cluster-devel] [PATCH 13/13] GFS2: Optimization of GL_ST_UNLOCK state

2018-11-19 Thread Bob Peterson
The previous patch added a new GL_ST_UNLOCK state that simply
cleared the GLF_LOCK bit, then went into the RUN state. Since the
RUN state immediately sets GLF_LOCK again and does a few other
things, we can optimize it by placing the GL_ST_UNLOCK case label in
the middle of the GL_ST_RUN state. This avoids the unlock-then-lock
sequence and the extra checks, and skips directly to the resulting
promotion/demotion state.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 6 ++
 fs/gfs2/glock.h | 2 +-
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 33f4a530caf4..29e6d9a440ec 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -724,7 +724,9 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state,
 
GLOCK_BUG_ON(gl, test_bit(GLF_DEMOTE_IN_PROGRESS,
  &gl->gl_flags));
+   /* Fall through */
 
+   case GL_ST_UNLOCK:
if (test_bit(GLF_DEMOTE, &gl->gl_flags) &&
gl->gl_demote_state != gl->gl_state)
gl->gl_mch = nonblock ? GL_ST_DEMOTE_NONBLOCK :
@@ -751,10 +753,6 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state,
gl->gl_mch = GL_ST_RUN;
break;
 
-   case GL_ST_UNLOCK:
-   clear_bit(GLF_LOCK, &gl->gl_flags);
-   gl->gl_mch = GL_ST_RUN;
-   break;
}
} while (gl->gl_mch != GL_ST_IDLE || new_state != GL_ST_IDLE);
 
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 5bbf3118c7c3..e1a5bc2d3f76 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -131,7 +131,7 @@ enum gl_machine_states {
GL_ST_PROMOTE = 6, /* Promote the lock */
GL_ST_RUN = 7,  /* "Run" or progress the lock */
GL_ST_WORK = 8, /* Perform general glock work */
-   GL_ST_UNLOCK = 9,   /* Unlock then run */
+   GL_ST_UNLOCK = 9,   /* Unlock the glock */
 };
 
 struct lm_lockops {
-- 
2.19.1



[Cluster-devel] [PATCH 11/13] GFS2: Reduce glock_work_func to a single call to state_machine

2018-11-19 Thread Bob Peterson
Before this patch, function glock_work_func would call into the
state machine for GL_ST_FINISH_XMOTE, then GL_ST_RUN, plus some work
related to dropping references and requeueing itself. This patch
moves all that functionality to a new GL_ST_WORK state. This reduces
glock_work_func to a single call to the state machine.

The goal here is to allow for patches in the future that will
bypass the delayed workqueue altogether to improve performance.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 97 -
 fs/gfs2/glock.h |  1 +
 2 files changed, 56 insertions(+), 42 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 858f42e66698..22ddeda90199 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -624,9 +624,25 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state,
const int nonblock)
 {
struct gfs2_holder *gh = NULL;
+   unsigned long delay = 0;
+   unsigned int drop_refs = 0;
int ret;
 
BUG_ON(!spin_is_locked(&gl->gl_lockref.lock));
+   if (new_state == GL_ST_WORK) {
+   drop_refs = 1;
+   /**
+* Before we can do the rest of the work, we need to finish
+* any xmotes due to a reply from dlm. Note that since we did
+* not change new_state, we'll drop back into GL_ST_WORK when
+* the GL_ST_FINISH_XMOTE completes its cycle, regardless
+* of how many other states it passes through.
+*/
+   if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags)) {
+   gl->gl_mch = GL_ST_FINISH_XMOTE;
+   drop_refs++;
+   }
+   }
 
do {
switch (gl->gl_mch) {
@@ -716,8 +732,41 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state,
else
gl->gl_mch = GL_ST_PROMOTE;
break;
+
+   case GL_ST_WORK:
+   if (test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
+   gl->gl_state != LM_ST_UNLOCKED &&
+   gl->gl_demote_state != LM_ST_EXCLUSIVE) {
+   unsigned long holdtime, now = jiffies;
+
+   holdtime = gl->gl_tchange + gl->gl_hold_time;
+   if (time_before(now, holdtime))
+   delay = holdtime - now;
+
+   if (!delay) {
+   clear_bit(GLF_PENDING_DEMOTE, 
&gl->gl_flags);
+   set_bit(GLF_DEMOTE, &gl->gl_flags);
+   }
+   }
+   gl->gl_mch = GL_ST_RUN;
+   break;
}
-   } while (gl->gl_mch != GL_ST_IDLE);
+   } while (gl->gl_mch != GL_ST_IDLE || new_state != GL_ST_IDLE);
+
+   /* Now check if a delayed re-queue of the work is needed */
+   if (delay) {
+   /* Keep one glock reference for the work we requeue. */
+   drop_refs--;
+   if (gl->gl_name.ln_type != LM_TYPE_INODE)
+   delay = 0;
+   __gfs2_glock_queue_work(gl, delay);
+   }
+   /*
+* Drop the remaining glock references manually here. (Mind that
+* __gfs2_glock_queue_work depends on the lockref spinlock begin held
+* here as well.)
+*/
+   gl->gl_lockref.count -= drop_refs;
 }
 
 /**
@@ -734,6 +783,10 @@ __acquires(&gl->gl_lockref.lock)
 {
spin_lock(&gl->gl_lockref.lock);
__state_machine(gl, new_state, nonblock);
+   if (new_state == GL_ST_WORK && !gl->gl_lockref.count) {
+   __gfs2_glock_put(gl);
+   return;
+   }
spin_unlock(&gl->gl_lockref.lock);
 }
 
@@ -761,49 +814,9 @@ static void delete_work_func(struct work_struct *work)
 
 static void glock_work_func(struct work_struct *work)
 {
-   unsigned long delay = 0;
struct gfs2_glock *gl = container_of(work, struct gfs2_glock, 
gl_work.work);
-   unsigned int drop_refs = 1;
-
-   if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags)) {
-   state_machine(gl, GL_ST_FINISH_XMOTE, 0);
-   drop_refs++;
-   }
-   spin_lock(&gl->gl_lockref.lock);
-   if (test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
-   gl->gl_state != LM_ST_UNLOCKED &&
-   gl->gl_demote_state != LM_ST_EXCLUSIVE) {
-   unsigned long holdtime, now = jiffies;
-
-   holdtime = gl->gl_tchange + gl->gl_hold_time;
-   if (time_before(now, holdtime))
-   delay = holdtime - now;
 
-   if (!delay) {
-   clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags);
-   set_bit(GLF_DEMOTE, &gl->gl_flags);
-  

[Cluster-devel] [PATCH 10/13] GFS2: Reduce redundancy in GL_ST_DEMOTE_NONBLOCK state

2018-11-19 Thread Bob Peterson
This patch just simplifies the path through the GL_ST_DEMOTE_NONBLOCK
state in the state machine. Regardless of whether a holder is found,
it still needs to clear the GLF_LOCK bit. But we need to be careful
to check for a holder before we release GLF_LOCK.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 82c4614793b3..858f42e66698 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -673,15 +673,13 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state,
 
case GL_ST_DEMOTE_NONBLOCK:
gl->gl_mch = GL_ST_IDLE;
-   if (find_first_holder(gl)) {
-   clear_bit(GLF_LOCK, &gl->gl_flags);
-   smp_mb__after_atomic();
-   break;
-   }
+   gh = find_first_holder(gl);
clear_bit(GLF_LOCK, &gl->gl_flags);
smp_mb__after_atomic();
-   gl->gl_lockref.count++;
-   __gfs2_glock_queue_work(gl, 0);
+   if (!gh) {
+   gl->gl_lockref.count++;
+   __gfs2_glock_queue_work(gl, 0);
+   }
break;
 
case GL_ST_PROMOTE:
-- 
2.19.1



[Cluster-devel] [PATCH 08/13] GFS2: Add a new GL_ST_PROMOTE state to glock state machine

2018-11-19 Thread Bob Peterson
Before this patch, function run_queue performed the promotion logic
inline when glocks were being promoted. This patch moves that logic to
the glock state machine and simplifies run_queue further.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 46 +++---
 fs/gfs2/glock.h |  1 +
 2 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 023e2186b374..6048867b2d66 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -621,37 +621,17 @@ static void run_queue(struct gfs2_glock *gl, const int 
nonblock)
 __releases(&gl->gl_lockref.lock)
 __acquires(&gl->gl_lockref.lock)
 {
-   struct gfs2_holder *gh = NULL;
-   int ret;
-
if (test_and_set_bit(GLF_LOCK, &gl->gl_flags))
return;
 
GLOCK_BUG_ON(gl, test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags));
 
if (test_bit(GLF_DEMOTE, &gl->gl_flags) &&
-   gl->gl_demote_state != gl->gl_state) {
+   gl->gl_demote_state != gl->gl_state)
__state_machine(gl, nonblock ? GL_ST_DEMOTE_NONBLOCK :
GL_ST_BLOCKING_DEMOTE);
-   return;
-   } else {
-   if (test_bit(GLF_DEMOTE, &gl->gl_flags))
-   gfs2_demote_wake(gl);
-   ret = do_promote(gl);
-   if (ret == 0) {
-   clear_bit(GLF_LOCK, &gl->gl_flags);
-   smp_mb__after_atomic();
-   return;
-   }
-   if (ret == 2)
-   return;
-   gh = find_first_waiter(gl);
-   gl->gl_target = gh->gh_state;
-   if (!(gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)))
-   do_error(gl, 0); /* Fail queued try locks */
-   }
-   __state_machine(gl, GL_ST_DO_XMOTE);
-   return;
+   else
+   __state_machine(gl, GL_ST_PROMOTE);
 }
 
 /**
@@ -668,6 +648,7 @@ __acquires(&gl->gl_lockref.lock)
  */
 static void __state_machine(struct gfs2_glock *gl, int new_state)
 {
+   struct gfs2_holder *gh = NULL;
int ret;
 
BUG_ON(!spin_is_locked(&gl->gl_lockref.lock));
@@ -727,6 +708,25 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state)
gl->gl_lockref.count++;
__gfs2_glock_queue_work(gl, 0);
break;
+
+   case GL_ST_PROMOTE:
+   gl->gl_mch = GL_ST_IDLE;
+   if (test_bit(GLF_DEMOTE, &gl->gl_flags))
+   gfs2_demote_wake(gl);
+   ret = do_promote(gl);
+   if (ret == 0) {
+   clear_bit(GLF_LOCK, &gl->gl_flags);
+   smp_mb__after_atomic();
+   break;
+   }
+   if (ret == 2)
+   break;
+   gh = find_first_waiter(gl);
+   gl->gl_target = gh->gh_state;
+   if (!(gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)))
+   do_error(gl, 0); /* Fail queued try locks */
+   gl->gl_mch = GL_ST_DO_XMOTE;
+   break;
}
} while (gl->gl_mch != GL_ST_IDLE);
 }
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index c8b704a73638..76db63d1efae 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -128,6 +128,7 @@ enum gl_machine_states {
  state change attempt from dlm. */
GL_ST_DEMOTE_NONBLOCK = 4, /* Demote is in progress - non-blocking */
GL_ST_BLOCKING_DEMOTE = 5, /* Demote is in progress - blocking */
+   GL_ST_PROMOTE = 6, /* Promote the lock */
 };
 
 struct lm_lockops {
-- 
2.19.1



[Cluster-devel] [PATCH 05/13] GFS2: Add do_xmote states to state machine

2018-11-19 Thread Bob Peterson
This patch adds two new states to the glock state machine. The
first handles the normal do_xmote call for gl_target. The second handles
the abnormal case in which dlm was unable to grant our request, so we
unlock and try again.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 37 -
 fs/gfs2/glock.h |  3 +++
 2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 5e0eac782c32..b541c4053dd7 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -60,7 +60,7 @@ struct gfs2_glock_iter {
 
 typedef void (*glock_examiner) (struct gfs2_glock * gl);
 
-static void do_xmote(struct gfs2_glock *gl, unsigned int target);
+static void __state_machine(struct gfs2_glock *gl, int new_state);
 static void state_machine(struct gfs2_glock *gl, int new_state);
 
 static struct dentry *gfs2_root;
@@ -444,9 +444,12 @@ static void gfs2_demote_wake(struct gfs2_glock *gl)
  * finish_xmote - The DLM has replied to one of our lock requests
  * @gl: The glock
  *
+ * Returns: -EAGAIN if do_xmote should be called next
+ *  -EBUSY if dlm could not satisfy the request
+ *  else 0
  */
 
-static void finish_xmote(struct gfs2_glock *gl)
+static int finish_xmote(struct gfs2_glock *gl)
 {
const struct gfs2_glock_operations *glops = gl->gl_ops;
struct gfs2_holder *gh;
@@ -484,19 +487,17 @@ static void finish_xmote(struct gfs2_glock *gl)
switch(state) {
/* Unlocked due to conversion deadlock, try again */
case LM_ST_UNLOCKED:
-   do_xmote(gl, gl->gl_target);
-   break;
+   return -EAGAIN;
/* Conversion fails, unlock and try again */
case LM_ST_SHARED:
case LM_ST_DEFERRED:
-   do_xmote(gl, LM_ST_UNLOCKED);
-   break;
+   return -EBUSY;
default: /* Everything else */
fs_err(gl->gl_name.ln_sbd, "wanted %u got %u\n",
   gl->gl_target, state);
GLOCK_BUG_ON(gl, 1);
}
-   return;
+   return 0;
}
 
/* Fast path - we got what we asked for */
@@ -514,10 +515,11 @@ static void finish_xmote(struct gfs2_glock *gl)
}
rv = do_promote(gl);
if (rv == 2)
-   return;
+   return 0;
}
 out:
clear_bit(GLF_LOCK, &gl->gl_flags);
+   return 0;
 }
 
 /**
@@ -656,7 +658,7 @@ __acquires(&gl->gl_lockref.lock)
if (!(gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)))
do_error(gl, 0); /* Fail queued try locks */
}
-   do_xmote(gl, gl->gl_target);
+   __state_machine(gl, GL_ST_DO_XMOTE);
return;
 }
 
@@ -674,6 +676,8 @@ __acquires(&gl->gl_lockref.lock)
  */
 static void __state_machine(struct gfs2_glock *gl, int new_state)
 {
+   int ret;
+
BUG_ON(!spin_is_locked(&gl->gl_lockref.lock));
 
do {
@@ -685,6 +689,21 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state)
 
case GL_ST_FINISH_XMOTE:
gl->gl_mch = GL_ST_IDLE;
+   ret = finish_xmote(gl);
+   if (ret == -EBUSY)
+   gl->gl_mch = GL_ST_DO_XMOTE_UNLOCK;
+   else if (ret == -EAGAIN)
+   gl->gl_mch = GL_ST_DO_XMOTE;
+   break;
+
+   case GL_ST_DO_XMOTE:
+   gl->gl_mch = GL_ST_IDLE;
+   do_xmote(gl, gl->gl_target);
+   break;
+
+   case GL_ST_DO_XMOTE_UNLOCK:
+   gl->gl_mch = GL_ST_IDLE;
+   do_xmote(gl, LM_ST_UNLOCKED);
finish_xmote(gl);
break;
}
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 88043d610d64..8333bbc5a197 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -123,6 +123,9 @@ enum {
 enum gl_machine_states {
GL_ST_IDLE = 0, /* State machine idle; no transition needed */
GL_ST_FINISH_XMOTE = 1, /* Promotion/demotion complete */
+   GL_ST_DO_XMOTE = 2, /* Promote or demote a waiter */
+   GL_ST_DO_XMOTE_UNLOCK = 3, /* Promote or demote a waiter after a failed
+ state change attempt from dlm. */
 };
 
 struct lm_lockops {
-- 
2.19.1



[Cluster-devel] [PATCH 04/13] GFS2: Baby step toward a real state machine: finish_xmote

2018-11-19 Thread Bob Peterson
This patch adds a new function state_machine and some hooks to
call it. For this early version, we've only got two states:
idle and finish_xmote. Later, many more will be added.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c  | 74 +++-
 fs/gfs2/glock.h  |  5 
 fs/gfs2/incore.h |  1 +
 3 files changed, 66 insertions(+), 14 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 6e9d53583b73..5e0eac782c32 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -61,6 +61,7 @@ struct gfs2_glock_iter {
 typedef void (*glock_examiner) (struct gfs2_glock * gl);
 
 static void do_xmote(struct gfs2_glock *gl, unsigned int target);
+static void state_machine(struct gfs2_glock *gl, int new_state);
 
 static struct dentry *gfs2_root;
 static struct workqueue_struct *glock_workqueue;
@@ -442,18 +443,16 @@ static void gfs2_demote_wake(struct gfs2_glock *gl)
 /**
  * finish_xmote - The DLM has replied to one of our lock requests
  * @gl: The glock
- * @ret: The status from the DLM
  *
  */
 
-static void finish_xmote(struct gfs2_glock *gl, unsigned int ret)
+static void finish_xmote(struct gfs2_glock *gl)
 {
const struct gfs2_glock_operations *glops = gl->gl_ops;
struct gfs2_holder *gh;
-   unsigned state = ret & LM_OUT_ST_MASK;
+   unsigned state = gl->gl_reply & LM_OUT_ST_MASK;
int rv;
 
-   spin_lock(&gl->gl_lockref.lock);
trace_gfs2_glock_state_change(gl, state);
state_change(gl, state);
gh = find_first_waiter(gl);
@@ -467,18 +466,18 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
if (unlikely(state != gl->gl_target)) {
if (gh && !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags)) {
/* move to back of queue and try next entry */
-   if (ret & LM_OUT_CANCELED) {
+   if (gl->gl_reply & LM_OUT_CANCELED) {
if ((gh->gh_flags & LM_FLAG_PRIORITY) == 0)
list_move_tail(&gh->gh_list, 
&gl->gl_holders);
gh = find_first_waiter(gl);
gl->gl_target = gh->gh_state;
state = LM_ST_UNLOCKED;
-   } else if ((ret & LM_OUT_ERROR) ||
+   } else if ((gl->gl_reply & LM_OUT_ERROR) ||
   (gh->gh_flags & (LM_FLAG_TRY |
LM_FLAG_TRY_1CB))) {
/* An error or failed "try lock" - report it */
gl->gl_target = gl->gl_state;
-   do_error(gl, ret);
+   do_error(gl, gl->gl_reply);
goto out;
}
}
@@ -497,7 +496,6 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
   gl->gl_target, state);
GLOCK_BUG_ON(gl, 1);
}
-   spin_unlock(&gl->gl_lockref.lock);
return;
}
 
@@ -516,12 +514,10 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
}
rv = do_promote(gl);
if (rv == 2)
-   goto out_locked;
+   return;
}
 out:
clear_bit(GLF_LOCK, &gl->gl_flags);
-out_locked:
-   spin_unlock(&gl->gl_lockref.lock);
 }
 
 /**
@@ -573,7 +569,8 @@ __acquires(&gl->gl_lockref.lock)
if (ret == -EINVAL && gl->gl_target == LM_ST_UNLOCKED &&
target == LM_ST_UNLOCKED &&
test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags)) {
-   finish_xmote(gl, target);
+   gl->gl_reply = target;
+   state_machine(gl, GL_ST_FINISH_XMOTE);
gfs2_glock_queue_work(gl, 0);
}
else if (ret) {
@@ -582,7 +579,8 @@ __acquires(&gl->gl_lockref.lock)
   &sdp->sd_flags));
}
} else { /* lock_nolock */
-   finish_xmote(gl, target);
+   gl->gl_reply = target;
+   state_machine(gl, GL_ST_FINISH_XMOTE);
gfs2_glock_queue_work(gl, 0);
}
 
@@ -662,6 +660,53 @@ __acquires(&gl->gl_lockref.lock)
return;
 }
 
+/**
+ * __state_machine - the glock state machine
+ * @gl: pointer to the glock we are transitioning
+ * @new_state: The new state we need to execute
+ *
+ * This function handles state transitions for glocks.
+ * When the state_machine is called, it's given a new state that needs to be
+ * handled, but only after it becomes idle from the last call. Once called,
+ * it keeps running until the state transitions have all been resolved.
+ * The lock might be released inside some of the

[Cluster-devel] [PATCH 00/13] Radical Reform of glock state machine (take 2)

2018-11-19 Thread Bob Peterson
This is a second go at my "Radical Reform of glock state machine"
patch set. It's been rebased to the latest git tree and has had a lot more
testing (on the actual for-next tree rather than testing the ports).
This includes performance testing using iozone.

The patches are unchanged from my original post on 09 July, except that
I've added one patch at the beginning, "Remove gotos from function run_queue."

Just to reiterate: The goal here is to preserve the current logic of
today's glock state machine, so no attempts have been made (yet) to
make it faster or more sane. The goal is to eventually allow gfs2 to
reduce its use of the glock workqueue by calling the state machine
directly from the locking process, thus reducing glock latency and
greatly reducing context switches. That will be another set of patches.

I'd like to get this pushed early in the cycle so it doesn't keep
getting pushed out for various merge window cycles.

Bob Peterson
---
GFS2 glocks go through a very complex and confusing series of steps to
transition from one locking mode to another. For example, to transition
from unlocked (UN) mode to Shared (SH) mode we need to "promote" the lock,
and to do so requires a series of steps such as asking permission of DLM,
fetching dlm responses, checking for remote demotion requests, granting
multiple holders, and so forth.

This is all managed by a large set of functions known as the glock state
machine. Today, the glock state machine is a disorganized chaos of
functions that are neither documented nor intuitive. It's a complete
disaster that makes no sense at all to someone coming into the code cold.
For proof, you need only look at the fact that function do_xmote() can
call function finish_xmote(), but finish_xmote() can call do_xmote() as
well. It works really well, but the problem is: it's a house of cards.
It's easy to misunderstand the intent and get it wrong, and if you try to
make any little change, everything falls apart.

The other problem is that today's glock state machine relies completely
on the delayed work queue. That means you can't simply transition from
one state to the next; you need to queue delayed work and have the
delayed workqueue do the work. This requires an absurd number of context
switches to make the simplest change to a glock. That creates a lot of
unnecessary latency, and makes it much harder for smaller environments,
like virts, to do the job, due to a potentially very limited number of
cores.

This patch set is my first pass designed to radically reform the
gfs2 glock state machine. Each patch slowly transforms it into more and
more of a real state machine.

The first few patches merely just untangle the mess. After that, each
patch removes an old-style state and adds it as a new state for the new
state machine.

The result is a state machine that's readable; each state is now
clearly labeled, and it is more obvious what each state does and which
other states it can transition to.

One primary goal here is to leave the logic completely unchanged. I
wanted to make it so that code reviewers could check to make sure I
did not break today's existing logic. I do, however, have a few
optimizations at the end. So I'm not trying to fix any bugs here.
I'm just untangling spaghetti.

Another goal was to make each patch as small and digestible as possible
to avoid any hand waving.

I'm planning future patches to reduce the context switches by making
the state machine not rely as much on the glock work queue. These
patches will speed up glocks by reducing the work queue delays, by
executing the state machine from within the process context. For
example, there are only a few cases in which we need to ensure a
glock demote happens later (minimum hold time).
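
To give a feel for where the series ends up, here is a heavily simplified
sketch of the final state machine (the per-state logic, and exactly how the
requested new_state is folded into gl_mch, are in the individual patches):

	enum gl_machine_states {
		GL_ST_IDLE = 0,         /* State machine idle */
		GL_ST_FINISH_XMOTE,     /* Promotion/demotion complete */
		GL_ST_DO_XMOTE,         /* Promote or demote a waiter */
		GL_ST_DO_XMOTE_UNLOCK,  /* Retry after a failed state change
					   attempt from dlm */
		GL_ST_DEMOTE_NONBLOCK,  /* Demote in progress - non-blocking */
		GL_ST_BLOCKING_DEMOTE,  /* Demote in progress - blocking */
		GL_ST_PROMOTE,          /* Promote the lock */
		GL_ST_RUN,              /* "Run" or progress the lock */
		GL_ST_WORK,             /* Perform general glock work */
		GL_ST_UNLOCK,           /* Unlock the glock */
	};

	/* Called with gl->gl_lockref.lock held. Each case handles one state
	 * and sets gl->gl_mch to its successor, or to GL_ST_IDLE when the
	 * glock has nothing further to do; the loop runs until everything
	 * is resolved. */
	static void __state_machine(struct gfs2_glock *gl, int new_state,
				    const int nonblock)
	{
		/* the requested new_state is folded into gl->gl_mch here
		 * (details elided) */
		do {
			switch (gl->gl_mch) {
			case GL_ST_FINISH_XMOTE:
				/* process the dlm reply; may chain to
				 * GL_ST_DO_XMOTE or GL_ST_DO_XMOTE_UNLOCK */
				break;
			case GL_ST_RUN:
				/* decide between promote and (non)blocking
				 * demote */
				break;
			/* ... remaining states elided ... */
			}
		} while (gl->gl_mch != GL_ST_IDLE);
	}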
---
Bob Peterson (13):
  GFS2: Remove gotos from function run_queue
  GFS2: Make do_xmote determine its own gh parameter
  GFS2: Eliminate a goto in finish_xmote
  GFS2: Baby step toward a real state machine: finish_xmote
  GFS2: Add do_xmote states to state machine
  GFS2: Make do_xmote not call the state machine again
  GFS2: Add blocking and non-blocking demote to state machine
  GFS2: Add a new GL_ST_PROMOTE state to glock state machine
  GFS2: Replace run_queue with new GL_ST_RUN state in state machine
  GFS2: Reduce redundancy in GL_ST_DEMOTE_NONBLOCK state
  GFS2: Reduce glock_work_func to a single call to state_machine
  GFS2: Add new GL_ST_UNLOCK state to reduce calls to the __ version
  GFS2: Optimization of GL_ST_UNLOCK state

 fs/gfs2/glock.c  | 327 +--
 fs/gfs2/glock.h  |  14 ++
 fs/gfs2/incore.h |   1 +
 3 files changed, 221 insertions(+), 121 deletions(-)

-- 
2.19.1



[Cluster-devel] [PATCH 06/13] GFS2: Make do_xmote not call the state machine again

2018-11-19 Thread Bob Peterson
Before this patch, the state machine could call do_xmote which,
in turn, could call back into the state machine. This patch unravels
the logic so that do_xmote instead returns -EAGAIN, which signals the
state machine to loop back into the GL_ST_FINISH_XMOTE state.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index b541c4053dd7..8dc98d069afa 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -527,9 +527,11 @@ static int finish_xmote(struct gfs2_glock *gl)
  * @gl: The lock state
  * @target: The target lock state
  *
+ * Returns: 0 if the lock is pending, or
+ *  -EAGAIN if we need to run the state machine again to finish_xmote
  */
 
-static void do_xmote(struct gfs2_glock *gl, unsigned int target)
+static int do_xmote(struct gfs2_glock *gl, unsigned int target)
 __releases(&gl->gl_lockref.lock)
 __acquires(&gl->gl_lockref.lock)
 {
@@ -537,11 +539,11 @@ __acquires(&gl->gl_lockref.lock)
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
struct gfs2_holder *gh = find_first_waiter(gl);
unsigned int lck_flags = (unsigned int)(gh ? gh->gh_flags : 0);
-   int ret;
+   int ret = 0;
 
if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) &&
target != LM_ST_UNLOCKED)
-   return;
+   return 0;
lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
  LM_FLAG_PRIORITY);
GLOCK_BUG_ON(gl, gl->gl_state == target);
@@ -572,21 +574,23 @@ __acquires(&gl->gl_lockref.lock)
target == LM_ST_UNLOCKED &&
test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags)) {
gl->gl_reply = target;
-   state_machine(gl, GL_ST_FINISH_XMOTE);
+   ret = -EAGAIN;
gfs2_glock_queue_work(gl, 0);
}
else if (ret) {
fs_err(sdp, "lm_lock ret %d\n", ret);
GLOCK_BUG_ON(gl, !test_bit(SDF_SHUTDOWN,
   &sdp->sd_flags));
+   ret = 0;
}
} else { /* lock_nolock */
gl->gl_reply = target;
-   state_machine(gl, GL_ST_FINISH_XMOTE);
+   ret = -EAGAIN;
gfs2_glock_queue_work(gl, 0);
}
 
spin_lock(&gl->gl_lockref.lock);
+   return ret;
 }
 
 /**
@@ -698,13 +702,16 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state)
 
case GL_ST_DO_XMOTE:
gl->gl_mch = GL_ST_IDLE;
-   do_xmote(gl, gl->gl_target);
+   ret = do_xmote(gl, gl->gl_target);
+   if (ret == -EAGAIN)
+   gl->gl_mch = GL_ST_FINISH_XMOTE;
break;
 
case GL_ST_DO_XMOTE_UNLOCK:
gl->gl_mch = GL_ST_IDLE;
-   do_xmote(gl, LM_ST_UNLOCKED);
-   finish_xmote(gl);
+   ret = do_xmote(gl, LM_ST_UNLOCKED);
+   if (ret == -EAGAIN)
+   gl->gl_mch = GL_ST_FINISH_XMOTE;
break;
}
} while (gl->gl_mch != GL_ST_IDLE);
-- 
2.19.1



Re: [Cluster-devel] [PATCH 02/13] GFS2: Make do_xmote determine its own gh parameter

2018-11-19 Thread Steven Whitehouse




On 19/11/18 13:29, Bob Peterson wrote:

This is another baby step toward a better glock state machine.
Before this patch, do_xmote was called with a gh parameter, but
only for promotes, not demotes. This patch allows do_xmote to
determine the gh autonomously.

Signed-off-by: Bob Peterson 
---
  fs/gfs2/glock.c | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 692784faa464..5f2156f15f05 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -60,7 +60,7 @@ struct gfs2_glock_iter {
  
  typedef void (*glock_examiner) (struct gfs2_glock * gl);
  
-static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned int target);

+static void do_xmote(struct gfs2_glock *gl, unsigned int target);
  
  static struct dentry *gfs2_root;

  static struct workqueue_struct *glock_workqueue;
@@ -486,12 +486,12 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
/* Unlocked due to conversion deadlock, try again */
case LM_ST_UNLOCKED:
  retry:
-   do_xmote(gl, gh, gl->gl_target);
+   do_xmote(gl, gl->gl_target);
break;
/* Conversion fails, unlock and try again */
case LM_ST_SHARED:
case LM_ST_DEFERRED:
-   do_xmote(gl, gh, LM_ST_UNLOCKED);
+   do_xmote(gl, LM_ST_UNLOCKED);
break;
default: /* Everything else */
fs_err(gl->gl_name.ln_sbd, "wanted %u got %u\n",
@@ -528,17 +528,17 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
  /**
   * do_xmote - Calls the DLM to change the state of a lock
   * @gl: The lock state
- * @gh: The holder (only for promotes)
   * @target: The target lock state
   *
   */
  
-static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned int target)

+static void do_xmote(struct gfs2_glock *gl, unsigned int target)
  __releases(&gl->gl_lockref.lock)
  __acquires(&gl->gl_lockref.lock)
  {
const struct gfs2_glock_operations *glops = gl->gl_ops;
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
+   struct gfs2_holder *gh = find_first_waiter(gl);
unsigned int lck_flags = (unsigned int)(gh ? gh->gh_flags : 0);
int ret;
  
@@ -659,7 +659,7 @@ __acquires(&gl->gl_lockref.lock)

if (!(gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)))
do_error(gl, 0); /* Fail queued try locks */
}
-   do_xmote(gl, gh, gl->gl_target);
+   do_xmote(gl, gl->gl_target);
return;
  }
  
Since gh is apparently only used to get the lock flags, it would make 
more sense just to pass the lock flags rather than add in an additional 
find_first_waiter() call,
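
i.e. something roughly like this (only a sketch of the idea, untested):

	static void do_xmote(struct gfs2_glock *gl, unsigned int lck_flags,
			     unsigned int target);
	...
	/* a caller that already has the holder would just pass its flags */
	do_xmote(gl, gh ? gh->gh_flags : 0, gl->gl_target);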


Steve.



Re: [Cluster-devel] [PATCH 03/13] GFS2: Eliminate a goto in finish_xmote

2018-11-19 Thread Steven Whitehouse




On 19/11/18 13:29, Bob Peterson wrote:

This is another baby step toward a better glock state machine.
This patch eliminates a goto in function finish_xmote so we can
begin unraveling the cryptic logic with later patches.

Signed-off-by: Bob Peterson 
---
  fs/gfs2/glock.c | 11 +--
  1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 5f2156f15f05..6e9d53583b73 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -472,11 +472,11 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
list_move_tail(&gh->gh_list, 
&gl->gl_holders);
gh = find_first_waiter(gl);
gl->gl_target = gh->gh_state;
-   goto retry;
-   }
-   /* Some error or failed "try lock" - report it */
-   if ((ret & LM_OUT_ERROR) ||
-   (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB))) {
+   state = LM_ST_UNLOCKED;
I'm not sure what you are trying to achieve here, but setting the state 
to LM_ST_UNLOCKED when it is quite possible that it is not that state, 
doesn't really seem to improve anything. Indeed, it looks more confusing 
to me, at least it was fairly clear before that the intent was to retry 
the operation which has been canceled.



+   } else if ((ret & LM_OUT_ERROR) ||
+  (gh->gh_flags & (LM_FLAG_TRY |
+   LM_FLAG_TRY_1CB))) {
+   /* An error or failed "try lock" - report it */
gl->gl_target = gl->gl_state;
do_error(gl, ret);
goto out;
@@ -485,7 +485,6 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
switch(state) {
/* Unlocked due to conversion deadlock, try again */
case LM_ST_UNLOCKED:
-retry:
do_xmote(gl, gl->gl_target);
break;
/* Conversion fails, unlock and try again */




[Cluster-devel] [PATCH 12/13] GFS2: Add new GL_ST_UNLOCK state to reduce calls to the __ version

2018-11-19 Thread Bob Peterson
Before this patch, the truncate code cleared the GLF_LOCK bit itself
and then called __state_machine directly. This patch adds a new state,
GL_ST_UNLOCK, that does the same thing, thus allowing the truncate code
to call the regular state_machine function. This reduces the number of
callers of the __state_machine helper.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 11 ++-
 fs/gfs2/glock.h |  1 +
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 22ddeda90199..33f4a530caf4 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -750,6 +750,11 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state,
}
gl->gl_mch = GL_ST_RUN;
break;
+
+   case GL_ST_UNLOCK:
+   clear_bit(GLF_LOCK, &gl->gl_flags);
+   gl->gl_mch = GL_ST_RUN;
+   break;
}
} while (gl->gl_mch != GL_ST_IDLE || new_state != GL_ST_IDLE);
 
@@ -774,7 +779,6 @@ static void __state_machine(struct gfs2_glock *gl, int 
new_state,
  * @gl: pointer to the glock we are transitioning
  * @new_state: The new state we need to execute
  *
- * Just like __state_machine but it acquires the gl_lockref lock
  */
 static void state_machine(struct gfs2_glock *gl, int new_state,
  const int nonblock)
@@ -1734,10 +1738,7 @@ void gfs2_glock_finish_truncate(struct gfs2_inode *ip)
ret = gfs2_truncatei_resume(ip);
gfs2_assert_withdraw(gl->gl_name.ln_sbd, ret == 0);
 
-   spin_lock(&gl->gl_lockref.lock);
-   clear_bit(GLF_LOCK, &gl->gl_flags);
-   __state_machine(gl, GL_ST_RUN, 1);
-   spin_unlock(&gl->gl_lockref.lock);
+   state_machine(gl, GL_ST_UNLOCK, 1);
 }
 
 static const char *state2str(unsigned state)
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 0b1dffb92e8a..5bbf3118c7c3 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -131,6 +131,7 @@ enum gl_machine_states {
GL_ST_PROMOTE = 6, /* Promote the lock */
GL_ST_RUN = 7,  /* "Run" or progress the lock */
GL_ST_WORK = 8, /* Perform general glock work */
+   GL_ST_UNLOCK = 9,   /* Unlock then run */
 };
 
 struct lm_lockops {
-- 
2.19.1



[Cluster-devel] [PATCH 02/13] GFS2: Make do_xmote determine its own gh parameter

2018-11-19 Thread Bob Peterson
This is another baby step toward a better glock state machine.
Before this patch, do_xmote was called with a gh parameter, but
only for promotes, not demotes. This patch allows do_xmote to
determine the gh autonomously.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 692784faa464..5f2156f15f05 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -60,7 +60,7 @@ struct gfs2_glock_iter {
 
 typedef void (*glock_examiner) (struct gfs2_glock * gl);
 
-static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned 
int target);
+static void do_xmote(struct gfs2_glock *gl, unsigned int target);
 
 static struct dentry *gfs2_root;
 static struct workqueue_struct *glock_workqueue;
@@ -486,12 +486,12 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
/* Unlocked due to conversion deadlock, try again */
case LM_ST_UNLOCKED:
 retry:
-   do_xmote(gl, gh, gl->gl_target);
+   do_xmote(gl, gl->gl_target);
break;
/* Conversion fails, unlock and try again */
case LM_ST_SHARED:
case LM_ST_DEFERRED:
-   do_xmote(gl, gh, LM_ST_UNLOCKED);
+   do_xmote(gl, LM_ST_UNLOCKED);
break;
default: /* Everything else */
fs_err(gl->gl_name.ln_sbd, "wanted %u got %u\n",
@@ -528,17 +528,17 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
 /**
  * do_xmote - Calls the DLM to change the state of a lock
  * @gl: The lock state
- * @gh: The holder (only for promotes)
  * @target: The target lock state
  *
  */
 
-static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned 
int target)
+static void do_xmote(struct gfs2_glock *gl, unsigned int target)
 __releases(&gl->gl_lockref.lock)
 __acquires(&gl->gl_lockref.lock)
 {
const struct gfs2_glock_operations *glops = gl->gl_ops;
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
+   struct gfs2_holder *gh = find_first_waiter(gl);
unsigned int lck_flags = (unsigned int)(gh ? gh->gh_flags : 0);
int ret;
 
@@ -659,7 +659,7 @@ __acquires(&gl->gl_lockref.lock)
if (!(gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)))
do_error(gl, 0); /* Fail queued try locks */
}
-   do_xmote(gl, gh, gl->gl_target);
+   do_xmote(gl, gl->gl_target);
return;
 }
 
-- 
2.19.1



[Cluster-devel] [PATCH 03/13] GFS2: Eliminate a goto in finish_xmote

2018-11-19 Thread Bob Peterson
This is another baby step toward a better glock state machine.
This patch eliminates a goto in function finish_xmote so we can
begin unraveling the cryptic logic with later patches.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 5f2156f15f05..6e9d53583b73 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -472,11 +472,11 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
list_move_tail(&gh->gh_list, 
&gl->gl_holders);
gh = find_first_waiter(gl);
gl->gl_target = gh->gh_state;
-   goto retry;
-   }
-   /* Some error or failed "try lock" - report it */
-   if ((ret & LM_OUT_ERROR) ||
-   (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB))) {
+   state = LM_ST_UNLOCKED;
+   } else if ((ret & LM_OUT_ERROR) ||
+  (gh->gh_flags & (LM_FLAG_TRY |
+   LM_FLAG_TRY_1CB))) {
+   /* An error or failed "try lock" - report it */
gl->gl_target = gl->gl_state;
do_error(gl, ret);
goto out;
@@ -485,7 +485,6 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
switch(state) {
/* Unlocked due to conversion deadlock, try again */
case LM_ST_UNLOCKED:
-retry:
do_xmote(gl, gl->gl_target);
break;
/* Conversion fails, unlock and try again */
-- 
2.19.1



[Cluster-devel] [PATCH 01/13] GFS2: Remove gotos from function run_queue

2018-11-19 Thread Bob Peterson
This patch removes all the gotos from function run_queue and inlines
the code where necessary. This is a small step toward unravelling the
logic to reorganize the glock state machine.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 38 ++
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 05431324b262..692784faa464 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -628,10 +628,18 @@ __acquires(&gl->gl_lockref.lock)
 
if (test_bit(GLF_DEMOTE, &gl->gl_flags) &&
gl->gl_demote_state != gl->gl_state) {
-   if (find_first_holder(gl))
-   goto out_unlock;
-   if (nonblock)
-   goto out_sched;
+   if (find_first_holder(gl)) {
+   clear_bit(GLF_LOCK, &gl->gl_flags);
+   smp_mb__after_atomic();
+   return;
+   }
+   if (nonblock) {
+   clear_bit(GLF_LOCK, &gl->gl_flags);
+   smp_mb__after_atomic();
+   gl->gl_lockref.count++;
+   __gfs2_glock_queue_work(gl, 0);
+   return;
+   }
set_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags);
GLOCK_BUG_ON(gl, gl->gl_demote_state == LM_ST_EXCLUSIVE);
gl->gl_target = gl->gl_demote_state;
@@ -639,29 +647,19 @@ __acquires(&gl->gl_lockref.lock)
if (test_bit(GLF_DEMOTE, &gl->gl_flags))
gfs2_demote_wake(gl);
ret = do_promote(gl);
-   if (ret == 0)
-   goto out_unlock;
+   if (ret == 0) {
+   clear_bit(GLF_LOCK, &gl->gl_flags);
+   smp_mb__after_atomic();
+   return;
+   }
if (ret == 2)
-   goto out;
+   return;
gh = find_first_waiter(gl);
gl->gl_target = gh->gh_state;
if (!(gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)))
do_error(gl, 0); /* Fail queued try locks */
}
do_xmote(gl, gh, gl->gl_target);
-out:
-   return;
-
-out_sched:
-   clear_bit(GLF_LOCK, &gl->gl_flags);
-   smp_mb__after_atomic();
-   gl->gl_lockref.count++;
-   __gfs2_glock_queue_work(gl, 0);
-   return;
-
-out_unlock:
-   clear_bit(GLF_LOCK, &gl->gl_flags);
-   smp_mb__after_atomic();
return;
 }
 
-- 
2.19.1



Re: [Cluster-devel] [PATCH V10 18/19] block: kill QUEUE_FLAG_NO_SG_MERGE

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:58:03PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 15, 2018 at 04:53:05PM +0800, Ming Lei wrote:
> > Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after 
> > splitting"),
> > physical segment number is mainly figured out in blk_queue_split() for
> > fast path, and the flag of BIO_SEG_VALID is set there too.
> > 
> > Now only blk_recount_segments() and blk_recalc_rq_segments() use this
> > flag.
> > 
> > Basically blk_recount_segments() is bypassed in fast path given 
> > BIO_SEG_VALID
> > is set in blk_queue_split().
> > 
> > For another user of blk_recalc_rq_segments():
> > 
> > - run in partial completion branch of blk_update_request, which is an 
> > unusual case
> > 
> > - run in blk_cloned_rq_check_limits(), still not a big problem if the flag 
> > is killed
> > since dm-rq is the only user.
> > 
> > Multi-page bvec is enabled now, QUEUE_FLAG_NO_SG_MERGE doesn't make sense 
> > any more.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-devel@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-merge.c  | 31 ++-
> >  block/blk-mq-debugfs.c |  1 -
> >  block/blk-mq.c |  3 ---
> >  drivers/md/dm-table.c  | 13 -
> >  include/linux/blkdev.h |  1 -
> >  5 files changed, 6 insertions(+), 43 deletions(-)
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 153a659fde74..06be298be332 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -351,8 +351,7 @@ void blk_queue_split(struct request_queue *q, struct 
> > bio **bio)
> >  EXPORT_SYMBOL(blk_queue_split);
> >  
> >  static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
> > -struct bio *bio,
> > -bool no_sg_merge)
> > +struct bio *bio)
> >  {
> > struct bio_vec bv, bvprv = { NULL };
> > int cluster, prev = 0;
> > @@ -379,13 +378,6 @@ static unsigned int __blk_recalc_rq_segments(struct 
> > request_queue *q,
> > nr_phys_segs = 0;
> > for_each_bio(bio) {
> > bio_for_each_bvec(bv, bio, iter) {
> > -   /*
> > -* If SG merging is disabled, each bio vector is
> > -* a segment
> > -*/
> > -   if (no_sg_merge)
> > -   goto new_segment;
> > -
> > if (prev && cluster) {
> > if (seg_size + bv.bv_len
> > > queue_max_segment_size(q))
> > @@ -420,27 +412,16 @@ static unsigned int __blk_recalc_rq_segments(struct 
> > request_queue *q,
> >  
> >  void blk_recalc_rq_segments(struct request *rq)
> >  {
> > -   bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE,
> > -   &rq->q->queue_flags);
> > -
> > -   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio,
> > -   no_sg_merge);
> > +   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
> 
> Can we rename __blk_recalc_rq_segments to blk_recalc_rq_segments
> can kill the old blk_recalc_rq_segments now?

Sure.

Thanks,
Ming



Re: [Cluster-devel] [PATCH V10 18/19] block: kill QUEUE_FLAG_NO_SG_MERGE

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 06:18:11PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:53:05PM +0800, Ming Lei wrote:
> > Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after 
> > splitting"),
> > physical segment number is mainly figured out in blk_queue_split() for
> > fast path, and the flag of BIO_SEG_VALID is set there too.
> > 
> > Now only blk_recount_segments() and blk_recalc_rq_segments() use this
> > flag.
> > 
> > Basically blk_recount_segments() is bypassed in fast path given 
> > BIO_SEG_VALID
> > is set in blk_queue_split().
> > 
> > For another user of blk_recalc_rq_segments():
> > 
> > - run in partial completion branch of blk_update_request, which is an 
> > unusual case
> > 
> > - run in blk_cloned_rq_check_limits(), still not a big problem if the flag 
> > is killed
> > since dm-rq is the only user.
> > 
> > Multi-page bvec is enabled now, QUEUE_FLAG_NO_SG_MERGE doesn't make sense 
> > any more.
> 
> This commit message wasn't very clear. Is it the case that
> QUEUE_FLAG_NO_SG_MERGE is no longer set by any drivers?

OK, I will add the explanation to commit log in next version.

05f1dd53152173 (block: add queue flag for disabling SG merging) introduced this
flag purely as an NVMe performance optimization, so that segment merging could
be bypassed for NVMe.

Actually this optimization was bypassed by 54efd50bfd873e2d (block: make
generic_make_request handle arbitrarily sized bios) and bdced438acd83ad83a6c
("block: setup bi_phys_segments after splitting").

Now segment computation can be very quick, given that most of the time one bvec
can be treated as one segment, so we can remove the flag.

thanks, 
Ming



Re: [Cluster-devel] [PATCH V10 17/19] block: don't use bio->bi_vcnt to figure out segment number

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 06:11:40PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:53:04PM +0800, Ming Lei wrote:
> > It is wrong to use bio->bi_vcnt to figure out how many segments
> > there are in the bio even though CLONED flag isn't set on this bio,
> > because this bio may be splitted or advanced.
> > 
> > So always use bio_segments() in blk_recount_segments(), and it shouldn't
> > cause any performance loss now because the physical segment number is 
> > figured
> > out in blk_queue_split() and BIO_SEG_VALID is set meantime since
> > bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Fixes: 7f60dcaaf91 ("block: blk-merge: fix blk_recount_segments()")
> 
> From what I can tell, the problem was originally introduced by
> 76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max 
> segments")
> 
> Is that right?

Indeed, will update it in next version.

Thanks,
Ming



Re: [Cluster-devel] [PATCH V10 15/19] block: always define BIO_MAX_PAGES as 256

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 05:59:36PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:53:02PM +0800, Ming Lei wrote:
> > Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
> > increase BIO_MAX_PAGES for it.
> 
> You mentioned to it in the cover letter, but this needs more explanation
> in the commit message. Why did CONFIG_THP_SWAP require > 256? Why does
> multipage bvecs remove that requirement?

CONFIG_THP_SWAP needs to split one THP into normal pages and add them
all to one bio. With multi-page bvecs, it just takes one bvec to
hold them all.
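
To put numbers on it (just a sketch, assuming a 2MB THP on a system with 4K
pages; thp_head_page is a placeholder name):

	/* with multi-page bvecs, the whole THP can be described by a
	 * single bvec entry */
	struct bio_vec bv = {
		.bv_page   = thp_head_page,
		.bv_offset = 0,
		.bv_len    = HPAGE_PMD_SIZE,    /* 2MB == 512 * PAGE_SIZE */
	};
	/* with single-page bvecs the same THP needs 512 entries, which is
	 * why CONFIG_THP_SWAP previously needed BIO_MAX_PAGES > 256 */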

thanks,
Ming



Re: [Cluster-devel] [PATCH V10 14/19] block: enable multipage bvecs

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:53:08PM +0100, Christoph Hellwig wrote:
> > -
> > -   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
> > -   bv->bv_len += len;
> > -   bio->bi_iter.bi_size += len;
> > -   return true;
> > -   }
> > +   struct request_queue *q = NULL;
> > +
> > +   if (page == bv->bv_page && off == (bv->bv_offset + bv->bv_len)
> > +   && (off + len) <= PAGE_SIZE)
> 
> How could the page struct be the same, but the range beyond PAGE_SIZE
> (at least with the existing callers)?
> 
> Also no need for the inner btraces, and the && always goes on the
> first line.

OK.

> 
> > +   if (bio->bi_disk)
> > +   q = bio->bi_disk->queue;
> > +
> > +   /* disable multi-page bvec too if cluster isn't enabled */
> > +   if (!q || !blk_queue_cluster(q) ||
> > +   ((page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len) !=
> > +(page_to_phys(page) + off)))
> > +   return false;
> > + merge:
> > +   bv->bv_len += len;
> > +   bio->bi_iter.bi_size += len;
> > +   return true;
> 
> Ok, this is scary, as it will give differen results depending on when
> bi_disk is assigned.

It just decides whether to merge or not; both cases are handled correctly now.

> But then again we shouldn't really do the cluster
> check here, but rather when splitting the bio for the actual low-level
> driver.

Yeah, I thought of this approach too, but it may cause tons of bio splits for
non-clustering queues, and there are quite a few scsi devices which require
clustering to be disabled.

[linux]$ git grep -n DISABLE_CLUSTERING ./drivers/scsi/ | wc -l
 28

Or we may introduce bio_split_to_single_page_bvec() to allocate and
convert to a single-page bvec table for non-clustering queues; I will try
this approach in the next version.

> 
> (and eventually we should kill this clustering setting off in favor
> of our normal segment limits).

Yeah, it has been in my post-multi-page todo list already, :-)

thanks,
Ming



Re: [Cluster-devel] [PATCH V10 14/19] block: enable multipage bvecs

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 05:56:27PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:53:01PM +0800, Ming Lei wrote:
> > This patch pulls the trigger for multi-page bvecs.
> > 
> > Now any request queue which supports queue cluster will see multi-page
> > bvecs.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-devel@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  block/bio.c | 24 ++--
> >  1 file changed, 18 insertions(+), 6 deletions(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 6486722d4d4b..ed6df6f8e63d 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> 
> This comment above __bio_try_merge_page() doesn't make sense after this
> change:
> 
>  This is a
>  a useful optimisation for file systems with a block size smaller than the
>  page size.
> 
> Can you please get rid of it in this patch?

I understand __bio_try_merge_page() still works for the original cases, so
it looks like the optimization for sub-page-size block sizes is still there
too, isn't it?

> 
> > @@ -767,12 +767,24 @@ bool __bio_try_merge_page(struct bio *bio, struct 
> > page *page,
> >  
> > if (bio->bi_vcnt > 0) {
> > struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> > -
> > -   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
> > -   bv->bv_len += len;
> > -   bio->bi_iter.bi_size += len;
> > -   return true;
> > -   }
> > +   struct request_queue *q = NULL;
> > +
> > +   if (page == bv->bv_page && off == (bv->bv_offset + bv->bv_len)
> > +   && (off + len) <= PAGE_SIZE)
> > +   goto merge;
> 
> The parentheses around (bv->bv_offset + bv->bv_len) and (off + len) are
> unnecessary noise.
> 
> What's the point of the new (off + len) <= PAGE_SIZE check?

Yeah, I don't know why I added it, :-(; the check is always true.

> 
> > +
> > +   if (bio->bi_disk)
> > +   q = bio->bi_disk->queue;
> > +
> > +   /* disable multi-page bvec too if cluster isn't enabled */
> > +   if (!q || !blk_queue_cluster(q) ||
> > +   ((page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len) !=
> > +(page_to_phys(page) + off)))
> 
> More unnecessary parentheses here.

OK.

Thanks,
Ming



Re: [Cluster-devel] [PATCH V10 12/19] block: allow bio_for_each_segment_all() to iterate over multi-page bvec

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 05:22:45PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:59PM +0800, Ming Lei wrote:
> > This patch introduces one extra iterator variable to
> > bio_for_each_segment_all(), so that bio_for_each_segment_all() can
> > iterate over multi-page bvecs.
> > 
> > Given it is just one mechanical and simple change to all
> > bio_for_each_segment_all() users, this patch does the tree-wide change
> > in one single patch, so that we can avoid using a temporary helper for
> > this conversion.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Alexander Viro 
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: linux-bt...@vger.kernel.org
> > Cc: David Sterba 
> > Cc: Darrick J. Wong 
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-devel@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  block/bio.c   | 27 ++-
> >  block/blk-zoned.c |  1 +
> >  block/bounce.c|  6 --
> >  drivers/md/bcache/btree.c |  3 ++-
> >  drivers/md/dm-crypt.c |  3 ++-
> >  drivers/md/raid1.c|  3 ++-
> >  drivers/staging/erofs/data.c  |  3 ++-
> >  drivers/staging/erofs/unzip_vle.c |  3 ++-
> >  fs/block_dev.c|  6 --
> >  fs/btrfs/compression.c|  3 ++-
> >  fs/btrfs/disk-io.c|  3 ++-
> >  fs/btrfs/extent_io.c  | 12 
> >  fs/btrfs/inode.c  |  6 --
> >  fs/btrfs/raid56.c |  3 ++-
> >  fs/crypto/bio.c   |  3 ++-
> >  fs/direct-io.c|  4 +++-
> >  fs/exofs/ore.c|  3 ++-
> >  fs/exofs/ore_raid.c   |  3 ++-
> >  fs/ext4/page-io.c |  3 ++-
> >  fs/ext4/readpage.c|  3 ++-
> >  fs/f2fs/data.c|  9 ++---
> >  fs/gfs2/lops.c|  6 --
> >  fs/gfs2/meta_io.c |  3 ++-
> >  fs/iomap.c|  6 --
> >  fs/mpage.c|  3 ++-
> >  fs/xfs/xfs_aops.c |  5 +++--
> >  include/linux/bio.h   | 11 +--
> >  include/linux/bvec.h  | 31 +++
> >  28 files changed, 129 insertions(+), 46 deletions(-)
> > 
> 
> [snip]
> 
> > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > index 3496c816946e..1a2430a8b89d 100644
> > --- a/include/linux/bio.h
> > +++ b/include/linux/bio.h
> > @@ -131,12 +131,19 @@ static inline bool bio_full(struct bio *bio)
> > return bio->bi_vcnt >= bio->bi_max_vecs;
> >  }
> >  
> > +#define bvec_for_each_segment(bv, bvl, i, iter_all)
> > \
> > +   for (bv = bvec_init_iter_all(&iter_all);\
> > +   (iter_all.done < (bvl)->bv_len) &&  \
> > +   ((bvec_next_segment((bvl), &iter_all)), 1); \
> 
> The parentheses around (bvec_next_segment((bvl), &iter_all)) are
> unnecessary.

OK.

> 
> > +   iter_all.done += bv->bv_len, i += 1)
> > +
> >  /*
> >   * drivers should _never_ use the all version - the bio may have been split
> >   * before it got to the driver and the driver won't own all of it
> >   */
> > -#define bio_for_each_segment_all(bvl, bio, i)  
> > \
> > -   for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
> > +#define bio_for_each_segment_all(bvl, bio, i, iter_all)\
> > +   for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; 
> > iter_all.idx++)\
> > +   bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), 
> > i, iter_all)
> 
> Would it be possible to move i into iter_all to streamline this a bit?

That may cause unnecessary conversion work for us, because the local
variable 'i' is defined in the calling function.
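
For reference, a typical converted caller would look roughly like this (an
illustrative sketch only, using the macro exactly as posted in this patch):

        struct bio_vec *bvl;
        struct bvec_iter_all iter_all;
        int i;

        /* 'i' keeps its old meaning; iter_all just carries the per-page
         * iteration state needed to walk inside multi-page bvecs. */
        bio_for_each_segment_all(bvl, bio, i, iter_all) {
                /* bvl points at one single-page segment */
                flush_dcache_page(bvl->bv_page);
        }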

> 
> >  static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter 
> > *iter,
> >   unsigned bytes, bool mp)
> > diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> > index 01616a0b6220..02f26d2b59ad 100644
> > --- a/include/linux/bvec.h
> > +++ b/include/linux/bvec.h
> > @@ -82,6 +82,12 @@ struct bvec_iter {
> >current bvec */
> >  };
> >  
> > +struct bvec_iter_all {
> > +   struct bio_vec  bv;
> > +   int idx;
> > +   unsigneddone;
> > +};
> > +
> >  /*
> >   * various member access, note that bio_data should of course not be used
> >   * on highmem page vectors
> > @@ -216,6 +222,31 @@ static inline bool mp_bvec_iter_advance(const struct 
> > bio_vec *bv,
> > .bi_bvec_done   = 0,  

Re: [Cluster-devel] [PATCH V10 12/19] block: allow bio_for_each_segment_all() to iterate over multi-page bvec

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 01:42:52PM +0100, David Sterba wrote:
> On Thu, Nov 15, 2018 at 04:52:59PM +0800, Ming Lei wrote:
> > diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> > index 13ba2011a306..789b09ae402a 100644
> > --- a/block/blk-zoned.c
> > +++ b/block/blk-zoned.c
> > @@ -123,6 +123,7 @@ static int blk_report_zones(struct gendisk *disk, 
> > sector_t sector,
> > unsigned int z = 0, n, nrz = *nr_zones;
> > sector_t capacity = get_capacity(disk);
> > int ret;
> > +   struct bvec_iter_all iter_all;
> >  
> > while (z < nrz && sector < capacity) {
> > n = nrz - z;
> 
> iter_all is added but not used and I don't see any
> bio_for_each_segment_all for conversion in this function.

Good catch, will fix it in next version.

Thanks,
Ming



Re: [Cluster-devel] [PATCH V10 13/19] iomap & xfs: only account for new added page

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:49:36PM +0100, Christoph Hellwig wrote:
> I'd much rather have __bio_try_merge_page only do merges in
> the same page, and have a new __bio_try_merge_segment that does
> multi-page merges.  This will keep the accounting a lot simpler.

This way looks clever; I will do it in the next version.
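
For reference, one possible shape of such a helper (the name and exact
placement are hypothetical here, just to sketch the idea;
__bio_try_merge_page() would then stay strictly same-page):

        /* Sketch only, not the merged implementation: extend the last bvec
         * when the new page is physically contiguous with its end. */
        static bool __bio_try_merge_segment(struct bio *bio, struct page *page,
                                            unsigned int len, unsigned int off)
        {
                struct bio_vec *bv;

                if (!bio->bi_vcnt || bio_flagged(bio, BIO_CLONED))
                        return false;

                bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
                if (page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len !=
                    page_to_phys(page) + off)
                        return false;

                bv->bv_len += len;
                bio->bi_iter.bi_size += len;
                return true;
        }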

Thanks,
Ming



Re: [Cluster-devel] [PATCH V10 13/19] iomap & xfs: only account for new added page

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 05:46:58PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:53:00PM +0800, Ming Lei wrote:
> > After multi-page bvecs are enabled, a new page may be merged into an
> > existing segment even though it is a newly added page.
> > 
> > This patch deals with this issue by checking after the merge, so that
> > only a freshly added page needs to be accounted for in iomap & xfs.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-devel@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  fs/iomap.c  | 22 ++
> >  fs/xfs/xfs_aops.c   | 10 --
> >  include/linux/bio.h | 11 +++
> >  3 files changed, 33 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index df0212560b36..a1b97a5c726a 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -288,6 +288,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> > loff_t length, void *data,
> > loff_t orig_pos = pos;
> > unsigned poff, plen;
> > sector_t sector;
> > +   bool need_account = false;
> >  
> > if (iomap->type == IOMAP_INLINE) {
> > WARN_ON_ONCE(pos);
> > @@ -313,18 +314,15 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> > loff_t length, void *data,
> >  */
> > sector = iomap_sector(iomap, pos);
> > if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
> > -   if (__bio_try_merge_page(ctx->bio, page, plen, poff))
> > +   if (__bio_try_merge_page(ctx->bio, page, plen, poff)) {
> > +   need_account = iop && bio_is_last_segment(ctx->bio,
> > +   page, plen, poff);
> 
> It's redundant to make this iop && ... since you already check
> iop && need_account below. Maybe rename it to added_page? Also, this
> indentation is wack.

We can avoid calling bio_is_last_segment() in the !iop case, and I will
fix the indentation.

added_page does look like a better name.

> 
> > goto done;
> > +   }
> > is_contig = true;
> > }
> >  
> > -   /*
> > -* If we start a new segment we need to increase the read count, and we
> > -* need to do so before submitting any previous full bio to make sure
> > -* that we don't prematurely unlock the page.
> > -*/
> > -   if (iop)
> > -   atomic_inc(&iop->read_count);
> > +   need_account = true;
> >  
> > if (!ctx->bio || !is_contig || bio_full(ctx->bio)) {
> > gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
> > @@ -347,6 +345,14 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> > loff_t length, void *data,
> > __bio_add_page(ctx->bio, page, plen, poff);
> >  done:
> > /*
> > +* If we add a new page we need to increase the read count, and we
> > +* need to do so before submitting any previous full bio to make sure
> > +* that we don't prematurely unlock the page.
> > +*/
> > +   if (iop && need_account)
> > +   atomic_inc(&iop->read_count);
> > +
> > +   /*
> >  * Move the caller beyond our range so that it keeps making progress.
> >  * For that we have to include any leading non-uptodate ranges, but
> >  * we can skip trailing ones as they will be handled in the next
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 1f1829e506e8..d8e9cc9f751a 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -603,6 +603,7 @@ xfs_add_to_ioend(
> > unsignedlen = i_blocksize(inode);
> > unsignedpoff = offset & (PAGE_SIZE - 1);
> > sector_tsector;
> > +   boolneed_account;
> >  
> > sector = xfs_fsb_to_db(ip, wpc->imap.br_startblock) +
> > ((offset - XFS_FSB_TO_B(mp, wpc->imap.br_startoff)) >> 9);
> > @@ -617,13 +618,18 @@ xfs_add_to_ioend(
> > }
> >  
> > if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
> > -   if (iop)
> > -   atomic_inc(&iop->write_count);
> > +   need_account = true;
> > if (bio_full(wpc->ioend->io_bio))
> > xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
> > __bio_add_page(wpc->ioend->io_bio, page, len, poff);
> > +   } else {
> > +   need_account = iop && bio_is_last_segment(wpc->ioend->io_bio,
> > +   page, len, poff);
> 
> Same here, no need for iop &&, rename it added_page, indentation is off.
> 
> > }

Re: [Cluster-devel] [PATCH V10 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:46:45PM +0100, Christoph Hellwig wrote:
> > -   bio_for_each_segment_all(bv, bio, i) {
> > +   for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++) {
> 
> This really needs a comment.  Otherwise it looks fine to me.

OK, will do it in next version.

Thanks,
Ming



Re: [Cluster-devel] [PATCH V10 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 04:44:02PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:58PM +0800, Ming Lei wrote:
> > bch_bio_alloc_pages() is always called on one new bio, so it is safe
> > to access the bvec table directly. Given it is the only case of this
> > kind, open code the bvec table access since bio_for_each_segment_all()
> > will be changed to support iterating over multi-page bvecs.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Acked-by: Coly Li 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-devel@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/md/bcache/util.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
> > index 20eddeac1531..8517aebcda2d 100644
> > --- a/drivers/md/bcache/util.c
> > +++ b/drivers/md/bcache/util.c
> > @@ -270,7 +270,7 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
> > int i;
> > struct bio_vec *bv;
> >  
> > -   bio_for_each_segment_all(bv, bio, i) {
> > +   for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++) {
> 
> This is missing an i++.

Good catch, will fix it in next version.
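
For clarity, the corrected open-coded loop would look roughly like this (a
sketch; the allocation and unwind body is assumed to stay as it is in the
current bch_bio_alloc_pages()):

        for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; i++, bv++) {
                bv->bv_page = alloc_page(gfp_mask);
                if (!bv->bv_page) {
                        /* free whatever was allocated so far */
                        while (--bv >= bio->bi_io_vec)
                                __free_page(bv->bv_page);
                        return -ENOMEM;
                }
        }
        return 0;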

thanks,
Ming



Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:45:41PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 15, 2018 at 04:52:56PM +0800, Ming Lei wrote:
> > There are still cases in which we need to use bio_bvecs() to get the
> > number of multi-page segments, so introduce it.
> 
> The only user in your final tree seems to be the loop driver, and
> even that one only uses the helper for read/write bios.
> 
> I think something like this would be much simpler in the end:
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index d509902a8046..712511815ac6 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -514,16 +514,18 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>   struct request *rq = blk_mq_rq_from_pdu(cmd);
>   struct bio *bio = rq->bio;
>   struct file *file = lo->lo_backing_file;
> + struct bvec_iter bvec_iter;
> + struct bio_vec tmp;
>   unsigned int offset;
>   int nr_bvec = 0;
>   int ret;
>  
> + __rq_for_each_bio(bio, rq)
> + bio_for_each_bvec(tmp, bio, bvec_iter)
> + nr_bvec++;
> +
>   if (rq->bio != rq->biotail) {
> - struct bvec_iter iter;
> - struct bio_vec tmp;
>  
> - __rq_for_each_bio(bio, rq)
> - nr_bvec += bio_bvecs(bio);
>   bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
>GFP_NOIO);
>   if (!bvec)
> @@ -537,7 +539,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>* API will take care of all details for us.
>*/
>   __rq_for_each_bio(bio, rq)
> - bio_for_each_bvec(tmp, bio, iter) {
> + bio_for_each_bvec(tmp, bio, bvec_iter) {
>   *bvec = tmp;
>   bvec++;
>   }
> @@ -551,7 +553,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>*/
>   offset = bio->bi_iter.bi_bvec_done;
>   bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> - nr_bvec = bio_bvecs(bio);
>   }
>   atomic_set(&cmd->ref, 2);
>  
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index dcad0b69f57a..379440d1ced0 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -200,30 +200,6 @@ static inline unsigned bio_segments(struct bio *bio)
>   }
>  }
>  
> -static inline unsigned bio_bvecs(struct bio *bio)
> -{
> - unsigned bvecs = 0;
> - struct bio_vec bv;
> - struct bvec_iter iter;
> -
> - /*
> -  * We special case discard/write same/write zeroes, because they
> -  * interpret bi_size differently:
> -  */
> - switch (bio_op(bio)) {
> - case REQ_OP_DISCARD:
> - case REQ_OP_SECURE_ERASE:
> - case REQ_OP_WRITE_ZEROES:
> - return 0;
> - case REQ_OP_WRITE_SAME:
> - return 1;
> - default:
> - bio_for_each_bvec(bv, bio, iter)
> - bvecs++;
> - return bvecs;
> - }
> -}
> -
>  /*
>   * get a reference to a bio, so it won't disappear. the intended use is
>   * something like:

OK, will do it in next version.

Thanks,
Ming



Re: [Cluster-devel] [PATCH V10 08/19] btrfs: move bio_pages_all() to btrfs

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:38:45PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 15, 2018 at 04:52:55PM +0800, Ming Lei wrote:
> > BTRFS is the only user of this helper, so move this helper into
> > BTRFS, and implement it via bio_for_each_segment_all(), since
> > bio->bi_vcnt may not equal to number of pages after multipage bvec
> > is enabled.
> 
> btrfs only uses the value to check if it is larger than 1.  No amount
> of multipage bio merging should ever make bi_vcnt go from 0 to 1 or
> vice versa.

Could you explain a bit why?

Suppose 2 physically contiguous pages are added to this bio: .bi_vcnt
can be 1 in the multi-page bvec case, but it is 2 in the single-page
bvec case.
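
For illustration only (assuming 'bio' is fresh and the two pages are
physically contiguous):

        bio_add_page(bio, page, PAGE_SIZE, 0);
        bio_add_page(bio, page + 1, PAGE_SIZE, 0);

        /* single-page bvecs: bi_vcnt == 2
         * multi-page bvecs:  the second add merges into the first bvec,
         *                    so bi_vcnt == 1, and a bare "bi_vcnt > 1"
         *                    test changes meaning. */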

Thanks,
Ming



Re: [Cluster-devel] [PATCH V10 10/19] block: loop: pass multi-page bvec to iov_iter

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 04:40:22PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:57PM +0800, Ming Lei wrote:
> > iov_iter is implemented with a bvec iterator, so it is safe to pass
> > multi-page bvecs to it, and this is much more efficient than
> > passing one page in each bvec.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-devel@redhat.com
> 
> Reviewed-by: Omar Sandoval 
> 
> Comments below.
> 
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/block/loop.c | 23 ---
> >  1 file changed, 12 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index bf6bc35aaf88..a3fd418ec637 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -515,16 +515,16 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> > loop_cmd *cmd,
> > struct bio *bio = rq->bio;
> > struct file *file = lo->lo_backing_file;
> > unsigned int offset;
> > -   int segments = 0;
> > +   int nr_bvec = 0;
> > int ret;
> >  
> > if (rq->bio != rq->biotail) {
> > -   struct req_iterator iter;
> > +   struct bvec_iter iter;
> > struct bio_vec tmp;
> >  
> > __rq_for_each_bio(bio, rq)
> > -   segments += bio_segments(bio);
> > -   bvec = kmalloc_array(segments, sizeof(struct bio_vec),
> > +   nr_bvec += bio_bvecs(bio);
> > +   bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
> >  GFP_NOIO);
> > if (!bvec)
> > return -EIO;
> > @@ -533,13 +533,14 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> > loop_cmd *cmd,
> > /*
> >  * The bios of the request may be started from the middle of
> >  * the 'bvec' because of bio splitting, so we can't directly
> > -* copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
> > +* copy bio->bi_iov_vec to new bvec. The bio_for_each_bvec
> >  * API will take care of all details for us.
> >  */
> > -   rq_for_each_segment(tmp, rq, iter) {
> > -   *bvec = tmp;
> > -   bvec++;
> > -   }
> > +   __rq_for_each_bio(bio, rq)
> > +   bio_for_each_bvec(tmp, bio, iter) {
> > +   *bvec = tmp;
> > +   bvec++;
> > +   }
> 
> Even if they're not strictly necessary, could you please include the
> curly braces for __rq_for_each_bio() here?

Sure, will do it.

> 
> > bvec = cmd->bvec;
> > offset = 0;
> > } else {
> > @@ -550,11 +551,11 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> > loop_cmd *cmd,
> >  */
> > offset = bio->bi_iter.bi_bvec_done;
> > bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> > -   segments = bio_segments(bio);
> > +   nr_bvec = bio_bvecs(bio);
> 
> This scared me for a second, but it's fine to do here because we haven't
> actually enabled multipage bvecs yet, right?

Well, it is fine; all helpers supporting multi-page bvecs actually work
correctly even when the feature isn't enabled, because a single-page bvec
is just a special case that the multi-page bvec helpers have to handle.

Thanks,
Ming



Re: [Cluster-devel] [PATCH V10 08/19] btrfs: move bio_pages_all() to btrfs

2018-11-19 Thread Christoph Hellwig
On Mon, Nov 19, 2018 at 04:19:24PM +0800, Ming Lei wrote:
> On Fri, Nov 16, 2018 at 02:38:45PM +0100, Christoph Hellwig wrote:
> > On Thu, Nov 15, 2018 at 04:52:55PM +0800, Ming Lei wrote:
> > > BTRFS is the only user of this helper, so move this helper into
> > > BTRFS, and implement it via bio_for_each_segment_all(), since
> > > bio->bi_vcnt may not equal to number of pages after multipage bvec
> > > is enabled.
> > 
> > btrfs only uses the value to check if it is larger than 1.  No amount
> > of multipage bio merging should ever make bi_vcnt go from 0 to 1 or
> > vice versa.
> 
> Could you explain a bit why?
> 
> Suppose 2 physically continuous pages are added to this bio, .bi_vcnt
> can be 1 in case of multi-page bvec, but it is 2 in case of single-page
> bvec.

True, I did think of 0 vs 1.

The magic here in btrfs still doesn't make much sense.  The comments
down in btrfs_check_repairable talk about sectors, so it might be a
leftover from the blocksize == PAGE_SIZE days (assuming those patches
actually got merged).  I'd like to have the btrfs folks chime in,
but in the end we should probably check if the bio was larger than
a single sector here.



Re: [Cluster-devel] [PATCH V10 08/19] btrfs: move bio_pages_all() to btrfs

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 04:23:56PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:55PM +0800, Ming Lei wrote:
> > BTRFS is the only user of this helper, so move this helper into
> > BTRFS, and implement it via bio_for_each_segment_all(), since
> > bio->bi_vcnt may not equal to number of pages after multipage bvec
> > is enabled.
> 
> Shouldn't you also get rid of bio_pages_all() in this patch?

Good catch!

thanks,
Ming



Re: [Cluster-devel] [PATCH V10 05/19] block: introduce bvec_last_segment()

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 03:23:56PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:52PM +0800, Ming Lei wrote:
> > BTRFS and guard_bio_eod() need to get the last single-page segment
> > from one multi-page bvec, so introduce this helper to make them happy.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-devel@redhat.com
> 
> Reviewed-by: Omar Sandoval 
> 
> Minor comments below.
> 
> > Signed-off-by: Ming Lei 
> > ---
> >  include/linux/bvec.h | 25 +
> >  1 file changed, 25 insertions(+)
> > 
> > diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> > index 3d61352cd8cf..01616a0b6220 100644
> > --- a/include/linux/bvec.h
> > +++ b/include/linux/bvec.h
> > @@ -216,4 +216,29 @@ static inline bool mp_bvec_iter_advance(const struct 
> > bio_vec *bv,
> > .bi_bvec_done   = 0,\
> >  }
> >  
> > +/*
> > + * Get the last singlepage segment from the multipage bvec and store it
> > + * in @seg
> > + */
> > +static inline void bvec_last_segment(const struct bio_vec *bvec,
> > +   struct bio_vec *seg)
> 
> Indentation is all messed up here.

Will fix it.

> 
> > +{
> > +   unsigned total = bvec->bv_offset + bvec->bv_len;
> > +   unsigned last_page = total / PAGE_SIZE;
> > +
> > +   if (last_page * PAGE_SIZE == total)
> > +   last_page--;
> 
> I think this could just be
> 
>   unsigned int last_page = (total - 1) / PAGE_SIZE;

This way is really elegant.
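
Folding that in, the helper would look roughly as below (a sketch: everything
past the index computation is a plausible reconstruction of the rest of the
helper, not text quoted from the posted patch):

        static inline void bvec_last_segment(const struct bio_vec *bvec,
                                             struct bio_vec *seg)
        {
                unsigned int total = bvec->bv_offset + bvec->bv_len;
                unsigned int last_page = (total - 1) / PAGE_SIZE;

                seg->bv_page = nth_page(bvec->bv_page, last_page);

                /* the whole segment sits inside the last page */
                if (bvec->bv_offset >= last_page * PAGE_SIZE) {
                        seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
                        seg->bv_len = bvec->bv_len;
                } else {
                        seg->bv_offset = 0;
                        seg->bv_len = total - last_page * PAGE_SIZE;
                }
        }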

Thanks,
Ming



Re: [Cluster-devel] [PATCH V10 07/19] btrfs: use bvec_last_segment to get bio's last page

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:37:10PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 15, 2018 at 04:52:54PM +0800, Ming Lei wrote:
> > index 2955a4ea2fa8..161e14b8b180 100644
> > --- a/fs/btrfs/compression.c
> > +++ b/fs/btrfs/compression.c
> > @@ -400,8 +400,11 @@ blk_status_t btrfs_submit_compressed_write(struct 
> > inode *inode, u64 start,
> >  static u64 bio_end_offset(struct bio *bio)
> >  {
> > struct bio_vec *last = bio_last_bvec_all(bio);
> > +   struct bio_vec bv;
> >  
> > -   return page_offset(last->bv_page) + last->bv_len + last->bv_offset;
> > +   bvec_last_segment(last, &bv);
> > +
> > +   return page_offset(bv.bv_page) + bv.bv_len + bv.bv_offset;
> 
> I don't think we need this.  If last is a multi-page bvec bv_offset
> will already contain the correct offset from the first page.

Yeah, that is true for this specific case; it looks like we can drop this patch.


thanks,
Ming



Re: [Cluster-devel] [PATCH V10 03/19] block: use bio_for_each_bvec() to compute multi-page bvec count

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 12:20:28PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:50PM +0800, Ming Lei wrote:
> > First it is more efficient to use bio_for_each_bvec() in both
> > blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
> > many multi-page bvecs there are in the bio.
> > 
> > Secondly, once bio_for_each_bvec() is used, the bvec may need to be
> > split because its length can be much longer than the max segment size,
> > so we have to split the big bvec into several segments.
> > 
> > Thirdly, when splitting a multi-page bvec into segments, the max segment
> > limit may be reached, so the bio split needs to be considered in this
> > situation too.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-devel@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-merge.c | 90 
> > ++-
> >  1 file changed, 76 insertions(+), 14 deletions(-)
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 91b2af332a84..6f7deb94a23f 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -160,6 +160,62 @@ static inline unsigned get_max_io_size(struct 
> > request_queue *q,
> > return sectors;
> >  }
> >  
> > +/*
> > + * Split the bvec @bv into segments, and update all kinds of
> > + * variables.
> > + */
> > +static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
> > +   unsigned *nsegs, unsigned *last_seg_size,
> > +   unsigned *front_seg_size, unsigned *sectors)
> > +{
> > +   bool need_split = false;
> > +   unsigned len = bv->bv_len;
> > +   unsigned total_len = 0;
> > +   unsigned new_nsegs = 0, seg_size = 0;
> 
> "unsigned int" here and everywhere else.
> 
> > +   if ((*nsegs >= queue_max_segments(q)) || !len)
> > +   return need_split;
> > +
> > +   /*
> > +* Multipage bvec may be too big to hold in one segment,
> > +* so the current bvec has to be splitted as multiple
> > +* segments.
> > +*/
> > +   while (new_nsegs + *nsegs < queue_max_segments(q)) {
> > +   seg_size = min(queue_max_segment_size(q), len);
> > +
> > +   new_nsegs++;
> > +   total_len += seg_size;
> > +   len -= seg_size;
> > +
> > +   if ((queue_virt_boundary(q) && ((bv->bv_offset +
> > +   total_len) & queue_virt_boundary(q))) || !len)
> > +   break;
> 
> Checking queue_virt_boundary(q) != 0 is superfluous, and the len check
> could just control the loop, i.e.,
> 
>   while (len && new_nsegs + *nsegs < queue_max_segments(q)) {
>   seg_size = min(queue_max_segment_size(q), len);
> 
>   new_nsegs++;
>   total_len += seg_size;
>   len -= seg_size;
> 
>   if ((bv->bv_offset + total_len) & queue_virt_boundary(q))
>   break;
>   }
> 
> And if you rewrite it this way, I _think_ you can get rid of this
> special case:
> 
>   if ((*nsegs >= queue_max_segments(q)) || !len)
>   return need_split;
> 
> above.

Good point, will do in next version.

> 
> > +   }
> > +
> > +   /* split in the middle of the bvec */
> > +   if (len)
> > +   need_split = true;
> 
> need_split is unnecessary, just return len != 0.

OK.

> 
> > +
> > +   /* update front segment size */
> > +   if (!*nsegs) {
> > +   unsigned first_seg_size = seg_size;
> > +
> > +   if (new_nsegs > 1)
> > +   first_seg_size = queue_max_segment_size(q);
> > +   if (*front_seg_size < first_seg_size)
> > +   *front_seg_size = first_seg_size;
> > +   }
> > +
> > +   /* update other varibles */
> > +   *last_seg_size = seg_size;
> > +   *nsegs += new_nsegs;
> > +   if (sectors)
> > +   *sectors += total_len >> 9;
> > +
> > +   return need_split;
> > +}
> > +
> >  static struct bio *blk_bio_segment_split(struct request_queue *q,
> >  struct bio *bio,
> >  struct bio_set *bs,
> > @@ -173,7 +229,7 @@ static struct bio *blk_bio_segment_split(struct 
> > request_queue *q,
> > struct bio *new = NULL;
> > const unsigned max_sectors = get_max_io_size(q, bio);
> >  
> > -   bio_for_each_segment(bv, bio, iter) {
> > +   bio_for_each_bvec(bv, bio, iter) {
> > /*
> >  * If the queue doesn't support SG gaps and adding this
> >  * offset would create a

Re: [Cluster-devel] [PATCH V10 04/19] block: use bio_for_each_bvec() to map sg

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:33:14PM +0100, Christoph Hellwig wrote:
> > +   if (!*sg)
> > +   return sglist;
> > +   else {
> 
> No need for an else after an early return.

OK, good catch!

Thanks,
Ming