Re: [Ocfs2-devel] [PATCH] ocfs2: don't put and assign null to bh allocated outside

2018-04-10 Thread Changwei Ge
Hi Jun,

On 2018/4/11 9:43, piaojun wrote:
> Hi Changwei,
> 
> On 2018/4/11 9:14, Changwei Ge wrote:
>> Hi Jun,
>>
>> Thanks for your review.
>>
>> On 2018/4/11 8:40, piaojun wrote:
>>> Hi Changwei,
>>>
>>> On 2018/4/10 19:35, Changwei Ge wrote:
 ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
 several blocks from disk. Currently, the input argument *bhs* can be
 NULL or NOT. It depends on the caller's behavior. If the function fails
 in reading blocks from disk, the corresponding bh will be assigned to
 NULL and put.

 Obviously, above process for non-NULL input bh is not appropriate.
 Because the caller doesn't even know its bhs are put and re-assigned.

 If buffer head is managed by caller, ocfs2_read_blocks and
 ocfs2_read_blocks_sync()  should not evaluate it to NULL. It will
 cause caller accessing illegal memory, thus crash.

 Also, we should put bhs which have succeeded in reading before current
 read failure.

 Signed-off-by: Changwei Ge 
 ---
fs/ocfs2/buffer_head_io.c | 77 
 ---
1 file changed, 59 insertions(+), 18 deletions(-)

 diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
 index d9ebe11..7ae4147 100644
 --- a/fs/ocfs2/buffer_head_io.c
 +++ b/fs/ocfs2/buffer_head_io.c
 @@ -99,25 +99,34 @@ int ocfs2_write_block(struct ocfs2_super *osb, struct 
 buffer_head *bh,
return ret;
}

 +/* Caller must provide a bhs[] with all NULL or non-NULL entries, so it
 + * will be easier to handle read failure.
 + */
int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 block,
   unsigned int nr, struct buffer_head *bhs[])
{
int status = 0;
unsigned int i;
struct buffer_head *bh;
 +  int new_bh = 0;

trace_ocfs2_read_blocks_sync((unsigned long long)block, nr);

if (!nr)
goto bail;

 +  /* Don't put buffer head and re-assign it to NULL if it is allocated
 +   * outside since the call can't be aware of this alternation!
 +   */
 +  new_bh = (bhs[0] == NULL);
>>>
>>> 'new_bh' just means the first bh is NULL, what if the middle bh is NULL?
>>
>> I am afraid we have to assume that if the first bh is NULL, others must to be
>> NULL as well. Otherwise this function's exception handling will be rather
>> complicated and messy. So I add a comment before this function to remind the
>> caller should pass appropriate arguments in.
>> And I checked the callers to this function, they behave like my assumption.
>>
>> Moreover, who would pass a so strange bh array in?
> 
> What about restricting the outside caller's behaviour rather than handling
> it in ocfs2_read_blocks_sync()? Will it be easier to do this?

No, it can't be easier than fix it inside ocfs2_read_blocks_sync().
Because some callers don't want to put bh until it has to.
Others will put shortly after it obtains it wants from disk.
So it's hard to restrict callers' behavior.
And that way will change long-lived logic a lot.

Thanks,
Changwei

> 
> thanks,
> Jum
> 
>>
>>>
 +
for (i = 0 ; i < nr ; i++) {
if (bhs[i] == NULL) {
bhs[i] = sb_getblk(osb->sb, block++);
if (bhs[i] == NULL) {
status = -ENOMEM;
mlog_errno(status);
 -  goto bail;
 +  break;
}
}
bh = bhs[i];
 @@ -158,9 +167,26 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, 
 u64 block,
submit_bh(REQ_OP_READ, 0, bh);
}

 +read_failure:
>>>
>>> This looks weird that 'read_failure' include the normal branch.
>>
>> Sounds reasonable.
>> How about if_read_failure or may_read_failure?
>>
>> Thanks,
>> Changwei
>>
>>>
for (i = nr; i > 0; i--) {
bh = bhs[i - 1];

 +  if (unlikely(status)) {
 +  if (new_bh && !bh) {
 +  /* If middle bh fails, let previous bh
 +   * finish its read and then put it to
 +   * aovoid bh leak
 +   */
 +  if (!buffer_jbd(bh))
 +  wait_on_buffer(bh);
 +  put_bh(bh);
 +  bhs[i - 1] = NULL;
 +  } else if (buffer_uptodate(bh)) {
 +  clear_buffer_uptodate(bh);
 +  }
 +  continue;
>>>

Re: [Ocfs2-devel] [PATCH] ocfs2: don't use iocb when EIOCBQUEUED returns

2018-04-10 Thread Gang He
Hi Changwei,

The code change just works around the problem, but theoretically the IOCB 
object should not be freed before which is handled.
Anyway, if we can find the root cause behind via some way (e.g. inject delay in 
some place), the result is more perfect. 


Thanks
Gang


>>> 
> Hi Jun,
> 
> On 2018/4/11 8:52, piaojun wrote:
>> Hi Changwei,
>> 
>> It looks like a code bug, and 'iocb' should not be freed at this place.
>> Could this BUG reproduced easily?
> 
> Actually, it's not easy to be reproduced since IO is much slower than CPU 
> executing instructions. But the logic here is broken, we'd better fix this.
> 
> Thanks,
> Changwei
> 
>> 
>> thanks,
>> Jun
>> 
>> On 2018/4/10 20:00, Changwei Ge wrote:
>>> When -EIOCBQUEUED returns, it means that aio_complete() will be called
>>> from dio_complete(), which is an asynchronous progress against write_iter.
>>> Generally, IO is a very slow progress than executing instruction, but we
>>> still can't take the risk to access a freed iocb.
>>>
>>> And we do face a BUG crash issue.
>>> >From crash tool, iocb is obviously freed already.
>>> crash> struct -x kiocb 881a350f5900
>>> struct kiocb {
>>>ki_filp = 0x881a350f5a80,
>>>ki_pos = 0x0,
>>>ki_complete = 0x0,
>>>private = 0x0,
>>>ki_flags = 0x0
>>> }
>>>
>>> And the backtrace shows:
>>> ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
>>> ? ocfs2_check_range_for_refcount+0x150/0x150 [ocfs2]
>>> aio_run_iocb+0x229/0x2f0
>>> ? try_to_wake_up+0x380/0x380
>>> do_io_submit+0x291/0x540
>>> ? syscall_trace_leave+0xad/0x130
>>> SyS_io_submit+0x10/0x20
>>> system_call_fastpath+0x16/0x75
>>>
>>> Signed-off-by: Changwei Ge 
>>> ---
>>>   fs/ocfs2/file.c | 4 ++--
>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>>> index 5d1784a..1393ff2 100644
>>> --- a/fs/ocfs2/file.c
>>> +++ b/fs/ocfs2/file.c
>>> @@ -2343,7 +2343,7 @@ static ssize_t ocfs2_file_write_iter(struct kiocb 
> *iocb,
>>>   
>>> written = __generic_file_write_iter(iocb, from);
>>> /* buffered aio wouldn't have proper lock coverage today */
>>> -   BUG_ON(written == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
>>> +   BUG_ON(written == -EIOCBQUEUED && !direct_io);
>>>   
>>> /*
>>>  * deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io
>>> @@ -2463,7 +2463,7 @@ static ssize_t ocfs2_file_read_iter(struct kiocb 
>>> *iocb,
>>> trace_generic_file_aio_read_ret(ret);
>>>   
>>> /* buffered aio wouldn't have proper lock coverage today */
>>> -   BUG_ON(ret == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
>>> +   BUG_ON(ret == -EIOCBQUEUED && !direct_io);
>>>   
>>> /* see ocfs2_file_write_iter */
>>> if (ret == -EIOCBQUEUED || !ocfs2_iocb_is_rw_locked(iocb)) {
>>>
>> 
> 
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com 
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2/o2hb: check len for bio_add_page() to avoid submitting incorrect bio

2018-04-10 Thread Changwei Ge
Hi Jun,

Thanks for your patch.

I just applied your patch into my tree and triggered ocfs2-test.
Unfortunately, the very first case fails in making fs since bio can't 
accommodate more than 16 vecs.

Of course this is not introduced by your patch. You patch just makes this 
hidden issue visible.

I just want to remind if this patch is applied. The cluster scale can't exceed 
16 nodes.

And I will try to post a patch to fix it.

Attach log:

Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329330] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, vec_start = 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329331] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, vec_start = 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329332] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, vec_start = 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329333] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, vec_start = 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329334] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, vec_start = 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329335] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, vec_start = 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329336] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, vec_start = 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329337] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, vec_start = 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329338] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, vec_start = 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329339] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, vec_start = 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329339] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, vec_start 
= 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329340] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, vec_start 
= 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329341] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, vec_start 
= 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329342] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, vec_start 
= 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329343] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, vec_start 
= 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329344] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, vec_start 
= 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329345] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, vec_start 
= 0
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329346] 
(mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio 
failed, page ea0002d7ed40, len 0, vec_len 4096, vec_start 0, bi_sector 8192
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329357] 
(mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329361] 
(mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5
Apr 11 08:37:02 cvknode-ocfs2test-e0501-1 kernel: [  942.329364] 
(mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5


On 2018/3/28 11:52, piaojun wrote:
> We need check len for bio_add_page() to make sure the bio has been set up
> correctly, otherwise we may submit incorrect data to device.
> 
> Signed-off-by: Jun Piao 
> Reviewed-by: Yiwen Jiang 
> ---
>   fs/ocfs2/cluster/heartbeat.c | 11 ++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
> index ea8c551..43ad79f 100644
> --- a/fs/ocfs2/cluster/heartbeat.c
> +++ b/fs/ocfs2/cluster/heartbeat.c
> @@ -570,7 +570,16 @@ static struct bio *o2hb_setup_one_bio(struct o2hb_region 
> *reg,
>current_page, vec_len, vec_start);
> 
>   len = bio_add_page(bio, page, vec_len, vec_start);
> - if (len != vec_len) break;
> + if (len != vec_len) {
> + mlog(ML_ERROR, "Adding page[%d] to bio failed, "
> +  "page %p, len %d, vec_len %u, vec_start %u, "
> +  "bi_sector %llu\n", current_page, page, len,
> +  vec_len, vec_start,
> +  (unsigned long long)bio->bi_iter.bi_sector);
> + bio_put(bio);
> + bio = ERR_PTR(-EFAULT);
> + return bio;
> + }
> 
>   cs += vec_len / (PAGE_SIZE/spp);
>   vec_start = 0;
> 

__

Re: [Ocfs2-devel] [PATCH] ocfs2: don't use iocb when EIOCBQUEUED returns

2018-04-10 Thread Changwei Ge
Hi Jun,

On 2018/4/11 9:52, piaojun wrote:
> Hi Changwei,
> 
> It seems other codes which try to access 'iocb' will also cause error,
> right? I think we should find the reason why 'iocb' is freed first.

Which code snippet do you mean? Actually, I have checked most of other parts in 
write_iter() and read_iter(). I don't see any risk accessing freed iocb yet.

The root cause is clear for this issue.
iocb is freed by aio_complete().

You can refer to path:

dio_complete
   aio_complete
 kiocb_free

Thanks,
Changwei

> 
> thanks,
> Jun
> 
> On 2018/4/11 9:07, Changwei Ge wrote:
>> Hi Jun,
>>
>> On 2018/4/11 8:52, piaojun wrote:
>>> Hi Changwei,
>>>
>>> It looks like a code bug, and 'iocb' should not be freed at this place.
>>> Could this BUG reproduced easily?
>>
>> Actually, it's not easy to be reproduced since IO is much slower than CPU
>> executing instructions. But the logic here is broken, we'd better fix this.
>>
>> Thanks,
>> Changwei
>>
>>>
>>> thanks,
>>> Jun
>>>
>>> On 2018/4/10 20:00, Changwei Ge wrote:
 When -EIOCBQUEUED returns, it means that aio_complete() will be called
 from dio_complete(), which is an asynchronous progress against write_iter.
 Generally, IO is a very slow progress than executing instruction, but we
 still can't take the risk to access a freed iocb.

 And we do face a BUG crash issue.
 >From crash tool, iocb is obviously freed already.
 crash> struct -x kiocb 881a350f5900
 struct kiocb {
 ki_filp = 0x881a350f5a80,
 ki_pos = 0x0,
 ki_complete = 0x0,
 private = 0x0,
 ki_flags = 0x0
 }

 And the backtrace shows:
 ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
 ? ocfs2_check_range_for_refcount+0x150/0x150 [ocfs2]
 aio_run_iocb+0x229/0x2f0
 ? try_to_wake_up+0x380/0x380
 do_io_submit+0x291/0x540
 ? syscall_trace_leave+0xad/0x130
 SyS_io_submit+0x10/0x20
 system_call_fastpath+0x16/0x75

 Signed-off-by: Changwei Ge 
 ---
fs/ocfs2/file.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

 diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
 index 5d1784a..1393ff2 100644
 --- a/fs/ocfs2/file.c
 +++ b/fs/ocfs2/file.c
 @@ -2343,7 +2343,7 @@ static ssize_t ocfs2_file_write_iter(struct kiocb 
 *iocb,

written = __generic_file_write_iter(iocb, from);
/* buffered aio wouldn't have proper lock coverage today */
 -  BUG_ON(written == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
 +  BUG_ON(written == -EIOCBQUEUED && !direct_io);

/*
 * deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a 
 ocfs2_dio_end_io
 @@ -2463,7 +2463,7 @@ static ssize_t ocfs2_file_read_iter(struct kiocb 
 *iocb,
trace_generic_file_aio_read_ret(ret);

/* buffered aio wouldn't have proper lock coverage today */
 -  BUG_ON(ret == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
 +  BUG_ON(ret == -EIOCBQUEUED && !direct_io);

/* see ocfs2_file_write_iter */
if (ret == -EIOCBQUEUED || !ocfs2_iocb_is_rw_locked(iocb)) {

>>>
>> .
>>
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: don't use iocb when EIOCBQUEUED returns

2018-04-10 Thread piaojun
Hi Changwei,

It seems other codes which try to access 'iocb' will also cause error,
right? I think we should find the reason why 'iocb' is freed first.

thanks,
Jun

On 2018/4/11 9:07, Changwei Ge wrote:
> Hi Jun,
> 
> On 2018/4/11 8:52, piaojun wrote:
>> Hi Changwei,
>>
>> It looks like a code bug, and 'iocb' should not be freed at this place.
>> Could this BUG reproduced easily?
> 
> Actually, it's not easy to be reproduced since IO is much slower than CPU 
> executing instructions. But the logic here is broken, we'd better fix this.
> 
> Thanks,
> Changwei
> 
>>
>> thanks,
>> Jun
>>
>> On 2018/4/10 20:00, Changwei Ge wrote:
>>> When -EIOCBQUEUED returns, it means that aio_complete() will be called
>>> from dio_complete(), which is an asynchronous progress against write_iter.
>>> Generally, IO is a very slow progress than executing instruction, but we
>>> still can't take the risk to access a freed iocb.
>>>
>>> And we do face a BUG crash issue.
>>> >From crash tool, iocb is obviously freed already.
>>> crash> struct -x kiocb 881a350f5900
>>> struct kiocb {
>>>ki_filp = 0x881a350f5a80,
>>>ki_pos = 0x0,
>>>ki_complete = 0x0,
>>>private = 0x0,
>>>ki_flags = 0x0
>>> }
>>>
>>> And the backtrace shows:
>>> ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
>>> ? ocfs2_check_range_for_refcount+0x150/0x150 [ocfs2]
>>> aio_run_iocb+0x229/0x2f0
>>> ? try_to_wake_up+0x380/0x380
>>> do_io_submit+0x291/0x540
>>> ? syscall_trace_leave+0xad/0x130
>>> SyS_io_submit+0x10/0x20
>>> system_call_fastpath+0x16/0x75
>>>
>>> Signed-off-by: Changwei Ge 
>>> ---
>>>   fs/ocfs2/file.c | 4 ++--
>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>>> index 5d1784a..1393ff2 100644
>>> --- a/fs/ocfs2/file.c
>>> +++ b/fs/ocfs2/file.c
>>> @@ -2343,7 +2343,7 @@ static ssize_t ocfs2_file_write_iter(struct kiocb 
>>> *iocb,
>>>   
>>> written = __generic_file_write_iter(iocb, from);
>>> /* buffered aio wouldn't have proper lock coverage today */
>>> -   BUG_ON(written == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
>>> +   BUG_ON(written == -EIOCBQUEUED && !direct_io);
>>>   
>>> /*
>>>  * deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io
>>> @@ -2463,7 +2463,7 @@ static ssize_t ocfs2_file_read_iter(struct kiocb 
>>> *iocb,
>>> trace_generic_file_aio_read_ret(ret);
>>>   
>>> /* buffered aio wouldn't have proper lock coverage today */
>>> -   BUG_ON(ret == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
>>> +   BUG_ON(ret == -EIOCBQUEUED && !direct_io);
>>>   
>>> /* see ocfs2_file_write_iter */
>>> if (ret == -EIOCBQUEUED || !ocfs2_iocb_is_rw_locked(iocb)) {
>>>
>>
> .
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: don't put and assign null to bh allocated outside

2018-04-10 Thread piaojun
Hi Changwei,

On 2018/4/11 9:14, Changwei Ge wrote:
> Hi Jun,
> 
> Thanks for your review.
> 
> On 2018/4/11 8:40, piaojun wrote:
>> Hi Changwei,
>>
>> On 2018/4/10 19:35, Changwei Ge wrote:
>>> ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
>>> several blocks from disk. Currently, the input argument *bhs* can be
>>> NULL or NOT. It depends on the caller's behavior. If the function fails
>>> in reading blocks from disk, the corresponding bh will be assigned to
>>> NULL and put.
>>>
>>> Obviously, above process for non-NULL input bh is not appropriate.
>>> Because the caller doesn't even know its bhs are put and re-assigned.
>>>
>>> If buffer head is managed by caller, ocfs2_read_blocks and
>>> ocfs2_read_blocks_sync()  should not evaluate it to NULL. It will
>>> cause caller accessing illegal memory, thus crash.
>>>
>>> Also, we should put bhs which have succeeded in reading before current
>>> read failure.
>>>
>>> Signed-off-by: Changwei Ge 
>>> ---
>>>   fs/ocfs2/buffer_head_io.c | 77 
>>> ---
>>>   1 file changed, 59 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
>>> index d9ebe11..7ae4147 100644
>>> --- a/fs/ocfs2/buffer_head_io.c
>>> +++ b/fs/ocfs2/buffer_head_io.c
>>> @@ -99,25 +99,34 @@ int ocfs2_write_block(struct ocfs2_super *osb, struct 
>>> buffer_head *bh,
>>> return ret;
>>>   }
>>>   
>>> +/* Caller must provide a bhs[] with all NULL or non-NULL entries, so it
>>> + * will be easier to handle read failure.
>>> + */
>>>   int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 block,
>>>unsigned int nr, struct buffer_head *bhs[])
>>>   {
>>> int status = 0;
>>> unsigned int i;
>>> struct buffer_head *bh;
>>> +   int new_bh = 0;
>>>   
>>> trace_ocfs2_read_blocks_sync((unsigned long long)block, nr);
>>>   
>>> if (!nr)
>>> goto bail;
>>>   
>>> +   /* Don't put buffer head and re-assign it to NULL if it is allocated
>>> +* outside since the call can't be aware of this alternation!
>>> +*/
>>> +   new_bh = (bhs[0] == NULL);
>>
>> 'new_bh' just means the first bh is NULL, what if the middle bh is NULL?
> 
> I am afraid we have to assume that if the first bh is NULL, others must to be 
> NULL as well. Otherwise this function's exception handling will be rather 
> complicated and messy. So I add a comment before this function to remind the 
> caller should pass appropriate arguments in.
> And I checked the callers to this function, they behave like my assumption.
> 
> Moreover, who would pass a so strange bh array in?

What about restricting the outside caller's behaviour rather than handling
it in ocfs2_read_blocks_sync()? Will it be easier to do this?

thanks,
Jum

> 
>>
>>> +
>>> for (i = 0 ; i < nr ; i++) {
>>> if (bhs[i] == NULL) {
>>> bhs[i] = sb_getblk(osb->sb, block++);
>>> if (bhs[i] == NULL) {
>>> status = -ENOMEM;
>>> mlog_errno(status);
>>> -   goto bail;
>>> +   break;
>>> }
>>> }
>>> bh = bhs[i];
>>> @@ -158,9 +167,26 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, 
>>> u64 block,
>>> submit_bh(REQ_OP_READ, 0, bh);
>>> }
>>>   
>>> +read_failure:
>>
>> This looks weird that 'read_failure' include the normal branch.
> 
> Sounds reasonable.
> How about if_read_failure or may_read_failure?
> 
> Thanks,
> Changwei
> 
>>
>>> for (i = nr; i > 0; i--) {
>>> bh = bhs[i - 1];
>>>   
>>> +   if (unlikely(status)) {
>>> +   if (new_bh && !bh) {
>>> +   /* If middle bh fails, let previous bh
>>> +* finish its read and then put it to
>>> +* aovoid bh leak
>>> +*/
>>> +   if (!buffer_jbd(bh))
>>> +   wait_on_buffer(bh);
>>> +   put_bh(bh);
>>> +   bhs[i - 1] = NULL;
>>> +   } else if (buffer_uptodate(bh)) {
>>> +   clear_buffer_uptodate(bh);
>>> +   }
>>> +   continue;
>>> +   }
>>> +
>>> /* No need to wait on the buffer if it's managed by JBD. */
>>> if (!buffer_jbd(bh))
>>> wait_on_buffer(bh);
>>> @@ -170,8 +196,7 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
>>> block,
>>>  * so we can safely record this and loop back
>>>  * to cleanup the other buffers. */
>>> status = -EIO;
>>> -   put_bh(bh);
>>> -   bhs[i - 1] = NULL;
>>> +   goto read_failure;
>>> }
>>> }
>>>   
>>> @@ -17

Re: [Ocfs2-devel] [PATCH] ocfs2: don't put and assign null to bh allocated outside

2018-04-10 Thread Changwei Ge
Hi Jun,

Thanks for your review.

On 2018/4/11 8:40, piaojun wrote:
> Hi Changwei,
> 
> On 2018/4/10 19:35, Changwei Ge wrote:
>> ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
>> several blocks from disk. Currently, the input argument *bhs* can be
>> NULL or NOT. It depends on the caller's behavior. If the function fails
>> in reading blocks from disk, the corresponding bh will be assigned to
>> NULL and put.
>>
>> Obviously, above process for non-NULL input bh is not appropriate.
>> Because the caller doesn't even know its bhs are put and re-assigned.
>>
>> If buffer head is managed by caller, ocfs2_read_blocks and
>> ocfs2_read_blocks_sync()  should not evaluate it to NULL. It will
>> cause caller accessing illegal memory, thus crash.
>>
>> Also, we should put bhs which have succeeded in reading before current
>> read failure.
>>
>> Signed-off-by: Changwei Ge 
>> ---
>>   fs/ocfs2/buffer_head_io.c | 77 
>> ---
>>   1 file changed, 59 insertions(+), 18 deletions(-)
>>
>> diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
>> index d9ebe11..7ae4147 100644
>> --- a/fs/ocfs2/buffer_head_io.c
>> +++ b/fs/ocfs2/buffer_head_io.c
>> @@ -99,25 +99,34 @@ int ocfs2_write_block(struct ocfs2_super *osb, struct 
>> buffer_head *bh,
>>  return ret;
>>   }
>>   
>> +/* Caller must provide a bhs[] with all NULL or non-NULL entries, so it
>> + * will be easier to handle read failure.
>> + */
>>   int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 block,
>> unsigned int nr, struct buffer_head *bhs[])
>>   {
>>  int status = 0;
>>  unsigned int i;
>>  struct buffer_head *bh;
>> +int new_bh = 0;
>>   
>>  trace_ocfs2_read_blocks_sync((unsigned long long)block, nr);
>>   
>>  if (!nr)
>>  goto bail;
>>   
>> +/* Don't put buffer head and re-assign it to NULL if it is allocated
>> + * outside since the call can't be aware of this alternation!
>> + */
>> +new_bh = (bhs[0] == NULL);
> 
> 'new_bh' just means the first bh is NULL, what if the middle bh is NULL?

I am afraid we have to assume that if the first bh is NULL, others must to be 
NULL as well. Otherwise this function's exception handling will be rather 
complicated and messy. So I add a comment before this function to remind the 
caller should pass appropriate arguments in.
And I checked the callers to this function, they behave like my assumption.

Moreover, who would pass a so strange bh array in?

> 
>> +
>>  for (i = 0 ; i < nr ; i++) {
>>  if (bhs[i] == NULL) {
>>  bhs[i] = sb_getblk(osb->sb, block++);
>>  if (bhs[i] == NULL) {
>>  status = -ENOMEM;
>>  mlog_errno(status);
>> -goto bail;
>> +break;
>>  }
>>  }
>>  bh = bhs[i];
>> @@ -158,9 +167,26 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
>> block,
>>  submit_bh(REQ_OP_READ, 0, bh);
>>  }
>>   
>> +read_failure:
> 
> This looks weird that 'read_failure' include the normal branch.

Sounds reasonable.
How about if_read_failure or may_read_failure?

Thanks,
Changwei

> 
>>  for (i = nr; i > 0; i--) {
>>  bh = bhs[i - 1];
>>   
>> +if (unlikely(status)) {
>> +if (new_bh && !bh) {
>> +/* If middle bh fails, let previous bh
>> + * finish its read and then put it to
>> + * aovoid bh leak
>> + */
>> +if (!buffer_jbd(bh))
>> +wait_on_buffer(bh);
>> +put_bh(bh);
>> +bhs[i - 1] = NULL;
>> +} else if (buffer_uptodate(bh)) {
>> +clear_buffer_uptodate(bh);
>> +}
>> +continue;
>> +}
>> +
>>  /* No need to wait on the buffer if it's managed by JBD. */
>>  if (!buffer_jbd(bh))
>>  wait_on_buffer(bh);
>> @@ -170,8 +196,7 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
>> block,
>>   * so we can safely record this and loop back
>>   * to cleanup the other buffers. */
>>  status = -EIO;
>> -put_bh(bh);
>> -bhs[i - 1] = NULL;
>> +goto read_failure;
>>  }
>>  }
>>   
>> @@ -179,6 +204,9 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
>> block,
>>  return status;
>>   }
>>   
>> +/* Caller must provide a bhs[] with all NULL or non-NULL entries, so it
>> + * will be easier to handle read failure.
>> + */
>>   int ocfs2_read_blocks(struct ocfs2_caching_info

Re: [Ocfs2-devel] [PATCH] ocfs2: don't use iocb when EIOCBQUEUED returns

2018-04-10 Thread Changwei Ge
Hi Jun,

On 2018/4/11 8:52, piaojun wrote:
> Hi Changwei,
> 
> It looks like a code bug, and 'iocb' should not be freed at this place.
> Could this BUG reproduced easily?

Actually, it's not easy to be reproduced since IO is much slower than CPU 
executing instructions. But the logic here is broken, we'd better fix this.

Thanks,
Changwei

> 
> thanks,
> Jun
> 
> On 2018/4/10 20:00, Changwei Ge wrote:
>> When -EIOCBQUEUED returns, it means that aio_complete() will be called
>> from dio_complete(), which is an asynchronous progress against write_iter.
>> Generally, IO is a very slow progress than executing instruction, but we
>> still can't take the risk to access a freed iocb.
>>
>> And we do face a BUG crash issue.
>> >From crash tool, iocb is obviously freed already.
>> crash> struct -x kiocb 881a350f5900
>> struct kiocb {
>>ki_filp = 0x881a350f5a80,
>>ki_pos = 0x0,
>>ki_complete = 0x0,
>>private = 0x0,
>>ki_flags = 0x0
>> }
>>
>> And the backtrace shows:
>> ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
>> ? ocfs2_check_range_for_refcount+0x150/0x150 [ocfs2]
>> aio_run_iocb+0x229/0x2f0
>> ? try_to_wake_up+0x380/0x380
>> do_io_submit+0x291/0x540
>> ? syscall_trace_leave+0xad/0x130
>> SyS_io_submit+0x10/0x20
>> system_call_fastpath+0x16/0x75
>>
>> Signed-off-by: Changwei Ge 
>> ---
>>   fs/ocfs2/file.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>> index 5d1784a..1393ff2 100644
>> --- a/fs/ocfs2/file.c
>> +++ b/fs/ocfs2/file.c
>> @@ -2343,7 +2343,7 @@ static ssize_t ocfs2_file_write_iter(struct kiocb 
>> *iocb,
>>   
>>  written = __generic_file_write_iter(iocb, from);
>>  /* buffered aio wouldn't have proper lock coverage today */
>> -BUG_ON(written == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
>> +BUG_ON(written == -EIOCBQUEUED && !direct_io);
>>   
>>  /*
>>   * deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io
>> @@ -2463,7 +2463,7 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
>>  trace_generic_file_aio_read_ret(ret);
>>   
>>  /* buffered aio wouldn't have proper lock coverage today */
>> -BUG_ON(ret == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
>> +BUG_ON(ret == -EIOCBQUEUED && !direct_io);
>>   
>>  /* see ocfs2_file_write_iter */
>>  if (ret == -EIOCBQUEUED || !ocfs2_iocb_is_rw_locked(iocb)) {
>>
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: clean up unused stack variable in dlm_do_local_ast()

2018-04-10 Thread Changwei Ge
Hi Jun,

On 2018/4/11 8:43, piaojun wrote:
> Hi Changwei,
> 
> This patch had been merged into mainline already.
> 
> Please see patch a43d24cb3b0b.

Oops, I forgot to rebase my tree... :(

Thanks for your reminder.

Changwei

> 
> thanks,
> Jun
> 
> On 2018/4/10 19:49, Changwei Ge wrote:
>> Signed-off-by: Changwei Ge 
>> ---
>>   fs/ocfs2/dlm/dlmast.c | 2 --
>>   1 file changed, 2 deletions(-)
>>
>> diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c
>> index fd6..39831fc 100644
>> --- a/fs/ocfs2/dlm/dlmast.c
>> +++ b/fs/ocfs2/dlm/dlmast.c
>> @@ -224,14 +224,12 @@ void dlm_do_local_ast(struct dlm_ctxt *dlm, struct 
>> dlm_lock_resource *res,
>>struct dlm_lock *lock)
>>   {
>>  dlm_astlockfunc_t *fn;
>> -struct dlm_lockstatus *lksb;
>>   
>>  mlog(0, "%s: res %.*s, lock %u:%llu, Local AST\n", dlm->name,
>>   res->lockname.len, res->lockname.name,
>>   dlm_get_lock_cookie_node(be64_to_cpu(lock->ml.cookie)),
>>   dlm_get_lock_cookie_seq(be64_to_cpu(lock->ml.cookie)));
>>   
>> -lksb = lock->lksb;
>>  fn = lock->ast;
>>  BUG_ON(lock->ml.node != dlm->node_num);
>>   
>>
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: don't use iocb when EIOCBQUEUED returns

2018-04-10 Thread piaojun
Hi Changwei,

It looks like a code bug, and 'iocb' should not be freed at this place.
Could this BUG reproduced easily?

thanks,
Jun

On 2018/4/10 20:00, Changwei Ge wrote:
> When -EIOCBQUEUED returns, it means that aio_complete() will be called
> from dio_complete(), which is an asynchronous progress against write_iter.
> Generally, IO is a very slow progress than executing instruction, but we
> still can't take the risk to access a freed iocb.
> 
> And we do face a BUG crash issue.
>>From crash tool, iocb is obviously freed already.
> crash> struct -x kiocb 881a350f5900
> struct kiocb {
>   ki_filp = 0x881a350f5a80,
>   ki_pos = 0x0,
>   ki_complete = 0x0,
>   private = 0x0,
>   ki_flags = 0x0
> }
> 
> And the backtrace shows:
> ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
> ? ocfs2_check_range_for_refcount+0x150/0x150 [ocfs2]
> aio_run_iocb+0x229/0x2f0
> ? try_to_wake_up+0x380/0x380
> do_io_submit+0x291/0x540
> ? syscall_trace_leave+0xad/0x130
> SyS_io_submit+0x10/0x20
> system_call_fastpath+0x16/0x75
> 
> Signed-off-by: Changwei Ge 
> ---
>  fs/ocfs2/file.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 5d1784a..1393ff2 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -2343,7 +2343,7 @@ static ssize_t ocfs2_file_write_iter(struct kiocb *iocb,
>  
>   written = __generic_file_write_iter(iocb, from);
>   /* buffered aio wouldn't have proper lock coverage today */
> - BUG_ON(written == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
> + BUG_ON(written == -EIOCBQUEUED && !direct_io);
>  
>   /*
>* deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io
> @@ -2463,7 +2463,7 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
>   trace_generic_file_aio_read_ret(ret);
>  
>   /* buffered aio wouldn't have proper lock coverage today */
> - BUG_ON(ret == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
> + BUG_ON(ret == -EIOCBQUEUED && !direct_io);
>  
>   /* see ocfs2_file_write_iter */
>   if (ret == -EIOCBQUEUED || !ocfs2_iocb_is_rw_locked(iocb)) {
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2/dlm: clean up unused stack variable in dlm_do_local_ast()

2018-04-10 Thread piaojun
Hi Changwei,

This patch had been merged into mainline already.

Please see patch a43d24cb3b0b.

thanks,
Jun

On 2018/4/10 19:49, Changwei Ge wrote:
> Signed-off-by: Changwei Ge 
> ---
>  fs/ocfs2/dlm/dlmast.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c
> index fd6..39831fc 100644
> --- a/fs/ocfs2/dlm/dlmast.c
> +++ b/fs/ocfs2/dlm/dlmast.c
> @@ -224,14 +224,12 @@ void dlm_do_local_ast(struct dlm_ctxt *dlm, struct 
> dlm_lock_resource *res,
> struct dlm_lock *lock)
>  {
>   dlm_astlockfunc_t *fn;
> - struct dlm_lockstatus *lksb;
>  
>   mlog(0, "%s: res %.*s, lock %u:%llu, Local AST\n", dlm->name,
>res->lockname.len, res->lockname.name,
>dlm_get_lock_cookie_node(be64_to_cpu(lock->ml.cookie)),
>dlm_get_lock_cookie_seq(be64_to_cpu(lock->ml.cookie)));
>  
> - lksb = lock->lksb;
>   fn = lock->ast;
>   BUG_ON(lock->ml.node != dlm->node_num);
>  
> 

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: don't put and assign null to bh allocated outside

2018-04-10 Thread piaojun
Hi Changwei,

On 2018/4/10 19:35, Changwei Ge wrote:
> ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
> several blocks from disk. Currently, the input argument *bhs* can be
> NULL or NOT. It depends on the caller's behavior. If the function fails
> in reading blocks from disk, the corresponding bh will be assigned to
> NULL and put.
> 
> Obviously, above process for non-NULL input bh is not appropriate.
> Because the caller doesn't even know its bhs are put and re-assigned.
> 
> If buffer head is managed by caller, ocfs2_read_blocks and
> ocfs2_read_blocks_sync()  should not evaluate it to NULL. It will
> cause caller accessing illegal memory, thus crash.
> 
> Also, we should put bhs which have succeeded in reading before current
> read failure.
> 
> Signed-off-by: Changwei Ge 
> ---
>  fs/ocfs2/buffer_head_io.c | 77 
> ---
>  1 file changed, 59 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
> index d9ebe11..7ae4147 100644
> --- a/fs/ocfs2/buffer_head_io.c
> +++ b/fs/ocfs2/buffer_head_io.c
> @@ -99,25 +99,34 @@ int ocfs2_write_block(struct ocfs2_super *osb, struct 
> buffer_head *bh,
>   return ret;
>  }
>  
> +/* Caller must provide a bhs[] with all NULL or non-NULL entries, so it
> + * will be easier to handle read failure.
> + */
>  int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 block,
>  unsigned int nr, struct buffer_head *bhs[])
>  {
>   int status = 0;
>   unsigned int i;
>   struct buffer_head *bh;
> + int new_bh = 0;
>  
>   trace_ocfs2_read_blocks_sync((unsigned long long)block, nr);
>  
>   if (!nr)
>   goto bail;
>  
> + /* Don't put buffer head and re-assign it to NULL if it is allocated
> +  * outside since the call can't be aware of this alternation!
> +  */
> + new_bh = (bhs[0] == NULL);

'new_bh' just means the first bh is NULL, what if the middle bh is NULL?

> +
>   for (i = 0 ; i < nr ; i++) {
>   if (bhs[i] == NULL) {
>   bhs[i] = sb_getblk(osb->sb, block++);
>   if (bhs[i] == NULL) {
>   status = -ENOMEM;
>   mlog_errno(status);
> - goto bail;
> + break;
>   }
>   }
>   bh = bhs[i];
> @@ -158,9 +167,26 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
> block,
>   submit_bh(REQ_OP_READ, 0, bh);
>   }
>  
> +read_failure:

This looks weird that 'read_failure' include the normal branch.

>   for (i = nr; i > 0; i--) {
>   bh = bhs[i - 1];
>  
> + if (unlikely(status)) {
> + if (new_bh && !bh) {
> + /* If middle bh fails, let previous bh
> +  * finish its read and then put it to
> +  * aovoid bh leak
> +  */
> + if (!buffer_jbd(bh))
> + wait_on_buffer(bh);
> + put_bh(bh);
> + bhs[i - 1] = NULL;
> + } else if (buffer_uptodate(bh)) {
> + clear_buffer_uptodate(bh);
> + }
> + continue;
> + }
> +
>   /* No need to wait on the buffer if it's managed by JBD. */
>   if (!buffer_jbd(bh))
>   wait_on_buffer(bh);
> @@ -170,8 +196,7 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
> block,
>* so we can safely record this and loop back
>* to cleanup the other buffers. */
>   status = -EIO;
> - put_bh(bh);
> - bhs[i - 1] = NULL;
> + goto read_failure;
>   }
>   }
>  
> @@ -179,6 +204,9 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
> block,
>   return status;
>  }
>  
> +/* Caller must provide a bhs[] with all NULL or non-NULL entries, so it
> + * will be easier to handle read failure.
> + */
>  int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 block, int nr,
> struct buffer_head *bhs[], int flags,
> int (*validate)(struct super_block *sb,
> @@ -188,6 +216,7 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 
> block, int nr,
>   int i, ignore_cache = 0;
>   struct buffer_head *bh;
>   struct super_block *sb = ocfs2_metadata_cache_get_super(ci);
> + int new_bh = 0;
>  
>   trace_ocfs2_read_blocks_begin(ci, (unsigned long long)block, nr, flags);
>  
> @@ -213,6 +242,11 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 
> block, int nr,
>   goto bail;
>   }
>  
> + /* Don'

Re: [Ocfs2-devel] [PATCH] ocfs2: Take inode cluster lock before moving reflinked inode from orphan dir

2018-04-10 Thread Ashish Samant


On 04/02/2018 02:17 AM, Joseph Qi wrote:
>
> On 18/3/31 01:42, Ashish Samant wrote:
>> While reflinking an inode, we create a new inode in orphan directory, then
>> take EX lock on it, reflink the original inode to orphan inode and release
>> EX lock. Once the lock is released another node might request it in PR mode
>> which causes downconvert of the lock to PR mode.
>>
>> Later we attempt to initialize security acl for the orphan inode and move
>> it to the reflink destination. However, while doing this we dont take EX
>> lock on the inode. So effectively, we are doing this and accessing the
>> journal for this inode while holding PR lock. While accessing the journal,
>> we make
>>
>> ci->ci_last_trans = journal->j_trans_id
>>
>> At this point, if there is another downconvert request on this inode from
>> another node (PR->NL), we will trip on the following condition in
>> ocfs2_ci_checkpointed()
>>
>> BUG_ON(lockres->l_level != DLM_LOCK_EX && !checkpointed);
>>
>> because we hold the lock in PR mode and journal->j_trans_id is not greater
>> than ci_last_trans for the inode.
>>
>> Fix this by taking orphan inode cluster lock in EX mode before
>> initializing security and moving orphan inode to reflink destination.
>> Use the __tracker variant while taking inode lock to avoid recursive
>> locking in the ocfs2_init_security_and_acl() call chain.
>>
>> Signed-off-by: Ashish Samant 
> Looks good.
> Reviewed-by: Joseph Qi 
Hi Joseph,

Looks like mainline kernel has removed the inode lock call
in PR mode for orphan inode in ocfs2_recover_orphans()
and it directly takes the lock in EX mode. So, the patch commit log does 
not
reflect the exact problem for mainline kernel. However,
it is still possible that we might end up accessing the journal for the 
inode
while holding inode lock in NL mode (instead of PR). It might not cause
the same problem as described above, but it seems wrong to
modify inode metadata with NL lock held and could potentially cause similar
problems later. So, i'll submit v2 with modified commit message.

Thanks,
Ashish




___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2: don't use iocb when EIOCBQUEUED returns

2018-04-10 Thread Changwei Ge
When -EIOCBQUEUED returns, it means that aio_complete() will be called
from dio_complete(), which is an asynchronous progress against write_iter.
Generally, IO is a very slow progress than executing instruction, but we
still can't take the risk to access a freed iocb.

And we do face a BUG crash issue.
>From crash tool, iocb is obviously freed already.
crash> struct -x kiocb 881a350f5900
struct kiocb {
  ki_filp = 0x881a350f5a80,
  ki_pos = 0x0,
  ki_complete = 0x0,
  private = 0x0,
  ki_flags = 0x0
}

And the backtrace shows:
ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
? ocfs2_check_range_for_refcount+0x150/0x150 [ocfs2]
aio_run_iocb+0x229/0x2f0
? try_to_wake_up+0x380/0x380
do_io_submit+0x291/0x540
? syscall_trace_leave+0xad/0x130
SyS_io_submit+0x10/0x20
system_call_fastpath+0x16/0x75

Signed-off-by: Changwei Ge 
---
 fs/ocfs2/file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 5d1784a..1393ff2 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2343,7 +2343,7 @@ static ssize_t ocfs2_file_write_iter(struct kiocb *iocb,
 
written = __generic_file_write_iter(iocb, from);
/* buffered aio wouldn't have proper lock coverage today */
-   BUG_ON(written == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
+   BUG_ON(written == -EIOCBQUEUED && !direct_io);
 
/*
 * deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io
@@ -2463,7 +2463,7 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
trace_generic_file_aio_read_ret(ret);
 
/* buffered aio wouldn't have proper lock coverage today */
-   BUG_ON(ret == -EIOCBQUEUED && !(iocb->ki_flags & IOCB_DIRECT));
+   BUG_ON(ret == -EIOCBQUEUED && !direct_io);
 
/* see ocfs2_file_write_iter */
if (ret == -EIOCBQUEUED || !ocfs2_iocb_is_rw_locked(iocb)) {
-- 
2.7.4


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2/dlm: clean up unused stack variable in dlm_do_local_ast()

2018-04-10 Thread Changwei Ge
Signed-off-by: Changwei Ge 
---
 fs/ocfs2/dlm/dlmast.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c
index fd6..39831fc 100644
--- a/fs/ocfs2/dlm/dlmast.c
+++ b/fs/ocfs2/dlm/dlmast.c
@@ -224,14 +224,12 @@ void dlm_do_local_ast(struct dlm_ctxt *dlm, struct 
dlm_lock_resource *res,
  struct dlm_lock *lock)
 {
dlm_astlockfunc_t *fn;
-   struct dlm_lockstatus *lksb;
 
mlog(0, "%s: res %.*s, lock %u:%llu, Local AST\n", dlm->name,
 res->lockname.len, res->lockname.name,
 dlm_get_lock_cookie_node(be64_to_cpu(lock->ml.cookie)),
 dlm_get_lock_cookie_seq(be64_to_cpu(lock->ml.cookie)));
 
-   lksb = lock->lksb;
fn = lock->ast;
BUG_ON(lock->ml.node != dlm->node_num);
 
-- 
2.7.4


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2: don't put and assign null to bh allocated outside

2018-04-10 Thread Changwei Ge
ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
several blocks from disk. Currently, the input argument *bhs* can be
NULL or NOT. It depends on the caller's behavior. If the function fails
in reading blocks from disk, the corresponding bh will be assigned to
NULL and put.

Obviously, above process for non-NULL input bh is not appropriate.
Because the caller doesn't even know its bhs are put and re-assigned.

If buffer head is managed by caller, ocfs2_read_blocks and
ocfs2_read_blocks_sync()  should not evaluate it to NULL. It will
cause caller accessing illegal memory, thus crash.

Also, we should put bhs which have succeeded in reading before current
read failure.

Signed-off-by: Changwei Ge 
---
 fs/ocfs2/buffer_head_io.c | 77 ---
 1 file changed, 59 insertions(+), 18 deletions(-)

diff --git a/fs/ocfs2/buffer_head_io.c b/fs/ocfs2/buffer_head_io.c
index d9ebe11..7ae4147 100644
--- a/fs/ocfs2/buffer_head_io.c
+++ b/fs/ocfs2/buffer_head_io.c
@@ -99,25 +99,34 @@ int ocfs2_write_block(struct ocfs2_super *osb, struct 
buffer_head *bh,
return ret;
 }
 
+/* Caller must provide a bhs[] with all NULL or non-NULL entries, so it
+ * will be easier to handle read failure.
+ */
 int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 block,
   unsigned int nr, struct buffer_head *bhs[])
 {
int status = 0;
unsigned int i;
struct buffer_head *bh;
+   int new_bh = 0;
 
trace_ocfs2_read_blocks_sync((unsigned long long)block, nr);
 
if (!nr)
goto bail;
 
+   /* Don't put buffer head and re-assign it to NULL if it is allocated
+* outside since the call can't be aware of this alternation!
+*/
+   new_bh = (bhs[0] == NULL);
+
for (i = 0 ; i < nr ; i++) {
if (bhs[i] == NULL) {
bhs[i] = sb_getblk(osb->sb, block++);
if (bhs[i] == NULL) {
status = -ENOMEM;
mlog_errno(status);
-   goto bail;
+   break;
}
}
bh = bhs[i];
@@ -158,9 +167,26 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
block,
submit_bh(REQ_OP_READ, 0, bh);
}
 
+read_failure:
for (i = nr; i > 0; i--) {
bh = bhs[i - 1];
 
+   if (unlikely(status)) {
+   if (new_bh && !bh) {
+   /* If middle bh fails, let previous bh
+* finish its read and then put it to
+* aovoid bh leak
+*/
+   if (!buffer_jbd(bh))
+   wait_on_buffer(bh);
+   put_bh(bh);
+   bhs[i - 1] = NULL;
+   } else if (buffer_uptodate(bh)) {
+   clear_buffer_uptodate(bh);
+   }
+   continue;
+   }
+
/* No need to wait on the buffer if it's managed by JBD. */
if (!buffer_jbd(bh))
wait_on_buffer(bh);
@@ -170,8 +196,7 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
block,
 * so we can safely record this and loop back
 * to cleanup the other buffers. */
status = -EIO;
-   put_bh(bh);
-   bhs[i - 1] = NULL;
+   goto read_failure;
}
}
 
@@ -179,6 +204,9 @@ int ocfs2_read_blocks_sync(struct ocfs2_super *osb, u64 
block,
return status;
 }
 
+/* Caller must provide a bhs[] with all NULL or non-NULL entries, so it
+ * will be easier to handle read failure.
+ */
 int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 block, int nr,
  struct buffer_head *bhs[], int flags,
  int (*validate)(struct super_block *sb,
@@ -188,6 +216,7 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 
block, int nr,
int i, ignore_cache = 0;
struct buffer_head *bh;
struct super_block *sb = ocfs2_metadata_cache_get_super(ci);
+   int new_bh = 0;
 
trace_ocfs2_read_blocks_begin(ci, (unsigned long long)block, nr, flags);
 
@@ -213,6 +242,11 @@ int ocfs2_read_blocks(struct ocfs2_caching_info *ci, u64 
block, int nr,
goto bail;
}
 
+   /* Don't put buffer head and re-assign it to NULL if it is allocated
+* outside since the call can't be aware of this alternation!
+*/
+   new_bh = (bhs[0] == NULL);
+
ocfs2_metadata_cache_io_lock(ci);
for (i = 0 ; i < nr ; i++) {
if (bhs[i] == NULL) {
@@ -221,7 +255,8 @@ int