Re: [Ocfs2-devel] [PATCH V2] ocfs2: Take inode cluster lock before moving reflinked inode from orphan dir

2018-04-11 Thread Junxiao Bi
On 04/12/2018 03:31 AM, Ashish Samant wrote:

> While reflinking an inode, we create a new inode in the orphan directory,
> then take an EX lock on it, reflink the original inode to the orphan inode,
> and release the EX lock. Once the lock is released, another node could
> request it in EX mode from ocfs2_recover_orphans(), which causes a
> downconvert of the lock, on this node, to NL mode.
>
> Later we attempt to initialize the security acl for the orphan inode and
> move it to the reflink destination. However, while doing this we don't take
> an EX lock on the inode. This could potentially cause problems because we
> could be starting a transaction, accessing the journal and modifying
> metadata of the inode while holding an NL lock, with another node holding
> an EX lock on the inode.
>
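> Illustratively, the race window looks like this (a schematic of the
> scenario above, not an actual trace):
>
>   Node A (reflink)                     Node B
>   ----------------                     ------
>   create inode in orphan dir
>   take EX on orphan inode
>   reflink original inode, drop EX
>                                        ocfs2_recover_orphans() requests
>                                        EX; node A downconverts to NL
>   init security/ACL and move inode
>   to reflink destination, touching     holds EX
>   journal and metadata under NL
>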
> Fix this by taking the orphan inode cluster lock in EX mode before
> initializing security and moving the orphan inode to the reflink
> destination. Use the _tracker variant while taking the inode lock to avoid
> recursive locking in the ocfs2_init_security_and_acl() call chain.
>
> Signed-off-by: Ashish Samant 
Reviewed-by: Junxiao Bi 
>
> V1->V2:
> Modify commit message to better reflect the problem in upstream kernel.
> ---
>   fs/ocfs2/refcounttree.c | 14 ++++++++++++--
>   1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
> index ab156e3..1b1283f 100644
> --- a/fs/ocfs2/refcounttree.c
> +++ b/fs/ocfs2/refcounttree.c
> @@ -4250,10 +4250,11 @@ static int __ocfs2_reflink(struct dentry *old_dentry,
>   static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
>struct dentry *new_dentry, bool preserve)
>   {
> - int error;
> + int error, had_lock;
>   struct inode *inode = d_inode(old_dentry);
>   struct buffer_head *old_bh = NULL;
>   struct inode *new_orphan_inode = NULL;
> + struct ocfs2_lock_holder oh;
>   
>   if (!ocfs2_refcount_tree(OCFS2_SB(inode->i_sb)))
>   return -EOPNOTSUPP;
> @@ -4295,6 +4296,14 @@ static int ocfs2_reflink(struct dentry *old_dentry, 
> struct inode *dir,
>   goto out;
>   }
>   
> + had_lock = ocfs2_inode_lock_tracker(new_orphan_inode, NULL, 1,
> + &oh);
> + if (had_lock < 0) {
> + error = had_lock;
> + mlog_errno(error);
> + goto out;
> + }
> +
>   /* If the security isn't preserved, we need to re-initialize them. */
>   if (!preserve) {
>   error = ocfs2_init_security_and_acl(dir, new_orphan_inode,
> @@ -4302,14 +4311,15 @@ static int ocfs2_reflink(struct dentry *old_dentry, 
> struct inode *dir,
>   if (error)
>   mlog_errno(error);
>   }
> -out:
>   if (!error) {
>   error = ocfs2_mv_orphaned_inode_to_new(dir, new_orphan_inode,
>  new_dentry);
>   if (error)
>   mlog_errno(error);
>   }
> + ocfs2_inode_unlock_tracker(new_orphan_inode, 1, &oh, had_lock);
>   
> +out:
>   if (new_orphan_inode) {
>   /*
>* We need to open_unlock the inode no matter whether we




Re: [Ocfs2-devel] [PATCH v2] ocfs2: try to reuse extent block in dealloc without meta_alloc

2018-01-02 Thread Junxiao Bi
On 12/27/2017 08:21 PM, Changwei Ge wrote:
> Hi Junxiao,
> 
> On 2017/12/27 18:02, Junxiao Bi wrote:
>> Hi Changwei,
>>
>>
>> On 12/26/2017 03:55 PM, Changwei Ge wrote:
>>> A crash issue was reported by John.
>>> The call trace follows:
>>> ocfs2_split_extent+0x1ad3/0x1b40 [ocfs2]
>>> ocfs2_change_extent_flag+0x33a/0x470 [ocfs2]
>>> ocfs2_mark_extent_written+0x172/0x220 [ocfs2]
>>> ocfs2_dio_end_io+0x62d/0x910 [ocfs2]
>>> dio_complete+0x19a/0x1a0
>>> do_blockdev_direct_IO+0x19dd/0x1eb0
>>> __blockdev_direct_IO+0x43/0x50
>>> ocfs2_direct_IO+0x8f/0xa0 [ocfs2]
>>> generic_file_direct_write+0xb2/0x170
>>> __generic_file_write_iter+0xc3/0x1b0
>>> ocfs2_file_write_iter+0x4bb/0xca0 [ocfs2]
>>> __vfs_write+0xae/0xf0
>>> vfs_write+0xb8/0x1b0
>>> SyS_write+0x4f/0xb0
>>> system_call_fastpath+0x16/0x75
>>>
>>> The BUG showed that the extent tree wants to grow but no metadata
>>> was reserved ahead of time.
>>> From my investigation into this issue, the root cause is that no metadata
>>> is reserved because, at estimate time, there appeared to be enough in the
>>> tree already for the following use.
>>> The rightmost extent gets merged into its left neighbor after a certain
>>> number of extents are marked written, because marking extents written
>>> produces many physically contiguous extents. Eventually an empty extent
>>> shows up and the rightmost path is removed from the extent tree.
>> I am trying to understand the issue. Quick questions.
>> Is this issue caused by BUG_ON(meta_ac == NULL)? Can you explain why it
>> is NULL?
> My pleasure.
> Before marking extents written, we have to estimate how much metadata will
> be used. If the extent tree already has enough metadata for the following
> operation (marking extents written), none is reserved ahead of time.
> 
> For this BUG scenario, it happens that the extent tree already has enough
> free metadata. This can be seen from the code path:
> ocfs2_dio_end_io_write
>   ocfs2_lock_allocators - no need to reserve metadata, since the extent
>       tree already has more metadata than needed, so *meta_ac* is NULL
>   ocfs2_mark_extent_written
>     ocfs2_change_extent_flag - cluster by cluster
>       ocfs2_split_extent
>         ocfs2_try_to_merge_extent - while filling a file hole, we mark
>             clusters written one by one, and merge the physically
>             contiguous (WRITTEN) clusters into a single one
>           ocfs2_rotate_tree_left - rotate extents
>             __ocfs2_rotate_tree_left
>               ocfs2_rotate_subtree_left - here we find a totally empty
>                   extent, so unlink it from the extent tree
>                 ocfs2_unlink_subtree
>                 ocfs2_remove_empty_extent
>  
> 
> Then, since we deleted one extent block, our previously estimated metadata
> count no longer holds (note that the estimate itself was accurate), because
> the free metadata in the tree was reduced by the extent rotation and merging.
> So now we don't have enough metadata.
I see, thanks for your detailed explanation.

> 
> Then we are still marking extents written for the remaining clusters and we
> need to split extents.
Once one ocfs2_mark_extent_written() call causes an extent block to be
removed, the next call to it may not have enough metadata, right?

> But we don't have enough metadata to accommodate the split extent records.
If the above is right, can we invoke ocfs2_lock_allocators() to reserve
enough metadata before every call to ocfs2_mark_extent_written()?
That would be simpler than fixing this by reusing dealloc, as sketched below.
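
A rough sketch of that idea, modeled on the existing loop in
ocfs2_dio_end_io_write(); it assumes the current signatures of
ocfs2_lock_allocators() and ocfs2_mark_extent_written(), and it glosses over
the extents_to_split estimate and the transaction-credit handling, so treat
it as illustrative only:

	list_for_each_entry(ue, &dwc->dw_zero_list, ue_node) {
		struct ocfs2_alloc_context *meta_ac = NULL;

		/* Redo the estimate before every call, since the previous
		 * call may have shrunk the tree by rotation/merging and
		 * invalidated the original estimate. */
		ret = ocfs2_lock_allocators(inode, &et, 0, 2, NULL, &meta_ac);
		if (ret < 0)
			break;

		ret = ocfs2_mark_extent_written(inode, &et, handle,
						ue->ue_cpos, 1, ue->ue_phys,
						meta_ac, &dealloc);
		if (meta_ac)
			ocfs2_free_alloc_context(meta_ac);
		if (ret < 0)
			break;
	}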

Thanks,
Junxiao.

> So the extent tree needs to grow.
> BUG: no metadata was reserved ahead of time.
> 
> Thanks,
> Changwei
>
> 




Re: [Ocfs2-devel] [PATCH v2] ocfs2: try to reuse extent block in dealloc without meta_alloc

2017-12-27 Thread Junxiao Bi
Hi Changwei,


On 12/26/2017 03:55 PM, Changwei Ge wrote:
> A crash issue was reported by John.
> The call trace follows:
> ocfs2_split_extent+0x1ad3/0x1b40 [ocfs2]
> ocfs2_change_extent_flag+0x33a/0x470 [ocfs2]
> ocfs2_mark_extent_written+0x172/0x220 [ocfs2]
> ocfs2_dio_end_io+0x62d/0x910 [ocfs2]
> dio_complete+0x19a/0x1a0
> do_blockdev_direct_IO+0x19dd/0x1eb0
> __blockdev_direct_IO+0x43/0x50
> ocfs2_direct_IO+0x8f/0xa0 [ocfs2]
> generic_file_direct_write+0xb2/0x170
> __generic_file_write_iter+0xc3/0x1b0
> ocfs2_file_write_iter+0x4bb/0xca0 [ocfs2]
> __vfs_write+0xae/0xf0
> vfs_write+0xb8/0x1b0
> SyS_write+0x4f/0xb0
> system_call_fastpath+0x16/0x75
>
> The BUG showed that the extent tree wants to grow but no metadata
> was reserved ahead of time.
> From my investigation into this issue, the root cause is that no metadata
> is reserved because, at estimate time, there appeared to be enough in the
> tree already for the following use.
> The rightmost extent gets merged into its left neighbor after a certain
> number of extents are marked written, because marking extents written
> produces many physically contiguous extents. Eventually an empty extent
> shows up and the rightmost path is removed from the extent tree.
I am trying to understand the issue. Quick questions.
Is this issue caused by BUG_ON(meta_ac == NULL)? Can you explain why it 
is NULL?

Thanks,
Junxiao.
>
> Add a new mechanism that reuses the extent blocks cached in dealloc, which
> were just unlinked from the extent tree, to solve this crash issue.
>
> The criterion is: while marking extents *written*, if extent rotation and
> merging unlink an extent block, and the extent tree later needs to grow
> without any metadata reserved ahead of time, try to reuse the deleted
> extent blocks cached in dealloc.
>
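> A minimal sketch of the emptiness check this relies on, assuming the dealloc
> context caches freed metadata blocks on its c_first_suballocator list
> (simplified; the actual helper in the patch may differ):
>
>	static int ocfs2_is_dealloc_empty(struct ocfs2_extent_tree *et)
>	{
>		return !et->et_dealloc ||
>		       !et->et_dealloc->c_first_suballocator;
>	}
>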
> Also, this patch addresses the issue John reported that ::dw_zero_count is
> not calculated properly.
>
> After applying this patch, the issue John reported was gone.
> Thanks for the reproducer provided by John.
> And this patch has passed the ocfs2-test suite (29 cases) run by New H3C Group.
>
> Reported-by: John Lightsey 
> Signed-off-by: Changwei Ge 
> Reviewed-by: Duan Zhang 
> ---
>fs/ocfs2/alloc.c | 140 
> ---
>fs/ocfs2/alloc.h |   1 +
>fs/ocfs2/aops.c  |  14 --
>3 files changed, 145 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
> index ab5105f..56aba96 100644
> --- a/fs/ocfs2/alloc.c
> +++ b/fs/ocfs2/alloc.c
> @@ -165,6 +165,13 @@ static int ocfs2_dinode_insert_check(struct 
> ocfs2_extent_tree *et,
>struct ocfs2_extent_rec *rec);
>static int ocfs2_dinode_sanity_check(struct ocfs2_extent_tree *et);
>static void ocfs2_dinode_fill_root_el(struct ocfs2_extent_tree *et);
> +
> +static int ocfs2_reuse_blk_from_dealloc(handle_t *handle,
> + struct ocfs2_extent_tree *et,
> + struct buffer_head **new_eb_bh,
> + int blk_cnt);
> +static int ocfs2_is_dealloc_empty(struct ocfs2_extent_tree *et);
> +
>static const struct ocfs2_extent_tree_operations ocfs2_dinode_et_ops = {
>   .eo_set_last_eb_blk = ocfs2_dinode_set_last_eb_blk,
>   .eo_get_last_eb_blk = ocfs2_dinode_get_last_eb_blk,
> @@ -448,6 +455,7 @@ static void __ocfs2_init_extent_tree(struct 
> ocfs2_extent_tree *et,
>   if (!obj)
>   obj = (void *)bh->b_data;
>   et->et_object = obj;
> + et->et_dealloc = NULL;
>
>   et->et_ops->eo_fill_root_el(et);
>   if (!et->et_ops->eo_fill_max_leaf_clusters)
> @@ -1213,8 +1221,15 @@ static int ocfs2_add_branch(handle_t *handle,
>   goto bail;
>   }
>
> - status = ocfs2_create_new_meta_bhs(handle, et, new_blocks,
> -meta_ac, new_eb_bhs);
> + if (meta_ac) {
> + status = ocfs2_create_new_meta_bhs(handle, et, new_blocks,
> +meta_ac, new_eb_bhs);
> + } else if (!ocfs2_is_dealloc_empty(et)) {
> + status = ocfs2_reuse_blk_from_dealloc(handle, et,
> +   new_eb_bhs, new_blocks);
> + } else {
> + BUG();
> + }
>   if (status < 0) {
>   mlog_errno(status);
>   goto bail;
> @@ -1347,8 +1362,15 @@ static int ocfs2_shift_tree_depth(handle_t *handle,
>   struct ocfs2_extent_list  *root_el;
>   struct ocfs2_extent_list  *eb_el;
>
> - status = ocfs2_create_new_meta_bhs(handle, et, 1, meta_ac,
> -&new_eb_bh);
> + if (meta_ac) {
> + status = ocfs2_create_new_meta_bhs(handle, et, 1, meta_ac,
> +&new_eb_bh);
> + } else if (!ocfs2_is_dealloc_empty(et)) {
> + status = ocfs2_reuse_blk_from_dealloc(handle, et,
> +

Re: [Ocfs2-devel] [PATCH v2] ocfs2: fall back to buffer IO when append dio is disabled with file hole existing

2017-12-27 Thread Junxiao Bi
On 12/27/2017 03:46 PM, Changwei Ge wrote:

> Hi Junxiao,
>
> On 2017/12/27 15:35, Junxiao Bi wrote:
>> Hi Changwei,
>>
>> On 12/26/2017 05:20 PM, Changwei Ge wrote:
>>> Hi Alex
>>>
>>> On 2017/12/26 16:20, alex chen wrote:
>>>> Hi Changwei,
>>>>
>>>> On 2017/12/26 15:03, Changwei Ge wrote:
>>>>> The intention of this patch is to give ocfs2 users an option to choose
>>>>> whether to allocate disk space while doing dio writes.
>>>>>
>>>>> Whether ocfs2 falls back to buffer io is up to ocfs2 users, through
>>>>> toggling the append-dio feature. It makes ocfs2 more configurable and
>>>>> flexible.
>>>>>
>>>> It is too strange to make ocfs2 fall back to buffer io by toggling 
>>>> append-dio feature.
>>> It might be.
>>> But as my changelog said, I think append-dio is the key to whether space
>>> is allocated during dio writes. So filling a hole and appending a file
>>> should react to append-dio in the same way.
>>>
>>> Besides, in the early days, ocfs2 truly fell back to buffer io when
>>> append-dio was disabled and a file hole was encountered.
>> I think we discussed this a lot on your v1 version. This isn't fit for
>> mainline; you can keep it as an off-mainline patch.
> Fine. I agree now.
> I give up trying to push this patch into mainline.
> BTW, could you please help review another patch of mine (ocfs2: try to reuse
> extent block in dealloc without meta_alloc), which fixes a dio crash issue?
> With that fix in, dio won't hit such an issue even without this patch.
> Your comments are very important to me.
Sure, I will look at it.

Thanks,
Junxiao.
>
> Thanks,
> Changwei
>
>> Thanks,
>> Junxiao.
>>
>



Re: [Ocfs2-devel] [PATCH v2] ocfs2: fall back to buffer IO when append dio is disabled with file hole existing

2017-12-26 Thread Junxiao Bi
Hi Changwei,

On 12/26/2017 05:20 PM, Changwei Ge wrote:
> Hi Alex
> 
> On 2017/12/26 16:20, alex chen wrote:
>> Hi Changwei,
>>
>> On 2017/12/26 15:03, Changwei Ge wrote:
>>> The intention of this patch is to give ocfs2 users an option to choose
>>> whether to allocate disk space while doing dio writes.
>>>
>>> Whether ocfs2 falls back to buffer io is up to ocfs2 users, through
>>> toggling the append-dio feature. It makes ocfs2 more configurable and
>>> flexible.
>>>
>> It is too strange to make ocfs2 fall back to buffer io by toggling 
>> append-dio feature.
> 
> It might be.
> But as my changelog said, I think append-dio is the key to whether space
> is allocated during dio writes. So filling a hole and appending a file
> should react to append-dio in the same way.
> 
> Besides, in the early days, ocfs2 truly fell back to buffer io when
> append-dio was disabled and a file hole was encountered.
I think we discussed this a lot on your v1 version. This isn't fit for
mainline; you can keep it as an off-mainline patch.

Thanks,
Junxiao.

> 
> Thanks,
> Changwei
> 
>>
>>> So if something bad happens to dio writes with space allocation, we can
>>> still make ocfs2 fall back to buffer io. It's an option, not a mandatory
>>> action. :)
>> Now ocfs2 supports filling holes during direct io whether or not the
>> append-dio feature is enabled, and we can fix the problem directly.
>> I think it is meaningless to provide a temporary option to turn it off.
> 
> IMO, this patch is NOT a temporary solution.
> Instead, it provides an extra option to ocfs2 users. And append-dio is
> enabled by *default* at mkfs time.
> 
> Thanks,
> Changwei  
> 
>>
>>>
>>> Besides, the append-dio feature is the key to whether space is allocated
>>> during dio writes. So writing to a file hole and enlarging a file
>>> (appending) should react to the append-dio feature in the same way.
>>>
>>> Signed-off-by: Changwei Ge 
>>> ---
>>>fs/ocfs2/aops.c | 53 
>>> ++---
>>>1 file changed, 50 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
>>> index d151632..32e60c0 100644
>>> --- a/fs/ocfs2/aops.c
>>> +++ b/fs/ocfs2/aops.c
>>> @@ -2414,12 +2414,52 @@ static int ocfs2_dio_end_io(struct kiocb *iocb,
>>> return ret;
>>>}
>>>
>>> +/*
>>> + * Look for holes in the range starting at pos, for count bytes
>>> + * (inclusive).
>>> + * Returns 1 if a hole exists, 0 if not, and a negative value on error.
>>> + */
>>> +static int ocfs2_range_has_holes(struct inode *inode, loff_t pos,
>>> +size_t count)
>>> +{
>>> +   int ret = 0;
>>> +   unsigned int extent_flags;
>>> +   u32 cpos, clusters, extent_len, phys_cpos;
>>> +   struct super_block *sb = inode->i_sb;
>>> +
>>> +   cpos = pos >> OCFS2_SB(sb)->s_clustersize_bits;
>>> +   clusters = ocfs2_clusters_for_bytes(sb, pos + count) - cpos;
>>> +
>>> +   while (clusters) {
>>> +   ret = ocfs2_get_clusters(inode, cpos, &phys_cpos, &extent_len,
>>> +&extent_flags);
>>> +   if (ret < 0) {
>>> +   mlog_errno(ret);
>>> +   goto out;
>>> +   }
>>> +
>>> +   if (phys_cpos == 0) {
>>> +   ret = 1;
>>> +   goto out;
>>> +   }
>>> +
>>> +   if (extent_len > clusters)
>>> +   extent_len = clusters;
>>> +
>>> +   clusters -= extent_len;
>>> +   cpos += extent_len;
>>> +   }
>>> +out:
>>> +   return ret;
>>> +}
>>> +
>>>static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>>{
>>> struct file *file = iocb->ki_filp;
>>> struct inode *inode = file->f_mapping->host;
>>> struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>> get_block_t *get_block;
>>> +   int ret;
>>>
>>> /*
>>>  * Fallback to buffered I/O if we see an inode without
>>> @@ -2429,9 +2469,16 @@ static ssize_t ocfs2_direct_IO(struct kiocb *iocb, 
>>> struct iov_iter *iter)
>>> return 0;
>>>
>>> /* Fallback to buffered I/O if we do not support append dio. */
>>> -   if (iocb->ki_pos + iter->count > i_size_read(inode) &&
>>> -   !ocfs2_supports_append_dio(osb))
>>> -   return 0;
>>> +   if (!ocfs2_supports_append_dio(osb)) {
>>> +   if (iocb->ki_pos + iter->count > i_size_read(inode))
>>> +   return 0;
>>> +
>>> +   ret = ocfs2_range_has_holes(inode, iocb->ki_pos, iter->count);
>>> +   if (ret == 1)
>>> +   return 0;
>>> +   else if (ret < 0)
>>> +   return ret;
>>> +   }
>>>
>>> if (iov_iter_rw(iter) == READ)
>>> get_block = ocfs2_lock_get_block;
>>>
>>
>>
> 




Re: [Ocfs2-devel] [PATCH] ocfs2: fall back to buffer IO when append dio is disabled with file hole existing

2017-12-19 Thread Junxiao Bi
On 12/19/2017 05:11 PM, Changwei Ge wrote:
> Hi Junxiao,
> 
> On 2017/12/19 16:15, Junxiao Bi wrote:
>> Hi Changwei,
>>
>> On 12/19/2017 02:02 PM, Changwei Ge wrote:
>>> On 2017/12/19 11:41, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> On 2017/12/19 11:05, Changwei Ge wrote:
>>>>> Hi Jun,
>>>>>
>>>>> On 2017/12/19 9:48, piaojun wrote:
>>>>>> Hi Changwei,
>>>>>>
>>>>>> On 2017/12/18 20:06, Changwei Ge wrote:
>>>>>>> Before ocfs2 supported allocating clusters while doing append-dio, all
>>>>>>> append dio fell back to buffer io to allocate clusters first. Also,
>>>>>>> when it stepped on a file hole, it fell back to buffer io, too. But in
>>>>>>> the current code, writing to a file hole leverages dio to allocate
>>>>>>> clusters. This is not right, since append-dio being enabled is what
>>>>>>> tells whether ocfs2 can allocate space while doing dio.
>>>>>>> So introduce the file hole check function back into ocfs2.
>>>>>>> Once ocfs2 is doing dio upon a file hole with append-dio disabled, it
>>>>>>> will fall back to buffer IO to allocate clusters.
>>>>>>>
>>>>>> 1. Do you mean that filling hole can't go with dio when append-dio is 
>>>>>> disabled?
>>>>>
>>>>> Yes, direct IO will fall back to buffer IO with _append-dio_ disabled.
>>>>
>>>> Why does dio need to fall back to buffer io when append-dio is disabled?
>>>> Could 'common dio' on a file hole go through the direct io process? If
>>>> not, could you please point out why it is necessary?
>>> Hi Jun,
>>>
>>> The intention of making dio fall back to buffer io is to provide
>>> *an option* to users, one that is more stable and can even be faster.
>> More memory will be consumed for some important use cases. For example, if
>> ocfs2 is used to store VM system image files, the files are highly sparse
>> but never extended. Falling back to buffer io means more memory consumed by
>> the page cache in dom0, which allows fewer VMs to run on dom0 and causes
>> performance issues.
> 
> I admit your point above makes sense.
> But, AFAIK, major ocfs2 crash issues come from the direct IO part, especially
> when the file is extremely sparse. So in the worst case, the virtual machine
> can't even run because it crashes again and again.
Can you please point out where those crashes were? I would like to take a
look at them. I run ocfs2-test against mainline sometimes and have never
found a dio crash. As far as I know, Eric also runs the test regularly, and
he didn't have a dio crash report either.

> 
> I think the biggest benefit DIO brings to VMs is that data is transferred to
> the LUN as soon as possible, so no data can be lost.
Right, another benefit.

Thanks,
Junxiao.
> 
>>
>>> From my perspective, the current ocfs2 dio implementation, especially
>>> around space allocation during dio, still needs more testing and
>>> improvement.
>>>
>>> Whether to make ocfs2 fall back to buffer io is up to ocfs2 users through 
>>> toggling append-dio feature.
>>> It rather makes ocfs2 configurable and flexible.
>>>
>>> Besides, do you still remember John's report about dio crash weeks ago?
>> This looks like a workaround; why not fix the bug directly? With this,
>> people may disable append-dio by default to avoid dio issues. That will
>> keep it from ever becoming stable. But it is a useful feature.
> 
> Arguably, this patch just provides an extra option to users. It's up to ocfs2 
> users how to use ocfs2 for
> their business. I think we should not limit ocfs2 users.
> 
> Moreover, I agree that direct IO is a useful feature, but it is not mature 
> enough yet.
> We have to improve it, however, it needs time. I suppose we still have a 
> short or long journey until that.
> So before that we could provide a backup way.
> This may look like kind of workaround, but I prefer to call it an extra 
> option.
> 
> Thanks,
> Changwei
> 
>>
>> Thanks,
>> Junxiao.
>>
>>>
>>> I managed to reproduce this issue, so for now, I don't trust the dio
>>> related code one hundred percent.
>>> So if something bad happens to dio writing with s

Re: [Ocfs2-devel] [PATCH] ocfs2: fall back to buffer IO when append dio is disabled with file hole existing

2017-12-19 Thread Junxiao Bi
Hi Changwei,

On 12/19/2017 02:02 PM, Changwei Ge wrote:
> On 2017/12/19 11:41, piaojun wrote:
>> Hi Changwei,
>>
>> On 2017/12/19 11:05, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> On 2017/12/19 9:48, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> On 2017/12/18 20:06, Changwei Ge wrote:
>>>>> Before ocfs2 supported allocating clusters while doing append-dio, all
>>>>> append dio fell back to buffer io to allocate clusters first. Also, when
>>>>> it stepped on a file hole, it fell back to buffer io, too. But in the
>>>>> current code, writing to a file hole leverages dio to allocate clusters.
>>>>> This is not right, since append-dio being enabled is what tells whether
>>>>> ocfs2 can allocate space while doing dio.
>>>>> So introduce the file hole check function back into ocfs2.
>>>>> Once ocfs2 is doing dio upon a file hole with append-dio disabled, it
>>>>> will fall back to buffer IO to allocate clusters.
>
>>>> 1. Do you mean that filling hole can't go with dio when append-dio is
>>>> disabled?
>>>
>>> Yes, direct IO will fall back to buffer IO with _append-dio_ disabled.
>>
>> Why does dio need to fall back to buffer io when append-dio is disabled?
>> Could 'common dio' on a file hole go through the direct io process? If
>> not, could you please point out why it is necessary?
> Hi Jun,
> 
> The intention of making dio fall back to buffer io is to provide
> *an option* to users, one that is more stable and can even be faster.
More memory will be consumed for some important use cases. For example, if
ocfs2 is used to store VM system image files, the files are highly sparse
but never extended. Falling back to buffer io means more memory consumed by
the page cache in dom0, which allows fewer VMs to run on dom0 and causes
performance issues.

> From my perspective, the current ocfs2 dio implementation, especially around
> space allocation during dio, still needs more testing and improvement.
> 
> Whether to make ocfs2 fall back to buffer io is up to ocfs2 users through 
> toggling append-dio feature.
> It rather makes ocfs2 configurable and flexible.
> 
> Besides, do you still remember John's report about dio crash weeks ago?
This looks like a workaround; why not fix the bug directly? With this,
people may disable append-dio by default to avoid dio issues. That will
keep it from ever becoming stable. But it is a useful feature.

Thanks,
Junxiao.

> 
> I managed to reproduce this issue, so for now, I don't trust the dio related
> code one hundred percent.
> So if something bad happens to dio writing with space allocation, we can
> still make ocfs2 fall back to buffer io.
> It's an option, not a mandatory action.
> 
> Besides, the append-dio feature is the key to whether space is allocated
> during dio writes.
> So writing to a file hole and enlarging a file (appending) should react to
> the append-dio feature in the same way.
> :)
> 
> Thanks,
> Changwei
> 
>>
>>>
>>>> 2. Is your checking-hole just for 'append-dio' or for 'all-common-dio'?
>>>
>>> Just for append-dio
>>>
>>
>> If your patch is just for 'append-dio', I wonder whether it will have an
>> impact on 'common-dio'.
>>
>> thanks,
>> Jun
>>
> Signed-off-by: Changwei Ge 
> ---
> fs/ocfs2/aops.c | 44 ++--
> 1 file changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index d151632..a982cf6 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -2414,6 +2414,44 @@ static int ocfs2_dio_end_io(struct kiocb *iocb,
>   return ret;
> }
> 
> +/*
> + * Will look for holes and unwritten extents in the range starting at
> + * pos for count bytes (inclusive).
> + */
> +static int ocfs2_check_range_for_holes(struct inode *inode, loff_t pos,
> +size_t count)
> +{
> + int ret = 0;
> + unsigned int extent_flags;
> + u32 cpos, clusters, extent_len, phys_cpos;
> + struct super_block *sb = inode->i_sb;
> +
> + cpos = pos >> OCFS2_SB(sb)->s_clustersize_bits;
> + clusters = ocfs2_clusters_for_bytes(sb, pos + count) - cpos;
> +
> + while (clusters) {
> + ret = ocfs2_get_clusters(inode, cpos, &phys_cpos, &extent_len,
> +  &extent_flags);
> + if (ret < 0) {
> + mlog_errno(ret);
> + goto out;
> + }
> +
> + if (phys_cpos == 0 || (extent_flags & OCFS2_EXT_UNWRITTEN)) {
> + ret = 1;
> + break;
> + }
> +
> + if (extent_len > clusters)
> + extent_len = clusters;
> +
> + clusters -= extent_len;
> + cpos += extent_len;
> + }
> +out:
> + return ret;
> +}
> +
> static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter)
> {
>   struct file *file = 

Re: [Ocfs2-devel] ocfs2 hangs

2017-10-23 Thread Junxiao Bi
Hi Dmitry,

Please wait for our new kernel; we will address this issue by backporting
upstream commit c25a1e0671fb ("ocfs2: fix posix_acl_create deadlock").

Thanks,
Junxiao.
On 10/23/2017 11:57 PM, Zhen Ren wrote:
> Hi,
> 
> From the backtrace below, it seems very much like the issue recently fixed
> by Junxiao with this patch:
> 
> [PATCH] ocfs2: mknod: fix recursive locking hung
> 
> Eric
> 
> 
>>> Dmitry Melekhov 10/18/17 1:20 PM >>>
> Hello!
> 
> I run two dovecot servers over ocfs2 for years.
> 
> Previously I used ubuntu, but migrated to Oracle Linux this year.
> 
> And all kernels older than
> 
> 4.1.12-94.5.9.el7uek.x86_64
> 
> hang with
> Aug 27 07:14:17 dovecot1 kernel: INFO: task deliver:15573 blocked for more 
> than
> 120 seconds.
> Aug 27 07:14:17 dovecot1 kernel:  Not tainted 4.1.12-103.3.8.el7uek.x86_64
> #2
> Aug 27 07:14:17 dovecot1 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug 27 07:14:17 dovecot1 kernel: deliver D 0001 0 
> 15573
>   15572 0x0080
> Aug 27 07:14:17 dovecot1 kernel: 88003d0af6c8 0086
> 880079039c00 88007c1ff000
> Aug 27 07:14:17 dovecot1 kernel: 88003d0af6b8 88003d0b
> 88003d0af890 7fff
> Aug 27 07:14:17 dovecot1 kernel: 88007c1ff000 
> 88003d0af6e8 81739c57
> Aug 27 07:14:17 dovecot1 kernel: Call Trace:
> Aug 27 07:14:17 dovecot1 kernel: [] schedule+0x37/0x90
> Aug 27 07:14:17 dovecot1 kernel: []
> schedule_timeout+0x24c/0x2c0
> Aug 27 07:14:17 dovecot1 kernel: [] ?
> find_get_entry+0x1e/0xa0
> Aug 27 07:14:17 dovecot1 kernel: [] ?
> pagecache_get_page+0xd1/0x1a0
> Aug 27 07:14:17 dovecot1 kernel: [] ?
> bh_lru_install+0x18a/0x1e0
> Aug 27 07:14:17 dovecot1 kernel: []
> wait_for_completion+0x134/0x190
> Aug 27 07:14:17 dovecot1 kernel: [] ? 
> wake_up_state+0x20/0x20
> Aug 27 07:14:17 dovecot1 kernel: []
> __ocfs2_cluster_lock.isra.36+0x231/0x9c0 [ocfs2]
> Aug 27 07:14:17 dovecot1 kernel: [] ?
> ocfs2_buffer_cached.isra.6+0xb6/0x240 [ocfs2]
> Aug 27 07:14:17 dovecot1 kernel: []
> ocfs2_inode_lock_full_nested+0x1da/0x530 [ocfs2]
> Aug 27 07:14:17 dovecot1 kernel: []
> ocfs2_inode_lock_tracker+0xbb/0x1c0 [ocfs2]
> Aug 27 07:14:17 dovecot1 kernel: []
> ocfs2_iop_get_acl+0x5d/0x25e [ocfs2]
> Aug 27 07:14:17 dovecot1 kernel: [] ?
> ocfs2_reserve_local_alloc_bits+0x8d/0x380 [ocfs2]
> Aug 27 07:14:17 dovecot1 kernel: [] get_acl+0x47/0x70
> Aug 27 07:14:17 dovecot1 kernel: []
> posix_acl_create+0x5a/0x160
> Aug 27 07:14:17 dovecot1 kernel: [] ocfs2_mknod+0x938/0x1620
> [ocfs2]
> Aug 27 07:14:17 dovecot1 kernel: [] ?
> ocfs2_wake_downconvert_thread+0x49/0x50 [ocfs2]
> Aug 27 07:14:17 dovecot1 kernel: [] ocfs2_create+0x66/0x170
> [ocfs2]
> Aug 27 07:14:17 dovecot1 kernel: [] vfs_create+0xd5/0x140
> Aug 27 07:14:17 dovecot1 kernel: [] do_last+0x9ed/0x1270
> Aug 27 07:14:17 dovecot1 kernel: [] path_openat+0x8f/0x630
> Aug 27 07:14:17 dovecot1 kernel: [] ?
> user_path_at_empty+0x6e/0xc0
> Aug 27 07:14:17 dovecot1 kernel: [] do_filp_open+0x49/0xc0
> Aug 27 07:14:17 dovecot1 kernel: [] ?
> find_next_zero_bit+0x25/0x30
> Aug 27 07:14:17 dovecot1 kernel: [] ? __alloc_fd+0xa7/0x130
> Aug 27 07:14:17 dovecot1 kernel: [] do_sys_open+0x137/0x240
> Aug 27 07:14:17 dovecot1 kernel: [] ?
> __audit_syscall_exit+0x1e6/0x280
> Aug 27 07:14:17 dovecot1 kernel: [] ?
> SyS_mprotect+0x1f4/0x290
> Aug 27 07:14:17 dovecot1 kernel: [] SyS_open+0x1e/0x20
> Aug 27 07:14:17 dovecot1 kernel: []
> system_call_fastpath+0x12/0x71
>  
>
> 
> Here is my bug report
> https://bugzilla.oracle.com/bugzilla/show_bug.cgi?id=16056
> 
> But, unfortunately, there is no activity there and the bug is still not
> fixed in the latest UEK kernels.
> 
> I think that Oracle takes ocfs2 from here anyway.
> 
> Maybe somebody knows what was changed in
> 4.1.12-103.3.8.el7uek?
> 
> 
> Thank you!

Re: [Ocfs2-devel] [PATCH] ocfs2: mknod: fix recursive locking hung

2017-10-23 Thread Junxiao Bi
On 10/23/2017 02:51 PM, Eric Ren wrote:
> Hi,
> 
> On 10/18/2017 12:44 PM, Junxiao Bi wrote:
>> On 10/18/2017 12:41 PM, Gang He wrote:
>>> Hi Junxiao,
>>>
>>> The problem looks easy to reproduce?
>>> Could you share the trigger script/code for this issue?
>> Please run ocfs2-test multiple reflink test.
> Hmm, strange, we do run ocfs2-test quite often.
Indeed, this issue does not exist upstream;
commit c25a1e0671fb ("ocfs2: fix posix_acl_create deadlock") already
fixed it.

Thanks,
Junxiao.
> 
> Eric




Re: [Ocfs2-devel] [PATCH] ocfs2: mknod: fix recursive locking hung

2017-10-17 Thread Junxiao Bi
On 10/18/2017 12:41 PM, Gang He wrote:
> Hi Junxiao,
> 
> The problem looks easy to reproduce?
> Could you share the trigger script/code for this issue?
Please run ocfs2-test multiple reflink test.

Thanks,
Junxiao.
> 
> 
> Thanks
> Gang
> 
> 
>>>>
>> Here is another recursive lock that was caught; it caused the cluster to hang.
>>
>>  #0 [88008e3935a8] __schedule at 816e4722
>>  #1 [88008e393600] schedule at 816e4dee
>>  #2 [88008e393620] schedule_timeout at 816e7cd5
>>  #3 [88008e3936c0] wait_for_completion at 816e631f
>>  #4 [88008e393740] __ocfs2_cluster_lock at a05a9111 [ocfs2]
>>  #5 [88008e393890] ocfs2_inode_lock_full_nested at a05aec14 
>> [ocfs2]
>>  #6 [88008e393910] ocfs2_inode_lock_tracker at a05af02f [ocfs2]
>>  #7 [88008e393970] ocfs2_iop_get_acl at a0620c92 [ocfs2]
>>  #8 [88008e3939d0] get_acl at 8126ae79
>>  #9 [88008e3939f0] posix_acl_create at 8126b27a
>>  #10 [88008e393a20] ocfs2_mknod at a05cedcc [ocfs2]
>>  #11 [88008e393b60] ocfs2_create at a05cfb13 [ocfs2]
>>  #12 [88008e393bd0] vfs_create at 81217338
>>  #13 [88008e393c10] lookup_open at 81217a85
>>  #14 [88008e393ca0] do_last at 8121ac6d
>>  #15 [88008e393d30] path_openat at 8121b112
>>  #16 [88008e393df0] do_filp_open at 8121b53a
>>  #17 [88008e393ed0] do_sys_open at 81209a5a
>>  #18 [88008e393f40] sys_open at 81209bae
>>  #19 [88008e393f50] system_call_fastpath at 816e902e
>>
>> The inode lock is taken by ocfs2_mknod() before calling into posix_acl_create().
>>
>> Signed-off-by: Junxiao Bi 
>> Cc: 
>> ---
>>  fs/ocfs2/namei.c |   14 ++++++++------
>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
>> index 3b0a10d9b36f..f0ee52e600ff 100644
>> --- a/fs/ocfs2/namei.c
>> +++ b/fs/ocfs2/namei.c
>> @@ -260,6 +260,8 @@ static int ocfs2_mknod(struct inode *dir,
>>  sigset_t oldset;
>>  int did_block_signals = 0;
>>  struct ocfs2_dentry_lock *dl = NULL;
>> +int locked;
>> +struct ocfs2_lock_holder oh;
>>  
>>  trace_ocfs2_mknod(dir, dentry, dentry->d_name.len, dentry->d_name.name,
>>(unsigned long long)OCFS2_I(dir)->ip_blkno,
>> @@ -274,11 +276,11 @@ static int ocfs2_mknod(struct inode *dir,
>>  /* get our super block */
>>  osb = OCFS2_SB(dir->i_sb);
>>  
>> -status = ocfs2_inode_lock(dir, &parent_fe_bh, 1);
>> -if (status < 0) {
>> -if (status != -ENOENT)
>> -mlog_errno(status);
>> -return status;
>> +locked = ocfs2_inode_lock_tracker(dir, &parent_fe_bh, 1, &oh);
>> +if (locked < 0) {
>> +if (locked != -ENOENT)
>> +mlog_errno(locked);
>> +return locked;
>>  }
>>  
>>  if (S_ISDIR(mode) && (dir->i_nlink >= ocfs2_link_max(osb))) {
>> @@ -462,7 +464,7 @@ static int ocfs2_mknod(struct inode *dir,
>>  if (handle)
>>  ocfs2_commit_trans(osb, handle);
>>  
>> -ocfs2_inode_unlock(dir, 1);
>> +ocfs2_inode_unlock_tracker(dir, 1, &oh, locked);
>>  if (did_block_signals)
>>  ocfs2_unblock_signals(&oldset);
>>  
>> -- 
>> 1.7.9.5
>>
>>
> 




[Ocfs2-devel] [PATCH] ocfs2: mknod: fix recursive locking hung

2017-10-17 Thread Junxiao Bi
Here is another recursive lock that was caught; it caused the cluster to hang.

 #0 [88008e3935a8] __schedule at 816e4722
 #1 [88008e393600] schedule at 816e4dee
 #2 [88008e393620] schedule_timeout at 816e7cd5
 #3 [88008e3936c0] wait_for_completion at 816e631f
 #4 [88008e393740] __ocfs2_cluster_lock at a05a9111 [ocfs2]
 #5 [88008e393890] ocfs2_inode_lock_full_nested at a05aec14 [ocfs2]
 #6 [88008e393910] ocfs2_inode_lock_tracker at a05af02f [ocfs2]
 #7 [88008e393970] ocfs2_iop_get_acl at a0620c92 [ocfs2]
 #8 [88008e3939d0] get_acl at 8126ae79
 #9 [88008e3939f0] posix_acl_create at 8126b27a
 #10 [88008e393a20] ocfs2_mknod at a05cedcc [ocfs2]
 #11 [88008e393b60] ocfs2_create at a05cfb13 [ocfs2]
 #12 [88008e393bd0] vfs_create at 81217338
 #13 [88008e393c10] lookup_open at 81217a85
 #14 [88008e393ca0] do_last at 8121ac6d
 #15 [88008e393d30] path_openat at 8121b112
 #16 [88008e393df0] do_filp_open at 8121b53a
 #17 [88008e393ed0] do_sys_open at 81209a5a
 #18 [88008e393f40] sys_open at 81209bae
 #19 [88008e393f50] system_call_fastpath at 816e902e

The inode lock is taken by ocfs2_mknod() before calling into posix_acl_create().
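
Conceptually, the _tracker variants avoid this recursion by remembering which
process already holds the cluster lock. A simplified sketch of the idea (not
the exact dlmglue implementation; ocfs2_is_locked_by_me() and
ocfs2_add_holder() are the existing dlmglue helpers):

	int ocfs2_inode_lock_tracker(struct inode *inode,
				     struct buffer_head **ret_bh,
				     int ex, struct ocfs2_lock_holder *oh)
	{
		int status;
		struct ocfs2_lock_res *lockres =
			&OCFS2_I(inode)->ip_inode_lockres;

		/* Re-entry from the same process (e.g. via get_acl())?
		 * The lock is already ours; take no new cluster lock. */
		if (ocfs2_is_locked_by_me(lockres))
			return 1;

		status = ocfs2_inode_lock(inode, ret_bh, ex);
		if (status < 0)
			return status;

		/* Record this process as a holder so nested callers can
		 * detect the recursion. */
		ocfs2_add_holder(lockres, oh);
		return 0;	/* caller must unlock */
	}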

Signed-off-by: Junxiao Bi 
Cc: 
---
 fs/ocfs2/namei.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 3b0a10d9b36f..f0ee52e600ff 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -260,6 +260,8 @@ static int ocfs2_mknod(struct inode *dir,
sigset_t oldset;
int did_block_signals = 0;
struct ocfs2_dentry_lock *dl = NULL;
+   int locked;
+   struct ocfs2_lock_holder oh;
 
trace_ocfs2_mknod(dir, dentry, dentry->d_name.len, dentry->d_name.name,
  (unsigned long long)OCFS2_I(dir)->ip_blkno,
@@ -274,11 +276,11 @@ static int ocfs2_mknod(struct inode *dir,
/* get our super block */
osb = OCFS2_SB(dir->i_sb);
 
-   status = ocfs2_inode_lock(dir, &parent_fe_bh, 1);
-   if (status < 0) {
-   if (status != -ENOENT)
-   mlog_errno(status);
-   return status;
+   locked = ocfs2_inode_lock_tracker(dir, &parent_fe_bh, 1, &oh);
+   if (locked < 0) {
+   if (locked != -ENOENT)
+   mlog_errno(locked);
+   return locked;
}
 
if (S_ISDIR(mode) && (dir->i_nlink >= ocfs2_link_max(osb))) {
@@ -462,7 +464,7 @@ static int ocfs2_mknod(struct inode *dir,
if (handle)
ocfs2_commit_trans(osb, handle);
 
-   ocfs2_inode_unlock(dir, 1);
+   ocfs2_inode_unlock_tracker(dir, 1, &oh, locked);
if (did_block_signals)
ocfs2_unblock_signals(&oldset);
 
-- 
1.7.9.5




Re: [Ocfs2-devel] [PATCH] ocfs2: fstrim: Fix start offset of first cluster group during fstrim

2017-10-12 Thread Junxiao Bi
On 10/13/2017 03:12 AM, Ashish Samant wrote:
> The first cluster group descriptor is not stored at the start of the group
> but at an offset from the start. We need to take this into account while
> doing fstrim on the first cluster group. Otherwise we will wrongly start
> fstrim a few blocks after the desired start block and the range can cross
> over into the next cluster group and zero out the group descriptor there.
> This can cause filesystem corruption that cannot be fixed by fsck.
> 
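> A worked example with made-up numbers: suppose osb->first_cluster_group_blkno
> is block 64, i.e. the descriptor of the first group sits 64 blocks past the
> start of the group. If fstrim asks to start at cluster 0 of that group, the
> old code computed
> 
>   discard = le64_to_cpu(gd->bg_blkno)              /* 64 */
>           + ocfs2_clusters_to_blocks(sb, start);   /* +0 */
> 
> so trimming began 64 blocks past the intended start, and a range sized for
> the whole group could run 64 blocks into the next group, zeroing its
> descriptor.
> 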
> Signed-off-by: Ashish Samant 
> Cc: sta...@vger.kernel.org
Looks good.

Reviewed-by: Junxiao Bi 

> ---
>  fs/ocfs2/alloc.c | 24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
> index a177eae..addd7c5 100644
> --- a/fs/ocfs2/alloc.c
> +++ b/fs/ocfs2/alloc.c
> @@ -7304,13 +7304,24 @@ int ocfs2_truncate_inline(struct inode *inode, struct 
> buffer_head *di_bh,
>  
>  static int ocfs2_trim_extent(struct super_block *sb,
>struct ocfs2_group_desc *gd,
> -  u32 start, u32 count)
> +  u64 group, u32 start, u32 count)
>  {
>   u64 discard, bcount;
> + struct ocfs2_super *osb = OCFS2_SB(sb);
>  
>   bcount = ocfs2_clusters_to_blocks(sb, count);
> - discard = le64_to_cpu(gd->bg_blkno) +
> - ocfs2_clusters_to_blocks(sb, start);
> + discard = ocfs2_clusters_to_blocks(sb, start);
> +
> + /*
> +  * For the first cluster group, the gd->bg_blkno is not at the start
> +  * of the group, but at an offset from the start. If we add it while
> +  * calculating discard for first group, we will wrongly start fstrim a
> +  * few blocks after the desired start block and the range can cross
> +  * over into the next cluster group. So, add it only if this is not
> +  * the first cluster group.
> +  */
> + if (group != osb->first_cluster_group_blkno)
> + discard += le64_to_cpu(gd->bg_blkno);
>  
>   trace_ocfs2_trim_extent(sb, (unsigned long long)discard, bcount);
>  
> @@ -7318,7 +7329,7 @@ static int ocfs2_trim_extent(struct super_block *sb,
>  }
>  
>  static int ocfs2_trim_group(struct super_block *sb,
> - struct ocfs2_group_desc *gd,
> + struct ocfs2_group_desc *gd, u64 group,
>   u32 start, u32 max, u32 minbits)
>  {
>   int ret = 0, count = 0, next;
> @@ -7337,7 +7348,7 @@ static int ocfs2_trim_group(struct super_block *sb,
>   next = ocfs2_find_next_bit(bitmap, max, start);
>  
>   if ((next - start) >= minbits) {
> - ret = ocfs2_trim_extent(sb, gd,
> + ret = ocfs2_trim_extent(sb, gd, group,
>   start, next - start);
>   if (ret < 0) {
>   mlog_errno(ret);
> @@ -7435,7 +7446,8 @@ int ocfs2_trim_fs(struct super_block *sb, struct 
> fstrim_range *range)
>   }
>  
>   gd = (struct ocfs2_group_desc *)gd_bh->b_data;
> - cnt = ocfs2_trim_group(sb, gd, first_bit, last_bit, minlen);
> + cnt = ocfs2_trim_group(sb, gd, group,
> +first_bit, last_bit, minlen);
>   brelse(gd_bh);
>   gd_bh = NULL;
>   if (cnt < 0) {
> 




Re: [Ocfs2-devel] A o2cb DLM problem

2017-10-11 Thread Junxiao Bi
On 10/12/2017 02:37 PM, Gang He wrote:
> Hello list,
> 
> We got an o2cb DLM problem from a customer who is using the o2cb stack for
> an OCFS2 file system on SLES12SP1 (3.12.49-11-default).
> The problem description is as follows:
> 
> The customer has a three-node Oracle RAC cluster:
> gal7gblr2084
> gal7gblr2085
> gal7gblr2086
> 
> On each node they have configured two ocfs2 resources as filesystems. The
> two nodes gal7gblr2085 and gal7gblr2086 hung and went into a loop of killing
> each other, and they want a root cause analysis.
> Anyway, all I see in the logs is these messages flooding /var/log/messages:
> 
> 2017-10-05T06:50:25.980773+01:00 gal7gblr2085 kernel: [16874541.314199] 
> o2net: Connection to node gal7gblr2086 (num 2) at 10.233.217.12: has been 
> idle for 30.5 secs, shutting it down.
Looks like it is an old kernel. Shutting down the connection on idle timeout
can lose dlm messages, which may cause a hang. Please apply the
following 3 patches:

8c7b638cece1 ocfs2: quorum: add a log for node not fenced
8e9801dfe37c ocfs2: o2net: set tcp user timeout to max value
c43c363def04 ocfs2: o2net: don't shutdown connection when idle timeout
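
For reference, the idea behind the second commit is to keep TCP itself from
aborting the connection while o2net still has unacked data. A sketch of that
idea, using the in-kernel setsockopt interface of that era (the constant name
O2NET_TCP_USER_TIMEOUT comes from the commit; the exact code may differ):

	static void o2net_set_usertimeout(struct socket *sock)
	{
		int user_timeout = O2NET_TCP_USER_TIMEOUT;

		/* Keep the connection alive even if data stays unacked for
		 * a long time, instead of erroring out to o2net. */
		kernel_setsockopt(sock, SOL_TCP, TCP_USER_TIMEOUT,
				  (char *)&user_timeout,
				  sizeof(user_timeout));
	}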

Thanks,
Junxiao.
> 2017-10-05T06:50:37.456786+01:00 gal7gblr2085 kernel: [16874552.778726] 
> o2net: No longer connected to node gal7gblr2086 (num 2) at 10.233.217.12:
> 2017-10-05T06:50:45.176798+01:00 gal7gblr2085 kernel: [16874560.487834] 
> (kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: Error 
> -107 when sending message 504 (key 0x4a68dd81) to node 2
> 2017-10-05T06:50:45.176812+01:00 gal7gblr2085 kernel: [16874560.487838] 
> o2dlm: Waiting on the death of node 2 in domain 
> 18AE08328428452BA610E7BDE26F5246
> 2017-10-05T06:50:50.284796+01:00 gal7gblr2085 kernel: [16874565.589996] 
> (kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: Error 
> -107 when sending message 504 (key 0x4a68dd81) to node 2
> 2017-10-05T06:50:50.284811+01:00 gal7gblr2085 kernel: [16874565.59] 
> o2dlm: Waiting on the death of node 2 in domain 
> 18AE08328428452BA610E7BDE26F5246
> 2017-10-05T06:50:55.400808+01:00 gal7gblr2085 kernel: [16874570.700448] 
> (kworker/u64:1,13245,10):dlm_send_remote_convert_request:392 ERROR: Error 
> -107 when sending message 504 (key 0x4a68dd81) to node 2
> 2017-10-05T06:50:55.400824+01:00 gal7gblr2085 kernel: [16874570.700452] 
> o2dlm: Waiting on the death of node 2 in domain 
> 18AE08328428452BA610E7BDE26F5246
> 2017-10-05T06:51:00.512766+01:00 gal7gblr2085 kernel: [16874575.808944] 
> (kworker/u64:1,13245,26):dlm_send_remote_convert_request:392 ERROR: Error 
> -107 when sending message 504 (key 0x4a68dd81) to node 2
> 2017-10-05T06:51:00.512783+01:00 gal7gblr2085 kernel: [16874575.808948] 
> o2dlm: Waiting on the death of node 2 in domain 
> 18AE08328428452BA610E7BDE26F5246
> 2017-10-05T06:51:02.456785+01:00 gal7gblr2085 kernel: [16874577.749286] 
> (ora_diag_rcp2,24339,0):dlm_do_master_request:1344 ERROR: link to 2 went down!
> 2017-10-05T06:51:02.456797+01:00 gal7gblr2085 kernel: [16874577.749289] 
> (ora_diag_rcp2,24339,0):dlm_get_lock_resource:929 ERROR: status = -107
> 2017-10-05T06:51:05.632955+01:00 gal7gblr2085 kernel: [16874580.920124] 
> (kworker/u64:1,13245,26):dlm_send_remote_convert_request:392 ERROR: Error 
> -107 when sending message 504 (key 0x4a68dd81) to node 2
> 2017-10-05T06:51:05.632973+01:00 gal7gblr2085 kernel: [16874580.920132] 
> o2dlm: Waiting on the death of node 2 in domain 
> 18AE08328428452BA610E7BDE26F5246
> 2017-10-05T06:51:07.976787+01:00 gal7gblr2085 kernel: [16874583.262561] 
> o2net: No connection established with node 2 after 30.0 seconds, giving up.
> 2017-10-05T10:03:38.439542+01:00 gal7gblr2084 kernel: [1911889.097543] 
> (mdb_psp0_-mgmtd,21126,0):dlm_send_remote_unlock_request:358 ERROR: Error 
> -107 when sending message 506 (key 0x4a68dd81) to node 1
> 2017-10-05T10:03:38.439543+01:00 gal7gblr2084 kernel: [1911889.097547] 
> (mdb_psp0_-mgmtd,21126,0):dlm_send_remote_unlock_request:358 ERROR: Error 
> -107 when sending message 506 (key 0x4a68dd81) to node 1
> 
> 
> Did you guys encounter such a problem when using the o2cb stack? We mainly
> focus on the pcmk stack, but I still want to help this customer find the
> root cause.
> 
> 
> Thanks
> Gang
> 
> 
> 
> 
> 
> 




Re: [Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability

2017-08-22 Thread Junxiao Bi
On 08/10/2017 06:49 PM, Changwei Ge wrote:
> Hi Joseph,
> 
> 
> On 2017/8/10 17:53, Joseph Qi wrote:
>> Hi Changwei,
>>
>> On 17/8/9 23:24, ge changwei wrote:
>>> Hi
>>>
>>>
>>> On 2017/8/9 7:32 PM, Joseph Qi wrote:
 Hi,

 On 17/8/7 15:13, Changwei Ge wrote:
> Hi,
>
> In the current code, while flushing ASTs, we don't handle the exception
> where sending an AST or BAST fails.
> But it is indeed possible that an AST or BAST is lost due to some kind of
> network fault.
>
 Could you please describe this issue more clearly? It is better to analyze
 the issue along with the error message and the status of related nodes.
 IMO, if the network is down, one of the two nodes will be fenced. So what's
 your case here?

 Thanks,
 Joseph
>>> I have posted the status of the related lock resources in my preceding
>>> email. Please check them out.
>>>
>>> Moreover, the network is not down forever, not even longer than the
>>> threshold for being fenced.
>>> So no node will be fenced.
>>>
>>> This issue happens in a terrible network environment. Some messages may be
>>> dropped by the switch under various conditions.
>>> And even frequent, fast link up/down will cause this issue.
>>>
>>> In a nutshell, re-queuing ASTs and BASTs is crucial when the link between
>>> nodes recovers quickly. It prevents the cluster from hanging.
>> So you mean the tcp packet is lost due to connection reset? IIRC,
> Yes, it's something like that exception, which I think deserves to be
> fixed within OCFS2.
>> Junxiao has posted a patchset to fix this issue.
>> If you are using the re-queuing approach, how do you make sure the original
>> message is *truly* lost and the same ast/bast won't be sent twice?
> With regard to the TCP layer, if it returns an error to OCFS2, the packet
> cannot have been sent successfully. So no node will have received such an
> AST or BAST.
Right, but not only AST/BAST: other messages pending in the tcp queue will
also be lost if tcp returns an error to ocfs2, and this can also cause a
hang. Besides, your fix may introduce the duplicated ast/bast messages
Joseph mentioned.
Ocfs2 depends on tcp a lot; it can't work well if tcp returns errors to it.
To fix this properly, maybe ocfs2 should maintain its own message queue and
ack messages itself rather than depending on TCP.
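
A purely hypothetical sketch of that idea (none of these helpers or types
exist in ocfs2 today; the names are made up for illustration):

	/* Keep every sent message until the peer acks it at the ocfs2
	 * level, so a TCP error or reconnect never silently drops it. */
	struct o2net_pending_msg {
		struct list_head	list;
		u64			msg_id;	/* echoed back in the ack */
		void			*payload;
		size_t			len;
		u8			target_node;
	};

	static LIST_HEAD(o2net_pending_msgs);
	static DEFINE_SPINLOCK(o2net_pending_lock);

	static void o2net_ack_received(u64 msg_id)
	{
		struct o2net_pending_msg *p, *tmp;

		spin_lock(&o2net_pending_lock);
		list_for_each_entry_safe(p, tmp, &o2net_pending_msgs, list) {
			if (p->msg_id == msg_id) {
				list_del(&p->list);
				kfree(p->payload);
				kfree(p);
				break;
			}
		}
		spin_unlock(&o2net_pending_lock);
	}

Anything still queued after a reconnect would be resent, or dropped if the
target node is dead, instead of being lost with the old connection.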

Thanks,
Junxiao.


> With regard to OCFS2, my patch can guarantee that one AST/BAST is never
> queued on the pending list twice and sent successfully both times.
> 
> Thanks,
> Changwei
>>
>> Thanks,
>> Joseph
>>  
>>> Thanks,
>>> Changwei
> If the above exception happens, the requesting node will never get an AST
> back; hence, it will never acquire the lock or abort the current locking.
>
> With this patch, I'd like to fix this issue by re-queuing the AST or
> BAST if sending failed due to a network fault.
>
> And the re-queued AST or BAST will be dropped if the requesting node is
> dead!
>
> It will improve reliability a lot.
>
>
> Thanks.
>
> Changwei.
> 




Re: [Ocfs2-devel] [PATCH] ocfs2: o2hb: revert hb threshold to keep compatible

2017-03-28 Thread Junxiao Bi
On 03/29/2017 12:01 PM, Joseph Qi wrote:
> 
> 
> On 17/3/29 09:07, Junxiao Bi wrote:
>> On 03/29/2017 06:31 AM, Andrew Morton wrote:
>>> On Tue, 28 Mar 2017 09:40:45 +0800 Junxiao Bi 
>>> wrote:
>>>
>>>> Configfs is the interface through which ocfs2-tools passes configuration
>>>> to the kernel. Changing the heartbeat dead threshold name in configfs
>>>> causes a compatibility issue, so revert it.
>>>>
>>>> Fixes: 45b997737a80 ("ocfs2/cluster: use per-attribute show and
>>>> store methods")
>>> I don't get it.  45b997737a80 was merged nearly two years ago, so isn't
>>> it a bit late to fix compatibility issues?
>>>
>> This compatibility issue will not bring ocfs2 down; it just makes some
>> configuration (the hb dead threshold) lose effect. If someone wants to use
>> the new kernel, they should apply this fix.
> The threshold configuration has a default value in the kernel, so this will
> only affect changing the value from user space.
Right. Thank you.

> 
> Thanks,
> Joseph
>>
>> Thanks,
>> Junxiao.
>>
> 




Re: [Ocfs2-devel] [PATCH] ocfs2: o2hb: revert hb threshold to keep compatible

2017-03-28 Thread Junxiao Bi
Hi Andrew,

On 03/29/2017 11:31 AM, Andrew Morton wrote:
> On Wed, 29 Mar 2017 09:07:08 +0800 Junxiao Bi  wrote:
> 
>> On 03/29/2017 06:31 AM, Andrew Morton wrote:
>>> On Tue, 28 Mar 2017 09:40:45 +0800 Junxiao Bi  wrote:
>>>
>>>> Configfs is the interface through which ocfs2-tools passes configuration
>>>> to the kernel. Changing the heartbeat dead threshold name in configfs
>>>> causes a compatibility issue, so revert it.
>>>>
>>>> Fixes: 45b997737a80 ("ocfs2/cluster: use per-attribute show and store 
>>>> methods")
>>>
>>> I don't get it.  45b997737a80 was merged nearly two years ago, so isn't
>>> it a bit late to fix compatibility issues?
>>>
>> This compatibility issue will not bring ocfs2 down; it just makes some
>> configuration (the hb dead threshold) lose effect. If someone wants to use
>> the new kernel, they should apply this fix.
> 
> Well could someone please send a better changelog?  One which carefully
> describes the present behaviour, what is wrong with it and how the
> patch fixes it?

Here is a new one; please help review.

Configfs is the interface through which ocfs2-tools passes configuration
to the kernel, and
$configfs_dir/cluster/$clustername/heartbeat/dead_threshold is the file
used to configure the heartbeat dead threshold. The kernel has a default
value for it, but users can set O2CB_HEARTBEAT_THRESHOLD in
/etc/sysconfig/o2cb to override it.
Commit 45b997737a80 ("ocfs2/cluster: use per-attribute show and store
methods") changed the heartbeat dead threshold attribute name while
ocfs2-tools did not, so ocfs2-tools can no longer set this configuration
and the default value is always used. So revert it.
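
For context, CONFIGFS_ATTR() derives the configfs file name from the
attribute name, which is why renaming the attribute renamed the file users
see. Roughly, paraphrased from include/linux/configfs.h of that era (check
the exact header before relying on this):

	#define CONFIGFS_ATTR(_pfx, _name)				\
	static struct configfs_attribute _pfx##attr_##_name = {		\
		.ca_name	= __stringify(_name),			\
		.ca_mode	= S_IRUGO | S_IWUSR,			\
		.ca_owner	= THIS_MODULE,				\
		.show		= _pfx##_name##_show,			\
		.store		= _pfx##_name##_store,			\
	}

So CONFIGFS_ATTR(o2hb_heartbeat_group_, threshold) exposes a file named
"threshold", while ocfs2-tools keeps writing to "dead_threshold".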

Thanks,
Junxiao.

> 
> One reason for doing this is to permit effecitive patch review.
> 
> Another reason is to permit others to decide whether the patch should
> be backported into -stable kernels.
> 
> Yet another reason is so that maintainers of other kernels can
> determine whether this patch will fix behaviour which their users are
> reporting.
> 
> Thanks.
> 




Re: [Ocfs2-devel] [PATCH] ocfs2: o2hb: revert hb threshold to keep compatible

2017-03-28 Thread Junxiao Bi
On 03/29/2017 06:31 AM, Andrew Morton wrote:
> On Tue, 28 Mar 2017 09:40:45 +0800 Junxiao Bi  wrote:
> 
>> Configfs is the interface through which ocfs2-tools passes configuration
>> to the kernel. Changing the heartbeat dead threshold name in configfs
>> causes a compatibility issue, so revert it.
>>
>> Fixes: 45b997737a80 ("ocfs2/cluster: use per-attribute show and store 
>> methods")
> 
> I don't get it.  45b997737a80 was merged nearly two years ago, so isn't
> it a bit late to fix compatibility issues?
> 
This compatibility issue will not bring ocfs2 down; it just makes some
configuration (the hb dead threshold) lose effect. If someone wants to use
the new kernel, they should apply this fix.

Thanks,
Junxiao.



[Ocfs2-devel] [PATCH] ocfs2: o2hb: revert hb threshold to keep compatible

2017-03-27 Thread Junxiao Bi
Configfs is the interface through which ocfs2-tools passes configuration
to the kernel. Changing the heartbeat dead threshold name in configfs
causes a compatibility issue, so revert it.

Fixes: 45b997737a80 ("ocfs2/cluster: use per-attribute show and store methods")
Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/cluster/heartbeat.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index f6e871760f8d..0da0332725aa 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -2242,13 +2242,13 @@ static void o2hb_heartbeat_group_drop_item(struct 
config_group *group,
spin_unlock(&o2hb_live_lock);
 }
 
-static ssize_t o2hb_heartbeat_group_threshold_show(struct config_item *item,
+static ssize_t o2hb_heartbeat_group_dead_threshold_show(struct config_item *item,
char *page)
 {
return sprintf(page, "%u\n", o2hb_dead_threshold);
 }
 
-static ssize_t o2hb_heartbeat_group_threshold_store(struct config_item *item,
+static ssize_t o2hb_heartbeat_group_dead_threshold_store(struct config_item *item,
const char *page, size_t count)
 {
unsigned long tmp;
@@ -2297,11 +2297,11 @@ static ssize_t o2hb_heartbeat_group_mode_store(struct 
config_item *item,
 
 }
 
-CONFIGFS_ATTR(o2hb_heartbeat_group_, threshold);
+CONFIGFS_ATTR(o2hb_heartbeat_group_, dead_threshold);
 CONFIGFS_ATTR(o2hb_heartbeat_group_, mode);
 
 static struct configfs_attribute *o2hb_heartbeat_group_attrs[] = {
-   &o2hb_heartbeat_group_attr_threshold,
+   &o2hb_heartbeat_group_attr_dead_threshold,
&o2hb_heartbeat_group_attr_mode,
NULL,
 };
-- 
1.7.9.5




Re: [Ocfs2-devel] [PATCH 02/17] Single Run: kernel building is little broken now

2017-03-13 Thread Junxiao Bi
On 12/13/2016 01:29 PM, Eric Ren wrote:
> Only check kernel source if we specify "buildkernel" test case.
> The original kernel source web-link cannot be reached,
> so give a new link instead but the md5sum check is missing
> now.
> 
> Signed-off-by: Eric Ren 
> ---
>  programs/python_common/single_run-WIP.sh | 56 
> 
>  1 file changed, 28 insertions(+), 28 deletions(-)
> 
> diff --git a/programs/python_common/single_run-WIP.sh 
> b/programs/python_common/single_run-WIP.sh
> index fe0056c..61008d8 100755
> --- a/programs/python_common/single_run-WIP.sh
> +++ b/programs/python_common/single_run-WIP.sh
> @@ -20,9 +20,9 @@ WGET=`which wget`
>  WHOAMI=`which whoami`
>  SED=`which sed`
>  
> -DWNLD_PATH="http://oss.oracle.com/~smushran/ocfs2-test";
> -KERNEL_TARBALL="linux-kernel.tar.gz"
> -KERNEL_TARBALL_CHECK="${KERNEL_TARBALL}.md5sum"
> +DWNLD_PATH="https://cdn.kernel.org/pub/linux/kernel/v3.x/";
> +KERNEL_TARBALL="linux-3.2.80.tar.xz"
> +#KERNEL_TARBALL_CHECK="${KERNEL_TARBALL}.md5sum"

Can we compute the md5sum manually and put it here?

Thanks,
Junxiao.

>  USERID=`${WHOAMI}`
>  
>  DEBUGFS_BIN="${SUDO} `which debugfs.ocfs2`"
> @@ -85,7 +85,7 @@ get_bits()
>  # get_kernel_source $LOGDIR $DWNLD_PATH $KERNEL_TARBALL $KERNEL_TARBALL_CHECK
>  get_kernel_source()
>  {
> - if [ "$#" -lt "4" ]; then
> + if [ "$#" -lt "3" ]; then
>   ${ECHO} "Error in get_kernel_source()"
>   exit 1
>   fi
> @@ -93,18 +93,18 @@ get_kernel_source()
>   logdir=$1
>   dwnld_path=$2
>   kernel_tarball=$3
> - kernel_tarball_check=$4
> + #kernel_tarball_check=$4
>  
>   cd ${logdir}
>  
>   outlog=get_kernel_source.log
>  
> - ${WGET} -o ${outlog} ${dwnld_path}/${kernel_tarball_check}
> - if [ $? -ne 0 ]; then
> - ${ECHO} "ERROR downloading 
> ${dwnld_path}/${kernel_tarball_check}"
> - cd -
> - exit 1
> - fi
> +#${WGET} -o ${outlog} ${dwnld_path}/${kernel_tarball_check}
> +#if [ $? -ne 0 ]; then
> +#${ECHO} "ERROR downloading 
> ${dwnld_path}/${kernel_tarball_check}"
> +#cd -
> +#exit 1
> +#fi
>  
>   ${WGET} -a ${outlog} ${dwnld_path}/${kernel_tarball}
>   if [ $? -ne 0 ]; then
> @@ -113,13 +113,13 @@ get_kernel_source()
>   exit 1
>   fi
>  
> - ${MD5SUM} -c ${kernel_tarball_check} >>${outlog} 2>&1
> - if [ $? -ne 0 ]; then
> - ${ECHO} "ERROR ${kernel_tarball_check} check failed"
> - cd -
> - exit 1
> - fi
> - cd -
> +#${MD5SUM} -c ${kernel_tarball_check} >>${outlog} 2>&1
> +#if [ $? -ne 0 ]; then
> +#${ECHO} "ERROR ${kernel_tarball_check} check failed"
> +#cd -
> +#exit 1
> +#fi
> +#cd -
>  }
>  
>  # do_format() ${BLOCKSIZE} ${CLUSTERSIZE} ${FEATURES} ${DEVICE}
> @@ -1012,16 +1012,6 @@ LOGFILE=${LOGDIR}/single_run.log
>  
>  do_mkdir ${LOGDIR}
>  
> -if [ -z ${KERNELSRC} ]; then
> - get_kernel_source $LOGDIR $DWNLD_PATH $KERNEL_TARBALL $KERNEL_TARBALL_CHECK
> - KERNELSRC=${LOGDIR}/${KERNEL_TARBALL}
> -fi
> -
> -if [ ! -f ${KERNELSRC} ]; then
> - ${ECHO} "No kernel source"
> - usage
> -fi
> -
>  STARTRUN=$(date +%s)
>  log_message "*** Start Single Node test ***"
>  
> @@ -1058,6 +1048,16 @@ for tc in `${ECHO} ${TESTCASES} | ${SED} "s:,: :g"`; do
>   fi
>  
>   if [ "$tc"X = "buildkernel"X -o "$tc"X = "all"X ];then
> + if [ -z ${KERNELSRC} ]; then
> + get_kernel_source $LOGDIR $DWNLD_PATH $KERNEL_TARBALL #$KERNEL_TARBALL_CHECK
> + KERNELSRC=${LOGDIR}/${KERNEL_TARBALL}
> + fi
> +
> + if [ ! -f ${KERNELSRC} ]; then
> + ${ECHO} "No kernel source"
> + usage
> + fi
> +
>   run_buildkernel ${LOGDIR} ${DEVICE} ${MOUNTPOINT} ${KERNELSRC}
>   fi
>  
> 




Re: [Ocfs2-devel] [PATCH 01/17] ocfs2 test: correct the check on testcase if supported

2017-03-13 Thread Junxiao Bi

On 12/13/2016 01:29 PM, Eric Ren wrote:
> Signed-off-by: Eric Ren 
Reviewed-by: Junxiao Bi 

> ---
>  programs/python_common/multiple_run.sh   | 2 +-
>  programs/python_common/single_run-WIP.sh | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/programs/python_common/multiple_run.sh b/programs/python_common/multiple_run.sh
> index dd9603f..c4a7da9 100755
> --- a/programs/python_common/multiple_run.sh
> +++ b/programs/python_common/multiple_run.sh
> @@ -201,7 +201,7 @@ f_setup()
>   fi
>  
> - SUPPORTED_TESTCASES="all xattr inline reflink write_append_truncate multi_mmap create_racer flock_unit cross_delete open_delete lvb_torture"
> - for cas in ${TESTCASES}; do
> + for cas in `${ECHO} ${TESTCASES} | ${SED} "s:,: :g"`; do
>   echo ${SUPPORTED_TESTCASES} | grep -sqw $cas
>   if [ $? -ne 0 ]; then
>   echo "testcase [${cas}] not supported."
> diff --git a/programs/python_common/single_run-WIP.sh b/programs/python_common/single_run-WIP.sh
> index 5a8fae1..fe0056c 100755
> --- a/programs/python_common/single_run-WIP.sh
> +++ b/programs/python_common/single_run-WIP.sh
> @@ -997,7 +997,7 @@ fi
>  SUPPORTED_TESTCASES="all create_and_open directaio fillverifyholes 
> renamewriterace aiostress\
>filesizelimits mmaptruncate buildkernel splice sendfile mmap reserve_space 
> inline xattr\
>reflink mkfs tunefs backup_super filecheck"
> -for cas in ${TESTCASES}; do
> +for cas in `${ECHO} ${TESTCASES} | ${SED} "s:,: :g"`; do
>   echo ${SUPPORTED_TESTCASES} | grep -sqw $cas
>   if [ $? -ne 0 ]; then
>   echo "testcase [${cas}] not supported."
> 




Re: [Ocfs2-devel] [PATCH v3 1/2] ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock

2017-01-16 Thread Junxiao Bi
On 01/17/2017 02:30 PM, Eric Ren wrote:
> We are in the situation that we have to avoid recursive cluster locking,
> but there is no way to check if a cluster lock has been taken by a
> process already.
> 
> Mostly, we can avoid recursive locking by writing code carefully.
> However, we found that it's very hard to handle the routines that
> are invoked directly by vfs code. For instance:
> 
> const struct inode_operations ocfs2_file_iops = {
> .permission = ocfs2_permission,
> .get_acl= ocfs2_iop_get_acl,
> .set_acl= ocfs2_iop_set_acl,
> };
> 
> Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):
> do_sys_open
>  may_open
>   inode_permission
>ocfs2_permission
> ocfs2_inode_lock() <=== first time
>  generic_permission
>   get_acl
>ocfs2_iop_get_acl
>   ocfs2_inode_lock() <=== recursive one
> 
> A deadlock will occur if a remote EX request comes in between two
> of ocfs2_inode_lock(). Briefly describe how the deadlock is formed:
> 
> On one hand, OCFS2_LOCK_BLOCKED flag of this lockres is set in
> BAST(ocfs2_generic_handle_bast) when downconvert is started
> on behalf of the remote EX lock request. Another hand, the recursive
> cluster lock (the second one) will be blocked in __ocfs2_cluster_lock()
> because of OCFS2_LOCK_BLOCKED. But the downconvert never completes. Why?
> because there is no chance for the first cluster lock on this node to be
> unlocked - we block ourselves in the code path.
> 
> The idea to fix this issue is mostly taken from gfs2 code.
> 1. introduce a new field: struct ocfs2_lock_res.l_holders, to
> keep track of the pids of the processes that have taken the cluster lock
> of this lock resource;
> 2. introduce a new flag for ocfs2_inode_lock_full: OCFS2_META_LOCK_GETBH;
> it means just getting back disk inode bh for us if we've got cluster lock.
> 3. export a helper: ocfs2_is_locked_by_me() is used to check if we
> have got the cluster lock in the upper code path.
> 
> The tracking logic should be used by some of the ocfs2 vfs's callbacks,
> to solve the recursive locking issue caused by the fact that vfs routines
> can call into each other.
> 
> The performance penalty of processing the holder list should only be seen
> at a few cases where the tracking logic is used, such as get/set acl.
> 
> You may ask what if the first time we got a PR lock, and the second time
> we want a EX lock? fortunately, this case never happens in the real world,
> as far as I can see, including permission check, (get|set)_(acl|attr), and
> the gfs2 code also do so.
> 
> Changes since v1:
> - Let ocfs2_is_locked_by_me() just return true/false to indicate if the
> process gets the cluster lock - suggested by: Joseph Qi 
> and Junxiao Bi .
> 
> - Change "struct ocfs2_holder" to a more meaningful name "ocfs2_lock_holder",
> suggested by: Junxiao Bi.
> 
> - Do not inline functions whose bodies are not in scope, changed by:
> Stephen Rothwell .
> 
> Changes since v2:
> - Wrap the tracking logic code of recursive locking into functions,
> ocfs2_inode_lock_tracker() and ocfs2_inode_unlock_tracker(),
> suggested by: Junxiao Bi.
> 
> [s...@canb.auug.org.au remove some inlines]
> Signed-off-by: Eric Ren 

Reviewed-by: Junxiao Bi 

> ---
>  fs/ocfs2/dlmglue.c | 105 
> +++--
>  fs/ocfs2/dlmglue.h |  18 +
>  fs/ocfs2/ocfs2.h   |   1 +
>  3 files changed, 121 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 77d1632..c75b9e9 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -532,6 +532,7 @@ void ocfs2_lock_res_init_once(struct ocfs2_lock_res *res)
>   init_waitqueue_head(&res->l_event);
>   INIT_LIST_HEAD(&res->l_blocked_list);
>   INIT_LIST_HEAD(&res->l_mask_waiters);
> + INIT_LIST_HEAD(&res->l_holders);
>  }
>  
>  void ocfs2_inode_lock_res_init(struct ocfs2_lock_res *res,
> @@ -749,6 +750,50 @@ void ocfs2_lock_res_free(struct ocfs2_lock_res *res)
>   res->l_flags = 0UL;
>  }
>  
> +/*
> + * Keep a list of processes who have interest in a lockres.
> + * Note: this is now only used for checking recursive cluster locking.
> + */
> +static inline void ocfs2_add_holder(struct ocfs2_lock_res *lockres,
> +struct ocfs2_lock_holder *oh)
> +{
> + INIT_LIST_HEAD(&oh->oh_list);
> + oh->oh_owner_pid =  get_pid(task_pid(current));
> +
> + spin_lock(&lockres->l_lock);
> + list_add_tail(&oh->oh_list, &lockres->l_holders);
> +  
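The quoted diff is cut short by the archive before it reaches the two new wrappers. For orientation, a sketch of the calling convention they establish; the wrapper names and the had_lock convention are from the patch, while the surrounding code is illustrative only:

    struct buffer_head *bh = NULL;
    struct ocfs2_lock_holder oh;
    int had_lock, status;

    /* <0: error; 0: we took the cluster lock right here;
     * 1: an outer frame of this same process already holds it */
    had_lock = ocfs2_inode_lock_tracker(inode, &bh, 1, &oh);
    if (had_lock < 0)
            return had_lock;

    status = 0;     /* ... work done under the EX inode lock ... */

    /* a no-op when had_lock == 1, so the outer frame still
     * drops the cluster lock exactly once */
    ocfs2_inode_unlock_tracker(inode, 1, &oh, had_lock);
    brelse(bh);
    return status;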

Re: [Ocfs2-devel] [PATCH v3 2/2] ocfs2: fix deadlock issue when taking inode lock at vfs entry points

2017-01-16 Thread Junxiao Bi
On 01/17/2017 02:30 PM, Eric Ren wrote:
> Commit 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
> results in a deadlock, as the author "Tariq Saeed" realized shortly
> after the patch was merged. The discussion happened here
> (https://oss.oracle.com/pipermail/ocfs2-devel/2015-September/011085.html).
> 
> The reason why taking cluster inode lock at vfs entry points opens up
> a self deadlock window, is explained in the previous patch of this
> series.
> 
> So far, we have seen two different code paths that have this issue.
> 1. do_sys_open
>  may_open
>   inode_permission
>ocfs2_permission
> ocfs2_inode_lock() <=== take PR
>  generic_permission
>   get_acl
>ocfs2_iop_get_acl
> ocfs2_inode_lock() <=== take PR
> 2. fchmod|fchmodat
> chmod_common
>  notify_change
>   ocfs2_setattr <=== take EX
>posix_acl_chmod
> get_acl
>  ocfs2_iop_get_acl <=== take PR
> ocfs2_iop_set_acl <=== take EX
> 
> Fixes them by adding the tracking logic (in the previous patch) for
> these funcs above, ocfs2_permission(), ocfs2_iop_[set|get]_acl(),
> ocfs2_setattr().
> 
> Changes since v1:
> - Let ocfs2_is_locked_by_me() just return true/false to indicate if the
> process gets the cluster lock - suggested by: Joseph Qi 
> and Junxiao Bi .
> 
> - Change "struct ocfs2_holder" to a more meaningful name "ocfs2_lock_holder",
> suggested by: Junxiao Bi.
> 
> - Add debugging output at ocfs2_setattr() and ocfs2_permission() to
> catch exceptional cases, suggested by: Junxiao Bi.
> 
> Changes since v2:
> - Use new wrappers of tracking logic code, suggested by: Junxiao Bi.
> 
> Signed-off-by: Eric Ren 
Reviewed-by: Junxiao Bi 

> ---
>  fs/ocfs2/acl.c  | 29 +
>  fs/ocfs2/file.c | 58 
> -
>  2 files changed, 58 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/ocfs2/acl.c b/fs/ocfs2/acl.c
> index bed1fcb..dc22ba8 100644
> --- a/fs/ocfs2/acl.c
> +++ b/fs/ocfs2/acl.c
> @@ -283,16 +283,14 @@ int ocfs2_set_acl(handle_t *handle,
>  int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>  {
>   struct buffer_head *bh = NULL;
> - int status = 0;
> + int status, had_lock;
> + struct ocfs2_lock_holder oh;
>  
> - status = ocfs2_inode_lock(inode, &bh, 1);
> - if (status < 0) {
> - if (status != -ENOENT)
> - mlog_errno(status);
> - return status;
> - }
> + had_lock = ocfs2_inode_lock_tracker(inode, &bh, 1, &oh);
> + if (had_lock < 0)
> + return had_lock;
>   status = ocfs2_set_acl(NULL, inode, bh, type, acl, NULL, NULL);
> - ocfs2_inode_unlock(inode, 1);
> + ocfs2_inode_unlock_tracker(inode, 1, &oh, had_lock);
>   brelse(bh);
>   return status;
>  }
> @@ -302,21 +300,20 @@ struct posix_acl *ocfs2_iop_get_acl(struct inode *inode, int type)
>   struct ocfs2_super *osb;
>   struct buffer_head *di_bh = NULL;
>   struct posix_acl *acl;
> - int ret;
> + int had_lock;
> + struct ocfs2_lock_holder oh;
>  
>   osb = OCFS2_SB(inode->i_sb);
>   if (!(osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL))
>   return NULL;
> - ret = ocfs2_inode_lock(inode, &di_bh, 0);
> - if (ret < 0) {
> - if (ret != -ENOENT)
> - mlog_errno(ret);
> - return ERR_PTR(ret);
> - }
> +
> + had_lock = ocfs2_inode_lock_tracker(inode, &di_bh, 0, &oh);
> + if (had_lock < 0)
> + return ERR_PTR(had_lock);
>  
>   acl = ocfs2_get_acl_nolock(inode, type, di_bh);
>  
> - ocfs2_inode_unlock(inode, 0);
> + ocfs2_inode_unlock_tracker(inode, 0, &oh, had_lock);
>   brelse(di_bh);
>   return acl;
>  }
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index c488965..7b6a146 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -1138,6 +1138,8 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
>   handle_t *handle = NULL;
>   struct dquot *transfer_to[MAXQUOTAS] = { };
>   int qtype;
> + int had_lock;
> + struct ocfs2_lock_holder oh;
>  
>   trace_ocfs2_setattr(inode, dentry,
>   (unsigned long long)OCFS2_I(inode)->ip_blkno,
> @@ -1173,11 +1175,30 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
>   }
>   }
>  
> - status = ocfs2_

Re: [Ocfs2-devel] [PATCH v2 2/2] ocfs2: fix deadlock issue when taking inode lock at vfs entry points

2017-01-15 Thread Junxiao Bi
On 01/16/2017 02:42 PM, Eric Ren wrote:
> Commit 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
> results in a deadlock, as the author "Tariq Saeed" realized shortly
> after the patch was merged. The discussion happened here
> (https://oss.oracle.com/pipermail/ocfs2-devel/2015-September/011085.html).
> 
> The reason why taking cluster inode lock at vfs entry points opens up
> a self deadlock window, is explained in the previous patch of this
> series.
> 
> So far, we have seen two different code paths that have this issue.
> 1. do_sys_open
>  may_open
>   inode_permission
>ocfs2_permission
> ocfs2_inode_lock() <=== take PR
>  generic_permission
>   get_acl
>ocfs2_iop_get_acl
> ocfs2_inode_lock() <=== take PR
> 2. fchmod|fchmodat
> chmod_common
>  notify_change
>   ocfs2_setattr <=== take EX
>posix_acl_chmod
> get_acl
>  ocfs2_iop_get_acl <=== take PR
> ocfs2_iop_set_acl <=== take EX
> 
> Fixes them by adding the tracking logic (in the previous patch) for
> these funcs above, ocfs2_permission(), ocfs2_iop_[set|get]_acl(),
> ocfs2_setattr().
> 
> Changes since v1:
> 1. Let ocfs2_is_locked_by_me() just return true/false to indicate if the
> process gets the cluster lock - suggested by: Joseph Qi 
> and Junxiao Bi .
> 
> 2. Change "struct ocfs2_holder" to a more meaningful name "ocfs2_lock_holder",
> suggested by: Junxiao Bi .
> 
> 3. Add debugging output at ocfs2_setattr() and ocfs2_permission() to
> catch exceptional cases, suggested by: Junxiao Bi .
> 
> Signed-off-by: Eric Ren 
> ---
>  fs/ocfs2/acl.c  | 39 +
>  fs/ocfs2/file.c | 76 
> +
>  2 files changed, 100 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/ocfs2/acl.c b/fs/ocfs2/acl.c
> index bed1fcb..3e47262 100644
> --- a/fs/ocfs2/acl.c
> +++ b/fs/ocfs2/acl.c
> @@ -284,16 +284,31 @@ int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>  {
>   struct buffer_head *bh = NULL;
>   int status = 0;
> -
> - status = ocfs2_inode_lock(inode, &bh, 1);
> + int arg_flags = 0, has_locked;
> + struct ocfs2_lock_holder oh;
> + struct ocfs2_lock_res *lockres;
> +
> + lockres = &OCFS2_I(inode)->ip_inode_lockres;
> + has_locked = ocfs2_is_locked_by_me(lockres);
> + if (has_locked)
> + arg_flags = OCFS2_META_LOCK_GETBH;
> + status = ocfs2_inode_lock_full(inode, &bh, 1, arg_flags);
>   if (status < 0) {
>   if (status != -ENOENT)
>   mlog_errno(status);
>   return status;
>   }
> + if (!has_locked)
> + ocfs2_add_holder(lockres, &oh);
> +
The same code pattern shows up here and in *_get_acl; can it be abstracted
into one function (as sketched below)?
The same issue applies to *_setattr and *_permission. Sorry for not
mentioning that in the last review.

Thanks,
Junxiao.
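A sketch of the requested abstraction, assembled from the repeated pattern above; the name and return convention follow what v3 of this series ends up calling ocfs2_inode_lock_tracker(), but treat the body as an outline rather than the merged code:

    static int ocfs2_inode_lock_tracker(struct inode *inode,
                                        struct buffer_head **ret_bh,
                                        int ex, struct ocfs2_lock_holder *oh)
    {
            int status, arg_flags = 0, has_locked;
            struct ocfs2_lock_res *lockres = &OCFS2_I(inode)->ip_inode_lockres;

            has_locked = ocfs2_is_locked_by_me(lockres);
            if (has_locked)
                    arg_flags = OCFS2_META_LOCK_GETBH; /* only fetch the inode bh */

            status = ocfs2_inode_lock_full(inode, ret_bh, ex, arg_flags);
            if (status < 0) {
                    if (status != -ENOENT)
                            mlog_errno(status);
                    return status;
            }
            if (!has_locked)
                    ocfs2_add_holder(lockres, oh);

            return has_locked; /* 0: locked here, 1: already held by us */
    }

with a matching unlock helper that removes the holder and calls ocfs2_inode_unlock() only when has_locked was 0.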
>   status = ocfs2_set_acl(NULL, inode, bh, type, acl, NULL, NULL);
> - ocfs2_inode_unlock(inode, 1);
> +
> + if (!has_locked) {
> + ocfs2_remove_holder(lockres, &oh);
> + ocfs2_inode_unlock(inode, 1);
> + }
>   brelse(bh);
> +
>   return status;
>  }
>  
> @@ -303,21 +318,35 @@ struct posix_acl *ocfs2_iop_get_acl(struct inode *inode, int type)
>   struct buffer_head *di_bh = NULL;
>   struct posix_acl *acl;
>   int ret;
> + int arg_flags = 0, has_locked;
> + struct ocfs2_lock_holder oh;
> + struct ocfs2_lock_res *lockres;
>  
>   osb = OCFS2_SB(inode->i_sb);
>   if (!(osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL))
>   return NULL;
> - ret = ocfs2_inode_lock(inode, &di_bh, 0);
> +
> + lockres = &OCFS2_I(inode)->ip_inode_lockres;
> + has_locked = ocfs2_is_locked_by_me(lockres);
> + if (has_locked)
> + arg_flags = OCFS2_META_LOCK_GETBH;
> + ret = ocfs2_inode_lock_full(inode, &di_bh, 0, arg_flags);
>   if (ret < 0) {
>   if (ret != -ENOENT)
>   mlog_errno(ret);
>   return ERR_PTR(ret);
>   }
> + if (!has_locked)
> + ocfs2_add_holder(lockres, &oh);
>  
>   acl = ocfs2_get_acl_nolock(inode, type, di_bh);
>  
> - ocfs2_inode_unlock(inode, 0);
> + if (!has_locked) {
> + ocfs2_remove_holder(lockres, &oh);
> + ocfs2_inode_unlock(inode, 0);
> + }
>   brelse(di_bh);
> +
>   ret

Re: [Ocfs2-devel] [PATCH 2/2] ocfs2: fix deadlocks when taking inode lock at vfs entry points

2017-01-15 Thread Junxiao Bi
On 01/16/2017 11:06 AM, Eric Ren wrote:
> Hi Junxiao,
> 
> On 01/16/2017 10:46 AM, Junxiao Bi wrote:
>>>> If had_lock==true, it is a bug? I think we should BUG_ON for it, that
>>>> can help us catch bug at the first time.
>>> Good idea! But I'm not sure if "ocfs2_setattr" is always the first one
>>> who takes the cluster lock.
>>> It's harder for me to name all the possible paths;-/
>> The BUG_ON() can help catch the path where ocfs2_setattr is not the
>> first one.
> Yes, I understand. But the problem is that the calling order of the vfs
> entries is out of our control.
> I don't want to place an assertion where I'm not 100% sure it's
> absolutely right;-)
If it is not the first one, is it another recursive locking bug? In that
case, if you don't like BUG_ON(), you can dump the call trace and print
a warning message, as sketched below.

Thanks,
Junxiao.
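A sketch of the warn-and-dump alternative suggested above, assuming the check sits right after the had_lock lookup in ocfs2_setattr(); the message text and the exact placement are illustrative:

    if (had_lock) {
            /* unexpected recursive cluster lock: report it loudly,
             * but keep running instead of BUG_ON() */
            pr_warn("ocfs2: recursive inode cluster lock in setattr, pid %d\n",
                    task_pid_nr(current));
            dump_stack();
    }

dump_stack() gives the call trace needed to identify the offending path without taking the machine down.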
> 
> Thanks,
> Eric
> 
>>
>> Thanks,
>> Junxiao.
>>
>>>>
>>>>> +if (had_lock)
>>>>> +arg_flags = OCFS2_META_LOCK_GETBH;
>>>>> +status = ocfs2_inode_lock_full(inode, &bh, 1, arg_flags);
>>>>>if (status < 0) {
>>>>>if (status != -ENOENT)
>>>>>mlog_errno(status);
>>>>>goto bail_unlock_rw;
>>>>>}
>>>>> -inode_locked = 1;
>>>>> +if (!had_lock) {
>>>>> +ocfs2_add_holder(lockres, &oh);
>>>>> +inode_locked = 1;
>>>>> +}
>>>>>  if (size_change) {
>>>>>status = inode_newsize_ok(inode, attr->ia_size);
>>>>> @@ -1260,7 +1270,8 @@ int ocfs2_setattr(struct dentry *dentry, struct
>>>>> iattr *attr)
>>>>>bail_commit:
>>>>>ocfs2_commit_trans(osb, handle);
>>>>>bail_unlock:
>>>>> -if (status) {
>>>>> +if (status && inode_locked) {
>>>>> +ocfs2_remove_holder(lockres, &oh);
>>>>>ocfs2_inode_unlock(inode, 1);
>>>>>inode_locked = 0;
>>>>>}
>>>>> @@ -1278,8 +1289,10 @@ int ocfs2_setattr(struct dentry *dentry,
>>>>> struct iattr *attr)
>>>>>if (status < 0)
>>>>>mlog_errno(status);
>>>>>}
>>>>> -if (inode_locked)
>>>>> +if (inode_locked) {
>>>>> +ocfs2_remove_holder(lockres, &oh);
>>>>>ocfs2_inode_unlock(inode, 1);
>>>>> +}
>>>>>  brelse(bh);
>>>>>return status;
>>>>> @@ -1321,20 +1334,31 @@ int ocfs2_getattr(struct vfsmount *mnt,
>>>>>int ocfs2_permission(struct inode *inode, int mask)
>>>>>{
>>>>>int ret;
>>>>> +int has_locked;
>>>>> +struct ocfs2_holder oh;
>>>>> +struct ocfs2_lock_res *lockres;
>>>>>  if (mask & MAY_NOT_BLOCK)
>>>>>return -ECHILD;
>>>>>-ret = ocfs2_inode_lock(inode, NULL, 0);
>>>>> -if (ret) {
>>>>> -if (ret != -ENOENT)
>>>>> -mlog_errno(ret);
>>>>> -goto out;
>>>>> +lockres = &OCFS2_I(inode)->ip_inode_lockres;
>>>>> +has_locked = (ocfs2_is_locked_by_me(lockres) != NULL);
>>>> The same thing as ocfs2_setattr.
>>> OK. I will think over your suggestions!
>>>
>>> Thanks,
>>> Eric
>>>
>>>> Thanks,
>>>> Junxiao.
>>>>> +if (!has_locked) {
>>>>> +ret = ocfs2_inode_lock(inode, NULL, 0);
>>>>> +if (ret) {
>>>>> +if (ret != -ENOENT)
>>>>> +mlog_errno(ret);
>>>>> +goto out;
>>>>> +}
>>>>> +ocfs2_add_holder(lockres, &oh);
>>>>>}
>>>>>  ret = generic_permission(inode, mask);
>>>>>-ocfs2_inode_unlock(inode, 0);
>>>>> +if (!has_locked) {
>>>>> +ocfs2_remove_holder(lockres, &oh);
>>>>> +ocfs2_inode_unlock(inode, 0);
>>>>> +}
>>>>>out:
>>>>>return ret;
>>>>>}
>>>>>
>>
> 




Re: [Ocfs2-devel] [PATCH 2/2] ocfs2: fix deadlocks when taking inode lock at vfs entry points

2017-01-15 Thread Junxiao Bi
On 01/13/2017 02:19 PM, Eric Ren wrote:
> Hi!
> 
> On 01/13/2017 12:22 PM, Junxiao Bi wrote:
>> On 01/05/2017 11:31 PM, Eric Ren wrote:
>>> Commit 743b5f1434f5 ("ocfs2: take inode lock in
>>> ocfs2_iop_set/get_acl()")
>>> results in a deadlock, as the author "Tariq Saeed" realized shortly
>>> after the patch was merged. The discussion happened here
>>> (https://oss.oracle.com/pipermail/ocfs2-devel/2015-September/011085.html).
>>>
>>>
>>> The reason why taking cluster inode lock at vfs entry points opens up
>>> a self deadlock window, is explained in the previous patch of this
>>> series.
>>>
>>> So far, we have seen two different code paths that have this issue.
>>> 1. do_sys_open
>>>   may_open
>>>inode_permission
>>> ocfs2_permission
>>>  ocfs2_inode_lock() <=== take PR
>>>   generic_permission
>>>get_acl
>>> ocfs2_iop_get_acl
>>>  ocfs2_inode_lock() <=== take PR
>>> 2. fchmod|fchmodat
>>>  chmod_common
>>>   notify_change
>>>ocfs2_setattr <=== take EX
>>> posix_acl_chmod
>>>  get_acl
>>>   ocfs2_iop_get_acl <=== take PR
>>>  ocfs2_iop_set_acl <=== take EX
>>>
>>> Fixes them by adding the tracking logic (in the previous patch) for
>>> these funcs above, ocfs2_permission(), ocfs2_iop_[set|get]_acl(),
>>> ocfs2_setattr().
>>>
>>> Signed-off-by: Eric Ren 
>>> ---
>>>   fs/ocfs2/acl.c  | 39 ++-
>>>   fs/ocfs2/file.c | 44 ++--
>>>   2 files changed, 68 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/acl.c b/fs/ocfs2/acl.c
>>> index bed1fcb..c539890 100644
>>> --- a/fs/ocfs2/acl.c
>>> +++ b/fs/ocfs2/acl.c
>>> @@ -284,16 +284,31 @@ int ocfs2_iop_set_acl(struct inode *inode,
>>> struct posix_acl *acl, int type)
>>>   {
>>>   struct buffer_head *bh = NULL;
>>>   int status = 0;
>>> -
>>> -status = ocfs2_inode_lock(inode, &bh, 1);
>>> +int arg_flags = 0, has_locked;
>>> +struct ocfs2_holder oh;
>>> +struct ocfs2_lock_res *lockres;
>>> +
>>> +lockres = &OCFS2_I(inode)->ip_inode_lockres;
>>> +has_locked = (ocfs2_is_locked_by_me(lockres) != NULL);
>>> +if (has_locked)
>>> +arg_flags = OCFS2_META_LOCK_GETBH;
>>> +status = ocfs2_inode_lock_full(inode, &bh, 1, arg_flags);
>>>   if (status < 0) {
>>>   if (status != -ENOENT)
>>>   mlog_errno(status);
>>>   return status;
>>>   }
>>> +if (!has_locked)
>>> +ocfs2_add_holder(lockres, &oh);
>>> +
>>>   status = ocfs2_set_acl(NULL, inode, bh, type, acl, NULL, NULL);
>>> -ocfs2_inode_unlock(inode, 1);
>>> +
>>> +if (!has_locked) {
>>> +ocfs2_remove_holder(lockres, &oh);
>>> +ocfs2_inode_unlock(inode, 1);
>>> +}
>>>   brelse(bh);
>>> +
>>>   return status;
>>>   }
>>>   @@ -303,21 +318,35 @@ struct posix_acl *ocfs2_iop_get_acl(struct
>>> inode *inode, int type)
>>>   struct buffer_head *di_bh = NULL;
>>>   struct posix_acl *acl;
>>>   int ret;
>>> +int arg_flags = 0, has_locked;
>>> +struct ocfs2_holder oh;
>>> +struct ocfs2_lock_res *lockres;
>>> osb = OCFS2_SB(inode->i_sb);
>>>   if (!(osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL))
>>>   return NULL;
>>> -ret = ocfs2_inode_lock(inode, &di_bh, 0);
>>> +
>>> +lockres = &OCFS2_I(inode)->ip_inode_lockres;
>>> +has_locked = (ocfs2_is_locked_by_me(lockres) != NULL);
>>> +if (has_locked)
>>> +arg_flags = OCFS2_META_LOCK_GETBH;
>>> +ret = ocfs2_inode_lock_full(inode, &di_bh, 0, arg_flags);
>>>   if (ret < 0) {
>>>   if (ret != -ENOENT)
>>>   mlog_errno(ret);
>>>   return ERR_PTR(ret);
>>>   }
>>> +if (!has_locked)
>>> +ocfs2_add_holder(lockres, &oh);
>>> acl = ocfs2_get_acl_nolock(inode, type, di_bh);
>>>   -ocfs2_inode_u

Re: [Ocfs2-devel] [PATCH 1/2] ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock

2017-01-15 Thread Junxiao Bi
On 01/13/2017 02:12 PM, Eric Ren wrote:
> Hi Junxiao!
> 
> On 01/13/2017 11:59 AM, Junxiao Bi wrote:
>> On 01/05/2017 11:31 PM, Eric Ren wrote:
>>> We are in the situation that we have to avoid recursive cluster locking,
>>> but there is no way to check if a cluster lock has been taken by a
>>> process already.
>>>
>>> Mostly, we can avoid recursive locking by writing code carefully.
>>> However, we found that it's very hard to handle the routines that
>>> are invoked directly by vfs code. For instance:
>>>
>>> const struct inode_operations ocfs2_file_iops = {
>>>  .permission = ocfs2_permission,
>>>  .get_acl= ocfs2_iop_get_acl,
>>>  .set_acl= ocfs2_iop_set_acl,
>>> };
>>>
>>> Both ocfs2_permission() and ocfs2_iop_get_acl() call
>>> ocfs2_inode_lock(PR):
>>> do_sys_open
>>>   may_open
>>>inode_permission
>>> ocfs2_permission
>>>  ocfs2_inode_lock() <=== first time
>>>   generic_permission
>>>get_acl
>>> ocfs2_iop_get_acl
>>> ocfs2_inode_lock() <=== recursive one
>>>
>>> A deadlock will occur if a remote EX request comes in between two
>>> of ocfs2_inode_lock(). Briefly describe how the deadlock is formed:
>>>
>>> On one hand, OCFS2_LOCK_BLOCKED flag of this lockres is set in
>>> BAST(ocfs2_generic_handle_bast) when downconvert is started
>>> on behalf of the remote EX lock request. Another hand, the recursive
>>> cluster lock (the second one) will be blocked in __ocfs2_cluster_lock()
>>> because of OCFS2_LOCK_BLOCKED. But the downconvert never completes. Why?
>>> because there is no chance for the first cluster lock on this node to be
>>> unlocked - we block ourselves in the code path.
>>>
>>> The idea to fix this issue is mostly taken from gfs2 code.
>>> 1. introduce a new field: struct ocfs2_lock_res.l_holders, to
>>> keep track of the pids of the processes that have taken the cluster lock
>>> of this lock resource;
>>> 2. introduce a new flag for ocfs2_inode_lock_full:
>>> OCFS2_META_LOCK_GETBH;
>>> it means just getting back disk inode bh for us if we've got cluster
>>> lock.
>>> 3. export a helper: ocfs2_is_locked_by_me() is used to check if we
>>> have got the cluster lock in the upper code path.
>>>
>>> The tracking logic should be used by some of the ocfs2 vfs's callbacks,
>>> to solve the recursive locking issue caused by the fact that vfs
>>> routines
>>> can call into each other.
>>>
>>> The performance penalty of processing the holder list should only be
>>> seen
>>> at a few cases where the tracking logic is used, such as get/set acl.
>>>
>>> You may ask what if the first time we got a PR lock, and the second time
>>> we want a EX lock? fortunately, this case never happens in the real
>>> world,
>>> as far as I can see, including permission check,
>>> (get|set)_(acl|attr), and
>>> the gfs2 code also do so.
>>>
>>> Signed-off-by: Eric Ren 
>>> ---
>>>   fs/ocfs2/dlmglue.c | 47
>>> ---
>>>   fs/ocfs2/dlmglue.h | 18 ++
>>>   fs/ocfs2/ocfs2.h   |  1 +
>>>   3 files changed, 63 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>>> index 83d576f..500bda4 100644
>>> --- a/fs/ocfs2/dlmglue.c
>>> +++ b/fs/ocfs2/dlmglue.c
>>> @@ -532,6 +532,7 @@ void ocfs2_lock_res_init_once(struct
>>> ocfs2_lock_res *res)
>>>   init_waitqueue_head(&res->l_event);
>>>   INIT_LIST_HEAD(&res->l_blocked_list);
>>>   INIT_LIST_HEAD(&res->l_mask_waiters);
>>> +INIT_LIST_HEAD(&res->l_holders);
>>>   }
>>> void ocfs2_inode_lock_res_init(struct ocfs2_lock_res *res,
>>> @@ -749,6 +750,45 @@ void ocfs2_lock_res_free(struct ocfs2_lock_res
>>> *res)
>>>   res->l_flags = 0UL;
>>>   }
>>>   +inline void ocfs2_add_holder(struct ocfs2_lock_res *lockres,
>>> +   struct ocfs2_holder *oh)
>>> +{
>>> +INIT_LIST_HEAD(&oh->oh_list);
>>> +oh->oh_owner_pid =  get_pid(task_pid(current));
>> struct pid(oh->oh_owner_pid) looks complicated here, why not use
>> task_struct(current) or pid_t(current-&
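For contrast, the two bookkeeping choices under discussion, sketched side by side (field names follow the patch; only one of the structs would exist). Taking a reference on struct pid pins the task identity even if the numeric pid is recycled, while a bare pid_t is simpler and relies on each holder being added and removed within the same task's lock scope:

    /* as in the patch: reference-counted identity */
    struct ocfs2_holder {
            struct list_head oh_list;
            struct pid *oh_owner_pid;   /* get_pid(task_pid(current)) */
    };

    /* the suggested simplification: a plain pid number */
    struct ocfs2_holder {
            struct list_head oh_list;
            pid_t oh_owner;             /* current->pid */
    };

The lookup then compares either oh->oh_owner_pid == task_pid(current) or oh->oh_owner == current->pid.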

Re: [Ocfs2-devel] [PATCH 2/2] ocfs2: fix deadlocks when taking inode lock at vfs entry points

2017-01-12 Thread Junxiao Bi
On 01/05/2017 11:31 PM, Eric Ren wrote:
> Commit 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
> results in a deadlock, as the author "Tariq Saeed" realized shortly
> after the patch was merged. The discussion happened here
> (https://oss.oracle.com/pipermail/ocfs2-devel/2015-September/011085.html).
> 
> The reason why taking cluster inode lock at vfs entry points opens up
> a self deadlock window, is explained in the previous patch of this
> series.
> 
> So far, we have seen two different code paths that have this issue.
> 1. do_sys_open
>  may_open
>   inode_permission
>ocfs2_permission
> ocfs2_inode_lock() <=== take PR
>  generic_permission
>   get_acl
>ocfs2_iop_get_acl
> ocfs2_inode_lock() <=== take PR
> 2. fchmod|fchmodat
> chmod_common
>  notify_change
>   ocfs2_setattr <=== take EX
>posix_acl_chmod
> get_acl
>  ocfs2_iop_get_acl <=== take PR
> ocfs2_iop_set_acl <=== take EX
> 
> Fixes them by adding the tracking logic (in the previous patch) for
> these funcs above, ocfs2_permission(), ocfs2_iop_[set|get]_acl(),
> ocfs2_setattr().
> 
> Signed-off-by: Eric Ren 
> ---
>  fs/ocfs2/acl.c  | 39 ++-
>  fs/ocfs2/file.c | 44 ++--
>  2 files changed, 68 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/ocfs2/acl.c b/fs/ocfs2/acl.c
> index bed1fcb..c539890 100644
> --- a/fs/ocfs2/acl.c
> +++ b/fs/ocfs2/acl.c
> @@ -284,16 +284,31 @@ int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>  {
>   struct buffer_head *bh = NULL;
>   int status = 0;
> -
> - status = ocfs2_inode_lock(inode, &bh, 1);
> + int arg_flags = 0, has_locked;
> + struct ocfs2_holder oh;
> + struct ocfs2_lock_res *lockres;
> +
> + lockres = &OCFS2_I(inode)->ip_inode_lockres;
> + has_locked = (ocfs2_is_locked_by_me(lockres) != NULL);
> + if (has_locked)
> + arg_flags = OCFS2_META_LOCK_GETBH;
> + status = ocfs2_inode_lock_full(inode, &bh, 1, arg_flags);
>   if (status < 0) {
>   if (status != -ENOENT)
>   mlog_errno(status);
>   return status;
>   }
> + if (!has_locked)
> + ocfs2_add_holder(lockres, &oh);
> +
>   status = ocfs2_set_acl(NULL, inode, bh, type, acl, NULL, NULL);
> - ocfs2_inode_unlock(inode, 1);
> +
> + if (!has_locked) {
> + ocfs2_remove_holder(lockres, &oh);
> + ocfs2_inode_unlock(inode, 1);
> + }
>   brelse(bh);
> +
>   return status;
>  }
>  
> @@ -303,21 +318,35 @@ struct posix_acl *ocfs2_iop_get_acl(struct inode *inode, int type)
>   struct buffer_head *di_bh = NULL;
>   struct posix_acl *acl;
>   int ret;
> + int arg_flags = 0, has_locked;
> + struct ocfs2_holder oh;
> + struct ocfs2_lock_res *lockres;
>  
>   osb = OCFS2_SB(inode->i_sb);
>   if (!(osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL))
>   return NULL;
> - ret = ocfs2_inode_lock(inode, &di_bh, 0);
> +
> + lockres = &OCFS2_I(inode)->ip_inode_lockres;
> + has_locked = (ocfs2_is_locked_by_me(lockres) != NULL);
> + if (has_locked)
> + arg_flags = OCFS2_META_LOCK_GETBH;
> + ret = ocfs2_inode_lock_full(inode, &di_bh, 0, arg_flags);
>   if (ret < 0) {
>   if (ret != -ENOENT)
>   mlog_errno(ret);
>   return ERR_PTR(ret);
>   }
> + if (!has_locked)
> + ocfs2_add_holder(lockres, &oh);
>  
>   acl = ocfs2_get_acl_nolock(inode, type, di_bh);
>  
> - ocfs2_inode_unlock(inode, 0);
> + if (!has_locked) {
> + ocfs2_remove_holder(lockres, &oh);
> + ocfs2_inode_unlock(inode, 0);
> + }
>   brelse(di_bh);
> +
>   return acl;
>  }
>  
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index c488965..62be75d 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -1138,6 +1138,9 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
>   handle_t *handle = NULL;
>   struct dquot *transfer_to[MAXQUOTAS] = { };
>   int qtype;
> + int arg_flags = 0, had_lock;
> + struct ocfs2_holder oh;
> + struct ocfs2_lock_res *lockres;
>  
>   trace_ocfs2_setattr(inode, dentry,
>   (unsigned long long)OCFS2_I(inode)->ip_blkno,
> @@ -1173,13 +1176,20 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
>   }
>   }
>  
> - status = ocfs2_inode_lock(inode, &bh, 1);
> + lockres = &OCFS2_I(inode)->ip_inode_lockres;
> + had_lock = (ocfs2_is_locked_by_me(lockres) != NULL);

If had_lock==true, is it a bug? I think we should BUG_ON() for it; that
can help us catch the bug the first time it happens.


> + if (had_lock)
> + arg_flags = OCFS2_META_LOCK_GETBH;
> + status = ocfs2_inode_lock_full(inode, &bh, 1, arg_flags);
>

Re: [Ocfs2-devel] [PATCH 1/2] ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock

2017-01-12 Thread Junxiao Bi
On 01/05/2017 11:31 PM, Eric Ren wrote:
> We are in the situation that we have to avoid recursive cluster locking,
> but there is no way to check if a cluster lock has been taken by a
> process already.
> 
> Mostly, we can avoid recursive locking by writing code carefully.
> However, we found that it's very hard to handle the routines that
> are invoked directly by vfs code. For instance:
> 
> const struct inode_operations ocfs2_file_iops = {
> .permission = ocfs2_permission,
> .get_acl= ocfs2_iop_get_acl,
> .set_acl= ocfs2_iop_set_acl,
> };
> 
> Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):
> do_sys_open
>  may_open
>   inode_permission
>ocfs2_permission
> ocfs2_inode_lock() <=== first time
>  generic_permission
>   get_acl
>ocfs2_iop_get_acl
>   ocfs2_inode_lock() <=== recursive one
> 
> A deadlock will occur if a remote EX request comes in between two
> of ocfs2_inode_lock(). Briefly describe how the deadlock is formed:
> 
> On one hand, OCFS2_LOCK_BLOCKED flag of this lockres is set in
> BAST(ocfs2_generic_handle_bast) when downconvert is started
> on behalf of the remote EX lock request. Another hand, the recursive
> cluster lock (the second one) will be blocked in __ocfs2_cluster_lock()
> because of OCFS2_LOCK_BLOCKED. But the downconvert never completes. Why?
> because there is no chance for the first cluster lock on this node to be
> unlocked - we block ourselves in the code path.
> 
> The idea to fix this issue is mostly taken from gfs2 code.
> 1. introduce a new field: struct ocfs2_lock_res.l_holders, to
> keep track of the pids of the processes that have taken the cluster lock
> of this lock resource;
> 2. introduce a new flag for ocfs2_inode_lock_full: OCFS2_META_LOCK_GETBH;
> it means just getting back disk inode bh for us if we've got cluster lock.
> 3. export a helper: ocfs2_is_locked_by_me() is used to check if we
> have got the cluster lock in the upper code path.
> 
> The tracking logic should be used by some of the ocfs2 vfs's callbacks,
> to solve the recursive locking issue caused by the fact that vfs routines
> can call into each other.
> 
> The performance penalty of processing the holder list should only be seen
> at a few cases where the tracking logic is used, such as get/set acl.
> 
> You may ask what if the first time we got a PR lock, and the second time
> we want a EX lock? fortunately, this case never happens in the real world,
> as far as I can see, including permission check, (get|set)_(acl|attr), and
> the gfs2 code also do so.
> 
> Signed-off-by: Eric Ren 
> ---
>  fs/ocfs2/dlmglue.c | 47 ---
>  fs/ocfs2/dlmglue.h | 18 ++
>  fs/ocfs2/ocfs2.h   |  1 +
>  3 files changed, 63 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 83d576f..500bda4 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -532,6 +532,7 @@ void ocfs2_lock_res_init_once(struct ocfs2_lock_res *res)
>   init_waitqueue_head(&res->l_event);
>   INIT_LIST_HEAD(&res->l_blocked_list);
>   INIT_LIST_HEAD(&res->l_mask_waiters);
> + INIT_LIST_HEAD(&res->l_holders);
>  }
>  
>  void ocfs2_inode_lock_res_init(struct ocfs2_lock_res *res,
> @@ -749,6 +750,45 @@ void ocfs2_lock_res_free(struct ocfs2_lock_res *res)
>   res->l_flags = 0UL;
>  }
>  
> +inline void ocfs2_add_holder(struct ocfs2_lock_res *lockres,
> +struct ocfs2_holder *oh)
> +{
> + INIT_LIST_HEAD(&oh->oh_list);
> + oh->oh_owner_pid =  get_pid(task_pid(current));
struct pid (oh->oh_owner_pid) looks complicated here; why not use
task_struct (current) or pid_t (current->pid) directly? Also, I didn't see
that the ref count needs to be considered.

> +
> + spin_lock(&lockres->l_lock);
> + list_add_tail(&oh->oh_list, &lockres->l_holders);
> + spin_unlock(&lockres->l_lock);
> +}
> +
> +inline void ocfs2_remove_holder(struct ocfs2_lock_res *lockres,
> +struct ocfs2_holder *oh)
> +{
> + spin_lock(&lockres->l_lock);
> + list_del(&oh->oh_list);
> + spin_unlock(&lockres->l_lock);
> +
> + put_pid(oh->oh_owner_pid);
same as the above

> +}
> +
> +inline struct ocfs2_holder *ocfs2_is_locked_by_me(struct ocfs2_lock_res *lockres)
Agree with Joseph, returning bool looks better (see the sketch at the end of
this message). I didn't see how that helps debugging, since the return value
is not used.


> +{
> + struct ocfs2_holder *oh;
> + struct pid *pid;
> +
> + /* look in the list of holders for one with the current task as owner */
> + spin_lock(&lockres->l_lock);
> + pid = task_pid(current);
> + list_for_each_entry(oh, &lockres->l_holders, oh_list) {
> + if (oh->oh_owner_pid == pid)
> + goto out;
> + }
> + oh = NULL;
> +out:
> + spin_unlock(&lockres->l_lock);
> + return oh;
> +}
> +
>  static inline void ocfs2_inc_holders(struct ocfs2_lock_res 
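Picking up the bool suggestion, a sketch of the shape the check takes in later revisions (v3 keeps an int return used as a boolean; the locking and the list walk are as in the quoted hunk):

    static bool ocfs2_is_locked_by_me(struct ocfs2_lock_res *lockres)
    {
            struct ocfs2_holder *oh;
            struct pid *pid = task_pid(current);
            bool mine = false;

            /* look in the list of holders for one owned by the current task */
            spin_lock(&lockres->l_lock);
            list_for_each_entry(oh, &lockres->l_holders, oh_list) {
                    if (oh->oh_owner_pid == pid) {
                            mine = true;
                            break;
                    }
            }
            spin_unlock(&lockres->l_lock);

            return mine;
    }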

Re: [Ocfs2-devel] ocfs2-test passed on next-20161223

2016-12-31 Thread Junxiao Bi

> On 2016-12-30, at 3:12 PM, Eric Ren wrote:
> 
> Hi Junxiao,
> 
> On 12/30/2016 10:44 AM, Junxiao Bi wrote:
>> Hi Guys,
>> 
>> I just done ocfs2-test single/multiple/discontig test on linux
>> next-20161223, all test passed. Thank you for your effort to make the
>> good quality.
> 
> Thanks for your effort! BTW, how long does the whole testing usually take on
> your side?
Nearly 3 days, about 66 hours.

thanks,
Junxiao.
> 
> Eric
> 
>> 
>> Thanks,
>> Junxiao.
>> 
>> ___
>> Ocfs2-devel mailing list
>> Ocfs2-devel@oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>> 
> 



[Ocfs2-devel] ocfs2-test passed on next-20161223

2016-12-29 Thread Junxiao Bi
Hi Guys,

I have just finished the ocfs2-test single/multiple/discontig tests on linux
next-20161223; all tests passed. Thank you for your effort in keeping the
quality high.

Thanks,
Junxiao.



Re: [Ocfs2-devel] ocfs2: fix sparse file & data ordering issue in direct io

2016-11-21 Thread Junxiao Bi
Hi Dan,

It will not cause a real issue. -EAGAIN can only be returned in the
__ocfs2_page_mkwrite() path, where "locked_page" is NULL, so that
function will return VM_FAULT_NOPAGE before accessing "fsdata".

Thanks,
Junxiao.

On 11/17/2016 06:03 PM, Dan Carpenter wrote:
> On Thu, Nov 17, 2016 at 11:08:08AM +0800, Eric Ren wrote:
>> Hi,
>>
>> On 11/16/2016 06:45 PM, Dan Carpenter wrote:
>>> On Wed, Nov 16, 2016 at 10:33:49AM +0800, Eric Ren wrote:
>>> That silences the warning, of course, but I feel like the code is buggy.
>>> How do we know that we don't hit that exit path?
>> Sorry, I missed your point. Do you mean the below?
>>
>> "1817 goto out_quota; " will free (*wc), but with "ret = 0". Thus, the caller
>> think it's OK to use (*wc), but...
>>
>> Do I understand you correctly?
>>
> 
> It doesn't free it.  It frees "wc" but not "*fsdata".  So it leaves it
> unintialized on that path.  That's the issue, yes.
> 
> It could be that it's impossible to reach that path from here, but it's
> not clear to me.
> 
> regards,
> dan carpenter
> 
> 
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 




[Ocfs2-devel] [PATCH] ocfs2: fix not enough credit panic

2016-11-04 Thread Junxiao Bi
The following panic was caught when running the ocfs2 discontig single test
(block size 512 and cluster size 8192). ocfs2_journal_dirty() returned
-ENOSPC, which means the credits were used up. The total credit should
include 3 times of "num_dx_leaves" from ocfs2_dx_dir_rebalance(),
because 2 times will be consumed in ocfs2_dx_dir_transfer_leaf() and
1 time will be consumed in ocfs2_dx_dir_new_cluster()->
__ocfs2_dx_dir_new_cluster()->ocfs2_dx_dir_format_cluster(). But only
two times are included in ocfs2_dx_dir_rebalance_credits(); fix it.
This can cause a read-only fs (v4.1+) or a panic on mainline Linux,
depending on the mount option.

[34377.331151] [ cut here ]
[34377.332007] kernel BUG at fs/ocfs2/journal.c:775!
[34377.344107] invalid opcode:  [#1] SMP
[34377.346090] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport acpi_cpufreq i2c_piix4 i2c_core pcspkr ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
[34377.346090] CPU: 2 PID: 10601 Comm: dd Not tainted 4.1.12-71.el6uek.bug24939243.x86_64 #2
[34377.346090] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
[34377.346090] task: 8800b6de6200 ti: 8800a7d48000 task.ti: 8800a7d48000
[34377.346090] RIP: 0010:[]  [] ocfs2_journal_dirty+0xa7/0xb0 [ocfs2]
[34377.346090] RSP: 0018:8800a7d4b6d8  EFLAGS: 00010286
[34377.346090] RAX: ffe4 RBX: 814d0a9c RCX: 04f9
[34377.346090] RDX: a008e990 RSI: a008f1ee RDI: 8800622b6460
[34377.346090] RBP: 8800a7d4b6f8 R08: a008f288 R09: 8800622b6460
[34377.346090] R10:  R11: 0282 R12: 02c8421e
[34377.346090] R13: 88006d0cad00 R14: 880092beef60 R15: 0070
[34377.346090] FS:  7f9b83e92700() GS:8800be88() knlGS:
[34377.346090] CS:  0010 DS:  ES:  CR0: 80050033
[34377.346090] CR2: 7fb2c0d1a000 CR3: 08f8 CR4: 000406e0
[34377.346090] Stack:
[34377.346090]  814d0a9c 88005fe61e00 88006e995c00 880009847c00
[34377.346090]  8800a7d4b768 a06c0999 88009d3c2a10 88005fe61e30
[34377.346090]  8800997ce500 8800997ce980 880092beef60 001e
[34377.346090] Call Trace:
[34377.346090]  [] ocfs2_dx_dir_transfer_leaf+0x159/0x1a0 [ocfs2]
[34377.346090]  [] ocfs2_dx_dir_rebalance+0xd9b/0xea0 [ocfs2]
[34377.346090]  [] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2]
[34377.346090]  [] ? ocfs2_refcount_tree_et_ops+0x60/0xfffe4b31 [ocfs2]
[34377.346090]  [] ? ocfs2_journal_access_dl+0x20/0x20 [ocfs2]
[34377.346090]  [] ocfs2_find_dir_space_dx+0xd3/0x300 [ocfs2]
[34377.346090]  [] ocfs2_prepare_dx_dir_for_insert+0x219/0x450 [ocfs2]
[34377.346090]  [] ocfs2_prepare_dir_for_insert+0x1d6/0x580 [ocfs2]
[34377.346090]  [] ? ocfs2_read_inode_block+0x10/0x20 [ocfs2]
[34377.346090]  [] ocfs2_mknod+0x5a2/0x1400 [ocfs2]
[34377.346090]  [] ocfs2_create+0x73/0x180 [ocfs2]
[34377.346090]  [] vfs_create+0xd8/0x100
[34377.346090]  [] ? lookup_real+0x1d/0x60
[34377.346090]  [] lookup_open+0x185/0x1c0
[34377.346090]  [] do_last+0x36d/0x780
[34377.346090]  [] ? security_file_alloc+0x16/0x20
[34377.346090]  [] path_openat+0x92/0x470
[34377.346090]  [] do_filp_open+0x4a/0xa0
[34377.346090]  [] ? find_next_zero_bit+0x10/0x20
[34377.346090]  [] ? __alloc_fd+0xac/0x150
[34377.346090]  [] do_sys_open+0x11a/0x230
[34377.346090]  [] ? syscall_trace_enter_phase1+0x153/0x180
[34377.346090]  [] SyS_open+0x1e/0x20
[34377.346090]  [] system_call_fastpath+0x12/0x71
[34377.346090] Code: 1d 3f 29 09 00 48 85 db 74 1f 48 8b 03 0f 1f 80 00 00 00 00 48 8b 7b 08 48 83 c3 10 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 eb eb 90 <0f> 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54
[34377.346090] RIP  [] ocfs2_journal_dirty+0xa7/0xb0 [ocfs2]
[34377.346090]  RSP 
[34377.615401] ---[ end trace 91ac5312a6ee1288 ]---
[34377.618919] Kernel panic - not syncing: Fatal exception
[34377.619910] Kernel Offset: disabled

Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/dir.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index ccd4dcfc3645..39f02b75aaf8 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -3712,7 +3712,7 @@ static void ocfs2_dx_dir_transfer_leaf(struct inode *dir, u32 split_hash,
 static int ocfs2_dx_dir_rebalance_credits(s
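The diff is cut off here in the archive. Going by the commit message (three passes over num_dx_leaves, and a one-line change in dir.c), the fix plausibly bumps the multiplier in this helper from 2 to 3, along these lines; a hedged reconstruction, not the verbatim patch:

    static int ocfs2_dx_dir_rebalance_credits(struct ocfs2_super *osb,
                                              struct ocfs2_dx_root_block *dx_root)
    {
            /* was 2: the rebalance dirties a cluster's worth of dx leaves
             * three times (transfer twice, format once), not twice */
            int credits = ocfs2_clusters_to_blocks(osb->sb, 3);

            credits += ocfs2_calc_extend_credits(osb->sb, &dx_root->dr_list);
            credits += ocfs2_quota_trans_credits(osb->sb);
            return credits;
    }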

Re: [Ocfs2-devel] [RFC] Should we revert commit "ocfs2: take inode lock in ocfs2_iop_set/get_acl()"? or other ideas?

2016-10-18 Thread Junxiao Bi
Hi Eric,

> On 2016-10-19, at 1:19 PM, Eric Ren wrote:
> 
> Hi all!
> 
> Commit 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
> results in another deadlock as we have discussed in the recent thread:
>https://oss.oracle.com/pipermail/ocfs2-devel/2016-October/012454.html
> 
> Before this one, a similar deadlock has been fixed by Junxiao:
>commit c25a1e0671fb ("ocfs2: fix posix_acl_create deadlock")
>commit 5ee0fbd50fdf ("ocfs2: revert using ocfs2_acl_chmod to avoid inode 
> cluster lock hang")
> 
> We are in the situation that we have to avoid recursive cluster locking, but
> there is no way to check if a cluster lock has been taken by a process
> already.
> 
> Mostly, we can avoid recursive locking by writing code carefully. However, as
> the deadlock issues have proved out, it's very hard to handle the routines
> that are called directly by vfs. For instance:
> 
>const struct inode_operations ocfs2_file_iops = {
>.permission = ocfs2_permission,
>.get_acl= ocfs2_iop_get_acl,
>.set_acl= ocfs2_iop_set_acl,
>};
> 
> 
> ocfs2_permission() and ocfs2_iop_get/set_acl() both call ocfs2_inode_lock().
> The problem is that the call chain of ocfs2_permission() includes *_acl().
> 
> Possibly, there are three solutions I can think of.  The first one is to
> implement the inode permission routine for ocfs2 itself, replacing the
> existing generic_permission(); this will bring lots of changes and
> involve too many trivial vfs functions into ocfs2 code. Frown on this.
> 
> The second one is, what I am trying now, to keep track of the processes who
> lock/unlock a cluster lock by the following draft patches. But I quickly
> found out that a cluster lock which has been taken by processA can be
> unlocked by processB. For example, system files like journal: are locked
> during mount and unlocked during umount.
I once implemented generic recursive locking support; please check the
patch at https://oss.oracle.com/pipermail/ocfs2-devel/2015-December/011408.html,
where the issue of locking and unlocking in different processes was considered.
But it was rejected by Mark, as recursive locking is not allowed in ocfs2/the
kernel.
> 
> The third one is to revert that problematic commit! It looks like get/set_acl()
> are always called by other vfs callbacks like ocfs2_permission(). I think
> we can do this if it's true, right? Anyway, I'll try to work out if it's 
> true;-)
Not sure whether get/set_acl() will ever be called directly by vfs. Even if
not now, we can't be sure about the future. So reverting it may be a little
risky. But if the refactor is too complicated, then this may be the only
thing we can do.

Thanks,
Junxiao.
> 
> Hope for your input to solve this problem;-)
> 
> Thanks,
> Eric
> 


Re: [Ocfs2-devel] [Question] deadlock on chmod when running discontigous block group multiple node testing

2016-10-12 Thread Junxiao Bi
On 10/12/2016 06:54 PM, Eric Ren wrote:
> Hi,
> 
> On 10/12/2016 05:45 PM, Junxiao Bi wrote:
>> On 10/12/2016 05:34 PM, Eric Ren wrote:
>>> Hi Junxiao,
>>>
>>> On 10/12/2016 02:47 PM, Junxiao Bi wrote:
>>>> On 10/12/2016 10:36 AM, Eric Ren wrote:
>>>>> Hi,
>>>>>
>>>>> When backporting those patches, I find that they are already in our
>>>>> product kernel, maybe
>>>>> via "stable kernel" policy, although our product kernel is 4.4
>>>>> while the
>>>>> patches were merged
>>>>> into 4.6.
>>>>>
>>>>> Seems it's another deadlock that happens when doing `chmod -R 777
>>>>> /mnt/ocfs2`
>>>>> among multiple nodes at the same time.
>>>> Yes, but I just finished running the full ocfs2 test on linux next-20161006
>>>> and didn't find any issues.
>>> Thanks a lot, really!
>>>
>>> 1. What's the size of your ocfs2 disk? My disk is 200G.
>> 212G
>>
>>> 2. Did you run discontig block group test with multiple nodes? with this
>>> option:
>> Yes, but i don't know what that option is.
>>
>>>  " -m ocfs2cts1,ocfs2cts2"
> 
> ocfs2ctsX is the hostname of a cluster node. The discontig bg testcase will
> run in local mode without this option.
It did; 3 machines were used. I first thought ocfs2cts1,ocfs2cts2 was the
option.

Thanks,
Junxiao.
> 
> Thanks
> Eric
> 
>>>
>>> 3. Then, I am using fs/dlm. That's a different point.
>> Yes, that deserves a look since your issue is a cluster locking hang.
>>
>> Thanks,
>> Junxiao.
>>> Thanks,
>>> Eric
>>>
>>>> Thanks,
>>>> Junxiao.
>>>>
>>>>> Thanks,
>>>>> Eric
>>>>> On 10/12/2016 09:23 AM, Eric Ren wrote:
>>>>>> Hi Junxiao,
>>>>>>
>>>>>>> Hi Eric,
>>>>>>>
>>>>>>> On 10/11/2016 10:42 AM, Eric Ren wrote:
>>>>>>>> Hi Junxiao,
>>>>>>>>
>>>>>>>> As the subject, the testing hung there on a kernel without your
>>>>>>>> patches:
>>>>>>>>
>>>>>>>> "ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock
>>>>>>>> hang"
>>>>>>>> and
>>>>>>>> "ocfs2: fix posix_acl_create deadlock"
>>>>>>>>
>>>>>>>> The stack trace is:
>>>>>>>> ```
>>>>>>>> ocfs2cts1:~ # pstree -pl 24133
>>>>>>>> discontig_runne(24133)───activate_discon(21156)───mpirun(15146)─┬─fillup_contig_b(15149)───sudo(15231)───chmod(15232)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ocfs2cts1:~ # pgrep -a chmod
>>>>>>>> 15232 /bin/chmod -R 777 /mnt/ocfs2
>>>>>>>>
>>>>>>>> ocfs2cts1:~ # cat /proc/15232/stack
>>>>>>>> [] __ocfs2_cluster_lock.isra.39+0x1bf/0x620
>>>>>>>> [ocfs2]
>>>>>>>> [] ocfs2_inode_lock_full_nested+0x12d/0x840
>>>>>>>> [ocfs2]
>>>>>>>> [] ocfs2_inode_lock_atime+0xcb/0x170 [ocfs2]
>>>>>>>> [] ocfs2_readdir+0x41/0x1b0 [ocfs2]
>>>>>>>> [] iterate_dir+0x9c/0x110
>>>>>>>> [] SyS_getdents+0x83/0xf0
>>>>>>>> [] entry_SYSCALL_64_fastpath+0x12/0x6d
>>>>>>>> [] 0x
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Do you think this issue can be fixed by your patches?
>>>>>>> Looks not. Those two patches are to fix recursive locking deadlock.
>>>>>>> But
>>>>>>> from above call trace, there is no recursive lock.
>>>>>> Sorry, the call trace on another node was missing.  Here it is:
>>>>>>
>>>>>> ocfs2cts2:~ # pstree -lp
>>>>>> sshd(4292)─┬─sshd(4745)───sshd(4753)───bash(4754)───orted(4781)───fillup_contig_b(4782)───sudo(4864)───chmod(4865)
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ocfs2cts2:~ # cat /proc/4865/stack
>>>>>> [] __ocfs2_cluster_lock.isra.39+0x1bf/0x620 [ocfs2]
>>>>>> [] ocfs2_inode_lock_full_nested+0x12d/0x840 [ocfs2]
>>>>>> [] ocfs2_iop_get_acl+0x40/0xf0 [ocfs2]
>>>>>> [] generic_permission+0x166/0x1c0
>>>>>> [] ocfs2_permission+0xaa/0xd0 [ocfs2]
>>>>>> [] __inode_permission+0x56/0xb0
>>>>>> [] link_path_walk+0x29a/0x560
>>>>>> [] path_lookupat+0x7f/0x110
>>>>>> [] filename_lookup+0x9c/0x150
>>>>>> [] SyS_fchmodat+0x33/0x90
>>>>>> [] entry_SYSCALL_64_fastpath+0x12/0x6d
>>>>>> [] 0x
>>>>>>
>>>>>> Thanks,
>>>>>> Eric
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> Junxiao.
>>>>>>>> I will try your patches later, but I am little worried the
>>>>>>>> possibility
>>>>>>>> of reproduction may not be 100%.
>>>>>>>> So ask you to confirm;-)
>>>>>>>>
>>>>>>>> Eric
>>>>>> ___
>>>>>> Ocfs2-devel mailing list
>>>>>> Ocfs2-devel@oss.oracle.com
>>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>
> 



Re: [Ocfs2-devel] [Question] deadlock on chmod when running discontigous block group multiple node testing

2016-10-12 Thread Junxiao Bi
On 10/12/2016 05:34 PM, Eric Ren wrote:
> Hi Junxiao,
> 
> On 10/12/2016 02:47 PM, Junxiao Bi wrote:
>> On 10/12/2016 10:36 AM, Eric Ren wrote:
>>> Hi,
>>>
>>> When backporting those patches, I find that they are already in our
>>> product kernel, maybe
>>> via "stable kernel" policy, although our product kernel is 4.4 while the
>>> patches were merged
>>> into 4.6.
>>>
>>> Seems it's another deadlock that happens when doing `chmod -R 777
>>> /mnt/ocfs2`
>>> among multiple nodes at the same time.
>> Yes, but I just finished running the full ocfs2 test on linux next-20161006
>> and didn't find any issues.
> 
> Thanks a lot, really!
> 
> 1. What's the size of your ocfs2 disk? My disk is 200G.
212G

> 
> 2. Did you run discontig block group test with multiple nodes? with this
> option:
Yes, but i don't know what that option is.

> 
> " -m ocfs2cts1,ocfs2cts2"
> 
> 3. Then, I am using fs/dlm. That's a different point.
Yes, that deserves a look since your issue is a cluster locking hang.

Thanks,
Junxiao.
> 
> Thanks,
> Eric
> 
>>
>> Thanks,
>> Junxiao.
>>
>>> Thanks,
>>> Eric
>>> On 10/12/2016 09:23 AM, Eric Ren wrote:
>>>> Hi Junxiao,
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> On 10/11/2016 10:42 AM, Eric Ren wrote:
>>>>>> Hi Junxiao,
>>>>>>
>>>>>> As the subject, the testing hung there on a kernel without your
>>>>>> patches:
>>>>>>
>>>>>> "ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock
>>>>>> hang"
>>>>>> and
>>>>>> "ocfs2: fix posix_acl_create deadlock"
>>>>>>
>>>>>> The stack trace is:
>>>>>> ```
>>>>>> ocfs2cts1:~ # pstree -pl 24133
>>>>>> discontig_runne(24133)───activate_discon(21156)───mpirun(15146)─┬─fillup_contig_b(15149)───sudo(15231)───chmod(15232)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ocfs2cts1:~ # pgrep -a chmod
>>>>>> 15232 /bin/chmod -R 777 /mnt/ocfs2
>>>>>>
>>>>>> ocfs2cts1:~ # cat /proc/15232/stack
>>>>>> [] __ocfs2_cluster_lock.isra.39+0x1bf/0x620 [ocfs2]
>>>>>> [] ocfs2_inode_lock_full_nested+0x12d/0x840 [ocfs2]
>>>>>> [] ocfs2_inode_lock_atime+0xcb/0x170 [ocfs2]
>>>>>> [] ocfs2_readdir+0x41/0x1b0 [ocfs2]
>>>>>> [] iterate_dir+0x9c/0x110
>>>>>> [] SyS_getdents+0x83/0xf0
>>>>>> [] entry_SYSCALL_64_fastpath+0x12/0x6d
>>>>>> [] 0x
>>>>>> ```
>>>>>>
>>>>>> Do you think this issue can be fixed by your patches?
>>>>> Looks not. Those two patches are to fix recursive locking deadlock.
>>>>> But
>>>>> from above call trace, there is no recursive lock.
>>>> Sorry, the call trace on another node was missing.  Here it is:
>>>>
>>>> ocfs2cts2:~ # pstree -lp
>>>> sshd(4292)─┬─sshd(4745)───sshd(4753)───bash(4754)───orted(4781)───fillup_contig_b(4782)───sudo(4864)───chmod(4865)
>>>>
>>>>
>>>>
>>>> ocfs2cts2:~ # cat /proc/4865/stack
>>>> [] __ocfs2_cluster_lock.isra.39+0x1bf/0x620 [ocfs2]
>>>> [] ocfs2_inode_lock_full_nested+0x12d/0x840 [ocfs2]
>>>> [] ocfs2_iop_get_acl+0x40/0xf0 [ocfs2]
>>>> [] generic_permission+0x166/0x1c0
>>>> [] ocfs2_permission+0xaa/0xd0 [ocfs2]
>>>> [] __inode_permission+0x56/0xb0
>>>> [] link_path_walk+0x29a/0x560
>>>> [] path_lookupat+0x7f/0x110
>>>> [] filename_lookup+0x9c/0x150
>>>> [] SyS_fchmodat+0x33/0x90
>>>> [] entry_SYSCALL_64_fastpath+0x12/0x6d
>>>> [] 0x
>>>>
>>>> Thanks,
>>>> Eric
>>>>
>>>>
>>>>> Thanks,
>>>>> Junxiao.
>>>>>> I will try your patches later, but I am little worried the
>>>>>> possibility
>>>>>> of reproduction may not be 100%.
>>>>>> So ask you to confirm;-)
>>>>>>
>>>>>> Eric
>>>> ___
>>>> Ocfs2-devel mailing list
>>>> Ocfs2-devel@oss.oracle.com
>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>>>
>>
> 



[Ocfs2-devel] ocfs2-test passed on linux-next/next-20161006

2016-10-11 Thread Junxiao Bi
Hi all,

I just finished a full ocfs2 test (single/multiple/discontig) on
linux-next/next-20161006. All test cases passed. That's a good sign of
quality. Thank you for your effort.

Thanks,
Junxiao.



Re: [Ocfs2-devel] [Question] deadlock on chmod when running discontigous block group multiple node testing

2016-10-11 Thread Junxiao Bi
On 10/12/2016 10:36 AM, Eric Ren wrote:
> Hi,
> 
> When backporting those patches, I find that they are already in our
> product kernel, maybe
> via "stable kernel" policy, although our product kernel is 4.4 while the
> patches were merged
> into 4.6.
> 
> Seems it's another deadlock that happens when doing `chmod -R 777
> /mnt/ocfs2`
> among multiple nodes at the same time.
Yes, but I just finished running the full ocfs2 test on linux next-20161006
and didn't find any issues.

Thanks,
Junxiao.

> 
> Thanks,
> Eric
> On 10/12/2016 09:23 AM, Eric Ren wrote:
>> Hi Junxiao,
>>
>>> Hi Eric,
>>>
>>> On 10/11/2016 10:42 AM, Eric Ren wrote:
 Hi Junxiao,

 As the subject, the testing hung there on a kernel without your
 patches:

 "ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang"
 and
 "ocfs2: fix posix_acl_create deadlock"

 The stack trace is:
 ```
 ocfs2cts1:~ # pstree -pl 24133
 discontig_runne(24133)───activate_discon(21156)───mpirun(15146)─┬─fillup_contig_b(15149)───sudo(15231)───chmod(15232)


 ocfs2cts1:~ # pgrep -a chmod
 15232 /bin/chmod -R 777 /mnt/ocfs2

 ocfs2cts1:~ # cat /proc/15232/stack
 [] __ocfs2_cluster_lock.isra.39+0x1bf/0x620 [ocfs2]
 [] ocfs2_inode_lock_full_nested+0x12d/0x840 [ocfs2]
 [] ocfs2_inode_lock_atime+0xcb/0x170 [ocfs2]
 [] ocfs2_readdir+0x41/0x1b0 [ocfs2]
 [] iterate_dir+0x9c/0x110
 [] SyS_getdents+0x83/0xf0
 [] entry_SYSCALL_64_fastpath+0x12/0x6d
 [] 0x
 ```

 Do you think this issue can be fixed by your patches?
>>> Looks not. Those two patches are to fix recursive locking deadlock. But
>>> from above call trace, there is no recursive lock.
>> Sorry, the call trace on another node was missing.  Here it is:
>>
>> ocfs2cts2:~ # pstree -lp
>> sshd(4292)─┬─sshd(4745)───sshd(4753)───bash(4754)───orted(4781)───fillup_contig_b(4782)───sudo(4864)───chmod(4865)
>>
>>
>> ocfs2cts2:~ # cat /proc/4865/stack
>> [] __ocfs2_cluster_lock.isra.39+0x1bf/0x620 [ocfs2]
>> [] ocfs2_inode_lock_full_nested+0x12d/0x840 [ocfs2]
>> [] ocfs2_iop_get_acl+0x40/0xf0 [ocfs2]
>> [] generic_permission+0x166/0x1c0
>> [] ocfs2_permission+0xaa/0xd0 [ocfs2]
>> [] __inode_permission+0x56/0xb0
>> [] link_path_walk+0x29a/0x560
>> [] path_lookupat+0x7f/0x110
>> [] filename_lookup+0x9c/0x150
>> [] SyS_fchmodat+0x33/0x90
>> [] entry_SYSCALL_64_fastpath+0x12/0x6d
>> [] 0x
>>
>> Thanks,
>> Eric
>>
>>
>>> Thanks,
>>> Junxiao.
 I will try your patches later, but I am a little worried that the possibility
 of reproduction may not be 100%.
 So I am asking you to confirm ;-)

 Eric
>>
>> ___
>> Ocfs2-devel mailing list
>> Ocfs2-devel@oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 
> 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] [Question] deadlock on chmod when running discontigous block group multiple node testing

2016-10-10 Thread Junxiao Bi
Hi Eric,

On 10/11/2016 10:42 AM, Eric Ren wrote:
> Hi Junxiao,
> 
> As the subject says, the testing hung there on a kernel without your patches:
> 
> "ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang"
> and
> "ocfs2: fix posix_acl_create deadlock"
> 
> The stack trace is:
> ```
> ocfs2cts1:~ # pstree -pl 24133
> discontig_runne(24133)───activate_discon(21156)───mpirun(15146)─┬─fillup_contig_b(15149)───sudo(15231)───chmod(15232)
> 
> ocfs2cts1:~ # pgrep -a chmod
> 15232 /bin/chmod -R 777 /mnt/ocfs2
> 
> ocfs2cts1:~ # cat /proc/15232/stack
> [] __ocfs2_cluster_lock.isra.39+0x1bf/0x620 [ocfs2]
> [] ocfs2_inode_lock_full_nested+0x12d/0x840 [ocfs2]
> [] ocfs2_inode_lock_atime+0xcb/0x170 [ocfs2]
> [] ocfs2_readdir+0x41/0x1b0 [ocfs2]
> [] iterate_dir+0x9c/0x110
> [] SyS_getdents+0x83/0xf0
> [] entry_SYSCALL_64_fastpath+0x12/0x6d
> [] 0x
> ```
> 
> Do you think this issue can be fixed by your patches?
Looks not. Those two patches fix a recursive locking deadlock, but
from the above call trace, there is no recursive lock.

Thanks,
Junxiao.
> 
> I will try your patches later, but I am a little worried that the possibility
> of reproduction may not be 100%.
> So I am asking you to confirm ;-)
> 
> Eric


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel

Re: [Ocfs2-devel] What are the purposes of GLOBAL_INODE_ALLOC_SYSTEM_INODE and BAD_BLOCK_SYSTEM_INODE system file

2016-10-09 Thread Junxiao Bi
On 10/09/2016 04:47 PM, Gang He wrote:
> Hello Guys,
> 
> If you use debugfs.ocfs2 to list system files for an ocfs2 file system, you
> can find these two system files.
> sles12sp1-node1:/ # debugfs.ocfs2 /dev/sdb1
> debugfs.ocfs2 1.8.2
> debugfs: ls //
>  6   16   12  .
>  6   16   22  ..
>  7   24   10   1  bad_blocks <<== BAD_BLOCK_SYSTEM_INODE
>  8   32   18   1  global_inode_alloc <<== GLOBAL_INODE_ALLOC_SYSTEM_INODE
>
> 
> But, what are the purposes of the GLOBAL_INODE_ALLOC_SYSTEM_INODE and
> BAD_BLOCK_SYSTEM_INODE system files?
> The BAD_BLOCK_SYSTEM_INODE system file looks to be used to store bad
> blocks for a file system partition, but there is no code handling this
> system file.
> For the GLOBAL_INODE_ALLOC_SYSTEM_INODE system file, there is also no code
> for it; what is the purpose of this file?
These two are not used. Maybe they are left for future extension.

Thanks,
Junxiao.
> 
> 
> Thanks
> Gang
> 
> 
> 
> 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH 1/2] ocfs2: fix trans extend while flush truncate log

2016-09-12 Thread Junxiao Bi
On 09/13/2016 10:04 AM, Joseph Qi wrote:
> Hi Junxiao,
> 
> On 2016/9/12 18:03, Junxiao Bi wrote:
>> Every time, ocfs2_extend_trans() included a credit for the truncate log
>> inode, but as that inode had already been added to the running jbd2
>> transaction the first time, it will not consume that credit until
>> jbd2_journal_restart(). Since the total credits to extend always included
>> the un-consumed ones, more and more credits go un-consumed; at last
>> jbd2_journal_restart() will fail because the credit number goes over half
>> of the max transaction credits.
>>
>> The following error was caught when unlinking a large file with many extents.
>>
>> [233096.013936] [ cut here ]
>> [233096.018586] WARNING: CPU: 0 PID: 13626 at fs/jbd2/transaction.c:269 
>> start_this_handle+0x4c3/0x510 [jbd2]()
>> [233096.028335] Modules linked in: ocfs2 nfsd lockd grace nfs_acl 
>> auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm 
>> ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT 
>> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack 
>> ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i 
>> cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad 
>> ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 
>> ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect 
>> syscopyarea parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 
>> jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror 
>> dm_region_hash dm_log dm_mod
>> [233096.081751] CPU: 0 PID: 13626 Comm: unlink Tainted: GW   
>> 4.1.12-37.6.3.el6uek.x86_64 #2
>> [233096.088556] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
>> [233096.093125]  010d 8818b768 816bc5bc 
>> 010d
>> [233096.099082]   8818b7a8 81081475 
>> 8818b788
>> [233096.105038]  88007a99a000 88007b573390 00fb 
>> 0050
>> [233096.110540] Call Trace:
>> [233096.111893]  [] dump_stack+0x48/0x5c
>> [233096.114637]  [] warn_slowpath_common+0x95/0xe0
>> [233096.117797]  [] warn_slowpath_null+0x1a/0x20
>> [233096.120984]  [] start_this_handle+0x4c3/0x510 [jbd2]
>> [233096.124505]  [] ? __jbd2_log_start_commit+0xe5/0xf0 
>> [jbd2]
>> [233096.128115]  [] ? __wake_up+0x53/0x70
>> [233096.130924]  [] jbd2__journal_restart+0x161/0x1b0 
>> [jbd2]
>> [233096.134523]  [] jbd2_journal_restart+0x13/0x20 [jbd2]
>> [233096.137986]  [] ocfs2_extend_trans+0x74/0x220 [ocfs2]
>> [233096.141407]  [] ? ocfs2_journal_dirty+0x3a/0x90 [ocfs2]
>> [233096.144921]  [] 
>> ocfs2_replay_truncate_records+0x93/0x360 [ocfs2]
>> [233096.148819]  [] __ocfs2_flush_truncate_log+0x13e/0x3a0 
>> [ocfs2]
>> [233096.152644]  [] ? 
>> ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x1f0 [ocfs2]
>> [233096.157310]  [] ocfs2_remove_btree_range+0x458/0x7f0 
>> [ocfs2]
>> [233096.161099]  [] ? __ocfs2_find_path+0x187/0x2d0 [ocfs2]
>> [233096.164612]  [] ocfs2_commit_truncate+0x1b3/0x6f0 
>> [ocfs2]
>> [233096.168204]  [] ? 
>> ocfs2_xattr_tree_et_ops+0x60/0xfffe8c20 [ocfs2]
>> [233096.172539]  [] ? ocfs2_journal_access_eb+0x20/0x20 
>> [ocfs2]
>> [233096.176285]  [] ? __sb_end_write+0x33/0x70
>> [233096.179226]  [] ocfs2_truncate_for_delete+0xbd/0x380 
>> [ocfs2]
>> [233096.183009]  [] ? ocfs2_query_inode_wipe+0xf4/0x320 
>> [ocfs2]
>> [233096.186738]  [] ocfs2_wipe_inode+0x136/0x6a0 [ocfs2]
>> [233096.190165]  [] ? ocfs2_query_inode_wipe+0xf4/0x320 
>> [ocfs2]
>> [233096.193846]  [] ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2]
>> [233096.197274]  [] ? __inode_wait_for_writeback+0x69/0xc0
>> [233096.200736]  [] ? 
>> __PRETTY_FUNCTION__.112282+0x20/0xb520 [ocfs2]
>> [233096.205146]  [] ocfs2_evict_inode+0x28/0x60 [ocfs2]
>> [233096.208462]  [] evict+0xab/0x1a0
>> [233096.211020]  [] ? 
>> __PRETTY_FUNCTION__.112282+0x20/0xb520 [ocfs2]
>> [233096.215396]  [] iput_final+0xf6/0x190
>> [233096.218169]  [] iput+0xc8/0xe0
>> [233096.220586]  [] do_unlinkat+0x1b7/0x310
>> [233096.223487]  [] ? __do_page_fault+0x18b/0x480
>> [233096.226655]  [] ? __audit_syscall_entry+0xac/0x110
>> [233096.230009]  [] ? do_audit_syscall_entry+0x6c/0x70
>> [233096.233346]  [] ? 
>> syscall_trace_enter_phase1+0x153/0x180
>> [233096.237103]  [] SyS_unlink+0x16/0x20
>> [233096.239800]  [] system_call_fastpath+0x12/0x71
>> [

[Ocfs2-devel] [PATCH 2/2] ocfs2: fix trans extend while free cached blocks

2016-09-12 Thread Junxiao Bi
The root cause of this issue is the same as the one fixed by the last patch,
but this time the credits for the allocator inode and group descriptor may
not be consumed before the transaction is extended.

The following error was caught.

[  685.240276] WARNING: CPU: 0 PID: 2037 at fs/jbd2/transaction.c:269 
start_this_handle+0x4c3/0x510 [jbd2]()
[  685.240294] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss 
sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager 
ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 
nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi 
iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser 
rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp 
libiscsi scsi_transport_iscsi ppdev xen_kbdfront fb_sys_fops sysimgblt 
sysfillrect syscopyarea xen_netfront parport_pc parport pcspkr i2c_piix4 
i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi 
ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
[  685.240296] CPU: 0 PID: 2037 Comm: rm Tainted: GW   
4.1.12-37.6.3.el6uek.bug24573128v2.x86_64 #2
[  685.240296] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
[  685.240298]  010d 88007ac3f808 816bc5bc 
010d
[  685.240300]   88007ac3f848 81081475 
88007ac3f828
[  685.240301]  880037bbf000 880037688210 0095 
0050
[  685.240301] Call Trace:
[  685.240305]  [] dump_stack+0x48/0x5c
[  685.240308]  [] warn_slowpath_common+0x95/0xe0
[  685.240310]  [] warn_slowpath_null+0x1a/0x20
[  685.240313]  [] start_this_handle+0x4c3/0x510 [jbd2]
[  685.240317]  [] ? __jbd2_log_start_commit+0xe5/0xf0 [jbd2]
[  685.240319]  [] ? __wake_up+0x53/0x70
[  685.240322]  [] jbd2__journal_restart+0x161/0x1b0 [jbd2]
[  685.240325]  [] jbd2_journal_restart+0x13/0x20 [jbd2]
[  685.240340]  [] ocfs2_extend_trans+0x74/0x220 [ocfs2]
[  685.240347]  [] ocfs2_free_cached_blocks+0x16b/0x4e0 
[ocfs2]
[  685.240349]  [] ? internal_add_timer+0x91/0xc0
[  685.240356]  [] ocfs2_run_deallocs+0x70/0x270 [ocfs2]
[  685.240363]  [] ocfs2_commit_truncate+0x474/0x6f0 [ocfs2]
[  685.240374]  [] ? 
ocfs2_xattr_tree_et_ops+0x60/0xfffe8c00 [ocfs2]
[  685.240384]  [] ? ocfs2_journal_access_eb+0x20/0x20 [ocfs2]
[  685.240385]  [] ? __sb_end_write+0x33/0x70
[  685.240394]  [] ocfs2_truncate_for_delete+0xbd/0x380 
[ocfs2]
[  685.240402]  [] ? ocfs2_query_inode_wipe+0xf4/0x320 [ocfs2]
[  685.240409]  [] ocfs2_wipe_inode+0x136/0x6a0 [ocfs2]
[  685.240415]  [] ? ocfs2_query_inode_wipe+0xf4/0x320 [ocfs2]
[  685.240422]  [] ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2]
[  685.240424]  [] ? __inode_wait_for_writeback+0x69/0xc0
[  685.240437]  [] ? 
__PRETTY_FUNCTION__.112282+0x20/0xb500 [ocfs2]
[  685.240444]  [] ocfs2_evict_inode+0x28/0x60 [ocfs2]
[  685.240445]  [] evict+0xab/0x1a0
[  685.240456]  [] ? 
__PRETTY_FUNCTION__.112282+0x20/0xb500 [ocfs2]
[  685.240457]  [] iput_final+0xf6/0x190
[  685.240458]  [] iput+0xc8/0xe0
[  685.240460]  [] do_unlinkat+0x1b7/0x310
[  685.240462]  [] ? __audit_syscall_entry+0xac/0x110
[  685.240464]  [] ? do_audit_syscall_entry+0x6c/0x70
[  685.240465]  [] ? syscall_trace_enter_phase1+0x153/0x180
[  685.240467]  [] SyS_unlinkat+0x22/0x40
[  685.240468]  [] system_call_fastpath+0x12/0x71
[  685.240469] ---[ end trace a62437cb060baa71 ]---
[  685.240470] JBD2: rm wants too many credits (149 > 128)

Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/alloc.c |   27 +--
 1 file changed, 9 insertions(+), 18 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 51128789a661..f165f867f332 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -6404,43 +6404,34 @@ static int ocfs2_free_cached_blocks(struct ocfs2_super 
*osb,
goto out_mutex;
}
 
-   handle = ocfs2_start_trans(osb, OCFS2_SUBALLOC_FREE);
-   if (IS_ERR(handle)) {
-   ret = PTR_ERR(handle);
-   mlog_errno(ret);
-   goto out_unlock;
-   }
-
while (head) {
if (head->free_bg)
bg_blkno = head->free_bg;
else
bg_blkno = ocfs2_which_suballoc_group(head->free_blk,
  head->free_bit);
+   handle = ocfs2_start_trans(osb, OCFS2_SUBALLOC_FREE);
+   if (IS_ERR(handle)) {
+   ret = PTR_ERR(handle);
+   mlog_errno(ret);
+   goto out_unlock;
+   }
+
trace_ocfs2_free_cached_blocks(
 (unsigned long long)head->free_blk, head->free_bit);
 
ret = ocfs2_free_suballoc_bits(handle, inode, di_bh,
   head->free_bit, bg_blkno, 1);
-   if (ret) {

[Ocfs2-devel] [PATCH 1/2] ocfs2: fix trans extend while flush truncate log

2016-09-12 Thread Junxiao Bi
Every time, ocfs2_extend_trans() included a credit for the truncate log
inode, but as that inode had already been added to the running jbd2
transaction the first time, it will not consume that credit until
jbd2_journal_restart(). Since the total credits to extend always included
the un-consumed ones, more and more credits go un-consumed; at last
jbd2_journal_restart() will fail because the credit number goes over half
of the max transaction credits.
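
For illustration, here is a minimal userspace model of that credit leak
(an illustrative sketch only; the constants are made up and this is not
jbd2 code). Each pass requests the same credit set again while one credit
is never consumed, until the restart request exceeds half the maximum and
is rejected:

#include <stdio.h>

#define MAX_CREDITS    128   /* model of the journal's per-handle maximum */
#define CREDITS_PER_OP   5   /* credits requested for each extend */

int main(void)
{
        int leftover = 0;   /* un-consumed credits carried in the handle */
        int pass;

        for (pass = 1; ; pass++) {
                int request = leftover + CREDITS_PER_OP;

                /* a restart is rejected over half the max credits */
                if (request > MAX_CREDITS / 2) {
                        printf("pass %d: wants too many credits (%d > %d)\n",
                               pass, request, MAX_CREDITS / 2);
                        return 1;
                }
                /* all credits are consumed except the one for the truncate
                 * log inode, which is carried over to the next pass */
                leftover += 1;
        }
}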

The following error was caught when unlinking a large file with many extents.

[233096.013936] [ cut here ]
[233096.018586] WARNING: CPU: 0 PID: 13626 at fs/jbd2/transaction.c:269 
start_this_handle+0x4c3/0x510 [jbd2]()
[233096.028335] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss 
sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager 
ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 
nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi 
iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser 
rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp 
libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops 
sysimgblt sysfillrect syscopyarea parport_pc parport pcspkr i2c_piix4 i2c_core 
acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic 
ata_piix dm_mirror dm_region_hash dm_log dm_mod
[233096.081751] CPU: 0 PID: 13626 Comm: unlink Tainted: GW   
4.1.12-37.6.3.el6uek.x86_64 #2
[233096.088556] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
[233096.093125]  010d 8818b768 816bc5bc 
010d
[233096.099082]   8818b7a8 81081475 
8818b788
[233096.105038]  88007a99a000 88007b573390 00fb 
0050
[233096.110540] Call Trace:
[233096.111893]  [] dump_stack+0x48/0x5c
[233096.114637]  [] warn_slowpath_common+0x95/0xe0
[233096.117797]  [] warn_slowpath_null+0x1a/0x20
[233096.120984]  [] start_this_handle+0x4c3/0x510 [jbd2]
[233096.124505]  [] ? __jbd2_log_start_commit+0xe5/0xf0 [jbd2]
[233096.128115]  [] ? __wake_up+0x53/0x70
[233096.130924]  [] jbd2__journal_restart+0x161/0x1b0 [jbd2]
[233096.134523]  [] jbd2_journal_restart+0x13/0x20 [jbd2]
[233096.137986]  [] ocfs2_extend_trans+0x74/0x220 [ocfs2]
[233096.141407]  [] ? ocfs2_journal_dirty+0x3a/0x90 [ocfs2]
[233096.144921]  [] ocfs2_replay_truncate_records+0x93/0x360 
[ocfs2]
[233096.148819]  [] __ocfs2_flush_truncate_log+0x13e/0x3a0 
[ocfs2]
[233096.152644]  [] ? 
ocfs2_reserve_blocks_for_rec_trunc.clone.0+0x44/0x1f0 [ocfs2]
[233096.157310]  [] ocfs2_remove_btree_range+0x458/0x7f0 
[ocfs2]
[233096.161099]  [] ? __ocfs2_find_path+0x187/0x2d0 [ocfs2]
[233096.164612]  [] ocfs2_commit_truncate+0x1b3/0x6f0 [ocfs2]
[233096.168204]  [] ? 
ocfs2_xattr_tree_et_ops+0x60/0xfffe8c20 [ocfs2]
[233096.172539]  [] ? ocfs2_journal_access_eb+0x20/0x20 
[ocfs2]
[233096.176285]  [] ? __sb_end_write+0x33/0x70
[233096.179226]  [] ocfs2_truncate_for_delete+0xbd/0x380 
[ocfs2]
[233096.183009]  [] ? ocfs2_query_inode_wipe+0xf4/0x320 
[ocfs2]
[233096.186738]  [] ocfs2_wipe_inode+0x136/0x6a0 [ocfs2]
[233096.190165]  [] ? ocfs2_query_inode_wipe+0xf4/0x320 
[ocfs2]
[233096.193846]  [] ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2]
[233096.197274]  [] ? __inode_wait_for_writeback+0x69/0xc0
[233096.200736]  [] ? 
__PRETTY_FUNCTION__.112282+0x20/0xb520 [ocfs2]
[233096.205146]  [] ocfs2_evict_inode+0x28/0x60 [ocfs2]
[233096.208462]  [] evict+0xab/0x1a0
[233096.211020]  [] ? 
__PRETTY_FUNCTION__.112282+0x20/0xb520 [ocfs2]
[233096.215396]  [] iput_final+0xf6/0x190
[233096.218169]  [] iput+0xc8/0xe0
[233096.220586]  [] do_unlinkat+0x1b7/0x310
[233096.223487]  [] ? __do_page_fault+0x18b/0x480
[233096.226655]  [] ? __audit_syscall_entry+0xac/0x110
[233096.230009]  [] ? do_audit_syscall_entry+0x6c/0x70
[233096.233346]  [] ? syscall_trace_enter_phase1+0x153/0x180
[233096.237103]  [] SyS_unlink+0x16/0x20
[233096.239800]  [] system_call_fastpath+0x12/0x71
[233096.244346] ---[ end trace 28aa7410e69369cf ]---
[233096.247798] JBD2: unlink wants too many credits (251 > 128)

Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/alloc.c |   29 ++---
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 7dabbc31060e..51128789a661 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -5922,7 +5922,6 @@ bail:
 }
 
 static int ocfs2_replay_truncate_records(struct ocfs2_super *osb,
-handle_t *handle,
 struct inode *data_alloc_inode,
 struct buffer_head *data_alloc_bh)
 {
@@ -5935,11 +5934,19 @@ static int ocfs2_replay_truncate_records(struct 
ocfs2_super *osb,
struct ocfs2_truncate_log *tl;
struct inode *tl_inode = osb->osb_tl_inode;
struct buffer_head 

Re: [Ocfs2-devel] [PATCH v2] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()

2016-08-29 Thread Junxiao Bi
On 08/30/2016 03:23 AM, Ashish Samant wrote:
> Hi Eric,
> 
> The easiest way to reproduce this is:
> 
> 1. Create a random file of, say, 10 MB
>  xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
> 2. Reflink it
>  reflink -f 10MBfile reflnktest
> 3. Punch a hole starting at a cluster boundary with a range greater than
> 1MB. You can also use a range that will put the end offset in another
> extent.
>  fallocate -p -o 0 -l 1048615 reflnktest
> 4. sync
> 5. Check the first cluster in the source file. (It will be zeroed out).
> dd if=10MBfile iflag=direct bs= count=1 | hexdump -C
> 
Cool, these reproduction steps deserve to be added into the patch log.
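
For reference, the cluster-boundary arithmetic behind these steps can be
checked with a small userspace program (an illustrative sketch only, not
ocfs2 code; it uses the 1MB cluster size and the offsets from step 3):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t csize = 1048576;   /* cluster size from the repro */
        uint64_t start = 0;         /* hole start (fallocate -o 0) */
        uint64_t len   = 1048615;   /* hole length (fallocate -l) */
        uint64_t end   = start + len;
        /* byte offset of the end of the first cluster, as the old
         * code computed it */
        uint64_t tmpend = csize + (start & ~(csize - 1));

        if (tmpend > end)
                tmpend = end;

        printf("start on cluster boundary: %s\n",
               (start & (csize - 1)) == 0 ? "yes (cluster not COWed)" : "no");
        printf("old code zeroed [%llu, %llu) in place\n",
               (unsigned long long)start, (unsigned long long)tmpend);
        return 0;
}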

Thanks,
Junxiao.

> Thanks,
> Ashish
> 
> On 08/28/2016 10:39 PM, Eric Ren wrote:
>> Hi,
>>
>> Thanks for this fix. I'd like to reproduce this issue locally and test 
>> this patch,
>> could you elaborate the detailed steps of reproduction?
>>
>> Thanks,
>> Eric
>>
>> On 08/27/2016 07:04 AM, Ashish Samant wrote:
>>> If we punch a hole on a reflink such that following conditions are met:
>>>
>>> 1. start offset is on a cluster boundary
>>> 2. end offset is not on a cluster boundary
>>> 3. (end offset is somewhere in another extent) or
>>> (hole range > MAX_CONTIG_BYTES(1MB)),
>>>
>>> we don't COW the first cluster starting at the start offset. But in this
>>> case, we were wrongly passing this cluster to
>>> ocfs2_zero_range_for_truncate() to zero out. This will modify the cluster
>>> in place and zero it in the source too.
>>>
>>> Fix this by skipping this cluster in such a scenario.
>>>
>>> Reported-by: Saar Maoz 
>>> Signed-off-by: Ashish Samant 
>>> Reviewed-by: Srinivas Eeda 
>>> ---
>>> v1->v2:
>>> -Changed the commit msg to include a better and generic description of
>>>   the problem, for all cluster sizes.
>>> -Added Reported-by and Reviewed-by tags.
>>>  fs/ocfs2/file.c | 34 --
>>>   1 file changed, 24 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>>> index 4e7b0dc..0b055bf 100644
>>> --- a/fs/ocfs2/file.c
>>> +++ b/fs/ocfs2/file.c
>>> @@ -1506,7 +1506,8 @@ static int ocfs2_zero_partial_clusters(struct 
>>> inode *inode,
>>>  u64 start, u64 len)
>>>   {
>>>   int ret = 0;
>>> -u64 tmpend, end = start + len;
>>> +u64 tmpend = 0;
>>> +u64 end = start + len;
>>>   struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>>   unsigned int csize = osb->s_clustersize;
>>>   handle_t *handle;
>>> @@ -1538,18 +1539,31 @@ static int ocfs2_zero_partial_clusters(struct 
>>> inode *inode,
>>>   }
>>> /*
>>> - * We want to get the byte offset of the end of the 1st cluster.
>>> + * If start is on a cluster boundary and end is somewhere in 
>>> another
>>> + * cluster, we have not COWed the cluster starting at start, unless
>>> + * end is also within the same cluster. So, in this case, we 
>>> skip this
>>> + * first call to ocfs2_zero_range_for_truncate() truncate and 
>>> move on
>>> + * to the next one.
>>>*/
>>> -tmpend = (u64)osb->s_clustersize + (start & ~(osb->s_clustersize 
>>> - 1));
>>> -if (tmpend > end)
>>> -tmpend = end;
>>> +if ((start & (csize - 1)) != 0) {
>>> +/*
>>> + * We want to get the byte offset of the end of the 1st
>>> + * cluster.
>>> + */
>>> +tmpend = (u64)osb->s_clustersize +
>>> +(start & ~(osb->s_clustersize - 1));
>>> +if (tmpend > end)
>>> +tmpend = end;
>>>   -trace_ocfs2_zero_partial_clusters_range1((unsigned long 
>>> long)start,
>>> - (unsigned long long)tmpend);
>>> +trace_ocfs2_zero_partial_clusters_range1(
>>> +(unsigned long long)start,
>>> +(unsigned long long)tmpend);
>>>   -ret = ocfs2_zero_range_for_truncate(inode, handle, start, 
>>> tmpend);
>>> -if (ret)
>>> -mlog_errno(ret);
>>> +ret = ocfs2_zero_range_for_truncate(inode, handle, start,
>>> +tmpend);
>>> +if (ret)
>>> +mlog_errno(ret);
>>> +}
>>> if (tmpend < end) {
>>>   /*
>>
>>
> 
> 
> ___
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] Revert "ocfs2: bump up o2cb network protocol version"

2016-08-16 Thread Junxiao Bi
This reverts commit 38b52efd218bf2a11a5b4a8f56052cee6684cfec.

This commit made rolling upgrades fail. When one node is upgraded
to a new version with this commit, the remaining nodes will fail to
establish connections to it, so applications like VMs on the
remaining nodes can't be live migrated to the upgraded one. This
will cause an outage. Since the negotiate hb timeout behavior didn't
change without this commit, revert it.
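
The failure mode is mechanical: o2net exchanges a 64-bit protocol version
in its handshake and drops the connection on any mismatch. A rough
userspace model of that check (an illustrative sketch only, not the actual
o2net code):

#include <stdio.h>
#include <stdint.h>

struct handshake {
        uint64_t protocol_version;
};

/* returns 0 when the peer may connect, -1 when it is rejected */
static int check_handshake(uint64_t local_version,
                           const struct handshake *hand)
{
        if (hand->protocol_version != local_version) {
                fprintf(stderr,
                        "version mismatch (%llu != %llu), dropping connection\n",
                        (unsigned long long)hand->protocol_version,
                        (unsigned long long)local_version);
                return -1;
        }
        return 0;
}

int main(void)
{
        /* an un-upgraded node still speaks version 11 */
        struct handshake old_node = { .protocol_version = 11 };

        /* a node bumped to version 12 rejects it, breaking rolling upgrade */
        return check_handshake(12, &old_node) ? 1 : 0;
}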

Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/cluster/tcp_internal.h |5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/ocfs2/cluster/tcp_internal.h b/fs/ocfs2/cluster/tcp_internal.h
index 94b18369b1cc..b95e7df5b76a 100644
--- a/fs/ocfs2/cluster/tcp_internal.h
+++ b/fs/ocfs2/cluster/tcp_internal.h
@@ -44,9 +44,6 @@
  * version here in tcp_internal.h should not need to be bumped for
  * filesystem locking changes.
  *
- * New in version 12
- * - Negotiate hb timeout when storage is down.
- *
  * New in version 11
  * - Negotiation of filesystem locking in the dlm join.
  *
@@ -78,7 +75,7 @@
  * - full 64 bit i_size in the metadata lock lvbs
  * - introduction of "rw" lock and pushing meta/data locking down
  */
-#define O2NET_PROTOCOL_VERSION 12ULL
+#define O2NET_PROTOCOL_VERSION 11ULL
 struct o2net_handshake {
__be64  protocol_version;
__be64  connector_id;
-- 
1.7.9.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH v2] ocfs2: improve recovery performance

2016-07-10 Thread Junxiao Bi
On 07/09/2016 05:23 AM, Andrew Morton wrote:
> On Thu,  7 Jul 2016 10:24:48 +0800 Junxiao Bi  wrote:
> 
>> Journal replay will be run when doing recovery for a dead node.
>> To avoid the stale cache impact, all blocks of the dead node's
>> journal inode were reloaded from disk. This hurts performance;
>> checking whether a block is cached before reloading it can improve
>> performance a lot. In my test env, the time doing recovery was
>> improved from 120s to 1s.
> 
> So since v1 you did this (unchangelogged bugfix!):
> 
> --- a/fs/ocfs2/journal.c~ocfs2-improve-recovery-performance-v2
> +++ a/fs/ocfs2/journal.c
> @@ -1194,6 +1194,7 @@ static int ocfs2_force_read_journal(stru
>  
>   brelse(bh);
>   bh = NULL;
> + p_blkno++;
>   }
>  
>   v_blkno += p_blocks;
> 
> 
> I suppose this is a bit neater?
Yes, looks good. Thank you.

Thanks,
Junxiao.
> 
> --- a/fs/ocfs2/journal.c~ocfs2-improve-recovery-performance-v2-fix
> +++ a/fs/ocfs2/journal.c
> @@ -1172,14 +1172,12 @@ static int ocfs2_force_read_journal(stru
>   goto bail;
>   }
>  
> - for (i = 0; i < p_blocks; i++) {
> + for (i = 0; i < p_blocks; i++, p_blkno++) {
>   bh = __find_get_block(osb->sb->s_bdev, p_blkno,
>   osb->sb->s_blocksize);
>   /* block not cached. */
> - if (!bh) {
> - p_blkno++;
> + if (!bh)
>   continue;
> - }
>  
>   brelse(bh);
>   bh = NULL;
> @@ -1194,7 +1192,6 @@ static int ocfs2_force_read_journal(stru
>  
>   brelse(bh);
>   bh = NULL;
> - p_blkno++;
>   }
>  
>   v_blkno += p_blocks;
> _
> 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH v2] ocfs2: improve recovery performance

2016-07-06 Thread Junxiao Bi
Journal replay will be run when doing recovery for a dead node.
To avoid the stale cache impact, all blocks of the dead node's
journal inode were reloaded from disk. This hurts performance;
checking whether a block is cached before reloading it can improve
performance a lot. In my test env, the time doing recovery was
improved from 120s to 1s.

Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/journal.c |   42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index e607419cdfa4..67179cf60525 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -1159,10 +1159,8 @@ static int ocfs2_force_read_journal(struct inode *inode)
int status = 0;
int i;
u64 v_blkno, p_blkno, p_blocks, num_blocks;
-#define CONCURRENT_JOURNAL_FILL 32ULL
-   struct buffer_head *bhs[CONCURRENT_JOURNAL_FILL];
-
-   memset(bhs, 0, sizeof(struct buffer_head *) * CONCURRENT_JOURNAL_FILL);
+   struct buffer_head *bh = NULL;
+   struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 
num_blocks = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
v_blkno = 0;
@@ -1174,29 +1172,35 @@ static int ocfs2_force_read_journal(struct inode *inode)
goto bail;
}
 
-   if (p_blocks > CONCURRENT_JOURNAL_FILL)
-   p_blocks = CONCURRENT_JOURNAL_FILL;
+   for (i = 0; i < p_blocks; i++) {
+   bh = __find_get_block(osb->sb->s_bdev, p_blkno,
+   osb->sb->s_blocksize);
+   /* block not cached. */
+   if (!bh) {
+   p_blkno++;
+   continue;
+   }
 
-   /* We are reading journal data which should not
-* be put in the uptodate cache */
-   status = ocfs2_read_blocks_sync(OCFS2_SB(inode->i_sb),
-   p_blkno, p_blocks, bhs);
-   if (status < 0) {
-   mlog_errno(status);
-   goto bail;
-   }
+   brelse(bh);
+   bh = NULL;
+   /* We are reading journal data which should not
+* be put in the uptodate cache.
+*/
+   status = ocfs2_read_blocks_sync(osb, p_blkno, 1, &bh);
+   if (status < 0) {
+   mlog_errno(status);
+   goto bail;
+   }
 
-   for(i = 0; i < p_blocks; i++) {
-   brelse(bhs[i]);
-   bhs[i] = NULL;
+   brelse(bh);
+   bh = NULL;
+   p_blkno++;
}
 
v_blkno += p_blocks;
}
 
 bail:
-   for(i = 0; i < CONCURRENT_JOURNAL_FILL; i++)
-   brelse(bhs[i]);
return status;
 }
 
-- 
1.7.9.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH v2] ocfs2: improve recovery performance

2016-07-06 Thread Junxiao Bi
On 07/07/2016 10:05 AM, xuejiufei wrote:
> Hi Junxiao,
> p_blkno is not increased after force reading from disk, so
> this block is read many times from disk while other blocks
> remaining in cache are not reloaded.
Good catch. Will send a v2 version.

Thanks,
Junxiao.
> 
> Thanks,
> Jiufei
> 
> On 2016/6/17 17:28, Junxiao Bi wrote:
>> Journal replay will be run when doing recovery for a dead node.
>> To avoid the stale cache impact, all blocks of the dead node's
>> journal inode were reloaded from disk. This hurts performance;
>> checking whether a block is cached before reloading it can improve
>> performance a lot. In my test env, the time doing recovery was
>> improved from 120s to 1s.
>>
>> Signed-off-by: Junxiao Bi 
>> ---
>>  fs/ocfs2/journal.c |   41 ++---
>>  1 file changed, 22 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>> index e607419cdfa4..bc0e21e8a674 100644
>> --- a/fs/ocfs2/journal.c
>> +++ b/fs/ocfs2/journal.c
>> @@ -1159,10 +1159,8 @@ static int ocfs2_force_read_journal(struct inode 
>> *inode)
>>  int status = 0;
>>  int i;
>>  u64 v_blkno, p_blkno, p_blocks, num_blocks;
>> -#define CONCURRENT_JOURNAL_FILL 32ULL
>> -struct buffer_head *bhs[CONCURRENT_JOURNAL_FILL];
>> -
>> -memset(bhs, 0, sizeof(struct buffer_head *) * CONCURRENT_JOURNAL_FILL);
>> +struct buffer_head *bh = NULL;
>> +struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>  
>>  num_blocks = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
>>  v_blkno = 0;
>> @@ -1174,29 +1172,34 @@ static int ocfs2_force_read_journal(struct inode 
>> *inode)
>>  goto bail;
>>  }
>>  
>> -if (p_blocks > CONCURRENT_JOURNAL_FILL)
>> -p_blocks = CONCURRENT_JOURNAL_FILL;
>> +for (i = 0; i < p_blocks; i++) {
>> +bh = __find_get_block(osb->sb->s_bdev, p_blkno,
>> +osb->sb->s_blocksize);
>> +/* block not cached. */
>> +if (!bh) {
>> +p_blkno++;
>> +continue;
>> +}
>>  
>> -/* We are reading journal data which should not
>> - * be put in the uptodate cache */
>> -status = ocfs2_read_blocks_sync(OCFS2_SB(inode->i_sb),
>> -p_blkno, p_blocks, bhs);
>> -if (status < 0) {
>> -mlog_errno(status);
>> -goto bail;
>> -}
>> +brelse(bh);
>> +bh = NULL;
>> +/* We are reading journal data which should not
>> + * be put in the uptodate cache.
>> + */
>> +status = ocfs2_read_blocks_sync(osb, p_blkno, 1, &bh);
>> +if (status < 0) {
>> +mlog_errno(status);
>> +goto bail;
>> +}
>>  
>> -for(i = 0; i < p_blocks; i++) {
>> -brelse(bhs[i]);
>> -bhs[i] = NULL;
>> +brelse(bh);
>> +bh = NULL;
>>  }
>>  
>>  v_blkno += p_blocks;
>>  }
>>  
>>  bail:
>> -for(i = 0; i < CONCURRENT_JOURNAL_FILL; i++)
>> -brelse(bhs[i]);
>>  return status;
>>  }
>>  
>>
> 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH v2] ocfs2: improve recovery performance

2016-06-23 Thread Junxiao Bi
On 06/24/2016 06:13 AM, Andrew Morton wrote:
> On Thu, 23 Jun 2016 09:17:53 +0800 Junxiao Bi  wrote:
> 
>> Hi Andrew,
>>
>> Did you miss this patch to your tree?
> 
> I would have seen it eventually.  Explicitly cc'ing me on patches
> helps, please.
I see, will cc you next time.

Thanks,
Junxiao.
> 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH v2] ocfs2: improve recovery performance

2016-06-22 Thread Junxiao Bi
Hi Andrew,

Did you miss this patch to your tree?

Thanks,
Junxiao.

On 06/17/2016 05:43 PM, Joseph Qi wrote:
> On 2016/6/17 17:28, Junxiao Bi wrote:
>> Journal replay will be run when doing recovery for a dead node.
>> To avoid the stale cache impact, all blocks of the dead node's
>> journal inode were reloaded from disk. This hurts performance;
>> checking whether a block is cached before reloading it can improve
>> performance a lot. In my test env, the time doing recovery was
>> improved from 120s to 1s.
>>
>> Signed-off-by: Junxiao Bi 
> Looks good to me. And it indeed has performance improvement from my
> test.
> Reviewed-by: Joseph Qi 
> 
>> ---
>>  fs/ocfs2/journal.c |   41 ++---
>>  1 file changed, 22 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>> index e607419cdfa4..bc0e21e8a674 100644
>> --- a/fs/ocfs2/journal.c
>> +++ b/fs/ocfs2/journal.c
>> @@ -1159,10 +1159,8 @@ static int ocfs2_force_read_journal(struct inode 
>> *inode)
>>  int status = 0;
>>  int i;
>>  u64 v_blkno, p_blkno, p_blocks, num_blocks;
>> -#define CONCURRENT_JOURNAL_FILL 32ULL
>> -struct buffer_head *bhs[CONCURRENT_JOURNAL_FILL];
>> -
>> -memset(bhs, 0, sizeof(struct buffer_head *) * CONCURRENT_JOURNAL_FILL);
>> +struct buffer_head *bh = NULL;
>> +struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>  
>>  num_blocks = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
>>  v_blkno = 0;
>> @@ -1174,29 +1172,34 @@ static int ocfs2_force_read_journal(struct inode 
>> *inode)
>>  goto bail;
>>  }
>>  
>> -if (p_blocks > CONCURRENT_JOURNAL_FILL)
>> -p_blocks = CONCURRENT_JOURNAL_FILL;
>> +for (i = 0; i < p_blocks; i++) {
>> +bh = __find_get_block(osb->sb->s_bdev, p_blkno,
>> +osb->sb->s_blocksize);
>> +/* block not cached. */
>> +if (!bh) {
>> +p_blkno++;
>> +continue;
>> +}
>>  
>> -/* We are reading journal data which should not
>> - * be put in the uptodate cache */
>> -status = ocfs2_read_blocks_sync(OCFS2_SB(inode->i_sb),
>> -p_blkno, p_blocks, bhs);
>> -if (status < 0) {
>> -mlog_errno(status);
>> -goto bail;
>> -}
>> +brelse(bh);
>> +bh = NULL;
>> +/* We are reading journal data which should not
>> + * be put in the uptodate cache.
>> + */
>> +status = ocfs2_read_blocks_sync(osb, p_blkno, 1, &bh);
>> +if (status < 0) {
>> +mlog_errno(status);
>> +goto bail;
>> +}
>>  
>> -for(i = 0; i < p_blocks; i++) {
>> -brelse(bhs[i]);
>> -bhs[i] = NULL;
>> +brelse(bh);
>> +bh = NULL;
>>  }
>>  
>>  v_blkno += p_blocks;
>>  }
>>  
>>  bail:
>> -for(i = 0; i < CONCURRENT_JOURNAL_FILL; i++)
>> -brelse(bhs[i]);
>>  return status;
>>  }
>>  
>>
> 
> 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH v2] ocfs2: improve recovery performance

2016-06-21 Thread Junxiao Bi
Hi Gang,

On 06/20/2016 11:10 AM, Gang He wrote:
> Hello Junxiao,
> 
> I think this change will bring a performance improvement, but from the 
> function comments
> /*
>  * JBD Might read a cached version of another nodes journal file. We
>  * don't want this as this file changes often and we get no
>  * notification on those changes. The only way to be sure that we've
>  * got the most up to date version of those blocks then is to force
>  * read them off disk. Just searching through the buffer cache won't
>  * work as there may be pages backing this file which are still marked
>  * up to date. We know things can't change on this file underneath us
>  * as we have the lock by now :)
>  */
> static int ocfs2_force_read_journal(struct inode *inode)
> 
> Did we consider this potential risk behind this patch? I am not familiar with
> this part of the code; I want to know if there is any sync mechanism to make
> sure the block cache for another node's journal file really has the latest
> data?
I don't see that being needed, because the stale info will not be used
except by journal replay.

Thanks,
Junxiao.
> 
> 
> 
> Thanks
> Gang 
> 
> 
>>>>
>> On 2016/6/17 17:28, Junxiao Bi wrote:
>>> Journal replay will be run when doing recovery for a dead node.
>>> To avoid the stale cache impact, all blocks of the dead node's
>>> journal inode were reloaded from disk. This hurts performance;
>>> checking whether a block is cached before reloading it can improve
>>> performance a lot. In my test env, the time doing recovery was
>>> improved from 120s to 1s.
>>>
>>> Signed-off-by: Junxiao Bi 
>> Looks good to me. And it indeed has performance improvement from my
>> test.
>> Reviewed-by: Joseph Qi 
>>
>>> ---
>>>  fs/ocfs2/journal.c |   41 ++---
>>>  1 file changed, 22 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>> index e607419cdfa4..bc0e21e8a674 100644
>>> --- a/fs/ocfs2/journal.c
>>> +++ b/fs/ocfs2/journal.c
>>> @@ -1159,10 +1159,8 @@ static int ocfs2_force_read_journal(struct inode 
>> *inode)
>>> int status = 0;
>>> int i;
>>> u64 v_blkno, p_blkno, p_blocks, num_blocks;
>>> -#define CONCURRENT_JOURNAL_FILL 32ULL
>>> -   struct buffer_head *bhs[CONCURRENT_JOURNAL_FILL];
>>> -
>>> -   memset(bhs, 0, sizeof(struct buffer_head *) * CONCURRENT_JOURNAL_FILL);
>>> +   struct buffer_head *bh = NULL;
>>> +   struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>>  
>>> num_blocks = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
>>> v_blkno = 0;
>>> @@ -1174,29 +1172,34 @@ static int ocfs2_force_read_journal(struct inode 
>> *inode)
>>> goto bail;
>>> }
>>>  
>>> -   if (p_blocks > CONCURRENT_JOURNAL_FILL)
>>> -   p_blocks = CONCURRENT_JOURNAL_FILL;
>>> +   for (i = 0; i < p_blocks; i++) {
>>> +   bh = __find_get_block(osb->sb->s_bdev, p_blkno,
>>> +   osb->sb->s_blocksize);
>>> +   /* block not cached. */
>>> +   if (!bh) {
>>> +   p_blkno++;
>>> +   continue;
>>> +   }
>>>  
>>> -   /* We are reading journal data which should not
>>> -* be put in the uptodate cache */
>>> -   status = ocfs2_read_blocks_sync(OCFS2_SB(inode->i_sb),
>>> -   p_blkno, p_blocks, bhs);
>>> -   if (status < 0) {
>>> -   mlog_errno(status);
>>> -   goto bail;
>>> -   }
>>> +   brelse(bh);
>>> +   bh = NULL;
>>> +   /* We are reading journal data which should not
>>> +* be put in the uptodate cache.
>>> +*/
>>> +   status = ocfs2_read_blocks_sync(osb, p_blkno, 1, &bh);
>>> +   if (status < 0) {
>>> +   mlog_errno(status);
>>> +   goto bail;
>>> +   }
>>>  
>>> -   for(i = 0; i < p_blocks; i++) {
>>> -   brelse(bhs[i]);
>>> -   bhs[i] = NULL;
>>> +   brelse(bh);
>>> +   bh = NULL;
>>> }
>>>  
>>> v_blkno += p_blocks;
>>> }
>>>  
>>>  bail:
>>> -   for(i = 0; i < CONCURRENT_JOURNAL_FILL; i++)
>>> -   brelse(bhs[i]);
>>> return status;
>>>  }
>>>  
>>>
>>
>>
>>
>> ___
>> Ocfs2-devel mailing list
>> Ocfs2-devel@oss.oracle.com 
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH v2] ocfs2: improve recovery performance

2016-06-17 Thread Junxiao Bi
Journal replay will be run when doing recovery for a dead node.
To avoid the stale cache impact, all blocks of the dead node's
journal inode were reloaded from disk. This hurts performance;
checking whether a block is cached before reloading it can improve
performance a lot. In my test env, the time doing recovery was
improved from 120s to 1s.

Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/journal.c |   41 ++---
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index e607419cdfa4..bc0e21e8a674 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -1159,10 +1159,8 @@ static int ocfs2_force_read_journal(struct inode *inode)
int status = 0;
int i;
u64 v_blkno, p_blkno, p_blocks, num_blocks;
-#define CONCURRENT_JOURNAL_FILL 32ULL
-   struct buffer_head *bhs[CONCURRENT_JOURNAL_FILL];
-
-   memset(bhs, 0, sizeof(struct buffer_head *) * CONCURRENT_JOURNAL_FILL);
+   struct buffer_head *bh = NULL;
+   struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 
num_blocks = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
v_blkno = 0;
@@ -1174,29 +1172,34 @@ static int ocfs2_force_read_journal(struct inode *inode)
goto bail;
}
 
-   if (p_blocks > CONCURRENT_JOURNAL_FILL)
-   p_blocks = CONCURRENT_JOURNAL_FILL;
+   for (i = 0; i < p_blocks; i++) {
+   bh = __find_get_block(osb->sb->s_bdev, p_blkno,
+   osb->sb->s_blocksize);
+   /* block not cached. */
+   if (!bh) {
+   p_blkno++;
+   continue;
+   }
 
-   /* We are reading journal data which should not
-* be put in the uptodate cache */
-   status = ocfs2_read_blocks_sync(OCFS2_SB(inode->i_sb),
-   p_blkno, p_blocks, bhs);
-   if (status < 0) {
-   mlog_errno(status);
-   goto bail;
-   }
+   brelse(bh);
+   bh = NULL;
+   /* We are reading journal data which should not
+* be put in the uptodate cache.
+*/
+   status = ocfs2_read_blocks_sync(osb, p_blkno, 1, &bh);
+   if (status < 0) {
+   mlog_errno(status);
+   goto bail;
+   }
 
-   for(i = 0; i < p_blocks; i++) {
-   brelse(bhs[i]);
-   bhs[i] = NULL;
+   brelse(bh);
+   bh = NULL;
}
 
v_blkno += p_blocks;
}
 
 bail:
-   for(i = 0; i < CONCURRENT_JOURNAL_FILL; i++)
-   brelse(bhs[i]);
return status;
 }
 
-- 
1.7.9.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: improve recovery performance

2016-06-17 Thread Junxiao Bi
On 06/17/2016 04:32 PM, Joseph Qi wrote:
> On 2016/6/17 15:50, Junxiao Bi wrote:
>> Hi Joseph,
>>
>> On 06/17/2016 03:44 PM, Joseph Qi wrote:
>>> Hi Junxiao,
>>>
>>> On 2016/6/17 14:10, Junxiao Bi wrote:
>>>> Journal replay will be run when doing recovery for a dead node.
>>>> To avoid the stale cache impact, all blocks of the dead node's
>>>> journal inode were reloaded from disk. This hurts performance;
>>>> checking whether a block is cached before reloading it can improve
>>>> performance a lot. In my test env, the time doing recovery was
>>>> improved from 120s to 1s.
>>>>
>>>> Signed-off-by: Junxiao Bi 
>>>> ---
>>>>  fs/ocfs2/journal.c |   41 ++---
>>>>  1 file changed, 22 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>>>> index e607419cdfa4..8b808afd5f82 100644
>>>> --- a/fs/ocfs2/journal.c
>>>> +++ b/fs/ocfs2/journal.c
>>>> @@ -1159,10 +1159,8 @@ static int ocfs2_force_read_journal(struct inode 
>>>> *inode)
>>>>int status = 0;
>>>>int i;
>>>>u64 v_blkno, p_blkno, p_blocks, num_blocks;
>>>> -#define CONCURRENT_JOURNAL_FILL 32ULL
>>>> -  struct buffer_head *bhs[CONCURRENT_JOURNAL_FILL];
>>>> -
>>>> -  memset(bhs, 0, sizeof(struct buffer_head *) * CONCURRENT_JOURNAL_FILL);
>>>> +  struct buffer_head *bhs[1] = {NULL};
>>> Since we do not need batch load now, how about making the logic like:
>>>
>>> struct buffer_head *bh = NULL;
>>> ...
>>> ocfs2_read_blocks_sync(osb, p_blkno, 1, &bh);
>> This array is used because ocfs2_read_blocks_sync() needs it as the last
>> parameter.
> IC, so we pass &bh like ocfs2_read_locked_inode.
Right, will submit v2.

Thanks,
Junxiao.
> 
> Thanks,
> Joseph
> 
>>
>> Thanks,
>> Junxiao.
>>>
>>> Thanks,
>>> Joseph
>>>
>>>> +  struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>>>  
>>>>num_blocks = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
>>>>v_blkno = 0;
>>>> @@ -1174,29 +1172,34 @@ static int ocfs2_force_read_journal(struct inode 
>>>> *inode)
>>>>goto bail;
>>>>}
>>>>  
>>>> -  if (p_blocks > CONCURRENT_JOURNAL_FILL)
>>>> -  p_blocks = CONCURRENT_JOURNAL_FILL;
>>>> +  for (i = 0; i < p_blocks; i++) {
>>>> +  bhs[0] = __find_get_block(osb->sb->s_bdev, p_blkno,
>>>> +  osb->sb->s_blocksize);
>>>> +  /* block not cached. */
>>>> +  if (!bhs[0]) {
>>>> +  p_blkno++;
>>>> +  continue;
>>>> +  }
>>>>  
>>>> -  /* We are reading journal data which should not
>>>> -   * be put in the uptodate cache */
>>>> -  status = ocfs2_read_blocks_sync(OCFS2_SB(inode->i_sb),
>>>> -  p_blkno, p_blocks, bhs);
>>>> -  if (status < 0) {
>>>> -  mlog_errno(status);
>>>> -  goto bail;
>>>> -  }
>>>> +  brelse(bhs[0]);
>>>> +  bhs[0] = NULL;
>>>> +  /* We are reading journal data which should not
>>>> +   * be put in the uptodate cache.
>>>> +   */
>>>> +  status = ocfs2_read_blocks_sync(osb, p_blkno, 1, bhs);
>>>> +  if (status < 0) {
>>>> +  mlog_errno(status);
>>>> +  goto bail;
>>>> +  }
>>>>  
>>>> -  for(i = 0; i < p_blocks; i++) {
>>>> -  brelse(bhs[i]);
>>>> -  bhs[i] = NULL;
>>>> +  brelse(bhs[0]);
>>>> +  bhs[0] = NULL;
>>>>}
>>>>  
>>>>v_blkno += p_blocks;
>>>>}
>>>>  
>>>>  bail:
>>>> -  for(i = 0; i < CONCURRENT_JOURNAL_FILL; i++)
>>>> -  brelse(bhs[i]);
>>>>return status;
>>>>  }
>>>>  
>>>>
>>>
>>>
>>
>>
>> .
>>
> 
> 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [PATCH] ocfs2: improve recovery performance

2016-06-17 Thread Junxiao Bi
Hi Joseph,

On 06/17/2016 03:44 PM, Joseph Qi wrote:
> Hi Junxiao,
> 
> On 2016/6/17 14:10, Junxiao Bi wrote:
>> Journal replay will be run when doing recovery for a dead node.
>> To avoid the stale cache impact, all blocks of the dead node's
>> journal inode were reloaded from disk. This hurts performance;
>> checking whether a block is cached before reloading it can improve
>> performance a lot. In my test env, the time doing recovery was
>> improved from 120s to 1s.
>>
>> Signed-off-by: Junxiao Bi 
>> ---
>>  fs/ocfs2/journal.c |   41 ++---
>>  1 file changed, 22 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
>> index e607419cdfa4..8b808afd5f82 100644
>> --- a/fs/ocfs2/journal.c
>> +++ b/fs/ocfs2/journal.c
>> @@ -1159,10 +1159,8 @@ static int ocfs2_force_read_journal(struct inode 
>> *inode)
>>  int status = 0;
>>  int i;
>>  u64 v_blkno, p_blkno, p_blocks, num_blocks;
>> -#define CONCURRENT_JOURNAL_FILL 32ULL
>> -struct buffer_head *bhs[CONCURRENT_JOURNAL_FILL];
>> -
>> -memset(bhs, 0, sizeof(struct buffer_head *) * CONCURRENT_JOURNAL_FILL);
>> +struct buffer_head *bhs[1] = {NULL};
> Since we do not need batch load now, how about making the logic like:
> 
>   struct buffer_head *bh = NULL;
>   ...
>   ocfs2_read_blocks_sync(osb, p_blkno, 1, &bh);
This array is used because ocfs2_read_blocks_sync() needs it as the last
parameter.

Thanks,
Junxiao.
> 
> Thanks,
> Joseph
> 
>> +struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>  
>>  num_blocks = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
>>  v_blkno = 0;
>> @@ -1174,29 +1172,34 @@ static int ocfs2_force_read_journal(struct inode 
>> *inode)
>>  goto bail;
>>  }
>>  
>> -if (p_blocks > CONCURRENT_JOURNAL_FILL)
>> -p_blocks = CONCURRENT_JOURNAL_FILL;
>> +for (i = 0; i < p_blocks; i++) {
>> +bhs[0] = __find_get_block(osb->sb->s_bdev, p_blkno,
>> +osb->sb->s_blocksize);
>> +/* block not cached. */
>> +if (!bhs[0]) {
>> +p_blkno++;
>> +continue;
>> +}
>>  
>> -/* We are reading journal data which should not
>> - * be put in the uptodate cache */
>> -status = ocfs2_read_blocks_sync(OCFS2_SB(inode->i_sb),
>> -p_blkno, p_blocks, bhs);
>> -if (status < 0) {
>> -mlog_errno(status);
>> -goto bail;
>> -}
>> +brelse(bhs[0]);
>> +bhs[0] = NULL;
>> +/* We are reading journal data which should not
>> + * be put in the uptodate cache.
>> + */
>> +status = ocfs2_read_blocks_sync(osb, p_blkno, 1, bhs);
>> +if (status < 0) {
>> +mlog_errno(status);
>> +goto bail;
>> +}
>>  
>> -for(i = 0; i < p_blocks; i++) {
>> -brelse(bhs[i]);
>> -bhs[i] = NULL;
>> +brelse(bhs[0]);
>> +bhs[0] = NULL;
>>  }
>>  
>>  v_blkno += p_blocks;
>>  }
>>  
>>  bail:
>> -for(i = 0; i < CONCURRENT_JOURNAL_FILL; i++)
>> -brelse(bhs[i]);
>>  return status;
>>  }
>>  
>>
> 
> 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2: improve recovery performance

2016-06-16 Thread Junxiao Bi
Journal replay will be run when doing recovery for a dead node.
To avoid the stale cache impact, all blocks of the dead node's
journal inode were reloaded from disk. This hurts performance;
checking whether a block is cached before reloading it can improve
performance a lot. In my test env, the time doing recovery was
improved from 120s to 1s.

Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/journal.c |   41 ++---
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index e607419cdfa4..8b808afd5f82 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -1159,10 +1159,8 @@ static int ocfs2_force_read_journal(struct inode *inode)
int status = 0;
int i;
u64 v_blkno, p_blkno, p_blocks, num_blocks;
-#define CONCURRENT_JOURNAL_FILL 32ULL
-   struct buffer_head *bhs[CONCURRENT_JOURNAL_FILL];
-
-   memset(bhs, 0, sizeof(struct buffer_head *) * CONCURRENT_JOURNAL_FILL);
+   struct buffer_head *bhs[1] = {NULL};
+   struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 
num_blocks = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
v_blkno = 0;
@@ -1174,29 +1172,34 @@ static int ocfs2_force_read_journal(struct inode *inode)
goto bail;
}
 
-   if (p_blocks > CONCURRENT_JOURNAL_FILL)
-   p_blocks = CONCURRENT_JOURNAL_FILL;
+   for (i = 0; i < p_blocks; i++) {
+   bhs[0] = __find_get_block(osb->sb->s_bdev, p_blkno,
+   osb->sb->s_blocksize);
+   /* block not cached. */
+   if (!bhs[0]) {
+   p_blkno++;
+   continue;
+   }
 
-   /* We are reading journal data which should not
-* be put in the uptodate cache */
-   status = ocfs2_read_blocks_sync(OCFS2_SB(inode->i_sb),
-   p_blkno, p_blocks, bhs);
-   if (status < 0) {
-   mlog_errno(status);
-   goto bail;
-   }
+   brelse(bhs[0]);
+   bhs[0] = NULL;
+   /* We are reading journal data which should not
+* be put in the uptodate cache.
+*/
+   status = ocfs2_read_blocks_sync(osb, p_blkno, 1, bhs);
+   if (status < 0) {
+   mlog_errno(status);
+   goto bail;
+   }
 
-   for(i = 0; i < p_blocks; i++) {
-   brelse(bhs[i]);
-   bhs[i] = NULL;
+   brelse(bhs[0]);
+   bhs[0] = NULL;
}
 
v_blkno += p_blocks;
}
 
 bail:
-   for(i = 0; i < CONCURRENT_JOURNAL_FILL; i++)
-   brelse(bhs[i]);
return status;
 }
 
-- 
1.7.9.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


[Ocfs2-devel] [PATCH] ocfs2: bump up o2cb network protocol version

2016-05-25 Thread Junxiao Bi
Two new messages are added to support negotiating the hb timeout. Stop
nodes talking the old version from mounting, as they will cause the
negotiation to fail.

Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/cluster/tcp_internal.h |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/ocfs2/cluster/tcp_internal.h b/fs/ocfs2/cluster/tcp_internal.h
index b95e7df5b76a..94b18369b1cc 100644
--- a/fs/ocfs2/cluster/tcp_internal.h
+++ b/fs/ocfs2/cluster/tcp_internal.h
@@ -44,6 +44,9 @@
  * version here in tcp_internal.h should not need to be bumped for
  * filesystem locking changes.
  *
+ * New in version 12
+ * - Negotiate hb timeout when storage is down.
+ *
  * New in version 11
  * - Negotiation of filesystem locking in the dlm join.
  *
@@ -75,7 +78,7 @@
  * - full 64 bit i_size in the metadata lock lvbs
  * - introduction of "rw" lock and pushing meta/data locking down
  */
-#define O2NET_PROTOCOL_VERSION 11ULL
+#define O2NET_PROTOCOL_VERSION 12ULL
 struct o2net_handshake {
__be64  protocol_version;
__be64  connector_id;
-- 
1.7.9.5


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] OCFS2 benchmark questions

2016-05-24 Thread Junxiao Bi
On 05/20/2016 04:30 PM, Gang He wrote:
> Hello Joseph, Junxiao and All,
> 
> I got some benchmark-related questions, but due to hardware limitations in our
> local lab, we don't have enough information to answer them.
> Could you help to look at the questions, if you ever did the related testing
> at your sites?
> 1) How many nodes can an OCFS2 cluster scale out to? Theoretically, the answer
> is 32, but who ever did the testing with the maximum number of nodes? Second,
> will too many nodes impact the performance, or not?
The limit is 255, but we don't advise using more than 32 nodes. More
nodes impact performance for sure.

Thanks,
Junxiao.
> 
> 2) Who can help to share OCFS2 benchmark information/data in the physical 
> environment? e.g. node hardware information, back-end storage information, IO 
> read/write data, IOPS/IO throughput. etc.
> 
> 
> Thanks
> Gang  
> 
> 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] [patch 1/6] ocfs2: o2hb: add negotiate timer

2016-05-24 Thread Junxiao Bi
On 05/25/2016 06:35 AM, Mark Fasheh wrote:
> On Mon, May 23, 2016 at 02:50:28PM -0700, Andrew Morton wrote:
>> From: Junxiao Bi 
>> Subject: ocfs2: o2hb: add negotiate timer
> 
> Thank you for the well written patch description by the way.
> 
> 
>> This series of patches is to fix the issue that when storage is down, all
>> nodes will fence themselves due to write timeout.
>>
>> With this patch set, all nodes will keep going until storage is back online,
>> except if one of the following issues happens, in which case all nodes will
>> fence themselves as before.
>>
>> 1. io error got
>> 2. network between nodes down
>> 3. nodes panic
>>
>> This patch (of 6):
>>
>> When storage is down, all nodes will fence themselves due to write timeout.
>> The negotiate timer is designed to avoid this; with it, a node will wait
>> until storage is up again.
>>
>> The negotiate timer works in the following way:
>>
>> 1. The timer expires before the write timeout timer; its timeout is half
>>    of the write timeout now.  It is re-queued along with the write timeout
>>    timer.  If it expires, it will send a NEGO_TIMEOUT message to the master
>>    node (the node with the lowest node number).  This message does nothing
>>    but mark a bit in a bitmap recording which nodes are negotiating the
>>    timeout on the master node.
> 
> I went through the patch series, and generally feel that the code
> is well written and straight forward. I have two issues regarding
> how this operates. Otherwise, I like the general direction this
> is taking.
> 
> The first is easy - we're updating the o2cb network protocol and
> need to bump the protocol version otherwise a node that doesn't
> speak these new messages could mount and even be selected as the
> 'master' without actually being able to participate in this scheme.
Right. Will add this.
> 
> 
> My other concern is whether the notion of 'lowest node' can
> change if one comes online while the cluster is negotiating this
> timeout. Obviously in the case where all the disks are unplugged
> this couldn't happen because a new node couldn't begin to
> heartbeat.
Yes.
> 
> What about a situation where only some nodes are negotiating this
> timeout? On the ones which have no disk access, lowest node
> number still won't change since they can't read the new
> heartbeats. On those with stable access though, can't this value
> change? How does that effect this algorithm?
The lowest node can change for good nodes, but that doesn't affect the
algorithm. Because only bad nodes send the NEGO_TIMEOUT message while good
nodes do not, the original lowest node will never receive NEGO_TIMEOUT
messages from all nodes, so it will not approve the timeout; at last the
bad nodes will fence themselves and the good nodes keep alive.
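
A toy model of the master's bookkeeping makes this concrete (an
illustrative sketch only, not the o2hb code): the master approves the
longer timeout only when every live heartbeating node has set its bit.

#include <stdio.h>

int main(void)
{
        /* nodes 0-3 are alive and heartbeating */
        unsigned int live_map = 0x0F;
        /* bits set as NEGO_TIMEOUT messages arrive at the master */
        unsigned int nego_map = 0;

        /* only the bad nodes (the ones that lost storage) send NEGO_TIMEOUT */
        nego_map |= 1u << 1;   /* node 1 lost storage */
        nego_map |= 1u << 3;   /* node 3 lost storage */

        if ((nego_map & live_map) == live_map)
                printf("all live nodes are negotiating: master approves the timeout\n");
        else
                printf("good nodes never sent NEGO_TIMEOUT: no approval, "
                       "bad nodes fence themselves\n");
        return 0;
}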

Thanks,
Junxiao.
> 
> Thanks,
>   --Mark
> 
> --
> Mark Fasheh
> 


___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] ocfs2: fix recursive locking deadlock

2016-05-09 Thread Junxiao Bi
On 05/10/2016 01:58 PM, Andrew Morton wrote:
> On Tue, 10 May 2016 12:53:41 +0800 Junxiao Bi  wrote:
> 
>> These two patches is to fix recursive locking deadlock issues. As we
>> talked with Mark before, recursive locking is not allowed in ocfs2,
>> so these two patches fixes the deadlock issue with reverting back
>> patches to avoid recursive locking. Please review.
> 
> 2-3 years old so I guess it isn't a huge issue.  Did you consider
> backporting into -stable kernels?
> 
We'd better do that. I reproduced every one in my test env.

Thanks,
Junxiao.

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel


Re: [Ocfs2-devel] Reflink hangs with kernel 4.4

2016-05-09 Thread Junxiao Bi
Hi Tiger,

Did only those two processes report call traces from the two nodes? If so,
this looks a little different from my hang, which is a recursive locking of
a cluster lock. Anyway, I just posted the fixes for my issue to the mailing
list; you can have a try.

Thanks,
Junxiao.

On 05/09/2016 09:20 PM, 서정우 wrote:
> Hi all.
> 
> I built up ocfs2 on drbd dual primary.
> Each node has 12 disks of Raid 10 with mdadm chunk size 4096k.
> The cluster size of the filesystem is 1048576 bytes.
> 
> The main purpose of use is to reflink files on drbd.
> 
> I reflinked files from a 1TB file and exported them to LIO iSCSI.
> 
> After a few days of tests, I got a kernel error.
> 
> 
> 
> May  4 19:29:38 master kernel: [1283940.130689]
> (reflink,30902,0):ocfs2_check_dir_for_entry:2048 ERROR: status = -17
> May  4 19:29:38 master kernel: [1283940.131122]
> (reflink,30902,0):ocfs2_mv_orphaned_inode_to_new:2917 ERROR: status = -17
> May  4 19:29:38 master kernel: [1283940.131533]
> (reflink,30902,0):ocfs2_reflink:4317 ERROR: status = -17
> May  4 21:15:29 master kernel: [1290290.387752] INFO: task reflink:5954
> blocked for more than 120 seconds.
> May  4 21:15:29 master kernel: [1290290.388093]   Not tainted
> 4.4.7-040407-generic #201604121331
> May  4 21:15:29 master kernel: [1290290.388417] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May  4 21:15:29 master kernel: [1290290.388784] reflink D
> 880037e83cf8 0  5954  25468 0x
> May  4 21:15:29 master kernel: [1290290.388788]  880037e83cf8
> 8800b80e6000 8802156ae040 88000195d280
> May  4 21:15:29 master kernel: [1290290.388790]  880037e84000
> 8801af84f1dc 88000195d280 
> May  4 21:15:29 master kernel: [1290290.388792]  8801af84f1e0
> 880037e83d10 817fdf35 8801af84f1d8
> May  4 21:15:29 master kernel: [1290290.388793] Call Trace:
> May  4 21:15:29 master kernel: [1290290.388798]  []
> schedule+0x35/0x80
> May  4 21:15:29 master kernel: [1290290.388800]  []
> schedule_preempt_disabled+0xe/0x10
> May  4 21:15:29 master kernel: [1290290.388802]  []
> __mutex_lock_slowpath+0xb9/0x130
> May  4 21:15:29 master kernel: [1290290.388803]  []
> mutex_lock+0x1f/0x30
> May  4 21:15:29 master kernel: [1290290.388832]  []
> ocfs2_reflink_ioctl+0x218/0x360 [ocfs2]
> May  4 21:15:29 master kernel: [1290290.388848]  []
> ocfs2_ioctl+0x26e/0x660 [ocfs2]
> May  4 21:15:29 master kernel: [1290290.388851]  []
> do_vfs_ioctl+0x298/0x480
> May  4 21:15:29 master kernel: [1290290.388853]  [] ?
> putname+0x54/0x60
> May  4 21:15:29 master kernel: [1290290.388854]  [] ?
> do_sys_open+0x1af/0x230
> May  4 21:15:29 master kernel: [1290290.388856]  []
> SyS_ioctl+0x79/0x90
> May  4 21:15:29 master kernel: [1290290.388858]  []
> entry_SYSCALL_64_fastpath+0x16/0x75
> May  4 21:15:29 master kernel: [1290290.388860] INFO: task reflink:6466
> blocked for more than 120 seconds.
> May  4 21:15:29 master kernel: [1290290.389236]   Not tainted
> 4.4.7-040407-generic #201604121331
> May  4 21:15:29 master kernel: [1290290.389611] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May  4 21:15:29 master kernel: [1290290.389998] reflink D
> 880038f87cf8 0  6466  32643 0x
> May  4 21:15:29 master kernel: [1290290.39]  880038f87cf8
> 8800b80e6000 880215542940 880002508dc0
> May  4 21:15:29 master kernel: [1290290.390002]  880038f88000
> 8801af84f1dc 880002508dc0 
> May  4 21:15:29 master kernel: [1290290.390004]  8801af84f1e0
> 880038f87d10 817fdf35 8801af84f1d8
> May  4 21:15:29 master kernel: [1290290.390005] Call Trace:
> May  4 21:15:29 master kernel: [1290290.390008]  []
> schedule+0x35/0x80
> May  4 21:15:29 master kernel: [1290290.390009]  []
> schedule_preempt_disabled+0xe/0x10
> May  4 21:15:29 master kernel: [1290290.390010]  []
> __mutex_lock_slowpath+0xb9/0x130
> May  4 21:15:29 master kernel: [1290290.390012]  []
> mutex_lock+0x1f/0x30
> May  4 21:15:29 master kernel: [1290290.390031]  []
> ocfs2_reflink_ioctl+0x218/0x360 [ocfs2]
> May  4 21:15:29 master kernel: [1290290.390045]  []
> ocfs2_ioctl+0x26e/0x660 [ocfs2]
> May  4 21:15:29 master kernel: [1290290.390048]  []
> do_vfs_ioctl+0x298/0x480
> May  4 21:15:29 master kernel: [1290290.390049]  [] ?
> putname+0x54/0x60
> May  4 21:15:29 master kernel: [1290290.390051]  [] ?
> do_sys_open+0x1af/0x230
> May  4 21:15:29 master kernel: [1290290.390052]  []
> SyS_ioctl+0x79/0x90
> May  4 21:15:29 master kernel: [1290290.390054]  []
> entry_SYSCALL_64_fastpath+0x16/0x75
> 
> 
>  I saw the same report with kernel 4.3 but there was no answer.
> Any ideas?
> 
>  
> 
> 



[Ocfs2-devel] [PATCH 2/2] ocfs2: fix posix_acl_create deadlock

2016-05-09 Thread Junxiao Bi
commit 702e5bc68ad2 ("ocfs2: use generic posix ACL infrastructure")
refactored code to use posix_acl_create. The problem with this function
is that it is not mindful of the cluster wide inode lock making it
unsuitable for use with ocfs2 inode creation with ACLs. For example,
when used in ocfs2_mknod, this function can cause deadlock as follows.
The parent dir inode lock is taken when calling posix_acl_create ->
get_acl -> ocfs2_iop_get_acl which takes the inode lock again. This can
cause deadlock if there is a blocked remote lock request waiting for the
lock to be downconverted. The same deadlock happened in ocfs2_reflink.
The fix is to revert back to using ocfs2_init_acl.
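As an illustration only, here is a minimal userspace analogy of the call
chain above, with a plain non-recursive pthread mutex standing in for the
cluster-wide inode lock. In ocfs2 the second acquire is normally served
from the lock cache and only blocks when a remote downconvert request has
been queued in between; the plain mutex just makes that failure mode
unconditional:

#include <pthread.h>

static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;

static void get_acl(void)
{
        /* corresponds to ocfs2_iop_get_acl() taking the inode lock again */
        pthread_mutex_lock(&inode_lock);   /* blocks forever: self-deadlock */
        pthread_mutex_unlock(&inode_lock);
}

int main(void)
{
        pthread_mutex_lock(&inode_lock);   /* ocfs2_mknod() holds the dir lock */
        get_acl();                         /* posix_acl_create() -> get_acl() */
        pthread_mutex_unlock(&inode_lock);
        return 0;
}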

Fixes: 702e5bc68ad2 ("ocfs2: use generic posix ACL infrastructure")
Signed-off-by: Tariq Saeed 
Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/acl.c  |   63 +++
 fs/ocfs2/acl.h  |4 +++
 fs/ocfs2/namei.c|   23 ++---
 fs/ocfs2/refcounttree.c |   17 ++---
 fs/ocfs2/xattr.c|   14 ---
 fs/ocfs2/xattr.h|4 +--
 6 files changed, 77 insertions(+), 48 deletions(-)

diff --git a/fs/ocfs2/acl.c b/fs/ocfs2/acl.c
index 749d3bc41232..2162434728c0 100644
--- a/fs/ocfs2/acl.c
+++ b/fs/ocfs2/acl.c
@@ -346,3 +346,66 @@ int ocfs2_acl_chmod(struct inode *inode, struct buffer_head *bh)
posix_acl_release(acl);
return ret;
 }
+
+/*
+ * Initialize the ACLs of a new inode. If parent directory has default ACL,
+ * then clone to new inode. Called from ocfs2_mknod.
+ */
+int ocfs2_init_acl(handle_t *handle,
+  struct inode *inode,
+  struct inode *dir,
+  struct buffer_head *di_bh,
+  struct buffer_head *dir_bh,
+  struct ocfs2_alloc_context *meta_ac,
+  struct ocfs2_alloc_context *data_ac)
+{
+   struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+   struct posix_acl *acl = NULL;
+   int ret = 0, ret2;
+   umode_t mode;
+
+   if (!S_ISLNK(inode->i_mode)) {
+   if (osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL) {
+   acl = ocfs2_get_acl_nolock(dir, ACL_TYPE_DEFAULT,
+  dir_bh);
+   if (IS_ERR(acl))
+   return PTR_ERR(acl);
+   }
+   if (!acl) {
+   mode = inode->i_mode & ~current_umask();
+   ret = ocfs2_acl_set_mode(inode, di_bh, handle, mode);
+   if (ret) {
+   mlog_errno(ret);
+   goto cleanup;
+   }
+   }
+   }
+   if ((osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL) && acl) {
+   if (S_ISDIR(inode->i_mode)) {
+   ret = ocfs2_set_acl(handle, inode, di_bh,
+   ACL_TYPE_DEFAULT, acl,
+   meta_ac, data_ac);
+   if (ret)
+   goto cleanup;
+   }
+   mode = inode->i_mode;
+   ret = __posix_acl_create(&acl, GFP_NOFS, &mode);
+   if (ret < 0)
+   return ret;
+
+   ret2 = ocfs2_acl_set_mode(inode, di_bh, handle, mode);
+   if (ret2) {
+   mlog_errno(ret2);
+   ret = ret2;
+   goto cleanup;
+   }
+   if (ret > 0) {
+   ret = ocfs2_set_acl(handle, inode,
+   di_bh, ACL_TYPE_ACCESS,
+   acl, meta_ac, data_ac);
+   }
+   }
+cleanup:
+   posix_acl_release(acl);
+   return ret;
+}
diff --git a/fs/ocfs2/acl.h b/fs/ocfs2/acl.h
index 035e5878db06..2783a75b3999 100644
--- a/fs/ocfs2/acl.h
+++ b/fs/ocfs2/acl.h
@@ -36,5 +36,9 @@ int ocfs2_set_acl(handle_t *handle,
 struct ocfs2_alloc_context *meta_ac,
 struct ocfs2_alloc_context *data_ac);
 extern int ocfs2_acl_chmod(struct inode *, struct buffer_head *);
+extern int ocfs2_init_acl(handle_t *, struct inode *, struct inode *,
+ struct buffer_head *, struct buffer_head *,
+ struct ocfs2_alloc_context *,
+ struct ocfs2_alloc_context *);
 
 #endif /* OCFS2_ACL_H */
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 6b3e87189a64..a8f1225e6d9b 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -259,7 +259,6 @@ static int ocfs2_mknod(struct inode *dir,
struct ocfs2_dir_lookup_result lookup = { NULL, };
sigset_t oldset;
int did_block_signals = 0;
-   struct posix_acl *default_acl = NULL, *acl = NULL;
struct ocfs2_de

[Ocfs2-devel] [PATCH 1/2] ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang

2016-05-09 Thread Junxiao Bi
commit 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
introduced this issue.  ocfs2_setattr called by chmod command holds cluster
wide inode lock when calling posix_acl_chmod. This latter function in turn
calls ocfs2_iop_get_acl and ocfs2_iop_set_acl.  These two are also called
directly from vfs layer for getfacl/setfacl commands and therefore acquire
the cluster wide inode lock. If a remote conversion request comes after the
first inode lock in ocfs2_setattr, OCFS2_LOCK_BLOCKED will be set. And this
will cause the second call to inode lock from the ocfs2_iop_get_acl() to
block indefinitely.

The deleted version of ocfs2_acl_chmod() calls __posix_acl_chmod() which
does not call back into the filesystem. Therefore, we restore
ocfs2_acl_chmod(), modify it slightly for locking as needed, and use that
instead.

Fixes: 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
Signed-off-by: Tariq Saeed 
Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/acl.c  |   24 
 fs/ocfs2/acl.h  |1 +
 fs/ocfs2/file.c |4 ++--
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/acl.c b/fs/ocfs2/acl.c
index 0cdf497c91ef..749d3bc41232 100644
--- a/fs/ocfs2/acl.c
+++ b/fs/ocfs2/acl.c
@@ -322,3 +322,27 @@ struct posix_acl *ocfs2_iop_get_acl(struct inode *inode, int type)
brelse(di_bh);
return acl;
 }
+
+int ocfs2_acl_chmod(struct inode *inode, struct buffer_head *bh)
+{
+   struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+   struct posix_acl *acl;
+   int ret;
+
+   if (S_ISLNK(inode->i_mode))
+   return -EOPNOTSUPP;
+
+   if (!(osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL))
+   return 0;
+
+   acl = ocfs2_get_acl_nolock(inode, ACL_TYPE_ACCESS, bh);
+   if (IS_ERR(acl) || !acl)
+   return PTR_ERR(acl);
+   ret = __posix_acl_chmod(&acl, GFP_KERNEL, inode->i_mode);
+   if (ret)
+   return ret;
+   ret = ocfs2_set_acl(NULL, inode, NULL, ACL_TYPE_ACCESS,
+   acl, NULL, NULL);
+   posix_acl_release(acl);
+   return ret;
+}
diff --git a/fs/ocfs2/acl.h b/fs/ocfs2/acl.h
index 3fce68d08625..035e5878db06 100644
--- a/fs/ocfs2/acl.h
+++ b/fs/ocfs2/acl.h
@@ -35,5 +35,6 @@ int ocfs2_set_acl(handle_t *handle,
 struct posix_acl *acl,
 struct ocfs2_alloc_context *meta_ac,
 struct ocfs2_alloc_context *data_ac);
+extern int ocfs2_acl_chmod(struct inode *, struct buffer_head *);
 
 #endif /* OCFS2_ACL_H */
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 2bf23fd333ed..4e7b0dc22450 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1268,20 +1268,20 @@ bail_unlock_rw:
if (size_change)
ocfs2_rw_unlock(inode, 1);
 bail:
-   brelse(bh);
 
/* Release quota pointers in case we acquired them */
for (qtype = 0; qtype < OCFS2_MAXQUOTAS; qtype++)
dqput(transfer_to[qtype]);
 
if (!status && attr->ia_valid & ATTR_MODE) {
-   status = posix_acl_chmod(inode, inode->i_mode);
+   status = ocfs2_acl_chmod(inode, bh);
if (status < 0)
mlog_errno(status);
}
if (inode_locked)
ocfs2_inode_unlock(inode, 1);
 
+   brelse(bh);
return status;
 }
 
-- 
1.7.9.5




[Ocfs2-devel] ocfs2: fix recursive locking deadlock

2016-05-09 Thread Junxiao Bi

Hi,

These two patches are to fix recursive locking deadlock issues. As we
discussed with Mark before, recursive locking is not allowed in ocfs2,
so these two patches fix the deadlock issues by reverting the patches
that introduced the recursive locking. Please review.

Thanks,
Junxiao.



Re: [Ocfs2-devel] [PATCH] o2hb: increase unsteady iterations

2016-03-30 Thread Junxiao Bi
On 03/29/2016 08:32 PM, Shichangkuo wrote:
> Hi Junxiao,
> If tcp connections take a long time to establish, hr_unsteady_iterations
> may still drop to 0, and then the o2hb thread will exit with an error.
> What about waiting for the connection?
NAK, this will hang the mount if the network is down.

Thanks,
Junxiao.
> 
> --- a/fs/ocfs2/cluster/heartbeat.c  2016-03-29 20:22:03.066400592 +0800
> +++ a/fs/ocfs2/cluster/heartbeat.c  2016-03-29 20:20:55.494065519 +0800
> @@ -1054,7 +1054,7 @@ bail:
> }
> 
> if (atomic_read(&reg->hr_steady_iterations) != 0) {
> -   if (atomic_dec_and_test(&reg->hr_unsteady_iterations)) {
> +   if (!membership_change && atomic_dec_and_test(&reg->hr_unsteady_iterations)) {
> printk(KERN_NOTICE "o2hb: Unable to stabilize "
>"heartbeart on region %s (%s)\n",
>config_item_name(®->hr_item),
> 
> Thanks
> Changkuo
> 
> 发件人: ocfs2-devel-boun...@oss.oracle.com 
> [mailto:ocfs2-devel-boun...@oss.oracle.com] 代表 Junxiao Bi
> 发送时间: 2015年11月19日 16:08
> 收件人: ocfs2-devel@oss.oracle.com
> 抄送: mfas...@suse.com
> 主题: [Ocfs2-devel] [PATCH] o2hb: increase unsteady iterations
> 
> When running the multiple-xattr test of ocfs2-test on a three-node cluster,
> mount sometimes failed with the following message.
> 
> o2hb: Unable to stabilize heartbeart on region D18B775E758D4D80837E8CF3D086AD4A (xvdb)
> 
> Stabilizing the heartbeat depends on the order in which cluster nodes mount
> ocfs2 and how fast the tcp connections are established. So increase the
> unsteady iterations to leave more time for it.
> 
> Signed-off-by: Junxiao Bi 
> ---
>  fs/ocfs2/cluster/heartbeat.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c 
> index 709fbbd..a3cc6d2 100644
> --- a/fs/ocfs2/cluster/heartbeat.c
> +++ b/fs/ocfs2/cluster/heartbeat.c
> @@ -1780,8 +1780,8 @@ static ssize_t o2hb_region_dev_store(struct config_item *item,
> }
> ++live_threshold;
> atomic_set(&reg->hr_steady_iterations, live_threshold);
> -   /* unsteady_iterations is double the steady_iterations */
> -   atomic_set(&reg->hr_unsteady_iterations, (live_threshold << 1));
> +   /* unsteady_iterations is triple the steady_iterations */
> +   atomic_set(&reg->hr_unsteady_iterations, (live_threshold * 3));
> 
> hb_task = kthread_run(o2hb_thread, reg, "o2hb-%s",
>   reg->hr_item.ci_name);
> --
> 1.7.9.5
> 
> 



Re: [Ocfs2-devel] kernel BUG in function ocfs2_truncate_file

2016-03-30 Thread Junxiao Bi
On 03/31/2016 10:56 AM, Gang He wrote:
> Hello Joseph and Junxiao,
> 
> Did you encounter this issue in the past? I suspect this is possibly a race
> condition bug (rather than data inconsistency).
Never saw this. Did fsck report any corruption?

Thanks,
Junxiao.
> 
> Thanks
> Gang
> 
> 

>> Hello Guys,
>>
>> I got a bug which reported a kernel BUG in function ocfs2_truncate_file.
>> Based on my initial analysis, this bug looks like a race condition problem.
>> Unfortunately, no kernel crash dump was caught; we just got some kernel
>> log, as below,
>>
>> kernel BUG at /usr/src/packages/BUILD/ocfs2-1.6/default/ocfs2/file.c:466!
>> Oct 21 13:02:19 uii316 [ 1766.831230] Supported: Yes
>> Oct 21 13:02:19 uii316 [ 1766.831234]
>> Oct 21 13:02:19 uii316 [ 1766.831238] Pid: 7134, comm: saposcol Not tainted 
>> 3.0.101-0.47.67-default #1
>> Oct 21 13:02:19 uii316 HP ProLiant BL460c G1
>> Oct 21 13:02:19 uii316
>> Oct 21 13:02:19 uii316 [ 1766.831247] RIP: 0010:[]
>> Oct 21 13:02:19 uii316 [] ocfs2_truncate_file+0xa5/0x490 
>> [ocfs2]
>> Oct 21 13:02:19 uii316 [ 1766.831312] RSP: 0018:880f39d79b68  EFLAGS: 
>> 00010296
>> Oct 21 13:02:19 uii316 [ 1766.831321] RAX: 008f RBX: 
>> 880f39c5e240 RCX: 39fd
>> Oct 21 13:02:19 uii316 [ 1766.831326] RDX:  RSI: 
>> 0007 RDI: 0246
>> Oct 21 13:02:19 uii316 [ 1766.831331] RBP: 1000 R08: 
>> 81da0ac0 R09: 
>> Oct 21 13:02:19 uii316 [ 1766.831336] R10: 0003 R11: 
>>  R12: 880f3949bc78
>> Oct 21 13:02:19 uii316 [ 1766.831342] R13: 880f39c5e888 R14: 
>> 880f3d481000 R15: 000e43bc
>> Oct 21 13:02:19 uii316 [ 1766.831347] FS:  7f11cda9d720() 
>> GS:880fefd4() knlGS:
>> Oct 21 13:02:19 uii316 [ 1766.831353] CS:  0010 DS:  ES:  CR0: 
>> 80050033
>> Oct 21 13:02:19 uii316 [ 1766.831358] CR2: 7f11cdad4000 CR3: 
>> 000f39d35000 CR4: 07e0
>> Oct 21 13:02:19 uii316 [ 1766.831363] DR0:  DR1: 
>>  DR2: 
>> Oct 21 13:02:19 uii316 [ 1766.831368] DR3:  DR6: 
>> 0ff0 DR7: 0400
>> Oct 21 13:02:19 uii316 [ 1766.831374] Process saposcol (pid: 7134, 
>> threadinfo 880f39d78000, task 880f39c5e240)
>> Oct 21 13:02:19 uii316 [ 1766.831379] Stack:
>> Oct 21 13:02:19 uii316 [ 1766.831383]  0002433c
>> Oct 21 13:02:19 uii316 000e43bc
>> Oct 21 13:02:19 uii316 000eab40
>> Oct 21 13:02:19 uii316 880f0001
>> Oct 21 13:02:19 uii316
>> Oct 21 13:02:19 uii316 [ 1766.831397]  880f394956e0
>> Oct 21 13:02:19 uii316 880f8e0d1000
>> Oct 21 13:02:19 uii316 880f3949b800
>> Oct 21 13:02:19 uii316 
>> Oct 21 13:02:19 uii316
>> Oct 21 13:02:19 uii316 [ 1766.831410]  880f39454980
>> Oct 21 13:02:19 uii316 0001
>> Oct 21 13:02:19 uii316 0002433c
>> Oct 21 13:02:19 uii316 0008
>> Oct 21 13:02:19 uii316
>> Oct 21 13:02:19 uii316 [ 1766.831423] Call Trace:
>> Oct 21 13:02:19 uii316 [ 1766.831492]  [] 
>> ocfs2_setattr+0x26e/0xa90 [ocfs2]
>> Oct 21 13:02:19 uii316 [ 1766.831522]  [] 
>> notify_change+0x19f/0x2f0
>> Oct 21 13:02:19 uii316 [ 1766.831534]  [] 
>> do_truncate+0x57/0x80
>> Oct 21 13:02:19 uii316 [ 1766.831544]  [] 
>> do_last+0x603/0x800
>> Oct 21 13:02:19 uii316 [ 1766.831551]  [] 
>> path_openat+0xd9/0x420
>> Oct 21 13:02:19 uii316 [ 1766.831558]  [] 
>> do_filp_open+0x4c/0xc0
>> Oct 21 13:02:19 uii316 [ 1766.831566]  [] 
>> do_sys_open+0x17f/0x250
>> Oct 21 13:02:19 uii316 [ 1766.831575]  [] 
>> system_call_fastpath+0x16/0x1b
>> Oct 21 13:02:19 uii316 [ 1766.831588]  [<7f11ccb07080>] 0x7f11ccb0707f
>> Oct 21 13:02:19 uii316 [ 1766.831592] Code:
>>
>>  The source code in question is as below,
>>  444 static int ocfs2_truncate_file(struct inode *inode,
>>  445struct buffer_head *di_bh,
>>  446u64 new_i_size)
>>  447 {
>>  448 int status = 0;
>>  449 struct ocfs2_dinode *fe = NULL;
>>  450 struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>  451
>>  452 /* We trust di_bh because it comes from ocfs2_inode_lock(), 
>> which
>>  453  * already validated it */
>>  454 fe = (struct ocfs2_dinode *) di_bh->b_data;
>>  455
>>  456 trace_ocfs2_truncate_file((unsigned long 
>> long)OCFS2_I(inode)->ip_blkno,
>>  457   (unsigned long 
>> long)le64_to_cpu(fe->i_size),
>>  458   (unsigned long long)new_i_size);
>>  459
>>  460 mlog_bug_on_msg(le64_to_cpu(fe->i_size) != i_size_read(inode),  
>>   
>>  <<= here
>>  461 "Inode %llu, inode i_size = %lld != di "
>>  462 "i_size = %llu, i_flags = 0x%x\n",
>>  463 (unsigned long long)OCFS2_I(inode)->ip_blkn

Re: [Ocfs2-devel] [patch 19/25] ocfs2: o2hb: add negotiate timer

2016-03-27 Thread Junxiao Bi
Hi Yiwen,

On 03/26/2016 10:54 AM, jiangyiwen wrote:
> Hi, Junxiao
> This patch may have a problem. That is, the journal of every node becomes
> aborted when storage is down, and then when storage comes back up, because
> the journal has been aborted, all metadata operations will fail. So how do
> we restore the environment? Panic or reset? How is that triggered?
Journal aborted means an io error was returned by storage, right?
If so, o2hb_thread should also get the io error; in that case, the nego
process will be bypassed, and the nodes will be fenced in the end, see
"[patch 23/25] ocfs2: o2hb: don't negotiate if last hb fail".

Thanks,
Junxiao.
> 
> Thanks,
> Yiwen Jiang.




Re: [Ocfs2-devel] [patch 19/25] ocfs2: o2hb: add negotiate timer

2016-03-23 Thread Junxiao Bi
This is the v1 version; I sent out a V2 patch set before to fix all the
code style issues.

On 03/24/2016 04:12 AM, a...@linux-foundation.org wrote:
> From: Junxiao Bi 
> Subject: ocfs2: o2hb: add negotiate timer
> 
> This series of patches is to fix the issue that when storage is down, all
> nodes will fence themselves due to write timeout.
> 
> With this patch set, all nodes will keep going until storage is back online,
> except if one of the following issues happens, in which case all nodes will
> fence themselves as before.
> 
> 1. an io error is returned
> 2. the network between nodes goes down
> 3. a node panics
> 
> 
> This patch (of 6):
> 
> When storage is down, all nodes will fence themselves due to write timeout.
> The negotiate timer is designed to avoid this; with it, a node will wait
> until storage is up again.
> 
> Negotiate timer working in the following way:
> 
> 1. The timer expires before the write timeout timer; its timeout is now
>    half of the write timeout.  It is re-queued along with the write timeout
>    timer.  If it expires, it sends a NEGO_TIMEOUT message to the master node
>    (the node with the lowest node number).  This message does nothing but
>    mark a bit in a bitmap recording which nodes are negotiating timeout on
>    the master node.
> 
> 2. If storage is down, nodes will send this message to the master node;
>    when the master node finds its bitmap includes all online nodes, it sends
>    a NEGO_APPROVE message to all nodes one by one.  This message re-queues
>    the write timeout timer and the negotiate timer.  Any node that doesn't
>    receive this message, or meets some issue when handling it, will be
>    fenced.  If storage comes back up at any time, o2hb_thread will run and
>    re-queue all the timers; nothing will be affected by these two steps.
> 
> Signed-off-by: Junxiao Bi 
> Reviewed-by: Ryan Ding 
> Cc: Gang He 
> Cc: rwxybh 
> Cc: Mark Fasheh 
> Cc: Joel Becker 
> Cc: Joseph Qi 
> Signed-off-by: Andrew Morton 
> ---
> 
>  fs/ocfs2/cluster/heartbeat.c |   52 ++---
>  1 file changed, 48 insertions(+), 4 deletions(-)
> 
> diff -puN fs/ocfs2/cluster/heartbeat.c~ocfs2-o2hb-add-negotiate-timer 
> fs/ocfs2/cluster/heartbeat.c
> --- a/fs/ocfs2/cluster/heartbeat.c~ocfs2-o2hb-add-negotiate-timer
> +++ a/fs/ocfs2/cluster/heartbeat.c
> @@ -272,6 +272,10 @@ struct o2hb_region {
>   struct delayed_work hr_write_timeout_work;
>   unsigned long   hr_last_timeout_start;
>  
> + /* negotiate timer, used to negotiate extending hb timeout. */
> + struct delayed_work hr_nego_timeout_work;
> + unsigned long   hr_nego_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
> +
>   /* Used during o2hb_check_slot to hold a copy of the block
>* being checked because we temporarily have to zero out the
>* crc field. */
> @@ -319,7 +323,7 @@ static void o2hb_write_timeout(struct wo
>   o2quo_disk_timeout();
>  }
>  
> -static void o2hb_arm_write_timeout(struct o2hb_region *reg)
> +static void o2hb_arm_timeout(struct o2hb_region *reg)
>  {
>   /* Arm writeout only after thread reaches steady state */
> if (atomic_read(&reg->hr_steady_iterations) != 0)
> @@ -337,11 +341,50 @@ static void o2hb_arm_write_timeout(struc
>   reg->hr_last_timeout_start = jiffies;
> schedule_delayed_work(&reg->hr_write_timeout_work,
> msecs_to_jiffies(O2HB_MAX_WRITE_TIMEOUT_MS));
> +
> + cancel_delayed_work(&reg->hr_nego_timeout_work);
> + /* negotiate timeout must be less than write timeout. */
> + schedule_delayed_work(&reg->hr_nego_timeout_work,
> +   msecs_to_jiffies(O2HB_MAX_WRITE_TIMEOUT_MS)/2);
> + memset(reg->hr_nego_node_bitmap, 0, sizeof(reg->hr_nego_node_bitmap));
>  }
>  
> -static void o2hb_disarm_write_timeout(struct o2hb_region *reg)
> +static void o2hb_disarm_timeout(struct o2hb_region *reg)
>  {
> cancel_delayed_work_sync(&reg->hr_write_timeout_work);
> + cancel_delayed_work_sync(&reg->hr_nego_timeout_work);
> +}
> +
> +static void o2hb_nego_timeout(struct work_struct *work)
> +{
> + struct o2hb_region *reg =
> + container_of(work, struct o2hb_region,
> +  hr_nego_timeout_work.work);
> + unsigned long live_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
> + int master_node;
> +
> + o2hb_fill_node_map(live_node_bitmap, sizeof(live_node_bitmap));
> + /* lowest node as master node to make negotiate decision. */
> + master_node = find_next_bit(live_node_bitmap, O2NM_MAX_NODES, 0);
> +
> + if (master_node == o2nm_this_node()) {
> + set_bit(master_node, reg->hr_nego_node_bitmap);

[Ocfs2-devel] [PATCH] ocfs2: o2hb: fix double free bug

2016-03-20 Thread Junxiao Bi
This is a regression issue which caused the following kernel panic
when running the ocfs2 multiple-node test.

[  254.604228] BUG: unable to handle kernel paging request at
0002000800c0
[  254.605013] IP: [] kmem_cache_alloc+0x78/0x160
[  254.605013] PGD 7bbe5067 PUD 0
[  254.605013] Oops:  [#1] SMP
[  254.605013] Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront
[  254.605013] CPU: 2 PID: 4044 Comm: mpirun Not tainted
4.5.0-rc5-next-20160225 #1
[  254.605013] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
[  254.605013] task: 88007a521a80 ti: 88007aed task.ti:
88007aed
[  254.605013] RIP: 0010:[]  []
kmem_cache_alloc+0x78/0x160
[  254.605013] RSP: 0018:88007aed3a48  EFLAGS: 00010282
[  254.605013] RAX:  RBX:  RCX:
1991
[  254.605013] RDX: 1990 RSI: 024000c0 RDI:
0001b330
[  254.605013] RBP: 88007aed3a98 R08: 88007d29b330 R09:
0002000800c0
[  254.605013] R10: 000c51376d87 R11: 8800792cac38 R12:
88007cc30f00
[  254.605013] R13: 024000c0 R14: 811b053f R15:
88007aed3ce7
[  254.605013] FS:  () GS:88007d28()
knlGS:
[  254.605013] CS:  0010 DS:  ES:  CR0: 80050033
[  254.605013] CR2: 0002000800c0 CR3: 7aeb2000 CR4:
000406e0
[  254.605013] Stack:
[  254.605013]  13082000 88007aed3d28 0079
0001
[  254.605013]  2f2f2f2f 8800792cac00 88007aed3d38
0101
[  254.605013]  88007a5e2000 88007aed3ce7 88007aed3b08
811b053f
[  254.605013] Call Trace:
[  254.605013]  [] __d_alloc+0x2f/0x1a0
[  254.605013]  [] ? unlazy_walk+0xe2/0x160
[  254.605013]  [] d_alloc+0x17/0x80
[  254.605013]  [] lookup_dcache+0x8a/0xc0
[  254.605013]  [] ? __alloc_pages_nodemask+0x173/0xeb0
[  254.605013]  [] path_openat+0x3c3/0x1210
[  254.605013]  [] ? radix_tree_lookup_slot+0x13/0x30
[  254.605013]  [] ? find_get_entry+0x32/0xc0
[  254.605013]  [] ? atime_needs_update+0x55/0xe0
[  254.605013]  [] ? filemap_fault+0xd1/0x4b0
[  254.605013]  [] ? do_set_pte+0xb6/0x140
[  254.605013]  [] do_filp_open+0x80/0xe0
[  254.605013]  [] ? __alloc_fd+0x48/0x1a0
[  254.605013]  [] ? getname_flags+0x7a/0x1e0
[  254.605013]  [] do_sys_open+0x110/0x200
[  254.605013]  [] SyS_open+0x19/0x20
[  254.605013]  [] do_syscall_64+0x72/0x230
[  254.605013]  [] ? __do_page_fault+0x177/0x430
[  254.605013]  [] entry_SYSCALL64_slow_path+0x25/0x25
[  254.605013] Code: 05 e6 77 e7 7e 4d 8b 08 49 8b 40 10 4d 85 c9 0f 84
dd 00 00 00 48 85 c0 0f 84 d4 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d
4a 01 <49> 8b 1c 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 3c 01 75 b6 49 63
[  254.605013] RIP  [] kmem_cache_alloc+0x78/0x160
[  254.605013]  RSP 
[  254.605013] CR2: 0002000800c0
[  254.792273] ---[ end trace 823969e602e4aaac ]---

Fixes: a4a1dfa4bb8b("ocfs2/cluster: fix memory leak in o2hb_region_release")
Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/cluster/heartbeat.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index ef6a2ec494de..bd15929b5f92 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -1444,8 +1444,8 @@ static void o2hb_region_release(struct config_item *item)
debugfs_remove(reg->hr_debug_dir);
kfree(reg->hr_db_livenodes);
kfree(reg->hr_db_regnum);
-   kfree(reg->hr_debug_elapsed_time);
-   kfree(reg->hr_debug_pinned);
+   kfree(reg->hr_db_elapsed_time);
+   kfree(reg->hr_db_pinned);
 
spin_lock(&o2hb_live_lock);
	list_del(&reg->hr_all_item);
-- 
1.7.9.5
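
For readers unfamiliar with this bug class, here is a minimal userspace
sketch (made-up field names, not the kernel structures) of why kfree()-ing
the similarly named member is a double free: the other member is owned by
a different subsystem, which releases it again during its own teardown.

#include <stdlib.h>

struct region {
        char *db_elapsed_time;    /* allocated by this code: ours to free */
        void *debug_elapsed_time; /* owned elsewhere (debugfs in the patch) */
};

static void region_release(struct region *reg)
{
        free(reg->db_elapsed_time);   /* correct, what the fix does */
        /* the regression did the equivalent of
         *     free(reg->debug_elapsed_time);
         * and the real owner later frees that object again */
}

int main(void)
{
        struct region reg = { malloc(16), NULL };
        region_release(&reg);
        return 0;
}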




Re: [Ocfs2-devel] Could the master return DLM_NORMAL when unlock nonexistent locks from other node?

2016-03-19 Thread Junxiao Bi
On 03/15/2016 09:55 AM, Shichangkuo wrote:
> Hi all,
> When NodeA wants to unlock lock-res1, it sends a message to NodeB, but on
> NodeB none of the lock queues (granted, converting, blocked) can find this
> lock for some unknown reason, so NodeB replies DLM_IVLOCKID.
> In this situation, NodeA BUGs. The details are described as follows.
> 
> NODEA  NODEB
> ocfs2_drop_lock
> ocfs2_dlm_unlock
> o2cb_dlm_unlock
> dlmunlock
> dlmunlock_remote>   send message to master
>   dlm_unlock_lock_handler
>   return DLM_IVLOCKID
> BUG()
> 
> I think it's not necessary to let NodeA BUG; it is just like removing a
> nonexistent file.
> Could NodeB return DLM_NORMAL?
No. This is a lock inconsistency bug. What kernel version are you using?
Please check whether the patch set "ocfs2: o2net: don't shutdown connection
when idle timeout" is there. Without it, this lock inconsistency can be
triggered.

Thanks,
Junxiao.
> 
> Thanks
> Changkuo


Re: [Ocfs2-devel] [QUESTION] How to recover a deleted file?

2016-03-03 Thread Junxiao Bi
On 03/04/2016 08:47 AM, Shichangkuo wrote:
> Hi All,
> 
> I have removed a file which was very important to me by mistake.
> Has anyone encountered a problem like this before, and can the file be
> undeleted?
Maybe you can rebuild it manually if the released clusters have not been
reused. Unmount the volume first, then use the logdump command of
debugfs.ocfs2 to find the latest extent tree of the file from before it
was removed. With this, you can copy the removed file out. Good luck.

Thanks,
Junxiao.

> 
>  
> 
> Many thanks.
> 
> Changkuo.Shi
> 


[Ocfs2-devel] ocfs2: o2hb: not fence self if storage down

2016-03-02 Thread Junxiao Bi

Hi Mark,

This series of patches is to fix the issue that when storage is down,
all nodes will fence themselves due to write timeout.
With this patch set, all nodes will keep going until storage is back
online, except if one of the following issues happens, in which case
all nodes will fence themselves as before.
1. an io error is returned
2. the network between nodes goes down
3. a node panics

---
Changes from V1:
- code style fix.

Junxiao Bi (6):
  ocfs2: o2hb: add negotiate timer
  ocfs2: o2hb: add NEGO_TIMEOUT message
  ocfs2: o2hb: add NEGOTIATE_APPROVE message
  ocfs2: o2hb: add some user/debug log
  ocfs2: o2hb: don't negotiate if last hb fail
  ocfs2: o2hb: fix hb hung time

fs/ocfs2/cluster/heartbeat.c |  180 --
 1 file changed, 174 insertions(+), 6 deletions(-)


Thanks,
Junxiao.



[Ocfs2-devel] [PATCH V2 6/6] ocfs2: o2hb: fix hb hung time

2016-03-02 Thread Junxiao Bi
hr_last_timeout_start should be set to the last time when the hb was still
OK. When the hb write times out, the hung time will be
(jiffies - hr_last_timeout_start).

Signed-off-by: Junxiao Bi 
Reviewed-by: Ryan Ding 
Cc: Gang He 
Cc: rwxybh 
Cc: Mark Fasheh 
Cc: Joel Becker 
Cc: Joseph Qi 
Signed-off-by: Andrew Morton 
---
 fs/ocfs2/cluster/heartbeat.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index c040fc3dd605..8ec85cac894e 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -357,7 +357,6 @@ static void o2hb_arm_timeout(struct o2hb_region *reg)
spin_unlock(&o2hb_live_lock);
}
	cancel_delayed_work(&reg->hr_write_timeout_work);
-   reg->hr_last_timeout_start = jiffies;
	schedule_delayed_work(&reg->hr_write_timeout_work,
  msecs_to_jiffies(O2HB_MAX_WRITE_TIMEOUT_MS));
 
@@ -1175,6 +1174,7 @@ static int o2hb_do_disk_heartbeat(struct o2hb_region *reg)
if (own_slot_ok) {
o2hb_set_quorum_device(reg);
o2hb_arm_timeout(reg);
+   reg->hr_last_timeout_start = jiffies;
}
 
 bail:
-- 
1.7.9.5




[Ocfs2-devel] [PATCH] ocfs2: o2hb: remove useless force cast

2016-03-02 Thread Junxiao Bi
Signed-off-by: Junxiao Bi 
---
 fs/ocfs2/cluster/heartbeat.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 8ec85cac894e..023c72d35498 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -1315,19 +1315,19 @@ static int o2hb_debug_open(struct inode *inode, struct file *file)
 
case O2HB_DB_TYPE_REGION_LIVENODES:
spin_lock(&o2hb_live_lock);
-   reg = (struct o2hb_region *)db->db_data;
+   reg = db->db_data;
memcpy(map, reg->hr_live_node_bitmap, db->db_size);
spin_unlock(&o2hb_live_lock);
break;
 
case O2HB_DB_TYPE_REGION_NUMBER:
-   reg = (struct o2hb_region *)db->db_data;
+   reg = db->db_data;
out += snprintf(buf + out, PAGE_SIZE - out, "%d\n",
reg->hr_region_num);
goto done;
 
case O2HB_DB_TYPE_REGION_ELAPSED_TIME:
-   reg = (struct o2hb_region *)db->db_data;
+   reg = db->db_data;
lts = reg->hr_last_timeout_start;
/* If 0, it has never been set before */
if (lts)
@@ -1336,7 +1336,7 @@ static int o2hb_debug_open(struct inode *inode, struct file *file)
goto done;
 
case O2HB_DB_TYPE_REGION_PINNED:
-   reg = (struct o2hb_region *)db->db_data;
+   reg = db->db_data;
out += snprintf(buf + out, PAGE_SIZE - out, "%u\n",
!!reg->hr_item_pinned);
goto done;
-- 
1.7.9.5




[Ocfs2-devel] [PATCH V2 1/6] ocfs2: o2hb: add negotiate timer

2016-03-02 Thread Junxiao Bi
This series of patches is to fix the issue that when storage is down, all
nodes will fence themselves due to write timeout.

With this patch set, all nodes will keep going until storage is back online,
except if one of the following issues happens, in which case all nodes will
fence themselves as before.

1. an io error is returned
2. the network between nodes goes down
3. a node panics

This patch (of 6):

When storage is down, all nodes will fence themselves due to write timeout.
The negotiate timer is designed to avoid this; with it, a node will wait
until storage is up again.

Negotiate timer working in the following way:

1. The timer expires before the write timeout timer; its timeout is now
   half of the write timeout.  It is re-queued along with the write timeout
   timer.  If it expires, it sends a NEGO_TIMEOUT message to the master node
   (the node with the lowest node number).  This message does nothing but
   mark a bit in a bitmap recording which nodes are negotiating timeout on
   the master node.

2. If storage is down, nodes will send this message to the master node;
   when the master node finds its bitmap includes all online nodes, it sends
   a NEGO_APPROVE message to all nodes one by one.  This message re-queues
   the write timeout timer and the negotiate timer.  Any node that doesn't
   receive this message, or meets some issue when handling it, will be
   fenced.  If storage comes back up at any time, o2hb_thread will run and
   re-queue all the timers; nothing will be affected by these two steps.

Signed-off-by: Junxiao Bi 
Reviewed-by: Ryan Ding 
Cc: Gang He 
Cc: rwxybh 
Cc: Mark Fasheh 
Cc: Joel Becker 
Cc: Joseph Qi 
Signed-off-by: Andrew Morton 
---
 fs/ocfs2/cluster/heartbeat.c |   51 ++
 1 file changed, 47 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index a76b9ea7722e..59982b34b7a5 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -272,6 +272,10 @@ struct o2hb_region {
struct delayed_work hr_write_timeout_work;
unsigned long   hr_last_timeout_start;
 
+   /* negotiate timer, used to negotiate extending hb timeout. */
+   struct delayed_work hr_nego_timeout_work;
+   unsigned long   hr_nego_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
+
/* Used during o2hb_check_slot to hold a copy of the block
 * being checked because we temporarily have to zero out the
 * crc field. */
@@ -320,7 +324,7 @@ static void o2hb_write_timeout(struct work_struct *work)
o2quo_disk_timeout();
 }
 
-static void o2hb_arm_write_timeout(struct o2hb_region *reg)
+static void o2hb_arm_timeout(struct o2hb_region *reg)
 {
/* Arm writeout only after thread reaches steady state */
	if (atomic_read(&reg->hr_steady_iterations) != 0)
@@ -338,11 +342,49 @@ static void o2hb_arm_write_timeout(struct o2hb_region *reg)
reg->hr_last_timeout_start = jiffies;
	schedule_delayed_work(&reg->hr_write_timeout_work,
	  msecs_to_jiffies(O2HB_MAX_WRITE_TIMEOUT_MS));
+
+   cancel_delayed_work(&reg->hr_nego_timeout_work);
+   /* negotiate timeout must be less than write timeout. */
+   schedule_delayed_work(&reg->hr_nego_timeout_work,
+ msecs_to_jiffies(O2HB_MAX_WRITE_TIMEOUT_MS)/2);
+   memset(reg->hr_nego_node_bitmap, 0, sizeof(reg->hr_nego_node_bitmap));
 }
 
-static void o2hb_disarm_write_timeout(struct o2hb_region *reg)
+static void o2hb_disarm_timeout(struct o2hb_region *reg)
 {
	cancel_delayed_work_sync(&reg->hr_write_timeout_work);
+   cancel_delayed_work_sync(&reg->hr_nego_timeout_work);
+}
+
+static void o2hb_nego_timeout(struct work_struct *work)
+{
+   unsigned long live_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
+   int master_node;
+   struct o2hb_region *reg;
+
+   reg = container_of(work, struct o2hb_region, hr_nego_timeout_work.work);
+   o2hb_fill_node_map(live_node_bitmap, sizeof(live_node_bitmap));
+   /* lowest node as master node to make negotiate decision. */
+   master_node = find_next_bit(live_node_bitmap, O2NM_MAX_NODES, 0);
+
+   if (master_node == o2nm_this_node()) {
+   set_bit(master_node, reg->hr_nego_node_bitmap);
+   if (memcmp(reg->hr_nego_node_bitmap, live_node_bitmap,
+   sizeof(reg->hr_nego_node_bitmap))) {
+   /* check negotiate bitmap every second to do timeout
+* approve decision.
+*/
+   schedule_delayed_work(&reg->hr_nego_timeout_work,
+   msecs_to_jiffies(1000));
+
+   return;
+   }
+
+   /* approve negotiate timeout request. */
+   } else {
+   /* negotiate timeout with master node. */
+   }
+
 }
 
 static inline void o2hb_bio_wait_init(struct o2hb_bio_wait_ctxt *wc)
@@ -1033,7 +1075,7

[Ocfs2-devel] [PATCH V2 3/6] ocfs2: o2hb: add NEGOTIATE_APPROVE message

2016-03-02 Thread Junxiao Bi
This message is used to re-queue the write timeout timer and the negotiate
timer when all nodes suffer a write hang to storage. This keeps nodes from
fencing themselves when storage is down.

Signed-off-by: Junxiao Bi 
Reviewed-by: Ryan Ding 
Cc: Gang He 
Cc: rwxybh 
Cc: Mark Fasheh 
Cc: Joel Becker 
Cc: Joseph Qi 
Signed-off-by: Andrew Morton 
---
 fs/ocfs2/cluster/heartbeat.c |   28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 39547d7090d3..9ac01dcd1feb 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -294,6 +294,7 @@ struct o2hb_bio_wait_ctxt {
 
 enum {
O2HB_NEGO_TIMEOUT_MSG = 1,
+   O2HB_NEGO_APPROVE_MSG = 2,
 };
 
 struct o2hb_nego_msg {
@@ -389,7 +390,7 @@ again:
 static void o2hb_nego_timeout(struct work_struct *work)
 {
unsigned long live_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
-   int master_node;
+   int master_node, i;
struct o2hb_region *reg;
 
reg = container_of(work, struct o2hb_region, hr_nego_timeout_work.work);
@@ -411,6 +412,17 @@ static void o2hb_nego_timeout(struct work_struct *work)
}
 
/* approve negotiate timeout request. */
+   o2hb_arm_timeout(reg);
+
+   i = -1;
+   while ((i = find_next_bit(live_node_bitmap,
+   O2NM_MAX_NODES, i + 1)) < O2NM_MAX_NODES) {
+   if (i == master_node)
+   continue;
+
+   o2hb_send_nego_msg(reg->hr_key,
+   O2HB_NEGO_APPROVE_MSG, i);
+   }
} else {
/* negotiate timeout with master node. */
o2hb_send_nego_msg(reg->hr_key, O2HB_NEGO_TIMEOUT_MSG,
@@ -433,6 +445,13 @@ static int o2hb_nego_timeout_handler(struct o2net_msg *msg, u32 len, void *data,
return 0;
 }
 
+static int o2hb_nego_approve_handler(struct o2net_msg *msg, u32 len, void *data,
+   void **ret_data)
+{
+   o2hb_arm_timeout(data);
+   return 0;
+}
+
 static inline void o2hb_bio_wait_init(struct o2hb_bio_wait_ctxt *wc)
 {
atomic_set(&wc->wc_num_reqs, 1);
@@ -2101,6 +2120,13 @@ static struct config_item *o2hb_heartbeat_group_make_item(struct config_group *g
if (ret)
goto free;
 
+   ret = o2net_register_handler(O2HB_NEGO_APPROVE_MSG, reg->hr_key,
+   sizeof(struct o2hb_nego_msg),
+   o2hb_nego_approve_handler,
+   reg, NULL, &reg->hr_handler_list);
+   if (ret)
+   goto unregister_handler;
+
ret = o2hb_debug_region_init(reg, o2hb_debug_dir);
if (ret) {
	config_item_put(&reg->hr_item);
-- 
1.7.9.5




[Ocfs2-devel] [PATCH V2 4/6] ocfs2: o2hb: add some user/debug log

2016-03-02 Thread Junxiao Bi
Signed-off-by: Junxiao Bi 
Reviewed-by: Ryan Ding 
Cc: Gang He 
Cc: rwxybh 
Cc: Mark Fasheh 
Cc: Joel Becker 
Cc: Joseph Qi 
Signed-off-by: Andrew Morton 
---
 fs/ocfs2/cluster/heartbeat.c |   39 ---
 1 file changed, 32 insertions(+), 7 deletions(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 9ac01dcd1feb..9f4a02ed85fd 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -292,6 +292,8 @@ struct o2hb_bio_wait_ctxt {
int   wc_error;
 };
 
+#define O2HB_NEGO_TIMEOUT_MS (O2HB_MAX_WRITE_TIMEOUT_MS/2)
+
 enum {
O2HB_NEGO_TIMEOUT_MSG = 1,
O2HB_NEGO_APPROVE_MSG = 2,
@@ -359,7 +361,7 @@ static void o2hb_arm_timeout(struct o2hb_region *reg)
	cancel_delayed_work(&reg->hr_nego_timeout_work);
	/* negotiate timeout must be less than write timeout. */
	schedule_delayed_work(&reg->hr_nego_timeout_work,
- msecs_to_jiffies(O2HB_MAX_WRITE_TIMEOUT_MS)/2);
+ msecs_to_jiffies(O2HB_NEGO_TIMEOUT_MS));
memset(reg->hr_nego_node_bitmap, 0, sizeof(reg->hr_nego_node_bitmap));
 }
 
@@ -390,7 +392,7 @@ again:
 static void o2hb_nego_timeout(struct work_struct *work)
 {
unsigned long live_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
-   int master_node, i;
+   int master_node, i, ret;
struct o2hb_region *reg;
 
reg = container_of(work, struct o2hb_region, hr_nego_timeout_work.work);
@@ -399,7 +401,12 @@ static void o2hb_nego_timeout(struct work_struct *work)
master_node = find_next_bit(live_node_bitmap, O2NM_MAX_NODES, 0);
 
if (master_node == o2nm_this_node()) {
-   set_bit(master_node, reg->hr_nego_node_bitmap);
+   if (!test_bit(master_node, reg->hr_nego_node_bitmap)) {
+   printk(KERN_NOTICE "o2hb: node %d hb write hung for %ds on region %s (%s).\n",
+   o2nm_this_node(), O2HB_NEGO_TIMEOUT_MS/1000,
+   config_item_name(&reg->hr_item), reg->hr_dev_name);
+   set_bit(master_node, reg->hr_nego_node_bitmap);
+   }
if (memcmp(reg->hr_nego_node_bitmap, live_node_bitmap,
sizeof(reg->hr_nego_node_bitmap))) {
/* check negotiate bitmap every second to do timeout
@@ -411,6 +418,8 @@ static void o2hb_nego_timeout(struct work_struct *work)
return;
}
 
+   printk(KERN_NOTICE "o2hb: all nodes hb write hung, maybe region %s (%s) is down.\n",
+   config_item_name(&reg->hr_item), reg->hr_dev_name);
/* approve negotiate timeout request. */
o2hb_arm_timeout(reg);
 
@@ -420,13 +429,23 @@ static void o2hb_nego_timeout(struct work_struct *work)
if (i == master_node)
continue;
 
-   o2hb_send_nego_msg(reg->hr_key,
+   mlog(ML_HEARTBEAT, "send NEGO_APPROVE msg to node 
%d\n", i);
+   ret = o2hb_send_nego_msg(reg->hr_key,
O2HB_NEGO_APPROVE_MSG, i);
+   if (ret)
+   mlog(ML_ERROR, "send NEGO_APPROVE msg to node 
%d fail %d\n",
+   i, ret);
}
} else {
/* negotiate timeout with master node. */
-   o2hb_send_nego_msg(reg->hr_key, O2HB_NEGO_TIMEOUT_MSG,
-   master_node);
+   printk(KERN_NOTICE "o2hb: node %d hb write hung for %ds on region %s (%s), negotiate timeout with node %d.\n",
+   o2nm_this_node(), O2HB_NEGO_TIMEOUT_MS/1000, config_item_name(&reg->hr_item),
+   reg->hr_dev_name, master_node);
+   ret = o2hb_send_nego_msg(reg->hr_key, O2HB_NEGO_TIMEOUT_MSG,
+   master_node);
+   if (ret)
+   mlog(ML_ERROR, "send NEGO_TIMEOUT msg to node %d fail %d\n",
+   master_node, ret);
}
 }
 
@@ -437,6 +456,8 @@ static int o2hb_nego_timeout_handler(struct o2net_msg *msg, u32 len, void *data,
struct o2hb_nego_msg *nego_msg;
 
nego_msg = (struct o2hb_nego_msg *)msg->buf;
+   printk(KERN_NOTICE "o2hb: receive negotiate timeout message from node %d on region %s (%s).\n",
+   nego_msg->node_num, config_item_name(&reg->hr_item), reg->hr_dev_name);
if (nego_msg->node_num < O2NM_MAX_NODES)
set_bit(nego_msg->node_num, reg->hr_nego_node_bitmap);
else
@@ -448,7 +469,11 @@ static int o2hb_nego_timeout_handler(struct o2net_msg *msg,

[Ocfs2-devel] [PATCH V2 2/6] ocfs2: o2hb: add NEGO_TIMEOUT message

2016-03-02 Thread Junxiao Bi
This message is sent to the master node when a non-master node's negotiate
timer expires.  The master node records these nodes in a bitmap, which is
used to make the write timeout timer re-queue decision.

Signed-off-by: Junxiao Bi 
Reviewed-by: Ryan Ding 
Cc: Gang He 
Cc: rwxybh 
Cc: Mark Fasheh 
Cc: Joel Becker 
Cc: Joseph Qi 
Signed-off-by: Andrew Morton 
---
 fs/ocfs2/cluster/heartbeat.c |   66 +-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 59982b34b7a5..39547d7090d3 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -280,6 +280,10 @@ struct o2hb_region {
 * being checked because we temporarily have to zero out the
 * crc field. */
struct o2hb_disk_heartbeat_block *hr_tmp_block;
+
+   /* Message key for negotiate timeout message. */
+   unsigned int    hr_key;
+   struct list_head    hr_handler_list;
 };
 
 struct o2hb_bio_wait_ctxt {
@@ -288,6 +292,14 @@ struct o2hb_bio_wait_ctxt {
int   wc_error;
 };
 
+enum {
+   O2HB_NEGO_TIMEOUT_MSG = 1,
+};
+
+struct o2hb_nego_msg {
+   u8 node_num;
+};
+
 static void o2hb_write_timeout(struct work_struct *work)
 {
int failed, quorum;
@@ -356,6 +368,24 @@ static void o2hb_disarm_timeout(struct o2hb_region *reg)
cancel_delayed_work_sync(®->hr_nego_timeout_work);
 }
 
+static int o2hb_send_nego_msg(int key, int type, u8 target)
+{
+   struct o2hb_nego_msg msg;
+   int status, ret;
+
+   msg.node_num = o2nm_this_node();
+again:
+   ret = o2net_send_message(type, key, &msg, sizeof(msg),
+   target, &status);
+
+   if (ret == -EAGAIN || ret == -ENOMEM) {
+   msleep(100);
+   goto again;
+   }
+
+   return ret;
+}
+
 static void o2hb_nego_timeout(struct work_struct *work)
 {
unsigned long live_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
@@ -383,8 +413,24 @@ static void o2hb_nego_timeout(struct work_struct *work)
/* approve negotiate timeout request. */
} else {
/* negotiate timeout with master node. */
+   o2hb_send_nego_msg(reg->hr_key, O2HB_NEGO_TIMEOUT_MSG,
+   master_node);
}
+}
+
+static int o2hb_nego_timeout_handler(struct o2net_msg *msg, u32 len, void *data,
+   void **ret_data)
+{
+   struct o2hb_region *reg = data;
+   struct o2hb_nego_msg *nego_msg;
 
+   nego_msg = (struct o2hb_nego_msg *)msg->buf;
+   if (nego_msg->node_num < O2NM_MAX_NODES)
+   set_bit(nego_msg->node_num, reg->hr_nego_node_bitmap);
+   else
+   mlog(ML_ERROR, "got nego timeout message from bad node.\n");
+
+   return 0;
 }
 
 static inline void o2hb_bio_wait_init(struct o2hb_bio_wait_ctxt *wc)
@@ -1494,6 +1540,7 @@ static void o2hb_region_release(struct config_item *item)
	list_del(&reg->hr_all_item);
spin_unlock(&o2hb_live_lock);
 
+   o2net_unregister_handler_list(&reg->hr_handler_list);
kfree(reg);
 }
 
@@ -2040,13 +2087,30 @@ static struct config_item *o2hb_heartbeat_group_make_item(struct config_group *g
 
	config_item_init_type_name(&reg->hr_item, name, &o2hb_region_type);
 
+   /* this is the same way to generate msg key as dlm, for local heartbeat,
+* name is also the same, so make initial crc value different to avoid
+* message key conflict.
+*/
+   reg->hr_key = crc32_le(reg->hr_region_num + O2NM_MAX_REGIONS,
+   name, strlen(name));
+   INIT_LIST_HEAD(&reg->hr_handler_list);
+   ret = o2net_register_handler(O2HB_NEGO_TIMEOUT_MSG, reg->hr_key,
+   sizeof(struct o2hb_nego_msg),
+   o2hb_nego_timeout_handler,
+   reg, NULL, &reg->hr_handler_list);
+   if (ret)
+   goto free;
+
ret = o2hb_debug_region_init(reg, o2hb_debug_dir);
if (ret) {
config_item_put(®->hr_item);
-   goto free;
+   goto unregister_handler;
}
 
return ®->hr_item;
+
+unregister_handler:
+   o2net_unregister_handler_list(&reg->hr_handler_list);
 free:
kfree(reg);
return ERR_PTR(ret);
-- 
1.7.9.5




[Ocfs2-devel] [PATCH V2 5/6] ocfs2: o2hb: don't negotiate if last hb fail

2016-03-02 Thread Junxiao Bi
Sometimes an io error is returned when storage is down for a while.  For an
iscsi device, for example, the storage is taken offline when the session
times out, and this makes all io return -EIO.  In this case, nodes shouldn't
negotiate the timeout but should fence themselves.  So let nodes fence
themselves when o2hb_do_disk_heartbeat returns an error; this is the same
behavior as o2hb without the negotiate timer.

Signed-off-by: Junxiao Bi 
Reviewed-by: Ryan Ding 
Cc: Gang He 
Cc: rwxybh 
Cc: Mark Fasheh 
Cc: Joel Becker 
Cc: Joseph Qi 
Signed-off-by: Andrew Morton 
---
 fs/ocfs2/cluster/heartbeat.c |   10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 9f4a02ed85fd..c040fc3dd605 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -284,6 +284,9 @@ struct o2hb_region {
/* Message key for negotiate timeout message. */
	unsigned int    hr_key;
	struct list_head    hr_handler_list;
+
+   /* last hb status, 0 for success, other value for error. */
+   int hr_last_hb_status;
 };
 
 struct o2hb_bio_wait_ctxt {
@@ -396,6 +399,12 @@ static void o2hb_nego_timeout(struct work_struct *work)
struct o2hb_region *reg;
 
reg = container_of(work, struct o2hb_region, hr_nego_timeout_work.work);
+   /* don't negotiate timeout if last hb failed since it is very
+* possible io failed. Should let write timeout fence self.
+*/
+   if (reg->hr_last_hb_status)
+   return;
+
o2hb_fill_node_map(live_node_bitmap, sizeof(live_node_bitmap));
/* lowest node as master node to make negotiate decision. */
master_node = find_next_bit(live_node_bitmap, O2NM_MAX_NODES, 0);
@@ -1229,6 +1238,7 @@ static int o2hb_thread(void *data)
before_hb = ktime_get_real();
 
ret = o2hb_do_disk_heartbeat(reg);
+   reg->hr_last_hb_status = ret;
 
after_hb = ktime_get_real();
 
-- 
1.7.9.5




[Ocfs2-devel] kernel panic on next-20160225

2016-02-25 Thread Junxiao Bi
Hi,

The following panic is triggered when running the ocfs2 xattr test on
linux-next-20160225. Did anybody ever see this?

[  254.604228] BUG: unable to handle kernel paging request at
0002000800c0
[  254.605013] IP: [] kmem_cache_alloc+0x78/0x160
[  254.605013] PGD 7bbe5067 PUD 0
[  254.605013] Oops:  [#1] SMP
[  254.605013] Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi xen_kbdfront xen_netfront xen_fbfront xen_blkfront
[  254.605013] CPU: 2 PID: 4044 Comm: mpirun Not tainted
4.5.0-rc5-next-20160225 #1
[  254.605013] Hardware name: Xen HVM domU, BIOS 4.3.1OVM 05/14/2014
[  254.605013] task: 88007a521a80 ti: 88007aed task.ti:
88007aed
[  254.605013] RIP: 0010:[]  []
kmem_cache_alloc+0x78/0x160
[  254.605013] RSP: 0018:88007aed3a48  EFLAGS: 00010282
[  254.605013] RAX:  RBX:  RCX:
1991
[  254.605013] RDX: 1990 RSI: 024000c0 RDI:
0001b330
[  254.605013] RBP: 88007aed3a98 R08: 88007d29b330 R09:
0002000800c0
[  254.605013] R10: 000c51376d87 R11: 8800792cac38 R12:
88007cc30f00
[  254.605013] R13: 024000c0 R14: 811b053f R15:
88007aed3ce7
[  254.605013] FS:  () GS:88007d28()
knlGS:
[  254.605013] CS:  0010 DS:  ES:  CR0: 80050033
[  254.605013] CR2: 0002000800c0 CR3: 7aeb2000 CR4:
000406e0
[  254.605013] Stack:
[  254.605013]  13082000 88007aed3d28 0079
0001
[  254.605013]  2f2f2f2f 8800792cac00 88007aed3d38
0101
[  254.605013]  88007a5e2000 88007aed3ce7 88007aed3b08
811b053f
[  254.605013] Call Trace:
[  254.605013]  [] __d_alloc+0x2f/0x1a0
[  254.605013]  [] ? unlazy_walk+0xe2/0x160
[  254.605013]  [] d_alloc+0x17/0x80
[  254.605013]  [] lookup_dcache+0x8a/0xc0
[  254.605013]  [] ? __alloc_pages_nodemask+0x173/0xeb0
[  254.605013]  [] path_openat+0x3c3/0x1210
[  254.605013]  [] ? radix_tree_lookup_slot+0x13/0x30
[  254.605013]  [] ? find_get_entry+0x32/0xc0
[  254.605013]  [] ? atime_needs_update+0x55/0xe0
[  254.605013]  [] ? filemap_fault+0xd1/0x4b0
[  254.605013]  [] ? do_set_pte+0xb6/0x140
[  254.605013]  [] do_filp_open+0x80/0xe0
[  254.605013]  [] ? __alloc_fd+0x48/0x1a0
[  254.605013]  [] ? getname_flags+0x7a/0x1e0
[  254.605013]  [] do_sys_open+0x110/0x200
[  254.605013]  [] SyS_open+0x19/0x20
[  254.605013]  [] do_syscall_64+0x72/0x230
[  254.605013]  [] ? __do_page_fault+0x177/0x430
[  254.605013]  [] entry_SYSCALL64_slow_path+0x25/0x25
[  254.605013] Code: 05 e6 77 e7 7e 4d 8b 08 49 8b 40 10 4d 85 c9 0f 84
dd 00 00 00 48 85 c0 0f 84 d4 00 00 00 49 63 44 24 20 49 8b 3c 24 48 8d
4a 01 <49> 8b 1c 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 3c 01 75 b6 49 63
[  254.605013] RIP  [] kmem_cache_alloc+0x78/0x160
[  254.605013]  RSP 
[  254.605013] CR2: 0002000800c0
[  254.792273] ---[ end trace 823969e602e4aaac ]---

Thanks,
Junxiao.



Re: [Ocfs2-devel] ocfs2-test for v4.3 done

2016-02-23 Thread Junxiao Bi
Hi Eric,

On 02/19/2016 11:01 AM, Eric Ren wrote:
> Hi Junxiao,
> 
> On Wed, Feb 17, 2016 at 10:15:56AM +0800, Junxiao Bi wrote: 
>> Hi Eric,
>>
>> I remember i described it before, please search it on ocfs2-devel. For
>> ocfs2 env setup, please refer to README in ocfs2-test.
> 
> Yes, you did. Actually, it's the  quoted paragraph below;-)
> Thanks, but more things what I really want to learn about are, such as:
> 1. git hook/auto control scripts, if it's OK to share;
Sure. Attached.
> 2. pain point and solution, for example, a latest tagged release kernel may
>not compile successfully by `make defconfig` or `cp /boot/config-`uname 
> -r``;
>Or cannot boot up even if we've built kernel RPM and installed it.
That may be because some config options are not enabled for your platform.
I used xen vms as test nodes, used make xenconfig to generate the .config,
and it works well.

There are two pain points for this test framework:
1. auto bisect to spot regression issues.
2. improve ocfs2-test speed.
Now it needs several days to finish the test. Better to split the test into
a function test and a stress test.

Please think about it when build your test framework.

Thanks,
Junxiao.
> 
>Did you have this problem? Any suggestion;-) What I can think of is to try 
> opensuse
>tumbleweed distribution(a rolling release).
> 
>>
>> On 02/16/2016 05:54 PM, Eric Ren wrote:
>>> Hi Junxiao,
>>>
>>>> Four vms are used: one as the git server, and the other three to build
>>>> the kernel and run the ocfs2 test. Tag a branch and push it to the git
>>>> server, and the test will be started. The test cases to run can be set
>>>> in the tag message.
>>>
>>> Recently, I get free days and also want to setup automatic testing env for 
>>> ocfs2.
>>> I'll use pcmk as cluster stack while you probably use o2cb. I think we can 
>>> complement
>>> each other.  May I bother you to describe the work flow more specifically, 
>>> or share
>>> experiences and references? That way it would save my life;-)
>>>
> Thanks!
> Eric
> 

#!/bin/sh
#
# An example hook script to prepare a packed repository for use over
# dumb transports.
#
# To enable this hook, rename this file to "post-update".

#exec git update-server-info

function parse_tag() {
tag=`git tag -n1 -l $tag_name | awk '{print $2}'`
echo $tag | grep -sq $BUILD_TAG
if [ $? -eq 0 ]; then
command=1
fi

echo $tag | grep -sq $TEST_TAG
if [ $? -eq 0 ]; then
echo $tag | grep -sq $TESTONLY_TAG
if [ $? -eq 0 ]; then
command=3
cases=`echo $tag | sed "s:^${TESTONLY_TAG}-::"`
else
command=2
cases=`echo $tag | sed "s:^${TEST_TAG}-::"`
fi

if [ X"$cases" = X"$tag" ];then
cases="all"
fi

for cas in `echo $cases | sed "s:,: :g"`; do
echo $USERDEF_TESTCASES " " $SINGLE_TESTCASES " " $MULTIPLE_TESTCASES | grep -sqw $cas
if [ $? -ne 0 ]; then
echo "error, testcase [${cas}] not supported."
echo "The following testcases are supported:"
echo "user defined testcases: 
$USERDEF_TESTCASES"
echo "single testcases:  $SINGLE_TESTCASES"
echo "multiple testcases: $MULTIPLE_TESTCASES"
exit 1
else
testcase=${testcase}" "${cas}
fi
done

if [ -z "$testcase" ]; then
exit 1
fi
fi

if [ $command -eq 0 ]; then
exit 0
elif [ $command -eq 1 ];then
echo "command: build $version"
elif [ $command -eq 2 ]; then
echo "command: test $version"
echo "testcase: $testcase"
else
echo "command: testonly"
echo "testcase: $testcase"
fi
}

function build_kernel() {
echo "archive tag: $tag_name to ${version}.tar"
git archive --prefix=${version}/ $tag_name > /tmp/${version}.tar
echo "copy ${version}.tar to $BUILD_SERVER:$BUILD_PATH"
scp /tmp/${version}.tar root@${BUILD_SERVER}:${BUILD_PATH} && rm -rf 
/tmp/${version}.tar
echo "un

Re: [Ocfs2-devel] ocfs2-test for v4.3 done

2016-02-16 Thread Junxiao Bi
Hi Eric,

I remember I described it before, please search it on ocfs2-devel. For
ocfs2 env setup, please refer to README in ocfs2-test.

Thanks,
Junxiao.

On 02/16/2016 05:54 PM, Eric Ren wrote:
> Hi Junxiao,
> 
 I have setup a test env to build and auto do ocfs2 test. With it, Ocfs2
 for mainline and linux-next will be test regularly, the test status and
 bugs will be reported to ocfs2-devel. Feel free to take any bug if you
 are interested, it will be a good start point with ocfs2. Hope this can
 catch regression bugs earlier before merged by mainline.
>>> I'm very interested in how your testing env works, could you briefly 
>>> introduce
>>> it?
>> Four vm are used, one for git server, and the other three to build
>> kernel and run ocfs2 test. Tag a branch and push it to the git server,
>> the test will be started. The test cases to run can be set in tag message.
> 
> Recently, I get free days and also want to setup automatic testing env for 
> ocfs2.
> I'll use pcmk as cluster stack while you probably use o2cb. I think we can 
> complement
> each other.  May I bother you to describe the work flow more specifically, or 
> share
> experiences and references? That way it would save my life;-)
> 
> Thanks!
> Eric
> 




[Ocfs2-devel] Ocfs2 test for linux-next-20160122 passed

2016-01-27 Thread Junxiao Bi
This means the following patches have passed ocfs2-test. The first three
were merged by me to avoid that recursive deadlock issue.

=

-inode deadlock in ocfs2_mknode due to using posix_acl_create
-posix_acl_create unsuitable to use in ocfs2_reflink
-revert to using ocfs2_acl_chmod to avoid inode cluster lock hang
-ocfs2: check/fix inode block for online file check
-ocfs2: create/remove sysfile for online file check
-ocfs2: sysfile interfaces for online file check
-ocfs2: export ocfs2_kset for online file check
-ocfs2: solve a problem of crossing the boundary in updating backups
-ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local
-ocfs2: extend enough credits for freeing one truncate record while
replaying truncate records
-ocfs2: extend transaction for ocfs2_remove_rightmost_path() and
ocfs2_update_edge_lengths() before to avoid inconsistency between inode
and et
-ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list
-ocfs2-dlm-fix-race-between-convert-and-recovery-v3
-ocfs2-dlm-fix-race-between-convert-and-recovery-v2
-ocfs2/dlm: fix race between convert and recovery
-ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write()
-ocfs2: fix disk file size and memory file size mismatch
-ocfs2: take ip_alloc_sem in ocfs2_dio_get_block & ocfs2_dio_end_io_write
-ocfs2-fix-ip_unaligned_aio-deadlock-with-dio-work-queue-fix
-ocfs2: fix ip_unaligned_aio deadlock with dio work queue
-ocfs2: code clean up for direct io
-ocfs2: fix sparse file & data ordering issue in direct io
-ocfs2: record UNWRITTEN extents when populate write desc
-ocfs2: return the physical address in ocfs2_write_cluster
-ocfs2: do not change i_size in write_end for direct io
-ocfs2: test target page before change it
-ocfs2: use c_new to indicate newly allocated extents
-ocfs2: add ocfs2_write_type_t type to identify the caller of write
-ocfs2: NFS hangs in __ocfs2_cluster_lock due to race with
ocfs2_unblock_lock

==

Thanks,
Junxiao.



Re: [Ocfs2-devel] ocfs2: A race between refmap setting and clearing

2016-01-25 Thread Junxiao Bi
On 01/26/2016 09:43 AM, xuejiufei wrote:
> Hi Junxiao,
> 
> On 2016/1/21 15:34, Junxiao Bi wrote:
>> Hi Jiufei,
>>
>> I didn't find other solution for this issue. You can go with yours.
>> Looks like your second one is more straightforward, there deref work can
>> be removed.
>>
> There are two problems with the second solution:
> 1) Node retry to deref the lock resource will block dlm_thread to process
> other lock resources.
Yes, a little, but I don't think that will be long. Indeed I thought about
clearing DLM_LOCK_RES_DROPPING_REF and requeueing the lockres again before;
that would not block dlm_thread, but I found dlm has an assumption about this
flag: it assumes the lockres is gone if the flag is set. If this bad assumption
can be fixed, the second solution will be much better.

> 2) When node retries to drop the refmap bit, the master may be in another
> assert master work, that will take a long time to purge a lockres.
Delaying the purge of a lockres may be not a bad idea, but a good one. Like in your
case, the second lock request can go directly without a master request.
This can improve performance.

Thanks,
Junxiao.

> So I prefer the first solution.
> 
> Thanks,
> Jiufei
> 
>> Thanks,
>> Junxiao.
>> On 01/11/2016 10:46 AM, xuejiufei wrote:
>>> Hi all,
>>> We have found a race between refmap setting and clearing which
>>> will cause the lock resource on master is freed before other nodes
>>> purge it.
>>>
>>> Node 1   Node 2(master)
>>> dlm_do_master_request
>>> dlm_master_request_handler
>>> -> dlm_lockres_set_refmap_bit
>>> call dlm_purge_lockres after unlock
>>> dlm_deref_handler
>>> -> find lock resource is in
>>>DLM_LOCK_RES_SETREF_INPROG state,
>>>so dispatch a deref work
>>> dlm_purge_lockres succeed.
>>>
>>> dlm_do_master_request
>>> dlm_master_request_handler
>>> -> dlm_lockres_set_refmap_bit
>>>
>>> deref work trigger, call
>>> dlm_lockres_clear_refmap_bit
>>> to clear Node 1 from refmap
>>>
>>> Now Node 2 can purge the lock resource but the owner of lock resource
>>> on Node 1 is still Node 2 which may trigger BUG if the lock resource
>>> is $RECOVERY or other problems.
>>>
>>> We have discussed 2 solutions:
>>> 1)The master return error to Node 1 if the DLM_LOCK_RES_SETREF_INPROG
>>> is set. Node 1 will not retry and master send another message to Node 1
>>> after clearing the refmap. Node 1 can purge the lock resource after the
>>> refmap on master is cleared.
>>> 2) The master return error to Node 1 if the DLM_LOCK_RES_SETREF_INPROG
>>> is set, and Node 1 will retry to deref the lockres.
>>>
>>> Does anybody has better ideas?
>>>
>>> Thanks,
>>> --Jiufei
>>>
>>
>>
>> .
>>
> 




Re: [Ocfs2-devel] [PATCH 4/6] ocfs2: o2hb: add some user/debug log

2016-01-24 Thread Junxiao Bi
On 01/25/2016 11:28 AM, Eric Ren wrote:
>> @@ -449,7 +470,11 @@ static int o2hb_nego_timeout_handler(struct o2net_msg *msg, u32 len, void *data,
>> >  static int o2hb_nego_approve_handler(struct o2net_msg *msg, u32 len, void *data,
>> >void **ret_data)
>> >  {
>> > -  o2hb_arm_timeout((struct o2hb_region *)data);
>> > +  struct o2hb_region *reg = (struct o2hb_region *)data;
>> > +
>> > +  printk(KERN_NOTICE "o2hb: negotiate timeout approved by master node on region %s (%s).\n",
>> > +  config_item_name(&reg->hr_item), reg->hr_dev_name);
>> > +  o2hb_arm_timeout(reg);
> Why mix the use of printk and mlog? Any rules to follow?
printk is the log for users, while mlog is the log for debugging.
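
As a rough illustration of that convention, restating the two calls quoted above from the patch:

    /* user-visible event: always emitted to the kernel log */
    printk(KERN_NOTICE "o2hb: negotiate timeout approved by master node on region %s (%s).\n",
           config_item_name(&reg->hr_item), reg->hr_dev_name);

    /* developer-facing error/trace path: filtered through the ocfs2 mlog mask */
    mlog(ML_ERROR, "got nego timeout message from bad node.\n");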

Thanks,
Junxiao.



Re: [Ocfs2-devel] [PATCH 2/6] ocfs2: o2hb: add NEGO_TIMEOUT message

2016-01-24 Thread Junxiao Bi
On 01/25/2016 11:18 AM, Eric Ren wrote:
>>  
>> > @@ -2039,13 +2086,30 @@ static struct config_item *o2hb_heartbeat_group_make_item(struct config_group *g
>> >  
>> >  config_item_init_type_name(&reg->hr_item, name, &o2hb_region_type);
>> >  
>> > +  /* this is the same way to generate msg key as dlm, for local heartbeat,
>> > +   * name is also the same, so make initial crc value different to avoid
>> > +   * message key conflict.
>> > +   */
>> > +  reg->hr_key = crc32_le(reg->hr_region_num + O2NM_MAX_REGIONS,
>> > +  name, strlen(name));
>> > +  INIT_LIST_HEAD(&reg->hr_handler_list);
> Looks no need to initilize ->hr_handler_list here?
Why? It is a list head; it must be initialized before it can be used.

Thanks,
Junxiao.
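
For readers following along, a minimal sketch of why the initialization matters (the struct and field come from the patch; the fragment itself is illustrative only):

    #include <linux/list.h>
    #include <linux/slab.h>

    struct o2hb_region *reg = kzalloc(sizeof(*reg), GFP_KERNEL);

    /* kzalloc() zero-fills the struct, but a zeroed list_head is not a
     * valid empty list: next and prev must point back at the head before
     * list_add()/list_for_each_entry() may touch it. */
    if (reg)
            INIT_LIST_HEAD(&reg->hr_handler_list);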



Re: [Ocfs2-devel] [PATCH 2/6] ocfs2: o2hb: add NEGO_TIMEOUT message

2016-01-21 Thread Junxiao Bi
On 01/22/2016 01:45 PM, Andrew Morton wrote:
> On Fri, 22 Jan 2016 13:12:26 +0800 Junxiao Bi  wrote:
> 
>> On 01/22/2016 07:47 AM, Andrew Morton wrote:
>>> On Wed, 20 Jan 2016 11:13:35 +0800 Junxiao Bi  wrote:
>>>
>>>> This message is sent to master node when non-master nodes's
>>>> negotiate timer expired. Master node records these nodes in
>>>> a bitmap which is used to do write timeout timer re-queue
>>>> decision.
>>>>
>>>> ...
>>>>
>>>> +static int o2hb_nego_timeout_handler(struct o2net_msg *msg, u32 len, void *data,
>>>> +  void **ret_data)
>>>> +{
>>>> +  struct o2hb_region *reg = (struct o2hb_region *)data;
>>>
>>> It's best not to typecast a void*.  It's unneeded clutter and the cast
>>> can actually hide bugs - if someone changes `data' to a different type
>>> or if there's a different "data" in scope, etc.
>> There are many kinds of messages in ocfs2 and each one needs a different
>> type of "data", so it is made type void*.
> 
> What I mean is to do this:
> 
>   struct o2hb_region *reg = data;
> 
> and not
> 
>   struct o2hb_region *reg = (struct o2hb_region *)data;
> 
> Because the typecast is unneeded and is actually harmful.  Imagine if someone
> goofed and had `int data;': no warning, runtime failure.
Oh, I see. Thank you. Will update this in V2.

Thanks,
Junxiao.
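
To make the point concrete, here is a tiny standalone sketch (hypothetical names, not from the patch):

    struct region { int id; };

    static int handler(void *data)
    {
            struct region *reg  = data;                  /* plain assignment */
            struct region *reg2 = (struct region *)data; /* explicit cast */

            return reg == reg2;
    }

If the type of `data` is later changed, say to `long`, the plain assignment is rejected (or at least loudly warned about) by the compiler, so the mistake is caught at build time; the cast form keeps compiling and silently reinterprets the integer as a pointer at run time.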
> 




Re: [Ocfs2-devel] [PATCH 2/6] ocfs2: o2hb: add NEGO_TIMEOUT message

2016-01-21 Thread Junxiao Bi
On 01/22/2016 07:47 AM, Andrew Morton wrote:
> On Wed, 20 Jan 2016 11:13:35 +0800 Junxiao Bi  wrote:
> 
>> This message is sent to master node when non-master nodes's
>> negotiate timer expired. Master node records these nodes in
>> a bitmap which is used to do write timeout timer re-queue
>> decision.
>>
>> ...
>>
>> +static int o2hb_nego_timeout_handler(struct o2net_msg *msg, u32 len, void *data,
>> +void **ret_data)
>> +void **ret_data)
>> +{
>> +struct o2hb_region *reg = (struct o2hb_region *)data;
> 
> It's best not to typecast a void*.  It's unneeded clutter and the cast
> can actually hide bugs - if someone changes `data' to a different type
> or if there's a different "data" in scope, etc.
There are many kinds of messages in ocfs2 and each one needs a different
type of "data", so the parameter is declared void*.

Thanks,
Junxiao.
> 
>> +struct o2hb_nego_msg *nego_msg;
>>  
>> +nego_msg = (struct o2hb_nego_msg *)msg->buf;
>> +if (nego_msg->node_num < O2NM_MAX_NODES)
>> +set_bit(nego_msg->node_num, reg->hr_nego_node_bitmap);
>> +else
>> +mlog(ML_ERROR, "got nego timeout message from bad node.\n");
>> +
>> +return 0;
>>  }
> 




Re: [Ocfs2-devel] ocfs2: o2hb: not fence self if storage down

2016-01-21 Thread Junxiao Bi
Hi Joseph,

On 01/22/2016 12:25 PM, Joseph Qi wrote:
> Hi Junxiao,
> 
> On 2016/1/21 9:48, Junxiao Bi wrote:
>> On 01/21/2016 08:46 AM, Joseph Qi wrote:
>>> Hi Junxiao,
>>> So you mean the negotiation you added only happens if all nodes storage
>>> link down?
>> Negotiation happened when one node found its storage link down, but
>> success when all nodes storage link down, or it will keep the same
>> behavior like before.
> IC, thanks for your explanation.
> IMHO, if storage down, all business deployed on the storage will be
> impacted even nodes won't fence.
Yes, but the storage may come back online after a while. This can improve
the system's stability and availability.

> I have another scenario, only several paths (multipath environment) in
> several nodes have problems, as a result, ocfs2 will fence these nodes.
> So I wonder if we have a better way to resolve this issue.
This doesn't seem to follow ocfs2's usual policy. Would fencing these nodes at
that time really be good for the availability of ocfs2?

Anyway, I am not sure whether it is feasible now. The problem is that we
need to find a way to reach an agreement between the good nodes in an
environment where more errors may be coming, while the good nodes must not
be hurt even if the agreement can't be reached.

Thanks,
Junxiao.
> 
> Thanks,
> Joseph
> 
>>
>> Thanks,
>> Junxiao.
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On 2016/1/20 21:27, Junxiao Bi wrote:
>>>> Hi Joseph,
>>>>
>>>>> 在 2016年1月20日,下午5:18,Joseph Qi  写道:
>>>>>
>>>>> Hi Junxiao,
>>>>> Thanks for the patch set.
>>>>> In case only one node storage link down, if this node doesn't fence
>>>>> self, other nodes will still check and mark this node dead, which will
>>>>> cause cluster membership inconsistency.
>>>>> In your patch set, I cannot see any logic to handle this. Am I missing
>>>>> something?
>>>> No, there is no logic for this. But why didn’t node fence self when 
>>>> storage down? What make a softirq timer can’t be run, another bug?
>>>>
>>>> Thanks,
>>>> Junxiao.
>>>>>
>>>>> On 2016/1/20 11:13, Junxiao Bi wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This serial of patches is to fix the issue that when storage down,
>>>>>> all nodes will fence self due to write timeout.
>>>>>> With this patch set, all nodes will keep going until storage back
>>>>>> online, except if the following issue happens, then all nodes will
>>>>>> do as before to fence self.
>>>>>> 1. io error got
>>>>>> 2. network between nodes down
>>>>>> 3. nodes panic
>>>>>>
>>>>>> Junxiao Bi (6):
>>>>>>  ocfs2: o2hb: add negotiate timer
>>>>>>  ocfs2: o2hb: add NEGO_TIMEOUT message
>>>>>>  ocfs2: o2hb: add NEGOTIATE_APPROVE message
>>>>>>  ocfs2: o2hb: add some user/debug log
>>>>>>  ocfs2: o2hb: don't negotiate if last hb fail
>>>>>>  ocfs2: o2hb: fix hb hung time
>>>>>>
>>>>>> fs/ocfs2/cluster/heartbeat.c |  181 
>>>>>> --
>>>>>> 1 file changed, 175 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> Thanks,
>>>>>> Junxiao.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> .
>>>>
>>>
>>>
>>
>>
>> .
>>
> 
> 



Re: [Ocfs2-devel] [PATCH 1/6] ocfs2: o2hb: add negotiate timer

2016-01-21 Thread Junxiao Bi
Hi Andrew,

On 01/22/2016 07:42 AM, Andrew Morton wrote:
> On Wed, 20 Jan 2016 11:13:34 +0800 Junxiao Bi  wrote:
> 
>> When storage down, all nodes will fence self due to write timeout.
>> The negotiate timer is designed to avoid this, with it node will
>> wait until storage up again.
>>
>> Negotiate timer working in the following way:
>>
>> 1. The timer expires before write timeout timer, its timeout is half
>> of write timeout now. It is re-queued along with write timeout timer.
>> If expires, it will send NEGO_TIMEOUT message to master node(node with
>> lowest node number). This message does nothing but marks a bit in a
>> bitmap recording which nodes are negotiating timeout on master node.
>>
>> 2. If storage down, nodes will send this message to master node, then
>> when master node finds its bitmap including all online nodes, it sends
>> NEGO_APPROVL message to all nodes one by one, this message will re-queue
>> write timeout timer and negotiate timer.
>> For any node doesn't receive this message or meets some issue when
>> handling this message, it will be fenced.
>> If storage up at any time, o2hb_thread will run and re-queue all the
>> timer, nothing will be affected by these two steps.
>>
>> ...
>>
>> +static void o2hb_nego_timeout(struct work_struct *work)
>> +{
>> +struct o2hb_region *reg =
>> +container_of(work, struct o2hb_region,
>> + hr_nego_timeout_work.work);
> 
> It's better to just do
> 
>   struct o2hb_region *reg;
> 
>   reg = container_of(work, struct o2hb_region, hr_nego_timeout_work.work);
> 
> and avoid the weird 80-column tricks.
OK. Will update this in V2.

> 
>> +unsigned long live_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
> 
> the bitmap.h interfaces might be nicer here.  Perhaps.  A little bit.
Will consider this in v2.
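
For reference, a hedged sketch of what the bitmap.h variant could look like (illustrative only, not the actual v2 change; the helper name is made up):

    #include <linux/bitmap.h>

    /* true once every live node has marked itself in the nego bitmap */
    static bool o2hb_nego_bitmap_complete(struct o2hb_region *reg,
                                          unsigned long *live_node_bitmap)
    {
            return bitmap_equal(reg->hr_nego_node_bitmap,
                                live_node_bitmap, O2NM_MAX_NODES);
    }

bitmap_equal() works in bits rather than bytes, so the nbits argument documents exactly how much of the map is compared.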

> 
>> +int master_node;
>> +
>> +o2hb_fill_node_map(live_node_bitmap, sizeof(live_node_bitmap));
>> +/* lowest node as master node to make negotiate decision. */
>> +master_node = find_next_bit(live_node_bitmap, O2NM_MAX_NODES, 0);
>> +
>> +if (master_node == o2nm_this_node()) {
>> +set_bit(master_node, reg->hr_nego_node_bitmap);
>> +if (memcmp(reg->hr_nego_node_bitmap, live_node_bitmap,
>> +sizeof(reg->hr_nego_node_bitmap))) {
>> +/* check negotiate bitmap every second to do timeout
>> + * approve decision.
>> + */
>> +schedule_delayed_work(&reg->hr_nego_timeout_work,
>> +msecs_to_jiffies(1000));
> 
> One second is long enough to unmount the fs (and to run `rmmod
> ocfs2'!).  Is there anything preventing the work from triggering in
> these situations?
Yes, this delayed work will be synced before the umount.

Thanks,
Junxiao.
> 
>> +
>> +return;
>> +}
>> +
>> +/* approve negotiate timeout request. */
>> +} else {
>> +/* negotiate timeout with master node. */
>> +}
>> +
>>  }
> 




Re: [Ocfs2-devel] [PATCH 1/6] ocfs2: o2hb: add negotiate timer

2016-01-21 Thread Junxiao Bi
Hi Joseph,

On 01/22/2016 08:56 AM, Joseph Qi wrote:
> Hi Junxiao,
> 
> On 2016/1/20 11:13, Junxiao Bi wrote:
>> When storage down, all nodes will fence self due to write timeout.
>> The negotiate timer is designed to avoid this, with it node will
>> wait until storage up again.
>>
>> Negotiate timer working in the following way:
>>
>> 1. The timer expires before write timeout timer, its timeout is half
>> of write timeout now. It is re-queued along with write timeout timer.
>> If expires, it will send NEGO_TIMEOUT message to master node(node with
>> lowest node number). This message does nothing but marks a bit in a
>> bitmap recording which nodes are negotiating timeout on master node.
>>
>> 2. If storage down, nodes will send this message to master node, then
>> when master node finds its bitmap including all online nodes, it sends
>> NEGO_APPROVL message to all nodes one by one, this message will re-queue
>> write timeout timer and negotiate timer.
>> For any node doesn't receive this message or meets some issue when
>> handling this message, it will be fenced.
>> If storage up at any time, o2hb_thread will run and re-queue all the
>> timer, nothing will be affected by these two steps.
>>
>> Signed-off-by: Junxiao Bi 
>> Reviewed-by: Ryan Ding 
>> ---
>>  fs/ocfs2/cluster/heartbeat.c |   52 
>> ++
>>  1 file changed, 48 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
>> index a3cc6d2fc896..b601ee95de50 100644
>> --- a/fs/ocfs2/cluster/heartbeat.c
>> +++ b/fs/ocfs2/cluster/heartbeat.c
>> @@ -272,6 +272,10 @@ struct o2hb_region {
>>  struct delayed_work hr_write_timeout_work;
>>  unsigned long   hr_last_timeout_start;
>>  
>> +/* negotiate timer, used to negotiate extending hb timeout. */
>> +struct delayed_work hr_nego_timeout_work;
>> +unsigned long   hr_nego_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
>> +
>>  /* Used during o2hb_check_slot to hold a copy of the block
>>   * being checked because we temporarily have to zero out the
>>   * crc field. */
>> @@ -320,7 +324,7 @@ static void o2hb_write_timeout(struct work_struct *work)
>>  o2quo_disk_timeout();
>>  }
>>  
>> -static void o2hb_arm_write_timeout(struct o2hb_region *reg)
>> +static void o2hb_arm_timeout(struct o2hb_region *reg)
>>  {
>>  /* Arm writeout only after thread reaches steady state */
>>  if (atomic_read(&reg->hr_steady_iterations) != 0)
>> @@ -338,11 +342,50 @@ static void o2hb_arm_write_timeout(struct o2hb_region *reg)
>>  reg->hr_last_timeout_start = jiffies;
>>  schedule_delayed_work(&reg->hr_write_timeout_work,
>>msecs_to_jiffies(O2HB_MAX_WRITE_TIMEOUT_MS));
>> +
>> +cancel_delayed_work(&reg->hr_nego_timeout_work);
>> +/* negotiate timeout must be less than write timeout. */
>> +schedule_delayed_work(&reg->hr_nego_timeout_work,
>> +  msecs_to_jiffies(O2HB_MAX_WRITE_TIMEOUT_MS)/2);
>> +memset(reg->hr_nego_node_bitmap, 0, sizeof(reg->hr_nego_node_bitmap));
>>  }
>>  
>> -static void o2hb_disarm_write_timeout(struct o2hb_region *reg)
>> +static void o2hb_disarm_timeout(struct o2hb_region *reg)
>>  {
>>  cancel_delayed_work_sync(&reg->hr_write_timeout_work);
>> +cancel_delayed_work_sync(&reg->hr_nego_timeout_work);
>> +}
>> +
>> +static void o2hb_nego_timeout(struct work_struct *work)
>> +{
>> +struct o2hb_region *reg =
>> +container_of(work, struct o2hb_region,
>> + hr_nego_timeout_work.work);
>> +unsigned long live_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
>> +int master_node;
>> +
>> +o2hb_fill_node_map(live_node_bitmap, sizeof(live_node_bitmap));
>> +/* lowest node as master node to make negotiate decision. */
>> +master_node = find_next_bit(live_node_bitmap, O2NM_MAX_NODES, 0);
>> +
>> +if (master_node == o2nm_this_node()) {
>> +set_bit(master_node, reg->hr_nego_node_bitmap);
>> +if (memcmp(reg->hr_nego_node_bitmap, live_node_bitmap,
>> +sizeof(reg->hr_nego_node_bitmap))) {
> Should the access to hr_nego_node_bitmap be protected, for example,
> under o2hb_live_lock?
I didn't see a need for this. This bitmap is used by the negotiation master
node, every set op is ord

Re: [Ocfs2-devel] OCFS2 causing system instability

2016-01-21 Thread Junxiao Bi
Hi Guy,

On 01/22/2016 01:46 AM, Guy 2212112 wrote:
> Hi,
> First, I'm well aware that OCFS2 is not a distributed file system, but a
> shared, clustered file system. This is the main reason that I want to
> use it - access the same filesystem from multiple nodes.
Glad to hear you are interested in ocfs2.

> I've checked the latest Kernel 4.4 release that include the
> "errors=continue" option and installed also (manually) the patch
> described in this thread - "[PATCH V2] ocfs2: call ocfs2_abort when
> journal abort" .
> 
> Unfortunately the issues I've described where not solved.
> 
> Also, I understand that OCFS2 relies on the SAN availability and is not
> replicating the data to other locations (like a distributed file
> system), so I don't expect to be able to access the data when a
> disk/volume is not accessible (for example because of hardware failure).
Losing data may not be a big issue, but losing metadata is. There is some
metadata on the unplugged disk; without it, ocfs2 can't know how the data is
stored, so it can't work well. And I think this may be the same for other
clustered or local filesystems.

> 
> In other filesystems, clustered or even local, when a disk/volume fails
> - this and only this disk/volume cannot be accessed - and all the other
> filesystems continue to function and can accessed and the whole system
> stability is definitely not compromised.
Which fs can do this?

> 
> Of course, I can understand that if this specific disk/volume contains
> the operating system it probably cause a  panic/reboot, or if the
> disk/volume is used by the cluster as heartbeat, it may influence the
> whole cluster - if it's the only way the nodes in the cluster are using
> to communicate between themselves.
> 
> The configuration I use rely on Global heartbeat on three different
> dedicated disks and the "simulated error" is on an additional,fourth
> disk that doesn't include a heartbeat.
You mean the fourth disk is used to store data but is not an hb disk, right?

Thanks,
Junxiao.

> 
> Errors may occur on storage arrays and if I'm connecting my OCFS2
> cluster to 4 storage arrays with each 10 disks/volumes, I don't expect
> that the whole OCFS2 cluster will fail when only one array is down. I
> still expect that the other 30 disks from the other 3 remaining arrays
> will continue working.
> Of course, I will not have any access to the failed array disks.
> 
> I hope this describes better the situation,
> 
> Thanks,
> 
> Guy
> 
> On Wed, Jan 20, 2016 at 10:51 AM, Junxiao Bi  <mailto:junxiao...@oracle.com>> wrote:
> 
> Hi Guy,
> 
> ocfs2 is shared-disk fs, there is no way to do replication like dfs,
> also no volume manager integrated in ocfs2. Ocfs2 depends on underlying
> storage stack to handler disk failure, so you can configure multipath,
> raid or storage to handle removing disk issue. If io error is still
> reported to ocfs2, then there is no way to workaround, ocfs2 will be set
> read-only or even panic to avoid fs corruption. This is the same
> behavior with local fs.
> If io error not reported to ocfs2, then there is a fix i just posted to
> ocfs2-devel to avoid the node panic, please try patch serial [ocfs2:
> o2hb: not fence self if storage down]. Note this is only useful to o2cb
> stack. Nodes will hung on io and wait storage online again.
> 
> For the endless loop you met in "Appendix A1", it is a bug and fixed by
> "[PATCH V2] ocfs2: call ocfs2_abort when journal abort", you can get it
> from ocfs2-devel. This patch will set fs readonly or panic node since io
> error have been reported to ocfs2.
> 
> Thanks,
> Junxiao.
> 
> On 01/20/2016 03:19 AM, Guy 1234 wrote:
> > Dear OCFS2 guys,
> >
> >
> >
> > My name is Guy, and I'm testing ocfs2 due to its features as a
> clustered
> > filesystem that I need.
> >
> > As part of the stability and reliability test I’ve performed, I've
> > encountered an issue with ocfs2 (format + mount + remove disk...),
> that
> > I wanted to make sure it is a real issue and not just a
> mis-configuration.
> >
> >
> >
> > The main concern is that the stability of the whole system is
> > compromised when a single disk/volumes fails. It looks like the
> OCFS2 is
> > not handling the error correctly but stuck in an endless loop that
> > interferes with the work of the server.
> >
> >
> >
> > I’ve test tested two cluster configurations – (1)
> Corosyn

Re: [Ocfs2-devel] ocfs2: o2hb: not fence self if storage down

2016-01-21 Thread Junxiao Bi
On 01/21/2016 04:34 PM, rwxybh wrote:
> Hi, junxiao!
> 
> 
> We can't find correct fencing log after a node fencing itself. 
> We know there is log such as following in source code:
> 
> printk(KERN_ERR "*** ocfs2 is very sorry to be fencing this "
>   "system by restarting ***\n");
> 
> But we NEVER found this message from /var/log/message or last "demsg".
> 
> Do u mean we can find this message from local fs log after applying this
> patch set?
No, this patch set is not targeted at that; it is to avoid nodes fencing
themselves if storage goes down.
To get that log, I am afraid you need to configure a console, as panic
follows that printk.

Thanks,
Junxiao.
> 
> Or any way to find this output (without netconsole), thx?
> 
> --------
> rwxybh
> 
>  
> *From:* Junxiao Bi <mailto:junxiao...@oracle.com>
> *Date:* 2016-01-20 11:13
> *To:* ocfs2-devel <mailto:ocfs2-devel@oss.oracle.com>
> *CC:* mfasheh <mailto:mfas...@suse.com>
> *Subject:* [Ocfs2-devel] ocfs2: o2hb: not fence self if storage down
> Hi,
>  
> This serial of patches is to fix the issue that when storage down,
> all nodes will fence self due to write timeout.
> With this patch set, all nodes will keep going until storage back
> online, except if the following issue happens, then all nodes will
> do as before to fence self.
> 1. io error got
> 2. network between nodes down
> 3. nodes panic
>  
> Junxiao Bi (6):
>   ocfs2: o2hb: add negotiate timer
>   ocfs2: o2hb: add NEGO_TIMEOUT message
>   ocfs2: o2hb: add NEGOTIATE_APPROVE message
>   ocfs2: o2hb: add some user/debug log
>   ocfs2: o2hb: don't negotiate if last hb fail
>   ocfs2: o2hb: fix hb hung time
>  
> fs/ocfs2/cluster/heartbeat.c |  181
> --
> 1 file changed, 175 insertions(+), 6 deletions(-)
>  
> Thanks,
> Junxiao.
>  
> 




Re: [Ocfs2-devel] [PATCH] ocfs2: dlmglue: fix false deadlock caused by clearing UPCONVERT_FINISHING too early

2016-01-21 Thread Junxiao Bi
On 01/21/2016 04:10 PM, Eric Ren wrote:
> Hi Junxiao,
> 
> On Thu, Jan 21, 2016 at 03:10:20PM +0800, Junxiao Bi wrote: 
>> Hi Eric,
>>
>> This patch should fix your issue.
>> "NFS hangs in __ocfs2_cluster_lock due to race with ocfs2_unblock_lock"
> 
> Thanks a lot for bringing up this patch! It hasn't been merged into mainline(
> at least 4.4), right?
Right, it is still in linux-next.

Thanks,
Junxiao.
> 
> I have found this patch in maillist and it looks good! I'd like to test it 
> right
> now and give feadback!
> 
> Thanks again,
> Eric
> 
>>
>> Thanks,
>> Junxiao.
>> On 01/20/2016 12:46 AM, Eric Ren wrote:
>>> This problem was introduced by commit 
>>> a19128260107f951d1b4c421cf98b92f8092b069.
>>> OCFS2_LOCK_UPCONVERT_FINISHING is set just before clearing OCFS2_LOCK_BUSY. 
>>> This
>>> will prevent dc thread from downconverting immediately, and let 
>>> mask-waiters in
>>> ->l_mask_waiters list whose requesting level is compatible with ->l_level 
>>> to take
>>> the lock. But if we have two waiters in mw list, the first is to get EX 
>>> lock, and
>>> the second is to get PR lock. The first may fail to get lock and then 
>>> clear
>>> UPCONVERT_FINISHING. It's too early to clear the flag because this second 
>>> will be
>>> also queued again even if ->l_level is PR. As a result, nobody would kick 
>>> up dc
>>> thread, leaving dlmglue a deadlock until another lockres relative thread 
>>> wake it
>>> up.
>>>
>>> More specifically, for example:
>>> On node1, there is thread W1 keeping writing; on node2, there are thread R1 
>>> and
>>> R2 keeping reading; sure this 3 threads make IO on the same shared file. At 
>>> a
>>> time, node2 is receiving ast(0=>3), followed immediately by a bast 
>>> requesting EX
>>> lock on behave of node1. Then this may happen:
>>> node2:  node1:
>>> l_level==3; R1(3); R2(3)l_level==3
>>> R1(unlock); R1(3=>5, update atime)  W1(3=>5)
>>> BAST
>>> R2(unlock); AST(3=>0)
>>> R2(0=>3)
>>> BAST
>>> AST(0=>3)
>>> set OCFS2_LOCK_UPCONVERT_FINISHING
>>> clear OCFS2_LOCK_BUSY
>>> W1(3=>5)
>>> BAST
>>> dc thread requeue=yes
>>> R1(clear OCFS2_LOCK_UPCONVERT_FINISHING,wait)
>>> R2(wait)
>>> ...
>>> dlmglue deadlock until dc thread woken up by others
>>>
>>> This fix is to clear OCFS2_LOCK_UPCONVERT_FINISHING until OCFS2_LOCK_BUSY has
>>> been cleared and every waiters has been looped.
>>>
>>> Signed-off-by: Eric Ren 
>>> ---
>>>  fs/ocfs2/dlmglue.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>>> index f92612e..72f8b6c 100644
>>> --- a/fs/ocfs2/dlmglue.c
>>> +++ b/fs/ocfs2/dlmglue.c
>>> @@ -824,6 +824,8 @@ static void lockres_clear_flags(struct ocfs2_lock_res 
>>> *lockres,
>>> unsigned long clear)
>>>  {
>>> lockres_set_flags(lockres, lockres->l_flags & ~clear);
>>> +   if(clear & OCFS2_LOCK_BUSY)
>>> +   lockres->l_flags &= ~OCFS2_LOCK_UPCONVERT_FINISHING;
>>>  }
>>>  
>>>  static inline void ocfs2_generic_handle_downconvert_action(struct ocfs2_lock_res *lockres)
>>> @@ -1522,8 +1524,6 @@ update_holders:
>>>  
>>> ret = 0;
>>>  unlock:
>>> -   lockres_clear_flags(lockres, OCFS2_LOCK_UPCONVERT_FINISHING);
>>> -
>>> spin_unlock_irqrestore(&lockres->l_lock, flags);
>>>  out:
>>> /*
>>>
>>
>>




Re: [Ocfs2-devel] ocfs2: A race between refmap setting and clearing

2016-01-20 Thread Junxiao Bi
Hi Jiufei,

I didn't find another solution for this issue. You can go with yours.
Looks like your second one is more straightforward; the deref work can
be removed.

Thanks,
Junxiao.
On 01/11/2016 10:46 AM, xuejiufei wrote:
> Hi all,
> We have found a race between refmap setting and clearing which
> will cause the lock resource on master is freed before other nodes
> purge it.
> 
> Node 1   Node 2(master)
> dlm_do_master_request
> dlm_master_request_handler
> -> dlm_lockres_set_refmap_bit
> call dlm_purge_lockres after unlock
> dlm_deref_handler
> -> find lock resource is in
>DLM_LOCK_RES_SETREF_INPROG state,
>so dispatch a deref work
> dlm_purge_lockres succeed.
> 
> dlm_do_master_request
> dlm_master_request_handler
> -> dlm_lockres_set_refmap_bit
> 
> deref work trigger, call
> dlm_lockres_clear_refmap_bit
> to clear Node 1 from refmap
> 
> Now Node 2 can purge the lock resource but the owner of lock resource
> on Node 1 is still Node 2 which may trigger BUG if the lock resource
> is $RECOVERY or other problems.
> 
> We have discussed 2 solutions:
> 1)The master return error to Node 1 if the DLM_LOCK_RES_SETREF_INPROG
> is set. Node 1 will not retry and master send another message to Node 1
> after clearing the refmap. Node 1 can purge the lock resource after the
> refmap on master is cleared.
> 2) The master return error to Node 1 if the DLM_LOCK_RES_SETREF_INPROG
> is set, and Node 1 will retry to deref the lockres.
> 
> Does anybody has better ideas?
> 
> Thanks,
> --Jiufei
> 




Re: [Ocfs2-devel] [PATCH] ocfs2: dlmglue: fix false deadlock caused by clearing UPCONVERT_FINISHING too early

2016-01-20 Thread Junxiao Bi
Hi Eric,

This patch should fix your issue.
"NFS hangs in __ocfs2_cluster_lock due to race with ocfs2_unblock_lock"

Thanks,
Junxiao.
On 01/20/2016 12:46 AM, Eric Ren wrote:
> This problem was introduced by commit 
> a19128260107f951d1b4c421cf98b92f8092b069.
> OCFS2_LOCK_UPCONVERT_FINISHING is set just before clearing OCFS2_LOCK_BUSY. 
> This
> will prevent dc thread from downconverting immediately, and let mask-waiters 
> in
> ->l_mask_waiters list whose requesting level is compatible with ->l_level to 
> take
> the lock. But if we have two waiters in mw list, the first is to get EX lock, 
> and
> the second is to get PR lock. The first may fail to get lock and then clear
> UPCONVERT_FINISHING. It's too early to clear the flag because this second 
> will be
> also queued again even if ->l_level is PR. As a result, nobody would kick up 
> dc
> thread, leaving dlmglue a deadlock until another lockres relative thread wake 
> it
> up.
> 
> More specifically, for example:
> On node1, there is thread W1 keeping writing; on node2, there are thread R1 
> and
> R2 keeping reading; sure this 3 threads make IO on the same shared file. At a
> time, node2 is receiving ast(0=>3), followed immediately by a bast requesting 
> EX
> lock on behave of node1. Then this may happen:
> node2:  node1:
> l_level==3; R1(3); R2(3)l_level==3
> R1(unlock); R1(3=>5, update atime)  W1(3=>5)
> BAST
> R2(unlock); AST(3=>0)
> R2(0=>3)
> BAST
> AST(0=>3)
> set OCFS2_LOCK_UPCONVERT_FINISHING
> clear OCFS2_LOCK_BUSY
> W1(3=>5)
> BAST
> dc thread requeue=yes
> R1(clear OCFS2_LOCK_UPCONVERT_FINISHING,wait)
> R2(wait)
> ...
> dlmglue deadlock until dc thread woken up by others
> 
> This fix is to clear OCFS2_LOCK_UPCONVERT_FINISHING until OCFS2_LOCK_BUSY has
> been cleared and every waiters has been looped.
> 
> Signed-off-by: Eric Ren 
> ---
>  fs/ocfs2/dlmglue.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index f92612e..72f8b6c 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -824,6 +824,8 @@ static void lockres_clear_flags(struct ocfs2_lock_res 
> *lockres,
>   unsigned long clear)
>  {
>   lockres_set_flags(lockres, lockres->l_flags & ~clear);
> + if(clear & OCFS2_LOCK_BUSY)
> + lockres->l_flags &= ~OCFS2_LOCK_UPCONVERT_FINISHING;
>  }
>  
>  static inline void ocfs2_generic_handle_downconvert_action(struct ocfs2_lock_res *lockres)
> @@ -1522,8 +1524,6 @@ update_holders:
>  
>   ret = 0;
>  unlock:
> - lockres_clear_flags(lockres, OCFS2_LOCK_UPCONVERT_FINISHING);
> -
>   spin_unlock_irqrestore(&lockres->l_lock, flags);
>  out:
>   /*
> 




Re: [Ocfs2-devel] ocfs2: o2hb: not fence self if storage down

2016-01-20 Thread Junxiao Bi
On 01/21/2016 08:46 AM, Joseph Qi wrote:
> Hi Junxiao,
> So you mean the negotiation you added only happens if all nodes storage
> link down?
Negotiation happens when one node finds its storage link down, but it only
succeeds when all nodes' storage links are down; otherwise it keeps the same
behavior as before.

Thanks,
Junxiao.
> 
> Thanks,
> Joseph
> 
> On 2016/1/20 21:27, Junxiao Bi wrote:
>> Hi Joseph,
>>
>>> 在 2016年1月20日,下午5:18,Joseph Qi  写道:
>>>
>>> Hi Junxiao,
>>> Thanks for the patch set.
>>> In case only one node storage link down, if this node doesn't fence
>>> self, other nodes will still check and mark this node dead, which will
>>> cause cluster membership inconsistency.
>>> In your patch set, I cannot see any logic to handle this. Am I missing
>>> something?
>> No, there is no logic for this. But why didn’t node fence self when storage 
>> down? What make a softirq timer can’t be run, another bug?
>>
>> Thanks,
>> Junxiao.
>>>
>>> On 2016/1/20 11:13, Junxiao Bi wrote:
>>>> Hi,
>>>>
>>>> This serial of patches is to fix the issue that when storage down,
>>>> all nodes will fence self due to write timeout.
>>>> With this patch set, all nodes will keep going until storage back
>>>> online, except if the following issue happens, then all nodes will
>>>> do as before to fence self.
>>>> 1. io error got
>>>> 2. network between nodes down
>>>> 3. nodes panic
>>>>
>>>> Junxiao Bi (6):
>>>>  ocfs2: o2hb: add negotiate timer
>>>>  ocfs2: o2hb: add NEGO_TIMEOUT message
>>>>  ocfs2: o2hb: add NEGOTIATE_APPROVE message
>>>>  ocfs2: o2hb: add some user/debug log
>>>>  ocfs2: o2hb: don't negotiate if last hb fail
>>>>  ocfs2: o2hb: fix hb hung time
>>>>
>>>> fs/ocfs2/cluster/heartbeat.c |  181 
>>>> --
>>>> 1 file changed, 175 insertions(+), 6 deletions(-)
>>>>
>>>> Thanks,
>>>> Junxiao.
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> .
>>
> 
> 



Re: [Ocfs2-devel] ocfs2: o2hb: not fence self if storage down

2016-01-20 Thread Junxiao Bi
Hi Joseph,

> 在 2016年1月20日,下午5:18,Joseph Qi  写道:
> 
> Hi Junxiao,
> Thanks for the patch set.
> In case only one node storage link down, if this node doesn't fence
> self, other nodes will still check and mark this node dead, which will
> cause cluster membership inconsistency.
> In your patch set, I cannot see any logic to handle this. Am I missing
> something?
No, there is no logic for this. But why didn't the node fence itself when
storage went down? What made the softirq timer unable to run, another bug?

Thanks,
Junxiao.
> 
> On 2016/1/20 11:13, Junxiao Bi wrote:
>> Hi,
>> 
>> This serial of patches is to fix the issue that when storage down,
>> all nodes will fence self due to write timeout.
>> With this patch set, all nodes will keep going until storage back
>> online, except if the following issue happens, then all nodes will
>> do as before to fence self.
>> 1. io error got
>> 2. network between nodes down
>> 3. nodes panic
>> 
>> Junxiao Bi (6):
>>  ocfs2: o2hb: add negotiate timer
>>  ocfs2: o2hb: add NEGO_TIMEOUT message
>>  ocfs2: o2hb: add NEGOTIATE_APPROVE message
>>  ocfs2: o2hb: add some user/debug log
>>  ocfs2: o2hb: don't negotiate if last hb fail
>>  ocfs2: o2hb: fix hb hung time
>> 
>> fs/ocfs2/cluster/heartbeat.c |  181 
>> --
>> 1 file changed, 175 insertions(+), 6 deletions(-)
>> 
>> Thanks,
>> Junxiao.
>> 
>> 
>> 
> 
> 



Re: [Ocfs2-devel] OCFS2 causing system instability

2016-01-20 Thread Junxiao Bi
Hi Guy,

ocfs2 is a shared-disk fs; there is no way to do replication like a dfs, and
no volume manager is integrated into ocfs2. Ocfs2 depends on the underlying
storage stack to handle disk failure, so you can configure multipath,
raid or the storage array to handle the removed-disk issue. If an io error is
still reported to ocfs2, then there is no way to work around it; ocfs2 will be
set read-only or even panic to avoid fs corruption. This is the same
behavior as a local fs.
If the io error is not reported to ocfs2, then there is a fix I just posted to
ocfs2-devel to avoid the node panic; please try the patch series [ocfs2:
o2hb: not fence self if storage down]. Note this is only useful for the o2cb
stack. Nodes will hang on io and wait for the storage to come online again.

For the endless loop you met in "Appendix A1", it is a bug fixed by
"[PATCH V2] ocfs2: call ocfs2_abort when journal abort"; you can get the patch
from ocfs2-devel. It will set the fs read-only or panic the node once the io
error has been reported to ocfs2.

Thanks,
Junxiao.

On 01/20/2016 03:19 AM, Guy 1234 wrote:
> Dear OCFS2 guys,
> 
>  
> 
> My name is Guy, and I'm testing ocfs2 due to its features as a clustered
> filesystem that I need.
> 
> As part of the stability and reliability test I’ve performed, I've
> encountered an issue with ocfs2 (format + mount + remove disk...), that
> I wanted to make sure it is a real issue and not just a mis-configuration.
> 
>  
> 
> The main concern is that the stability of the whole system is
> compromised when a single disk/volumes fails. It looks like the OCFS2 is
> not handling the error correctly but stuck in an endless loop that
> interferes with the work of the server.
> 
>  
> 
> I’ve test tested two cluster configurations – (1) Corosync/Pacemaker and
> (2) o2cb that react similarly.
> 
> Following the process and log entries:
> 
> 
> Also below additional configuration that were tested.
> 
> 
> Node 1:
> 
> === 
> 
> 1. service corosync start
> 
> 2. service dlm start
> 
> 3. mkfs.ocfs2 -v -Jblock64 -b 4096 --fs-feature-level=max-features
> --cluster-=pcmk --cluster-name=cluster-name -N 2 /dev/
> 
> 4. mount -o
> rw,noatime,nodiratime,data=writeback,heartbeat=none,cluster_stack=pcmk
> /dev/ /mnt/ocfs2-mountpoint
> 
>  
> 
> Node 2:
> 
> ===
> 
> 5. service corosync start
> 
> 6. service dlm start
> 
> 7. mount -o
> rw,noatime,nodiratime,data=writeback,heartbeat=none,cluster_stack=pcmk
> /dev/ /mnt/ocfs2-mountpoint
> 
>  
> 
> So far all is working well, including reading and writing.
> 
> Next
> 
> 8. I’ve physically, pull out the disk at /dev/ to
> simulate a hardware failure (that may occur…) , in real life the disk is
> (hardware or software) protected. Nonetheless, I’m testing a hardware
> failure that the one of the OCFS2 file systems in my server fails.
> 
> Following  - messages observed in the system log (see below) and
> 
> ==>  9. kernel panic(!) ... in one of the nodes or on both, or reboot on
> one of the nodes or both.
> 
> 
> Is there any configuration or set of parameters that will enable the
> system to continue working, disabling the access to the failed disk
> without compromising the system stability and not cause the kernel to
> panic?!
> 
>  
> 
> From my point of view it looks basics – when a hardware failure occurs:
> 
> 1. All remaining hardware should continue working
> 
> 2. The failed disk/volume should be inaccessible – but not compromise
> the whole system availability (Kernel panic).
> 
> 3. OCFS2 “understands” there’s a failed disk and stop trying to access it.
> 
> 3. All disk commands such as mount/umount, df etc. should continue working.
> 
> 4. When a new/replacement drive is connected to the system, it can be
> accessed.
> 
> My settings:
> 
> ubuntu 14.04
> 
> linux:  3.16.0-46-generic
> 
> mkfs.ocfs2 1.8.4 (downloaded from git)
> 
>  
> 
>  
> 
> Some other scenarios which also were tested:
> 
> 1. Remove the max-features in the mkfs (i.e. mkfs.ocfs2 -v -Jblock64 -b
> 4096 --cluster-stack=pcmk --cluster-name=cluster-name -N 2 /dev/ device>)
> 
> This improved in some of the cases with no kernel panic but still the
> stability of the system was compromised, the syslog indicates that
> something unrecoverable is going on (See below - Appendix A1).
> Furthermore, System is hanging when trying to software reboot.
> 
> 2. Also tried with the o2cb stack, with similar outcomes.
> 
> 3. The configuration was also tested with (1,2 and 3) Local and Global
> heartbeat(s) that were NOT on the simulated failed disk, but on other
> physical disks.
> 
> 4. Also tested: 
> 
> Ubuntu 15.15
> 
> Kernel: 4.2.0-23-generic
> 
> mkfs.ocfs2 1.8.4 (git clone git://oss.oracle.com/git/ocfs2-tools.git
> )
> 
>  
> 
>  
> 
> ==
> 
> Appendix A1:
> 
> ==
> 
> from syslog:
> 
> [ 1676.608123] (ocfs2cmt,5316,14):ocfs2_commit_thread:2195 ERROR: status
> = -5, journal is already aborted.
> 
> [ 1677.611827] (ocfs2cmt,5316,14):ocfs2_commit_cache:324 ERROR: status = -5
> 
> [

Re: [Ocfs2-devel] ocfs2: o2hb: not fence self if storage down

2016-01-20 Thread Junxiao Bi
Hi Gang,

On 01/20/2016 02:00 PM, Gang He wrote:
> Hi Junxiao,
> 
> Thank for your fix.
> Just one quick question, this fix only effects OCFS2 O2CB case, right?
Right.
> If the user selects pacemaker as cluster stack? OCFS2 file system will 
> encounter the same problem?
Not sure about this; I have no knowledge about pacemaker. You can run a
quick test on the setup.

Thanks,
Junxiao.
> 
> Thanks
> Gang 
> 
> 
>>>>
>> Hi,
>>
>> This serial of patches is to fix the issue that when storage down,
>> all nodes will fence self due to write timeout.
>> With this patch set, all nodes will keep going until storage back
>> online, except if the following issue happens, then all nodes will
>> do as before to fence self.
>> 1. io error got
>> 2. network between nodes down
>> 3. nodes panic
>>
>> Junxiao Bi (6):
>>   ocfs2: o2hb: add negotiate timer
>>   ocfs2: o2hb: add NEGO_TIMEOUT message
>>   ocfs2: o2hb: add NEGOTIATE_APPROVE message
>>   ocfs2: o2hb: add some user/debug log
>>   ocfs2: o2hb: don't negotiate if last hb fail
>>   ocfs2: o2hb: fix hb hung time
>>
>>  fs/ocfs2/cluster/heartbeat.c |  181 
>> --
>>  1 file changed, 175 insertions(+), 6 deletions(-)
>>
>>  Thanks,
>>  Junxiao.
>>
> 




[Ocfs2-devel] [PATCH 1/6] ocfs2: o2hb: add negotiate timer

2016-01-19 Thread Junxiao Bi
When storage is down, all nodes will fence themselves due to write timeout.
The negotiate timer is designed to avoid this; with it, nodes will
wait until storage is up again.

The negotiate timer works in the following way:

1. The timer expires before the write timeout timer; its timeout is now half
of the write timeout. It is re-queued along with the write timeout timer.
If it expires, it sends a NEGO_TIMEOUT message to the master node (the node
with the lowest node number). This message does nothing but mark a bit in a
bitmap recording which nodes are negotiating timeout on the master node.

2. If storage is down, nodes will send this message to the master node; then
when the master node finds its bitmap covering all online nodes, it sends a
NEGO_APPROVE message to all nodes one by one. This message re-queues the
write timeout timer and the negotiate timer.
Any node that doesn't receive this message or meets some issue when
handling it will be fenced.
If storage comes up at any time, o2hb_thread will run and re-queue all the
timers, and nothing will be affected by these two steps.

Signed-off-by: Junxiao Bi 
Reviewed-by: Ryan Ding 
---
 fs/ocfs2/cluster/heartbeat.c |   52 ++
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index a3cc6d2fc896..b601ee95de50 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -272,6 +272,10 @@ struct o2hb_region {
struct delayed_work hr_write_timeout_work;
unsigned long   hr_last_timeout_start;
 
+   /* negotiate timer, used to negotiate extending hb timeout. */
+   struct delayed_work hr_nego_timeout_work;
+   unsigned long   hr_nego_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
+
/* Used during o2hb_check_slot to hold a copy of the block
 * being checked because we temporarily have to zero out the
 * crc field. */
@@ -320,7 +324,7 @@ static void o2hb_write_timeout(struct work_struct *work)
o2quo_disk_timeout();
 }
 
-static void o2hb_arm_write_timeout(struct o2hb_region *reg)
+static void o2hb_arm_timeout(struct o2hb_region *reg)
 {
/* Arm writeout only after thread reaches steady state */
if (atomic_read(&reg->hr_steady_iterations) != 0)
@@ -338,11 +342,50 @@ static void o2hb_arm_write_timeout(struct o2hb_region *reg)
reg->hr_last_timeout_start = jiffies;
schedule_delayed_work(&reg->hr_write_timeout_work,
  msecs_to_jiffies(O2HB_MAX_WRITE_TIMEOUT_MS));
+
+   cancel_delayed_work(&reg->hr_nego_timeout_work);
+   /* negotiate timeout must be less than write timeout. */
+   schedule_delayed_work(&reg->hr_nego_timeout_work,
+ msecs_to_jiffies(O2HB_MAX_WRITE_TIMEOUT_MS)/2);
+   memset(reg->hr_nego_node_bitmap, 0, sizeof(reg->hr_nego_node_bitmap));
 }
 
-static void o2hb_disarm_write_timeout(struct o2hb_region *reg)
+static void o2hb_disarm_timeout(struct o2hb_region *reg)
 {
cancel_delayed_work_sync(&reg->hr_write_timeout_work);
+   cancel_delayed_work_sync(&reg->hr_nego_timeout_work);
+}
+
+static void o2hb_nego_timeout(struct work_struct *work)
+{
+   struct o2hb_region *reg =
+   container_of(work, struct o2hb_region,
+hr_nego_timeout_work.work);
+   unsigned long live_node_bitmap[BITS_TO_LONGS(O2NM_MAX_NODES)];
+   int master_node;
+
+   o2hb_fill_node_map(live_node_bitmap, sizeof(live_node_bitmap));
+   /* lowest node as master node to make negotiate decision. */
+   master_node = find_next_bit(live_node_bitmap, O2NM_MAX_NODES, 0);
+
+   if (master_node == o2nm_this_node()) {
+   set_bit(master_node, reg->hr_nego_node_bitmap);
+   if (memcmp(reg->hr_nego_node_bitmap, live_node_bitmap,
+   sizeof(reg->hr_nego_node_bitmap))) {
+   /* check negotiate bitmap every second to do timeout
+* approve decision.
+*/
+   schedule_delayed_work(&reg->hr_nego_timeout_work,
+   msecs_to_jiffies(1000));
+
+   return;
+   }
+
+   /* approve negotiate timeout request. */
+   } else {
+   /* negotiate timeout with master node. */
+   }
+
 }
 
 static inline void o2hb_bio_wait_init(struct o2hb_bio_wait_ctxt *wc)
@@ -1033,7 +1076,7 @@ static int o2hb_do_disk_heartbeat(struct o2hb_region *reg)
/* Skip disarming the timeout if own slot has stale/bad data */
if (own_slot_ok) {
o2hb_set_quorum_device(reg);
-   o2hb_arm_write_timeout(reg);
+   o2hb_arm_timeout(reg);
}
 
 bail:
@@ -1115,7 +1158,7 @@ static int o2hb_thread(void *data)
}
}
 
-   o2hb_disarm_write_timeout(reg);
+   o2hb_disarm_timeout(reg);
 

[Ocfs2-devel] [PATCH 6/6] ocfs2: o2hb: fix hb hung time

2016-01-19 Thread Junxiao Bi
hr_last_timeout_start should be set to the last time at which the hb was still OK.
When the hb write times out, the hung time will be (jiffies - hr_last_timeout_start).
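
For context, a small sketch (not part of this patch) of how the hung time could then be derived when the write timeout fires, assuming the standard jiffies helpers:

    #include <linux/jiffies.h>

    /* elapsed time since the heartbeat was last known good, in ms */
    unsigned int hung_ms =
            jiffies_to_msecs(jiffies - reg->hr_last_timeout_start);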

Signed-off-by: Junxiao Bi 
Reviewed-by: Ryan Ding 
---
 fs/ocfs2/cluster/heartbeat.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index cb931381f474..a3ce5a734b7b 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -357,7 +357,6 @@ static void o2hb_arm_timeout(struct o2hb_region *reg)
spin_unlock(&o2hb_live_lock);
}
cancel_delayed_work(&reg->hr_write_timeout_work);
-   reg->hr_last_timeout_start = jiffies;
schedule_delayed_work(&reg->hr_write_timeout_work,
  msecs_to_jiffies(O2HB_MAX_WRITE_TIMEOUT_MS));
 
@@ -1176,6 +1175,7 @@ static int o2hb_do_disk_heartbeat(struct o2hb_region *reg)
if (own_slot_ok) {
o2hb_set_quorum_device(reg);
o2hb_arm_timeout(reg);
+   reg->hr_last_timeout_start = jiffies;
}
 
 bail:
-- 
1.7.9.5




  1   2   3   4   >