Re: [Gluster-devel] tests/bugs/quota/bug-1035576.t & tests/basic/quota-nfs.t spurious failures?
The below patch, submitted upstream, fixes the testcase './tests/basic/quota-nfs.t': http://review.gluster.org/#/c/12075/ Thanks, Vijay On Tuesday 01 September 2015 11:38 AM, Vijaikumar M wrote: We will look into this issue. Thanks, Vijay On Tuesday 01 September 2015 11:03 AM, Atin Mukherjee wrote: One more instance - https://build.gluster.org/job/rackspace-regression-2GB-triggered/13899/consoleFull Can you please put these tests in bad_tests()? On 08/31/2015 09:23 AM, Atin Mukherjee wrote: For tests/bugs/quota/bug-1035576.t refer [1]. For tests/basic/quota-nfs.t refer [2]. Please note I've not yet added these tests to the spurious failures list [3]. [1] https://build.gluster.org/job/rackspace-regression-2GB-triggered/13829/consoleFull [2] https://build.gluster.org/job/rackspace-regression-2GB-triggered/13839/consoleFull [3] https://public.pad.fsfe.org/p/gluster-spurious-failures Thanks, Atin ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Netbsd build failure
On Friday 21 August 2015 10:21 AM, Avra Sengupta wrote: + Adding Vijaikumar On 08/20/2015 04:19 PM, Niels de Vos wrote: On Thu, Aug 20, 2015 at 03:05:56AM -0400, Susant Palai wrote: Hi, I tried running the NetBSD regression twice on a patch, and twice it failed at the same point. Here is the error: snip Build GlusterFS *** + '/opt/qa/build.sh' File /usr/pkg/lib/python2.7/site.py, line 601 [2015-08-19 05:45:06.N]:++ G_LOG:./tests/basic/quota-anon-fd-nfs.t: TEST: 85 ! fd_write 3 content ++ This particular test is currently in bad tests and I believe Vijaikumar is looking into it. Could you please check whether there is any other failure (apart from this) that is failing the regression runs? ^ SyntaxError: invalid token We have marked the test './tests/basic/quota-anon-fd-nfs.t' as a bad test; I am not sure about the 'SyntaxError'. I think there is some parsing error in the shell script; we need to root-cause the issue. + RET=1 + '[' 1 '!=' 0 ']' + exit 1 Build step 'Exécuter un script shell' marked build as failure Finished: FAILURE /snip Requesting you to take a look into it. Which Jenkins slave was this? Got a link to the job that failed? This again looks like a NetBSD slave where logs from regression tests are overwriting random files. The /usr/pkg/lib/python2.7/site.py file should be valid Python, and not contain these logs... Does anyone have an idea why this happens? Thanks, Niels
Re: [Gluster-devel] NetBSD regression failures
On Monday 17 August 2015 12:22 PM, Avra Sengupta wrote: Hi, The NetBSD regression tests are continuously failing with errors in the following tests: ./tests/basic/mount-nfs-auth.t ./tests/basic/quota-anon-fd-nfs.t quota-anon-fd-nfs.t has known issues with NFS client caching, so it is marked as a bad test; the final result will be marked as success even if this test fails. Is there any recent change that is triggering this behaviour? Also, currently only one machine is running NetBSD tests. Can someone with access to Jenkins bring up a few more slaves to run NetBSD regressions in parallel? Regards, Avra
Re: [Gluster-devel] 3.7 spurious failures
On Monday 13 July 2015 11:14 PM, Joseph Fernandes wrote: Hi All, These are some of the recently hit spurious failures on the 3.7 branch: http://build.gluster.org/job/rackspace-regression-2GB-triggered/12356/consoleFull ./tests/bugs/snapshot/bug-1109889.t is blocking the http://review.gluster.org/11649 merge http://build.gluster.org/job/rackspace-regression-2GB-triggered/12357/consoleFull ./tests/bugs/fuse/bug-1126048.t is blocking the http://review.gluster.org/11608 merge NetBSD: http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/8257/consoleFull ./tests/basic/quota-nfs.t is blocking the http://review.gluster.org/11649 merge. This issue is fixed in master; we have submitted a back-port patch to 3.7. It will be merged by EOD today. Thanks, Vijay Appropriate owners please take a look. Thanks Regards, Joe
Re: [Gluster-devel] Spurious failure in 3.7.2: ./tests/bugs/quota/afr-quota-xattr-mdata-heal.t
Patch submitted upstream which fixes this issue: http://review.gluster.org/#/c/11583/ Will submit the fix for 3.7 as well. Thanks, Vijay On Friday 10 July 2015 01:19 PM, Joseph Fernandes wrote: http://build.gluster.org/job/rackspace-regression-2GB-triggered/12204/consoleFull
Re: [Gluster-devel] NetBSD regression tests not Initializing...
NetBSD tests are failing again: http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/8123/console Triggered by Gerrit: http://review.gluster.org/11616 in silent mode. Building remotely on nbslave74.cloud.gluster.org http://build.gluster.org/computer/nbslave74.cloud.gluster.org (netbsd7_regression) in workspace /home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository git config remote.origin.url http://review.gluster.org/glusterfs.git # timeout=10 Fetching upstream changes from http://review.gluster.org/glusterfs.git git --version # timeout=10 git -c core.askpass=true fetch --tags --progress http://review.gluster.org/glusterfs.git refs/changes/16/11616/1 ERROR: Error fetching remote repo 'origin' ERROR http://stacktrace.jenkins-ci.org/search?query=ERROR: Error fetching remote repo 'origin' Finished http://stacktrace.jenkins-ci.org/search?query=Finished: FAILURE Thanks, Vijay On Tuesday 07 July 2015 07:13 PM, Kaushal M wrote: I've taken this slave and one other offline and am rebooting them. On Tue, Jul 7, 2015 at 6:44 PM, Kotresh Hiremath Ravishankar khire...@redhat.com wrote: Hi Emmanuel, We are seeing these issues again on nbslave7h.cloud.gluster.org http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/7974/console Thanks and Regards, Kotresh H R - Original Message - From: Emmanuel Dreyfus m...@netbsd.org To: Kotresh Hiremath Ravishankar khire...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Sunday, July 5, 2015 12:52:23 AM Subject: Re: [Gluster-devel] NetBSD regression tests not Initializing... Kotresh Hiremath Ravishankar khire...@redhat.com wrote: Any help is appreciated. nbslave72 was indeed sick: it refused SSH connections. I rebooted it and retriggered your change, but it went to another machine.
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: [Gluster-devel] Spurious failures again
On Wednesday 08 July 2015 03:42 PM, Kaushal M wrote: I've been hitting spurious failures in Linux regression runs for my change [1]. The following tests failed, ./tests/basic/afr/replace-brick-self-heal.t [2] ./tests/bugs/replicate/bug-1238508-self-heal.t [3] ./tests/bugs/quota/afr-quota-xattr-mdata-heal.t [4] I will look into this issue ./tests/bugs/quota/bug-1235182.t [5] I have submitted two patches to fix failures from 'bug-1235182.t' http://review.gluster.org/#/c/11561/ http://review.gluster.org/#/c/11510/ ./tests/bugs/replicate/bug-977797.t [6] Can AFR and quota owners look into this? Thanks. Kaushal [1] https://review.gluster.org/11559 [2] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12023/consoleFull [3] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12029/consoleFull [4] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12044/consoleFull [5] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12060/consoleFull [6] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12071/consoleFull
Re: [Gluster-devel] Spurious failures again
On Wednesday 08 July 2015 03:53 PM, Vijaikumar M wrote: On Wednesday 08 July 2015 03:42 PM, Kaushal M wrote: I've been hitting spurious failures in Linux regression runs for my change [1]. The following tests failed, ./tests/basic/afr/replace-brick-self-heal.t [2] ./tests/bugs/replicate/bug-1238508-self-heal.t [3] ./tests/bugs/quota/afr-quota-xattr-mdata-heal.t [4] I will look into this issue Patch submitted: http://review.gluster.org/#/c/11583/ ./tests/bugs/quota/bug-1235182.t [5] I have submitted two patches to fix failures from 'bug-1235182.t' http://review.gluster.org/#/c/11561/ http://review.gluster.org/#/c/11510/ ./tests/bugs/replicate/bug-977797.t [6] Can AFR and quota owners look into this? Thanks. Kaushal [1] https://review.gluster.org/11559 [2] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12023/consoleFull [3] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12029/consoleFull [4] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12044/consoleFull [5] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12060/consoleFull [6] http://build.gluster.org/job/rackspace-regression-2GB-triggered/12071/consoleFull
Re: [Gluster-devel] Huge memory consumption with quota-marker
On Thursday 02 July 2015 11:27 AM, Krishnan Parthasarathi wrote: Yes. The PROC_MAX is the maximum no. of 'worker' threads that would be spawned for a given syncenv. - Original Message - - Original Message - From: Krishnan Parthasarathi kpart...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Vijay Bellur vbel...@redhat.com, Vijaikumar M vmall...@redhat.com, Gluster Devel gluster-devel@gluster.org, Raghavendra Gowdappa rgowd...@redhat.com, Nagaprasad Sathyanarayana nsath...@redhat.com Sent: Thursday, July 2, 2015 10:54:44 AM Subject: Re: Huge memory consumption with quota-marker Yes, we could take the synctask stack size as an argument to synctask_create. The increase in synctask threads is not really a problem; it can't grow beyond 16 (SYNCENV_PROC_MAX). That is, it cannot grow beyond PROC_MAX in a _single_ syncenv, I suppose. - Original Message - On 07/02/2015 10:40 AM, Krishnan Parthasarathi wrote: - Original Message - On Wednesday 01 July 2015 08:41 AM, Vijaikumar M wrote: Hi, The new marker xlator uses the syncop framework to update the quota-size in the background; it uses one synctask per write FOP. If there are 100 parallel writes to different inodes but in the same directory '/dir', there will be ~100 transactions waiting in a queue to acquire a lock on their parent, i.e. '/dir'. Each of these transactions uses a synctask, and each synctask allocates a stack of 2M (the default size), for a total of 200M. This usage can increase depending on the load. I am thinking of reducing the synctask stack size to 256k; will this be sufficient, given that we perform very limited operations within a synctask during marker updates? Seems like a good idea to me. Do we need a 256k stack size or can we live with something even smaller? It was 16K when synctask was introduced. This is a property of the syncenv. We could create a separate syncenv for marker transactions which has smaller stacks.
env-stacksize (and SYNCTASK_DEFAULT_STACKSIZE) was increased to 2MB to support pump-xlator-based data migration for replace-brick. For the no. of stack frames a marker transaction could use at any given time, we could use much less, say 16K. Does that make sense? What information do we store in this memory? Is it only the frames, or are we also storing the functions' stack data? Thanks, Vijay Creating one more syncenv will lead to extra sync-threads; maybe we can take the stack size as an argument. Pranith
Re: [Gluster-devel] Regression Failure: ./tests/basic/quota.t
We will look into this issue. Thanks, Vijay On Thursday 02 July 2015 11:46 AM, Kotresh Hiremath Ravishankar wrote: Hi, I see a quota.t regression failure for the following change. The changes are related to example programs in libgfchangelog. http://build.gluster.org/job/rackspace-regression-2GB-triggered/11785/consoleFull Could someone from the quota team take a look at it? Thanks and Regards, Kotresh H R
[Gluster-devel] Huge memory consumption with quota-marker
Hi, The new marker xlator uses the syncop framework to update the quota-size in the background; it uses one synctask per write FOP. If there are 100 parallel writes to different inodes but in the same directory '/dir', there will be ~100 transactions waiting in a queue to acquire a lock on their parent, i.e. '/dir'. Each of these transactions uses a synctask, and each synctask allocates a stack of 2M (the default size), for a total of 200M. This usage can increase depending on the load. I am thinking of reducing the synctask stack size to 256k; will this be sufficient, given that we perform very limited operations within a synctask during marker updates? Please provide suggestions for solving this problem. Thanks, Vijay
Re: [Gluster-devel] Three Issues Confused me recently
On Friday 26 June 2015 12:59 PM, Susant Palai wrote: Comment inline. - Original Message - From: christ1...@sina.com To: gluster-devel gluster-devel@gluster.org Sent: Thursday, 25 June, 2015 7:56:45 PM Subject: [Gluster-devel] Three Issues Confused me recently Hi, everyone! There are three issues that have confused me recently while using glusterfs to store huge amounts of data: 1) Is there any reason for reserving 10% free space on each brick in the volume? And can I avoid reserving the 10% free space on each brick? I will use glusterfs to store huge surveillance videos, so each brick will be given a large disk. If each brick reserves 10% free space, it leads to low disk utilization and wastes a lot of disk space. 10% is the default and it can be modified by the cluster.min-free-disk option, e.g. gluster v set _VOL_NAME_ min-free-disk 8GB *On the question of what this cluster.min-free-disk value should be:* cluster.min-free-disk: The min-free-disk setting establishes a data threshold for each brick in a volume. The primary intention of this is to ensure that there is adequate space to perform self-heal and rebalance operations, both of which require disk overhead. The min-free-disk value is taken into account only when it is already exceeded before a file is written. When that is the case, the DHT algorithm will choose to write the file to another brick where min-free-disk is not exceeded, and will write a 0-byte link-to file on the brick where min-free-disk is exceeded and where the file was originally hashed. This link-to file contains metadata to point the client to the brick where the data was actually written.
Because min-free-disk is only considered after it has been exceeded, and because the DHT algorithm makes no other consideration of available space on a brick, it is possible to write a large file that will exceed the space on the brick it is hashed to even while another brick has enough space to hold the file. This would result in an I/O error to the client. So if you know you routinely write files up to n GB in size, then min-free-disk can be set to a value a little larger than n. For example, if 5GB is at the high end of the file sizes you will be writing, then you might consider setting min-free-disk to 8GB. Doing this will ensure that the file goes to a brick with enough available space (assuming one exists). 2) Will there be exceptions when the filesystem, like xfs or ext4, is completely full? As mentioned above, new file creation will be redirected to a different brick with adequate space when min-free-disk is exceeded. 3) Is it normal to see very high CPU usage when directory quota is enabled? (glusterfs 3.6.2) What is the testcase that causes the high CPU usage? CCing the quota team for this. And is there any solution to avoid it? I very much appreciate your help, thanks. Best regards. Louis 2015/6/25
[Gluster-devel] spurious failure with test-case ./tests/basic/tier/tier.t
Hi, Upstream regression failure with test-case ./tests/basic/tier/tier.t: my patch #11315 failed regression twice with test-case ./tests/basic/tier/tier.t. Is anyone seeing this issue with other patches? http://build.gluster.org/job/rackspace-regression-2GB-triggered/11396/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11456/consoleFull Thanks, Vijay
Re: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing
Hi Niels, Patch #11022 is not available in downstream 3.1. Patch #11361 is a blocker for 3.1 and depends on the ref-count functions; is it possible to backport patch #11022 downstream? Thanks, Vijay On Tuesday 23 June 2015 06:26 PM, Niels de Vos wrote: On Tue, Jun 23, 2015 at 05:30:39PM +0530, Vijaikumar M wrote: On Tuesday 23 June 2015 04:28 PM, Niels de Vos wrote: On Tue, Jun 23, 2015 at 03:45:43PM +0530, Vijaikumar M wrote: I have submitted the below patch which fixes this issue. I am handling memory clean-up with a reference count mechanism. http://review.gluster.org/#/c/11361 Is there a reason you can not use the (new) refcounting functions that were introduced with http://review.gluster.org/11022 ? I was not aware that the ref-counting patch was merged. Sure, we will use these functions and re-submit my patch. Ok, thanks! Niels Thanks, Vijay It would be nicer to standardize all refcounting mechanisms on one implementation. I hope we can replace the existing refcounting with this one too. Introducing more ways of refcounting is not going to be helpful. Thanks, Niels Thanks, Vijay On Tuesday 23 June 2015 12:58 PM, Raghavendra G wrote: Multiple replies to the same query. Pick one ;). On Tue, Jun 23, 2015 at 12:55 PM, Venky Shankar yknev.shan...@gmail.com wrote: OK. Two reverts of the same patch ;) Pick one. On Tue, Jun 23, 2015 at 12:51 PM, Raghavendra Gowdappa rgowd...@redhat.com wrote: Seems like it's a memory corruption caused by: http://review.gluster.org/11311 I've reverted the patch at: http://review.gluster.org/11360 - Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, June 23, 2015 12:44:47 PM Subject: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing Hi, the quota test bug-1153964.t is failing consistently for a totally unrelated patch. Is this a known issue?
http://build.gluster.org/job/rackspace-regression-2GB-triggered/11142/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11165/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11172/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11191/consoleFull Xavi -- Raghavendra G
Re: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing
Hi All, I request you to rebase your patches that failed regression with test-case bug-1153964.t. Thanks, Vijay On Wednesday 24 June 2015 11:42 AM, Raghavendra Gowdappa wrote: http://review.gluster.org/#/c/11362/ has been merged. - Original Message - From: Atin Mukherjee amukh...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Niels de Vos nde...@redhat.com, Vijaikumar M vmall...@redhat.com, Raghavendra G raghaven...@gluster.com, Gluster Devel gluster-devel@gluster.org Sent: Wednesday, June 24, 2015 10:55:02 AM Subject: Re: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing On 06/24/2015 10:53 AM, Raghavendra Gowdappa wrote: - Original Message - From: Atin Mukherjee amukh...@redhat.com To: Niels de Vos nde...@redhat.com, Vijaikumar M vmall...@redhat.com Cc: Raghavendra G raghaven...@gluster.com, Gluster Devel gluster-devel@gluster.org Sent: Wednesday, June 24, 2015 10:15:12 AM Subject: Re: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing When is this patch getting merged? It is blocking other patches from getting in. The revert of http://review.gluster.org/11311 is waiting for regression runs to pass. There are three patches (duplicates of each other); if any one of them passes both regression runs, I'll merge it. As far as the refcounting mechanism goes, it'll take some time to review and merge the patch. Once the revert patch is merged, we are good to go. Please let us know once it's merged, as after that all patches need a rebase. ~Atin ~Atin On 06/23/2015 06:26 PM, Niels de Vos wrote: On Tue, Jun 23, 2015 at 05:30:39PM +0530, Vijaikumar M wrote: On Tuesday 23 June 2015 04:28 PM, Niels de Vos wrote: On Tue, Jun 23, 2015 at 03:45:43PM +0530, Vijaikumar M wrote: I have submitted the below patch which fixes this issue. I am handling memory clean-up with a reference count mechanism.
http://review.gluster.org/#/c/11361 Is there a reason you can not use the (new) refcounting functions that were introduced with http://review.gluster.org/11022 ? I was not aware that the ref-counting patch was merged. Sure, we will use these functions and re-submit my patch. Ok, thanks! Niels Thanks, Vijay It would be nicer to standardize all refcounting mechanisms on one implementation. I hope we can replace the existing refcounting with this one too. Introducing more ways of refcounting is not going to be helpful. Thanks, Niels Thanks, Vijay On Tuesday 23 June 2015 12:58 PM, Raghavendra G wrote: Multiple replies to the same query. Pick one ;). On Tue, Jun 23, 2015 at 12:55 PM, Venky Shankar yknev.shan...@gmail.com wrote: OK. Two reverts of the same patch ;) Pick one. On Tue, Jun 23, 2015 at 12:51 PM, Raghavendra Gowdappa rgowd...@redhat.com wrote: Seems like it's a memory corruption caused by: http://review.gluster.org/11311 I've reverted the patch at: http://review.gluster.org/11360 - Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, June 23, 2015 12:44:47 PM Subject: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing Hi, the quota test bug-1153964.t is failing consistently for a totally unrelated patch. Is this a known issue?
http://build.gluster.org/job/rackspace-regression-2GB-triggered/11142/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11165/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11172/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11191/consoleFull Xavi -- Raghavendra G -- ~Atin
Re: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing
On Tuesday 23 June 2015 04:28 PM, Niels de Vos wrote: On Tue, Jun 23, 2015 at 03:45:43PM +0530, Vijaikumar M wrote: I have submitted the below patch which fixes this issue. I am handling memory clean-up with a reference count mechanism. http://review.gluster.org/#/c/11361 Is there a reason you can not use the (new) refcounting functions that were introduced with http://review.gluster.org/11022 ? I was not aware that the ref-counting patch was merged. Sure, we will use these functions and re-submit my patch. Thanks, Vijay It would be nicer to standardize all refcounting mechanisms on one implementation. I hope we can replace the existing refcounting with this one too. Introducing more ways of refcounting is not going to be helpful. Thanks, Niels Thanks, Vijay On Tuesday 23 June 2015 12:58 PM, Raghavendra G wrote: Multiple replies to the same query. Pick one ;). On Tue, Jun 23, 2015 at 12:55 PM, Venky Shankar yknev.shan...@gmail.com wrote: OK. Two reverts of the same patch ;) Pick one. On Tue, Jun 23, 2015 at 12:51 PM, Raghavendra Gowdappa rgowd...@redhat.com wrote: Seems like it's a memory corruption caused by: http://review.gluster.org/11311 I've reverted the patch at: http://review.gluster.org/11360 - Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, June 23, 2015 12:44:47 PM Subject: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing Hi, the quota test bug-1153964.t is failing consistently for a totally unrelated patch. Is this a known issue?
http://build.gluster.org/job/rackspace-regression-2GB-triggered/11142/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11165/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11172/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11191/consoleFull Xavi -- Raghavendra G
Re: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing
I have submitted the below patch which fixes this issue. I am handling memory clean-up with a reference count mechanism. http://review.gluster.org/#/c/11361 Thanks, Vijay On Tuesday 23 June 2015 12:58 PM, Raghavendra G wrote: Multiple replies to the same query. Pick one ;). On Tue, Jun 23, 2015 at 12:55 PM, Venky Shankar yknev.shan...@gmail.com wrote: OK. Two reverts of the same patch ;) Pick one. On Tue, Jun 23, 2015 at 12:51 PM, Raghavendra Gowdappa rgowd...@redhat.com wrote: Seems like it's a memory corruption caused by: http://review.gluster.org/11311 I've reverted the patch at: http://review.gluster.org/11360 - Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, June 23, 2015 12:44:47 PM Subject: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing Hi, the quota test bug-1153964.t is failing consistently for a totally unrelated patch. Is this a known issue?
http://build.gluster.org/job/rackspace-regression-2GB-triggered/11142/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11165/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11172/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11191/consoleFull Xavi -- Raghavendra G
Re: [Gluster-devel] Moratorium on new patch acceptance
Here is the status on the quota test-case spurious failures. There were 3 issues: 1) Quota exceeding the limit because of parallel writes - merged upstream, patch submitted to release-3.7: #10910 ./tests/bugs/quota/bug-1038598.t ./tests/bugs/distribute/bug-1161156.t 2) Quota accounting going wrong - patch submitted: #10918 ./tests/basic/ec/quota.t ./tests/basic/quota-nfs.t 3) Quota with anonymous FDs on NetBSD: this is an NFS client caching issue on NetBSD. Sachin and I are working on this issue. ./tests/basic/quota-anon-fd-nfs.t Thanks, Vijay On Friday 22 May 2015 11:45 PM, Vijay Bellur wrote: On 05/21/2015 12:07 AM, Vijay Bellur wrote: On 05/19/2015 11:56 PM, Vijay Bellur wrote: On 05/18/2015 08:03 PM, Vijay Bellur wrote: On 05/16/2015 03:34 PM, Vijay Bellur wrote: I will send daily status updates from Monday (05/18) about this so that we are clear about where we are and what needs to be done to remove this moratorium. Appreciate your help in having a clean set of regression tests going forward! We have made some progress since Saturday. The problem with glupy.t has been fixed - thanks to Niels! All but the following tests have developers looking into them: ./tests/basic/afr/entry-self-heal.t ./tests/bugs/replicate/bug-976800.t ./tests/bugs/replicate/bug-1015990.t ./tests/bugs/quota/bug-1038598.t ./tests/basic/ec/quota.t ./tests/basic/quota-nfs.t ./tests/bugs/glusterd/bug-974007.t Can submitters of these test cases or current feature owners pick these up and start looking into the failures please? Do update the spurious failures etherpad [1] once you pick up a particular test. [1] https://public.pad.fsfe.org/p/gluster-spurious-failures Update for today - all tests that are known to fail have owners. Thanks everyone for chipping in! I think we should be able to lift this moratorium and resume normal patch acceptance shortly. Today's update - Pranith fixed a bunch of failures in erasure coding and Avra removed a test that was no longer relevant - thanks for that!
Quota, afr, snapshot and tiering tests are being looked into. Will provide an update on where we are with these tomorrow. A few tests have not been readily reproducible. Of the remaining tests, all but the following have either been root caused or have patches in review:
./tests/basic/mount-nfs-auth.t
./tests/performance/open-behind.t
./tests/basic/ec/ec-5-2.t
./tests/basic/quota-nfs.t
With some reviews and investigations of failing tests happening over the weekend, I am optimistic about being able to accept patches as usual from early next week. Thanks, Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Moratorium on new patch acceptance
On Tuesday 19 May 2015 09:50 PM, Shyam wrote: On 05/19/2015 11:23 AM, Vijaikumar M wrote: On Tuesday 19 May 2015 08:36 PM, Shyam wrote: On 05/19/2015 08:10 AM, Raghavendra G wrote: After discussion with Vijaykumar Mallikarjuna and other inputs in this thread, we are proposing that all quota tests comply with the following criteria:
* use dd always with oflag=append (to make sure there are no parallel writes) and conv=fdatasync (to make sure errors, if any, are delivered to the application; turning off flush-behind is optional since fdatasync acts as a barrier) OR
* turn off write-behind in the nfs client and glusterfs server.
What do you people think is a better test scenario? Also, we don't have confirmation on the RCA that parallel writes are indeed the culprits. We are trying to reproduce the issue locally. @Shyam, it would be helpful if you can confirm the hypothesis :). Ummm... I thought we acknowledged that quota checks are done during the WIND and updated during the UNWIND, that io-threads has in-flight IOs (as well as possible IOs in its queue), and that we have 256K writes in the case mentioned. Put together, in my head this forms a good RCA that we write more than needed due to the in-flight IOs on the brick. We need to control the in-flight IOs as a resolution for this from the application. In terms of actual proof, we would need to instrument the code and check. When you say it does not fail for you, does the file stop growing once the quota is reached, or is it a random size greater than the quota? Which itself may explain or point to the RCA.
The basic thing needed from an application is:
- Sync IOs, so that there aren't too many in-flight IOs and the application waits for each IO to complete
- Based on the tests below, if we keep the block size in dd lower and use oflag=sync we can achieve the same; if we use higher block sizes we cannot
Test results:
1) noac:
- NFS sends a COMMIT (internally translates to a flush) post each IO request (NFS WRITES are still with the UNSTABLE flag)
- Ensures prior IO is complete before the next IO request is sent (due to waiting on the COMMIT)
- Fails if the IO size is large, i.e. in the test case being discussed I changed the dd line that was failing to TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 *bs=10M* count=1 conv=fdatasync and this fails at times, as the writes here are sent as 256k chunks to the server and we still see the same behavior
- noac + performance.nfs.flush-behind: off + performance.flush-behind: off + performance.nfs.strict-write-ordering: on + performance.strict-write-ordering: on + performance.nfs.write-behind: off + performance.write-behind: off
- Still see similar failures, i.e. at times a 10MB file is created successfully in the modified dd command above
Overall, the switch works, but not always. If we are to use this variant then we need to announce that all quota tests using dd not try to go beyond the quota limit set in a single IO from dd.
2) oflag=sync:
- Exactly the same behavior as above.
3) Added all (and possibly the kitchen sink) to the test case, as attached, and still see failures:
- Yes, I have made the test fail intentionally (of sorts) by using 3M per dd IO and 2 IOs to go beyond the quota limit.
- The intention is to demonstrate that we still get parallel IOs from the NFS client
- The test would work if we reduce the block size per IO (reliability is a border condition here, and we need specific rules like block size and how many blocks before we state quota is exceeded etc.)
- The test would work if we just go beyond the quota, and then check that a separate dd instance is *not* able to exceed the quota. Which is why I put up that patch. What next? Hi Shyam, I tried running the test with the dd option 'oflag=append' and didn't see the issue. Can you please try this option and see if it works? Did that (in the attached script that I sent) and it still failed. Please note:
- This dd command passes (or fails with EDQUOT) - dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=512 count=10240 oflag=append oflag=sync conv=fdatasync
- We can even drop append and fdatasync, as sync sends a commit per block written, which is better for the test and quota enforcement, whereas fdatasync does one at the end and sometimes fails (with larger block sizes, say 1M)
- We can change bs to [512 - 256k]
- This dd command fails (or writes all the data) - dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=3M count=2 oflag=append oflag=sync conv=fdatasync
The reasoning is that when we write a larger block size, NFS sends it in multiple 256k chunks and then sends the commit before the next block. As a result, if we exceed quota in the *last block* that we are writing, we *may* fail. If we exceed quota in the last but one block we will pass. Hope this shorter version explains it better. (VijayM is educating me on quota (over IM), and it looks like the quota update happens as a synctask in the background, so post the flush (NFS commit) we
Re: [Gluster-devel] Moratorium on new patch acceptance
On Thursday 21 May 2015 06:48 PM, Shyam wrote: On 05/21/2015 04:04 AM, Vijaikumar M wrote: On Tuesday 19 May 2015 09:50 PM, Shyam wrote: On 05/19/2015 11:23 AM, Vijaikumar M wrote: Did that (in the attached script that I sent) and it still failed. Please note:
- This dd command passes (or fails with EDQUOT) - dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=512 count=10240 oflag=append oflag=sync conv=fdatasync
- We can even drop append and fdatasync, as sync sends a commit per block written, which is better for the test and quota enforcement, whereas fdatasync does one at the end and sometimes fails (with larger block sizes, say 1M)
- We can change bs to [512 - 256k]
- This dd command fails (or writes all the data) - dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=3M count=2 oflag=append oflag=sync conv=fdatasync
The reasoning is that when we write a larger block size, NFS sends it in multiple 256k chunks and then sends the commit before the next block. As a result, if we exceed quota in the *last block* that we are writing, we *may* fail. If we exceed quota in the last but one block we will pass. Hope this shorter version explains it better.
(VijayM is educating me on quota (over IM), and it looks like the quota update happens as a synctask in the background, so post the flush (NFS commit) we may still have a race.) Post-education solution:
- Quota updates the on-disk xattr as a synctask; as a result, if we exceeded quota in the (n-1)th block there is no guarantee that the nth block would fail, as the synctask may not have completed
So I think we need to do the following for the quota-based tests (expanding on the provided patch, http://review.gluster.org/#/c/10811/ ):
- First dd that exceeds quota (with either oflag=sync or conv=fdatasync, so that we do not see any flush-behind or write-behind effects) to be done without checks
- Next, check in an EXPECT_WITHIN that quota is exceeded (maybe add checks on the just created/appended file w.r.t. its minimum size that would make it exceed the quota)
- Then do a further dd to a new file, or append to an existing file, to get the EDQUOT error
- Proceed with whatever the test case needs to do next
Suggestions? Here is my analysis of the spurious failure with testcase tests/bugs/distribute/bug-1161156.t. In release-3.7, marker is refactored to use a synctask to do background accounting. I have done the below tests with different combinations and found that parallel writes are causing the spurious failure. I have filed bug# 1223658 to track the parallel write issue with quota. Agreed with the observations, tallies with mine. Just one addition: when we write 256k or less, the writes become serial as NFS writes in 256k chunks, and due to oflag=sync it follows up with a flush, correct? Yes. Test (2) is interesting: even with marker foreground updates (which are still in the UNWIND path), we observe failures. Do we know why? My analysis/understanding of the same is that we have more in-flight IOs that passed quota enforcement (due to accounting on the UNWIND path); does this bear any merit post your tests?
Yes, my understanding is the same: it could be because of more in-flight IOs, and there is not much impact from the marker doing background updates.
1) Parallel writes and marker background update (test always fails)
TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=3M count=2 conv=fdatasync oflag=sync oflag=append
The NFS client breaks 3M writes into multiple 256k chunks and does parallel writes
2) Parallel writes and marker foreground update (test always fails)
TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=3M count=2 conv=fdatasync oflag=sync oflag=append
Made a marker code change to account quota in the foreground (without synctask)
3) Serial writes and marker background update (test passed 100/100 times)
TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=256k count=24 conv=fdatasync oflag=sync oflag=append
Using a smaller block size (256k), so that the NFS client reduces parallel writes
4) Serial writes and marker foreground update (test passed 100/100 times)
TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=256k count=24 conv=fdatasync oflag=sync oflag=append
Using a smaller block size (256k), so that the NFS client reduces parallel writes. Made a marker code change to account quota in the foreground (without synctask)
5) Parallel writes on release-3.6 (test always fails)
TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=3M count=2 conv=fdatasync oflag=sync oflag=append
Moved the marker xlator above io-threads in the graph.
Thanks, Vijay
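For reference, the serial-writes and large-block dd variants from the matrix above can be exercised against a plain local file to sanity-check the flag combination itself. This is only a sketch with made-up, scaled-down sizes: there is no gluster, NFS or quota in the path, so it demonstrates the byte accounting of the flags, not the 256k chunking or enforcement behavior.

```shell
#!/bin/sh
# Sketch: the two dd shapes from the test matrix, run against a local
# file. Sizes are scaled down from the 3M/256k originals; both write
# 98304 bytes in total.
set -e
tmp=$(mktemp -d)

# "Serial" shape: many small blocks; oflag=sync forces each block to be
# synced before dd issues the next write.
dd if=/dev/zero of="$tmp/serial" bs=4k count=24 oflag=sync conv=fdatasync 2>/dev/null

# "Large-block" shape: one big write() from dd, which an NFS client
# would split into concurrent 256k WRITEs.
dd if=/dev/zero of="$tmp/large" bs=96k count=1 oflag=sync conv=fdatasync 2>/dev/null

echo "serial=$(wc -c < "$tmp/serial") large=$(wc -c < "$tmp/large")"
rm -rf "$tmp"
```

Note that dd accepts oflag=sync and conv=fdatasync together, which is the combination the thread converged on: per-block sync plus a final fdatasync.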
Re: [Gluster-devel] Moratorium on new patch acceptance
- The test would work if we just go beyond the quota, and then check that a separate dd instance is *not* able to exceed the quota. Which is why I put up that patch. What next? Hi Shyam, I tried running the test with the dd option 'oflag=append' and didn't see the issue. Can you please try this option and see if it works? Did that (in the attached script that I sent) and it still failed. Please note:
- This dd command passes (or fails with EDQUOT) - dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=512 count=10240 oflag=append oflag=sync conv=fdatasync
- We can even drop append and fdatasync, as sync sends a commit per block written, which is better for the test and quota enforcement, whereas fdatasync does one at the end and sometimes fails (with larger block sizes, say 1M)
- We can change bs to [512 - 256k]
Here you are trying to write 5M of data, which is always written, so the test will fail.
- This dd command fails (or writes all the data) - dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=3M count=2 oflag=append oflag=sync conv=fdatasync
Here you are trying to write 6M of data (exceeding the quota limit by only 1M) and the test can fail. With count=3, the test passes.
The reasoning is that when we write a larger block size, NFS sends it in multiple 256k chunks and then sends the commit before the next block. As a result, if we exceed quota in the *last block* that we are writing, we *may* fail. If we exceed quota in the last but one block we
Re: [Gluster-devel] Moratorium on new patch acceptance
On Tuesday 19 May 2015 06:13 AM, Shyam wrote: On 05/18/2015 07:05 PM, Shyam wrote: On 05/18/2015 03:49 PM, Shyam wrote: On 05/18/2015 10:33 AM, Vijay Bellur wrote: The etherpad did not call out ./tests/bugs/distribute/bug-1161156.t, which did not have an owner, so I took a stab at it and below are the results. I also think the failure in ./tests/bugs/quota/bug-1038598.t is the same as the observation below. NOTE: Anyone with better knowledge of quota can possibly chip in as to what we should expect in this case and how to correct the expectation in these test cases. (Details of ./tests/bugs/distribute/bug-1161156.t)
1) Failure is in TEST #20. Failed line: TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=1k count=10240 conv=fdatasync
2) The above line is expected to fail (i.e. dd is expected to fail) as the set quota is 20MB and we are attempting to exceed it by another 5MB at this point in the test case.
3) The failure is easily reproducible on my laptop, 2/10 times.
4) On debugging, I see that when the above dd succeeds (or the test fails, which means dd succeeded in writing more than the set quota), there are no write errors from the bricks or any errors on the final COMMIT RPC call to NFS. As a result the expectation of this test fails. NOTE: Sometimes there is a write failure from one of the bricks (the above test uses AFR as well), but AFR self-healing kicks in and fixes the problem, as expected, as the write succeeded on one of the replicas. I add this observation as the failed regression run logs have some EDQUOT errors reported in the client xlator, but only from one of the client bricks, and there are further AFR self-heal messages noted in the logs.
5) When the test case succeeds, the writes fail with EDQUOT as expected. There are times when the quota is exceeded by say 1MB - 4.8MB, but the test case still passes. Which means that if we were to try to exceed the quota by 1MB (instead of the 5MB as in the test case), this test case may fail always.
Here is why I think this passes the quota sometimes and not others, making this and the other test case mentioned below spurious.
- Each write is 256K from the client (that is what is sent over the wire)
- If more IO was queued by io-threads after passing quota checks, which in this 5MB case requires 20 IOs to be queued (16 IOs could be active in io-threads itself), we could end up writing more than the quota amount
So, if quota checks whether a write is violating the quota, lets it through, and updates the space used for future checks on the UNWIND, we could have more IO outstanding than what the quota allows, and as a result allow such a larger write to pass through, considering the io-threads queue and active IOs as well. Would this be a fair assumption of how quota works? I believe this is what is happening in this case. Checking a fix on my machine, and will post the same if it proves to help the situation. Posted a patch to fix the problem: http://review.gluster.org/#/c/10811/ There are arguably other ways to fix/overcome the same; this seemed apt for this test case though.
6) Note on dd with conv=fdatasync: As one of the fixes attempts to overcome this issue with the addition of conv=fdatasync, I wanted to cover that behavior here. What the above parameter does is send an NFS_COMMIT (which internally becomes a flush FOP) at the end of writing the blocks to the NFS share. This commit as a result triggers any pending writes for this file and sends the flush to the brick, all of which succeeds at times, resulting in the failure of the test case. NOTE: In the TC ./tests/bugs/quota/bug-1038598.t the failed line is pretty much in the same context (LINE 26: TEST ! dd if=/dev/zero of=$M0/test_dir/file1.txt bs=1024k count=15, expecting the hard limit to be exceeded, and there are no write failures in the logs (which should be expected with EDQUOT (122))). Currently we are not accounting in-progress writes (it is a bit complicated to account in-progress writes).
When a write is successful, the accounting for this is done by marker asynchronously. We can get other writes before the marker completes accounting the previously written size, so there is a small window where we exceed the quota limit. In the testcase we are attempting to write 5MB more; we may need to change this to write a few more MBs. Thanks, Vijay
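The overshoot window described above (admission checks reading stale accounting while writes are in flight) can be simulated deterministically in plain shell. The limit, block size and in-flight count below are invented numbers for illustration, not gluster parameters:

```shell
#!/bin/sh
# Deterministic sketch of "check on WIND, account on UNWIND": every
# in-flight write is admitted against a stale usage counter, and the
# accounting only catches up afterwards, so usage overshoots the limit.
LIMIT=20480               # pretend quota: 20KB
BLOCK=256                 # pretend per-write size
used=$((LIMIT - BLOCK))   # only one block of head-room left
inflight=16               # writes issued before any accounting update

admitted=0
i=0
while [ "$i" -lt "$inflight" ]; do
    # WIND-side check: sees the stale $used, so every write passes.
    [ "$used" -lt "$LIMIT" ] && admitted=$((admitted + 1))
    i=$((i + 1))
done

# UNWIND-side accounting: applied only after all the checks were done.
used=$((used + admitted * BLOCK))
echo "admitted=$admitted used=$used limit=$LIMIT overshoot=$((used - LIMIT))"
```

With these numbers all 16 writes are admitted even though only one block of quota remains, which mirrors why a dd that should hit EDQUOT can sometimes write the whole file.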
Re: [Gluster-devel] Moratorium on new patch acceptance
- The test would work if we just go beyond the quota, and then check that a separate dd instance is *not* able to exceed the quota. Which is why I put up that patch. What next? Hi Shyam, I tried running the test with the dd option 'oflag=append' and didn't see the issue. Can you please try this option and see if it works? Thanks, Vijay regards, Raghavendra. On Tue, May 19, 2015 at 5:27 PM, Raghavendra G raghaven...@gluster.com mailto:raghaven...@gluster.com wrote: On Tue, May 19, 2015 at 4:26 PM, Jeff Darcy jda...@redhat.com mailto:jda...@redhat.com wrote: No, my suggestion was aimed at not having parallel writes. In this case quota won't even fail the writes with EDQUOT because of the reasons explained above. Yes, we need to disable flush-behind along with this so that errors are delivered to the application. Would conv=sync help here? That should prevent any kind of write parallelism. An strace of dd shows that:
* fdatasync is issued only once, at the end of all writes, when conv=fdatasync
* for some strange reason no fsync or fdatasync is issued at all when conv=sync
So, using conv=fdatasync in the test cannot prevent write-parallelism induced by write-behind. Parallelism would've been prevented only if dd had issued fdatasync after each write or opened the file with O_SYNC. If it doesn't, I'd say that's a true test failure somewhere in our stack. A similar possibility would be to
Re: [Gluster-devel] NetBSD regression in quota-nfs.t
Hi Emmanuel, I have submitted another patch, http://review.gluster.org/#/c/9478/, for addressing the spurious failure with quota-nfs.t. Thanks, Vijay On Wednesday 18 March 2015 07:40 PM, Emmanuel Dreyfus wrote: On Wed, Mar 18, 2015 at 10:28:37AM +, Emmanuel Dreyfus wrote: Indeed, the test passes with this patch: And when submitting it I noticed that change has already been done.
Re: [Gluster-devel] uss.t in master doing bad things to our regression test VM's
Hi Justin, I have submitted patch 'http://review.gluster.org/#/c/9703/', which uses a different approach to generate a random string. Thanks, Vijay On Thursday 19 February 2015 05:21 PM, Vijaikumar M wrote: On Wednesday 18 February 2015 10:42 PM, Justin Clift wrote: Hi Vijaikumar, As part of investigating what is going wrong with our VM's in Rackspace, I created several new VM's (11 of them) and started a full regression test run on them. They're all hitting a major problem with uss.t. Part of it does a cat on /dev/urandom... which is taking several hours at 100% of a cpu. :( Here is output from ps -ef f on one of them:
root 12094  1287  0 13:23 ?  S  0:00  \_ /bin/bash /opt/qa/regression.sh
root 12101 12094  0 13:23 ?  S  0:00    \_ /bin/bash ./run-tests.sh
root 12116 12101  0 13:23 ?  S  0:01      \_ /usr/bin/perl /usr/bin/prove -rf --timer ./tests
root   382 12116  0 14:13 ?  S  0:00        \_ /bin/bash ./tests/basic/uss.t
root  1713   382  0 14:14 ?  S  0:00          \_ /bin/bash ./tests/basic/uss.t
root  1714  1713 96 14:14 ?  R  166:31          \_ cat /dev/urandom
root  1715  1713  2 14:14 ?  S  5:04            \_ tr -dc a-zA-Z
root  1716  1713  9 14:14 ?  S  16:31           \_ fold -w 8
And from top:
top - 17:09:19 up 3:50, 1 user, load average: 1.04, 1.03, 1.00
Tasks: 240 total, 3 running, 237 sleeping, 0 stopped, 0 zombie
Cpu0: 4.3%us, 95.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1: 8.1%us, 15.9%sy, 0.0%ni, 76.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1916672k total, 1119544k used, 797128k free, 114976k buffers
Swap: 0k total, 0k used, 0k free, 427032k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1714 root 20 0 98.6m 620 504 R 96.0 0.0 169:00.94 cat
137 root 20 0 36100 1396 1140 S 15.9 0.1 37:01.55 plymouthd
1716 root 20 0 98.6m 712 616 S 10.0 0.0 16:46.55 fold
1715 root 20 0 98.6m 636 540 S 2.7 0.0 5:08.95 tr
9 root 20 0 0 0 0 S 0.3 0.0 0:00.59 ksoftirqd/1
1 root 20 0 19232 1128 860 S 0.0 0.1 0:00.93 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
Your name is on the commit which added the code, but that was months ago.
No idea why it's suddenly being a problem. Do you have any idea? I am going to shut down all of these new test VM's except one, which I can give you (or anyone) access to, if that would help find and fix the problem. I am not sure why this is suddenly causing a problem. I can remove the 'cat /dev/urandom' and use a different approach to test this particular case. Thanks, Vijay Btw, this is pretty important. ;) + Justin -- GlusterFS - http://www.gluster.org An open source, distributed file system scaling to several petabytes, and handling thousands of clients. My personal twitter: twitter.com/realjustinclift
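One way to generate a short random string without an unbounded `cat /dev/urandom` stage is to let a bounded reader terminate the pipeline. This is a generic sketch, not necessarily the approach taken in the patch referenced above:

```shell
#!/bin/sh
# Sketch: bounded random-string generation. `head -c 8` exits as soon
# as 8 filtered characters have been produced; `tr` then dies on
# SIGPIPE, so no stage can spin at 100% CPU indefinitely the way an
# unterminated `cat /dev/urandom | tr | fold` pipeline can.
rand_str=$(tr -dc 'a-zA-Z' < /dev/urandom | head -c 8)
echo "$rand_str"
```

The key difference from the uss.t pipeline is that the consumer (`head -c`) bounds the amount of data pulled through, instead of relying on a downstream reader that may never stop consuming.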
[Gluster-devel] Quota with hard-links
Hi All, This is regarding quota accounting for hard-links. Currently, accounting is done only once for links created within the same directory, and accounting is done separately when links are created in a separate directory. With this approach, accounting may go wrong when a rename is performed on hard-linked files across directories. We are implementing one of the below-mentioned policies for hard-links when quota is enabled:
1) Allow creating hard-links only within the same directory. (We can hit the same problem if quota is enabled on pre-existing data which contains hard-links)
2) Allow creating hard-links only within the same branch where the limit is set. (We can hit the same problem if quota is enabled on pre-existing data which contains hard-links, and also when a quota-limit is set/unset)
3) Account for all the hard-links. (Marker treats all hard-links as a new file)
Please provide your feedback. Thanks, Vijay
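The pitfall behind policy (3), counting each path's size separately versus deduplicating by inode, can be shown with a plain hard link. This is an illustrative sketch (the 4096-byte size is arbitrary), not gluster's marker code:

```shell
#!/bin/sh
# Sketch: naive per-path accounting counts a hard-linked file once per
# name, while deduplicating by inode counts the underlying data once.
set -e
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/a" bs=4096 count=1 2>/dev/null
ln "$tmp/a" "$tmp/b"               # second name, same inode

size_a=$(wc -c < "$tmp/a")
size_b=$(wc -c < "$tmp/b")
naive=$((size_a + size_b))         # counts the same data twice

# Dedup by inode: `ls -i` prints the inode number before each name.
unique=$(ls -i "$tmp/a" "$tmp/b" | awk '{print $1}' | sort -u | wc -l)
deduped=$((size_a * unique))       # unique is 1, so the data counts once

echo "naive=$naive deduped=$deduped"
rm -rf "$tmp"
```

With one 4096-byte file and one hard link, the naive sum is 8192 while the inode-deduplicated usage is 4096, which is exactly the double-counting a "treat every link as a new file" policy would introduce.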
Re: [Gluster-devel] quota and snapshot testcase failure (zfs on CentOS 6.6)
Hi Kiran, Testcase './tests/basic/quota-anon-fd-nfs.t' is removed from the test suite. There are some issues with this testcase; we are working on it. We will add this test-case back soon once the issue is fixed. Thanks, Vijay On Tuesday 27 January 2015 06:11 PM, Vijaikumar M wrote: Hi Kiran, the quota.t failure issue has been fixed with patch http://review.gluster.org/#/c/9410/. Can you please re-try the test with this patch and see if it works? Thanks, Vijay On Wednesday 19 November 2014 10:32 AM, Pranith Kumar Karampuri wrote: On 11/19/2014 10:30 AM, Atin Mukherjee wrote: On 11/18/2014 10:35 PM, Pranith Kumar Karampuri wrote: On 11/12/2014 04:52 PM, Kiran Patil wrote: I have created zpools with the names d and mnt, and they appear in the filesystem as follows: d on /d type zfs (rw,xattr) mnt on /mnt type zfs (rw,xattr) Debug-enabled output of the quota.t testcase is at http://ur1.ca/irbt1. CC vijaikumar The quota-anon-fd-nfs.t spurious failure fix is addressed by http://review.gluster.org/#/c/9108/ This is just quota.t in tests/basic, not the anon-fd one Pranith ~Atin On Wed, Nov 12, 2014 at 3:22 PM, Kiran Patil ki...@fractalio.com mailto:ki...@fractalio.com wrote: Hi, Gluster suite report: Gluster version: glusterfs 3.6.1 On-disk filesystem: ZFS 0.6.3-1.1 Operating system: CentOS release 6.6 (Final) We are seeing quota and snapshot testcase failures. We are not sure why quota is failing, since quotas worked fine on gluster 3.4.
Test Summary Report
---
./tests/basic/quota-anon-fd-nfs.t (Wstat: 0 Tests: 16 Failed: 1) Failed test: 16
./tests/basic/quota.t (Wstat: 0 Tests: 73 Failed: 4) Failed tests: 24, 28, 32, 65
./tests/basic/uss.t (Wstat: 0 Tests: 147 Failed: 78) Failed tests: 8-11, 16-25, 28-29, 31-32, 39-40, 45-47, 49-57, 60-61, 63-64, 71-72, 78-87, 90-91, 93-94, 101-102, 107-115, 118-119, 121-122, 129-130, 134, 136-137, 139-140, 142-143, 145-146
./tests/basic/volume-snapshot.t (Wstat: 0 Tests: 30 Failed: 12) Failed tests: 11-18, 21-24
./tests/basic/volume-status.t (Wstat: 0 Tests: 14 Failed: 1) Failed test: 14
./tests/bugs/bug-1023974.t (Wstat: 0 Tests: 15 Failed: 1) Failed test: 12
./tests/bugs/bug-1038598.t (Wstat: 0 Tests: 28 Failed: 6) Failed tests: 17, 21-22, 26-28
./tests/bugs/bug-1045333.t (Wstat: 0 Tests: 16 Failed: 9) Failed tests: 7-15
./tests/bugs/bug-1049834.t (Wstat: 0 Tests: 18 Failed: 7) Failed tests: 11-14, 16-18
./tests/bugs/bug-1087203.t (Wstat: 0 Tests: 43 Failed: 2) Failed tests: 31, 41
./tests/bugs/bug-1090042.t (Wstat: 0 Tests: 12 Failed: 3) Failed tests: 9-11
./tests/bugs/bug-1109770.t (Wstat: 0 Tests: 19 Failed: 4) Failed tests: 8-11
./tests/bugs/bug-1109889.t (Wstat: 0 Tests: 20 Failed: 4) Failed tests: 8-11
./tests/bugs/bug-1112559.t (Wstat: 0 Tests: 11 Failed: 3) Failed tests: 8-9, 11
./tests/bugs/bug-1112613.t (Wstat: 0 Tests: 22 Failed: 5) Failed tests: 12-14, 17-18
./tests/bugs/bug-1113975.t (Wstat: 0 Tests: 13 Failed: 4) Failed tests: 8-9, 11-12
./tests/bugs/bug-847622.t (Wstat: 0 Tests: 10 Failed: 1) Failed test: 8
./tests/bugs/bug-861542.t (Wstat: 0 Tests: 13 Failed: 7) Failed tests: 7-13
./tests/features/ssl-authz.t (Wstat: 0 Tests: 18 Failed: 1) Failed test: 18
Files=277, Tests=7908, 8147 wallclock secs ( 4.56 usr 0.78 sys + 774.74 cusr 666.97 csys = 1447.05 CPU)
Result: FAIL
Thanks, Kiran.
___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] quota and snapshot testcase failure (zfs on CentOS 6.6)
Hi Kiran, quota.t failure issue has been fixed with patch http://review.gluster.org/#/c/9410/. Can you please re-try the test with this patch and see if it works? Thanks, Vijay On Wednesday 19 November 2014 10:32 AM, Pranith Kumar Karampuri wrote: On 11/19/2014 10:30 AM, Atin Mukherjee wrote: On 11/18/2014 10:35 PM, Pranith Kumar Karampuri wrote: On 11/12/2014 04:52 PM, Kiran Patil wrote: I have create zpool with name d and mnt and they appear in filesystem as follows. d on /d type zfs (rw,xattr) mnt on /mnt type zfs (rw,xattr) Debug enabled output of quota.t testcase is at http://ur1.ca/irbt1. CC vijaikumar quota-anon-fd-nfs.t spurious failure fix is addressed by http://review.gluster.org/#/c/9108/ This is just quota.t in tests/basic, not the anon-fd one Pranith ~Atin On Wed, Nov 12, 2014 at 3:22 PM, Kiran Patil ki...@fractalio.com mailto:ki...@fractalio.com wrote: Hi, Gluster suite report, Gluster version: glusterfs 3.6.1 On disk filesystem: Zfs 0.6.3-1.1 Operating system: CentOS release 6.6 (Final) We are seeing quota and snapshot testcase failures. We are not sure why quota is failing since quotas worked fine on gluster 3.4. 
Test Summary Report --- ./tests/basic/quota-anon-fd-nfs.t (Wstat: 0 Tests: 16 Failed: 1) Failed test: 16 ./tests/basic/quota.t (Wstat: 0 Tests: 73 Failed: 4) Failed tests: 24, 28, 32, 65 ./tests/basic/uss.t (Wstat: 0 Tests: 147 Failed: 78) Failed tests: 8-11, 16-25, 28-29, 31-32, 39-40, 45-47 49-57, 60-61, 63-64, 71-72, 78-87, 90-91 93-94, 101-102, 107-115, 118-119, 121-122 129-130, 134, 136-137, 139-140, 142-143 145-146 ./tests/basic/volume-snapshot.t (Wstat: 0 Tests: 30 Failed: 12) Failed tests: 11-18, 21-24 ./tests/basic/volume-status.t (Wstat: 0 Tests: 14 Failed: 1) Failed test: 14 ./tests/bugs/bug-1023974.t (Wstat: 0 Tests: 15 Failed: 1) Failed test: 12 ./tests/bugs/bug-1038598.t (Wstat: 0 Tests: 28 Failed: 6) Failed tests: 17, 21-22, 26-28 ./tests/bugs/bug-1045333.t (Wstat: 0 Tests: 16 Failed: 9) Failed tests: 7-15 ./tests/bugs/bug-1049834.t (Wstat: 0 Tests: 18 Failed: 7) Failed tests: 11-14, 16-18 ./tests/bugs/bug-1087203.t (Wstat: 0 Tests: 43 Failed: 2) Failed tests: 31, 41 ./tests/bugs/bug-1090042.t (Wstat: 0 Tests: 12 Failed: 3) Failed tests: 9-11 ./tests/bugs/bug-1109770.t (Wstat: 0 Tests: 19 Failed: 4) Failed tests: 8-11 ./tests/bugs/bug-1109889.t (Wstat: 0 Tests: 20 Failed: 4) Failed tests: 8-11 ./tests/bugs/bug-1112559.t (Wstat: 0 Tests: 11 Failed: 3) Failed tests: 8-9, 11 ./tests/bugs/bug-1112613.t (Wstat: 0 Tests: 22 Failed: 5) Failed tests: 12-14, 17-18 ./tests/bugs/bug-1113975.t (Wstat: 0 Tests: 13 Failed: 4) Failed tests: 8-9, 11-12 ./tests/bugs/bug-847622.t (Wstat: 0 Tests: 10 Failed: 1) Failed test: 8 ./tests/bugs/bug-861542.t (Wstat: 0 Tests: 13 Failed: 7) Failed tests: 7-13 ./tests/features/ssl-authz.t (Wstat: 0 Tests: 18 Failed: 1) Failed test: 18 Files=277, Tests=7908, 8147 wallclock secs ( 4.56 usr 0.78 sys + 774.74 cusr 666.97 csys = 1447.05 CPU) Result: FAIL Thanks, Kiran. 
___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] [Gluster-users] Quota command bug in 3.6.1?
Hi Raghuram, Thanks for reporting the problem. We will submit the fix upstream soon. Thanks, Vijay On Wednesday 14 January 2015 01:50 PM, Raghuram BK wrote: When I issue the quota list command with the xml option, it seems to return non-XML data:

[root@fractalio-66f2 fractalio]# gluster --version
glusterfs 3.6.1 built on Jan 13 2015 16:46:51
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. http://www.gluster.com
GlusterFS comes with ABSOLUTELY NO WARRANTY. You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

[root@primary templates]# gluster volume quota vol1 list --xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cliOutput>
  <opRet>0</opRet>
  <opErrno>0</opErrno>
  <opErrstr/>
  <volQuota/>
</cliOutput>
Path Hard-limit Soft-limit Used Available Soft-limit exceeded? Hard-limit exceeded?
---------------------------------------------------------------------------------
/ 10.0GB 80% 0Bytes 10.0GB No No

-- Fractalio Data, India Mobile: +91 96635 92022 Email: r...@fractalio.com Web: www.fractalio.com ___ Gluster-users mailing list gluster-us...@gluster.org http://www.gluster.org/mailman/listinfo/gluster-users ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
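As a quick way to confirm the bug above before and after the fix, one could check that a `--xml` run emits a single XML document with nothing appended. The helper below is a sketch of my own (the function name and the crude "every non-blank line starts with '<'" heuristic are assumptions, not part of Gluster):

```shell
# Hypothetical check: return success only if the given output looks like
# pure XML (every non-blank line starts with '<'). The mixed XML +
# plain-text table shown above would fail this check.
is_pure_xml() {
    printf '%s\n' "$1" | grep -v '^[[:space:]]*$' | grep -qv '^[[:space:]]*<' && return 1
    return 0
}
```

Usage would be along the lines of `is_pure_xml "$(gluster volume quota vol1 list --xml)"`.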
Re: [Gluster-devel] Spurious failure in quota test cases
I see the below error in the log file. I think somehow an old mount was not cleaned up properly. File: cli.log [2014-12-30 11:23:19.553912] W [cli-cmd-volume.c:886:gf_cli_create_auxiliary_mount] 0-cli: failed to mount glusterfs client. Please check the log file /var/log/glusterfs/quota-mount-patchy.log for more details File: quota-mount-patchy.log [2014-12-30 09:54:38.093890] I [MSGID: 100030] [glusterfsd.c:2027:main] 0-/build/install/sbin/glusterfs: Started running /build/install/sbin/glusterfs version 3.7dev (args: /build/install/sbin/glusterfs -s localhost --volfile-id patchy -l /var/log/glusterfs/quota-mount-patchy.log -p /var/run/gluster/patchy.pid --client-pid -5 /var/run/gluster/patchy/) [2014-12-30 09:54:38.094546] E [fuse-bridge.c:5338:init] 0-fuse: Mountpoint /var/run/gluster/patchy/ seems to have a stale mount, run 'umount /var/run/gluster/patchy/' and try again. Can someone who has access to the build machine please clear the stale mount on '/var/run/gluster/patchy/'? Thanks, Vijay On Friday 02 January 2015 04:38 PM, Atin Mukherjee wrote: Hi Vijai, It seems like lots of regression test cases are failing due to auxiliary mount failure in cli, and that's because of left-over auxiliary mount points. [2014-12-30 10:21:15.875965] E [fuse-bridge.c:5338:init] 0-fuse: Mountpoint /var/run/gluster/patchy/ seems to have a stale mount, run 'umount /var/run/gluster/patchy/' and try again. One such instance can be found at [1] Can you please look into it? ~Atin [1] http://build.gluster.org/job/rackspace-regression-2GB-triggered/3406/consoleFull ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
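For clearing such left-over auxiliary mounts in bulk on a build slave, something like the sketch below could help. The function name and the `/var/run/gluster/` prefix are assumptions based on the log messages above, not an official Gluster tool:

```shell
# Hypothetical helper: given text in /proc/mounts format (device,
# mountpoint, fstype, options, ...), print the mountpoints under
# /var/run/gluster/ so stale auxiliary quota mounts can be unmounted.
list_aux_mounts() {
    printf '%s\n' "$1" | awk '$2 ~ "^/var/run/gluster/" { print $2 }'
}

# On an affected machine one would then lazily unmount each hit:
#   for m in $(list_aux_mounts "$(cat /proc/mounts)"); do umount -l "$m"; done
```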
Re: [Gluster-devel] handling statfs call in USS
On Wednesday 24 December 2014 02:30 PM, Raghavendra Bhat wrote: Hi, I have a doubt. In user serviceable snapshots, the statfs call is not implemented as of now. There are two ways statfs can be handled. 1) Whenever the snapview-client xlator gets a statfs call on a path that belongs to the snapshot world, it can send the statfs call to the main volume itself, with the path and the inode set to the root of the main volume. In this approach, when the statfs call is sent to the main volume with path and inode set to the root, it can give an incorrect value when quota and quota-deem-statfs are enabled. The path/inode should be set to the parent of '.snaps'. Thanks, Vijay OR 2) It can redirect the call to the snapshot world (the snapshot daemon which talks to all the snapshots of that particular volume) and send back the reply that it has obtained. Please provide feedback. Regards, Raghavendra Bhat ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Proposal for more sub-maintainers
On Thursday 04 December 2014 08:32 PM, Niels de Vos wrote: On Fri, Nov 28, 2014 at 01:08:29PM +0530, Vijay Bellur wrote: Hi All, To supplement our ongoing effort of better patch management, I am proposing the addition of more sub-maintainers for various components. The rationale behind this proposal and the responsibilities of maintainers continue to be the same as discussed in these lists a while ago [1]. Here is the proposed list:
Build - Kaleb Keithley, Niels de Vos
DHT - Raghavendra Gowdappa, Shyam Ranganathan
docs - Humble Chirammal, Lalatendu Mohanty
gfapi - Niels de Vos, Shyam Ranganathan
index, io-threads - Pranith Karampuri
posix - Pranith Karampuri, Raghavendra Bhat
I'm wondering if there are any volunteers for maintaining the FUSE component? And maybe rewrite it to use libgfapi and drop the mount.glusterfs script? I am interested. Thanks, Vijay Niels We intend to update Gerrit with this list by 8th of December. Please let us know if you have objections, concerns or feedback on this process by then. Thanks, Vijay [1] http://gluster.org/pipermail/gluster-devel/2014-April/025425.html ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] snapshot restore and USS
On Monday 01 December 2014 05:36 PM, Raghavendra Bhat wrote: On Monday 01 December 2014 04:51 PM, Raghavendra G wrote: On Fri, Nov 28, 2014 at 6:48 PM, RAGHAVENDRA TALUR raghavendra.ta...@gmail.com mailto:raghavendra.ta...@gmail.com wrote: On Thu, Nov 27, 2014 at 2:59 PM, Raghavendra Bhat rab...@redhat.com mailto:rab...@redhat.com wrote: Hi, With USS to access snapshots, we depend on last snapshot of the volume (or the latest snapshot) to resolve some issues. Ex: Say there is a directory called dir within the root of the volume and USS is enabled. Now when .snaps is accessed from dir (i.e. /dir/.snaps), first a lookup is sent on /dir which snapview-client xlator passes onto the normal graph till posix xlator of the brick. Next the lookup comes on /dir/.snaps. snapview-client xlator now redirects this call to the snap daemon (since .snaps is a virtual directory to access the snapshots). The lookup comes to snap daemon with parent gfid set to the gfid of /dir and the basename being set to .snaps. Snap daemon will first try to resolve the parent gfid by trying to find the inode for that gfid. But since that gfid was not looked up before in the snap daemon, it will not be able to find the inode. So now to resolve it, snap daemon depends upon the latest snapshot. i.e. it tries to look up the gfid of /dir in the latest snapshot and if it can get the gfid, then lookup on /dir/.snaps is also successful. From the user point of view, I would like to be able to enter into the .snaps anywhere. To be able to do that, we can turn the dependency upside down, instead of listing all snaps in the .snaps dir, lets just show whatever snapshots had that dir. Currently readdir in snap-view server is listing _all_ the snapshots. However if you try to do ls on a snapshot which doesn't contain this directory (say dir/.snaps/snap3), I think it returns ESTALE/ENOENT. 
So, to get what you've explained above, readdir(p) should filter out those snapshots which don't contain this directory (to do that, it has to look up dir on each of the snapshots). Raghavendra Bhat explained the problem and also a possible solution to me in person. There are some pieces missing in the problem description as explained in the mail (but not in the discussion we had). The problem explained here occurs when you restore a snapshot (say snap3) where the directory got created, but was deleted before the next snapshot. So, the directory doesn't exist in snap2 and snap4, but exists only in snap3. Now, when you restore snap3, ls on dir/.snaps should show nothing. Now, what should the result of lookup (gfid-of-dir, .snaps) be? 1. We can blindly return a virtual inode, assuming at least one snapshot contains dir. If fops come on specific snapshots (e.g., dir/.snaps/snap4), they'll fail anyway with ENOENT (since dir is not present on any snaps). 2. We can choose to return ENOENT if we figure out that dir is not present on any snaps. The problem we are trying to solve here is how to achieve 2. One simple solution is to look up gfid-of-dir on all the snapshots, and if every lookup fails with ENOENT, we can return ENOENT. The other solution is to just look up in the snapshots before and after (if both are present, otherwise just in the latest snapshot). If both fail, then we can be sure that no snapshots contain that directory. Rabhat, Correct me if I've missed out anything :). If a readdir on .snaps entered from a non-root directory has to show only those snapshots where the directory (or rather the gfid of the directory) is present, then achieving it will be a bit costly. When readdir comes on .snaps entered from a non-root directory (say ls /dir/.snaps), the following operations have to be performed: 1) In an array we have the names of all the snapshots.
So, do a nameless lookup on the gfid of /dir on all the snapshots. 2) Based on which snapshots have sent success for the above lookup, build a new array or list of snapshots. 3) Then send the above new list as the readdir entries. But the above operation is costlier: just to serve one readdir request we have to make a lookup on each snapshot (if there are 256 snapshots, then we have to make 256 lookup calls over the network). One more thing is resource usage. As of now any snapshot will be initialized (i.e. via gfapi a connection is established with the corresponding snapshot volume, which is equivalent to a mounted volume) when that snapshot is accessed (from a fops point of view, a lookup comes on the snapshot entry, say ls /dir/.snaps/snap1). Now to serve readdir all the snapshots will be accessed and all the snapshots are initialized. This means there can be 256 instances of gfapi connections, with each instance having its own inode table and other resources. After
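The filtering idea in steps 1-3 above can be illustrated with a toy sketch (all names here are my own, and a real implementation would do gfid-based network lookups in the snap daemon rather than local path checks — the per-snapshot lookup being exactly the cost under discussion):

```shell
# Toy illustration of filtering readdir entries: given a directory path
# relative to the volume root and a set of snapshot mountpoints, print
# only the snapshots that actually contain that directory.
snaps_containing() {
    dir_rel=$1; shift
    for snap in "$@"; do
        # stand-in for a per-snapshot gfid lookup
        [ -d "$snap/$dir_rel" ] && basename "$snap"
    done
    return 0
}
```

With 256 snapshots this loop performs 256 existence checks per readdir, which is why the mail argues the approach is costly.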
Re: [Gluster-devel] quota and snapshot testcase failure (zfs on CentOS 6.6)
Hi Kiran, Can we also get the xattrs of all the directories on the bricks? How to capture xattrs of all dirs in a brick is here: Edit quota.t and find all the lines that match 'EXPECT_WITHIN $MARKER_UPDATE_TIMEOUT .. usage .' Add the below lines after every match:
echo "Matching Testcase" >> /var/tmp/quota-xattr.txt
for file in `find $B0 -type d`; do echo $file; getfattr -d -m . -e hex $file; echo; done >> /var/tmp/quota-xattr.txt
echo >> /var/tmp/quota-xattr.txt
Thanks, Vijay On Wednesday 19 November 2014 01:17 PM, Vijaikumar M wrote: Hi Kiran, Can we get the brick, client and quotad logs? Thanks, Vijay On Tuesday 18 November 2014 10:35 PM, Pranith Kumar Karampuri wrote: On 11/12/2014 04:52 PM, Kiran Patil wrote: I have created zpools with names d and mnt, and they appear in the filesystem as follows. d on /d type zfs (rw,xattr) mnt on /mnt type zfs (rw,xattr) Debug-enabled output of the quota.t testcase is at http://ur1.ca/irbt1. CC vijaikumar On Wed, Nov 12, 2014 at 3:22 PM, Kiran Patil ki...@fractalio.com wrote: Hi, Gluster suite report, Gluster version: glusterfs 3.6.1 On-disk filesystem: Zfs 0.6.3-1.1 Operating system: CentOS release 6.6 (Final) We are seeing quota and snapshot testcase failures. We are not sure why quota is failing since quotas worked fine on gluster 3.4.
Test Summary Report --- ./tests/basic/quota-anon-fd-nfs.t (Wstat: 0 Tests: 16 Failed: 1) Failed test: 16 ./tests/basic/quota.t (Wstat: 0 Tests: 73 Failed: 4) Failed tests: 24, 28, 32, 65 ./tests/basic/uss.t (Wstat: 0 Tests: 147 Failed: 78) Failed tests: 8-11, 16-25, 28-29, 31-32, 39-40, 45-47 49-57, 60-61, 63-64, 71-72, 78-87, 90-91 93-94, 101-102, 107-115, 118-119, 121-122 129-130, 134, 136-137, 139-140, 142-143 145-146 ./tests/basic/volume-snapshot.t (Wstat: 0 Tests: 30 Failed: 12) Failed tests: 11-18, 21-24 ./tests/basic/volume-status.t (Wstat: 0 Tests: 14 Failed: 1) Failed test: 14 ./tests/bugs/bug-1023974.t (Wstat: 0 Tests: 15 Failed: 1) Failed test: 12 ./tests/bugs/bug-1038598.t (Wstat: 0 Tests: 28 Failed: 6) Failed tests: 17, 21-22, 26-28 ./tests/bugs/bug-1045333.t (Wstat: 0 Tests: 16 Failed: 9) Failed tests: 7-15 ./tests/bugs/bug-1049834.t (Wstat: 0 Tests: 18 Failed: 7) Failed tests: 11-14, 16-18 ./tests/bugs/bug-1087203.t (Wstat: 0 Tests: 43 Failed: 2) Failed tests: 31, 41 ./tests/bugs/bug-1090042.t (Wstat: 0 Tests: 12 Failed: 3) Failed tests: 9-11 ./tests/bugs/bug-1109770.t (Wstat: 0 Tests: 19 Failed: 4) Failed tests: 8-11 ./tests/bugs/bug-1109889.t (Wstat: 0 Tests: 20 Failed: 4) Failed tests: 8-11 ./tests/bugs/bug-1112559.t (Wstat: 0 Tests: 11 Failed: 3) Failed tests: 8-9, 11 ./tests/bugs/bug-1112613.t (Wstat: 0 Tests: 22 Failed: 5) Failed tests: 12-14, 17-18 ./tests/bugs/bug-1113975.t (Wstat: 0 Tests: 13 Failed: 4) Failed tests: 8-9, 11-12 ./tests/bugs/bug-847622.t (Wstat: 0 Tests: 10 Failed: 1) Failed test: 8 ./tests/bugs/bug-861542.t (Wstat: 0 Tests: 13 Failed: 7) Failed tests: 7-13 ./tests/features/ssl-authz.t (Wstat: 0 Tests: 18 Failed: 1) Failed test: 18 Files=277, Tests=7908, 8147 wallclock secs ( 4.56 usr 0.78 sys + 774.74 cusr 666.97 csys = 1447.05 CPU) Result: FAIL Thanks, Kiran. 
___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] quota and snapshot testcase failure (zfs on CentOS 6.6)
Hi Kiran, Can we get the brick, client and quotad logs? Thanks, Vijay On Tuesday 18 November 2014 10:35 PM, Pranith Kumar Karampuri wrote: On 11/12/2014 04:52 PM, Kiran Patil wrote: I have create zpool with name d and mnt and they appear in filesystem as follows. d on /d type zfs (rw,xattr) mnt on /mnt type zfs (rw,xattr) Debug enabled output of quota.t testcase is at http://ur1.ca/irbt1. CC vijaikumar On Wed, Nov 12, 2014 at 3:22 PM, Kiran Patil ki...@fractalio.com mailto:ki...@fractalio.com wrote: Hi, Gluster suite report, Gluster version: glusterfs 3.6.1 On disk filesystem: Zfs 0.6.3-1.1 Operating system: CentOS release 6.6 (Final) We are seeing quota and snapshot testcase failures. We are not sure why quota is failing since quotas worked fine on gluster 3.4. Test Summary Report --- ./tests/basic/quota-anon-fd-nfs.t(Wstat: 0 Tests: 16 Failed: 1) Failed test: 16 ./tests/basic/quota.t(Wstat: 0 Tests: 73 Failed: 4) Failed tests: 24, 28, 32, 65 ./tests/basic/uss.t(Wstat: 0 Tests: 147 Failed: 78) Failed tests: 8-11, 16-25, 28-29, 31-32, 39-40, 45-47 49-57, 60-61, 63-64, 71-72, 78-87, 90-91 93-94, 101-102, 107-115, 118-119, 121-122 129-130, 134, 136-137, 139-140, 142-143 145-146 ./tests/basic/volume-snapshot.t(Wstat: 0 Tests: 30 Failed: 12) Failed tests: 11-18, 21-24 ./tests/basic/volume-status.t(Wstat: 0 Tests: 14 Failed: 1) Failed test: 14 ./tests/bugs/bug-1023974.t (Wstat: 0 Tests: 15 Failed: 1) Failed test: 12 ./tests/bugs/bug-1038598.t (Wstat: 0 Tests: 28 Failed: 6) Failed tests: 17, 21-22, 26-28 ./tests/bugs/bug-1045333.t (Wstat: 0 Tests: 16 Failed: 9) Failed tests: 7-15 ./tests/bugs/bug-1049834.t (Wstat: 0 Tests: 18 Failed: 7) Failed tests: 11-14, 16-18 ./tests/bugs/bug-1087203.t (Wstat: 0 Tests: 43 Failed: 2) Failed tests: 31, 41 ./tests/bugs/bug-1090042.t (Wstat: 0 Tests: 12 Failed: 3) Failed tests: 9-11 ./tests/bugs/bug-1109770.t (Wstat: 0 Tests: 19 Failed: 4) Failed tests: 8-11 ./tests/bugs/bug-1109889.t (Wstat: 0 Tests: 20 Failed: 4) Failed tests: 8-11 
./tests/bugs/bug-1112559.t (Wstat: 0 Tests: 11 Failed: 3) Failed tests: 8-9, 11 ./tests/bugs/bug-1112613.t (Wstat: 0 Tests: 22 Failed: 5) Failed tests: 12-14, 17-18 ./tests/bugs/bug-1113975.t (Wstat: 0 Tests: 13 Failed: 4) Failed tests: 8-9, 11-12 ./tests/bugs/bug-847622.t(Wstat: 0 Tests: 10 Failed: 1) Failed test: 8 ./tests/bugs/bug-861542.t(Wstat: 0 Tests: 13 Failed: 7) Failed tests: 7-13 ./tests/features/ssl-authz.t (Wstat: 0 Tests: 18 Failed: 1) Failed test: 18 Files=277, Tests=7908, 8147 wallclock secs ( 4.56 usr 0.78 sys + 774.74 cusr 666.97 csys = 1447.05 CPU) Result: FAIL Thanks, Kiran. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Change in glusterfs[master]: epoll: Handle client and server FDs in a separate event pool
Hi Jeff, I missed adding this: SSL_pending was 0 before calling SSL_read, and hence SSL_get_error returned 'SSL_ERROR_WANT_READ'. Thanks, Vijay On Tuesday 24 June 2014 05:15 PM, Vijaikumar M wrote: Hi Jeff, This is regarding the patch http://review.gluster.org/#/c/3842/ (epoll: edge triggered and multi-threaded epoll). The testcase './tests/bugs/bug-873367.t' hangs with this fix (please find the stack trace below). In the code snippet below we found that 'SSL_pending' was returning 0. I have added a condition here to return from the function when there is no data available. Please suggest whether this is OK, or whether we need to restructure this function for multi-threaded epoll.
code: socket.c
178 static int
179 ssl_do (rpc_transport_t *this, void *buf, size_t len, SSL_trinary_func *func)
180 {
211         switch (SSL_get_error(priv->ssl_ssl,r)) {
212         case SSL_ERROR_NONE:
213                 return r;
214         case SSL_ERROR_WANT_READ:
215                 if (SSL_pending(priv->ssl_ssl) == 0)
216                         return r;
217                 pfd.fd = priv->sock;
221                 if (poll(&pfd,1,-1) < 0) {
/code
Thanks, Vijay On Tuesday 24 June 2014 03:55 PM, Vijaikumar M wrote: From the stack trace we found that the function 'socket_submit_request' is waiting on a mutex lock. The lock is held by the function 'ssl_do', and that function is blocked in the poll syscall.
(gdb) bt #0 0x003daa80822d in pthread_join () from /lib64/libpthread.so.0 #1 0x7f3b94eea9d0 in event_dispatch_epoll (event_pool=value optimized out) at event-epoll.c:632 #2 0x00407ecd in main (argc=4, argv=0x7fff160a4528) at glusterfsd.c:2023 (gdb) info threads 10 Thread 0x7f3b8d483700 (LWP 26225) 0x003daa80e264 in __lll_lock_wait () from /lib64/libpthread.so.0 9 Thread 0x7f3b8ca82700 (LWP 26226) 0x003daa80f4b5 in sigwait () from /lib64/libpthread.so.0 8 Thread 0x7f3b8c081700 (LWP 26227) 0x003daa80b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 7 Thread 0x7f3b8b680700 (LWP 26228) 0x003daa80b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 6 Thread 0x7f3b8a854700 (LWP 26232) 0x003daa4e9163 in epoll_wait () from /lib64/libc.so.6 5 Thread 0x7f3b89e53700 (LWP 26233) 0x003daa4e9163 in epoll_wait () from /lib64/libc.so.6 4 Thread 0x7f3b833eb700 (LWP 26241) 0x003daa4df343 in poll () from /lib64/libc.so.6 3 Thread 0x7f3b82130700 (LWP 26245) 0x003daa80e264 in __lll_lock_wait () from /lib64/libpthread.so.0 2 Thread 0x7f3b8172f700 (LWP 26247) 0x003daa80e75d in read () from /lib64/libpthread.so.0 * 1 Thread 0x7f3b94a38700 (LWP 26224) 0x003daa80822d in pthread_join () from /lib64/libpthread.so.0 *(gdb) thread 3** **[Switching to thread 3 (Thread 0x7f3b82130700 (LWP 26245))]#0 0x003daa80e264 in __lll_lock_wait ()** ** from /lib64/libpthread.so.0** **(gdb) bt #0 0x003daa80e264 in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x003daa809508 in _L_lock_854 () from /lib64/libpthread.so.0 #2 0x003daa8093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x7f3b8aa74524 in socket_submit_request (this=0x7f3b7c0505c0, req=0x7f3b8212f0b0) at socket.c:3134 *#4 0x7f3b94c6b7d5 in rpc_clnt_submit (rpc=0x7f3b7c029ce0, prog=value optimized out, procnum=value optimized out, cbkfn=0x7f3b892364b0 client3_3_lookup_cbk, proghdr=0x7f3b8212f410, proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=value optimized out, 
frame=0x7f3b93d2a454, rsphdr=0x7f3b8212f4c0, rsphdr_count=1, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x7f3b700010d0) at rpc-clnt.c:1556 #5 0x7f3b892243b0 in client_submit_request (this=0x7f3b7c005ef0, req=value optimized out, frame=0x7f3b93d2a454, prog=0x7f3b894525a0, procnum=27, cbkfn=0x7f3b892364b0 client3_3_lookup_cbk, iobref=0x0, rsphdr=0x7f3b8212f4c0, rsphdr_count=1, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x7f3b700010d0, xdrproc=0x7f3b94a4ede0 xdr_gfs3_lookup_req) at client.c:243 #6 0x7f3b8922fa42 in client3_3_lookup (frame=0x7f3b93d2a454, this=0x7f3b7c005ef0, data=0x7f3b8212f660) at client-rpc-fops.c:3119 (gdb) p priv-lock $1 = {__data = {__lock = 2, __count = 0, __owner = 26241, __nusers = 1, __kind = 0, __spins = 0, __list = { __prev = 0x0, __next = 0x0}}, __size = \002\000\000\000\000\000\000\000\201f\000\000\001, '\000' repeats 26 times, __align = 2} *(gdb) thread 4 [Switching to thread 4 (Thread 0x7f3b833eb700 (LWP 26241))]#0 0x003daa4df343 in poll () from /lib64/libc.so.6 (gdb) bt #0 0x003daa4df343 in poll () from /lib64/libc.so.6 #1 0x7f3b8aa71fff in ssl_do (this=0x7f3b7c0505c0, buf=0x7f3b7c051264, len=4, func=0x3db2441570 SSL_read) at socket.c:216 #2 0x7f3b8aa7277b in __socket_ssl_readv (this=value optimized out
Re: [Gluster-devel] Change in glusterfs[master]: epoll: Handle client and server FDs in a separate event pool
Hi Jeff, This is regarding the patch http://review.gluster.org/#/c/3842/ (epoll: edge triggered and multi-threaded epoll). The testcase './tests/bugs/bug-873367.t' hangs with this fix (please find the stack trace below). In the code snippet below we found that 'SSL_pending' was returning 0. I have added a condition here to return from the function when there is no data available. Please suggest whether this is OK, or whether we need to restructure this function for multi-threaded epoll.
code: socket.c
178 static int
179 ssl_do (rpc_transport_t *this, void *buf, size_t len, SSL_trinary_func *func)
180 {
211         switch (SSL_get_error(priv->ssl_ssl,r)) {
212         case SSL_ERROR_NONE:
213                 return r;
214         case SSL_ERROR_WANT_READ:
215                 if (SSL_pending(priv->ssl_ssl) == 0)
216                         return r;
217                 pfd.fd = priv->sock;
221                 if (poll(&pfd,1,-1) < 0) {
/code
Thanks, Vijay On Tuesday 24 June 2014 03:55 PM, Vijaikumar M wrote: From the stack trace we found that the function 'socket_submit_request' is waiting on a mutex lock. The lock is held by the function 'ssl_do', and that function is blocked in the poll syscall.
(gdb) bt #0 0x003daa80822d in pthread_join () from /lib64/libpthread.so.0 #1 0x7f3b94eea9d0 in event_dispatch_epoll (event_pool=value optimized out) at event-epoll.c:632 #2 0x00407ecd in main (argc=4, argv=0x7fff160a4528) at glusterfsd.c:2023 (gdb) info threads 10 Thread 0x7f3b8d483700 (LWP 26225) 0x003daa80e264 in __lll_lock_wait () from /lib64/libpthread.so.0 9 Thread 0x7f3b8ca82700 (LWP 26226) 0x003daa80f4b5 in sigwait () from /lib64/libpthread.so.0 8 Thread 0x7f3b8c081700 (LWP 26227) 0x003daa80b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 7 Thread 0x7f3b8b680700 (LWP 26228) 0x003daa80b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 6 Thread 0x7f3b8a854700 (LWP 26232) 0x003daa4e9163 in epoll_wait () from /lib64/libc.so.6 5 Thread 0x7f3b89e53700 (LWP 26233) 0x003daa4e9163 in epoll_wait () from /lib64/libc.so.6 4 Thread 0x7f3b833eb700 (LWP 26241) 0x003daa4df343 in poll () from /lib64/libc.so.6 3 Thread 0x7f3b82130700 (LWP 26245) 0x003daa80e264 in __lll_lock_wait () from /lib64/libpthread.so.0 2 Thread 0x7f3b8172f700 (LWP 26247) 0x003daa80e75d in read () from /lib64/libpthread.so.0 * 1 Thread 0x7f3b94a38700 (LWP 26224) 0x003daa80822d in pthread_join () from /lib64/libpthread.so.0 *(gdb) thread 3** **[Switching to thread 3 (Thread 0x7f3b82130700 (LWP 26245))]#0 0x003daa80e264 in __lll_lock_wait ()** ** from /lib64/libpthread.so.0** **(gdb) bt #0 0x003daa80e264 in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x003daa809508 in _L_lock_854 () from /lib64/libpthread.so.0 #2 0x003daa8093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x7f3b8aa74524 in socket_submit_request (this=0x7f3b7c0505c0, req=0x7f3b8212f0b0) at socket.c:3134 *#4 0x7f3b94c6b7d5 in rpc_clnt_submit (rpc=0x7f3b7c029ce0, prog=value optimized out, procnum=value optimized out, cbkfn=0x7f3b892364b0 client3_3_lookup_cbk, proghdr=0x7f3b8212f410, proghdrcount=1, progpayload=0x0, progpayloadcount=0, iobref=value optimized out, 
frame=0x7f3b93d2a454, rsphdr=0x7f3b8212f4c0, rsphdr_count=1, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x7f3b700010d0) at rpc-clnt.c:1556 #5 0x7f3b892243b0 in client_submit_request (this=0x7f3b7c005ef0, req=value optimized out, frame=0x7f3b93d2a454, prog=0x7f3b894525a0, procnum=27, cbkfn=0x7f3b892364b0 client3_3_lookup_cbk, iobref=0x0, rsphdr=0x7f3b8212f4c0, rsphdr_count=1, rsp_payload=0x0, rsp_payload_count=0, rsp_iobref=0x7f3b700010d0, xdrproc=0x7f3b94a4ede0 xdr_gfs3_lookup_req) at client.c:243 #6 0x7f3b8922fa42 in client3_3_lookup (frame=0x7f3b93d2a454, this=0x7f3b7c005ef0, data=0x7f3b8212f660) at client-rpc-fops.c:3119 (gdb) p priv-lock $1 = {__data = {__lock = 2, __count = 0, __owner = 26241, __nusers = 1, __kind = 0, __spins = 0, __list = { __prev = 0x0, __next = 0x0}}, __size = \002\000\000\000\000\000\000\000\201f\000\000\001, '\000' repeats 26 times, __align = 2} *(gdb) thread 4 [Switching to thread 4 (Thread 0x7f3b833eb700 (LWP 26241))]#0 0x003daa4df343 in poll () from /lib64/libc.so.6 (gdb) bt #0 0x003daa4df343 in poll () from /lib64/libc.so.6 #1 0x7f3b8aa71fff in ssl_do (this=0x7f3b7c0505c0, buf=0x7f3b7c051264, len=4, func=0x3db2441570 SSL_read) at socket.c:216 #2 0x7f3b8aa7277b in __socket_ssl_readv (this=value optimized out, opvector=value optimized out, opcount=value optimized out) at socket.c:335 #3 0x7f3b8aa72c26 in __socket_cached_read (this=value optimized out, vector=value optimized out, count
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
KP, Atin and I did some debugging and found that there was a deadlock in glusterd. When creating a volume snapshot, the back-end operations (taking an LVM snapshot and starting the brick) for each brick are executed in parallel using the synctask framework. brick_start was releasing the big_lock around brick_connect and acquiring it again. This caused a deadlock in a race condition where the main thread was waiting for one of the synctask threads to finish, and the synctask thread was waiting for the big_lock. We are working on fixing this issue. Thanks, Vijay On Wednesday 21 May 2014 12:23 PM, Vijaikumar M wrote: From the log: http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a17%3a10%3a51.tgz it looks like glusterd was hung: Glusterd log: 5305 [2014-05-20 20:08:55.040665] E [glusterd-snapshot.c:3805:glusterd_add_brick_to_snap_volume] 0-management: Unable to fetch snap device (vol1.brick_snapdevice0). Leaving empty 5306 [2014-05-20 20:08:55.649146] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 5307 [2014-05-20 20:08:55.663181] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600 5308 [2014-05-20 20:16:55.541197] W [glusterfsd.c:1182:cleanup_and_exit] (-- 0-: received signum (15), shutting down Glusterd was hung when executing the testcase ./tests/bugs/bug-1090042.t.
*Cli log:** *72649 [2014-05-20 20:12:51.960765] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect 72650 [2014-05-20 20:12:51.960850] T [socket.c:2689:socket_connect] (--/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) [0x7ff8b6609994] (--/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) [0x7ff8b5d3305b] (- -/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) [0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already connected 72651 [2014-05-20 20:12:52.960943] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect 72652 [2014-05-20 20:12:52.960999] T [socket.c:2697:socket_connect] 0-glusterfs: connecting 0x1e0fcc0, state=0 gen=0 sock=-1 72653 [2014-05-20 20:12:52.961038] W [dict.c:1059:data_to_str] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) [0x7ff8ad9e95f3] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_clien t_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0xf1) [0x7ff8ad9ec7d0]))) 0-dict: data is NULL 72654 [2014-05-20 20:12:52.961070] W [dict.c:1059:data_to_str] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(+0xb5f3) [0x7ff8ad9e95f3] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(socket_clien t_get_remote_sockaddr+0x10a) [0x7ff8ad9ed568] (--/build/install/lib/glusterfs/3.5qa2/rpc-transport/socket.so(client_fill_address_family+0x100) [0x7ff8ad9ec7df]))) 0-dict: data is NULL 72655 [2014-05-20 20:12:52.961079] E [name.c:140:client_fill_address_family] 0-glusterfs: transport.address-family not specified. 
Could not guess default value from (remote-host:(null) or transport.unix.connect-path:(null)) options
72656 [2014-05-20 20:12:54.961273] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-glusterfs: attempting reconnect
72657 [2014-05-20 20:12:54.961404] T [socket.c:2689:socket_connect] (--/build/install/lib/libglusterfs.so.0(gf_timer_proc+0x1a2) [0x7ff8b6609994] (--/build/install/lib/libgfrpc.so.0(rpc_clnt_reconnect+0x137) [0x7ff8b5d3305b] (--/build/install/lib/libgfrpc.so.0(rpc_transport_connect+0x74) [0x7ff8b5d30071]))) 0-glusterfs: connect () called on transport already connected
72658 [2014-05-20 20:12:55.120645] D [cli-cmd.c:384:cli_cmd_submit] 0-cli: Returning 110
72659 [2014-05-20 20:12:55.120723] D [cli-rpc-ops.c:8716:gf_cli_snapshot] 0-cli: Returning 110

Now we need to find why glusterd was hung.

Thanks,
Vijay

On Wednesday 21 May 2014 06:46 AM, Pranith Kumar Karampuri wrote:

Hey,
Seems like even after this fix is merged, the regression tests are failing for the same script. You can check the logs at http://build.gluster.org:443/logs/glusterfs-logs-20140520%3a14%3a06%3a46.tgz

Relevant logs:
[2014-05-20 20:17:07.026045] : volume create patchy build.gluster.org:/d/backends/patchy1 build.gluster.org:/d/backends/patchy2 : SUCCESS
[2014-05-20 20:17:08.030673] : volume start patchy : SUCCESS
[2014-05-20 20:17:08.279148] : volume barrier patchy enable : SUCCESS
[2014-05-20 20:17:08.476785] : volume barrier patchy enable : FAILED : Failed to reconfigure barrier.
[2014-05-20 20:17:08.727429] : volume barrier patchy disable : SUCCESS
[2014-05-20 20:17:08.926995] : volume barrier patchy disable : FAILED : Failed to reconfigure barrier.

Pranith

- Original Message -
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Gluster Devel gluster-devel@gluster.org
Cc: Joseph Fernandes josfe...@redhat.com, Vijaikumar M vmall
Re: [Gluster-devel] Spurious failures because of nfs and snapshots
Hi Joseph,

In the log mentioned below, it says ping-timeout is set to the default value of 30 secs, so I think the issue is different. Can you please point me to the logs where you were able to re-create the problem?

Thanks,
Vijay

On Monday 19 May 2014 09:39 AM, Pranith Kumar Karampuri wrote:

Hi Vijai, Joseph,
In 2 of the last 3 build failures, http://build.gluster.org/job/regression/4479/console and http://build.gluster.org/job/regression/4478/console, this test (tests/bugs/bug-1090042.t) failed. Do you think it is better to revert this test until the fix is available? Please send a patch to revert the test case if you feel so. You can re-submit it along with the fix to the bug mentioned by Joseph.

Pranith.

- Original Message -
From: Joseph Fernandes josfe...@redhat.com
To: Pranith Kumar Karampuri pkara...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Friday, 16 May, 2014 5:13:57 PM
Subject: Re: Spurious failures because of nfs and snapshots

Hi All,

tests/bugs/bug-1090042.t: I was able to reproduce the issue, i.e. when this test is run in a loop:

for i in {1..135}; do ./bugs/bug-1090042.t; done

When I checked the logs:

[2014-05-16 10:49:49.003978] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-05-16 10:49:49.004035] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-management: defaulting ping-timeout to 30secs
[2014-05-16 10:49:49.004303] I [rpc-clnt.c:973:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-05-16 10:49:49.004340] I [rpc-clnt.c:988:rpc_clnt_connection_init] 0-management: defaulting ping-timeout to 30secs

The issue is with ping-timeout and is tracked under the bug https://bugzilla.redhat.com/show_bug.cgi?id=1096729
The workaround is mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1096729#c8

Regards,
Joe

- Original Message -
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Gluster Devel gluster-devel@gluster.org
Cc: Joseph Fernandes josfe...@redhat.com
Sent: Friday, May 16, 2014
6:19:54 AM
Subject: Spurious failures because of nfs and snapshots

Hi,

The latest build I fired for review.gluster.com/7766 (http://build.gluster.org/job/regression/4443/console) failed because of a spurious failure: the script doesn't wait for the NFS export to be available. I fixed that, but interestingly I found quite a few scripts with the same problem. Some of the scripts rely on 'sleep 5', which can also lead to spurious failures if the export is not available within 5 seconds. We found that waiting for up to 20 seconds is better, but 'sleep 20' would unnecessarily delay the build execution. So if you are going to write any scripts which have to do NFS mounts, please do it the following way:

EXPECT_WITHIN 20 1 is_nfs_export_available;
TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;

Please review http://review.gluster.com/7773 :-)

I saw one more spurious failure in a snapshot-related script, tests/bugs/bug-1090042.t, on the next build fired by Niels. Joseph (CCed) is debugging it. He agreed to reply with what he finds and share it with us, so that we won't introduce similar bugs in future. I encourage you to share what you fix to prevent spurious failures in future.

Thanks
Pranith

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel