Re: [boinc_dev] [patch] mystery solved? Re: BOINC having too many open files - failure in opendir()
Hello,

> Sent: Sunday, 19 May 2013, 05:34
> From: David Anderson da...@ssl.berkeley.edu
> To: boinc_dev@ssl.berkeley.edu
> Subject: Re: [boinc_dev] [patch] mystery solved? Re: BOINC having too many open files - failure in opendir()
>
> As far as I can tell, none of these changes would fix a file descriptor
> leak in the client.

The MFILE class today I read as

    class A {
        FILE *f;
        A(string fname) { f = fopen(fname); }
        void close() { fclose(f); }
        ~A() { /* not closing the file pointer */ }
    };

which leaks a file pointer with every creation of an object of class A that does not see a close(). For MFILE this is in lib/gui_rpc_client.cpp. Please have a second look at it, and kindly ignore those 'const' changes ... I did not notice adding them :o)

By grepping through the code I agree that all instances in the BOINC client that use MFILE with a filename do indeed perform the close. I can now empirically confirm that the attached patch does not fix the issue. When bringing the BOINC client up again, the first thing the client does is upload results. The machine is otherwise idle; no idea whom else to blame.

Steffen

> We need a system-call trace that shows open()s. It's also possible that
> the system is running out of file descriptors because of software other
> than BOINC. -- David
>
> On 18-May-2013 4:00 AM, Steffen Möller wrote:
>> Dear all,
>> I skimmed through all invocations of (boinc_)?fopen() in api/ and lib/,
>> seeking the respective matching fclose(). What I found missing I placed
>> here as a patch:
>> http://anonscm.debian.org/gitweb/?p=pkg-boinc/boinc.git;a=blob;f=debian/patches/fopen_closing.patch;hb=HEAD
>> The trickiest and possibly the most important one is the omission of a
>> close in the destructor of the MFILE class.
>> Cheers, Steffen

Sent: Friday, 17 May 2013, 22:00
From: Nicolás Alvarez nicolas.alva...@gmail.com
To: boinc_dev@ssl.berkeley.edu
Subject: Re: [boinc_dev] BOINC having too many open files - failure in opendir()

Get the list of open files (ls -l /proc/$(pidof boinc)/fd) when that happens. Does the client die after that last fopen() failure? Maybe you could write a script to log the open file list every few minutes.

-- Nicolás

2013/5/16 Steffen Möller steffen_moel...@gmx.de:
> Dear all,
> every few months I get an error like the one below (taken from
> stdoutdae.txt) that reports "too many open files". I have seen this for
> about three years on several Linux machines; I only recall it on machines
> with many cores (12 or 24), though, Opterons and Xeons alike. Is anything
> jumping out at you where to look?
> Cheers, Steffen
>
> 16-May-2013 16:58:33 [World Community Grid] Sending scheduler request: To fetch work.
> 16-May-2013 16:58:33 [World Community Grid] Requesting new tasks for CPU
> 16-May-2013 16:58:36 [World Community Grid] Scheduler request completed: got 0 new tasks
> 16-May-2013 16:58:36 [World Community Grid] No tasks sent
> 16-May-2013 16:58:36 [World Community Grid] No tasks are available for The Clean Energy Project - Phase 2
> 16-May-2013 16:58:36 [World Community Grid] No tasks are available for the applications you have selected.
> 16-May-2013 16:58:42 [Einstein@Home] Sending scheduler request: To fetch work.
> 16-May-2013 16:58:42 [Einstein@Home] Reporting 4 completed tasks
> 16-May-2013 16:58:42 [Einstein@Home] Requesting new tasks for CPU
> 16-May-2013 16:58:46 [Einstein@Home] Scheduler request completed: got 1 new tasks
> 16-May-2013 17:15:53 [Einstein@Home] Sending scheduler request: To fetch work.
> 16-May-2013 17:15:53 [Einstein@Home] Requesting new tasks for CPU
> 16-May-2013 17:15:56 [Einstein@Home] Scheduler request completed: got 1 new tasks
> 16-May-2013 17:30:11 [World Community Grid] Can't get task disk usage: opendir() failed
> 16-May-2013 17:30:11 [Einstein@Home] Can't get task disk usage: opendir() failed
> 16-May-2013 17:30:11 [Einstein@Home] Can't get task disk usage: opendir() failed
> 16-May-2013 17:30:11 [Einstein@Home] Can't get task disk usage: opendir() failed
> 16-May-2013 17:30:11 [Einstein@Home] Can't get task disk usage: opendir() failed
> 16-May-2013 17:30:11 [Einstein@Home] Can't get task disk usage: opendir() failed
> 16-May-2013 17:30:11 [Einstein@Home] Can't get task disk usage: opendir() failed
> 16-May-2013 17:30:11 [Einstein@Home] Can't get task disk usage: opendir() failed
> 16-May-2013 17:30:11 [Einstein@Home] Can't get task disk usage: opendir() failed
> 16-May-2013 17:32:31 [Einstein@Home] read_stderr_file(): malloc() failed
> 16-May-2013 17:32:31 [Einstein@Home] Computation for task LATeah0024U_80.0_500_-4.66e-10_1 finished
> 16-May-2013 17:32:31 [Einstein@Home] md5_file failed for projects/einstein.phys.uwm.edu/einstein_S6BucketLVE_1.04_i686-pc-linux-gnu__SSE2: fopen() failed
> 16-May-2013 17:32:31 [---] Can't open client_state_next.xml: fopen() failed
> 16-May-2013 17:32:31 [---] Couldn't write state file:
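The destructor fix Steffen describes is essentially RAII for the wrapped FILE*. A minimal sketch of that pattern, using an illustrative ScopedFile class (not the actual MFILE code), could look like this:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Illustrative RAII wrapper (hypothetical 'ScopedFile', not BOINC's MFILE):
// the destructor closes the FILE* if close() was never called, so the
// descriptor is not leaked when an instance goes out of scope.
class ScopedFile {
    FILE *f;
public:
    explicit ScopedFile(const std::string &fname, const char *mode = "w")
        : f(std::fopen(fname.c_str(), mode)) {}

    // Explicit close; safe to call more than once.
    int close() {
        int ret = 0;
        if (f) {
            ret = std::fclose(f);
            f = nullptr;   // guard against a double fclose()
        }
        return ret;
    }

    bool is_open() const { return f != nullptr; }

    // The fix under discussion: release the descriptor in the destructor.
    ~ScopedFile() { close(); }

    // Copying would double-close the same FILE*; forbid it.
    ScopedFile(const ScopedFile &) = delete;
    ScopedFile &operator=(const ScopedFile &) = delete;
};
```

With this shape, every code path that forgets an explicit close() still releases the descriptor at scope exit, which is what the missing cleanup in ~MFILE() would provide.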
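Nicolás's suggestion of logging the open-file list every few minutes can be sketched as a small shell function; the log path and the "boinc" process name here are assumptions, not from the thread:

```shell
#!/bin/sh
# Append a timestamped snapshot of a process's open file descriptors
# (via /proc) to a log file. Illustrative sketch only.
snapshot_fds() {   # $1 = pid, $2 = log file
    {
        date
        ls -l "/proc/$1/fd"
        printf 'fd count: %s\n' "$(ls "/proc/$1/fd" | wc -l)"
        echo '---'
    } >> "$2"
}

# Example loop: every 5 minutes, log the BOINC client's descriptors.
# while :; do
#     pid=$(pidof boinc) && snapshot_fds "$pid" /var/tmp/boinc_fds.log
#     sleep 300
# done
```

The fd count over time should show whether the client itself is leaking descriptors or whether some other process is exhausting the system-wide limit, which is exactly the distinction David asks about.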
Re: [boinc_dev] The ways in which BOINC tells an app to quit
Thus far I have been unable to catch a recurrence.

Great, an intermittent failure.

On Thu, May 16, 2013 at 11:34 AM, Eric J Korpela korp...@ssl.berkeley.edu wrote:

OK, I'll try to catch another one.

On Thu, May 16, 2013 at 11:31 AM, David Anderson da...@ssl.berkeley.edu wrote:

Eric: If you set the task_debug logging flag, the client will show when it suspends/resumes tasks. This will tell us whether it's a problem in the client or the app.

-- David

On 16-May-2013 11:16 AM, Eric J Korpela wrote:

I had the "keeps running when computer is in use" problem last night with an Einstein CUDA app, so it's clear that this isn't restricted to SETI@home. There has to be a BOINC issue or some common flaw in how exiting is handled. I'm wondering if there's some way that writing a checkpoint (or reporting a checkpoint) is going wrong. I assume that even GPU apps would suspend rather than exit if there has been no checkpoint since the task was started? Of course in this case it's not suspending either; it's continuing to run. David, any comments?

On Sun, May 12, 2013 at 11:47 PM, Raistmer the Sorcerer raist...@mail.ru wrote:

Then it looks like I should report a bug in the BOINC API - sometimes it doesn't, and sets only the suspend flag without the quit flag. Possibly relevant info: the user had changed the idle interval to 10 min instead of the default.

On Sunday, 12 May 2013, 21:06 -07:00, David Anderson da...@ssl.berkeley.edu wrote:

On 12-May-2013 4:44 AM, Raistmer the Sorcerer wrote:
> Can I get a definitive answer, please. In the case where the user sets
> BOINC not to use the GPU while the user is active ("Use GPU only when PC
> idle"): in the BOINC devs' opinion, should BOINC set the
> boinc_status.quit_request flag when the user becomes active, or should
> it not?

Yes.

-- David

On Saturday, 11 May 2013, 8:49 +04:00, Raistmer the Sorcerer raist...@mail.ru wrote:

If you look at the corresponding thread you will see that the app does check. There is also a boinc_status.suspended flag, but is it designed behavior not to set boinc_status.quit_request for a GPU app in all cases when its exit-on-suspend is required?

On Friday, 10 May 2013, 17:24 -07:00, Eric J Korpela korp...@ssl.berkeley.edu wrote:

I can think of one other option. If an OpenCL routine never exits, the application might not get to the point of checking the flags (unless you are checking the flags while waiting on OpenCL routines to finish). I haven't checked the driver revision the users in question have installed. Are they using suspect driver versions?

On Fri, May 10, 2013 at 4:46 PM, Raistmer the Sorcerer raist...@mail.ru wrote:

Trying to solve the non-suspending issue listed in this thread:
http://setiathome.berkeley.edu/forum_thread.php?id=71581&postid=1365551
I came to the conclusion that either BOINC does something wrong in this situation or the app doesn't check all the needed flags and is not aware of the exit request. I check these flags: boinc_status.quit_request and boinc_status.abort_request. Maybe there is another flag BOINC sets to inform the app that a suspend is required?

_______________________________________________
boinc_dev mailing list
boinc_dev@ssl.berkeley.edu
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and (near bottom of page) enter your email address.
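The flag checks this thread converges on can be sketched in a self-contained form. The struct below is a mock, not the real BOINC API headers; its field names mirror the flags named in the discussion, and treating 'suspended' as exit-worthy for a GPU app reflects Raistmer's reported expectation, not confirmed client behavior:

```cpp
#include <cassert>

// Mock of the status structure; field names mirror the flags discussed
// (quit_request, abort_request, suspended). Not the real BOINC headers.
struct BoincStatus {
    int quit_request;
    int abort_request;
    int suspended;
};

enum class AppAction { Continue, Suspend, Exit };

// What the worker loop should do after reading the status. The point
// raised in the thread: a GPU app whose suspend policy is exit-and-restart
// must also act on 'suspended', since the client may set only that flag
// without quit_request.
AppAction poll_status(const BoincStatus &s, bool gpu_exits_on_suspend) {
    if (s.quit_request || s.abort_request) return AppAction::Exit;
    if (s.suspended)
        return gpu_exits_on_suspend ? AppAction::Exit : AppAction::Suspend;
    return AppAction::Continue;
}
```

Eric's caveat still applies regardless of the decision logic: these flags only help if the app reaches a point where it polls them, so long-running OpenCL calls need the check to happen while waiting on the device, not after.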
[boinc_dev] Problems with 7.1.1 work fetch on projects set to No new tasks
Yesterday, I upgraded to 7.1.1 and had to set the Constellation project to No New Tasks. I would have suspended it, but its work units take about 10 hours and it only had about 2 hours to go on the one work unit it had. Sometime late yesterday or overnight, it finished that work unit. However, even with No New Tasks set, it fetched another work unit at 11:11:41. I've included the relevant section of stdoutdae.txt below, with enough before and after the work fetch so you can see that it definitely had No New Tasks set.

FYI, the reason I set Constellation to No New Tasks was that after the upgrade to 7.1.1, it started executing Constellation work units as NCI again, which it had not done in 7.0.64, although it had this same problem in some earlier 7.0 versions. Constellation is not NCI, so this caused an extra work unit to be executed on my C2D E6420, which slowed down the other work units and anything else running on the system. I aborted the new Constellation work unit it fetched at 11:11:41 and suspended Constellation now that it has no work units.
Here is the section of stdoutdae.txt where it fetched a work unit while set to No New Tasks:

21-May-2013 11:11:34 [---] [work_fetch] --- start work fetch state ---
21-May-2013 11:11:34 [---] [work_fetch] target work buffer: 83808.00 + 11232.00 sec
21-May-2013 11:11:34 [---] [work_fetch] --- project states ---
21-May-2013 11:11:34 [The Lattice Project] [work_fetch] REC 0.000 prio 0.00 can't req work: suspended via Manager
21-May-2013 11:11:34 [superlinkattechnion] [work_fetch] REC 0.000 prio 0.00 can't req work: suspended via Manager
21-May-2013 11:11:34 [MindModeling@Beta] [work_fetch] REC 0.000 prio 0.00 can't req work: suspended via Manager
21-May-2013 11:11:34 [Constellation] [work_fetch] REC 24.318 prio -0.00 can't req work: no new tasks requested via Manager
21-May-2013 11:11:34 [Docking] [work_fetch] REC 70.270 prio -0.053371 can req work
21-May-2013 11:11:34 [malariacontrol.net] [work_fetch] REC 35.233 prio -0.053520 can req work
21-May-2013 11:11:34 [rosetta@home] [work_fetch] REC 28.369 prio -0.055149 can req work
21-May-2013 11:11:34 [correlizer] [work_fetch] REC 27.663 prio -0.055464 can req work
21-May-2013 11:11:34 [eon2] [work_fetch] REC 14.738 prio -0.055967 can req work
21-May-2013 11:11:34 [World Community Grid] [work_fetch] REC 275.469 prio -0.059589 can req work
21-May-2013 11:11:34 [NumberFields@home] [work_fetch] REC 15.186 prio -0.060085 can req work
21-May-2013 11:11:34 [SZTAKI Desktop Grid] [work_fetch] REC 15.104 prio -0.063915 can req work
21-May-2013 11:11:34 [boincsimap] [work_fetch] REC 32.650 prio -0.070539 can req work
21-May-2013 11:11:34 [ibercivis] [work_fetch] REC 14.749 prio -0.074577 can req work
21-May-2013 11:11:34 [fightmalaria@home] [work_fetch] REC 10.813 prio -0.082123 can req work
21-May-2013 11:11:34 [Asteroids@home] [work_fetch] REC 16.048 prio -0.093301 can req work
21-May-2013 11:11:34 [Milkyway@Home] [work_fetch] REC 11.449 prio -0.180597 can req work
21-May-2013 11:11:34 [NFS@Home] [work_fetch] REC 28.673 prio -0.217775 can req work
21-May-2013 11:11:34 [LHC@home 1.0] [work_fetch] REC 35.037 prio -0.266106 can req work
21-May-2013 11:11:34 [Poem@Home] [work_fetch] REC 4056.596 prio -3.851261 can req work
21-May-2013 11:11:34 [SETI@home] [work_fetch] REC 2544.520 prio -5.105209 can req work
21-May-2013 11:11:34 [Einstein@Home] [work_fetch] REC 4200.380 prio -8.210770 can req work
21-May-2013 11:11:34 [PrimeGrid] [work_fetch] REC 1403.029 prio -10.714000 can req work
21-May-2013 11:11:34 [---] [work_fetch] --- state for CPU ---
21-May-2013 11:11:34 [---] [work_fetch] shortfall 9834.29 nidle 0.00 saturated 85205.71 busy 0.00
21-May-2013 11:11:34 [The Lattice Project] [work_fetch] fetch share 0.000
21-May-2013 11:11:34 [superlinkattechnion] [work_fetch] fetch share 0.000
21-May-2013 11:11:34 [MindModeling@Beta] [work_fetch] fetch share 0.000
21-May-2013 11:11:34 [Constellation] [work_fetch] fetch share 0.000
21-May-2013 11:11:34 [Docking] [work_fetch] fetch share 0.124
21-May-2013 11:11:34 [malariacontrol.net] [work_fetch] fetch share 0.062
21-May-2013 11:11:34 [rosetta@home] [work_fetch] fetch share 0.050
21-May-2013 11:11:34 [correlizer] [work_fetch] fetch share 0.050
21-May-2013 11:11:34 [eon2] [work_fetch] fetch share 0.025
21-May-2013 11:11:34 [World Community Grid] [work_fetch] fetch share 0.497
21-May-2013 11:11:34 [NumberFields@home] [work_fetch] fetch share 0.025
21-May-2013 11:11:34 [SZTAKI Desktop Grid] [work_fetch] fetch share 0.025
21-May-2013 11:11:34 [boincsimap] [work_fetch] fetch share 0.050
21-May-2013 11:11:34 [ibercivis] [work_fetch] fetch share 0.025
21-May-2013 11:11:34 [fightmalaria@home] [work_fetch] fetch share 0.012
21-May-2013 11:11:34 [Asteroids@home] [work_fetch] fetch share 0.025
21-May-2013 11:11:34 [Milkyway@Home] [work_fetch] fetch share 0.006
21-May-2013 11:11:34 [NFS@Home] [work_fetch] fetch share 0.012
21-May-2013
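As an aside on reading the log: the fetch share numbers above look like each fetchable project's resource share normalized over all projects currently allowed to request work, with non-fetchable projects pinned to 0.000. A hedged sketch of that inference (illustrative only, not the actual client code):

```cpp
#include <cassert>
#include <vector>

// Hypothetical reconstruction of the "fetch share" numbers as they appear
// in the log: resource share divided by the total resource share of all
// projects currently allowed to request work.
struct Project {
    const char *name;
    double resource_share;
    bool can_req_work;   // false when suspended, No New Tasks, or backed off
};

std::vector<double> compute_fetch_shares(const std::vector<Project> &projects) {
    double total = 0.0;
    for (const auto &p : projects)
        if (p.can_req_work) total += p.resource_share;

    std::vector<double> shares;
    for (const auto &p : projects)
        shares.push_back((p.can_req_work && total > 0.0)
                             ? p.resource_share / total
                             : 0.0);
    return shares;
}
```

Under this reading, Constellation's fetch share of 0.000 is consistent with the client honoring No New Tasks at the scheduling stage, which makes the later fetch all the more puzzling.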
[boinc_dev] 7.1.1 also getting new work for suspended project
I suspended Constellation since No New Work wasn't preventing BOINC from requesting new work units from it, and I just noticed that it's still requesting work units from Constellation. Here's the section of stdoutdae.txt where it got another WU.

5/21/2013 2:29:52 PM | | [work_fetch] --- start work fetch state ---
5/21/2013 2:29:52 PM | | [work_fetch] target work buffer: 83808.00 + 11232.00 sec
5/21/2013 2:29:52 PM | | [work_fetch] --- project states ---
5/21/2013 2:29:52 PM | The Lattice Project | [work_fetch] REC 0.000 prio 0.00 can't req work: suspended via Manager
5/21/2013 2:29:52 PM | superlinkattechnion | [work_fetch] REC 0.000 prio 0.00 can't req work: suspended via Manager
5/21/2013 2:29:52 PM | MindModeling@Beta | [work_fetch] REC 0.000 prio 0.00 can't req work: suspended via Manager
5/21/2013 2:29:52 PM | Constellation | [work_fetch] REC 24.280 prio 0.00 can't req work: suspended via Manager
5/21/2013 2:29:52 PM | rosetta@home | [work_fetch] REC 28.099 prio -0.065670 can req work
5/21/2013 2:29:52 PM | correlizer | [work_fetch] REC 27.875 prio -0.066564 can req work
5/21/2013 2:29:52 PM | eon2 | [work_fetch] REC 14.597 prio -0.066898 can req work
5/21/2013 2:29:52 PM | World Community Grid | [work_fetch] REC 276.058 prio -0.067595 can req work
5/21/2013 2:29:52 PM | Docking | [work_fetch] REC 71.034 prio -0.069294 can req work
5/21/2013 2:29:52 PM | NumberFields@home | [work_fetch] REC 15.042 prio -0.071349 can req work
5/21/2013 2:29:52 PM | SZTAKI Desktop Grid | [work_fetch] REC 14.960 prio -0.075118 can req work
5/21/2013 2:29:52 PM | malariacontrol.net | [work_fetch] REC 34.898 prio -0.077509 can req work
5/21/2013 2:29:52 PM | boincsimap | [work_fetch] REC 32.339 prio -0.082647 can req work
5/21/2013 2:29:52 PM | ibercivis | [work_fetch] REC 15.922 prio -0.088790 can req work
5/21/2013 2:29:52 PM | fightmalaria@home | [work_fetch] REC 10.710 prio -0.098163 can req work
5/21/2013 2:29:52 PM | Asteroids@home | [work_fetch] REC 15.895 prio -0.105204 can req work
5/21/2013 2:29:52 PM | Milkyway@Home | [work_fetch] REC 11.340 prio -0.215259 can req work
5/21/2013 2:29:52 PM | NFS@Home | [work_fetch] REC 28.400 prio -0.260310 can req work
5/21/2013 2:29:52 PM | LHC@home 1.0 | [work_fetch] REC 34.703 prio -0.318082 can req work
5/21/2013 2:29:52 PM | Poem@Home | [work_fetch] REC 4017.988 prio -4.603479 can req work
5/21/2013 2:29:52 PM | SETI@home | [work_fetch] REC 2618.643 prio -6.264621 can't req work: scheduler RPC backoff (backoff: 297.68 sec)
5/21/2013 2:29:52 PM | Einstein@Home | [work_fetch] REC 4160.403 prio -9.768530 can req work
5/21/2013 2:29:52 PM | PrimeGrid | [work_fetch] REC 1389.675 prio -12.795318 can req work
5/21/2013 2:29:52 PM | | [work_fetch] --- state for CPU ---
5/21/2013 2:29:52 PM | | [work_fetch] shortfall 3133.46 nidle 0.00 saturated 91906.54 busy 0.00
5/21/2013 2:29:52 PM | The Lattice Project | [work_fetch] fetch share 0.000
5/21/2013 2:29:52 PM | superlinkattechnion | [work_fetch] fetch share 0.000
5/21/2013 2:29:52 PM | MindModeling@Beta | [work_fetch] fetch share 0.000
5/21/2013 2:29:52 PM | Constellation | [work_fetch] fetch share 0.000
5/21/2013 2:29:52 PM | rosetta@home | [work_fetch] fetch share 0.050
5/21/2013 2:29:52 PM | correlizer | [work_fetch] fetch share 0.050
5/21/2013 2:29:52 PM | eon2 | [work_fetch] fetch share 0.025
5/21/2013 2:29:52 PM | World Community Grid | [work_fetch] fetch share 0.497
5/21/2013 2:29:52 PM | Docking | [work_fetch] fetch share 0.124
5/21/2013 2:29:52 PM | NumberFields@home | [work_fetch] fetch share 0.025
5/21/2013 2:29:52 PM | SZTAKI Desktop Grid | [work_fetch] fetch share 0.025
5/21/2013 2:29:52 PM | malariacontrol.net | [work_fetch] fetch share 0.062
5/21/2013 2:29:52 PM | boincsimap | [work_fetch] fetch share 0.050
5/21/2013 2:29:52 PM | ibercivis | [work_fetch] fetch share 0.025
5/21/2013 2:29:52 PM | fightmalaria@home | [work_fetch] fetch share 0.012
5/21/2013 2:29:52 PM | Asteroids@home | [work_fetch] fetch share 0.025
5/21/2013 2:29:52 PM | Milkyway@Home | [work_fetch] fetch share 0.006
5/21/2013 2:29:52 PM | NFS@Home | [work_fetch] fetch share 0.012
5/21/2013 2:29:52 PM | LHC@home 1.0 | [work_fetch] fetch share 0.012
5/21/2013 2:29:52 PM | Poem@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
5/21/2013 2:29:52 PM | SETI@home | [work_fetch] fetch share 0.000 (blocked by prefs)
5/21/2013 2:29:52 PM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
5/21/2013 2:29:52 PM | PrimeGrid | [work_fetch] fetch share 0.000 (blocked by prefs)
5/21/2013 2:29:52 PM | | [work_fetch] --- state for NVIDIA ---
5/21/2013 2:29:52 PM | | [work_fetch] shortfall 5949.70 nidle 0.00 saturated 89090.30 busy 0.00
5/21/2013 2:29:52 PM | The Lattice Project | [work_fetch] fetch share 0.000 (no apps)
5/21/2013 2:29:52 PM | superlinkattechnion | [work_fetch] fetch share 0.000 (blocked by configuration file)
5/21/2013 2:29:52 PM | MindModeling@Beta | [work_fetch]
Re: [boinc_dev] [boinc_alpha] 7.1.1 also getting new work for suspended project
David,

> For both your No New Tasks and Project Suspended scenarios, where BOINC
> still fetched work: did you manually click the Update button in both of
> those scenarios?

No, IIRC, although when I aborted the WU and updated the project to turn it in, I believe it got another WU then.

In fact, I had loaded 7.1.1 on a different quad yesterday (one that doesn't use a GPU), and when I flipped over to look at it (they're on the same KVM, so they share the screen and keyboard, but each has its own mouse), it had gotten work for 2 projects that were suspended on it. Einstein and PrimeGrid were the projects, IIRC. I unsuspended them since they didn't have large shares anyway. On the C2D machine, which has a GPU, I finally had to remove the Constellation project. It was still getting work while it was suspended, and I had set the resource share to 0.001.

BTW, on the 3 test machines (2 Vista32 and 1 Vista64) I downloaded BOINC 7.1.1 on each machine. I didn't move it between machines because I didn't have shares set up for that. Each machine downloaded its own separate copy of 7.1.1, so even if one machine somehow got a bad copy, the other machines shouldn't have had the same problem. The 3rd test machine (the only one running 64-bit Vista) didn't have any projects on it that were suspended or NNW, so it didn't show the work fetch problem, although it does seem to have tried to make sure it had at least one job from each project (see next paragraph).

7.1.1 is behaving differently on work fetch. 7.0.64 seemed to just grab all the work it needed from the current-priority project. It seems to me like 7.1.1 is trying to get at least one work unit from each project it is attached to. ISTR David Anderson saying something about the new work fetch simulator on the BOINC alpha website doing that for some reason, so I guess 7.1.1 has the same logic in it.
David Ball