Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty
Rom, if you could build a private drop, I'll report what the log says.

On Wednesday, 10 June 2015, 4:28, David Anderson da...@ssl.berkeley.edu wrote:

I added a log message that may help a bit. I'd like to track this down, even though it's minor.
-- David

On 19-May-2015 12:15 PM, Richard Haselgrove wrote:

OK, the delay happened again, and I captured a procmon log. Copy of the BOINC log attached (period of interest is 19:35:30 to 19:35:41): also a simple extract of ProcMon for the same period. It has to be said, boinc.exe was doing surprisingly little. I have kept the full ~200 MB native ProcMon log, which can be re-filtered to look for anything else of interest, if you can suggest some likely targets.

On Monday, 18 May 2015, 20:57, David Anderson da...@ssl.berkeley.edu wrote:

That looks like what's needed. Richard, if you can repro the inter-job delay, you could try using Process Monitor to capture as much as possible from the client during that period.
-- David

On 18-May-2015 11:12 AM, Jacob Klein wrote:

Process Monitor can be used to watch the things a process does (you have to set up correct filters, etc.)... but I'm not sure if that includes sleeps. But if the process is waiting on a file or something, though, it should be able to tell you. Worth looking into.
https://technet.microsoft.com/en-us/library/bb896645.aspx

Regards,
Jacob

Date: Mon, 18 May 2015 10:41:16 -0700
From: da...@ssl.berkeley.edu
To: r.haselgr...@btopenworld.com; onec...@hotmail.com; jacob_w_kl...@msn.com
CC: boinc_dev@ssl.berkeley.edu
Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty

I looked at this and couldn't figure out the source of the 12-sec delay. In general, delays could happen because
1) the client does something that takes a long time (like copying a 5 GB file)
2) the client sleeps (i.e. calls boinc_sleep()). It does this in a few situations, like backing off and retrying a file system operation.

But there's no indication that either of these is happening here. Does Windows have a way of logging the system calls that a process makes (like strace on Unix)? If so that might reveal what the client is doing during those 12 seconds.
-- David

On 16-May-2015 8:01 AM, Richard Haselgrove wrote:

Here is the message log file for a GPUGrid task finish. The 12-second delay appears again between 14:26:35 and 14:26:47 - that's after the slot directory has been cleared, and the exiting task has changed state from 'running' to 'uploading'. Two new tasks have been assigned to the GPU, but their (small) startup files have not yet been linked to their respective slot directories.

I also attach directory listings for the slot and GPUGrid project folders at various stages of the cleanup: the slot held 34 files totalling 44,186,727 bytes, which doesn't sound excessive: the largest file deletion (94,783,960 bytes) occurred several minutes later, when that file finished uploading.

I'll enable similar logging and watch what happens when the next GPUGrid task starts up, but from memory, the disruption to BOINC is less severe at startup.

On Tuesday, 12 May 2015, 23:29, David Anderson da...@ssl.berkeley.edu wrote:

BTW: the client isn't completely single-threaded; it uses a separate thread to do CPU throttling. It would be feasible to also use separate threads for serving GUI RPC connections, which would allow client to remain responsive even while e.g. copying thousands of files to a slot dir.
-- David

On 12-May-2015 2:40 AM, Seke Rob wrote:

Reminds me of the Clean Energy Project, Phase 2 and why we have app_config and max_concurrent and a default control of allowing 1 'In Progress' on a host. This project sets up in slot copying near 6700 files [symlinking proposed long ago as is done on several other WCG projects for the static files]. If more than one CEP2 is started the machine feels at times like a snail, responsiveness of the BOINC manager is poor, many a time the less powerful systems incurring error zero status exits or total fail. On an 8 core observed it could take over an hour before actual computing commenced [CPU time logged]. Boot cycle requires manually starting of tasks one by one. Kevin Reed few years ago raised a ticket for staggered starting, where the models can reach several GB and bigger in the coming. At any rate, as much as these 6700 files are copied,
Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty
Here is a private drop:
http://boinc.berkeley.edu/dl/boinc.100615.x64.zip

- Rom

-----Original Message-----
From: boinc_dev [mailto:boinc_dev-boun...@ssl.berkeley.edu] On Behalf Of Richard Haselgrove
Sent: Wednesday, June 10, 2015 3:34 AM
To: David Anderson; Jacob Klein; Seke Rob
Cc: BOINC Development
Subject: Re: [boinc_dev] [boinc_alpha] BOINC re-using slot directories without ensuring they're empty

Rom, if you could build a private drop, I'll report what the log says.
Re: [boinc_dev] I: Question about changes to boinc/7.4.7+dfsg-1exp1
Fixed.
-- David

On 03-Apr-2015 1:40 AM, Gianfranco Costamagna wrote:

Hi Boinc developers,

Michael Tautschnig has discovered a bug in the zip code with a really nice code checker tool and reported it as Debian bug 747964:
https://bugs.debian.org/747964

Can you please apply the attached patch from him?

Many thanks,
Gianfranco

On Thursday, 2 April 2015, 10:41, Michael Tautschnig m...@debian.org wrote:

Hi,

Many thanks for getting back so quickly.

On Thu, Apr 02, 2015 at 8:09:57 +, Gianfranco Costamagna wrote:

[...] Yes, IIRC upstream told me the code was actually not used by boinc, it was a bundled zip library, but we should use a subset of it, and not the line above. But this is upstream, not me :)
http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2014-May/020956.html

I don't remember the exact mail, I just found the thread above... Do you still have the build failure? I might add the patch again if needed!

I am attaching a patch that more or less has the effect of your proposal in that thread. Yet if I understand the response in
http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2014-May/020958.html
correctly, one should rather change the call so as not to do any results checking? Anyway, the attached patch makes things compile in a consistent manner. It's not more broken than the existing code :-)

Best,
Michael

___
boinc_dev mailing list
boinc_dev@ssl.berkeley.edu
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and (near bottom of page) enter your email address.