Re: OpenPKG 4.0.7 RPM lockfiles permission problems solaris 10
On 09/01/2010 12:41 AM, Jeff Johnson wrote:

Yesterday we were able to reproduce the rpm locking problem. The openpkg rc facility is started by cron and performs an "rpm -q openpkg" query. This query hangs on a lock. We have seen 100 of them in the process table (the cron limit) -- in this situation all rpm queries hang, as does db_verify. The output of db_stat -CA is attached at the end.

  root 12777 12589 0   Sep 02  ?  1:03 /usr/sbin/cron
  root  4501 12777 0 17:10:00  ?  0:00 sh -c /opt/local/opkg/bin/openpkg rpm -q openpkg >/dev/null 2>&1 && touch /v
  root  4509  4501 0 17:10:00  ?  0:02 /opt/local/opkg/lib/openpkg/rpm -q openpkg

After terminating all the rpm queries we found the rpm db in the state "run database recovery":

  x:/opt/local/opkg/RPM/DB# openpkg rpm -qavv > /var/tmp/rpm-qavv.out
  D: pool fd:   created size 464 limit -1 flags 0
  D: pool iob:  created size 20 limit -1 flags 0
  D: pool mire: created size 84 limit -1 flags 0
  D: pool lua:  created size 32 limit -1 flags 0
  D: pool ts:   created size 888 limit -1 flags 0
  D: pool db:   created size 208 limit -1 flags 0
  rpm: openpkg: layout: [global] tmpdir = /opt/local/opkg/RPM/TMP
  D: opening db environment /opt/local/opkg/RPM/DB/Packages auto_commit:cdb
  rpmdb: Thread/process 2479/0 failed: Thread died in Berkeley DB library
  error: db4 error(-30974) from dbenv->failchk: DB_RUNRECOVERY: Fatal error, run database recovery
  error: Running db-verify ...
  D: opening db environment /opt/local/opkg/RPM/DB/Packages auto_commit:cdb
  D: opening db index /opt/local/opkg/RPM/DB/Packages rdonly mode=0x0
  D: locked db index /opt/local/opkg/RPM/DB/Packages
  D: opening db index /opt/local/opkg/RPM/DB/Name rdonly mode=0x0

An "rpm --db-cleanup" fixed the rpm db.
  x:/opt/local/opkg/RPM/DB# /opt/local/opkg/bin/openpkg rpm --db-cleanup
  rpmdb: CLEANING UP EXISTING RPM DATABASE (/opt/local/opkg/RPM/DB)
  rpmdb: cleaning up RPM database DB files
  rpmdb: making sure RPM database are of the correct type
  rpmdb: rebuilding RPM database (built-in RPM procedure)
  rpmdb: making sure RPM database files have consistent attributes
  rpmdb: performing read/write operation on RPM database

Now the output of db_stat -CA:

  Default locking region information:
  1503    Last allocated locker ID
  0x7fff  Current maximum unused locker ID
  5       Number of lock modes
  1000    Maximum number of locks possible
  1000    Maximum number of lockers possible
  1000    Maximum number of lock objects possible
  50      Number of lock object partitions
  296     Number of current locks
  306     Maximum number of locks at any one time
  196     Maximum number of locks in any one bucket
  263     Maximum number of locks stolen by for an empty partition
  176     Maximum number of locks stolen for any one partition
  492     Number of current lockers
  492     Maximum number of lockers at any one time
  2       Number of current lock objects
  5       Maximum number of lock objects at any one time
  1       Maximum number of lock objects in any one bucket
  0       Maximum number of objects stolen by for an empty partition
  0       Maximum number of objects stolen for any one partition
  4517    Total number of locks requested
  3958    Total number of locks released
  0       Total number of locks upgraded
  827     Total number of locks downgraded
  35      Lock requests not available due to conflicts, for which we waited
  0       Lock requests not available due to conflicts, for which we did not wait
  0       Number of deadlocks
  0       Lock timeout value
  0       Number of locks that have timed out
  0       Transaction timeout value
  0       Number of transactions that have timed out
  504KB   The size of the lock region
  10      The number of partition locks that required waiting (0%)
  4       The maximum number of times any partition lock was waited for (0%)
  0       The number of object queue operations that required waiting (0%)
  17      The number of locker allocations that required waiting (0%)
  0       The number of region locks that required waiting (0%)
  1       Maximum hash bucket length
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Lock REGINFO information:
  Lock        Region type
  4           Region ID
  __db.004    Region name
  0xfed8      Original region address
  0xfed8      Region address
  0xfed800c8  Region primary address
  0           Region maximum allocation
  0           Region allocated
  Region allocations: 3006 allocations, 0 failures, 0 frees, 1 longest
  Allocations by power-of-two sizes:
    1KB     3002
    2KB     0
    4KB     1
    8KB     0
    16KB    2
    32KB    0
    64KB    1
    128KB   0
    256KB   0
    512KB   0
    1024KB  0
  REGION_JOIN_OK  Region flags
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Lock region parameters:
  283     Lock region region mutex [0/581 0% 19054/0]
  1031    locker table size
  1031    object table size
  560     obj_off
  64720   locker_off
  1       need_dd
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Lock conflict matrix:
  000
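A hundred piled-up copies of the same cron-launched check is a pattern worth guarding against independently of the underlying Berkeley DB fault. A minimal sketch of a non-blocking lock guard for such a cron job (the lock file path and the idea of wrapping the rc check this way are a suggestion, not part of OpenPKG):

```python
import fcntl
import os
import subprocess
import tempfile

# Hypothetical lock file path; any path on a local filesystem works.
LOCKFILE = os.path.join(tempfile.gettempdir(), "openpkg-rc-check.lock")

def run_exclusive(cmd):
    """Run cmd only if no other instance holds the lock; skip otherwise.

    Prevents cron from piling up hung copies of the same check: if a
    previous invocation is still running (or hung), the non-blocking
    flock() fails and we return None instead of adding another stuck
    process to the table.
    """
    fd = os.open(LOCKFILE, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return subprocess.run(cmd, timeout=60)  # also bound the runtime
    except BlockingIOError:
        return None          # another instance still holds the lock
    finally:
        os.close(fd)         # releases the lock if we acquired it
```

The timeout on subprocess.run additionally bounds how long a single wedged query can linger before cron fires the next one.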
Re: OpenPKG 4.0.7 RPM lockfiles permission problems solaris 10
On Aug 31, 2010, at 4:23 AM, Roman Gächter wrote:

> Hello
>
> We have run into serious problems with OpenPKG 4.0.2 and 4.0.7 on Solaris 10 SPARC in Solaris whole-root zones. After some days of normal operation our rpm db gets corrupted.

(aside)
"Corrupted" technically means data loss or integrity failure (although it is quite common to hear "corruption" applied to Berkeley DB with other meanings). Is there data loss or integrity failure?

Running db_verify

  cd /var/lib/rpm
  db_verify Packages

against Packages or other tables (Berkeley DB calls tables "databases") will detect integrity failures. Data loss can be detected by, say, counting the number of packages installed:

  rpm -qa | wc -l

> We see hanging processes from the openpkg rc facility. The openpkg rc script tests the rpm db with an "openpkg rpm -q openpkg" query. These queries hang.

Hang how? Can you add -vv to a query? There is also

  cd /var/lib/rpm
  db_stat -CA    # -CA displays all info, -Cl displays lock info

Note that you MUST use db_verify/db_stat from the same version of Berkeley DB used by RPM. When Berkeley DB is internal to RPM, tools for the matching version are released with RPM.

> When this happens the permissions of the rpm lockfiles RPM/DB/__db.001 .002 .003 change to root.

An rpmdb is protected by permissions on /var/lib/rpm. The associated __db* files are created as needed. I would expect root as owner typically, because /var/lib/rpm is usually writable only by root, and rpm installs are typically run as root. But the openpkg wrapper appears to have a different ownership model.

> To recover from this problem we have to terminate the rpm queries, clean up and rebuild the rpm db, and change the permissions of the rpm lockfiles back to the openpkg user. Building our openpkg4 binary package from the original openpkg-4.0.7 source package worked without problems. We are using the VALUE license. Has anyone had similar problems? Any hints to resolve this?
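The data-loss check above ("rpm -qa | wc -l") can be scripted so the count is compared against a recorded baseline. A small sketch; the baseline idea is an addition, and the rpm argument is parameterized only so the function can be exercised without an rpmdb:

```python
import subprocess

def installed_count(rpm="rpm"):
    """Count installed packages: the `rpm -qa | wc -l` check, scripted.

    A sudden drop against a baseline recorded at install time suggests
    lost header records even when db_verify finds the B-trees
    structurally sound.
    """
    out = subprocess.run([rpm, "-qa"], capture_output=True,
                         text=True, check=True).stdout
    return len(out.splitlines())
```

Comparing installed_count() against the recorded baseline distinguishes real data loss from a merely wedged lock region, which db_verify alone cannot do.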
The first step is identifying what is causing the hang.

hth
73 de Jeff

> Best regards
> Roman
Re: OpenPKG 4.0.7 RPM lockfiles permission problems solaris 10
The problem is intermittent and we are not able to reproduce it at will. At the moment db_verify returns no errors. Maybe the output of db_stat -CA will give a clue?

  db_tool db_stat -CA
  Default locking region information:
  629     Last allocated locker ID
  0x7fff  Current maximum unused locker ID
  5       Number of lock modes
  1000    Maximum number of locks possible
  1000    Maximum number of lockers possible
  1000    Maximum number of lock objects possible
  160     Number of lock object partitions
  0       Number of current locks
  48      Maximum number of locks at any one time
  4       Maximum number of locks in any one bucket
  0       Maximum number of locks stolen by for an empty partition
  0       Maximum number of locks stolen for any one partition
  0       Number of current lockers
  12      Maximum number of lockers at any one time
  0       Number of current lock objects
  34      Maximum number of lock objects at any one time
  1       Maximum number of lock objects in any one bucket
  0       Maximum number of objects stolen by for an empty partition
  0       Maximum number of objects stolen for any one partition
  12898   Total number of locks requested
  12898   Total number of locks released
  90      Total number of locks upgraded
  495     Total number of locks downgraded
  4       Lock requests not available due to conflicts, for which we waited
  0       Lock requests not available due to conflicts, for which we did not wait
  0       Number of deadlocks
  0       Lock timeout value
  0       Number of locks that have timed out
  0       Transaction timeout value
  0       Number of transactions that have timed out
  512KB   The size of the lock region
  11      The number of partition locks that required waiting (0%)
  7       The maximum number of times any partition lock was waited for (0%)
  0       The number of object queue operations that required waiting (0%)
  3       The number of locker allocations that required waiting (0%)
  0       The number of region locks that required waiting (0%)
  1       Maximum hash bucket length
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Lock REGINFO information:
  Lock        Region type
  4           Region ID
  __db.004    Region name
  0xfed0      Original region address
  0xfed0      Region address
  0xfed000c8  Region primary address
  0           Region maximum allocation
  0           Region allocated
  Region allocations: 3006 allocations, 0 failures, 0 frees, 1 longest
  Allocations by power-of-two sizes:
    1KB     3002
    2KB     0
    4KB     0
    8KB     1
    16KB    2
    32KB    0
    64KB    1
    128KB   0
    256KB   0
    512KB   0
    1024KB  0
  REGION_JOIN_OK  Region flags
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Lock region parameters:
  283     Lock region region mutex [0/223 0% 23806/0]
  1031    locker table size
  1031    object table size
  560     obj_off
  69560   locker_off
  1       need_dd
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Lock conflict matrix:
  0 0 0 0 0
  0 0 1 0 0
  0 1 1 1 1
  0 0 0 0 0
  0 0 1 0 1
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Locks grouped by lockers:
  Locker   Mode      Count Status  ---------------- Object ----------------
  =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  Locks grouped by object:
  Locker   Mode      Count Status  ---------------- Object ----------------

Regards
Roman

On 08/31/2010 02:20 PM, Jeff Johnson wrote:
> [quoted text elided]
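db_stat -C output is a column of counters, one value per description, so the interesting ones can be pulled out programmatically. A throwaway sketch (not an OpenPKG or Berkeley DB tool) that maps each decimal counter description to its value:

```python
import re

def lock_counters(db_stat_text):
    """Extract '<value> <description>' counters from `db_stat -C` output.

    Returns a dict mapping description -> integer value. Lines whose
    value is not decimal (hex locker IDs, region names, separators)
    are skipped; a trailing KB unit on the value is dropped.
    """
    counters = {}
    for line in db_stat_text.splitlines():
        m = re.match(r"\s*(\d+)(?:KB)?\s+(\S.*\S)\s*$", line)
        if m:
            counters[m.group(2)] = int(m.group(1))
    return counters

# Counters taken from the first db_stat dump in this thread.
sample = """\
296\tNumber of current locks
492\tNumber of current lockers
0\tNumber of deadlocks
"""
c = lock_counters(sample)
# Current lockers on a system with no live rpm processes are stale
# entries that only failchk or db_recover will clean up.
stale_suspect = c["Number of current lockers"] > 0
```

Run against the first dump in this thread, it reports 492 current lockers on a box where every rpm process had already been terminated, which is exactly the stale-locker signature that precedes DB_RUNRECOVERY. The quiescent dump above, by contrast, shows 0 current locks and 0 current lockers.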
Re: OpenPKG 4.0.7 RPM lockfiles permission problems solaris 10
On Aug 31, 2010, at 4:23 PM, Roman Gaechter wrote:

> The problem is intermittent and we are not able to reproduce it at will.

Intermittent is what makes it tricky to identify.

> At the moment db_verify returns no errors.

So technically there is no corruption if db_verify doesn't detect any. rpm -qavv will show the data integrity checks: each installed header has a SHA1 digest attached, and you should see confirmation that the SHA1 digest check passes when -vv is passed.

> Maybe the output of db_stat -CA will give a clue?

If a hang is the issue, running db_stat will show what locks are held. The output you have attached is from a quiescent system, afaict. Try running (as root, so that locks can be acquired)

  rpm -qavv

as before. In the middle, hit ^Z, then run

  db_stat -Cl

You should see a shared/read lock from the running rpm -qa in the db_stat output. The locks usually have a state attached, like READ or WRITE or HELD.

One cause of hangs is a deadlock or a stale lock (though rpm should detect and clean up stale locks automatically these days). One way to see stale lock cleanup is to start an rpm install operation and then send kill -9 to rpm to force immediate termination. The next rpm invocation (as root) should display messages showing the cleanup of the stale lock left by the rpm install that was killed.

Those are basically the techniques for diagnosing hangs with RPM and Berkeley DB.
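The kill -9 experiment above works because Berkeley DB's locks live in the shared __db.* region files rather than in the kernel: they outlive a killed process until the next rpm runs dbenv->failchk(). Kernel locks behave differently, and the contrast is worth keeping in mind. A small demonstration using flock (illustrative only; nothing here touches an rpmdb):

```python
import fcntl
import os
import signal
import tempfile
import time

# Kernel flock() locks die with their owner. Berkeley DB locks sit in
# the __db.* region files and survive a kill -9 until failchk or
# db_recover reaps them -- that gap is what stale-lock cleanup handles.
path = os.path.join(tempfile.gettempdir(), "flock-demo")

pid = os.fork()
if pid == 0:
    # Child: grab the lock, then park until the parent kills us.
    cfd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    fcntl.flock(cfd, fcntl.LOCK_EX)
    os.kill(os.getpid(), signal.SIGSTOP)
    os._exit(0)

fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
held = False
deadline = time.time() + 5
while time.time() < deadline:          # wait until the child holds it
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.flock(fd, fcntl.LOCK_UN) # child not there yet; retry
        time.sleep(0.05)
    except BlockingIOError:
        held = True                    # child owns the kernel lock
        break

os.kill(pid, signal.SIGKILL)           # simulate `kill -9` on rpm
os.waitpid(pid, 0)
fcntl.flock(fd, fcntl.LOCK_EX)         # succeeds: lock died with owner
print("child held lock before kill -9:", held)
```

The final flock() succeeds immediately because the kernel released the lock when the killed child's descriptor closed; a Berkeley DB region lock would instead still be recorded in __db.004 and show up in db_stat -Cl.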
hth
73 de Jeff

> [quoted db_stat output and earlier messages elided]