Re: [Zope-dev] recipe for trapping SIGSEGV and SIGILL signals on solaris
Matt -

Ok, I installed everything and the system is running fine (or no worse). However, we have still had one restart so far. I have included the debug information below. This looks similar to the problem report on sourceforge:

http://sourceforge.net/tracker/?func=detail&atid=105470&aid=471942&group_id=5470

I posted a comment to see if they have any updates.

One question ... does anyone ever malloc a plain ClassExtension object? It seems that every CE-based object has its own struct typedef. If so, then I think yesterday's patch probably won't do any harm but won't help either.

The currently running process is being monitored by truss, so I will be able to get at least one more core dump (if we get one). I won't be able to get any more information until tomorrow. Any other ideas?

Thanks for your help.

- joe

(gdb) info threads
  17 Thread 10  0xef5b9810 in _lwp_sema_wait ()
  16 Thread 9  0xef647cac in _swtch ()
  15 Thread 8  0xef5b9810 in _lwp_sema_wait ()
  14 Thread 7 (LWP 5)  0xcaeb50 in ?? ()
  13 Thread 6  0xef647cac in _swtch ()
  12 Thread 5  0xef5b9810 in _lwp_sema_wait ()
  11 Thread 4  0xef647cac in _swtch ()
  10 Thread 3  0xef647cac in _swtch ()
  9 Thread 2 (LWP 2)  0xef5b9958 in _signotifywait ()
  8 Thread 1 (LWP 6)  0xef5b7488 in _poll ()
  7 LWP8  0xef5b6a24 in door_restart ()
  6 LWP6  0xef5b7488 in _poll ()
  5 LWP5  0xcaeb50 in ?? ()
  4 LWP4  0xef5b9810 in _lwp_sema_wait ()
  3 LWP3  0xef5b9810 in _lwp_sema_wait ()
  2 LWP2  0xef5b9958 in _signotifywait ()
* 1 LWP1  0xef5b9810 in _lwp_sema_wait ()
(gdb) thread 14
[Switching to Thread 7 (LWP 5)]
#0  0xcaeb50 in ?? ()
(gdb) where
#0  0xcaeb50 in ?? ()
#1  0x516bc in collect (young=0x13dec8, old=0x13ded4) at ./Modules/gcmodule.c:379
#2  0x51984 in collect_generations () at ./Modules/gcmodule.c:484
#3  0x519fc in _PyGC_Insert (op=0xecf7d4) at ./Modules/gcmodule.c:507
#4  0x664ec in PyMethod_New (func=0x3f796c, self=0x11c0d44, class=0x3c7e5c) at Objects/classobject.c:1834
#5  0x63850 in instance_getattr2 (inst=0x11c0d44, name=0x3d5378) at Objects/classobject.c:642
#6  0x63750 in instance_getattr1 (inst=0x11c0d44, name=0x3d5378) at Objects/classobject.c:608
#7  0x63898 in instance_getattr (inst=0x11c0d44, name=0x3d5378) at Objects/classobject.c:656
#8  0x78330 in PyObject_GetAttr (v=0x11c0d44, name=0x3d5378) at Objects/object.c:1052
#9  0x895ec in builtin_hasattr (self=0x0, args=0x12ed944) at Python/bltinmodule.c:886
#10 0x35a44 in call_cfunction (func=0x1609b0, arg=0x12ed944, kw=0x0) at Python/ceval.c:2854
#11 0x33c5c in eval_code2 (co=0x3cbf80, globals=0x1, locals=0x0, args=0x2, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:1948

and so on

___
Zope-Dev maillist - [EMAIL PROTECTED]
http://lists.zope.org/mailman/listinfo/zope-dev
** No cross posts or HTML encoding! **
(Related lists -
 http://lists.zope.org/mailman/listinfo/zope-announce
 http://lists.zope.org/mailman/listinfo/zope )
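When a core has this many threads, it can help to filter the 'info threads' listing mechanically before choosing a 'thread N' to switch to. The following is a minimal Python 3 sketch of that idea, not part of the original recipe. The names PARKED, LINE and suspicious_threads are hypothetical, and treating the listed wait functions as "parked" is an assumption made only for this example; the sample lines are copied from the listing in this message.

```python
import re

# Assumption: frames sitting in these functions are treated as parked threads.
# The set is hypothetical, built from the function names in the listing above.
PARKED = {"_lwp_sema_wait", "_swtch", "_signotifywait", "_poll", "door_restart"}

# Matches lines such as "14 Thread 7 (LWP 5) 0xcaeb50 in ?? ()".
LINE = re.compile(r"^\*?\s*(\d+)\s.*\sin\s+(\S+)\s*\(")


def suspicious_threads(listing):
    """Return (gdb thread number, frame name) for threads not in PARKED."""
    found = []
    for line in listing.splitlines():
        match = LINE.match(line.strip())
        if match and match.group(2) not in PARKED:
            found.append((int(match.group(1)), match.group(2)))
    return found


LISTING = """\
17 Thread 10 0xef5b9810 in _lwp_sema_wait ()
16 Thread 9 0xef647cac in _swtch ()
14 Thread 7 (LWP 5) 0xcaeb50 in ?? ()
9 Thread 2 (LWP 2) 0xef5b9958 in _signotifywait ()
8 Thread 1 (LWP 6) 0xef5b7488 in _poll ()
7 LWP8 0xef5b6a24 in door_restart ()
5 LWP5 0xcaeb50 in ?? ()
* 1 LWP1 0xef5b9810 in _lwp_sema_wait ()
"""

for number, frame in suspicious_threads(LISTING):
    # Each hit is a candidate for "thread N" followed by "where" in gdb.
    print("thread %d" % number, "# frame:", frame)
```

For this listing the filter leaves entries 14 and 5, which are the same LWP seen once as a user thread and once as a raw LWP.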
Re: [Zope-dev] recipe for trapping SIGSEGV and SIGILL signals on solaris
Matt -

Well, your patch seems fine in our testing environment. Unfortunately, we do not see any restarts in the testing environment ... always in production.

I had to rebuild our entire software base because we are using other products that use extension classes and they are not included under the main zope installation. It caused a bus error the first time (when running only wo_pcgi.py).

As I mentioned in my prior e-mail, I modified the patch slightly to exactly match the struct in Python's object.h. Please review this patch.

I will apply the patch in production tomorrow morning, 12/13 (Japan Standard Time, GMT+9), and monitor the system. If zope does not restart during the day, then I think you have fixed the problem.

I'm using Zope 2.4.3 and Python 2.1.1 with pymalloc disabled on the solaris platform.

thanks and regards,

- joe n.

p.s. I looked **briefly** at the Zope 2.5 source and this patch will not be compatible, since there doesn't seem to be a standard among the different extension classes on whether or not to include the COUNT_ALLOCS define. The cAccessControl class seems to be the exception.

*** ExtensionClass.h.bak  Fri Nov 16 10:37:11 2001
--- ExtensionClass.h  Wed Dec 12 15:10:03 2001
***************
*** 136,154 ****
      PySequenceMethods *tp_as_sequence;
      PyMappingMethods *tp_as_mapping;

!     /* More standard operations (at end for binary compatibility) */

      hashfunc tp_hash;
      ternaryfunc tp_call;
      reprfunc tp_str;
      getattrofunc tp_getattro;
      setattrofunc tp_setattro;
!     /* Space for future expansion */
!     long tp_xxx3;
!     long tp_xxx4;

      char *tp_doc; /* Documentation string */

  #ifdef COUNT_ALLOCS
      /* these must be last */
      int tp_alloc;
--- 136,169 ----
      PySequenceMethods *tp_as_sequence;
      PyMappingMethods *tp_as_mapping;

!     /* More standard operations (here for binary compatibility) */

      hashfunc tp_hash;
      ternaryfunc tp_call;
      reprfunc tp_str;
      getattrofunc tp_getattro;
      setattrofunc tp_setattro;
!
!     /* Functions to access object as input/output buffer */
!     PyBufferProcs *tp_as_buffer;
!
!     /* Flags to define presence of optional/expanded features */
!     long tp_flags;

      char *tp_doc; /* Documentation string */

+     /* call function for all accessible objects */
+     traverseproc tp_traverse;
+
+     /* delete references to contained objects */
+     inquiry tp_clear;
+
+     /* rich comparisons */
+     richcmpfunc tp_richcompare;
+
+     /* weak reference enabler */
+     long tp_weaklistoffset;
+
  #ifdef COUNT_ALLOCS
      /* these must be last */
      int tp_alloc;
***************
*** 302,308 ****
    { PyExtensionClassCAPI->Export(D,N,T); }

  /* Convert a method list to a method chain. */
! #define METHOD_CHAIN(DEF) { DEF, NULL }

  /* The following macro checks whether a type is an extension class: */
  #define PyExtensionClass_Check(TYPE) \
--- 317,330 ----
    { PyExtensionClassCAPI->Export(D,N,T); }

  /* Convert a method list to a method chain. */
! /* MTK -- make it pad the type structure out -- presumes only use is in
! ** type structure initialization
! */
! #ifdef COUNT_ALLOCS
! #define METHOD_CHAIN(DEF) 0,0,0,0,0,0,0,0,{ DEF, NULL }
! #else
! #define METHOD_CHAIN(DEF) 0,0,0,0,{ DEF, NULL }
! #endif

  /* The following macro checks whether a type is an extension class: */
  #define PyExtensionClass_Check(TYPE) \
***************
*** 336,342 ****
  #define PURE_MIXIN_CLASS(NAME,DOC,METHODS) \
  static PyExtensionClass NAME ## Type = { PyObject_HEAD_INIT(NULL) \
  0, # NAME, sizeof(PyPureMixinObject), 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
! 0, 0, 0, 0, 0, 0, 0, DOC, {METHODS, NULL}, \
  EXTENSIONCLASS_BASICNEW_FLAG}

  /* The following macros provide limited access to extension-class
--- 358,364 ----
  #define PURE_MIXIN_CLASS(NAME,DOC,METHODS) \
  static PyExtensionClass NAME ## Type = { PyObject_HEAD_INIT(NULL) \
  0, # NAME, sizeof(PyPureMixinObject), 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
! 0, 0, 0, 0, 0, 0, 0, DOC, METHOD_CHAIN(METHODS), \
  EXTENSIONCLASS_BASICNEW_FLAG}

  /* The following macros provide limited access to extension-class
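The padding in the patched METHOD_CHAIN macro follows directly from the slots the patch adds between tp_doc and the method chain. The Python 3 sketch below only illustrates that counting; it is not part of the patch. The names SLOTS_AFTER_DOC, COUNT_ALLOCS_SLOT_COUNT and method_chain are hypothetical, and the number of COUNT_ALLOCS slots is an assumption inferred from the difference between the two macro variants (the hunk shows only tp_alloc).

```python
# Slots the patch inserts after tp_doc, in the order they appear in the diff.
SLOTS_AFTER_DOC = ("tp_traverse", "tp_clear", "tp_richcompare", "tp_weaklistoffset")

# Assumption: inferred from the macro (8 zeros with COUNT_ALLOCS, 4 without);
# the diff itself shows only the first of these slots, tp_alloc.
COUNT_ALLOCS_SLOT_COUNT = 4


def method_chain(definition, count_allocs=False):
    """Build the initializer text the patched METHOD_CHAIN macro expands to."""
    zeros = len(SLOTS_AFTER_DOC)
    if count_allocs:
        zeros += COUNT_ALLOCS_SLOT_COUNT
    # One "0" per skipped slot, then the method chain itself.
    return ",".join(["0"] * zeros) + ",{ %s, NULL }" % definition


print(method_chain("METHODS"))
print(method_chain("METHODS", count_allocs=True))
```

If an extension class is compiled with a different COUNT_ALLOCS setting than the header assumes, the initializer is off by four slots, which is the compatibility concern raised in the p.s. above.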
Re: [Zope-dev] recipe for trapping SIGSEGV and SIGILL signals on solaris
Joseph Wayne Norton wrote:

> Matt -
>
> Well, your patch seems fine in our testing environment.
> Unfortunately, we do not see any restarts in the testing environment
> ... always in production.
>
> I had to rebuild our entire software base because we are using other
> products that use extension classes and they are not included under
> the main zope installation. It caused a bus error the first time
> (when running only wo_pcgi.py).
>
> As I mentioned in my prior e-mail, I modified the patch slightly to
> exactly match the struct in Python's object.h. Please review this
> patch.
>
> I will apply the patch in production tomorrow morning, 12/13 (Japan
> Standard Time, GMT+9), and monitor the system. If zope does not
> restart during the day, then I think you have fixed the problem.
>
> I'm using Zope 2.4.3 and Python 2.1.1 with pymalloc disabled on the
> solaris platform.
>
> thanks and regards,
>
> - joe n.
>
> p.s. I looked **briefly** at the Zope 2.5 source and this patch will
> not be compatible, since there doesn't seem to be a standard among
> the different extension classes on whether or not to include the
> COUNT_ALLOCS define. The cAccessControl class seems to be the
> exception.

My fingers and toes are crossed for you ;)

I've actually built 2.5 with the modified ExtensionClass.h; it builds OK, runs, and passes all of its unit tests. That's not proof one way or another, but...

Sorry our turnaround times are so laggy; that's the downside of diagnosing a problem on the other side of the globe.
Re: [Zope-dev] recipe for trapping SIGSEGV and SIGILL signals on solaris
> (gdb) print *((PyObject *) gc)->ob_type
> $1 = {ob_refcnt = 18213696, ob_type = 0x2d70b0, ob_size = 0,
>   tp_name = 0x1 T, tp_basicsize = 1328272, tp_itemsize = 4156348,
>   tp_dealloc = 0x125865c, tp_print = 0x3c1b04, tp_getattr = 0,
>   tp_setattr = 0, tp_compare = 0x29, tp_repr = 0x3adeb0,
>   tp_as_number = 0xf66198, tp_as_sequence = 0xdf3fa0,
>   tp_as_mapping = 0x0, tp_hash = 0x1,
>   tp_call = 0x144490 <PyMethod_Type>, tp_str = 0x3f0a1c,
>   tp_getattro = 0x125865c, tp_setattro = 0x3c1b04, tp_as_buffer = 0x0,
>   tp_flags = 158561192, tp_doc = 0x29 , tp_traverse = 0x4c4f4144,
>   tp_clear = 0xd908c0, tp_richcompare = 0x1151300,
>   tp_weaklistoffset = 0}
> [...]
> (gdb) x 0x4c4f4144
> 0x4c4f4144: Cannot access memory at address 0x4c4f4144.

0x4c4f4144 is big-endian ASCII for "LOAD". Things were corrupted before...

Florent

--
Florent Guillaume, Nuxeo (Paris, France)
+33 1 40 33 79 10  http://nuxeo.com  mailto:[EMAIL PROTECTED]
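The "LOAD" observation can be checked by treating the bogus pointer as four raw bytes. A minimal Python 3 sketch, assuming 32-bit pointers as in the dump; pointer_as_ascii is a hypothetical helper name, not something from gdb or Python itself:

```python
def pointer_as_ascii(value, byteorder="big"):
    """Return the 4 bytes of a 32-bit word as text if they are all
    printable ASCII, otherwise None."""
    raw = value.to_bytes(4, byteorder)
    if all(0x20 <= byte < 0x7F for byte in raw):
        return raw.decode("ascii")
    return None


# tp_traverse from the dump: looks like string data, not a code address.
print(pointer_as_ascii(0x4C4F4144))

# tp_call from the dump: a plausible address, so no text comes back.
print(pointer_as_ascii(0x144490))
```

A "pointer" that decodes to readable text is a strong hint that string data has been written over the structure, or that the structure pointer itself lands in the middle of unrelated data.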
Re: [Zope-dev] recipe for trapping SIGSEGV and SIGILL signals on solaris
Florent Guillaume wrote:

> > (gdb) print *((PyObject *) gc)->ob_type
> > $1 = {ob_refcnt = 18213696, ob_type = 0x2d70b0, ob_size = 0,
> >   tp_name = 0x1 T, tp_basicsize = 1328272, tp_itemsize = 4156348,
> >   tp_dealloc = 0x125865c, tp_print = 0x3c1b04, tp_getattr = 0,
> >   tp_setattr = 0, tp_compare = 0x29, tp_repr = 0x3adeb0,
> >   tp_as_number = 0xf66198, tp_as_sequence = 0xdf3fa0,
> >   tp_as_mapping = 0x0, tp_hash = 0x1,
> >   tp_call = 0x144490 <PyMethod_Type>, tp_str = 0x3f0a1c,
> >   tp_getattro = 0x125865c, tp_setattro = 0x3c1b04,
> >   tp_as_buffer = 0x0, tp_flags = 158561192, tp_doc = 0x29 ,
> >   tp_traverse = 0x4c4f4144, tp_clear = 0xd908c0,
> >   tp_richcompare = 0x1151300, tp_weaklistoffset = 0}
> > [...]
> > (gdb) x 0x4c4f4144
> > 0x4c4f4144: Cannot access memory at address 0x4c4f4144.
>
> 0x4c4f4144 is big-endian ASCII for "LOAD". Things were corrupted
> before...
>
> Florent

Yes, the whole block is bad, so it probably isn't really a Python type object. The refcount is a bit high, the name is really low (0x01!), the basicsize and itemsize are extremely large, the compare function is too low, the hash function is too low -- i.e. it isn't a type object.

So, I may have been telling him to get the wrong thing; the source code that he faulted in reads:

    /* Subtract internal references from gc_refs */
    static void
    subtract_refs(PyGC_Head *containers)
    {
        traverseproc traverse;
        PyGC_Head *gc = containers->gc_next;
        for (; gc != containers; gc=gc->gc_next) {
            /* The next line is the line that was active at the time of his fault */
            traverse = PyObject_FROM_GC(gc)->ob_type->tp_traverse;
            (void) traverse(PyObject_FROM_GC(gc),
                            (visitproc)visit_decref,
                            NULL);
        }
    }

And PyObject_FROM_GC(gc) is either

    (gc)

or

    ((PyObject *)(((PyGC_Head *)gc)+1))

depending on whether or not WITH_CYCLE_GC is defined. I took the easy route and asked Joe to assume that the former was true.

If the latter is true, then the type object is shifted upwards in memory by three words; the new first three fields are gc_next, gc_prev, and gc_refs. That means every value in the type header is off by three fields, if it isn't aligned, meaning the real type object would be:

    gc_next        = 0x115eb40
    gc_prev        = 0x2d70b0
    gc_refs        = 0
    ob_refcnt      = 0x1
    ob_type        = 0x144490 (which we actually know is PyMethod_Type -- yay)
    ob_size        = 0x3f6bbc (which is too large for my comfort)
    tp_name        = 0x125865c (valid pointer but we don't know what it is)
    tp_basicsize   = 0x3c1b04 (seems high again, but is 0x350b8 less than ob_size)
    tp_itemsize    = 0
    tp_dealloc     = 0
    tp_print       = 0x29 (boo!)
    tp_getattr     = 0x3adeb0
    tp_setattr     = 0xf66198
    tp_compare     = 0xdf3fa0
    tp_repr        = 0
    tp_as_number   = 1 (boo!)
    tp_as_sequence = 0x144490 <PyMethod_Type> (boo!)
    etc...

Even shifting THESE values by 1 (assuming the compiler takes PyGC_Head, which is three words, and pads it up to 4 words for alignment) puts garbage values like 0x29 in tp_dealloc.

Ergo, I'm pretty confident that the gc pointer itself is bad.

If I was just a *wee* bit more familiar with how Solaris loaded segments, I'd be able to glean some more information from the addresses (i.e. are they code or data segment pointers). Normally I like seeing OS's use the high nybble or byte of an address as a segment number to make that sort of diagnosis easier. It actually looks like page zero is MAPPED on Solaris (I didn't think it was), which in my book is a baaad thing since it means a null pointer CAN be dereferenced.
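The three-word shift described in this message can be reproduced by re-labelling the raw words from the gdb dump. The Python 3 sketch below is an illustration only: WORDS is copied from the first seventeen fields of the dump quoted in this thread, while GC_HEAD, FIELDS and relabel are hypothetical names introduced for the example.

```python
# Field names in the order gdb printed them (first 17 fields of the dump).
FIELDS = [
    "ob_refcnt", "ob_type", "ob_size", "tp_name", "tp_basicsize",
    "tp_itemsize", "tp_dealloc", "tp_print", "tp_getattr", "tp_setattr",
    "tp_compare", "tp_repr", "tp_as_number", "tp_as_sequence",
    "tp_as_mapping", "tp_hash", "tp_call",
]

# Raw words from the dump, in the same order as FIELDS.
WORDS = [
    18213696, 0x2D70B0, 0, 0x1, 1328272, 4156348, 0x125865C, 0x3C1B04,
    0, 0, 0x29, 0x3ADEB0, 0xF66198, 0xDF3FA0, 0x0, 0x1, 0x144490,
]

# The three fields that sit in front of the object when PyObject_FROM_GC
# is the "+1" variant.
GC_HEAD = ["gc_next", "gc_prev", "gc_refs"]


def relabel(words, prefix=GC_HEAD, fields=FIELDS):
    """Pair each raw word with the name it would have if the structure
    started len(prefix) words earlier than gdb assumed."""
    names = list(prefix) + list(fields)
    return list(zip(names, words))  # zip stops at the last available word


for name, word in relabel(WORDS):
    print("%-15s = %#x" % (name, word))
```

Under this re-labelling ob_refcnt becomes 1 and ob_type becomes 0x144490, which matches the list in the message; the later fields still contain values such as 0x29 where a function pointer would be expected.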
Re: [Zope-dev] recipe for trapping SIGSEGV and SIGILL signals on solaris
Hi Joe,

The problem you're seeing is that the fault is happening on a different thread than the one that receives the signal. I think you first need to do an 'info thread' in gdb and then a 'thread N' to switch to the real crashing thread before getting the backtrace.

That truss syntax is interesting, though (I have an old SPARC around to test on, but it's painfully slow).

- Original Message -
From: Joseph Wayne Norton [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, December 11, 2001 2:20 AM
Subject: [Zope-dev] recipe for trapping SIGSEGV and SIGILL signals on solaris

Hello.

We are facing zope restarts on the solaris 5.6 platform with zope 2.4.3 and python 2.1.1. I put together a script based on some information in an old posting to the apache mailing list. The following shell/perl script allows one to get a core file from a dying zope child process and also allows zope to restart without any side effects.

The script:

    #!/bin/sh
    PATH=$PATH:/usr/local/bin
    export PATH
    cd /tmp
    for PID in `ps -u zfs -f -o pid,comm,args | fgrep z2.py | cut -d' ' -f1`
    do
      export PID
      truss -f -l -t\!all -S SIGSEGV,SIGILL -p $PID 2>&1 \
      | perl -pe 'system("gcore $ENV{'PID'} && sleep 5 && kill -9 $ENV{'PID'}"), exit($ENV{'PID'}) if /(SIGSEGV|SIGILL)/;'
    done

Step 1: modify the script to match your environment.
Step 2: execute the script.
Step 3: wait for a core file to be dumped in /tmp.
Step 4: analyze with gdb, where $PID is the pid of the dumped process.

#bash gdb /path/to/bin/python /tmp/core.$PID
#0  0xef5b9810 in _lwp_sema_wait ()
(gdb) where
#0  0xef5b9810 in _lwp_sema_wait ()
#1  0xef647ea0 in _park ()
#2  0xef647b84 in _swtch ()
#3  0xef6468a4 in cond_wait ()
#4  0xef6467c8 in _ti_pthread_cond_wait ()
#5  0x50220 in PyThread_acquire_lock (lock=0xd9d878, waitflag=1) at Python/thread_pthread.h:313
#6  0x51f18 in lock_PyThread_acquire_lock (self=0xda39b8, args=0x0) at ./Modules/threadmodule.c:67
#7  0x35db4 in fast_cfunction (func=0xda39b8, pp_stack=0xed40f828, na=0) at Python/ceval.c:2994
#8  0x33ca0 in eval_code2 (co=0x267848, globals=0x51ec4, locals=0x0, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:1951
:
:

It seems that we are facing trouble due to the thread library on solaris (unless the truss command has introduced a side effect). Is anyone else facing similar troubles? Or maybe I should post this to a python mailing list.

- joe
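The perl one-liner in the script acts as a line filter: it passes truss output through and reacts to the first line that mentions SIGSEGV or SIGILL. The Python 3 sketch below shows only that matching step and deliberately runs no external commands. first_fatal_signal is a hypothetical helper name, and the sample lines are made-up placeholders rather than real truss output.

```python
import re

# Same alternation as the perl pattern /(SIGSEGV|SIGILL)/ in the script.
FATAL = re.compile(r"(SIGSEGV|SIGILL)")


def first_fatal_signal(lines):
    """Return (signal name, line) for the first line mentioning a fatal
    signal, or None if no line matches."""
    for line in lines:
        match = FATAL.search(line)
        if match:
            return match.group(1), line.rstrip("\n")
    return None


# Placeholder input; in the recipe these lines would come from truss.
SAMPLE = [
    "placeholder line without any signal\n",
    "placeholder line mentioning SIGSEGV\n",
    "placeholder line mentioning SIGILL\n",
]

print(first_fatal_signal(SAMPLE))
```

In the original recipe the match triggers gcore on the traced pid, followed by a short sleep and a kill -9; a Python version would need something like subprocess for that step, which is left out here.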
Re: [Zope-dev] recipe for trapping SIGSEGV and SIGILL signals on solaris
At Tue, 11 Dec 2001 10:42:46 -0500, Matthew T. Kromer wrote:

> > #0  0xef5b9810 in _lwp_sema_wait ()
> > (gdb) info threads
> >   19 Thread 10  0xef5b9810 in _lwp_sema_wait ()
> >   18 Thread 9  0xef5b9810 in _lwp_sema_wait ()
> >   17 Thread 8  0xef5b9810 in _lwp_sema_wait ()
> >   16 Thread 7 (LWP 8)  subtract_refs (containers=0x13dec8) at ./Modules/gcmodule.c:166
>
> Aha! See?

Matthew -

I performed the operations that you recommended and here are the results (see below). The problem seems to be with the value of the tp_traverse field. I am not aware of any "T" type python object. I'm wondering if this is an extension class type (just a guess). I searched through all of the *.c files in zope, etc., but I was not able to find any type named T.

I also ran across a bug posting at sourceforge ...

http://sourceforge.net/tracker/?func=detail&atid=105470&aid=471942&group_id=5470

This bug report looks very similar.

- j

#0  0xef5b9810 in _lwp_sema_wait ()
(gdb) info threads
  19 Thread 10  0xef5b9810 in _lwp_sema_wait ()
  18 Thread 9  0xef5b9810 in _lwp_sema_wait ()
  17 Thread 8  0xef5b9810 in _lwp_sema_wait ()
  16 Thread 7 (LWP 8)  subtract_refs (containers=0x13dec8) at ./Modules/gcmodule.c:166
  15 Thread 6  0xef647cac in _swtch ()
  14 Thread 5  0xef5b9810 in _lwp_sema_wait ()
  13 Thread 4 (LWP 0)  0xef647b7c in _swtch ()
  12 Thread 3  0xef647cac in _swtch ()
  11 Thread 2 (LWP 2)  0xef5b9958 in _signotifywait ()
  10 Thread 1 (LWP 6)  0xef5b7488 in _poll ()
  9 LWP9  0xef5b6a24 in door_restart ()
  8 LWP8  subtract_refs (containers=0x13dec8) at ./Modules/gcmodule.c:166
  7 LWP7  0xef5b9810 in _lwp_sema_wait ()
  6 LWP6  0xef5b7488 in _poll ()
  5 LWP5  0xef5b9814 in _lwp_sema_wait ()
  4 LWP4  0xef5b9810 in _lwp_sema_wait ()
  3 LWP3  0xef5b9810 in _lwp_sema_wait ()
  2 LWP2  0xef5b9958 in _signotifywait ()
* 1 LWP1  0xef5b9810 in _lwp_sema_wait ()
(gdb) thread 16
[Switching to Thread 7 (LWP 8)]
#0  subtract_refs (containers=0x13dec8) at ./Modules/gcmodule.c:166
./Modules/gcmodule.c:166: No such file or directory.
(gdb) print *((PyObject *) gc)->ob_type
$1 = {ob_refcnt = 18213696, ob_type = 0x2d70b0, ob_size = 0,
  tp_name = 0x1 T, tp_basicsize = 1328272, tp_itemsize = 4156348,
  tp_dealloc = 0x125865c, tp_print = 0x3c1b04, tp_getattr = 0,
  tp_setattr = 0, tp_compare = 0x29, tp_repr = 0x3adeb0,
  tp_as_number = 0xf66198, tp_as_sequence = 0xdf3fa0,
  tp_as_mapping = 0x0, tp_hash = 0x1,
  tp_call = 0x144490 <PyMethod_Type>, tp_str = 0x3f0a1c,
  tp_getattro = 0x125865c, tp_setattro = 0x3c1b04, tp_as_buffer = 0x0,
  tp_flags = 158561192, tp_doc = 0x29 , tp_traverse = 0x4c4f4144,
  tp_clear = 0xd908c0, tp_richcompare = 0x1151300,
  tp_weaklistoffset = 0}
(gdb) print *((PyObject *) 0x2d70b0)->ob_type
$2 = {ob_refcnt = 2977968, ob_type = 0xff5b80, ob_size = 0,
  tp_name = 0x1 T, tp_basicsize = 1328272, tp_itemsize = 4155228,
  tp_dealloc = 0x125865c, tp_print = 0x3c1b04, tp_getattr = 0,
  tp_setattr = 0, tp_compare = 0x29, tp_repr = 0,
  tp_as_number = 0x1212b48, tp_as_sequence = 0xbf8d30,
  tp_as_mapping = 0x, tp_hash = 0x1,
  tp_call = 0x144490 <PyMethod_Type>, tp_str = 0x4ab2cc,
  tp_getattro = 0x1089d5c, tp_setattro = 0x4ab30c, tp_as_buffer = 0x0,
  tp_flags = 0, tp_doc = 0x29 , tp_traverse = 0,
  tp_clear = 0x122d140, tp_richcompare = 0x11ccd70,
  tp_weaklistoffset = -1}
(gdb) x 0x4c4f4144
0x4c4f4144: Cannot access memory at address 0x4c4f4144.