Hi,

I don't know whether this happens for you too, but I encountered a problem in
the reactionner when it processes a Notification (with the latest git code).


The reactionner's worker that handles the notification crashes with a
traceback:

    [0][scheduler-central]Stats : Workers:1 (Queued:0 Processing:0 ReturnWait:0)
    [1][scheduler-CMP]Stats : Workers:1 (Queued:0 Processing:0 ReturnWait:0)
    Wait ratio: 1.0
    Notification instance has no attribute 'timeout'
    Ask actions to 1 got 1
    Process Process-2:
    Traceback (most recent call last):
      File "/usr/lib/python2.6/multiprocessing/process.py", line 232, in _bootstrap
        self.run()
      File "/usr/lib/python2.6/multiprocessing/process.py", line 88, in run
        self._target(*self._args, **self._kwargs)
      File "./shinken/worker.py", line 207, in work
        self.manage_finished_checks()
      File "./shinken/worker.py", line 146, in manage_finished_checks
        action.check_finished(self.max_plugins_output_length)
      File "./shinken/action.py", line 178, in check_finished
        self.check_finished_unix(max_plugins_output_length)
      File "./shinken/action.py", line 191, in check_finished_unix
        if (now - self.check_time) > self.timeout:
    AttributeError: Notification instance has no attribute 'timeout'
    We ask us for a ping
     ======================== 
    [reactionner-central] Warning : the worker 0 goes down unexpectly!
    [0][scheduler-central]Stats : Workers:0 (Queued:0 Processing:1 ReturnWait:0)
    [1][scheduler-CMP]Stats : Workers:0 (Queued:0 Processing:1 ReturnWait:0)
    Wait ratio: 1.0
    [reactionner-central] Allocating new Worker : 1


After debugging, I found that the Notification is correctly created and sent
on the scheduler side (in the get_checks method), but the reactionner receives
this Notification without the 'timeout' attribute (after the Pyro remote call
to get_checks)!
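
One way to check this kind of loss outside Shinken is to compare an object's
attributes before and after a pickle round trip. The sketch below assumes that
the remote call serializes objects with pickle and that the class restricts its
pickled state to the names declared in a 'properties' list; neither assumption
is verified against the Shinken code here. FakeNotification and
attrs_lost_by_pickle are hypothetical names used only for illustration.

```python
import pickle


class FakeNotification(object):
    """Hypothetical stand-in, NOT the real shinken Notification class."""

    # Assumption: only the names declared here survive serialization.
    properties = ('command', 'check_time')

    def __init__(self):
        self.command = '/bin/true'
        self.check_time = 0
        # Set on the instance, but not declared in `properties`.
        self.timeout = 5

    def __getstate__(self):
        # Keep only the declared properties in the pickled state.
        return dict((name, getattr(self, name))
                    for name in self.properties if hasattr(self, name))

    def __setstate__(self, state):
        # Restore exactly what was pickled, nothing more.
        self.__dict__.update(state)


def attrs_lost_by_pickle(obj):
    """Return the sorted instance attribute names that do not survive
    a pickle round trip."""
    clone = pickle.loads(pickle.dumps(obj))
    return sorted(set(vars(obj)) - set(vars(clone)))
```

Calling attrs_lost_by_pickle(FakeNotification()) reports 'timeout' as the only
attribute dropped, which would be consistent with the AttributeError above if
the real class behaves like this stand-in.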


Here is a small patch that worked for me (adding the 'timeout' attribute
to the 'properties' list defined in the Notification class). I don't
know if it's the right way to fix the problem:

  notification.py

    93c93
    < 
    ---
    >         'timeout' : StringProp(default=5),


And a small worker.py patch to add exception handling:

    146,147c150,157
    <                 action.check_finished(self.max_plugins_output_length)
    <                 wait_time = min(wait_time, action.wait_time)
    ---
    >                 try:
    >                     action.check_finished(self.max_plugins_output_length)
    >                     wait_time = min(wait_time, action.wait_time)
    >                 except Exception, exp:
    >                     print "[%d]Error!!! %s, exiting." % (self.id, exp)
    >                     sys.exit(2)


But after fixing this first problem, another bug occurred (in the
reactionner again, when the worker returns its result to the reactionner).
Traceback:

    Traceback (most recent call last):
      File "/usr/local/shinken/bin/shinken-reactionner", line 5, in <module>
        pkg_resources.run_script('Shinken==0.4', 'shinken-reactionner')
      File "/usr/lib/python2.6/dist-packages/pkg_resources.py", line 467, in run_script
        self.require(requires)[0].run_script(script_name, ns)
      File "/usr/lib/python2.6/dist-packages/pkg_resources.py", line 1200, in run_script
        execfile(script_filename, namespace, namespace)
      File "/usr/local/lib/python2.6/dist-packages/Shinken-0.4-py2.6.egg/EGG-INFO/scripts/shinken-reactionner", line 158, in <module>
        p.main()
      File "/usr/local/lib/python2.6/dist-packages/Shinken-0.4-py2.6.egg/shinken/satellite.py", line 708, in main
        self.manage_action_return(self.returns_queue.pop())
      File "/usr/local/lib/python2.6/dist-packages/Shinken-0.4-py2.6.egg/shinken/satellite.py", line 309, in manage_action_return
        sched_id = action.sched_id
    AttributeError: Notification instance has no attribute 'sched_id'


I found where the problem is, but it's very strange and I didn't manage to
solve it.

When the reactionner gets a Notification from the scheduler, it adds the
sched_id attribute to it and puts it in its 'self.s' Queue
(multiprocessing.Queue), ok.
But when the worker dequeues this Notification, the sched_id attribute
has disappeared!
I tried to dequeue the Notification just after it had been queued by
the reactionner, and this attribute really had disappeared!
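
For reference, multiprocessing.Queue pickles every object on put() and
rebuilds a copy on get(), even within a single process, so an attribute can
vanish without any race condition if the class filters its pickled state. The
sketch below shows this with a hypothetical FakeAction class whose
__getstate__ keeps only the names in 'properties'. That filtering is an
assumption about the cause, not something confirmed in the Shinken code, and
roundtrip_through_queue is an illustrative helper.

```python
import multiprocessing


class FakeAction(object):
    """Hypothetical stand-in, NOT the real shinken action class."""

    # Assumption: only the names declared here are pickled.
    properties = ('command',)

    def __init__(self, command):
        self.command = command

    def __getstate__(self):
        # Drop every attribute that is not a declared property.
        return dict((name, getattr(self, name))
                    for name in self.properties if hasattr(self, name))

    def __setstate__(self, state):
        self.__dict__.update(state)


def roundtrip_through_queue(obj):
    """Put obj on a multiprocessing.Queue and read it straight back."""
    queue = multiprocessing.Queue()
    queue.put(obj)  # pickled by the queue's feeder thread
    try:
        return queue.get(timeout=5)  # unpickled into a new object
    finally:
        queue.close()
        queue.join_thread()


if __name__ == '__main__':
    action = FakeAction('/bin/true')
    action.sched_id = 0  # added after construction, like in the reactionner
    copy = roundtrip_through_queue(action)
    print(hasattr(action, 'sched_id'), hasattr(copy, 'sched_id'))
```

If the real class behaves like this stand-in, declaring sched_id among the
serialized properties (as was done for 'timeout') would be one way to keep it
across the queue.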

Do you have any idea? Some race condition? I'm running Python 2.6.6
(Debian Squeeze).

Laurent



_______________________________________________
Shinken-devel mailing list
Shinken-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/shinken-devel