Matt,

Thank you very much for this information.  There was a lot here to process, so
I had to read it multiple times to make sure that I understood what you were
doing.


One of the errors that I saw in the osa-dispatcher log was an invalid
password.  I ran the SQL command as suggested and saw that there were in fact
two entries for the rhn-dispatcher: one from the original install a year ago,
and a second one with the FQDN.  I removed both entries, cleaned out the
databases, and restarted jabberd/osa-dispatcher.  I'll keep an eye on things
for a while and see if this helps with my database issues.


Oh, almost forgot... I do not see an entry in /etc/rhn/rhn.conf for
osa_dispatcher.debug.  I added it and stopped/started osa-dispatcher, but I
did not see any additional logging.


Thanks


Daryl

________________________________
From: [email protected] <[email protected]> on 
behalf of Matt Moldvan <[email protected]>
Sent: Friday, August 19, 2016 9:57 AM
To: [email protected]
Subject: Re: [Spacewalk-list] Ongoing jabberd/osad issues.

To answer your question about how you can tell if clients are registering (to 
OSA dispatcher, not jabber), set osa_dispatcher.debug = 4 (goes up to 9 but 
that is -too much- info in my experience) in /etc/rhn/rhn.conf and restart 
osa-dispatcher.  This will tell you if systems are subscribing to the 
dispatcher's presence and are ready to receive actions.
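
For reference, the whole change is just this (stock paths; the option name is
exactly as above):

# /etc/rhn/rhn.conf -- OSA dispatcher verbosity; 4 is plenty, 9 is a firehose
osa_dispatcher.debug = 4

service osa-dispatcher restart

With that set, entries like the ones below start showing up in the
osa-dispatcher log.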

2016/07/28 08:16:28 -05:00 3187 0.0.0.0: osad/jabber_lib._roster_callback('Updating the roster', <iq to='rhn-dispatcher-sat@somefqdn/superclient' type='set' id='d0i9bb6qg2mxcvod2204d697ar81obq2b9cmljd1'><query xmlns='jabber:iq:roster'><item jid='osad-5987ca54b5@somefqdn' subscription='both' /></query></iq>)
2016/07/28 08:16:28 -05:00 3187 0.0.0.0: osad/jabber_lib._presence_callback('rhn-dispatcher-sat@somefqdn/superclient', osad-5987ca54b5@somefqdn, u'subscribed')

For jabber connections, check /var/log/messages for jabberd entries via
syslog.  I had to change rsyslog settings to see all of them: so many came in
when I'd restart the jabber services that rsyslog would start to rate-limit
them.
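
If you hit the same rate limiting, these are the rsyslog knobs I mean; a
minimal sketch assuming rsyslog 5.7 or later (directive names vary by
version, so check yours):

# /etc/rsyslog.conf -- stop imuxsock from dropping bursts of jabberd messages
$SystemLogRateLimitInterval 0
$SystemLogRateLimitBurst 0

service rsyslog restart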

Also, osad should -not- be installed on the masters; it conflicts with the OSA
dispatcher packages...
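
A quick way to check is a plain rpm query; on a master the first package
should come back as not installed:

rpm -q osad osa-dispatcher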

If you have to do it manually, do this.

In the Spacewalk back-end database (execute "spacewalk-sql -i" via sudo or as
root), run "select * from rhnpushdispatcher;" to see what OSA dispatcher
thinks the password is.  Stop jabberd, delete the entries in that table,
delete the Berkeley DB files, then start jabberd and osa-dispatcher.  I have
more about the issue below, but it's a very long and boring story.
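
Put together, the manual sequence looks roughly like this (a sketch assuming
the stock /var/lib/jabberd/db path and SysV service names):

# see what OSA dispatcher thinks the credentials are
echo "select * from rhnpushdispatcher;" | sudo spacewalk-sql -i

# with both services down, clear the stale rows and the Berkeley DB files
sudo service osa-dispatcher stop
sudo service jabberd stop
echo "delete from rhnpushdispatcher;" | sudo spacewalk-sql -i
sudo rm -f /var/lib/jabberd/db/*

# bring everything back up
sudo service jabberd start
sudo service osa-dispatcher start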

--- long story

This is a long story... excuse the rant, but it's caused me a lot of late
nights over the course of implementation.  It's a sort of catharsis to be
able to send this.

So, I've spent the last year and a half trying various methods to keep OSA
dispatcher on our Spacewalk masters stable, and trying to hammer the square
peg that is jabberd2 into our round hole of using F5s for GTMs and LTMs.
Generally, for LDAP services for example, we create an SSL cert signed by our
internal CA with the GTM FQDN as the subject, and any LTM FQDNs and pool
members as Subject Alternative Names.  But the question wasn't really about
that; I just wanted to share my frustration that jabberd2 (s2s specifically)
doesn't play nicely when a system has the GTM FQDN in the jabberd config
(routes get marked as invalid and so on).  Also, the default implementation
in Spacewalk uses Berkeley DB, which is notorious for crashing and corrupting
itself and also doesn't support locking, so good luck trying to put it on an
NFS mount and having multiple proxies write to that default database.

So instead I tried Postgres in the jabber config, which worked well for a
time, but pointing all of our clients to the GTM FQDN and attempting to
schedule actions in the GUI wasn't working out.  When you're in an
environment that follows ITIL processes and patches production systems in
always-varying scheduled and approved change windows, having rhnsd pick up
actions whenever it feels like it (I've seen 24+ hours) just doesn't work.

I had to disable snapshots, because when spacewalk-clone-by-date runs (daily)
and when systems check in with Puppet (ensuring registration daily),
Spacewalk sees that the base channel for thousands of systems has been
updated, attempts to update them sequentially, locks the table, and causes
memory exhaustion on the master and the database server.
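
For anyone who wants to do the same, snapshots are a single rhn.conf toggle
(enable_snapshots is the standard Spacewalk option name, but verify against
your version's defaults):

# /etc/rhn/rhn.conf -- stop writing a snapshot on every channel change
enable_snapshots = 0

spacewalk-service restart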

This week, I gave up on using the proxies for Jabber and pointed all my
clients to the masters in their respective datacenters.  This brought up an
interesting issue: in the OSA dispatcher Python code, an attempt is made to
retrieve the password from the rhnPushDispatcher table in the Spacewalk
database.  Unfortunately, it matches on only part of the Jabber ID, which is
hard-coded to "rhn-dispatcher-sat" in the code.  So if you have two entries,
it will most likely not pull the correct one, forever grab the wrong
password, attempt to use it against Jabber, and then crash OSA dispatcher.
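
You can check whether you're exposed with the same query from above; more
than one row back means the partial-match lookup can land on the wrong
password:

echo "select * from rhnpushdispatcher;" | sudo spacewalk-sql -i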

I personally have two masters (and two OSA dispatchers) because I want
datacenter redundancy in case of an issue in one.  Also, I want to use the
GUI via GTM in global availability mode again, for redundancy.  So having OSA
dispatcher running on both is highly preferable in my situation.
Unfortunately, with the issue I mentioned above about the dispatcher code not
pulling the password properly from rhnPushDispatcher when there are multiple
entries, I had to implement an ugly hack.

Another issue to throw in here was that jabber and OSA dispatcher would
randomly crash.  SM and C2S would segfault at random times, and OSA
dispatcher would error out for various reasons:
  - timeouts when it was loading the rows from "active" and "roster-items"
  - invalidclient errors (I didn't look into these much, but maybe old
versions of the OSAD components on the clients).  Ridiculous that one bad
client could cause the whole dispatcher process to bomb out, though
  - SELinux errors that weren't covered by running
osa-dispatcher-selinux-enable (I had to run audit2allow and semodule more
times than I'm proud of, a few times in a while loop because fixing one
SELinux denial would lead to another on the next line of the code; a sketch
of that loop follows)
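
That loop was roughly the following; a hedged sketch rather than my exact
commands (the module name osad_local is made up, and you should eyeball each
generated .te file before loading it):

# regenerate and load local policy until osa-dispatcher stops tripping AVC denials
while ausearch -m avc -ts recent | grep -q dispatcher; do
    ausearch -m avc -ts recent | audit2allow -M osad_local   # writes osad_local.te / osad_local.pp
    semodule -i osad_local.pp
    service osa-dispatcher restart
    sleep 10
done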

Below is the script I mentioned, the hack to get around OSA dispatcher's
laziness in using a "like" statement to pull the password from
rhnPushDispatcher.  It's ugly, but the dispatcher daemon has stayed up for
almost 24 hours now (I consider this a win after all my other issues and my
year-and-a-half-long struggle with the software) and we have ~5.5k systems
online with it.  It's in cron, running every 2 minutes to check on both the
jabberd and osa-dispatcher services.
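
The cron entry itself is nothing special; something like this in /etc/crontab
(the redirection is just to keep cron mail quiet):

*/2 * * * * root /usr/local/bin/fixjabber.sh >/dev/null 2>&1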

Note that the SQL file reflects my having changed the jabber config on the
masters to use SQLite (another attempt in the long list of things I tried to
keep things stable).  A similar method would most likely work with other
backends, but the fixjabber.sql file below would need updating.
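
If you're not sure which storage backend your jabberd is on, the sm.xml
config will tell you (standard jabberd2 config location assumed):

grep -A1 '<driver>' /etc/jabberd/sm.xml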

I don't like the idea of deleting the entire Jabber database every time.  I
think it's a cop-out, and it requires all systems to recreate entries in
multiple tables of that database, whether it's Berkeley DB or Postgres or
whatever.

---

[me@ourmaster1 ~]$ cat /var/log/fixjabber.sh.out
osa-dispatcher and jabberd have been restarted 0 times since Thu Aug 18 11:27:59 CDT 2016
[me@ourmaster2 ~]$ cat /var/log/fixjabber.sh.out
osa-dispatcher and jabberd have been restarted 0 times since Thu Aug 18 11:27:01 CDT 2016

[me@ourmaster1 ~]$ sudo cat /usr/local/bin/fixjabber.sh
#!/bin/bash
PATH=/sbin:/usr/bin:/bin
LOGFILE=/var/log/$(basename $0).out
# seed the restart counter on first run
if [ ! -f "${LOGFILE}" ]; then
        echo "osa-dispatcher and jabberd have been restarted 0 times since $(date)" >> "${LOGFILE}"
fi
# if either service is down, wipe the stale dispatcher credentials and restart both
service jabberd status && service osa-dispatcher status
if [ $? -ne 0 ]; then
        service osa-dispatcher stop
        service jabberd stop
        # drop this master's dispatcher row so osa-dispatcher re-registers cleanly
        echo "delete from rhnpushdispatcher where jabber_id='rhn-dispatcher-sat@$(uname -n)/superclient';" | spacewalk-sql -i
        # purge the matching jabberd-side records (fixjabber.sql below)
        sqlite3 /var/lib/jabberd/db/sqlite.db < /usr/local/etc/fixjabber.sql
        service jabberd start
        service osa-dispatcher start
        # bump the restart counter in the log line (field 7 is the count)
        oldnum=$(cut -d ' ' -f7 ${LOGFILE})
        newnum=$(expr $oldnum + 1)
        sed -i "s/$oldnum/$newnum/g" ${LOGFILE}
fi

[me@ourmaster1 ~]$ cat /usr/local/etc/fixjabber.sql
delete from authreg where username = 'rhn-dispatcher-sat';
delete from "roster-items" where "collection-owner" = '[email protected]';
delete from status where "collection-owner" = '[email protected]';
delete from active where "collection-owner" = '[email protected]';



On Fri, Aug 19, 2016 at 8:34 AM Daryl Rose <[email protected]> wrote:

Basically you're doing everything that I've been doing with the exception of 
the db_recover command.  I was not familiar with that command.


How can I tell if the clients are self registering or not?


Thank you.


Daryl


________________________________
From: Robert Paschedag <[email protected]>
Sent: Friday, August 19, 2016 3:42 AM
To: Daryl Rose
Cc: [email protected]<mailto:[email protected]>
Subject: Re: [Spacewalk-list] Ongoing jabberd/osad issues.

Now I also had a problem with the database.  I just wanted to check which
logfiles of the jabber db are no longer needed, as stated in
http://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/transapp/logfile.html

Just running "db_archive" killed my jabber db. And....fixing it with

db_recover -v or
db_recover -c

within /var/lib/jabber/db

did not work.

So I was also in the situation of having to "clean" the database.

1. Stop jabberd (/etc/init.d/jabberd stop)
2. Stop osa-dispatcher (/etc/init.d/osa-dispatcher stop)
3. Remove contents of /var/lib/jabber/db (rm -f /var/lib/jabber/db/*)
4. Start jabber (/etc/init.d/jabberd start)
5. Start osa-dispatcher (/etc/init.d/osa-dispatcher start)

I thought I should restart the osad client everywhere... but no, the
clients just re-register themselves automatically. Of course, I have to
check this on every client, but what I have checked so far is looking
good.

Regards
Robert


On 18.08.2016 at 09:30, Robert Paschedag wrote:
> Hi Daryl,
>
> as long as there are no error messages within the logs indicating an error
> with the jabber db, I wouldn't do anything with the db.
>
> As I wrote earlier, I only had to repair the db once in about 3 1/2 years.
>
> So, what I would do now is really delete the jabber db (back it up... just
> in case) to start with a "clean" install. If the clients (that already
> have authentication information) do not re-register automatically, you should
> go to the client, stop osad, remove /etc/sysconfig/rhn/osad-auth.conf and
> start osad again. The client should then register and you should see its
> status on the web GUI as "online". If not, check /var/log/rhn/osad.log on the
> client (if I remember correctly) and the osa-dispatcher logs on the
> server.
>
> I also wrote that my Spacewalk servers are NOT clients of themselves. I
> don't think that should be a problem, but just for "testing" you should
> deactivate the osad "client" on the Spacewalk server.
>
> Start with one test server.
>
> Good luck.
>
> Regards
> Robert
> On 17.08.2016 at 20:43, Daryl Rose <[email protected]> wrote:
>>
>> I've posted issues that I've had with jabberd and osad here, as have others.
>> But I haven't gotten things resolved, so I am posting additional
>> information.
>>
>>
>> I put SW into production about a year ago.  After a period of time, I
>> noticed issues with the WUI, servers not reporting correctly, and other
>> issues.  Google searches showed that I needed to shut down Spacewalk and
>> remove all the contents of /var/lib/jabberd/db.  This seemed to work, but
>> after a few months, I realized that osad was no longer communicating with
>> osa-dispatcher.
>>
>>
>> I started doing some additional research and learned that this was not a
>> good way to resolve the issue.  According to the official Spacewalk
>> documentation, I should create a checkpoint and then clean up log files
>> while keeping the database and auth database files.
>>
>>
>> https://fedorahosted.org/spacewalk/wiki/JabberDatabase
>>
>> These are the steps that I followed:
>>
>>
>> /usr/bin/db_checkpoint -1 -h /var/lib/jabberd/db/ ## mark logs for deletion
>> /usr/bin/db_archive -d -h /var/lib/jabberd/db/  ## delete logs
>> service jabberd restart
>>
>> However, this also causes problems with jabberd and osad.  If I use the
>> commands as the documentation instructs, then osa-dispatcher will start but
>> then die, and I get errors in the log saying there is an invalid password.
>>
>>
>> So to help explain my issue, I ran a test and tried to capture everything
>> that I could; I'll post it here.
>>
>>
>> 1. Listing of /var/lib/jabberd/db
>>
>> [root@<spwalk-server> db]# ls
>> __db.001  __db.006        log.0000000004  log.0000000009  log.0000000014  log.0000000019  log.0000000024  sm.db
>> __db.002  authreg.db      log.0000000005  log.0000000010  log.0000000015  log.0000000020  log.0000000025
>> __db.003  log.0000000001  log.0000000006  log.0000000011  log.0000000016  log.0000000021  log.0000000026
>> __db.004  log.0000000002  log.0000000007  log.0000000012  log.0000000017  log.0000000022  log.0000000027
>> __db.005  log.0000000003  log.0000000008  log.0000000013  log.0000000018  log.0000000023  log.0000000028
>>
>> 2. Spacewalk Server Status
>>
>> [root@<spwalk-server> db]# spacewalk-service status
>> postmaster (pid  1175) is running...
>> router (pid 21431) is running...
>> sm (pid 21441) is running...
>> c2s (pid 21451) is running...
>> s2s (pid 21461) is running...
>> tomcat6 (pid 1304) is running...                           [  OK  ]
>> httpd (pid  1385) is running...
>> osa-dispatcher (pid  21479) is running...
>> rhn-search is running (1441).
>> cobblerd (pid 1491) is running...
>> RHN Taskomatic is running (1515).
>>
>> 3.  Most recent log file entry:
>>
>> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0: osad/jabber_lib.__init__
>> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0: osad/jabber_lib.setup_connection('Connected to jabber server', '<spwalk-server>.com')
>> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0: osad/osa_dispatcher.fix_connection('Upstream notification server started on port', 1290)
>> 2016/08/17 07:44:14 -05:00 21476 0.0.0.0: osad/jabber_lib.process_forever
>>
>> 4.  Ran the commands as instructed in the jabberd documentation.
>>
>> /usr/bin/db_checkpoint -1 -h /var/lib/jabberd/db/ ## mark logs for deletion
>> /usr/bin/db_archive -d -h /var/lib/jabberd/db/  ## delete logs
>> service jabberd restart
>>
>> 5.  Log file entry:
>>
>> 2016/08/17 13:28:19 -05:00 21476 0.0.0.0: osad/jabber_lib.main('ERROR',
>> 'Traceback (most recent call last):\n  File "/usr/share/rhn/osad/jabber_lib.py",
>> line 121, in main\n    self.process_forever(c)\n  File
>> "/usr/share/rhn/osad/jabber_lib.py", line 179, in process_forever\n
>> self.process_once(client)\n  File "/usr/share/rhn/osad/osa_dispatcher.py",
>> line 187, in process_once\n    client.retrieve_roster()\n  File
>> "/usr/share/rhn/osad/jabber_lib.py", line 729, in retrieve_roster\n    stanza =
>> self.get_one_stanza()\n  File "/usr/share/rhn/osad/jabber_lib.py", line 801,
>> in get_one_stanza\n    self.process(timeout=tm)\n  File
>> "/usr/share/rhn/osad/jabber_lib.py", line 1055, in process\n    data =
>> self._read(self.BLOCK_SIZE)\nSSLError: (\'OpenSSL error; will retry\', "(-1,
>> \'Unexpected EOF\')")\n')
>> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0: osad/jabber_lib.__init__
>> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0: osad/jabber_lib.setup_connection('Connected
>> to jabber server', '<spwalk-server>.com')
>> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0: osad/jabber_lib.register('ERROR',
>> 'Invalid password')
>>
>> 6.  Spacewalk server status
>>
>> [root@<spwalk-server> db]# spacewalk-service status
>> postmaster (pid  1175) is running...
>> router (pid 27119) is running...
>> sm (pid 27129) is running...
>> c2s (pid 27139) is running...
>> s2s (pid 27149) is running...
>> tomcat6 (pid 1304) is running...                           [  OK  ]
>> httpd (pid  1385) is running...
>> osa-dispatcher dead but pid file exists
>> rhn-search is running (1441).
>> cobblerd (pid 1491) is running...
>> RHN Taskomatic is running (1515).
>>
>> 7. Long listing of /var/lib/jabberd/db
>>
>> [root@<spwalk-server> db]# ls -l
>> total 7536
>> -rw-r-----. 1 jabber jabber    24576 Aug 17 13:28 __db.001
>> -rw-r-----. 1 jabber jabber   204800 Aug 17 13:29 __db.002
>> -rw-r-----. 1 jabber jabber   270336 Aug 17 13:29 __db.003
>> -rw-r-----. 1 jabber jabber    98304 Aug 17 13:29 __db.004
>> -rw-r-----. 1 jabber jabber   753664 Aug 17 13:29 __db.005
>> -rw-r-----. 1 jabber jabber    57344 Aug 17 13:29 __db.006
>> -rw-r-----. 1 jabber jabber   368640 Aug 17 07:46 authreg.db
>> -rw-r-----. 1 jabber jabber 10485760 Aug 17 13:29 log.0000000031
>> -rw-r-----. 1 jabber jabber   487424 Aug 17 13:29 sm.db
>>
>> So neither completely cleaning out the jabberd database/log files nor
>> creating a checkpoint and removing the stale log files works.  What can I
>> do to get jabberd and osad working, so that I can push out updates when I
>> need to push them out?
>>
>>
>> Thank you.
>>
>>
>> Daryl
>>
>