Thanks Steve,
Unfortunately, since this is a very busy production server (even a
couple hours of slony downtime takes a while to catch up from, so we
didn't want to prolong it any further), we didn't have much time to
troubleshoot. I did not grab the slon daemon logs and back them up like
I should have; I only looked at them and noticed events being
processed, but didn't note whether they came from the worker or
listener threads. We restored the original slave cluster's data
directory, fired it all up, and slony resumed just fine.
We did not run slonik REPAIR CONFIG, but now that we know about it I'd
imagine that would fix the problem: the pg_dump didn't preserve OIDs,
so the restored tables would have gotten new OIDs and Slony's stored
name-to-OID mapping would be stale. Something like the script below is
what we'd try.
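A rough sketch only -- the cluster name, node IDs, and conninfo are
placeholders for our setup, and the exact options should be
double-checked against the stmtrepairconfig page Steve linked:

    cluster name = slony;
    node 1 admin conninfo = 'dbname=proddb host=master-host user=slony';
    node 2 admin conninfo = 'dbname=proddb host=slave-host user=slony';

    # rebuild the table name-to-OID mapping for set 1 on the
    # restored slave (node 2 here is assumed to be the slave)
    repair config (set id = 1, event node = 2);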
For now we're running on the old cluster. We'll eventually try again,
but only after setting up a few test clusters, seeing whether we can
reproduce the "problem", and then fixing it. I'll report the results
when we do.
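When we get to that, queries roughly like these (assuming the master is
node 1; adjust for the actual node IDs) should answer Steve's question
about whether confirmations from the slave are making it back to the
master:

    -- on the master: latest event generated vs. latest confirmation received back
    select max(ev_seqno) from _slony.sl_event where ev_origin = 1;
    select con_received, max(con_seqno) as last_confirmed
      from _slony.sl_confirm
     where con_origin = 1
     group by con_received;

    -- and the backlog that kept growing last time
    select count(*) from _slony.sl_log_1;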
- Brian
Steve Singer wrote:
> On Thu, 16 Sep 2010, Brian Fehrle wrote:
>
> What are your slon processes logging/printing? Are the remoteWorker
> threads actually processing events, or is the remoteListener the only
> thread logging? Are the events in sl_event being marked as confirmed
> in sl_confirm on the slave? Is the data making it back to sl_confirm
> on the master? (I suspect not, otherwise sl_log_1 wouldn't keep
> growing.)
>
> Also remember to run REPAIR CONFIG
> (http://www.slony.info/documentation/stmtrepairconfig.html) after
> restoring from a pg_dump.
>
>
>
>
>> Hi all,
>> We realized that our 1 master -> 1 slave slony cluster had different
>> encodings on each box, and we attempted to fix that. Our master had an
>> encoding of LATIN1 and our slave SQL_ASCII (they were initialized so
>> long ago that we don't know who did it or why it was done that way).
>> Slony worked with this setup, but due to some other problems we wanted
>> to fix it by moving the slave from SQL_ASCII to LATIN1.
>>
>> So we brought down the slon daemons, brought down the slave database,
>> and rebooted the physical machine the slave is on (we had commented out
>> dozens of cron jobs and wanted to verify they were all dead).
>>
>> When we rebooted the machine, we brought the slave postgres cluster
>> online and performed a pg_dump on the entire database (including the
>> _slony schema). Then we brought down the postgres cluster, ran initdb to
>> create a new one with LATIN1 encoding, brought the new cluster online,
>> and ran a pg_restore on it with the dump file we created before.
>>
>> After that we restarted our cron jobs, which also started up the two
>> slon daemons. We started monitoring the slave and noticed that no
>> updates were being applied. We're running the slon daemons with -s 60000
>> (force a sync every 60 seconds) and the -x flag to generate slony logs
>> for log shipping. The slony logs generated with -x are empty (they have
>> the slony header and footer, but no insert data).
>>
>> On the master, if I do a # select * from _slony.sl_status; I get
>> back anywhere between 0 and 2 events, and a lag time no greater than
>> 3 minutes. Monitoring the slave slony log output also verifies that
>> events are being received and processed without error every minute.
>>
>> Again, on the master, # select count(*) from _slony.sl_log_1;
>> returns 12,000+ rows, and it continually grows. So from what I can
>> tell, the master is getting events queued up but not pushing the data
>> for those events to the slave: each event arrives completely void of
>> data, and sl_log_1 just keeps building up.
>>
>> One theory is that even though we have an exact data dump of the old
>> slave cluster restored to the new slave cluster, the encoding change
>> means the master perhaps doesn't recognize the slave as the same slave
>> it had before. If that's the case, is there any way we can get it to
>> recognize it without having to rebuild the slony cluster? (Rebuilding
>> the cluster would mean a few days of work, if not weeks.)
>>
>> Other than that, I'm unsure what to make of this. I've restarted the
>> daemons, and neither the master nor the slave daemon reports any errors
>> in the logs. I verified that the triggers exist on the master as they
>> should (we never touched the master anyway, but we're still checking
>> everything), and the path to the slave remained the same as the
>> previous slave (same dbname, host, port, user).
>>
>> Any thoughts or things I can check would be appreciated. Or, if my
>> theory about the master not recognizing the new slave cluster as the
>> old one is correct, a way to fix that would be great.
>>
>> thanks in advance,
>> Brian F
>
_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general