Jim Davis wrote:

I am using the Appia configuration for group communication. I was having issues with JGroups once I started really loading the system with a lot of transactions.

I am seeing the deadlock detection routine run after the system has been sitting with a hung write for a while. If I dump the backend schema/locks, I can see transactions holding locks on tables. I will have to go back and track which tables are being locked, compare them against the table with the hung write, and see if there is a connection.
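
In case it helps anyone chasing the same thing, here is a small standalone JDBC sketch of that check, run directly against one PostgreSQL backend rather than through the controller (the connection URL and credentials are placeholders):

import java.sql.*;

// List which tables currently hold locks on one backend, so they can be
// compared against the table named in the hung write.
public class LockCheck {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                "jdbc:postgresql://backend1:5432/mydb", "user", "password");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                 "SELECT cl.relname, l.mode, l.granted, l.pid" +
                 "  FROM pg_locks l JOIN pg_class cl ON l.relation = cl.oid" +
                 " WHERE l.relation IS NOT NULL ORDER BY cl.relname")) {
            while (rs.next()) {
                System.out.printf("%-30s %-20s granted=%-5s pid=%d%n",
                        rs.getString("relname"), rs.getString("mode"),
                        rs.getBoolean("granted"), rs.getInt("pid"));
            }
        }
    }
}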

I still have the developer digging into why we are getting duplicate keys. We are running our application on a JBoss 4.0.4.GA application server in a two-node cluster configuration. From my discussions with our developers, during an ingest workflow it is possible for different nodes to perform different phases of the workflow. Because they use JMS messaging and serialize the transactions, they do not expect two nodes to receive the same phase of a transaction. I have read on the JBoss forums that, during server restarts, JBoss has been observed to sometimes replay already-processed JMS messages. But since these duplicate keys and the subsequent hung writes occur during normal operation, I cannot say it is a JBoss behavior at this point.

On Thursday, I brought our test system down, flushed all of our databases, and upgraded to the Jan. 6 nightly build of Sequoia 2.10.4. I restarted the test system and verified it was empty. This morning I was able to run a small 30-submission ingest test with no problems. An hour later, I started a 450-submission ingest test and hit a hung write after our system issued an insert that resulted in a duplicate key violation.

From a system engineering perspective, I know that if I bring the JBoss cluster down, shut down all of my Sequoia controllers, and bring them back up by restoring and allowing the backend enable operation to replay, I can recover. When I restart the first JBoss node, it will either roll back or continue processing anything in the JMS message queue. I can then resume the ingest test, and it will run for anywhere from minutes to hours; the duplicate key problems show no consistent pattern.
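
One idea I want to run by the developers while they keep digging is sketched below. It is only an illustration: the table, column, and class names are made up, and all it does is treat a PostgreSQL unique violation (SQLState 23505) as "another node already recorded this phase", which is the behavior we would want if a JMS message were ever replayed.

import java.sql.*;

// Illustration only: record an ingest phase at most once, treating a duplicate
// key violation as "another node already recorded this phase".
// Table and column names are hypothetical.
public class IngestPhaseRecorder {

    /** Returns true if this node recorded the phase, false if it already existed. */
    public static boolean recordPhase(Connection conn, long submissionId, String phase)
            throws SQLException {
        String sql = "INSERT INTO ingest_phase (submission_id, phase) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, submissionId);
            ps.setString(2, phase);
            ps.executeUpdate();
            return true;
        } catch (SQLException e) {
            if ("23505".equals(e.getSQLState())) {   // unique_violation in PostgreSQL
                return false;
            }
            throw e;                                  // anything else is a real error
        }
    }
}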

As I am not a Java developer, my knowledge comes from repeated trial and error. Because my developers cannot reproduce this problem in their development environment, where they go straight to a PostgreSQL data server instead of through Sequoia, I am left to my own devices to find a working solution if I am to keep my Sequoia configuration for our new production system. I am willing to try anything to get past this problem. I have DEBUG turned on everywhere... I have my virtualdatabase configuration set to a 30 second idle timeout and a 10 second wait.

I would appreciate any suggestions you could send my way for how to proceed.

Thank you for your time and efforts,

Jim D.


Emmanuel Cecchet wrote:

Hi Jim,

I am seeing failed writes to a PostgreSQL database backend remain in the write queue on the controller. The duplicate key error message for the corresponding write only appears on one of the 3 controllers, but the two sister controllers have the same request id 10577 in the scheduler queue, along with any other write requests that arrived after request 10577. Is this normal behavior? How can I clear a failed write from the controller's write queue? My three controllers basically just start queuing any additional writes once the duplicate key write occurs. Any assistance with resolving this issue would be greatly appreciated.

From the log you attached, I understand that the query was issued on the first controller (where it failed) but is still pending on the 2 other controllers. This is why it still shows as 'pending': it has to wait for the results from the other controllers to decide whether this was a real failure (all controllers fail) or whether only the local controller failed (in which case its local backends are disabled and we continue with the other controllers).
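
In other words, the decision is roughly the following. This is only a simplified sketch of the logic described above, not the actual Sequoia code:

// Simplified sketch of the failure-handling decision, not the real implementation.
// A write stays 'pending' until every controller has reported its result.
public class WriteOutcomeSketch {

    enum Decision { RETURN_ERROR_TO_CLIENT, DISABLE_LOCAL_BACKENDS_AND_CONTINUE, SUCCESS }

    static Decision decide(boolean localFailed, boolean allControllersFailed) {
        if (allControllersFailed) {
            return Decision.RETURN_ERROR_TO_CLIENT;              // the failure is real everywhere
        }
        if (localFailed) {
            return Decision.DISABLE_LOCAL_BACKENDS_AND_CONTINUE; // only this controller failed
        }
        return Decision.SUCCESS;
    }

    public static void main(String[] args) {
        // Example: the local controller failed but not all controllers did.
        System.out.println(decide(true, false));
    }
}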

2nd controller where no duplicate key error is recorded but the request is queued:

ANGe(admin) > dump scheduler queues
Active transactions: 7
        Transaction id list: 3800 3802 3803 3804 3805 3806 3807
Pending write requests: 6
        Write request id list: 10586 10593 10591 10581 10587 10577

3rd controller where no duplicate key error is recorded but the request is queued:

ANGe(admin) > dump scheduler queues
Active transactions: 8
        Transaction id list: 3703 3800 3802 3803 3804 3805 3806 3807
Pending write requests: 6
        Write request id list: 10586 10593 10591 10581 10587 10577

Any suggestions on how I can recover when this happens?

What puzzles me is the old transaction 3703 that remains open on the 3rd controller. I have no idea where it could come from, since if it were a read-only transaction it would have executed on the first controller (given its id).

Another possible cause is a problem with the group communication. Which one are you using?

Something else to investigate is potential query indeterminism. This can happen with multi-table updates or updates with subselects. In such cases, strict table locking might be needed. Was this duplicate key exception something you expected?
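
To make the indeterminism point concrete, here is the kind of statement I have in mind (the table and column names are invented for the example, and the connection URL is a placeholder). The SQL text is sent as-is to every backend and each backend evaluates it independently, so if the result can differ from one backend to another, the replicas diverge and you can later get exactly this kind of duplicate key error:

import java.sql.*;

// Illustration of write statements whose results can differ between backends.
public class NonDeterministicWrites {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                "jdbc:sequoia://controller1:25322/mydb", "vuser", "vpass");
             Statement s = c.createStatement()) {
            // now() is evaluated separately by each PostgreSQL backend,
            // so the stored timestamps can differ between replicas.
            s.executeUpdate("UPDATE submission SET processed_at = now() WHERE id = 42");
            // A subselect with LIMIT and no ORDER BY can pick different rows
            // on each backend, so the updated rows can differ between replicas.
            s.executeUpdate("UPDATE submission SET batch_id = 7 WHERE id IN " +
                            "(SELECT id FROM submission WHERE batch_id IS NULL LIMIT 10)");
        }
    }
}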

Thanks for your feedback,
Emmanuel


