Jim Davis wrote:
I am using the Appia configuration for group communications. I was
having issues with JGroups when I started really loading the system
with a lot of transactions.
I am seeing the deadlock detection routine run after the system has
been sitting with a hung write for a while. If I dump the backend
schema with /locks I can see transactions holding locks on tables. I
will have to go back and start tracking which tables are being locked,
compare them to the table with the hung write, and see if there is a
connection.
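A minimal sketch of how that lock tracking could be done directly
against one of the PostgreSQL backends, bypassing Sequoia so the
middleware state does not get in the way (the connection URL,
credentials, and class name are placeholders):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class BackendLockDump {
      public static void main(String[] args) throws Exception {
          // Placeholder URL/credentials: point this at one PostgreSQL backend directly.
          Connection conn = DriverManager.getConnection(
                  "jdbc:postgresql://backend-host:5432/ingestdb", "dbuser", "dbpass");
          Statement st = conn.createStatement();
          // pg_locks lists every lock currently held or awaited; joining pg_class
          // turns the relation OID into a table name that can be compared with
          // the table named in the hung write.
          ResultSet rs = st.executeQuery(
                  "SELECT c.relname, l.mode, l.granted, l.pid "
                  + "FROM pg_locks l JOIN pg_class c ON c.oid = l.relation "
                  + "ORDER BY c.relname");
          while (rs.next()) {
              System.out.println(rs.getString("relname") + "  " + rs.getString("mode")
                      + "  granted=" + rs.getBoolean("granted")
                      + "  pid=" + rs.getInt("pid"));
          }
          rs.close();
          st.close();
          conn.close();
      }
  }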
I still have the developer digging into why we are getting duplicate
keys. We are running our application on a JBoss 4.0.4.GA application
server in a two-node cluster configuration.
From my discussions with our developers, during an ingest workflow it
is possible for different nodes to perform different phases of the
workflow. Because they use JMS messaging and serialize the
transactions, they do not expect two nodes to ever receive the same
phase of a transaction. I have read on JBoss forums that JBoss has
been observed to sometimes replay already-processed JMS messages
during server restarts. But since these duplicate keys and subsequent
hung writes occur during normal operations, I cannot say it is a JBoss
behavior at this point.
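One way to make the ingest phases tolerant of an occasional replayed
message, wherever the replay comes from, would be to treat a
duplicate-key failure on the insert as "already processed". A minimal
sketch, assuming a PostgreSQL backend (SQLState 23505 is the standard
unique-violation code) and an illustrative submissions table:

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.SQLException;

  public class IdempotentInsert {
      private static final String UNIQUE_VIOLATION = "23505"; // SQLState for duplicate key

      // Returns true if the row was inserted, false if it already existed
      // (e.g. because the triggering JMS message was replayed). Table and
      // column names are illustrative only.
      public static boolean insertSubmission(Connection conn, long submissionId, String payload)
              throws SQLException {
          PreparedStatement ps = conn.prepareStatement(
                  "INSERT INTO submissions (submission_id, payload) VALUES (?, ?)");
          try {
              ps.setLong(1, submissionId);
              ps.setString(2, payload);
              ps.executeUpdate();
              return true;
          } catch (SQLException e) {
              if (UNIQUE_VIOLATION.equals(e.getSQLState())) {
                  return false; // row already there: treat as a duplicate delivery and move on
              }
              throw e; // any other failure is a real error
          } finally {
              ps.close();
          }
      }
  }

Whether that pattern is safe depends on whether the rest of the phase
is also idempotent, so it is only a direction to discuss with the
developers.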
On Thursday, I brought our test system down, flushed all of our
databases, and upgraded to the Jan. 6 nightly build of Sequoia 2.10.4.
I restarted the test system and verified it was an empty system. This
morning I was able to run a small 30-submission ingest test with no
problems. An hour later, I started a 450-submission ingest test and
hit a hung write after our system issued an insert that resulted in a
duplicate key violation. From a system engineering perspective, I know
that if I bring the JBoss cluster down, shut down all of my Sequoia
controllers, and bring them back up by restoring and allowing the
backends to replay as they are enabled, I can recover.
When I restart the first JBoss node, it will either roll back or
continue processing anything in the JMS message queue. I can then
resume the ingest test and process for either minutes or hours; the
duplicate key problems show no consistent pattern.
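If JMS replay is a suspect, one low-effort check would be to log
whenever the provider marks a message as redelivered. A minimal sketch
using the standard javax.jms API; IngestPhaseProcessor is a
hypothetical stand-in for the application's real handler:

  import javax.jms.JMSException;
  import javax.jms.Message;
  import javax.jms.MessageListener;

  public class RedeliveryAwareListener implements MessageListener {
      private final IngestPhaseProcessor processor = new IngestPhaseProcessor();

      public void onMessage(Message message) {
          try {
              if (message.getJMSRedelivered()) {
                  // The provider has delivered this message before; if duplicate
                  // keys correlate with these log lines, replay is the culprit.
                  System.out.println("Redelivered JMS message: " + message.getJMSMessageID());
              }
              processor.process(message);
          } catch (JMSException e) {
              throw new RuntimeException(e);
          }
      }

      // Hypothetical placeholder so the sketch compiles on its own.
      static class IngestPhaseProcessor {
          void process(Message m) { /* application-specific work */ }
      }
  }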
As I am not a Java developer, my knowledge comes from repeated trial
and error. Because my developers cannot reproduce this problem in
their development environment, where they go straight to a Postgres
data server instead of through Sequoia, I am left to my own devices to
find a working solution if I am to keep my Sequoia configuration for
our new production system. I am willing to try anything to get past
this problem. I have DEBUG turned on everywhere... I have my
virtualdatabase configuration set to a 30 second idle timeout and a
10 second wait.
I would appreciate any suggestions you could send my way on how to
proceed.
Thank you for your time and efforts,
Jim D.
Emmanuel Cecchet wrote:
Hi Jim,
I am seeing failed writes to a PostgreSQL database backend remain in
the write queue on the controller. The duplicate key error message
for the corresponding
write only appears on one of the 3 controllers. But the two sister
controllers have the same request id 10577 in the scheduler queue
along with any other
write requests which arrived after the 10577 request. Is this
normal behavior? How can I clear a failed write from the
controller's write queue? My three controllers
basically just start queueing any additional writes after the
duplicate key write occurs. Any assistance with resolving this issue
would be greatly appreciated.
From the log you attached, I understand that the query was issued on
the first controller (where it failed) but it is still pending on the
2 other controllers. This is why it still shows as 'pending': it has
to wait for the results of the other controllers to decide whether
that was a real failure (all controllers fail) or whether only the
local controller failed (in which case its local backends are disabled
and we continue with the other controllers).
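A minimal sketch of that decision rule, with hypothetical names (the
real Sequoia classes differ); it only illustrates why a request stays
'pending' until every controller has answered:

  import java.util.Collection;

  public class FailureResolutionSketch {
      enum Outcome { SUCCESS, FAILURE, PENDING }

      // One outcome per controller for the same distributed write request.
      static Outcome resolve(Collection<Outcome> perController) {
          boolean allFailed = true;
          for (Outcome o : perController) {
              if (o == Outcome.PENDING) {
                  // Some controller has not answered yet: the request stays
                  // queued and shows up as 'pending' in the scheduler dump.
                  return Outcome.PENDING;
              }
              if (o != Outcome.FAILURE) {
                  allFailed = false;
              }
          }
          if (allFailed) {
              return Outcome.FAILURE; // every controller failed: a real failure for the client
          }
          // Only some controllers failed: their local backends get disabled
          // (not shown here) and the successful result is kept.
          return Outcome.SUCCESS;
      }
  }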
2nd controller, where no duplicate key error is recorded but the
request is queued:
ANGe(admin) > dump scheduler queues
Active transactions: 7
Transaction id list: 3800 3802 3803 3804 3805 3806 3807
Pending write requests: 6
Write request id list: 10586 10593 10591 10581 10587 10577
3rd controller, where no duplicate error is recorded but the request
is queued:
ANGe(admin) > dump scheduler queues
Active transactions: 8
Transaction id list: 3703 3800 3802 3803 3804 3805 3806 3807
Pending write requests: 6
Write request id list: 10586 10593 10591 10581 10587 10577
Any suggestions on how I can recover when this happens?
What puzzles me is the old transaction 3703 that remains open on the
3rd controller. I have no idea where this could come from, since if it
were a read-only transaction it would have executed on the first
controller (given its id).
Another reason could be a problem with the group communication. Which
one are you using?
Something else to investigate is potential query indeterminism. This
can happen with multi-table updates or updates with subselects. In
such cases, strict table locking might be needed. Was this duplicate
key exception something you expected?
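To illustrate the kind of indeterminism that matters here: if the key
is computed inside the statement itself (for example with a
subselect), each backend evaluates it independently, and backends
whose data has drifted can compute different values; computing the
value once on the client and binding it as a parameter keeps every
backend executing the identical statement. A minimal sketch with
illustrative table and column names:

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.SQLException;

  public class DeterministicInsert {

      // Risky in a replicated setup: each backend runs the subselect on its own,
      // so two backends whose contents differ can pick different keys.
      static void nonDeterministicInsert(Connection conn, String payload) throws SQLException {
          PreparedStatement ps = conn.prepareStatement(
                  "INSERT INTO submissions (submission_id, payload) "
                  + "SELECT COALESCE(MAX(submission_id), 0) + 1, ? FROM submissions");
          ps.setString(1, payload);
          ps.executeUpdate();
          ps.close();
      }

      // Safer: decide the key once, then send the same literal values to every
      // backend as bound parameters.
      static void deterministicInsert(Connection conn, long submissionId, String payload)
              throws SQLException {
          PreparedStatement ps = conn.prepareStatement(
                  "INSERT INTO submissions (submission_id, payload) VALUES (?, ?)");
          ps.setLong(1, submissionId);
          ps.setString(2, payload);
          ps.executeUpdate();
          ps.close();
      }
  }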
Thanks for your feedback,
Emmanuel