Hello Ivan, Thank you very much for the response.
Due to the nature of the problem, I don't think I have the ability to provide the data. I did go back to the event.log to locate the flow prior to the first error. More information: * There are multiple threads doing updates on the database. * The number of threads is limited to 10. * An update is occurring prior to AnalyzePopularFailure. * Update logic: index-scan for an id, if the id exist revise the xml by doing an 'update replace', else 'update insert' the new xml to the doc. * The following is a sanitized event.log prior to the failure. LOG 19/08/2011 21:50:55 (TRN xrdb pid=752 sid=47 trid=-1) [tr_common_funcs.cpp:on_transaction_end:406]: Transaction has been COMMITED LOG 19/08/2011 21:50:55 (TRN xrdb pid=752 sid=47 trid=46) [tr_functions.cpp:on_user_statement_begin:103]: User query: --- update insert <JournalEntry>..redacted..into doc('JournalEntry.xml')/JournalEntries LOG 19/08/2011 21:50:55 (TRN xrdb pid=752 sid=47 trid=-1) [tr_common_funcs.cpp:on_transaction_end:406]: Transaction has been COMMITED LOG 19/08/2011 21:50:55 (TRN xrdb pid=752 sid=47 trid=46) [tr_functions.cpp:on_user_statement_begin:103]: User query: --- index-scan("JournalEntrysIndex", "2EFE64-1313778104", "EQ") LOG 19/08/2011 21:50:55 (TRN xrdb pid=752 sid=47 trid=-1) [tr_common_funcs.cpp:on_transaction_end:406]: Transaction has been COMMITED LOG 19/08/2011 21:50:55 (TRN xrdb pid=752 sid=47 trid=46) [tr_functions.cpp:on_user_statement_begin:103]: User query: --- update insert <JournalEntry>..redacted..</JournalEntry> into doc('JournalEntry.xml')/JournalEntries LOG 19/08/2011 21:50:55 (TRN xrdb pid=782 sid=46 trid=-1) [tr_common_funcs.cpp:on_transaction_end:406]: Transaction has been COMMITED LOG 19/08/2011 21:50:55 (TRN xrdb pid=782 sid=46 trid=49) [tr_functions.cpp:on_user_statement_begin:103]: User query: LOG 19/08/2011 21:50:56 (TRN xrdb pid=752 sid=47 trid=-1) [tr_common_funcs.cpp:on_transaction_end:406]: Transaction has been COMMITED LOG 19/08/2011 21:50:56 (TRN xrdb pid=776 sid=48 trid=-1) [tr_common_funcs.cpp:on_transaction_end:406]: Transaction has been COMMITED LOG 19/08/2011 21:50:56 (TRN xrdb pid=782 sid=46 trid=-1) [tr_common_funcs.cpp:on_transaction_end:406]: Transaction has been COMMITED LOG 19/08/2011 21:50:56 (TRN xrdb pid=776 sid=48 trid=45) [tr_functions.cpp:on_user_statement_begin:103]: User query: --- update insert <Invoice>..redacted..</Invoice> into doc('Invoice.xml')/Invoices LOG 19/08/2011 21:50:56 (TRN xrdb pid=782 sid=46 trid=49) [tr_functions.cpp:on_user_statement_begin:103]: User query: --- update replace $Payment in index-scan("PaymentsIndex", "3B1F5-1197085616", "EQ") WITH <Payment>..redacted..</Payment> LOG 19/08/2011 21:50:56 (TRN xrdb pid=752 sid=47 trid=-1) [tr_common_funcs.cpp:on_transaction_end:406]: Transaction has been COMMITED LOG 19/08/2011 21:50:56 (TRN xrdb pid=752 sid=47 trid=46) [tr_functions.cpp:on_user_statement_begin:103]: User query: --- update insert <JournalEntry>..redacted..</JournalEntry> into doc('JournalEntry.xml')/JournalEntries ERROR 19/08/2011 21:50:56 (SM xrdb pid=578) [wu.cpp:AnalyzePopularFailure:64]: PutBlockVersionToBuffer - invalid lxptr (old version) ERROR 19/08/2011 21:50:56 (SM xrdb pid=578) [wu.cpp:AnalyzePopularFailure:65]: PutBlockVersionToBuffer - xptr 000000033c7e0000, lxptr 000000033c7e0000 ERROR 19/08/2011 21:50:56 (SM xrdb pid=578) [wu.cpp:AnalyzePopularFailure:66]: PutBlockVersionToBuffer - vheader {0000000221bffff8, 0000000200000006} {000000022326a368, 0000000222eaa5e4} {000000022326a410, 00000002233b0d24} {0000f7ec2326f070, 00000002233a2e34} INFO 22/08/2011 15:53:58 (GOV pid=528) [gov_functions.cpp:log_out_system_information:87]: SEDNA version is 3.4.66 (64bit Release) INFO 22/08/2011 15:53:58 (GOV pid=528) [gov_functions.cpp:log_out_system_information:95]: System: Linux 2.6.35-22-virtual x86_64 LOG 22/08/2011 15:53:58 (GOV pid=528) [gov.cpp:main:239]: Process ping server is ready LOG 22/08/2011 15:53:58 (GOV pid=528) [listener.cpp:client_listener:70]: GOVERNOR is ready LOG 22/08/2011 15:53:59 (SM xrdb pid=540) [sm.cpp:main:493]: Event log is ready LOG 22/08/2011 15:53:59 (SM xrdb pid=540) [sm.cpp:main:506]: Ping client has been started SYS 22/08/2011 15:53:59 (SM xrdb pid=540) [uprocess.c:uGetEnvironmentVariable:101]: getenv (code = 0): Success SYS 22/08/2011 15:53:59 (SM xrdb pid=540) [uprocess.c:uGetEnvironmentVariable:101]: getenv (code = 0): Success LOG 22/08/2011 15:53:59 (SM xrdb pid=540) [sm.cpp:recover_database_by_physical_and_logical_log:753]: Event log in recovery procedure is ready LOG 22/08/2011 15:53:59 (SM xrdb pid=540) [sm.cpp:recover_database_by_physical_and_logical_log:763]: Starting database recovery or hot-backup restoration... SYS 22/08/2011 15:53:59 (SM xrdb pid=540) [ummap.c:uMemLock:200]: mlock (code = 12): Cannot allocate memory WARN 22/08/2011 15:53:59 (SM xrdb pid=540) [bm_functions.cpp:_bm_init_buffer_pool:98]: Can't lock memory. It is not supported without root, RLIMIT_MEMLOCK exceeded or there are not enough system resources. LOG 22/08/2011 15:53:59 (SM xrdb pid=540) [sm.cpp:recover_database_by_physical_and_logical_log:803]: Buffer manager is started LOG 22/08/2011 15:54:35 (SM xrdb pid=540) [sm.cpp:recover_database_by_physical_and_logical_log:810]: Database has been recovered by physical log successfully LOG 22/08/2011 15:54:35 (SM xrdb pid=540) [blk_mngmt.cpp:extend_tmp_file:629]: Temp file has been extended, size: 6400000 LOG 22/08/2011 15:54:35 (SM xrdb pid=540) [sm.cpp:recover_database_by_physical_and_logical_log:822]: Checkpoints are disabled LOG 22/08/2011 15:54:35 (SM xrdb pid=540) [sm.cpp:recover_database_by_physical_and_logical_log:845]: Wu is initialized LOG 22/08/2011 15:54:35 (RCV xrdb pid=545) [tr.cpp:TRmain:182]: Session is ready LOG 22/08/2011 15:54:39 (SM xrdb pid=540) [blk_mngmt.cpp:extend_data_file:594]: Data file has been extended, size: b5410000 ERROR 22/08/2011 15:54:40 (SM xrdb pid=540) [wu.cpp:AnalyzePopularFailure:64]: PutBlockVersionToBuffer - invalid lxptr (old version) ERROR 22/08/2011 15:54:40 (SM xrdb pid=540) [wu.cpp:AnalyzePopularFailure:65]: PutBlockVersionToBuffer - xptr 0000000104ad0000, lxptr 0000000104ad0000 ERROR 22/08/2011 15:54:40 (SM xrdb pid=540) [wu.cpp:AnalyzePopularFailure:66]: PutBlockVersionToBuffer - vheader {0000000101b40000, 0000000000010420} Stop, start, and reboot. I have modified the init.d script provided by Charles Foster at http://www.cfoster.net/articles/sedna-stability/ I noticed that stopping & starting an Amazon instance works fine, but that a reboot causes the Sedna db to fail. I think I can cause the failure without any data. In our problem space, the individual Amazon instances are stopped when they are not in use. I never reboot an Amazon Sedna db instance (I refer to the OS instance) because it normally causes an unrecoverable error. Thanks, Malcolm -----Original Message----- From: Ivan Shcheklein [mailto:shchekl...@gmail.com] Sent: Thursday, September 01, 2011 9:33 AM To: Malcolm Davis Cc: sedna-discussion@lists.sourceforge.net Subject: Re: [Sedna-discussion] Questions: Database startup problem Hi Malcolm, Questions: 1. In the future, what kind of diagnostic information (memory, processes, etc.) can I provide to help resolve issues? Are there specific linux commands I should execute and provide results? In such cases (recovery procedure failure) the root of the problem usually can be found if it's possible to reproduce crash (or unexpected shutdown- process kill or machine reboot) of the database wich caused restart with recovery. So, what can we actually try? We can try to get your source data (XML files), your workload (queries, updated from event logs), try to rerun and crash (kill) Sedna until we get recovery failure. Ideally, if you know steps from the very beginning how to reproduce it we can fix the bug very soon. 2. Is the root cause a Linux or Sedna configuration problem, or is the problem a Sedna DB issue? Most likely it's DB issue. 3. Is there any configuration that would be considered optimal for an Amazon micro-instance that has nothing but Senda as the core process? The DB can grow to 6 GB, number of required simultaneous connections is limited, maybe 10 max. I don't mind recompiling code if that helps. (I think I have asked this question before, sorry for being repetitive.) Just make sure that you don't use too big buffers number value. It doesn't make sense to use 1GB buffers if instance has 600MB ram. Remeber that OS also needs some memory for cache, for system processe, etc. I would try to use 400MB at first monitoring response times. 4. I noticed that Sedna build is up to build 3.5.13. Are there any good reasons to move from 3.4.66 to the latest build? Are there any stable builds since the 3.4.66 release? No. Though 3.5 will be released soon. Don't use it now. It still has minor issues. I have all the data associated with db stored elsewhere, so I am going to simply rebuild the instance and move on. I just want to do my best from having the issue occur again. We also want to make it reliable as much as possible. Unfortunatelly, recovery failure is very hard to debug. We really need help from DB users to investigate/reproduce this problem to fix it. Ivan Shcheklein, Sedna Team ------------------------------------------------------------------------------ Special Offer -- Download ArcSight Logger for FREE! Finally, a world-class log management solution at an even better price-free! And you'll get a free "Love Thy Logs" t-shirt when you download Logger. Secure your free ArcSight Logger TODAY! http://p.sf.net/sfu/arcsisghtdev2dev _______________________________________________ Sedna-discussion mailing list Sedna-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/sedna-discussion