Hi Ilya, Thanks for the hint about jstack. I've analyzed two hanging issues and both look identical on Java side:
One thread, let's call it A, takes a lock and does not release it: "ajp-bio-8059-exec-21" daemon prio=10 tid=0x0000000018a3e000 nid=0x4c89 runnable [0x00000000434ee000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at net.cfoster.sedna.xmldb.p.a(Unknown Source) at net.cfoster.sedna.xmldb.o.a(Unknown Source) at net.cfoster.sedna.xmldb.o.a(Unknown Source) at net.cfoster.sedna.xmldb.g.g(Unknown Source) at net.cfoster.sedna.xmldb.g.<init>(Unknown Source) at net.cfoster.sedna.xmldb.h.c(Unknown Source) - locked <0x00000000c1c27948> (a net.cfoster.sedna.xmldb.h) at net.cfoster.sedna.xmldb.h.a(Unknown Source) - locked <0x00000000c1c27948> (a net.cfoster.sedna.xmldb.h) at net.cfoster.sedna.DatabaseImpl.getCollection(Unknown Source) Another thread, let's call it B, waits for the lock: "ajp-bio-8059-exec-22" daemon prio=10 tid=0x0000000018c36800 nid=0x4c8e waiting for monitor entry [0x0000000041135000] java.lang.Thread.State: BLOCKED (on object monitor) at net.cfoster.sedna.xmldb.h.a(Unknown Source) - waiting to lock <0x00000000c1c27948> (a net.cfoster.sedna.xmldb.h) at net.cfoster.sedna.DatabaseImpl.getCollection(Unknown Source) at org.xmldb.api.DatabaseManager.getCollection(DatabaseManager.java:206) On Sedna side the thread A database connection is processed and COMMITED in the logs: [tr_common_funcs.cpp:on_transaction_end:414]: Transaction has been COMMITED [sp.c:sp_send_msg:64]: Message send : type 250, len 0 The thread B database connection is processed by Sedna and ends up waiting for the client's response: [sp.c:sp_send_msg:64]: Message send : type 360, len 1086 [sp.c:sp_send_msg:64]: Message send : type 370, len 0 The client thread B does not respond as it waits for the lock and the corresponding Sedna connection is eventually closed by XML:DB API implementation with the message: Warning: an open Sedna XML:DB Collection/Connection was just automatically closed due to inactivity, you SHOULD close these manually. Meanwhile thread A successfully completes the processing and returns the response while still keeping the lock. @Ilya, so it's definitely the client-side issue, related either to JDK or to Charles Foster implementation. I've upgraded to the latest JDK 1.8.0_25 just for a check. If it does not help, I may ask Charles for a paid support as he suggested. Best regards, Ivan Lagunov On Tue, Oct 28, 2014 at 9:41 PM, Ilya Taranov <epsi...@socio.msu.ru> wrote: > Hi, Ivan, > > Unfortunately, I'm too busy at my full-time job currently, so sometimes, > I just don't have a chance to take a look at your dropbox file during the > day. Sorry for that. > > What I have seen so far, is that everything confirms my suspicion of the > client somehow involved. After 27478 sent 370 message, it awaits for > 310/220/225 from client to continue, holding locks. That's perfectly fine. > It's hard to tell anything without understanding why client is not willing > to send it. May that be some kind of hang? I wonder what the client is > doing, could you take a look at the client backtrace during the hang (like > the one returned by jstack/sending SIG_QUIT)? > > Index problem you mentioned is a separate issue, I will take a look at. > > Thanks, > Ilya Taranov. > > > On Tue, Oct 28, 2014 at 1:09 PM, Ivan Lagunov <lagi...@gmail.com> wrote: > >> I've faced two more hangings today. Again I'm posting only those >> connections that were created by our suspected application. Besides, I'm >> adding the last debug Message from the event.log for each transaction. >> >> https://www.dropbox.com/s/istotjr41vrmmd5/20141028001023.ZIP?dl=0 >> >> se_trn 21097 - fast XQuery, COMMITED, Message recv : type 210, len 0 >> se_trn 21114 - XQuery categorytree:getCategoryTree(), not COMMITED, gdb >> thread 1 in select, Message send : type 370, len 0 >> se_trn 21127 - fast XQuery, COMMITED, Message recv : type 210, len 0 >> se_trn 21229 - Session is ready >> >> So this hanging issue looks completely the same as those of yesterday. >> However, yesterday's problem has been fixed and now it happens roughly once >> in several hundreds cases. >> >> https://www.dropbox.com/s/ru6r6rplar3fat9/20141028151343.ZIP?dl=0 >> >> se_trn 27478 - fast XQuery basictypes:getBasicType(), not COMMITED, gdb >> thread 1 in select, Message send : type 370, len 0 >> se_trn 27932 - fast XQuery, COMMITED, Message recv : type 210, len 0 >> se_trn 28030 - Session is ready >> >> Obviously, the issue has not gone. Moreover, the second case confirms >> that the database can hang on various XQueries as before. As the >> yesterday's issue looks really the same but just worse, I've come to a >> conclusion that other hangings may be related to the Java version too. I'm >> going to try running the application under a different Java version. >> >> @Ilya, could you comment on the provided event.log with the debug >> messages generated by your patch? I'm curious if it can give better insight >> into the problem. >> >> Best regards, >> Ivan Lagunov >> >> On Mon, Oct 27, 2014 at 5:46 PM, Ivan Lagunov <lagi...@gmail.com> wrote: >> >>> Hi Ilya, >>> >>> Finally I've got some clues about Sedna hangings. Today the same >>> application that is under suspicion has been deployed by another developer. >>> It has been built on the developer's machine so apparently the JDK has been >>> different from the JDK that is used for running the application on the >>> server. I'm still waiting for more details on this matter. But this >>> deployment has resulted in a series of Sedna hangings. Although it differs >>> from the hanging scenarios that we had faced before, it may still be >>> somehow related. >>> >>> Here is the archive with logs of the first hanging issue of today: >>> >>> https://www.dropbox.com/s/ufwayhwx7ersy88/20141027131246.ZIP?dl=0 >>> >>> These are the transaction processes created by the suspected application: >>> se_trn 17209 - XQuery categorytree:getCategoryTree(), not COMMITED, gdb >>> thread 1 in select >>> se_trn 17217 - Session is ready >>> >>> All other today's hangings have been caused by the same XQuery >>> categorytree:getCategoryTree(). It simply loads and returns a big document >>> TreeExport.xml (about 2.5MB) from the collection "treeStructures/released". >>> It's very likely that the "Sedna index subsystem error" from my >>> previous email is also related to this. There have been no more hangings >>> after I rebuilt and redeployed the Java application on the server using the >>> same "good" JDK that is used for running. >>> >>> @Ilya, could you please look into the event.log and say if the client >>> indeed makes Sedna wait as you had suggested before? >>> >>> This is an extract from event.log: >>> >>> LOG 27/10/2014 12:51:43 (TRN nxp pid=17209 sid=44 trid=-1) >>> [sp.c:sp_recv_msg:34]: Message recv : type 210, len 0 >>> LOG 27/10/2014 12:51:43 (TRN nxp pid=17209 sid=44 trid=48) >>> [sp.c:sp_send_msg:64]: Message send : type 230, len 0 >>> LOG 27/10/2014 12:51:43 (TRN nxp pid=17209 sid=44 trid=48) >>> [sp.c:sp_recv_msg:34]: Message recv : type 300, len 220 >>> LOG 27/10/2014 12:51:43 (TRN nxp pid=17209 sid=44 trid=48) >>> [tr_functions.cpp:on_user_statement_begin:102]: User query: >>> --- categorytree:getCategoryTree() >>> LOG 27/10/2014 12:51:43 (TRN nxp pid=17209 sid=44 trid=48) >>> [sp.c:sp_send_msg:64]: Message send : type 320, len 0 >>> LOG 27/10/2014 12:51:45 (TRN nxp pid=17209 sid=44 trid=48) >>> [sp.c:sp_send_msg:64]: Message send : type 360, len 10240 >>> ... plenty of lines like this ... >>> LOG 27/10/2014 12:51:45 (TRN nxp pid=17209 sid=44 trid=48) >>> [sp.c:sp_send_msg:64]: Message send : type 360, len 10240 >>> LOG 27/10/2014 12:51:45 (TRN nxp pid=17209 sid=44 trid=48) >>> [sp.c:sp_send_msg:64]: Message send : type 360, len 8204 >>> LOG 27/10/2014 12:51:45 (TRN nxp pid=17209 sid=44 trid=48) >>> [sp.c:sp_send_msg:64]: Message send : type 370, len 0 >>> >>> Just to clarify about our earlier hangings, afaik all the deployments of >>> the suspected application were performed on the server using "good" JDK >>> earlier. So while earlier hangings might have been somehow related to >>> today's, the situation is quite different. >>> >>> Best regards, >>> Ivan Lagunov >>> >>> >>> On Mon, Oct 27, 2014 at 12:46 PM, Ivan Lagunov <lagi...@gmail.com> >>> wrote: >>> >>>> Hi Ilya, >>>> >>>> The patch has been applied to the night build 3.6.519, the data has >>>> been migrated with se_exp using complete export-import. The issue has not >>>> appear again for a week already. I'll keep monitoring Sedna. >>>> >>>> However, it seems there is another issue with that night build: >>>> >>>> LOG 27/10/2014 01:21:18 (TRN nxp pid=20091 sid=49 trid=-1) >>>> [sp.c:sp_send_msg:64]: Message send : type 250, len 0 >>>> LOG 27/10/2014 01:21:18 (TRN nxp pid=20091 sid=49 trid=-1) >>>> [sp.c:sp_recv_msg:34]: Message recv : type 210, len 0 >>>> LOG 27/10/2014 01:21:18 (TRN nxp pid=20091 sid=49 trid=49) >>>> [sp.c:sp_send_msg:64]: Message send : type 230, len 0 >>>> LOG 27/10/2014 01:21:18 (TRN nxp pid=20091 sid=49 trid=49) >>>> [sp.c:sp_recv_msg:34]: Message recv : type 300, len 76 >>>> LOG 27/10/2014 01:21:18 (TRN nxp pid=20091 sid=49 trid=49) >>>> [tr_functions.cpp:on_user_statement_begin:102]: User query: >>>> --- DROP DOCUMENT "TreeExport.xml" IN COLLECTION >>>> "treeStructures/released" >>>> ERROR 27/10/2014 01:21:20 (TRN nxp pid=20091 sid=49 trid=49) >>>> [btree.cpp:bt_delete_tmpl:273]: (SE1008) Sedna index subsystem error. >>>> Details: Cannot delete object which is not in the btree >>>> LOG 27/10/2014 01:21:20 (TRN nxp pid=20091 sid=49 trid=-1) >>>> [tr_common_funcs.cpp:on_transaction_end:416]: Transaction has been ROLLED >>>> BACK >>>> LOG 27/10/2014 01:21:20 (TRN nxp pid=20091 sid=49 trid=-1) >>>> [sp.c:sp_send_msg:64]: Message send : type 100, len 122 >>>> LOG 27/10/2014 01:21:25 (TRN nxp pid=20091 sid=49 trid=-1) >>>> [sp.c:sp_recv_msg:34]: Message recv : type 500, len 0 >>>> LOG 27/10/2014 01:21:25 (TRN nxp pid=20091 sid=49 trid=-1) >>>> [sp.c:sp_send_msg:64]: Message send : type 510, len 0 >>>> INFO 27/10/2014 01:21:25 (SM nxp pid=19377) >>>> [bm_core.cpp:bm_log_out_io_statistics:537]: IO block reads:0, writes:0 >>>> LOG 27/10/2014 01:21:25 (TRN nxp pid=20091 sid=49 trid=-1) >>>> [tr.cpp:TRmain:456]: Session is closed >>>> >>>> It's persistent, I could not delete the document at all until I deleted >>>> the indexes on that collection: >>>> <index name="category_href" backend="btree" as_type="xs:string" >>>> on_path="descendant::element(Category)" >>>> by_path="attribute::attribute(href)" object_name="treeStructures/released" >>>> object_type="collection"/> >>>> <index name="category_id" backend="btree" as_type="xs:string" >>>> on_path="descendant::element(Category)" >>>> by_path="attribute::attribute(Identifier)" >>>> object_name="treeStructures/released" object_type="collection"/> >>>> >>>> Any ideas? >>>> >>>> Best regards, >>>> Ivan Lagunov >>>> >>>> >>>> On Fri, Oct 17, 2014 at 10:05 AM, Ilya Taranov <epsi...@socio.msu.ru> >>>> wrote: >>>> >>>>> Hi, here is the patch over the latest label that should add protocol >>>>> minimal debugging. >>>>> >>>>> On Wed, Oct 15, 2014 at 9:53 AM, Ivan Lagunov <lagi...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi Ilya, >>>>>> >>>>>> As it's stated in the email topic, I use Sedna 3.5.615. It was built >>>>>> from sources with RelWithDebugInfo flag enabled on the RHEL linux >>>>>> server. If you want to add some additional debug info, I'd suggest adding >>>>>> it on top of the latest night build though. I'm planning to upgrade >>>>>> anyway >>>>>> as there were quite some nice features implemented. >>>>>> >>>>>> @Charles, okay, nice to know this. If only I could provide a >>>>>> consistently failing test case. Currently the issue is quite mysterious. >>>>>> >>>>>> Best regards, >>>>>> Ivan Lagunov >>>>>> >>>>>> On Wed, Oct 15, 2014 at 11:02 AM, Ivan Shcheklein < >>>>>> shchekl...@gmail.com> wrote: >>>>>> >>>>>>> Ilya, >>>>>>> >>>>>>> As far as I remember. It is almost the latest dev sources built with >>>>>>> relWithDebInfo. Internal data structures should be compatible. >>>>>>> >>>>>>> Ivan >>>>>>> >>>>>>> On Wed, Oct 15, 2014 at 2:08 AM, Ilya Taranov <epsi...@socio.msu.ru> >>>>>>> wrote: >>>>>>> >>>>>>>> I'm not sure, latest build will fix the problem. >>>>>>>> >>>>>>>> It's definitely some client-to-app communication problem. >>>>>>>> If I make a git patch for you with additional debug info, will you >>>>>>>> be able to apply it (assuming, the data file version will be the same)? >>>>>>>> What build do you currently use? >>>>>>>> >>>>>>>> On Sat, Oct 11, 2014 at 12:48 AM, Ivan Lagunov <lagi...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Ilya, >>>>>>>>> >>>>>>>>> I've analyzed logs for other issue occurrences and it has appeared >>>>>>>>> more complicated. >>>>>>>>> >>>>>>>>> First, could you please clarify about "Session is ready" >>>>>>>>> statement, when exactly is it dumped to the log file? I'm wondering >>>>>>>>> because >>>>>>>>> when the database is hanging, those log entries still appear at some >>>>>>>>> random >>>>>>>>> moments of time. >>>>>>>>> >>>>>>>>> Second, I'm going to provide the investigation details for several >>>>>>>>> more issue occurrences here. Below "fast XQuery" means that it uses >>>>>>>>> index >>>>>>>>> and is expected to be very fast. >>>>>>>>> >>>>>>>>> >>>>>>>>> https://www.dropbox.com/s/2luq53cstc4t3tl/sedna_debug_info1.zip?dl=0 >>>>>>>>> >>>>>>>>> se_trn 27173 - DROP DOCUMENT operation, COMMITED >>>>>>>>> se_gov 29197 >>>>>>>>> se_sm 29224 >>>>>>>>> se_trn 29617 - fast XQuery basictypes:getBasicType, not COMMITED, >>>>>>>>> gdb thread 1 in select >>>>>>>>> se_trn 29820 - Session is ready >>>>>>>>> se_trn 29976 - Session is ready >>>>>>>>> >>>>>>>>> >>>>>>>>> https://www.dropbox.com/s/scmszq9yphyv6xf/sedna_debug_info2.zip?dl=0 >>>>>>>>> >>>>>>>>> se_gov 15890 >>>>>>>>> se_sm 15899 >>>>>>>>> se_trn 21611 - doc-available XQuery, not COMMITED, gbd thread 1 in >>>>>>>>> nanosleep >>>>>>>>> se_trn 21709 - doc('$version') XQuery, COMMITED, gdb thread 1 in >>>>>>>>> select >>>>>>>>> se_trn 21713 - doc('$version') XQuery, COMMITED, gdb thread 1 in >>>>>>>>> select >>>>>>>>> se_trn 21716 - doc('$version') XQuery, COMMITED, gdb thread 1 in >>>>>>>>> select >>>>>>>>> se_trn 21719 - doc('$version') XQuery, COMMITED, gdb thread 1 in >>>>>>>>> select >>>>>>>>> se_trn 21727 - LOAD STDIN operation, not COMMITED, gbd thread 1 in >>>>>>>>> nanosleep >>>>>>>>> se_trn 21996 - fast XQuery basictypes:getBasicType, not COMMITED, >>>>>>>>> gdb thread 1 in select >>>>>>>>> se_trn 22002 - document("$collections") XQuery, not COMMITED, gdb >>>>>>>>> thread 1 in nanosleep >>>>>>>>> se_trn 22010 - Session is ready >>>>>>>>> se_trn 22013 - Session is ready >>>>>>>>> se_trn 22152 - Session is ready >>>>>>>>> se_trn 22295 - Session is ready >>>>>>>>> se_trn 23615 - Session is ready >>>>>>>>> se_trn 24315 - Session is ready >>>>>>>>> >>>>>>>>> Questions: >>>>>>>>> 1. Does nanosleep state mean that the operation is waiting for >>>>>>>>> another one (here it's probably 21996 XQuery in select)? >>>>>>>>> 2. Why does doc-available wait for any other operation? I'd expect >>>>>>>>> it to be non-blocking and to run immediately on the latest data >>>>>>>>> snapshot. >>>>>>>>> 3. Why does doc('$version') wait for any other operation? It's a >>>>>>>>> database metadata being fetched, so I'd also expect it to run >>>>>>>>> immediately. >>>>>>>>> >>>>>>>>> >>>>>>>>> https://www.dropbox.com/s/3ibo4dvzapfdirw/sedna_debug_info3.zip?dl=0 >>>>>>>>> >>>>>>>>> se_trn 641 - fast XQuery packages:getPackage, not COMMITED, gdb >>>>>>>>> thread 1 in select >>>>>>>>> se_trn 645 - Session is ready & COMMITED without operation >>>>>>>>> se_trn 648 - Session is ready >>>>>>>>> se_trn 823 - Session is ready >>>>>>>>> se_trn 898 - Session is ready >>>>>>>>> se_gov 8444 >>>>>>>>> se_sm 8453 >>>>>>>>> >>>>>>>>> Questions: >>>>>>>>> What is going on with 645? There is "Session in ready" statement >>>>>>>>> immediately followed by "COMMITED" statement in the event.log. What >>>>>>>>> operation was committed? Is it possible the order of statements in the >>>>>>>>> event.log is inconsistent? Even so there is no more statements with >>>>>>>>> pid=645, so no operation to be seen that could be committed. Very >>>>>>>>> suspicious. >>>>>>>>> >>>>>>>>> >>>>>>>>> https://www.dropbox.com/s/38idxmpzjv6rvxf/sedna_debug_info4.zip?dl=0 >>>>>>>>> >>>>>>>>> se_gov 1532 >>>>>>>>> se_sm 1711 >>>>>>>>> se_trn 6837 - DROP DOCUMENT operation, not COMMITED, gdb thread 1 >>>>>>>>> in nanosleep >>>>>>>>> se_trn 11703 - LOAD STDIN operation, not COMMITED, gdb thread 1 in >>>>>>>>> nanosleep >>>>>>>>> se_trn 11944 - document("$collections") XQuery, not COMMITED, gdb >>>>>>>>> thread 1 in nanosleep >>>>>>>>> se_trn 11972 - document("$collections") XQuery, not COMMITED, gdb >>>>>>>>> thread 1 in nanosleep >>>>>>>>> se_trn 12139 - XQuery collection("groupingDefinitions/wip"), >>>>>>>>> COMMITED, gdb thread 1 in select >>>>>>>>> se_trn 12183 - fast XQuery basictypes:getBasicType, not COMMITED, >>>>>>>>> gdb thread 1 in select >>>>>>>>> se_trn 12186 - XQuery collection("groupingDefinitions/wip"), >>>>>>>>> COMMITED, gdb thread 1 in select >>>>>>>>> se_trn 12195 - XQuery collection("groupingDefinitions/wip"), >>>>>>>>> COMMITED, gdb thread 1 in select >>>>>>>>> se_trn 12222 - document("$collections") XQuery, not COMMITED, gdb >>>>>>>>> thread 1 in nanosleep >>>>>>>>> se_trn 12302 - fast XQuery documents:getFilesByBasicTypeId, not >>>>>>>>> COMMITED, gdb thread 1 in nanosleep >>>>>>>>> se_trn 12307 - document("$documents") XQuery, not COMMITED, gdb >>>>>>>>> thread 1 in nanosleep >>>>>>>>> se_trn 12310 - document("$collections") XQuery, not COMMITED, gdb >>>>>>>>> thread 1 in nanosleep >>>>>>>>> se_trn 12377 - Session is ready >>>>>>>>> se_trn 12382 - Session is ready >>>>>>>>> se_trn 12423 - Session is ready >>>>>>>>> se_trn 12848 - Session is ready >>>>>>>>> >>>>>>>>> Comment: >>>>>>>>> I understand many of those XQueries are not optimal, they are part >>>>>>>>> of an application developed by other devs. Now I'll take care of >>>>>>>>> improving >>>>>>>>> those operations to replace document("$documents") and >>>>>>>>> document("$collections") with doc-available and col-available >>>>>>>>> accordingly >>>>>>>>> and to use an index for collection("groupingDefinitions/wip"). >>>>>>>>> >>>>>>>>> Overall conclusion: >>>>>>>>> There is no single operation/XQuery causing the database hanging. >>>>>>>>> There were at least three different XQueries observed. Taking into >>>>>>>>> account >>>>>>>>> the seemingly random occurrences, it looks like the protocol issue >>>>>>>>> indeed. >>>>>>>>> I don't see how to apply tcpdump in this case as it's Production >>>>>>>>> environment. Any ideas? >>>>>>>>> >>>>>>>>> I'd try to rebuild the database from sources and, perhaps, to take >>>>>>>>> the latest night 3.5 (or 3.6?) build for this. Do you think it makes >>>>>>>>> sense >>>>>>>>> to try? >>>>>>>>> >>>>>>>>> Best regards, >>>>>>>>> Ivan Lagunov >>>>>>>>> >>>>>>>>> On Fri, Oct 10, 2014 at 1:23 AM, Ilya Taranov < >>>>>>>>> epsi...@socio.msu.ru> wrote: >>>>>>>>> >>>>>>>>>> Hi, from the top of my head... >>>>>>>>>> >>>>>>>>>> 1. Queries are evaluated in a lazy fashion. Which means, that >>>>>>>>>> before client asks to fetch the next row, nothing is evaluated. >>>>>>>>>> >>>>>>>>>> So select() is a socket read, which is waiting for client to make >>>>>>>>>> a decision on whether to commit (or end RO transaction) or ask for >>>>>>>>>> next >>>>>>>>>> part of the result. All locks are only released when end transaction >>>>>>>>>> happens. That's why others are waiting for one single client. >>>>>>>>>> >>>>>>>>>> To my mind, it's a wrong approach, and that's not entirely how I >>>>>>>>>> would do it right now, But that days we had ACID as a primary goal >>>>>>>>>> in minds. >>>>>>>>>> >>>>>>>>>> 2. Session is ready, means that there is no ongoing transaction, >>>>>>>>>> but the connection is established. >>>>>>>>>> >>>>>>>>>> 3. No idea in this particular case, but there can be a situation >>>>>>>>>> when session is started without client connection. >>>>>>>>>> Recovery/checkpoint, >>>>>>>>>> maybe something else, I'm not sure. >>>>>>>>>> >>>>>>>>>> 4. That dfntly looks strange, but should not cause a lock. Once >>>>>>>>>> transaction is committed, it should release all the locks. So even >>>>>>>>>> should >>>>>>>>>> there be orphan sessions, they would not interfere the normal >>>>>>>>>> operation. >>>>>>>>>> >>>>>>>>>> 5. Yes that pretty much looks like a problem. Based on logs, it >>>>>>>>>> looks like 17338 is waiting for client to fetch a result. >>>>>>>>>> >>>>>>>>>> That may be a protocol issue, by the way, when as a result of >>>>>>>>>> malformed message both client and sedna are waiting for each other. >>>>>>>>>> >>>>>>>>>> I would guess, that's always happens when >>>>>>>>>> "categorytree:getCategoryTree()" is called, could you please try >>>>>>>>>> that? If >>>>>>>>>> it confirms, could you please make a wireshark/tcpdump dump of this >>>>>>>>>> communication? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Oct 9, 2014 at 6:42 AM, Ivan Lagunov <lagi...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Ilya, >>>>>>>>>>> >>>>>>>>>>> Please check my investigations below. >>>>>>>>>>> >>>>>>>>>>> I've collected all the debug information for another hanging >>>>>>>>>>> today, it's available here: >>>>>>>>>>> https://www.dropbox.com/s/g14aer9821fzx05/sedna_debug_info6.zip?dl=0 >>>>>>>>>>> Here are the statuses of all Sedna processes: >>>>>>>>>>> >>>>>>>>>>> se_trn 13436 - DROP DOCUMENT operation, COMMITED >>>>>>>>>>> se_trn 16245 - LOAD STDIN operation, COMMITED, no ESTABLISHED >>>>>>>>>>> connection on port 5050 found (only 5151) >>>>>>>>>>> se_trn 17335 - doc('$version') XQuery, COMMITED, gdb thread 1 in >>>>>>>>>>> select () from /lib64/libc.so.6 >>>>>>>>>>> se_trn 17338 - XQuery running about 10sec, not COMMITED, gdb >>>>>>>>>>> thread 1 in select () from /lib64/libc.so.6 >>>>>>>>>>> se_trn 17410 - XQuery running <1sec, COMMITED >>>>>>>>>>> se_trn 17462 - Session is ready >>>>>>>>>>> se_trn 17554 - Session is ready >>>>>>>>>>> se_trn 18892 - Session is ready >>>>>>>>>>> se_trn 19124 - Session is ready >>>>>>>>>>> se_gov 20441 - gdb threads 1 and 2 in select () from >>>>>>>>>>> /lib64/libc.so.6 >>>>>>>>>>> se_sm 20449 >>>>>>>>>>> >>>>>>>>>>> Comments: >>>>>>>>>>> 1. Where gdb is not stated, all threads are in semop () from >>>>>>>>>>> /lib64/libc.so.6 except the last thread that is in >>>>>>>>>>> pthread_cond_timedwait@@GLIBC_2.3.2 >>>>>>>>>>> (). I believe semop is a good state, correct me if I'm wrong. >>>>>>>>>>> 2. All se_trn with "Session is ready" have no other entries in >>>>>>>>>>> the latest event.log. As I collect only the latest event.log, I >>>>>>>>>>> cannot >>>>>>>>>>> check what was the last operation. >>>>>>>>>>> 3. All se_trn processes have two ESTABLISHED connections in >>>>>>>>>>> netstat for ports 5050 and 5151 if not stated otherwise. >>>>>>>>>>> 4. se_trn 17338, 17410, 17462 were created by the same >>>>>>>>>>> application that is under suspicion. The database unfreezed after >>>>>>>>>>> the >>>>>>>>>>> restart of that application. >>>>>>>>>>> >>>>>>>>>>> Issues/questions: >>>>>>>>>>> 1. What exactly does it mean when a thread is in select()? It >>>>>>>>>>> seems the database is waiting but for the client or for some >>>>>>>>>>> internal >>>>>>>>>>> operation, e.g. fetching the data. Please clarify. >>>>>>>>>>> 2. What exactly does "Session is ready" mean? When does it >>>>>>>>>>> appear in the event.log? I believe those connections are okay but >>>>>>>>>>> I'm >>>>>>>>>>> curious to know. >>>>>>>>>>> 3. se_trn 16245 - how is it possible that there is no >>>>>>>>>>> ESTABLISHED connection on port 5050 while the process is still >>>>>>>>>>> alive? An >>>>>>>>>>> issue? >>>>>>>>>>> 4. se_trn 17335 - how is it possible that thread is in select() >>>>>>>>>>> while the transaction has been COMMITED in event.log? An issue? >>>>>>>>>>> 5. se_trn 17338 - looks like a root cause of Sedna hanging. >>>>>>>>>>> Currently I run it via XML:DB API implementation mentioned earlier, >>>>>>>>>>> more >>>>>>>>>>> specifically I run the XQuery via XQueryService.query("xquery >>>>>>>>>>> string") >>>>>>>>>>> call. Besides, I always try to close the connection in the Java >>>>>>>>>>> finally >>>>>>>>>>> block, so it looks roughly like this: >>>>>>>>>>> >>>>>>>>>>> try { >>>>>>>>>>> rootCollection = DatabaseManager.getCollection(...); >>>>>>>>>>> ... do something ... >>>>>>>>>>> } finally { >>>>>>>>>>> rootCollection.close(); >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> According to event.log the XQuery was started at 09/10/2014 >>>>>>>>>>> 12:00:01. >>>>>>>>>>> In the application log somewhere between 12:01:53 and 12:05:00 >>>>>>>>>>> appeared a message: >>>>>>>>>>> >>>>>>>>>>>> Warning: an open Sedna XML:DB Collection/Connection was just >>>>>>>>>>>> automatically closed due to inactivity, you SHOULD close these >>>>>>>>>>>> manually. >>>>>>>>>>> >>>>>>>>>>> It seems to mean that the connection was considered as abandoned >>>>>>>>>>> and removed by XML:DB API implementation (see above >>>>>>>>>>> conn-log-abandoned=true >>>>>>>>>>> and conn-remove-abandoned-timeout=120000). >>>>>>>>>>> For me it looks like the client waited for 2 minutes but did not >>>>>>>>>>> get any response from the database. So there was an attempt to >>>>>>>>>>> close the >>>>>>>>>>> connection. It's easy to understand as the database hanged at about >>>>>>>>>>> 12:00:02 when the last transaction was COMMITED. So it may be even >>>>>>>>>>> that the >>>>>>>>>>> root cause is hidden somewhere deeper. For example, this long >>>>>>>>>>> XQuery could >>>>>>>>>>> cause a deadlock. >>>>>>>>>>> >>>>>>>>>>> I understand it may be an issue in XML:DB API implementation but >>>>>>>>>>> its support was stopped by Charles Foster. Unfortunately, we cannot >>>>>>>>>>> afford >>>>>>>>>>> to migrate to XQJ API implementation yet. The most tricky part to >>>>>>>>>>> understand is why we had not faced similar issues before recent >>>>>>>>>>> occurrence >>>>>>>>>>> when it has become a regular major issue immediately. There were no >>>>>>>>>>> server >>>>>>>>>>> updates and no changes that could affect the way >>>>>>>>>>> transactions/connections >>>>>>>>>>> are handled. >>>>>>>>>>> >>>>>>>>>>> Best regards, >>>>>>>>>>> Ivan Lagunov >>>>>>>>>>> >>>>>>>>>>> On Thu, Oct 9, 2014 at 1:33 PM, Ivan Lagunov <lagi...@gmail.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Sorry, copied the wrong link for the connection pool >>>>>>>>>>>> documentation. Here is the correct one: >>>>>>>>>>>> http://www.cfoster.net/pdf/sedna/xmldb/documentation.pdf#page=10 >>>>>>>>>>>> >>>>>>>>>>>> Besides, I use the following values for the connection pool >>>>>>>>>>>> parameters in the suspected application: >>>>>>>>>>>> conn-max-active=50 >>>>>>>>>>>> conn-max-wait=30000 >>>>>>>>>>>> conn-log-abandoned=true >>>>>>>>>>>> conn-remove-abandoned-timeout=120000 >>>>>>>>>>>> conn-abandoned-check-interval=5000 >>>>>>>>>>>> >>>>>>>>>>>> Best regards, >>>>>>>>>>>> Ivan Lagunov >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Oct 9, 2014 at 1:28 PM, Ivan Lagunov <lagi...@gmail.com >>>>>>>>>>>> > wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Ilya, >>>>>>>>>>>>> >>>>>>>>>>>>> I use Charles Foster implementation of XML:DB API, version >>>>>>>>>>>>> 1.2.6 without any customizations: >>>>>>>>>>>>> http://www.cfoster.net/sedna/xmldb/download/ >>>>>>>>>>>>> >>>>>>>>>>>>> Here is the link again to the connection pool documentation >>>>>>>>>>>>> (it's a part of the above implementation): >>>>>>>>>>>>> http://www.cfoster.net/sedna/xmldb/download/ >>>>>>>>>>>>> >>>>>>>>>>>>> Meanwhile, I've identified the application that makes Sedna >>>>>>>>>>>>> wait. At least restarting the application has unfreezed the >>>>>>>>>>>>> database twice >>>>>>>>>>>>> already. I'll analyze the application log along with Sedna >>>>>>>>>>>>> event.log >>>>>>>>>>>>> shortly. Hopefully, I'll find the root cause. >>>>>>>>>>>>> >>>>>>>>>>>>> Best regards, >>>>>>>>>>>>> Ivan Lagunov >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Oct 9, 2014 at 3:50 AM, Ilya Taranov < >>>>>>>>>>>>> epsi...@socio.msu.ru> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I can easily see the scenario, where one of the clients >>>>>>>>>>>>>> doesn't terminate the transaction properly. Unfortunately, >>>>>>>>>>>>>> current >>>>>>>>>>>>>> application timeout doesn't apply for client hangs. The issue >>>>>>>>>>>>>> can be easily >>>>>>>>>>>>>> fixed, however. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Unfortunately, even fix still would not solve your problem, >>>>>>>>>>>>>> you would still have application timeout wait on locks. Probably >>>>>>>>>>>>>> the most >>>>>>>>>>>>>> straightforward way to find the cause for you would be to log >>>>>>>>>>>>>> connection >>>>>>>>>>>>>> checkins/checkouts at the client driver level. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Could you please give me a link to the connection pool + >>>>>>>>>>>>>> driver you currently use (including all your changes if >>>>>>>>>>>>>> possible) so I >>>>>>>>>>>>>> could help you with that? >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Oct 7, 2014 at 4:37 AM, Ivan Lagunov < >>>>>>>>>>>>>> lagi...@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Ilya, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Your suggestion seems to be correct. After I've restarted >>>>>>>>>>>>>>> three Java applications that had connections to Sedna, the >>>>>>>>>>>>>>> database has >>>>>>>>>>>>>>> resumed processing operations. Here is the event.log where >>>>>>>>>>>>>>> Sedna has hanged >>>>>>>>>>>>>>> between 12:35:03 and 13:06:05: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> https://www.dropbox.com/s/pzafuvzcvfchxwa/event-2014-10-07-13-15-04-018.zip?dl=0 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I'm not sure which of the applications caused it though. It >>>>>>>>>>>>>>> still worries me as the queries before hanging look quite >>>>>>>>>>>>>>> simple and I >>>>>>>>>>>>>>> cannot understand which client operation could have hanged the >>>>>>>>>>>>>>> whole >>>>>>>>>>>>>>> database. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>> Ivan Lagunov >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Oct 6, 2014 at 12:29 PM, Ivan Lagunov < >>>>>>>>>>>>>>> lagi...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Ilya, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Yes, I use a connection pool via Sedna XML:DB API >>>>>>>>>>>>>>>> implementation. It's documented here: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> http://www.cfoster.net/pdf/sedna/xmldb/documentation.pdf#page=10 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Unfortunately, there is no support for active connection >>>>>>>>>>>>>>>> timeout. This could be the cause if there was a query hanging >>>>>>>>>>>>>>>> for some >>>>>>>>>>>>>>>> reason. Still if some query hangs forever (that should not as >>>>>>>>>>>>>>>> it's in my >>>>>>>>>>>>>>>> case) it looks more like a database bug/deadlock situation. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'll try as you suggest next time. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> P.S. Sorry for the second email as I've forgotten to add CC >>>>>>>>>>>>>>>> first. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>>> Ivan Lagunov >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Oct 3, 2014 at 10:23 PM, Ilya Taranov < >>>>>>>>>>>>>>>> epsi...@socio.msu.ru> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Ivan, do you have a connection pool? I see one of the >>>>>>>>>>>>>>>>> se_trn processes waiting for socket read. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> If you don't have a connection pool, it just might be a >>>>>>>>>>>>>>>>> client problem. I'll take a deeper look anyway, just next >>>>>>>>>>>>>>>>> time it happens, >>>>>>>>>>>>>>>>> could you please try to shutdown the client thread, if it's >>>>>>>>>>>>>>>>> even possible. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Oct 2, 2014 at 5:30 AM, Ivan Lagunov < >>>>>>>>>>>>>>>>> lagi...@gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I've been facing major issues with Sedna hanging >>>>>>>>>>>>>>>>>> regularly for the last two weeks (after several months of >>>>>>>>>>>>>>>>>> stable work). I >>>>>>>>>>>>>>>>>> tried to do the complete export-import of data using se_exp >>>>>>>>>>>>>>>>>> but Sedna still >>>>>>>>>>>>>>>>>> hangs regularly. Looking into event.log does not help as >>>>>>>>>>>>>>>>>> queries differ >>>>>>>>>>>>>>>>>> every time. Sometimes it looks like Sedna hangs after some >>>>>>>>>>>>>>>>>> heavy query >>>>>>>>>>>>>>>>>> (with huge response) while in other cases there are no heavy >>>>>>>>>>>>>>>>>> queries to >>>>>>>>>>>>>>>>>> suspect. I've written a script to collect the following >>>>>>>>>>>>>>>>>> logs/dumps you will >>>>>>>>>>>>>>>>>> find in the archives: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> event.log - the last Sedna event log till the moment DB >>>>>>>>>>>>>>>>>> hangs and is stopped; >>>>>>>>>>>>>>>>>> sedna_procs.txt - list of all Sedna processes (ps -ef); >>>>>>>>>>>>>>>>>> netstat.txt - list of all ESTABLISHED connections; >>>>>>>>>>>>>>>>>> ipcs.txt - results of ipcs -s; >>>>>>>>>>>>>>>>>> <pid>.txt - results of gdb with bt for all threads of a >>>>>>>>>>>>>>>>>> Sedna process (available for all processes at the moment of >>>>>>>>>>>>>>>>>> hanging). This >>>>>>>>>>>>>>>>>> is collected automatically for threads 1-4 of each process. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'm providing two sets of logs/dumps for two separate >>>>>>>>>>>>>>>>>> hanging events: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 25th September - seemed to hang without any heavy queries: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/2luq53cstc4t3tl/sedna_debug_info1.zip?dl=0 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 2nd October - hanged after a heavy query: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/scmszq9yphyv6xf/sedna_debug_info2.zip?dl=0 >>>>>>>>>>>>>>>>>> The heavy query occurs at LOG 02/10/2014 11:10:20 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> If you can find out the cause of the hangings, it'd be >>>>>>>>>>>>>>>>>> the best. Otherwise it'd be nice to hear some >>>>>>>>>>>>>>>>>> recommendations at least on >>>>>>>>>>>>>>>>>> how to avoid these issues/minimize an impact. May be I need >>>>>>>>>>>>>>>>>> to upgrade to >>>>>>>>>>>>>>>>>> some later version but I'm currently using 3.5.615 that is >>>>>>>>>>>>>>>>>> already a night >>>>>>>>>>>>>>>>>> build that we had to upgrade to after some older issue >>>>>>>>>>>>>>>>>> (reported back in >>>>>>>>>>>>>>>>>> March-April of 2013). I'd try some 3.5 or 3.6 version but >>>>>>>>>>>>>>>>>> without knowing >>>>>>>>>>>>>>>>>> the reason there is no guarantee the potential bug is fixed >>>>>>>>>>>>>>>>>> in the later >>>>>>>>>>>>>>>>>> night build. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Also it'd be really nice if there was an official release >>>>>>>>>>>>>>>>>> for 3.5 with all the latest bug fixes (I see the last night >>>>>>>>>>>>>>>>>> build is >>>>>>>>>>>>>>>>>> 3.5.627 of 8th May 2013). >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>>>>> Ivan Lagunov >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>>>>>>>>> Meet PCI DSS 3.0 Compliance Requirements with EventLog >>>>>>>>>>>>>>>>>> Analyzer >>>>>>>>>>>>>>>>>> Achieve PCI DSS 3.0 Compliant Status with Out-of-the-box >>>>>>>>>>>>>>>>>> PCI DSS Reports >>>>>>>>>>>>>>>>>> Are you Audit-Ready for PCI DSS 3.0 Compliance? Download >>>>>>>>>>>>>>>>>> White paper >>>>>>>>>>>>>>>>>> Comply to PCI DSS 3.0 Requirement 10 and 11.5 with >>>>>>>>>>>>>>>>>> EventLog Analyzer >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=154622311&iu=/4140/ostg.clktrk >>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>> Sedna-discussion mailing list >>>>>>>>>>>>>>>>>> Sedna-discussion@lists.sourceforge.net >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/sedna-discussion >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
------------------------------------------------------------------------------
_______________________________________________ Sedna-discussion mailing list Sedna-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/sedna-discussion