Hi Ilya,

Thanks for the hint about jstack. I've analyzed two hanging issues and both
look identical on Java side:

One thread, let's call it A, takes a lock and does not release it:
"ajp-bio-8059-exec-21" daemon prio=10 tid=0x0000000018a3e000 nid=0x4c89
runnable [0x00000000434ee000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at net.cfoster.sedna.xmldb.p.a(Unknown Source)
at net.cfoster.sedna.xmldb.o.a(Unknown Source)
at net.cfoster.sedna.xmldb.o.a(Unknown Source)
at net.cfoster.sedna.xmldb.g.g(Unknown Source)
at net.cfoster.sedna.xmldb.g.<init>(Unknown Source)
at net.cfoster.sedna.xmldb.h.c(Unknown Source)
- locked <0x00000000c1c27948> (a net.cfoster.sedna.xmldb.h)
at net.cfoster.sedna.xmldb.h.a(Unknown Source)
- locked <0x00000000c1c27948> (a net.cfoster.sedna.xmldb.h)
at net.cfoster.sedna.DatabaseImpl.getCollection(Unknown Source)

Another thread, let's call it B, waits for the lock:
"ajp-bio-8059-exec-22" daemon prio=10 tid=0x0000000018c36800 nid=0x4c8e
waiting for monitor entry [0x0000000041135000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at net.cfoster.sedna.xmldb.h.a(Unknown Source)
- waiting to lock <0x00000000c1c27948> (a net.cfoster.sedna.xmldb.h)
at net.cfoster.sedna.DatabaseImpl.getCollection(Unknown Source)
at org.xmldb.api.DatabaseManager.getCollection(DatabaseManager.java:206)

On Sedna side the thread A database connection is processed and COMMITED in
the logs:
[tr_common_funcs.cpp:on_transaction_end:414]: Transaction has been COMMITED
[sp.c:sp_send_msg:64]: Message send : type 250, len 0

The thread B database connection is processed by Sedna and ends up waiting
for the client's response:
[sp.c:sp_send_msg:64]: Message send : type 360, len 1086
[sp.c:sp_send_msg:64]: Message send : type 370, len 0

The client thread B does not respond as it waits for the lock and the
corresponding Sedna connection is eventually closed by XML:DB API
implementation with the message:
Warning: an open Sedna XML:DB Collection/Connection was just automatically
closed due to inactivity, you SHOULD close these manually.

Meanwhile thread A successfully completes the processing and returns the
response while still keeping the lock.

@Ilya, so it's definitely the client-side issue, related either to JDK or
to Charles Foster implementation. I've upgraded to the latest JDK 1.8.0_25
just for a check. If it does not help, I may ask Charles for a paid support
as he suggested.

Best regards,
Ivan Lagunov


On Tue, Oct 28, 2014 at 9:41 PM, Ilya Taranov <epsi...@socio.msu.ru> wrote:

> Hi, Ivan,
>
>   Unfortunately, I'm too busy at my full-time job currently, so sometimes,
> I just don't have a chance to take a look at your dropbox file during the
> day. Sorry for that.
>
>   What I have seen so far, is that everything confirms my suspicion of the
> client somehow involved. After 27478 sent 370 message, it awaits for
> 310/220/225 from client to continue, holding locks. That's perfectly fine.
> It's hard to tell anything without understanding why client is not willing
> to send it. May that be some kind of hang? I wonder what the client is
> doing, could you take a look at the client backtrace during the hang (like
> the one returned by jstack/sending SIG_QUIT)?
>
>   Index problem you mentioned is a separate issue, I will take a look at.
>
> Thanks,
> Ilya Taranov.
>
>
> On Tue, Oct 28, 2014 at 1:09 PM, Ivan Lagunov <lagi...@gmail.com> wrote:
>
>> I've faced two more hangings today. Again I'm posting only those
>> connections that were created by our suspected application. Besides, I'm
>> adding the last debug Message from the event.log for each transaction.
>>
>> https://www.dropbox.com/s/istotjr41vrmmd5/20141028001023.ZIP?dl=0
>>
>> se_trn 21097 - fast XQuery, COMMITED, Message recv : type 210, len 0
>> se_trn 21114 - XQuery categorytree:getCategoryTree(), not COMMITED, gdb
>> thread 1 in select, Message send : type 370, len 0
>> se_trn 21127 - fast XQuery, COMMITED, Message recv : type 210, len 0
>> se_trn 21229 - Session is ready
>>
>> So this hanging issue looks completely the same as those of yesterday.
>> However, yesterday's problem has been fixed and now it happens roughly once
>> in several hundreds cases.
>>
>> https://www.dropbox.com/s/ru6r6rplar3fat9/20141028151343.ZIP?dl=0
>>
>> se_trn 27478 - fast XQuery basictypes:getBasicType(), not COMMITED, gdb
>> thread 1 in select, Message send : type 370, len 0
>> se_trn 27932 - fast XQuery, COMMITED, Message recv : type 210, len 0
>> se_trn 28030 - Session is ready
>>
>> Obviously, the issue has not gone. Moreover, the second case confirms
>> that the database can hang on various XQueries as before. As the
>> yesterday's issue looks really the same but just worse, I've come to a
>> conclusion that other hangings may be related to the Java version too. I'm
>> going to try running the application under a different Java version.
>>
>> @Ilya, could you comment on the provided event.log with the debug
>> messages generated by your patch? I'm curious if it can give better insight
>> into the problem.
>>
>> Best regards,
>> Ivan Lagunov
>>
>> On Mon, Oct 27, 2014 at 5:46 PM, Ivan Lagunov <lagi...@gmail.com> wrote:
>>
>>> Hi Ilya,
>>>
>>> Finally I've got some clues about Sedna hangings. Today the same
>>> application that is under suspicion has been deployed by another developer.
>>> It has been built on the developer's machine so apparently the JDK has been
>>> different from the JDK that is used for running the application on the
>>> server. I'm still waiting for more details on this matter. But this
>>> deployment has resulted in a series of Sedna hangings. Although it differs
>>> from the hanging scenarios that we had faced before, it may still be
>>> somehow related.
>>>
>>> Here is the archive with logs of the first hanging issue of today:
>>>
>>> https://www.dropbox.com/s/ufwayhwx7ersy88/20141027131246.ZIP?dl=0
>>>
>>> These are the transaction processes created by the suspected application:
>>> se_trn 17209 - XQuery categorytree:getCategoryTree(), not COMMITED, gdb
>>> thread 1 in select
>>> se_trn 17217 - Session is ready
>>>
>>> All other today's hangings have been caused by the same XQuery
>>> categorytree:getCategoryTree(). It simply loads and returns a big document
>>> TreeExport.xml (about 2.5MB) from the collection "treeStructures/released".
>>> It's very likely that the "Sedna index subsystem error" from my
>>> previous email is also related to this. There have been no more hangings
>>> after I rebuilt and redeployed the Java application on the server using the
>>> same "good" JDK that is used for running.
>>>
>>> @Ilya, could you please look into the event.log and say if the client
>>> indeed makes Sedna wait as you had suggested before?
>>>
>>> This is an extract from event.log:
>>>
>>> LOG   27/10/2014 12:51:43 (TRN nxp pid=17209 sid=44 trid=-1)
>>> [sp.c:sp_recv_msg:34]: Message recv : type 210, len 0
>>> LOG   27/10/2014 12:51:43 (TRN nxp pid=17209 sid=44 trid=48)
>>> [sp.c:sp_send_msg:64]: Message send : type 230, len 0
>>> LOG   27/10/2014 12:51:43 (TRN nxp pid=17209 sid=44 trid=48)
>>> [sp.c:sp_recv_msg:34]: Message recv : type 300, len 220
>>> LOG   27/10/2014 12:51:43 (TRN nxp pid=17209 sid=44 trid=48)
>>> [tr_functions.cpp:on_user_statement_begin:102]: User query:
>>> --- categorytree:getCategoryTree()
>>> LOG   27/10/2014 12:51:43 (TRN nxp pid=17209 sid=44 trid=48)
>>> [sp.c:sp_send_msg:64]: Message send : type 320, len 0
>>> LOG   27/10/2014 12:51:45 (TRN nxp pid=17209 sid=44 trid=48)
>>> [sp.c:sp_send_msg:64]: Message send : type 360, len 10240
>>> ... plenty of lines like this ...
>>> LOG   27/10/2014 12:51:45 (TRN nxp pid=17209 sid=44 trid=48)
>>> [sp.c:sp_send_msg:64]: Message send : type 360, len 10240
>>> LOG   27/10/2014 12:51:45 (TRN nxp pid=17209 sid=44 trid=48)
>>> [sp.c:sp_send_msg:64]: Message send : type 360, len 8204
>>> LOG   27/10/2014 12:51:45 (TRN nxp pid=17209 sid=44 trid=48)
>>> [sp.c:sp_send_msg:64]: Message send : type 370, len 0
>>>
>>> Just to clarify about our earlier hangings, afaik all the deployments of
>>> the suspected application were performed on the server using "good" JDK
>>> earlier. So while earlier hangings might have been somehow related to
>>> today's, the situation is quite different.
>>>
>>> Best regards,
>>> Ivan Lagunov
>>>
>>>
>>> On Mon, Oct 27, 2014 at 12:46 PM, Ivan Lagunov <lagi...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ilya,
>>>>
>>>> The patch has been applied to the night build 3.6.519, the data has
>>>> been migrated with se_exp using complete export-import. The issue has not
>>>> appear again for a week already. I'll keep monitoring Sedna.
>>>>
>>>> However, it seems there is another issue with that night build:
>>>>
>>>> LOG   27/10/2014 01:21:18 (TRN nxp pid=20091 sid=49 trid=-1)
>>>> [sp.c:sp_send_msg:64]: Message send : type 250, len 0
>>>> LOG   27/10/2014 01:21:18 (TRN nxp pid=20091 sid=49 trid=-1)
>>>> [sp.c:sp_recv_msg:34]: Message recv : type 210, len 0
>>>> LOG   27/10/2014 01:21:18 (TRN nxp pid=20091 sid=49 trid=49)
>>>> [sp.c:sp_send_msg:64]: Message send : type 230, len 0
>>>> LOG   27/10/2014 01:21:18 (TRN nxp pid=20091 sid=49 trid=49)
>>>> [sp.c:sp_recv_msg:34]: Message recv : type 300, len 76
>>>> LOG   27/10/2014 01:21:18 (TRN nxp pid=20091 sid=49 trid=49)
>>>> [tr_functions.cpp:on_user_statement_begin:102]: User query:
>>>> ---   DROP DOCUMENT "TreeExport.xml" IN COLLECTION
>>>> "treeStructures/released"
>>>> ERROR 27/10/2014 01:21:20 (TRN nxp pid=20091 sid=49 trid=49)
>>>> [btree.cpp:bt_delete_tmpl:273]: (SE1008) Sedna index subsystem error.
>>>> Details: Cannot delete object which is not in the btree
>>>> LOG   27/10/2014 01:21:20 (TRN nxp pid=20091 sid=49 trid=-1)
>>>> [tr_common_funcs.cpp:on_transaction_end:416]: Transaction has been ROLLED
>>>> BACK
>>>> LOG   27/10/2014 01:21:20 (TRN nxp pid=20091 sid=49 trid=-1)
>>>> [sp.c:sp_send_msg:64]: Message send : type 100, len 122
>>>> LOG   27/10/2014 01:21:25 (TRN nxp pid=20091 sid=49 trid=-1)
>>>> [sp.c:sp_recv_msg:34]: Message recv : type 500, len 0
>>>> LOG   27/10/2014 01:21:25 (TRN nxp pid=20091 sid=49 trid=-1)
>>>> [sp.c:sp_send_msg:64]: Message send : type 510, len 0
>>>> INFO  27/10/2014 01:21:25 (SM nxp pid=19377)
>>>> [bm_core.cpp:bm_log_out_io_statistics:537]: IO block reads:0, writes:0
>>>> LOG   27/10/2014 01:21:25 (TRN nxp pid=20091 sid=49 trid=-1)
>>>> [tr.cpp:TRmain:456]: Session is closed
>>>>
>>>> It's persistent, I could not delete the document at all until I deleted
>>>> the indexes on that collection:
>>>>   <index name="category_href" backend="btree" as_type="xs:string"
>>>> on_path="descendant::element(Category)"
>>>> by_path="attribute::attribute(href)" object_name="treeStructures/released"
>>>> object_type="collection"/>
>>>>   <index name="category_id" backend="btree" as_type="xs:string"
>>>> on_path="descendant::element(Category)"
>>>> by_path="attribute::attribute(Identifier)"
>>>> object_name="treeStructures/released" object_type="collection"/>
>>>>
>>>> Any ideas?
>>>>
>>>> Best regards,
>>>> Ivan Lagunov
>>>>
>>>>
>>>> On Fri, Oct 17, 2014 at 10:05 AM, Ilya Taranov <epsi...@socio.msu.ru>
>>>> wrote:
>>>>
>>>>> Hi, here is the patch over the latest label that should add protocol
>>>>> minimal debugging.
>>>>>
>>>>> On Wed, Oct 15, 2014 at 9:53 AM, Ivan Lagunov <lagi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ilya,
>>>>>>
>>>>>> As it's stated in the email topic, I use Sedna 3.5.615. It was built
>>>>>> from sources with RelWithDebugInfo flag enabled on the RHEL linux
>>>>>> server. If you want to add some additional debug info, I'd suggest adding
>>>>>> it on top of the latest night build though. I'm planning to upgrade 
>>>>>> anyway
>>>>>> as there were quite some nice features implemented.
>>>>>>
>>>>>> @Charles, okay, nice to know this. If only I could provide a
>>>>>> consistently failing test case. Currently the issue is quite mysterious.
>>>>>>
>>>>>> Best regards,
>>>>>> Ivan Lagunov
>>>>>>
>>>>>> On Wed, Oct 15, 2014 at 11:02 AM, Ivan Shcheklein <
>>>>>> shchekl...@gmail.com> wrote:
>>>>>>
>>>>>>> Ilya,
>>>>>>>
>>>>>>> As far as I remember. It is almost the latest dev sources built with
>>>>>>> relWithDebInfo. Internal data structures should be compatible.
>>>>>>>
>>>>>>> Ivan
>>>>>>>
>>>>>>> On Wed, Oct 15, 2014 at 2:08 AM, Ilya Taranov <epsi...@socio.msu.ru>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm not sure, latest build will fix the problem.
>>>>>>>>
>>>>>>>> It's definitely some client-to-app communication problem.
>>>>>>>> If I make a git patch for you with additional debug info, will you
>>>>>>>> be able to apply it (assuming, the data file version will be the same)?
>>>>>>>> What build do you currently use?
>>>>>>>>
>>>>>>>> On Sat, Oct 11, 2014 at 12:48 AM, Ivan Lagunov <lagi...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Ilya,
>>>>>>>>>
>>>>>>>>> I've analyzed logs for other issue occurrences and it has appeared
>>>>>>>>> more complicated.
>>>>>>>>>
>>>>>>>>> First, could you please clarify about "Session is ready"
>>>>>>>>> statement, when exactly is it dumped to the log file? I'm wondering 
>>>>>>>>> because
>>>>>>>>> when the database is hanging, those log entries still appear at some 
>>>>>>>>> random
>>>>>>>>> moments of time.
>>>>>>>>>
>>>>>>>>> Second, I'm going to provide the investigation details for several
>>>>>>>>> more issue occurrences here. Below "fast XQuery" means that it uses 
>>>>>>>>> index
>>>>>>>>> and is expected to be very fast.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://www.dropbox.com/s/2luq53cstc4t3tl/sedna_debug_info1.zip?dl=0
>>>>>>>>>
>>>>>>>>> se_trn 27173 - DROP DOCUMENT operation, COMMITED
>>>>>>>>> se_gov 29197
>>>>>>>>> se_sm 29224
>>>>>>>>> se_trn 29617 - fast XQuery basictypes:getBasicType, not COMMITED,
>>>>>>>>> gdb thread 1 in select
>>>>>>>>> se_trn 29820 - Session is ready
>>>>>>>>> se_trn 29976 - Session is ready
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://www.dropbox.com/s/scmszq9yphyv6xf/sedna_debug_info2.zip?dl=0
>>>>>>>>>
>>>>>>>>> se_gov 15890
>>>>>>>>> se_sm 15899
>>>>>>>>> se_trn 21611 - doc-available XQuery, not COMMITED, gbd thread 1 in
>>>>>>>>> nanosleep
>>>>>>>>> se_trn 21709 - doc('$version') XQuery, COMMITED, gdb thread 1 in
>>>>>>>>> select
>>>>>>>>> se_trn 21713 - doc('$version') XQuery, COMMITED, gdb thread 1 in
>>>>>>>>> select
>>>>>>>>> se_trn 21716 - doc('$version') XQuery, COMMITED, gdb thread 1 in
>>>>>>>>> select
>>>>>>>>> se_trn 21719 - doc('$version') XQuery, COMMITED, gdb thread 1 in
>>>>>>>>> select
>>>>>>>>> se_trn 21727 - LOAD STDIN operation, not COMMITED, gbd thread 1 in
>>>>>>>>> nanosleep
>>>>>>>>> se_trn 21996 - fast XQuery basictypes:getBasicType, not COMMITED,
>>>>>>>>> gdb thread 1 in select
>>>>>>>>> se_trn 22002 - document("$collections") XQuery, not COMMITED, gdb
>>>>>>>>> thread 1 in nanosleep
>>>>>>>>> se_trn 22010 - Session is ready
>>>>>>>>> se_trn 22013 - Session is ready
>>>>>>>>> se_trn 22152 - Session is ready
>>>>>>>>> se_trn 22295 - Session is ready
>>>>>>>>> se_trn 23615 - Session is ready
>>>>>>>>> se_trn 24315 - Session is ready
>>>>>>>>>
>>>>>>>>> Questions:
>>>>>>>>> 1. Does nanosleep state mean that the operation is waiting for
>>>>>>>>> another one (here it's probably 21996 XQuery in select)?
>>>>>>>>> 2. Why does doc-available wait for any other operation? I'd expect
>>>>>>>>> it to be non-blocking and to run immediately on the latest data 
>>>>>>>>> snapshot.
>>>>>>>>> 3. Why does doc('$version') wait for any other operation? It's a
>>>>>>>>> database metadata being fetched, so I'd also expect it to run 
>>>>>>>>> immediately.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://www.dropbox.com/s/3ibo4dvzapfdirw/sedna_debug_info3.zip?dl=0
>>>>>>>>>
>>>>>>>>> se_trn 641 - fast XQuery packages:getPackage, not COMMITED, gdb
>>>>>>>>> thread 1 in select
>>>>>>>>> se_trn 645 - Session is ready & COMMITED without operation
>>>>>>>>> se_trn 648 - Session is ready
>>>>>>>>> se_trn 823 - Session is ready
>>>>>>>>> se_trn 898 - Session is ready
>>>>>>>>> se_gov 8444
>>>>>>>>> se_sm 8453
>>>>>>>>>
>>>>>>>>> Questions:
>>>>>>>>> What is going on with 645? There is "Session in ready" statement
>>>>>>>>> immediately followed by "COMMITED" statement in the event.log. What
>>>>>>>>> operation was committed? Is it possible the order of statements in the
>>>>>>>>> event.log is inconsistent? Even so there is no more statements with
>>>>>>>>> pid=645, so no operation to be seen that could be committed. Very
>>>>>>>>> suspicious.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://www.dropbox.com/s/38idxmpzjv6rvxf/sedna_debug_info4.zip?dl=0
>>>>>>>>>
>>>>>>>>> se_gov 1532
>>>>>>>>> se_sm 1711
>>>>>>>>> se_trn 6837 - DROP DOCUMENT operation, not COMMITED, gdb thread 1
>>>>>>>>> in nanosleep
>>>>>>>>> se_trn 11703 - LOAD STDIN operation, not COMMITED, gdb thread 1 in
>>>>>>>>> nanosleep
>>>>>>>>> se_trn 11944 - document("$collections") XQuery, not COMMITED, gdb
>>>>>>>>> thread 1 in nanosleep
>>>>>>>>> se_trn 11972 - document("$collections") XQuery, not COMMITED, gdb
>>>>>>>>> thread 1 in nanosleep
>>>>>>>>> se_trn 12139 - XQuery collection("groupingDefinitions/wip"),
>>>>>>>>> COMMITED, gdb thread 1 in select
>>>>>>>>> se_trn 12183 - fast XQuery basictypes:getBasicType, not COMMITED,
>>>>>>>>> gdb thread 1 in select
>>>>>>>>> se_trn 12186 - XQuery collection("groupingDefinitions/wip"),
>>>>>>>>> COMMITED, gdb thread 1 in select
>>>>>>>>> se_trn 12195 - XQuery collection("groupingDefinitions/wip"),
>>>>>>>>> COMMITED, gdb thread 1 in select
>>>>>>>>> se_trn 12222 - document("$collections") XQuery, not COMMITED, gdb
>>>>>>>>> thread 1 in nanosleep
>>>>>>>>> se_trn 12302 - fast XQuery documents:getFilesByBasicTypeId, not
>>>>>>>>> COMMITED, gdb thread 1 in nanosleep
>>>>>>>>> se_trn 12307 - document("$documents") XQuery, not COMMITED, gdb
>>>>>>>>> thread 1 in nanosleep
>>>>>>>>> se_trn 12310 - document("$collections") XQuery, not COMMITED, gdb
>>>>>>>>> thread 1 in nanosleep
>>>>>>>>> se_trn 12377 - Session is ready
>>>>>>>>> se_trn 12382 - Session is ready
>>>>>>>>> se_trn 12423 - Session is ready
>>>>>>>>> se_trn 12848 - Session is ready
>>>>>>>>>
>>>>>>>>> Comment:
>>>>>>>>> I understand many of those XQueries are not optimal, they are part
>>>>>>>>> of an application developed by other devs. Now I'll take care of 
>>>>>>>>> improving
>>>>>>>>> those operations to replace document("$documents") and
>>>>>>>>> document("$collections") with doc-available and col-available 
>>>>>>>>> accordingly
>>>>>>>>> and to use an index for collection("groupingDefinitions/wip").
>>>>>>>>>
>>>>>>>>> Overall conclusion:
>>>>>>>>> There is no single operation/XQuery causing the database hanging.
>>>>>>>>> There were at least three different XQueries observed. Taking into 
>>>>>>>>> account
>>>>>>>>> the seemingly random occurrences, it looks like the protocol issue 
>>>>>>>>> indeed.
>>>>>>>>> I don't see how to apply tcpdump in this case as it's Production
>>>>>>>>> environment. Any ideas?
>>>>>>>>>
>>>>>>>>> I'd try to rebuild the database from sources and, perhaps, to take
>>>>>>>>> the latest night 3.5 (or 3.6?) build for this. Do you think it makes 
>>>>>>>>> sense
>>>>>>>>> to try?
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Ivan Lagunov
>>>>>>>>>
>>>>>>>>> On Fri, Oct 10, 2014 at 1:23 AM, Ilya Taranov <
>>>>>>>>> epsi...@socio.msu.ru> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, from the top of my head...
>>>>>>>>>>
>>>>>>>>>> 1. Queries are evaluated in a lazy fashion. Which means, that
>>>>>>>>>> before client asks to fetch the next row, nothing is evaluated.
>>>>>>>>>>
>>>>>>>>>> So select() is a socket read, which is waiting for client to make
>>>>>>>>>> a decision on whether to commit (or end RO transaction) or ask for 
>>>>>>>>>> next
>>>>>>>>>> part of the result. All locks are only released when end transaction
>>>>>>>>>> happens. That's why others are waiting for one single client.
>>>>>>>>>>
>>>>>>>>>> To my mind, it's a wrong approach, and that's not entirely how I
>>>>>>>>>> would do it right now, But that days we had ACID as a primary goal 
>>>>>>>>>> in minds.
>>>>>>>>>>
>>>>>>>>>> 2. Session is ready, means that there is no ongoing transaction,
>>>>>>>>>> but the connection is established.
>>>>>>>>>>
>>>>>>>>>> 3. No idea in this particular case, but there can be a situation
>>>>>>>>>> when session is started without client connection. 
>>>>>>>>>> Recovery/checkpoint,
>>>>>>>>>> maybe something else, I'm not sure.
>>>>>>>>>>
>>>>>>>>>> 4. That dfntly looks strange, but should not cause a lock. Once
>>>>>>>>>> transaction is committed, it should release all the locks. So even 
>>>>>>>>>> should
>>>>>>>>>> there be orphan sessions, they would not interfere the normal 
>>>>>>>>>> operation.
>>>>>>>>>>
>>>>>>>>>> 5. Yes that pretty much looks like a problem. Based on logs, it
>>>>>>>>>> looks like 17338 is waiting for client to fetch a result.
>>>>>>>>>>
>>>>>>>>>> That may be a protocol issue, by the way, when as a result of
>>>>>>>>>> malformed message both client and sedna are waiting for each other.
>>>>>>>>>>
>>>>>>>>>> I would guess, that's always happens when
>>>>>>>>>> "categorytree:getCategoryTree()" is called, could you please try 
>>>>>>>>>> that? If
>>>>>>>>>> it confirms, could you please make a wireshark/tcpdump dump of this
>>>>>>>>>> communication?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 9, 2014 at 6:42 AM, Ivan Lagunov <lagi...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Ilya,
>>>>>>>>>>>
>>>>>>>>>>> Please check my investigations below.
>>>>>>>>>>>
>>>>>>>>>>> I've collected all the debug information for another hanging
>>>>>>>>>>> today, it's available here:
>>>>>>>>>>> https://www.dropbox.com/s/g14aer9821fzx05/sedna_debug_info6.zip?dl=0
>>>>>>>>>>> Here are the statuses of all Sedna processes:
>>>>>>>>>>>
>>>>>>>>>>> se_trn 13436 - DROP DOCUMENT operation, COMMITED
>>>>>>>>>>> se_trn 16245 - LOAD STDIN operation, COMMITED, no ESTABLISHED
>>>>>>>>>>> connection on port 5050 found (only 5151)
>>>>>>>>>>> se_trn 17335 - doc('$version') XQuery, COMMITED, gdb thread 1 in
>>>>>>>>>>> select () from /lib64/libc.so.6
>>>>>>>>>>> se_trn 17338 - XQuery running about 10sec, not COMMITED, gdb
>>>>>>>>>>> thread 1 in select () from /lib64/libc.so.6
>>>>>>>>>>> se_trn 17410 - XQuery running <1sec, COMMITED
>>>>>>>>>>> se_trn 17462 - Session is ready
>>>>>>>>>>> se_trn 17554 - Session is ready
>>>>>>>>>>> se_trn 18892 - Session is ready
>>>>>>>>>>> se_trn 19124 - Session is ready
>>>>>>>>>>> se_gov 20441 - gdb threads 1 and 2 in select () from
>>>>>>>>>>> /lib64/libc.so.6
>>>>>>>>>>> se_sm 20449
>>>>>>>>>>>
>>>>>>>>>>> Comments:
>>>>>>>>>>> 1. Where gdb is not stated, all threads are in semop () from
>>>>>>>>>>> /lib64/libc.so.6 except the last thread that is in 
>>>>>>>>>>> pthread_cond_timedwait@@GLIBC_2.3.2
>>>>>>>>>>> (). I believe semop is a good state, correct me if I'm wrong.
>>>>>>>>>>> 2. All se_trn with "Session is ready" have no other entries in
>>>>>>>>>>> the latest event.log. As I collect only the latest event.log, I 
>>>>>>>>>>> cannot
>>>>>>>>>>> check what was the last operation.
>>>>>>>>>>> 3. All se_trn processes have two ESTABLISHED connections in
>>>>>>>>>>> netstat for ports 5050 and 5151 if not stated otherwise.
>>>>>>>>>>> 4. se_trn 17338, 17410, 17462 were created by the same
>>>>>>>>>>> application that is under suspicion. The database unfreezed after 
>>>>>>>>>>> the
>>>>>>>>>>> restart of that application.
>>>>>>>>>>>
>>>>>>>>>>> Issues/questions:
>>>>>>>>>>> 1. What exactly does it mean when a thread is in select()? It
>>>>>>>>>>> seems the database is waiting but for the client or for some 
>>>>>>>>>>> internal
>>>>>>>>>>> operation, e.g. fetching the data. Please clarify.
>>>>>>>>>>> 2. What exactly does "Session is ready" mean? When does it
>>>>>>>>>>> appear in the event.log? I believe those connections are okay but 
>>>>>>>>>>> I'm
>>>>>>>>>>> curious to know.
>>>>>>>>>>> 3. se_trn 16245 - how is it possible that there is no
>>>>>>>>>>> ESTABLISHED connection on port 5050 while the process is still 
>>>>>>>>>>> alive? An
>>>>>>>>>>> issue?
>>>>>>>>>>> 4. se_trn 17335 - how is it possible that thread is in select()
>>>>>>>>>>> while the transaction has been COMMITED in event.log? An issue?
>>>>>>>>>>> 5. se_trn 17338 - looks like a root cause of Sedna hanging.
>>>>>>>>>>> Currently I run it via XML:DB API implementation mentioned earlier, 
>>>>>>>>>>> more
>>>>>>>>>>> specifically I run the XQuery via XQueryService.query("xquery 
>>>>>>>>>>> string")
>>>>>>>>>>> call. Besides, I always try to close the connection in the Java 
>>>>>>>>>>> finally
>>>>>>>>>>> block, so it looks roughly like this:
>>>>>>>>>>>
>>>>>>>>>>> try {
>>>>>>>>>>>     rootCollection = DatabaseManager.getCollection(...);
>>>>>>>>>>>     ... do something ...
>>>>>>>>>>> } finally {
>>>>>>>>>>>     rootCollection.close();
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> According to event.log the XQuery was started at 09/10/2014
>>>>>>>>>>> 12:00:01.
>>>>>>>>>>> In the application log somewhere between 12:01:53 and 12:05:00
>>>>>>>>>>> appeared a message:
>>>>>>>>>>>
>>>>>>>>>>>> Warning: an open Sedna XML:DB Collection/Connection was just
>>>>>>>>>>>> automatically closed due to inactivity, you SHOULD close these 
>>>>>>>>>>>> manually.
>>>>>>>>>>>
>>>>>>>>>>> It seems to mean that the connection was considered as abandoned
>>>>>>>>>>> and removed by XML:DB API implementation (see above 
>>>>>>>>>>> conn-log-abandoned=true
>>>>>>>>>>> and conn-remove-abandoned-timeout=120000).
>>>>>>>>>>> For me it looks like the client waited for 2 minutes but did not
>>>>>>>>>>> get any response from the database. So there was an attempt to 
>>>>>>>>>>> close the
>>>>>>>>>>> connection. It's easy to understand as the database hanged at about
>>>>>>>>>>> 12:00:02 when the last transaction was COMMITED. So it may be even 
>>>>>>>>>>> that the
>>>>>>>>>>> root cause is hidden somewhere deeper. For example, this long 
>>>>>>>>>>> XQuery could
>>>>>>>>>>> cause a deadlock.
>>>>>>>>>>>
>>>>>>>>>>> I understand it may be an issue in XML:DB API implementation but
>>>>>>>>>>> its support was stopped by Charles Foster. Unfortunately, we cannot 
>>>>>>>>>>> afford
>>>>>>>>>>> to migrate to XQJ API implementation yet. The most tricky part to
>>>>>>>>>>> understand is why we had not faced similar issues before recent 
>>>>>>>>>>> occurrence
>>>>>>>>>>> when it has become a regular major issue immediately. There were no 
>>>>>>>>>>> server
>>>>>>>>>>> updates and no changes that could affect the way 
>>>>>>>>>>> transactions/connections
>>>>>>>>>>> are handled.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Ivan Lagunov
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Oct 9, 2014 at 1:33 PM, Ivan Lagunov <lagi...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry, copied the wrong link for the connection pool
>>>>>>>>>>>> documentation. Here is the correct one:
>>>>>>>>>>>> http://www.cfoster.net/pdf/sedna/xmldb/documentation.pdf#page=10
>>>>>>>>>>>>
>>>>>>>>>>>> Besides, I use the following values for the connection pool
>>>>>>>>>>>> parameters in the suspected application:
>>>>>>>>>>>> conn-max-active=50
>>>>>>>>>>>> conn-max-wait=30000
>>>>>>>>>>>> conn-log-abandoned=true
>>>>>>>>>>>> conn-remove-abandoned-timeout=120000
>>>>>>>>>>>> conn-abandoned-check-interval=5000
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Ivan Lagunov
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Oct 9, 2014 at 1:28 PM, Ivan Lagunov <lagi...@gmail.com
>>>>>>>>>>>> > wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ilya,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I use Charles Foster implementation of XML:DB API, version
>>>>>>>>>>>>> 1.2.6 without any customizations:
>>>>>>>>>>>>> http://www.cfoster.net/sedna/xmldb/download/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is the link again to the connection pool documentation
>>>>>>>>>>>>> (it's a part of the above implementation):
>>>>>>>>>>>>> http://www.cfoster.net/sedna/xmldb/download/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Meanwhile, I've identified the application that makes Sedna
>>>>>>>>>>>>> wait. At least restarting the application has unfreezed the 
>>>>>>>>>>>>> database twice
>>>>>>>>>>>>> already. I'll analyze the application log along with Sedna 
>>>>>>>>>>>>> event.log
>>>>>>>>>>>>> shortly. Hopefully, I'll find the root cause.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Ivan Lagunov
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Oct 9, 2014 at 3:50 AM, Ilya Taranov <
>>>>>>>>>>>>> epsi...@socio.msu.ru> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can easily see the scenario, where one of the clients
>>>>>>>>>>>>>> doesn't terminate the transaction properly. Unfortunately, 
>>>>>>>>>>>>>> current
>>>>>>>>>>>>>> application timeout doesn't apply for client hangs. The issue 
>>>>>>>>>>>>>> can be easily
>>>>>>>>>>>>>> fixed, however.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Unfortunately, even fix still would not solve your problem,
>>>>>>>>>>>>>> you would still have application timeout wait on locks. Probably 
>>>>>>>>>>>>>> the most
>>>>>>>>>>>>>> straightforward way to find the cause for you would be to log 
>>>>>>>>>>>>>> connection
>>>>>>>>>>>>>> checkins/checkouts at the client driver level.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Could you please give me a link to the connection pool +
>>>>>>>>>>>>>> driver you currently use (including all your changes if 
>>>>>>>>>>>>>> possible) so I
>>>>>>>>>>>>>> could help you with that?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 7, 2014 at 4:37 AM, Ivan Lagunov <
>>>>>>>>>>>>>> lagi...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Ilya,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Your suggestion seems to be correct. After I've restarted
>>>>>>>>>>>>>>> three Java applications that had connections to Sedna, the 
>>>>>>>>>>>>>>> database has
>>>>>>>>>>>>>>> resumed processing operations. Here is the event.log where 
>>>>>>>>>>>>>>> Sedna has hanged
>>>>>>>>>>>>>>> between 12:35:03 and 13:06:05:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://www.dropbox.com/s/pzafuvzcvfchxwa/event-2014-10-07-13-15-04-018.zip?dl=0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure which of the applications caused it though. It
>>>>>>>>>>>>>>> still worries me as the queries before hanging look quite 
>>>>>>>>>>>>>>> simple and I
>>>>>>>>>>>>>>> cannot understand which client operation could have hanged the 
>>>>>>>>>>>>>>> whole
>>>>>>>>>>>>>>> database.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Ivan Lagunov
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 6, 2014 at 12:29 PM, Ivan Lagunov <
>>>>>>>>>>>>>>> lagi...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Ilya,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, I use a connection pool via Sedna XML:DB API
>>>>>>>>>>>>>>>> implementation. It's documented here:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> http://www.cfoster.net/pdf/sedna/xmldb/documentation.pdf#page=10
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Unfortunately, there is no support for active connection
>>>>>>>>>>>>>>>> timeout. This could be the cause if there was a query hanging 
>>>>>>>>>>>>>>>> for some
>>>>>>>>>>>>>>>> reason. Still if some query hangs forever (that should not as 
>>>>>>>>>>>>>>>> it's in my
>>>>>>>>>>>>>>>> case) it looks more like a database bug/deadlock situation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll try as you suggest next time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> P.S. Sorry for the second email as I've forgotten to add CC
>>>>>>>>>>>>>>>> first.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>> Ivan Lagunov
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Oct 3, 2014 at 10:23 PM, Ilya Taranov <
>>>>>>>>>>>>>>>> epsi...@socio.msu.ru> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ivan, do you have a connection pool? I see one of the
>>>>>>>>>>>>>>>>> se_trn processes waiting for socket read.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If you don't have a connection pool, it just might be a
>>>>>>>>>>>>>>>>> client problem. I'll take a deeper look anyway, just next 
>>>>>>>>>>>>>>>>> time it happens,
>>>>>>>>>>>>>>>>> could you please try to shutdown the client thread, if it's 
>>>>>>>>>>>>>>>>> even possible.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Oct 2, 2014 at 5:30 AM, Ivan Lagunov <
>>>>>>>>>>>>>>>>> lagi...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I've been facing major issues with Sedna hanging
>>>>>>>>>>>>>>>>>> regularly for the last two weeks (after several months of 
>>>>>>>>>>>>>>>>>> stable work). I
>>>>>>>>>>>>>>>>>> tried to do the complete export-import of data using se_exp 
>>>>>>>>>>>>>>>>>> but Sedna still
>>>>>>>>>>>>>>>>>> hangs regularly. Looking into event.log does not help as 
>>>>>>>>>>>>>>>>>> queries differ
>>>>>>>>>>>>>>>>>> every time. Sometimes it looks like Sedna hangs after some 
>>>>>>>>>>>>>>>>>> heavy query
>>>>>>>>>>>>>>>>>> (with huge response) while in other cases there are no heavy 
>>>>>>>>>>>>>>>>>> queries to
>>>>>>>>>>>>>>>>>> suspect. I've written a script to collect the following 
>>>>>>>>>>>>>>>>>> logs/dumps you will
>>>>>>>>>>>>>>>>>> find in the archives:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> event.log - the last Sedna event log till the moment DB
>>>>>>>>>>>>>>>>>> hangs and is stopped;
>>>>>>>>>>>>>>>>>> sedna_procs.txt - list of all Sedna processes (ps -ef);
>>>>>>>>>>>>>>>>>> netstat.txt - list of all ESTABLISHED connections;
>>>>>>>>>>>>>>>>>> ipcs.txt - results of ipcs -s;
>>>>>>>>>>>>>>>>>> <pid>.txt - results of gdb with bt for all threads of a
>>>>>>>>>>>>>>>>>> Sedna process (available for all processes at the moment of 
>>>>>>>>>>>>>>>>>> hanging). This
>>>>>>>>>>>>>>>>>> is collected automatically for threads 1-4 of each process.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm providing two sets of logs/dumps for two separate
>>>>>>>>>>>>>>>>>> hanging events:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 25th September - seemed to hang without any heavy queries:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/2luq53cstc4t3tl/sedna_debug_info1.zip?dl=0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2nd October - hanged after a heavy query:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://www.dropbox.com/s/scmszq9yphyv6xf/sedna_debug_info2.zip?dl=0
>>>>>>>>>>>>>>>>>> The heavy query occurs at LOG   02/10/2014 11:10:20
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you can find out the cause of the hangings, it'd be
>>>>>>>>>>>>>>>>>> the best. Otherwise it'd be nice to hear some 
>>>>>>>>>>>>>>>>>> recommendations at least on
>>>>>>>>>>>>>>>>>> how to avoid these issues/minimize an impact. May be I need 
>>>>>>>>>>>>>>>>>> to upgrade to
>>>>>>>>>>>>>>>>>> some later version but I'm currently using 3.5.615 that is 
>>>>>>>>>>>>>>>>>> already a night
>>>>>>>>>>>>>>>>>> build that we had to upgrade to after some older issue 
>>>>>>>>>>>>>>>>>> (reported back in
>>>>>>>>>>>>>>>>>> March-April of 2013). I'd try some 3.5 or 3.6 version but 
>>>>>>>>>>>>>>>>>> without knowing
>>>>>>>>>>>>>>>>>> the reason there is no guarantee the potential bug is fixed 
>>>>>>>>>>>>>>>>>> in the later
>>>>>>>>>>>>>>>>>> night build.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Also it'd be really nice if there was an official release
>>>>>>>>>>>>>>>>>> for 3.5 with all the latest bug fixes (I see the last night 
>>>>>>>>>>>>>>>>>> build is
>>>>>>>>>>>>>>>>>> 3.5.627 of 8th May 2013).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>> Ivan Lagunov
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>> Meet PCI DSS 3.0 Compliance Requirements with EventLog
>>>>>>>>>>>>>>>>>> Analyzer
>>>>>>>>>>>>>>>>>> Achieve PCI DSS 3.0 Compliant Status with Out-of-the-box
>>>>>>>>>>>>>>>>>> PCI DSS Reports
>>>>>>>>>>>>>>>>>> Are you Audit-Ready for PCI DSS 3.0 Compliance? Download
>>>>>>>>>>>>>>>>>> White paper
>>>>>>>>>>>>>>>>>> Comply to PCI DSS 3.0 Requirement 10 and 11.5 with
>>>>>>>>>>>>>>>>>> EventLog Analyzer
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=154622311&iu=/4140/ostg.clktrk
>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>> Sedna-discussion mailing list
>>>>>>>>>>>>>>>>>> Sedna-discussion@lists.sourceforge.net
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/sedna-discussion
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
------------------------------------------------------------------------------
_______________________________________________
Sedna-discussion mailing list
Sedna-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sedna-discussion

Reply via email to