[GitHub] trafficserver issue #1476: Crash in get_client_addr()

2017-03-05 Thread shinrich
Github user shinrich commented on the issue:

https://github.com/apache/trafficserver/issues/1476
  
I had two more asserts once I added back in the accept thread.  Upon deeper 
inspection, the last alloc time and the UA_BEGIN time were off by half a ms, so 
I am attributing that to clock skew between threads.  I think the best action 
for 7.1.0 is to revert commit c1ac5f8bf87fd4bc3a8e06507219970d83965acd and 
investigate how to safely bubble up errors for 7.1.1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] trafficserver issue #1476: Crash in get_client_addr()

2017-03-04 Thread suenway
Github user suenway commented on the issue:

https://github.com/apache/trafficserver/issues/1476
  

This is an automatically generated message.

sun...@yahoo-inc.com is no longer with Yahoo! Inc.

Your message will not be forwarded.

If you require assistance with a legal matter, please send a message to 
legal-noti...@yahoo-inc.com

Thank you!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] trafficserver issue #1476: Crash in get_client_addr()

2017-03-04 Thread shinrich
Github user shinrich commented on the issue:

https://github.com/apache/trafficserver/issues/1476
  
Have been adding member variables to NetVConnection and adding assert to 
HttpSM::state_api_callout.

In the cores it seemed that the crash always happened in the first hook 
called from HttpSM::state_read_server_response_header.  As noted in the 
previous comment, the event came in on the server side VC, and the 
state_read_server_response_header is getting ready to send the response of the 
UA side VC.   I added allocate thread, free_thread, and last_alloc time to the 
netvc.  And added the following assert at the beginning of NetVConnection 

{code}
ink_release_assert(milestones[TS_MILESTONE_UA_CLOSE] > 0 || 
ua_session->get_netvc() == NULL ||
(ua_session->get_create_time() != 0 && 
ink_hrtime_to_msec(ua_session->get_create_time()) <= 
ink_hrtime_to_msec(milestones[TS_MILESTONE_UA_BEGIN])));
{code}

This will trigger if there is not last_alloc_time set (it has been freed) 
or the create time of the VC is newer than the start of the transaction (has 
been reallocated).  This assert would trigger before we called the hooks 
implying that the client_vc has been freed in a previous stack but the 
ua_session reference has not been cleaned up.  The only place I could see this 
happening is in read_signal_and_update and write_signal_and_update of the 
corresponding vio._cont is NULL.  I added warnings in the NULL path, and they 
triggered a handful of times with event=EVENT_ERROR (3).  

The error bubble addition (which we were concerned about from issue #1401) 
adds calls to read_signal_error and write_signal_error.  So I reverted that 
commit (c1ac5f8bf87fd4bc3a8e06507219970d83965acd) and removed my workarounds 
for issue #1401.  I let it run overnight and it crashed twice on my assert, but 
in the accept stack not the send response stack.  The timestamps varied in the 
low microseconds, so I am doubting the accuracy of our timestamping at the 
micro and nano seconds.  I added the ink_hrtime_to_msec to the asserts, and 
kicked it off to run today.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] trafficserver issue #1476: Crash in get_client_addr()

2017-03-03 Thread shinrich
Github user shinrich commented on the issue:

https://github.com/apache/trafficserver/issues/1476
  
Still struggling with this issue.  Running into it every 2-4 hours in 
production.  At this point, I think it is a use-after-free, but as noted avoid 
ASAN fails in this case.  

Looking at a core more closely, it seems to be an incoming client VC that 
is being messed up.  In the core, the remote_addr of the VC did not match the 
src_addr in the HttpSM.  Both IP addresses show up in our squid.log at roughly 
the same time.

We are running with accept threads.  I'm going to turn that off on my test 
machine,  I assume that will eliminate this race condition.  If so, then will 
need to press on and see how we are freeing the VC early.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] trafficserver issue #1476: Crash in get_client_addr()

2017-02-27 Thread shinrich
Github user shinrich commented on the issue:

https://github.com/apache/trafficserver/issues/1476
  
Very odd bug.  Most of the crashes are in get_client_addr or 
get_remote_addr.  In both cases, the problem is at the point of the virtual 
method call into the corresponding test.  The RAX register is either 0 or 1.  
But when you go through the assembly instructions based on the current register 
values, the function pointer should be valid.

Tried this on two different machines in case it was a HW problem.  Still 
see the problem on both machine.   Set counters to see if the crash happens 
when calling into the set method or on return.  The crash happens before the 
set method is called.

At this point, I think it is a use-after-free problem.  If I do the 
function calculation off a VC object (which would be the pointer in the virtual 
function table after free), the value is a 0 or 1.

I've tried to set up ASAN builds but they are not working for me at the 
moment (issue filed).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] trafficserver issue #1476: Crash in get_client_addr()

2017-02-21 Thread shinrich
GitHub user shinrich opened an issue:

https://github.com/apache/trafficserver/issues/1476

Crash in get_client_addr()

After working through issues #1401 and #1443, we now see the following 
crash in our copy of 7.1.  The crash occurs once every couple hours.

```
#0  0x2b46debb3867 in ?? () from /lib64/libgcc_s.so.1
#1  0x2b46debb4119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x2b46deeb9936 in backtrace () from /lib64/libc.so.6
#3  0x2b46dc70fe72 in ink_stack_trace_dump () at ink_stack_trace.cc:61
#4  0x2b46dc71213a in signal_crash_handler (signo=11) at signals.cc:186
#5  0x00513414 in crash_logger_invoke (signo=11, 
info=0x2b46e39dc130, ctx=0x2b46e39dc000) at Crash.cc:169
#6  
#7  0x0001 in ?? () at ../../lib/ts/Diags.h:142
#8  0x005126d5 in NetVConnection::get_remote_addr 
(this=0x2aad1cc4d0a0)
at 
/home/shinrich/yats_build/trafficserver/iocore/net/P_NetVConnection.h:30
#9  0x0055deea in ProxyClientSession::get_client_addr 
(this=0x2ab0dc25bba0) at ProxyClientSession.h:221
#10 0x005352bf in TSHttpSsnClientAddrGet (ssnp=0x2ab0dc25bba0) at 
InkAPI.cc:5439
#11 0x0053530d in TSHttpTxnClientAddrGet (txnp=0x2b474440c9f0) at 
InkAPI.cc:5447
#12 0x2e4008f6 in http_hook (contp=0x1e42f40, event=60006, 
edata=0x2b474440c9f0) at INKPluginInit.cc:174
#13 0x0052a5fd in INKContInternal::handle_event (this=0x1e42f40, 
event=60006, edata=0x2b474440c9f0) at InkAPI.cc:1048
#14 0x005160f2 in Continuation::handleEvent (this=0x1e42f40, 
event=60006, data=0x2b474440c9f0)
at 
/home/shinrich/yats_build/trafficserver/iocore/eventsystem/I_Continuation.h:153
#15 0x0052adf3 in APIHook::invoke (this=0x1e44f40, event=60006, 
edata=0x2b474440c9f0) at InkAPI.cc:1267
#16 0x005e5e16 in HttpSM::state_api_callout (this=0x2b474440c9f0, 
event=0, data=0x0) at HttpSM.cc:1462
#17 0x005f3a11 in HttpSM::do_api_callout_internal 
(this=0x2b474440c9f0) at HttpSM.cc:5171
#18 0x00601b85 in HttpSM::do_api_callout (this=0x2b474440c9f0) at 
HttpSM.cc:438
#19 0x005e78d6 in HttpSM::state_read_server_response_header 
(this=0x2b474440c9f0, event=100, data=0x2b4744cac508)
at HttpSM.cc:1962
#20 0x005ea331 in HttpSM::main_handler (this=0x2b474440c9f0, 
event=100, data=0x2b4744cac508) at HttpSM.cc:2662
#21 0x005160f2 in Continuation::handleEvent (this=0x2b474440c9f0, 
event=100, data=0x2b4744cac508)
at 
/home/shinrich/yats_build/trafficserver/iocore/eventsystem/I_Continuation.h:153
#22 0x0078489f in read_signal_and_update (event=100, 
vc=0x2b4744cac3e0) at UnixNetVConnection.cc:145
#23 0x00787a10 in UnixNetVConnection::readSignalAndUpdate 
(this=0x2b4744cac3e0, event=100) at UnixNetVConnection.cc:1125
#24 0x0076a098 in SSLNetVConnection::net_read_io 
(this=0x2b4744cac3e0, nh=0x2b46e18ade60, lthread=0x2b46e18aa010)
at SSLNetVConnection.cc:587
#25 0x0077d20d in NetHandler::mainNetEvent (this=0x2b46e18ade60, 
event=5, e=0x1873f20) at UnixNet.cc:509
#26 0x005160f2 in Continuation::handleEvent (this=0x2b46e18ade60, 
event=5, data=0x1873f20)
at 
/home/shinrich/yats_build/trafficserver/iocore/eventsystem/I_Continuation.h:153
#27 0x007a69c5 in EThread::process_event (this=0x2b46e18aa010, 
e=0x1873f20, calling_code=5) at UnixEThread.cc:143
#28 0x007a6eb1 in EThread::execute (this=0x2b46e18aa010) at 
UnixEThread.cc:270
#29 0x007a6089 in spawn_thread_internal (a=0x1762fe0) at 
Thread.cc:84
#30 0x2b46df156aa1 in start_thread () from /lib64/libpthread.so.0
#31 0x2b46deea393d in clone () from /lib64/libc.so.6
```
I assume that there has been a goodly amount of inlining here so it is hard 
to tell how we got from NetVConnection::get_remote_addr to the Diags::on 
method.  Digging through the core, the cached member addresses seem ok, and I 
don't see evidence of thread races.

I think I have seen this stack reported before, but couldn't find the issue.






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---