[jira] [Resolved] (TS-5039) Crash when connecting to downed parent.

2016-11-18 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-5039.

Resolution: Cannot Reproduce

> Crash when connecting to downed parent.
> ---
>
> Key: TS-5039
> URL: https://issues.apache.org/jira/browse/TS-5039
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core
>Reporter: James Peach
>Assignee: Thomas Jackson
>Priority: Blocker
> Fix For: 7.1.0
>
>
> If you configure a parent proxy on {{127.0.0.1:}} and never start the 
> service, Traffic Server will crash when it fails to connect. Looks like this 
> regression was introduced in TS-4796.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-5039) Crash when connecting to downed parent.

2016-11-18 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15676954#comment-15676954
 ] 

Thomas Jackson commented on TS-5039:


Hmm, well I guess we close as can't repro then :/




> Crash when connecting to downed parent.
> ---
>
> Key: TS-5039
> URL: https://issues.apache.org/jira/browse/TS-5039
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core
>Reporter: James Peach
>Assignee: Thomas Jackson
>Priority: Blocker
> Fix For: 7.1.0
>
>
> If you configure a parent proxy on {{127.0.0.1:}} and never start the 
> service, Traffic Server will crash when it fails to connect. Looks like this 
> regression was introduced in TS-4796.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-5039) Crash when connecting to downed parent.

2016-11-16 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15670962#comment-15670962
 ] 

Thomas Jackson commented on TS-5039:


[~jpe...@apache.org] I spent a few minutes, but I can't seem to reproduce this 
case (probably due to misconfiguration, since I don't use the parent proxy 
stuff). Mind sharing the configs and/or the backtrace of the crash?

> Crash when connecting to downed parent.
> ---
>
> Key: TS-5039
> URL: https://issues.apache.org/jira/browse/TS-5039
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core
>Reporter: James Peach
>Assignee: Thomas Jackson
>Priority: Blocker
> Fix For: 7.1.0
>
>
> If you configure a parent proxy on {{127.0.0.1:}} and never start the 
> service, Traffic Server will crash when it fails to connect. Looks like this 
> regression was introduced in TS-4796.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-5039) Crash when connecting to downed parent.

2016-11-15 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15667666#comment-15667666
 ] 

Thomas Jackson commented on TS-5039:


I should be able to take a look at this this week; doesn't sound good >.<

> Crash when connecting to downed parent.
> ---
>
> Key: TS-5039
> URL: https://issues.apache.org/jira/browse/TS-5039
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core
>Reporter: James Peach
>Assignee: Thomas Jackson
>Priority: Blocker
> Fix For: 7.1.0
>
>
> If you configure a parent proxy on {{127.0.0.1:}} and never start the 
> service, Traffic Server will crash when it fails to connect. Looks like this 
> regression was introduced in TS-4796.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TS-5052) Segfault in HostDB sync if something fails while not holding the parent continuation mutex

2016-11-11 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson closed TS-5052.
--
Resolution: Cannot Reproduce

Turns out this is only applicable to some backport builds I've made-- not 
upstream (yay!)

> Segfault in HostDB sync if something fails while not holding the parent 
> continuation mutex
> --
>
> Key: TS-5052
> URL: https://issues.apache.org/jira/browse/TS-5052
> Project: Traffic Server
>  Issue Type: Bug
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> What we noticed was the following in traffic.out:
> {code}
> Server {0x2af761e0d700} WARNING:  
> Unable to create temporary file /var/trafficserver/
> host.db.syncing, unable to persist hostdb: -13 error:Permission denied
> traffic_server: Segmentation fault (Address not mapped to object 
> [0x28])traffic_server - STACK TRACE: 
> {code}
> Which led me to dig into it-- and it turns out the issue is related to 
> changes after the HostDB rewrite to move syncing outside of the main NET 
> threads. Before, all calls to this syncer were done in a single net thread, 
> wherever it was initially scheduled. Now we bounce between ET_TASK threads 
> and ET_NET threads (to avoid switching, lock contention, etc.)-- but the 
> error handlers weren't updated to handle this situation.
> So to fix this, I've created "set_error" and "return_error" methods on the 
> RefCountCacheSerializer which take this into consideration-- specifically, 
> it will immediately return the error if scheduled in the calling thread-- 
> otherwise it'll reschedule onto that thread *then* return the error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-5052) Segfault in HostDB sync if something fails while not holding the parent continuation mutex

2016-11-11 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-5052:
--

Assignee: Thomas Jackson

> Segfault in HostDB sync if something fails while not holding the parent 
> continuation mutex
> --
>
> Key: TS-5052
> URL: https://issues.apache.org/jira/browse/TS-5052
> Project: Traffic Server
>  Issue Type: Bug
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> What we noticed was the following in traffic.out:
> {code}
> Server {0x2af761e0d700} WARNING:  
> Unable to create temporary file /var/trafficserver/
> host.db.syncing, unable to persist hostdb: -13 error:Permission denied
> traffic_server: Segmentation fault (Address not mapped to object 
> [0x28])traffic_server - STACK TRACE: 
> {code}
> Which led me to dig into it-- and it turns out the issue is related to 
> changes after the HostDB rewrite to move syncing outside of the main NET 
> threads. Before, all calls to this syncer were done in a single net thread, 
> wherever it was initially scheduled. Now we bounce between ET_TASK threads 
> and ET_NET threads (to avoid switching, lock contention, etc.)-- but the 
> error handlers weren't updated to handle this situation.
> So to fix this, I've created "set_error" and "return_error" methods on the 
> RefCountCacheSerializer which take this into consideration-- specifically, 
> it will immediately return the error if scheduled in the calling thread-- 
> otherwise it'll reschedule onto that thread *then* return the error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-5052) Segfault in HostDB sync if something fails while not holding the parent continuation mutex

2016-11-11 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-5052:
--

 Summary: Segfault in HostDB sync if something fails while not 
holding the parent continuation mutex
 Key: TS-5052
 URL: https://issues.apache.org/jira/browse/TS-5052
 Project: Traffic Server
  Issue Type: Bug
Reporter: Thomas Jackson


What we noticed was the following in traffic.out:

{code}
Server {0x2af761e0d700} WARNING:  
Unable to create temporary file /var/trafficserver/
host.db.syncing, unable to persist hostdb: -13 error:Permission denied
traffic_server: Segmentation fault (Address not mapped to object 
[0x28])traffic_server - STACK TRACE: 
{code}

Which led me to dig into it-- and it turns out the issue is related to changes 
after the HostDB rewrite to move syncing outside of the main NET threads. 
Before, all calls to this syncer were done in a single net thread, wherever it 
was initially scheduled. Now we bounce between ET_TASK threads and ET_NET 
threads (to avoid switching, lock contention, etc.)-- but the error handlers 
weren't updated to handle this situation.

So to fix this, I've created "set_error" and "return_error" methods on the 
RefCountCacheSerializer which take this into consideration-- specifically, it 
will immediately return the error if scheduled in the calling thread-- 
otherwise it'll reschedule onto that thread *then* return the error.
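
For illustration, here is a minimal, self-contained sketch of the 
reschedule-then-return pattern described above. The names (Syncer, 
reschedule_on_scheduling_thread) are hypothetical stand-ins, not the actual 
RefCountCacheSerializer/EThread API:

{code}
#include <functional>
#include <thread>

// Hypothetical sketch: remember which thread scheduled the sync, and route
// all error handling back to that thread.
struct Syncer {
  std::thread::id scheduling_thread; // thread the sync was scheduled from
  std::function<void(std::function<void()>)> reschedule_on_scheduling_thread;

  void set_error(int err) {
    if (std::this_thread::get_id() == scheduling_thread) {
      // Already on the calling thread: safe to return the error immediately.
      return_error(err);
    } else {
      // On a helper (ET_TASK-style) thread: bounce back onto the scheduling
      // thread *then* return the error, so the error path always runs where
      // the parent continuation (and its mutex) expects it.
      reschedule_on_scheduling_thread([this, err] { return_error(err); });
    }
  }

  void return_error(int err) {
    (void)err;
    // ... tear down the sync attempt and propagate err to the parent ...
  }
};
{code}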



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TS-3067) ATS runs DNS query lookups against IPs in URLs

2016-10-26 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson closed TS-3067.
--
Resolution: Cannot Reproduce

From the details in this ticket I'm unable to reproduce this issue. Sadly 
there was a paste link-- but it is now empty; presumably the info in the paste 
would have made this debuggable.

If you are still seeing this issue, please re-open the ticket and attach the 
paste, and we can dig in.

> ATS runs DNS query lookups against IPs in URLs 
> --
>
> Key: TS-3067
> URL: https://issues.apache.org/jira/browse/TS-3067
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 5.1.0
>Reporter: Luca Rea
>Assignee: Thomas Jackson
> Fix For: sometime
>
>
> When URLs contain IPs, ATS, in some hooks, tries to do a reverse lookup, and 
> if the IP does not have a PTR record it waits for the timeout, returning the 
> answer only after many seconds.
> http://apaste.info/yLg



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TS-4968) Log a warning if connect_attempts_rr_retries is >= connect_attempts_max_retries

2016-10-26 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-4968.

Resolution: Fixed

> Log a warning if connect_attempts_rr_retries is >= 
> connect_attempts_max_retries
> ---
>
> Key: TS-4968
> URL: https://issues.apache.org/jira/browse/TS-4968
> Project: Traffic Server
>  Issue Type: Improvement
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.1.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> If connect_attempts_rr_retries >= connect_attempts_max_retries, requests going 
> to an RR DNS record will never be redispatched. To make it a bit more 
> obvious-- I think we should log a warning when we load the configs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4796) ATS not closing origin connections on first RST from client

2016-10-26 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4796:
---
Backport to Version: 6.1.2

> ATS not closing origin connections on first RST from client
> ---
>
> Key: TS-4796
> URL: https://issues.apache.org/jira/browse/TS-4796
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.1.0
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> *TLDR; similar to TS-4720 -- slower to close than it should, instead of never 
> closing*
> As a continuation of TS-4720, while testing that the session is closed when 
> we expect-- I found that it isn't.
> Although we are now closing the sessions, we aren't doing it as quickly as we 
> should. In this client abort case we expect the client to abort, and ATS 
> should initially continue to send bytes to the client-- as we are in the 
> half-open state. After the first set of bytes are sent to the client-- the 
> client will send an RST-- which should signal ATS to stop sending the request 
> (and tear down the origin connection etc.).
> I'm able to reproduce this locally, and the debug output (with some 
> additional comments) looks like below:
> {code}
> < FIN FROM CLIENT >
> [Aug 29 18:25:07.491] Server {0x7effa538a800} DEBUG:  (main_handler)> (http) [0] [HttpSM::main_handler, VC_EVENT_EOS]
> [Aug 29 18:25:07.491] Server {0x7effa538a800} DEBUG:  (state_watch_for_client_abort)> (http) [0] 
> [::state_watch_for_client_abort, VC_EVENT_EOS]
> < RST FROM CLIENT >
> Got an HttpTunnel event 100 
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (producer_handler)> (http_tunnel) [0] producer_handler [http server 
> VC_EVENT_READ_READY]
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (producer_handler_chunked)> (http_tunnel) [0] producer_handler_chunked [http 
> server VC_EVENT_READ_READY]
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (read_size)> (http_chunk) read chunk size of 15 bytes
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (read_chunk)> (http_chunk) completed read of chunk of 15 bytes
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (producer_handler)> (http_redirect) [HttpTunnel::producer_handler] 
> enable_redirection: [1 0 0] event: 100
> Got an HttpTunnel event 101 
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (consumer_handler)> (http_tunnel) [0] consumer_handler [user agent 
> VC_EVENT_WRITE_READY]
> write ready consumer_handler
> {code}
> In this situation the connection doesn't close here at the RST-- but rather 
> on the next set of bytes from the origin to send-- which end up tripping a 
> VC_EVENT_ERROR-- and tearing down the connection.
> When the client sends the first RST, epoll returns a WRITE_READY event-- 
> which the HTTPTunnel consumer ignores completely. It seems then that when we 
> receive the WRITE_READY event we need to determine if we are already in the 
> writing state-- and if so, then we should stop the transaction (since we are 
> already edge-triggered).
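
To make the proposed check concrete, here is a rough, self-contained sketch of 
treating a WRITE_READY that arrives mid-write as the client's RST. The 
structure and field names are hypothetical, not the actual HttpTunnel consumer 
code; the 100/101 event values match the debug log above, while the error 
value is only illustrative:

{code}
struct Consumer {
  static const int VC_EVENT_READ_READY  = 100; // matches the log above
  static const int VC_EVENT_WRITE_READY = 101; // matches the log above
  static const int VC_EVENT_ERROR       = 199; // illustrative value only

  bool write_in_progress = false;

  int handle_event(int event) {
    if (event == VC_EVENT_WRITE_READY) {
      if (write_in_progress) {
        // Edge-triggered and already writing: a fresh WRITE_READY means the
        // client reset the connection, so fail fast instead of waiting for
        // the next origin bytes to trip VC_EVENT_ERROR.
        return handle_event(VC_EVENT_ERROR);
      }
      write_in_progress = true; // first wakeup: enter the writing state
      return 0;
    }
    if (event == VC_EVENT_ERROR) {
      // ... stop the transaction and tear down the origin connection ...
      return -1;
    }
    return 0;
  }
};
{code}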



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TS-4796) ATS not closing origin connections on first RST from client

2016-10-26 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-4796.

Resolution: Fixed

> ATS not closing origin connections on first RST from client
> ---
>
> Key: TS-4796
> URL: https://issues.apache.org/jira/browse/TS-4796
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.1.0
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> *TLDR; similar to TS-4720 -- slower to close than it should, instead of never 
> closing*
> As a continuation of TS-4720, while testing that the session is closed when 
> we expect-- I found that it isn't.
> Although we are now closing the sessions, we aren't doing it as quickly as we 
> should. In this client abort case we expect the client to abort, and ATS 
> should initially continue to send bytes to the client-- as we are in the 
> half-open state. After the first set of bytes are sent to the client-- the 
> client will send an RST-- which should signal ATS to stop sending the request 
> (and tear down the origin connection etc.).
> I'm able to reproduce this locally, and the debug output (with some 
> additional comments) looks like below:
> {code}
> < FIN FROM CLIENT >
> [Aug 29 18:25:07.491] Server {0x7effa538a800} DEBUG:  (main_handler)> (http) [0] [HttpSM::main_handler, VC_EVENT_EOS]
> [Aug 29 18:25:07.491] Server {0x7effa538a800} DEBUG:  (state_watch_for_client_abort)> (http) [0] 
> [::state_watch_for_client_abort, VC_EVENT_EOS]
> < RST FROM CLIENT >
> Got an HttpTunnel event 100 
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (producer_handler)> (http_tunnel) [0] producer_handler [http server 
> VC_EVENT_READ_READY]
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (producer_handler_chunked)> (http_tunnel) [0] producer_handler_chunked [http 
> server VC_EVENT_READ_READY]
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (read_size)> (http_chunk) read chunk size of 15 bytes
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (read_chunk)> (http_chunk) completed read of chunk of 15 bytes
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (producer_handler)> (http_redirect) [HttpTunnel::producer_handler] 
> enable_redirection: [1 0 0] event: 100
> Got an HttpTunnel event 101 
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (consumer_handler)> (http_tunnel) [0] consumer_handler [user agent 
> VC_EVENT_WRITE_READY]
> write ready consumer_handler
> {code}
> In this situation the connection doesn't close here at the RST-- but rather 
> on the next set of bytes from the origin to send-- which end up tripping a 
> VC_EVENT_ERROR-- and tearing down the connection.
> When the client sends the first RST, epoll returns a WRITE_READY event-- 
> which the HTTPTunnel consumer ignores completely. It seems then that when we 
> receive the WRITE_READY event we need to determine if we are already in the 
> writing state-- and if so, then we should stop the transaction (since we are 
> already edge-triggered).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4970) Crash in INKVConnInternal when handle_event is called after destroy()

2016-10-26 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4970:
---
Backport to Version: 5.3.2, 6.2.1

> Crash in INKVConnInternal when handle_event is called after destroy()
> -
>
> Key: TS-4970
> URL: https://issues.apache.org/jira/browse/TS-4970
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.1.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> We've noticed a few crashes for requests using SPDY (on ATS 5.2.x and 6.x) 
> where the downstream origin is down with a backtrace that looks something 
> like:
> {code}
> (gdb) bt
> #0  0x in ?? ()
> #1  0x004cfe54 in set_continuation (this=0x2afe63a93530, event=1, 
> edata=0x2afe6399fc40) at ../iocore/eventsystem/P_VIO.h:104
> #2  INKVConnInternal::handle_event (this=0x2afe63a93530, event=1, 
> edata=0x2afe6399fc40) at InkAPI.cc:1060
> #3  0x006f8e65 in handleEvent (this=0x2afe3dd07000, e=0x2afe6399fc40, 
> calling_code=1) at I_Continuation.h:146
> #4  EThread::process_event (this=0x2afe3dd07000, e=0x2afe6399fc40, 
> calling_code=1) at UnixEThread.cc:144
> #5  0x006f993b in EThread::execute (this=0x2afe3dd07000)
> at UnixEThread.cc:195
> #6  0x006f832a in spawn_thread_internal (a=0x2afe3badf400)
> at Thread.cc:88
> #7  0x003861c079d1 in start_thread () from /lib64/libpthread.so.0
> #8  0x0038614e8b5d in clone () from /lib64/libc.so.6
> {code}
> Which looks a bit odd-- as frame 0 is missing. From digging into it a bit 
> more (with the help of [~amc]) we found that the VC we were calling was an 
> INKContInternal (meaning an INKVConnInternal):
> {code}
> (gdb) p (INKVConnInternal) *vc_server
> $5 = { = { = { = 
> { = { = {_vptr.force_VFPT_to_top = 
> 0x2afe63a93170}, 
>   handler = (int (Continuation::*)(Continuation *, int, 
> void *)) 0x4cfd90 , mutex = {
> m_ptr = 0x0}, link = { = {next = 0x0}, 
> prev = 0x0}}, lerrno = 20600}, }, 
> mdata = 0xdeaddead, m_event_func = 0x2afe43c18490
>  <(anonymous namespace)::handleTransformationPluginEvents(TSCont, 
> TSEvent, void*)>, m_event_count = 0, m_closed = -1, m_deletable = 1, 
> m_deleted = 1, 
> m_free_magic = INKCONT_INTERN_MAGIC_ALIVE}, m_read_vio = {_cont = 0x0, 
> nbytes = 0, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
> vc_server = 0x0, mutex = {m_ptr = 0x0}}, m_write_vio = {_cont = 0x0, 
> nbytes = 122, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
> vc_server = 0x2afe63a93530, mutex = {m_ptr = 0x0}}, 
>   m_output_vc = 0x2afe63091a88}
> {code}
> From looking at the debug logs that led up to the crash, I'm seeing that 
> some events (namely timeout events) are being called after the VConn has been 
> destroy()'d. This led me to find that INKVConnInternal::handle_event is 
> actually checking if that is the case-- and then re-destroying everything, 
> which makes no sense.
> So although the ideal would be to not call handle_event on a closed VConn, 
> crashing is definitely not acceptable. My solution is to continue to only 
> call the event handler if the VConn hasn't been deleted-- but instead of 
> attempting to re-destroy the connection, we'll leave it be (unless we are in 
> debug mode-- where I'll throw in an assert).
> I did some looking at this on ATS7 and it looks like this is all fixed by the 
> cleanup of the whole free-ing stuff for VConns 
> (https://github.com/apache/trafficserver/pull/752/files).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4970) Crash in INKVConnInternal when handle_event is called after destroy()

2016-10-21 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15596682#comment-15596682
 ] 

Thomas Jackson commented on TS-4970:


This patch is now out running on our infra, and the issues have definitely 
cleared up. So, at this point I think it'd be fine to merge this-- but if we'd 
prefer to backport the larger change I'm also okay with that.

> Crash in INKVConnInternal when handle_event is called after destroy()
> -
>
> Key: TS-4970
> URL: https://issues.apache.org/jira/browse/TS-4970
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.1.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> We've noticed a few crashes for requests using SPDY (on ATS 5.2.x and 6.x) 
> where the downstream origin is down with a backtrace that looks something 
> like:
> {code}
> (gdb) bt
> #0  0x in ?? ()
> #1  0x004cfe54 in set_continuation (this=0x2afe63a93530, event=1, 
> edata=0x2afe6399fc40) at ../iocore/eventsystem/P_VIO.h:104
> #2  INKVConnInternal::handle_event (this=0x2afe63a93530, event=1, 
> edata=0x2afe6399fc40) at InkAPI.cc:1060
> #3  0x006f8e65 in handleEvent (this=0x2afe3dd07000, e=0x2afe6399fc40, 
> calling_code=1) at I_Continuation.h:146
> #4  EThread::process_event (this=0x2afe3dd07000, e=0x2afe6399fc40, 
> calling_code=1) at UnixEThread.cc:144
> #5  0x006f993b in EThread::execute (this=0x2afe3dd07000)
> at UnixEThread.cc:195
> #6  0x006f832a in spawn_thread_internal (a=0x2afe3badf400)
> at Thread.cc:88
> #7  0x003861c079d1 in start_thread () from /lib64/libpthread.so.0
> #8  0x0038614e8b5d in clone () from /lib64/libc.so.6
> {code}
> Which looks a bit odd-- as frame 0 is missing. From digging into it a bit 
> more (with the help of [~amc]) we found that the VC we were calling was an 
> INKContInternal (meaning an INKVConnInternal):
> {code}
> (gdb) p (INKVConnInternal) *vc_server
> $5 = { = { = { = 
> { = { = {_vptr.force_VFPT_to_top = 
> 0x2afe63a93170}, 
>   handler = (int (Continuation::*)(Continuation *, int, 
> void *)) 0x4cfd90 , mutex = {
> m_ptr = 0x0}, link = { = {next = 0x0}, 
> prev = 0x0}}, lerrno = 20600}, }, 
> mdata = 0xdeaddead, m_event_func = 0x2afe43c18490
>  <(anonymous namespace)::handleTransformationPluginEvents(TSCont, 
> TSEvent, void*)>, m_event_count = 0, m_closed = -1, m_deletable = 1, 
> m_deleted = 1, 
> m_free_magic = INKCONT_INTERN_MAGIC_ALIVE}, m_read_vio = {_cont = 0x0, 
> nbytes = 0, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
> vc_server = 0x0, mutex = {m_ptr = 0x0}}, m_write_vio = {_cont = 0x0, 
> nbytes = 122, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
> vc_server = 0x2afe63a93530, mutex = {m_ptr = 0x0}}, 
>   m_output_vc = 0x2afe63091a88}
> {code}
> From looking at the debug logs that led up to the crash, I'm seeing that 
> some events (namely timeout events) are being called after the VConn has been 
> destroy()'d. This led me to find that INKVConnInternal::handle_event is 
> actually checking if that is the case-- and then re-destroying everything, 
> which makes no sense.
> So although the ideal would be to not call handle_event on a closed VConn, 
> crashing is definitely not acceptable. My solution is to continue to only 
> call the event handler if the VConn hasn't been deleted-- but instead of 
> attempting to re-destroy the connection, we'll leave it be (unless we are in 
> debug mode-- where I'll throw in an assert).
> I did some looking at this on ATS7 and it looks like this is all fixed by the 
> cleanup of the whole free-ing stuff for VConns 
> (https://github.com/apache/trafficserver/pull/752/files).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4970) Crash in INKVConnInternal when handle_event is called after destroy()

2016-10-18 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15585605#comment-15585605
 ] 

Thomas Jackson commented on TS-4970:


Not a duplicate, as they are PRs against different branches, although I 
imagine we won't backport to 5.2.x since it goes out of support in a couple of 
weeks.




> Crash in INKVConnInternal when handle_event is called after destroy()
> -
>
> Key: TS-4970
> URL: https://issues.apache.org/jira/browse/TS-4970
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.1.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> We've noticed a few crashes for requests using SPDY (on ATS 5.2.x and 6.x) 
> where the downstream origin is down with a backtrace that looks something 
> like:
> {code}
> (gdb) bt
> #0  0x in ?? ()
> #1  0x004cfe54 in set_continuation (this=0x2afe63a93530, event=1, 
> edata=0x2afe6399fc40) at ../iocore/eventsystem/P_VIO.h:104
> #2  INKVConnInternal::handle_event (this=0x2afe63a93530, event=1, 
> edata=0x2afe6399fc40) at InkAPI.cc:1060
> #3  0x006f8e65 in handleEvent (this=0x2afe3dd07000, e=0x2afe6399fc40, 
> calling_code=1) at I_Continuation.h:146
> #4  EThread::process_event (this=0x2afe3dd07000, e=0x2afe6399fc40, 
> calling_code=1) at UnixEThread.cc:144
> #5  0x006f993b in EThread::execute (this=0x2afe3dd07000)
> at UnixEThread.cc:195
> #6  0x006f832a in spawn_thread_internal (a=0x2afe3badf400)
> at Thread.cc:88
> #7  0x003861c079d1 in start_thread () from /lib64/libpthread.so.0
> #8  0x0038614e8b5d in clone () from /lib64/libc.so.6
> {code}
> Which looks a bit odd-- as frame 0 is missing. From digging into it a bit 
> more (with the help of [~amc]) we found that the VC we were calling was an 
> INKContInternal (meaning an INKVConnInternal):
> {code}
> (gdb) p (INKVConnInternal) *vc_server
> $5 = { = { = { = 
> { = { = {_vptr.force_VFPT_to_top = 
> 0x2afe63a93170}, 
>   handler = (int (Continuation::*)(Continuation *, int, 
> void *)) 0x4cfd90 , mutex = {
> m_ptr = 0x0}, link = { = {next = 0x0}, 
> prev = 0x0}}, lerrno = 20600}, }, 
> mdata = 0xdeaddead, m_event_func = 0x2afe43c18490
>  <(anonymous namespace)::handleTransformationPluginEvents(TSCont, 
> TSEvent, void*)>, m_event_count = 0, m_closed = -1, m_deletable = 1, 
> m_deleted = 1, 
> m_free_magic = INKCONT_INTERN_MAGIC_ALIVE}, m_read_vio = {_cont = 0x0, 
> nbytes = 0, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
> vc_server = 0x0, mutex = {m_ptr = 0x0}}, m_write_vio = {_cont = 0x0, 
> nbytes = 122, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
> vc_server = 0x2afe63a93530, mutex = {m_ptr = 0x0}}, 
>   m_output_vc = 0x2afe63091a88}
> {code}
> From looking at the debug logs that led up to the crash, I'm seeing that 
> some events (namely timeout events) are being called after the VConn has been 
> destroy()'d. This led me to find that INKVConnInternal::handle_event is 
> actually checking if that is the case-- and then re-destroying everything, 
> which makes no sense.
> So although the ideal would be to not call handle_event on a closed VConn, 
> crashing is definitely not acceptable. My solution is to continue to only 
> call the event handler if the VConn hasn't been deleted-- but instead of 
> attempting to re-destroy the connection, we'll leave it be (unless we are in 
> debug mode-- where I'll throw in an assert).
> I did some looking at this on ATS7 and it looks like this is all fixed by the 
> cleanup of the whole free-ing stuff for VConns 
> (https://github.com/apache/trafficserver/pull/752/files).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4970) Crash in INKVConnInternal when handle_event is called after destroy()

2016-10-17 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583311#comment-15583311
 ] 

Thomas Jackson commented on TS-4970:


After spending some time looking at it-- [~shinrich] is correct, I am misusing 
the m_deleted field. Taking a step back, my primary issue with this bug is that 
we are double-freeing something. From looking at the diff of TS-4590 I see that 
the mechanism for doing this is by using the `m_free_magic` member, but that is 
a bit broken in 5.2 and 6.2. So, instead of the current patch I have, I am 
changing to use the mutex pointer as the "are we deleted" flag to avoid double 
freeing. I've updated the PRs, and will ping back once I've finished rolling 
out the change on our infra.
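
As a rough illustration of that approach (simplified, hypothetical types; not 
the actual 5.2/6.2 patch), the mutex pointer released in destroy() doubles as 
the deleted marker:

{code}
#include <cassert>
#include <memory>

struct VConn {
  std::shared_ptr<int> mutex = std::make_shared<int>(0); // stand-in for Ptr<ProxyMutex>

  void destroy() {
    if (!mutex) {
      assert(!"destroy() called twice"); // debug builds: flag the bug loudly
      return;                            // release builds: never double-free
    }
    mutex.reset(); // from here on, a null mutex means "deleted"
    // ... free the remaining connection state exactly once ...
  }

  void handle_event(int /*event*/) {
    if (!mutex) {
      return; // stale event after destroy(): drop it instead of crashing
    }
    // ... normal event dispatch ...
  }
};
{code}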

> Crash in INKVConnInternal when handle_event is called after destroy()
> -
>
> Key: TS-4970
> URL: https://issues.apache.org/jira/browse/TS-4970
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> We've noticed a few crashes for requests using SPDY (on ATS 5.2.x and 6.x) 
> where the downstream origin is down with a backtrace that looks something 
> like:
> {code}
> (gdb) bt
> #0  0x in ?? ()
> #1  0x004cfe54 in set_continuation (this=0x2afe63a93530, event=1, 
> edata=0x2afe6399fc40) at ../iocore/eventsystem/P_VIO.h:104
> #2  INKVConnInternal::handle_event (this=0x2afe63a93530, event=1, 
> edata=0x2afe6399fc40) at InkAPI.cc:1060
> #3  0x006f8e65 in handleEvent (this=0x2afe3dd07000, e=0x2afe6399fc40, 
> calling_code=1) at I_Continuation.h:146
> #4  EThread::process_event (this=0x2afe3dd07000, e=0x2afe6399fc40, 
> calling_code=1) at UnixEThread.cc:144
> #5  0x006f993b in EThread::execute (this=0x2afe3dd07000)
> at UnixEThread.cc:195
> #6  0x006f832a in spawn_thread_internal (a=0x2afe3badf400)
> at Thread.cc:88
> #7  0x003861c079d1 in start_thread () from /lib64/libpthread.so.0
> #8  0x0038614e8b5d in clone () from /lib64/libc.so.6
> {code}
> Which looks a bit odd-- as frame 0 is missing. From digging into it a bit 
> more (with the help of [~amc]) we found that the VC we were calling was an 
> INKContInternal (meaning an INKVConnInternal):
> {code}
> (gdb) p (INKVConnInternal) *vc_server
> $5 = { = { = { = 
> { = { = {_vptr.force_VFPT_to_top = 
> 0x2afe63a93170}, 
>   handler = (int (Continuation::*)(Continuation *, int, 
> void *)) 0x4cfd90 , mutex = {
> m_ptr = 0x0}, link = { = {next = 0x0}, 
> prev = 0x0}}, lerrno = 20600}, }, 
> mdata = 0xdeaddead, m_event_func = 0x2afe43c18490
>  <(anonymous namespace)::handleTransformationPluginEvents(TSCont, 
> TSEvent, void*)>, m_event_count = 0, m_closed = -1, m_deletable = 1, 
> m_deleted = 1, 
> m_free_magic = INKCONT_INTERN_MAGIC_ALIVE}, m_read_vio = {_cont = 0x0, 
> nbytes = 0, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
> vc_server = 0x0, mutex = {m_ptr = 0x0}}, m_write_vio = {_cont = 0x0, 
> nbytes = 122, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
> vc_server = 0x2afe63a93530, mutex = {m_ptr = 0x0}}, 
>   m_output_vc = 0x2afe63091a88}
> {code}
> From looking at the debug logs that led up to the crash, I'm seeing that 
> some events (namely timeout events) are being called after the VConn has been 
> destroy()'d. This led me to find that INKVConnInternal::handle_event is 
> actually checking if that is the case-- and then re-destroying everything, 
> which makes no sense.
> So although the ideal would be to not call handle_event on a closed VConn, 
> crashing is definitely not acceptable. My solution is to continue to only 
> call the event handler if the VConn hasn't been deleted-- but instead of 
> attempting to re-destroy the connection, we'll leave it be (unless we are in 
> debug mode-- where I'll throw in an assert).
> I did some looking at this on ATS7 and it looks like this is all fixed by the 
> cleanup of the whole free-ing stuff for VConns 
> (https://github.com/apache/trafficserver/pull/752/files).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TS-4969) Add log field for number of retries against origin

2016-10-13 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson closed TS-4969.
--
Resolution: Duplicate

> Add log field for number of retries against origin
> --
>
> Key: TS-4969
> URL: https://issues.apache.org/jira/browse/TS-4969
> Project: Traffic Server
>  Issue Type: Improvement
>  Components: HTTP
>Reporter: Thomas Jackson
>
> We have configuration options for retries, but I don't see a mechanism to log 
> the number of retries that have happened. This would be immensely helpful in 
> debugging origin issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4970) Crash in INKVConnInternal when handle_event is called after destroy()

2016-10-13 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4970:
---
Summary: Crash in INKVConnInternal when handle_event is called after 
destroy()  (was: Crash in INKVConnInternal if handle_event is called after 
destroy())

> Crash in INKVConnInternal when handle_event is called after destroy()
> -
>
> Key: TS-4970
> URL: https://issues.apache.org/jira/browse/TS-4970
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> We've noticed a few crashes for requests using SPDY (on ATS 5.2.x and 6.x) 
> where the downstream origin is down with a backtrace that looks something 
> like:
> {code}
> (gdb) bt
> #0  0x in ?? ()
> #1  0x004cfe54 in set_continuation (this=0x2afe63a93530, event=1, 
> edata=0x2afe6399fc40) at ../iocore/eventsystem/P_VIO.h:104
> #2  INKVConnInternal::handle_event (this=0x2afe63a93530, event=1, 
> edata=0x2afe6399fc40) at InkAPI.cc:1060
> #3  0x006f8e65 in handleEvent (this=0x2afe3dd07000, e=0x2afe6399fc40, 
> calling_code=1) at I_Continuation.h:146
> #4  EThread::process_event (this=0x2afe3dd07000, e=0x2afe6399fc40, 
> calling_code=1) at UnixEThread.cc:144
> #5  0x006f993b in EThread::execute (this=0x2afe3dd07000)
> at UnixEThread.cc:195
> #6  0x006f832a in spawn_thread_internal (a=0x2afe3badf400)
> at Thread.cc:88
> #7  0x003861c079d1 in start_thread () from /lib64/libpthread.so.0
> #8  0x0038614e8b5d in clone () from /lib64/libc.so.6
> {code}
> Which looks a bit odd-- as frame 0 is missing. From digging into it a bit 
> more (with the help of [~amc]) we found that the VC we were calling was an 
> INKContInternal (meaning an INKVConnInternal):
> {code}
> (gdb) p (INKVConnInternal) *vc_server
> $5 = { = { = { = 
> { = { = {_vptr.force_VFPT_to_top = 
> 0x2afe63a93170}, 
>   handler = (int (Continuation::*)(Continuation *, int, 
> void *)) 0x4cfd90 , mutex = {
> m_ptr = 0x0}, link = { = {next = 0x0}, 
> prev = 0x0}}, lerrno = 20600}, }, 
> mdata = 0xdeaddead, m_event_func = 0x2afe43c18490
>  <(anonymous namespace)::handleTransformationPluginEvents(TSCont, 
> TSEvent, void*)>, m_event_count = 0, m_closed = -1, m_deletable = 1, 
> m_deleted = 1, 
> m_free_magic = INKCONT_INTERN_MAGIC_ALIVE}, m_read_vio = {_cont = 0x0, 
> nbytes = 0, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
> vc_server = 0x0, mutex = {m_ptr = 0x0}}, m_write_vio = {_cont = 0x0, 
> nbytes = 122, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
> vc_server = 0x2afe63a93530, mutex = {m_ptr = 0x0}}, 
>   m_output_vc = 0x2afe63091a88}
> {code}
> From looking at the debug logs that led up to the crash, I'm seeing that 
> some events (namely timeout events) are being called after the VConn has been 
> destroy()'d. This led me to find that INKVConnInternal::handle_event is 
> actually checking if that is the case-- and then re-destroying everything, 
> which makes no sense.
> So although the ideal would be to not call handle_event on a closed VConn, 
> crashing is definitely not acceptable. My solution is to continue to only 
> call the event handler if the VConn hasn't been deleted-- but instead of 
> attempting to re-destroy the connection, we'll leave it be (unless we are in 
> debug mode-- where I'll throw in an assert).
> I did some looking at this on ATS7 and it looks like this is all fixed by the 
> cleanup of the whole free-ing stuff for VConns 
> (https://github.com/apache/trafficserver/pull/752/files).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4970) Crash in INKVConnInternal if handle_event is called after destroy()

2016-10-13 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4970:
--

Assignee: Thomas Jackson

> Crash in INKVConnInternal if handle_event is called after destroy()
> ---
>
> Key: TS-4970
> URL: https://issues.apache.org/jira/browse/TS-4970
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> We've noticed a few crashes for requests using SPDY (on ATS 5.2.x and 6.x) 
> where the downstream origin is down with a backtrace that looks something 
> like:
> {code}
> (gdb) bt
> #0  0x in ?? ()
> #1  0x004cfe54 in set_continuation (this=0x2afe63a93530, event=1, 
> edata=0x2afe6399fc40) at ../iocore/eventsystem/P_VIO.h:104
> #2  INKVConnInternal::handle_event (this=0x2afe63a93530, event=1, 
> edata=0x2afe6399fc40) at InkAPI.cc:1060
> #3  0x006f8e65 in handleEvent (this=0x2afe3dd07000, e=0x2afe6399fc40, 
> calling_code=1) at I_Continuation.h:146
> #4  EThread::process_event (this=0x2afe3dd07000, e=0x2afe6399fc40, 
> calling_code=1) at UnixEThread.cc:144
> #5  0x006f993b in EThread::execute (this=0x2afe3dd07000)
> at UnixEThread.cc:195
> #6  0x006f832a in spawn_thread_internal (a=0x2afe3badf400)
> at Thread.cc:88
> #7  0x003861c079d1 in start_thread () from /lib64/libpthread.so.0
> #8  0x0038614e8b5d in clone () from /lib64/libc.so.6
> {code}
> Which looks a bit odd-- as frame 0 is missing. From digging into it a bit 
> more (with the help of [~amc]) we found that the VC we were calling was an 
> INKContInternal (meaning an INKVConnInternal):
> {code}
> (gdb) p (INKVConnInternal) *vc_server
> $5 = { = { = { = 
> { = { = {_vptr.force_VFPT_to_top = 
> 0x2afe63a93170}, 
>   handler = (int (Continuation::*)(Continuation *, int, 
> void *)) 0x4cfd90 , mutex = {
> m_ptr = 0x0}, link = { = {next = 0x0}, 
> prev = 0x0}}, lerrno = 20600}, }, 
> mdata = 0xdeaddead, m_event_func = 0x2afe43c18490
>  <(anonymous namespace)::handleTransformationPluginEvents(TSCont, 
> TSEvent, void*)>, m_event_count = 0, m_closed = -1, m_deletable = 1, 
> m_deleted = 1, 
> m_free_magic = INKCONT_INTERN_MAGIC_ALIVE}, m_read_vio = {_cont = 0x0, 
> nbytes = 0, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
> vc_server = 0x0, mutex = {m_ptr = 0x0}}, m_write_vio = {_cont = 0x0, 
> nbytes = 122, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
> vc_server = 0x2afe63a93530, mutex = {m_ptr = 0x0}}, 
>   m_output_vc = 0x2afe63091a88}
> {code}
> From looking at the debug logs that led up to the crash, I'm seeing that 
> some events (namely timeout events) are being called after the VConn has been 
> destroy()'d. This led me to find that INKVConnInternal::handle_event is 
> actually checking if that is the case-- and then re-destroying everything, 
> which makes no sense.
> So although the ideal would be to not call handle_event on a closed VConn, 
> crashing is definitely not acceptable. My solution is to continue to only 
> call the event handler if the VConn hasn't been deleted-- but instead of 
> attempting to re-destroy the connection, we'll leave it be (unless we are in 
> debug mode-- where I'll throw in an assert).
> I did some looking at this on ATS7 and it looks like this is all fixed by the 
> cleanup of the whole free-ing stuff for VConns 
> (https://github.com/apache/trafficserver/pull/752/files).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-4970) Crash in INKVConnInternal if handle_event is called after destroy()

2016-10-13 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-4970:
--

 Summary: Crash in INKVConnInternal if handle_event is called after 
destroy()
 Key: TS-4970
 URL: https://issues.apache.org/jira/browse/TS-4970
 Project: Traffic Server
  Issue Type: Bug
  Components: HTTP
Reporter: Thomas Jackson


We've noticed a few crashes for requests using SPDY (on ATS 5.2.x and 6.x) 
where the downstream origin is down with a backtrace that looks something like:

{code}
(gdb) bt
#0  0x in ?? ()
#1  0x004cfe54 in set_continuation (this=0x2afe63a93530, event=1, 
edata=0x2afe6399fc40) at ../iocore/eventsystem/P_VIO.h:104
#2  INKVConnInternal::handle_event (this=0x2afe63a93530, event=1, 
edata=0x2afe6399fc40) at InkAPI.cc:1060
#3  0x006f8e65 in handleEvent (this=0x2afe3dd07000, e=0x2afe6399fc40, 
calling_code=1) at I_Continuation.h:146
#4  EThread::process_event (this=0x2afe3dd07000, e=0x2afe6399fc40, 
calling_code=1) at UnixEThread.cc:144
#5  0x006f993b in EThread::execute (this=0x2afe3dd07000)
at UnixEThread.cc:195
#6  0x006f832a in spawn_thread_internal (a=0x2afe3badf400)
at Thread.cc:88
#7  0x003861c079d1 in start_thread () from /lib64/libpthread.so.0
#8  0x0038614e8b5d in clone () from /lib64/libc.so.6
{code}

Which looks a bit odd-- as frame 0 is missing. From digging into it a bit more 
(with the help of [~amc]) we found that the VC we were calling was an 
INKContInternal (meaning an INKVConnInternal):

{code}
(gdb) p (INKVConnInternal) *vc_server
$5 = { = { = { = 
{ = { = {_vptr.force_VFPT_to_top = 
0x2afe63a93170}, 
  handler = (int (Continuation::*)(Continuation *, int, 
void *)) 0x4cfd90 , mutex = {
m_ptr = 0x0}, link = { = {next = 0x0}, 
prev = 0x0}}, lerrno = 20600}, }, 
mdata = 0xdeaddead, m_event_func = 0x2afe43c18490
 <(anonymous namespace)::handleTransformationPluginEvents(TSCont, TSEvent, 
void*)>, m_event_count = 0, m_closed = -1, m_deletable = 1, m_deleted = 1, 
m_free_magic = INKCONT_INTERN_MAGIC_ALIVE}, m_read_vio = {_cont = 0x0, 
nbytes = 0, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
vc_server = 0x0, mutex = {m_ptr = 0x0}}, m_write_vio = {_cont = 0x0, 
nbytes = 122, ndone = 0, op = 0, buffer = {mbuf = 0x0, entry = 0x0}, 
vc_server = 0x2afe63a93530, mutex = {m_ptr = 0x0}}, 
  m_output_vc = 0x2afe63091a88}
{code}

From looking at the debug logs that led up to the crash, I'm seeing that some 
events (namely timeout events) are being called after the VConn has been 
destroy()'d. This led me to find that INKVConnInternal::handle_event is 
actually checking if that is the case-- and then re-destroying everything, 
which makes no sense.

So although the ideal would be to not call handle_event on a closed VConn, 
crashing is definitely not acceptable. My solution is to continue to only call 
the event handler if the VConn hasn't been deleted-- but instead of attempting 
to re-destroy the connection, we'll leave it be (unless we are in debug mode-- 
where I'll throw in an assert).

I did some looking at this on ATS7 and it looks like this is all fixed by the 
cleanup of the whole free-ing stuff for VConns 
(https://github.com/apache/trafficserver/pull/752/files).
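
As a minimal sketch of that guard (hypothetical member names; not the actual 
InkAPI.cc patch), the handler dispatches only while the VConn is alive, 
asserts on stale events in debug builds, and otherwise leaves the 
already-destroyed object alone:

{code}
#include <cassert>

struct VConnInternal {
  bool m_deleted = false;

  void handle_event(int event, void *edata) {
    if (m_deleted) {
      assert(!"handle_event called after destroy()"); // debug mode only
      return; // release builds: ignore the stale (e.g. timeout) event
    }
    dispatch(event, edata); // normal path: forward to the registered handler
  }

  void dispatch(int event, void *edata) {
    (void)event;
    (void)edata;
    // ... call m_event_func and do the usual bookkeeping ...
  }
};
{code}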



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-4969) Add log field for number of retries against origin

2016-10-13 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-4969:
--

 Summary: Add log field for number of retries against origin
 Key: TS-4969
 URL: https://issues.apache.org/jira/browse/TS-4969
 Project: Traffic Server
  Issue Type: Improvement
  Components: HTTP
Reporter: Thomas Jackson


We have configuration options for retries, but I don't see a mechanism to log 
the number of retries that have happened. This would be immensely helpful in 
debugging origin issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-4968) Log a warning if connect_attempts_rr_retries is >= connect_attempts_max_retries

2016-10-13 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-4968:
--

 Summary: Log a warning if connect_attempts_rr_retries is >= 
connect_attempts_max_retries
 Key: TS-4968
 URL: https://issues.apache.org/jira/browse/TS-4968
 Project: Traffic Server
  Issue Type: Improvement
  Components: HTTP
Reporter: Thomas Jackson


If connect_attempts_rr_retries >= connect_attempts_max_retries, requests going 
to an RR DNS record will never be redispatched. To make it a bit more obvious-- 
I think we should log a warning when we load the configs.
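
A minimal sketch of what that load-time check could look like (hypothetical 
free function; not the actual records.config validation code):

{code}
#include <cstdio>

// Warn when round-robin redispatch is unreachable: the transaction gives up
// after max_retries attempts, so an rr_retries threshold at or above that
// can never trigger a switch to the next host in the RR record.
void check_retry_config(int rr_retries, int max_retries) {
  if (rr_retries >= max_retries) {
    fprintf(stderr,
            "WARNING: connect_attempts_rr_retries (%d) >= "
            "connect_attempts_max_retries (%d); requests to round-robin DNS "
            "records will never be redispatched\n",
            rr_retries, max_retries);
  }
}
{code}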



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4960) Undo internal request tunnelling hacks

2016-10-12 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569429#comment-15569429
 ] 

Thomas Jackson commented on TS-4960:


There is a fairly long chain of bugs around this-- so I'll attempt to dump as 
much context here as I can.

Originally we had an issue where internal transactions were getting stuck on 
inactivity timers because they'd get into a half-close state-- at the time the 
plugin VC stuff didn't fire the appropriate events for the half-close state to 
ever end. Later (seemingly unrelated, and without anyone knowing that they 
interact) TS-3777 came around and added the missing events to the pluginVC.

To [~jpe...@apache.org]'s point-- the plugin VCs shouldn't be treated any 
differently-- the original patch was to work around the fact that they *were* 
different in the code, which seems to be fixed now with TS-3777. So, assuming 
pluginVCs now fire all the same events-- we *should* be able to remove the 
patch from TS-3404.

> Undo internal request tunnelling hacks
> --
>
> Key: TS-4960
> URL: https://issues.apache.org/jira/browse/TS-4960
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core
>Reporter: James Peach
>
> {noformat}
> proxy/http/HttpSM.cc:is_eligible_post_request &= 
> !vc->get_is_internal_request();
> {noformat}
> {{HttpSM::tunnel_handler_ua}} does shenanigans based on whether this is an 
> internal transaction or not. This is a complete hack. Internal transactions 
> are not supposed to behave differently.
> AFAICT, this hack from TS-3404 led to TS-3777, which led to TS-4924, which 
> makes it impossible for protocol plugins to use keepalive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TS-4509) Dropped keep-alive connections not being re-established (TS-3959 continued)

2016-10-11 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-4509.

Resolution: Fixed

> Dropped keep-alive connections not being re-established (TS-3959 continued)
> ---
>
> Key: TS-4509
> URL: https://issues.apache.org/jira/browse/TS-4509
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core, Network
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>Priority: Blocker
> Fix For: 7.1.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> I've observed some differences in how TrafficServer 6.0.0 behaves with 
> connection retrying and outgoing keep-alive connections. I believe the 
> changes in behavior might be related to this issue: 
> https://issues.apache.org/jira/browse/TS-3440
> I originally wasn't sure if this was a bug, but James Peach indicated it 
> sounded more like a regression on the mailing list 
> (http://mail-archives.apache.org/mod_mbox/trafficserver-users/201510.mbox/%3cba85d5a2-8b29-44a9-acdc-e7fa8d21f...@apache.org%3e).
> What I'm seeing in 6.0.0 is that if TrafficServer has some backend keep-alive 
> connections already opened, but then one of the keep-alive connections is 
> closed, the next request to TrafficServer may generate a 502 Server Hangup 
> response when attempting to reuse that connection. Previously, I think 
> TrafficServer was retrying when it encountered a closed keep-alive 
> connection, but that is no longer the case. So if you have a backend that 
> might unexpectedly close its open keep-alive connections, the only way I've 
> found to completely prevent these 502 errors in 6.0.0 is to disable outgoing 
> keepalive (proxy.config.http.keep_alive_enabled_out and 
> proxy.config.http.keep_alive_post_out settings).
> For a slightly more concrete example of what can trigger this, this is fairly 
> easy to reproduce with the following setup:
> - TrafficServer is proxying to nginx with outgoing keep-alive connections 
> enabled (the default).
> - Throw a constant stream of requests at TrafficServer.
> - While that constant stream of requests is happening, also send a regular 
> stream of SIGHUP commands to nginx to reload nginx.
> - Eventually you'll get some 502 Server Hangup responses from TrafficServer 
> among your stream of requests.
> SIGHUPs in nginx should result in zero downtime for new requests, but I think 
> what's happening is that TrafficServer may fail when an old keep-alive 
> connection is reused (it's not common, so it depends on the timing of things 
> and if the connection is from an old nginx worker that has since been shut 
> down). In TrafficServer 5.3.1 these connection failures were retried, but in 
> 6.0.0, no retries occur in this case.
> Here's some debug logs that show the difference in behavior between 6.0.0 and 
> 5.3.1. Note that differences seem to stem from how each version eventually 
> handles the "VC_EVENT_EOS" event following 
> "::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".
> 5.3.1: 
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
> 6.0.0: 
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314
> Interestingly, if I'm understanding the log files correctly, it looks like 
> TrafficServer is reporting an odd empty response from these connections 
> ("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0). However, as far as I can 
> tell from TCP dumps on the system, nginx is not actually sending any form of 
> response.
> In these example cases the backend server isn't sending back any data (at 
> least as far as I can tell), so from what I understand (and the logic 
> outlined in https://issues.apache.org/jira/browse/TS-3440), it should be safe 
> to retry.
> Let me know if I can provide any other details. Or if exact scripts to 
> reproduce the issues against the example nginx backend I described above 
> would be useful, I could get that together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4921) Safe HTTP methods should be retryable.

2016-10-03 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4921:
---
Description: 
In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if you 
have sent any bytes. Following the RFCs, the default behaviour should allow 
safe (and also idempotent) method requests to be retried regardless of whether 
bytes were sent.

"safe" methods (https://tools.ietf.org/html/rfc7231#section-4.2.1): GET HEAD

From conversations, it sounds like the ideal approach is to create a config 
option (which is transaction overrideable) which allows you to define the list 
of methods which are retryable (which wouldn't be limited to the well-known 
methods inside ATS).


  was:
In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if you 
have sent any bytes. Following the RFCs, the default behaviour should allow 
safe (and also idempotent) method requests to be retried regardless of whether 
bytes were sent.

"safe" methods: GET HEAD

From conversations, it sounds like the ideal approach is to create a config 
option (which is transaction overrideable) which allows you to define the list 
of methods which are retryable (which wouldn't be limited to the well-known 
methods inside ATS).



> Safe HTTP methods should be retryable.
> --
>
> Key: TS-4921
> URL: https://issues.apache.org/jira/browse/TS-4921
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core, HTTP
>Reporter: James Peach
>Assignee: Thomas Jackson
>
> In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if 
> you have sent any bytes. Following the RFCs, the default behaviour should 
> allow safe (and also idempotent) method requests to be retried regardless of 
> whether bytes were sent.
> "safe" methods (https://tools.ietf.org/html/rfc7231#section-4.2.1): GET HEAD
> From conversations, it sounds like the ideal approach is to create a config 
> option (which is transaction overrideable) which allows you to define the 
> list of methods which are retryable (which wouldn't be limited to the 
> well-known methods inside ATS).
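
One possible shape for the suggested config option, sketched with hypothetical 
names (not the actual HttpTransact change): after bytes have been sent, the 
retry decision consults an operator-defined method list that defaults to the 
RFC 7231 safe methods:

{code}
#include <algorithm>
#include <string>
#include <vector>

struct RetryConfig {
  // Transaction-overrideable list; defaults to the "safe" methods and need
  // not be limited to the well-known methods inside ATS.
  std::vector<std::string> retryable_methods = {"GET", "HEAD"};

  // Simplified: the real check involves more conditions than bytes_sent.
  bool is_request_retryable(const std::string &method, bool bytes_sent) const {
    if (!bytes_sent) {
      return true; // nothing sent to the origin yet: retrying is safe
    }
    // Bytes already sent: retry only methods the operator declared safe.
    return std::find(retryable_methods.begin(), retryable_methods.end(),
                     method) != retryable_methods.end();
  }
};
{code}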



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4921) Safe HTTP methods should be retryable.

2016-10-03 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4921:
---
Description: 
In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if you 
have sent any bytes. Following the RFCs, the default behaviour should allow 
safe (and also idempotent) method requests to be retried regardless of whether 
bytes were sent.

"safe" methods: GET HEAD

From conversations, it sounds like the ideal approach is to create a config 
option (which is transaction overrideable) which allows you to define the list 
of methods which are retryable (which wouldn't be limited to the well-known 
methods inside ATS).


  was:
In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if you 
have sent any bytes. Following the RFCs, the default behaviour should allow 
safe (and also idempotent) method requests to be retried regardless of whether 
bytes were sent.

"safe" methods: GET HEAD

>From conversations, it sounds like the best approach is to create a config 
>option (which is transaction overrideable) which allows you to define the list 
>of methods which are retryable.



> Safe HTTP methods should be retryable.
> --
>
> Key: TS-4921
> URL: https://issues.apache.org/jira/browse/TS-4921
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core, HTTP
>Reporter: James Peach
>Assignee: Thomas Jackson
>
> In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if 
> you have sent any bytes. Following the RFCs, the default behaviour should 
> allow safe (and also idempotent) method requests to be retried regardless of 
> whether bytes were sent.
> "safe" methods: GET HEAD
> From conversations, it sounds like the ideal approach is to create a config 
> option (which is transaction overrideable) which allows you to define the 
> list of methods which are retryable (which wouldn't be limited to the 
> well-known methods inside ATS).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4921) Safe HTTP methods should be retryable.

2016-10-03 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4921:
---
Description: 
In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if you 
have sent any bytes. Following the RFCs, the default behaviour should allow 
safe (and also idempotent) method requests to be retried regardless of whether 
bytes were sent.

"safe" methods: GET HEAD

  was:In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable 
if you have sent any bytes. Following the RFCs, the default behaviour should 
allow safe (and also idempotent) method requests to be retried regardless of 
whether bytes were sent.


> Safe HTTP methods should be retryable.
> --
>
> Key: TS-4921
> URL: https://issues.apache.org/jira/browse/TS-4921
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core, HTTP
>Reporter: James Peach
>
> In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if 
> you have sent any bytes. Following the RFCs, the default behaviour should 
> allow safe (and also idempotent) method requests to be retried regardless of 
> whether bytes were sent.
> "safe" methods: GET HEAD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4921) Safe HTTP methods should be retryable.

2016-10-03 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4921:
---
Description: 
In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if you 
have sent any bytes. Following the RFCs, the default behaviour should allow 
safe (and also idempotent) method requests to be retried regardless of whether 
bytes were sent.

"safe" methods: GET HEAD

From conversations, it sounds like the best approach is to create a config 
option (which is transaction overrideable) which allows you to define the list 
of methods which are retryable.


  was:
In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if you 
have sent any bytes. Following the RFCs, the default behaviour should allow 
safe (and also idempotent) method requests to be retried regardless of whether 
bytes were sent.

"safe" methods: GET HEAD




> Safe HTTP methods should be retryable.
> --
>
> Key: TS-4921
> URL: https://issues.apache.org/jira/browse/TS-4921
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core, HTTP
>Reporter: James Peach
>
> In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if 
> you have sent any bytes. Following the RFCs, the default behaviour should 
> allow safe (and also idempotent) method requests to be retried regardless of 
> whether bytes were sent.
> "safe" methods: GET HEAD
> From conversations, it sounds like the best approach is to create a config 
> option (which is transaction overrideable) which allows you to define the 
> list of methods which are retryable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4921) Safe HTTP methods should be retryable.

2016-10-03 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4921:
--

Assignee: Thomas Jackson

> Safe HTTP methods should be retryable.
> --
>
> Key: TS-4921
> URL: https://issues.apache.org/jira/browse/TS-4921
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core, HTTP
>Reporter: James Peach
>Assignee: Thomas Jackson
>
> In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if 
> you have sent any bytes. Following the RFCs, the default behaviour should 
> allow safe (and also idempotent) method requests to be retried regardless of 
> whether bytes were sent.
> "safe" methods: GET HEAD
> From conversations, it sounds like the best approach is to create a config 
> option (which is transaction overrideable) which allows you to define the 
> list of methods which are retryable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4921) Safe HTTP methods should be retryable.

2016-10-03 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4921:
---
Description: 
In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if you 
have sent any bytes. Following the RFCs, the default behaviour should allow 
safe (and also idempotent) method requests to be retried regardless of whether 
bytes were sent.

"safe" methods: GET HEAD



  was:
In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if you 
have sent any bytes. Following the RFCs, the default behaviour should allow 
safe (and also idempotent) method requests to be retried regardless of whether 
bytes were sent.

"safe" methods: GET HEAD


> Safe HTTP methods should be retryable.
> --
>
> Key: TS-4921
> URL: https://issues.apache.org/jira/browse/TS-4921
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core, HTTP
>Reporter: James Peach
>
> In {{HttpTransact::is_request_retryable}}, nothing seems to be retryable if 
> you have sent any bytes. Following the RFCs, the default behaviour should 
> allow safe (and also idempotent) method requests to be retried regardless of 
> whether bytes were sent.
> "safe" methods: GET HEAD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4915) Crash from hostdb in PriorityQueueLess

2016-10-03 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4915:
---
Description: 
Saw this while testing fix for TS-4813 with debug enabled.

{code}
(gdb) bt full
#0  0x00547bfe in RefCountCacheHashEntry::operator< (this=0x1cc0880, 
v2=...) at ../iocore/hostdb/P_RefCountCache.h:94
No locals.
#1  0x0054988d in 
PriorityQueueLess::operator() (this=0x2b78a9a2587b, 
a=@0x2b78f402af68, b=@0x2b78f402aa28)
at ../lib/ts/PriorityQueue.h:41
No locals.
#2  0x00549785 in PriorityQueue >::_bubble_up (this=0x1cb2990, 
index=2) at ../lib/ts/PriorityQueue.h:191
comp = {}
parent = 0
#3  0x006ecfcc in PriorityQueue >::push (this=0x1cb2990, 
entry=0x2b78f402af60) at ../../lib/ts/PriorityQueue.h:91
len = 2
#4  0x006ec206 in RefCountCachePartition::put 
(this=0x1cb2900, key=6912554662447498853, item=0x2b78aee04f00, size=96, 
expire_time=1475202356) at ./P_RefCountCache.h:210
expiry_entry = 0x2b78f402af60
__func__ = "put"
val = 0x1cc0880
#5  0x006eb3de in RefCountCache::put (this=0x18051e0, 
key=6912554662447498853, item=0x2b78aee04f00, size=16, 
expiry_time=1475202356) at ./P_RefCountCache.h:462
No locals.
#6  0x006e2d8e in HostDBContinuation::dnsEvent (this=0x2b7938020f00, 
event=600, e=0x2b78ac009440) at HostDB.cc:1422
is_rr = false
old_rr_data = 0x0
first_record = 0x2b78ac0094f8
m = 0x1
failed = false
old_r = {m_ptr = 0x0}
af = 2 '\002'
s_size = 16
rrsize = 0
allocSize = 16
r = 0x2b78aee04f00
old_info = { = { = {_vptr.ForceVFPTToTop = 
0x7f3630}, m_refcount = 0}, iobuffer_index = 0, 
  key = 47797242059264, app = {allotment = {application1 = 5326300, 
application2 = 0}, http_data = {http_version = 4, 
  pipeline_max = 59, keepalive_timeout = 17, fail_count = 81, 
unused1 = 0, last_failure = 0}, rr = {offset = 5326300}}, data = {
ip = {sa = {sa_family = 54488, sa_data = 
"^\000\000\000\000\000\020\034$\274x+\000"}, sin = {sin_family = 54488, 
sin_port = 94, 
sin_addr = {s_addr = 0}, sin_zero = "\020\034$\274x+\000"}, 
sin6 = {sin6_family = 54488, sin6_port = 94, sin6_flowinfo = 0, 
sin6_addr = {__in6_u = {__u6_addr8 = 
"\020\034$\274x+\000\000\030\036$\274\375\b\000", __u6_addr16 = {7184, 48164, 
11128, 
  0, 7704, 48164, 2301, 0}, __u6_addr32 = {3156483088, 
11128, 3156483608, 2301}}}, sin6_scope_id = 3156478176}}, 
hostname_offset = 6214872, srv = {srv_offset = 54488, srv_weight = 
94, srv_priority = 0, srv_port = 0, key = 3156483088}}, 
  hostname_offset = 11128, ip_timestamp = 2845989456, 
ip_timeout_interval = 11128, is_srv = 0, reverse_dns = 0, round_robin = 1, 
  round_robin_elt = 0}
valid_records = 0
tip = {_family = 2, _addr = {_ip4 = 540420056, _ip6 = {__in6_u = 
{__u6_addr8 = "\330'6 x+\000\000\360L\020\250x+\000", 
__u6_addr16 = {10200, 8246, 11128, 0, 19696, 43024, 11128, 0}, 
__u6_addr32 = {540420056, 11128, 2819640560, 11128}}}, 
_byte = "\330'6 x+\000\000\360L\020\250x+\000", _u32 = {540420056, 
11128, 2819640560, 11128}, _u64 = {47794936489944, 
  47797215710448}}}
ttl_seconds = 132
aname = 0x2b7938021000 "fbmm1.zenfs.com"
offset = 96
thread = 0x2b78a8101010
__func__ = "dnsEvent"
#7  0x005145dc in Continuation::handleEvent (this=0x2b7938020f00, 
event=600, data=0x2b78ac009440)
at ../iocore/eventsystem/I_Continuation.h:153
No locals.
#8  0x006f681e in DNSEntry::postEvent (this=0x2b78f4028600) at 
DNS.cc:1269
__func__ = "postEvent"
#9  0x005145dc in Continuation::handleEvent (this=0x2b78f4028600, 
event=1, data=0x2aac954db040)
at ../iocore/eventsystem/I_Continuation.h:153
No locals.
#10 0x007bc9be in EThread::process_event (this=0x2b78a8101010, 
e=0x2aac954db040, calling_code=1) at UnixEThread.cc:143
c_temp = 0x2b78f4028600
lock = {m = {m_ptr = 0x17dea10}, lock_acquired = true}
__func__ = "process_event"
#11 0x007bcc2d in EThread::execute (this=0x2b78a8101010) at 
UnixEThread.cc:197
done_one = false
e = 0x2aac954db040
NegativeQueue = {> = {head = 0x18ce400}, 
tail = 0x18ce400}
next_time = 1475191803711988905
__func__ = "execute"
#12 0x007bbfd2 in spawn_thread_internal (a=0x17fb9a0) at Thread.cc:84
p = 0x17fb9a0
#13 0x2b78a2555aa1 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#14 0x0032310e893d in clone () 
{code}

[jira] [Resolved] (TS-4867) Race condition in RefCountCacheSerializer Initialization

2016-09-14 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-4867.

Resolution: Fixed

> Race condition in RefCountCacheSerializer Initialization
> 
>
> Key: TS-4867
> URL: https://issues.apache.org/jira/browse/TS-4867
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now the serializer task is scheduled before one of the class members is 
> initialized-- which means we have a race between the next line running and 
> the eventProcessor scheduling+executing the task.
> Thanks to [~jaaju] for catching it!
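
A hedged sketch of the shape of the bug (illustrative members, not the actual
RefCountCacheSerializer code):

{code}
// Illustrative only. Scheduling publishes `this` to another thread, so
// any member initialized after the schedule call is a data race.
struct SerializerBuggy : Continuation {
  SerializerBuggy() {
    eventProcessor.schedule_imm(this, ET_TASK); // task may run now...
    partition = 0;                              // ...before this store
  }
  int partition;
};

// Fix: initialize every member first, schedule last.
struct SerializerFixed : Continuation {
  SerializerFixed() : partition(0) {
    eventProcessor.schedule_imm(this, ET_TASK);
  }
  int partition;
};
{code}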



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4867) Race condition in RefCountCacheSerializer

2016-09-14 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4867:
---
Summary: Race condition in RefCountCacheSerializer  (was: Potential race 
condition in RefCountCacheSerializer)

> Race condition in RefCountCacheSerializer
> -
>
> Key: TS-4867
> URL: https://issues.apache.org/jira/browse/TS-4867
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>
> Right now the serializer task is scheduled before one of the class members is 
> initialized-- which means we have a race between the next line running and 
> the eventProcessor scheduling+executing the task.
> Thanks to [~jaaju] for catching it!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4866) Remove traffic_cop health checking.

2016-09-14 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4866:
---
Fix Version/s: 7.1.0

> Remove traffic_cop health checking.
> ---
>
> Key: TS-4866
> URL: https://issues.apache.org/jira/browse/TS-4866
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Cop
>Affects Versions: 7.0.0
>Reporter: James Peach
>Assignee: Leif Hedstrom
> Fix For: 7.0.0
>
>
> There is a school of thought that {{traffic_cop}} health checking causes more 
> problems than it solves. Consider whether we should eliminate health checking 
> from {{traffic_cop}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4866) Remove traffic_cop health checking.

2016-09-14 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4866:
---
Fix Version/s: (was: 7.1.0)
   7.0.0

> Remove traffic_cop health checking.
> ---
>
> Key: TS-4866
> URL: https://issues.apache.org/jira/browse/TS-4866
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Cop
>Affects Versions: 7.0.0
>Reporter: James Peach
>Assignee: Leif Hedstrom
> Fix For: 7.0.0
>
>
> There is a school of thought that {{traffic_cop}} health checking causes more 
> problems than it solves. Consider whether we should eliminate health checking 
> from {{traffic_cop}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4866) Remove traffic_cop health checking.

2016-09-14 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4866:
---
Affects Version/s: 7.0.0

> Remove traffic_cop health checking.
> ---
>
> Key: TS-4866
> URL: https://issues.apache.org/jira/browse/TS-4866
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Cop
>Affects Versions: 7.0.0
>Reporter: James Peach
>Assignee: Leif Hedstrom
> Fix For: 7.0.0
>
>
> There is a school of thought that {{traffic_cop}} health checking causes more 
> problems than it solves. Consider whether we should eliminate health checking 
> from {{traffic_cop}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4866) Remove traffic_cop health checking.

2016-09-14 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4866:
---
Backport to Version: 7.0.0

> Remove traffic_cop health checking.
> ---
>
> Key: TS-4866
> URL: https://issues.apache.org/jira/browse/TS-4866
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Cop
>Affects Versions: 7.0.0
>Reporter: James Peach
>Assignee: Leif Hedstrom
> Fix For: 7.0.0
>
>
> There is a school of thought that {{traffic_cop}} health checking causes more 
> problems than it solves. Consider whether we should eliminate health checking 
> from {{traffic_cop}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4867) Race condition in RefCountCacheSerializer Initialization

2016-09-14 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4867:
---
Summary: Race condition in RefCountCacheSerializer Initialization  (was: 
Race condition in RefCountCacheSerializer)

> Race condition in RefCountCacheSerializer Initialization
> 
>
> Key: TS-4867
> URL: https://issues.apache.org/jira/browse/TS-4867
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>
> Right now the serializer task is scheduled before one of the class members is 
> initialized-- which means we have a race between the next line running and 
> the eventProcessor scheduling+executing the task.
> Thanks to [~jaaju] for catching it!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4867) Race condition in RefCountCacheSerializer Initialization

2016-09-14 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4867:
--

Assignee: Thomas Jackson

> Race condition in RefCountCacheSerializer Initialization
> 
>
> Key: TS-4867
> URL: https://issues.apache.org/jira/browse/TS-4867
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> Right now the serializer task is scheduled before one of the class members is 
> initialized-- which means we have a race between the next line running and 
> the eventProcessor scheduling+executing the task.
> Thanks to [~jaaju] for catching it!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-4867) Potential race condition in RefCountCacheSerializer

2016-09-14 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-4867:
--

 Summary: Potential race condition in RefCountCacheSerializer
 Key: TS-4867
 URL: https://issues.apache.org/jira/browse/TS-4867
 Project: Traffic Server
  Issue Type: Bug
  Components: HostDB
Reporter: Thomas Jackson


Right now the serializer task is scheduled before one of the class members is 
initialized-- which means we have a race between the next line running and the 
eventProcessor scheduling+executing the task.

Thanks to [~jaaju] for catching it!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TS-4276) Segmentation fault when hostdb runs out of space

2016-09-08 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-4276.

Resolution: Fixed

> Segmentation fault when hostdb runs out of space
> 
>
> Key: TS-4276
> URL: https://issues.apache.org/jira/browse/TS-4276
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>
> Hostdb assumes that `lookup_done` returns a valid HostDBInfo (as mentioned in 
> comments-- 
> https://github.com/apache/trafficserver/blob/master/iocore/hostdb/HostDB.cc#L1545).
>  In actuality, lookup_done can return NULL in error conditions-- 
> primarily when it is full 
> (https://github.com/apache/trafficserver/blob/master/iocore/hostdb/HostDB.cc#L1363).
>  Because of this, if a lookup is being done when hostdb is full, r comes back 
> as NULL and we get a segmentation fault that looks like:
> {noformat}
> traffic_server: Segmentation fault (Address not mapped to object [(nil)])
> traffic_server - STACK TRACE: 
> ./bin/traffic_server(_Z19crash_logger_invokeiP9siginfo_tPv+0x8e)[0x4ab81e]
> /lib64/libpthread.so.0(+0x109f0)[0x7f991609a9f0]
> ./bin/traffic_server(_ZN18HostDBContinuation8dnsEventEiP7HostEnt+0xebb)[0x6ad5bb]
> ./bin/traffic_server(_ZN8DNSEntry9postEventEiP5Event+0x45)[0x6c5405]
> ./bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x8a)[0x7c2c0a]
> ./bin/traffic_server(_ZN7EThread7executeEv+0x7e8)[0x7c3a38]
> ./bin/traffic_server[0x7c26e5]
> /lib64/libpthread.so.0(+0x760a)[0x7f991609160a]
> /lib64/libc.so.6(clone+0x6d)[0x7f9914fa4a4d]
> Segmentation fault (core dumped)
> {noformat}
> Found while trying to repro TS-4207
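
A hedged sketch of the missing guard (per the comments below, the crash path
was ultimately removed by the hostdb rewrite, so this is illustrative rather
than the committed fix):

{code}
// lookup_done() can return NULL when hostdb is full, so dnsEvent()
// must check the result before using it. Argument names here are
// approximations of the real signature, not exact ATS code.
HostDBInfo *r = lookup_done(ip, hostname, is_rr, ttl_seconds, srv);
if (r == NULL) {
  Warning("hostdb full, dropping result for %s", hostname);
  return EVENT_DONE; // bail out instead of dereferencing NULL below
}
{code}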



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4276) Segmentation fault when hostdb runs out of space

2016-09-08 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15475729#comment-15475729
 ] 

Thomas Jackson commented on TS-4276:


This is actually no longer an issue since the hostdb rewrite I did, so we
are okay to close this.

On Sep 8, 2016 8:53 AM, "Leif Hedstrom (JIRA)"  wrote:


[ https://issues.apache.org/jira/browse/TS-4276?page=com.
atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanel=15474229#comment-15474229 ]

Leif Hedstrom commented on TS-4276:
---

[~jacksontj] Should we move this out to later? 7.1.0?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


> Segmentation fault when hostdb runs out of space
> 
>
> Key: TS-4276
> URL: https://issues.apache.org/jira/browse/TS-4276
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>
> Hostdb assumes that `lookup_done` returns a valid HostDBInfo (as mentioned in 
> comments-- 
> https://github.com/apache/trafficserver/blob/master/iocore/hostdb/HostDB.cc#L1545).
>  In actuality, lookup_done can return NULL in error conditions-- 
> primarily when it is full 
> (https://github.com/apache/trafficserver/blob/master/iocore/hostdb/HostDB.cc#L1363).
>  Because of this, if a lookup is being done when hostdb is full, r comes back 
> as NULL and we get a segmentation fault that looks like:
> {noformat}
> traffic_server: Segmentation fault (Address not mapped to object [(nil)])
> traffic_server - STACK TRACE: 
> ./bin/traffic_server(_Z19crash_logger_invokeiP9siginfo_tPv+0x8e)[0x4ab81e]
> /lib64/libpthread.so.0(+0x109f0)[0x7f991609a9f0]
> ./bin/traffic_server(_ZN18HostDBContinuation8dnsEventEiP7HostEnt+0xebb)[0x6ad5bb]
> ./bin/traffic_server(_ZN8DNSEntry9postEventEiP5Event+0x45)[0x6c5405]
> ./bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x8a)[0x7c2c0a]
> ./bin/traffic_server(_ZN7EThread7executeEv+0x7e8)[0x7c3a38]
> ./bin/traffic_server[0x7c26e5]
> /lib64/libpthread.so.0(+0x760a)[0x7f991609160a]
> /lib64/libc.so.6(clone+0x6d)[0x7f9914fa4a4d]
> Segmentation fault (core dumped)
> {noformat}
> Found while trying to repro TS-4207



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4684) Leaked references to HostDBInfos from HttpTransact

2016-09-01 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15456173#comment-15456173
 ] 

Thomas Jackson commented on TS-4684:


That is correct, and thanks :)

> Leaked references to HostDBInfos from HttpTransact
> --
>
> Key: TS-4684
> URL: https://issues.apache.org/jira/browse/TS-4684
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Affects Versions: 7.0.0
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> After running for a while I've noticed a slow leak in memory. After tracking 
> it down the leak is due to not properly decrementing the refcounts in 
> HttpTransact for Ptr. Seems that somewhere in the rebasing this 
> very important line was lost



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4796) ATS not closing origin connections on first RST from client

2016-08-29 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4796:
--

Assignee: Thomas Jackson

> ATS not closing origin connections on first RST from client
> ---
>
> Key: TS-4796
> URL: https://issues.apache.org/jira/browse/TS-4796
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> *TLDR; similar to TS-4720 -- slower to close than it should, instead of never 
> closing*
> As a continuation of TS-4720, while testing that the session is closed when 
> we expect-- I found that it isn't.
> Although we are now closing the sessions, we aren't doing it as quickly as we 
> should. In this client abort case we expect the client to abort, and ATS 
> should initially continue to send bytes to the client-- as we are in the 
> half-open state. After the first set of bytes are sent to the client-- the 
> client will send an RST-- which should signal ATS to stop sending the request 
> (and tear down the origin connection etc.).
> I'm able to reproduce this locally, and the debug output (with some 
> additional comments) looks like below:
> {code}
> < FIN FROM CLIENT >
> [Aug 29 18:25:07.491] Server {0x7effa538a800} DEBUG:  (main_handler)> (http) [0] [HttpSM::main_handler, VC_EVENT_EOS]
> [Aug 29 18:25:07.491] Server {0x7effa538a800} DEBUG:  (state_watch_for_client_abort)> (http) [0] 
> [::state_watch_for_client_abort, VC_EVENT_EOS]
> < RST FROM CLIENT >
> Got an HttpTunnel event 100 
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (producer_handler)> (http_tunnel) [0] producer_handler [http server 
> VC_EVENT_READ_READY]
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (producer_handler_chunked)> (http_tunnel) [0] producer_handler_chunked [http 
> server VC_EVENT_READ_READY]
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (read_size)> (http_chunk) read chunk size of 15 bytes
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (read_chunk)> (http_chunk) completed read of chunk of 15 bytes
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (producer_handler)> (http_redirect) [HttpTunnel::producer_handler] 
> enable_redirection: [1 0 0] event: 100
> Got an HttpTunnel event 101 
> [Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (consumer_handler)> (http_tunnel) [0] consumer_handler [user agent 
> VC_EVENT_WRITE_READY]
> write ready consumer_handler
> {code}
> In this situation the connection doesn't close here at the RST-- but rather 
> on the next set of bytes from the origin to send-- which end up tripping a 
> VC_EVENT_ERROR-- and tearing down the connection.
> When the client sends the first RST epoll returns a WRITE_READY event -- 
> which the HTTPTunnel consumer ignores completely. It seems then that when we 
> receive the WRITE_READY event we need to determine if we are already in the 
> writing state-- and if so, then we should stop the transaction (since we are 
> already edge-triggered).
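
A hedged sketch of the check that last paragraph points at (illustrative, not
the actual HttpTunnel consumer code):

{code}
// Sketch only: in the tunnel's consumer handler, a WRITE_READY that
// arrives while nothing is blocked on write space is the client RST
// surfacing through epoll (we are edge-triggered), so it should be
// treated as a teardown signal rather than ignored.
bool
write_ready_means_abort(bool consumer_alive, int64_t bytes_blocked_on_write)
{
  return consumer_alive && bytes_blocked_on_write == 0;
}
{code}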



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4796) ATS not closing origin connections on first RST from client

2016-08-29 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4796:
---
Description: 
*TLDR; similar to TS-4720 -- slower to close than it should, instead of never 
closing*

As a continuation of TS-4720, while testing that the session is closed when we 
expect-- I found that it isn't.

Although we are now closing the sessions, we aren't doing it as quickly as we 
should. In this client abort case we expect the client to abort, and ATS should 
initially continue to send bytes to the client-- as we are in the half-open 
state. After the first set of bytes are sent to the client-- the client will 
send an RST-- which should signal ATS to stop sending the request (and tear 
down the origin connection etc.).

I'm able to reproduce this locally, and the debug output (with some additional 
comments) looks like below:

{code}
< FIN FROM CLIENT >
[Aug 29 18:25:07.491] Server {0x7effa538a800} DEBUG:  (http) [0] [HttpSM::main_handler, VC_EVENT_EOS]
[Aug 29 18:25:07.491] Server {0x7effa538a800} DEBUG:  (http) [0] 
[::state_watch_for_client_abort, VC_EVENT_EOS]



< RST FROM CLIENT >
Got an HttpTunnel event 100 
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_tunnel) [0] producer_handler [http server 
VC_EVENT_READ_READY]
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_tunnel) [0] producer_handler_chunked [http 
server VC_EVENT_READ_READY]
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_chunk) read chunk size of 15 bytes
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_chunk) completed read of chunk of 15 bytes
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_redirect) [HttpTunnel::producer_handler] 
enable_redirection: [1 0 0] event: 100
Got an HttpTunnel event 101 
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_tunnel) [0] consumer_handler [user agent 
VC_EVENT_WRITE_READY]
write ready consumer_handler

{code}


In this situation the connection doesn't close here at the RST-- but rather on 
the next set of bytes from the origin to send-- which end up tripping a 
VC_EVENT_ERROR-- and tearing down the connection.

When the client sends the first RST epoll returns a WRITE_READY event -- which 
the HTTPTunnel consumer ignores completely. It seems then that when we receive 
the WRITE_READY event we need to determine if we are already in the writing 
state-- and if so, then we should stop the transaction (since we are already 
edge-triggered).

  was:
As a continuation of TS-4720, while testing that the session is closed when we 
expect-- I found that it isn't.

Although we are now closing the sessions, we aren't doing it as quickly as we 
should. In this client abort case we expect the client to abort, and ATS should 
initially continue to send bytes to the client-- as we are in the half-open 
state. After the first set of bytes are sent to the client-- the client will 
send an RST-- which should signal ATS to stop sending the request (and tear 
down the origin connection etc.).

I'm able to reproduce this locally, and the debug output (with some additional 
comments) looks like below:

{code}
< FIN FROM CLIENT >
[Aug 29 18:25:07.491] Server {0x7effa538a800} DEBUG:  (http) [0] [HttpSM::main_handler, VC_EVENT_EOS]
[Aug 29 18:25:07.491] Server {0x7effa538a800} DEBUG:  (http) [0] 
[::state_watch_for_client_abort, VC_EVENT_EOS]



< RST FROM CLIENT >
Got an HttpTunnel event 100 
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_tunnel) [0] producer_handler [http server 
VC_EVENT_READ_READY]
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_tunnel) [0] producer_handler_chunked [http 
server VC_EVENT_READ_READY]
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_chunk) read chunk size of 15 bytes
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_chunk) completed read of chunk of 15 bytes
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_redirect) [HttpTunnel::producer_handler] 
enable_redirection: [1 0 0] event: 100
Got an HttpTunnel event 101 
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_tunnel) [0] consumer_handler [user agent 
VC_EVENT_WRITE_READY]
write ready consumer_handler

{code}


In this situation the connection doesn't close here at the RST-- but rather on 
the next set of bytes from the origin to send-- which end up tripping a 
VC_EVENT_ERROR-- and tearing down the connection.

When the client sends the first RST epoll returns a WRITE_READY event -- which 
the HTTPTunnel consumer ignores completely. It seems then that when we receive 
the WRITE_READY event we need to determine if we are already in the writing 
state-- and if so, then we should stop the transaction (since we are already 
edge-triggered).


> ATS not closing origin connections on first RST from client
> ---
>
> Key: TS-4796
>   

[jira] [Created] (TS-4796) ATS not closing origin connections on first RST from client

2016-08-29 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-4796:
--

 Summary: ATS not closing origin connections on first RST from 
client
 Key: TS-4796
 URL: https://issues.apache.org/jira/browse/TS-4796
 Project: Traffic Server
  Issue Type: Bug
  Components: HTTP
Reporter: Thomas Jackson


As a continuation of TS-4720, while testing that the session is closed when we 
expect-- I found that it isn't.

Although we are now closing the sessions, we aren't doing it as quickly as we 
should. In this client abort case we expect the client to abort, and ATS should 
initially continue to send bytes to the client-- as we are in the half-open 
state. After the first set of bytes are sent to the client-- the client will 
send an RST-- which should signal ATS to stop sending the request (and tear 
down the origin connection etc.).

I'm able to reproduce this locally, and the debug output (with some additional 
comments) looks like below:

{code}
< FIN FROM CLIENT >
[Aug 29 18:25:07.491] Server {0x7effa538a800} DEBUG:  (http) [0] [HttpSM::main_handler, VC_EVENT_EOS]
[Aug 29 18:25:07.491] Server {0x7effa538a800} DEBUG:  (http) [0] 
[::state_watch_for_client_abort, VC_EVENT_EOS]



< RST FROM CLIENT >
Got an HttpTunnel event 100 
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_tunnel) [0] producer_handler [http server 
VC_EVENT_READ_READY]
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_tunnel) [0] producer_handler_chunked [http 
server VC_EVENT_READ_READY]
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_chunk) read chunk size of 15 bytes
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_chunk) completed read of chunk of 15 bytes
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_redirect) [HttpTunnel::producer_handler] 
enable_redirection: [1 0 0] event: 100
Got an HttpTunnel event 101 
[Aug 29 18:25:13.062] Server {0x7effa538a800} DEBUG:  (http_tunnel) [0] consumer_handler [user agent 
VC_EVENT_WRITE_READY]
write ready consumer_handler

{code}


In this situation the connection doesn't close here at the RST-- but rather on 
the next set of bytes from the origin to send-- which end up tripping a 
VC_EVENT_ERROR-- and tearing down the connection.

When the client sends the first RST epoll returns a WRITE_READY event -- which 
the HTTPTunnel consumer ignores completely. It seems then that when we receive 
the WRITE_READY event we need to determine if we are already in the writing 
state-- and if so, then we should stop the transaction (since we are already 
edge-triggered).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4720) ATS not properly closing origin connections in client abort situations

2016-08-29 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4720:
---
Summary: ATS not properly closing origin connections in client abort 
situations  (was: ATS not properly closing origin connections in client abort 
situationsounce L0proxy in ela4 and idb2 for memory leak)

> ATS not properly closing origin connections in client abort situations
> --
>
> Key: TS-4720
> URL: https://issues.apache.org/jira/browse/TS-4720
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We've noticed that there are some scenarios that ATS doesn't close the origin 
> connection when the client aborts. To reproduce I set up an http server which 
> would return a text/stream sending a message every 10s. In this case, if I do 
> a GET request to the endpoint and then immediately kill the client, the 
> connection to the origin doesn't close until the transaction active timer 
> kicks in. 
> After digging into this, it seems that this is actually due to a bug in the 
> HttpSM-- specifically in how it checks whether a request has a body. The 
> default value for content-length is `-1`, but some checks are `== 0` -- which 
> means that if the request had no content-length header it is treated as a 
> request with a content-length. 
> The particular place that was problematic was the section that enables the 
> vio reader to watch for client aborts-- which specifically isn't enabled for 
> POST/chunked requests as it is enabled later down the call chain (since it 
> needs to handle the buffers itself).
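
The shape of the bug in a hedged sketch (accessor names are hypothetical, not
the actual HttpSM code):

{code}
// Content-length bookkeeping: -1 conventionally means "no
// Content-Length header was present".
int64_t content_length = request.get_content_length(); // hypothetical accessor

// Buggy check: `!= 0` treats the missing-header case (-1) as "has a
// body", so the client-abort watcher is never armed for plain GETs.
bool has_body_buggy = (content_length != 0);

// Intended check: only a positive length or chunked encoding means
// there is actually a request body to consume.
bool has_body = (content_length > 0) || request.is_chunked();
{code}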



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4342) Connections queued when hitting `proxy.config.http.origin_max_connections` don't honor order

2016-08-18 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426814#comment-15426814
 ] 

Thomas Jackson commented on TS-4342:


And replace it with what? Or are we planning on just dropping when we hit 
limits? I agree we should get rid of these sleep queues, but I'm not really for 
no queuing at all.

> Connections queued when hitting `proxy.config.http.origin_max_connections` 
> don't honor order
> 
>
> Key: TS-4342
> URL: https://issues.apache.org/jira/browse/TS-4342
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Network
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> As of today when you hit `proxy.config.http.origin_max_connections` requests 
> are queued waiting for an available connection to origin. Today this is done 
> by simply [rescheduling 100ms in the 
> future|https://github.com/apache/trafficserver/blob/master/proxy/http/HttpSM.cc#L4756].
>  Ideally this would honor the order with which the requests came in.
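
A minimal sketch of FIFO queuing in place of the 100ms reschedule (types and
wakeup mechanics below are hypothetical):

{code}
#include <deque>

// Sketch only: park waiting state machines in arrival order and wake
// the head when an origin connection frees up, instead of every waiter
// re-polling on its own 100ms timer (which loses ordering).
struct OriginWaitQueue {
  std::deque<HttpSM *> waiters;

  void enqueue(HttpSM *sm) { waiters.push_back(sm); }

  // Called when a connection is returned to the origin's pool.
  void on_connection_released() {
    if (!waiters.empty()) {
      HttpSM *sm = waiters.front();
      waiters.pop_front();
      sm->handleEvent(EVENT_INTERVAL, NULL); // resume oldest waiter first
    }
  }
};
{code}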



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (TS-4342) Connections queued when hitting `proxy.config.http.origin_max_connections` don't honor order

2016-08-18 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reopened TS-4342:


> Connections queued when hitting `proxy.config.http.origin_max_connections` 
> don't honor order
> 
>
> Key: TS-4342
> URL: https://issues.apache.org/jira/browse/TS-4342
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Network
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> As of today when you hit `proxy.config.http.origin_max_connections` requests 
> are queued waiting for an available connection to origin. Today this is done 
> by simply [rescheduling 100ms in the 
> future|https://github.com/apache/trafficserver/blob/master/proxy/http/HttpSM.cc#L4756].
>  Ideally this would honor the order with which the requests came in.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4343) When ATS hits `proxy.config.http.origin_max_connections`, the requests queued will not abort even if client aborts

2016-08-18 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4343:
---
Fix Version/s: sometime

> When ATS hits `proxy.config.http.origin_max_connections`, the requests queued 
> will not abort even if client aborts
> --
>
> Key: TS-4343
> URL: https://issues.apache.org/jira/browse/TS-4343
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Network
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: sometime
>
>
> While testing my patch for TS-4341 I've noticed that once a request is queued 
> waiting for max connections, ATS will not abort the request even if the 
> client has completely disconnected. It seems that ATS is only setting up a 
> consumer of the UA tunnel if there is a post body-- and as such we never see 
> the socket close on us. Still digging into this some, but this seems like a 
> problem ;)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4346) Requests that hit connection limits (max_origin_connections, etc.) incur a hard-coded latency hit

2016-08-18 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4346:
---
Fix Version/s: sometime

> Requests that hit connection limits (max_origin_connections, etc.) incur a 
> hard-coded latency hit
> -
>
> Key: TS-4346
> URL: https://issues.apache.org/jira/browse/TS-4346
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Network
>Reporter: Thomas Jackson
> Fix For: sometime
>
>
> example: 
> https://github.com/apache/trafficserver/blob/master/proxy/http/HttpSM.cc#L4741
> Ideally this would instead put the transaction in some queue to be processed, 
> such that the transaction could be resumed as soon as a connection becomes 
> available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (TS-4343) When ATS hits `proxy.config.http.origin_max_connections`, the requests queued will not abort even if client aborts

2016-08-18 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reopened TS-4343:


> When ATS hits `proxy.config.http.origin_max_connections`, the requests queued 
> will not abort even if client aborts
> --
>
> Key: TS-4343
> URL: https://issues.apache.org/jira/browse/TS-4343
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Network
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> While testing my patch for TS-4341 I've noticed that once a request is queued 
> waiting for max connections, ATS will not abort the request even if the 
> client has completely disconnected. It seems that ATS is only setting up a 
> consumer of the UA tunnel if there is a post body-- and as such we never see 
> the socket close on us. Still digging into this some, but this seems like a 
> problem ;)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4346) Requests that hit connection limits (max_origin_connections, etc.) incur a hard-coded latency hit

2016-08-18 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15426809#comment-15426809
 ] 

Thomas Jackson commented on TS-4346:


Why would we close this won't fix? We simply need to queue the connections
when they hit the limit instead of sleeping.




> Requests that hit connection limits (max_origin_connections, etc.) incur a 
> hard-coded latency hit
> -
>
> Key: TS-4346
> URL: https://issues.apache.org/jira/browse/TS-4346
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Network
>Reporter: Thomas Jackson
>
> example: 
> https://github.com/apache/trafficserver/blob/master/proxy/http/HttpSM.cc#L4741
> Ideally this would instead put the transaction in some queue to be processed, 
> such that the transaction could be resumed as soon as a connection becomes 
> available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (TS-4346) Requests that hit connection limits (max_origin_connections, etc.) incur a hard-coded latency hit

2016-08-18 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reopened TS-4346:


> Requests that hit connection limits (max_origin_connections, etc.) incur a 
> hard-coded latency hit
> -
>
> Key: TS-4346
> URL: https://issues.apache.org/jira/browse/TS-4346
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Network
>Reporter: Thomas Jackson
>
> example: 
> https://github.com/apache/trafficserver/blob/master/proxy/http/HttpSM.cc#L4741
> Ideally this would instead put the transaction in some queue to be processed, 
> such that the transaction could be resumed as soon as a connection becomes 
> available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4027) Incorrect usage of mmap

2016-08-17 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15424816#comment-15424816
 ] 

Thomas Jackson commented on TS-4027:


From the looks of the report it sounds like the reported problem is an issue 
with how multicache was doing mmap-- good news the new one doesn't use mmap at 
all :)

> Incorrect usage of mmap
> ---
>
> Key: TS-4027
> URL: https://issues.apache.org/jira/browse/TS-4027
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 6.0.0
>Reporter: Jason Kenny
>Assignee: Jason Kenny
> Fix For: 7.0.0
>
>
> The specific type of the crash caused by this failure (whether it is the 
> SCT_SetReg assertion, or Unexpected memory deallocation error) is pretty 
> random.
> This is because the crash is caused by the application’s problematic/buggy 
> behavior in a way I’ll describe below.
> This is a trace of the syscalls invoked by traffic_server as generated with 
> strace (running without Pin) - I highlighted the important syscalls:
> open("/tmp/yts/var/trafficserver/host.db", O_RDWR|O_CREAT, 0644) = 118
> fstat(118, {st_mode=S_IFREG|0644, st_size=25935872, ...}) = 0
> open("/dev/zero", O_RDONLY) = 119
> 1.   mmap(NULL, 25935872, PROT_READ, MAP_SHARED|MAP_NORESERVE, 119, 0) = 
> 0x7fffef935000
> 2.   munmap(0x7fffef935000, 25935872) = 0
> 3.   mmap(0x7fffef935000, 966656, PROT_READ|PROT_WRITE, 
> MAP_SHARED|MAP_FIXED|MAP_NORESERVE, 118, 0) = 0x7fffef935000
> mmap(0x7fffefa21000, 7675904, PROT_READ|PROT_WRITE, 
> MAP_SHARED|MAP_FIXED|MAP_NORESERVE, 118, 0xec000) = 0x7fffefa21000
> mmap(0x70173000, 17285120, PROT_READ|PROT_WRITE, 
> MAP_SHARED|MAP_FIXED|MAP_NORESERVE, 118, 0x83e000) = 0x70173000
> mmap(0x711ef000, 8192, PROT_READ|PROT_WRITE, 
> MAP_SHARED|MAP_FIXED|MAP_NORESERVE, 118, 0x18ba000) = 0x711ef000
> close(119)  = 0
> close(118)  = 0
> 1.   You can see that the application is first calling mmap with “addr” 
> argument equal to NULL in order to get a memory region that covers the whole 
> region requested for mapping of the file host.db (25935872 bytes).
> 2.   The application unmaps the allocated region (at address 
> 0x7fffef935000), but remembers this region as a “free” region.
> 3.   Then, the application tries to mmap each region of the file to 
> addresses inside the region (from step 1) that it considers as “free”. These 
> mmaps are called with the MAP_FIXED flag – meaning that if there is already a 
> memory mapped in the requested region then the kernel should still map the 
> requested memory in a way that the already mapped region will be discarded 
> (zeroed).
>  
> As you probably guess, Between step 2 and 3 Pin is asking the OS to allocate 
> memory for its own purpose and get some memory inside the unmapped region 
> that the application considers as “free”.
> This will cause the mapping in step 3 to overwrite Pin’s memory and corrupt 
> it – leading to this crash.
> This is a bug in the application (traffic_server) that needs to be fixed.
> In a multi-threaded environment (traffic_server is multi-threaded) another 
> thread can request memory mapping (by mmap) between step 2 and 3, and get a 
> memory region that intersects with the problematic region.
> This allocated memory will eventually be corrupted.
> You don’t need Pin and dynamic instrumentation environment in order to 
> reproduce this bug!
>  
> BTW, the application call-stack of this crash, as reported by Inspector is:
> [call stack stripped in the archive]
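
A hedged sketch of the standard fix for this pattern (not the committed ATS
patch): keep the reservation mapped with PROT_NONE instead of unmapping it,
then overlay file regions with MAP_FIXED, so there is never a window in which
another thread can be handed an address inside the range:

{code}
#include <stddef.h>
#include <sys/mman.h>

// Sketch only. Because the PROT_NONE reservation is never unmapped,
// the MAP_FIXED overlay replaces pages we still own -- eliminating the
// race between steps 2 and 3 in the report above.
void *map_reserved(int fd, size_t total_size, size_t region_size)
{
  void *base = mmap(NULL, total_size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (base == MAP_FAILED)
    return NULL;
  // Overlay the first file region in place; real code would loop over
  // every region at its offset.
  if (mmap(base, region_size, PROT_READ | PROT_WRITE,
           MAP_SHARED | MAP_FIXED | MAP_NORESERVE, fd, 0) == MAP_FAILED) {
    munmap(base, total_size);
    return NULL;
  }
  return base;
}
{code}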



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-2358) DNS does not fail-over promptly for DNS server returning SERVFAIL

2016-08-16 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422932#comment-15422932
 ] 

Thomas Jackson commented on TS-2358:


I'm not sure if this is still an issue-- haven't run into that problem. In 
general SERVFAIL means the authoritative nameserver is having problems. So at 
least in my infra trying another nameserver would return equally as broken 
results-- although I do agree that it'd be good to try a different one (in case 
one of your immediate upstream resolvers is having issues). I can add this to 
my backlog of things to take a look at.

> DNS does not fail-over promptly for DNS server returning SERVFAIL
> -
>
> Key: TS-2358
> URL: https://issues.apache.org/jira/browse/TS-2358
> Project: Traffic Server
>  Issue Type: Bug
>  Components: DNS
>Affects Versions: 3.2.5
>Reporter: William Bardwell
>Assignee: Thomas Jackson
> Fix For: sometime
>
> Attachments: ats.dns.txt
>
>
> If I have 2 dns servers listed in my resolv.conf and the first one is 
> returning SERVFAIL for something that I try to lookup, ATS takes a long time 
> to fail over, and won't do it for the first request to look something up.  
> Using normal system commands (host, ping etc.) with the same resolv.conf work 
> fine.
> I tried various config values with out much improvement.  I could make it 
> fail in 40sec instead of 60sec for the initial failure...
> debug logs will be attached, doing one DNS and then waiting a while and doing 
> another.  (Doing more before enough time has passed don't seem to help much.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-2134) SRV lookup does not handle failover correctly

2016-08-15 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421988#comment-15421988
 ] 

Thomas Jackson commented on TS-2134:


This should be fixed in 7.x (not sure exactly which patch, but I have tested 
that case). If someone else wants to verify that'd be great :)

> SRV lookup does not handle failover correctly
> -
>
> Key: TS-2134
> URL: https://issues.apache.org/jira/browse/TS-2134
> Project: Traffic Server
>  Issue Type: Bug
>  Components: DNS, HTTP
>Reporter: Thach Tran
>Assignee: Thomas Jackson
>  Labels: review
> Fix For: 7.0.0
>
> Attachments: ats.log, ts2134.patch
>
>
> I'm seeing an issue with SRV lookup in ATS in which the proxy doesn't fail 
> over to alternative origins once the first choice is marked as down.
> To reproduce this, I'm running dnsmasq as a local resolver to serve up the 
> test SRV records. My configuration is as follows.
> h4. records.config
> CONFIG proxy.config.dns.nameservers STRING 127.0.0.1
> CONFIG proxy.config.dns.resolv_conf STRING NULL
> CONFIG proxy.config.srv_enabled INT 1
> h4. remap.config
> regex_remap http://.*:8080/ https://noexample.com/
> h4. dnsmasq.conf (srv records config)
> srv-host=_http._tcp.noexample.com,abc.com,443,0,100
> srv-host=_http._tcp.noexample.com,google.com,443,1,100
> The intention is since the srv lookup for _http._tcp.noexample.com returns 
> abc.com:443 and google.com:443 with abc.com:443 being the one with higher 
> priority, the proxy should try that first and once the connection to 
> abc.com:443 is marked as down (up to 6 retries by default), google.com:443 
> should be tried next and the connection should succeed then.
> However, testing with the following curl command multiple times still gives 
> back 502.
> $ curl -v http://localhost:8080/
> Debug log seems to suggest it always attempts abc.com:443.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TS-4720) ATS not properly closing origin connections in client abort situationsounce L0proxy in ela4 and idb2 for memory leak

2016-08-05 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-4720.

Resolution: Fixed

> ATS not properly closing origin connections in client abort situationsounce 
> L0proxy in ela4 and idb2 for memory leak
> 
>
> Key: TS-4720
> URL: https://issues.apache.org/jira/browse/TS-4720
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We've noticed that there are some scenarios that ATS doesn't close the origin 
> connection when the client aborts. To reproduce I set up an http server which 
> would return a text/stream sending a message every 10s. In this case, if I do 
> a GET request to the endpoint and then immediately kill the client, the 
> connection to the origin doesn't close until the transaction active timer 
> kicks in. 
> After digging into this, it seems that this is actually due to a bug in the 
> HttpSM-- specifically in how it checks whether a request has a body. The 
> default value for content-length is `-1`, but some checks are `== 0` -- which 
> means that if the request had no content-length header it is treated as a 
> request with a content-length. 
> The particular place that was problematic was the section that enables the 
> vio reader to watch for client aborts-- which specifically isn't enabled for 
> POST/chunked requests as it is enabled later down the call chain (since it 
> needs to handle the buffers itself).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4720) ATS not properly closing origin connections in client abort situationsounce L0proxy in ela4 and idb2 for memory leak

2016-08-05 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4720:
---
Backport to Version: 6.2.1
Summary: ATS not properly closing origin connections in client 
abort situationsounce L0proxy in ela4 and idb2 for memory leak  (was: ATS not 
properly closing origin connections in client abort situations)

> ATS not properly closing origin connections in client abort situationsounce 
> L0proxy in ela4 and idb2 for memory leak
> 
>
> Key: TS-4720
> URL: https://issues.apache.org/jira/browse/TS-4720
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> We've noticed that there are some scenarios that ATS doesn't close the origin 
> connection when the client aborts. To reproduce I set up an http server which 
> would return a text/stream sending a message every 10s. In this case, if I do 
> a GET request to the endpoint and then immediately kill the client, the 
> connection to the origin doesn't close until the transaction active timer 
> kicks in. 
> After digging into this, it seems that this is actually due to a bug in the 
> HttpSM-- specifically in how it checks whether a request has a body. The 
> default value for content-length is `-1`, but some checks are `== 0` -- which 
> means that if the request had no content-length header it is treated as a 
> request with a content-length. 
> The particular place that was problematic was the section that enables the 
> vio reader to watch for client aborts-- which specifically isn't enabled for 
> POST/chunked requests as it is enabled later down the call chain (since it 
> needs to handle the buffers itself).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4710) Make `proxy.config.srv_enabled` transaction overrideable

2016-08-05 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4710:
---
Fix Version/s: 7.0.0

> Make `proxy.config.srv_enabled` transaction overrideable
> 
>
> Key: TS-4710
> URL: https://issues.apache.org/jira/browse/TS-4710
> Project: Traffic Server
>  Issue Type: Improvement
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TS-4710) Make `proxy.config.srv_enabled` transaction overrideable

2016-08-05 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-4710.

Resolution: Fixed

> Make `proxy.config.srv_enabled` transaction overrideable
> 
>
> Key: TS-4710
> URL: https://issues.apache.org/jira/browse/TS-4710
> Project: Traffic Server
>  Issue Type: Improvement
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (TS-4720) ATS not properly closing origin connections in client abort situations

2016-08-04 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4720:
---
Comment: was deleted

(was: https://github.com/apache/trafficserver/pull/841)

> ATS not properly closing origin connections in client abort situations
> --
>
> Key: TS-4720
> URL: https://issues.apache.org/jira/browse/TS-4720
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We've noticed that there are some scenarios in which ATS doesn't close the 
> origin connection when the client aborts. To reproduce, I set up an HTTP 
> server that returns a text/stream response, sending a message every 10s. In 
> this case, if I do a GET request to the endpoint and then immediately kill 
> the client, the connection to the origin doesn't close until the transaction 
> active timeout kicks in.
> After digging into this, it seems that this is actually due to a bug in the 
> HttpSM-- specifically in how it checks whether a request has a body. The 
> default value for content-length is `-1`, but some checks are `== 0`-- which 
> means that a request with no content-length header is treated as a request 
> that has a body.
> The particular place that was problematic was the section that enables the 
> vio reader to watch for client aborts-- which intentionally isn't enabled 
> for POST/chunked requests, as it is enabled later down the call chain (since 
> that path needs to handle the buffers itself).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4720) ATS not properly closing origin connections in client abort situations

2016-08-04 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408718#comment-15408718
 ] 

Thomas Jackson commented on TS-4720:


https://github.com/apache/trafficserver/pull/841

> ATS not properly closing origin connections in client abort situations
> --
>
> Key: TS-4720
> URL: https://issues.apache.org/jira/browse/TS-4720
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We've noticed that there are some scenarios in which ATS doesn't close the 
> origin connection when the client aborts. To reproduce, I set up an HTTP 
> server that returns a text/stream response, sending a message every 10s. In 
> this case, if I do a GET request to the endpoint and then immediately kill 
> the client, the connection to the origin doesn't close until the transaction 
> active timeout kicks in.
> After digging into this, it seems that this is actually due to a bug in the 
> HttpSM-- specifically in how it checks whether a request has a body. The 
> default value for content-length is `-1`, but some checks are `== 0`-- which 
> means that a request with no content-length header is treated as a request 
> that has a body.
> The particular place that was problematic was the section that enables the 
> vio reader to watch for client aborts-- which intentionally isn't enabled 
> for POST/chunked requests, as it is enabled later down the call chain (since 
> that path needs to handle the buffers itself).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4720) ATS not properly closing origin connections in client abort situations

2016-08-04 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4720:
---
Fix Version/s: 7.0.0

> ATS not properly closing origin connections in client abort situations
> --
>
> Key: TS-4720
> URL: https://issues.apache.org/jira/browse/TS-4720
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>
> We've noticed that there are some scenarios in which ATS doesn't close the 
> origin connection when the client aborts. To reproduce, I set up an HTTP 
> server that returns a text/stream response, sending a message every 10s. In 
> this case, if I do a GET request to the endpoint and then immediately kill 
> the client, the connection to the origin doesn't close until the transaction 
> active timeout kicks in.
> After digging into this, it seems that this is actually due to a bug in the 
> HttpSM-- specifically in how it checks whether a request has a body. The 
> default value for content-length is `-1`, but some checks are `== 0`-- which 
> means that a request with no content-length header is treated as a request 
> that has a body.
> The particular place that was problematic was the section that enables the 
> vio reader to watch for client aborts-- which intentionally isn't enabled 
> for POST/chunked requests, as it is enabled later down the call chain (since 
> that path needs to handle the buffers itself).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-4720) ATS not properly closing origin connections in client abort situations

2016-08-04 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-4720:
--

 Summary: ATS not properly closing origin connections in client 
abort situations
 Key: TS-4720
 URL: https://issues.apache.org/jira/browse/TS-4720
 Project: Traffic Server
  Issue Type: Bug
  Components: HTTP
Reporter: Thomas Jackson


We've noticed that there are some scenarios in which ATS doesn't close the 
origin connection when the client aborts. To reproduce, I set up an HTTP 
server that returns a text/stream response, sending a message every 10s. In 
this case, if I do a GET request to the endpoint and then immediately kill the 
client, the connection to the origin doesn't close until the transaction 
active timeout kicks in.

After digging into this, it seems that this is actually due to a bug in the 
HttpSM-- specifically in how it checks whether a request has a body. The 
default value for content-length is `-1`, but some checks are `== 0`-- which 
means that a request with no content-length header is treated as a request 
that has a body.

The particular place that was problematic was the section that enables the vio 
reader to watch for client aborts-- which intentionally isn't enabled for 
POST/chunked requests, as it is enabled later down the call chain (since that 
path needs to handle the buffers itself).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4509) Dropped keep-alive connections not being re-established (TS-3959 continued)

2016-08-03 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407042#comment-15407042
 ] 

Thomas Jackson commented on TS-4509:


[~amc] we had discussed adding a "sent byte count" before; the issue is that it 
is just the number of bytes we've scheduled for send-- not how many were 
actually sent. What we actually need answers the question "how many of the 
bytes I queued for send haven't been sent yet?". With that we can determine 
that even though we queued the entire request, if none of it was actually sent 
(all of it was still outstanding) we can safely re-send the request.

The big thing here is to ensure that if we sent data to the FE and it was 
ACKed, we cannot retry the request.
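
A sketch of the invariant being discussed, under the assumption that a 
per-connection ACK counter is available (on Linux it could plausibly be 
derived from TCP_INFO; none of these names are ATS's):

{code}
#include <cstdint>

// Hypothetical per-connection send accounting.
struct SendState {
  int64_t bytes_queued = 0; // scheduled for send (what the existing counter tracks)
  int64_t bytes_acked  = 0; // confirmed received by the peer
};

// Retrying is only safe when the origin provably saw none of the request:
// everything we queued must still be outstanding (unACKed).
bool safe_to_retry(const SendState &s) {
  return s.bytes_acked == 0;
}
{code}

Even if the whole request was queued, bytes_acked == 0 means all of it was 
still outstanding, so a transparent re-send cannot cause the origin to process 
the request twice.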

> Dropped keep-alive connections not being re-established (TS-3959 continued)
> ---
>
> Key: TS-4509
> URL: https://issues.apache.org/jira/browse/TS-4509
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core, Network
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>Priority: Blocker
> Fix For: 7.0.0
>
>
> I've observed some differences in how TrafficServer 6.0.0 behaves with 
> connection retrying and outgoing keep-alive connections. I believe the 
> changes in behavior might be related to this issue: 
> https://issues.apache.org/jira/browse/TS-3440
> I originally wasn't sure if this was a bug, but James Peach indicated it 
> sounded more like a regression on the mailing list 
> (http://mail-archives.apache.org/mod_mbox/trafficserver-users/201510.mbox/%3cba85d5a2-8b29-44a9-acdc-e7fa8d21f...@apache.org%3e).
> What I'm seeing in 6.0.0 is that if TrafficServer has some backend keep-alive 
> connections already opened, but then one of the keep-alive connections is 
> closed, the next request to TrafficServer may generate a 502 Server Hangup 
> response when attempting to reuse that connection. Previously, I think 
> TrafficServer was retrying when it encountered a closed keep-alive 
> connection, but that is no longer the case. So if you have a backend that 
> might unexpectedly close its open keep-alive connections, the only way I've 
> found to completely prevent these 502 errors in 6.0.0 is to disable outgoing 
> keepalive (proxy.config.http.keep_alive_enabled_out and 
> proxy.config.http.keep_alive_post_out settings).
> For a slightly more concrete example of what can trigger this, this is fairly 
> easy to reproduce with the following setup:
> - TrafficServer is proxying to nginx with outgoing keep-alive connections 
> enabled (the default).
> - Throw a constant stream of requests at TrafficServer.
> - While that constant stream of requests is happening, also send a regular 
> stream of SIGHUP commands to nginx to reload nginx.
> - Eventually you'll get some 502 Server Hangup responses from TrafficServer 
> among your stream of requests.
> SIGHUPs in nginx should result in zero downtime for new requests, but I think 
> what's happening is that TrafficServer may fail when an old keep-alive 
> connection is reused (it's not common, so it depends on the timing of things 
> and if the connection is from an old nginx worker that has since been shut 
> down). In TrafficServer 5.3.1 these connection failures were retried, but in 
> 6.0.0, no retries occur in this case.
> Here's some debug logs that show the difference in behavior between 6.0.0 and 
> 5.3.1. Note that differences seem to stem from how each version eventually 
> handles the "VC_EVENT_EOS" event following 
> "::state_send_server_request_header, VC_EVENT_WRITE_COMPLETE".
> 5.3.1: 
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_5-3-1-log-L316
> 6.0.0: 
> https://gist.github.com/GUI/0c53a6c4fdc2782b14aa#file-trafficserver_6-0-0-log-L314
> Interestingly, if I'm understanding the log files correctly, it looks like 
> TrafficServer is reporting an odd empty response from these connections 
> ("HTTP/0.9 0" in 5.3.1 and "HTTP/1.0 0" in 6.0.0). However, as far as I can 
> tell from TCP dumps on the system, nginx is not actually sending any form of 
> response.
> In these example cases the backend server isn't sending back any data (at 
> least as far as I can tell), so from what I understand (and the logic 
> outlined in https://issues.apache.org/jira/browse/TS-3440), it should be safe 
> to retry.
> Let me know if I can provide any other details. Or if exact scripts to 
> reproduce the issues against the example nginx backend I described above 
> would be useful, I could get that together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4710) Make `proxy.config.srv_enabled` transaction overrideable

2016-08-01 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15403203#comment-15403203
 ] 

Thomas Jackson commented on TS-4710:


https://github.com/apache/trafficserver/pull/836

> Make `proxy.config.srv_enabled` transaction overrideable
> 
>
> Key: TS-4710
> URL: https://issues.apache.org/jira/browse/TS-4710
> Project: Traffic Server
>  Issue Type: Improvement
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4710) Make `proxy.config.srv_enabled` transaction overrideable

2016-08-01 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4710:
--

Assignee: Thomas Jackson

> Make `proxy.config.srv_enabled` transaction overrideable
> 
>
> Key: TS-4710
> URL: https://issues.apache.org/jira/browse/TS-4710
> Project: Traffic Server
>  Issue Type: Improvement
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-4710) Make `proxy.config.srv_enabled` transaction overrideable

2016-08-01 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-4710:
--

 Summary: Make `proxy.config.srv_enabled` transaction overrideable
 Key: TS-4710
 URL: https://issues.apache.org/jira/browse/TS-4710
 Project: Traffic Server
  Issue Type: Improvement
  Components: HostDB
Reporter: Thomas Jackson






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4693) SRV priorities don't work

2016-07-26 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4693:
---
Fix Version/s: (was: sometime)
   7.0.0

> SRV priorities don't work
> -
>
> Key: TS-4693
> URL: https://issues.apache.org/jira/browse/TS-4693
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>
> Although priorities are stored in hostdb (and code exists that attempts to 
> honor them)-- from some black-box testing it seems that the priorities aren't 
> taken into consideration for LB.
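
For context, honoring priorities (and weights, per TS-4692) would look roughly 
like the RFC 2782 selection below-- restrict to the lowest priority value 
present, then draw weighted-random within that group. A sketch only; the 
record type and function are illustrative, not hostdb's:

{code}
#include <algorithm>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Illustrative SRV record; field names are not ATS's.
struct SrvRecord {
  std::string target;
  uint16_t priority; // lower wins (RFC 2782)
  uint16_t weight;   // proportional share within a priority group
};

// Pick a target the RFC 2782 way: restrict to the lowest priority value
// present, then do a weighted-random draw inside that group.
const SrvRecord *select_srv(const std::vector<SrvRecord> &records, std::mt19937 &rng) {
  if (records.empty())
    return nullptr;
  uint16_t best = std::min_element(records.begin(), records.end(),
                                   [](const SrvRecord &a, const SrvRecord &b) {
                                     return a.priority < b.priority;
                                   })->priority;
  uint32_t total = 0;
  for (const auto &r : records)
    if (r.priority == best)
      total += r.weight;
  // All-zero weights: fall through to the first record in the group.
  std::uniform_int_distribution<uint32_t> dist(0, total ? total - 1 : 0);
  uint32_t pick = dist(rng);
  for (const auto &r : records) {
    if (r.priority != best)
      continue;
    if (total == 0 || pick < r.weight)
      return &r;
    pick -= r.weight;
  }
  return nullptr;
}
{code}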



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (TS-4692) SRV weights don't work

2016-07-26 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson closed TS-4692.
--
Resolution: Duplicate

> SRV weights don't work
> --
>
> Key: TS-4692
> URL: https://issues.apache.org/jira/browse/TS-4692
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: sometime
>
>
> Although weights are stored in hostdb (and code exists that attempts to 
> apply weighting)-- from some black-box testing it seems that the weighting 
> isn't taken into consideration for LB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4693) SRV priorities don't work

2016-07-26 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395095#comment-15395095
 ] 

Thomas Jackson commented on TS-4693:


PR with the fix (and re-enabled tests): 
https://github.com/apache/trafficserver/pull/827

> SRV priorities don't work
> -
>
> Key: TS-4693
> URL: https://issues.apache.org/jira/browse/TS-4693
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>
> Although priorities are stored in hostdb (and code exists that attempts to 
> honor them)-- from some black-box testing it seems that the priorities aren't 
> taken into consideration for LB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4692) SRV weights don't work

2016-07-22 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4692:
---
Issue Type: Bug  (was: Improvement)

> SRV weights don't work
> --
>
> Key: TS-4692
> URL: https://issues.apache.org/jira/browse/TS-4692
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: sometime
>
>
> Although weights are stored in hostdb-- from some black-box testing it seems 
> that the weighting isn't taken into consideration for LB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4693) SRV priorities don't work

2016-07-22 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4693:
---
Issue Type: Bug  (was: Improvement)

> SRV priorities don't work
> -
>
> Key: TS-4693
> URL: https://issues.apache.org/jira/browse/TS-4693
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: sometime
>
>
> Although priorities are stored in hostdb-- from some black-box testing it 
> seems that the priorities aren't taken into consideration for LB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4692) SRV weights don't work

2016-07-22 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4692:
---
Description: Although weights are stored in hostdb (and code exists that 
attempts to apply weighting)-- from some black-box testing it seems that the 
weighting isn't taken into consideration for LB.  (was: Although weights are 
stored in hostdb-- from some black-box testing it seems that the weighting 
isn't taken into consideration for LB.)

> SRV weights don't work
> --
>
> Key: TS-4692
> URL: https://issues.apache.org/jira/browse/TS-4692
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: sometime
>
>
> Although weights are stored in hostdb (and code exists that attempts to 
> apply weighting)-- from some black-box testing it seems that the weighting 
> isn't taken into consideration for LB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4693) SRV priorities don't work

2016-07-22 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4693:
---
Description: 
Although priorities are stored in hostdb (and code exists that attempts to 
honor them)-- from some black-box testing it seems that the priorities aren't 
taken into consideration for LB.


  was:
Although priorities are stored in hostdb-- from some black-box testing it seems 
that the priorities aren't taken into consideration for LB.



> SRV priorities don't work
> -
>
> Key: TS-4693
> URL: https://issues.apache.org/jira/browse/TS-4693
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: sometime
>
>
> Although priorities are stored in hostdb (and code exists that attempts to 
> honor them)-- from some black-box testing it seems that the priorities aren't 
> taken into consideration for LB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4692) SRV weights don't work

2016-07-21 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4692:
--

Assignee: Thomas Jackson

> SRV weights don't work
> --
>
> Key: TS-4692
> URL: https://issues.apache.org/jira/browse/TS-4692
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> Although weights are stored in hostdb-- from some black-box testing it seems 
> that the weighting isn't taken into consideration for LB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4693) SRV priorities don't work

2016-07-21 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4693:
--

Assignee: Thomas Jackson

> SRV priorities don't work
> -
>
> Key: TS-4693
> URL: https://issues.apache.org/jira/browse/TS-4693
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> Although priorities are stored in hostdb-- from some black-box testing it 
> seems that the priorities aren't taken into consideration for LB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-4693) SRV priorities don't work

2016-07-21 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-4693:
--

 Summary: SRV priorities don't work
 Key: TS-4693
 URL: https://issues.apache.org/jira/browse/TS-4693
 Project: Traffic Server
  Issue Type: Bug
  Components: HostDB
Reporter: Thomas Jackson


Although priorities are stored in hostdb-- from some black-box testing it seems 
that the priorities aren't taken into consideration for LB.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-4692) SRV weights don't work

2016-07-21 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-4692:
--

 Summary: SRV weights don't work
 Key: TS-4692
 URL: https://issues.apache.org/jira/browse/TS-4692
 Project: Traffic Server
  Issue Type: Bug
  Components: HostDB
Reporter: Thomas Jackson


Although weights are stored in hostdb-- from some black-box testing it seems 
that the weighting isn't taken into consideration for LB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TS-4615) Dynamic SRV lookups

2016-07-21 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-4615.

Resolution: Fixed

> Dynamic SRV lookups
> ---
>
> Key: TS-4615
> URL: https://issues.apache.org/jira/browse/TS-4615
> Project: Traffic Server
>  Issue Type: Improvement
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Right now all SRV lookups are `_http._tcp.NAME`; it'd be better if we 
> switched between the various protocols we support (http/https/ws/wss).
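
A sketch of what that could look like, assuming the scheme string is available 
at lookup time (`srv_name_for` is a hypothetical helper, not an ATS function):

{code}
#include <string>

// Build the SRV query name from the transaction's scheme instead of
// hardcoding "_http". ws/wss run over TCP just like http/https, so the
// protocol label stays "_tcp" for all four.
std::string srv_name_for(const std::string &scheme, const std::string &host) {
  return "_" + scheme + "._tcp." + host;
}

// srv_name_for("https", "example.com") == "_https._tcp.example.com"
// srv_name_for("wss", "example.com")   == "_wss._tcp.example.com"
{code}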



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (TS-4615) Dynamic SRV lookups

2016-07-21 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on TS-4615 started by Thomas Jackson.
--
> Dynamic SRV lookups
> ---
>
> Key: TS-4615
> URL: https://issues.apache.org/jira/browse/TS-4615
> Project: Traffic Server
>  Issue Type: Improvement
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Right now all SRV lookups are `_http._tcp.NAME`; it'd be better if we 
> switched between the various protocols we support (http/https/ws/wss).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4615) Dynamic SRV lookups

2016-07-21 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4615:
--

Assignee: Thomas Jackson

> Dynamic SRV lookups
> ---
>
> Key: TS-4615
> URL: https://issues.apache.org/jira/browse/TS-4615
> Project: Traffic Server
>  Issue Type: Improvement
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Right now all SRV lookups are `_http._tcp.NAME`; it'd be better if we 
> switched between the various protocols we support (http/https/ws/wss).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4615) Dynamic SRV lookups

2016-07-21 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4615:
---
Component/s: HostDB

> Dynamic SRV lookups
> ---
>
> Key: TS-4615
> URL: https://issues.apache.org/jira/browse/TS-4615
> Project: Traffic Server
>  Issue Type: Improvement
>  Components: HostDB
>Reporter: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Right now all SRV lookups are `_http._tcp.NAME`; it'd be better if we 
> switched between the various protocols we support (http/https/ws/wss).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TS-4622) Ports from SRV lookups aren't used

2016-07-21 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-4622.

Resolution: Fixed

> Ports from SRV lookups aren't used
> --
>
> Key: TS-4622
> URL: https://issues.apache.org/jira/browse/TS-4622
> Project: Traffic Server
>  Issue Type: Bug
>  Components: DNS
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Although the DNS processor parses out the ports and we keep track of them, 
> the port from the SRV response is not used at all when connecting to the 
> origin. The fix is simple: we just need to set the port before doing 
> `do_http_server_open` (potentially earlier, if we want to let plugins etc. 
> override the port?)
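
A sketch of where such a fix would land; only `do_http_server_open` is named 
in this issue, and the surrounding types and the helper are stand-ins:

{code}
#include <cstdint>

struct SrvResult {
  uint16_t port; // port parsed from the SRV answer
};

struct ServerInfo {
  uint16_t dst_port; // port the origin connection will be opened on
};

void prepare_origin_connection(ServerInfo &server, const SrvResult *srv) {
  // The fix: propagate the SRV-provided port before the connection is
  // opened (or earlier, if plugins should be allowed to override it).
  if (srv != nullptr) {
    server.dst_port = srv->port;
  }
  // ... then do_http_server_open() connects to server.dst_port
}
{code}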



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TS-4688) DNS resolver doesn't handle DNS compression labels for SRV responses

2016-07-21 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-4688.

Resolution: Fixed

> DNS resolver doesn't handle DNS compression labels for SRV responses
> 
>
> Key: TS-4688
> URL: https://issues.apache.org/jira/browse/TS-4688
> Project: Traffic Server
>  Issue Type: Bug
>  Components: DNS
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> If you get an SRV response from DNS with compression labels-- ATS' resolver 
> doesn't expand the names so the lookup fails.
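
For reference, libresolv's dn_expand() is the standard way to expand a 
possibly-compressed name against the full message. A sketch of applying it to 
an SRV target (buffer handling simplified; link with -lresolv on Linux):

{code}
#include <arpa/nameser.h> // NS_MAXDNAME
#include <resolv.h>       // dn_expand()

// SRV RDATA is priority (2 bytes), weight (2), port (2), then the target
// name, which may contain 0xC0-style compression pointers back into the
// message. A resolver that only reads the labels in place misses those.
// Returns true and fills `expanded` with the dotted name on success.
bool expand_srv_target(const unsigned char *msg, const unsigned char *msg_end,
                       const unsigned char *rdata, char (&expanded)[NS_MAXDNAME])
{
  const unsigned char *target = rdata + 6; // skip priority/weight/port
  return dn_expand(msg, msg_end, target, expanded, sizeof(expanded)) >= 0;
}
{code}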



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4674) Cleanup incorrect assert after TS-4403

2016-07-19 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4674:
--

Assignee: Thomas Jackson

> Cleanup incorrect assert after TS-4403
> --
>
> Key: TS-4674
> URL: https://issues.apache.org/jira/browse/TS-4674
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Now that we allow some records to persist in the case of failure, the 
> records in this path are not always in the 0 state-- and therefore this 
> assert doesn't hold. Regardless, this assert has been of little use since the 
> replacement of multicache with RefCountCache-- we aren't clobbering memory 
> all over creation anymore, so the need for these types of asserts is 
> significantly reduced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4674) Cleanup incorrect assert after TS-4403

2016-07-19 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4674:
---
Fix Version/s: 7.0.0

> Cleanup incorrect assert after TS-4403
> --
>
> Key: TS-4674
> URL: https://issues.apache.org/jira/browse/TS-4674
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Now that we allow some records to persist in the case of failure, the 
> records in this path are not always in the 0 state-- and therefore this 
> assert doesn't hold. Regardless, this assert has been of little use since the 
> replacement of multicache with RefCountCache-- we aren't clobbering memory 
> all over creation anymore, so the need for these types of asserts is 
> significantly reduced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TS-4674) Cleanup incorrect assert after TS-4403

2016-07-19 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-4674.

Resolution: Fixed

> Cleanup incorrect assert after TS-4403
> --
>
> Key: TS-4674
> URL: https://issues.apache.org/jira/browse/TS-4674
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Now that we allow some records to persist in the case of failure, the 
> records in this path are not always in the 0 state-- and therefore this 
> assert doesn't hold. Regardless, this assert has been of little use since the 
> replacement of multicache with RefCountCache-- we aren't clobbering memory 
> all over creation anymore, so the need for these types of asserts is 
> significantly reduced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4688) DNS resolver doesn't handle DNS compression labels for SRV responses

2016-07-19 Thread Thomas Jackson (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384385#comment-15384385
 ] 

Thomas Jackson commented on TS-4688:


I ran into this problem while creating tsqa tests for SRV responses, since 
Python's dnslib uses DNS compression labels wherever possible (to reduce 
response size).

> DNS resolver doesn't handle DNS compression labels for SRV responses
> 
>
> Key: TS-4688
> URL: https://issues.apache.org/jira/browse/TS-4688
> Project: Traffic Server
>  Issue Type: Bug
>  Components: DNS
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If you get an SRV response from DNS with compression labels-- ATS' resolver 
> doesn't expand the names so the lookup fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4688) DNS resolver doesn't handle DNS compression labels for SRV responses

2016-07-19 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4688:
--

Assignee: Thomas Jackson

> DNS resolver doesn't handle DNS compression labels for SRV responses
> 
>
> Key: TS-4688
> URL: https://issues.apache.org/jira/browse/TS-4688
> Project: Traffic Server
>  Issue Type: Bug
>  Components: DNS
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>
> If you get an SRV response from DNS with compression labels-- ATS' resolver 
> doesn't expand the names so the lookup fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-4688) DNS resolver doesn't handle DNS compression labels for SRV responses

2016-07-19 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-4688:
--

 Summary: DNS resolver doesn't handle DNS compression labels for 
SRV responses
 Key: TS-4688
 URL: https://issues.apache.org/jira/browse/TS-4688
 Project: Traffic Server
  Issue Type: Bug
  Components: DNS
Reporter: Thomas Jackson


If you get an SRV response from DNS with compression labels-- ATS' resolver 
doesn't expand the names so the lookup fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4688) DNS resolver doesn't handle DNS compression labels for SRV responses

2016-07-19 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4688:
---
Fix Version/s: 7.0.0

> DNS resolver doesn't handle DNS compression labels for SRV responses
> 
>
> Key: TS-4688
> URL: https://issues.apache.org/jira/browse/TS-4688
> Project: Traffic Server
>  Issue Type: Bug
>  Components: DNS
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>
> If you get an SRV response from DNS with compression labels-- ATS' resolver 
> doesn't expand the names so the lookup fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TS-4684) Leaked references to HostDBInfos from HttpTransact

2016-07-18 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson resolved TS-4684.

   Resolution: Fixed
Fix Version/s: 7.0.0

> Leaked references to HostDBInfos from HttpTransact
> --
>
> Key: TS-4684
> URL: https://issues.apache.org/jira/browse/TS-4684
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Affects Versions: 7.0.0
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
> Fix For: 7.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> After running for a while I've noticed a slow memory leak. After tracking 
> it down, the leak is due to not properly decrementing the refcounts in 
> HttpTransact for Ptr<HostDBInfo>. It seems that somewhere in the rebasing 
> this very important line was lost.
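
A minimal sketch of the Ptr<T> contract involved-- not ATS's actual 
implementation: every owning Ptr contributes one reference, and reassignment 
must release the old one. Dropping that release is exactly the kind of slow 
leak described here.

{code}
// T is assumed to expose intrusive refcount_inc()/refcount_dec() helpers.
template <class T>
struct Ptr {
  T *p = nullptr;

  Ptr() = default;
  explicit Ptr(T *q) : p(q) { if (p) p->refcount_inc(); }
  Ptr(const Ptr &o) : p(o.p) { if (p) p->refcount_inc(); }

  Ptr &operator=(const Ptr &o) {
    if (o.p) o.p->refcount_inc();
    if (p) p->refcount_dec(); // the "very important line": lose this
    p = o.p;                  // decrement and the old object leaks
    return *this;
  }

  ~Ptr() { if (p) p->refcount_dec(); }
};
{code}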



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-4685) Strict round robin isn't followed for KA connections

2016-07-18 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-4685:
--

 Summary: Strict round robin isn't followed for KA connections
 Key: TS-4685
 URL: https://issues.apache.org/jira/browse/TS-4685
 Project: Traffic Server
  Issue Type: Bug
Reporter: Thomas Jackson


ATS doesn't honor "strict round robin" for KA connections-- it first re-uses 
available connections, then creates new connections following the origin 
selection algorithm (rr, etc.). This effectively makes it "least connections", 
which is almost the opposite of what was configured. 

We should make ATS honor the selection algorithm we've configured it to use-- 
and then it can KA with that real selection.
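
A sketch of the proposed ordering, with illustrative types (the real session 
pool is keyed and locked differently): run the configured selection first, 
then reuse a pooled connection only if it matches that pick.

{code}
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Origin  { std::string addr; };
struct Session { std::string origin_addr; /* pooled keep-alive connection */ };

// Illustrative pool of idle keep-alive sessions, keyed by origin address.
using SessionPool = std::multimap<std::string, Session>;

// Today (the bug): any idle session is grabbed first, which turns the
// configured algorithm into de-facto least-connections. Proposed: pick
// the origin strictly round-robin, THEN reuse only a session to it.
Session acquire(const std::vector<Origin> &origins, std::size_t &rr_index,
                SessionPool &pool)
{
  // Assumes origins is non-empty.
  const Origin &picked = origins[rr_index++ % origins.size()]; // strict RR
  auto it = pool.find(picked.addr);
  if (it != pool.end()) { // keep-alive reuse, but only for the RR pick
    Session s = it->second;
    pool.erase(it);
    return s;
  }
  return Session{picked.addr}; // otherwise open a new connection
}
{code}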



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4685) Strict round robin isn't followed for KA connections

2016-07-18 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4685:
--

Assignee: Thomas Jackson

> Strict round robin isn't followed for KA connections
> 
>
> Key: TS-4685
> URL: https://issues.apache.org/jira/browse/TS-4685
> Project: Traffic Server
>  Issue Type: Bug
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> ATS doesn't honor "strict round robin" for KA connections-- it first re-uses 
> available connections, then creates new connections following the origin 
> selection algorithm (rr, etc.). This effectively makes it "least 
> connections", which is almost the opposite of what was configured. 
> We should make ATS honor the selection algorithm we've configured it to 
> use-- and then it can KA with that real selection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4684) Leaked references to HostDBInfos from HttpTransact

2016-07-18 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4684:
---
Affects Version/s: 7.0.0

> Leaked references to HostDBInfos from HttpTransact
> --
>
> Key: TS-4684
> URL: https://issues.apache.org/jira/browse/TS-4684
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Affects Versions: 7.0.0
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After running for a while I've noticed a slow memory leak. After tracking 
> it down, the leak is due to not properly decrementing the refcounts in 
> HttpTransact for Ptr<HostDBInfo>. It seems that somewhere in the rebasing 
> this very important line was lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4684) Leaked references to HostDBInfos from HttpTransact

2016-07-18 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4684:
--

Assignee: Thomas Jackson

> Leaked references to HostDBInfos from HttpTransact
> --
>
> Key: TS-4684
> URL: https://issues.apache.org/jira/browse/TS-4684
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Affects Versions: 7.0.0
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After running for a while I've noticed a slow memory leak. After tracking 
> it down, the leak is due to not properly decrementing the refcounts in 
> HttpTransact for Ptr<HostDBInfo>. It seems that somewhere in the rebasing 
> this very important line was lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-4684) Leaked references to HostDBInfos from HttpTransact

2016-07-18 Thread Thomas Jackson (JIRA)
Thomas Jackson created TS-4684:
--

 Summary: Leaked references to HostDBInfos from HttpTransact
 Key: TS-4684
 URL: https://issues.apache.org/jira/browse/TS-4684
 Project: Traffic Server
  Issue Type: Bug
  Components: HostDB
Reporter: Thomas Jackson


After running for a while I've noticed a slow memory leak. After tracking it 
down, the leak is due to not properly decrementing the refcounts in 
HttpTransact for Ptr<HostDBInfo>. It seems that somewhere in the rebasing this 
very important line was lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4682) HostDB caching of SOA responses doesn't honor TTL

2016-07-18 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson updated TS-4682:
---
Component/s: HostDB

> HostDB caching of SOA responses doesn't honor TTL
> -
>
> Key: TS-4682
> URL: https://issues.apache.org/jira/browse/TS-4682
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HostDB
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> HostDB categorizes all DNS responses as "failed" or "not failed", "failed" 
> being any response that doesn't contain an answer to the question sent. This 
> means that SOA responses are considered "failed", and therefore the response 
> isn't cached for the TTL defined in the SOA record. The TTL for "failed" 
> responses is configured by "proxy.config.hostdb.fail.timeout" (defaults to 
> 0). With the default value, the record is immediately considered stale and 
> requires another lookup. To fix this we should have "failed" responses honor 
> the TTL whenever the response defines one.
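
A sketch of the proposed policy; `fail_timeout` stands in for 
proxy.config.hostdb.fail.timeout and the function name is illustrative:

{code}
#include <cstdint>

// TTL to cache a "failed" lookup for. An SOA-bearing negative answer
// carries a real TTL-- honor it. Fall back to the configured failure
// timeout only when the response gave no TTL at all; otherwise, with
// fail.timeout=0, the record is stale immediately and every request
// triggers another lookup.
uint32_t failed_record_ttl(uint32_t response_ttl, uint32_t fail_timeout) {
  return response_ttl > 0 ? response_ttl : fail_timeout;
}
{code}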



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TS-4682) HostDB caching of SOA responses doesn't honor TTL

2016-07-18 Thread Thomas Jackson (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Jackson reassigned TS-4682:
--

Assignee: Thomas Jackson

> HostDB caching of SOA responses doesn't honor TTL
> -
>
> Key: TS-4682
> URL: https://issues.apache.org/jira/browse/TS-4682
> Project: Traffic Server
>  Issue Type: Bug
>Reporter: Thomas Jackson
>Assignee: Thomas Jackson
>
> HostDB categorizes all DNS responses as "failed" or "not failed", "failed" 
> being any response that doesn't contain an answer to the question sent. This 
> means that SOA responses are considered "failed", and therefore the response 
> isn't cached for the TTL defined in the SOA record. The TTL for "failed" 
> responses is configured by "proxy.config.hostdb.fail.timeout" (defaults to 
> 0). With the default value, the record is immediately considered stale and 
> requires another lookup. To fix this we should have "failed" responses honor 
> the TTL whenever the response defines one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

