[jira] [Created] (TS-3084) forwarding mode breaks iPhone activation (ga.apple.com)

2014-09-18 Thread Nikolai Gorchilov (JIRA)
Nikolai Gorchilov created TS-3084:
-

 Summary: forwarding mode breaks iPhone activation (ga.apple.com)
 Key: TS-3084
 URL: https://issues.apache.org/jira/browse/TS-3084
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, HTTP
Reporter: Nikolai Gorchilov


During iDevice restoration, iTunes makes an activation request to ga.apple.com
(request attached).

When sent via ATS, the request leads to an HTTP/1.1 400 (Bad Request) response
from the origin server.

The proper response (on a direct connection) is also attached for your reference.

Here's the command to reproduce the problem:
{noformat}
netcat gs.apple.com 80 < gs.request
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-3084) forwarding mode breaks iPhone activation (ga.apple.com)

2014-09-18 Thread Nikolai Gorchilov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolai Gorchilov updated TS-3084:
--
Attachment: gs.response
gs.request

 forwarding mode breaks iPhone activation (ga.apple.com)
 ---

 Key: TS-3084
 URL: https://issues.apache.org/jira/browse/TS-3084
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, HTTP
Reporter: Nikolai Gorchilov
 Attachments: gs.request, gs.response


 On iDevice restoration iTunes makes activation request to ga.apple.com 
 (request attached).
 When sent via ATS, the request leads to HTTP/1.1 400 (bad request) response 
 from origin server.
 Proper response (on direct connection) is also attached for your reference.
 Here's the command to reproduce the problem
 {noformat}
 netcat gs.apple.com 80 < gs.request
 {noformat}





[jira] [Updated] (TS-3084) forwarding mode breaks iPhone activation (gs.apple.com)

2014-09-18 Thread Nikolai Gorchilov (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolai Gorchilov updated TS-3084:
--
Description: 
During iDevice restoration, iTunes makes an activation request to gs.apple.com
(request attached).

When sent via ATS, the request leads to an HTTP/1.1 400 (Bad Request) response
from the origin server.

The proper response (on a direct connection) is also attached for your reference.

Here's the command to reproduce the problem:
{noformat}
netcat gs.apple.com 80 < gs.request
{noformat}

  was:
On iDevice restoration iTunes makes activation request to ga.apple.com (request 
attached).

When sent via ATS, the request leads to HTTP/1.1 400 (bad request) response 
from origin server.

Proper response (on direct connection) is also attached for your reference.

Here's the command to reproduce the problem
{noformat}
netcat gs.apple.com 80 < gs.request
{noformat}

Summary: forwarding mode breaks iPhone activation (gs.apple.com)  (was: 
forwarding mode breaks iPhone activation (ga.apple.com))

 forwarding mode breaks iPhone activation (gs.apple.com)
 ---

 Key: TS-3084
 URL: https://issues.apache.org/jira/browse/TS-3084
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, HTTP
Reporter: Nikolai Gorchilov
 Attachments: gs.request, gs.response


 On iDevice restoration iTunes makes activation request to gs.apple.com 
 (request attached).
 When sent via ATS, the request leads to HTTP/1.1 400 (bad request) response 
 from origin server.
 Proper response (on direct connection) is also attached for your reference.
 Here's the command to reproduce the problem
 {noformat}
 netcat gs.apple.com 80 < gs.request
 {noformat}





[jira] [Updated] (TS-3084) forwarding mode breaks iPhone activation (gs.apple.com)

2014-09-18 Thread Leif Hedstrom (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leif Hedstrom updated TS-3084:
--
Fix Version/s: 5.2.0

 forwarding mode breaks iPhone activation (gs.apple.com)
 ---

 Key: TS-3084
 URL: https://issues.apache.org/jira/browse/TS-3084
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, HTTP
Reporter: Nikolai Gorchilov
 Fix For: 5.2.0

 Attachments: gs.request, gs.response


 On iDevice restoration iTunes makes activation request to gs.apple.com 
 (request attached).
 When sent via ATS, the request leads to HTTP/1.1 400 (bad request) response 
 from origin server.
 Proper response (on direct connection) is also attached for your reference.
 Here's the command to reproduce the problem
 {noformat}
 netcat gs.apple.com 80 < gs.request
 {noformat}





[jira] [Updated] (TS-3083) crash

2014-09-18 Thread Leif Hedstrom (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leif Hedstrom updated TS-3083:
--
Fix Version/s: 5.2.0

 crash
 -

 Key: TS-3083
 URL: https://issues.apache.org/jira/browse/TS-3083
 Project: Traffic Server
  Issue Type: Bug
  Components: Core
Affects Versions: 5.0.2
Reporter: bettydramit
  Labels: crash
 Fix For: 5.2.0


 c++filt a.txt 
 {code}
 /lib64/libpthread.so.0(+0xf710)[0x2b4c37949710]
 /usr/lib64/trafficserver/libtsutil.so.5(ink_atomiclist_pop+0x3e)[0x2b4c35abb64e]
 /usr/lib64/trafficserver/libtsutil.so.5(reclaimable_freelist_new+0x65)[0x2b4c35abc065]
 /usr/bin/traffic_server(MIOBuffer_tracker::operator()(long)+0x2b)[0x4a33db]
 /usr/bin/traffic_server(PluginVCCore::init()+0x2e3)[0x4d9903]
 /usr/bin/traffic_server(PluginVCCore::alloc()+0x11d)[0x4dcf4d]
 /usr/bin/traffic_server(TSHttpConnectWithPluginId+0x5d)[0x4b9e9d]
 /usr/bin/traffic_server(FetchSM::httpConnect()+0x74)[0x4a0224]
 /usr/bin/traffic_server(PluginVC::process_read_side(bool)+0x375)[0x4da675]
 /usr/bin/traffic_server(PluginVC::process_write_side(bool)+0x57a)[0x4dafca]
 /usr/bin/traffic_server(PluginVC::main_handler(int, void*)+0x315)[0x4dc9a5]
 /usr/bin/traffic_server(EThread::process_event(Event*, int)+0x8f)[0x73788f]
 /usr/bin/traffic_server(EThread::execute()+0x57b)[0x7381fb]
 {code}





[jira] [Updated] (TS-3080) OpenSSL implementation of TLS session cache is very slow.

2014-09-18 Thread Leif Hedstrom (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leif Hedstrom updated TS-3080:
--
Fix Version/s: 5.2.0

 OpenSSL implementation of TLS session cache is very slow.
 -

 Key: TS-3080
 URL: https://issues.apache.org/jira/browse/TS-3080
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, SSL
Reporter: Brian Geffon
Assignee: Brian Geffon
 Fix For: 5.2.0


 The OpenSSL implementation of TLS session caching is very slow. We attempted
 to use it, and it runs into lock contention and blows up at only a few hundred
 QPS. I'm going to develop a new TLS session cache in TS that performs better
 under high load.





[jira] [Commented] (TS-3080) OpenSSL implementation of TLS session cache is very slow.

2014-09-18 Thread Leif Hedstrom (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139128#comment-14139128
 ] 

Leif Hedstrom commented on TS-3080:
---

I still find it odd that you'd run into lock contention already at hundreds 
of sessions / sec. Unless the critical section is large and/or very slow, that 
shouldn't be happening. Futexes are fast in general :). 

 OpenSSL implementation of TLS session cache is very slow.
 -

 Key: TS-3080
 URL: https://issues.apache.org/jira/browse/TS-3080
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, SSL
Reporter: Brian Geffon
Assignee: Brian Geffon
 Fix For: 5.2.0


 The OpenSSL implementation of TLS session caching is very slow. We attempted
 to use it, and it runs into lock contention and blows up at only a few hundred
 QPS. I'm going to develop a new TLS session cache in TS that performs better
 under high load.





[jira] [Updated] (TS-3082) ATS does not bind sessions to SNI names.

2014-09-18 Thread Leif Hedstrom (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leif Hedstrom updated TS-3082:
--
Labels: security  (was: )

 ATS does not bind sessions to SNI names.
 

 Key: TS-3082
 URL: https://issues.apache.org/jira/browse/TS-3082
 Project: Traffic Server
  Issue Type: Bug
  Components: SSL
Reporter: Alexey Ivanov
Assignee: Brian Geffon
  Labels: security
 Fix For: 5.2.0


 More information in paper:
 Virtual Host Confusion: Weaknesses and Exploits. Black Hat 2014 Report
 http://bh.ht.vc/vhost_confusion.pdf





[jira] [Updated] (TS-3082) ATS does not bind sessions to SNI names.

2014-09-18 Thread Leif Hedstrom (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leif Hedstrom updated TS-3082:
--
Priority: Critical  (was: Major)

 ATS does not bind sessions to SNI names.
 

 Key: TS-3082
 URL: https://issues.apache.org/jira/browse/TS-3082
 Project: Traffic Server
  Issue Type: Bug
  Components: SSL
Reporter: Alexey Ivanov
Assignee: Brian Geffon
Priority: Critical
  Labels: security
 Fix For: 5.2.0


 More information in paper:
 Virtual Host Confusion: Weaknesses and Exploits. Black Hat 2014 Report
 http://bh.ht.vc/vhost_confusion.pdf





[jira] [Updated] (TS-3082) ATS does not bind sessions to SNI names.

2014-09-18 Thread Leif Hedstrom (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leif Hedstrom updated TS-3082:
--
Component/s: SSL

 ATS does not bind sessions to SNI names.
 

 Key: TS-3082
 URL: https://issues.apache.org/jira/browse/TS-3082
 Project: Traffic Server
  Issue Type: Bug
  Components: SSL
Reporter: Alexey Ivanov
Assignee: Brian Geffon
  Labels: security
 Fix For: 5.2.0


 More information in paper:
 Virtual Host Confusion: Weaknesses and Exploits. Black Hat 2014 Report
 http://bh.ht.vc/vhost_confusion.pdf





[jira] [Created] (TS-3085) Large POSTs over (relatively) slower connections failing in ats5

2014-09-18 Thread Sudheer Vinukonda (JIRA)
Sudheer Vinukonda created TS-3085:
-

 Summary: Large POSTs over (relatively) slower connections failing 
in ats5
 Key: TS-3085
 URL: https://issues.apache.org/jira/browse/TS-3085
 Project: Traffic Server
  Issue Type: Bug
  Components: SSL
Reporter: Sudheer Vinukonda


We ran into a production issue where large POSTs (30MB or higher) fail over
slower connections after the ats5 rollout (the problem is easily reproduced
using a Charles proxy with throttling enabled).

Further debugging isolated the issue to uploads over SSL connections, and after
a lot of debugging the issue appears to be the following:

ATS calls SSL_read() followed by SSL_get_error() to check whether the read hit
an error. This is repeated until either the complete data is read or an error
occurs. However, per the OpenSSL documentation, it is recommended to call
ERR_clear_error() prior to calling SSL_read() + SSL_get_error() to ensure the
error queue is clean of any leftover/garbage errors. It's not clear what might
be corrupting the error queue of the SSL context in a tight loop - possibly
some new feature in ats5. In any case, calling ERR_clear_error() is a good idea,
and adding it seems to resolve the POST failures.

Documentation from OpenSSL and some related notes on Stack Overflow:

https://www.openssl.org/docs/ssl/SSL_get_error.html

http://stackoverflow.com/questions/18179128/how-to-manage-the-error-queue-in-openssl-ssl-get-error-and-err-get-error


{code}
SSL_get_error() returns a result code (suitable for the C ``switch''
statement) for a preceding call to SSL_connect(), SSL_accept(),
SSL_do_handshake(), SSL_read(), SSL_peek(), or SSL_write() on ssl. The value
returned by that TLS/SSL I/O function must be passed to SSL_get_error() in
parameter ret.

In addition to ssl and ret, SSL_get_error() inspects the current thread's
OpenSSL error queue. Thus, SSL_get_error() must be used in the same thread that
performed the TLS/SSL I/O operation, and no other OpenSSL function calls should
appear in between. The current thread's error queue must be empty before the
TLS/SSL I/O operation is attempted, or SSL_get_error() will not work reliably.

SSL_get_error does not call ERR_get_error. So if you just call SSL_get_error,
the error stays in the queue.

You should be calling ERR_clear_error prior to ANY SSL-call(SSL_read, SSL_write
etc) that is followed by SSL_get_error, otherwise you may be reading an old
error that occurred previously in the current thread.
{code}
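
To illustrate why a leftover queue entry matters, here is a minimal,
self-contained C sketch. It does not call OpenSSL: the toy_* functions and
TOY_* constants are hypothetical stand-ins that only model the behaviour quoted
above, namely that the error lookup consults the thread's error queue whenever
the I/O call did not succeed. The constant values mirror SSL_ERROR_NONE (0),
SSL_ERROR_SSL (1) and SSL_ERROR_WANT_READ (2).

```c
#include <assert.h>

/* Hypothetical stand-ins; the values mirror SSL_ERROR_NONE, SSL_ERROR_SSL
   and SSL_ERROR_WANT_READ, but nothing here calls OpenSSL. */
enum { TOY_ERROR_NONE = 0, TOY_ERROR_SSL = 1, TOY_ERROR_WANT_READ = 2 };

#define TOY_QUEUE_CAP 8
static unsigned long toy_queue[TOY_QUEUE_CAP];
static int toy_queue_len = 0;

/* Models some earlier, unrelated call leaving an entry in the queue. */
void toy_put_error(unsigned long code) {
    assert(toy_queue_len < TOY_QUEUE_CAP);
    toy_queue[toy_queue_len++] = code;
}

/* Models ERR_clear_error(): empties the queue. */
void toy_clear_error(void) { toy_queue_len = 0; }

/* Models a non-blocking read with no data available yet:
   it returns -1 and queues nothing itself. */
int toy_read_would_block(void) { return -1; }

/* Models the error lookup: success short-circuits; otherwise any queued
   entry is reported as a hard failure, whoever queued it. */
int toy_get_error(int ret) {
    if (ret > 0) return TOY_ERROR_NONE;
    if (toy_queue_len > 0) return TOY_ERROR_SSL;
    return TOY_ERROR_WANT_READ;
}

/* Pattern from the report: read, then classify, queue left as-is. */
int read_without_clear(void) {
    int ret = toy_read_would_block();
    return toy_get_error(ret);
}

/* Proposed fix: empty the queue immediately before the read. */
int read_with_clear(void) {
    toy_clear_error();
    int ret = toy_read_would_block();
    return toy_get_error(ret);
}
```

With one stale entry queued, read_without_clear() reports TOY_ERROR_SSL for
what is only a would-block condition, while read_with_clear() reports
TOY_ERROR_WANT_READ, so the caller would retry instead of aborting the POST.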





[jira] [Updated] (TS-3085) Large POSTs over (relatively) slower connections failing in ats5

2014-09-18 Thread Sudheer Vinukonda (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudheer Vinukonda updated TS-3085:
--
Description: 
We ran into a production issue where large POSTs (30MB or higher) fail over
slower connections after the ats5 rollout (the problem is easily reproduced
using a Charles proxy with throttling enabled).

Further debugging isolated the issue to uploads over SSL connections, and after
a lot of debugging the issue appears to be the following:

ATS calls SSL_read() followed by SSL_get_error() to check whether the read hit
an error. This is repeated until either the complete data is read or an error
occurs. However, per the OpenSSL documentation, it is recommended to call
ERR_clear_error() prior to calling SSL_read() + SSL_get_error() to ensure the
error queue is clean of any leftover/garbage errors. It's not clear what might
be corrupting the error queue of the SSL context in a tight loop - possibly
some new feature in ats5. In any case, calling ERR_clear_error() is a good idea,
and adding it seems to resolve the POST failures.

Documentation from OpenSSL and some related notes on Stack Overflow:

https://www.openssl.org/docs/ssl/SSL_get_error.html

http://stackoverflow.com/questions/18179128/how-to-manage-the-error-queue-in-openssl-ssl-get-error-and-err-get-error


{code}
SSL_get_error() returns a result code (suitable for the C ``switch''
statement) for a preceding call to SSL_connect(), SSL_accept(),
SSL_do_handshake(), SSL_read(), SSL_peek(), or SSL_write() on ssl. The value
returned by that TLS/SSL I/O function must be passed to SSL_get_error() in
parameter ret.

In addition to ssl and ret, SSL_get_error() inspects the current thread's
OpenSSL error queue. Thus, SSL_get_error() must be used in the same thread that
performed the TLS/SSL I/O operation, and no other OpenSSL function calls should
appear in between. The current thread's error queue must be empty before the
TLS/SSL I/O operation is attempted, or SSL_get_error() will not work reliably.

SSL_get_error does not call ERR_get_error. So if you just call SSL_get_error,
the error stays in the queue.

You should be calling ERR_clear_error prior to ANY SSL-call(SSL_read, SSL_write
etc) that is followed by SSL_get_error, otherwise you may be reading an old
error that occurred previously in the current thread.
{code}

  was:
We ran into a production issue where large POSTs (30MB or high) are failing 
over slower connection speeds after ats5 roll out (the problem could be easily 
reproduced using a charles proxy with throttling enabled). 

Further debugging isolated the issue to uploads over SSL connections and after 
a lot of debugging the issue appears to be the below:



ATS calls SSL_read() followed by SSL_get_error() to check if there was any 
error in the read. This is repeated until either the complete data is read or 
an error occurs. However, from the openssl documentation, it is recommended to 
call ERR_clear_error() prior to calling SSL_read() + SSL_get_error() to ensure 
the error queue is clean of any leftover/garbage errors.  It's not clear what 
might be corrupting the error queue of the SSL context in a tight loop - 
possibly, some new feature in ats5. In any case, calling ERR_clear_error() is a 
good idea and adding this seems to resolve the post failures.

Documentation from openSSL and some related notes on stackoverflow:

https://www.openssl.org/docs/ssl/SSL_get_error.html

http://stackoverflow.com/questions/18179128/how-to-manage-the-error-queue-in-openssl-ssl-get-error-and-err-get-error


{code}
SSL_get_error() returns a result code (suitable for the C ``switch''
statement) for a preceding call to SSL_connect(), SSL_accept(),
SSL_do_handshake(), SSL_read(), SSL_peek(), or SSL_write() on ssl. The value
returned by that TLS/SSL I/O function must be passed to SSL_get_error() in
parameter ret.

In addition to ssl and ret, SSL_get_error() inspects the current thread's
OpenSSL error queue. Thus, SSL_get_error() must be used in the same thread that
performed the TLS/SSL I/O operation, and no other OpenSSL function calls should
appear in between. The current thread's error queue must be empty before the
TLS/SSL I/O operation is attempted, or SSL_get_error() will not work reliably.

SSL_get_error does not call ERR_get_error. So if you just call SSL_get_error,
the error stays in the queue.

You should be calling ERR_clear_error prior to ANY SSL-call(SSL_read, SSL_write
etc) that is followed by SSL_get_error, otherwise you may be reading an old
error that occurred previously in the current thread.
{code}

  Affects Version/s: 5.0.1
Backport to Version: 5.1.1
  Fix Version/s: 5.2.0
   Assignee: Sudheer Vinukonda
 Labels: yahoo  (was: )

The fix is really simple: call ERR_clear_error() before
SSL_read(). I will investigate separately on why/who is

[jira] [Commented] (TS-3085) Large POSTs over (relatively) slower connections failing in ats5

2014-09-18 Thread Sudheer Vinukonda (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139222#comment-14139222
 ] 

Sudheer Vinukonda commented on TS-3085:
---

Per Leif's suggestion on a different JIRA, I've marked this for 5.2 and added a
backport to 5.1.1. However, this is a blocker for our ats5 rollout, and anyone
with use cases involving large POSTs may need to cherry-pick the fix.

 Large POSTs over (relatively) slower connections failing in ats5
 

 Key: TS-3085
 URL: https://issues.apache.org/jira/browse/TS-3085
 Project: Traffic Server
  Issue Type: Bug
  Components: SSL
Affects Versions: 5.0.1
Reporter: Sudheer Vinukonda
Assignee: Sudheer Vinukonda
  Labels: yahoo
 Fix For: 5.2.0







[jira] [Commented] (TS-3085) Large POSTs over (relatively) slower connections failing in ats5

2014-09-18 Thread Sudheer Vinukonda (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139224#comment-14139224
 ] 

Sudheer Vinukonda commented on TS-3085:
---

When a POST fails, below is the log (slightly enhanced and traced using 
single/ip debugging in production):

{code}
[Sep 18 17:09:26.382] Server {0x2ab554605700} DEBUG: (ssl) 
[SSL_NetVConnection::ssl_read_from_net] b->write_avail()=32768
[Sep 18 17:09:26.382] Server {0x2ab554605700} DEBUG: (ssl) 
[SSL_NetVConnection::ssl_read_from_net] rres=-1
[Sep 18 17:09:26.382] Server {0x2ab554605700} DEBUG: (ssl.error) 
[SSL_NetVConnection::ssl_read_from_net] error 1
[Sep 18 17:09:26.382] Server {0x2ab554605700} DEBUG: (http_tunnel) [510166] 
producer_handler [user agent post VC_EVENT_ERROR]
[Sep 18 17:09:26.382] Server {0x2ab554605700} DEBUG: (http_redirect) 
[HttpTunnel::producer_handler] enable_redirection: [1 0 0] event: 3
[Sep 18 17:09:26.382] Server {0x2ab554605700} DEBUG: (http) [510166] 
[HttpSM::tunnel_handler_post_ua, VC_EVENT_ERROR]
{code}

 Large POSTs over (relatively) slower connections failing in ats5
 

 Key: TS-3085
 URL: https://issues.apache.org/jira/browse/TS-3085
 Project: Traffic Server
  Issue Type: Bug
  Components: SSL
Affects Versions: 5.0.1
Reporter: Sudheer Vinukonda
Assignee: Sudheer Vinukonda
  Labels: yahoo
 Fix For: 5.2.0







[jira] [Commented] (TS-1975) LocalManager may cause manager crash

2014-09-18 Thread Jared Ocker (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139280#comment-14139280
 ] 

Jared Ocker commented on TS-1975:
-

Here's a stack trace from our traffic.out, including some logs leading up to 
it.  As you can see, we're using the rfc5861 (stale while revalidate) plugin.  
When I disable it, the issue seems to go away.

{code}
[Sep 15 09:51:55.419] Server {0x2b3a9b284710} DIAG: (sdk) (SDK) null mutex 
detected in critical region (mutex created)
[Sep 15 09:51:55.419] Server {0x2b3a9b284710} DIAG: (sdk) (SDK) please create 
continuation [0x2dfc930] with mutex
[Sep 15 09:51:55.419] Server {0x2b3a9b284710} DIAG: (rfc5861) Write Complete
[Sep 15 09:51:55.420] Server {0x2b3a9b284710} DIAG: (rfc5861) Internal Request
[Sep 15 09:51:55.450] Server {0x2b3a9b284710} DIAG: (rfc5861) Read Ready
[Sep 15 09:51:55.450] Server {0x2b3a9b284710} DIAG: (rfc5861) HTTP Status: 304
[Sep 15 09:51:55.450] Server {0x2b3a9b284710} DIAG: (rfc5861) EOS
[Sep 15 09:51:55.450] Server {0x2b3a9b284710} DIAG: (rfc5861) In sync path. 
setting fresh and re-enabling
[Sep 15 09:51:55.450] Server {0x2b3a9b284710} DIAG: (rfc5861) Attempting new 
cache lookup
[Sep 15 09:51:55.450] Server {0x2b3a9b284710} DIAG: (rfc5861) Not Stale!
[Sep 15 09:51:55.469] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) External Request
[Sep 15 09:51:55.487] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) CacheLookupStatus 
is STALE
[Sep 15 09:51:55.487] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Found a date
[Sep 15 09:51:55.487] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Found 
cache-control
[Sep 15 09:51:55.487] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Unknown field 
value
[Sep 15 09:51:55.487] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Found max-age
[Sep 15 09:51:55.487] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Unknown field 
value
[Sep 15 09:51:55.487] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Found 
stale-while-revalidate
[Sep 15 09:51:55.487] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Found 
stale-on-error
[Sep 15 09:51:55.487] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Looks like we can 
return fresh data on 500 error
[Sep 15 09:51:55.487] Server {0x2b3a9a7e9ca0} DIAG: (sdk) (SDK) null mutex 
detected in critical region (mutex created)
[Sep 15 09:51:55.487] Server {0x2b3a9a7e9ca0} DIAG: (sdk) (SDK) please create 
continuation [0x2dfc930] with mutex
[Sep 15 09:51:55.487] Server {0x2b3a9b183710} DIAG: (rfc5861) Lets do the lookup
[Sep 15 09:51:55.487] Server {0x2b3a9b183710} DIAG: (rfc5861) Set Connection: 
close
[Sep 15 09:51:55.487] Server {0x2b3a9b183710} DIAG: (rfc5861) Found old 
Connection hdr
[Sep 15 09:51:55.487] Server {0x2b3a9b183710} DIAG: (rfc5861) Creating 
Connection hdr
[Sep 15 09:51:55.487] Server {0x2b3a9b183710} DIAG: (rfc5861) Create Buffers
[Sep 15 09:51:55.487] Server {0x2b3a9b183710} DIAG: (sdk) (SDK) null mutex 
detected in critical region (mutex created)
[Sep 15 09:51:55.487] Server {0x2b3a9b183710} DIAG: (sdk) (SDK) please create 
continuation [0x2dfc5d0] with mutex
[Sep 15 09:51:55.487] Server {0x2b3a9b183710} DIAG: (rfc5861) Write Complete
[Sep 15 09:51:55.487] Server {0x2b3a9b183710} DIAG: (rfc5861) Internal Request
[Sep 15 09:51:55.515] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Read Ready
[Sep 15 09:51:55.515] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) HTTP Status: 304
[Sep 15 09:51:55.515] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) EOS
[Sep 15 09:51:55.515] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) In sync path. 
setting fresh and re-enabling
[Sep 15 09:51:55.515] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Attempting new 
cache lookup
[Sep 15 09:51:55.515] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Not Stale!
[Sep 15 09:52:19.599] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) External Request
[Sep 15 09:52:19.600] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Not Stale!
[Sep 15 09:52:19.670] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) External Request
[Sep 15 09:52:19.670] Server {0x2b3a9b284710} DIAG: (rfc5861) External Request
[Sep 15 09:52:19.671] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Not Stale!
[Sep 15 09:52:19.671] Server {0x2b3a9b284710} DIAG: (rfc5861) Not Stale!
[Sep 15 09:52:19.673] Server {0x2b3a9b284710} DIAG: (rfc5861) External Request
[Sep 15 09:52:19.674] Server {0x2b3a9b284710} DIAG: (rfc5861) Not Stale!
[Sep 15 09:52:19.674] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) External Request
[Sep 15 09:52:19.675] Server {0x2b3a9b183710} DIAG: (rfc5861) External Request
[Sep 15 09:52:19.675] Server {0x2b3a9b284710} DIAG: (rfc5861) External Request
[Sep 15 09:52:19.676] Server {0x2b3a9b284710} DIAG: (rfc5861) Not Stale!
[Sep 15 09:52:19.677] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) Not Stale!
[Sep 15 09:52:19.677] Server {0x2b3a9b183710} DIAG: (rfc5861) Not Stale!
[Sep 15 09:52:19.685] Server {0x2b3a9a7e9ca0} DIAG: (rfc5861) External Request
[Sep 15 09:52:19.685] Server {0x2b3a9b284710} DIAG: (rfc5861) External Request
[Sep 15 09:52:19.686] Server {0x2b3a9b284710} DIAG: 

[jira] [Commented] (TS-3084) forwarding mode breaks iPhone activation (gs.apple.com)

2014-09-18 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139303#comment-14139303
 ] 

James Peach commented on TS-3084:
-

What is the request that ATS sends to the origin server?

 forwarding mode breaks iPhone activation (gs.apple.com)
 ---

 Key: TS-3084
 URL: https://issues.apache.org/jira/browse/TS-3084
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, HTTP
Reporter: Nikolai Gorchilov
 Fix For: 5.2.0

 Attachments: gs.request, gs.response


 On iDevice restoration iTunes makes activation request to gs.apple.com 
 (request attached).
 When sent via ATS, the request leads to HTTP/1.1 400 (bad request) response 
 from origin server.
 Proper response (on direct connection) is also attached for your reference.
 Here's the command to reproduce the problem
 {noformat}
 netcat gs.apple.com 80 < gs.request
 {noformat}





[jira] [Commented] (TS-3085) Large POSTs over (relatively) slower connections failing in ats5

2014-09-18 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139314#comment-14139314
 ] 

James Peach commented on TS-3085:
-

Good catch, that code looks quite broken. I think a better fix is to only call 
{{SSL_get_error()}} if {{SSL_read()}} returns <= 0. The error handling for 
{{SSL_write}} also looks problematic. Can you refactor that to also call 
{{SSL_get_error()}} correctly?
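
To make the failure mode concrete, here is a toy Python model of the logic under discussion. It is not OpenSSL and not ATS code: `ToyConn`, `ssl_read`, `ssl_get_error`, `read_all` and the status strings are hypothetical stand-ins. The only behaviour borrowed from the quoted documentation is that the error check consults the thread's error queue whenever the read did not return a positive value, so a leftover entry can turn a harmless "retry" into an apparent failure.

```python
from collections import deque

# Toy model only: these names are hypothetical stand-ins, not OpenSSL APIs.
error_queue = deque()          # models the per-thread OpenSSL error queue


class ToyConn:
    def __init__(self, chunks):
        # Each item is a bytes chunk, or None meaning "no data yet, retry".
        self.chunks = deque(chunks)

    def ssl_read(self):
        """Return the byte count read, or -1 if the read would block."""
        chunk = self.chunks.popleft()
        return len(chunk) if chunk else -1


def ssl_get_error(ret):
    """Classify a read result the way the quoted documentation describes."""
    if ret > 0:
        return "NONE"            # success: the error queue is not consulted
    if error_queue:
        return "SSL"             # any queued entry looks like a fatal error
    return "WANT_READ"           # clean queue: a harmless retry condition


def read_all(conn, clear_first):
    """Read until the chunks run out; return (bytes_read, final_status)."""
    total = 0
    while conn.chunks:
        if clear_first:
            error_queue.clear()  # models the ERR_clear_error() step
        ret = conn.ssl_read()
        status = ssl_get_error(ret)
        if status == "SSL":
            return total, "failed"
        if status == "NONE":
            total += ret
    return total, "ok"


error_queue.append("stale error left by an unrelated call")
print(read_all(ToyConn([b"abc", None, b"de"]), clear_first=False))  # (3, 'failed')
print(read_all(ToyConn([b"abc", None, b"de"]), clear_first=True))   # (5, 'ok')
```

In this model the stale entry only matters on the read that returns -1, which fits the report that failures appear on slower connections, where reads that have to be retried are more common.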

 Large POSTs over (relatively) slower connections failing in ats5
 

 Key: TS-3085
 URL: https://issues.apache.org/jira/browse/TS-3085
 Project: Traffic Server
  Issue Type: Bug
  Components: SSL
Affects Versions: 5.0.1
Reporter: Sudheer Vinukonda
Assignee: Sudheer Vinukonda
  Labels: yahoo
 Fix For: 5.2.0


 We ran into a production issue where large POSTs (30MB or higher) are failing 
 over slower connections after the ats5 rollout (the problem can be 
 easily reproduced using Charles proxy with throttling enabled). 
 Further debugging isolated the issue to uploads over SSL connections and 
 after a lot of debugging the issue appears to be the below:
 ATS calls SSL_read() followed by SSL_get_error() to check if there was any 
 error in the read. This is repeated until either the complete data is read or 
 an error occurs. However, from the openssl documentation, it is recommended 
 to call ERR_clear_error() prior to calling SSL_read() + SSL_get_error() to 
 ensure the error queue is clean of any leftover/garbage errors.  It's not 
 clear what might be corrupting the error queue of the SSL context in a tight 
 loop - possibly, some new feature in ats5. In any case, calling 
 ERR_clear_error() is a good idea and adding this seems to resolve the post 
 failures.
 Documentation from openSSL and some related notes on stackoverflow:
 https://www.openssl.org/docs/ssl/SSL_get_error.html
 http://stackoverflow.com/questions/18179128/how-to-manage-the-error-queue-in-openssl-ssl-get-error-and-err-get-error
 {code}
 SSL_get_error() returns a result code (suitable for the C ``switch''
 statement) for a preceding call to SSL_connect(), SSL_accept(),
 SSL_do_handshake(), SSL_read(), SSL_peek(), or SSL_write() on ssl. The value
 returned by that TLS/SSL I/O function must be passed to SSL_get_error() in
 parameter ret.
 In addition to ssl and ret, SSL_get_error() inspects the current thread's
 OpenSSL error queue. Thus, SSL_get_error() must be used in the same thread
 that performed the TLS/SSL I/O operation, and no other OpenSSL function calls
 should appear in between. The current thread's error queue must be empty
 before the TLS/SSL I/O operation is attempted, or SSL_get_error() will not
 work reliably.
 SSL_get_error does not call ERR_get_error. So if you just call SSL_get_error,
 the error stays in the queue.
 You should be calling ERR_clear_error prior to ANY SSL-call (SSL_read,
 SSL_write etc) that is followed by SSL_get_error, otherwise you may be reading
 an old error that occurred previously in the current thread.
 {code}





[jira] [Updated] (TS-3084) forwarding mode breaks iPhone activation (gs.apple.com)

2014-09-18 Thread Susan Hinrichs (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Susan Hinrichs updated TS-3084:
---
Attachment: partial-request.txt

I was just able to reproduce.  It looks like only part of the gs.request was 
sent before the FIN was sent from ATS.  See the partial request attachment.

 forwarding mode breaks iPhone activation (gs.apple.com)
 ---

 Key: TS-3084
 URL: https://issues.apache.org/jira/browse/TS-3084
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, HTTP
Reporter: Nikolai Gorchilov
 Fix For: 5.2.0

 Attachments: gs.request, gs.response, partial-request.txt


 On iDevice restoration iTunes makes activation request to gs.apple.com 
 (request attached).
 When sent via ATS, the request leads to HTTP/1.1 400 (bad request) response 
 from origin server.
 Proper response (on direct connection) is also attached for your reference.
 Here's the command to reproduce the problem
 {noformat}
 netcat gs.apple.com 80 < gs.request
 {noformat}





[jira] [Commented] (TS-3084) forwarding mode breaks iPhone activation (gs.apple.com)

2014-09-18 Thread Susan Hinrichs (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139329#comment-14139329
 ] 

Susan Hinrichs commented on TS-3084:


Sometimes I see the full request sent to the origin server, but ATS then 
terminates the connection to the origin server before the full response has 
been sent.

 forwarding mode breaks iPhone activation (gs.apple.com)
 ---

 Key: TS-3084
 URL: https://issues.apache.org/jira/browse/TS-3084
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, HTTP
Reporter: Nikolai Gorchilov
 Fix For: 5.2.0

 Attachments: gs.request, gs.response, partial-request.txt


 On iDevice restoration iTunes makes activation request to gs.apple.com 
 (request attached).
 When sent via ATS, the request leads to HTTP/1.1 400 (bad request) response 
 from origin server.
 Proper response (on direct connection) is also attached for your reference.
 Here's the command to reproduce the problem
 {noformat}
 netcat gs.apple.com 80 < gs.request
 {noformat}





[jira] [Commented] (TS-3084) forwarding mode breaks iPhone activation (gs.apple.com)

2014-09-18 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139338#comment-14139338
 ] 

James Peach commented on TS-3084:
-

Is this in transparent proxy mode?

 forwarding mode breaks iPhone activation (gs.apple.com)
 ---

 Key: TS-3084
 URL: https://issues.apache.org/jira/browse/TS-3084
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, HTTP
Reporter: Nikolai Gorchilov
 Fix For: 5.2.0

 Attachments: gs.request, gs.response, partial-request.txt


 On iDevice restoration iTunes makes activation request to gs.apple.com 
 (request attached).
 When sent via ATS, the request leads to HTTP/1.1 400 (bad request) response 
 from origin server.
 Proper response (on direct connection) is also attached for your reference.
 Here's the command to reproduce the problem
 {noformat}
 netcat gs.apple.com 80 < gs.request
 {noformat}





[jira] [Commented] (TS-3084) forwarding mode breaks iPhone activation (gs.apple.com)

2014-09-18 Thread Susan Hinrichs (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139350#comment-14139350
 ] 

Susan Hinrichs commented on TS-3084:


Yes, my test system is in transparent mode.

 forwarding mode breaks iPhone activation (gs.apple.com)
 ---

 Key: TS-3084
 URL: https://issues.apache.org/jira/browse/TS-3084
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, HTTP
Reporter: Nikolai Gorchilov
 Fix For: 5.2.0

 Attachments: gs.request, gs.response, partial-request.txt


 On iDevice restoration iTunes makes activation request to gs.apple.com 
 (request attached).
 When sent via ATS, the request leads to HTTP/1.1 400 (bad request) response 
 from origin server.
 Proper response (on direct connection) is also attached for your reference.
 Here's the command to reproduce the problem
 {noformat}
 netcat gs.apple.com 80 < gs.request
 {noformat}





[jira] [Commented] (TS-1975) LocalManager may cause manager crash

2014-09-18 Thread Phil Sorber (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139359#comment-14139359
 ] 

Phil Sorber commented on TS-1975:
-

Do you have a test case that repeats this easily that you can share? Or can you 
get a core and a real stack trace with debug symbols?

Also, I think [~amc] said he saw something like this and it wasn't rfc5861. 
Maybe he can comment.

Thanks.

 LocalManager may cause manager crash
 

 Key: TS-1975
 URL: https://issues.apache.org/jira/browse/TS-1975
 Project: Traffic Server
  Issue Type: Bug
  Components: Manager
Affects Versions: 3.3.4
Reporter: Zhao Yongming
Assignee: portl4t
  Labels: Crash
 Fix For: 5.2.0


 When something goes wrong in the LocalManager and it logs 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104), both the 
 manager and the server restart.
 {code}
 Jun 17 17:40:06 cache163 traffic_manager[25654]: {0x7f528b4297e0} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Jun 17 17:40:06 cache163 traffic_manager[25654]: {0x7f528b4297e0} FATAL:  
 (last system error 104: Connection reset by peer)
 Jun 17 17:40:06 cache163 traffic_manager[25654]: {0x7f528b4297e0} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Jun 17 17:40:06 cache163 traffic_manager[25654]: {0x7f528b4297e0} ERROR:  
 (last system error 32: Broken pipe)
 Jun 17 17:40:07 cache163 traffic_cop[25652]: cop received child status signal 
 [25654 2816]
 Jun 17 17:40:07 cache163 traffic_cop[25652]: traffic_manager not running, 
 making sure traffic_server is dead
 Jun 17 17:40:07 cache163 traffic_cop[25652]: spawning traffic_manager
 Jun 17 17:40:07 cache163 traffic_manager[10118]: NOTE: --- Manager Starting 
 ---
 Jun 17 17:40:07 cache163 traffic_manager[10118]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 3.2.0 - (build # 51516 on Jun 15 
 2013 at 16:01:06)
 Jun 17 17:40:07 cache163 traffic_manager[10118]: NOTE: 
 RLIMIT_NOFILE(7):cur(16),max(16)
 Jun 17 17:40:07 cache163 traffic_manager[10118]: {0x7f26fc24a7e0} STATUS: 
 opened /var/log/trafficserver/manager.log
 Jun 17 17:40:09 cache163 traffic_server[10131]: NOTE: --- Server Starting ---
 Jun 17 17:40:09 cache163 traffic_server[10131]: NOTE: Server Version: Apache 
 Traffic Server - traffic_server - 3.2.0 - (build # 51516 on Jun 15 2013 at 
 16:01:31)
 Jun 17 17:40:09 cache163 traffic_server[10131]: {0x2b167ded2280} STATUS: 
 opened /var/log/trafficserver/diags.log
 {code}





[jira] [Commented] (TS-1975) LocalManager may cause manager crash

2014-09-18 Thread Jared Ocker (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139478#comment-14139478
 ] 

Jared Ocker commented on TS-1975:
-

Unfortunately, I don't know how to trigger it; it just happens multiple times 
per day.  Below is a listing of our September log output from {{grep FATAL 
manager.log}} showing the frequency on a test server with light traffic.  I've 
not seen any core dumps.  Do you have any suggestions for debug tags?  Even 
excluding some of the most verbose tags with {{CONFIG 
proxy.config.diags.debug.tags STRING ^(?!.\*dir_clean|stats|log.*|http)}} still 
generates a ton of output.
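
A rough way to turn the {{grep FATAL manager.log}} output into a per-day count is sketched below. This is only an illustration: the regular expression is an assumption based on the line format in the listing that follows, and `fatal_counts_per_day` is a hypothetical helper, not an ATS tool.

```python
import re
from collections import Counter

# Matches lines such as:
# [Sep  4 14:41:12.517] Manager {0x7fb24371e7e0} FATAL:
FATAL_LINE = re.compile(
    r"^\[(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2}) "
    r"\d{2}:\d{2}:\d{2}\.\d+\] Manager \{0x[0-9a-f]+\} FATAL:"
)


def fatal_counts_per_day(lines):
    """Count FATAL entries per (month, day), ignoring continuation lines."""
    counts = Counter()
    for line in lines:
        match = FATAL_LINE.match(line)
        if match:
            counts[(match["month"], int(match["day"]))] += 1
    return counts


sample = [
    "[Sep  4 14:41:12.517] Manager {0x7fb24371e7e0} FATAL: ",
    "[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)",
    "[Sep  4 20:13:49.649] Manager {0x7fa5f16ed7e0} FATAL: ",
    "[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)",
    "[Sep  5 00:34:55.776] Manager {0x7fab124677e0} FATAL: ",
    "[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)",
]
for (month, day), count in sorted(fatal_counts_per_day(sample).items()):
    print(month, day, count)
```

Each run of the manager logs under a different thread id in braces, so counting by timestamp rather than by id avoids treating every restart as a separate kind of event.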


{code}
[Sep  4 14:41:12.517] Manager {0x7fb24371e7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  4 20:13:49.649] Manager {0x7fa5f16ed7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  5 00:34:55.776] Manager {0x7fab124677e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  5 09:25:42.075] Manager {0x7f9edb27b7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  6 00:13:55.717] Manager {0x7f7b8313a7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  6 00:58:49.020] Manager {0x7f04482517e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  6 01:00:49.758] Manager {0x7fd4e6c387e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  6 03:01:39.411] Manager {0x7f4b0285b7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  6 05:15:48.996] Manager {0x7fb6867407e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  6 23:19:50.204] Manager {0x7fb67c3e57e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  7 01:15:41.632] Manager {0x7fe5527627e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  7 01:17:50.320] Manager {0x7f60e471c7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  7 02:43:49.077] Manager {0x7f3f960c27e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  7 03:11:15.545] Manager {0x7f21e77107e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  7 03:12:01.707] Manager {0x7fcc284f37e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  7 03:34:53.128] Manager {0x7f283d1317e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  7 20:01:45.939] Manager {0x7f63d15977e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  7 20:02:54.278] Manager {0x7fec786157e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  7 20:52:53.018] Manager {0x7f34d10fc7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  8 23:03:53.881] Manager {0x7f97d5be57e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  9 01:17:06.301] Manager {0x7f62dc3837e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep  9 04:02:54.184] Manager {0x7f1e2a0587e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 10 02:09:23.664] Manager {0x7f7936f427e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 10 04:05:23.641] Manager {0x7fd0c8fcc7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 10 21:49:25.050] Manager {0x7f2bd3ba47e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 10 22:55:24.225] Manager {0x7fbe262aa7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 11 02:15:24.191] Manager {0x7f030a7da7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 12 01:48:25.133] Manager {0x7f7a396337e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 12 03:28:24.094] Manager {0x7f53933977e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 12 03:29:24.516] Manager {0x7f0d3d1567e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 12 04:03:25.149] Manager {0x7f55a12b07e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 12 07:02:24.187] Manager {0x7f72d13ee7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 12 10:18:24.169] Manager {0x7f564bae57e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 12 11:13:15.805] Manager {0x7fb1b9ccc7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 12 11:30:17.156] Manager {0x7ff22a03d7e0} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
[Sep 12 12:39:25.266] Manager {0x7fed1d3867e0} FATAL: 

[jira] [Commented] (TS-1975) LocalManager may cause manager crash

2014-09-18 Thread Phil Sorber (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139489#comment-14139489
 ] 

Phil Sorber commented on TS-1975:
-

Here is a link to the docs on enabling core dumps. You may also need to make 
OS-level changes to enable core dumps. You also need to make sure you compiled 
with debug (-g), which I think is the default, so that symbols are available.

https://docs.trafficserver.apache.org/en/latest/sdk/troubleshooting-tips/using-a-debugger.en.html

 LocalManager may cause manager crash
 

 Key: TS-1975
 URL: https://issues.apache.org/jira/browse/TS-1975
 Project: Traffic Server
  Issue Type: Bug
  Components: Manager
Affects Versions: 3.3.4
Reporter: Zhao Yongming
Assignee: portl4t
  Labels: Crash
 Fix For: 5.2.0


 When something goes wrong in the LocalManager and it logs 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104), both the 
 manager and the server restart.
 {code}
 Jun 17 17:40:06 cache163 traffic_manager[25654]: {0x7f528b4297e0} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Jun 17 17:40:06 cache163 traffic_manager[25654]: {0x7f528b4297e0} FATAL:  
 (last system error 104: Connection reset by peer)
 Jun 17 17:40:06 cache163 traffic_manager[25654]: {0x7f528b4297e0} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Jun 17 17:40:06 cache163 traffic_manager[25654]: {0x7f528b4297e0} ERROR:  
 (last system error 32: Broken pipe)
 Jun 17 17:40:07 cache163 traffic_cop[25652]: cop received child status signal 
 [25654 2816]
 Jun 17 17:40:07 cache163 traffic_cop[25652]: traffic_manager not running, 
 making sure traffic_server is dead
 Jun 17 17:40:07 cache163 traffic_cop[25652]: spawning traffic_manager
 Jun 17 17:40:07 cache163 traffic_manager[10118]: NOTE: --- Manager Starting 
 ---
 Jun 17 17:40:07 cache163 traffic_manager[10118]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 3.2.0 - (build # 51516 on Jun 15 
 2013 at 16:01:06)
 Jun 17 17:40:07 cache163 traffic_manager[10118]: NOTE: 
 RLIMIT_NOFILE(7):cur(16),max(16)
 Jun 17 17:40:07 cache163 traffic_manager[10118]: {0x7f26fc24a7e0} STATUS: 
 opened /var/log/trafficserver/manager.log
 Jun 17 17:40:09 cache163 traffic_server[10131]: NOTE: --- Server Starting ---
 Jun 17 17:40:09 cache163 traffic_server[10131]: NOTE: Server Version: Apache 
 Traffic Server - traffic_server - 3.2.0 - (build # 51516 on Jun 15 2013 at 
 16:01:31)
 Jun 17 17:40:09 cache163 traffic_server[10131]: {0x2b167ded2280} STATUS: 
 opened /var/log/trafficserver/diags.log
 {code}





[jira] [Created] (TS-3086) Range requests for stale cache entries never use If-Modified-Since/If-None-Match

2014-09-18 Thread William Bardwell (JIRA)
William Bardwell created TS-3086:


 Summary: Range requests for stale cache entries never use 
If-Modified-Since/If-None-Match
 Key: TS-3086
 URL: https://issues.apache.org/jira/browse/TS-3086
 Project: Traffic Server
  Issue Type: Bug
  Components: HTTP
Reporter: William Bardwell


Range requests against a stale cache entry always cause the Range request to be 
tunneled with no conditional.  It would be nice if it used conditionals even if 
it couldn't update the cache entry just to cut down the traffic.  (Not sure if 
updating the cache entry would be right, does If-Modified-Since refer only to 
the Range requested or to the whole object?)  We could also have an option to 
use If-Range in these cases, but that might not make sense as a global 
decision...
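
As a sketch of what the two options would put on the wire, the hypothetical helper below builds the upstream request headers for a Range request against a stale entry. The function and parameter names are assumptions for illustration, not ATS code. With If-None-Match/If-Modified-Since the origin can answer 304 when the object is unchanged; with If-Range it returns the requested range when the validator still matches and the full object otherwise.

```python
def revalidation_headers(range_value, etag=None, last_modified=None,
                         use_if_range=False):
    """Build request headers for a Range request against a stale entry.

    All names here are hypothetical; the validators would come from the
    stale cached response's ETag and Last-Modified headers.
    """
    headers = {"Range": range_value}
    if use_if_range:
        # If-Range takes one validator. A weak ETag (W/"...") is not allowed
        # there, so fall back to the date, or send the Range unconditionally.
        if etag and not etag.startswith("W/"):
            headers["If-Range"] = etag
        elif last_modified:
            headers["If-Range"] = last_modified
        return headers
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


print(revalidation_headers("bytes=0-99", etag='"abc"',
                           last_modified="Wed, 17 Sep 2014 10:00:00 GMT"))
print(revalidation_headers("bytes=0-99", etag='"abc"', use_if_range=True))
```

Making `use_if_range` a per-request argument rather than a global switch reflects the concern above that If-Range may not make sense as a global decision.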





[jira] [Comment Edited] (TS-3080) OpenSSL implementation of TLS session cache is very slow.

2014-09-18 Thread Alexey Ivanov (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139593#comment-14139593
 ] 

Alexey Ivanov edited comment on TS-3080 at 9/18/14 10:00 PM:
-

The bottleneck seems to manifest itself if:
1) we are at around 1k handshakes/sec.
2) we have a huge session cache size - 30 entries

It shows up as all NET threads stuck inside [SSL_CTX_flush_sessions], 
which is logical since it walks the list of sessions, applying the 
timeout function to each of them while holding a lock:
{code}
lh_SSL_SESSION_doall_arg(tp.cache, LHASH_DOALL_ARG_FN(timeout),
TIMEOUT_PARAM, tp);
{code}

So we either need to reduce the number of elements in the cache, which would 
make it useless, or (preferably) write our own implementation using CK and the 
{{SSL_CTX_sess_set_\{new,get,remove\}_cb}} callbacks.
(That's how nginx did it, though nginx still allows using the built-in openssl 
cache, even though it is slow and causes memory fragmentation 
[nginx_ssl_session_cache])

[SSL_CTX_flush_sessions] 
https://github.com/openssl/openssl/blob/master/ssl/ssl_sess.c#L964
[nginx_ssl_session_cache] 
http://nginx.org/en/docs/http/ngx_http_ssl_module.html#ssl_session_cache
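
To show the idea behind replacing the full scan, here is a toy Python cache whose three methods mirror the new/get/remove callbacks named above. It is a sketch under stated assumptions, not the CK-based design and not what nginx does: `SessionCache` and all its names are hypothetical, and it assumes one fixed timeout for every session. Under that assumption insertion order equals expiry order, so expired entries can be dropped from the front without visiting the rest of the cache while the lock is held.

```python
import threading
import time
from collections import OrderedDict


class SessionCache:
    """Toy session cache with a fixed timeout (all names are hypothetical)."""

    def __init__(self, timeout, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = OrderedDict()   # session_id -> (expires_at, session)

    def _drop_expired(self, now):
        # Oldest entries sit at the front; stop at the first live one.
        while self._entries:
            session_id, (expires_at, _) = next(iter(self._entries.items()))
            if expires_at > now:
                break                   # everything behind it is newer
            del self._entries[session_id]

    def new(self, session_id, session):
        with self._lock:
            now = self._clock()
            self._drop_expired(now)
            self._entries.pop(session_id, None)   # re-insert at the back
            self._entries[session_id] = (now + self._timeout, session)

    def get(self, session_id):
        with self._lock:
            self._drop_expired(self._clock())
            entry = self._entries.get(session_id)
            return entry[1] if entry else None

    def remove(self, session_id):
        with self._lock:
            self._entries.pop(session_id, None)

    def __len__(self):
        with self._lock:
            return len(self._entries)
```

The work done under the lock is proportional to the number of entries that actually expired, not to the cache size. A single lock is still a point of contention at high handshake rates, which is presumably where a lock-free structure such as CK would come in.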


was (Author: savetherbtz):
The bottleneck seems to manifest itself if:
1) we are at around 1k handshakes/sec.
2) we have a huge session cache size - 30 entries

It shows up as all NET threads stuck inside [SSL_CTX_flush_sessions], 
which is logical since it walks the list of sessions, applying the 
timeout function to each of them while holding a lock:
{code}
lh_SSL_SESSION_doall_arg(tp.cache, LHASH_DOALL_ARG_FN(timeout),
TIMEOUT_PARAM, tp);
{code}

So we either need to reduce the number of elements in the cache, which would 
make it useless, or (preferably) write our own implementation using CK and the 
{{SSL_CTX_sess_set_\{new,get,remove\}_cb}} callbacks.
(That's how nginx did it, though nginx still allows using the built-in openssl 
cache, even though it is slow and causes memory fragmentation 
[nginx#ssl_session_cache])

[SSL_CTX_flush_sessions] 
https://github.com/openssl/openssl/blob/master/ssl/ssl_sess.c#L964
[nginx#ssl_session_cache] 
http://nginx.org/en/docs/http/ngx_http_ssl_module.html#ssl_session_cache

 OpenSSL implementation of TLS session cache is very slow.
 -

 Key: TS-3080
 URL: https://issues.apache.org/jira/browse/TS-3080
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, SSL
Reporter: Brian Geffon
Assignee: Brian Geffon
 Fix For: 5.2.0


 The OpenSSL implementation of TLS session caching is very slow; we attempted 
 to use it and, because of its locking, it blows up at only a few hundred QPS. 
 I'm going to develop a new TLS session cache in TS that performs better under 
 high load.





[jira] [Commented] (TS-3080) OpenSSL implementation of TLS session cache is very slow.

2014-09-18 Thread Alexey Ivanov (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139593#comment-14139593
 ] 

Alexey Ivanov commented on TS-3080:
---

The bottleneck seems to manifest itself if:
1) we are at around 1k handshakes/sec.
2) we have a huge session cache size - 30 entries

It shows up as all NET threads stuck inside [SSL_CTX_flush_sessions], 
which is logical since it walks the list of sessions, applying the 
timeout function to each of them while holding a lock:
{code}
lh_SSL_SESSION_doall_arg(tp.cache, LHASH_DOALL_ARG_FN(timeout),
TIMEOUT_PARAM, tp);
{code}

So we either need to reduce the number of elements in the cache, which would 
make it useless, or (preferably) write our own implementation using CK and the 
{{SSL_CTX_sess_set_\{new,get,remove\}_cb}} callbacks.
(That's how nginx did it, though nginx still allows using the built-in openssl 
cache, even though it is slow and causes memory fragmentation 
[nginx#ssl_session_cache])

[SSL_CTX_flush_sessions] 
https://github.com/openssl/openssl/blob/master/ssl/ssl_sess.c#L964
[nginx#ssl_session_cache] 
http://nginx.org/en/docs/http/ngx_http_ssl_module.html#ssl_session_cache

 OpenSSL implementation of TLS session cache is very slow.
 -

 Key: TS-3080
 URL: https://issues.apache.org/jira/browse/TS-3080
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, SSL
Reporter: Brian Geffon
Assignee: Brian Geffon
 Fix For: 5.2.0


 The OpenSSL implementation of TLS session caching is very slow; we attempted 
 to use it and, because of its locking, it blows up at only a few hundred QPS. 
 I'm going to develop a new TLS session cache in TS that performs better under 
 high load.





[jira] [Commented] (TS-3085) Large POSTs over (relatively) slower connections failing in ats5

2014-09-18 Thread kang li (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139915#comment-14139915
 ] 

kang li commented on TS-3085:
-

Hi [~sudheerv],

I think the SSL error stack corruption may be related to 
[TS-2986|https://issues.apache.org/jira/browse/TS-2986], which removed 
SSLErrorVC to eliminate the SSL error log in diags.log. SSLErrorVC used to call 
ERR_get_error_line_data, which cleaned the error stack.

 Large POSTs over (relatively) slower connections failing in ats5
 

 Key: TS-3085
 URL: https://issues.apache.org/jira/browse/TS-3085
 Project: Traffic Server
  Issue Type: Bug
  Components: SSL
Affects Versions: 5.0.1
Reporter: Sudheer Vinukonda
Assignee: Sudheer Vinukonda
  Labels: yahoo
 Fix For: 5.2.0


 We ran into a production issue where large POSTs (30MB or higher) are failing 
 over slower connections after the ats5 rollout (the problem can be 
 easily reproduced using Charles proxy with throttling enabled). 
 Further debugging isolated the issue to uploads over SSL connections and 
 after a lot of debugging the issue appears to be the below:
 ATS calls SSL_read() followed by SSL_get_error() to check if there was any 
 error in the read. This is repeated until either the complete data is read or 
 an error occurs. However, from the openssl documentation, it is recommended 
 to call ERR_clear_error() prior to calling SSL_read() + SSL_get_error() to 
 ensure the error queue is clean of any leftover/garbage errors.  It's not 
 clear what might be corrupting the error queue of the SSL context in a tight 
 loop - possibly, some new feature in ats5. In any case, calling 
 ERR_clear_error() is a good idea and adding this seems to resolve the post 
 failures.
 Documentation from openSSL and some related notes on stackoverflow:
 https://www.openssl.org/docs/ssl/SSL_get_error.html
 http://stackoverflow.com/questions/18179128/how-to-manage-the-error-queue-in-openssl-ssl-get-error-and-err-get-error
 {code}
 SSL_get_error() returns a result code (suitable for the C ``switch''
 statement) for a preceding call to SSL_connect(), SSL_accept(),
 SSL_do_handshake(), SSL_read(), SSL_peek(), or SSL_write() on ssl. The value
 returned by that TLS/SSL I/O function must be passed to SSL_get_error() in
 parameter ret.
 In addition to ssl and ret, SSL_get_error() inspects the current thread's
 OpenSSL error queue. Thus, SSL_get_error() must be used in the same thread
 that performed the TLS/SSL I/O operation, and no other OpenSSL function calls
 should appear in between. The current thread's error queue must be empty
 before the TLS/SSL I/O operation is attempted, or SSL_get_error() will not
 work reliably.
 SSL_get_error does not call ERR_get_error. So if you just call SSL_get_error,
 the error stays in the queue.
 You should be calling ERR_clear_error prior to ANY SSL-call (SSL_read,
 SSL_write etc) that is followed by SSL_get_error, otherwise you may be reading
 an old error that occurred previously in the current thread.
 {code}





[jira] [Updated] (TS-3084) forwarding mode breaks iPhone activation (gs.apple.com)

2014-09-18 Thread Alan M. Carroll (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan M. Carroll updated TS-3084:

Assignee: Susan Hinrichs

 forwarding mode breaks iPhone activation (gs.apple.com)
 ---

 Key: TS-3084
 URL: https://issues.apache.org/jira/browse/TS-3084
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, HTTP
Reporter: Nikolai Gorchilov
Assignee: Susan Hinrichs
 Fix For: 5.2.0

 Attachments: gs.request, gs.response, partial-request.txt


 On iDevice restoration iTunes makes activation request to gs.apple.com 
 (request attached).
 When sent via ATS, the request leads to HTTP/1.1 400 (bad request) response 
 from origin server.
 Proper response (on direct connection) is also attached for your reference.
 Here's the command to reproduce the problem
 {noformat}
 netcat gs.apple.com 80 < gs.request
 {noformat}





[jira] [Updated] (TS-3073) tr-pass: non-http request gets blocked with error message instead of being tunnelled to the origin server

2014-09-18 Thread Susan Hinrichs (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Susan Hinrichs updated TS-3073:
---
Attachment: ts-3084.patch

This appears to be very similar to the problem fixed in ts-3073.  The ATS logic 
really doesn't like it when the client closes a connection before the server 
finishes sending a response.  This patch is built upon the patch for ts-3073, 
but that patch addressed a tr-pass case and this one addresses an HTTP POST 
case.

It again uses the half_open flag to track that a client has already sent a FIN, 
and checks the flag after all the data has been read to do a shutdown(IO_READ) 
to send the FIN along to the origin server.
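
The half-close behaviour involved can be seen with plain sockets. The sketch below is not ATS code and does not use the half_open flag; `half_close_exchange` is a hypothetical helper in which a local socket pair stands in for the client and origin ends of a connection. The client sends its request and shuts down only its write side (the FIN); the peer sees end-of-stream but can still deliver the full response, which is what is lost if the FIN is treated as a full close.

```python
import socket


def read_until_eof(sock):
    """Collect bytes until the peer shuts down its write side."""
    data = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            return data
        data += chunk


def half_close_exchange(request, response):
    """Send a request, half-close, and still receive the whole response."""
    client, origin = socket.socketpair()
    try:
        client.sendall(request)
        client.shutdown(socket.SHUT_WR)      # FIN: "I have nothing more to send"
        received = read_until_eof(origin)    # origin sees the request, then EOF
        origin.sendall(response)             # the other direction is still open
        origin.shutdown(socket.SHUT_WR)
        reply = read_until_eof(client)
        return received, reply
    finally:
        client.close()
        origin.close()


print(half_close_exchange(b"POST / HTTP/1.1\r\n\r\nbody", b"HTTP/1.1 200 OK\r\n\r\n"))
```

The payloads are kept small so that each `sendall` fits in the socket buffer and this single-threaded demo cannot block.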

 tr-pass: non-http request gets blocked with error message instead of being 
 tunnelled to the origin server
 -

 Key: TS-3073
 URL: https://issues.apache.org/jira/browse/TS-3073
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, HTTP
Reporter: Nikolai Gorchilov
Assignee: Susan Hinrichs
 Fix For: 5.2.0

 Attachments: bypass.request, tr-pass-client-close.patch, ts-3084.patch


 ATS breaks RIFF Box JTAG Manager software that is using proprietary protocol 
 over port 80 even with tr-pass enabled.
 Instead of creating a tunnel, ATS returns bad request error.
 Managed to capture the request that triggers the issue (to be attached as 
 bypass.request). Here's a simple command to reproduce the problem:
 #$ netcat 93.191.132.28 80 < bypass.request
 Direct request returns a simple exclamation mark '!', but passing it via ATS 
 results in:
 {noformat}
 <HTML>
 <HEAD>
 <TITLE>Bad Request</TITLE>
 </HEAD>
 <BODY BGCOLOR="white" FGCOLOR="black">
 <H1>Bad Request</H1>
 <HR>
 <FONT FACE="Helvetica,Arial"><B>
 Description: Could not process this request. 
 </B></FONT>
 <HR>
 </BODY>
 {noformat}





[jira] [Work started] (TS-3073) tr-pass: non-http request gets blocked with error message instead of being tunnelled to the origin server

2014-09-18 Thread Susan Hinrichs (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on TS-3073 started by Susan Hinrichs.
--
 tr-pass: non-http request gets blocked with error message instead of being 
 tunnelled to the origin server
 -

 Key: TS-3073
 URL: https://issues.apache.org/jira/browse/TS-3073
 Project: Traffic Server
  Issue Type: Bug
  Components: Core, HTTP
Reporter: Nikolai Gorchilov
Assignee: Susan Hinrichs
 Fix For: 5.2.0

 Attachments: bypass.request, tr-pass-client-close.patch, ts-3084.patch


 ATS breaks RIFF Box JTAG Manager software that is using proprietary protocol 
 over port 80 even with tr-pass enabled.
 Instead of creating a tunnel, ATS returns bad request error.
 Managed to capture the request that triggers the issue (to be attached as 
 bypass.request). Here's a simple command to reproduce the problem:
 #$ netcat 93.191.132.28 80 < bypass.request
 Direct request returns a simple exclamation mark '!', but passing it via ATS 
 results in:
 {noformat}
 <HTML>
 <HEAD>
 <TITLE>Bad Request</TITLE>
 </HEAD>
 <BODY BGCOLOR="white" FGCOLOR="black">
 <H1>Bad Request</H1>
 <HR>
 <FONT FACE="Helvetica,Arial"><B>
 Description: Could not process this request. 
 </B></FONT>
 <HR>
 </BODY>
 {noformat}


