[jira] [Updated] (TS-4242) Permanent disk failures are not handled gracefully

2016-02-29 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-4242:
---
Description: 
I'm simulating a disk failure of 1 sector with the following setup:

{noformat}
dd if=/dev/zero of=err.img bs=512 count=2097152
losetup /dev/loop0 err.img
dmsetup create err0 <<EOF
0 1024000 linear /dev/loop0 0
1024000 1 error
1024001 1073151 linear /dev/loop0 1024001
EOF
dmsetup mknodes err0
{noformat}

With the above commands, we create a 1 GiB disk and, at the 500 MiB mark, we simulate an 
error for a single 512-byte sector.

storage.config:

{noformat}
/dev/mapper/err0
{noformat}

Now I have a tool that randomly generates URLs, stores them, and requests them back with 
a certain probability, so that I both write to and read from the disk with a certain 
offered/expected hit ratio.

Once I hit the 500 MiB mark, trafficserver keeps spitting warnings about disk errors. I 
fear it's because trafficserver keeps writing to that bad sector instead of skipping it.

These are the errors/warnings I'm seeing in the log repeatedly:

{noformat}
[Feb 29 15:29:33.308] Server {0x2ac3f1cd4700} WARNING:  (cache_op)> cache disk operation failed WRITE -1 5
[Feb 29 15:29:33.309] Server {0x2ac3e56063c0} WARNING:  (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1726/1]
[Feb 29 15:29:33.320] Server {0x2ac3e56063c0} WARNING:  (openReadStartHead)> Head : Doc magic does not match for 75B41B1A2C85AE637DD6CE368BF783D0
[Feb 29 15:29:33.323] Server {0x2ac3eb480700} WARNING:  (openReadStartHead)> Head : Doc magic does not match for 1075CEA6E2E47496BE190DBB448B0B64
...
[Feb 29 15:29:33.284] Server {0x2ac3f28e0700} WARNING:  cache disk operation failed WRITE -1 5
[Feb 29 15:29:33.287] Server {0x2ac3eb682700} WARNING:  Error accessing Disk /dev/mapper/err0 [1725/1]
[Feb 29 15:29:33.289] Server {0x2ac3eb682700} WARNING:  Head : Doc magic does not match for 7E3325870F5488955118359E6C4B10F4
[Feb 29 15:29:33.289] Server {0x2ac3eb27e700} WARNING:  Head : Doc magic does not match for 7AE309F21ABF9B3774C67921018FCA0E
...
{noformat}

Summary: trafficserver does not treat I/O errors as permanent, but as temporary. Is this 
true? It leaves only two options:
1. Replace the hard disk, or
2. Use device-mapper to skip the bad sector.

Both cases mean throwing away a whole disk cache of terabytes for just one bad sector.

If this is what's really happening, is it feasible to skip the bad sector? If so, I could 
work on a patch.
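
To make the question concrete, here is a minimal, self-contained sketch of the kind of 
bookkeeping "skipping the bad sector" would imply: remember the offsets that returned EIO 
and steer later cache writes past them instead of retiring the whole disk. This is not 
Traffic Server code; every name in it is hypothetical.

{code}
#include <cstdint>
#include <cstdio>
#include <set>

// Hypothetical illustration only, not Traffic Server code: remember sectors
// that failed with EIO and route later writes around them, rather than
// treating one bad sector as a whole-disk failure.
struct BadSectorMap {
  static const int64_t kSectorSize = 512;
  std::set<int64_t> bad; // byte offsets of sectors known to return EIO

  void record_failure(int64_t offset) { bad.insert(offset - offset % kSectorSize); }

  // Return the first writable offset at or after 'offset'.
  int64_t next_writable(int64_t offset) const {
    int64_t o = offset - offset % kSectorSize;
    while (bad.count(o) != 0) {
      o += kSectorSize;
    }
    return o;
  }
};

int main() {
  BadSectorMap map;
  map.record_failure(524288000); // the 500 MiB sector that keeps returning errno 5 (EIO)
  std::printf("next write goes to offset %lld\n",
              (long long)map.next_writable(524288000));
  return 0;
}
{code}

A real implementation would have to hook the cache's write path and persist the bad-offset 
set across restarts, which is exactly the part I'd need guidance on.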



> Permanent disk failures are not handled gracefully
> --
>
> Key: TS-4242
> URL: https://issues.apache.org/jira/browse/TS-4242
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Cache
>Reporter: Luca Bruno
>
> I'm simulating a disk failure of 1 sector with the following setup:
> {noformat}
> dd if=/dev/zero of=err.img bs=512 count=2097152
> losetup /dev/loop0 err.img
> dmsetup create err0 <<EOF
> 0 1024000 linear /dev/loop0 0
> 1024000 1 error
> 1024001 1073151 linear /dev/loop0 1024001
> EOF
> dmsetup mknodes err0
> {noformat}
> With the above commands, we create a 1 GiB disk, and at 500 MiB we simulate an 
> error for a single 512-byte sector.
> storage.config:
> {noformat}
> /dev/mapper/err0
> {noformat}
> Now I have a tool that randomly generates urls, stores them, and requests 
> them back with a certain probability. So that I both write and read from the 
> disk with a certain offered/expected hit ratio.
> Once I hit the 500mib mark, trafficserver keeps spitting warnings about disk 
> error. I fear it's because trafficserver keeps writing that bad sector 
> instead of skipping it.
> These are the errors/warnings I'm seeing in the log repeatedly:
> {noformat}
> [Feb 29 15:29:33.308] Server {0x2ac3f1cd4700} WARNING:  (cache_op)> cache disk operation failed WRITE -1 5
> [Feb 29 15:29:33.309] Server {0x2ac3e56063c0} WARNING:  (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1726/1]
> [Feb 29 15:29:33.320] Server {0x2ac3e56063c0} WARNING:  (openReadStartHead)> Head : Doc magic does not match for 
> 75B41B1A2C85AE637DD6CE368BF783D0
> [Feb 29 15:29:33.323] Server {0x2ac3eb480700} WARNING:  (openReadStartHead)> Head : Doc magic does not match for 
> 1075CEA6E2E47496BE190DBB448B0B64
> ...
> [Feb 29 15:29:33.284] Server 

[jira] [Created] (TS-4242) Permanent disk failures are not handled gracefully

2016-02-29 Thread Luca Bruno (JIRA)
Luca Bruno created TS-4242:
--

 Summary: Permanent disk failures are not handled gracefully
 Key: TS-4242
 URL: https://issues.apache.org/jira/browse/TS-4242
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno


I'm simulating a disk failure of 1 sector with the following setup:

{noformat}
dd if=/dev/zero of=err.img bs=512 count=2097152
losetup /dev/loop0 err.img
dmsetup create err0 <<EOF
0 1024000 linear /dev/loop0 0
1024000 1 error
1024001 1073151 linear /dev/loop0 1024001
EOF
dmsetup mknodes err0
{noformat}

With the above commands, we create a 1 GiB disk and, at the 500 MiB mark, we simulate an 
error for a single 512-byte sector.

storage.config:

{noformat}
/dev/mapper/err0
{noformat}

Now I have a tool that randomly generates URLs, stores them, and requests them back with 
a certain probability, so that I both write to and read from the disk with a certain 
offered/expected hit ratio.

Once I hit the 500 MiB mark, trafficserver keeps failing at writing and reading every new 
object. I fear it's because trafficserver keeps writing to that bad sector and does NOT 
skip it.

These are the errors/warnings I'm seeing in the log repeatedly:

{noformat}
[Feb 29 15:29:33.308] Server {0x2ac3f1cd4700} WARNING:  (cache_op)> cache disk operation failed WRITE -1 5
[Feb 29 15:29:33.309] Server {0x2ac3e56063c0} WARNING:  (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1726/1]
[Feb 29 15:29:33.320] Server {0x2ac3e56063c0} WARNING:  (openReadStartHead)> Head : Doc magic does not match for 75B41B1A2C85AE637DD6CE368BF783D0
[Feb 29 15:29:33.323] Server {0x2ac3eb480700} WARNING:  (openReadStartHead)> Head : Doc magic does not match for 1075CEA6E2E47496BE190DBB448B0B64
...
[Feb 29 15:29:33.284] Server {0x2ac3f28e0700} WARNING:  cache disk operation failed WRITE -1 5
[Feb 29 15:29:33.287] Server {0x2ac3eb682700} WARNING:  Error accessing Disk /dev/mapper/err0 [1725/1]
[Feb 29 15:29:33.289] Server {0x2ac3eb682700} WARNING:  Head : Doc magic does not match for 7E3325870F5488955118359E6C4B10F4
[Feb 29 15:29:33.289] Server {0x2ac3eb27e700} WARNING:  Head : Doc magic does not match for 7AE309F21ABF9B3774C67921018FCA0E
...
{noformat}

Summary: trafficserver does not treat I/O errors as permanent, but as temporary. Is this 
true? It leaves only two options:
1. Replace the hard disk, or
2. Use device-mapper to skip the bad sector.

Both cases mean throwing away a whole disk cache of terabytes for just one bad sector.

If this is what's really happening, is it feasible to skip the bad sector? If so, I could 
work on a patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-4173) High latency with --enable-linux-native-aio

2016-02-04 Thread Luca Bruno (JIRA)
Luca Bruno created TS-4173:
--

 Summary: High latency with --enable-linux-native-aio
 Key: TS-4173
 URL: https://issues.apache.org/jira/browse/TS-4173
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno


Problem:

My average disk latency on a rotational disk with 1RPM is around 5ms.
However, ATS response times, regardless of having low or high concurrent 
requests, are on average 20ms.

The cause:

With --enable-linux-native-aio, the AIO period time is hardcoded to 10ms 
(AIO_PERIOD in AIO.cc).
That's really suboptimal with disk latency that is lower than 10ms. 
Without --enable-linux-native-aio I get on average 5ms response times which 
resemble the disk read latency.

Possible solution:

Instead of waiting for 10ms unconditionally:
- Decide (don't know how yet) a number of disk ops NOPS.
- If the ops queue size >= NOPS, trigger mainAIOEvent, then reset the timer 
again to AIO_PERIOD.
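
As a rough, self-contained sketch of that idea (NOPS, the queue model, and the function 
names below are invented for illustration; a real change would live in AIO.cc around 
AIO_PERIOD and mainAIOEvent):

{code}
#include <cstdio>

// Toy model of the proposal above. AIO_PERIOD_MS mirrors the hardcoded 10ms
// period mentioned in the report; NOPS and the rest are invented names.
static const int AIO_PERIOD_MS = 10;
static const int NOPS          = 32; // example threshold; picking it well is the open question

struct AioQueueModel {
  int pending = 0;

  // Returns true when the caller should run mainAIOEvent immediately instead
  // of waiting for the next AIO_PERIOD tick.
  bool enqueue_and_should_dispatch() {
    ++pending;
    return pending >= NOPS;
  }

  // After dispatching, the periodic timer would be reset to AIO_PERIOD.
  void dispatched() { pending = 0; }
};

int main() {
  AioQueueModel q;
  for (int op = 1; op <= 100; ++op) {
    if (q.enqueue_and_should_dispatch()) {
      std::printf("op %d: dispatch now instead of waiting up to %d ms\n", op, AIO_PERIOD_MS);
      q.dispatched();
    }
  }
  return 0;
}
{code}

The open question from the list above is how to choose NOPS; anything from a fixed 
constant to a value derived from measured disk latency would fit this shape.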



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4173) High latency with --enable-linux-native-aio

2016-02-04 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-4173:
---
Description: 
Problem:

My average disk latency on rotational disk with 1RPM is around 5ms.
However ATS response times, regardless of having low or high concurrent 
requests, is on average 20ms.

The cause:

With --enable-linux-native-aio, the AIO period time is hardcoded to 10ms 
(AIO_PERIOD in AIO.cc).
That's really suboptimal with disk latency that is lower than 10ms. 
Without --enable-linux-native-aio I get on average 5ms response times which 
resemble the disk read latency.

Possible solution:

Instead of waiting for 10ms unconditionally:
- Decide (don't know how yet) a number of disk ops NOPS.
- If the ops queue size >= NOPS, trigger mainAIOEvent, then reset the timer 
again to AIO_PERIOD.

  was:
Problem:

My average disk latency on rotational disk with 1RPM is however around 5ms.
However ATS response times, regardless of having low or high concurrent 
requests, is on average 20ms.

The cause:

With --enable-linux-native-aio, the AIO period time is hardcoded to 10ms 
(AIO_PERIOD in AIO.cc).
That's really suboptimal with disk latency that is lower than 10ms. 
Without --enable-linux-native-aio I get on average 5ms response times which 
resemble the disk read latency.

Possible solution:

Instead of waiting for 10ms unconditionally:
- Decide (don't know how yet) a number of disk ops NOPS.
- If the ops queue size >= NOPS, trigger mainAIOEvent, then reset the timer 
again to AIO_PERIOD.


> High latency with --enable-linux-native-aio
> ---
>
> Key: TS-4173
> URL: https://issues.apache.org/jira/browse/TS-4173
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Cache
>Reporter: Luca Bruno
>
> Problem:
> My average disk latency on rotational disk with 1RPM is around 5ms.
> However ATS response times, regardless of having low or high concurrent 
> requests, is on average 20ms.
> The cause:
> With --enable-linux-native-aio, the AIO period time is hardcoded to 10ms 
> (AIO_PERIOD in AIO.cc).
> That's really suboptimal with disk latency that is lower than 10ms. 
> Without --enable-linux-native-aio I get on average 5ms response times which 
> resemble the disk read latency.
> Possible solution:
> Instead of waiting for 10ms unconditionally:
> - Decide (don't know how yet) a number of disk ops NOPS.
> - If the ops queue size >= NOPS, trigger mainAIOEvent, then reset the timer 
> again to AIO_PERIOD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TS-4159) ATS 6.1.0 does not log to stdout when not started from shell

2016-01-30 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124890#comment-15124890
 ] 

Luca Bruno edited comment on TS-4159 at 1/30/16 1:25 PM:
-

There is no such alternative, except writing to a log file with the LogText 
functions. TSDebug requires debug to be enabled, which means testing a lot of 
code paths against the debug regex. TSError is, well, logged as an error. 
Internal TS code is able to print notices, but plugins are not.
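
For reference, a minimal sketch of the two calls being compared, assuming the ATS 6.x C 
plugin API from ts/ts.h (the tag and the helper function name are made up):

{code}
#include <ts/ts.h>

// TSDebug output is only visible when debug is enabled and the tag matches
// proxy.config.diags.debug.tags; TSError always logs, but at error severity.
// Neither gives plugins a plain notice level.
static void
log_notice_workaround(const char *msg)
{
  TSDebug("example", "notice: %s", msg); // hidden unless the "example" tag is enabled
  TSError("[example] notice: %s", msg);  // always logged, but flagged as an error
}
{code}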


was (Author: lethalman):
There is no such alternative. TSDebug requires debug to be enabled, which means 
testing a lot of code paths with regex. TSError is well, logged as error. 
Internal TS code is able to print notices but plugins are not.

> ATS 6.1.0 does not log to stdout when not started from shell
> 
>
> Key: TS-4159
> URL: https://issues.apache.org/jira/browse/TS-4159
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core
>Reporter: Luca Bruno
>  Labels: regression
> Fix For: 6.2.0
>
>
> I have a plugin that uses printf() to print some notice messages. Then my ATS 
> is started by systemd with just `ExecStart=/usr/bin/traffic_manager`.
> With ATS 6.0.0 I could see the printf() from my plugin with journalctl. 
> That's because the stdout of the service is connected to the journal.
> With ATS 6.1.0 the plugin output is not shown anymore with journalctl. 
> However, starting `traffic_manager` from the shell actually shows the plugin 
> output.
> No file under the log directory contains the plugin output.
> That makes me think there has been some change in 6.1.0 like: "if it's not a 
> tty, close stdout", or something like that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4159) ATS 6.1.0 does not log to stdout when not started from shell

2016-01-30 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124890#comment-15124890
 ] 

Luca Bruno commented on TS-4159:


There is no such alternative. TSDebug requires debug to be enabled, which means 
testing a lot of code paths with regex. TSError is well, logged as error. 
Internal TS code is able to print notices but plugins are not.

> ATS 6.1.0 does not log to stdout when not started from shell
> 
>
> Key: TS-4159
> URL: https://issues.apache.org/jira/browse/TS-4159
> Project: Traffic Server
>  Issue Type: Bug
>  Components: Core
>Reporter: Luca Bruno
>  Labels: regression
> Fix For: 6.2.0
>
>
> I have a plugin that uses printf() to print some notice messages. Then my ATS 
> is started by systemd with just `ExecStart=/usr/bin/traffic_manager`.
> With ATS 6.0.0 I could see the printf() from my plugin with journalctl. 
> That's because the stdout of the service is connected to the journal.
> With ATS 6.1.0 the plugin output is not shown anymore with journalctl. 
> However, starting `traffic_manager` from the shell actually shows the plugin 
> output.
> No file under the log directory contains the plugin output.
> That makes me think there has been some change in 6.1.0 like: "if it's not a 
> tty, close stdout", or something like that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-4126) Crash with high concurrency and low max_connections_active_in

2016-01-12 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-4126:
---
Fix Version/s: 6.0.1

> Crash with high concurrency and low max_connections_active_in
> -
>
> Key: TS-4126
> URL: https://issues.apache.org/jira/browse/TS-4126
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Luca Bruno
> Fix For: 6.0.1
>
>
> ATS 6.0.0 reproducible crash with high concurrency (thousands of connections) 
> and low proxy.config.net.max_connections_active_in.
> {noformat}
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x71062700 (LWP 51041)]
> NetHandler::_close_vc (this=0x72076b90, vc=0x7fff40053c20, 
> now=1452601469419678969, handle_event=@0x7105da3c: 0, 
> closed=@0x7105da38: 0, total_idle_time=<optimized out>, 
> total_idle_count=@0x7105da44: 1) at UnixNet.cc:678
> 678   UnixNet.cc: No such file or directory.
> (gdb) bt
> #0  NetHandler::_close_vc (this=0x72076b90, vc=0x7fff40053c20, 
> now=1452601469419678969, handle_event=@0x7105da3c: 0, 
> closed=@0x7105da38: 0, total_idle_time=<optimized out>, 
> total_idle_count=@0x7105da44: 1) at UnixNet.cc:678
> #1  0x55891383 in NetHandler::manage_active_queue 
> (this=0x72076b90) at UnixNet.cc:590
> #2  0x5589143a in NetHandler::add_to_active_queue 
> (this=0x72076b90, vc=0x7fffe4048f40) at UnixNet.cc:720
> #3  0x556dd82d in HttpClientSession::new_transaction 
> (this=0x7fffcc0ac1d0) at HttpClientSession.cc:124
> #4  0x55666d67 in ProxyClientSession::state_api_callout 
> (this=0x7fffcc0ac1d0, event=) at ProxyClientSession.cc:123
> #5  0x556de75b in HttpClientSession::new_connection 
> (this=0x7fffcc0ac1d0, new_vc=<optimized out>, iobuf=<optimized out>, 
> reader=<optimized out>, backdoor=<optimized out>)
> at HttpClientSession.cc:220
> #6  0x556d860d in HttpSessionAccept::accept (this=<optimized out>, 
> netvc=0x7fffe4048f40, iobuf=<optimized out>, reader=0x7fffa4021a68) at 
> HttpSessionAccept.cc:74
> #7  0x556668ab in ProtocolProbeTrampoline::ioCompletionEvent 
> (this=0x7fffb002d8d0, event=1, edata=0x71062a30) at 
> ProtocolProbeSessionAccept.cc:123
> #8  0x5589bf19 in handleEvent (data=0x7fffe4049060, event=<optimized out>, this=<optimized out>) at ../../iocore/eventsystem/I_Continuation.h:146
> #9  read_signal_and_update (event=<optimized out>, vc=0x7fffe4048f40) at 
> UnixNetVConnection.cc:145
> #10 0x558a080c in read_from_net (nh=0x72076b90, 
> vc=0x7fffe4048f40, thread=0x72073010) at UnixNetVConnection.cc:377
> #11 0x5588f33a in NetHandler::mainNetEvent (this=0x72076b90, 
> event=1, e=0x71062a30) at UnixNet.cc:516
> #12 0x558c352b in handleEvent (data=0x562b5820, event=5, 
> this=0x72076b90) at I_Continuation.h:146
> #13 EThread::process_event (this=this@entry=0x72073010, e=0x562b5820, 
> calling_code=calling_code@entry=5) at UnixEThread.cc:128
> #14 0x558c4e26 in EThread::execute (this=0x72073010) at 
> UnixEThread.cc:252
> #15 0x558c2eae in spawn_thread_internal (a=0x5620cf90) at 
> Thread.cc:86
> #16 0x760510a4 in start_thread (arg=0x71062700) at 
> pthread_create.c:309
> #17 0x74ff904d in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> {noformat}
> To reproduce:
> 1. Set proxy.config.net.max_connections_active_in to 100
> 2. ab -c 2 -n 10 http://sameurlhere



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4126) Crash with high concurrency and low max_connections_active_in

2016-01-12 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093922#comment-15093922
 ] 

Luca Bruno commented on TS-4126:


Seems like the keep_alive_queue is empty, yet keep_alive_queue.head is accessed 
unconditionally. The debug says this right before crashing:

{noformat}
[Jan 12 15:06:01.082] Server {0x70d5f700} DEBUG:  (net_queue) closing connection NetVC=0x7fff30016df0 idle: 0 now: 
1452607561 at: 0 in: 30 diff: 1452607591
{noformat}

As you can see, idle: 0, which is keep_alive_queue_size. In fact the crash path 
is at keep_alive_queue.head->handleEvent(EVENT_IMMEDIATE, ); .
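
To make the suspicion concrete, here is a standalone toy of the pattern (the types are 
stand-ins, not the real NetHandler/UnixNetVConnection code): with an empty queue, head is 
null and must be checked before the handleEvent call.

{code}
#include <cstdio>

// Stand-in types only, not the UnixNet.cc implementation.
struct Conn {
  void handleEvent(int event, void *data) {
    std::printf("closing connection, event=%d data=%p\n", event, data);
  }
};

struct Queue {
  Conn *head = nullptr; // empty queue <=> head is null
};

static const int EVENT_IMMEDIATE = 1;

// Crash pattern: calling keep_alive_queue.head->handleEvent(...) while the
// queue is empty dereferences a null pointer. The guard below is the obvious
// shape of a fix.
static void close_one_idle_connection(Queue &keep_alive_queue) {
  if (keep_alive_queue.head == nullptr) {
    return; // nothing idle to close; bail out instead of crashing
  }
  keep_alive_queue.head->handleEvent(EVENT_IMMEDIATE, nullptr);
}

int main() {
  Queue q; // idle: 0, as in the debug log above
  close_one_idle_connection(q);
  std::printf("no crash with an empty keep-alive queue\n");
  return 0;
}
{code}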


> Crash with high concurrency and low max_connections_active_in
> -
>
> Key: TS-4126
> URL: https://issues.apache.org/jira/browse/TS-4126
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Luca Bruno
>
> ATS 6.0.0 reproducible crash with high concurrency (thousands of connections) 
> and low proxy.config.net.max_connections_active_in.
> {noformat}
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x71062700 (LWP 51041)]
> NetHandler::_close_vc (this=0x72076b90, vc=0x7fff40053c20, 
> now=1452601469419678969, handle_event=@0x7105da3c: 0, 
> closed=@0x7105da38: 0, total_idle_time=, 
> total_idle_count=@0x7105da44: 1) at UnixNet.cc:678
> 678   UnixNet.cc: No such file or directory.
> (gdb) bt
> #0  NetHandler::_close_vc (this=0x72076b90, vc=0x7fff40053c20, 
> now=1452601469419678969, handle_event=@0x7105da3c: 0, 
> closed=@0x7105da38: 0, total_idle_time=, 
> total_idle_count=@0x7105da44: 1) at UnixNet.cc:678
> #1  0x55891383 in NetHandler::manage_active_queue 
> (this=0x72076b90) at UnixNet.cc:590
> #2  0x5589143a in NetHandler::add_to_active_queue 
> (this=0x72076b90, vc=0x7fffe4048f40) at UnixNet.cc:720
> #3  0x556dd82d in HttpClientSession::new_transaction 
> (this=0x7fffcc0ac1d0) at HttpClientSession.cc:124
> #4  0x55666d67 in ProxyClientSession::state_api_callout 
> (this=0x7fffcc0ac1d0, event=) at ProxyClientSession.cc:123
> #5  0x556de75b in HttpClientSession::new_connection 
> (this=0x7fffcc0ac1d0, new_vc=, iobuf=, 
> reader=, backdoor=)
> at HttpClientSession.cc:220
> #6  0x556d860d in HttpSessionAccept::accept (this=, 
> netvc=0x7fffe4048f40, iobuf=, reader=0x7fffa4021a68) at 
> HttpSessionAccept.cc:74
> #7  0x556668ab in ProtocolProbeTrampoline::ioCompletionEvent 
> (this=0x7fffb002d8d0, event=1, edata=0x71062a30) at 
> ProtocolProbeSessionAccept.cc:123
> #8  0x5589bf19 in handleEvent (data=0x7fffe4049060, event= out>, this=) at ../../iocore/eventsystem/I_Continuation.h:146
> #9  read_signal_and_update (event=, vc=0x7fffe4048f40) at 
> UnixNetVConnection.cc:145
> #10 0x558a080c in read_from_net (nh=0x72076b90, 
> vc=0x7fffe4048f40, thread=0x72073010) at UnixNetVConnection.cc:377
> #11 0x5588f33a in NetHandler::mainNetEvent (this=0x72076b90, 
> event=1, e=0x71062a30) at UnixNet.cc:516
> #12 0x558c352b in handleEvent (data=0x562b5820, event=5, 
> this=0x72076b90) at I_Continuation.h:146
> #13 EThread::process_event (this=this@entry=0x72073010, e=0x562b5820, 
> calling_code=calling_code@entry=5) at UnixEThread.cc:128
> #14 0x558c4e26 in EThread::execute (this=0x72073010) at 
> UnixEThread.cc:252
> #15 0x558c2eae in spawn_thread_internal (a=0x5620cf90) at 
> Thread.cc:86
> #16 0x760510a4 in start_thread (arg=0x71062700) at 
> pthread_create.c:309
> #17 0x74ff904d in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> {noformat}
> To reproduce:
> 1. Set proxy.config.net.max_connections_active_in to 100
> 2. ab -c 2 -n 10 http://sameurlhere



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TS-4126) Crash with high concurrency and low max_connections_active_in

2016-01-12 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno resolved TS-4126.

Resolution: Duplicate

I've realized this has been fixed. Thanks.

> Crash with high concurrency and low max_connections_active_in
> -
>
> Key: TS-4126
> URL: https://issues.apache.org/jira/browse/TS-4126
> Project: Traffic Server
>  Issue Type: Bug
>  Components: HTTP
>Reporter: Luca Bruno
>
> ATS 6.0.0 reproducible crash with high concurrency (thousands of connections) 
> and low proxy.config.net.max_connections_active_in.
> {noformat}
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x71062700 (LWP 51041)]
> NetHandler::_close_vc (this=0x72076b90, vc=0x7fff40053c20, 
> now=1452601469419678969, handle_event=@0x7105da3c: 0, 
> closed=@0x7105da38: 0, total_idle_time=, 
> total_idle_count=@0x7105da44: 1) at UnixNet.cc:678
> 678   UnixNet.cc: No such file or directory.
> (gdb) bt
> #0  NetHandler::_close_vc (this=0x72076b90, vc=0x7fff40053c20, 
> now=1452601469419678969, handle_event=@0x7105da3c: 0, 
> closed=@0x7105da38: 0, total_idle_time=, 
> total_idle_count=@0x7105da44: 1) at UnixNet.cc:678
> #1  0x55891383 in NetHandler::manage_active_queue 
> (this=0x72076b90) at UnixNet.cc:590
> #2  0x5589143a in NetHandler::add_to_active_queue 
> (this=0x72076b90, vc=0x7fffe4048f40) at UnixNet.cc:720
> #3  0x556dd82d in HttpClientSession::new_transaction 
> (this=0x7fffcc0ac1d0) at HttpClientSession.cc:124
> #4  0x55666d67 in ProxyClientSession::state_api_callout 
> (this=0x7fffcc0ac1d0, event=) at ProxyClientSession.cc:123
> #5  0x556de75b in HttpClientSession::new_connection 
> (this=0x7fffcc0ac1d0, new_vc=, iobuf=, 
> reader=, backdoor=)
> at HttpClientSession.cc:220
> #6  0x556d860d in HttpSessionAccept::accept (this=, 
> netvc=0x7fffe4048f40, iobuf=, reader=0x7fffa4021a68) at 
> HttpSessionAccept.cc:74
> #7  0x556668ab in ProtocolProbeTrampoline::ioCompletionEvent 
> (this=0x7fffb002d8d0, event=1, edata=0x71062a30) at 
> ProtocolProbeSessionAccept.cc:123
> #8  0x5589bf19 in handleEvent (data=0x7fffe4049060, event= out>, this=) at ../../iocore/eventsystem/I_Continuation.h:146
> #9  read_signal_and_update (event=, vc=0x7fffe4048f40) at 
> UnixNetVConnection.cc:145
> #10 0x558a080c in read_from_net (nh=0x72076b90, 
> vc=0x7fffe4048f40, thread=0x72073010) at UnixNetVConnection.cc:377
> #11 0x5588f33a in NetHandler::mainNetEvent (this=0x72076b90, 
> event=1, e=0x71062a30) at UnixNet.cc:516
> #12 0x558c352b in handleEvent (data=0x562b5820, event=5, 
> this=0x72076b90) at I_Continuation.h:146
> #13 EThread::process_event (this=this@entry=0x72073010, e=0x562b5820, 
> calling_code=calling_code@entry=5) at UnixEThread.cc:128
> #14 0x558c4e26 in EThread::execute (this=0x72073010) at 
> UnixEThread.cc:252
> #15 0x558c2eae in spawn_thread_internal (a=0x5620cf90) at 
> Thread.cc:86
> #16 0x760510a4 in start_thread (arg=0x71062700) at 
> pthread_create.c:309
> #17 0x74ff904d in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> {noformat}
> To reproduce:
> 1. Set proxy.config.net.max_connections_active_in to 100
> 2. ab -c 2 -n 10 http://sameurlhere



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-4055) Coredump when closing a transaction with a stalled connection to the origin

2016-01-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092275#comment-15092275
 ] 

Luca Bruno commented on TS-4055:


I'm getting a similar crash with ats 6.0.0, but the stack trace is different. 
Note that I'm not using traffic_cop, but directly traffic_manager:

traffic_server: Segmentation fault (Address not mapped to object [0x8])
traffic_server - STACK TRACE:
/usr/bin/traffic_server(_Z19crash_logger_invokeiP9siginfo_tPv+0xa7)[0x2b1869771db7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x2b186c0538d0]
/usr/bin/traffic_server(_ZN10NetHandler9_close_vcEP18UnixNetVConnectionlRiS2_S2_S2_+0x2a8)[0x2b18699b8e58]
/usr/bin/traffic_server(_ZN10NetHandler19manage_active_queueEv+0xd3)[0x2b18699b9143]
/usr/bin/traffic_server(_ZN10NetHandler19add_to_active_queueEP18UnixNetVConnection+0x2a)[0x2b18699b91fa]
/usr/bin/traffic_server(_ZN17HttpClientSession15new_transactionEv+0x125)[0x2b1869820795]
/usr/bin/traffic_server(_ZN18ProxyClientSession17state_api_calloutEiPv+0x33)[0x2b18697ac1b3]
/usr/bin/traffic_server(_ZN17HttpClientSession14new_connectionEP14NetVConnectionP9MIOBufferP14IOBufferReaderb+0x234)[0x2b186981fd04]
/usr/bin/traffic_server(_ZN17HttpSessionAccept6acceptEP14NetVConnectionP9MIOBufferP14IOBufferReader+0x23d)[0x2b186981b3bd]
/usr/bin/traffic_server(_ZN23ProtocolProbeTrampoline17ioCompletionEventEiPv+0x2da)[0x2b18697abeaa]
/usr/bin/traffic_server(+0x3235d8)[0x2b18699c45d8]
/usr/bin/traffic_server(_ZN10NetHandler12mainNetEventEiP5Event+0x29a)[0x2b18699b7a7a]
/usr/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x150)[0x2b18699e7550]
/usr/bin/traffic_server(_ZN7EThread7executeEv+0x67e)[0x2b18699e80ee]
/usr/bin/traffic_server(+0x345ff6)[0x2b18699e6ff6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x2b186c04c0a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x2b186d0d404d]

Is it the same or should I file a bug? If it's the same issue, which commits 
should I apply on top of 6.0.0 to test? Or which commit should I try?

> Coredump when closing a transaction with a stalled connection to the origin  
> -
>
> Key: TS-4055
> URL: https://issues.apache.org/jira/browse/TS-4055
> Project: Traffic Server
>  Issue Type: Bug
>Affects Versions: 6.0.1
>Reporter: bettydramit
>Assignee: Bryan Call
>  Labels: crash
> Fix For: 6.0.1, 6.1.0
>
>
> {code}
> c++filt  traffic_server: Segmentation fault (Address not mapped to object [0x8])
> traffic_server - STACK TRACE: 
> /usr/bin/traffic_server(crash_logger_invoke(int, siginfo_t*, 
> void*)+0x8e)[0x4abf4e]
> /lib64/libpthread.so.0(+0x10430)[0x2abaac562430]
> /usr/bin/traffic_server(HttpSM::handle_server_setup_error(int, 
> void*)+0x25b)[0x5b5d0b]
> /usr/bin/traffic_server(HttpSM::state_send_server_request_header(int, 
> void*)+0x142)[0x5c0dd2]
> /usr/bin/traffic_server(HttpSM::main_handler(int, void*)+0xc8)[0x5c5e38]
> /usr/bin/traffic_server(UnixNetVConnection::mainEvent(int, 
> Event*)+0x4ff)[0x78651f]
> /usr/bin/traffic_server(InactivityCop::check_inactivity(int, 
> Event*)+0x28d)[0x7789ad]
> /usr/bin/traffic_server(EThread::process_event(Event*, int)+0x8a)[0x7bdf5a]
> /usr/bin/traffic_server(EThread::execute()+0xaa5)[0x7bf045]
> /usr/bin/traffic_server[0x7bda25]
> /lib64/libpthread.so.0(+0x7555)[0x2abaac559555]
> /lib64/libc.so.6(clone+0x6d)[0x2abaad67bb9d]
> Dec 05 21:00:12 localhost sendEmail[23289]: Email was sent successfully!
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TS-4055) Coredump when closing a transaction with a stalled connection to the origin

2016-01-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092275#comment-15092275
 ] 

Luca Bruno edited comment on TS-4055 at 1/11/16 5:02 PM:
-

I'm getting a similar crash with ats 6.0.0 under stress test, but the stack 
trace is different. Note that I'm not using traffic_cop, but directly 
traffic_manager:
{noformat}
traffic_server: Segmentation fault (Address not mapped to object [0x8])
traffic_server - STACK TRACE:
/usr/bin/traffic_server(_Z19crash_logger_invokeiP9siginfo_tPv+0xa7)[0x2b1869771db7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x2b186c0538d0]
/usr/bin/traffic_server(_ZN10NetHandler9_close_vcEP18UnixNetVConnectionlRiS2_S2_S2_+0x2a8)[0x2b18699b8e58]
/usr/bin/traffic_server(_ZN10NetHandler19manage_active_queueEv+0xd3)[0x2b18699b9143]
/usr/bin/traffic_server(_ZN10NetHandler19add_to_active_queueEP18UnixNetVConnection+0x2a)[0x2b18699b91fa]
/usr/bin/traffic_server(_ZN17HttpClientSession15new_transactionEv+0x125)[0x2b1869820795]
/usr/bin/traffic_server(_ZN18ProxyClientSession17state_api_calloutEiPv+0x33)[0x2b18697ac1b3]
/usr/bin/traffic_server(_ZN17HttpClientSession14new_connectionEP14NetVConnectionP9MIOBufferP14IOBufferReaderb+0x234)[0x2b186981fd04]
/usr/bin/traffic_server(_ZN17HttpSessionAccept6acceptEP14NetVConnectionP9MIOBufferP14IOBufferReader+0x23d)[0x2b186981b3bd]
/usr/bin/traffic_server(_ZN23ProtocolProbeTrampoline17ioCompletionEventEiPv+0x2da)[0x2b18697abeaa]
/usr/bin/traffic_server(+0x3235d8)[0x2b18699c45d8]
/usr/bin/traffic_server(_ZN10NetHandler12mainNetEventEiP5Event+0x29a)[0x2b18699b7a7a]
/usr/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x150)[0x2b18699e7550]
/usr/bin/traffic_server(_ZN7EThread7executeEv+0x67e)[0x2b18699e80ee]
/usr/bin/traffic_server(+0x345ff6)[0x2b18699e6ff6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x2b186c04c0a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x2b186d0d404d]
{noformat}

Is it the same or should I file a bug? If it's the same issue, which commits 
should I apply on top of 6.0.0 to test? Or which commit should I try?
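
(If someone can point at the relevant commits, a rough way to test them on top of 6.0.0 might look like the sketch below; the tag name and the <commit-sha> placeholders are assumptions, not actual fix commits.)
{noformat}
git clone https://github.com/apache/trafficserver.git
cd trafficserver
git checkout -b test-6.0.0-fixes 6.0.0
git cherry-pick <commit-sha>     # repeat for each suggested commit
autoreconf -if && ./configure && make
{noformat}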


was (Author: lethalman):
I'm getting a similar crash with ats 6.0.0, but the stack trace is different. 
Note that I'm not using traffic_cop, but directly traffic_manager:
{noformat}
traffic_server: Segmentation fault (Address not mapped to object [0x8])
traffic_server - STACK TRACE:
/usr/bin/traffic_server(_Z19crash_logger_invokeiP9siginfo_tPv+0xa7)[0x2b1869771db7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x2b186c0538d0]
/usr/bin/traffic_server(_ZN10NetHandler9_close_vcEP18UnixNetVConnectionlRiS2_S2_S2_+0x2a8)[0x2b18699b8e58]
/usr/bin/traffic_server(_ZN10NetHandler19manage_active_queueEv+0xd3)[0x2b18699b9143]
/usr/bin/traffic_server(_ZN10NetHandler19add_to_active_queueEP18UnixNetVConnection+0x2a)[0x2b18699b91fa]
/usr/bin/traffic_server(_ZN17HttpClientSession15new_transactionEv+0x125)[0x2b1869820795]
/usr/bin/traffic_server(_ZN18ProxyClientSession17state_api_calloutEiPv+0x33)[0x2b18697ac1b3]
/usr/bin/traffic_server(_ZN17HttpClientSession14new_connectionEP14NetVConnectionP9MIOBufferP14IOBufferReaderb+0x234)[0x2b186981fd04]
/usr/bin/traffic_server(_ZN17HttpSessionAccept6acceptEP14NetVConnectionP9MIOBufferP14IOBufferReader+0x23d)[0x2b186981b3bd]
/usr/bin/traffic_server(_ZN23ProtocolProbeTrampoline17ioCompletionEventEiPv+0x2da)[0x2b18697abeaa]
/usr/bin/traffic_server(+0x3235d8)[0x2b18699c45d8]
/usr/bin/traffic_server(_ZN10NetHandler12mainNetEventEiP5Event+0x29a)[0x2b18699b7a7a]
/usr/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x150)[0x2b18699e7550]
/usr/bin/traffic_server(_ZN7EThread7executeEv+0x67e)[0x2b18699e80ee]
/usr/bin/traffic_server(+0x345ff6)[0x2b18699e6ff6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x2b186c04c0a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x2b186d0d404d]
{noformat}

Is it the same or should I file a bug? If it's the same issue, which commits 
should I apply on top of 6.0.0 to test? Or which commit should I try?


[jira] [Comment Edited] (TS-4055) Coredump when closing a transaction with a stalled connection to the origin

2016-01-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-4055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092275#comment-15092275
 ] 

Luca Bruno edited comment on TS-4055 at 1/11/16 5:00 PM:
-

I'm getting a similar crash with ats 6.0.0, but the stack trace is different. 
Note that I'm not using traffic_cop, but directly traffic_manager:
{noformat}
traffic_server: Segmentation fault (Address not mapped to object [0x8])
traffic_server - STACK TRACE:
/usr/bin/traffic_server(_Z19crash_logger_invokeiP9siginfo_tPv+0xa7)[0x2b1869771db7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x2b186c0538d0]
/usr/bin/traffic_server(_ZN10NetHandler9_close_vcEP18UnixNetVConnectionlRiS2_S2_S2_+0x2a8)[0x2b18699b8e58]
/usr/bin/traffic_server(_ZN10NetHandler19manage_active_queueEv+0xd3)[0x2b18699b9143]
/usr/bin/traffic_server(_ZN10NetHandler19add_to_active_queueEP18UnixNetVConnection+0x2a)[0x2b18699b91fa]
/usr/bin/traffic_server(_ZN17HttpClientSession15new_transactionEv+0x125)[0x2b1869820795]
/usr/bin/traffic_server(_ZN18ProxyClientSession17state_api_calloutEiPv+0x33)[0x2b18697ac1b3]
/usr/bin/traffic_server(_ZN17HttpClientSession14new_connectionEP14NetVConnectionP9MIOBufferP14IOBufferReaderb+0x234)[0x2b186981fd04]
/usr/bin/traffic_server(_ZN17HttpSessionAccept6acceptEP14NetVConnectionP9MIOBufferP14IOBufferReader+0x23d)[0x2b186981b3bd]
/usr/bin/traffic_server(_ZN23ProtocolProbeTrampoline17ioCompletionEventEiPv+0x2da)[0x2b18697abeaa]
/usr/bin/traffic_server(+0x3235d8)[0x2b18699c45d8]
/usr/bin/traffic_server(_ZN10NetHandler12mainNetEventEiP5Event+0x29a)[0x2b18699b7a7a]
/usr/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x150)[0x2b18699e7550]
/usr/bin/traffic_server(_ZN7EThread7executeEv+0x67e)[0x2b18699e80ee]
/usr/bin/traffic_server(+0x345ff6)[0x2b18699e6ff6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x2b186c04c0a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x2b186d0d404d]
{noformat}

Is it the same or should I file a bug? If it's the same issue, which commits 
should I apply on top of 6.0.0 to test? Or which commit should I try?


was (Author: lethalman):
I'm getting a similar crash with ats 6.0.0, but the stack trace is different. 
Note that I'm not using traffic_cop, but directly traffic_manager:
{quote}
traffic_server: Segmentation fault (Address not mapped to object [0x8])
traffic_server - STACK TRACE:
/usr/bin/traffic_server(_Z19crash_logger_invokeiP9siginfo_tPv+0xa7)[0x2b1869771db7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x2b186c0538d0]
/usr/bin/traffic_server(_ZN10NetHandler9_close_vcEP18UnixNetVConnectionlRiS2_S2_S2_+0x2a8)[0x2b18699b8e58]
/usr/bin/traffic_server(_ZN10NetHandler19manage_active_queueEv+0xd3)[0x2b18699b9143]
/usr/bin/traffic_server(_ZN10NetHandler19add_to_active_queueEP18UnixNetVConnection+0x2a)[0x2b18699b91fa]
/usr/bin/traffic_server(_ZN17HttpClientSession15new_transactionEv+0x125)[0x2b1869820795]
/usr/bin/traffic_server(_ZN18ProxyClientSession17state_api_calloutEiPv+0x33)[0x2b18697ac1b3]
/usr/bin/traffic_server(_ZN17HttpClientSession14new_connectionEP14NetVConnectionP9MIOBufferP14IOBufferReaderb+0x234)[0x2b186981fd04]
/usr/bin/traffic_server(_ZN17HttpSessionAccept6acceptEP14NetVConnectionP9MIOBufferP14IOBufferReader+0x23d)[0x2b186981b3bd]
/usr/bin/traffic_server(_ZN23ProtocolProbeTrampoline17ioCompletionEventEiPv+0x2da)[0x2b18697abeaa]
/usr/bin/traffic_server(+0x3235d8)[0x2b18699c45d8]
/usr/bin/traffic_server(_ZN10NetHandler12mainNetEventEiP5Event+0x29a)[0x2b18699b7a7a]
/usr/bin/traffic_server(_ZN7EThread13process_eventEP5Eventi+0x150)[0x2b18699e7550]
/usr/bin/traffic_server(_ZN7EThread7executeEv+0x67e)[0x2b18699e80ee]
/usr/bin/traffic_server(+0x345ff6)[0x2b18699e6ff6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x2b186c04c0a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x2b186d0d404d]
{quote}
Is it the same or should I file a bug? If it's the same issue, which commits 
should I apply on top of 6.0.0 to test? Or which commit should I try?


[jira] [Comment Edited] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-23 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375892#comment-14375892
 ] 

Luca Bruno edited comment on TS-3455 at 3/23/15 1:38 PM:
-

Just tried with and without read_while_writer, same abort. The remap is roughly as follows:
{noformat}
map http://media.mycache.com/ http://media.origin.com/
{noformat}

media.origin.com is looked up via a DNS server configured in resolv.conf.
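
(For context, read_while_writer is toggled in records.config; a minimal sketch, with the value shown purely for illustration rather than taken from this setup:)
{noformat}
# 1 = enabled, 0 = disabled
CONFIG proxy.config.cache.enable_read_while_writer INT 1
{noformat}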


was (Author: lethalman):
Just tried with and without read_while_writer, same abort.

 Marking the cache STALE in lookup-complete causes abort()
 -

 Key: TS-3455
 URL: https://issues.apache.org/jira/browse/TS-3455
 Project: Traffic Server
  Issue Type: Bug
Reporter: Luca Bruno
 Fix For: 6.0.0


 I've written a simple test case plugin for demonstrating this problem, not 
 sure if it's a problem on my side, but that would also mean that the regex 
 invalidate plugin would also abort().
 What the plugin does: in LOOKUP_COMPLETE, if the cache status is FRESH then 
 set it to STALE.
 To reproduce:
 1) Send a first cacheable request to ATS, which gets cached.
 2) Request the same URL again; the plugin triggers and sets the cache status to STALE. Then ATS does abort().
 Plugin code:
 {noformat}
 #include <ts/ts.h>
 #include <ts/remap.h>
 #include <ts/experimental.h>
 #include <stdlib.h>
 #include <stdio.h>
 #include <getopt.h>
 #include <string.h>
 #include <string>
 #include <iterator>
 #include <map>
 const char PLUGIN_NAME[] = "maybebug";
 static int Handler(TSCont cont, TSEvent event, void *edata);
 struct PluginState {
   PluginState()
   {
     cont = TSContCreate(Handler, NULL);
     TSContDataSet(cont, this);
   }
   ~PluginState()
   {
     TSContDestroy(cont);
   }
   TSCont cont;
 };
 static int Handler(TSCont cont, TSEvent event, void* edata) {
   TSHttpTxn txn = (TSHttpTxn)edata;
   if (event == TS_EVENT_HTTP_CACHE_LOOKUP_COMPLETE) {
     // If the cached copy is FRESH, force it to STALE.
     int lookup_status;
     if (TS_SUCCESS == TSHttpTxnCacheLookupStatusGet(txn, &lookup_status)) {
       TSDebug(PLUGIN_NAME, "lookup complete: %d", lookup_status);
       if (lookup_status == TS_CACHE_LOOKUP_HIT_FRESH) {
         TSDebug(PLUGIN_NAME, "set stale");
         TSHttpTxnCacheLookupStatusSet(txn, TS_CACHE_LOOKUP_HIT_STALE);
       }
     }
   }
   TSHttpTxnReenable(txn, TS_EVENT_HTTP_CONTINUE);
   return TS_EVENT_NONE;
 }
 void TSPluginInit(int argc, const char *argv[]) {
   TSPluginRegistrationInfo info;
   info.plugin_name = strdup("cappello");
   info.vendor_name = strdup("foo");
   info.support_email = strdup("f...@bar.com");
   if (TSPluginRegister(TS_SDK_VERSION_3_0, &info) != TS_SUCCESS) {
     TSError("Plugin registration failed");
   }
   PluginState* state = new PluginState();
   TSHttpHookAdd(TS_HTTP_CACHE_LOOKUP_COMPLETE_HOOK, state->cont);
 }
 {noformat}
 Output:
 {noformat}
 [Mar 19 18:40:36.254] Server {0x7f6df0b4f740} DIAG: (maybebug) lookup 
 complete: 0
 [Mar 19 18:40:40.854] Server {0x7f6decfee700} DIAG: (maybebug) lookup 
 complete: 2
 [Mar 19 18:40:40.854] Server {0x7f6decfee700} DIAG: (maybebug) set stale
 FATAL: HttpTransact.cc:433: failed assert `s->pending_work == NULL`
 traffic_server - STACK TRACE: 
 /usr/local/lib/libtsutil.so.5(ink_fatal+0xa3)[0x7f6df072186d]
 /usr/local/lib/libtsutil.so.5(_Z12ink_get_randv+0x0)[0x7f6df071f3a0]
 traffic_server[0x60d0aa]
 traffic_server(_ZN12HttpTransact22HandleCacheOpenReadHitEPNS_5StateE+0xf82)[0x619206]
 ...
 {noformat}
 What happens in gdb is that HandleCacheOpenReadHit is called twice in the 
 same request. The first time s->pending_work is NULL, the second time it's 
 not NULL.
 The patch below fixes the problem:
 {noformat}
 diff --git a/proxy/http/HttpTransact.cc b/proxy/http/HttpTransact.cc
 index 0078ef1..852f285 100644
 --- a/proxy/http/HttpTransact.cc
 +++ b/proxy/http/HttpTransact.cc
 @@ -2641,11 +2641,6 @@ HttpTransact::HandleCacheOpenReadHit(State* s)
    //ink_release_assert(s->current.request_to == PARENT_PROXY ||
    //s->http_config_param->no_dns_forward_to_parent != 0);
 
 -  // Set ourselves up to handle pending revalidate issues
 -  //  after the PP DNS lookup
 -  ink_assert(s->pending_work == NULL);
 -  s->pending_work = issue_revalidate;
 -
    // We must be going a PARENT PROXY since so did
    //  origin server DNS lookup right after state Start
    //
 @@ -2654,6 +2649,11 @@ HttpTransact::HandleCacheOpenReadHit(State* s)
    //  missing ip but we won't take down the system
    //
    if (s->current.request_to == PARENT_PROXY) {
 +    // Set ourselves up to handle pending revalidate issues
 +    //  after the PP DNS lookup
 +    ink_assert(s->pending_work == NULL);
 +    s->pending_work = issue_revalidate;
 +
       TRANSACT_RETURN(SM_ACTION_DNS_LOOKUP, 

[jira] [Commented] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-23 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375892#comment-14375892
 ] 

Luca Bruno commented on TS-3455:


Just tried with and without read_while_writer, same abort.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-23 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375998#comment-14375998
 ] 

Luca Bruno commented on TS-3455:


So I think the relevant line is: 
https://github.com/apache/trafficserver/blob/95cd99da5d161fc2419584a0e40329f48e55e732/proxy/http/HttpTransact.cc#L1786

ATS does a cache lookup -> OS DNS lookup -> cache lookup again. What could be 
the code path by which my setup reaches that point, while you can't reproduce 
it?
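
(A minimal gdb sketch for catching the second entry into HandleCacheOpenReadHit; the binary path is an assumption:)
{noformat}
gdb /usr/local/bin/traffic_server
(gdb) break HttpTransact::HandleCacheOpenReadHit
(gdb) run
# on the second stop for the same request, inspect the transaction state:
(gdb) print s->pending_work
(gdb) backtrace
{noformat}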


[jira] [Comment Edited] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-23 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375998#comment-14375998
 ] 

Luca Bruno edited comment on TS-3455 at 3/23/15 2:56 PM:
-

So I think the relevant line is: 
https://github.com/apache/trafficserver/blob/95cd99da5d161fc2419584a0e40329f48e55e732/proxy/http/HttpTransact.cc#L1786

ATS does a cache lookup -> OS DNS lookup -> cache lookup again (I think it 
would loop if it weren't for that assertion). What could be the code path by 
which my setup reaches that point, while you can't reproduce it?


was (Author: lethalman):
So I think the relevant line is: 
https://github.com/apache/trafficserver/blob/95cd99da5d161fc2419584a0e40329f48e55e732/proxy/http/HttpTransact.cc#L1786

ATS does a cache lookup -> OS DNS lookup -> cache lookup again. What could be 
the code path by which my setup reaches that point, while you can't reproduce 
it?


[jira] [Commented] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-23 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376266#comment-14376266
 ] 

Luca Bruno commented on TS-3455:


Yes confirming that with --disable-debug I don't see the abort.
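
(For anyone else reproducing this: ink_assert() appears to be active only in debug builds, while ink_release_assert() always fires, which would explain why a non-debug build skips this abort. A rough build sketch, flags assumed, see ./configure --help:)
{noformat}
autoreconf -if
./configure --disable-debug
make && sudo make install
{noformat}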




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371119#comment-14371119
 ] 

Luca Bruno commented on TS-3455:


So I've tried the regex_revalidate plugin, which does something similar, and I 
get exactly the same abort() as with my plugin.
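
(For reference, a minimal regex_revalidate setup that exercises the same STALE-on-lookup path might look like the sketch below; the rule, the expiry epoch, and the exact option name are assumptions to be checked against the plugin's docs.)
{noformat}
# plugin.config
regex_revalidate.so --config regex_revalidate.config

# regex_revalidate.config: <url-regex> <expiry-epoch>
http://media.mycache.com/.* 1427500000
{noformat}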




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371223#comment-14371223
 ] 

Luca Bruno commented on TS-3455:


[~zwoop] the question is also why OpenReadHit is called twice in the same 
request. If the content is STALE, why should it open the cache content again?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371439#comment-14371439
 ] 

Luca Bruno commented on TS-3455:


ATS 4.2.3 also aborts in the same way with regex_revalidate. I have the plain 
default configuration of ATS; I only changed plugin.config and added simple 
remap rules.




--
This message was sent by Atlassian JIRA

[jira] [Commented] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371620#comment-14371620
 ] 

Luca Bruno commented on TS-3455:


Relevant debug output: hit the cache, call the plugin which sets STALE (see the 
"set stale" message), then an OS DNS lookup with next action 
HandleCacheOpenReadHit (a records.config sketch for enabling these debug tags follows the log):

{noformat}
...
[Mar 20 17:23:45.420] Server {0x7fe58c8cf740} DEBUG: (http_match) 
[..._document_freshness] document is fresh; returning FRESHNESS_FRESH
[Mar 20 17:23:45.420] Server {0x7fe58c8cf740} DEBUG: (http_seq) 
[HttpTransact::HandleCacheOpenReadHitFreshness] Fresh copy
[Mar 20 17:23:45.420] Server {0x7fe58c8cf740} DEBUG: (http_seq) 
[HttpTransact::HandleCacheOpenReadHit] Authentication not needed
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DEBUG: (http_trans) Next action 
SM_ACTION_API_CACHE_LOOKUP_COMPLETE; HttpTransact::HandleCacheOpenReadHit
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DEBUG: (http) [1] State 
Transition: SM_ACTION_API_READ_CACHE_HDR - SM_ACTION_API_CACHE_LOOKUP_COMPLETE
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DEBUG: (http) [1] calling plugin 
on hook TS_HTTP_CACHE_LOOKUP_COMPLETE_HOOK at hook 0x7fe580011240
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DIAG: (maybebug) lookup complete: 
2
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DIAG: (maybebug) set stale
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DEBUG: (http) [1] 
[HttpSM::state_api_callback, HTTP_API_CONTINUE]
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DEBUG: (http) [1] 
[HttpSM::state_api_callout, HTTP_API_CONTINUE]
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DEBUG: (http_seq) 
[HttpTransact::HandleCacheOpenReadHit] Authentication not needed
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DEBUG: (http_trans) CacheOpenRead 
--- needs_auth  = 0
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DEBUG: (http_trans) CacheOpenRead 
--- needs_revalidate= 1
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DEBUG: (http_trans) CacheOpenRead 
--- response_returnable = 1
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DEBUG: (http_trans) CacheOpenRead 
--- needs_cache_auth= 0
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DEBUG: (http_trans) CacheOpenRead 
--- send_revalidate= 1
[Mar 20 17:23:45.421] Server {0x7fe58c8cf740} DEBUG: (http_trans) CacheOpenRead 
--- HIT-STALE
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_seq) 
[HttpTransact::HandleCacheOpenReadHit] Revalidate document with server
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_trans) Next action 
SM_ACTION_DNS_LOOKUP; OSDNSLookup
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http) [1] State 
Transition: SM_ACTION_API_CACHE_LOOKUP_COMPLETE - SM_ACTION_DNS_LOOKUP
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_seq) 
[HttpSM::do_hostdb_lookup] Doing DNS Lookup
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_trans) 
[ink_cluster_time] local: 1426868625, highest_delta: 0, cluster: 1426868625
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_trans) 
[HttpTransact::OSDNSLookup] This was attempt 1
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_seq) 
[HttpTransact::OSDNSLookup] DNS Lookup successful
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_trans) [OSDNSLookup] 
DNS lookup for O.S. successful IP: xx.xx.xx.xx
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_trans) Next action 
SM_ACTION_API_OS_DNS; HandleCacheOpenReadHit
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http) [1] State 
Transition: SM_ACTION_DNS_LOOKUP - SM_ACTION_API_OS_DNS
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_seq) 
[HttpTransact::HandleCacheOpenReadHit] Authentication not needed
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_trans) CacheOpenRead 
--- needs_auth  = 0
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_trans) CacheOpenRead 
--- needs_revalidate= 1
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_trans) CacheOpenRead 
--- response_returnable = 1
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_trans) CacheOpenRead 
--- needs_cache_auth= 0
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_trans) CacheOpenRead 
--- send_revalidate= 1
[Mar 20 17:23:45.422] Server {0x7fe58c8cf740} DEBUG: (http_trans) CacheOpenRead 
--- HIT-STALE
[Mar 20 17:23:45.423] Server {0x7fe58c8cf740} DEBUG: (http_seq) 
[HttpTransact::HandleCacheOpenReadHit] Revalidate document with server
...
{noformat}
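
(The debug tags seen above can be enabled with settings along these lines in records.config; the tag list is a regex-style pattern chosen to match the tags appearing in the log:)
{noformat}
CONFIG proxy.config.diags.debug.enabled INT 1
CONFIG proxy.config.diags.debug.tags STRING http.*|maybebug
{noformat}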


[jira] [Created] (TS-3455) Renaming Cache-Control in read-response and marking the cache STALE in lookup-complete causes abort()

2015-03-19 Thread Luca Bruno (JIRA)
Luca Bruno created TS-3455:
--

 Summary: Renaming Cache-Control in read-response and marking the 
cache STALE in lookup-complete causes abort()
 Key: TS-3455
 URL: https://issues.apache.org/jira/browse/TS-3455
 Project: Traffic Server
  Issue Type: Bug
Reporter: Luca Bruno


I've written a simple test case plugin for demonstrating this problem, not sure 
if it's a problem on my side.

What the plugin does:

1) In READ_RESPONSE_HDR rename Cache-Control to X-Plugin-Cache-Control
2) In LOOKUP_COMPLETE, if it's a FRESH hit set the status to be STALE instead

So after the cache has been enabled, do a first request: this will be cached 
because there's no Cache-Control header.
Do a second request; now ATS will abort().

The issue requires both that Cache-Control is renamed and that the cache is set 
to stale; either change alone doesn't trigger it.

Plugin code:

{noformat}
#include <ts/ts.h>
#include <ts/remap.h>
#include <ts/experimental.h>
#include <stdlib.h>
#include <stdio.h>
#include <getopt.h>
#include <string.h>
#include <string>
#include <iterator>
#include <map>


// Constants and some declarations
const char PLUGIN_NAME[] = "maybebug";
static int Handler(TSCont cont, TSEvent event, void *edata);

struct PluginState {
  PluginState()
  {
    cont = TSContCreate(Handler, NULL);
    TSContDataSet(cont, this);
  }

  ~PluginState()
  {
    TSContDestroy(cont);
  }

  TSCont cont;
};

static int Handler(TSCont cont, TSEvent event, void* edata) {
  TSHttpTxn txn = (TSHttpTxn)edata;
  TSMBuffer mbuf;
  TSMLoc hdrp;

  if (event == TS_EVENT_HTTP_READ_RESPONSE_HDR) {
    // Rename Cache-Control to X-Plugin-Cache-Control in the origin response.
    TSDebug(PLUGIN_NAME, "read response");
    if (TS_SUCCESS == TSHttpTxnServerRespGet(txn, &mbuf, &hdrp)) {
      TSMLoc fieldp = TSMimeHdrFieldFind(mbuf, hdrp, "Cache-Control",
                                         sizeof("Cache-Control") - 1);
      if (fieldp != TS_NULL_MLOC) {
        TSDebug(PLUGIN_NAME, "rename cache-control");
        TSMimeHdrFieldNameSet(mbuf, hdrp, fieldp, "X-Plugin-Cache-Control",
                              sizeof("X-Plugin-Cache-Control") - 1);
        TSHandleMLocRelease(mbuf, hdrp, fieldp);
      }
      TSHandleMLocRelease(mbuf, TS_NULL_MLOC, hdrp);
    }
  } else if (event == TS_EVENT_HTTP_CACHE_LOOKUP_COMPLETE) {
    // If the cached copy is FRESH and carries the renamed header, force it to STALE.
    int lookup_status;
    if (TS_SUCCESS != TSHttpTxnCacheLookupStatusGet(txn, &lookup_status)) {
      goto beach;
    }
    TSDebug(PLUGIN_NAME, "lookup complete: %d", lookup_status);

    if (lookup_status == TS_CACHE_LOOKUP_HIT_FRESH &&
        TSHttpTxnCachedRespGet(txn, &mbuf, &hdrp) == TS_SUCCESS) {
      TSMLoc fieldp = TSMimeHdrFieldFind(mbuf, hdrp, "X-Plugin-Cache-Control",
                                         sizeof("X-Plugin-Cache-Control") - 1);
      if (fieldp) {
        TSDebug(PLUGIN_NAME, "set stale");
        TSHandleMLocRelease(mbuf, hdrp, fieldp);
        TSHttpTxnCacheLookupStatusSet(txn, TS_CACHE_LOOKUP_HIT_STALE);
      }

      TSHandleMLocRelease(mbuf, TS_NULL_MLOC, hdrp);
    }
  }

beach:
  TSHttpTxnReenable(txn, TS_EVENT_HTTP_CONTINUE);
  return TS_EVENT_NONE;
}


TSReturnCode TSRemapInit(TSRemapInterface* api, char* errbuf, int bufsz) {
  return TS_SUCCESS;
}


TSReturnCode TSRemapNewInstance(int argc, char* argv[], void** instance,
                                char* errbuf, int errbuf_size) {
  TSDebug(PLUGIN_NAME, "new instance");

  PluginState* es = new PluginState();

  *instance = es;

  return TS_SUCCESS;
}


void TSRemapDeleteInstance(void* instance) {
  delete static_cast<PluginState*>(instance);
}


TSRemapStatus TSRemapDoRemap(void* instance, TSHttpTxn txn, TSRemapRequestInfo* rri) {
  PluginState* es = static_cast<PluginState*>(instance);

  TSHttpTxnHookAdd(txn, TS_HTTP_READ_RESPONSE_HDR_HOOK, es->cont);
  TSHttpTxnHookAdd(txn, TS_HTTP_CACHE_LOOKUP_COMPLETE_HOOK, es->cont);
  return TSREMAP_NO_REMAP;
}
{noformat}

Stacktrace from gdb:

{noformat}
#1  0x754e53e0 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x77bb96fc in ink_die_die_die (retval=1) at ink_error.cc:43
#3  0x77bb97c2 in ink_fatal_va(int, const char *, typedef __va_list_tag 
__va_list_tag *) (return_code=1, fmt=0x77bca490 "%s:%d: failed assert `%s`", 
ap=0x742813d8)
at ink_error.cc:67
#4  0x77bb986d in ink_fatal (return_code=1, 
message_format=0x77bca490 "%s:%d: failed assert `%s`") at ink_error.cc:75
#5  0x77bb73a0 in _ink_assert (expression=0x7c6883 "s->pending_work == 
NULL", file=0x7c66cc "HttpTransact.cc", line=433) at ink_assert.cc:37
#6  0x0060d0aa in how_to_open_connection (s=0x7fffdacac9d0) at 
HttpTransact.cc:433
#7  0x00619206 in HttpTransact::HandleCacheOpenReadHit 
(s=0x7fffdacac9d0) at HttpTransact.cc:2684
#8  0x005ffd21 in HttpSM::call_transact_and_set_next_state 
(this=0x7fffdacac960, f=0) at HttpSM.cc:6954
#9  0x005ebc20 in HttpSM::handle_api_return (this=0x7fffdacac960) at 
HttpSM.cc:1532
#10 0x005eb9d5 in HttpSM::state_api_callout (this=0x7fffdacac960, 
event=0, data=0x0) at HttpSM.cc:1464
#11 0x005f8bcb in HttpSM::do_api_callout_internal 

[jira] [Updated] (TS-3455) Renaming Cache-Control in read-response and marking the cache STALE in lookup-complete causes abort()

2015-03-19 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-3455:
---
Description: 
I've written a simple test-case plugin to demonstrate this problem; I'm not sure 
whether the problem is on my side.

What the plugin does:

1) In READ_RESPONSE_HDR, rename Cache-Control to X-Plugin-Cache-Control
2) In LOOKUP_COMPLETE, if the lookup is a FRESH hit, set the status to STALE instead

So, with the cache enabled, do a first request: the response is cached because 
there's no Cache-Control header.
Do a second request: ATS now calls abort().

The issue requires both parts, renaming Cache-Control and marking the cache 
lookup stale; either one alone doesn't trigger it.

Plugin code:

{noformat}
#include <ts/ts.h>
#include <ts/remap.h>
#include <ts/experimental.h>
#include <stdlib.h>
#include <stdio.h>
#include <getopt.h>
#include <string.h>
#include <string>
#include <iterator>
#include <map>


// Constants and some declarations
const char PLUGIN_NAME[] = "maybebug";
static int Handler(TSCont cont, TSEvent event, void *edata);

struct PluginState {
  PluginState()
  {
    cont = TSContCreate(Handler, NULL);
    TSContDataSet(cont, this);
  }

  ~PluginState()
  {
    TSContDestroy(cont);
  }

  TSCont cont;
};

static int Handler(TSCont cont, TSEvent event, void* edata) {
  TSHttpTxn txn = (TSHttpTxn)edata;
  TSMBuffer mbuf;
  TSMLoc hdrp;

  if (event == TS_EVENT_HTTP_READ_RESPONSE_HDR) {
    TSDebug(PLUGIN_NAME, "read response");
    if (TS_SUCCESS == TSHttpTxnServerRespGet(txn, &mbuf, &hdrp)) {
      TSMLoc fieldp = TSMimeHdrFieldFind(mbuf, hdrp, "Cache-Control", sizeof("Cache-Control")-1);
      if (fieldp != TS_NULL_MLOC) {
        TSDebug(PLUGIN_NAME, "rename cache-control");
        TSMimeHdrFieldNameSet(mbuf, hdrp, fieldp, "X-Plugin-Cache-Control", sizeof("X-Plugin-Cache-Control")-1);
        TSHandleMLocRelease(mbuf, hdrp, fieldp);
      }
      TSHandleMLocRelease(mbuf, TS_NULL_MLOC, hdrp);
    }
  } else if (event == TS_EVENT_HTTP_CACHE_LOOKUP_COMPLETE) {
    int lookup_status;
    if (TS_SUCCESS != TSHttpTxnCacheLookupStatusGet(txn, &lookup_status)) {
      goto beach;
    }
    TSDebug(PLUGIN_NAME, "lookup complete: %d", lookup_status);

    if (lookup_status == TS_CACHE_LOOKUP_HIT_FRESH &&
        TSHttpTxnCachedRespGet(txn, &mbuf, &hdrp) == TS_SUCCESS) {
      TSMLoc fieldp = TSMimeHdrFieldFind(mbuf, hdrp, "X-Plugin-Cache-Control", sizeof("X-Plugin-Cache-Control")-1);
      if (fieldp) {
        TSDebug(PLUGIN_NAME, "set stale");
        TSHandleMLocRelease(mbuf, hdrp, fieldp);
        TSHttpTxnCacheLookupStatusSet(txn, TS_CACHE_LOOKUP_HIT_STALE);
      }

      TSHandleMLocRelease(mbuf, TS_NULL_MLOC, hdrp);
    }
  }

beach:
  TSHttpTxnReenable(txn, TS_EVENT_HTTP_CONTINUE);
  return TS_EVENT_NONE;
}


TSReturnCode TSRemapInit(TSRemapInterface* api, char* errbuf, int bufsz) {
  return TS_SUCCESS;
}


TSReturnCode TSRemapNewInstance(int argc, char* argv[], void** instance, char* errbuf, int errbuf_size) {
  TSDebug(PLUGIN_NAME, "new instance");

  PluginState* es = new PluginState();

  *instance = es;

  return TS_SUCCESS;
}


void TSRemapDeleteInstance(void* instance) {
  delete static_cast<PluginState*>(instance);
}


TSRemapStatus TSRemapDoRemap(void* instance, TSHttpTxn txn, TSRemapRequestInfo* rri) {
  PluginState* es = static_cast<PluginState*>(instance);

  TSHttpTxnHookAdd(txn, TS_HTTP_READ_RESPONSE_HDR_HOOK, es->cont);
  TSHttpTxnHookAdd(txn, TS_HTTP_CACHE_LOOKUP_COMPLETE_HOOK, es->cont);
  return TSREMAP_NO_REMAP;
}
{noformat}

Stacktrace from gdb:

{noformat}
#1  0x754e53e0 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x77bb96fc in ink_die_die_die (retval=1) at ink_error.cc:43
#3  0x77bb97c2 in ink_fatal_va(int, const char *, typedef __va_list_tag 
__va_list_tag *) (return_code=1, fmt=0x77bca490 "%s:%d: failed assert 
`%s`", ap=0x742813d8)
at ink_error.cc:67
#4  0x77bb986d in ink_fatal (return_code=1, 
message_format=0x77bca490 "%s:%d: failed assert `%s`") at ink_error.cc:75
#5  0x77bb73a0 in _ink_assert (expression=0x7c6883 "s->pending_work == 
NULL", file=0x7c66cc "HttpTransact.cc", line=433) at ink_assert.cc:37
#6  0x0060d0aa in how_to_open_connection (s=0x7fffdacac9d0) at 
HttpTransact.cc:433
#7  0x00619206 in HttpTransact::HandleCacheOpenReadHit 
(s=0x7fffdacac9d0) at HttpTransact.cc:2684
#8  0x005ffd21 in HttpSM::call_transact_and_set_next_state 
(this=0x7fffdacac960, f=0) at HttpSM.cc:6954
#9  0x005ebc20 in HttpSM::handle_api_return (this=0x7fffdacac960) at 
HttpSM.cc:1532
#10 0x005eb9d5 in HttpSM::state_api_callout (this=0x7fffdacac960, 
event=0, data=0x0) at HttpSM.cc:1464
#11 0x005f8bcb in HttpSM::do_api_callout_internal (this=0x7fffdacac960) 
at HttpSM.cc:4912
#12 0x00606999 in HttpSM::do_api_callout (this=0x7fffdacac960) at 
HttpSM.cc:450
#13 0x005ffed0 in HttpSM::set_next_state 

[jira] [Commented] (TS-3455) Renaming Cache-Control in read-response and marking the cache STALE in lookup-complete causes abort()

2015-03-19 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14369712#comment-14369712
 ] 

Luca Bruno commented on TS-3455:


Ok, sorry, the issue as filed is misleading; the abort actually happens as follows:

1) Hook lookup_complete
2) If the cache status is FRESH, set it to STALE

HandleCacheOpenReadHit is then called twice for the same request, and the second 
call fails the assert because s->pending_work is no longer NULL.

So it has nothing to do with the header renaming above. That also means the regex 
invalidate plugin would abort(). I will try with ATS 5.0.
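
For context, the stock regex_revalidate plugin works the same way: it hooks 
CACHE_LOOKUP_COMPLETE and turns FRESH hits into STALE for URLs matching its rules, 
which is why it should trip the same assert. A minimal sketch of how one might 
configure it follows; the file names, the plugin.config flag, and the expiry value 
are from my memory rather than from this ticket, so please verify them against the 
plugin's documentation before relying on them:

{noformat}
# plugin.config -- load regex_revalidate with a rule file (flag name assumed)
regex_revalidate.so --config regex_revalidate.config

# regex_revalidate.config -- mark everything under /foo/ stale until the given
# expiry (assumed rule format: <url-regex> <expiry-epoch>; values are made up)
http://example\.com/foo/.* 1500000000
{noformat}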

 Renaming Cache-Control in read-response and marking the cache STALE in 
 lookup-complete causes abort()
 -

 Key: TS-3455
 URL: https://issues.apache.org/jira/browse/TS-3455
 Project: Traffic Server
  Issue Type: Bug
Reporter: Luca Bruno

 I've written a simple test case plugin for demonstrating this problem, not 
 sure if it's a problem on my side.
 What the plugin does:
 1) In READ_RESPONSE_HDR rename Cache-Control to X-Plugin-Cache-Control
 2) In LOOKUP_COMPLETE, if it's a FRESH hit set the status to be STALE instead
 So after the cache has been enabled, do a first request: this will be cached 
 because there's no Cache-Control header.
 Do a second request, now ATS will abort().
 The issue is related to both the fact that Cache-Control is renamed and the 
 cache set to be stale. Either of the two alone don't trigger the issue.
 Plugin code:
 {noformat}
 #include ts/ts.h
 #include ts/remap.h
 #include ts/experimental.h
 #include stdlib.h
 #include stdio.h
 #include getopt.h
 #include string.h
 #include string
 #include iterator
 #include map
 // Constants and some declarations
 const char PLUGIN_NAME[] = maybebug;
 static int Handler(TSCont cont, TSEvent event, void *edata);
 struct PluginState {
   PluginState()
   {
 cont = TSContCreate(Handler, NULL);
 TSContDataSet(cont, this);
   }
   ~PluginState()
   {
 TSContDestroy(cont);
   }
   TSCont cont;
 };
 static int Handler(TSCont cont, TSEvent event, void* edata) {
   TSHttpTxn txn = (TSHttpTxn)edata;
   TSMBuffer mbuf;
   TSMLoc hdrp;
   if (event == TS_EVENT_HTTP_READ_RESPONSE_HDR) {
 TSDebug(PLUGIN_NAME, read response);
 if (TS_SUCCESS == TSHttpTxnServerRespGet(txn, mbuf, hdrp)) {
   TSMLoc fieldp = TSMimeHdrFieldFind(mbuf, hdrp, Cache-Control, 
 sizeof(Cache-Control)-1);
   if (fieldp != TS_NULL_MLOC) {
 TSDebug(PLUGIN_NAME, rename cache-control);
 TSMimeHdrFieldNameSet(mbuf, hdrp, fieldp, X-Plugin-Cache-Control, 
 sizeof(X-Plugin-Cache-Control)-1);
 TSHandleMLocRelease(mbuf, hdrp, fieldp);
   }
   TSHandleMLocRelease(mbuf, TS_NULL_MLOC, hdrp);
 }
   } else if (event == TS_EVENT_HTTP_CACHE_LOOKUP_COMPLETE) {
 int lookup_status;
 if (TS_SUCCESS != TSHttpTxnCacheLookupStatusGet(txn, lookup_status)) {
   goto beach;
 }
 TSDebug(PLUGIN_NAME, lookup complete: %d, lookup_status);
 if (lookup_status == TS_CACHE_LOOKUP_HIT_FRESH  
 TSHttpTxnCachedRespGet(txn, mbuf, hdrp) == TS_SUCCESS) {
   TSMLoc fieldp = TSMimeHdrFieldFind(mbuf, hdrp, 
 X-Plugin-Cache-Control, sizeof(X-Plugin-Cache-Control)-1);
   if (fieldp) {
 TSDebug(PLUGIN_NAME, set stale);
 TSHandleMLocRelease(mbuf, hdrp, fieldp);
 TSHttpTxnCacheLookupStatusSet(txn, TS_CACHE_LOOKUP_HIT_STALE);
   }
   TSHandleMLocRelease(mbuf, TS_NULL_MLOC, hdrp);
 }
   }
 beach:
   TSHttpTxnReenable(txn, TS_EVENT_HTTP_CONTINUE);
   return TS_EVENT_NONE;
 }
 TSReturnCode TSRemapInit(TSRemapInterface*  api, char* errbuf, int bufsz) {
   return TS_SUCCESS;
 }
 TSReturnCode TSRemapNewInstance(int argc, char* argv[], void** instance, 
 char* errbuf, int errbuf_size) {
   TSDebug(PLUGIN_NAME, new instance);
   PluginState* es = new PluginState();
   *instance = es;
   return TS_SUCCESS;
 }
 void TSRemapDeleteInstance(void* instance) {
   delete static_castPluginState*(instance);
 }
 TSRemapStatus TSRemapDoRemap(void* instance, TSHttpTxn txn, 
 TSRemapRequestInfo* rri) {
   PluginState* es = static_castPluginState*(instance);
   TSHttpTxnHookAdd(txn, TS_HTTP_READ_RESPONSE_HDR_HOOK, es-cont);
   TSHttpTxnHookAdd(txn, TS_HTTP_CACHE_LOOKUP_COMPLETE_HOOK, es-cont);
   return TSREMAP_NO_REMAP;
 }
 {noformat}
 Stacktrace from gdb:
 {noformat}
 #1  0x754e53e0 in abort () from /lib/x86_64-linux-gnu/libc.so.6
 #2  0x77bb96fc in ink_die_die_die (retval=1) at ink_error.cc:43
 #3  0x77bb97c2 in ink_fatal_va(int, const char *, typedef 
 __va_list_tag __va_list_tag *) (return_code=1, fmt=0x77bca490 %s:%d: 
 failed assert `%s`, ap=0x742813d8)
 at ink_error.cc:67
 #4  0x77bb986d in ink_fatal (return_code=1, 
 message_format=0x77bca490 %s:%d: failed assert `%s`) at ink_error.cc:75
 #5  0x77bb73a0 in 

[jira] [Issue Comment Deleted] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-19 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-3455:
---
Comment: was deleted

(was: To clarify. I also tried with header_rewrite:
{noformat}
cond %{READ_RESPONSE_HDR_HOOK}
add-header X-Plugin-Cache-Control %{HEADER:Cache-Control}
rm-header Cache-Control
{noformat}

Then in my plugin I only hooked up the lookup-complete to set the cache to be 
stale. Got the same abort().

Also please note that the tests I do always start with a clean cache.)
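
For the record, "start with a clean cache" here means wiping the disk cache between 
runs. A sketch of how one might do that, assuming the traffic_server -Cclear option 
available in ATS of this era (verify the flag against your version):

{noformat}
# stop Traffic Server, clear the on-disk cache, then start it again
trafficserver stop
traffic_server -Cclear
trafficserver start
{noformat}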

 Marking the cache STALE in lookup-complete causes abort()
 -

 Key: TS-3455
 URL: https://issues.apache.org/jira/browse/TS-3455
 Project: Traffic Server
  Issue Type: Bug
Reporter: Luca Bruno

 I've written a simple test case plugin for demonstrating this problem, not 
 sure if it's a problem on my side, but that would also mean that the regex 
 invalidate plugin would also abort().
 What the plugin does: in LOOKUP_COMPLETE, if the cache status is FRESH then 
 set it to STALE.
 To reproduce:
 1) First cacheable request to ATS gets cached
 2) Request again the same url, the plugin triggers and set the cache to 
 STALE. Then ATS does abort().
 Plugin code:
 {noformat}
 #include ts/ts.h
 #include ts/remap.h
 #include ts/experimental.h
 #include stdlib.h
 #include stdio.h
 #include getopt.h
 #include string.h
 #include string
 #include iterator
 #include map
 const char PLUGIN_NAME[] = maybebug;
 static int Handler(TSCont cont, TSEvent event, void *edata);
 struct PluginState {
   PluginState()
   {
 cont = TSContCreate(Handler, NULL);
 TSContDataSet(cont, this);
   }
   ~PluginState()
   {
 TSContDestroy(cont);
   }
   TSCont cont;
 };
 static int Handler(TSCont cont, TSEvent event, void* edata) {
   TSHttpTxn txn = (TSHttpTxn)edata;
   if (event == TS_EVENT_HTTP_CACHE_LOOKUP_COMPLETE) {
 int lookup_status;
 if (TS_SUCCESS == TSHttpTxnCacheLookupStatusGet(txn, lookup_status)) {
   TSDebug(PLUGIN_NAME, lookup complete: %d, lookup_status);
   if (lookup_status == TS_CACHE_LOOKUP_HIT_FRESH) {
 TSDebug(PLUGIN_NAME, set stale);
 TSHttpTxnCacheLookupStatusSet(txn, TS_CACHE_LOOKUP_HIT_STALE);
   }
 }
   }
   TSHttpTxnReenable(txn, TS_EVENT_HTTP_CONTINUE);
   return TS_EVENT_NONE;
 }
 void TSPluginInit (int argc, const char *argv[]) {
   TSPluginRegistrationInfo info;
   info.plugin_name = strdup(cappello);
   info.vendor_name = strdup(foo);
   info.support_email = strdup(f...@bar.com);
   if (TSPluginRegister(TS_SDK_VERSION_3_0 , info) != TS_SUCCESS) {
 TSError(Plugin registration failed);
   }
   PluginState* state = new PluginState();
   TSHttpHookAdd(TS_HTTP_CACHE_LOOKUP_COMPLETE_HOOK, state-cont);
 }
 {noformat}
 Output:
 {noformat}
 [Mar 19 18:40:36.254] Server {0x7f6df0b4f740} DIAG: (maybebug) lookup 
 complete: 0
 [Mar 19 18:40:40.854] Server {0x7f6decfee700} DIAG: (maybebug) lookup 
 complete: 2
 [Mar 19 18:40:40.854] Server {0x7f6decfee700} DIAG: (maybebug) set stale
 FATAL: HttpTransact.cc:433: failed assert `s-pending_work == NULL`
 traffic_server - STACK TRACE: 
 /usr/local/lib/libtsutil.so.5(ink_fatal+0xa3)[0x7f6df072186d]
 /usr/local/lib/libtsutil.so.5(_Z12ink_get_randv+0x0)[0x7f6df071f3a0]
 traffic_server[0x60d0aa]
 traffic_server(_ZN12HttpTransact22HandleCacheOpenReadHitEPNS_5StateE+0xf82)[0x619206]
 ...
 {noformat}
 What happens in gdb is that HandleCacheOpenReadHit is called twice in the 
 same request. The first time s-pending_work is NULL, the second time it's 
 not NULL.
 The patch below fixes the problem:
 {noformat}
 diff --git a/proxy/http/HttpTransact.cc b/proxy/http/HttpTransact.cc
 index 0078ef1..852f285 100644
 --- a/proxy/http/HttpTransact.cc
 +++ b/proxy/http/HttpTransact.cc
 @@ -2641,11 +2641,6 @@ HttpTransact::HandleCacheOpenReadHit(State* s)
  //ink_release_assert(s-current.request_to == PARENT_PROXY ||
  //s-http_config_param-no_dns_forward_to_parent != 0);
  
 -// Set ourselves up to handle pending revalidate issues
 -//  after the PP DNS lookup
 -ink_assert(s-pending_work == NULL);
 -s-pending_work = issue_revalidate;
 -
  // We must be going a PARENT PROXY since so did
  //  origin server DNS lookup right after state Start
  //
 @@ -2654,6 +2649,11 @@ HttpTransact::HandleCacheOpenReadHit(State* s)
  //  missing ip but we won't take down the system
  //
  if (s-current.request_to == PARENT_PROXY) {
 +  // Set ourselves up to handle pending revalidate issues
 +  //  after the PP DNS lookup
 +  ink_assert(s-pending_work == NULL);
 +  s-pending_work = issue_revalidate;
 +
TRANSACT_RETURN(SM_ACTION_DNS_LOOKUP, PPDNSLookup);
  } else if (s-current.request_to == ORIGIN_SERVER) {

[jira] [Updated] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-19 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-3455:
---
Description: 
I've written a simple test-case plugin to demonstrate this problem. I'm not sure 
whether the problem is on my side, but if it isn't, it would also mean that the 
regex invalidate plugin aborts().

What the plugin does: in LOOKUP_COMPLETE, if the cache status is FRESH, set it 
to STALE.

To reproduce:
1) Send a first cacheable request to ATS; the response gets cached.
2) Request the same url again; the plugin triggers and sets the cache status to 
STALE. ATS then calls abort().

Plugin code:

{noformat}
#include <ts/ts.h>
#include <ts/remap.h>
#include <ts/experimental.h>
#include <stdlib.h>
#include <stdio.h>
#include <getopt.h>
#include <string.h>
#include <string>
#include <iterator>
#include <map>

const char PLUGIN_NAME[] = "maybebug";
static int Handler(TSCont cont, TSEvent event, void *edata);

struct PluginState {
  PluginState()
  {
    cont = TSContCreate(Handler, NULL);
    TSContDataSet(cont, this);
  }

  ~PluginState()
  {
    TSContDestroy(cont);
  }

  TSCont cont;
};

static int Handler(TSCont cont, TSEvent event, void* edata) {
  TSHttpTxn txn = (TSHttpTxn)edata;

  if (event == TS_EVENT_HTTP_CACHE_LOOKUP_COMPLETE) {
    int lookup_status;
    if (TS_SUCCESS == TSHttpTxnCacheLookupStatusGet(txn, &lookup_status)) {
      TSDebug(PLUGIN_NAME, "lookup complete: %d", lookup_status);

      if (lookup_status == TS_CACHE_LOOKUP_HIT_FRESH) {
        TSDebug(PLUGIN_NAME, "set stale");
        TSHttpTxnCacheLookupStatusSet(txn, TS_CACHE_LOOKUP_HIT_STALE);
      }
    }
  }

  TSHttpTxnReenable(txn, TS_EVENT_HTTP_CONTINUE);
  return TS_EVENT_NONE;
}

void TSPluginInit (int argc, const char *argv[]) {
  TSPluginRegistrationInfo info;
  info.plugin_name = strdup("cappello");
  info.vendor_name = strdup("foo");
  info.support_email = strdup("f...@bar.com");

  if (TSPluginRegister(TS_SDK_VERSION_3_0, &info) != TS_SUCCESS) {
    TSError("Plugin registration failed");
  }

  PluginState* state = new PluginState();

  TSHttpHookAdd(TS_HTTP_CACHE_LOOKUP_COMPLETE_HOOK, state->cont);
}
{noformat}

Output:

{noformat}
[Mar 19 18:40:36.254] Server {0x7f6df0b4f740} DIAG: (maybebug) lookup complete: 0
[Mar 19 18:40:40.854] Server {0x7f6decfee700} DIAG: (maybebug) lookup complete: 
2
[Mar 19 18:40:40.854] Server {0x7f6decfee700} DIAG: (maybebug) set stale
FATAL: HttpTransact.cc:433: failed assert `s->pending_work == NULL`
traffic_server - STACK TRACE: 
/usr/local/lib/libtsutil.so.5(ink_fatal+0xa3)[0x7f6df072186d]
/usr/local/lib/libtsutil.so.5(_Z12ink_get_randv+0x0)[0x7f6df071f3a0]
traffic_server[0x60d0aa]
traffic_server(_ZN12HttpTransact22HandleCacheOpenReadHitEPNS_5StateE+0xf82)[0x619206]
...
{noformat}

What happens in gdb is that HandleCacheOpenReadHit is called twice in the same 
request. The first time s->pending_work is NULL, the second time it's not NULL.

The patch below fixes the problem:

{noformat}
diff --git a/proxy/http/HttpTransact.cc b/proxy/http/HttpTransact.cc
index 0078ef1..852f285 100644
--- a/proxy/http/HttpTransact.cc
+++ b/proxy/http/HttpTransact.cc
@@ -2641,11 +2641,6 @@ HttpTransact::HandleCacheOpenReadHit(State* s)
 //ink_release_assert(s->current.request_to == PARENT_PROXY ||
 //s->http_config_param->no_dns_forward_to_parent != 0);
 
-// Set ourselves up to handle pending revalidate issues
-//  after the PP DNS lookup
-ink_assert(s->pending_work == NULL);
-s->pending_work = issue_revalidate;
-
 // We must be going a PARENT PROXY since so did
 //  origin server DNS lookup right after state Start
 //
@@ -2654,6 +2649,11 @@ HttpTransact::HandleCacheOpenReadHit(State* s)
 //  missing ip but we won't take down the system
 //
 if (s->current.request_to == PARENT_PROXY) {
+  // Set ourselves up to handle pending revalidate issues
+  //  after the PP DNS lookup
+  ink_assert(s->pending_work == NULL);
+  s->pending_work = issue_revalidate;
+
   TRANSACT_RETURN(SM_ACTION_DNS_LOOKUP, PPDNSLookup);
 } else if (s->current.request_to == ORIGIN_SERVER) {
   TRANSACT_RETURN(SM_ACTION_DNS_LOOKUP, OSDNSLookup);
{noformat}

  was:
I've written a simple test case plugin for demonstrating this problem, not sure 
if it's a problem on my side, but that would also mean that the regex 
invalidate plugin would also abort().

What the plugin does: in LOOKUP_COMPLETE, if the cache status is FRESH then set 
it to STALE.

To reproduce:
1) A first cacheable request to ATS gets cached.
2) Request the same url again; the plugin triggers and sets the cache status to 
STALE. ATS then calls abort().

Plugin code:

{noformat}
#include <ts/ts.h>
#include <ts/remap.h>
#include <ts/experimental.h>
#include <stdlib.h>
#include <stdio.h>
#include <getopt.h>
#include <string.h>
#include <string>
#include <iterator>
#include <map>

const char PLUGIN_NAME[] = "maybebug";

[jira] [Updated] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-19 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-3455:
---
Summary: Marking the cache STALE in lookup-complete causes abort()  (was: 
Renaming Cache-Control in read-response and marking the cache STALE in 
lookup-complete causes abort())

 Marking the cache STALE in lookup-complete causes abort()
 -

 Key: TS-3455
 URL: https://issues.apache.org/jira/browse/TS-3455
 Project: Traffic Server
  Issue Type: Bug
Reporter: Luca Bruno

 I've written a simple test case plugin for demonstrating this problem, not 
 sure if it's a problem on my side, but that would also mean that the regex 
 invalidate plugin would also abort().
 What the plugin does: in LOOKUP_COMPLETE, if the cache status is FRESH then 
 set it to STALE.
 To reproduce:
 1) First cacheable request to ATS gets cached
 2) Request again the same url, the plugin triggers and set the cache to 
 STALE. Then ATS does abort().
 Plugin code:
 {noformat}
 #include ts/ts.h
 #include ts/remap.h
 #include ts/experimental.h
 #include stdlib.h
 #include stdio.h
 #include getopt.h
 #include string.h
 #include string
 #include iterator
 #include map
 const char PLUGIN_NAME[] = maybebug;
 static int Handler(TSCont cont, TSEvent event, void *edata);
 struct PluginState {
   PluginState()
   {
 cont = TSContCreate(Handler, NULL);
 TSContDataSet(cont, this);
   }
   ~PluginState()
   {
 TSContDestroy(cont);
   }
   TSCont cont;
 };
 static int Handler(TSCont cont, TSEvent event, void* edata) {
   TSHttpTxn txn = (TSHttpTxn)edata;
   if (event == TS_EVENT_HTTP_CACHE_LOOKUP_COMPLETE) {
 int lookup_status;
 if (TS_SUCCESS == TSHttpTxnCacheLookupStatusGet(txn, lookup_status)) {
   TSDebug(PLUGIN_NAME, lookup complete: %d, lookup_status);
   if (lookup_status == TS_CACHE_LOOKUP_HIT_FRESH) {
 TSDebug(PLUGIN_NAME, set stale);
 TSHttpTxnCacheLookupStatusSet(txn, TS_CACHE_LOOKUP_HIT_STALE);
   }
 }
   }
   TSHttpTxnReenable(txn, TS_EVENT_HTTP_CONTINUE);
   return TS_EVENT_NONE;
 }
 void TSPluginInit (int argc, const char *argv[]) {
   TSPluginRegistrationInfo info;
   info.plugin_name = strdup(cappello);
   info.vendor_name = strdup(foo);
   info.support_email = strdup(f...@bar.com);
   if (TSPluginRegister(TS_SDK_VERSION_3_0 , info) != TS_SUCCESS) {
 TSError(Plugin registration failed);
   }
   PluginState* state = new PluginState();
   TSHttpHookAdd(TS_HTTP_CACHE_LOOKUP_COMPLETE_HOOK, state-cont);
 }
 {noformat}
 Output:
 {noformat}
 [Mar 19 18:40:36.254] Server {0x7f6df0b4f740} DIAG: (maybebug) lookup 
 complete: 0
 [Mar 19 18:40:40.854] Server {0x7f6decfee700} DIAG: (maybebug) lookup 
 complete: 2
 [Mar 19 18:40:40.854] Server {0x7f6decfee700} DIAG: (maybebug) set stale
 FATAL: HttpTransact.cc:433: failed assert `s-pending_work == NULL`
 traffic_server - STACK TRACE: 
 /usr/local/lib/libtsutil.so.5(ink_fatal+0xa3)[0x7f6df072186d]
 /usr/local/lib/libtsutil.so.5(_Z12ink_get_randv+0x0)[0x7f6df071f3a0]
 traffic_server[0x60d0aa]
 traffic_server(_ZN12HttpTransact22HandleCacheOpenReadHitEPNS_5StateE+0xf82)[0x619206]
 ...
 {noformat}
 What happens in gdb is that HandleCacheOpenReadHit is called twice in the 
 same request. The first time s-pending_work is NULL, the second time it's 
 not NULL.
 The patch below fixes the problem:
 {noformat}
 diff --git a/proxy/http/HttpTransact.cc b/proxy/http/HttpTransact.cc
 index 0078ef1..852f285 100644
 --- a/proxy/http/HttpTransact.cc
 +++ b/proxy/http/HttpTransact.cc
 @@ -2641,11 +2641,6 @@ HttpTransact::HandleCacheOpenReadHit(State* s)
  //ink_release_assert(s-current.request_to == PARENT_PROXY ||
  //s-http_config_param-no_dns_forward_to_parent != 0);
  
 -// Set ourselves up to handle pending revalidate issues
 -//  after the PP DNS lookup
 -ink_assert(s-pending_work == NULL);
 -s-pending_work = issue_revalidate;
 -
  // We must be going a PARENT PROXY since so did
  //  origin server DNS lookup right after state Start
  //
 @@ -2654,6 +2649,11 @@ HttpTransact::HandleCacheOpenReadHit(State* s)
  //  missing ip but we won't take down the system
  //
  if (s-current.request_to == PARENT_PROXY) {
 +  // Set ourselves up to handle pending revalidate issues
 +  //  after the PP DNS lookup
 +  ink_assert(s-pending_work == NULL);
 +  s-pending_work = issue_revalidate;
 +
TRANSACT_RETURN(SM_ACTION_DNS_LOOKUP, PPDNSLookup);
  } else if (s-current.request_to == ORIGIN_SERVER) {
TRANSACT_RETURN(SM_ACTION_DNS_LOOKUP, OSDNSLookup);
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (TS-3455) Marking the cache STALE in lookup-complete causes abort()

2015-03-19 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-3455:
---
Comment: was deleted

(was: Ok, sorry, the issue as filed is misleading; the abort actually happens as follows:

1) Hook lookup_complete
2) If the cache status is FRESH, set it to STALE

HandleCacheOpenReadHit is then called twice for the same request, and the second 
call fails the assert because s->pending_work is no longer NULL.

So it has nothing to do with the header renaming above. That also means the regex 
invalidate plugin would abort(). I will try with ATS 5.0.)

 Marking the cache STALE in lookup-complete causes abort()
 -

 Key: TS-3455
 URL: https://issues.apache.org/jira/browse/TS-3455
 Project: Traffic Server
  Issue Type: Bug
Reporter: Luca Bruno

 I've written a simple test case plugin for demonstrating this problem, not 
 sure if it's a problem on my side, but that would also mean that the regex 
 invalidate plugin would also abort().
 What the plugin does: in LOOKUP_COMPLETE, if the cache status is FRESH then 
 set it to STALE.
 To reproduce:
 1) First cacheable request to ATS gets cached
 2) Request again the same url, the plugin triggers and set the cache to 
 STALE. Then ATS does abort().
 Plugin code:
 {noformat}
 #include ts/ts.h
 #include ts/remap.h
 #include ts/experimental.h
 #include stdlib.h
 #include stdio.h
 #include getopt.h
 #include string.h
 #include string
 #include iterator
 #include map
 const char PLUGIN_NAME[] = maybebug;
 static int Handler(TSCont cont, TSEvent event, void *edata);
 struct PluginState {
   PluginState()
   {
 cont = TSContCreate(Handler, NULL);
 TSContDataSet(cont, this);
   }
   ~PluginState()
   {
 TSContDestroy(cont);
   }
   TSCont cont;
 };
 static int Handler(TSCont cont, TSEvent event, void* edata) {
   TSHttpTxn txn = (TSHttpTxn)edata;
   if (event == TS_EVENT_HTTP_CACHE_LOOKUP_COMPLETE) {
 int lookup_status;
 if (TS_SUCCESS == TSHttpTxnCacheLookupStatusGet(txn, lookup_status)) {
   TSDebug(PLUGIN_NAME, lookup complete: %d, lookup_status);
   if (lookup_status == TS_CACHE_LOOKUP_HIT_FRESH) {
 TSDebug(PLUGIN_NAME, set stale);
 TSHttpTxnCacheLookupStatusSet(txn, TS_CACHE_LOOKUP_HIT_STALE);
   }
 }
   }
   TSHttpTxnReenable(txn, TS_EVENT_HTTP_CONTINUE);
   return TS_EVENT_NONE;
 }
 void TSPluginInit (int argc, const char *argv[]) {
   TSPluginRegistrationInfo info;
   info.plugin_name = strdup(cappello);
   info.vendor_name = strdup(foo);
   info.support_email = strdup(f...@bar.com);
   if (TSPluginRegister(TS_SDK_VERSION_3_0 , info) != TS_SUCCESS) {
 TSError(Plugin registration failed);
   }
   PluginState* state = new PluginState();
   TSHttpHookAdd(TS_HTTP_CACHE_LOOKUP_COMPLETE_HOOK, state-cont);
 }
 {noformat}
 Output:
 {noformat}
 [Mar 19 18:40:36.254] Server {0x7f6df0b4f740} DIAG: (maybebug) lookup 
 complete: 0
 [Mar 19 18:40:40.854] Server {0x7f6decfee700} DIAG: (maybebug) lookup 
 complete: 2
 [Mar 19 18:40:40.854] Server {0x7f6decfee700} DIAG: (maybebug) set stale
 FATAL: HttpTransact.cc:433: failed assert `s-pending_work == NULL`
 traffic_server - STACK TRACE: 
 /usr/local/lib/libtsutil.so.5(ink_fatal+0xa3)[0x7f6df072186d]
 /usr/local/lib/libtsutil.so.5(_Z12ink_get_randv+0x0)[0x7f6df071f3a0]
 traffic_server[0x60d0aa]
 traffic_server(_ZN12HttpTransact22HandleCacheOpenReadHitEPNS_5StateE+0xf82)[0x619206]
 ...
 {noformat}
 What happens in gdb is that HandleCacheOpenReadHit is called twice in the 
 same request. The first time s-pending_work is NULL, the second time it's 
 not NULL.
 The patch below fixes the problem:
 {noformat}
 diff --git a/proxy/http/HttpTransact.cc b/proxy/http/HttpTransact.cc
 index 0078ef1..852f285 100644
 --- a/proxy/http/HttpTransact.cc
 +++ b/proxy/http/HttpTransact.cc
 @@ -2641,11 +2641,6 @@ HttpTransact::HandleCacheOpenReadHit(State* s)
  //ink_release_assert(s-current.request_to == PARENT_PROXY ||
  //s-http_config_param-no_dns_forward_to_parent != 0);
  
 -// Set ourselves up to handle pending revalidate issues
 -//  after the PP DNS lookup
 -ink_assert(s-pending_work == NULL);
 -s-pending_work = issue_revalidate;
 -
  // We must be going a PARENT PROXY since so did
  //  origin server DNS lookup right after state Start
  //
 @@ -2654,6 +2649,11 @@ HttpTransact::HandleCacheOpenReadHit(State* s)
  //  missing ip but we won't take down the system
  //
  if (s-current.request_to == PARENT_PROXY) {
 +  // Set ourselves up to handle pending revalidate issues
 +  //  after the PP DNS lookup
 +  ink_assert(s-pending_work == NULL);
 +  s-pending_work = issue_revalidate;
 +
TRANSACT_RETURN(SM_ACTION_DNS_LOOKUP, PPDNSLookup);
  } else if (s-current.request_to == 

[jira] [Updated] (TS-3455) Renaming Cache-Control in read-response and marking the cache STALE in lookup-complete causes abort()

2015-03-19 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-3455:
---
Description: 
I've written a simple test-case plugin to demonstrate this problem. I'm not sure 
whether the problem is on my side, but if it isn't, it would also mean that the 
regex invalidate plugin aborts().

What the plugin does: in LOOKUP_COMPLETE, if the cache status is FRESH, set it 
to STALE.

To reproduce:
1) A first cacheable request to ATS gets cached.
2) Request the same url again; the plugin triggers and sets the cache status to 
STALE. ATS then calls abort().

Plugin code:

{noformat}
#include <ts/ts.h>
#include <ts/remap.h>
#include <ts/experimental.h>
#include <stdlib.h>
#include <stdio.h>
#include <getopt.h>
#include <string.h>
#include <string>
#include <iterator>
#include <map>

const char PLUGIN_NAME[] = "maybebug";
static int Handler(TSCont cont, TSEvent event, void *edata);

struct PluginState {
  PluginState()
  {
    cont = TSContCreate(Handler, NULL);
    TSContDataSet(cont, this);
  }

  ~PluginState()
  {
    TSContDestroy(cont);
  }

  TSCont cont;
};

static int Handler(TSCont cont, TSEvent event, void* edata) {
  TSHttpTxn txn = (TSHttpTxn)edata;

  if (event == TS_EVENT_HTTP_CACHE_LOOKUP_COMPLETE) {
    int lookup_status;
    if (TS_SUCCESS == TSHttpTxnCacheLookupStatusGet(txn, &lookup_status)) {
      TSDebug(PLUGIN_NAME, "lookup complete: %d", lookup_status);

      if (lookup_status == TS_CACHE_LOOKUP_HIT_FRESH) {
        TSDebug(PLUGIN_NAME, "set stale");
        TSHttpTxnCacheLookupStatusSet(txn, TS_CACHE_LOOKUP_HIT_STALE);
      }
    }
  }

  TSHttpTxnReenable(txn, TS_EVENT_HTTP_CONTINUE);
  return TS_EVENT_NONE;
}

void TSPluginInit (int argc, const char *argv[]) {
  TSPluginRegistrationInfo info;
  info.plugin_name = strdup("cappello");
  info.vendor_name = strdup("foo");
  info.support_email = strdup("f...@bar.com");

  if (TSPluginRegister(TS_SDK_VERSION_3_0, &info) != TS_SUCCESS) {
    TSError("Plugin registration failed");
  }

  PluginState* state = new PluginState();

  TSHttpHookAdd(TS_HTTP_CACHE_LOOKUP_COMPLETE_HOOK, state->cont);
}
{noformat}

Output:

{noformat}
[Mar 19 18:40:36.254] Server {0x7f6df0b4f740} DIAG: (maybebug) lookup complete: 0
[Mar 19 18:40:40.854] Server {0x7f6decfee700} DIAG: (maybebug) lookup complete: 
2
[Mar 19 18:40:40.854] Server {0x7f6decfee700} DIAG: (maybebug) set stale
FATAL: HttpTransact.cc:433: failed assert `s->pending_work == NULL`
traffic_server - STACK TRACE: 
/usr/local/lib/libtsutil.so.5(ink_fatal+0xa3)[0x7f6df072186d]
/usr/local/lib/libtsutil.so.5(_Z12ink_get_randv+0x0)[0x7f6df071f3a0]
traffic_server[0x60d0aa]
traffic_server(_ZN12HttpTransact22HandleCacheOpenReadHitEPNS_5StateE+0xf82)[0x619206]
...
{noformat}

What happens in gdb is that HandleCacheOpenReadHit is called twice in the same 
request. The first time s->pending_work is NULL, the second time it's not NULL.

The patch below fixes the problem:

{noformat}
diff --git a/proxy/http/HttpTransact.cc b/proxy/http/HttpTransact.cc
index 0078ef1..852f285 100644
--- a/proxy/http/HttpTransact.cc
+++ b/proxy/http/HttpTransact.cc
@@ -2641,11 +2641,6 @@ HttpTransact::HandleCacheOpenReadHit(State* s)
 //ink_release_assert(s->current.request_to == PARENT_PROXY ||
 //s->http_config_param->no_dns_forward_to_parent != 0);
 
-// Set ourselves up to handle pending revalidate issues
-//  after the PP DNS lookup
-ink_assert(s->pending_work == NULL);
-s->pending_work = issue_revalidate;
-
 // We must be going a PARENT PROXY since so did
 //  origin server DNS lookup right after state Start
 //
@@ -2654,6 +2649,11 @@ HttpTransact::HandleCacheOpenReadHit(State* s)
 //  missing ip but we won't take down the system
 //
 if (s->current.request_to == PARENT_PROXY) {
+  // Set ourselves up to handle pending revalidate issues
+  //  after the PP DNS lookup
+  ink_assert(s->pending_work == NULL);
+  s->pending_work = issue_revalidate;
+
   TRANSACT_RETURN(SM_ACTION_DNS_LOOKUP, PPDNSLookup);
 } else if (s->current.request_to == ORIGIN_SERVER) {
   TRANSACT_RETURN(SM_ACTION_DNS_LOOKUP, OSDNSLookup);
{noformat}

  was:
I've written a simple test-case plugin to demonstrate this problem; I'm not sure 
whether the problem is on my side.

What the plugin does:

1) In READ_RESPONSE_HDR, rename Cache-Control to X-Plugin-Cache-Control
2) In LOOKUP_COMPLETE, if the lookup is a FRESH hit, set the status to STALE instead

So, with the cache enabled, do a first request: the response is cached because 
there's no Cache-Control header.
Do a second request: ATS now calls abort().

The issue requires both parts, renaming Cache-Control and marking the cache 
lookup stale; either one alone doesn't trigger it.

Plugin code:

{noformat}
#include <ts/ts.h>
#include <ts/remap.h>
#include <ts/experimental.h>
#include <stdlib.h>

[jira] [Comment Edited] (TS-3455) Renaming Cache-Control in read-response and marking the cache STALE in lookup-complete causes abort()

2015-03-19 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14369461#comment-14369461
 ] 

Luca Bruno edited comment on TS-3455 at 3/19/15 3:00 PM:
-

To clarify. I also tried with header_rewrite:
{noformat}
cond %{READ_RESPONSE_HDR_HOOK}
add-header X-Plugin-Cache-Control %{HEADER:Cache-Control}
rm-header Cache-Control
{noformat}

Then in my plugin I only hooked up the lookup-complete to set the cache to be 
stale. Got the same abort().

Also please note that the tests I do always start with a clean cache.


was (Author: lethalman):
To clarify. I also tried with header_rewrite:
{noformat}
cond %{READ_RESPONSE_HDR_HOOK}
add-header X-Plugin-Cache-Control %{HEADER:Cache-Control}
rm-header Cache-Control
{noformat}

Then in my plugin I only hooked up the lookup-complete to set the cache to be 
stale. Same abort().

 Renaming Cache-Control in read-response and marking the cache STALE in 
 lookup-complete causes abort()
 -

 Key: TS-3455
 URL: https://issues.apache.org/jira/browse/TS-3455
 Project: Traffic Server
  Issue Type: Bug
Reporter: Luca Bruno

 I've written a simple test case plugin for demonstrating this problem, not 
 sure if it's a problem on my side.
 What the plugin does:
 1) In READ_RESPONSE_HDR rename Cache-Control to X-Plugin-Cache-Control
 2) In LOOKUP_COMPLETE, if it's a FRESH hit set the status to be STALE instead
 So after the cache has been enabled, do a first request: this will be cached 
 because there's no Cache-Control header.
 Do a second request, now ATS will abort().
 The issue is related to both the fact that Cache-Control is renamed and the 
 cache set to be stale. Either of the two alone don't trigger the issue.
 Plugin code:
 {noformat}
 #include ts/ts.h
 #include ts/remap.h
 #include ts/experimental.h
 #include stdlib.h
 #include stdio.h
 #include getopt.h
 #include string.h
 #include string
 #include iterator
 #include map
 // Constants and some declarations
 const char PLUGIN_NAME[] = maybebug;
 static int Handler(TSCont cont, TSEvent event, void *edata);
 struct PluginState {
   PluginState()
   {
 cont = TSContCreate(Handler, NULL);
 TSContDataSet(cont, this);
   }
   ~PluginState()
   {
 TSContDestroy(cont);
   }
   TSCont cont;
 };
 static int Handler(TSCont cont, TSEvent event, void* edata) {
   TSHttpTxn txn = (TSHttpTxn)edata;
   TSMBuffer mbuf;
   TSMLoc hdrp;
   if (event == TS_EVENT_HTTP_READ_RESPONSE_HDR) {
 TSDebug(PLUGIN_NAME, read response);
 if (TS_SUCCESS == TSHttpTxnServerRespGet(txn, mbuf, hdrp)) {
   TSMLoc fieldp = TSMimeHdrFieldFind(mbuf, hdrp, Cache-Control, 
 sizeof(Cache-Control)-1);
   if (fieldp != TS_NULL_MLOC) {
 TSDebug(PLUGIN_NAME, rename cache-control);
 TSMimeHdrFieldNameSet(mbuf, hdrp, fieldp, X-Plugin-Cache-Control, 
 sizeof(X-Plugin-Cache-Control)-1);
 TSHandleMLocRelease(mbuf, hdrp, fieldp);
   }
   TSHandleMLocRelease(mbuf, TS_NULL_MLOC, hdrp);
 }
   } else if (event == TS_EVENT_HTTP_CACHE_LOOKUP_COMPLETE) {
 int lookup_status;
 if (TS_SUCCESS != TSHttpTxnCacheLookupStatusGet(txn, lookup_status)) {
   goto beach;
 }
 TSDebug(PLUGIN_NAME, lookup complete: %d, lookup_status);
 if (lookup_status == TS_CACHE_LOOKUP_HIT_FRESH  
 TSHttpTxnCachedRespGet(txn, mbuf, hdrp) == TS_SUCCESS) {
   TSMLoc fieldp = TSMimeHdrFieldFind(mbuf, hdrp, 
 X-Plugin-Cache-Control, sizeof(X-Plugin-Cache-Control)-1);
   if (fieldp) {
 TSDebug(PLUGIN_NAME, set stale);
 TSHandleMLocRelease(mbuf, hdrp, fieldp);
 TSHttpTxnCacheLookupStatusSet(txn, TS_CACHE_LOOKUP_HIT_STALE);
   }
   TSHandleMLocRelease(mbuf, TS_NULL_MLOC, hdrp);
 }
   }
 beach:
   TSHttpTxnReenable(txn, TS_EVENT_HTTP_CONTINUE);
   return TS_EVENT_NONE;
 }
 TSReturnCode TSRemapInit(TSRemapInterface*  api, char* errbuf, int bufsz) {
   return TS_SUCCESS;
 }
 TSReturnCode TSRemapNewInstance(int argc, char* argv[], void** instance, 
 char* errbuf, int errbuf_size) {
   TSDebug(PLUGIN_NAME, new instance);
   PluginState* es = new PluginState();
   *instance = es;
   return TS_SUCCESS;
 }
 void TSRemapDeleteInstance(void* instance) {
   delete static_castPluginState*(instance);
 }
 TSRemapStatus TSRemapDoRemap(void* instance, TSHttpTxn txn, 
 TSRemapRequestInfo* rri) {
   PluginState* es = static_castPluginState*(instance);
   TSHttpTxnHookAdd(txn, TS_HTTP_READ_RESPONSE_HDR_HOOK, es-cont);
   TSHttpTxnHookAdd(txn, TS_HTTP_CACHE_LOOKUP_COMPLETE_HOOK, es-cont);
   return TSREMAP_NO_REMAP;
 }
 {noformat}
 Stacktrace from gdb:
 {noformat}
 #1  0x754e53e0 in abort () from /lib/x86_64-linux-gnu/libc.so.6
 #2  0x77bb96fc in ink_die_die_die (retval=1) at ink_error.cc:43
 #3  

[jira] [Commented] (TS-3455) Renaming Cache-Control in read-response and marking the cache STALE in lookup-complete causes abort()

2015-03-19 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14369461#comment-14369461
 ] 

Luca Bruno commented on TS-3455:


To clarify. I also tried with header_rewrite:
{noformat}
cond %{READ_RESPONSE_HDR_HOOK}
add-header X-Plugin-Cache-Control %{HEADER:Cache-Control}
rm-header Cache-Control
{noformat}

Then in my plugin I only hooked up the lookup-complete to set the cache to be 
stale. Same abort().

 Renaming Cache-Control in read-response and marking the cache STALE in 
 lookup-complete causes abort()
 -

 Key: TS-3455
 URL: https://issues.apache.org/jira/browse/TS-3455
 Project: Traffic Server
  Issue Type: Bug
Reporter: Luca Bruno

 I've written a simple test case plugin for demonstrating this problem, not 
 sure if it's a problem on my side.
 What the plugin does:
 1) In READ_RESPONSE_HDR rename Cache-Control to X-Plugin-Cache-Control
 2) In LOOKUP_COMPLETE, if it's a FRESH hit set the status to be STALE instead
 So after the cache has been enabled, do a first request: this will be cached 
 because there's no Cache-Control header.
 Do a second request, now ATS will abort().
 The issue is related to both the fact that Cache-Control is renamed and the 
 cache set to be stale. Either of the two alone don't trigger the issue.
 Plugin code:
 {noformat}
 #include ts/ts.h
 #include ts/remap.h
 #include ts/experimental.h
 #include stdlib.h
 #include stdio.h
 #include getopt.h
 #include string.h
 #include string
 #include iterator
 #include map
 // Constants and some declarations
 const char PLUGIN_NAME[] = maybebug;
 static int Handler(TSCont cont, TSEvent event, void *edata);
 struct PluginState {
   PluginState()
   {
 cont = TSContCreate(Handler, NULL);
 TSContDataSet(cont, this);
   }
   ~PluginState()
   {
 TSContDestroy(cont);
   }
   TSCont cont;
 };
 static int Handler(TSCont cont, TSEvent event, void* edata) {
   TSHttpTxn txn = (TSHttpTxn)edata;
   TSMBuffer mbuf;
   TSMLoc hdrp;
   if (event == TS_EVENT_HTTP_READ_RESPONSE_HDR) {
 TSDebug(PLUGIN_NAME, read response);
 if (TS_SUCCESS == TSHttpTxnServerRespGet(txn, mbuf, hdrp)) {
   TSMLoc fieldp = TSMimeHdrFieldFind(mbuf, hdrp, Cache-Control, 
 sizeof(Cache-Control)-1);
   if (fieldp != TS_NULL_MLOC) {
 TSDebug(PLUGIN_NAME, rename cache-control);
 TSMimeHdrFieldNameSet(mbuf, hdrp, fieldp, X-Plugin-Cache-Control, 
 sizeof(X-Plugin-Cache-Control)-1);
 TSHandleMLocRelease(mbuf, hdrp, fieldp);
   }
   TSHandleMLocRelease(mbuf, TS_NULL_MLOC, hdrp);
 }
   } else if (event == TS_EVENT_HTTP_CACHE_LOOKUP_COMPLETE) {
 int lookup_status;
 if (TS_SUCCESS != TSHttpTxnCacheLookupStatusGet(txn, lookup_status)) {
   goto beach;
 }
 TSDebug(PLUGIN_NAME, lookup complete: %d, lookup_status);
 if (lookup_status == TS_CACHE_LOOKUP_HIT_FRESH  
 TSHttpTxnCachedRespGet(txn, mbuf, hdrp) == TS_SUCCESS) {
   TSMLoc fieldp = TSMimeHdrFieldFind(mbuf, hdrp, 
 X-Plugin-Cache-Control, sizeof(X-Plugin-Cache-Control)-1);
   if (fieldp) {
 TSDebug(PLUGIN_NAME, set stale);
 TSHandleMLocRelease(mbuf, hdrp, fieldp);
 TSHttpTxnCacheLookupStatusSet(txn, TS_CACHE_LOOKUP_HIT_STALE);
   }
   TSHandleMLocRelease(mbuf, TS_NULL_MLOC, hdrp);
 }
   }
 beach:
   TSHttpTxnReenable(txn, TS_EVENT_HTTP_CONTINUE);
   return TS_EVENT_NONE;
 }
 TSReturnCode TSRemapInit(TSRemapInterface*  api, char* errbuf, int bufsz) {
   return TS_SUCCESS;
 }
 TSReturnCode TSRemapNewInstance(int argc, char* argv[], void** instance, 
 char* errbuf, int errbuf_size) {
   TSDebug(PLUGIN_NAME, new instance);
   PluginState* es = new PluginState();
   *instance = es;
   return TS_SUCCESS;
 }
 void TSRemapDeleteInstance(void* instance) {
   delete static_castPluginState*(instance);
 }
 TSRemapStatus TSRemapDoRemap(void* instance, TSHttpTxn txn, 
 TSRemapRequestInfo* rri) {
   PluginState* es = static_castPluginState*(instance);
   TSHttpTxnHookAdd(txn, TS_HTTP_READ_RESPONSE_HDR_HOOK, es-cont);
   TSHttpTxnHookAdd(txn, TS_HTTP_CACHE_LOOKUP_COMPLETE_HOOK, es-cont);
   return TSREMAP_NO_REMAP;
 }
 {noformat}
 Stacktrace from gdb:
 {noformat}
 #1  0x754e53e0 in abort () from /lib/x86_64-linux-gnu/libc.so.6
 #2  0x77bb96fc in ink_die_die_die (retval=1) at ink_error.cc:43
 #3  0x77bb97c2 in ink_fatal_va(int, const char *, typedef 
 __va_list_tag __va_list_tag *) (return_code=1, fmt=0x77bca490 %s:%d: 
 failed assert `%s`, ap=0x742813d8)
 at ink_error.cc:67
 #4  0x77bb986d in ink_fatal (return_code=1, 
 message_format=0x77bca490 %s:%d: failed assert `%s`) at ink_error.cc:75
 #5  0x77bb73a0 in _ink_assert (expression=0x7c6883 s-pending_work 
 == NULL, file=0x7c66cc 

[jira] [Commented] (TS-3395) Hit ratio drops with high concurrency

2015-02-23 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333179#comment-14333179
 ] 

Luca Bruno commented on TS-3395:


This is with 1 vol (same with 5 vols) and 
`proxy.config.http.origin_max_connections 10` (same with 100 connections):

!http://i.imgur.com/vYXDE0J.png?1!

With 10 vols and max 10 origin connections:

!http://i.imgur.com/frRQ5r6.png?1!

!http://i.imgur.com/Xwt09Hj.png?1!

So, the same result as without capping origin connections. At this point the only 
tuning that seems to help is using 10 volumes, and perhaps it also depends on the 
number of client connections.
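
For reference, this is a sketch of the kind of tuning being compared in this 
comment: splitting the cache into 10 volumes and capping connections per origin. 
The volume sizes and layout below are my own illustration, not taken from the 
ticket; only the origin_max_connections value comes from the test above.

{noformat}
# volume.config -- divide the cache into 10 equal http volumes (illustrative sizes)
volume=1 scheme=http size=10%
volume=2 scheme=http size=10%
# ... volumes 3 through 9 defined the same way ...
volume=10 scheme=http size=10%

# records.config -- cap simultaneous connections per origin, as tested above
CONFIG proxy.config.http.origin_max_connections INT 10
{noformat}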


 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffoc_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TS-3395) Hit ratio drops with high concurrency

2015-02-23 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333179#comment-14333179
 ] 

Luca Bruno edited comment on TS-3395 at 2/23/15 10:17 AM:
--

This is with 1 vol (same with 5 vols) and 
`proxy.config.http.origin_max_connections 10` (same with 100 connections):

!http://i.imgur.com/vYXDE0J.png?1!

With 10 vols and max 10 origin connections:

!http://i.imgur.com/frRQ5r6.png?1!

!http://i.imgur.com/Xwt09Hj.png?1!

So, the same result as without capping origin connections. At this point the only 
tuning that seems to help is using 10 volumes, and perhaps it also depends on the 
number of simultaneous client connections.



was (Author: lethalman):
This is with 1 vol (same with 5 vols) and 
`proxy.config.http.origin_max_connections 10` (same with 100 connections):

!http://i.imgur.com/vYXDE0J.png?1!

With 10 vols and max 10 origin connections:

!http://i.imgur.com/frRQ5r6.png?1!

!http://i.imgur.com/Xwt09Hj.png?1!

So same result as without setting the max origin connections. At this point the 
only tuning option I see working is the fact of having 10 vols, or who knows 
depending on the number of client connections apparently.


 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffoc_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 

[jira] [Commented] (TS-3395) Hit ratio drops with high concurrency

2015-02-21 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330302#comment-14330302
 ] 

Luca Bruno commented on TS-3395:


My complaint is about the hit ratio drop with vol 1, not the I/O bottleneck 
with 10 vols. 

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffoc_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly; ask me for further information 
 if needed. But in summary: the hit ratio drops with a high number of 
 connections, and the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-3395) Hit ratio drops with high concurrency

2015-02-21 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330333#comment-14330333
 ] 

Luca Bruno commented on TS-3395:


Other cache servers hit the I/O bottleneck but at least deliver the expected hit 
ratio. So if this is intended ATS behaviour, it's not good for our case, where we 
don't want to hit the origin servers.

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-3395) Hit ratio drops with high concurrency

2015-02-21 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330341#comment-14330341
 ] 

Luca Bruno commented on TS-3395:


About limiting connections: https://issues.apache.org/jira/browse/TS-3386

And what you suggest doesn't address the software issue. I know perfectly well 
that I can do what you suggest at the hardware level.

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-3395) Hit ratio drops with high concurrency

2015-02-21 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330416#comment-14330416
 ] 

Luca Bruno commented on TS-3395:


For you, maybe. Not sure how to explain it better: we don't want to depend on 
some kind of magic tuning with magic numbers. We want software that achieves 
that hit ratio even when stressed, and that doesn't depend on any kind of 
connection limit.
You avoid the bad cases; we instead want to be sure certain things never happen. 
If this isn't ATS's job, please let me know so I can look further.

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TS-3395) Hit ratio drops with high concurrency

2015-02-21 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330416#comment-14330416
 ] 

Luca Bruno edited comment on TS-3395 at 2/21/15 8:14 PM:
-

For you maybe. Not sure how to explain better, we don't want to depend on some 
kind of magic tuning with some magic numbers. We want a software that achieves 
that hit ratio even when stressed, and must not depend on any kind of max 
connections.
You avoid cases, we instead want to be sure certain things never happen. If 
it's not ATS job please let me know so I can look further, unless I 
misunderstood you.


was (Author: lethalman):
For you maybe. Not sure how to explain better, we don't want to depend on some 
kind of magic tuning with some magic numbers. We want a software that achieves 
that hit ratio even when stressed, and must not depend on any kind of max 
connections.
You avoid cases, we instead want to be sure certain things never happen. If 
it's not ATS job please let me know so I can look further.

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TS-3395) Hit ratio drops with high concurrency

2015-02-21 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330416#comment-14330416
 ] 

Luca Bruno edited comment on TS-3395 at 2/21/15 8:15 PM:
-

For you maybe. Not sure how to explain better, we don't want to depend on some 
kind of magic tuning with some magic numbers. We want a software that achieves 
that hit ratio even when stressed, and must not depend on any kind of max 
connections.
You avoid cases, we instead want to be sure certain things never happen. If 
it's not ATS job please let me know so I can look further, unless I 
misunderstood you.
But basically this is the second bug where I'm suggested to change some config 
to workaround some software behavior.


was (Author: lethalman):
For you maybe. Not sure how to explain better, we don't want to depend on some 
kind of magic tuning with some magic numbers. We want a software that achieves 
that hit ratio even when stressed, and must not depend on any kind of max 
connections.
You avoid cases, we instead want to be sure certain things never happen. If 
it's not ATS job please let me know so I can look further, unless I 
misunderstood you.

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 

[jira] [Commented] (TS-3395) Hit ratio drops with high concurrency

2015-02-21 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14330417#comment-14330417
 ] 

Luca Bruno commented on TS-3395:


Of course, for now I'll try limiting the connections and let this issue die on 
its own.
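
For reference, a minimal sketch of capping connections in ATS, assuming the knob 
meant here is the global connection throttle in records.config (the value 500 is 
only an illustrative number, not one taken from this setup):

{noformat}
CONFIG proxy.config.net.connections_throttle INT 500
{noformat}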

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-3395) Hit ratio drops with high concurrency

2015-02-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328818#comment-14328818
 ] 

Luca Bruno commented on TS-3395:


Further debugging, with 2000 http concurrent connections and more than 4000 
requests per second. This is what I see from my client, also confirmed by the 
logs of the origin server.

{noformat}
$ grep /someurl clientlog |grep MISS
2015/02/20 11:54:50 MISS /storage/someurl
2015/02/20 11:54:55 MISS /storage/someurl
2015/02/20 11:55:12 MISS /storage/someurl
2015/02/20 11:56:27 MISS /storage/someurl
2015/02/20 11:56:59 MISS /storage/someurl
2015/02/20 11:57:05 MISS /storage/someurl
2015/02/20 11:57:32 MISS /storage/someurl
$ grep /someurl clientlog |grep HIT
2015/02/20 11:58:22 HIT /storage/someurl
{noformat}

So out of 8 requests for the same url, ATS either didn't write the object to the 
cache or couldn't read it at that time; only the last request was cached.
Some urls are not cached at all. Given that I don't see any intermittent 
HIT/MISS for the same url, I consider this a problem during cache writes 
(unless it's a misconfiguration on my side).
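
To get a rough measure of how widespread this is across the whole log, a quick 
sketch over the clientlog format shown above (date, time, HIT/MISS, url); the 
file name and field layout are simply the ones from the excerpt:

{noformat}
# how many times each url went to the origin, most-missed urls first
grep ' MISS ' clientlog | awk '{print $4}' | sort | uniq -c | sort -rn | head

# number of urls that were fetched from the origin more than once
grep ' MISS ' clientlog | awk '{print $4}' | sort | uniq -c | awk '$1 > 1' | wc -l
{noformat}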


 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno

 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 

[jira] [Comment Edited] (TS-3395) Hit ratio drops with high concurrency

2015-02-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328818#comment-14328818
 ] 

Luca Bruno edited comment on TS-3395 at 2/20/15 11:17 AM:
--

Further debugging, with 2000 http concurrent connections and more than 4000 
requests per second. This is what I see from my client, also confirmed by the 
logs of the origin server.

{noformat}
$ grep /someurl clientlog |grep MISS
2015/02/20 11:54:50 MISS /storage/someurl
2015/02/20 11:54:55 MISS /storage/someurl
2015/02/20 11:55:12 MISS /storage/someurl
2015/02/20 11:56:27 MISS /storage/someurl
2015/02/20 11:56:59 MISS /storage/someurl
2015/02/20 11:57:05 MISS /storage/someurl
2015/02/20 11:57:32 MISS /storage/someurl
$ grep /someurl clientlog |grep HIT
2015/02/20 11:58:22 HIT /storage/someurl
{noformat}

So out of 8 same requests, ATS either didn't write the cache for that url, or 
couldn't read it at that time. ATS was able to cache only that last request.
Also some requests are not cached at all. Given that I don't see any 
interleaving of HIT and MISS for the same url, I consider this a problem during 
cache write (yet, unless it's misconfiguration from my side).



was (Author: lethalman):
Further debugging, with 2000 http concurrent connections and more than 4000 
requests per second. This is what I see from my client, also confirmed by the 
logs of the origin server.

{noformat}
$ grep /someurl clientlog |grep MISS
2015/02/20 11:54:50 MISS /storage/someurl
2015/02/20 11:54:55 MISS /storage/someurl
2015/02/20 11:55:12 MISS /storage/someurl
2015/02/20 11:56:27 MISS /storage/someurl
2015/02/20 11:56:59 MISS /storage/someurl
2015/02/20 11:57:05 MISS /storage/someurl
2015/02/20 11:57:32 MISS /storage/someurl
$ grep /someurl clientlog |grep HIT
2015/02/20 11:58:22 HIT /storage/someurl
{noformat}

So out of 8 same requests, ATS either didn't write the cache for that url, or 
couldn't read it at that time. ATS was able to cache only that last request.
Also some requests are not cached at all. Given that I don't see any 
intermittent HIT/MISS for the same url, I consider this a problem during cache 
write (yet, unless it's misconfiguration from my side).


 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno

 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 

[jira] [Commented] (TS-3395) Hit ratio drops with high concurrency

2015-02-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328934#comment-14328934
 ] 

Luca Bruno commented on TS-3395:


So I'm trying to debug the problem and enabled debug logging for http.*. It turns 
out the problem disappears: I get the 60% hit ratio, except that of course the 
request rate is now about 400/s instead of 4000/s :).
So I'm unable to debug this issue by following the state changes during 
requests. Any hints are appreciated.
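
For reference, a sketch of the records.config lines presumably used to turn on 
that debug output (the standard diags knobs; the tag list is the one mentioned 
above):

{noformat}
CONFIG proxy.config.diags.debug.enabled INT 1
CONFIG proxy.config.diags.debug.tags STRING http.*
{noformat}

The synchronous debug logging presumably slows each transaction down enough that 
the timing-dependent behaviour no longer shows up, which is consistent with the 
drop from roughly 4000 req/s to 400 req/s.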

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno

 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-3395) Hit ratio drops with high concurrency

2015-02-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329648#comment-14329648
 ] 

Luca Bruno commented on TS-3395:


As stated in my earlier comments, I already tried the open_read settings: 
max_open_read_retries 4 and open_read_retry_time 500.

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TS-3395) Hit ratio drops with high concurrency

2015-02-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329648#comment-14329648
 ] 

Luca Bruno edited comment on TS-3395 at 2/20/15 10:15 PM:
--

As stated in my earlier comments, I already tried setting open_read settings, 
max 4 retries and 500 retry time.


was (Author: lethalman):
As stated in my earlier comments, I already tried setting open_read settings, 
max 4 retries and 500 time.

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TS-3395) Hit ratio drops with high concurrency

2015-02-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329091#comment-14329091
 ] 

Luca Bruno edited comment on TS-3395 at 2/20/15 4:03 PM:
-

I've set proxy.config.http.cache.max_open_write_retries to 1000.

I've a 500gb spinning disk. This is with 5 volumes, average hit ratio increased 
a little:

!http://i.imgur.com/7mI6kpo.png!

This is with 10 volumes:

!http://i.imgur.com/OGzeQVm.png!

Perfect, but bad req/sec and latency:

!http://i.imgur.com/zXS8uv3.png!

!http://i.imgur.com/UgYWzj5.png!

In this case I would have expected latency to be high, but still rate to be at 
least around 2500-3000 req/sec. I think here we're hitting disk scheduling 
issues, too much concurrency. Will do a test that lasts longer and also grab 
some graph on disk usage.


was (Author: lethalman):
I've set proxy.config.http.cache.max_open_write_retries to 1000.

I've a 500gb disk. This is with 5 volumes, average hit ratio increased a little:

!http://i.imgur.com/7mI6kpo.png!

This is with 10 volumes:

!http://i.imgur.com/OGzeQVm.png!

Perfect, but bad req/sec and latency:

!http://i.imgur.com/zXS8uv3.png!

!http://i.imgur.com/UgYWzj5.png!

In this case I would have expected latency to be high, but still rate to be at 
least around 2500-3000 req/sec. I think here we're hitting disk scheduling 
issues, too much concurrency. Will do a test that lasts longer and also grab 
some graph on disk usage.

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 100 requests, 78600 urls were asked at least twice. An url was 
 even requested 9 times. These same url are not requested close to each other: 
 even more than 30sec can pass from one request to the other for the same url.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, and yet having similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high 

[jira] [Commented] (TS-3395) Hit ratio drops with high concurrency

2015-02-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329091#comment-14329091
 ] 

Luca Bruno commented on TS-3395:


I've set proxy.config.http.cache.max_open_write_retries to 1000.

I've a 500gb disk. This is with 5 volumes, average hit ratio increased a little:

!http://i.imgur.com/7mI6kpo.png!

This is with 10 volumes:

!http://i.imgur.com/OGzeQVm.png!

Perfect, but bad req/sec and latency:

!http://i.imgur.com/zXS8uv3.png!

!http://i.imgur.com/UgYWzj5.png!

In this case I would have expected latency to be high, but the rate to still be 
at least around 2500-3000 req/sec. I think here we're hitting disk scheduling 
issues from too much concurrency. I'll run a longer test and also grab some 
graphs of disk usage.
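
A sketch of the cache configuration behind the 10-volume run above, assuming the 
single 500gb disk is listed in storage.config and simply split by percentage in 
volume.config (the retry setting is the one quoted at the top of this comment):

{noformat}
# records.config
CONFIG proxy.config.http.cache.max_open_write_retries INT 1000

# volume.config: split the cache into 10 equal volumes
volume=1 scheme=http size=10%
volume=2 scheme=http size=10%
# ... volumes 3 through 9 defined the same way ...
volume=10 scheme=http size=10%
{noformat}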

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 1000000 requests, 78600 URLs were asked at least twice; one URL was 
 even requested 9 times. These repeated requests for the same URL are not close 
 together: more than 30 seconds can pass between one request and the next for 
 the same URL.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, yet got similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TS-3395) Hit ratio drops with high concurrency

2015-02-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329132#comment-14329132
 ] 

Luca Bruno commented on TS-3395:


Longer test with 10 vols:

!http://i.imgur.com/eCEUmOK.png!

Req/sec and response time are similar to the above, but I understand that's a 
disk bottleneck, of course.

This is the disk I/O:

http://i.imgur.com/QuNIav1.png

So this is what I expect, except that it should behave the same with 1 vol, not 
only with 10 vols.

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 1000000 requests, 78600 URLs were asked at least twice; one URL was 
 even requested 9 times. These repeated requests for the same URL are not close 
 together: more than 30 seconds can pass between one request and the next for 
 the same URL.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, yet got similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TS-3395) Hit ratio drops with high concurrency

2015-02-20 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329132#comment-14329132
 ] 

Luca Bruno edited comment on TS-3395 at 2/20/15 4:32 PM:
-

Longer test with 10 vols:

!http://i.imgur.com/eCEUmOK.png!

Req/sec and response time are similar to the above, but I understand that's a 
disk bottleneck, of course.

This is the disk I/O:

!http://i.imgur.com/QuNIav1.png!

So this is what I expect, except that it should behave the same with 1 vol, not 
only with 10 vols.


was (Author: lethalman):
Longer test with 10 vols:

!http://i.imgur.com/eCEUmOK.png!

Req/sec and response time are similar to the above, but I understand that's a 
disk bottleneck, of course.

This is the disk I/O:

http://i.imgur.com/QuNIav1.png

So this is what I expect, except that it should behave the same with 1 vol, not 
only with 10 vols.

 Hit ratio drops with high concurrency
 -

 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno
 Fix For: 5.3.0


 I'm doing some tests and I've noticed that the hit ratio drops with more than 
 300 simultaneous http connections.
 The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
 ram cache is disabled.
 The test is done with web-polygraph. Content size vary from 5kb to 20kb 
 uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
 after months. There's no Vary.
 !http://i.imgur.com/Zxlhgnf.png!
 Then I thought it could be a problem of polygraph. I wrote my own 
 client/server test code, it works fine also with squid, varnish and nginx. I 
 register a hit if I get either cR or cH in the headers.
 {noformat}
 2015/02/19 12:38:28 Starting 100 requests
 2015/02/19 12:37:58 Elapsed: 3m51.23552164s
 2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
 2015/02/19 12:37:58 Average size: 12.50kb/req
 2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
 2015/02/19 12:37:58 Errors: 0
 2015/02/19 12:37:58 Offered Hit ratio: 59.95%
 2015/02/19 12:37:58 Measured Hit ratio: 37.20%
 2015/02/19 12:37:58 Hit bytes: 4649000609
 2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
 2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
 {noformat}
 So similar results, 37.20% on average. Then I thought that could be a problem 
 of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
 ratio, but request rate is very slow compared to ATS for obvious reasons.
 Then I wanted to check if with 200 connections but with longer test time hit 
 ratio also dropped, but no, it's fine:
 !http://i.imgur.com/oMHscuf.png!
 So not a problem of my tests I guess.
 Then I realized by debugging the test server that the same url was asked 
 twice.
 Out of 1000000 requests, 78600 URLs were asked at least twice; one URL was 
 even requested 9 times. These repeated requests for the same URL are not close 
 together: more than 30 seconds can pass between one request and the next for 
 the same URL.
 I also tweaked the following parameters:
 {noformat}
 CONFIG proxy.config.http.cache.fuzz.time INT 0
 CONFIG proxy.config.http.cache.fuzz.min_time INT 0
 CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
 CONFIG proxy.config.http.cache.max_open_read_retries INT 4
 CONFIG proxy.config.http.cache.open_read_retry_time INT 500
 {noformat}
 And this is the result with polygraph, similar results:
 !http://i.imgur.com/YgOndhY.png!
 Tweaked the read-while-writer option, yet got similar results.
 Then I've enabled 1GB of ram, it is slightly better at the beginning, but 
 then it drops:
 !http://i.imgur.com/dFTJI16.png!
 traffic_top says 25% ram hit, 37% fresh, 63% cold.
 So given that it doesn't seem to be a concurrency problem when requesting the 
 url to the origin server, could it be a problem of concurrent write access to 
 the cache? So that some pages are not cached at all? The traffic_top fresh 
 percentage also makes me think it can be a problem in writing the cache.
 Not sure if I explained the problem correctly, ask me further information in 
 case. But in summary: hit ratio drops with a high number of connections, and 
 the problem seems related to pages that are not written to the cache.
 This is some related issue: 
 http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E
 Also this: 
 http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-3395) Hit ratio drops with high concurrency

2015-02-19 Thread Luca Bruno (JIRA)
Luca Bruno created TS-3395:
--

 Summary: Hit ratio drops with high concurrency
 Key: TS-3395
 URL: https://issues.apache.org/jira/browse/TS-3395
 Project: Traffic Server
  Issue Type: Bug
  Components: Cache
Reporter: Luca Bruno


I'm doing some tests and I've noticed that the hit ratio drops with more than 
300 simultaneous http connections.

The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
ram cache is disabled.

The test is done with web-polygraph. Content size vary from 5kb to 20kb 
uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
after months. There's no Vary.

!http://i.imgur.com/Zxlhgnf.png!

Then I thought it could be a problem of polygraph. I wrote my own client/server 
test code, it works fine also with other cache servers. I register a hit if I 
get either cR or cH in the headers.

{noformat}
2015/02/19 12:38:28 Starting 100 requests
2015/02/19 12:37:58 Elapsed: 3m51.23552164s
2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
2015/02/19 12:37:58 Average size: 12.50kb/req
2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
2015/02/19 12:37:58 Errors: 0
2015/02/19 12:37:58 Offered Hit ratio: 59.95%
2015/02/19 12:37:58 Measured Hit ratio: 37.20%
2015/02/19 12:37:58 Hit bytes: 4649000609
2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
{noformat}

So similar results, 37.20% on average. Then I thought that could be a problem 
of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
ratio, but request rate is very slow compared to ATS for obvious reasons.
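
For reference, a hit check of the kind used above can be done by looking for the 
cR/cH codes in the headers the proxy adds. A minimal Go sketch, not the author's 
actual client; the proxy address, the object paths, and the assumption that the 
codes show up in the Via header are all illustrative:

{noformat}
// Classify responses as hit/miss by looking for "cR" or "cH", the way the
// test above counts hits. Address, paths and header choice are assumptions.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func isHit(resp *http.Response) bool {
	via := resp.Header.Get("Via")
	return strings.Contains(via, "cR") || strings.Contains(via, "cH")
}

func main() {
	hits, total := 0, 0
	for i := 0; i < 1000; i++ {
		// Re-request a small set of objects so a share of them should be hits.
		resp, err := http.Get(fmt.Sprintf("http://127.0.0.1:8080/obj/%d", i%100))
		if err != nil {
			continue
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		if isHit(resp) {
			hits++
		}
		total++
	}
	if total > 0 {
		fmt.Printf("Measured hit ratio: %.2f%%\n", 100*float64(hits)/float64(total))
	}
}
{noformat}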

Then I wanted to check if with 200 connections but with longer test time hit 
ratio also dropped, but no, it's fine:

!http://i.imgur.com/oMHscuf.png!

So not a problem of my tests I guess.

Then I realized by debugging the test server that the same url was asked twice.
Out of 1000000 requests, 78600 URLs were asked at least twice; one URL was even 
requested 9 times. These repeated requests for the same URL are not close 
together: more than 30 seconds can pass between one request and the next for 
the same URL.
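
Counting requests per URL on the test server is enough to see this; a 
hypothetical sketch of such a counter (the port and payload are made up):

{noformat}
// A stand-in origin/test server that counts how many times each URL is
// requested; any count above 1 means the cache went back to the origin
// for the same URL. Port and payload are placeholders.
package main

import (
	"log"
	"net/http"
	"sync"
)

func main() {
	var mu sync.Mutex
	counts := make(map[string]int)
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		counts[r.URL.String()]++
		n := counts[r.URL.String()]
		mu.Unlock()
		if n > 1 {
			log.Printf("asked %d times for %s", n, r.URL)
		}
		w.Write([]byte("dummy payload"))
	})
	log.Fatal(http.ListenAndServe(":9000", nil))
}
{noformat}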

I also tweaked the following parameters:

{noformat}
CONFIG proxy.config.http.cache.fuzz.time INT 0
CONFIG proxy.config.http.cache.fuzz.min_time INT 0
CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
CONFIG proxy.config.http.cache.max_open_read_retries INT 4
CONFIG proxy.config.http.cache.open_read_retry_time INT 500
{noformat}

And this is the result with polygraph, similar results:

!http://imgur.com/oLO0M01!

Tweaked the read-while-writer option, yet got similar results.

Then I've enabled 1GB of ram, it is slightly better at the beginning, but then 
it drops:

!http://i.imgur.com/dFTJI16.png!

traffic_top says 25% ram hit, 37% fresh, 63% cold.

So given that it doesn't seem to be a concurrency problem when requesting the 
url to the origin server, could it be a problem of concurrent write access to 
the cache? So that some pages are not cached at all? The traffic_top fresh 
percentage also makes me think it can be a problem in writing the cache.

Not sure if I explained the problem correctly, ask me further information in 
case. But in summary: hit ratio drops with a high number of connections, and 
the problem seems related to pages that are not written to the cache.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TS-3395) Hit ratio drops with high concurrency

2015-02-19 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-3395:
---
Description: 
I'm doing some tests and I've noticed that the hit ratio drops with more than 
300 simultaneous http connections.

The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
ram cache is disabled.

The test is done with web-polygraph. Content size vary from 5kb to 20kb 
uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
after months. There's no Vary.

!http://i.imgur.com/Zxlhgnf.png!

Then I thought it could be a problem of polygraph. I wrote my own client/server 
test code, it works fine also with other cache servers. I register a hit if I 
get either cR or cH in the headers.

{noformat}
2015/02/19 12:38:28 Starting 100 requests
2015/02/19 12:37:58 Elapsed: 3m51.23552164s
2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
2015/02/19 12:37:58 Average size: 12.50kb/req
2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
2015/02/19 12:37:58 Errors: 0
2015/02/19 12:37:58 Offered Hit ratio: 59.95%
2015/02/19 12:37:58 Measured Hit ratio: 37.20%
2015/02/19 12:37:58 Hit bytes: 4649000609
2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
{noformat}

So similar results, 37.20% on average. Then I thought that could be a problem 
of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
ratio, but request rate is very slow compared to ATS for obvious reasons.

Then I wanted to check if with 200 connections but with longer test time hit 
ratio also dropped, but no, it's fine:

!http://i.imgur.com/oMHscuf.png!

So not a problem of my tests I guess.

Then I realized by debugging the test server that the same url was asked twice.
Out of 1000000 requests, 78600 URLs were asked at least twice; one URL was even 
requested 9 times. These repeated requests for the same URL are not close 
together: more than 30 seconds can pass between one request and the next for 
the same URL.

I also tweaked the following parameters:

{noformat}
CONFIG proxy.config.http.cache.fuzz.time INT 0
CONFIG proxy.config.http.cache.fuzz.min_time INT 0
CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
CONFIG proxy.config.http.cache.max_open_read_retries INT 4
CONFIG proxy.config.http.cache.open_read_retry_time INT 500
{noformat}

And this is the result with polygraph, similar results:

!http://i.imgur.com/YgOndhY.png!

Tweaked the read-while-writer option, yet got similar results.

Then I've enabled 1GB of ram, it is slightly better at the beginning, but then 
it drops:

!http://i.imgur.com/dFTJI16.png!

traffic_top says 25% ram hit, 37% fresh, 63% cold.

So given that it doesn't seem to be a concurrency problem when requesting the 
url to the origin server, could it be a problem of concurrent write access to 
the cache? So that some pages are not cached at all? The traffic_top fresh 
percentage also makes me think it can be a problem in writing the cache.

Not sure if I explained the problem correctly, ask me further information in 
case. But in summary: hit ratio drops with a high number of connections, and 
the problem seems related to pages that are not written to the cache.

This is some related issue: 
http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E

Also this: 
http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html

  was:
I'm doing some tests and I've noticed that the hit ratio drops with more than 
300 simultaneous http connections.

The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
ram cache is disabled.

The test is done with web-polygraph. Content size vary from 5kb to 20kb 
uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
after months. There's no Vary.

!http://i.imgur.com/Zxlhgnf.png!

Then I thought it could be a problem of polygraph. I wrote my own client/server 
test code, it works fine also with other cache servers. I register a hit if I 
get either cR or cH in the headers.

{noformat}
2015/02/19 12:38:28 Starting 100 requests
2015/02/19 12:37:58 Elapsed: 3m51.23552164s
2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
2015/02/19 12:37:58 Average size: 12.50kb/req
2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
2015/02/19 12:37:58 Errors: 0
2015/02/19 12:37:58 Offered Hit ratio: 59.95%
2015/02/19 12:37:58 Measured Hit ratio: 37.20%
2015/02/19 12:37:58 Hit bytes: 4649000609
2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
{noformat}

So similar results, 37.20% on average. Then I thought that could be a problem 
of how I'm testing stuff, and tried with 

[jira] [Updated] (TS-3395) Hit ratio drops with high concurrency

2015-02-19 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-3395:
---
Description: 
I'm doing some tests and I've noticed that the hit ratio drops with more than 
300 simultaneous http connections.

The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
ram cache is disabled.

The test is done with web-polygraph. Content size vary from 5kb to 20kb 
uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
after months. There's no Vary.

!http://i.imgur.com/Zxlhgnf.png!

Then I thought it could be a problem of polygraph. I wrote my own client/server 
test code, it works fine also with other cache servers. I register a hit if I 
get either cR or cH in the headers.

{noformat}
2015/02/19 12:38:28 Starting 100 requests
2015/02/19 12:37:58 Elapsed: 3m51.23552164s
2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
2015/02/19 12:37:58 Average size: 12.50kb/req
2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
2015/02/19 12:37:58 Errors: 0
2015/02/19 12:37:58 Offered Hit ratio: 59.95%
2015/02/19 12:37:58 Measured Hit ratio: 37.20%
2015/02/19 12:37:58 Hit bytes: 4649000609
2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
{noformat}

So similar results, 37.20% on average. Then I thought that could be a problem 
of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
ratio, but request rate is very slow compared to ATS for obvious reasons.

Then I wanted to check if with 200 connections but with longer test time hit 
ratio also dropped, but no, it's fine:

!http://i.imgur.com/oMHscuf.png!

So not a problem of my tests I guess.

Then I realized by debugging the test server that the same url was asked twice.
Out of 1000000 requests, 78600 URLs were asked at least twice; one URL was even 
requested 9 times. These repeated requests for the same URL are not close 
together: more than 30 seconds can pass between one request and the next for 
the same URL.

I also tweaked the following parameters:

{noformat}
CONFIG proxy.config.http.cache.fuzz.time INT 0
CONFIG proxy.config.http.cache.fuzz.min_time INT 0
CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
CONFIG proxy.config.http.cache.max_open_read_retries INT 4
CONFIG proxy.config.http.cache.open_read_retry_time INT 500
{noformat}

And this is the result with polygraph, similar results:

!http://i.imgur.com/YgOndhY.png!

Tweaked the read-while-writer option, yet got similar results.

Then I've enabled 1GB of ram, it is slightly better at the beginning, but then 
it drops:

!http://i.imgur.com/dFTJI16.png!

traffic_top says 25% ram hit, 37% fresh, 63% cold.

So given that it doesn't seem to be a concurrency problem when requesting the 
url to the origin server, could it be a problem of concurrent write access to 
the cache? So that some pages are not cached at all? The traffic_top fresh 
percentage also makes me think it can be a problem in writing the cache.

Not sure if I explained the problem correctly, ask me further information in 
case. But in summary: hit ratio drops with a high number of connections, and 
the problem seems related to pages that are not written to the cache.

This is some related issue: 
http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E

Also this: 
http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html

  was:
I'm doing some tests and I've noticed that the hit ratio drops with more than 
300 simultaneous http connections.

The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
ram cache is disabled.

The test is done with web-polygraph. Content size vary from 5kb to 20kb 
uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
after months. There's no Vary.

!http://i.imgur.com/Zxlhgnf.png!

Then I thought it could be a problem of polygraph. I wrote my own client/server 
test code, it works fine also with other cache servers. I register a hit if I 
get either cR or cH in the headers.

{noformat}
2015/02/19 12:38:28 Starting 100 requests
2015/02/19 12:37:58 Elapsed: 3m51.23552164s
2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
2015/02/19 12:37:58 Average size: 12.50kb/req
2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
2015/02/19 12:37:58 Errors: 0
2015/02/19 12:37:58 Offered Hit ratio: 59.95%
2015/02/19 12:37:58 Measured Hit ratio: 37.20%
2015/02/19 12:37:58 Hit bytes: 4649000609
2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
{noformat}

So similar results, 37.20% on average. Then I thought that could be a problem 
of how I'm testing stuff, and tried with 

[jira] [Updated] (TS-3395) Hit ratio drops with high concurrency

2015-02-19 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-3395:
---
Description: 
I'm doing some tests and I've noticed that the hit ratio drops with more than 
300 simultaneous http connections.

The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
ram cache is disabled.

The test is done with web-polygraph. Content size vary from 5kb to 20kb 
uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
after months. There's no Vary.

!http://i.imgur.com/Zxlhgnf.png!

Then I thought it could be a problem of polygraph. I wrote my own client/server 
test code, it works fine also with squid, varnish and nginx. I register a hit 
if I get either cR or cH in the headers.

{noformat}
2015/02/19 12:38:28 Starting 100 requests
2015/02/19 12:37:58 Elapsed: 3m51.23552164s
2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
2015/02/19 12:37:58 Average size: 12.50kb/req
2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
2015/02/19 12:37:58 Errors: 0
2015/02/19 12:37:58 Offered Hit ratio: 59.95%
2015/02/19 12:37:58 Measured Hit ratio: 37.20%
2015/02/19 12:37:58 Hit bytes: 4649000609
2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
{noformat}

So similar results, 37.20% on average. Then I thought that could be a problem 
of how I'm testing stuff, and tried with nginx cache. It achieves 60% hit 
ratio, but request rate is very slow compared to ATS for obvious reasons.

Then I wanted to check if with 200 connections but with longer test time hit 
ratio also dropped, but no, it's fine:

!http://i.imgur.com/oMHscuf.png!

So not a problem of my tests I guess.

Then I realized by debugging the test server that the same url was asked twice.
Out of 1000000 requests, 78600 URLs were asked at least twice; one URL was even 
requested 9 times. These repeated requests for the same URL are not close 
together: more than 30 seconds can pass between one request and the next for 
the same URL.

I also tweaked the following parameters:

{noformat}
CONFIG proxy.config.http.cache.fuzz.time INT 0
CONFIG proxy.config.http.cache.fuzz.min_time INT 0
CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.00
CONFIG proxy.config.http.cache.max_open_read_retries INT 4
CONFIG proxy.config.http.cache.open_read_retry_time INT 500
{noformat}

And this is the result with polygraph, similar results:

!http://i.imgur.com/YgOndhY.png!

Tweaked the read-while-writer option, yet got similar results.

Then I've enabled 1GB of ram, it is slightly better at the beginning, but then 
it drops:

!http://i.imgur.com/dFTJI16.png!

traffic_top says 25% ram hit, 37% fresh, 63% cold.

So given that it doesn't seem to be a concurrency problem when requesting the 
url to the origin server, could it be a problem of concurrent write access to 
the cache? So that some pages are not cached at all? The traffic_top fresh 
percentage also makes me think it can be a problem in writing the cache.

Not sure if I explained the problem correctly, ask me further information in 
case. But in summary: hit ratio drops with a high number of connections, and 
the problem seems related to pages that are not written to the cache.

This is some related issue: 
http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E

Also this: 
http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html

  was:
I'm doing some tests and I've noticed that the hit ratio drops with more than 
300 simultaneous http connections.

The cache is on a raw disk of 500gb and it's not filled, so no eviction. The 
ram cache is disabled.

The test is done with web-polygraph. Content size vary from 5kb to 20kb 
uniformly, expected hit ratio 60%, 2000 http connections, documents expire 
after months. There's no Vary.

!http://i.imgur.com/Zxlhgnf.png!

Then I thought it could be a problem of polygraph. I wrote my own client/server 
test code, it works fine also with other cache servers. I register a hit if I 
get either cR or cH in the headers.

{noformat}
2015/02/19 12:38:28 Starting 100 requests
2015/02/19 12:37:58 Elapsed: 3m51.23552164s
2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
2015/02/19 12:37:58 Average size: 12.50kb/req
2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
2015/02/19 12:37:58 Errors: 0
2015/02/19 12:37:58 Offered Hit ratio: 59.95%
2015/02/19 12:37:58 Measured Hit ratio: 37.20%
2015/02/19 12:37:58 Hit bytes: 4649000609
2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
{noformat}

So similar results, 37.20% on average. Then I thought that could be a problem 
of how I'm testing stuff, and 

[jira] [Updated] (TS-3391) Close origin connection after time/request

2015-02-16 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-3391:
---
Issue Type: New Feature  (was: Wish)

 Close origin connection after time/request
 --

 Key: TS-3391
 URL: https://issues.apache.org/jira/browse/TS-3391
 Project: Traffic Server
  Issue Type: New Feature
  Components: HTTP
Reporter: Luca Bruno

 Our origin server is a balancer, and sometimes we have to tweak weights of 
 the real backend servers.
 I haven't found this kind of option in ATS so I'd like to ask if this idea 
 could be accepted.
 The idea is to close an origin connection after some time has passed, or a 
 certain number of requests have been made. We're more interested in the fact 
 that after X seconds a connection will be closed, and a new one opened.
 This should be quite safe to implement in ATS itself theoretically: if the 
 condition has been met, remove the connection from the pool for new requests. 
 After all requests are complete, close that connection.
 It's not precise, but it guarantees that after a certain time/requests 
 eventually that connection will be closed.
 In practice, it will be closed at most after the given time + 
 transaction_active_timeout_out if I'm not wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TS-3391) Close origin connection after time/request

2015-02-16 Thread Luca Bruno (JIRA)
Luca Bruno created TS-3391:
--

 Summary: Close origin connection after time/request
 Key: TS-3391
 URL: https://issues.apache.org/jira/browse/TS-3391
 Project: Traffic Server
  Issue Type: Wish
  Components: HTTP
Reporter: Luca Bruno


Our origin server is a balancer, and sometimes we have to tweak weights of the 
real backend servers.

I haven't found this kind of option in ATS so I'd like to ask if this idea 
could be accepted.

The idea is to close an origin connection after some time has passed, or a 
certain number of requests have been made. We're more interested in the fact 
that after X seconds a connection will be closed, and a new one opened.

This should be quite safe to implement in ATS itself theoretically: if the 
condition has been met, remove the connection from the pool for new requests. After 
all requests are complete, close that connection.

It's not precise, but it guarantees that after a certain time/requests 
eventually that connection will be closed.
In practice, it will be closed at most after the given time + 
transaction_active_timeout_out if I'm not wrong.
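
A rough sketch of the idea, purely illustrative (hypothetical names, not ATS 
code): a pooled origin connection is retired from the pool once it exceeds a 
maximum age or request count, and is actually closed only when nothing is left 
in flight on it.

{noformat}
// Retire a pooled origin connection after a maximum age or request count.
// All names here are made up for illustration.
package main

import (
	"fmt"
	"time"
)

// pooledConn stands in for one keep-alive connection to the origin.
type pooledConn struct {
	opened   time.Time
	requests int
	inFlight int
	retired  bool // no longer handed out for new requests
}

// shouldRetire reports whether the connection exceeded its budget.
func (c *pooledConn) shouldRetire(maxAge time.Duration, maxRequests int) bool {
	return time.Since(c.opened) >= maxAge || c.requests >= maxRequests
}

// finishRequest is called when a request on this connection completes. A
// retired connection is closed only once nothing is in flight, matching the
// "remove from the pool, close after all requests complete" idea above.
func (c *pooledConn) finishRequest(maxAge time.Duration, maxRequests int) (closeNow bool) {
	c.inFlight--
	c.requests++
	if c.shouldRetire(maxAge, maxRequests) {
		c.retired = true
	}
	return c.retired && c.inFlight == 0
}

func main() {
	c := &pooledConn{opened: time.Now().Add(-90 * time.Second), inFlight: 1}
	if c.finishRequest(60*time.Second, 1000) {
		fmt.Println("connection exceeded max age: close it, open a new one")
	}
}
{noformat}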



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316517#comment-14316517
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 4:53 PM:
-

James:

I've pasted the manager.log above; diags had nothing related. And syslog clearly 
states that the server returned 502 to cop, which decided to kill the server 
after it had failed twice:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).
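
A throwaway sketch of such a client (assumptions: ATS answering as a reverse 
proxy on 127.0.0.1:8080, and unique query strings so nearly every request is a 
miss and opens an origin connection):

{noformat}
// Fire N concurrent requests at the proxy with unique query strings so that
// almost everything misses. Address and path are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	const concurrency = 1000
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			url := fmt.Sprintf("http://127.0.0.1:8080/test?miss=%d", i)
			resp, err := http.Get(url)
			if err != nil {
				return
			}
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
		}(i)
	}
	wg.Wait()
}
{noformat}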

Zhao:

q1: Whenever I restart ATS with `trafficserver restart`, the cache starts from 
zero. It's a 500gb raw disk cache. It's not supposed to work like this? Perhaps 
I'm missing some option. Either way, it would keep restarting the server, so 
fixing the cache issue wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still find this a bug.
q2.1: not sure sorry, what is httpSM?
q2.2: I think there are 2 options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does also for the heartbeat). For example it's not 
that nginx kills its worker children because it reached the maximum number of 
connections.
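
As an illustration of the second option (accept, answer with an error, close), 
a tiny sketch with a made-up limit; this is not how ATS or nginx are actually 
implemented:

{noformat}
// Keep accepting when over the limit, but answer the excess connections with
// an immediate 503 and close them instead of letting the whole process die.
// The limit and the listen address are placeholders.
package main

import (
	"fmt"
	"net"
	"sync/atomic"
)

func main() {
	const maxConns = 100
	var active int64

	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		panic(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		if atomic.AddInt64(&active, 1) > maxConns {
			// Over the limit: reject this connection, keep serving the rest.
			fmt.Fprint(conn, "HTTP/1.1 503 Service Unavailable\r\nConnection: close\r\n\r\n")
			conn.Close()
			atomic.AddInt64(&active, -1)
			continue
		}
		go func(c net.Conn) {
			defer func() { c.Close(); atomic.AddInt64(&active, -1) }()
			// ... serve requests on c here ...
		}(conn)
	}
}
{noformat}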

Sure I keep it simple, but robustness is also the fact that the server must not 
be killed just because it's busy. Being busy means it's working and must not be 
killed by the heartbeat. In which case it can be solved by using a dedicated 
connection for the heartbeat in my opinion OR not kill at all yet keep alerting 
in the log files for external monitoring.




was (Author: lethalman):
James:

I've pasted the manager.log above; diags had nothing related. And syslog clearly 
states that the server returned 502 to cop, which decided to kill the server 
after it had failed twice:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).

Zhao:

q1: Whenever I restart ATS with `trafficserver restart`, the cache starts from 
zero. It's a 500gb raw disk cache. It's not supposed to work like this? Perhaps 
I'm missing some option. Either way, it would keep restarting the server, so 
fixing the cache issue wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still find this a bug.
q2.1: not sure sorry, what is httpSM?
q2.2: I think there are 2 options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does also for the heartbeat). For example it's not 
that nginx kills its worker children because it reached the maximum number of 
connections.

Sure I keep it simple, but robustness is also the fact that the server must not 
be killed just because it's busy. Being busy means it's working and must not be 
killed by the heartbeat. In which case it can be solved by using a dedicated 
connection for the heartbeat in my opinion OR not kill at all, but keep 
alerting in the log files for external monitoring.



 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http 

[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316517#comment-14316517
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 4:53 PM:
-

James:

I've pasted the manager.log above; diags had nothing related. And syslog clearly 
states that the server returned 502 to cop, which decided to kill the server 
after it had failed twice:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).

Zhao:

q1: Whenever I restart ATS with `trafficserver restart`, the cache starts from 
zero. It's a 500gb raw disk cache. It's not supposed to work like this? Perhaps 
I'm missing some option. Either way, it would keep restarting the server, so 
fixing the cache issue wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still find this a bug.
q2.1: not sure sorry, what is httpSM?
q2.2: I think there are 2 options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does also for the heartbeat). For example it's not 
that nginx kills its worker children because it reached the maximum number of 
connections.

Sure I keep it simple, but robustness is also the fact that the server must not 
be killed just because it's busy. Being busy means it's working and must not be 
killed by the heartbeat. In which case it can be solved by using a dedicated 
connection for the heartbeat in my opinion.




was (Author: lethalman):
James:

I've pasted the manager.log above; diags had nothing related. And syslog clearly 
states that the server returned 502 to cop, which decided to kill the server 
after it had failed twice:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).

Zhao:

q1: Whenever I restart ATS with `trafficserver restart`, the cache starts from 
zero. It's a 500gb raw disk cache. It's not supposed to work like this? Perhaps 
I'm missing some option. Either way, it would keep restarting the server, so 
fixing the cache issue wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still find this a bug.
q2.1: not sure sorry, what is httpSM?
q2.2: I think there are 2 options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does also for the heartbeat). For example it's not 
that nginx kills its worker children because it reached the maximum number of 
connections.

Sure I keep it simple, but robustness is also the fact that the server must not 
be restarted just because it's busy. Being busy means it's working and must not be 
killed by the heartbeat. In which case it can be solved by using a dedicated 
connection for the heartbeat in my opinion.



 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: 

[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316517#comment-14316517
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 4:52 PM:
-

James:

I've pasted the manager.log above; diags had nothing related. And syslog clearly 
states that the server returned 502 to cop, which decided to kill the server 
after it had failed twice:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).

Zhao:

q1: Whenever I restart ATS with `trafficserver restart`, the cache starts from 
zero. It's a 500gb raw disk cache. It's not supposed to work like this? Perhaps 
I'm missing some option. Either way, it would keep restarting the server, so 
fixing the cache issue wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still find this a bug.
q2.1: not sure sorry, what is httpSM?
q2.2: I think there are 2 options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does also for the heartbeat). For example it's not 
that nginx kills its worker children because it reached the maximum number of 
connections.

Sure I keep it simple, but robustness is also the fact that the server must not 
be restarted just because it's busy. Being busy means it's working and must not be 
killed by the heartbeat. In which case it can be solved by using a dedicated 
connection for the heartbeat in my opinion.




was (Author: lethalman):
James:

I've pasted the manager.log above; diags had nothing related. And syslog clearly 
states that the server returned 502 to cop, which decided to kill the server 
after it had failed twice:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).

Zhao:

q1: Whenever I restart ATS, the cache starts from zero. It's a 500gb raw disk 
cache. It's not supposed to work like this? Perhaps I'm missing some option. 
Either way, it would keep restarting the server, so fixing the cache issue 
wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still find this a bug.
q2.1: not sure sorry, what is httpSM?
q2.2: I think there are 2 options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does also for the heartbeat). For example it's not 
that nginx kills its worker children because it reached the maximum number of 
connections.

Sure I keep it simple, but robustness is also the fact that the server must not 
be restarted just because it's busy. Being busy means it's working and must not be 
killed by the heartbeat. In which case it can be solved by using a dedicated 
connection for the heartbeat in my opinion.



 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 

[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316517#comment-14316517
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 4:53 PM:
-

James:

I've pasted the manager.log above; diags had nothing related. And syslog clearly 
states that the server returned 502 to cop, which decided to kill the server 
after it had failed twice:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).

Zhao:

q1: Whenever I restart ATS with `trafficserver restart`, the cache starts from 
zero. It's a 500gb raw disk cache. It's not supposed to work like this? Perhaps 
I'm missing some option. Either way, it would keep restarting the server, so 
fixing the cache issue wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still find this a bug.
q2.1: not sure sorry, what is httpSM?
q2.2: I think there are 2 options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does also for the heartbeat). For example it's not 
that nginx kills its worker children because it reached the maximum number of 
connections.

Sure I keep it simple, but robustness is also the fact that the server must not 
be killed just because it's busy. Being busy means it's working and must not be 
killed by the heartbeat. In which case it can be solved by using a dedicated 
connection for the heartbeat in my opinion OR not kill at all, but keep 
alerting in the log files.




was (Author: lethalman):
James:

I've pasted the manager.log above; diags had nothing related. And syslog clearly 
states that the server returned 502 to cop, which decided to kill the server 
after it had failed twice:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).

Zhao:

q1: Whenever I restart ATS with `trafficserver restart`, the cache starts from 
zero. It's a 500gb raw disk cache. It's not supposed to work like this? Perhaps 
I'm missing some option. Either way, it would keep restarting the server, so 
fixing the cache issue wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still find this a bug.
q2.1: not sure sorry, what is httpSM?
q2.2: I think there are 2 options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does also for the heartbeat). For example it's not 
that nginx kills its worker children because it reached the maximum number of 
connections.

Sure I keep it simple, but robustness is also the fact that the server must not 
be killed just because it's busy. Being busy means it's working and must not be 
killed by the heartbeat. In which case it can be solved by using a dedicated 
connection for the heartbeat in my opinion.



 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat 

[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316517#comment-14316517
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 5:26 PM:
-

James:

I've pasted the manager.log above; diags had nothing related. And syslog clearly 
states that the server returned 502 to cop, which decided to kill the server 
after it had failed twice:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

But I haven't found anything in the logs on why it returns 502, perhaps the 
connections to the origin server are full and it returns bad gateway for that 
reason also to the clients?

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).

Zhao:

q1: Whenever I restart ATS with `trafficserver restart`, the cache starts from 
zero. It's a 500gb raw disk cache. It's not supposed to work like this? Perhaps 
I'm missing some option (EDIT: it re-reads the cache, see comment below). 
Either way, it would keep restarting the server, so fixing the cache issue 
wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still find this a bug.
q2.1: not sure sorry, what is httpSM?
q2.2: I think there are 2 options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does also for the heartbeat). For example it's not 
that nginx kills its worker children because it reached the maximum number of 
connections.

Sure I keep it simple, but robustness is also the fact that the server must not 
be killed just because it's busy. Being busy means it's working and must not be 
killed by the heartbeat. In which case it can be solved by using a dedicated 
connection for the heartbeat in my opinion OR not kill at all yet keep alerting 
in the log files for external monitoring.




was (Author: lethalman):
James:

I've pasted the manager.log above; diags had nothing related. And syslog clearly 
states that the server returned 502 to cop, which decided to kill the server 
after it had failed twice:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

But I haven't found anything in the logs on why it returns 502, perhaps the 
connections to the origin server are full and it returns bad gateway for that 
reason also to the clients?

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).

Zhao:

q1: Whenever I restart ATS with `trafficserver restart`, the cache starts from 
zero. It's a 500gb raw disk cache. It's not supposed to work like this? Perhaps 
I'm missing some option. Either way, it would keep restarting the server, so 
fixing the cache issue wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still find this a bug.
q2.1: not sure sorry, what is httpSM?
q2.2: I think there are 2 options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does also for the heartbeat). For example it's not 
that nginx kills its worker children because it reached the maximum number of 
connections.

Sure I keep it simple, but robustness is also the fact that the server must not 
be killed just because it's busy. Being busy means it's working and must not be 
killed by the heartbeat. In which case it can be solved by using a dedicated 
connection for the heartbeat in my opinion OR not kill at all yet keep alerting 
in the log files for external monitoring.



 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm 

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316517#comment-14316517
 ] 

Luca Bruno commented on TS-3386:


James:

I've pasted the manager.log above; diags.log had nothing related. And syslog 
clearly shows that the server returned 502 to cop, and cop decided to kill the 
server after it failed 2 times:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).

Zhao:

q1: Whenever I restart ATS, the cache starts from zero. It's a 500gb raw disk 
cache. It's not supposed to work like this, is it? Perhaps I'm missing some 
option. Either way, it would keep restarting the server, so fixing the cache 
issue wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still consider this a bug.
q2.1: Not sure, sorry, what is httpSM?
q2.2: I think there are two options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does so for the heartbeat as well). nginx, for 
example, doesn't kill its worker children just because it reached the maximum 
number of connections.

Sure, I'll keep it simple, but robustness also means that the server must not 
be restarted just because it's busy. Being busy means it's working, and it must 
not be killed by the heartbeat check. In my opinion this could be solved by 
using a dedicated connection for the heartbeat.



 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache 

[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316517#comment-14316517
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 4:55 PM:
-

James:

I've pasted the manager.log above; diags.log had nothing related. And syslog 
clearly shows that the server returned 502 to cop, and cop decided to kill the 
server after it failed 2 times:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

But I haven't found anything in the logs explaining why it returns 502. Perhaps 
the connections to the origin server are saturated, and for that reason it 
returns Bad Gateway to the clients as well?

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).

Zhao:

q1: Whenever I restart ATS with `trafficserver restart`, the cache starts from 
zero. It's a 500gb raw disk cache. It's not supposed to work like this, is it? 
Perhaps I'm missing some option. Either way, it would keep restarting the 
server, so fixing the cache issue wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still consider this a bug.
q2.1: Not sure, sorry, what is httpSM?
q2.2: I think there are two options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does so for the heartbeat as well). nginx, for 
example, doesn't kill its worker children just because it reached the maximum 
number of connections.

Sure, I'll keep it simple, but robustness also means that the server must not 
be killed just because it's busy. Being busy means it's working, and it must 
not be killed by the heartbeat check. In my opinion this could be solved either 
by using a dedicated connection for the heartbeat, OR by not killing the server 
at all while still alerting in the log files for external monitoring.




was (Author: lethalman):
James:

I've pasted the manager.log above; diags.log had nothing related. And syslog 
clearly shows that the server returned 502 to cop, and cop decided to kill the 
server after it failed 2 times:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).

Zhao:

q1: Whenever I restart ATS with `trafficserver restart`, the cache starts from 
zero. It's a 500gb raw disk cache. It's not supposed to work like this, is it? 
Perhaps I'm missing some option. Either way, it would keep restarting the 
server, so fixing the cache issue wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still consider this a bug.
q2.1: Not sure, sorry, what is httpSM?
q2.2: I think there are two options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it does so for the heartbeat as well). nginx, for 
example, doesn't kill its worker children just because it reached the maximum 
number of connections.

Sure, I'll keep it simple, but robustness also means that the server must not 
be killed just because it's busy. Being busy means it's working, and it must 
not be killed by the heartbeat check. In my opinion this could be solved either 
by using a dedicated connection for the heartbeat, OR by not killing the server 
at all while still alerting in the log files for external monitoring.



 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316560#comment-14316560
 ] 

Luca Bruno commented on TS-3386:


Sorry, when I restart trafficserver it does indeed re-read the cache (after 
adding the udev rule for the raw disk). But of course when it gets restarted by 
cop, it restarts so often that I see the cache empty.
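
(For reference, such a udev rule is typically a one-liner that makes the raw 
device accessible to the ATS user; the device name, owner and group below are 
placeholders and depend on the local setup.)

{noformat}
# /etc/udev/rules.d/51-cache-disk.rules -- example only, adjust device and user
KERNEL=="sdb", OWNER="trafficserver", GROUP="trafficserver", MODE="0660"
{noformat}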

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server 
 Version: Apache 

[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316783#comment-14316783
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 7:07 PM:
-

I'm using the latest 5.2.0. Just to say, I've read some of the cop code and 
it's quite clear: if the heartbeat fails 2 times, the server is killed. I 
didn't pay attention to whether it's using a separate channel though.


was (Author: lethalman):
I'm using the latest 5.2.0. Just to say, I've read some of the cop code and 
it's quite clear: if the heartbeat fails 2 times, the server is killed. I 
didn't pay attention to whether it's using a separate connection though.

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:42 test-cache 

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316783#comment-14316783
 ] 

Luca Bruno commented on TS-3386:


I'm using the latest 5.2.0. Just to say, I've read some of the cop code and 
it's quite clear: if the heartbeat fails 2 times, the server is killed. I 
didn't pay attention to whether it's using a separate connection though.

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server 
 Version: Apache Traffic 

[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316350#comment-14316350
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 3:23 PM:
-

Even for 100 connections, it should limit the connections but not restart and 
lose the cache: that's what I call kidding. So if I have 3 like the 
default config suggests, and it reaches the maximum, it would still restart.

I think I will disable the kill signal sent to the child after a failed 
heartbeat; would you accept that kind of patch, including a config option?

Or find a way to return 200 instead of 502 when a heartbeat request is 
recognized.
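
(For illustration only: if such a patch were accepted, the knob could be a 
records.config entry along these lines. The option name below is purely 
hypothetical and does not exist in ATS 5.2.0.)

{noformat}
# Hypothetical option, not present in ATS 5.2.0: when set to 0, traffic_cop
# would only log/alert on failed heartbeats instead of killing traffic_server.
CONFIG proxy.config.cop.kill_server_on_heartbeat_failure INT 0
{noformat}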


was (Author: lethalman):
Even for 100 connections, it should limit the connections but not restart and 
lose the cache: that's what I call kidding. So if I have 3 like the 
default config suggests, and it reaches the maximum, it would still restart.

I think I will disable the kill signal sent to the child after a failed 
heartbeat; would you accept that kind of patch, including a config option?

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 

[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316316#comment-14316316
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 3:06 PM:
-

Ok my remap is now:
{noformat}
map http://test-cache.local:8080/ http://test-cache.local:8081/
{noformat}

where test-cache.local is a real entry in a DNS server on the local network. 
Things didn't change much.

Btw, I've found something valuable in error.log (it also happened before); the 
following line is repeated indefinitely:

{noformat}
20150211.15h53m56s RESPONSE: sent 192.168.199.31 status 502 (Connect Error 
internal error - server connection terminated/-1) for 
'http://test-cache.service.farm:8081/storage/test29429?size=11023&age=46230&sleep=0'
{noformat}

Please note that sending those concurrent requests directly to the backend 
server works just fine, without any error. Anyway, I believe an error in the 
backend should not result in ATS being restarted.

Debug:

{noformat}
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpSM::do_hostdb_lookup] Doing DNS Lookup
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[ink_cluster_time] local: 1423666415, highest_delta: 0, cluster: 1423666415
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[HttpTransact::OSDNSLookup] This was attempt 1
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpTransact::OSDNSLookup] DNS Lookup successful
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) [OSDNSLookup] 
DNS lookup for O.S. successful IP: 192.168.x.y
{noformat}

Then I noticed this:

{noformat}
[Feb 11 15:53:35.043] Server {0x2b7ddc565e00} DEBUG: (http) [293] hostdb update 
marking IP: 192.168.x.y:8081 as down
{noformat}

Also proxy.config.net.connections_throttle is 1000, and I do 1000 concurrent 
requests in my tests. Increasing the throttle to 1 seems to mitigate the 
problem.
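
(For reference, raising the throttle is a one-line records.config change 
followed by a restart; the value below is only an example.)

{noformat}
# records.config: keep the throttle well above the expected client concurrency
CONFIG proxy.config.net.connections_throttle INT 100000
{noformat}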


was (Author: lethalman):
Ok my remap is now:
{noformat}
map http://test-cache.local:8080/ http://test-cache.local:8081/
{noformat}

where test-cache.local is a real entry in a DNS server on the local network. 
Things didn't change much.

Btw, I've found something valuable in error.log (it also happened before); the 
following line is repeated indefinitely:

{noformat}
20150211.15h53m56s RESPONSE: sent 192.168.199.31 status 502 (Connect Error 
internal error - server connection terminated/-1) for 
'http://test-cache.service.farm:8081/storage/test29429?size=11023&age=46230&sleep=0'
{noformat}

Please note that sending those concurrent requests directly to the backend 
server works just fine, without any error. Anyway, I believe an error in the 
backend should not result in ATS being restarted.

Debug:

{noformat}
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpSM::do_hostdb_lookup] Doing DNS Lookup
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[ink_cluster_time] local: 1423666415, highest_delta: 0, cluster: 1423666415
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[HttpTransact::OSDNSLookup] This was attempt 1
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpTransact::OSDNSLookup] DNS Lookup successful
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) [OSDNSLookup] 
DNS lookup for O.S. successful IP: 192.168.x.y
{noformat}

Then I noticed this:

{noformat}
[Feb 11 15:53:35.043] Server {0x2b7ddc565e00} DEBUG: (http) [293] hostdb update 
marking IP: 192.168.x.y:8081 as down
{noformat}

Also proxy.config.net.connections_throttle is 1000, and I do 1000 concurrent 
requests in my tests.

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316350#comment-14316350
 ] 

Luca Bruno commented on TS-3386:


Even for 100 connections, it should limit the connections but not restart and 
lose the cache: that's what I call kidding. So if I have 3 like the 
default config suggests, and it reaches the maximum, it would still restart.

I think I will disable the kill signal sent to the child after a failed 
heartbeat; would you accept that kind of patch, including a config option?

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:44 

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316388#comment-14316388
 ] 

Luca Bruno commented on TS-3386:


We can detect anything abnormal by other means in our infrastructure. For 
example, it would suffice to alert "heartbeat failed because of 502" instead of 
killing the server. For us, it's better to wait than to lose terabytes of 
cache, which would then kill the origin servers afterwards.

Also, why is putting the heartbeat on a dedicated connection not good? What 
wasn't good when you tried it? Do you have some related work I can read?

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316316#comment-14316316
 ] 

Luca Bruno commented on TS-3386:


Ok my remap is now:
{noformat}
map http://test-cache.local:8080/ http://test-cache.local:8081/
{noformat}

where test-cache.local is a real entry in a DNS server on the local network. 
Things didn't change much.

Btw, I've found something valuable in error.log (it also happened before); the 
following line is repeated indefinitely:

{noformat}
20150211.15h53m56s RESPONSE: sent 192.168.199.31 status 502 (Connect Error 
internal error - server connection terminated/-1) for 
'http://test-cache.service.farm:8081/storage/test29429?size=11023&age=46230&sleep=0'
{noformat}

Please note that sending those concurrent requests directly to the backend 
server works just fine, without any error. Anyway, I believe an error in the 
backend should not result in ATS being restarted.

Debug:

{noformat}
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpSM::do_hostdb_lookup] Doing DNS Lookup
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[ink_cluster_time] local: 1423666415, highest_delta: 0, cluster: 1423666415
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[HttpTransact::OSDNSLookup] This was attempt 1
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpTransact::OSDNSLookup] DNS Lookup successful
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) [OSDNSLookup] 
DNS lookup for O.S. successful IP: 192.168.x.y
{noformat}

Then I noticed this:

{noformat}
[Feb 11 15:53:35.043] Server {0x2b7ddc565e00} DEBUG: (http) [293] hostdb update 
marking IP: 192.168.x.y:8081 as down
{noformat}

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316266#comment-14316266
 ] 

Luca Bruno commented on TS-3386:


Thanks a lot for your answer. I'm seeing this:

{noformat}
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_trans) Next action 
SM_ACTION_DNS_LOOKUP; OSDNSLookup
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http) [9] State 
Transition: SM_ACTION_API_CACHE_LOOKUP_COMPLETE - SM_ACTION_DNS_LOOKUP
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_seq) 
[HttpSM::do_hostdb_lookup] Doing DNS Lookup
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_trans) 
[ink_cluster_time] local: 1423665061, highest_delta: 0, cluster: 1423665061
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_trans) 
[HttpTransact::OSDNSLookup] This was attempt 1
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_seq) 
[HttpTransact::OSDNSLookup] DNS Lookup successful
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_trans) [OSDNSLookup] 
DNS lookup for O.S. successful IP: 127.0.0.1
{noformat}

Should I be looking for something else? This is my remap:
{noformat}
map http://127.0.0.1:8080/ http://127.0.0.1:8081/
{noformat}

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 

[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316316#comment-14316316
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 3:04 PM:
-

Ok my remap is now:
{noformat}
map http://test-cache.local:8080/ http://test-cache.local:8081/
{noformat}

where test-cache.local is a real entry in a DNS server on the local network. 
Things didn't change much.

Btw, I've found something valuable in error.log (it also happened before); the 
following line is repeated indefinitely:

{noformat}
20150211.15h53m56s RESPONSE: sent 192.168.199.31 status 502 (Connect Error 
internal error - server connection terminated/-1) for 
'http://test-cache.service.farm:8081/storage/test29429?size=11023&age=46230&sleep=0'
{noformat}

Please note that sending those concurrent requests directly to the backend 
server works just fine, without any error. Anyway, I believe an error in the 
backend should not result in ATS being restarted.

Debug:

{noformat}
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpSM::do_hostdb_lookup] Doing DNS Lookup
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[ink_cluster_time] local: 1423666415, highest_delta: 0, cluster: 1423666415
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[HttpTransact::OSDNSLookup] This was attempt 1
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpTransact::OSDNSLookup] DNS Lookup successful
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) [OSDNSLookup] 
DNS lookup for O.S. successful IP: 192.168.x.y
{noformat}

Then I noticed this:

{noformat}
[Feb 11 15:53:35.043] Server {0x2b7ddc565e00} DEBUG: (http) [293] hostdb update 
marking IP: 192.168.x.y:8081 as down
{noformat}

Also proxy.config.net.connections_throttle is 1000, and I do 1000 concurrent 
requests in my tests.
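
For reference, a minimal records.config sketch (assuming the stock CONFIG syntax; the 
value is only illustrative) that raises the throttle above the test concurrency, which 
might leave headroom for the heartbeat connection:

{noformat}
# records.config (sketch): keep the global connection throttle above the test
# concurrency so the heartbeat request is not competing for the last slots.
CONFIG proxy.config.net.connections_throttle INT 2000
{noformat}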


was (Author: lethalman):
Ok my remap is now:
{noformat}
map http://test-cache.local:8080/ http://test-cache.local:8081/
{noformat}

Where test-cache.local is a real entry in a DNS server on the local network. 
Things didn't change much.

Btw, I've found something valuable in the error.log (it also happened before); 
this line is repeated indefinitely:

{noformat}
20150211.15h53m56s RESPONSE: sent 192.168.199.31 status 502 (Connect Error 
internal error - server connection terminated/-1) for 
'http://test-cache.service.farm:8081/storage/test29429?size=11023&age=46230&sleep=0'
{noformat}

Please note that sending those concurrent requests directly to the backend server 
works just fine, without any error. In any case, I believe an error in the backend 
should not result in ATS being restarted.

Debug:

{noformat}
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpSM::do_hostdb_lookup] Doing DNS Lookup
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[ink_cluster_time] local: 1423666415, highest_delta: 0, cluster: 1423666415
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[HttpTransact::OSDNSLookup] This was attempt 1
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpTransact::OSDNSLookup] DNS Lookup successful
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) [OSDNSLookup] 
DNS lookup for O.S. successful IP: 192.168.x.y
{noformat}

Then I noticed this:

{noformat}
[Feb 11 15:53:35.043] Server {0x2b7ddc565e00} DEBUG: (http) [293] hostdb update 
marking IP: 192.168.x.y:8081 as down
{noformat}

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} 

[jira] [Created] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)
Luca Bruno created TS-3386:
--

 Summary: Heartbeat failed with high load, trafficserver restarted
 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno


I've been evaluating ATS for some days. I'm using it with mostly default 
settings, except that I've lowered the number of connections to the backend, use 
raw storage of 500 GB, and have disabled the RAM cache.
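
For context, a minimal sketch of how such a setup is typically expressed (assuming the 
stock storage.config/records.config syntax; the device path, the origin-connection 
setting and the values are assumptions, not the exact configuration used here):

{noformat}
# storage.config (sketch): one raw device used as a ~500 GB cache span
/dev/sdb

# records.config (sketch): a 0-byte RAM cache effectively disables it,
# and origin_max_connections caps the connections to the backend.
CONFIG proxy.config.cache.ram_cache.size INT 0
CONFIG proxy.config.http.origin_max_connections INT 100
{noformat}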

It was working fine, so I wanted to stress it more. I increased the test to 1000 
concurrent requests; the ATS worker then crashed, was restarted, and thus lost the 
whole cache.

/var/log/syslog:

{noformat}
Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
[LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
Killed
Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
[Alarms::signalAlarm] Server Process was reset
Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
Starting ---
Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server Version: 
Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 
at 13:04:42)
Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
[LocalManager::sendMgmtMsgToProcesses] Error writing message
Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
(last system error 32: Broken pipe)
Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status signal 
[32985 256]
Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
making sure traffic_server is dead
Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
---
Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 2015 
at 13:05:19)
Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
Starting ---
Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server Version: 
Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 
at 13:04:42)
Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
[LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
Killed
Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
[Alarms::signalAlarm] Server Process was reset
Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server 
Starting ---
Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server Version: 
Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 
at 13:04:42)
Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: 
RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:52 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:07:02 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:07:02 test-cache traffic_cop[32984]: server heartbeat 

[jira] [Updated] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

 [ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Bruno updated TS-3386:
---
Description: 
I've been evaluating ATS for some days. I'm using it with mostly default 
settings, except I've lowered the number of connections to the backend, I have 
a raw storage of 500gb, and disabled ram cache.

Working fine, then I wanted to stress it more. I've increased the test to 1000 
concurrent requests, then the ATS worker has been restarted and thus lost the 
whole cache.

/var/log/syslog:

{noformat}
Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
[LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
Killed
Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
[Alarms::signalAlarm] Server Process was reset
Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
Starting ---
Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server Version: 
Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 
at 13:04:42)
Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
[LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
[LocalManager::sendMgmtMsgToProcesses] Error writing message
Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
(last system error 32: Broken pipe)
Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status signal 
[32985 256]
Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
making sure traffic_server is dead
Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
---
Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 2015 
at 13:05:19)
Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
Starting ---
Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server Version: 
Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 
at 13:04:42)
Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
[LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
Killed
Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
[Alarms::signalAlarm] Server Process was reset
Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server 
Starting ---
Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server Version: 
Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 
at 13:04:42)
Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: 
RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:52 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:07:02 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:07:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:07:02 test-cache traffic_cop[32984]: killing server
Feb 11 10:07:02 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 

[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316143#comment-14316143
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 12:34 PM:
--

So it seems the server saturated its connections and replied 502 even to the 
heartbeat check, without recognizing that the request was a heartbeat request. 
Am I right? In that case, I'd like to either disable the heartbeat completely or 
start a second TCP server just for heartbeat, instead of mixing heartbeat and 
client connections within the same server.


was (Author: lethalman):
So it's like the server saturated the connections, and it replied 502 even to 
the heartbeat server, without analyzing the response. Am I right? In this case, 
I'd either like to disable heartbeat completely or start a second tcp server 
just for heartbeat, instead of mixing heartbeat + client connections within the 
same server.

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:42 test-cache traffic_cop[32984]: 

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316134#comment-14316134
 ] 

Luca Bruno commented on TS-3386:


Additionally, proxy.config.cop.core_signal is 0, which should mean no signal is 
sent to stop the server. Yet it seems signal 9 is sent anyway, unless I'm 
misunderstanding the option description.
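
For reference, the setting in question as it would appear in records.config (a sketch; 
0 is the value already in use here and, as I read the docs, should mean no core signal 
is sent when the server is stopped):

{noformat}
# records.config (sketch): 0 should mean "do not send a core-dumping signal"
# when traffic_cop decides to restart the server process.
CONFIG proxy.config.cop.core_signal INT 0
{noformat}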

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server 
 Version: Apache Traffic Server - 

[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316143#comment-14316143
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 12:31 PM:
--

So it's like the server saturated the connections, and it replied 502 even to 
the heartbeat server, without analyzing the response. Am I right? In this case, 
I'd either like to disable heartbeat completely or start a second tcp server 
just for heartbeat, instead of mixing heartbeat + client connections within the 
same server.


was (Author: lethalman):
So it's like the server saturated the connections, and it replied 502 even to 
the heartbeat server, without analyzing the response. Am I right? In this case, 
I'd either like to disable heartbeat completely or start a second tcp server 
just for heartbeat, instead of mixing heartbeat + client connections within the 
same server.

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316143#comment-14316143
 ] 

Luca Bruno commented on TS-3386:


So it's like the server saturated the connections, and it replied 502 even to 
the heartbeat server, without analyzing the response. Am I right? In this case, 
I'd either like to disable heartbeat completely or start a second tcp server 
just for heartbeat, instead of mixing heartbeat + client connections within the 
same server.

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- 

[jira] [Comment Edited] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316143#comment-14316143
 ] 

Luca Bruno edited comment on TS-3386 at 2/11/15 1:34 PM:
-

So it's like the server saturated the connections, and it replied 502 even to 
the heartbeat server, without analyzing whether the request was a heartbeat 
request. Am I right? In this case, I'd either like to disable heartbeat 
completely or start a second tcp server just for heartbeat, instead of mixing 
heartbeat + client connections within the same server, or keep the heartbeat 
for monitoring without killing the server.


was (Author: lethalman):
So it's like the server saturated the connections, and it replied 502 even to 
the heartbeat server, without analyzing whether the request was a heartbeat 
request. Am I right? In this case, I'd either like to disable heartbeat 
completely or start a second tcp server just for heartbeat, instead of mixing 
heartbeat + client connections within the same server.

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: