[jira] [Commented] (TS-1087) TSHttpTxnOutgoingAddrSet forward declaration does not match implementation
[ https://issues.apache.org/jira/browse/TS-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224817#comment-13224817 ]

B Wyatt commented on TS-1087:

I am not at head at the moment, but at least in my version the precedent set by the rest of the API was to pass a socklen_t as a parameter. Is that still the case? I can see the argument either way: the addition of a socklen_t parameter at least gives the backend a fighting chance to not read invalid memory if a plugin calls in with a malformed socket address (like a sockaddr_in with AF_INET6 for a family). In the case where the data is correct, it is useless. FWIW, the signature of the implementation includes the socklen_t, so either nobody was using this function with a recent version of Traffic Server (it would be unresolved at library load time), or they are already passing the socklen_t parameter and counting on a rogue forward declaration or voodoo to link it.

TSHttpTxnOutgoingAddrSet forward declaration does not match implementation

Key: TS-1087
URL: https://issues.apache.org/jira/browse/TS-1087
Project: Traffic Server
Issue Type: Bug
Components: TS API
Affects Versions: 3.1.0
Reporter: B Wyatt
Assignee: B Wyatt
Priority: Trivial
Fix For: 3.1.5
Attachments: txn-outgoing-addr.patch
Original Estimate: 1m
Remaining Estimate: 1m

ts.h.in lists the following declaration:

{code}
TSReturnCode TSHttpTxnOutgoingAddrSet(TSHttpTxn txnp, struct sockaddr const* addr);
{code}

However, the current implementation has this function signature:

{code}
tsapi TSReturnCode TSHttpTxnOutgoingAddrSet(TSHttpTxn txnp, struct sockaddr const* addr, socklen_t addrlen);
{code}

Traffic Server is unable to load plugins which use this function due to the unresolved symbol.
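For plugin authors hitting the unresolved symbol, the working call is the three-argument form that the implementation actually exports. A minimal sketch, assuming the implementation signature quoted above (the address value and error handling are invented for illustration):

{code}
// Hypothetical plugin snippet (not from the patch) calling the
// three-argument form of TSHttpTxnOutgoingAddrSet. Passing the
// socklen_t gives the backend a chance to reject a malformed
// family/length combination instead of reading past the struct.
#include <cstring>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <ts/ts.h>

static void
set_outgoing_addr(TSHttpTxn txnp)
{
  struct sockaddr_in addr;
  std::memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;                      // family matches the struct
  addr.sin_port   = htons(0);                     // 0 = let the OS pick a port
  inet_pton(AF_INET, "10.0.0.1", &addr.sin_addr); // hypothetical local address

  if (TSHttpTxnOutgoingAddrSet(txnp, reinterpret_cast<struct sockaddr *>(&addr),
                               sizeof(addr)) != TS_SUCCESS) {
    TSError("TSHttpTxnOutgoingAddrSet failed");
  }
}
{code}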
[jira] [Commented] (TS-1075) Port range bottleneck in transparent proxy mode
[ https://issues.apache.org/jira/browse/TS-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224821#comment-13224821 ]

B Wyatt commented on TS-1075:

I will investigate, as this will no doubt bite me as well (it may be biting me already).

Port range bottleneck in transparent proxy mode

Key: TS-1075
URL: https://issues.apache.org/jira/browse/TS-1075
Project: Traffic Server
Issue Type: Bug
Components: Core
Affects Versions: 3.0.1
Environment: CentOS 5.6, kernel 2.6.39.2 compiled with TPROXY support; ATS compiled as: ./configure --enable-tproxy
Reporter: Danny Shporer
Assignee: B Wyatt
Fix For: 3.1.3
Attachments: ports.patch

The Linux TPROXY stack only takes the local addresses into account when using dynamic bind (bind without specifying a specific port). This limits the usable ports to the local ephemeral range (around 30K by default, extensible to around 64K); together with the way Linux releases ports out of TIME-WAIT, this causes a bottleneck. One symptom of this is that traffic_cop cannot open a connection to the server to monitor it (it gets error 99 - address already in use) and kills it. The same problem appears when opening connections to the origin server.
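One conceivable way around the dynamic-bind limit is for the proxy to walk a configured port range itself instead of binding with port 0; the sketch below illustrates that idea only, and is not necessarily what ports.patch does (the function name, range policy, and error handling are all invented):

{code}
// Sketch: rather than relying on dynamic bind (port 0), which draws from
// the local-only ephemeral range, walk a port range explicitly for each
// spoofed client address. Assumes a TPROXY-capable kernel.
#include <cerrno>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IP_TRANSPARENT
#define IP_TRANSPARENT 19 // from <linux/in.h>
#endif

static int
bind_transparent(int fd, sockaddr_in *client_addr, uint16_t lo, uint16_t hi)
{
  int one = 1;
  // IP_TRANSPARENT allows binding to the non-local client address.
  if (setsockopt(fd, IPPROTO_IP, IP_TRANSPARENT, &one, sizeof(one)) < 0)
    return -1;

  for (unsigned port = lo; port <= hi; ++port) {
    client_addr->sin_port = htons(static_cast<uint16_t>(port));
    if (bind(fd, reinterpret_cast<sockaddr *>(client_addr),
             sizeof(*client_addr)) == 0)
      return 0;               // secured a port in the range
    if (errno != EADDRINUSE)  // a real error, not just a busy port
      return -1;
  }
  errno = EADDRINUSE;         // the whole range is exhausted
  return -1;
}
{code}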
[jira] [Commented] (TS-949) key-volume hash table is not consistent when a disk is marked as bad or removed due to failure
[ https://issues.apache.org/jira/browse/TS-949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169622#comment-13169622 ]

B Wyatt commented on TS-949:

Opened a new bug, TS-1050, which refers to this bug and addresses the data loss on volume addition problem.

key-volume hash table is not consistent when a disk is marked as bad or removed due to failure

Key: TS-949
URL: https://issues.apache.org/jira/browse/TS-949
Project: Traffic Server
Issue Type: Bug
Components: Cache
Affects Versions: 3.1.0
Environment: Multi-volume cache with apparently faulty drives
Reporter: B Wyatt
Assignee: John Plevyak
Fix For: 3.1.2
Attachments: TS-949-jp-1.patch, TS-949-jp2.patch, TS949-BW-p1.patch, explicit-pair.patch

The method for resolving collisions when distributing hash-table space to volumes for the object_key-volume hash table creates inconsistency when a disk is determined to be bad, or when a failed disk is removed from volume.config.

Background: The hash space is distributed by a round-robin draft in which each volume drafts a random index in the hash table until the hash space is exhausted. The random order in which a given volume drafts hash-table slots is consistent across reboot/crash/disk failure; however, when a volume attempts to draft a slot which is already occupied, it skips to its next random pick, and tries again until it finds an open slot. This ensures that the hash is partitioned evenly between volumes.

The issue: Resolving slot contention breaks the consistency, as it is dependent on the order in which the volumes draft. When rebuilding the hash after a disk failure, or after a reboot with fewer drives, a volume may secure an index that was previously occupied by the dead disk. In the old hash, the surviving volume would have selected another random index due to contention; if that index is taken by the next draft round, it will now hold an inconsistent key-volume result. The effects of one inconsistency then cascade, since whichever volume occupies that index after the dead disk is removed is now behind on its draft sequence as well.

An example:

||Disk||Draft Sequence||
|A|1,4,7,5|
|B|4,2,8,1|
|C|3,7,5,2|

Pre-failure hash table after 2 rounds of draft:

|A|B|C|B|C|?|A|?|

Post-failure of drive B, hash table after 3 rounds of draft:

|A|C|C|A|{color:red}A{color}|?|{color:red}C{color}|?|

Two slots have become inconsistent, and more will probably follow. These inconsistencies become objects stored in a volume but lost to the top-level cache for open/lookup.
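The draft scheme is small enough to simulate. The toy program below (not the Traffic Server code) replays the example sequences with and without volume B and reproduces the inconsistent slots:

{code}
// Toy reproduction of the draft described above. Each volume owns a fixed
// "random" draft sequence; on contention it skips to its next pick. Running
// the draft with and without volume B shows slots changing owners, which is
// the inconsistency reported here.
#include <cstdio>

constexpr int SLOTS = 8, VOLS = 3, PICKS = 4;

// Draft sequences from the example table (slots are 1-based in the text).
const int seq[VOLS][PICKS] = {{1, 4, 7, 5},   // A
                              {4, 2, 8, 1},   // B
                              {3, 7, 5, 2}};  // C

static void
draft(const bool alive[VOLS])
{
  char table[SLOTS + 1];
  int  next[VOLS] = {0, 0, 0}; // per-volume position in its sequence
  for (int i = 0; i <= SLOTS; ++i) table[i] = '?';

  bool progressed = true;
  while (progressed) {
    progressed = false;
    for (int v = 0; v < VOLS; ++v) {
      if (!alive[v]) continue;
      // Contention resolution: skip picks already taken. This is the step
      // that makes the final layout depend on which volumes are present.
      while (next[v] < PICKS && table[seq[v][next[v]]] != '?') ++next[v];
      if (next[v] < PICKS) {
        table[seq[v][next[v]++]] = static_cast<char>('A' + v);
        progressed = true;
      }
    }
  }
  for (int i = 1; i <= SLOTS; ++i) std::putchar(table[i]);
  std::putchar('\n');
}

int
main()
{
  const bool all[VOLS]       = {true, true, true};
  const bool without_b[VOLS] = {true, false, true};
  draft(all);       // full pre-failure layout: ABCBC?AB
  draft(without_b); // post-failure layout: ACCAA?C? -- owners have flipped
  return 0;
}
{code}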
[jira] [Commented] (TS-949) key-volume hash table is not consistent when a disk is marked as bad or removed due to failure
[ https://issues.apache.org/jira/browse/TS-949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167597#comment-13167597 ]

B Wyatt commented on TS-949:

Thanks John. I think the new patch should be more stable. I apologize for the misread of the previous patch; all of my volumes are matched in size, so I had erroneously tuned out the inclusion of vol-len in the initial value of forvol[i].

While I am not an enforcer of code quality, I think the particulars of this method should at the very least be documented in the patched code. I'll let someone else decide whether it is worth the effort to pretty it up.

All of this digging has brought up a new related issue (which I am pretty sure we cannot address at this level): object loss when adding volumes. The hash is now consistent; however, when a new volume supersedes an existing volume in the hash, any object that maps to that bucket but is currently stored on the old volume becomes inaccessible. I will probably create a new issue for that, as this one is solved in my book.

key-volume hash table is not consistent when a disk is marked as bad or removed due to failure

Key: TS-949
URL: https://issues.apache.org/jira/browse/TS-949
Project: Traffic Server
Issue Type: Bug
Components: Cache
Affects Versions: 3.1.0
Environment: Multi-volume cache with apparently faulty drives
Reporter: B Wyatt
Assignee: John Plevyak
Fix For: 3.1.2
Attachments: TS-949-jp-1.patch, TS-949-jp2.patch, TS949-BW-p1.patch
[jira] [Commented] (TS-949) key-volume hash table is not consistent when a disk is marked as bad or removed due to failure
[ https://issues.apache.org/jira/browse/TS-949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163641#comment-13163641 ]

B Wyatt commented on TS-949:

Thanks John, this scheme certainly solves disks failing or being added to the cache in a deterministic way. I tend to agree with you that the extra effort to guarantee an equal distribution of hash buckets is of questionable value.

It does look like there is some cruft in the patch. The score is multiplied by a value which is almost constant across the volumes and divided by an integer constant. The comments indicate that this may have been an attempt to even out the distribution, but as it would cause the same type of inconsistency on disk loss as the previous scheme, I assume it was disabled on purpose (by never decrementing forvol[*]). Either way, the result of the comparison is currently the same as the un-multiplied, un-divided comparison, as long as the integer truncation does not matter.

Also, I think ttable[i] = top; should be ttable[i] = mapping[top]; as the range of valid volume indices has holes when one or more disks have been declared bad.

key-volume hash table is not consistent when a disk is marked as bad or removed due to failure

Key: TS-949
URL: https://issues.apache.org/jira/browse/TS-949
Project: Traffic Server
Issue Type: Bug
Components: Cache
Affects Versions: 3.1.0
Environment: Multi-volume cache with apparently faulty drives
Reporter: B Wyatt
Assignee: John Plevyak
Fix For: 3.1.2
Attachments: TS-949-jp-1.patch
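To see why the mapping matters, here is a sketch of the indexing problem called out above. The names ttable, mapping, and top follow the comment; everything around them is invented for illustration: the draft winner is a position among the live volumes, but the table must record the stable volume index, and the two diverge once a disk has been declared bad.

{code}
// Hypothetical illustration of the suggested fix: storing the raw draft
// winner ("top") credits buckets to the wrong volume whenever the live
// volume list has holes; the mapping step restores the stable index.
#include <cstdio>

int
main()
{
  // Stable volume indices 0..3, with volume 2 declared bad.
  const int mapping[3] = {0, 1, 3}; // live slot -> stable volume index
  int ttable[4];

  for (int i = 0; i < 4; ++i) {
    int top = i % 3;            // stand-in for the draft winner (a live slot)
    ttable[i] = mapping[top];   // with "ttable[i] = top", buckets won by the
                                // third live volume would be credited to the
                                // dead volume 2 instead of volume 3
  }
  for (int i = 0; i < 4; ++i)
    std::printf("bucket %d -> volume %d\n", i, ttable[i]);
  return 0;
}
{code}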
[jira] [Commented] (TS-996) HTTPHdr::m_host goes stale if HdrHeap::evacuate_from_str_heaps is called
[ https://issues.apache.org/jira/browse/TS-996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132822#comment-13132822 ]

B Wyatt commented on TS-996:

I am auditioning a fast/dirty fix for this that caches a MIMEField pointer instead of the string pointer if the host comes from the MIMEHdr. For hosts in the URL, it uses m_cached_url's copy instead of a top-level copy, which should be immune to heap changes. Ideally, I think that code is due for a bit of cleaning, but this will hopefully suffice for now. The patch is attached as m_host.patch.

HTTPHdr::m_host goes stale if HdrHeap::evacuate_from_str_heaps is called

Key: TS-996
URL: https://issues.apache.org/jira/browse/TS-996
Project: Traffic Server
Issue Type: Bug
Components: HTTP, MIME
Affects Versions: 3.1.0
Reporter: B Wyatt
Attachments: m_host.patch

The HTTPHdr class stores, in m_host, a copy of the string pointer for the host name from either the URLImpl or the MIMEHdr. In both cases, these strings can be moved to a new heap underneath the HTTPHdr. When this happens, the process will at best read stale memory and survive, and at worst read unmapped memory and segfault.

Currently, HdrHeap::evacuate_from_str_heaps is called to coalesce multiple heaps into a single heap. When this happens, it directly accesses the low-level objects via ::move_strings calls. These objects do not possess the necessary information to inform parent objects about the change, nor does the HdrHeap directly inform interested parties.
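The failure mode is the classic cached-raw-pointer hazard; a generic sketch (no HdrHeap code, all names invented) shows the same mechanics:

{code}
// Generic illustration of the bug: a cached raw pointer into storage that
// is later moved. This mirrors what happens to HTTPHdr::m_host when
// evacuate_from_str_heaps relocates strings without telling the parent.
#include <cstdio>
#include <cstring>

int
main()
{
  char *heap = new char[16];
  std::strcpy(heap, "example.com");

  const char *m_host = heap;        // cached raw pointer, like m_host

  // "Evacuation": the string moves to a new, coalesced heap.
  char *new_heap = new char[1024];
  std::strcpy(new_heap, heap);
  delete[] heap;                    // the old heap is gone...

  std::printf("%s\n", m_host);      // ...but the cached pointer still points
                                    // into it: stale reads at best, a
                                    // segfault at worst
  delete[] new_heap;
  return 0;
}
{code}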
[jira] [Commented] (TS-934) Proxy Mutex null pointer crash
[ https://issues.apache.org/jira/browse/TS-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124201#comment-13124201 ]

B Wyatt commented on TS-934:

From the cores/callstacks I've seen, this is the same issue as TS-857.

Proxy Mutex null pointer crash

Key: TS-934
URL: https://issues.apache.org/jira/browse/TS-934
Project: Traffic Server
Issue Type: Bug
Components: Core
Affects Versions: 3.1.0
Environment: Debian 6.0.2, quad core, forward transparent proxy.
Reporter: Alan M. Carroll
Assignee: Alan M. Carroll
Fix For: 3.1.1
Attachments: ts-934-patch.txt

[Client report] We had the cache crash gracefully twice last night on a segfault. Both times the callstack produced by trafficserver's signal handler was:

/usr/bin/traffic_server[0x529596]
/lib/libpthread.so.0(+0xef60)[0x2ab09a897f60]
[0x2ab09e7c0a10]
/usr/bin/traffic_server(HttpServerSession::do_io_close(int)+0xa8)[0x567a3c]
/usr/bin/traffic_server(HttpVCTable::cleanup_entry(HttpVCTableEntry*)+0x4c)[0x56aff6]
/usr/bin/traffic_server(HttpVCTable::cleanup_all()+0x64)[0x56b07a]
/usr/bin/traffic_server(HttpSM::kill_this()+0x120)[0x57c226]
/usr/bin/traffic_server(HttpSM::main_handler(int, void*)+0x208)[0x571b28]
/usr/bin/traffic_server(Continuation::handleEvent(int, void*)+0x69)[0x4e4623]

I went through the disassembly, and the instruction it is on in ::do_io_close is loading the value of diags (not dereferencing it), so it is unlikely that that threw a segfault (unless this is somehow in thread-local storage and that is corrupt). The kernel message claimed that the instruction pointer was 0x4e438e, which in this build is in ProxyMutexPtr::operator ->(), on the instruction that dereferences the object pointer to get the stored mutex pointer (bingo!). So it would seem that at some point we are dereferencing a null smart pointer.
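For illustration, a stand-in for the reference-counted pointer (the real Ptr class differs) shows how a null smart pointer turns a member access into the observed fault, and the kind of guard the crash path lacked:

{code}
// Minimal stand-in for the smart-pointer dereference implicated above.
// If m_ptr is null, operator->() hands back a null object pointer and the
// subsequent member access faults, exactly as in the reported crash.
#include <cstdio>

template <typename T> class Ptr {
public:
  explicit Ptr(T *p = nullptr) : m_ptr(p) {}
  T *operator->() const { return m_ptr; } // no null check: caller faults
  explicit operator bool() const { return m_ptr != nullptr; }
private:
  T *m_ptr;
};

struct ProxyMutex {
  int thread_holding;
};

int
main()
{
  Ptr<ProxyMutex> mutex;   // null, e.g. after a cleanup race
  if (mutex)               // the guard the crash path was missing
    std::printf("%d\n", mutex->thread_holding);
  else
    std::printf("mutex is null; dereference would segfault\n");
  return 0;
}
{code}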