Re: Google OCSP service down
Thank you for this comprehensive incident report, Ryan. Your team's decision to improve the documentation around the right address for reporting is great to see! I wonder if it might also make sense to pull the contact information directly on https://pki.goog above the fold?

-Paul (reaperhulk)

On February 22, 2018 at 12:53:32 PM, Ryan Hurst via dev-security-policy (dev-security-policy@lists.mozilla.org) wrote:

[snip -- full report quoted in the original message below]
Re: Google OCSP service down
I wanted to follow up with our findings and a summary of this issue for the community. Below is a detailed account of what happened and how we resolved the issue; hopefully this will explain what happened and help others avoid a similar issue.

Summary
---

January 19th, at 08:40 UTC, a code push to improve OCSP generation for a subset of the Google-operated Certificate Authorities was initiated. The change was related to the packaging of generated OCSP responses. The first time this change was invoked in production was January 19th at 16:40 UTC.

NOTE: The publication of new revocation information to all geographies can take up to 6 hours to propagate. Additionally, clients and middle-boxes commonly implement caching behavior. This results in a large window during which clients may have begun to observe the outage.

NOTE: Most modern web browsers "soft-fail" in response to OCSP server availability issues, masking outages. Firefox, however, supports an advanced option that allows users to opt in to "hard-fail" behavior for revocation checking. An unknown percentage of Firefox users enable this setting. We believe most users who were impacted by the outage were these Firefox users.

About 9 hours after the deployment of the change began (2018-01-20 01:36 UTC), a user on Twitter mentioned that they were having problems with their hard-fail OCSP checking configuration in Firefox when visiting Google properties. This tweet and the few that followed during the outage period were not noticed by any Google employees until after the incident's post-mortem investigation had begun.

About 1 day and 22 hours after the push was initiated (2018-01-21 15:07 UTC), a user posted a message to the mozilla.dev.security.policy mailing list mentioning that they too were having problems with their hard-fail configuration in Firefox when visiting Google properties.
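For anyone wanting to reproduce the hard-fail behavior described in the note above: as far as I know, the relevant settings are the `security.OCSP.enabled` and `security.OCSP.require` preferences in about:config (shown here in user.js form; these pref names are my best understanding, not taken from the report, so verify them against your Firefox version):

```
// user.js -- opt in to hard-fail OCSP revocation checking (assumed pref names)
user_pref("security.OCSP.enabled", 1);    // 1 = fetch OCSP for server certificates
user_pref("security.OCSP.require", true); // treat OCSP fetch failures as fatal
```

With `security.OCSP.require` set to true, an unreachable or empty OCSP responder surfaces as a connection error rather than being silently ignored, which is why these users noticed the outage first.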
About two days after the push was initiated, a Google employee discovered the post and opened a ticket (2018-01-21 16:10 UTC). This triggered the remediation procedures, which began in under an hour. The issue was resolved about 2 days and 6 hours after it was introduced (2018-01-21 22:56 UTC). Once Google became aware of the issue, it took 1 hour and 55 minutes to resolve it, and an additional 4 hours and 51 minutes for the fix to be completely deployed.

No customer reports regarding this issue were sent to the notification addresses listed in Google's CPSs or on the repository websites for the duration of the outage. This extended the duration of the outage.

Background
--

Google's OCSP infrastructure works by generating OCSP responses in batches, with each batch being made up of the certificates issued by an individual CA. In the case of GIAG2, this batch is produced in chunks covering the certificates issued in the last 370 days. For each chunk, the GIAG2 CA is asked to produce the corresponding OCSP responses, the results of which are placed into a separate .tar file.

The issuer of GIAG2 has chosen to issue new certificates to GIAG2 periodically; as a result, GIAG2 has multiple certificates. Two of these certificates no longer have unexpired certificates associated with them. As a result, and as expected, the CA does not produce responses for the corresponding periods.

All .tar files produced during this process are then concatenated with the --concatenate command in GNU tar. This produces a single .tar file containing all of the OCSP responses for the given Certificate Authority; this .tar file is then distributed to our global CDN infrastructure for serving.

A change was made in how we batch these responses: instead of outputting many .tar files within a batch, a concatenation of all tar files was produced. The change in question triggered an unexpected behaviour in GNU tar which then manifested as an empty tarball.
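The batching pipeline described above can be sketched as follows. This is an illustrative reconstruction, not Google's actual code: it uses Python's tarfile module in place of GNU tar's --concatenate, and all names and structure are assumptions. The point it demonstrates is the pre-condition of the bug (a chunk with no unexpired certificates yields an archive with no members) and how a member-by-member merge tolerates that case:

```python
# Illustrative reconstruction of per-CA OCSP response batching -- NOT
# Google's actual code. Names and structure are assumptions.
import io
import tarfile


def make_chunk(members: dict) -> io.BytesIO:
    """Build one chunk archive from {filename: response_bytes}.

    A chunk whose certificates have all expired produces an archive
    with no members -- the pre-condition that triggered the bug.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def concatenate(chunks: list) -> io.BytesIO:
    """Combine chunk archives into one .tar for CDN distribution.

    Copying member-by-member means an empty chunk simply contributes
    nothing, rather than producing an empty combined archive.
    """
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as combined:
        for chunk in chunks:
            with tarfile.open(fileobj=chunk, mode="r") as tf:
                for member in tf.getmembers():
                    combined.addfile(member, tf.extractfile(member))
    out.seek(0)
    return out


# One chunk with a (placeholder) response, one empty chunk.
chunks = [make_chunk({"resp1.der": b"placeholder"}), make_chunk({})]
combined = concatenate(chunks)
with tarfile.open(fileobj=combined) as tf:
    print(tf.getnames())  # ['resp1.der']
```

A regression test covering the empty-chunk scenario -- asserting that the combined archive still lists every member from the non-empty chunks -- is exactly the test case the report says was missing.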
These "empty" updates ended up being distributed to our global CDN, effectively dropping some responses while continuing to serve responses for other CAs. During testing of the change, this behaviour was not detected, as the tests did not cover the scenario in which some chunks did not contain unexpired certificates.

Findings

- The outage only impacted sites with TLS certificates issued by the GIAG2 CA, as it was the only CA that met the required pre-conditions of the bug.
- The bug that introduced this failure manifested itself as an empty container of OCSP responses. The root cause of the issue was an unexpected behavior of GNU tar relating to concatenating tar files.
- The outage was observed by revocation service monitoring as "unknown certificate" (HTTP 404) errors. HTTP 404 errors are expected in OCSP responder operations; they are typically the result of poorly configured clients. These events are monitored, and a threshold does exist for an on-call escalation.
- Due to a configuration error the designated Google team did no
Re: Root Store Policy 2.6
I've added the issue of subordinate CA transfers to the list for policy version 2.6: https://github.com/mozilla/pkipolicy/issues/122

On Tue, Feb 20, 2018 at 4:50 PM, Ryan Sleevi wrote:

> On Tue, Feb 20, 2018 at 6:19 PM, Wayne Thayer wrote:
>
>> Ryan,
>>
>> On Fri, Feb 16, 2018 at 3:19 PM, Ryan Sleevi wrote:
>>
>>> Hi Wayne,
>>>
>>> One point of possible clarification that should be undertaken is with respect to https://github.com/mozilla/pkipolicy/blob/master/rootstore/policy.md#8-ca-operational-changes
>>>
>>> While this section is worded around CA's certificates, it would appear that some CAs have interpreted this to mean "root CAs", rather than "any certificates operated by the CA".
>>
>> My interpretation is that this section applies to certificates directly included in the Mozilla root store - i.e. root CAs.
>
> Interesting. This definitely means we have a gap in disclosure requirements, in which there exists a set of trust paths where there's no public awareness.
>
>>> An example of this would potentially appear to be QuoVadis. QuoVadis created several sub-CAs, under their control and audit regime. They then sold/transferred these to an entity closely linked with the United Arab Emirates, and known to be closely related to the intelligence services [1], and reportedly under investigation by the FBI. [2] This information comes by way of DarkMatter, as part of their request to join the CA/Browser Forum [3], and as far as I can tell, has not been discussed publicly here.
>>
>> DarkMatter's root inclusion request hasn't yet reached the public discussion phase: https://bugzilla.mozilla.org/show_bug.cgi?id=1427262
>
> The public discussion refers to the Section 8 process, which was meant to mitigate situations in which CAs transferred their trust. Transferring root certificates and intermediates is no different - it's still conferring trust to an organization unknown to Mozilla.
> Intermediate cross-signing at least has a disclosure within a week, which allows for some public awareness and review (and indeed, tooling has been built around it).
>
>>> DarkMatter reported to the Forum that "DM also operates 3 other Issuing CAs - one for EV, one for OV, and one for Client certificates. These 3 ICAs were issued under QuoVadis Roots and subsequently migrated to the DM infrastructure (as witnessed by our WT auditors) once our WebTrust Audits were successfully obtained. These 3 Issuing CAs have live end entity certificates being trusted within the browsers." This statement was made by Scott Rea, the Senior Vice President of Trust Services at DarkMatter.
>>>
>>> DarkMatter disclosed that these ICAs were previously under QuoVadis's audit [4], a period-of-time audit, and are now nominally in scope for DarkMatter's audit [5], or at least, we can expect to see that in the next update. DarkMatter's CP/CPS [6] notes that some certificates are under the QuoVadis CA3 - but it is ambiguous as to what policies are in place for those, given that they state "additional" policies, whether additive or separate. In any event, it would appear that the aforementioned EV and OV sub-CAs are likely [7] and [8]. At present, these disclosures are still represented as being under the QuoVadis audit in CCADB.
>>
>> In terms of policy, is the issue here that subordinate CAs - either newly issued by or newly transferred to an "existing" CA organization (i.e. one that had a current audit prior to generating or receiving the new sub-CA) - only show up in the CA organization's next regular audit? That is issue #32 (https://github.com/mozilla/pkipolicy/issues/32), one that I had not proposed tackling in version 2.6 of the policy.
>
> No, this is different, but related.
> In the case of Issue #32, it means that the certificate itself won't necessarily be listed in the scope of the Operating Organization's audit, even though they're operating to the audited CP/CPS. This is the general problem that audits only look retrospectively, and thus can't speak to future events.
>
> This goes a step further, which is that there will be no (public) disclosure of the transfer of control until 15 months after the transfer was executed, at least based on a reading that says Section 8 only applies to roots. This seems to go against the intent of both Section 8 and Section 5.3.2 - which tried to get timely disclosure of those events.
>
>>> It may be that QuoVadis intends to ensure their next audit covers the facility, state, and procedures at both QuoVadis' location and DarkMatter's location. It may alternatively be that the expectation is that, within a year of QuoVadis' audit, DarkMatter is expected to provide the audit. What is unclear, however, is whether any such disclosure was made to Mozilla regarding the change in Legal Ownership, Operationa