Re: Google OCSP service down

2018-02-21 Thread Paul Kehrer via dev-security-policy
Thank you for this comprehensive incident report, Ryan. Your team's decision
to improve the documentation around the right address for reporting issues is
great to see! I wonder if it might also make sense to put the contact
information directly on https://pki.goog, above the fold?

-Paul (reaperhulk)

Re: Google OCSP service down

2018-02-21 Thread Ryan Hurst via dev-security-policy
I wanted to follow up with our findings and a summary of this issue for the 
community. 

Below you will find details on what happened and how we resolved the issue.
Hopefully this will help explain what happened and help others avoid a
similar issue.

Summary
---
On January 19th at 08:40 UTC, a code push to improve OCSP generation for a
subset of the Google-operated Certificate Authorities was initiated. The change
was related to the packaging of generated OCSP responses. The first time this
change was invoked in production was January 19th at 16:40 UTC.

NOTE: Newly published revocation information can take up to 6 hours to
propagate to all geographies. Additionally, clients and middle-boxes commonly
implement caching behavior. This results in a large window during which clients
may have begun to observe the outage.

NOTE: Most modern web browsers “soft-fail” in response to OCSP server
availability issues, masking outages. Firefox, however, supports an advanced
option that allows users to opt in to “hard-fail” behavior for revocation
checking. An unknown percentage of Firefox users enable this setting. We
believe most users who were impacted by the outage were these Firefox users.
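
For reference, the advanced option referred to here is (to the best of my
knowledge) the security.OCSP.require preference in Firefox's about:config,
alongside security.OCSP.enabled. The sketch below, which appends those
preferences to a profile's user.js, is illustrative only; the profile path is
a placeholder.

    # Illustrative sketch only: enable hard-fail OCSP checking in a Firefox
    # profile by appending the relevant preferences to its user.js.
    # The profile path is a placeholder; the preference names are standard
    # Firefox prefs.
    from pathlib import Path

    profile = Path("/path/to/firefox/profile")  # placeholder profile directory

    prefs = [
        'user_pref("security.OCSP.enabled", 1);',    # fetch OCSP responses
        'user_pref("security.OCSP.require", true);', # hard-fail when OCSP is unavailable
    ]

    with open(profile / "user.js", "a") as f:
        f.write("\n".join(prefs) + "\n")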

About 9 hours after the deployment of the change began (2018-01-20 01:36 UTC),
a user on Twitter mentioned that they were having problems with their hard-fail
OCSP checking configuration in Firefox when visiting Google properties. This
tweet and the few that followed during the outage period were not noticed by
any Google employees until after the incident’s post-mortem investigation had
begun.

About 1 day and 22 hours after the push was initiated (2018-01-21 15:07 UTC), a
user posted a message to the mozilla.dev.security.policy mailing list in which
they mentioned that they, too, were having problems with their hard-fail
configuration in Firefox when visiting Google properties.

About two days after the push was initiated, a Google employee discovered the 
post and opened a ticket (2018-01-21 16:10 UTC). This triggered the remediation 
procedures, which began in under an hour.

The issue was resolved about 2 days and 6 hours from the time it was introduced 
(2018-01-21 22:56 UTC). Once Google became aware of the issue, it took 1 hour 
and 55 minutes to resolve the issue, and an additional 4 hours and 51 minutes 
for the fix to be completely deployed.

For the duration of the outage, no customer reports regarding this issue were
sent to the notification addresses listed in Google's CPSs or on the repository
websites. This extended the duration of the outage.

Background
--
Google's OCSP Infrastructure works by generating OCSP responses in batches, 
with each batch being made up of the certificates issued by an individual CA.

In the case of GIAG2, this batch is produced in chunks of certificates issued 
in the last 370 days. For each chunk, the GIAG2 CA is asked to produce the 
corresponding OCSP responses, the results of which are placed into a separate 
.tar file.

The issuer of GIAG2 has chosen to issue new certificates to GIAG2 periodically;
as a result, GIAG2 has multiple certificates. Two of these certificates no
longer have unexpired certificates associated with them. As a result, and as
expected, the CA does not produce responses for the corresponding periods.

All .tar files produced during this process are then concatenated with the
--concatenate command in GNU tar. This produces a single .tar file containing
all of the OCSP responses for the given Certificate Authority; this .tar file
is then distributed to our global CDN infrastructure for serving.
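
For illustration only (this is not our production pipeline code, and the file
names are hypothetical), the concatenation step is equivalent to invoking GNU
tar roughly as follows:

    # Illustration only -- not the production pipeline. Concatenate the
    # per-chunk OCSP response archives into a single archive using GNU tar's
    # --concatenate, as described above. File names are hypothetical.
    import subprocess

    chunk_archives = [
        "giag2-chunk-000.tar",
        "giag2-chunk-001.tar",
        "giag2-chunk-002.tar",
    ]

    # GNU tar appends the later archives onto the first one in place; the
    # resulting combined archive is what gets pushed to the CDN.
    subprocess.run(
        ["tar", "--concatenate", "--file", chunk_archives[0], *chunk_archives[1:]],
        check=True,
    )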

A change was made in how we batch these responses: instead of outputting many
.tar files within a batch, a single concatenation of all the .tar files was
produced.

The change in question triggered an unexpected behavior in GNU tar that
manifested as an empty tarball. These "empty" updates ended up being
distributed to our global CDN, effectively dropping some responses while
continuing to serve responses for other CAs.

During testing of the change, this behavior was not detected, as the tests did
not cover the scenario in which some chunks did not contain unexpired
certificates.
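
A pre-publication sanity check along the following lines would catch this
class of regression before the archive reaches the CDN. This is only a sketch:
the expected-count threshold is a placeholder command-line argument, not a
real production value.

    # Sketch of a pre-publication sanity check: refuse to publish a combined
    # OCSP archive that contains fewer responses than expected. The expected
    # count is supplied as a placeholder command-line argument.
    import sys
    import tarfile

    def count_responses(archive_path):
        """Count regular-file members (OCSP responses) in the archive."""
        with tarfile.open(archive_path) as tf:
            return sum(1 for m in tf.getmembers() if m.isfile())

    if __name__ == "__main__":
        archive, expected = sys.argv[1], int(sys.argv[2])
        actual = count_responses(archive)
        if actual < expected:
            sys.exit("refusing to publish %s: %d responses, expected >= %d"
                     % (archive, actual, expected))
        print("%s: %d responses, ok to publish" % (archive, actual))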

Findings
--------
- The outage only impacted sites with TLS certificates issued by the GIAG2 CA,
as it was the only CA that met the preconditions required to trigger the bug.
- The bug that introduced this failure manifested itself as an empty container 
of OCSP responses. The root cause of the issue was an unexpected behavior of 
GNU tar relating to concatenating tar files.
- The outage was observed by revocation service monitoring as “unknown
certificate” (HTTP 404) errors. HTTP 404 errors are expected in OCSP responder
operations; they are typically the result of poorly configured clients. These
events are monitored, and a threshold exists for an on-call escalation.
- Due to a configuration error the designated Google team did no

Re: Root Store Policy 2.6

2018-02-21 Thread Wayne Thayer via dev-security-policy
I've added the issue of subordinate CA transfers to the list for policy
version 2.6: https://github.com/mozilla/pkipolicy/issues/122

On Tue, Feb 20, 2018 at 4:50 PM, Ryan Sleevi wrote:

>
>
> On Tue, Feb 20, 2018 at 6:19 PM, Wayne Thayer wrote:
>
>> Ryan,
>>
>> On Fri, Feb 16, 2018 at 3:19 PM, Ryan Sleevi wrote:
>>
>>>
>>> Hi Wayne,
>>>
>>> One point of possible clarification that should be undertaken is with
>>> respect to
>>> https://github.com/mozilla/pkipolicy/blob/master/rootstore/policy.md#8-ca-operational-changes
>>>
>>> While this section is worded around CA's certificates, it would appear
>>> that some CAs have interpreted this to mean "root CAs", rather than "Any
>>> certificates operated by the CA"
>>>
>> My interpretation is that this section applies to certificates directly
>> included in the Mozilla root store - i.e. root CAs.
>>
>
> Interesting. This definitely means we have a gap in disclosure
> requirements, in which there exists a set of trust paths where there's no
> public awareness.
>
>
>>
>>
>>> An example of this would potentially appear to be QuoVadis. QuoVadis
>>> created several sub-CAs, under their control and audit regime. They then
>>> sold/transferred these to an entity closely linked with the United Arab
>>> Emirates, and known to be closely related to the intelligence services [1],
>>> and reportedly under investigation by the FBI. [2] This information comes
>>> by way of DarkMatter, as part of their request to join the CA/Browser Forum
>>> [3], and as far as I can tell, has not been discussed publicly here.
>>>
>> DarkMatter's root inclusion request hasn't yet reached the public
>> discussion phase: https://bugzilla.mozilla.org/show_bug.cgi?id=1427262
>>
>
> The public discussion refers to the Section 8 process, which was meant to
> mitigate situations in which CAs transferred their trust. Transferring root
> certificates and intermediates is no different - it's still conferring
> trust to an organization unknown to Mozilla. Intermediate cross-signing at
> least has a disclosure within a week, which allows for some public
> awareness and review (and indeed, tooling has been built around it).
>
>
>>
>>> DarkMatter reported to the Forum that "DM also operates 3 other Issuing
>>> CAs - one for EV, one for OV, and one for Client certificates. These 3 ICAs
>>> were issued under QuoVadis Roots and subsequently migrated to the DM
>>> infrastructure (as witnessed by our WT auditors) once our WebTrust Audits
>>> were successfully obtained. These 3 Issuing CAs have live end entity
>>> certificates being trusted within the browsers." This statement was made by
>>> Scott Rea, the Senior Vice President of Trust Services at DarkMatter.
>>>
>>> DarkMatter disclosed that these ICAs were previously under QuoVadis's
>>> period-of-time audit [4], and are now nominally in scope for DarkMatter's
>>> audit [5], or at least, we can expect to see that in the next update.
>>> DarkMatter's CP/CPS [6] notes that some certificates are under the
>>> QuoVadis CA3 - but it is ambiguous as to what policies are in place for
>>> those, given that they state "additional" policies, whether it's additive
>>> or separate. In any event, it would appear that the aforementioned EV and
>>> OV sub-CAs are likely [7] and [8]. At present, these disclosures are still
>>> represented as being under the QuoVadis audit in CCADB.
>>>
>> In terms of policy, is the issue here that subordinate CAs - either
>> newly issued by or newly transferred to an "existing" CA organization (i.e.
>> one that had a current audit prior to generating or receiving the new sub
>> CA) - only show up on the CA organization's next regular audit? That is
>> issue #32 (https://github.com/mozilla/pkipolicy/issues/32), one that I
>> had not proposed tackling in version 2.6 of the policy.
>>
>
> No, this is different, but related. In the case of Issue #32, it means
> that the certificate itself won't necessarily be listed in the scope of the
> Operating Organization's audit, even though they're operating to the
> audited CP/CPS. This is the general problem that audits only look
> retrospectively, and thus can't speak to future events.
>
> This goes a step further, which is that there will be no (public)
> disclosure of the transfer of control until 15 months after the transfer
> was executed, at least based on a reading that says Section 8 only applies
> to roots. This seems to go against the intent of both Section 8 and Section
> 5.3.2 - which tried to get timely disclosure of those events.
>
>
>>
>>
>>> It may be that QuoVadis intends to ensure their next audit covers the
>>> facility, state, and procedures at both QuoVadis' location and DarkMatter's
>>> location. It may alternatively be that, within a year of QuoVadis'
>>> audit, DarkMatter is expected to provide the audit.
>>> What is unclear, however, is whether any such disclosure was made to
>>> Mozilla regarding the change in Legal Ownership, Operationa