GTS - OCSP serving issue 2020-04-09

2020-04-14 Thread Andy Warner via dev-security-policy
m.d.s.p community, Google Trust Services just filed
https://bugzilla.mozilla.org/show_bug.cgi?id=1630040 which contains the
same information as the report that follows.

From 2020-04-08 16:25 UTC to 2020-04-09 05:40 UTC, Google Trust Services'
EJBCA-based CAs (GIAG4, GIAG4ECC, GTSY1-4) served empty OCSP data, which led
the OCSP responders to return unauthorized.

These CAs exist for issuance of custom certificate profiles and
certificates for test sites for inactive roots. Our primary CAs (GTS CA 1O1
and GTS CA 1D2) were unaffected. The problem self-corrected, but we have
added safeguards to prevent recurrence.

1. How your CA first became aware of the problem (e.g. via a problem report
submitted to your Problem Reporting Mechanism, a discussion in
mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and
the time and date.

Monitoring detected the issue on 2020-04-08 at 16:35 UTC. The root cause
was identified within hours. The issue was automatically remediated by the
next scheduled cycle of OCSP response generation and push to the CDN, while
debugging and fixes were still ongoing.

2. A timeline of the actions your CA took in response. A timeline is a
date-and-time-stamped sequence of all relevant events. This may include
events before the incident was reported, such as when a particular
requirement became applicable, or a document changed, or a bug was
introduced, or an audit was done.

2020-04-08, 11:29 UTC - Scheduled system update begins
2020-04-08, 14:00 UTC - Incorrect OCSP archives are generated
2020-04-08, 15:03 UTC - Scheduled system update concludes
2020-04-08, 16:20 UTC - Incorrect OCSP responses pushed to CDN
2020-04-08, 16:35 UTC - First production monitoring alert fires
2020-04-08, 22:00 UTC - Correct OCSP archives are generated automatically
2020-04-09, 00:20 UTC - Correct OCSP responses pushed to CDN
2020-04-09, 05:40 UTC - Monitoring confirms all probes are passing

3. Whether your CA has stopped, or has not yet stopped, issuing
certificates with the problem. A statement that you have will be considered
a pledge to the community; a statement that you have not requires an
explanation.

The affected CAs are only used for infrequent, manual custom certificate
issuance. The only certificate issued during this period was a manually
issued post-update test certificate, used to validate the upgrade made to
resolve the issue. The issue in question was also specific to refreshing
OCSP responses, not to certificate issuance.

4. A summary of the problematic certificates. For each problem: number of
certs, and the date the first and last certs with that problem were issued.

The only certificate issued during this period was a manually issued
post-update test certificate, used to validate the upgrade made to resolve
the issue. The test certificate was a valid and fully compliant issuance.

5. The complete certificate data for the problematic certificates. The
recommended way to provide this is to ensure each certificate is logged to
CT and then list the fingerprints or crt.sh IDs, either in the report or as
an attached spreadsheet, with one list per distinct problem.

No certificates were issued aside from the manually issued post-update test
certificate used to validate the upgrade.

6. Explanation about how and why the mistakes were made or bugs introduced,
and how they avoided detection until now.

Our creation of OCSP responses and packaging them for serving is designed
to fail if any sub-command fails, using set -e. However, if the function
call is part of an AND or OR sequence (i.e. using the '&&' or '||' control
operators), set -e is suppressed inside the function.

The tool we use to fetch OCSP responses from EJBCA correctly returned a
non-zero exit code (due to no OCSP responses being generated because EJBCA
was not running), but because it was called inside a function with its own
error handling (using && syntax), the script continued without handling the
error properly and wrongly used empty tar.gz files with no responses in
them. The bug had existed for multiple years as a potential race condition
and we did not encounter it previously.
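
As an illustration only (this is not the GTS tooling, which the report
describes as a shell script), the same fetch-then-package flow can be sketched
in Python so that a failed fetch can never fall through to packaging. The
function name fetch_then_package and the stand-in commands are hypothetical.

```python
import subprocess
import sys
from typing import Callable, Sequence


def fetch_then_package(
    fetch_cmd: Sequence[str], package: Callable[[bytes], bytes]
) -> bytes:
    """Run the fetch step and package its output only if it succeeded."""
    # check=True raises CalledProcessError on any non-zero exit status, so the
    # error is surfaced explicitly instead of depending on shell errexit rules.
    result = subprocess.run(list(fetch_cmd), capture_output=True, check=True)
    return package(result.stdout)


if __name__ == "__main__":
    # Stand-in for a fetch tool that exits non-zero because nothing was generated.
    failing_fetch = [sys.executable, "-c", "import sys; sys.exit(3)"]
    try:
        fetch_then_package(failing_fetch, lambda data: data)
    except subprocess.CalledProcessError as err:
        print(f"fetch failed with exit status {err.returncode}; nothing packaged")
```

The design point is that the exit status is checked at the call site, so
wrapping the call in additional error handling cannot silently mask it.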

Quality tests are executed before publication to the CDN; however, those
tests accommodate empty responses as a valid condition because it is
something that can and does happen.

This condition did not repeat on the following update of the OCSP
responses. As a result, the next update resolved the issue. Our monitoring
caught the issue, enabling prompt root cause analysis and resolution.

7. List of steps your CA is taking to resolve the situation and ensure such
issuance will not be repeated in the future, accompanied with a timeline of
when your CA expects to accomplish these things.

No certificate issuance aside from a valid manually issued post update test
certificate to validate the upgrade took place during this period.

The logic error that led to incorrect OCSP responses being served has been
corrected, is checked in and in production. Additionally, checks have 

Re: DigiCert OCSP services returns 1 byte

2019-09-23 Thread Andy Warner via dev-security-policy
The CRL question is not about it being a requirement, but rather the fact
that it could / would lead to disparate treatment between CRL and OCSP for
the same certificate, which does not feel right.

On the CT quorum issue, we use a mix of the most available sharded logs and
that is the failure rate we're observing. We have a few ideas for
improvements we're working on. If other operators are seeing much different
success rates, we'd love to compare notes. We're using the published best
practices, spreading load and using sharded logs, so an implementation
issue is not obvious if there is one. That said, other groups within Google
including the CT team also exchange messages with CT logs in fairly high
volumes, so we may experience atypically high rate-limiting due to all
being bucketed together.

CAA validations are only good for 8 hours, so the suggestion of a year
misses the much shorter timeline that needs to be honored for CAA.

--
Andy Warner
Google Trust Services

On Mon, Sep 23, 2019 at 3:57 PM Kurt Roeckx wrote:

> On Mon, Sep 23, 2019 at 02:53:26PM -0700, Andy Warner via
> dev-security-policy wrote:
> >
> > 1. The new text added to the Mozilla Recommended and Required Practices
> for this topic states only OCSP status is required for precertificates.
> Many CAs provide both CRLs and OCSP services and it would be problematic if
> these two mechanisms provided different answers to the same question.
> >
> > The practice of revoking non-issued certificates would therefore lead to
> CRL growth which would further make reliable revocation checking on
> bandwidth constrained clients more difficult.
>
> There have been suggestions to revoke them, but it's my
> understanding that there is no such requirement.
>
> > 2. There seem to be a number of assumptions that precertificate issuance
> and certificate issuance is roughly atomic. In reality, a quorum of SCTs is
> required prior to final certificate issuance, so that is not the case.
>
> I don't see anybody suggesting that, nor how it's relevant.
>
> With all the uptime requirements on the logs and the number of
> available logs, I don't see why you should have a failure rate
> of 1 in 2000, and that more seems like an implementation problem.
>
> > 3. This raises the question of how much time a CA has from the time they
> issue a precertificate to when the final certificate must be issued. When
> there are logs ecosystem issues that are beyond the control of a CA, the
> gap can easily be orders of magnitude higher than normal operating
> conditions.
>
> And what is the issue with that?
>
> > * Clarifications
> >
> > This in turn raises the question if CAs can re-use authorization data
> such as CAA records or domain authorizations from the precertificate? If a
> final certificate has not been issued due to a persistent quorum failure,
> and that failure persists longer than the validity of the used
> authorization data, can the authorizations that were done prior to the
> precertificate issuance be re-used?
>
> So 1 year is sometimes not enough to get SCTs?
>
>
> Kurt
>
>


___
dev-security-policy mailing list
dev-security-policy@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-security-policy


Re: DigiCert OCSP services returns 1 byte

2019-09-23 Thread Andy Warner via dev-security-policy
The last thing we intended was for our prior mail to be interpreted as negative 
and without substance.  That said, it is clear our mail was not received in the 
light in which it was intended.

We would like to rectify that. We have been closely monitoring this thread and 
as it began to converge on a conclusion we started planning for each of our CA 
environments what if any changes would be required and what solutions compliant 
with our understanding of the conclusions would look like.

With that background, our understanding is that the goal of this change is to 
make it easier to monitor the issuance and revocation practices of a CA based 
on the existence of precertificates, i.e. extending the use of certificate 
revocation data to include this monitoring use case. We see value in this and 
are supportive of the overall change, as it is clear that CT has made a 
significant quality improvement to the WebPKI as a whole and this provides 
additional incremental benefit.

However, we see several challenges that we want to discuss, in particular:

1. The new text added to the Mozilla Recommended and Required Practices for 
this topic states only OCSP status is required for precertificates. Many CAs 
provide both CRLs and OCSP services and it would be problematic if these two 
mechanisms provided different answers to the same question. 

The practice of revoking non-issued certificates would therefore lead to CRL 
growth which would further make reliable revocation checking on bandwidth 
constrained clients more difficult.

Though this tax may be deemed acceptable, there is a clear impact and GTS feels 
that increasing CRL sizes for this use case is not in the best interest of 
users. We can see both sides of the argument, but we think a bit more detail is 
required to ensure our implementations align with best practices and user 
interests.

2. There seem to be a number of assumptions that precertificate issuance and 
certificate issuance are roughly atomic. In reality, a quorum of SCTs is 
required prior to final certificate issuance, so that is not the case.

CAs can be rate limited by logs, or logs can experience availability issues, 
so achieving quorum may require retries or fail altogether. GTS, for example, 
sees approximately 0.05% of orders delayed or abandoned because of an 
inability to achieve quorum.

As a result of this, the existence of a precertificate is possible without a 
final certificate having been issued. With the wider availability of sharded 
logs, this number has been improving, but it continues to be our most common 
cause of issuance delay or order abandonment.

3. This raises the question of how much time a CA has from the time they issue 
a precertificate to when the final certificate must be issued. When there are 
log ecosystem issues that are beyond the control of a CA, the gap can easily 
be orders of magnitude higher than under normal operating conditions.

Likewise, there is the question of how soon the revocation information must be 
produced and reachable by an interested party (e.g. someone who has never seen 
the certificate in question but still wants to know the status of that 
certificate). [Aside, Wayne, you specifically said relying parties earlier, did 
you intend to say interested party or relying party? We have some additional 
questions if relying party was actually intended, as using it in that context 
seems to redefine what a relying party is.]

This “reachable” part is particularly meaningful in that when using a CDN there 
are often phased rollouts that can take hours to complete. Today, the BRs 
leave this ambiguous; the only statement in this area is that new information 
must be published every four days:

"The CA SHALL update information provided via an Online Certificate Status 
Protocol at least every four days. OCSP responses from this service MUST have a 
maximum expiration time of ten days."

With this change, it would seem there needs to be a lower bound defined for how 
quickly the information needs to be available if it is to be an effective 
monitoring tool.
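
To make the quoted BR text concrete, here is a minimal Python sketch of the two 
bounds it states. It is an illustration under assumptions: it reads "maximum 
expiration time" as nextUpdate minus thisUpdate, and the function and parameter 
names are hypothetical. It says nothing about the lower bound discussed above, 
which the BRs do not define.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

MAX_UPDATE_INTERVAL = timedelta(days=4)   # "at least every four days"
MAX_VALIDITY = timedelta(days=10)         # "maximum expiration time of ten days"


def check_ocsp_freshness(
    this_update: datetime,
    next_update: datetime,
    previous_this_update: Optional[datetime] = None,
) -> List[str]:
    """Return a list of problems found; an empty list means both bounds hold."""
    problems = []
    # Assumed reading: validity is measured from thisUpdate to nextUpdate.
    if next_update - this_update > MAX_VALIDITY:
        problems.append("response validity exceeds ten days")
    # Only checked when the previous response's thisUpdate is known.
    if (
        previous_this_update is not None
        and this_update - previous_this_update > MAX_UPDATE_INTERVAL
    ):
        problems.append("more than four days since the previous update")
    return problems


if __name__ == "__main__":
    t0 = datetime(2019, 9, 23, tzinfo=timezone.utc)
    print(check_ocsp_freshness(t0, t0 + timedelta(days=11)))
```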

* Clarifications

This in turn raises the question if CAs can re-use authorization data such as 
CAA records or domain authorizations from the precertificate? If a final 
certificate has not been issued due to a persistent quorum failure, and that 
failure persists longer than the validity of the used authorization data, can 
the authorizations that were done prior to the precertificate issuance be 
re-used? If the precertificate is a promise to issue the exact same cert, it 
would seem to imply yes, but there are plenty of real-world scenarios where 
that would not be sensible or in line with the requester's intent. If there 
were delays and the CAA record changes between initial validation for the 
precertificate and re-validation for actual issuance, what is the correct 
course of action? 

* Process

On Thursday last week, Wayne added the topic to Recommended and Required 

Re: DigiCert OCSP services returns 1 byte

2019-09-20 Thread Andy Warner via dev-security-policy
Google Trust Services (GTS) reached out to Wayne directly, but I'm also posting 
here as the conversation seems to be rapidly converging on solutions. GTS still 
has reservations that the proposed solutions may be problematic to implement 
and may leave a number of CAs and one very common CA vendor in a bind to get 
from their current state to whatever the final state is cleanly. While 
Mozilla's requirements and recommendations are not strictly binding, they carry 
a great deal of weight and could lead to rapid implementation of a sub-standard 
solution. 

Google Trust Services would like to see the current precertificate 
'requirements' moved to the 'recommendations' section with a note explaining 
that once the formal details are worked out via bylaw changes (preferably) or 
further discussion on m.d.s.p (if bylaw changes are deemed too slow), they will 
become requirements. 

Sorry to post late in the process like this. Unfortunately, as a globally 
distributed team within a much larger company, the Google Trust Services team 
cannot always move and post as quickly as we'd like. We will follow up early 
next week with more details about our concerns, but there are a number of 
complex interactions and subtly conflicting requirements that seem best served 
by taking the time to ensure the final state is not settled on in haste. It 
would be great to achieve consistency sooner rather than later, so a 
time-bounded window to get there seems best to balance convergence against a 
rush to decisions that may adversely affect the ecosystem or be a challenge to 
live with for years.

--
Andy Warner
Google Trust Services

On Friday, September 20, 2019 at 1:20:02 PM UTC-7, Curt Spann wrote:
> This is a great discussion and I want to thank everyone for their continued 
> input. Let me try and summarize my interpretation based on the input from 
> this thread and related RFC.
> 
> My interpretation is an “unknown” OCSP response should be used in the 
> following conditions:
> 1. When the OCSP request contains an issuerNameHash and issuerKeyHash for 
> which the OCSP responder is NOT authoritative (wrong issuing CA).
> 2. When the OCSP request contains an issuerNameHash and issuerKeyHash for 
> which the OCSP responder IS authoritative (correct issuing CA) but for 
> whatever reason the OCSP responder does not know the status of the requested 
> certificates and ONLY if the certificate for which the status is requested 
> contains another OCSP responder URL available in the AIA extension.
> 
> My interpretation is a “revoked” OCSP response should be used in the 
> following conditions:
> 1. When the OCSP request contains an issuerNameHash and issuerKeyHash for 
> which the OCSP responder IS authoritative and the requested certificate has 
> been revoked.
> 2. When the OCSP request contains an issuerNameHash and issuerKeyHash for 
> which the OCSP responder IS authoritative and the CA corresponding to the 
> issuerNameHash and issuerKeyHash has been revoked.
> 3. When the OCSP request contains an issuerNameHash and issuerKeyHash for 
> which the OCSP responder IS authoritative and the requested certificate has 
> not been issued. This OCSP response MUST include the extended revoked 
> definition response extension in the response, indicating that the OCSP 
> responder supports the extended definition of the "revoked" state to also 
> cover non-issued certificates. The SingleResponse related to this non-issued 
> certificate MUST specify the revocation reason certificateHold (6), MUST 
> specify the revocationTime January 1, 1970, and MUST NOT include a CRL 
> references extension or any CRL entry extensions. [1]
> 
> I agree number 3 above is in conflict with the BRs as pointed out by Wayne.
> 
> - Curt
> 
> [1] RFC 6960: https://tools.ietf.org/html/rfc6960



Re: Google Trust Services - CRL handling of expired certificates not fully compliant with RFC 5280 Section 3.3

2019-09-13 Thread Andy Warner via dev-security-policy
A quick follow-up to close this out.

The push to fully address the issue was completed globally shortly before 16:00 
UTC on 2019-09-02.

After additional review, we're confident the only certificates affected were 
these two:
https://crt.sh/?id=760396354
https://crt.sh/?id=759833603

Google Trust Services considers this matter fully addressed. We will of course 
continue our ongoing internal review program, but no other work or information 
is outstanding at this point.

--
Andy Warner
Google Trust Services

On Friday, August 30, 2019 at 2:39:51 PM UTC-4, Andy Warner wrote:
> This is an initial report and we expect to provide some additional details 
> and the completion timeline after a bit more verification and full deployment 
> of in-flight mitigations. We are posting the most complete information we 
> have currently to comply with Mozilla reporting timelines and will follow-up 
> with additional details soon.
> 
> 1. How your CA first became aware of the problem and the time and date.
> 
> While performing an internal review and assessment of the CRL generation 
> system for Google Trust Services' GTS CA 1O1 on August 16, 2019, it was 
> discovered that the CRL generation service did not include CRL entries of 
> expired certificates. The periodic job only considered certificates with 
> valid lifetimes. This does not conform to RFC 5280 Section 3.3 which states 
> that “An entry MUST NOT be removed from the CRL until it appears on one 
> regularly scheduled CRL issued beyond the revoked certificate's validity 
> period.”  We expect that few, if any, clients have been impacted.  For a 
> client to be impacted they would have to: have a clock skewed to a time 
> before the not-after field of the certificate; and have a CRL published 
> after expiration that dropped the revoked certificate.
> 
> 
> 2. A timeline of the actions your CA took in response. A timeline is a 
> date-and-time-stamped sequence of all relevant events. This may include 
> events before the incident was reported, such as when a particular 
> requirement became applicable, or a document changed, or a bug was 
> introduced, or an audit was done.
> 
> August 16, 2019 15:00 UTC - Reviewer realizes that CRL will not publish for 
> one update past expiration
> August 16, 2019 16:00 UTC - Reviewer checks for other issues & talks to peers 
> to confirm problem
> August 16, 2019 17:00 UTC - Bug is filed to fix the issue with a proposed 
> design fix
> August 16, 2019 23:30 UTC - Fix is sent for review
> August 20, 2019 16:00 UTC - Remediation work is discussed & assigned
> August 20, 2019 18:00 UTC - Query to inspect revoked certificates is created 
> and sent to be run by production team for initial analysis.
> August 21, 2019 10:40 UTC - Production team runs query and returns result
> August 21, 2019 15:00 UTC - Reviewer analyzes data
> August 21, 2019 20:30 UTC - Reviewer asks for a follow up query to ascertain 
> if any certificates did not make it onto the CRL 
> August 22, 2019 07:00 UTC - Initial attempt at updating test systems with fix.
> August 22, 2019 09:00 UTC - Updating of test systems aborted due to 
> (unrelated) issues.
> August 22, 2019 07:00 UTC - Production team runs query for CRLs that may have 
> missed a certificate
> August 22, 2019 15:00 UTC - Reviewer ascertains that certificates under 
> question were on a CRL
> August 26, 2019 11:00 UTC - Second attempt at updating test systems with fix.
> August 26, 2019 13:00 UTC - Test systems updated, confirmed integrity of 
> fixed software.
> August 27, 2019 09:00 UTC - Confirmed fix is effective on test systems.
> August 27, 2019 10:00 UTC - present: Ongoing staged deployment to production 
> systems. Should complete fully by September 3, 2019 17:00 UTC (slightly 
> extended window due to push policies around holiday weekends. The rollout was 
> staged in accordance with Google's standard rollout procedures.)
> 
> 
> 3. Whether your CA has stopped, or has not yet stopped, issuing certificates 
> with the problem. 
> 
> The affected CA software has been patched.  It now populates expired 
> certificates in the CRL for 7 days after their expiration to ensure they 
> appear in at least one regularly issued CRL update.  Automated testing was 
> added as part of the same patch to check that revoked certificates are kept 
> in the CRL.  The patch was developed, tested, reviewed and landed within the 
> codebase by August 19, 2019.  The CRL entry removal bug has been fully 
> remediated.
> 
> 
> 4. A summary of the problematic certificates. For each problem: number of 
> certs, and the date the first and last certs with that problem were issued.
> 
> Investigation began on August 20, 2019 to discover the potential impact of 
> the logic bug. The CRL generation had contained the bug since its inception, 
> affecting all issuance under GTS 1O1 since March 2018. There were 200,263 
> revoked certificates during that time window. Almost all certificates were 
> for internal monitoring 

Google Trust Services - CRL handling of expired certificates not fully compliant with RFC 5280 Section 3.3

2019-08-30 Thread Andy Warner via dev-security-policy
This is an initial report and we expect to provide some additional details and 
the completion timeline after a bit more verification and full deployment of 
in-flight mitigations. We are posting the most complete information we have 
currently to comply with Mozilla reporting timelines and will follow-up with 
additional details soon.

1. How your CA first became aware of the problem and the time and date.

While performing an internal review and assessment of the CRL generation system 
for Google Trust Services' GTS CA 1O1 on August 16, 2019, it was discovered 
that the CRL generation service did not include CRL entries of expired 
certificates. The periodic job only considered certificates with valid 
lifetimes. This does not conform to RFC 5280 Section 3.3 which states that “An 
entry MUST NOT be removed from the CRL until it appears on one regularly 
scheduled CRL issued beyond the revoked certificate's validity period.”  We 
expect that few, if any, clients have been impacted.  For a client to be 
impacted they would have to: have a clock skewed to a time before the not-after 
field of the certificate; and have a CRL published after expiration that 
dropped the revoked certificate.


2. A timeline of the actions your CA took in response. A timeline is a 
date-and-time-stamped sequence of all relevant events. This may include events 
before the incident was reported, such as when a particular requirement became 
applicable, or a document changed, or a bug was introduced, or an audit was 
done.

August 16, 2019 15:00 UTC - Reviewer realizes that the CRL will not publish 
revoked entries for one update past expiration
August 16, 2019 16:00 UTC - Reviewer checks for other issues & talks to peers 
to confirm problem
August 16, 2019 17:00 UTC - Bug is filed to fix the issue with a proposed 
design fix
August 16, 2019 23:30 UTC - Fix is sent for review
August 20, 2019 16:00 UTC - Remediation work is discussed & assigned
August 20, 2019 18:00 UTC - Query to inspect revoked certificates is created 
and sent to be run by production team for initial analysis.
August 21, 2019 10:40 UTC - Production team runs query and returns result
August 21, 2019 15:00 UTC - Reviewer analyzes data
August 21, 2019 20:30 UTC - Reviewer asks for a follow up query to ascertain if 
any certificates did not make it onto the CRL 
August 22, 2019 07:00 UTC - Initial attempt at updating test systems with fix.
August 22, 2019 09:00 UTC - Updating of test systems aborted due to (unrelated) 
issues.
August 22, 2019 07:00 UTC - Production team runs query for CRLs that may have 
missed a certificate
August 22, 2019 15:00 UTC - Reviewer ascertains that certificates under 
question were on a CRL
August 26, 2019 11:00 UTC - Second attempt at updating test systems with fix.
August 26, 2019 13:00 UTC - Test systems updated, confirmed integrity of fixed 
software.
August 27, 2019 09:00 UTC - Confirmed fix is effective on test systems.
August 27, 2019 10:00 UTC - present: Ongoing staged deployment to production 
systems. Should complete fully by September 3, 2019 17:00 UTC (slightly 
extended window due to push policies around holiday weekends. The rollout was 
staged in accordance with Google's standard rollout procedures.)


3. Whether your CA has stopped, or has not yet stopped, issuing certificates 
with the problem. 

The affected CA software has been patched.  It now populates expired 
certificates in the CRL for 7 days after their expiration to ensure they appear 
in at least one regularly issued CRL update.  Automated testing was added as 
part of the same patch to check that revoked certificates are kept in the CRL.  
The patch was developed, tested, reviewed and landed within the codebase by 
August 19, 2019.  The CRL entry removal bug has been fully remediated.
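
The retention rule described above can be sketched as follows. This is a 
minimal Python illustration, not the patched CA software: the RevokedCert type 
and crl_entries function are hypothetical, and the sketch assumes CRLs are 
issued at least once within the 7-day window so that every entry appears on a 
CRL issued after the certificate's expiration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

# Keep entries for 7 days past expiration, as described in the report.
RETENTION_AFTER_EXPIRY = timedelta(days=7)


@dataclass(frozen=True)
class RevokedCert:
    serial: int
    not_after: datetime
    revoked_at: datetime


def crl_entries(revoked: Iterable[RevokedCert], crl_time: datetime) -> List[RevokedCert]:
    """Select the revoked certificates to list on a CRL issued at crl_time."""
    return [
        cert
        for cert in revoked
        # The behavior described as the bug corresponds to comparing against
        # cert.not_after alone, which drops entries as soon as they expire.
        if cert.revoked_at <= crl_time < cert.not_after + RETENTION_AFTER_EXPIRY
    ]


if __name__ == "__main__":
    expiry = datetime(2019, 8, 1, tzinfo=timezone.utc)
    cert = RevokedCert(serial=1, not_after=expiry, revoked_at=expiry - timedelta(days=10))
    print(len(crl_entries([cert], expiry + timedelta(days=3))))
```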


4. A summary of the problematic certificates. For each problem: number of 
certs, and the date the first and last certs with that problem were issued.

Investigation began on August 20, 2019 to discover the potential impact of the 
logic bug. The CRL generation had contained the bug since its inception, 
affecting all issuance under GTS 1O1 since March 2018. There were 200,263 
revoked certificates during that time window. Almost all certificates were for 
internal monitoring specific to checking revocation. The few non-monitoring 
certificates were all revocations by clients following rotation of certificates 
and not due to compromises.


5. The complete certificate data for the problematic certificates. The 
recommended way to provide this is to ensure each certificate is logged to CT 
and then list the fingerprints or crt.sh IDs, either in the report or as an 
attached spreadsheet, with one list per distinct problem.

crt.sh IDs to follow, waiting on confirmation that the 2 test certificates 
mentioned below are the only cases where the issue was surfaced.

The team looked for revoked certificates from first issuance that never 
appeared within a published CRL from operation of CA until August 21, 

Re: Google Trust Services - Minor SCT issue disclosure

2018-08-24 Thread Andy Warner via dev-security-policy
The code at issue evolved as CT requirements changed. What started off as a
very simple conditional grew into a more complex if / else if block with
somewhat complicated logic and inline checks. As part of the fix, we
simplified the conditionals and refactored the inline checks to make use of
nice clear IsExternallyOperated() and IsGoogleOperated() functions. The end
result is a much more readable and clear set of logic that is easier to
test and we expanded test coverage. I think the big lesson for the
community is that it would have been better to have refactored earlier
rather than evolving the code to the point it became more complicated than
it needed to be.

On Thu, Aug 23, 2018 at 9:40 AM Ryan Sleevi wrote:

>
>
> On Thu, Aug 23, 2018 at 8:50 AM, Andy Warner via dev-security-policy <
> dev-security-policy@lists.mozilla.org> wrote:
>>
>> * NOTE: The bug was due to an 'if/else' chain fall through. The code in
>> question has been refactored to be simpler and more readable.
>>
>
> Andy,
>
> It might be good for the community if you could describe the processes
> before and after the change, so that other CAs can help prevent similar
> issues with their own embedding systems.
>




Re: Google Trust Services - Minor SCT issue disclosure

2018-08-23 Thread Andy Warner via dev-security-policy
Google provides SCTs via embedding and during SSL handshaking depending on
the certificate and how it is served. In this case, all of the affected
certs used embedded SCTs and the issue was the selection of which SCTs to
include because we submit to more CT logs than required, but only embed the
required number of SCTs to keep cert sizes as small as possible. These
certs were submitted to 4 CT logs, 2 Google, 2 non-Google, but the embedded
SCTs were only from the 2 Google logs, not one Google and one non-Google.
The CA signed 4 correct SCTs and all 4 were submitted to CT logs; the
problem was the embedding logic for the SCTs.
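
As a rough illustration of the selection rule described above (one Google and
one non-Google SCT out of a larger set), here is a minimal Python sketch. It
is not the GTS code: the Sct type, its fields, and select_scts_to_embed are
hypothetical names, and picking the first SCT of each kind is an arbitrary
choice made for brevity.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Sct:
    log_name: str
    google_operated: bool


def select_scts_to_embed(scts: Sequence[Sct]) -> List[Sct]:
    """Pick exactly one Google-operated and one externally operated SCT."""
    google = [s for s in scts if s.google_operated]
    external = [s for s in scts if not s.google_operated]
    # Fail loudly instead of falling through to an all-Google selection.
    if not google or not external:
        raise ValueError("need at least one Google and one non-Google SCT")
    return [google[0], external[0]]


if __name__ == "__main__":
    available = [
        Sct("google-a", True),
        Sct("google-b", True),
        Sct("external-a", False),
        Sct("external-b", False),
    ]
    print([s.log_name for s in select_scts_to_embed(available)])
```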

In response to Q1, the logic involved was specific to selection and
embedding of SCTs, not part of validation logic, so a related error would
not affect validation. An unrelated error in validation logic could of
course affect validation, but all CAs have that risk and like other CAs we
have multiple layers of safeguards on validation logic.

For Q2, we sample certs regularly and make use of proven external linting
libraries and our own linting and audit logic. In this case because the
issue was not something normally checked by external tools and the behavior
was perfectly fine until the Chrome deadline in April, I can only posit
that we would have discovered it fairly quickly via other means. We now
have specific checks for this issue and other similar problems we could
foresee.

For Q3, we could account for the initial submission fully and understand
exactly what happened. Google has rigorous version control and enforcement
systems to ensure only properly reviewed and built code can enter
production and to reconcile running code against the cut point for an
approved release. Our CA systems have additional safeguards on top of those
standard tools to further ensure that we have strong knowledge of the
pedigree of all code and how it was built and deployed.

On Thu, Aug 23, 2018 at 10:55 AM Nick Lamb wrote:

> On Thu, 23 Aug 2018 05:50:05 -0700 (PDT)
> Andy Warner via dev-security-policy wrote:
>
> > May 21st 2018, a new tool for issuing certificates within Google was
> > made available to internal customers. Within hours we started to
> > receive reports that Chrome Canary (v67) with Certificate
> > Transparency checks enabled was showing warnings. A coding error led
> > to the new tool providing Signed Certificate Timestamps (SCTs) from 2
> > Google CT logs instead of one Google and one non-Google log.
>
> Feel free to jump in anywhere I've made a mistake, this might totally
> invalidate some of my questions.
>
> Presumably, since you eventually "fixed" this by asking Subscribers to
> re-issue, the SCTs are baked into a signed certificate, rather than
> provided separately so that the Subscriber can use them with e.g.
> Stapling technologies ?
>
> Which means that this "new tool" also involved a Google controlled
> subCA signing these certificates with, as it turns out, the wrong SCTs
> in them. It's not clear to me if the tool and CA are operationally one
> and the same.
>
> Q1: Could a more significant "coding error" in this tool have resulted
> in certificates being mis-issued (for example with SANs that don't
> belong to Google, or lacking mandatory X.509 fields, or without being
> CT logged)? If not please explain why the tool couldn't cause this.
>
> Q2: If this error hadn't caused a negative end-user experience, what
> mechanisms if any do you believe would have brought it to your
> attention and how soon? e.g. does a team sample resulting certificates
> from this tool at some interval? If it samples pre-certificates that
> would not have detected this error, but is worth mentioning.
>
> Q3: Such mistakes are of course inevitable in software development. But
> they could also be introduced maliciously. Were you able to confidently
> identify which specific individual(s) made the relevant change? (I don't
> want names). Are you confident you'd be able to do this even if somehow
> the production tool turned out not to match your revision control
> systems?
>
> Thanks as always for satisfying my curiosity
>
> Nick.
>


___
dev-security-policy mailing list
dev-security-policy@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-security-policy


Re: Google Trust Services - Minor SCT issue disclosure

2018-08-23 Thread Andy Warner via dev-security-policy
Correct, we do not believe there was a policy violation; we're proactively
sharing this in the interest of transparency and knowledge sharing.

I believe there is additional information we could share about how we've
modified testing to ensure compliance with Chrome and Safari's SCT
inclusion rules and have more flexible tests. I want to discuss this with
the engineer who implemented the changes to ensure they agree with how I
would summarize the changes. Update to follow.

On Thu, Aug 23, 2018 at 8:57 AM Alex Gaynor wrote:

> Hi Andy,
>
> Just so I follow, this is something you're proactively sharing, right? As
> far as I can tell, there's no violation of any Mozilla Root Program rules
> here, just an issue that caused interstitials in Chrome.
>
> Either way, I appreciate your sharing.
>
> You mentioned the issue was due to some overly complex control flow. In
> order to help other CAs out, do you think there are testing methodologies
> that could have helped catch this earlier?
>
> Alex




Google Trust Services - Minor SCT issue disclosure

2018-08-23 Thread Andy Warner via dev-security-policy
Please note, Google wrote this report for internal use immediately after the
issue. We intended to post it to m.d.s.p at that time, but securing internal
approvals took a while and the posting ended up on the back burner for a bit.
It was a minor issue, but we want the community to be aware of it.

Summary:

On May 21st 2018, a new tool for issuing certificates within Google was made
available to internal customers. Within hours we started to receive reports
that Chrome Canary (v67) with Certificate Transparency checks enabled was
showing warnings. A coding error led to the new tool providing Signed
Certificate Timestamps (SCTs) from 2 Google CT logs instead of one Google and
one non-Google log.

* NOTE: Affected certs were logged at issuance to at least 2 Google CT logs and 
2 non-Google CT logs. The embedded SCTs for affected certs only provided proofs 
from Google logs instead of Google and non-Google logs as required by Chrome.

* NOTE: The bug was due to an 'if/else' chain fall through. The code in 
question has been refactored to be simpler and more readable.
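
As a rough illustration of the selection rule described in the notes above
(this is not the actual Google code, which was not published), the sketch
below picks SCTs for embedding by log operator instead of through a chain of
if/else branches. The names `Sct` and `select_scts_for_embedding` and the
operator strings are hypothetical. The only rule taken from the report is that
Chrome required one SCT from a Google log and one from a non-Google log.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sct:
    """Hypothetical record for one SCT returned by a CT log."""
    log_name: str
    operator: str  # e.g. "Google" or some other log operator


def select_scts_for_embedding(candidates: List[Sct]) -> List[Sct]:
    """Pick one Google and one non-Google SCT, or fail loudly."""
    google = [s for s in candidates if s.operator == "Google"]
    non_google = [s for s in candidates if s.operator != "Google"]
    if not google or not non_google:
        # Refuse to build a set that would not satisfy the rule,
        # rather than silently falling through to a default.
        raise ValueError("need at least one Google and one non-Google SCT")
    return [google[0], non_google[0]]


# Assumed example input: 2 Google and 2 non-Google SCTs, as in the report.
candidates = [
    Sct("google-log-a", "Google"),
    Sct("google-log-b", "Google"),
    Sct("other-log-a", "ExampleOperator"),
    Sct("other-log-b", "ExampleOperator"),
]
selected = select_scts_for_embedding(candidates)
```

Partitioning by operator first and raising on an incomplete set leaves no
branch in which two Google SCTs can be returned by default.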

The issue was fully resolved ~14 hours after initial notification. The issue 
was mitigated within 4 hours. Triage and code fixes happened within 11 hours 
and it took ~3 hours to deploy the fixed code and confirm the fixed behavior in 
production. The new code was running in relatively few locations, so deployment 
was quick compared to some changes in our infrastructure.

Most affected customers responded quickly to communications that they should 
replace their certificates and revoke the old ones before a given deadline. All 
certificates that were issued with an SCT set that was not fully compliant were 
revoked on 2018-06-19 if they had not already been revoked by the customer 
previously. Most users replaced certificates shortly after notification.

Timeline:

2018-03-22 Bug introduced to codebase.
2018-05-21 Push including bug became available to clients.
2018-05-22 08:05 UTC First user reports that Chrome Canary presents a CT 
warning for a certificate.
2018-05-22 09:25 UTC Bug filed with initial assessment.
2018-05-22 12:01 UTC Frontend jobs with the bug are taken offline following 
standard CA procedures.
2018-05-22 15:59 UTC Issue conclusively identified.
2018-05-22 19:07 UTC Fix is submitted.
2018-05-22 21:48 UTC Fix starts to be rolled out.
2018-05-22 22:16 UTC Fix fully deployed and tested on test instances followed 
by deployment to production. Access to frontends restored.
2018-05-24 Customer communication sent to affected users to ask them to renew 
their certificates and revoke the old ones.
2018-06-19 The final handful of certificates that had not already been revoked 
and replaced by users were revoked by the CA.
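
The durations quoted in the summary can be cross-checked against the
timestamps in the timeline. The short sketch below measures elapsed time from
the first user report at 08:05 UTC. The timestamps come from the timeline; the
event labels are shorthand for the corresponding entries.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# First user report of the CT warning (UTC), taken from the timeline.
first_report = datetime.strptime("2018-05-22 08:05", FMT)

# Later timeline entries (UTC); labels are shorthand for the entries.
events = {
    "frontends taken offline": "2018-05-22 12:01",
    "fix submitted": "2018-05-22 19:07",
    "fix fully deployed": "2018-05-22 22:16",
}


def minutes_since_first_report(stamp: str) -> int:
    """Whole minutes between the first report and the given timestamp."""
    delta = datetime.strptime(stamp, FMT) - first_report
    return int(delta.total_seconds() // 60)


elapsed = {name: minutes_since_first_report(ts) for name, ts in events.items()}
for name, mins in elapsed.items():
    print(f"{name}: {mins // 60}h {mins % 60:02d}m after first report")
```

This gives roughly 4 hours to mitigation, 11 hours to the submitted fix, and
14 hours to full deployment, in line with the figures in the summary.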

Findings:

* The operational plan to halt issuance worked as expected and was implemented 
quickly.
* The problem was quickly found, fully understood and easy to remedy.
* Tests existed, but did not cover this failure case. 

Remediation Plan:
* Completed
** Message of the Day (MOTD) functionality was added or improved for all 
issuance systems to make it easier to communicate issues to users when issuance 
is intentionally paused.
** Test coverage was expanded to ensure that both the quantity and type of SCTs 
are checked.
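
A check on both the quantity and the type of SCTs, as described in the last
item, might look something like the sketch below. It is only an illustration
under stated assumptions: the function name `sct_set_problems`, the
representation of an SCT set as a list of log operator names, and the default
minimum of 2 are all hypothetical and are not taken from the actual test suite.

```python
from typing import List


def sct_set_problems(operators: List[str], minimum: int = 2) -> List[str]:
    """Return a list of problems found in an embedded SCT set.

    `operators` holds one log-operator name per embedded SCT.
    `minimum` is an assumed lower bound on the number of SCTs.
    """
    problems = []
    # Quantity check.
    if len(operators) < minimum:
        problems.append(
            f"expected at least {minimum} SCTs, found {len(operators)}"
        )
    # Type checks: one Google and one non-Google log must be represented.
    if "Google" not in operators:
        problems.append("no SCT from a Google log")
    if all(op == "Google" for op in operators):
        problems.append("no SCT from a non-Google log")
    return problems


# The SCT set produced by the bug: two SCTs, both from Google logs.
buggy = sct_set_problems(["Google", "Google"])
# A compliant set: one Google and one non-Google SCT.
compliant = sct_set_problems(["Google", "ExampleOperator"])
```

A count-only test would have passed the buggy set, which is why the operator
check is kept separate from the quantity check.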


Re: CAs not compliant with CAA CP/CPS requirement

2017-09-09 Thread Andy Warner via dev-security-policy
Google Trust Services published updated CP & CPS versions earlier today
covering CAA checking. I'd suggest checking all CAs again tomorrow. Given the
range of timezones CA operational staff work across, some may not have had
a chance to publish their updates yet.

In terms of the 'rush', I suspect many CAs have had language prepared to publish
well in advance, but were holding off given the number of discussions in
various forums about how to interpret some sections of the RFC and BRs. Many of
those discussions continued until the last moment, so holding off to ensure
published details aligned with community consensus was a reasonable approach.