GTS - OCSP serving issue 2020-04-09
m.d.s.p community,

Google Trust Services just filed https://bugzilla.mozilla.org/show_bug.cgi?id=1630040 which contains the same information as the report that follows.

From 2020-04-08 16:25 UTC to 2020-04-09 05:40 UTC, Google Trust Services' EJBCA based CAs (GIAG4, GIAG4ECC, GTSY1-4) served empty OCSP data which led the OCSP responders to return unauthorized. These CAs exist for issuance of custom certificate profiles and certificates for test sites for inactive roots. Our primary CAs (GTS CA 1O1 and GTS CA 1D2) were unaffected. The problem self-corrected, but we have added safeguards to prevent recurrence.

1. How your CA first became aware of the problem (e.g. via a problem report submitted to your Problem Reporting Mechanism, a discussion in mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and the time and date.

Monitoring detected the issue on 2020-04-08 at 16:35 UTC. The root cause was identified within hours. The issue was automatically remediated in the next generation and push to CDN cycle while debugging and fixes were ongoing.

2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.

2020-04-08, 11:29 UTC - Scheduled system update begins
2020-04-08, 14:00 UTC - Incorrect OCSP archives are generated
2020-04-08, 15:03 UTC - Scheduled system update concludes
2020-04-08, 16:20 UTC - Incorrect OCSP responses pushed to CDN
2020-04-08, 16:35 UTC - First production monitoring alert fires
2020-04-08, 22:00 UTC - Correct OCSP archives are generated automatically
2020-04-09, 00:20 UTC - Correct OCSP responses pushed to CDN
2020-04-09, 05:40 UTC - Monitoring confirms all probes are passing

3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem.
A statement that you have will be considered a pledge to the community; a statement that you have not requires an explanation.

The affected CAs are only used for infrequent and manual custom certificate issuance. No certificate issuance aside from a manually issued post-update test certificate to validate the upgrade to resolve the issue took place during this period. The issue in question was also specific to refreshing OCSP responses and not certificate issuance.

4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

No certificate issuance aside from a manually issued post-update test certificate to validate the upgrade to resolve the issue took place during this period. The test certificate was a valid and fully compliant issuance.

5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

No certificate issuance took place aside from the manually issued post-update test certificate to validate the upgrade.

6. Explanation about how and why the mistakes were made or bugs introduced, and how they avoided detection until now.

Our creation of OCSP responses and packaging them for serving is designed to fail if any sub-command fails using set -e. However, if the function call is part of an AND or OR sequence (i.e. using '&&' or '||' control operators), the set -e is suppressed inside the function.
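The set -e behavior described above can be reproduced in a few lines. What follows is a minimal sketch, not the GTS pipeline: the function name fetch_responses and the script bodies are hypothetical, and it assumes bash is available on the PATH. It runs the same failing shell function twice, once as a plain command and once on the left-hand side of '&&'.

```python
import subprocess
from typing import Tuple

# Hypothetical stand-in for a pipeline step that fetches OCSP responses
# and fails partway through.
FUNCTION = """
set -e
fetch_responses() {
  false                      # simulated failure (non-zero exit status)
  echo "packaging archive"   # must not run if set -e is in effect
}
"""


def run_bash(body: str) -> Tuple[int, str]:
    """Run FUNCTION followed by `body` in bash; return (exit status, stdout)."""
    proc = subprocess.run(
        ["bash", "-c", FUNCTION + body],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()


# Plain call: set -e aborts the whole script at `false`.
plain = run_bash("fetch_responses\n")

# Call inside an && list: set -e is ignored inside the function body,
# so execution continues past `false` and the function returns 0.
chained = run_bash('fetch_responses && echo "publishing"\n')

print(plain)
print(chained)
```

In the plain call the script stops at the failing command and exits with status 1 without printing anything. In the chained call both "packaging archive" and "publishing" are printed and the exit status is 0, which is the class of silent continuation the report describes.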
The tool we use to fetch OCSP responses from EJBCA correctly returned a non-zero exit code (no OCSP responses were generated because EJBCA was not running), but because it was called inside a function with its own error handling (using && syntax), the script continued without handling the error properly and wrongly used empty tar.gz files with no responses in them. The bug had existed for multiple years as a potential race condition and we did not encounter it previously.

Quality tests are executed before publication to the CDN; however, those tests accommodate empty responses as a valid condition because it is something that can and does happen.

This condition did not repeat on the following update of the OCSP responses. As a result, the next update resolved the issue. Our monitoring caught the issue, enabling expedient root cause analysis and resolution.

7. List of steps your CA is taking to resolve the situation and ensure such issuance will not be repeated in the future, accompanied with a timeline of when your CA expects to accomplish these things.

No certificate issuance aside from a valid manually issued post-update test certificate to validate the upgrade took place during this period. The logic error that led to incorrect OCSP responses being served has been corrected, is checked in and in production. Additionally, checks have
Re: DigiCert OCSP services returns 1 byte
The CRL question is not about it being a requirement, but rather the fact that it could / would lead to disparate treatment between CRL and OCSP for the same certificate, which does not feel right.

On the CT quorum issue, we use a mix of the most available sharded logs and that is the failure rate we're observing. We have a few ideas for improvements we're working on. If other operators are seeing much different success rates, we'd love to compare notes. We're using the published best practices, spreading load and using sharded logs, so an implementation issue is not obvious if there is one. That said, other groups within Google including the CT team also exchange messages with CT logs in fairly high volumes, so we may experience atypically high rate-limiting due to all being bucketed together.

CAA validations are only good for 8 hours, so the suggestion of a year misses the much shorter timeline that needs to be honored for CAA.

--
Andy Warner
Google Trust Services

On Mon, Sep 23, 2019 at 3:57 PM Kurt Roeckx wrote:
> On Mon, Sep 23, 2019 at 02:53:26PM -0700, Andy Warner via dev-security-policy wrote:
> >
> > 1. The new text added to the Mozilla Recommended and Required Practices for this topic states only OCSP status is required for precertificates. Many CAs provide both CRLs and OCSP services and it would be problematic if these two mechanisms provided different answers to the same question.
> >
> > The practice of revoking non-issued certificates would therefore lead to CRL growth which would further make reliable revocation checking on bandwidth constrained clients more difficult.
>
> There have been suggestions to revoke them, but it's my understanding that there is no such requirement.
>
> > 2. There seem to be a number of assumptions that precertificate issuance and certificate issuance are roughly atomic. In reality, a quorum of SCTs is required prior to final certificate issuance, so that is not the case.
>
> I don't see anybody suggesting that, nor how it's relevant.
>
> With all the uptime requirements on the logs and the number of available logs, I don't see why you should have a failure rate of 1 in 2000, and that seems more like an implementation problem.
>
> > 3. This raises the question of how much time a CA has from the time they issue a precertificate to when the final certificate must be issued. When there are log ecosystem issues that are beyond the control of a CA, the gap can easily be orders of magnitude higher than normal operating conditions.
>
> And what is the issue with that?
>
> > * Clarifications
> >
> > This in turn raises the question if CAs can re-use authorization data such as CAA records or domain authorizations from the precertificate? If a final certificate has not been issued due to a persistent quorum failure, and that failure persists longer than the validity of the used authorization data, can the authorizations that were done prior to the precertificate issuance be re-used?
>
> So 1 year is sometimes not enough to get SCTs?
>
> Kurt

smime.p7s Description: S/MIME Cryptographic Signature
___
dev-security-policy mailing list
dev-security-policy@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-security-policy
Re: DigiCert OCSP services returns 1 byte
The last thing we intended was for our prior mail to be interpreted as negative and without substance. That said, it is clear our mail was not received in the light in which it was intended. We would like to rectify that.

We have been closely monitoring this thread and, as it began to converge on a conclusion, we started planning for each of our CA environments what, if any, changes would be required and what solutions compliant with our understanding of the conclusions would look like.

With that background, our understanding is that the goal of this change is to make it easier to monitor the issuance and revocation practices of a CA based on the existence of precertificates, i.e. extending the use of certificate revocation data to include this monitoring use case.

We see value in this and are supportive of the overall change, as it is clear that CT has made significant quality improvements to the WebPKI as a whole and this provides additional incremental benefit.

However, we see several challenges that we want to discuss, in particular:

1. The new text added to the Mozilla Recommended and Required Practices for this topic states only OCSP status is required for precertificates. Many CAs provide both CRLs and OCSP services and it would be problematic if these two mechanisms provided different answers to the same question.

The practice of revoking non-issued certificates would therefore lead to CRL growth which would further make reliable revocation checking on bandwidth constrained clients more difficult. Though this tax may be deemed acceptable, there is a clear impact and GTS feels that increasing CRL sizes for this use case is not in the best interest of users. We can see both sides of the argument, but we think a bit more detail is required to ensure our implementations align with best practices and user interests.

2. There seem to be a number of assumptions that precertificate issuance and certificate issuance are roughly atomic.
In reality, a quorum of SCTs is required prior to final certificate issuance, so that is not the case. CAs are rate limited by logs, or logs experience availability issues, so achieving quorum requires retries or fails altogether. GTS, for example, experiences approximately 0.05% delays / order abandonment related to an inability to achieve quorum. As a result of this, the existence of a precertificate is possible without a final certificate having been issued. With the wider availability of sharded logs, this number has been improving, but it continues to be our most common cause of issuance delay or order abandonment.

3. This raises the question of how much time a CA has from the time they issue a precertificate to when the final certificate must be issued. When there are log ecosystem issues that are beyond the control of a CA, the gap can easily be orders of magnitude higher than normal operating conditions.

Likewise, there is the question of how soon the revocation information must be produced and reachable by an interested party (e.g. someone who has never seen the certificate in question but still wants to know the status of that certificate).

[Aside, Wayne, you specifically said relying parties earlier, did you intend to say interested party or relying party? We have some additional questions if relying party was actually intended, as using it in that context seems to redefine what a relying party is.]

This “reachable” part is particularly meaningful in that when using a CDN there are often phased roll outs that can take hours to complete. Today, the BRs leave this ambiguous; the only statement in this area is that new information must be published every four days:

"The CA SHALL update information provided via an Online Certificate Status Protocol at least every four days. OCSP responses from this service MUST have a maximum expiration time of ten days."
With this change, it would seem there needs to be a lower bound defined for how quickly the information needs to be available if it is to be an effective monitoring tool.

* Clarifications

This in turn raises the question if CAs can re-use authorization data such as CAA records or domain authorizations from the precertificate? If a final certificate has not been issued due to a persistent quorum failure, and that failure persists longer than the validity of the used authorization data, can the authorizations that were done prior to the precertificate issuance be re-used? If the precertificate is a promise to issue the exact same cert, it would seem to imply yes, but there are plenty of real-world scenarios where that would not be sensible or in line with the requester's intent. If the CAA record changes between initial validation for the precertificate and re-validation for actual issuance after a delay, what is the correct course of action?

* Process

On Thursday last week, Wayne added the topic to Recommended and Required
Re: DigiCert OCSP services returns 1 byte
Google Trust Services (GTS) reached out to Wayne directly, but I'm also posting here as the conversation seems to be rapidly converging on solutions. GTS still has reservations that the proposed solutions may be problematic to implement and may leave a number of CAs and one very common CA vendor in a bind to get from their current state to whatever the final state is cleanly.

While Mozilla's requirements and recommendations are not strictly binding, they carry a great deal of weight and could lead to rapid implementation of a sub-standard solution. Google Trust Services would like to see the current precertificate 'requirements' moved to the 'recommendations' section with a note explaining that once the formal details are worked out via bylaw changes (preferably) or further discussion on m.d.s.p (if bylaw changes are deemed too slow), they will become requirements.

Sorry to post late in the process like this. Unfortunately, as a globally distributed team within a much larger company, the Google Trust Services team cannot always move and post as quickly as we'd like. We will follow up early next week with more details about our concerns, but there are a number of complex interactions and subtly conflicting requirements that seem best served by taking the time to ensure the final state is not settled on in haste.

It would be great to achieve consistency sooner rather than later, so a time bounded window to get there seems best to balance convergence versus a rush to decisions that may adversely affect the ecosystem or be a challenge to live with for years.

--
Andy Warner
Google Trust Services

On Friday, September 20, 2019 at 1:20:02 PM UTC-7, Curt Spann wrote:
> This is a great discussion and I want to thank everyone for their continued input. Let me try and summarize my interpretation based on the input from this thread and related RFC.
>
> My interpretation is an “unknown” OCSP response should be used in the following conditions:
> 1. When the OCSP request contains an issuerNameHash and issuerKeyHash for which the OCSP responder is NOT authoritative (wrong issuing CA).
> 2. When the OCSP request contains an issuerNameHash and issuerKeyHash for which the OCSP responder IS authoritative (correct issuing CA) but for whatever reason the OCSP responder does not know the status of the requested certificates and ONLY if the certificate for which the status is requested contains another OCSP responder URL available in the AIA extension.
>
> My interpretation is a “revoked” OCSP response should be used in the following conditions:
> 1. When the OCSP request contains an issuerNameHash and issuerKeyHash for which the OCSP responder IS authoritative and the requested certificate has been revoked.
> 2. When the OCSP request contains an issuerNameHash and issuerKeyHash for which the OCSP responder IS authoritative and the CA corresponding to the issuerNameHash and issuerKeyHash has been revoked.
> 3. When the OCSP request contains an issuerNameHash and issuerKeyHash for which the OCSP responder IS authoritative and the requested certificate has not been issued. This OCSP response MUST include the extended revoked definition response extension in the response, indicating that the OCSP responder supports the extended definition of the "revoked" state to also cover non-issued certificates. The SingleResponse related to this non-issued certificate MUST specify the revocation reason certificateHold (6), MUST specify the revocationTime January 1, 1970, and MUST NOT include a CRL references extension or any CRL entry extensions. [1]
>
> I agree number 3 above is in conflict with the BRs as pointed out by Wayne.
>
> - Curt
>
> [1] RFC 6960: https://tools.ietf.org/html/rfc6960
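The conditions in the quoted summary can be written down as a small decision function. This is only a sketch of that interpretation, not responder code: the Lookup fields and the ocsp_status name are hypothetical, the "good" branch is an assumption (the summary does not cover it), and the case where an authoritative responder does not know the status and no other responder URL exists is left undecided because the summary does not say what to return.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    UNKNOWN = "unknown"


@dataclass
class Lookup:
    """Hypothetical facts a responder has about one request."""
    authoritative: bool                # authoritative for issuerNameHash/issuerKeyHash
    status_known: bool = True          # responder can determine the certificate status
    has_other_responder: bool = False  # AIA extension lists another OCSP responder URL
    issued: bool = True
    cert_revoked: bool = False
    ca_revoked: bool = False


def ocsp_status(q: Lookup) -> Optional[Status]:
    if not q.authoritative:
        return Status.UNKNOWN  # "unknown" condition 1: wrong issuing CA
    if not q.status_known:
        # "unknown" condition 2 applies ONLY if another responder exists;
        # otherwise the summary gives no answer, so return None.
        return Status.UNKNOWN if q.has_other_responder else None
    if q.cert_revoked or q.ca_revoked:
        return Status.REVOKED  # "revoked" conditions 1 and 2
    if not q.issued:
        # "revoked" condition 3: per the summary this response would also carry
        # the extended revoked extension, reason certificateHold (6) and
        # revocationTime January 1, 1970.
        return Status.REVOKED
    return Status.GOOD  # assumption: issued, not revoked


wrong_ca = ocsp_status(Lookup(authoritative=False))
non_issued = ocsp_status(Lookup(authoritative=True, issued=False))
undecided = ocsp_status(Lookup(authoritative=True, status_known=False))
normal = ocsp_status(Lookup(authoritative=True))
```

The non_issued case is the one the summary itself flags as conflicting with the BRs, so a real responder would need that conflict resolved before adopting this branch.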
Re: Google Trust Services - CRL handling of expired certificates not fully compliant with RFC 5280 Section 3.3
A quick follow-up to close this out. The push to fully address the issue was completed globally shortly before 16:00 UTC on 2019-09-02. After additional review, we're confident the only certificates affected were these two: https://crt.sh/?id=760396354 https://crt.sh/?id=759833603 Google Trust Services considers this matter fully addressed. We will of course continue our ongoing internal review program, but no other work or information is outstanding at this point. -- Andy Warner Google Trust Services On Friday, August 30, 2019 at 2:39:51 PM UTC-4, Andy Warner wrote: > This is an initial report and we expect to provide some additional details > and the completion timeline after a bit more verification and full deployment > of in-flight mitigations. We are posting the most complete information we > have currently to comply with Mozilla reporting timelines and will follow-up > with additional details soon. > > 1. How your CA first became aware of the problem and the time and date. > > While performing an internal review and assessment of the CRL generation > system for Google Trust Services' GTS CA 1O1 on August 16, 2019, it was > discovered that the CRL generation service did not include CRL entries of > expired certificates. The periodic job only considered certificates with > valid lifetimes. This does not conform to RFC 5280 Section 3.3 which states > that “An entry MUST NOT be removed from the CRL until it appears on one > regularly scheduled CRL issued beyond the revoked certificate's validity > period.” We expect that few, if any, clients have been impacted. For a > client to be impacted they would have to: clock skewed to a time before the > not-after field of the certificate; and have a CRL published after expiration > dropping the revoked certificate. > > > 2. A timeline of the actions your CA took in response. A timeline is a > date-and-time-stamped sequence of all relevant events. 
This may include > events before the incident was reported, such as when a particular > requirement became applicable, or a document changed, or a bug was > introduced, or an audit was done. > > August 16, 2019 15:00 UTC - Reviewer realizes that CRL will not publish for > one update past expiration > August 16, 2019 16:00 UTC - Reviewer checks for other issues & talks to peers > to confirm problem > August 16, 2019 17:00 UTC - Bug is filed to fix the issue with a proposed > design fix > August 16, 2019 23:30 UTC - Fix is sent for review > August 20, 2019 16:00 UTC - Remediation work is discussed & assigned > August 20, 2019 18:00 UTC - Query to inspect revoked certificates is created > and sent to be run by production team for initial analysis. > August 21, 2019 10:40 UTC - Production team runs query and returns result > August 21, 2019 15:00 UTC - Reviewer analyzes data > August 21, 2019 20:30 UTC - Reviewer asks for a follow up query to ascertain > if any certificates did not make it onto the CRL > August 22, 2019 07:00 UTC - Initial attempt at updating test systems with fix. > August 22, 2019 09:00 UTC - Updating of test systems aborted due to > (unrelated) issues. > August 22, 2019 07:00 UTC - Production team runs query for CRLs that may have > missed a certificate > August 22, 2019 15:00 UTC - Reviewer ascertains that certificates under > question were on a CRL > August 26, 2019 11:00 UTC - Second attempt at updating test systems with fix. > August 26, 2019 13:00 UTC - Test systems updated, confirmed integrity of > fixed software. > August 27, 2019 09:00 UTC - Confirmed fix is effective on test systems. > August 27, 2019 10:00 UTC - present: Ongoing staged deployment to production > systems. Should complete fully by September 3, 2019 17:00 UTC (slightly > extended window due to push policies around holiday weekends. The rollout was > staged in accordance with Google's standard rollout procedures.) > > > 3. 
Whether your CA has stopped, or has not yet stopped, issuing certificates > with the problem. > > The affected CA software has been patched. It now populates expired > certificates in the CRL for 7 days after their expiration to ensure they > appear in at least one regularly issued CRL update. Automated testing was > added as part of the same patch to check that revoked certificates are kept > in the CRL. The patch was developed, tested, reviewed and landed within the > codebase by August 19, 2019. The CRL entry removal bug has been fully > remediated. > > > 4. A summary of the problematic certificates. For each problem: number of > certs, and the date the first and last certs with that problem were issued. > > Investigation began on August 20, 2019 to discover the potential impact of > the logic bug. The CRL generation had contained the bug since its inception, > affecting all issuance under GTS 1O1 since March 2018. There were 200,263 > revoked certificates during that time window. Almost all certificates were > for internal monitoring
Google Trust Services - CRL handling of expired certificates not fully compliant with RFC 5280 Section 3.3
This is an initial report and we expect to provide some additional details and the completion timeline after a bit more verification and full deployment of in-flight mitigations. We are posting the most complete information we have currently to comply with Mozilla reporting timelines and will follow up with additional details soon.

1. How your CA first became aware of the problem and the time and date.

While performing an internal review and assessment of the CRL generation system for Google Trust Services' GTS CA 1O1 on August 16, 2019, it was discovered that the CRL generation service did not include CRL entries of expired certificates. The periodic job only considered certificates with valid lifetimes. This does not conform to RFC 5280 Section 3.3 which states that “An entry MUST NOT be removed from the CRL until it appears on one regularly scheduled CRL issued beyond the revoked certificate's validity period.” We expect that few, if any, clients have been impacted. For a client to be impacted it would have to: have a clock skewed to a time before the not-after field of the certificate; and have a CRL published after expiration dropping the revoked certificate.

2. A timeline of the actions your CA took in response. A timeline is a date-and-time-stamped sequence of all relevant events. This may include events before the incident was reported, such as when a particular requirement became applicable, or a document changed, or a bug was introduced, or an audit was done.
August 16, 2019 15:00 UTC - Reviewer realizes that CRL will not publish for one update past expiration
August 16, 2019 16:00 UTC - Reviewer checks for other issues & talks to peers to confirm problem
August 16, 2019 17:00 UTC - Bug is filed to fix the issue with a proposed design fix
August 16, 2019 23:30 UTC - Fix is sent for review
August 20, 2019 16:00 UTC - Remediation work is discussed & assigned
August 20, 2019 18:00 UTC - Query to inspect revoked certificates is created and sent to be run by production team for initial analysis.
August 21, 2019 10:40 UTC - Production team runs query and returns result
August 21, 2019 15:00 UTC - Reviewer analyzes data
August 21, 2019 20:30 UTC - Reviewer asks for a follow up query to ascertain if any certificates did not make it onto the CRL
August 22, 2019 07:00 UTC - Initial attempt at updating test systems with fix.
August 22, 2019 09:00 UTC - Updating of test systems aborted due to (unrelated) issues.
August 22, 2019 07:00 UTC - Production team runs query for CRLs that may have missed a certificate
August 22, 2019 15:00 UTC - Reviewer ascertains that certificates under question were on a CRL
August 26, 2019 11:00 UTC - Second attempt at updating test systems with fix.
August 26, 2019 13:00 UTC - Test systems updated, confirmed integrity of fixed software.
August 27, 2019 09:00 UTC - Confirmed fix is effective on test systems.
August 27, 2019 10:00 UTC - present: Ongoing staged deployment to production systems. Should complete fully by September 3, 2019 17:00 UTC (slightly extended window due to push policies around holiday weekends. The rollout was staged in accordance with Google's standard rollout procedures.)

3. Whether your CA has stopped, or has not yet stopped, issuing certificates with the problem.

The affected CA software has been patched. It now populates expired certificates in the CRL for 7 days after their expiration to ensure they appear in at least one regularly issued CRL update.
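The difference between the original behavior and the patched retention rule can be sketched as two filters over the set of revoked certificates. This is an illustrative sketch only: the RevokedCert type and both function names are hypothetical, and the 7-day window is taken from the report, which implies that at least one regularly scheduled CRL is issued within that window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

RETENTION = timedelta(days=7)  # grace period after expiration stated in the report


@dataclass
class RevokedCert:
    serial: int
    not_after: datetime   # end of the certificate's validity period
    revoked_at: datetime


def crl_entries_before_fix(revoked: List[RevokedCert], now: datetime) -> List[RevokedCert]:
    """Behavior as described in the report: only unexpired certificates are listed."""
    return [c for c in revoked if c.revoked_at <= now <= c.not_after]


def crl_entries_after_fix(revoked: List[RevokedCert], now: datetime) -> List[RevokedCert]:
    """Patched rule: keep the entry until RETENTION has passed after expiration."""
    return [c for c in revoked if c.revoked_at <= now <= c.not_after + RETENTION]


utc = timezone.utc
cert = RevokedCert(
    serial=1,
    not_after=datetime(2019, 8, 1, tzinfo=utc),
    revoked_at=datetime(2019, 7, 30, tzinfo=utc),
)

# One day after expiration: dropped before the fix, still listed after it.
day_after = datetime(2019, 8, 2, tzinfo=utc)
before_fix = [c.serial for c in crl_entries_before_fix([cert], day_after)]
after_fix = [c.serial for c in crl_entries_after_fix([cert], day_after)]

# Nine days after expiration: outside the retention window, so no longer listed.
later = [c.serial for c in crl_entries_after_fix([cert], datetime(2019, 8, 10, tzinfo=utc))]
```

Here before_fix is empty while after_fix still contains serial 1, so the revoked certificate appears on a CRL issued beyond its validity period, as RFC 5280 Section 3.3 requires.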
Automated testing was added as part of the same patch to check that revoked certificates are kept in the CRL. The patch was developed, tested, reviewed and landed within the codebase by August 19, 2019. The CRL entry removal bug has been fully remediated.

4. A summary of the problematic certificates. For each problem: number of certs, and the date the first and last certs with that problem were issued.

Investigation began on August 20, 2019 to discover the potential impact of the logic bug. The CRL generation had contained the bug since its inception, affecting all issuance under GTS 1O1 since March 2018. There were 200,263 revoked certificates during that time window. Almost all certificates were for internal monitoring specific to checking revocation. The few non-monitoring certificates were all revocations by clients following rotation of certificates and not due to compromises.

5. The complete certificate data for the problematic certificates. The recommended way to provide this is to ensure each certificate is logged to CT and then list the fingerprints or crt.sh IDs, either in the report or as an attached spreadsheet, with one list per distinct problem.

crt.sh IDs to follow, waiting on confirmation that the 2 test certificates mentioned below are the only cases where the issue was surfaced. The team looked for revoked certificates from first issuance that never appeared within a published CRL from operation of CA until August 21,
Re: Google Trust Services - Minor SCT issue disclosure
The code at issue evolved as CT requirements changed. What started off as a very simple conditional grew into a more complex if / else if block with somewhat complicated logic and inline checks. As part of the fix, we simplified the conditionals and refactored the inline checks to make use of nice clear IsExternallyOperated() and IsGoogleOperated() functions. The end result is a much more readable and clear set of logic that is easier to test, and we expanded test coverage.

I think the big lesson for the community is that it would have been better to have refactored earlier rather than evolving the code to the point it became more complicated than it needed to be.

On Thu, Aug 23, 2018 at 9:40 AM Ryan Sleevi wrote:
> On Thu, Aug 23, 2018 at 8:50 AM, Andy Warner via dev-security-policy <dev-security-policy@lists.mozilla.org> wrote:
>>
>> * NOTE: The bug was due to an 'if/else' chain fall through. The code in question has been refactored to be simpler and more readable.
>>
>
> Andy,
>
> It might be good for the community if you could describe the processes before and after the change, so that other CAs can help prevent similar issues with their own embedding systems.
Re: Google Trust Services - Minor SCT issue disclosure
Google provides SCTs via embedding and during SSL handshaking depending on the certificate and how it is served. In this case, all of the affected certs used embedded SCTs and the issue was the selection of which SCTs to include because we submit to more CT logs than required, but only embed the required number of SCTs to keep cert sizes as small as possible. These certs were submitted to 4 CT logs, 2 Google, 2 non-Google, but the embedded SCTs were only from the 2 Google logs, not one Google and one non-Google. All 4 log submissions succeeded and returned correct SCTs; the problem was the embedding logic for the SCTs.

In response to Q1, the logic involved was specific to selection and embedding of SCTs, not part of validation logic, so a related error would not affect validation. An unrelated error in validation logic could of course affect validation, but all CAs have that risk and like other CAs we have multiple layers of safeguards on validation logic.

For Q2, we sample certs regularly and make use of proven external linting libraries and our own linting and audit logic. In this case, because the issue was not something normally checked by external tools and the behavior was perfectly fine until the Chrome deadline in April, I can only posit that we would have discovered it fairly quickly via other means. We now have specific checks for this issue and other similar problems we could foresee.

For Q3, we could account for the initial submission fully and understand exactly what happened. Google has rigorous version control and enforcement systems to ensure only properly reviewed and built code can enter production and to reconcile running code against the cut point for an approved release. Our CA systems have additional safeguards on top of those standard tools to further ensure that we have strong knowledge of the pedigree of all code and how it was built and deployed.
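The selection rule at the heart of this issue (embed fewer SCTs than were obtained, but always a mix of Google and non-Google logs) can be sketched as follows. This is a hypothetical illustration, not the GTS implementation: the Sct type, the log names, and both function names are invented for the example, and the rule is reduced to "one Google-operated plus one externally operated log".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sct:
    log_name: str          # hypothetical log identifier
    google_operated: bool  # True if the issuing log is operated by Google


def select_scts(available: List[Sct]) -> List[Sct]:
    """Pick one Google-operated and one externally operated SCT to embed."""
    google = [s for s in available if s.google_operated]
    external = [s for s in available if not s.google_operated]
    if not google or not external:
        raise ValueError("need at least one Google and one non-Google SCT")
    return [google[0], external[0]]


def is_diverse(embedded: List[Sct]) -> bool:
    """Post-issuance check: the embedded set mixes Google and non-Google logs."""
    has_google = any(s.google_operated for s in embedded)
    has_external = any(not s.google_operated for s in embedded)
    return has_google and has_external


# Four SCTs obtained at issuance: 2 Google logs, 2 non-Google logs.
available = [
    Sct("google-a", True),
    Sct("google-b", True),
    Sct("external-a", False),
    Sct("external-b", False),
]

faulty = available[:2]            # the failure mode: both embedded SCTs from Google logs
chosen = select_scts(available)   # one from each operator class

faulty_ok = is_diverse(faulty)
chosen_ok = is_diverse(chosen)
```

Keeping the diversity check separate from the selection step means it can also run as an independent lint over issued certificates, which is one way to catch a selection bug that validation logic would never see.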
On Thu, Aug 23, 2018 at 10:55 AM Nick Lamb wrote:
> On Thu, 23 Aug 2018 05:50:05 -0700 (PDT) Andy Warner via dev-security-policy wrote:
>
> > May 21st 2018, a new tool for issuing certificates within Google was made available to internal customers. Within hours we started to receive reports that Chrome Canary (v67) with Certificate Transparency checks enabled was showing warnings. A coding error led to the new tool providing Signed Certificate Timestamps (SCTs) from 2 Google CT logs instead of one Google and one non-Google log.
>
> Feel free to jump in anywhere I've made a mistake, this might totally invalidate some of my questions.
>
> Presumably, since you eventually "fixed" this by asking Subscribers to re-issue, the SCTs are baked into a signed certificate, rather than provided separately so that the Subscriber can use them with e.g. Stapling technologies ?
>
> Which means that this "new tool" also involved a Google controlled subCA signing these certificates with, as it turns out, the wrong SCTs in them. It's not clear to me if the tool and CA are operationally one and the same.
>
> Q1: Could a more significant "coding error" in this tool have resulted in certificates being mis-issued (for example with SANs that don't belong to Google, or lacking mandatory X.509 fields, or without being CT logged)? If not please explain why the tool couldn't cause this.
>
> Q2: If this error hadn't caused a negative end-user experience, what mechanisms if any do you believe would have brought it to your attention and how soon? e.g. does a team sample resulting certificates from this tool at some interval? If it samples pre-certificates that would not have detected this error, but is worth mentioning.
>
> Q3: Such mistakes are of course inevitable in software development. But they could also be introduced maliciously. Were you able to confidently identify which specific individual(s) made the relevant change? (I don't want names). Are you confident you'd be able to do this even if somehow the production tool turned out not to match your revision control systems?
>
> Thanks as always for satisfying my curiosity
>
> Nick.
Re: Google Trust Services - Minor SCT issue disclosure
Correct, we do not believe there was a policy violation; we're proactively sharing in the interest of transparency and knowledge sharing.

I believe there is additional information we could share about how we've modified testing to ensure compliance with Chrome and Safari's SCT inclusion rules and to have more flexible tests. I want to discuss this with the engineer who implemented the changes to ensure they agree with how I would summarize the changes. Update to follow.

On Thu, Aug 23, 2018 at 8:57 AM Alex Gaynor wrote:

> Hi Andy,
>
> Just so I follow, this is something you're proactively sharing, right? As
> far as I can tell, there's no violation of any Mozilla Root Program rules
> here, just an issue that caused interstitials in Chrome.
>
> Either way, I appreciate your sharing.
>
> You mentioned the issue was due to some overly complex control flow. In
> order to help other CAs out, do you think there are testing methodologies
> that could have helped catch this earlier?
>
> Alex
>
> On Thu, Aug 23, 2018 at 8:50 AM Andy Warner via dev-security-policy <dev-security-policy@lists.mozilla.org> wrote:
>
>> Please note, Google wrote this report for internal use immediately after the issue. We intended to post it to m.d.s.p at that time, but securing internal approvals took a while and the posting ended-up on the back burner for a bit. It was a minor issue, but we want the community to be aware of it.
>>
>> Summary:
>>
>> May 21st 2018, a new tool for issuing certificates within Google was made available to internal customers. Within hours we started to receive reports that Chrome Canary (v67) with Certificate Transparency checks enabled was showing warnings. A coding error led to the new tool providing Signed Certificate Timestamps (SCTs) from 2 Google CT logs instead of one Google and one non-Google log.
>>
>> * NOTE: Affected certs were logged at issuance to at least 2 Google CT logs and 2 non-Google CT logs. The embedded SCTs for affected certs only provided proofs from Google logs instead of Google and non-Google logs as required by Chrome.
>>
>> * NOTE: The bug was due to an 'if/else' chain fall through. The code in question has been refactored to be simpler and more readable.
>>
>> The issue was fully resolved ~14 hours after initial notification. The issue was mitigated within 4 hours. Triage and code fixes happened within 11 hours and it took ~3 hours to deploy the fixed code and confirm the fixed behavior in production. The new code was running in relatively few locations, so deployment was quick compared to some changes in our infrastructure.
>>
>> Most affected customers responded quickly to communications that they should replace their certificates and revoke the old ones before a given deadline. All certificates that were issued with an SCT set that was not fully compliant were revoked on 2018-06-19 if they had not already been revoked by the customer previously. Most users replaced certificates shortly after notification.
>>
>> Timeline:
>>
>> 2018-03-22 Bug introduced to codebase.
>> 2018-05-21 Push including bug became available to clients.
>> 2018-05-22 08:05 UTC First user reports that Chrome Canary presents a CT warning for a certificate.
>> 2018-05-22 09:25 UTC Bug filed with initial assessment.
>> 2018-05-22 12:01 UTC Frontend jobs with the bug are taken offline following standard CA procedures.
>> 2018-05-22 15:59 UTC Issue conclusively identified.
>> 2018-05-22 19:07 UTC Fix is submitted.
>> 2018-05-22 21:48 UTC Fix starts to be rolled out.
>> 2018-05-22 22:16 UTC Fix fully deployed and tested on test instances followed by deployment to production. Access to frontends restored.
>> 2018-05-24 Customer communication sent to affected users to ask them to renew their certificates and revoke the old ones.
>> 2018-06-19 The final handful of certificates that had not already been revoked and replaced by users were revoked by the CA.
>>
>> Findings:
>>
>> * The operational plan to halt issuance worked as expected and was implemented quickly.
>> * The problem was quickly found, fully understood and easy to remedy.
>> * Tests existed, but did not cover this failure case.
>>
>> Remediation Plan
>> * Completed
>> ** Message of the Day (MOTD) functionality was added or improved for all issuance systems to make it easier to communicate issues to users when issuance is intentionally paused.
>> ** Test coverage was expanded to ensure that both the quantity and type of SCTs are checked.
Google Trust Services - Minor SCT issue disclosure
Please note, Google wrote this report for internal use immediately after the issue. We intended to post it to m.d.s.p at that time, but securing internal approvals took a while and the posting ended-up on the back burner for a bit. It was a minor issue, but we want the community to be aware of it.

Summary:

May 21st 2018, a new tool for issuing certificates within Google was made available to internal customers. Within hours we started to receive reports that Chrome Canary (v67) with Certificate Transparency checks enabled was showing warnings. A coding error led to the new tool providing Signed Certificate Timestamps (SCTs) from 2 Google CT logs instead of one Google and one non-Google log.

* NOTE: Affected certs were logged at issuance to at least 2 Google CT logs and 2 non-Google CT logs. The embedded SCTs for affected certs only provided proofs from Google logs instead of Google and non-Google logs as required by Chrome.

* NOTE: The bug was due to an 'if/else' chain fall through. The code in question has been refactored to be simpler and more readable.

The issue was fully resolved ~14 hours after initial notification. The issue was mitigated within 4 hours. Triage and code fixes happened within 11 hours and it took ~3 hours to deploy the fixed code and confirm the fixed behavior in production. The new code was running in relatively few locations, so deployment was quick compared to some changes in our infrastructure.

Most affected customers responded quickly to communications that they should replace their certificates and revoke the old ones before a given deadline. All certificates that were issued with an SCT set that was not fully compliant were revoked on 2018-06-19 if they had not already been revoked by the customer previously. Most users replaced certificates shortly after notification.

Timeline:

2018-03-22 Bug introduced to codebase.
2018-05-21 Push including bug became available to clients.
2018-05-22 08:05 UTC First user reports that Chrome Canary presents a CT warning for a certificate.
2018-05-22 09:25 UTC Bug filed with initial assessment.
2018-05-22 12:01 UTC Frontend jobs with the bug are taken offline following standard CA procedures.
2018-05-22 15:59 UTC Issue conclusively identified.
2018-05-22 19:07 UTC Fix is submitted.
2018-05-22 21:48 UTC Fix starts to be rolled out.
2018-05-22 22:16 UTC Fix fully deployed and tested on test instances followed by deployment to production. Access to frontends restored.
2018-05-24 Customer communication sent to affected users to ask them to renew their certificates and revoke the old ones.
2018-06-19 The final handful of certificates that had not already been revoked and replaced by users were revoked by the CA.

Findings:

* The operational plan to halt issuance worked as expected and was implemented quickly.
* The problem was quickly found, fully understood and easy to remedy.
* Tests existed, but did not cover this failure case.

Remediation Plan
* Completed
** Message of the Day (MOTD) functionality was added or improved for all issuance systems to make it easier to communicate issues to users when issuance is intentionally paused.
** Test coverage was expanded to ensure that both the quantity and type of SCTs are checked.
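The last remediation item, checking both the quantity and the type of embedded SCTs, could look roughly like the following. This is a hedged sketch only: the `check_embedded_scts` function, the operator labels, and the default of 2 required SCTs are assumptions made for illustration, and it does not attempt to model any browser's full CT policy.

```python
def check_embedded_scts(operators: list[str], required: int = 2) -> list[str]:
    """Return a list of problems; an empty list means the set passes this sketch.

    `operators` holds one label per embedded SCT naming the log operator,
    e.g. "google" or "other" (hypothetical labels).
    """
    problems = []
    # Quantity check: enough SCTs embedded.
    if len(operators) < required:
        problems.append(
            f"expected at least {required} SCTs, found {len(operators)}"
        )
    # Type checks: at least one Google and one non-Google log represented.
    if "google" not in operators:
        problems.append("no SCT from a Google log")
    if all(op == "google" for op in operators):
        problems.append("no SCT from a non-Google log")
    return problems


# The set embedded by the buggy tool: right quantity, wrong mix.
print(check_embedded_scts(["google", "google"]))
# A set with one Google and one non-Google log.
print(check_embedded_scts(["google", "other"]))
```

Checking the count and the operator mix separately matters here because the affected certificates would have passed a count-only test.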
Re: CAs not compliant with CAA CP/CPS requirement
Google Trust Services published updated CP & CPS versions earlier today covering CAA checking. I'd suggest checking all CAs again tomorrow. Given the range of timezones CA operational staffs operate across, some may not have had a chance to publish their updates yet. In terms of the 'rush' I suspect many CAs have had language prepared to publish well in advance, but were holding off given the number of discussions in various forums about how to interpret some sections of the RFC and BRs. Many of those discussions continued until the last moment, so holding off to ensure published details aligned with community consensus was a reasonable approach.