July 9, 2021
How Certificate Transparency Logs Fail and Why It's OK
Last week, a Certificate Transparency log called Yeti 2022 suffered a single bit flip, likely due to a hardware error or cosmic ray, which rendered the log unusable. Although this event will have zero impact on Web users and website operators, and was reported on an obscure mailing list for industry insiders, it captured the interest of people on Hacker News, Twitter, and Reddit. Certificate Transparency plays an essential role in ensuring security on the Web, and numerous commentators were concerned that logs could be wiped out by a single bit flip. I'm going to explain why these concerns are misplaced and why log failure doesn't worry me.
Background: Certificate Transparency (CT) is a system to log publicly-trusted SSL certificates in public, append-only logs. Website owners can monitor these logs and take action if they discover an unauthorized certificate for one of their domains. Thanks to Certificate Transparency, several untrustworthy certificate authorities have been distrusted, and the ecosystem has improved enormously compared to the pre-CT days when misissued certificates usually went unnoticed.
To ensure that CT logs remain append-only, submitted certificates are placed in the leaves of a data structure called a Merkle Tree. The leaves of the Merkle Tree are recursively hashed together with SHA-256 to produce a root hash that represents the contents of the tree. Periodically, the CT log publishes a signed statement, called a Signed Tree Head or STH, containing the current tree size and root hash. The STH is a commitment that at the specified size, the log has the specified contents. To enforce the commitment, monitors collect STHs and verify that the root hashes match the certificates downloaded from the log. If the downloaded certificates don't match the published STHs, or if a monitor detects two STHs for the same tree size with different root hashes, it means the log contents have been altered - perhaps to conceal a malicious certificate.
On June 30, 2021 at 01:02 UTC, my CT monitor, Cert Spotter, raised an alert that the root hash it calculated from the first 65,569,149 certificate entries downloaded from Yeti 2022 did not equal the root hash in the STH that Yeti 2022 had published for tree size 65,569,149.
I noticed the alert the next morning and reported the problem to the ct-policy mailing list, where these matters are discussed. The Google Chrome CT team reported that they too had observed problems, as did a user of the open source certspotter. Later that day, I drilled down that the problem was with entry 65,562,066 - the hash of the certificate returned by Yeti at this position was not part of the Merkle Tree.
On Thursday, the log operator reported that this entry had "shifted one bit" before the STH was signed. Curious, I calculated the correct hash of entry 65,562,066 and then tried flipping every bit and seeing if the resulting hash was part of the Merkle Tree. Sure enough, flipping the lowest bit of the first byte of the hash resulted in a hash that was part of the Merkle Tree and would ultimately produce the root hash from the STH.
There is no way for the log operator to fix this problem: they can't change entry 65,562,066 to match the errant hash, as this would require breaking SHA-2's preimage resistance, which is computationally infeasible. And since they've already published an STH for tree size 65,569,149, they can't publish an updated STH with a root hash that correctly reflects entry 65,562,066.
Consequentially, Yeti 2022 is toast. It has been made read-only and web browsers will soon cease to rely on it.
Yeti 2022 is not the first Certificate Transparency log to fail. Seven logs have previously failed, all without causing any impact to Web users:
- In 2016, a log failed for having excessive downtime.
- In 2016, a log failed because the operator reused the log's private key with a test log. Since the same private key was being used to sign STHs from two different logs, each log appeared to be violating the append-only property.
- In 2017, a log failed because it violated the append-only property after its database was rolled back during a botched backup restore.
- In 2017, a log failed after its operator, a Chinese blockchain company that no one had ever heard of, vanished into thin air. (I tried calling their phone number in China and got a message that their voicemail box was full.) (Maybe they thought CT was a cryptocurrency and pulled the plug when they didn't get rich?)
- In 2018, a log failed because it didn't include submitted certificates in the log.
- In 2018, another log failed because it didn't include submitted certificates in the log.
- In 2020, a log failed because its private key was presumed compromised after the server fell victim to a Salt vulnerability.
The largest risk of log failure is that previously-issued certificates will stop working and require replacement prior to their natural expiration dates. CT-enforcing browsers only accept certificates that contain receipts, called Signed Certificate Timestamps or SCTs, from a sufficient number of approved (i.e. non-failed) CT logs. Each SCT is a promise by the respective log to publish the certificate (Technically, the precertificate, which contains the same information as the certificate). What if one or more of the SCTs in a certificate is from a log which failed after the certificate was issued?
Fortunately, browser Certificate Transparency policies anticipated the possibility. At a high level, the Chrome and Apple policies require the following:
- At least one SCT from a log that is approved at time of certificate validation
- At least 2-5 SCTs (depending on certificate lifetime) from logs that were approved at time of SCT issuance
(The precise details are a bit more nuanced but don't matter for this blog post. Also, some very advanced website operators deliver SCTs using alternative mechanisms, which are subject to different rules, but this is extremely rare. Read the policies if you want the nitty gritty.)
Consequentially, a single log failure can't cause a certificate to stop working. The first requirement is still satisfied because the certificate still has at least one SCT from a currently-approved log. The second requirement is still satisfied because the failed log was approved at time of SCT issuance. The minimum number of SCTs from approved-at-issuance logs increases with certificate lifetime to reflect the increased probability of log failure: a 180 day certificate only needs 2 SCTs, whereas a 3 year certificate (back when they were allowed) needed 5 SCTs.
Note that when a log fails, its public key is not totally removed from browsers like a distrusted certificate authority's key would be. Instead, the log transitions to a state called "retired" (previously known as "disqualified"), which means its SCTs are still recognized for satisfying the second requirement. This led to an interesting question in 2020 when a log's private key was compromised: should the log be retired, or should it be totally distrusted? Counter-intuitively, it was retired, even though SCTs could have been forged to make it seem like a certificate was included in the log when it really wasn't. But that's OK, because the second requirement isn't about making sure certificates are logged, but about making sure certificate authorities aren't dumbasses. In a world of competent CAs, the second requirement wouldn't be necessary since CAs would have the good sense to include SCTs from extra logs in case some logs failed. But we do not live in a world of competent CAs - indeed, that's why CT exists - and there would no doubt be CAs embedding a single SCT in certificates if browsers didn't require redundancy.
Of course, there is still a chance that all of the SCTs in a certificate come from logs that end up failing. That would suck. But I don't think the solution is to change Certificate Transparency. Catastrophic CT failure is just one of several reasons that a certificate might need to be replaced before its natural expiration, and empirically it's the least likely reason. When a certificate authority is distrusted, as has happened several times, all of its certificates must be replaced. When a certificate is misissued, it has to be revoked and replaced, and there have been numerous incidents since 2019 in which a considerable number of certificates have required revocation - sometimes as many as 100% of a CA's active certificates:
- Multiple CAs: millions of certificates misissued due to insufficient serial number entropy
- Sectigo: 11 million certificates misissued with illegal OU field
- Let's Encrypt: 3 million certificates misissued due to CAA checking bug
- Google Trust Services: 1 million certificates misissued due to domain validation method not documented in CPS
- Sectigo: 400,000 certificates misissued due to domain validation method not documented in CPS
- Let's Encrypt: 100% of certificates misissued due to lifetime being longer than CPS allowed
The ecosystem is currently ill-prepared to handle mass replacement events like these, and in many of the above cases CAs missed the revocation deadline or declined to revoke entirely. Although the above misissuances had relatively low security impact, other cases, such as distrusting a compromised certificate authority, or events like Heartbleed or the Debian random number fiasco, are very security critical. This makes the inability to quickly replace certificates at scale a serious problem, larger than the problem of CT logs failing. To address the problem, Let's Encrypt is working on a specification called ACME Renewal Info (ARI) that would allow CAs to instruct TLS servers to request new certificates prior to their normal expiration. They've committed to deploying ARI or a similar technology in their staging environment by 2021-11-12.
Post a Comment
Your comment will be public. To contact me privately, email me. Please keep your comment polite, on-topic, and comprehensible. Your comment may be held for moderation before being published.
Comments
Anonymous on 2021-12-11 at 22:23:
Thanks for a very interesting article.
I noticed that you can only submit to a CT log if your certificate is signed by a trusted CA. Do you know of any similar systems which accept certificates from non trusted CAs or self signed?
One use case is to be able to log certificates with custom extensions (which most CAs strip out).
Reply
Andrew Ayer on 2021-12-12 at 01:37:
No. The problem with accepting non-publicly-trusted certificates is that someone could generate an unlimited number of certificates and spam logs with them.
Reply