Fix a latent query-of-death in a GFE+QUIC+Leto codepath

This CL addresses a long-standing, rare, and mysterious GFE production crash, first observed during the initial Leto-for-QUIC rollout in 2019 and undiagnosed until now.

The crash occurs in the following situation:

- Client sends an initial inchoate CHLO.
- GFE responds with a REJ containing a ServerConfig.
- Client sends a CHLO containing a PUBS (public value) field which is incorrectly sized (or possibly corrupt in some other way, but the size issue is easiest to understand).
- GFE forwards the PUBS value to Leto to complete the DH exchange, but Leto returns an error since the value is corrupt.
- GFE is configured in a temporary mode where it holds copies of Leto's private keys, to provide transparent fallback if Leto goes down.  GFE tries to complete the DH exchange with its local copy of the key, which fails for the same reason as above.

At this point GFE *should* give up and reject the handshake.  Instead, it incorrectly goes down a codepath which was intended for the situation where GFE *does not* hold the ServerConfig private keys.  In that case, each GFE generates a local (i.e. not shared with other GFEs) "fallback" ServerConfig and private key, so that if Leto cannot be reached, the GFE can send a REJ containing this "fallback" ServerConfig, and allow the handshake to proceed without Leto.

GFE only creates a "fallback" ServerConfig if it intends to use Leto exclusively, without holding the private keys for all of its "normal" ServerConfigs.
So, in the situation described above, GFE has not created any "fallback" ServerConfig at all, yet the fallback codepath assumes that it has, and GFE segfaults.

The fix is simple - check whether a fallback ServerConfig exists before trying to use it.
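
For illustration only, the shape of that check can be sketched as below. This is a minimal sketch, not the code in this CL: ServerConfig, CryptoState, HandshakeResult and OnKeyExchangeFailed are hypothetical stand-ins invented for the example, and the real QUICHE types and call sites differ.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for a ServerConfig; not the real QUICHE type.
struct ServerConfig {
  std::string id;
};

// Hypothetical per-server crypto state.
struct CryptoState {
  // Null when the server holds the private keys for its normal
  // ServerConfigs, because in that mode no fallback config is generated.
  std::unique_ptr<ServerConfig> fallback_config;
};

enum class HandshakeResult { kRejectWithFallbackConfig, kFailHandshake };

// Called once both the remote and the local key exchange have failed.
HandshakeResult OnKeyExchangeFailed(const CryptoState& state) {
  if (state.fallback_config == nullptr) {
    // No fallback config exists, so give up instead of dereferencing null.
    return HandshakeResult::kFailHandshake;
  }
  // A fallback config exists: send a REJ that carries it.
  return HandshakeResult::kRejectWithFallbackConfig;
}
```

With a default-constructed CryptoState the function returns kFailHandshake; once fallback_config is populated it returns kRejectWithFallbackConfig.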

Testing the fix is ugly.  I found no existing test tools for generating malformed or otherwise adversarial QUIC messages.  So, after consulting with fayang@, I went with a testvalue-driven approach - the test client generates a kosher CHLO, and I use a testvalue callback to corrupt its PUBS field inside the server.

This requires an invasive change to core QUIC code, as well as undesirably adding a subset of the testvalue API to quic/platform.  My intention is that this method would just be stubbed out and useless on the Chromium side.  I would welcome feedback on whether this is a workable approach, or what alternatives I should use instead.
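
To make the testvalue-driven approach concrete for reviewers, here is a rough sketch of the idea. Everything in it is an assumption made for illustration: the registry, the AdjustTestValue signature, the label "hypothetical/corrupt_pubs" and ExtractPublicValue are invented here and are not the testvalue API or the QUIC code touched by this CL.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical minimal test-value registry; not the real testvalue API.
using TestValueCallback = std::function<void(std::string*)>;

std::map<std::string, TestValueCallback>& TestValueRegistry() {
  static std::map<std::string, TestValueCallback> registry;
  return registry;
}

// Does nothing unless a test has registered a callback under `label`.
// A stubbed-out build could make this an empty function.
void AdjustTestValue(const std::string& label, std::string* value) {
  auto it = TestValueRegistry().find(label);
  if (it != TestValueRegistry().end()) {
    it->second(value);
  }
}

// Stand-in for the server-side step that reads PUBS out of the CHLO.
std::string ExtractPublicValue(const std::string& pubs_from_chlo) {
  std::string pubs = pubs_from_chlo;
  // Hook point: a test can corrupt the value after the client sent it.
  AdjustTestValue("hypothetical/corrupt_pubs", &pubs);
  return pubs;
}
```

A test would register a callback under the label that, for example, drops the last byte of the value, so that the server sees an incorrectly sized PUBS even though the client sent a valid one.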

Protected by --quic_reloadable_flag_quic_check_fallback_null (default true).

PiperOrigin-RevId: 333740049
Change-Id: I2abfa3b5f45ea69b1f252e79b14e9237656f4d99
2 files changed
tree: f8cdbc2eaa15597b1afe9d619ce8e3e8779ed80c