tailscale/cmd
Brad Fitzpatrick 15cba0a3f6 tstest/natlab/vmtest: add TestDiscoKeyChange
Add a vmtest that brings up two gokrazy nodes A and B behind two
One2OneNAT networks (so direct UDP works in both directions and any
slowness can't be blamed on NAT traversal), establishes a WireGuard
tunnel A → B with TSMP, then rotates B's disco key four times and
asserts that the data plane recovers in both directions after each
rotation. All pings are TSMP (the data-plane ping; disco pings would
not exercise the WireGuard tunnel itself).

The five pings:

  1. A → B  (initial; brings up the tunnel; 30s budget)
  2. B → A  after rotate (LocalAPI rotate-disco-key debug action)
  3. A → B  after rotate (LocalAPI)
  4. B → A  after restart (SIGKILL; gokrazy supervisor respawns)
  5. A → B  after restart (SIGKILL)

Each post-rotation ping gets a 15-second budget. Two unavoidable
multi-second waits dominate today:

  - The rotate-then-a→b phase takes ~10s on main because of LazyWG.
    After B's WantRunning bounce, B's wgengine resets its
    sentActivityAt/recvActivityAt maps and trims A out of the
    wireguard-go config as an "idle peer"; B only re-adds A on
    inbound activity, by which point A's first few TSMP packets
    have been silently dropped at B's tundev. The
    bradfitz/rm_lazy_wg branch removes that trimming entirely
    (verified locally: this phase drops to <100ms there).

  - The restart phases take ~5s for wireguard-go's RekeyTimeout
    handshake retry. After SIGKILL+respawn the first WG handshake
    init from the restarted node sometimes goes into the void
    (likely the brief peer-removed window in the receiver's
    two-step maybeReconfigWireguardLocked reconfig during which
    the peer is absent from wireguard-go), and wg-go's 5s+jitter
    retransmit timer is the next opportunity to retry. That retry
    succeeds and the staged TSMP packet flushes. Intrinsic to the
    protocol's retransmit policy.

Once LazyWG is removed and the first-handshake-after-reconfig race
is fixed, the budget should drop to 5s.

Supporting changes:

  ipn/ipnlocal: DebugRotateDiscoKey now toggles WantRunning off and
  back on after rotating the disco key. magicsock.Conn.RotateDiscoKey
  only resets local disco state; without also dropping wireguard-go
  session keys, peers keep encrypting with their stale per-peer
  session against us until their rekey timer fires (WireGuard has no
  data-plane signaling to invalidate sessions). Bouncing WantRunning
  runs the engine through Reconfig(empty) → authReconfig, which
  drops every peer's WG session so the next packet either way
  triggers a fresh handshake.

  ipn/ipnlocal, ipn/localapi: add a debug-only "peer-disco-keys"
  LocalAPI action ([LocalBackend.DebugPeerDiscoKeys]) that returns
  a map[NodePublic]DiscoPublic from the current netmap. Tests reach
  it via [local.Client.DebugResultJSON]. We do not surface disco
  keys via [ipnstate.PeerStatus] because adding a non-comparable
  [key.DiscoPublic] field there breaks reflect-based test helpers
  (e.g. TestFilterFormatAndSortExitNodes' use of cmp.Diff), and
  general LocalAPI clients have no need for disco keys. Since the
  debug LocalAPI is gated behind the ts_omit_debug build tag, this
  endpoint is automatically stripped from small binaries.

  cmd/tta: add /restart-tailscaled handler (Linux-only, via /proc walk)
  to drive the SIGKILL phase. On gokrazy the supervisor respawns
  tailscaled within a second.

  tstest/integration/testcontrol: add Server.AllOnline. When set,
  every peer entry in MapResponses is marked Online=true. Several
  disco-key handling fast paths in controlclient and wgengine
  (removeUnwantedDiscoUpdates, removeUnwantedDiscoUpdatesFromFull
  NetmapUpdate, the wgengine tsmpLearnedDisco fast path) only fire
  for online peers; without this flag, tests exercising disco-key
  rotation only hit the offline-peer code paths, which mask issues
  and are several seconds slower in this scenario. Finer-grained
  per-node online tracking can be added later.

  tstest/natlab/vmtest: add Env.RotateDiscoKey,
  Env.RestartTailscaled, Env.PeerDiscoKey, Node.Name, an
  [AllOnline] EnvOption that plumbs through to
  testcontrol.Server.AllOnline, and an exported
  Env.Ping(from, to, type, timeout). Ping replaces the unexported
  helper so callers can specify both a ping type (PingDisco for
  warming peer state, PingTSMP for asserting end-to-end
  connectivity) and a deadline. PeerDiscoKey returns its LocalAPI
  error so callers inside tstest.WaitFor can retry transient
  failures rather than fataling the test.

Updates #12639
Updates #13038

Change-Id: I3644f27fc30e52990ba25a3983498cc582ddb958
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-04-29 12:58:00 -07:00
..
addlicense all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
build-webclient all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
checkmetrics all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
cigocacher cmd/cigocacher: make --stats flag best-effort (#18761) 2026-02-19 16:06:12 +00:00
cloner cmd/cloner: deep-clone pointer elements in map-of-slice values 2026-04-17 11:36:05 -04:00
connector-gen all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
containerboot cmd/containerboot,cmd/k8s-proxy,kube: add authkey renewal to k8s-proxy (#19221) 2026-04-15 16:13:46 +01:00
derper cmd/derper,derp: add metrics for rate limit hits (#19560) 2026-04-29 10:29:09 -07:00
derpprobe all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
dist all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
distsign all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
featuretags all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
get-authkey all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
gitops-pusher cmd/vet: add subtestnames analyzer; fix all existing violations 2026-04-05 15:52:51 -07:00
hello cmd/hello: remove hello.ipn.dev (#19567) 2026-04-28 17:54:29 -07:00
jsonimports all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
k8s-nameserver cmd/vet: add subtestnames analyzer; fix all existing violations 2026-04-05 15:52:51 -07:00
k8s-operator cmd/k8s-operator: add nodeSelector to DNSConfig resource (#19429) 2026-04-29 15:56:33 +01:00
k8s-proxy cmd/containerboot,cmd/k8s-proxy,kube: add authkey renewal to k8s-proxy (#19221) 2026-04-15 16:13:46 +01:00
mkmanifest all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
mkpkg all: use Go 1.26 things, run most gofix modernizers 2026-03-06 13:32:03 -08:00
mkversion all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
nardump tool/updateflakes, cmd/nardump: replace update-flake.sh with Go tool 2026-04-28 10:18:32 -07:00
natc all: use bart.Lite instead of bart.Table where appropriate 2026-03-24 14:45:23 +00:00
netlogfmt all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
nginx-auth all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
omitsize all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
pgproxy all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
printdep cmd/printdep: add --next flag to use rc Go build hash instead 2026-01-27 14:49:56 -08:00
proxy-test-server all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
proxy-to-grafana all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
sniproxy all: use Go 1.26 things, run most gofix modernizers 2026-03-06 13:32:03 -08:00
speedtest all: use Go 1.26 things, run most gofix modernizers 2026-03-06 13:32:03 -08:00
ssh-auth-none-demo ssh: replace tempfork with tailscale/gliderssh 2026-04-07 11:59:38 +01:00
stunc all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
stund derp,types,util: use bufio Peek+Discard for allocation-free fast reads (#19067) 2026-03-24 10:52:20 -04:00
stunstamp all: use Go 1.26 things, run most gofix modernizers 2026-03-06 13:32:03 -08:00
sync-containers all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
systray client/systray: support several different color themes 2026-04-27 18:54:14 -07:00
tailscale cmd/tailscale/cli: drive "file cp" progress and offline warning from peerAPI 2026-04-28 11:03:58 -07:00
tailscaled go.mod: bump github.com/google/go-containerregistry (#19500) 2026-04-23 10:39:27 -07:00
testcontrol all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
testwrapper cmd/testwrapper: make test tolerant of a GOEXPERIMENT being set 2026-03-06 14:05:35 -08:00
tl-longchain all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
tsconnect tsd, all: add Sys.ExtraRootCAs, plumb through TLS dial paths 2026-04-07 18:10:54 -07:00
tsidp go.mod: bump github.com/google/go-containerregistry (#19500) 2026-04-23 10:39:27 -07:00
tsnet-proxy cmd/tsnet-proxy: add tsnet-based port proxy tool (#19468) 2026-04-22 13:34:18 -04:00
tsp control/tsp, cmd/tsp: add low-level Tailscale protocol client and tool 2026-04-16 20:00:25 -07:00
tsshd all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
tta tstest/natlab/vmtest: add TestDiscoKeyChange 2026-04-29 12:58:00 -07:00
vet cmd/vet: add subtestnames analyzer; fix all existing violations 2026-04-05 15:52:51 -07:00
viewer cmd/cloner: deep-clone pointer elements in map-of-slice values 2026-04-17 11:36:05 -04:00
vnet all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00
xdpderper all: remove AUTHORS file and references to it 2026-01-23 15:49:45 -08:00