tailscale

mirror of https://github.com/tailscale/tailscale.git synced 2026-06-03 21:01:54 +08:00

Author	SHA1	Message	Date
Alex Chan	f4a280cdbd	all: update a few more references to network/tailnet lock Updates tailscale/corp#37904 Change-Id: I746b06328e080fa2b9ff28a2d099f95645aa3d0b Signed-off-by: Alex Chan <alexc@tailscale.com>	2026-05-28 16:44:16 +01:00
Brad Fitzpatrick	94af1b00fb	cmd/testwrapper, tstest: move test sharding out of test code Previously, sharding required tests to opt in by calling tstest.Shard, which used a process-global counter to assign each test to a shard. This had two problems: most tests didn't call it, so they ran on every shard (defeating the purpose), and shard assignments were unstable (depended on call order, so adding a test could reshuffle others). Remove tstest.Shard and tstest.SkipOnUnshardedCI entirely. Instead, have testwrapper implement sharding automatically for all tests: when TS_TEST_SHARD=N/M is set, it uses "go list -json" (no compilation) to find test source files, scans them for top-level Test/Benchmark/ Example/Fuzz function names, and filters by fnv32a(name) % M == N-1. The filtered names are passed as an anchored -run regex to go test. Using go list instead of "go test -list" avoids linking the test binary twice (Go's build cache does not cache test binary linking). Fixes #19886 Change-Id: I62ab7b3d757324d4c5fd0b5de50c1e3742681791 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-05-27 16:53:17 -07:00
Brad Fitzpatrick	364b952d62	cmd/containerboot: track peers from IPN bus updates, stop using netmap.NetworkMap Some tests in another repo were broken by tailscale/tailscale#19607. This fixes them, by finishing off the rest of the migration away from netmap.NetworkMap on the IPN bus in containerboot. Containerboot used to rebuild a full NetworkMap-shaped view while reacting to IPN bus notifications. Now it insteads has its own netmapState type (immutable) of exactly what it needs to track, and sends those immutable values around, making cheap edits of new immutable values when an IPN bus edit arrives. This should make cmd/containerboot scale to much larger tailnets now too. Fixes #19852 Fixes tailscale/corp#42347 Updates #12542 Change-Id: I88adaf061f85f677f954a764935e6654329d75a6 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-05-27 14:12:48 -07:00
Claus Lensbøl	9be21088f4	wgengine/{,magicsock},tstest/natlab/vmtest: send disco on cached netmap (#19878 ) Originally found when adding tests for working with cached netmaps, and finding the added tests to be flakey. When working off of a cached netmap, if a node exists in the cached netmap but does not yet have any endpoints, DERP connections are available but not direct ones. By sending callMeMaybe to nodes without endpoints in the cached netmap, we can establish direct connections for this edge case. Aditionally, ensure that TSMP disco advert messages are not sent if the endpoint does not have a valid address yet. Fixes #19843 Updates #19597 Signed-off-by: Claus Lensbøl <claus@tailscale.com>	2026-05-27 13:05:12 -04:00
Brad Fitzpatrick	2c965ab540	types/netmap, ipn/ipnlocal, control/controlclient: rename NodeMutationAdd to NodeMutationUpsert NodeMutationAdd was a misleading name: a PeersChanged entry in a MapResponse can represent either a truly new peer or a full replacement for an existing peer that couldn't be expressed as a PeerChangedPatch. Calling it "Add" implied it was always a completely new node, which is wrong. (I'd changed my mind on the design of mapping add/delete events to NodeMutations halfway through #19607 and forgot to update the name, even though I'd updated half the docs) Rename it to NodeMutationUpsert to reflect the actual semantics: the node should be inserted or replaced in the peer map regardless of whether it already existed. Updates #19607 Updates #12542 Change-Id: Iebd3daddb3318cba02e115a1b184fcb3ee8f83d6 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-05-27 08:37:14 -07:00
Simon Law	988615dbad	ipn/ipnlocal,tstest/integration: pause the control client consistently (#19846 ) There are two places where tailscaled transitions into a paused state: 1. tailscaled’s controlclient is initially created, 2. tailscale down, or the GUI equivalent, commands it to. This patch unifies the implementation of both scenarios into LocalBackend.shouldPauseControlClientLocked to prevent the implementation from drifting. The flaky tstest/integration.TestNoControlConnWhenDown test exposed this mismatch, but only by accident. This patch also changes TestNode.MustDown so that it runs `tailscale down` and then waits for the testcontrol server to finish handling any associated /machine/map requests. Fixes #19831 Signed-off-by: Simon Law <sfllaw@tailscale.com>	2026-05-22 17:58:44 -07:00
Simon Law	fd2405ca8f	tstest/integration: mark TestNoControlConnWhenDown as a flaky test (#19832 ) Some checks failed CI / cross (amd64, windows) (push) Has been cancelled Details CI / cross (arm, 5, linux) (push) Has been cancelled Details CI / cross (arm, 7, linux) (push) Has been cancelled Details CI / cross (arm64, darwin) (push) Has been cancelled Details CI / cross (arm64, linux) (push) Has been cancelled Details CI / cross (arm64, windows) (push) Has been cancelled Details CI / cross (loong64, linux) (push) Has been cancelled Details CI / ios (push) Has been cancelled Details CI / crossmin (amd64, illumos) (push) Has been cancelled Details CI / crossmin (amd64, plan9) (push) Has been cancelled Details CI / crossmin (amd64, solaris) (push) Has been cancelled Details CI / crossmin (ppc64, aix) (push) Has been cancelled Details CI / android (push) Has been cancelled Details CI / wasm (push) Has been cancelled Details CI / tailscale_go (push) Has been cancelled Details CI / depaware (push) Has been cancelled Details CI / go_generate (push) Has been cancelled Details CI / make_tidy (push) Has been cancelled Details CI / licenses (push) Has been cancelled Details CI / staticcheck (${{ matrix.name }}) (--with-tags-all=darwin, arm64, darwin, macOS) (push) Has been cancelled Details CI / staticcheck (${{ matrix.name }}) (--with-tags-all=linux, amd64, linux, Linux) (push) Has been cancelled Details CI / staticcheck (${{ matrix.name }}) (--with-tags-all=windows, amd64, windows, Windows) (push) Has been cancelled Details CI / staticcheck (${{ matrix.name }}) (--without-tags-any=windows,darwin,linux --shard=1/4, amd64, linux, Portable (1/4)) (push) Has been cancelled Details CI / staticcheck (${{ matrix.name }}) (--without-tags-any=windows,darwin,linux --shard=2/4, amd64, linux, Portable (2/4)) (push) Has been cancelled Details CI / staticcheck (${{ matrix.name }}) (--without-tags-any=windows,darwin,linux --shard=3/4, amd64, linux, Portable (3/4)) (push) Has been cancelled Details CI / staticcheck (${{ matrix.name }}) (--without-tags-any=windows,darwin,linux --shard=4/4, amd64, linux, Portable (4/4)) (push) Has been cancelled Details CI / notify_slack (push) Has been cancelled Details CI / merge_blocker (push) Has been cancelled Details CI / check_mergeability_strict (push) Has been cancelled Details CI / check_mergeability (push) Has been cancelled Details Updates #19831 Signed-off-by: Simon Law <sfllaw@tailscale.com>	2026-05-21 17:36:09 -07:00
Brad Fitzpatrick	aa5da2e5f2	ipn/ipnlocal, control/controlclient: process node adds/removes in constant time For large tailnets (~50k+ nodes) with frequent peer churn (ephemeral GitHub Actions workers etc.), tailscaled used to rebuild the full netmap and fan it out on the IPN bus on every MapResponse that added or removed a peer. There were two O(N) costs per delta: the full netmap rebuild + every Notify.NetMap encode to every bus watcher. This change tackles both: 1. Plumb O(1) peer add/remove through the delta path. PeersChanged and PeersRemoved no longer prevent the delta happy path; instead, they mutate the per-node-backend peer map in place. 2. Restrict ipn.Notify.NetMap emission to the platforms whose host GUIs still depend on it (Windows, macOS, iOS) and migrate in-tree consumers off it everywhere else: - Migrate reactive consumers (containerboot, kube agents, sniproxy, tsconsensus, etc.) off Notify.NetMap to the previously-added Notify.SelfChange signal so they no longer have to subscribe to the full netmap. - Add ipn.NotifyNoNetMap so GUI clients on "legacy-emit" platforms that have already migrated can opt out of the per-watcher NetMap encode. - Gate Notify.NetMap emission on the producer side by a compile- time GOOS check, so the supporting code is dead-code-eliminated on Linux and other geese where no GUI consumer needs it. Re-running BenchmarkGiantTailnet from tstest/largetailnet, which was added along with baseline numbers on unmodified main in `ad5436af0d`, the per-delta cost (one peer add+remove pair) is now ~O(1) regardless of tailnet size N: N no-watcher (ms/op) bus-watcher (ms/op) before now factor before now factor 10000 32 0.11 300x 166 0.13 1300x 50000 222 0.11 2000x 865 0.13 6700x 100000 504 0.12 4100x 1765 0.13 13400x 250000 1551 0.12 12500x 4696 0.15 32400x Updates #12542 Change-Id: I94e34b37331d1a8ec74c299deffadf4d061fda9e Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-05-21 09:26:19 -07:00
Brad Fitzpatrick	f3a117e813	net/tsdial: run happy eyeballs across A and AAAA in UserDial When tailscaled is running in userspace-networking mode behind an exit node (e.g. as a SOCKS5 proxy), it resolves a hostname and then dials a single resolved IP through the tunnel. If the name has both A and AAAA, Go's net.Resolver merges them and we pick ips[0], which on an IPv6-native host is usually AAAA. If the exit node has no IPv6 egress (or vice versa), the dial fails silently through the tunnel and the user sees a hang. Resolve all candidates and race connect attempts across address families with a 300ms happy-eyeballs delay, matching Go's net.Dialer default and the existing pattern in net/dnscache (commit `ee0a03b14`). First success wins; losers are cancelled and any conns they produce are closed. A failBoost channel wakes the launcher when a connect fails fast (e.g. ICMP "no route" via the tunnel) so we don't sit on the 300ms timer when the answer is already known. userDialResolve is refactored into userDialResolveAll (returns the full candidate list) plus a thin single-IP wrapper for callers like UserDialPlan that don't race. UserDial's per-IP dispatch (netstack vs peer dialer vs SystemDial vs std) is extracted to dialOneUser so each candidate can route correctly on its own merits. Also fix serveDial in localapi to pass the original hostname to UserDial rather than a pre-resolved IP, so the race can fire. This fix is single-ended: it works against any exit node, including old ones, with no protocol changes. The trade-off versus filtering on the exit-node side via PeerAPI DoH is that every dial through an unreachable-family exit node costs one failed connect attempt per cache window, rather than zero, which is acceptable given the simplicity. Fixes #19792 Fixes #13257 Change-Id: I9d7645d0034caf3ee22ecdd8070798353f77e94b Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-05-20 18:35:55 -07:00
James Tucker	36c52ef383	tstest/integration/testcontrol: fix serveMap read-modify-write race serveMap cloned s.nodes[nk], mutated the clone outside the mutex, then wrote it back via updateNodeLocked. A concurrent UpdateNode, SetNodeCapMap, or other writer landing between the clone and the writeback would be silently clobbered. Mutate the live node under the mutex instead. Surfaces in tsnet's TestListenService as a flaky ErrUntaggedServiceHost panic: the test calls control.UpdateNode to attach a tag, a concurrent updateRoutine map request from the host races, and the host's next netmap arrives with Tags=[]. Updates #19822 Change-Id: I6c5ebd5e5bf79a40316f53f627157230773cb469 Signed-off-by: James Tucker <james@tailscale.com>	2026-05-20 18:29:58 -07:00
Brad Fitzpatrick	04ae61fe4b	tstest/integration/jswasmtest: add headless-Chromium tests for @tailscale/connect Add Go tests that drive a real headless Chromium (via chromedp) against the built cmd/tsconnect/pkg/ artifact and verify the @tailscale/connect public API surface end-to-end. The package has not been republished in three years, in part because no test exercises the produced artifact at runtime — only tsc --noEmit and a Go build run in CI. TestCreateIPN loads pkg.js into the browser, calls createIPN with a junk auth key, and asserts that pkg.createIPN / pkg.runSSHSession are functions and that createIPN() returns an IPN with the documented run/login/logout/ssh/fetch methods. No control-plane traffic. TestFetchTailnetPeer stands up a full local tailnet (testcontrol + DERP + a tsnet.Server peer) and verifies that the browser-side WASM client can join over WebSocket-noise to the same control, connect to DERP over WSS, and then ipn.fetch() an HTTP service hosted on the tsnet peer through the tailnet. The test asserts the response body matches a known string. Browser state transitions are logged: NoState -> NeedsLogin -> Starting -> Running. Tests are opt-in via --run-headless-browser-tests (matching the existing --run-vm-tests pattern in tstest/natlab/vmtest) so they never fire in casual `go test ./...` runs. When the flag is set, a test is skipped if cmd/tsconnect/pkg/ has not been built, and fails with t.Error if no chromium binary is found on $PATH (honoring $CHROME_BIN as an override). findChromium also falls back to /Applications/Google Chrome.app and /Applications/Chromium.app on darwin, since macOS Chrome's executable lives inside an .app bundle and is not on $PATH by default. The .github/workflows/test.yml wasm job is extended to install google-chrome-stable and run the tests with the flag after build-pkg. To prevent silently testing a stale pkg/main.wasm (built from an older checkout than the rest of the test invocation), build-pkg now writes pkg/build-info.json recording the sha256 of the raw (pre-wasm-opt) go-build output. The test does its own `go build` of cmd/tsconnect/wasm with the same -tags/-trimpath/-ldflags (factored into a new cmd/tsconnect/wasmbuild package shared by both call sites) and t.Fatalfs with a "rebuild" instruction on mismatch. Cost is near-zero because the Go build cache from the prior build-pkg makes the rebuild a cache hit. The new wasmbuild package also replaces cmd/tsconnect's hardcoded -tags string with a minimal-feature-set computation. wasmbuild.Keep names the small set of feature/featuretags entries the browser client actually needs (netstack, logtail, dns, health, c2n, ipnbus); wasmbuild.Tags() emits a ts_omit_<f> for every other omittable feature in feature/featuretags.Features, with transitive deps expanded via featuretags.Requires. An init() panics if Keep references a feature unknown to feature/featuretags so a rename there fails loudly. Net effect on size: 32M raw / 9.4M brotli before this change, 25M raw / 4.4M brotli after — vs the last-published 1.39.98 at 21M / 3.8M. The transitive package-import graph is unchanged (176 tailscale.com/* packages either way): featuretags omits eliminate dead code via `const HasX = false`, not imports. Trimming the import graph would require a separate, larger refactor splitting interface packages by build tag. Writing TestFetchTailnetPeer surfaced several real issues, all fixed here: * cmd/tsconnect built the wasm with the nethttpomithttp2 tag, but control/ts2021 (since commit `1d93bdce2`, "control/controlclient: remove x/net/http2, use net/http", Oct 2025) requires HTTP/2 from net/http's bundled implementation. With nethttpomithttp2 set, the bundle is excluded and the wasm client cannot speak HTTP/2 to any control plane, including production. Drop the tag. Wasm size grows ~1 MB raw / ~300 KB brotli (more than offset by the feature pruning above). The last published @tailscale/connect (1.39.98, early 2023) pre-dates the regression, which is why no consumer has reported the breakage. * tstest/integration/testcontrol.Server's /ts2021 noise upgrade endpoint rejected anything but POST. WebSocket clients (the only transport available to browser-WASM) come in as GET. Allow both; the controlhttp AcceptHTTP path dispatches on the Upgrade header, so the websocket library still enforces GET for WS upgrades. This matches production, where the same controlhttpserver.AcceptHTTP routes purely on the Upgrade header without checking method. * derp/derphttp's urlString built the DERP URL from node.HostName only, dropping node.DERPPort. Non-WS clients use a separate code path (connectToHost) that honors DERPPort, but WebSocket-only clients (browser-WASM) went through urlString and so could not reach a DERP running on any port other than 443. Include the port when it differs from the scheme default. Also move addWebSocketSupport from cmd/derper (where it was main-only) to derp/derpserver.AddWebSocketSupport so tstest/integration.RunDERPAndSTUN can wrap its DERP handler with WebSocket support — without that, the test DERP would not accept the browser's wss connection. Fixes #9394 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com> Change-Id: Iff9cdee303e3b239924249b5bffb2fd04e02f391	2026-05-20 10:48:29 -07:00
Claus Lensbøl	ee0a03b140	net/dnscache: run happy eyeballs with more than one dest IP (#19770 ) If the context given to DialContext has a shorter lifetime than the OS TCP SYN timeout, and TCP SYNs are dropped from the path to the remote, DialContext would never fall back to try IPv6 after IPv4. Instead, use the normal happy eyeballs race if there is more than one address. This does remove the implicit prioritization of IPv4 over IPv6 in cases where there is only a single IPv4 remote address. Updates #13346 Signed-off-by: Claus Lensbøl <claus@tailscale.com>	2026-05-19 12:59:11 -04:00
Brad Fitzpatrick	93440604e0	tstest/natlab/vmtest: add TestPeerRelay Add a VM-based natlab test that exercises the peer-relay feature (feature/relayserver) end-to-end across three Tailscale nodes whose network topology makes a direct A<->B UDP path impossible: both peers are behind HardNAT (FreeBSD/pfSense-style endpoint-dependent NAT) with no port-mapping services, while the relay node is behind One2OneNAT so its STUN-discovered WAN endpoint is reachable from both peers. The test enables the relay server via EditPrefs, then waits for an a->b PingDisco whose PingResult.PeerRelay is set (proving magicsock chose the peer-relay path, not DERP), and finally asserts that the relay's DebugPeerRelaySessions LocalAPI reports the session. The existing TestPeerRelayPing in tstest/integration runs three tailscaled processes on the loopback interface with no NATs; this new vmtest covers peer relay through real per-VM kernels and NATs. To wire control-server capabilities into vmtest, also add a PeerRelayGrants() EnvOption (sibling of AllOnline, SameTailnetUser) that flips testcontrol.Server.PeerRelayGrants so the wildcard packet filter grants tailcfg.PeerCapabilityRelay and PeerCapabilityRelayTarget; without those caps magicsock won't consider any peer a candidate relay. Updates #13038 Change-Id: Ib3440b83ec442da0d3b89ffa48ceea9398ea9062 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-05-14 14:47:29 -07:00
M. J. Fromberger	4eb977413a	tstest/natlab/vmtest: add helpers for fatal step errors (#19753 ) In a lot of places, we construct an error to End a step, then immediately log it to the governing test as test fatal. Save ourselves a bit of boilerplate by putting methods on Step for that. There are a couple cases this doesn't cover, e.g., where we construct the Step outside a subtest that wants to fail individually, but it helps enough to pay for its lines. Updates #13038 Change-Id: I71f9900942962de16609b6b198d3ba13d6958a5f Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>	2026-05-14 09:24:47 -07:00
Claus Lensbøl	bb47ea2c6b	tstest/natlab/vmtest: start migrating old natlab tests to vmtest (#19727 ) Instead of having two entry points for running natlab tests, start converting the connectivity tests to use the vmtest framework. Grid and pair tests have yet to be moved over. Updates #13038 Signed-off-by: Claus Lensbøl <claus@tailscale.com>	2026-05-13 16:44:53 -04:00
M. J. Fromberger	9f48567bf1	ipn/ipnlocal,wgengine/magicsock: add basic counters for cached peer connectivity (#19699 ) Add new clientmetric counters for establishing contact with peers while using cached network map data. To do this, instrument the magicsock.Conn with a bit to indicate whether its peer data came from a cached netmap. If so, there are two conditions we will count as establishing connectivity to a peer: - Receipt of a CallMeMaybe from a peer via disco. - Establishing a valid endpoint address for a peer. In vmtest, add Env.ClientMetrics to scrape metrics from the specified node. Use this to check that counters were updated in caching tests. Updates https://github.com/tailscale/projects/issues/13 Updates #12639 Change-Id: Ie8cf3244ac8af4f5bcfe4d0d944078da2ba08990 Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>	2026-05-12 12:01:05 -07:00
Brad Fitzpatrick	758ebe9839	tstest/natlab/vmtest: use short paths for Unix sockets macOS limits Unix socket paths to 104 bytes. The Go test TempDir path (e.g. /var/folders/.../TestDirectConnection...679197086/001/) easily exceeds that, causing "bind: invalid argument". Create a short /tmp/vmtest* directory for all socket files (vnet, QMP, dgram) so the paths stay well under the limit on every platform. Updates #13038 Change-Id: I721d24561d1766aaa964692bc77f40a131aa9455 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-05-11 21:54:27 -07:00
Brad Fitzpatrick	f4c5613156	tstest/natlab/vmtest: don't require KVM; use TCG on macOS startCloudQEMU hardcoded -machine q35,accel=kvm and -cpu host, which fails on any host without KVM (notably macOS). Replace with a qemuAccelArgs helper that probes /dev/kvm and falls back to QEMU's TCG software emulation, matching the pattern already used by tstest/integration/nat. Also wire the helper into startGokrazyQEMU so gokrazy VMs pick up KVM when available. Updates #13038 Change-Id: I7745518db823279b1880957bb14ca2ffdaab4c50 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-05-11 19:18:17 -07:00
Brad Fitzpatrick	e062b46984	tstest/natlab, .github/workflows: add opt-in natlab CI workflow The natlab vmtest suite (tstest/natlab/vmtest) and the integration nat tests are gated behind --run-vm-tests because they need KVM and are slow. Until now nothing in CI exercised them apart from a single canary TestEasyEasy run on every PR. Add .github/workflows/natlab-test.yml that runs the full opt-in suite on demand (workflow_dispatch), on PRs labeled "natlab", and on main every 12 hours via cron. The workflow has two phases: - "prepare" builds the gokrazy VM image, downloads the Ubuntu and FreeBSD cloud images once via the new natlabprep tool, and emits a dynamic JSON matrix of every TestX function it finds in the two opt-in packages. - "test" is a per-test matrix that depends on prepare. Each matrix job restores the shared caches and runs a single test, so adding a new TestFoo is automatically picked up on the next run without any workflow edits. Rename the existing natlab-integrationtest.yml to natlab-basic.yml since it's the small smoke variant (just TestEasyEasy on every PR); the new natlab-test.yml is the bigger suite. The job inside is renamed to EasyEasy for the same reason. Move the macOS arm64 host check from vmtest.Env.Start into vmtest.Env.AddNode so a test that adds a vmtest.MacOS node skips immediately on a non-macOS host, and add an explicit skipIfNotMacOSArm64 helper at the top of the two macOS-only tests so the platform requirement is obvious to readers. Quiet the takeAgentConnOne miss log in tstest/natlab/vnet by default (it was the overwhelming majority of bytes in CI logs, with no signal in healthy runs) and replace it with a periodic "still waiting" line that only fires after 10s, so a truly stuck agent connection still surfaces. Updates #13038 Change-Id: I4582098d8865200fd5a73a9b696942319ccf3bf0 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-05-11 17:14:46 -07:00
Claus Lensbøl	469d356ed8	tstest/natlab/vmtest: add test for direct conn with cached netmap (#19660 ) When a peer is not able to connect to control after a restart and is using a cached netmap, that nodes should be able to connect to another peer in its tailnet (given that the home DERP of that peer has not changed in the meantime). Add test that starts two peers and connects them to a tailnet with caching enabled. Then blackhole traffic to control from one peer and restart it. Verify that the connection between the two ends up direct. Adds facilities for expecting a certain path type between nodes. Updates: #19597 Signed-off-by: Claus Lensbøl <claus@tailscale.com>	2026-05-08 16:57:27 -04:00
Fernando Serboncini	495d3acc7b	tstest/natlab/vmtest: kill QEMU when test process dies (#19676 ) Re-exec the test binary as a thin wrapper that holds a pipe inherited from the test. When the test goes away (any reason, including SIGKILL, panic, or OOM), the kernel closes the pipe write end; the wrapper sees EOF and SIGKILLs itself, taking QEMU and its children with it. Updates #13038 Change-Id: Ib2151098193551396c1d7bb51b07da3bd6b2cfb4 Signed-off-by: Fernando Serboncini <fserb@tailscale.com>	2026-05-07 16:14:27 -04:00
Claus Lensbøl	76248a68b2	tstest/natlab/vnet: close gonet sockets when test is done (#19677 ) Running all vmtests in tstest/natlab/vmtest locally was breaking later tasks in the queue. The goroutine dump on timeout had goroutines hanging around for 9 minutes, meaning that something was not getting cleaned up. goroutine 262 [select, 9 minutes]: gvisor.dev/gvisor/pkg/tcpip/adapters/gonet.commonRead({...}) Add a timeout of Now() to gonet TCP connections when the test ends (inspired by ServeUnixConn()), and wait for them to shut down before exiting the test. Updates #13038 Signed-off-by: Claus Lensbøl <claus@tailscale.com>	2026-05-07 14:57:07 -04:00
Brad Fitzpatrick	883d4fd2cd	wgengine/netstack, net/ping: stop using pro-bing and use our net/ping instead Fixes #19633 Fixes #13760 Change-Id: I0fa9423523a3a0fb1dfcde57de0f26e51723ff97 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-05-04 14:05:24 -07:00
Brad Fitzpatrick	81569e891f	tstest/iosdeps: update import list to mirror ipn-go-bridge The purpose of this package is to test the iOS dependency closure, but it had drifted from the actual import list of the ipn-go-bridge package in the corp repo (the Go side of the iOS / macOS app). Update the imports to match ipn-go-bridge's GOOS=ios import list, adding many missing packages including wgengine/netstack, feature/{taildrop,syspolicy,condregister}, the util/syspolicy/* subpackages, types/{key,lazy,logid,netmap}, tsd, safesocket, util/{eventbus,must,set}, and several net/* and ipn/* packages. Drop two now-stale BadDeps entries (for now!): database/sql/driver and github.com/google/uuid are reached via wgengine/netstack -> github.com/prometheus-community/pro-bing, which netstack imports on darwin \|\| ios for ICMP user-ping, so the iOS app already ships them. But we should fix that later. Updates #19633 Change-Id: Ic50779fdb195685a2e8ccd7c513eee91b0feeaf8 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-05-04 14:05:24 -07:00
Claus Lensbøl	ff9c3f0e00	tstest/natlab/vmtest: add test loading netmap cache from disk (#19598 ) For testing the loading of netmap cache from disk, the cache needs to exist. The simple solution is to start two nodes and connect them to control, with the netmap caching capability set. Then cut the connection to control, restart the nodes, and ping between them. This tests that we can start from a cache and get to running state, but also that we are able to establish a connection between the nodes. For now this is not testing how the nodes are able to talk to each other (DERP vs direct). Updates #19597 Signed-off-by: Claus Lensbøl <claus@tailscale.com>	2026-05-01 09:46:19 -04:00
Brad Fitzpatrick	b313bffbe7	control/tsp, tstest/integration/testcontrol: deflake TestMapAgainstTestControl The test was flaky under stress with "AddRawMapResponse N: node not connected" failures. The root cause was in testcontrol's addDebugMessage: it conflated "no streaming poll registered" with "wake-up channel buffer momentarily full". The single-slot updatesCh is just a lossy wake-up signal, but the streaming serveMap loop has fast paths (takeRawMapMessage and the hasPendingRawMapMessage continue) that don't drain it. A stale notification could remain buffered, causing the next sendUpdate to fail even though msgToSend had been queued and the streaming poll would still pick it up. Detect the real failure case (no streaming poll) by checking s.updates[nodeID] directly, and treat sendUpdate's buffer-full result as benign — the message is in msgToSend, which is the source of truth. Also plumb an optional health.Tracker through tsp.ClientOpts to the underlying ts2021.Client and supply one in the tests, eliminating the "## WARNING: (non-fatal) nil health.Tracker (being strict in CI)" stack dumps emitted by controlhttp.(Dialer).forceNoise443 under CI. Fixes #19583 Change-Id: Ib2334376585e8d6562f000a0b71dea0117acb0ff Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-29 16:11:00 -07:00
Brad Fitzpatrick	15cba0a3f6	tstest/natlab/vmtest: add TestDiscoKeyChange Add a vmtest that brings up two gokrazy nodes A and B behind two One2OneNAT networks (so direct UDP works in both directions and any slowness can't be blamed on NAT traversal), establishes a WireGuard tunnel A → B with TSMP, then rotates B's disco key four times and asserts that the data plane recovers in both directions after each rotation. All pings are TSMP (the data-plane ping; disco pings would not exercise the WireGuard tunnel itself). The five pings: 1. A → B (initial; brings up the tunnel; 30s budget) 2. B → A after rotate (LocalAPI rotate-disco-key debug action) 3. A → B after rotate (LocalAPI) 4. B → A after restart (SIGKILL; gokrazy supervisor respawns) 5. A → B after restart (SIGKILL) Each post-rotation ping gets a 15-second budget. Two unavoidable multi-second waits dominate today: - The rotate-then-a→b phase takes ~10s on main because of LazyWG. After B's WantRunning bounce, B's wgengine resets its sentActivityAt/recvActivityAt maps and trims A out of the wireguard-go config as an "idle peer"; B only re-adds A on inbound activity, by which point A's first few TSMP packets have been silently dropped at B's tundev. The bradfitz/rm_lazy_wg branch removes that trimming entirely (verified locally: this phase drops to <100ms there). - The restart phases take ~5s for wireguard-go's RekeyTimeout handshake retry. After SIGKILL+respawn the first WG handshake init from the restarted node sometimes goes into the void (likely the brief peer-removed window in the receiver's two-step maybeReconfigWireguardLocked reconfig during which the peer is absent from wireguard-go), and wg-go's 5s+jitter retransmit timer is the next opportunity to retry. That retry succeeds and the staged TSMP packet flushes. Intrinsic to the protocol's retransmit policy. Once LazyWG is removed and the first-handshake-after-reconfig race is fixed, the budget should drop to 5s. Supporting changes: ipn/ipnlocal: DebugRotateDiscoKey now toggles WantRunning off and back on after rotating the disco key. magicsock.Conn.RotateDiscoKey only resets local disco state; without also dropping wireguard-go session keys, peers keep encrypting with their stale per-peer session against us until their rekey timer fires (WireGuard has no data-plane signaling to invalidate sessions). Bouncing WantRunning runs the engine through Reconfig(empty) → authReconfig, which drops every peer's WG session so the next packet either way triggers a fresh handshake. ipn/ipnlocal, ipn/localapi: add a debug-only "peer-disco-keys" LocalAPI action ([LocalBackend.DebugPeerDiscoKeys]) that returns a map[NodePublic]DiscoPublic from the current netmap. Tests reach it via [local.Client.DebugResultJSON]. We do not surface disco keys via [ipnstate.PeerStatus] because adding a non-comparable [key.DiscoPublic] field there breaks reflect-based test helpers (e.g. TestFilterFormatAndSortExitNodes' use of cmp.Diff), and general LocalAPI clients have no need for disco keys. Since the debug LocalAPI is gated behind the ts_omit_debug build tag, this endpoint is automatically stripped from small binaries. cmd/tta: add /restart-tailscaled handler (Linux-only, via /proc walk) to drive the SIGKILL phase. On gokrazy the supervisor respawns tailscaled within a second. tstest/integration/testcontrol: add Server.AllOnline. When set, every peer entry in MapResponses is marked Online=true. Several disco-key handling fast paths in controlclient and wgengine (removeUnwantedDiscoUpdates, removeUnwantedDiscoUpdatesFromFull NetmapUpdate, the wgengine tsmpLearnedDisco fast path) only fire for online peers; without this flag, tests exercising disco-key rotation only hit the offline-peer code paths, which mask issues and are several seconds slower in this scenario. Finer-grained per-node online tracking can be added later. tstest/natlab/vmtest: add Env.RotateDiscoKey, Env.RestartTailscaled, Env.PeerDiscoKey, Node.Name, an [AllOnline] EnvOption that plumbs through to testcontrol.Server.AllOnline, and an exported Env.Ping(from, to, type, timeout). Ping replaces the unexported helper so callers can specify both a ping type (PingDisco for warming peer state, PingTSMP for asserting end-to-end connectivity) and a deadline. PeerDiscoKey returns its LocalAPI error so callers inside tstest.WaitFor can retry transient failures rather than fataling the test. Updates #12639 Updates #13038 Change-Id: I3644f27fc30e52990ba25a3983498cc582ddb958 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-29 12:58:00 -07:00
Brad Fitzpatrick	fd6ae2fad4	tstest/natlab/vmtest: serialize per-platform setup with sync.Once Two cloud-platform nodes (e.g. sr-a and sr-b in TestSiteToSite) boot in parallel via errgroup and both call ensureCompiled and the inline image preparation block, racing to Begin() the same shared Step (which is deduped by name in Env.Step). The second goroutine panics: panic: Step "Compile linux_amd64 binaries": Begin called in state running panic: Step "Prepare ubuntu-24.04 image": Begin called in state done ensureCompiled had a TOCTOU dedup attempt (released compileMu before doing the work, only added to the compiled set at the end), and image preparation had no dedup at all. Replace the compiled set with a per-key map[string]sync.Once for each of compile and image preparation, so concurrent callers serialize on the Once and only the first executes Begin/work/End. Fixes commit `02ffe5baa8`. Updates #13038 Change-Id: If710bcc9e0aafebf0ad5b61553bae11458d976d7 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-29 09:54:58 -07:00
Brad Fitzpatrick	02ffe5baa8	tstest/natlab/vmtest: add macOS VM snapshot caching for fast test starts Cache a pre-booted macOS VM snapshot on disk so subsequent test runs restore from the snapshot instead of cold-booting. The snapshot is keyed by the Tart base image digest and a code version constant (macOSSnapshotCodeVersion); bumping either invalidates the cache. Snapshot preparation (one-time): - Boot the Tart base image with a NAT NIC (--nat-nic flag) - Wait for SSH, compile and install cmd/tta as a LaunchDaemon - TTA polls the host via AF_VSOCK for an IP assignment; during prep the host replies "wait" - Disconnect NIC, save VM state via SIGINT Test fast path (cached, ~7s to agent connected): - APFS clone the snapshot, write test-specific config.json - Launch Host.app with --disconnected-nic --attach-network --assign-ip - VZ restores from SaveFile.vzvmsave (~5s with 4GB RAM) - TTA's vsock poll gets the IP config, sets static IP via ifconfig (bypasses DHCP entirely), switches driver addr to the IP directly (bypasses DNS), and resets the dial context so the reverse-dial reconnects immediately - TTA agent connects to test driver within ~2s of IP assignment Key optimizations: - 4GB RAM instead of 8GB: halves SaveFile.vzvmsave (1.4GB vs 2.4GB), halves restore time (5.5s vs 11s) - AF_VSOCK IP assignment: bypasses macOS DHCP (~5-7s saved) - Direct IP dial: bypasses DNS resolution for test-driver.tailscale - Dial context reset: cancels stale in-flight dials from snapshot - Kill instead of SIGINT for test VM cleanup (no state save needed) - Parallel VM launches Also: - Add TestDriverIPv4/TestDriverPort constants to vnet - Add --nat-nic and --assign-ip flags to Host.app - Fix SIGINT handler: retain DispatchSource globally, use dispatchMain() - Add vsock listener (port 51011) to Host.app for IP config protocol - Add disconnectNetwork() to VMController for clean snapshot state - Fix Makefile: set -o pipefail so xcodebuild failures aren't swallowed Updates #13038 Change-Id: Icbab73b57af7df3ae96136fb49cda2536310f31b Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-29 08:17:13 -07:00
Brad Fitzpatrick	4cec06b8f2	tstest/natlab/vmtest: add macOS VM screenshot streaming to web UI When --vmtest-web is set, Host.app is launched with --screenshot-port 0 to start a localhost HTTP server that captures the VZVirtualMachineView display. The Go test harness parses the SCREENSHOT_PORT=<port> line from stdout, then polls every 2 seconds for JPEG thumbnails and pushes them over WebSocket to the web dashboard. Clicking a screenshot thumbnail opens a full-resolution image proxied through the web UI's /screenshot/{node} endpoint. Screenshot events are excluded from the EventBus history (they're large and only the latest matters, stored in NodeStatus.Screenshot). Updates #13038 Change-Id: I9bc67ddd1cc72948b33c555d4be3d8db06a41f6d Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-29 07:48:26 -07:00
Alex Chan	bb91bb842c	all: remove everything related to non-seamless key renewal Seamless key renewal has been the default in all clients since 1.90. We retained the ability to disable it from the control plane as a precaution, but we haven't seen any issues that require us to disable it. We're now removing all the code for non-seamless key renewal, because we don't expect to turn it on again, and indeed it's been untested in the field for three releases so might contain latent bugs! Updates tailscale/corp#33042 Change-Id: I4b80bf07a3a50298d1c303743484169accc8844b Signed-off-by: Alex Chan <alexc@tailscale.com>	2026-04-29 10:03:26 +01:00
Brad Fitzpatrick	b2d4ba04b6	tstest/natlab/vmtest: add macOS VM support using Tart base images Add macOS VM support to the vmtest framework using Tart's pre-built macOS images (ghcr.io/cirruslabs/macos-tahoe-base) instead of building from IPSW. The Tart image has SIP disabled and SSH enabled. At test time, the Tart base image's disk, NVRAM, and hardware identity are APFS-cloned into a tailmac-compatible directory layout, and the VM is booted headlessly via tailmac's Host.app (Virtualization.framework) with its NIC connected to vnet's dgram socket. New features: - tailmac.go: ensureTartImage (auto-pull), cloneTartToTailmac (format conversion), startTailMacVM (launch + cleanup) - NoAgent() node option for VMs without TTA installed - LANPing() for ICMP reachability testing via TTA's /ping endpoint - IsMacOS field on OSImage, with GOOS/GOARCH support - Dgram socket listener in Start() for macOS VMs - Fix ReadFromUnix error spam on dgram socket close in vnet TestMacOSAndLinuxCanPing verifies a macOS Tart VM and a gokrazy Linux VM can ping each other on the same vnet LAN. Updates #13038 Change-Id: I5e73a27878abf009f780fdf11a346fc857711cff Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-28 12:51:40 -07:00
Brad Fitzpatrick	ec7b11d986	tstest/natlab/vmtest, cmd/tta: add TestTaildrop Add a vmtest that brings up two Ubuntu nodes, each behind its own EasyNAT, joined to the tailnet. The sender pushes a small file via "tailscale file cp" and the receiver fetches it via "tailscale file get --wait", asserting that the filename and contents round-trip unchanged. To make Taildrop work in vmtest, three small pieces were needed: The Linux/FreeBSD cloud-init now starts tailscaled with --statedir as well as --state=mem:, so the daemon has a VarRoot to host Taildrop's incoming-files directory. State itself remains in-memory (so nothing persists across reboots); only the var-root scratch space is on disk. vmtest.New grows a variadic EnvOption parameter and a SameTailnetUser helper. When the option is passed, Start sets AllNodesSameUser=true on the embedded testcontrol.Server. Cross-node Taildrop requires the sender and receiver to share a Tailnet user (or have an explicit PeerCapabilityFileSharingTarget granted between them, which we don't plumb here), so TestTaildrop opts in. Existing tests don't. cmd/tta gains /taildrop-send and /taildrop-recv handlers that wrap "tailscale file cp" and "tailscale file get --wait", plus Env.SendTaildropFile and Env.RecvTaildropFile helpers in vmtest that drive them. Updates #13038 Change-Id: I8f5f70f88106e6e2ee07780dd46fe00f8efcfdf1 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-28 12:27:55 -07:00
Brad Fitzpatrick	4b8e0ede6d	tstest/natlab/{vmtest,vnet}, cmd/tta: add TestMullvadExitNode Add a vmtest that brings up a Tailscale client, an Ubuntu VM acting as a Mullvad-style plain-WireGuard exit node, and a non-Tailscale webserver, each on its own NAT'd vnet network with a distinct WAN IP. The test exercises Tailscale's IsWireGuardOnly peer code path: the way the control plane wires Mullvad exit nodes into a client's netmap, including the per-client SelfNodeV4MasqAddrForThisPeer source-IP rewrite that lets a Tailscale CGNAT IP egress through a plain-WireGuard tunnel that has no idea what Tailscale is. The mullvad VM doesn't run wireguard-tools or kernel WireGuard; instead, a new TTA endpoint /wg-server-up creates a real Linux TUN named wg0, drives it with wireguard-go (already vendored), and configures the kernel side (ip addr/up, ip_forward, iptables NAT MASQUERADE) so decrypted traffic from the peer egresses with the mullvad VM's WAN IP. Userspace vs kernel WireGuard makes no difference on the wire — what's being tested is Tailscale's plain-WireGuard exit-node code path, not the kernel module — and this lets the test avoid downloading and installing .deb packages inside the VM. Adds Env.BringUpMullvadWGServer (calls /wg-server-up, returns the generated WG public key as a key.NodePublic), Env.SetExitNodeIP (EditPrefs ExitNodeIP directly, for exit nodes whose IPs aren't discoverable via TTA), Env.ControlServer (exposes the underlying testcontrol.Server so tests can UpdateNode / SetMasqueradeAddresses to inject custom peers), and Env.Status (fetches a node's tailscale status, used to read the client's pubkey so we can pin it as the WG server's only allowed peer). The test verifies that the webserver's echoed source IP is the client's WAN with no exit node selected, the mullvad VM's WAN with the WG-only peer selected as exit, and the client's WAN again after clearing. Updates #13038 Change-Id: I5bac4e0d832f05929f12cb77fa9946d7f5fb5ef1 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-28 11:31:48 -07:00
Brad Fitzpatrick	b9eac14ef9	tstest/natlab/vmtest: add web UI for watching VM tests live Add an optional --vmtest-web flag that starts an HTTP server showing a live dashboard for vmtest runs. The dashboard includes: - Step progress tracker showing all test phases (compile, image prep, QEMU launch, agent connect, tailscale up, test-specific steps) with status icons and elapsed times - Per-VM "virtual monitor" cards showing serial console output streamed in realtime via WebSocket - Per-NIC DHCP status (supporting multi-homed VMs like subnet routers) - Per-node Tailscale status (hidden for non-tailnet VMs) - Test status badge (Running/Passed/Failed) with live elapsed timer - Event log showing all lifecycle events chronologically Architecture follows the existing util/eventbus HTMX+WebSocket pattern: the server pushes HTML fragments with hx-swap-oob attributes over a WebSocket, and HTMX routes them to the correct DOM elements by ID. Key components: - vmstatus.go: Step tracker (Begin/End lifecycle), EventBus (pub/sub with history for late joiners), VMEvent types, NodeStatus tracking - web.go: HTTP server, WebSocket handler, template loading, ANSI-to-HTML conversion via robert-nix/ansihtml, deterministic port selection - assets/: HTML templates, CSS, HTMX library (copied from eventbus) - vnet/vnet.go: DHCP event callback on Server for observing DHCP lifecycle - qemu.go: Console log file tailing with manual offset-based reading Usage: go test ./tstest/natlab/vmtest/ --run-vm-tests --vmtest-web=:0 -v When using :0, a deterministic port based on the test name is tried first so re-runs get the same URL, falling back to OS-assigned on conflict. Updates #13038 Change-Id: I45281347b3d7af78ed9f4ff896033984f84dcb4d Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-28 07:46:04 -07:00
Brad Fitzpatrick	cb239808a6	tstest/natlab/vmtest: add --test-version flag Add a --test-version flag to run the natlab VM tests against released tailscale/tailscaled binaries downloaded from pkgs.tailscale.com instead of building from the source tree. The value can be a concrete release like "1.97.255", or "stable" / "unstable" which resolve to the latest TarballsVersion on that track via pkgs.tailscale.com/<track>/?mode=json. The track for a concrete version is derived from its minor (even=stable, odd=unstable). The host architecture (amd64 or arm64) selects the tarball. Tarballs are cached + extracted under ~/.cache/tailscale-vmtest/builds/<version>_<arch>/ so they are not re-fetched per test. tta is still always built from the local tree. Cloud VMs (Ubuntu, Debian) pick up the downloaded binaries via the existing files.tailscale file server. Non-Linux GOOS (FreeBSD) falls back to building from source since pkgs.tailscale.com only ships Linux tarballs. Gokrazy nodes continue to use binaries baked into the gokrazy image; --test-version is a no-op for them. Updates #13038 Change-Id: I213ef7db362dd17bf69d2685cbf2ab0ec5a3fee1 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-28 06:59:26 -07:00
Brad Fitzpatrick	d0ae993334	tstest/natlab/vmtest: add more subnet router tests Add two tests building on TestExitNode's framework: TestSubnetRouterPublicIP brings up a client, a subnet router, and a webserver, each on its own NAT'd network with distinct WAN IPs. The subnet router advertises the webserver's network as a route. The test toggles the client's --accept-routes preference and asserts that the webserver's echoed source IP switches between the client's own WAN (direct dial) and the subnet router's WAN (forwarded through the router and SNAT'd). TestSubnetRouterAndExitNode adds a fourth node, an exit node that advertises 0.0.0.0/0 + ::/0, and uses a table-driven layout with subtests to cover the four combinations of (exit on/off, subnet on/off). The case where both are on confirms longest-prefix match wins: the subnet router's /24 takes precedence over the exit node's /0. The exit node itself is configured with --accept-routes=off so that, in the exit-only case, it forwards directly to the simulated internet rather than re-routing the forwarded traffic via the subnet router (which would otherwise mask the exit node's WAN as the observed source). Adds an Env.SetAcceptRoutes helper for toggling the RouteAll pref via EditPrefs, used by both tests. Updates #13038 Change-Id: Ifc2726db1df2f039c477c222484f535bebc40445 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-27 17:06:17 -07:00
Brad Fitzpatrick	c0e6ffed0d	tstest/tailmac: add NIC hot-swap, disconnected NIC, and screenshot server Add NIC attachment hot-swap support to Host.app: VZNetworkDevice.attachment is writable at runtime, so --disconnected-nic creates a NIC with no attachment, and --attach-network hot-swaps it to a vnet dgram socket after boot/restore. macOS detects link-up and does DHCP. Refactor TailMacConfigHelper: extract createDgramAttachment() and createDisconnectedNetworkDeviceConfiguration() from the monolithic createSocketNetworkDeviceConfiguration(). Add --screenshot-port flag for headless mode. Host.app serves GET /screenshot as JPEG via a localhost HTTP server, capturing the VZVirtualMachineView via CGWindowListCreateImage. The Go test harness polls these to push live thumbnails to the web dashboard. Also: SIGINT handler in headless mode for clean VM state save. Updates #13038 Change-Id: I42fba0ecd760371b4ec5b26a0557e3dd0ba9ecae Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-27 17:03:09 -07:00
Brad Fitzpatrick	5c1738fd56	tstest/natlab/{vmtest,vnet}, cmd/tta: add TestExitNode Add a vmtest TestExitNode that brings up a client, two exit nodes, and a non-Tailscale webserver, each on its own NAT'd vnet network with a distinct WAN IP. The test cycles the client's exit node setting between off, exit1, and exit2 and asserts that the webserver echoes the expected post-NAT source IP for each. Three pieces were needed to make this work: vnet now forwards TCP between simulated networks at the packet level, mirroring the existing UDP path. When a guest VM sends TCP to another simulated network's WAN IP, the source network's gateway rewrites src via doNATOut and routeTCPPacket hands the packet off to the destination network, which rewrites dst via doNATIn and writes the rewritten frame onto the destination LAN. The TCP stacks of the two guest VM kernels talk end-to-end; vnet just NATs the IP/port headers in flight, so all TCP semantics (handshakes, options, sequence numbers, payload) are preserved without a gvisor TCP termination in the middle. Adds a focused TestInterNetworkTCP that exercises this path without any Tailscale machinery. cmd/tta binds its outbound dial to the default route's interface using SO_BINDTODEVICE. Without that, the moment tailscaled installs 0.0.0.0/0 → tailscale0 in response to setting an exit node, TTA's existing TCP connection to test-driver gets rerouted through the exit node. From the test driver's perspective the connection's packets then arrive with the exit node's WAN IP as the source rather than the client's, so they don't match the existing flow and the connection is dead — manifesting in the test as a hang on EditPrefs (which had actually completed in milliseconds on the daemon side, but whose response never made it back). Pinning the socket to the underlying NIC keeps TTA's agent connection on a real interface regardless of any policy routing tailscaled installs later. We bind rather than carry the Tailscale bypass fwmark because the fwmark approach is conditional on tailscaled having configured SO_MARK-based policy routing, while binding is unconditional. vmtest grows an Env.SetExitNode helper that sets ExitNodeIP via EditPrefs through the agent, used by the new test. Updates #13038 Change-Id: I9fc8f91848b7aa2297ef3eaf71fed9d96056a024 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-27 16:54:20 -07:00
Alex Chan	10b63f27ce	tstest/clock: explain what happens if you don't set a Start time While working on #19444, I assumed that omitting `Start` would return a clock that started at January 1, year 1, because that's the zero value for a `time.Time`, but actually it uses the current UTC time instead. This behaviour is non-obvious, so document it. Updates #cleanup Change-Id: Id91400778578655953ff3e1671ce470db97cfe91 Signed-off-by: Alex Chan <alexc@tailscale.com>	2026-04-28 00:15:46 +02:00
Brad Fitzpatrick	ad5436af0d	tstest/largetailnet, tstest/integration/testcontrol: add in-process large-tailnet benchmark Add a Go benchmark that exercises a single tailnet client (a [tsnet.Server] running in the test process) against a synthetic large initial netmap and a stream of caller-driven peer add/remove deltas, all in-process. The harness is split in two parts: - tstest/largetailnet, a reusable package containing a [Streamer] that hijacks the map long-poll on a [testcontrol.Server] via the new AltMapStream hook, sends one initial MapResponse with N synthetic peers, and forwards caller-supplied delta MapResponses on the same stream. Helpers like MakePeer / AllocPeer build synthetic peers with unique IDs and addresses derived from the Tailscale ULA range. - tstest/largetailnet/largetailnet_test.go, BenchmarkGiantTailnet (headless tailscaled workload, no IPN bus subscriber) and BenchmarkGiantTailnetBusWatcher (GUI-client workload with one Notify subscriber attached). Both are gated on --actually-test-giant-tailnet (skipped by default), stand up an in-process testcontrol + tsnet.Server, let Up block until the initial N-peer netmap has been processed, then ResetTimer and run add+remove pairs via b.Loop. Per-delta sync is via a test-only [ipnlocal.LocalBackend.AwaitNodeKeyForTest] channel that closes once the just-added peer key appears in the netmap (no-watcher variant) or via bus-Notify drain (bus-watcher variant). To support the hijack, [testcontrol.Server] grows an AltMapStream hook and a small MapStreamWriter interface for benchmarks/stress tests that need to drive a controlled MapResponse sequence; the normal serveMap path is untouched when AltMapStream is nil. The streamer answers non-streaming "lite" map polls (which controlclient issues before the streaming long-poll to push HostInfo) with an empty MapResponse and returns immediately, so the streaming poll that follows is the one that gets the initial netmap. The benchmark is intended for before/after comparisons of netmap- and delta-handling changes targeted at large tailnets. CPU profiles on unmodified main show the expected O(N) hotspots: setControlClientStatusLocked / authReconfigLocked / userspaceEngine.Reconfig / setNetMapLocked, plus JSON encoding of the full Notify.NetMap to bus watchers (which dominates the BusWatcher variant). Median ms/op over 10 runs on unmodified main, by tailnet size N: N no-watcher bus-watcher 10000 32 166 50000 222 865 100000 504 1765 250000 1551 4696 Recommended invocation: go test ./tstest/largetailnet/ -run=^$ \ -bench='BenchmarkGiantTailnet(BusWatcher)?$' \ -benchtime=2000x -timeout=10m \ --actually-test-giant-tailnet \ --giant-tailnet-n=250000 \ -cpuprofile=/tmp/giant.cpu.pprof Updates #12542 Change-Id: I4f5b2bb271a36ba853d5a0ffe82054ef2b15c585 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-27 11:47:12 -07:00
Brad Fitzpatrick	f289f7e77c	tstest/natlab/vmtest,cmd/tta: add TestSiteToSite Verifies that site-to-site Tailscale subnet routing with --snat-subnet-routes=false preserves the original source IP end-to-end. Topology: two sites, each with a Linux subnet router on a NATted WAN plus an internal LAN, and a non-Tailscale backend on each LAN. Backends are given static routes pointing to their local subnet router for the remote site's prefix; an HTTP GET from backend-a to backend-b over Tailscale returns a body containing backend-a's LAN IP. Adds the supporting vmtest.SNATSubnetRoutes NodeOption and plumbs snat-subnet-routes through TTA's /up handler. The webserver started by vmtest.WebServer now also echoes the remote IP, for the preservation assertion. Adds a /add-route TTA endpoint (Linux-only for now) and a vmtest Env.AddRoute helper so the test can install the backend static routes through TTA rather than needing a host SSH key and debug NIC. ensureGokrazy now always rebuilds the natlab qcow2 (once per test process, via sync.Once) so the test picks up the new TTA and webserver behavior. This is pulled out of a larger pending change that adds FreeBSD site-to-site subnet routing support; figured we should have at least the Linux test covering what works today. Updates #5573 Change-Id: I881c55b0f118ac9094546b5fbe68dddf179bb042 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-22 12:11:30 -07:00
Brad Fitzpatrick	dfc2667f8f	tstest/integration/testcontrol: make Stream w/ capver >= 68 match docs, prod testcontrol wasn't following the document specs (and prod behavior) breaking a WIP integration test elsewhere. Updates tailscale/corp#40088 Change-Id: I02cf70894346bad7c85940b617d99c21c5310664 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-20 07:34:04 -07:00
Tom Proctor	c2da563fef	tstest/integration/vms: skip cloud-init package updates (#19443 ) The package updates started getting really slow yesterday. We can do better, but attempt a band aid fix for now, as the test is failing about a third of the time on PR CI. Updates tailscale/corp#40465 Change-Id: Icf53292ba83dd1ed76b9bdf9fb94a8f6fb448c07 Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>	2026-04-17 10:39:47 +01:00
Anton Tolchanov	958bcda5bf	control/controlclient: handle 429 responses during node registration If we get a 429 response during node registration, use the `Retry-After` header for backoff instead of the regular exponential backoff. The rate limiter error is propagated to the user, just like other registration errors are, e.g. ``` $ tailscale up backend error: node registration rate limited; will retry after 57s exit status 1 ``` Updates tailscale/corp#39533 Signed-off-by: Anton Tolchanov <anton@tailscale.com>	2026-04-15 18:54:08 +01:00
Claus Lensbøl	61c95f409c	control/controlclient: accept key if last seen on exist node is absent (#19402 ) On some nodes (found via natlab), the existing nodes last seen could be unset. For these cases, we would want to accept the key and write a last seen. This was breaking the cached netmap natlab tests. Updates #12639 Signed-off-by: Claus Lensbøl <claus@tailscale.com>	2026-04-15 03:53:40 -04:00
Naman Sood	6301a6ce4b	util/linuxfw,wgengine/router: allow incoming CGNAT range traffic with nodeattr Clients with the newly added node attribute `"disable-linux-cgnat-drop-rule"` will not automatically drop inbound traffic on non-Tailscale network interfaces with the source IP in the CGNAT IP range. This is an initial proof-of-concept for enabling connectivity with off-Tailnet CGNAT endpoints. Fixes tailscale/corp#36270. Signed-off-by: Naman Sood <mail@nsood.in>	2026-04-14 16:45:06 -04:00
Brad Fitzpatrick	a0a8fae856	tstest/integration: use linkat to hardlink test binaries on Linux Use linkat via /proc/self/fd with AT_SYMLINK_FOLLOW to create a hardlink of the test binary instead of copying it. This avoids copying ~50MB+ binaries into each test's temp directory, making test setup faster and reducing disk I/O. The simpler os.Link(b.Path, ret.Path) can't be used here because the source binary lives in the first test's TempDir, which may be cleaned up before later tests call CopyTo. The open FD keeps the inode alive after the path is deleted, but os.Link needs a valid path. (See also `b9f468240f` which tried os.Link but is racy for this reason.) The /proc/self/fd approach works without elevated privileges, unlike AT_EMPTY_PATH which requires CAP_DAC_READ_SEARCH. If the linkat fails for any reason (e.g. cross-filesystem temp dirs), it falls back to the existing full-copy path. Fixes #19397 Change-Id: I4b1f97f7e63a9ae9e09dce36dfbdd1f6cff92320 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>	2026-04-14 07:13:10 -07:00
Avery Pennarun	621dc9cf1b	tstest: fix kernel version parsing for Debian-style version strings The kernel version parser used strings.Cut with "-" to handle versions like "5.4.0-76-generic", but Debian uses "+" in versions like "6.12.41+deb13-amd64". Use strings.IndexAny to find the first "-" or "+" and truncate there. Fixes TestKernelVersion on Debian systems. Fixes #19395 Change-Id: I70e5f95682d54baf908e51f9f4b51c130b00aaaa Co-Authored-By: Brad Fitzpatrick <bradfitz@tailscale.com> Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>	2026-04-14 07:11:44 -07:00
Avery Pennarun	ab74ea0a67	tstest/integration: clear SSH_CLIENT env to prevent false positive detection When running integration tests over SSH (e.g., in remote development environments), the SSH_CLIENT environment variable is set. This causes isSSHOverTailscale() to incorrectly detect an SSH session and change behavior. Clear SSH_CLIENT in the test node environment to prevent these false positives. Fixes #19393 Change-Id: I1411abf0be9704cce37051476efb04d59beed386 Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>	2026-04-13 18:53:07 -07:00

1 2 3 4 5 ...

578 Commits