Commit Graph

10699 Commits

Author SHA1 Message Date
Brad Fitzpatrick
57e443c813 cmd/testwrapper: auto-retry every failing test
Previously, testwrapper only retried tests explicitly annotated with
flakytest.Mark. Authors don't pre-emptively mark tests that haven't
flaked yet, so the first flake of a brand-new test failed CI even
when a re-run would have passed.

testwrapper now retries every failing test within a per-test wall-clock
budget (default: 5 minute per-attempt timeout capped at 1.5x the first
failure duration, 10 minute total). A test that fails and then passes
on retry is reported as flaky; a test that never passes within the
budget remains a real failure (exit non-zero).

For flakeapp's existing log scraping, the wire format is preserved:
the "flakytest failures JSON:" line is now emitted only for tests
that ultimately flaked (passed on retry). Unmarked tests get a fake
issue URL of the form https://github.com/{owner}/{repo}/issues/UNKNOWN
where owner/repo is detected from GITHUB_REPOSITORY, the local git
remote, or falls back to tailscale/tailscale. A new "permanent test
failures JSON:" line is emitted for tests that never passed; flakeapp
ignores it for now (a follow-up can teach it to record real failures
separately).

flakytest.Mark stays as an opt-in API: still useful for tracking a
known-flaky test against a real issue and for TS_SKIP_FLAKY_TESTS.

Updates tailscale/corp#38960

Change-Id: I56dfc9b023486d239f60793a53e9690578ce8017
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-29 19:11:08 +00:00
Simon Law
5d935c8900
net/traffic: add fuzz test for sorting nodes by traffic score (#19893)
In PR #19682, we introduced the traffic package which provides a
traffic.Scores.SortNodes method that uses rendezvous hashing to
break ties by equally distributing the “best” node for any given client.

This PR adds a fuzzer to make sure this algorithm is not wildly unfair.

Updates #17366
Updates tailscale/corp#33033

Signed-off-by: Simon Law <sfllaw@tailscale.com>
2026-05-29 11:55:49 -07:00
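For readers unfamiliar with rendezvous hashing as a tie-breaker (5d935c8900), here is a minimal sketch. It assumes integer scores and an FNV-1a hash over the client and node names; the `node` type, `weight`, and `sortNodes` are hypothetical and are not the traffic package's API.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// node is a hypothetical stand-in for a scored candidate.
type node struct {
	Name  string
	Score int
}

// weight hashes the (client, node) pair. Each client sees a different but
// deterministic ordering of nodes.
func weight(clientID, nodeName string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(clientID))
	h.Write([]byte{0}) // separator so "ab"+"c" differs from "a"+"bc"
	h.Write([]byte(nodeName))
	return h.Sum64()
}

// sortNodes orders nodes by descending score and breaks ties by the
// per-client rendezvous weight, so equally scored nodes are spread
// across clients instead of everyone choosing the same one.
func sortNodes(clientID string, nodes []node) {
	sort.SliceStable(nodes, func(i, j int) bool {
		if nodes[i].Score != nodes[j].Score {
			return nodes[i].Score > nodes[j].Score
		}
		return weight(clientID, nodes[i].Name) > weight(clientID, nodes[j].Name)
	})
}

func main() {
	nodes := []node{{"a", 1}, {"b", 2}, {"c", 2}}
	sortNodes("client-1", nodes)
	fmt.Println(nodes)
}
```

A fairness fuzzer for a scheme like this would check that, across many client IDs, each tied node ends up first roughly equally often.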
Jordan Whited
8b58bd6c64 net/batching: implement NodeAttrNeverGSOEqualTail
This NodeCapability works around the UDP GSO bugs introduced by
torvalds/linux@b10b446 (v7.0-rc1). These bugs were later fixed by
torvalds/linux@78effd8 and torvalds/linux@5f17ae0 (v7.1-rc5). These
Linux kernel bugs cause mangled UDP headers and UDP checksums, resulting
in high levels of packet loss.

The aforementioned bugs have already made their way downstream into
various distros, e.g. Ubuntu 26.04 LTS. Impacted users are now dealing
with poor UDP performance in tailscaled, and in any other software that
makes use of UDP GSO.

Not all users of the affected kernels are impacted as the relevant
kernel code path sits between kernel and netdev driver, and behaviors
vary by driver/device capability.

We cannot detect impact at runtime, as this would require gathering all
netdevs, and performing loopback tests. This is invasive and in many
cases impossible.

So, we are left to choose between disabling UDP GSO for all users on
affected kernels, whether they experience real impact or not, or try
and work around the bugs. Disabling UDP GSO for a user that is not
impacted can cut max throughput in half, and consume more CPU cycles.

This commit attempts to work around the bugs by avoiding UDP GSO when
batches are small, and injecting a 1-byte sentinel tail payload when
they are large. This tail payload is smaller than "GSO size", which
sidesteps the primary trigger of all fragments in a batch being
equal in length.

The end result is slightly increased payload and packet overhead, but
functional UDP GSO for all Linux 7.0-7.1.4 users, regardless of
netdev/driver.

Updates #19777

Signed-off-by: Jordan Whited <jordan@tailscale.com>
2026-05-29 11:36:35 -07:00
kari-ts
7355116c05
ipn/store: make WriteState(id, nil) delete key instead of adding nil entry (#19920)
All StateStore implementations store a nil value in the cache map when WriteState is called with a nil byte slice instead of deleting the key. This causes ReadState to return (nil, nil) instead of (nil, ErrStateNotExist), since the key is still present in the map.

This breaks reset-auth on Windows, Linux, and Android, and the node can't log back in without manually editing the state file. (macOS uses a different state store.)
DeleteProfile, DeleteAllProfilesForUser, setUnattendedModeAsConfigured are impacted but don't seem to break because the deleted keys are not reread.

This deletes the key from the cache instead.

Fixes tailscale/corp#42477

Signed-off-by: kari-ts <kari@tailscale.com>
2026-05-29 11:22:14 -07:00
Fran Bull
3d5102090f feature/conn25: use new pool nodeattr
We have been reading the pool config from the app nodeattr, but it is
global config, not per app, so it needs to be its own thing.

Updates tailscale/corp#39999

Signed-off-by: Fran Bull <fran@tailscale.com>
2026-05-29 08:29:34 -07:00
Brad Fitzpatrick
412c812d76 ipn/ipnlocal: use ACME ALPN for authorized Funnel non-CertDomain domains
If a user explicitly adds a non-ts.net (not a CertDomain domain) domain
like "foo.com" to their serve config as a web target that's also an allowed
funnel domain (using raw "tailscale serve set-config"), then use the new
ALPN cert fetching (from b553969b) to get certs for that domain.

This is just plumbing; there's no new product functionality to
actually enable this easily client-side, and it also has no visible
product surface to enable it server-side.

Updates tailscale/corp#41736

Change-Id: Ie2e421ac9611bce64bba3de6a454b2d505ea0e8a
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-28 13:33:45 -07:00
Tom Proctor
788a49eca5
.github/workflows: run vet on GitHub-hosted runners (#19913)
The github-ci-vm machine that runs our self-hosted CI for this repo is
only designed for the `vm` job in test.yml. That uses a different cache
dir which is causing github-ci-vm's small disk to fill up. Switch to
ubuntu 24.04 like the rest of our CI for this repo that doesn't require
anything special.

Updates tailscale/corp#40465

Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>
2026-05-28 21:30:46 +01:00
James Tucker
524a374f01 tsnet: wait for peer in netmap before pinging in setupTwoClientTest
If we dispatch a ping too early (after a later patch removes a 250ms
blockage) then the ping may be lost due to the peers not yet knowing
about each other. The ping is retained in order to set up and ensure a
wireguard session prior to test flow.

Updates #19822

Change-Id: I6cfea28931646a9387b6ffc2654e72cd846f4e55
Signed-off-by: James Tucker <james@tailscale.com>
Co-authored-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-28 11:27:54 -07:00
Brad Fitzpatrick
c086992f4f cmd/tailscale/cli: add whoami subcommand
Add a "tailscale whoami" subcommand that is equivalent to running
"tailscale whois $(tailscale ip -4)" but more ergonomic. It supports
the --json flag just like whois, and shares the WhoIsResponse
rendering code with whois.

Fixes #19907

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: I8f33ba7a5608bab7dffa8213303beb5f345936d3
2026-05-28 10:49:17 -07:00
Alex Chan
9d126aec34 all: remove network lock references from private method names
Updates tailscale/corp#37904

Change-Id: I312d46d958209ca3d1152d1877fb91a57c91798d
Signed-off-by: Alex Chan <alexc@tailscale.com>
2026-05-28 18:00:36 +01:00
Brendan Creane
8d90a6ab1e
ipn/ipnlocal: add HTTP/2 Content-Type tests for serve reverse proxy (#19905)
Adds two tests exercising the HTTP/2-inbound -> plaintext HTTP/1.1 backend
path through serve's reverseProxy and through the full serveWebHandler
entry point (with a funnel serveHTTPContext).

Updates #19866

Signed-off-by: Brendan Creane <bcreane@gmail.com>
2026-05-28 09:46:36 -07:00
Alex Chan
f4a280cdbd all: update a few more references to network/tailnet lock
Updates tailscale/corp#37904

Change-Id: I746b06328e080fa2b9ff28a2d099f95645aa3d0b
Signed-off-by: Alex Chan <alexc@tailscale.com>
2026-05-28 16:44:16 +01:00
Alex Chan
446ae97491 ipn: improve --exit-node hostname error during startup
When parsing the `tailscale up --exit-node=ARG` argument, we try to
resolve hostnames by searching the list of peers. However, at startup,
the peer list is empty, causing hostname lookups to trivially fail with
an unhelpful "invalid value" error.

Improve the error message when the peer list is empty to inform the user
that hostnames cannot be resolved during startup, and advise them to use
the exit node's Tailscale IP address instead.

Also, clarify that hostnames must be peer hostnames, not arbitrary
hostnames.

Fixes #19882

Change-Id: I9390a427c2863d657cf46c5e33b43cb3c5363764
Signed-off-by: Alex Chan <alexc@tailscale.com>
2026-05-28 16:43:45 +01:00
dragondscv
4b8115bb2c
cmd/containerboot: clamp MSS to PMTU for proxy group pods (#19686)
Single-pod ingress/egress proxies already called ClampMSSToPMTU when
setting up forwarding rules, but the proxy group (HA) code paths in
egressservices.go and ingressservices.go did not. This caused TCP
connections through proxy group pods to suffer from MSS/MTU mismatch
issues in environments where path MTU discovery is not working.

Add ClampMSSToPMTU calls in the egress sync loop (alongside the existing
EnsureSNATForDst call) and in addDNATRuleForSvc (alongside the existing
EnsureDNATRuleForSvc call), mirroring what the single-pod forwarding
rules already do.

Also add MSS clamping assertions to TestSyncIngressConfigs and track
ClampMSSToPMTU calls in FakeNetfilterRunner.

Fixes issue #19812 https://github.com/tailscale/tailscale/issues/19812.
Tracking internal ticket TSS-86326.

Signed-off-by: Jay Tung <ltung@crusoeenergy.com>
Co-authored-by: Jay Tung <ltung@crusoeenergy.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 12:57:38 +01:00
Brad Fitzpatrick
782c73bf41 cmd/containerboot: fix data race in TestContainerBoot
Parallel subtests share *ipn.Notify pointers (e.g. runningNotify).
When multiple subtests reached the same phase concurrently, they
all wrote to the shared notify's InitialStatus field without
synchronization, triggering the race detector.

Fix by shallow-copying *ipn.Notify before setting InitialStatus,
so each test iteration works on its own copy.

Updates #19380

Change-Id: I9dd40037e02146166f006f4f7c1ddcc47adba191
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-27 18:40:03 -07:00
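The shallow-copy fix in 782c73bf41 follows a common pattern for shared test fixtures. The sketch below assumes a simplified `notify` struct standing in for ipn.Notify; the field types and the `withInitialStatus` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// notify is a hypothetical stand-in for ipn.Notify with only the fields this
// sketch needs.
type notify struct {
	State         string
	InitialStatus string
}

// withInitialStatus returns a shallow copy of shared with InitialStatus set.
// The shared value is only read, never written, so concurrent callers do not
// race with each other.
func withInitialStatus(shared *notify, status string) *notify {
	cp := *shared
	cp.InitialStatus = status
	return &cp
}

func main() {
	shared := &notify{State: "Running"}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			n := withInitialStatus(shared, fmt.Sprintf("subtest-%d", i))
			_ = n // each goroutine works on its own copy
		}(i)
	}
	wg.Wait()
	fmt.Println(shared.InitialStatus == "") // true
}
```

A shallow copy is enough here because only a top-level field is overwritten; any pointers inside the struct are still shared and must remain read-only.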
James Tucker
25b8ed8d9e control/controlknobs,net/{batching,tstun},wgengine: add nodecaps to disable UDP & TUN GRO/GSO
Add four control-plane node attributes that let us disable UDP GSO/GRO
on the magicsock UDP socket and UDP/TCP GRO on the Tailscale TUN
device.

These complement the pre-existing TS_DEBUG_DISABLE_UDP_{GRO,GSO} and
TS_TUN_DISABLE_{UDP,TCP}_GRO envknobs. They exist so we can mitigate
upstream Linux kernel regressions on a deployed fleet without
requiring a client release, after two incidents (#13041, #19777) where
buggy kernel patches landed upstream and the fix took an excessively
long time to reach downstream distros.

Knob changes are reacted to in setNetworkMapInternal / SetNetworkMap via
a comparison against a cached "last applied" value and only an actual
transition triggers work: magicsock Rebind()+ReSTUN for UDP,
ApplyGROKnobs for TUN. The TUN side is gated by buildfeatures.HasGRO and
is one-way (wireguard-go GRO disablement is sticky); re-enabling
requires a client restart.

Updates #13041
Updates #19777

Change-Id: I802993070afa659cc06809bb0bfbb7f8a0cdb273
Signed-off-by: James Tucker <james@tailscale.com>
2026-05-27 17:10:14 -07:00
Brad Fitzpatrick
94af1b00fb cmd/testwrapper, tstest: move test sharding out of test code
Previously, sharding required tests to opt in by calling tstest.Shard,
which used a process-global counter to assign each test to a shard.
This had two problems: most tests didn't call it, so they ran on every
shard (defeating the purpose), and shard assignments were unstable
(depended on call order, so adding a test could reshuffle others).

Remove tstest.Shard and tstest.SkipOnUnshardedCI entirely. Instead,
have testwrapper implement sharding automatically for all tests: when
TS_TEST_SHARD=N/M is set, it uses "go list -json" (no compilation) to
find test source files, scans them for top-level Test/Benchmark/
Example/Fuzz function names, and filters by fnv32a(name) % M == N-1.
The filtered names are passed as an anchored -run regex to go test.

Using go list instead of "go test -list" avoids linking the test binary
twice (Go's build cache does not cache test binary linking).

Fixes #19886

Change-Id: I62ab7b3d757324d4c5fd0b5de50c1e3742681791
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-27 16:53:17 -07:00
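The shard assignment rule from 94af1b00fb, fnv32a(name) % M == N-1, and the anchored -run pattern can be sketched as below. The helper names (`inShard`, `filterShard`, `runPattern`) are hypothetical, and the exact regex shape testwrapper emits is an assumption.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"regexp"
	"strings"
)

// inShard reports whether name belongs to shard n of m (n is 1-based), using
// the rule from the commit message: fnv32a(name) % M == N-1.
func inShard(name string, n, m uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(name))
	return h.Sum32()%m == n-1
}

// filterShard keeps only the names assigned to shard n of m.
func filterShard(names []string, n, m uint32) []string {
	var out []string
	for _, name := range names {
		if inShard(name, n, m) {
			out = append(out, name)
		}
	}
	return out
}

// runPattern builds an anchored regex matching exactly the given names.
// With no names it yields "^()$", which matches no test.
func runPattern(names []string) string {
	quoted := make([]string, len(names))
	for i, name := range names {
		quoted[i] = regexp.QuoteMeta(name)
	}
	return "^(" + strings.Join(quoted, "|") + ")$"
}

func main() {
	names := []string{"TestDial", "TestListen", "ExampleClient", "FuzzParse"}
	for n := uint32(1); n <= 2; n++ {
		fmt.Printf("shard %d/2: -run %q\n", n, runPattern(filterShard(names, n, 2)))
	}
}
```

Because the assignment depends only on the test's name, adding or removing a test never moves any other test to a different shard.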
James Scott
db60aa8eca
logtail: gate "logtail started" behind TS_DEBUG_LOGTAIL envknob (#19891)
Gates the unnecessary "logtail started" message behind
the debug envknob TS_DEBUG_LOGTAIL. This is extra log spam that isn't
needed unless we are debugging.

Updates tailscale/corp#40908

Signed-off-by: James Scott <jim@tailscale.com>
2026-05-27 15:48:44 -07:00
kari-ts
1a17ec1988
net/netmon: in Android, replace system/bin/ip call with cached LinkProperties gateway (#19804)
bind() on NETLINK_ROUTE sockets does not work on Android 11+ (https://developer.android.com/identity/user-data-ids#mac-11-plus). Since system/bin/ip uses bind(), likelyHomeRouterIPHelper() always fails on Android 11+, so GatewayAndSelfIP never caches the result, causing repeated ip process spawns on every periodic ReSTUN.

This replaces the system/bin/ip fallback with a cached gateway IP pushed from Android’s ConnectivityManager via LinkProperties.getRoutes(). This is the same pattern used by UpdateLastKnownDefaultRouteInterface for the interface name (see https://github.com/tailscale/tailscale/pull/11784/). We keep the proc/net/route path as a fallback for early startup before NetworkChangeCallback has fired.

Updates tailscale/tailscale#18622
Updates tailscale/tailscale#13352

Signed-off-by: kari-ts <kari@tailscale.com>
2026-05-27 15:42:48 -07:00
Brad Fitzpatrick
c9fb05b6f5 ipn/ipnlocal: don't dup-suppress UserProfiles on IPNBus on profile switches
Fixes #19889

Change-Id: I324a735c13772c0c79ed7392c0baa5064b34823b
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-27 14:47:02 -07:00
Brad Fitzpatrick
364b952d62 cmd/containerboot: track peers from IPN bus updates, stop using netmap.NetworkMap
Some tests in another repo were broken by tailscale/tailscale#19607.
This fixes them, by finishing off the rest of the migration away from
netmap.NetworkMap on the IPN bus in containerboot.

Containerboot used to rebuild a full NetworkMap-shaped view while
reacting to IPN bus notifications. Now it instead has its own
netmapState type (immutable) of exactly what it needs to track, and
sends those immutable values around, making cheap edits of new
immutable values when an IPN bus edit arrives.

This should make cmd/containerboot scale to much larger tailnets now too.

Fixes #19852
Fixes tailscale/corp#42347
Updates #12542

Change-Id: I88adaf061f85f677f954a764935e6654329d75a6
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-27 14:12:48 -07:00
Fran Bull
80dc7a8d07 feature/conn25: disallow addrs assignment overwriting.
We don't want addr assignments to be lost from the collection before
they can be returned to the IP pools, otherwise we will get orphan
addresses marked inUse in the pools that will never be returned.

Fixes tailscale/corp#39975

Signed-off-by: Fran Bull <fran@tailscale.com>
2026-05-27 13:54:40 -07:00
Patrick O'Doherty
8501be1990
go.mod: bump dependencies to resolve govulncheck warnings (#19884)
Bump the following:
  go get -u github.com/moby/spdystream@v0.5.1
  go get -u golang.org/x/crypto@v0.52.0
  go get -u golang.org/x/net@v0.55.0

to resolve open govulncheck warnings.

Updates #cleanup

Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
2026-05-27 12:24:59 -07:00
James Tucker
dea49bb4da net/batching: add envknobs to disable UDP GRO & GSO
It is sometimes useful when diagnosing subtle and specific performance
problems to rule out GRO/GSO independently and/or toggle them to
influence packet pacing.

Updates #17835
Updates tailscale/corp#31164

Signed-off-by: James Tucker <james@tailscale.com>
2026-05-27 12:05:00 -07:00
James Tucker
d1912167dc feature/taildrop: replace outgoing-file progress channel with synchronous reporter
serveFilePut tracked outgoing-file progress through an unbuffered
progressUpdates channel whose close was owned by the request goroutine
while writers were spread across manifest parsing, the
progresstracking.Reader callback, singleFilePut failure paths, and the
success path. That writer-closes mismatch made the
send-on-closed-channel panic effectively unfixable in place.

Replace it with a request-scoped outgoingProgress reporter. Transfer
code reports state by method call; the reporter coalesces hot-path
updates and is flushed once via defer in serveFilePut. With no
producer channel to close, the panic is structurally impossible.

Fixes #19115
Fixes #19817

Change-Id: I8f00d982d2c79880dfc1f8104c5eed06e94b5a6c
Signed-off-by: James Tucker <james@tailscale.com>
2026-05-27 12:00:34 -07:00
Brad Fitzpatrick
f277bfb09d release/dist/synology: add GOARM=7,softfloat mode for hi3535
Fixes #6860

Change-Id: I36f3101e75dab35d03e76693555ac93da893f8d5
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-27 10:54:15 -07:00
Claus Lensbøl
9be21088f4
wgengine/{,magicsock},tstest/natlab/vmtest: send disco on cached netmap (#19878)
Originally found when adding tests for working with cached netmaps, and
finding the added tests to be flaky.

When working off of a cached netmap, if a node exists in the cached
netmap but does not yet have any endpoints, DERP connections are
available but not direct ones. By sending callMeMaybe to nodes
without endpoints in the cached netmap, we can establish direct
connections for this edge case.

Additionally, ensure that TSMP disco advert messages are not sent if the
endpoint does not have a valid address yet.

Fixes #19843
Updates #19597

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
2026-05-27 13:05:12 -04:00
Brad Fitzpatrick
b553969b03 ipnlocal: try ACME TLS-ALPN for Funnel renewals
Use TLS-ALPN-01 for Funnel certificate renewals only when the node
already has a cached certificate, and fall back to DNS-01 with a fresh
order if the ALPN path is unavailable or fails.

Dynamically advertise acme-tls/1 only while an ACME challenge
certificate is pending, and add client metrics for DNS-01 and
TLS-ALPN-01 start/success/failure paths.

Updates tailscale/corp#41736
Fixes tailscale/corp#42320

Change-Id: I5adc6ea129237f9ef592f84fc1a8953c80bc9d5c
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-27 09:30:23 -07:00
Jordan Whited
4aef023765 cmd/tailscaled,types/logger: remove TS_DEBUG_MEMORY and associated logger
Commit e5a8cf3b1 added feature/runtimemetrics, which emits heap bytes
and total process memory as clientmetrics when the
NodeAttrEmitRuntimeMetrics capability is set. That subsumes the job of
the TS_DEBUG_MEMORY envknob, whose only effect is to prefix every log
line with Go heap+stack and Maxrss via logger.RusagePrefixLog.

Updates tailscale/corp#39434

Signed-off-by: Jordan Whited <jordan@tailscale.com>
2026-05-27 09:09:05 -07:00
Artem Leshchev
5652b6c9c0
cmd/k8s-operator: fix token exchange for identity federation (#19845)
tailscale-client-go-v2 natively supports identity federation authentication,
and #19010 started using the required authentication provider, but the manual
token exchange was never removed. As a result, we exchanged the JWT for an auth
token and then tried to exchange that auth token once again.
This commit removes the legacy mechanism, fully relying on
tailscale-client-go-v2 to handle authentication.

Fixes #19844

Signed-off-by: Artem Leshchev <matshch@avride.ai>
2026-05-27 16:45:07 +01:00
License Updater
77010351f0 licenses: update license notices
Signed-off-by: License Updater <noreply+license-updater@tailscale.com>
2026-05-27 08:38:44 -07:00
Brad Fitzpatrick
2c965ab540 types/netmap, ipn/ipnlocal, control/controlclient: rename NodeMutationAdd to NodeMutationUpsert
NodeMutationAdd was a misleading name: a PeersChanged entry in a
MapResponse can represent either a truly new peer or a full
replacement for an existing peer that couldn't be expressed as a
PeerChangedPatch. Calling it "Add" implied it was always a completely
new node, which is wrong.  (I'd changed my mind on the design of
mapping add/delete events to NodeMutations halfway through #19607 and
forgot to update the name, even though I'd updated half the docs)

Rename it to NodeMutationUpsert to reflect the actual semantics: the
node should be inserted or replaced in the peer map regardless of
whether it already existed.

Updates #19607
Updates #12542

Change-Id: Iebd3daddb3318cba02e115a1b184fcb3ee8f83d6
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-27 08:37:14 -07:00
Brad Fitzpatrick
a8f40a2ca5 ipn/ipnlocal: add missing bus notify of peers on full netmap
The prior aa5da2e5f2 ("process node adds/removes in constant
time") commit missed a bus notification case, where new-style
subscribers set NotifyNoNetmap and then the controlclient map routing
sends a full update (rather than a delta). Those profiles + peers
need to be put on the bus too.

I noticed this only when porting the Android app over to use the
new bus stuff.

Updates #19607
Updates #12542

Change-Id: I82c35011d2c532222ca27f7d4e790522c31bd156
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-27 08:03:47 -07:00
Jason Dillingham
0e2b3f31af
cmd/k8s-operator: stabilize StaticEndpoints order in ProxyGroup reconciles (#19755)
findStaticEndpoints built its return slice by iterating nodes.Items in
the order returned by r.List, which is not guaranteed to be stable
across calls. When the resulting set of addresses already matched the
existing config Secret, the slice could still permute between
reconciles, making the marshalled config Secret differ byte-for-byte.
That tripped the DeepEqual check on the config Secret, which rewrote
the Secret, which fired a watch event, which re-enqueued the
ProxyGroup, looping forever.

Detect this case and return the existing currAddrs slice unchanged
when the resulting set is the same, preserving the "use the currently
used IPs first" intent without spurious writes.

Fixes #19700

Signed-off-by: Jason Dillingham <jasonmdillingham@gmail.com>
2026-05-27 14:28:04 +01:00
Erisa A
e2a0d45418
cmd/tailscale/cli: fix time parsing in debug daemon-logs (#19875)
Fixes #19874

Signed-off-by: Erisa A <erisa@tailscale.com>
2026-05-27 12:30:28 +01:00
BeckyPauley
0ed6da2826
cmd/k8s-operator, net/netutil: support 4via6 in egress proxy and connector (#19863)
Add support for configuring egress to destinations reachable via 4via6
subnet routes. This change affects the standalone egress proxy only; the egress
ProxyGroup needs IPv6 support before it can support 4via6. Egress may
be configured using either the synthesized 4via6 address or the MagicDNS
name (in the form
<IPv4-address-with-hyphens-instead-of-dots>-via-<siteid>[.*]).

Also update the Connector to validate and advertise 4via6 subnet routes.
Export net/netutil.ValidateViaPrefix so it can be reused by the Connector
validation logic.

Updates #19334

Signed-off-by: Becky Pauley <becky@tailscale.com>
2026-05-27 10:54:35 +01:00
Jordan Whited
e5a8cf3b18 control/controlknobs,feature/*,ipn/ipnlocal,tailcfg: add runtimemetrics
Emit runtime metrics as clientmetrics when the
NodeAttrEmitRuntimeMetrics NodeCapability is present.

We start small with just 2 metrics: heap bytes and total process memory.

Updates tailscale/corp#39434

Signed-off-by: Jordan Whited <jordan@tailscale.com>
2026-05-26 16:02:01 -07:00
Fran Bull
2eb45c2457 feature/conn25: extend assignment expiry on use
When we use assigned addresses in response to a DNS request, extend the
expiry on the assignment.

Updates tailscale/corp#39975

Signed-off-by: Fran Bull <fran@tailscale.com>
2026-05-26 07:28:47 -07:00
Michael Ben-Ami
5877809097 feature/conn25: unify FlowTable storage to prepare for expiry
Previously we had two maps keyed on a direction-specific tuple, with
distinct values containing the data (action) for that direction.
Values pointed at each other across maps to ensure they were removed
at the same time in the case of tuple overwrite, but LRU eviction
was per-map. So if LRU was turned on, it was possible for one
direction's data (action) to be evicted and leave the other direction
dangling.

NewFlow replaces the two direction-specific flow constructors, and
lookups return the direction-specific PacketAction directly.

Now the values in each map point to the same element, with data for both
directions in the element. A linked list also points to the elements to
implement LRU. The previous flowtrack.Cache is removed.

The single LRU structure will allow us to implement idle time expiration
by walking the list backward starting with the least recently used flow, and
stopping after a fixed number of flows, or at the first non-expired flow.

We add commented-out unused placeholder fields for tracking the
"last seen" timestamp, and an on-removal hook, to document the intent for
the follow-up expiry work.

Updates tailscale/corp#38630

Signed-off-by: Michael Ben-Ami <mzb@tailscale.com>
2026-05-26 10:09:48 -04:00
Yago Raña Gayoso
26952d53fa
scripts/installer.sh: update KDE Linux link (#19857)
Signed-off-by: Yago Raña Gayoso <yago.rana.gayoso@gmail.com>
2026-05-24 21:40:42 +01:00
Simon Law
da8cd5cc7f
ipn/ipnlocal: fix documentation typo, NodeAttrCacheNetworkMaps (#19851)
Updates #cleanup

Signed-off-by: Simon Law <sfllaw@tailscale.com>
2026-05-22 22:19:10 -07:00
Simon Law
988615dbad
ipn/ipnlocal,tstest/integration: pause the control client consistently (#19846)
There are two places where tailscaled transitions into a paused state:
1. tailscaled’s controlclient is initially created,
2. tailscale down, or the GUI equivalent, commands it to.

This patch unifies the implementation of both scenarios into
LocalBackend.shouldPauseControlClientLocked to prevent the
implementation from drifting.

The flaky tstest/integration.TestNoControlConnWhenDown test exposed
this mismatch, but only by accident. This patch also changes
TestNode.MustDown so that it runs `tailscale down` and then waits for
the testcontrol server to finish handling any associated /machine/map
requests.

Fixes #19831

Signed-off-by: Simon Law <sfllaw@tailscale.com>
2026-05-22 17:58:44 -07:00
Adrian Dewhurst
5d8f401956 net/dns: fix handling non-IP single split DNS
Fixes #19834

Change-Id: I4d48efed00cd080b14c6fd713ff21e53a5a6ee3c
Signed-off-by: Adrian Dewhurst <adrian@tailscale.com>
2026-05-22 20:45:58 -04:00
Brad Fitzpatrick
5295e3e119 ipn/{ipnstate,ipnlocal}: add integer NodeID to PeerStatus
In aa5da2e5f2 we made the IPN bus include deltas, including the
PeersRemoved, sending a slice of integer NodeIDs that were
removed. But when updating xcode, I realized there was no way to map
those integers to the stable node IDs used in other places.

I was considering changing the just-added ipn.Notify.PeersRemoved from
an IntID to a string StableID, but then it doesn't match the MapResponse
wire protocol, which we've tried to match so far.

Instead, just add the integer ID as well. Callers can use whichever
world they want, having both. It's a little regrettable that we still
have two worlds of IDs, but oh well. Neither is really suitable to a
hypothetical future fully federated world of control servers anyway,
so we'll need a third type later anyway, so just live with the two we
have for now.

Updates #12542

Change-Id: Ib8fd48a265e1da1f8779152f141f624a7f7260e9
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-22 08:16:55 -07:00
Amal Bansode
e32b9bde1d
control/controlclient: fix deadlock in map session change queue processing (#19828)
Holding an exclusive lock while writing to the unbuffered changequeue chan
is likely to deadlock, because the run() path may try to grab the same lock
before reading from the chan to drain it (on map session close). This causes
the client to stop processing new map responses and TSMP disco key advertisements.
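
The general hazard can be sketched in isolation. The session type and
its methods below are hypothetical and are not taken from controlclient;
they only show the pattern of releasing a mutex before a blocking send
on an unbuffered channel, so that a consumer that needs the same mutex
on close cannot wedge against a blocked sender.

```go
package main

import (
	"fmt"
	"sync"
)

// session is a hypothetical stand-in for a map session.
type session struct {
	mu      sync.Mutex
	closed  bool
	done    chan struct{}
	changes chan string // unbuffered, so sends block until received
}

// enqueue reads shared state under the lock, then releases the lock
// before the blocking send. Holding mu across the send would deadlock
// against close, which also needs mu.
func (s *session) enqueue(c string) bool {
	s.mu.Lock()
	closed := s.closed
	s.mu.Unlock()
	if closed {
		return false
	}
	select {
	case s.changes <- c:
		return true
	case <-s.done: // session closed while we were waiting to send
		return false
	}
}

// close marks the session closed and wakes any blocked senders.
func (s *session) close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.closed {
		s.closed = true
		close(s.done)
	}
}

// run consumes n changes and then closes the session.
func (s *session) run(n int) []string {
	var got []string
	for i := 0; i < n; i++ {
		got = append(got, <-s.changes)
	}
	s.close()
	return got
}

// demo sends three changes while the consumer only accepts two.
func demo() ([]string, bool) {
	s := &session{done: make(chan struct{}), changes: make(chan string)}
	third := make(chan bool, 1)
	go func() {
		s.enqueue("a")
		s.enqueue("b")
		third <- s.enqueue("c") // rejected once the session closes
	}()
	got := s.run(2)
	return got, <-third
}

func main() {
	got, third := demo()
	fmt.Println(got, third)
}
```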

The new test added in this commit, TestUpdateDiscoForNodeCallback/test_deadlock,
induces this deadlock with good probability when run against the old code.

Also fix an unintentional regression in how the client responds to a mapResponse sleep
command. 85bb5f84a5 moved the processing of mapResponses into a new goroutine,
serialized via mapSession's changequeue. Thus, controlclient stopped sleeping in the
same goroutine servicing mapResponses/control connections. This commit brings us back
to sleeping synchronously in the same goroutine as controlclient.

Updates #12639

Signed-off-by: Amal Bansode <amal@tailscale.com>
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
Co-authored-by: Claus Lensbøl <claus@tailscale.com>
2026-05-22 07:13:18 -07:00
Simon Law
fd2405ca8f
tstest/integration: mark TestNoControlConnWhenDown as a flaky test (#19832)
Updates #19831

Signed-off-by: Simon Law <sfllaw@tailscale.com>
2026-05-21 17:36:09 -07:00
Simon Law
7dabebc691
net/traffic: switch rendezvous hashing from SHA256 to FNV-1a (#19821)
In PR tailscale/corp#30448, we originally decided to break ties using
SHA256 for our rendezvous hashing algorithm. Now that we’ve had some
experience with it, we think that FNV-1a is a better choice. It
distributes bits evenly, it’s much faster, and it doesn’t need to be
cryptographically secure. The FNV designers recommend FNV-1a over the
deprecated FNV-1.
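
For readers unfamiliar with the technique, here is a minimal sketch of
rendezvous hashing with 64-bit FNV-1a from Go's standard hash/fnv
package. The score and pick functions, the zero-byte separator, and the
use of plain strings as keys are assumptions made for illustration; they
are not the net/traffic implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// score hashes the (client, node) pair with 64-bit FNV-1a.
func score(client, node string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(client))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
	h.Write([]byte(node))
	return h.Sum64()
}

// pick returns the node with the highest score for this client.
// Equal scores fall back to the lexically smaller name, so the result
// does not depend on the order of the input slice.
func pick(client string, nodes []string) string {
	best, bestScore := "", uint64(0)
	for i, n := range nodes {
		s := score(client, n)
		if i == 0 || s > bestScore || (s == bestScore && n < best) {
			best, bestScore = n, s
		}
	}
	return best
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3"}
	for _, c := range []string{"client-a", "client-b", "client-c"} {
		fmt.Println(c, "->", pick(c, nodes))
	}
}
```

Each client ranks the candidates independently, so removing a node that
a client did not pick leaves that client's choice unchanged. Changing
the hash function changes the ranking, which is why the stable picks in
the tests had to be updated.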

This PR makes the switch and updates the related tests, since changing
the algorithm changes which stable pick gets selected. As of 2026-05,
this is the best time to make this change, since there are almost no
clients in the wild with traffic steering enabled.

Updates #17366
Updates tailscale/corp#29964
Updates tailscale/corp#29966
Updates tailscale/corp#33033

Signed-off-by: Simon Law <sfllaw@tailscale.com>
2026-05-21 10:11:59 -07:00
Brad Fitzpatrick
aa5da2e5f2 ipn/ipnlocal, control/controlclient: process node adds/removes in constant time
For large tailnets (~50k+ nodes) with frequent peer churn (ephemeral
GitHub Actions workers etc.), tailscaled used to rebuild the full
netmap and fan it out on the IPN bus on every MapResponse that
added or removed a peer. That imposed two O(N) costs per delta: the
full netmap rebuild, and a Notify.NetMap encode for every bus watcher.

This change tackles both:

  1. Plumb O(1) peer add/remove through the delta path. PeersChanged
     and PeersRemoved no longer prevent the delta happy path; instead,
     they mutate the per-node-backend peer map in place.

  2. Restrict ipn.Notify.NetMap emission to the platforms whose host
     GUIs still depend on it (Windows, macOS, iOS) and migrate
     in-tree consumers off it everywhere else:

     - Migrate reactive consumers (containerboot, kube agents,
       sniproxy, tsconsensus, etc.) off Notify.NetMap to the
       previously-added Notify.SelfChange signal so they no longer
       have to subscribe to the full netmap.
     - Add ipn.NotifyNoNetMap so GUI clients on "legacy-emit" platforms
       that have already migrated can opt out of the per-watcher
       NetMap encode.
     - Gate Notify.NetMap emission on the producer side by a compile-
       time GOOS check, so the supporting code is dead-code-eliminated
       on Linux and other geese where no GUI consumer needs it.
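
The idea behind item 1 can be sketched as follows: when peers live in a
map keyed by node ID, applying a delta costs time proportional to the
size of the delta, not to the size of the tailnet. The nodeID, peer,
delta, and peerMap types here are hypothetical simplifications, not the
ipnlocal data structures.

```go
package main

import "fmt"

// nodeID, peer, and delta are hypothetical, simplified types.
type nodeID int64

type peer struct {
	ID   nodeID
	Name string
}

type delta struct {
	PeersChanged []peer   // peers added or updated
	PeersRemoved []nodeID // peers that went away
}

// peerMap holds the current peers keyed by ID.
type peerMap map[nodeID]peer

// apply mutates the map in place. Each map insert or delete is
// constant time on average, so the total work tracks the delta size.
func (m peerMap) apply(d delta) {
	for _, p := range d.PeersChanged {
		m[p.ID] = p
	}
	for _, id := range d.PeersRemoved {
		delete(m, id)
	}
}

func main() {
	m := peerMap{1: {ID: 1, Name: "a"}, 2: {ID: 2, Name: "b"}}
	m.apply(delta{
		PeersChanged: []peer{{ID: 3, Name: "c"}},
		PeersRemoved: []nodeID{1},
	})
	fmt.Println(len(m), m[3].Name) // prints 2 c
}
```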

Re-running BenchmarkGiantTailnet from tstest/largetailnet, which was
added along with baseline numbers on unmodified main in ad5436af0d,
shows that the per-delta cost (one peer add+remove pair) is now ~O(1)
regardless of tailnet size N:

    N         no-watcher (ms/op)            bus-watcher (ms/op)
              before    now     factor      before    now     factor
     10000        32   0.11       300x         166   0.13      1300x
     50000       222   0.11      2000x         865   0.13      6700x
    100000       504   0.12      4100x        1765   0.13     13400x
    250000      1551   0.12     12500x        4696   0.15     32400x

Updates #12542

Change-Id: I94e34b37331d1a8ec74c299deffadf4d061fda9e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-21 09:26:19 -07:00
Brad Fitzpatrick
2703f91174 wgengine/magicsock: fix data race in TestSetDERPMapDoReStun
SetDERPMap spawns a goroutine that calls ReSTUN, which logs via the
test logger. If the test returns before that goroutine logs, the
goroutine races with testing cleanup.

Use tstest.WhileTestRunningLogger so the goroutine's logf call becomes
a no-op once the test finishes.
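
The pattern can be sketched without the tstest package. The whileRunning
helper below is a hypothetical illustration of a logger that turns into
a no-op after a stop function runs; it is not the
tstest.WhileTestRunningLogger implementation. In a real test, stop would
typically be registered with t.Cleanup.

```go
package main

import (
	"fmt"
	"sync"
)

// logf matches the common printf-style logger signature.
type logf func(format string, args ...any)

// whileRunning wraps inner and returns the wrapped logger plus a stop
// function. After stop returns, the wrapped logger silently drops
// every message. The mutex guarantees that no call to inner is still
// in flight once stop has returned.
func whileRunning(inner logf) (logf, func()) {
	var mu sync.Mutex
	done := false
	wrapped := func(format string, args ...any) {
		mu.Lock()
		defer mu.Unlock()
		if done {
			return
		}
		inner(format, args...)
	}
	stop := func() {
		mu.Lock()
		done = true
		mu.Unlock()
	}
	return wrapped, stop
}

func main() {
	var lines []string
	log, stop := whileRunning(func(format string, args ...any) {
		lines = append(lines, fmt.Sprintf(format, args...))
	})
	log("restun %d", 1)
	stop()
	log("restun %d", 2) // dropped: the "test" has finished
	fmt.Println(lines)  // prints [restun 1]
}
```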

Fixes #19829

Change-Id: I1097f98e40ffd1c5dd7fb7a715c918255853e3c6
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2026-05-21 08:51:50 -07:00
Simon Law
7ebca58042
net/traffic,ipn/ipnlocal: extract traffic steering utilities (#19682)
The traffic package contains helpers for evaluating traffic steering
scores and picking appropriate nodes. These were extracted from
ipnlocal.suggestExitNodeUsingTrafficSteering so they can be reused by
the new routecheck package to probe exit nodes in priority order.

Updates #17366
Updates tailscale/corp#33033

Signed-off-by: Simon Law <sfllaw@tailscale.com>
2026-05-21 08:28:27 -07:00