This adds tsnet.Server.ListenSSH which, if the SSH feature is linked,
returns a net.Listener whose Accept yields *tailssh.Session values (as
net.Conn). This lets tsnet apps accept incoming SSH connections to
implement custom TUI applications.
Basic apps can use net.Conn directly (Read/Write/Close). Rich apps
import ssh/tailssh and type-assert for peer identity, PTY, signals,
etc. If feature/ssh isn't imported, ListenSSH returns an error.
Includes a demo guess-the-number game in tsnet/example/ssh-game.
Updates tailscale/corp#37839
Change-Id: I4e7c3c96afb030cdf4da8f2d8b2253820628129a
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Currently we are picking a peer for the split dns routes when we get a
netmap. Use the new custom scheme resolvers, installed per app in the
config in the netmap, to allow us to choose which connector peer should
handle a DNS request at the time the request is made.
Fixestailscale/corp#39858
Signed-off-by: Fran Bull <fran@tailscale.com>
In PR #19682, we introduced the traffic package which provides a
traffic.Scores.SortNodes method that uses rendezvous hashing to
break ties by equally distribute the “best” node for any given client.
This PR adds a fuzzer to make sure this algorithm is not wildly unfair.
Updates #17366
Updates tailscale/corp#33033
Signed-off-by: Simon Law <sfllaw@tailscale.com>
This NodeCapability works around the UDP GSO bugs introduced by
torvalds/linux@b10b446 (v7.0-rc1). These bugs were later fixed by
torvalds/linux@78effd8 and torvalds/linux@5f17ae0 (v7.1-rc5). These
Linux kernel bugs cause mangled UDP headers and UDP checksums, resulting
in high levels of packet loss.
The aforementioned bugs have already made their way downstream into
various distros, e.g. Ubuntu 26.04 LTS. Impacted users are now dealing
with poor UDP performance in tailscaled, and in any other software that
makes use of UDP GSO.
Not all users of the affected kernels are impacted as the relevant
kernel code path sits between kernel and netdev driver, and behaviors
vary by driver/device capability.
We cannot detect impact at runtime, as this would require gathering all
netdevs, and performing loopback tests. This is invasive and in many
cases impossible.
So, we are left to choose between disabling UDP GSO for all users on
affected kernels, whether they experience real impact or not, or try
and work around the bugs. Disabling UDP GSO for a user that is not
impacted can cut max throughput in half, and consume more CPU cycles.
This commit attempts to workaround the bugs by avoiding UDP GSO when
batches are small, and injecting a 1-byte sentinel tail payload when
they are large. This tail payload is smaller than "GSO size", which
sidesteps the primary trigger of all fragments in a batch being
equal in length.
The end result is slightly increased payload and packet overhead, but
functional UDP GSO for all Linux 7.0-7.1.4 users, regardless of
netdev/driver.
Updates #19777
Signed-off-by: Jordan Whited <jordan@tailscale.com>
All StateStore implementations store a nil value in the cache map when WriteState is called with a nil byte slice instead of deleting the key. This causes ReadState to return (nil, nil) instead of (nil, ErrStateNotExist), since the key is still present in the map.
This breaks reset-auth in Windows, Linux, and Android, and the node can't log back in without manually editing the state file. (macOS uses a different state store)
DeleteProfile, DeleteAllProfilesForUser, setUnattendedModeAsConfigured are impacted but don't seem to break because the deleted keys are not reread.
This deletes the key from the cache instead.
Fixestailscale/corp#42477
Signed-off-by: kari-ts <kari@tailscale.com>
We have been reading the pool config from the app nodeattr, but it is
global config, not per app, so it needs to be its own thing.
Updates tailscale/corp#39999
Signed-off-by: Fran Bull <fran@tailscale.com>
If a user explicitly adds a non-ts.net (not a CertDomain domain) domain
like "foo.com" to their serve config as a web target that's also an allowed
funnel domain (using raw "tailscale serve set-config"), then use the new
ALPN cert fetching (from b553969b) to get certs for that domain.
This is just plumbing; there's no new product functionality to
actually enable this easily client-side, and it also has no visible
product surface to enable it server-side.
Updates tailscale/corp#41736
Change-Id: Ie2e421ac9611bce64bba3de6a454b2d505ea0e8a
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The github-ci-vm machine that runs our self-hosted CI for this repo is
only designed for the `vm` job in test.yml. That uses a different cache
dir which is causing github-ci-vm's small disk to fill up. Switch to
ubuntu 24.04 like the rest of our CI for this repo that doesn't require
anything special.
Updates tailscale/corp#40465
Signed-off-by: Tom Proctor <tomhjp@users.noreply.github.com>
If we dispatch a ping too early (after a later patch removes a 250ms
blockage) then the ping may be lost due to the peers not yet knowing
about each other. The ping is retained in order to setup and ensure a
wireguard session prior to test flow.
Updates #19822
Change-Id: I6cfea28931646a9387b6ffc2654e72cd846f4e55
Signed-off-by: James Tucker <james@tailscale.com>
Co-authored-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add a "tailscale whoami" subcommand that is equivalent to running
"tailscale whois $(tailscale ip -4)" but more ergonomic. It supports
the --json flag just like whois, and shares the WhoIsResponse
rendering code with whois.
Fixes#19907
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Change-Id: I8f33ba7a5608bab7dffa8213303beb5f345936d3
Adds two tests exercising the HTTP/2-inbound -> plaintext HTTP/1.1 backend
path through serve's reverseProxy and through the full serveWebHandler
entry point (with a funnel serveHTTPContext).
Updates #19866
Signed-off-by: Brendan Creane <bcreane@gmail.com>
When parsing the `tailscale up --exit-node=ARG` argument, we try to
resolve hostnames by searching the list of peers. However, at startup,
the peer list is empty, causing hostname lookups to trivially fail with
an unhelpful "invalid value" erorr.
Improve the error message when the peer list is empty to inform the user
that hostnames cannot be resolved during startup, and advise them to use
the exit node's Tailscale IP address instead.
Also, clarify that hostnames must be peer hostnames, not arbitrary
hostnames.
Fixes#19882
Change-Id: I9390a427c2863d657cf46c5e33b43cb3c5363764
Signed-off-by: Alex Chan <alexc@tailscale.com>
Single-pod ingress/egress proxies already called ClampMSSToPMTU when
setting up forwarding rules, but the proxy group (HA) code paths in
egressservices.go and ingressservices.go did not. This caused TCP
connections through proxy group pods to suffer from MSS/MTU mismatch
issues in environments where path MTU discovery is not working.
Add ClampMSSToPMTU calls in the egress sync loop (alongside the existing
EnsureSNATForDst call) and in addDNATRuleForSvc (alongside the existing
EnsureDNATRuleForSvc call), mirroring what the single-pod forwarding
rules already do.
Also add MSS clamping assertions to TestSyncIngressConfigs and track
ClampMSSToPMTU calls in FakeNetfilterRunner.
Fixes issue #19812https://github.com/tailscale/tailscale/issues/19812.
Tracking internal ticket TSS-86326.
Signed-off-by: Jay Tung <ltung@crusoeenergy.com>
Co-authored-by: Jay Tung <ltung@crusoeenergy.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Parallel subtests share *ipn.Notify pointers (e.g. runningNotify).
When multiple subtests reached the same phase concurrently, they
all wrote to the shared notify's InitialStatus field without
synchronization, triggering the race detector.
Fix by shallow-copying *ipn.Notify before setting InitialStatus,
so each test iteration works on its own copy.
Updates #19380
Change-Id: I9dd40037e02146166f006f4f7c1ddcc47adba191
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Add four control-plane node attributes that let us disable UDP GSO/GRO
on the magicsock UDP socket and UDP/TCP GRO on the Tailscale TUN
device.
These complement the pre-existing TS_DEBUG_DISABLE_UDP_{GRO,GSO} and
TS_TUN_DISABLE_{UDP,TCP}_GRO envknobs. They exist so we can mitigate
upstream Linux kernel regressions on a deployed fleet without
requiring a client release, after two incidents (#13041, #19777) where
buggy kernel patches landed upstream and the fix took an excessively
long time to reach downstream distros.
Knob changes are reacted to in setNetworkMapInternal / SetNetworkMap via
a comparison against a cached "last applied" value and only an actual
transition triggers work: magicsock Rebind()+ReSTUN for UDP,
ApplyGROKnobs for TUN. The TUN side is gated by buildfeatures.HasGRO and
is one-way (wireguard-go GRO disablement is sticky); re-enabling
requires a client restart.
Updates #13041
Updates #19777
Change-Id: I802993070afa659cc06809bb0bfbb7f8a0cdb273
Signed-off-by: James Tucker <james@tailscale.com>
Previously, sharding required tests to opt in by calling tstest.Shard,
which used a process-global counter to assign each test to a shard.
This had two problems: most tests didn't call it, so they ran on every
shard (defeating the purpose), and shard assignments were unstable
(depended on call order, so adding a test could reshuffle others).
Remove tstest.Shard and tstest.SkipOnUnshardedCI entirely. Instead,
have testwrapper implement sharding automatically for all tests: when
TS_TEST_SHARD=N/M is set, it uses "go list -json" (no compilation) to
find test source files, scans them for top-level Test/Benchmark/
Example/Fuzz function names, and filters by fnv32a(name) % M == N-1.
The filtered names are passed as an anchored -run regex to go test.
Using go list instead of "go test -list" avoids linking the test binary
twice (Go's build cache does not cache test binary linking).
Fixes#19886
Change-Id: I62ab7b3d757324d4c5fd0b5de50c1e3742681791
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Gates the unnecessary "logtail started" message behind
the debug envknob TS_DEBUG_LOGTAIL. This is extra log spam that isn't
needed unless we are debugging.
Updates tailscale/corp#40908
Signed-off-by: James Scott <jim@tailscale.com>
bind() on NETLINK_ROUTE sockets does not work on Android 11+ (https://developer.android.com/identity/user-data-ids#mac-11-plus) . Since system/bin/ip uses bind(), likelyHomeRouterIPHelper() always fails on Andoroid 11+, so that GatewayAndSelfIP never caches the result, causing repeated ip process spawns on every periodic ReSTUN.
This replaces the system/bin/ip fallback with a cached gateway IP pushed from Android’s ConnectivityManager via LinkProperties.getRoutes(). This is the same patterm used by UpdateLastKnownDefaultRouteInterface for the interface name (see https://github.com/tailscale/tailscale/pull/11784/). We keep the proc/net/route path as a fallback for early startup before NetworkChangeCallback has fired.
Updates tailscale/tailscale#18622
Updates tailscale/tailscale#13352
Signed-off-by: kari-ts <kari@tailscale.com>
Some tests in another repo were broken by tailscale/tailscale#19607.
This fixes them, by finishing off the rest of the migration away from
netmap.NetworkMap on the IPN bus in containerboot.
Containerboot used to rebuild a full NetworkMap-shaped view while
reacting to IPN bus notifications. Now it insteads has its own
netmapState type (immutable) of exactly what it needs to track, and
sends those immutable values around, making cheap edits of new
immutable values when an IPN bus edit arrives.
This should make cmd/containerboot scale to much larger tailnets now too.
Fixes#19852Fixestailscale/corp#42347
Updates #12542
Change-Id: I88adaf061f85f677f954a764935e6654329d75a6
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
We don't want addr assignments to be lost from the collection before
they can be returned to the IP pools, otherwise we will get orphan
addresses marked inUse in the pools that will never be returned.
Fixestailscale/corp#39975
Signed-off-by: Fran Bull <fran@tailscale.com>
Bump the following:
go get -u github.com/moby/spdystream@v0.5.1
go get -u golang.org/x/crypto@v0.52.0
go get -u golang.org/x/net@v0.55.0
to resolve open govulncheck warnings.
Updates #cleanup
Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
It is sometimes useful when diagnosing subtle and specific performance
problems to rule out GRO/GSO independently and/or toggle them to
influence packet pacing.
Updates #17835
Updates tailscale/corp#31164
Signed-off-by: James Tucker <james@tailscale.com>
serveFilePut tracked outgoing-file progress through an unbuffered
progressUpdates channel whose close was owned by the request goroutine
while writers were spread across manifest parsing, the
progresstracking.Reader callback, singleFilePut failure paths, and the
success path. That writer-closes mismatch made the
send-on-closed-channel panic effectively unfixable in place.
Replace it with a request-scoped outgoingProgress reporter. Transfer
code reports state by method call; the reporter coalesces hot-path
updates and is flushed once via defer in serveFilePut. With no
producer channel to close, the panic is structurally impossible.
Fixes#19115Fixes#19817
Change-Id: I8f00d982d2c79880dfc1f8104c5eed06e94b5a6c
Signed-off-by: James Tucker <james@tailscale.com>
Originally found when adding tests for working with cached netmaps, and
finding the added tests to be flakey.
When working off of a cached netmap, if a node exists in the cached
netmap but does not yet have any endpoints, DERP connections are
available but not direct ones. By sending callMeMaybe to nodes
without endpoints in the cached netmap, we can establish direct
connections for this edge case.
Aditionally, ensure that TSMP disco advert messages are not sent if the
endpoint does not have a valid address yet.
Fixes#19843
Updates #19597
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
Use TLS-ALPN-01 for Funnel certificate renewals only when the node
already has a cached certificate, and fall back to DNS-01 with a fresh
order if the ALPN path is unavailable or fails.
Dynamically advertise acme-tls/1 only while an ACME challenge
certificate is pending, and add client metrics for DNS-01 and
TLS-ALPN-01 start/success/failure paths.
Updates tailscale/corp#41736Fixestailscale/corp#42320
Change-Id: I5adc6ea129237f9ef592f84fc1a8953c80bc9d5c
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Commit e5a8cf3b1 added feature/runtimemetrics, which emits heap bytes
and total process memory as clientmetrics when the
NodeAttrEmitRuntimeMetrics capability is set. That subsumes the job of
the TS_DEBUG_MEMORY envknob, whose only effect is to prefix every log
line with Go heap+stack and Maxrss via logger.RusagePrefixLog.
Updates tailscale/corp#39434
Signed-off-by: Jordan Whited <jordan@tailscale.com>
tailscale-client-go-v2 natively supports identity federation authentication,
and in #19010 the required authentication provider is used, but the manual
token exchange was never removed, so we were exchanging JWT token to an auth
token, and then were trying to use that auth token for exchange once again.
This commit removes the legacy mechanism, fully relying on
tailscale-client-go-v2 to handle authentication.
Fixes#19844
Signed-off-by: Artem Leshchev <matshch@avride.ai>
NodeMutationAdd was a misleading name: a PeersChanged entry in a
MapResponse can represent either a truly new peer or a full
replacement for an existing peer that couldn't be expressed as a
PeerChangedPatch. Calling it "Add" implied it was always a completely
new node, which is wrong. (I'd changed my mind on the design of
mapping add/delete events to NodeMutations halfway through #19607 and
forgot to update the name, even though I'd updated half the docs)
Rename it to NodeMutationUpsert to reflect the actual semantics: the
node should be inserted or replaced in the peer map regardless of
whether it already existed.
Updates #19607
Updates #12542
Change-Id: Iebd3daddb3318cba02e115a1b184fcb3ee8f83d6
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The prior aa5da2e5f2 ("process node adds/removes in constant
time") commit missed a bus notification case, where new-style
subscribers set NotifyNoNetmap and then the controlclient map routing
sends a full update (rather than a delta). Those profiles + peers
need to be put on the bus too.
I noticed this only when porting the Android app over to use the
new bus stuff.
Updates #19607
Updates #12542
Change-Id: I82c35011d2c532222ca27f7d4e790522c31bd156
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
findStaticEndpoints built its return slice by iterating nodes.Items in
the order returned by r.List, which is not guaranteed to be stable
across calls. When the resulting set of addresses already matched the
existing config Secret, the slice could still permute between
reconciles, making the marshalled config Secret differ byte-for-byte.
That tripped the DeepEqual check on the config Secret, which rewrote
the Secret, which fired a watch event, which re-enqueued the
ProxyGroup, looping forever.
Detect this case and return the existing currAddrs slice unchanged
when the resulting set is the same, preserving the "use the currently
used IPs first" intent without spurious writes.
Fixes#19700
Signed-off-by: Jason Dillingham <jasonmdillingham@gmail.com>
Add support for configuring egress to destinations reachable via 4via6
subnet routes. This change affects standalone egress proxy only- egress
ProxyGroup needs IPv6 support before being able to support 4via6. Egress may
be configured using either the synthesized 4via6 address or the MagicDNS
name (in the form
<IPv4-address-with-hyphens-instead-of-dots>-via-<siteid>[.*]).
Also update the Connector to validate and advertise 4via6 subnet routes.
Export net/netutil.ValidateViaPrefix so it can be reused by the Connector
validation logic.
Updates #19334
Signed-off-by: Becky Pauley <becky@tailscale.com>
Emit runtime metrics as clientmetrics when the
NodeAttrEmitRuntimeMetrics NodeCapability is present.
We start small with just 2 metrics: heap bytes and total process memory.
Updates tailscale/corp#39434
Signed-off-by: Jordan Whited <jordan@tailscale.com>
When we use assigned addresses in response to a DNS request, extend the
expiry on the assignment.
Updates tailscale/corp#39975
Signed-off-by: Fran Bull <fran@tailscale.com>
Previously we had two maps keyed on a direction-specific tuple, with
distinct values containing the data (action) for that direction.
Values pointed at each other across maps to ensure they were removed
at the same time in the case of tuple overwrite, but LRU eviction
was per-map. So if LRU was turned on, it was possible for one
direction's data (action) to be evicted and leave the other direction
dangling.
NewFlow replaces the two direction-specific flow constructors, and
lookups return the direction-specific PacketAction directly.
Now the values in each map point to the same element, with data for both
directions in the element. A linked list also points to the elements to
implement LRU. The previous flowtrack.Cache is removed.
The single LRU structure will allow us to implement idle time expiration
by walking the list backward starting with the least recently used flow, and
stopping after a fixed number of flows, or at the first non-expired flow.
We add commented-out unused placeholder fields for tracking the
"last seen" timestamp, and an on-removal hook, to document the intent for
the follow-up expiry work.
Updates tailscale/corp#38630
Signed-off-by: Michael Ben-Ami <mzb@tailscale.com>
There are two places where tailscaled transitions into a paused state:
1. tailscaled’s controlclient is initially created,
2. tailscale down, or the GUI equivalent, commands it to.
This patch unifies the implementation of both scenarios into
LocalBackend.shouldPauseControlClientLocked to prevent the
implementation from drifting.
The flaky tstest/integration.TestNoControlConnWhenDown test exposed
this mismatch, but only by accident. This patch also changes
TestNode.MustDown so that it runs `tailscale down` and then waits for
the testcontrol server to finish handling any associated /machine/map
requests.
Fixes#19831
Signed-off-by: Simon Law <sfllaw@tailscale.com>
In aa5da2e5f2 we made the IPN bus include deltas, including the
PeersRemoved, sending a slice of integer NodeIDs that were
removed. But when updating xcode, I realized there was no way to map
those integers to the stable node IDs used in other places.
I was consdering changing the just-added ipn.Notify.PeersRemoved from
an IntID to a string StableID, but then it doesn't match the MapResponse
wire protocol, which we've tried to match so far.
Instead, just add the integer ID as well. Callers can use whichever
world they want, having both. It's a little regrettable that we still
have two worlds of IDs, but oh well. Neither is really suitable to a
hypothetical future fully federated world of control servers anyway,
so we'll need a third type later anyway, so just live with the two we
have for now.
Updates #12542
Change-Id: Ib8fd48a265e1da1f8779152f141f624a7f7260e9
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Holding an exclusive lock while writing to the unbuffered changequeue chan
is likely going to deadlock when the run() path may try to grab the same lock
before reading from the chan to drain it (on map session close). This causes
the client to stop processing new map responses and TSMP disco key advertisements.
There is a good probability of inducing this deadlock using the old code and new
test added in this commit: TestUpdateDiscoForNodeCallback/test_deadlock.
Also fix an unintentional regression in how the client responds to a mapResponse sleep
command. 85bb5f84a5 moved the processing of mapResponses into a new goroutine,
serialized via mapSession's changequeue. Thus, controlclient stopped sleeping in the
same goroutine servicing mapResponses/control connections. This commit brings us back
to sleeping synchronously in the same goroutine as controlclient.
Updates #12639
Signed-off-by: Amal Bansode <amal@tailscale.com>
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
Co-authored-by: Claus Lensbøl <claus@tailscale.com>
In PR tailscale/corp#30448, we originally decided to break ties using
SHA256 for our rendezvous hashing algorithm. Now that we’ve had some
experience with it, we think that FNV-1a is a better choice. It
distributes bits evenly, it’s much faster, and it doesn’t need to be
cryptographically secure. The FNV designers recommend FNV-1a over the
deprecated FNV-1.
This PR makes the switch and updates the related tests, since changing
the algorithm changes which stable pick gets selected. As of 2026-05,
this is the best time to make this change, since there are almost no
clients in the wild with traffic steering enabled.
Updates #17366
Updates tailscale/corp#29964
Updates tailscale/corp#29966
Updates tailscale/corp#33033
Signed-off-by: Simon Law <sfllaw@tailscale.com>
For large tailnets (~50k+ nodes) with frequent peer churn (ephemeral
GitHub Actions workers etc.), tailscaled used to rebuild the full
netmap and fan it out on the IPN bus on every MapResponse that
added or removed a peer. There were two O(N) costs per delta: the
full netmap rebuild + every Notify.NetMap encode to every bus watcher.
This change tackles both:
1. Plumb O(1) peer add/remove through the delta path. PeersChanged
and PeersRemoved no longer prevent the delta happy path; instead,
they mutate the per-node-backend peer map in place.
2. Restrict ipn.Notify.NetMap emission to the platforms whose host
GUIs still depend on it (Windows, macOS, iOS) and migrate
in-tree consumers off it everywhere else:
- Migrate reactive consumers (containerboot, kube agents,
sniproxy, tsconsensus, etc.) off Notify.NetMap to the
previously-added Notify.SelfChange signal so they no longer
have to subscribe to the full netmap.
- Add ipn.NotifyNoNetMap so GUI clients on "legacy-emit" platforms
that have already migrated can opt out of the per-watcher
NetMap encode.
- Gate Notify.NetMap emission on the producer side by a compile-
time GOOS check, so the supporting code is dead-code-eliminated
on Linux and other geese where no GUI consumer needs it.
Re-running BenchmarkGiantTailnet from tstest/largetailnet, which was
added along with baseline numbers on unmodified main in ad5436af0d,
the per-delta cost (one peer add+remove pair) is now ~O(1) regardless
of tailnet size N:
N no-watcher (ms/op) bus-watcher (ms/op)
before now factor before now factor
10000 32 0.11 300x 166 0.13 1300x
50000 222 0.11 2000x 865 0.13 6700x
100000 504 0.12 4100x 1765 0.13 13400x
250000 1551 0.12 12500x 4696 0.15 32400x
Updates #12542
Change-Id: I94e34b37331d1a8ec74c299deffadf4d061fda9e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
SetDERPMap spawns a goroutine that calls ReSTUN, which logs via the
test logger. If the test returns before that goroutine logs, the
goroutine races with testing cleanup.
Use tstest.WhileTestRunningLogger so the goroutine's logf call becomes
a no-op once the test finishes.
Fixes#19829
Change-Id: I1097f98e40ffd1c5dd7fb7a715c918255853e3c6
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>