zulip

mirror of https://github.com/zulip/zulip.git synced 2026-07-21 21:05:48 +08:00

Author	SHA1	Message	Date
Mateusz Mandera	902e2df257	rabbitmq: Disable consumer_timeout to avoid event redelivery. Since 3.8.15, RabbitMQ enforces a consumer_timeout (30 minutes by default): if a delivery is not acknowledged within it, the broker requeues the event and closes the channel. Because our queue workers acknowledge only after consume() returns, any task that runs longer causes the event to be redelivered and processed a second time. This timeout does nothing useful for us. It does not abort an overlong job, it merely re-runs it; and we already have our own mechanism (MAX_CONSUME_SECONDS) for bounding worker runtime, which we deliberately disable on deferred_work precisely because its jobs have no time bound. Self-serve Slack imports routinely exceed 30 minutes, so in practice this just redelivers the import for a duplicate, harmful run. Disable consumer_timeout globally.	2026-07-02 14:16:22 +05:30
Aman Agrawal	7fd8bfb1c6	soft_deactivation: Allow a dedicated queue for soft reactivations. Soft-reactivating a returning long-term-idle user backfills the UserMessage rows skipped while they were idle, which can take many seconds. These jobs share the deferred_work queue, which also runs realm exports that can take minutes -- so an optimistic reactivation (enqueued when an idle user is sent a notification or password-reset email, to make their return fast) can sit stuck behind an export and defeat its own purpose. Add an opt-in `dedicated_soft_reactivation_queue` setting that routes these jobs to their own queue and worker. It defaults to off, so that memory-constrained servers keep sharing the deferred_work queue and don't pay for an extra worker process; large servers can enable it to isolate reactivations. The deferred_work handler is deliberately retained: it stays the default path, and also drains soft_reactivate events already queued there when a server enables the dedicated queue. get_active_worker_queues honors the same setting, so a server that hasn't opted in neither runs nor expects the worker, keeping the production-install queue-processor check consistent. The backlog Nagios check only covers the dedicated queue when it is enabled, and gives it the same clear-time thresholds as deferred_work -- the workload it splits off -- so its slow backfills don't trip false alerts. The kandra consumer check is unconditional and assumes the dedicated queue is enabled on those frontends. Fixes part of #22396. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-30 01:44:04 +08:00
Alex Vandiver	ea1323c407	postgresql: Raise locked-memory limit for io_uring rings. Since `f0cc982e52`, PostgreSQL 18 uses io_uring when available. PostgreSQL creates an io_uring instance for each of the `max_connections = 1000` possible backends. Linux 6.14 and newer charge that usage (8-13MiB at our default PostgreSQL settings) against RLIMIT_MEMLOCK. Unfortunately, the systemd default is only 8MiB; on such kernels (e.g. Ubuntu 26.04), PostgreSQL 18 fails to start with: FATAL: could not setup io_uring queue: Cannot allocate memory Raise the limit to 256MiB. This accommodates io_uring usage even at the maximum configurable io_max_concurrency (~100MiB), times two clusters running concurrently during pg_upgradecluster (since the kernel counts locked memory per-user, not per-process). Installations on older kernels (e.g. Ubuntu 24.04's Linux 6.8) pin the same memory without accounting it, so they start fine today but would fail after a kernel upgrade; applying the higher limit fixes them preemptively.	2026-06-03 10:49:10 +05:30
Alex Vandiver	dbe4330fda	puppet: Skip systemd daemon-reload when systemd is not booted. The guard only checked that the systemctl binary existed, which is still true in containers with systemd installed but not running as PID 1 (e.g. in Docker containers in CI); "systemctl daemon-reload" then fails, aborting the puppet run. Check for /run/systemd/system instead, which only exists when systemd is the running init system.	2026-06-03 10:49:10 +05:30
Aman Agrawal	ae4d65ad7f	hooks: Make deploy notifications best-effort. The pre-deploy zulip_notify hook posts an informational "Starting deploy" message; combined with `set -e`, a failure to deliver it (e.g. the bot lacks posting permission on the target channel, or the server is unreachable) propagated out of the hook, and run_hooks.py treats any pre-deploy hook failure as fatal. A transient chat-notification problem could thus block an urgent deploy. Swallow delivery failures in the shared zulip_send helper after logging them, so neither the pre- nor post-deploy notification can abort a deploy. The post-deploy hooks were already best-effort; this brings the gating pre-deploy notification in line. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 18:22:16 +08:00
Aman Agrawal	3ae2232e2e	kandra: Allow frontend access to camo from port 9292. Partial revert of `d8f47d7cc2`. Port 9292 is used both to serve metrics to prometheus and to serve the frontend. Allowing only prometheus to access port 9292 broke image previews for external image uploads, since the frontend could not reach camo via port 9292.	2026-05-22 11:46:19 +05:30
Alex Vandiver	9b4f2baba5	kandra: Remove firewall rule for incoming port 3000. Teleport smuggles this in via localhost -- and grafana is already only listening on localhost.	2026-05-21 01:01:03 -04:00
Alex Vandiver	32023d7cf7	kandra: Close down services on prometheus host. These were missed in `e317fb5582`.	2026-05-21 01:01:03 -04:00
Alex Vandiver	32ea23e561	kandra: Gracefully reload Teleport when its config changes. A change to /etc/teleport_*.yaml today triggers a hard stop+start of the teleport_$part unit, severing every active SSH, database, and app-proxy session holding a connection through that node. If your zulip-puppet-apply is being controlled via Teleport SSH, this can easily lead to bad state. Teleport's signal handling supports a fork+drain reload on SIGHUP[^1], which spawns a new daemon to serve new connections, and only shuts the old process down after existing connections close. Route the YAML config notifies through a `systemctl reload` exec so we actually use the `ExecReload` we already defined, which exercises that codepath. Unit-file and package changes still notify the service directly to get a real restart. [^1]: https://goteleport.com/docs/reference/deployment/signals/	2026-05-20 16:52:18 -04:00
Alex Vandiver	e317fb5582	kandra: Make exporter ports listen only on localhost. Scrapes now arrive via the Teleport tunnel which terminates at localhost on each host, so the metrics ports no longer need to listen on any other interface.	2026-05-19 10:26:52 -04:00
Alex Vandiver	3aade3588e	kandra: Close down access to exporter ports. These are now accessed through Teleport apps.	2026-05-19 10:26:52 -04:00
Alex Vandiver	15daadd56b	kandra: Update prometheus config to use teleport-sd.	2026-05-19 10:26:52 -04:00
Alex Vandiver	0dd0143060	kandra: Use tbot application proxy + teleport-sd for service discovery. This makes the Prometheus configuration no longer have to know anything about which hosts run which exporters; instead, hosts register the exporter in Teleport, and Prometheus asks Teleport which instances of a given exporter it knows about.	2026-05-19 10:26:52 -04:00
Alex Vandiver	4320986c74	kandra: Add a Teleport app for each Prometheus exporter.	2026-05-19 10:26:52 -04:00
Alex Vandiver	aa6c135cd4	kandra: Switch 127.0.0.1 to localhost, so it works on ipv6 hosts. For example, the katex server binds to [::]:9700.	2026-05-19 10:26:52 -04:00
Alex Vandiver	1635fa5801	kandra: Use to_yaml when writing YAML data.	2026-05-19 10:26:52 -04:00
Alex Vandiver	e2d426292c	kandra: Give the teleport server its own data_dir. The ssh node and the auth node should not share a data_dir.	2026-05-19 10:26:52 -04:00
Alex Vandiver	739162d8bc	kandra: Remove explicit return_per_object_metrics setting in rabbitmq. We already accomplish this by using the dedicated metrics endpoint that returns that variant[^1]. [^1]: https://www.rabbitmq.com/docs/prometheus#per-object-endpoint	2026-05-14 00:10:14 -04:00
Alex Vandiver	f7808492c9	kandra: Switch Teleport to also bind ipv6 addresses.	2026-05-12 01:53:09 -04:00
Alex Vandiver	0c8c5ec7d1	kandra: Note that port 3025 is auth <-> proxy, which is the same host.	2026-05-12 01:53:09 -04:00
Alex Vandiver	29b60c0d77	kandra: Switch Teleport to multiplexed port 443.	2026-05-12 01:53:09 -04:00
Anders Kaseorg	d4d503f39b	requirements: Remove dateutil. This removes - an unclear fuzzy syntax that had been incorrectly accepted by our `<time:…>` Markdown extension and could not be reproducibly parsed without a specific Python library (even the UNIX timestamp part did not work reliably: some UNIX timestamps were instead parsed as YYYYMMDD); - a fundamentally ambiguous ad-hoc list of three-letter timezone abbreviations that we had needed to manually disambiguate by some kind of subjective popularity; - an unpleasant dependency of the `pg_backup_and_purge` script that we had needed to install system-wide because there might not be a virtualenv set up. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2026-05-10 00:21:37 -07:00
Alex Vandiver	f7bc4e97df	puppet: Create /home/zulip/uploads at install time. Previously, the local uploads directory was created lazily by the upload code as files were written. This worked but left the directory absent immediately after install, which is awkward to reason about and breaks the upcoming check_uploads_settings system check that verifies the configured upload directory exists. The path is documented as something administrators may replace with a symlink to a different storage location, so we use an exec with a 'test -d' guard (which follows symlinks) rather than a file resource that would replace the symlink with a fresh directory.	2026-05-09 23:18:47 -07:00
Alex Vandiver	97a8a5f1a0	nginx: Make Tornado /events locations exact matches. `location /api/v1/events` is a prefix-match, and as such passes any URI starting with `/api/v1/events` through to Tornado -- including encoded oddities like `/api/v1/events%3fdont%255fblock=false` (whose decoded $uri still has the prefix) and `/api/v1/events/internal`, which is meant to be reachable only via the loopback interface but was being proxied to Tornado from the public socket. Tornado's internal_api_view rejects external callers both via its REMOTE_ADDR check and its `SHARED_SECRET` check, so this was not exploitable, but a Tornado worker still had to handle each such request just to 403 it. Switch to exact matches, as was likely intended all along, which lets those requests fall through to Django/uWSGI and 404 without ever waking Tornado. The legitimate internal callers in zerver/tornado/django_api.py talk to http://127.0.0.1:<tornado-port> directly, so they are unaffected, as is the X-Accel-Redirect path served by the /internal/tornado/ regex location.	2026-04-27 09:45:18 -07:00
Alex Vandiver	ea2c37cf50	puppet: get_django_setting_slow returns nil if there are no deploy dirs.	2026-03-09 21:43:06 -07:00
iofq	05392a74bf	puppet: Install logrotate package on Debian systems. A default Debian/Ubuntu server image comes with the `logrotate` package installed, but the `ubuntu` Docker image does not. This causes the Zulip Docker install to not rotate its logfiles, despite having logrotate configuration files installed. Add `logrotate` to the list of required packages for Debian-based systems in the puppet manifest, to ensure installation is enforced on all target platforms, including the Docker image. Fixes: zulip/docker-zulip#263	2026-02-26 15:52:53 -05:00
Alex Vandiver	cbcc588999	kandra: Fix grafana tarball directory prefix. Apparently they now build this with grafana-1.2.3/ as a prefix, not grafana-v1.2.3/	2026-02-25 23:46:44 -05:00
Alex Vandiver	5c1d6a8c98	puppet: Fix the wal-g package location and hashes.	2026-02-25 15:10:57 -08:00
Alex Vandiver	f2a5dc949a	puppet: Update dependencies.	2026-02-24 08:59:08 -08:00
Alex Vandiver	0f67aa1ab2	puppet: Rename redis reconfiguration Exec to better name.	2026-02-14 21:04:12 -08:00
Alex Vandiver	874f5f8441	puppet: Remove zuli-redis.conf workaround from Zulip 2.0.0. It is no longer possible to upgrade directly from Zulip 2.0.0, so no current install needs this.	2026-02-14 21:04:12 -08:00
Alex Vandiver	3495258664	puppet: Remove /run/redis explicit creation. The packages now handle this themselves, and use mode 2755, which we should not fight them about.	2026-02-14 21:04:07 -08:00
Alex Vandiver	147e98c03b	puppet: Provide zulip_version from the same tree as puppet is run. This fixes a bug where the version puppet provided was the _current_ version, which meant that what it provided lagged by one deployment. It also did not function on first install, as `/home/zulip/deployments/current` does not exist yet on first puppet run. Examine the `version.py` from the same tree that puppet is being run from, which addresses both of these issues.	2026-02-14 21:01:56 -08:00
Alex Vandiver	aef2d28194	nginx: Hardcode a Host: header to uwsgi when making health checks. Fixes: #37805.	2026-02-11 21:04:36 -08:00
Alex Vandiver	22ad7cd4ee	nginx: Add preload to HSTS.	2026-02-09 08:54:05 -08:00
Alex Vandiver	c18a7eabbe	nginx: Add includeSubdomains to HSTS.	2026-02-09 08:54:05 -08:00
Alex Vandiver	1ba90e5faa	nginx: Increase HSTS to 1 year, from 6 months.	2026-02-09 08:54:05 -08:00
Alex Vandiver	687d2e0bd3	nginx: Add missing semicolon to CSP definition.	2026-02-09 08:51:58 -08:00
Anders Kaseorg	4913fe228d	env-wal-g: Fix inappropriate 2>&1. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2026-02-06 21:36:20 -05:00
Anders Kaseorg	5a8c5cb563	puppet: Use Open3 for safer command execution. Signed-off-by: Anders Kaseorg <anders@zulip.com>	2026-02-06 21:36:20 -05:00
Alex Vandiver	89473cbca2	nginx: Set Cross-Origin-Opener-Policy as defense in depth. While we set rel="noopener", this provides additional protection against tabnabbing.	2026-02-06 11:54:51 -08:00
Alex Vandiver	1a46153264	nginx: Set Referrer-Policy as defense in depth for Referer: headers. While we set `rel=noreferrer` on our links, this provides additional protections.	2026-02-06 11:54:51 -08:00
Alex Vandiver	e66cfe02db	nginx: Refactor immutable cache headers. This fixes a bug where, because nginx does not cascade `add_header`, Astro headers improperly did not include the default headers.	2026-02-05 17:10:14 -08:00
Alex Vandiver	df4d695d4f	nginx: Merge two adjacent location blocks.	2026-02-05 17:10:14 -08:00
Alex Vandiver	10c13e6367	nginx: Always send X-Content-Type-Options, even on error pages.	2026-02-05 17:08:47 -08:00
Alex Vandiver	f0cc982e52	postgresql: Default to io_method=io_uring on PostgreSQL 18. This is more performant than the PostgreSQL 18 default of `io_method=workers`, but requires kernel 5.1. All supported OSes of Zulip have at least that -- however, it may not be available inside containers, so add a puppet fact to check the syscall before enabling it in PostgreSQL.	2026-02-02 16:26:08 -08:00
Lauryn Menard	6f4e88a441	demo-orgs: Update Welcome bot string to use global time. Updates the demo organization warning for demo creators to use a global timestamp for when the demo organization will be deleted by the archive-messages cronjob (based on the realm's scheduled deletion timestamp).	2026-01-28 09:14:07 -08:00
Arun-kushwaha007	1e7974909d	letsencrypt: Enable strict shell mode for email server restart. Add strict shell options to the email server restart script. This script is a single command wrapper with no variables or pipelines, making strict mode unambiguous and safe. Fixes part of #20748.	2026-01-27 14:16:39 -05:00
Arun Kushwaha	19e9a4e44b	tooling: Enable strict shell mode in selected scripts. Add strict shell options (set -euo pipefail / set -eu) to a small set of simple shell scripts that do not rely on unset variables or pipeline exit-code masking. Each script was reviewed line by line to confirm strict mode is safe and that stopping immediately on errors is the correct behavior for these scripts. Fixes part of #20748.	2026-01-21 09:53:52 -08:00
Alex Vandiver	6f29077560	puppet: Ensure latest ca-certificates is installed. This is particularly necessary if `application_server.custom_ca_path` is in use, as that causes the system CA bundle to be used for all outgoing `requests` connections, instead of the standard `certifi` package.	2026-01-21 09:33:55 -08:00
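
The prefix-versus-exact distinction described in 97a8a5f1a0 can be illustrated with a small Python model. This is only a sketch: it approximates nginx's normalized `$uri` with a single percent-decoding pass and ignores everything else nginx does when selecting a location (regex locations, longest-prefix priority, and so on). The function names `normalized_uri`, `prefix_match`, and `exact_match` are hypothetical and do not appear in the Zulip tree.

```python
from urllib.parse import unquote

# The path named in the commit message.
EVENTS_PATH = "/api/v1/events"


def normalized_uri(raw_path: str) -> str:
    """Rough stand-in for nginx's $uri: percent-decode the path once."""
    return unquote(raw_path)


def prefix_match(raw_path: str) -> bool:
    """Model of `location /api/v1/events`: any URI with this prefix matches."""
    return normalized_uri(raw_path).startswith(EVENTS_PATH)


def exact_match(raw_path: str) -> bool:
    """Model of `location = /api/v1/events`: only the exact URI matches."""
    return normalized_uri(raw_path) == EVENTS_PATH


if __name__ == "__main__":
    for path in (
        "/api/v1/events",
        "/api/v1/events/internal",
        "/api/v1/events%3fdont%255fblock=false",
    ):
        print(path, prefix_match(path), exact_match(path))
```

In this model, all three sample paths satisfy the prefix rule, but only the first satisfies the exact rule; the other two would no longer be handed to Tornado.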

1933 Commits