zulip/docs/subsystems/queuing.md
# Queue processors
Zulip uses RabbitMQ to manage a system of internal queues. These are
used for a variety of purposes:
* Asynchronously doing expensive operations like sending email
notifications, which can take seconds per email and thus would
otherwise time out when hundreds are triggered at once (e.g. inviting a
lot of new users to a realm).
* Asynchronously doing non-time-critical, somewhat expensive operations
like updating analytics tables (e.g. `UserActivityInterval`), which
don't have any immediate runtime effect.
* Communicating events to push to clients (browsers, etc.) from the
main Zulip Django application process to the Tornado-based events
system. Example events might be that a new message was sent, a user
has changed their subscriptions, etc.
* Processing mobile push notifications and email mirroring system
messages.
* Processing various errors, frontend tracebacks, and slow database
queries in a batched fashion.

The RabbitMQ-based queuing system is an important
part of the overall Zulip architecture, since it's in critical code
paths for everything from signing up for an account, to rendering
messages, to delivering updates to clients.

We use the `pika` library to interface with RabbitMQ, using a simple
custom integration defined in `zerver/lib/queue.py`.

### Adding a new queue processor
To add a new queue processor:
* Define the processor in `zerver/worker/queue_processors.py` using
the `@assign_queue` decorator; the easiest way to start is to copy
an existing similar queue processor as a template. This suffices to
test your queue worker in the Zulip development environment
(`tools/run-dev.py` will automatically restart the queue processors
and start running your new queue processor code). You can also run
a single queue processor manually using e.g.
`./manage.py process_queue --queue=user_activity`.
* So that supervisord will know to run the queue processor in
production, you will need to add it to `normal_queues` in
`puppet/zulip/manifests/base.pp`; the list there is used to generate
`/etc/supervisor/conf.d/zulip.conf` via a Puppet template in
`app_frontend.pp`.

The queue will automatically be added to the list of queues tracked by
`scripts/nagios/check-rabbitmq-consumers`, so Nagios can properly
check whether a queue processor is running for your queue. You still
need to update the sample Nagios configuration in `puppet/zulip_ops`
manually.
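
As a rough illustration of the first step, the sketch below shows the general shape of a decorator-registered queue processor. It is self-contained and does not import Zulip: the `assign_queue` decorator, the `QueueProcessingWorker` base class, and the `worker_classes` registry here are simplified stand-ins written for this example, and `ExampleWorker` and the `example_queue` queue name are hypothetical. Consult `zerver/worker/queue_processors.py` for the real definitions.

```python
from typing import Any, Callable, Dict, List, Type

# Stand-in registry mapping queue names to worker classes. This is an
# assumption made for the sketch, not Zulip's actual data structure.
worker_classes: Dict[str, Type["QueueProcessingWorker"]] = {}


class QueueProcessingWorker:
    """Simplified stand-in base class; subclasses implement consume()."""

    queue_name: str = ""

    def consume(self, event: Dict[str, Any]) -> None:
        raise NotImplementedError


def assign_queue(
    queue_name: str,
) -> Callable[[Type[QueueProcessingWorker]], Type[QueueProcessingWorker]]:
    """Return a class decorator that registers a worker under queue_name."""

    def decorate(cls: Type[QueueProcessingWorker]) -> Type[QueueProcessingWorker]:
        cls.queue_name = queue_name
        worker_classes[queue_name] = cls
        return cls

    return decorate


# Hypothetical worker: it only records the events it is handed.
@assign_queue("example_queue")
class ExampleWorker(QueueProcessingWorker):
    def __init__(self) -> None:
        self.seen: List[Dict[str, Any]] = []

    def consume(self, event: Dict[str, Any]) -> None:
        self.seen.append(event)
```

The point of the pattern is that defining the class is enough to make it discoverable by queue name; nothing else needs to list the worker explicitly within the Python code.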
### Publishing events into a queue
You can publish events to a RabbitMQ queue using the
`queue_json_publish` function defined in `zerver/lib/queue.py`.

An interesting challenge with queue processors is what should happen
when events are queued in Zulip's backend tests. Our current solution is
that in the tests, `queue_json_publish` will (by default) simply call
the `consume` method for the relevant queue processor. However,
`queue_json_publish` also supports being passed a function that should
be called in the tests instead of the queue processor's `consume`
method. Where possible, we prefer the model of calling `consume` in
tests since that's more predictable and automatically covers the queue
processor's code path, but it isn't always possible.
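
The test-mode behavior described above can be sketched as follows. This is a simplified, self-contained model rather than Zulip's implementation: `publish_event`, `FakeWorker`, the `workers` registry, the `in_tests` flag, and the `processor` parameter name are all hypothetical names chosen for illustration, and the production branch is reduced to appending to an in-memory list instead of publishing to RabbitMQ.

```python
import json
from typing import Any, Callable, Dict, List, Optional, Tuple

Event = Dict[str, Any]


class FakeWorker:
    """Hypothetical queue processor that records consumed events."""

    def __init__(self) -> None:
        self.consumed: List[Event] = []

    def consume(self, event: Event) -> None:
        self.consumed.append(event)


# Hypothetical registry of queue processors, keyed by queue name.
workers: Dict[str, FakeWorker] = {"user_activity": FakeWorker()}

# Stand-in for the message broker: (queue name, JSON payload) pairs.
pending: List[Tuple[str, str]] = []


def publish_event(
    queue_name: str,
    event: Event,
    processor: Optional[Callable[[Event], None]] = None,
    in_tests: bool = True,
) -> None:
    if not in_tests:
        # Outside tests, the event would be serialized and handed to the broker.
        pending.append((queue_name, json.dumps(event)))
        return
    if processor is not None:
        # The caller supplied a replacement to run instead of consume().
        processor(event)
    else:
        # Default in tests: run the worker's consume() synchronously.
        workers[queue_name].consume(event)
```

Calling `publish_event("user_activity", {"id": 1})` in this model runs `FakeWorker.consume` immediately, which is why the default path exercises the queue processor's code in tests without a running broker.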
### Clearing a RabbitMQ queue
If you need to clear a queue (delete all the events in it), run
`./manage.py purge_queue <queue_name>`, for example:
```
./manage.py purge_queue user_activity
```
You can also use the amqp tools directly. Install `amqp-tools` from
apt and then run:
```
amqp-delete-queue --username=zulip --password='...' --server=localhost \
--queue=user_presence
```
with the RabbitMQ password from `/etc/zulip/zulip-secrets.conf`.