21 May 2024

Consequences of using FIFO SQS queues where unnecessary

#aws #architecture #pipelines
FIFO SQS queues are sometimes viewed as a drop‑in replacement for regular SQS queues that simply adds a few convenient guarantees. This simplification can have costly consequences.

Overview

When first encountering AWS SQS queues (or any production-grade message passing service for distributed systems), it is not uncommon to be perplexed by the notion of at‑least‑once delivery. Many new developers look for a remedy and come across FIFO SQS queues, which boast message deduplication. In addition, guaranteed message ordering can sometimes simplify reasoning about the system. These reasons, together with the limitations of FIFO SQS not looking too daunting at first glance, lead inexperienced AWS architects to choose FIFO SQS queues over regular queues in situations where the business logic does not strictly demand it.

In our experience, this approach tends to come up short. The advantages of FIFO SQS queues are very specific and can be fully capitalized on only in particular situations. The limitations, on the other hand, are universal and usually come into play in the most critical scenarios.

Pros and cons of using FIFO queues

Deduplication

First, let’s analyze the deduplication argument. The desire for deduplicated message passing stems from the overhead of writing idempotent code - if we can receive a message more than once, we need to make sure that the handling code is ready for such scenarios.

However, even with a deduplicated queue, message duplication can still occur. The most obvious case is an error. If a message consumer crashes before deleting a message, the message becomes visible in the queue again once its visibility timeout expires. Depending on the timing of the error, the message might already have been logically processed, but it is still going to be redelivered - making it a duplicate. In practice, such situations do happen. There are a number of reasons why code can crash in a high‑throughput distributed system: network hiccups, downtime of third‑party services, resource allocation errors, or simple bugs.

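It is also possible for duplicates to be generated by the producer, for example when the same logical event is sent again with a different deduplication ID or after the 5-minute deduplication interval has passed - but how likely that is depends heavily on the specific system.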
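
Since redelivery can happen regardless of the queue type, the handler itself has to tolerate it. Below is a minimal sketch of an idempotent handler, assuming each message carries a business-level identifier. The `payment_id` field, the `IdempotentHandler` class and the in-memory set are hypothetical stand-ins chosen for illustration; a real system would keep the processed IDs in a durable store.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentHandler:
    """Applies each message's effect at most once, keyed by a business-level ID."""

    # Hypothetical in-memory store; a real system would use a durable one.
    processed_ids: set = field(default_factory=set)
    balance: int = 0

    def handle(self, message: dict) -> bool:
        """Return True if the effect was applied, False if it was a redelivery."""
        message_id = message["payment_id"]
        if message_id in self.processed_ids:
            return False  # duplicate delivery: acknowledge and skip
        self.balance += message["amount"]
        self.processed_ids.add(message_id)
        return True


handler = IdempotentHandler()
deliveries = [
    {"payment_id": "p-1", "amount": 50},
    {"payment_id": "p-2", "amount": 20},
    {"payment_id": "p-1", "amount": 50},  # redelivered after a consumer crash
]
results = [handler.handle(m) for m in deliveries]
```

In this sketch the third delivery is recognized as a repeat and skipped, so the balance reflects each payment once. In a real handler, applying the effect and recording the ID would need to be committed together, otherwise a crash between the two steps reintroduces the problem.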

Head-of-line blocking

The most notable problem with FIFO queues is head‑of‑line blocking. Since a FIFO queue guarantees an exact processing order, it has no choice but to get stuck when one of the messages fails to process. This is a serious shortcoming, especially for a rapidly growing business. In scale-ups, systems might contain left‑over data from the start‑up phase that is not necessarily up to date with the newest data formats and contracts. If a message with such data finds its way into the FIFO queue, head-of-line blocking becomes very real and very disastrous in an instant.

Similarly, a genuine FIFO queue scales exceptionally poorly. Since a new message cannot be read until the previous one has been processed, there is no effective way to introduce multiple workers into the architecture. In a high‑throughput system this is a huge shortcoming (especially if message processing takes some time) and will almost certainly end up being a bottleneck.

These drawbacks are critical and have to be addressed. The simplest way to do that is to use different message groups. In FIFO SQS, ordering is only guaranteed among messages in the same group. If messages are spread somewhat evenly between groups, the FIFO SQS queue becomes an interleaving of multiple genuine FIFOs. In such a setup, scaling is somewhat better, and the whole system will not grind to a halt because of one erroneous message. However, our queue is no longer a genuine FIFO, which might defeat the initial motivation for using it in the first place.
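
The effect of message groups on head-of-line blocking can be illustrated with a small simulation. This is a simplified model, not the SQS API: the `drain` function and its arguments are hypothetical, groups are drained one after another rather than interleaved, and retries and dead-letter queues are ignored.

```python
from collections import OrderedDict, deque


def drain(messages, can_process):
    """Simulate a FIFO queue with message groups.

    messages: iterable of (group_id, body) pairs in send order.
    can_process: callable returning False for a message that always fails.
    Returns (processed, stuck): bodies that were handled, and bodies left
    waiting at or behind a failing head-of-line message in their group.
    """
    groups = OrderedDict()
    for group_id, body in messages:
        groups.setdefault(group_id, deque()).append(body)

    processed, stuck = [], []
    for queue in groups.values():
        while queue:
            if not can_process(queue[0]):
                # Order within a group must be preserved, so everything
                # behind the failing message has to wait as well.
                stuck.extend(queue)
                break
            processed.append(queue.popleft())
    return processed, stuck


messages = [("a", "a1"), ("b", "b1"), ("a", "bad"), ("a", "a3"), ("b", "b2")]


def is_ok(body):
    return body != "bad"


# Everything in one group: a genuine FIFO, blocked by the bad message.
single = drain([("all", body) for _, body in messages], is_ok)
# Two groups: only group "a" is blocked, group "b" keeps flowing.
grouped = drain(messages, is_ok)
```

With a single group, every message sent after the failing one is stuck. With two groups, the failure is contained to its own group, at the price of losing any ordering between `a` and `b` messages.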

Capacity limitations

Moreover, there are the capacity limitations of FIFO SQS queues. At the time of writing, the defaults are 300 transactions per second per API action (which translates to 3,000 messages per second with batches of 10) and 20,000 in‑flight messages (meaning that at most 20,000 messages can be received by consumers but not yet deleted at a single point in time). These numbers should be sufficient for most common use‑cases, but they will be an obstacle during the largest traffic spikes. More practically, they can be particularly annoying when deliberately processing a large batch of messages. Such a need usually arises during data schema changes (which we would like to reflect in other parts of the system) or during integrations with new third‑party services (when we need to push existing data to the vendor) - situations quite common in the start‑up phase of a business.
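
To get a feeling for what these limits mean for a one-off backfill, here is a back-of-the-envelope calculation. The backlog size of 10 million messages is a made-up figure, and the helper `drain_time_seconds` is hypothetical; it only divides the backlog by the default throughput quoted above and ignores processing time and the in-flight limit, so it is a lower bound.

```python
def drain_time_seconds(backlog: int, tps: int = 300, batch_size: int = 10) -> float:
    """Lower bound on the time needed to push `backlog` messages through the queue."""
    if not 1 <= batch_size <= 10:
        raise ValueError("SQS batch operations accept 1 to 10 messages")
    return backlog / (tps * batch_size)


backlog = 10_000_000  # hypothetical backfill size

# One message per API call versus full batches of 10.
unbatched_hours = drain_time_seconds(backlog, batch_size=1) / 3600
batched_hours = drain_time_seconds(backlog) / 3600

print(f"unbatched: {unbatched_hours:.2f} h, batched: {batched_hours:.2f} h")
```

Under these assumptions the backfill takes a bit over nine hours without batching and a little under one hour with full batches, before any actual processing time is taken into account.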

There is also the high‑throughput mode, which can raise the throughput limit to as much as 18,000 TPS (depending on the region; in some it is lower). In theory, this should satisfy almost all use‑cases, but it is non‑trivial to capitalize on fully. Most notably, one needs to use a relatively large number of message groups - which might not always be possible. Even then, the throughput remains constrained, while regular SQS queues have virtually no limits.

Conclusion

We’d like to stress that we are not implying that FIFO queues are never relevant. There is a plethora of legitimate use cases where they can be a life‑saver (the most notable being e‑commerce order processing, where the order of events for a single customer is critical). However, they come with their own disadvantages, and the potential benefits need to outweigh them for the choice to be justifiable. In general, we recommend sticking to regular SQS queues and biting the bullet of writing idempotent handlers.
