PostgreSQL 18 has been officially released, packed with numerous improvements. One major architectural change is Asynchronous I/O (AIO), which enables the asynchronous scheduling of I/O operations. This grants the database better control over storage resources and improves storage utilization.

This article will not delve deeply into how AIO works or present exhaustive benchmark results. Its primary goal is to share tuning recommendations for AIO in PostgreSQL 18 and explain some inherent, non-obvious trade-offs and limitations.

Ideally, these tuning suggestions should be incorporated into the official documentation, but that requires a clear consensus based on practical experience. As a new feature, AIO currently lacks sufficient real-world validation data. Although extensive benchmarks were conducted during development to set the default parameters, this cannot replace the experience of actual production systems. Therefore, this article will discuss how to (possibly) adjust the default parameters and the trade-offs involved, based on personal experience.
io_method / io_workers
There is a series of parameters related to AIO (or I/O in general). However, you likely only need to focus on these two introduced in Postgres 18:
io_method = worker (options: sync, io_uring)
io_workers = 3
Other parameters (like io_combine_limit) have reasonable defaults. I don't have strong recommendations for tuning them yet, so it's best to keep them as-is for now. This article will focus on these two key parameters.
io_method
The io_method setting determines how AIO requests are actually handled: which process performs the I/O and how it is scheduled. It has three possible values:
- sync - This is a "backwards-compatible" option, using synchronous I/O with posix_fadvise where supported. This prefetches data into the page cache, not the shared buffers.
- worker - Creates a pool of "I/O worker processes" to perform the actual I/O. When a backend process needs to read a block from a data file, it inserts a request into a queue in shared memory. An I/O worker process is woken up, performs the pread operation, places the data into the shared buffers, and notifies the backend process.
- io_uring - Each backend process has an io_uring instance (a pair of queues) and uses it to perform I/O. The difference from worker is that instead of executing pread directly, it submits requests via io_uring.
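As a quick sanity check, the active method can be inspected and changed like any other server setting. A minimal sketch (note that io_method only takes effect at server start, so a configuration reload is not enough):

```
-- Show which I/O method the running server is using
SHOW io_method;

-- Switch methods (takes effect only after a server restart)
ALTER SYSTEM SET io_method = 'worker';
```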
The default value is io_method = worker. We did consider making either sync or io_uring the default, but I believe worker was the correct choice. It is truly "asynchronous" and works everywhere (since it's our own implementation). sync was considered as a "fallback" option in case we encountered issues during the beta/RC phase. But we didn't have problems, and it's unclear if using sync would even be helpful, as it still goes through the AIO infrastructure. You can still use sync if you want to simulate the behavior of older versions.
io_uring is a popular method for asynchronous I/O (not just for disks). It is excellent, efficient, and lightweight. However, it is Linux-specific, and we need to support many platforms. We could have used platform-specific defaults (similar to wal_sync_method), but that seemed unnecessarily complex.
Note: Even on Linux, validating io_uring can be tricky. Some container runtimes (e.g., containerd) previously disabled io_uring support due to security risks.
No single io_method option is "universally optimal." There will always be workloads where A is better than B, and vice versa. Ultimately, we hope most systems will use and benefit from AIO, and we wanted to keep things simple, so we chose worker.
💡 **Suggestion:** My recommendation is to stick with io_method = worker and adjust the io_workers value (described in the next section).
io_workers
Postgres defaults are very conservative. It can even start on small machines like a Raspberry Pi. On the other hand, this conservative configuration performs poorly on typical database servers, which usually have more RAM/CPU. To get good performance on such large machines, you need to tune some parameters (shared_buffers, max_wal_size, etc.). I wish we had an automated way to choose "appropriate" initial values for these basic parameters, but it's more difficult than it seems. It largely depends on the context (e.g., other things might be running on the same system). At least there are tools like PGTune that provide reasonable recommendations for these parameters.

This also applies to the default value of io_workers = 3, which only creates 3 I/O worker processes. This might be acceptable for a small machine with 8 cores, but it is definitely insufficient for a machine with 128 cores.
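Raising the setting itself is straightforward. A minimal sketch (io_workers is marked reloadable in PostgreSQL 18, so a restart should not be needed; the value 16 is purely illustrative, not a recommendation):

```
ALTER SYSTEM SET io_workers = 16;  -- illustrative value
SELECT pg_reload_conf();
SHOW io_workers;
```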
I can demonstrate this with results from a benchmark I ran to help select the default io_method. This benchmark generated a synthetic dataset and then ran queries matching parts of the data (while forcing the use of a specific scan type).

Note: This benchmark (along with scripts, numerous results, and a more detailed explanation) was initially shared in a pgsql-hackers mailing list thread about the default io_method. Please refer to that thread for more details and feedback from others. The results shown are from a small workstation with a Ryzen 9900X (12 cores / 24 threads) and 4 NVMe SSDs (configured in RAID0).

The following chart compares query execution times for different io_method options:
(Chart description: Each color represents a different io_method value (17 stands for "Postgres 17"). For the "worker" method, there are two data series corresponding to different numbers of worker processes (3 and 12). This is shown for two datasets: "uniform" - uniform distribution (so I/O is completely random), and "linear_10" - sequential distribution with a bit of randomness (imperfect correlation).)
The chart shows some very interesting phenomena:
- **Index Scan:** The io_method has no impact, which is understandable because index scans do not yet use AIO (all I/O is synchronous).
- **Bitmap Scan:** The behavior is more chaotic. The worker method performs best, but only when there are 12 worker processes. With the default 3 worker processes, its performance is actually poor for low-selectivity queries.
- **Sequential Scan:** There is a clear difference between methods. worker is the fastest, about twice as fast as sync (and PG17). io_uring falls in between.
In a chart with a logarithmic scale on the Y-axis, the performance disadvantage of the worker mode with io_workers = 3 in bitmap scan scenarios is more evident: that configuration is consistently the slowest (this is almost imperceptible in the linear chart).
The good news is that while I/O worker processes are not free, their overhead is not excessive. Therefore, having too many workers is generally better than having too few. In the future, we might start/stop worker processes on demand, making them "adaptive." This would allow us to always maintain an optimal number of processes. There is even a patch in progress for this, but it wasn't included in Postgres 18.
**Suggestion:** Consider increasing io_workers. There isn't an ideal recommended value or formula yet, but perhaps setting it to about 1/4 of the number of CPU cores is a viable option?
Trade-offs
A one-size-fits-all optimal configuration does not exist. I have seen suggestions to "use io_uring for maximum efficiency," but the benchmark above clearly shows that for sequential scans, io_uring is significantly slower than worker. Don't get me wrong, I recognize that io_uring is an excellent interface, and the aforementioned suggestion is not "wrong." Any performance tuning advice is inherently a simplification; there will always be counterexamples. The real world is never as simple as advice suggests: the point of such advice is to hide the underlying complexity behind a concise rule.
So, what are the trade-offs and differences between these asynchronous I/O methods?
Bandwidth
A major difference between io_uring and worker lies in where the tasks are executed. For io_uring, all tasks are executed within the backend process itself; for worker, these tasks are handled in separate processes. This can have noteworthy implications for bandwidth, depending on the overhead of processing the I/O. This overhead can be significant because it involves:
- The actual I/O operation
- Checksum verification (enabled by default in Postgres 18)
- Copying data into the shared buffers
For io_uring, all of this happens within the backend process itself. The I/O part might be more efficient, but the checksum verification and memory copying (memcpy) steps can become performance bottlenecks. For worker, this work is effectively distributed among the worker processes. If you have 1 backend process and 3 worker processes, the limit is increased by a factor of 3. Of course, the converse is also true. With 16 connections, for io_uring, that's 16 processes that can verify checksums, etc. For worker, the limit is the value set for io_workers. This is why I suggest setting io_workers to about 25% of the core count. I think it could even be set higher, possibly up to one I/O worker per core. In any case, 3 seems clearly too low.
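The scaling argument can be made concrete with a toy model. The 3 GB/s per-process rate and 12 GB/s device bandwidth below are made-up illustrative numbers, not measurements:

```python
def aggregate_read_gbps(working_processes: int, per_process_gbps: float,
                        device_gbps: float) -> float:
    """Toy model: throughput is capped both by the device and by the processes
    doing the post-I/O work (checksum verification + memcpy)."""
    return min(device_gbps, working_processes * per_process_gbps)

# One busy backend with the worker method: the I/O workers share the post-I/O work.
print(aggregate_read_gbps(3, 3.0, 12.0))   # io_workers = 3  -> capped at 9.0 GB/s
print(aggregate_read_gbps(12, 3.0, 12.0))  # io_workers = 12 -> device-bound, 12.0 GB/s
# The same backend with io_uring does all of that work itself.
print(aggregate_read_gbps(1, 3.0, 12.0))   # -> 3.0 GB/s
```

With many concurrent backends the comparison flips, since each io_uring backend brings its own post-I/O capacity while worker stays capped at io_workers.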
Note: I believe this ability to spread the overhead across multiple processes is the reason worker outperforms io_uring on sequential scans. A difference of around 20% seems plausible for checksum verification and memory copying in this benchmark.
Signaling
Another important detail is the overhead of inter-process communication (IPC) between backend processes and I/O worker processes, which is based on UNIX signals. The execution flow for a single I/O operation is as follows:
- The backend process adds a read request to a queue in shared memory.
- The backend process signals an I/O worker process to wake it up.
- The I/O worker process performs the I/O requested by the backend and copies the data into the shared buffers.
- The I/O worker process signals the backend process to notify it that the I/O is complete.
In the worst case, this means one "round-trip signal" (2 signals in total) is required for every 8KB data block processed. The problem is that signaling is not "zero-cost": there is a limit to the number of signals a process can handle per second. I wrote a simple benchmark to test the performance of signal passing between two processes. On my machine, the results showed it could reach 250,000 to 500,000 round trips per second. If each 8KB block requires one round trip, this translates to a transfer rate of only 2-4 GB/s. This is not particularly fast, especially considering the data might already be in the page cache, not just cold data read from storage. According to a test copying data from the page cache, a single process can achieve 10-20 GB/s, which is about 4 times faster than the signaling method. Clearly, signaling could become a performance bottleneck.
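A minimal version of such a ping-pong benchmark might look like the sketch below. This is an illustration of the idea, not the author's actual script; it relies on POSIX signals (Unix-only), and the numbers it prints vary wildly between machines.

```python
import os
import signal
import time

ROUNDS = 20_000
BLOCK = 8192  # one 8KB page per round trip, as in the worst case above

# Block SIGUSR1 before forking so neither side can lose a signal,
# then wait for it synchronously with sigwaitinfo() instead of a handler.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})

child = os.fork()
if child == 0:  # child: echo every signal straight back to the parent
    parent = os.getppid()
    for _ in range(ROUNDS):
        signal.sigwaitinfo({signal.SIGUSR1})
        os.kill(parent, signal.SIGUSR1)
    os._exit(0)

start = time.perf_counter()
for _ in range(ROUNDS):
    os.kill(child, signal.SIGUSR1)        # "wake the I/O worker"
    signal.sigwaitinfo({signal.SIGUSR1})  # "wait for I/O completion"
elapsed = time.perf_counter() - start
os.waitpid(child, 0)

rate = ROUNDS / elapsed
print(f"{rate:,.0f} round trips/s -> {rate * BLOCK / 1e9:.2f} GB/s "
      f"at one 8KB block per round trip")
```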
Note: The specific limits vary by hardware and can be much lower on older machines. But this general observation held true on all machines I had access to.
The good news is that this only affects "worst-case" workloads that require reading 8KB pages one by one. Most regular workloads are not like this. Backends often find many buffers in shared memory (thus requiring no I/O). Or, due to read-ahead, I/O happens in larger chunks, amortizing the signaling cost over multiple blocks. Therefore, I don't consider this a serious issue likely to arise frequently. There is a longer discussion about the overhead of AIO (not just due to signaling) in the mailing list thread about index prefetching.
File Limits
io_uring does not require any IPC, so it is not subject to signaling overhead or similar issues. However, io_uring also has its own limitations, just in different places. For instance, each process is subject to "per-process bandwidth limits" (e.g., how much memory copying a single process can perform). But judging by the page cache test, these limits are quite high: around 10-20 GB/s. Another consideration is that io_uring might require a considerable number of file descriptors. As explained in this pgsql-hackers thread:

The issue is that with io_uring, we need to create one file descriptor (FD) per possible child process so that one backend process can wait for I/O initiated by another backend to complete. These io_uring instances need to be created in the postmaster so that all backends can access them. Obviously, if max_connections is set high, this helps hit the unadjusted soft RLIMIT_NOFILE limit faster.
Therefore, if you decide to use io_uring, you might also need to adjust ulimit -n.
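For example, checking and raising the soft limit might look like this (the values are illustrative; the right number depends on max_connections and the rest of your setup):

```
# Current soft limit on open file descriptors for this shell
ulimit -Sn

# Raise it for the current session (illustrative value; must not exceed the hard limit)
# ulimit -Sn 65536

# For a systemd-managed Postgres, set it in the service unit instead:
#   [Service]
#   LimitNOFILE=262144
```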
Note: This is not the only place in the Postgres code where you might hit file descriptor limits. About a year ago, I proposed a patch idea related to file descriptor caching. Each backend keeps open file descriptors up to max_files_per_process, which is set to 1000 by default. This was sufficient in the past, but with partitioning (or per-tenant schemas), it's easy to trigger frequent and costly open/close calls. That is a separate but similar issue.
Summary
AIO is a major architectural change in PostgreSQL 18, but it currently has limitations: it only supports read operations, and some operations still rely on the old synchronous I/O mechanism. These limitations are not permanent and are expected to be addressed step by step in future versions. Based on the analysis in this article, the final AIO tuning recommendations are as follows:
- **Keep the default io_method = worker:** Unless benchmarking proves io_uring is superior for your specific workload, switching is not recommended. Use sync only if you need to simulate PostgreSQL 17 behavior (even though it may lead to performance degradation in some scenarios).
- **Adjust io_workers based on CPU cores:** Start with a configuration of about 25% of the core count, and consider increasing it up to 100% in I/O-intensive scenarios.
If you discover interesting conclusions during your tuning process, feel free to provide feedback to the author, and it is even more recommended to post your experiences to the pgsql-hackers mailing list. These experiences will help improve the tuning recommendations in the official documentation in the future.