On the commit title: I'm a bit hesitant to call this a fix, as it doesn't actually fix anything if the size of io_event remains the same. Maybe just talk directly about switching to vmap? The commit message itself clarifies the point of doing that.
On 04/04/2023 15:53, Tudor Cretu wrote:
There were two underlying assumptions:
- The size of the aio_ring header is equal to the size of an io_event, i.e. sizeof(aio_ring) == sizeof(io_event)
- There is no leftover space at the end of a page when populating it with io_events, i.e. PAGE_SIZE is a multiple of sizeof(io_event)
This change aims to make the aio_ring mechanism more robust so that it no longer relies on the above assumptions. It also improves the copying of the io_events from the shared ring to userspace by batching them into at most two copy_to_user calls.
Mapping pages independently using kmap doesn't create a virtually contiguous space. As some io_event structs could span two pages, all the pages needed to access the events' data must be mapped into a virtually contiguous space. For this reason, kmap is replaced with vmap.
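(Side note to illustrate the second assumption breaking: with a 4 KiB PAGE_SIZE, a 48-byte struct io_event, which is the purecap size discussed further down, and a 32-byte aio_ring header, a few lines of throwaway userspace C show where events start straddling pages. These sizes are assumptions on my side, not taken from the patch.)

#include <stdio.h>

int main(void)
{
	const unsigned long page_size = 4096;  /* assumed PAGE_SIZE */
	const unsigned long ev_size = 48;      /* assumed sizeof(struct io_event) */
	const unsigned long hdr_size = 32;     /* assumed sizeof(struct aio_ring) */
	unsigned long idx;

	/* Find the first io_event whose storage straddles a page boundary. */
	for (idx = 0; ; idx++) {
		unsigned long start = hdr_size + idx * ev_size;
		unsigned long end = start + ev_size - 1;

		if (start / page_size != end / page_size) {
			printf("event %lu spans pages %lu and %lu\n",
			       idx, start / page_size, end / page_size);
			break;
		}
	}
	return 0;
}

With those sizes, event 84 already spans pages 0 and 1, so kmap'ing a single page is not enough to read or write it.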
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
 fs/aio.c | 80 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 52 insertions(+), 28 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 079074f47c2e7..c835deda5cdcc 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -603,9 +603,24 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 	return 0;
 }
 
-#define AIO_EVENTS_PER_PAGE	(PAGE_SIZE / sizeof(struct io_event))
-#define AIO_EVENTS_FIRST_PAGE	((PAGE_SIZE - sizeof(struct aio_ring)) / sizeof(struct io_event))
-#define AIO_EVENTS_OFFSET	(AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)
+static unsigned int get_event_pages_idx(unsigned int event_idx,
+					unsigned int nr_events,
+					size_t *event_offset,
+					unsigned int *nr_pages)
+{
+	unsigned int page_idx;
+	size_t off;
+
+	off = sizeof(struct aio_ring);
+	off += sizeof(struct io_event) * event_idx;
+	page_idx = off / PAGE_SIZE;
+	*event_offset = offset_in_page(off);
+
+	off += sizeof(struct io_event) * nr_events - 1;
+	*nr_pages = off / PAGE_SIZE + 1 - page_idx;
+
+	return page_idx;
+}
 
 void kiocb_set_cancel_fn(struct kiocb *iocb, kiocb_cancel_fn *cancel)
 {
@@ -1134,8 +1149,10 @@ static void aio_complete(struct aio_kiocb *iocb)
 {
 	struct kioctx	*ctx = iocb->ki_ctx;
 	struct aio_ring	*ring;
-	struct io_event	*ev_page, *event;
-	unsigned tail, pos, head;
+	struct io_event	*event;
+	void *ring_pages;
+	size_t event_offset;
+	unsigned int tail, head, ctx_page_idx, nr_pages;
 	unsigned long	flags;
 
 	/*
@@ -1146,18 +1163,21 @@ static void aio_complete(struct aio_kiocb *iocb)
 	spin_lock_irqsave(&ctx->completion_lock, flags);
 
 	tail = ctx->tail;
-	pos = tail + AIO_EVENTS_OFFSET;
-
 	if (++tail >= ctx->nr_events)
 		tail = 0;
 
-	ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
-	event = ev_page + pos % AIO_EVENTS_PER_PAGE;
+	ctx_page_idx = get_event_pages_idx(tail, 1, &event_offset, &nr_pages);
+	ring_pages = vmap(ctx->ring_pages + ctx_page_idx, nr_pages, VM_MAP, PAGE_KERNEL);
I've pondered on this for a while; in the end, the current approach does worry me in terms of performance. The io_getevents() case is not a very big deal: you can fetch multiple pages of events at the same time this way, and anyway high-performance applications will probably just read directly from the ring. OTOH there is no way around io_submit(), and here we are effectively performing a vmap() + vunmap() for every event in the array, which seems very wasteful.
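To make that concern concrete, here is a rough sketch of the cost shape when N events are handled back to back under this scheme. It is purely illustrative: it reuses get_event_pages_idx() from the hunk above, ignores ring wrap-around, and is not an actual code path in the patch.

#include <linux/mm.h>		/* PAGE_KERNEL */
#include <linux/vmalloc.h>	/* vmap(), vunmap(), VM_MAP */

/* Hypothetical helper, not in the patch: touch nr events starting at first. */
static void touch_events(struct kioctx *ctx, unsigned int first, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		size_t event_offset;
		unsigned int nr_pages;
		unsigned int page_idx = get_event_pages_idx(first + i, 1,
							    &event_offset, &nr_pages);
		/* A fresh mapping of one or two pages is set up per event... */
		void *p = vmap(ctx->ring_pages + page_idx, nr_pages,
			       VM_MAP, PAGE_KERNEL);

		if (!p)
			break;
		/* ... the struct io_event at p + event_offset is accessed ... */
		vunmap(p);	/* ... and the mapping is immediately torn down again. */
	}
}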
Very interestingly, while I was checking whether the overhead of vmap() had been measured, I stumbled upon an aio patch [1] that also replaces kmap with vmap! A big difference is that it sets up the mapping when creating the ring and then keeps it around, allowing the ring to be accessed directly at all times (a bit like io_uring). There are some discussions on this patch, including performance-wise, but unfortunately no conclusion, and it looks like the author then lost interest.
In our situation, where struct io_event ends up being 48 bytes, switching to vmap seems to be justified, if only to keep things reasonably simple. It does feel though that the approach in [1] makes more sense - we do increase TLB pressure by keeping the kernel mapping, but that should be a negligible cost compared to calling vmap/vunmap in a loop.
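For concreteness, keeping the mapping for the lifetime of the ring could look roughly like the sketch below. This is my own illustration based on the description above, not the actual code from [1], and the ctx->ring field is hypothetical.

#include <linux/errno.h>	/* ENOMEM */
#include <linux/mm.h>		/* PAGE_KERNEL */
#include <linux/vmalloc.h>	/* vmap(), vunmap(), VM_MAP */

/* Called once from aio_setup_ring(): map all ring pages contiguously. */
static int aio_map_ring(struct kioctx *ctx)
{
	/* ctx->ring would be a new void * field caching the mapping. */
	ctx->ring = vmap(ctx->ring_pages, ctx->nr_pages, VM_MAP, PAGE_KERNEL);
	return ctx->ring ? 0 : -ENOMEM;
}

/* Called once from aio_free_ring(): drop the long-lived mapping. */
static void aio_unmap_ring(struct kioctx *ctx)
{
	vunmap(ctx->ring);
	ctx->ring = NULL;
}

That trades a handful of long-lived kernel PTEs per ring for never having to touch the vmalloc area on the hot paths.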
I don't think [1] applies as-is and it might not be a perfect solution anyway, so feel free to adapt it or just take some inspiration from it.
Kevin
[1] https://lore.kernel.org/all/1349764760-21093-4-git-send-email-koverstreet@go...