On 02-03-2023 12:41, Kevin Brodsky wrote:
On 23/02/2023 18:53, Tudor Cretu wrote:
Introduce compat versions of the structs exposed in the uAPI headers that might contain pointers as a member. Also, implement functions that convert the compat versions to the native versions of the struct.
A subsequent patch is going to change the io_uring structs to enable them to support new architectures. On such architectures, the current struct layout still needs to be supported for compat tasks.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
 include/linux/io_uring_types.h | 140 ++++++++++++++++++-
 io_uring/cancel.c              |  38 ++++-
 io_uring/epoll.c               |   2 +-
 io_uring/fdinfo.c              |  55 +++++++-
 io_uring/io_uring.c            | 248 +++++++++++++++++++++++----------
 io_uring/io_uring.h            | 122 +++++++++++++---
 io_uring/kbuf.c                | 108 ++++++++++++--
 io_uring/kbuf.h                |   6 +-
 io_uring/net.c                 |   5 +-
 io_uring/rsrc.c                | 125 +++++++++++++++--
 io_uring/tctx.c                |  57 ++++++--
 io_uring/uring_cmd.h           |   7 +
 12 files changed, 782 insertions(+), 131 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 440179029a8f0..737bc1aa67306 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
- union {
__kernel_rwf_t rw_flags;
__u32 fsync_flags;
__u16 poll_events;
__u32 poll32_events;
__u32 sync_range_flags;
__u32 msg_flags;
__u32 timeout_flags;
__u32 accept_flags;
__u32 cancel_flags;
__u32 open_flags;
__u32 statx_flags;
__u32 fadvise_advice;
__u32 splice_flags;
__u32 rename_flags;
__u32 unlink_flags;
__u32 hardlink_flags;
__u32 xattr_flags;
__u32 msg_ring_flags;
__u32 uring_cmd_flags;
I see that a few new members have appeared since 5.18. This makes me think, is there really much point in keeping those union members in sync? This is going to prove rather painful when rebasing, in fact they'll probably end up getting out of sync. Maybe we could simply represent the union as the member that we actually use in convert_compat_io_uring_sqe()? A comment next to that member would be enough to make it clear it's actually a union in the native struct.
Aye, great suggestion!
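For reference, a rough sketch of what I have in mind (only the relevant part of struct compat_io_uring_sqe is shown, and the member name simply mirrors the one read in convert_compat_io_uring_sqe()):

struct compat_io_uring_sqe {
	__u8	opcode;
	__u8	flags;
	__u16	ioprio;
	__s32	fd;
	/* ... */
	/*
	 * In the native struct this is an anonymous union of per-opcode
	 * flag fields (fsync_flags, poll32_events, msg_flags, ...). All
	 * members fit in 32 bits, so a single member is enough here; it
	 * is the one copied by convert_compat_io_uring_sqe().
	 */
	__u32	rw_flags;
	__u64	user_data;
	/* ... */
};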
+struct compat_io_uring_cqe {
- __u64 user_data;
- __s32 res;
- __u32 flags;
- __u64 big_cqe[];
Nit: might as well align it like the rest, even though it's not in the native struct. I think it would make sense to have the same alignment approach for the other structs too, the native struct declarations are not very consistent.
Done!
@@ -216,7 +340,11 @@ struct io_ring_ctx {
		 * array.
		 */
		u32			*sq_array;
struct io_uring_sqe *sq_sqes;
Nit: not sure adding a new line helps, it feels clearer to keep all the sq_* stuff in one block.
Done!
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index 2291a53cdabd1..20cb8634e44af 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -27,6 +27,42 @@ struct io_cancel {
#define CANCEL_FLAGS	(IORING_ASYNC_CANCEL_ALL | IORING_ASYNC_CANCEL_FD | \
			 IORING_ASYNC_CANCEL_ANY | IORING_ASYNC_CANCEL_FD_FIXED)
+struct compat_io_uring_sync_cancel_reg {
Shouldn't that struct be with its friends in io_uring_types.h? Same question regarding struct compat_io_uring_buf_reg.
I thought to move it here because it's currently only used in this file, but I'll move it back to party with its friends.
- __u64 addr;
- __s32 fd;
- __u32 flags;
- struct __kernel_timespec timeout;
- __u64 pad[4];
+};
+#ifdef CONFIG_COMPAT64
The #ifdef'ing makes me think: should we make it more systematic? Right now it looks like only the compat copy helpers and calls to them are #ifdef'd, but arguably it doesn't make sense to #ifdef that and not things like the size selection (e.g. in rings_size()), as we are effectively assuming that the compat and native structs are identical apart from in compat64. Maybe a new helper that returns IS_ENABLED(CONFIG_COMPAT64) && ctx->compat would help? In that case the copy helpers couldn't be #ifdef'd any longer but it may be better this way.
Great suggestion!
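I'll add a small helper along these lines (name is just my pick) and use it everywhere instead of raw ctx->compat, so that the size selections and the calls to the copy helpers all collapse at compile time on !COMPAT64 kernels; as you say, the copy helpers themselves then can't be #ifdef'd anymore:

static inline bool io_in_compat64(struct io_ring_ctx *ctx)
{
	return IS_ENABLED(CONFIG_COMPAT64) && ctx->compat;
}

	/* e.g. in rings_size(): */
	size_t cqe_size = io_in_compat64(ctx) ? sizeof(struct compat_io_uring_cqe) :
						sizeof(struct io_uring_cqe);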
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index bc8c9d764bc13..c5bd669081c98 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -89,12 +89,25 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
	for (i = 0; i < sq_entries; i++) {
		unsigned int entry = i + sq_head;
		struct io_uring_sqe *sqe;
unsigned int sq_idx;
unsigned int sq_idx, sq_off;
+#ifdef CONFIG_COMPAT64
struct io_uring_sqe *native_sqe = NULL;
+#endif
		sq_idx = READ_ONCE(ctx->sq_array[entry & sq_mask]);
		if (sq_idx > sq_mask)
			continue;
sqe = &ctx->sq_sqes[sq_idx << sq_shift];
sq_off = sq_idx << sq_shift;
sqe = ctx->compat ? (void *)&ctx->sq_sqes_compat[sq_off] : &ctx->sq_sqes[sq_off];
+#ifdef CONFIG_COMPAT64
if (ctx->compat) {
native_sqe = kmalloc(sizeof(struct io_uring_sqe) << sq_shift, GFP_KERNEL);
It's pretty suboptimal to do a kmalloc() / kfree() in a loop. An alternative would be to have a local struct io_uring_sqe native_sqe[2]; to account for the SQE128 case.
I had considered just having struct io_uring_sqe native_sqe[2]. I'll pay more attention when using kmalloc/kfree from now on, so thanks for pointing it out!
That said for the case of the printing code in this file, I wonder if we shouldn't skip the conversion altogether, since the structs are not passed to any other function. If we move each of the three printing blocks where conversion is needed to a macro, then we can easily handle both the native and compat struct with the same code.
Agree!
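Roughly what the macro ends up looking like (heavily abridged, and the macro name is made up; the real one prints every field the current code does). It works because the member names are identical in struct io_uring_sqe and struct compat_io_uring_sqe:

#define io_uring_show_sqe(m, sqe, sq_idx)				\
	seq_printf(m, "%5u: opcode:%s, fd:%d, flags:%x, user_data:%llu\n", \
		   sq_idx, io_uring_get_opcode((sqe)->opcode),		\
		   (sqe)->fd, (sqe)->flags,				\
		   (unsigned long long)(sqe)->user_data)

		/* and in the fdinfo loop: */
		if (ctx->compat)
			io_uring_show_sqe(m, &ctx->sq_sqes_compat[sq_off], sq_idx);
		else
			io_uring_show_sqe(m, &ctx->sq_sqes[sq_off], sq_idx);

No conversion and no allocation is needed in the loop anymore.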
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 707229ae04dc8..03506776af043 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -152,6 +152,43 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
static struct kmem_cache *req_cachep;
+#ifdef CONFIG_COMPAT64
+static int get_compat_io_uring_getevents_arg(struct io_uring_getevents_arg *arg,
const void __user *user_arg)
+{
- struct compat_io_uring_getevents_arg compat_arg;
- if (unlikely(copy_from_user(&compat_arg, user_arg, sizeof(compat_arg))))
Not very important but curious: we only seem to be using unlikely(copy_from_user(...)) in the compat helpers, any reason for that?
It's a leftover from the rebase. I will remove the "unlikely" for consistency.
return -EFAULT;
- arg->sigmask = (__kernel_uintptr_t)compat_ptr(compat_arg.sigmask);
Looks like a leftover conversion that belongs to patch 8 instead.
Done!
- arg->sigmask_sz = compat_arg.sigmask_sz;
- arg->pad = compat_arg.pad;
- arg->ts = (__kernel_uintptr_t)compat_ptr(compat_arg.ts);
- return 0;
+}
+#endif /* CONFIG_COMPAT64 */
+static int copy_io_uring_getevents_arg_from_user(struct io_ring_ctx *ctx,
struct io_uring_getevents_arg *arg,
const void __user *argp,
size_t size)
+{
+#ifdef CONFIG_COMPAT64
- if (ctx->compat) {
if (size != sizeof(struct compat_io_uring_getevents_arg))
return -EINVAL;
if (get_compat_io_uring_getevents_arg(arg, argp))
return -EFAULT;
return 0;
Or simply return get_compat_io_uring_getevents_arg(arg, argp);
I think this pattern reappears in quite a few copy_*_from_user() helpers.
Done!
- }
+#endif
- if (size != sizeof(*arg))
return -EINVAL;
- if (copy_from_user(arg, argp, sizeof(*arg)))
return -EFAULT;
- return 0;
+}
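Picking up the simplification from above, the helper ends up looking roughly like this (a sketch; io_in_compat64() is the helper discussed earlier, with the compat copy helpers no longer #ifdef'd):

static int copy_io_uring_getevents_arg_from_user(struct io_ring_ctx *ctx,
						 struct io_uring_getevents_arg *arg,
						 const void __user *argp,
						 size_t size)
{
	if (io_in_compat64(ctx)) {
		if (size != sizeof(struct compat_io_uring_getevents_arg))
			return -EINVAL;
		/* returns 0 or -EFAULT, so it can be returned directly */
		return get_compat_io_uring_getevents_arg(arg, argp);
	}

	if (size != sizeof(*arg))
		return -EINVAL;
	if (copy_from_user(arg, argp, sizeof(*arg)))
		return -EFAULT;
	return 0;
}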
struct sock *io_uring_get_socket(struct file *file)
{
#if defined(CONFIG_UNIX)
@@ -604,7 +641,9 @@ void io_cq_unlock_post(struct io_ring_ctx *ctx)
static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
{
	bool all_flushed;
- size_t cqe_size = sizeof(struct io_uring_cqe);
- size_t cqe_size = ctx->compat ?
sizeof(struct compat_io_uring_cqe) :
sizeof(struct io_uring_cqe);
I'm not sure this is enough to get the overflow machinery to work in compat64. struct io_overflow_cqe gets filled in io_cqring_event_overflow() but AFAICT that code is not compat-aware, so it always initialises the native cqe and not the compat one. I suppose we could make use of __io_fill_cqe_any() in io_cqring_event_overflow()?
This makes me think, are there tests that trigger this overflow mechanism? Issues there might easily go unnoticed otherwise.
I had a better look at this and realised that struct io_overflow_cqe is always filled with a native cqe, so I removed the union from it. Also this change is not needed anymore. Yes, there are tests in liburing, and they pass now for compat as well.
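For clarity, struct io_overflow_cqe then simply stays as upstream has it, always holding a native CQE; the compat layout only matters when the entry is eventually flushed into the CQ ring:

struct io_overflow_cqe {
	struct io_uring_cqe	cqe;	/* always the native layout */
	struct list_head	list;
};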
	if (!force && __io_cqring_events(ctx) == ctx->cq_entries)
		return false;
@@ -741,10 +780,11 @@ bool io_req_cqe_overflow(struct io_kiocb *req)
	 * control dependency is enough as we're using WRITE_ONCE to
	 * fill the cq entry
	 */
-struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
+void *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
I certainly understand the rationale behind changing the return value to void *, as it is now either struct io_uring_cqe * or struct compat_io_uring_cqe *. That said I'm not convinced this change makes a lot of sense as it stands, because all users of *io_get_cqe* still interpret the return value as struct io_uring_cqe * (which works, as void * is implicitly convertible to any pointer type, not too conveniently here...).
My feeling is that we should really choose one way or the other. Either make such pointers void * everywhere - or maybe better, pointer to a union, to stop them from being implicitly convertible. Or just keep struct io_uring_cqe * everywhere, which is quite common when dealing with compat structs. I don't really have a preference as both approaches have merits, up to you :)
Done! I left it as struct io_uring_cqe *.
{
	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
	unsigned int free, queued, len;
- void *cqe;
	/*
	 * Posting into the CQ when there are pending overflowed CQEs may break
@@ -767,14 +807,15 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
		len <<= 1;
	}
- ctx->cqe_cached = &ctx->cqes[off];
- cqe = ctx->compat ? (void *)&ctx->cqes_compat[off] : (void *)&ctx->cqes[off];
- ctx->cqe_cached = cqe;
Don't we have a problem with the whole cqe_cached / cqe_sentinel logic? It assumes that it is manipulating pointers to struct io_uring_cqe, but in fact these are pointers to struct compat_io_uring_cqe in compat64, aren't they? This appeared in 5.19 [1] so you may not have noticed it when rebasing on 6.1.
[1] https://lore.kernel.org/all/487eeef00f3146537b3d9c1a9cef2fc0b9a86f81.1649771...
Indeed it was broken, thank you for the heads up!
@@ -2265,8 +2300,11 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
	io_submit_state_start(&ctx->submit_state, left);
	do {
-		const struct io_uring_sqe *sqe;
+		const void *sqe;
		struct io_kiocb *req;
+#ifdef CONFIG_COMPAT64
struct io_uring_sqe native_sqe;
It would seem that this will result in a buffer overflow in the IORING_OP_URING_CMD + IORING_SETUP_SQE128 case, since no space is allocated for the extra data. Maybe the best is to allocate an array of two sqes to cover this situation.
Aye, missed this instance, thank you!
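Concretely, I'll replace the single local with a pair so the 128-byte case has room too (sketch):

		/*
		 * Two entries: IORING_SETUP_SQE128 rings carry 128-byte
		 * SQEs (e.g. for IORING_OP_URING_CMD), which would overflow
		 * a single struct io_uring_sqe.
		 */
		struct io_uring_sqe native_sqe[2];

convert_compat_io_uring_sqe() then fills native_sqe[0] and the cmd memcpy can spill into native_sqe[1] without overflowing.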
-static int io_validate_ext_arg(unsigned flags, const void __user *argp, size_t argsz)
+static int io_validate_ext_arg(struct io_ring_ctx *ctx, unsigned int flags,
- const void __user *argp, size_t argsz)
It looks like overflow lines in prototypes are generally aligned on the opening parenthesis in this file, not just indented by a tab.
I was misled by the two functions around it. Fixed!
{
	if (flags & IORING_ENTER_EXT_ARG) {
		struct io_uring_getevents_arg arg;
int ret;
if (argsz != sizeof(arg))
return -EINVAL;
if (copy_from_user(&arg, argp, sizeof(arg)))
return -EFAULT;
ret = copy_io_uring_getevents_arg_from_user(ctx, &arg, argp, argsz);
if (ret)
return ret;
Or simply return copy_io_uring_getevents_arg_from_user(...);
Done!
+#ifdef CONFIG_COMPAT64
+#define BUILD_BUG_COMPAT_SQE_ELEM(eoffset, etype, ename) \
- __BUILD_BUG_VERIFY_OFFSET_SIZE(struct compat_io_uring_sqe, eoffset, sizeof(etype), ename)
+#define BUILD_BUG_COMPAT_SQE_ELEM_SIZE(eoffset, esize, ename) \
- __BUILD_BUG_VERIFY_OFFSET_SIZE(struct compat_io_uring_sqe, eoffset, esize, ename)
- BUILD_BUG_ON(sizeof(struct compat_io_uring_sqe) != 64);
- BUILD_BUG_COMPAT_SQE_ELEM(0, __u8, opcode);
- BUILD_BUG_COMPAT_SQE_ELEM(1, __u8, flags);
- BUILD_BUG_COMPAT_SQE_ELEM(2, __u16, ioprio);
- BUILD_BUG_COMPAT_SQE_ELEM(4, __s32, fd);
- BUILD_BUG_COMPAT_SQE_ELEM(8, __u64, off);
- BUILD_BUG_COMPAT_SQE_ELEM(8, __u64, addr2);
- BUILD_BUG_COMPAT_SQE_ELEM(8, __u32, cmd_op);
- BUILD_BUG_COMPAT_SQE_ELEM(12, __u32, __pad1);
- BUILD_BUG_COMPAT_SQE_ELEM(16, __u64, addr);
- BUILD_BUG_COMPAT_SQE_ELEM(16, __u64, splice_off_in);
- BUILD_BUG_COMPAT_SQE_ELEM(24, __u32, len);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __kernel_rwf_t, rw_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, /* compat */ int, rw_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, /* compat */ __u32, rw_flags);
I'm a bit confused by these lines, as they are removed from the native checks above but preserved here? I don't think "compat" has much to do with compat64 here; in the case of poll_events below, it seems to be about a backwards-compatible change in ABI (see [1]).
I also expected the changes in whitespace above to be reflected here.
[1] https://lore.kernel.org/all/1592387636-80105-2-git-send-email-jiufei.xue@lin...
Done!
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 50bc3af449534..fb2711770bfb0 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
+static inline void convert_compat_io_uring_sqe(struct io_ring_ctx *ctx,
struct io_uring_sqe *sqe,
const struct compat_io_uring_sqe *compat_sqe)
+{
+/*
+ * The struct io_uring_sqe contains anonymous unions and there is no field
+ * keeping track of which union's member is active. Because in all the cases,
+ * the unions are between integral types and the types are compatible, use the
+ * largest member of each union to perform the copy. Use this compile-time check
+ * to ensure that the union's members are not truncated during the conversion.
+ */
+#define BUILD_BUG_COMPAT_SQE_UNION_ELEM(elem1, elem2) \
- BUILD_BUG_ON(sizeof_field(struct compat_io_uring_sqe, elem1) != \
(offsetof(struct compat_io_uring_sqe, elem2) - \
offsetof(struct compat_io_uring_sqe, elem1)))
- sqe->opcode = READ_ONCE(compat_sqe->opcode);
- sqe->flags = READ_ONCE(compat_sqe->flags);
- sqe->ioprio = READ_ONCE(compat_sqe->ioprio);
- sqe->fd = READ_ONCE(compat_sqe->fd);
- BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr2, addr);
- sqe->addr2 = READ_ONCE(compat_sqe->addr2);
- BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr, len);
- sqe->addr = READ_ONCE(compat_sqe->addr);
- sqe->len = READ_ONCE(compat_sqe->len);
- BUILD_BUG_COMPAT_SQE_UNION_ELEM(rw_flags, user_data);
- sqe->rw_flags = READ_ONCE(compat_sqe->rw_flags);
- sqe->user_data = READ_ONCE(compat_sqe->user_data);
- BUILD_BUG_COMPAT_SQE_UNION_ELEM(buf_index, personality);
- sqe->buf_index = READ_ONCE(compat_sqe->buf_index);
- sqe->personality = READ_ONCE(compat_sqe->personality);
- BUILD_BUG_COMPAT_SQE_UNION_ELEM(splice_fd_in, addr3);
- sqe->splice_fd_in = READ_ONCE(compat_sqe->splice_fd_in);
- if (sqe->opcode == IORING_OP_URING_CMD) {
size_t compat_cmd_size = compat_uring_cmd_pdu_size(ctx->flags &
IORING_SETUP_SQE128);
memcpy(sqe->cmd, compat_sqe->cmd, compat_cmd_size);
- } else
sqe->addr3 = READ_ONCE(compat_sqe->addr3);
It would be better to copy __pad2 too, as some net code does check that it is zero.
Done!
+#undef BUILD_BUG_COMPAT_SQE_UNION_ELEM
+}
+#endif /* CONFIG_COMPAT64 */
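For the record, the extra copy is just the following (assuming the compat struct mirrors the native __u64 __pad2[1] member):

	} else {
		sqe->addr3 = READ_ONCE(compat_sqe->addr3);
		sqe->__pad2[0] = READ_ONCE(compat_sqe->__pad2[0]);
	}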
+static inline void *io_get_cqe_overflow(struct io_ring_ctx *ctx,
+					bool overflow)
{
	if (likely(ctx->cqe_cached < ctx->cqe_sentinel)) {
		struct io_uring_cqe *cqe = ctx->cqe_cached;
@@ -109,15 +169,46 @@ static inline struct io_uring_cqe *io_get_cqe_overflow(struct io_ring_ctx *ctx,
	return __io_get_cqe(ctx, overflow);
}
-static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
+static inline void *io_get_cqe(struct io_ring_ctx *ctx)
{
	return io_get_cqe_overflow(ctx, false);
}
+static inline void __io_fill_cqe_any(struct io_ring_ctx *ctx, struct io_uring_cqe *cqe,
Nit: not sure the "any" suffix really helps here, __io_fill_cqe() is probably enough.
Done!
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index e2c46889d5fab..9f840ba93c888 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
+#ifdef CONFIG_COMPAT64
+static void __user *io_ring_buffer_select_compat(struct io_kiocb *req, size_t *len,
struct io_buffer_list *bl,
unsigned int issue_flags)
+{
- struct compat_io_uring_buf_ring *br = bl->buf_ring_compat;
- struct compat_io_uring_buf *buf;
- __u16 head = bl->head;
- if (unlikely(smp_load_acquire(&br->tail) == head))
return NULL;
- head &= bl->mask;
- if (head < IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE) {
buf = &br->bufs[head];
- } else {
int off = head & (IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE - 1);
int index = head / IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE;
buf = page_address(bl->buf_pages[index]);
buf += off;
- }
- if (*len == 0 || *len > buf->len)
*len = buf->len;
- req->flags |= REQ_F_BUFFER_RING;
- req->buf_list = bl;
- req->buf_index = buf->bid;
- if (issue_flags & IO_URING_F_UNLOCKED || !file_can_poll(req->file)) {
req->buf_list = NULL;
bl->head++;
- }
I agree duplicating that function is the only reasonable option here. It looks like this block could be moved to the common function though, reducing the duplication a bit.
Done!
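The split I'm going for looks roughly like this (names approximate): the compat and native helpers only walk their respective ring layout and set *len / req->buf_index, while the shared wrapper does the bookkeeping that used to be duplicated:

static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
					  struct io_buffer_list *bl,
					  unsigned int issue_flags)
{
	void __user *ret;

	/* layout-specific part */
	if (io_in_compat64(req->ctx))
		ret = io_ring_buffer_select_compat(req, len, bl);
	else
		ret = io_ring_buffer_select_native(req, len, bl);
	if (!ret)
		return NULL;

	/* common bookkeeping, previously duplicated in both variants */
	req->flags |= REQ_F_BUFFER_RING;
	req->buf_list = bl;

	if (issue_flags & IO_URING_F_UNLOCKED || !file_can_poll(req->file)) {
		req->buf_list = NULL;
		bl->head++;
	}
	return ret;
}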
@@ -508,19 +596,19 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
		return -ENOMEM;
	}
- pages = io_pin_pages(reg.ring_addr,
struct_size(br, bufs, reg.ring_entries),
AFAICT the new code is not quite equivalent to this line, even in the native case, because struct_size() is equivalent to sizeof(struct) + n * sizeof(flex_array_member). struct io_uring_buf_ring is a bit of a strange union, in the end its size is the size of struct io_uring_buf. IOW I think this line actually yields (n + 1) * sizeof(struct io_uring_buf).
I suspect this is not what was actually intended, and that your change fixes that off-by-one, though I may well be wrong as the way the union is used is not exactly trivial. If that's the case then it should be a separate patch, and it would be a good candidate for upstreaming.
Indeed, it is an off-by-one error. The man page states that "The size of the ring is the product of ring_entries and the size of struct io_uring_buf", so I just did that. The union is there just so that we store the tail of the ring in the same place as the resv field of the first io_uring_buf. ring_entries must be a power of two, otherwise the syscall errs out early, so it can't be 0 and the change should be fine.
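To spell the arithmetic out (sizes per the uAPI structs, with n == reg.ring_entries):

	/*
	 * struct io_uring_buf_ring is a union sized like a single
	 * struct io_uring_buf (16 bytes), so:
	 *
	 *   struct_size(br, bufs, n) == sizeof(*br) + n * sizeof(br->bufs[0])
	 *                            == (n + 1) * 16
	 *
	 * while the documented ring size is n * 16. One way of writing the
	 * intended size (just a sketch of the native side of the fix):
	 */
	pages = io_pin_pages(reg.ring_addr,
			     flex_array_size(br, bufs, reg.ring_entries),
			     &nr_pages);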
diff --git a/io_uring/net.c b/io_uring/net.c
index c586278858e7e..4c133bc6f9d1d 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -4,6 +4,7 @@
#include <linux/file.h>
#include <linux/slab.h>
#include <linux/net.h>
+#include <linux/uio.h>
#include <linux/compat.h>
#include <net/compat.h>
#include <linux/io_uring.h>
@@ -435,7 +436,9 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req,
	} else if (msg.msg_iovlen > 1) {
		return -EINVAL;
	} else {
if (copy_from_user(iomsg->fast_iov, msg.msg_iov, sizeof(*msg.msg_iov)))
void *iov = iovec_from_user(msg.msg_iov, 1, 1, iomsg->fast_iov,
req->ctx->compat);
if (IS_ERR(iov))
This looks correct, but do you have any idea why the compat code looks so different? It does not set fast_iov at all in that case, which makes me wonder if it's really equivalent...
I don't know why one uses copy_from_user and the other uses __get_user. It has been like this from the very beginning and I don't think there is a good reason for it; __get_user would suffice in both cases really. The difference in the setting of fast_iov comes from commit 5476dfed29ad ("io_uring: clean iov usage for recvmsg buf select"), which cleans up only __io_recvmsg_copy_hdr and ignores __io_compat_recvmsg_copy_hdr.
+static int copy_io_uring_rsrc_register_from_user(struct io_ring_ctx *ctx,
struct io_uring_rsrc_register *rr,
const void __user *arg,
size_t size)
+{
+#ifdef CONFIG_COMPAT64
- if (ctx->compat) {
if (size != sizeof(struct compat_io_uring_rsrc_register))
return -EINVAL;
if (get_compat_io_uring_rsrc_register(rr, arg))
return -EFAULT;
return 0;
- }
+#endif
- /* keep it extendible */
Nit: I don't think it's worth keeping that comment, especially considering the existing typo :)
Done!
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 96f77450cf4e2..6ab9916ed3844 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -260,9 +284,17 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
	tctx = current->io_uring;
	for (i = 0; i < nr_args; i++) {
void __user *arg;
__u32 __user *arg_offset;
		int start, end;
		struct io_uring_rsrc_update reg;
		if (copy_from_user(&reg, &arg[i], sizeof(reg))) {
if (ctx->compat)
arg = &((struct compat_io_uring_rsrc_update __user *)__arg)[i];
else
arg = &((struct io_uring_rsrc_update __user *)__arg)[i];
Might be worth adding a helper for that magic, since it somewhat hurts the eyes and it's also used in io_ringfd_unregister().
Done!
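The helper looks more or less like this (name is arbitrary), and both io_ringfd_register() and io_ringfd_unregister() now just do arg = io_ringfd_reg_ptr(ctx, __arg, i);:

static void __user *io_ringfd_reg_ptr(struct io_ring_ctx *ctx,
				      void __user *__arg, unsigned int i)
{
	if (io_in_compat64(ctx))
		return &((struct compat_io_uring_rsrc_update __user *)__arg)[i];

	return &((struct io_uring_rsrc_update __user *)__arg)[i];
}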
		if (copy_io_uring_rsrc_update_from_user(ctx, &reg, arg)) {
			ret = -EFAULT;
			break;
		}
@@ -288,8 +320,10 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
		if (ret < 0)
			break;
reg.offset = ret;
Don't we want to keep this line?
Oops, thank you!
Tudor
Kevin