This series enables purecap applications to use the io_uring subsystem.
With these patches, all io_uring LTP tests pass in both Purecap and compat modes. Note that the LTP tests only exercise the basic functionality of the io_uring subsystem; a significant portion of the multiplexed functionality is not covered by LTP.
I have finished investigating the Purecap liburing tests and examples, and the series has been updated accordingly. I am currently investigating the compat liburing tests, so a further version may follow.
v3:
- Introduce Patch 5, which exposes the compat handling logic for epoll_event. This is then used in io_uring/epoll.c.
- Introduce Patch 6, which makes sure that capability tags are preserved when struct iovec is copied from userspace.
- Fix a few sizeof(var) to sizeof(*var).
- Use iovec_from_user so that the compat handling logic is applied, instead of copying directly from user.
- Add a few missing copy_from_user_with_ptr calls where suitable.
v2:
- Rebase on top of release 6.1.
- Remove the VM_READ_CAPS/VM_LOAD_CAPS patches as they are already merged.
- Update the commit message in PATCH 1.
- Add the generic changes PATCH 2 and PATCH 3 to avoid copying user pointers from/to userspace unnecessarily. These could be upstreamable.
- Split the "pulling the cqes member out" change into PATCH 4.
- The changes for PATCH 5 and 6 are now split into their respective files after the rebase.
- Format and change organization based on the feedback on the previous version, including creating copy_*_from_* helpers for various uAPI structs.
- Add comments related to the handling of the setup flags IORING_SETUP_SQE128 and IORING_SETUP_CQE32.
- Add handling for the new uAPI structs: io_uring_buf, io_uring_buf_ring, io_uring_buf_reg, io_uring_sync_cancel_reg.
Gitlab issue: https://git.morello-project.org/morello/kernel/linux/-/issues/2
Review branch: https://git.morello-project.org/tudcre01/linux/-/commits/morello/io_uring_v3
Tudor Cretu (9):
  compiler_types: Add (u)intcap_t to native_words
  io_uring/rw: Restrict copy to only uiov->len from userspace
  io_uring/tctx: Copy only the offset field back to user
  io_uring: Pull cqes member out from rings struct
  epoll: Expose compat handling logic of epoll_event
  iov_iter: use copy_from_user_with_ptr for struct iovec
  io_uring: Implement compat versions of uAPI structs and handle them
  io_uring: Use user pointer type in the uAPI structs
  io_uring: Allow capability tag access on the shared memory
 fs/eventpoll.c                 |  39 ++--
 include/linux/compiler_types.h |   7 +
 include/linux/eventpoll.h      |   4 +
 include/linux/io_uring_types.h | 160 ++++++++++++++--
 include/uapi/linux/io_uring.h  |  62 ++++---
 io_uring/advise.c              |   2 +-
 io_uring/cancel.c              |  40 +++-
 io_uring/cancel.h              |   2 +-
 io_uring/epoll.c               |   4 +-
 io_uring/fdinfo.c              |  64 ++++++-
 io_uring/fs.c                  |  16 +-
 io_uring/io_uring.c            | 329 +++++++++++++++++++++++++--------
 io_uring/io_uring.h            | 126 ++++++++++---
 io_uring/kbuf.c                | 119 ++++++++++--
 io_uring/kbuf.h                |   8 +-
 io_uring/msg_ring.c            |   4 +-
 io_uring/net.c                 |  25 +--
 io_uring/openclose.c           |   4 +-
 io_uring/poll.c                |   4 +-
 io_uring/rsrc.c                | 150 ++++++++++++---
 io_uring/rw.c                  |  17 +-
 io_uring/statx.c               |   4 +-
 io_uring/tctx.c                |  57 +++++-
 io_uring/timeout.c             |  10 +-
 io_uring/uring_cmd.c           |   5 +
 io_uring/uring_cmd.h           |   7 +
 io_uring/xattr.c               |  12 +-
 lib/iov_iter.c                 |   2 +-
 28 files changed, 1014 insertions(+), 269 deletions(-)
On CHERI architectures, loads and stores of capabilities must be atomic. Add the (u)intcap_t types to the __native_word() check so that capability loads/stores pass the compile-time checks for atomic operations.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
---
 include/linux/compiler_types.h | 7 +++++++
 1 file changed, 7 insertions(+)
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index 365fda8e7e424..19a85e9490ff3 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -328,9 +328,16 @@ struct ftrace_likely_data {
 			default: (x)))
 
 /* Is this type a native word size -- useful for atomic operations */
+#ifdef __CHERI__
+#define __native_word(t) \
+	(sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
+	 sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long) || \
+	 __same_type(t, __intcap_t) || __same_type(t, __uintcap_t))
+#else
 #define __native_word(t) \
 	(sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
 	 sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
+#endif
 
 #ifdef __OPTIMIZE__
 # define __compiletime_assert(condition, msg, prefix, suffix)	\
On 23/02/2023 18:53, Tudor Cretu wrote:
On CHERI architectures, the stores/loads of capabilities should be atomic. Add (u)intcap_t types to the native_words check in order to allow the stores/loads of capabilities to pass the checks for atomic operations.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
include/linux/compiler_types.h | 7 +++++++ 1 file changed, 7 insertions(+)
That's a straightforward improvement to __native_word() that is not directly related to io_uring, so I've picked it on next. Thanks!
Kevin
Only the iov_len member is needed, so restrict the copy from userspace to just that field.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
---
 io_uring/rw.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 1393cdae75854..2edca190450ee 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -55,7 +55,6 @@ static int io_iov_compat_buffer_select_prep(struct io_rw *rw)
 static int io_iov_buffer_select_prep(struct io_kiocb *req)
 {
 	struct iovec __user *uiov;
-	struct iovec iov;
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 
 	if (rw->len != 1)
@@ -67,9 +66,8 @@ static int io_iov_buffer_select_prep(struct io_kiocb *req)
 #endif
 
 	uiov = u64_to_user_ptr(rw->addr);
-	if (copy_from_user(&iov, uiov, sizeof(*uiov)))
+	if (get_user(rw->len, &uiov->iov_len))
 		return -EFAULT;
-	rw->len = iov.iov_len;
 	return 0;
 }
Upon successful return of the io_uring_register system call, the offset field will contain the value of the registered file descriptor to be used for future io_uring_enter system calls. The rest of the struct doesn't need to be copied back to userspace, so restrict the copy_to_user call only to the offset field.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
---
 io_uring/tctx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 4324b1cf1f6af..96f77450cf4e2 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -289,7 +289,7 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 			break;
 
 		reg.offset = ret;
-		if (copy_to_user(&arg[i], &reg, sizeof(reg))) {
+		if (put_user(reg.offset, &arg[i].offset)) {
 			fput(tctx->registered_rings[reg.offset]);
 			tctx->registered_rings[reg.offset] = NULL;
 			ret = -EFAULT;
Pull the cqes member out of the rings struct so that we can have a union between cqes and cqes_compat. This is done in a similar way to commit 75b28affdd6a ("io_uring: allocate the two rings together"), where sq_array was pulled out of the rings struct.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
---
 include/linux/io_uring_types.h | 18 +++++++++--------
 io_uring/fdinfo.c              |  2 +-
 io_uring/io_uring.c            | 35 ++++++++++++++++++++++++----------
 3 files changed, 36 insertions(+), 19 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index df7d4febc38a4..440179029a8f0 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -141,14 +141,6 @@ struct io_rings {
 	 * ordered with any other data.
 	 */
 	u32			cq_overflow;
-	/*
-	 * Ring buffer of completion events.
-	 *
-	 * The kernel writes completion events fresh every time they are
-	 * produced, so the application is allowed to modify pending
-	 * entries.
-	 */
-	struct io_uring_cqe	cqes[] ____cacheline_aligned_in_smp;
 };
 
 struct io_restriction {
@@ -270,7 +262,17 @@ struct io_ring_ctx {
 	struct xarray		personalities;
 	u32			pers_next;
 
+	/* completion data */
 	struct {
+		/*
+		 * Ring buffer of completion events.
+		 *
+		 * The kernel writes completion events fresh every time they are
+		 * produced, so the application is allowed to modify pending
+		 * entries.
+		 */
+		struct io_uring_cqe	*cqes;
+
 		/*
 		 * We cache a range of free CQEs we can use, once exhausted it
 		 * should go through a slower range setup, see __io_get_cqe()
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index 2e04850a657b0..bc8c9d764bc13 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -119,7 +119,7 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
 	cq_entries = min(cq_tail - cq_head, ctx->cq_entries);
 	for (i = 0; i < cq_entries; i++) {
 		unsigned int entry = i + cq_head;
-		struct io_uring_cqe *cqe = &r->cqes[(entry & cq_mask) << cq_shift];
+		struct io_uring_cqe *cqe = &ctx->cqes[(entry & cq_mask) << cq_shift];
 
 		seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x",
 			   entry & cq_mask, cqe->user_data, cqe->res,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index df41a63c642c1..707229ae04dc8 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -743,7 +743,6 @@ bool io_req_cqe_overflow(struct io_kiocb *req)
  */
 struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 {
-	struct io_rings *rings = ctx->rings;
 	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
 	unsigned int free, queued, len;
 
@@ -768,14 +767,14 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 		len <<= 1;
 	}
 
-	ctx->cqe_cached = &rings->cqes[off];
+	ctx->cqe_cached = &ctx->cqes[off];
 	ctx->cqe_sentinel = ctx->cqe_cached + len;
 
 	ctx->cached_cq_tail++;
 	ctx->cqe_cached++;
 	if (ctx->flags & IORING_SETUP_CQE32)
 		ctx->cqe_cached++;
-	return &rings->cqes[off];
+	return &ctx->cqes[off];
 }
 
 bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
@@ -2476,13 +2475,28 @@ static void *io_mem_alloc(size_t size)
 }
 
 static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries,
-				unsigned int cq_entries, size_t *sq_offset)
+				unsigned int cq_entries, size_t *sq_offset,
+				size_t *cq_offset)
 {
 	struct io_rings *rings;
-	size_t off, sq_array_size;
+	size_t off, cq_array_size, sq_array_size;
+
+	off = sizeof(*rings);
+
+#ifdef CONFIG_SMP
+	off = ALIGN(off, SMP_CACHE_BYTES);
+	if (off == 0)
+		return SIZE_MAX;
+#endif
+
+	if (cq_offset)
+		*cq_offset = off;
+
+	cq_array_size = array_size(sizeof(struct io_uring_cqe), cq_entries);
+	if (cq_array_size == SIZE_MAX)
+		return SIZE_MAX;
 
-	off = struct_size(rings, cqes, cq_entries);
-	if (off == SIZE_MAX)
+	if (check_add_overflow(off, cq_array_size, &off))
 		return SIZE_MAX;
 	if (ctx->flags & IORING_SETUP_CQE32) {
 		if (check_shl_overflow(off, 1, &off))
@@ -3314,13 +3328,13 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
 					 struct io_uring_params *p)
 {
 	struct io_rings *rings;
-	size_t size, sq_array_offset;
+	size_t size, cqes_offset, sq_array_offset;
 
 	/* make sure these are sane, as we already accounted them */
 	ctx->sq_entries = p->sq_entries;
 	ctx->cq_entries = p->cq_entries;
 
-	size = rings_size(ctx, p->sq_entries, p->cq_entries, &sq_array_offset);
+	size = rings_size(ctx, p->sq_entries, p->cq_entries, &sq_array_offset, &cqes_offset);
 	if (size == SIZE_MAX)
 		return -EOVERFLOW;
 
@@ -3329,6 +3343,7 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
 		return -ENOMEM;
 
 	ctx->rings = rings;
+	ctx->cqes = (struct io_uring_cqe *)((char *)rings + cqes_offset);
 	ctx->sq_array = (u32 *)((char *)rings + sq_array_offset);
 	rings->sq_ring_mask = p->sq_entries - 1;
 	rings->cq_ring_mask = p->cq_entries - 1;
@@ -3533,7 +3548,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	p->cq_off.ring_mask = offsetof(struct io_rings, cq_ring_mask);
 	p->cq_off.ring_entries = offsetof(struct io_rings, cq_ring_entries);
 	p->cq_off.overflow = offsetof(struct io_rings, cq_overflow);
-	p->cq_off.cqes = offsetof(struct io_rings, cqes);
+	p->cq_off.cqes = (char *)ctx->cqes - (char *)ctx->rings;
 	p->cq_off.flags = offsetof(struct io_rings, cq_flags);
 
 	p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP |
Move the logic that copies an epoll_event from userspace into its own function and expose it in the eventpoll.h header. This allows other subsystems, such as io_uring, to handle epoll_events.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
---
 fs/eventpoll.c            | 39 ++++++++++++++++++++++++-------------
 include/linux/eventpoll.h |  4 ++++
 2 files changed, 30 insertions(+), 13 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 7e33a2781dec8..faa493a279bd0 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2197,6 +2197,28 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 	return error;
 }
 
+static int get_compat_epoll_event(struct epoll_event *epds,
+				  const void __user *user_epds)
+{
+	struct compat_epoll_event compat_epds;
+
+	if (unlikely(copy_from_user(&compat_epds, user_epds, sizeof(compat_epds))))
+		return -EFAULT;
+	epds->events = compat_epds.events;
+	epds->data = (__kernel_uintptr_t)as_user_ptr(compat_epds.data);
+	return 0;
+}
+
+int copy_epoll_event_from_user(struct epoll_event *epds,
+			       const void __user *user_epds,
+			       bool compat)
+{
+	if (compat)
+		return get_compat_epoll_event(epds, user_epds);
+	return copy_from_user_with_ptr(epds, user_epds, sizeof(*epds));
+}
+
+
 /*
  * The following function implements the controller interface for
  * the eventpoll file that enables the insertion/removal/change of
@@ -2211,20 +2233,11 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	struct epoll_event epds;
 
 	if (ep_op_has_event(op)) {
-		if (in_compat_syscall()) {
-			struct compat_epoll_event compat_epds;
-
-			if (copy_from_user(&compat_epds, event,
-					   sizeof(struct compat_epoll_event)))
-				return -EFAULT;
+		int ret;
 
-			epds.events = compat_epds.events;
-			epds.data = (__kernel_uintptr_t)as_user_ptr(compat_epds.data);
-		} else {
-			if (copy_from_user_with_ptr(&epds, event,
-						    sizeof(struct epoll_event)))
-				return -EFAULT;
-		}
+		ret = copy_epoll_event_from_user(&epds, event, in_compat_syscall());
+		if (ret)
+			return -EFAULT;
 	}
 
 	return do_epoll_ctl(epfd, op, fd, &epds, false);
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 457811d82ff20..62b0829354a0e 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -103,4 +103,8 @@ epoll_put_uevent(__poll_t revents, __kernel_uintptr_t data,
 }
 #endif
 
+int copy_epoll_event_from_user(struct epoll_event *epds,
+			       const void __user *user_epds,
+			       bool compat);
+
 #endif /* #ifndef _LINUX_EVENTPOLL_H */
On 23/02/2023 18:53, Tudor Cretu wrote:
Move the logic that copies an epoll_event from user to its own function and expose it in the eventpoll.h header. This allows other subsystems such as io_uring to handle epoll_events.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
fs/eventpoll.c | 39 ++++++++++++++++++++++++++------------- include/linux/eventpoll.h | 4 ++++ 2 files changed, 30 insertions(+), 13 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 7e33a2781dec8..faa493a279bd0 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2197,6 +2197,28 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 	return error;
 }
 
+static int get_compat_epoll_event(struct epoll_event *epds,
+				  const void __user *user_epds)
+{
+	struct compat_epoll_event compat_epds;
+
+	if (unlikely(copy_from_user(&compat_epds, user_epds, sizeof(compat_epds))))
+		return -EFAULT;
+	epds->events = compat_epds.events;
+	epds->data = (__kernel_uintptr_t)as_user_ptr(compat_epds.data);
+	return 0;
+}
+
+int copy_epoll_event_from_user(struct epoll_event *epds,
+			       const void __user *user_epds,
+			       bool compat)
+{
+	if (compat)
+		return get_compat_epoll_event(epds, user_epds);
+	return copy_from_user_with_ptr(epds, user_epds, sizeof(*epds));
+}
+
+
Nit: just one empty line is enough.
Kevin
struct iovec contains a user pointer, so use the correct copy routine.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
---
 lib/iov_iter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f8b88e515a03f..2d74d8d00ad94 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1829,7 +1829,7 @@ static int copy_iovec_from_user(struct iovec *iov,
 {
 	unsigned long seg;
 
-	if (copy_from_user(iov, uvec, nr_segs * sizeof(*uvec)))
+	if (copy_from_user_with_ptr(iov, uvec, nr_segs * sizeof(*uvec)))
 		return -EFAULT;
 	for (seg = 0; seg < nr_segs; seg++) {
 		if ((ssize_t)iov[seg].iov_len < 0)
On 23/02/2023 18:53, Tudor Cretu wrote:
struct iovec contains a user pointer, so use the correct copy routine.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
lib/iov_iter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
Like patch 1, that's a straightforward fix for copy_iovec_from_user(), so I've applied it on next.
Kevin
Introduce compat versions of the structs exposed in the uAPI headers that might contain pointers as a member. Also, implement functions that convert the compat versions to the native versions of the struct.
A subsequent patch is going to change the io_uring structs to enable them to support new architectures. On such architectures, the current struct layout still needs to be supported for compat tasks.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
---
 include/linux/io_uring_types.h | 140 ++++++++++++++++++-
 io_uring/cancel.c              |  38 ++++-
 io_uring/epoll.c               |   2 +-
 io_uring/fdinfo.c              |  55 +++++++-
 io_uring/io_uring.c            | 248 +++++++++++++++++++++++----------
 io_uring/io_uring.h            | 122 +++++++++++++---
 io_uring/kbuf.c                | 108 ++++++++++++--
 io_uring/kbuf.h                |   6 +-
 io_uring/net.c                 |   5 +-
 io_uring/rsrc.c                | 125 +++++++++++++++--
 io_uring/tctx.c                |  57 ++++++--
 io_uring/uring_cmd.h           |   7 +
 12 files changed, 782 insertions(+), 131 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 440179029a8f0..737bc1aa67306 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -7,6 +7,130 @@
 #include <linux/llist.h>
 #include <uapi/linux/io_uring.h>
 
+struct compat_io_uring_sqe {
+	__u8	opcode;
+	__u8	flags;
+	__u16	ioprio;
+	__s32	fd;
+	union {
+		__u64	off;
+		__u64	addr2;
+		struct {
+			__u32	cmd_op;
+			__u32	__pad1;
+		};
+	};
+	union {
+		__u64	addr;
+		__u64	splice_off_in;
+	};
+	__u32	len;
+	union {
+		__kernel_rwf_t	rw_flags;
+		__u32		fsync_flags;
+		__u16		poll_events;
+		__u32		poll32_events;
+		__u32		sync_range_flags;
+		__u32		msg_flags;
+		__u32		timeout_flags;
+		__u32		accept_flags;
+		__u32		cancel_flags;
+		__u32		open_flags;
+		__u32		statx_flags;
+		__u32		fadvise_advice;
+		__u32		splice_flags;
+		__u32		rename_flags;
+		__u32		unlink_flags;
+		__u32		hardlink_flags;
+		__u32		xattr_flags;
+		__u32		msg_ring_flags;
+		__u32		uring_cmd_flags;
+	};
+	__u64	user_data;
+	union {
+		__u16	buf_index;
+		__u16	buf_group;
+	} __packed;
+	__u16	personality;
+	union {
+		__s32	splice_fd_in;
+		__u32	file_index;
+		struct {
+			__u16	addr_len;
+			__u16	__pad3[1];
+		};
+	};
+	union {
+		struct {
+			__u64	addr3;
+			__u64	__pad2[1];
+		};
+		__u8	cmd[0];
+	};
+};
+
+struct compat_io_uring_cqe {
+	__u64	user_data;
+	__s32	res;
+	__u32	flags;
+	__u64	big_cqe[];
+};
+
+struct compat_io_uring_files_update {
+	__u32	offset;
+	__u32	resv;
+	__aligned_u64	fds;
+};
+
+struct compat_io_uring_rsrc_register {
+	__u32	nr;
+	__u32	flags;
+	__u64	resv2;
+	__aligned_u64	data;
+	__aligned_u64	tags;
+};
+
+struct compat_io_uring_rsrc_update {
+	__u32	offset;
+	__u32	resv;
+	__aligned_u64	data;
+};
+
+struct compat_io_uring_rsrc_update2 {
+	__u32	offset;
+	__u32	resv;
+	__aligned_u64	data;
+	__aligned_u64	tags;
+	__u32	nr;
+	__u32	resv2;
+};
+
+struct compat_io_uring_buf {
+	__u64	addr;
+	__u32	len;
+	__u16	bid;
+	__u16	resv;
+};
+
+struct compat_io_uring_buf_ring {
+	union {
+		struct {
+			__u64	resv1;
+			__u32	resv2;
+			__u16	resv3;
+			__u16	tail;
+		};
+		struct compat_io_uring_buf	bufs[0];
+	};
+};
+
+struct compat_io_uring_getevents_arg {
+	__u64	sigmask;
+	__u32	sigmask_sz;
+	__u32	pad;
+	__u64	ts;
+};
+
 struct io_wq_work_node {
 	struct io_wq_work_node *next;
 };
@@ -216,7 +340,11 @@ struct io_ring_ctx {
 	 * array.
 	 */
 	u32			*sq_array;
-	struct io_uring_sqe	*sq_sqes;
+
+	union {
+		struct compat_io_uring_sqe	*sq_sqes_compat;
+		struct io_uring_sqe		*sq_sqes;
+	};
 	unsigned		cached_sq_head;
 	unsigned		sq_entries;
@@ -271,7 +399,10 @@ struct io_ring_ctx {
 	 * produced, so the application is allowed to modify pending
 	 * entries.
 	 */
-	struct io_uring_cqe	*cqes;
+	union {
+		struct compat_io_uring_cqe	*cqes_compat;
+		struct io_uring_cqe		*cqes;
+	};
 
 	/*
 	 * We cache a range of free CQEs we can use, once exhausted it
@@ -581,7 +712,10 @@ struct io_kiocb {
 
 struct io_overflow_cqe {
 	struct list_head	list;
-	struct io_uring_cqe	cqe;
+	union {
+		struct compat_io_uring_cqe	compat_cqe;
+		struct io_uring_cqe		cqe;
+	};
 };
#endif diff --git a/io_uring/cancel.c b/io_uring/cancel.c index 2291a53cdabd1..20cb8634e44af 100644 --- a/io_uring/cancel.c +++ b/io_uring/cancel.c @@ -27,6 +27,42 @@ struct io_cancel { #define CANCEL_FLAGS (IORING_ASYNC_CANCEL_ALL | IORING_ASYNC_CANCEL_FD | \ IORING_ASYNC_CANCEL_ANY | IORING_ASYNC_CANCEL_FD_FIXED)
+struct compat_io_uring_sync_cancel_reg { + __u64 addr; + __s32 fd; + __u32 flags; + struct __kernel_timespec timeout; + __u64 pad[4]; +}; + +#ifdef CONFIG_COMPAT64 +static int get_compat_io_uring_sync_cancel_reg(struct io_uring_sync_cancel_reg *sc, + const void __user *user_sc) +{ + struct compat_io_uring_sync_cancel_reg compat_sc; + + if (unlikely(copy_from_user(&compat_sc, user_sc, sizeof(compat_sc)))) + return -EFAULT; + sc->addr = compat_sc.addr; + sc->fd = compat_sc.fd; + sc->flags = compat_sc.flags; + sc->timeout = compat_sc.timeout; + memcpy(sc->pad, compat_sc.pad, sizeof(sc->pad)); + return 0; +} +#endif + +static int copy_io_uring_sync_cancel_reg_from_user(struct io_ring_ctx *ctx, + struct io_uring_sync_cancel_reg *sc, + const void __user *arg) +{ +#ifdef CONFIG_COMPAT64 + if (ctx->compat) + return get_compat_io_uring_sync_cancel_reg(sc, arg); +#endif + return copy_from_user(sc, arg, sizeof(*sc)); +} + static bool io_cancel_cb(struct io_wq_work *work, void *data) { struct io_kiocb *req = container_of(work, struct io_kiocb, work); @@ -243,7 +279,7 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg) DEFINE_WAIT(wait); int ret;
- if (copy_from_user(&sc, arg, sizeof(sc))) + if (copy_io_uring_sync_cancel_reg_from_user(ctx, &sc, arg)) return -EFAULT; if (sc.flags & ~CANCEL_FLAGS) return -EINVAL; diff --git a/io_uring/epoll.c b/io_uring/epoll.c index 9aa74d2c80bc4..d5580ff465c3e 100644 --- a/io_uring/epoll.c +++ b/io_uring/epoll.c @@ -40,7 +40,7 @@ int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) struct epoll_event __user *ev;
ev = u64_to_user_ptr(READ_ONCE(sqe->addr)); - if (copy_from_user(&epoll->event, ev, sizeof(*ev))) + if (copy_epoll_event_from_user(&epoll->event, ev, req->ctx->compat)) return -EFAULT; }
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c index bc8c9d764bc13..c5bd669081c98 100644 --- a/io_uring/fdinfo.c +++ b/io_uring/fdinfo.c @@ -89,12 +89,25 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, for (i = 0; i < sq_entries; i++) { unsigned int entry = i + sq_head; struct io_uring_sqe *sqe; - unsigned int sq_idx; + unsigned int sq_idx, sq_off; +#ifdef CONFIG_COMPAT64 + struct io_uring_sqe *native_sqe = NULL; +#endif
sq_idx = READ_ONCE(ctx->sq_array[entry & sq_mask]); if (sq_idx > sq_mask) continue; - sqe = &ctx->sq_sqes[sq_idx << sq_shift]; + sq_off = sq_idx << sq_shift; + sqe = ctx->compat ? (void *)&ctx->sq_sqes_compat[sq_off] : &ctx->sq_sqes[sq_off]; +#ifdef CONFIG_COMPAT64 + if (ctx->compat) { + native_sqe = kmalloc(sizeof(struct io_uring_sqe) << sq_shift, GFP_KERNEL); + convert_compat_io_uring_sqe(ctx, native_sqe, + (struct compat_io_uring_sqe *)sqe); + sqe = native_sqe; + } +#endif + seq_printf(m, "%5u: opcode:%s, fd:%d, flags:%x, off:%llu, " "addr:0x%llx, rw_flags:0x%x, buf_index:%d " "user_data:%llu", @@ -104,7 +117,8 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, sqe->buf_index, sqe->user_data); if (sq_shift) { u64 *sqeb = (void *) (sqe + 1); - int size = sizeof(struct io_uring_sqe) / sizeof(u64); + int size = (ctx->compat ? sizeof(struct compat_io_uring_sqe) + : sizeof(struct io_uring_sqe)) / sizeof(u64); int j;
for (j = 0; j < size; j++) { @@ -114,12 +128,29 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, } } seq_printf(m, "\n"); +#ifdef CONFIG_COMPAT64 + kfree(native_sqe); +#endif } seq_printf(m, "CQEs:\t%u\n", cq_tail - cq_head); cq_entries = min(cq_tail - cq_head, ctx->cq_entries); for (i = 0; i < cq_entries; i++) { unsigned int entry = i + cq_head; - struct io_uring_cqe *cqe = &ctx->cqes[(entry & cq_mask) << cq_shift]; + struct io_uring_cqe *cqe; + unsigned int cq_off = (entry & cq_mask) << cq_shift; +#ifdef CONFIG_COMPAT64 + struct io_uring_cqe *native_cqe = NULL; +#endif + + cqe = ctx->compat ? (void *)&ctx->cqes_compat[cq_off] : &ctx->cqes[cq_off]; +#ifdef CONFIG_COMPAT64 + if (ctx->compat) { + native_cqe = kmalloc(sizeof(struct io_uring_cqe) << cq_shift, GFP_KERNEL); + convert_compat_io_uring_cqe(ctx, native_cqe, + (struct compat_io_uring_cqe *)cqe); + cqe = native_cqe; + } +#endif
seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x", entry & cq_mask, cqe->user_data, cqe->res, @@ -128,6 +159,9 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, seq_printf(m, ", extra1:%llu, extra2:%llu\n", cqe->big_cqe[0], cqe->big_cqe[1]); seq_printf(m, "\n"); +#ifdef CONFIG_COMPAT64 + kfree(native_cqe); +#endif }
/* @@ -189,10 +223,23 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, spin_lock(&ctx->completion_lock); list_for_each_entry(ocqe, &ctx->cq_overflow_list, list) { struct io_uring_cqe *cqe = &ocqe->cqe; +#ifdef CONFIG_COMPAT64 + struct io_uring_cqe *native_cqe = NULL; + + if (ctx->compat) { + native_cqe = kmalloc(sizeof(struct io_uring_cqe) << cq_shift, GFP_KERNEL); + convert_compat_io_uring_cqe(ctx, native_cqe, + (struct compat_io_uring_cqe *)cqe); + cqe = native_cqe; + } +#endif
seq_printf(m, " user_data=%llu, res=%d, flags=%x\n", cqe->user_data, cqe->res, cqe->flags);
+#ifdef CONFIG_COMPAT64 + kfree(native_cqe); +#endif }
spin_unlock(&ctx->completion_lock); diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 707229ae04dc8..03506776af043 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -152,6 +152,43 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
static struct kmem_cache *req_cachep;
+#ifdef CONFIG_COMPAT64 +static int get_compat_io_uring_getevents_arg(struct io_uring_getevents_arg *arg, + const void __user *user_arg) +{ + struct compat_io_uring_getevents_arg compat_arg; + + if (unlikely(copy_from_user(&compat_arg, user_arg, sizeof(compat_arg)))) + return -EFAULT; + arg->sigmask = (__kernel_uintptr_t)compat_ptr(compat_arg.sigmask); + arg->sigmask_sz = compat_arg.sigmask_sz; + arg->pad = compat_arg.pad; + arg->ts = (__kernel_uintptr_t)compat_ptr(compat_arg.ts); + return 0; +} +#endif /* CONFIG_COMPAT64 */ + +static int copy_io_uring_getevents_arg_from_user(struct io_ring_ctx *ctx, + struct io_uring_getevents_arg *arg, + const void __user *argp, + size_t size) +{ +#ifdef CONFIG_COMPAT64 + if (ctx->compat) { + if (size != sizeof(struct compat_io_uring_getevents_arg)) + return -EINVAL; + if (get_compat_io_uring_getevents_arg(arg, argp)) + return -EFAULT; + return 0; + } +#endif + if (size != sizeof(*arg)) + return -EINVAL; + if (copy_from_user(arg, argp, sizeof(*arg))) + return -EFAULT; + return 0; +} + struct sock *io_uring_get_socket(struct file *file) { #if defined(CONFIG_UNIX) @@ -604,7 +641,9 @@ void io_cq_unlock_post(struct io_ring_ctx *ctx) static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) { bool all_flushed; - size_t cqe_size = sizeof(struct io_uring_cqe); + size_t cqe_size = ctx->compat ? + sizeof(struct compat_io_uring_cqe) : + sizeof(struct io_uring_cqe);
if (!force && __io_cqring_events(ctx) == ctx->cq_entries) return false; @@ -741,10 +780,11 @@ bool io_req_cqe_overflow(struct io_kiocb *req) * control dependency is enough as we're using WRITE_ONCE to * fill the cq entry */ -struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow) +void *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow) { unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1); unsigned int free, queued, len; + void *cqe;
/* * Posting into the CQ when there are pending overflowed CQEs may break @@ -767,14 +807,15 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow) len <<= 1; }
- ctx->cqe_cached = &ctx->cqes[off]; + cqe = ctx->compat ? (void *)&ctx->cqes_compat[off] : (void *)&ctx->cqes[off]; + ctx->cqe_cached = cqe; ctx->cqe_sentinel = ctx->cqe_cached + len;
ctx->cached_cq_tail++; ctx->cqe_cached++; if (ctx->flags & IORING_SETUP_CQE32) ctx->cqe_cached++; - return &ctx->cqes[off]; + return cqe; }
bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags, @@ -793,14 +834,7 @@ bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags if (likely(cqe)) { trace_io_uring_complete(ctx, NULL, user_data, res, cflags, 0, 0);
- WRITE_ONCE(cqe->user_data, user_data); - WRITE_ONCE(cqe->res, res); - WRITE_ONCE(cqe->flags, cflags); - - if (ctx->flags & IORING_SETUP_CQE32) { - WRITE_ONCE(cqe->big_cqe[0], 0); - WRITE_ONCE(cqe->big_cqe[1], 0); - } + __io_fill_cqe_any(ctx, cqe, user_data, res, cflags, 0, 0); return true; }
@@ -2222,7 +2256,7 @@ static void io_commit_sqring(struct io_ring_ctx *ctx) * used, it's important that those reads are done through READ_ONCE() to * prevent a re-load down the line. */ -static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx) +static const void *io_get_sqe(struct io_ring_ctx *ctx) { unsigned head, mask = ctx->sq_entries - 1; unsigned sq_idx = ctx->cached_sq_head++ & mask; @@ -2240,7 +2274,8 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx) /* double index for 128-byte SQEs, twice as long */ if (ctx->flags & IORING_SETUP_SQE128) head <<= 1; - return &ctx->sq_sqes[head]; + return ctx->compat ? (void *)&ctx->sq_sqes_compat[head] + : (void *)&ctx->sq_sqes[head]; }
 /* drop invalid entries */
@@ -2265,8 +2300,11 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 io_submit_state_start(&ctx->submit_state, left);
 do {
- const struct io_uring_sqe *sqe;
+ const void *sqe;
 struct io_kiocb *req;
+#ifdef CONFIG_COMPAT64
+ struct io_uring_sqe native_sqe;
+#endif
 if (unlikely(!io_alloc_req_refill(ctx)))
 break;
@@ -2276,6 +2314,12 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 io_req_add_to_cache(req, ctx);
 break;
 }
+#ifdef CONFIG_COMPAT64
+ if (ctx->compat) {
+ convert_compat_io_uring_sqe(ctx, &native_sqe, sqe);
+ sqe = &native_sqe;
+ }
+#endif
 /*
 * Continue submitting even for sqe failure if the
@@ -2480,6 +2524,9 @@ static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries
 {
 struct io_rings *rings;
 size_t off, cq_array_size, sq_array_size;
+ size_t cqe_size = ctx->compat ?
+ sizeof(struct compat_io_uring_cqe) :
+ sizeof(struct io_uring_cqe);
off = sizeof(*rings);
@@ -2492,7 +2539,7 @@ static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries
 if (cq_offset)
 *cq_offset = off;
- cq_array_size = array_size(sizeof(struct io_uring_cqe), cq_entries);
+ cq_array_size = array_size(cqe_size, cq_entries);
 if (cq_array_size == SIZE_MAX)
 return SIZE_MAX;
@@ -3120,20 +3167,22 @@ static unsigned long io_uring_nommu_get_unmapped_area(struct file *file,
#endif /* !CONFIG_MMU */
-static int io_validate_ext_arg(unsigned flags, const void __user *argp, size_t argsz)
+static int io_validate_ext_arg(struct io_ring_ctx *ctx, unsigned int flags,
+ const void __user *argp, size_t argsz)
 {
 if (flags & IORING_ENTER_EXT_ARG) {
 struct io_uring_getevents_arg arg;
+ int ret;
- if (argsz != sizeof(arg))
- return -EINVAL;
- if (copy_from_user(&arg, argp, sizeof(arg)))
- return -EFAULT;
+ ret = copy_io_uring_getevents_arg_from_user(ctx, &arg, argp, argsz);
+ if (ret)
+ return ret;
 }
 return 0;
 }
-static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz,
+static int io_get_ext_arg(struct io_ring_ctx *ctx, unsigned int flags,
+ const void __user *argp, size_t *argsz,
 #ifdef CONFIG_CHERI_PURECAP_UABI
 struct __kernel_timespec * __capability *ts,
 const sigset_t * __capability *sig)
@@ -3143,6 +3192,7 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz
 #endif
 {
 struct io_uring_getevents_arg arg;
+ int ret;
/* * If EXT_ARG isn't set, then we have no timespec and the argp pointer @@ -3158,10 +3208,9 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz * EXT_ARG is set - ensure we agree on the size of it and copy in our * timespec and sigset_t pointers if good. */ - if (*argsz != sizeof(arg)) - return -EINVAL; - if (copy_from_user(&arg, argp, sizeof(arg))) - return -EFAULT; + ret = copy_io_uring_getevents_arg_from_user(ctx, &arg, argp, *argsz); + if (ret) + return ret; if (arg.pad) return -EINVAL; *sig = u64_to_user_ptr(arg.sigmask); @@ -3268,7 +3317,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, */ mutex_lock(&ctx->uring_lock); iopoll_locked: - ret2 = io_validate_ext_arg(flags, argp, argsz); + ret2 = io_validate_ext_arg(ctx, flags, argp, argsz); if (likely(!ret2)) { min_complete = min(min_complete, ctx->cq_entries); @@ -3279,7 +3328,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, const sigset_t __user *sig; struct __kernel_timespec __user *ts;
- ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig);
+ ret2 = io_get_ext_arg(ctx, flags, argp, &argsz, &ts, &sig);
 if (likely(!ret2)) {
 min_complete = min(min_complete, ctx->cq_entries);
@@ -3329,6 +3378,9 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
 {
 struct io_rings *rings;
 size_t size, cqes_offset, sq_array_offset;
+ size_t sqe_size = ctx->compat ?
+ sizeof(struct compat_io_uring_sqe) :
+ sizeof(struct io_uring_sqe);
 /* make sure these are sane, as we already accounted them */
 ctx->sq_entries = p->sq_entries;
@@ -3351,9 +3403,9 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
 rings->cq_ring_entries = p->cq_entries;
if (p->flags & IORING_SETUP_SQE128) - size = array_size(2 * sizeof(struct io_uring_sqe), p->sq_entries); + size = array_size(2 * sqe_size, p->sq_entries); else - size = array_size(sizeof(struct io_uring_sqe), p->sq_entries); + size = array_size(sqe_size, p->sq_entries); if (size == SIZE_MAX) { io_mem_free(ctx->rings); ctx->rings = NULL; @@ -4107,48 +4159,45 @@ static int __init io_uring_init(void) #define BUILD_BUG_SQE_ELEM_SIZE(eoffset, esize, ename) \ __BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, esize, ename) BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 64); - BUILD_BUG_SQE_ELEM(0, __u8, opcode); - BUILD_BUG_SQE_ELEM(1, __u8, flags); - BUILD_BUG_SQE_ELEM(2, __u16, ioprio); - BUILD_BUG_SQE_ELEM(4, __s32, fd); - BUILD_BUG_SQE_ELEM(8, __u64, off); - BUILD_BUG_SQE_ELEM(8, __u64, addr2); - BUILD_BUG_SQE_ELEM(8, __u32, cmd_op); + BUILD_BUG_SQE_ELEM(0, __u8, opcode); + BUILD_BUG_SQE_ELEM(1, __u8, flags); + BUILD_BUG_SQE_ELEM(2, __u16, ioprio); + BUILD_BUG_SQE_ELEM(4, __s32, fd); + BUILD_BUG_SQE_ELEM(8, __u64, off); + BUILD_BUG_SQE_ELEM(8, __u64, addr2); + BUILD_BUG_SQE_ELEM(8, __u32, cmd_op); BUILD_BUG_SQE_ELEM(12, __u32, __pad1); - BUILD_BUG_SQE_ELEM(16, __u64, addr); - BUILD_BUG_SQE_ELEM(16, __u64, splice_off_in); - BUILD_BUG_SQE_ELEM(24, __u32, len); - BUILD_BUG_SQE_ELEM(28, __kernel_rwf_t, rw_flags); - BUILD_BUG_SQE_ELEM(28, /* compat */ int, rw_flags); - BUILD_BUG_SQE_ELEM(28, /* compat */ __u32, rw_flags); - BUILD_BUG_SQE_ELEM(28, __u32, fsync_flags); - BUILD_BUG_SQE_ELEM(28, /* compat */ __u16, poll_events); - BUILD_BUG_SQE_ELEM(28, __u32, poll32_events); - BUILD_BUG_SQE_ELEM(28, __u32, sync_range_flags); - BUILD_BUG_SQE_ELEM(28, __u32, msg_flags); - BUILD_BUG_SQE_ELEM(28, __u32, timeout_flags); - BUILD_BUG_SQE_ELEM(28, __u32, accept_flags); - BUILD_BUG_SQE_ELEM(28, __u32, cancel_flags); - BUILD_BUG_SQE_ELEM(28, __u32, open_flags); - BUILD_BUG_SQE_ELEM(28, __u32, statx_flags); - BUILD_BUG_SQE_ELEM(28, __u32, fadvise_advice); - 
BUILD_BUG_SQE_ELEM(28, __u32, splice_flags); - BUILD_BUG_SQE_ELEM(28, __u32, rename_flags); - BUILD_BUG_SQE_ELEM(28, __u32, unlink_flags); - BUILD_BUG_SQE_ELEM(28, __u32, hardlink_flags); - BUILD_BUG_SQE_ELEM(28, __u32, xattr_flags); - BUILD_BUG_SQE_ELEM(28, __u32, msg_ring_flags); - BUILD_BUG_SQE_ELEM(32, __u64, user_data); - BUILD_BUG_SQE_ELEM(40, __u16, buf_index); - BUILD_BUG_SQE_ELEM(40, __u16, buf_group); - BUILD_BUG_SQE_ELEM(42, __u16, personality); - BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in); - BUILD_BUG_SQE_ELEM(44, __u32, file_index); - BUILD_BUG_SQE_ELEM(44, __u16, addr_len); - BUILD_BUG_SQE_ELEM(46, __u16, __pad3[0]); - BUILD_BUG_SQE_ELEM(48, __u64, addr3); + BUILD_BUG_SQE_ELEM(16, __u64, addr); + BUILD_BUG_SQE_ELEM(16, __u64, splice_off_in); + BUILD_BUG_SQE_ELEM(24, __u32, len); + BUILD_BUG_SQE_ELEM(28, __kernel_rwf_t, rw_flags); + BUILD_BUG_SQE_ELEM(28, __u32, fsync_flags); + BUILD_BUG_SQE_ELEM(28, __u32, poll32_events); + BUILD_BUG_SQE_ELEM(28, __u32, sync_range_flags); + BUILD_BUG_SQE_ELEM(28, __u32, msg_flags); + BUILD_BUG_SQE_ELEM(28, __u32, timeout_flags); + BUILD_BUG_SQE_ELEM(28, __u32, accept_flags); + BUILD_BUG_SQE_ELEM(28, __u32, cancel_flags); + BUILD_BUG_SQE_ELEM(28, __u32, open_flags); + BUILD_BUG_SQE_ELEM(28, __u32, statx_flags); + BUILD_BUG_SQE_ELEM(28, __u32, fadvise_advice); + BUILD_BUG_SQE_ELEM(28, __u32, splice_flags); + BUILD_BUG_SQE_ELEM(28, __u32, rename_flags); + BUILD_BUG_SQE_ELEM(28, __u32, unlink_flags); + BUILD_BUG_SQE_ELEM(28, __u32, hardlink_flags); + BUILD_BUG_SQE_ELEM(28, __u32, xattr_flags); + BUILD_BUG_SQE_ELEM(28, __u32, msg_ring_flags); + BUILD_BUG_SQE_ELEM(32, __u64, user_data); + BUILD_BUG_SQE_ELEM(40, __u16, buf_index); + BUILD_BUG_SQE_ELEM(40, __u16, buf_group); + BUILD_BUG_SQE_ELEM(42, __u16, personality); + BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in); + BUILD_BUG_SQE_ELEM(44, __u32, file_index); + BUILD_BUG_SQE_ELEM(44, __u16, addr_len); + BUILD_BUG_SQE_ELEM(46, __u16, __pad3[0]); + BUILD_BUG_SQE_ELEM(48, 
__u64, addr3); BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd); - BUILD_BUG_SQE_ELEM(56, __u64, __pad2); + BUILD_BUG_SQE_ELEM(56, __u64, __pad2);
BUILD_BUG_ON(sizeof(struct io_uring_files_update) != sizeof(struct io_uring_rsrc_update)); @@ -4160,6 +4209,65 @@ static int __init io_uring_init(void) BUILD_BUG_ON(offsetof(struct io_uring_buf, resv) != offsetof(struct io_uring_buf_ring, tail));
+#ifdef CONFIG_COMPAT64 +#define BUILD_BUG_COMPAT_SQE_ELEM(eoffset, etype, ename) \ + __BUILD_BUG_VERIFY_OFFSET_SIZE(struct compat_io_uring_sqe, eoffset, sizeof(etype), ename) +#define BUILD_BUG_COMPAT_SQE_ELEM_SIZE(eoffset, esize, ename) \ + __BUILD_BUG_VERIFY_OFFSET_SIZE(struct compat_io_uring_sqe, eoffset, esize, ename) + BUILD_BUG_ON(sizeof(struct compat_io_uring_sqe) != 64); + BUILD_BUG_COMPAT_SQE_ELEM(0, __u8, opcode); + BUILD_BUG_COMPAT_SQE_ELEM(1, __u8, flags); + BUILD_BUG_COMPAT_SQE_ELEM(2, __u16, ioprio); + BUILD_BUG_COMPAT_SQE_ELEM(4, __s32, fd); + BUILD_BUG_COMPAT_SQE_ELEM(8, __u64, off); + BUILD_BUG_COMPAT_SQE_ELEM(8, __u64, addr2); + BUILD_BUG_COMPAT_SQE_ELEM(8, __u32, cmd_op); + BUILD_BUG_COMPAT_SQE_ELEM(12, __u32, __pad1); + BUILD_BUG_COMPAT_SQE_ELEM(16, __u64, addr); + BUILD_BUG_COMPAT_SQE_ELEM(16, __u64, splice_off_in); + BUILD_BUG_COMPAT_SQE_ELEM(24, __u32, len); + BUILD_BUG_COMPAT_SQE_ELEM(28, __kernel_rwf_t, rw_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, /* compat */ int, rw_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, /* compat */ __u32, rw_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, fsync_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, /* compat */ __u16, poll_events); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, poll32_events); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, sync_range_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, msg_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, timeout_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, accept_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, cancel_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, open_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, statx_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, fadvise_advice); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, splice_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, rename_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, unlink_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, hardlink_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, xattr_flags); + BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, 
msg_ring_flags); + BUILD_BUG_COMPAT_SQE_ELEM(32, __u64, user_data); + BUILD_BUG_COMPAT_SQE_ELEM(40, __u16, buf_index); + BUILD_BUG_COMPAT_SQE_ELEM(40, __u16, buf_group); + BUILD_BUG_COMPAT_SQE_ELEM(42, __u16, personality); + BUILD_BUG_COMPAT_SQE_ELEM(44, __s32, splice_fd_in); + BUILD_BUG_COMPAT_SQE_ELEM(44, __u32, file_index); + BUILD_BUG_COMPAT_SQE_ELEM(44, __u16, addr_len); + BUILD_BUG_COMPAT_SQE_ELEM(46, __u16, __pad3[0]); + BUILD_BUG_COMPAT_SQE_ELEM(48, __u64, addr3); + BUILD_BUG_COMPAT_SQE_ELEM_SIZE(48, 0, cmd); + BUILD_BUG_COMPAT_SQE_ELEM(56, __u64, __pad2); + + BUILD_BUG_ON(sizeof(struct compat_io_uring_files_update) != + sizeof(struct compat_io_uring_rsrc_update)); + BUILD_BUG_ON(sizeof(struct compat_io_uring_rsrc_update) > + sizeof(struct compat_io_uring_rsrc_update2)); + + BUILD_BUG_ON(offsetof(struct compat_io_uring_buf_ring, bufs) != 0); + BUILD_BUG_ON(offsetof(struct compat_io_uring_buf, resv) != + offsetof(struct compat_io_uring_buf_ring, tail)); +#endif /* CONFIG_COMPAT64 */ + /* should fit into one byte */ BUILD_BUG_ON(SQE_VALID_FLAGS >= (1 << 8)); BUILD_BUG_ON(SQE_COMMON_FLAGS >= (1 << 8)); diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index 50bc3af449534..fb2711770bfb0 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -5,6 +5,7 @@ #include <linux/lockdep.h> #include <linux/io_uring_types.h> #include "io-wq.h" +#include "uring_cmd.h" #include "slist.h" #include "filetable.h"
@@ -24,7 +25,7 @@ enum {
 IOU_STOP_MULTISHOT = -ECANCELED,
 };
-struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow);
+void *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow);
 bool io_req_cqe_overflow(struct io_kiocb *req);
 int io_run_task_work_sig(struct io_ring_ctx *ctx);
 int __io_run_local_work(struct io_ring_ctx *ctx, bool *locked);
@@ -93,8 +94,67 @@ static inline void io_cq_lock(struct io_ring_ctx *ctx)
void io_cq_unlock_post(struct io_ring_ctx *ctx);
-static inline struct io_uring_cqe *io_get_cqe_overflow(struct io_ring_ctx *ctx, - bool overflow) +#ifdef CONFIG_COMPAT64 +static inline void convert_compat_io_uring_cqe(struct io_ring_ctx *ctx, + struct io_uring_cqe *cqe, + const struct compat_io_uring_cqe *compat_cqe) +{ + cqe->user_data = READ_ONCE(compat_cqe->user_data); + cqe->res = READ_ONCE(compat_cqe->res); + cqe->flags = READ_ONCE(compat_cqe->flags); + + if (ctx->flags & IORING_SETUP_CQE32) { + cqe->big_cqe[0] = READ_ONCE(compat_cqe->big_cqe[0]); + cqe->big_cqe[1] = READ_ONCE(compat_cqe->big_cqe[1]); + } +} + +static inline void convert_compat_io_uring_sqe(struct io_ring_ctx *ctx, + struct io_uring_sqe *sqe, + const struct compat_io_uring_sqe *compat_sqe) +{ +/* + * The struct io_uring_sqe contains anonymous unions and there is no field + * keeping track of which union's member is active. Because in all the cases, + * the unions are between integral types and the types are compatible, use the + * largest member of each union to perform the copy. Use this compile-time check + * to ensure that the union's members are not truncated during the conversion. 
+ */ +#define BUILD_BUG_COMPAT_SQE_UNION_ELEM(elem1, elem2) \ + BUILD_BUG_ON(sizeof_field(struct compat_io_uring_sqe, elem1) != \ + (offsetof(struct compat_io_uring_sqe, elem2) - \ + offsetof(struct compat_io_uring_sqe, elem1))) + + sqe->opcode = READ_ONCE(compat_sqe->opcode); + sqe->flags = READ_ONCE(compat_sqe->flags); + sqe->ioprio = READ_ONCE(compat_sqe->ioprio); + sqe->fd = READ_ONCE(compat_sqe->fd); + BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr2, addr); + sqe->addr2 = READ_ONCE(compat_sqe->addr2); + BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr, len); + sqe->addr = READ_ONCE(compat_sqe->addr); + sqe->len = READ_ONCE(compat_sqe->len); + BUILD_BUG_COMPAT_SQE_UNION_ELEM(rw_flags, user_data); + sqe->rw_flags = READ_ONCE(compat_sqe->rw_flags); + sqe->user_data = READ_ONCE(compat_sqe->user_data); + BUILD_BUG_COMPAT_SQE_UNION_ELEM(buf_index, personality); + sqe->buf_index = READ_ONCE(compat_sqe->buf_index); + sqe->personality = READ_ONCE(compat_sqe->personality); + BUILD_BUG_COMPAT_SQE_UNION_ELEM(splice_fd_in, addr3); + sqe->splice_fd_in = READ_ONCE(compat_sqe->splice_fd_in); + if (sqe->opcode == IORING_OP_URING_CMD) { + size_t compat_cmd_size = compat_uring_cmd_pdu_size(ctx->flags & + IORING_SETUP_SQE128); + + memcpy(sqe->cmd, compat_sqe->cmd, compat_cmd_size); + } else + sqe->addr3 = READ_ONCE(compat_sqe->addr3); +#undef BUILD_BUG_COMPAT_SQE_UNION_ELEM +} +#endif /* CONFIG_COMPAT64 */ + +static inline void *io_get_cqe_overflow(struct io_ring_ctx *ctx, + bool overflow) { if (likely(ctx->cqe_cached < ctx->cqe_sentinel)) { struct io_uring_cqe *cqe = ctx->cqe_cached; @@ -109,15 +169,46 @@ static inline struct io_uring_cqe *io_get_cqe_overflow(struct io_ring_ctx *ctx, return __io_get_cqe(ctx, overflow); }
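To illustrate what BUILD_BUG_COMPAT_SQE_UNION_ELEM guards against, here is a standalone sketch (with a hypothetical struct, not the real SQE layout): copying a union via its largest member is only safe when that member spans the whole gap up to the next field, which can be checked at compile time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-member union mirroring the shape of the SQE unions:
 * the copy is done through the largest member (addr), and the check below
 * proves no other member of the union can be truncated by that copy. */
struct demo_sqe {
	union {
		uint64_t addr;   /* largest member, used for the copy */
		uint32_t len32;  /* smaller member overlapping addr */
	};
	uint32_t next_field;
};

/* Compile-time check in the spirit of BUILD_BUG_COMPAT_SQE_UNION_ELEM:
 * sizeof(first union member) must equal the offset distance between the
 * union and the field that follows it. */
#define CHECK_UNION_ELEM(type, elem1, elem2)                      \
	_Static_assert(sizeof(((type *)0)->elem1) ==              \
		       offsetof(type, elem2) - offsetof(type, elem1), \
		       "union member would be truncated")

CHECK_UNION_ELEM(struct demo_sqe, addr, next_field);
```

If a new, wider member were ever added to the union, the static assertion would fail rather than silently dropping bytes during conversion.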
-static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
+static inline void *io_get_cqe(struct io_ring_ctx *ctx)
 {
 return io_get_cqe_overflow(ctx, false);
 }
+static inline void __io_fill_cqe_any(struct io_ring_ctx *ctx, struct io_uring_cqe *cqe, + u64 user_data, s32 res, u32 cflags, + u64 extra1, u64 extra2) +{ +#ifdef CONFIG_COMPAT64 + if (ctx->compat) { + struct compat_io_uring_cqe *compat_cqe = (struct compat_io_uring_cqe *)cqe; + + WRITE_ONCE(compat_cqe->user_data, user_data); + WRITE_ONCE(compat_cqe->res, res); + WRITE_ONCE(compat_cqe->flags, cflags); + + if (ctx->flags & IORING_SETUP_CQE32) { + WRITE_ONCE(compat_cqe->big_cqe[0], extra1); + WRITE_ONCE(compat_cqe->big_cqe[1], extra2); + } + return; + } +#endif + WRITE_ONCE(cqe->user_data, user_data); + WRITE_ONCE(cqe->res, res); + WRITE_ONCE(cqe->flags, cflags); + + if (ctx->flags & IORING_SETUP_CQE32) { + WRITE_ONCE(cqe->big_cqe[0], extra1); + WRITE_ONCE(cqe->big_cqe[1], extra2); + } +} + static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req) { struct io_uring_cqe *cqe; + u64 extra1 = 0; + u64 extra2 = 0;
 /*
 * If we can't get a cq entry, userspace overflowed the
@@ -128,24 +219,17 @@ static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx,
 if (unlikely(!cqe))
 return io_req_cqe_overflow(req);
+ if (ctx->flags & IORING_SETUP_CQE32 && req->flags & REQ_F_CQE32_INIT) {
+ extra1 = req->extra1;
+ extra2 = req->extra2;
+ }
+
 trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
 req->cqe.res, req->cqe.flags,
- (req->flags & REQ_F_CQE32_INIT) ? req->extra1 : 0,
- (req->flags & REQ_F_CQE32_INIT) ? req->extra2 : 0);
+ extra1, extra2);
- memcpy(cqe, &req->cqe, sizeof(*cqe));
-
- if (ctx->flags & IORING_SETUP_CQE32) {
- u64 extra1 = 0, extra2 = 0;
-
- if (req->flags & REQ_F_CQE32_INIT) {
- extra1 = req->extra1;
- extra2 = req->extra2;
- }
-
- WRITE_ONCE(cqe->big_cqe[0], extra1);
- WRITE_ONCE(cqe->big_cqe[1], extra2);
- }
+ __io_fill_cqe_any(ctx, cqe, req->cqe.user_data, req->cqe.res,
+ req->cqe.flags, extra1, extra2);
 return true;
 }
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index e2c46889d5fab..9f840ba93c888 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -16,6 +16,9 @@
 #include "kbuf.h"
 #define IO_BUFFER_LIST_BUF_PER_PAGE (PAGE_SIZE / sizeof(struct io_uring_buf))
+#ifdef CONFIG_COMPAT64
+#define IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE (PAGE_SIZE / sizeof(struct compat_io_uring_buf))
+#endif
#define BGID_ARRAY 64
@@ -28,6 +31,42 @@ struct io_provide_buf {
 __u16 bid;
 };
+#ifdef CONFIG_COMPAT64 +struct compat_io_uring_buf_reg { + __u64 ring_addr; + __u32 ring_entries; + __u16 bgid; + __u16 pad; + __u64 resv[3]; +}; + +static int get_compat_io_uring_buf_reg(struct io_uring_buf_reg *reg, + const void __user *user_reg) +{ + struct compat_io_uring_buf_reg compat_reg; + + if (unlikely(copy_from_user(&compat_reg, user_reg, sizeof(compat_reg)))) + return -EFAULT; + reg->ring_addr = compat_reg.ring_addr; + reg->ring_entries = compat_reg.ring_entries; + reg->bgid = compat_reg.bgid; + reg->pad = compat_reg.pad; + memcpy(reg->resv, compat_reg.resv, sizeof(reg->resv)); + return 0; +} +#endif + +static int copy_io_uring_buf_reg_from_user(struct io_ring_ctx *ctx, + struct io_uring_buf_reg *reg, + const void __user *arg) +{ +#ifdef CONFIG_COMPAT64 + if (ctx->compat) + return get_compat_io_uring_buf_reg(reg, arg); +#endif + return copy_from_user(reg, arg, sizeof(*reg)); +} + static inline struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx *ctx, unsigned int bgid) { @@ -125,6 +164,41 @@ static void __user *io_provided_buffer_select(struct io_kiocb *req, size_t *len, return NULL; }
+#ifdef CONFIG_COMPAT64 +static void __user *io_ring_buffer_select_compat(struct io_kiocb *req, size_t *len, + struct io_buffer_list *bl, + unsigned int issue_flags) +{ + struct compat_io_uring_buf_ring *br = bl->buf_ring_compat; + struct compat_io_uring_buf *buf; + __u16 head = bl->head; + + if (unlikely(smp_load_acquire(&br->tail) == head)) + return NULL; + + head &= bl->mask; + if (head < IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE) { + buf = &br->bufs[head]; + } else { + int off = head & (IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE - 1); + int index = head / IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE; + buf = page_address(bl->buf_pages[index]); + buf += off; + } + if (*len == 0 || *len > buf->len) + *len = buf->len; + req->flags |= REQ_F_BUFFER_RING; + req->buf_list = bl; + req->buf_index = buf->bid; + + if (issue_flags & IO_URING_F_UNLOCKED || !file_can_poll(req->file)) { + req->buf_list = NULL; + bl->head++; + } + return compat_ptr(buf->addr); +} +#endif /* CONFIG_COMPAT64 */ + static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len, struct io_buffer_list *bl, unsigned int issue_flags) @@ -168,6 +242,17 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len, return u64_to_user_ptr(buf->addr); }
+static void __user *io_ring_buffer_select_any(struct io_kiocb *req, size_t *len,
+ struct io_buffer_list *bl,
+ unsigned int issue_flags)
+{
+#ifdef CONFIG_COMPAT64
+ if (req->ctx->compat)
+ return io_ring_buffer_select_compat(req, len, bl, issue_flags);
+#endif
+ return io_ring_buffer_select(req, len, bl, issue_flags);
+}
+
 void __user *io_buffer_select(struct io_kiocb *req, size_t *len,
 unsigned int issue_flags)
 {
@@ -180,7 +265,7 @@ void __user *io_buffer_select(struct io_kiocb *req, size_t *len,
 bl = io_buffer_get_list(ctx, req->buf_index);
 if (likely(bl)) {
 if (bl->buf_nr_pages)
- ret = io_ring_buffer_select(req, len, bl, issue_flags);
+ ret = io_ring_buffer_select_any(req, len, bl, issue_flags);
 else
 ret = io_provided_buffer_select(req, len, bl);
 }
@@ -215,9 +300,12 @@ static int __io_remove_buffers(struct io_ring_ctx *ctx,
 return 0;
 if (bl->buf_nr_pages) {
+ __u16 tail = ctx->compat ?
+ bl->buf_ring_compat->tail :
+ bl->buf_ring->tail;
 int j;
- i = bl->buf_ring->tail - bl->head;
+ i = tail - bl->head;
 for (j = 0; j < bl->buf_nr_pages; j++)
 unpin_user_page(bl->buf_pages[j]);
 kvfree(bl->buf_pages);
@@ -469,13 +557,13 @@ int io_provide_buffers(struct io_kiocb *req, unsigned int issue_flags)
 int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 {
- struct io_uring_buf_ring *br;
 struct io_uring_buf_reg reg;
 struct io_buffer_list *bl, *free_bl = NULL;
 struct page **pages;
+ size_t pages_size;
 int nr_pages;
- if (copy_from_user(&reg, arg, sizeof(reg)))
+ if (copy_io_uring_buf_reg_from_user(ctx, &reg, arg))
 return -EFAULT;
 if (reg.pad || reg.resv[0] || reg.resv[1] || reg.resv[2])
@@ -508,19 +596,19 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 return -ENOMEM;
 }
- pages = io_pin_pages(reg.ring_addr,
- struct_size(br, bufs, reg.ring_entries),
- &nr_pages);
+ pages_size = ctx->compat ?
+ size_mul(sizeof(struct compat_io_uring_buf), reg.ring_entries) :
+ size_mul(sizeof(struct io_uring_buf), reg.ring_entries);
+ pages = io_pin_pages(reg.ring_addr, pages_size, &nr_pages);
 if (IS_ERR(pages)) {
 kfree(free_bl);
 return PTR_ERR(pages);
 }
- br = page_address(pages[0]);
 bl->buf_pages = pages;
 bl->buf_nr_pages = nr_pages;
 bl->nr_entries = reg.ring_entries;
- bl->buf_ring = br;
+ bl->buf_ring = page_address(pages[0]);
 bl->mask = reg.ring_entries - 1;
 io_buffer_add_list(ctx, bl, reg.bgid);
 return 0;
@@ -531,7 +619,7 @@ int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 {
 struct io_uring_buf_reg reg;
 struct io_buffer_list *bl;
- if (copy_from_user(&reg, arg, sizeof(reg)))
+ if (copy_io_uring_buf_reg_from_user(ctx, &reg, arg))
 return -EFAULT;
 if (reg.pad || reg.resv[0] || reg.resv[1] || reg.resv[2])
 return -EINVAL;
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index c23e15d7d3caf..1aa5bbbc5d628 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -2,6 +2,7 @@
 #ifndef IOU_KBUF_H
 #define IOU_KBUF_H
+#include <linux/io_uring_types.h>
 #include <uapi/linux/io_uring.h>
struct io_buffer_list { @@ -13,7 +14,10 @@ struct io_buffer_list { struct list_head buf_list; struct { struct page **buf_pages; - struct io_uring_buf_ring *buf_ring; + union { + struct io_uring_buf_ring *buf_ring; + struct compat_io_uring_buf_ring *buf_ring_compat; + }; }; }; __u16 bgid; diff --git a/io_uring/net.c b/io_uring/net.c index c586278858e7e..4c133bc6f9d1d 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -4,6 +4,7 @@ #include <linux/file.h> #include <linux/slab.h> #include <linux/net.h> +#include <linux/uio.h> #include <linux/compat.h> #include <net/compat.h> #include <linux/io_uring.h> @@ -435,7 +436,9 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req, } else if (msg.msg_iovlen > 1) { return -EINVAL; } else { - if (copy_from_user(iomsg->fast_iov, msg.msg_iov, sizeof(*msg.msg_iov))) + void *iov = iovec_from_user(msg.msg_iov, 1, 1, iomsg->fast_iov, + req->ctx->compat); + if (IS_ERR(iov)) return -EFAULT; sr->len = iomsg->fast_iov[0].iov_len; iomsg->free_iov = NULL; diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index 41e192de9e8a7..eafdb039f9e51 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -23,6 +23,106 @@ struct io_rsrc_update { u32 offset; };
+#ifdef CONFIG_COMPAT64 +static int get_compat_io_uring_rsrc_update(struct io_uring_rsrc_update2 *up2, + const void __user *user_up) +{ + struct compat_io_uring_rsrc_update compat_up; + + if (unlikely(copy_from_user(&compat_up, user_up, sizeof(compat_up)))) + return -EFAULT; + up2->offset = compat_up.offset; + up2->resv = compat_up.resv; + up2->data = compat_up.data; + return 0; +} + +static int get_compat_io_uring_rsrc_update2(struct io_uring_rsrc_update2 *up2, + const void __user *user_up2) +{ + struct compat_io_uring_rsrc_update2 compat_up2; + + if (unlikely(copy_from_user(&compat_up2, user_up2, sizeof(compat_up2)))) + return -EFAULT; + up2->offset = compat_up2.offset; + up2->resv = compat_up2.resv; + up2->data = compat_up2.data; + up2->tags = compat_up2.tags; + up2->nr = compat_up2.nr; + up2->resv2 = compat_up2.resv2; + return 0; +} + +static int get_compat_io_uring_rsrc_register(struct io_uring_rsrc_register *rr, + const void __user *user_rr) +{ + struct compat_io_uring_rsrc_register compat_rr; + + if (unlikely(copy_from_user(&compat_rr, user_rr, sizeof(compat_rr)))) + return -EFAULT; + rr->nr = compat_rr.nr; + rr->flags = compat_rr.flags; + rr->resv2 = compat_rr.resv2; + rr->data = compat_rr.data; + rr->tags = compat_rr.tags; + return 0; +} +#endif /* CONFIG_COMPAT64 */ + +static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx, + struct io_uring_rsrc_update2 *up2, + const void __user *arg) +{ +#ifdef CONFIG_COMPAT64 + if (ctx->compat) + return get_compat_io_uring_rsrc_update(up2, arg); +#endif + return copy_from_user(up2, arg, sizeof(struct io_uring_rsrc_update)); +} + +static int copy_io_uring_rsrc_update2_from_user(struct io_ring_ctx *ctx, + struct io_uring_rsrc_update2 *up2, + const void __user *arg, + size_t size) +{ +#ifdef CONFIG_COMPAT64 + if (ctx->compat) { + if (size != sizeof(struct compat_io_uring_rsrc_update2)) + return -EINVAL; + if (get_compat_io_uring_rsrc_update2(up2, arg)) + return -EFAULT; + return 0; + } +#endif + if (size 
!= sizeof(*up2)) + return -EINVAL; + if (copy_from_user(up2, arg, sizeof(*up2))) + return -EFAULT; + return 0; +} + +static int copy_io_uring_rsrc_register_from_user(struct io_ring_ctx *ctx, + struct io_uring_rsrc_register *rr, + const void __user *arg, + size_t size) +{ +#ifdef CONFIG_COMPAT64 + if (ctx->compat) { + if (size != sizeof(struct compat_io_uring_rsrc_register)) + return -EINVAL; + if (get_compat_io_uring_rsrc_register(rr, arg)) + return -EFAULT; + return 0; + } +#endif + /* keep it extendible */ + if (size != sizeof(*rr)) + return -EINVAL; + if (copy_from_user(rr, arg, size)) + return -EFAULT; + return 0; +} + static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov, struct io_mapped_ubuf **pimu, struct page **last_hpage); @@ -597,12 +697,14 @@ int io_register_files_update(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args) { struct io_uring_rsrc_update2 up; + int ret;
 if (!nr_args)
 return -EINVAL;
 memset(&up, 0, sizeof(up));
- if (copy_from_user(&up, arg, sizeof(struct io_uring_rsrc_update)))
- return -EFAULT;
+ ret = copy_io_uring_rsrc_update_from_user(ctx, &up, arg);
+ if (ret)
+ return ret;
 if (up.resv || up.resv2)
 return -EINVAL;
 return __io_register_rsrc_update(ctx, IORING_RSRC_FILE, &up, nr_args);
@@ -612,11 +714,11 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
 unsigned size, unsigned type)
 {
 struct io_uring_rsrc_update2 up;
+ int ret;
- if (size != sizeof(up))
- return -EINVAL;
- if (copy_from_user(&up, arg, sizeof(up)))
- return -EFAULT;
+ ret = copy_io_uring_rsrc_update2_from_user(ctx, &up, arg, size);
+ if (ret)
+ return ret;
 if (!up.nr || up.resv || up.resv2)
 return -EINVAL;
 return __io_register_rsrc_update(ctx, type, &up, up.nr);
@@ -626,14 +728,11 @@ __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
 unsigned int size, unsigned int type)
 {
 struct io_uring_rsrc_register rr;
+ int ret;
- /* keep it extendible */
- if (size != sizeof(rr))
- return -EINVAL;
-
- memset(&rr, 0, sizeof(rr));
- if (copy_from_user(&rr, arg, size))
- return -EFAULT;
+ ret = copy_io_uring_rsrc_register_from_user(ctx, &rr, arg, size);
+ if (ret)
+ return ret;
 if (!rr.nr || rr.resv2)
 return -EINVAL;
 if (rr.flags & ~IORING_RSRC_REGISTER_SPARSE)
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 96f77450cf4e2..6ab9916ed3844 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -12,6 +12,32 @@
 #include "io_uring.h"
 #include "tctx.h"
+#ifdef CONFIG_COMPAT64 +static int get_compat_io_uring_rsrc_update(struct io_uring_rsrc_update *up, + const void __user *user_up) +{ + struct compat_io_uring_rsrc_update compat_up; + + if (unlikely(copy_from_user(&compat_up, user_up, sizeof(compat_up)))) + return -EFAULT; + up->offset = compat_up.offset; + up->resv = compat_up.resv; + up->data = compat_up.data; + return 0; +} +#endif /* CONFIG_COMPAT64 */ + +static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx, + struct io_uring_rsrc_update *up, + const void __user *arg) +{ +#ifdef CONFIG_COMPAT64 + if (ctx->compat) + return get_compat_io_uring_rsrc_update(up, arg); +#endif + return copy_from_user(up, arg, sizeof(struct io_uring_rsrc_update)); +} + static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx, struct task_struct *task) { @@ -244,8 +270,6 @@ static int io_ring_add_registered_fd(struct io_uring_task *tctx, int fd, int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg, unsigned nr_args) { - struct io_uring_rsrc_update __user *arg = __arg; - struct io_uring_rsrc_update reg; struct io_uring_task *tctx; int ret, i;
@@ -260,9 +284,17 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 tctx = current->io_uring;
 for (i = 0; i < nr_args; i++) {
+ void __user *arg;
+ __u32 __user *arg_offset;
+ struct io_uring_rsrc_update reg;
 int start, end;
- if (copy_from_user(&reg, &arg[i], sizeof(reg))) {
+ if (ctx->compat)
+ arg = &((struct compat_io_uring_rsrc_update __user *)__arg)[i];
+ else
+ arg = &((struct io_uring_rsrc_update __user *)__arg)[i];
+
+ if (copy_io_uring_rsrc_update_from_user(ctx, &reg, arg)) {
 ret = -EFAULT;
 break;
 }
@@ -288,8 +320,10 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 if (ret < 0)
 break;
- reg.offset = ret;
- if (put_user(reg.offset, &arg[i].offset)) {
+ arg_offset = ctx->compat ?
+ &((struct compat_io_uring_rsrc_update __user *)arg)->offset :
+ &((struct io_uring_rsrc_update __user *)arg)->offset;
+ if (put_user(reg.offset, arg_offset)) {
 fput(tctx->registered_rings[reg.offset]);
 tctx->registered_rings[reg.offset] = NULL;
 ret = -EFAULT;
@@ -303,9 +337,7 @@ int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg,
 unsigned nr_args)
 {
- struct io_uring_rsrc_update __user *arg = __arg;
 struct io_uring_task *tctx = current->io_uring;
- struct io_uring_rsrc_update reg;
 int ret = 0, i;
 if (!nr_args || nr_args > IO_RINGFD_REG_MAX)
@@ -314,10 +346,19 @@ int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg,
 return 0;
 for (i = 0; i < nr_args; i++) {
- if (copy_from_user(&reg, &arg[i], sizeof(reg))) {
+ void __user *arg;
+ struct io_uring_rsrc_update reg;
+
+ if (ctx->compat)
+ arg = &((struct compat_io_uring_rsrc_update __user *)__arg)[i];
+ else
+ arg = &((struct io_uring_rsrc_update __user *)__arg)[i];
+
+ if (copy_io_uring_rsrc_update_from_user(ctx, &reg, arg)) {
 ret = -EFAULT;
 break;
 }
+
 if (reg.resv || reg.data || reg.offset >= IO_RINGFD_REG_MAX) {
 ret = -EINVAL;
 break;
 }
diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
index 7c6697d13cb2e..d67bb30ad543b 100644
--- a/io_uring/uring_cmd.h
+++ b/io_uring/uring_cmd.h
@@ -11,3 +11,10 @@ int io_uring_cmd_prep_async(struct io_kiocb *req);
 #define uring_cmd_pdu_size(is_sqe128) \
 ((1 + !!(is_sqe128)) * sizeof(struct io_uring_sqe) - \
 offsetof(struct io_uring_sqe, cmd))
+
+#ifdef CONFIG_COMPAT64
+#define compat_uring_cmd_pdu_size(is_sqe128) \
+ ((1 + !!(is_sqe128)) * sizeof(struct compat_io_uring_sqe) - \
+ offsetof(struct compat_io_uring_sqe, cmd))
+#endif
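For reference, the pdu-size arithmetic in uring_cmd.h can be checked with a standalone sketch (using a hypothetical mock struct, not the kernel headers): with a 64-byte SQE and the cmd payload at offset 48, the macro yields 16 bytes for a regular SQE and 80 bytes when IORING_SETUP_SQE128 doubles the entry size.

```c
#include <stddef.h>

/* Hypothetical mock of a 64-byte SQE layout with the command payload
 * starting at offset 48, mirroring the uring_cmd_pdu_size() shape. */
struct mock_sqe {
	unsigned char header[48];
	unsigned char cmd[16];
};

/* (1 + !!is_sqe128) doubles the entry size under SQE128; subtracting
 * offsetof(cmd) leaves only the command payload bytes. */
#define mock_cmd_pdu_size(is_sqe128) \
	((1 + !!(is_sqe128)) * sizeof(struct mock_sqe) - \
	 offsetof(struct mock_sqe, cmd))
```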
On 23/02/2023 18:53, Tudor Cretu wrote:
Introduce compat versions of the structs exposed in the uAPI headers that might contain pointers as a member. Also, implement functions that convert the compat versions to the native versions of the struct.
A subsequent patch is going to change the io_uring structs to enable them to support new architectures. On such architectures, the current struct layout still needs to be supported for compat tasks.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
include/linux/io_uring_types.h | 140 ++++++++++++++++++- io_uring/cancel.c | 38 ++++- io_uring/epoll.c | 2 +- io_uring/fdinfo.c | 55 +++++++- io_uring/io_uring.c | 248 +++++++++++++++++++++++---------- io_uring/io_uring.h | 122 +++++++++++++--- io_uring/kbuf.c | 108 ++++++++++++-- io_uring/kbuf.h | 6 +- io_uring/net.c | 5 +- io_uring/rsrc.c | 125 +++++++++++++++-- io_uring/tctx.c | 57 ++++++-- io_uring/uring_cmd.h | 7 + 12 files changed, 782 insertions(+), 131 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index 440179029a8f0..737bc1aa67306 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -7,6 +7,130 @@ #include <linux/llist.h> #include <uapi/linux/io_uring.h> +struct compat_io_uring_sqe {
- __u8 opcode;
- __u8 flags;
- __u16 ioprio;
- __s32 fd;
- union {
__u64 off;
__u64 addr2;
struct {
__u32 cmd_op;
__u32 __pad1;
};
- };
- union {
__u64 addr;
__u64 splice_off_in;
- };
- __u32 len;
- union {
__kernel_rwf_t rw_flags;
__u32 fsync_flags;
__u16 poll_events;
__u32 poll32_events;
__u32 sync_range_flags;
__u32 msg_flags;
__u32 timeout_flags;
__u32 accept_flags;
__u32 cancel_flags;
__u32 open_flags;
__u32 statx_flags;
__u32 fadvise_advice;
__u32 splice_flags;
__u32 rename_flags;
__u32 unlink_flags;
__u32 hardlink_flags;
__u32 xattr_flags;
__u32 msg_ring_flags;
__u32 uring_cmd_flags;
I see that a few new members have appeared since 5.18. This makes me think: is there really much point in keeping those union members in sync? This is going to prove rather painful when rebasing; in fact, they'll probably end up getting out of sync. Maybe we could simply represent the union as the member that we actually use in convert_compat_io_uring_sqe()? A comment next to that member would be enough to make it clear it's actually a union in the native struct.
- };
- __u64 user_data;
- union {
__u16 buf_index;
__u16 buf_group;
- } __packed;
- __u16 personality;
- union {
__s32 splice_fd_in;
__u32 file_index;
struct {
__u16 addr_len;
__u16 __pad3[1];
};
- };
- union {
struct {
__u64 addr3;
__u64 __pad2[1];
};
__u8 cmd[0];
- };
+};
+struct compat_io_uring_cqe {
- __u64 user_data;
- __s32 res;
- __u32 flags;
- __u64 big_cqe[];
Nit: might as well align it like the rest, even though it's not in the native struct. I think it would make sense to have the same alignment approach for the other structs too, the native struct declarations are not very consistent.
+};
+struct compat_io_uring_files_update {
- __u32 offset;
- __u32 resv;
- __aligned_u64 fds;
+};
+struct compat_io_uring_rsrc_register {
- __u32 nr;
- __u32 flags;
- __u64 resv2;
- __aligned_u64 data;
- __aligned_u64 tags;
+};
+struct compat_io_uring_rsrc_update {
- __u32 offset;
- __u32 resv;
- __aligned_u64 data;
+};
+struct compat_io_uring_rsrc_update2 {
- __u32 offset;
- __u32 resv;
- __aligned_u64 data;
- __aligned_u64 tags;
- __u32 nr;
- __u32 resv2;
+};
+struct compat_io_uring_buf {
- __u64 addr;
- __u32 len;
- __u16 bid;
- __u16 resv;
+};
+struct compat_io_uring_buf_ring {
- union {
struct {
__u64 resv1;
__u32 resv2;
__u16 resv3;
__u16 tail;
};
struct compat_io_uring_buf bufs[0];
- };
+};
+struct compat_io_uring_getevents_arg {
- __u64 sigmask;
- __u32 sigmask_sz;
- __u32 pad;
- __u64 ts;
+};
struct io_wq_work_node { struct io_wq_work_node *next; }; @@ -216,7 +340,11 @@ struct io_ring_ctx { * array. */ u32 *sq_array;
struct io_uring_sqe *sq_sqes;
Nit: not sure adding a blank line helps; it feels clearer to keep all the sq_* stuff in one block.
union {
struct compat_io_uring_sqe *sq_sqes_compat;
struct io_uring_sqe *sq_sqes;
	};
	unsigned		cached_sq_head;
	unsigned		sq_entries;
@@ -271,7 +399,10 @@ struct io_ring_ctx { * produced, so the application is allowed to modify pending * entries. */
struct io_uring_cqe *cqes;
union {
struct compat_io_uring_cqe *cqes_compat;
struct io_uring_cqe *cqes;
};
/* * We cache a range of free CQEs we can use, once exhausted it @@ -581,7 +712,10 @@ struct io_kiocb { struct io_overflow_cqe { struct list_head list;
- struct io_uring_cqe cqe;
- union {
struct compat_io_uring_cqe compat_cqe;
struct io_uring_cqe cqe;
- };
}; #endif diff --git a/io_uring/cancel.c b/io_uring/cancel.c index 2291a53cdabd1..20cb8634e44af 100644 --- a/io_uring/cancel.c +++ b/io_uring/cancel.c @@ -27,6 +27,42 @@ struct io_cancel { #define CANCEL_FLAGS (IORING_ASYNC_CANCEL_ALL | IORING_ASYNC_CANCEL_FD | \ IORING_ASYNC_CANCEL_ANY | IORING_ASYNC_CANCEL_FD_FIXED) +struct compat_io_uring_sync_cancel_reg {
Shouldn't that struct be with its friends in io_uring_types.h? Same question regarding struct compat_io_uring_buf_reg.
- __u64 addr;
- __s32 fd;
- __u32 flags;
- struct __kernel_timespec timeout;
- __u64 pad[4];
+};
+#ifdef CONFIG_COMPAT64
The #ifdef'ing makes me think: should we make it more systematic? Right now it looks like only the compat copy helpers and calls to them are #ifdef'd, but arguably it doesn't make sense to #ifdef that and not things like the size selection (e.g. in rings_size()), as we are effectively assuming that the compat and native structs are identical apart from in compat64. Maybe a new helper that returns IS_ENABLED(CONFIG_COMPAT64) && ctx->compat would help? In that case the copy helpers couldn't be #ifdef'd any longer but it may be better this way.
+static int get_compat_io_uring_sync_cancel_reg(struct io_uring_sync_cancel_reg *sc,
const void __user *user_sc)
+{
- struct compat_io_uring_sync_cancel_reg compat_sc;
- if (unlikely(copy_from_user(&compat_sc, user_sc, sizeof(compat_sc))))
return -EFAULT;
- sc->addr = compat_sc.addr;
- sc->fd = compat_sc.fd;
- sc->flags = compat_sc.flags;
- sc->timeout = compat_sc.timeout;
- memcpy(sc->pad, compat_sc.pad, sizeof(sc->pad));
- return 0;
+} +#endif
+static int copy_io_uring_sync_cancel_reg_from_user(struct io_ring_ctx *ctx,
struct io_uring_sync_cancel_reg *sc,
const void __user *arg)
+{ +#ifdef CONFIG_COMPAT64
- if (ctx->compat)
return get_compat_io_uring_sync_cancel_reg(sc, arg);
+#endif
- return copy_from_user(sc, arg, sizeof(*sc));
+}
static bool io_cancel_cb(struct io_wq_work *work, void *data) { struct io_kiocb *req = container_of(work, struct io_kiocb, work); @@ -243,7 +279,7 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg) DEFINE_WAIT(wait); int ret;
- if (copy_from_user(&sc, arg, sizeof(sc)))
-	if (copy_from_user(&sc, arg, sizeof(sc)))
+	if (copy_io_uring_sync_cancel_reg_from_user(ctx, &sc, arg))
 		return -EFAULT;
 	if (sc.flags & ~CANCEL_FLAGS)
 		return -EINVAL;
diff --git a/io_uring/epoll.c b/io_uring/epoll.c index 9aa74d2c80bc4..d5580ff465c3e 100644 --- a/io_uring/epoll.c +++ b/io_uring/epoll.c @@ -40,7 +40,7 @@ int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) struct epoll_event __user *ev; ev = u64_to_user_ptr(READ_ONCE(sqe->addr));
if (copy_from_user(&epoll->event, ev, sizeof(*ev)))
	if (copy_epoll_event_from_user(&epoll->event, ev, req->ctx->compat))
		return -EFAULT;
	}
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c index bc8c9d764bc13..c5bd669081c98 100644 --- a/io_uring/fdinfo.c +++ b/io_uring/fdinfo.c @@ -89,12 +89,25 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, for (i = 0; i < sq_entries; i++) { unsigned int entry = i + sq_head; struct io_uring_sqe *sqe;
unsigned int sq_idx;
unsigned int sq_idx, sq_off;
+#ifdef CONFIG_COMPAT64
struct io_uring_sqe *native_sqe = NULL;
+#endif sq_idx = READ_ONCE(ctx->sq_array[entry & sq_mask]); if (sq_idx > sq_mask) continue;
sqe = &ctx->sq_sqes[sq_idx << sq_shift];
sq_off = sq_idx << sq_shift;
sqe = ctx->compat ? (void *)&ctx->sq_sqes_compat[sq_off] : &ctx->sq_sqes[sq_off];
+#ifdef CONFIG_COMPAT64
if (ctx->compat) {
native_sqe = kmalloc(sizeof(struct io_uring_sqe) << sq_shift, GFP_KERNEL);
It's pretty suboptimal to do a kmalloc() / kfree() in a loop. An alternative would be to have a local struct io_uring_sqe native_sqe[2]; to account for the SQE128 case.
That said for the case of the printing code in this file, I wonder if we shouldn't skip the conversion altogether, since the structs are not passed to any other function. If we move each of the three printing blocks where conversion is needed to a macro, then we can easily handle both the native and compat struct with the same code.
convert_compat_io_uring_sqe(ctx, native_sqe,
(struct compat_io_uring_sqe *)sqe);
sqe = native_sqe;
}
+#endif
- seq_printf(m, "%5u: opcode:%s, fd:%d, flags:%x, off:%llu, " "addr:0x%llx, rw_flags:0x%x, buf_index:%d " "user_data:%llu",
@@ -104,7 +117,8 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, sqe->buf_index, sqe->user_data); if (sq_shift) { u64 *sqeb = (void *) (sqe + 1);
int size = sizeof(struct io_uring_sqe) / sizeof(u64);
int size = (ctx->compat ? sizeof(struct compat_io_uring_sqe)
: sizeof(struct io_uring_sqe)) / sizeof(u64); int j;
for (j = 0; j < size; j++) { @@ -114,12 +128,29 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, } } seq_printf(m, "\n"); +#ifdef CONFIG_COMPAT64
kfree(native_sqe);
+#endif } seq_printf(m, "CQEs:\t%u\n", cq_tail - cq_head); cq_entries = min(cq_tail - cq_head, ctx->cq_entries); for (i = 0; i < cq_entries; i++) { unsigned int entry = i + cq_head;
struct io_uring_cqe *cqe = &ctx->cqes[(entry & cq_mask) << cq_shift];
struct io_uring_cqe *cqe;
unsigned int cq_off = (entry & cq_mask) << cq_shift;
+#ifdef CONFIG_COMPAT64
struct io_uring_cqe *native_cqe = NULL;
+#endif
cqe = ctx->compat ? (void *)&ctx->cqes_compat[cq_off] : &ctx->cqes[cq_off];
+#ifdef CONFIG_COMPAT64
if (ctx->compat) {
native_cqe = kmalloc(sizeof(struct io_uring_cqe) << cq_shift, GFP_KERNEL);
convert_compat_io_uring_cqe(ctx, native_cqe,
(struct compat_io_uring_cqe *)cqe);
cqe = native_cqe;
}
+#endif seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x", entry & cq_mask, cqe->user_data, cqe->res, @@ -128,6 +159,9 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, seq_printf(m, ", extra1:%llu, extra2:%llu\n", cqe->big_cqe[0], cqe->big_cqe[1]); seq_printf(m, "\n"); +#ifdef CONFIG_COMPAT64
kfree(native_cqe);
+#endif } /* @@ -189,10 +223,23 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, spin_lock(&ctx->completion_lock); list_for_each_entry(ocqe, &ctx->cq_overflow_list, list) { struct io_uring_cqe *cqe = &ocqe->cqe; +#ifdef CONFIG_COMPAT64
struct io_uring_cqe *native_cqe = NULL;
if (ctx->compat) {
native_cqe = kmalloc(sizeof(struct io_uring_cqe) << cq_shift, GFP_KERNEL);
convert_compat_io_uring_cqe(ctx, native_cqe,
(struct compat_io_uring_cqe *)cqe);
cqe = native_cqe;
}
+#endif seq_printf(m, " user_data=%llu, res=%d, flags=%x\n", cqe->user_data, cqe->res, cqe->flags); +#ifdef CONFIG_COMPAT64
kfree(native_cqe);
+#endif } spin_unlock(&ctx->completion_lock); diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 707229ae04dc8..03506776af043 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -152,6 +152,43 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx); static struct kmem_cache *req_cachep; +#ifdef CONFIG_COMPAT64 +static int get_compat_io_uring_getevents_arg(struct io_uring_getevents_arg *arg,
const void __user *user_arg)
+{
- struct compat_io_uring_getevents_arg compat_arg;
- if (unlikely(copy_from_user(&compat_arg, user_arg, sizeof(compat_arg))))
Not very important but curious: we only seem to be using unlikely(copy_from_user(...)) in the compat helpers, any reason for that?
return -EFAULT;
- arg->sigmask = (__kernel_uintptr_t)compat_ptr(compat_arg.sigmask);
Looks like a leftover conversion that belongs to patch 8 instead.
- arg->sigmask_sz = compat_arg.sigmask_sz;
- arg->pad = compat_arg.pad;
- arg->ts = (__kernel_uintptr_t)compat_ptr(compat_arg.ts);
- return 0;
+} +#endif /* CONFIG_COMPAT64 */
+static int copy_io_uring_getevents_arg_from_user(struct io_ring_ctx *ctx,
struct io_uring_getevents_arg *arg,
const void __user *argp,
size_t size)
+{ +#ifdef CONFIG_COMPAT64
- if (ctx->compat) {
if (size != sizeof(struct compat_io_uring_getevents_arg))
return -EINVAL;
if (get_compat_io_uring_getevents_arg(arg, argp))
return -EFAULT;
return 0;
Or simply return get_compat_io_uring_getevents_arg(arg, argp);
I think this pattern reappears in quite a few copy_*_from_user() helpers.
- }
+#endif
- if (size != sizeof(*arg))
return -EINVAL;
- if (copy_from_user(arg, argp, sizeof(*arg)))
return -EFAULT;
- return 0;
+}
struct sock *io_uring_get_socket(struct file *file) { #if defined(CONFIG_UNIX) @@ -604,7 +641,9 @@ void io_cq_unlock_post(struct io_ring_ctx *ctx) static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) { bool all_flushed;
- size_t cqe_size = sizeof(struct io_uring_cqe);
- size_t cqe_size = ctx->compat ?
sizeof(struct compat_io_uring_cqe) :
sizeof(struct io_uring_cqe);
I'm not sure this is enough to get the overflow machinery to work in compat64. struct io_overflow_cqe gets filled in io_cqring_event_overflow() but AFAICT that code is not compat-aware, so it always initialises the native cqe and not the compat one. I suppose we could make use of __io_fill_cqe_any() in io_cqring_event_overflow()?
This makes me think, are there tests that trigger this overflow mechanism? Issues there might easily go unnoticed otherwise.
if (!force && __io_cqring_events(ctx) == ctx->cq_entries) return false; @@ -741,10 +780,11 @@ bool io_req_cqe_overflow(struct io_kiocb *req)
- control dependency is enough as we're using WRITE_ONCE to
- fill the cq entry
*/ -struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow) +void *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
I certainly understand the rationale behind changing the return value to void *, as it is now either struct io_uring_cqe * or struct compat_io_uring_cqe *. That said I'm not convinced this change makes a lot of sense as it stands, because all users of *io_get_cqe* still interpret the return value as struct io_uring_cqe * (which works as void * is implicitly convertible to any pointer type, not too conveniently here...).
My feeling is that we should really choose one way or the other. Either make such pointers void * everywhere - or maybe better, pointer to a union, to stop them from being implicitly convertible. Or just keep struct io_uring_cqe * everywhere, which is quite common when dealing with compat structs. I don't really have a preference as both approaches have merits, up to you :)
{ unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1); unsigned int free, queued, len;
- void *cqe;
/* * Posting into the CQ when there are pending overflowed CQEs may break @@ -767,14 +807,15 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow) len <<= 1; }
- ctx->cqe_cached = &ctx->cqes[off];
- cqe = ctx->compat ? (void *)&ctx->cqes_compat[off] : (void *)&ctx->cqes[off];
- ctx->cqe_cached = cqe;
Don't we have a problem with the whole cqe_cached / cqe_sentinel logic? It assumes that it is manipulating pointers to struct io_uring_cqe, but in fact these are pointers to struct compat_io_uring_cqe in compat64, aren't they? This appeared in 5.19 [1] so you may not have noticed it when rebasing on 6.1.
[1] https://lore.kernel.org/all/487eeef00f3146537b3d9c1a9cef2fc0b9a86f81.1649771...
	ctx->cqe_sentinel = ctx->cqe_cached + len;
	ctx->cached_cq_tail++;
	ctx->cqe_cached++;
	if (ctx->flags & IORING_SETUP_CQE32)
		ctx->cqe_cached++;
- return &ctx->cqes[off];
- return cqe;
} bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags, @@ -793,14 +834,7 @@ bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags if (likely(cqe)) { trace_io_uring_complete(ctx, NULL, user_data, res, cflags, 0, 0);
WRITE_ONCE(cqe->user_data, user_data);
WRITE_ONCE(cqe->res, res);
WRITE_ONCE(cqe->flags, cflags);
if (ctx->flags & IORING_SETUP_CQE32) {
WRITE_ONCE(cqe->big_cqe[0], 0);
WRITE_ONCE(cqe->big_cqe[1], 0);
}
		__io_fill_cqe_any(ctx, cqe, user_data, res, cflags, 0, 0);
		return true;
	}
@@ -2222,7 +2256,7 @@ static void io_commit_sqring(struct io_ring_ctx *ctx)
- used, it's important that those reads are done through READ_ONCE() to
- prevent a re-load down the line.
*/ -static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx) +static const void *io_get_sqe(struct io_ring_ctx *ctx) { unsigned head, mask = ctx->sq_entries - 1; unsigned sq_idx = ctx->cached_sq_head++ & mask; @@ -2240,7 +2274,8 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx) /* double index for 128-byte SQEs, twice as long */ if (ctx->flags & IORING_SETUP_SQE128) head <<= 1;
return &ctx->sq_sqes[head];
return ctx->compat ? (void *)&ctx->sq_sqes_compat[head]
			   : (void *)&ctx->sq_sqes[head];
}
/* drop invalid entries */ @@ -2265,8 +2300,11 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr) io_submit_state_start(&ctx->submit_state, left); do {
const struct io_uring_sqe *sqe;
		const void *sqe;
		struct io_kiocb *req;
+#ifdef CONFIG_COMPAT64
struct io_uring_sqe native_sqe;
It would seem that this will result in a buffer overflow in the IORING_OP_URING_CMD + IORING_SETUP_SQE128 case, since no space is allocated for the extra data. Maybe the best option is to allocate an array of two sqes to cover this situation.
+#endif if (unlikely(!io_alloc_req_refill(ctx))) break; @@ -2276,6 +2314,12 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr) io_req_add_to_cache(req, ctx); break; } +#ifdef CONFIG_COMPAT64
if (ctx->compat) {
convert_compat_io_uring_sqe(ctx, &native_sqe, sqe);
sqe = &native_sqe;
}
+#endif /* * Continue submitting even for sqe failure if the @@ -2480,6 +2524,9 @@ static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries { struct io_rings *rings; size_t off, cq_array_size, sq_array_size;
- size_t cqe_size = ctx->compat ?
sizeof(struct compat_io_uring_cqe) :
sizeof(struct io_uring_cqe);
off = sizeof(*rings); @@ -2492,7 +2539,7 @@ static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries if (cq_offset) *cq_offset = off;
- cq_array_size = array_size(sizeof(struct io_uring_cqe), cq_entries);
	cq_array_size = array_size(cqe_size, cq_entries);
	if (cq_array_size == SIZE_MAX)
		return SIZE_MAX;
@@ -3120,20 +3167,22 @@ static unsigned long io_uring_nommu_get_unmapped_area(struct file *file, #endif /* !CONFIG_MMU */ -static int io_validate_ext_arg(unsigned flags, const void __user *argp, size_t argsz) +static int io_validate_ext_arg(struct io_ring_ctx *ctx, unsigned int flags,
- const void __user *argp, size_t argsz)
It looks like overflow lines in prototypes are generally aligned on the opening parenthesis in this file, not just indented by a tab.
{ if (flags & IORING_ENTER_EXT_ARG) { struct io_uring_getevents_arg arg;
int ret;
if (argsz != sizeof(arg))
return -EINVAL;
if (copy_from_user(&arg, argp, sizeof(arg)))
return -EFAULT;
ret = copy_io_uring_getevents_arg_from_user(ctx, &arg, argp, argsz);
if (ret)
return ret;
Or simply return copy_io_uring_getevents_arg_from_user(...);
} return 0; } -static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz, +static int io_get_ext_arg(struct io_ring_ctx *ctx, unsigned int flags,
- const void __user *argp, size_t *argsz,
#ifdef CONFIG_CHERI_PURECAP_UABI struct __kernel_timespec * __capability *ts, const sigset_t * __capability *sig) @@ -3143,6 +3192,7 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz #endif { struct io_uring_getevents_arg arg;
- int ret;
/* * If EXT_ARG isn't set, then we have no timespec and the argp pointer @@ -3158,10 +3208,9 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz * EXT_ARG is set - ensure we agree on the size of it and copy in our * timespec and sigset_t pointers if good. */
- if (*argsz != sizeof(arg))
return -EINVAL;
- if (copy_from_user(&arg, argp, sizeof(arg)))
return -EFAULT;
- ret = copy_io_uring_getevents_arg_from_user(ctx, &arg, argp, *argsz);
- if (ret)
		return ret;
	if (arg.pad)
		return -EINVAL;
	*sig = u64_to_user_ptr(arg.sigmask);
@@ -3268,7 +3317,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, */ mutex_lock(&ctx->uring_lock); iopoll_locked:
ret2 = io_validate_ext_arg(flags, argp, argsz);
		ret2 = io_validate_ext_arg(ctx, flags, argp, argsz);
		if (likely(!ret2)) {
			min_complete = min(min_complete, ctx->cq_entries);
@@ -3279,7 +3328,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, const sigset_t __user *sig; struct __kernel_timespec __user *ts;
ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig);
		ret2 = io_get_ext_arg(ctx, flags, argp, &argsz, &ts, &sig);
		if (likely(!ret2)) {
			min_complete = min(min_complete, ctx->cq_entries);
@@ -3329,6 +3378,9 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx, { struct io_rings *rings; size_t size, cqes_offset, sq_array_offset;
- size_t sqe_size = ctx->compat ?
sizeof(struct compat_io_uring_sqe) :
sizeof(struct io_uring_sqe);
/* make sure these are sane, as we already accounted them */ ctx->sq_entries = p->sq_entries; @@ -3351,9 +3403,9 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx, rings->cq_ring_entries = p->cq_entries; if (p->flags & IORING_SETUP_SQE128)
size = array_size(2 * sizeof(struct io_uring_sqe), p->sq_entries);
		size = array_size(2 * sqe_size, p->sq_entries);
	else
size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
		size = array_size(sqe_size, p->sq_entries);
	if (size == SIZE_MAX) {
		io_mem_free(ctx->rings);
		ctx->rings = NULL;
@@ -4107,48 +4159,45 @@ static int __init io_uring_init(void) #define BUILD_BUG_SQE_ELEM_SIZE(eoffset, esize, ename) \ __BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, esize, ename) BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 64);
- BUILD_BUG_SQE_ELEM(0, __u8, opcode);
- BUILD_BUG_SQE_ELEM(1, __u8, flags);
- BUILD_BUG_SQE_ELEM(2, __u16, ioprio);
- BUILD_BUG_SQE_ELEM(4, __s32, fd);
- BUILD_BUG_SQE_ELEM(8, __u64, off);
- BUILD_BUG_SQE_ELEM(8, __u64, addr2);
- BUILD_BUG_SQE_ELEM(8, __u32, cmd_op);
- BUILD_BUG_SQE_ELEM(0, __u8, opcode);
- BUILD_BUG_SQE_ELEM(1, __u8, flags);
- BUILD_BUG_SQE_ELEM(2, __u16, ioprio);
- BUILD_BUG_SQE_ELEM(4, __s32, fd);
- BUILD_BUG_SQE_ELEM(8, __u64, off);
- BUILD_BUG_SQE_ELEM(8, __u64, addr2);
- BUILD_BUG_SQE_ELEM(8, __u32, cmd_op); BUILD_BUG_SQE_ELEM(12, __u32, __pad1);
- BUILD_BUG_SQE_ELEM(16, __u64, addr);
- BUILD_BUG_SQE_ELEM(16, __u64, splice_off_in);
- BUILD_BUG_SQE_ELEM(24, __u32, len);
- BUILD_BUG_SQE_ELEM(28, __kernel_rwf_t, rw_flags);
- BUILD_BUG_SQE_ELEM(28, /* compat */ int, rw_flags);
- BUILD_BUG_SQE_ELEM(28, /* compat */ __u32, rw_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, fsync_flags);
- BUILD_BUG_SQE_ELEM(28, /* compat */ __u16, poll_events);
- BUILD_BUG_SQE_ELEM(28, __u32, poll32_events);
- BUILD_BUG_SQE_ELEM(28, __u32, sync_range_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, msg_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, timeout_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, accept_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, cancel_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, open_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, statx_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, fadvise_advice);
- BUILD_BUG_SQE_ELEM(28, __u32, splice_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, rename_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, unlink_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, hardlink_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, xattr_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, msg_ring_flags);
- BUILD_BUG_SQE_ELEM(32, __u64, user_data);
- BUILD_BUG_SQE_ELEM(40, __u16, buf_index);
- BUILD_BUG_SQE_ELEM(40, __u16, buf_group);
- BUILD_BUG_SQE_ELEM(42, __u16, personality);
- BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
- BUILD_BUG_SQE_ELEM(44, __u32, file_index);
- BUILD_BUG_SQE_ELEM(44, __u16, addr_len);
- BUILD_BUG_SQE_ELEM(46, __u16, __pad3[0]);
- BUILD_BUG_SQE_ELEM(48, __u64, addr3);
- BUILD_BUG_SQE_ELEM(16, __u64, addr);
- BUILD_BUG_SQE_ELEM(16, __u64, splice_off_in);
- BUILD_BUG_SQE_ELEM(24, __u32, len);
- BUILD_BUG_SQE_ELEM(28, __kernel_rwf_t, rw_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, fsync_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, poll32_events);
- BUILD_BUG_SQE_ELEM(28, __u32, sync_range_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, msg_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, timeout_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, accept_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, cancel_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, open_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, statx_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, fadvise_advice);
- BUILD_BUG_SQE_ELEM(28, __u32, splice_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, rename_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, unlink_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, hardlink_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, xattr_flags);
- BUILD_BUG_SQE_ELEM(28, __u32, msg_ring_flags);
- BUILD_BUG_SQE_ELEM(32, __u64, user_data);
- BUILD_BUG_SQE_ELEM(40, __u16, buf_index);
- BUILD_BUG_SQE_ELEM(40, __u16, buf_group);
- BUILD_BUG_SQE_ELEM(42, __u16, personality);
- BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
- BUILD_BUG_SQE_ELEM(44, __u32, file_index);
- BUILD_BUG_SQE_ELEM(44, __u16, addr_len);
- BUILD_BUG_SQE_ELEM(46, __u16, __pad3[0]);
- BUILD_BUG_SQE_ELEM(48, __u64, addr3); BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
- BUILD_BUG_SQE_ELEM(56, __u64, __pad2);
- BUILD_BUG_SQE_ELEM(56, __u64, __pad2);
BUILD_BUG_ON(sizeof(struct io_uring_files_update) != sizeof(struct io_uring_rsrc_update)); @@ -4160,6 +4209,65 @@ static int __init io_uring_init(void) BUILD_BUG_ON(offsetof(struct io_uring_buf, resv) != offsetof(struct io_uring_buf_ring, tail)); +#ifdef CONFIG_COMPAT64 +#define BUILD_BUG_COMPAT_SQE_ELEM(eoffset, etype, ename) \
- __BUILD_BUG_VERIFY_OFFSET_SIZE(struct compat_io_uring_sqe, eoffset, sizeof(etype), ename)
+#define BUILD_BUG_COMPAT_SQE_ELEM_SIZE(eoffset, esize, ename) \
- __BUILD_BUG_VERIFY_OFFSET_SIZE(struct compat_io_uring_sqe, eoffset, esize, ename)
- BUILD_BUG_ON(sizeof(struct compat_io_uring_sqe) != 64);
- BUILD_BUG_COMPAT_SQE_ELEM(0, __u8, opcode);
- BUILD_BUG_COMPAT_SQE_ELEM(1, __u8, flags);
- BUILD_BUG_COMPAT_SQE_ELEM(2, __u16, ioprio);
- BUILD_BUG_COMPAT_SQE_ELEM(4, __s32, fd);
- BUILD_BUG_COMPAT_SQE_ELEM(8, __u64, off);
- BUILD_BUG_COMPAT_SQE_ELEM(8, __u64, addr2);
- BUILD_BUG_COMPAT_SQE_ELEM(8, __u32, cmd_op);
- BUILD_BUG_COMPAT_SQE_ELEM(12, __u32, __pad1);
- BUILD_BUG_COMPAT_SQE_ELEM(16, __u64, addr);
- BUILD_BUG_COMPAT_SQE_ELEM(16, __u64, splice_off_in);
- BUILD_BUG_COMPAT_SQE_ELEM(24, __u32, len);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __kernel_rwf_t, rw_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, /* compat */ int, rw_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, /* compat */ __u32, rw_flags);
I'm a bit confused by these lines, as they are removed from the native checks above but preserved here? I don't think "compat" has much to do with compat64 here, in the case of poll_events below it seems to be about a backwards-compatible change in ABI (see [1]).
I also expected the changes in whitespace above to be reflected here.
[1] https://lore.kernel.org/all/1592387636-80105-2-git-send-email-jiufei.xue@lin...
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, fsync_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, /* compat */ __u16, poll_events);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, poll32_events);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, sync_range_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, msg_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, timeout_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, accept_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, cancel_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, open_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, statx_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, fadvise_advice);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, splice_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, rename_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, unlink_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, hardlink_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, xattr_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(28, __u32, msg_ring_flags);
- BUILD_BUG_COMPAT_SQE_ELEM(32, __u64, user_data);
- BUILD_BUG_COMPAT_SQE_ELEM(40, __u16, buf_index);
- BUILD_BUG_COMPAT_SQE_ELEM(40, __u16, buf_group);
- BUILD_BUG_COMPAT_SQE_ELEM(42, __u16, personality);
- BUILD_BUG_COMPAT_SQE_ELEM(44, __s32, splice_fd_in);
- BUILD_BUG_COMPAT_SQE_ELEM(44, __u32, file_index);
- BUILD_BUG_COMPAT_SQE_ELEM(44, __u16, addr_len);
- BUILD_BUG_COMPAT_SQE_ELEM(46, __u16, __pad3[0]);
- BUILD_BUG_COMPAT_SQE_ELEM(48, __u64, addr3);
- BUILD_BUG_COMPAT_SQE_ELEM_SIZE(48, 0, cmd);
- BUILD_BUG_COMPAT_SQE_ELEM(56, __u64, __pad2);
- BUILD_BUG_ON(sizeof(struct compat_io_uring_files_update) !=
sizeof(struct compat_io_uring_rsrc_update));
- BUILD_BUG_ON(sizeof(struct compat_io_uring_rsrc_update) >
sizeof(struct compat_io_uring_rsrc_update2));
- BUILD_BUG_ON(offsetof(struct compat_io_uring_buf_ring, bufs) != 0);
- BUILD_BUG_ON(offsetof(struct compat_io_uring_buf, resv) !=
offsetof(struct compat_io_uring_buf_ring, tail));
+#endif /* CONFIG_COMPAT64 */
- /* should fit into one byte */ BUILD_BUG_ON(SQE_VALID_FLAGS >= (1 << 8)); BUILD_BUG_ON(SQE_COMMON_FLAGS >= (1 << 8));
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 50bc3af449534..fb2711770bfb0 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -5,6 +5,7 @@
 #include <linux/lockdep.h>
 #include <linux/io_uring_types.h>
 #include "io-wq.h"
+#include "uring_cmd.h"
 #include "slist.h"
 #include "filetable.h"
@@ -24,7 +25,7 @@ enum {
 	IOU_STOP_MULTISHOT = -ECANCELED,
 };
-struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow);
+void *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow);
 bool io_req_cqe_overflow(struct io_kiocb *req);
 int io_run_task_work_sig(struct io_ring_ctx *ctx);
 int __io_run_local_work(struct io_ring_ctx *ctx, bool *locked);
@@ -93,8 +94,67 @@ static inline void io_cq_lock(struct io_ring_ctx *ctx)
 void io_cq_unlock_post(struct io_ring_ctx *ctx);
 
-static inline struct io_uring_cqe *io_get_cqe_overflow(struct io_ring_ctx *ctx,
-						       bool overflow)
+#ifdef CONFIG_COMPAT64
+static inline void convert_compat_io_uring_cqe(struct io_ring_ctx *ctx,
+					       struct io_uring_cqe *cqe,
+					       const struct compat_io_uring_cqe *compat_cqe)
+{
+	cqe->user_data = READ_ONCE(compat_cqe->user_data);
+	cqe->res = READ_ONCE(compat_cqe->res);
+	cqe->flags = READ_ONCE(compat_cqe->flags);
+
+	if (ctx->flags & IORING_SETUP_CQE32) {
+		cqe->big_cqe[0] = READ_ONCE(compat_cqe->big_cqe[0]);
+		cqe->big_cqe[1] = READ_ONCE(compat_cqe->big_cqe[1]);
+	}
+}
+
+static inline void convert_compat_io_uring_sqe(struct io_ring_ctx *ctx,
+					       struct io_uring_sqe *sqe,
+					       const struct compat_io_uring_sqe *compat_sqe)
+{
+/*
+ * The struct io_uring_sqe contains anonymous unions and there is no field
+ * keeping track of which union's member is active. Because in all the cases,
+ * the unions are between integral types and the types are compatible, use the
+ * largest member of each union to perform the copy. Use this compile-time check
+ * to ensure that the union's members are not truncated during the conversion.
+ */
+#define BUILD_BUG_COMPAT_SQE_UNION_ELEM(elem1, elem2) \
+	BUILD_BUG_ON(sizeof_field(struct compat_io_uring_sqe, elem1) != \
+		     (offsetof(struct compat_io_uring_sqe, elem2) - \
+		      offsetof(struct compat_io_uring_sqe, elem1)))
+
+	sqe->opcode = READ_ONCE(compat_sqe->opcode);
+	sqe->flags = READ_ONCE(compat_sqe->flags);
+	sqe->ioprio = READ_ONCE(compat_sqe->ioprio);
+	sqe->fd = READ_ONCE(compat_sqe->fd);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr2, addr);
+	sqe->addr2 = READ_ONCE(compat_sqe->addr2);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr, len);
+	sqe->addr = READ_ONCE(compat_sqe->addr);
+	sqe->len = READ_ONCE(compat_sqe->len);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(rw_flags, user_data);
+	sqe->rw_flags = READ_ONCE(compat_sqe->rw_flags);
+	sqe->user_data = READ_ONCE(compat_sqe->user_data);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(buf_index, personality);
+	sqe->buf_index = READ_ONCE(compat_sqe->buf_index);
+	sqe->personality = READ_ONCE(compat_sqe->personality);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(splice_fd_in, addr3);
+	sqe->splice_fd_in = READ_ONCE(compat_sqe->splice_fd_in);
+	if (sqe->opcode == IORING_OP_URING_CMD) {
+		size_t compat_cmd_size = compat_uring_cmd_pdu_size(ctx->flags &
+								   IORING_SETUP_SQE128);
+
+		memcpy(sqe->cmd, compat_sqe->cmd, compat_cmd_size);
+	} else
+		sqe->addr3 = READ_ONCE(compat_sqe->addr3);
It would be better to copy __pad2 too, as some net code does check that it is zero.
+#undef BUILD_BUG_COMPAT_SQE_UNION_ELEM
+}
+#endif /* CONFIG_COMPAT64 */
+
+static inline void *io_get_cqe_overflow(struct io_ring_ctx *ctx,
+					bool overflow)
 {
 	if (likely(ctx->cqe_cached < ctx->cqe_sentinel)) {
 		struct io_uring_cqe *cqe = ctx->cqe_cached;
@@ -109,15 +169,46 @@ static inline struct io_uring_cqe *io_get_cqe_overflow(struct io_ring_ctx *ctx,
 	return __io_get_cqe(ctx, overflow);
 }
 
-static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
+static inline void *io_get_cqe(struct io_ring_ctx *ctx)
 {
 	return io_get_cqe_overflow(ctx, false);
 }
 
+static inline void __io_fill_cqe_any(struct io_ring_ctx *ctx, struct io_uring_cqe *cqe,
Nit: not sure the "any" suffix really helps here, __io_fill_cqe() is probably enough.
+				     u64 user_data, s32 res, u32 cflags,
+				     u64 extra1, u64 extra2)
+{
+#ifdef CONFIG_COMPAT64
+	if (ctx->compat) {
+		struct compat_io_uring_cqe *compat_cqe = (struct compat_io_uring_cqe *)cqe;
+
+		WRITE_ONCE(compat_cqe->user_data, user_data);
+		WRITE_ONCE(compat_cqe->res, res);
+		WRITE_ONCE(compat_cqe->flags, cflags);
+
+		if (ctx->flags & IORING_SETUP_CQE32) {
+			WRITE_ONCE(compat_cqe->big_cqe[0], extra1);
+			WRITE_ONCE(compat_cqe->big_cqe[1], extra2);
+		}
+		return;
+	}
+#endif
+	WRITE_ONCE(cqe->user_data, user_data);
+	WRITE_ONCE(cqe->res, res);
+	WRITE_ONCE(cqe->flags, cflags);
+	if (ctx->flags & IORING_SETUP_CQE32) {
+		WRITE_ONCE(cqe->big_cqe[0], extra1);
+		WRITE_ONCE(cqe->big_cqe[1], extra2);
+	}
+}
 static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx,
 				     struct io_kiocb *req)
 {
 	struct io_uring_cqe *cqe;
+	u64 extra1 = 0;
+	u64 extra2 = 0;
 
 	/*
 	 * If we can't get a cq entry, userspace overflowed the
@@ -128,24 +219,17 @@ static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx,
 	if (unlikely(!cqe))
 		return io_req_cqe_overflow(req);
 
+	if (ctx->flags & IORING_SETUP_CQE32 && req->flags & REQ_F_CQE32_INIT) {
+		extra1 = req->extra1;
+		extra2 = req->extra2;
+	}
+
 	trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
 				req->cqe.res, req->cqe.flags,
-				(req->flags & REQ_F_CQE32_INIT) ? req->extra1 : 0,
-				(req->flags & REQ_F_CQE32_INIT) ? req->extra2 : 0);
+				extra1, extra2);
 
-	memcpy(cqe, &req->cqe, sizeof(*cqe));
-
-	if (ctx->flags & IORING_SETUP_CQE32) {
-		u64 extra1 = 0, extra2 = 0;
-
-		if (req->flags & REQ_F_CQE32_INIT) {
-			extra1 = req->extra1;
-			extra2 = req->extra2;
-		}
-
-		WRITE_ONCE(cqe->big_cqe[0], extra1);
-		WRITE_ONCE(cqe->big_cqe[1], extra2);
-	}
+	__io_fill_cqe_any(ctx, cqe, req->cqe.user_data, req->cqe.res,
+			  req->cqe.flags, extra1, extra2);
 
 	return true;
 }
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index e2c46889d5fab..9f840ba93c888 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -16,6 +16,9 @@
 #include "kbuf.h"
 
 #define IO_BUFFER_LIST_BUF_PER_PAGE (PAGE_SIZE / sizeof(struct io_uring_buf))
+#ifdef CONFIG_COMPAT64
+#define IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE (PAGE_SIZE / sizeof(struct compat_io_uring_buf))
+#endif
 
 #define BGID_ARRAY	64
 
@@ -28,6 +31,42 @@ struct io_provide_buf {
 	__u16				bid;
 };
 
+#ifdef CONFIG_COMPAT64
+struct compat_io_uring_buf_reg {
+	__u64	ring_addr;
+	__u32	ring_entries;
+	__u16	bgid;
+	__u16	pad;
+	__u64	resv[3];
+};
+
+static int get_compat_io_uring_buf_reg(struct io_uring_buf_reg *reg,
+				       const void __user *user_reg)
+{
+	struct compat_io_uring_buf_reg compat_reg;
+
+	if (unlikely(copy_from_user(&compat_reg, user_reg, sizeof(compat_reg))))
+		return -EFAULT;
+	reg->ring_addr = compat_reg.ring_addr;
+	reg->ring_entries = compat_reg.ring_entries;
+	reg->bgid = compat_reg.bgid;
+	reg->pad = compat_reg.pad;
+	memcpy(reg->resv, compat_reg.resv, sizeof(reg->resv));
+	return 0;
+}
+#endif
+
+static int copy_io_uring_buf_reg_from_user(struct io_ring_ctx *ctx,
+					   struct io_uring_buf_reg *reg,
+					   const void __user *arg)
+{
+#ifdef CONFIG_COMPAT64
+	if (ctx->compat)
+		return get_compat_io_uring_buf_reg(reg, arg);
+#endif
+	return copy_from_user(reg, arg, sizeof(*reg));
+}
+
 static inline struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx *ctx,
 							unsigned int bgid)
 {
@@ -125,6 +164,41 @@ static void __user *io_provided_buffer_select(struct io_kiocb *req, size_t *len,
 	return NULL;
 }
 
+#ifdef CONFIG_COMPAT64
+static void __user *io_ring_buffer_select_compat(struct io_kiocb *req, size_t *len,
+						 struct io_buffer_list *bl,
+						 unsigned int issue_flags)
+{
+	struct compat_io_uring_buf_ring *br = bl->buf_ring_compat;
+	struct compat_io_uring_buf *buf;
+	__u16 head = bl->head;
+
+	if (unlikely(smp_load_acquire(&br->tail) == head))
+		return NULL;
+
+	head &= bl->mask;
+	if (head < IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE) {
+		buf = &br->bufs[head];
+	} else {
+		int off = head & (IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE - 1);
+		int index = head / IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE;
+
+		buf = page_address(bl->buf_pages[index]);
+		buf += off;
+	}
+	if (*len == 0 || *len > buf->len)
+		*len = buf->len;
+	req->flags |= REQ_F_BUFFER_RING;
+	req->buf_list = bl;
+	req->buf_index = buf->bid;
+
+	if (issue_flags & IO_URING_F_UNLOCKED || !file_can_poll(req->file)) {
+		req->buf_list = NULL;
+		bl->head++;
+	}
I agree duplicating that function is the only reasonable option here. It looks like this block could be moved to the common function though, reducing the duplication a bit.
+	return compat_ptr(buf->addr);
+}
+#endif /* CONFIG_COMPAT64 */
+
 static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
 					  struct io_buffer_list *bl,
 					  unsigned int issue_flags)
@@ -168,6 +242,17 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
 	return u64_to_user_ptr(buf->addr);
 }
 
+static void __user *io_ring_buffer_select_any(struct io_kiocb *req, size_t *len,
+					      struct io_buffer_list *bl,
+					      unsigned int issue_flags)
+{
+#ifdef CONFIG_COMPAT64
+	if (req->ctx->compat)
+		return io_ring_buffer_select_compat(req, len, bl, issue_flags);
+#endif
+	return io_ring_buffer_select(req, len, bl, issue_flags);
+}
+
 void __user *io_buffer_select(struct io_kiocb *req, size_t *len,
 			      unsigned int issue_flags)
 {
@@ -180,7 +265,7 @@ void __user *io_buffer_select(struct io_kiocb *req, size_t *len,
 	bl = io_buffer_get_list(ctx, req->buf_index);
 	if (likely(bl)) {
 		if (bl->buf_nr_pages)
-			ret = io_ring_buffer_select(req, len, bl, issue_flags);
+			ret = io_ring_buffer_select_any(req, len, bl, issue_flags);
 		else
 			ret = io_provided_buffer_select(req, len, bl);
 	}
@@ -215,9 +300,12 @@ static int __io_remove_buffers(struct io_ring_ctx *ctx,
 		return 0;
 
 	if (bl->buf_nr_pages) {
+		__u16 tail = ctx->compat ?
+			     bl->buf_ring_compat->tail :
+			     bl->buf_ring->tail;
 		int j;
 
-		i = bl->buf_ring->tail - bl->head;
+		i = tail - bl->head;
 		for (j = 0; j < bl->buf_nr_pages; j++)
 			unpin_user_page(bl->buf_pages[j]);
 		kvfree(bl->buf_pages);
@@ -469,13 +557,13 @@ int io_provide_buffers(struct io_kiocb *req, unsigned int issue_flags)
 
 int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 {
-	struct io_uring_buf_ring *br;
 	struct io_uring_buf_reg reg;
 	struct io_buffer_list *bl, *free_bl = NULL;
 	struct page **pages;
+	size_t pages_size;
 	int nr_pages;
 
-	if (copy_from_user(&reg, arg, sizeof(reg)))
+	if (copy_io_uring_buf_reg_from_user(ctx, &reg, arg))
 		return -EFAULT;
 
 	if (reg.pad || reg.resv[0] || reg.resv[1] || reg.resv[2])
@@ -508,19 +596,19 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 		return -ENOMEM;
 	}
 
-	pages = io_pin_pages(reg.ring_addr,
-			     struct_size(br, bufs, reg.ring_entries),
AFAICT the new code is not quite equivalent to this line, even in the native case, because struct_size() is equivalent to sizeof(struct) + n * sizeof(flex_array_member). struct io_uring_buf_ring is a bit of a strange union, in the end its size is the size of struct io_uring_buf. IOW I think this line actually yields (n + 1) * sizeof(struct io_uring_buf).
I suspect this is not what was actually intended, and that your change fixes that off-by-one, though I may well be wrong as the way the union is used is not exactly trivial. If that's the case then it should be a separate patch, and it would be a good candidate for upstreaming.
-			     &nr_pages);
+	pages_size = ctx->compat ?
+		     size_mul(sizeof(struct compat_io_uring_buf), reg.ring_entries) :
+		     size_mul(sizeof(struct io_uring_buf), reg.ring_entries);
+	pages = io_pin_pages(reg.ring_addr, pages_size, &nr_pages);
 	if (IS_ERR(pages)) {
 		kfree(free_bl);
 		return PTR_ERR(pages);
 	}
 
-	br = page_address(pages[0]);
 	bl->buf_pages = pages;
 	bl->buf_nr_pages = nr_pages;
 	bl->nr_entries = reg.ring_entries;
-	bl->buf_ring = br;
+	bl->buf_ring = page_address(pages[0]);
 	bl->mask = reg.ring_entries - 1;
 
 	io_buffer_add_list(ctx, bl, reg.bgid);
 	return 0;
@@ -531,7 +619,7 @@ int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	struct io_uring_buf_reg reg;
 	struct io_buffer_list *bl;
 
-	if (copy_from_user(&reg, arg, sizeof(reg)))
+	if (copy_io_uring_buf_reg_from_user(ctx, &reg, arg))
 		return -EFAULT;
 	if (reg.pad || reg.resv[0] || reg.resv[1] || reg.resv[2])
 		return -EINVAL;
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index c23e15d7d3caf..1aa5bbbc5d628 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -2,6 +2,7 @@
 #ifndef IOU_KBUF_H
 #define IOU_KBUF_H
 
+#include <linux/io_uring_types.h>
 #include <uapi/linux/io_uring.h>
 
 struct io_buffer_list {
@@ -13,7 +14,10 @@ struct io_buffer_list {
 		struct list_head buf_list;
 		struct {
 			struct page **buf_pages;
-			struct io_uring_buf_ring *buf_ring;
+			union {
+				struct io_uring_buf_ring *buf_ring;
+				struct compat_io_uring_buf_ring *buf_ring_compat;
+			};
 		};
 	};
 	__u16 bgid;
diff --git a/io_uring/net.c b/io_uring/net.c
index c586278858e7e..4c133bc6f9d1d 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -4,6 +4,7 @@
 #include <linux/file.h>
 #include <linux/slab.h>
 #include <linux/net.h>
+#include <linux/uio.h>
 #include <linux/compat.h>
 #include <net/compat.h>
 #include <linux/io_uring.h>
@@ -435,7 +436,9 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req,
 	} else if (msg.msg_iovlen > 1) {
 		return -EINVAL;
 	} else {
-		if (copy_from_user(iomsg->fast_iov, msg.msg_iov, sizeof(*msg.msg_iov)))
+		void *iov = iovec_from_user(msg.msg_iov, 1, 1, iomsg->fast_iov,
+					    req->ctx->compat);
+		if (IS_ERR(iov))
This looks correct, but do you have any idea why the compat code looks so different? It does not set fast_iov at all in that case, which makes me wonder if it's really equivalent...
 			return -EFAULT;
 		sr->len = iomsg->fast_iov[0].iov_len;
 		iomsg->free_iov = NULL;
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 41e192de9e8a7..eafdb039f9e51 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -23,6 +23,106 @@ struct io_rsrc_update {
 	u32				offset;
 };
 
+#ifdef CONFIG_COMPAT64
+static int get_compat_io_uring_rsrc_update(struct io_uring_rsrc_update2 *up2,
+					   const void __user *user_up)
+{
+	struct compat_io_uring_rsrc_update compat_up;
+
+	if (unlikely(copy_from_user(&compat_up, user_up, sizeof(compat_up))))
+		return -EFAULT;
+	up2->offset = compat_up.offset;
+	up2->resv = compat_up.resv;
+	up2->data = compat_up.data;
+	return 0;
+}
+
+static int get_compat_io_uring_rsrc_update2(struct io_uring_rsrc_update2 *up2,
+					    const void __user *user_up2)
+{
+	struct compat_io_uring_rsrc_update2 compat_up2;
+
+	if (unlikely(copy_from_user(&compat_up2, user_up2, sizeof(compat_up2))))
+		return -EFAULT;
+	up2->offset = compat_up2.offset;
+	up2->resv = compat_up2.resv;
+	up2->data = compat_up2.data;
+	up2->tags = compat_up2.tags;
+	up2->nr = compat_up2.nr;
+	up2->resv2 = compat_up2.resv2;
+	return 0;
+}
+
+static int get_compat_io_uring_rsrc_register(struct io_uring_rsrc_register *rr,
+					     const void __user *user_rr)
+{
+	struct compat_io_uring_rsrc_register compat_rr;
+
+	if (unlikely(copy_from_user(&compat_rr, user_rr, sizeof(compat_rr))))
+		return -EFAULT;
+	rr->nr = compat_rr.nr;
+	rr->flags = compat_rr.flags;
+	rr->resv2 = compat_rr.resv2;
+	rr->data = compat_rr.data;
+	rr->tags = compat_rr.tags;
+	return 0;
+}
+#endif /* CONFIG_COMPAT64 */
+
+static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx,
+					       struct io_uring_rsrc_update2 *up2,
+					       const void __user *arg)
+{
+#ifdef CONFIG_COMPAT64
+	if (ctx->compat)
+		return get_compat_io_uring_rsrc_update(up2, arg);
+#endif
+	return copy_from_user(up2, arg, sizeof(struct io_uring_rsrc_update));
+}
+
+static int copy_io_uring_rsrc_update2_from_user(struct io_ring_ctx *ctx,
+						struct io_uring_rsrc_update2 *up2,
+						const void __user *arg,
+						size_t size)
+{
+#ifdef CONFIG_COMPAT64
+	if (ctx->compat) {
+		if (size != sizeof(struct compat_io_uring_rsrc_update2))
+			return -EINVAL;
+		if (get_compat_io_uring_rsrc_update2(up2, arg))
+			return -EFAULT;
+		return 0;
+	}
+#endif
+	if (size != sizeof(*up2))
+		return -EINVAL;
+	if (copy_from_user(up2, arg, sizeof(*up2)))
+		return -EFAULT;
+	return 0;
+}
+
+static int copy_io_uring_rsrc_register_from_user(struct io_ring_ctx *ctx,
+						 struct io_uring_rsrc_register *rr,
+						 const void __user *arg,
+						 size_t size)
+{
+#ifdef CONFIG_COMPAT64
+	if (ctx->compat) {
+		if (size != sizeof(struct compat_io_uring_rsrc_register))
+			return -EINVAL;
+		if (get_compat_io_uring_rsrc_register(rr, arg))
+			return -EFAULT;
+		return 0;
+	}
+#endif
+	/* keep it extendible */
Nit: I don't think it's worth keeping that comment, especially considering the existing typo :)
+	if (size != sizeof(*rr))
+		return -EINVAL;
+	if (copy_from_user(rr, arg, size))
+		return -EFAULT;
+	return 0;
+}
+
 static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 				  struct io_mapped_ubuf **pimu,
 				  struct page **last_hpage);
@@ -597,12 +697,14 @@ int io_register_files_update(struct io_ring_ctx *ctx, void __user *arg,
 			     unsigned nr_args)
 {
 	struct io_uring_rsrc_update2 up;
+	int ret;
 
 	if (!nr_args)
 		return -EINVAL;
 	memset(&up, 0, sizeof(up));
-	if (copy_from_user(&up, arg, sizeof(struct io_uring_rsrc_update)))
-		return -EFAULT;
+	ret = copy_io_uring_rsrc_update_from_user(ctx, &up, arg);
+	if (ret)
+		return ret;
 	if (up.resv || up.resv2)
 		return -EINVAL;
 	return __io_register_rsrc_update(ctx, IORING_RSRC_FILE, &up, nr_args);
@@ -612,11 +714,11 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
 			    unsigned size, unsigned type)
 {
 	struct io_uring_rsrc_update2 up;
+	int ret;
 
-	if (size != sizeof(up))
-		return -EINVAL;
-	if (copy_from_user(&up, arg, sizeof(up)))
-		return -EFAULT;
+	ret = copy_io_uring_rsrc_update2_from_user(ctx, &up, arg, size);
+	if (ret)
+		return ret;
 	if (!up.nr || up.resv || up.resv2)
 		return -EINVAL;
 	return __io_register_rsrc_update(ctx, type, &up, up.nr);
@@ -626,14 +728,11 @@ __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
 			    unsigned int size, unsigned int type)
 {
 	struct io_uring_rsrc_register rr;
+	int ret;
 
-	/* keep it extendible */
-	if (size != sizeof(rr))
-		return -EINVAL;
-
-	memset(&rr, 0, sizeof(rr));
-	if (copy_from_user(&rr, arg, size))
-		return -EFAULT;
+	ret = copy_io_uring_rsrc_register_from_user(ctx, &rr, arg, size);
+	if (ret)
+		return ret;
 	if (!rr.nr || rr.resv2)
 		return -EINVAL;
 	if (rr.flags & ~IORING_RSRC_REGISTER_SPARSE)
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 96f77450cf4e2..6ab9916ed3844 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -12,6 +12,32 @@
 #include "io_uring.h"
 #include "tctx.h"
 
+#ifdef CONFIG_COMPAT64
+static int get_compat_io_uring_rsrc_update(struct io_uring_rsrc_update *up,
+					   const void __user *user_up)
+{
+	struct compat_io_uring_rsrc_update compat_up;
+
+	if (unlikely(copy_from_user(&compat_up, user_up, sizeof(compat_up))))
+		return -EFAULT;
+	up->offset = compat_up.offset;
+	up->resv = compat_up.resv;
+	up->data = compat_up.data;
+	return 0;
+}
+#endif /* CONFIG_COMPAT64 */
+
+static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx,
+					       struct io_uring_rsrc_update *up,
+					       const void __user *arg)
+{
+#ifdef CONFIG_COMPAT64
+	if (ctx->compat)
+		return get_compat_io_uring_rsrc_update(up, arg);
+#endif
+	return copy_from_user(up, arg, sizeof(struct io_uring_rsrc_update));
+}
+
 static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
 					struct task_struct *task)
 {
@@ -244,8 +270,6 @@ static int io_ring_add_registered_fd(struct io_uring_task *tctx, int fd,
 int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 		       unsigned nr_args)
 {
-	struct io_uring_rsrc_update __user *arg = __arg;
-	struct io_uring_rsrc_update reg;
 	struct io_uring_task *tctx;
 	int ret, i;
 
@@ -260,9 +284,17 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 	tctx = current->io_uring;
 
 	for (i = 0; i < nr_args; i++) {
+		void __user *arg;
+		__u32 __user *arg_offset;
 		int start, end;
+		struct io_uring_rsrc_update reg;
 
-		if (copy_from_user(&reg, &arg[i], sizeof(reg))) {
+		if (ctx->compat)
+			arg = &((struct compat_io_uring_rsrc_update __user *)__arg)[i];
+		else
+			arg = &((struct io_uring_rsrc_update __user *)__arg)[i];
Might be worth adding a helper for that magic, since it somewhat hurts the eyes and it's also used in io_ringfd_unregister().
+
+		if (copy_io_uring_rsrc_update_from_user(ctx, &reg, arg)) {
 			ret = -EFAULT;
 			break;
 		}
@@ -288,8 +320,10 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 		if (ret < 0)
 			break;
-		reg.offset = ret;
Don't we want to keep this line?
Kevin
-		if (put_user(reg.offset, &arg[i].offset)) {
+		arg_offset = ctx->compat ?
+			     &((struct compat_io_uring_rsrc_update __user *)arg)->offset :
+			     &((struct io_uring_rsrc_update __user *)arg)->offset;
+		if (put_user(reg.offset, arg_offset)) {
 			fput(tctx->registered_rings[reg.offset]);
 			tctx->registered_rings[reg.offset] = NULL;
 			ret = -EFAULT;
@@ -303,9 +337,7 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg,
 			 unsigned nr_args)
 {
-	struct io_uring_rsrc_update __user *arg = __arg;
 	struct io_uring_task *tctx = current->io_uring;
-	struct io_uring_rsrc_update reg;
 	int ret = 0, i;
 
 	if (!nr_args || nr_args > IO_RINGFD_REG_MAX)
@@ -314,10 +346,19 @@ int io_ringfd_unregister(struct io_ring_ctx *ctx, void __user *__arg,
 		return 0;
 
 	for (i = 0; i < nr_args; i++) {
-		if (copy_from_user(&reg, &arg[i], sizeof(reg))) {
+		void __user *arg;
+		struct io_uring_rsrc_update reg;
+
+		if (ctx->compat)
+			arg = &((struct compat_io_uring_rsrc_update __user *)__arg)[i];
+		else
+			arg = &((struct io_uring_rsrc_update __user *)__arg)[i];
+
+		if (copy_io_uring_rsrc_update_from_user(ctx, &reg, arg)) {
 			ret = -EFAULT;
 			break;
 		}
 		if (reg.resv || reg.data || reg.offset >= IO_RINGFD_REG_MAX) {
 			ret = -EINVAL;
 			break;
diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
index 7c6697d13cb2e..d67bb30ad543b 100644
--- a/io_uring/uring_cmd.h
+++ b/io_uring/uring_cmd.h
@@ -11,3 +11,10 @@ int io_uring_cmd_prep_async(struct io_kiocb *req);
 #define uring_cmd_pdu_size(is_sqe128) \
 	((1 + !!(is_sqe128)) * sizeof(struct io_uring_sqe) - \
 	 offsetof(struct io_uring_sqe, cmd))
+
+#ifdef CONFIG_COMPAT64
+#define compat_uring_cmd_pdu_size(is_sqe128) \
+	((1 + !!(is_sqe128)) * sizeof(struct compat_io_uring_sqe) - \
+	 offsetof(struct compat_io_uring_sqe, cmd))
+#endif
On 02-03-2023 12:41, Kevin Brodsky wrote:
On 23/02/2023 18:53, Tudor Cretu wrote:
Introduce compat versions of the structs exposed in the uAPI headers that might contain pointers as a member. Also, implement functions that convert the compat versions to the native versions of the struct.
A subsequent patch is going to change the io_uring structs to enable them to support new architectures. On such architectures, the current struct layout still needs to be supported for compat tasks.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
 include/linux/io_uring_types.h | 140 ++++++++++++++++++-
 io_uring/cancel.c              |  38 ++++-
 io_uring/epoll.c               |   2 +-
 io_uring/fdinfo.c              |  55 +++++++-
 io_uring/io_uring.c            | 248 +++++++++++++++++++++++----------
 io_uring/io_uring.h            | 122 +++++++++++++---
 io_uring/kbuf.c                | 108 ++++++++++++--
 io_uring/kbuf.h                |   6 +-
 io_uring/net.c                 |   5 +-
 io_uring/rsrc.c                | 125 +++++++++++++++--
 io_uring/tctx.c                |  57 ++++++--
 io_uring/uring_cmd.h           |   7 +
 12 files changed, 782 insertions(+), 131 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 440179029a8f0..737bc1aa67306 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
+	union {
+		__kernel_rwf_t	rw_flags;
+		__u32		fsync_flags;
+		__u16		poll_events;
+		__u32		poll32_events;
+		__u32		sync_range_flags;
+		__u32		msg_flags;
+		__u32		timeout_flags;
+		__u32		accept_flags;
+		__u32		cancel_flags;
+		__u32		open_flags;
+		__u32		statx_flags;
+		__u32		fadvise_advice;
+		__u32		splice_flags;
+		__u32		rename_flags;
+		__u32		unlink_flags;
+		__u32		hardlink_flags;
+		__u32		xattr_flags;
+		__u32		msg_ring_flags;
+		__u32		uring_cmd_flags;
I see that a few new members have appeared since 5.18. This makes me think, is there really much point in keeping those union members in sync? This is going to prove rather painful when rebasing, in fact they'll probably end up getting out of sync. Maybe we could simply represent the union as the member that we actually use in convert_compat_io_uring_sqe()? A comment next to that member would be enough to make it clear it's actually a union in the native struct.
Aye, great suggestion!
+struct compat_io_uring_cqe {
+	__u64	user_data;
+	__s32	res;
+	__u32	flags;
+	__u64	big_cqe[];
Nit: might as well align it like the rest, even though it's not in the native struct. I think it would make sense to have the same alignment approach for the other structs too, the native struct declarations are not very consistent.
Done!
@@ -216,7 +340,11 @@ struct io_ring_ctx {
 		 * array.
 		 */
 		u32			*sq_array;
+
 		struct io_uring_sqe	*sq_sqes;
Nit: not sure adding a new line helps, it feels clearer to keep all the sq_* stuff in one block.
Done!
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index 2291a53cdabd1..20cb8634e44af 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -27,6 +27,42 @@ struct io_cancel {
 
 #define CANCEL_FLAGS	(IORING_ASYNC_CANCEL_ALL | IORING_ASYNC_CANCEL_FD | \
 			 IORING_ASYNC_CANCEL_ANY | IORING_ASYNC_CANCEL_FD_FIXED)
 
+struct compat_io_uring_sync_cancel_reg {
Shouldn't that struct be with its friends in io_uring_types.h? Same question regarding struct compat_io_uring_buf_reg.
I moved it here because it's currently only used in this file, but I'll move it back to party with its friends.
+	__u64				addr;
+	__s32				fd;
+	__u32				flags;
+	struct __kernel_timespec	timeout;
+	__u64				pad[4];
+};
+#ifdef CONFIG_COMPAT64
The #ifdef'ing makes me think: should we make it more systematic? Right now it looks like only the compat copy helpers and calls to them are #ifdef'd, but arguably it doesn't make sense to #ifdef that and not things like the size selection (e.g. in rings_size()), as we are effectively assuming that the compat and native structs are identical apart from in compat64. Maybe a new helper that returns IS_ENABLED(CONFIG_COMPAT64) && ctx->compat would help? In that case the copy helpers couldn't be #ifdef'd any longer but it may be better this way.
Great suggestion!
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index bc8c9d764bc13..c5bd669081c98 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -89,12 +89,25 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
 	for (i = 0; i < sq_entries; i++) {
 		unsigned int entry = i + sq_head;
 		struct io_uring_sqe *sqe;
-		unsigned int sq_idx;
+		unsigned int sq_idx, sq_off;
+#ifdef CONFIG_COMPAT64
+		struct io_uring_sqe *native_sqe = NULL;
+#endif
 
 		sq_idx = READ_ONCE(ctx->sq_array[entry & sq_mask]);
 		if (sq_idx > sq_mask)
 			continue;
-		sqe = &ctx->sq_sqes[sq_idx << sq_shift];
+		sq_off = sq_idx << sq_shift;
+		sqe = ctx->compat ? (void *)&ctx->sq_sqes_compat[sq_off] : &ctx->sq_sqes[sq_off];
+#ifdef CONFIG_COMPAT64
+		if (ctx->compat) {
+			native_sqe = kmalloc(sizeof(struct io_uring_sqe) << sq_shift, GFP_KERNEL);
It's pretty suboptimal to do a kmalloc() / kfree() in a loop. An alternative would be to have a local struct io_uring_sqe native_sqe[2]; to account for the SQE128 case.
I have considered just having struct io_uring_sqe native_sqe[2]. I'll know to pay more attention when using kmalloc/kfree now, so thanks for pointing it out!
That said for the case of the printing code in this file, I wonder if we shouldn't skip the conversion altogether, since the structs are not passed to any other function. If we move each of the three printing blocks where conversion is needed to a macro, then we can easily handle both the native and compat struct with the same code.
Agree!
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 707229ae04dc8..03506776af043 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -152,6 +152,43 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
 
 static struct kmem_cache *req_cachep;
 
+#ifdef CONFIG_COMPAT64
+static int get_compat_io_uring_getevents_arg(struct io_uring_getevents_arg *arg,
+					     const void __user *user_arg)
+{
+	struct compat_io_uring_getevents_arg compat_arg;
+
+	if (unlikely(copy_from_user(&compat_arg, user_arg, sizeof(compat_arg))))
Not very important but curious: we only seem to be using unlikely(copy_from_user(...)) in the compat helpers, any reason for that?
It's a leftover from the rebase. I will remove the "unlikely" for consistency.
+		return -EFAULT;
+	arg->sigmask = (__kernel_uintptr_t)compat_ptr(compat_arg.sigmask);
Looks like a leftover conversion that belongs to patch 8 instead.
Done!
+	arg->sigmask_sz = compat_arg.sigmask_sz;
+	arg->pad = compat_arg.pad;
+	arg->ts = (__kernel_uintptr_t)compat_ptr(compat_arg.ts);
+	return 0;
+}
+#endif /* CONFIG_COMPAT64 */
+
+static int copy_io_uring_getevents_arg_from_user(struct io_ring_ctx *ctx,
+						 struct io_uring_getevents_arg *arg,
+						 const void __user *argp,
+						 size_t size)
+{
+#ifdef CONFIG_COMPAT64
+	if (ctx->compat) {
+		if (size != sizeof(struct compat_io_uring_getevents_arg))
+			return -EINVAL;
+		if (get_compat_io_uring_getevents_arg(arg, argp))
+			return -EFAULT;
+		return 0;
Or simply return get_compat_io_uring_getevents_arg(arg, argp);
I think this pattern reappears in quite a few copy_*_from_user() helpers.
Done!
+	}
+#endif
+	if (size != sizeof(*arg))
+		return -EINVAL;
+	if (copy_from_user(arg, argp, sizeof(*arg)))
+		return -EFAULT;
+	return 0;
+}
+
 struct sock *io_uring_get_socket(struct file *file)
 {
 #if defined(CONFIG_UNIX)
@@ -604,7 +641,9 @@ void io_cq_unlock_post(struct io_ring_ctx *ctx)
 static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 {
 	bool all_flushed;
-	size_t cqe_size = sizeof(struct io_uring_cqe);
+	size_t cqe_size = ctx->compat ?
+			  sizeof(struct compat_io_uring_cqe) :
+			  sizeof(struct io_uring_cqe);
I'm not sure this is enough to get the overflow machinery to work in compat64. struct io_overflow_cqe gets filled in io_cqring_event_overflow() but AFAICT that code is not compat-aware, so it always initialises the native cqe and not the compat one. I suppose we could make use of __io_fill_cqe_any() in io_cqring_event_overflow()?
This makes me think, are there tests that trigger this overflow mechanism? Issues there might easily go unnoticed otherwise.
I had a better look at this and realised that struct io_overflow_cqe is always filled with a native cqe, so I removed the union from it. Also this change is not needed anymore. Yes, there are tests in liburing, and they pass now for compat as well.
 	if (!force && __io_cqring_events(ctx) == ctx->cq_entries)
 		return false;
@@ -741,10 +780,11 @@ bool io_req_cqe_overflow(struct io_kiocb *req)
  * control dependency is enough as we're using WRITE_ONCE to
  * fill the cq entry
  */
-struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
+void *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
I certainly understand the rationale behind changing the return value to void *, as it is now either struct io_uring_cqe * or struct compat_io_uring_cqe *. That said I'm not convinced this change makes a lot of sense as it stands, because all users of *io_get_cqe* still interpret the return value as struct io_uring_cqe * (which works, as void * is implicitly convertible to any pointer type, not too conveniently here...).
My feeling is that we should really choose one way or the other. Either make such pointers void * everywhere - or maybe better, pointer to a union, to stop them from being implicitly convertible. Or just keep struct io_uring_cqe * everywhere, which is quite common when dealing with compat structs. I don't really have a preference as both approaches have merits, up to you :)
Done! I left it as struct io_uring_cqe *.
 {
 	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
 	unsigned int free, queued, len;
+	void *cqe;
 
 	/*
 	 * Posting into the CQ when there are pending overflowed CQEs may break
@@ -767,14 +807,15 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 		len <<= 1;
 	}
 
-	ctx->cqe_cached = &ctx->cqes[off];
+	cqe = ctx->compat ? (void *)&ctx->cqes_compat[off] : (void *)&ctx->cqes[off];
+	ctx->cqe_cached = cqe;
Don't we have a problem with the whole cqe_cached / cqe_sentinel logic? It assumes that it is manipulating pointers to struct io_uring_cqe, but in fact these are pointers to struct compat_io_uring_cqe in compat64, aren't they? This appeared in 5.19 [1] so you may not have noticed it when rebasing on 6.1.
[1] https://lore.kernel.org/all/487eeef00f3146537b3d9c1a9cef2fc0b9a86f81.1649771...
Indeed it was broken, thank you for the heads up!
@@ -2265,8 +2300,11 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 	io_submit_state_start(&ctx->submit_state, left);
 
 	do {
-		const struct io_uring_sqe *sqe;
+		const void *sqe;
 		struct io_kiocb *req;
+#ifdef CONFIG_COMPAT64
+		struct io_uring_sqe native_sqe;
+#endif
It would seem that this will result in a buffer overflow in the IORING_OP_URING_CMD + IORING_SETUP_SQE128 case, since no space is allocated for the extra data. Maybe the best is to allocate an array of two sqes to cover this situation.
Aye, missed this instance, thank you!
-static int io_validate_ext_arg(unsigned flags, const void __user *argp, size_t argsz)
+static int io_validate_ext_arg(struct io_ring_ctx *ctx, unsigned int flags,
+		const void __user *argp, size_t argsz)
It looks like overflow lines in prototypes are generally aligned on the opening parenthesis in this file, not just indented by a tab.
I have been deceived by the two functions around it. Fixed!
 {
 	if (flags & IORING_ENTER_EXT_ARG) {
 		struct io_uring_getevents_arg arg;
+		int ret;
 
-		if (argsz != sizeof(arg))
-			return -EINVAL;
-		if (copy_from_user(&arg, argp, sizeof(arg)))
-			return -EFAULT;
+		ret = copy_io_uring_getevents_arg_from_user(ctx, &arg, argp, argsz);
+		if (ret)
+			return ret;
Or simply return copy_io_uring_getevents_arg_from_user(...);
Done!
+#ifdef CONFIG_COMPAT64
+#define BUILD_BUG_COMPAT_SQE_ELEM(eoffset, etype, ename) \
+	__BUILD_BUG_VERIFY_OFFSET_SIZE(struct compat_io_uring_sqe, eoffset, sizeof(etype), ename)
+#define BUILD_BUG_COMPAT_SQE_ELEM_SIZE(eoffset, esize, ename) \
+	__BUILD_BUG_VERIFY_OFFSET_SIZE(struct compat_io_uring_sqe, eoffset, esize, ename)
+
+	BUILD_BUG_ON(sizeof(struct compat_io_uring_sqe) != 64);
+	BUILD_BUG_COMPAT_SQE_ELEM(0,  __u8,  opcode);
+	BUILD_BUG_COMPAT_SQE_ELEM(1,  __u8,  flags);
+	BUILD_BUG_COMPAT_SQE_ELEM(2,  __u16, ioprio);
+	BUILD_BUG_COMPAT_SQE_ELEM(4,  __s32, fd);
+	BUILD_BUG_COMPAT_SQE_ELEM(8,  __u64, off);
+	BUILD_BUG_COMPAT_SQE_ELEM(8,  __u64, addr2);
+	BUILD_BUG_COMPAT_SQE_ELEM(8,  __u32, cmd_op);
+	BUILD_BUG_COMPAT_SQE_ELEM(12, __u32, __pad1);
+	BUILD_BUG_COMPAT_SQE_ELEM(16, __u64, addr);
+	BUILD_BUG_COMPAT_SQE_ELEM(16, __u64, splice_off_in);
+	BUILD_BUG_COMPAT_SQE_ELEM(24, __u32, len);
+	BUILD_BUG_COMPAT_SQE_ELEM(28, __kernel_rwf_t, rw_flags);
+	BUILD_BUG_COMPAT_SQE_ELEM(28, /* compat */ int, rw_flags);
+	BUILD_BUG_COMPAT_SQE_ELEM(28, /* compat */ __u32, rw_flags);
I'm a bit confused by these lines, as they are removed from the native checks above but preserved here? I don't think "compat" has much to do with compat64 here, in the case of poll_events below it seems to be about a backwards-compatible change in ABI (see [1]).
I also expected the changes in whitespace above to be reflected here.
[1] https://lore.kernel.org/all/1592387636-80105-2-git-send-email-jiufei.xue@lin...
Done!
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 50bc3af449534..fb2711770bfb0 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
+static inline void convert_compat_io_uring_sqe(struct io_ring_ctx *ctx,
struct io_uring_sqe *sqe,
const struct compat_io_uring_sqe *compat_sqe)
+{
+/*
+ * The struct io_uring_sqe contains anonymous unions and there is no field
+ * keeping track of which union's member is active. Because in all the cases,
+ * the unions are between integral types and the types are compatible, use the
+ * largest member of each union to perform the copy. Use this compile-time check
+ * to ensure that the union's members are not truncated during the conversion.
+ */
+#define BUILD_BUG_COMPAT_SQE_UNION_ELEM(elem1, elem2) \
+	BUILD_BUG_ON(sizeof_field(struct compat_io_uring_sqe, elem1) != \
+		     (offsetof(struct compat_io_uring_sqe, elem2) - \
+		      offsetof(struct compat_io_uring_sqe, elem1)))
+	sqe->opcode = READ_ONCE(compat_sqe->opcode);
+	sqe->flags = READ_ONCE(compat_sqe->flags);
+	sqe->ioprio = READ_ONCE(compat_sqe->ioprio);
+	sqe->fd = READ_ONCE(compat_sqe->fd);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr2, addr);
+	sqe->addr2 = READ_ONCE(compat_sqe->addr2);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr, len);
+	sqe->addr = READ_ONCE(compat_sqe->addr);
+	sqe->len = READ_ONCE(compat_sqe->len);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(rw_flags, user_data);
+	sqe->rw_flags = READ_ONCE(compat_sqe->rw_flags);
+	sqe->user_data = READ_ONCE(compat_sqe->user_data);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(buf_index, personality);
+	sqe->buf_index = READ_ONCE(compat_sqe->buf_index);
+	sqe->personality = READ_ONCE(compat_sqe->personality);
+	BUILD_BUG_COMPAT_SQE_UNION_ELEM(splice_fd_in, addr3);
+	sqe->splice_fd_in = READ_ONCE(compat_sqe->splice_fd_in);
+	if (sqe->opcode == IORING_OP_URING_CMD) {
+		size_t compat_cmd_size = compat_uring_cmd_pdu_size(ctx->flags &
+								   IORING_SETUP_SQE128);
+
+		memcpy(sqe->cmd, compat_sqe->cmd, compat_cmd_size);
+	} else
+		sqe->addr3 = READ_ONCE(compat_sqe->addr3);
It would be better to copy __pad2 too, as some net code does check that it is zero.
Done!
+#undef BUILD_BUG_COMPAT_SQE_UNION_ELEM
+}
+#endif /* CONFIG_COMPAT64 */
+static inline void *io_get_cqe_overflow(struct io_ring_ctx *ctx,
+					bool overflow)
 {
 	if (likely(ctx->cqe_cached < ctx->cqe_sentinel)) {
 		struct io_uring_cqe *cqe = ctx->cqe_cached;
@@ -109,15 +169,46 @@ static inline struct io_uring_cqe *io_get_cqe_overflow(struct io_ring_ctx *ctx,
 	return __io_get_cqe(ctx, overflow);
 }

-static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
+static inline void *io_get_cqe(struct io_ring_ctx *ctx)
 {
 	return io_get_cqe_overflow(ctx, false);
 }

+static inline void __io_fill_cqe_any(struct io_ring_ctx *ctx, struct io_uring_cqe *cqe,
Nit: not sure the "any" suffix really helps here, __io_fill_cqe() is probably enough.
Done!
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index e2c46889d5fab..9f840ba93c888 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
+#ifdef CONFIG_COMPAT64
+static void __user *io_ring_buffer_select_compat(struct io_kiocb *req, size_t *len,
struct io_buffer_list *bl,
unsigned int issue_flags)
+{
+	struct compat_io_uring_buf_ring *br = bl->buf_ring_compat;
+	struct compat_io_uring_buf *buf;
+	__u16 head = bl->head;
+
+	if (unlikely(smp_load_acquire(&br->tail) == head))
+		return NULL;
+
+	head &= bl->mask;
+	if (head < IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE) {
+		buf = &br->bufs[head];
+	} else {
+		int off = head & (IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE - 1);
+		int index = head / IO_BUFFER_LIST_COMPAT_BUF_PER_PAGE;
+		buf = page_address(bl->buf_pages[index]);
+		buf += off;
+	}
+
+	if (*len == 0 || *len > buf->len)
+		*len = buf->len;
+
+	req->flags |= REQ_F_BUFFER_RING;
+	req->buf_list = bl;
+	req->buf_index = buf->bid;
+
+	if (issue_flags & IO_URING_F_UNLOCKED || !file_can_poll(req->file)) {
+		req->buf_list = NULL;
+		bl->head++;
+	}
I agree duplicating that function is the only reasonable option here. It looks like this block could be moved to the common function though, reducing the duplication a bit.
Done!
@@ -508,19 +596,19 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 		return -ENOMEM;
 	}
-	pages = io_pin_pages(reg.ring_addr,
-			     struct_size(br, bufs, reg.ring_entries),
AFAICT the new code is not quite equivalent to this line, even in the native case, because struct_size() is equivalent to sizeof(struct) + n * sizeof(flex_array_member). struct io_uring_buf_ring is a bit of a strange union, in the end its size is the size of struct io_uring_buf. IOW I think this line actually yields (n + 1) * sizeof(struct io_uring_buf).
I suspect this is not what was actually intended, and that your change fixes that off-by-one, though I may well be wrong as the way the union is used is not exactly trivial. If that's the case then it should be a separate patch, and it would be a good candidate for upstreaming.
Indeed, it is an off-by-one error. The man page states that the size of the ring is the product of ring_entries and the size of struct io_uring_buf, so I just did that. The union is there just so that we store the tail of the ring in the same place as the resv field of the first io_uring_buf. ring_entries must be a power of two, otherwise the syscall errs out early, so it can't be 0, and the change should be fine.
diff --git a/io_uring/net.c b/io_uring/net.c
index c586278858e7e..4c133bc6f9d1d 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -4,6 +4,7 @@
 #include <linux/file.h>
 #include <linux/slab.h>
 #include <linux/net.h>
+#include <linux/uio.h>
 #include <linux/compat.h>
 #include <net/compat.h>
 #include <linux/io_uring.h>
@@ -435,7 +436,9 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req,
 	} else if (msg.msg_iovlen > 1) {
 		return -EINVAL;
 	} else {
-		if (copy_from_user(iomsg->fast_iov, msg.msg_iov, sizeof(*msg.msg_iov)))
+		void *iov = iovec_from_user(msg.msg_iov, 1, 1, iomsg->fast_iov,
+					    req->ctx->compat);
+		if (IS_ERR(iov))
This looks correct, but do you have any idea why the compat code looks so different? It does not set fast_iov at all in that case, which makes me wonder if it's really equivalent...
I don't know why one uses copy_from_user and the other uses __get_user. It has been like this from the very beginning and I don't think there is a good reason for it; __get_user would suffice in both cases really. The difference in the setting of fast_iov comes from commit 5476dfed29ad ("io_uring: clean iov usage for recvmsg buf select"), which cleans up only __io_recvmsg_copy_hdr and ignores __io_compat_recvmsg_copy_hdr.
+static int copy_io_uring_rsrc_register_from_user(struct io_ring_ctx *ctx,
struct io_uring_rsrc_register *rr,
const void __user *arg,
size_t size)
+{
+#ifdef CONFIG_COMPAT64
+	if (ctx->compat) {
+		if (size != sizeof(struct compat_io_uring_rsrc_register))
+			return -EINVAL;
+		if (get_compat_io_uring_rsrc_register(rr, arg))
+			return -EFAULT;
+		return 0;
+	}
+#endif
-	/* keep it extendible */
Nit: I don't think it's worth keeping that comment, especially considering the existing typo :)
Done!
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 96f77450cf4e2..6ab9916ed3844 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -260,9 +284,17 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 	tctx = current->io_uring;
 	for (i = 0; i < nr_args; i++) {
+		void __user *arg;
 		__u32 __user *arg_offset;
 		int start, end;
 		struct io_uring_rsrc_update reg;

-		if (copy_from_user(&reg, &arg[i], sizeof(reg))) {
+		if (ctx->compat)
+			arg = &((struct compat_io_uring_rsrc_update __user *)__arg)[i];
+		else
+			arg = &((struct io_uring_rsrc_update __user *)__arg)[i];
Might be worth adding a helper for that magic, since it somewhat hurts the eyes and it's also used in io_ringfd_unregister().
Done!
+		if (copy_io_uring_rsrc_update_from_user(ctx, &reg, arg)) {
 			ret = -EFAULT;
 			break;
 		}
@@ -288,8 +320,10 @@ int io_ringfd_register(struct io_ring_ctx *ctx, void __user *__arg,
 		if (ret < 0)
 			break;
-		reg.offset = ret;
Don't we want to keep this line?
Oops, thank you!
Tudor
Kevin
On 16/03/2023 14:39, Tudor Cretu wrote:
[...]
@@ -604,7 +641,9 @@ void io_cq_unlock_post(struct io_ring_ctx *ctx)
 static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 {
 	bool all_flushed;
-	size_t cqe_size = sizeof(struct io_uring_cqe);
+	size_t cqe_size = ctx->compat ?
+			  sizeof(struct compat_io_uring_cqe) :
+			  sizeof(struct io_uring_cqe);
I'm not sure this is enough to get the overflow machinery to work in compat64. struct io_overflow_cqe gets filled in io_cqring_event_overflow() but AFAICT that code is not compat-aware, so it always initialises the native cqe and not the compat one. I suppose we could make use of __io_fill_cqe_any() in io_cqring_event_overflow()?
This makes me think, are there tests that trigger this overflow mechanism? Issues there might easily go unnoticed otherwise.
I had a better look at this and realised that struct io_overflow_cqe is always filled with a native cqe, so I removed the union from it. This change is therefore no longer needed.
Right I guess that's easier this way, the overflow list is not visible to userspace so it makes sense to always store in the native format and do the conversion in this function like you did in v4.
Yes, there are tests in liburing, and they pass now for compat as well.
Nice!
[...]
@@ -508,19 +596,19 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 		return -ENOMEM;
 	}
-	pages = io_pin_pages(reg.ring_addr,
-			     struct_size(br, bufs, reg.ring_entries),
AFAICT the new code is not quite equivalent to this line, even in the native case, because struct_size() is equivalent to sizeof(struct) + n * sizeof(flex_array_member). struct io_uring_buf_ring is a bit of a strange union, in the end its size is the size of struct io_uring_buf. IOW I think this line actually yields (n + 1) * sizeof(struct io_uring_buf).
I suspect this is not what was actually intended, and that your change fixes that off-by-one, though I may well be wrong as the way the union is used is not exactly trivial. If that's the case then it should be a separate patch, and it would be a good candidate for upstreaming.
Indeed, it is an off-by-one error. The man page states that the size of the ring is the product of ring_entries and the size of struct io_uring_buf, so I just did that. The union is there just so that we store the tail of the ring in the same place as the resv field of the first io_uring_buf. ring_entries must be a power of two, otherwise the syscall errs out early, so it can't be 0, and the change should be fine.
Makes sense.
diff --git a/io_uring/net.c b/io_uring/net.c
index c586278858e7e..4c133bc6f9d1d 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -4,6 +4,7 @@
 #include <linux/file.h>
 #include <linux/slab.h>
 #include <linux/net.h>
+#include <linux/uio.h>
 #include <linux/compat.h>
 #include <net/compat.h>
 #include <linux/io_uring.h>
@@ -435,7 +436,9 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req,
 	} else if (msg.msg_iovlen > 1) {
 		return -EINVAL;
 	} else {
-		if (copy_from_user(iomsg->fast_iov, msg.msg_iov, sizeof(*msg.msg_iov)))
+		void *iov = iovec_from_user(msg.msg_iov, 1, 1, iomsg->fast_iov,
+					    req->ctx->compat);
+		if (IS_ERR(iov))
This looks correct, but do you have any idea why the compat code looks so different? It does not set fast_iov at all in that case, which makes me wonder if it's really equivalent...
I don't know why one uses copy_from_user and the other uses __get_user. It has been like this from the very beginning and I don't think there is a good reason for it; __get_user would suffice in both cases really. The difference in the setting of fast_iov comes from commit 5476dfed29ad ("io_uring: clean iov usage for recvmsg buf select"), which cleans up only __io_recvmsg_copy_hdr and ignores __io_compat_recvmsg_copy_hdr.
I don't think that commit fundamentally changed the situation, which boils down to this: in native, the iov in iomsg is set (now to fast_iov), while in compat it is not. The only rational explanation that I can think of is that setting the iov is not actually needed, but I didn't manage to convince myself this is true.
In any case the situation is the same whether using compat32 or compat64, so it's not essential to investigate this further.
Kevin
Some members of the io_uring uAPI structs may contain user pointers. In the PCuABI, a user pointer is a 129-bit capability, so the __u64 type is not big enough to hold it. Use the __kernel_uintptr_t type instead, which is big enough on the affected architectures while remaining 64-bit on others.
The user_data field must be passed unchanged from the submission queue to the completion queue. As it is standard practice to store a pointer in user_data, expand the field to __kernel_uintptr_t. However, the kernel doesn't dereference the user_data, so don't cast it to a compat_ptr.
In addition, for the io_uring structs containing user pointers, use the special copy routines when copying user pointers from/to userspace.
Note that the structs io_uring_sqe and io_uring_cqe are doubled in size in PCuABI. The setup flags IORING_SETUP_SQE128 and IORING_SETUP_CQE32 used to double the sizes of the two structs up to 128 bytes and 32 bytes respectively. In PCuABI, the two flags are still used to double the sizes of the two structs, but, as they increased in size, they increase up to 256 bytes and 64 bytes.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 include/linux/io_uring_types.h |  4 +--
 include/uapi/linux/io_uring.h  | 62 +++++++++++++++++++---------------
 io_uring/advise.c              |  2 +-
 io_uring/cancel.c              |  6 ++--
 io_uring/cancel.h              |  2 +-
 io_uring/epoll.c               |  2 +-
 io_uring/fdinfo.c              |  9 ++---
 io_uring/fs.c                  | 16 ++++-----
 io_uring/io_uring.c            | 49 +++++++++++++++++++++++----
 io_uring/io_uring.h            | 16 ++++-----
 io_uring/kbuf.c                | 15 ++++----
 io_uring/kbuf.h                |  2 +-
 io_uring/msg_ring.c            |  4 +--
 io_uring/net.c                 | 20 +++++------
 io_uring/openclose.c           |  4 +--
 io_uring/poll.c                |  4 +--
 io_uring/rsrc.c                | 41 +++++++++++-----------
 io_uring/rw.c                  | 13 ++++---
 io_uring/statx.c               |  4 +--
 io_uring/tctx.c                |  4 +--
 io_uring/timeout.c             | 10 +++---
 io_uring/uring_cmd.c           |  5 +++
 io_uring/xattr.c               | 12 +++----
 23 files changed, 178 insertions(+), 128 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index 737bc1aa67306..1407ccadf0575 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -604,8 +604,8 @@ struct io_task_work { };
struct io_cqe { - __u64 user_data; - __s32 res; + __kernel_uintptr_t user_data; + __s32 res; /* fd initially, then cflags for completion */ union { __u32 flags; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 2df3225b562fa..14aa30151c5f9 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -11,6 +11,11 @@ #include <linux/fs.h> #include <linux/types.h> #include <linux/time_types.h> +#ifdef __KERNEL__ +#include <linux/stddef.h> +#else +#include <stddef.h> +#endif
#ifdef __cplusplus extern "C" { @@ -25,16 +30,16 @@ struct io_uring_sqe { __u16 ioprio; /* ioprio for the request */ __s32 fd; /* file descriptor to do IO on */ union { - __u64 off; /* offset into file */ - __u64 addr2; + __u64 off; /* offset into file */ + __kernel_uintptr_t addr2; struct { __u32 cmd_op; __u32 __pad1; }; }; union { - __u64 addr; /* pointer to buffer or iovecs */ - __u64 splice_off_in; + __kernel_uintptr_t addr; /* pointer to buffer or iovecs */ + __u64 splice_off_in; }; __u32 len; /* buffer size or number of iovecs */ union { @@ -58,7 +63,7 @@ struct io_uring_sqe { __u32 msg_ring_flags; __u32 uring_cmd_flags; }; - __u64 user_data; /* data to be passed back at completion time */ + __kernel_uintptr_t user_data; /* data to be passed back at completion time */ /* pack this to avoid bogus arm OABI complaints */ union { /* index into fixed buffers, if used */ @@ -78,12 +83,14 @@ struct io_uring_sqe { }; union { struct { - __u64 addr3; - __u64 __pad2[1]; + __kernel_uintptr_t addr3; + __u64 __pad2[1]; }; /* * If the ring is initialized with IORING_SETUP_SQE128, then - * this field is used for 80 bytes of arbitrary command data + * this field is used to double the size of the + * struct io_uring_sqe to store bytes of arbitrary + * command data, i.e. 80 bytes or 160 bytes in PCuABI */ __u8 cmd[0]; }; @@ -326,13 +333,14 @@ enum { * IO completion data structure (Completion Queue Entry) */ struct io_uring_cqe { - __u64 user_data; /* sqe->data submission passed back */ - __s32 res; /* result code for this event */ - __u32 flags; + __kernel_uintptr_t user_data; /* sqe->data submission passed back */ + __s32 res; /* result code for this event */ + __u32 flags;
/* * If the ring is initialized with IORING_SETUP_CQE32, then this field - * contains 16-bytes of padding, doubling the size of the CQE. + * doubles the size of the CQE, i.e. contains 16 bytes, or in PCuABI, + * 32 bytes of padding. */ __u64 big_cqe[]; }; @@ -504,7 +512,7 @@ enum { struct io_uring_files_update { __u32 offset; __u32 resv; - __aligned_u64 /* __s32 * */ fds; + __kernel_aligned_uintptr_t /* __s32 * */ fds; };
/* @@ -517,21 +525,21 @@ struct io_uring_rsrc_register { __u32 nr; __u32 flags; __u64 resv2; - __aligned_u64 data; - __aligned_u64 tags; + __kernel_aligned_uintptr_t data; + __kernel_aligned_uintptr_t tags; };
struct io_uring_rsrc_update { __u32 offset; __u32 resv; - __aligned_u64 data; + __kernel_aligned_uintptr_t data; };
struct io_uring_rsrc_update2 { __u32 offset; __u32 resv; - __aligned_u64 data; - __aligned_u64 tags; + __kernel_aligned_uintptr_t data; + __kernel_aligned_uintptr_t tags; __u32 nr; __u32 resv2; }; @@ -581,7 +589,7 @@ struct io_uring_restriction { };
struct io_uring_buf { - __u64 addr; + __kernel_uintptr_t addr; __u32 len; __u16 bid; __u16 resv; @@ -594,9 +602,7 @@ struct io_uring_buf_ring { * ring tail is overlaid with the io_uring_buf->resv field. */ struct { - __u64 resv1; - __u32 resv2; - __u16 resv3; + __u8 resv1[offsetof(struct io_uring_buf, resv)]; __u16 tail; }; struct io_uring_buf bufs[0]; @@ -605,7 +611,7 @@ struct io_uring_buf_ring {
/* argument for IORING_(UN)REGISTER_PBUF_RING */ struct io_uring_buf_reg { - __u64 ring_addr; + __kernel_uintptr_t ring_addr; __u32 ring_entries; __u16 bgid; __u16 pad; @@ -632,17 +638,17 @@ enum { };
struct io_uring_getevents_arg { - __u64 sigmask; - __u32 sigmask_sz; - __u32 pad; - __u64 ts; + __kernel_uintptr_t sigmask; + __u32 sigmask_sz; + __u32 pad; + __kernel_uintptr_t ts; };
/* * Argument for IORING_REGISTER_SYNC_CANCEL */ struct io_uring_sync_cancel_reg { - __u64 addr; + __kernel_uintptr_t addr; __s32 fd; __u32 flags; struct __kernel_timespec timeout; diff --git a/io_uring/advise.c b/io_uring/advise.c index 449c6f14649f7..5bb8094204979 100644 --- a/io_uring/advise.c +++ b/io_uring/advise.c @@ -23,7 +23,7 @@ struct io_fadvise {
struct io_madvise { struct file *file; - u64 addr; + __kernel_uintptr_t addr; u32 len; u32 advice; }; diff --git a/io_uring/cancel.c b/io_uring/cancel.c index 20cb8634e44af..372e2442231a4 100644 --- a/io_uring/cancel.c +++ b/io_uring/cancel.c @@ -19,7 +19,7 @@
struct io_cancel { struct file *file; - u64 addr; + __kernel_uintptr_t addr; u32 flags; s32 fd; }; @@ -43,7 +43,7 @@ static int get_compat_io_uring_sync_cancel_reg(struct io_uring_sync_cancel_reg *
if (unlikely(copy_from_user(&compat_sc, user_sc, sizeof(compat_sc)))) return -EFAULT; - sc->addr = compat_sc.addr; + sc->addr = (__kernel_uintptr_t)compat_ptr(compat_sc.addr); sc->fd = compat_sc.fd; sc->flags = compat_sc.flags; sc->timeout = compat_sc.timeout; @@ -60,7 +60,7 @@ static int copy_io_uring_sync_cancel_reg_from_user(struct io_ring_ctx *ctx, if (ctx->compat) return get_compat_io_uring_sync_cancel_reg(sc, arg); #endif - return copy_from_user(sc, arg, sizeof(*sc)); + return copy_from_user_with_ptr(sc, arg, sizeof(*sc)); }
static bool io_cancel_cb(struct io_wq_work *work, void *data) diff --git a/io_uring/cancel.h b/io_uring/cancel.h index 6a59ee484d0cc..7c1249d61bf25 100644 --- a/io_uring/cancel.h +++ b/io_uring/cancel.h @@ -5,7 +5,7 @@ struct io_cancel_data { struct io_ring_ctx *ctx; union { - u64 data; + __kernel_uintptr_t data; struct file *file; }; u32 flags; diff --git a/io_uring/epoll.c b/io_uring/epoll.c index d5580ff465c3e..d9d5983f823c2 100644 --- a/io_uring/epoll.c +++ b/io_uring/epoll.c @@ -39,7 +39,7 @@ int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (ep_op_has_event(epoll->op)) { struct epoll_event __user *ev;
- ev = u64_to_user_ptr(READ_ONCE(sqe->addr)); + ev = (struct epoll_event __user *)READ_ONCE(sqe->addr); if (copy_epoll_event_from_user(&epoll->event, ev, req->ctx->compat)) return -EFAULT; } diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c index c5bd669081c98..41628302d1f8b 100644 --- a/io_uring/fdinfo.c +++ b/io_uring/fdinfo.c @@ -114,7 +114,7 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, sq_idx, io_uring_get_opcode(sqe->opcode), sqe->fd, sqe->flags, (unsigned long long) sqe->off, (unsigned long long) sqe->addr, sqe->rw_flags, - sqe->buf_index, sqe->user_data); + sqe->buf_index, (unsigned long long) sqe->user_data); if (sq_shift) { u64 *sqeb = (void *) (sqe + 1); int size = (ctx->compat ? sizeof(struct compat_io_uring_sqe) @@ -153,8 +153,8 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, #endif
seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x", - entry & cq_mask, cqe->user_data, cqe->res, - cqe->flags); + entry & cq_mask, (unsigned long long) cqe->user_data, + cqe->res, cqe->flags); if (cq_shift) seq_printf(m, ", extra1:%llu, extra2:%llu\n", cqe->big_cqe[0], cqe->big_cqe[1]); @@ -235,7 +235,8 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, #endif
seq_printf(m, " user_data=%llu, res=%d, flags=%x\n", - cqe->user_data, cqe->res, cqe->flags); + (unsigned long long) cqe->user_data, cqe->res, + cqe->flags);
#ifdef CONFIG_COMPAT64 kfree(native_cqe); diff --git a/io_uring/fs.c b/io_uring/fs.c index 7100c293c13a8..2e01e7da1d4ba 100644 --- a/io_uring/fs.c +++ b/io_uring/fs.c @@ -58,8 +58,8 @@ int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EBADF;
ren->old_dfd = READ_ONCE(sqe->fd); - oldf = u64_to_user_ptr(READ_ONCE(sqe->addr)); - newf = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + oldf = (char __user *)READ_ONCE(sqe->addr); + newf = (char __user *)READ_ONCE(sqe->addr2); ren->new_dfd = READ_ONCE(sqe->len); ren->flags = READ_ONCE(sqe->rename_flags);
@@ -117,7 +117,7 @@ int io_unlinkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (un->flags & ~AT_REMOVEDIR) return -EINVAL;
- fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); + fname = (char __user *)READ_ONCE(sqe->addr); un->filename = getname(fname); if (IS_ERR(un->filename)) return PTR_ERR(un->filename); @@ -164,7 +164,7 @@ int io_mkdirat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) mkd->dfd = READ_ONCE(sqe->fd); mkd->mode = READ_ONCE(sqe->len);
- fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); + fname = (char __user *)READ_ONCE(sqe->addr); mkd->filename = getname(fname); if (IS_ERR(mkd->filename)) return PTR_ERR(mkd->filename); @@ -206,8 +206,8 @@ int io_symlinkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EBADF;
sl->new_dfd = READ_ONCE(sqe->fd); - oldpath = u64_to_user_ptr(READ_ONCE(sqe->addr)); - newpath = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + oldpath = (char __user *)READ_ONCE(sqe->addr); + newpath = (char __user *)READ_ONCE(sqe->addr2);
sl->oldpath = getname(oldpath); if (IS_ERR(sl->oldpath)) @@ -250,8 +250,8 @@ int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
lnk->old_dfd = READ_ONCE(sqe->fd); lnk->new_dfd = READ_ONCE(sqe->len); - oldf = u64_to_user_ptr(READ_ONCE(sqe->addr)); - newf = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + oldf = (char __user *)READ_ONCE(sqe->addr); + newf = (char __user *)READ_ONCE(sqe->addr2); lnk->flags = READ_ONCE(sqe->hardlink_flags);
lnk->oldpath = getname(oldf); diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 03506776af043..24c1c9fb9b298 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -184,7 +184,7 @@ static int copy_io_uring_getevents_arg_from_user(struct io_ring_ctx *ctx, #endif if (size != sizeof(*arg)) return -EINVAL; - if (copy_from_user(arg, argp, sizeof(*arg))) + if (copy_from_user_with_ptr(arg, argp, sizeof(*arg))) return -EFAULT; return 0; } @@ -726,7 +726,7 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task) } }
-static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, +static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, s32 res, u32 cflags, u64 extra1, u64 extra2) { struct io_overflow_cqe *ocqe; @@ -818,7 +818,7 @@ void *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow) return cqe; }
-bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags, +bool io_fill_cqe_aux(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, s32 res, u32 cflags, bool allow_overflow) { struct io_uring_cqe *cqe; @@ -845,7 +845,7 @@ bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags }
bool io_post_aux_cqe(struct io_ring_ctx *ctx, - u64 user_data, s32 res, u32 cflags, + __kernel_uintptr_t user_data, s32 res, u32 cflags, bool allow_overflow) { bool filled; @@ -3213,9 +3213,9 @@ static int io_get_ext_arg(struct io_ring_ctx *ctx, unsigned int flags, return ret; if (arg.pad) return -EINVAL; - *sig = u64_to_user_ptr(arg.sigmask); + *sig = (sigset_t __user *)arg.sigmask; *argsz = arg.sigmask_sz; - *ts = u64_to_user_ptr(arg.ts); + *ts = (struct __kernel_timespec __user *)arg.ts; return 0; }
@@ -4158,6 +4158,42 @@ static int __init io_uring_init(void) __BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, sizeof(etype), ename) #define BUILD_BUG_SQE_ELEM_SIZE(eoffset, esize, ename) \ __BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, esize, ename) +#ifdef CONFIG_CHERI_PURECAP_UABI + BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 128); + BUILD_BUG_SQE_ELEM(0, __u8, opcode); + BUILD_BUG_SQE_ELEM(1, __u8, flags); + BUILD_BUG_SQE_ELEM(2, __u16, ioprio); + BUILD_BUG_SQE_ELEM(4, __s32, fd); + BUILD_BUG_SQE_ELEM(16, __u64, off); + BUILD_BUG_SQE_ELEM(16, __uintcap_t, addr2); + BUILD_BUG_SQE_ELEM(32, __uintcap_t, addr); + BUILD_BUG_SQE_ELEM(32, __u64, splice_off_in); + BUILD_BUG_SQE_ELEM(48, __u32, len); + BUILD_BUG_SQE_ELEM(52, __kernel_rwf_t, rw_flags); + BUILD_BUG_SQE_ELEM(52, __u32, fsync_flags); + BUILD_BUG_SQE_ELEM(52, __u16, poll_events); + BUILD_BUG_SQE_ELEM(52, __u32, poll32_events); + BUILD_BUG_SQE_ELEM(52, __u32, sync_range_flags); + BUILD_BUG_SQE_ELEM(52, __u32, msg_flags); + BUILD_BUG_SQE_ELEM(52, __u32, timeout_flags); + BUILD_BUG_SQE_ELEM(52, __u32, accept_flags); + BUILD_BUG_SQE_ELEM(52, __u32, cancel_flags); + BUILD_BUG_SQE_ELEM(52, __u32, open_flags); + BUILD_BUG_SQE_ELEM(52, __u32, statx_flags); + BUILD_BUG_SQE_ELEM(52, __u32, fadvise_advice); + BUILD_BUG_SQE_ELEM(52, __u32, splice_flags); + BUILD_BUG_SQE_ELEM(64, __uintcap_t, user_data); + BUILD_BUG_SQE_ELEM(80, __u16, buf_index); + BUILD_BUG_SQE_ELEM(80, __u16, buf_group); + BUILD_BUG_SQE_ELEM(82, __u16, personality); + BUILD_BUG_SQE_ELEM(84, __s32, splice_fd_in); + BUILD_BUG_SQE_ELEM(84, __u32, file_index); + BUILD_BUG_SQE_ELEM(84, __u16, addr_len); + BUILD_BUG_SQE_ELEM(86, __u16, __pad3[0]); + BUILD_BUG_SQE_ELEM(96, __uintcap_t, addr3); + BUILD_BUG_SQE_ELEM_SIZE(96, 0, cmd); + BUILD_BUG_SQE_ELEM(112, __u64, __pad2); +#else /* !CONFIG_CHERI_PURECAP_UABI */ BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 64); BUILD_BUG_SQE_ELEM(0, __u8, opcode); BUILD_BUG_SQE_ELEM(1, __u8, flags); 
@@ -4198,6 +4234,7 @@ static int __init io_uring_init(void) BUILD_BUG_SQE_ELEM(48, __u64, addr3); BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd); BUILD_BUG_SQE_ELEM(56, __u64, __pad2); +#endif /* !CONFIG_CHERI_PURECAP_UABI */
BUILD_BUG_ON(sizeof(struct io_uring_files_update) != sizeof(struct io_uring_rsrc_update)); diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index fb2711770bfb0..f8c4e6d61124b 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -34,9 +34,9 @@ void io_req_complete_failed(struct io_kiocb *req, s32 res); void __io_req_complete(struct io_kiocb *req, unsigned issue_flags); void io_req_complete_post(struct io_kiocb *req); void __io_req_complete_post(struct io_kiocb *req); -bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags, +bool io_post_aux_cqe(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, s32 res, u32 cflags, bool allow_overflow); -bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags, +bool io_fill_cqe_aux(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, s32 res, u32 cflags, bool allow_overflow); void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
@@ -99,7 +99,7 @@ static inline void convert_compat_io_uring_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe *cqe, const struct compat_io_uring_cqe *compat_cqe) { - cqe->user_data = READ_ONCE(compat_cqe->user_data); + cqe->user_data = (__kernel_uintptr_t)READ_ONCE(compat_cqe->user_data); cqe->res = READ_ONCE(compat_cqe->res); cqe->flags = READ_ONCE(compat_cqe->flags);
@@ -130,13 +130,13 @@ static inline void convert_compat_io_uring_sqe(struct io_ring_ctx *ctx, sqe->ioprio = READ_ONCE(compat_sqe->ioprio); sqe->fd = READ_ONCE(compat_sqe->fd); BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr2, addr); - sqe->addr2 = READ_ONCE(compat_sqe->addr2); + sqe->addr2 = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr2)); BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr, len); - sqe->addr = READ_ONCE(compat_sqe->addr); + sqe->addr = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr)); sqe->len = READ_ONCE(compat_sqe->len); BUILD_BUG_COMPAT_SQE_UNION_ELEM(rw_flags, user_data); sqe->rw_flags = READ_ONCE(compat_sqe->rw_flags); - sqe->user_data = READ_ONCE(compat_sqe->user_data); + sqe->user_data = (__kernel_uintptr_t)READ_ONCE(compat_sqe->user_data); BUILD_BUG_COMPAT_SQE_UNION_ELEM(buf_index, personality); sqe->buf_index = READ_ONCE(compat_sqe->buf_index); sqe->personality = READ_ONCE(compat_sqe->personality); @@ -148,7 +148,7 @@ static inline void convert_compat_io_uring_sqe(struct io_ring_ctx *ctx,
memcpy(sqe->cmd, compat_sqe->cmd, compat_cmd_size); } else - sqe->addr3 = READ_ONCE(compat_sqe->addr3); + sqe->addr3 = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr3)); #undef BUILD_BUG_COMPAT_SQE_UNION_ELEM } #endif /* CONFIG_COMPAT64 */ @@ -175,7 +175,7 @@ static inline void *io_get_cqe(struct io_ring_ctx *ctx) }
static inline void __io_fill_cqe_any(struct io_ring_ctx *ctx, struct io_uring_cqe *cqe, - u64 user_data, s32 res, u32 cflags, + __kernel_uintptr_t user_data, s32 res, u32 cflags, u64 extra1, u64 extra2) { #ifdef CONFIG_COMPAT64 diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c index 9f840ba93c888..414c5064222b4 100644 --- a/io_uring/kbuf.c +++ b/io_uring/kbuf.c @@ -24,7 +24,7 @@
struct io_provide_buf { struct file *file; - __u64 addr; + __kernel_uintptr_t addr; __u32 len; __u32 bgid; __u16 nbufs; @@ -47,7 +47,7 @@ static int get_compat_io_uring_buf_reg(struct io_uring_buf_reg *reg,
if (unlikely(copy_from_user(&compat_reg, user_reg, sizeof(compat_reg)))) return -EFAULT; - reg->ring_addr = compat_reg.ring_addr; + reg->ring_addr = (__kernel_uintptr_t)compat_ptr(compat_reg.ring_addr); reg->ring_entries = compat_reg.ring_entries; reg->bgid = compat_reg.bgid; reg->pad = compat_reg.pad; @@ -64,7 +64,7 @@ static int copy_io_uring_buf_reg_from_user(struct io_ring_ctx *ctx, if (ctx->compat) return get_compat_io_uring_buf_reg(reg, arg); #endif - return copy_from_user(reg, arg, sizeof(*reg)); + return copy_from_user_with_ptr(reg, arg, sizeof(*reg)); }
static inline struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx *ctx, @@ -159,7 +159,7 @@ static void __user *io_provided_buffer_select(struct io_kiocb *req, size_t *len, req->flags |= REQ_F_BUFFER_SELECTED; req->kbuf = kbuf; req->buf_index = kbuf->bid; - return u64_to_user_ptr(kbuf->addr); + return (void __user *)kbuf->addr; } return NULL; } @@ -239,7 +239,7 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len, req->buf_list = NULL; bl->head++; } - return u64_to_user_ptr(buf->addr); + return (void __user *)buf->addr; }
static void __user *io_ring_buffer_select_any(struct io_kiocb *req, size_t *len, @@ -427,7 +427,7 @@ int io_provide_buffers_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe return -EOVERFLOW;
size = (unsigned long)p->len * p->nbufs; - if (!access_ok(u64_to_user_ptr(p->addr), size)) + if (!access_ok(p->addr, size)) return -EFAULT;
p->bgid = READ_ONCE(sqe->buf_group); @@ -487,7 +487,7 @@ static int io_add_buffers(struct io_ring_ctx *ctx, struct io_provide_buf *pbuf, struct io_buffer_list *bl) { struct io_buffer *buf; - u64 addr = pbuf->addr; + __kernel_uintptr_t addr = pbuf->addr; int i, bid = pbuf->bid;
for (i = 0; i < pbuf->nbufs; i++) { @@ -599,6 +599,7 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg) pages_size = ctx->compat ? size_mul(sizeof(struct compat_io_uring_buf), reg.ring_entries) : size_mul(sizeof(struct io_uring_buf), reg.ring_entries); + /* TODO [PCuABI] - capability checks for uaccess */ pages = io_pin_pages(reg.ring_addr, pages_size, &nr_pages); if (IS_ERR(pages)) { kfree(free_bl); diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h index 1aa5bbbc5d628..50a0a7524e6b6 100644 --- a/io_uring/kbuf.h +++ b/io_uring/kbuf.h @@ -31,7 +31,7 @@ struct io_buffer_list {
struct io_buffer { struct list_head list; - __u64 addr; + __kernel_uintptr_t addr; __u32 len; __u16 bid; __u16 bgid; diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c index 90d2fc6fd80e4..654f5ad0b11c0 100644 --- a/io_uring/msg_ring.c +++ b/io_uring/msg_ring.c @@ -15,7 +15,7 @@
struct io_msg { struct file *file; - u64 user_data; + __kernel_uintptr_t user_data; u32 len; u32 cmd; u32 src_fd; @@ -130,7 +130,7 @@ int io_msg_ring_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (unlikely(sqe->buf_index || sqe->personality)) return -EINVAL;
- msg->user_data = READ_ONCE(sqe->off); + msg->user_data = READ_ONCE(sqe->addr2); msg->len = READ_ONCE(sqe->len); msg->cmd = READ_ONCE(sqe->addr); msg->src_fd = READ_ONCE(sqe->addr3); diff --git a/io_uring/net.c b/io_uring/net.c index 4c133bc6f9d1d..a9143b4652114 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -243,13 +243,13 @@ int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (req->opcode == IORING_OP_SEND) { if (READ_ONCE(sqe->__pad3[0])) return -EINVAL; - sr->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + sr->addr = (void __user *)READ_ONCE(sqe->addr2); sr->addr_len = READ_ONCE(sqe->addr_len); } else if (sqe->addr2 || sqe->file_index) { return -EINVAL; }
- sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->umsg = (struct user_msghdr __user *)READ_ONCE(sqe->addr); sr->len = READ_ONCE(sqe->len); sr->flags = READ_ONCE(sqe->ioprio); if (sr->flags & ~IORING_RECVSEND_POLL_FIRST) @@ -421,7 +421,7 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct user_msghdr msg; int ret;
- if (copy_from_user(&msg, sr->umsg, sizeof(*sr->umsg))) + if (copy_from_user_with_ptr(&msg, sr->umsg, sizeof(*sr->umsg))) return -EFAULT;
ret = __copy_msghdr(&iomsg->msg, &msg, &iomsg->uaddr); @@ -549,7 +549,7 @@ int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (unlikely(sqe->file_index || sqe->addr2)) return -EINVAL;
- sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr)); + sr->umsg = (struct user_msghdr __user *)READ_ONCE(sqe->addr); sr->len = READ_ONCE(sqe->len); sr->flags = READ_ONCE(sqe->ioprio); if (sr->flags & ~(RECVMSG_FLAGS)) @@ -966,7 +966,7 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (req->opcode == IORING_OP_SEND_ZC) { if (READ_ONCE(sqe->__pad3[0])) return -EINVAL; - zc->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + zc->addr = (void __user *)READ_ONCE(sqe->addr2); zc->addr_len = READ_ONCE(sqe->addr_len); } else { if (unlikely(sqe->addr2 || sqe->file_index)) @@ -975,7 +975,7 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EINVAL; }
- zc->buf = u64_to_user_ptr(READ_ONCE(sqe->addr)); + zc->buf = (void __user *)READ_ONCE(sqe->addr); zc->len = READ_ONCE(sqe->len); zc->msg_flags = READ_ONCE(sqe->msg_flags) | MSG_NOSIGNAL; if (zc->msg_flags & MSG_DONTWAIT) @@ -1242,8 +1242,8 @@ int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (sqe->len || sqe->buf_index) return -EINVAL;
- accept->addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); - accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + accept->addr = (void __user *)READ_ONCE(sqe->addr); + accept->addr_len = (int __user *)READ_ONCE(sqe->addr2); accept->flags = READ_ONCE(sqe->accept_flags); accept->nofile = rlimit(RLIMIT_NOFILE); flags = READ_ONCE(sqe->ioprio); @@ -1392,8 +1392,8 @@ int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (sqe->len || sqe->buf_index || sqe->rw_flags || sqe->splice_fd_in) return -EINVAL;
- conn->addr = u64_to_user_ptr(READ_ONCE(sqe->addr)); - conn->addr_len = READ_ONCE(sqe->addr2); + conn->addr = (void __user *)READ_ONCE(sqe->addr); + conn->addr_len = READ_ONCE(sqe->addr2); conn->in_progress = false; return 0; } diff --git a/io_uring/openclose.c b/io_uring/openclose.c index 67178e4bb282d..0a5c838885306 100644 --- a/io_uring/openclose.c +++ b/io_uring/openclose.c @@ -47,7 +47,7 @@ static int __io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe open->how.flags |= O_LARGEFILE;
open->dfd = READ_ONCE(sqe->fd); - fname = u64_to_user_ptr(READ_ONCE(sqe->addr)); + fname = (char __user *)READ_ONCE(sqe->addr); open->filename = getname(fname); if (IS_ERR(open->filename)) { ret = PTR_ERR(open->filename); @@ -81,7 +81,7 @@ int io_openat2_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) size_t len; int ret;
- how = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + how = (struct open_how __user *)READ_ONCE(sqe->addr2); len = READ_ONCE(sqe->len); if (len < OPEN_HOW_SIZE_VER0) return -EINVAL; diff --git a/io_uring/poll.c b/io_uring/poll.c index d9bf1767867e6..4914048fed0f8 100644 --- a/io_uring/poll.c +++ b/io_uring/poll.c @@ -22,8 +22,8 @@
struct io_poll_update { struct file *file; - u64 old_user_data; - u64 new_user_data; + __kernel_uintptr_t old_user_data; + __kernel_uintptr_t new_user_data; __poll_t events; bool update_events; bool update_user_data; diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index eafdb039f9e51..f94d7475cb4d8 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -18,7 +18,7 @@
struct io_rsrc_update { struct file *file; - u64 arg; + __kernel_uintptr_t arg; u32 nr_args; u32 offset; }; @@ -33,7 +33,7 @@ static int get_compat_io_uring_rsrc_update(struct io_uring_rsrc_update2 *up2, return -EFAULT; up2->offset = compat_up.offset; up2->resv = compat_up.resv; - up2->data = compat_up.data; + up2->data = (__kernel_uintptr_t)compat_ptr(compat_up.data); return 0; }
@@ -46,8 +46,8 @@ static int get_compat_io_uring_rsrc_update2(struct io_uring_rsrc_update2 *up2, return -EFAULT; up2->offset = compat_up2.offset; up2->resv = compat_up2.resv; - up2->data = compat_up2.data; - up2->tags = compat_up2.tags; + up2->data = (__kernel_uintptr_t)compat_ptr(compat_up2.data); + up2->tags = (__kernel_uintptr_t)compat_ptr(compat_up2.tags); up2->nr = compat_up2.nr; up2->resv2 = compat_up2.resv2; return 0; @@ -63,8 +63,8 @@ static int get_compat_io_uring_rsrc_register(struct io_uring_rsrc_register *rr, rr->nr = compat_rr.nr; rr->flags = compat_rr.flags; rr->resv2 = compat_rr.resv2; - rr->data = compat_rr.data; - rr->tags = compat_rr.tags; + rr->data = (__kernel_uintptr_t)compat_ptr(compat_rr.data); + rr->tags = (__kernel_uintptr_t)compat_ptr(compat_rr.tags); return 0; } #endif /* CONFIG_COMPAT64 */ @@ -77,7 +77,7 @@ static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx, if (ctx->compat) return get_compat_io_uring_rsrc_update(up2, arg); #endif - return copy_from_user(up2, arg, sizeof(struct io_uring_rsrc_update)); + return copy_from_user_with_ptr(up2, arg, sizeof(struct io_uring_rsrc_update)); }
static int copy_io_uring_rsrc_update2_from_user(struct io_ring_ctx *ctx, @@ -96,7 +96,7 @@ static int copy_io_uring_rsrc_update2_from_user(struct io_ring_ctx *ctx, #endif if (size != sizeof(*up2)) return -EINVAL; - if (copy_from_user(up2, arg, sizeof(*up2))) + if (copy_from_user_with_ptr(up2, arg, sizeof(*up2))) return -EFAULT; return 0; } @@ -118,7 +118,7 @@ static int copy_io_uring_rsrc_register_from_user(struct io_ring_ctx *ctx, /* keep it extendible */ if (size != sizeof(*rr)) return -EINVAL; - if (copy_from_user(rr, arg, size)) + if (copy_from_user_with_ptr(rr, arg, size)) return -EFAULT; return 0; } @@ -201,13 +201,14 @@ static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst, if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov))) return -EFAULT;
- dst->iov_base = u64_to_user_ptr((u64)ciov.iov_base); + dst->iov_base = compat_ptr(ciov.iov_base); + dst->iov_len = ciov.iov_len; return 0; } #endif src = (struct iovec __user *) arg; - if (copy_from_user(dst, &src[index], sizeof(*dst))) + if (copy_from_user_with_ptr(dst, &src[index], sizeof(*dst))) return -EFAULT; return 0; } @@ -534,8 +535,8 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx, struct io_uring_rsrc_update2 *up, unsigned nr_args) { - u64 __user *tags = u64_to_user_ptr(up->tags); - __s32 __user *fds = u64_to_user_ptr(up->data); + u64 __user *tags = (u64 __user *)up->tags; + __s32 __user *fds = (__s32 __user *)up->data; struct io_rsrc_data *data = ctx->file_data; struct io_fixed_file *file_slot; struct file *file; @@ -614,9 +615,9 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx, struct io_uring_rsrc_update2 *up, unsigned int nr_args) { - u64 __user *tags = u64_to_user_ptr(up->tags); + u64 __user *tags = (u64 __user *)up->tags; struct iovec iov; - struct iovec __user *iovs = u64_to_user_ptr(up->data); + struct iovec __user *iovs = (struct iovec __user *)up->data; struct page *last_hpage = NULL; bool needs_switch = false; __u32 done; @@ -742,13 +743,13 @@ __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg, case IORING_RSRC_FILE: if (rr.flags & IORING_RSRC_REGISTER_SPARSE && rr.data) break; - return io_sqe_files_register(ctx, u64_to_user_ptr(rr.data), - rr.nr, u64_to_user_ptr(rr.tags)); + return io_sqe_files_register(ctx, (void __user *)rr.data, + rr.nr, (u64 __user *)rr.tags); case IORING_RSRC_BUFFER: if (rr.flags & IORING_RSRC_REGISTER_SPARSE && rr.data) break; - return io_sqe_buffers_register(ctx, u64_to_user_ptr(rr.data), - rr.nr, u64_to_user_ptr(rr.tags)); + return io_sqe_buffers_register(ctx, (void __user *)rr.data, + rr.nr, (u64 __user *)rr.tags); } return -EINVAL; } @@ -774,7 +775,7 @@ static int io_files_update_with_index_alloc(struct io_kiocb *req, unsigned int issue_flags) { struct io_rsrc_update *up 
= io_kiocb_to_cmd(req, struct io_rsrc_update); - __s32 __user *fds = u64_to_user_ptr(up->arg); + __s32 __user *fds = (__s32 __user *)up->arg; unsigned int done; struct file *file; int ret, fd; diff --git a/io_uring/rw.c b/io_uring/rw.c index 2edca190450ee..424ee773c95aa 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -23,7 +23,7 @@ struct io_rw { /* NOTE: kiocb has the file as the first member, so don't do it here */ struct kiocb kiocb; - u64 addr; + __kernel_uintptr_t addr; u32 len; rwf_t flags; }; @@ -39,7 +39,7 @@ static int io_iov_compat_buffer_select_prep(struct io_rw *rw) struct compat_iovec __user *uiov; compat_ssize_t clen;
- uiov = u64_to_user_ptr(rw->addr); + uiov = (struct compat_iovec __user *)rw->addr; if (!access_ok(uiov, sizeof(*uiov))) return -EFAULT; if (__get_user(clen, &uiov->iov_len)) @@ -65,7 +65,7 @@ static int io_iov_buffer_select_prep(struct io_kiocb *req) return io_iov_compat_buffer_select_prep(rw); #endif
- uiov = u64_to_user_ptr(rw->addr); + uiov = (struct iovec __user *)rw->addr; if (get_user(rw->len, &uiov->iov_len)) return -EFAULT; return 0; @@ -370,7 +370,7 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req, return NULL; }
- buf = u64_to_user_ptr(rw->addr); + buf = (void __user *)rw->addr; sqe_len = rw->len;
if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE || @@ -379,8 +379,7 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req, buf = io_buffer_select(req, &sqe_len, issue_flags); if (!buf) return ERR_PTR(-ENOBUFS); - /* TODO [PCuABI] - capability checks for uaccess */ - rw->addr = user_ptr_addr(buf); + rw->addr = (__kernel_uintptr_t)buf; rw->len = sqe_len; }
@@ -446,7 +445,7 @@ static ssize_t loop_rw_iter(int ddir, struct io_rw *rw, struct iov_iter *iter) if (!iov_iter_is_bvec(iter)) { iovec = iov_iter_iovec(iter); } else { - iovec.iov_base = u64_to_user_ptr(rw->addr); + iovec.iov_base = (void __user *)rw->addr; iovec.iov_len = rw->len; }
diff --git a/io_uring/statx.c b/io_uring/statx.c index d8fc933d3f593..d2604fdbcbe33 100644 --- a/io_uring/statx.c +++ b/io_uring/statx.c @@ -32,8 +32,8 @@ int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
sx->dfd = READ_ONCE(sqe->fd); sx->mask = READ_ONCE(sqe->len); - path = u64_to_user_ptr(READ_ONCE(sqe->addr)); - sx->buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + path = (char __user *)READ_ONCE(sqe->addr); + sx->buffer = (struct statx __user *)READ_ONCE(sqe->addr2); sx->flags = READ_ONCE(sqe->statx_flags);
sx->filename = getname_flags(path, diff --git a/io_uring/tctx.c b/io_uring/tctx.c index 6ab9916ed3844..828383907b3b9 100644 --- a/io_uring/tctx.c +++ b/io_uring/tctx.c @@ -22,7 +22,7 @@ static int get_compat_io_uring_rsrc_update(struct io_uring_rsrc_update *up, return -EFAULT; up->offset = compat_up.offset; up->resv = compat_up.resv; - up->data = compat_up.data; + up->data = (__kernel_uintptr_t)compat_ptr(compat_up.data); return 0; } #endif /* CONFIG_COMPAT64 */ @@ -35,7 +35,7 @@ static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx, if (ctx->compat) return get_compat_io_uring_rsrc_update(up, arg); #endif - return copy_from_user(up, arg, sizeof(struct io_uring_rsrc_update)); + return copy_from_user_with_ptr(up, arg, sizeof(struct io_uring_rsrc_update)); }
static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx, diff --git a/io_uring/timeout.c b/io_uring/timeout.c index e8a8c20994805..5a0fe53c13329 100644 --- a/io_uring/timeout.c +++ b/io_uring/timeout.c @@ -26,7 +26,7 @@ struct io_timeout {
struct io_timeout_rem { struct file *file; - u64 addr; + __kernel_uintptr_t addr;
/* timeout update */ struct timespec64 ts; @@ -337,7 +337,7 @@ static clockid_t io_timeout_get_clock(struct io_timeout_data *data) } }
-static int io_linked_timeout_update(struct io_ring_ctx *ctx, __u64 user_data, +static int io_linked_timeout_update(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, struct timespec64 *ts, enum hrtimer_mode mode) __must_hold(&ctx->timeout_lock) { @@ -365,7 +365,7 @@ static int io_linked_timeout_update(struct io_ring_ctx *ctx, __u64 user_data, return 0; }
-static int io_timeout_update(struct io_ring_ctx *ctx, __u64 user_data, +static int io_timeout_update(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, struct timespec64 *ts, enum hrtimer_mode mode) __must_hold(&ctx->timeout_lock) { @@ -405,7 +405,7 @@ int io_timeout_remove_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) tr->ltimeout = true; if (tr->flags & ~(IORING_TIMEOUT_UPDATE_MASK|IORING_TIMEOUT_ABS)) return -EINVAL; - if (get_timespec64(&tr->ts, u64_to_user_ptr(sqe->addr2))) + if (get_timespec64(&tr->ts, (struct __kernel_timespec __user *)sqe->addr2)) return -EFAULT; if (tr->ts.tv_sec < 0 || tr->ts.tv_nsec < 0) return -EINVAL; @@ -490,7 +490,7 @@ static int __io_timeout_prep(struct io_kiocb *req, data->req = req; data->flags = flags;
- if (get_timespec64(&data->ts, u64_to_user_ptr(sqe->addr))) + if (get_timespec64(&data->ts, (struct __kernel_timespec __user *)sqe->addr)) return -EFAULT;
if (data->ts.tv_sec < 0 || data->ts.tv_nsec < 0) diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c index e50de0b6b9f84..4d2d2e3f885ee 100644 --- a/io_uring/uring_cmd.c +++ b/io_uring/uring_cmd.c @@ -65,8 +65,13 @@ int io_uring_cmd_prep_async(struct io_kiocb *req) struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd); size_t cmd_size;
+#ifdef CONFIG_CHERI_PURECAP_UABI + BUILD_BUG_ON(uring_cmd_pdu_size(0) != 32); + BUILD_BUG_ON(uring_cmd_pdu_size(1) != 160); +#else BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16); BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80); +#endif
cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
diff --git a/io_uring/xattr.c b/io_uring/xattr.c index 99df641594d74..1f13032e59536 100644 --- a/io_uring/xattr.c +++ b/io_uring/xattr.c @@ -53,8 +53,8 @@ static int __io_getxattr_prep(struct io_kiocb *req,
ix->filename = NULL; ix->ctx.kvalue = NULL; - name = u64_to_user_ptr(READ_ONCE(sqe->addr)); - ix->ctx.cvalue = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + name = (char __user *)READ_ONCE(sqe->addr); + ix->ctx.cvalue = (void __user *)READ_ONCE(sqe->addr2); ix->ctx.size = READ_ONCE(sqe->len); ix->ctx.flags = READ_ONCE(sqe->xattr_flags);
@@ -93,7 +93,7 @@ int io_getxattr_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (ret) return ret;
- path = u64_to_user_ptr(READ_ONCE(sqe->addr3)); + path = (char __user *)READ_ONCE(sqe->addr3);
ix->filename = getname_flags(path, LOOKUP_FOLLOW, NULL); if (IS_ERR(ix->filename)) { @@ -159,8 +159,8 @@ static int __io_setxattr_prep(struct io_kiocb *req, return -EBADF;
ix->filename = NULL; - name = u64_to_user_ptr(READ_ONCE(sqe->addr)); - ix->ctx.cvalue = u64_to_user_ptr(READ_ONCE(sqe->addr2)); + name = (char __user *)READ_ONCE(sqe->addr); + ix->ctx.cvalue = (void __user *)READ_ONCE(sqe->addr2); ix->ctx.kvalue = NULL; ix->ctx.size = READ_ONCE(sqe->len); ix->ctx.flags = READ_ONCE(sqe->xattr_flags); @@ -189,7 +189,7 @@ int io_setxattr_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (ret) return ret;
- path = u64_to_user_ptr(READ_ONCE(sqe->addr3)); + path = (char __user *)READ_ONCE(sqe->addr3);
ix->filename = getname_flags(path, LOOKUP_FOLLOW, NULL); if (IS_ERR(ix->filename)) {
On 23/02/2023 18:53, Tudor Cretu wrote:
Some members of the io_uring uAPI structs may contain user pointers. In the PCuABI, a user pointer is a 129-bit capability, so the __u64 type is not big enough to hold it. Use the __kernel_uintptr_t type instead, which is big enough on the affected architectures while remaining 64-bit on others.
The user_data field must be passed unchanged from the submission queue to the completion queue. As it is standard practice to store a pointer in user_data, expand the field to __kernel_uintptr_t. However, the kernel doesn't dereference the user_data, so don't cast it to a compat_ptr.
Maybe something like "... don't convert it in the compat case"?
In addition, for the io_uring structs containing user pointers, use the special copy routines when copying user pointers from/to userspace.
Note that the structs io_uring_sqe and io_uring_cqe double in size in PCuABI. The setup flags IORING_SETUP_SQE128 and IORING_SETUP_CQE32 previously doubled the sizes of the two structs to 128 bytes and 32 bytes respectively. In PCuABI, the two flags still double the sizes of the two structs, but, as the base structs have grown, the doubled sizes become 256 bytes and 64 bytes.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
 include/linux/io_uring_types.h |  4 +--
 include/uapi/linux/io_uring.h  | 62 +++++++++++++++++++---------------
 io_uring/advise.c              |  2 +-
 io_uring/cancel.c              |  6 ++--
 io_uring/cancel.h              |  2 +-
 io_uring/epoll.c               |  2 +-
 io_uring/fdinfo.c              |  9 ++---
 io_uring/fs.c                  | 16 ++++-----
 io_uring/io_uring.c            | 49 +++++++++++++++++++++++----
 io_uring/io_uring.h            | 16 ++++-----
 io_uring/kbuf.c                | 15 ++++----
 io_uring/kbuf.h                |  2 +-
 io_uring/msg_ring.c            |  4 +--
 io_uring/net.c                 | 20 +++++------
 io_uring/openclose.c           |  4 +--
 io_uring/poll.c                |  4 +--
 io_uring/rsrc.c                | 41 +++++++++++-----------
 io_uring/rw.c                  | 13 ++++---
 io_uring/statx.c               |  4 +--
 io_uring/tctx.c                |  4 +--
 io_uring/timeout.c             | 10 +++---
 io_uring/uring_cmd.c           |  5 +++
 io_uring/xattr.c               | 12 +++---
 23 files changed, 178 insertions(+), 128 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 737bc1aa67306..1407ccadf0575 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -604,8 +604,8 @@ struct io_task_work {
 };

 struct io_cqe {
-	__u64			user_data;
-	__s32			res;
+	__kernel_uintptr_t	user_data;
+	__s32			res;
 	/* fd initially, then cflags for completion */
 	union {
 		__u32		flags;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 2df3225b562fa..14aa30151c5f9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -11,6 +11,11 @@
 #include <linux/fs.h>
 #include <linux/types.h>
 #include <linux/time_types.h>
+#ifdef __KERNEL__
+#include <linux/stddef.h>
+#else
+#include <stddef.h>
+#endif
Worth adding a comment explaining this is for offsetof(), otherwise this is rather mysterious.
 #ifdef __cplusplus
 extern "C" {
@@ -25,16 +30,16 @@ struct io_uring_sqe {
 	__u16	ioprio;		/* ioprio for the request */
 	__s32	fd;		/* file descriptor to do IO on */
 	union {
-		__u64	off;	/* offset into file */
-		__u64	addr2;
+		__u64	off;	/* offset into file */
+		__kernel_uintptr_t	addr2;
 		struct {
 			__u32	cmd_op;
 			__u32	__pad1;
 		};
 	};
 	union {
-		__u64	addr;	/* pointer to buffer or iovecs */
-		__u64	splice_off_in;
+		__kernel_uintptr_t	addr;	/* pointer to buffer or iovecs */
+		__u64	splice_off_in;
 	};
 	__u32	len;		/* buffer size or number of iovecs */
 	union {
@@ -58,7 +63,7 @@ struct io_uring_sqe {
 		__u32	msg_ring_flags;
 		__u32	uring_cmd_flags;
 	};
-	__u64	user_data;	/* data to be passed back at completion time */
+	__kernel_uintptr_t	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
 	union {
 		/* index into fixed buffers, if used */
@@ -78,12 +83,14 @@ struct io_uring_sqe {
 	};
 	union {
 		struct {
-			__u64	addr3;
-			__u64	__pad2[1];
+			__kernel_uintptr_t	addr3;
+			__u64	__pad2[1];
 		};
That looks off, the padding does not have the same size as addr3 in PCuABI. IIUC the overall struct layout, you do want the padding in PCuABI too so that the overall struct size is 128 (two cache lines), so it should be the same size as addr3. The easiest option is probably to make it __kernel_uintptr_t too.
 	};
 	/*
 	 * If the ring is initialized with IORING_SETUP_SQE128, then
-	 * this field is used for 80 bytes of arbitrary command data
+	 * this field is used to double the size of the
+	 * struct io_uring_sqe to store bytes of arbitrary
+	 * command data, i.e. 80 bytes or 160 bytes in PCuABI
 	 */
 	__u8	cmd[0];
 };
@@ -326,13 +333,14 @@ enum {
 /*
  * IO completion data structure (Completion Queue Entry)
  */
 struct io_uring_cqe {
-	__u64	user_data;	/* sqe->data submission passed back */
-	__s32	res;		/* result code for this event */
-	__u32	flags;
+	__kernel_uintptr_t	user_data;	/* sqe->data submission passed back */
+	__s32	res;		/* result code for this event */
+	__u32	flags;
 	/*
 	 * If the ring is initialized with IORING_SETUP_CQE32, then this field
-	 * contains 16-bytes of padding, doubling the size of the CQE.
+	 * doubles the size of the CQE, i.e. contains 16 bytes, or in PCuABI,
+	 * 32 bytes of padding.
 	 */
 	__u64	big_cqe[];
 };
@@ -504,7 +512,7 @@ enum {
 struct io_uring_files_update {
 	__u32 offset;
 	__u32 resv;
-	__aligned_u64 /* __s32 * */ fds;
+	__kernel_aligned_uintptr_t /* __s32 * */ fds;
 };

 /*
@@ -517,21 +525,21 @@ struct io_uring_rsrc_register {
 	__u32 nr;
 	__u32 flags;
 	__u64 resv2;
-	__aligned_u64 data;
-	__aligned_u64 tags;
+	__kernel_aligned_uintptr_t data;
+	__kernel_aligned_uintptr_t tags;
 };

 struct io_uring_rsrc_update {
 	__u32 offset;
 	__u32 resv;
-	__aligned_u64 data;
+	__kernel_aligned_uintptr_t data;
 };

 struct io_uring_rsrc_update2 {
 	__u32 offset;
 	__u32 resv;
-	__aligned_u64 data;
-	__aligned_u64 tags;
+	__kernel_aligned_uintptr_t data;
+	__kernel_aligned_uintptr_t tags;
 	__u32 nr;
 	__u32 resv2;
 };

@@ -581,7 +589,7 @@ struct io_uring_restriction {
 };

 struct io_uring_buf {
-	__u64	addr;
+	__kernel_uintptr_t	addr;
 	__u32	len;
 	__u16	bid;
 	__u16	resv;
Nit: might as well align the other fields, same for io_uring_buf_reg.
@@ -594,9 +602,7 @@ struct io_uring_buf_ring {
 	 * ring tail is overlaid with the io_uring_buf->resv field.
 	 */
 	struct {
-		__u64	resv1;
-		__u32	resv2;
-		__u16	resv3;
+		__u8	resv1[offsetof(struct io_uring_buf, resv)];
Nit: maybe just resv now that there's only one? (Nice solution BTW, I don't think you can do better than that!)
 		__u16	tail;
 	};
 	struct io_uring_buf	bufs[0];
@@ -605,7 +611,7 @@ struct io_uring_buf_ring {

 /* argument for IORING_(UN)REGISTER_PBUF_RING */
 struct io_uring_buf_reg {
-	__u64	ring_addr;
+	__kernel_uintptr_t	ring_addr;
 	__u32	ring_entries;
 	__u16	bgid;
 	__u16	pad;
@@ -632,17 +638,17 @@ enum {
 };

 struct io_uring_getevents_arg {
-	__u64	sigmask;
-	__u32	sigmask_sz;
-	__u32	pad;
-	__u64	ts;
+	__kernel_uintptr_t	sigmask;
+	__u32	sigmask_sz;
+	__u32	pad;
+	__kernel_uintptr_t	ts;
 };

 /*
  * Argument for IORING_REGISTER_SYNC_CANCEL
  */
 struct io_uring_sync_cancel_reg {
-	__u64	addr;
+	__kernel_uintptr_t	addr;
 	__s32	fd;
 	__u32	flags;
 	struct __kernel_timespec	timeout;
diff --git a/io_uring/advise.c b/io_uring/advise.c
index 449c6f14649f7..5bb8094204979 100644
--- a/io_uring/advise.c
+++ b/io_uring/advise.c
@@ -23,7 +23,7 @@ struct io_fadvise {

 struct io_madvise {
 	struct file	*file;
-	u64	addr;
+	__kernel_uintptr_t	addr;
 	u32	len;
 	u32	advice;
 };
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index 20cb8634e44af..372e2442231a4 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -19,7 +19,7 @@

 struct io_cancel {
 	struct file	*file;
-	u64	addr;
+	__kernel_uintptr_t	addr;
So that one is interesting. In this case addr represents a user_data value to match against in order to find a request to cancel. As it stands we do propagate the capability in PCuABI but then only use it as 64-bit value, since we do a simple address comparison in io_cancel_cb(). The same applies to IORING_OP_POLL_REMOVE (io_poll_find() in poll.c) and IORING_OP_TIMEOUT_REMOVE (io_timeout_extract() in timeout.c).
Considering user_data is used as a sort of key in that case, I suspect doing a full comparison makes more sense, though it sounds unlikely that two SQEs would have different user_data with the same address. Worth pondering on in any case.
 	u32	flags;
 	s32	fd;
 };
@@ -43,7 +43,7 @@ static int get_compat_io_uring_sync_cancel_reg(struct io_uring_sync_cancel_reg *
 	if (unlikely(copy_from_user(&compat_sc, user_sc, sizeof(compat_sc))))
 		return -EFAULT;
-	sc->addr = compat_sc.addr;
+	sc->addr = (__kernel_uintptr_t)compat_ptr(compat_sc.addr);
 	sc->fd = compat_sc.fd;
 	sc->flags = compat_sc.flags;
 	sc->timeout = compat_sc.timeout;
@@ -60,7 +60,7 @@ static int copy_io_uring_sync_cancel_reg_from_user(struct io_ring_ctx *ctx,
 	if (ctx->compat)
 		return get_compat_io_uring_sync_cancel_reg(sc, arg);
 #endif
-	return copy_from_user(sc, arg, sizeof(*sc));
+	return copy_from_user_with_ptr(sc, arg, sizeof(*sc));
 }

 static bool io_cancel_cb(struct io_wq_work *work, void *data)
diff --git a/io_uring/cancel.h b/io_uring/cancel.h
index 6a59ee484d0cc..7c1249d61bf25 100644
--- a/io_uring/cancel.h
+++ b/io_uring/cancel.h
@@ -5,7 +5,7 @@
 struct io_cancel_data {
 	struct io_ring_ctx *ctx;
 	union {
-		u64 data;
+		__kernel_uintptr_t data;
 		struct file *file;
 	};
 	u32 flags;
diff --git a/io_uring/epoll.c b/io_uring/epoll.c
index d5580ff465c3e..d9d5983f823c2 100644
--- a/io_uring/epoll.c
+++ b/io_uring/epoll.c
@@ -39,7 +39,7 @@ int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (ep_op_has_event(epoll->op)) {
 		struct epoll_event __user *ev;

-		ev = u64_to_user_ptr(READ_ONCE(sqe->addr));
+		ev = (struct epoll_event __user *)READ_ONCE(sqe->addr);
 		if (copy_epoll_event_from_user(&epoll->event, ev, req->ctx->compat))
 			return -EFAULT;
 	}
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index c5bd669081c98..41628302d1f8b 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -114,7 +114,7 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
 			   sq_idx, io_uring_get_opcode(sqe->opcode), sqe->fd,
 			   sqe->flags, (unsigned long long) sqe->off,
 			   (unsigned long long) sqe->addr, sqe->rw_flags,
-			   sqe->buf_index, sqe->user_data);
+			   sqe->buf_index, (unsigned long long) sqe->user_data);
 		if (sq_shift) {
 			u64 *sqeb = (void *) (sqe + 1);
 			int size = (ctx->compat ? sizeof(struct compat_io_uring_sqe)
@@ -153,8 +153,8 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
 #endif
 		seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x",
-			   entry & cq_mask, cqe->user_data, cqe->res,
-			   cqe->flags);
+			   entry & cq_mask, (unsigned long long) cqe->user_data,
+			   cqe->res, cqe->flags);
 		if (cq_shift)
 			seq_printf(m, ", extra1:%llu, extra2:%llu\n",
 				   cqe->big_cqe[0], cqe->big_cqe[1]);
@@ -235,7 +235,8 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
 #endif
 	seq_printf(m, " user_data=%llu, res=%d, flags=%x\n",
-		   cqe->user_data, cqe->res, cqe->flags);
+		   (unsigned long long) cqe->user_data, cqe->res,
+		   cqe->flags);
 #ifdef CONFIG_COMPAT64
 	kfree(native_cqe);
diff --git a/io_uring/fs.c b/io_uring/fs.c
index 7100c293c13a8..2e01e7da1d4ba 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -58,8 +58,8 @@ int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EBADF;

 	ren->old_dfd = READ_ONCE(sqe->fd);
-	oldf = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	newf = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	oldf = (char __user *)READ_ONCE(sqe->addr);
+	newf = (char __user *)READ_ONCE(sqe->addr2);
 	ren->new_dfd = READ_ONCE(sqe->len);
 	ren->flags = READ_ONCE(sqe->rename_flags);
@@ -117,7 +117,7 @@ int io_unlinkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (un->flags & ~AT_REMOVEDIR)
 		return -EINVAL;
-	fname = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	fname = (char __user *)READ_ONCE(sqe->addr);
 	un->filename = getname(fname);
 	if (IS_ERR(un->filename))
 		return PTR_ERR(un->filename);
@@ -164,7 +164,7 @@ int io_mkdirat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	mkd->dfd = READ_ONCE(sqe->fd);
 	mkd->mode = READ_ONCE(sqe->len);
-	fname = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	fname = (char __user *)READ_ONCE(sqe->addr);
 	mkd->filename = getname(fname);
 	if (IS_ERR(mkd->filename))
 		return PTR_ERR(mkd->filename);
@@ -206,8 +206,8 @@ int io_symlinkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EBADF;

 	sl->new_dfd = READ_ONCE(sqe->fd);
-	oldpath = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	newpath = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	oldpath = (char __user *)READ_ONCE(sqe->addr);
+	newpath = (char __user *)READ_ONCE(sqe->addr2);

 	sl->oldpath = getname(oldpath);
@@ -250,8 +250,8 @@ int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	lnk->old_dfd = READ_ONCE(sqe->fd);
 	lnk->new_dfd = READ_ONCE(sqe->len);
-	oldf = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	newf = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	oldf = (char __user *)READ_ONCE(sqe->addr);
+	newf = (char __user *)READ_ONCE(sqe->addr2);
 	lnk->flags = READ_ONCE(sqe->hardlink_flags);
 	lnk->oldpath = getname(oldf);
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 03506776af043..24c1c9fb9b298 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -184,7 +184,7 @@ static int copy_io_uring_getevents_arg_from_user(struct io_ring_ctx *ctx,
 #endif
 	if (size != sizeof(*arg))
 		return -EINVAL;
-	if (copy_from_user(arg, argp, sizeof(*arg)))
+	if (copy_from_user_with_ptr(arg, argp, sizeof(*arg)))
 		return -EFAULT;
 	return 0;
 }
@@ -726,7 +726,7 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task)
 	}
 }

-static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
+static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data,
Looking at how user_data is used, I see that it is passed on to a couple of trace functions (here and in io_fill_cqe_aux()). Should we extend the trace events accordingly? Would that be useful at all? We don't necessarily need an answer now, maybe that's to be investigated later.
 				     s32 res, u32 cflags, u64 extra1, u64 extra2)
 {
 	struct io_overflow_cqe *ocqe;
@@ -818,7 +818,7 @@ void *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 	return cqe;
 }

-bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
+bool io_fill_cqe_aux(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, s32 res, u32 cflags,
Nit: it's now well over 80 char so splitting would be nice (in the header too).
 		     bool allow_overflow)
 {
 	struct io_uring_cqe *cqe;
@@ -845,7 +845,7 @@ bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags
 }

 bool io_post_aux_cqe(struct io_ring_ctx *ctx,
-		     u64 user_data, s32 res, u32 cflags,
+		     __kernel_uintptr_t user_data, s32 res, u32 cflags,
 		     bool allow_overflow)
 {
 	bool filled;
@@ -3213,9 +3213,9 @@ static int io_get_ext_arg(struct io_ring_ctx *ctx, unsigned int flags,
 		return ret;
 	if (arg.pad)
 		return -EINVAL;
-	*sig = u64_to_user_ptr(arg.sigmask);
+	*sig = (sigset_t __user *)arg.sigmask;
 	*argsz = arg.sigmask_sz;
-	*ts = u64_to_user_ptr(arg.ts);
+	*ts = (struct __kernel_timespec __user *)arg.ts;
 	return 0;
 }
@@ -4158,6 +4158,42 @@ static int __init io_uring_init(void)
 	__BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, sizeof(etype), ename)
 #define BUILD_BUG_SQE_ELEM_SIZE(eoffset, esize, ename) \
 	__BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, esize, ename)

+#ifdef CONFIG_CHERI_PURECAP_UABI
+	BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 128);
+	BUILD_BUG_SQE_ELEM(0, __u8, opcode);
+	BUILD_BUG_SQE_ELEM(1, __u8, flags);
+	BUILD_BUG_SQE_ELEM(2, __u16, ioprio);
+	BUILD_BUG_SQE_ELEM(4, __s32, fd);
+	BUILD_BUG_SQE_ELEM(16, __u64, off);
+	BUILD_BUG_SQE_ELEM(16, __uintcap_t, addr2);
+	BUILD_BUG_SQE_ELEM(32, __uintcap_t, addr);
+	BUILD_BUG_SQE_ELEM(32, __u64, splice_off_in);
+	BUILD_BUG_SQE_ELEM(48, __u32, len);
+	BUILD_BUG_SQE_ELEM(52, __kernel_rwf_t, rw_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, fsync_flags);
+	BUILD_BUG_SQE_ELEM(52, __u16, poll_events);
+	BUILD_BUG_SQE_ELEM(52, __u32, poll32_events);
+	BUILD_BUG_SQE_ELEM(52, __u32, sync_range_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, msg_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, timeout_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, accept_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, cancel_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, open_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, statx_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, fadvise_advice);
+	BUILD_BUG_SQE_ELEM(52, __u32, splice_flags);
+	BUILD_BUG_SQE_ELEM(64, __uintcap_t, user_data);
+	BUILD_BUG_SQE_ELEM(80, __u16, buf_index);
+	BUILD_BUG_SQE_ELEM(80, __u16, buf_group);
+	BUILD_BUG_SQE_ELEM(82, __u16, personality);
+	BUILD_BUG_SQE_ELEM(84, __s32, splice_fd_in);
+	BUILD_BUG_SQE_ELEM(84, __u32, file_index);
+	BUILD_BUG_SQE_ELEM(84, __u16, addr_len);
+	BUILD_BUG_SQE_ELEM(86, __u16, __pad3[0]);
+	BUILD_BUG_SQE_ELEM(96, __uintcap_t, addr3);
+	BUILD_BUG_SQE_ELEM_SIZE(96, 0, cmd);
+	BUILD_BUG_SQE_ELEM(112, __u64, __pad2);
Some assertions are missing compared to !PCuABI, I suppose that's because the corresponding fields were introduced after 5.18.
+#else /* !CONFIG_CHERI_PURECAP_UABI */
 	BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 64);
 	BUILD_BUG_SQE_ELEM(0, __u8, opcode);
 	BUILD_BUG_SQE_ELEM(1, __u8, flags);
@@ -4198,6 +4234,7 @@ static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(48, __u64, addr3);
 	BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
 	BUILD_BUG_SQE_ELEM(56, __u64, __pad2);
+#endif /* !CONFIG_CHERI_PURECAP_UABI */

 	BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
 		     sizeof(struct io_uring_rsrc_update));
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index fb2711770bfb0..f8c4e6d61124b 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -34,9 +34,9 @@ void io_req_complete_failed(struct io_kiocb *req, s32 res);
 void __io_req_complete(struct io_kiocb *req, unsigned issue_flags);
 void io_req_complete_post(struct io_kiocb *req);
 void __io_req_complete_post(struct io_kiocb *req);
-bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
+bool io_post_aux_cqe(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, s32 res, u32 cflags,
 		     bool allow_overflow);
-bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
+bool io_fill_cqe_aux(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, s32 res, u32 cflags,
 		     bool allow_overflow);

 void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
@@ -99,7 +99,7 @@ static inline void convert_compat_io_uring_cqe(struct io_ring_ctx *ctx,
 					       struct io_uring_cqe *cqe,
 					       const struct compat_io_uring_cqe *compat_cqe)
 {
-	cqe->user_data = READ_ONCE(compat_cqe->user_data);
+	cqe->user_data = (__kernel_uintptr_t)READ_ONCE(compat_cqe->user_data);
 	cqe->res = READ_ONCE(compat_cqe->res);
 	cqe->flags = READ_ONCE(compat_cqe->flags);
@@ -130,13 +130,13 @@ static inline void convert_compat_io_uring_sqe(struct io_ring_ctx *ctx,
 	sqe->ioprio = READ_ONCE(compat_sqe->ioprio);
 	sqe->fd = READ_ONCE(compat_sqe->fd);
 	BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr2, addr);
-	sqe->addr2 = READ_ONCE(compat_sqe->addr2);
+	sqe->addr2 = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr2));
 	BUILD_BUG_COMPAT_SQE_UNION_ELEM(addr, len);
-	sqe->addr = READ_ONCE(compat_sqe->addr);
+	sqe->addr = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr));
 	sqe->len = READ_ONCE(compat_sqe->len);
 	BUILD_BUG_COMPAT_SQE_UNION_ELEM(rw_flags, user_data);
 	sqe->rw_flags = READ_ONCE(compat_sqe->rw_flags);
-	sqe->user_data = READ_ONCE(compat_sqe->user_data);
+	sqe->user_data = (__kernel_uintptr_t)READ_ONCE(compat_sqe->user_data);
 	BUILD_BUG_COMPAT_SQE_UNION_ELEM(buf_index, personality);
 	sqe->buf_index = READ_ONCE(compat_sqe->buf_index);
 	sqe->personality = READ_ONCE(compat_sqe->personality);
@@ -148,7 +148,7 @@ static inline void convert_compat_io_uring_sqe(struct io_ring_ctx *ctx,
 		memcpy(sqe->cmd, compat_sqe->cmd, compat_cmd_size);
Now that sqe->cmd is bigger than compat_sqe->cmd, might it be problematic that we're leaving the end of sqe->cmd uninitialised? I suspect the answer is no because .uring_cmd handlers are not going to use that extra data, at least not in compat, but maybe that deserves a comment at least.
 	} else
-		sqe->addr3 = READ_ONCE(compat_sqe->addr3);
+		sqe->addr3 = (__kernel_uintptr_t)compat_ptr(READ_ONCE(compat_sqe->addr3));
 #undef BUILD_BUG_COMPAT_SQE_UNION_ELEM
 }
 #endif /* CONFIG_COMPAT64 */
@@ -175,7 +175,7 @@ static inline void *io_get_cqe(struct io_ring_ctx *ctx)
 }

 static inline void __io_fill_cqe_any(struct io_ring_ctx *ctx, struct io_uring_cqe *cqe,
-				     u64 user_data, s32 res, u32 cflags,
+				     __kernel_uintptr_t user_data, s32 res, u32 cflags,
I would add an explicit cast to __u64 when writing user_data in compat64. It's not essential but makes it easier to see that the rest is discarded (it doesn't matter because user_data is also a __u64 in the SQE in compat64, but it would be symmetric with the cast to __kernel_uintptr_t in convert_compat_io_uring_sqe()).
 				     u64 extra1, u64 extra2)
 {
 #ifdef CONFIG_COMPAT64
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 9f840ba93c888..414c5064222b4 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -24,7 +24,7 @@ struct io_provide_buf {
 	struct file			*file;
-	__u64				addr;
+	__kernel_uintptr_t		addr;
 	__u32				len;
 	__u32				bgid;
 	__u16				nbufs;
@@ -47,7 +47,7 @@ static int get_compat_io_uring_buf_reg(struct io_uring_buf_reg *reg,
 	if (unlikely(copy_from_user(&compat_reg, user_reg, sizeof(compat_reg))))
 		return -EFAULT;
-	reg->ring_addr = compat_reg.ring_addr;
+	reg->ring_addr = (__kernel_uintptr_t)compat_ptr(compat_reg.ring_addr);
 	reg->ring_entries = compat_reg.ring_entries;
 	reg->bgid = compat_reg.bgid;
 	reg->pad = compat_reg.pad;
@@ -64,7 +64,7 @@ static int copy_io_uring_buf_reg_from_user(struct io_ring_ctx *ctx,
 	if (ctx->compat)
 		return get_compat_io_uring_buf_reg(reg, arg);
 #endif
-	return copy_from_user(reg, arg, sizeof(*reg));
+	return copy_from_user_with_ptr(reg, arg, sizeof(*reg));
 }

 static inline struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx *ctx,
@@ -159,7 +159,7 @@ static void __user *io_provided_buffer_select(struct io_kiocb *req, size_t *len,
 		req->flags |= REQ_F_BUFFER_SELECTED;
 		req->kbuf = kbuf;
 		req->buf_index = kbuf->bid;
-		return u64_to_user_ptr(kbuf->addr);
+		return (void __user *)kbuf->addr;
 	}

 	return NULL;
 }
@@ -239,7 +239,7 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
 			req->buf_list = NULL;
 		bl->head++;
 	}
-	return u64_to_user_ptr(buf->addr);
+	return (void __user *)buf->addr;
 }

 static void __user *io_ring_buffer_select_any(struct io_kiocb *req, size_t *len,
@@ -427,7 +427,7 @@ int io_provide_buffers_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
 		return -EOVERFLOW;

 	size = (unsigned long)p->len * p->nbufs;
-	if (!access_ok(u64_to_user_ptr(p->addr), size))
+	if (!access_ok(p->addr, size))
 		return -EFAULT;

 	p->bgid = READ_ONCE(sqe->buf_group);
@@ -487,7 +487,7 @@ static int io_add_buffers(struct io_ring_ctx *ctx, struct io_provide_buf *pbuf,
 			  struct io_buffer_list *bl)
 {
 	struct io_buffer *buf;
-	u64 addr = pbuf->addr;
+	__kernel_uintptr_t addr = pbuf->addr;
 	int i, bid = pbuf->bid;

 	for (i = 0; i < pbuf->nbufs; i++) {
@@ -599,6 +599,7 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	pages_size = ctx->compat ?
 			size_mul(sizeof(struct compat_io_uring_buf), reg.ring_entries) :
 			size_mul(sizeof(struct io_uring_buf), reg.ring_entries);
+	/* TODO [PCuABI] - capability checks for uaccess */
 	pages = io_pin_pages(reg.ring_addr, pages_size, &nr_pages);
 	if (IS_ERR(pages)) {
 		kfree(free_bl);
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index 1aa5bbbc5d628..50a0a7524e6b6 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -31,7 +31,7 @@ struct io_buffer_list {
 struct io_buffer {
 	struct list_head list;
- __u64 addr;
+	__kernel_uintptr_t addr;
 	__u32 len;
 	__u16 bid;
 	__u16 bgid;
diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 90d2fc6fd80e4..654f5ad0b11c0 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -15,7 +15,7 @@ struct io_msg {
 	struct file			*file;
-	u64 user_data;
+	__kernel_uintptr_t user_data;
 	u32 len;
 	u32 cmd;
 	u32 src_fd;
@@ -130,7 +130,7 @@ int io_msg_ring_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (unlikely(sqe->buf_index || sqe->personality))
 		return -EINVAL;
-	msg->user_data = READ_ONCE(sqe->off);
+	msg->user_data = READ_ONCE(sqe->addr2);
 	msg->len = READ_ONCE(sqe->len);
 	msg->cmd = READ_ONCE(sqe->addr);
 	msg->src_fd = READ_ONCE(sqe->addr3);
diff --git a/io_uring/net.c b/io_uring/net.c
index 4c133bc6f9d1d..a9143b4652114 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -243,13 +243,13 @@ int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (req->opcode == IORING_OP_SEND) {
 		if (READ_ONCE(sqe->__pad3[0]))
 			return -EINVAL;
-		sr->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+		sr->addr = (void __user *)READ_ONCE(sqe->addr2);
 		sr->addr_len = READ_ONCE(sqe->addr_len);
 	} else if (sqe->addr2 || sqe->file_index) {
 		return -EINVAL;
 	}
-	sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	sr->umsg = (struct user_msghdr __user *)READ_ONCE(sqe->addr);
 	sr->len = READ_ONCE(sqe->len);
 	sr->flags = READ_ONCE(sqe->ioprio);
 	if (sr->flags & ~IORING_RECVSEND_POLL_FIRST)
@@ -421,7 +421,7 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req,
 	struct user_msghdr msg;
 	int ret;

-	if (copy_from_user(&msg, sr->umsg, sizeof(*sr->umsg)))
+	if (copy_from_user_with_ptr(&msg, sr->umsg, sizeof(*sr->umsg)))
Good catch. I've also realised that copy_msghdr_from_user() (net/socket.c) does not preserve capabilities either. It's broken in general and not just for io_uring, but maybe it makes sense to fix it in that series too?
 		return -EFAULT;

 	ret = __copy_msghdr(&iomsg->msg, &msg, &iomsg->uaddr);
@@ -549,7 +549,7 @@ int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (unlikely(sqe->file_index || sqe->addr2))
 		return -EINVAL;
-	sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	sr->umsg = (struct user_msghdr __user *)READ_ONCE(sqe->addr);
 	sr->len = READ_ONCE(sqe->len);
 	sr->flags = READ_ONCE(sqe->ioprio);
 	if (sr->flags & ~(RECVMSG_FLAGS))
@@ -966,7 +966,7 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (req->opcode == IORING_OP_SEND_ZC) {
 		if (READ_ONCE(sqe->__pad3[0]))
 			return -EINVAL;
-		zc->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+		zc->addr = (void __user *)READ_ONCE(sqe->addr2);
 		zc->addr_len = READ_ONCE(sqe->addr_len);
 	} else {
 		if (unlikely(sqe->addr2 || sqe->file_index))
@@ -975,7 +975,7 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 			return -EINVAL;
 	}
-	zc->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	zc->buf = (void __user *)READ_ONCE(sqe->addr);
 	zc->len = READ_ONCE(sqe->len);
 	zc->msg_flags = READ_ONCE(sqe->msg_flags) | MSG_NOSIGNAL;
 	if (zc->msg_flags & MSG_DONTWAIT)
@@ -1242,8 +1242,8 @@ int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (sqe->len || sqe->buf_index)
 		return -EINVAL;
-	accept->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	accept->addr = (void __user *)READ_ONCE(sqe->addr);
+	accept->addr_len = (int __user *)READ_ONCE(sqe->addr2);
 	accept->flags = READ_ONCE(sqe->accept_flags);
 	accept->nofile = rlimit(RLIMIT_NOFILE);
 	flags = READ_ONCE(sqe->ioprio);
@@ -1392,8 +1392,8 @@ int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (sqe->len || sqe->buf_index || sqe->rw_flags || sqe->splice_fd_in)
 		return -EINVAL;
-	conn->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	conn->addr_len = READ_ONCE(sqe->addr2);
+	conn->addr = (void __user *)READ_ONCE(sqe->addr);
+	conn->addr_len = READ_ONCE(sqe->addr2);
It probably ends up being equivalent, but using sqe->off would probably be better here (as it's a size and not a pointer), especially since liburing seems to set ->off and not ->addr2.
It might be good to check other non-pointer uses of ->addr2 while at it (if any). Unfortunately ->off and ->addr2 seem to be used somewhat interchangeably upstream and that's not really equivalent any more.
 	conn->in_progress = false;
 	return 0;
 }
diff --git a/io_uring/openclose.c b/io_uring/openclose.c
index 67178e4bb282d..0a5c838885306 100644
--- a/io_uring/openclose.c
+++ b/io_uring/openclose.c
@@ -47,7 +47,7 @@ static int __io_openat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe
 		open->how.flags |= O_LARGEFILE;

 	open->dfd = READ_ONCE(sqe->fd);
-	fname = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	fname = (char __user *)READ_ONCE(sqe->addr);
 	open->filename = getname(fname);
 	if (IS_ERR(open->filename)) {
 		ret = PTR_ERR(open->filename);
@@ -81,7 +81,7 @@ int io_openat2_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	size_t len;
 	int ret;

-	how = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	how = (struct open_how __user *)READ_ONCE(sqe->addr2);
 	len = READ_ONCE(sqe->len);
 	if (len < OPEN_HOW_SIZE_VER0)
 		return -EINVAL;
diff --git a/io_uring/poll.c b/io_uring/poll.c
index d9bf1767867e6..4914048fed0f8 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -22,8 +22,8 @@ struct io_poll_update {
 	struct file			*file;
-	u64				old_user_data;
-	u64				new_user_data;
+	__kernel_uintptr_t		old_user_data;
+	__kernel_uintptr_t		new_user_data;
Looking at how new_user_data is used, I see that it is set to sqe->off, which is clearly not what we want in PCuABI (since off is a __u64). I suppose we should have the same change as in msg_ring.c and set it to sqe->addr2 instead?
 	__poll_t			events;
 	bool				update_events;
 	bool				update_user_data;
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index eafdb039f9e51..f94d7475cb4d8 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -18,7 +18,7 @@ struct io_rsrc_update {
 	struct file			*file;
-	u64				arg;
+	__kernel_uintptr_t		arg;
 	u32				nr_args;
 	u32				offset;
 };
@@ -33,7 +33,7 @@ static int get_compat_io_uring_rsrc_update(struct io_uring_rsrc_update2 *up2,
 		return -EFAULT;
 	up2->offset = compat_up.offset;
 	up2->resv = compat_up.resv;
-	up2->data = compat_up.data;
+	up2->data = (__kernel_uintptr_t)compat_ptr(compat_up.data);
 	return 0;
 }
@@ -46,8 +46,8 @@ static int get_compat_io_uring_rsrc_update2(struct io_uring_rsrc_update2 *up2,
 		return -EFAULT;
 	up2->offset = compat_up2.offset;
 	up2->resv = compat_up2.resv;
-	up2->data = compat_up2.data;
-	up2->tags = compat_up2.tags;
+	up2->data = (__kernel_uintptr_t)compat_ptr(compat_up2.data);
+	up2->tags = (__kernel_uintptr_t)compat_ptr(compat_up2.tags);
 	up2->nr = compat_up2.nr;
 	up2->resv2 = compat_up2.resv2;
 	return 0;
@@ -63,8 +63,8 @@ static int get_compat_io_uring_rsrc_register(struct io_uring_rsrc_register *rr,
 	rr->nr = compat_rr.nr;
 	rr->flags = compat_rr.flags;
 	rr->resv2 = compat_rr.resv2;
-	rr->data = compat_rr.data;
-	rr->tags = compat_rr.tags;
+	rr->data = (__kernel_uintptr_t)compat_ptr(compat_rr.data);
+	rr->tags = (__kernel_uintptr_t)compat_ptr(compat_rr.tags);
 	return 0;
 }
 #endif /* CONFIG_COMPAT64 */
@@ -77,7 +77,7 @@ static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx,
 	if (ctx->compat)
 		return get_compat_io_uring_rsrc_update(up2, arg);
 #endif
-	return copy_from_user(up2, arg, sizeof(struct io_uring_rsrc_update));
+	return copy_from_user_with_ptr(up2, arg, sizeof(struct io_uring_rsrc_update));
 }

 static int copy_io_uring_rsrc_update2_from_user(struct io_ring_ctx *ctx,
@@ -96,7 +96,7 @@ static int copy_io_uring_rsrc_update2_from_user(struct io_ring_ctx *ctx,
 #endif
 	if (size != sizeof(*up2))
 		return -EINVAL;
-	if (copy_from_user(up2, arg, sizeof(*up2)))
+	if (copy_from_user_with_ptr(up2, arg, sizeof(*up2)))
 		return -EFAULT;
 	return 0;
 }
@@ -118,7 +118,7 @@ static int copy_io_uring_rsrc_register_from_user(struct io_ring_ctx *ctx,
 	/* keep it extendible */
 	if (size != sizeof(*rr))
 		return -EINVAL;
-	if (copy_from_user(rr, arg, size))
+	if (copy_from_user_with_ptr(rr, arg, size))
 		return -EFAULT;
 	return 0;
 }
@@ -201,13 +201,14 @@ static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
 		if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
 			return -EFAULT;
-		dst->iov_base = u64_to_user_ptr((u64)ciov.iov_base);
+		dst->iov_base = compat_ptr(ciov.iov_base);
+
Not sure we want that extra empty line.
 		dst->iov_len = ciov.iov_len;
 		return 0;
 	}
 #endif
 	src = (struct iovec __user *) arg;
-	if (copy_from_user(dst, &src[index], sizeof(*dst)))
+	if (copy_from_user_with_ptr(dst, &src[index], sizeof(*dst)))
 		return -EFAULT;
 	return 0;
 }
@@ -534,8 +535,8 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				 struct io_uring_rsrc_update2 *up,
 				 unsigned nr_args)
 {
-	u64 __user *tags = u64_to_user_ptr(up->tags);
-	__s32 __user *fds = u64_to_user_ptr(up->data);
+	u64 __user *tags = (u64 __user *)up->tags;
+	__s32 __user *fds = (__s32 __user *)up->data;
 	struct io_rsrc_data *data = ctx->file_data;
 	struct io_fixed_file *file_slot;
 	struct file *file;
@@ -614,9 +615,9 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 				   struct io_uring_rsrc_update2 *up,
 				   unsigned int nr_args)
 {
-	u64 __user *tags = u64_to_user_ptr(up->tags);
+	u64 __user *tags = (u64 __user *)up->tags;
 	struct iovec iov;
-	struct iovec __user *iovs = u64_to_user_ptr(up->data);
+	struct iovec __user *iovs = (struct iovec __user *)up->data;
 	struct page *last_hpage = NULL;
 	bool needs_switch = false;
 	__u32 done;
@@ -742,13 +743,13 @@ __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
 	case IORING_RSRC_FILE:
 		if (rr.flags & IORING_RSRC_REGISTER_SPARSE && rr.data)
 			break;
-		return io_sqe_files_register(ctx, u64_to_user_ptr(rr.data),
-					     rr.nr, u64_to_user_ptr(rr.tags));
+		return io_sqe_files_register(ctx, (void __user *)rr.data,
+					     rr.nr, (u64 __user *)rr.tags);
 	case IORING_RSRC_BUFFER:
 		if (rr.flags & IORING_RSRC_REGISTER_SPARSE && rr.data)
 			break;
-		return io_sqe_buffers_register(ctx, u64_to_user_ptr(rr.data),
-					       rr.nr, u64_to_user_ptr(rr.tags));
+		return io_sqe_buffers_register(ctx, (void __user *)rr.data,
+					       rr.nr, (u64 __user *)rr.tags);
 	}
 	return -EINVAL;
 }
@@ -774,7 +775,7 @@ static int io_files_update_with_index_alloc(struct io_kiocb *req,
 					    unsigned int issue_flags)
 {
 	struct io_rsrc_update *up = io_kiocb_to_cmd(req, struct io_rsrc_update);
-	__s32 __user *fds = u64_to_user_ptr(up->arg);
+	__s32 __user *fds = (__s32 __user *)up->arg;
 	unsigned int done;
 	struct file *file;
 	int ret, fd;
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 2edca190450ee..424ee773c95aa 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -23,7 +23,7 @@ struct io_rw {
 	/* NOTE: kiocb has the file as the first member, so don't do it here */
 	struct kiocb			kiocb;
-	u64				addr;
+	__kernel_uintptr_t		addr;
In such cases where we have a member in a private struct that clearly represents a user pointer, I wonder if we wouldn't be better off actually representing it as a user pointer, i.e. void __user * (or a more specific pointer type if it makes sense), as it provides better type safety and removes the need for most casts.
That would make it slightly inconsistent with cases where the member actually represents user_data (say struct io_cancel::data), in which case __kernel_uintptr_t is what we need, but that may not be a bad thing, as it would make the meaning / usage of the member clearer.
 	u32				len;
 	rwf_t				flags;
 };
@@ -39,7 +39,7 @@ static int io_iov_compat_buffer_select_prep(struct io_rw *rw)
 	struct compat_iovec __user *uiov;
 	compat_ssize_t clen;

-	uiov = u64_to_user_ptr(rw->addr);
+	uiov = (struct compat_iovec __user *)rw->addr;
 	if (!access_ok(uiov, sizeof(*uiov)))
 		return -EFAULT;
 	if (__get_user(clen, &uiov->iov_len))
@@ -65,7 +65,7 @@ static int io_iov_buffer_select_prep(struct io_kiocb *req)
 		return io_iov_compat_buffer_select_prep(rw);
 #endif

-	uiov = u64_to_user_ptr(rw->addr);
+	uiov = (struct iovec __user *)rw->addr;
 	if (get_user(rw->len, &uiov->iov_len))
 		return -EFAULT;
 	return 0;
@@ -370,7 +370,7 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
 		return NULL;
 	}

-	buf = u64_to_user_ptr(rw->addr);
+	buf = (void __user *)rw->addr;
 	sqe_len = rw->len;

 	if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE ||
@@ -379,8 +379,7 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
 		buf = io_buffer_select(req, &sqe_len, issue_flags);
 		if (!buf)
 			return ERR_PTR(-ENOBUFS);
-		/* TODO [PCuABI] - capability checks for uaccess */
-		rw->addr = user_ptr_addr(buf);
+		rw->addr = (__kernel_uintptr_t)buf;
 	}

 	rw->len = sqe_len;
@@ -446,7 +445,7 @@ static ssize_t loop_rw_iter(int ddir, struct io_rw *rw, struct iov_iter *iter)
 		if (!iov_iter_is_bvec(iter)) {
 			iovec = iov_iter_iovec(iter);
 		} else {
-			iovec.iov_base = u64_to_user_ptr(rw->addr);
+			iovec.iov_base = (void __user *)rw->addr;
 			iovec.iov_len = rw->len;
 		}
diff --git a/io_uring/statx.c b/io_uring/statx.c
index d8fc933d3f593..d2604fdbcbe33 100644
--- a/io_uring/statx.c
+++ b/io_uring/statx.c
@@ -32,8 +32,8 @@ int io_statx_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	sx->dfd = READ_ONCE(sqe->fd);
 	sx->mask = READ_ONCE(sqe->len);
-	path = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	sx->buffer = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	path = (char __user *)READ_ONCE(sqe->addr);
+	sx->buffer = (struct statx __user *)READ_ONCE(sqe->addr2);
 	sx->flags = READ_ONCE(sqe->statx_flags);

 	sx->filename = getname_flags(path,
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 6ab9916ed3844..828383907b3b9 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -22,7 +22,7 @@ static int get_compat_io_uring_rsrc_update(struct io_uring_rsrc_update *up,
 		return -EFAULT;
 	up->offset = compat_up.offset;
 	up->resv = compat_up.resv;
-	up->data = compat_up.data;
+	up->data = (__kernel_uintptr_t)compat_ptr(compat_up.data);
 	return 0;
 }
 #endif /* CONFIG_COMPAT64 */
@@ -35,7 +35,7 @@ static int copy_io_uring_rsrc_update_from_user(struct io_ring_ctx *ctx,
 	if (ctx->compat)
 		return get_compat_io_uring_rsrc_update(up, arg);
 #endif
-	return copy_from_user(up, arg, sizeof(struct io_uring_rsrc_update));
+	return copy_from_user_with_ptr(up, arg, sizeof(struct io_uring_rsrc_update));
 }
As I mentioned in v1, there is no actual pointer in the case of IORING_{,UN}REGISTER_RING_FDS, because the struct io_uring_rsrc_update fields are used in a strange way, such that data is in fact an fd (or 0). So in fact the best is probably to do nothing in this patch :)
It makes me think we should maybe give these helpers a different name in patch 7, to avoid any confusion when e.g. grepping those in rsrc.c. Maybe add "ringfd" somewhere in their name to disambiguate?
 static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx,
diff --git a/io_uring/timeout.c b/io_uring/timeout.c
index e8a8c20994805..5a0fe53c13329 100644
--- a/io_uring/timeout.c
+++ b/io_uring/timeout.c
@@ -26,7 +26,7 @@ struct io_timeout {
 struct io_timeout_rem {
 	struct file			*file;
-	u64				addr;
+	__kernel_uintptr_t		addr;

 	/* timeout update */
 	struct timespec64		ts;
@@ -337,7 +337,7 @@ static clockid_t io_timeout_get_clock(struct io_timeout_data *data)
 	}
 }

-static int io_linked_timeout_update(struct io_ring_ctx *ctx, __u64 user_data,
+static int io_linked_timeout_update(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data,
 				    struct timespec64 *ts, enum hrtimer_mode mode)
 	__must_hold(&ctx->timeout_lock)
 {
@@ -365,7 +365,7 @@ static int io_linked_timeout_update(struct io_ring_ctx *ctx, __u64 user_data,
 	return 0;
 }

-static int io_timeout_update(struct io_ring_ctx *ctx, __u64 user_data,
+static int io_timeout_update(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data,
 			     struct timespec64 *ts, enum hrtimer_mode mode)
 	__must_hold(&ctx->timeout_lock)
 {
@@ -405,7 +405,7 @@ int io_timeout_remove_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 			tr->ltimeout = true;
 		if (tr->flags & ~(IORING_TIMEOUT_UPDATE_MASK|IORING_TIMEOUT_ABS))
 			return -EINVAL;
-		if (get_timespec64(&tr->ts, u64_to_user_ptr(sqe->addr2)))
+		if (get_timespec64(&tr->ts, (struct __kernel_timespec __user *)sqe->addr2))
 			return -EFAULT;
 		if (tr->ts.tv_sec < 0 || tr->ts.tv_nsec < 0)
 			return -EINVAL;
@@ -490,7 +490,7 @@ static int __io_timeout_prep(struct io_kiocb *req,
 	data->req = req;
 	data->flags = flags;
-	if (get_timespec64(&data->ts, u64_to_user_ptr(sqe->addr)))
+	if (get_timespec64(&data->ts, (struct __kernel_timespec __user *)sqe->addr))
 		return -EFAULT;

 	if (data->ts.tv_sec < 0 || data->ts.tv_nsec < 0)
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index e50de0b6b9f84..4d2d2e3f885ee 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -65,8 +65,13 @@ int io_uring_cmd_prep_async(struct io_kiocb *req)
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
 	size_t cmd_size;

+#ifdef CONFIG_CHERI_PURECAP_UABI
+	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 32);
+	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 160);
+#else
 	BUILD_BUG_ON(uring_cmd_pdu_size(0) != 16);
 	BUILD_BUG_ON(uring_cmd_pdu_size(1) != 80);
+#endif

 	cmd_size = uring_cmd_pdu_size(req->ctx->flags & IORING_SETUP_SQE128);
diff --git a/io_uring/xattr.c b/io_uring/xattr.c
index 99df641594d74..1f13032e59536 100644
--- a/io_uring/xattr.c
+++ b/io_uring/xattr.c
@@ -53,8 +53,8 @@ static int __io_getxattr_prep(struct io_kiocb *req,
 	ix->filename = NULL;
 	ix->ctx.kvalue = NULL;
-	name = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	ix->ctx.cvalue = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	name = (char __user *)READ_ONCE(sqe->addr);
+	ix->ctx.cvalue = (void __user *)READ_ONCE(sqe->addr2);
 	ix->ctx.size = READ_ONCE(sqe->len);
 	ix->ctx.flags = READ_ONCE(sqe->xattr_flags);
@@ -93,7 +93,7 @@ int io_getxattr_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (ret)
 		return ret;
-	path = u64_to_user_ptr(READ_ONCE(sqe->addr3));
+	path = (char __user *)READ_ONCE(sqe->addr3);

 	ix->filename = getname_flags(path, LOOKUP_FOLLOW, NULL);
 	if (IS_ERR(ix->filename)) {
@@ -159,8 +159,8 @@ static int __io_setxattr_prep(struct io_kiocb *req,
 		return -EBADF;
 	ix->filename = NULL;
-	name = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	ix->ctx.cvalue = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	name = (char __user *)READ_ONCE(sqe->addr);
+	ix->ctx.cvalue = (void __user *)READ_ONCE(sqe->addr2);
 	ix->ctx.kvalue = NULL;
 	ix->ctx.size = READ_ONCE(sqe->len);
 	ix->ctx.flags = READ_ONCE(sqe->xattr_flags);
@@ -189,7 +189,7 @@ int io_setxattr_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (ret)
 		return ret;
-	path = u64_to_user_ptr(READ_ONCE(sqe->addr3));
+	path = (char __user *)READ_ONCE(sqe->addr3);
Down 48 uses of u64_to_user_ptr(), none remaining in io_uring/, well done :party:
 	ix->filename = getname_flags(path, LOOKUP_FOLLOW, NULL);
 	if (IS_ERR(ix->filename)) {
On 02-03-2023 12:41, Kevin Brodsky wrote:
On 23/02/2023 18:53, Tudor Cretu wrote:
Some members of the io_uring uAPI structs may contain user pointers. In the PCuABI, a user pointer is a 129-bit capability, so the __u64 type is not big enough to hold it. Use the __kernel_uintptr_t type instead, which is big enough on the affected architectures while remaining 64-bit on others.
The user_data field must be passed unchanged from the submission queue to the completion queue. As it is standard practice to store a pointer in user_data, expand the field to __kernel_uintptr_t. However, the kernel doesn't dereference the user_data, so don't cast it to a compat_ptr.
Maybe something like "... don't convert it in the compat case"?
Done!
In addition, for the io_uring structs containing user pointers, use the special copy routines when copying user pointers from/to userspace.
Note that the structs io_uring_sqe and io_uring_cqe are doubled in size in PCuABI. The setup flags IORING_SETUP_SQE128 and IORING_SETUP_CQE32 used to double the sizes of the two structs up to 128 bytes and 32 bytes respectively. In PCuABI, the two flags are still used to double the sizes of the two structs, but, as they increased in size, they increase up to 256 bytes and 64 bytes.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
 include/linux/io_uring_types.h |  4 +--
 include/uapi/linux/io_uring.h  | 62 +++++++++++++++++++---------------
 io_uring/advise.c              |  2 +-
 io_uring/cancel.c              |  6 ++--
 io_uring/cancel.h              |  2 +-
 io_uring/epoll.c               |  2 +-
 io_uring/fdinfo.c              |  9 ++---
 io_uring/fs.c                  | 16 ++++-----
 io_uring/io_uring.c            | 49 +++++++++++++++++++++++----
 io_uring/io_uring.h            | 16 ++++-----
 io_uring/kbuf.c                | 15 ++++----
 io_uring/kbuf.h                |  2 +-
 io_uring/msg_ring.c            |  4 +--
 io_uring/net.c                 | 20 +++++------
 io_uring/openclose.c           |  4 +--
 io_uring/poll.c                |  4 +--
 io_uring/rsrc.c                | 41 +++++++++++-----------
 io_uring/rw.c                  | 13 ++++---
 io_uring/statx.c               |  4 +--
 io_uring/tctx.c                |  4 +--
 io_uring/timeout.c             | 10 +++---
 io_uring/uring_cmd.c           |  5 +++
 io_uring/xattr.c               | 12 +++---
 23 files changed, 178 insertions(+), 128 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 2df3225b562fa..14aa30151c5f9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -11,6 +11,11 @@
 #include <linux/fs.h>
 #include <linux/types.h>
 #include <linux/time_types.h>
+#ifdef __KERNEL__
+#include <linux/stddef.h>
+#else
+#include <stddef.h>
+#endif
Worth adding a comment explaining this is for offsetof(), otherwise this is rather mysterious.
Done!
@@ -58,7 +63,7 @@ struct io_uring_sqe {
 		__u32		msg_ring_flags;
 		__u32		uring_cmd_flags;
 	};
-	__u64	user_data;	/* data to be passed back at completion time */
+	__kernel_uintptr_t	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
 	union {
 		/* index into fixed buffers, if used */
@@ -78,12 +83,14 @@ struct io_uring_sqe {
 	};
 	union {
 		struct {
-			__u64	addr3;
-			__u64	__pad2[1];
+			__kernel_uintptr_t	addr3;
+			__u64	__pad2[1];
That looks off, the padding does not have the same size as addr3 in PCuABI. IIUC the overall struct layout, you do want the padding in PCuABI too so that the overall struct size is 128 (two cache lines), so it should be the same size as addr3. The easiest option is probably to make it __kernel_uintptr_t too.
I agree. Done!
 struct io_uring_buf {
-	__u64	addr;
+	__kernel_uintptr_t	addr;
 	__u32	len;
 	__u16	bid;
 	__u16	resv;
Nit: might as well align the other fields, same for io_uring_buf_reg.
Done
@@ -594,9 +602,7 @@ struct io_uring_buf_ring {
 		 * ring tail is overlaid with the io_uring_buf->resv field.
 		 */
 		struct {
-			__u64	resv1;
-			__u32	resv2;
-			__u16	resv3;
+			__u8	resv1[offsetof(struct io_uring_buf, resv)];
Nit: maybe just resv now that there's only one? (Nice solution BTW, I don't think you can do better than that!)
Done!
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index 20cb8634e44af..372e2442231a4 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -19,7 +19,7 @@ struct io_cancel {
 	struct file			*file;
-	u64				addr;
+	__kernel_uintptr_t		addr;
So that one is interesting. In this case addr represents a user_data value to match against in order to find a request to cancel. As it stands we do propagate the capability in PCuABI but then only use it as 64-bit value, since we do a simple address comparison in io_cancel_cb(). The same applies to IORING_OP_POLL_REMOVE (io_poll_find() in poll.c) and IORING_OP_TIMEOUT_REMOVE (io_timeout_extract() in timeout.c).
Considering user_data is used as a sort of key in that case, I suspect doing a full comparison makes more sense, though it sounds unlikely that two SQEs would have different user_data with the same address. Worth pondering on in any case.
I agree it's a tricky case. The man page specifies that "addr must contain the user_data field of the request that should be canceled", so I believe it's clearer to give addr the same type as user_data and simply propagate it here (and then to struct io_cancel_data). You made a great point that it's not required to be a capability, but I'm not sure we'd gain anything by restricting it to u64. I have removed the "compat_ptr()" from get_compat64_io_uring_sync_cancel_reg though. Thanks!
@@ -726,7 +726,7 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task)
 	}
 }

-static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
+static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data,
Looking at how user_data is used, I see that it is passed on to a couple of trace functions (here and in io_fill_cqe_aux()). Should we extend the trace events accordingly? Would that be useful at all? We don't necessarily need an answer now, maybe that's to be investigated later.
We could propagate the capability there, but what format specifier would be appropriate for a __kernel_uintptr_t instead of %llx? I don't think printing the capability metadata would really be worth it, but I don't like the idea of keeping user_data as a u64 there either. So maybe I should just propagate __kernel_uintptr_t into the trace functions as well and keep %llx?
 				     s32 res, u32 cflags, u64 extra1, u64 extra2)
 {
 	struct io_overflow_cqe *ocqe;
@@ -818,7 +818,7 @@ void *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow)
 	return cqe;
 }

-bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
+bool io_fill_cqe_aux(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, s32 res, u32 cflags,
Nit: it's now well over 80 char so splitting would be nice (in the header too).
Done!
@@ -4158,6 +4158,42 @@ static int __init io_uring_init(void)
 	__BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, sizeof(etype), ename)
 #define BUILD_BUG_SQE_ELEM_SIZE(eoffset, esize, ename) \
 	__BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, esize, ename)
+#ifdef CONFIG_CHERI_PURECAP_UABI
+	BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 128);
+	BUILD_BUG_SQE_ELEM(0,  __u8,  opcode);
+	BUILD_BUG_SQE_ELEM(1,  __u8,  flags);
+	BUILD_BUG_SQE_ELEM(2,  __u16, ioprio);
+	BUILD_BUG_SQE_ELEM(4,  __s32, fd);
+	BUILD_BUG_SQE_ELEM(16, __u64, off);
+	BUILD_BUG_SQE_ELEM(16, __uintcap_t, addr2);
+	BUILD_BUG_SQE_ELEM(32, __uintcap_t, addr);
+	BUILD_BUG_SQE_ELEM(32, __u64, splice_off_in);
+	BUILD_BUG_SQE_ELEM(48, __u32, len);
+	BUILD_BUG_SQE_ELEM(52, __kernel_rwf_t, rw_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, fsync_flags);
+	BUILD_BUG_SQE_ELEM(52, __u16, poll_events);
+	BUILD_BUG_SQE_ELEM(52, __u32, poll32_events);
+	BUILD_BUG_SQE_ELEM(52, __u32, sync_range_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, msg_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, timeout_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, accept_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, cancel_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, open_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, statx_flags);
+	BUILD_BUG_SQE_ELEM(52, __u32, fadvise_advice);
+	BUILD_BUG_SQE_ELEM(52, __u32, splice_flags);
+	BUILD_BUG_SQE_ELEM(64, __uintcap_t, user_data);
+	BUILD_BUG_SQE_ELEM(80, __u16, buf_index);
+	BUILD_BUG_SQE_ELEM(80, __u16, buf_group);
+	BUILD_BUG_SQE_ELEM(82, __u16, personality);
+	BUILD_BUG_SQE_ELEM(84, __s32, splice_fd_in);
+	BUILD_BUG_SQE_ELEM(84, __u32, file_index);
+	BUILD_BUG_SQE_ELEM(84, __u16, addr_len);
+	BUILD_BUG_SQE_ELEM(86, __u16, __pad3[0]);
+	BUILD_BUG_SQE_ELEM(96, __uintcap_t, addr3);
+	BUILD_BUG_SQE_ELEM_SIZE(96, 0, cmd);
+	BUILD_BUG_SQE_ELEM(112, __u64, __pad2);
Some assertions are missing compared to !PCuABI, I suppose that's because the corresponding fields were introduced after 5.18.
Well spotted!
@@ -148,7 +148,7 @@ static inline void convert_compat_io_uring_sqe(struct io_ring_ctx *ctx,
 	memcpy(sqe->cmd, compat_sqe->cmd, compat_cmd_size);
Now that sqe->cmd is bigger than compat_sqe->cmd, might it be problematic that we're leaving the end of sqe->cmd uninitialised? I suspect the answer is no because .uring_cmd handlers are not going to use that extra data, at least not in compat, but maybe that deserves a comment at least.
Done!
@@ -175,7 +175,7 @@ static inline void *io_get_cqe(struct io_ring_ctx *ctx)
 }

 static inline void __io_fill_cqe_any(struct io_ring_ctx *ctx, struct io_uring_cqe *cqe,
-				     u64 user_data, s32 res, u32 cflags,
+				     __kernel_uintptr_t user_data, s32 res, u32 cflags,
I would add an explicit cast to __u64 when writing user_data in compat64. It's not essential but makes it easier to see that the rest is discarded (it doesn't matter because user_data is also a __u64 in the SQE in compat64, but it would be symmetric with the cast to __kernel_uintptr_t in convert_compat_io_uring_sqe()).
Done!
diff --git a/io_uring/net.c b/io_uring/net.c
index 4c133bc6f9d1d..a9143b4652114 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -243,13 +243,13 @@ int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (req->opcode == IORING_OP_SEND) {
 		if (READ_ONCE(sqe->__pad3[0]))
 			return -EINVAL;
-		sr->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+		sr->addr = (void __user *)READ_ONCE(sqe->addr2);
 		sr->addr_len = READ_ONCE(sqe->addr_len);
 	} else if (sqe->addr2 || sqe->file_index) {
 		return -EINVAL;
 	}

-	sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	sr->umsg = (struct user_msghdr __user *)READ_ONCE(sqe->addr);
 	sr->len = READ_ONCE(sqe->len);
 	sr->flags = READ_ONCE(sqe->ioprio);
 	if (sr->flags & ~IORING_RECVSEND_POLL_FIRST)
@@ -421,7 +421,7 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req,
 	struct user_msghdr msg;
 	int ret;

-	if (copy_from_user(&msg, sr->umsg, sizeof(*sr->umsg)))
+	if (copy_from_user_with_ptr(&msg, sr->umsg, sizeof(*sr->umsg)))
Good catch. I've also realised that copy_msghdr_from_user() (net/socket.c) does not preserve capabilities either. It's broken in general and not just for io_uring, but maybe it makes sense to fix it in that series too?
Sure! Made the change in a new patch.
@@ -1392,8 +1392,8 @@ int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (sqe->len || sqe->buf_index || sqe->rw_flags || sqe->splice_fd_in)
 		return -EINVAL;

-	conn->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
-	conn->addr_len = READ_ONCE(sqe->addr2);
+	conn->addr = (void __user *)READ_ONCE(sqe->addr);
+	conn->addr_len = READ_ONCE(sqe->addr2);
It probably ends up being equivalent, but using sqe->off would probably be better here (as it's a size and not a pointer), especially since liburing seems to set ->off and not ->addr2.
It might be good to check other non-pointer uses of ->addr2 while at it (if any). Unfortunately ->off and ->addr2 seem to be used somewhat interchangeably upstream and that's not really equivalent any more.
Good catch! I haven't noticed any other misuse of addr2.
diff --git a/io_uring/poll.c b/io_uring/poll.c
index d9bf1767867e6..4914048fed0f8 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -22,8 +22,8 @@
 struct io_poll_update {
	struct file			*file;
-	u64				old_user_data;
-	u64				new_user_data;
+	__kernel_uintptr_t		old_user_data;
+	__kernel_uintptr_t		new_user_data;
Looking at how new_user_data is used, I see that it is set to sqe->off, which is clearly not what we want in PCuABI (since off is a __u64). I suppose we should have the same change as in msg_ring.c and set it to sqe->addr2 instead?
That's right. Btw, the man pages mention the flag IORING_POLL_UPDATE_USER_DATA at the IORING_OP_POLL_ADD section, but liburing confirms that the flag is used only with IORING_OP_POLL_REMOVE, so that was a bit confusing.
 	__poll_t			events;
 	bool				update_events;
 	bool				update_user_data;

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index eafdb039f9e51..f94d7475cb4d8 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -201,13 +201,14 @@ static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
 		if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
 			return -EFAULT;

-		dst->iov_base = u64_to_user_ptr((u64)ciov.iov_base);
+		dst->iov_base = compat_ptr(ciov.iov_base);
+
Not sure we want that extra empty line.
Done!
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 2edca190450ee..424ee773c95aa 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -23,7 +23,7 @@
 struct io_rw {
 	/* NOTE: kiocb has the file as the first member, so don't do it here */
 	struct kiocb			kiocb;
-	u64				addr;
+	__kernel_uintptr_t		addr;
In such cases where we have a member in a private struct that clearly represents a user pointer, I wonder if we wouldn't better off actually representing it as a user pointer, i.e. void __user * (or a more specific pointer type if it makes sense), as it provides better type safety and removes the need for most casts.
That would make it slightly inconsistent with cases where the member actually represents user_data (say struct io_cancel::data), in which case __kernel_uintptr_t is what we need, but that may not be bad thing, as it would make the meaning / usage of the member clearer.
Good suggestion! I have updated the following structs: io_madvise, io_provide_buf, io_buffer, io_rsrc_update and io_rw.
The other similar structs are using the addr field for storing a user_data value, so I kept __kernel_uintptr_t for them.
diff --git a/io_uring/xattr.c b/io_uring/xattr.c
index 99df641594d74..1f13032e59536 100644
--- a/io_uring/xattr.c
+++ b/io_uring/xattr.c
@@ -189,7 +189,7 @@ int io_setxattr_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (ret)
 		return ret;

-	path = u64_to_user_ptr(READ_ONCE(sqe->addr3));
+	path = (char __user *)READ_ONCE(sqe->addr3);
Down 48 uses of u64_to_user_ptr(), none remaining in io_uring/, well done :party:
Thanks for the detailed review! :penguin_dance:
Tudor
On 16/03/2023 14:40, Tudor Cretu wrote:
[...]
@@ -726,7 +726,7 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task)
 	}
 }

-static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
+static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data,
Looking at how user_data is used, I see that it is passed on to a couple of trace functions (here and in io_fill_cqe_aux()). Should we extend the trace events accordingly? Would that be useful at all? We don't necessarily need an answer now, maybe that's to be investigated later.
We could propagate the capability there, but what format specifier would be appropriate for a __kernel_uintptr_t instead of %llx? I don't think printing the capability metadata would really be worth it, but I don't like the idea of keeping user_data as a u64 there either. So maybe I should just propagate __kernel_uintptr_t into the trace functions as well and keep %llx?
I agree printing capability metadata is unlikely to be very helpful. I would say it's fine to keep the trace functions as-is, I would only change them to take __kernel_uintptr_t if they actually printed more than just a u64.
[...]
diff --git a/io_uring/poll.c b/io_uring/poll.c
index d9bf1767867e6..4914048fed0f8 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -22,8 +22,8 @@
 struct io_poll_update {
 	struct file			*file;
-	u64				old_user_data;
-	u64				new_user_data;
+	__kernel_uintptr_t		old_user_data;
+	__kernel_uintptr_t		new_user_data;
Looking at how new_user_data is used, I see that it is set to sqe->off, which is clearly not what we want in PCuABI (since off is a __u64). I suppose we should have the same change as in msg_ring.c and set it to sqe->addr2 instead?
That's right. Btw, the man pages mention the flag IORING_POLL_UPDATE_USER_DATA at the IORING_OP_POLL_ADD section, but liburing confirms that the flag is used only with IORING_OP_POLL_REMOVE, so that was a bit confusing.
Oh yes that's clearly a mistake in the man page, IORING_POLL_UPDATE_* don't make sense for IORING_OP_POLL_ADD! Since the man pages actually come from liburing itself, would be a nice patch to add there (and eventually upstream).
Kevin
The io_uring shared memory region hosts the io_uring_sqe and io_uring_cqe arrays. These structs may contain user pointers, so the memory region must be allowed to store and load capability pointers.
Signed-off-by: Tudor Cretu <tudor.cretu@arm.com>
---
 io_uring/io_uring.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 24c1c9fb9b298..82a9ef824e32a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3136,6 +3136,11 @@ static __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
 	if (IS_ERR(ptr))
 		return PTR_ERR(ptr);

+#ifdef CONFIG_CHERI_PURECAP_UABI
+	vma->vm_flags |= VM_READ_CAPS | VM_WRITE_CAPS;
+	vma_set_page_prot(vma);
+#endif
+
 	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
 	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
 }
On 23/02/2023 18:53, Tudor Cretu wrote:
This series makes it possible for purecap apps to use the io_uring system.
With these patches, all io_uring LTP tests pass in both Purecap and compat modes. Note that the LTP tests only address the basic functionality of the io_uring system and a significant portion of the multiplexed functionality is untested in LTP.
I have finished investigating Purecap liburing tests and examples and the series is updated accordingly. I am investigating compat liburing tests at the moment, so another version might be expected.
v3:
- Introduce Patch 5 which exposes the compat handling logic for epoll_event. This is used then in io_uring/epoll.c.
- Introduce Patch 6 which makes sure that when struct iovec is copied from userspace, the capability tags are preserved.
- Fix a few sizeof(var) to sizeof(*var).
- Use iovec_from_user so that compat handling logic is applied instead of copying directly from user
- Add a few missing copy_from_user_with_ptr where suitable.
v2:
- Rebase on top of release 6.1
- Remove VM_READ_CAPS/VM_LOAD_CAPS patches as they are already merged
- Update commit message in PATCH 1
- Add the generic changes PATCH 2 and PATCH 3 to avoid copying user pointers from/to userspace unnecessarily. These could be upstreamable.
- Split "pulling the cqes member out" change into PATCH 4
- The changes for PATCH 5 and 6 are now split into their respective files after the rebase.
- Format and change organization based on the feedback on the previous version, including creating helpers copy_*_from_* for various uAPI structs
- Add comments related to handling of setup flags IORING_SETUP_SQE128 and IORING_SETUP_CQE32
- Add handling for new uAPI structs: io_uring_buf, io_uring_buf_ring, io_uring_buf_reg, io_uring_sync_cancel_reg.
Gitlab issue: https://git.morello-project.org/morello/kernel/linux/-/issues/2
Review branch: https://git.morello-project.org/tudcre01/linux/-/commits/morello/io_uring_v3
Tudor Cretu (9):
  compiler_types: Add (u)intcap_t to native_words
  io_uring/rw: Restrict copy to only uiov->len from userspace
  io_uring/tctx: Copy only the offset field back to user
  io_uring: Pull cqes member out from rings struct
  epoll: Expose compat handling logic of epoll_event
  iov_iter: use copy_from_user_with_ptr for struct iovec
  io_uring: Implement compat versions of uAPI structs and handle them
  io_uring: Use user pointer type in the uAPI structs
  io_uring: Allow capability tag access on the shared memory
This is looking very good overall, well done! I've got quite a few comments on patch 7 and 8 but it's nothing fundamental.
Minor thing, it might make more sense to have patch 9 before patch 8, since patch 8 cannot work without patch 9 in PCuABI (and arguably before patch 8 things do work in PCuABI, although with a different ABI).
Kevin
 fs/eventpoll.c                 |  39 ++--
 include/linux/compiler_types.h |   7 +
 include/linux/eventpoll.h      |   4 +
 include/linux/io_uring_types.h | 160 ++++++++++++++--
 include/uapi/linux/io_uring.h  |  62 ++++---
 io_uring/advise.c              |   2 +-
 io_uring/cancel.c              |  40 +++-
 io_uring/cancel.h              |   2 +-
 io_uring/epoll.c               |   4 +-
 io_uring/fdinfo.c              |  64 ++++++-
 io_uring/fs.c                  |  16 +-
 io_uring/io_uring.c            | 329 +++++++++++++++++++++++++--------
 io_uring/io_uring.h            | 126 ++++++++++---
 io_uring/kbuf.c                | 119 ++++++++++--
 io_uring/kbuf.h                |   8 +-
 io_uring/msg_ring.c            |   4 +-
 io_uring/net.c                 |  25 +--
 io_uring/openclose.c           |   4 +-
 io_uring/poll.c                |   4 +-
 io_uring/rsrc.c                | 150 ++++++++++++---
 io_uring/rw.c                  |  17 +-
 io_uring/statx.c               |   4 +-
 io_uring/tctx.c                |  57 +++++-
 io_uring/timeout.c             |  10 +-
 io_uring/uring_cmd.c           |   5 +
 io_uring/uring_cmd.h           |   7 +
 io_uring/xattr.c               |  12 +-
 lib/iov_iter.c                 |   2 +-
 28 files changed, 1014 insertions(+), 269 deletions(-)
linux-morello@op-lists.linaro.org