On 02-03-2023 12:41, Kevin Brodsky wrote:
On 23/02/2023 18:53, Tudor Cretu wrote:
Some members of the io_uring uAPI structs may contain user pointers. In the PCuABI, a user pointer is a 129-bit capability, so the __u64 type is not big enough to hold it. Use the __kernel_uintptr_t type instead, which is big enough on the affected architectures while remaining 64-bit on others.
The user_data field must be passed unchanged from the submission queue to the completion queue. As it is standard practice to store a pointer in user_data, expand the field to __kernel_uintptr_t. However, the kernel doesn't dereference the user_data, so don't cast it to a compat_ptr.
Maybe something like "... don't convert it in the compat case"?
Done!
In addition, for the io_uring structs containing user pointers, use the special copy routines when copying user pointers from/to userspace.
Note that the structs io_uring_sqe and io_uring_cqe are doubled in size in PCuABI. The setup flags IORING_SETUP_SQE128 and IORING_SETUP_CQE32 used to double the sizes of the two structs up to 128 bytes and 32 bytes respectively. In PCuABI, the two flags are still used to double the sizes of the two structs, but, as they increased in size, they increase up to 256 bytes and 64 bytes.
Signed-off-by: Tudor Cretu tudor.cretu@arm.com
include/linux/io_uring_types.h | 4 +-- include/uapi/linux/io_uring.h | 62 +++++++++++++++++++--------------- io_uring/advise.c | 2 +- io_uring/cancel.c | 6 ++-- io_uring/cancel.h | 2 +- io_uring/epoll.c | 2 +- io_uring/fdinfo.c | 9 ++--- io_uring/fs.c | 16 ++++----- io_uring/io_uring.c | 49 +++++++++++++++++++++++---- io_uring/io_uring.h | 16 ++++----- io_uring/kbuf.c | 15 ++++---- io_uring/kbuf.h | 2 +- io_uring/msg_ring.c | 4 +-- io_uring/net.c | 20 +++++------ io_uring/openclose.c | 4 +-- io_uring/poll.c | 4 +-- io_uring/rsrc.c | 41 +++++++++++----------- io_uring/rw.c | 13 ++++--- io_uring/statx.c | 4 +-- io_uring/tctx.c | 4 +-- io_uring/timeout.c | 10 +++--- io_uring/uring_cmd.c | 5 +++ io_uring/xattr.c | 12 +++---- 23 files changed, 178 insertions(+), 128 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 2df3225b562fa..14aa30151c5f9 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -11,6 +11,11 @@ #include <linux/fs.h> #include <linux/types.h> #include <linux/time_types.h> +#ifdef __KERNEL__ +#include <linux/stddef.h> +#else +#include <stddef.h> +#endif
Worth adding a comment explaining this is for offsetof(), otherwise this is rather mysterious.
Done!
@@ -58,7 +63,7 @@ struct io_uring_sqe { __u32 msg_ring_flags; __u32 uring_cmd_flags; };
- __u64 user_data; /* data to be passed back at completion time */
- __kernel_uintptr_t user_data; /* data to be passed back at completion time */ /* pack this to avoid bogus arm OABI complaints */ union { /* index into fixed buffers, if used */
@@ -78,12 +83,14 @@ struct io_uring_sqe { }; union { struct {
__u64 addr3;
__u64 __pad2[1];
__kernel_uintptr_t addr3;
__u64 __pad2[1];
That looks off, the padding does not have the same size as addr3 in PCuABI. IIUC the overall struct layout, you do want the padding in PCuABI too so that the overall struct size is 128 (two cache lines), so it should be the same size as addr3. The easiest option is probably to make it __kernel_uintptr_t too.
I agree. Done!
struct io_uring_buf {
- __u64 addr;
- __kernel_uintptr_t addr; __u32 len; __u16 bid; __u16 resv;
Nit: might as well align the other fields, same for io_uring_buf_reg.
Done
@@ -594,9 +602,7 @@ struct io_uring_buf_ring { * ring tail is overlaid with the io_uring_buf->resv field. */ struct {
__u64 resv1;
__u32 resv2;
__u16 resv3;
__u8 resv1[offsetof(struct io_uring_buf, resv)];
Nit: maybe just resv now that there's only one? (Nice solution BTW, I don't think you can do better than that!)
Done!
diff --git a/io_uring/cancel.c b/io_uring/cancel.c index 20cb8634e44af..372e2442231a4 100644 --- a/io_uring/cancel.c +++ b/io_uring/cancel.c @@ -19,7 +19,7 @@ struct io_cancel { struct file *file;
- u64 addr;
- __kernel_uintptr_t addr;
So that one is interesting. In this case addr represents a user_data value to match against in order to find a request to cancel. As it stands we do propagate the capability in PCuABI but then only use it as 64-bit value, since we do a simple address comparison in io_cancel_cb(). The same applies to IORING_OP_POLL_REMOVE (io_poll_find() in poll.c) and IORING_OP_TIMEOUT_REMOVE (io_timeout_extract() in timeout.c).
Considering user_data is used as a sort of key in that case, I suspect doing a full comparison makes more sense, though it sounds unlikely that two SQEs would have different user_data with the same address. Worth pondering on in any case.
I agree it's a tricky case. The man page specifies that "addr must contain the user_data field of the request that should be canceled", so I believe it's clearer to have addr the same type as user_data and just propagate it here (and then to struct io_cancel_data). You made a great point that it's not required to be a capability, but I'm not sure we're gaining anything if we would restrict it to u64. I have removed the "compat_ptr()" from get_compat64_io_uring_sync_cancel_reg though. Thanks!
@@ -726,7 +726,7 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task) } } -static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, +static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data,
Looking at how user_data is used, I see that it is passed on to a couple of trace functions (here and in io_fill_cqe_aux()). Should we extend the trace events accordingly? Would that be useful at all? We don't necessarily need an answer now, maybe that's to be investigated later.
We could propagate the capability there, what format specifier would be appropriate for a __kernel_uintptr_t though instead of %llx? I don't think printing capability metadata would be really worth it, but I don't like the idea of keeping user_data as a u64 there either. So, maybe I just propagate to __kernel_uintptr_t in the trace functions as well and keep %llx?
s32 res, u32 cflags, u64 extra1, u64 extra2)
{ struct io_overflow_cqe *ocqe; @@ -818,7 +818,7 @@ void *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow) return cqe; } -bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags, +bool io_fill_cqe_aux(struct io_ring_ctx *ctx, __kernel_uintptr_t user_data, s32 res, u32 cflags,
Nit: it's now well over 80 char so splitting would be nice (in the header too).
Done!
@@ -4158,6 +4158,42 @@ static int __init io_uring_init(void) __BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, sizeof(etype), ename) #define BUILD_BUG_SQE_ELEM_SIZE(eoffset, esize, ename) \ __BUILD_BUG_VERIFY_OFFSET_SIZE(struct io_uring_sqe, eoffset, esize, ename) +#ifdef CONFIG_CHERI_PURECAP_UABI
- BUILD_BUG_ON(sizeof(struct io_uring_sqe) != 128);
- BUILD_BUG_SQE_ELEM(0, __u8, opcode);
- BUILD_BUG_SQE_ELEM(1, __u8, flags);
- BUILD_BUG_SQE_ELEM(2, __u16, ioprio);
- BUILD_BUG_SQE_ELEM(4, __s32, fd);
- BUILD_BUG_SQE_ELEM(16, __u64, off);
- BUILD_BUG_SQE_ELEM(16, __uintcap_t, addr2);
- BUILD_BUG_SQE_ELEM(32, __uintcap_t, addr);
- BUILD_BUG_SQE_ELEM(32, __u64, splice_off_in);
- BUILD_BUG_SQE_ELEM(48, __u32, len);
- BUILD_BUG_SQE_ELEM(52, __kernel_rwf_t, rw_flags);
- BUILD_BUG_SQE_ELEM(52, __u32, fsync_flags);
- BUILD_BUG_SQE_ELEM(52, __u16, poll_events);
- BUILD_BUG_SQE_ELEM(52, __u32, poll32_events);
- BUILD_BUG_SQE_ELEM(52, __u32, sync_range_flags);
- BUILD_BUG_SQE_ELEM(52, __u32, msg_flags);
- BUILD_BUG_SQE_ELEM(52, __u32, timeout_flags);
- BUILD_BUG_SQE_ELEM(52, __u32, accept_flags);
- BUILD_BUG_SQE_ELEM(52, __u32, cancel_flags);
- BUILD_BUG_SQE_ELEM(52, __u32, open_flags);
- BUILD_BUG_SQE_ELEM(52, __u32, statx_flags);
- BUILD_BUG_SQE_ELEM(52, __u32, fadvise_advice);
- BUILD_BUG_SQE_ELEM(52, __u32, splice_flags);
- BUILD_BUG_SQE_ELEM(64, __uintcap_t, user_data);
- BUILD_BUG_SQE_ELEM(80, __u16, buf_index);
- BUILD_BUG_SQE_ELEM(80, __u16, buf_group);
- BUILD_BUG_SQE_ELEM(82, __u16, personality);
- BUILD_BUG_SQE_ELEM(84, __s32, splice_fd_in);
- BUILD_BUG_SQE_ELEM(84, __u32, file_index);
- BUILD_BUG_SQE_ELEM(84, __u16, addr_len);
- BUILD_BUG_SQE_ELEM(86, __u16, __pad3[0]);
- BUILD_BUG_SQE_ELEM(96, __uintcap_t, addr3);
- BUILD_BUG_SQE_ELEM_SIZE(96, 0, cmd);
- BUILD_BUG_SQE_ELEM(112, __u64, __pad2);
Some assertions are missing compared to !PCuABI, I suppose that's because the corresponding fields were introduced after 5.18.
Well spotted!
@@ -148,7 +148,7 @@ static inline void convert_compat_io_uring_sqe(struct io_ring_ctx *ctx, memcpy(sqe->cmd, compat_sqe->cmd, compat_cmd_size);
Now that sqe->cmd is bigger than compat_sqe->cmd, might it be problematic that we're leaving the end of sqe->cmd uninitialised? I suspect the answer is no because .uring_cmd handlers are not going to use that extra data, at least not in compat, but maybe that deserves a comment at least.
Done!
@@ -175,7 +175,7 @@ static inline void *io_get_cqe(struct io_ring_ctx *ctx) } static inline void __io_fill_cqe_any(struct io_ring_ctx *ctx, struct io_uring_cqe *cqe,
u64 user_data, s32 res, u32 cflags,
__kernel_uintptr_t user_data, s32 res, u32 cflags,
I would add an explicit cast to __u64 when writing user_data in compat64. It's not essential but makes it easier to see that the rest is discarded (it doesn't matter because user_data is also a __u64 in the SQE in compat64, but it would be symmetric with the cast to __kernel_uintptr_t in convert_compat_io_uring_sqe()).
Done!
diff --git a/io_uring/net.c b/io_uring/net.c index 4c133bc6f9d1d..a9143b4652114 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -243,13 +243,13 @@ int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (req->opcode == IORING_OP_SEND) { if (READ_ONCE(sqe->__pad3[0])) return -EINVAL;
sr->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
sr->addr_len = READ_ONCE(sqe->addr_len); } else if (sqe->addr2 || sqe->file_index) { return -EINVAL; }sr->addr = (void __user *)READ_ONCE(sqe->addr2);
- sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr));
- sr->umsg = (struct user_msghdr __user *)READ_ONCE(sqe->addr); sr->len = READ_ONCE(sqe->len); sr->flags = READ_ONCE(sqe->ioprio); if (sr->flags & ~IORING_RECVSEND_POLL_FIRST)
@@ -421,7 +421,7 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req, struct user_msghdr msg; int ret;
- if (copy_from_user(&msg, sr->umsg, sizeof(*sr->umsg)))
- if (copy_from_user_with_ptr(&msg, sr->umsg, sizeof(*sr->umsg)))
Good catch. I've also realised that copy_msghdr_from_user() (net/socket.c) does not preserve capabilities either. It's broken in general and not just for io_uring, but maybe it makes sense to fix it in that series too?
Sure! Made the change in a new patch.
@@ -1392,8 +1392,8 @@ int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (sqe->len || sqe->buf_index || sqe->rw_flags || sqe->splice_fd_in) return -EINVAL;
- conn->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
- conn->addr_len = READ_ONCE(sqe->addr2);
- conn->addr = (void __user *)READ_ONCE(sqe->addr);
- conn->addr_len = READ_ONCE(sqe->addr2);
It probably ends up being equivalent, but using sqe->off would probably be better here (as it's a size and not a pointer), especially since liburing seems to set ->off and not ->addr2.
It might be good to check other non-pointer uses of ->addr2 while at it (if any). Unfortunately ->off and ->addr2 seem to be used somewhat interchangeably upstream and that's not really equivalent any more.
Good catch! I haven't noticed any other misuse of addr2.
diff --git a/io_uring/poll.c b/io_uring/poll.c index d9bf1767867e6..4914048fed0f8 100644 --- a/io_uring/poll.c +++ b/io_uring/poll.c @@ -22,8 +22,8 @@ struct io_poll_update { struct file *file;
- u64 old_user_data;
- u64 new_user_data;
- __kernel_uintptr_t old_user_data;
- __kernel_uintptr_t new_user_data;
Looking at how new_user_data is used, I see that it is set to sqe->off, which is clearly not what we want in PCuABI (since off is a __u64). I suppose we should have the same change as in msg_ring.c and set it to sqe->addr2 instead?
That's right. Btw, the man pages mention the flag IORING_POLL_UPDATE_USER_DATA at the IORING_OP_POLL_ADD section, but liburing confirms that the flag is used only with IORING_OP_POLL_REMOVE, so that was a bit confusing.
__poll_t events; bool update_events; bool update_user_data; diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index eafdb039f9e51..f94d7475cb4d8 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -201,13 +201,14 @@ static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst, if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov))) return -EFAULT;
dst->iov_base = u64_to_user_ptr((u64)ciov.iov_base);
dst->iov_base = compat_ptr(ciov.iov_base);
Not sure we want that extra empty line.
Done!
diff --git a/io_uring/rw.c b/io_uring/rw.c index 2edca190450ee..424ee773c95aa 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -23,7 +23,7 @@ struct io_rw { /* NOTE: kiocb has the file as the first member, so don't do it here */ struct kiocb kiocb;
- u64 addr;
- __kernel_uintptr_t addr;
In such cases where we have a member in a private struct that clearly represents a user pointer, I wonder if we wouldn't better off actually representing it as a user pointer, i.e. void __user * (or a more specific pointer type if it makes sense), as it provides better type safety and removes the need for most casts.
That would make it slightly inconsistent with cases where the member actually represents user_data (say struct io_cancel::data), in which case __kernel_uintptr_t is what we need, but that may not be bad thing, as it would make the meaning / usage of the member clearer.
Good suggestiong! I have updated the following structs: io_madvise, io_provide_buf, io_buffer, io_rsrc_update, io_rw
The other similar structs are using the addr field for storing a user_data value, so I kept __kernel_uintptr_t for them.
diff --git a/io_uring/xattr.c b/io_uring/xattr.c index 99df641594d74..1f13032e59536 100644 --- a/io_uring/xattr.c +++ b/io_uring/xattr.c @@ -189,7 +189,7 @@ int io_setxattr_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (ret) return ret;
- path = u64_to_user_ptr(READ_ONCE(sqe->addr3));
- path = (char __user *)READ_ONCE(sqe->addr3);
Down 48 uses of u64_to_user_ptr(), none remaining in io_uring/, well done :party:
Thanks for the detailed review! :penguin_dance:
Tudor