Hi,
I've looked in more detail at limited memory sharing (STR-6, STR-8, STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
Problem
-------
We have a primary VM running a guest, and a secondary one running a backend that manages one hardware resource (for example network access). They communicate with virtio (for example virtio-net). The guest implements a virtio driver, the backend a virtio device. Problem is, how to ensure that the backend and guest only share the memory required for the virtio communication, and that the backend cannot access any other memory from the guest?
Static shared region
--------------------
Let's first look at static DMA regions. The hypervisor allocates a subset of memory to be shared between guest and backend. The hypervisor communicates this per-device DMA restriction to the guest during boot. It could be using a firmware property, or a discovery protocol. I did start drafting such a protocol in virtio-iommu, but I now think the reserved-memory mechanism in device-tree, below, is preferable. Would we need an equivalent for ACPI, though?
How would we implement this in a Linux guest? The virtqueue of a virtio device has two components. Static ring buffers, allocated at boot with dma_alloc_coherent(), and the actual data payload, mapped with dma_map_page() and dma_map_single(). Linux calls the former "coherent" DMA, and the latter "streaming" DMA.
Coherent DMA can already obtain its pages from a per-device memory pool. dma_init_coherent_memory() defines a range of physical memory usable by a device. Importantly this region has to be distinct from system memory RAM and reserved by the platform. It is mapped non-cacheable. If it exists, dma_alloc_coherent() will only get its pages from that region.
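To make that concrete, here is a minimal, illustrative sketch (not taken from an existing driver; the function name and trimmed error handling are made up) of how a platform driver would attach a reserved-memory node as its per-device coherent pool and then allocate from it:

    #include <linux/dma-mapping.h>
    #include <linux/of_reserved_mem.h>
    #include <linux/platform_device.h>

    static int example_probe(struct platform_device *pdev)
    {
        dma_addr_t dma;
        void *ring;
        int ret;

        /* Attach the reserved-memory region referenced by the device node
         * as this device's coherent pool. */
        ret = of_reserved_mem_device_init(&pdev->dev);
        if (ret)
            return ret;

        /* With the pool in place, the allocation is served from the
         * reserved region rather than from system RAM. */
        ring = dma_alloc_coherent(&pdev->dev, PAGE_SIZE, &dma, GFP_KERNEL);
        if (!ring)
            return -ENOMEM;

        return 0;
    }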
On the other hand streaming DMA doesn't allocate memory. The virtio drivers don't control where that memory comes from, since the pages are fed to them by an upper layer of the subsystem, specific to the device type (net, block, video, etc). Often they are pages from the slab cache that contain other unrelated objects, a notorious problem for DMA isolation. If the page is not accessible by the device, swiotlb_map() allocates a bounce buffer somewhere more convenient and copies the data when needed.
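As an illustration (simplified, names invented), this is what the streaming path looks like from a driver's point of view; whether the buffer is used in place or bounced through swiotlb is entirely up to the DMA layer:

    #include <linux/dma-mapping.h>

    static int example_tx(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t addr;

        /* If the device cannot reach 'buf' (for instance because it lies
         * outside a restricted DMA region), this transparently allocates a
         * bounce buffer and copies the data. */
        addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, addr))
            return -ENOMEM;

        /* ... notify the device, which reads from 'addr' ... */

        dma_unmap_single(dev, addr, len, DMA_TO_DEVICE);
        return 0;
    }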
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
That seems to be precisely what we need. Even when using the virtio-pci transport, it is still possible to define per-device properties in the device-tree, for example:
    /* PCI root complex node */
    pcie@10000000 {
        compatible = "pci-host-ecam-generic";

        /* Add properties to endpoint with BDF 00:01.0 */
        ep@0008 {
            reg = <0x00000800 0 0 0 0>;
            restricted-dma-region = <&dma_region_1>;
        };
    };
    reserved-memory {
        /* Define 64MB reserved region at address 0x50400000 */
        dma_region_1: restricted_dma_region {
            reg = <0x50400000 0x4000000>;
        };
    };
Dynamic regions
---------------
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
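For reference, a MAP request in virtio-iommu is a small fixed-size structure, roughly the following (quoted from memory; see include/uapi/linux/virtio_iommu.h and the spec for the authoritative layout). The guest asks for the physical range starting at phys_start to be mapped at [virt_start, virt_end] in the given domain, which is what the hypervisor would translate into backend mappings:

    struct virtio_iommu_req_map {
        struct virtio_iommu_req_head    head;
        __le32                          domain;
        __le64                          virt_start;
        __le64                          virt_end;
        __le64                          phys_start;
        __le32                          flags;  /* read/write/MMIO permissions */
        struct virtio_iommu_req_tail    tail;
    };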
Since it depends on the device, I guess we'll need a survey of memory access patterns by the different virtio devices that we're considering. In the end a mix of both solutions might be necessary.
Thanks, Jean
[1] https://lists.oasis-open.org/archives/virtio-dev/202006/msg00037.html
On Fri, Oct 2, 2020 at 3:44 PM Jean-Philippe Brucker jean-philippe@linaro.org wrote:
Hi,
I've looked in more detail at limited memory sharing (STR-6, STR-8, STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
Thanks for writing this up and for sharing!
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
Right, I think this can work, but there are very substantial downsides to it:
- it is a fairly substantial departure from the virtio specification, which defines that transfers can be made to any part of the guest physical address space
- for each device that allocates a region, we have to reserve host memory that cannot be used for anything else, and the size is based on the maximum we might need at any point in time.
- it removes any kind of performance benefit that we might see from using an iommu, by forcing the guest to do bounce-buffering, in addition to the copies that may have to be done in the host. (you have the same point below as well)
Dynamic regions
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
Agreed, I would think the iommu based approach is much more promising here.
Maybe the partial-page sharing could be part of the iommu dma_ops? There are many scenarios where we already rely on the iommu for isolating the kernel from malicious or accidental DMA, but I have not seen anyone do this on a sub-page granularity. Using bounce buffers for partial page dma_map_* could be something we can do separately from the rest, as both aspects seem useful regardless of one another.
Since it depends on the device, I guess we'll need a survey of memory access patterns by the different virtio devices that we're considering. In the end a mix of both solutions might be necessary.
Can you describe a case in which the iommu would clearly be inferior? IOW, what stops us from always using an iommu here?
Arnd
On Fri, Oct 02, 2020 at 04:26:10PM +0200, Arnd Bergmann wrote:
On Fri, Oct 2, 2020 at 3:44 PM Jean-Philippe Brucker jean-philippe@linaro.org wrote:
Hi,
I've looked in more detail at limited memory sharing (STR-6, STR-8, STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
Thanks for writing this up and for sharing!
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
Right, I think this can work, but there are very substantial downsides to it:
- it is a fairly substantial departure from the virtio specification, which defines that transfers can be made to any part of the guest physical address space
- for each device that allocates a region, we have to reserve host memory that cannot be used for anything else, and the size is based on the maximum we might need at any point in time.
- it removes any kind of performance benefit that we might see from using an iommu, by forcing the guest to do bounce-buffering, in addition to the copies that may have to be done in the host. (you have the same point below as well)
Dynamic regions
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation. But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
Agreed, I would think the iommu based approach is much more promising here.
Maybe the partial-page sharing could be part of the iommu dma_ops? There are many scenarios where we already rely on the iommu for isolating the kernel from malicious or accidental DMA, but I have not seen anyone do this on a sub-page granularity. Using bounce buffers for partial page dma_map_* could be something we can do separately from the rest, as both aspects seem useful regardless of one another.
It is about to be added to the dma-iommu module, as part of the consolidation of the IOMMU dma_ops: https://lore.kernel.org/linux-iommu/20200912032200.11489-4-baolu.lu@linux.in...
That will enforce bounce-buffers for any device marked "untrusted" (external devices such as thunderbolt devices, which could be malicious). I believe it would be a good thing to enable in our case as well.
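As a rough sketch of the kind of check involved (not the exact dma-iommu code; the helper name is made up): PCI devices already carry an "untrusted" flag, set today for externally-facing ports such as Thunderbolt, and the proposed bounce path keys off it, so virtio devices backed by another VM would want to be marked the same way:

    #include <linux/pci.h>

    /* Hypothetical helper: decide whether sub-page buffers for this device
     * should be bounced rather than mapped directly. */
    static bool example_needs_bounce(struct device *dev)
    {
        return dev_is_pci(dev) && to_pci_dev(dev)->untrusted;
    }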
Since it depends on the device, I guess we'll need a survey of memory access patterns by the different virtio devices that we're considering. In the end a mix of both solutions might be necessary.
Can you describe a case in which the iommu would clearly be inferior? IOW, what stops us from always using an iommu here?
If the virtio device only transfers small sub-page payloads, for example small packets:
- with static regions we copy each buffer to/from the static region,
- with an IOMMU we copy each buffer to/from a safe page *and* send requests to map+unmap that page. Even if we didn't use a bounce buffer in this case, map+unmap generally has a very high cost due to context switching and could easily be much slower than copying a small buffer.
For this case I see a possible optimization, keeping the bounce buffers mapped for some time so subsequent transfers can reuse them, but it's not implemented at the moment (and I wonder if it opens a vulnerability, though I can't see one right now).
Thanks, Jean
On Fri, Oct 2, 2020 at 5:26 PM Jean-Philippe Brucker jean-philippe@linaro.org wrote:
On Fri, Oct 02, 2020 at 04:26:10PM +0200, Arnd Bergmann wrote:
On Fri, Oct 2, 2020 at 3:44 PM Jean-Philippe Brucker jean-philippe@linaro.org wrote:
Maybe the partial-page sharing could be part of the iommu dma_ops? There are many scenarios where we already rely on the iommu for isolating the kernel from malicious or accidental DMA, but I have not seen anyone do this on a sub-page granularity. Using bounce buffers for partial page dma_map_* could be something we can do separately from the rest, as both aspects seem useful regardless of one another.
It is about to be added to the dma-iommu module, as part of the consolidation of the IOMMU dma_ops: https://lore.kernel.org/linux-iommu/20200912032200.11489-4-baolu.lu@linux.in...
That will enforce bounce-buffers for any device marked "untrusted" (external devices such as thunderbolt devices, which could be malicious). I believe it would be a good thing to enable in our case as well.
Ah, nice!
Since it depends on the device, I guess we'll need a survey of memory access patterns by the different virtio devices that we're considering. In the end a mix of both solutions might be necessary.
Can you describe a case in which the iommu would clearly be inferior? IOW, what stops us from always using an iommu here?
If the virtio device only transfers small sub-page payloads, for example small packets:
- with static regions we copy each buffer to/from the static region,
- with an IOMMU we copy each buffer to/from a safe page *and* send requests to map+unmap that page. Even if we didn't use a bounce buffer in this case, map+unmap generally has a very high cost due to context switching and could easily be much slower than copying a small buffer.
Ok, I see. So the question is mainly if any of the devices that do have sub-page buffers actually care about performance, right?
For this case I see a possible optimization, keeping the bounce buffers mapped for some time so subsequent transfers can reuse them, but it's not implemented at the moment (and I wonder if it opens a vulnerability, though I can't see one right now).
Right, the bounce buffers could essentially use the coherent mapping here, assuming we can find an upper bound on the size. I don't see any vulnerability with that either.
There is a similar issue with vq->indirect virtqueues, where we also need to map the vring descriptors dynamically, and these are not typically page aligned.
Arnd
On Fri, Oct 2, 2020 at 4:26 PM Arnd Bergmann arnd@linaro.org wrote:
On Fri, Oct 2, 2020 at 3:44 PM Jean-Philippe Brucker jean-philippe@linaro.org wrote:
- it is a fairly substantial departure from the virtio specification, which defines that transfers can be made to any part of the guest physical address space
One more thought on this: If we do something that is a significant departure from today's virtio-1.1 specification in order to do what we want, it makes sense to consider an even wider change, in this case to the way the virtqueues are defined. We currently have four fundamental layouts:
- split virtqueue, direct descriptors
- split virtqueue, indirect descriptors
- packed virtqueue, direct descriptors
- packed virtqueue, indirect descriptors
In a scenario that always requires bounce buffers and limits the address range accessed by the virtio device implementation, a more logical way to handle this would be a shared-memory ring with no descriptors pointing to external memory at all, but with all data packed into a FIFO, similar to how 9p packs its messages into a simple 2-way FIFO.
This may be possible to implement with just one additional virtio ring layout and minimal changes to any of the higher levels, essentially just flags for negotiating the ring type. (It's also likely that I'm missing a major problem here that would make it way more complicated.)
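Purely as an illustration of the idea (all field names invented, nothing here exists in the spec or in Linux), such a layout could be as simple as:

    /* Hypothetical descriptor-less ring: all data lives inside the shared
     * region itself, so the device side never needs to reach any other
     * guest memory. */
    struct fifo_ring {
        __le32  producer;  /* byte offset into data[], written by the driver */
        __le32  consumer;  /* byte offset into data[], written by the device */
        __le32  size;      /* size of data[] in bytes */
        __le32  flags;     /* e.g. notification suppression */
        __u8    data[];    /* length-prefixed messages packed back to back */
    };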
For inter-guest communication, the ideal outcome would be that the virtio driver in one guest makes its ring buffers available to others by allocating and mapping the guest-physical pages through its iommu, while the virtio device implementation in another guest can map the shared ring through a PCI BAR or an MMIO range.
Similarly, a host user implementation of the virtio device would mmap() that ring buffer into its address space and require a notification mechanism but no other access to the memory of the guest running the virtio driver.
I don't know if this approach has been discussed before, but if so, I'd like to know if it has already been rejected or if someone has tried implementing it this way.
Arnd
On Fri, Oct 02, 2020 at 04:26:10PM +0200, Arnd Bergmann wrote:
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
Right, I think this can work, but there are very substantial downsides to it:
- it is a fairly substantial departure from the virtio specification, which defines that transfers can be made to any part of the guest physical address space
Coming back to this point, that was originally true but prevented implementing hardware virtio devices or putting a vIOMMU in front of the device. The spec now defines feature bit VIRTIO_F_ACCESS_PLATFORM (previously called VIRTIO_F_IOMMU_PLATFORM):
A device SHOULD offer VIRTIO_F_ACCESS_PLATFORM if its access to memory is through bus addresses distinct from and translated by the platform to physical addresses used by the driver, and/or if it can only access certain memory addresses with said access specified and/or granted by the platform. A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not accepted.
With this the driver has to follow the DMA layout given by the platform, in our case given by the device tree.
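In Linux terms (simplified from the logic in drivers/virtio/virtio_ring.c; the helper name here is made up), accepting the feature is what sends the virtio core through the DMA API, so the per-device pools, swiotlb bouncing and vIOMMU mappings discussed in this thread all apply transparently:

    #include <linux/virtio.h>
    #include <linux/virtio_config.h>

    static bool example_use_dma_api(struct virtio_device *vdev)
    {
        /* Device access is translated/restricted by the platform, so
         * descriptors must carry DMA addresses, not raw guest PAs. */
        return virtio_has_feature(vdev, VIRTIO_F_ACCESS_PLATFORM);
    }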
Another point about solution #1: since the backend (secondary VM) accesses the virtqueue directly (we probably want to avoid a scatter-gather translation step in the backend), the pointers written by the frontend in the virtqueue have to be guest-physical addresses valid in the backend. So the static memory region needs to have an identical guest-physical address in the primary and secondary VMs. In practice I doubt this would be a problem.
Thanks, Jean
On Fri, Oct 16, 2020 at 9:19 AM Jean-Philippe Brucker jean-philippe@linaro.org wrote:
On Fri, Oct 02, 2020 at 04:26:10PM +0200, Arnd Bergmann wrote:
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
Right, I think this can work, but there are very substantial downsides to it:
- it is a fairly substantial departure from the virtio specification, which defines that transfers can be made to any part of the guest physical address space
Coming back to this point, that was originally true but prevented implementing hardware virtio devices or putting a vIOMMU in front of the device. The spec now defines feature bit VIRTIO_F_ACCESS_PLATFORM (previously called VIRTIO_F_IOMMU_PLATFORM):
A device SHOULD offer VIRTIO_F_ACCESS_PLATFORM if its access to memory is through bus addresses distinct from and translated by the platform to physical addresses used by the driver, and/or if it can only access certain memory addresses with said access specified and/or granted by the platform. A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not accepted.
With this the driver has to follow the DMA layout given by the platform, in our case given by the device tree.
Ok, got it.
Another point about solution #1: since the backend (secondary VM) accesses the virtqueue directly (we probably want to avoid a scatter-gather translation step in the backend), the pointers written by the frontend in the virtqueue have to be guest-physical addresses valid in the backend. So the static memory region needs to have an identical guest-physical address in the primary and secondary VMs. In practice I doubt this would be a problem.
I believe this is different from what Srivatsa explained in yesterday's call about Qualcomm's current work, which makes all pointer values relative to the start of the shared memory area, making them relocatable between frontend and backend guest physical address spaces.
However, using a matching dma-ranges property in the front-end guest to describe this mapping of guest-physical addresses into the shared space would be consistent with existing practice, at least in the case of virtio-mmio devices. For virtio-pci it would not work well because the dma-ranges are common between all PCI devices under a host bridge.
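As a hedged example of what that could look like for virtio-mmio (addresses invented, reusing the region from the earlier reserved-memory example; cell sizes would need adjusting for a real platform):

    /* Bus node wrapping the virtio-mmio device. dma-ranges states that
     * device bus addresses 0x0-0x4000000 translate to the shared region at
     * guest-physical 0x50400000, matching the backend's view. */
    bus@a000000 {
        compatible = "simple-bus";
        #address-cells = <1>;
        #size-cells = <1>;
        ranges;
        dma-ranges = <0x0 0x50400000 0x4000000>;

        virtio@a000000 {
            compatible = "virtio,mmio";
            reg = <0xa000000 0x200>;
        };
    };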
Arnd
Jean-Philippe Brucker jean-philippe@linaro.org writes:
On Fri, Oct 02, 2020 at 04:26:10PM +0200, Arnd Bergmann wrote:
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
Right, I think this can work, but there are very substantial downsides to it:
- it is a fairly substantial departure from the virtio specification, which defines that transfers can be made to any part of the guest physical address space
Coming back to this point, that was originally true but prevented implementing hardware virtio devices or putting a vIOMMU in front of the device. The spec now defines feature bit VIRTIO_F_ACCESS_PLATFORM (previously called VIRTIO_F_IOMMU_PLATFORM):
A device SHOULD offer VIRTIO_F_ACCESS_PLATFORM if its access to memory is through bus addresses distinct from and translated by the platform to physical addresses used by the driver, and/or if it can only access certain memory addresses with said access specified and/or granted by the platform. A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not accepted.
With this the driver has to follow the DMA layout given by the platform, in our case given by the device tree.
Is there any overlap between this concept and the blob resources concept that the virtio-gpu drivers have? From my limited following of the discussion, they have:
 (1) shared resources (basically udmabuf support).
 (2) blob resources (what VIRTIO_GPU_F_MEMORY used to be, but simpler).
 (3) hostmem (host-allocated resources).
Is this just a case of GPUs being a special case, or is there some overlap we could use here?
On Fri, Oct 16, 2020 at 11:06:07AM +0100, Alex Bennée wrote:
Jean-Philippe Brucker jean-philippe@linaro.org writes:
On Fri, Oct 02, 2020 at 04:26:10PM +0200, Arnd Bergmann wrote:
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
Right, I think this can work, but there are very substantial downsides to it:
- it is a fairly substantial departure from the virtio specification, which defines that transfers can be made to any part of the guest physical address space
Coming back to this point, that was originally true but prevented implementing hardware virtio devices or putting a vIOMMU in front of the device. The spec now defines feature bit VIRTIO_F_ACCESS_PLATFORM (previously called VIRTIO_F_IOMMU_PLATFORM):
A device SHOULD offer VIRTIO_F_ACCESS_PLATFORM if its access to memory is through bus addresses distinct from and translated by the platform to physical addresses used by the driver, and/or if it can only access certain memory addresses with said access specified and/or granted by the platform. A device MAY fail to operate further if VIRTIO_F_ACCESS_PLATFORM is not accepted.
With this the driver has to follow the DMA layout given by the platform, in our case given by the device tree.
Is there any overlap between this concept and the blob resources concept that the virtio-gpu drivers have?
No, I don't think so.
From my limited following of the discussion, they have:
 (1) shared resources (basically udmabuf support).
 (2) blob resources (what VIRTIO_GPU_F_MEMORY used to be, but simpler).
 (3) hostmem (host-allocated resources).
Looking at the proposed Linux implementation, the last two seem to rely on the shmem feature from the upcoming virtio release:
"Shared memory regions are an additional facility available to devices that need a region of memory that’s continuously shared between the device and the driver, rather than passed between them in the way virtqueue elements are." https://lists.oasis-open.org/archives/virtio-dev/201907/msg00021.html
It was introduced for virtio-fs, to allow mapping a file for direct access. The virtio-sound device will probably use them as well, though the current spec forbids using shmem for streaming data.
The way it works for virtio-fs: a memory region (gpa, size) is exposed in the virtio-pci capability or via virtio-mmio registers. Then the virtio-fs driver sends a "setupmapping" request to the device, asking to map the file at a specific offset into that region. When it succeeds, the guest can access the file directly through that region.
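For reference, the request body looks roughly like this (quoting from memory of include/uapi/linux/fuse.h, struct fuse_setupmapping_in): map 'len' bytes of the open file 'fh', starting at file offset 'foffset', at offset 'moffset' within the device's shared memory window:

    struct fuse_setupmapping_in {
        uint64_t    fh;        /* already-open file handle */
        uint64_t    foffset;   /* offset into the file */
        uint64_t    len;       /* length of the mapping */
        uint64_t    flags;     /* FUSE_SETUPMAPPING_FLAG_READ/WRITE */
        uint64_t    moffset;   /* offset into the shared memory window */
    };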
Is this just a case of GPUs being a special case, or is there some overlap we could use here?
The mechanism for using shared regions is specific to each device type (it can't be applied to a virtio-net device for example, without amending the virtio-net specification). It does seem close in concept to the pre-share solution, but I'm not sure how it could be generalized to virtqueues.
Thanks, Jean
Jean-Philippe Brucker jean-philippe@linaro.org writes:
Hi,
I've looked in more detail at limited memory sharing (STR-6, STR-8, STR-15), mainly from the Linux guest perspective. Here are a few thoughts.
Problem
We have a primary VM running a guest, and a secondary one running a backend that manages one hardware resource (for example network access). They communicate with virtio (for example virtio-net). The guest implements a virtio driver, the backend a virtio device. Problem is, how to ensure that the backend and guest only share the memory required for the virtio communication, and that the backend cannot access any other memory from the guest?
Yes - the aim is for isolation. This becomes even more important if you have an untrusted secure workload that provides, for example, DRM playback capability to the primary guest.
Static shared region
Let's first look at static DMA regions. The hypervisor allocates a subset of memory to be shared between guest and backend. The hypervisor communicates this per-device DMA restriction to the guest during boot. It could be using a firmware property, or a discovery protocol. I did start drafting such a protocol in virtio-iommu, but I now think the reserved-memory mechanism in device-tree, below, is preferable. Would we need an equivalent for ACPI, though?
I think device-tree certainly makes sense for an initial implementation, as these sorts of deployments are likely to be fixed platforms, possibly with the device tree specifying the deployment of primary and secondary VMs (this is STR-10/11 work).
How would we implement this in a Linux guest? The virtqueue of a virtio device has two components. Static ring buffers, allocated at boot with dma_alloc_coherent(), and the actual data payload, mapped with dma_map_page() and dma_map_single(). Linux calls the former "coherent" DMA, and the latter "streaming" DMA.
Coherent DMA can already obtain its pages from a per-device memory pool. dma_init_coherent_memory() defines a range of physical memory usable by a device. Importantly this region has to be distinct from system memory RAM and reserved by the platform. It is mapped non-cacheable. If it exists, dma_alloc_coherent() will only get its pages from that region.
On the other hand streaming DMA doesn't allocate memory. The virtio drivers don't control where that memory comes from, since the pages are fed to them by an upper layer of the subsystem, specific to the device type (net, block, video, etc). Often they are pages from the slab cache that contain other unrelated objects, a notorious problem for DMA isolation. If the page is not accessible by the device, swiotlb_map() allocates a bounce buffer somewhere more convenient and copies the data when needed.
At the moment the bounce buffer is allocated from a global pool in the low physical pages. However a recent proposal by Chromium would add support for per-device swiotlb pools:
https://lore.kernel.org/linux-iommu/20200728050140.996974-1-tientzu@chromium...
And quoting Tomasz from the discussion on patch 4:
For this, I'd like to propose a "restricted-dma-region" (feel free to suggest a better name) binding, which is explicitly specified to be the only DMA-able memory for this device and make Linux use the given pool for coherent DMA allocations and bouncing non-coherent DMA.
That seems to be precisely what we need. Even when using the virtio-pci transport, it is still possible to define per-device properties in the device-tree, for example:
    /* PCI root complex node */
    pcie@10000000 {
        compatible = "pci-host-ecam-generic";

        /* Add properties to endpoint with BDF 00:01.0 */
        ep@0008 {
            reg = <0x00000800 0 0 0 0>;
            restricted-dma-region = <&dma_region_1>;
        };
    };
    reserved-memory {
        /* Define 64MB reserved region at address 0x50400000 */
        dma_region_1: restricted_dma_region {
            reg = <0x50400000 0x4000000>;
        };
    };
Dynamic regions
In a previous discussion [1], several people suggested using a vIOMMU to dynamically update the mappings rather than statically setting a memory region usable by the backend. I believe that approach is still worth considering because it satisfies the security requirement and doesn't necessarily have worse performance. There is a trade-off between bounce buffers on one hand, and map notifications on the other.
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation.
Shouldn't sub-page payloads be embedded directly in the virtqueues (direct vs indirect buffers)?
But for large payloads bounce buffering might be prohibitive, and using a virtual IOMMU might actually be more efficient. Instead of copying large buffers the guest would send a MAP request to the hypervisor, which would then map the pages into the backend. Despite taking plenty of cycles for context switching and setting up the maps, it might be less costly than copying.
Since it depends on the device, I guess we'll need a survey of memory access patterns by the different virtio devices that we're considering. In the end a mix of both solutions might be necessary.
Thanks, Jean
[1] https://lists.oasis-open.org/archives/virtio-dev/202006/msg00037.html
On Tue, Oct 6, 2020 at 4:33 PM Alex Bennée alex.bennee@linaro.org wrote:
Jean-Philippe Brucker jean-philippe@linaro.org writes:
The problem with static regions is that all of the traffic will require copying. Sub-page payloads will need bounce buffering anyway, for proper isolation.
Shouldn't sub-page payloads be embedded directly in the virtqueues (direct vs indirect buffers)?
It appears that one of us misunderstands how direct buffers in virtqueues work. ;-)
From my reading of the specification and the source code, there are never any buffers within the virtqueue itself. The difference is that in "direct" virtqueues, the queue contains descriptors pointing directly at arbitrary physical pages in the guest-physical address space, while in indirect mode, the descriptors in the virtqueue point to another set of dynamically allocated descriptors outside of the virtqueue, which then point to the actual data.
In case of indirect virtqueues, full isolation requires special care for sub-page mappings of both the (indirect) descriptors and the data.
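For reference, this is the split-virtqueue descriptor format (from include/uapi/linux/virtio_ring.h); 'addr' can point anywhere in the guest-physical address space, which is precisely what the isolation schemes above need to constrain, and with VRING_DESC_F_INDIRECT set it points at a further table of these:

    struct vring_desc {
        __virtio64 addr;    /* buffer address */
        __virtio32 len;     /* buffer length */
        __virtio16 flags;   /* VRING_DESC_F_NEXT / _WRITE / _INDIRECT */
        __virtio16 next;    /* chaining within the descriptor table */
    };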
Arnd