On Thu, 12 Nov 2020 16:36:01 +0000 Lorenzo Pieralisi lorenzo.pieralisi@arm.com wrote:
On Wed, Nov 04, 2020 at 01:23:43PM +0000, Jonathan Cameron via Linaro-open-discussions wrote:
Hi All,
This one made it onto the list of topics to discuss (now marked as no need to discuss). I've been meaning to give a status update by email including what is outstanding here. Please let me know if this fails to cover some aspect of interest.
Background:
https://github.com/hisilicon/acpi-numa-whitepaper/releases/tag/v0.93 chapter 3.
Generic initiators are a concept in ACPI 6.3 (sec 5.2.16.6) to plug a hole in the definition of proximity domains.
Proximity domains in ACPI (NUMA nodes in kernel) are defined by entries in SRAT table. There are a whole range of different types of SRAT entry but before ACPI 6.3 this more or less in practice meant that a proximity domain only existed if it contained either (or both) memory and CPUs. Other initiators of memory transactions such as network cards can be assigned to an existing proximity domain via _PXM in ACPI DSDT. This restricted them to sharing a domain with either memory or processors.
That doesn't always reflect system architecture, particularly with the addition of richer descriptions of access characteristics (latency / bandwidth) brought in by HMAT. Hence Generic Initiator domains to allow you to specify a proximity domain with some other type of device in it (such as a network card) and get all of the descriptive capability available for CPU / memory nodes.
Note that this was brought in prior to CXL becoming public but 1.1 CXL spec states that initiators on CXL should be described using Generic Initiator nodes. This should accelerate the number of users of this feature considerably. It is also useful in some existing systems.
What support was needed in kernel:
- Parsing of the SRAT Generic Initiator Affinity Structure
- Instantiating the NUMA nodes that map to the GI PXM nodes to ensure stuff like fallback lists for memory allocation work as normal.
- Richer use of HMAT access characteristics to differentiate nearest CPU to memory from nearest initiator to memory.
- PXM assignment from the SRAT record rather than _PXM (not yet done).
- PCI PXM assignment (not yet done)
Status:
The kernel patches sat on the list (with rebases) for well over a year failing to get the architecture review needed (as there was significant risk of breakage in both ARM64 and x86). It was to break this blockage that we were interested in an open discussion on this. However, they did recently get x86 review this and Rafael queued them for 5.10 (now merged)
The PCI PXM issue has been long standing due to some buggy firmware on certain X86 boards and the need for a clarification in the ACPI spec (added in 6.3). To make this safe, needed to ensure that NUMA nodes on ACPI systems can only be instantiated during the main parse of SRAT.
https://lore.kernel.org/linux-mm/20200818142430.1156547-1-Jonathan.Cameron@h...
That fix is now in place, and we'll resend the PCI fix shortly.
Overall, this is a topic to be discussed since it is important.
Shorter term, I need to pick your brain on this:
- The series above, it should fix this issue:
https://lore.kernel.org/lkml/26284ca5-ea05-0496-629d-9951f49dda8f@linux.alib...
Correct ?
From a code read through I 'think' the PXM only from SRAT series should fix that but without actually setting up a suitable test I'm not 100% confident.
- Why in dummy_numa_init() (arch/arm64/mm/numa.c) we don't turn numa_off == true if numa_add_memblk() fails ? (Side note: I really like the comment :) "...It is unlikely that this function fails."
So this got moved around by Mike's patch set. That set more or less by coincidence fixed the issues that I saw with SRAT not covering all memblocks. The issue I ran into in some broken firmware test cases left memblock in a wierd state after the ACPI code ran. Mike's series put everything back together again because the core memblock handling was a bit more resilient than the loop in the arm64 code.
He didn't change the flow however, we never set numa_off in the the error path.
So can we actually carry on safely if numa_add_memblk() has failed here? The only way that can happen is to have a failure from memblock_isolate_range() - the actual code to set the node can't fail.
memblock_isolate_range() only has one failure path from memblock_double_array() That only fails for reasons of code ordering (reserved regions not set yet) or a memory allocation failure (very unlikely this early in boot unless something is horribly wrong) So I think the comment is kind of valid. It doesn't fail.
If this code does fail to run we potentially end up with corrupted memblock entries but they 'might' be fine. That makes me nervous. Perhaps worth noting x86 doesn't even bother checking for return value of numa_add_memblk().
On the numa_off front, we had a discussion around whether x86 should set it for similar error paths (it doesn't).
https://patchwork.kernel.org/project/linux-pci/patch/20181211094737.71554-1-...
I haven't tested, but I think we could drop the numa_off = true from the arm64 code now without any problems.
At somepoint I'd like to explore more cross arch unification in this code where possible. There is a lot of cut and paste going on and assumptions in this code tend to have unexpected affects on ordering of core code. One implementation would make reworking things much easier, even if we have to handle a few x86 quirks in that generic code.
Forgive me for not looking earlier into the series above that Rafael merged - as I said above this deserves more attention on my side.
No problem - always far too many things to do!
Thanks,
Jonathan
Thanks, Lorenzo
Note it may be "interesting" to support nodes from CXL CDAT tables at runtime but that is another topic. ( https://uefi.org/node/4093 https://lore.kernel.org/linux-cxl/20201102183428.00005f4f@Huawei.com/T/#m52f... )
For a Generic Initiator Nodes, there are two ways a device an be assigned to the proximity domain. Conventional _PXM in DSDT can be used and that is now supported. The SRAT entry itself also contains an address (PCI seg + BDF or Platform UID / HID based). There is no obligation to provide both. The SRAT based method will require some level of alternative infrastructure to that used for _PXM. We may look at this at some stage.
So a few outstanding things but probably not worth discussing on a call at this stage unless anyone is seeing problems with the stuff already merged.
Thanks,
Jonathan
Linaro-open-discussions mailing list https://collaborate.linaro.org/display/LOD/Linaro+Open+Discussions+Home https://op-lists.linaro.org/mailman/listinfo/linaro-open-discussions