From: Barry Song <song.bao.hua@hisilicon.com>
For platforms with clusters, such as Kunpeng 920, tasks in the same cluster share the L3 cache tag and therefore have lower latency when synchronizing and accessing shared resources. Based on this, this patch changes the CPU at which the scan in select_idle_cpu() begins, from the CPU after the target to the first CPU of the target's cluster. The search is then performed within the cluster first, so there is a better chance of waking the wakee in the same cluster as the waker.
Benchmarks have been run on a 2-socket, 4-NUMA Kunpeng 920 with 8 clusters in each NUMA node, both machine-wide and on NUMA node 0 alone. Improvements are observed in most cases compared with 5.15-rc1 with the cluster scheduler level [1].
hackbench-process-pipes
                      5.15-rc1+cluster     5.15-rc1+cluster+patch
Amean     1        0.6136 (   0.00%)        0.5988 (   2.41%)
Amean     4        0.8380 (   0.00%)        0.8904 *  -6.25%*
Amean     7        1.1661 (   0.00%)        1.1017 *   5.52%*
Amean     12       1.4670 (   0.00%)        1.5994 *  -9.03%*
Amean     21       2.8909 (   0.00%)        2.8640 (   0.93%)
Amean     30       4.3943 (   0.00%)        4.2052 (   4.30%)
Amean     48       6.6870 (   0.00%)        6.4079 (   4.17%)
Amean     79      10.4796 (   0.00%)        9.5507 *   8.86%*
Amean     110     14.5310 (   0.00%)       12.2114 *  15.96%*
Amean     141     16.4772 (   0.00%)       14.1517 *  14.11%*
Amean     172     20.0868 (   0.00%)       15.9852 *  20.42%*
Amean     203     22.9282 (   0.00%)       18.4574 *  19.50%*
Amean     234     25.8139 (   0.00%)       20.4725 *  20.69%*
Amean     256     27.6834 (   0.00%)       22.9076 *  17.25%*
tbench4
                      5.15-rc1+cluster     5.15-rc1+cluster+patch
Hmean     1        338.50 (   0.00%)       345.47 *   2.06%*
Hmean     2        672.20 (   0.00%)       695.10 *   3.41%*
Hmean     4       1329.03 (   0.00%)      1357.40 *   2.14%*
Hmean     8       2513.25 (   0.00%)      2419.88 *  -3.71%*
Hmean     16      4957.39 (   0.00%)      4882.04 *  -1.52%*
Hmean     32      8737.07 (   0.00%)      8649.97 *  -1.00%*
Hmean     64      4929.31 (   0.00%)      6570.13 *  33.29%*
Hmean     128     5052.75 (   0.00%)      8157.96 *  61.46%*
Hmean     256     6971.70 (   0.00%)      7648.01 *   9.70%*
Hmean     512     7427.32 (   0.00%)      7450.68 *   0.31%*
tbench4 NUMA 0
                      5.15-rc1+cluster     5.15-rc1+cluster+patch
Hmean     1        318.98 (   0.00%)       322.53 *   1.11%*
Hmean     2        640.50 (   0.00%)       641.89 *   0.22%*
Hmean     4       1277.57 (   0.00%)      1292.54 *   1.17%*
Hmean     8       2584.55 (   0.00%)      2622.64 *   1.47%*
Hmean     16      5245.05 (   0.00%)      5440.75 *   3.73%*
Hmean     32      3231.60 (   0.00%)      3991.83 *  23.52%*
Hmean     64      7361.28 (   0.00%)      7356.56 (  -0.06%)
Hmean     128     6240.28 (   0.00%)      6293.78 *   0.86%*
hackbench-process-pipes NUMA 0
                      5.15-rc1+cluster     5.15-rc1+cluster+patch
Amean     1        0.5196 (   0.00%)        0.5121 (   1.44%)
Amean     4        1.0946 (   0.00%)        1.3234 * -20.90%*
Amean     7        1.9368 (   0.00%)        2.4304 * -25.49%*
Amean     12       3.4168 (   0.00%)        3.6422 *  -6.60%*
Amean     21       6.1119 (   0.00%)        5.5032 *   9.96%*
Amean     30       7.8980 (   0.00%)        7.5433 *   4.49%*
Amean     48      11.2969 (   0.00%)       10.6889 *   5.38%*
Amean     79      17.3220 (   0.00%)       15.2553 *  11.93%*
Amean     110     22.9893 (   0.00%)       19.8521 *  13.65%*
Amean     141     28.5319 (   0.00%)       24.9064 *  12.71%*
Amean     172     34.1731 (   0.00%)       30.8424 *   9.75%*
Amean     203     39.9368 (   0.00%)       35.4607 *  11.21%*
Amean     234     45.6207 (   0.00%)       40.4969 *  11.23%*
Amean     256     50.0725 (   0.00%)       45.0295 *  10.07%*
[1] https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
---
 kernel/sched/fair.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..852a048a5f8c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6265,10 +6265,10 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	int i, cpu, scan_from, idle_cpu = -1, nr = INT_MAX;
+	struct sched_domain *this_sd, *cluster_sd;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
-	struct sched_domain *this_sd;
 	u64 time = 0;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
@@ -6276,6 +6276,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		return -1;
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	cpumask_clear_cpu(target, cpus);
+
+	cluster_sd = rcu_dereference(*this_cpu_ptr(&sd_cluster));
+	scan_from = cluster_sd ? cpumask_first(sched_domain_span(cluster_sd)) : target + 1;
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
@@ -6305,7 +6309,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		time = cpu_clock(this);
 	}
 
-	for_each_cpu_wrap(cpu, cpus, target + 1) {
+	for_each_cpu_wrap(cpu, cpus, scan_from) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
 			if ((unsigned int)i < nr_cpumask_bits)