From aad92defbdc1385c351d685e86dbc1ec2a3a29d0 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron
Date: Thu, 18 Nov 2021 20:43:35 +0800
Subject: [PATCH 01/16] topology: Represent clusters of CPUs within a die

mainline inclusion
from mainline-v5.16-rc1
commit c5e22feffdd736cb02b98b0f5b375c8ebc858dd4
upstream.

------------------------------------------------------------------------

Both ACPI and DT provide the ability to describe additional layers of
topology between that of individual cores and higher level constructs
such as the level at which the last level cache is shared.
In ACPI this can be represented in PPTT as a Processor Hierarchy
Node Structure [1] that is the parent of the CPU cores and in turn
has a parent Processor Hierarchy Node Structure representing
a higher level of topology.

For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 CPUs. All clusters share L3 cache data, but each cluster
has a local L3 tag. On the other hand, each cluster shares some internal
system bus.

+-----------------------------------+                           +---------+
|  +------+    +------+             +---------------------------+         |
|  | CPU0 |    | CPU1 |             |    +-----------+          |         |
|  +------+    +------+             |    |           |          |         |
|                                   +----+    L3     |          |         |
|  +------+    +------+   cluster   |    |    tag    |          |         |
|  | CPU2 |    | CPU3 |             |    |           |          |         |
|  +------+    +------+             |    +-----------+          |         |
|                                   |                           |         |
+-----------------------------------+                           |         |
+-----------------------------------+                           |         |
|  +------+    +------+             +---------------------------+         |
|  |      |    |      |             |    +-----------+          |         |
|  +------+    +------+             |    |           |          |         |
|                                   +----+    L3     |          |         |
|  +------+    +------+             |    |    tag    |          |         |
|  |      |    |      |             |    |           |          |         |
|  +------+    +------+             |    +-----------+          |         |
|                                   |                           |         |
+-----------------------------------+                           |         |
+-----------------------------------+                           |         |
|  +------+    +------+             +---------------------------+         |
|  |      |    |      |             |    +-----------+          |         |
|  +------+    +------+             |    |           |          |         |
|                                   +----+    L3     |          |         |
|  +------+    +------+             |    |    tag    |          |         |
|  |      |    |      |             |    |           |          |         |
|  +------+    +------+             |    +-----------+          |         |
|                                   |                           |         |
+-----------------------------------+                           |   L3    |
                                                                |  data   |
+-----------------------------------+                           |         |
|  +------+    +------+             +---------------------------+         |
|  |      |    |      |             |    +-----------+          |         |
|  +------+    +------+             |    |           |          |         |
|                                   +----+    L3     |          |         |
|  +------+    +------+             |    |    tag    |          |         |
|  |      |    |      |             |    |           |          |         |
|  +------+    +------+             |    +-----------+          |         |
|                                   |                           |         |
+-----------------------------------+                           |         |
+-----------------------------------+                           |         |
|  +------+    +------+             +---------------------------+         |
|  |      |    |      |             |    +-----------+          |         |
|  +------+    +------+             |    |           |          |         |
|                                   +----+    L3     |          |         |
|  +------+    +------+             |    |    tag    |          |         |
|  |      |    |      |             |    |           |          |         |
|  +------+    +------+             |    +-----------+          |         |
|                                   |                           |         |
+-----------------------------------+                           |         |
+-----------------------------------+                           |         |
|  +------+    +------+             +---------------------------+         |
|  |      |    |      |             |    +-----------+          |         |
|  +------+    +------+             |    |           |          |         |
|                                   +----+    L3     |          |         |
|  +------+    +------+             |    |    tag    |          |         |
|  |      |    |      |             |    |           |          |         |
|  +------+    +------+             |    +-----------+          |         |
|                                   |                           |         |
+-----------------------------------+                           +---------+

That means spreading tasks among clusters will bring more bandwidth
while packing tasks within one cluster will lead to smaller cache
synchronization latency. So both kernel and userspace will have
a chance to leverage this topology to deploy tasks accordingly to
achieve either smaller cache latency within one cluster or an even
distribution of load among clusters for higher throughput.

This patch exposes cluster topology to both kernel and userspace.
Libraries like hwloc will know clusters via cluster_cpus and the
related sysfs attributes. PoC of HWLOC support at [2].
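
For illustration only (not part of the patch itself), a minimal
userspace sketch of how the new attribute could be consumed; the sysfs
path follows the cputopology documentation updated below:

  /* Hypothetical example: print the cluster siblings of CPU0 via the
   * cluster_cpus_list attribute added by this patch. */
  #include <stdio.h>

  int main(void)
  {
          char buf[256];
          FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list", "r");

          if (!f) {
                  perror("cluster_cpus_list");
                  return 1;
          }
          if (fgets(buf, sizeof(buf), f))
                  printf("cpu0 cluster CPUs: %s", buf);
          fclose(f);
          return 0;
  }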
Note this patch only handles the ACPI case. Special consideration is
needed for SMT processors, where it is necessary to move 2 levels up
the hierarchy from the leaf nodes (thus skipping the processor core
level).

Note that arm64 / ACPI does not provide any means of identifying a die
level in the topology, but that may be unrelated to the cluster level.

[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
    structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster

Signed-off-by: Jonathan Cameron
Signed-off-by: Tian Tao
Signed-off-by: Barry Song
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com
Signed-off-by: Xue Sinian
---
 Documentation/admin-guide/cputopology.rst | 26 +++++++--
 arch/arm64/kernel/topology.c              |  2 +
 drivers/acpi/pptt.c                       | 67 +++++++++++++++++++++++
 drivers/base/arch_topology.c              | 14 +++++
 drivers/base/topology.c                   | 10 ++++
 include/linux/acpi.h                      |  5 ++
 include/linux/arch_topology.h             |  5 ++
 include/linux/topology.h                  |  6 ++
 8 files changed, 131 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index b90dafcc8237..496ec1e6fd4b 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -18,6 +18,11 @@ die_id:
 	identifier (rather than the kernel's).  The actual value is
 	architecture and platform dependent.
 
+cluster_id:
+
+	the cluster ID of cpuX. Typically it is the hardware platform's
+	identifier (rather than the kernel's). The actual value is
+	architecture and platform dependent.
+
 core_id:
 
 	the CPU core ID of cpuX. Typically it is the hardware platform's
@@ -36,6 +41,15 @@ drawer_id:
 	identifier (rather than the kernel's).  The actual value is
 	architecture and platform dependent.
 
+cluster_cpus:
+
+	internal kernel map of CPUs within the same cluster.
+
+cluster_cpus_list:
+
+	human-readable list of CPUs within the same cluster.
+	The format is like 0-3, 8-11, 14,17.
+
 core_cpus:
 
 	internal kernel map of CPUs within the same core.
@@ -96,12 +110,14 @@ these macros in include/asm-XXX/topology.h::
 
 	#define topology_physical_package_id(cpu)
 	#define topology_die_id(cpu)
+	#define topology_cluster_id(cpu)
 	#define topology_core_id(cpu)
 	#define topology_book_id(cpu)
 	#define topology_drawer_id(cpu)
 	#define topology_sibling_cpumask(cpu)
 	#define topology_core_cpumask(cpu)
 	#define topology_die_cpumask(cpu)
+	#define topology_cluster_cpumask(cpu)
 	#define topology_book_cpumask(cpu)
 	#define topology_drawer_cpumask(cpu)
 
@@ -116,10 +132,12 @@ not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
 2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_sibling_cpumask: just the given CPU
+6) topology_core_cpumask: just the given CPU
+7) topology_cluster_cpumask: just the given CPU
+8) topology_die_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 3d3a673c6704..4ecf361cb562 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -62,6 +62,8 @@ int __init parse_acpi_topology(void)
 			cpu_topology[cpu].thread_id  = -1;
 			cpu_topology[cpu].core_id    = topology_id;
 		}
+		topology_id = find_acpi_cpu_topology_cluster(cpu);
+		cpu_topology[cpu].cluster_id = topology_id;
 		topology_id = find_acpi_cpu_topology_package(cpu);
 		cpu_topology[cpu].package_id = topology_id;
 
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
index f31544d3656e..e906dfc17f77 100644
--- a/drivers/acpi/pptt.c
+++ b/drivers/acpi/pptt.c
@@ -735,6 +735,73 @@ int find_acpi_cpu_topology_package(unsigned int cpu)
 					  ACPI_PPTT_PHYSICAL_PACKAGE);
 }
 
+/**
+ * find_acpi_cpu_topology_cluster() - Determine a unique CPU cluster value
+ * @cpu: Kernel logical CPU number
+ *
+ * Determine a topology unique cluster ID for the given CPU/thread.
+ * This ID can then be used to group peers, which will have matching ids.
+ *
+ * The cluster, if present, is the level of topology above CPUs. In a
+ * multi-thread CPU, it will be the level above the CPU, not the thread.
+ * It may not exist in single CPU systems. In simple multi-CPU systems,
+ * it may be equal to the package topology level.
+ *
+ * Return: -ENOENT if the PPTT doesn't exist, the CPU cannot be found,
+ * or there is no topology level above the CPU.
+ * Otherwise returns a value which represents the cluster for this CPU.
+ */
+
+int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	struct acpi_table_header *table;
+	acpi_status status;
+	struct acpi_pptt_processor *cpu_node, *cluster_node;
+	u32 acpi_cpu_id;
+	int retval;
+	int is_thread;
+
+	status = acpi_get_table(ACPI_SIG_PPTT, 0, &table);
+	if (ACPI_FAILURE(status)) {
+		acpi_pptt_warn_missing();
+		return -ENOENT;
+	}
+
+	acpi_cpu_id = get_acpi_id_for_cpu(cpu);
+	cpu_node = acpi_find_processor_node(table, acpi_cpu_id);
+	if (cpu_node == NULL || !cpu_node->parent) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+
+	is_thread = cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_IS_THREAD;
+	cluster_node = fetch_pptt_node(table, cpu_node->parent);
+	if (cluster_node == NULL) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+	if (is_thread) {
+		if (!cluster_node->parent) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+		cluster_node = fetch_pptt_node(table, cluster_node->parent);
+		if (cluster_node == NULL) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+	}
+	if (cluster_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID)
+		retval = cluster_node->acpi_processor_id;
+	else
+		retval = ACPI_PTR_DIFF(cluster_node, table);
+
+put_table:
+	acpi_put_table(table);
+
+	return retval;
+}
+
 /**
  * find_acpi_cpu_topology_hetero_id() - Get a core architecture tag
  * @cpu: Kernel logical CPU number
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 0dfe20d18b8d..02af6964d9fd 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -448,6 +448,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return core_mask;
 }
 
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	return &cpu_topology[cpu].cluster_sibling;
+}
+
 void update_siblings_masks(unsigned int cpuid)
 {
 	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -464,6 +469,11 @@ void update_siblings_masks(unsigned int cpuid)
 		if (cpuid_topo->package_id != cpu_topo->package_id)
 			continue;
 
+		if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
+		    cpuid_topo->cluster_id != -1) {
+			cpumask_set_cpu(cpu,
&cpuid_topo->cluster_sibling); + cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling); + } cpumask_set_cpu(cpuid, &cpu_topo->core_sibling); cpumask_set_cpu(cpu, &cpuid_topo->core_sibling); @@ -483,6 +493,9 @@ static void clear_cpu_topology(int cpu) cpumask_clear(&cpu_topo->llc_sibling); cpumask_set_cpu(cpu, &cpu_topo->llc_sibling); + cpumask_clear(&cpu_topo->cluster_sibling); + cpumask_set_cpu(cpu, &cpu_topo->cluster_sibling); + cpumask_clear(&cpu_topo->core_sibling); cpumask_set_cpu(cpu, &cpu_topo->core_sibling); cpumask_clear(&cpu_topo->thread_sibling); @@ -498,6 +511,7 @@ void __init reset_cpu_topology(void) cpu_topo->thread_id = -1; cpu_topo->core_id = -1; + cpu_topo->cluster_id = -1; cpu_topo->package_id = -1; cpu_topo->llc_id = -1; diff --git a/drivers/base/topology.c b/drivers/base/topology.c index 4e033d4cc0dc..bf169975e4ef 100644 --- a/drivers/base/topology.c +++ b/drivers/base/topology.c @@ -46,6 +46,9 @@ static DEVICE_ATTR_RO(physical_package_id); define_id_show_func(die_id); static DEVICE_ATTR_RO(die_id); +define_id_show_func(cluster_id); +static DEVICE_ATTR_RO(cluster_id); + define_id_show_func(core_id); static DEVICE_ATTR_RO(core_id); @@ -61,6 +64,10 @@ define_siblings_show_func(core_siblings, core_cpumask); static DEVICE_ATTR_RO(core_siblings); static DEVICE_ATTR_RO(core_siblings_list); +define_siblings_show_func(cluster_cpus, cluster_cpumask); +static DEVICE_ATTR_RO(cluster_cpus); +static DEVICE_ATTR_RO(cluster_cpus_list); + define_siblings_show_func(die_cpus, die_cpumask); static DEVICE_ATTR_RO(die_cpus); static DEVICE_ATTR_RO(die_cpus_list); @@ -88,6 +95,7 @@ static DEVICE_ATTR_RO(drawer_siblings_list); static struct attribute *default_attrs[] = { &dev_attr_physical_package_id.attr, &dev_attr_die_id.attr, + &dev_attr_cluster_id.attr, &dev_attr_core_id.attr, &dev_attr_thread_siblings.attr, &dev_attr_thread_siblings_list.attr, @@ -95,6 +103,8 @@ static struct attribute *default_attrs[] = { &dev_attr_core_cpus_list.attr, &dev_attr_core_siblings.attr, &dev_attr_core_siblings_list.attr, + &dev_attr_cluster_cpus.attr, + &dev_attr_cluster_cpus_list.attr, &dev_attr_die_cpus.attr, &dev_attr_die_cpus_list.attr, &dev_attr_package_cpus.attr, diff --git a/include/linux/acpi.h b/include/linux/acpi.h index ed68f45c09f7..dfc4aa531c77 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -1297,6 +1297,7 @@ static inline int lpit_read_residency_count_address(u64 *address) #ifdef CONFIG_ACPI_PPTT int acpi_pptt_cpu_is_thread(unsigned int cpu); int find_acpi_cpu_topology(unsigned int cpu, int level); +int find_acpi_cpu_topology_cluster(unsigned int cpu); int find_acpi_cpu_topology_package(unsigned int cpu); int find_acpi_cpu_topology_hetero_id(unsigned int cpu); int find_acpi_cpu_cache_topology(unsigned int cpu, int level); @@ -1309,6 +1310,10 @@ static inline int find_acpi_cpu_topology(unsigned int cpu, int level) { return -EINVAL; } +static inline int find_acpi_cpu_topology_cluster(unsigned int cpu) +{ + return -EINVAL; +} static inline int find_acpi_cpu_topology_package(unsigned int cpu) { return -EINVAL; diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h index 42f2b5126094..9d7a80debe06 100644 --- a/include/linux/arch_topology.h +++ b/include/linux/arch_topology.h @@ -36,10 +36,12 @@ unsigned long topology_get_freq_scale(int cpu) struct cpu_topology { int thread_id; int core_id; + int cluster_id; int package_id; int llc_id; cpumask_t thread_sibling; cpumask_t core_sibling; + cpumask_t cluster_sibling; cpumask_t llc_sibling; }; @@ -47,13 +49,16 @@ struct 
cpu_topology {
 
 extern struct cpu_topology cpu_topology[NR_CPUS];
 
 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].package_id)
+#define topology_cluster_id(cpu)	(cpu_topology[cpu].cluster_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
 #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
 #define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
+#define topology_cluster_cpumask(cpu)	(&cpu_topology[cpu].cluster_sibling)
 #define topology_llc_cpumask(cpu)	(&cpu_topology[cpu].llc_sibling)
 
 void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
+const struct cpumask *cpu_clustergroup_mask(int cpu);
 void update_siblings_masks(unsigned int cpu);
 void remove_cpu_topology(unsigned int cpuid);
 void reset_cpu_topology(void);
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 64e7ee0abe71..b216ed3d9fb1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -203,6 +203,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_die_id
 #define topology_die_id(cpu)	((void)(cpu), -1)
 #endif
+#ifndef topology_cluster_id
+#define topology_cluster_id(cpu)	((void)(cpu), -1)
+#endif
 #ifndef topology_core_id
 #define topology_core_id(cpu)	((void)(cpu), 0)
 #endif
@@ -212,6 +215,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu)	cpumask_of(cpu)
 #endif
+#ifndef topology_cluster_cpumask
+#define topology_cluster_cpumask(cpu)	cpumask_of(cpu)
+#endif
 #ifndef topology_die_cpumask
 #define topology_die_cpumask(cpu)	cpumask_of(cpu)
 #endif
--
Gitee

From ce8c88a5eb7d8c3291c299555fe1279b43be2511 Mon Sep 17 00:00:00 2001
From: Barry Song
Date: Fri, 24 Sep 2021 20:51:03 +1200
Subject: [PATCH 02/16] sched: Add cluster scheduler level in core and related
 Kconfig for ARM64

mainline inclusion
from mainline-v5.16-rc1
commit 778c558f49a2cb3dc7b18a80ff515e82aa813627
upstream.

------------------------------------------------------------------------

This patch adds a scheduler level for clusters and automatically enables
load balancing among clusters. It will directly benefit many workloads
that love more resources, such as memory bandwidth and caches.

Testing has been done widely in two different hardware configurations of
Kunpeng920:

 24 cores in one NUMA node (6 clusters in each NUMA node);
 32 cores in one NUMA node (8 clusters in each NUMA node)

Workload is run on either one NUMA node or four NUMA nodes; thus, this
can estimate the effect of cluster spreading w/ and w/o NUMA load
balance.

* Stream benchmark:

4threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     29929.64 (   0.00%)    32932.68 (  10.03%)
MB/sec scale    29861.10 (   0.00%)    32710.58 (   9.54%)
MB/sec add      27034.42 (   0.00%)    32400.68 (  19.85%)
MB/sec triad    27225.26 (   0.00%)    31965.36 (  17.41%)

6threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     40330.24 (   0.00%)    42377.68 (   5.08%)
MB/sec scale    40196.42 (   0.00%)    42197.90 (   4.98%)
MB/sec add      37427.00 (   0.00%)    41960.78 (  12.11%)
MB/sec triad    37841.36 (   0.00%)    42513.64 (  12.35%)

12threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     52639.82 (   0.00%)    53818.04 (   2.24%)
MB/sec scale    52350.30 (   0.00%)    53253.38 (   1.73%)
MB/sec add      53607.68 (   0.00%)    55198.82 (   2.97%)
MB/sec triad    54776.66 (   0.00%)    56360.40 (   2.89%)

Thus, it could help memory-bound workloads, especially under medium load.
Similar improvement is also seen in lkp-pbzip2: * lkp-pbzip2 benchmark 2-96 threads (on 4NUMA * 24cores = 96cores) lkp-pbzip2 lkp-pbzip2 w/o patch w/ patch Hmean tput-2 11062841.57 ( 0.00%) 11341817.51 * 2.52%* Hmean tput-5 26815503.70 ( 0.00%) 27412872.65 * 2.23%* Hmean tput-8 41873782.21 ( 0.00%) 43326212.92 * 3.47%* Hmean tput-12 61875980.48 ( 0.00%) 64578337.51 * 4.37%* Hmean tput-21 105814963.07 ( 0.00%) 111381851.01 * 5.26%* Hmean tput-30 150349470.98 ( 0.00%) 156507070.73 * 4.10%* Hmean tput-48 237195937.69 ( 0.00%) 242353597.17 * 2.17%* Hmean tput-79 360252509.37 ( 0.00%) 362635169.23 * 0.66%* Hmean tput-96 394571737.90 ( 0.00%) 400952978.48 * 1.62%* 2-24 threads (on 1NUMA * 24cores = 24cores) lkp-pbzip2 lkp-pbzip2 w/o patch w/ patch Hmean tput-2 11071705.49 ( 0.00%) 11296869.10 * 2.03%* Hmean tput-4 20782165.19 ( 0.00%) 21949232.15 * 5.62%* Hmean tput-6 30489565.14 ( 0.00%) 33023026.96 * 8.31%* Hmean tput-8 40376495.80 ( 0.00%) 42779286.27 * 5.95%* Hmean tput-12 61264033.85 ( 0.00%) 62995632.78 * 2.83%* Hmean tput-18 86697139.39 ( 0.00%) 86461545.74 ( -0.27%) Hmean tput-24 104854637.04 ( 0.00%) 104522649.46 * -0.32%* In the case of 6 threads and 8 threads, we see the greatest performance improvement. Similar improvement can be seen on lkp-pixz though the improvement is smaller: * lkp-pixz benchmark 2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores) lkp-pixz lkp-pixz w/o patch w/ patch Hmean tput-2 6486981.16 ( 0.00%) 6561515.98 * 1.15%* Hmean tput-4 11645766.38 ( 0.00%) 11614628.43 ( -0.27%) Hmean tput-6 15429943.96 ( 0.00%) 15957350.76 * 3.42%* Hmean tput-8 19974087.63 ( 0.00%) 20413746.98 * 2.20%* Hmean tput-12 28172068.18 ( 0.00%) 28751997.06 * 2.06%* Hmean tput-18 39413409.54 ( 0.00%) 39896830.55 * 1.23%* Hmean tput-24 49101815.85 ( 0.00%) 49418141.47 * 0.64%* * SPECrate benchmark 4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores) Base Base Run Time Rate ------- --------- 4 Copies w/o 580 (w/ 570) w/o 11.1 (w/ 11.3) 8 Copies w/o 647 (w/ 605) w/o 20.0 (w/ 21.4, +7%) 16 Copies w/o 844 (w/ 844) w/o 30.6 (w/ 30.6) 32 Copies(on 4NUMA * 32 cores = 128cores) [w/o patch] Base Base Base Benchmarks Copies Run Time Rate --------------- ------- --------- --------- 500.perlbench_r 32 584 87.2 * 502.gcc_r 32 503 90.2 * 505.mcf_r 32 745 69.4 * 520.omnetpp_r 32 1031 40.7 * 523.xalancbmk_r 32 597 56.6 * 525.x264_r 1 -- CE 531.deepsjeng_r 32 336 109 * 541.leela_r 32 556 95.4 * 548.exchange2_r 32 513 163 * 557.xz_r 32 530 65.2 * Est. SPECrate2017_int_base 80.3 [w/ patch] Base Base Base Benchmarks Copies Run Time Rate --------------- ------- --------- --------- 500.perlbench_r 32 580 87.8 (+0.688%) * 502.gcc_r 32 477 95.1 (+5.432%) * 505.mcf_r 32 644 80.3 (+13.574%) * 520.omnetpp_r 32 942 44.6 (+9.58%) * 523.xalancbmk_r 32 560 60.4 (+6.714%%) * 525.x264_r 1 -- CE 531.deepsjeng_r 32 337 109 (+0.000%) * 541.leela_r 32 554 95.6 (+0.210%) * 548.exchange2_r 32 515 163 (+0.000%) * 557.xz_r 32 524 66.0 (+1.227%) * Est. 
SPECrate2017_int_base           83.7 (+4.062%)

On the other hand, it is slightly helpful to CPU-bound tasks like
kernbench:

* 24-96 threads kernbench (on 4NUMA * 24cores = 96cores)
                     kernbench              kernbench
                     w/o cluster            w/ cluster
Min       user-24    12054.67 (   0.00%)    12024.19 (   0.25%)
Min       syst-24     1751.51 (   0.00%)     1731.68 (   1.13%)
Min       elsp-24      600.46 (   0.00%)      598.64 (   0.30%)
Min       user-48    12361.93 (   0.00%)    12315.32 (   0.38%)
Min       syst-48     1917.66 (   0.00%)     1892.73 (   1.30%)
Min       elsp-48      333.96 (   0.00%)      332.57 (   0.42%)
Min       user-96    12922.40 (   0.00%)    12921.17 (   0.01%)
Min       syst-96     2143.94 (   0.00%)     2110.39 (   1.56%)
Min       elsp-96      211.22 (   0.00%)      210.47 (   0.36%)
Amean     user-24    12063.99 (   0.00%)    12030.78 *   0.28%*
Amean     syst-24     1755.20 (   0.00%)     1735.53 *   1.12%*
Amean     elsp-24      601.60 (   0.00%)      600.19 (   0.23%)
Amean     user-48    12362.62 (   0.00%)    12315.56 *   0.38%*
Amean     syst-48     1921.59 (   0.00%)     1894.95 *   1.39%*
Amean     elsp-48      334.10 (   0.00%)      332.82 *   0.38%*
Amean     user-96    12925.27 (   0.00%)    12922.63 (   0.02%)
Amean     syst-96     2146.66 (   0.00%)     2122.20 *   1.14%*
Amean     elsp-96      211.96 (   0.00%)      211.79 (   0.08%)

Note this patch isn't a universal win; it might hurt workloads that can
benefit from packing. Though tasks which want to take advantage of the
lower communication latency of one cluster won't necessarily be packed
into one cluster while the kernel is not aware of clusters, they do have
some chance of being randomly packed. This patch will make them more
likely to be spread.

Signed-off-by: Barry Song
Tested-by: Yicong Yang
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Xue Sinian
---
 arch/arm64/Kconfig             | 9 +++++++++
 include/linux/sched/topology.h | 7 +++++++
 include/linux/topology.h       | 7 +++++++
 kernel/sched/topology.c        | 5 +++++
 4 files changed, 28 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 098c8bbbd4e7..8bcd353e333e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -994,6 +994,15 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	help
+	  Cluster scheduler support improves the CPU scheduler's decision
+	  making when dealing with machines that have clusters of CPUs.
+	  Cluster usually means a couple of CPUs which are placed closely
+	  by sharing mid-level caches, last-level cache tags or internal
+	  busses.
+ config SCHED_SMT bool "SMT scheduler support" help diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 4188de2c1631..95598e9cf41d 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -34,6 +34,13 @@ static inline int cpu_smt_flags(void) } #endif +#ifdef CONFIG_SCHED_CLUSTER +static inline int cpu_cluster_flags(void) +{ + return SD_SHARE_PKG_RESOURCES; +} +#endif + #ifdef CONFIG_SCHED_MC static inline int cpu_core_flags(void) { diff --git a/include/linux/topology.h b/include/linux/topology.h index b216ed3d9fb1..a85ad55ef8a2 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -229,6 +229,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu) } #endif +#if defined(CONFIG_SCHED_CLUSTER) && !defined(cpu_cluster_mask) +static inline const struct cpumask *cpu_cluster_mask(int cpu) +{ + return topology_cluster_cpumask(cpu); +} +#endif + static inline const struct cpumask *cpu_cpu_mask(int cpu) { return cpumask_of_node(cpu_to_node(cpu)); diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index ea53dd8bbf7f..5beb5fe22ff4 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1439,6 +1439,11 @@ static struct sched_domain_topology_level default_topology[] = { #ifdef CONFIG_SCHED_SMT { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) }, #endif + +#ifdef CONFIG_SCHED_CLUSTER + { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) }, +#endif + #ifdef CONFIG_SCHED_MC { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) }, #endif -- Gitee From ea5a8dfbe2017e7be1a1532d47e6614ff2530b37 Mon Sep 17 00:00:00 2001 From: Heiko Carstens Date: Mon, 29 Nov 2021 14:03:08 +0100 Subject: [PATCH 03/16] topology/sysfs: export cluster attributes only if an architectures has support mainline inclusion from mainline-v5.17-rc1 commit e795707703b32fecdd7467afcc33ff1e92416c05 upstream. ---------------------------------------------------------------------- The cluster_id and cluster_cpus topology sysfs attributes have been added with commit c5e22feffdd7 ("topology: Represent clusters of CPUs within a die"). They are currently only used for x86, arm64, and riscv (via generic arch topology), however they are still present with bogus default values for all other architectures. Instead of enforcing such new sysfs attributes to all architectures, make them only optional visible if an architecture opts in by defining both the topology_cluster_id and topology_cluster_cpumask attributes. This is similar to what was done when the book and drawer topology levels were introduced: avoid useless and therefore confusing sysfs attributes for architectures which cannot make use of them. This should not break any existing applications, since this is a new interface introduced with the v5.16 merge window. Acked-by: Peter Zijlstra (Intel) Signed-off-by: Heiko Carstens Link: https://lore.kernel.org/r/20211129130309.3256168-3-hca@linux.ibm.com Signed-off-by: Greg Kroah-Hartman Signed-off-by: Xue Sinian --- Documentation/admin-guide/cputopology.rst | 3 +++ drivers/base/topology.c | 8 ++++++++ include/linux/topology.h | 4 ++++ 3 files changed, 15 insertions(+) diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst index 496ec1e6fd4b..9e0c47da3ccb 100644 --- a/Documentation/admin-guide/cputopology.rst +++ b/Documentation/admin-guide/cputopology.rst @@ -102,6 +102,9 @@ Architecture-neutral, drivers/base/topology.c, exports these attributes. 
However, the book and drawer related sysfs files will only be created if CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are selected, respectively. +The cluster hierarchy related sysfs files will only be created if an +architecture provides the related macros as described below. + CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are currently only used on s390, where they reflect the cpu and cache hierarchy. diff --git a/drivers/base/topology.c b/drivers/base/topology.c index bf169975e4ef..ea863bcc7269 100644 --- a/drivers/base/topology.c +++ b/drivers/base/topology.c @@ -46,8 +46,10 @@ static DEVICE_ATTR_RO(physical_package_id); define_id_show_func(die_id); static DEVICE_ATTR_RO(die_id); +#ifdef TOPOLOGY_CLUSTER_SYSFS define_id_show_func(cluster_id); static DEVICE_ATTR_RO(cluster_id); +#endif define_id_show_func(core_id); static DEVICE_ATTR_RO(core_id); @@ -64,9 +66,11 @@ define_siblings_show_func(core_siblings, core_cpumask); static DEVICE_ATTR_RO(core_siblings); static DEVICE_ATTR_RO(core_siblings_list); +#ifdef TOPOLOGY_CLUSTER_SYSFS define_siblings_show_func(cluster_cpus, cluster_cpumask); static DEVICE_ATTR_RO(cluster_cpus); static DEVICE_ATTR_RO(cluster_cpus_list); +#endif define_siblings_show_func(die_cpus, die_cpumask); static DEVICE_ATTR_RO(die_cpus); @@ -95,7 +99,9 @@ static DEVICE_ATTR_RO(drawer_siblings_list); static struct attribute *default_attrs[] = { &dev_attr_physical_package_id.attr, &dev_attr_die_id.attr, +#ifdef TOPOLOGY_CLUSTER_SYSFS &dev_attr_cluster_id.attr, +#endif &dev_attr_core_id.attr, &dev_attr_thread_siblings.attr, &dev_attr_thread_siblings_list.attr, @@ -103,8 +109,10 @@ static struct attribute *default_attrs[] = { &dev_attr_core_cpus_list.attr, &dev_attr_core_siblings.attr, &dev_attr_core_siblings_list.attr, +#ifdef TOPOLOGY_CLUSTER_SYSFS &dev_attr_cluster_cpus.attr, &dev_attr_cluster_cpus_list.attr, +#endif &dev_attr_die_cpus.attr, &dev_attr_die_cpus_list.attr, &dev_attr_package_cpus.attr, diff --git a/include/linux/topology.h b/include/linux/topology.h index a85ad55ef8a2..6eb50cd0adac 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -197,6 +197,10 @@ static inline int cpu_to_mem(int cpu) #endif /* [!]CONFIG_HAVE_MEMORYLESS_NODES */ +#if defined(topology_cluster_id) && defined(topology_cluster_cpumask) +#define TOPOLOGY_CLUSTER_SYSFS +#endif + #ifndef topology_physical_package_id #define topology_physical_package_id(cpu) ((void)(cpu), -1) #endif -- Gitee From 1a531ebf877b0f1b5cda908e4b8d82fad3e2de17 Mon Sep 17 00:00:00 2001 From: Dietmar Eggemann Date: Fri, 13 May 2022 11:34:33 +0200 Subject: [PATCH 04/16] topology: Remove unused cpu_cluster_mask() mainline inclusion from mainline-v5.19-rc1 commit 15f214f9bdb7c1f560b4bf863c5a72ff53b442a4 upstream. ------------------------------------------------------------------------ default_topology[] uses cpu_clustergroup_mask() for the CLS level (guarded by CONFIG_SCHED_CLUSTER) which is currently provided by x86 (arch/x86/kernel/smpboot.c) and arm64 (drivers/base/arch_topology.c). 
Fixes: 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64") Acked-by: Barry Song Signed-off-by: Dietmar Eggemann Link: https://lore.kernel.org/r/20220513093433.425163-1-dietmar.eggemann@arm.com Signed-off-by: Greg Kroah-Hartman Signed-off-by: Xue Sinian --- include/linux/topology.h | 7 ------- 1 file changed, 7 deletions(-) diff --git a/include/linux/topology.h b/include/linux/topology.h index 6eb50cd0adac..587814e357fc 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -233,13 +233,6 @@ static inline const struct cpumask *cpu_smt_mask(int cpu) } #endif -#if defined(CONFIG_SCHED_CLUSTER) && !defined(cpu_cluster_mask) -static inline const struct cpumask *cpu_cluster_mask(int cpu) -{ - return topology_cluster_cpumask(cpu); -} -#endif - static inline const struct cpumask *cpu_cpu_mask(int cpu) { return cpumask_of_node(cpu_to_node(cpu)); -- Gitee From e217329ad36e621aa4a1340e6340db3aa5044970 Mon Sep 17 00:00:00 2001 From: Ionela Voinescu Date: Mon, 4 Jul 2022 11:16:01 +0100 Subject: [PATCH 05/16] arch_topology: Limit span of cpu_clustergroup_mask() mainline inclusion from mainline-v6.0-rc1 commit bfcc4397435dc0407099b9a805391abc05c2313b upstream. ---------------------------------------------------------------------- Currently the cluster identifier is not set on DT based platforms. The reset or default value is -1 for all the CPUs. Once we assign the cluster identifier values correctly, the cluster_sibling mask will be populated and returned by cpu_clustergroup_mask() to contribute in the creation of the CLS scheduling domain level, if SCHED_CLUSTER is enabled. To avoid topologies that will result in questionable or incorrect scheduling domains, impose restrictions regarding the span of clusters, as presented to scheduling domains building code: cluster_sibling should not span more or the same CPUs as cpu_coregroup_mask(). This is needed in order to obtain a strict separation between the MC and CLS levels, and maintain the same domains for existing platforms in the presence of CONFIG_SCHED_CLUSTER, where the new cluster information is redundant and irrelevant for the scheduler. While previously the scheduling domain builder code would have removed MC as redundant and kept CLS if SCHED_CLUSTER was enabled and the cpu_coregroup_mask() and cpu_clustergroup_mask() spanned the same CPUs, now CLS will be removed and MC kept. Link: https://lore.kernel.org/r/20220704101605.1318280-18-sudeep.holla@arm.com Cc: Darren Hart Tested-by: Conor Dooley Acked-by: Vincent Guittot Signed-off-by: Ionela Voinescu Signed-off-by: Sudeep Holla Signed-off-by: Xue Sinian --- drivers/base/arch_topology.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index 02af6964d9fd..e5d4674aa031 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -450,6 +450,13 @@ const struct cpumask *cpu_coregroup_mask(int cpu) const struct cpumask *cpu_clustergroup_mask(int cpu) { + /* + * Forbid cpu_clustergroup_mask() to span more or the same CPUs as + * cpu_coregroup_mask(). 
+	 */
+	if (cpumask_subset(cpu_coregroup_mask(cpu),
+			   &cpu_topology[cpu].cluster_sibling))
+		return get_cpu_mask(cpu);
 	return &cpu_topology[cpu].cluster_sibling;
 }
 
--
Gitee

From 739d364ccd9c4ad6460ef6f7dcc34b0901a5014b Mon Sep 17 00:00:00 2001
From: Yicong Yang
Date: Mon, 5 Sep 2022 20:26:15 +0800
Subject: [PATCH 06/16] arch_topology: Make cluster topology span at least SMT
 CPUs

mainline inclusion
from mainline-v6.0-rc5
commit 5ac251c8a05ce074e5efac779debf82a15d870a3
upstream.

----------------------------------------------------------------------

Currently cpu_clustergroup_mask() will return the CPU mask if the
cluster spans more or the same CPUs as cpu_coregroup_mask(). This
results in a broken topology on non-cluster SMT machines when building
with CONFIG_SCHED_CLUSTER=y.

Test with:
qemu-system-aarch64 -enable-kvm -machine virt \
 -net none \
 -cpu host \
 -bios ./QEMU_EFI.fd \
 -m 2G \
 -smp 48,sockets=2,cores=12,threads=2 \
 -kernel $Image \
 -initrd $Rootfs \
 -nographic -append "rdinit=init console=ttyAMA0 sched_verbose loglevel=8"

We'll get the below error:
[    3.084568] BUG: arch topology borken
[    3.084570]      the SMT domain not a subset of the CLS domain

Since the cluster is a level higher than SMT, fix this by making the
cluster span at least the SMT CPUs.

Fixes: bfcc4397435d ("arch_topology: Limit span of cpu_clustergroup_mask()")
Cc: Sudeep Holla
Cc: Vincent Guittot
Cc: Ionela Voinescu
Cc: Greg KH
Reviewed-by: Sudeep Holla
Signed-off-by: Yicong Yang
Link: https://lore.kernel.org/r/20220905122615.12946-1-yangyicong@huawei.com
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: Xue Sinian
---
 drivers/base/arch_topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e5d4674aa031..bf336aa4deaf 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -456,7 +456,7 @@ const struct cpumask *cpu_clustergroup_mask(int cpu)
 	 */
 	if (cpumask_subset(cpu_coregroup_mask(cpu),
 			   &cpu_topology[cpu].cluster_sibling))
-		return get_cpu_mask(cpu);
+		return topology_sibling_cpumask(cpu);
 	return &cpu_topology[cpu].cluster_sibling;
 }
 
--
Gitee

From 8c1ef41b122faf09ff87ee416a862498c60a3b47 Mon Sep 17 00:00:00 2001
From: Barry Song
Date: Mon, 17 Oct 2022 15:01:55 +0800
Subject: [PATCH 07/16] sched: Add per_cpu cluster domain info and
 cpus_share_lowest_cache API

Reference: https://lore.kernel.org/lkml/20220915073423.25535-1-yangyicong@huawei.com/
commit 53ad6bf76d9c646e3c8494ed82d90f304c50de1f openeuler.

----------------------------------------------------------------------

Add per-cpu cluster domain info and a cpus_share_lowest_cache() API.
This is preparation for the optimization of select_idle_cpu() on
platforms with a cluster scheduler level.
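
As a sketch of how this API is meant to be consumed, the wake-up path
change later in this series uses roughly the following call pattern in
select_idle_sibling():

  /* Sketch: prefer the previous CPU only when it is cache-affine at the
   * lowest level -- the cluster on cluster machines, the LLC elsewhere. */
  if (prev != target && cpus_share_lowest_cache(prev, target) &&
      (available_idle_cpu(prev) || sched_idle_cpu(prev)))
          return prev;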
Signed-off-by: Xue Sinian
---
 include/linux/sched/topology.h | 23 +++++++++++++++--------
 kernel/sched/core.c            | 14 ++++++++++++++
 kernel/sched/sched.h           |  2 ++
 kernel/sched/topology.c        | 15 +++++++++++++++
 4 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 95598e9cf41d..f98e823bee8c 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -19,13 +19,14 @@
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
 #define SD_ASYM_CPUCAPACITY	0x0040	/* Domain members have different CPU capacities */
 #define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share CPU capacity */
-#define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
-#define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share CPU pkg resources */
-#define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
-#define SD_ASYM_PACKING		0x0800	/* Place busy groups earlier in the domain */
-#define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
-#define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
-#define SD_NUMA			0x4000	/* cross-node balancing */
+#define SD_CLUSTER		0x0100	/* Domain members share CPU cluster */
+#define SD_SHARE_POWERDOMAIN	0x0200	/* Domain members share power domain */
+#define SD_SHARE_PKG_RESOURCES	0x0400	/* Domain members share CPU pkg resources */
+#define SD_SERIALIZE		0x0800	/* Only a single load balancing instance */
+#define SD_ASYM_PACKING		0x1000	/* Place busy groups earlier in the domain */
+#define SD_PREFER_SIBLING	0x2000	/* Prefer to place tasks in a sibling domain */
+#define SD_OVERLAP		0x4000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x8000	/* cross-node balancing */
 
 #ifdef CONFIG_SCHED_SMT
 static inline int cpu_smt_flags(void)
@@ -37,7 +38,7 @@ static inline int cpu_smt_flags(void)
 #ifdef CONFIG_SCHED_CLUSTER
 static inline int cpu_cluster_flags(void)
 {
-	return SD_SHARE_PKG_RESOURCES;
+	return SD_CLUSTER | SD_SHARE_PKG_RESOURCES;
 }
 #endif
 
@@ -180,6 +181,7 @@ cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
 bool cpus_share_cache(int this_cpu, int that_cpu);
+bool cpus_share_lowest_cache(int this_cpu, int that_cpu);
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 typedef int (*sched_domain_flags_f)(void);
@@ -233,6 +235,11 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
 
+static inline bool cpus_share_lowest_cache(int this_cpu, int that_cpu)
+{
+	return true;
+}
+
 #endif	/* !CONFIG_SMP */
 
 #ifndef arch_scale_cpu_capacity
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f7eda976364..48d796f02d4a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2925,6 +2925,20 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+/*
+ * Whether the CPUs share the lowest cache, which means the LLC on
+ * non-cluster machines and the LLC tag or L2 on machines with clusters.
+ */
+bool cpus_share_lowest_cache(int this_cpu, int that_cpu)
+{
+	if (this_cpu == that_cpu)
+		return true;
+
+	return per_cpu(sd_lowest_cache_id, this_cpu) ==
+	       per_cpu(sd_lowest_cache_id, that_cpu);
+}
+
 #endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 764a0000df94..e8ee56a8796b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1937,7 +1937,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(int, sd_lowest_cache_id);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5beb5fe22ff4..95b2d5c7fac3 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -618,6 +618,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_lowest_cache_id);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -643,6 +645,18 @@ static void update_top_cache_domain(int cpu)
 	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
+	sd = lowest_flag_domain(cpu, SD_CLUSTER);
+	if (sd)
+		id = cpumask_first(sched_domain_span(sd));
+	rcu_assign_pointer(per_cpu(sd_cluster, cpu), sd);
+
+	/*
+	 * This assignment should be placed after the sd_llc_id as
+	 * we want this id to equal the cluster id on cluster machines
+	 * but the LLC id on non-cluster machines.
+	 */
+	per_cpu(sd_lowest_cache_id, cpu) = id;
+
 	sd = lowest_flag_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
 
@@ -1306,6 +1320,7 @@ int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY	|	\
+	 SD_CLUSTER		|	\
 	 SD_SHARE_PKG_RESOURCES |	\
	 SD_NUMA		|	\
	 SD_ASYM_PACKING	|	\
--
Gitee

From ccc92160c5fd71023ef6865ef0f4f95771db5576 Mon Sep 17 00:00:00 2001
From: Barry Song
Date: Mon, 17 Oct 2022 15:34:27 +0800
Subject: [PATCH 08/16] sched/fair: Scan cluster before scanning LLC in
 wake-up path

Reference: https://lore.kernel.org/lkml/20220915073423.25535-1-yangyicong@huawei.com/
commit 0c3a4f986962ed94da6e26ba3ec0bdf700945894 openeuler.

----------------------------------------------------------------------

For platforms having clusters like Kunpeng920, CPUs within the same
cluster have lower latency when synchronizing and accessing shared
resources like cache. Thus, this patch tries to find an idle cpu within
the cluster of the target CPU before scanning the whole LLC to gain
lower latency.

Testing has been done on Kunpeng920 by pinning tasks to one NUMA node
and to two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and
each cluster has 4 CPUs.

With this patch, we noticed enhancement on tbench within one NUMA node
or across two NUMA nodes.
On numa 0: 6.0-rc1 patched Hmean 1 351.20 ( 0.00%) 396.45 * 12.88%* Hmean 2 700.43 ( 0.00%) 793.76 * 13.32%* Hmean 4 1404.42 ( 0.00%) 1583.62 * 12.76%* Hmean 8 2833.31 ( 0.00%) 3147.85 * 11.10%* Hmean 16 5501.90 ( 0.00%) 6089.89 * 10.69%* Hmean 32 10428.59 ( 0.00%) 10619.63 * 1.83%* Hmean 64 8223.39 ( 0.00%) 8306.93 * 1.02%* Hmean 128 7042.88 ( 0.00%) 7068.03 * 0.36%* On numa 0-1: 6.0-rc1 patched Hmean 1 363.06 ( 0.00%) 397.13 * 9.38%* Hmean 2 721.68 ( 0.00%) 789.84 * 9.44%* Hmean 4 1435.15 ( 0.00%) 1566.01 * 9.12%* Hmean 8 2776.17 ( 0.00%) 3007.05 * 8.32%* Hmean 16 5471.71 ( 0.00%) 6103.91 * 11.55%* Hmean 32 10164.98 ( 0.00%) 11531.81 * 13.45%* Hmean 64 17143.28 ( 0.00%) 20078.68 * 17.12%* Hmean 128 14552.70 ( 0.00%) 15156.41 * 4.15%* Hmean 256 12827.37 ( 0.00%) 13326.86 * 3.89%* Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch in the code has not been tested but it supposed to work. Signed-off-by: Xue Sinian --- kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++--- kernel/sched/sched.h | 1 + kernel/sched/topology.c | 8 ++++++- 3 files changed, 53 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index db15e02baa55..0bcb18eb7907 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6117,6 +6117,30 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); + if (static_branch_unlikely(&sched_cluster_active)) { + struct sched_domain *sdc = + rcu_dereference(per_cpu(sd_cluster, target)); + + if (sdc) { + for_each_cpu_wrap(core, sched_domain_span(sdc), target) { + bool idle = true; + + if (!cpumask_test_cpu(core, cpus)) + continue; + + for_each_cpu(cpu, cpu_smt_mask(core)) { + cpumask_clear_cpu(cpu, cpus); + if (!available_idle_cpu(cpu)) + idle = false; + } + + if (idle) + return core; + } + cpumask_andnot(cpus, cpus, sched_domain_span(sdc)); + } + } + for_each_cpu_wrap(core, cpus, target) { bool idle = true; @@ -6217,8 +6241,26 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); + if (static_branch_unlikely(&sched_cluster_active)) { + struct sched_domain *sdc = + rcu_dereference(per_cpu(sd_cluster, target)); + + if (sdc) { + for_each_cpu_wrap(cpu, sched_domain_span(sdc), target) { + if (!cpumask_test_cpu(cpu, cpus)) + continue; + if (--nr <= 0) + return -1; + if (available_idle_cpu(cpu) || + sched_idle_cpu(cpu)) + return cpu; + } + cpumask_andnot(cpus, cpus, sched_domain_span(sdc)); + } + } + for_each_cpu_wrap(cpu, cpus, target) { - if (!--nr) + if (--nr <= 0) return si_cpu; if (available_idle_cpu(cpu)) break; @@ -6253,7 +6295,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) /* * If the previous CPU is cache affine and idle, don't be stupid: */ - if (prev != target && cpus_share_cache(prev, target) && + if (prev != target && cpus_share_lowest_cache(prev, target) && (available_idle_cpu(prev) || sched_idle_cpu(prev))) return prev; @@ -6261,7 +6303,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) recent_used_cpu = p->recent_used_cpu; if (recent_used_cpu != prev && recent_used_cpu != target && - cpus_share_cache(recent_used_cpu, target) && + cpus_share_lowest_cache(recent_used_cpu, target) && (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr)) { /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 
e8ee56a8796b..bd6f1f042c4d 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1944,6 +1944,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa); DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); extern struct static_key_false sched_asym_cpucapacity; +extern struct static_key_false sched_cluster_active; struct sched_group_capacity { atomic_t ref; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 95b2d5c7fac3..fb8c20ca1a31 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -625,6 +625,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa); DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing); DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity); DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity); +DEFINE_STATIC_KEY_FALSE(sched_cluster_active); static void update_top_cache_domain(int cpu) { @@ -2011,6 +2012,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att int i, ret = -ENOMEM; struct sched_domain_topology_level *tl_asym; bool has_asym = false; + bool has_cluster = false; if (WARN_ON(cpumask_empty(cpu_map))) goto error; @@ -2038,7 +2040,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att goto error; sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i); - + has_cluster |= sd->flags & SD_CLUSTER; if (tl == sched_domain_topology) *per_cpu_ptr(d.sd, i) = sd; if (tl->flags & SDTL_OVERLAP) @@ -2089,6 +2091,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att if (has_asym) static_branch_inc_cpuslocked(&sched_asym_cpucapacity); + if (has_cluster) + static_branch_inc_cpuslocked(&sched_cluster_active); if (rq && sched_debug_enabled) { pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n", @@ -2188,6 +2192,8 @@ static void detach_destroy_domains(const struct cpumask *cpu_map) if (rcu_access_pointer(per_cpu(sd_asym_cpucapacity, cpu))) static_branch_dec_cpuslocked(&sched_asym_cpucapacity); + if (rcu_access_pointer(per_cpu(sd_cluster, cpu))) + static_branch_dec_cpuslocked(&sched_cluster_active); rcu_read_lock(); for_each_cpu(i, cpu_map) -- Gitee From f6f061ca0fd1d218fc8ed54d4dd71c0cf0c3ab39 Mon Sep 17 00:00:00 2001 From: Tim Chen Date: Fri, 3 Dec 2021 12:32:38 -0800 Subject: [PATCH 09/16] scheduler: Create SDTL_SKIP flag to skip topology level commit 211b6fb7d5a8558a453475a08a697e651ca2d0cb openeuler. Reference: https://lore.kernel.org/lkml/cover.1638563225.git.tim.c.chen@linux.intel.com/ ---------------------------------------------------------------------- A system admin may not want to use cluster scheduling. Make changes to allow cluster topology level to be skipped when building sched domains. Create SDTL_SKIP bit on the sched_domain_topology_level flag so we can check if the cluster topology level should be skipped when building sched domains. 
Signed-off-by: Xue Sinian --- include/linux/sched/topology.h | 1 + kernel/sched/topology.c | 12 ++++++++++-- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index f98e823bee8c..103b2696ff17 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -187,6 +187,7 @@ typedef const struct cpumask *(*sched_domain_mask_f)(int cpu); typedef int (*sched_domain_flags_f)(void); #define SDTL_OVERLAP 0x01 +#define SDTL_SKIP 0x02 struct sd_data { struct sched_domain *__percpu *sd; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index fb8c20ca1a31..70688b5109eb 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1470,8 +1470,16 @@ static struct sched_domain_topology_level default_topology[] = { static struct sched_domain_topology_level *sched_domain_topology = default_topology; +static struct sched_domain_topology_level * +next_tl(struct sched_domain_topology_level *tl) +{ + while (tl->mask && tl->flags & SDTL_SKIP) + ++tl; + return tl; +} + #define for_each_sd_topology(tl) \ - for (tl = sched_domain_topology; tl->mask; tl++) + for (tl = next_tl(sched_domain_topology); tl->mask; tl = next_tl(++tl)) void set_sched_topology(struct sched_domain_topology_level *tl) { @@ -2041,7 +2049,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i); has_cluster |= sd->flags & SD_CLUSTER; - if (tl == sched_domain_topology) + if (tl == next_tl(sched_domain_topology)) *per_cpu_ptr(d.sd, i) = sd; if (tl->flags & SDTL_OVERLAP) sd->flags |= SD_OVERLAP; -- Gitee From e2ed20091b1e8f2a1314a55fffe28fc51a367fe8 Mon Sep 17 00:00:00 2001 From: Tim Chen Date: Fri, 3 Dec 2021 12:32:40 -0800 Subject: [PATCH 10/16] scheduler: Add runtime knob sysctl_sched_cluster commit 8ce3e706b31409147f035c037055caa68e450ce5 openeuler. Reference: https://lore.kernel.org/lkml/cover.1638563225.git.tim.c.chen@linux.intel.com/ ---------------------------------------------------------------------- Allow run time configuration of the scheduler to use cluster scheduling. Configuration can be changed via the sysctl variable /proc/sys/kernel/sched_cluster. Setting it to 1 enable cluster scheduling and setting it to 0 turns it off. Cluster scheduling should benefit independent tasks by load balancing them between clusters. It reaps the most benefit when the system's CPUs are not fully busy, so we can spread the tasks out between the clusters to reduce contention on cluster resource (e.g. L2 cache). However, if the system is expected to operate close to full utilization, the system admin could turn this feature off so as not to incur extra load balancing overhead between the cluster domains. 
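
As a usage illustration (not part of the patch), toggling the knob at
runtime amounts to writing 0 or 1 to the sysctl file added here, e.g.
'echo 1 > /proc/sys/kernel/sched_cluster'; a minimal C sketch:

  /* Illustrative only: enable cluster scheduling at runtime via the
   * /proc/sys/kernel/sched_cluster sysctl (requires CAP_SYS_ADMIN). */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/proc/sys/kernel/sched_cluster", O_WRONLY);

          if (fd < 0) {
                  perror("open sched_cluster");
                  return 1;
          }
          if (write(fd, "1", 1) != 1)
                  perror("write");
          close(fd);
          return 0;
  }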
Signed-off-by: Xue Sinian --- arch/x86/kernel/smpboot.c | 8 +++++ drivers/base/arch_topology.c | 10 +++++- include/linux/sched/sysctl.h | 6 ++++ include/linux/topology.h | 1 + kernel/sched/core.c | 1 + kernel/sched/sched.h | 6 ++++ kernel/sched/topology.c | 67 ++++++++++++++++++++++++++++++++++++ 7 files changed, 98 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 1699d18bd154..bf1144286ca7 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -55,6 +55,7 @@ #include #include #include +#include #include #include @@ -119,6 +120,13 @@ int arch_update_cpu_topology(void) return retval; } +void arch_rebuild_cpu_topology(void) +{ + x86_topology_update = true; + rebuild_sched_domains(); + x86_topology_update = false; +} + static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip) { unsigned long flags; diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index bf336aa4deaf..dbbe6fa6a42f 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -75,6 +75,7 @@ static int register_cpu_capacity_sysctl(void) } subsys_initcall(register_cpu_capacity_sysctl); +static u32 capacity_scale; static int update_topology; int topology_update_cpu_topology(void) @@ -94,7 +95,14 @@ static void update_topology_flags_workfn(struct work_struct *work) update_topology = 0; } -static u32 capacity_scale; +void __weak arch_rebuild_cpu_topology(void) +{ + update_topology = 1; + rebuild_sched_domains(); + pr_debug("sched_domain hierarchy rebuilt, flags updated\n"); + update_topology = 0; +} + static u32 *raw_capacity; static int free_raw_capacity(void) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 6a67c16b436b..96b072b2a74c 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -128,4 +128,10 @@ extern int sched_energy_aware_handler(struct ctl_table *table, int write, loff_t *ppos); #endif +#ifdef CONFIG_SCHED_CLUSTER +extern unsigned int sysctl_sched_cluster; +int sched_cluster_handler(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos); +#endif + #endif /* _LINUX_SCHED_SYSCTL_H */ diff --git a/include/linux/topology.h b/include/linux/topology.h index 587814e357fc..3d68447a7577 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -44,6 +44,7 @@ if (nr_cpus_node(node)) int arch_update_cpu_topology(void); +void arch_rebuild_cpu_topology(void); /* Conform to ACPI 2.0 SLIT distance definitions */ #define LOCAL_DISTANCE 10 diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 48d796f02d4a..dc5f92ac8111 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7773,6 +7773,7 @@ int sched_cpu_dying(unsigned int cpu) void __init sched_init_smp(void) { sched_init_numa(); + set_sched_cluster(); /* * There's no userspace yet to cause hotplug operations; hence all the diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index bd6f1f042c4d..eb6a4ffb8103 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1821,6 +1821,12 @@ this_rq_lock_irq(struct rq_flags *rf) return rq; } +#ifdef CONFIG_SCHED_CLUSTER +extern void set_sched_cluster(void); +#else +static inline void set_sched_cluster(void) { } +#endif + #ifdef CONFIG_NUMA enum numa_topology_type { NUMA_DIRECT, diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 70688b5109eb..b8f79eb6984f 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1470,6 +1470,73 @@ static struct 
sched_domain_topology_level default_topology[] = { static struct sched_domain_topology_level *sched_domain_topology = default_topology; +#ifdef CONFIG_SCHED_CLUSTER +void set_sched_cluster(void) +{ + struct sched_domain_topology_level *tl; + + for (tl = sched_domain_topology; tl->mask; tl++) { + if (tl->sd_flags && (tl->sd_flags() & SD_CLUSTER)) { + if (!sysctl_sched_cluster) + tl->flags |= SDTL_SKIP; + else + tl->flags &= ~SDTL_SKIP; + break; + } + } +} + +/* set via /proc/sys/kernel/sched_cluster */ +unsigned int __read_mostly sysctl_sched_cluster = 1; + +static DEFINE_MUTEX(sched_cluster_mutex); +int sched_cluster_handler(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + int ret; + unsigned int oldval; + + if (write && !capable(CAP_SYS_ADMIN)) + return -EPERM; + + mutex_lock(&sched_cluster_mutex); + oldval = sysctl_sched_cluster; + ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (!ret && write) { + if (oldval != sysctl_sched_cluster) { + set_sched_cluster(); + arch_rebuild_cpu_topology(); + } + } + mutex_unlock(&sched_cluster_mutex); + + return ret; +} + +static int zero; +static int one = 1; + +static struct ctl_table sched_cluster_sysctls[] = { + { + .procname = "sched_cluster", + .data = &sysctl_sched_cluster, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = sched_cluster_handler, + .extra1 = (void *)&zero, + .extra2 = (void *)&one, + }, + {} +}; + +static int __init sched_cluster_sysctl_init(void) +{ + register_sysctl_init("kernel", sched_cluster_sysctls); + return 0; +} +late_initcall(sched_cluster_sysctl_init); +#endif + static struct sched_domain_topology_level * next_tl(struct sched_domain_topology_level *tl) { -- Gitee From d6c8df58c9267957b99e1c18bbc2db574011f01d Mon Sep 17 00:00:00 2001 From: Tim Chen Date: Fri, 3 Dec 2021 12:32:41 -0800 Subject: [PATCH 11/16] scheduler: Add boot time enabling/disabling of cluster scheduling commit 9e68cc2bf535a2f4e3c33e7e53bbb15815b703c4 openeuler. Reference: https://lore.kernel.org/lkml/cover.1638563225.git.tim.c.chen@linux.intel.com/ ---------------------------------------------------------------------- Add boot time parameter sched_cluster to enable or disable cluster scheduling. Set boot parameter as follow: sched_cluster=0 disables cluster scheduling sched_cluster=1 enables cluster scheduling Signed-off-by: Xue Sinian --- Documentation/admin-guide/kernel-parameters.txt | 4 ++++ kernel/sched/topology.c | 16 ++++++++++++++++ 2 files changed, 20 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index e7e9c9979fe3..810b9f5cbf9b 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4461,6 +4461,10 @@ qspinlock.numa_spinlock_threshold_ns= [NUMA, PV_OPS] sched_debug [KNL] Enables verbose scheduler debug messages. + sched_cluster= Enable or disable cluster scheduling. + 0 -- disable. + 1 -- enable. + schedstats= [KNL,X86] Enable or disable scheduled statistics. Allowed values are enable and disable. 
 			This feature incurs a small amount of overhead in the scheduler
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b8f79eb6984f..4e6c58488898 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1535,6 +1535,22 @@ static int __init sched_cluster_sysctl_init(void)
 	return 0;
 }
 late_initcall(sched_cluster_sysctl_init);
+
+static int __init sched_cluster_option(char *str)
+{
+	int enable;
+
+	if (get_option(&str, &enable)) {
+		if (enable != 0 && enable != 1)
+			return -EINVAL;
+
+		sysctl_sched_cluster = enable;
+		return 0;
+	}
+
+	return -EINVAL;
+}
+early_param("sched_cluster", sched_cluster_option);
 #endif
 
 static struct sched_domain_topology_level *
--
Gitee

From 4de683d5f4ac8e31ec9265771c5b3c1034c18b56 Mon Sep 17 00:00:00 2001
From: Yicong Yang
Date: Mon, 13 Feb 2023 10:48:54 +0800
Subject: [PATCH 12/16] scheduler: Disable cluster scheduling by default

commit 6afb257d6dd71085344e1472ea6e820b5dc0a8e3 openeuler.

----------------------------------------------------------------------

Disable cluster scheduling by default since it is not a universal win.
Users can choose to enable it through the sysctl interface or at boot
time according to their scenarios.

Signed-off-by: Xue Sinian
---
 kernel/sched/topology.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 4e6c58488898..b25044a6e64b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1487,7 +1487,7 @@ void set_sched_cluster(void)
 }
 
 /* set via /proc/sys/kernel/sched_cluster */
-unsigned int __read_mostly sysctl_sched_cluster = 1;
+unsigned int __read_mostly sysctl_sched_cluster;
 
 static DEFINE_MUTEX(sched_cluster_mutex);
 int sched_cluster_handler(struct ctl_table *table, int write,
--
Gitee

From a392cdbad56ce67d086772ded110e2dc77e0f4cd Mon Sep 17 00:00:00 2001
From: Jie Liu
Date: Mon, 24 Oct 2022 09:34:57 +0800
Subject: [PATCH 13/16] sched: Open the kernel configuration for cluster.

commit aff649361671b432570e94c9056932f50dd6f101 openeuler.

----------------------------------------------------------------------

CONFIG_SCHED_CLUSTER was not set in the previous configuration. Enable
it now so that cluster scheduling is available.

Signed-off-by: Xue Sinian
---
 arch/arm64/configs/tencent.config | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/tencent.config b/arch/arm64/configs/tencent.config
index dcfd3e97e9a4..9cfd12ce0e6f 100644
--- a/arch/arm64/configs/tencent.config
+++ b/arch/arm64/configs/tencent.config
@@ -57,6 +57,7 @@ CONFIG_ARM64_VA_BITS_48=y
 CONFIG_SCHED_MC=y
 CONFIG_SCHED_SMT=y
 CONFIG_NR_CPUS=1024
+CONFIG_SCHED_CLUSTER=y
 CONFIG_NUMA=y
 CONFIG_NODES_SHIFT=8
 CONFIG_HZ_1000=y
--
Gitee

From 037b1ecbaeb855daa7527fb7c8d1ee5ed6dbb3ff Mon Sep 17 00:00:00 2001
From: Chen Yu
Date: Wed, 30 Nov 2022 18:23:24 +0800
Subject: [PATCH 14/16] sched/fair: Introduce SIS_UTIL to search idle CPU
 based on sum of util_avg

mainline inclusion
from mainline-v6.0-rc1
commit 70fb5ccf2ebb09a0c8ebba775041567812d45f86
upstream.

--------------------------------

[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU
when the system is overloaded.
The following histogram is the time spent in select_idle_cpu(), when
running 224 instances of netperf on a system with 112 CPUs per LLC
domain:

@usecs:
[0]                  533 |                                                    |
[1]                 5495 |                                                    |
[2, 4)             12008 |                                                    |
[4, 8)            239252 |                                                    |
[8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)        2600472 |@@@@@@@@@                                           |
[1K, 2K)          927912 |@@@                                                 |
[2K, 4K)          218720 |                                                    |
[4K, 8K)           98161 |                                                    |
[8K, 16K)          37722 |                                                    |
[16K, 32K)          6715 |                                                    |
[32K, 64K)           477 |                                                    |
[64K, 128K)            7 |                                                    |

netperf latency usecs:
=======
case            load            Lat_99th    std%
TCP_RR          thread-224      257.39      (  0.21)

The time spent in select_idle_cpu() is visible to netperf and might
have a negative impact.

[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. Copy the indicators here:

SIS Search Efficiency (se_eff%):
        A ratio expressed as a percentage of runqueues scanned versus
        idle CPUs found. A 100% efficiency indicates that the target,
        prev or recent CPU of a task was idle at wakeup. The lower the
        efficiency, the more runqueues were scanned before an idle CPU
        was found.

SIS Domain Search Efficiency (dom_eff%):
        Similar, except only for the slower SIS path.

SIS Fast Success Rate (fast_rate%):
        Percentage of SIS that used target, prev or recent CPUs.

SIS Success Rate (success_rate%):
        Percentage of scans that found an idle CPU.

The test is based on Aubrey's schedtests tool, including netperf,
hackbench, schbench and tbench. Test on the vanilla kernel:

schedstat_parse.py -f netperf_vanilla.log
case            load         se_eff%   dom_eff%   fast_rate%   success_rate%
TCP_RR          28 threads    99.978     18.535      99.995        100.000
TCP_RR          56 threads    99.397      5.671      99.964        100.000
TCP_RR          84 threads    21.721      6.818      73.632        100.000
TCP_RR          112 threads   12.500      5.533      59.000        100.000
TCP_RR          140 threads    8.524      4.535      49.020        100.000
TCP_RR          168 threads    6.438      3.945      40.309         99.999
TCP_RR          196 threads    5.397      3.718      32.320         99.982
TCP_RR          224 threads    4.874      3.661      25.775         99.767
UDP_RR          28 threads    99.988     17.704      99.997        100.000
UDP_RR          56 threads    99.528      5.977      99.970        100.000
UDP_RR          84 threads    24.219      6.992      76.479        100.000
UDP_RR          112 threads   13.907      5.706      62.538        100.000
UDP_RR          140 threads    9.408      4.699      52.519        100.000
UDP_RR          168 threads    7.095      4.077      44.352        100.000
UDP_RR          196 threads    5.757      3.775      35.764         99.991
UDP_RR          224 threads    5.124      3.704      28.748         99.860

schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case            load         se_eff%   dom_eff%   fast_rate%   success_rate%
normal          1 mthread     99.152      6.400      99.941        100.000
normal          2 mthreads    97.844      4.003      99.908        100.000
normal          3 mthreads    96.395      2.118      99.917         99.998
normal          4 mthreads    55.288      1.451      98.615         99.804
normal          5 mthreads     7.004      1.870      45.597         61.036
normal          6 mthreads     3.354      1.346      20.777         34.230
normal          7 mthreads     2.183      1.028      11.257         21.055
normal          8 mthreads     1.653      0.825       7.849         15.549

schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case            load         se_eff%   dom_eff%   fast_rate%   success_rate%
process-pipe    1 group       99.991      7.692      99.999        100.000
process-pipe    2 groups      99.934      4.615      99.997        100.000
process-pipe    3 groups      99.597      3.198      99.987        100.000
process-pipe    4 groups      98.378      2.464      99.958        100.000
process-pipe    5 groups      27.474      3.653      89.811         99.800
process-pipe    6 groups      20.201      4.098      82.763         99.570
process-pipe    7 groups      16.423      4.156      77.398         99.316
process-pipe    8 groups      13.165      3.920      72.232         98.828
process-sockets 1 group       99.977      5.882      99.999        100.000
process-sockets 2 groups      99.927      5.505      99.996        100.000
process-sockets 3 groups      99.397      3.250      99.980        100.000
process-sockets 4 groups      79.680      4.258      98.864         99.998
process-sockets 5 groups       7.673      2.503      63.659         92.115
process-sockets 6 groups       4.642      1.584      58.946         88.048
process-sockets 7 groups       3.493      1.379      49.816         81.164
process-sockets 8 groups       3.015      1.407      40.845         75.500
threads-pipe    1 group       99.997      0.000     100.000        100.000
threads-pipe    2 groups      99.894      2.932      99.997        100.000
threads-pipe    3 groups      99.611      4.117      99.983        100.000
threads-pipe    4 groups      97.703      2.624      99.937        100.000
threads-pipe    5 groups      22.919      3.623      87.150         99.764
threads-pipe    6 groups      18.016      4.038      80.491         99.557
threads-pipe    7 groups      14.663      3.991      75.239         99.247
threads-pipe    8 groups      12.242      3.808      70.651         98.644
threads-sockets 1 group       99.990      6.667      99.999        100.000
threads-sockets 2 groups      99.940      5.114      99.997        100.000
threads-sockets 3 groups      99.469      4.115      99.977        100.000
threads-sockets 4 groups      87.528      4.038      99.400        100.000
threads-sockets 5 groups       6.942      2.398      59.244         88.337
threads-sockets 6 groups       4.359      1.954      49.448         87.860
threads-sockets 7 groups       2.845      1.345      41.198         77.102
threads-sockets 8 groups       2.871      1.404      38.512         74.312

schedstat_parse.py -f tbench_vanilla.log
case            load         se_eff%   dom_eff%   fast_rate%   success_rate%
loopback        28 threads    99.976     18.369      99.995        100.000
loopback        56 threads    99.222      7.799      99.934        100.000
loopback        84 threads    19.723      6.819      70.215        100.000
loopback        112 threads   11.283      5.371      55.371         99.999
loopback        140 threads    0.000      0.000       0.000          0.000
loopback        168 threads    0.000      0.000       0.000          0.000
loopback        196 threads    0.000      0.000       0.000          0.000
loopback        224 threads    0.000      0.000       0.000          0.000

According to the tests above, the SIS Search Efficiency (se_eff%)
drops significantly once the system becomes busy. Although some
benchmarks do eventually find an idle CPU (success_rate% = 100%), it
is doubtful whether it is worth searching the whole LLC domain.

[Proposal]
It would be ideal to have a crystal ball to answer this question: how
many CPUs must a wakeup path walk down before it can find an idle CPU?
Many potential metrics could be used to predict the number. One
candidate is the sum of util_avg in the LLC domain. The benefit of
choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics (such
as rq->nr_running). Besides, the sum of util_avg helps predict the
load of the LLC domain more precisely, whereas SIS_PROP uses one CPU's
idle time to estimate the idle time of the whole LLC domain.

In summary, the lower the util_avg is, the more select_idle_cpu()
should scan for an idle CPU, and vice versa. When the sum of util_avg
in the LLC domain hits 85% or above, the scan stops. The reason to
choose 85% as the threshold is that it corresponds to the
imbalance_pct (117) at which an LLC sched group is considered
overloaded.

Introduce the quadratic function:

    y = SCHED_CAPACITY_SCALE - p * x^2
    and y' = y / SCHED_CAPACITY_SCALE

    x is the ratio of sum_util compared to the CPU capacity:
    x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)

    y' is the ratio of CPUs to be scanned in the LLC domain,
    and the number of CPUs to scan is calculated by:

    nr_scan = llc_weight * y'

The quadratic function was chosen because:

[1] Compared to the linear function, it scans more aggressively when
    the sum_util is low.
[2] Compared to the exponential function, it is easier to calculate.
[3] It seems that there is no accurate mapping between the sum of
    util_avg and the number of CPUs to be scanned. Use a heuristic
    scan for now.
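To make the equations concrete before looking at the tables, here is a
minimal userspace sketch of the same integer arithmetic (this writeup's
own illustration: the nr_scan() helper below is hypothetical and only
mirrors the math; the in-kernel version is update_idle_cpu_scan() in
the diff further down):

    #include <stdio.h>

    #define SCHED_CAPACITY_SCALE 1024ULL

    /* y = SCHED_CAPACITY_SCALE - p * x^2, nr_scan = llc_weight * y' */
    static unsigned long nr_scan(unsigned long long sum_util,
                                 int llc_weight, int pct)
    {
            unsigned long long x, tmp;

            /* x, pre-scaled by SCHED_CAPACITY_SCALE (equation [3]) */
            x = sum_util / llc_weight;
            /* x'^2 * pct^2 / (10000 * SCHED_CAPACITY_SCALE) (equation [4]) */
            tmp = x * x * pct * pct / (10000 * SCHED_CAPACITY_SCALE);
            if (tmp > SCHED_CAPACITY_SCALE)     /* y bottoms out at 0 */
                    tmp = SCHED_CAPACITY_SCALE;

            /* nr_scan = llc_weight * y' (equation [2]) */
            return (SCHED_CAPACITY_SCALE - tmp) * llc_weight / SCHED_CAPACITY_SCALE;
    }

    int main(void)
    {
            int llc_weight = 112, pct = 117;    /* imbalance_pct of an LLC domain */
            int p;

            for (p = 0; p <= 100; p += 5) {     /* sum_util as a percentage */
                    unsigned long long sum_util =
                            p * llc_weight * SCHED_CAPACITY_SCALE / 100;

                    printf("sum_util%% %3d -> scan %lu CPUs\n",
                           p, nr_scan(sum_util, llc_weight, pct));
            }
            return 0;
    }

With llc_weight = 112 and pct = 117, this prints 81 CPUs to scan at 45%
utilization and 1 CPU at 85%, matching the first table below.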
For a platform with 112 CPUs per LLC, the number of CPUs to scan is:

sum_util%     0    5   15   25   35   45   55   65   75   85   86 ...
scan_nr     112  111  108  102   93   81   65   47   25    1    0 ...

For a platform with 16 CPUs per LLC, the number of CPUs to scan is:

sum_util%     0    5   15   25   35   45   55   65   75   85   86 ...
scan_nr      16   15   15   14   13   11    9    6    3    0    0 ...

Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the sum_util
calculated by periodic load balance after 112 ms would decay to about
0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in reflecting the
latest utilization. But it is a trade-off: checking the util_avg in
newidle load balance would be more frequent, but it brings overhead -
multiple CPUs write/read the per-LLC shared variable, which introduces
cache contention. Tim also mentioned that it is acceptable to be
non-optimal in terms of scheduling for short-term variations; if there
is a long-term trend in the load behavior, the scheduler can adjust
for that.

When SIS_UTIL is enabled, select_idle_cpu() uses the nr_scan calculated
by SIS_UTIL instead of the one from SIS_PROP. As Peter and Mel
suggested, SIS_UTIL should be enabled by default.

This patch is based on util_avg, which is very sensitive to CPU
frequency invariance. There is an issue that when the max frequency
has been clamped, the util_avg decays extremely fast while the CPU is
idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo in
frequency invariance") could be used to mitigate this symptom, by
adjusting arch_max_freq_ratio when turbo is disabled. But this issue
is still not thoroughly fixed, because the current code is unaware of
the user-specified max CPU frequency.

[Test result]
netperf and tbench were launched with 25%, 50%, 75%, 100%, 125%, 150%,
175% and 200% of the CPU number, respectively. Hackbench and schbench
were launched with 1, 2, 4 and 8 groups. Each test lasts for 100
seconds and repeats 3 times.

The following is the benchmark result comparison between
baseline: vanilla v5.19-rc1 and compare: patched kernel. Positive
compare% indicates better performance.

Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100

netperf.throughput
=======
case            load            baseline(std%)  compare%( std%)
TCP_RR          28 threads       1.00 (  0.34)    -0.16 (  0.40)
TCP_RR          56 threads       1.00 (  0.19)    -0.02 (  0.20)
TCP_RR          84 threads       1.00 (  0.39)    -0.47 (  0.40)
TCP_RR          112 threads      1.00 (  0.21)    -0.66 (  0.22)
TCP_RR          140 threads      1.00 (  0.19)    -0.69 (  0.19)
TCP_RR          168 threads      1.00 (  0.18)    -0.48 (  0.18)
TCP_RR          196 threads      1.00 (  0.16)  +194.70 ( 16.43)
TCP_RR          224 threads      1.00 (  0.16)  +197.30 (  7.85)
UDP_RR          28 threads       1.00 (  0.37)    +0.35 (  0.33)
UDP_RR          56 threads       1.00 ( 11.18)    -0.32 (  0.21)
UDP_RR          84 threads       1.00 (  1.46)    -0.98 (  0.32)
UDP_RR          112 threads      1.00 ( 28.85)    -2.48 ( 19.61)
UDP_RR          140 threads      1.00 (  0.70)    -0.71 ( 14.04)
UDP_RR          168 threads      1.00 ( 14.33)    -0.26 ( 11.16)
UDP_RR          196 threads      1.00 ( 12.92)  +186.92 ( 20.93)
UDP_RR          224 threads      1.00 ( 11.74)  +196.79 ( 18.62)

Take the 224 threads case as an example; the SIS search metrics
changes are illustrated below:

        vanilla                    patched
      4544492          +237.5%   15338634  sched_debug.cpu.sis_domain_search.avg
        38539        +39686.8%   15333634  sched_debug.cpu.sis_failed.avg
    128300000           -87.9%   15551326  sched_debug.cpu.sis_scanned.avg
      5842896          +162.7%   15347978  sched_debug.cpu.sis_search.avg

There are 87.9% fewer CPU scans after patching, which indicates lower
overhead.
Besides, with this patch applied, there is 13% less rq lock contention
in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function. This might
help explain the performance improvement: the patch allows the waking
task to remain on its previous CPU, rather than grabbing other CPUs'
locks.

Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100

hackbench.throughput
=========
case            load            baseline(std%)  compare%( std%)
process-pipe    1 group          1.00 (  1.29)    +0.57 (  0.47)
process-pipe    2 groups         1.00 (  0.27)    +0.77 (  0.81)
process-pipe    4 groups         1.00 (  0.26)    +1.17 (  0.02)
process-pipe    8 groups         1.00 (  0.15)    -4.79 (  0.02)
process-sockets 1 group          1.00 (  0.63)    -0.92 (  0.13)
process-sockets 2 groups         1.00 (  0.03)    -0.83 (  0.14)
process-sockets 4 groups         1.00 (  0.40)    +5.20 (  0.26)
process-sockets 8 groups         1.00 (  0.04)    +3.52 (  0.03)
threads-pipe    1 group          1.00 (  1.28)    +0.07 (  0.14)
threads-pipe    2 groups         1.00 (  0.22)    -0.49 (  0.74)
threads-pipe    4 groups         1.00 (  0.05)    +1.88 (  0.13)
threads-pipe    8 groups         1.00 (  0.09)    -4.90 (  0.06)
threads-sockets 1 group          1.00 (  0.25)    -0.70 (  0.53)
threads-sockets 2 groups         1.00 (  0.10)    -0.63 (  0.26)
threads-sockets 4 groups         1.00 (  0.19)   +11.92 (  0.24)
threads-sockets 8 groups         1.00 (  0.08)    +4.31 (  0.11)

Each tbench test is a:
tbench -t 100 $job 127.0.0.1

tbench.throughput
======
case            load            baseline(std%)  compare%( std%)
loopback        28 threads       1.00 (  0.06)    -0.14 (  0.09)
loopback        56 threads       1.00 (  0.03)    -0.04 (  0.17)
loopback        84 threads       1.00 (  0.05)    +0.36 (  0.13)
loopback        112 threads      1.00 (  0.03)    +0.51 (  0.03)
loopback        140 threads      1.00 (  0.02)    -1.67 (  0.19)
loopback        168 threads      1.00 (  0.38)    +1.27 (  0.27)
loopback        196 threads      1.00 (  0.11)    +1.34 (  0.17)
loopback        224 threads      1.00 (  0.11)    +1.67 (  0.22)

Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000

schbench.latency_90%_us
========
case            load            baseline(std%)  compare%( std%)
normal          1 mthread        1.00 ( 31.22)    -7.36 ( 20.25)*
normal          2 mthreads       1.00 (  2.45)    -0.48 (  1.79)
normal          4 mthreads       1.00 (  1.69)    +0.45 (  0.64)
normal          8 mthreads       1.00 (  5.47)    +9.81 ( 14.28)

*Considering the standard deviation, this -7.36% regression might not
be valid.

Also, an OLTP workload with a commercial RDBMS has been tested, and
there is no significant change.

There were concerns that unbalanced tasks among CPUs would cause
problems. For example, suppose the LLC domain is composed of 8 CPUs,
and 7 tasks are bound to CPU0~CPU6, while CPU7 is idle:

          CPU0  CPU1  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7
util_avg  1024  1024  1024  1024  1024  1024  1024     0

Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
select_idle_cpu() will not scan, so CPU7 goes undetected during the
scan. But according to Mel, it is unlikely that CPU7 will stay idle
all the time, because it can pull some tasks via CPU_NEWLY_IDLE.

lkp (kernel test robot) has reported a regression on stress-ng.sock on
a very busy system. According to the sched_debug statistics, it might
be caused by SIS_UTIL terminating the scan and choosing the previous
CPU earlier, which might introduce more context switches, especially
involuntary preemption, and that impacts a busy stress-ng. This
regression has shown that not all benchmarks in every scenario benefit
from the idle CPU scan limit, and it needs further investigation.

Besides, there is a slight regression in hackbench's 16 groups case
when the LLC domain has 16 CPUs. Prateek mentioned that we should scan
aggressively in an LLC domain with 16 CPUs,
because the cost of searching for an idle CPU among 16 CPUs is
negligible. The current patch aims to propose a generic solution and
only considers util_avg. Something like the below could be applied on
top of the current patch to fulfill the requirement:

    if (llc_weight <= 16)
            nr_scan = nr_scan * 32 / llc_weight;

For an LLC domain with 16 CPUs, nr_scan would be doubled; the fewer
CPUs an LLC domain has, the more nr_scan would be expanded. This needs
further investigation.

There is also ongoing work [2] from Abel to filter out the busy CPUs
during wakeup, to further speed up the idle CPU scan, and it could be
a follow-up optimization on top of this change.

Suggested-by: Tim Chen
Suggested-by: Peter Zijlstra
Signed-off-by: Chen Yu
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Yicong Yang
Tested-by: Mohini Narkhede
Tested-by: K Prateek Nayak
Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
Signed-off-by: Xue Sinian
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 87 ++++++++++++++++++++++++++++++++++
 kernel/sched/features.h        |  1 +
 3 files changed, 89 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 103b2696ff17..0798b5063596 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -74,6 +74,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	int		nr_idle_scan;
 #ifdef CONFIG_ARM64
 } ____cacheline_aligned;
 #else
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0bcb18eb7907..e3dd46f6a019 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6214,6 +6214,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	s64 delta;
 	int this = smp_processor_id();
 	int cpu, nr = INT_MAX, si_cpu = -1;
+	struct sched_domain_shared *sd_share;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6241,6 +6242,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
 
+	if (sched_feat(SIS_UTIL)) {
+		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
+		if (sd_share) {
+			/* because !--nr is the condition to stop scan */
+			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
+			/* overloaded LLC is unlikely to have idle cpu/core */
+			if (nr == 1)
+				return -1;
+		}
+	}
+
 	if (static_branch_unlikely(&sched_cluster_active)) {
 		struct sched_domain *sdc = rcu_dereference(per_cpu(sd_cluster, target));
 
@@ -8680,6 +8692,77 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+static void update_idle_cpu_scan(struct lb_env *env,
+				 unsigned long sum_util)
+{
+	struct sched_domain_shared *sd_share;
+	int llc_weight, pct;
+	u64 x, y, tmp;
+	/*
+	 * Update the number of CPUs to scan in LLC domain, which could
+	 * be used as a hint in select_idle_cpu(). The update of sd_share
+	 * could be expensive because it is within a shared cache line.
+	 * So the write of this hint only occurs during periodic load
+	 * balancing, rather than CPU_NEWLY_IDLE, because the latter
+	 * can fire way more frequently than the former.
+	 */
+	if (!sched_feat(SIS_UTIL) || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	llc_weight = per_cpu(sd_llc_size, env->dst_cpu);
+	if (env->sd->span_weight != llc_weight)
+		return;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
+	if (!sd_share)
+		return;
+
+	/*
+	 * The number of CPUs to search drops as sum_util increases, when
+	 * sum_util hits 85% or above, the scan stops.
+	 * The reason to choose 85% as the threshold is because this is the
+	 * imbalance_pct(117) when a LLC sched group is overloaded.
+	 *
+	 * let y = SCHED_CAPACITY_SCALE - p * x^2                       [1]
+	 * and y'= y / SCHED_CAPACITY_SCALE
+	 *
+	 * x is the ratio of sum_util compared to the CPU capacity:
+	 * x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
+	 * y' is the ratio of CPUs to be scanned in the LLC domain,
+	 * and the number of CPUs to scan is calculated by:
+	 *
+	 * nr_scan = llc_weight * y'                                    [2]
+	 *
+	 * When x hits the threshold of overloaded, AKA, when
+	 * x = 100 / pct, y drops to 0. According to [1],
+	 * p should be SCHED_CAPACITY_SCALE * pct^2 / 10000
+	 *
+	 * Scale x by SCHED_CAPACITY_SCALE:
+	 * x' = sum_util / llc_weight;                                  [3]
+	 *
+	 * and finally [1] becomes:
+	 * y = SCHED_CAPACITY_SCALE -
+	 *     x'^2 * pct^2 / (10000 * SCHED_CAPACITY_SCALE)            [4]
+	 *
+	 */
+	/* equation [3] */
+	x = sum_util;
+	do_div(x, llc_weight);
+
+	/* equation [4] */
+	pct = env->sd->imbalance_pct;
+	tmp = x * x * pct * pct;
+	do_div(tmp, 10000 * SCHED_CAPACITY_SCALE);
+	tmp = min_t(long, tmp, SCHED_CAPACITY_SCALE);
+	y = SCHED_CAPACITY_SCALE - tmp;
+
+	/* equation [2] */
+	y *= llc_weight;
+	do_div(y, SCHED_CAPACITY_SCALE);
+	if ((int)y != sd_share->nr_idle_scan)
+		WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -8693,6 +8776,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sg_lb_stats tmp_sgs;
 	bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
 	int sg_status = 0;
+	unsigned long sum_util = 0;
 
 #ifdef CONFIG_NO_HZ_COMMON
 	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
@@ -8746,6 +8830,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
+		sum_util += sgs->group_util;
 
 		sg = sg->next;
 	} while (sg != env->sd->groups);
@@ -8776,6 +8861,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		WRITE_ONCE(rd->overutilized, SG_OVERUTILIZED);
 		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
 	}
+
+	update_idle_cpu_scan(env, sum_util);
 }
 
 /**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 66c74aa4753e..59e83f5a5719 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -56,6 +56,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_UTIL, false)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
--
Gitee

From 300cd494dad24e16a6fea877e193c1f955fbc2ee Mon Sep 17 00:00:00 2001
From: Guan Jing
Date: Wed, 30 Nov 2022 18:23:25 +0800
Subject: [PATCH 15/16] sched/fair: Fix broken kABI in sched_domain_shared

commit 222d84a0de0d2dbd75b2c73f469d74868955f3b5 openeuler.

--------------------------------

The sched_domain_shared structure is only used as a pointer, and other
drivers do not use it directly.
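A note on the mechanism, since the diff below depends on it:
__GENKSYMS__ is defined only while genksyms computes the CRCs used for
symbol versioning, so a member guarded by #ifndef __GENKSYMS__ is
invisible to that computation and the kABI checksum stays unchanged. A
minimal sketch of the pattern (the structure here is a made-up
example, not this patch's code):

    struct example_shared {
            atomic_t        ref;            /* existing, kABI-visible member */
    #ifndef __GENKSYMS__
            int             new_member;     /* hidden from genksyms, CRC unchanged */
    #endif
    };

This only keeps the checksum stable; it remains the author's
responsibility that the new member does not break the layout seen by
out-of-tree users, which is why it matters that sched_domain_shared is
only ever used as a pointer.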
Signed-off-by: Xue Sinian
---
 include/linux/sched/topology.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 0798b5063596..a2261040dbd6 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -74,7 +74,9 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+#ifndef __GENKSYMS__
 	int		nr_idle_scan;
+#endif
 #ifdef CONFIG_ARM64
 } ____cacheline_aligned;
 #else
--
Gitee

From 164ec9d6049fc71ce05733fe92e97662e3bd8f7e Mon Sep 17 00:00:00 2001
From: Wang ShaoBo
Date: Wed, 10 Nov 2021 17:58:56 +0800
Subject: [PATCH 16/16] arch_topology: Fix missing clear cluster_cpumask in
 remove_cpu_topology()

mainline inclusion
commit 4cc4cc28ec4154c4f1395648ab67ac9fd3e71fdc
upstream.

----------------------------------------------------------------------

When testing CPU online and offline, a warning like the following
occurred:

[  146.746743] WARNING: CPU: 92 PID: 974 at kernel/sched/topology.c:2215 build_sched_domains+0x81c/0x11b0
[  146.749988] CPU: 92 PID: 974 Comm: kworker/92:2 Not tainted 5.15.0 #9
[  146.750402] Hardware name: Huawei TaiShan 2280 V2/BC82AMDDA, BIOS 1.79 08/21/2021
[  146.751213] Workqueue: events cpuset_hotplug_workfn
[  146.751629] pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  146.752048] pc : build_sched_domains+0x81c/0x11b0
[  146.752461] lr : build_sched_domains+0x414/0x11b0
[  146.752860] sp : ffff800040a83a80
[  146.753247] x29: ffff800040a83a80 x28: ffff20801f13a980 x27: ffff20800448ae00
[  146.753644] x26: ffff800012a858e8 x25: ffff800012ea48c0 x24: 0000000000000000
[  146.754039] x23: ffff800010ab7d60 x22: ffff800012f03758 x21: 000000000000005f
[  146.754427] x20: 000000000000005c x19: ffff004080012840 x18: ffffffffffffffff
[  146.754814] x17: 3661613030303230 x16: 30303078303a3239 x15: ffff800011f92b48
[  146.755197] x14: ffff20be3f95cef6 x13: 2e6e69616d6f642d x12: 6465686373204c4c
[  146.755578] x11: ffff20bf7fc83a00 x10: 0000000000000040 x9 : 0000000000000000
[  146.755957] x8 : 0000000000000002 x7 : ffffffffe0000000 x6 : 0000000000000002
[  146.756334] x5 : 0000000090000000 x4 : 00000000f0000000 x3 : 0000000000000001
[  146.756705] x2 : 0000000000000080 x1 : ffff800012f03860 x0 : 0000000000000001
[  146.757070] Call trace:
[  146.757421]  build_sched_domains+0x81c/0x11b0
[  146.757771]  partition_sched_domains_locked+0x57c/0x978
[  146.758118]  rebuild_sched_domains_locked+0x44c/0x7f0
[  146.758460]  rebuild_sched_domains+0x2c/0x48
[  146.758791]  cpuset_hotplug_workfn+0x3fc/0x888
[  146.759114]  process_one_work+0x1f4/0x480
[  146.759429]  worker_thread+0x48/0x460
[  146.759734]  kthread+0x158/0x168
[  146.760030]  ret_from_fork+0x10/0x20
[  146.760318] ---[ end trace 82c44aad6900e81a ]---

For architectures such as RISC-V and arm64, which use the common
clear_cpu_topology() code when shutting down CPUx, the cluster_sibling
mask in the cpu_topology of each sibling adjacent to CPUx is not
cleared when CONFIG_SCHED_CLUSTER is set. This makes the
topology_span_sane() check fail, and rebuilding the topology fails
when the CPU is brought online again, as the sketch below illustrates.
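The invariant that trips here is, roughly, that two CPUs' masks at the
same topology level must be either identical or non-overlapping. A
simplified, kernel-style paraphrase of the check (the real one lives
in topology_span_sane(); this sketch is illustrative only):

    /*
     * A stale mask such as CPU92's [92] versus CPU93's [92-95] is
     * unequal but intersecting, which makes the check fail and
     * triggers the WARN in build_sched_domains().
     */
    static bool span_sane(const struct cpumask *(*mask)(int cpu), int cpu)
    {
            int i;

            for_each_cpu(i, cpu_possible_mask) {
                    if (i == cpu)
                            continue;
                    if (!cpumask_equal(mask(cpu), mask(i)) &&
                        cpumask_intersects(mask(cpu), mask(i)))
                            return false;
            }
            return true;
    }

With the fix below, the offlined CPU is also removed from its
siblings' cluster_sibling masks, so the masks stay consistent: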
Each sibling's cluster_sibling in cpu_topology[] when CPU92 is offline
(CPUs 92, 93, 94 and 95 are in one cluster):

Before revision:
CPU              [92]    [93]     [94]     [95]
cluster_sibling  [92]    [92-95]  [92-95]  [92-95]

After revision:
CPU              [92]    [93]     [94]     [95]
cluster_sibling  [92]    [93-95]  [93-95]  [93-95]

Signed-off-by: Wang ShaoBo
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Dietmar Eggemann
Acked-by: Barry Song
Tested-by: Dietmar Eggemann
Link: https://lore.kernel.org/r/20211110095856.469360-1-bobo.shaobowang@huawei.com
Signed-off-by: Xue Sinian
---
 drivers/base/arch_topology.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index dbbe6fa6a42f..80343cf0f9ad 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -542,6 +542,8 @@ void remove_cpu_topology(unsigned int cpu)
 		cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
 	for_each_cpu(sibling, topology_sibling_cpumask(cpu))
 		cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
+	for_each_cpu(sibling, topology_cluster_cpumask(cpu))
+		cpumask_clear_cpu(cpu, topology_cluster_cpumask(sibling));
 	for_each_cpu(sibling, topology_llc_cpumask(cpu))
 		cpumask_clear_cpu(cpu, topology_llc_cpumask(sibling));
 
--
Gitee