This article explains the key points about load averages.
0x00 What Are Load Averages?
The uptime manual page says the following:
man uptime …

"System load averages is the average number of processes that are either in a runnable or uninterruptable state. A process in a runnable state is either using the CPU or waiting to use the CPU. A process in uninterruptable state is waiting for some I/O access, eg waiting for disk."

Note on the uninterruptible state: the process is in a critical kernel-mode code path that must not be interrupted, such as waiting for a disk I/O response. The uninterruptible state is in effect a protection mechanism the system applies to processes and hardware devices.

"The averages are taken over the three time intervals."

The three intervals are 1 min, 5 min, and 15 min. Comparing them shows the trend:
- Steady load: 1min ≈ 5min ≈ 15min
- Load was high earlier and is now falling: 1min << 15min
- Load is high right now and rising: 1min >> 15min

"Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system is loaded all the time while on a 4 CPU system it means it was idle 75% of the time."

So when the load average is 1, its meaning depends on the number of CPUs, which can be obtained with grep 'model name' /proc/cpuinfo | wc -l:
- 1-CPU system: fully loaded
- 4-CPU system: one quarter loaded (idle 75% of the time)
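The two readings above (load relative to CPU count, and the trend across the three intervals) can be sketched in a few lines of Python. This is only an illustration: the function names and the 20% tolerance used to decide between "steady", "rising", and "falling" are assumptions made for this example, not thresholds defined by uptime.

```python
import os


def per_cpu_load(load: float, cpus: int) -> float:
    """Divide a load average by the CPU count (1.0 then means 'fully loaded')."""
    return load / cpus


def trend(one_min: float, fifteen_min: float, tolerance: float = 0.2) -> str:
    """Compare the 1-min and 15-min averages.

    The tolerance is a hypothetical cut-off chosen for this sketch.
    """
    if one_min > fifteen_min * (1 + tolerance):
        return "rising"
    if one_min < fifteen_min * (1 - tolerance):
        return "falling"
    return "steady"


if __name__ == "__main__":
    try:
        one, five, fifteen = os.getloadavg()  # same three numbers uptime prints
    except OSError:
        print("load average is not available on this platform")
    else:
        cpus = os.cpu_count() or 1
        print(f"load averages: {one:.2f} {five:.2f} {fifteen:.2f}")
        print(f"CPUs: {cpus}, 1-min load per CPU: {per_cpu_load(one, cpus):.2f}")
        print(f"trend: {trend(one, fifteen)}")
```

With the sample values from the uptime output later in this article (0.85 for 1 min, 0.36 for 15 min), trend() reports a rising load.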
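Load average = runnable state + uninterruptible state
The expressions below illustrate the three factors that can raise the load average: using the CPU, waiting for the CPU, and waiting for I/O. A higher load average does not necessarily mean higher CPU usage; it can also come from processes waiting for I/O.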
System Load Averages ↑ = Using CPU ↑ + Waiting CPU + Waiting I/O
System Load Averages ↑ = Using CPU + Waiting CPU ↑ + Waiting I/O
System Load Averages ↑ = Using CPU + Waiting CPU + Waiting I/O ↑
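In ps output, a runnable process has state R (Running or Runnable) and an uninterruptible process has state D (Disk Sleep, i.e. uninterruptible sleep). In the output below, R+ is a running process and D+ is a process in disk sleep; the trailing + means the process is in the foreground process group.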
    [root@localhost ~]# ps -aux
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root     22417  0.0  0.3  39008  3628 pts/0    R+   21:40   0:00 ps -aux
    root     22418  0.0  0.1  22016  1624 pts/0    D+   21:40   0:00 -bash
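To see how many processes are currently in the R and D states without scanning ps output by eye, the state field can be read from /proc/<pid>/stat. The following is a minimal sketch, assuming a Linux /proc filesystem; the function names are made up for this example. It counts processes, not individual threads, so the result is only an approximation of what the kernel counts for the load average.

```python
import os
from collections import Counter


def state_from_stat(stat_line: str) -> str:
    """Extract the state letter (R, S, D, ...) from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split after the LAST ')' instead of on whitespace.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root: str = "/proc") -> Counter:
    """Count processes by state letter under proc_root."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():  # only numeric directories are PIDs
            continue
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as f:
                counts[state_from_stat(f.read())] += 1
        except (OSError, IndexError):
            continue  # the process exited while we were reading it
    return counts


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        states = count_states()
        print(f"runnable (R): {states['R']}, uninterruptible (D): {states['D']}")
```

The sum of the R and D counts is the instantaneous value that the load average smooths over time.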
0x01 Case Simulations
Test Environment
    [root@localhost ~]# screenfetch
    root@localhost
    OS: CentOS
    Kernel: x86_64 Linux 3.10.0-957.21.3.el7.x86_64
    Uptime: 48m
    Packages: 494
    Shell: bash 4.2.46
    CPU: Intel Xeon E5-26xx v4 @ 2x 2.394GHz
    GPU: cirrusdrmfb
    RAM: 147MiB / 7821MiB
    [root@localhost ~]# wget http://download-ib01.fedoraproject.org/pub/fedora/linux/releases/30/Everything/x86_64/os/Packages/s/sysstat-11.7.3-3.fc30.x86_64.rpm
    [root@localhost ~]# yum install -y stress sysstat-11.7.3-3.fc30.x86_64.rpm
CPU Usage (CPU-Intensive Process)
Window 1: Apply a load that keeps one CPU busy for 10 minutes.
    [root@localhost ~]# stress --cpu 1 --timeout 600
    stress: info: [19751] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
Window 2: Print the output of the uptime command every two seconds.
    [root@localhost ~]# watch -d uptime
    Every 2.0s: uptime                                  Mon Sep  2 23:39:47 2019

     23:39:47 up  1:09,  4 users,  load average: 0.85, 0.68, 0.36
Window 3: Run mpstat with a 5 s reporting interval; the %usr column of a single CPU (the rows marked >>) rises sharply.
    [root@localhost ~]# mpstat -P ALL 5
      11:41:18 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
      11:41:23 PM  all   50.15    0.00    0.20    0.10    0.00    0.10    0.00    0.00    0.00   49.45
    >>11:41:23 PM    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
      11:41:23 PM    1    0.20    0.00    0.40    0.00    0.00    0.00    0.00    0.00    0.00   99.40

      11:41:23 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
      11:41:28 PM  all   50.25    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   49.45
    >>11:41:28 PM    0   94.81    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00    4.99
      11:41:28 PM    1    5.61    0.00    0.40    0.00    0.00    0.00    0.00    0.00    0.00   93.99

      11:41:28 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
      11:41:33 PM  all   50.25    0.00    0.20    0.10    0.00    0.00    0.00    0.00    0.00   49.45
    >>11:41:33 PM    0   98.60    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    1.40
      11:41:33 PM    1    1.80    0.00    0.40    0.20    0.00    0.00    0.00    0.00    0.00   97.60
Window 4: pidstat confirms that the load average rose because the stress process is occupying a CPU.
    [root@localhost ~]# pidstat -u 5
      10:40:09 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
      10:40:14 PM     0      2853    0.20    0.20    0.00    0.40    0.40     0  YDService
      10:40:14 PM     0      3303    0.00    0.20    0.00    0.00    0.20     0  sshd
      10:40:14 PM     0      3750    0.40    0.40    0.00    0.00    0.80     0  barad_agent
      10:40:14 PM     0      4179    0.20    0.00    0.00    0.00    0.20     0  watch
    >>10:40:14 PM     0     18043  100.00    0.00    0.00    0.00  100.00     1  stress
I/O Wait (I/O-Intensive Process)
Window 1: Apply an I/O write load for 10 minutes.
    [root@localhost ~]# stress -i 1 --timeout 600
    stress: info: [15703] dispatching hogs: 0 cpu, 1 io, 0 vm, 0 hdd
Window 2: The 1-minute load average quickly climbs past 1.0 (1.05 in the snapshot below).
    [root@localhost ~]# watch -d uptime
    Every 2.0s: uptime                                  Tue Sep  3 21:34:11 2019

     21:34:11 up 6 min,  4 users,  load average: 1.05, 0.62, 0.27
Window 3: %iowait rises together with %sys, most noticeably on CPU 1 (the rows marked >>), while %usr stays near zero.
    [root@localhost ~]# mpstat -P ALL 5 3
      09:36:05 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
      09:36:10 PM  all    0.50    0.00   29.03   21.27    0.00    0.00    0.00    0.00    0.00   49.19
      09:36:10 PM    0    0.81    0.00   25.15   17.44    0.00    0.00    0.00    0.00    0.00   56.59
    >>09:36:10 PM    1    0.20    0.00   32.73   25.10    0.00    0.00    0.00    0.00    0.00   41.97

      09:36:10 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
      09:36:15 PM  all    0.30    0.00   29.43   21.42    0.00    0.00    0.00    0.00    0.00   48.85
      09:36:15 PM    0    0.40    0.00   20.61   15.15    0.00    0.00    0.00    0.00    0.00   63.84
    >>09:36:15 PM    1    0.00    0.00   38.29   27.58    0.00    0.00    0.00    0.00    0.00   34.13

      09:36:15 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
      09:36:20 PM  all    0.30    0.00   28.61   21.54    0.00    0.00    0.00    0.00    0.00   49.54
      09:36:20 PM    0    0.41    0.00   13.18    9.13    0.00    0.00    0.00    0.00    0.00   77.28
    >>09:36:20 PM    1    0.20    0.00   43.75   33.87    0.00    0.00    0.00    0.00    0.00   22.18
Window 4: The load average rises because of waiting for I/O.
    [root@localhost ~]# pidstat -u 5 1
      10:37:29 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
      10:37:34 PM     0      1232    0.00    0.40    0.00    0.00    0.40     0  kworker/0:1H
      10:37:34 PM     0      2853    0.40    0.40    0.00    0.60    0.80     1  YDService
      10:37:34 PM     0      3750    0.20    0.00    0.00    0.00    0.20     0  barad_agent
      10:37:34 PM     0     16458    0.00    0.20    0.00    0.00    0.20     0  pidstat
    >>10:37:34 PM     0     17460    0.00   85.20    0.00    0.40   85.20     1  stress
Waiting for CPU (Many Processes)
Window 1: Start 8 CPU-bound workers for 10 minutes on the 2-CPU test machine.

    [root@localhost ~]# stress --cpu 8 --timeout 600
    stress: info: [9168] dispatching hogs: 8 cpu, 0 io, 0 vm, 0 hdd
Window 2: The 1-minute load average approaches 10.00.

    [root@localhost ~]# watch -d uptime
    Every 2.0s: uptime                                  Tue Sep  3 22:08:46 2019

     22:08:46 up 41 min,  4 users,  load average: 9.34, 8.13, 4.66
Window 3: Run mpstat with a 5 s reporting interval; the %usr column rises sharply on all CPUs.
    [root@localhost ~]# mpstat -P ALL 5 3
    09:59:51 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
    09:59:57 PM  all   99.70    0.00    0.20    0.10    0.00    0.10    0.00    0.00    0.00    0.00
    09:59:57 PM    0   99.80    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    09:59:57 PM    1   99.80    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00

    09:59:57 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
    10:00:02 PM  all   99.60    0.00    0.40    0.00    0.00    0.10    0.00    0.00    0.00    0.00
    10:00:02 PM    0   99.80    0.00    0.20    0.00    0.10    0.00    0.00    0.00    0.00    0.00
    10:00:02 PM    1   99.40    0.00    0.60    0.00    0.00    0.00    0.00    0.00    0.00    0.00

    10:00:02 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
    10:00:07 PM  all   99.70    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    10:00:07 PM    0   99.80    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    10:00:07 PM    1   99.60    0.00    0.40    0.00    0.00    0.00    0.00    0.00    0.00    0.00
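Window 4: Tracking processes with pidstat shows many stress processes competing for the CPU: each of the 8 workers gets only about 25% CPU and spends about 75% of its time in %wait.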
    [root@localhost ~]# pidstat -u 5 1
      10:33:44 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
      10:33:49 PM     0      2774    0.00    0.20    0.00    0.00    0.20     1  auditd
      10:33:49 PM     0      2853    0.20    0.20    0.00   74.45    0.40     1  YDService
      10:33:49 PM     0     16458    0.00    0.20    0.00    0.20    0.20     0  pidstat
    >>10:33:49 PM     0     16488   24.55    0.00    0.00   75.65   24.55     0  stress
    >>10:33:49 PM     0     16489   24.55    0.00    0.00   75.05   24.55     0  stress
    >>10:33:49 PM     0     16490   24.75    0.00    0.00   75.45   24.75     0  stress
    >>10:33:49 PM     0     16491   24.75    0.00    0.00   74.65   24.75     1  stress
    >>10:33:49 PM     0     16492   24.95    0.00    0.00   75.45   24.95     1  stress
    >>10:33:49 PM     0     16493   24.95    0.00    0.00   75.05   24.95     1  stress
    >>10:33:49 PM     0     16494   24.35    0.00    0.00   75.05   24.35     0  stress
    >>10:33:49 PM     0     16495   24.75    0.00    0.00   75.65   24.75     1  stress
0x02 Conclusion
Load Averages = runnable (using or waiting for the CPU) + uninterruptible
System Load Averages ↑ = Using CPU ↑ + Waiting CPU + Waiting I/O
System Load Averages ↑ = Using CPU + Waiting CPU ↑ + Waiting I/O
System Load Averages ↑ = Using CPU + Waiting CPU + Waiting I/O ↑
The load average can be computed with the following formula.
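load(t) = n + ((load(t-1) - n) / e^(interval / (min * 60)))

- load(t): the load average at time t
- n: the number of threads in the runnable and uninterruptible states
- interval: the calculation interval; 5 seconds on RHEL
- min: the length of the averaging window in minutes (1, 5, or 15)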
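To get a feel for how this formula behaves, it can be iterated in Python. The sketch below uses floating-point arithmetic and assumes that n stays constant for the whole simulation; the function names are chosen for this example only, and this is not the kernel's own implementation.

```python
import math


def next_load(prev: float, n: float, interval: float = 5.0, minutes: float = 1.0) -> float:
    """One update step: load(t) = n + (load(t-1) - n) / e^(interval / (min * 60))."""
    return n + (prev - n) / math.exp(interval / (minutes * 60.0))


def simulate(n: float, seconds: float, minutes: float,
             interval: float = 5.0, start: float = 0.0) -> float:
    """Apply the update repeatedly while n tasks stay runnable for `seconds`."""
    load = start
    for _ in range(int(seconds // interval)):
        load = next_load(load, n, interval, minutes)
    return load


if __name__ == "__main__":
    # One busy task on an idle system: after 60 s the 1-min average
    # has only reached 1 - 1/e of its final value (about 0.63).
    for window in (1, 5, 15):
        print(f"{window:>2}-min average after 60 s: {simulate(1, 60, window):.2f}")
```

This is why the three numbers react at different speeds: after the same 60 seconds of load, the 1-minute average has moved much further toward n than the 5-minute and 15-minute averages.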