0%

Linux性能調優實戰筆記IV(CPU占用率)

本文旨在剖析系统CPU性能的指标:CPU使用率。

0x00 前言

回顾前面,学习了以下描述系统情况的方式:

而描述CPU,可以直接使用CPU使用率。

0x01 什么是CPU使用率

  1. Linux身为多任务操作系统,靠的是短时间片(CPU时间的最小单位)+调度器(轮流分配)。

    • 短时间片:事先约定来触发中断的节拍率(HZ),并使用Jiffies记录开机依赖节拍数。

      1
      2
      # cat /proc/timer_list|grep -m1 -E '^jiffies'|cut -d" " -f2
      7604634289
      • 节拍率(HZ):不同发行版会有不同的内核预设值,意味每秒钟触发的时间中断次数。

        1
        2
        3
        4
        [root@rhel7 ~]# grep 'CONFIG_HZ=' /boot/config-$(uname -r)
        CONFIG_HZ=1000
        ubuntu@localhost:~$ grep 'CONFIG_HZ=' /boot/config-$(uname -r)
        CONFIG_HZ=250
      • 而用户空间往往使用的是USER_HZ,任何运行时kernel的编译选项都可以使用getconf进行显示,这个值被固定为100,也就是1秒中断100次。

        $ man 7 time

        The software clock, HZ, and jiffies
        The accuracy of various system calls that set timeouts, (e.g., select(2), sigtimedwait(2)) and measure CPU time (e.g., getrusage(2)) is limited by the resolution of the software clock, a clock
        maintained by the kernel which measures time in jiffies. The size of a jiffy is determined by the value of the kernel constant HZ.
        The value of HZ varies across kernel versions and hardware platforms. On i386 the situation is as follows: on kernels up to and including 2.4.x, HZ was 100, giving a jiffy value of 0.01 seconds;
        starting with 2.6.0, HZ was raised to 1000, giving a jiffy of 0.001 seconds. Since kernel 2.6.13, the HZ value is a kernel configuration parameter and can be 100, 250 (the default) or 1000, yield‐
        ing a jiffies value of, respectively, 0.01, 0.004, or 0.001 seconds. Since kernel 2.6.20, a further frequency is available: 300, a number that divides evenly for the common video frame rates (PAL,
        25 HZ; NTSC, 30 HZ).
        The times(2) system call is a special case. It reports times with a granularity defined by the kernel constant USER_HZ. User-space applications can determine the value of this constant using
        sysconf(_SC_CLK_TCK).

        1
        2
        3
        4
        [root@rhel7 ~]# getconf CLK_TCK
        100
        ubuntu@localhost:~$ getconf CLK_TCK
        100
  2. Linux通过伪文件系统 /proc来向用户提供OS内部状态信息,而/proc/stat 提供了系统的CPU和系统任务统计信息

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    # 重点查看不同场景下CPU的累计节拍数。
    [root@rhel7 ~]# cat /proc/stat |grep ^cpu
    cpu 34606696 176287 4467987 6580945956 1271138 0 43622 0 29486561 0 <<< 总体累加值
    cpu0 595076 20722 276949 329973186 13384 0 13257 0 362976 0
    cpu1 1127636 67877 201993 328548578 930793 0 18742 0 756795 0
    cpu2 1263945 20524 166542 329398032 229204 0 34 0 1069117 0
    cpu3 2598239 8315 279844 328137468 39873 0 33 0 2418959 0
    cpu4 3152236 4124 276053 327627575 9873 0 16 0 2983401 0
    cpu5 3758087 1714 254151 327048933 4333 0 11 0 3607109 0
    cpu6 3694695 2045 240707 327126810 3759 0 17 0 3554974 0
    cpu7 3828972 4117 240018 326989822 3392 0 21 0 3688237 0
    cpu8 3905215 3004 243946 326913132 2545 0 7 0 3765278 0
    cpu9 4009330 1172 246768 326807704 2270 0 19 0 3870190 0
    cpu10 554056 9355 214524 330129039 442 0 11371 0 159670 0
    cpu11 533244 15006 200194 330390188 4311 0 1 0 187749 0
    cpu12 540121 6568 190984 330403882 545 0 2 0 203736 0
    cpu13 643304 2944 196357 330296946 338 0 2 0 345497 0
    cpu14 895960 1370 199230 330038803 260 0 1 0 620911 0
    cpu15 671360 1116 202075 330264781 374 0 2 0 370423 0
    cpu16 778120 3359 209582 330147575 276 0 3 0 463865 0
    cpu17 582231 757 203751 330350880 299 0 0 0 295223 0
    cpu18 836571 669 215387 330086448 291 0 3 0 495057 0
    cpu19 638288 1522 208925 330266164 24565 0 70 0 267386 0

    其含义:

    $ man proc

    /proc/stat
    kernel/system statistics. Varies with architecture. Common entries include:
    cpu 3357 0 4313 1362393
    The amount of time, measured in units of USER_HZ (1/100ths of a second on most architectures, use sysconf(_SC_CLK_TCK) to obtain the right value), that the system spent in various
    states:
    user (1) Time spent in user mode.
    (缩写为us)用户态时间,不含nice,但包含guest。
    nice (2) Time spent in user mode with low priority (nice).
    (缩写为ni)低优先级用户态时间,取值范围-20 -19,数值越大优先级越低。
    system (3) Time spent in system mode.
    (缩写为sys)内核态CPU时间。
    idle (4) Time spent in the idle task. This value should be USER_HZ times the second entry in the /proc/uptime pseudo-file.
    (缩写为id)空闲时间,不含IO等待(iowait),/proc/uptime第二个值的100倍。
    iowait (since Linux 2.5.41)
    (5) Time waiting for I/O to complete.
    (缩写为wa),代表等待I/O的CPU时间。
    irq (since Linux 2.6.0-test4)
    (6) Time servicing interrupts.
    (缩写为hi),代表硬中断的CPU时间。
    softirq (since Linux 2.6.0-test4)
    (7) Time servicing softirqs.
    (缩写为si),代表软中断的CPU时间。
    steal (since Linux 2.6.11)
    (8) Stolen time, which is the time spent in other operating systems when running in a virtualized environment
    (缩写为st),代表vm系统被其他vm抢占的CPU时间。
    guest (since Linux 2.6.24)
    (9) Time spent running a virtual CPU for guest operating systems under the control of the Linux kernel.
    (缩写为guest),代表作为hypervisor运行虚拟机的CPU时间。
    guest_nice (since Linux 2.6.33)
    (10) Time spent running a niced guest (virtual CPU for guest operating systems under the control of the Linux kernel).
    (缩写为gnice),代表作为低优先级运行虚拟机的时间。

  3. 因此我们可以有了CPU使用率的计算公式:

    https://img.madebug.net/m4d3bug/images-of-website/master/blog/CPUCalculationFormula-1.jpg

    但是Linux开机很久以来的时间也纳入计算,明显没有意义,因此我们常用的是优化版:

    https://img.madebug.net/m4d3bug/images-of-website/master/blog/CPUCalculationFormula-2.jpg

    系统的角度考量的是/proc/stat , 而进程的角度考量的是/proc/[pid]/stat 。

  4. 而常见的分析工具top(3秒时间间隔)和ps(进程全生命周期)都可以帮我们计算好,我们只需要注意它们的时间间隔。

0x02 怎么查看CPU使用率

  • top:

    • 系统的资源(总体CPU和内存)使用情况。

    • 进程的资源(不区分用户态/内核态)使用情况。

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      top - 09:03:15 up 2 days,  8:11,  3 users,  load average: 0.00, 0.03, 0.05
      Tasks: 466 total, 1 running, 465 sleeping, 0 stopped, 0 zombie
      %Cpu(s): 0.1 us, 0.1 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
      KiB Mem : 13172561+total, 12070921+free, 5016792 used, 5999608 buff/cache
      KiB Swap: 4194300 total, 4194300 free, 0 used. 12589456+avail Mem
      (用户态%CPU+内核态%CPU)
      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2147 root 20 0 1462688 20152 5340 S 0.3 0.0 0:37.21 docker-containe
      3953 root 20 0 23164 2688 1312 S 0.3 0.0 5:00.06 tmux
      1 root 20 0 191856 4860 2644 S 0.0 0.0 0:02.15 systemd
      2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
      4 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
      6 root 20 0 0 0 0 S 0.0 0.0 0:01.66 ksoftirqd/0
      7 root rt 0 0 0 0 S 0.0 0.0 0:00.19 migration/0
      8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
      9 root 20 0 0 0 0 S 0.0 0.0 0:07.40 rcu_sched
      10 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 lru-add-drain
      11 root rt 0 0 0 0 S 0.0 0.0 0:00.12 watchdog/0
      12 root rt 0 0 0 0 S 0.0 0.0 0:00.10 watchdog/1
      13 root rt 0 0 0 0 S 0.0 0.0 0:00.18 migration/1
      14 root 20 0 0 0 0 S 0.0 0.0 0:00.27 ksoftirqd/1
      16 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
      17 root rt 0 0 0 0 S 0.0 0.0 0:00.10 watchdog/2
      18 root rt 0 0 0 0 S 0.0 0.0 0:00.53 migration/2
  • ps:进程的资源(区分用户态/内核态)使用情况。

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
       [root@rhel7 ~]# pidstat 1  5                                                                                                                                                                           
    Linux 3.10.0-1160.11.1.el7.x86_64 06/20/2021 _x86_64_ (20 CPU)

    用户态 内核态 虚拟机 等待 总使用率
    CPU使用率 CPU使用率
    12:10:18 AM UID PID %usr %system %guest %wait %CPU CPU Command
    12:10:19 AM 1000 206 0.00 6.00 0.00 0.00 6.00 0 tmux: server
    12:10:19 AM 1000 17722 0.00 1.00 0.00 0.00 1.00 0 pidstat

    12:10:19 AM UID PID %usr %system %guest %wait %CPU CPU Command
    12:10:20 AM 1000 206 0.00 8.00 0.00 0.00 8.00 0 tmux: server

    12:10:20 AM UID PID %usr %system %guest %wait %CPU CPU Command
    12:10:21 AM 1000 206 1.00 6.00 0.00 0.00 7.00 0 tmux: server

    12:10:21 AM UID PID %usr %system %guest %wait %CPU CPU Command
    12:10:22 AM 1000 206 0.00 10.00 0.00 0.00 10.00 0 tmux: server

    12:10:22 AM UID PID %usr %system %guest %wait %CPU CPU Command
    12:10:23 AM 1000 206 0.00 7.00 0.00 0.00 7.00 0 tmux: server

    Average: UID PID %usr %system %guest %wait %CPU CPU Command <<< 5组平均值
    Average: 1000 206 0.20 7.40 0.00 0.00 7.60 - tmux: server
    Average: 1000 17722 0.00 0.20 0.00 0.00 0.20 - pidstat

0x03 CPU使用率过高怎么办?

答:直接查找占用的函数进行优化。

  • GDB(中断程序进行调试)

  • perf(不中断程序进行调试)

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    $ perf top
    Samples: 2K of event 'cycles', 4000 Hz, Event count (approx.): 1591275795 lost: 0/0 drop: 0/0
    共2k个CPU时钟事件 总事件数1591275795
    采样率 函数共享对象/类型 函数名,符号名
    Overhead Shared Object Symbol
    5.45% [kernel] [k] module_get_kallsym
    3.84% [kernel] [k] page_fault
    3.53% libc-2.17.so [.] __gconv_transform_utf8_internal
    3.19% libc-2.17.so [.] __mbrtowc
    2.55% libc-2.17.so [.] _int_malloc
    2.06% [kernel] [k] kallsyms_expand_symbol.constprop.1
    2.01% libc-2.17.so [.] __GI_____strtoull_l_internal
    1.74% libc-2.17.so [.] __strcmp_sse42
    1.64% [kernel] [k] format_decode
    1.62% libc-2.17.so [.] _int_free
    1.31% [kernel] [k] __memcpy
    1.23% perf [.] __dso__load_kallsyms
    1.21% libc-2.17.so [.] __strcpy_sse2_unaligned
    1.20% perf [.] rb_next
    1.09% [kernel] [k] vsnprintf
    1.05% libc-2.17.so [.] __strlen_sse2_pminub
    0.95% libc-2.17.so [.] malloc
    0.90% [kernel] [k] acpi_processor_ffh_cstate_enter
    0.90% [kernel] [k] delay_tsc
    0.88% perf [.] 0x00000000000d67a4
    0.87% libc-2.17.so [.] __memcpy_sse2
    0.86% [kernel] [k] handle_mm_fault
    0.83% libc-2.17.so [.] _IO_feof
    0.82% [kernel] [k] number.isra.2
    0.72% [kernel] [k] strnlen
    0.70% perf [.] rb_insert_color
    0.63% [kernel] [k] copy_page_rep
    0.61% libc-2.17.so [.] __ctype_get_mb_cur_max
    0.59% libc-2.17.so [.] _dl_addr
    0.59% [kernel] [k] string.isra.7
    0.59% [kernel] [k] strnlen_user

第一组示范:

  • 环境准备:
1
2
$ docker run --name nginx -p 10000:80 -itd docker.io/feisky/nginx
$ docker run --name phpfpm -itd --network container:nginx docker.io/feisky/php-fpm
  • 压测:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
这里使用其他客户端配合
$ curl http://192.168.99.79:10000/
It works!
$ ab -c 10 -n 10000 http://192.168.99.79:10000/
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.99.79 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software: nginx/1.15.4
Server Hostname: 192.168.99.79
Server Port: 10000

Document Path: /
Document Length: 9 bytes

Concurrency Level: 10
Time taken for tests: 144.878 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 1720000 bytes
HTML transferred: 90000 bytes
Requests per second: 69.02 [#/sec] (mean)
Time per request: 144.878 [ms] (mean)
Time per request: 14.488 [ms] (mean, across all concurrent requests)
Transfer rate: 11.59 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 1.1 1 15
Processing: 73 144 2.7 144 172
Waiting: 73 143 2.7 144 171
Total: 74 145 2.4 145 173

Percentage of the requests served within a certain time (ms)
50% 145
66% 145
75% 145
80% 145
90% 146
95% 147
98% 149
99% 151
100% 173 (longest request)
1
2
3
4
5
6
7
8
9
10
11
12
13
[root@rhel7 ~]# top
top - 09:47:03 up 2 days, 8:55, 2 users, load average: 1.85, 0.53, 0.22
Tasks: 472 total, 6 running, 466 sleeping, 0 stopped, 0 zombie
%Cpu(s): 25.1 us, 0.0 sy, 0.0 ni, 74.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13172561+total, 12010049+free, 5040968 used, 6584148 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 12585044+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6763 bin 20 0 336684 9364 1688 R 100.0 0.0 0:31.10 php-fpm
6764 bin 20 0 336684 9360 1684 R 100.0 0.0 0:31.09 php-fpm
6760 bin 20 0 336684 9364 1688 R 99.7 0.0 0:31.17 php-fpm
6761 bin 20 0 336684 9364 1688 R 99.7 0.0 0:31.09 php-fpm
6762 bin 20 0 336684 9360 1684 R 99.7 0.0 0:31.09 php-fpm

第一组分析

  1. 初步分析:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
可以见到由于依赖路径不一,只能看到十六进制内容。

Samples: 111K of event 'cycles', 4000 Hz, Event count (approx.): 55131826469 lost: 0/0 drop: 0/0
Children Self Shared Object Symbol
+ 14.35% 0.00% [unknown] [.] 0x6cb6258d4c544155
+ 14.35% 0.00% libc-2.24.so [.] 0x00007f10b557e2e1
+ 14.34% 0.00% php-fpm [.] 0x000055c8a0e6d642
+ 14.34% 0.00% php-fpm [.] 0x000055c8a0c1e6fc
+ 14.34% 0.00% php-fpm [.] 0x000055c8a0ccdf94
+ 14.34% 0.00% php-fpm [.] 0x000055c8a0d65323
+ 13.50% 0.00% php-fpm [.] 0x000055c8a0d6496e
+ 4.71% 4.71% libm-2.24.so [.] 0x000000000002127e
+ 3.42% 3.42% php-fpm [.] 0x000000000094ede0
+ 2.61% 0.00% php-fpm [.] 0x000055c8a0d66a7c
+ 2.48% 2.48% php-fpm [.] 0x00000000008c4ae5
+ 2.24% 0.00% php-fpm [.] 0x000055c8a0e2fea3
+ 2.15% 2.15% php-fpm [.] 0x00000000008c4a6c
+ 2.03% 2.03% php-fpm [.] 0x00000000009513cc
+ 1.88% 1.88% php-fpm [.] 0x00000000008c4a64
+ 1.52% 1.52% php-fpm [.] 0x000000000098de23
+ 1.38% 1.38% php-fpm [.] 0x00000000008cda08
+ 1.06% 1.06% php-fpm [.] 0x00000000008c297a
+ 1.06% 1.06% php-fpm [.] 0x00000000008c296e
+ 1.01% 1.01% php-fpm [.] 0x00000000008cd720
  1. 保存一个进程的记录,拷贝进入容器内分析:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
[root@rhel7 ~]# perf record -g -p 6763
[ perf record: Woken up 240 times to write data ]
[ perf record: Captured and wrote 59.857 MB perf.data (512079 samples) ]
[root@rhel7 ~]# docker cp perf.data phpfpm:/tmp
[root@rhel7 ~]# docker exec -it phpfpm bash
root@d5bdfeec83b1:/app# cd /tmp
root@d5bdfeec83b1:/tmp# apt update && apt install linux-perf linux-tools procps -y
Hit:1 http://security.debian.org stretch/updates InRelease
Ign:2 http://httpredir.debian.org/debian stretch InRelease
Hit:3 http://httpredir.debian.org/debian stretch Release
Reading package lists... Done
Building dependency tree
Reading state information... Done
51 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists... Done
Building dependency tree
Reading state information... Done
procps is already the newest version (2:3.3.12-3+deb9u1).
linux-perf is already the newest version (4.9+80+deb9u13).
linux-tools is already the newest version (4.9+80+deb9u13).
0 upgraded, 0 newly installed, 0 to remove and 51 not upgraded.
root@d5bdfeec83b1:/tmp# perf_4.9 report
  1. 光标选中后回车进入,逐一检视← →

https://img.madebug.net/m4d3bug/images-of-website/master/blog/calltrace.jpg

  1. 基本得出结果了,源码核实一下
1
2
3
4
5
6
7
8
9
10
11
12
[root@rhel7 ~]# docker cp phpfpm:/app .
[root@rhel7 ~]# grep sqrt -r app/
app/index.php: $x += sqrt($x);
[root@rhel7 ~]# cat app/index.php
<?php
// test only.
$x = 0.0001;
for ($i = 0; $i <= 1000000; $i++) {
$x += sqrt($x);
}

echo "It works!"
  1. 修复
1
2
3
4
[root@rhel7 ~/app]# cat index.php
<?php
echo "It works!"
?>
  1. 测试一下修复结果
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
[root@rhel7 ~]# docker rm -f nginx phpfpm
nginx
phpfpm
[root@rhel7 ~]# docker run --name nginx -p 10000:80 -itd docker.io/feisky/nginx:cpu-fix
Unable to find image 'feisky/nginx:cpu-fix' locally
cpu-fix: Pulling from feisky/nginx
a5a6f2f73cd8: Pull complete
67da5fbcb7a0: Pull complete
e82455fa5628: Pull complete
97fea4312924: Pull complete
3faf3e92e113: Pull complete
Digest: sha256:a02d81e93c29d0196293556723cfbcb7fa94b9761b01f4fc9f1b2a6291295dd9
Status: Downloaded newer image for feisky/nginx:cpu-fix
9027bc69836a6764847b305e2e1a826711c5736fac720cd8b8b66a9e36684a8a
[root@rhel7 ~]# docker run --name phpfpm -itd --network container:nginx docker.io/feisky/php-fpm:cpu-fix
Unable to find image 'feisky/php-fpm:cpu-fix' locally
cpu-fix: Pulling from feisky/php-fpm
ad7b1e3e2409: Pull complete
99184861b139: Pull complete
e0a0d77439f2: Pull complete
555d89a6180d: Pull complete
189480ae9323: Pull complete
a0fd9c448873: Pull complete
542f27453920: Pull complete
287f65539de5: Pull complete
22f455c492a0: Pull complete
14144598512d: Pull complete
Digest: sha256:92147d37d143e2e122e8dceaa7466cd61d76b61601a82025ce6032d26d3c9a27
Status: Downloaded newer image for feisky/php-fpm:cpu-fix
228cda70451e388447e9f5674d75ea0c80e820eec7f00152dc1a9344b67e39de
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
$ ab -c 10 -n 10000 http://192.168.99.79:10000/
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.99.79 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software: nginx/1.15.6
Server Hostname: 192.168.99.79
Server Port: 10000

Document Path: /
Document Length: 9 bytes

Concurrency Level: 10
Time taken for tests: 2.755 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 1720000 bytes
HTML transferred: 90000 bytes
Requests per second: 3629.69 [#/sec] (mean) <-------
Time per request: 2.755 [ms] (mean)
Time per request: 0.276 [ms] (mean, across all concurrent requests)
Transfer rate: 609.68 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 1 1 1.0 1 17
Processing: 0 1 1.2 1 16
Waiting: 0 1 1.1 1 16
Total: 1 3 1.8 2 29

Percentage of the requests served within a certain time (ms)
50% 2
66% 3
75% 3
80% 3
90% 3
95% 5
98% 9
99% 13
100% 29 (longest request)

第二组示范

  • 环境准备:
1
2
3
$ docker rm -f nginx phpfpm
$ docker run --name nginx -p 10000:80 -itd feisky/nginx:sp
$ docker run --name phpfpm -itd --network container:nginx feisky/php-fpm:sp
  • 压测:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
这里使用其他客户端配合
$ curl http://192.168.99.40:10000/
It works!
$ ab -c 100 -n 1000 http://192.168.99.40:10000/
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.99.40 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software: nginx/1.15.4
Server Hostname: 192.168.99.40
Server Port: 10000

Document Path: /
Document Length: 9 bytes

Concurrency Level: 100
Time taken for tests: 9.049 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 172000 bytes
HTML transferred: 9000 bytes
Requests per second: 110.51 [#/sec] (mean) <-----
Time per request: 904.876 [ms] (mean)
Time per request: 9.049 [ms] (mean, across all concurrent requests)
Transfer rate: 18.56 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.8 0 8
Processing: 30 858 132.0 889 1020
Waiting: 27 858 132.0 889 1020
Total: 32 859 131.6 889 1023
WARNING: The median and mean for the initial connection time are not within a normal deviation
These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
50% 889
66% 900
75% 905
80% 910
90% 923
95% 934
98% 954
99% 964
100% 1023 (longest request)

换一个时间比较长的进行压测方便观察。

1
2
3
4
5
6
$ ab -c 5 -t 600 http://192.168.99.40:10000/
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.99.40 (be patient)

可以观察到处于R (Running) 的php-fpm 与 总running 数目有所出入, 而其他处于R的进程是stress

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
[root@rhel7 ~]# top
top - 23:48:57 up 1:04, 3 users, load average: 5.02, 2.00, 1.09
Tasks: 313 total, 9 running, 303 sleeping, 0 stopped, 1 zombie
%Cpu(s): 81.1 us, 13.8 sy, 0.0 ni, 2.2 id, 0.0 wa, 0.0 hi, 2.9 si, 0.0 st
KiB Mem : 4026220 total, 738312 free, 1249376 used, 2038532 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 2461720 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3974 101 20 0 33092 2208 788 S 4.7 0.1 0:51.84 nginx
101516 bin 20 0 336684 9608 1932 R 2.7 0.2 0:03.36 php-fpm
101521 bin 20 0 336684 9624 1948 S 2.7 0.2 0:03.34 php-fpm
101529 bin 20 0 336684 9604 1928 R 2.7 0.2 0:03.22 php-fpm
101543 bin 20 0 336684 9600 1924 S 2.3 0.2 0:03.28 php-fpm
101535 bin 20 0 336684 9600 1924 S 2.0 0.2 0:03.31 php-fpm
3628 root 20 0 677812 97988 31360 S 1.3 2.4 0:27.50 dockerd
3921 root 20 0 113364 12076 3676 S 1.0 0.3 0:13.47 containerd-shim
6 root 20 0 0 0 0 S 0.3 0.0 0:01.96 ksoftirqd/0
7 root rt 0 0 0 0 S 0.3 0.0 0:00.29 migration/0
9 root 20 0 0 0 0 R 0.3 0.0 0:06.12 rcu_sched
3617 root 20 0 446044 44020 16136 S 0.3 1.1 0:14.64 containerd
4531 root 20 0 164236 2548 1612 R 0.3 0.1 0:00.08 top
11479 bin 20 0 8176 864 104 R 0.3 0.0 0:00.01 stress
11481 bin 20 0 0 0 0 R 0.3 0.0 0:00.01 stress

pidstat不能抓到转瞬即逝的这些进程,我们转用pstree |grep stress

1
2
3
$ watch -d -n 1 "pstree |grep stress" 
|-docker-containe-+-php-fpm-+-php-fpm---sh---stress
| |-3*[php-fpm---sh---stress---stress]

第二组分析

  1. 查找源码

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    $ docker cp phpfpm:/app .
    $ grep stress -r app
    app/index.php:// fake I/O with stress (via write()/unlink()).
    app/index.php:$result = exec("/usr/local/bin/stress -t 1 -d 1 2>&1", $output, $status);
    $ cat app/index.php
    <?php
    // fake I/O with stress (via write()/unlink()).
    $result = exec("/usr/local/bin/stress -t 1 -d 1 2>&1", $output, $status);
    if (isset($_GET["verbose"]) && $_GET["verbose"]==1 && $status != 0) {
    echo "Server internal error: ";
    print_r($output);
    } else {
    echo "It works!";
    }
    ?>
  2. 通过源码带的接口调试结果

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    看错误消息 mkstemp failed: Permission denied ,以及 failed run completed in 0s。原来 stress 命令并没有成功,它因为权限问题失败退出了
    $ curl "http://192.168.99.40:10000?verbose=1"
    Server internal error: Array
    (
    [0] => stress: info: [74139] dispatching hogs: 0 cpu, 0 io, 0 vm, 1 hdd
    [1] => stress: FAIL: [74140] (563) mkstemp failed: Permission denied
    [2] => stress: FAIL: [74139] (394) <-- worker 74140 returned error 1
    [3] => stress: WARN: [74139] (396) now reaping child worker processes
    [4] => stress: FAIL: [74139] (400) kill error: No such process
    [5] => stress: FAIL: [74139] (451) failed run completed in 0s
    )
  3. perf分析

    1
    2
    3
    4
    5
    6
    7
    8
    确保软连接开启
    # sudo sysctl -w kernel.kptr_restrict=1
    # perf record -g
    # docker cp perf.data phpfpm:/tmp
    # docker exec -it phpfpm bash
    # cd /tmp
    # apt update && apt install linux-perf linux-tools procps -y
    # perf_4.9 report
  4. 光标选中后回车进入,逐一检视← →

https://img.madebug.net/m4d3bug/images-of-website/master/blog/perfcalltrace.jpg

0x04 总结

  • /proc/stat可以得知CPU使用率的类型,其排查思路如下:
    • CPU和Nice升高,排查用户态进程。
    • 系统CPU升高,排查内核线程或者系统调用。
    • I / O 等待CPU升高,排查系统存储。
    • 软中断和硬中断升高,说明中断操作占用,排查内核中断服务程序。
  • pidstat和top各有所长。
  • perf可以在不中断进程的情况下收集证据。
  • 转瞬即逝的进程,不停地崩溃重启,也有可能影响性能表现,通过perf top来搭配观察。

Welcome to my other publishing channels