In vCenter, the virtual machine is reported as having its CPU disabled.
vmware.log on the ESXi host:
2019-09-30T02:54:20.165Z| vcpu-1| I125: APIC THERMLVT write: 0x10000
2019-09-30T02:54:20.165Z| vcpu-3| I125: APIC THERMLVT write: 0x10000
2019-09-30T02:54:20.165Z| vcpu-4| I125: APIC THERMLVT write: 0x10000
2019-09-30T02:54:20.165Z| vcpu-5| I125: APIC THERMLVT write: 0x10000
2019-09-30T02:54:20.165Z| vcpu-0| I125: APIC THERMLVT write: 0x10000
2019-09-30T02:54:20.165Z| vcpu-2| I125: APIC THERMLVT write: 0x10000
2019-09-30T02:54:20.165Z| vcpu-6| I125: APIC THERMLVT write: 0x10000
2019-09-30T02:54:20.165Z| vcpu-0| I125: Vix: [248776 vmxCommands.c:7739]: VMAutomation_HandleCLIHLTEvent. Do nothing.
2019-09-30T02:54:20.165Z| vcpu-0| I125: MsgHint: msg.monitorevent.halt
2019-09-30T02:54:20.165Z| vcpu-0| I125+ The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.
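To locate this event in a long vmware.log, a small filter can help. The following is only a sketch: the line shape (timestamp, thread, level, message separated by "| ") is assumed from the excerpt above, and find_guest_halts is a hypothetical helper name, not part of any VMware tool.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   <timestamp>| <thread>| <level>: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\| (?P<thread>[^|]+)\| (?P<level>\w+): (?P<msg>.*)$"
)


def find_guest_halts(lines):
    """Return (timestamp, thread) for every msg.monitorevent.halt hint."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and "msg.monitorevent.halt" in m.group("msg"):
            hits.append((m.group("ts"), m.group("thread")))
    return hits


if __name__ == "__main__":
    sample = [
        "2019-09-30T02:54:20.165Z| vcpu-2| I125: APIC THERMLVT write: 0x10000",
        "2019-09-30T02:54:20.165Z| vcpu-0| I125: MsgHint: msg.monitorevent.halt",
    ]
    for ts, thread in find_guest_halts(sample):
        print(ts, thread)
```

The timestamp of the halt event is what you then match against the vmcore directory name under /var/crash on the guest.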
Resolution:
- Update the kernel package to kernel-3.10.0-514.el7 or newer; the patches have been pulled into the newer kernel versions.
- The fixes for this issue have also been included in RHEL7.2 EUS kernel version 3.10.0-327.62.1.el7. The errata for this is (RHBA-2017:3256).
- More recently, an issue with a similar symptom was also reported on the RHEL 7.5 kernel. It has a different root cause, which was investigated in the internal BZ 1636066 and is described in Red Hat KCS 4094221.
- As a workaround, try disabling Transparent HugePages until the patches can be applied.
- If there are any third party, non-Red Hat shipped modules loaded, consider removing them. If after removing them, the server crashes again, please contact Red Hat.
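For the Transparent HugePages workaround, the current mode can be read from sysfs, where the active setting is the one shown in square brackets. The sketch below only reads the state; active_thp_mode is a hypothetical helper name. Writing "never" to the same file as root disables THP until the next reboot.

```python
from pathlib import Path

# Standard sysfs location of the THP switch on Linux.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed mode from a string such as 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
    else:
        print("THP sysfs file not found on this system")
```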
The following analysis of the dump follows the steps in the official KB article.
[root@test /]# crash /usr/lib/debug/usr/lib/modules/3.10.0-327.el7.x86_64/vmlinux /var/crash/127.0.0.1-2019-09-28-09\:41\:08/vmcore <--- open the vmcore with the crash utility
crash> bt
PID: 10620 TASK: ffff880232319700 CPU: 2 COMMAND: "dotenv-generato"
#0 [ffff8801a070f610] machine_kexec at ffffffff81051beb
#1 [ffff8801a070f670] crash_kexec at ffffffff810f2542
#2 [ffff8801a070f740] oops_end at ffffffff8163e1a8
#3 [ffff8801a070f768] no_context at ffffffff8162e2b8
#4 [ffff8801a070f7b8] __bad_area_nosemaphore at ffffffff8162e34e
#5 [ffff8801a070f800] bad_area_nosemaphore at ffffffff8162e4b8
#6 [ffff8801a070f810] __do_page_fault at ffffffff81640fce
#7 [ffff8801a070f868] do_page_fault at ffffffff81641113
#8 [ffff8801a070f890] page_fault at ffffffff8163d408
[exception RIP: down_read_trylock+9] <----- the panic occurred here
RIP: ffffffff810aa989 RSP: ffff8801a070f948 RFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff8800846a0b40 RCX: ffff8800846a0b40
RDX: 0000000000000001 RSI: 0000000000000301 RDI: 000000000000002d
RBP: ffff8801a070f948 R8: 0000000000000015 R9: ffff8800846a0b40
R10: ffff88023ffd8000 R11: 0000000000000000 R12: ffff8800846a0b41
R13: ffffea0005ee5480 R14: 000000000000002d R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#9 [ffff8801a070f950] page_lock_anon_vma_read at ffffffff811a2e65 <-------- last caller before the fault
#10 [ffff8801a070f980] try_to_unmap_anon at ffffffff811a3291
#11 [ffff8801a070f9d0] try_to_unmap at ffffffff811a33dd
#12 [ffff8801a070f9e8] migrate_pages at ffffffff811c7449
#13 [ffff8801a070fa90] compact_zone at ffffffff8118f259
#14 [ffff8801a070fae0] compact_zone_order at ffffffff8118f45c
#15 [ffff8801a070fb80] try_to_compact_pages at ffffffff8118f811
#16 [ffff8801a070fbe0] __alloc_pages_direct_compact at ffffffff816305c8
#17 [ffff8801a070fc40] __alloc_pages_nodemask at ffffffff811734e8
#18 [ffff8801a070fd78] alloc_pages_vma at ffffffff811b78ca
#19 [ffff8801a070fde0] do_huge_pmd_anonymous_page at ffffffff811cc2d3
#20 [ffff8801a070fe40] handle_mm_fault at ffffffff81196c78
#21 [ffff8801a070fed0] __do_page_fault at ffffffff81640e22
#22 [ffff8801a070ff28] do_page_fault at ffffffff81641113
#23 [ffff8801a070ff50] page_fault at ffffffff8163d408
RIP: 000000000040cac1 RSP: 000000c000042cf8 RFLAGS: 00010206
RAX: 00007ffff7fbdd98 RBX: 0000000000000000 RCX: 000000c000400000
RDX: 000000c000038700 RSI: 0000000000000008 RDI: 000000000065a020
RBP: 000000c000042d88 R8: 0000000000000001 R9: 0000000000000001
R10: 0000000000000200 R11: 0000000000000000 R12: 0000000000000040
R13: 0000000000000001 R14: 0000000000000000 R15: 000000c0000603c0
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
crash>
Inspect the location of the panic
crash> dis -rl down_read_trylock+0x9
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/kernel/rwsem.c: 32
0xffffffff810aa980 <down_read_trylock>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff810aa985 <down_read_trylock+5>: push %rbp
0xffffffff810aa986 <down_read_trylock+6>: mov %rsp,%rbp
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/include/asm/rwsem.h: 83
0xffffffff810aa989 <down_read_trylock+9>: mov (%rdi),%rax
crash>
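From the above, the panic occurred because the rwsem address passed in %rdi was corrupted: the faulting instruction dereferences (%rdi).
RDI: 000000000000002d ----- this value does not look like a valid kernel pointer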
crash> bt | awk '/exception RIP: down_read_trylock/,/RDI:/ {print}' | grep RDI
RDX: 0000000000000001 RSI: 0000000000000301 RDI: 000000000000002d
crash> eval 000000000000002d | grep binary
binary: 0000000000000000000000000000000000000000000000000000000000101101
crash>
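Why 0x2d cannot be a valid rwsem address can be shown with a quick range check. This is a sketch that assumes x86_64 with 48-bit virtual addresses (4-level paging), where kernel addresses start at 0xffff800000000000; looks_like_kernel_pointer is a hypothetical helper name.

```python
# Assumption: x86_64 with 4-level paging, so the kernel half of the
# canonical address space starts at 0xffff800000000000.
KERNEL_SPACE_START = 0xFFFF800000000000
ADDRESS_MAX = 0xFFFFFFFFFFFFFFFF


def looks_like_kernel_pointer(value):
    """True if value lies in the kernel half of the address space."""
    return KERNEL_SPACE_START <= value <= ADDRESS_MAX


if __name__ == "__main__":
    for reg, value in (("RDI", 0x2D), ("RBX", 0xFFFF8800846A0B40)):
        print(reg, hex(value), looks_like_kernel_pointer(value))
```

A value this small lies in the first page of the address space, which is never mapped for kernel data, so the load from (%rdi) faults.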
Find out where this value came from. The panic happened inside the call made from page_lock_anon_vma_read:
crash> dis -rl ffffffff811a2e65
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 446
0xffffffff811a2e10 <page_lock_anon_vma_read>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff811a2e15 <page_lock_anon_vma_read+5>: push %rbp
0xffffffff811a2e16 <page_lock_anon_vma_read+6>: mov %rsp,%rbp
0xffffffff811a2e19 <page_lock_anon_vma_read+9>: push %r14
0xffffffff811a2e1b <page_lock_anon_vma_read+11>: push %r13
0xffffffff811a2e1d <page_lock_anon_vma_read+13>: mov %rdi,%r13
0xffffffff811a2e20 <page_lock_anon_vma_read+16>: push %r12
0xffffffff811a2e22 <page_lock_anon_vma_read+18>: push %rbx
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 452
0xffffffff811a2e23 <page_lock_anon_vma_read+19>: mov 0x8(%rdi),%r12
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 453
0xffffffff811a2e27 <page_lock_anon_vma_read+23>: mov %r12,%rax
0xffffffff811a2e2a <page_lock_anon_vma_read+26>: and $0x3,%eax
0xffffffff811a2e2d <page_lock_anon_vma_read+29>: cmp $0x1,%rax
0xffffffff811a2e31 <page_lock_anon_vma_read+33>: je 0xffffffff811a2e48 <page_lock_anon_vma_read+56>
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 447
0xffffffff811a2e33 <page_lock_anon_vma_read+35>: xor %ebx,%ebx
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 505
0xffffffff811a2e35 <page_lock_anon_vma_read+37>: mov %rbx,%rax
0xffffffff811a2e38 <page_lock_anon_vma_read+40>: pop %rbx
0xffffffff811a2e39 <page_lock_anon_vma_read+41>: pop %r12
0xffffffff811a2e3b <page_lock_anon_vma_read+43>: pop %r13
0xffffffff811a2e3d <page_lock_anon_vma_read+45>: pop %r14
0xffffffff811a2e3f <page_lock_anon_vma_read+47>: pop %rbp
0xffffffff811a2e40 <page_lock_anon_vma_read+48>: retq
0xffffffff811a2e41 <page_lock_anon_vma_read+49>: nopl 0x0(%rax)
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/include/asm/atomic.h: 26
0xffffffff811a2e48 <page_lock_anon_vma_read+56>: mov 0x18(%rdi),%eax
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 455
0xffffffff811a2e4b <page_lock_anon_vma_read+59>: test %eax,%eax
0xffffffff811a2e4d <page_lock_anon_vma_read+61>: js 0xffffffff811a2e33 <page_lock_anon_vma_read+35>
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 459
0xffffffff811a2e4f <page_lock_anon_vma_read+63>: mov -0x1(%r12),%r14
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 458
0xffffffff811a2e54 <page_lock_anon_vma_read+68>: lea -0x1(%r12),%rbx
/usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 460
0xffffffff811a2e59 <page_lock_anon_vma_read+73>: add $0x8,%r14
0xffffffff811a2e5d <page_lock_anon_vma_read+77>: mov %r14,%rdi
0xffffffff811a2e60 <page_lock_anon_vma_read+80>: callq 0xffffffff810aa980 <down_read_trylock>
0xffffffff811a2e65 <page_lock_anon_vma_read+85>: test %eax,%eax
crash>
Tracing the code shows that the kernel was trying to take the read lock on the anon_vma of the page.
Inspect how the anon_vma is obtained
crash> struct -o page.mapping
struct page {
[8] struct address_space *mapping;
}
crash> struct -o anon_vma
struct anon_vma {
[0] struct anon_vma *root;
[8] struct rw_semaphore rwsem;
[40] atomic_t refcount;
[48] struct rb_root rb_root;
}
SIZE: 56
crash>
Continue tracing
crash> dis -r ffffffff811a3291
0xffffffff811a3270 <try_to_unmap_anon>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff811a3275 <try_to_unmap_anon+5>: push %rbp
0xffffffff811a3276 <try_to_unmap_anon+6>: mov %rsp,%rbp
0xffffffff811a3279 <try_to_unmap_anon+9>: push %r15
0xffffffff811a327b <try_to_unmap_anon+11>: push %r14
0xffffffff811a327d <try_to_unmap_anon+13>: mov %esi,%r14d
0xffffffff811a3280 <try_to_unmap_anon+16>: push %r13
0xffffffff811a3282 <try_to_unmap_anon+18>: push %r12
0xffffffff811a3284 <try_to_unmap_anon+20>: push %rbx
0xffffffff811a3285 <try_to_unmap_anon+21>: mov %rdi,%rbx
0xffffffff811a3288 <try_to_unmap_anon+24>: sub $0x18,%rsp
0xffffffff811a328c <try_to_unmap_anon+28>: callq 0xffffffff811a2e10 <page_lock_anon_vma_read>
0xffffffff811a3291 <try_to_unmap_anon+33>: mov %rax,%rcx
crash>
crash> whatis page_lock_anon_vma_read
struct anon_vma *page_lock_anon_vma_read(struct page *);
crash> dis -r ffffffff811a2e65
0xffffffff811a2e10 <page_lock_anon_vma_read>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff811a2e15 <page_lock_anon_vma_read+5>: push %rbp
0xffffffff811a2e16 <page_lock_anon_vma_read+6>: mov %rsp,%rbp
0xffffffff811a2e19 <page_lock_anon_vma_read+9>: push %r14
0xffffffff811a2e1b <page_lock_anon_vma_read+11>: push %r13
0xffffffff811a2e1d <page_lock_anon_vma_read+13>: mov %rdi,%r13
0xffffffff811a2e20 <page_lock_anon_vma_read+16>: push %r12
0xffffffff811a2e22 <page_lock_anon_vma_read+18>: push %rbx <--- page* is pushed onto the stack
0xffffffff811a2e23 <page_lock_anon_vma_read+19>: mov 0x8(%rdi),%r12 <--- 452 anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
0xffffffff811a2e27 <page_lock_anon_vma_read+23>: mov %r12,%rax
0xffffffff811a2e2a <page_lock_anon_vma_read+26>: and $0x3,%eax
0xffffffff811a2e2d <page_lock_anon_vma_read+29>: cmp $0x1,%rax
0xffffffff811a2e31 <page_lock_anon_vma_read+33>: je 0xffffffff811a2e48 <page_lock_anon_vma_read+56>
0xffffffff811a2e33 <page_lock_anon_vma_read+35>: xor %ebx,%ebx <--- 453 if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
0xffffffff811a2e35 <page_lock_anon_vma_read+37>: mov %rbx,%rax <--- 454 goto out;
0xffffffff811a2e38 <page_lock_anon_vma_read+40>: pop %rbx
0xffffffff811a2e39 <page_lock_anon_vma_read+41>: pop %r12
0xffffffff811a2e3b <page_lock_anon_vma_read+43>: pop %r13
0xffffffff811a2e3d <page_lock_anon_vma_read+45>: pop %r14
0xffffffff811a2e3f <page_lock_anon_vma_read+47>: pop %rbp
0xffffffff811a2e40 <page_lock_anon_vma_read+48>: retq
0xffffffff811a2e41 <page_lock_anon_vma_read+49>: nopl 0x0(%rax)
0xffffffff811a2e48 <page_lock_anon_vma_read+56>: mov 0x18(%rdi),%eax
0xffffffff811a2e4b <page_lock_anon_vma_read+59>: test %eax,%eax
0xffffffff811a2e4d <page_lock_anon_vma_read+61>: js 0xffffffff811a2e33 <page_lock_anon_vma_read+35>
0xffffffff811a2e4f <page_lock_anon_vma_read+63>: mov -0x1(%r12),%r14 <---- 459 root_anon_vma = ACCESS_ONCE(anon_vma->root);
0xffffffff811a2e54 <page_lock_anon_vma_read+68>: lea -0x1(%r12),%rbx <--- 458 anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
0xffffffff811a2e59 <page_lock_anon_vma_read+73>: add $0x8,%r14 <--- &root_anon_vma->rwsem
0xffffffff811a2e5d <page_lock_anon_vma_read+77>: mov %r14,%rdi <--- %r14 is rwsem
0xffffffff811a2e60 <page_lock_anon_vma_read+80>: callq 0xffffffff810aa980 <down_read_trylock>
0xffffffff811a2e65 <page_lock_anon_vma_read+85>: test %eax,%eax
crash>
The page* (held in the caller's %rbx) was pushed onto the stack.
%r14 holds anon_vma->root plus 8, that is, the address of root->rwsem, while %rbx holds the anon_vma pointer derived from page->mapping. Both values are preserved up to and through the call to down_read_trylock.
crash> bt -f | awk '/page_lock_anon_vma_read/,/page_referenced/ {print}'
#9 [ffff8801a070f950] page_lock_anon_vma_read at ffffffff811a2e65
ffff8801a070f958: ffffea0005ee5480 ffffea0005ee5440
>> %rbx << %r12
ffff8801a070f968: ffffea00078b2c40 0000000000000301
%r13 %r14
ffff8801a070f978: ffff8801a070f9c8 ffffffff811a3291
%rbp %rip
#10 [ffff8801a070f980] try_to_unmap_anon at ffffffff811a3291
ffff8801a070f988: ffffea0005ee5480 ffff8801a070fa50
ffff8801a070f998: 0000000000000001 ffffea0005ee5480
ffff8801a070f9a8: ffffea0005ee5440 ffffea00078b2c40
ffff8801a070f9b8: 0000000000000000 0000000000000000
ffff8801a070f9c8: ffff8801a070f9e0 ffffffff811a33dd
#11 [ffff8801a070f9d0] try_to_unmap at ffffffff811a33dd
ffff8801a070f9d8: ffffea0005ee5480 ffff8801a070fa88
ffff8801a070f9e8: ffffffff811c7449
#12 [ffff8801a070f9e8] migrate_pages at ffffffff811c7449
ffff8801a070f9f0: ffff8800846a0b40 ffff880232319700
ffff8801a070fa00: ffff880100000001 00000000a070fb00
ffff8801a070fa10: 0000000000000000 0000000000000000
ffff8801a070fa20: 000000000000000f ffff8801a070faf0
ffff8801a070fa30: ffffffff8118e260 ffffea0005ee54a0
ffff8801a070fa40: ffff8801a070fb00 0000000000000000
ffff8801a070fa50: ffff880237052000 00000000641e072f
ffff8801a070fa60: ffff88023ffd8000 ffff8801a070fb00
ffff8801a070fa70: 0000000000140000 ffff8801a070faf0
ffff8801a070fa80: ffff880232319700 ffff8801a070fad8
ffff8801a070fa90: ffffffff8118f259
#13 [ffff8801a070fa90] compact_zone at ffffffff8118f259
ffff8801a070fa98: 00000000ab51b7e8 ffff8801a070fb00
ffff8801a070faa8: 0000000000000020 ffff8801a070faf0
ffff8801a070fab8: ffff8801a070fd17 ffff88023ffd8000
ffff8801a070fac8: ffff88023ffd9008 0000000000000000
ffff8801a070fad8: ffff8801a070fb78 ffffffff8118f45c
#14 [ffff8801a070fae0] compact_zone_order at ffffffff8118f45c
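The register labels under frame #9 follow from the prologue of page_lock_anon_vma_read: registers are pushed in the order rbp, r14, r13, r12, rbx, so the last one pushed sits at the lowest address. A minimal sketch of that mapping, using the words from the frame above (label_saved_registers is a hypothetical helper name):

```python
def label_saved_registers(push_order, stack_words):
    """Pair stack words (lowest address first) with the saved registers.

    The last register pushed sits at the lowest address, so the push
    order is reversed; the return address follows the saved registers.
    """
    names = list(reversed(push_order)) + ["return address"]
    return dict(zip(names, stack_words))


if __name__ == "__main__":
    # Push order read from the disassembly of page_lock_anon_vma_read.
    push_order = ["rbp", "r14", "r13", "r12", "rbx"]
    # Words of frame #9, starting at ffff8801a070f958.
    words = [
        0xFFFFEA0005EE5480, 0xFFFFEA0005EE5440,
        0xFFFFEA00078B2C40, 0x0000000000000301,
        0xFFFF8801A070F9C8, 0xFFFFFFFF811A3291,
    ]
    for name, value in label_saved_registers(push_order, words).items():
        print(f"{name:>14}: {value:016x}")
```

These are the caller's register values. The saved %rbx is the page* that try_to_unmap_anon passed in.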
crash> page.mapping ffffea0005ee5480
mapping = 0xffff8800846a0b41
crash> bt | awk '/exception RIP: down_read_trylock/,/ORIG_RAX:/ {print}' | grep -e R14 -e RBX
RAX: 0000000000000000 RBX: ffff8800846a0b40 RCX: ffff8800846a0b40
R13: ffffea0005ee5480 R14: 000000000000002d R15: 0000000000000000
crash>
From the above we obtain the relevant addresses:
page*: ffffea0005ee5480
page->mapping: 0xffff8800846a0b41
&anon_vma->root->rwsem (anon_vma->root + 8): 000000000000002d
&anon_vma->root: ffff8800846a0b40
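The relationship between these values can be re-derived from the disassembly: the low bit of page->mapping tags an anonymous mapping, and the rwsem sits at offset 8 in struct anon_vma. The sketch below uses the constants seen in the listing (the `and $0x3` mask and the `add $0x8`); the function names are hypothetical.

```python
PAGE_MAPPING_ANON = 0x1   # low bit set: mapping points to an anon_vma
PAGE_MAPPING_FLAGS = 0x3  # mask applied by `and $0x3` in the disassembly
RWSEM_OFFSET = 8          # offset of rwsem in struct anon_vma (struct -o)


def decode_anon_mapping(mapping):
    """Return the anon_vma address encoded in page->mapping, or None."""
    if (mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON:
        return None
    return mapping - PAGE_MAPPING_ANON


def root_from_rwsem_address(rwsem_addr):
    """Recover anon_vma->root from the &root->rwsem value passed in %rdi."""
    return rwsem_addr - RWSEM_OFFSET


if __name__ == "__main__":
    print(hex(decode_anon_mapping(0xFFFF8800846A0B41)))
    print(hex(root_from_rwsem_address(0x2D)))
```

With the values from this dump, the decoded anon_vma matches RBX, and the root pointer that was read from it works out to 0x25, which is what becomes 0x2d after the `add $0x8`.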
Verify each value
crash> kmem ffffea0005ee5480
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea0005ee5480 17b952000 ffff8800846a0b41 7ffff33a6 2 2fffff00080009 locked,uptodate,swapbacked
crash> kmem 000000000000002d
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea0000000000 0 0 0 0 0
crash> kmem 0xffff8800846a0b41
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
ffff880237058700 vm_area_struct 216 51394 51408 2856 4k
SLAB MEMORY NODE TOTAL ALLOCATED FREE
ffffea000211a800 ffff8800846a0000 0 18 18 0
FREE / [ALLOCATED]
[ffff8800846a0af8]
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea000211a800 846a0000 0 0 1 1fffff00000080 slab
crash> kmem ffff8800846a0b40
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE
ffff880237058700 vm_area_struct 216 51394 51408 2856 4k
SLAB MEMORY NODE TOTAL ALLOCATED FREE
ffffea000211a800 ffff8800846a0000 0 18 18 0
FREE / [ALLOCATED]
[ffff8800846a0af8]
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea000211a800 846a0000 0 0 1 1fffff00000080 slab
crash>
crash> anon_vma 000000000000002d
struct: invalid kernel virtual address: 000000000000002d
crash>
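The kmem output places ffff8800846a0b40 in the vm_area_struct cache rather than in an anon_vma cache. How far into the allocated object the address lands can be computed from the object start and OBJSIZE shown above. This is a sketch with a hypothetical helper name:

```python
def offset_in_slab_object(addr, obj_start, obj_size):
    """Return addr's byte offset inside the object, or None if outside it."""
    offset = addr - obj_start
    return offset if 0 <= offset < obj_size else None


if __name__ == "__main__":
    # Values from the kmem output: object at ffff8800846a0af8, OBJSIZE 216.
    off = offset_in_slab_object(0xFFFF8800846A0B40, 0xFFFF8800846A0AF8, 216)
    print("offset into vm_area_struct object:", off)
```

The address falls inside a live vm_area_struct object, which suggests that page->mapping no longer pointed at a valid anon_vma when the lock was attempted.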
The analysis above follows the official Red Hat KB article; refer to it for further details:
https://access.redhat.com/solutions/2779851