CPU disabled on a VMware virtual machine: The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.


In vCenter, the virtual machine is reported as having its CPU disabled.

 
 


 
 

vmware.log on the ESXi host:

2019-09-30T02:54:20.165Z| vcpu-1| I125: APIC THERMLVT write: 0x10000

2019-09-30T02:54:20.165Z| vcpu-3| I125: APIC THERMLVT write: 0x10000

2019-09-30T02:54:20.165Z| vcpu-4| I125: APIC THERMLVT write: 0x10000

2019-09-30T02:54:20.165Z| vcpu-5| I125: APIC THERMLVT write: 0x10000

2019-09-30T02:54:20.165Z| vcpu-0| I125: APIC THERMLVT write: 0x10000

2019-09-30T02:54:20.165Z| vcpu-2| I125: APIC THERMLVT write: 0x10000

2019-09-30T02:54:20.165Z| vcpu-6| I125: APIC THERMLVT write: 0x10000

2019-09-30T02:54:20.165Z| vcpu-0| I125: Vix: [248776 vmxCommands.c:7739]: VMAutomation_HandleCLIHLTEvent. Do nothing.

2019-09-30T02:54:20.165Z| vcpu-0| I125: MsgHint: msg.monitorevent.halt

2019-09-30T02:54:20.165Z| vcpu-0| I125+ The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.
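To locate this event in a large vmware.log, a small filter can help. The following is a minimal sketch, assuming the log keeps the `timestamp| thread| message` layout seen in the excerpt above; the function name `find_halt_events` and the marker strings chosen are illustrative and not part of any VMware tooling.

```python
import re

# Strings taken from the log excerpt above that indicate the guest halted its CPUs.
HALT_MARKERS = (
    "msg.monitorevent.halt",
    "The CPU has been disabled by the guest operating system",
)

# Assumed layout: "<timestamp>| <thread>| <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+)\|\s*(?P<thread>[^|]+)\|\s*(?P<body>.*)$")


def find_halt_events(lines):
    """Return (timestamp, thread, message) tuples for lines that mention a guest CPU halt."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        body = m.group("body")
        if any(marker in body for marker in HALT_MARKERS):
            events.append((m.group("ts"), m.group("thread").strip(), body))
    return events


sample = [
    "2019-09-30T02:54:20.165Z| vcpu-2| I125: APIC THERMLVT write: 0x10000",
    "2019-09-30T02:54:20.165Z| vcpu-0| I125: MsgHint: msg.monitorevent.halt",
    "2019-09-30T02:54:20.165Z| vcpu-0| I125+ The CPU has been disabled by the guest "
    "operating system. Power off or reset the virtual machine.",
]

for ts, thread, body in find_halt_events(sample):
    print(ts, thread, body)
```

For a real log, the same function can be fed an open file object, for example `find_halt_events(open("vmware.log", errors="replace"))`.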

 
 

Solution:

  • Update the kernel package to at least kernel-3.10.0-514.el7 or newer as the patches have been pulled into the newer kernel versions.
  • The fixes for this issue have also been included in RHEL7.2 EUS kernel version 3.10.0-327.62.1.el7. The errata for this is (RHBA-2017:3256).
  • Lately, the issue with a similar symptom was also reported on the RHEL 7.5 kernel. However, it had a different root cause, which was investigated in the internal BZ 1636066 and is now described in our KCS 4094221.
  • As a workaround, try disabling Transparent HugePages until the patches can be applied.
  • If there are any third party, non-Red Hat shipped modules loaded, consider removing them. If after removing them, the server crashes again, please contact Red Hat.
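The first two bullets and the workaround can be checked mechanically. The sketch below is an assumption-laden helper, not an official Red Hat check: it only knows the two kernel builds named above (3.10.0-514 and the 3.10.0-327.62.1 EUS build), and the function names `has_fix` and `active_thp_mode` are made up for illustration. The Transparent HugePages state is normally exposed in `/sys/kernel/mm/transparent_hugepage/enabled`, where the active mode is shown in square brackets.

```python
import re


def build_numbers(release):
    """Return the numeric build fields of a 3.10.0 release, e.g. (327, 62, 1), or None."""
    m = re.match(r"3\.10\.0-([0-9.]+)", release)
    if not m:
        return None  # not a 3.10.0-style kernel release; this check does not apply
    return tuple(int(p) for p in m.group(1).split(".") if p)


def has_fix(release):
    """True/False for 3.10.0 kernels, None when the release string is out of scope."""
    build = build_numbers(release)
    if not build:
        return None
    if build[0] == 327:
        # RHEL 7.2 EUS stream: fixed from 3.10.0-327.62.1 according to the text above
        return build >= (327, 62, 1)
    # Otherwise require at least 3.10.0-514
    return build >= (514,)


def active_thp_mode(text):
    """Extract the bracketed mode from e.g. '[always] madvise never'."""
    m = re.search(r"\[(\w+)\]", text)
    return m.group(1) if m else None


print(has_fix("3.10.0-327.el7.x86_64"))           # kernel from the vmcore in this article
print(has_fix("3.10.0-327.62.1.el7.x86_64"))
print(active_thp_mode("[always] madvise never"))
```

On a live system, `platform.release()` and the contents of the sysfs file above could be passed to these functions. The 7.5 variant of the issue mentioned in the third bullet has a different root cause and is not covered by this check.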

     
     

     
     

     
     

     
     

    The following analysis of the dump follows the steps in the official KB article.

     
     

     
     

    [root@test /]# crash /usr/lib/debug/usr/lib/modules/3.10.0-327.el7.x86_64/vmlinux /var/crash/127.0.0.1-2019-09-28-09\:41\:08/vmcore <--- open the vmcore with the crash utility

    crash> bt

    PID: 10620 TASK: ffff880232319700 CPU: 2 COMMAND: "dotenv-generato"

    #0 [ffff8801a070f610] machine_kexec at ffffffff81051beb

    #1 [ffff8801a070f670] crash_kexec at ffffffff810f2542

    #2 [ffff8801a070f740] oops_end at ffffffff8163e1a8

    #3 [ffff8801a070f768] no_context at ffffffff8162e2b8

    #4 [ffff8801a070f7b8] __bad_area_nosemaphore at ffffffff8162e34e

    #5 [ffff8801a070f800] bad_area_nosemaphore at ffffffff8162e4b8

    #6 [ffff8801a070f810] __do_page_fault at ffffffff81640fce

    #7 [ffff8801a070f868] do_page_fault at ffffffff81641113

    #8 [ffff8801a070f890] page_fault at ffffffff8163d408

    [exception RIP: down_read_trylock+9] <----- the panic occurred here

    RIP: ffffffff810aa989 RSP: ffff8801a070f948 RFLAGS: 00010206

    RAX: 0000000000000000 RBX: ffff8800846a0b40 RCX: ffff8800846a0b40

    RDX: 0000000000000001 RSI: 0000000000000301 RDI: 000000000000002d

    RBP: ffff8801a070f948 R8: 0000000000000015 R9: ffff8800846a0b40

    R10: ffff88023ffd8000 R11: 0000000000000000 R12: ffff8800846a0b41

    R13: ffffea0005ee5480 R14: 000000000000002d R15: 0000000000000000

    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000

    #9 [ffff8801a070f950] page_lock_anon_vma_read at ffffffff811a2e65 <-------- last caller before the fault

    #10 [ffff8801a070f980] try_to_unmap_anon at ffffffff811a3291

    #11 [ffff8801a070f9d0] try_to_unmap at ffffffff811a33dd

    #12 [ffff8801a070f9e8] migrate_pages at ffffffff811c7449

    #13 [ffff8801a070fa90] compact_zone at ffffffff8118f259

    #14 [ffff8801a070fae0] compact_zone_order at ffffffff8118f45c

    #15 [ffff8801a070fb80] try_to_compact_pages at ffffffff8118f811

    #16 [ffff8801a070fbe0] __alloc_pages_direct_compact at ffffffff816305c8

    #17 [ffff8801a070fc40] __alloc_pages_nodemask at ffffffff811734e8

    #18 [ffff8801a070fd78] alloc_pages_vma at ffffffff811b78ca

    #19 [ffff8801a070fde0] do_huge_pmd_anonymous_page at ffffffff811cc2d3

    #20 [ffff8801a070fe40] handle_mm_fault at ffffffff81196c78

    #21 [ffff8801a070fed0] __do_page_fault at ffffffff81640e22

    #22 [ffff8801a070ff28] do_page_fault at ffffffff81641113

    #23 [ffff8801a070ff50] page_fault at ffffffff8163d408

    RIP: 000000000040cac1 RSP: 000000c000042cf8 RFLAGS: 00010206

    RAX: 00007ffff7fbdd98 RBX: 0000000000000000 RCX: 000000c000400000

    RDX: 000000c000038700 RSI: 0000000000000008 RDI: 000000000065a020

    RBP: 000000c000042d88 R8: 0000000000000001 R9: 0000000000000001

    R10: 0000000000000200 R11: 0000000000000000 R12: 0000000000000040

    R13: 0000000000000001 R14: 0000000000000000 R15: 000000c0000603c0

    ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b

    crash>

     
     

    Inspect the location of the panic:

    crash> dis -rl down_read_trylock+0x9

    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/kernel/rwsem.c: 32

    0xffffffff810aa980 <down_read_trylock>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]

    0xffffffff810aa985 <down_read_trylock+5>: push %rbp

    0xffffffff810aa986 <down_read_trylock+6>: mov %rsp,%rbp

    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/include/asm/rwsem.h: 83

    0xffffffff810aa989 <down_read_trylock+9>: mov (%rdi),%rax

    crash>

     
     


     
     

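    The output above shows that the panic was caused by a corrupted rwsem address: the faulting instruction dereferences %rdi.

    RDI: 000000000000002d ----- this value does not look like a valid kernel pointer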

    crash> bt | awk '/exception RIP: down_read_trylock/,/RDI:/ {print}' | grep RDI

    RDX: 0000000000000001 RSI: 0000000000000301 RDI: 000000000000002d

    crash> eval 000000000000002d | grep binary

    binary: 0000000000000000000000000000000000000000000000000000000000101101

    crash>
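Why 0x2d "does not look normal" can be made explicit with a rough address classifier. This is a sketch under assumptions: x86_64 with 4-level paging (kernel addresses in the upper canonical half starting at 0xffff800000000000) and a 4 KiB page size; the function name `classify_address` is illustrative.

```python
KERNEL_SPACE_START = 0xFFFF800000000000  # upper canonical half, x86_64 4-level paging (assumed)
USER_SPACE_END = 0x00007FFFFFFFFFFF      # top of the lower canonical half
PAGE_SIZE = 4096                         # assumed 4 KiB pages


def classify_address(addr):
    """Roughly classify a 64-bit virtual address."""
    if addr < PAGE_SIZE:
        return "near-NULL"      # inside the first page, never a valid kernel object
    if addr >= KERNEL_SPACE_START:
        return "kernel"
    if addr <= USER_SPACE_END:
        return "user"
    return "non-canonical"


for value in (0x2D, 0xFFFF8800846A0B40, 0x40CAC1):
    print(hex(value), classify_address(value))
```

With these assumptions, the RDI value falls in the first page, while the RBX value from the same register dump is a plausible kernel address.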

     
     

    Find out where this value came from. The panic happened inside the call made by page_lock_anon_vma_read:

    crash> dis -rl ffffffff811a2e65

    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 446

    0xffffffff811a2e10 <page_lock_anon_vma_read>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]

    0xffffffff811a2e15 <page_lock_anon_vma_read+5>: push %rbp

    0xffffffff811a2e16 <page_lock_anon_vma_read+6>: mov %rsp,%rbp

    0xffffffff811a2e19 <page_lock_anon_vma_read+9>: push %r14

    0xffffffff811a2e1b <page_lock_anon_vma_read+11>: push %r13

    0xffffffff811a2e1d <page_lock_anon_vma_read+13>: mov %rdi,%r13

    0xffffffff811a2e20 <page_lock_anon_vma_read+16>: push %r12

    0xffffffff811a2e22 <page_lock_anon_vma_read+18>: push %rbx

    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 452

    0xffffffff811a2e23 <page_lock_anon_vma_read+19>: mov 0x8(%rdi),%r12

    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 453

    0xffffffff811a2e27 <page_lock_anon_vma_read+23>: mov %r12,%rax

    0xffffffff811a2e2a <page_lock_anon_vma_read+26>: and $0x3,%eax

    0xffffffff811a2e2d <page_lock_anon_vma_read+29>: cmp $0x1,%rax

    0xffffffff811a2e31 <page_lock_anon_vma_read+33>: je 0xffffffff811a2e48 <page_lock_anon_vma_read+56>

    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 447

    0xffffffff811a2e33 <page_lock_anon_vma_read+35>: xor %ebx,%ebx

    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 505

    0xffffffff811a2e35 <page_lock_anon_vma_read+37>: mov %rbx,%rax

    0xffffffff811a2e38 <page_lock_anon_vma_read+40>: pop %rbx

    0xffffffff811a2e39 <page_lock_anon_vma_read+41>: pop %r12

    0xffffffff811a2e3b <page_lock_anon_vma_read+43>: pop %r13

    0xffffffff811a2e3d <page_lock_anon_vma_read+45>: pop %r14

    0xffffffff811a2e3f <page_lock_anon_vma_read+47>: pop %rbp

    0xffffffff811a2e40 <page_lock_anon_vma_read+48>: retq

    0xffffffff811a2e41 <page_lock_anon_vma_read+49>: nopl 0x0(%rax)

    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/include/asm/atomic.h: 26

    0xffffffff811a2e48 <page_lock_anon_vma_read+56>: mov 0x18(%rdi),%eax

    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 455

    0xffffffff811a2e4b <page_lock_anon_vma_read+59>: test %eax,%eax

    0xffffffff811a2e4d <page_lock_anon_vma_read+61>: js 0xffffffff811a2e33 <page_lock_anon_vma_read+35>

    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 459

    0xffffffff811a2e4f <page_lock_anon_vma_read+63>: mov -0x1(%r12),%r14

    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 458

    0xffffffff811a2e54 <page_lock_anon_vma_read+68>: lea -0x1(%r12),%rbx

    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/mm/rmap.c: 460

    0xffffffff811a2e59 <page_lock_anon_vma_read+73>: add $0x8,%r14

    0xffffffff811a2e5d <page_lock_anon_vma_read+77>: mov %r14,%rdi

    0xffffffff811a2e60 <page_lock_anon_vma_read+80>: callq 0xffffffff810aa980 <down_read_trylock>

    0xffffffff811a2e65 <page_lock_anon_vma_read+85>: test %eax,%eax

    crash>

     
     

    Tracing the code shows that the kernel was trying to take the lock of the page's anon_vma.

     
     

    Look at how the anon_vma is obtained:

    crash> struct -o page.mapping

    struct page {

    [8] struct address_space *mapping;

    }

    crash> struct -o anon_vma

    struct anon_vma {

    [0] struct anon_vma *root;

    [8] struct rw_semaphore rwsem;

    [40] atomic_t refcount;

    [48] struct rb_root rb_root;

    }

    SIZE: 56

    crash>

     
     

     
     

    Continue tracing:

    crash> dis -r ffffffff811a3291

    0xffffffff811a3270 <try_to_unmap_anon>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]

    0xffffffff811a3275 <try_to_unmap_anon+5>: push %rbp

    0xffffffff811a3276 <try_to_unmap_anon+6>: mov %rsp,%rbp

    0xffffffff811a3279 <try_to_unmap_anon+9>: push %r15

    0xffffffff811a327b <try_to_unmap_anon+11>: push %r14

    0xffffffff811a327d <try_to_unmap_anon+13>: mov %esi,%r14d

    0xffffffff811a3280 <try_to_unmap_anon+16>: push %r13

    0xffffffff811a3282 <try_to_unmap_anon+18>: push %r12

    0xffffffff811a3284 <try_to_unmap_anon+20>: push %rbx

    0xffffffff811a3285 <try_to_unmap_anon+21>: mov %rdi,%rbx

    0xffffffff811a3288 <try_to_unmap_anon+24>: sub $0x18,%rsp

    0xffffffff811a328c <try_to_unmap_anon+28>: callq 0xffffffff811a2e10 <page_lock_anon_vma_read>

    0xffffffff811a3291 <try_to_unmap_anon+33>: mov %rax,%rcx

    crash>

     
     

    crash> whatis page_lock_anon_vma_read

    struct anon_vma *page_lock_anon_vma_read(struct page *);

     
     

    crash> dis -r ffffffff811a2e65

    0xffffffff811a2e10 <page_lock_anon_vma_read>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]

    0xffffffff811a2e15 <page_lock_anon_vma_read+5>: push %rbp

    0xffffffff811a2e16 <page_lock_anon_vma_read+6>: mov %rsp,%rbp

    0xffffffff811a2e19 <page_lock_anon_vma_read+9>: push %r14

    0xffffffff811a2e1b <page_lock_anon_vma_read+11>: push %r13

    0xffffffff811a2e1d <page_lock_anon_vma_read+13>: mov %rdi,%r13

    0xffffffff811a2e20 <page_lock_anon_vma_read+16>: push %r12

    0xffffffff811a2e22 <page_lock_anon_vma_read+18>: push %rbx <--- page* is pushed onto the stack

    0xffffffff811a2e23 <page_lock_anon_vma_read+19>: mov 0x8(%rdi),%r12 <--- 452 anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);

    0xffffffff811a2e27 <page_lock_anon_vma_read+23>: mov %r12,%rax

    0xffffffff811a2e2a <page_lock_anon_vma_read+26>: and $0x3,%eax

    0xffffffff811a2e2d <page_lock_anon_vma_read+29>: cmp $0x1,%rax

    0xffffffff811a2e31 <page_lock_anon_vma_read+33>: je 0xffffffff811a2e48 <page_lock_anon_vma_read+56>

    0xffffffff811a2e33 <page_lock_anon_vma_read+35>: xor %ebx,%ebx <--- 453 if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)

    0xffffffff811a2e35 <page_lock_anon_vma_read+37>: mov %rbx,%rax <--- 454 goto out;

    0xffffffff811a2e38 <page_lock_anon_vma_read+40>: pop %rbx

    0xffffffff811a2e39 <page_lock_anon_vma_read+41>: pop %r12

    0xffffffff811a2e3b <page_lock_anon_vma_read+43>: pop %r13

    0xffffffff811a2e3d <page_lock_anon_vma_read+45>: pop %r14

    0xffffffff811a2e3f <page_lock_anon_vma_read+47>: pop %rbp

    0xffffffff811a2e40 <page_lock_anon_vma_read+48>: retq

    0xffffffff811a2e41 <page_lock_anon_vma_read+49>: nopl 0x0(%rax)

    0xffffffff811a2e48 <page_lock_anon_vma_read+56>: mov 0x18(%rdi),%eax

    0xffffffff811a2e4b <page_lock_anon_vma_read+59>: test %eax,%eax

    0xffffffff811a2e4d <page_lock_anon_vma_read+61>: js 0xffffffff811a2e33 <page_lock_anon_vma_read+35>

    0xffffffff811a2e4f <page_lock_anon_vma_read+63>: mov -0x1(%r12),%r14 <---- 459 root_anon_vma = ACCESS_ONCE(anon_vma->root);

    0xffffffff811a2e54 <page_lock_anon_vma_read+68>: lea -0x1(%r12),%rbx <--- 458 anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);

    0xffffffff811a2e59 <page_lock_anon_vma_read+73>: add $0x8,%r14 <--- &root_anon_vma->rwsem

    0xffffffff811a2e5d <page_lock_anon_vma_read+77>: mov %r14,%rdi <--- %r14 is rwsem

    0xffffffff811a2e60 <page_lock_anon_vma_read+80>: callq 0xffffffff810aa980 <down_read_trylock>

    0xffffffff811a2e65 <page_lock_anon_vma_read+85>: test %eax,%eax

    crash>

     
     

    The page* was pushed onto the stack

    %r14 holds the address of the root anon_vma's rwsem (the value read from anon_vma->root plus 8), while %rbx holds the anon_vma pointer derived from page->mapping. Both values are preserved up to and through the call to down_read_trylock.

     
     

     
     

    crash> bt -f | awk '/page_lock_anon_vma_read/,/page_referenced/ {print}'

    #9 [ffff8801a070f950] page_lock_anon_vma_read at ffffffff811a2e65

    ffff8801a070f958: ffffea0005ee5480 ffffea0005ee5440

    >> %rbx << %r12

    ffff8801a070f968: ffffea00078b2c40 0000000000000301

    %r13 %r14

    ffff8801a070f978: ffff8801a070f9c8 ffffffff811a3291

    %rbp %rip

    #10 [ffff8801a070f980] try_to_unmap_anon at ffffffff811a3291

    ffff8801a070f988: ffffea0005ee5480 ffff8801a070fa50

    ffff8801a070f998: 0000000000000001 ffffea0005ee5480

    ffff8801a070f9a8: ffffea0005ee5440 ffffea00078b2c40

    ffff8801a070f9b8: 0000000000000000 0000000000000000

    ffff8801a070f9c8: ffff8801a070f9e0 ffffffff811a33dd

    #11 [ffff8801a070f9d0] try_to_unmap at ffffffff811a33dd

    ffff8801a070f9d8: ffffea0005ee5480 ffff8801a070fa88

    ffff8801a070f9e8: ffffffff811c7449

    #12 [ffff8801a070f9e8] migrate_pages at ffffffff811c7449

    ffff8801a070f9f0: ffff8800846a0b40 ffff880232319700

    ffff8801a070fa00: ffff880100000001 00000000a070fb00

    ffff8801a070fa10: 0000000000000000 0000000000000000

    ffff8801a070fa20: 000000000000000f ffff8801a070faf0

    ffff8801a070fa30: ffffffff8118e260 ffffea0005ee54a0

    ffff8801a070fa40: ffff8801a070fb00 0000000000000000

    ffff8801a070fa50: ffff880237052000 00000000641e072f

    ffff8801a070fa60: ffff88023ffd8000 ffff8801a070fb00

    ffff8801a070fa70: 0000000000140000 ffff8801a070faf0

    ffff8801a070fa80: ffff880232319700 ffff8801a070fad8

    ffff8801a070fa90: ffffffff8118f259

    #13 [ffff8801a070fa90] compact_zone at ffffffff8118f259

    ffff8801a070fa98: 00000000ab51b7e8 ffff8801a070fb00

    ffff8801a070faa8: 0000000000000020 ffff8801a070faf0

    ffff8801a070fab8: ffff8801a070fd17 ffff88023ffd8000

    ffff8801a070fac8: ffff88023ffd9008 0000000000000000

    ffff8801a070fad8: ffff8801a070fb78 ffffffff8118f45c

    #14 [ffff8801a070fae0] compact_zone_order at ffffffff8118f45c

     
     

    crash> page.mapping ffffea0005ee5480

    mapping = 0xffff8800846a0b41

    crash> bt | awk '/exception RIP: down_read_trylock/,/ORIG_RAX:/ {print}' | grep -e R14 -e RBX

    RAX: 0000000000000000 RBX: ffff8800846a0b40 RCX: ffff8800846a0b40

    R13: ffffea0005ee5480 R14: 000000000000002d R15: 0000000000000000

    crash>

     
     

    From the above we obtain the relevant addresses:

    page*: ffffea0005ee5480

    page->mapping: 0xffff8800846a0b41

    &anon_vma->root->rwsem (anon_vma->root + 8, the value in %r14/%rdi): 000000000000002d

    &anon_vma->root: ffff8800846a0b40
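The pointer arithmetic performed by page_lock_anon_vma_read can be replayed with the values from the dump. This is only a sketch: the constants come from the disassembly and `struct -o` output above (PAGE_MAPPING_ANON is the `-0x1` adjustment, the rwsem sits at offset 8), and the variable names are chosen for illustration.

```python
PAGE_MAPPING_FLAGS = 0x3  # mask applied by "and $0x3,%eax"
PAGE_MAPPING_ANON = 0x1   # value compared by "cmp $0x1,%rax"
RWSEM_OFFSET = 8          # [8] struct rw_semaphore rwsem in struct anon_vma

page_mapping = 0xFFFF8800846A0B41  # crash> page.mapping ffffea0005ee5480
rdi_at_fault = 0x2D                # RDI / R14 at down_read_trylock+9

# rmap.c:453 - the low bits mark the mapping as anonymous
is_anon = (page_mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON

# rmap.c:458 - strip the flag bit to get the anon_vma pointer (matches RBX)
anon_vma = page_mapping - PAGE_MAPPING_ANON

# rmap.c:459/460 - %r14 = anon_vma->root, then 8 is added before the call,
# so the value that was read from anon_vma->root is RDI minus 8
root_value = rdi_at_fault - RWSEM_OFFSET

print("anonymous mapping:", is_anon)
print("anon_vma         :", hex(anon_vma))
print("anon_vma->root   :", hex(root_value))
```

Read this way, the anon_vma pointer itself matches RBX, and the word stored at its `root` field is a small integer rather than a kernel pointer, which is what makes the later dereference in down_read_trylock fault.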

     
     

    Verify each value:

    crash> kmem ffffea0005ee5480

    PAGE PHYSICAL MAPPING INDEX CNT FLAGS

    ffffea0005ee5480 17b952000 ffff8800846a0b41 7ffff33a6 2 2fffff00080009 locked,uptodate,swapbacked

    crash> kmem 000000000000002d

    PAGE PHYSICAL MAPPING INDEX CNT FLAGS

    ffffea0000000000 0 0 0 0 0

    crash> kmem 0xffff8800846a0b41

    CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE

    ffff880237058700 vm_area_struct 216 51394 51408 2856 4k

    SLAB MEMORY NODE TOTAL ALLOCATED FREE

    ffffea000211a800 ffff8800846a0000 0 18 18 0

    FREE / [ALLOCATED]

    [ffff8800846a0af8]

     
     

    PAGE PHYSICAL MAPPING INDEX CNT FLAGS

    ffffea000211a800 846a0000 0 0 1 1fffff00000080 slab

    crash> kmem ffff8800846a0b40

    CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE

    ffff880237058700 vm_area_struct 216 51394 51408 2856 4k

    SLAB MEMORY NODE TOTAL ALLOCATED FREE

    ffffea000211a800 ffff8800846a0000 0 18 18 0

    FREE / [ALLOCATED]

    [ffff8800846a0af8]

     
     

    PAGE PHYSICAL MAPPING INDEX CNT FLAGS

    ffffea000211a800 846a0000 0 0 1 1fffff00000080 slab

    crash>

     
     

    crash> anon_vma 000000000000002d

    struct: invalid kernel virtual address: 000000000000002d

    crash>
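The kmem output places ffff8800846a0b40 in the vm_area_struct cache, inside the allocated object that starts at ffff8800846a0af8. A quick calculation shows where the pointer lands within that object. This is a sketch that only uses the numbers printed above; the interpretation in the comments is an assumption, since the output alone does not show how the memory came to be reused.

```python
OBJ_START = 0xFFFF8800846A0AF8  # [ffff8800846a0af8] allocated object reported by kmem
OBJ_SIZE = 216                  # OBJSIZE of the vm_area_struct cache
anon_vma_ptr = 0xFFFF8800846A0B40

offset = anon_vma_ptr - OBJ_START
inside = 0 <= offset < OBJ_SIZE

print("offset into object:", offset, "bytes")
print("inside the vm_area_struct object:", inside)
# If 'inside' is True, page->mapping points into the middle of a vm_area_struct
# rather than at the start of an anon_vma object, which would be consistent with
# a stale or corrupted pointer.
```

One would normally expect an anon_vma pointer to resolve to the anon_vma slab cache, so a hit in the middle of a vm_area_struct object fits the corrupted-pointer reading of the crash.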

     
     

     
     

    The analysis above follows the official Red Hat KB article; refer to it for further details:

    https://access.redhat.com/solutions/2779851

     
     

     
     

     
     

     
     

     
     

     
     

     
     
