nhc weirdness

From Ungracious Zebra, 1 Year ago, written in Plain Text, viewed 108 times.
Without NHC_CHECK_ALL=1, check_cpufreq_is_reasonable returns 1 and the node is marked offline as expected.

[root@fc99 ~]# nhc -v
Node Health Check starting.
Running check:  "check_fs_mount_rw -t "tmpfs" -s "tmpfs"                    -f "/dev/shm""
Running check:  "check_fs_mount_rw -t "tmpfs" -s "tmpfs"                    -f "/""
Running check:  "check_fs_mount_rw -t "nfs"   -s "netapp?:/ArcHome"        -f "/home""
Running check:  "check_fs_mount_rw -t "nfs"   -s "netapp?:/ArcScratch"     -f "/scratch""
Running check:  "check_fs_mount_rw -t "nfs"   -s "netapp?:/ArcWork"        -f "/work""
Running check:  "check_fs_mount_rw -t "nfs"   -s "netapp?:/ArcSoftware"    -f "/global/software""
Running check:  "check_fs_mount_rw -t "nfs"   -s "bulknetapp?:/Bulk"       -f "/bulk""
Running check:  "check_fs_used /scratch 95%"
Running check:  "check_fs_used /home    95%"
Running check:  "check_fs_used /work    95%"
Running check:  "check_fs_used /tmp     95%"
Running check:  "check_fs_used /bulk    95%"
Running check:  "check_file_test -r -w -x -d -k /tmp"
Running check:  "check_hw_swap_free 1048576"
Running check:  "check_file_contents /etc/rocky-release 'Rocky Linux release 8.4 (Green Obsidian)'"
Running check:  "check_file_contents /proc/version 'Linux version 4.18.0-348.23.1.el8_5.x86_64 (mockbuild@dal1-prod-builder001.bld.equ.rockylinux.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-4) (GCC)) #1 SMP Wed Apr 27 15:32:52 UTC 2022'"
Running check:  "check_cmd_output -t 10 -e 'id slurm' -m 'uid=987(slurm) gid=983(slurm) groups=983(slurm)'"
Running check:  "check_cmd_output -t 10 -e 'id munge' -m 'uid=989(munge) gid=984(munge) groups=984(munge)'"
Running check:  "check_cpufreq_is_reasonable"
Node Health Check failed.  Check check_cpufreq_is_reasonable returned 1
ERROR:  nhc:  Health check failed:  Check check_cpufreq_is_reasonable returned 1
20221129 13:35:13 /usr/libexec/nhc/node-mark-offline fc99 Check check_cpufreq_is_reasonable returned 1
/usr/libexec/nhc/node-mark-offline:  Marking idle fc99 offline:  NHC: Check check_cpufreq_is_reasonable returned 1
ERROR:  nhc:  Health check failed:  Check check_cpufreq_is_reasonable returned 1


With NHC_CHECK_ALL=1, check_cpufreq_is_reasonable still returns 1, but nhc brings the node back online and reports success.

[root@fc99 ~]# nhc -v
Node Health Check starting.
Running check:  "export MARK_OFFLINE=1 NHC_CHECK_ALL=1"
Running check:  "check_fs_mount_rw -t "tmpfs" -s "tmpfs"                    -f "/dev/shm""
Running check:  "check_fs_mount_rw -t "tmpfs" -s "tmpfs"                    -f "/""
Running check:  "check_fs_mount_rw -t "nfs"   -s "netapp?:/ArcHome"        -f "/home""
Running check:  "check_fs_mount_rw -t "nfs"   -s "netapp?:/ArcScratch"     -f "/scratch""
Running check:  "check_fs_mount_rw -t "nfs"   -s "netapp?:/ArcWork"        -f "/work""
Running check:  "check_fs_mount_rw -t "nfs"   -s "netapp?:/ArcSoftware"    -f "/global/software""
Running check:  "check_fs_mount_rw -t "nfs"   -s "bulknetapp?:/Bulk"       -f "/bulk""
Running check:  "check_fs_used /scratch 95%"
Running check:  "check_fs_used /home    95%"
Running check:  "check_fs_used /work    95%"
Running check:  "check_fs_used /tmp     95%"
Running check:  "check_fs_used /bulk    95%"
Running check:  "check_file_test -r -w -x -d -k /tmp"
Running check:  "check_hw_swap_free 1048576"
Running check:  "check_file_contents /etc/rocky-release 'Rocky Linux release 8.4 (Green Obsidian)'"
Running check:  "check_file_contents /proc/version 'Linux version 4.18.0-348.23.1.el8_5.x86_64 (mockbuild@dal1-prod-builder001.bld.equ.rockylinux.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-4) (GCC)) #1 SMP Wed Apr 27 15:32:52 UTC 2022'"
Running check:  "check_cmd_output -t 10 -e 'id slurm' -m 'uid=987(slurm) gid=983(slurm) groups=983(slurm)'"
Running check:  "check_cmd_output -t 10 -e 'id munge' -m 'uid=989(munge) gid=984(munge) groups=984(munge)'"
Running check:  "check_cpufreq_is_reasonable"
20221129 13:35:24 [slurm] /usr/libexec/nhc/node-mark-online fc99
/usr/libexec/nhc/node-mark-online:  Marking fc99 online and clearing note (NHC: Check check_cpufreq_is_reasonable returned 1)
Node Health Check completed successfully (2s).
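
To spot this mismatch across many nodes, captured `nhc -v` output could be scanned with something like the sketch below. It is not part of NHC. The helper names (`summarize`, `is_contradictory`) are made up for illustration, and the regular expressions are assumptions based only on the message formats in the two runs above. Note the heuristic is crude: it treats any "Check X returned N" text as evidence of a failure, including the note that node-mark-online reports clearing.

```python
import re

# Patterns inferred from the two runs above (assumptions, not an NHC spec).
FAILED = re.compile(r"Check (\S+) returned (\d+)")
MARK_ONLINE = re.compile(r"node-mark-online:\s+Marking \S+ online")
MARK_OFFLINE = re.compile(r"node-mark-offline:\s+Marking \w+ \S+ offline")


def summarize(log_text):
    """Return (sorted failed check names, last marking action or None)."""
    failed = set()
    action = None
    for line in log_text.splitlines():
        match = FAILED.search(line)
        if match and match.group(2) != "0":
            failed.add(match.group(1))
        if MARK_ONLINE.search(line):
            action = "online"
        elif MARK_OFFLINE.search(line):
            action = "offline"
    return sorted(failed), action


def is_contradictory(log_text):
    """True when a non-zero check result is mentioned but the node went online."""
    failed, action = summarize(log_text)
    return bool(failed) and action == "online"


if __name__ == "__main__":
    # Excerpt of the second run, copied from the output above.
    run_with_check_all = (
        'Running check:  "check_cpufreq_is_reasonable"\n'
        "/usr/libexec/nhc/node-mark-online:  Marking fc99 online and clearing "
        "note (NHC: Check check_cpufreq_is_reasonable returned 1)\n"
        "Node Health Check completed successfully (2s).\n"
    )
    print(summarize(run_with_check_all))
    print(is_contradictory(run_with_check_all))
```

Fed the first run, `is_contradictory` returns False because the last action is a node-mark-offline. Fed the second run, it returns True.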
