- Without NHC_CHECK_ALL=1, check_cpufreq_is_reasonable returns 1 and the node is marked offline as expected.
- [root@fc99 ~]# nhc -v
- Node Health Check starting.
- Running check: "check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/dev/shm""
- Running check: "check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/""
- Running check: "check_fs_mount_rw -t "nfs" -s "netapp?:/ArcHome" -f "/home""
- Running check: "check_fs_mount_rw -t "nfs" -s "netapp?:/ArcScratch" -f "/scratch""
- Running check: "check_fs_mount_rw -t "nfs" -s "netapp?:/ArcWork" -f "/work""
- Running check: "check_fs_mount_rw -t "nfs" -s "netapp?:/ArcSoftware" -f "/global/software""
- Running check: "check_fs_mount_rw -t "nfs" -s "bulknetapp?:/Bulk" -f "/bulk""
- Running check: "check_fs_used /scratch 95%"
- Running check: "check_fs_used /home 95%"
- Running check: "check_fs_used /work 95%"
- Running check: "check_fs_used /tmp 95%"
- Running check: "check_fs_used /bulk 95%"
- Running check: "check_file_test -r -w -x -d -k /tmp"
- Running check: "check_hw_swap_free 1048576"
- Running check: "check_file_contents /etc/rocky-release 'Rocky Linux release 8.4 (Green Obsidian)'"
- Running check: "check_file_contents /proc/version 'Linux version 4.18.0-348.23.1.el8_5.x86_64 (mockbuild@dal1-prod-builder001.bld.equ.rockylinux.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-4) (GCC)) #1 SMP Wed Apr 27 15:32:52 UTC 2022'"
- Running check: "check_cmd_output -t 10 -e 'id slurm' -m 'uid=987(slurm) gid=983(slurm) groups=983(slurm)'"
- Running check: "check_cmd_output -t 10 -e 'id munge' -m 'uid=989(munge) gid=984(munge) groups=984(munge)'"
- Running check: "check_cpufreq_is_reasonable"
- Node Health Check failed. Check check_cpufreq_is_reasonable returned 1
- ERROR: nhc: Health check failed: Check check_cpufreq_is_reasonable returned 1
- 20221129 13:35:13 /usr/libexec/nhc/node-mark-offline fc99 Check check_cpufreq_is_reasonable returned 1
- /usr/libexec/nhc/node-mark-offline: Marking idle fc99 offline: NHC: Check check_cpufreq_is_reasonable returned 1
- ERROR: nhc: Health check failed: Check check_cpufreq_is_reasonable returned 1
- With NHC_CHECK_ALL=1, check_cpufreq_is_reasonable still returns 1, but nhc marks the node back online and reports that the health check completed successfully.
- [root@fc99 ~]# nhc -v
- Node Health Check starting.
- Running check: "export MARK_OFFLINE=1 NHC_CHECK_ALL=1"
- Running check: "check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/dev/shm""
- Running check: "check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/""
- Running check: "check_fs_mount_rw -t "nfs" -s "netapp?:/ArcHome" -f "/home""
- Running check: "check_fs_mount_rw -t "nfs" -s "netapp?:/ArcScratch" -f "/scratch""
- Running check: "check_fs_mount_rw -t "nfs" -s "netapp?:/ArcWork" -f "/work""
- Running check: "check_fs_mount_rw -t "nfs" -s "netapp?:/ArcSoftware" -f "/global/software""
- Running check: "check_fs_mount_rw -t "nfs" -s "bulknetapp?:/Bulk" -f "/bulk""
- Running check: "check_fs_used /scratch 95%"
- Running check: "check_fs_used /home 95%"
- Running check: "check_fs_used /work 95%"
- Running check: "check_fs_used /tmp 95%"
- Running check: "check_fs_used /bulk 95%"
- Running check: "check_file_test -r -w -x -d -k /tmp"
- Running check: "check_hw_swap_free 1048576"
- Running check: "check_file_contents /etc/rocky-release 'Rocky Linux release 8.4 (Green Obsidian)'"
- Running check: "check_file_contents /proc/version 'Linux version 4.18.0-348.23.1.el8_5.x86_64 (mockbuild@dal1-prod-builder001.bld.equ.rockylinux.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-4) (GCC)) #1 SMP Wed Apr 27 15:32:52 UTC 2022'"
- Running check: "check_cmd_output -t 10 -e 'id slurm' -m 'uid=987(slurm) gid=983(slurm) groups=983(slurm)'"
- Running check: "check_cmd_output -t 10 -e 'id munge' -m 'uid=989(munge) gid=984(munge) groups=984(munge)'"
- Running check: "check_cpufreq_is_reasonable"
- 20221129 13:35:24 [slurm] /usr/libexec/nhc/node-mark-online fc99
- /usr/libexec/nhc/node-mark-online: Marking fc99 online and clearing note (NHC: Check check_cpufreq_is_reasonable returned 1)
- Node Health Check completed successfully (2s).
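
For reference, the behaviour expected with NHC_CHECK_ALL=1 can be sketched roughly as follows. This is a minimal illustration only, not NHC's actual code: the function name `run_all_checks` and its messages are hypothetical, and the checks are assumed to be plain commands that return non-zero on failure. The idea is that every check runs, failures are counted, and the node is only marked online when the count is zero.

```shell
#!/bin/bash
# Hypothetical sketch of "run every check, then decide" logic.
# Not taken from NHC; names and messages are illustrative only.

run_all_checks() {
    local failures=0
    local check

    # Run every check even if an earlier one failed.
    for check in "$@"; do
        if ! "$check"; then
            echo "FAILED: $check"
            failures=$((failures + 1))
        fi
    done

    # Only report the node as healthy when nothing failed.
    if [ "$failures" -gt 0 ]; then
        echo "would mark offline: $failures check(s) failed"
        return 1
    fi

    echo "would mark online"
    return 0
}
```

Calling `run_all_checks true false true` prints `FAILED: false` followed by `would mark offline: 1 check(s) failed` and returns 1. The second log above instead shows the equivalent of the "mark online" branch being taken even though one check returned 1.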