Check_smartstatus

Script: check_smartstatus

check_smartstatus is a plugin that runs smartctl to verify the health status of all local hard disks and SSDs.

It works on physical machines only.
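A wrapper could skip the check on virtual machines before calling the plugin. The following is a minimal sketch, assuming a systemd-based host where `systemd-detect-virt` is available; the function name `is_physical` and the `DETECT_VIRT` override are hypothetical and not part of check_smartstatus.

```shell
# Sketch only: not part of check_smartstatus.
# Returns 0 when no virtualization is detected, non-zero otherwise.
# DETECT_VIRT is a hypothetical override so the function can be tested
# without systemd; by default systemd-detect-virt is used.
is_physical() {
    local virt
    # systemd-detect-virt prints "none" (and exits non-zero) on bare metal.
    virt=$(${DETECT_VIRT:-systemd-detect-virt} 2>/dev/null) || true
    [ "$virt" = "none" ]
}

# Example:
#   if is_physical; then ./check_smartstatus; fi
```

On hosts without systemd the command is missing and the function reports "not physical", so a different detection method would be needed there.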

Requirements

  • smartctl

The Icinga user (icingaclient in the example below) needs sudo permission to run the smartctl binary without a password:

icingaclient ALL=(ALL) NOPASSWD: /sbin/smartctl
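The path to smartctl can differ between distributions, so the rule may need to be adapted. The following is a minimal sketch that builds the rule for a given user and binary path; the function name `sudoers_rule` and the drop-in file name are assumptions for illustration.

```shell
# Sketch only: build the sudoers rule for a given user and smartctl path.
sudoers_rule() {
    local user=$1 binary=$2
    printf '%s ALL=(ALL) NOPASSWD: %s\n' "$user" "$binary"
}

# Example (run as root); the file name icinga-smartctl is an assumption:
#   sudoers_rule icingaclient "$(command -v smartctl)" > /etc/sudoers.d/icinga-smartctl
#   chmod 0440 /etc/sudoers.d/icinga-smartctl
#   visudo -cf /etc/sudoers.d/icinga-smartctl   # syntax check of the new file
```

Running `visudo -cf` on the new file catches syntax errors before they can lock out sudo.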

Standalone installation

In addition to this script, you need the following file from this repository:

  • inc_pluginfunctions: shared functions for all IML checks written in Bash

Syntax

______________________________________________________________________

CHECK_SMARTSTATUS
v1.6

(c) Institute for Medical Education - University of Bern
Licence: GNU GPL 3

https://os-docs.iml.unibe.ch/icinga-checks/Checks/check_smartstatus.html
______________________________________________________________________

Show status of local S.M.A.R.T. devices.

SYNTAX:
    check_smartstatus [-h] [-l] [devices]

OPTIONS:

    -h|--help            show this help.
    -l|--list            list devices only.

PARAMETERS:

EXAMPLES

    check_smartstatus
      Scan all local disks

    check_smartstatus -l
      List all local disks without scanning them.

Parameters

(none)

Examples

For testing purposes, list the devices without scanning them:

./check_smartstatus -l
Devices to scan:
- /dev/nvme0 -d nvme # /dev/nvme0, NVMe device

Without parameters, check_smartstatus loops over all detected devices and performs a SMART scan on each. The output starts with a status line containing a summary, followed by an output section for each disk.

This is the output of a single SSD:

OK: SMART check on 1 Disks - 0 errors - /dev/nvme0:  PASSED
SMART/Health Information (NVMe Log 0x02)
---------------------------------------------------------------------- 

/dev/nvme0

sudo smartctl -Ha /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.9.2-1-MANJARO] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SKHynix_HFS001TEJ9X162N
Serial Number:                      AJC9N469110209D22
Firmware Version:                   51730A10
PCI Vendor/Subsystem ID:            0x1c5c
IEEE OUI Identifier:                0xace42e
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            ace42e 0035db84db
Local Time is:                      Fri Jun  7 12:59:02 2024 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     86 Celsius
Critical Comp. Temp. Threshold:     87 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.50W       -        -    0  0  0  0        5     305
 1 +   3.9000W       -        -    1  1  1  1       30     330
 2 +   1.5000W       -        -    2  2  2  2      100     400
 3 -   0.0500W       -        -    3  3  3  3      500    1500
 4 -   0.0050W       -        -    4  4  4  4     1000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        43 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    6,589,009 [3.37 TB]
Data Units Written:                 3,879,914 [1.98 TB]
Host Read Commands:                 39,241,205
Host Write Commands:                72,717,841
Controller Busy Time:               2,112
Power Cycles:                       176
Power On Hours:                     642
Unsafe Shutdowns:                   21
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius
Temperature Sensor 2:               37 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

/dev/nvme0 - rc=0

PASSED SMART/Health Information (NVMe Log 0x02)
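The loop over all devices described in this section can be sketched roughly as follows. This is not the actual check_smartstatus code: the function names and the `SMARTCTL` override are hypothetical, the device list is assumed to come from `smartctl --scan`, and any non-zero exit code of `smartctl -H` is counted as an error.

```shell
# Sketch only: not the actual check_smartstatus implementation.

# Print the device path (first field) of each "smartctl --scan" line on stdin.
list_devices() {
    awk 'NF && $1 !~ /^#/ { print $1 }'
}

# Print a summary line and return an Icinga-style code (0 = OK, 2 = CRITICAL).
summary_line() {
    local total=$1 errors=$2
    if [ "$errors" -eq 0 ]; then
        echo "OK: SMART check on $total disks - $errors errors"
        return 0
    fi
    echo "CRITICAL: SMART check on $total disks - $errors errors"
    return 2
}

# Read device paths from stdin and run a health check on each one.
# SMARTCTL is a hypothetical override for testing; default is "sudo smartctl".
check_all() {
    local total=0 errors=0 dev
    while read -r dev; do
        total=$((total + 1))
        # Assumption: any non-zero exit code counts as a failed disk.
        ${SMARTCTL:-sudo smartctl} -H "$dev" </dev/null >/dev/null 2>&1 \
            || errors=$((errors + 1))
    done
    summary_line "$total" "$errors"
}

# Example:
#   sudo smartctl --scan | list_devices | check_all
```

The real plugin additionally prints the full smartctl output per disk, as shown in the example above; the sketch only reproduces the summary line and the exit code.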