check PostgreSql server

Introduction

check_psqlserver is a plugin execute different checks on a postgreSql server instance. The kind of check is defined by a parameter -m METHOD.

Requirements

  • psql (cli tool)
  • The icinga user needs to connect to the database server (see Installation).

Syntax

$ check_psqlserver [-i|-u|-m METHOD]

./check_psqlserver -h
______________________________________________________________________

CHECK_PSQLSERVER :: v0.9

(c) Institute for Medical Education - University of Bern
Licence: GNU GPL 3
______________________________________________________________________

USAGE:
  check_psqlserver [OPTIONS] -m METHOD

OPTIONS:
  -h  this help
  -i  install monitoring user (must be executed as root)
  -u  uninstall monitoring user (must be executed as root)

PARAMETERS:
  -m  method; valid methods are:
      activity        Count running processes and queries
      conflicts       Count of detected conflicts
      dbrows          Count of database row actions
      diskblock       Count of diskblocks physically read or coming from cache
      problems        Count of problems and troublemakers
      replication     Replication status and lag time
      transactions    Count of transactions over all databases

EXAMPLES:
  check_psqlserver -i
  check_psqlserver -m activity

Installation

To give access to the database there is the param -i(for “install”). This command needs to be run as root.

./check_psqlserver -i

This creates a database user “icingamonitor” with a 64 byte random password. To store the credentials a file file be created: /etc/icingaclient/.psql.conf. It is a shell script that will be sourced by check_psqlserver.

cat /etc/icingaclient/.psql.conf 
#
# generated on Thu Jun  8 03:44:06 CEST 2023
#
export PGUSER=icingamonitor
export PGPASSWORD=87B9jUcRp38DgkLDL3uHzC3V8YjgX6KZMxIxqpWieTWWIBgNoFYt8yvK9Y8RmdL0
export PGHOST=localhost

# set default database because a user db won't be created
export PGDATABASE=postgres

To test the connection run ./check_psqlserver -m activity.

If the config was written and the connect fails then search for pg_hba.conf (/var/lib/pgsql/data/pg_hba.conf or /etc/postgresql/13/main/pg_hba.conf). If local authentication for ipv4 and v6 is set to “ident”

host    all     all     127.0.0.1/32    ident

… try to set it to “md5” and restart the pgsql service.

Checks

The checks are done on the server and summarize data from statistic tables for all databases.

see https://www.postgresql.org/docs/current/monitoring-stats.html

If you need to troubleshot and want to see which of your databases causes the trouble you can execute the statement which is ggiven in the hint.

activity

Show count of running processeses and sum of process states. Possible states in pg_stat_activity are:

  • active: The backend is executing a query.
  • idle: The backend is waiting for a new client command.
  • idle in transaction: The backend is in a transaction, but is not currently executing a query.
  • idle in transaction (aborted): This state is similar to idle in transaction, except one of the statements in the transaction caused an error.
  • fastpath function call: The backend is executing a fast-path function.
  • disabled: This state is reported if track_activities is disabled in this backend.

The check summarizes:

  • total - the total count of all processes
  • active - processes with state “active”
  • idle - processes with state “idle”, “idle in transaction” and “idle in transaction (aborted)”
  • fastpath - processes with state “fastpath function call”
  • other - count of psql base processes having no value in state column

The state of the check is always “OK”.

To analyze a troublemaker on high number of processes run select * from pg_stat_activity to see the queries and the database name.

Example output:

./check_psqlserver -m activity
OK: Pgsql activity :: Running total: 33 ... active: 2 idle: 25 fastpath: 0 other: 6
 
Hint: Query for details:
select * from pg_stat_activity.

 |running-total=33;; running-active=2;; running-idle=25;; running-fastpath=0;; running-other=6;; 

conflicts

Show number of detected conflicts from pg_stat_database_conflicts. The values are counters. Therefor there is a calculation per minute to find newly occured changes.

The columns in pg_stat_database_conflicts are:

  • confl_tablespace bigint - Number of queries in this database that have been canceled due to dropped tablespaces
  • confl_lock bigint - Number of queries in this database that have been canceled due to lock timeouts
  • confl_snapshot bigint - Number of queries in this database that have been canceled due to old snapshots
  • confl_bufferpin bigint - Number of queries in this database that have been canceled due to pinned buffers
  • confl_deadlock bigint - Number of queries in this database that have been canceled due to deadlocks

The check summarizes all conflicts of all databases.

The check switches to “critical” if one of the delta values per min is <> 0.

Example output:

./check_psqlserver -m conflicts
OK: Pgsql conflicts ::  ... OK, nothing was found
 confl_tablespace         :               0 ... delta = 0 per min
 confl_lock               :               0 ... delta = 0 per min
 confl_snapshot           :               0 ... delta = 0 per min
 confl_bufferpin          :               0 ... delta = 0 per min
 confl_deadlock           :               0 ... delta = 0 per min
 
Hint: Query for details per database:
select * from pg_stat_database_conflicts.

 |confltablespace=0;; confllock=0;; conflsnapshot=0;; conflbufferpin=0;; confldeadlock=0;; 

dbrows

Count of database row actions.

From pg_stat_database we read the following columns and add them for all databases.

  • tup_fetched bigint - Number of live rows fetched by index scans in this database
  • tup_inserted bigint - Number of rows inserted by queries in this database
  • tup_updated bigint - Number of rows updated by queries in this database
  • tup_deleted bigint - Number of rows deleted by queries in this database

The values are counters. Therefor there is a calculation per sec to find current changes.

The state of the check is always “OK”.

Example output:

./check_psqlserver -m dbrows
OK: Pgsql dbrows :: Count of database row actions
 tup_returned             :    744761392127 ... delta = 0 per sec
 tup_fetched              :     21597149760 ... delta = 0 per sec
 tup_inserted             :        42683720 ... delta = 0 per sec
 tup_updated              :         1282902 ... delta = 0 per sec
 tup_deleted              :         1878883 ... delta = 0 per sec
 
Hint: Query for details per database:
select * from pg_stat_database.

 |tupreturned=0;; tupfetched=0;; tupinserted=0;; tupupdated=0;; tupdeleted=0;; 

diskblock

Count of diskblocks physically read or coming from cache

From pg_stat_database we read the following columns and add them for all databases.

  • blks_read bigint - Number of disk blocks read in this database
  • blks_hit bigint - Number of times disk blocks were found already in the buffer cache, so that a read was not necessary (this only includes hits in the PostgreSQL buffer cache, not the operating system’s file system cache)

The values are counters. Therefor there is a calculation per sec to find current changes.

The state of the check is always “OK”.

Example output:

./check_psqlserver -m diskblock
OK: Pgsql diskblock :: Count of diskblocks physically read or coming from cache (from pg_stat_database)
 blks_read                :     20391240674 ... delta = 439 per sec
 blks_hit                 :    103015908887 ... delta = 6549 per sec
 
Hint: Query for details per database:
select * from pg_stat_database.

 |blksread=439;; blkshit=6549;;

problems

Problems and troublemakers

From pg_stat_database we read the following columns and add them for all databases.

  • conflicts bigint - Number of queries canceled due to conflicts with recovery in this database. (Conflicts occur only on standby servers; see pg_stat_database_conflicts for details.)
  • deadlocks bigint - Number of deadlocks detected in this database
  • checksum_failures bigint - Number of data page checksum failures detected in this database (or on a shared object), or NULL if data checksums are not enabled.
  • temp_files bigint - Number of temporary files created by queries in this database. All temporary files are counted, regardless of why the temporary file was created (e.g., sorting or hashing), and regardless of the log_temp_files setting.
  • temp_bytes bigint - Total amount of data written to temporary files by queries in this database. All temporary files are counted, regardless of why the temporary file was created, and regardless of the log_temp_files setting.

The values are counters. Therefor there is a calculation per min to find current changes.

The state of the check switches to critical if a minimum problem was detected in the delta value.

Example output:

./check_psqlserver -m problems
OK: Pgsql problems :: Problems and troublemakers (from pg_stat_database) ... OK, nothing was found
 conflicts                :               0 ... delta = 0 per min
 deadlocks                :               0 ... delta = 0 per min
 checksumfailures         :               0 ... delta = 0 per min
 temp_files               :         5617739 ... delta = 0 per min
 temp_bytes               :   2896790521548 ... delta = 0 per min
 
Hint: Query for details per database:
select * from pg_stat_database.

 |conflicts=0;; deadlocks=0;; checksumfailures=0;; tempfiles=0;; tempbytes=0;; 

replication

Replication status.

It shows the defined replication and their status.

Aditionally it fetches the maximum lag of write, flush and replay of all replications.

The state of the check switches “warning” if …

  • one of the replications is not “streaming”
  • the maximum lag is larger 1 sec (just experimental).

Example output:

./check_psqlserver -m replication
OK: Pgsql replication :: status
 OK: all replications have the state 'streaming'.
OK: maximum lag is 0.2077 sec (below 1 sec).
 application_name :  client_addr   :   state   :    write_lag    :    flush_lag    :  replay_lag   :    max_lag    : sync_state 
------------------+----------------+-----------+-----------------+-----------------+---------------+---------------+------------
 psqlbackup         : 192.168.10.21 : streaming : 00:00:00.000673 : 00:00:00.003771 : 00:00:00.2077 : 00:00:00.2077 : async
(1 row)
Hint: Query for details per replication:
select * from pg_stat_replication.

 |total=1;; state-streaming=1;; state-other=0;; max-lag=0.2077;; 

transactions

Count of transactions over all databases

From pg_stat_database we fetch these columns and summarize it for all database:

  • xact_commit bigint - Number of transactions in this database that have been committed
  • xact_rollback bigint - Number of transactions in this database that have been rolled back

The values are counters. Therefor there is a calculation per sec to show the current speed.

The state of the check is always “OK”.

Example output:

./check_psqlserver -m transactions
OK: Pgsql transactions :: Count of transactions over all databases
 commit                   :       380856524 ... delta = 0 per sec
 rollback                 :           13173 ... delta = 0 per sec
 
Hint: Query for details per database:
select * from pg_stat_database.

 |commit=0;; rollback=0;; 

Run a query on command line

As root or icingaclient user you can read the configuration for the database monitoring user (created with param -i).

In a terminal you can source the created config file. Then run a query using psql.

Example:

. /etc/icingaclient/.psql.conf
psql -c 'select * from pg_stat_activity'