check_patroni-2.2.0/.coveragerc

[run]
include =
    check_patroni/*

check_patroni-2.2.0/.flake8

[flake8]
doctests = True
ignore =
    # line too long
    E501,
    # line break before binary operator (added by black)
    W503,
exclude =
    .git,
    .mypy_cache,
    .tox,
    .venv,
mypy_config = mypy.ini

check_patroni-2.2.0/.github/workflows/lint.yml

name: Lint
on: [push, pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - name: Install tox
        run: pip install tox
      - name: Lint (black & flake8)
        run: tox -e lint
      - name: Mypy
        run: tox -e mypy

check_patroni-2.2.0/.github/workflows/publish.yml

name: Publish
on:
  push:
    tags:
      - 'v*'
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.10'
      - name: Install
        run: python -m pip install setuptools wheel twine
      - name: Build
        run: |
          python setup.py check
          python setup.py sdist bdist_wheel
          python -m twine check dist/*
      - name: Publish
        run: python -m twine upload dist/*
        env:
          TWINE_USERNAME: __token__
          TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}

check_patroni-2.2.0/.github/workflows/tests.yml

name: Tests
on: [push, pull_request]
jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
          - python: "3.9"
          - python: "3.13"
    steps:
      - uses: actions/checkout@v2
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python }}
      - name: Install tox
        run: pip install tox
      - name: Test
        run: tox -e py

check_patroni-2.2.0/.gitignore

__pycache__/
check_patroni.egg-info
tests/config.ini
tests/*.state_file
vagrant/.vagrant
vagrant/*.state_file
.*.swp
.coverage
.venv/
.tox/
dist/
build/

check_patroni-2.2.0/CHANGELOG.md

# Change log

## check_patroni 2.2.0 - 2025-02-17

### Added

* Support for quorum synchronous replication (#80)

### Fixed

* Update the documentation to clarify that if patroni cannot be reached, we
  consider it a configuration error and return UNKNOWN (#77, reported by
  @MLyssens)

## check_patroni 2.1.0 - 2024-10-19

### Fixed

* cluster_has_replica now properly accounts for standby leaders (#72,
  reported by @MLyssens)

### Misc

* Update the tests and the documentation to reflect that "master" is replaced
  by "primary" everywhere it's visible in Patroni. We didn't use the term, so
  there is nothing to change in our code.

## check_patroni 2.0.0 - 2024-04-29

### Notice

While fixing `cluster_has_replica`, the definition of a healthy replica was
changed. It's more restrictive now, hence the jump from v1 to v2.
### Changed

* In `cluster_node_count`, a healthy standby, sync replica or standby leader
  cannot be "in archive recovery", because this service doesn't check for lag
  and timelines.

### Added

* Add the timeline in the `cluster_has_replica` perfstats. (#50)
* Add a mention about shell completion support and shell versions in the doc. (#53)
* Add the leader type and whether it's archiving to the `cluster_has_leader` perfstats. (#58)

### Fixed

* Add compatibility with [requests](https://requests.readthedocs.io) version 2.25 and higher.
* Fix what `cluster_has_replica` deems a healthy replica. (#50, reported by @mbanck)
* Fix `cluster_has_replica` to display perfstats for replicas whenever it's possible (healthy or not). (#50)
* Fix `cluster_has_leader` to correctly check for standby leaders. (#58, reported by @mbanck)
* Fix `cluster_node_count` to correctly manage replication states. (#50, reported by @mbanck)

### Misc

* Improve the documentation for `node_is_replica`.
* Improve test coverage by running an HTTP server to fake the Patroni API (#55 by @dlax).
* Work around old pytest versions in type annotations in the test suite.
* Declare compatibility with click version 7.1 (or higher).
* In tests, work around nagiosplugin 1.3.2 not properly handling stdout redirection.

## check_patroni 1.0.0 - 2023-08-28

check_patroni is now tagged as Production/Stable.

### Added

* Add `sync_standby` as a valid replica type for `cluster_has_replica`. (contributed by @mattpoel)
* Add info and options (`--sync-warning` and `--sync-critical`) about sync replicas to `cluster_has_replica`.
* Add a new service `cluster_has_scheduled_action` to warn of any scheduled switchover or restart.
* Add options to `node_is_replica` to check specifically for a synchronous (`--is-sync`) or asynchronous node (`--is-async`).
* Add `standby-leader` as a valid leader type for `cluster_has_leader`.
* Add a new service `node_is_leader` to check if a node is a leader (which includes standby leader nodes).

### Fixed

* Fix the `node_is_alive` check. (#31)
* Fix the `cluster_has_replica` and `cluster_node_count` checks to account for the new replica state `streaming` introduced in v3.0.4. (#28, reported by @log1-c)

### Misc

* Create CHANGELOG.md
* Add tests for the output of the scripts in addition to the return code
* Documentation in CONTRIBUTING.md

## check_patroni 0.2.0 - 2023-03-20

### Added

* Add a `--save` option when state files are used
* Modify `-e/--endpoints` to allow a comma separated list of endpoints (#21, reported by @lihnjo)
* Use requests instead of urllib3 (with extensive help from @dlax)
* Change the way logging is handled (with extensive help from @dlax)

### Fixed

* Reverse the test for `node_is_pending`
* SSL handling

### Misc

* Several documentation fixes and updates
* Use spellcheck and isort
* Remove tests for python 3.6
* Add python tests for python 3.11

## check_patroni 0.1.1 - 2022-07-15

The initial release covers the following checks:

* check a cluster for
  + configuration change
  + presence of a leader
  + presence of a replica
  + maintenance status
* check a node for
  + liveness
  + pending restart status
  + primary status
  + replica status
  + tl change
  + patroni version

check_patroni-2.2.0/CONTRIBUTING.md

# Contributing to check_patroni

Thanks for your interest in contributing to check_patroni.

## Clone Git Repository

Installation from the git repository:

```
$ git clone https://github.com/dalibo/check_patroni.git
$ cd check_patroni
```

Change the branch if necessary.

## Create Python Virtual Environment

You need a dedicated environment: install the dependencies and then
check_patroni itself from the repo:

```
$ python3 -m venv .venv
$ . .venv/bin/activate
(.venv) $ pip3 install .[test]
(.venv) $ pip3 install -r requirements-dev.txt
(.venv) $ check_patroni
```

To quit this env and destroy it:

```
$ deactivate
$ rm -r .venv
```

## Development Environment

A vagrant file is available to create an icinga / opm / grafana stack and
install check_patroni. You can then add a server to the supervision and watch
the graphs in grafana. It's in the `vagrant` directory.

A vagrant file can be found in [this
repository](https://github.com/ioguix/vagrant-patroni) to generate a
patroni/etcd setup.

The `README.md` can be generated with `./docs/make_readme.sh`.

## Executing Tests

Crafting repeatable tests using a live Patroni cluster can be intricate. To
simplify the development process, a fake HTTP server is set up as a test
fixture and serves static files (either from the `tests/json` directory or
from in-memory data). One potential drawback: if the JSON data is incorrect,
or if Patroni has been modified without corresponding updates to the tests
described here, the tests might still pass erroneously.
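Below is a minimal sketch of the idea behind that fixture, assuming a generic
`pytest` + `http.server` setup; the names, payloads and layout of the real
fixture in `tests/` differ.

```python
import http.server
import json
import threading

import pytest


class FakePatroniHandler(http.server.BaseHTTPRequestHandler):
    # Hypothetical canned payloads; the real tests load them from tests/json.
    payloads = {"/cluster": {"members": []}}

    def do_GET(self):
        body = json.dumps(self.payloads.get(self.path, {})).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the test output quiet


@pytest.fixture
def fake_patroni():
    # Bind to port 0 so the OS picks a free port for each test run.
    server = http.server.HTTPServer(("127.0.0.1", 0), FakePatroniHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    yield f"http://127.0.0.1:{server.server_address[1]}"
    server.shutdown()
```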
The tests are executed automatically for each PR using the CI (see
`.github/workflow/lint.yml` and `.github/workflow/tests.yml`).

Running the tests:

* manually:

```bash
pytest --cov tests
```

* or using tox:

```bash
tox -e lint # mypy + flake8 + black + isort + codespell
tox         # pytests and "lint" tests for all supported versions of python
tox -e py   # pytests and "lint" tests for the default version of python
```

Please note that when dealing with any service that checks the state of a
node, the related tests must use the `old_replica_state` fixture to test with
both old (pre 3.0.4) and new replica states.

A bash script, `check_patroni.sh`, is provided to facilitate testing all
services on a Patroni endpoint (`./vagrant/check_patroni.sh`). It requires one
parameter: the endpoint URL that will be used as the argument for the
`-e/--endpoints` option of `check_patroni`. This script essentially compiles a
list of service calls and executes them sequentially. It creates a state file
in the directory from which you run the script.

Here's an example usage:

```bash
./vagrant/check_patroni.sh http://10.20.30.51:8008
```

check_patroni-2.2.0/LICENSE

PostgreSQL Licence

Copyright (c) 2022, DALIBO

Permission to use, copy, modify, and distribute this software and its
documentation for any purpose, without fee, and without a written agreement is
hereby granted, provided that the above copyright notice and this paragraph
and the following two paragraphs appear in all copies.

IN NO EVENT SHALL DALIBO BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL,
INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF
THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF DALIBO HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

DALIBO SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.
THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND DALIBO HAS NO
OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR
MODIFICATIONS.

check_patroni-2.2.0/MANIFEST.in

include *.md
include mypy.ini
include pytest.ini
include tox.ini
include .coveragerc
include .flake8
include pyproject.toml
recursive-include docs *.sh
recursive-include tests *.json
recursive-include tests *.py

check_patroni-2.2.0/README.md

# check_patroni

A nagios plugin for patroni.

## Features

- Check presence of leader, replicas, node counts.
- Check each node for replication status.

```
Usage: check_patroni [OPTIONS] COMMAND [ARGS]...

  Nagios plugin that uses Patroni's REST API to monitor a Patroni cluster.

Options:
  --config FILE         Read option defaults from the specified INI file
                        [default: config.ini]
  -e, --endpoints TEXT  Patroni API endpoint. Can be specified multiple times
                        or as a list of comma separated addresses. The node
                        services checks the status of one node, therefore if
                        several addresses are specified they should point to
                        different interfaces on the same node. The cluster
                        services check the status of the cluster, therefore
                        it's better to give a list of all Patroni node
                        addresses.  [default: http://127.0.0.1:8008]
  --cert_file PATH      File with the client certificate.
  --key_file PATH       File with the client key.
  --ca_file PATH        The CA certificate.
  -v, --verbose         Increase verbosity -v (info)/-vv (warning)/-vvv
                        (debug)
  --version
  --timeout INTEGER     Timeout in seconds for the API queries (0 to disable)
                        [default: 2]
  --help                Show this message and exit.

Commands:
  cluster_config_has_changed    Check if the hash of the configuration...
  cluster_has_leader            Check if the cluster has a leader.
  cluster_has_replica           Check if the cluster has healthy replicas...
  cluster_has_scheduled_action  Check if the cluster has a scheduled...
  cluster_is_in_maintenance     Check if the cluster is in maintenance...
  cluster_node_count            Count the number of nodes in the cluster.
  node_is_alive                 Check if the node is alive, i.e. patroni...
  node_is_leader                Check if the node is a leader node.
  node_is_pending_restart       Check if the node is in pending restart...
  node_is_primary               Check if the node is the primary with the...
  node_is_replica               Check if the node is a replica with no...
  node_patroni_version          Check if the version is equal to the input
  node_tl_has_changed           Check if the timeline has changed.
```

## Install

check_patroni is licensed under the PostgreSQL license.

```
$ pip install git+https://github.com/dalibo/check_patroni.git
```

check_patroni works on python 3.6; we keep it that way because patroni also
supports it and there are still lots of RH 7 variants around. That being said,
python 3.6 has been EOL for ages and there is no support for it in the GitHub
CI.

## Support

If you hit a bug or need help, open a [GitHub
issue](https://github.com/dalibo/check_patroni/issues/new). Dalibo has no
commitment on response time for public free support. Thanks for your
contribution!

## Config file

All global and service specific parameters can be specified via a config file
as follows:

```
[options]
endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008
cert_file = ./ssl/my-cert.pem
key_file = ./ssl/my-key.pem
ca_file = ./ssl/CA-cert.pem
timeout = 0

[options.node_is_replica]
lag=100
```

## Thresholds

The format for the threshold parameters is `[@][start:][end]`.

* `start:` may be omitted if `start == 0`
* `~:` means that start is negative infinity
* If `end` is omitted, infinity is assumed
* To invert the match condition, prefix the range expression with `@`.

A match is found when: `start <= VALUE <= end`.

For example, the following command will raise:

* a warning if there are fewer than 2 replicas, which can be translated to outside of range [2;+INF[
* a critical if there is no replica, which can be translated to outside of range [1;+INF[

```
check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning 2: --critical 1:
```
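These ranges are parsed by the underlying
[nagiosplugin](https://nagiosplugin.readthedocs.io) library. A quick way to
explore the semantics in Python (illustrative snippet, not part of
check_patroni; it assumes `nagiosplugin.Range.match()` behaves as documented):

```python
import nagiosplugin

# "2:" is the range [2;+INF[: values below 2 fall outside and raise an alert.
warning = nagiosplugin.Range("2:")
critical = nagiosplugin.Range("1:")

for replicas in (0, 1, 2, 5):
    # match() returns True when the value is inside the range (no alert).
    print(replicas, warning.match(replicas), critical.match(replicas))
```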
## SSL

Several options are available:

* the server's CA certificate is not available or trusted by the client system:
  * `--ca_file`: your certificate chain `cat CA-certificate server-certificate > cabundle`
* you have a client certificate for authenticating with Patroni's REST API:
  * `--cert_file`: your certificate or the concatenation of your certificate and private key
  * `--key_file`: your private key (optional)

## Shell completion

We use the [click] library which supports shell completion natively.

Shell completion can be added by typing the following command or adding it to
a file specific to your shell of choice.

* for Bash (add to `~/.bashrc`):

```
eval "$(_CHECK_PATRONI_COMPLETE=bash_source check_patroni)"
```

* for Zsh (add to `~/.zshrc`):

```
eval "$(_CHECK_PATRONI_COMPLETE=zsh_source check_patroni)"
```

* for Fish (add to `~/.config/fish/completions/check_patroni.fish`):

```
eval "$(_CHECK_PATRONI_COMPLETE=fish_source check_patroni)"
```

Please note that shell completion is not supported for all shell versions: for
example, only Bash versions 4.4 and newer are supported.

[click]: https://click.palletsprojects.com/en/8.1.x/shell-completion/

## Connection errors and service status

If patroni is not running, we have no way to know if the provided endpoint is
valid, therefore the check returns UNKNOWN.

## Cluster services

### cluster_config_has_changed

```
Usage: check_patroni cluster_config_has_changed [OPTIONS]

  Check if the hash of the configuration has changed.

  Note: either a hash or a state file must be provided for this service to
  work.

  Check:
  * `OK`: The hash didn't change
  * `CRITICAL`: The hash of the configuration has changed compared to the
  input (`--hash`) or last time (`--state_file`)

  Perfdata:
  * `is_configuration_changed` is 1 if the configuration has changed

Options:
  --hash TEXT            A hash to compare with.
  -s, --state-file TEXT  A state file to store the hash of the configuration.
  --save                 Set the current configuration hash as the reference
                         for future calls.
  --help                 Show this message and exit.
```
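With `--state-file`, the first call with `--save` records the configuration
hash and later calls alert on any change. A hypothetical demonstration using
click's test runner (in practice your monitoring agent would run the
`check_patroni` executable directly; the endpoint below is an assumption):

```python
from click.testing import CliRunner

from check_patroni.cli import main

runner = CliRunner()
# First run: store the current configuration hash in the state file.
result = runner.invoke(main, [
    "-e", "http://127.0.0.1:8008",  # assumed Patroni endpoint
    "cluster_config_has_changed",
    "--state-file", "config.state_file",
    "--save",
])
print(result.output)
# Subsequent runs without --save return CRITICAL if the hash has changed.
```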
### cluster_has_leader

```
Usage: check_patroni cluster_has_leader [OPTIONS]

  Check if the cluster has a leader.

  This check applies to any kind of leaders including standby leaders.

  A leader is a node with the "leader" role and a "running" state.

  A standby leader is a node with a "standby_leader" role and a "streaming"
  or "in archive recovery" state. Please note that log shipping could be
  stuck because the WAL are not available or applicable. Patroni doesn't
  provide information about the origin cluster (timeline or lag), so we
  cannot check if there is a problem in that particular case. That's why we
  issue a warning when the node is "in archive recovery". We suggest using
  other supervision tools to do this (eg. check_pgactivity).

  Check:
  * `OK`: if there is a leader node.
  * `WARNING`: if there is a standby leader in archive recovery.
  * `CRITICAL`: otherwise.

  Perfdata:
  * `has_leader` is 1 if there is any kind of leader node, 0 otherwise
  * `is_standby_leader_in_arc_rec` is 1 if the standby leader node is "in
  archive recovery", 0 otherwise
  * `is_standby_leader` is 1 if there is a standby leader node, 0 otherwise
  * `is_leader` is 1 if there is a "classical" leader node, 0 otherwise

Options:
  --help  Show this message and exit.
```

### cluster_has_replica

```
Usage: check_patroni cluster_has_replica [OPTIONS]

  Check if the cluster has healthy replicas and/or if some are sync or
  quorum standbies.

  For patroni (and this check):
  * a replica is `streaming` if the `pg_stat_wal_receiver` says so.
  * a replica is `in archive recovery`, if it's not `streaming` and has a
  `restore_command`.

  A healthy replica:
  * has a `replica`, `quorum_standby` or `sync_standby` role
  * has the same timeline as the leader and
    * is in `running` state (patroni < V3.0.4)
    * is in `streaming` or `in archive recovery` state (patroni >= V3.0.4)
  * has a lag lower or equal to `max_lag`

  Please note that replicas `in archive recovery` could be stuck because the
  WAL are not available or applicable (the server's timeline has diverged
  from the leader's). We already detect the latter but we will miss the
  former. Therefore, it's preferable to check for the lag in addition to the
  healthy state if you rely on log shipping to help lagging standbies to
  catch up.

  Since we require a healthy replica to have the same timeline as the
  leader, it's possible that we raise alerts when the cluster is performing
  a switchover or failover and the standbies are in the process of catching
  up with the new leader. The alert shouldn't last long.

  In PostgreSQL, synchronous replication has two modes: on and quorum, and
  is configured with the GUCs `synchronous_standby_names` and
  `synchronous_commit`. Patroni uses the parameter `synchronous_mode`, which
  can be set to `on`, `quorum` and `off`, and has `synchronous_node_count`
  to configure the synchronous replication factor. Please note that, in
  synchronous replication, the number of servers tagged as
  "{sync|quorum}_standby" (what we measure) is not always equal to
  `synchronous_node_count`.

  Check:
  * `OK`: if the healthy_replica count and their lag are compatible with the
  replica count threshold, and if the synchronous replica count is
  compatible with the sync replica count threshold.
  * `WARNING` / `CRITICAL`: otherwise

  Perfdata:
  * healthy_replica & unhealthy_replica count
  * the number of sync_replica (sync or quorum depending on `--sync-type`),
  they are included in the previous count
  * the lag of each replica labelled with "member name"_lag
  * the timeline of each replica labelled with "member name"_timeline
  * a boolean to tell if the node is a sync standby labelled with "member
  name"_sync

Options:
  -w, --warning TEXT             Warning threshold for the number of healthy
                                 replica nodes.
  -c, --critical TEXT            Critical threshold for the number of healthy
                                 replica nodes.
  --sync-warning TEXT            Warning threshold for the number of sync
                                 replica.
  --sync-critical TEXT           Critical threshold for the number of sync
                                 replica.
  --sync-type [any|sync|quorum]  Synchronous replication mode used to filter
                                 and count sync standbies.  [default: any]
  --max-lag TEXT                 maximum allowed lag
  --help                         Show this message and exit.
```
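The `--max-lag` value accepts a unit suffix; it is converted to bytes by
`check_patroni.convert.size_to_byte` (see the doctests at the end of this
dump). For instance, taken verbatim from those doctests:

```python
from check_patroni.convert import size_to_byte

size_to_byte("5kB")  # 5120
size_to_byte("1TB")  # 1099511627776
```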
### cluster_has_scheduled_action

```
Usage: check_patroni cluster_has_scheduled_action [OPTIONS]

  Check if the cluster has a scheduled action (switchover or restart)

  Check:
  * `OK`: If the cluster has no scheduled action
  * `CRITICAL`: otherwise.

  Perfdata:
  * `scheduled_actions` is 1 if the cluster has scheduled actions.
  * `scheduled_switchover` is 1 if the cluster has a scheduled switchover.
  * `scheduled_restart` counts the number of scheduled restarts in the
  cluster.

Options:
  --help  Show this message and exit.
```

### cluster_is_in_maintenance

```
Usage: check_patroni cluster_is_in_maintenance [OPTIONS]

  Check if the cluster is in maintenance mode or paused.

  Check:
  * `OK`: If the cluster is not in maintenance mode.
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_in_maintenance` is 1 if the cluster is in maintenance mode, 0
  otherwise

Options:
  --help  Show this message and exit.
```

### cluster_node_count

```
Usage: check_patroni cluster_node_count [OPTIONS]

  Count the number of nodes in the cluster.

  The role refers to the role of the server in the cluster. Possible values
  are:
  * leader (master was removed in patroni 4.0.0)
  * replica
  * standby_leader
  * sync_standby
  * quorum_standby
  * demoted
  * promoted
  * uninitialized

  The state refers to the state of PostgreSQL. Possible values are:
  * initializing new cluster, initdb failed
  * running custom bootstrap script, custom bootstrap failed
  * starting, start failed
  * restarting, restart failed
  * running, streaming, in archive recovery
  * stopping, stopped, stop failed
  * creating replica
  * crashed

  The "healthy" check only ensures that:
  * a leader has the running state
  * a standby_leader has the running or streaming (V3.0.4) state
  * a replica, quorum_standby or sync_standby has the running or streaming
  (V3.0.4) state

  Since we don't check the lag or timeline, "in archive recovery" is not
  considered a valid state for this service. See cluster_has_leader and
  cluster_has_replica for specialized checks.

  Check:
  * Compares the number of nodes against the normal and healthy nodes
  warning and critical thresholds.
  * `OK`: If they are not provided.

  Perfdata:
  * `members`: the member count.
  * `healthy_members`: the running and streaming member count.
  * all the roles of the nodes in the cluster with their count (start with
  "role_").
  * all the statuses of the nodes in the cluster with their count (start
  with "state_").

Options:
  -w, --warning TEXT       Warning threshold for the number of nodes.
  -c, --critical TEXT      Critical threshold for the number of nodes.
  --healthy-warning TEXT   Warning threshold for the number of healthy nodes
                           (running + streaming).
  --healthy-critical TEXT  Critical threshold for the number of healthy
                           nodes (running + streaming).
  --help                   Show this message and exit.
```

## Node services

### node_is_alive

```
Usage: check_patroni node_is_alive [OPTIONS]

  Check if the node is alive, i.e. patroni is running.

  This is a liveness check as defined in Patroni's documentation. If patroni
  is not running, we have no way to know if the provided endpoint is valid,
  therefore the check returns UNKNOWN.

  Check:
  * `OK`: If patroni's liveness check returns with HTTP status 200.
  * `CRITICAL`: if patroni's liveness check returns with an HTTP status
  other than 200.

  Perfdata:
  * `is_running` is 1 if patroni is running, 0 otherwise

Options:
  --help  Show this message and exit.
```
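Functionally, this service boils down to an HTTP probe of Patroni's liveness
route. A minimal standalone sketch, assuming the `/liveness` endpoint of the
REST API (this is not how check_patroni is implemented internally):

```python
import requests


def node_is_alive(endpoint: str = "http://127.0.0.1:8008", timeout: int = 2) -> bool:
    """Return True when Patroni's liveness route answers with HTTP 200."""
    try:
        response = requests.get(f"{endpoint}/liveness", timeout=timeout)
    except requests.RequestException:
        # Connection refused or timeout: patroni is not reachable.
        return False
    return response.status_code == 200
```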
### node_is_pending_restart

```
Usage: check_patroni node_is_pending_restart [OPTIONS]

  Check if the node is in pending restart state.

  This situation can arise if the configuration has been modified but
  requires a restart of PostgreSQL to take effect.

  Check:
  * `OK`: if the node has no pending restart tag.
  * `CRITICAL`: otherwise

  Perfdata: `is_pending_restart` is 1 if the node has the pending restart
  tag, 0 otherwise.

Options:
  --help  Show this message and exit.
```

### node_is_leader

```
Usage: check_patroni node_is_leader [OPTIONS]

  Check if the node is a leader node.

  This check applies to any kind of leaders including standby leaders. To
  check explicitly for a standby leader use the `--is-standby-leader`
  option.

  Check:
  * `OK`: if the node is a leader.
  * `CRITICAL`: otherwise

  Perfdata: `is_leader` is 1 if the node is a leader node, 0 otherwise.

Options:
  --is-standby-leader  Check for a standby leader
  --help               Show this message and exit.
```

### node_is_primary

```
Usage: check_patroni node_is_primary [OPTIONS]

  Check if the node is the primary with the leader lock.

  This service is not valid for a standby leader, because this kind of node
  is not a primary.

  Check:
  * `OK`: if the node is a primary with the leader lock.
  * `CRITICAL`: otherwise

  Perfdata: `is_primary` is 1 if the node is a primary with the leader lock,
  0 otherwise.

Options:
  --help  Show this message and exit.
```

### node_is_replica

```
Usage: check_patroni node_is_replica [OPTIONS]

  Check if the node is a replica with no noloadbalance tag.

  It is possible to check if the node is synchronous or asynchronous. If
  nothing is specified, any kind of replica is accepted. When checking for a
  synchronous replica, it's not possible to specify a lag.

  This service is using the following Patroni endpoints: replica,
  asynchronous and synchronous. The first two implement the `lag` tag. For
  these endpoints the state of a replica node doesn't reflect the
  replication state (`streaming` or `in archive recovery`), we only know if
  it's `running`. The timeline is also not checked.

  Therefore, if a cluster is using asynchronous replication, it is
  recommended to check for the lag to detect a divergence as soon as
  possible.

  Check:
  * `OK`: if the node is a running replica with no noloadbalance tag and the
  lag is under the maximum threshold.
  * `CRITICAL`: otherwise

  Perfdata: `is_replica` is 1 if the node is a running replica with no
  noloadbalance tag and the lag is under the maximum threshold, 0 otherwise.

Options:
  --max-lag TEXT                 maximum allowed lag
  --is-sync                      check if the replica is synchronous
  --sync-type [any|sync|quorum]  Synchronous replication mode.  [default:
                                 any]
  --is-async                     check if the replica is asynchronous
  --help                         Show this message and exit.
```

### node_patroni_version

```
Usage: check_patroni node_patroni_version [OPTIONS]

  Check if the version is equal to the input

  Check:
  * `OK`: The version is the same as the input `--patroni-version`
  * `CRITICAL`: otherwise.

  Perfdata:
  * `is_version_ok` is 1 if the version is ok, 0 otherwise

Options:
  --patroni-version TEXT  Patroni version to compare to  [required]
  --help                  Show this message and exit.
```

### node_tl_has_changed

```
Usage: check_patroni node_tl_has_changed [OPTIONS]

  Check if the timeline has changed.

  Note: either a timeline or a state file must be provided for this service
  to work.

  Check:
  * `OK`: The timeline is the same as last time (`--state_file`) or the
  provided timeline (`--timeline`)
  * `CRITICAL`: The timeline is not the same.

  Perfdata:
  * `is_timeline_changed` is 1 if the timeline has changed, 0 otherwise
  * the timeline

Options:
  --timeline TEXT        A timeline number to compare with.
  -s, --state-file TEXT  A state file to store the last timeline number into.
  --save                 Set the current timeline number as the reference for
                         future calls.
  --help                 Show this message and exit.
```
check_patroni-2.2.0/RELEASE.md

# Release HOW TO

## Preparatory changes

* Review the **Unreleased** section, if any, in `CHANGELOG.md`, possibly
  adding any missing items from closed issues, merged pull requests, or
  directly from the git history[^git-changes],
* Rename the **Unreleased** section according to the version to be released,
  with a date,
* Bump the version in `check_patroni/__init__.py`,
* Rebuild the `README.md` (`cd docs; ./make_readme.sh`),
* Commit these changes (either on a dedicated branch, before submitting a
  pull request, or directly on the `master` branch) with the commit message
  `release X.Y.Z`.
* Then, when the changes have landed in the `master` branch, create an
  annotated (and possibly signed) tag, as
  `git tag -a [-s] -m 'release X.Y.Z' vX.Y.Z`, and,
* Push with `--follow-tags`.

[^git-changes]: Use `git log $(git describe --tags --abbrev=0).. --format=%s --reverse` to get commits from the previous tag.

## PyPI package

The package is generated and uploaded to pypi when a `v*` tag is created (see
`.github/workflow/publish.yml`).

Alternatively, the release can be done manually with:

```
tox -e build
tox -e upload
```

## GitHub release

Draft a new release from the release page, choosing the tag just pushed, and
copy the relevant changelog section as the description.

check_patroni-2.2.0/check_patroni/__init__.py

import logging

__version__ = "2.2.0"

_log: logging.Logger = logging.getLogger(__name__)

check_patroni-2.2.0/check_patroni/__main__.py

from .cli import main

if __name__ == "__main__":
    main()

check_patroni-2.2.0/check_patroni/cli.py

import logging
import re
from configparser import ConfigParser
from typing import List

import click
import nagiosplugin

from . import __version__, _log
from .cluster import (
    ClusterConfigHasChanged,
    ClusterConfigHasChangedSummary,
    ClusterHasLeader,
    ClusterHasLeaderSummary,
    ClusterHasReplica,
    ClusterHasScheduledAction,
    ClusterIsInMaintenance,
    ClusterNodeCount,
)
from .convert import size_to_byte
from .node import (
    NodeIsAlive,
    NodeIsAliveSummary,
    NodeIsLeader,
    NodeIsLeaderSummary,
    NodeIsPendingRestart,
    NodeIsPendingRestartSummary,
    NodeIsPrimary,
    NodeIsPrimarySummary,
    NodeIsReplica,
    NodeIsReplicaSummary,
    NodePatroniVersion,
    NodePatroniVersionSummary,
    NodeTLHasChanged,
    NodeTLHasChangedSummary,
)
from .types import ConnectionInfo, Parameters, SyncType

DEFAULT_CFG = "config.ini"
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
_log.addHandler(handler)


def print_version(ctx: click.Context, param: str, value: str) -> None:
    if not value or ctx.resilient_parsing:
        return
    click.echo(f"Version {__version__}")
    ctx.exit()


def configure(ctx: click.Context, param: str, filename: str) -> None:
    """Use a config file for the parameters

    stolen from https://jwodder.github.io/kbits/posts/click-config/
    """
    # FIXME should use click-configfile / click-config-file ?
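    # Illustrative example (hypothetical values, not from the project): given
    #
    #   [options]
    #   timeout = 5
    #   [options.node_is_replica]
    #   lag = 100
    #
    # ctx.default_map ends up as
    #   {"timeout": "5", "node_is_replica": {"lag": "100"}}
    # i.e. keys of [options] become global option defaults and each
    # [options.<command>] section provides that command's option defaults.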
    cfg = ConfigParser()
    cfg.read(filename)
    ctx.default_map = {}
    for sect in cfg.sections():
        command_path = sect.split(".")
        if command_path[0] != "options":
            continue
        defaults = ctx.default_map
        for cmdname in command_path[1:]:
            defaults = defaults.setdefault(cmdname, {})
        defaults.update(cfg[sect])
        try:
            # endpoints is an array of addresses separated by ,
            if isinstance(defaults["endpoints"], str):
                defaults["endpoints"] = re.split(r"\s*,\s*", defaults["endpoints"])
        except KeyError:
            pass


@click.group()
@click.option(
    "--config",
    type=click.Path(dir_okay=False),
    default=DEFAULT_CFG,
    callback=configure,
    is_eager=True,
    expose_value=False,
    help="Read option defaults from the specified INI file",
    show_default=True,
)
@click.option(
    "-e",
    "--endpoints",
    "endpoints",
    type=str,
    multiple=True,
    default=["http://127.0.0.1:8008"],
    help=(
        "Patroni API endpoint. Can be specified multiple times or as a list "
        "of comma separated addresses. "
        "The node services checks the status of one node, therefore if "
        "several addresses are specified they should point to different "
        "interfaces on the same node. The cluster services check the "
        "status of the cluster, therefore it's better to give a list of "
        "all Patroni node addresses."
    ),
    show_default=True,
)
@click.option(
    "--cert_file",
    "cert_file",
    type=click.Path(exists=True),
    default=None,
    help="File with the client certificate.",
)
@click.option(
    "--key_file",
    "key_file",
    type=click.Path(exists=True),
    default=None,
    help="File with the client key.",
)
@click.option(
    "--ca_file",
    "ca_file",
    type=click.Path(exists=True),
    default=None,
    help="The CA certificate.",
)
@click.option(
    "-v",
    "--verbose",
    "verbose",
    count=True,
    default=0,
    help="Increase verbosity -v (info)/-vv (warning)/-vvv (debug)",
    show_default=False,
)
@click.option(
    "--version", is_flag=True, callback=print_version, expose_value=False, is_eager=True
)
@click.option(
    "--timeout",
    "timeout",
    default=2,
    type=int,
    help="Timeout in seconds for the API queries (0 to disable)",
    show_default=True,
)
@click.pass_context
@nagiosplugin.guarded
def main(
    ctx: click.Context,
    endpoints: List[str],
    cert_file: str,
    key_file: str,
    ca_file: str,
    verbose: int,
    timeout: int,
) -> None:
    """Nagios plugin that uses Patroni's REST API to monitor a Patroni cluster."""
    # FIXME Not all "is/has" services have the same return code for ok. Check if it's ok

    # We use this to pass parameters instead of ctx.parent.params because the
    # latter is typed as Optional[Context] and mypy complains with the following
    # error unless we test if ctx.parent is None, which looked ugly.
    #
    # error: Item "None" of "Optional[Context]" has an attribute "params" [union-attr]

    # The config file allows endpoints to be specified as a comma separated
    # list of endpoints. To avoid confusion, we allow the same in command
    # line parameters.
    tendpoints: List[str] = []
    for e in endpoints:
        tendpoints += re.split(r"\s*,\s*", e)
    endpoints = tendpoints

    if verbose == 3:
        logging.getLogger("urllib3").addHandler(handler)
        logging.getLogger("urllib3").setLevel(logging.DEBUG)
        _log.setLevel(logging.DEBUG)

    connection_info: ConnectionInfo
    if cert_file is None and key_file is None:
        connection_info = ConnectionInfo(endpoints, None, ca_file)
    else:
        connection_info = ConnectionInfo(endpoints, (cert_file, key_file), ca_file)

    ctx.obj = Parameters(
        connection_info,
        timeout,
        verbose,
    )


@main.command(name="cluster_node_count")  # required otherwise _ are converted to -
@click.option(
    "-w",
    "--warning",
    "warning",
    type=str,
    help="Warning threshold for the number of nodes.",
)
@click.option(
    "-c",
    "--critical",
    "critical",
    type=str,
    help="Critical threshold for the number of nodes.",
)
@click.option(
    "--healthy-warning",
    "healthy_warning",
    type=str,
    help="Warning threshold for the number of healthy nodes (running + streaming).",
)
@click.option(
    "--healthy-critical",
    "healthy_critical",
    type=str,
    help="Critical threshold for the number of healthy nodes (running + streaming).",
)
@click.pass_context
@nagiosplugin.guarded
def cluster_node_count(
    ctx: click.Context,
    warning: str,
    critical: str,
    healthy_warning: str,
    healthy_critical: str,
) -> None:
    """Count the number of nodes in the cluster.

    \b
    The role refers to the role of the server in the cluster. Possible values
    are:
    * leader (master was removed in patroni 4.0.0)
    * replica
    * standby_leader
    * sync_standby
    * quorum_standby
    * demoted
    * promoted
    * uninitialized

    \b
    The state refers to the state of PostgreSQL. Possible values are:
    * initializing new cluster, initdb failed
    * running custom bootstrap script, custom bootstrap failed
    * starting, start failed
    * restarting, restart failed
    * running, streaming, in archive recovery
    * stopping, stopped, stop failed
    * creating replica
    * crashed

    \b
    The "healthy" check only ensures that:
    * a leader has the running state
    * a standby_leader has the running or streaming (V3.0.4) state
    * a replica, quorum_standby or sync_standby has the running or streaming
      (V3.0.4) state

    Since we don't check the lag or timeline, "in archive recovery" is not
    considered a valid state for this service. See cluster_has_leader and
    cluster_has_replica for specialized checks.

    \b
    Check:
    * Compares the number of nodes against the normal and healthy nodes
      warning and critical thresholds.
    * `OK`: If they are not provided.

    \b
    Perfdata:
    * `members`: the member count.
    * `healthy_members`: the running and streaming member count.
    * all the roles of the nodes in the cluster with their count (start with
      "role_").
    * all the statuses of the nodes in the cluster with their count (start
      with "state_").
    """
    check = nagiosplugin.Check()
    check.add(
        ClusterNodeCount(ctx.obj.connection_info),
        nagiosplugin.ScalarContext(
            "members",
            warning,
            critical,
        ),
        nagiosplugin.ScalarContext(
            "healthy_members",
            healthy_warning,
            healthy_critical,
        ),
        nagiosplugin.ScalarContext("member_roles"),
        nagiosplugin.ScalarContext("member_statuses"),
    )
    check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)


@main.command(name="cluster_has_leader")
@click.pass_context
@nagiosplugin.guarded
def cluster_has_leader(ctx: click.Context) -> None:
    """Check if the cluster has a leader.
    This check applies to any kind of leaders including standby leaders.

    A leader is a node with the "leader" role and a "running" state.

    A standby leader is a node with a "standby_leader" role and a "streaming"
    or "in archive recovery" state. Please note that log shipping could be
    stuck because the WAL are not available or applicable. Patroni doesn't
    provide information about the origin cluster (timeline or lag), so we
    cannot check if there is a problem in that particular case. That's why we
    issue a warning when the node is "in archive recovery". We suggest using
    other supervision tools to do this (eg. check_pgactivity).

    \b
    Check:
    * `OK`: if there is a leader node.
    * `WARNING`: if there is a standby leader in archive recovery.
    * `CRITICAL`: otherwise.

    \b
    Perfdata:
    * `has_leader` is 1 if there is any kind of leader node, 0 otherwise
    * `is_standby_leader_in_arc_rec` is 1 if the standby leader node is "in
      archive recovery", 0 otherwise
    * `is_standby_leader` is 1 if there is a standby leader node, 0 otherwise
    * `is_leader` is 1 if there is a "classical" leader node, 0 otherwise
    """
    check = nagiosplugin.Check()
    check.add(
        ClusterHasLeader(ctx.obj.connection_info),
        nagiosplugin.ScalarContext("has_leader", None, "@0:0"),
        nagiosplugin.ScalarContext("is_standby_leader_in_arc_rec", "@1:1", None),
        nagiosplugin.ScalarContext("is_leader", None, None),
        nagiosplugin.ScalarContext("is_standby_leader", None, None),
        ClusterHasLeaderSummary(),
    )
    check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)


@main.command(name="cluster_has_replica")
@click.option(
    "-w",
    "--warning",
    "warning",
    type=str,
    help="Warning threshold for the number of healthy replica nodes.",
)
@click.option(
    "-c",
    "--critical",
    "critical",
    type=str,
    help="Critical threshold for the number of healthy replica nodes.",
)
@click.option(
    "--sync-warning",
    "sync_warning",
    type=str,
    help="Warning threshold for the number of sync replica.",
)
@click.option(
    "--sync-critical",
    "sync_critical",
    type=str,
    help="Critical threshold for the number of sync replica.",
)
@click.option(
    "--sync-type",
    type=click.Choice(["any", "sync", "quorum"], case_sensitive=True),
    default="any",
    show_default=True,
    help="Synchronous replication mode used to filter and count sync standbies.",
)
@click.option("--max-lag", "max_lag", type=str, help="maximum allowed lag")
@click.pass_context
@nagiosplugin.guarded
def cluster_has_replica(
    ctx: click.Context,
    warning: str,
    critical: str,
    sync_warning: str,
    sync_critical: str,
    sync_type: SyncType,
    max_lag: str,
) -> None:
    """Check if the cluster has healthy replicas and/or if some are sync or
    quorum standbies.

    \b
    For patroni (and this check):
    * a replica is `streaming` if the `pg_stat_wal_receiver` says so.
    * a replica is `in archive recovery`, if it's not `streaming` and has a
      `restore_command`.

    \b
    A healthy replica:
    * has a `replica`, `quorum_standby` or `sync_standby` role
    * has the same timeline as the leader and
      * is in `running` state (patroni < V3.0.4)
      * is in `streaming` or `in archive recovery` state (patroni >= V3.0.4)
    * has a lag lower or equal to `max_lag`

    Please note that replicas `in archive recovery` could be stuck because
    the WAL are not available or applicable (the server's timeline has
    diverged from the leader's). We already detect the latter but we will
    miss the former. Therefore, it's preferable to check for the lag in
    addition to the healthy state if you rely on log shipping to help lagging
    standbies to catch up.
    Since we require a healthy replica to have the same timeline as the
    leader, it's possible that we raise alerts when the cluster is performing
    a switchover or failover and the standbies are in the process of catching
    up with the new leader. The alert shouldn't last long.

    In PostgreSQL, synchronous replication has two modes: on and quorum, and
    is configured with the GUCs `synchronous_standby_names` and
    `synchronous_commit`. Patroni uses the parameter `synchronous_mode`,
    which can be set to `on`, `quorum` and `off`, and has
    `synchronous_node_count` to configure the synchronous replication factor.
    Please note that, in synchronous replication, the number of servers
    tagged as "{sync|quorum}_standby" (what we measure) is not always equal
    to `synchronous_node_count`.

    \b
    Check:
    * `OK`: if the healthy_replica count and their lag are compatible with
      the replica count threshold, and if the synchronous replica count is
      compatible with the sync replica count threshold.
    * `WARNING` / `CRITICAL`: otherwise

    \b
    Perfdata:
    * healthy_replica & unhealthy_replica count
    * the number of sync_replica (sync or quorum depending on `--sync-type`),
      they are included in the previous count
    * the lag of each replica labelled with "member name"_lag
    * the timeline of each replica labelled with "member name"_timeline
    * a boolean to tell if the node is a sync standby labelled with "member
      name"_sync
    """
    tmax_lag = size_to_byte(max_lag) if max_lag is not None else None
    check = nagiosplugin.Check()
    check.add(
        ClusterHasReplica(ctx.obj.connection_info, tmax_lag, sync_type),
        nagiosplugin.ScalarContext(
            "healthy_replica",
            warning,
            critical,
        ),
        nagiosplugin.ScalarContext(
            "sync_replica",
            sync_warning,
            sync_critical,
        ),
        nagiosplugin.ScalarContext("unhealthy_replica"),
        nagiosplugin.ScalarContext("replica_lag"),
        nagiosplugin.ScalarContext("replica_timeline"),
        nagiosplugin.ScalarContext("replica_sync"),
    )
    check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)


@main.command(name="cluster_config_has_changed")
@click.option("--hash", "config_hash", type=str, help="A hash to compare with.")
@click.option(
    "-s",
    "--state-file",
    "state_file",
    type=str,
    help="A state file to store the hash of the configuration.",
)
@click.option(
    "--save",
    "save_config",
    is_flag=True,
    default=False,
    help="Set the current configuration hash as the reference for future calls.",
)
@click.pass_context
@nagiosplugin.guarded
def cluster_config_has_changed(
    ctx: click.Context, config_hash: str, state_file: str, save_config: bool
) -> None:
    """Check if the hash of the configuration has changed.

    Note: either a hash or a state file must be provided for this service to
    work.
    \b
    Check:
    * `OK`: The hash didn't change
    * `CRITICAL`: The hash of the configuration has changed compared to the
      input (`--hash`) or last time (`--state_file`)

    \b
    Perfdata:
    * `is_configuration_changed` is 1 if the configuration has changed
    """
    # Note: hash cannot be in the perf data = not a number
    if (config_hash is None and state_file is None) or (
        config_hash is not None and state_file is not None
    ):
        raise click.UsageError(
            "Either --hash or --state-file should be provided for this service", ctx
        )

    old_config_hash = config_hash
    if state_file is not None:
        cookie = nagiosplugin.Cookie(state_file)
        cookie.open()
        old_config_hash = cookie.get("hash")
        cookie.close()

    check = nagiosplugin.Check()
    check.add(
        ClusterConfigHasChanged(
            ctx.obj.connection_info, old_config_hash, state_file, save_config
        ),
        nagiosplugin.ScalarContext("is_configuration_changed", None, "@1:1"),
        ClusterConfigHasChangedSummary(old_config_hash),
    )
    check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)


@main.command(name="cluster_is_in_maintenance")
@click.pass_context
@nagiosplugin.guarded
def cluster_is_in_maintenance(ctx: click.Context) -> None:
    """Check if the cluster is in maintenance mode or paused.

    \b
    Check:
    * `OK`: If the cluster is not in maintenance mode.
    * `CRITICAL`: otherwise.

    \b
    Perfdata:
    * `is_in_maintenance` is 1 if the cluster is in maintenance mode, 0
      otherwise
    """
    check = nagiosplugin.Check()
    check.add(
        ClusterIsInMaintenance(ctx.obj.connection_info),
        nagiosplugin.ScalarContext("is_in_maintenance", None, "0:0"),
    )
    check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)


@main.command(name="cluster_has_scheduled_action")
@click.pass_context
@nagiosplugin.guarded
def cluster_has_scheduled_action(ctx: click.Context) -> None:
    """Check if the cluster has a scheduled action (switchover or restart)

    \b
    Check:
    * `OK`: If the cluster has no scheduled action
    * `CRITICAL`: otherwise.

    \b
    Perfdata:
    * `scheduled_actions` is 1 if the cluster has scheduled actions.
    * `scheduled_switchover` is 1 if the cluster has a scheduled switchover.
    * `scheduled_restart` counts the number of scheduled restarts in the
      cluster.
    """
    check = nagiosplugin.Check()
    check.add(
        ClusterHasScheduledAction(ctx.obj.connection_info),
        nagiosplugin.ScalarContext("has_scheduled_actions", None, "0:0"),
        nagiosplugin.ScalarContext("scheduled_switchover"),
        nagiosplugin.ScalarContext("scheduled_restart"),
    )
    check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)


@main.command(name="node_is_primary")
@click.pass_context
@nagiosplugin.guarded
def node_is_primary(ctx: click.Context) -> None:
    """Check if the node is the primary with the leader lock.

    This service is not valid for a standby leader, because this kind of node
    is not a primary.

    \b
    Check:
    * `OK`: if the node is a primary with the leader lock.
    * `CRITICAL`: otherwise

    Perfdata: `is_primary` is 1 if the node is a primary with the leader
    lock, 0 otherwise.
    """
    check = nagiosplugin.Check()
    check.add(
        NodeIsPrimary(ctx.obj.connection_info),
        nagiosplugin.ScalarContext("is_primary", None, "@0:0"),
        NodeIsPrimarySummary(),
    )
    check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)


@main.command(name="node_is_leader")
@click.option(
    "--is-standby-leader",
    "check_standby_leader",
    is_flag=True,
    default=False,
    help="Check for a standby leader",
)
@click.pass_context
@nagiosplugin.guarded
def node_is_leader(ctx: click.Context, check_standby_leader: bool) -> None:
    """Check if the node is a leader node.

    This check applies to any kind of leaders including standby leaders.
    To check explicitly for a standby leader use the `--is-standby-leader`
    option.

    \b
    Check:
    * `OK`: if the node is a leader.
    * `CRITICAL`: otherwise

    Perfdata: `is_leader` is 1 if the node is a leader node, 0 otherwise.
    """
    check = nagiosplugin.Check()
    check.add(
        NodeIsLeader(ctx.obj.connection_info, check_standby_leader),
        nagiosplugin.ScalarContext("is_leader", None, "@0:0"),
        NodeIsLeaderSummary(check_standby_leader),
    )
    check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)


@main.command(name="node_is_replica")
@click.option("--max-lag", "max_lag", type=str, help="maximum allowed lag")
@click.option(
    "--is-sync",
    "check_is_sync",
    is_flag=True,
    default=False,
    help="check if the replica is synchronous",
)
@click.option(
    "--sync-type",
    type=click.Choice(["any", "sync", "quorum"], case_sensitive=True),
    default="any",
    show_default=True,
    help="Synchronous replication mode.",
)
@click.option(
    "--is-async",
    "check_is_async",
    is_flag=True,
    default=False,
    help="check if the replica is asynchronous",
)
@click.pass_context
@nagiosplugin.guarded
def node_is_replica(
    ctx: click.Context,
    max_lag: str,
    check_is_sync: bool,
    check_is_async: bool,
    sync_type: SyncType,
) -> None:
    """Check if the node is a replica with no noloadbalance tag.

    It is possible to check if the node is synchronous or asynchronous. If
    nothing is specified, any kind of replica is accepted. When checking for
    a synchronous replica, it's not possible to specify a lag.

    This service is using the following Patroni endpoints: replica,
    asynchronous and synchronous. The first two implement the `lag` tag. For
    these endpoints the state of a replica node doesn't reflect the
    replication state (`streaming` or `in archive recovery`), we only know if
    it's `running`. The timeline is also not checked.

    Therefore, if a cluster is using asynchronous replication, it is
    recommended to check for the lag to detect a divergence as soon as
    possible.

    \b
    Check:
    * `OK`: if the node is a running replica with no noloadbalance tag and
      the lag is under the maximum threshold.
    * `CRITICAL`: otherwise

    Perfdata: `is_replica` is 1 if the node is a running replica with no
    noloadbalance tag and the lag is under the maximum threshold, 0
    otherwise.
    """
    if check_is_sync and max_lag is not None:
        raise click.UsageError(
            "--is-sync and --max-lag cannot be provided at the same time for this service",
            ctx,
        )

    if check_is_sync and check_is_async:
        raise click.UsageError(
            "--is-sync and --is-async cannot be provided at the same time for this service",
            ctx,
        )

    check = nagiosplugin.Check()
    check.add(
        NodeIsReplica(
            ctx.obj.connection_info, max_lag, check_is_sync, check_is_async, sync_type
        ),
        nagiosplugin.ScalarContext("is_replica", None, "@0:0"),
        NodeIsReplicaSummary(max_lag, check_is_sync, check_is_async, sync_type),
    )
    check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)


@main.command(name="node_is_pending_restart")
@click.pass_context
@nagiosplugin.guarded
def node_is_pending_restart(ctx: click.Context) -> None:
    """Check if the node is in pending restart state.

    This situation can arise if the configuration has been modified but
    requires a restart of PostgreSQL to take effect.

    \b
    Check:
    * `OK`: if the node has no pending restart tag.
    * `CRITICAL`: otherwise

    Perfdata: `is_pending_restart` is 1 if the node has the pending restart
    tag, 0 otherwise.
""" check = nagiosplugin.Check() check.add( NodeIsPendingRestart(ctx.obj.connection_info), nagiosplugin.ScalarContext("is_pending_restart", None, "0:0"), NodeIsPendingRestartSummary(), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="node_tl_has_changed") @click.option( "--timeline", "timeline", type=str, help="A timeline number to compare with." ) @click.option( "-s", "--state-file", "state_file", type=str, help="A state file to store the last tl number into.", ) @click.option( "--save", "save_tl", is_flag=True, default=False, help="Set the current timeline number as the reference for future calls.", ) @click.pass_context @nagiosplugin.guarded def node_tl_has_changed( ctx: click.Context, timeline: str, state_file: str, save_tl: bool ) -> None: """Check if the timeline has changed. Note: either a timeline or a state file must be provided for this service to work. \b Check: * `OK`: The timeline is the same as last time (`--state_file`) or the inputted timeline (`--timeline`) * `CRITICAL`: The tl is not the same. \b Perfdata: * `is_timeline_changed` is 1 if the tl has changed, 0 otherwise * the timeline """ if (timeline is None and state_file is None) or ( timeline is not None and state_file is not None ): raise click.UsageError( "Either --timeline or --state-file should be provided for this service", ctx ) old_timeline = timeline if state_file is not None: cookie = nagiosplugin.Cookie(state_file) cookie.open() old_timeline = cookie.get("timeline") cookie.close() check = nagiosplugin.Check() check.add( NodeTLHasChanged(ctx.obj.connection_info, old_timeline, state_file, save_tl), nagiosplugin.ScalarContext("is_timeline_changed", None, "@1:1"), nagiosplugin.ScalarContext("timeline"), NodeTLHasChangedSummary(old_timeline), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="node_patroni_version") @click.option( "--patroni-version", "patroni_version", type=str, help="Patroni version to compare to", required=True, ) @click.pass_context @nagiosplugin.guarded def node_patroni_version(ctx: click.Context, patroni_version: str) -> None: """Check if the version is equal to the input \b Check: * `OK`: The version is the same as the input `--patroni-version` * `CRITICAL`: otherwise. \b Perfdata: * `is_version_ok` is 1 if version is ok, 0 otherwise """ # TODO the version cannot be written in perfdata find something else ? check = nagiosplugin.Check() check.add( NodePatroniVersion(ctx.obj.connection_info, patroni_version), nagiosplugin.ScalarContext("is_version_ok", None, "@0:0"), nagiosplugin.ScalarContext("patroni_version"), NodePatroniVersionSummary(patroni_version), ) check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout) @main.command(name="node_is_alive") @click.pass_context @nagiosplugin.guarded def node_is_alive(ctx: click.Context) -> None: """Check if the node is alive ie patroni is running. This is a liveness check as defined in Patroni's documentation. If patroni is not running, we have no way to know if the provided endpoint is valid, therefore the check returns UNKNOWN. \b Check: * `OK`: If patroni the liveness check returns with HTTP status 200. * `CRITICAL`: if partoni's liveness check returns with an HTTP status other than 200. 
    \b
    Perfdata:
    * `is_running` is 1 if patroni is running, 0 otherwise
    """
    check = nagiosplugin.Check()
    check.add(
        NodeIsAlive(ctx.obj.connection_info),
        nagiosplugin.ScalarContext("is_alive", None, "@0:0"),
        NodeIsAliveSummary(),
    )
    check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)

check_patroni-2.2.0/check_patroni/cluster.py

import hashlib
import json
from collections import Counter
from typing import Any, Iterable, Union

import nagiosplugin

from . import _log
from .types import ConnectionInfo, PatroniResource, SyncType, handle_unknown


def replace_chars(text: str) -> str:
    return text.replace("'", "").replace(" ", "_")


class ClusterNodeCount(PatroniResource):
    def probe(self) -> Iterable[nagiosplugin.Metric]:
        def debug_member(member: Any, health: str) -> None:
            _log.debug(
                "Node %(node_name)s is %(health)s: role %(role)s state %(state)s.",
                {
                    "node_name": member["name"],
                    "health": health,
                    "role": member["role"],
                    "state": member["state"],
                },
            )

        # get the cluster info
        item_dict = self.rest_api("cluster")

        role_counters: Counter[str] = Counter()
        roles = []
        status_counters: Counter[str] = Counter()
        statuses = []
        healthy_member = 0

        for member in item_dict["members"]:
            state, role = member["state"], member["role"]
            roles.append(replace_chars(role))
            statuses.append(replace_chars(state))

            if role == "leader" and state == "running":
                healthy_member += 1
                debug_member(member, "healthy")
                continue

            if role in [
                "standby_leader",
                "replica",
                "sync_standby",
                "quorum_standby",
            ] and (
                (self.has_detailed_states() and state == "streaming")
                or (not self.has_detailed_states() and state == "running")
            ):
                healthy_member += 1
                debug_member(member, "healthy")
                continue

            debug_member(member, "unhealthy")

        role_counters.update(roles)
        status_counters.update(statuses)

        # The actual check: members, healthy_members
        yield nagiosplugin.Metric("members", len(item_dict["members"]))
        yield nagiosplugin.Metric("healthy_members", healthy_member)

        # The performance data: roles
        for role in role_counters:
            yield nagiosplugin.Metric(
                f"role_{role}", role_counters[role], context="member_roles"
            )

        # The performance data: statuses
        for state in status_counters:
            yield nagiosplugin.Metric(
                f"state_{state}", status_counters[state], context="member_statuses"
            )


class ClusterHasLeader(PatroniResource):
    def probe(self) -> Iterable[nagiosplugin.Metric]:
        item_dict = self.rest_api("cluster")

        is_leader_found = False
        is_standby_leader_found = False
        is_standby_leader_in_arc_rec = False
        for member in item_dict["members"]:
            if member["role"] == "leader" and member["state"] == "running":
                is_leader_found = True
                break

            if member["role"] == "standby_leader":
                if member["state"] not in ["streaming", "in archive recovery"]:
                    # for patroni >= 3.0.4 any other state would be wrong
                    # for patroni < 3.0.4 a state different from running would be wrong
                    if self.has_detailed_states() or member["state"] != "running":
                        continue

                if member["state"] in ["in archive recovery"]:
                    is_standby_leader_in_arc_rec = True

                is_standby_leader_found = True
                break

        return [
            nagiosplugin.Metric(
                "has_leader",
                1 if is_leader_found or is_standby_leader_found else 0,
            ),
            nagiosplugin.Metric(
                "is_standby_leader_in_arc_rec",
                1 if is_standby_leader_in_arc_rec else 0,
            ),
            nagiosplugin.Metric(
                "is_standby_leader",
                1 if is_standby_leader_found else 0,
            ),
            nagiosplugin.Metric(
                "is_leader",
                1 if is_leader_found else 0,
            ),
        ]


class ClusterHasLeaderSummary(nagiosplugin.Summary):
    def ok(self, results: nagiosplugin.Result) -> str:
return "The cluster has a running leader." @handle_unknown def problem(self, results: nagiosplugin.Result) -> str: return "The cluster has no running leader or the standby leader is in archive recovery." class ClusterHasReplica(PatroniResource): def __init__( self, connection_info: ConnectionInfo, max_lag: Union[int, None], sync_type: SyncType, ): super().__init__(connection_info) self.max_lag = max_lag self.sync_type = sync_type def probe(self) -> Iterable[nagiosplugin.Metric]: def debug_member(member: Any, health: str) -> None: _log.debug( "Node %(node_name)s is %(health)s: lag %(lag)s, state %(state)s, tl %(tl)s.", { "node_name": member["name"], "health": health, "lag": member["lag"], "state": member["state"], "tl": member["timeline"], }, ) # get the cluster info cluster_item_dict = self.rest_api("cluster") replicas = [] healthy_replica = 0 unhealthy_replica = 0 sync_replica = 0 leader_tl = None # Look for replicas for member in cluster_item_dict["members"]: if member["role"] in ["replica", "sync_standby", "quorum_standby"]: if member["lag"] == "unknown": # This could happen if the node is stopped # nagiosplugin doesn't handle strings in perfstats # so we have to ditch all the stats in that case debug_member(member, "unhealthy") unhealthy_replica += 1 continue else: replicas.append( { "name": member["name"], "lag": member["lag"], "timeline": member["timeline"], "sync": ( 1 if member["role"] in ["sync_standby", "quorum_standby"] else 0 ), } ) # Get the leader tl if we haven't already if leader_tl is None: # If there are no leaders, we will loop here for all # members because leader_tl will remain None. it's not # a big deal since having no leader is rare. for tmember in cluster_item_dict["members"]: if tmember["role"] in ["leader", "standby_leader"]: leader_tl = int(tmember["timeline"]) break _log.debug( "Patroni's leader_timeline is %(leader_tl)s", { "leader_tl": leader_tl, }, ) # Test for an unhealthy replica if ( self.has_detailed_states() and not ( member["state"] in ["streaming", "in archive recovery"] and int(member["timeline"]) == leader_tl ) ) or ( not self.has_detailed_states() and not ( member["state"] == "running" and int(member["timeline"]) == leader_tl ) ): debug_member(member, "unhealthy") unhealthy_replica += 1 continue if ( self.sync_type in ["sync", "any"] and member["role"] == "sync_standby" ) or ( self.sync_type in ["quorum", "any"] and member["role"] == "quorum_standby" ): sync_replica += 1 if self.max_lag is None or self.max_lag >= int(member["lag"]): debug_member(member, "healthy") healthy_replica += 1 else: debug_member(member, "unhealthy") unhealthy_replica += 1 # The actual check yield nagiosplugin.Metric("healthy_replica", healthy_replica) yield nagiosplugin.Metric("sync_replica", sync_replica) # The performance data : unhealthy replica count, replicas lag yield nagiosplugin.Metric("unhealthy_replica", unhealthy_replica) for replica in replicas: yield nagiosplugin.Metric( f"{replica['name']}_lag", replica["lag"], context="replica_lag" ) yield nagiosplugin.Metric( f"{replica['name']}_timeline", replica["timeline"], context="replica_timeline", ) yield nagiosplugin.Metric( f"{replica['name']}_sync", replica["sync"], context="replica_sync" ) # FIXME is this needed ?? 
# class ClusterHasReplicaSummary(nagiosplugin.Summary): # def ok(self, results): # def problem(self, results): class ClusterConfigHasChanged(PatroniResource): def __init__( self, connection_info: ConnectionInfo, config_hash: str, # Always contains the old hash state_file: str, # Only used to update the hash in the state_file (when needed) save: bool = False, # Save the new hash in the state file ): super().__init__(connection_info) self.state_file = state_file self.config_hash = config_hash self.save = save def probe(self) -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("config") new_hash = hashlib.md5(json.dumps(item_dict).encode()).hexdigest() _log.debug("save result: %(issave)s", {"issave": self.save}) old_hash = self.config_hash if self.state_file is not None and self.save: _log.debug( "saving new hash to state file / cookie %(state_file)s", {"state_file": self.state_file}, ) cookie = nagiosplugin.Cookie(self.state_file) cookie.open() cookie["hash"] = new_hash cookie.commit() cookie.close() _log.debug( "hash info: old hash %(old_hash)s, new hash %(new_hash)s", {"old_hash": old_hash, "new_hash": new_hash}, ) return [ nagiosplugin.Metric( "is_configuration_changed", 1 if new_hash != old_hash else 0, ) ] class ClusterConfigHasChangedSummary(nagiosplugin.Summary): def __init__(self, config_hash: str) -> None: self.old_config_hash = config_hash # Note: It would be helpful to display the old / new hash here. Unfortunately, it's not a metric. # So we only have the old / expected one. def ok(self, results: nagiosplugin.Result) -> str: return f"The hash of patroni's dynamic configuration has not changed ({self.old_config_hash})." @handle_unknown def problem(self, results: nagiosplugin.Result) -> str: return f"The hash of patroni's dynamic configuration has changed. The old hash was {self.old_config_hash}." class ClusterIsInMaintenance(PatroniResource): def probe(self) -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("cluster") # The actual check return [ nagiosplugin.Metric( "is_in_maintenance", 1 if "pause" in item_dict and item_dict["pause"] else 0, ) ] class ClusterHasScheduledAction(PatroniResource): def probe(self) -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("cluster") scheduled_switchover = 0 scheduled_restart = 0 if "scheduled_switchover" in item_dict: scheduled_switchover = 1 for member in item_dict["members"]: if "scheduled_restart" in member: scheduled_restart += 1 # The actual check yield nagiosplugin.Metric( "has_scheduled_actions", 1 if (scheduled_switchover + scheduled_restart) > 0 else 0, ) # The performance data : scheduled_switchover, scheduled action count yield nagiosplugin.Metric("scheduled_switchover", scheduled_switchover) yield nagiosplugin.Metric("scheduled_restart", scheduled_restart) check_patroni-2.2.0/check_patroni/convert.py000066400000000000000000000027231475506406400212060ustar00rootroot00000000000000import re from typing import Tuple, Union import click def size_to_byte(value: str) -> int: """Convert any size to bytes >>> size_to_byte('1TB') 1099511627776 >>> size_to_byte('5kB') 5120 >>> size_to_byte('.5kB') 512 >>> size_to_byte('.5 yoyo') Traceback (most recent call last): ...
click.exceptions.BadParameter: Invalid unit for size .5 yoyo """ convert = { "B": 1, "kB": 1024, "MB": 1024 * 1024, "GB": 1024 * 1024 * 1024, "TB": 1024 * 1024 * 1024 * 1024, } val, unit = strtod(value) if val is None: val = 1 if unit is None: # No unit, all good # we can round, half bytes don't really make sense return round(val) else: try: multiplicateur = convert[unit] except KeyError: raise click.BadParameter(f"Invalid unit for size {value}") # we can round, half bytes don't really make sense return round(val * multiplicateur) DBL_RE = re.compile(r"^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?") def strtod(value: str) -> Tuple[Union[float, None], Union[str, None]]: """As close as possible an equivalent of the strtod(3) function used by postgres to parse parameter values. >>> strtod(' A ') == (None, 'A') True """ value = str(value).strip() match = DBL_RE.match(value) if match: end = match.end() return float(value[:end]), value[end:] return None, value check_patroni-2.2.0/check_patroni/node.py000066400000000000000000000215751475506406400204550ustar00rootroot00000000000000from typing import Iterable import nagiosplugin from . import _log from .types import APIError, ConnectionInfo, PatroniResource, SyncType, handle_unknown class NodeIsPrimary(PatroniResource): def probe(self) -> Iterable[nagiosplugin.Metric]: try: self.rest_api("primary") except APIError: return [nagiosplugin.Metric("is_primary", 0)] return [nagiosplugin.Metric("is_primary", 1)] class NodeIsPrimarySummary(nagiosplugin.Summary): def ok(self, results: nagiosplugin.Result) -> str: return "This node is the primary with the leader lock." @handle_unknown def problem(self, results: nagiosplugin.Result) -> str: return "This node is not the primary with the leader lock." class NodeIsLeader(PatroniResource): def __init__( self, connection_info: ConnectionInfo, check_is_standby_leader: bool ) -> None: super().__init__(connection_info) self.check_is_standby_leader = check_is_standby_leader def probe(self) -> Iterable[nagiosplugin.Metric]: apiname = "leader" if self.check_is_standby_leader: apiname = "standby-leader" try: self.rest_api(apiname) except APIError: return [nagiosplugin.Metric("is_leader", 0)] return [nagiosplugin.Metric("is_leader", 1)] class NodeIsLeaderSummary(nagiosplugin.Summary): def __init__(self, check_is_standby_leader: bool) -> None: if check_is_standby_leader: self.leader_kind = "standby leader" else: self.leader_kind = "leader" def ok(self, results: nagiosplugin.Result) -> str: return f"This node is a {self.leader_kind} node." @handle_unknown def problem(self, results: nagiosplugin.Result) -> str: return f"This node is not a {self.leader_kind} node."
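# Illustrative sketch only (an assumption, mirroring the node_is_alive wiring
# shown in cli.py; the exact shipped wiring for the node_is_leader command may
# differ slightly): the resource, context and summary above are combined into
# a nagiosplugin check like this:
#
#   check = nagiosplugin.Check()
#   check.add(
#       NodeIsLeader(ctx.obj.connection_info, check_is_standby_leader),
#       nagiosplugin.ScalarContext("is_leader", None, "@0:0"),
#       NodeIsLeaderSummary(check_is_standby_leader),
#   )
#   check.main(verbose=ctx.obj.verbose, timeout=ctx.obj.timeout)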
class NodeIsReplica(PatroniResource): def __init__( self, connection_info: ConnectionInfo, max_lag: str, check_is_sync: bool, check_is_async: bool, sync_type: SyncType, ) -> None: super().__init__(connection_info) self.max_lag = max_lag self.check_is_sync = check_is_sync self.check_is_async = check_is_async self.sync_type = sync_type def probe(self) -> Iterable[nagiosplugin.Metric]: item_dict = {} try: if self.max_lag is None: item_dict = self.rest_api("replica") else: item_dict = self.rest_api(f"replica?lag={self.max_lag}") except APIError: return [nagiosplugin.Metric("is_replica", 0)] if self.check_is_sync: if (self.sync_type in ["sync", "any"] and "sync_standby" in item_dict) or ( self.sync_type in ["quorum", "any"] and "quorum_standby" in item_dict ): return [nagiosplugin.Metric("is_replica", 1)] else: return [nagiosplugin.Metric("is_replica", 0)] elif self.check_is_async: if "sync_standby" in item_dict or "quorum_standby" in item_dict: return [nagiosplugin.Metric("is_replica", 0)] else: return [nagiosplugin.Metric("is_replica", 1)] else: return [nagiosplugin.Metric("is_replica", 1)] class NodeIsReplicaSummary(nagiosplugin.Summary): def __init__( self, lag: str, check_is_sync: bool, check_is_async: bool, sync_type: SyncType, ) -> None: self.lag = lag if check_is_sync: self.replica_kind = f"synchronous replica of kind '{sync_type}'" elif check_is_async: self.replica_kind = "asynchronous replica" else: self.replica_kind = "replica" def ok(self, results: nagiosplugin.Result) -> str: if self.lag is None: return ( f"This node is a running {self.replica_kind} with no noloadbalance tag." ) return f"This node is a running {self.replica_kind} with no noloadbalance tag and the lag is under {self.lag}." @handle_unknown def problem(self, results: nagiosplugin.Result) -> str: if self.lag is None: return f"This node is not a running {self.replica_kind} with no noloadbalance tag." return f"This node is not a running {self.replica_kind} with no noloadbalance tag and a lag under {self.lag}." class NodeIsPendingRestart(PatroniResource): def probe(self) -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("patroni") is_pending_restart = item_dict.get("pending_restart", False) return [ nagiosplugin.Metric( "is_pending_restart", 1 if is_pending_restart else 0, ) ] class NodeIsPendingRestartSummary(nagiosplugin.Summary): def ok(self, results: nagiosplugin.Result) -> str: return "This node doesn't have the pending restart flag." @handle_unknown def problem(self, results: nagiosplugin.Result) -> str: return "This node has the pending restart flag." 
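# NodeTLHasChanged below always receives the *old* timeline. When a state file
# is used, the node_tl_has_changed command in cli.py reads it back beforehand
# with nagiosplugin's Cookie, as in this minimal sketch (it assumes the state
# file already exists):
#
#   cookie = nagiosplugin.Cookie(state_file)
#   cookie.open()
#   old_timeline = cookie.get("timeline")
#   cookie.close()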
class NodeTLHasChanged(PatroniResource): def __init__( self, connection_info: ConnectionInfo, timeline: str, # Always contains the old timeline state_file: str, # Only used to update the timeline in the state_file (when needed) save: bool, # save timeline in state file ) -> None: super().__init__(connection_info) self.state_file = state_file self.timeline = timeline self.save = save def probe(self) -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("patroni") new_tl = item_dict["timeline"] _log.debug("save result: %(issave)s", {"issave": self.save}) old_tl = self.timeline if self.state_file is not None and self.save: _log.debug( "saving new timeline to state file / cookie %(state_file)s", {"state_file": self.state_file}, ) cookie = nagiosplugin.Cookie(self.state_file) cookie.open() cookie["timeline"] = new_tl cookie.commit() cookie.close() _log.debug( "Tl data: old tl %(old_tl)s, new tl %(new_tl)s", {"old_tl": old_tl, "new_tl": new_tl}, ) # The actual check yield nagiosplugin.Metric( "is_timeline_changed", 1 if str(new_tl) != str(old_tl) else 0, ) # The performance data : the timeline number yield nagiosplugin.Metric("timeline", new_tl) class NodeTLHasChangedSummary(nagiosplugin.Summary): def __init__(self, timeline: str) -> None: self.timeline = timeline def ok(self, results: nagiosplugin.Result) -> str: return f"The timeline is still {self.timeline}." @handle_unknown def problem(self, results: nagiosplugin.Result) -> str: return f"The expected timeline was {self.timeline} got {results['timeline'].metric}." class NodePatroniVersion(PatroniResource): def __init__(self, connection_info: ConnectionInfo, patroni_version: str) -> None: super().__init__(connection_info) self.patroni_version = patroni_version def probe(self) -> Iterable[nagiosplugin.Metric]: item_dict = self.rest_api("patroni") version = item_dict["patroni"]["version"] _log.debug( "Version data: patroni version %(version)s input version %(patroni_version)s", {"version": version, "patroni_version": self.patroni_version}, ) # The actual check return [ nagiosplugin.Metric( "is_version_ok", 1 if version == self.patroni_version else 0, ) ] class NodePatroniVersionSummary(nagiosplugin.Summary): def __init__(self, patroni_version: str) -> None: self.patroni_version = patroni_version def ok(self, results: nagiosplugin.Result) -> str: return f"Patroni's version is {self.patroni_version}." @handle_unknown def problem(self, results: nagiosplugin.Result) -> str: # FIXME find a way to make the following work, check if perf data can be strings # return f"The expected patroni version was {self.patroni_version} got {results['patroni_version'].metric}." return f"Patroni's version is not {self.patroni_version}." class NodeIsAlive(PatroniResource): def probe(self) -> Iterable[nagiosplugin.Metric]: try: self.rest_api("liveness") except APIError: return [nagiosplugin.Metric("is_alive", 0)] return [nagiosplugin.Metric("is_alive", 1)] class NodeIsAliveSummary(nagiosplugin.Summary): def ok(self, results: nagiosplugin.Result) -> str: return "This node is alive (patroni is running)." @handle_unknown def problem(self, results: nagiosplugin.Result) -> str: return "This node is not alive (patroni is not running)." check_patroni-2.2.0/check_patroni/types.py000066400000000000000000000077661475506406400207010ustar00rootroot00000000000000import json from functools import lru_cache from typing import Any, Callable, List, Literal, Optional, Tuple, Union from urllib.parse import urlparse import attr import nagiosplugin import requests from .
import _log SyncType = Literal["any", "sync", "quorum"] class APIError(requests.exceptions.RequestException): """This exception is raised when the rest api could be reached but we got an HTTP status code different from 200. """ @attr.s(auto_attribs=True, frozen=True, slots=True) class ConnectionInfo: endpoints: List[str] = ["http://127.0.0.1:8008"] cert: Optional[Union[str, Tuple[str, str]]] = None ca_cert: Optional[str] = None @attr.s(auto_attribs=True, frozen=True, slots=True) class Parameters: connection_info: ConnectionInfo timeout: int verbose: int @attr.s(auto_attribs=True, eq=False, slots=True) class PatroniResource(nagiosplugin.Resource): conn_info: ConnectionInfo def rest_api(self, service: str) -> Any: """Try to connect to all the provided endpoints for the requested service""" for endpoint in self.conn_info.endpoints: cert: Optional[Union[Tuple[str, str], str]] = None verify: Optional[Union[str, bool]] = None if urlparse(endpoint).scheme == "https": if self.conn_info.cert is not None: # we can have: a key + a cert or a single file with key and cert. cert = self.conn_info.cert if self.conn_info.ca_cert is not None: verify = self.conn_info.ca_cert _log.debug( "Trying to connect to %(endpoint)s/%(service)s with cert: %(cert)s verify: %(verify)s", { "endpoint": endpoint, "service": service, "cert": cert, "verify": verify, }, ) try: r = requests.get(f"{endpoint}/{service}", verify=verify, cert=cert) except Exception as e: _log.debug(e) continue # The status code is already displayed by urllib3 _log.debug( "api call data: %(data)s", {"data": r.text if r.text else ""} ) if r.status_code != 200: raise APIError( f"Failed to connect to {endpoint}/{service} status code {r.status_code}" ) try: return r.json() except (json.JSONDecodeError, ValueError): return None raise nagiosplugin.CheckError("Connection failed for all provided endpoints") @lru_cache(maxsize=None) def has_detailed_states(self) -> bool: # get patroni's version to find out if the "streaming" and "in archive recovery" states are available patroni_item_dict = self.rest_api("patroni") if tuple( int(v) for v in patroni_item_dict["patroni"]["version"].split(".", 2) ) >= (3, 0, 4): _log.debug( "Patroni's version is %(version)s, more detailed states can be used to check for the health of replicas.", {"version": patroni_item_dict["patroni"]["version"]}, ) return True _log.debug( "Patroni's version is %(version)s, the running state and the timelines must be used to check for the health of replicas.", {"version": patroni_item_dict["patroni"]["version"]}, ) return False HandleUnknown = Callable[[nagiosplugin.Summary, nagiosplugin.Results], Any] def handle_unknown(func: HandleUnknown) -> HandleUnknown: """decorator to handle the unknown state in Summary.problem""" def wrapper(summary: nagiosplugin.Summary, results: nagiosplugin.Results) -> Any: if results.most_significant[0].state.code == 3: """get the appropriate message for all unknown errors""" return results.most_significant[0].hint return func(summary, results) return wrapper check_patroni-2.2.0/docs/000077500000000000000000000000001475506406400152675ustar00rootroot00000000000000check_patroni-2.2.0/docs/make_readme.sh000077500000000000000000000105451475506406400200650ustar00rootroot00000000000000#!/bin/bash if !
command -v check_patroni &>/dev/null; then echo "check_patroni must be installed to generate the documentation" exit 1 fi top_srcdir="$(readlink -m "$0/../..")" README="${top_srcdir}/README.md" function readme(){ echo "$1" >> $README } function helpme(){ readme readme '```' check_patroni $1 --help >> $README readme '```' readme } cat << '_EOF_' > $README # check_patroni A nagios plugin for patroni. ## Features - Check presence of leader, replicas, node counts. - Check each node for replication status. _EOF_ helpme cat << '_EOF_' >> $README ## Install check_patroni is licensed under the PostgreSQL license. ``` $ pip install git+https://github.com/dalibo/check_patroni.git ``` check_patroni works on python 3.6; we keep it that way because patroni also supports it and there are still lots of RH 7 variants around. That being said, python 3.6 has been EOL for ages and is not covered by the GitHub CI. ## Support If you hit a bug or need help, open a [GitHub issue](https://github.com/dalibo/check_patroni/issues/new). Dalibo has no commitment on response time for public free support. Thanks for your contribution! ## Config file All global and service specific parameters can be specified via a config file as follows: ``` [options] endpoints = https://10.20.199.3:8008, https://10.20.199.4:8008,https://10.20.199.5:8008 cert_file = ./ssl/my-cert.pem key_file = ./ssl/my-key.pem ca_file = ./ssl/CA-cert.pem timeout = 0 [options.node_is_replica] lag=100 ``` ## Thresholds The format for the threshold parameters is `[@][start:][end]`. * `start:` may be omitted if `start == 0` * `~:` means that start is negative infinity * If `end` is omitted, infinity is assumed * To invert the match condition, prefix the range expression with `@`. A match is found when: `start <= VALUE <= end`. For example, the following command will raise: * a warning if there are fewer than 2 replicas, which translates to being outside of range [2;+INF[ * a critical if there is no replica at all, which translates to being outside of range [1;+INF[ ``` check_patroni -e https://10.20.199.3:8008 cluster_has_replica --warning 2: --critical 1: ``` A short, runnable illustration of this range syntax is given in the "Threshold ranges in practice" section at the end of this README. ## SSL Several options are available: * the server's CA certificate is not available or trusted by the client system: * `--ca_cert`: your certificate chain `cat CA-certificate server-certificate > cabundle` * you have a client certificate for authenticating with Patroni's REST API: * `--cert_file`: your certificate or the concatenation of your certificate and private key * `--key_file`: your private key (optional) ## Shell completion We use the [click] library which supports shell completion natively. Shell completion can be added by typing the following command or adding it to a file specific to your shell of choice. * for Bash (add to `~/.bashrc`): ``` eval "$(_CHECK_PATRONI_COMPLETE=bash_source check_patroni)" ``` * for Zsh (add to `~/.zshrc`): ``` eval "$(_CHECK_PATRONI_COMPLETE=zsh_source check_patroni)" ``` * for Fish (add to `~/.config/fish/completions/check_patroni.fish`): ``` _CHECK_PATRONI_COMPLETE=fish_source check_patroni | source ``` Please note that shell completion is not supported for all shell versions; for example, Bash versions older than 4.4 are not supported. [click]: https://click.palletsprojects.com/en/8.1.x/shell-completion/ ## Connection errors and service status If patroni is not running, we have no way to know if the provided endpoint is valid, therefore the check returns UNKNOWN.
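## Threshold ranges in practice

As a complement to the Thresholds section above, here is a minimal sketch of how these range expressions behave, using the [nagiosplugin](https://pypi.org/project/nagiosplugin/) library that check_patroni is built on. It is illustrative only and not part of check_patroni itself; `warning_range` is just a throwaway name.

```
import nagiosplugin

warning_range = nagiosplugin.Range("2:")
assert warning_range.match(2)      # 2 is inside [2;+INF[, no alert
assert not warning_range.match(1)  # 1 is outside [2;+INF[, an alert is raised
```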
_EOF_ readme readme "## Cluster services" readme readme "### cluster_config_has_changed" helpme cluster_config_has_changed readme "### cluster_has_leader" helpme cluster_has_leader readme "### cluster_has_replica" helpme cluster_has_replica readme "### cluster_has_scheduled_action" helpme cluster_has_scheduled_action readme "### cluster_is_in_maintenance" helpme cluster_is_in_maintenance readme "### cluster_node_count" helpme cluster_node_count readme "## Node services" readme readme "### node_is_alive" helpme node_is_alive readme "### node_is_pending_restart" helpme node_is_pending_restart readme "### node_is_leader" helpme node_is_leader readme "### node_is_primary" helpme node_is_primary readme "### node_is_replica" helpme node_is_replica readme "### node_patroni_version" helpme node_patroni_version readme "### node_tl_has_changed" helpme node_tl_has_changed cat << _EOF_ >> $README _EOF_ check_patroni-2.2.0/mypy.ini000066400000000000000000000014241475506406400160370ustar00rootroot00000000000000[mypy] files = . show_error_codes = true strict = true exclude = build/ [mypy-setup] ignore_errors = True [mypy-nagiosplugin.*] ignore_missing_imports = true [mypy-check_patroni.types] # no stubs for nagioplugin => ignore: Class cannot subclass "Resource" (has type "Any") [misc] disallow_subclassing_any = false [mypy-check_patroni.node] # no subs for nagiosplugin => ignore: Class cannot subclass "Summary" (has type "Any") [misc] disallow_subclassing_any = false [mypy-check_patroni.cluster] # no subs for nagiosplugin => ignore: Class cannot subclass "Summary" (has type "Any") [misc] disallow_subclassing_any = false [mypy-check_patroni.cli] # no stubs for nagiosplugin => ignore: Untyped decorator makes function "main" untyped [misc] disallow_untyped_decorators = false check_patroni-2.2.0/pyproject.toml000066400000000000000000000002041475506406400172470ustar00rootroot00000000000000[build-system] requires = ["setuptools", "setuptools-scm"] build-backend = "setuptools.build_meta" [tool.isort] profile = "black" check_patroni-2.2.0/pytest.ini000066400000000000000000000000451475506406400163670ustar00rootroot00000000000000[pytest] addopts = --doctest-modules check_patroni-2.2.0/requirements-dev.txt000066400000000000000000000001361475506406400203770ustar00rootroot00000000000000black codespell isort flake8 mypy pytest pytest-cov types-requests setuptools tox twine wheel check_patroni-2.2.0/setup.py000066400000000000000000000031151475506406400160510ustar00rootroot00000000000000import pathlib from setuptools import find_packages, setup HERE = pathlib.Path(__file__).parent long_description = (HERE / "README.md").read_text() def get_version() -> str: fpath = HERE / "check_patroni" / "__init__.py" with fpath.open() as f: for line in f: if line.startswith("__version__"): return line.split('"')[1] raise Exception(f"version information not found in {fpath}") setup( name="check_patroni", version=get_version(), author="Dalibo", author_email="contact@dalibo.com", packages=find_packages(include=["check_patroni*"]), include_package_data=True, url="https://github.com/dalibo/check_patroni", license="PostgreSQL", description="Nagios plugin to check on patroni", long_description=long_description, long_description_content_type="text/markdown", classifiers=[ "Development Status :: 5 - Production/Stable", "Environment :: Console", "License :: OSI Approved :: PostgreSQL License", "Programming Language :: Python :: 3", "Topic :: System :: Monitoring", ], keywords="patroni nagios check", python_requires=">=3.6", install_requires=[ 
"attrs >= 17, !=21.1", "requests", "nagiosplugin >= 1.3.2", "click >= 7.1", ], extras_require={ "test": [ "importlib_metadata; python_version < '3.8'", "pytest >= 6.0.2", ], }, entry_points={ "console_scripts": [ "check_patroni=check_patroni.cli:main", ], }, zip_safe=False, ) check_patroni-2.2.0/tests/000077500000000000000000000000001475506406400155015ustar00rootroot00000000000000check_patroni-2.2.0/tests/__init__.py000066400000000000000000000045311475506406400176150ustar00rootroot00000000000000import json import logging import shutil from contextlib import contextmanager from functools import partial from http.server import HTTPServer, SimpleHTTPRequestHandler from pathlib import Path from typing import Any, Iterator, Mapping, Union logger = logging.getLogger(__name__) class PatroniAPI(HTTPServer): def __init__(self, directory: Path, *, datadir: Path) -> None: self.directory = directory self.datadir = datadir handler_cls = partial(SimpleHTTPRequestHandler, directory=str(directory)) super().__init__(("", 0), handler_cls) def serve_forever(self, *args: Any) -> None: logger.info( "starting fake Patroni API at %s (directory=%s)", self.endpoint, self.directory, ) return super().serve_forever(*args) @property def endpoint(self) -> str: return f"http://{self.server_name}:{self.server_port}" @contextmanager def routes(self, mapping: Mapping[str, Union[Path, str]]) -> Iterator[None]: """Temporarily install specified files in served directory, thus building "routes" from given mapping. The 'mapping' defines target route paths as keys and files to be installed in served directory as values. Mapping values of type 'str' are assumed be relative file path to the 'datadir'. """ for route_path, fpath in mapping.items(): if isinstance(fpath, str): fpath = self.datadir / fpath shutil.copy(fpath, self.directory / route_path) try: yield None finally: for fname in mapping: (self.directory / fname).unlink() def cluster_api_set_replica_running(in_json: Path, target_dir: Path) -> Path: # starting from 3.0.4 the state of replicas is streaming or in archive recovery # instead of running with in_json.open() as f: js = json.load(f) for node in js["members"]: if node["role"] in [ "replica", "sync_standby", "standby_leader", "quorum_standby", ]: if node["state"] in ["streaming", "in archive recovery"]: node["state"] = "running" assert target_dir.is_dir() out_json = target_dir / in_json.name with out_json.open("w") as f: json.dump(js, f) return out_json check_patroni-2.2.0/tests/conftest.py000066400000000000000000000037221475506406400177040ustar00rootroot00000000000000import logging import sys from pathlib import Path from threading import Thread from typing import Any, Iterator, Tuple from unittest.mock import patch if sys.version_info >= (3, 8): from importlib.metadata import version as metadata_version else: from importlib_metadata import version as metadata_version import pytest from click.testing import CliRunner from . 
import PatroniAPI logger = logging.getLogger(__name__) def numversion(pkgname: str) -> Tuple[int, ...]: version = metadata_version(pkgname) return tuple(int(v) for v in version.split(".", 3)) if numversion("pytest") >= (6, 2): TempPathFactory = pytest.TempPathFactory else: from _pytest.tmpdir import TempPathFactory @pytest.fixture(scope="session", autouse=True) def nagioplugin_runtime_stdout() -> Iterator[None]: # work around https://github.com/mpounsett/nagiosplugin/issues/24 when # nagiosplugin is older than 1.3.3 if numversion("nagiosplugin") < (1, 3, 3): target = "nagiosplugin.runtime.Runtime.stdout" with patch(target, None): logger.warning("patching %r", target) yield None else: yield None @pytest.fixture( params=[False, True], ids=lambda v: "old-replica-state" if v else "new-replica-state", ) def old_replica_state(request: Any) -> Any: return request.param @pytest.fixture(scope="session") def datadir() -> Path: return Path(__file__).parent / "json" @pytest.fixture(scope="session") def patroni_api( tmp_path_factory: TempPathFactory, datadir: Path ) -> Iterator[PatroniAPI]: """A fake HTTP server for the Patroni API serving files from a temporary directory. """ httpd = PatroniAPI(tmp_path_factory.mktemp("api"), datadir=datadir) t = Thread(target=httpd.serve_forever) t.start() yield httpd httpd.shutdown() t.join() @pytest.fixture def runner() -> CliRunner: """A CliRunner with stdout and stderr not mixed.""" return CliRunner(mix_stderr=False) check_patroni-2.2.0/tests/json/000077500000000000000000000000001475506406400164525ustar00rootroot00000000000000check_patroni-2.2.0/tests/json/cluster_config_has_changed.json000066400000000000000000000006061475506406400246610ustar00rootroot00000000000000{ "loop_wait": 10, "primary_start_timeout": 300, "postgresql": { "parameters": { "archive_command": "pgbackrest --stanza=main archive-push %p", "archive_mode": "on", "max_connections": 300, "restore_command": "pgbackrest --stanza=main archive-get %f \"%p\"" }, "use_pg_rewind": false, "use_slot": true }, "retry_timeout": 10, "ttl": 30 } check_patroni-2.2.0/tests/json/cluster_has_leader_ko.json000066400000000000000000000012511475506406400236650ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "replica", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "running", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "running", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_has_leader_ko_standby_leader.json000066400000000000000000000012641475506406400267310ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "standby_leader", "state": "stopped", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_has_leader_ko_standby_leader_archiving.json000066400000000000000000000013001475506406400307520ustar00rootroot00000000000000{
"members": [ { "name": "srv1", "role": "standby_leader", "state": "in archive recovery", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_has_leader_ok.json000066400000000000000000000012541475506406400236700ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_has_leader_ok_standby_leader.json000066400000000000000000000012661475506406400267330ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "standby_leader", "state": "streaming", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_has_replica_ko.json000066400000000000000000000012621475506406400240520ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "stopped", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": "unknown" }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_has_replica_ko_all_replica.json000066400000000000000000000012721475506406400264020ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "replica", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv2", "role": "replica", "state": "running", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "running", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_has_replica_ko_lag.json000066400000000000000000000012721475506406400246760ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": 
"replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 10241024 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 20000000 } ] } check_patroni-2.2.0/tests/json/cluster_has_replica_ko_wrong_tl.json000066400000000000000000000012601475506406400257630ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "running", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 50, "lag": 1000000 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_has_replica_ok.json000066400000000000000000000012731475506406400240540ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "in archive recovery", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "sync_standby", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_has_replica_ok_lag.json000066400000000000000000000012601475506406400246730ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 1024 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_has_replica_ok_quorum.json000066400000000000000000000012721475506406400254630ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "quorum_standby", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "quorum_standby", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_has_replica_patroni_verion_3.0.0.json000066400000000000000000000010711475506406400272130ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "primary", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 51, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], 
"database_system_identifier": "6965971025273547206", "patroni": { "version": "3.0.0", "scope": "patroni-demo" } } check_patroni-2.2.0/tests/json/cluster_has_replica_patroni_verion_3.1.0.json000066400000000000000000000010711475506406400272140ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "primary", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 51, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "3.1.0", "scope": "patroni-demo" } } check_patroni-2.2.0/tests/json/cluster_has_replica_standby_cluster_ok.json000066400000000000000000000013031475506406400273330ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "standby_leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "in archive recovery", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "sync_standby", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_has_scheduled_action_ko_restart.json000066400000000000000000000011411475506406400274700ustar00rootroot00000000000000{ "members": [ { "name": "p1", "role": "sync_standby", "state": "streaming", "api_url": "http://10.20.30.51:8008/patroni", "host": "10.20.30.51", "port": 5432, "timeline": 3, "scheduled_restart": { "schedule": "2023-10-08T11:30:00+00:00", "postmaster_start_time": "2023-08-21 08:08:33.415237+00:00" }, "lag": 0 }, { "name": "p2", "role": "leader", "state": "running", "api_url": "http://10.20.30.52:8008/patroni", "host": "10.20.30.52", "port": 5432, "timeline": 3 } ] } check_patroni-2.2.0/tests/json/cluster_has_scheduled_action_ko_switchover.json000066400000000000000000000010571475506406400302070ustar00rootroot00000000000000{ "members": [ { "name": "p1", "role": "sync_standby", "state": "streaming", "api_url": "http://10.20.30.51:8008/patroni", "host": "10.20.30.51", "port": 5432, "timeline": 3, "lag": 0 }, { "name": "p2", "role": "leader", "state": "running", "api_url": "http://10.20.30.52:8008/patroni", "host": "10.20.30.52", "port": 5432, "timeline": 3 } ], "scheduled_switchover": { "at": "2023-10-08T11:30:00+00:00", "from": "p1", "to": "p2" } } check_patroni-2.2.0/tests/json/cluster_has_scheduled_action_ok.json000066400000000000000000000012611475506406400257270ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "sync_standby", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_is_in_maintenance_ko.json000066400000000000000000000012751475506406400252470ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", 
"api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ], "pause": true } check_patroni-2.2.0/tests/json/cluster_is_in_maintenance_ko_pause_false.json000066400000000000000000000012761475506406400276170ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ], "pause": false } check_patroni-2.2.0/tests/json/cluster_is_in_maintenance_ok.json000066400000000000000000000012541475506406400252440ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_is_in_maintenance_ok_pause_false.json000066400000000000000000000012761475506406400276170ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ], "pause": false } check_patroni-2.2.0/tests/json/cluster_node_count_critical.json000066400000000000000000000003461475506406400251200ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 } ] } check_patroni-2.2.0/tests/json/cluster_node_count_healthy_critical.json000066400000000000000000000012261475506406400266340ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "start failed", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "lag": "unknown" }, { "name": "srv3", "role": "replica", "state": "start failed", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "lag": "unknown" } ] } 
check_patroni-2.2.0/tests/json/cluster_node_count_healthy_warning.json000066400000000000000000000007111475506406400265050ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_node_count_ko_in_archive_recovery.json000066400000000000000000000013121475506406400300360ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "standby_leader", "state": "in archive recovery", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "in archive recovery", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_node_count_ok.json000066400000000000000000000012541475506406400237360ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_node_count_ok_quorum.json000066400000000000000000000012721475506406400253460ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "quorum_standby", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "quorum_standby", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_node_count_ok_sync.json000066400000000000000000000012611475506406400247700ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "sync_standby", "state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 }, { "name": "srv3", "role": "replica", "state": "streaming", "api_url": "https://10.20.199.5:8008/patroni", "host": "10.20.199.5", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/cluster_node_count_warning.json000066400000000000000000000007111475506406400247670ustar00rootroot00000000000000{ "members": [ { "name": "srv1", "role": "leader", "state": "running", "api_url": "https://10.20.199.3:8008/patroni", "host": "10.20.199.3", "port": 5432, "timeline": 51 }, { "name": "srv2", "role": "replica", 
"state": "streaming", "api_url": "https://10.20.199.4:8008/patroni", "host": "10.20.199.4", "port": 5432, "timeline": 51, "lag": 0 } ] } check_patroni-2.2.0/tests/json/node_is_leader_ko.json000066400000000000000000000010711475506406400227710ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "primary", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-2.2.0/tests/json/node_is_leader_ko_standby_leader.json000066400000000000000000000007171475506406400260370ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2023-08-23 14:30:50.201691+00:00", "role": "standby_leader", "server_version": 140009, "xlog": { "received_location": 889192448, "replayed_location": 889192448, "replayed_timestamp": null, "paused": false }, "timeline": 1, "dcs_last_seen": 1692805971, "database_system_identifier": "7270495803765492571", "patroni": { "version": "3.1.0", "scope": "patroni-demo-sb" } } check_patroni-2.2.0/tests/json/node_is_leader_ok.json000066400000000000000000000010711475506406400227710ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "primary", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-2.2.0/tests/json/node_is_leader_ok_standby_leader.json000066400000000000000000000007171475506406400260370ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2023-08-23 14:30:50.201691+00:00", "role": "standby_leader", "server_version": 140009, "xlog": { "received_location": 889192448, "replayed_location": 889192448, "replayed_timestamp": null, "paused": false }, "timeline": 1, "dcs_last_seen": 1692805971, "database_system_identifier": "7270495803765492571", "patroni": { "version": "3.1.0", "scope": "patroni-demo-sb" } } check_patroni-2.2.0/tests/json/node_is_pending_restart_ko.json000066400000000000000000000011241475506406400247240ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "primary", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "pending_restart": true, "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-2.2.0/tests/json/node_is_pending_restart_ok.json000066400000000000000000000010711475506406400247250ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "primary", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", 
"application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-2.2.0/tests/json/node_is_primary_ko.json000066400000000000000000000007011475506406400232170ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:57:51.693 UTC", "role": "replica", "server_version": 110012, "cluster_unlocked": false, "xlog": { "received_location": 1174407088, "replayed_location": 1174407088, "replayed_timestamp": null, "paused": false }, "timeline": 58, "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-2.2.0/tests/json/node_is_primary_ok.json000066400000000000000000000010711475506406400232200ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "primary", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-2.2.0/tests/json/node_is_replica_ko.json000066400000000000000000000010711475506406400231540ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "primary", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-2.2.0/tests/json/node_is_replica_ok.json000066400000000000000000000007011475506406400231530ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:57:51.693 UTC", "role": "replica", "server_version": 110012, "cluster_unlocked": false, "xlog": { "received_location": 1174407088, "replayed_location": 1174407088, "replayed_timestamp": null, "paused": false }, "timeline": 58, "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-2.2.0/tests/json/node_is_replica_ok_quorum.json000066400000000000000000000010631475506406400245650ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2024-12-23 10:07:07.660665+00:00", "role": "replica", "server_version": 140013, "xlog": { "received_location": 251660416, "replayed_location": 251660416, "replayed_timestamp": "2024-12-23 15:43:43.152572+00:00", "paused": false }, "quorum_standby": true, "timeline": 9, "replication_state": "streaming", "dcs_last_seen": 1734972473, "database_system_identifier": "7421168130564934130", "patroni": { "version": "4.0.2", "scope": "patroni-demo", "name": "p2" } } check_patroni-2.2.0/tests/json/node_is_replica_ok_sync.json000066400000000000000000000010611475506406400242070ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2024-12-23 10:07:07.660665+00:00", "role": "replica", "server_version": 140013, "xlog": { "received_location": 251660416, "replayed_location": 251660416, 
"replayed_timestamp": "2024-12-23 15:43:43.152572+00:00", "paused": false }, "sync_standby": true, "timeline": 9, "replication_state": "streaming", "dcs_last_seen": 1734972473, "database_system_identifier": "7421168130564934130", "patroni": { "version": "4.0.2", "scope": "patroni-demo", "name": "p2" } } check_patroni-2.2.0/tests/json/node_patroni_version.json000066400000000000000000000010711475506406400235720ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "primary", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-2.2.0/tests/json/node_tl_has_changed.json000066400000000000000000000010711475506406400232740ustar00rootroot00000000000000{ "state": "running", "postmaster_start_time": "2021-08-11 07:02:20.732 UTC", "role": "primary", "server_version": 110012, "cluster_unlocked": false, "xlog": { "location": 1174407088 }, "timeline": 58, "replication": [ { "usename": "replicator", "application_name": "srv1", "client_addr": "10.20.199.3", "state": "streaming", "sync_state": "async", "sync_priority": 0 } ], "database_system_identifier": "6965971025273547206", "patroni": { "version": "2.0.2", "scope": "patroni-demo" } } check_patroni-2.2.0/tests/test_api.py000066400000000000000000000011651475506406400176660ustar00rootroot00000000000000from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI def test_api_status_code_200(runner: CliRunner, patroni_api: PatroniAPI) -> None: with patroni_api.routes({"patroni": "node_is_pending_restart_ok.json"}): result = runner.invoke( main, ["-e", patroni_api.endpoint, "node_is_pending_restart"] ) assert result.exit_code == 0 def test_api_status_code_404(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke( main, ["-e", patroni_api.endpoint, "node_is_pending_restart"] ) assert result.exit_code == 3 check_patroni-2.2.0/tests/test_cluster_config_has_changed.py000066400000000000000000000122751475506406400244330ustar00rootroot00000000000000from pathlib import Path from typing import Iterator import nagiosplugin import pytest from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI @pytest.fixture(scope="module", autouse=True) def cluster_config_has_changed(patroni_api: PatroniAPI) -> Iterator[None]: with patroni_api.routes({"config": "cluster_config_has_changed.json"}): yield None def test_cluster_config_has_changed_ok_with_hash( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_config_has_changed", "--hash", "30022c301991e7395182b1134683e518", ], ) assert ( result.stdout == "CLUSTERCONFIGHASCHANGED OK - The hash of patroni's dynamic configuration has not changed (30022c301991e7395182b1134683e518). 
| is_configuration_changed=0;;@1:1\n" ) assert result.exit_code == 0 def test_cluster_config_has_changed_ok_with_state_file( runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path ) -> None: state_file = tmp_path / "cluster_config_has_changed.state_file" with state_file.open("w") as f: f.write('{"hash": "30022c301991e7395182b1134683e518"}') result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_config_has_changed", "--state-file", str(state_file), ], ) assert ( result.stdout == "CLUSTERCONFIGHASCHANGED OK - The hash of patroni's dynamic configuration has not changed (30022c301991e7395182b1134683e518). | is_configuration_changed=0;;@1:1\n" ) assert result.exit_code == 0 def test_cluster_config_has_changed_ko_with_hash( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_config_has_changed", "--hash", "96b12d82571473d13e890b8937ffffff", ], ) assert ( result.stdout == "CLUSTERCONFIGHASCHANGED CRITICAL - The hash of patroni's dynamic configuration has changed. The old hash was 96b12d82571473d13e890b8937ffffff. | is_configuration_changed=1;;@1:1\n" ) assert result.exit_code == 2 def test_cluster_config_has_changed_ko_with_state_file_and_save( runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path ) -> None: state_file = tmp_path / "cluster_config_has_changed.state_file" with state_file.open("w") as f: f.write('{"hash": "96b12d82571473d13e890b8937ffffff"}') # test without saving the new hash result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_config_has_changed", "--state-file", str(state_file), ], ) assert ( result.stdout == "CLUSTERCONFIGHASCHANGED CRITICAL - The hash of patroni's dynamic configuration has changed. The old hash was 96b12d82571473d13e890b8937ffffff. | is_configuration_changed=1;;@1:1\n" ) assert result.exit_code == 2 state_file = tmp_path / "cluster_config_has_changed.state_file" cookie = nagiosplugin.Cookie(state_file) cookie.open() new_config_hash = cookie.get("hash") cookie.close() assert new_config_hash == "96b12d82571473d13e890b8937ffffff" # test when we save the hash result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_config_has_changed", "--state-file", str(state_file), "--save", ], ) assert ( result.stdout == "CLUSTERCONFIGHASCHANGED CRITICAL - The hash of patroni's dynamic configuration has changed. The old hash was 96b12d82571473d13e890b8937ffffff. | is_configuration_changed=1;;@1:1\n" ) assert result.exit_code == 2 cookie = nagiosplugin.Cookie(state_file) cookie.open() new_config_hash = cookie.get("hash") cookie.close() assert new_config_hash == "30022c301991e7395182b1134683e518" def test_cluster_config_has_changed_params( runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path ) -> None: # This one is placed last because it seems like the exceptions are not flushed from stderr for the next tests. 
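# A short sketch of the contract exercised here (as implied by the assertions
# below, not by upstream docs): the service wants exactly one of
# --hash / --state-file; passing both, or neither, makes the CLI raise a
# click.exceptions.UsageError, which nagiosplugin reports as UNKNOWN (exit 3).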
fake_state_file = tmp_path / "fake_file_name.state_file" result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_config_has_changed", "--hash", "640df9f0211c791723f18fc3ed9dbb95", "--state-file", str(fake_state_file), ], ) assert ( result.stdout == "CLUSTERCONFIGHASCHANGED UNKNOWN: click.exceptions.UsageError: Either --hash or --state-file should be provided for this service\n" ) assert result.exit_code == 3 result = runner.invoke( main, ["-e", "https://10.20.199.3:8008", "cluster_config_has_changed"] ) assert ( result.stdout == "CLUSTERCONFIGHASCHANGED UNKNOWN: click.exceptions.UsageError: Either --hash or --state-file should be provided for this service\n" ) check_patroni-2.2.0/tests/test_cluster_has_leader.py000066400000000000000000000136521475506406400227510ustar00rootroot00000000000000from pathlib import Path from typing import Iterator, Union import pytest from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI, cluster_api_set_replica_running @pytest.fixture def cluster_has_leader_ok( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_leader_ok.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_leader_ok") def test_cluster_has_leader_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_leader"]) assert ( result.stdout == "CLUSTERHASLEADER OK - The cluster has a running leader. | has_leader=1;;@0 is_leader=1 is_standby_leader=0 is_standby_leader_in_arc_rec=0;@1:1\n" ) assert result.exit_code == 0 @pytest.fixture def cluster_has_leader_ok_standby_leader( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_leader_ok_standby_leader.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_leader_ok_standby_leader") def test_cluster_has_leader_ok_standby_leader( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_leader"]) assert ( result.stdout == "CLUSTERHASLEADER OK - The cluster has a running leader. 
| has_leader=1;;@0 is_leader=0 is_standby_leader=1 is_standby_leader_in_arc_rec=0;@1:1\n" ) assert result.exit_code == 0 @pytest.fixture def cluster_has_leader_ko( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_leader_ko.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_leader_ko") def test_cluster_has_leader_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_leader"]) assert ( result.stdout == "CLUSTERHASLEADER CRITICAL - The cluster has no running leader or the standby leader is in archive recovery. | has_leader=0;;@0 is_leader=0 is_standby_leader=0 is_standby_leader_in_arc_rec=0;@1:1\n" ) assert result.exit_code == 2 @pytest.fixture def cluster_has_leader_ko_standby_leader( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_leader_ko_standby_leader.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_leader_ko_standby_leader") def test_cluster_has_leader_ko_standby_leader( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_leader"]) assert ( result.stdout == "CLUSTERHASLEADER CRITICAL - The cluster has no running leader or the standby leader is in archive recovery. | has_leader=0;;@0 is_leader=0 is_standby_leader=0 is_standby_leader_in_arc_rec=0;@1:1\n" ) assert result.exit_code == 2 @pytest.fixture def cluster_has_leader_ko_standby_leader_archiving( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = ( "cluster_has_leader_ko_standby_leader_archiving.json" ) patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_leader_ko_standby_leader_archiving") def test_cluster_has_leader_ko_standby_leader_archiving( runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool ) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_leader"]) if old_replica_state: assert ( result.stdout == "CLUSTERHASLEADER OK - The cluster has a running leader. | has_leader=1;;@0 is_leader=0 is_standby_leader=1 is_standby_leader_in_arc_rec=0;@1:1\n" ) assert result.exit_code == 0 else: assert ( result.stdout == "CLUSTERHASLEADER WARNING - The cluster has no running leader or the standby leader is in archive recovery. 
| has_leader=1;;@0 is_leader=0 is_standby_leader=1 is_standby_leader_in_arc_rec=1;@1:1\n" ) assert result.exit_code == 1 check_patroni-2.2.0/tests/test_cluster_has_replica.py000066400000000000000000000322171475506406400231320ustar00rootroot00000000000000from pathlib import Path from typing import Iterator, Union import pytest from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI, cluster_api_set_replica_running @pytest.fixture def cluster_has_replica_ok( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_replica_ok.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_replica_ok") def test_cluster_has_replica_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_replica"]) assert ( result.stdout == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=1 unhealthy_replica=0\n" ) assert result.exit_code == 0 @pytest.fixture def cluster_has_replica_ok_sync( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_replica_ok.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_replica_ok_sync") def test_cluster_has_replica_ok_sync(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke( main, ["-e", patroni_api.endpoint, "cluster_has_replica", "--sync-type", "sync"] ) assert ( result.stdout == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=1 unhealthy_replica=0\n" ) assert result.exit_code == 0 @pytest.fixture def cluster_has_replica_ok_quorum( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_replica_ok_quorum.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_replica_ok_quorum") def test_cluster_has_replica_ok_quorum( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, ["-e", patroni_api.endpoint, "cluster_has_replica", "--sync-type", "quorum"], ) assert ( result.stdout == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=1 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=2 unhealthy_replica=0\n" ) assert result.exit_code == 0 @pytest.mark.usefixtures("cluster_has_replica_ok") def 
test_cluster_has_replica_ok_with_count_thresholds( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_has_replica", "--warning", "@1", "--critical", "@0", ], ) assert ( result.stdout == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2;@1;@0 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=1 unhealthy_replica=0\n" ) assert result.exit_code == 0 @pytest.mark.usefixtures("cluster_has_replica_ok") def test_cluster_has_replica_ok_with_sync_count_thresholds( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_has_replica", "--sync-warning", "1:", ], ) assert ( result.stdout == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=1;1: unhealthy_replica=0\n" ) assert result.exit_code == 0 @pytest.fixture def cluster_has_replica_ok_lag( patroni_api: PatroniAPI, datadir: Path, tmp_path: Path, old_replica_state: bool ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_replica_ok_lag.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_replica_ok_lag") def test_cluster_has_replica_ok_with_count_thresholds_lag( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_has_replica", "--warning", "@1", "--critical", "@0", "--max-lag", "1MB", ], ) assert ( result.stdout == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2;@1;@0 srv2_lag=1024 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=0\n" ) assert result.exit_code == 0 @pytest.fixture def cluster_has_replica_standby_cluster_ok( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_replica_standby_cluster_ok.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_replica_standby_cluster_ok") def test_cluster_has_replica_standby_cluster_ok( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_has_replica"]) assert ( result.stdout == "CLUSTERHASREPLICA OK - healthy_replica is 2 | healthy_replica=2 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=1 srv3_timeline=51 sync_replica=1 unhealthy_replica=0\n" ) assert result.exit_code == 0 @pytest.fixture def cluster_has_replica_ko( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_replica_ko.json" patroni_path: Union[str, Path] = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = 
"cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_replica_ko") def test_cluster_has_replica_ko_with_count_thresholds( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_has_replica", "--warning", "@1", "--critical", "@0", ], ) assert ( result.stdout == "CLUSTERHASREPLICA WARNING - healthy_replica is 1 (outside range @0:1) | healthy_replica=1;@1;@0 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=1\n" ) assert result.exit_code == 1 @pytest.mark.usefixtures("cluster_has_replica_ko") def test_cluster_has_replica_ko_with_sync_count_thresholds( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_has_replica", "--sync-warning", "2:", "--sync-critical", "1:", ], ) # The lag on srv2 is "unknown". We don't handle string in perfstats so we have to scratch all the second node stats assert ( result.stdout == "CLUSTERHASREPLICA CRITICAL - sync_replica is 0 (outside range 1:) | healthy_replica=1 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0;2:;1: unhealthy_replica=1\n" ) assert result.exit_code == 2 @pytest.fixture def cluster_has_replica_ko_lag( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_replica_ko_lag.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_replica_ko_lag") def test_cluster_has_replica_ko_with_count_thresholds_and_lag( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_has_replica", "--warning", "@1", "--critical", "@0", "--max-lag", "1MB", ], ) assert ( result.stdout == "CLUSTERHASREPLICA CRITICAL - healthy_replica is 0 (outside range @0:0) | healthy_replica=0;@1;@0 srv2_lag=10241024 srv2_sync=0 srv2_timeline=51 srv3_lag=20000000 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=2\n" ) assert result.exit_code == 2 @pytest.fixture def cluster_has_replica_ko_wrong_tl( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_replica_ko_wrong_tl.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_replica_ko_wrong_tl") def test_cluster_has_replica_ko_wrong_tl( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_has_replica", "--warning", "@1", "--critical", "@0", "--max-lag", "1MB", ], ) assert ( result.stdout == "CLUSTERHASREPLICA WARNING - healthy_replica is 1 (outside range @0:1) | healthy_replica=1;@1;@0 srv2_lag=1000000 srv2_sync=0 srv2_timeline=50 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=1\n" ) assert 
result.exit_code == 1 @pytest.fixture def cluster_has_replica_ko_all_replica( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_has_replica_ko_all_replica.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_has_replica_ko_all_replica") def test_cluster_has_replica_ko_all_replica( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_has_replica", "--warning", "@1", "--critical", "@0", "--max-lag", "1MB", ], ) assert ( result.stdout == "CLUSTERHASREPLICA CRITICAL - healthy_replica is 0 (outside range @0:0) | healthy_replica=0;@1;@0 srv1_lag=0 srv1_sync=0 srv1_timeline=51 srv2_lag=0 srv2_sync=0 srv2_timeline=51 srv3_lag=0 srv3_sync=0 srv3_timeline=51 sync_replica=0 unhealthy_replica=3\n" ) assert result.exit_code == 2 check_patroni-2.2.0/tests/test_cluster_has_scheduled_action.py000066400000000000000000000033741475506406400250120ustar00rootroot00000000000000from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI def test_cluster_has_scheduled_action_ok( runner: CliRunner, patroni_api: PatroniAPI ) -> None: with patroni_api.routes({"cluster": "cluster_has_scheduled_action_ok.json"}): result = runner.invoke( main, ["-e", patroni_api.endpoint, "cluster_has_scheduled_action"] ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERHASSCHEDULEDACTION OK - has_scheduled_actions is 0 | has_scheduled_actions=0;;0 scheduled_restart=0 scheduled_switchover=0\n" ) def test_cluster_has_scheduled_action_ko_switchover( runner: CliRunner, patroni_api: PatroniAPI ) -> None: with patroni_api.routes( {"cluster": "cluster_has_scheduled_action_ko_switchover.json"} ): result = runner.invoke( main, ["-e", patroni_api.endpoint, "cluster_has_scheduled_action"] ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERHASSCHEDULEDACTION CRITICAL - has_scheduled_actions is 1 (outside range 0:0) | has_scheduled_actions=1;;0 scheduled_restart=0 scheduled_switchover=1\n" ) def test_cluster_has_scheduled_action_ko_restart( runner: CliRunner, patroni_api: PatroniAPI ) -> None: with patroni_api.routes( {"cluster": "cluster_has_scheduled_action_ko_restart.json"} ): result = runner.invoke( main, ["-e", patroni_api.endpoint, "cluster_has_scheduled_action"] ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERHASSCHEDULEDACTION CRITICAL - has_scheduled_actions is 1 (outside range 0:0) | has_scheduled_actions=1;;0 scheduled_restart=1 scheduled_switchover=0\n" ) check_patroni-2.2.0/tests/test_cluster_is_in_maintenance.py000066400000000000000000000030111475506406400243110ustar00rootroot00000000000000from click.testing import CliRunner from check_patroni.cli import main from . 
import PatroniAPI def test_cluster_is_in_maintenance_ok( runner: CliRunner, patroni_api: PatroniAPI ) -> None: with patroni_api.routes({"cluster": "cluster_is_in_maintenance_ok.json"}): result = runner.invoke( main, ["-e", patroni_api.endpoint, "cluster_is_in_maintenance"] ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERISINMAINTENANCE OK - is_in_maintenance is 0 | is_in_maintenance=0;;0\n" ) def test_cluster_is_in_maintenance_ko( runner: CliRunner, patroni_api: PatroniAPI ) -> None: with patroni_api.routes({"cluster": "cluster_is_in_maintenance_ko.json"}): result = runner.invoke( main, ["-e", patroni_api.endpoint, "cluster_is_in_maintenance"] ) assert result.exit_code == 2 assert ( result.stdout == "CLUSTERISINMAINTENANCE CRITICAL - is_in_maintenance is 1 (outside range 0:0) | is_in_maintenance=1;;0\n" ) def test_cluster_is_in_maintenance_ok_pause_false( runner: CliRunner, patroni_api: PatroniAPI ) -> None: with patroni_api.routes( {"cluster": "cluster_is_in_maintenance_ok_pause_false.json"} ): result = runner.invoke( main, ["-e", patroni_api.endpoint, "cluster_is_in_maintenance"] ) assert result.exit_code == 0 assert ( result.stdout == "CLUSTERISINMAINTENANCE OK - is_in_maintenance is 0 | is_in_maintenance=0;;0\n" ) check_patroni-2.2.0/tests/test_cluster_node_count.py000066400000000000000000000300371475506406400230130ustar00rootroot00000000000000from pathlib import Path from typing import Iterator, Union import pytest from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI, cluster_api_set_replica_running @pytest.fixture def cluster_node_count_ok( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_node_count_ok.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_node_count_ok") def test_cluster_node_count_ok( runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool ) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_node_count"]) if old_replica_state: assert ( result.output == "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3 members=3 role_leader=1 role_replica=2 state_running=3\n" ) else: assert ( result.output == "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3 members=3 role_leader=1 role_replica=2 state_running=1 state_streaming=2\n" ) assert result.exit_code == 0 @pytest.fixture def cluster_node_count_ok_sync( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_node_count_ok_sync.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_node_count_ok_sync") def test_cluster_node_count_ok_sync( runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool ) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_node_count"]) if old_replica_state: assert ( result.output == 
"CLUSTERNODECOUNT OK - members is 3 | healthy_members=3 members=3 role_leader=1 role_replica=1 role_sync_standby=1 state_running=3\n" ) else: assert ( result.output == "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3 members=3 role_leader=1 role_replica=1 role_sync_standby=1 state_running=1 state_streaming=2\n" ) assert result.exit_code == 0 @pytest.fixture def cluster_node_count_ok_quorum( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_node_count_ok_quorum.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_node_count_ok_quorum") def test_cluster_node_count_ok_quorum( runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool ) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "cluster_node_count"]) if old_replica_state: assert ( result.output == "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3 members=3 role_leader=1 role_quorum_standby=2 state_running=3\n" ) else: assert ( result.output == "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3 members=3 role_leader=1 role_quorum_standby=2 state_running=1 state_streaming=2\n" ) @pytest.mark.usefixtures("cluster_node_count_ok") def test_cluster_node_count_ok_with_thresholds( runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_node_count", "--warning", "@0:1", "--critical", "@2", "--healthy-warning", "@2", "--healthy-critical", "@0:1", ], ) if old_replica_state: assert ( result.output == "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3;@2;@1 members=3;@1;@2 role_leader=1 role_replica=2 state_running=3\n" ) else: assert ( result.output == "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3;@2;@1 members=3;@1;@2 role_leader=1 role_replica=2 state_running=1 state_streaming=2\n" ) assert result.exit_code == 0 @pytest.fixture def cluster_node_count_healthy_warning( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_node_count_healthy_warning.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_node_count_healthy_warning") def test_cluster_node_count_healthy_warning( runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_node_count", "--healthy-warning", "@2", "--healthy-critical", "@0:1", ], ) if old_replica_state: assert ( result.output == "CLUSTERNODECOUNT WARNING - healthy_members is 2 (outside range @0:2) | healthy_members=2;@2;@1 members=2 role_leader=1 role_replica=1 state_running=2\n" ) else: assert ( result.output == "CLUSTERNODECOUNT WARNING - healthy_members is 2 (outside range @0:2) | healthy_members=2;@2;@1 members=2 role_leader=1 role_replica=1 state_running=1 state_streaming=1\n" ) assert result.exit_code == 1 
@pytest.fixture def cluster_node_count_healthy_critical( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_node_count_healthy_critical.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_node_count_healthy_critical") def test_cluster_node_count_healthy_critical( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_node_count", "--healthy-warning", "@2", "--healthy-critical", "@0:1", ], ) assert ( result.output == "CLUSTERNODECOUNT CRITICAL - healthy_members is 1 (outside range @0:1) | healthy_members=1;@2;@1 members=3 role_leader=1 role_replica=2 state_running=1 state_start_failed=2\n" ) assert result.exit_code == 2 @pytest.fixture def cluster_node_count_warning( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_node_count_warning.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_node_count_warning") def test_cluster_node_count_warning( runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_node_count", "--warning", "@2", "--critical", "@0:1", ], ) if old_replica_state: assert ( result.stdout == "CLUSTERNODECOUNT WARNING - members is 2 (outside range @0:2) | healthy_members=2 members=2;@2;@1 role_leader=1 role_replica=1 state_running=2\n" ) else: assert ( result.stdout == "CLUSTERNODECOUNT WARNING - members is 2 (outside range @0:2) | healthy_members=2 members=2;@2;@1 role_leader=1 role_replica=1 state_running=1 state_streaming=1\n" ) assert result.exit_code == 1 @pytest.fixture def cluster_node_count_critical( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = "cluster_node_count_critical.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_node_count_critical") def test_cluster_node_count_critical( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_node_count", "--warning", "@2", "--critical", "@0:1", ], ) assert ( result.stdout == "CLUSTERNODECOUNT CRITICAL - members is 1 (outside range @0:1) | healthy_members=1 members=1;@2;@1 role_leader=1 state_running=1\n" ) assert result.exit_code == 2 @pytest.fixture def cluster_node_count_ko_in_archive_recovery( patroni_api: PatroniAPI, old_replica_state: bool, datadir: Path, tmp_path: Path ) -> Iterator[None]: cluster_path: Union[str, Path] = 
"cluster_node_count_ko_in_archive_recovery.json" patroni_path = "cluster_has_replica_patroni_verion_3.1.0.json" if old_replica_state: cluster_path = cluster_api_set_replica_running(datadir / cluster_path, tmp_path) patroni_path = "cluster_has_replica_patroni_verion_3.0.0.json" with patroni_api.routes({"cluster": cluster_path, "patroni": patroni_path}): yield None @pytest.mark.usefixtures("cluster_node_count_ko_in_archive_recovery") def test_cluster_node_count_ko_in_archive_recovery( runner: CliRunner, patroni_api: PatroniAPI, old_replica_state: bool ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "cluster_node_count", "--healthy-warning", "@2", "--healthy-critical", "@0:1", ], ) if old_replica_state: assert ( result.stdout == "CLUSTERNODECOUNT OK - members is 3 | healthy_members=3;@2;@1 members=3 role_replica=2 role_standby_leader=1 state_running=3\n" ) assert result.exit_code == 0 else: assert ( result.stdout == "CLUSTERNODECOUNT CRITICAL - healthy_members is 1 (outside range @0:1) | healthy_members=1;@2;@1 members=3 role_replica=2 role_standby_leader=1 state_in_archive_recovery=2 state_streaming=1\n" ) assert result.exit_code == 2 check_patroni-2.2.0/tests/test_node_is_alive.py000066400000000000000000000016351475506406400217170ustar00rootroot00000000000000from pathlib import Path from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI def test_node_is_alive_ok( runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path ) -> None: liveness = tmp_path / "liveness" liveness.touch() with patroni_api.routes({"liveness": liveness}): result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_alive"]) assert result.exit_code == 0 assert ( result.stdout == "NODEISALIVE OK - This node is alive (patroni is running). | is_alive=1;;@0\n" ) def test_node_is_alive_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_alive"]) assert result.exit_code == 2 assert ( result.stdout == "NODEISALIVE CRITICAL - This node is not alive (patroni is not running). | is_alive=0;;@0\n" ) check_patroni-2.2.0/tests/test_node_is_leader.py000066400000000000000000000032371475506406400220530ustar00rootroot00000000000000from typing import Iterator import pytest from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI @pytest.fixture def node_is_leader_ok(patroni_api: PatroniAPI) -> Iterator[None]: with patroni_api.routes( { "leader": "node_is_leader_ok.json", "standby-leader": "node_is_leader_ok_standby_leader.json", } ): yield None @pytest.mark.usefixtures("node_is_leader_ok") def test_node_is_leader_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_leader"]) assert result.exit_code == 0 assert ( result.stdout == "NODEISLEADER OK - This node is a leader node. | is_leader=1;;@0\n" ) result = runner.invoke( main, ["-e", patroni_api.endpoint, "node_is_leader", "--is-standby-leader"], ) assert result.exit_code == 0 assert ( result.stdout == "NODEISLEADER OK - This node is a standby leader node. | is_leader=1;;@0\n" ) def test_node_is_leader_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_leader"]) assert result.exit_code == 2 assert ( result.stdout == "NODEISLEADER CRITICAL - This node is not a leader node. 
| is_leader=0;;@0\n" ) result = runner.invoke( main, ["-e", patroni_api.endpoint, "node_is_leader", "--is-standby-leader"], ) assert result.exit_code == 2 assert ( result.stdout == "NODEISLEADER CRITICAL - This node is not a standby leader node. | is_leader=0;;@0\n" ) check_patroni-2.2.0/tests/test_node_is_pending_restart.py000066400000000000000000000020231475506406400237770ustar00rootroot00000000000000from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI def test_node_is_pending_restart_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None: with patroni_api.routes({"patroni": "node_is_pending_restart_ok.json"}): result = runner.invoke( main, ["-e", patroni_api.endpoint, "node_is_pending_restart"] ) assert result.exit_code == 0 assert ( result.stdout == "NODEISPENDINGRESTART OK - This node doesn't have the pending restart flag. | is_pending_restart=0;;0\n" ) def test_node_is_pending_restart_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None: with patroni_api.routes({"patroni": "node_is_pending_restart_ko.json"}): result = runner.invoke( main, ["-e", patroni_api.endpoint, "node_is_pending_restart"] ) assert result.exit_code == 2 assert ( result.stdout == "NODEISPENDINGRESTART CRITICAL - This node has the pending restart flag. | is_pending_restart=1;;0\n" ) check_patroni-2.2.0/tests/test_node_is_primary.py000066400000000000000000000015331475506406400222770ustar00rootroot00000000000000from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI def test_node_is_primary_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None: with patroni_api.routes({"primary": "node_is_primary_ok.json"}): result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_primary"]) assert result.exit_code == 0 assert ( result.stdout == "NODEISPRIMARY OK - This node is the primary with the leader lock. | is_primary=1;;@0\n" ) def test_node_is_primary_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_primary"]) assert result.exit_code == 2 assert ( result.stdout == "NODEISPRIMARY CRITICAL - This node is not the primary with the leader lock. | is_primary=0;;@0\n" ) check_patroni-2.2.0/tests/test_node_is_replica.py000066400000000000000000000212601475506406400222320ustar00rootroot00000000000000from typing import Iterator import pytest from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI @pytest.fixture def node_is_replica_ok(patroni_api: PatroniAPI) -> Iterator[None]: with patroni_api.routes( { k: "node_is_replica_ok.json" for k in ("replica", "synchronous", "asynchronous") } ): yield None @pytest.mark.usefixtures("node_is_replica_ok") def test_node_is_replica_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_replica"]) assert ( result.stdout == "NODEISREPLICA OK - This node is a running replica with no noloadbalance tag. | is_replica=1;;@0\n" ) assert result.exit_code == 0 def test_node_is_replica_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_is_replica"]) assert ( result.stdout == "NODEISREPLICA CRITICAL - This node is not a running replica with no noloadbalance tag. 
| is_replica=0;;@0\n" ) assert result.exit_code == 2 def test_node_is_replica_ko_lag(runner: CliRunner, patroni_api: PatroniAPI) -> None: # We don't do the check ourselves, patroni does it and changes the return code result = runner.invoke( main, ["-e", patroni_api.endpoint, "node_is_replica", "--max-lag", "100"] ) assert ( result.stdout == "NODEISREPLICA CRITICAL - This node is not a running replica with no noloadbalance tag and a lag under 100. | is_replica=0;;@0\n" ) assert result.exit_code == 2 result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_is_replica", "--is-async", "--max-lag", "100", ], ) assert ( result.stdout == "NODEISREPLICA CRITICAL - This node is not a running asynchronous replica with no noloadbalance tag and a lag under 100. | is_replica=0;;@0\n" ) assert result.exit_code == 2 @pytest.mark.usefixtures("node_is_replica_sync_sync_ok") def test_node_is_replica_sync_any_ok( runner: CliRunner, patroni_api: PatroniAPI ) -> None: # We don't do the check ourselves, patroni does it and changes the return code result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_is_replica", "--is-sync", "--sync-type", "any", ], ) assert ( result.stdout == "NODEISREPLICA OK - This node is a running synchronous replica of kind 'any' with no noloadbalance tag. | is_replica=1;;@0\n" ) assert result.exit_code == 0 def test_node_is_replica_sync_any_ko( runner: CliRunner, patroni_api: PatroniAPI ) -> None: # We don't do the check ourselves, patroni does it and changes the return code result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_is_replica", "--is-sync", "--sync-type", "any", ], ) assert ( result.stdout == "NODEISREPLICA CRITICAL - This node is not a running synchronous replica of kind 'any' with no noloadbalance tag. | is_replica=0;;@0\n" ) assert result.exit_code == 2 @pytest.mark.usefixtures("node_is_replica_ok") def test_node_is_replica_async_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None: # We don't do the check ourselves, patroni does it and changes the return code result = runner.invoke( main, ["-e", patroni_api.endpoint, "node_is_replica", "--is-async"] ) assert ( result.stdout == "NODEISREPLICA OK - This node is a running asynchronous replica with no noloadbalance tag. | is_replica=1;;@0\n" ) assert result.exit_code == 0 def test_node_is_replica_async_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None: # We don't do the check ourselves, patroni does it and changes the return code result = runner.invoke( main, ["-e", patroni_api.endpoint, "node_is_replica", "--is-async"] ) assert ( result.stdout == "NODEISREPLICA CRITICAL - This node is not a running asynchronous replica with no noloadbalance tag. 
| is_replica=0;;@0\n" ) assert result.exit_code == 2 @pytest.mark.usefixtures("node_is_replica_ok") def test_node_is_replica_params(runner: CliRunner, patroni_api: PatroniAPI) -> None: # We don't do the check ourselves, patroni does it and changes the return code result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_is_replica", "--is-async", "--is-sync", ], ) assert ( result.stdout == "NODEISREPLICA UNKNOWN: click.exceptions.UsageError: --is-sync and --is-async cannot be provided at the same time for this service\n" ) assert result.exit_code == 3 # We don't do the check ourselves, patroni does it and changes the return code result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_is_replica", "--is-sync", "--max-lag", "1MB", ], ) assert ( result.stdout == "NODEISREPLICA UNKNOWN: click.exceptions.UsageError: --is-sync and --max-lag cannot be provided at the same time for this service\n" ) assert result.exit_code == 3 @pytest.fixture def node_is_replica_sync_sync_ok(patroni_api: PatroniAPI) -> Iterator[None]: with patroni_api.routes( { k: "node_is_replica_ok_sync.json" for k in ("replica", "synchronous", "asynchronous") } ): yield None @pytest.mark.usefixtures("node_is_replica_sync_sync_ok") def test_node_is_replica_sync_sync_ok( runner: CliRunner, patroni_api: PatroniAPI ) -> None: # We don't do the check ourselves, patroni does it and changes the return code result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_is_replica", "--is-sync", "--sync-type", "sync", ], ) assert ( result.stdout == "NODEISREPLICA OK - This node is a running synchronous replica of kind 'sync' with no noloadbalance tag. | is_replica=1;;@0\n" ) assert result.exit_code == 0 @pytest.mark.usefixtures("node_is_replica_sync_quorum_ok") def test_node_is_replica_sync_sync_ko( runner: CliRunner, patroni_api: PatroniAPI ) -> None: # We don't do the check ourselves, patroni does it and changes the return code result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_is_replica", "--is-sync", "--sync-type", "sync", ], ) assert ( result.stdout == "NODEISREPLICA CRITICAL - This node is not a running synchronous replica of kind 'sync' with no noloadbalance tag. | is_replica=0;;@0\n" ) assert result.exit_code == 2 @pytest.fixture def node_is_replica_sync_quorum_ok(patroni_api: PatroniAPI) -> Iterator[None]: with patroni_api.routes( { k: "node_is_replica_ok_quorum.json" for k in ("replica", "synchronous", "asynchronous") } ): yield None @pytest.mark.usefixtures("node_is_replica_sync_quorum_ok") def test_node_is_replica_sync_quorum_ok( runner: CliRunner, patroni_api: PatroniAPI ) -> None: # We don't do the check ourselves, patroni does it and changes the return code result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_is_replica", "--is-sync", "--sync-type", "quorum", ], ) assert ( result.stdout == "NODEISREPLICA OK - This node is a running synchronous replica of kind 'quorum' with no noloadbalance tag. | is_replica=1;;@0\n" ) assert result.exit_code == 0 @pytest.mark.usefixtures("node_is_replica_sync_sync_ok") def test_node_is_replica_quorum_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None: # We don't do the check ourselves, patroni does it and changes the return code result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_is_replica", "--is-sync", "--sync-type", "quorum", ], ) assert ( result.stdout == "NODEISREPLICA CRITICAL - This node is not a running synchronous replica of kind 'quorum' with no noloadbalance tag. 
| is_replica=0;;@0\n" ) assert result.exit_code == 2 check_patroni-2.2.0/tests/test_node_patroni_version.py000066400000000000000000000024021475506406400233360ustar00rootroot00000000000000from typing import Iterator import pytest from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI @pytest.fixture(scope="module", autouse=True) def node_patroni_version(patroni_api: PatroniAPI) -> Iterator[None]: with patroni_api.routes({"patroni": "node_patroni_version.json"}): yield None def test_node_patroni_version_ok(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_patroni_version", "--patroni-version", "2.0.2", ], ) assert result.exit_code == 0 assert ( result.stdout == "NODEPATRONIVERSION OK - Patroni's version is 2.0.2. | is_version_ok=1;;@0\n" ) def test_node_patroni_version_ko(runner: CliRunner, patroni_api: PatroniAPI) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_patroni_version", "--patroni-version", "1.0.0", ], ) assert result.exit_code == 2 assert ( result.stdout == "NODEPATRONIVERSION CRITICAL - Patroni's version is not 1.0.0. | is_version_ok=0;;@0\n" ) check_patroni-2.2.0/tests/test_node_tl_has_changed.py000066400000000000000000000112541475506406400230450ustar00rootroot00000000000000from pathlib import Path from typing import Iterator import nagiosplugin import pytest from click.testing import CliRunner from check_patroni.cli import main from . import PatroniAPI @pytest.fixture def node_tl_has_changed(patroni_api: PatroniAPI) -> Iterator[None]: with patroni_api.routes({"patroni": "node_tl_has_changed.json"}): yield None @pytest.mark.usefixtures("node_tl_has_changed") def test_node_tl_has_changed_ok_with_timeline( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_tl_has_changed", "--timeline", "58", ], ) assert result.exit_code == 0 assert ( result.stdout == "NODETLHASCHANGED OK - The timeline is still 58. | is_timeline_changed=0;;@1:1 timeline=58\n" ) @pytest.mark.usefixtures("node_tl_has_changed") def test_node_tl_has_changed_ok_with_state_file( runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path ) -> None: state_file = tmp_path / "node_tl_has_changed.state_file" with state_file.open("w") as f: f.write('{"timeline": 58}') result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_tl_has_changed", "--state-file", str(state_file), ], ) assert result.exit_code == 0 assert ( result.stdout == "NODETLHASCHANGED OK - The timeline is still 58. | is_timeline_changed=0;;@1:1 timeline=58\n" ) @pytest.mark.usefixtures("node_tl_has_changed") def test_node_tl_has_changed_ko_with_timeline( runner: CliRunner, patroni_api: PatroniAPI ) -> None: result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_tl_has_changed", "--timeline", "700", ], ) assert result.exit_code == 2 assert ( result.stdout == "NODETLHASCHANGED CRITICAL - The expected timeline was 700 got 58. 
| is_timeline_changed=1;;@1:1 timeline=58\n" ) @pytest.mark.usefixtures("node_tl_has_changed") def test_node_tl_has_changed_ko_with_state_file_and_save( runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path ) -> None: state_file = tmp_path / "node_tl_has_changed.state_file" with state_file.open("w") as f: f.write('{"timeline": 700}') # test without saving the new tl result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_tl_has_changed", "--state-file", str(state_file), ], ) assert result.exit_code == 2 assert ( result.stdout == "NODETLHASCHANGED CRITICAL - The expected timeline was 700 got 58. | is_timeline_changed=1;;@1:1 timeline=58\n" ) cookie = nagiosplugin.Cookie(state_file) cookie.open() new_tl = cookie.get("timeline") cookie.close() assert new_tl == 700 # test when we save the hash result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_tl_has_changed", "--state-file", str(state_file), "--save", ], ) assert result.exit_code == 2 assert ( result.stdout == "NODETLHASCHANGED CRITICAL - The expected timeline was 700 got 58. | is_timeline_changed=1;;@1:1 timeline=58\n" ) cookie = nagiosplugin.Cookie(state_file) cookie.open() new_tl = cookie.get("timeline") cookie.close() assert new_tl == 58 @pytest.mark.usefixtures("node_tl_has_changed") def test_node_tl_has_changed_params( runner: CliRunner, patroni_api: PatroniAPI, tmp_path: Path ) -> None: # This one is placed last because it seems like the exceptions are not flushed from stderr for the next tests. fake_state_file = tmp_path / "fake_file_name.state_file" result = runner.invoke( main, [ "-e", patroni_api.endpoint, "node_tl_has_changed", "--timeline", "58", "--state-file", str(fake_state_file), ], ) assert result.exit_code == 3 assert ( result.stdout == "NODETLHASCHANGED UNKNOWN: click.exceptions.UsageError: Either --timeline or --state-file should be provided for this service\n" ) result = runner.invoke(main, ["-e", patroni_api.endpoint, "node_tl_has_changed"]) assert result.exit_code == 3 assert ( result.stdout == "NODETLHASCHANGED UNKNOWN: click.exceptions.UsageError: Either --timeline or --state-file should be provided for this service\n" ) check_patroni-2.2.0/tox.ini000066400000000000000000000022561475506406400156570ustar00rootroot00000000000000[tox] # the versions specified here are overridden by github workflow envlist = lint, mypy, py{39,310,311,312,313} skip_missing_interpreters = True [testenv] extras = test commands = pytest {toxinidir}/check_patroni {toxinidir}/tests {posargs:-vv --log-level=debug} [testenv:lint] skip_install = True deps = codespell black flake8 isort commands = codespell {toxinidir}/check_patroni {toxinidir}/tests {toxinidir}/docs/ {toxinidir}/RELEASE.md {toxinidir}/CONTRIBUTING.md black --check --diff {toxinidir}/check_patroni {toxinidir}/tests flake8 {toxinidir}/check_patroni {toxinidir}/tests isort --check --diff {toxinidir}/check_patroni {toxinidir}/tests [testenv:mypy] deps = mypy commands = # we need to install types-requests mypy --install-types --non-interactive [testenv:build] deps = wheel setuptools twine allowlist_externals = rm commands = rm --verbose --recursive --force {toxinidir}/dist/ python -m build python -m twine check dist/* [testenv:upload] # requires a check_patroni section in ~/.pypirc skip_install = True deps = twine commands = python -m twine upload --repository check_patroni dist/* 
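# Typical local usage (illustrative, not part of the file's commands): run a
# single environment with `tox -e lint` or `tox -e mypy`; plain `tox` runs the
# default envlist above.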
check_patroni-2.2.0/vagrant/000077500000000000000000000000001475506406400160015ustar00rootroot00000000000000check_patroni-2.2.0/vagrant/LICENSE000066400000000000000000000030001475506406400167770ustar00rootroot00000000000000BSD 3-Clause License Copyright (c) 2019, Jehan-Guillaume (ioguix) de Rorthais All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. check_patroni-2.2.0/vagrant/Makefile000066400000000000000000000006731475506406400174470ustar00rootroot00000000000000export VAGRANT_BOX_UPDATE_CHECK_DISABLE=1 export VAGRANT_CHECKPOINT_DISABLE=1 .PHONY: all prov validate all: prov prov: vagrant up --provision clean: vagrant destroy -f validate: @vagrant validate @if which shellcheck >/dev/null ;\ then shellcheck provision/* ;\ else echo "WARNING: shellcheck is not in PATH, not checking bash syntax" ;\ fi check_patroni-2.2.0/vagrant/README.md000066400000000000000000000031651475506406400172650ustar00rootroot00000000000000# Icinga ## Install Create the VM: ``` make ``` ## IcingaWeb Configure Icingaweb: ``` http://$IP/icingaweb2/setup ``` * Screen 1: Welcome Use the icinga token given at the end of the `icinga2-setup` provision, or: ``` sudo icingacli setup token show ``` Next * Screen 2: Modules Activate Monitor (already set) Next * Screen 3: Icinga Web 2 Next * Screen 4: Authentication Next * Screen 5: Database Resource Database Name: icingaweb_db Username: supervisor Password: th3Pass Charset: UTF8 Validate Next * Screen 6: Authentication Backend Next * Screen 7: Administration Fill in the blanks Next * Screen 8: Application Configuration Next * Screen 9: Summary Next * Screen 10: Welcome ... 
again Next * Screen 11: Monitoring IDO Resource Database Name: icinga2 Username: supervisor Password: th3Pass Charset: UTF8 Validate Next * Screen 12: Command Transport Transport name: icinga2 Transport Type: API Host: 127.0.0.1 Port: 5665 User: icinga_api Password: th3Pass Next * Screen 13: Monitoring Security Next * Screen 14: Summary Finish * Screen 15: Hopefully success Login ## Add servers to icinga ``` # Connect to the vm vagrant ssh s1 # Create /etc/icinga2/conf.d/check_patroni.conf sudo /vagrant/provision/director.bash init cluster1 p1=10.20.89.54 p2=10.20.89.55 # Check and load conf sudo icinga2 daemon -C sudo systemctl restart icinga2.service ``` # Grafana Connect to: http://10.20.89.52:3000/login User / pass: admin/admin Import the dashboards from the grafana directory. They are created for cluster1, and servers p1, p2. check_patroni-2.2.0/vagrant/Vagrantfile000066400000000000000000000040521475506406400201670ustar00rootroot00000000000000require 'ipaddr' #require 'yaml' ENV["LC_ALL"] = 'en_US.utf8' myBox = 'debian/buster64' myProvider = 'libvirt' pgver = 11 start_ip = '10.20.89.51' etcd_nodes = [] patroni_nodes = [] sup_nodes = ['s1'] # install check_patroni from the local repo (test) or from pip (official) cp_origin = 'test' # [test, official] Vagrant.configure(2) do |config| config.vm.provider myProvider next_ip = IPAddr.new(start_ip).succ host_ip = (IPAddr.new(start_ip) & "255.255.255.0").succ.to_s nodes_ips = {} ( patroni_nodes + etcd_nodes + sup_nodes ).each do |node| nodes_ips[node] = next_ip.to_s next_ip = next_ip.succ end # don't worry about the insecure ssh key config.ssh.insert_key = false # https://vagrantcloud.com/search. config.vm.box = myBox # hardware and host settings config.vm.provider 'libvirt' do |lv| lv.cpus = 1 lv.memory = 512 lv.watchdog model: 'i6300esb' lv.default_prefix = 'patroni_' lv.qemu_use_session = false end # disable default share (NFS does not work out of the box on Debian 10) config.vm.synced_folder ".", "/vagrant", type: "rsync" config.vm.synced_folder "/home/benoit/git/dalibo/check_patroni", "/check_patroni", type: "rsync" ## allow root@vm to ssh to ssh_login@network_1 #config.vm.synced_folder 'ssh', '/root/.ssh', type: 'rsync', # owner: 'root', group: 'root', # rsync__args: [ "--verbose", "--archive", "--delete", "--copy-links", "--no-perms" ] # system setup for sup nodes (sup_nodes).each do |node| config.vm.define node do |conf| conf.vm.network 'private_network', ip: nodes_ips[node] conf.vm.provision 'icinga2-setup', type: 'shell', path: 'provision/icinga2.bash', args: [ node ], preserve_order: true conf.vm.provision 'check_patroni', type: 'shell', path: 'provision/check_patroni.bash', args: [ cp_origin ], preserve_order: true end end end check_patroni-2.2.0/vagrant/check_patroni.sh000077500000000000000000000016161475506406400211550ustar00rootroot00000000000000#!/bin/bash if [[ -z "$1" ]]; then echo "usage: $0 PATRONI_END_POINT" exit 1 fi echo "-- Running patroni checks using endpoint $1" echo "-- Cluster checks" check_patroni -e "$1" cluster_config_has_changed --state-file cluster.state_file --save check_patroni -e "$1" cluster_has_leader check_patroni -e "$1" cluster_has_replica check_patroni -e "$1" cluster_is_in_maintenance check_patroni -e "$1" cluster_has_scheduled_action check_patroni -e "$1" cluster_node_count echo "-- Node checks" check_patroni -e "$1" node_is_alive check_patroni -e "$1" node_is_pending_restart check_patroni -e "$1" node_is_primary check_patroni -e "$1" node_is_leader --is-standby-leader check_patroni -e "$1" node_is_replica 
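# Sketch of what the sync variant below checks (inferred from the test suite
# rather than CLI docs): with --is-sync the plugin asks Patroni whether this
# node is a synchronous standby, so it only returns OK on a sync replica.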
check_patroni -e "$1" node_is_replica --is-sync check_patroni -e "$1" node_patroni_version --patroni-version 4.0.2 check_patroni -e "$1" node_tl_has_changed --state-file cluster.state_file --save check_patroni-2.2.0/vagrant/grafana/000077500000000000000000000000001475506406400174005ustar00rootroot00000000000000check_patroni-2.2.0/vagrant/grafana/cluster_status_cluster1.json000066400000000000000000000471551475506406400252150ustar00rootroot00000000000000{ "__inputs": [ { "name": "DS_OPM", "label": "opm", "description": "", "type": "datasource", "pluginId": "postgres", "pluginName": "PostgreSQL" }, { "name": "VAR_CLUSTER_NAME", "type": "constant", "label": "cluster_name", "value": "cluster1", "description": "" } ], "__elements": [], "__requires": [ { "type": "grafana", "id": "grafana", "name": "Grafana", "version": "8.3.3" }, { "type": "datasource", "id": "postgres", "name": "PostgreSQL", "version": "1.0.0" }, { "type": "panel", "id": "stat", "name": "Stat", "version": "" }, { "type": "panel", "id": "timeseries", "name": "Time series", "version": "" } ], "annotations": { "list": [ { "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "target": { "limit": 100, "matchAny": false, "tags": [], "type": "dashboard" }, "type": "dashboard" } ] }, "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 0, "id": null, "iteration": 1640960519458, "links": [], "liveNow": false, "panels": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": true, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 6, "w": 20, "x": 0, "y": 0 }, "id": 14, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom" }, "tooltip": { "mode": "single" } }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$cluster_name' AND s.service = 'check_patroni_cluster_has_replica'\n ) \n AND m.label ilike '%lag%' \nGROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Cluster replica lag", "type": "timeseries" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null } ] } }, "overrides": [] }, "gridPos": { 
"h": 2, "w": 4, "x": 20, "y": 0 }, "id": 4, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$cluster_name' AND s.service = 'check_patroni_cluster_has_leader'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Cluster has primary", "type": "stat" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 2, "w": 4, "x": 20, "y": 2 }, "id": 10, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$cluster_name' AND s.service = 'check_patroni_cluster_config_has_changed'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Cluster config has changed", "type": "stat" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 2, "w": 4, "x": 20, "y": 4 }, "id": 8, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = 
'$cluster_name' AND s.service = 'check_patroni_cluster_is_in_maintenance'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Cluster is in maintenance", "type": "stat" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [ { "matcher": { "id": "byName", "options": "role_leader" }, "properties": [ { "id": "displayName", "value": "leader" } ] }, { "matcher": { "id": "byName", "options": "role_replica" }, "properties": [ { "id": "displayName", "value": "replicas" } ] }, { "matcher": { "id": "byName", "options": "state_running" }, "properties": [ { "id": "displayName", "value": "running" } ] } ] }, "gridPos": { "h": 5, "w": 12, "x": 0, "y": 6 }, "id": 2, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "center", "orientation": "vertical", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM public.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id \n FROM public.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$cluster_name' AND s.service = 'check_patroni_cluster_node_count'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Cluster node count", "type": "stat" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "viz": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": true, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [ { "matcher": { "id": "byName", "options": "healthy_replica" }, "properties": [ { "id": "displayName", "value": "healthy" } ] }, { "matcher": { "id": "byName", "options": "unhealthy_replica" }, "properties": [ { "id": "displayName", "value": "unhealthy" } ] } ] }, "gridPos": { "h": 5, "w": 12, "x": 12, "y": 6 }, "id": 6, "options": { "legend": { "calcs": [], "displayMode": "list", "placement": "bottom" }, "tooltip": { "mode": "single" } }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS 
metric\n FROM public.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id \n FROM public.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$cluster_name' AND s.service = 'check_patroni_cluster_has_replica'\n )\n AND m.label IN('healthy_replica','unhealthy_replica') \n GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Cluster has replica", "type": "timeseries" } ], "refresh": "", "schemaVersion": 34, "style": "dark", "tags": [], "templating": { "list": [ { "hide": 2, "name": "cluster_name", "query": "${VAR_CLUSTER_NAME}", "skipUrlSync": false, "type": "constant", "current": { "value": "${VAR_CLUSTER_NAME}", "text": "${VAR_CLUSTER_NAME}", "selected": false }, "options": [ { "value": "${VAR_CLUSTER_NAME}", "text": "${VAR_CLUSTER_NAME}", "selected": false } ] }, { "auto": false, "auto_count": 30, "auto_min": "10s", "current": { "selected": true, "text": "1m", "value": "1m" }, "hide": 0, "name": "interval", "options": [ { "selected": true, "text": "1m", "value": "1m" }, { "selected": false, "text": "10m", "value": "10m" }, { "selected": false, "text": "30m", "value": "30m" }, { "selected": false, "text": "1h", "value": "1h" }, { "selected": false, "text": "6h", "value": "6h" }, { "selected": false, "text": "12h", "value": "12h" }, { "selected": false, "text": "1d", "value": "1d" }, { "selected": false, "text": "7d", "value": "7d" }, { "selected": false, "text": "14d", "value": "14d" }, { "selected": false, "text": "30d", "value": "30d" } ], "query": "1m,10m,30m,1h,6h,12h,1d,7d,14d,30d", "queryValue": "", "refresh": 2, "skipUrlSync": false, "type": "interval" } ] }, "time": { "from": "now-6h", "to": "now" }, "timepicker": {}, "timezone": "", "title": "Cluster status: cluster1", "uid": "4BullO0nk", "version": 10, "weekStart": "" }check_patroni-2.2.0/vagrant/grafana/node_status_p1.json000066400000000000000000000311571475506406400232320ustar00rootroot00000000000000{ "__inputs": [ { "name": "DS_OPM", "label": "opm", "description": "", "type": "datasource", "pluginId": "postgres", "pluginName": "PostgreSQL" }, { "name": "VAR_NODE_NAME", "type": "constant", "label": "node_name", "value": "p1", "description": "" } ], "__elements": [], "__requires": [ { "type": "grafana", "id": "grafana", "name": "Grafana", "version": "8.3.3" }, { "type": "datasource", "id": "postgres", "name": "PostgreSQL", "version": "1.0.0" }, { "type": "panel", "id": "stat", "name": "Stat", "version": "" } ], "annotations": { "list": [ { "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "target": { "limit": 100, "matchAny": false, "tags": [], "type": "dashboard" }, "type": "dashboard" } ] }, "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 0, "id": null, "iteration": 1640961009033, "links": [], "liveNow": false, "panels": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [ { "matcher": { "id": "byName", "options": "is_primary" }, "properties": [ { "id": "displayName", "value": "Primaire" } ] }, { "matcher": { "id": "byName", "options": 
"is_replica" }, "properties": [ { "id": "displayName", "value": "Secondaire" } ] } ] }, "gridPos": { "h": 9, "w": 12, "x": 0, "y": 0 }, "id": 2, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_primary'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "hide": false, "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_replica'\n ) GROUP BY time, m.label ORDER BY time", "refId": "B", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Node type", "type": "stat" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [ { "matcher": { "id": "byName", "options": "is_alive" }, "properties": [ { "id": "displayName", "value": "Node is alive" } ] }, { "matcher": { "id": "byName", "options": "is_pending_restart" }, "properties": [ { "id": "displayName", "value": "Node is pending restart" } ] }, { "matcher": { "id": "byName", "options": "timeline" }, "properties": [ { "id": "displayName", "value": "Current timeline" } ] } ] }, "gridPos": { "h": 9, "w": 12, "x": 12, "y": 0 }, "id": 4, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_alive'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] 
], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "hide": false, "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_tl_has_changed'\n )\nAND m.label = 'timeline'\nGROUP BY time, m.label ORDER BY time", "refId": "B", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "hide": false, "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_pending_restart'\n ) GROUP BY time, m.label ORDER BY time", "refId": "D", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Health stats", "type": "stat" } ], "schemaVersion": 34, "style": "dark", "tags": [], "templating": { "list": [ { "hide": 2, "name": "node_name", "query": "${VAR_NODE_NAME}", "skipUrlSync": false, "type": "constant", "current": { "value": "${VAR_NODE_NAME}", "text": "${VAR_NODE_NAME}", "selected": false }, "options": [ { "value": "${VAR_NODE_NAME}", "text": "${VAR_NODE_NAME}", "selected": false } ] }, { "auto": false, "auto_count": 30, "auto_min": "10s", "current": { "selected": false, "text": "1m", "value": "1m" }, "hide": 0, "name": "interval", "options": [ { "selected": true, "text": "1m", "value": "1m" }, { "selected": false, "text": "10m", "value": "10m" }, { "selected": false, "text": "30m", "value": "30m" }, { "selected": false, "text": "1h", "value": "1h" }, { "selected": false, "text": "6h", "value": "6h" }, { "selected": false, "text": "12h", "value": "12h" }, { "selected": false, "text": "1d", "value": "1d" }, { "selected": false, "text": "7d", "value": "7d" }, { "selected": false, "text": "14d", "value": "14d" }, { "selected": false, "text": "30d", "value": "30d" } ], "query": "1m,10m,30m,1h,6h,12h,1d,7d,14d,30d", "queryValue": "", "refresh": 2, "skipUrlSync": false, "type": "interval" } ] }, "time": { "from": "now-6h", "to": "now" }, "timepicker": {}, "timezone": "", "title": "Node status: p1", "uid": "2LfUnFAnk", "version": 1, "weekStart": "" }check_patroni-2.2.0/vagrant/grafana/node_status_p2.json000066400000000000000000000311601475506406400232250ustar00rootroot00000000000000{ "__inputs": [ { "name": "DS_OPM", "label": "opm", "description": "", "type": "datasource", "pluginId": "postgres", "pluginName": "PostgreSQL" }, { "name": "VAR_NODE_NAME", "type": "constant", "label": "node_name", "value": "p2", "description": "" } ], "__elements": [], "__requires": [ { "type": "grafana", "id": "grafana", "name": "Grafana", "version": "8.3.3" }, { "type": 
"datasource", "id": "postgres", "name": "PostgreSQL", "version": "1.0.0" }, { "type": "panel", "id": "stat", "name": "Stat", "version": "" } ], "annotations": { "list": [ { "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "target": { "limit": 100, "matchAny": false, "tags": [], "type": "dashboard" }, "type": "dashboard" } ] }, "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 0, "id": null, "iteration": 1640960994907, "links": [], "liveNow": false, "panels": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [ { "matcher": { "id": "byName", "options": "is_primary" }, "properties": [ { "id": "displayName", "value": "Primaire" } ] }, { "matcher": { "id": "byName", "options": "is_replica" }, "properties": [ { "id": "displayName", "value": "Secondaire" } ] } ] }, "gridPos": { "h": 9, "w": 12, "x": 0, "y": 0 }, "id": 2, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_primary'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "hide": false, "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_replica'\n ) GROUP BY time, m.label ORDER BY time", "refId": "B", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Node type", "type": "stat" }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [ { "matcher": { "id": "byName", "options": "is_alive" }, "properties": [ { "id": "displayName", "value": "Node is alive" } ] }, { "matcher": { "id": "byName", "options": "is_pending_restart" }, "properties": [ { "id": "displayName", "value": "Node is pending restart" } ] }, { "matcher": { "id": "byName", "options": "timeline" }, 
"properties": [ { "id": "displayName", "value": "Current timeline" } ] } ] }, "gridPos": { "h": 9, "w": 12, "x": 12, "y": 0 }, "id": 4, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "textMode": "auto" }, "pluginVersion": "8.3.3", "targets": [ { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_alive'\n ) GROUP BY time, m.label ORDER BY time", "refId": "A", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "hide": false, "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_tl_has_changed'\n )\nAND m.label = 'timeline'\nGROUP BY time, m.label ORDER BY time", "refId": "B", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] }, { "datasource": { "type": "postgres", "uid": "${DS_OPM}" }, "format": "time_series", "group": [], "hide": false, "metricColumn": "none", "rawQuery": true, "rawSql": " SELECT $__timeGroup(timet, $interval) AS time, MAX(d.value), m.label AS metric\n FROM wh_nagios.metrics m,\nLATERAL wh_nagios.get_metric_data(m.id, $__timeFrom(), $__timeTo()) d\n WHERE m.id_service = (\n SELECT s.id FROM wh_nagios.services s \n JOIN public.servers h ON h.id=s.id_server\n WHERE h.hostname = '$node_name' AND s.service = 'check_patroni_node_is_pending_restart'\n ) GROUP BY time, m.label ORDER BY time", "refId": "D", "select": [ [ { "params": [ "value" ], "type": "column" } ] ], "timeColumn": "time", "where": [ { "name": "$__timeFilter", "params": [], "type": "macro" } ] } ], "title": "Health stats", "type": "stat" } ], "schemaVersion": 34, "style": "dark", "tags": [], "templating": { "list": [ { "hide": 2, "name": "node_name", "query": "${VAR_NODE_NAME}", "skipUrlSync": false, "type": "constant", "current": { "value": "${VAR_NODE_NAME}", "text": "${VAR_NODE_NAME}", "selected": false }, "options": [ { "value": "${VAR_NODE_NAME}", "text": "${VAR_NODE_NAME}", "selected": false } ] }, { "auto": false, "auto_count": 30, "auto_min": "10s", "current": { "selected": false, "text": "1m", "value": "1m" }, "hide": 0, "name": "interval", "options": [ { "selected": true, "text": "1m", "value": "1m" }, { "selected": false, "text": "10m", "value": "10m" }, { "selected": false, "text": "30m", "value": "30m" }, { "selected": false, "text": "1h", "value": "1h" }, { "selected": false, "text": "6h", "value": "6h" }, { "selected": false, "text": "12h", "value": 
"12h" }, { "selected": false, "text": "1d", "value": "1d" }, { "selected": false, "text": "7d", "value": "7d" }, { "selected": false, "text": "14d", "value": "14d" }, { "selected": false, "text": "30d", "value": "30d" } ], "query": "1m,10m,30m,1h,6h,12h,1d,7d,14d,30d", "queryValue": "", "refresh": 2, "skipUrlSync": false, "type": "interval" } ] }, "time": { "from": "now-6h", "to": "now" }, "timepicker": {}, "timezone": "", "title": "Node status: p2", "uid": "2LfUnFAnkr", "version": 1, "weekStart": "" }check_patroni-2.2.0/vagrant/provision/000077500000000000000000000000001475506406400200315ustar00rootroot00000000000000check_patroni-2.2.0/vagrant/provision/check_patroni.bash000077500000000000000000000012371475506406400235070ustar00rootroot00000000000000#!/usr/bin/env bash info (){ echo "$1" } ORIGIN=$1 set -o errexit set -o nounset set -o pipefail info "#=============================================================================" info "# check_patroni" info "#=============================================================================" DEBIAN_FRONTEND=noninteractive apt install -q -y git python3-pip pip3 install --upgrade pip case "$ORIGIN" in "test") cd /check_patroni pip3 install . ln -s /usr/local/bin/check_patroni /usr/lib/nagios/plugins/check_patroni ;; "official") pip3 install check_patroni ;; *) echo "Origin : [$ORIGIN] is not supported" exit 1 esac check_patroni --version check_patroni-2.2.0/vagrant/provision/director.bash000077500000000000000000000174451475506406400225210ustar00rootroot00000000000000#!/usr/bin/env bash info(){ echo "$1" } usage(){ echo "$0 ACTION CLUSTER_NAME [NODE..]" echo "" echo " ACTION: init | add" echo " CLUSTER: cluster name" echo " NODE: HOST=IP" echo " HOST: any name for icinga" echo " IP: the IP" } if [ "$#" -le "3" ]; then usage exit 1 fi ACTION="$1" shift CLUSTER="$1" shift NODES=( "$@" ) TARGET="/etc/icinga2/conf.d/check_patroni.conf" #set -o errexit set -o nounset set -o pipefail init(){ cat << '__EOF__' > "$TARGET" # =================================================================== # Check Commands # =================================================================== template CheckCommand "check_patroni" { command = [ PluginDir + "/check_patroni" ] arguments = { "--endpoints" = { value = "$endpoints$" order = -2 repeat_key = true } "--timeout" = { value = "$timeout$" order = -1 } } } object CheckCommand "check_patroni_node_is_alive" { import "check_patroni" arguments += { "node_is_alive" = { order = 1 } } } object CheckCommand "check_patroni_node_is_primary" { import "check_patroni" arguments += { "node_is_primary" = { order = 1 } } } object CheckCommand "check_patroni_node_is_replica" { import "check_patroni" arguments += { "node_is_replica" = { order = 1 } } } object CheckCommand "check_patroni_node_is_pending_restart" { import "check_patroni" arguments += { "node_is_pending_restart" = { order = 1 } } } object CheckCommand "check_patroni_node_patroni_version" { import "check_patroni" arguments += { "node_patroni_version" = { order = 1 } "--patroni-version" = { value = "$patroni_version$" order = 2 } } } object CheckCommand "check_patroni_node_tl_has_changed" { import "check_patroni" arguments += { "node_tl_has_changed" = { order = 1 } "--state-file" = { value = "/tmp/$state_file$" # a quick and dirty way for this poc order = 2 } } } # ------------------------------------------------------------------- object CheckCommand "check_patroni_cluster_has_leader" { import "check_patroni" arguments += { "cluster_has_leader" = { order = 1 } } } object 
CheckCommand "check_patroni_cluster_has_replica" { import "check_patroni" arguments += { "cluster_has_replica" = { order = 1 } "--warning" = { value = "$has_replica_warning$" order = 2 } "--critical" = { value = "$has_replica_critical$" order = 3 } } } object CheckCommand "check_patroni_cluster_config_has_changed" { import "check_patroni" arguments += { "cluster_config_has_changed" = { order = 1 } "--state-file" = { value = "/tmp/$state_file$" # a quick and dirty way for this poc order = 2 } } } object CheckCommand "check_patroni_cluster_is_in_maintenance" { import "check_patroni" arguments += { "cluster_is_in_maintenance" = { order = 1 } } } object CheckCommand "check_patroni_cluster_node_count" { import "check_patroni" arguments += { "cluster_node_count" = { order = 1 } "--warning" = { value = "$node_count_warning$" order = 2 } "--critical" = { value = "$node_count_critical$" order = 3 } "--running-warning" = { value = "$node_count_running_warning$" order = 4 } "--running-critical" = { value = "$node_count_running_critical$" order = 5 } } } # =================================================================== # Services # =================================================================== template Service "check_patroni" { max_check_attempts = 3 check_interval = 1m # we spam a little for the sake of testing retry_interval = 15 # we spam a little for the sake of testing enable_perfdata = true vars.timeout = 10 } apply Service "check_patroni_node_is_alive" { import "check_patroni" check_command = "check_patroni_node_is_alive" assign where "patroni_servers" in host.groups } apply Service "check_patroni_node_is_primary" { import "check_patroni" check_command = "check_patroni_node_is_primary" assign where "patroni_servers" in host.groups } apply Service "check_patroni_node_is_replica" { import "check_patroni" check_command = "check_patroni_node_is_replica" assign where "patroni_servers" in host.groups } apply Service "check_patroni_node_is_pending_restart" { import "check_patroni" check_command = "check_patroni_node_is_pending_restart" assign where "patroni_servers" in host.groups } apply Service "check_patroni_node_patroni_version" { import "check_patroni" check_command = "check_patroni_node_patroni_version" assign where "patroni_servers" in host.groups } apply Service "check_patroni_node_tl_has_changed" { import "check_patroni" vars.state_file = host.name + ".state" check_command = "check_patroni_node_tl_has_changed" assign where "patroni_servers" in host.groups } # ------------------------------------------------------------------- apply Service "check_patroni_cluster_has_leader" { import "check_patroni" check_command = "check_patroni_cluster_has_leader" assign where "patroni_clusters" in host.groups } apply Service "check_patroni_cluster_has_replica" { import "check_patroni" check_command = "check_patroni_cluster_has_replica" assign where "patroni_clusters" in host.groups } apply Service "check_patroni_cluster_config_has_changed" { import "check_patroni" vars.state_file = host.name + ".state" check_command = "check_patroni_cluster_config_has_changed" assign where "patroni_clusters" in host.groups } apply Service "check_patroni_cluster_is_in_maintenance" { import "check_patroni" check_command = "check_patroni_cluster_is_in_maintenance" assign where "patroni_clusters" in host.groups } apply Service "check_patroni_cluster_node_count" { import "check_patroni" check_command = "check_patroni_cluster_node_count" assign where "patroni_clusters" in host.groups } # 
=================================================================== # Hosts meta # =================================================================== object HostGroup "patroni_servers" { display_name = "patroni servers" } template Host "patroni_servers" { groups = [ "patroni_servers" ] check_command = "hostalive" vars.patroni_version = "2.1.2" } # ------------------------------------------------------------------- object HostGroup "patroni_clusters" { display_name = "patroni clusters" } template Host "patroni_clusters" { groups = [ "patroni_clusters" ] check_command = "dummy" } # =================================================================== # Hosts # =================================================================== __EOF__ } add_hosts(){ NODES=( "$@" ) for N in "${NODES[@]}"; do IP="${N##*=}" HOST="${N%=*}" cat << __EOF__ >> "$TARGET" object Host "$HOST" { import "patroni_servers" display_name = "Server patroni $HOST" address = "$IP" vars.endpoints = [ "http://" + address + ":8008" ] } __EOF__ done } add_cluster(){ CLUSTER=$1 shift NODES=( "$@" ) NAME="" IPS=" " for N in "${NODES[@]}"; do IP="${N##*=}" HOST="${N%=*}" NAME="$NAME $HOST" IPS="$IPS\"http://${IP}:8008\", " done cat << __EOF__ >> "$TARGET" object Host "$CLUSTER" { import "patroni_clusters" display_name = "Cluster: $CLUSTER ($NAME )" vars.endpoints = [$IPS ] vars.has_replica_warning = "1:" vars.has_replica_critical = "1:" vars.node_count_warning = "2:" vars.node_count_critical = "1:" vars.node_count_running_warning = "2:" vars.node_count_running_critical = "1:" } __EOF__ } case "$ACTION" in "init") init add_hosts "${NODES[@]}" add_cluster "$CLUSTER" "${NODES[@]}" ;; "add") add_hosts "${NODES[@]}" add_cluster "$CLUSTER" "${NODES[@]}" ;; *) usage echo "error: invalid action" exit 1 esac check_patroni-2.2.0/vagrant/provision/icinga2.bash000077500000000000000000000270211475506406400222110ustar00rootroot00000000000000#!/usr/bin/env bash info (){ echo "$1" } #set -o errexit set -o nounset set -o pipefail NODENAME="$1" shift PG_ICINGA_USER_NAME="supervisor" PG_ICINGA_USER_PWD="th3Pass" PG_ICINGAWEB_USER_NAME="supervisor" PG_ICINGAWEB_USER_PWD="th3Pass" PG_DIRECTOR_USER_NAME="supervisor" PG_DIRECTOR_USER_PWD="th3Pass" PG_OPM_USER_NAME="opm" PG_OPM_USER_PWD="th3Pass" PG_GRAFANA_USER_NAME="supervisor" PG_GRAFANA_USER_PWD="th3Pass" set_hostname(){ info "#=============================================================================" info "# hostname and /etc/hosts setup" info "#=============================================================================" hostnamectl set-hostname "${NODENAME}" sed --in-place -e "s/\(127\.0\.0\.1\s*localhost$\)/\1 ${NODENAME}/" /etc/hosts } packages(){ info "#=============================================================================" info "# install required repos and packages" info "#=============================================================================" apt-get update || true apt-get -y install apt-transport-https wget gnupg software-properties-common DIST=$(awk -F"[)(]+" '/VERSION=/ {print $2}' /etc/os-release) echo "deb https://packages.icinga.com/debian icinga-${DIST} main" > "/etc/apt/sources.list.d/${DIST}-icinga.list" echo "deb-src https://packages.icinga.com/debian icinga-${DIST} main" >> "/etc/apt/sources.list.d/${DIST}-icinga.list" echo "deb https://packages.grafana.com/oss/deb stable main" > /etc/apt/sources.list.d/grafana.list echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list wget -q -O - https://packages.icinga.com/icinga.key | apt-key add -
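# NB: apt-key is deprecated from Debian 11 on; it still works on the
# debian/buster64 box this setup targets. On newer releases a signed-by
# keyring would be used instead, along these lines (untested sketch):
#   wget -q -O - https://packages.icinga.com/icinga.key \
#     | gpg --dearmor > /usr/share/keyrings/icinga.gpg
#   echo "deb [signed-by=/usr/share/keyrings/icinga.gpg] https://packages.icinga.com/debian icinga-${DIST} main" \
#     > "/etc/apt/sources.list.d/${DIST}-icinga.list"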
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add - wget -q -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add - apt-get update || true PACKAGES=( grafana icinga2 icinga2-ido-pgsql icingaweb2 icingaweb2-module-monitoring icingacli postgresql-client postgresql-14 php7.3-pgsql php7.3-imagick php7.3-intl nagios-plugins ) DEBIAN_FRONTEND=noninteractive apt install -q -y "${PACKAGES[@]}" systemctl --quiet --now enable postgresql@14 } icinga_setup(){ info "#=============================================================================" info "# Icinga setup" info "#=============================================================================" ## this part is already done by the standard icinga install with the user icinga2 ## and a random password, here we dont really care cat << __EOF__ | sudo -u postgres psql DROP ROLE IF EXISTS supervisor; DROP DATABASE IF EXISTS icinga2; CREATE ROLE ${PG_ICINGA_USER_NAME} WITH LOGIN SUPERUSER PASSWORD '${PG_ICINGA_USER_PWD}'; CREATE DATABASE icinga2; __EOF__ echo "*:*:*:${PG_ICINGA_USER_NAME}:${PG_ICINGA_USER_PWD}" > ~postgres/.pgpass chown postgres:postgres ~postgres/.pgpass chmod 600 ~postgres/.pgpass PGPASSFILE=~postgres/.pgpass psql -U $PG_ICINGA_USER_NAME -h 127.0.0.1 -d icinga2 -f /usr/share/icinga2-ido-pgsql/schema/pgsql.sql icingacli setup config directory --group icingaweb2 icingacli setup token create ## this part is already done by the standard icinga install with the user icinga2 cat << __EOF__ > /etc/icinga2/features-available/ido-pgsql.conf /** * The db_ido_pgsql library implements IDO functionality * for PostgreSQL. */ library "db_ido_pgsql" object IdoPgsqlConnection "ido-pgsql" { user = "${PG_ICINGA_USER_NAME}", password = "${PG_ICINGA_USER_PWD}", host = "localhost", database = "icinga2" } __EOF__ icinga2 feature enable ido-pgsql icinga2 feature enable command icinga2 feature enable perfdata #icinga2 node wizard icinga2 node setup --master --cn s1 --zone master systemctl restart icinga2.service } icinga_API(){ info "#=============================================================================" info "# Icinga API" info "#=============================================================================" icinga2 api setup cat <<__EOF__ >> /etc/icinga2/conf.d/api-users.conf object ApiUser "icingaapi" { password = "th3Pass" permissions = [ "*" ] } __EOF__ systemctl restart icinga2.service } icinga_web(){ info "#=============================================================================" info "# Icinga2 Web" info "#=============================================================================" if [ "$PG_ICINGA_USER_NAME" != "$PG_ICINGAWEB_USER_NAME" ]; then sudo -u postgres psql -c "CREATE ROLE ${PG_ICINGAWEB_USER_NAME} WITH LOGIN PASSWORD '${PG_ICINGAWEB_USER_PWD}';" fi sudo -u postgres psql -c "CREATE DATABASE icingaweb_db OWNER ${PG_ICINGAWEB_USER_NAME};" sed --in-place -e "s/;date\.timezone =/date.timezone = europe\/paris/" /etc/php/7.3/apache2/php.ini a2enconf icingaweb2 a2enmod rewrite a2dismod mpm_event a2enmod php7.3 systemctl restart apache2 } director(){ info "#=============================================================================" info "# Icinga director" info "#=============================================================================" # Create the database if [ "$PG_ICINGA_USER_NAME" != "$PG_DIRECTOR_USER_NAME" ]; then sudo -u postgres psql -c "CREATE ROLE ${PG_DIRECTOR_USER_NAME} WITH LOGIN PASSWORD '${PG_DIRECTOR_USER_PWD}';" fi sudo -u postgres psql -c "CREATE DATABASE director_db 
OWNER ${PG_DIRECTOR_USER_NAME};" sudo -iu postgres psql -d director_db -c "CREATE EXTENSION pgcrypto;" ## Prereq MODULE_NAME=incubator MODULE_VERSION=v0.11.0 MODULES_PATH="/usr/share/icingaweb2/modules" MODULE_PATH="${MODULES_PATH}/${MODULE_NAME}" RELEASES="https://github.com/Icinga/icingaweb2-module-${MODULE_NAME}/archive" mkdir "$MODULE_PATH" \ && wget -q $RELEASES/${MODULE_VERSION}.tar.gz -O - \ | tar xfz - -C "$MODULE_PATH" --strip-components 1 icingacli module enable "${MODULE_NAME}" ## Director MODULE_VERSION="1.8.1" ICINGAWEB_MODULEPATH="/usr/share/icingaweb2/modules" REPO_URL="https://github.com/icinga/icingaweb2-module-director" TARGET_DIR="${ICINGAWEB_MODULEPATH}/director" URL="${REPO_URL}/archive/v${MODULE_VERSION}.tar.gz" useradd -r -g icingaweb2 -d /var/lib/icingadirector -s /bin/false icingadirector install -d -o icingadirector -g icingaweb2 -m 0750 /var/lib/icingadirector install -d -m 0755 "${TARGET_DIR}" wget -q -O - "$URL" | tar xfz - -C "${TARGET_DIR}" --strip-components 1 cp "${TARGET_DIR}/contrib/systemd/icinga-director.service" /etc/systemd/system/ icingacli module enable director systemctl daemon-reload systemctl enable icinga-director.service systemctl start icinga-director.service # The permissions have to be like this to let icingaweb activate modules chown -R www-data:icingaweb2 /etc/icingaweb2 } grafana(){ info "#=============================================================================" info "# Grafana" info "#=============================================================================" if [ "$PG_ICINGA_USER_NAME" != "$PG_GRAFANA_USER_NAME" ]; then sudo -u postgres psql -c "CREATE ROLE ${PG_GRAFANA_USER_NAME} WITH LOGIN PASSWORD '${PG_GRAFANA_USER_PWD}';" fi sudo -u postgres psql -c "CREATE DATABASE grafana OWNER ${PG_GRAFANA_USER_NAME};" cat << __EOF__ > /etc/grafana/grafana.ini [database] # You can configure the database connection by specifying type, host, name, user and password # as separate properties or as one string using the url property. # Either "mysql", "postgres" or "sqlite3", it's your choice type = postgres host = 127.0.0.1:5432 name = grafana user = $PG_GRAFANA_USER_NAME password = $PG_GRAFANA_USER_PWD __EOF__ systemctl --quiet --now enable grafana-server.service } opm(){ info "#=============================================================================" info "# OPM" info "#=============================================================================" ## OPM Install DEBIAN_FRONTEND=noninteractive apt install -q -y postgresql-server-dev-14 libdbd-pg-perl git build-essential cd /usr/local/src || exit 1 git clone https://github.com/OPMDG/opm-core.git git clone https://github.com/OPMDG/opm-wh_nagios.git cd /usr/local/src/opm-wh_nagios/pg/ || exit 1 make install cd /usr/local/src/opm-core/pg/ || exit 1 make install ## OPM db setup cat << __EOF__ | sudo -iu postgres psql CREATE ROLE ${PG_OPM_USER_NAME} WITH LOGIN PASSWORD '${PG_OPM_USER_PWD}'; CREATE DATABASE opm OWNER ${PG_OPM_USER_NAME}; __EOF__ cat << __EOF__ | sudo -iu postgres psql -d opm CREATE EXTENSION opm_core; CREATE EXTENSION wh_nagios CASCADE; SELECT * FROM grant_dispatcher('wh_nagios', 'opm'); __EOF__ ## OPM dispatcher cat << EOF > /etc/opm_dispatcher.conf daemon=0 directory=/var/spool/icinga2/perfdata frequency=5 db_connection_string=dbi:Pg:dbname=opm host=localhost db_user=${PG_OPM_USER_NAME} db_password=${PG_OPM_USER_PWD} debug=0 syslog=1 hostname_filter = /^$/ # Empty hostname. Never happens
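# NB: the *_filter settings appear to be regexes used by nagios_dispatcher.pl
# to discard matching perfdata records; /^$/ only matches an empty value, so
# these defaults should drop nothing in practice.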
service_filter = /^$/ # Empty service label_filter = /^$/ # Empty label EOF cat <<'EOF' > /etc/systemd/system/opm_dispatcher.service [Unit] Description=nagios dispatcher, imports perf files from icinga to opm [Service] User=nagios Group=nagios ExecStart=/usr/local/src/opm-wh_nagios/bin/nagios_dispatcher.pl -c /etc/opm_dispatcher.conf # start right after boot Type=simple # restart on crash Restart=always # after 10s RestartSec=10 [Install] WantedBy=multi-user.target EOF ## OPM planned task cat <<'EOF' > /etc/systemd/system/opm_dispatch_record.service [Unit] Description=Run wh_nagios.dispatch_record() on OPM database [Service] Type=oneshot User=postgres Group=postgres SyslogIdentifier=opm_dispatch_record ExecStart=/usr/bin/psql -U postgres -d opm -c "SELECT * FROM wh_nagios.dispatch_record()" EOF cat <<'EOF' > /etc/systemd/system/opm_dispatch_record.timer [Unit] Description=Timer to run wh_nagios.dispatch_record() on OPM [Timer] OnBootSec=60s OnUnitInactiveSec=1min [Install] WantedBy=timers.target EOF systemctl daemon-reload systemctl enable opm_dispatcher systemctl start opm_dispatcher systemctl enable opm_dispatch_record.timer systemctl start opm_dispatch_record.timer ## To check once everything is set up (icingaweb is set up) # sudo journalctl -fu opm_dispatcher # sudo journalctl -ft opm_dispatch_record ## Grants for grafana sudo -iu postgres psql -c "CREATE ROLE grafana WITH LOGIN PASSWORD 'th3Pass'" cat <