
Node Readiness Controller

A Kubernetes controller that provides fine-grained, declarative readiness for nodes. It ensures nodes only accept workloads when all required components (e.g., network agents, GPU drivers, storage drivers, or custom health-checks) are fully ready on the node.

Use it to orchestrate complex bootstrap steps in your node-init workflow, enforce node health, and improve workload reliability.

What is Node Readiness Controller?

The Node Readiness Controller extends Kubernetes’ node readiness model by allowing you to define additional pre-requisites for nodes (as readiness rules) based on node conditions. It automatically manages node taints to prevent scheduling until all specified conditions are satisfied.

Why This Project?

Kubernetes nodes expose a single, coarse “Ready” condition. Modern workloads often depend on additional critical infrastructure components that must be ready on the node before they can run.

With this controller you can:

  • Define custom readiness for your workloads
  • Automatically taint and untaint nodes based on condition status
  • Support continuous readiness enforcement to block scheduling when critical components fail later (circuit-breaker style)
  • Integrate with existing problem-detectors like NPD or any custom daemons/node plugins for reporting readiness

Key Features

  • Multi-condition Rules: Define rules that require ALL specified conditions to be satisfied
  • Flexible Enforcement: Support for bootstrap-only and continuous enforcement modes
  • Conflict Prevention: Validation webhook prevents conflicting taint configurations
  • Dry Run Mode: Preview rule impact before applying changes
  • Comprehensive Status: Detailed observability into rule evaluation and node readiness status
  • Node Targeting: Use label selectors to target specific node types
  • Bootstrap Completion Tracking: Prevents re-evaluation once bootstrap conditions are met

Demo

Demo recording: the Node Readiness Controller running in a Kind cluster.

Example Rule

apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: network-readiness-rule
spec:
  conditions:
    - type: "example.com/CNIReady"
      requiredStatus: "True"
  taint:
    key: "readiness.k8s.io/NetworkReady"
    effect: "NoSchedule"
    value: "pending"
  enforcementMode: "bootstrap-only"
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""

Getting Involved

If you’re interested in participating in future discussions or development related to Node Readiness Controller, you can reach the maintainers of the project at:

Open issues, PRs, and discussions in the project’s GitHub repository.

See the Kubernetes community page for general contribution information. You can also engage with SIG Node on the #sig-node Slack channel and the SIG Node mailing list.

Project Status

This project is currently in alpha. The API may change in future releases.

Concepts

This section explores the core concepts of the Node Readiness Controller and how to use it to manage node lifecycle.

Node Readiness Rule (NRR)

The NodeReadinessRule is the primary resource used to define readiness criteria for your nodes. It allows you to define declarative “gates” that a node must pass before it is considered ready for workloads.

A rule specifies:

  1. Target Nodes: Which nodes the rule applies to (using nodeSelector).
  2. Readiness Conditions: A list of conditions (type and status) that must be met.
  3. Readiness Taint: The taint to apply to the node if the conditions are not met.

When a rule is created, the controller continuously watches all matching nodes. If a node does not satisfy the required conditions, the controller ensures the configured taint is present, preventing the scheduler from assigning new pods to that node.
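
Once a rule is in place, you can observe its evaluation results and the taints it manages with standard kubectl commands (the rule and node names below are placeholders):

# List rules and inspect a rule's spec and status
kubectl get nodereadinessrules
kubectl get nodereadinessrule <rule-name> -o yaml

# Check which taints are currently set on a matching node
kubectl describe node <node-name> | grep -i -A3 taints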

Enforcement Modes

The controller supports two distinct modes of enforcement, configured via spec.enforcementMode, to handle different operational needs.

1. Continuous Enforcement (continuous)

In this mode, the controller actively maintains the readiness guarantee throughout the entire lifecycle of the node.

  • Behavior:
    • If conditions fail: The taint is applied immediately.
    • If conditions pass: The taint is removed.
  • Use Case: Critical infrastructure dependencies that must always be healthy.
    • Example: A CNI plugin or a storage daemon must be running. If they crash, you want the node effectively taken offline (tainted) immediately to prevent application failures.
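
As a sketch, a continuous-mode rule looks like the example rule shown earlier but with a different enforcementMode; the storage.example.com/CSIDriverReady condition type and the taint key below are illustrative placeholders, not names defined by the project:

apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: storage-readiness-rule
spec:
  conditions:
    - type: "storage.example.com/CSIDriverReady"  # hypothetical condition reported by your storage daemon
      requiredStatus: "True"
  taint:
    key: "readiness.k8s.io/StorageReady"          # illustrative taint key
    effect: "NoSchedule"
    value: "pending"
  enforcementMode: "continuous"                   # re-taint the node if the condition regresses later
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""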

2. Bootstrap-Only Enforcement (bootstrap-only)

In this mode, the controller enforces readiness only during the initial node startup (bootstrap).

  • Behavior:
    • The taint is applied when the node first joins or the rule is created.
    • The controller waits for the conditions to be met.
    • Once satisfied:
      1. The taint is removed.
      2. A completion marker is added to the node’s annotations: readiness.k8s.io/bootstrap-completed-<ruleName>=true.
    • After completion: The controller ignores this rule for the node, even if the conditions fail later.
  • Use Case: One-time initialization steps.
    • Example: Pre-pulling heavy container images, initializing a local cache, or performing hardware provisioning that only needs to happen once per boot.
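
To check whether bootstrap enforcement has already completed for a rule on a given node, look for the completion annotation described above (the node name is a placeholder):

kubectl describe node <node-name> | grep bootstrap-completed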

Readiness Condition Reporting

The Node Readiness Controller operates on Node Conditions. It does not perform health checks itself; rather, it reacts to the state of conditions on the Node object.

This design decouples the policy (the Controller) from the health checking (the Reporter). You have multiple options for reporting these conditions:

Option 1: Node Problem Detector (NPD)

The Node Problem Detector is a standard Kubernetes add-on commonly found in many clusters. It is designed to monitor node health and update NodeConditions or emit Events.

You can extend NPD with Custom Plugins (Monitor Scripts) to check the status of your specific components (e.g., checking if a daemon process is running or if a local endpoint is responding).

Why choose NPD?

  • Existing Infrastructure: Leverages a tool that may already be running and authorized to update node status.
  • Separation of Concerns: Decouples the monitoring logic from the workload itself (no need to modify your DaemonSet manifests to add sidecars).
  • Centralized Config: Health checks are defined in NPD configuration rather than scattered across workload pod specs.
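
As a rough sketch (not shipped with this project), an NPD custom-plugin check script signals health through its exit code; a minimal example that probes a hypothetical local endpoint could look like this:

#!/bin/bash
# Hypothetical NPD custom-plugin check: probe a local component endpoint.
# Exit 0 = healthy, non-zero = problem; NPD maps the result to the configured node condition.
if curl -sf --max-time 2 http://localhost:8080/healthz > /dev/null; then
  exit 0
else
  exit 1
fi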

Option 2: Readiness Condition Reporter

To help you integrate custom checks where NPD might not be suitable, the project includes a lightweight Readiness Condition Reporter. This is designed to be run as a sidecar container within your DaemonSet.

  • How it works:
    1. It runs as a sidecar container in the same Pod as your workload.
    2. It periodically checks a local HTTP endpoint (e.g., a healthz probe).
    3. It patches the Node status with a custom Condition (e.g., example.com/MyCustomServiceReady).

When to choose the Reporter?

  • Simplicity: Good for simple “is this HTTP endpoint up?” checks without configuring external scripts.
  • Direct Coupling: Useful when you want the readiness reporting lifecycle of the component to strictly match the pod’s lifecycle.
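
Whichever reporter you choose, you can confirm the condition is actually landing on the Node object (the condition type and node name here are placeholders):

kubectl get node <node-name> \
  -o jsonpath='{.status.conditions[?(@.type=="example.com/MyCustomServiceReady")].status}'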

Dry Run Mode

To reduce operational risk when deploying new readiness rules in production, the controller includes a dryRun capability that lets you analyze a rule’s impact before it takes effect.

When spec.dryRun: true is set on a rule:

  • The controller evaluates all nodes against the criteria.
  • No taints are applied or removed.
  • The intended actions are reported in the status.dryRunResults field of the NodeReadinessRule.

This allows you to preview exactly which nodes would be affected and identify potential misconfigurations (such as a typo in a label selector) before they impact your cluster.
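
For example, after setting spec.dryRun: true on a rule, the planned actions can be read back from its status (the rule name is a placeholder; field names follow the API reference):

kubectl get nodereadinessrule <rule-name> -o jsonpath='{.status.dryRunResults}'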

Installation

Follow this guide to install the Node Readiness Controller in your Kubernetes cluster.

Deployment Options

Option 1: Deploy Using Release Manifests

The easiest way to get started is by applying the official release manifests.

First, to install the CRDs, apply the crds.yaml manifest:

# Replace with the desired version
VERSION=v0.1.1
kubectl apply -f https://github.com/kubernetes-sigs/node-readiness-controller/releases/download/${VERSION}/crds.yaml
kubectl wait --for condition=established --timeout=30s crd/nodereadinessrules.readiness.node.x-k8s.io

To install the controller, apply the install.yaml manifest:

kubectl apply -f https://github.com/kubernetes-sigs/node-readiness-controller/releases/download/${VERSION}/install.yaml

This will deploy the controller into the nrr-system namespace on any available node in your cluster.

Images: The official releases use multi-arch images (AMD64, Arm64).

Option 2: Deploy Using Kustomize

If you have cloned the repository and want to deploy from source, you can use Kustomize.

# 1. Install Custom Resource Definitions (CRDs)
kubectl apply -k config/crd

# 2. Deploy Controller and RBAC
kubectl apply -k config/default

Verification

After installation, verify that the controller is running successfully.

  1. Check Pod Status:

    kubectl get pods -n nrr-system
    

    You should see a pod named nrr-controller-manager-... in Running status.

  2. Check Logs:

    kubectl logs -n nrr-system -l control-plane=controller-manager
    

    Look for “Starting EventSource” or “Starting Controller” messages indicating the manager is active.

  3. Verify CRDs:

    kubectl get crd nodereadinessrules.readiness.node.x-k8s.io
    

Uninstallation

IMPORTANT: Follow this order to avoid “stuck” resources.

The controller uses a finalizer (readiness.node.x-k8s.io/cleanup-taints) on NodeReadinessRule resources to ensure taints are safely removed from nodes before a rule is deleted.

You must delete all rule objects before deleting the controller.

  1. Delete all Rules:

    kubectl delete nodereadinessrules --all
    

    Wait for this command to complete. This ensures the running controller removes its taints from your nodes.

  2. Uninstall Controller:

    # If installed via release manifest
    kubectl delete -f https://github.com/kubernetes-sigs/node-readiness-controller/releases/download/${VERSION}/install.yaml
    
    # OR if using Kustomize
    kubectl delete -k config/default
    
  3. Uninstall CRDs (Optional):

    kubectl delete -k config/crd
    

Recovering from Stuck Resources

If you accidentally deleted the controller before the rules, the NodeReadinessRule objects will get stuck in a Terminating state, because the controller is needed to clean up the taints and remove the finalizers.

To force-delete them (you will then need to manually remove any managed taints left on your nodes):

# Patch the finalizer to remove it
kubectl patch nodereadinessrule <rule-name> -p '{"metadata":{"finalizers":[]}}' --type=merge
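
You can then remove any leftover managed taints by hand; for the taint key used in the examples in this documentation, that looks like:

# Remove the managed taint from a node (adjust the key and effect to match your rule)
kubectl taint nodes <node-name> readiness.k8s.io/NetworkReady:NoSchedule-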

Troubleshooting Deployment

RBAC Permissions

If the controller logs show “Forbidden” errors, verify the ClusterRole bindings:

kubectl describe clusterrole nrr-manager-role

It requires nodes (update/patch) and nodereadinessrules (all) access.

Debug Logging

To enable verbose logging for deeper investigation:

kubectl patch deployment -n nrr-system nrr-controller-manager \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","args":["--zap-log-level=debug"]}]}}}}'

CNI Readiness

In many Kubernetes clusters, the CNI plugin runs as a DaemonSet. When a new node joins the cluster, there is a race condition:

  1. The Node object is created and marked Ready by the Kubelet.
  2. The Scheduler sees the node as Ready and schedules application pods.
  3. However, the CNI DaemonSet might still be initializing networking on that node.

This guide demonstrates how to use the Node Readiness Controller to prevent pods from being scheduled on a node until the Container Network Interface (CNI) plugin (e.g., Calico) is fully initialized and ready.

The high-level steps are:

  1. The node is bootstrapped with a startup taint readiness.k8s.io/NetworkReady=pending:NoSchedule immediately upon joining (see the kubelet configuration sketch after this list).
  2. A sidecar is added to the CNI agent DaemonSet (calico-node in this example) to monitor the CNI’s health and report it to the API server as a node condition (network.k8s.io/CalicoReady).
  3. The Node Readiness Controller untaints the node only when the CNI reports it is ready.
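
One common way to achieve step 1, which is not specific to this project, is to have the kubelet register the node with the startup taint already in place, for example via the registerWithTaints field of the KubeletConfiguration (adapt this to however your nodes are provisioned):

# Sketch: the node joins the cluster already carrying the startup taint
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
  - key: "readiness.k8s.io/NetworkReady"
    value: "pending"
    effect: "NoSchedule"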

Step-by-Step Guide

This example uses Calico, but the pattern applies to any CNI.

Note: You can find all the manifests used in this guide in the examples/cni-readiness directory.

1. Deploy the Readiness Condition Reporter

We need to bridge Calico’s internal health status to a Kubernetes Node Condition. We will add a sidecar container to the Calico DaemonSet.

This sidecar checks Calico’s local health endpoint (http://localhost:9099/readiness) and updates a node condition network.k8s.io/CalicoReady.

Patch your Calico DaemonSet:

# cni-patcher-sidecar.yaml
- name: cni-status-patcher
  image: registry.k8s.io/node-readiness-controller/node-readiness-reporter:v0.1.1
  imagePullPolicy: IfNotPresent
  env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: CHECK_ENDPOINT
      value: "http://localhost:9099/readiness" # update to your CNI health endpoint
    - name: CONDITION_TYPE
      value: "network.k8s.io/CalicoReady"     # update this node condition
    - name: CHECK_INTERVAL
      value: "15s"
  resources:
    limits:
      cpu: "10m"
      memory: "32Mi"
    requests:
      cpu: "10m"
      memory: "32Mi"

Note: In this example, the CNI pod’s health is monitored by a sidecar, so the watcher’s lifecycle is the same as the pod’s lifecycle. If the Calico pod is crash-looping, the sidecar will not run and cannot report readiness. For robust ‘continuous’ readiness reporting, the watcher should be external to the pod.

2. Grant Permissions (RBAC)

The sidecar needs permission to update the Node object’s status.

# calico-rbac-node-status-patch-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-status-patch-role
rules:
- apiGroups: [""]
  resources: ["nodes/status"]
  verbs: ["patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: calico-node-status-patch-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-status-patch-role
subjects:
# Bind to CNI's ServiceAccount
- kind: ServiceAccount
  name: calico-node
  namespace: kube-system
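
Optionally, verify the binding took effect by impersonating the CNI's service account:

# Should print "yes" once the ClusterRoleBinding above is in place
kubectl auth can-i patch nodes/status --as=system:serviceaccount:kube-system:calico-node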

3. Create the Node Readiness Rule

Now define the rule that enforces the requirement. This tells the controller: “Keep the readiness.k8s.io/NetworkReady taint on the node until network.k8s.io/CalicoReady is True.”

# network-readiness-rule.yaml
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: network-readiness-rule
spec:
  # The condition(s) to monitor
  conditions:
    - type: "network.k8s.io/CalicoReady"
      requiredStatus: "True"
  
  # The taint to manage
  taint:
    key: "readiness.k8s.io/NetworkReady"
    effect: "NoSchedule"
    value: "pending"
  
  # "bootstrap-only" means: once the CNI is ready once, we stop enforcing.
  enforcementMode: "bootstrap-only"
  
  # Update to target only the nodes that need to be protected by this guardrail
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""

Test scripts

  1. Create the Readiness Rule:

    cd examples/cni-readiness
    kubectl apply -f network-readiness-rule.yaml
    
  2. Install Calico CNI and Apply the RBAC:

    chmod +x apply-calico.sh
    sh apply-calico.sh
    

Verification

To test this, add a new node to the cluster.

  1. Check the Node Taints: Immediately upon joining, the node should have the taint: readiness.k8s.io/NetworkReady=pending:NoSchedule.

  2. Check Node Conditions: Watch the node conditions. You will initially see network.k8s.io/CalicoReady as False or missing. Once Calico starts, the sidecar will update it to True.

  3. Check Taint Removal: As soon as the condition becomes True, the Node Readiness Controller will remove the taint, and workloads will be scheduled.
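
The checks above map to a few kubectl commands (the node name is a placeholder); re-run them as the node comes up to watch the transition:

# 1. Taints currently on the node
kubectl get node <node-name> -o jsonpath='{.spec.taints}'

# 2. The condition reported by the sidecar
kubectl get node <node-name> \
  -o jsonpath='{.status.conditions[?(@.type=="network.k8s.io/CalicoReady")].status}'

# 3. Once the condition is True, the taint from the first check should disappear
kubectl get node <node-name> -o jsonpath='{.spec.taints}'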

Releases

This page details the official releases of the Node Readiness Controller.

v0.1.1

Date: 2026-01-19

This patch release includes important regression bug fixes and documentation updates made since v0.1.0.

Release Notes

Bug or Regression

  • Fix race condition where deleting a rule could leave taints stuck on nodes (#84)
  • Ensure new node evaluation results are persisted to rule status (#87)

Documentation

  • Add/update Concepts documentation (enforcement modes, dry-run, condition reporting) (#74)
  • Add v0.1.0 release notes to docs (#76)

Images

The following container images are published as part of this release.

// Node readiness controller
registry.k8s.io/node-readiness-controller/node-readiness-controller:v0.1.1

// Report component readiness condition from the node
registry.k8s.io/node-readiness-controller/node-readiness-reporter:v0.1.1

Installation

To install the CRDs, apply the crds.yaml manifest for this version:

kubectl apply -f https://github.com/kubernetes-sigs/node-readiness-controller/releases/download/v0.1.1/crds.yaml

To install the controller, apply the install.yaml manifest for this version:

kubectl apply -f https://github.com/kubernetes-sigs/node-readiness-controller/releases/download/v0.1.1/install.yaml

This will deploy the controller into the nrr-system namespace, on any available node in your cluster. See the Installation section for more detailed instructions.

Contributors

  • ajaysundark

v0.1.0

Date: 2026-01-14

This is the first official release of the Node Readiness Controller.

Release Notes

  • Initial implementation of the Node Readiness Controller.
  • Support for NodeReadinessRule API (readiness.node.x-k8s.io/v1alpha1).
  • Defines custom readiness rules for k8s nodes based on node conditions.
  • Manages node taints to prevent scheduling until readiness rules are met.
  • Includes modes for bootstrap-only and continuous readiness enforcement.
  • Readiness condition reporter for reporting component health.

Images

The following container images are published as part of this release.

// Node readiness controller
registry.k8s.io/node-readiness-controller/node-readiness-controller:v0.1.0

// Report component readiness condition from the node
registry.k8s.io/node-readiness-controller/node-readiness-reporter:v0.1.0

Installation

To install the CRDs, apply the crds.yaml manifest for this version:

kubectl apply -f https://github.com/kubernetes-sigs/node-readiness-controller/releases/download/v0.1.0/crds.yaml

To install the controller, apply the install.yaml manifest for this version:

kubectl apply -f https://github.com/kubernetes-sigs/node-readiness-controller/releases/download/v0.1.0/install.yaml

This will deploy the controller into the nrr-system namespace, on any available node in your cluster. See the Installation section for more detailed instructions.

Contributors

  • ajaysundark
  • Karthik-K-N
  • Priyankasaggu11929
  • sreeram-venkitesh
  • Hii-Himanshu
  • Serafeim-Katsaros
  • arnab-logs
  • Yuan-prog
  • AvineshTripathi

API Reference

Packages

readiness.node.x-k8s.io/v1alpha1

Package v1alpha1 contains API Schema definitions for the v1alpha1 API group.

Resource Types

  • NodeReadinessRule

ConditionEvaluationResult

ConditionEvaluationResult provides a detailed report of the comparison between the Node’s observed condition and the rule’s requirement.

Appears in:

  • NodeEvaluation

Fields:

  • type (string): type corresponds to the Node condition type being evaluated. Validation: MinLength: 1, MaxLength: 316.
  • currentStatus (ConditionStatus): currentStatus is the actual status value observed on the Node, one of True, False, Unknown. Validation: Enum: [True False Unknown].
  • requiredStatus (ConditionStatus): requiredStatus is the status value defined in the rule that must be matched, one of True, False, Unknown. Validation: Enum: [True False Unknown].

ConditionRequirement

ConditionRequirement defines a specific Node condition and the status value required to trigger the controller’s action.

Appears in:

  • NodeReadinessRuleSpec

Fields:

  • type (string): type of the Node condition. The kubebuilder validation follows https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Condition. Validation: MinLength: 1, MaxLength: 316.
  • requiredStatus (ConditionStatus): requiredStatus is the status of the condition, one of True, False, Unknown. Validation: Enum: [True False Unknown].

DryRunResults

DryRunResults provides a summary of the actions the controller would perform if DryRun mode is enabled.

Validation:

  • MinProperties: 1

Appears in:

  • NodeReadinessRuleStatus

Fields:

  • affectedNodes (integer): affectedNodes is the total count of Nodes that match the rule’s criteria. Validation: Minimum: 0.
  • taintsToAdd (integer): taintsToAdd is the number of Nodes that currently lack the specified taint and would have it applied. Validation: Minimum: 0.
  • taintsToRemove (integer): taintsToRemove is the number of Nodes that currently possess the taint but no longer meet the criteria, leading to its removal. Validation: Minimum: 0.
  • riskyOperations (integer): riskyOperations represents the count of Nodes where required conditions are missing entirely, potentially indicating an ambiguous node state. Validation: Minimum: 0.
  • summary (string): summary provides a human-readable overview of the dry run evaluation, highlighting key findings or warnings. Validation: MinLength: 1, MaxLength: 4096.

EnforcementMode

Underlying type: string

EnforcementMode specifies how the controller maintains the desired state.

Validation:

  • Enum: [bootstrap-only continuous]

Appears in:

  • NodeReadinessRuleSpec

Values:

  • bootstrap-only (EnforcementModeBootstrapOnly): applies the configuration only during the first reconcile.
  • continuous (EnforcementModeContinuous): continuously monitors and enforces the configuration.

NodeEvaluation

NodeEvaluation provides a detailed audit of a single Node’s compliance with the rule.

Appears in:

  • NodeReadinessRuleStatus

Fields:

  • nodeName (string): nodeName is the name of the evaluated Node. Validation: MinLength: 1, MaxLength: 253, Pattern: ^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$.
  • conditionResults (ConditionEvaluationResult array): conditionResults provides a detailed breakdown of each condition evaluation for this Node. This allows for granular auditing of which specific criteria passed or failed during the rule assessment. Validation: MaxItems: 5000.
  • taintStatus (TaintStatus): taintStatus represents the taint status on the Node, one of Present, Absent. Validation: Enum: [Present Absent].
  • lastEvaluationTime (Time): lastEvaluationTime is the timestamp when the controller last assessed this Node.

NodeFailure

NodeFailure provides diagnostic details for Nodes that could not be successfully evaluated by the rule.

Appears in:

  • NodeReadinessRuleStatus

Fields:

  • nodeName (string): nodeName is the name of the failed Node. The kubebuilder validation follows https://github.com/kubernetes/apimachinery/blob/84d740c9e27f3ccc94c8bc4d13f1b17f60f7080b/pkg/util/validation/validation.go#L198. Validation: MinLength: 1, MaxLength: 253, Pattern: ^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$.
  • reason (string): reason provides a brief explanation of the evaluation result. Validation: MinLength: 1, MaxLength: 256.
  • message (string): message is a human-readable message indicating details about the evaluation. Validation: MinLength: 1, MaxLength: 10240.
  • lastEvaluationTime (Time): lastEvaluationTime is the timestamp of the last failed rule evaluation for this Node.

NodeReadinessRule

NodeReadinessRule is the Schema for the NodeReadinessRules API.

Fields:

  • apiVersion (string): readiness.node.x-k8s.io/v1alpha1
  • kind (string): NodeReadinessRule
  • metadata (ObjectMeta): Refer to Kubernetes API documentation for fields of metadata.
  • spec (NodeReadinessRuleSpec): spec defines the desired state of NodeReadinessRule.
  • status (NodeReadinessRuleStatus): status defines the observed state of NodeReadinessRule. Validation: MinProperties: 1.

NodeReadinessRuleSpec

NodeReadinessRuleSpec defines the desired state of NodeReadinessRule.

Appears in:

  • NodeReadinessRule

Fields:

  • conditions (ConditionRequirement array): conditions contains a list of the Node conditions that define the specific criteria that must be met for taints to be managed on the target Node. The presence or status of these conditions directly triggers the application or removal of Node taints. Validation: MinItems: 1, MaxItems: 32.
  • enforcementMode (EnforcementMode): enforcementMode specifies how the controller maintains the desired state; it is one of bootstrap-only, continuous. “bootstrap-only” applies the configuration once during initial setup. “continuous” ensures the state is monitored and corrected throughout the resource lifecycle. Validation: Enum: [bootstrap-only continuous].
  • taint (Taint): taint defines the specific Taint (Key, Value, and Effect) to be managed on Nodes that meet the defined condition criteria.
  • nodeSelector (LabelSelector): nodeSelector limits the scope of this rule to a specific subset of Nodes.
  • dryRun (boolean): when set to true, the controller evaluates Node conditions and logs intended taint modifications without persisting changes to the cluster. Proposed actions are reflected in the resource status.

NodeReadinessRuleStatus

NodeReadinessRuleStatus defines the observed state of NodeReadinessRule.

Validation:

  • MinProperties: 1

Appears in:

  • NodeReadinessRule

Fields:

  • observedGeneration (integer): observedGeneration reflects the generation of the most recently observed NodeReadinessRule by the controller. Validation: Minimum: 1.
  • appliedNodes (string array): appliedNodes lists the names of Nodes where the taint has been successfully managed. This provides a quick reference to the scope of impact for this rule. Validation: MaxItems: 5000; items: MaxLength: 253.
  • failedNodes (NodeFailure array): failedNodes lists the Nodes where the rule evaluation encountered an error. This is used for troubleshooting configuration issues, such as invalid selectors during node lookup. Validation: MaxItems: 5000.
  • nodeEvaluations (NodeEvaluation array): nodeEvaluations provides detailed insight into the rule’s assessment for individual Nodes. This is primarily used for auditing and debugging why specific Nodes were or were not targeted by the rule. Validation: MaxItems: 5000.
  • dryRunResults (DryRunResults): dryRunResults captures the outcome of the rule evaluation when DryRun is enabled. This field provides visibility into the actions the controller would have taken, allowing users to preview taint changes before they are committed. Validation: MinProperties: 1.
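
Putting these fields together, an illustrative status stanza (hand-written for this guide, not captured from a live cluster) might look like:

status:
  observedGeneration: 2
  appliedNodes:
    - worker-1
  nodeEvaluations:
    - nodeName: worker-1
      taintStatus: Absent
      lastEvaluationTime: "2026-01-19T10:00:00Z"
      conditionResults:
        - type: "network.k8s.io/CalicoReady"
          currentStatus: "True"
          requiredStatus: "True"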

TaintStatus

Underlying type: string

TaintStatus specifies status of the Taint on Node.

Validation:

  • Enum: [Present Absent]

Appears in:

  • NodeEvaluation

Values:

  • Present (TaintStatusPresent): the taint is present on the Node.
  • Absent (TaintStatusAbsent): the taint is absent from the Node.