Backoff limit per index

In Kubernetes v1.33, backoff limit per index for Jobs graduates to stable (GA).

Backoff limWHAT?

Ideally, your Kubernetes workloads should tolerate transient failures and keep running. By transient failures I mean things like network congestion or nodes with problems.

To achieve failure tolerance in a Kubernetes Job, you can set the spec.backoffLimit field. This field specifies the total number of Pod failures the Job tolerates before it is marked as failed. Why should there be a limit? Well, you don’t want cascading failures, for example, or a broken workload that keeps retrying forever.
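As a minimal sketch, a Job with a global failure budget could look like the following. The Job name, container name, image, and command are placeholders chosen for illustration; only spec.backoffLimit is the point here.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job            # hypothetical name
spec:
  backoffLimit: 3              # number of retries before the Job is marked as failed
  template:
    spec:
      restartPolicy: Never     # a failed Pod is replaced by a new one and counted as a failure
      containers:
      - name: worker           # hypothetical container
        image: busybox
        command: ["sh", "-c", "echo working"]
```

Note that the budget is shared by the whole Job: it does not matter which Pod fails, every failure counts against the same limit.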

However, for workloads where every index is considered independent, such as embarrassingly parallel workloads, the spec.backoffLimit field is often not flexible enough. For example, you may choose to run a set of integration test suites by representing each suite as an index within an Indexed Job. In that setup, a single faulty test suite can burn through the entire failure budget, so all the other suites are stopped along with it.

In order to overcome this limitation, Kubernetes introduced backoff limit per index, which allows you to control the number of retries per index.

With this feature, you set the number of retries allowed for each index in the spec.backoffLimitPerIndex field, and you can cap the number of failed indexes the Job tolerates by setting the spec.maxFailedIndexes field. You can also define a short-circuit that fails an index immediately, without further retries, by using the FailIndex action in the Pod Failure Policy mechanism. When the number of tolerated failures for an index is exceeded, the Job marks that index as failed and lists it in the Job’s status.failedIndexes field.
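For illustration, the status of an Indexed Job with ten indexes, two of which have failed, might contain something like the following. The index values are made up; the field uses the same compact text format as status.completedIndexes.

```yaml
status:
  completedIndexes: "0,2,4-9"   # indexes that finished successfully (illustrative values)
  failedIndexes: "1,3"          # indexes marked as failed (illustrative values)
```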

Example

The following Job spec snippet is an example of how to combine backoff limit per index with the Pod Failure Policy feature:

completions: 10
parallelism: 10
completionMode: Indexed
backoffLimitPerIndex: 1
maxFailedIndexes: 5
podFailurePolicy:
  rules:
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
  - action: FailIndex
    onExitCodes:
      operator: In
      values: [ 42 ]

This Job handles Pod failures as follows:

  • Ignores failed Pods that have the DisruptionTarget condition, so they do not count toward the retry limit
  • Fails the index immediately if a container exits with exit code 42
  • Retries each index at most once (unless the index was failed by the FailIndex rule)
  • Fails the entire Job if more than 5 indexes fail
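To try this out, the snippet needs to be embedded in a complete Job manifest. The following is a sketch under a few assumptions: the Job name, container name, image, and shell command are hypothetical stand-ins for a real test suite, and index 3 is arbitrarily chosen to exit with code 42.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-limit-per-index-demo   # hypothetical name
spec:
  completions: 10
  parallelism: 10
  completionMode: Indexed              # required for backoffLimitPerIndex
  backoffLimitPerIndex: 1
  maxFailedIndexes: 5
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
    - action: FailIndex
      onExitCodes:
        operator: In
        values: [ 42 ]
  template:
    spec:
      restartPolicy: Never             # required when using podFailurePolicy
      containers:
      - name: suite                    # hypothetical container
        image: busybox
        command:
        - sh
        - -c
        - |
          # Hypothetical stand-in for a test suite: index 3 always exits with 42.
          echo "running suite $JOB_COMPLETION_INDEX"
          if [ "$JOB_COMPLETION_INDEX" -eq 3 ]; then exit 42; fi
```

With this setup, index 3 should be marked as failed on its first attempt because of the FailIndex rule, while the remaining indexes still run to completion, since the number of failed indexes stays below spec.maxFailedIndexes.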

For more details, see the official documentation on backoff limit per index.