
SLI

In simple terms, Service Level Indicators (SLIs) are the metrics that represent how the consumers of the service perceive its reliability. They are normalized to a number between 0 and 100 using this formula:

SLI = (Good / Valid) × 100

You can read more about the definition of Good and Valid.
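As a minimal sketch of the formula above, the following Python function computes an SLI from a count of good and valid events. The function name and the sample counts are illustrative assumptions, not part of the app.

```python
def sli(good: float, valid: float) -> float:
    """Return the SLI as a percentage between 0 and 100."""
    if valid <= 0:
        raise ValueError("valid must be positive")
    if not 0 <= good <= valid:
        raise ValueError("good must be between 0 and valid")
    return good / valid * 100


# Hypothetical counts: 9,985 good requests out of 10,000 valid ones.
print(sli(9_985, 10_000))
```

The same function works for time-based SLIs if good and valid are durations in the same unit, for example minutes.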

Common SLIs include latency, availability, yield, durability, correctness, etc.

🎉 Tip: Use Templates
Basics
Budgeting
Metric
Data points

The metric definition already seems to contain a comparator.

Having another bound makes the formula harder to read. Please use only one way to set the boundaries.


Service Level Formula

The formula for calculating SLI for the given SLO window is the percentage of good per valid.

Depending on whether the SLI is time-based or event-based, the formula calculates the percentage of good time or good events.

SLI = (Good {{ sloWindow.eventUnitNorm }} / Valid {{ sloWindow.eventUnitNorm }}) × 100

SLO

Service Level Objective (SLO) is the target percentage of good {{ sloWindow.eventUnitNorm }} out of total {{ sloWindow.eventUnitNorm }} in {{ sloWindow }}.

Using the two sliders below, you can fine-tune the SLO to your needs. The first slider sets the integer part of the percentage ({{ percL10n(sloInt) }}). The second slider sets the fractional part of the percentage ({{ percL10n(sloFrac) }}).

Typical SLO values
Informal Name | SLO Value
{{ p.title }} | {{ percL10n(p.slo) }}

Note: Just be mindful of the price tag for this high service level objective!

Everyone wants the highest possible number, but not everyone is willing to pay the price.

Note: This is an unusually low service level objective. Typically, a service level objective is above {{ percL10n(90) }}, with some rare exceptions. Please check the error budget for the implications of your chosen SLO.

Value: {{ percL10n(slo) }}
Window
Thresholds

The upper threshold must be greater than the lower threshold.

Service Level Status Formula

Service Level Status (SLS) is the percentage of good {{ sloWindow.eventUnitNorm }} in a given time.

SLS is the current status of the service level and relates directly to the SLO. Whenever the SLS is below the SLO, the SLO has been breached. If the SLO backs an SLA, a breach may have severe consequences.

Error budget

The error budget is one of the core ideas behind using SLIs and SLOs to improve reliability. Instead of denying or forbidding errors, an error budget allows the system to fail within a predefined limit.

The number one enemy of reliability is change, but we need change to improve the system. Error budgets reconcile the two: they give the team a budget of errors to spend on improving the system while keeping the consumers happy enough.

The error budget is the complement of the SLO. It is the percentage of bad {{ sloWindow.eventUnitNorm }} that you can have before you violate the SLO.

error_budget = 100 - SLO = {{ percL10n(100) }} - {{ percL10n(slo) }} = {{ percL10n(errorBudgetPerc) }}

Warning: The error budget is 0 based on your estimated number of valid {{ sloWindow.eventUnitNorm }}.
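To illustrate how an error budget turns into a count of allowed failures, and why that count can be 0, here is a small Python sketch. It assumes the allowed count is the budget percentage applied to the number of valid events and rounded down. The function names and sample numbers are hypothetical.

```python
import math
from fractions import Fraction


def error_budget_percent(slo: float) -> Fraction:
    """Error budget as a percentage: the complement of the SLO."""
    # Fraction(str(...)) avoids float rounding such as 100 - 99.9 != 0.1.
    return Fraction(100) - Fraction(str(slo))


def allowed_bad_events(slo: float, valid: int) -> int:
    """Whole number of bad events that fit in the budget (rounded down)."""
    return math.floor(error_budget_percent(slo) * valid / 100)


# Hypothetical numbers: with an SLO of 99.9, 1,000 valid events allow
# one failure, while 500 valid events allow none.
print(allowed_bad_events(99.9, 1_000))
print(allowed_bad_events(99.9, 500))
```

With too few valid events in the window, even a single failure exceeds the budget, which is the situation the warning above describes.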

Error cost

Alerting

What is the point of setting SLIs and SLOs if we are not going to take action when the SLO is violated?

Alerting on error budgets enables us to stay on top of the reliability of our system. When using service levels, the alert triggers on the rate at which the error budget is consumed.

When setting an alert, the burn rate decides how quickly the alert reacts to errors.

  • Too fast and it will lead to false positives (alerting unnecessarily) and alert fatigue (too many alerts).
  • Too slow and the error budget will be burned before you know it.
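The trade-off above can be made concrete with a short Python sketch. It assumes the usual definition of burn rate as the observed error rate divided by the error budget, so a burn rate of 1 uses up the budget exactly at the end of the SLO window. The function names and sample numbers are hypothetical.

```python
def burn_rate(bad: float, valid: float, slo: float) -> float:
    """Return how many times faster than 'exactly on budget' errors occur."""
    if valid <= 0:
        raise ValueError("valid must be positive")
    error_budget = 100 - slo        # percentage of bad events the SLO allows
    if error_budget <= 0:
        raise ValueError("an SLO of 100 leaves no error budget to burn")
    error_rate = bad / valid * 100  # observed percentage of bad events
    return error_rate / error_budget


def hours_until_exhausted(slo_window_hours: float, rate: float) -> float:
    """Hours until the whole budget is gone if the burn rate stays constant."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return slo_window_hours / rate


# Hypothetical numbers: SLO of 99, 20 bad out of 1,000 valid events,
# and a 30-day SLO window.
rate = burn_rate(bad=20, valid=1_000, slo=99)
print(rate, hours_until_exhausted(30 * 24, rate))
```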

The Google SRE Workbook goes through 6 alerting strategies based on SLOs.

Burn rate
Alert Triggers

Warning: The alert is too "jumpy" and will trigger too often. This may lead to alert fatigue or even worse: ignoring the alerts.

Note: The time to resolve (TTR) is too short for a human to react. It is strongly recommended to automate the incident resolution instead of relying on human response to alerts.

Warning: Remember that the alert will trigger after {{ percL10n(longWindowPerc) }} of the error budget is consumed! That error budget is for {{ sloWindow.humanTime }}.

Based on your settings, an alert burns {{ percL10n(longWindowPerc) }} of the error budget just to trigger. Then it needs some time to resolve too.

How many alerts like this can you have in {{ sloWindow.humanTime }} before the entire error budget is consumed?
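A rough way to answer that question is sketched below in Python. It assumes the share of the budget consumed before the alert triggers equals the burn rate multiplied by the long alert window and divided by the SLO window. It also ignores the budget burned while the incident is being resolved, so the result is an upper bound. The function names are hypothetical.

```python
import math


def budget_consumed_to_trigger(
    burn_rate: float, long_window_hours: float, slo_window_hours: float
) -> float:
    """Percentage of the error budget burned before the alert triggers."""
    if slo_window_hours <= 0:
        raise ValueError("slo_window_hours must be positive")
    return burn_rate * long_window_hours / slo_window_hours * 100


def max_alerts_per_window(consumed_percent: float) -> int:
    """Upper bound on alerts before the whole budget is gone."""
    if consumed_percent <= 0:
        raise ValueError("consumed_percent must be positive")
    return math.floor(100 / consumed_percent)


# Hypothetical setup: burn rate 14.4, 1-hour long window, 30-day SLO window.
print(budget_consumed_to_trigger(14.4, 1, 30 * 24))
# If each alert costs 2 percent of the budget just to trigger:
print(max_alerts_per_window(2))
```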

Warning: The long alert window is too short at this burn rate ({{ burnRate }}x), which may lead to alert fatigue.

Error: Division by zero! The long alert window is too short for enough valid {{ sloWindow.eventUnitNorm }} to be counted.

The purpose of the short-window alert is to reduce false alerts.

It checks a shorter lookback window (hence the name) to make sure that the burn rate is still high before triggering the alert. This reduces false positives where an alert is triggered for a temporary high burn rate.

The short-window alert reduces false positives at the expense of making the alerting setup more complex.

Warning: The short alert window is too short at this burn rate ({{ burnRate }}x), which may lead to alert fatigue.

Error: The short alert window is too short for enough valid events to be counted.

Alert Policy

This is pseudo-code for triggering alerts based on the SLI metric in relation to the desired SLO. You need to translate it to your observability and/or alerting tool.

≤
&&
≤
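The operands of the pseudo-code did not survive on this page, so the following Python sketch is only one plausible reading of it: the alert fires when the SLI measured over the long window and the SLI measured over the short window are both at or below a threshold. The threshold formula (100 minus the burn rate times the error budget) and all names are assumptions made for illustration.

```python
def sli_threshold(slo: float, burn_rate: float) -> float:
    """SLI value at which errors burn the budget at the given burn rate."""
    error_budget = 100 - slo
    return 100 - burn_rate * error_budget


def should_alert(
    long_window_sli: float,
    short_window_sli: float,
    slo: float,
    burn_rate: float,
) -> bool:
    """Trigger only if both lookback windows show a high burn rate."""
    threshold = sli_threshold(slo, burn_rate)
    return long_window_sli <= threshold and short_window_sli <= threshold


# Hypothetical readings: SLO of 99 and burn rate 2 give a threshold of 98.
print(should_alert(97.5, 97.0, slo=99, burn_rate=2))  # both windows are bad
print(should_alert(97.5, 99.5, slo=99, burn_rate=2))  # short window recovered
```

Requiring both conditions means a brief spike that has already ended does not page anyone, which matches the purpose of the short-window check.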

Share

This app runs completely in the browser and has no backend.

So all you have to do is copy the following link. Whenever that link is opened, the app loads in the exact state it was in when you copied it.
