SLI
In simple words, Service Level Indicators (SLIs) are the metrics that represent how the reliability of the service is perceived by its consumers. They are normalized to a number between 0 and 100 using this formula:

SLI = 100 × good / valid
You can read more about the definition of Good and Valid.
Common SLIs include latency, availability, yield, durability, correctness, etc.
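As a minimal sketch of that normalization (not tied to any particular monitoring stack; the function name and the example numbers are illustrative):

```ts
/**
 * Normalizes an SLI to a number between 0 and 100:
 * the percentage of good events (or timeslices) per valid ones.
 */
function sli(good: number, valid: number): number {
  if (valid === 0) {
    throw new Error("SLI is undefined when there are no valid events");
  }
  return (100 * good) / valid;
}

// Example: 9,990 good requests out of 10,000 valid requests
console.log(sli(9_990, 10_000)); // 99.9
```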
Type
SLIs can be either event-based or time-based.

Comparing SLI Types

|  | Event-Based | Time-Based |
|---|---|---|
| Use | When consumers perceive reliability by events | When consumers perceive reliability by time |
| Counts | Good events | Good timeslices |
| Advantage | More accurately adjusts to the amount of load | More forgiving towards the negative impact of failed events |
| Formula | 100 × good events / valid events | 100 × good timeslices / valid timeslices |
Time-based SLIs aggregate metric data over a timeslice to mark it as success or failure. This can also reduce the resolution of the data. For example, probing an endpoint every 60 seconds to see if it is available assumes that the endpoint is available for the entire 60 seconds.

Another common example is to compare the average of data points with a desired value. Averages hide the spikes and valleys in the data. It is better to use percentiles instead, but note that percentiles aggregate as well: when calculating the 99th percentile of the latency every 5 minutes, the aggregation window is 5 × 60 = 300 seconds.
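For illustration, here is a hedged sketch of a time-based availability SLI computed from such 60-second probes (the data and names are made up):

```ts
/**
 * Time-based availability SLI: each entry is one 60-second timeslice,
 * true if the probe succeeded and false if it failed.
 */
function timeBasedSli(timeslices: boolean[]): number {
  const valid = timeslices.length;
  const good = timeslices.filter((up) => up).length;
  return (100 * good) / valid;
}

// Example: one failed probe out of 60 timeslices (one hour)
const lastHour = Array.from({ length: 60 }, (_, i) => i !== 42);
console.log(timeBasedSli(lastHour).toFixed(2)); // "98.33"
```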
Typical timeslice lengths

| Timeslice | Seconds |
|---|---|
| {{ p.title }} | {{ p.seconds }} |
How do consumers perceive the reliability of your service? What kind of events are important to the service consumers?

You probably don't want to count all of them. This is an opportunity to narrow down the scope of the optimization and what triggers an alert.

For simplicity, sometimes total is used instead of valid, but there is a difference. While the Service Level Indicator guides the optimization, the definition of valid scopes that optimization for two reasons:
- Focus the optimization effort
- Clarify responsibility and control
Note: use a plural form of the event name so that the UI reads more fluently.
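For example, a hedged sketch of how this scoping could look for an HTTP service; the filtering rules below are purely illustrative assumptions, not a recommendation:

```ts
interface HttpRequest {
  statusCode: number;
  isHealthCheck: boolean;
}

// Illustrative scoping: health checks and client errors (4xx) are not
// counted as valid because they are outside the team's control and
// would dilute the signal we want to optimize.
function isValid(req: HttpRequest): boolean {
  const clientError = req.statusCode >= 400 && req.statusCode < 500;
  return !req.isHealthCheck && !clientError;
}

// Illustrative definition of good: a valid request that did not fail
// on the server side.
function isGood(req: HttpRequest): boolean {
  return isValid(req) && req.statusCode < 500;
}
```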
Common events

| Event | Use case |
|---|---|
| {{ p.eventUnit }} | {{ p.useCase }} |
Good {{ sloWindow.eventUnitNorm }}
What are good {{ sloWindow.eventUnitNorm }} from the consumer's perspective? What do good {{ sloWindow.eventUnitNorm }} look like?
What is the metric that you can measure to identify the good {{ sloWindow.eventUnitNorm }} among all the valid {{ sloWindow.eventUnitNorm }}?
This is the unit of the metric ( ), not to be confused with the unit of the events ( ).
What metric values define a good {{ sloWindow.eventUnitNorm }}?
The actual thresholds are part of the SLO definition. This allows, for example, Multi-Tiered SLOs.

Good values can be bound. The actual bound is part of the SLO definition.

There seems to be a comparator in the definition of the metric already. Having another bound makes the formula harder to read. Please use only one way to set the boundaries.
Service Level Formula
The formula for calculating the SLI for the given SLO window is the percentage of good per valid. Depending on whether the SLI is time-based or event-based, the formula counts good timeslices or good events.
Service Level Objective (SLO) is the target percentage of good {{ sloWindow.eventUnitNorm }} out of total {{ sloWindow.eventUnitNorm }} in {{ sloWindow }}.
Using the two sliders below you can fine tune the SLO to your needs.
The first slider is for the integer part of the percentage ({{ percL10n(sloInt) }}).
The second slider is for the fractional part of the percentage ({{ percL10n(sloFrac) }}).
Typical SLO values

| Informal Name | SLO Value |
|---|---|
| {{ p.title }} | {{percL10n(p.slo)}} |
Note: Everyone wants the highest possible number, but just be mindful of the price tag for this high service level objective!
Note: This is an unusually low service level objective. Typically the service level objective is above {{ percL10n(90) }}, with some rare exceptions. Please check the Error budget for implications of your chosen SLO.
This slider allows fine tuning the SLO. It is mostly for convenience when deciding a reasonable error budget while keeping an eye on it.
The SLO window (also known as the compliance period) is the time period for which the SLO is calculated. It is usually 30 days or 4 weeks.

Essentially this adjusts the forgiveness of the SLO. For example, if the window is 30 days, we are not concerned with any incidents and breaches of the SLO that happened before that.

Smaller windows also help prevent the error budget from accumulating too much. For example, if the SLO is 99% for a time-based Availability SLI (uptime), the error budget allows 432 minutes of downtime per month. This amount can be consumed in multiple downtimes during the month or in one long chunk. But the same SLO allows only about 100 minutes of downtime per week.

You can play with different ranges to see how a given SLO translates to different good {{ sloWindow.eventUnitNorm }} and how it impacts the error budget.
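A small sketch of that downtime arithmetic (assuming a purely time-based error budget):

```ts
// Allowed downtime (in minutes) for a time-based SLO over a given window.
function allowedDowntimeMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * ((100 - sloPercent) / 100);
}

console.log(allowedDowntimeMinutes(99, 30)); // 432 minutes per 30 days
console.log(allowedDowntimeMinutes(99, 7));  // ~100.8 minutes per week
```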
Typical compliance periods

| Window | Days | Advantage |
|---|---|---|
| {{ p.title }} | {{ p.days }} | {{ p.useCase }} |
{{ sloWindow }}
Thresholds
The thresholds are the values that put boundaries on the values of {{ metricName }} to define good {{ sloWindow.eventUnitNorm }}. They are part of the SLO definition and allow for Multi-Tiered SLOs.
The lower threshold (LT) defines the minimum possible values for the {{ metricName }} (in ) to indicate good {{ sloWindow.eventUnitNorm }}. LT is part of the SLO definition and, for example, allows for Multi-Tiered SLOs.
The upper threshold (UT) defines the maximum possible values for the {{ metricName }} (in ) to indicate good {{ sloWindow.eventUnitNorm }}. UT is part of the SLO definition and, for example, allows for Multi-Tiered SLOs.
The upper threshold must be greater than the lower threshold.
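As a hedged sketch of how LT and UT could be applied when classifying events (the function is illustrative, not the tool's own code):

```ts
// An event is good when its metric value stays within the optional
// lower (LT) and upper (UT) thresholds defined in the SLO.
function isGoodEvent(
  metricValue: number,
  lowerThreshold?: number,
  upperThreshold?: number,
): boolean {
  if (lowerThreshold !== undefined && metricValue < lowerThreshold) return false;
  if (upperThreshold !== undefined && metricValue > upperThreshold) return false;
  return true;
}

// Example: latency must be at most 300 ms to count as good.
console.log(isGoodEvent(250, undefined, 300)); // true
console.log(isGoodEvent(450, undefined, 300)); // false
```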
Service Level Status
Service Level Status (SLS) is the percentage of good {{ sloWindow.eventUnitNorm }} in a given time. SLS is the status of the Service Level and directly relates to the SLO. Whenever the SLS is below the SLO, we have breached the SLO. In case of an SLA, this may have severe consequences.
Error budget
Error budget is one of the core ideas behind using SLIs/SLOs to improve reliability. Instead of denying or forbidding errors, the error budget allows the system to fail within a pre-defined limit.

The number one enemy of reliability is change, but we need change to be able to improve the system. Error budgets resolve exactly that tension: they provide a budget of errors for the team to improve the system while keeping the consumers happy enough.

The error budget is the complement of the SLO. It is the percentage of bad {{ sloWindow.eventUnitNorm }} that you can have before you violate the SLO.
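A minimal sketch of that relationship, assuming an event-based SLI and illustrative numbers:

```ts
// Error budget as a percentage: the complement of the SLO.
function errorBudgetPercent(sloPercent: number): number {
  return 100 - sloPercent;
}

// Bad events allowed in the SLO window before the SLO is violated,
// given an estimate of the number of valid events in that window.
function allowedBadEvents(sloPercent: number, validEvents: number): number {
  return Math.floor(validEvents * (errorBudgetPercent(sloPercent) / 100));
}

console.log(errorBudgetPercent(99.5));          // 0.5 (percent)
console.log(allowedBadEvents(99.5, 1_000_000)); // 5000 bad events
```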
Warning: The error budget is 0 based on your estimated number of valid {{ sloWindow.eventUnitNorm }}.
Here you can enter the numbers for your expected load and see how many {{ eventUnit }} are allowed to fail during the SLO window while still being within the error budget.
How much does a bad event cost the business or your team?

This cost will be used to put a tangible number on various windows and events. It might be hard to put a number on failures, especially if some resilience patterns are part of the architecture.

There are many ways to make failures cheaper. In a future article, we will discuss all patterns of reliability and how to make errors cheap. In the meantime, check out the following techniques:

- Fallback
- Failover
You can set the currency to see how much it costs to violate the SLO. If you can't put a currency on the errors, feel free to get creative.
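For instance, a rough sketch of how a price tag could be attached to the error budget; the numbers and the currency formatting are made-up assumptions:

```ts
// Worst-case cost of exhausting the entire error budget, given an
// estimated cost per bad event.
function costOfViolation(
  allowedBadEvents: number,
  costPerBadEvent: number,
  currency = "USD",
): string {
  const total = allowedBadEvents * costPerBadEvent;
  return `${total.toLocaleString()} ${currency}`;
}

// Example: 5,000 allowed bad events at 2 USD each
console.log(costOfViolation(5_000, 2)); // e.g. "10,000 USD" (locale-dependent)
```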
Typical Currencies

| Abbreviation | Description |
|---|---|
| {{ p.currency }} | {{ p.description }} |
Alerting
What is the point of setting SLIs/SLOs if we are not going to take action when the SLO is violated?

Alerting on error budgets enables us to stay on top of the reliability of our system. When using service levels, the alert triggers on the rate of consuming the error budget. When setting an alert, the burn rate decides how quickly the alert reacts to errors.

- Too fast, and it will lead to false positives (alerting unnecessarily) and alert fatigue (too many alerts).
- Too slow, and the error budget will be burned before you know it.

Google SRE Workbook goes through 6 alerting strategies based on SLOs.
Burn rate is the rate at which the error budget is consumed, relative to the SLO window. A burn rate of 1x means that the error budget will be consumed exactly over the SLO window (acceptable). A burn rate of 2x means that the error budget will be consumed in half the SLO window. This is not acceptable because, at this rate, the SLO will be violated before the end of the SLO window.

You have selected a burn rate of {{ burnRate }}x. This means the error budget ({{ errorBudget.eventCountL10n }} failed {{ errorBudget.eventUnitNorm }}) will be consumed in instead of being spread across {{ sloWindow.humanTime }}. If the error budget continues to burn at this rate throughout the SLO window, there will be {{ sloWindowBudgetBurn }}.
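A small sketch of the burn-rate arithmetic (names and numbers are illustrative):

```ts
// Hours until the error budget is exhausted, assuming a constant burn rate.
function hoursToExhaustBudget(sloWindowDays: number, burnRate: number): number {
  return (sloWindowDays * 24) / burnRate;
}

console.log(hoursToExhaustBudget(30, 1));    // 720: the whole 30-day window
console.log(hoursToExhaustBudget(30, 2));    // 360: half the window
console.log(hoursToExhaustBudget(30, 14.4)); // 50 hours
```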
Google SRE Workbook goes through 6 alerting strategies and recommends:

| Burn Rate | Error Budget | Long-Window | Short-Window | Action |
|---|---|---|---|---|
| 14.4x | {{ percL10n(2) }} Consumed | 1 hour | 5 minutes | Page |
| 6x | {{ percL10n(5) }} Consumed | 6 hours | 30 minutes | Page |
| 1x | {{ percL10n(10) }} Consumed | 3 days | 6 hours | Ticket |
Note: The above values for Long-Window and Short-Window are based on a 1-month SLO window. You can see your actual values in the comments below Long-Window and Short-Window.
Long-Window
The long-window alert is the "normal" alert. The reason it is called "long" is to distinguish it from the "short-window" alert, which is primarily used to reduce false positives and improve the alert reset time.

We don't want to wait for the entire error budget to be consumed before alerting! It will be too late to take action. Therefore the alert should trigger before a significant portion of the error budget is consumed.

Based on your setup, the alert will trigger after we have consumed {{ percL10n(longWindowPerc) }} of the entire time allotted for the error budget (or SLO compliance window).

Assuming that the entire error budget was available at the beginning of the incident, the maximum time available to respond before the entire error budget is exhausted is:

Remember that this is the best-case scenario. In reality, you may have much less time if you don't want to consume the entire error budget for an incident. Also note that the burn rate can be higher than {{ burnRate }}x.
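A hedged sketch of the long-window math described above, following the Google SRE Workbook relationship between burn rate, alert window, and budget consumption (variable names are illustrative):

```ts
// Percentage of the error budget consumed by the time the long-window
// alert fires: burnRate * longWindow / sloWindow.
function budgetConsumedAtAlertPercent(
  burnRate: number,
  longWindowHours: number,
  sloWindowHours: number,
): number {
  return (100 * burnRate * longWindowHours) / sloWindowHours;
}

// Maximum time to respond after detection, assuming the burn rate stays
// constant and the full budget was intact when the incident started.
function maxResponseHours(
  sloWindowHours: number,
  burnRate: number,
  longWindowHours: number,
): number {
  return sloWindowHours / burnRate - longWindowHours;
}

// Example from the recommended table: 14.4x burn rate, 1-hour long
// window, 30-day (720-hour) SLO window.
console.log(budgetConsumedAtAlertPercent(14.4, 1, 720)); // ~2 (% of the budget)
console.log(maxResponseHours(720, 14.4, 1));             // ~49 hours to react
```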
Warning: The alert is too "jumpy" and will trigger too often. This may lead to alert fatigue or even worse: ignoring the alerts.
Note: The time to resolve (TTR) is too short for a human to react. It is strongly recommended to automate the incident resolution instead of relying on human response to alerts.
Warning: Remember that the alert will trigger after {{ percL10n(longWindowPerc) }} of the error budget is consumed! That error budget is for {{ sloWindow.humanTime }}.
Based on your settings, an alert burns {{ percL10n(longWindowPerc) }} of the error budget just to trigger. Then it needs some time to resolve too.
How many alerts like this can you have in {{ sloWindow.humanTime }} before the entire error budget is consumed?
Warning: The long alert window is too short at this burn rate ({{ burnRate }}x), which may lead to alert fatigue.
Error: Division by zero! The long alert window is too short for enough valid {{ sloWindow.eventUnitNorm }} to be counted.
Alert Policy
This is pseudo-code for triggering alerts based on the SLI metric in relation to the desired SLO ( ). You need to translate it to your observability and/or alerting tool.
SLI(long window) ≤ SLO threshold
&&
SLI(short window) ≤ SLO threshold
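As a hedged illustration of how that condition could be evaluated in code (not tied to any specific monitoring product; the threshold derivation is an assumption based on the burn-rate discussion above):

```ts
// Multiwindow alert condition: fire only when both the long and the
// short lookback windows show the SLI at or below the alerting threshold.
function shouldAlert(
  sliLongWindow: number,  // SLI (0-100) measured over the long window
  sliShortWindow: number, // SLI (0-100) measured over the short window
  alertThreshold: number, // derived from the SLO and the chosen burn rate
): boolean {
  return sliLongWindow <= alertThreshold && sliShortWindow <= alertThreshold;
}

// Example: SLO 99.9% and a 14.4x burn rate give a threshold of
// 100 - 14.4 * (100 - 99.9) = 98.56.
console.log(shouldAlert(98.4, 98.2, 98.56));  // true: both windows breach
console.log(shouldAlert(98.4, 99.95, 98.56)); // false: the short window recovered
```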
The purpose of the short-window alert is to reduce false alerts. It checks a shorter lookback window (hence the name) to make sure that the burn rate is still high before triggering the alert. This reduces false positives where an alert is triggered for a temporarily high burn rate. The short-window alert reduces false positives at the expense of making the alerting setup more complex.

The Short-Window is usually 1/12th of the Long-Window (per Google SRE Workbook recommendation). But you can play with different dividers to see how they impact the detection time of the alert.

The Long-Window alert triggers after consuming {{ longWindowPerc }}% of the total error budget. Therefore, the Short-Window alert triggers after consuming:

Converted to time based on the current window ({{ sloWindow.humanTime }}):

This means the alert will trigger only if we are still burning the error budget at least at the {{ burnRate }}x burn rate over the past short window.
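A quick sketch of that 1/12th relationship, using the 14.4x row from the table above as an example:

```ts
// The Short-Window is conventionally the Long-Window divided by 12.
const longWindowHours = 1;     // long window for the 14.4x burn rate row
const longWindowBudgetPct = 2; // % of the error budget consumed before triggering
const divider = 12;

const shortWindowHours = longWindowHours / divider;
const shortWindowBudgetPct = longWindowBudgetPct / divider;

console.log(shortWindowHours * 60);           // 5 (minutes)
console.log(shortWindowBudgetPct.toFixed(2)); // "0.17" (% of the budget)
```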
Warning: The short alert window is too short at this burn rate ({{ burnRate }}x), which may lead to alert fatigue.
Error: The short alert window is too short for enough valid events to be counted.