Last week at work, I encountered an alert that was misfiring. Or so I thought.

This was the PromQL used for alerting when the HTTP request error rate exceeds the SLO of 5% per hour:

(sum(rate(service_foo_http_request_total{status_code!~"2.."}[1h]))
  / sum(rate(service_foo_http_request_total[1h]))) OR vector(0) >= 0.05

After some debugging, the issue was finally resolved and, well, the PromQL was behaving exactly as it was supposed to.

The cause can be summarized in three points:

  • misunderstanding of how the PromQL vector() function works
  • misunderstanding of how PromQL comparison binary operators work
  • misunderstanding of how Prometheus/Thanos-ruler evaluates query results and triggers alerts

In this post, I will share what I have learned about these three points and the changes made to the PromQL to make it behave as intended.

ps: The Prometheus documentation isn’t the clearest. I highly recommend checking out the Prometheus: Up & Running book for better explanations of the various features & functions provided by Prometheus.


How does the vector() function work?

Let’s check out the function description from Prometheus docs:

vector(s scalar) returns the scalar s as a vector with no labels.

I think the description taken from Prometheus: Up & Running is much clearer:

Vector function takes a scalar value and converts it into an instant vector with one label-less sample with the given value.

Scalar? Label-less instant vector? What are these?

Instant vector data type

a set of time series containing a single sample for each time series, all sharing the same timestamp

Example:

prometheus_operator_kubernetes_client_http_requests_total

The PromQL above returns an instant vector, with each row being one time series sampled at the time the query is evaluated.

Scalar data type

a simple numeric floating point value

It is literally just a value, and it cannot have labels.

Example:

2

The PromQL above returns a scalar.

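NOTE: “cannot have labels” !== “label-less”. “Label-less” only applies to the vector (instant & range) data types: a label-less sample belongs to a vector and has an empty label set, whereas a scalar has no label set at all. Example of label-less data: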

year()

A scalar is usually used to make a comparison easier.

For example, checking the year of the time series vector data. Both PromQL examples below perform the same comparison, but the scalar comparison is a little easier to understand than vector matching.

Scalar comparison

year(prometheus_operator_kubernetes_client_http_requests_total) == 1970

Vector comparison

year(prometheus_operator_kubernetes_client_http_requests_total) == on() group_left vector(1970)

Another question to answer: when should we use vector()?

The vector() function is useful when we need to ensure an expression returns a result but can’t depend on any particular time series existing.

For example:

sum(some_gauge) or vector(0)
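To make the fallback behaviour concrete, here is a rough Python sketch. It is a toy model, not Prometheus code: it assumes an instant vector can be represented as a dict from label set to value, and the names vector, sum_all, or_ and some_gauge are made up for illustration.

```python
from typing import Dict, Tuple

# Toy model (an assumption, not Prometheus code): an instant vector is a dict
# mapping a label set to a sample value. () stands for the empty label set.
LabelSet = Tuple[Tuple[str, str], ...]
InstantVector = Dict[LabelSet, float]


def vector(s: float) -> InstantVector:
    """Mimic vector(s): one label-less sample holding the scalar s."""
    return {(): float(s)}


def sum_all(v: InstantVector) -> InstantVector:
    """Mimic sum(v) without grouping: an empty input yields an empty output."""
    return {(): sum(v.values())} if v else {}


def or_(left: InstantVector, right: InstantVector) -> InstantVector:
    """Mimic `left or right`: keep left, add right's unmatched label sets."""
    result = dict(left)
    for labels, value in right.items():
        result.setdefault(labels, value)
    return result


# Hypothetical sample data for the some_gauge metric.
some_gauge: InstantVector = {(("pod", "a"),): 3.0, (("pod", "b"),): 4.0}
present = or_(sum_all(some_gauge), vector(0))  # series exist: the sum is kept
missing = or_(sum_all({}), vector(0))          # no series: the fallback sample
print(present, missing)
```

When the gauge has series, the summed value is kept; when it has none, the label-less 0 from vector(0) is the only element left.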

How do comparison binary operators work?

Comparison binary operators:

  • == (equal)
  • != (not-equal)
  • > (greater-than)
  • < (less-than)
  • >= (greater-or-equal)
  • <= (less-or-equal)

Referring to the PromQL again:

(sum(rate(service_foo_http_request_total{status_code!~"2.."}[1h]))
  / sum(rate(service_foo_http_request_total[1h]))) OR vector(0) >= 0.05

In the PromQL, we used the >= operator.

Expectation 1 (the entire (sum(...)/sum(...)) expression results in no data):

  • if either the numerator expression or the denominator expression evaluates to no data, the entire expression (sum(...)/sum(...)) will result in no data
  • OR vector(0) will handle the above and return a fallback value of {} 0
  • {} 0 >= 0.05 will evaluate to false
  • because it evaluates to false, we might expect the entire PromQL to evaluate to 0 or 1 (in place of true or false)

Reality 1: the comparison is false, which means no matching time series is found and the query returns no data

Explanation 1:

The PromQL above is actually performing a vector (sum(...)/sum(...)) VS scalar (0.05) comparison.

Depending on what data types are being compared, there are variations in behaviour:

  • scalar VS scalar
  • scalar VS vector
  • vector VS vector

As per Prometheus docs,

Between an instant vector and a scalar, these operators are applied to the value of every data sample in the vector, and vector elements between which the comparison result is false get dropped from the result vector

This is why no data was returned.

If we want it to evaluate to a boolean value ({} 0 or {} 1), we will need to add a bool modifier after the operator.

Example:

(sum(rate(service_foo_http_request_total{status_code!~"2.."}[1h]))
  / sum(rate(service_foo_http_request_total[1h]))) OR vector(0) >= bool 0.05
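The difference between filtering and the bool modifier can be sketched in Python. This is only a toy model of the behaviour quoted from the docs, assuming an instant vector is a dict from label set to value; compare is a hypothetical helper, not a Prometheus API.

```python
import operator
from typing import Callable, Dict, Tuple

InstantVector = Dict[Tuple, float]  # toy model: label set -> sample value


def compare(v: InstantVector, op: Callable[[float, float], bool],
            scalar: float, bool_modifier: bool = False) -> InstantVector:
    """Mimic `v <op> scalar`, optionally with the bool modifier."""
    if bool_modifier:
        # Every element is kept; its value becomes 1 (true) or 0 (false).
        return {labels: float(op(value, scalar)) for labels, value in v.items()}
    # Without bool, elements that fail the comparison are dropped.
    return {labels: value for labels, value in v.items() if op(value, scalar)}


fallback = {(): 0.0}                                   # what vector(0) yields
filtered = compare(fallback, operator.ge, 0.05)        # empty, i.e. no data
as_bool = compare(fallback, operator.ge, 0.05, True)   # one sample, value 0
print(filtered, as_bool)
```

Without bool the failing sample disappears; with bool it stays, and only its value changes to 0.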

What if the entire (sum(...)/sum(...)) expression evaluated to {} 0 instead? Based on what we understood so far…

Expectation 2 (the entire (sum(...)/sum(...)) expression results in {} 0):

  • if the numerator expression evaluates to 0 while the denominator is non-zero, the entire expression (sum(...)/sum(...)) will result in 0
  • the left-hand side already has a value, so OR vector(0) adds nothing and the value stays 0
  • 0 >= 0.05 will evaluate to false
  • because it evaluates to false, it means no matching time series is found and it will return no data

Reality 2: evaluated to a label-less instant vector {} 0

Explanation 2:

The parentheses enclosing the OR vector(0) portion are missing (i.e. it should be (sum(...) / sum(...) OR vector(0)) >= 0.05).

Comparison operators bind more tightly than OR, so without the parentheses sum(...) / sum(...) and vector(0) >= 0.05 are evaluated separately, and the two results are then combined with OR.

sum(...) / sum(...) returns {} 0

vector(0) >= 0.05 returns no data

As per the logical binary operator description:

vector1 or vector2 results in a vector that contains all original elements (label sets + values) of vector1 and additionally all elements of vector2 which do not have matching label sets in vector1

Since vector1 already contains {} 0 and vector2 is empty, the final result will be {} 0.
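The two groupings can be compared side by side with a small Python sketch. It is a toy model under the same assumption as before (an instant vector is a dict from label set to value); ge, or_, as_written and as_intended are illustrative names only.

```python
from typing import Dict, Tuple

InstantVector = Dict[Tuple, float]    # toy model: label set -> sample value
VECTOR_0: InstantVector = {(): 0.0}   # what vector(0) yields


def ge(v: InstantVector, threshold: float) -> InstantVector:
    """Mimic `v >= threshold`: drop elements that fail the comparison."""
    return {labels: value for labels, value in v.items() if value >= threshold}


def or_(left: InstantVector, right: InstantVector) -> InstantVector:
    """Mimic `left or right`: left wins, right fills unmatched label sets."""
    return {**right, **left}


def as_written(ratio: InstantVector, threshold: float = 0.05) -> InstantVector:
    """ratio OR vector(0) >= t, which parses as ratio OR (vector(0) >= t)."""
    return or_(ratio, ge(VECTOR_0, threshold))


def as_intended(ratio: InstantVector, threshold: float = 0.05) -> InstantVector:
    """(ratio OR vector(0)) >= t."""
    return ge(or_(ratio, VECTOR_0), threshold)


for ratio in ({}, {(): 0.0}, {(): 0.01}, {(): 0.1}):
    print(ratio, as_written(ratio), as_intended(ratio))
```

In this model the query as written returns a sample whenever the ratio has one, even for ratios below the threshold, while the parenthesized form only returns ratios of at least 0.05.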


How does Prometheus/Thanos-ruler evaluate PromQL results and trigger alerts?

Intuitively (or not), one might assume that for a PromQL result to trigger an alert, the resulting query must evaluate to a “truthy” value. At least that was what I assumed.

On further testing with PromQLs that returned either 0, 1 or no data, I realized that:

  • no data means “ALL OK”
  • returning ANY data (0, 1, or anything else) is deemed “SOMETHING IS WRONG. TRIGGER ALERT”

This is supported by what I found in the Prometheus source code:

func (r *AlertingRule) Eval(ctx context.Context, ts time.Time, query QueryFunc, externalURL *url.URL, limit int) (promql.Vector, error) {
	res, err := query(ctx, r.vector.String(), ts)
	if err != nil {
		return nil, err
	}

	r.mtx.Lock()
	defer r.mtx.Unlock()

	// Create pending alerts for any new vector elements in the alert expression
	// or update the expression value for existing elements.
	resultFPs := map[uint64]struct{}{}

	var vec promql.Vector
	var alerts = make(map[uint64]*Alert, len(res))
	for _, smpl := range res {
		// Provide the alert information to the template.
		l := make(map[string]string, len(smpl.Metric))
		for _, lbl := range smpl.Metric {
			l[lbl.Name] = lbl.Value
		}
    ...

The alert expression is evaluated by this call:

res, err := query(ctx, r.vector.String(), ts)

If the PromQL returned any data (i.e. anything other than no data), every returned sample gets an entry in the alerts map:

var alerts = make(map[uint64]*Alert, len(res))
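The rule can be paraphrased in a few lines of Python. This is a simplified sketch of the behaviour described above, not a port of the Go code: it ignores the pending and firing states and any `for` duration, and alerts_for is a hypothetical name.

```python
from typing import Dict, List, Tuple

InstantVector = Dict[Tuple, float]  # toy model: label set -> sample value


def alerts_for(result: InstantVector) -> List[dict]:
    """Build one alert per returned sample; the sample value is not inspected."""
    return [{"labels": dict(labels), "value": value}
            for labels, value in result.items()]


print(len(alerts_for({})))          # no data
print(len(alerts_for({(): 0.0})))   # a sample whose value is 0
print(len(alerts_for({(): 1.0})))   # a sample whose value is 1
```

Only the empty result produces no alert; a sample with value 0 counts just as much as a sample with value 1.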

Summary

We now have a better understanding of the vector() function, comparison binary operators and how alerts are triggered by Prometheus/Thanos-ruler.

So what exactly were the changes made to the PromQL to have it evaluate as intended?

Well, one way is to enclose the entire section before the >= operator in parentheses:

((sum(rate(service_foo_http_request_total{status_code!~"2.."}[1h]))
  / sum(rate(service_foo_http_request_total[1h]))) OR vector(0)) >= 0.05

The alternative is to remove the OR vector(0), which is redundant because the fallback value {} 0 is always dropped by >= 0.05 anyway.

sum(rate(service_foo_http_request_total{status_code!~"2.."}[1h]))
  / sum(rate(service_foo_http_request_total[1h])) >= 0.05
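As a quick sanity check that the two fixes agree, here is a Python sketch using the same toy model (an instant vector as a dict from label set to value). The helper names are hypothetical, and the equivalence relies on the threshold being greater than 0.

```python
from typing import Dict, Tuple

InstantVector = Dict[Tuple, float]  # toy model: label set -> sample value


def ge(v: InstantVector, threshold: float) -> InstantVector:
    """Mimic `v >= threshold`: drop elements that fail the comparison."""
    return {labels: value for labels, value in v.items() if value >= threshold}


def with_fallback(ratio: InstantVector, threshold: float = 0.05) -> InstantVector:
    """(ratio OR vector(0)) >= threshold."""
    merged = {(): 0.0, **ratio}  # vector(0) only fills in a missing sample
    return ge(merged, threshold)


def without_fallback(ratio: InstantVector, threshold: float = 0.05) -> InstantVector:
    """ratio >= threshold."""
    return ge(ratio, threshold)


cases = [{}, {(): 0.0}, {(): 0.01}, {(): 0.05}, {(): 0.1}]
same = all(with_fallback(c) == without_fallback(c) for c in cases)
print(same)
```

Both forms return no data below the threshold and the ratio itself at or above it, so either one should alert only when the SLO is breached.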