How do I instrument region and environment information correctly in Prometheus?

Issue

I have an application, and I’m running one instance of it per AWS region.
I’m instrumenting the application code with a Prometheus metrics client and will expose the collected metrics at the /metrics endpoint. A central server will scrape the /metrics endpoints across all the regions and store the data in a central time series database.

Let’s say I’ve defined a metric named http_responses_total. I would like to know its value aggregated over all the regions, as well as the individual regional values.
How do I store the region (any one of 13 regions) and the env (dev, test, or prod) along with the metrics so that I can slice and dice them by region and env?

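I found a few ways to do it, but I’m not sure how it’s usually done, even though this seems like a pretty common scenario:

1. Expose env and region as labels from the application, with every metric.
2. Add these labels in Prometheus, as part of the scrape configuration.
3. Store these labels in a separate metric.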

I’m new to Prometheus. Could someone please suggest how I should store this region and env information? Are there any other better ways?

Solution

All the proposed options will work, and all of them have downsides.

The first option (having env and region exposed by the application with every metric) is easy to implement but hard to maintain. Eventually somebody will forget about these labels, which opens the door to a failure going unobserved. Beyond that, you may not be able to add these labels to exporters written by someone else. Lastly, if you have to deal with millions of time series, more plain-text data means more traffic.

The third option (storing these labels in a separate metric) will make it quite difficult to write and understand queries. Take this one for example:

sum by(instance) (node_arp_entries) and on(instance) node_exporter_build_info{version="0.17.0"}

It calculates a sum of node_arp_entries for instances running node-exporter version 0.17.0. More precisely, it calculates a sum for every instance and then drops those whose version does not match, but you get the idea.
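Applied to the question, the third option might look like the rule file below. This is only a sketch: app_deployment_info is a hypothetical info-style metric (constant value 1, carrying region and env labels) that the application would have to expose, and the group and rule names are made up for illustration. It also assumes both metrics carry the same instance label.

```yaml
# rules.yml: hypothetical recording rule for the "separate metric" approach.
groups:
  - name: region_env_join
    rules:
      # Copy region and env from the info metric onto every
      # http_responses_total series, then aggregate by those labels.
      - record: region_env:http_responses_total:sum
        expr: |
          sum by (region, env) (
              http_responses_total
            * on (instance) group_left (region, env)
              app_deployment_info
          )
```

Every query that needs region or env has to repeat this join (or depend on a rule like this one), which is the cost this option carries.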

The second option (adding these labels with Prometheus as a part of scrape configuration) is what I would choose. To save words, consider this monitoring setup:

Datacenter Prometheus:
1. Collects metrics from local instances.
2. Adds dc label to each metric.
3. Pushes the data into the regional Prometheus.

Regional Prometheus:
1. Collects data on datacenter scale.
2. Adds region label to all metrics.
3. Pushes the data into the global instance.

Global Prometheus:
Simply collects and stores the data on global scale.
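One tier of such a hierarchy could be sketched as follows. Host names, the job name, and label values are placeholders. The sketch assumes federation, where the upper instance pulls from the /federate endpoint of the lower one, rather than a literal push; external_labels are attached to series when an instance is federated or sends data to remote storage.

```yaml
# Datacenter Prometheus (its own prometheus.yml)
global:
  external_labels:
    dc: dc1              # placeholder datacenter name
---
# Regional Prometheus (a separate prometheus.yml)
global:
  external_labels:
    region: us-east-1    # placeholder; attached when the global instance pulls from here
scrape_configs:
  - job_name: federate-datacenters
    honor_labels: true   # keep labels (including dc) from the source instance
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'  # pull every series that has a job label
    static_configs:
      - targets:
          - prometheus.dc1.example.internal:9090
```

The global instance would carry a similar federate job pointing at each regional instance.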

This tiered setup is the kind you need at Google scale, but the point is its simplicity. It’s perfectly clear where each label comes from and why. This approach requires a somewhat more complicated Prometheus configuration, and the fewer Prometheus instances you have, the more scrape configurations you will need. Overall, I think this option beats the alternatives.
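For the setup in the question, with a single central Prometheus scraping every region, the second option could look roughly like this. The target addresses and the job name are placeholders, and only two of the regions are shown.

```yaml
# prometheus.yml on the central server
scrape_configs:
  - job_name: my-app
    metrics_path: /metrics
    static_configs:
      # One entry per region; the labels are attached to every
      # series scraped from the listed targets.
      - targets: ['app.us-east-1.example.internal:8080']
        labels:
          region: us-east-1
          env: prod
      - targets: ['app.eu-west-1.example.internal:8080']
        labels:
          region: eu-west-1
          env: prod
```

With these labels in place, sum(http_responses_total) gives the value across all regions, and sum by (region, env) (http_responses_total) breaks it down per region and environment.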

Answered By – anemyte

Answer Checked By – Katrina (GoLangFix Volunteer)
