Process and mailing monitoring via Prometheus
Process and mailing monitoring via Prometheus lets you track performance, processing stability, and message queue status.
Collecting metrics allows you to:
- track mailing processing speed and individual lead processing stages;
- identify delays in scenario execution and data processing;
- monitor message publishing status in RabbitMQ;
- detect queue growth, re-sends, and lost events;
- analyze load on the campaign, procworkflow, and proctrigger processes;
- build dashboards and configure alerts in Prometheus and Grafana;
- find bottlenecks during performance degradation or mailing processing errors.
Which processes support metrics
Currently, metrics are supported by the campaign, procworkflow, and proctrigger processes. Mailing metrics can be delivered in two ways:
- via Pushgateway, when campaign runs separately;
- via the pull model inside procworkflow and proctrigger, if campaigns are executed within these processes.
| Monitoring type | Processes | Collection method |
|---|---|---|
| Pull | procworkflow, proctrigger | Prometheus scrapes /metrics |
| Push | campaign | Metrics are sent to Pushgateway |
Configuring pull metrics
The pull model is used to collect metrics from the procworkflow and proctrigger processes. In this mode, Prometheus periodically queries the HTTP endpoint /metrics exposed by the corresponding process.
Metrics from procworkflow and proctrigger also include mailing metrics if mailings are executed within these processes.
Pull metrics configuration example
Example platform configuration for enabling pull metrics for both procworkflow and proctrigger simultaneously:
{
    "PROMETHEUS_METRICS": {
        "ENABLE": true,
        "PROCESSES": [
            "procworkflow",
            "proctrigger"
        ]
    },
    "WF_METRIC_HOST": "0.0.0.0",
    "WF_METRIC_PORT": 8911,
    "PROC_TRIGGER_METRIC_HOST": "0.0.0.0",
    "PROC_TRIGGER_METRIC_PORT": 8912
}
If the PROCESSES array is empty ([]), metrics are automatically enabled for all supported processes.
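The selection rule above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the platform's code: the function name `enabled_processes` is hypothetical, and treating a missing `ENABLE` key as disabled and a missing `PROCESSES` key as an empty list are assumptions.

```python
# Processes that support metrics according to this section.
SUPPORTED_PROCESSES = ("campaign", "procworkflow", "proctrigger")


def enabled_processes(config):
    """Return the processes for which metrics are enabled (hypothetical helper)."""
    section = config.get("PROMETHEUS_METRICS", {})
    # Assumption: a missing ENABLE key means metrics are off.
    if not section.get("ENABLE", False):
        return []
    # Assumption: a missing PROCESSES key behaves like an empty list.
    requested = section.get("PROCESSES", [])
    if not requested:
        # An empty list enables metrics for every supported process.
        return list(SUPPORTED_PROCESSES)
    return [name for name in SUPPORTED_PROCESSES if name in requested]


config = {"PROMETHEUS_METRICS": {"ENABLE": True, "PROCESSES": []}}
print(enabled_processes(config))
```

Listing the processes explicitly, as in the configuration example, keeps the result independent of any processes that gain metric support later.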
Add metric scrape jobs to the Prometheus configuration:
scrape_configs:
  - job_name: 'procworkflow'
    metrics_path: /metrics
    static_configs:
      - targets:
          - '<procworkflow-server>:8911'
  - job_name: 'proctrigger'
    metrics_path: /metrics
    static_configs:
      - targets:
          - '<proctrigger-server>:8912'
After starting the processes, verify that the metric services are listening on the specified ports:
netstat -tlpn | grep 8911
netstat -tlpn | grep 8912
Example output:
tcp6 0 0 :::8911 :::* LISTEN
tcp6 0 0 :::8912 :::* LISTEN
Check the availability of the /metrics endpoint:
curl http://<procworkflow-server>:8911/metrics
curl http://<proctrigger-server>:8912/metrics
Replace <procworkflow-server> and <proctrigger-server> with the actual IP addresses or hostnames of the servers running the respective processes.
If the services are configured correctly, the endpoint will return a list of Prometheus metrics for the procworkflow and proctrigger processes.
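To see at a glance which metrics an endpoint returns, the response can be reduced to a set of metric names. The sketch below parses the Prometheus text exposition format with a deliberately simple rule (skip comment lines, take the token before the labels or the value). The sample text is made up for illustration; in particular, the assumption that `lead_total_milliseconds` is a histogram is not confirmed by this document, and real names may carry extra prefixes.

```python
def metric_names(exposition_text):
    """Collect sample names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


# Illustrative sample only, not real platform output.
SAMPLE = """\
# HELP lead_total_milliseconds Total lead processing time
# TYPE lead_total_milliseconds histogram
lead_total_milliseconds_bucket{le="10"} 3
lead_total_milliseconds_sum 42.5
lead_total_milliseconds_count 3
cursor_lag_milliseconds 120
"""

print(sorted(metric_names(SAMPLE)))
```

In practice the text would come from the `curl` output above, for example by saving it to a file and passing the file contents to `metric_names`.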
Pull metric parameters
| Parameter | Description |
|---|---|
| PROMETHEUS_METRICS.ENABLE | Globally enables Prometheus metrics |
| PROMETHEUS_METRICS.PROCESSES | List of processes for which metric collection is enabled |
| WF_METRIC_HOST | Address on which procworkflow publishes metrics |
| WF_METRIC_PORT | Port of the procworkflow metrics service |
| PROC_TRIGGER_METRIC_HOST | Address on which proctrigger publishes metrics |
| PROC_TRIGGER_METRIC_PORT | Port of the proctrigger metrics service |
When configuring pull metrics, consider the Prometheus location relative to the platform server.
| Host value | When to use |
|---|---|
| 127.0.0.1 | Prometheus is installed on the same server as the platform |
| 0.0.0.0 | Prometheus is installed on a separate server |
The WF_METRIC_HOST and PROC_TRIGGER_METRIC_HOST parameters define the internal address on which the processes will accept requests to the /metrics endpoint.
The WF_METRIC_PORT and PROC_TRIGGER_METRIC_PORT parameters set the metric service ports. You can use ports in the range 1024 to 9999, excluding ports occupied by other services.
If Prometheus is located on a separate server, specify 0.0.0.0 in the WF_METRIC_HOST and PROC_TRIGGER_METRIC_HOST parameters.
Do not use the same port for WF_METRIC_PORT and PROC_TRIGGER_METRIC_PORT. Each process must serve metrics on a separate port.
If the metrics service does not start after changing WF_METRIC_HOST, WF_METRIC_PORT, PROC_TRIGGER_METRIC_HOST, or PROC_TRIGGER_METRIC_PORT, verify that the specified port is free and available on the server.
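The port rules above (range 1024 to 9999, distinct ports per process, port not already taken) can be checked before restarting the platform. The following is a minimal sketch; both helper names are hypothetical and are not part of the platform.

```python
import socket


def check_metric_ports(config):
    """Return a list of problems found in the metric port settings."""
    problems = []
    wf_port = config.get("WF_METRIC_PORT")
    trigger_port = config.get("PROC_TRIGGER_METRIC_PORT")
    for key, port in (("WF_METRIC_PORT", wf_port),
                      ("PROC_TRIGGER_METRIC_PORT", trigger_port)):
        if not isinstance(port, int) or not 1024 <= port <= 9999:
            problems.append(f"{key}={port!r} is outside the 1024-9999 range")
    if wf_port is not None and wf_port == trigger_port:
        problems.append("WF_METRIC_PORT and PROC_TRIGGER_METRIC_PORT must differ")
    return problems


def port_is_free(host, port):
    """Try to bind the port; a failed bind means something already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


print(check_metric_ports({"WF_METRIC_PORT": 8911, "PROC_TRIGGER_METRIC_PORT": 8912}))
```

Run `port_is_free` while the platform processes are stopped; otherwise it reports the port as busy because the metrics service itself is holding it.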
Configuring push metrics for campaign
The push model is used to send metrics from the campaign process to the Prometheus Pushgateway.
In this mode, the campaign process independently sends metrics to the Pushgateway, after which Prometheus scrapes them from the gateway server.
The push model is supported only for the campaign process. Metrics will not appear in Pushgateway until the mailing has been started at least once.
Before configuration, you must deploy and start the Prometheus Pushgateway.
The platform does not start Pushgateway automatically. In the ADDRESS parameter, you must specify the address of an already running Pushgateway.
Pushgateway launch example via systemd
Example unit file:
[Unit]
Description=Prometheus Pushgateway
Wants=network-online.target
After=network-online.target
[Service]
User=pushgateway
Group=pushgateway
Type=simple
ExecStart=/usr/local/bin/pushgateway
[Install]
WantedBy=multi-user.target
After starting Pushgateway, configure metric delivery in the platform configuration:
{
    "PROMETHEUS_METRICS_PUSH_GATEWAY": {
        "ENABLE": true,
        "ADDRESS": "<pushgateway-server>:9091",
        "PERIOD_SEC": 5
    }
}
Parameter description:
| Parameter | Description |
|---|---|
| ENABLE | Enables sending metrics to Pushgateway |
| ADDRESS | Address and port of the already running Pushgateway |
| PERIOD_SEC | Metric send interval in seconds |
Add Pushgateway to the Prometheus configuration:
scrape_configs:
  - job_name: 'pushgateway'
    static_configs:
      - targets:
          - '<pushgateway-server>:9091'
Check Pushgateway metrics availability:
curl http://<pushgateway-server>:9091/metrics
Replace <pushgateway-server> with the actual IP address or hostname of your Pushgateway server.
Metrics from the campaign process start appearing in Pushgateway only after a mailing is launched.
Grouping metrics by campaign ID
The CAMPAIGN_ID_PROMETHEUS_GROUPING_ENABLE parameter controls grouping of push metrics by mailing ID.
Example configuration:
{
    "CAMPAIGN_ID_PROMETHEUS_GROUPING_ENABLE": true
}
| Value | Behavior |
|---|---|
| true | Metrics are grouped by mailing ID |
| false | All metrics are sent to a single group |
By default, the parameter is enabled (true).
When grouping is enabled, separate metric groups are created in the Pushgateway for each mailing. This simplifies:
- analyzing performance of individual mailings;
- building Grafana dashboards;
- configuring alerting rules;
- finding issues in a specific mailing.
When grouping is disabled, metrics from all mailings are aggregated into a single Pushgateway group.
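The difference between the two modes can be pictured through the Pushgateway push URL, where the path segments after `/metrics/job/<job>` form the grouping key. The sketch below only illustrates that idea: the job name `campaign` and the label name `campaign_id` are assumptions, since this document does not state which job and label names the platform actually uses, and IDs are assumed to be plain alphanumeric strings.

```python
def push_url(address, job, campaign_id=None):
    """Build a Pushgateway push URL (hypothetical job and label names)."""
    url = f"http://{address}/metrics/job/{job}"
    if campaign_id is not None:
        # Extra path segments become part of the grouping key,
        # so each mailing gets its own metric group.
        url += f"/campaign_id/{campaign_id}"
    return url


# Grouping enabled: one group per mailing.
print(push_url("pushgateway.example:9091", "campaign", "42"))
# Grouping disabled: all mailings share one group.
print(push_url("pushgateway.example:9091", "campaign"))
```

With a shared group, each push replaces the metrics previously stored in that group, which is one reason per-mailing groups are easier to analyze.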
RabbitMQ publisher metrics
RabbitMQ publisher business metrics are available for the procworkflow and proctrigger processes.
These metrics are used to monitor message publishing, delivery confirmation time, and the number of retry attempts.
Configuring histogram bucket values
For the total_duration and confirm_duration metrics, you can configure custom histogram bucket values.
Example configuration:
{
    "PROMETHEUS_METRICS_RMQ_PUBLISHER": {
        "MSEC_BUCKETS": {
            "total_duration": [10, 25, 50, 75.5],
            "confirm_duration": [10, 25, 50, 75.5]
        }
    }
}
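To choose bucket values, it helps to see how observations are distributed across them. Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit `+Inf` bucket counts all observations. The sketch below applies that rule to the bounds from the example; the observed durations are invented for illustration.

```python
import math


def cumulative_bucket_counts(bounds_ms, observations_ms):
    """Count observations per cumulative bucket, including +Inf."""
    counts = {bound: 0 for bound in sorted(bounds_ms)}
    counts[math.inf] = 0  # the implicit +Inf bucket
    for value in observations_ms:
        for bound in counts:
            if value <= bound:
                counts[bound] += 1
    return counts


bounds = [10, 25, 50, 75.5]
observed = [4, 10, 30, 80, 200]  # made-up durations in milliseconds
for bound, count in cumulative_bucket_counts(bounds, observed).items():
    print(f"le={bound}: {count}")
```

In this sample, the 80 ms and 200 ms observations fall only into `+Inf`, so they cannot be told apart. If a large share of real durations exceeds the highest bound, adding higher bucket values gives more useful percentiles.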
Available metrics
| Metric | Description |
|---|---|
| total_duration | Total message publishing duration |
| confirm_duration | Publishing confirmation duration |
| retry_counts | Number of retry attempts |
| retried_count | Number of messages sent with at least one retry |
| lost_failed_events_count | Number of messages discarded after exceeding the retry limit |
Metric interpretation
When analyzing RabbitMQ publisher metrics, pay attention to the following changes:
| Metric | Possible cause |
|---|---|
| Increase in confirm_duration | RabbitMQ slowdown or network issues |
| Increase in retry_counts | Unstable message delivery |
| Increase in retried_count | Elevated number of publishing errors |
| Non-zero lost_failed_events_count | Event loss after exceeding the retry limit |
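The counter rows of the table can be turned into a simple comparison of two metric snapshots. The sketch below is an illustration under assumptions: the helper name is hypothetical, the snapshots are plain dictionaries of counter values taken at two points in time, and `confirm_duration` is left out because a duration distribution needs bucket or percentile analysis rather than a plain comparison.

```python
def publisher_warnings(previous, current):
    """Compare two counter snapshots and describe suspicious changes."""
    warnings = []
    if current.get("lost_failed_events_count", 0) > 0:
        warnings.append("lost_failed_events_count is non-zero: "
                        "events were lost after exceeding the retry limit")
    causes = (
        ("retry_counts", "unstable message delivery"),
        ("retried_count", "elevated number of publishing errors"),
    )
    for name, cause in causes:
        if current.get(name, 0) > previous.get(name, 0):
            warnings.append(f"{name} increased: possible {cause}")
    return warnings


before = {"retry_counts": 5, "retried_count": 2, "lost_failed_events_count": 0}
after = {"retry_counts": 9, "retried_count": 2, "lost_failed_events_count": 1}
for message in publisher_warnings(before, after):
    print(message)
```

In a production setup the same conditions are usually expressed as Prometheus alerting rules; the function only shows the logic behind them.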
Mailing metrics
Mailing metrics are used to monitor lead processing performance, individual stage execution time, and error counts during mailing execution.
The tables below list the main metrics. Actual names in Prometheus may contain additional prefixes, suffixes, and labels depending on the platform configuration.
Lag metrics
| Metric | Description |
|---|---|
| cursor_lag_milliseconds | Mailing processing lag relative to the current queue state |
An increase in cursor_lag_milliseconds may indicate insufficient resources, queue overload, or slowed lead processing.
General lead processing metrics
| Metric | Description |
|---|---|
| lead_prepare_milliseconds | Lead preparation time |
| lead_processing_milliseconds | Lead processing time |
| lead_wait_milliseconds | Total wait time |
| lead_total_milliseconds | Total lead processing time |
Stage processing metrics
| Metric | Description |
|---|---|
| lead_suppress_lists_check_milliseconds | Suppress list check |
| lead_policy_check_milliseconds | Policy check |
| lead_static_milliseconds | Static data processing |
| lead_form_milliseconds | Form processing |
| lead_relation_milliseconds | Relations processing |
| lead_query_milliseconds | Query execution |
| lead_loyalty_milliseconds | Loyalty data processing |
| lead_loyalty_program_milliseconds | Loyalty program processing |
| lead_site_milliseconds | Site data processing |
| lead_json_milliseconds | JSON processing |
| lead_render_milliseconds | Content rendering |
| lead_links_milliseconds | Link generation |
| lead_sends_milliseconds | Message sending |
Stage processing metrics are used to find bottlenecks during mailing execution.
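One way to locate a bottleneck is to rank stages by average duration. The sketch below assumes that each stage metric exposes a running total and an observation count (as Prometheus histograms and summaries do through their `_sum` and `_count` series); this document does not confirm the metric types, so treat the input format and the sample numbers as illustrative.

```python
def slowest_stages(stage_totals, top=3):
    """Rank stages by average duration in milliseconds.

    stage_totals maps a metric name to a (total_ms, count) pair.
    """
    averages = []
    for name, (total_ms, count) in stage_totals.items():
        if count > 0:  # skip stages that have not run yet
            averages.append((name, total_ms / count))
    averages.sort(key=lambda item: item[1], reverse=True)
    return averages[:top]


# Made-up values for illustration.
sample = {
    "lead_render_milliseconds": (900.0, 100),
    "lead_query_milliseconds": (4000.0, 100),
    "lead_links_milliseconds": (50.0, 100),
    "lead_json_milliseconds": (0.0, 0),
}
for name, average in slowest_stages(sample, top=2):
    print(f"{name}: {average:.1f} ms on average")
```

Comparing the same ranking for the corresponding wait metrics helps separate stages that are slow to execute from stages that are slow to get started.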
Stage wait metrics
| Metric | Description |
|---|---|
| lead_suppress_lists_check_wait_milliseconds | Wait time for suppress list check |
| lead_policy_check_wait_milliseconds | Wait time for policy check |
| lead_static_wait_milliseconds | Wait time for static data processing |
| lead_form_wait_milliseconds | Wait time for form processing |
| lead_relation_wait_milliseconds | Wait time for Relations processing |
| lead_query_wait_milliseconds | Wait time for query execution |
| lead_loyalty_wait_milliseconds | Wait time for loyalty data |
| lead_loyalty_program_wait_milliseconds | Wait time for loyalty programs |
| lead_site_wait_milliseconds | Wait time for site data |
| lead_json_wait_milliseconds | Wait time for JSON processing |
| lead_render_wait_milliseconds | Wait time for rendering |
| lead_links_wait_milliseconds | Wait time for link generation |
An increase in wait metrics typically indicates insufficient resources, locks, or overloaded dependent services.
Stage error metrics
| Metric | Description |
|---|---|
| lead_suppress_list_check_failure_count | Suppress list check errors |
| lead_policy_check_failure_count | Policy check errors |
| lead_static_failure_count | Static data processing errors |
| lead_form_failure_count | Form processing errors |
| lead_relation_failure_count | Relations processing errors |
| lead_query_failure_count | Query execution errors |
| lead_loyalty_failure_count | Loyalty data processing errors |
| lead_loyalty_program_failure_count | Loyalty program processing errors |
| lead_site_failure_count | Site data processing errors |
| lead_json_failure_count | JSON processing errors |
| lead_render_failure_count | Rendering errors |
| lead_links_failure_count | Link generation errors |
| lead_sends_failure_count | Message sending errors |
An increase in failure metrics indicates errors at specific mailing processing stages and can be used to configure alerting rules in Prometheus and Grafana.
Monitoring verification
After configuration, verify that Prometheus receives metrics.
For pull metrics, run the following requests:
curl http://<procworkflow-server>:8911/metrics
curl http://<proctrigger-server>:8912/metrics
For push metrics, check the Pushgateway:
curl http://<pushgateway-server>:9091/metrics
Replace <procworkflow-server>, <proctrigger-server>, and <pushgateway-server> with the actual IP addresses or hostnames of your servers.
Also check the target status in the Prometheus UI. The procworkflow, proctrigger, and pushgateway jobs should show the UP state.
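The same target check can be scripted against the Prometheus HTTP API, which lists scrape targets at `/api/v1/targets`. The sketch below is a minimal example: the helper names are hypothetical, and the Prometheus address is a placeholder to replace with your own server.

```python
import json
from urllib.request import urlopen

EXPECTED_JOBS = ("procworkflow", "proctrigger", "pushgateway")


def job_health(targets_response):
    """Summarize the health of the expected jobs from an API response."""
    seen = {job: [] for job in EXPECTED_JOBS}
    active = targets_response.get("data", {}).get("activeTargets", [])
    for target in active:
        job = target.get("labels", {}).get("job")
        if job in seen:
            seen[job].append(target.get("health", "unknown"))
    result = {}
    for job, states in seen.items():
        if not states:
            result[job] = "missing"  # job is not configured in Prometheus
        elif all(state == "up" for state in states):
            result[job] = "up"
        else:
            result[job] = "down"  # at least one target is not healthy
    return result


def fetch_targets(prometheus_url):
    """Download the target list, e.g. from http://<prometheus-server>:9090."""
    with urlopen(f"{prometheus_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)
```

A typical call is `job_health(fetch_targets("http://<prometheus-server>:9090"))`; any job reported as `missing` or `down` points to the scrape configuration or the corresponding metrics service.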
Common issues
| Issue | Possible cause | What to check |
|---|---|---|
| Prometheus does not receive pull metrics | Process listens only on the local interface | Check WF_METRIC_HOST and PROC_TRIGGER_METRIC_HOST values |
| /metrics endpoint is unavailable | Metrics service did not start | Check the port with netstat -tlpn |
| Metrics service does not start | Port is occupied by another process | Specify a free port in the 1024–9999 range |
| No campaign metrics in Pushgateway | Mailing has not been launched yet | Launch the mailing and re-check |
| All campaign metrics fall into a single group | Mailing ID grouping is disabled | Check CAMPAIGN_ID_PROMETHEUS_GROUPING_ENABLE |