# Slack Worker Observability And Release Gates

This guide documents the WS-10 observability and rollout controls for the Slack worker.

## 1. Metric Events

All Slack worker metrics are emitted as structured logs with message `slack_metric` from logger `app.slack.telemetry`.
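
As a rough illustration of the emission pattern described above, the sketch below logs one metric event through the standard `logging` module. The helper name `emit_metric`, the `extra` keys `metric` and `fields`, and the example field values are assumptions made for illustration; the actual telemetry module may structure its records differently.

```python
import logging

# Logger name and log message come from this guide; everything else is a sketch.
logger = logging.getLogger("app.slack.telemetry")


def emit_metric(name: str, **fields) -> None:
    """Emit one metric event as a structured log record (hypothetical helper)."""
    # Drop None values so optional fields such as error_type are simply omitted.
    payload = {key: value for key, value in fields.items() if value is not None}
    # "metric" and "fields" are assumed attribute names attached to the record.
    logger.info("slack_metric", extra={"metric": name, "fields": payload})


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Example values are hypothetical.
    emit_metric(
        "slack_ingress_event_total",
        status="accepted",
        event_type="message",
        queue="slack",
        reason=None,
    )
```

A log pipeline can then filter on the `slack_metric` message and aggregate by the metric name and its fields.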

### Ingress and routing

- `slack_ingress_event_total`
  - fields: `status`, `workspace_external_id`, `workspace_id`, `event_id`, `event_type`, `channel_type`, `reason`, `queue`
- `slack_intent_routing_total`
  - fields: `workspace_id`, `thread_id`, `intent_category`, `detected_intent`, `channel_type`
- `slack_router_message_total`
  - fields: `workspace_id`, `thread_id`, `intent_category`, `has_pending`
- `slack_router_decision_total`
  - fields: `workspace_id`, `thread_id`, `intent_category`, `action`, `confidence`, `source`, `outcome`
- `slack_action_dispatch_total`
  - fields: `workspace_id`, `thread_id`, `action`, `intent_category`, `outcome`

### Endpoint execution quality

- `slack_endpoint_call_total`
  - fields: `workspace_id`, `thread_id`, `action`, `endpoint`, `outcome`, `duration_ms`, `status_code`, `error_type`

### Queue lag and worker health

- `slack_task_enqueue_total`
  - fields: `queue`, `func`, `outcome`, `mode`, `delay_seconds`, `job_id`, `error_type`
- `slack_queue_health_total`
  - fields: `queue`, `stage`, `outcome`, `queued`, `started`, `failed`, `deferred`, `scheduled`, `oldest_age_seconds`, `error_type`

### Proactive messaging

- `slack_proactive_checkpoint_total`
  - fields: `workspace_id`, `thread_id`, `outcome`, `reason`, `error_type`, `event_count`, `delivered`, `suppressed`
- `slack_proactive_delivery_total`
  - fields: `workspace_id`, `thread_id`, `event_type`, `outcome`, `reason`
- `slack_post_attempt_total`
  - fields: `outcome`, `attempt`, `max_attempts`, `channel_id`, `thread_ts`, `post_kind`, `error_type`
- `slack_reply_delivery_total`
  - fields: `workspace_id`, `thread_id`, `outcome`, `message_count`

## 2. Suggested Alerts

### Routing quality

- Alert when `slack_action_dispatch_total{outcome="unhandled"}` > 0 in a 15-minute window.
- Alert when `slack_router_decision_total{outcome="no_response"}` rises above baseline.

### Endpoint errors by action

- Alert when the error ratio for `slack_endpoint_call_total` exceeds 5% for any `action` over 15 minutes.
- Alert immediately if a single action has 3 or more consecutive endpoint errors.
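
The two endpoint rules above can be sketched as plain functions over a window of `slack_endpoint_call_total` events. This is a minimal sketch under stated assumptions: events are dicts already parsed from the logs and ordered by time, and an `outcome` value of `"error"` marks a failed call. The guide does not list the actual outcome values, so adjust that comparison to match what the worker emits.

```python
from collections import defaultdict

ERROR_RATIO_THRESHOLD = 0.05   # 5% over the evaluation window
CONSECUTIVE_ERROR_LIMIT = 3    # 3 or more errors in a row
ERROR_OUTCOME = "error"        # assumed outcome value, not confirmed by the guide


def actions_over_error_ratio(events, threshold=ERROR_RATIO_THRESHOLD):
    """Return {action: error_ratio} for actions whose ratio exceeds the threshold."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        totals[event["action"]] += 1
        if event["outcome"] == ERROR_OUTCOME:
            errors[event["action"]] += 1
    ratios = {action: errors[action] / totals[action] for action in totals}
    return {action: ratio for action, ratio in ratios.items() if ratio > threshold}


def actions_with_consecutive_errors(events, limit=CONSECUTIVE_ERROR_LIMIT):
    """Return the set of actions that hit `limit` errors in a row (events in time order)."""
    streak = defaultdict(int)
    flagged = set()
    for event in events:
        action = event["action"]
        if event["outcome"] == ERROR_OUTCOME:
            streak[action] += 1
            if streak[action] >= limit:
                flagged.add(action)
        else:
            # Any non-error outcome resets the streak for that action.
            streak[action] = 0
    return flagged
```

The caller is expected to pass only events from the last 15 minutes to `actions_over_error_ratio`; the consecutive-error check has no time window.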

### Queue lag and worker health

- Warning when `slack_queue_health_total` reports `oldest_age_seconds` > 120 for 10 minutes.
- Critical when `slack_queue_health_total` reports `oldest_age_seconds` > 300 for 5 minutes.
- Warning when the `failed` registry count increases across 3 consecutive samples.
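
One way to read the "for N minutes" conditions above is that every recent sample must exceed the threshold and those samples must span at least N minutes. The sketch below implements that reading. The function names and the `(timestamp_seconds, oldest_age_seconds)` sample shape are assumptions, and "increases across 3 samples" is interpreted as three strictly increasing values.

```python
def sustained_above(samples, threshold, duration_seconds):
    """True if the trailing run of samples above `threshold` spans `duration_seconds`.

    `samples` is a time-ordered list of (timestamp_seconds, oldest_age_seconds).
    """
    run_start = None
    for timestamp, value in reversed(samples):
        if value <= threshold:
            break  # the trailing run of breaching samples ends here
        run_start = timestamp
    if run_start is None:
        return False
    return samples[-1][0] - run_start >= duration_seconds


def queue_lag_severity(samples):
    """Map queue-age samples to 'critical', 'warning', or None."""
    if sustained_above(samples, threshold=300, duration_seconds=5 * 60):
        return "critical"
    if sustained_above(samples, threshold=120, duration_seconds=10 * 60):
        return "warning"
    return None


def failed_count_rising(failed_counts, samples_required=3):
    """True if the last `samples_required` failed-registry counts strictly increase."""
    recent = failed_counts[-samples_required:]
    if len(recent) < samples_required:
        return False
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

Checking the critical rule first ensures a queue that breaches both thresholds is reported at the higher severity.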

### Proactive volume and failures

- Warning when `slack_post_attempt_total{outcome="failure", post_kind="proactive"}` > 0 in 30 minutes.
- Warning when `slack_proactive_delivery_total{outcome="suppressed"}` spikes above baseline (possible metadata/rate-limit issue).

## 3. Release Gate Controls

Set these environment variables on the Slack connector process:

- `SLACK_RELEASE_GATES_ENABLED=1`
- `SLACK_RELEASE_CANARY_REQUIRED=1|0`
- `SLACK_CANARY_WORKSPACE_IDS=T123,T456`
- `SLACK_RELEASE_MIN_EVAL_SCORE=<float>`
- `SLACK_RELEASE_EVAL_SCORE=<float>`
- `SLACK_RELEASE_MAX_REGRESSION=<float>`
- `SLACK_RELEASE_REGRESSION_DELTA=<float>`
- `SLACK_RELEASE_FAIL_CLOSED=1|0`

Gate behavior:

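- If release gates are disabled, all workspaces pass.
- If `SLACK_RELEASE_CANARY_REQUIRED=1`, only workspaces listed in `SLACK_CANARY_WORKSPACE_IDS` pass.
- If `SLACK_RELEASE_MIN_EVAL_SCORE` is set, `SLACK_RELEASE_EVAL_SCORE` must be at or above it.
- If `SLACK_RELEASE_MAX_REGRESSION` is set, `SLACK_RELEASE_REGRESSION_DELTA` must be at or below it.
- If `SLACK_RELEASE_FAIL_CLOSED=1`, a missing eval score or regression delta blocks the release instead of passing.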

Ingress status when a gate blocks an event:

- `status` is `release_blocked`, with one of these `reason` values:
  - `canary_unapproved`
  - `canary_allowlist_empty`
  - `eval_below_minimum`
  - `regression_exceeded`
  - `missing_eval_score`
  - `missing_regression_delta`
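
The gate rules and block reasons above can be sketched as a single pure function over the environment. This is an illustrative reading, not the connector's actual code: the function name `evaluate_release_gate`, the order of the checks, treating an unparseable number as missing, and letting a missing value pass when `SLACK_RELEASE_FAIL_CLOSED` is not `1` are all assumptions.

```python
from typing import Mapping, Optional, Tuple


def _flag(env: Mapping[str, str], name: str) -> bool:
    return env.get(name, "").strip() == "1"


def _number(env: Mapping[str, str], name: str) -> Optional[float]:
    # Assumption: an empty or unparseable value counts as "missing".
    try:
        return float(env.get(name, "").strip())
    except ValueError:
        return None


def evaluate_release_gate(
    workspace_id: str, env: Mapping[str, str]
) -> Tuple[bool, Optional[str]]:
    """Return (allowed, reason); reason is None when the workspace passes."""
    if not _flag(env, "SLACK_RELEASE_GATES_ENABLED"):
        return True, None
    fail_closed = _flag(env, "SLACK_RELEASE_FAIL_CLOSED")

    if _flag(env, "SLACK_RELEASE_CANARY_REQUIRED"):
        raw_ids = env.get("SLACK_CANARY_WORKSPACE_IDS", "")
        allowlist = {item.strip() for item in raw_ids.split(",") if item.strip()}
        if not allowlist:
            return False, "canary_allowlist_empty"
        if workspace_id not in allowlist:
            return False, "canary_unapproved"

    min_score = _number(env, "SLACK_RELEASE_MIN_EVAL_SCORE")
    if min_score is not None:
        score = _number(env, "SLACK_RELEASE_EVAL_SCORE")
        if score is None:
            if fail_closed:
                return False, "missing_eval_score"
        elif score < min_score:
            return False, "eval_below_minimum"

    max_regression = _number(env, "SLACK_RELEASE_MAX_REGRESSION")
    if max_regression is not None:
        delta = _number(env, "SLACK_RELEASE_REGRESSION_DELTA")
        if delta is None:
            if fail_closed:
                return False, "missing_regression_delta"
        elif delta > max_regression:
            return False, "regression_exceeded"

    return True, None
```

In a real process the function would be called with `os.environ`; passing a plain mapping keeps it easy to unit test each reason code.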
