---
title: "Observability"
canonical: "https://helm.docs.mindburn.org/observability"
source: "helm-ai-enterprise/docs/public/observability/observability.md"
edit: "https://github.com/Mindburn-Labs/helm-ai-enterprise/edit/main/docs/public/observability/observability.md"
section: "helm-ai-enterprise"
access: "public"
sensitivity: "public"
last_reviewed: "2026-02-21"
checksum_sha256: "sha256:3a31079cb38d048c638a859591a39ee97f040bb638c2b088551f6ea7f28a2076"
build_timestamp: "2026-05-24T13:40:27.882Z"
---
# Observability Templates

## Audience

## Outcome

After this page you should know what this surface is for, which source files own the behavior, which public route or adjacent page to use next, and which validation command to run before changing the claim.

## Source Truth

- Public route: `observability`
- Source document: `helm-ai-enterprise/docs/public/observability/observability.md`
- Public manifest: `helm-ai-enterprise/docs/public-docs.manifest.json`
- Source inventory: `helm-ai-enterprise/docs/source-inventory.manifest.json`
- Validation: `corepack pnpm run docs:coverage`, `corepack pnpm run docs:truth`, and `npm run coverage:inventory` from `docs-platform`

Do not expand this page with unsupported product, SDK, deployment, compliance, or integration claims unless the inventory manifest points to code, schemas, tests, examples, or an owner doc that proves the claim.

## Troubleshooting

| Symptom | First check |
| --- | --- |
| A link or route is missing from the docs website | Check `docs/public-docs.manifest.json`, `llms.txt`, search, and the per-page Markdown export before changing navigation. |
| A claim is not backed by code or tests | Remove the claim or add the missing code, example, schema, or validation command before publishing. |

## OBS-001: Structured Logging with Trace Correlation

HELM uses Go's `log/slog` for structured logging with OpenTelemetry trace ID correlation.

### Configuration

```go
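import (
    "context"
    "log/slog"

    "go.opentelemetry.io/otel/trace"
)

// TracedHandler wraps a slog.Handler and adds trace_id and span_id to every
// record whose context carries a valid OpenTelemetry span.
type TracedHandler struct {
    slog.Handler
}

// Handle injects the IDs of the active span, if any, then delegates.
func (h *TracedHandler) Handle(ctx context.Context, r slog.Record) error {
    sc := trace.SpanFromContext(ctx).SpanContext()
    if sc.IsValid() {
        r.AddAttrs(
            slog.String("trace_id", sc.TraceID().String()),
            slog.String("span_id", sc.SpanID().String()),
        )
    }
    return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the derived handler. Without them, loggers
// created via logger.With(...) or logger.WithGroup(...) would fall back to the
// embedded handler and lose trace correlation.
func (h *TracedHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
    return &TracedHandler{Handler: h.Handler.WithAttrs(attrs)}
}

func (h *TracedHandler) WithGroup(name string) slog.Handler {
    return &TracedHandler{Handler: h.Handler.WithGroup(name)}
}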
```

### Log Fields Convention

| Field | Type | Description |
|-------|------|-------------|
| `trace_id` | string | OpenTelemetry trace ID (auto-injected) |
| `span_id` | string | Span ID within trace |
| `tenant_id` | string | Tenant isolation boundary |
| `decision_id` | string | Guardian decision identifier |
| `receipt_id` | string | Execution receipt reference |
| `tool_id` | string | Tool being executed |
| `latency_ms` | float64 | Operation latency in milliseconds |
| `error` | string | Error message if applicable |
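
The field names in the table can be attached as ordinary `slog` attributes. The following is a minimal sketch using only the standard library. The helper `decisionAttrs`, the function `logDecision`, and all sample values are hypothetical and are not taken from the HELM codebase.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
	"time"
)

// decisionAttrs is a hypothetical helper that builds attributes using the
// field names from the convention table. It is not part of HELM.
func decisionAttrs(tenantID, decisionID, toolID string, latency time.Duration) []slog.Attr {
	return []slog.Attr{
		slog.String("tenant_id", tenantID),
		slog.String("decision_id", decisionID),
		slog.String("tool_id", toolID),
		// latency_ms is a float64 number of milliseconds.
		slog.Float64("latency_ms", float64(latency)/float64(time.Millisecond)),
	}
}

// logDecision writes one JSON log line into a buffer and returns the decoded
// fields so the resulting keys can be inspected.
func logDecision() (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	attrs := decisionAttrs("tenant-a", "dec-123", "tool-x", 1500*time.Microsecond)
	logger.LogAttrs(context.Background(), slog.LevelInfo, "decision evaluated", attrs...)

	var fields map[string]any
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := logDecision()
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["tenant_id"], fields["decision_id"], fields["latency_ms"])
}
```

Building the attributes in one helper keeps field names and types consistent across call sites; `trace_id` and `span_id` are left out because the handler injects them.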

---

## OBS-002: Grafana / Prometheus Overview Templates

### Prometheus Metrics (exposed at `:9090/metrics`)

```yaml
# Guardian decision metrics
helm_guardian_decisions_total{verdict="ALLOW|DENY|ESCALATE"}
helm_guardian_decision_duration_seconds{quantile="0.5|0.9|0.99"}

# Executor metrics
helm_executor_tool_calls_total{tool_id, status="ok|error"}
helm_executor_tool_duration_seconds{tool_id}

# ProofGraph metrics
helm_proofgraph_nodes_total{type="DECISION|EXECUTION|EVIDENCE"}
helm_proofgraph_chain_length

# Budget metrics
helm_budget_remaining{tenant_id}
helm_budget_consumed_total{tenant_id, tool_id}

# Evidence metrics
helm_evidence_packs_exported_total
helm_evidence_verification_total{result="pass|fail"}
```
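
To make the scrape format concrete, the sketch below serves one of the listed counters in the Prometheus text exposition format using only the Go standard library. It is an illustration, not HELM's implementation: the `verdictCounter` type and `scrape` helper are hypothetical, and a real service would more likely rely on a Prometheus client library than hand-written output.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"sync"
)

// verdictCounter is a hypothetical, simplified stand-in for a counter with a
// single "verdict" label.
type verdictCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newVerdictCounter() *verdictCounter {
	return &verdictCounter{counts: make(map[string]uint64)}
}

// Inc increments the counter for one verdict label value.
func (c *verdictCounter) Inc(verdict string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[verdict]++
}

// ServeHTTP writes the counter in the Prometheus text exposition format,
// with label values sorted so the output is deterministic.
func (c *verdictCounter) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	c.mu.Lock()
	defer c.mu.Unlock()
	verdicts := make([]string, 0, len(c.counts))
	for v := range c.counts {
		verdicts = append(verdicts, v)
	}
	sort.Strings(verdicts)

	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintln(w, "# TYPE helm_guardian_decisions_total counter")
	for _, v := range verdicts {
		fmt.Fprintf(w, "helm_guardian_decisions_total{verdict=%q} %d\n", v, c.counts[v])
	}
}

// scrape calls the handler in-process and returns the response body.
func scrape(c *verdictCounter) string {
	rec := httptest.NewRecorder()
	c.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Body.String()
}

func main() {
	c := newVerdictCounter()
	c.Inc("ALLOW")
	c.Inc("ALLOW")
	c.Inc("DENY")
	fmt.Print(scrape(c))
}
```

In a running process the same handler could be mounted with `http.Handle("/metrics", c)` and served on the port named in the heading above.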

### Grafana Overview JSON (import via Grafana UI)

Save as `grafana-helm-overview.json` and import:

```json
{
  "overview": {
    "title": "HELM Kernel Overview",
    "panels": [
      {
        "title": "Decision Throughput",
        "type": "timeseries",
        "targets": [{"expr": "rate(helm_guardian_decisions_total[5m])"}]
      },
      {
        "title": "Decision Latency (p99)",
        "type": "gauge",
        "targets": [{"expr": "helm_guardian_decision_duration_seconds{quantile='0.99'}"}]
      },
      {
        "title": "Tool Call Rate by Tool",
        "type": "timeseries",
        "targets": [{"expr": "sum by (tool_id) (rate(helm_executor_tool_calls_total[5m]))"}]
      },
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "targets": [{"expr": "helm_budget_remaining"}]
      },
      {
        "title": "ProofGraph Growth",
        "type": "timeseries",
        "targets": [{"expr": "helm_proofgraph_nodes_total"}]
      },
      {
        "title": "Evidence Verification Success Rate",
        "type": "stat",
        "targets": [{"expr": "sum(helm_evidence_verification_total{result='pass'}) / sum(helm_evidence_verification_total)"}]
      }
    ]
  }
}
```
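
Before importing, it can help to confirm that the file parses and that every panel carries a query. The sketch below assumes the template shape shown above (an `overview` object containing `panels`, each with `title`, `type`, and `targets[].expr`). The type names and `checkTemplate` are hypothetical helpers, not part of HELM or Grafana.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types mirror only the fields used by the template on this page.
type panelTarget struct {
	Expr string `json:"expr"`
}

type panel struct {
	Title   string        `json:"title"`
	Type    string        `json:"type"`
	Targets []panelTarget `json:"targets"`
}

type overviewTemplate struct {
	Overview struct {
		Title  string  `json:"title"`
		Panels []panel `json:"panels"`
	} `json:"overview"`
}

// checkTemplate parses the template and returns the number of panels checked.
// It fails on the first panel that lacks a title, a type, or a query expression.
func checkTemplate(data []byte) (int, error) {
	var t overviewTemplate
	if err := json.Unmarshal(data, &t); err != nil {
		return 0, fmt.Errorf("parse template: %w", err)
	}
	for i, p := range t.Overview.Panels {
		if p.Title == "" || p.Type == "" {
			return 0, fmt.Errorf("panel %d: missing title or type", i)
		}
		if len(p.Targets) == 0 {
			return 0, fmt.Errorf("panel %q: no targets", p.Title)
		}
		for _, tg := range p.Targets {
			if tg.Expr == "" {
				return 0, fmt.Errorf("panel %q: empty query expression", p.Title)
			}
		}
	}
	return len(t.Overview.Panels), nil
}

func main() {
	sample := []byte(`{"overview": {"title": "HELM Kernel Overview", "panels": [
		{"title": "ProofGraph Growth", "type": "timeseries",
		 "targets": [{"expr": "helm_proofgraph_nodes_total"}]}
	]}}`)
	n, err := checkTemplate(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("panels checked:", n)
}
```

This check covers structure only; it does not validate the PromQL expressions or confirm that the metrics exist.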

### Prometheus Alerting Rules (for Alertmanager)

```yaml
groups:
  - name: helm-kernel
    rules:
      - alert: GuardianLatencyHigh
        expr: helm_guardian_decision_duration_seconds{quantile="0.99"} > 0.005
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Guardian p99 latency exceeds 5ms SLA"

      - alert: BudgetExhausted
        expr: helm_budget_remaining < 20
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Error budget below 20% — builder/promotion gates activated"

      - alert: EvidenceVerificationFailure
        expr: rate(helm_evidence_verification_total{result="fail"}[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Evidence pack verification failures detected — potential tampering"
```

## Diagram

```mermaid
flowchart TD
    subgraph Ingestion["1. Ingestion & Context Plane"]
        source["Observability Templates"]
        s0["OBS-001: Structured Logging with Trace Correlation"]
        s1["OBS-002: Grafana / Prometheus Overview Templates"]
        output["Reader outcome"]
    end

    %% Operational Flow Edges
    source --> s0
    s0 --> s1
    s1 --> output

    %% Premium Styling Rules
```


## Operational Readiness

Use this page as the public operating layer for **Observability**. The source of truth is `helm-ai-enterprise/docs/public/observability/observability.md`; if this page and the implementation disagree, update the source-backed doc and rerun the validation command before publishing.

Before relying on this surface, confirm three things: the source path above still exists, the referenced commands or contracts are still present in the owning repo, and the docs-platform export surfaces still show this page in search, Markdown, `llms-full.txt`, and MCP without exposing protected routes.

Validation command: `corepack pnpm run docs:coverage && corepack pnpm run docs:truth`. For website parity, also run `npm run exports:boundary` and `npm run thin-pages:check` from `docs-platform`.

### Expected Output

A reader should leave with a concrete next action, the source file or contract to inspect, the command that proves the claim, and a clear boundary for what is public versus protected. For reference pages, the expected output is a correctly scoped request, schema, command, or diagnostic path. For operations pages, the expected output is a reproducible readiness or failure signal that can be attached to an evaluation or support thread.

### Failure Modes

If the validation command fails, do not patch this page in isolation. First identify whether the drift is in code, generated contracts, source-owner docs, or the docs manifest. If the public page needs a protected deep link, describe the protected document by name instead of exposing its route. Commercial operator details, tenant data, key ceremonies, and deployment-sensitive internals stay in protected customer or staff docs; this public page only exposes the safe developer contract.
