Observability


Audience

Outcome

After this page you should know what this surface is for, which source files own the behavior, which public route or adjacent page to use next, and which validation command to run before changing the claim.

Source Truth

  • Public route: observability
  • Source document: helm-ai-enterprise/docs/public/observability/observability.md
  • Public manifest: helm-ai-enterprise/docs/public-docs.manifest.json
  • Source inventory: helm-ai-enterprise/docs/source-inventory.manifest.json
  • Validation: corepack pnpm run docs:coverage, corepack pnpm run docs:truth, and npm run coverage:inventory from docs-platform

Do not expand this page with unsupported product, SDK, deployment, compliance, or integration claims unless the inventory manifest points to code, schemas, tests, examples, or an owner doc that proves the claim.

Troubleshooting

| Symptom | First check |
| --- | --- |
| A link or route is missing from the docs website | Check docs/public-docs.manifest.json, llms.txt, search, and the per-page Markdown export before changing navigation. |
| A claim is not backed by code or tests | Remove the claim or add the missing code, example, schema, or validation command before publishing. |

OBS-001: Structured Logging with Trace Correlation

HELM uses Go's log/slog for structured logging with OpenTelemetry trace ID correlation.

Configuration

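import (
    "context"
    "log/slog"

    "go.opentelemetry.io/otel/trace"
)

// TracedHandler wraps another slog.Handler and adds trace_id and span_id
// to every log entry whose context carries a valid span.
type TracedHandler struct {
    slog.Handler
}

// Handle copies the span identifiers from ctx onto the record, then delegates.
func (h *TracedHandler) Handle(ctx context.Context, r slog.Record) error {
    sc := trace.SpanFromContext(ctx).SpanContext()
    if sc.IsValid() {
        r.AddAttrs(
            slog.String("trace_id", sc.TraceID().String()),
            slog.String("span_id", sc.SpanID().String()),
        )
    }
    return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the derived handler. Without them,
// logger.With(...) would return the embedded handler and lose trace correlation.
func (h *TracedHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
    return &TracedHandler{Handler: h.Handler.WithAttrs(attrs)}
}

func (h *TracedHandler) WithGroup(name string) slog.Handler {
    return &TracedHandler{Handler: h.Handler.WithGroup(name)}
}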

Log Fields Convention

| Field | Type | Description |
| --- | --- | --- |
| trace_id | string | OpenTelemetry trace ID (auto-injected) |
| span_id | string | Span ID within trace |
| tenant_id | string | Tenant isolation boundary |
| decision_id | string | Guardian decision identifier |
| receipt_id | string | Execution receipt reference |
| tool_id | string | Tool being executed |
| latency_ms | float64 | Operation latency in milliseconds |
| error | string | Error message if applicable |

OBS-002: Grafana / Prometheus Overview Templates

Prometheus Metrics (exposed at :9090/metrics)

# Guardian decision metrics
helm_guardian_decisions_total{verdict="ALLOW|DENY|ESCALATE"}
helm_guardian_decision_duration_seconds{quantile="0.5|0.9|0.99"}

# Executor metrics
helm_executor_tool_calls_total{tool_id, status="ok|error"}
helm_executor_tool_duration_seconds{tool_id}

# ProofGraph metrics
helm_proofgraph_nodes_total{type="DECISION|EXECUTION|EVIDENCE"}
helm_proofgraph_chain_length

# Budget metrics
helm_budget_remaining{tenant_id}
helm_budget_consumed_total{tenant_id, tool_id}

# Evidence metrics
helm_evidence_packs_exported_total
helm_evidence_verification_total{result="pass|fail"}
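
To make the listing concrete, here is a hand-rolled, standard-library-only sketch of how a verdict-labelled counter such as helm_guardian_decisions_total could be rendered in the Prometheus text exposition format. This is an illustration, not HELM's implementation: the decisionCounter type and the scrape helper are hypothetical, and a real service would more likely use a Prometheus client library and serve the handler on the :9090/metrics endpoint named above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
	"sync"
)

// decisionCounter is a hypothetical counter keyed by Guardian verdict.
type decisionCounter struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newDecisionCounter() *decisionCounter {
	return &decisionCounter{counts: make(map[string]uint64)}
}

// Inc records one decision with the given verdict (ALLOW, DENY, or ESCALATE).
func (c *decisionCounter) Inc(verdict string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[verdict]++
}

// ServeHTTP renders one sample line per verdict in the text exposition format.
func (c *decisionCounter) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	c.mu.Lock()
	verdicts := make([]string, 0, len(c.counts))
	for v := range c.counts {
		verdicts = append(verdicts, v)
	}
	sort.Strings(verdicts) // stable output order

	var b strings.Builder
	b.WriteString("# TYPE helm_guardian_decisions_total counter\n")
	for _, v := range verdicts {
		// %q is sufficient quoting for plain ASCII label values like these.
		fmt.Fprintf(&b, "helm_guardian_decisions_total{verdict=%q} %d\n", v, c.counts[v])
	}
	c.mu.Unlock()

	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	_, _ = w.Write([]byte(b.String()))
}

// scrape issues an in-memory GET against the handler, standing in for a Prometheus scrape.
func scrape(h http.Handler) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Body.String()
}

func main() {
	c := newDecisionCounter()
	c.Inc("ALLOW")
	c.Inc("ALLOW")
	c.Inc("DENY")
	fmt.Print(scrape(c))
}
```

Each distinct label value becomes its own time series, which is why the listing writes verdict="ALLOW|DENY|ESCALATE" as shorthand for three separate series rather than one literal label value.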

Grafana Overview JSON (import via Grafana UI)

Save as grafana-helm-overview.json and import:

{
  "overview": {
    "title": "HELM Kernel Overview",
    "panels": [
      {
        "title": "Decision Throughput",
        "type": "timeseries",
        "targets": [{"expr": "rate(helm_guardian_decisions_total[5m])"}]
      },
      {
        "title": "Decision Latency (p99)",
        "type": "gauge",
        "targets": [{"expr": "helm_guardian_decision_duration_seconds{quantile='0.99'}"}]
      },
      {
        "title": "Tool Call Rate by Tool",
        "type": "timeseries",
        "targets": [{"expr": "sum by (tool_id) (rate(helm_executor_tool_calls_total[5m]))"}]
      },
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "targets": [{"expr": "helm_budget_remaining"}]
      },
      {
        "title": "ProofGraph Growth",
        "type": "timeseries",
        "targets": [{"expr": "helm_proofgraph_nodes_total"}]
      },
      {
        "title": "Evidence Verification Success Rate",
        "type": "stat",
        "targets": [{"expr": "sum(rate(helm_evidence_verification_total{result='pass'}[5m])) / sum(rate(helm_evidence_verification_total[5m]))"}]
      }
    ]
  }
}
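
The Evidence Verification Success Rate panel is meant to show the share of verifications that passed in a window, that is, passes divided by passes plus failures. A minimal sketch of that arithmetic, using a hypothetical successRate helper that is not part of HELM, and treating an empty window as undefined rather than as 0% or 100%:

```go
package main

import "fmt"

// successRate returns passed / (passed + failed) for one time window.
// ok is false when there were no verifications, where the ratio is undefined.
func successRate(passed, failed float64) (rate float64, ok bool) {
	total := passed + failed
	if total == 0 {
		return 0, false
	}
	return passed / total, true
}

func main() {
	if r, ok := successRate(3, 1); ok {
		fmt.Printf("success rate: %.2f\n", r)
	}
	if _, ok := successRate(0, 0); !ok {
		fmt.Println("no verifications in window")
	}
}
```

The denominator has to aggregate across both result values; dividing the pass series by a selector that still carries the result label would only compare the pass series with itself.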

Prometheus Alerting Rules (routed by Alertmanager)

groups:
  - name: helm-kernel
    rules:
      - alert: GuardianLatencyHigh
        expr: helm_guardian_decision_duration_seconds{quantile="0.99"} > 0.005
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Guardian p99 latency exceeds 5ms SLA"

      - alert: BudgetExhausted
        expr: helm_budget_remaining < 20
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Error budget below 20% — builder/promotion gates activated"

      - alert: EvidenceVerificationFailure
        expr: rate(helm_evidence_verification_total{result="fail"}[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Evidence pack verification failures detected — potential tampering"

Diagram

flowchart TD
    subgraph Ingestion["1. Ingestion & Context Plane"]
        source["Observability Templates"]
        s0["OBS-001: Structured Logging with Trace Correlation"]
        s1["OBS-002: Grafana / Prometheus Overview Templates"]
        output["Reader outcome"]
    end

    %% Operational Flow Edges
    source --> s0
    s0 --> s1
    s1 --> output

    %% Premium Styling Rules

Operational Readiness

Use this page as the public operating layer for Observability. The source of truth is helm-ai-enterprise/docs/public/observability/observability.md; if this page and the implementation disagree, update the source-backed doc and rerun the validation command before publishing.

Before relying on this surface, confirm three things: the source path above still exists, the referenced commands or contracts are still present in the owning repo, and the docs-platform export surfaces still show this page in search, Markdown, llms-full.txt, and MCP without exposing protected routes.

Validation command: corepack pnpm run docs:coverage && corepack pnpm run docs:truth. For website parity, also run npm run exports:boundary and npm run thin-pages:check from docs-platform.

Expected Output

A reader should leave with a concrete next action, the source file or contract to inspect, the command that proves the claim, and a clear boundary for what is public versus protected. For reference pages, the expected output is a correctly scoped request, schema, command, or diagnostic path. For operations pages, the expected output is a reproducible readiness or failure signal that can be attached to an evaluation or support thread.

Failure Modes

If the validation command fails, do not patch this page in isolation. First identify whether the drift is in code, generated contracts, source-owner docs, or the docs manifest. If the public page needs a protected deep link, describe the protected document by name instead of exposing its route. Commercial operator details, tenant data, key ceremonies, and deployment-sensitive internals stay in protected customer or staff docs; this public page only exposes the safe developer contract.