Observability

Overview

The API ships traces, metrics, and logs to Azure Monitor (Application Insights + Log Analytics) in non-dev environments via UseAzureMonitor() in Program.cs. In Development, OpenTelemetry is still wired up but the Azure exporter is off; set Telemetry:ConsoleMetrics to true to print metrics to the console instead.

This page lists the custom signals — the bits beyond what Azure.Monitor.OpenTelemetry.AspNetCore gives you for free — and how to query them.

Custom signals

All defined in FormationTelemetry.cs.

Activity sources

Source	What it spans
`Formation.Handlers`	LiteBus command + event handlers (one span per call)
`Formation.Search`	Full-text search operations
`Formation.Browser`	Browser-emitted spans propagated up via headers

Histograms (meter `Formation.API`)

Metric	Unit	Tags	Source
`formation.handler.duration`	ms	`command`	`InstrumentedCommandMediator`
`formation.event.duration`	ms	`event`	`InstrumentedEventMediator`
`formation.search.duration`	ms	`entity`	`FormationTelemetry.SearchScope`
`formation.efcommand.duration`	ms	`command_type`, `success`	`SlowQueryLoggingInterceptor`

Counters (meter `Formation.API`)

Metric	Unit	Tags	Source
`formation.event.failures`	`{failure}`	`event`	`InstrumentedEventMediator` (catch path)

formation.event.failures is the signal that drift exists between the source-of-truth tables and the denormalised [query].*List views. See the CQRS flow doc for the contract — every cascade-handler exception that the API swallows-and-records lands here, tagged with the event type. A small trickle is recoverable (the daily RebuildQueryViewsWorker reconstructs every list view), but a sustained spike means the cascade is broken and views may be stale until midnight.

The Bicep alert rule {resPrefix}-alrt-eventfailures-01 fires when this counter exceeds 10 in a 5-minute window.

Browser telemetry

The SvelteKit app proxies browser RUM through the API rather than shipping a browser SDK. The connection string stays out of the bundle, the firewall trusts a single API-origin destination, and the bundle stays ~80KB lighter than @microsoft/applicationinsights-web would make it.

Endpoint	Span	Source	Tags
`POST /telemetry/exception`	`browser.exception`	`Formation.Browser`	`exception.type`, `exception.message`, `browser.url`, `browser.component`, `error.severity`, `session.id` (+ `exception.stacktrace` event)
`POST /telemetry/pageview`	`browser.pageview`	`Formation.Browser`	`browser.path`, `browser.referrer_path`, `browser.navigation_type`, `session.id`

Both endpoints are anonymous, size-capped (16 KB / 4 KB), and return 204 No Content. The browser side lives in src/services/web/src/lib/telemetry.ts: trackException is wired through ErrorHandler (so all ErrorBoundary and hooks.client.ts errors flow through it), and trackPageView is called from the root layout’s afterNavigate hook.

session.id is a per-tab UUID persisted in sessionStorage. Because every span from that tab carries the same id, a support engineer can pivot from a reported error to the full path the user took before it.
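A minimal sketch of how a per-tab id like this can be obtained. It is not the contents of telemetry.ts: the function name getSessionId, the storage key, and the injected store and newId parameters are all hypothetical, chosen so the logic can run outside a browser.

```typescript
// Hypothetical sketch: not the actual contents of telemetry.ts.
// Minimal subset of the Web Storage API, so the function can run outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Assumed storage key; the real key name is not documented here.
const SESSION_KEY = "telemetry.session.id";

// Returns the id already stored for this tab, or creates and persists a new one.
function getSessionId(store: KeyValueStore, newId: () => string): string {
  const existing = store.getItem(SESSION_KEY);
  if (existing !== null && existing !== "") {
    return existing;
  }
  const created = newId();
  store.setItem(SESSION_KEY, created);
  return created;
}

// In the browser this would be called as:
//   getSessionId(sessionStorage, () => crypto.randomUUID())
```

Because sessionStorage is scoped to a single tab, a second tab gets its own id, which is what keeps the session replay query below limited to one tab's history.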

Both spans are ActivityKind.Internal, so the Azure Monitor exporter routes them into AppDependencies / dependencies (not customEvents). The OperationName becomes Name; tags land in Properties (workspace) / customDimensions (classic).

Recent page-views.

AppDependencies
| where TimeGenerated > ago(1h)
| where Name == "browser.pageview"
| extend path     = tostring(Properties["browser.path"]),
         referrer = tostring(Properties["browser.referrer_path"]),
         navType  = tostring(Properties["browser.navigation_type"]),
         session  = tostring(Properties["session.id"])
| project TimeGenerated, path, referrer, navType, session
| order by TimeGenerated desc

Reconstruct one user session — the support workflow this exists for. Take the session.id off a reported exception and replay every page-view + exception from that tab.

let sid = "<paste session.id>";
AppDependencies
| where Name in ("browser.pageview", "browser.exception")
| where tostring(Properties["session.id"]) == sid
| project TimeGenerated, Name,
          path = tostring(Properties["browser.path"]),
          url  = tostring(Properties["browser.url"]),
          msg  = tostring(Properties["exception.message"]),
          comp = tostring(Properties["browser.component"])
| order by TimeGenerated asc

Recent exceptions, grouped by component — first stop when triaging a frontend regression.

AppDependencies
| where TimeGenerated > ago(24h)
| where Name == "browser.exception"
| extend type    = tostring(Properties["exception.type"]),
         msg     = tostring(Properties["exception.message"]),
         comp    = tostring(Properties["browser.component"])
| summarize count = count(),
            sessions = dcount(tostring(Properties["session.id"])),
            sample = any(msg)
            by comp, type
| order by count desc

Span events

Event	Where	Attributes
`ef.slow_command`	Current Activity (any)	`db.command_type`, `db.duration_ms`, `db.success`, `db.statement` (4 KB max), optional `db.param.*`

The ef.slow_command event fires whenever an EF Core DbCommand exceeds Telemetry:SlowQueryThresholdMs (default 500). It hangs off whatever Activity is current at the moment EF runs the command — so you’ll see it nested inside the controller / command handler / event handler that triggered it. If no Activity is current (e.g. background work), the slow command is logged via ILogger instead, which still ships to Log Analytics.

Runtime metrics (`OpenTelemetry.Instrumentation.Runtime`)

Wired up via .AddRuntimeInstrumentation() in Program.cs. Publishes the standard .NET runtime counters under dotnet.*. Earlier versions of the package named these process.runtime.dotnet.*; the 1.15.x line uses the shorter form, and that is the name that appears in customMetrics. The most useful ones for chasing latency stalls:

Metric	Tags / dimensions	What it tells you
`dotnet.gc.collections`	`gc.heap.generation`	Per-generation collection rate (gen0 / gen1 / gen2). Spikes in gen2 line up with heap-pressure stalls.
`dotnet.gc.last_collection.heap.size`	`gc.heap.generation`	Heap size per generation, sampled. Sudden drop = a Gen 2 collection just happened.
`dotnet.gc.pause.time`	—	Cumulative seconds the runtime spent paused in GC. The sharp number for “is GC the bottleneck”.
`dotnet.gc.heap.total_allocated`	—	Lifetime allocation total (bytes). Rate-of-change under load is the allocation-pressure signal.
`dotnet.thread_pool.thread.count`	—	Worker-thread pool size. Slow climb under load = pool starvation symptoms.
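`dotnet.thread_pool.queue.length`	—	Queue depth in the thread pool. High here under load is the smoking gun for sync-over-async.
`dotnet.thread_pool.work_item.count`	—	Cumulative count of work items the pool has completed. Rate-of-change is pool throughput.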
`dotnet.monitor.lock_contentions`	—	Contended `lock` / `Monitor` acquisitions. Spikes correlate with serialised hot paths.
`dotnet.jit.compilation.time`	—	Cumulative JIT time. Useful to rule in/out cold-start / tiered-recompilation stalls.

We added these after the detail-page load test surfaced lockstep latency stalls across all six list routes that didn’t show up in any SQL or Kestrel metric — see app.loadtests/BASELINE.md for the investigation. Without runtime metrics the GC hypothesis was unprovable.

Configuration

Both slow-query keys live under the Telemetry: section in appsettings.json (the same section as ConsoleMetrics). Override per environment via the standard ASP.NET configuration chain.

{
  "Telemetry": {
    "SlowQueryThresholdMs": 500,
    "LogSlowQueryParameters": false
  }
}

SlowQueryThresholdMs: int, ms. Commands under this threshold are recorded in the histogram only (cheap). Commands that exceed it also get an ef.slow_command span event carrying the SQL.
LogSlowQueryParameters: bool. Off by default. Turn it on locally when chasing a specific slow query and you need parameter values to reproduce it. Don’t enable it in prod without first considering how sensitive the parameter values are.
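The interaction between the two keys and the Activity fallback described under Span events can be modelled as a small decision function. This is an illustrative model only: the real behaviour is implemented in C# on the API side and is not reproduced here. The names classifyCommand, TelemetryOptions, and CommandOutcome are hypothetical, and treating "exceeds" as strictly greater than the threshold is an assumption.

```typescript
// Illustrative model only; the real logic is C# and is not shown in this page.
interface TelemetryOptions {
  slowQueryThresholdMs: number;    // Telemetry:SlowQueryThresholdMs
  logSlowQueryParameters: boolean; // Telemetry:LogSlowQueryParameters
}

interface CommandOutcome {
  recordHistogram: boolean;   // formation.efcommand.duration
  addSpanEvent: boolean;      // ef.slow_command on the current Activity
  logViaLogger: boolean;      // ILogger fallback when no Activity is current
  includeParameters: boolean; // db.param.* attributes on the span event
}

function classifyCommand(
  durationMs: number,
  hasCurrentActivity: boolean,
  options: TelemetryOptions,
): CommandOutcome {
  // Assumption: "exceeds" means strictly greater than the threshold.
  const slow = durationMs > options.slowQueryThresholdMs;
  const addSpanEvent = slow && hasCurrentActivity;
  return {
    recordHistogram: true, // every command lands in the histogram
    addSpanEvent,
    logViaLogger: slow && !hasCurrentActivity,
    // Only the span event is documented as carrying db.param.*.
    includeParameters: addSpanEvent && options.logSlowQueryParameters,
  };
}

const defaults: TelemetryOptions = { slowQueryThresholdMs: 500, logSlowQueryParameters: false };
console.log(classifyCommand(120, true, defaults));  // histogram only
console.log(classifyCommand(750, true, defaults));  // histogram + span event
console.log(classifyCommand(750, false, defaults)); // histogram + logger fallback
```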

Where to find the data

Where to query

The same data is reachable from two surfaces, with different table names:

App Insights resource → Logs blade uses the classic schema: customMetrics, traces, dependencies, customEvents, requests, exceptions. Columns: name, timestamp, customDimensions.
Log Analytics workspace → Logs blade uses the workspace-native schema: AppMetrics, AppTraces, AppDependencies, AppEvents, AppRequests, AppExceptions. Columns: Name, TimeGenerated, Properties.

The queries below use the workspace-native names because they work on both surfaces (the AI Logs blade transparently accepts them too). If you’re already comfortable with the classic schema, swap each App* for the lowercase form and the Properties[...] / Name / TimeGenerated columns for customDimensions[...] / name / timestamp.
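The swap described above is mechanical, so it can be sketched as a small rewrite helper. toClassicSchema is a hypothetical function, not part of the repo, and it only covers the table and column names listed in this section; any other column whose name differs between the two schemas is left untouched and would need adding to the map.

```typescript
// Hypothetical helper: rewrites the workspace-native identifiers listed above
// into their classic Application Insights names.
const WORKSPACE_TO_CLASSIC: Record<string, string> = {
  AppMetrics: "customMetrics",
  AppTraces: "traces",
  AppDependencies: "dependencies",
  AppEvents: "customEvents",
  AppRequests: "requests",
  AppExceptions: "exceptions",
  Properties: "customDimensions",
  TimeGenerated: "timestamp",
  Name: "name",
};

function toClassicSchema(query: string): string {
  // Match whole identifiers only, so names such as SourceTable are left alone.
  const pattern = new RegExp(
    `\\b(${Object.keys(WORKSPACE_TO_CLASSIC).join("|")})\\b`,
    "g",
  );
  return query.replace(pattern, (match) => WORKSPACE_TO_CLASSIC[match] ?? match);
}

console.log(
  toClassicSchema('AppDependencies | where TimeGenerated > ago(1h) | where Name == "browser.pageview"'),
);
```

The helper is a plain text substitution, so a string literal that happens to contain one of the mapped words would be rewritten too; check the output before running it.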

Useful queries

Find anything we just emitted — handy first thing after deploy. Scope search to the App-Insights-derived tables; an unscoped search scans every workspace table (including infra tables like Heartbeat, Perf, AzureActivity) which don’t share the Properties column and will fail the projection.

search in (AppTraces, AppDependencies, AppEvents, AppRequests, AppExceptions) "ef.slow_command"
| where TimeGenerated > ago(1h)
| summarize count() by $table

Once you know which table is carrying the events, swap to a table-specific query (the EF-targeted ones below).

If formation.efcommand.duration rows are arriving but no ef.slow_command events have fired, nothing has been slow enough yet under the default 500 ms threshold. To smoke-test the slow path, lower the threshold via Telemetry:SlowQueryThresholdMs and redeploy. The count below confirms the histogram rows are arriving:

AppMetrics
| where Name == "formation.efcommand.duration"
| where TimeGenerated > ago(1h)
| count

EF command duration by command type — averages and max from the pre-aggregated metrics. For accurate p95/p99 use the App Insights Metrics blade (chart mode, then change aggregation to p95/p99) — AppMetrics rows are pre-aggregated by interval so KQL percentile over them is approximate.

AppMetrics
| where Name == "formation.efcommand.duration"
| where TimeGenerated > ago(1h)
| extend command_type = tostring(Properties.command_type),
         success     = tobool(Properties.success)
| summarize
    avgMs = sum(Sum) / sum(ItemCount),
    maxMs = max(Max),
    count = sum(ItemCount)
    by command_type, success
| order by avgMs desc

Slow EF commands with SQL text — the ef.slow_command span event is exported as a trace record by the Azure Monitor exporter. The Properties bag carries db.statement, db.duration_ms, and db.command_type.

AppTraces
| where TimeGenerated > ago(1h)
| where Message has "ef.slow_command"
   or Properties has "ef.slow_command"
| extend statement   = tostring(Properties["db.statement"]),
         durationMs  = todouble(Properties["db.duration_ms"]),
         commandType = tostring(Properties["db.command_type"])
| project TimeGenerated, durationMs, commandType, statement, OperationId
| order by durationMs desc

The OperationId joins back to the parent request — chain it with AppRequests | where OperationId == "..." to see which endpoint the slow SQL came from.

Slow EF commands logged via ILogger — the fallback path when no Activity is current (background work):

AppTraces
| where TimeGenerated > ago(1h)
| where Message startswith "Slow EF command"
| project TimeGenerated, Message, SeverityLevel

Top 10 slowest command handlers by avg duration:

AppMetrics
| where Name == "formation.handler.duration"
| where TimeGenerated > ago(1h)
| extend command = tostring(Properties.command)
| summarize avgMs = sum(Sum) / sum(ItemCount), maxMs = max(Max), n = sum(ItemCount) by command
| order by avgMs desc
| take 10

HTTP route latency — already auto-instrumented by Azure.Monitor.OpenTelemetry.AspNetCore, no custom code needed. Prefer Properties["http.route"] (the route template, e.g. /api/Schemes/{key}) and fall back to Name when the property isn’t present:

AppRequests
| where TimeGenerated > ago(1h)
| where Success == true
| extend label = coalesce(tostring(Properties["http.route"]), Name)
| summarize p50=percentile(DurationMs, 50), p95=percentile(DurationMs, 95), p99=percentile(DurationMs, 99), count() by label
| order by p95 desc
| take 20

Drill from a slow request to its full SQL fan-out — the highest-leverage single query for triage. Pick an OperationId from one of the queries above (slow request, slow EF command, anything tied to a span), and pull every row the request produced:

let opId = "<OperationId>";
union withsource=SourceTable AppRequests, AppDependencies, AppTraces, AppExceptions
| where OperationId == opId
| project TimeGenerated, SourceTable, Name, Message, DurationMs, Success
| order by TimeGenerated asc

This turns “this query is slow in isolation” into “this route is slow because it issues this query and N others” — usually the actionable fix.

KQL pitfalls

A few things that bit us during real triage and aren’t obvious from the table schemas:

sql is a reserved word in KQL. extend sql = ... parses as a syntax error. Use a different identifier (statement, cmd).
In a union of tables, you need withsource=<Col> to capture the source table name. Bare $table doesn’t resolve in a project, and itemType (a column in the classic Application Insights schema) doesn’t exist on the workspace-native tables. Always union withsource=SourceTable A, B, C | project SourceTable, …. Avoid the obvious name Source — AppDependencies already has a column called that and the union will be rejected with “column named ‘Source’ already exists”.
Unscoped search fans out across every workspace table — infra tables like Heartbeat / Perf / AzureActivity don’t share the Properties column and will fail any subsequent projection that references it. Always scope: search in (AppTraces, AppDependencies, AppEvents, AppRequests, AppExceptions) "<term>".
AppMetrics rows are pre-aggregated by interval. KQL percentile(Sum / ItemCount, 95) over them is approximate. For exact percentiles, use the App Insights Metrics blade chart UI with the percentile aggregation.
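To make the pre-aggregation point concrete, here is a small sketch with made-up numbers. Each AppMetrics row already folds many measurements into Sum, ItemCount, and Max, so an exact overall mean is still recoverable (total Sum over total ItemCount), while the mean of per-row means weights a one-sample interval the same as a busy one. The MetricRow type and both function names are hypothetical.

```typescript
// One pre-aggregated AppMetrics row: ItemCount measurements folded into Sum and Max.
interface MetricRow {
  Sum: number;
  ItemCount: number;
  Max: number;
}

// Exact overall mean: total of all measurements over total count.
function weightedMeanMs(rows: MetricRow[]): number {
  const total = rows.reduce((acc, r) => acc + r.Sum, 0);
  const count = rows.reduce((acc, r) => acc + r.ItemCount, 0);
  return count === 0 ? 0 : total / count;
}

// Mean of per-row means: every interval counts equally, however few samples it held.
function meanOfRowMeansMs(rows: MetricRow[]): number {
  const means = rows.filter((r) => r.ItemCount > 0).map((r) => r.Sum / r.ItemCount);
  return means.length === 0 ? 0 : means.reduce((a, b) => a + b, 0) / means.length;
}

// Made-up data for illustration only.
const rows: MetricRow[] = [
  { Sum: 1000, ItemCount: 100, Max: 40 }, // busy interval, 10 ms average
  { Sum: 900, ItemCount: 1, Max: 900 },   // quiet interval, one 900 ms outlier
];

console.log(weightedMeanMs(rows));   // 1900 / 101, roughly 18.8
console.log(meanOfRowMeansMs(rows)); // (10 + 900) / 2 = 455
```

Percentiles have no equivalent shortcut: the individual measurements are gone once the row is written, which is why the Metrics blade is the better tool for p95/p99.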

Local development

Telemetry:ConsoleMetrics = true in appsettings.Development.json prints OTel histograms to the console at the configured export interval. Useful when iterating on a hot path without leaving the machine.

Alert rules and cost budget

Provisioned in Bicep under infrastructure/modules/alerts.bicep and wired into core.bicep after the App Insights component. Each alert is a Microsoft.Insights/scheduledQueryRules against the App Insights workspace.

Rule (suffix)	Severity	Window	Threshold	What it indicates
`alrt-eventfailures-01`	2 (Error)	5 min	`formation.event.failures` total > 10	Cascade pipeline broken; views drifting
`alrt-api5xx-01`	1 (Critical)	5 min	5xx % of all API requests > 5% (2 of 2 evaluation periods)	Outage or broken downstream
`alrt-apilatency-01`	2 (Error)	15 min	API request `duration` p95 > 5000 ms	Latency above the load-test BASELINE worst case
`alrt-apiheartbeat-01`	1 (Critical)	10 min	Zero requests received	Routing / replica-startup failure

Cost budgets — two layers

Cost monitoring is split between a manually-managed sub-wide budget and an opt-in per-RG budget:

Subscription-wide Formation budget (£2300/month, 90% actual + 100% forecasted): manually configured in the Azure Portal, not in Bicep. Catches “the whole org is overspending.” Recipients are kept in sync with budgetAlertEmails by manual edit.
Per-RG opt-in budget ({resPrefix}-budget-01) — created by alerts.bicep only when monthlyCostBudgetAmount > 0. Catches “this single env is anomalous” — useful when one env starts running away from the others and the sub-wide signal is too coarse. Default thresholds: 80% actual, 100% forecasted.

Per-env defaults are based on observed spend over the previous 3 months:

Env	Observed monthly spend	Default budget	80% notification fires at
dev	~£780	£1000	£800
uat	~£600	£800	£640

Update monthlyCostBudgetAmount in the env parameters file when steady-state spend changes, and base the new value on observed spend rather than an estimate.
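The notification column in the table is simply budget times threshold, and the gap between that figure and observed spend is the headroom before the first email goes out. A small sketch of the arithmetic, using the figures from the table (observed spend is approximate there, so the headroom is too); the function names are hypothetical.

```typescript
// Spend at which a budget notification fires.
// Multiplying before dividing keeps whole-pound inputs exact.
function notificationAmount(budget: number, thresholdPercent: number): number {
  return (budget * thresholdPercent) / 100;
}

// Headroom between observed spend and the notification amount.
// A negative result means the notification would already have fired.
function headroom(observedSpend: number, budget: number, thresholdPercent: number): number {
  return notificationAmount(budget, thresholdPercent) - observedSpend;
}

// Figures from the per-env table above (observed spend is approximate).
const dev = { observed: 780, budget: 1000 };
const uat = { observed: 600, budget: 800 };

console.log(notificationAmount(dev.budget, 80), headroom(dev.observed, dev.budget, 80)); // 800 20
console.log(notificationAmount(uat.budget, 80), headroom(uat.observed, uat.budget, 80)); // 640 40
```

With these defaults dev sits about £20 under its 80% notification, so a modest rise in steady-state spend is enough to trigger it.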

Adding / changing recipients

Two parameters in the env parameters file, each with its own audience:

applicationAlertEmails — the on-call rota for production health (5xx, latency, idle API, cascade failures). Routed via an action group. Empty = no action group is created and alert rules fire silently into the Azure Monitor portal.
budgetAlertEmails — the cost-budget audience (typically a wider list including finance / management). Wired straight to the budget resource’s contactEmails and doesn’t require an action group.

The split exists because the audiences differ: an API spike is an engineering page (a small group), whereas a budget breach is a financial decision (a broader group).

The sub-wide Formation budget is unmanaged in Bicep and its recipients have to be edited separately in the Azure Portal.

Tuning alert thresholds

The numbers in the table above are first-deploy defaults. They’re conservative for dev (low traffic, sporadic activity); prod should tighten them once steady-state behaviour is known. To change a threshold, edit the inline value in alerts.bicep. The thresholds aren’t currently parameterised because exposing one parameter per rule would bloat the Bicep parameter surface.

What’s not (yet) instrumented

Per-OData-controller spans. The HTTP request span (auto-instrumented by Azure.Monitor.OpenTelemetry.AspNetCore) covers the route as a whole. The EF command spans break the SQL portion out. The C# post-processing time inside the controller (e.g. LoadPolymorphicCollectionsAsync) has no span of its own, so for now it can only be estimated as the residual after subtracting EF time from the request span.
IQueryMediator handlers. LiteBus is wired up to discover them, but the API doesn’t currently inject IQueryMediator anywhere — every read flows through OData controllers + EF directly. If query handlers ever start being used, mirror InstrumentedCommandMediator to wrap them.
Web Vitals. The browser RUM proxy (see Browser telemetry above) covers exceptions and page-views; performance metrics (LCP, CLS, INP, TTFB) aren’t shipped yet.
Container app restart-loop / job-failure alerts. Restart and job-execution data lands in Log Analytics, but the right query shapes are TBD; they will be added once a real incident teaches us what to look for.