Observability
Overview
The API ships traces, metrics, and logs to Azure Monitor (Application
Insights + Log Analytics) in non-dev environments via
UseAzureMonitor() in Program.cs.
In Development, OTel is wired up but the Azure exporter is off (set
Telemetry:ConsoleMetrics to true to dump metrics to stdout instead).
This page lists the custom signals — the bits beyond what
Azure.Monitor.OpenTelemetry.AspNetCore gives you for free — and how
to query them.
Custom signals
All defined in FormationTelemetry.cs.
Activity sources
| Source | What it spans |
|---|---|
| Formation.Handlers | LiteBus command + event handlers (one span per call) |
| Formation.Search | Full-text search operations |
| Formation.Browser | Browser-emitted spans propagated up via headers |
Histograms (meter Formation.API)
| Metric | Unit | Tags | Source |
|---|---|---|---|
| formation.handler.duration | ms | command | InstrumentedCommandMediator |
| formation.event.duration | ms | event | InstrumentedEventMediator |
| formation.search.duration | ms | entity | FormationTelemetry.SearchScope |
| formation.efcommand.duration | ms | command_type, success | SlowQueryLoggingInterceptor |
Counters (meter Formation.API)
| Metric | Unit | Tags | Source |
|---|---|---|---|
| formation.event.failures | {failure} | event | InstrumentedEventMediator (catch path) |
formation.event.failures is the signal that drift exists between the source-of-truth tables and the denormalised [query].*List views. See the CQRS flow doc for the contract — every cascade-handler exception that the API swallows-and-records lands here, tagged with the event type. A small trickle is recoverable (the daily RebuildQueryViewsWorker reconstructs every list view), but a sustained spike means the cascade is broken and views may be stale until midnight.
The Bicep alert rule {resPrefix}-alrt-eventfailures-01 fires when this counter exceeds 10 in a 5-minute window.
Browser telemetry
The SvelteKit app proxies browser RUM through the API rather than shipping a browser SDK. The connection string stays out of the bundle, the firewall trusts a single API-origin destination, and the bundle stays ~80KB lighter than @microsoft/applicationinsights-web would make it.
| Endpoint | Span | Source | Tags |
|---|---|---|---|
| POST /telemetry/exception | browser.exception | Formation.Browser | exception.type, exception.message, browser.url, browser.component, error.severity, session.id (+ exception.stacktrace event) |
| POST /telemetry/pageview | browser.pageview | Formation.Browser | browser.path, browser.referrer_path, browser.navigation_type, session.id |
Both endpoints are anonymous, size-capped (16 KB / 4 KB), and emit a NoContent response. The browser side lives in src/services/web/src/lib/telemetry.ts — trackException is wired through ErrorHandler (so all ErrorBoundary and hooks.client.ts errors flow through it), trackPageView is wired from the root layout’s afterNavigate hook.
session.id is a per-tab UUID persisted in sessionStorage. Same id across all spans from one user session lets a support engineer pivot from a reported error to the full path the user took before it.
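A minimal sketch of how a per-tab id like this can be created once and then reused. The real code in telemetry.ts is not shown on this page, so the key name `formation.session.id` and the helper `getSessionId` are hypothetical. The storage and the id generator are passed in so the sketch also runs outside a browser.

```typescript
// Minimal subset of the Web Storage API that the sketch relies on.
// The browser's sessionStorage satisfies this shape structurally.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical key name; the key actually used by telemetry.ts is not documented here.
const SESSION_KEY = "formation.session.id";

// Returns the id already stored for this tab, or creates and stores a new one.
function getSessionId(store: KeyValueStore, newId: () => string): string {
  const existing = store.getItem(SESSION_KEY);
  if (existing !== null) {
    return existing;
  }
  const created = newId();
  store.setItem(SESSION_KEY, created);
  return created;
}

// In-memory stand-in for sessionStorage, for trying the sketch outside a browser.
class MemoryStore implements KeyValueStore {
  private items: Record<string, string> = Object.create(null);

  getItem(key: string): string | null {
    const value = this.items[key];
    return value === undefined ? null : value;
  }

  setItem(key: string, value: string): void {
    this.items[key] = value;
  }
}
```

In a browser the call would look like `getSessionId(sessionStorage, () => crypto.randomUUID())`. A new tab starts with an empty sessionStorage, so it gets a fresh id, while reloads within the same tab keep the existing one.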
Both spans are ActivityKind.Internal, so the Azure Monitor exporter routes them into AppDependencies / dependencies (not customEvents). The OperationName becomes Name; tags land in Properties (workspace) / customDimensions (classic).
Recent page-views.

    AppDependencies
    | where TimeGenerated > ago(1h)
    | where Name == "browser.pageview"
    | extend path = tostring(Properties["browser.path"]),
             referrer = tostring(Properties["browser.referrer_path"]),
             navType = tostring(Properties["browser.navigation_type"]),
             session = tostring(Properties["session.id"])
    | project TimeGenerated, path, referrer, navType, session
    | order by TimeGenerated desc

Reconstruct one user session, the support workflow this exists for. Take the session.id off a reported exception and replay every page-view + exception from that tab.

    let sid = "<paste session.id>";
    AppDependencies
    | where Name in ("browser.pageview", "browser.exception")
    | where tostring(Properties["session.id"]) == sid
    | project TimeGenerated, Name,
              path = tostring(Properties["browser.path"]),
              url = tostring(Properties["browser.url"]),
              msg = tostring(Properties["exception.message"]),
              comp = tostring(Properties["browser.component"])
    | order by TimeGenerated asc

Recent exceptions, grouped by component. First stop when triaging a frontend regression.

    AppDependencies
    | where TimeGenerated > ago(24h)
    | where Name == "browser.exception"
    | extend type = tostring(Properties["exception.type"]),
             msg = tostring(Properties["exception.message"]),
             comp = tostring(Properties["browser.component"])
    | summarize count = count(),
                sessions = dcount(tostring(Properties["session.id"])),
                sample = any(msg)
        by comp, type
    | order by count desc

Span events
| Event | Where | Attributes |
|---|---|---|
| ef.slow_command | Current Activity (any) | db.command_type, db.duration_ms, db.success, db.statement (4 KB max), optional db.param.* |
The ef.slow_command event fires whenever an EF Core DbCommand
exceeds Telemetry:SlowQueryThresholdMs (default 500). It hangs off
whatever Activity is current at the moment EF runs the command — so
you’ll see it nested inside the controller / command handler / event
handler that triggered it. If no Activity is current (e.g. background
work), the slow command is logged via ILogger instead, which still
ships to Log Analytics.
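The routing rule above can be sketched as a small decision function. The real logic lives in the C# SlowQueryLoggingInterceptor, which this page does not show, so the TypeScript below is only an illustration of the rule. The names `routeSlowCommand` and `CommandTiming` are hypothetical, and reading "exceeds" as strictly greater than the threshold is an assumption.

```typescript
// Where the extra slow-command detail ends up, per the rule described above.
type SlowCommandSink = "histogram-only" | "span-event" | "logger";

interface CommandTiming {
  durationMs: number;
  // Whether an Activity is current at the moment EF runs the command.
  hasCurrentActivity: boolean;
}

// Every command is recorded in the histogram; this decides what happens on top of that.
// Assumption: a command is "slow" only when it is strictly above the threshold.
function routeSlowCommand(timing: CommandTiming, thresholdMs: number = 500): SlowCommandSink {
  if (timing.durationMs <= thresholdMs) {
    return "histogram-only";
  }
  // Slow command: attach to the current span if there is one, otherwise fall back to the logger.
  return timing.hasCurrentActivity ? "span-event" : "logger";
}
```

For example, a 750 ms command inside a request handler maps to `"span-event"`, while the same command issued from background work with no current Activity maps to `"logger"`.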
Runtime metrics (OpenTelemetry.Instrumentation.Runtime)
Wired up via .AddRuntimeInstrumentation() in Program.cs.
Publishes the standard .NET runtime counters under dotnet.* —
note these were called process.runtime.dotnet.* in earlier
versions of the package; the 1.15.x line uses the shorter form
that ships in customMetrics. The most useful ones for chasing
latency stalls:
| Metric | Tags / dimensions | What it tells you |
|---|---|---|
| dotnet.gc.collections | gc.heap.generation | Per-generation collection rate (gen0 / gen1 / gen2). Spikes in gen2 line up with heap-pressure stalls. |
| dotnet.gc.last_collection.heap.size | gc.heap.generation | Heap size per generation, sampled. Sudden drop = a Gen 2 collection just happened. |
| dotnet.gc.pause.time | - | Cumulative seconds the runtime spent paused in GC. The sharp number for “is GC the bottleneck”. |
| dotnet.gc.heap.total_allocated | - | Lifetime allocation total (bytes). Rate-of-change under load is the allocation-pressure signal. |
| dotnet.thread_pool.thread.count | - | Worker-thread pool size. Slow climb under load = pool starvation symptoms. |
| dotnet.thread_pool.work_item.count | - | Queue depth in the thread pool. High here under load is the smoking gun for sync-over-async. |
| dotnet.monitor.lock_contentions | - | Contended lock / Monitor acquisitions. Spikes correlate with serialised hot paths. |
| dotnet.jit.compilation.time | - | Cumulative JIT time. Useful to rule in/out cold-start / tiered-recompilation stalls. |
We added these after the detail-page load test surfaced lockstep
latency stalls across all six list routes that didn’t show up in any
SQL or Kestrel metric — see
app.loadtests/BASELINE.md
for the investigation. Without runtime metrics the GC hypothesis was
unprovable.
Configuration
Both keys live under Telemetry: in appsettings.json. Override per
environment via the standard ASP.NET configuration chain.

    {
      "Telemetry": {
        "SlowQueryThresholdMs": 500,
        "LogSlowQueryParameters": false
      }
    }

- `SlowQueryThresholdMs`: int, ms. Below this threshold the command is recorded in the histogram only (cheap). Above, you also get a span event with the SQL.
- `LogSlowQueryParameters`: bool. Off by default. Flip on locally when chasing a specific slow query and you need parameter values to reproduce. Don’t enable in prod without thinking about parameter-value sensitivity.
Where to find the data
Where to query
The same data is reachable from two surfaces, with different table names:
- App Insights resource → Logs blade uses the classic schema: `customMetrics`, `traces`, `dependencies`, `customEvents`, `requests`, `exceptions`. Columns: `name`, `timestamp`, `customDimensions`.
- Log Analytics workspace → Logs blade uses the workspace-native schema: `AppMetrics`, `AppTraces`, `AppDependencies`, `AppEvents`, `AppRequests`, `AppExceptions`. Columns: `Name`, `TimeGenerated`, `Properties`.
The queries below use the workspace-native names because they
work on both surfaces (the AI Logs blade transparently accepts them
too). If you’re already comfortable with the classic schema, swap
each App* for the lowercase form and the Properties[...] /
Name / TimeGenerated columns for customDimensions[...] /
name / timestamp.
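The swap described above can be sketched as a lookup table plus a naive rewrite. This is only an illustration: `toClassicSchema` is a hypothetical helper, it covers only the tables and columns listed on this page, and it does not parse KQL, so it would also rewrite these words if they appeared inside a string literal. Other columns (for example DurationMs or OperationId) have their own classic names that this page does not list and need checking by hand.

```typescript
// Workspace-native name -> classic Application Insights name, as listed above.
const RENAMES: Record<string, string> = {
  // Tables
  AppMetrics: "customMetrics",
  AppTraces: "traces",
  AppDependencies: "dependencies",
  AppEvents: "customEvents",
  AppRequests: "requests",
  AppExceptions: "exceptions",
  // Columns
  Name: "name",
  TimeGenerated: "timestamp",
  Properties: "customDimensions",
};

// Naive identifier-by-identifier rewrite of a query string.
// Identifiers that are not in the table above are left untouched.
function toClassicSchema(query: string): string {
  return query.replace(/[A-Za-z_][A-Za-z0-9_]*/g, (word) => {
    const mapped = RENAMES[word];
    return typeof mapped === "string" ? mapped : word;
  });
}

const workspaceQuery =
  'AppDependencies | where TimeGenerated > ago(1h) | where Name == "browser.pageview" | project Properties["session.id"]';

const classicQuery = toClassicSchema(workspaceQuery);
// classicQuery:
// dependencies | where timestamp > ago(1h) | where name == "browser.pageview" | project customDimensions["session.id"]
```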
Useful queries
Find anything we just emitted: handy first thing after deploy.
Scope search to the App-Insights-derived tables; an unscoped
search scans every workspace table (including infra tables like
Heartbeat, Perf, AzureActivity) which don’t share the
Properties column and will fail the projection.

    search in (AppTraces, AppDependencies, AppEvents, AppRequests, AppExceptions) "ef.slow_command"
    | where TimeGenerated > ago(1h)
    | summarize count() by $table

Once you know which table is carrying the events, swap to a table-specific query (the EF-targeted ones below).
If formation.efcommand.duration rows are arriving but no
ef.slow_command events have fired, nothing’s been slow enough yet
under the default 500 ms threshold. To smoke-test the slow path, drop
the threshold via Telemetry:SlowQueryThresholdMs and redeploy.

    AppMetrics
    | where Name == "formation.efcommand.duration"
    | where TimeGenerated > ago(1h)
    | count

EF command duration by command type: averages and max from the
pre-aggregated metrics. For accurate p95/p99 use the App Insights
Metrics blade (chart mode, then change aggregation to p95/p99);
AppMetrics rows are pre-aggregated by interval so KQL percentile
over them is approximate.

    AppMetrics
    | where Name == "formation.efcommand.duration"
    | where TimeGenerated > ago(1h)
    | extend command_type = tostring(Properties.command_type),
             success = tobool(Properties.success)
    | summarize avgMs = avg(Sum / ItemCount),
                maxMs = max(Max),
                count = sum(ItemCount)
        by command_type, success
    | order by avgMs desc

Slow EF commands with SQL text: the ef.slow_command span event
is exported as a trace record by the Azure Monitor exporter. The
Properties bag carries db.statement, db.duration_ms, and
db.command_type.

    AppTraces
    | where TimeGenerated > ago(1h)
    | where Message has "ef.slow_command" or Properties has "ef.slow_command"
    | extend statement = tostring(Properties["db.statement"]),
             durationMs = todouble(Properties["db.duration_ms"]),
             commandType = tostring(Properties["db.command_type"])
    | project TimeGenerated, durationMs, commandType, statement, OperationId
    | order by durationMs desc

The OperationId joins back to the parent request; chain it with
AppRequests | where OperationId == "..." to see which endpoint the
slow SQL came from.
Slow EF commands logged via ILogger, the fallback path when no Activity is current (background work):

    AppTraces
    | where TimeGenerated > ago(1h)
    | where Message startswith "Slow EF command"
    | project TimeGenerated, Message, SeverityLevel

Top 10 slowest command handlers by avg duration:

    AppMetrics
    | where Name == "formation.handler.duration"
    | where TimeGenerated > ago(1h)
    | extend command = tostring(Properties.command)
    | summarize avgMs = avg(Sum / ItemCount), maxMs = max(Max), n = sum(ItemCount) by command
    | order by avgMs desc
    | take 10

HTTP route latency: already auto-instrumented by
Azure.Monitor.OpenTelemetry.AspNetCore, no custom code needed.
Prefer Properties["http.route"] (the route template, e.g.
/api/Schemes/{key}) and fall back to Name when the
property isn’t present:

    AppRequests
    | where TimeGenerated > ago(1h)
    | where Success == true
    | extend label = coalesce(tostring(Properties["http.route"]), Name)
    | summarize p50 = percentile(DurationMs, 50),
                p95 = percentile(DurationMs, 95),
                p99 = percentile(DurationMs, 99),
                count()
        by label
    | order by p95 desc
    | take 20

Drill from a slow request to its full SQL fan-out: the highest-leverage
single query for triage. Pick an OperationId from one of the queries above
(slow request, slow EF command, anything tied to a span), and pull every row
the request produced:

    let opId = "<OperationId>";
    union withsource=SourceTable AppRequests, AppDependencies, AppTraces, AppExceptions
    | where OperationId == opId
    | project TimeGenerated, SourceTable, Name, Message, DurationMs, Success
    | order by TimeGenerated asc

This turns “this query is slow in isolation” into “this route is slow because it issues this query and N others”, which is usually the actionable fix.
KQL pitfalls
A few things that bit us during real triage and aren’t obvious from the table schemas:
- `sql` is a reserved word in KQL. `extend sql = ...` parses as a syntax error. Use a different identifier (`statement`, `cmd`).
- In a `union` of tables, you need `withsource=<Col>` to capture the source table name. Bare `$table` doesn’t resolve in a project, and `itemType` (a column in the classic Application Insights schema) doesn’t exist on the workspace-native tables. Always `union withsource=SourceTable A, B, C | project SourceTable, …`. Avoid the obvious name `Source`: `AppDependencies` already has a column called that and the union will be rejected with “column named ‘Source’ already exists”.
- Unscoped `search` fans out across every workspace table. Infra tables like `Heartbeat` / `Perf` / `AzureActivity` don’t share the `Properties` column and will fail any subsequent projection that references it. Always scope: `search in (AppTraces, AppDependencies, AppEvents, AppRequests, AppExceptions) "<term>"`.
- `AppMetrics` rows are pre-aggregated by interval. KQL `percentile(Sum / ItemCount, 95)` over them is approximate. For exact percentiles, use the App Insights Metrics blade chart UI with the percentile aggregation.
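A small worked example of what pre-aggregation does to the arithmetic. The data is made up for illustration and the function names are hypothetical. It compares the `avg(Sum / ItemCount)` shape used in the queries on this page, which weights every interval equally, with `sum(Sum) / sum(ItemCount)`, which weights every sample equally. Percentiles over these rows are approximate for the same underlying reason: the individual samples are no longer available.

```typescript
// One pre-aggregated metric row: the total and the number of samples in an interval.
interface MetricRow {
  sum: number;
  itemCount: number;
}

// Equivalent of KQL avg(Sum / ItemCount): every interval counts equally.
function meanOfIntervalMeans(rows: MetricRow[]): number {
  const means = rows.map((r) => r.sum / r.itemCount);
  return means.reduce((acc, m) => acc + m, 0) / means.length;
}

// Equivalent of KQL sum(Sum) / sum(ItemCount): every sample counts equally.
function sampleWeightedMean(rows: MetricRow[]): number {
  const total = rows.reduce((acc, r) => acc + r.sum, 0);
  const count = rows.reduce((acc, r) => acc + r.itemCount, 0);
  return total / count;
}

// Made-up data: a quiet interval with a single 100 ms call,
// then a busy interval with 100 calls of 10 ms each.
const exampleRows: MetricRow[] = [
  { sum: 100, itemCount: 1 },
  { sum: 1000, itemCount: 100 },
];

const intervalMean = meanOfIntervalMeans(exampleRows); // (100 + 10) / 2 = 55
const weightedMean = sampleWeightedMean(exampleRows); // 1100 / 101, roughly 10.89
```

With uneven traffic the two numbers can differ by a wide margin, so it is worth knowing which one a given query reports before comparing it with a chart.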
Local development
Telemetry:ConsoleMetrics = true in appsettings.Development.json
prints OTel histograms to the console at the configured export
interval. Useful when iterating on a hot path without leaving the
machine.
Alert rules and cost budget
Provisioned in Bicep under infrastructure/modules/alerts.bicep and wired into core.bicep after the App Insights component. Each alert is a Microsoft.Insights/scheduledQueryRules against the App Insights workspace.
| Rule (suffix) | Severity | Window | Signal and threshold | What it indicates |
|---|---|---|---|---|
| alrt-eventfailures-01 | 2 (Error) | 5 min | formation.event.failures total > 10 | Cascade pipeline broken; views drifting |
| alrt-api5xx-01 | 1 (Critical) | 5 min | 5xx % of all API requests > 5% (2 of 2) | Outage or broken downstream |
| alrt-apilatency-01 | 2 (Error) | 15 min | API request duration p95 > 5000 ms | Latency above the load-test BASELINE worst case |
| alrt-apiheartbeat-01 | 1 (Critical) | 10 min | Zero requests received | Routing / replica-startup failure |
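A sketch of how the 5xx rule in the table can be read, assuming "(2 of 2)" means the condition must hold in both of the last two evaluation windows. That reading, and the names `breaches` and `shouldAlert`, are assumptions; the actual rule definition lives in alerts.bicep and is not shown here.

```typescript
// Request counts for one evaluation window.
interface WindowCounts {
  total: number; // all API requests in the window
  serverErrors: number; // responses with a 5xx status
}

// True when the 5xx share of the window is strictly above the threshold percentage.
// Integer cross-multiplication avoids floating-point surprises at the boundary.
function breaches(w: WindowCounts, thresholdPct: number): boolean {
  if (w.total === 0) {
    return false; // no traffic, so no error rate to measure (assumption)
  }
  return w.serverErrors * 100 > thresholdPct * w.total;
}

// "m of n": look at the most recent n windows and alert when at least m of them breach.
function shouldAlert(windows: WindowCounts[], thresholdPct: number, m: number, n: number): boolean {
  const recent = windows.slice(-n);
  if (recent.length < n) {
    return false; // not enough history yet
  }
  return recent.filter((w) => breaches(w, thresholdPct)).length >= m;
}
```

Under this reading a single bad window does not page anyone: `shouldAlert([{ total: 100, serverErrors: 10 }, { total: 100, serverErrors: 2 }], 5, 2, 2)` is false because only one of the two windows is above 5%.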
Cost budgets — two layers
Cost monitoring is split between a manually-managed sub-wide budget and an opt-in per-RG budget:
- Subscription-wide `Formation` budget (£2300/month, 90% actual + 100% forecasted): manually configured in the Azure Portal, not in Bicep. Catches “the whole org is overspending.” Recipients are kept in sync with `alertRecipientEmails` by manual edit.
- Per-RG opt-in budget (`{resPrefix}-budget-01`): created by `alerts.bicep` only when `monthlyCostBudgetAmount > 0`. Catches “this single env is anomalous”, which is useful when one env starts running away from the others and the sub-wide signal is too coarse. Default thresholds: 80% actual, 100% forecasted.
Per-env defaults are based on observed spend over the previous 3 months:
| Env | Observed monthly spend | Default budget | 80% notification fires at |
|---|---|---|---|
| dev | ~£780 | £1000 | £800 |
| uat | ~£600 | £800 | £640 |
Update monthlyCostBudgetAmount in the env parameters file when steady-state spend changes; values shouldn’t be guessed.
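The arithmetic behind the table can be sketched as follows, using only the figures shown above. The function and type names are hypothetical. The headroom value shows how far observed spend sits below the 80% notification level, which is a quick sanity check when revising `monthlyCostBudgetAmount`.

```typescript
interface EnvBudget {
  env: string;
  observedMonthlySpend: number; // GBP, approximate figure from the table
  budget: number; // GBP, the monthlyCostBudgetAmount value
}

// Spend level at which the "actual" notification fires (80% by default).
function notificationAmount(budget: number, thresholdPct: number = 80): number {
  return (budget * thresholdPct) / 100;
}

// How far observed spend sits below the notification level.
function headroom(b: EnvBudget, thresholdPct: number = 80): number {
  return notificationAmount(b.budget, thresholdPct) - b.observedMonthlySpend;
}

const envBudgets: EnvBudget[] = [
  { env: "dev", observedMonthlySpend: 780, budget: 1000 },
  { env: "uat", observedMonthlySpend: 600, budget: 800 },
];

// dev: notification at 800, headroom 20. uat: notification at 640, headroom 40.
const budgetSummary = envBudgets.map((b) => ({
  env: b.env,
  notifyAt: notificationAmount(b.budget),
  headroom: headroom(b),
}));
```

On these figures dev sits only about £20 below its 80% notification, so a small rise in steady-state spend would trigger it.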
Adding / changing recipients
Two parameters in the env parameters file, each with its own audience:
- `applicationAlertEmails`: the on-call rota for production health (5xx, latency, idle API, cascade failures). Routed via an action group. Empty = no action group is created and alert rules fire silently into the Azure Monitor portal.
- `budgetAlertEmails`: the cost-budget audience (typically a wider list including finance / management). Wired straight to the budget resource’s `contactEmails` and doesn’t require an action group.
The split exists because the audiences differ: an API spike is an engineering page (a small group), whereas a budget breach is a financial decision (a broader group).
The sub-wide Formation budget is unmanaged in Bicep and its recipients have to be edited separately in the Azure Portal.
Tuning alert thresholds
The numbers in the table above are first-deploy defaults. They’re conservative for dev (low traffic, sporadic activity); prod should tighten them once steady-state behaviour is known. Edit alerts.bicep to change the inline thresholds; they’re not currently parameterised because tuning per-rule via Bicep would explode the parameters surface.
What’s not (yet) instrumented
- Per-OData-controller spans. The HTTP request span (auto-instrumented by `Azure.Monitor.OpenTelemetry.AspNetCore`) covers the route as a whole. The EF command spans break the SQL portion out. The C# post-processing time inside the controller (e.g. `LoadPolymorphicCollectionsAsync`) is currently the residual after subtracting EF time from the request span.
- `IQueryMediator` handlers. LiteBus is wired up to discover them, but the API doesn’t currently inject `IQueryMediator` anywhere; every read flows through OData controllers + EF directly. If query handlers ever start being used, mirror `InstrumentedCommandMediator` to wrap them.
- Web Vitals. The browser RUM proxy (see Browser telemetry above) covers exceptions and page-views; performance metrics (LCP, CLS, INP, TTFB) aren’t shipped yet.
- Container app restart-loop / job-failure alerts. Restart and job-execution data lands in Log Analytics, but the right query shapes are TBD — added when a real incident teaches us what to look for.