Observability

This essay answers one question: how does NetScript make a distributed, multi-process application observable, so that one logical operation reads as one story even though it crosses HTTP, a queue, a saga, and a worker subprocess? The answer is a single idea applied everywhere — the trace context travels with the work — wired into the framework boundaries so you inherit it for free. Read this to build the mental model; to wire spans yourself, follow the how-to: add OpenTelemetry; for the headline API and ports, see the

Telemetry & logging hub; for exact exported symbols, see

telemetry .

A traceparent header entering at an HTTP request, propagating through the service RPC boundary, into a job dispatched onto the queue, and injected into a worker subprocess — every span nesting under one trace id. — One traceparent, four processes, one trace. The framework carries the W3C trace context across each boundary so spans born in separate processes still nest under the same root.

The thesis: observability is a property of the boundary

Most backends bolt observability on after the fact. You write a handler, ship it, watch it misbehave, and then thread a logger, a metrics client, and a tracer through every call site — three libraries, three configuration surfaces, three ways to forget a field. The instrumentation drifts from the code because it was never part of the code's shape.

NetScript takes the same stance on observability that it takes on

contracts: a cross-cutting concern belongs to the boundary, so the framework owns it and hands you a typed seam. A service built with defineService(...) wires request logging, health endpoints, and OpenTelemetry context propagation in one call. A worker job is dispatched inside a span the framework opens for you. The intent is that the common signal — "this operation happened, here is its trace id, here is whether it was healthy" — is free and uniform, and the specific signal — a custom child span around an expensive step — is one typed call away.

The core insight: distributed trace propagation

This is the part worth understanding well, because it is where most home-grown observability breaks. In a NetScript app, a single user action — "place an order" — does not run in one process. An HTTP request hits a service; the service dispatches a job onto a queue; a worker picks the job up, often in a separate subprocess; that worker may emit a saga step that fans out further. Four processes, one intent. If each process started its own trace, your dashboard would show four disconnected fragments and you would correlate them by hand, by timestamp, badly.

The fix is the W3C trace context: a traceparent header (and an optional tracestate) that carries a trace id and the current span id. The rule is simple — whoever does the work carries the context forward, and opens their span as a child of the id they received. When every hop obeys that rule, the spans nest into one tree even though no two of them share a process.

NetScript enforces the rule at each boundary it owns:

The service boundary. When you serve a router with defineService(...).withRPC({ traceContext: true }), RPC handling on /api/rpc/* reads the incoming traceparent and continues it, so a downstream call is a child of the caller's span rather than an orphan. You write the business logic; the framework keeps the causal chain intact across the wire.

The queue / worker boundary. Before your handler runs, the dispatcher opens a real traceJobExecution span carrying job attributes, duration, and status. When that job runs out-of-process, the dispatcher serializes the active context and injects the traceparent / tracestate (and the parent's OTEL_* config) into the subprocess environment via createJobSubprocessEnv. The child Deno process reads them back on startup and continues the same trace — so a span born in a forked subprocess still nests under the HTTP request that triggered it.

The scheduler and SSE boundaries. Cron runs emit their own spans rooted at the scheduler tick, and SSE events can be linked back to the job execution that produced them via extractTraceContextFromRecord — so a streamed progress event points at its originating trace.

  HTTP request          service (RPC)          queue / dispatcher         worker subprocess
 ──────────────       ─────────────────       ────────────────────      ──────────────────
 traceparent     ->   continue context   ->   traceJobExecution    ->   createJobSubprocessEnv
 arrives at edge      open child span         opens job span            injects traceparent into
                      (withRPC traceContext)   (REAL OTel today)         env; child joins SAME trace
        │                    │                        │                         │
        └──────────── one trace id, carried forward at every hop ───────────────┘

Structured logging, correlated by trace

Logs in NetScript are structured records, not free-text console.log. A handler's ctx.logger emits JSON with a level, a message, and an attribute bag, and the logging middleware enriches each record with request metadata. Because logging runs inside the same active span as the work, a log line emitted during a job carries — or can be joined to — the operation's trace id. That correlation is the whole point: in the dashboard you select a slow trace, and the log lines emitted within its spans are right there beside it, not in a separate searchable haystack you cross-reference by timestamp.

This is also why the scaffold's progress(...) helper is limited: it logs via the worker pool and delegates to ctx.reportProgress, but it does not by itself emit a job.progress OpenTelemetry span event. For an OTel-visible progress event, call recordJobProgress from @netscript/telemetry/instrumentation directly. The exact logging surface lives in

logger .

Where the signal goes: OTLP to the Aspire dashboard

A span or a log is only useful once something collects it. In the default dev loop that collector is the Aspire dashboard, and the wire between your process and the dashboard is OTLP (the OpenTelemetry Protocol). The generated Aspire AppHost configures an OTLP receiver at http://localhost:4318 and a dashboard UI at http://localhost:18888; resources started under Aspire are handed the OTLP endpoint through environment variables, so they export telemetry without per-service configuration. This is why Aspire is step two of the dev flow, not an afterthought: cd aspire && aspire start brings the receiver up — along with Postgres and Redis — before the first handler runs, so the very first request has somewhere to land. Without it, handlers still execute; they simply export into the void.

The observability surface — endpoints, ports, and what each one carries
Name	Type	Description
`Aspire dashboard`	`http://localhost:18888`	The viewing surface. Resource graph, per-resource health, structured logs, and the distributed-trace view. Login token is printed by `aspire start`.
`OTLP receiver`	`http://localhost:4318`	Where processes export OpenTelemetry traces, logs, and metrics. Configured by the generated Aspire AppHost and handed to resources via env vars.
`Service trace context`	`traceparent`	Continued by defineService RPC handling on /api/rpc/* (withRPC traceContext) so a downstream call is a child span of its caller, not an orphan.
`Subprocess env`	`OTEL_* + traceparent`	Injected into a worker subprocess by createJobSubprocessEnv so an out-of-process job continues the SAME trace as the request that dispatched it.
`Workers health`	`GET :8091/health`	Liveness for the workers API. The cheapest signal: is this capability up?
`Auth health`	`GET :8094/health/live`	Liveness for the auth-api service (also exposes /health/ready).

Known gap: scaffold job-tools helpers

Two layers, stated precisely so you never over- or under-claim.

Real OTel vs. scaffold stubs

Framework / dispatcher layer = real OTel. traceJobExecution, scheduler spans, subprocess traceparent propagation, and task.execute spans are live. A user gets job traces in Aspire automatically, with zero handler code.

Scaffold createJobTools(ctx) tracing helpers = still no-op stubs. The trace.addEvent and trace.withChildSpan helpers the generated sample hands into your handler are placeholders — your callback runs and returns normally, but no extra span is exported. This is a known, tracked limitation with a fix planned (debt workers-scaffold-job-tools-noop), not a permanent design choice. The workaround today: call @netscript/telemetry/instrumentation helpers directly from your handler for custom spans — because the dispatcher already opened the parent job span, your child span nests under it automatically.

// services/orders/handlers/process-order.ts
import {
  defineJobHandler,
  createSuccessResult,
} from '@netscript/plugin-workers-core';
// For CUSTOM spans inside a handler, import the telemetry helpers directly.
// The dispatcher already opened the parent job span around this handler,
// so this child span nests under it automatically.
import { withChildSpan } from '@netscript/telemetry/instrumentation';

const handler = defineJobHandler(async (ctx) => {
  ctx.logger.info('processing order'); // structured log, joined to the trace

  // Bracket an expensive step in a REAL child span.
  const result = await withChildSpan('order.charge', async (span) => {
    span.setAttribute('order.amount', ctx.payload.amount);
    return { charged: true };
  });

  return createSuccessResult({ charged: result.charged });
});

export default Object.assign(handler, { id: 'process-order' as const });

The auth audit trail: structured, redacted, traced

Authentication is the one place where "just log everything" is actively dangerous — sign-in events carry subjects, tokens, and claims you must not persist in the clear. The auth audit surface shipped to solve exactly this: @netscript/plugin-auth-core/telemetry is a small, audit-safe instrumentation facade that the auth service composition root wires in with createAuthTelemetry. It does three things, all of them on purpose.

It traces auth operations as first-class spans. traceOperation brackets each auth operation — auth.signin, auth.callback, auth.signout, auth.session, auth.me — in a child span that joins the incoming request trace, so a failed sign-in is a node in the same trace as the request that triggered it, not a disconnected log line.

It emits standardized audit events. Each operation records an auth.audit.log span event plus breadcrumbs (auth.principal.resolved, auth.session.issued, auth.session.revoked) with a finite outcome vocabulary — success, unauthenticated, failed_bad_credentials, failed_session_expired, failed_provider_error, failed_callback_invalid — and a machine-readable error code (AUTH_INVALID_CREDENTIALS, AUTH_SESSION_EXPIRED, …). Outcomes are an enum, not prose, so the audit trail is queryable.

It redacts by construction. A raw subject never lands in the trace. hashSubject runs the subject through HMAC-SHA-256 with a deployment-owned salt (never derived from the subject), so the recorded auth.subject_hash is stable for correlation but not reversible. redactAuthPrincipal projects a principal down to its hash, scheme, scope/role counts, and claims with any token-bearing key (anything matching token, secret, credential, password, apikey, authorization, sessionid, …) stripped out entirely. The shape that reaches the dashboard is audit-safe by design, not by a downstream scrubbing pass you might forget.

This supersedes the older "auth diagnostics, not an audit trail" caveat: there is now a real, structured, redacted auth audit surface. What it is not is a tamper-evident, immutable ledger — the events ride the standard OTel/streams transport. Treat it as a strong, queryable audit trail for operational and security review, not as a compliance-grade write-once log. For how auth itself is shaped, see Auth model and the

Authentication hub.

Why this design, and what it costs

The trade-offs, because instrumentation-at-the-boundary is an opinion.

Telemetry follows the framework, not a side library. Because the trace context rides the service boundary and the dispatcher opens the job span, you cannot accidentally instrument half your code — the lifecycle signal is free. The cost is that you instrument NetScript's way: you reach for @netscript/telemetry and the catalog-pinned @opentelemetry/api (^1.9), not whatever tracer you used last job.
One viewing surface in dev, your choice in prod. The dashboard at :18888 is a developer convenience wired by Aspire. It is an OTLP receiver like any other; in production you point the same OTLP export at your own collector. The model does not lock you to Aspire — it locks you to the protocol.
Correlation depends on propagation working end to end. A dropped traceparent — a call that doesn't carry it forward — turns a child span into an orphan and breaks the single-story view. The framework propagates it across the service boundary and into worker subprocesses so you don't have to; custom fan-out (your own fetch, your own spawned process) is where you reintroduce the responsibility.
The scaffold helpers are a known, bounded gap. The job lifecycle, scheduler, subprocess, and task.execute spans are real — but the scaffolded createJobTools(ctx) trace helpers are no-op stubs (fix planned). Their shapes are stable, so code you write against them keeps working when the runtime fills them in. Until then, use @netscript/telemetry/instrumentation directly.

The OpenTelemetry dependency is pinned in the workspace catalog (@opentelemetry/api at ^1.9) and imported through the catalog, never de-catalogued, so every workspace member shares one telemetry API surface — which is itself why a traceparent minted in the service and read in a worker refers to the same span model.

Glossary

OpenTelemetry (OTel) — the vendor-neutral standard for traces, logs, and metrics that NetScript instruments against (@opentelemetry/api). See the glossary and the

Telemetry & logging hub.

OTLP — the OpenTelemetry Protocol; the wire your process uses to export telemetry. In dev, the Aspire AppHost receives it at http://localhost:4318.
Span — one node in a trace: a named, timed operation with attributes and events. A child span (withChildSpan) nests under its parent to form the causal tree.
traceparent — the W3C trace-context header that carries the trace id across a boundary so separate spans stitch into one trace; the framework injects it even into worker subprocesses.

Where to go next

Do it: the how-to: add OpenTelemetry walks adding a custom span, structured logs, and traceparent propagation against a running service.
Hub: the Telemetry & logging hub covers the headline API and the OTel-wired-into-boundaries model with the real endpoints; Authentication

covers the auth surface whose audit trail is described above.
Related: Services & contracts (the boundary that propagates trace context),

Durable sagas (the multi-step operations a single trace spans), and

Orchestration with Aspire (the dashboard and OTLP receiver that consume these signals).

Reference: exact exported symbols live in telemetry , and the logging surface in logger .