LocalSessionWorker idle keep-alive timeout emits tracing::error! for normal zombie-session reaping #817

@lazytiger

Description

Summary

In rmcp 1.4.0, when LocalSessionWorker hits its configured keep-alive timeout (default 5 min), it returns WorkerQuitReason::Fatal { error: LocalSessionWorkerError::KeepAliveTimeout(..), context: "poll next session event" }. The outer transport worker then logs this at tracing::error! level, even though the docstring explicitly frames this branch as normal zombie-session cleanup. That turns every expected idle reap into a red log line for downstream consumers.

Versions / context

  • rmcp 1.4.0 (from crates.io)
  • Feature set in use: ["server", "schemars", "transport-streamable-http-server"]
  • Typical deployment: a long-running server, where long-lived agents wake up, perform one MCP tool call, go idle, and create a fresh session next wakeup — so idle reaping fires on every natural agent cycle, not just on error.

Where it comes from

  1. LocalSessionWorker::run has the keep-alive branch:

    // src/transport/streamable_http_server/session/local.rs (≈ lines 952–971)
    let keep_alive = self.session_config.keep_alive.unwrap_or(Duration::MAX);
    loop {
        let keep_alive_timeout = tokio::time::sleep(keep_alive);
        let event = tokio::select! {
            // …other arms…
            _ = keep_alive_timeout => {
                return Err(WorkerQuitReason::fatal(
                    LocalSessionWorkerError::KeepAliveTimeout(keep_alive),
                    "poll next session event",
                ))
            }
        };
        // …
    }
  2. The generic worker transport logs all Fatal exits at ERROR:

    // src/transport/worker.rs (≈ lines 122–134)
    .inspect_err(|e| match e {
        WorkerQuitReason::Cancelled
        | WorkerQuitReason::TransportClosed
        | WorkerQuitReason::HandlerTerminated => {
            tracing::debug!("worker quit with reason: {:?}", e);
        }
        WorkerQuitReason::Join(e) => {
            tracing::error!("worker quit with join error: {:?}", e);
        }
        WorkerQuitReason::Fatal { error, context } => {
            tracing::error!("worker quit with fatal: {error}, when {context}");
        }
    })
  3. The SessionConfig::keep_alive doc itself describes this as a safety net for silently-dropped HTTP connections, not an error condition:

    // src/transport/streamable_http_server/session/local.rs (≈ lines 1060–1089)
    /// The session will be closed after this duration of inactivity.
    ///
    /// This serves as a safety net for cleaning up sessions whose HTTP
    /// connections have silently dropped (e.g., due to an HTTP/2
    /// `RST_STREAM`). Without a timeout, such sessions become zombies […]
    ///
    /// Defaults to 5 minutes. […]
    pub keep_alive: Option<Duration>,

So the documented-as-normal path produces an ERROR log on every idle reap.
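For intuition, the keep-alive branch can be mimicked with only the standard library, replacing the tokio::select! arm with a blocking channel receive that times out. This is an illustrative sketch, not rmcp's actual code; QuitReason and run_worker are invented names standing in for WorkerQuitReason and LocalSessionWorker::run:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum QuitReason {
    IdleTimeout(Duration),
    Closed,
}

// Worker loop: wait for the next session event; if nothing arrives
// within `keep_alive`, exit with an idle-timeout reason (the analogue
// of the keep_alive_timeout select arm above).
fn run_worker(events: mpsc::Receiver<&'static str>, keep_alive: Duration) -> QuitReason {
    loop {
        match events.recv_timeout(keep_alive) {
            Ok(_event) => continue, // activity resets the timer
            Err(mpsc::RecvTimeoutError::Timeout) => {
                return QuitReason::IdleTimeout(keep_alive);
            }
            Err(mpsc::RecvTimeoutError::Disconnected) => return QuitReason::Closed,
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || run_worker(rx, Duration::from_millis(50)));
    tx.send("tool call").unwrap(); // one event, then the agent goes idle
    let reason = handle.join().unwrap(); // idle reap fires after 50ms of silence
    println!("{:?}", reason);
}
```

This also makes the bug's shape obvious: the timeout exit is a perfectly ordinary return path, indistinguishable in structure from a clean close.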

Concrete symptom

Every ~5 minutes of agent idleness, downstream logs show:

ERROR rmcp::transport::worker: worker quit with fatal: keep alive timeout after 300000ms, when poll next session event

This is indistinguishable, at the log level, from a real transport fatal (e.g. peer reset mid-stream). Operators who wire ERROR-level alerts get paged on something that is explicitly normal by design.

Why this matters

  • Mixes normal lifecycle events with real transport failures in a single severity bucket.
  • Forces every consumer to add target-level filters to silence idle reaps, but any filter coarse enough to drop this ERROR record (e.g. rmcp::transport::worker=off) also drops real fatals from the same target. No easy downstream fix preserves the original signal.
  • Observability best practice: ERROR should be reserved for conditions requiring human attention. Scheduled idle cleanup of a zombie-prevention timer does not qualify.
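For reference, the only blanket workaround available today looks like the directive below (assuming tracing-subscriber's EnvFilter syntax, with the target name taken from the log line above):

```shell
# Drops ALL records from the worker target, including genuine fatal
# transport errors -- exactly the signal loss described above.
RUST_LOG="info,rmcp::transport::worker=off"
```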

Proposal

Split the idle-reap exit from true transport fatals so the logger can tell them apart. Two options, in ascending invasiveness:

Option A — add a dedicated WorkerQuitReason variant (preferred).
WorkerQuitReason is already #[non_exhaustive], so this is additive. Add e.g.:

pub enum WorkerQuitReason<E> {
    // existing variants…
    #[error("worker idle timeout after {}ms", .0.as_millis())]
    IdleTimeout(Duration),
}

Have LocalSessionWorker::run return WorkerQuitReason::IdleTimeout(keep_alive) instead of Fatal { KeepAliveTimeout }, and in worker.rs log it at tracing::debug! (or tracing::info!) alongside the other expected-exit arms. True transport failures continue to flow through Fatal and keep their ERROR severity.

Option B — widen the Fatal arm's logger with a severity hint.
Keep Fatal as-is but add a method on the inner error (e.g. is_expected_shutdown() -> bool on the Worker::Error trait, defaulting to false) that LocalSessionWorkerError::KeepAliveTimeout overrides to true. The inspect_err arm downgrades to debug! when is_expected_shutdown() returns true. This avoids expanding the enum but couples severity to error identity, which is harder to reason about downstream.

Either way, the invariant should become: ERROR-level log ⇒ operator intervention likely needed. Idle reap of a zombie-prevention timer should not meet that bar.

Happy to follow up with a PR

If the maintainers prefer Option A, I can open a small PR (≤ 50 lines) that:

  • adds the new WorkerQuitReason::IdleTimeout(Duration) variant,
  • returns it from the keep_alive_timeout arm in LocalSessionWorker::run,
  • logs it at tracing::debug! in WorkerTransport::spawn_with_ct,
  • adds a regression test asserting that no ERROR-level records are emitted on idle reap (via a tracing_subscriber::Registry plus a test collector layer).
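The essence of that regression test, sketched with a hand-rolled in-memory recorder instead of a real tracing_subscriber layer (Recorder and log_idle_reap are invented names):

```rust
#[derive(Debug, PartialEq)]
enum Level {
    Debug,
    Error,
}

// Minimal stand-in for a test collector layer: records (level, message).
struct Recorder(Vec<(Level, String)>);

impl Recorder {
    fn log(&mut self, level: Level, msg: &str) {
        self.0.push((level, msg.to_string()));
    }
}

// Stand-in for the worker's exit logging after an idle reap.
fn log_idle_reap(rec: &mut Recorder, keep_alive_ms: u64) {
    rec.log(
        Level::Debug, // the fix: idle reap logs at DEBUG, not ERROR
        &format!("worker quit: idle timeout after {keep_alive_ms}ms"),
    );
}

fn main() {
    let mut rec = Recorder(Vec::new());
    log_idle_reap(&mut rec, 300_000);
    // The regression assertion: no ERROR-level records on idle reap.
    assert!(rec.0.iter().all(|(lvl, _)| *lvl != Level::Error));
    println!("no ERROR records: ok");
}
```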

Let me know which direction you'd like and I'll send it.
