Summary
In rmcp 1.4.0, when LocalSessionWorker hits its configured keep-alive timeout (default 5 min), it returns WorkerQuitReason::Fatal { error: LocalSessionWorkerError::KeepAliveTimeout(..), context: "poll next session event" }. The outer transport worker then logs this at tracing::error! level, even though the docstring explicitly frames this branch as normal zombie-session cleanup. As a result, every expected idle reap becomes a red log line for downstream consumers.
Versions / context
- rmcp 1.4.0 (from crates.io)
- Feature set in use: ["server", "schemars", "transport-streamable-http-server"]
- Typical deployment: a long-running server where long-lived agents wake up, perform one MCP tool call, go idle, and create a fresh session on the next wakeup, so idle reaping fires on every natural agent cycle, not just on error.
Where it comes from
- LocalSessionWorker::run has the keep-alive branch:
```rust
// src/transport/streamable_http_server/session/local.rs (≈ lines 952–971)
let keep_alive = self.session_config.keep_alive.unwrap_or(Duration::MAX);
loop {
    let keep_alive_timeout = tokio::time::sleep(keep_alive);
    let event = tokio::select! {
        // …other arms…
        _ = keep_alive_timeout => {
            return Err(WorkerQuitReason::fatal(
                LocalSessionWorkerError::KeepAliveTimeout(keep_alive),
                "poll next session event",
            ))
        }
    };
    // …
}
```
- The generic worker transport logs all Fatal exits at ERROR:
```rust
// src/transport/worker.rs (≈ lines 122–134)
.inspect_err(|e| match e {
    WorkerQuitReason::Cancelled
    | WorkerQuitReason::TransportClosed
    | WorkerQuitReason::HandlerTerminated => {
        tracing::debug!("worker quit with reason: {:?}", e);
    }
    WorkerQuitReason::Join(e) => {
        tracing::error!("worker quit with join error: {:?}", e);
    }
    WorkerQuitReason::Fatal { error, context } => {
        tracing::error!("worker quit with fatal: {error}, when {context}");
    }
})
```
- The SessionConfig::keep_alive doc itself describes this as a safety net for silently dropped HTTP connections, not an error condition:
```rust
// src/transport/streamable_http_server/session/local.rs (≈ lines 1060–1089)
/// The session will be closed after this duration of inactivity.
///
/// This serves as a safety net for cleaning up sessions whose HTTP
/// connections have silently dropped (e.g., due to an HTTP/2
/// `RST_STREAM`). Without a timeout, such sessions become zombies […]
///
/// Defaults to 5 minutes. […]
pub keep_alive: Option<Duration>,
```
So the documented-as-normal path produces an ERROR log on every idle reap.
Concrete symptom
Every ~5 minutes of agent idleness, downstream logs show:
```
ERROR rmcp::transport::worker: worker quit with fatal: keep alive timeout after 300000ms, when poll next session event
```
This is indistinguishable, at the log level, from a real transport fatal (e.g. peer reset mid-stream). Operators who wire ERROR-level alerts get paged on something that is explicitly normal by design.
Why this matters
- Mixes normal lifecycle events with real transport failures in a single severity bucket.
- Forces every consumer to add target-level filters (rmcp::transport::worker=warn) to silence idle reaps, which also silences real fatals. No easy downstream fix preserves the original signal.
- Observability best practice: ERROR should be reserved for conditions requiring human attention. Scheduled idle cleanup of a zombie-prevention timer does not qualify.
Proposal
Split the idle-reap exit from true transport fatals so the logger can tell them apart. Two options, in ascending invasiveness:
Option A — add a dedicated WorkerQuitReason variant (preferred).
WorkerQuitReason is already #[non_exhaustive], so this is additive. Add e.g.:
```rust
pub enum WorkerQuitReason<E> {
    // existing variants…
    #[error("worker idle timeout after {}ms", .0.as_millis())]
    IdleTimeout(Duration),
}
```
Have LocalSessionWorker::run return WorkerQuitReason::IdleTimeout(keep_alive) instead of Fatal { KeepAliveTimeout }, and in worker.rs log it at tracing::debug! (or tracing::info!) alongside the other expected-exit arms. True transport failures continue to flow through Fatal and keep their ERROR severity.
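Routed that way, the logging arm can stay a plain variant match. Here is a minimal sketch of the resulting severity mapping, with a mock enum standing in for the real WorkerQuitReason (all names and variants are illustrative, not rmcp's actual definitions):

```rust
use std::time::Duration;

// Illustrative model of Option A; not the actual rmcp enum.
#[derive(Debug)]
enum WorkerQuitReason {
    Cancelled,
    TransportClosed,
    HandlerTerminated,
    // Hypothetical new variant carrying the configured keep-alive window.
    IdleTimeout(Duration),
    Fatal { error: String, context: String },
}

// Map each exit reason to the severity the logging arm should use.
fn severity(reason: &WorkerQuitReason) -> &'static str {
    match reason {
        WorkerQuitReason::Cancelled
        | WorkerQuitReason::TransportClosed
        | WorkerQuitReason::HandlerTerminated
        // Idle reap joins the expected-exit arms instead of Fatal.
        | WorkerQuitReason::IdleTimeout(_) => "DEBUG",
        WorkerQuitReason::Fatal { .. } => "ERROR",
    }
}

fn main() {
    let idle = WorkerQuitReason::IdleTimeout(Duration::from_secs(300));
    let fatal = WorkerQuitReason::Fatal {
        error: "peer reset".into(),
        context: "read frame".into(),
    };
    println!("{} {}", severity(&idle), severity(&fatal)); // DEBUG ERROR
}
```

The point of the sketch: once idle reap is its own variant, the match itself documents which exits are lifecycle and which are failures, with no error-type introspection needed.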
Option B — widen the Fatal arm's logger with a severity hint.
Keep Fatal as-is but add a method on the inner error (e.g. is_expected_shutdown() -> bool on the Worker::Error trait, defaulting to false) that LocalSessionWorkerError::KeepAliveTimeout overrides to true. The inspect_err arm downgrades to debug! when is_expected_shutdown() returns true. This avoids expanding the enum but couples severity to error identity, which is harder to reason about downstream.
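A minimal sketch of that severity hint, with a mock trait and error enum standing in for the real rmcp types (all names here are assumptions, not the crate's API):

```rust
use std::time::Duration;

// Hypothetical extension of the worker error contract (Option B).
trait WorkerError {
    // Default: a fatal exit is unexpected and should be logged at ERROR.
    fn is_expected_shutdown(&self) -> bool {
        false
    }
}

// Illustrative stand-in for the session worker's error type.
#[derive(Debug)]
enum LocalSessionWorkerError {
    KeepAliveTimeout(Duration),
    TransportBroken,
}

impl WorkerError for LocalSessionWorkerError {
    fn is_expected_shutdown(&self) -> bool {
        // Only the idle-reap timeout opts into the quieter severity.
        matches!(self, LocalSessionWorkerError::KeepAliveTimeout(_))
    }
}

// The Fatal logging arm would branch on the hint instead of hard-coding ERROR.
fn severity<E: WorkerError>(error: &E) -> &'static str {
    if error.is_expected_shutdown() { "DEBUG" } else { "ERROR" }
}

fn main() {
    let reap = LocalSessionWorkerError::KeepAliveTimeout(Duration::from_secs(300));
    let broken = LocalSessionWorkerError::TransportBroken;
    println!("{} {}", severity(&reap), severity(&broken)); // DEBUG ERROR
}
```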
Either way, the invariant should become: ERROR-level log ⇒ operator intervention likely needed. Idle reap of a zombie-prevention timer should not meet that bar.
Happy to follow up with a PR
If the maintainers prefer Option A, I can open a small PR (≤ 50 lines) that:
- adds the new WorkerQuitReason::IdleTimeout(Duration) variant,
- returns it from the keep_alive_timeout arm in LocalSessionWorker::run,
- logs it at tracing::debug! in WorkerTransport::spawn_with_ct,
- adds a regression test asserting that no ERROR-level records are emitted on idle reap (via tracing_subscriber::Registry plus a test collector).
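The shape of that regression test could look roughly like this; a minimal sketch with a hand-rolled in-memory collector standing in for a real tracing-subscriber layer (all names here are illustrative, nothing is taken from rmcp):

```rust
use std::time::Duration;

// Tiny stand-in for log severity and a recording collector.
#[derive(Debug, PartialEq)]
enum Level { Debug, Error }

#[derive(Default)]
struct Collector {
    records: Vec<(Level, String)>,
}

impl Collector {
    fn log(&mut self, level: Level, msg: &str) {
        self.records.push((level, msg.to_string()));
    }
    fn error_count(&self) -> usize {
        self.records.iter().filter(|(l, _)| *l == Level::Error).count()
    }
}

// Simulated idle reap under Option A: the exit is recorded at DEBUG.
fn simulate_idle_reap(collector: &mut Collector, keep_alive: Duration) {
    collector.log(
        Level::Debug,
        &format!("worker quit: idle timeout after {}ms", keep_alive.as_millis()),
    );
}

fn main() {
    let mut collector = Collector::default();
    simulate_idle_reap(&mut collector, Duration::from_secs(300));
    // The regression assertion: idle reap emits no ERROR-level records.
    assert_eq!(collector.error_count(), 0);
}
```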
Let me know which direction you'd like and I'll send it.