LocalSessionWorker idle keep-alive timeout emits tracing::error! for normal zombie-session reaping #817

@lazytiger

Description

Summary

In rmcp 1.4.0, when LocalSessionWorker hits its configured keep-alive timeout (default 5 min), it returns WorkerQuitReason::Fatal { error: LocalSessionWorkerError::KeepAliveTimeout(..), context: "poll next session event" }. The outer transport worker then logs this at tracing::error! level, even though the docstring explicitly frames this branch as normal zombie-session cleanup. That turns every expected idle reap into a red log line for downstream consumers.

Versions / context

  • rmcp 1.4.0 (from crates.io)
  • Feature set in use: ["server", "schemars", "transport-streamable-http-server"]
  • Typical deployment: a long-running server, where long-lived agents wake up, perform one MCP tool call, go idle, and create a fresh session next wakeup — so idle reaping fires on every natural agent cycle, not just on error.

Where it comes from

  1. LocalSessionWorker::run has the keep-alive branch:

    // src/transport/streamable_http_server/session/local.rs (≈ lines 952–971)
    let keep_alive = self.session_config.keep_alive.unwrap_or(Duration::MAX);
    loop {
        let keep_alive_timeout = tokio::time::sleep(keep_alive);
        let event = tokio::select! {
            // …other arms…
            _ = keep_alive_timeout => {
                return Err(WorkerQuitReason::fatal(
                    LocalSessionWorkerError::KeepAliveTimeout(keep_alive),
                    "poll next session event",
                ))
            }
        };
        // …
    }
  2. The generic worker transport logs all Fatal exits at ERROR:

    // src/transport/worker.rs (≈ lines 122–134)
    .inspect_err(|e| match e {
        WorkerQuitReason::Cancelled
        | WorkerQuitReason::TransportClosed
        | WorkerQuitReason::HandlerTerminated => {
            tracing::debug!("worker quit with reason: {:?}", e);
        }
        WorkerQuitReason::Join(e) => {
            tracing::error!("worker quit with join error: {:?}", e);
        }
        WorkerQuitReason::Fatal { error, context } => {
            tracing::error!("worker quit with fatal: {error}, when {context}");
        }
    })
  3. The SessionConfig::keep_alive doc itself describes this as a safety net for silently-dropped HTTP connections, not an error condition:

    // src/transport/streamable_http_server/session/local.rs (≈ lines 1060–1089)
    /// The session will be closed after this duration of inactivity.
    ///
    /// This serves as a safety net for cleaning up sessions whose HTTP
    /// connections have silently dropped (e.g., due to an HTTP/2
    /// `RST_STREAM`). Without a timeout, such sessions become zombies […]
    ///
    /// Defaults to 5 minutes. […]
    pub keep_alive: Option<Duration>,

So the documented-as-normal path produces an ERROR log on every idle reap.
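For intuition, the keep-alive branch can be mimicked with only the standard library, replacing the tokio::select! arm with a blocking channel receive that times out. This is an illustrative sketch, not rmcp's actual code; QuitReason and run_worker are invented names standing in for WorkerQuitReason and LocalSessionWorker::run:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum QuitReason {
    IdleTimeout(Duration),
    Closed,
}

// Worker loop: wait for the next session event; if nothing arrives
// within `keep_alive`, exit with an idle-timeout reason (the analogue
// of the keep_alive_timeout select arm above).
fn run_worker(events: mpsc::Receiver<&'static str>, keep_alive: Duration) -> QuitReason {
    loop {
        match events.recv_timeout(keep_alive) {
            Ok(_event) => continue, // activity resets the timer
            Err(mpsc::RecvTimeoutError::Timeout) => {
                return QuitReason::IdleTimeout(keep_alive);
            }
            Err(mpsc::RecvTimeoutError::Disconnected) => return QuitReason::Closed,
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || run_worker(rx, Duration::from_millis(50)));
    tx.send("tool call").unwrap(); // one event, then the agent goes idle
    let reason = handle.join().unwrap(); // idle reap fires after 50ms of silence
    println!("{:?}", reason);
}
```

This also makes the bug's shape obvious: the timeout exit is a perfectly ordinary return path, indistinguishable in structure from a clean close.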

Concrete symptom

Every ~5 minutes of agent idleness, downstream logs show:

ERROR rmcp::transport::worker: worker quit with fatal: keep alive timeout after 300000ms, when poll next session event

This is indistinguishable, at the log level, from a real transport fatal (e.g. peer reset mid-stream). Operators who wire ERROR-level alerts get paged on something that is explicitly normal by design.

Why this matters

  • Mixes normal lifecycle events with real transport failures in a single severity bucket.
  • Forces every consumer to add target-level filters to silence idle reaps, but any filter coarse enough to drop this ERROR record (e.g. rmcp::transport::worker=off) also drops real fatals from the same target. No easy downstream fix preserves the original signal.
  • Observability best practice: ERROR should be reserved for conditions requiring human attention. Scheduled idle cleanup of a zombie-prevention timer does not qualify.
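For reference, the only blanket workaround available today looks like the directive below (assuming tracing-subscriber's EnvFilter syntax, with the target name taken from the log line above):

```shell
# Drops ALL records from the worker target, including genuine fatal
# transport errors -- exactly the signal loss described above.
RUST_LOG="info,rmcp::transport::worker=off"
```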

Proposal

Split the idle-reap exit from true transport fatals so the logger can tell them apart. Two options, in ascending invasiveness:

Option A — add a dedicated WorkerQuitReason variant (preferred).
WorkerQuitReason is already #[non_exhaustive], so this is additive. Add e.g.:

pub enum WorkerQuitReason<E> {
    // existing variants…
    #[error("worker idle timeout after {}ms", .0.as_millis())]
    IdleTimeout(Duration),
}

Have LocalSessionWorker::run return WorkerQuitReason::IdleTimeout(keep_alive) instead of Fatal { KeepAliveTimeout }, and in worker.rs log it at tracing::debug! (or tracing::info!) alongside the other expected-exit arms. True transport failures continue to flow through Fatal and keep their ERROR severity.

Option B — widen the Fatal arm's logger with a severity hint.
Keep Fatal as-is but add a method on the inner error (e.g. is_expected_shutdown() -> bool on the Worker::Error trait, defaulting to false) that LocalSessionWorkerError::KeepAliveTimeout overrides to true. The inspect_err arm downgrades to debug! when is_expected_shutdown() returns true. This avoids expanding the enum but couples severity to error identity, which is harder to reason about downstream.

Either way, the invariant should become: ERROR-level log ⇒ operator intervention likely needed. Idle reap of a zombie-prevention timer should not meet that bar.

Happy to follow up with a PR

If the maintainers prefer Option A, I can open a small PR (≤ 50 lines) that:

  • adds the new WorkerQuitReason::IdleTimeout(Duration) variant,
  • returns it from the keep_alive_timeout arm in LocalSessionWorker::run,
  • logs it at tracing::debug! in WorkerTransport::spawn_with_ct,
  • adds a regression test asserting that no ERROR-level records are emitted on idle reap (via a tracing_subscriber::Registry plus a test collector layer).
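The essence of that regression test, sketched with a hand-rolled in-memory recorder instead of a real tracing_subscriber layer (Recorder and log_idle_reap are invented names):

```rust
#[derive(Debug, PartialEq)]
enum Level {
    Debug,
    Error,
}

// Minimal stand-in for a test collector layer: records (level, message).
struct Recorder(Vec<(Level, String)>);

impl Recorder {
    fn log(&mut self, level: Level, msg: &str) {
        self.0.push((level, msg.to_string()));
    }
}

// Stand-in for the worker's exit logging after an idle reap.
fn log_idle_reap(rec: &mut Recorder, keep_alive_ms: u64) {
    rec.log(
        Level::Debug, // the fix: idle reap logs at DEBUG, not ERROR
        &format!("worker quit: idle timeout after {keep_alive_ms}ms"),
    );
}

fn main() {
    let mut rec = Recorder(Vec::new());
    log_idle_reap(&mut rec, 300_000);
    // The regression assertion: no ERROR-level records on idle reap.
    assert!(rec.0.iter().all(|(lvl, _)| *lvl != Level::Error));
    println!("no ERROR records: ok");
}
```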

Let me know which direction you'd like and I'll send it.
