fix(pg_upgrade): retry nix-store -r to avoid hangs on stalled S3 fetches #2220
# nix-store -r can stall indefinitely on a dropped S3 connection without
# erroring out (its own download timeout doesn't reliably fire), so guard each
# attempt with a timeout and retry. nix-store is resumable, so a retry only
# re-fetches the unfinished paths. Failing as a normal command (not exit 1)
Resumability is per-path, not per-byte. Nix realisation iterates the closure, and each individual NAR is downloaded to a temp location and only registered as a valid store path after it completes and its hash verifies. Registration is transactional, so a SIGKILL mid-download leaves the store consistent and that path simply invalid. The next nix-store -r invocation checks isValidPath per path and skips the ones already registered. That's why retries work.

But the path that was in flight when timeout fired restarts from 0% on the next attempt; there's no byte-level resume within a single NAR. So if your closure contains one NAR that takes longer than 120s to fetch over the link you're worried about, every attempt dies at 120s mid-fetch on that path and you get three guaranteed failures with zero forward progress on that path.

So make sure 120s comfortably exceeds your single largest NAR's transfer time at your worst expected S3 throughput, or split the closure or raise the budget accordingly. Everything smaller than that converges fine across attempts.
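The budget check described above is simple arithmetic, sketched here as a small shell helper. The function name `fits_budget` and every number in the usage lines are illustrative assumptions, not values measured for this PR; substitute your own largest NAR size and worst-case throughput.

```shell
# Hypothetical helper (not part of the PR): prints the seconds needed to fetch
# one NAR and succeeds only if that is strictly below the per-attempt timeout.
fits_budget() {
  local nar_bytes=$1       # size of the single largest NAR, in bytes
  local bytes_per_sec=$2   # worst expected S3 throughput, in bytes per second
  local timeout_secs=$3    # per-attempt timeout, e.g. 120
  # Ceiling division: a partial second still counts as a full second.
  local needed=$(( (nar_bytes + bytes_per_sec - 1) / bytes_per_sec ))
  echo "$needed"
  [ "$needed" -lt "$timeout_secs" ]
}

# Illustrative numbers only:
fits_budget 600000000 10000000 120 || echo "largest NAR will never finish"
# prints 60 (a 600 MB NAR at 10 MB/s fits within 120s)
fits_budget 3000000000 10000000 120 || echo "largest NAR will never finish"
# prints 300, then the warning (a 3 GB NAR at 10 MB/s cannot fit)
```

A strict comparison only shows the transfer fits on paper; leaving generous headroom on top of it is what "comfortably exceeds" asks for.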
added a very very brief version of this so it's at least kind of captured and people can find the PR for this context in the future. :)
Crispy1975 left a comment
LGTM, but I will defer to @samrose on nix things.
Pull request overview
This PR hardens the Nix-based pg_upgrade initiation workflow by preventing indefinite hangs during the nix-store -r (binary cache realization) step via per-attempt timeouts and retries, ensuring stalled downloads fail fast and are handled by the existing ERR cleanup/tracking.
Changes:
- Wrap `nix-store -r "$STORE_PATH"` in a 3-attempt retry loop guarded by `timeout`.
- Preserve existing failure handling by letting the script fail naturally (triggering the `ERR` trap) once retries are exhausted.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
What
Wrap the binary-cache `nix-store -r` step in the pg_upgrade initiate script with a per-attempt timeout and a small retry loop.

Why
`nix-store -r` can stall indefinitely on a dropped S3 connection without erroring out; its own download timeout doesn't reliably fire. When that happens the upgrade hangs forever reporting `running`, eventually exhausting the caller's poll budget with no useful error.

How
- Guard each attempt with `timeout` (3 × 120s); `nix-store` is resumable, so a retry only re-fetches the paths that didn't finish.
- Fail as a normal command (not `exit 1`) so the existing `ERR` trap runs cleanup and records `failed`, turning an indefinite hang into a fast, reported failure.
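The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the diff: the helper name `retry_with_timeout` and the log message are hypothetical, and it assumes GNU coreutils `timeout` is available on the host.

```shell
# Hypothetical helper (name not taken from the PR): run a command up to
# $1 times, giving each attempt at most $2 seconds.
retry_with_timeout() {
  local attempts=$1
  local secs=$2
  shift 2
  local i
  for i in $(seq 1 "$attempts"); do
    # timeout(1) terminates the command after $secs seconds and exits 124;
    # any non-zero status counts as a failed attempt here.
    if timeout "$secs" "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed" >&2
  done
  # Return non-zero instead of calling exit, so the failure surfaces as an
  # ordinary failed command and the caller's ERR trap can run.
  return 1
}

# Usage matching the numbers in the description (3 attempts, 120s each);
# STORE_PATH is assumed to be set earlier in the script:
# retry_with_timeout 3 120 nix-store -r "$STORE_PATH"
```

Returning a status rather than exiting keeps the cleanup and `failed` bookkeeping in one place, the existing `ERR` trap, instead of duplicating it inside the loop.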