Skip to content

fix(pg_upgrade): retry nix-store -r to avoid hangs on stalled S3 fetches#2220

Merged
jfreeland merged 4 commits into
developfrom
fix/pg-upgrade-initiate
Jun 16, 2026
Merged

fix(pg_upgrade): retry nix-store -r to avoid hangs on stalled S3 fetches#2220
jfreeland merged 4 commits into
developfrom
fix/pg-upgrade-initiate

Conversation

@jfreeland

Copy link
Copy Markdown
Contributor

What

Wrap the binary-cache nix-store -r step in the pg_upgrade initiate script with a per-attempt timeout and a small retry loop.

Why

nix-store -r can stall indefinitely on a dropped S3 connection without erroring out, its own download timeout doesn't reliably fire. When that happens the upgrade hangs forever reporting running, eventually exhausting the caller's poll budget with no useful error.

How

  • Each attempt runs under timeout (3 × 120s); nix-store is resumable, so a retry only re-fetches the paths that didn't finish.
  • On exhaustion it fails as a normal command (not exit 1) so the existing ERR trap runs cleanup and records failed, turning an indefinite hang into a fast, reported failure.
  • Worst case (360s) stays within the nix-fetch window the upgrade poller already allows, so no caller-side timeout changes are needed.

@jfreeland jfreeland requested review from a team as code owners June 15, 2026 20:00
# nix-store -r can stall indefinitely on a dropped S3 connection without
# erroring out (its own download timeout doesn't reliably fire), so guard each
# attempt with a timeout and retry. nix-store is resumable, so a retry only
# re-fetches the unfinished paths. Failing as a normal command (not exit 1)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resumability is per-path, not per-byte. Nix realisation iterates the closure, and each individual NAR is downloaded to a temp location and only registered as a valid store path after it completes and its hash verifies. Registration is transactional, so a SIGKILL mid-download leaves the store consistent and that path simply invalid. The next nix-store -r invocation checks isValidPath per path and skips the ones already registered. That's why retries work. But the path that was in flight when timeout fired restarts from 0% on the next attempt; there's no byte-level resume within a single NAR. So if your closure contains one NAR that takes longer than 120s to fetch over the link you're worried about, every attempt dies at 120s mid-fetch on that path and you get three guaranteed failures with zero forward progress. You have to just make sure 120s comfortably exceeds your single largest NAR's transfer time at your worst expected S3 throughput, or split/raise the budget accordingly. Everything smaller than that converges fine across attempts.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a very very brief version of this so it's at least kind of captured and people can find the PR for this context in the future. :)

@samrose samrose self-requested a review June 15, 2026 20:12
Comment thread ansible/files/admin_api_scripts/pg_upgrade_scripts/initiate.sh Outdated

@Crispy1975 Crispy1975 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I will defer to @samrose on nix things.

@jfreeland jfreeland requested a review from Copilot June 16, 2026 16:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the Nix-based pg_upgrade initiation workflow by preventing indefinite hangs during the nix-store -r (binary cache realization) step via per-attempt timeouts and retries, ensuring stalled downloads fail fast and are handled by the existing ERR cleanup/tracking.

Changes:

  • Wrap nix-store -r "$STORE_PATH" in a 3-attempt retry loop guarded by timeout.
  • Preserve existing failure handling by letting the script fail naturally (triggering the ERR trap) once retries are exhausted.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ansible/files/admin_api_scripts/pg_upgrade_scripts/initiate.sh Outdated
Comment thread ansible/files/admin_api_scripts/pg_upgrade_scripts/initiate.sh Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread ansible/files/admin_api_scripts/pg_upgrade_scripts/initiate.sh
Comment thread ansible/files/admin_api_scripts/pg_upgrade_scripts/initiate.sh
@jfreeland jfreeland enabled auto-merge June 16, 2026 17:02
@jfreeland jfreeland added this pull request to the merge queue Jun 16, 2026
Merged via the queue into develop with commit 405724f Jun 16, 2026
41 of 43 checks passed
@jfreeland jfreeland deleted the fix/pg-upgrade-initiate branch June 16, 2026 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants