Skip to content

Proxy add watchdog#1931

Open
hhaAndroid wants to merge 7 commits into
InternLM:mainfrom
hhaAndroid:proxy_add_watchdog
Open

Proxy add watchdog#1931
hhaAndroid wants to merge 7 commits into
InternLM:mainfrom
hhaAndroid:proxy_add_watchdog

Conversation

@hhaAndroid

@hhaAndroid hhaAndroid commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator
  • 之前 proxy 判断逻辑非常 hard code,这版本移除,而是要用户显式设置
  • 之前 proxy 没有支持注册和移除功能,在 worker 失败和重启中无法联动,目前已经支持

@hhaAndroid

Copy link
Copy Markdown
Collaborator Author

@claude review

Comment thread xtuner/v1/train/rl_trainer.py
Comment thread xtuner/v1/rl/rollout/proxy_manager.py
Comment thread xtuner/v1/rl/rollout/proxy_manager.py
Comment thread xtuner/v1/rl/rollout/proxy_manager.py
@claude

claude Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Claude: ## Summary

Moves proxy management from the trainer into the rollout subsystem via a new RolloutProxyManager that hooks into worker lifecycle events. Replaces hard-coded class-name checks with a declarative requires_rollout_proxy field on AgentLoopConfig. Good architectural improvement — decouples proxy logic from the trainer and enables automatic proxy sync on worker failure/recovery.

ProduceBatchResult impact: not affected
RoutedExperts impact: not affected

Issues

Critical

Warning

  • xtuner/v1/rl/rollout/proxy_manager.py:18-20 No thread synchronization on _registered_session_urls. Concurrent lifecycle notifications from health-check and recovery threads could race on the check-then-act patterns in _register_session_url/_delete_session_url.

Nit

Verdict

REQUEST_CHANGES — the test/implementation mismatch needs a fix before merge.

Comment thread xtuner/v1/rl/utils/misc.py Outdated
@hhaAndroid

Copy link
Copy Markdown
Collaborator Author

@claude review

Comment thread xtuner/v1/rl/rollout/proxy_manager.py
Comment thread xtuner/v1/rl/rollout/worker.py
@claude

claude Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Claude: ## Summary

Re-reviewed after new commits. The previous critical issue (test asserting logger.info that didn't exist) has been fixed. Thread safety concern addressed by moving lifecycle notifications inside _operation_lock.

Good architectural improvement: replaces hard-coded class-name proxy checks with a declarative requires_rollout_proxy field, moves proxy lifecycle management into RolloutProxyManager that hooks into the health manager via a clean RolloutWorkerLifecycleListener protocol, and adds proper validation via validate_registered_workers_to_proxy before training starts.

ProduceBatchResult impact: not affected
RoutedExperts impact: not affected

Issues

Nit

  • proxy_manager.py:38-42 Only the last registration result is captured in register_result; intermediate results are lost for debugging.
  • worker.py:424-437 Hardcoded internal URLs as defaults for routed_proxy_url/routed_proxy_admin_url could confuse external users if proxy is enabled without overriding them.

Verdict

APPROVE — previous concerns resolved, architecture is clean, tests cover the key behaviors.

logger.warning(
f"Rollout worker group ranks={group.ranks} failed health check. Marking as inactive."
)
self._notify_worker_group_inactive(group)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里为什么要被锁保护呢 ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

claude code 说有可能存在偶发的竞态。放到里面最保险

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants