qwen3_5_moe: add OpenAI serving entrypoint#20313
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20313
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New Failure, 3 Unrelated Failures, 4 Unclassified FailuresAs of commit 13b2ff0 with merge base 551e90e ( NEW FAILURE - The following job has failed:
UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but was present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This PR needs a
|
There was a problem hiding this comment.
Pull request overview
Adds an OpenAI-compatible serving entrypoint for the Qwen3.5 MoE example model by introducing a model-specific Python launcher (control plane) and a dedicated C++ worker binary (data plane) that speaks the generic examples/llm_server JSONL protocol.
Changes:
- Introduce
executorch.examples.models.qwen3_5_moe.serveplus hermetic tests asserting control-plane/model-code separation and correct worker spawn args. - Add
qwen3_5_moe_workerexecutable target and wire it into Qwen3.5 MoE CMake presets. - Extend CI to export additional tokenizer files and run a CUDA OpenAI-serving smoke test; document serving usage in the model README.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| examples/models/qwen3_5_moe/test_serve.py | Adds hermetic tests for the serving launcher and separation guarantees. |
| examples/models/qwen3_5_moe/serve.py | New OpenAI-compatible control-plane entrypoint that spawns the worker and builds the FastAPI app. |
| examples/models/qwen3_5_moe/README.md | Documents how to run the server and integrate it with pi. |
| examples/models/qwen3_5_moe/qwen35_moe_worker.cpp | New C++ worker binary for model execution via llm_server JSONL protocol. |
| examples/models/qwen3_5_moe/CMakePresets.json | Adds the worker target to CUDA/Metal build presets. |
| examples/models/qwen3_5_moe/CMakeLists.txt | Defines the qwen3_5_moe_worker executable and stripping/link options. |
| .ci/scripts/test_model_e2e.sh | Adds CUDA serving smoke test exercising /health, /v1/models, and /v1/chat/completions. |
| .ci/scripts/export_model_artifact.sh | Exports tokenizer_config.json alongside tokenizer.json for serving templating. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _default_worker_bin() -> str: | ||
| repo_root = Path(__file__).resolve().parents[3] | ||
| return str( | ||
| repo_root | ||
| / "cmake-out" | ||
| / "examples" | ||
| / "models" | ||
| / "qwen3_5_moe" | ||
| / "qwen3_5_moe_worker" | ||
| ) |
| _HERE = pathlib.Path(serve.__file__).resolve().parent | ||
| _REPO_ROOT = _HERE.parents[2] | ||
|
|
| offenders = [ | ||
| p | ||
| for p in server_dir.rglob("*.py") | ||
| if "qwen3_5_moe" in p.read_text() or "_qwen35_moe" in p.read_text() | ||
| ] |
No description provided.