High-Performance Serving Framework for LLMs and Multimodal Models

SGLang powers fast, scalable inference for large language and multimodal models.

Trusted by industry leaders:

See full list

Production-Grade Inference

Built for large-scale deployments, delivering reliable, low-latency, high-throughput serving from a single GPU to distributed clusters.

Model & Hardware Flexibility

Supports a wide range of open models — from LLMs to diffusion models — and runs across diverse hardware platforms.

Advanced Optimizations

Incorporates disaggregated prefill/decode, speculative decoding, parallelisms, a zero-overhead scheduler, and optimized GPU kernels.

Get Started in Seconds

Select your preferences and run the deployment command. SGLang is designed to be easy to install and deploy.

Install via pip or docker

The easiest ways to get started.

Launch the server

Start the server with a single command pointing to your model.

Query the API

Use standard OpenAI-compatible endpoints to interact with your model.

View Documentation

Build

Platform

Package

CUDA Version

Run this Command:

uv pip install sglang sglang-kernel \
  --extra-index-url https://sgl-project.github.io/whl/cu129/ \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match

Broad Model & Hardware Support

A single engine that runs across various models and hardware.

Supported Hardware

NVIDIA GPUs

AMD GPUs

CPU Servers

TPU

Ascend NPUs

XPU

See all supported hardware

Supported Models

DeepSeek

Qwen

GPT-OSS

Llama

Mistral

GLM

See all supported models

Join the Community

From first-time users to teams debugging complex deployments, the community is open to everyone.

GitHub

Report bugs, request features, and contribute code.

Slack

Chat with the community and get real-time help.

Discord

Community for technical discussions and office hours.

Discussions

Share ideas and discuss the future of SGLang.