High-Performance Serving Framework for LLMs and Multimodal Models

SGLang powers fast, scalable inference for large language and multimodal models.

SGLang Data Processing Architecture

Trusted by industry leaders:

Production-Grade Inference

Built for large-scale deployments, delivering reliable, low-latency, high-throughput serving from a single GPU to distributed clusters.

Model & Hardware Flexibility

Supports a wide range of open models — from LLMs to diffusion models — and runs across diverse hardware platforms.

Advanced Optimizations

Incorporates disaggregated prefill/decode, speculative decoding, parallelisms, a zero-overhead scheduler, and optimized GPU kernels.

Get Started in Seconds

Select your preferences and run the deployment command. SGLang is designed to be easy to install and deploy.

1

Install via pip or docker

The easiest ways to get started.

2

Launch the server

Start the server with a single command pointing to your model.

3

Query the API

Use standard OpenAI-compatible endpoints to interact with your model.

Build
Platform
Package
CUDA Version

Run this Command:
uv pip install sglang sglang-kernel \
  --extra-index-url https://sgl-project.github.io/whl/cu129/ \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match

Broad Model & Hardware Support

A single engine that runs across various models and hardware.

Supported Hardware

NVIDIA GPUs
AMD GPUs
CPU Servers
TPU
Ascend NPUs
XPU

Supported Models

DeepSeek
Qwen
GPT-OSS
Llama
Mistral
GLM