SGLang vs vLLM Streaming JSON

May 28, 2026

vLLM vs SGLang for `gpt-oss-120b`: Why “OpenAI-Compatible” Streaming JSON Still Needs a Compatibility Layer

Both vLLM and SGLang advertise OpenAI Chat Completions compatibility. In practice, once you enable stream: true against gpt-oss-120b, the SSE chunks are not byte-identical. Most of the differences are harmless, but two of them are subtle enough to silently corrupt client output if you move a parser from one engine to the other without adjustment.

This post turns a live capture into a practical compatibility guide. It shows the exact JSON differences, explains which ones are likely naming drift versus provider-specific extensions, and ends with a parser that safely handles both backends.

Test setup

Model: openai/gpt-oss-120b
Request mode: OpenAI-compatible Chat Completions with stream: true
Request settings: max_tokens: 200, temperature: 0.1
SGLang deployment in this repo: sglang:v0.5.12 with --reasoning-parser gpt-oss and --tool-call-parser gpt-oss
vLLM deployment in this repo: vllm:v0.21.0 with --reasoning_parser openai_gptoss and --tool-call-parser openai

The deployment manifests do not by themselves prove the runtime wire format, but they do support attributing the two captured streams to the SGLang and vLLM deployments respectively.

TL;DR — the two fixes that actually matter

# 1. The reasoning field has two names. Read both.
#    (Chunks are plain dicts here, e.g. json.loads() of each SSE data line.)
reasoning = delta.get("reasoning_content") or delta.get("reasoning") or ""

# 2. Don't break on finish_reason before consuming the delta.
#    vLLM can ship the last token + finish_reason in the same chunk.
text, reasoning = "", ""
for chunk in stream:
    if not chunk["choices"]:  # usage-only chunk, nothing to accumulate
        continue
    choice = chunk["choices"][0]
    delta = choice.get("delta") or {}

    if delta.get("content"):
        text += delta["content"]

    if r := delta.get("reasoning_content") or delta.get("reasoning"):
        reasoning += r

    if choice.get("finish_reason"):
        break

Everything else in this post is context for those two pieces of code.

At a glance: where the streams differ

Topic	vLLM-like stream	SGLang-like stream	Why it matters
`id` format	`chatcmpl-...` + UUID	bare 32-char hex	IDs should be treated as opaque strings
Reasoning field	`delta.reasoning`	`delta.reasoning_content`	Missing either one drops reasoning tokens
Finish behavior	last reasoning delta and `finish_reason` may share one chunk	captured run shows a final reasoning delta followed by a separate finish chunk	Wrong loop order can drop the final delta on vLLM
Stop metadata	`stop_reason`	`matched_stop`	Same concept, different keys
Extra per-chunk fields	`token_ids`, `prompt_token_ids`, `prompt_text`, `system_fingerprint`	fewer tracing fields	Strict schemas must allow unknown keys
Usage details	minimal usage object	`reasoning_tokens`, `prompt_tokens_details`	Billing and accounting logic must tolerate optional fields

1. First chunk: same SSE envelope, different JSON shape

The very first chunk already shows the divergence.

SGLang declares nullable fields up front:

{
  "id": "d3b406a9b33a435cb7a7bcc2266e48ac",
  "object": "chat.completion.chunk",
  "model": "openai/gpt-oss-120b",
  "choices": [{
    "index": 0,
    "delta": { "reasoning_content": null, "role": "assistant", "content": "" },
    "logprobs": null,
    "finish_reason": null,
    "matched_stop": null
  }]
}

vLLM emits a leaner object and adds top-level token-trace fields:

{
  "id": "chatcmpl-6ca2ec78-dac2-4759-8ffc-aa13d8b470bf",
  "object": "chat.completion.chunk",
  "model": "openai/gpt-oss-120b",
  "choices": [{
    "index": 0,
    "delta": { "role": "assistant", "content": "" },
    "logprobs": null,
    "finish_reason": null
  }],
  "prompt_token_ids": null,
  "prompt_text": null
}

Two practical consequences fall out immediately:

Do not assert id.startswith("chatcmpl-"). That works on many OpenAI-style systems, but it breaks on SGLang here.
Do not overfit a strict schema to one backend. Both engines add fields the other one does not use.
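
One way to apply both rules is to validate only the fields the client actually depends on and let everything else pass through. The sketch below is a minimal illustration in plain Python. The name `check_chunk` and the choice of required keys are assumptions made for this post, not part of either engine's API, and the sample chunks are abbreviated versions of the two first chunks shown above.

```python
# Assumption: the minimum set of keys this hypothetical client relies on.
REQUIRED_KEYS = {"id", "object", "choices"}


def check_chunk(chunk: dict) -> dict:
    """Validate only what the client needs and tolerate any extra keys."""
    missing = REQUIRED_KEYS - set(chunk)
    if missing:
        raise ValueError(f"chunk is missing required keys: {sorted(missing)}")
    # Treat the id as an opaque, non-empty string: no prefix or length checks.
    if not isinstance(chunk["id"], str) or not chunk["id"]:
        raise ValueError("chunk id must be a non-empty string")
    if chunk["object"] != "chat.completion.chunk":
        raise ValueError(f"unexpected object type: {chunk['object']!r}")
    return chunk


sglang_first = {
    "id": "d3b406a9b33a435cb7a7bcc2266e48ac",
    "object": "chat.completion.chunk",
    "choices": [{
        "index": 0,
        "delta": {"reasoning_content": None, "role": "assistant", "content": ""},
        "finish_reason": None,
        "matched_stop": None,
    }],
}
vllm_first = {
    "id": "chatcmpl-6ca2ec78-dac2-4759-8ffc-aa13d8b470bf",
    "object": "chat.completion.chunk",
    "choices": [{
        "index": 0,
        "delta": {"role": "assistant", "content": ""},
        "finish_reason": None,
    }],
    "prompt_token_ids": None,
    "prompt_text": None,
}

for first in (sglang_first, vllm_first):
    check_chunk(first)  # both shapes pass without backend-specific branches
```

The design choice is to reject only what would break the client, so a new tracing field from either engine does not become an outage.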

2. Reasoning deltas: same idea, different field names

This is the easiest silent migration bug to introduce.

SGLang uses reasoning_content:

{ "delta": { "reasoning_content": "We" }, "finish_reason": null, "matched_stop": null }
{ "delta": { "reasoning_content": " need" }, "finish_reason": null, "matched_stop": null }
{ "delta": { "reasoning_content": " to" }, "finish_reason": null, "matched_stop": null }

vLLM uses reasoning:

{ "delta": { "reasoning": "We" }, "finish_reason": null, "token_ids": null }
{ "delta": { "reasoning": " need" }, "finish_reason": null, "token_ids": null }
{ "delta": { "reasoning": " to" }, "finish_reason": null, "token_ids": null }

If your client only reads one name, all reasoning tokens can vanish when you switch engines.

For these captures, the safest interpretation is that reasoning_content and reasoning are two names for the same field. vLLM's public reasoning-output guidance treats reasoning as the current name and reasoning_content as a deprecated compatibility alias. Current OpenAI-compatible docs still do not provide a stable, universal spec for raw streamed reasoning fields, so tolerant parsing is the only safe choice.

3. Final chunk semantics: where parsers silently lose data

The most dangerous difference is not the reasoning field name. It is the termination pattern.

In this SGLang capture, the last reasoning token is followed by a separate finish frame:

{ "delta": { "reasoning_content": " IDs" }, "finish_reason": null, "matched_stop": null }
{ "delta": { "reasoning_content": null }, "finish_reason": "length", "matched_stop": null }

vLLM can merge the last reasoning token and finish_reason into one chunk:

{
  "delta": { "reasoning": "STATE" },
  "finish_reason": "length",
  "stop_reason": null,
  "token_ids": null
}

That makes this loop wrong:

if chunk.choices[0].finish_reason:
    break
text += chunk.choices[0].delta.content or ""

It looks correct under SGLang, but on vLLM it can drop the final delta in the same chunk. The safe rule is simple: consume the delta first, then inspect finish_reason.
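
To make the failure concrete, the sketch below replays two abbreviated vLLM-style frames from the capture through both loop orders. The frames are not adjacent in the real stream, the helper names are hypothetical, and only the `reasoning` field is read to keep the comparison small.

```python
# Abbreviated vLLM-style frames: the last reasoning token and finish_reason
# arrive in the same chunk.
vllm_tail = [
    {"choices": [{"delta": {"reasoning": " to"}, "finish_reason": None}]},
    {"choices": [{"delta": {"reasoning": "STATE"}, "finish_reason": "length"}]},
]


def collect_break_first(chunks):
    """Buggy order: checks finish_reason before reading the delta."""
    out = ""
    for chunk in chunks:
        choice = chunk["choices"][0]
        if choice["finish_reason"]:
            break
        out += choice["delta"].get("reasoning") or ""
    return out


def collect_delta_first(chunks):
    """Safe order: reads the delta, then checks finish_reason."""
    out = ""
    for chunk in chunks:
        choice = chunk["choices"][0]
        out += choice["delta"].get("reasoning") or ""
        if choice["finish_reason"]:
            break
    return out


print(repr(collect_break_first(vllm_tail)))  # ' to'       (final token lost)
print(repr(collect_delta_first(vllm_tail)))  # ' toSTATE'  (final token kept)
```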

This is also where you see another provider-specific split:

vLLM exposes stop_reason
SGLang exposes matched_stop

Treat those as analogous backend-specific stop metadata, not guaranteed semantic equivalents and not stable OpenAI-standard fields.
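
A client that wants stop diagnostics from either backend can read whichever key is present and keep track of where the value came from, instead of merging the two into one field. The helper below is a sketch under that assumption; `read_stop_detail` is a hypothetical name, and the sample choices are abbreviated from the finish frames above.

```python
def read_stop_detail(choice: dict):
    """Return (source_key, value) for whichever stop field is present.

    The source key is kept so callers do not silently treat the two
    backend-specific fields as semantically identical.
    """
    for key in ("stop_reason", "matched_stop"):
        if key in choice:
            return key, choice[key]
    return None, None


vllm_choice = {"delta": {"reasoning": "STATE"}, "finish_reason": "length", "stop_reason": None}
sglang_choice = {"delta": {"reasoning_content": None}, "finish_reason": "length", "matched_stop": None}

print(read_stop_detail(vllm_choice))    # ('stop_reason', None)
print(read_stop_detail(sglang_choice))  # ('matched_stop', None)
```

Both values are null in these captures because the runs ended on the token limit rather than a stop sequence.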

4. The `usage` chunk: both compatible, not equally rich

In this run, both engines end with a usage-only chunk with choices: [], but the payloads are still different.

SGLang exposes reasoning accounting explicitly:

{
  "choices": [],
  "usage": {
    "prompt_tokens": 2677,
    "total_tokens": 2877,
    "completion_tokens": 200,
    "prompt_tokens_details": null,
    "reasoning_tokens": 200
  }
}

vLLM keeps the usage object minimal and adds system_fingerprint at the chunk level:

{
  "choices": [],
  "usage": {
    "prompt_tokens": 2674,
    "total_tokens": 2874,
    "completion_tokens": 200
  },
  "system_fingerprint": "vllm-0.1.dev1+gc06ff9ec0-tp2-59a10424"
}

Three details matter here:

In this SGLang capture, reasoning_tokens is already reflected inside completion_tokens. Do not double-count it in client-side accounting unless your provider documents otherwise.
system_fingerprint leaks backend build and topology detail. Strip it if you do not want to expose runtime metadata externally.
The prompt_tokens counts differ by three tokens (2677 vs 2674) even for the same logical request. Do not assume token accounting will match exactly across engines.
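
The first two points can be sketched as a pair of small helpers. Both names are hypothetical, and the "do not add reasoning tokens again" behavior rests on what this SGLang capture shows (2677 + 200 = 2877), not on a documented guarantee.

```python
from typing import Optional


def normalize_usage(usage: Optional[dict]) -> dict:
    """Map either engine's usage object onto one shape with optional fields."""
    usage = usage or {}
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        # Assumption from the capture: reasoning tokens are already inside
        # completion_tokens, so they are NOT added to the total again.
        "total_tokens": usage.get("total_tokens", prompt + completion),
        # None means "backend did not report it", which differs from 0.
        "reasoning_tokens": usage.get("reasoning_tokens"),
    }


def strip_fingerprint(chunk: dict) -> dict:
    """Return a copy of the chunk without system_fingerprint."""
    return {k: v for k, v in chunk.items() if k != "system_fingerprint"}


sglang_usage = {"prompt_tokens": 2677, "total_tokens": 2877, "completion_tokens": 200,
                "prompt_tokens_details": None, "reasoning_tokens": 200}
vllm_usage = {"prompt_tokens": 2674, "total_tokens": 2874, "completion_tokens": 200}

print(normalize_usage(sglang_usage)["reasoning_tokens"])  # 200
print(normalize_usage(vllm_usage)["reasoning_tokens"])    # None
```

Returning `None` rather than `0` for a missing `reasoning_tokens` lets downstream accounting distinguish "no reasoning" from "not reported".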

5. Which differences are naming drift, and which are provider behavior?

The cleanest way to think about these captures is:

Likely naming drift

delta.reasoning_content vs delta.reasoning

Safer to treat as provider-specific extensions

matched_stop vs stop_reason
token_ids, prompt_token_ids, prompt_text
system_fingerprint
reasoning_tokens, prompt_tokens_details
merged-vs-separate finish frame behavior
id formatting differences

That distinction matters because it changes the right client strategy. Naming drift means “read both.” Provider extensions mean “accept if present, ignore if absent.”

6. The reasoning itself is not reproducible across engines

Even with the same model and nearly deterministic settings (temperature: 0.1), the reasoning traces are not the same.

SGLang-style capture: "We need to produce JSON with reasoning_steps, ..."
vLLM-style capture: "We need to extract signals: tables: ..."

That is not surprising. Different engines change enough of the runtime path to perturb generation. Likely contributors include:

different attention kernels
different tensor-parallel topology
different batching and paged-attention implementations

The operational takeaway is simple: if you are running quality A/B tests, lock the engine as well as the model. Otherwise you are partly measuring engine artifacts.

7. A minimal parser for the captured variants

def parse_chat_stream(stream):
    """Accumulate content, reasoning, finish_reason, and usage from dict chunks."""
    # usage starts as {} so the return below works even if no usage chunk arrives
    content, reasoning, finish_reason, usage = "", "", None, {}

    for chunk in stream:
        # Final usage chunk has empty choices on both engines
        if not chunk.get("choices"):
            # Keep earlier usage data if this chunk carries "usage": null
            usage = chunk.get("usage") or usage
            continue

        choice = chunk["choices"][0]
        delta = choice.get("delta") or {}

        # Dual-field reasoning: required for cross-engine correctness
        r = delta.get("reasoning_content") or delta.get("reasoning")
        if r:
            reasoning += r

        if delta.get("content"):
            content += delta["content"]

        # Collect first, THEN record finish_reason
        if choice.get("finish_reason"):
            finish_reason = choice["finish_reason"]

    return {
        "content": content,
        "reasoning": reasoning,
        "finish_reason": finish_reason,
        "usage": {
            "prompt_tokens": usage.get("prompt_tokens", 0),
            "completion_tokens": usage.get("completion_tokens", 0),
            "reasoning_tokens": usage.get("reasoning_tokens", 0),
        },
    }

This parser does not branch on provider identity. It simply tolerates the field and finish-frame differences that matter in the captured variants.
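
The parser expects already-decoded dict chunks, while the wire format is SSE text. A minimal decoder for the line shape seen in these captures might look like the sketch below. `iter_sse_json` is a hypothetical helper, and it assumes every `data:` line carries one complete JSON object, which holds for both captures but is not a full SSE implementation (multi-line events are not handled).

```python
import json


def iter_sse_json(lines):
    """Yield decoded JSON payloads from SSE `data:` lines until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # stream terminator used by both engines in this capture
        yield json.loads(payload)


sample = [
    'data: {"choices":[],"usage":{"prompt_tokens":2674,"total_tokens":2874,"completion_tokens":200}}',
    "",
    "data: [DONE]",
    'data: {"choices":[]}',  # never reached: iteration stops at [DONE]
]
chunks = list(iter_sse_json(sample))
print(len(chunks), chunks[0]["usage"]["prompt_tokens"])  # 1 2674
```

With that in place, the output of `iter_sse_json` over the response's text lines can be passed straight to `parse_chat_stream`.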

8. Migration checklist

Priority	Change	If you skip it
P0	Read both `reasoning_content` and `reasoning`	All reasoning tokens can be dropped
P0	Consume delta before checking `finish_reason`	Final token can be lost on vLLM
P1	Allow extra keys and optional reasoning fields in the schema	Strict validators reject one backend
P1	Stop asserting `id.startswith("chatcmpl-")`	Hard failure on SGLang
P2	Read `stop_reason` and `matched_stop` as optional equivalents	You lose stop diagnostics
P2	Treat `system_fingerprint` and `reasoning_tokens` as optional	`AttributeError` or accounting bugs

The two P0 items are the real compatibility boundary. Everything else is cleanup and resilience.

Appendix A: verbatim raw SSE excerpts from the live capture

The full captures are long and repetitive, so this appendix keeps the blog readable by showing representative raw lines from one live run exactly as captured. The vLLM excerpt includes system_fingerprint; redact it if you do not want to publish backend build metadata.

A.1 First chunk

vLLM-style capture

data: {"id":"chatcmpl-6ca2ec78-dac2-4759-8ffc-aa13d8b470bf","object":"chat.completion.chunk","created":1779866853,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null,"prompt_text":null}

SGLang-style capture

data: {"id":"d3b406a9b33a435cb7a7bcc2266e48ac","object":"chat.completion.chunk","created":1779866808,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"reasoning_content":null,"role":"assistant","content":""},"logprobs":null,"finish_reason":null,"matched_stop":null}]}

A.2 Early reasoning chunks

vLLM-style capture

data: {"id":"chatcmpl-6ca2ec78-dac2-4759-8ffc-aa13d8b470bf","object":"chat.completion.chunk","created":1779866853,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"reasoning":"We"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
data: {"id":"chatcmpl-6ca2ec78-dac2-4759-8ffc-aa13d8b470bf","object":"chat.completion.chunk","created":1779866853,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"reasoning":" need"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
data: {"id":"chatcmpl-6ca2ec78-dac2-4759-8ffc-aa13d8b470bf","object":"chat.completion.chunk","created":1779866853,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"reasoning":" to"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

SGLang-style capture

data: {"id":"d3b406a9b33a435cb7a7bcc2266e48ac","object":"chat.completion.chunk","created":1779866808,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"reasoning_content":"We"},"logprobs":null,"finish_reason":null,"matched_stop":null}]}
data: {"id":"d3b406a9b33a435cb7a7bcc2266e48ac","object":"chat.completion.chunk","created":1779866808,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"reasoning_content":" need"},"logprobs":null,"finish_reason":null,"matched_stop":null}]}
data: {"id":"d3b406a9b33a435cb7a7bcc2266e48ac","object":"chat.completion.chunk","created":1779866808,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"reasoning_content":" to"},"logprobs":null,"finish_reason":null,"matched_stop":null}]}

A.3 Terminal reasoning and finish frames

vLLM-style capture

data: {"id":"chatcmpl-6ca2ec78-dac2-4759-8ffc-aa13d8b470bf","object":"chat.completion.chunk","created":1779866853,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"reasoning":"STATE"},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}]}

SGLang-style capture

data: {"id":"d3b406a9b33a435cb7a7bcc2266e48ac","object":"chat.completion.chunk","created":1779866809,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"reasoning_content":" IDs"},"logprobs":null,"finish_reason":null,"matched_stop":null}]}
data: {"id":"d3b406a9b33a435cb7a7bcc2266e48ac","object":"chat.completion.chunk","created":1779866809,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"reasoning_content":null},"logprobs":null,"finish_reason":"length","matched_stop":null}]}

A.4 Final usage chunk and stream terminator

vLLM-style capture

data: {"id":"chatcmpl-6ca2ec78-dac2-4759-8ffc-aa13d8b470bf","object":"chat.completion.chunk","created":1779866853,"model":"openai/gpt-oss-120b","choices":[],"usage":{"prompt_tokens":2674,"total_tokens":2874,"completion_tokens":200},"system_fingerprint":"vllm-0.1.dev1+gc06ff9ec0-tp2-59a10424"}
data: [DONE]

SGLang-style capture

data: {"id":"d3b406a9b33a435cb7a7bcc2266e48ac","object":"chat.completion.chunk","created":1779866809,"model":"openai/gpt-oss-120b","choices":[],"usage":{"prompt_tokens":2677,"total_tokens":2877,"completion_tokens":200,"prompt_tokens_details":null,"reasoning_tokens":200}}
data: [DONE]
