NVIDIA: Llama 3.3 Nemotron Super 49B V1.5
Server-rendered model summary page for indexing/share previews. Use the interactive explorer for full filtering and comparison.
Identifiers & provenance
- Primary ID: nvidia/llama-3.3-nemotron-super-49b-v1.5
- OpenRouter ID: nvidia/llama-3.3-nemotron-super-49b-v1.5
- Canonical slug: nvidia/llama-3.3-nemotron-super-49b-v1.5
Source semantics
- Arena rank is a human-preference leaderboard signal, not a universal truth metric.
- OpenRouter usage/popularity reflects adoption/traffic, not benchmark quality.
- Pricing fields may differ by provider and can include extra modes beyond prompt/completion.
Read more on Methodology & data sources.
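The pricing fields in the raw snapshot below are quoted as USD-per-token strings, while the PPM block restates them per million tokens. A minimal sketch of that conversion (the helper name is illustrative; `decimal` is used to avoid float drift):

```python
from decimal import Decimal

def to_ppm(per_token_usd: str) -> Decimal:
    """Convert a USD-per-token price string to USD per million tokens."""
    return Decimal(per_token_usd) * 1_000_000

# Prices as quoted for this model:
prompt_ppm = to_ppm("0.0000001")      # $0.10 per million prompt tokens
completion_ppm = to_ppm("0.0000004")  # $0.40 per million completion tokens
```

Note that other providers may quote additional modes (e.g. cached or reasoning tokens) beyond prompt/completion, which this sketch does not cover.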
Description
Llama-3.3-Nemotron-Super-49B-v1.5 is a 49B-parameter, English-centric reasoning/chat model derived from Meta’s Llama-3.3-70B-Instruct, with a 128K-token context. It is post-trained for agentic workflows (RAG, tool calling) via SFT across math, code, science, and multi-turn chat, followed by multiple RL stages: Reward-aware Preference Optimization (RPO) for alignment, RL with Verifiable Rewards (RLVR) for step-wise reasoning, and iterative DPO to refine tool-use behavior. A distillation-driven Neural Architecture Search (“Puzzle”) replaces some attention blocks and varies FFN widths to shrink the memory footprint and improve throughput, enabling single-GPU (H100/H200) deployment while preserving instruction following and CoT quality.

In internal evaluations (NeMo-Skills, up to 16 runs, temperature = 0.6, top_p = 0.95), the model reports strong reasoning/coding results, e.g. MATH500 pass@1 = 97.4, AIME-2024 = 87.5, AIME-2025 = 82.71, GPQA = 71.97, LiveCodeBench (24.10–25.02) = 73.58, and MMLU-Pro (CoT) = 79.53. The model targets practical inference efficiency (high tokens/s, reduced VRAM) with Transformers/vLLM support and explicit “reasoning on/off” modes (chat-first defaults; greedy decoding recommended when reasoning is disabled). It suits agents, assistants, and long-context retrieval systems where a balanced accuracy-to-cost ratio and reliable tool use matter.
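The reasoning on/off behavior above can be sketched as request construction for an OpenAI-compatible chat endpoint. This is a minimal sketch, not an official client: `build_request` is a hypothetical helper and the `reasoning: {"enabled": …}` payload shape is an assumption; the sampling values mirror the evaluation settings quoted above, with greedy decoding when reasoning is off.

```python
MODEL_ID = "nvidia/llama-3.3-nemotron-super-49b-v1.5"

def build_request(messages: list[dict], reasoning_enabled: bool = True) -> dict:
    """Build a chat-completions payload for this model.

    With reasoning on, use the sampling settings reported for the
    model's evaluations (temperature 0.6, top_p 0.95); with reasoning
    off, fall back to greedy decoding as the model card recommends.
    """
    payload = {
        "model": MODEL_ID,
        "messages": messages,
        "reasoning": {"enabled": reasoning_enabled},  # assumed field shape
    }
    if reasoning_enabled:
        payload.update(temperature=0.6, top_p=0.95)
    else:
        payload.update(temperature=0.0)  # greedy decoding
    return payload

req = build_request([{"role": "user", "content": "Summarize RLVR in one line."}])
```

The payload would be POSTed as JSON to the provider's chat-completions endpoint; tool definitions can be attached via the `tools` and `tool_choice` parameters listed in the snapshot below.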
Raw fields snapshot
{
  "id": "nvidia/llama-3.3-nemotron-super-49b-v1.5",
  "name": "NVIDIA: Llama 3.3 Nemotron Super 49B V1.5",
  "description": "Llama-3.3-Nemotron-Super-49B-v1.5 is a 49B-parameter, English-centric reasoning/chat model derived from Meta’s Llama-3.3-70B-Instruct with a 128K context. It’s post-trained for agentic workflows (RAG, tool calling) via SFT across math, code, science, and multi-turn chat, followed by multiple RL stages; Reward-aware Preference Optimization (RPO) for alignment, RL with Verifiable Rewards (RLVR) for step-wise reasoning, and iterative DPO to refine tool-use behavior. A distillation-driven Neural Architecture Search (“Puzzle”) replaces some attention blocks and varies FFN widths to shrink memory footprint and improve throughput, enabling single-GPU (H100/H200) deployment while preserving instruction following and CoT quality.\n\nIn internal evaluations (NeMo-Skills, up to 16 runs, temp = 0.6, top_p = 0.95), the model reports strong reasoning/coding results, e.g., MATH500 pass@1 = 97.4, AIME-2024 = 87.5, AIME-2025 = 82.71, GPQA = 71.97, LiveCodeBench (24.10–25.02) = 73.58, and MMLU-Pro (CoT) = 79.53. The model targets practical inference efficiency (high tokens/s, reduced VRAM) with Transformers/vLLM support and explicit “reasoning on/off” modes (chat-first defaults, greedy recommended when disabled). Suitable for building agents, assistants, and long-context retrieval systems where balanced accuracy-to-cost and reliable tool use matter.\n",
  "created": 1760101395,
  "canonical_slug": "nvidia/llama-3.3-nemotron-super-49b-v1.5",
  "hugging_face_id": "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
  "source_type": "openrouter_only",
  "context_length": 131072,
  "max_completion_tokens": null,
  "is_moderated": false,
  "architecture": {
    "modality": "text->text",
    "input_modalities": ["text"],
    "output_modalities": ["text"],
    "tokenizer": "Llama3",
    "instruct_type": null
  },
  "input_modalities": ["text"],
  "output_modalities": ["text"],
  "modality": "text->text",
  "tokenizer": "Llama3",
  "instruct_type": null,
  "supported_parameters": [
    "frequency_penalty",
    "include_reasoning",
    "max_tokens",
    "min_p",
    "presence_penalty",
    "reasoning",
    "repetition_penalty",
    "response_format",
    "seed",
    "stop",
    "temperature",
    "tool_choice",
    "tools",
    "top_k",
    "top_p"
  ],
  "default_parameters": null,
  "per_request_limits": null,
  "top_provider": {
    "context_length": 131072,
    "max_completion_tokens": null,
    "is_moderated": false
  },
  "pricing": {
    "prompt": "0.0000001",
    "completion": "0.0000004"
  },
  "PPM": {
    "prompt": 0.1,
    "completion": 0.4
  },
  "openrouter_raw": {
    "id": "nvidia/llama-3.3-nemotron-super-49b-v1.5",
    "canonical_slug": "nvidia/llama-3.3-nemotron-super-49b-v1.5",
    "hugging_face_id": "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
    "name": "NVIDIA: Llama 3.3 Nemotron Super 49B V1.5",
    "created": 1760101395,
    "description": "Llama-3.3-Nemotron-Super-49B-v1.5 is a 49B-parameter, English-centric reasoning/chat model derived from Meta’s Llama-3.3-70B-Instruct with a 128K context. It’s post-trained for agentic workflows (RAG, tool calling) via SFT across math, code, science, and multi-turn chat, followed by multiple RL stages; Reward-aware Preference Optimization (RPO) for alignment, RL with Verifiable Rewards (RLVR) for step-wise reasoning, and iterative DPO to refine tool-use behavior. A distillation-driven Neural Architecture Search (“Puzzle”) replaces some attention blocks and varies FFN widths to shrink memory footprint and improve throughput, enabling single-GPU (H100/H200) deployment while preserving instruction following and CoT quality.\n\nIn internal evaluations (NeMo-Skills, up to 16 runs, temp = 0.6, top_p = 0.95), the model reports strong reasoning/coding results, e.g., MATH500 pass@1 = 97.4, AIME-2024 = 87.5, AIME-2025 = 82.71, GPQA = 71.97, LiveCodeBench (24.10–25.02) = 73.58, and MMLU-Pro (CoT) = 79.53. The model targets practical inference efficiency (high tokens/s, reduced VRAM) with Transformers/vLLM support and explicit “reasoning on/off” modes (chat-first defaults, greedy recommended when disabled). Suitable for building agents, assistants, and long-context retrieval systems where balanced accuracy-to-cost and reliable tool use matter.\n",
    "context_length": 131072,
    "architecture": {
      "modality": "text->text",
      "input_modalities": ["text"],
      "output_modalities": ["text"],
      "tokenizer": "Llama3",
      "instruct_type": null
    },
    "pricing": {
      "prompt": "0.0000001",
      "completion": "0.0000004"
    },
    "top_provider": {
      "context_length": 131072,
      "max_completion_tokens": null,
      "is_moderated": false
    },
    "per_request_limits": null,
    "supported_parameters": [
      "frequency_penalty",
      "include_reasoning",
      "max_tokens",
      "min_p",
      "presence_penalty",
      "reasoning",
      "repetition_penalty",
      "response_format",
      "seed",
      "stop",
      "temperature",
      "tool_choice",
      "tools",
      "top_k",
      "top_p"
    ],
    "default_parameters": null,
    "expiration_date": null
  }
}
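Because `max_completion_tokens` is null in the snapshot, the 131,072-token context window is the effective shared cap on prompt plus completion. A small sketch (hypothetical helper) of the completion budget left after a prompt of a given length:

```python
CONTEXT_LENGTH = 131072  # from the snapshot; shared by prompt and completion

def max_completion_budget(prompt_tokens: int,
                          context_length: int = CONTEXT_LENGTH) -> int:
    """Tokens remaining for the completion after the prompt is counted."""
    return max(context_length - prompt_tokens, 0)
```

For example, a 100,000-token retrieval prompt leaves roughly 31,000 tokens of headroom for reasoning traces and the final answer.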