
Architecture: Innovative Load Balancing Strategy and Training Objective


We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.


Figure 1 | Benchmark performance of DeepSeek-V3 and its counterparts.


In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), LLaMA series (AI@Meta, 2024a,b; Touvron et al., 2023a,b), Qwen series (Qwen, 2023, 2024a,b), and Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B parameters, of which 37B are activated for each token.

With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training. These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.

In order to achieve efficient training, we support FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b), its evolution being closely tied to advancements in hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through the support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead. In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.

During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.

Table 1 | Training costs of DeepSeek-V3, assuming the rental price of H800 is $2 per GPU hour.

| Training Costs | Pre-Training | Context Extension | Post-Training | Total |
| --- | --- | --- | --- | --- |
| in H800 GPU Hours | 2664K | 119K | 5K | 2788K |
| in USD | $5.328M | $0.238M | $0.01M | $5.576M |


We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard and open-ended benchmarks.

我們?cè)诙鄠€(gè)基準(zhǔn)測(cè)試上對(duì) DeepSeek-V3 進(jìn)行了全面評(píng)估。盡管訓(xùn)練成本較低,但綜合評(píng)估顯示,DeepSeek-V3-Base 已成為當(dāng)前最強(qiáng)的開(kāi)源基礎(chǔ)模型,尤其是在代碼和數(shù)學(xué)領(lǐng)域。其聊天版本也在多個(gè)標(biāo)準(zhǔn)和開(kāi)放式基準(zhǔn)測(cè)試中超越了其他開(kāi)源模型,并達(dá)到了與 GPT-4o 和 Claude-3.5-Sonnet 等領(lǐng)先閉源模型相當(dāng)?shù)男阅堋?/p>

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.

Our main contributions include:

• On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.
• We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.

• We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
• Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.
• At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.

• We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek-R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.

• Knowledge: (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in Chinese factual knowledge.
• Code, Math, and Reasoning: (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500, demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks, DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks, such as LiveCodeBench, solidifying its position as the leading model in this domain. For engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5, it still outpaces all other models by a significant margin, demonstrating its competitiveness across diverse technical benchmarks.

In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3 model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing our compute clusters, the training framework, the support for FP8 training, the inference deployment strategy, and our suggestions on future hardware design (Section 3). Next, we describe our pre-training process, including the construction of training data, hyper-parameter settings, long-context extension techniques, the associated evaluations, as well as some discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).

We first introduce the basic architecture of DeepSeek-V3, featured by Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for economical training. Then, we present a Multi-Token Prediction (MTP) training objective, which we have observed to enhance the overall performance on evaluation benchmarks. For other minor details not explicitly mentioned, DeepSeek-V3 adheres to the settings of DeepSeek-V2 (DeepSeek-AI, 2024c).

The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017) framework. For efficient inference and economical training, DeepSeek-V3 also adopts MLA and DeepSeekMoE, which have been thoroughly validated by DeepSeek-V2. Compared with DeepSeek-V2, an exception is that we additionally introduce an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) for DeepSeekMoE to mitigate the performance degradation induced by the effort to ensure load balance. Figure 2 illustrates the basic architecture of DeepSeek-V3, and we will briefly review the details of MLA and DeepSeekMoE in this section.


Figure 2 | Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training.

For attention, DeepSeek-V3 adopts the MLA architecture. Let $d$ denote the embedding dimension, $n_{h}$ denote the number of attention heads, $d_{h}$ denote the dimension per head, and $\mathbf{h}_{t}\in\mathbb{R}^{d}$ denote the attention input for the $t$-th token at a given attention layer. The core of MLA is the low-rank joint compression for attention keys and values to reduce Key-Value (KV) cache during inference:
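For reference, the KV compression can be written as follows; this is a reconstruction based on the symbol definitions in the next paragraph and the MLA formulation inherited from DeepSeek-V2, so the notation may differ slightly from the original equations:

```latex
\begin{aligned}
\mathbf{c}_{t}^{KV} &= W^{DKV}\mathbf{h}_{t}, \\
[\mathbf{k}_{t,1}^{C};\,\mathbf{k}_{t,2}^{C};\,\dots;\,\mathbf{k}_{t,n_{h}}^{C}] = \mathbf{k}_{t}^{C} &= W^{UK}\mathbf{c}_{t}^{KV}, \\
\mathbf{k}_{t}^{R} &= \operatorname{RoPE}\!\left(W^{KR}\mathbf{h}_{t}\right), \\
\mathbf{k}_{t,i} &= [\mathbf{k}_{t,i}^{C};\,\mathbf{k}_{t}^{R}], \\
[\mathbf{v}_{t,1}^{C};\,\mathbf{v}_{t,2}^{C};\,\dots;\,\mathbf{v}_{t,n_{h}}^{C}] = \mathbf{v}_{t}^{C} &= W^{UV}\mathbf{c}_{t}^{KV}.
\end{aligned}
```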

where $\mathbf{c}_{t}^{KV}\in\mathbb{R}^{d_{c}}$ is the compressed latent vector for keys and values; $d_{c}(\ll d_{h}n_{h})$ indicates the KV compression dimension; $W^{DKV}\in\mathbb{R}^{d_{c}\times d}$ denotes the down-projection matrix; $W^{UK},W^{UV}\in\mathbb{R}^{d_{h}n_{h}\times d_{c}}$ are the up-projection matrices for keys and values, respectively; $W^{KR}\in\mathbb{R}^{d_{h}^{R}\times d}$ is the matrix used to produce the decoupled key that carries Rotary Positional Embedding (RoPE) (Su et al., 2024); $\operatorname{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices; and $[\cdot;\cdot]$ denotes concatenation. Note that for MLA, only the blue-boxed vectors (i.e., $\mathbf{c}_{t}^{KV}$ and $\mathbf{k}_{t}^{R}$) need to be cached during generation, which results in significantly reduced KV cache while maintaining performance comparable to standard Multi-Head Attention (MHA) (Vaswani et al., 2017).

For the attention queries, we also perform a low-rank compression, which can reduce the activation memory during training:

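Analogously, the query-side compression can be sketched as follows (again a reconstruction from the definitions below, with the decoupled RoPE queries produced from the compressed query latent):

```latex
\begin{aligned}
\mathbf{c}_{t}^{Q} &= W^{DQ}\mathbf{h}_{t}, \\
[\mathbf{q}_{t,1}^{C};\,\dots;\,\mathbf{q}_{t,n_{h}}^{C}] = \mathbf{q}_{t}^{C} &= W^{UQ}\mathbf{c}_{t}^{Q}, \\
[\mathbf{q}_{t,1}^{R};\,\dots;\,\mathbf{q}_{t,n_{h}}^{R}] = \mathbf{q}_{t}^{R} &= \operatorname{RoPE}\!\left(W^{QR}\mathbf{c}_{t}^{Q}\right), \\
\mathbf{q}_{t,i} &= [\mathbf{q}_{t,i}^{C};\,\mathbf{q}_{t,i}^{R}].
\end{aligned}
```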

where $\mathbf{c}_{t}^{Q}\in\mathbb{R}^{d_{c}^{\prime}}$ is the compressed latent vector for queries; $d_{c}^{\prime}(\ll d_{h}n_{h})$ denotes the query compression dimension; $W^{DQ}\in\mathbb{R}^{d_{c}^{\prime}\times d}$ and $W^{UQ}\in\mathbb{R}^{d_{h}n_{h}\times d_{c}^{\prime}}$ are the down-projection and up-projection matrices for queries, respectively; and $W^{QR}\in\mathbb{R}^{d_{h}^{R}n_{h}\times d_{c}^{\prime}}$ is the matrix to produce the decoupled queries that carry RoPE.

Ultimately, the attention queries ($\mathbf{q}_{t,i}$), keys ($\mathbf{k}_{j,i}$), and values ($\mathbf{v}_{j,i}^{C}$) are combined to yield the final attention output $\mathbf{u}_{t}$:
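The combination is standard scaled-dot-product attention over the concatenated compressed and decoupled-RoPE components; a reconstructed form consistent with the definitions above (the scaling by $\sqrt{d_{h}+d_{h}^{R}}$ reflects the concatenated head dimension) is:

```latex
\begin{aligned}
\mathbf{o}_{t,i} &= \sum_{j=1}^{t}\operatorname{Softmax}_{j}\!\left(\frac{\mathbf{q}_{t,i}^{T}\mathbf{k}_{j,i}}{\sqrt{d_{h}+d_{h}^{R}}}\right)\mathbf{v}_{j,i}^{C}, \\
\mathbf{u}_{t} &= W^{O}\,[\mathbf{o}_{t,1};\,\mathbf{o}_{t,2};\,\dots;\,\mathbf{o}_{t,n_{h}}].
\end{aligned}
```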

where $W^{O}\in\mathbb{R}^{d\times d_{h}n_{h}}$ denotes the output projection matrix.

Basic Architecture of DeepSeekMoE. For Feed-Forward Networks (FFNs), DeepSeek-V3 employs the DeepSeekMoE architecture (Dai et al., 2024). Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones. Let $\mathbf{u}_{t}$ denote the FFN input of the $t$-th token, we compute the FFN output $\mathbf{h}_{t}^{\prime}$ as follows:

where $N_{s}$ and $N_{r}$ denote the numbers of shared experts and routed experts, respectively; $\mathrm{FFN}_{i}^{(s)}(\cdot)$ and $\mathrm{FFN}_{i}^{(r)}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_{r}$ denotes the number of activated routed experts; $g_{i,t}$ is the gating value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathbf{e}_{i}$ is the centroid vector of the $i$-th routed expert; and $\mathrm{Topk}(\cdot,K)$ denotes the set comprising the $K$ highest scores among the affinity scores calculated for the $t$-th token and all routed experts. Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.
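To make the routing concrete, the following is a minimal PyTorch sketch of the gating path described above: sigmoid token-to-expert affinities, top-$K_r$ selection, normalization of the selected affinities into gating values, and shared experts applied to every token. The module sizes, the two-layer expert MLPs, and the per-expert loop are illustrative simplifications rather than the actual DeepSeek-V3 implementation.

```python
import torch
import torch.nn as nn

class DeepSeekMoESketch(nn.Module):
    """Sketch of the DeepSeekMoE FFN: shared experts plus sigmoid-gated routed experts."""

    def __init__(self, d_model=64, d_ff=128, n_shared=1, n_routed=8, k_routed=2):
        super().__init__()
        self.k_routed = k_routed
        make_expert = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared_experts = nn.ModuleList([make_expert() for _ in range(n_shared)])
        self.routed_experts = nn.ModuleList([make_expert() for _ in range(n_routed)])
        # Centroid vector e_i of each routed expert, used for the token-to-expert affinity s_{i,t}.
        self.centroids = nn.Parameter(torch.randn(n_routed, d_model) * 0.02)

    def forward(self, u):                                     # u: (tokens, d_model), FFN input u_t
        s = torch.sigmoid(u @ self.centroids.t())             # affinities s_{i,t}
        topk_s, topk_idx = torch.topk(s, self.k_routed, dim=-1)
        gates = topk_s / topk_s.sum(dim=-1, keepdim=True)     # normalize selected scores -> g_{i,t}
        out = u.clone()                                       # residual connection around the FFN
        for expert in self.shared_experts:                    # shared experts see every token
            out = out + expert(u)
        for slot in range(self.k_routed):                     # routed experts see only their tokens
            idx, g = topk_idx[:, slot], gates[:, slot:slot + 1]
            for e, expert in enumerate(self.routed_experts):
                mask = idx == e
                if mask.any():
                    out[mask] = out[mask] + g[mask] * expert(u[mask])
        return out

print(DeepSeekMoESketch()(torch.randn(5, 64)).shape)          # torch.Size([5, 64])
```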

Auxiliary-Loss-Free Load Balancing. For MoE models, an unbalanced expert load will lead to routing collapse (Shazeer et al., 2017) and diminish computational efficiency in scenarios with expert parallelism. Conventional solutions usually rely on the auxiliary loss (Fedus et al., 2021; Lepikhin et al., 2021) to avoid unbalanced load. However, too large an auxiliary loss will impair the model performance (Wang et al., 2024a). To achieve a better trade-off between load balance and model performance, we pioneer an auxiliary-loss-free load balancing strategy (Wang et al., 2024a) to ensure load balance. To be specific, we introduce a bias term $b_{i}$ for each expert and add it to the corresponding affinity scores $s_{i,t}$ to determine the top-K routing:


Note that the bias term is only used for routing. The gating value, which will be multiplied with the FFN output, is still derived from the original affinity score $s_{i,t}$. During training, we keep monitoring the expert load on the whole batch of each training step. At the end of each step, we will decrease the bias term by $\gamma$ if its corresponding expert is overloaded, and increase it by $\gamma$ if its corresponding expert is underloaded, where $\gamma$ is a hyper-parameter called the bias update speed. Through this dynamic adjustment, DeepSeek-V3 keeps balanced expert load during training, and achieves better performance than models that encourage load balance through pure auxiliary losses.
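A minimal sketch of this mechanism is given below: the bias only influences which experts are selected, the gating values still come from the raw affinities, and after each step every expert's bias is nudged by an update speed `gamma` according to whether its load in the batch was above or below the mean. The mean-load threshold and the variable names are illustrative assumptions; the text above only specifies "overloaded" and "underloaded".

```python
import torch

def route_with_bias(s, bias, k):
    """Top-k selection on biased scores; gating values come from the original affinities s."""
    _, topk_idx = torch.topk(s + bias, k, dim=-1)               # selection uses s_{i,t} + b_i
    picked = torch.gather(s, 1, topk_idx)                       # gates use the original s_{i,t}
    return topk_idx, picked / picked.sum(dim=-1, keepdim=True)

def update_bias(bias, topk_idx, n_experts, gamma=1e-3):
    """After each training step: decrease b_i for overloaded experts, increase it for underloaded ones."""
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    bias = torch.where(load > load.mean(), bias - gamma, bias)  # overloaded -> push tokens away
    bias = torch.where(load < load.mean(), bias + gamma, bias)  # underloaded -> attract tokens
    return bias

# usage sketch: 16 tokens, 8 experts, top-2 routing
s = torch.sigmoid(torch.randn(16, 8))
bias = torch.zeros(8)
topk_idx, gates = route_with_bias(s, bias, k=2)
bias = update_bias(bias, topk_idx, n_experts=8)
```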

Complementary Sequence-Wise Auxiliary Loss. Although DeepSeek-V3 mainly relies on the auxiliary-loss-free strategy for load balance, to prevent extreme imbalance within any single sequence, we also employ a complementary sequence-wise balance loss:


where the balance factor $\alpha$ is a hyper-parameter, which will be assigned an extremely small value for DeepSeek-V3; $\mathbb{1}(\cdot)$ denotes the indicator function; and $T$ denotes the number of tokens in a sequence. The sequence-wise balance loss encourages the expert load on each sequence to be balanced.
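Because the text above does not spell out the loss, the snippet below sketches one plausible instantiation consistent with the description: per expert, it multiplies the (scaled) fraction of the sequence's tokens routed to that expert by the expert's mean normalized affinity over the sequence, sums over experts, and weights the result by the small factor $\alpha$. The exact formula used by DeepSeek-V3 is given in the original paper, so treat this purely as an illustration.

```python
import torch

def sequence_balance_loss(s, topk_idx, n_routed, k_routed, alpha=1e-4):
    """Hypothetical sequence-wise balance loss for a single sequence.

    s:        (T, n_routed) token-to-expert affinities
    topk_idx: (T, k_routed) indices of the experts selected for each token
    """
    T = s.shape[0]
    selected = torch.zeros_like(s).scatter_(1, topk_idx, 1.0)       # indicator of selection
    f = selected.sum(dim=0) * n_routed / (k_routed * T)             # scaled fraction of tokens per expert
    p = (s / s.sum(dim=-1, keepdim=True)).mean(dim=0)               # mean normalized affinity per expert
    return alpha * (f * p).sum()

# usage: T=32 tokens, 8 routed experts, top-2 routing
s = torch.sigmoid(torch.randn(32, 8))
_, topk_idx = torch.topk(s, 2, dim=-1)
print(sequence_balance_loss(s, topk_idx, n_routed=8, k_routed=2))
```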


Figure 3 | Illustration of our Multi-Token Prediction (MTP) implementation. We keep the complete causal chain for the prediction of each token at each depth.


Node-Limited Routing. Like the device-limited routing used by DeepSeek-V2, DeepSeek-V3 also uses a restricted routing mechanism to limit communication costs during training. In short, we ensure that each token will be sent to at most $M$ nodes, which are selected according to the sum of the highest $\frac{K_{r}}{M}$ affinity scores of the experts distributed on each node. Under this constraint, our MoE training framework can nearly achieve full computation-communication overlap.
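A sketch of the node-selection step follows: experts are grouped by node, each node is scored by the sum of its highest $K_r/M$ affinities, each token is restricted to its $M$ best nodes, and the usual top-$K_r$ selection then runs inside that restricted set. The tensor layout and masking trick are illustrative; the actual implementation is fused with the communication kernels described later.

```python
import torch

def node_limited_topk(s, experts_per_node, max_nodes, k_routed):
    """Restrict each token's routed experts to at most `max_nodes` nodes.

    s: (tokens, n_experts) affinity scores; experts are assumed to be laid out node by node.
    """
    tokens, n_experts = s.shape
    n_nodes = n_experts // experts_per_node
    per_node = s.view(tokens, n_nodes, experts_per_node)
    k_per_node = max(1, k_routed // max_nodes)                          # highest K_r / M scores per node
    node_score = per_node.topk(k_per_node, dim=-1).values.sum(dim=-1)   # (tokens, n_nodes)
    best_nodes = node_score.topk(max_nodes, dim=-1).indices             # (tokens, max_nodes)
    # Mask out experts on non-selected nodes, then perform the usual top-K_r selection.
    node_mask = torch.zeros(tokens, n_nodes).scatter_(1, best_nodes, 1.0).bool()
    expert_mask = node_mask.repeat_interleave(experts_per_node, dim=1)  # (tokens, n_experts)
    return s.masked_fill(~expert_mask, float("-inf")).topk(k_routed, dim=-1).indices

# usage: 4 tokens, 4 nodes x 8 experts each, at most 2 nodes, 4 routed experts per token
s = torch.sigmoid(torch.randn(4, 32))
print(node_limited_topk(s, experts_per_node=8, max_nodes=2, k_routed=4))
```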

No Token-Dropping. Due to the effective load balancing strategy, DeepSeek-V3 keeps a good load balance during its full training. Therefore, DeepSeek-V3 does not drop any tokens during training. In addition, we also implement specific deployment strategies to ensure inference load balance, so DeepSeek-V3 also does not drop tokens during inference.


Inspired by Gloeckle et al. (2024), we investigate and set a Multi-Token Prediction (MTP) objective for DeepSeek-V3, which extends the prediction scope to multiple future tokens at each position. On the one hand, an MTP objective densifies the training signals and may improve data efficiency. On the other hand, MTP may enable the model to pre-plan its representations for better prediction of future tokens. Figure 3 illustrates our implementation of MTP. Different from Gloeckle et al. (2024), which parallelly predicts $D$ additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth. We introduce the details of our MTP implementation in this section.


MTP Modules. To be specific, our MTP implementation uses $D$ sequential modules to predict $D$ additional tokens. The $k$-th MTP module consists of a shared embedding layer $\operatorname{Emb}(\cdot)$, a shared output head $\operatorname{OutHead}(\cdot)$, a Transformer block $\mathrm{TRM}_{k}(\cdot)$, and a projection matrix $M_{k}\in\mathbb{R}^{d\times2d}$. For the $i$-th input token $t_{i}$, at the $k$-th prediction depth, we first combine the representation of the $i$-th token at the $(k-1)$-th depth, $\mathbf{h}_{i}^{k-1}\in\mathbb{R}^{d}$, with the embedding of the $(i+k)$-th token, $\operatorname{Emb}(t_{i+k})\in\mathbb{R}^{d}$:

where $[\cdot;\cdot]$ denotes concatenation. Especially, when $k=1$, $\mathbf{h}_{i}^{k-1}$ refers to the representation given by the main model. Note that for each MTP module, its embedding layer is shared with the main model. The combined $\mathbf{h}_{i}^{\prime k}$ serves as the input of the Transformer block at the $k$-th depth to produce the output representation at the current depth, $\mathbf{h}_{i}^{k}$:

where $T$ represents the input sequence length and $i\!:\!j$ denotes the slicing operation (inclusive of both the left and right boundaries). Finally, taking $\mathbf{h}_{i}^{k}$ as the input, the shared output head will compute the probability distribution for the $k$-th additional prediction token, $P_{i+1+k}^{k}\in\mathbb{R}^{V}$, where $V$ is the vocabulary size:

The output head OutHead $(cdot)$ linearly maps the representation to logits and subsequently applies the Softmax(·) function to compute the prediction probabilities of the $k$ -th additional token. Also, for each MTP module, its output head is shared with the main model. Our principle of maintaining the causal chain of predictions is similar to that of EAGLE (Li et al., 2024b), but its primary objective is speculative decoding (Leviathan et al., 2023; Xia et al., 2023), whereas we utilize MTP to improve training.

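To make the data flow concrete, here is a compact PyTorch sketch of a single MTP depth: the previous depth's representation and the embedding of the $(i+k)$-th token are normalized, concatenated, projected back to width $d$ by $M_k$, passed through the depth's own Transformer block, and decoded by the shared output head. The LayerNorm (standing in for RMSNorm), the feed-forward stand-in for $\mathrm{TRM}_k(\cdot)$, and all sizes are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class MTPDepth(nn.Module):
    """Sketch of the k-th MTP module; the embedding and output head are shared with the main model."""

    def __init__(self, d_model, shared_emb, shared_head):
        super().__init__()
        self.emb = shared_emb                                      # Emb(.) shared with the main model
        self.head = shared_head                                    # OutHead(.) shared with the main model
        self.norm_h = nn.LayerNorm(d_model)                        # stand-in for RMSNorm
        self.norm_e = nn.LayerNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model, bias=False)    # projection matrix M_k
        self.block = nn.Sequential(                                # stand-in for the Transformer block TRM_k(.)
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, h_prev, future_tokens):
        # h_prev: (T', d) representations h_i^{k-1}; future_tokens: (T',) token ids t_{i+k}
        h_comb = self.proj(torch.cat([self.norm_h(h_prev),
                                      self.norm_e(self.emb(future_tokens))], dim=-1))
        h_k = h_comb + self.block(h_comb)                          # output representation h_i^k at depth k
        return h_k, self.head(h_k)                                 # shared head -> distribution P^k

# usage: vocabulary 1000, width 64, 16 positions predicting one extra token ahead
emb, head = nn.Embedding(1000, 64), nn.Linear(64, 1000, bias=False)
h_k, logits = MTPDepth(64, emb, head)(torch.randn(16, 64), torch.randint(0, 1000, (16,)))
print(h_k.shape, logits.shape)                                     # (16, 64) and (16, 1000)
```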

MTP Training Objective. For each prediction depth, we compute a cross-entropy loss $\mathcal{L}_{\mathrm{MTP}}^{k}$:

where $T$ denotes the input sequence length, $t_{i}$ denotes the ground-truth token at the $i$-th position, and $P_{i}^{k}[t_{i}]$ denotes the corresponding prediction probability of $t_{i}$, given by the $k$-th MTP module. Finally, we compute the average of the MTP losses across all depths and multiply it by a weighting factor $\lambda$ to obtain the overall MTP loss $\mathcal{L}_{\mathrm{MTP}}$, which serves as an additional training objective for DeepSeek-V3:
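In equation form, the per-depth loss and the overall objective can be reconstructed from this description as follows (index conventions may differ slightly from the original):

```latex
\mathcal{L}_{\mathrm{MTP}}^{k}
  = \operatorname{CrossEntropy}\!\left(P_{2+k:T+1}^{k},\; t_{2+k:T+1}\right)
  = -\frac{1}{T}\sum_{i=2+k}^{T+1}\log P_{i}^{k}[t_{i}],
\qquad
\mathcal{L}_{\mathrm{MTP}} = \frac{\lambda}{D}\sum_{k=1}^{D}\mathcal{L}_{\mathrm{MTP}}^{k}.
```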

MTP in Inference. Our MTP strategy mainly aims to improve the performance of the main model, so during inference, we can directly discard the MTP modules and the main model can function independently and normally. Additionally, we can also repurpose these MTP modules for speculative decoding to further reduce the generation latency.

DeepSeek-V3 is trained on a cluster equipped with 2048 NVIDIA H800 GPUs. Each node in the H800 cluster contains 8 GPUs connected by NVLink and NVSwitch within nodes. Across different nodes, InfiniBand (IB) interconnects are utilized to facilitate communications.


The training of DeepSeek-V3 is supported by the HAI-LLM framework, an efficient and lightweight training framework crafted by our engineers from the ground up. On the whole, DeepSeek-V3 applies 16-way Pipeline Parallelism (PP) (Qi et al., 2023a), 64-way Expert Parallelism (EP) (Lepikhin et al., 2021) spanning 8 nodes, and ZeRO-1 Data Parallelism (DP) (Rajbhandari et al., 2020).


In order to facilitate efficient training of DeepSeek-V3, we implement meticulous engineering optimizations. Firstly, we design the DualPipe algorithm for efficient pipeline parallelism. Compared with existing PP methods, DualPipe has fewer pipeline bubbles. More importantly, it overlaps the computation and communication phases across forward and backward processes, thereby addressing the challenge of heavy communication overhead introduced by cross-node expert parallelism. Secondly, we develop efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths and conserve Streaming Multiprocessors (SMs) dedicated to communication. Finally, we meticulously optimize the memory footprint during training, thereby enabling us to train DeepSeek-V3 without using costly Tensor Parallelism (TP).

For DeepSeek-V3, the communication overhead introduced by cross-node expert parallelism results in an inefficient computation-to-communication ratio of approximately 1:1. To tackle this challenge, we design an innovative pipeline parallelism algorithm called DualPipe, which not only accelerates model training by effectively overlapping forward and backward computationcommunication phases, but also reduces the pipeline bubbles.


The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specifically, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, as in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component. As illustrated in Figure 4, for a pair of forward and backward chunks, we rearrange these components and manually adjust the ratio of GPU SMs dedicated to communication versus computation. In this overlapping strategy, we can ensure that both all-to-all and PP communication can be fully hidden during execution. Given the efficient overlapping strategy, the full DualPipe scheduling is illustrated in Figure 5. It employs a bidirectional pipeline scheduling, which feeds micro-batches from both ends of the pipeline simultaneously, so that a significant portion of the communication can be fully overlapped. This overlap also ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead.

Figure 5 | Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication.


| Method | Bubble | Parameter | Activation |
| --- | --- | --- | --- |
| 1F1B | $(PP-1)(F+B)$ | $1\times$ | $PP$ |
| ZB1P | $(PP-1)(F+B-2W)$ | $1\times$ | $PP$ |
| DualPipe (Ours) | $(\frac{PP}{2}-1)(F\&B+B-3W)$ | $2\times$ | $PP+1$ |

Table 2 | Comparison of pipeline bubbles and memory usage across different pipeline parallel methods. $F$ denotes the execution time of a forward chunk, $B$ denotes the execution time of a full backward chunk, $W$ denotes the execution time of a "backward for weights" chunk, and $F\&B$ denotes the execution time of two mutually overlapped forward and backward chunks.

In addition, even in more general scenarios without a heavy communication burden, DualPipe still exhibits efficiency advantages. In Table 2, we summarize the pipeline bubbles and memory usage across different PP methods. As shown in the table, compared with ZB1P (Qi et al., 2023b) and 1F1B (Harlap et al., 2018), DualPipe significantly reduces the pipeline bubbles while only increasing the peak activation memory by $\frac{1}{PP}$ times. Although DualPipe requires keeping two copies of the model parameters, this does not significantly increase the memory consumption since we use a large EP size during training. Compared with Chimera (Li and Hoefler, 2021), DualPipe only requires that the pipeline stages and micro-batches be divisible by 2, without requiring micro-batches to be divisible by pipeline stages. In addition, for DualPipe, neither the bubbles nor activation memory will increase as the number of micro-batches grows.
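As a quick sanity check of Table 2, the bubble expressions can be evaluated numerically; the chunk timings and pipeline degree below are made-up values purely for illustration:

```python
# Pipeline-bubble expressions from Table 2, evaluated for illustrative chunk timings.
PP = 8                      # number of pipeline stages
F, B, W = 1.0, 2.0, 0.7     # forward, full backward, backward-for-weights chunk times (arbitrary units)
F_and_B = 2.6               # two mutually overlapped forward and backward chunks (assumed < F + B)

bubbles = {
    "1F1B":            (PP - 1) * (F + B),
    "ZB1P":            (PP - 1) * (F + B - 2 * W),
    "DualPipe (Ours)": (PP / 2 - 1) * (F_and_B + B - 3 * W),
}
for method, bubble in bubbles.items():
    print(f"{method:16s} bubble = {bubble:.2f}")
```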

In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB (50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is instantaneously forwarded via NVLink to specific GPUs that host their target experts, without being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink are fully overlapped, and each token can efficiently select an average of 3.2 experts per node without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3 selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts (4 nodes $\times$ 3.2 experts/node) while preserving the same communication cost. Overall, under such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB and NVLink.

In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2) IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The number of warps allocated to each communication task is dynamically adjusted according to the actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending, (2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also handled by dynamically adjusted warps. In addition, both dispatching and combining kernels overlap with the computation stream, so we also consider their impact on other SM computation kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.


In order to reduce the memory footprint during training, we employ the following techniques.


Recomputation of RMSNorm and MLA Up-Projection. We recompute all RMSNorm operations and MLA up-projections during back-propagation, thereby eliminating the need to persistently store their output activations. With a minor overhead, this strategy significantly reduces memory requirements for storing activations.
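In PyTorch terms, this kind of selective recomputation corresponds to activation checkpointing, where the wrapped sub-module (a stand-in below for an RMSNorm or MLA up-projection) is re-executed during the backward pass instead of having its outputs stored; this is only a generic illustration of the technique, not the framework's actual mechanism.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

norm_and_up = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 256))  # stand-in for RMSNorm + up-projection
x = torch.randn(8, 64, requires_grad=True)

y = checkpoint(norm_and_up, x, use_reentrant=False)  # outputs are recomputed in backward, not stored
y.sum().backward()
print(x.grad.shape)
```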

Exponential Moving Average in CPU. During training, we preserve the Exponential Moving Average (EMA) of the model parameters for early estimation of the model performance after learning rate decay. The EMA parameters are stored in CPU memory and are updated asynchronously after each training step. This method allows us to maintain EMA parameters without incurring additional memory or time overhead.

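A minimal sketch of the idea, assuming a simple decay schedule (the update is written synchronously here for clarity; the actual implementation overlaps the CPU-side update with the next training step):

```python
import torch

def init_ema(model):
    """Keep an EMA copy of the parameters in CPU memory."""
    return {name: p.detach().to("cpu", copy=True) for name, p in model.named_parameters()}

@torch.no_grad()
def update_ema(ema, model, decay=0.999):
    """Called after each optimizer step; the device-to-CPU copy can be issued asynchronously."""
    for name, p in model.named_parameters():
        cpu_param = p.detach().to("cpu", non_blocking=True)
        ema[name].mul_(decay).add_(cpu_param, alpha=1.0 - decay)

# usage
model = torch.nn.Linear(16, 16)
ema = init_ema(model)
# ... after each optimizer.step() ...
update_ema(ema, model)
```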

Shared Embedding and Output Head for Multi-Token Prediction. With the DualPipe strategy, we deploy the shallowest layers (including the embedding layer) and the deepest layers (including the output head) of the model on the same PP rank. This arrangement enables the physical sharing of parameters and gradients of the shared embedding and output head between the MTP modules and the main model. This physical sharing mechanism further enhances our memory efficiency.

Inspired by recent advances in low-precision training (Dettmers et al., 2022; Noune et al., 2022; Peng et al., 2023b), we propose a fine-grained mixed precision framework utilizing the FP8 data format for training DeepSeek-V3. While low-precision training holds great promise, it is often limited by the presence of outliers in activations, weights, and gradients (Fishman et al., 2024; He et al.; Sun et al., 2024). Although significant progress has been made in inference quantization (Frantar et al., 2022; Xiao et al., 2023), there are relatively few studies demonstrating successful application of low-precision techniques in large-scale language model pre-training (Fishman et al., 2024). To address this challenge and effectively extend the dynamic range of the FP8 format, we introduce a fine-grained quantization strategy: tile-wise grouping with $1\times N_{c}$ elements or block-wise grouping with $N_{c}\times N_{c}$ elements. The associated dequantization overhead is largely mitigated under our increased-precision accumulation process, a critical aspect for achieving accurate FP8 General Matrix Multiplication (GEMM). Moreover, to further reduce memory and communication overhead in MoE training, we cache and dispatch activations in FP8, while storing low-precision optimizer states in BF16. We validate the proposed FP8 mixed precision framework on two model scales similar to DeepSeek-V2-Lite and DeepSeek-V2, training for approximately 1 trillion tokens (see more details in Appendix B.1). Notably, compared with the BF16 baseline, the relative loss error of our FP8-training model remains consistently below 0.25%, a level well within the acceptable range of training randomness.


Figure 6 | The overall mixed precision framework with FP8 data format. For clarification, only the Linear operator is illustrated.


Building upon widely adopted techniques in low-precision training (Kalamkar et al., 2019; Narang et al., 2017), we propose a mixed precision framework for FP8 training. In this framework, most compute-intensive operations are conducted in FP8, while a few key operations are strategically maintained in their original data formats to balance training efficiency and numerical stability. The overall framework is illustrated in Figure 6.

Firstly, in order to accelerate model training, the majority of core computation kernels, i.e., GEMM operations, are implemented in FP8 precision. These GEMM operations accept FP8 tensors as inputs and produce outputs in BF16 or FP32. As depicted in Figure 6, all three GEMMs associated with the Linear operator, namely Fprop (forward pass), Dgrad (activation backward pass), and Wgrad (weight backward pass), are executed in FP8. This design theoretically doubles the computational speed compared with the original BF16 method. Additionally, the FP8 Wgrad GEMM allows activations to be stored in FP8 for use in the backward pass. This significantly reduces memory consumption.

Despite the efficiency advantage of the FP8 format, certain operators still require a higher precision due to their sensitivity to low-precision computations. Besides, some low-cost operators can also utilize a higher precision with a negligible overhead to the overall training cost. For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators. These targeted retentions of high precision ensure stable training dynamics for DeepSeek-V3. To further guarantee numerical stability, we store the master weights, weight gradients, and optimizer states in higher precision. While these high-precision components incur some memory overheads, their impact can be minimized through efficient sharding across multiple DP ranks in our distributed training system.



Figure 7 | (a) We propose a fine-grained quantization method to mitigate quantization errors caused by feature outliers; for illustration simplicity, only Fprop is illustrated. (b) In conjunction with our quantization strategy, we improve the FP8 GEMM precision by promoting to CUDA Cores at an interval of $N_{C}=128$ elements MMA for the high-precision accumulation.

圖 7 | (a) 我們提出了一種細(xì)粒度的量化方法,以減輕由特征異常值引起的量化誤差;為了簡(jiǎn)化說(shuō)明,僅展示了 Fprop。(b) 結(jié)合我們的量化策略,我們以 $N_C = 128$ 個(gè)元素為間隔將部分結(jié)果提升到 CUDA Cores 進(jìn)行高精度累加,從而提高 FP8 GEMM 的精度。



Based on our mixed precision FP8 framework, we introduce several strategies to enhance low-precision training accuracy, focusing on both the quantization method and the multiplication process.

基于我們的混合精度 FP8 框架,我們引入了多種策略來(lái)提高低精度訓(xùn)練的準(zhǔn)確性,重點(diǎn)關(guān)注量化方法和乘法過(guò)程。

Fine-Grained Quantization. In low-precision training frameworks, overflows and underflows are common challenges due to the limited dynamic range of the FP8 format, which is constrained by its reduced exponent bits. As a standard practice, the input distribution is aligned to the representable range of the FP8 format by scaling the maximum absolute value of the input tensor to the maximum representable value of FP8 (Narang et al., 2017). This method makes low-precision training highly sensitive to activation outliers, which can heavily degrade quantization accuracy. To solve this, we propose a fine-grained quantization method that applies scaling at a more granular level. As illustrated in Figure 7 (a), (1) for activations, we group and scale elements on a 1x128 tile basis (i.e., per token per 128 channels); and (2) for weights, we group and scale elements on a 128x128 block basis (i.e., per 128 input channels per 128 output channels). This approach ensures that the quantization process can better accommodate outliers by adapting the scale according to smaller groups of elements. In Appendix B.2, we further discuss the training instability that arises when we group and scale activations on a block basis in the same way as weight quantization.

細(xì)粒度量化。在低精度訓(xùn)練框架中,由于 FP8 格式的動(dòng)態(tài)范圍有限(受其較少的指數(shù)位限制),溢出和下溢是常見(jiàn)的挑戰(zhàn)。作為一種標(biāo)準(zhǔn)做法,輸入分布通過(guò)將輸入張量的最大絕對(duì)值縮放到 FP8 的最大可表示值來(lái)與 FP8 格式的可表示范圍對(duì)齊 (Narang et al., 2017)。這種方法使得低精度訓(xùn)練對(duì)激活異常值高度敏感,會(huì)嚴(yán)重降低量化精度。為了解決這個(gè)問(wèn)題,我們提出了一種細(xì)粒度量化方法,在更細(xì)的粒度上應(yīng)用縮放。如圖 7 (a) 所示,(1) 對(duì)于激活值,我們?cè)?1x128 的圖塊基礎(chǔ)上對(duì)元素進(jìn)行分組和縮放(即每個(gè) token 每 128 個(gè)通道);(2) 對(duì)于權(quán)重,我們?cè)?128x128 的塊基礎(chǔ)上對(duì)元素進(jìn)行分組和縮放(即每 128 個(gè)輸入通道每 128 個(gè)輸出通道)。這種方法通過(guò)根據(jù)較小的元素組調(diào)整縮放比例,確保量化過(guò)程能夠更好地適應(yīng)異常值。在附錄 B.2 中,我們進(jìn)一步討論了當(dāng)以與權(quán)重量化相同的方式在塊基礎(chǔ)上對(duì)激活值進(jìn)行分組和縮放時(shí)出現(xiàn)的訓(xùn)練不穩(wěn)定性。
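
A minimal NumPy sketch of the tile/block scaling described above. The FP8 cast itself is only approximated by clipping to E4M3's maximum of 448, and the shapes are illustrative; the point is the granularity of the scaling factors (one per 1x128 activation tile, one per 128x128 weight block).

```python
import numpy as np

FP8_MAX = 448.0  # E4M3 maximum normal magnitude

def quant_activation_1x128(x):
    """Scale activations per 1x128 tile: one scale per token per 128 channels."""
    t, c = x.shape
    assert c % 128 == 0
    tiles = x.reshape(t, c // 128, 128)
    amax = np.abs(tiles).max(axis=-1, keepdims=True) + 1e-12
    scale = FP8_MAX / amax                              # shape (t, c//128, 1)
    x_fp8 = np.clip(tiles * scale, -FP8_MAX, FP8_MAX)   # FP8 cast approximated by clipping
    return x_fp8.reshape(t, c), scale

def quant_weight_128x128(w):
    """Scale weights per 128x128 block: one scale per 128 input x 128 output channels."""
    o, i = w.shape
    assert o % 128 == 0 and i % 128 == 0
    blocks = w.reshape(o // 128, 128, i // 128, 128)
    amax = np.abs(blocks).max(axis=(1, 3), keepdims=True) + 1e-12
    scale = FP8_MAX / amax
    w_fp8 = np.clip(blocks * scale, -FP8_MAX, FP8_MAX)
    return w_fp8.reshape(o, i), scale

x_fp8, sx = quant_activation_1x128(np.random.randn(4, 7168))
w_fp8, sw = quant_weight_128x128(np.random.randn(2048, 7168))
print(sx.shape, sw.shape)   # (4, 56, 1) (16, 1, 56, 1)
```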

One key modification in our method is the introduction of per-group scaling factors along the inner dimension of GEMM operations. This functionality is not directly supported in the standard FP8 GEMM. However, combined with our precise FP32 accumulation strategy, it can be efficiently implemented.

我們方法的一個(gè)關(guān)鍵修改是引入了沿 GEMM 操作內(nèi)部維度的每組縮放因子。這一功能在標(biāo)準(zhǔn)的 FP8 GEMM 中并不直接支持。然而,結(jié)合我們精確的 FP32 累加策略,它可以被高效實(shí)現(xiàn)。

Notably, our fine-grained quantization strategy is highly consistent with the idea of microscaling formats (Rouhani et al., 2023b), while the Tensor Cores of NVIDIA next-generation GPUs (Blackwell series) have announced support for microscaling formats with smaller quantization granularity (NVIDIA, 2024a). We hope our design can serve as a reference for future work to keep pace with the latest GPU architectures.

值得注意的是,我們的細(xì)粒度量化策略與微縮放格式 (microscaling formats) 的理念高度一致 (Rouhani et al., 2023b),而 NVIDIA 下一代 GPU (Blackwell 系列) 的 Tensor Cores 已宣布支持具有更小量化粒度的微縮放格式 (NVIDIA, 2024a)。我們希望我們的設(shè)計(jì)能夠?yàn)槲磥?lái)的工作提供參考,以跟上最新的 GPU 架構(gòu)。

Increasing Accumulation Precision. Low-precision GEMM operations often suffer from underflow issues, and their accuracy largely depends on high-precision accumulation, which is commonly performed in FP32 precision (Kalamkar et al., 2019; Narang et al., 2017). However, we observe that the accumulation precision of FP8 GEMM on NVIDIA H800 GPUs is limited to retaining around 14 bits, which is significantly lower than FP32 accumulation precision. This problem becomes more pronounced when the inner dimension K is large (Wortsman et al., 2023), a typical scenario in large-scale model training where the batch size and model width are increased. Taking GEMM operations of two random matrices with $K = 4096$ as an example, in our preliminary test, the limited accumulation precision in Tensor Cores results in a maximum relative error of nearly 2%. Despite these problems, the limited accumulation precision is still the default option in a few FP8 frameworks (NVIDIA, 2024b), severely constraining the training accuracy.

提高累加精度。低精度 GEMM 操作經(jīng)常面臨下溢問(wèn)題,其精度很大程度上依賴于高精度累加,通常以 FP32 精度執(zhí)行 (Kalamkar et al., 2019; Narang et al., 2017)。然而,我們觀察到,在 NVIDIA H800 GPU 上,F(xiàn)P8 GEMM 的累加精度僅限于保留約 14 位,這顯著低于 FP32 的累加精度。當(dāng)內(nèi)維度 K 較大時(shí) (Wortsman et al., 2023),這一問(wèn)題將更加明顯,這是大規(guī)模模型訓(xùn)練中增加批量大小和模型寬度的典型場(chǎng)景。以?xún)蓚€(gè) $K = 4096$ 的隨機(jī)矩陣的 GEMM 操作為例,在我們的初步測(cè)試中,Tensor Cores 中有限的累加精度導(dǎo)致最大相對(duì)誤差接近 2%。盡管存在這些問(wèn)題,有限的累加精度仍然是少數(shù) FP8 框架 (NVIDIA, 2024b) 中的默認(rèn)選項(xiàng),嚴(yán)重限制了訓(xùn)練精度。

In order to address this issue, we adopt the strategy of promotion to CUDA Cores for higher precision (Thakkar et al., 2023). The process is illustrated in Figure 7 (b). To be specific, during MMA (Matrix Multiply-Accumulate) execution on Tensor Cores, intermediate results are accumulated using the limited bit width. Once an interval of $N_C$ is reached, these partial results are copied to FP32 registers on CUDA Cores, where full-precision FP32 accumulation is performed. As mentioned before, our fine-grained quantization applies per-group scaling factors along the inner dimension K. These scaling factors can be efficiently multiplied on the CUDA Cores as the dequantization process, with minimal additional computational cost.

為了解決這個(gè)問(wèn)題,我們采用了將計(jì)算提升到CUDA Cores以獲取更高精度的策略 (Thakkar et al., 2023)。該過(guò)程如圖7 (b)所示。具體來(lái)說(shuō),在Tensor Cores上執(zhí)行MMA(矩陣乘加)時(shí),中間結(jié)果使用有限的位寬進(jìn)行累加。一旦達(dá)到$N_{C}$的間隔,這些部分結(jié)果將被復(fù)制到CUDA Cores上的FP32寄存器中,并在那里執(zhí)行全精度的FP32累加。如前所述,我們的細(xì)粒度量化沿內(nèi)維度K應(yīng)用了每組的縮放因子。這些縮放因子可以在CUDA Cores上高效地乘以反量化過(guò)程,且額外的計(jì)算成本最小。

It is worth noting that this modification reduces the WGMMA (Warpgroup-level Matrix Multiply-Accumulate) instruction issue rate for a single warpgroup. However, on the H800 architecture, it is typical for two WGMMAs to persist concurrently: while one warpgroup performs the promotion operation, the other is able to execute the MMA operation. This design enables overlapping of the two operations, maintaining high utilization of Tensor Cores. Based on our experiments, setting $N_C = 128$ elements, equivalent to 4 WGMMAs, represents the minimal accumulation interval that can significantly improve precision without introducing substantial overhead.

值得注意的是,這種修改降低了單個(gè) warpgroup 的 WGMMA(Warpgroup-level Matrix Multiply-Accumulate)指令的發(fā)出率。然而,在 H800 架構(gòu)上,通常會(huì)有兩個(gè) WGMMA 同時(shí)存在:當(dāng)一個(gè) warpgroup 執(zhí)行提升操作時(shí),另一個(gè) warpgroup 能夠執(zhí)行 MMA 操作。這種設(shè)計(jì)使得兩個(gè)操作可以重疊,從而保持 Tensor Core 的高利用率。根據(jù)我們的實(shí)驗(yàn),設(shè)置 $N_C = 128$ 個(gè)元素(相當(dāng)于 4 個(gè) WGMMA)是最小的累加間隔,可以在不引入顯著開(kāi)銷的情況下顯著提高精度。
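
As a rough numerical illustration of why interval promotion helps (not a simulation of H800 Tensor Cores), the sketch below accumulates a length-4096 dot product either entirely in float16 or in float16 chunks of 128 elements that are promoted to a float32 accumulator, mirroring the $N_C = 128$ interval described above; the float16/float32 precision stand-ins are assumptions for illustration only.

```python
import numpy as np

def dot_limited(a, b):
    """Accumulate the whole inner dimension in float16 (stand-in for limited accumulator precision)."""
    acc = np.float16(0.0)
    for x, y in zip(a, b):
        acc = np.float16(acc + np.float16(x) * np.float16(y))
    return float(acc)

def dot_promoted(a, b, interval=128):
    """Accumulate in float16 within each interval, then promote the partial sum to an FP32
    accumulator, mimicking the copy to CUDA-core FP32 registers every N_C elements."""
    acc32 = 0.0
    for start in range(0, len(a), interval):
        acc16 = np.float16(0.0)
        for x, y in zip(a[start:start + interval], b[start:start + interval]):
            acc16 = np.float16(acc16 + np.float16(x) * np.float16(y))
        acc32 += float(acc16)
    return acc32

rng = np.random.default_rng(0)
a, b = rng.standard_normal(4096), rng.standard_normal(4096)
ref = float(np.dot(a, b))                      # full-precision reference
print("limited :", abs(dot_limited(a, b) - ref) / abs(ref))
print("promoted:", abs(dot_promoted(a, b) - ref) / abs(ref))
```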

Mantissa over Exponents. In contrast to the hybrid FP8 format adopted by prior work (NVIDIA, 2024b; Peng et al., 2023b; Sun et al., 2019b), which uses E4M3 (4-bit exponent and 3-bit mantissa) in Fprop and E5M2 (5-bit exponent and 2-bit mantissa) in Dgrad and Wgrad, we adopt the E4M3 format on all tensors for higher precision. We attribute the feasibility of this approach to our fine-grained quantization strategy, i.e., tile- and block-wise scaling. By operating on smaller element groups, our methodology effectively shares exponent bits among these grouped elements, mitigating the impact of the limited dynamic range.

尾數(shù)優(yōu)先于指數(shù)。與之前工作采用的混合 FP8 格式(NVIDIA, 2024b; Peng et al., 2023b; Sun et al., 2019b)不同,它們?cè)?Fprop 中使用 E4M3(4 位指數(shù)和 3 位尾數(shù)),在 Dgrad 和 Wgrad 中使用 E5M2(5 位指數(shù)和 2 位尾數(shù)),我們?cè)谒袕埩可喜捎?E4M3 格式以獲得更高的精度。我們將這種方法的可行性歸因于我們的細(xì)粒度量化策略,即分塊和塊級(jí)縮放。通過(guò)在較小的元素組上操作,我們的方法有效地在這些分組元素之間共享指數(shù)位,從而減輕了有限動(dòng)態(tài)范圍的影響。

Online Quantization. Delayed quantization is employed in tensor-wise quantization frameworks (NVIDIA, 2024b; Peng et al., 2023b), which maintains a history of the maximum absolute values across prior iterations to infer the current value. In order to ensure accurate scales and simplify the framework, we calculate the maximum absolute value online for each 1x128 activation tile or 128x128 weight block. Based on it, we derive the scaling factor and then quantize the activation or weight online into the FP8 format.

在線量化。張量級(jí)量化框架(NVIDIA, 2024b; Peng et al., 2023b)中采用了延遲量化,該方法通過(guò)維護(hù)先前迭代中的最大絕對(duì)值歷史來(lái)推斷當(dāng)前值。為了確保準(zhǔn)確的縮放比例并簡(jiǎn)化框架,我們?cè)诰€計(jì)算每個(gè) 1x128 激活塊或 128x128 權(quán)重塊的最大絕對(duì)值?;诖?,我們推導(dǎo)出縮放因子,然后在線將激活或權(quán)重量化為 FP8 格式。

In conjunction with our FP8 training framework, we further reduce the memory consumption and communication overhead by compressing cached activations and optimizer states into lower-precision formats.

結(jié)合我們的 FP8 訓(xùn)練框架,我們通過(guò)將緩存的激活值和優(yōu)化器狀態(tài)壓縮為低精度格式,進(jìn)一步減少了內(nèi)存消耗和通信開(kāi)銷。

Low-Precision Optimizer States. We adopt the BF16 data format instead of FP32 to track the first and second moments in the AdamW (Loshchilov and Hutter, 2017) optimizer, without incurring observable performance degradation. However, the master weights (stored by the optimizer) and gradients (used for batch size accumulation) are still retained in FP32 to ensure numerical stability throughout training.

低精度優(yōu)化器狀態(tài)。我們采用 BF16 數(shù)據(jù)格式而非 FP32 來(lái)跟蹤 AdamW (Loshchilov and Hutter, 2017) 優(yōu)化器中的一階和二階矩,而不會(huì)導(dǎo)致明顯的性能下降。然而,主權(quán)重(由優(yōu)化器存儲(chǔ))和梯度(用于批量大小累積)仍保留在 FP32 中,以確保整個(gè)訓(xùn)練過(guò)程中的數(shù)值穩(wěn)定性。

Low-Precision Activation. As illustrated in Figure 6, the Wgrad operation is performed in FP8. To reduce the memory consumption, it is a natural choice to cache activations in FP8 format for the backward pass of the Linear operator. However, special considerations are taken on several operators for low-cost high-precision training:

低精度激活。如圖 6 所示,Wgrad 操作在 FP8 中執(zhí)行。為了減少內(nèi)存消耗,將激活值以 FP8 格式緩存以用于 Linear 算子的反向傳播是一個(gè)自然的選擇。然而,為了低成本高精度訓(xùn)練,對(duì)幾個(gè)算子進(jìn)行了特殊考慮:

(1) Inputs of the Linear after the attention operator. These activations are also used in the backward pass of the attention operator, which makes them sensitive to precision. We adopt a customized E5M6 data format exclusively for these activations. Additionally, these activations are converted from a 1x128 quantization tile to a 128x128 tile in the backward pass. To avoid introducing extra quantization error, all the scaling factors are rounded to an integral power of 2 (a short sketch of this rounding follows the list below).

(1) 注意力算子后的線性層輸入。這些激活值也用于注意力算子的反向傳播,這使其對(duì)精度敏感。我們專門(mén)為這些激活值采用定制的 E5M6 數(shù)據(jù)格式。此外,這些激活值在反向傳播過(guò)程中將從 1x128 量化塊轉(zhuǎn)換為 128x128 塊。為了避免引入額外的量化誤差,所有的縮放因子都取 2 的整數(shù)冪(列表后附有該取整方式的簡(jiǎn)短示例)。

(2) Inputs of the SwiGLU operator in MoE. To further reduce the memory cost, we cache the inputs of the SwiGLU operator and recompute its output in the backward pass. These activation s are also stored in FP8 with our fine-grained quantization method, striking a balance between memory efficiency and computational accuracy.

(2) MoE 中 SwiGLU 算子的輸入。為了進(jìn)一步降低內(nèi)存成本,我們緩存了 SwiGLU 算子的輸入,并在反向傳播時(shí)重新計(jì)算其輸出。這些激活值也使用我們的細(xì)粒度量化方法以 FP8 格式存儲(chǔ),在內(nèi)存效率和計(jì)算精度之間取得了平衡。
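
The power-of-2 constraint on scaling factors mentioned in item (1) can be implemented by rounding the scale down to the nearest power of 2 that still keeps the scaled values inside the target format's range. A minimal sketch, using E4M3's 448 as an illustrative bound:

```python
import numpy as np

def pow2_scale(amax, fp8_max=448.0):
    """Return a scaling factor that is an integral power of 2 and keeps amax * scale <= fp8_max."""
    return 2.0 ** np.floor(np.log2(fp8_max / amax))

amax = np.array([0.37, 5.2, 300.0])     # per-tile maximum absolute values (illustrative)
scale = pow2_scale(amax)
print(scale)                            # powers of two: 1024, 64, 1
print(amax * scale <= 448.0)            # scaled values stay inside the representable range
```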

Low-Precision Communication. Communication bandwidth is a critical bottleneck in the training of MoE models. To alleviate this challenge, we quantize the activation before MoE up-projections into FP8 and then apply dispatch components, which is compatible with FP8 Fprop in MoE up-projections. Like the inputs of the Linear after the attention operator, scaling factors for this activation are integral powers of 2. A similar strategy is applied to the activation gradient before MoE down-projections. For both the forward and backward combine components, we retain them in BF16 to preserve training precision in critical parts of the training pipeline.

低精度通信。通信帶寬是 MoE 模型訓(xùn)練中的一個(gè)關(guān)鍵瓶頸。為了緩解這一挑戰(zhàn),我們?cè)?MoE 上投影之前將激活量化為 FP8,然后應(yīng)用調(diào)度組件,這與 MoE 上投影中的 FP8 Fprop 兼容。與注意力算子后的線性輸入類似,此激活的縮放因子是 2 的整數(shù)冪。類似的策略也應(yīng)用于 MoE 下投影之前的激活梯度。對(duì)于前向和后向組合組件,我們將其保留為 BF16,以保持訓(xùn)練管道關(guān)鍵部分的訓(xùn)練精度。

We deploy DeepSeek-V3 on the H800 cluster, where GPUs within each node are interconnected using NVLink, and all GPUs across the cluster are fully interconnected via IB. To simultaneously ensure both the Service-Level Objective (SLO) for online services and high throughput, we employ the following deployment strategy that separates the prefilling and decoding stages.

我們?cè)?H800 集群上部署了 DeepSeek-V3,其中每個(gè)節(jié)點(diǎn)內(nèi)的 GPU 通過(guò) NVLink 互連,集群中的所有 GPU 通過(guò) IB 完全互連。為了同時(shí)確保在線服務(wù)的服務(wù)級(jí)別目標(biāo) (SLO) 和高吞吐量,我們采用了以下部署策略,將預(yù)填充和解碼階段分開(kāi)。

The minimum deployment unit of the prefilling stage consists of 4 nodes with 32 GPUs. The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), combined with 8-way Data Parallelism (DP8). Its small TP size of 4 limits the overhead of TP communication. For the MoE part, we use 32-way Expert Parallelism (EP32), which ensures that each expert processes a sufficiently large batch size, thereby enhancing computational efficiency. For the MoE all-to-all communication, we use the same method as in training: first transferring tokens across nodes via IB, and then forwarding among the intra-node GPUs via NVLink. In particular, we use 1-way Tensor Parallelism for the dense MLPs in shallow layers to save TP communication.

預(yù)填充階段的最小部署單元由4個(gè)節(jié)點(diǎn)和32個(gè)GPU組成。注意力部分采用4路張量并行(TP4)與序列并行(SP)結(jié)合,同時(shí)使用8路數(shù)據(jù)并行(DP8)。其較小的TP大小為4,限制了TP通信的開(kāi)銷。對(duì)于MoE部分,我們使用32路專家并行(EP32),確保每個(gè)專家處理足夠大的批量大小,從而提高計(jì)算效率。對(duì)于MoE的全對(duì)全通信,我們使用與訓(xùn)練時(shí)相同的方法:首先通過(guò)IB在節(jié)點(diǎn)之間傳輸Token,然后通過(guò)NVLink在節(jié)點(diǎn)內(nèi)的GPU之間轉(zhuǎn)發(fā)。特別地,我們?cè)跍\層使用1路張量并行來(lái)處理密集的MLP,以節(jié)省TP通信。

To achieve load balancing among different experts in the MoE part, we need to ensure that each GPU processes approximately the same number of tokens. To this end, we introduce a deployment strategy of redundant experts, which duplicates high-load experts and deploys them redundantly. The high-load experts are detected based on statistics collected during the online deployment and are adjusted periodically (e.g., every 10 minutes). After determining the set of redundant experts, we carefully rearrange experts among GPUs within a node based on the observed loads, striving to balance the load across GPUs as much as possible without increasing the cross-node all-to-all communication overhead. For the deployment of DeepSeek-V3, we set 32 redundant experts for the prefilling stage. For each GPU, besides the original 8 experts it hosts, it will also host one additional redundant expert.

為了實(shí)現(xiàn) MoE 部分中不同專家之間的負(fù)載均衡,我們需要確保每個(gè) GPU 處理大致相同數(shù)量的 Token。為此,我們引入了一種冗余專家的部署策略,即復(fù)制高負(fù)載專家并進(jìn)行冗余部署。高負(fù)載專家是基于在線部署期間收集的統(tǒng)計(jì)數(shù)據(jù)檢測(cè)的,并定期進(jìn)行調(diào)整(例如,每 10 分鐘一次)。在確定冗余專家集后,我們根據(jù)觀察到的負(fù)載情況,在節(jié)點(diǎn)內(nèi)的 GPU 之間仔細(xì)重新安排專家,力求在不增加跨節(jié)點(diǎn)全對(duì)全通信開(kāi)銷的情況下,盡可能平衡 GPU 的負(fù)載。對(duì)于 DeepSeek-V3 的部署,我們?yōu)轭A(yù)填充階段設(shè)置了 32 個(gè)冗余專家。對(duì)于每個(gè) GPU,除了它原本承載的 8 個(gè)專家外,還將承載一個(gè)額外的冗余專家。
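
A minimal sketch of the redundant-expert idea, assuming per-expert token counts collected from the online service; the selection heuristic and the one-redundant-expert-per-GPU placement below are illustrative simplifications rather than the production rearrangement algorithm.

```python
import numpy as np

def choose_redundant_experts(token_counts, num_redundant=32):
    """Pick the most heavily loaded experts (by observed token counts) to duplicate."""
    return np.argsort(token_counts)[::-1][:num_redundant]

rng = np.random.default_rng(0)
token_counts = rng.integers(1_000, 100_000, size=256)      # observed load per routed expert
redundant = choose_redundant_experts(token_counts, num_redundant=32)

# With 32 GPUs in a prefilling unit, each GPU hosts 256 / 32 = 8 original experts
# plus one redundant expert, matching the deployment described above.
gpu_assignments = {gpu: [int(redundant[gpu])] for gpu in range(32)}
print(sorted(int(e) for e in redundant)[:5], len(gpu_assignments))
```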

Furthermore, in the prefilling stage, to improve the throughput and hide the overhead of all-to-all and TP communication, we simultaneously process two micro-batches with similar computational workloads, overlapping the attention and MoE of one micro-batch with the dispatch and combine of another.

此外,在預(yù)填充階段,為了提高吞吐量并隱藏 all-to-all 和 TP 通信的開(kāi)銷,我們同時(shí)處理兩個(gè)計(jì)算工作量相似的微批次,將一個(gè)微批次的注意力機(jī)制和 MoE 與另一個(gè)微批次的調(diào)度和組合重疊。

Finally, we are exploring a dynamic redundancy strategy for experts, where each GPU hosts more experts (e.g., 16 experts), but only 9 will be activated during each inference step. Before the all-to-all operation at each layer begins, we compute the globally optimal routing scheme on the fly. Given the substantial computation involved in the prefilling stage, the overhead of computing this routing scheme is almost negligible.

最后,我們正在探索一種動(dòng)態(tài)冗余策略,即每個(gè) GPU 上托管更多的專家(例如 16 個(gè)專家),但在每次推理步驟中只激活 9 個(gè)。在每層的 all-to-all 操作開(kāi)始之前,我們會(huì)動(dòng)態(tài)計(jì)算全局最優(yōu)的路由方案??紤]到預(yù)填充階段涉及的大量計(jì)算,計(jì)算該路由方案的開(kāi)銷幾乎可以忽略不計(jì)。

During decoding, we treat the shared expert as a routed one. From this perspective, each token will select 9 experts during routing, where the shared expert is regarded as a heavy-load one that will always be selected. The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320. For the MoE part, each GPU hosts only one expert, and 64 GPUs are responsible for hosting redundant experts and shared experts. All-to-all communication of the dispatch and combine parts is performed via direct point-to-point transfers over IB to achieve low latency. Additionally, we leverage the IBGDA (NVIDIA, 2022) technology to further minimize latency and enhance communication efficiency.

在解碼過(guò)程中,我們將共享專家視為一個(gè)路由專家。從這個(gè)角度來(lái)看,每個(gè) Token 在路由時(shí)會(huì)選擇 9 個(gè)專家,其中共享專家被視為一個(gè)高負(fù)載的專家,始終會(huì)被選中。解碼階段的最小部署單元由 40 個(gè)節(jié)點(diǎn)和 320 個(gè) GPU 組成。注意力部分采用 TP4 和 SP 結(jié)合 DP80,而 MoE 部分使用 EP320。對(duì)于 MoE 部分,每個(gè) GPU 僅托管一個(gè)專家,64 個(gè) GPU 負(fù)責(zé)托管冗余專家和共享專家。調(diào)度和組合部分的全對(duì)全通信通過(guò) IB 的直接點(diǎn)對(duì)點(diǎn)傳輸進(jìn)行,以實(shí)現(xiàn)低延遲。此外,我們利用 IBGDA (NVIDIA, 2022) 技術(shù)進(jìn)一步減少延遲并提高通信效率。
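
A quick sanity check of the decoding deployment arithmetic described above; the numbers come directly from the text, and the breakdown itself is just bookkeeping.

```python
nodes, gpus_per_node = 40, 8
total_gpus = nodes * gpus_per_node        # 320 GPUs -> EP320, one expert per GPU
routed_experts = 256                      # routed experts, each hosted on its own GPU
remaining_gpus = total_gpus - routed_experts   # 64 GPUs left for redundant and shared experts
experts_per_token = 8 + 1                 # 8 routed experts plus the shared expert treated as routed
print(total_gpus, remaining_gpus, experts_per_token)
```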

Similar to prefilling, we periodically determine the set of redundant experts in a certain interval, based on the statistical expert load from our online service. However, we do not need to rearrange experts since each GPU only hosts one expert. We are also exploring the dynamic redundancy strategy for decoding. However, this requires more careful optimization of the algorithm that computes the globally optimal routing scheme and the fusion with the dispatch kernel to reduce overhead.

與預(yù)填充類似,我們根據(jù)在線服務(wù)的統(tǒng)計(jì)專家負(fù)載,定期確定某個(gè)區(qū)間內(nèi)的冗余專家集。然而,由于每個(gè) GPU 只托管一個(gè)專家,我們不需要重新排列專家。我們還在探索解碼的動(dòng)態(tài)冗余策略。然而,這需要對(duì)計(jì)算全局最優(yōu)路由方案的算法進(jìn)行更仔細(xì)的優(yōu)化,并與調(diào)度內(nèi)核融合以減少開(kāi)銷。

Additionally, to enhance throughput and hide the overhead of all-to-all communication, we are also exploring processing two micro-batches with similar computational workloads simultaneously in the decoding stage. Unlike prefilling, attention consumes a larger portion of time in the decoding stage. Therefore, we overlap the attention of one micro-batch with the dispatch+MoE+combine of another. In the decoding stage, the batch size per expert is relatively small (usually within 256 tokens), and the bottleneck is memory access rather than computation. Since the MoE part only needs to load the parameters of one expert, the memory access overhead is minimal, so using fewer SMs will not significantly affect the overall performance. Therefore, to avoid impacting the computation speed of the attention part, we can allocate only a small portion of SMs to dispatch+MoE+combine.

此外,為了提高吞吐量并隱藏全對(duì)全通信的開(kāi)銷,我們還在探索在解碼階段同時(shí)處理兩個(gè)計(jì)算工作量相似的微批次。與預(yù)填充不同,注意力在解碼階段消耗的時(shí)間更多。因此,我們將一個(gè)微批次的注意力與另一個(gè)微批次的調(diào)度+MoE+組合重疊。在解碼階段,每個(gè)專家的批次大小相對(duì)較?。ㄍǔT?56個(gè)Token以內(nèi)),瓶頸是內(nèi)存訪問(wèn)而非計(jì)算。由于MoE部分只需要加載一個(gè)專家的參數(shù),內(nèi)存訪問(wèn)開(kāi)銷很小,因此使用較少的SM不會(huì)顯著影響整體性能。因此,為了避免影響注意力部分的計(jì)算速度,我們可以只分配一小部分SM給調(diào)度+MoE+組合。

Based on our implementation of the all-to-all communication and FP8 training scheme, we propose the following suggestions on chip design to AI hardware vendors.

基于我們對(duì)全對(duì)全通信和 FP8 訓(xùn)練方案的實(shí)現(xiàn),我們向 AI 硬件供應(yīng)商提出以下芯片設(shè)計(jì)建議。

In DeepSeek-V3, we implement the overlap between computation and communication to hide the communication latency during computation. This significantly reduces the dependency on communication bandwidth compared to serial computation and communication. However, the current communication implementation relies on expensive SMs (e.g., we allocate 20 out of the 132 SMs available in the H800 GPU for this purpose), which will limit the computational throughput. Moreover, using SMs for communication results in significant inefficiencies, as tensor cores remain entirely under-utilized.

在 DeepSeek-V3 中,我們實(shí)現(xiàn)了計(jì)算與通信的重疊,以隱藏計(jì)算過(guò)程中的通信延遲。與串行計(jì)算和通信相比,這顯著降低了對(duì)通信帶寬的依賴。然而,當(dāng)前的通信實(shí)現(xiàn)依賴于昂貴的 SMs(例如,我們?cè)?H800 GPU 的 132 個(gè)可用 SMs 中分配了 20 個(gè)用于此目的),這將限制計(jì)算吞吐量。此外,使用 SMs 進(jìn)行通信會(huì)導(dǎo)致顯著的效率低下,因?yàn)閺埩亢诵模╰ensor cores)完全未被充分利用。

Currently, the SMs primarily perform the following tasks for all-to-all communication:

目前,SMs 主要執(zhí)行以下全對(duì)全通信任務(wù):

We aspire to see future vendors develop hardware that offloads these communication tasks from the valuable computation unit SM, serving as a GPU co-processor or a network co-processor like NVIDIA SHARP (Graham et al., 2016). Furthermore, to reduce application programming complexity, we aim for this hardware to unify the IB (scale-out) and NVLink (scale-up) networks from the perspective of the computation units. With this unified interface, computation units can easily accomplish operations such as read, write, multicast, and reduce across the entire IB-NVLink-unified domain by submitting communication requests based on simple primitives.

我們期望未來(lái)的供應(yīng)商能夠開(kāi)發(fā)出將通信任務(wù)從寶貴的計(jì)算單元 SM 中卸載的硬件,作為 GPU 協(xié)處理器或網(wǎng)絡(luò)協(xié)處理器,例如 NVIDIA SHARP Graham 等人 (2016)。此外,為了降低應(yīng)用程序編程的復(fù)雜性,我們希望這種硬件能夠從計(jì)算單元的角度統(tǒng)一 IB(橫向擴(kuò)展)和 NVLink(縱向擴(kuò)展)網(wǎng)絡(luò)。通過(guò)這種統(tǒng)一的接口,計(jì)算單元可以通過(guò)提交基于簡(jiǎn)單原語(yǔ)的通信請(qǐng)求,輕松完成整個(gè) IB-NVLink 統(tǒng)一域中的讀取、寫(xiě)入、多播和歸約等操作。

Higher FP8 GEMM Accumulation Precision in Tensor Cores. In the current Tensor Core implementation of the NVIDIA Hopper architecture, FP8 GEMM (General Matrix Multiply) employs fixed-point accumulation, aligning the mantissa products by right-shifting based on the maximum exponent before addition. Our experiments reveal that it only uses the highest 14 bits of each mantissa product after sign-fill right shifting, and truncates bits exceeding this range. However, for example, to achieve precise FP32 results from the accumulation of 32 FP8×FP8 multiplications, at least 34-bit precision is required. Thus, we recommend that future chip designs increase accumulation precision in Tensor Cores to support full-precision accumulation, or select an appropriate accumulation bit-width according to the accuracy requirements of training and inference algorithms. This approach ensures that errors remain within acceptable bounds while maintaining computational efficiency.

Tensor Core 中更高的 FP8 GEMM 累加精度。在 NVIDIA Hopper 架構(gòu)的當(dāng)前 Tensor Core 實(shí)現(xiàn)中,F(xiàn)P8 GEMM(通用矩陣乘法)采用定點(diǎn)累加,通過(guò)在加法前根據(jù)最大指數(shù)右移來(lái)對(duì)齊尾數(shù)乘積。我們的實(shí)驗(yàn)表明,它在符號(hào)填充右移后僅使用每個(gè)尾數(shù)乘積的最高 14 位,并截?cái)喑龃朔秶奈?。然而,例如,要?32 個(gè) FP8×FP8 乘法的累加中獲得精確的 FP32 結(jié)果,至少需要 34 位精度。因此,我們建議未來(lái)的芯片設(shè)計(jì)提高 Tensor Core 中的累加精度,以支持全精度累加,或根據(jù)訓(xùn)練和推理算法的精度要求選擇合適的累加位寬。這種方法在保持計(jì)算效率的同時(shí),確保誤差保持在可接受的范圍內(nèi)。

Support for Tile- and Block-Wise Quantization. Current GPUs only support per-tensor quantization, lacking native support for fine-grained quantization like our tile- and block-wise quantization. In the current implementation, when the $N_C$ interval is reached, the partial results will be copied from Tensor Cores to CUDA Cores, multiplied by the scaling factors, and added to FP32 registers on CUDA Cores. Although the dequantization overhead is significantly mitigated combined with our precise FP32 accumulation strategy, the frequent data movements between Tensor Cores and CUDA Cores still limit the computational efficiency. Therefore, we recommend future chips to support fine-grained quantization by enabling Tensor Cores to receive scaling factors and implement MMA with group scaling. In this way, the whole partial sum accumulation and dequantization can be completed directly inside Tensor Cores until the final result is produced, avoiding frequent data movements.

支持分塊量化。當(dāng)前的 GPU 僅支持張量級(jí)量化,缺乏對(duì)我們分塊量化等細(xì)粒度量化的原生支持。在當(dāng)前實(shí)現(xiàn)中,當(dāng)達(dá)到 $N_{C}$ 間隔時(shí),部分結(jié)果將從 Tensor Core 復(fù)制到 CUDA Core,乘以縮放因子,并添加到 CUDA Core 上的 FP32 寄存器中。盡管結(jié)合我們精確的 FP32 累加策略,反量化開(kāi)銷得到了顯著緩解,但 Tensor Core 和 CUDA Core 之間的頻繁數(shù)據(jù)移動(dòng)仍然限制了計(jì)算效率。因此,我們建議未來(lái)的芯片通過(guò)使 Tensor Core 能夠接收縮放因子并實(shí)現(xiàn)分組縮放的 MMA 來(lái)支持細(xì)粒度量化。這樣,整個(gè)部分和累加和反量化可以直接在 Tensor Core 內(nèi)部完成,直到生成最終結(jié)果,從而避免頻繁的數(shù)據(jù)移動(dòng)。

Support for Online Quantization. The current implementations struggle to effectively support online quantization, despite its effectiveness demonstrated in our research. In the existing process, we need to read 128 BF16 activation values (the output of the previous computation) from HBM (High Bandwidth Memory) for quantization, and the quantized FP8 values are then written back to HBM, only to be read again for MMA. To address this inefficiency, we recommend that future chips integrate FP8 cast and TMA (Tensor Memory Accelerator) access into a single fused operation, so quantization can be completed during the transfer of activations from global memory to shared memory, avoiding frequent memory reads and writes. We also recommend supporting a warp-level cast instruction for speedup, which further facilitates the better fusion of layer normalization and FP8 cast. Alternatively, a near-memory computing approach can be adopted, where compute logic is placed near the HBM. In this case, BF16 elements can be cast to FP8 directly as they are read from HBM into the GPU, reducing off-chip memory access by roughly 50%.

支持在線量化。盡管我們的研究證明了在線量化的有效性,但當(dāng)前的實(shí)現(xiàn)難以有效支持這一功能。在現(xiàn)有流程中,我們需要從高帶寬內(nèi)存 (HBM) 中讀取 128 個(gè) BF16 激活值(前一次計(jì)算的輸出)進(jìn)行量化,然后將量化后的 FP8 值寫(xiě)回 HBM,再重新讀取以進(jìn)行矩陣乘法累加 (MMA)。為了解決這種低效問(wèn)題,我們建議未來(lái)的芯片將 FP8 轉(zhuǎn)換和張量?jī)?nèi)存加速器 (TMA) 訪問(wèn)集成到一個(gè)融合操作中,以便在將激活值從全局內(nèi)存?zhèn)鬏數(shù)焦蚕韮?nèi)存的過(guò)程中完成量化,從而避免頻繁的內(nèi)存讀寫(xiě)。我們還建議支持 warp 級(jí)別的轉(zhuǎn)換指令以加速操作,這進(jìn)一步促進(jìn)了層歸一化和 FP8 轉(zhuǎn)換的更好融合。或者,可以采用近內(nèi)存計(jì)算的方法,將計(jì)算邏輯放置在 HBM 附近。在這種情況下,BF16 元素在從 HBM 讀取到 GPU 時(shí)可以直接轉(zhuǎn)換為 FP8,從而將片外內(nèi)存訪問(wèn)減少約 50%。

Support for Transposed GEMM Operations. The current architecture makes it cumbersome to fuse matrix transposition with GEMM operations. In our workflow, activations during the forward pass are quantized into 1x128 FP8 tiles and stored. During the backward pass, the matrix needs to be read out, dequantized, transposed, re-quantized into 128x128 tiles, and stored in HBM. To reduce memory operations, we recommend future chips to enable direct transposed reads of matrices from shared memory before the MMA operation, for those precisions required in both training and inference. Combined with the fusion of FP8 format conversion and TMA access, this enhancement will significantly streamline the quantization workflow.

支持轉(zhuǎn)置的 GEMM 操作。當(dāng)前的架構(gòu)使得將矩陣轉(zhuǎn)置與 GEMM 操作融合變得繁瑣。在我們的工作流程中,前向傳播中的激活被量化為 1x128 FP8 塊并存儲(chǔ)。在反向傳播期間,矩陣需要被讀取、反量化、轉(zhuǎn)置、重新量化為 128x128 塊,并存儲(chǔ)在 HBM 中。為了減少內(nèi)存操作,我們建議未來(lái)的芯片在 MMA 操作之前,能夠直接從共享內(nèi)存中進(jìn)行矩陣的轉(zhuǎn)置讀取,以滿足訓(xùn)練和推理中所需的精度。結(jié)合 FP8 格式轉(zhuǎn)換和 TMA 訪問(wèn)的融合,這一增強(qiáng)將顯著簡(jiǎn)化量化工作流程。

Compared with DeepSeek-V2, we optimize the pre-training corpus by enhancing the ratio of mathematical and programming samples, while expanding multilingual coverage beyond English and Chinese. Also, our data processing pipeline is refined to minimize redundancy while maintaining corpus diversity. Inspired by Ding et al. (2024), we implement the document packing method for data integrity but do not incorporate cross-sample attention masking during training. Finally, the training corpus for DeepSeek-V3 consists of 14.8T high-quality and diverse tokens in our tokenizer.

與 DeepSeek-V2 相比,我們通過(guò)提高數(shù)學(xué)和編程樣本的比例優(yōu)化了預(yù)訓(xùn)練語(yǔ)料庫(kù),同時(shí)擴(kuò)展了除英語(yǔ)和中文之外的多語(yǔ)言覆蓋范圍。此外,我們的數(shù)據(jù)處理流程經(jīng)過(guò)優(yōu)化,以在保持語(yǔ)料庫(kù)多樣性的同時(shí)最大限度地減少冗余。受 Ding 等人 (2024) 的啟發(fā),我們實(shí)現(xiàn)了文檔打包方法以確保數(shù)據(jù)完整性,但在訓(xùn)練過(guò)程中沒(méi)有引入跨樣本注意力掩碼。最終,DeepSeek-V3 的訓(xùn)練語(yǔ)料庫(kù)在我們的分詞器中包含了 14.8T 的高質(zhì)量和多樣化 Token。

In the training process of DeepSeek-Coder-V2 (DeepSeek-AI, 2024a), we observe that the Fill-in-Middle (FIM) strategy does not compromise the next-token prediction capability while enabling the model to accurately predict middle text based on contextual cues. In alignment with DeepSeek-Coder-V2, we also incorporate the FIM strategy in the pre-training of DeepSeek-V3. To be specific, we employ the Prefix-Suffix-Middle (PSM) framework to structure data as follows:

在 DeepSeek-Coder-V2 (DeepSeek-AI, 2024a) 的訓(xùn)練過(guò)程中,我們觀察到 Fill-in-Middle (FIM) 策略在使模型能夠根據(jù)上下文線索準(zhǔn)確預(yù)測(cè)中間文本的同時(shí),不會(huì)損害下一個(gè) Token 的預(yù)測(cè)能力。為了與 DeepSeek-Coder-V2 保持一致,我們也在 DeepSeek-V3 的預(yù)訓(xùn)練中引入了 FIM 策略。具體來(lái)說(shuō),我們采用 Prefix-Suffix-Middle (PSM) 框架來(lái)構(gòu)建數(shù)據(jù),如下所示:

<|fim_begin|>$f_{\text{pre}}$<|fim_hole|>$f_{\text{suf}}$<|fim_end|>$f_{\text{middle}}$<|eos_token|>.

<|fim_begin|>$f_{\text{pre}}$<|fim_hole|>$f_{\text{suf}}$<|fim_end|>$f_{\text{middle}}$<|eos_token|>.

This structure is applied at the document level as a part of the pre-packing process. The FIM strategy is applied at a rate of 0.1, consistent with the PSM framework.

此結(jié)構(gòu)在文檔級(jí)別應(yīng)用,作為預(yù)打包過(guò)程的一部分。FIM策略以0.1的比率應(yīng)用,與PSM框架保持一致。
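
A minimal sketch of constructing a PSM-style FIM sample at the stated 0.1 rate; the choice of cut points and the string-level handling of the special tokens are illustrative assumptions, since the text does not specify them.

```python
import random

FIM_RATE = 0.1  # fraction of documents rearranged into the PSM structure

def to_psm_fim(doc: str, rng: random.Random) -> str:
    """With probability FIM_RATE, rearrange a document into
    <|fim_begin|> prefix <|fim_hole|> suffix <|fim_end|> middle <|eos_token|>."""
    if rng.random() >= FIM_RATE:
        return doc + "<|eos_token|>"
    i, j = sorted(rng.sample(range(len(doc)), 2))   # two cut points chosen uniformly (illustrative)
    pre, middle, suf = doc[:i], doc[i:j], doc[j:]
    return f"<|fim_begin|>{pre}<|fim_hole|>{suf}<|fim_end|>{middle}<|eos_token|>"

rng = random.Random(0)
print(to_psm_fim("def add(a, b):\n    return a + b\n", rng))
```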

The tokenizer for DeepSeek-V3 employs Byte-level BPE (Shibata et al., 1999) with an extended vocabulary of 128K tokens. The pretokenizer and training data for our tokenizer are modified to optimize multilingual compression efficiency. In addition, compared with DeepSeek-V2, the new pretokenizer introduces tokens that combine punctuation and line breaks. However, this trick may introduce the token boundary bias (Lundberg, 2023) when the model processes multi-line prompts without terminal line breaks, particularly for few-shot evaluation prompts. To address this issue, we randomly split a certain proportion of such combined tokens during training, which exposes the model to a wider array of special cases and mitigates this bias.

DeepSeek-V3 的分詞器采用了字節(jié)級(jí) BPE (Byte-level BPE) (Shibata et al., 1999),并擴(kuò)展了詞匯表至 128K 個(gè) Token。我們對(duì)分詞器的預(yù)處理和訓(xùn)練數(shù)據(jù)進(jìn)行了修改,以優(yōu)化多語(yǔ)言壓縮效率。此外,與 DeepSeek-V2 相比,新的分詞器引入了結(jié)合標(biāo)點(diǎn)符號(hào)和換行符的 Token。然而,當(dāng)模型處理沒(méi)有終止換行符的多行提示時(shí),這種技巧可能會(huì)引入 Token 邊界偏差 (Lundberg, 2023),尤其是在少樣本評(píng)估提示中。為了解決這個(gè)問(wèn)題,我們?cè)谟?xùn)練過(guò)程中隨機(jī)拆分了一定比例的此類組合 Token,從而使模型接觸到更多特殊情況,減輕了這種偏差。
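
A minimal sketch of the random-splitting trick for combined punctuation+line-break tokens; the token inventory and split probability below are assumptions for illustration, since the exact proportion is not stated in the text.

```python
import random

COMBINED = {".\n": (".", "\n"), ",\n": (",", "\n"), "?\n": ("?", "\n")}  # hypothetical combined tokens
SPLIT_PROB = 0.1  # illustrative rate; the paper does not state the exact proportion

def maybe_split(tokens, rng):
    """Randomly break combined punctuation+newline tokens so the model also sees them separately."""
    out = []
    for tok in tokens:
        if tok in COMBINED and rng.random() < SPLIT_PROB:
            out.extend(COMBINED[tok])
        else:
            out.append(tok)
    return out

print(maybe_split(["Hello", ",\n", "world", ".\n"], random.Random(3)))
```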

Model Hyper-Parameters. We set the number of Transformer layers to 61 and the hidden dimension to 7168. All learnable parameters are randomly initialized with a standard deviation of 0.006. In MLA, we set the number of attention heads $n_h$ to 128 and the per-head dimension $d_h$ to 128. The KV compression dimension $d_c$ is set to 512, and the query compression dimension $d_c^{\prime}$ is set to 1536. For the decoupled queries and keys, we set the per-head dimension $d_h^R$ to 64. We substitute all FFNs except for the first three layers with MoE layers. Each MoE layer consists of 1 shared expert and 256 routed experts, where the intermediate hidden dimension of each expert is 2048. Among the routed experts, 8 experts will be activated for each token, and each token will be ensured to be sent to at most 4 nodes. The multi-token prediction depth $D$ is set to 1, i.e., besides the exact next token, each token will predict one additional token. As in DeepSeek-V2, DeepSeek-V3 also employs additional RMSNorm layers after the compressed latent vectors, and multiplies additional scaling factors at the width bottlenecks. Under this configuration, DeepSeek-V3 comprises 671B total parameters, of which 37B are activated for each token.

模型超參數(shù)。我們將 Transformer 層數(shù)設(shè)置為 61,隱藏維度設(shè)置為 7168。所有可學(xué)習(xí)參數(shù)均以標(biāo)準(zhǔn)差為 0.006 進(jìn)行隨機(jī)初始化。在 MLA 中,我們將注意力頭數(shù) $n_h$ 設(shè)置為 128,每個(gè)頭的維度 $d_h$ 設(shè)置為 128。KV 壓縮維度 $d_c$ 設(shè)置為 512,查詢壓縮維度 $d_c^{\prime}$ 設(shè)置為 1536。對(duì)于解耦的查詢和鍵,我們將每個(gè)頭的維度 $d_h^R$ 設(shè)置為 64。我們將除前三層之外的所有 FFN 替換為 MoE 層。每個(gè) MoE 層由 1 個(gè)共享專家和 256 個(gè)路由專家組成,每個(gè)專家的中間隱藏維度為 2048。在路由專家中,每個(gè) token 將激活 8 個(gè)專家,并且每個(gè) token 將確保最多發(fā)送到 4 個(gè)節(jié)點(diǎn)。多 token 預(yù)測(cè)深度 $D$ 設(shè)置為 1,即除了精確的下一個(gè) token 外,每個(gè) token 還將預(yù)測(cè)一個(gè)額外的 token。與 DeepSeek-V2 一樣,DeepSeek-V3 也在壓縮的潛在向量后使用額外的 RMSNorm 層,并在寬度瓶頸處乘以額外的縮放因子。在此配置下,DeepSeek-V3 包含 671B 個(gè)總參數(shù),其中每個(gè) token 激活 37B 個(gè)參數(shù)。

Training Hyper-Parameters. We employ the AdamW optimizer (Loshchilov and Hutter, 2017) with hyper-parameters set to $\beta_1 = 0.9$, $\beta_2 = 0.95$, and weight decay of 0.1. We set the maximum sequence length to 4K during pre-training, and pre-train DeepSeek-V3 on 14.8T tokens. As for the learning rate scheduling, we first linearly increase it from 0 to $2.2\times10^{-4}$ during the first 2K steps. Then, we keep a constant learning rate of $2.2\times10^{-4}$ until the model consumes 10T training tokens. Subsequently, we gradually decay the learning rate to $2.2\times10^{-5}$ over 4.3T tokens, following a cosine decay curve. During the training of the final 500B tokens, we keep a constant learning rate of $2.2\times10^{-5}$ for the first 333B tokens, and switch to another constant learning rate of $7.3\times10^{-6}$ for the remaining 167B tokens. The gradient clipping norm is set to 1.0. We employ a batch size scheduling strategy, where the batch size is gradually increased from 3072 to 15360 in the training of the first 469B tokens, and then kept at 15360 for the remaining training. We leverage pipeline parallelism to deploy different layers of a model on different GPUs, and for each layer, the routed experts will be uniformly deployed on 64 GPUs belonging to 8 nodes. As for the node-limited routing, each token will be sent to at most 4 nodes (i.e., $M=4$). For auxiliary-loss-free load balancing, we set the bias update speed $\gamma$ to 0.001 for the first 14.3T tokens, and to 0.0 for the remaining 500B tokens. For the balance loss, we set $\alpha$ to 0.0001, just to avoid extreme imbalance within any single sequence. The MTP loss weight $\lambda$ is set to 0.3 for the first 10T tokens, and to 0.1 for the remaining 4.8T tokens.

訓(xùn)練超參數(shù)。我們采用 AdamW 優(yōu)化器 (Loshchilov and Hutter, 2017),超參數(shù)設(shè)置為 $\beta_1 = 0.9$、$\beta_2 = 0.95$,權(quán)重衰減為 0.1。在預(yù)訓(xùn)練期間,我們將最大序列長(zhǎng)度設(shè)置為 4K,并在 14.8T 的 token 上預(yù)訓(xùn)練 DeepSeek-V3。對(duì)于學(xué)習(xí)率調(diào)度,我們首先在前 2K 步中將其從 0 線性增加到 $2.2\times10^{-4}$。然后,我們保持 $2.2\times10^{-4}$ 的恒定學(xué)習(xí)率,直到模型消耗了 10T 的訓(xùn)練 token。隨后,我們按照余弦衰減曲線,在接下來(lái)的 4.3T token 中將學(xué)習(xí)率逐漸衰減到 $2.2\times10^{-5}$。在最后 500B token 的訓(xùn)練中,前 333B token 保持 $2.2\times10^{-5}$ 的恒定學(xué)習(xí)率,剩余的 167B token 則切換到 $7.3\times10^{-6}$ 的恒定學(xué)習(xí)率。梯度裁剪范數(shù)設(shè)置為 1.0。我們采用批量大小調(diào)度策略,在前 469B token 的訓(xùn)練中,批量大小從 3072 逐漸增加到 15360,然后在剩余的訓(xùn)練中保持 15360。我們利用流水線并行將模型的不同層部署在不同的 GPU 上,對(duì)于每一層,路由專家將均勻部署在屬于 8 個(gè)節(jié)點(diǎn)的 64 個(gè) GPU 上。對(duì)于節(jié)點(diǎn)限制路由,每個(gè) token 最多發(fā)送到 4 個(gè)節(jié)點(diǎn)(即 $M=4$)。對(duì)于無(wú)輔助損失的負(fù)載均衡,我們?cè)谇?14.3T token 中將偏置更新速度 $\gamma$ 設(shè)置為 0.001,在剩余的 500B token 中設(shè)置為 0.0。對(duì)于平衡損失,我們將 $\alpha$ 設(shè)置為 0.0001,僅用于避免任何單個(gè)序列內(nèi)的極端不平衡。MTP 損失權(quán)重 $\lambda$ 在前 10T token 中設(shè)置為 0.3,在剩余的 4.8T token 中設(shè)置為 0.1。
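
The learning-rate schedule above can be written as a single piecewise function of training progress; the sketch below reconstructs it from the description (warmup measured in steps, the remaining stages in consumed tokens), with the exact interpolation details being assumptions.

```python
import math

def lr_schedule(step, tokens_consumed, warmup_steps=2_000,
                peak=2.2e-4, decay_end=2.2e-5, final=7.3e-6):
    """Learning rate for DeepSeek-V3 pre-training, reconstructed from the description above."""
    T = 1e12  # one trillion tokens
    if step < warmup_steps:                          # linear warmup over the first 2K steps
        return peak * step / warmup_steps
    if tokens_consumed < 10 * T:                     # constant until 10T tokens
        return peak
    if tokens_consumed < 14.3 * T:                   # cosine decay over the next 4.3T tokens
        progress = (tokens_consumed - 10 * T) / (4.3 * T)
        return decay_end + 0.5 * (peak - decay_end) * (1 + math.cos(math.pi * progress))
    if tokens_consumed < 14.633 * T:                 # first 333B of the final 500B tokens
        return decay_end
    return final                                     # remaining 167B tokens

print(lr_schedule(step=500_000, tokens_consumed=12e12))   # somewhere in the cosine stage
```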


Figure 8 | Evaluation results on the "Needle In A Haystack" (NIAH) tests, pressure testing DeepSeek-V3's 128K context. DeepSeek-V3 performs well across all context window lengths up to 128K.


圖 8 | "大海撈針" (NIAH) 測(cè)試的評(píng)估結(jié)果,對(duì) DeepSeek-V3 的 128K 上下文進(jìn)行壓力測(cè)試。DeepSeek-V3 在所有上下文窗口長(zhǎng)度(最高達(dá) 128K)上表現(xiàn)良好。

We adopt a similar approach to DeepSeek-V2 (DeepSeek-AI, 2024c) to enable long context capabilities in DeepSeek-V3. After the pre-training stage, we apply YaRN (Peng et al., 2023a) for context extension and perform two additional training phases, each comprising 1000 steps, to progressively expand the context window from 4K to 32K and then to 128K. The YaRN configuration is consistent with that used in DeepSeek-V2, being applied exclusively to the decoupled shared key $\mathbf{k}_t^R$. The hyper-parameters remain identical across both phases, with the scale $s=40$, $\alpha=1$, $\beta=32$, and the scaling factor $\sqrt{t}=0.1\ln s+1$. In the first phase, the sequence length is set to 32K, and the batch size is 1920. During the second phase, the sequence length is increased to 128K, and the batch size is reduced to 480. The learning rate for both phases is set to $7.3\times10^{-6}$, matching the final learning rate from the pre-training stage.

我們采用與 DeepSeek-V2 (DeepSeek-AI, 2024c) 類似的方法,在 DeepSeek-V3 中實(shí)現(xiàn)長(zhǎng)上下文能力。在預(yù)訓(xùn)練階段之后,我們應(yīng)用 YaRN (Peng et al., 2023a) 進(jìn)行上下文擴(kuò)展,并執(zhí)行兩個(gè)額外的訓(xùn)練階段,每個(gè)階段包含 1000 步,逐步將上下文窗口從 4K 擴(kuò)展到 32K,然后再擴(kuò)展到 128K。YaRN 配置與 DeepSeek-V2 中使用的配置一致,僅應(yīng)用于解耦的共享鍵 $\mathbf{k}_t^R$。兩個(gè)階段的超參數(shù)保持一致,其中比例 $s=40$,$\alpha=1$,$\beta=32$,縮放因子 $\sqrt{t}=0.1\ln s+1$。在第一階段,序列長(zhǎng)度設(shè)置為 32K,批量大小為 1920。在第二階段,序列長(zhǎng)度增加到 128K,批量大小減少到 480。兩個(gè)階段的學(xué)習(xí)率均設(shè)置為 $7.3\times10^{-6}$,與預(yù)訓(xùn)練階段的最終學(xué)習(xí)率一致。
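
As a quick numeric check of the scaling factor above (assuming the natural logarithm), with $s = 40$:

$$\sqrt{t} = 0.1\ln s + 1 = 0.1\ln 40 + 1 \approx 1.369.$$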

Through this two-phase extension training, DeepSeek-V3 is capable of handling inputs up to 128K in length while maintaining strong performance. Figure 8 illustrates that DeepSeek-V3, following supervised fine-tuning, achieves notable performance on the "Needle In A Haystack" (NIAH) test, demonstrating consistent robustness across context window lengths up to 128K.

通過(guò)這種兩階段的擴(kuò)展訓(xùn)練,DeepSeek-V3 能夠處理長(zhǎng)達(dá) 128K 的輸入,同時(shí)保持強(qiáng)勁的性能。圖 8 展示了 DeepSeek-V3 在經(jīng)過(guò)監(jiān)督微調(diào)后,在 "Needle In A Haystack" (NIAH) 測(cè)試中取得了顯著的表現(xiàn),證明了其在長(zhǎng)達(dá) 128K 的上下文窗口長(zhǎng)度上具有一致的魯棒性。

The base model of DeepSeek-V3 is pretrained on a multilingual corpus with English and Chinese constituting the majority, so we evaluate its performance on a series of benchmarks primarily in English and Chinese, as well as on a multilingual benchmark. Our evaluation is based on our internal evaluation framework integrated in our HAI-LLM framework. Considered benchmarks are categorized and listed as follows, where underlined benchmarks are in Chinese and double-underlined benchmarks are multilingual ones:

DeepSeek-V3 的基礎(chǔ)模型是在以英語(yǔ)和中文為主的多語(yǔ)言語(yǔ)料庫(kù)上進(jìn)行預(yù)訓(xùn)練的,因此我們主要在英語(yǔ)和中文的一系列基準(zhǔn)測(cè)試中評(píng)估其性能,同時(shí)也包括一個(gè)多語(yǔ)言基準(zhǔn)測(cè)試。我們的評(píng)估基于集成在 HAI-LLM 框架中的內(nèi)部評(píng)估框架。所考慮的基準(zhǔn)測(cè)試分類如下,其中下劃線的基準(zhǔn)測(cè)試為中文,雙下劃線的基準(zhǔn)測(cè)試為多語(yǔ)言:

Multi-subject multiple-choice datasets include MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024b), MMMLU (OpenAI, 2024b), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023).

多學(xué)科多選題數(shù)據(jù)集包括 MMLU (Hendrycks et al., 2020)、MMLU-Redux (Gema et al., 2024)、MMLU-Pro (Wang et al., 2024b)、MMMLU (OpenAI, 2024b)、C-Eval (Huang et al., 2023) 和 CMMLU (Li et al., 2023)。

Language understanding and reasoning datasets include HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and BigBench Hard (BBH) (Suzgun et al., 2022).

語(yǔ)言理解和推理數(shù)據(jù)集包括 HellaSwag (Zellers et al., 2019)、PIQA (Bisk et al., 2020)、ARC (Clark et al., 2018) 和 BigBench Hard (BBH) (Suzgun et al., 2022)。

Closed-book question answering datasets include TriviaQA (Joshi et al., 2017) and Natural Questions (Kwiatkowski et al., 2019).

閉卷問(wèn)答數(shù)據(jù)集包括 TriviaQA (Joshi et al., 2017) 和 Natural Questions (Kwiatkowski et al., 2019)。

Reading comprehension datasets include RACE (Lai et al., 2017), DROP (Dua et al., 2019), C3 (Sun et al., 2019a), and CMRC (Cui et al., 2019).

閱讀理解數(shù)據(jù)集包括 RACE (Lai et al., 2017)、DROP (Dua et al., 2019)、C3 (Sun et al., 2019a) 和 CMRC (Cui et al., 2019)。

Reference disambiguation datasets include CLUEWSC (Xu et al., 2020) and WinoGrande (Sakaguchi et al., 2019).

指代消歧數(shù)據(jù)集包括 CLUEWSC (Xu et al., 2020) 和 WinoGrande (Sakaguchi et al., 2019)。

Language modeling datasets include Pile (Gao et al., 2020).

語(yǔ)言建模數(shù)據(jù)集包括 Pile (Gao et al., 2020)。

Chinese understanding and culture datasets include CCPM (Li et al., 2021).

中文理解和文化數(shù)據(jù)集包括 CCPM (Li et al., 2021)。

Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MGSM (Shi et al., 2023), and CMath (Wei et al., 2023).

數(shù)學(xué)數(shù)據(jù)集包括 GSM8K (Cobbe et al., 2021)、MATH (Hendrycks et al., 2021)、MGSM (Shi et al., 2023) 和 CMath (Wei et al., 2023)。

Code datasets include HumanEval (Chen et al., 2021), LiveCodeBench-Base (0801-1101) (Jain et al., 2024), MBPP (Austin et al., 2021), and CRUXEval (Gu et al., 2024).

代碼數(shù)據(jù)集包括 HumanEval (Chen et al., 2021)、LiveCodeBench-Base (0801-1101) (Jain et al., 2024)、MBPP (Austin et al., 2021) 和 CRUXEval (Gu et al., 2024)。

Standardized exams include AGIEval (Zhong et al., 2023). Note that AGIEval includes both English and Chinese subsets.

標(biāo)準(zhǔn)化考試包括AGIEval (Zhong et al., 2023)。需要注意的是,AGIEval包含英文和中文兩個(gè)子集。

Following our previous work (DeepSeek-AI, 2024c), we adopt perplexity-based evaluation for datasets including HellaSwag, PIQA, WinoGrande, RACE-Middle, RACE-High, MMLU, MMLU-Redux, MMLU-Pro, MMMLU, ARC-Easy, ARC-Challenge, C-Eval, CMMLU, C3, and CCPM, and adopt generation-based evaluation for TriviaQA, Natural Questions, DROP, MATH, GSM8K, MGSM, HumanEval, MBPP, LiveCodeBench-Base, CRUXEval, BBH, AGIEval, CLUEWSC, CMRC, and CMath. In addition, we perform language-modeling-based evaluation for Pile-test and use Bits-Per-Byte (BPB) as the metric to guarantee fair comparison among models using different tokenizers.

繼我們之前的工作 (DeepSeek-AI, 2024c) 之后,我們對(duì)包括 HellaSwag、PIQA、WinoGrande、RACE-Middle、RACE-High、MMLU、MMLU-Redux、MMLU-Pro、MMMLU、ARC-Easy、ARC-Challenge、C-Eval、CMMLU、C3 和 CCPM 在內(nèi)的數(shù)據(jù)集采用基于困惑度 (perplexity) 的評(píng)估方法,并對(duì) TriviaQA、Natural Questions、DROP、MATH、GSM8K、MGSM、HumanEval、MBPP、LiveCodeBench-Base、CRUXEval、BBH、AGIEval、CLUEWSC、CMRC 和 CMath 采用基于生成的評(píng)估方法。此外,我們對(duì) Pile-test 進(jìn)行基于語(yǔ)言建模的評(píng)估,并使用每字節(jié)比特?cái)?shù) (Bits-Per-Byte, BPB) 作為指標(biāo),以確保使用不同分詞器的模型之間的公平比較。
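
For reference, Bits-Per-Byte normalizes the summed cross-entropy (converted from nats to bits) by the number of UTF-8 bytes scored, which removes the dependence on tokenizer granularity; a minimal sketch with toy numbers:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """BPB = (total negative log-likelihood in bits) / (number of UTF-8 bytes scored)."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Toy example: 1,000 tokens scored at an average loss of 1.5 nats over 4,200 bytes of text.
print(round(bits_per_byte(1_000 * 1.5, 4_200), 3))   # ~0.515 bits per byte for this toy case
```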

Table 3 | Comparison among DeepSeek-V3-Base and other representative open-source base models. All models are evaluated in our internal framework and share the same evaluation setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3-Base achieves the best performance on most benchmarks, especially on math and code tasks.

| 基準(zhǔn) (指標(biāo)) | #Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base |
|---|---|---|---|---|---|
| 架構(gòu) | - | MoE | Dense | Dense | MoE |
| 激活參數(shù)量 | - | 21B | 72B | 405B | 37B |
| 總參數(shù)量 | - | 236B | 72B | 405B | 671B |
| English | | | | | |
| Pile-test (BPB) | - | 0.606 | 0.638 | 0.542 | 0.548 |
| BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | 87.5 |
| MMLU (EM) | 5-shot | 78.4 | 85.0 | 84.4 | 87.1 |
| MMLU-Redux (EM) | 5-shot | 75.6 | 83.2 | 81.3 | 86.2 |
| MMLU-Pro (EM) | 5-shot | 51.4 | 58.3 | 52.8 | 64.4 |
| DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | 89.0 |
| ARC-Easy (EM) | 25-shot | 97.6 | 98.4 | 98.4 | 98.9 |
| ARC-Challenge (EM) | 25-shot | 92.2 | 94.5 | 95.3 | 95.3 |
| HellaSwag (EM) | 10-shot | 87.1 | 84.8 | 89.2 | 88.9 |
| PIQA (EM) | 0-shot | 83.9 | 82.6 | 85.9 | 84.7 |
| WinoGrande (EM) | 5-shot | 86.3 | 82.3 | 85.2 | 84.9 |
| RACE-Middle (EM) | 5-shot | 73.1 | 68.1 | 74.2 | 67.1 |
| RACE-High (EM) | 5-shot | 52.6 | 50.3 | 56.8 | 51.3 |
| TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | 82.9 |
| NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | 41.5 | 40.0 |
| AGIEval (EM) | 0-shot | 57.5 | 75.8 | 60.6 | 79.6 |
| Code | | | | | |
| HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | 65.2 |
| MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | 75.4 |
| LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | 19.4 |
| CRUXEval-I (EM) | 2-shot | 52.5 | 59.1 | 58.5 | 67.3 |
| CRUXEval-O (EM) | 2-shot | 49.8 | 59.9 | 59.9 | 69.8 |
| Math | | | | | |
| GSM8K (EM) | 8-shot | 81.6 | 88.3 | 83.5 | 89.3 |
| MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | 61.6 |
| MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | 79.8 |
| CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | 90.7 |
| Chinese | | | | | |
| CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | 83.0 | 82.7 |
| C-Eval (EM) | 5-shot | 81.4 | 89.2 | 72.5 | 90.1 |
| CMMLU (EM) | 5-shot | 84.0 | 89.5 | 73.7 | 88.8 |
| CMRC (EM) | 1-shot | 77.4 | 75.8 | 76.0 | 76.3 |
| C3 (EM) | 0-shot | 77.4 | 76.7 | 79.7 | 78.6 |
| CCPM (EM) | 0-shot | 93.0 | 88.5 | 78.6 | 92.0 |
| Multilingual | | | | | |
| MMMLU-non-English (EM) | 5-shot | 64.0 | 74.8 | 73.8 | 79.4 |

表 3 | DeepSeek-V3-Base 與其他代表性開(kāi)源基礎(chǔ)模型的對(duì)比。所有模型均在我們的內(nèi)部框架中評(píng)估,并共享相同的評(píng)估設(shè)置。得分差距不超過(guò) 0.3 的模型被視為處于同一水平。DeepSeek-V3-Base 在大多數(shù)基準(zhǔn)測(cè)試中表現(xiàn)最佳,尤其是在數(shù)學(xué)和代碼任務(wù)上。

In Table 3, we compare the base model of DeepSeek-V3 with the state-of-the-art open-source base models, including DeepSeek-V2-Base (DeepSeek-AI, 2024c) (our previous release), Qwen2.5 72B Base (Qwen, 2024b), and LLaMA-3.1 405B Base (AI@Meta, 2024b). We evaluate all these models with our internal evaluation framework, and ensure that they share the same evaluation setting. Note that due to the changes in our evaluation framework over the past months, the performance of DeepSeek-V2-Base exhibits a slight difference from our previously reported results. Overall, DeepSeek-V3-Base comprehensively outperforms DeepSeek-V2-Base and Qwen2.5 72B Base, and surpasses LLaMA-3.1 405B Base in the majority of benchmarks, essentially becoming the strongest open-source model.

在表 3 中,我們將 DeepSeek-V3 的基礎(chǔ)模型與當(dāng)前最先進(jìn)的開(kāi)源基礎(chǔ)模型進(jìn)行了比較,包括 DeepSeek-V2-Base (DeepSeek-AI, 2024c) (我們之前的發(fā)布版本)、Qwen2.5 72B Base (Qwen, 2024b) 和 LLaMA-3.1 405B Base (AI@Meta, 2024b)。我們使用內(nèi)部評(píng)估框架對(duì)這些模型進(jìn)行了評(píng)估,并確保它們共享相同的評(píng)估設(shè)置。需要注意的是,由于過(guò)去幾個(gè)月評(píng)估框架的變化,DeepSeek-V2-Base 的表現(xiàn)與我們之前報(bào)告的結(jié)果略有不同。總體而言,DeepSeek-V3-Base 全面超越了 DeepSeek-V2-Base 和 Qwen2.5 72B Base,并在大多數(shù)基準(zhǔn)測(cè)試中超過(guò)了 LLaMA-3.1 405B Base,基本上成為了最強(qiáng)大的開(kāi)源模型。

From a more detailed perspective, we compare DeepSeek-V3-Base with the other open-source base models individually. (1) Compared with DeepSeek-V2-Base, due to the improvements in our model architecture, the scale-up of the model size and training tokens, and the enhancement of data quality, DeepSeek-V3-Base achieves significantly better performance as expected. (2) Compared with Qwen2.5 72B Base, the state-of-the-art Chinese open-source model, with only half of the activated parameters, DeepSeek-V3-Base also demonstrates remarkable advantages, especially on English, multilingual, code, and math benchmarks. As for Chinese benchmarks, except for CMMLU, a Chinese multi-subject multiple-choice task, DeepSeek-V3-Base also shows better performance than Qwen2.5 72B. (3) Compared with LLaMA-3.1 405B Base, the largest open-source model with 11 times the activated parameters, DeepSeek-V3-Base also exhibits much better performance on multilingual, code, and math benchmarks. As for English and Chinese language benchmarks, DeepSeek-V3-Base shows competitive or better performance, and is especially good on BBH, MMLU-series, DROP, C-Eval, CMMLU, and CCPM.

從更詳細(xì)的角度來(lái)看,我們將 DeepSeek-V3-Base 與其他開(kāi)源基礎(chǔ)模型逐一進(jìn)行比較。(1) 與 DeepSeek-V2-Base 相比,得益于模型架構(gòu)的改進(jìn)、模型規(guī)模和訓(xùn)練 Token 數(shù)量的擴(kuò)大以及數(shù)據(jù)質(zhì)量的提升,DeepSeek-V3-Base 如預(yù)期般取得了顯著更好的性能。(2) 與目前最先進(jìn)的中文開(kāi)源模型 Qwen2.5 72B Base 相比,DeepSeek-V3-Base 僅使用其一半的激活參數(shù),卻展現(xiàn)出顯著優(yōu)勢(shì),尤其是在英語(yǔ)、多語(yǔ)言、代碼和數(shù)學(xué)基準(zhǔn)測(cè)試上。在中文基準(zhǔn)測(cè)試中,除了中文多學(xué)科多選題任務(wù) CMMLU 之外,DeepSeek-V3-Base 的表現(xiàn)同樣優(yōu)于 Qwen2.5 72B。(3) 與激活參數(shù)量是其 11 倍的最大開(kāi)源模型 LLaMA-3.1 405B Base 相比,DeepSeek-V3-Base 在多語(yǔ)言、代碼和數(shù)學(xué)基準(zhǔn)測(cè)試上也表現(xiàn)出明顯更好的性能。在英語(yǔ)和中文語(yǔ)言基準(zhǔn)測(cè)試上,DeepSeek-V3-Base 表現(xiàn)出具有競(jìng)爭(zhēng)力或更好的性能,尤其在 BBH、MMLU 系列、DROP、C-Eval、CMMLU 和 CCPM 上表現(xiàn)突出。
