December 28, 2024
DeepSeek V3: The Open-Source AI Revolution
DeepSeek is causing a stir in the AI community with its latest model, DeepSeek-V3. This isn't just another iteration; it's a powerful force that's actively outperforming many top-tier AI models, particularly those behind closed doors. Forget the idea of open-source AI playing second fiddle – DeepSeek-V3 is setting a new benchmark and blazing its own trail.
What makes DeepSeek-V3 truly stand out is its remarkable speed and efficiency, processing information at a blistering 60 tokens per second - a threefold increase over its predecessor. But it's not just a speed demon; it’s also a versatile powerhouse capable of handling complex tasks from coding and math to text processing, proving to be a multi-faceted tool in the digital realm.
Perhaps most surprising is that DeepSeek-V3 is completely open-source, with its weights free to download. Available through an API, a chat website, or for local deployment, and priced very competitively, it is positioned to be the go-to solution for anyone seeking cutting-edge AI without breaking the bank.

I. DeepSeek V3's Architecture and Technical Details
Mixture-of-Experts (MoE) architecture
DeepSeek-V3 employs a Mixture-of-Experts (MoE) architecture, which is a key factor in its performance and efficiency. This architecture consists of multiple neural networks, each specifically optimised for different tasks.
When DeepSeek-V3 receives a prompt, a component called a router directs each token to the expert networks best suited to handle it. This selective activation is what makes the MoE architecture so efficient: only a fraction of the model's parameters, 37 billion of the 671 billion total, is engaged for any given token, so compute and hardware costs scale with the active subset rather than the full model.
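To make the routing idea concrete, here is a minimal, illustrative top-k MoE layer in PyTorch. The dimensions, expert count, and router design are toy values chosen for readability, not DeepSeek-V3's actual configuration.

```python
# Minimal sketch of Mixture-of-Experts routing (illustrative, not DeepSeek's code).
# A learned router scores the experts for each token, and only the top-k experts
# run, so most of the layer's parameters stay idle on any given input.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)   # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                         # x: (tokens, dim)
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):               # run only the selected experts
            for e in idx[:, k].unique().tolist():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(5, 64)).shape)              # torch.Size([5, 64])
```

Only the experts the router selects ever execute, which is how a 671-billion-parameter MoE can run at roughly the per-token cost of a 37-billion-parameter dense model.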
Parameters
DeepSeek-V3 has a total of 671 billion parameters, of which only 37 billion are active for each token during processing, thanks to the selective routing described above. The version hosted on Hugging Face is slightly larger, at 685 billion parameters, because it also bundles the weights of the Multi-Token Prediction (MTP) module.
The MTP module adds roughly 14 billion parameters on top of the 671-billion-parameter core model, which accounts for the 685 billion total. To put that scale in perspective, DeepSeek-V3 is roughly 1.6 times the size of Meta's already massive Llama 3.1 405B model.
Training Data
The model's impressive capabilities are underpinned by the vast amount of data it was trained on: a dataset of 14.8 trillion tokens. In language modelling, a token represents a small chunk of raw text; as a rule of thumb, 1 million tokens equate to around 750,000 words.
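A quick back-of-the-envelope conversion shows what that figure means in words, using the rule of thumb quoted above (the exact ratio depends on the tokenizer):

```python
# Rough conversion of DeepSeek-V3's training corpus from tokens to words,
# using the ~750,000 words per 1M tokens heuristic cited in the text.
TOKENS = 14.8e12                          # 14.8 trillion training tokens
WORDS_PER_MILLION_TOKENS = 750_000
words = TOKENS / 1e6 * WORDS_PER_MILLION_TOKENS
print(f"~{words:.2e} words")              # ~1.11e+13, i.e. roughly 11 trillion words
```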
Multi-head Latent Attention (MLA)
DeepSeek-V3 uses a technique called multi-head latent attention (MLA), an enhanced version of the attention mechanism commonly used in large language models. Attention mechanisms help models identify the most important parts of a text. MLA improves on the standard design by compressing the attention keys and values into a compact latent representation, which sharply reduces the memory needed during generation while still letting the model attend to crucial information, keeping it both fast and accurate.
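The core trick can be sketched in a few lines: project keys and values down to a small shared latent, cache only that latent, and expand it back per head at attention time. This is a simplified illustration with toy dimensions; DeepSeek-V3's actual formulation differs in detail (for example, it handles positional encodings separately).

```python
# Simplified sketch of MLA-style low-rank key/value compression.
# Only the small latent c_kv needs to be cached per token, instead of
# full per-head keys and values.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, latent, heads, head_dim, seq = 64, 16, 4, 16, 10
x = torch.randn(seq, dim)

down = nn.Linear(dim, latent, bias=False)         # compress KV into a latent
up_k = nn.Linear(latent, heads * head_dim, bias=False)
up_v = nn.Linear(latent, heads * head_dim, bias=False)
w_q = nn.Linear(dim, heads * head_dim, bias=False)

c_kv = down(x)                                    # (seq, latent): the KV cache
q = w_q(x).view(seq, heads, head_dim).transpose(0, 1)
k = up_k(c_kv).view(seq, heads, head_dim).transpose(0, 1)
v = up_v(c_kv).view(seq, heads, head_dim).transpose(0, 1)

out = F.scaled_dot_product_attention(q, k, v)     # (heads, seq, head_dim)
print(out.shape, "| cached floats per token:", latent)
```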
Multi-Token Prediction
Typical language models generate text one token at a time. DeepSeek-V3, by contrast, can predict several tokens at once. This multi-token prediction feature significantly speeds up inference, the process of generating text from the model. The same extra predictions can also be used for speculative decoding, which accelerates inference further.
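The toy sketch below shows the verification pattern that makes this safe: a cheap draft proposes several tokens, the full model checks them, and the longest agreeing prefix is accepted, so each full-model pass can yield more than one token without changing the output. Both "models" here are stand-in functions, not DeepSeek-V3's MTP module.

```python
# Toy greedy speculative decoding: accept the draft's tokens as long as the
# "full" model agrees, then take one token from the full model itself.
import random
random.seed(0)

def draft_propose(prefix, k=4):
    # Stand-in for a cheap draft model: guesses k tokens ahead.
    return [random.randint(0, 9) for _ in range(k)]

def target_next(prefix):
    # Stand-in for the full model's next-token prediction (deterministic toy rule).
    return (sum(prefix) * 7 + 3) % 10

def speculative_step(prefix, k=4):
    proposal = draft_propose(prefix, k)
    accepted = []
    for tok in proposal:                  # keep the longest agreeing prefix
        if tok == target_next(prefix + accepted):
            accepted.append(tok)
        else:
            break
    accepted.append(target_next(prefix + accepted))   # always gain >= 1 token
    return accepted

print(speculative_step([1, 2, 3]))        # one or more tokens per "full" pass
```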
FP8 Mixed Precision Training Framework
DeepSeek-V3 was trained using an FP8 mixed precision training framework, the first time the approach has been validated on a model of this scale, and it proved both feasible and effective. FP8 (8-bit floating point) is a numerical format more compact than the usual 16-bit or 32-bit formats, so it requires less memory and can significantly speed up computation.
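The memory savings are easy to demonstrate. The snippet below, which assumes PyTorch 2.1 or later for the float8_e4m3fn dtype, compares the same tensor stored in FP32, FP16, and FP8; production FP8 training also relies on careful scaling tricks that are omitted here.

```python
# Compare the storage cost of one tensor in FP32, FP16, and FP8 (E4M3),
# and measure the rounding error FP8 introduces. Requires PyTorch >= 2.1.
import torch

x = torch.randn(1024, 1024)
for dtype in (torch.float32, torch.float16, torch.float8_e4m3fn):
    y = x.to(dtype)
    print(f"{str(dtype):24s} {y.element_size() * y.numel() / 2**20:4.1f} MiB")

err = (x - x.to(torch.float8_e4m3fn).to(torch.float32)).abs().mean()
print(f"mean abs rounding error: {err:.4f}")
```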
Training Efficiency
DeepSeek-V3's training process was remarkably efficient. The pre-training phase required only 2.664 million H800 GPU hours, and the subsequent training stages added only about 0.1 million GPU hours more. DeepSeek trained the model on a cluster of 2048 GPUs in around two months, and the company claims the total training cost was only about $5.5 million, significantly lower than for other models of similar scale.
For example, Llama 3 405B used 30.8 million GPU hours, around 11 times the compute of DeepSeek-V3. This achievement demonstrates that it is possible to train large language models with less compute than previously thought, which could open the door to more efficient and affordable AI development. DeepSeek’s approach highlights how advancements in algorithms and data can reduce the need for very large GPU clusters.
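These figures are easy to sanity-check with a little arithmetic:

```python
# Sanity-checking the reported training budget: total GPU hours, implied
# wall-clock time on a 2,048-GPU cluster, and the ratio to Llama 3 405B.
pretrain_hours = 2.664e6              # H800 GPU hours, pre-training
post_hours = 0.1e6                    # remaining training stages
total = pretrain_hours + post_hours

days = total / 2048 / 24              # GPU hours -> wall-clock days on 2,048 GPUs
print(f"total: {total / 1e6:.3f}M GPU hours, ~{days:.0f} days on 2,048 GPUs")
print(f"Llama 3 405B used {30.8e6 / total:.1f}x the compute")   # ~11x
```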
Reasoning Capabilities
DeepSeek has also incorporated advanced reasoning capabilities into DeepSeek-V3. The model distills its reasoning capabilities from the DeepSeek R1 series of models. DeepSeek's pipeline integrates the verification and reflection patterns of R1 into DeepSeek-V3. This results in an improved reasoning performance for DeepSeek-V3.
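At its simplest, this kind of distillation amounts to fine-tuning the student on sequences produced by the teacher using the ordinary next-token loss. The sketch below shows that pattern with a tiny stand-in model and a fabricated reasoning trace; DeepSeek's actual R1-to-V3 pipeline is considerably more involved.

```python
# Conceptual sketch of distilling reasoning via supervised fine-tuning:
# train the student to reproduce a teacher-generated trace token by token.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 100, 32
student = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))

# Pretend the teacher (an R1-style model) emitted this reasoning trace.
teacher_trace = torch.randint(0, vocab, (1, 12))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(3):                               # a few fine-tuning steps
    logits = student(teacher_trace[:, :-1])      # predict each next token
    loss = F.cross_entropy(logits.reshape(-1, vocab),
                           teacher_trace[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
print(f"distillation (SFT) loss: {loss.item():.3f}")
```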
Now that we've explored the technical foundations of DeepSeek V3, let's turn our attention to its performance and benchmark results, to see how these technical innovations translate into real-world capabilities.
II. Performance and Benchmarks
The numbers are in, and DeepSeek-V3 is not just impressive, it's a serious contender. It's not content with merely matching other models – it's actively outperforming many open-source alternatives and even holding its own against top closed-source competitors. And, as mentioned, it's also lightning-fast, processing 60 tokens per second, which is three times faster than DeepSeek V2.
DeepSeek-V3 incorporates advanced features that enhance its performance.
- It uses a Mixture-of-Experts (MoE) architecture with 671 billion parameters, with 37 billion activated per token. This allows for efficient processing by activating only a portion of the network for each task.
- It utilizes Multi-head Latent Attention (MLA) to extract key details from text multiple times, improving its accuracy.
- It also incorporates Multi-Token Prediction to generate several tokens at once, which speeds up inference.
The model was trained on 14.8 trillion tokens and shows strong performance across various benchmarks.
DeepSeek-V3 demonstrates a strong aptitude for competitive programming challenges, surpassing Claude 3.5 Sonnet on the Codeforces benchmark. It also excels on the Aider Polyglot test, which measures a model's ability to integrate new code into existing code. The results show that the top performers are:
- o1-2024-11-12 leads the benchmark with nearly 65% accuracy in the "whole" edit format, showing exceptional performance across tasks.
- DeepSeek Chat V3 Preview and Claude 3.5 Sonnet (2024-10-22) follow closely, with scores in the range of 40–50%, demonstrating solid task completion in both formats.

DeepSeek V3 also achieves a score of 88.5 on the MMLU benchmark, slightly behind Llama 3.1, but outperforming Qwen2.5 and Claude 3.5 Sonnet. It also scores 91.6 on the DROP benchmark, outperforming the same models and demonstrating strong reasoning capabilities.
The model supports a context window of up to 128K tokens and, as noted above, used FP8 mixed precision training for efficiency.

The performance of DeepSeek V3 is impressive, but to be truly useful, an AI model needs to be accessible. The next section will explore how DeepSeek V3 is made available to users.
III. Accessibility and Usage
Performance means little if a model is locked away behind an impenetrable wall. Thankfully, DeepSeek-V3 prioritizes accessibility:
Open Source: You can grab the code and tinker to your heart's content on GitHub, and the model weights are readily available on Hugging Face. This means that it can be used for a plethora of applications including commercial projects.
API Access: DeepSeek offers an API that's compatible with OpenAI's API, which makes integration with existing systems easy (see the example after this list).
Chat Website: You can hop on the DeepSeek website and chat with V3 directly, no coding or APIs required.
Deep Roles: Think of it as customized AI companions – Deep Roles will allow users to craft their own or explore roles created by others, similar to Custom GPTs.
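Because the API mirrors OpenAI's, the standard openai Python client works with just a different base URL. The endpoint and model name below follow DeepSeek's public documentation; you supply your own API key.

```python
# Calling DeepSeek-V3 through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY",     # replace with a real key
                base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-chat",                           # DeepSeek-V3 chat model
    messages=[{"role": "user", "content": "Explain MoE routing in one line."}],
)
print(resp.choices[0].message.content)
```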
You can also deploy DeepSeek-V3 locally. The recommended setup is eight H200 GPUs, but the model can be deployed on other hardware, including NVIDIA GPUs, AMD GPUs, and Huawei Ascend NPUs. Plenty of open-source frameworks support this, including DeepSeek-Infer Demo, SGLang, LMDeploy, TensorRT-LLM, and vLLM, which shows its adaptability across platforms.
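As one example of local deployment, vLLM can serve the model in a few lines. This is a sketch that assumes a working vLLM installation and hardware with enough memory for the weights (hundreds of gigabytes); the exact arguments may need tuning for your setup.

```python
# Serving DeepSeek-V3 locally with vLLM, sharded across 8 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V3",
          tensor_parallel_size=8,          # shard weights across 8 GPUs
          trust_remote_code=True)
params = SamplingParams(max_tokens=64, temperature=0.7)
print(llm.generate(["Hello, DeepSeek!"], params)[0].outputs[0].text)
```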
DeepSeek-V3 also excels at various text-based tasks. It is excellent for coding, translations, and marketing content generation. All of these tasks are made possible by its efficiency at text processing.
IV. Inference Costs
The API pricing structure mirrors that of DeepSeek V2 until February 8th, 2025. After this, the price will be set at:
- Input: $0.27 per million tokens (cache miss)
- Input: $0.07 per million tokens (cache hits)
- Output: $1.10 per million tokens
To put it simply, DeepSeek is much more affordable than models like Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro. In fact, DeepSeek V3 is 53x cheaper to use for inference than Claude 3.5 Sonnet! On OpenRouter, it costs a mere $0.14 per million input tokens and $0.28 per million output tokens.
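To get a feel for what those rates mean per request, here is the arithmetic for a typical call at the post-February-2025 prices:

```python
# Per-request cost under the listed DeepSeek-V3 API prices (USD per 1M tokens).
IN_MISS, IN_HIT, OUT = 0.27, 0.07, 1.10

def cost(in_tokens, out_tokens, cached=False):
    rate_in = IN_HIT if cached else IN_MISS
    return (in_tokens * rate_in + out_tokens * OUT) / 1e6

# e.g. a 2,000-token prompt with a 500-token reply, cache miss:
print(f"${cost(2000, 500):.6f} per request")   # ~$0.001090
```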

V. Limitations
DeepSeek-V3, due to Chinese regulations, avoids politically sensitive topics. You won't get answers about:
- Tiananmen Square
- Xi Jinping
- The geopolitical implications of China invading Taiwan
This is due to Chinese regulations that require models to “embody core socialist values”. Also, it's not immune to "jailbreaking," meaning those with the know-how can bypass the safeguards.
It's important to note that these restrictions are not unique to DeepSeek-V3 but are a common feature of AI models developed within China. This is due to the political and regulatory environment in which these models are created.
VI. Application and Impact
The impact of DeepSeek-V3 is undeniable. Here's why:
Research and Development: An open-source, high-performing model like this fuels innovation, allowing researchers to experiment with and build upon DeepSeek's technology.
Commercial Applications: The licensing makes commercial use permissible, opening it up to numerous applications across different industries.
Democratisation of AI: By making powerful AI accessible, it levels the playing field, allowing smaller organizations to compete.
Cost-Effective Solutions: Lower training costs and competitive pricing make it a compelling choice for anyone looking to leverage AI without massive financial burdens.
Challenging the Status Quo: Its ability to challenge top closed-source models signals that open-source AI is a genuine and viable alternative.
Innovation in Inference: The model's advanced inference capabilities, which use 32 H800 GPUs for prefill and 320 H800 GPUs for decoding, showcase a new level of sophistication in model deployment and set the standard for the future.
Conclusion
DeepSeek-V3 isn't just another incremental improvement – it’s a major leap. Its exceptional performance, combined with an open-source approach, suggests a paradigm shift. Top-tier AI, it seems, doesn’t necessarily require exorbitant costs or restrictive licensing.
DeepSeek-V3's speed, versatility, and accessibility make it a force in the AI landscape, showcasing the power of collaboration and democratization in technology. It’s a bold statement: open development can not only keep pace but even outpace traditional models. This is more than just an impressive model; it's a beacon, guiding us toward a more inclusive and collaborative future for artificial intelligence.
