Qwen3: Alibaba’s model that challenges OpenAI and DeepSeek in mathematics and coding

aivancity

12 months ago

A new generation of Chinese models is making its mark in the field of reasoning

While large language models are dominated by the United States, China is gradually strengthening its position in the field of advanced artificial intelligence. With Qwen3, Alibaba aims to offer a competitive model in the strategic areas of mathematical reasoning and code generation. The stakes are not only technological; they are also symbolic. In an increasingly polarized international context, the ability to produce high-performing and reliable models is becoming an indicator of digital sovereignty.

According to the latest report from Hugging Face, Chinese contributions now account for 29% of the new models published on the platform¹. Qwen3 is part of this upward trend, explicitly aiming to match the performance of GPT-4, Claude 3, and DeepSeek on STEM (science, technology, engineering, mathematics) tasks.

Qwen3: A Leap Forward in Mathematics and Coding

The Qwen3 family comes in several versions:

Qwen3-7B, a compact model suitable for local inference
Qwen3-72B, a large dense model
Qwen3-MoE, a Mixture of Experts variant that is more computationally efficient
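
To make the efficiency argument behind Mixture of Experts concrete, here is a minimal sketch of generic top-k routing. It is not Qwen3's actual implementation, which this article does not describe: the function names, the eight toy experts, and the gate scores are all assumptions chosen for illustration. The point is that only k experts run for a given input, so compute grows with k rather than with the total number of experts.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, k))

# Hypothetical setup: eight toy "experts" that scale their input by 1..8.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
# Hypothetical gate scores; in a real model a learned layer produces these.
gate_scores = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]

selected = route_top_k(gate_scores, k=2)
output = moe_forward(3.0, experts, gate_scores, k=2)
print(selected)
print(output)
```

With k=2, six of the eight experts are never evaluated for this input, which is where the savings over a dense model of the same total size would come from.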

In April 2025, Alibaba announced that Qwen3-72B outperformed GPT-4 on certain advanced mathematics benchmarks, including MATH and GSM8K, while scoring over 81% on HumanEval, a standard benchmark for Python code generation².
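
HumanEval results are usually reported as pass@k: the probability that at least one of k generated solutions passes the problem's unit tests. The article does not say which k lies behind the 81% figure, so the sketch below is only an illustration of how such a score is typically computed, using the standard unbiased estimator. The per-problem counts are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number that pass the tests.
    """
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical per-problem results: (samples generated, samples passing).
results = [(10, 10), (10, 3), (10, 0), (10, 7)]
print(f"pass@1 = {benchmark_pass_at_k(results, 1):.3f}")
print(f"pass@5 = {benchmark_pass_at_k(results, 5):.3f}")
```

For k=1 the estimator reduces to the fraction of passing samples, which is why pass@1 is often read loosely as an "accuracy rate".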

The model is based on extensive multilingual pre-training, enriched with structured mathematical datasets (ProofWiki, arXiv, MathQA) and millions of code examples. Further fine-tuning was performed using reinforcement learning from human feedback (RLHF) techniques, with a focus on the logical rigor and readability of the generated code.

Compare to understand: How does Qwen3 stack up against the market leaders?

Here is a comparison table showing the scores achieved by Qwen3 and its competitors on industry-standard benchmarks:

Model	GSM8K (mathematical reasoning)	MATH (formal problems)	HumanEval (Python code)	MBPP (simple programming)	License
Qwen3-72B	89.6%	54.1%	81.2%	71.5%	Apache 2.0
GPT-4	~92%	~50%	~88%	~77%	Proprietary
DeepSeek Coder	88.8%	N/A	84.1%	75.3%	MIT
Claude 3 Opus	89.3%	~47%	~83%	~72%	Proprietary

Combined sources: published technical reports, independent reproducible tests (April–July 2025)

This table shows that Qwen3 is a serious contender among the best models on the market, even though it is released under an open-source license. It stands out in formal mathematics, a field that has historically been challenging for large language models (LLMs): its 54.1% on MATH is the highest score listed. On GSM8K it ranks second behind GPT-4, while on HumanEval and MBPP it trails the three other models.
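
To check a reading of the table, the scores can be ranked per benchmark. The snippet below is a small sketch that copies the figures from the table as they appear; approximate values (marked "~" in the table) are treated as point values, which is a simplifying assumption, and the N/A entry is skipped.

```python
# Scores (%) copied from the comparison table; None marks a missing value (N/A).
# Values given as "~" in the table are treated as exact here (an assumption).
scores = {
    "Qwen3-72B":      {"GSM8K": 89.6, "MATH": 54.1, "HumanEval": 81.2, "MBPP": 71.5},
    "GPT-4":          {"GSM8K": 92.0, "MATH": 50.0, "HumanEval": 88.0, "MBPP": 77.0},
    "DeepSeek Coder": {"GSM8K": 88.8, "MATH": None, "HumanEval": 84.1, "MBPP": 75.3},
    "Claude 3 Opus":  {"GSM8K": 89.3, "MATH": 47.0, "HumanEval": 83.0, "MBPP": 72.0},
}

def rank(benchmark):
    """Return (model, score) pairs sorted from best to worst, skipping N/A."""
    rows = [(model, s[benchmark]) for model, s in scores.items()
            if s[benchmark] is not None]
    return sorted(rows, key=lambda row: row[1], reverse=True)

for benchmark in ["GSM8K", "MATH", "HumanEval", "MBPP"]:
    ordered = ", ".join(f"{model} ({score})" for model, score in rank(benchmark))
    print(f"{benchmark}: {ordered}")
```

Because several entries are approximate and come from different sources, small gaps in these rankings should not be over-interpreted.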

Why reasoning and coding skills matter today

A model’s logical and algorithmic capabilities are not trivial. They determine its ability to:

outline a step-by-step response
manage complex dependencies in chains of reasoning
generate executable, optimized, and readable code

These skills are now in demand across several sectors: science education, research support, software prototyping, and the automation of technical tasks. In 2025, nearly 44% of developers surveyed by Stack Overflow reported using AI to test or write code on a daily basis³.

Limitations, methodological uncertainties, and open questions

Despite its relative openness, Qwen3 has some gray areas:

lack of details regarding the exact corpora used, particularly proprietary code
limited documentation for the full reproducibility of the results
assessments that are sometimes conducted internally, without an independent third-party auditor

Furthermore, the Mixture of Experts architecture can introduce non-deterministic variations in the results, making it more difficult to evaluate the model consistently.
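
One common way to deal with run-to-run variation is to repeat the evaluation and report the spread rather than a single number. The sketch below assumes five repeated runs of the same benchmark; the scores are hypothetical and the function name is illustrative.

```python
import statistics

def summarize_runs(run_scores):
    """Return (mean, sample standard deviation, range) for repeated runs."""
    mean = statistics.mean(run_scores)
    stdev = statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0
    spread = max(run_scores) - min(run_scores)
    return mean, stdev, spread

# Hypothetical HumanEval scores (%) from five repeated runs of one model.
run_scores = [81.1, 80.5, 81.7, 80.5, 81.2]
mean, stdev, spread = summarize_runs(run_scores)
print(f"{mean:.2f} ± {stdev:.2f} (range {spread:.2f})")
```

If the range across runs is comparable to the gap between two models, a single-run comparison between them says very little.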

Powerful models, greater responsibility: the ethical challenges of Qwen3

The more effective a model becomes, the more sensitive the issues surrounding its use become:

In education, can it encourage automated cheating, or, on the contrary, enhance learning?
In cybersecurity, can it generate code that is potentially dangerous or designed to bypass systems?
In terms of intellectual property, how can we verify that it isn't reproducing protected code encountered during its training?
In sensitive sectors such as finance, medicine, or the legal system, what safeguards are in place to regulate the automatic generation of algorithms?

The power of a model in mathematics or programming thus raises the question of specific technical and ethical governance, which has not yet been adequately addressed in current regulations.

A Chinese player shaking up the global ecosystem?

By releasing Qwen3 under the Apache 2.0 license while delivering performance close to that of proprietary industry leaders, Alibaba is setting a strategic milestone. This model demonstrates that it is possible to combine openness, power, and specialization in a highly demanding field.

This reinforces the idea of a multipolar AI landscape, where Chinese, American, and European models will coexist, each with its own architectural, licensing, and usage choices. But for this diversity to be beneficial, it must be accompanied by a collective effort toward interoperability, documentation, and scientific transparency.

Learn more

For more, see our blog article "DeepSeek R1-0528: The open-source model that rivals advanced AI systems", which examines how DeepSeek stacks up against the giants of open AI.

References

1. Hugging Face. (2025). The Open LLM Ecosystem Report Q2.
https://huggingface.co/

2. Alibaba DAMO Academy. (2025). Qwen3 Technical Report.
https://modelscope.cn/

3. Stack Overflow Developer Survey. (2025). How Developers Use AI Tools.
https://stackoverflow.blog/