RAND Forecasting Initiative

	Past Week	Past Month	Past Year	This Season	All Time
Forecasts	0	0	17	0	17
Comments	0	0	1	0	1
Questions Forecasted	0	0	7	0	7
Upvotes on Comments By This User	0	0	0	0	0

The competitiveness of the mainland's LLMs cannot be estimated from these rankings. Researchers should be discouraged from interpreting or assuming that these leaderboards are more than entertainment products, a way of keeping score in an imagined Sino-American rivalry, akin to fantasy football.

Reason one is power consumption. A million tokens from a high-quality LLM with the best answers might cost so much that it only pays to deploy it where cooling and electricity remain very cheap or free. An analogy, though hardware rather than software, is the CPU: certain vendors' CPUs are more energy-hungry than the competition, yet those same power-inefficient CPUs are designed for maximal compute power and reach the top ranks. The rankings seldom relate performance to the effort or cost expended, such as power consumption.
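To make the point about cost-blind rankings concrete, here is a minimal sketch in Python. The model names, scores, and energy figures are invented purely for illustration and are not measurements of any real system; the helper functions are hypothetical as well. It only shows how an ordering can flip once a score is divided by the energy spent to obtain it.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    score: float         # leaderboard-style quality score (hypothetical)
    kwh_per_mtok: float  # energy per million tokens in kWh (hypothetical)


def rank_by_score(entries):
    """Order models the way a plain leaderboard would: best score first."""
    ranked = sorted(entries, key=lambda e: e.score, reverse=True)
    return [e.name for e in ranked]


def rank_by_score_per_kwh(entries):
    """Order models by score obtained per kWh spent: most efficient first."""
    ranked = sorted(entries, key=lambda e: e.score / e.kwh_per_mtok, reverse=True)
    return [e.name for e in ranked]


# Invented figures: model_a scores higher but uses four times the energy.
entries = [
    Entry("model_a", score=90.0, kwh_per_mtok=4.0),
    Entry("model_b", score=80.0, kwh_per_mtok=1.0),
]

print(rank_by_score(entries))
print(rank_by_score_per_kwh(entries))
```

With these made-up numbers the plain ranking puts model_a first, while the per-kWh ranking puts model_b first. A real comparison would need measured energy use, which the leaderboards in question do not publish.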

Reason two is the upfront cost of the hardware that an LLM requires. High-quality LLMs currently require large amounts of VRAM, unusually large power supply units, and unusual cabling and cooling systems. The purchase alone has become a problem: many AI hobbyists and startups are waiting for datacenters to sell off their last-generation hardware, say, H100 accelerators. An LLM that produces reasonable results on old but cheaper hardware is more desirable than the latest LLM that requires the latest hardware for its best results. The leaderboard doesn't capture this reality.

Reason three is that the rankings are detached from any productive use case or application. Not all tasks require the same kind of all-purpose LLM. Many task-specific LLMs won't produce good answers to many types of questions but might be excellent in a narrowly defined application. China's English-language Tongyi (Qwen2.5) is said to be the programmers' favorite because of its reasonable to excellent results in mathematics and programming, despite lower hardware requirements and flaws in other tasks. The leaderboard assumes excellence as a generalist, while the need for computer software is usually specialized.

Reason four is that many of the mainland's LLMs are not taking part, some of them from major corporations, along with LLMs that are not designed for an English-first audience; only Tongyi (aka Qwen) was explicitly made for English. It has also been widely shared and widely used, even though the leaderboard doesn't capture its popularity among developers of LLMs.

Reason five is trivial, though: the mainland reads and writes Chinese; this leaderboard and its audience don't, nor German or French for that matter. What the Chinese deem intelligent responses or capacities may not be shared by Americans, for example wit, demeanor, use of culturally specific sayings, attempts to negotiate, or presentation.

In short, this is a flawed attempt to estimate the progress of LLMs outside Silicon Valley, and an even more flawed one if it is meant to estimate the effectiveness of sabotage.

tsm
