
Open and Closed: The Pursuit of Frontier Models

tecosystems

By Stephen O'Grady | @sogrady | May 15, 2026

In the beginning, software was open. Not because that was thought to be the correct strategic approach, but rather because software was an afterthought. Hardware was what mattered. Less than two decades later, the hardware was cheaper and consequently mattered less. In search of greater returns on capital, the focus swung back to software. To maximize those returns, software was turned from open to closed.

Ever since, software has been in a constant tug of war between open and closed. With operating systems, virtualization software, mobile and other categories, closed software led the way and open gave chase. For big data, containers, programming languages and web servers, the roles were reversed. Open source typically led, while closed and proprietary models have had to keep up.

Models, for all that they are built on and utterly dependent on a foundation of open source, are very much in the former camp. What began open became closed. “Frontier” models – which is to say models that push the “frontier” – are universally proprietary or closed, OpenAI’s name notwithstanding.

Ever since ChatGPT was unleashed on the world on the 30th of November, 2022, however, there have been – inevitably – efforts to counter the dominance of proprietary models with “open” alternatives (we’ll come back to what open means in this context). The technology industry has a long history of both dominant players and federated resistance to those dominant players.
At a recent industry event, an AI executive likened the open models chasing their closed frontier counterparts to a “pack of wolves.” Casual observers of the industry could be forgiven for not knowing open alternatives existed, because almost all of the media’s attention is consumed by coverage of Anthropic and OpenAI’s latest achievements – though arguably that is in part because of the latter’s notable tendency to strategically time its releases around Google AI announcements to minimize their impact.

Whether open will compete with closed, then, is not the interesting question. It always has, it always will. The question to ask is instead: how well? Put another way, will the “pack of wolves” ever catch their prey, and if so, how quickly?

Evaluation of AI models is challenging for many reasons. Anecdotal experimentation is useful – anyone who used models before and after November of last year would be struck by the difference in capability – but it obviously doesn’t scale. The only real standardized quantitative measurement available, however, is industry benchmarks. Given that benchmarks were gamed almost to the point of irrelevance decades ago during the TPC-C wars, they would not otherwise be the first choice for evaluation, but at this point they are the least worst method of measuring performance model versus model.

That being said, there are many specific concerns with benchmarks generally and with those selected here. Among them:

Contamination: models can be trained on data that includes benchmark test questions, either accidentally or deliberately.
Self-Reporting: benchmark scores are typically self-reported by the labs that created the models.
No Standardized Approach: benchmark scores can vary widely depending on scaffold, prompt, number of attempts, etc., and benchmarks typically don’t standardize the approach.
Specificity: as will be seen momentarily, benchmarks typically have a specific area of focus. None can adequately cover or represent the breadth of actual real world use cases, and notably the benchmarks here are text in, text out – not multi-modal.
Difficulty: to measure progress over time, the benchmarks selected for this analysis had to have actual history. This means that more difficult or challenging benchmarks that have emerged more recently and may be more strenuous tests of ability are not represented here because they would not reveal any real trendlines worth noting.
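
To make the "number of attempts" caveat concrete, here is a minimal Python sketch of the pass@k estimator commonly used for code benchmarks such as HumanEval. It estimates the chance that at least one of k sampled answers is correct, given n generated samples of which c passed. The function name and the sample counts in the usage lines are illustrative assumptions and do not come from this dataset.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which were correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all incorrect.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical model: 10 samples per problem, 3 of them correct.
print(round(pass_at_k(10, 3, 1), 3))   # 0.3
print(round(pass_at_k(10, 3, 5), 3))   # 0.917
print(round(pass_at_k(10, 3, 10), 3))  # 1.0
```

The same hypothetical model looks like a 30% performer with one attempt and a 92% performer with five, which is why scores reported under different attempt budgets are not directly comparable.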

In addition to those caveats, it’s important to note that there are dozens of potential benchmarks – some general, some specialized – that could be used. The selection process here prioritized consistently available scores across a wide variety of models and a reasonable history to evaluate. This, in other words, is a snapshot of benchmarks, and other selections might produce different results.

One last necessary clarification before proceeding is the definition of “open.” This analysis includes both closed and open models. Closed is closed, but open includes two distinct subsets of models: open weight, and fully open. Fully open refers simply to models that are licensed according to a known and OSI-approved open source license: Apache, MIT, etc. “Open weight,” on the other hand, is the emerging industry consensus term for models that are mostly open, but include some restrictions on use that prevent them from being called open source – the most common example of which in this dataset is Llama.

With that context out of the way, let’s start with a simple glossary of the benchmarks selected. Notably, the benchmarks here are arranged in order from most to least “saturated.” Saturated refers to benchmarks that have effectively been solved by all or most models, and thus are no longer useful for measuring relative capabilities. In spite of their lack of utility today, saturated benchmarks are included in this analysis because they demonstrate the historical progress open models have made in catching up with their proprietary counterparts.

We’ll begin by examining one of these fully saturated benchmarks, GSM8K. From GPT-3.5’s performance in December of 2022, within 16 months GSM8K’s grade school math problems were effectively solved. And importantly, by both open and closed models. By late 2024, fully open DeepSeek effectively matched Claude Sonnet’s ~96% score.
It’s also notable that the 7B Llama released in July of 2023 was basically guessing at 15%, but the 8B Granite model released in May of 2026 was at 93% – meaning that even small models’ performance was improving rapidly.

Next, we’ll look at a slightly less saturated benchmark, HumanEval. The “Pass@1” in the above means the model only gets one shot at the question, and it’s excluding other related benchmarks like LiveCodeBench. Again, we see the same pattern playing out, with both open and closed models alike largely solving HumanEval, though the scores are slightly lower than GSM8K. Also worth noting is that the 30B Granite 4.1 matches the 405B Llama 3.1 from two years ago, suggesting that open but restricted license models are not outperforming purely open alternatives – regardless of size.

Slightly less saturated than HumanEval is MMLU. Smaller models aren’t faring as well with the broader set of 14,000 questions: 7B Mixtral represents the peak at a bit over 70%, and that hasn’t been exceeded since. The larger tier, 70B+, has for its part stalled around a GPT-4o level of capability. The larger open models like DeepSeek have performed well, though nothing close to the closed Opus 4.6. It’s also worth noting that while open weight models led the way performance-wise, fully open models followed quickly after and now claim the highest scores.

Next is MATH, which is where things begin to separate. The frontier models score around 90%, while the best open models tap out at 83%. Some of this admittedly might be an artifact of the fact that newer models are more commonly reporting against the AIME and MATH-500 benchmarks rather than classic MATH. Two other things to note: model size doesn’t seem to play a major role in performance, and the older fully open Qwen model still outperforms newer open weight Llama alternatives.

Now we’ll see even more separation in GPQA Diamond. There are a number of interesting takeaways here.
For one, DeepSeek R1 was ahead of all models, open or closed, when it debuted. But closed models made a big jump in the form of Gemini a few months later, and it took almost a year for open models to close the gap. Earlier this year, DeepSeek, GLM, Kimi and Qwen approached the performance of Anthropic and OpenAI, but did not quite match it.

Lastly, let’s look at SWE-bench Verified – 500 human-validated real GitHub issues. From May of last year through early this year, all progress came from closed models. In February, however, things began to speed up. Open and closed models alike – Gemini, GLM, Kimi, MiniMax, Opus, Sonnet, etc. – all landed within 73-81%. Opus 4.7, for all of its other launch issues, jumped to ~88%, while DeepSeek V4 Pro leads the open contingent at ~81%.

The pattern here is clear and consistent: closed leaps forward, open is hot on its heels. And the cycle appears to be getting faster. To explore that, let’s look at the time it took open models to match closed model capabilities on the saturated benchmarks we examined earlier. It took 18 months for Qwen to match GPT-4’s capabilities within the MMLU benchmark, and 13 months for Llama to do the same for HumanEval and MATH. The longest it took an open model to match GPT-4o’s capabilities on any of those benchmarks, however, was seven months, and Llama matched its performance on HumanEval in two.

But what about the harder, non-saturated benchmarks? It’s more of the same. DeepSeek caught up to Opus 4.6’s capabilities on GPQA in three months, and MiniMax did the same on SWE-bench in one. None have matched Opus 4.7 as yet, but it’s been less than a month.
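
The catch-up times above come down to simple date arithmetic. As a rough sketch of one way to compute them, the Python below finds the first open model released on or after a closed reference model that meets or beats its score, then reports the gap in calendar months. The `Result` type, the function names, and every model name, date and score here are hypothetical placeholders, not values from this dataset, and the original analysis may have defined a "match" differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class Result:
    model: str
    released: date
    score: float   # benchmark score, in percent
    is_open: bool  # True for open weight or fully open models


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end, ignoring the day of month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def catch_up_months(reference: Result, results: Sequence[Result]) -> Optional[int]:
    """Months until the first open model matches the reference score.

    Only open models released on or after the reference are considered.
    Returns None if no open model has matched it yet.
    """
    matches = [
        r for r in results
        if r.is_open
        and r.released >= reference.released
        and r.score >= reference.score
    ]
    if not matches:
        return None
    first = min(matches, key=lambda r: r.released)
    return months_between(reference.released, first.released)


# Hypothetical data for illustration only.
closed_ref = Result("closed-A", date(2024, 1, 15), 85.0, False)
open_results = [
    Result("open-B", date(2024, 4, 1), 80.0, True),
    Result("open-C", date(2024, 8, 20), 86.0, True),
    Result("open-D", date(2024, 11, 5), 90.0, True),
]
print(catch_up_months(closed_ref, open_results))  # 7
```

Running the same function against each successive closed release on a benchmark would show whether the lag is shrinking over time.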

Takeaways

There are any number of different conclusions to be drawn from this dataset – with the caveats noted above – but here are five that stand out.

Closed models are setting the pace of innovation, and constantly breaking new ground from a capabilities standpoint.
Open models are chasing them, and the cycle times seem to be getting shorter. There are no clear capability moats, and what is frontier today is table stakes tomorrow.
Closed beats open today, but there is effectively no advantage to restricted open weight vs fully open models.
Small models are extremely competitive in specialized disciplines, but lag behind on general performance.
The United States has the largest contingent of surveyed models (42), and the largest proportion of closed models (64%). China, by contrast, features 17 models, and every single one is either open weight or fully open.

Having performed this base level analysis, it will be necessary to track how these models continue to evolve and how the benchmarks evolve with them.

Disclosure: Amazon (Nova), Google (Gemini, Gemma) and IBM (Granite) are all RedMonk clients. 01.AI (Yi), Alibaba (Qwen), Anthropic (Opus, Sonnet), DeepSeek, Mistral (Mistral, Mixtral), Meta (Llama), MiniMax, Moonshot (Kimi), OpenAI (GPT) and Zhipu (GLM) are not currently RedMonk customers.
