MiniMax M3: the open-weight frontier AI challenger
Table of Contents
If you’ve been treating open-weight models as the budget tier you settle for, MiniMax M3 is worth a second look. Released June 1, 2026, it’s the first open-weight model to combine frontier-level coding, a 1-million-token context window, and native multimodal input in a single system — and the benchmark numbers genuinely hold up against the closed frontier. This post breaks down who MiniMax is, what M3 actually does, how it stacks up against Claude, GPT and Gemini, and where the catches are.
Who MiniMax actually is and who M3 is for
MiniMax is a Shanghai-based AI lab (minimax.io) that has quietly moved from “promising newcomer” to serious frontier contender. The lab ships open-weight M-series reasoning models — M2, M2.1, M2.5, M2.7, and now M3 — alongside consumer products like Hailuo video generation and the MiniMax Agent. M2 and M2.5 got MiniMax taken seriously. M3 is the one that puts the company in the same conversation as the closed frontier labs.
The target audience is clear: developers who want frontier-level coding and agentic capability without frontier pricing or vendor lock-in. If you’re building agent loops, wiring up coding assistants, or running long-context workloads and watching your token bill climb, M3 is aimed squarely at you.
What M3 actually delivers: benchmarks and pricing
The headline claim is concrete. On SWE-Bench Pro — a harder benchmark than the original SWE-Bench — M3 scored 59.0%, edging out GPT-5.5 at 58.6% and beating Gemini 3.1 Pro at 54.2%, per the company’s own numbers. Claude still leads here: Opus 4.7 scores 64.3% on the same benchmark, so M3 approaches the top of the closed frontier without catching it.
Beyond coding, the agentic numbers are strong: 66.0 on Terminal Bench 2.1, 83.52 on BrowseComp, 74.2 on MCP Atlas, and 70.06 on OSWorld-Verified.
The pricing is where it stops being a fair fight. At launch, MiniMax M3 listed on OpenRouter at $0.60 per million input tokens and $2.40 per million output tokens, with a temporary 50% promotional discount bringing it to roughly $0.30 input and $1.20 output per million tokens. Compare that to the closed frontier: $0.30 per million input tokens against Claude Opus 4.7’s $5 per million is a 15x difference on input cost alone.
That cost profile changes what’s economically possible. At promo pricing of roughly $0.30 per million input tokens, an agent can carry hundreds of thousands of tokens of context across thousands of steps without the bill becoming the limiting factor.
The architecture: MiniMax Sparse Attention (MSA)
The reason a 1M-context model can be priced this aggressively comes down to the attention mechanism. M3 is a sparse mixture-of-experts model built on a custom attention architecture MiniMax calls MSA, which stands for MiniMax Sparse Attention. Instead of computing full attention across every token in context, MSA selects relevant key-value blocks.
The performance gains are substantial at long context. MSA cuts per-token compute at 1M context to one-twentieth of the prior generation. Prefill is 9.7x faster. Decoding is 15.6x faster. Those speedups are measured against the M2 generation, and they’re what make carrying a million tokens of context per request practical instead of prohibitive.
This lineage matters too. The earlier M2.7 already proved the trajectory: on SWE-Pro, which covers multiple programming languages, M2.7 achieved a 56.22% accuracy rate, matching GPT-5.3-Codex. And M2.7 drew attention for a “self-evolving” workflow — by autonomously triggering log-reading, debugging, and metric analysis, M2.7 handled between 30 percent and 50 percent of its own development workflow. M3 builds on that foundation with full multimodality and computer-use capability layered in.
How M3 fits against the tools you actually use
Here’s the honest positioning, tool by tool.
vs Claude (Fable 5 / Opus 4.8) and Gemini. On raw capability, M3 competes — it beats GPT-5.5 and Gemini on SWE-Bench Pro and approaches Opus. Where it wins decisively is cost and openness: you can self-host it, and you pay roughly a tenth of Opus pricing. Claude still wins on long-horizon agentic polish and ecosystem. If you want to see what the high end looks like, my hands-on with Claude Fable 5 covers where the proprietary frontier still pulls ahead.
vs ChatGPT. M3 is less of a polished consumer assistant and more of a builder’s model. If you want a chat product with a refined UX, ChatGPT wins. If you want a cheap, capable backend for your own systems, M3 is the better fit.
vs Cursor. This isn’t even a competition — Cursor is an editor, M3 is a model. M3 is exactly the kind of cheap, capable backend you plug into Cursor, Cline, or your own agent loop. If you’re running editor-based agents, pairing M3 with the patterns in my guide on running parallel coding agents with Cursor 3 is a natural combination.
For wiring M3 into a custom loop, the fundamentals carry over from any modern model — see build your first AI agent with Python and Claude for the agent-loop scaffolding, and swap the backend.
Practical access and the self-hosting reality
You can reach M3 three ways. On OpenRouter, M3 is currently $0.30 per million input tokens and $1.20 per million output tokens during a 50%-off launch promotion (regular pricing $0.60 / $2.40). The MiniMax direct API uses tiered pricing by input length. The third path is open weights for self-hosting, though as of launch open weights have not shipped, and China’s 2017 National Intelligence Law requires MiniMax to cooperate with government.
On hardware: a 1M-context open-weight sparse-MoE model is not something you run on a single consumer GPU. Even with MSA cutting compute, you’re looking at multi-GPU server-class infrastructure to hold the weights and the KV cache for long contexts. For most teams the API is the realistic on-ramp, with self-hosting reserved for data-residency or volume reasons. If you want to understand local-model mechanics on a smaller scale first, running Llama 3 locally with Ollama is a good warm-up before you scale to something this size.
The honest verdict: where the catches are
M3 is a real challenger, not hype — but take it seriously with eyes open:
- Benchmarks need a caveat. Several results were run on MiniMax’s own infrastructure with agent scaffolding, so independent verification is still pending. Vendor-run numbers are a starting point, not a guarantee of real-world reliability.
- Compliance is a genuine question. It’s a Chinese lab subject to local law, which has direct implications for data residency and enterprise compliance. For regulated workloads, that may rule it out regardless of benchmarks.
- Claude and GPT still win in places. Long-horizon agentic polish, ecosystem maturity, and safety tooling remain advantages of the closed frontier.
The takeaway: if you’re a developer deciding whether open-weight challengers are worth your time, M3 is the clearest evidence yet that the answer is yes. MiniMax M3 surpasses GPT-5.5 and Gemini 3.1 Pro on SWE-Bench Pro at 59.0%, and it does it at roughly a tenth of the price with the option to self-host. Test it against your own workloads before betting production on it — but it has earned a spot in the evaluation.