Llama 3.x and 4 cover most general tasks. Host them on Groq for fast inference, on Together for breadth, or self-host on your own GPUs via Ollama / vLLM. Open weights mean you keep the option to leave any provider.
Open-weight LLM stack
Closed-weight APIs are the right default for most teams. They're not the right default for a regulated industry, a privacy-led product, or anyone whose customers ask "where does this data go?" Llama and Mistral are open-weight families that ship near-frontier quality with the receipts. Use Claude on a small eval set to confirm the open-weight model holds up on YOUR task.
Mistral Large for general, Codestral for code. EU-headquartered, EU data centers available. The right pick when the legal team is in the room.
Use Claude on a 30 to 50-row eval to confirm your open-weight pick holds quality on your specific task. Re-run quarterly or whenever you swap providers.
| Budget | Meta AI (Llama) | Mistral | Claude |
| --- | --- | --- | --- |
| ~$15 total | $10 (Groq / Together API) | Free (Le Chat) | $5 (eval) |
| ~$170 total | $100 (Groq scale tier) | $50 (API) | $20 (eval Pro) |
| ~$1,180 total | $1,100 (rented GPU box for Llama) | $0 (also on the same box) | $80 (regular eval + drift checks) |
**Step 1 · Pick the model for the use case** (Meta AI)
Llama 3.x for general; Llama 4 for the heavy reasoning tasks; Mistral Large for EU + nuanced French/Spanish; Codestral when the task is code generation specifically.
**Step 2 · Pick the host** (Meta AI)
Don't self-host until you have to. Groq is fastest for Llama; Together is broadest. Use Mistral's La Plateforme for hosted Mistral. Self-host only when data residency or cost at scale forces you to.
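Both Groq and Together expose OpenAI-compatible chat endpoints, so switching hosts is mostly a base-URL and model-name change. A minimal sketch with the `openai` Python SDK; the base URLs and model IDs here are from memory, so confirm them against your provider's docs before relying on them.

```python
# pip install openai
import os
from openai import OpenAI

# Point the standard OpenAI client at a hosted-Llama provider.
# Base URLs / model IDs are illustrative -- verify them in the provider docs.
PROVIDERS = {
    "groq": {
        "base_url": "https://api.groq.com/openai/v1",
        "model": "llama-3.3-70b-versatile",
        "key_env": "GROQ_API_KEY",
    },
    "together": {
        "base_url": "https://api.together.xyz/v1",
        "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
        "key_env": "TOGETHER_API_KEY",
    },
}

def complete(provider: str, prompt: str) -> str:
    cfg = PROVIDERS[provider]
    client = OpenAI(base_url=cfg["base_url"], api_key=os.environ[cfg["key_env"]])
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic-ish output makes eval scoring easier
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(complete("groq", "One sentence: why do open weights reduce vendor lock-in?"))
```

Self-hosted vLLM (and Ollama) also expose OpenAI-compatible endpoints, so the same client code carries over to the rented-GPU-box tier later on.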
**Step 3 · Build a 50-row eval with Claude** (Claude)
Use Claude (or GPT) to draft inputs + expected outputs. Hand-edit to remove ambiguity. This eval is the only way to know your open-weight pick is truly good enough.
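If you'd rather script this than paste into a chat window, a minimal sketch with the `anthropic` Python SDK is below. The file names are hypothetical, the model ID is a placeholder, and the scaffold prompt it sends is the one that follows.

```python
# pip install anthropic
import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# `eval_scaffold_prompt.txt` holds the scaffold prompt below, with the
# {{...}} placeholders filled in for your task.
scaffold_prompt = open("eval_scaffold_prompt.txt").read()

resp = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder -- use whichever Claude model you run evals with
    max_tokens=4096,
    messages=[{"role": "user", "content": scaffold_prompt}],
)

# Save the draft, then hand-edit it: remove ambiguous rows, tighten expected outputs.
with open("eval_draft.md", "w") as f:
    f.write(resp.content[0].text)
```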
**Prompt · Eval scaffold for an open-weight LLM swap**

I'm evaluating whether to use {{Llama 3.3 70B / Mistral Large / etc.}} in production for {{task description}}. Help me build the eval set.

Task:
"""
{{task: input shape, expected output shape, definition of correct}}
"""

Output:

1. **Eval rows** (50) — table of {input, expected output, why this case matters}. Cover the easy cases, the long-tail edge cases, and 5 deliberately adversarial inputs.
2. **Scoring rubric** — exactly how I score actual outputs against expected. Define partial credit if useful.
3. **Pass bar** — what % score against the rubric should I require before swapping production traffic to the open-weight model?
4. **What I should re-run quarterly** — the 5 to 10 most important rows that catch regression.

Be ruthless about the adversarial cases. The whole point of evals is the model failing on things YOUR users will throw at it.

**Step 4 · Smoke-test on the eval** (Mistral)
Run the eval on Claude AND your open-weight pick. If the open-weight model scores within 5% (or your domain's tolerance), ship it. If not, narrow scope or pick a different open-weight model.
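A minimal harness for the side-by-side run, assuming the eval lives in an `eval.jsonl` file of `{input, expected}` rows (a name chosen here for illustration) and that exact-match scoring is good enough; swap in your own scorer for fuzzier tasks. Model IDs and base URLs are the same illustrative ones as in the step 2 sketch.

```python
# pip install openai anthropic
import json
import os
import anthropic
from openai import OpenAI

def call_open_weight(prompt: str) -> str:
    # Same OpenAI-compatible call as the step 2 sketch (Groq shown; model ID illustrative).
    client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.environ["GROQ_API_KEY"])
    r = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return r.choices[0].message.content

def call_claude(prompt: str) -> str:
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    r = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.content[0].text

def score(model_call, rows) -> float:
    # Exact match is the crudest rubric -- replace with your domain's scorer if needed.
    hits = sum(1 for r in rows if model_call(r["input"]).strip() == r["expected"].strip())
    return hits / len(rows)

rows = [json.loads(line) for line in open("eval.jsonl")]  # one {"input", "expected"} object per line
claude_score = score(call_claude, rows)
open_score = score(call_open_weight, rows)
print(f"Claude {claude_score:.0%} vs open-weight {open_score:.0%}")

if claude_score - open_score <= 0.05:  # the 5% tolerance from the text; tighten or loosen per domain
    print("Within tolerance: swap production traffic to the open-weight model.")
else:
    print("Gap too wide: narrow scope or try another open-weight model.")
```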
**Step 5 · Production + monthly drift check** (Claude)
Ship to prod. Re-run the eval monthly to catch silent quality drift when the host updates the model.
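The same harness can run on a schedule (cron, CI job) and compare against the score you recorded at launch. A sketch under those assumptions; `eval_harness` is a hypothetical module holding the helpers from the smoke-test sketch above, and the file names are placeholders.

```python
import datetime
import json

# Hypothetical module containing score() and call_open_weight() from the smoke-test sketch.
from eval_harness import call_open_weight, score

BASELINE_PATH = "eval_baseline.json"  # e.g. {"score": 0.93}, written the day you shipped
DRIFT_TOLERANCE = 0.03                # absolute score drop you tolerate before alerting

def monthly_drift_check() -> None:
    baseline = json.load(open(BASELINE_PATH))["score"]
    rows = [json.loads(line) for line in open("eval.jsonl")]
    current = score(call_open_weight, rows)
    print(f"{datetime.date.today()}: baseline {baseline:.0%}, current {current:.0%}")
    if baseline - current > DRIFT_TOLERANCE:
        # Wire this into whatever alerting you already run (Slack, PagerDuty, email).
        raise RuntimeError(f"Silent drift: eval score dropped {baseline - current:.0%} since launch")

if __name__ == "__main__":
    monthly_drift_check()
```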
Could not use closed-weight US-hosted APIs for the customer-data path. Built on Mistral La Plateforme (EU region) for the prod calls, kept Claude for the eval rig. Internal compliance team signed off in week 2 because the audit trail (model card + EU hosting) was clean.
Renting a GPU box for one app is rarely a good trade. Hosted Llama (Groq, Together) covers most cases at lower cost. Self-host only when audit / latency / cost-at-scale forces it.
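To sanity-check that trade for your own volume, the break-even arithmetic is simple. The $1,100/box figure comes from the budget table above; the hosted per-token price below is a placeholder, not a quote, so plug in real numbers from your providers.

```python
# Back-of-envelope break-even: self-hosted GPU box vs hosted Llama API.
# All prices are illustrative placeholders -- substitute your actual quotes.
gpu_box_per_month = 1_100.0        # rented GPU box, per the budget table above
hosted_price_per_m_tokens = 0.80   # blended $ per 1M tokens on a hosted API (placeholder)

breakeven_m_tokens = gpu_box_per_month / hosted_price_per_m_tokens
print(f"Self-hosting only pays off above ~{breakeven_m_tokens:,.0f}M tokens per month")
# Below that volume, hosted Llama is cheaper -- and that's before counting ops time on the box.
```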
Open weights only matter if you'd actually self-host or audit. If you'll never inspect them, the closed-weight APIs are usually still the right call. Be honest about which camp you're in.
Hosted open-weight models update quietly. The monthly eval is the cheapest way to catch a regression before a customer does.