
Groq

Sub-second LPU inference — Llama 3.1 8B at 840 tokens/sec for $0.05/M input

API · Free tier

Overall score

3.1/5
SME Fit 3/5 · usage-metered pricing + free tier
JTBD 4/5 · solid named JTBD
Integration 3/5 · API
Trust 5/5 · mature, founded 2016
Quality 1/5 · no public rating
Compliance 2/5 · compliance unknown

About

Groq runs language models on its own LPU (Language Processing Unit) hardware, optimized for inference speed. Throughput is the differentiator: Llama 3.1 8B at 840 tokens/sec, GPT OSS 20B at 1,000 tokens/sec — multiples faster than GPU-based competitors. Pricing is per-token PAYG at the cheapest end of the inference market.

Best for: Real-time AI experiences where latency matters — voice agents, streaming chat, autocomplete. The free API key + cheap per-token rates make Groq the default speed-comparison benchmark before considering Together AI or OpenAI.
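Because Groq exposes an OpenAI-compatible API, calling it is just a matter of pointing a standard chat-completions request at its base URL. A minimal sketch, assuming the base URL (`https://api.groq.com/openai/v1`) and model id (`llama-3.1-8b-instant`) as published in Groq's docs — verify both against the current documentation before relying on them:

```python
import json
import urllib.request

# Assumed Groq OpenAI-compatible base URL (check Groq's docs).
GROQ_BASE = "https://api.groq.com/openai/v1"

def build_chat_request(prompt: str, model: str = "llama-3.1-8b-instant") -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # stream tokens for low perceived latency
    }

def send(payload: dict, api_key: str) -> bytes:
    """POST the payload to Groq's chat-completions endpoint."""
    req = urllib.request.Request(
        f"{GROQ_BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    payload = build_chat_request("Say hello in one word.")
    # send(payload, "YOUR_GROQ_API_KEY")  # uncomment with a real key
    print(payload["model"])
```

Swapping in Groq this way is also how the speed comparison against Together AI or OpenAI is typically run: same payload, different base URL and key.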

Pricing

Pay-as-you-go · Free · usage-billed
  • Free API key on signup
  • Llama 3.1 8B Instant: $0.05/M input, $0.08/M output
  • GPT OSS 20B: $0.075/M input, $0.30/M output
  • GPT OSS 120B: $0.15/M input, $0.60/M output
  • Llama 3.3 70B: $0.59/M input, $0.79/M output
  • OpenAI-compatible API · Linear pricing across the catalog. No platform fee.

Enterprise · flat · contact sales
  • Enterprise-only models (Minimax M2.5, Qwen3-VL 32B)
  • On-premises deployments
  • Custom SLAs
  • Volume pricing · Contact sales (required for on-prem and gated models)
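Because the pricing is linear per token with no platform fee, per-request cost is simple arithmetic. A sketch using the rates from the table above (the dictionary keys are informal labels for the table rows, not necessarily Groq's official model ids):

```python
# USD per million tokens (input, output), from the PAYG table above.
RATES = {
    "llama-3.1-8b-instant": (0.05, 0.08),
    "gpt-oss-20b": (0.075, 0.30),
    "gpt-oss-120b": (0.15, 0.60),
    "llama-3.3-70b": (0.59, 0.79),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Linear pricing: tokens / 1M * rate, input and output billed separately."""
    rate_in, rate_out = RATES[model]
    return input_tokens / 1e6 * rate_in + output_tokens / 1e6 * rate_out

# One million chats of ~500 input + 200 output tokens on Llama 3.1 8B:
total = cost_usd("llama-3.1-8b-instant", 500, 200) * 1_000_000
print(f"${total:.2f}")  # $41.00
```

The same 1M-chat workload on Llama 3.3 70B would run about $453, which is why the 8B model is the usual starting point for latency-sensitive, high-volume use.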

Key features

  • Custom LPU hardware (vs GPU competitors)
  • Llama 3.1 8B: 840 tokens/sec at $0.05/M input
  • Free API key on signup (limits not specified)
  • OpenAI-compatible endpoints
  • Wide model catalog (Llama, GPT OSS, Qwen, DeepSeek)
  • Sub-second response times
  • Linear, predictable token pricing
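The quoted throughput figures make the "sub-second" claim easy to sanity-check: a typical ~200-token chat reply at 840 tokens/sec streams out in under a quarter of a second. A back-of-envelope sketch (this ignores network latency and time-to-first-token, so real end-to-end latency will be somewhat higher):

```python
def generation_time_s(tokens: int, tokens_per_sec: float) -> float:
    """Seconds to stream `tokens` at a steady decode rate."""
    return tokens / tokens_per_sec

print(round(generation_time_s(200, 840), 3))   # Llama 3.1 8B: 0.238
print(round(generation_time_s(200, 1000), 3))  # GPT OSS 20B: 0.2
```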

Integrations

OpenAI-compatible API · LangChain · LlamaIndex · Helicone · OpenRouter

Trust & compliance

Founded: 2016
Status: active
SOC 2: unknown
GDPR: unknown
Data residency: unknown
External rating: none (no public rating)
Last verified: May 2026


Related tools in AI agent infrastructure
  • Ollama (3.6)

    The easiest way to run open language models locally

  • Pinecone (3.4)

    Reference vector database for RAG and semantic search — Starter tier is free up to 2GB

  • Hugging Face (3.3)

    The model hub the open-source AI ecosystem runs on — free Spaces, $9 PRO, $20/user Team

  • Replicate (3.2)

    Run, fine-tune, and deploy AI models with one line of code

Pairs well with