Powered by Google
Gemini Embedding 2
- Text Embeddings
- Semantic Search
- Fast Inference
Gemini Embedding 2 is Google's natively multimodal embedding model that maps text, images, video, audio, and documents into a single semantic vector space. It is notable for unifying many media types in one model to power cross-modal search and retrieval.
About the model
What is Gemini Embedding 2?
Gemini Embedding 2 is a proprietary multimodal embedding model from Google that produces numerical vector representations for text, images, audio, video, and documents in a unified space. Its main use cases include powering retrieval-augmented generation, semantic search, recommendation, and classification across mixed media, and enabling cross-modal applications like using a text query to retrieve relevant images or video clips. It is part of Google’s Gemini Embedding family and succeeds earlier text-focused Gemini embedding models.
Model capabilities
5 Core Capabilities
- Text Embedding: Generates dense vector representations for text inputs, enabling semantic similarity, retrieval, clustering, and downstream ML tasks.
- Multilingual Embeddings: Produces embeddings for multiple languages, allowing cross-lingual semantic search and comparison in a shared vector space.
- Long-Text Support: Embeds long text segments (up to 8,192 tokens per input), enabling document-level search, recommendation, and topic grouping across larger content pieces.
- Semantic Retrieval: Supports building semantic search systems where queries and documents are embedded and matched by vector similarity instead of keywords.
- Recommendation Support: Provides text embeddings suitable for powering recommendation, personalization, and content discovery based on semantic similarity.
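As a rough illustration of matching "via vector similarity", the sketch below ranks documents against a query by cosine similarity. The three-dimensional vectors and document IDs are made-up stand-ins; real embeddings would come from the model and have far more dimensions. Cosine similarity is one common choice of metric and is assumed here, not confirmed by this page.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (hypothetical data).
docs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "gift-cards": [0.2, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

ranking = rank_documents(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In a production system the linear scan would normally be replaced by a vector index, but the scoring idea stays the same.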
Use cases
6 Most Valuable Use Cases
- Semantic Text Search
- Document Clustering
- Recommendation Systems
- Code Snippet Retrieval
- Question Answer Retrieval
- Multilingual Text Matching
Transparent pricing
Cost Comparison
LLM.API offers competitive embedding pricing, often 35–50% lower per 1M tokens than major cloud providers.
| Provider | Region | Latency | Throughput | Uptime | Input ($/1M tokens) | Output ($/1M tokens) | Context |
|---|---|---|---|---|---|---|---|
| LLM.API (Best) | Global | — | — | — | $0.09 | — | 8K tokens |
| Google (Gemini API / Vertex AI) | Global | — | — | 99.9% | $0.20 | — | 8K tokens |
| OpenAI | Global | — | — | 99.9% | $0.13 | — | 8K tokens |
| Azure OpenAI | Global | — | — | 99.9% | $0.13 | — | 8K tokens |
| Amazon Bedrock (Titan Embeddings) | US & Selected Regions | — | — | 99.9% | $0.02 | — | — |
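To make the per-1M-token prices concrete, here is a small arithmetic sketch that turns the input prices listed in the table above into a monthly bill. The 500M-token workload is a hypothetical figure chosen only for illustration, and the calculation ignores any free tiers, volume discounts, or non-text pricing.

```python
def embedding_cost_usd(total_tokens, price_per_million_usd):
    """Cost of embedding `total_tokens` at a given price per 1M input tokens."""
    return total_tokens / 1_000_000 * price_per_million_usd


def savings_percent(baseline_price, candidate_price):
    """How much cheaper the candidate is than the baseline, in percent."""
    return (baseline_price - candidate_price) / baseline_price * 100


# Input prices ($ per 1M tokens) as listed in the comparison table.
listed_input_prices = {
    "LLM.API": 0.09,
    "Google (Gemini API / Vertex AI)": 0.20,
    "OpenAI": 0.13,
    "Azure OpenAI": 0.13,
    "Amazon Bedrock (Titan Embeddings)": 0.02,
}

monthly_tokens = 500_000_000  # hypothetical workload

for provider, price in listed_input_prices.items():
    cost = embedding_cost_usd(monthly_tokens, price)
    print(f"{provider}: ${cost:.2f} per month")
```

Comparing two rows with `savings_percent` gives the relative difference between any pair of providers for the same token volume.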
Performance benchmarks
Technical Specifications
| Metric | Gemini Embedding 2 | OpenAI text-embedding-3-large | Cohere Embed v3 English | AWS Titan Text Embeddings V2 |
|---|---|---|---|---|
| Embedding Dimensions | 3072 | 3072 | 1024 | 1024 |
| Max Input Tokens | 8,192 | — | — | 8,000 |
| Price per 1M Tokens (Input) | $0.20 | $0.13 | $0.10 | $0.02 |
| Price per 1M Tokens (Output) | — | $0.13 | — | — |
| Modalities Supported | Text, Image, Audio, Video, Documents | Text | Text | Text |
| Throughput | — | — | — | — |
| Avg Latency | — | — | — | — |
| Service Uptime (SLA) | — | — | — | — |
30-day usage via LLM API
- 3.8B text chunks embedded (30 days)
- 520M API requests (30 days)
- 45K active developer accounts (30 days)
- 99.97% average API uptime (30 days)
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
- Unified AI Routing: Automatically route each request to the optimal model across providers based on latency, capability, and policies: no client changes, just better defaults.
  One endpoint, every model
- Cost-Aware Controls: Define per-project or per-tenant budgets, choose cost ceilings, and let LLM.API pick the cheapest model that still meets your quality and latency targets.
  Lower spend, same output
- Resilient Fallback Logic: Eliminate single-vendor outages with built-in failover across providers, automatic retries, and policy-based degradation that keeps your product responsive.
  Never ship 500s again
- End-to-End Observability: Get unified logs, traces, and metrics for every provider (latency, errors, token usage, and prompts), all correlated to requests and tenants in one place.
  See every token flow
- Task-Level Orchestration: Describe tasks, constraints, and tools once; LLM.API handles model selection, tool calling, and execution flow so you focus on product logic, not glue code.
  Ship workflows, not wiring
- High-Throughput Batch APIs: Process millions of inferences efficiently with bulk submission, concurrency control, and automatic chunking tuned for each provider’s limits and quotas.
  Scale from 10 to millions
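The cost-aware selection and fallback ideas described above can be sketched in a few lines. This is a conceptual illustration only, not LLM.API's actual routing logic: the `ModelOption` type, the model names, and the price and latency figures are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str                     # hypothetical model identifier
    price_per_million_usd: float  # hypothetical input price
    p95_latency_ms: float         # hypothetical latency estimate
    healthy: bool = True


def pick_model(
    options: Sequence[ModelOption],
    max_latency_ms: float,
    max_price_usd: float,
) -> Optional[ModelOption]:
    """Cheapest healthy option within both ceilings, or None if none fits."""
    eligible = [
        o for o in options
        if o.healthy
        and o.p95_latency_ms <= max_latency_ms
        and o.price_per_million_usd <= max_price_usd
    ]
    return min(eligible, key=lambda o: o.price_per_million_usd, default=None)


catalog = [
    ModelOption("embed-small", 0.02, 180.0),
    ModelOption("embed-medium", 0.09, 90.0),
    ModelOption("embed-large", 0.20, 60.0),
]

primary = pick_model(catalog, max_latency_ms=100.0, max_price_usd=0.15)

# Simulate an outage of the primary choice, relax the price ceiling,
# and select again to model a policy-based fallback.
degraded = [replace(o, healthy=(o.name != primary.name)) for o in catalog]
fallback = pick_model(degraded, max_latency_ms=100.0, max_price_usd=0.25)

print(primary.name, "->", fallback.name)
```

Relaxing the ceiling on the second call is one possible degradation policy; a stricter policy would return `None` and surface an error instead.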
Decision guide
When to Use — When NOT to Use
Use it if...
- You need general-purpose text embeddings for semantic search, clustering, or retrieval applications.
- You need multilingual embeddings that handle many languages consistently within a single vector space.
- Your use case involves building recommendation systems based on textual similarity or user profiles.
- You need embeddings optimized for low latency and reasonable cost on Google Cloud.
- Your use case involves hybrid search, combining Gemini Embedding 2 with keyword or metadata filters.
- You need embeddings well-integrated with other Google Vertex AI or Gemini-based workflows.
- Your use case involves encoding short queries and longer documents into the same embedding space.
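The hybrid search item above combines vector similarity with keyword and metadata signals. A minimal sketch of that idea follows, assuming a simple weighted blend; the `alpha` weight, the record layout, and the two-dimensional vectors are illustrative choices, not something this page specifies.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_overlap(query, text):
    """Fraction of query terms that also appear in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


def hybrid_search(query, query_vec, records, language, alpha=0.7):
    """Filter by metadata, then blend vector and keyword scores."""
    results = []
    for rec in records:
        if rec["language"] != language:  # metadata filter
            continue
        score = (
            alpha * cosine(query_vec, rec["vector"])
            + (1 - alpha) * keyword_overlap(query, rec["text"])
        )
        results.append((rec["id"], score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical records; vectors stand in for real embeddings.
records = [
    {"id": "a", "text": "reset your password", "language": "en", "vector": [0.9, 0.1]},
    {"id": "b", "text": "update billing address", "language": "en", "vector": [0.2, 0.9]},
    {"id": "c", "text": "restablecer la contraseña", "language": "es", "vector": [0.9, 0.1]},
]

hits = hybrid_search("password reset", [1.0, 0.0], records, language="en")
print(hits)
```

Production systems often use BM25 for the keyword side and a vector database's built-in filters for metadata; the term-overlap score here is a deliberately simple substitute.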
Avoid if...
- You need per-modality specialist models for images, audio, or video rather than one shared embedding space.
- Your workload requires full generative capabilities like conversation, code synthesis, or content creation.
- You need ultra-long-context document understanding beyond the maximum token limits of embeddings.
- Your workload requires highly domain-specific vectors trained on proprietary data you fully control.
- You need on-premise deployment without relying on Google-managed cloud infrastructure or APIs.
- Your workload requires strict vendor neutrality, avoiding lock-in to any specific cloud provider.
- You need binary or sparse representations instead of dense floating-point embeddings for storage efficiency.
FAQ
Frequently Asked Questions
- What is Gemini Embedding 2?
  Gemini Embedding 2 is Google’s multimodal embedding model, designed to generate dense vector representations of text (including code), images, audio, video, and documents for search, retrieval, and semantic similarity.
- What input modalities does Gemini Embedding 2 support?
  Gemini Embedding 2 accepts text (including code), images, audio, video, and documents, and maps them into a single vector space.
- How do I access Gemini Embedding 2 through LLM.API?
  You call the unified LLM.API embeddings endpoint with the provider set to Google and model set to Gemini Embedding 2.
- What is the context window of Gemini Embedding 2?
  Gemini Embedding 2 supports input sequences up to 8,192 tokens, after which inputs must be truncated or chunked.
- How fast is Gemini Embedding 2 for generating embeddings via LLM.API?
  Embedding requests typically return in tens of milliseconds to low hundreds of milliseconds per batch, depending on batch size and network latency.
- How is pricing for Gemini Embedding 2 handled on LLM.API?
  LLM.API bills Gemini Embedding 2 by input tokens, quoted per 1M tokens, with the exact rate shown in your LLM.API pricing and usage dashboard.
- How does Gemini Embedding 2 compare to other embedding models on LLM.API?
  Gemini Embedding 2 offers strong multilingual and code understanding, often outperforming many older open-source embedding models in retrieval and semantic similarity benchmarks.
- What are the main limitations of Gemini Embedding 2?
  Gemini Embedding 2 cannot generate text, has a fixed maximum context length, and may encode provider-specific biases present in its training data.
- Can I use Gemini Embedding 2 for multilingual applications?
  Yes, Gemini Embedding 2 supports many languages and produces a shared embedding space suitable for cross-lingual retrieval and semantic search.
- Does Gemini Embedding 2 support batching through LLM.API?
  Yes, you can send an array of input texts in a single embeddings request to Gemini Embedding 2 to reduce per-item latency and cost.
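The chunking and batching answers above can be combined into one preparation step, sketched below. Several things here are assumptions: whitespace-separated words are used as a crude stand-in for tokens (a real tokenizer usually yields more tokens than words, so a lower limit is safer in practice), and the payload field names and the `gemini-embedding-2` identifier are hypothetical placeholders rather than the documented LLM.API request format. The sketch only builds request bodies and does not call any endpoint.

```python
import json

MAX_INPUT_TOKENS = 8192  # input limit stated in the FAQ


def chunk_text(text, max_tokens=MAX_INPUT_TOKENS):
    """Split text into pieces of at most `max_tokens` whitespace words.

    Assumption: words approximate tokens; swap in a real tokenizer
    for accurate limits.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def batched(items, batch_size):
    """Yield consecutive slices of `items` with up to `batch_size` entries."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def build_payload(inputs):
    """Build a request body. Field names and model ID are hypothetical."""
    return {
        "provider": "google",
        "model": "gemini-embedding-2",
        "input": list(inputs),
    }


# A synthetic 20,000-word document standing in for real content.
document = "word " * 20000

chunks = chunk_text(document)
payloads = [build_payload(batch) for batch in batched(chunks, batch_size=2)]

print(json.dumps({"chunks": len(chunks), "requests": len(payloads)}))
```

Splitting on fixed word counts can cut sentences in half; splitting on paragraph or sentence boundaries, with a little overlap between chunks, usually gives better retrieval quality.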
