Retrieval API Integration Report
Prepared: 2026-04-24
This report reviews another_report_on_embeddings.md and turns it into a concrete integration plan for retrieval in unified-llm-client, with specific guidance for tenant isolation, shared-vs-private knowledge, safety, and scaling.
Executive Summary
- Your report is directionally correct on the most important architectural point:
  - add client.embed() to the core library
  - keep chunking, vector storage, retrieval policy, citations, and ingestion jobs outside the core generation path
- The current project decision is narrower than the earlier cross-provider plan:
  - support the Google Embedding 2 path only in v1
  - defer OpenAI embeddings
  - reject Anthropic embeddings
  - keep retrieval modular and separate from complete()/conversation()
What In Your Report Is Right
- client.embed() should be first-class.
- Retrieval should not be hidden inside the completion transport.
- anthropic should not be exposed as a first-party embeddings provider.
- The widget app should own:
  - source ingest
  - chunking
  - vector storage
  - retrieval policy
  - citations
  - ingestion status and retries
- Retrieval queries must be scoped. Searching across tenants, bots, or embedding profiles is a hard no.
- Reindexing should happen through embedding profiles, not by mixing vectors from different models or dimensions.
What Needs Adjustment
1. The public API should reflect the v1 Google-only product decision
The product scope has now changed from “possible multi-provider embeddings” to:
- Google Embedding 2 only in v1
- file/PDF-capable embedding use cases are the driver
- OpenAI embeddings are deferred
- Anthropic embeddings remain unsupported
That means the library API should be shaped so that:
- Google is the only accepted embedding provider in the first slice
- the request surface can support the widget’s file/PDF-oriented use case
- retrieval remains modular even though provider scope is narrow
2. tenantId and botId should not mean retrieval routing inside client.embed()
It is fine for EmbeddingRequestOptions to carry tenantId and botId for:
- usage logging
- observability
- tracing
But those fields should not change embedding semantics. The embedding request should stay stateless. Actual retrieval routing and isolation should happen in the app-owned retrieval layer and database filters.
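A minimal sketch of that separation follows. It assumes a hypothetical shape for EmbeddingRequestOptions (only tenantId and botId are named in this report; model, input, and the helper splitEmbeddingRequest are illustrative). The point is that the tenant and bot fields reach the usage log but never the provider payload.

```typescript
// Hypothetical request shape. Only tenantId and botId come from the report.
interface EmbeddingRequestOptions {
  model: string;
  input: string[];
  tenantId?: string; // observability only
  botId?: string; // observability only
}

// What the embedding provider receives: no tenancy fields at all.
interface ProviderPayload {
  model: string;
  input: string[];
}

// What usage logging and tracing receive.
interface UsageLogEntry {
  model: string;
  inputCount: number;
  tenantId: string | null;
  botId: string | null;
}

// Splits one request into the provider-facing and logging-facing parts.
function splitEmbeddingRequest(options: EmbeddingRequestOptions): {
  payload: ProviderPayload;
  usage: UsageLogEntry;
} {
  return {
    payload: { model: options.model, input: options.input },
    usage: {
      model: options.model,
      inputCount: options.input.length,
      tenantId: options.tenantId ?? null,
      botId: options.botId ?? null,
    },
  };
}
```

Two requests that differ only in tenantId or botId produce identical provider payloads, which is what keeps the embedding call stateless.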
3. Retrieval should not become a hidden side effect of complete()
Do not add a design where complete() silently reaches into a vector store. That would make:
- tests less deterministic
- provider behavior less transparent
- tenancy mistakes more dangerous
Keep retrieval explicit.
Recommended Way To Add Retrieval To The Existing API
The cleanest design is a two-layer approach.
Layer 1: core library
Add only these embeddings capabilities to LLMClient:
- client.embed(options)
- embedding-capable model metadata in the registry
- model-kind validation
- embedding usage reporting
Do not add:
- knowledge-base tables
- chunking
- ingestion queues
- hidden retrieval during complete()
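The model-kind validation item in the Layer 1 list could look roughly like this. The registry shape, the model ids, and the function name assertEmbeddingModel are all placeholders for illustration, not the library's existing API; the only rules taken from this report are that embed() needs an embedding-kind model and that Google is the only accepted provider in v1.

```typescript
type ModelKind = "chat" | "embedding";

interface ModelInfo {
  provider: string;
  kind: ModelKind;
}

// Hypothetical registry entries. The model ids are placeholders, not real model names.
const MODEL_REGISTRY: Record<string, ModelInfo> = {
  "example-google-embedding": { provider: "google", kind: "embedding" },
  "example-google-chat": { provider: "google", kind: "chat" },
  "example-openai-embedding": { provider: "openai", kind: "embedding" },
};

// v1 scope from the report: Google is the only accepted embedding provider.
const V1_EMBEDDING_PROVIDERS: readonly string[] = ["google"];

// Throws unless the model is registered, is an embedding model, and is in v1 scope.
function assertEmbeddingModel(modelId: string): ModelInfo {
  const info = Object.prototype.hasOwnProperty.call(MODEL_REGISTRY, modelId)
    ? MODEL_REGISTRY[modelId]
    : undefined;
  if (info === undefined) {
    throw new Error(`Unknown model: ${modelId}`);
  }
  if (info.kind !== "embedding") {
    throw new Error(`Model ${modelId} is a ${info.kind} model, not an embedding model`);
  }
  if (V1_EMBEDDING_PROVIDERS.indexOf(info.provider) === -1) {
    throw new Error(`Embeddings from provider ${info.provider} are not supported in v1`);
  }
  return info;
}
```

Keeping the provider allowlist separate from the kind check means a deferred provider can be enabled later without touching the validation logic.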
Layer 2: optional retrieval module or app layer
Add retrieval as a separate exported surface, not as a side effect of generation.
Recommended shape:
```typescript
// Storage and retrieval are built outside LLMClient and take the client as a dependency.
const store = createPostgresKnowledgeStore({ pool });
const retriever = createHybridRetriever({
  client,
  store,
});

// Every search carries its full scope explicitly.
const results = await retriever.search({
  tenantId,
  botId,
  embeddingProfileId,
  query: userMessage,
  topK: 8,
});
```
Recommended optional exports:
- createPostgresKnowledgeStore()
- createDenseRetriever()
- createHybridRetriever()
- mergeRetrievalCandidates()
- formatRetrievedContext()
This keeps LLMClient small while still giving the widget product a first-party retrieval path.
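This report does not say how mergeRetrievalCandidates() should combine dense and lexical results. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a decision: each candidate scores the sum of 1 / (k + rank) over the lists it appears in, with k = 60 as a conventional default. The signature shown here is illustrative.

```typescript
interface MergedCandidate {
  chunkId: string;
  score: number;
}

// Reciprocal rank fusion over several ranked lists of chunk ids.
// Assumes each input list is already sorted best-first and contains unique ids.
function mergeRetrievalCandidates(rankedLists: string[][], k = 60): MergedCandidate[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((chunkId, index) => {
      const rank = index + 1; // ranks start at 1
      scores.set(chunkId, (scores.get(chunkId) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties broken by id so the order is deterministic.
  return Array.from(scores, ([chunkId, score]) => ({ chunkId, score })).sort(
    (a, b) => b.score - a.score || a.chunkId.localeCompare(b.chunkId),
  );
}
```

RRF only needs ranks, so it avoids having to normalize cosine distances against lexical scores. A chunk found by both searches rises above chunks found by only one.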
How Retrieval Should Work In Practice
Ingestion flow
- Create or resolve the active embedding_profile.
- Create a knowledge_source row in queued state.
- Parse the source in the app:
  - URL
  - FAQ
  - plain text
- Chunk content.
- Call client.embed() on each chunk or batch.
- Store vectors plus metadata in knowledge_chunks.
- Mark the source ready only after all vectors are committed.
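Two of the ingestion steps above can be sketched in a few lines: chunking, and the rule that a source becomes ready only once every vector is committed. The fixed-size character chunker is a stand-in for whatever chunking the app really uses, and the processing and failed statuses are assumptions (this report only names queued and ready).

```typescript
// Fixed-size character chunking with overlap. A placeholder for the app's real chunker.
function chunkText(text: string, maxChars: number, overlap: number): string[] {
  if (maxChars <= 0 || overlap < 0 || overlap >= maxChars) {
    throw new Error("invalid chunking parameters");
  }
  const chunks: string[] = [];
  const step = maxChars - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
    // Stop once this chunk reached the end, so no tiny duplicate tail is emitted.
    if (start + maxChars >= text.length) break;
  }
  return chunks;
}

// "queued" and "ready" come from the report; "processing" and "failed" are assumed.
type SourceStatus = "queued" | "processing" | "ready" | "failed";

// Status of a source while its ingest job is running or after it finishes.
function resolveSourceStatus(
  totalChunks: number,
  committedChunks: number,
  failed: boolean,
): SourceStatus {
  if (failed) return "failed";
  // Ready only when every chunk's vector is committed. An empty source never becomes ready.
  if (totalChunks > 0 && committedChunks === totalChunks) return "ready";
  return "processing";
}
```

Because retrieval filters on the ready status, a half-indexed source stays invisible to chat traffic until the last chunk lands.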
Query-time flow
- Resolve the authenticated tenant and target bot.
- Resolve the bot's active embedding profile.
- Embed the user query with that exact profile.
- Run dense vector search with strict filters.
- Optionally run lexical search in parallel.
- Merge and rerank candidates.
- Build the final retrieval context and citations.
- Pass that context into complete() or conversation.send().
That means retrieval is explicit orchestration around generation, not part of the transport itself.
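The query-time flow can be sketched as plain orchestration code. The RetrievalDeps interface and answerWithRetrieval are hypothetical names; the three dependencies stand in for the server-side scope lookup, the retriever's search call, and complete(). Query embedding, merging, and reranking are assumed to happen inside search.

```typescript
interface Scope {
  tenantId: string;
  botId: string;
  embeddingProfileId: string;
}

interface RetrievedChunk {
  chunkId: string;
  content: string;
  sourceName: string;
}

// Hypothetical dependencies, injected so each step stays visible and testable.
interface RetrievalDeps {
  resolveScope(sessionToken: string, botId: string): Promise<Scope>;
  search(scope: Scope, query: string, topK: number): Promise<RetrievedChunk[]>;
  complete(prompt: string): Promise<string>;
}

// Numbered context lines so the answer can cite [1], [2], ...
function formatRetrievedContext(chunks: RetrievedChunk[]): string {
  return chunks.map((c, i) => `[${i + 1}] (${c.sourceName}) ${c.content}`).join("\n");
}

async function answerWithRetrieval(
  deps: RetrievalDeps,
  sessionToken: string,
  botId: string,
  userMessage: string,
): Promise<{ answer: string; citations: string[] }> {
  // 1. Scope comes from the authenticated session, not from the request body.
  const scope = await deps.resolveScope(sessionToken, botId);
  // 2. Retrieval is an explicit call made before generation.
  const chunks = await deps.search(scope, userMessage, 8);
  // 3. Generation only sees the context it is handed.
  const prompt =
    chunks.length > 0
      ? `Context:\n${formatRetrievedContext(chunks)}\n\nQuestion: ${userMessage}`
      : `Question: ${userMessage}`;
  const answer = await deps.complete(prompt);
  return { answer, citations: chunks.map((c) => c.chunkId) };
}
```

Since complete() receives an ordinary prompt, tests can swap in a fake search and assert on the exact context that reached generation.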
How Data Segregation Should Work
This is the most important part for correctness.
Separate the axes of isolation
Use four distinct scopes:
- tenant_id
- bot_id
- embedding_profile_id
- visibility_scope
visibility_scope should distinguish:
- shared bot knowledge
- tenant-wide knowledge
- optional user-private knowledge
If the widget only uses shared bot knowledge right now, keep visibility_scope = 'bot' and do not add user-level retrieval yet.
Recommended hierarchy
tenant
  -> bot
    -> knowledge_space
      -> embedding_profile
        -> source
          -> chunk
This gives you clean control over:
- which bot sees which knowledge
- when a bot switches to a new embedding model
- how reindexing happens without corrupting live retrieval
Shared vs private data
Most chatbot traffic does not require one vector index per end user.
Recommended default:
- knowledge is shared at the tenant + bot level
- conversation state is separate and scoped by tenant + session
- end-user count should increase chat sessions, not duplicate the bot's knowledge vectors
Only add user-private retrieval when the product truly needs it. If you do, add:
- scope_type = 'bot' | 'user'
- scope_user_id
and require both in the retrieval filter for private searches.
Recommended Schema
Recommended tables:
- knowledge_spaces
- embedding_profiles
- knowledge_sources
- knowledge_chunks
Recommended knowledge_spaces fields:
- id
- tenant_id
- bot_id
- name
- visibility_scope
- created_at
Recommended embedding_profiles fields:
- id
- knowledge_space_id
- tenant_id
- bot_id
- provider
- model
- dimensions
- distance_metric
- task_instruction
- status
- created_at
Recommended knowledge_sources fields:
- id
- knowledge_space_id
- tenant_id
- bot_id
- source_type
- external_id
- name
- checksum
- status
- progress_percent
- error_message
- created_at
- updated_at
Recommended knowledge_chunks fields:
- id
- knowledge_space_id
- tenant_id
- bot_id
- source_id
- embedding_profile_id
- chunk_index
- content
- citation jsonb
- metadata jsonb
- fts tsvector
- embedding
- created_at
The Retrieval Filter Must Be Non-Negotiable
Every retrieval query should filter by:
- tenant_id
- bot_id
- knowledge_space_id
- embedding_profile_id
- source.status = 'ready'
Optional filters:
- scope_type
- scope_user_id
- locale
- content_type
That is stricter than only tenant_id + bot_id, and it should be.
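One way to make the filter hard to forget is to build the dense query in a single function that refuses to run without every scope field. The sketch below returns a { text, values } pair of the kind node-postgres accepts. The function name is hypothetical, and the pgvector `<=>` operator (cosine distance) assumes the profile's distance_metric is cosine.

```typescript
interface RetrievalFilter {
  tenantId: string;
  botId: string;
  knowledgeSpaceId: string;
  embeddingProfileId: string;
}

interface SqlQuery {
  text: string;
  values: unknown[];
}

// Builds a parameterized dense search. Throws instead of running with a partial scope.
function buildDenseSearchQuery(
  filter: RetrievalFilter,
  queryEmbedding: number[],
  topK: number,
): SqlQuery {
  const required: Array<[string, string]> = [
    ["tenantId", filter.tenantId],
    ["botId", filter.botId],
    ["knowledgeSpaceId", filter.knowledgeSpaceId],
    ["embeddingProfileId", filter.embeddingProfileId],
  ];
  for (const [name, value] of required) {
    if (value.trim() === "") throw new Error(`Missing retrieval filter: ${name}`);
  }
  if (queryEmbedding.length === 0) throw new Error("Empty query embedding");
  if (!Number.isInteger(topK) || topK < 1) throw new Error("topK must be a positive integer");

  const text = [
    "SELECT c.id, c.content, c.citation,",
    "       c.embedding <=> $1::vector AS distance", // assumes cosine distance
    "FROM knowledge_chunks c",
    "JOIN knowledge_sources s ON s.id = c.source_id",
    "WHERE c.tenant_id = $2",
    "  AND c.bot_id = $3",
    "  AND c.knowledge_space_id = $4",
    "  AND c.embedding_profile_id = $5",
    "  AND s.status = 'ready'",
    "ORDER BY c.embedding <=> $1::vector",
    "LIMIT $6",
  ].join("\n");

  // pgvector accepts a vector literal in the text form "[x1,x2,...]".
  const vectorLiteral = `[${queryEmbedding.join(",")}]`;
  return {
    text,
    values: [
      vectorLiteral,
      filter.tenantId,
      filter.botId,
      filter.knowledgeSpaceId,
      filter.embeddingProfileId,
      topK,
    ],
  };
}
```

All scope values travel as bind parameters, so no tenant or bot id is ever concatenated into the SQL text.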
How To Keep It Safe
1. Derive scope server-side
Never trust tenantId from the browser request body for retrieval filters.
Use:
- auth token
- signed session
- API gateway context
- server-side bot ownership lookup
The browser can send botId, but the server must still verify that:
- the caller belongs to that tenant
- that bot belongs to that tenant
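A sketch of that check, assuming a session object already verified by auth middleware and an in-memory bot list standing in for the ownership lookup (both shapes, and the name resolveRetrievalScope, are hypothetical):

```typescript
// Produced by server-side auth (token, signed session, gateway), never by the request body.
interface VerifiedSession {
  tenantId: string;
}

interface BotRecord {
  botId: string;
  tenantId: string;
}

// The browser may suggest a botId; the tenant always comes from the session.
function resolveRetrievalScope(
  session: VerifiedSession,
  requestedBotId: string,
  bots: BotRecord[],
): { tenantId: string; botId: string } {
  const bot = bots.find((b) => b.botId === requestedBotId);
  // Same error for "missing" and "owned by another tenant",
  // so the response does not confirm that a foreign bot id exists.
  if (bot === undefined || bot.tenantId !== session.tenantId) {
    throw new Error("Bot not found for this tenant");
  }
  return { tenantId: session.tenantId, botId: bot.botId };
}
```

The returned scope is the only thing the retrieval layer should accept as its filter input.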
2. Use Row-Level Security in Postgres
Application filters are necessary, but not sufficient at scale.
Recommended:
- enable RLS on knowledge_spaces, knowledge_sources, and knowledge_chunks
- set tenant context per request using a trusted server-side mechanism
- apply policies that block reads outside the active tenant
Application code should still filter by bot_id and embedding_profile_id. RLS is the last line of defense, not the only one.
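As a rough sketch, the migration statements for those three tables could be generated like this. It assumes tenant_id is stored as text and uses app.tenant_id as a made-up name for the per-request setting. The SQL itself relies on standard Postgres features: ENABLE and FORCE ROW LEVEL SECURITY, CREATE POLICY, current_setting(), and set_config().

```typescript
const RLS_TABLES = ["knowledge_spaces", "knowledge_sources", "knowledge_chunks"] as const;

// Assumptions: tenant_id is a text column, and app.tenant_id is an illustrative setting name.
function buildRlsStatements(): string[] {
  const statements: string[] = [];
  for (const table of RLS_TABLES) {
    statements.push(`ALTER TABLE ${table} ENABLE ROW LEVEL SECURITY;`);
    // FORCE makes the policy apply to the table owner as well.
    statements.push(`ALTER TABLE ${table} FORCE ROW LEVEL SECURITY;`);
    statements.push(
      `CREATE POLICY ${table}_tenant_isolation ON ${table} ` +
        `USING (tenant_id = current_setting('app.tenant_id'));`,
    );
  }
  return statements;
}

// Run at the start of each request's transaction with the verified tenant id as $1.
// The third argument (true) limits the setting to the current transaction.
const SET_TENANT_CONTEXT = "SELECT set_config('app.tenant_id', $1, true)";
```

If the setting was never set for the connection, current_setting() raises an error rather than returning rows, which matches the fail-closed stance in this report. Superusers and roles with BYPASSRLS still skip policies, so the application should connect as neither.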
3. Treat embedding profiles as immutable
Do not update a live profile in place.
Instead:
- create a new embedding_profile
- reindex into that profile
- validate retrieval quality
- switch the bot's active profile pointer
That avoids mixing vectors with different:
- models
- dimensions
- task instructions
- normalization behavior
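A small sketch of the create, reindex, validate, switch sequence. The field names follow the embedding_profiles table in this report, but the status values building and ready, the BotConfig shape, and both function names are assumptions.

```typescript
interface EmbeddingProfile {
  readonly id: string;
  readonly model: string;
  readonly dimensions: number;
  readonly taskInstruction: string;
  readonly status: "building" | "ready"; // assumed status values
}

interface BotConfig {
  botId: string;
  activeProfileId: string;
}

// A change to model, dimensions, or task instruction always yields a new profile.
function deriveProfile(
  current: EmbeddingProfile,
  newId: string,
  changes: Partial<Pick<EmbeddingProfile, "model" | "dimensions" | "taskInstruction">>,
): EmbeddingProfile {
  return { ...current, ...changes, id: newId, status: "building" };
}

// The bot's pointer moves only to a fully built profile; profiles are never edited.
function switchActiveProfile(bot: BotConfig, candidate: EmbeddingProfile): BotConfig {
  if (candidate.status !== "ready") {
    throw new Error(`Profile ${candidate.id} is not ready`);
  }
  return { ...bot, activeProfileId: candidate.id };
}
```

Rolling back is then a matter of pointing the bot at the previous profile id, since the old vectors were never overwritten.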
4. Make retrieval fail closed
If retrieval fails, the chatbot must not widen the search by relaxing tenant or bot filters.
Fallback order should be:
- retry local query
- degrade to lexical-only inside the same scope
- answer without KB context
- clearly say no reliable source was found
Never broaden scope as a fallback.
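The fallback order can be sketched as a ladder in which every rung receives the same frozen scope, so no step is able to widen it. The Searcher type, the function name, and the mode labels are hypothetical. "Retry local query" is read here as retrying the same scoped search.

```typescript
interface Scope {
  tenantId: string;
  botId: string;
  embeddingProfileId: string;
}

interface Hit {
  chunkId: string;
}

type Searcher = (scope: Readonly<Scope>, query: string) => Promise<Hit[]>;

interface RetrievalOutcome {
  mode: "primary" | "lexical_only" | "no_context";
  hits: Hit[];
}

async function retrieveFailClosed(
  scope: Scope,
  query: string,
  primary: Searcher,
  lexical: Searcher,
  maxRetries = 1,
): Promise<RetrievalOutcome> {
  // One immutable scope object is shared by every attempt.
  const frozenScope = Object.freeze({ ...scope });

  // Rung 1: the normal search, retried inside the same scope.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return { mode: "primary", hits: await primary(frozenScope, query) };
    } catch {
      // Swallow and retry; the scope is unchanged.
    }
  }
  // Rung 2: lexical-only, still inside the same scope.
  try {
    return { mode: "lexical_only", hits: await lexical(frozenScope, query) };
  } catch {
    // Rung 3: give up on KB context rather than search anywhere else.
    return { mode: "no_context", hits: [] };
  }
}
```

On a no_context outcome the caller answers without KB context and states that no reliable source was found.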
How To Keep It Reliable At Scale
1. Separate online retrieval from offline indexing
Do not embed uploaded documents in the chat request path.
Use:
- ingestion workers
- job queue
- source statuses
- idempotent chunk upserts
Chat traffic should only do:
- query embedding
- retrieval
- reranking
- generation
2. Add idempotency and checksums
Every source should carry a checksum so the system can detect:
- duplicate uploads
- unchanged URLs
- unnecessary reindex requests
Every ingest job should be safe to retry without duplicating chunks.
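A sketch of both ideas with an in-memory store. The natural key (source, profile, chunk index) is an assumption about how chunks are identified, and the checksum is treated as an opaque string computed elsewhere. In Postgres the same behavior would typically come from a unique constraint on that key plus INSERT ... ON CONFLICT DO UPDATE.

```typescript
interface ChunkRecord {
  sourceId: string;
  embeddingProfileId: string;
  chunkIndex: number;
  content: string;
}

// In-memory stand-in for knowledge_chunks with a unique natural key.
class InMemoryChunkStore {
  private readonly rows = new Map<string, ChunkRecord>();

  private static key(chunk: ChunkRecord): string {
    return `${chunk.sourceId}:${chunk.embeddingProfileId}:${chunk.chunkIndex}`;
  }

  // Writing the same chunk twice overwrites it instead of adding a row.
  upsert(chunk: ChunkRecord): void {
    this.rows.set(InMemoryChunkStore.key(chunk), chunk);
  }

  get size(): number {
    return this.rows.size;
  }
}

// Skip the whole ingest job when the stored checksum matches the incoming one.
function needsReindex(storedChecksum: string | null, incomingChecksum: string): boolean {
  return storedChecksum !== incomingChecksum;
}
```

A retried job that replays chunks 0..n simply rewrites the same keys, so the chunk count after a retry equals the count after a clean run.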
3. Use blue/green profile rollouts
For reindexing:
- keep the active profile serving traffic
- build the new profile in parallel
- switch over only when the new profile is complete
Because live traffic never reads from a partially built profile, this lets many bots and many tenants be reindexed without downtime.
4. Start with one Postgres table, partition later
Recommended starting point:
- one knowledge_chunks table
- vector index per active profile or dimension family
- B-tree indexes for tenant and bot scoping
- GIN for lexical search
Recommended scale trigger:
- when chunk counts and index sizes become operationally painful, partition by hashed tenant_id
- if a few tenants dominate traffic, consider isolating those tenants or using dedicated partitions
Do not over-partition on day one.
5. Keep hot-path limits explicit
Set hard limits on:
- topK
- maximum rerank candidate set
- maximum retrieval context tokens
- maximum ingest chunk size
- maximum concurrent embed jobs per tenant
This prevents one tenant or one bad source from overwhelming the system.
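Two of these limits are easy to enforce in code on the request path. The numbers below are illustrative defaults, not recommendations from this report, and the four-characters-per-token estimate is a crude placeholder for a real tokenizer.

```typescript
interface RetrievalLimits {
  maxTopK: number;
  maxContextTokens: number;
}

// Illustrative values only.
const DEFAULT_LIMITS: RetrievalLimits = { maxTopK: 20, maxContextTokens: 2000 };

// Clamp a caller-supplied topK into [1, maxTopK].
function clampTopK(requested: number, limits: RetrievalLimits): number {
  if (!Number.isFinite(requested)) return 1;
  return Math.min(Math.max(1, Math.floor(requested)), limits.maxTopK);
}

// Rough estimate: about four characters per token. Replace with a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep chunks in ranked order until the next one would exceed the budget.
function trimToTokenBudget(chunks: string[], maxContextTokens: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk);
    if (used + cost > maxContextTokens) break;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

Stopping at the first chunk that does not fit preserves rank order, at the cost of sometimes leaving a little budget unused.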
6. Observe retrieval as its own system
Track:
- embed latency
- vector search latency
- lexical search latency
- rerank latency
- retrieval hit count
- no-hit rate
- retrieval source mix
- answer-with-citation rate
- per-tenant and per-bot error rates
Do not hide retrieval metrics inside generic completion metrics.
Recommended Public Surface
Recommended core surface in unified-llm-client:
- client.embed()
Recommended optional surface:
- createPostgresKnowledgeStore()
- createDenseRetriever()
- createHybridRetriever()
- formatRetrievedContext()
What I would not add to the core surface:
- client.retrieve()
- client.ingestKnowledge()
- automatic retrieval inside complete()
Those features are too product-specific and would make the library harder to keep correct.
Recommended Rollout
Phase 1
- add client.embed()
- support the selected Google Embedding 2 path
- add model-kind validation
Phase 2
- add app-owned Postgres pgvector storage
- add dense retrieval with strict filters
- add source statuses and background indexing
Phase 3
- add lexical search
- add hybrid candidate merge
- add citations
Phase 4
- add reranking
- add retrieval evaluation fixtures
- add blue/green embedding profile rollout
Phase 5
- add private user-scoped knowledge only if the product truly needs it
- add advanced partitioning or dedicated vector infrastructure only when Postgres stops being sufficient
Bottom Line
The right way to add retrieval to the existing API is not to make LLMClient own RAG end to end.
The safer design is:
- client.embed() in the core client
- app-owned or optional-module retrieval around it
- strict server-side scope derivation
- immutable embedding profiles
- Postgres + pgvector with strong filters and RLS
- shared bot-level knowledge by default, not one vector index per user
- Google-only embeddings support in v1, with other providers deferred or rejected
That keeps the current library clean, prevents tenant leakage, and gives you a path to scale without rewriting the API later.
Sources
- OpenAI embeddings API reference: https://developers.openai.com/api/reference/resources/embeddings/methods/create
- OpenAI retrieval guide: https://developers.openai.com/api/docs/guides/retrieval
- Gemini API embeddings guide: https://ai.google.dev/gemini-api/docs/embeddings
- Gemini API embeddings reference: https://ai.google.dev/api/embeddings
- Vertex AI multimodal embeddings: https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings
- Anthropic embeddings guide: https://platform.claude.com/docs/en/build-with-claude/embeddings
- pgvector README: https://github.com/pgvector/pgvector
- PostgreSQL full-text search indexes: https://www.postgresql.org/docs/current/textsearch-indexes.html