AI / RAG pipeline
GreekManage's AI assistant uses retrieval-augmented generation (RAG) — your question is matched against your org's data, relevant snippets are fed to a language model, and the model's answer is streamed back.
End-to-end flow
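The whole flow can be condensed into a few lines, with each function standing in for one of the components described below. All names here are illustrative stand-ins, not the actual GreekManage API:

```python
# Illustrative end-to-end sketch; every function is a stand-in for the
# real component described in the sections below.
def embed(question, org_id):
    return [0.0, 0.0, 0.0]                   # 1-2. embed the question (provider call)

def retrieve_snippets(q_vector, user):
    return ["[1] Dues are due on the 1st."]  # 3-4. access-filtered vector search

def build_prompt(question, snippets):
    context = "\n".join(snippets)
    return f"CONTEXT:\n{context}\n\nQUESTION:\n{question}"  # 5. prompt construction

def stream_llm(prompt):
    yield "Dues are due "                    # 6. streamed LLM response
    yield "on the 1st [1]."

def answer(question, org_id, user):
    q_vector = embed(question, org_id)
    snippets = retrieve_snippets(q_vector, user)
    prompt = build_prompt(question, snippets)
    yield from stream_llm(prompt)            # chunks are relayed over the WebSocket
```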
Components
1. Embeddings store (pgvector)
Model: apps/ai_services/models.py — ContentEmbedding(content_type, content_id, org_id, embedding_vector, ...)
pgvector is a Postgres extension that adds a vector column type and indexed approximate-nearest-neighbor search via ivfflat or hnsw indexes.
-- Sample table structure
CREATE TABLE content_embedding (
id UUID PRIMARY KEY,
org_id UUID NOT NULL REFERENCES organization(id),
content_type VARCHAR(64) NOT NULL, -- 'forum_post', 'document', 'member', etc.
content_id UUID NOT NULL,
embedding VECTOR(1536) NOT NULL, -- 1536 for OpenAI text-embedding-3-small, 768 for Google text-embedding-004
text_excerpt TEXT NOT NULL,
created_at TIMESTAMP NOT NULL,
...
);
CREATE INDEX content_embedding_hnsw_idx
ON content_embedding USING hnsw (embedding vector_cosine_ops);
Why dimensions matter: per-org BYOM (bring your own model) means different orgs may use different providers, and provider models produce embeddings of different dimensions. pgvector's hnsw and ivfflat indexes require a fixed dimension, so the column is sized for the platform's most common model; an org that switches to a provider with a different dimension must reindex its content before chat works again (see the failure-modes table below).
2. Embedder
File: apps/ai_services/providers/
Multi-provider — picks the embedding model based on the org's AIConfig.embedding_provider:
- Anthropic — Voyage AI embeddings (voyage-2 for general text, voyage-code-2 for code); Anthropic doesn't ship its own embeddings API
- OpenAI — text-embedding-3-small or text-embedding-3-large
- Google — text-embedding-004 (768 dims) or gemini-embedding-001 (3072 dims by default, truncatable to 1536 or 768)
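Because retrieval must query with the same dimensionality the index was built with, the provider layer presumably needs a model-to-dimension lookup. A minimal sketch (the mapping and function are assumptions; the dimensions are the providers' published sizes):

```python
# Hypothetical lookup used when validating an org's AIConfig; dimensions
# are the providers' published embedding sizes.
EMBEDDING_DIMS = {
    ("anthropic", "voyage-2"): 1024,
    ("openai", "text-embedding-3-small"): 1536,
    ("openai", "text-embedding-3-large"): 3072,
    ("google", "text-embedding-004"): 768,
}

def embedding_dim(provider: str, model: str) -> int:
    """Return the vector dimension for a provider/model pair."""
    try:
        return EMBEDDING_DIMS[(provider, model)]
    except KeyError:
        raise ValueError(f"unknown embedding model: {provider}/{model}")
```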
3. EmbeddingJob (Celery)
Model: apps/ai_services/models.py — EmbeddingJob
Long-lived background job that:
- Walks indexable content types (documents, forum posts, member profiles, compliance entries…)
- Batches them
- Calls the embedding provider
- Upserts ContentEmbedding rows
- Reports progress
Triggered by:
- Org-level "rebuild index" button (admin)
- Nightly Celery beat schedule (incremental — only new / changed content)
- Module-enable event (e.g., enabling documents triggers a one-time index of existing docs)
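The batch-embed-upsert-report loop can be sketched in pure Python (batch size, function names, and the upsert shape are assumptions; the real job runs inside Celery):

```python
# Minimal sketch of the EmbeddingJob loop; collaborators are injected so the
# control flow is visible. Batch size of 100 is an assumption.
from itertools import islice

def batched(iterable, size=100):
    """Yield lists of up to `size` items (Python 3.12 ships itertools.batched)."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def run_embedding_job(items, embed_batch, upsert, report):
    done = 0
    for chunk in batched(items):
        vectors = embed_batch([i["text"] for i in chunk])  # one provider call per batch
        upsert(zip(chunk, vectors))                        # upsert ContentEmbedding rows
        done += len(chunk)
        report(done)                                       # progress for the admin UI
```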
4. Retrieval
File: apps/ai_services/retrieval.py (typical pattern)
from pgvector.django import CosineDistance

def retrieve(question: str, org_id: UUID, user: User, k: int = 8) -> list[Snippet]:
    # 1. Embed the question
    q_vector = embed(question, org_id)
    # 2. Determine the user's access scope
    access_scope = derive_access_scope(user, org_id)
    # 3. Vector search, filtered by org and access.
    #    CosineDistance matches the vector_cosine_ops index above
    #    (L2Distance would bypass the hnsw index).
    snippets = (
        ContentEmbedding.objects
        .filter(org_id=org_id)
        .filter(access_scope_filter(access_scope))
        .order_by(CosineDistance("embedding", q_vector))[:k]
    )
    # 4. Hydrate with full text + source URL
    return [hydrate(s) for s in snippets]
The crucial line is #3 — the access filter. A chapter member's chatbot retrieval should never include another chapter's content.
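The access rule can be stated as a pure-Python predicate; the real code expresses the same logic as an ORM filter, and the field names here are assumptions:

```python
# Hypothetical sketch of the access check applied in step 3. Non-admins are
# scoped to their own chapter; org admins see everything in their org.
def derive_access_scope(user):
    return {
        "org_id": user["org_id"],
        "chapter_ids": None if user["is_org_admin"] else {user["chapter_id"]},
    }

def is_visible(snippet, scope):
    if snippet["org_id"] != scope["org_id"]:
        return False                     # never cross-org
    if scope["chapter_ids"] is None:
        return True                      # org admins see all chapters
    return snippet["chapter_id"] in scope["chapter_ids"]
```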
5. Prompt construction
prompt = f"""
You are GreekManage's assistant for {org.name}. Answer based on the
context below. If the answer isn't in the context, say so. Always cite
sources by their [bracketed numbers].
CONTEXT:
{render_snippets_with_numbers(snippets)}
QUESTION:
{question}
"""
6. Streaming
Channels: Django Channels 4.3 with Redis as the channel layer.
The ChatConsumer opens a WebSocket, runs the retrieval, then streams the LLM response chunk-by-chunk. The frontend renders tokens as they arrive, giving the typing-out effect.
If the user disconnects mid-stream, the consumer cancels the LLM request to save tokens.
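The stream-until-disconnect behavior can be sketched with plain asyncio (a Channels consumer follows the same shape; the LLM stream and socket here are simulated):

```python
# Asyncio sketch of streaming with mid-stream cancellation; the LLM stream
# and the send callable are stand-ins for the real provider and WebSocket.
import asyncio

async def fake_llm_stream():
    for token in ["Hel", "lo", " wor", "ld"]:
        await asyncio.sleep(0)           # stand-in for network latency
        yield token

async def stream_to_client(send, disconnected: asyncio.Event):
    async for token in fake_llm_stream():
        if disconnected.is_set():
            break                        # abandon the LLM stream to save tokens
        await send(token)
```

Breaking out of the `async for` closes the async generator, which is where a real consumer would cancel the in-flight provider request.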
BYOM (Bring Your Own Model)
Each org configures AIConfig:
class AIConfig(models.Model):
organization = models.OneToOneField(Organization, ...)
chat_provider = models.CharField(choices=PROVIDERS) # anthropic | openai | google
chat_model = models.CharField() # e.g. "claude-sonnet-4-5"
chat_api_key = EncryptedTextField() # encrypted
embedding_provider = models.CharField(choices=PROVIDERS)
embedding_model = models.CharField()
embedding_api_key = EncryptedTextField()
monthly_query_cap = models.IntegerField(default=10000)
is_logging_enabled = models.BooleanField(default=True)
The platform falls back to PlatformAIConfig if an org doesn't configure its own. Platform-managed keys are billed via the customer's subscription tier.
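The fallback rule is simple enough to state directly; this resolver is an assumption, with field names following the AIConfig model above:

```python
# Sketch of org-config resolution with a platform fallback; the resolver
# function and PLATFORM_DEFAULT values are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChatConfig:
    provider: str
    model: str

PLATFORM_DEFAULT = ChatConfig(provider="anthropic", model="claude-sonnet-4-5")

def resolve_chat_config(org_config: Optional[ChatConfig]) -> ChatConfig:
    """Org-level config wins; otherwise fall back to the platform default."""
    return org_config if org_config is not None else PLATFORM_DEFAULT
```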
Logging + observability
When is_logging_enabled=True:
- Every chat message + response logged to ChatTurn in apps/ai_services/models.py (text + tokens used)
- Org admins can view + export chat logs for quality review
- 👍 / 👎 feedback stored on each turn
When disabled, only token counts are stored (for billing) — no message content.
Cost control
- Per-org monthly_query_cap — user gets a "limit reached" message when exceeded
- LLM token caps per response (default 1024 output)
- Aggressive context truncation — if retrieved snippets exceed N tokens, lowest-similarity snippets are dropped
- Failed retrievals (zero relevant snippets) skip the LLM entirely and return a "no context found" message
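The truncation rule (drop lowest-similarity snippets until the context fits) can be sketched as follows; word count stands in for real tokenization here:

```python
# Sketch of context truncation; a real implementation would count tokens
# with the model's tokenizer rather than splitting on whitespace.
def truncate_context(snippets, budget_tokens):
    """snippets: list of (similarity, text). Keep highest-similarity first."""
    kept, used = [], 0
    for sim, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = len(text.split())         # crude token estimate
        if used + cost > budget_tokens:
            continue                     # drop the lower-similarity snippet
        kept.append((sim, text))
        used += cost
    return kept
```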
Failure modes + mitigations
| Failure | Mitigation |
|---|---|
| Provider API outage | Failover to next configured provider (Anthropic → OpenAI → Google), or graceful "AI is unavailable" message |
| Embedding dimension mismatch (BYOM swap) | EmbeddingJob reindexes all content with new dimensions before chat works |
| Hallucination | System prompt forces citation; UI shows sources; thumbs-down feedback flagged for review |
| Privacy leak | Per-user access filter in retrieval; tested via E2E that User A in Chapter X can't get Chapter Y data |
| Token cost runaway | Monthly cap + per-response output cap |
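The failover row in the table can be sketched as a simple loop over the configured providers (client callables here are stubs; the order comes from the table above):

```python
# Sketch of the Anthropic -> OpenAI -> Google failover chain; each client
# is a stand-in callable that raises on outage.
FAILOVER_ORDER = ["anthropic", "openai", "google"]

def chat_with_failover(prompt, clients):
    """Try providers in order; return a graceful message if all fail."""
    for name in FAILOVER_ORDER:
        client = clients.get(name)
        if client is None:
            continue                     # provider not configured for this org
        try:
            return client(prompt)
        except Exception:
            continue                     # provider outage: try the next one
    return "AI is unavailable right now - please try again later."
```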
Privacy guarantees
- No cross-org data: retrieval filters by org_id
- No cross-chapter for non-admins: retrieval filters by user's chapter scope
- PII-stripped where possible: sensitive fields (encrypted columns) never enter embeddings
- No training on customer data: API calls are inference-only; opt-out at the provider level when supported
- Per-org logging toggle: orgs can disable conversation logging entirely