AI / RAG pipeline

GreekManage's AI assistant uses retrieval-augmented generation (RAG) — your question is matched against your org's data, relevant snippets are fed to a language model, and the model's answer is streamed back.

End-to-end flow

Question → embed the question → vector search over the org's embeddings (pgvector) → build a prompt from the top snippets → LLM → stream the answer back over WebSocket.

Components

1. Embeddings store (pgvector)

Model: ContentEmbedding(content_type, content_id, org_id, embedding_vector, ...) in apps/ai_services/models.py

pgvector is a Postgres extension that adds a vector column type and indexed approximate-nearest-neighbor search via ivfflat or hnsw indexes.

-- Sample table structure
CREATE TABLE content_embedding (
    id UUID PRIMARY KEY,
    org_id UUID NOT NULL REFERENCES organization(id),
    content_type VARCHAR(64) NOT NULL,   -- 'forum_post', 'document', 'member', etc.
    content_id UUID NOT NULL,
    embedding VECTOR(1536) NOT NULL,     -- 1536 for OpenAI ada-002, 768 for Google text-embedding-004
    text_excerpt TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL,
    ...
);

CREATE INDEX content_embedding_hnsw_idx
    ON content_embedding USING hnsw (embedding vector_cosine_ops);

Why variable-dim matters: per-org BYOM (bring your own model) means different orgs may use different providers, and provider models emit embeddings of different dimensions. A pgvector column declared without a fixed size can store vectors of varying length per row, but an ANN index still requires a single fixed dimension, so the index targets the most common one (1536 in the sample above).
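
For orientation, the same table can be expressed with pgvector's Django integration. This is a hedged sketch, not the actual apps/ai_services/models.py source: it assumes the pgvector Python package's pgvector.django module and pins the dimension at 1536 to match the sample table (a truly variable-dimension column would drop the dimensions argument and lose the single HNSW index).

import uuid

from django.db import models
from pgvector.django import HnswIndex, VectorField


class ContentEmbedding(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    org_id = models.UUIDField(db_index=True)
    content_type = models.CharField(max_length=64)   # 'forum_post', 'document', 'member', etc.
    content_id = models.UUIDField()
    embedding = VectorField(dimensions=1536)         # matches the SQL sample above
    text_excerpt = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [
            HnswIndex(
                name="content_embedding_hnsw_idx",
                fields=["embedding"],
                opclasses=["vector_cosine_ops"],
            ),
        ]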

2. Embedder

File: apps/ai_services/providers/

Multi-provider — picks the embedding model based on the org's AIConfig.embedding_provider:

  • Anthropic — Voyage embeddings (voyage-2 for general text, voyage-code-2 for code)
  • OpenAI — text-embedding-3-small or text-embedding-3-large
  • Google — text-embedding-004 (768 dims) or gemini-embedding-001 (3072 dims by default, configurable)
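
A minimal sketch of that dispatch, shown for the OpenAI path only; the embed_batch / embed_with_* names and module layout are illustrative, not the real apps/ai_services/providers/ API (the single-string embed(question, org_id) used by retrieval below would be a thin wrapper over this).

from openai import OpenAI

from apps.ai_services.models import AIConfig


def embed_with_openai(texts: list[str], model: str, api_key: str) -> list[list[float]]:
    # OpenAI embeddings API (openai>=1.0); other providers follow the same shape
    client = OpenAI(api_key=api_key)
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


def embed_batch(texts: list[str], org_id) -> list[list[float]]:
    config = AIConfig.objects.get(organization_id=org_id)
    providers = {
        "openai": embed_with_openai,
        # "anthropic": embed_with_voyage,   # Voyage SDK wrapper, same signature (hypothetical)
        # "google": embed_with_google,      # Gemini embeddings wrapper, same signature (hypothetical)
    }
    embed_fn = providers[config.embedding_provider]
    return embed_fn(texts, model=config.embedding_model, api_key=config.embedding_api_key)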

3. EmbeddingJob (Celery)

Model: EmbeddingJob in apps/ai_services/models.py

Long-lived background job that:

  1. Walks indexable content types (documents, forum posts, member profiles, compliance entries…)
  2. Batches them
  3. Calls the embedding provider
  4. Upserts ContentEmbedding rows
  5. Reports progress

Triggered by:

  • Org-level "rebuild index" button (admin)
  • Nightly Celery beat schedule (incremental — only new / changed content)
  • Module-enable event (e.g., enabling documents triggers a one-time index of existing docs)
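
A hedged sketch of what one run might look like, assuming Celery's shared_task and the ContentEmbedding model above; iter_indexable_content() and embed_batch() are hypothetical helpers standing in for the real content walker and provider call, and progress reporting is omitted.

from celery import shared_task

from apps.ai_services.models import ContentEmbedding

BATCH_SIZE = 64  # illustrative batch size


@shared_task
def run_embedding_job(org_id, incremental=True):
    # 1-2. Walk indexable content types and batch them
    batch = []
    for item in iter_indexable_content(org_id, only_changed=incremental):  # hypothetical walker
        batch.append(item)
        if len(batch) == BATCH_SIZE:
            _flush(org_id, batch)
            batch = []
    if batch:
        _flush(org_id, batch)
    # 5. Progress reporting omitted for brevity


def _flush(org_id, batch):
    # 3. Call the embedding provider for the whole batch
    vectors = embed_batch([item.text for item in batch], org_id)  # see Embedder above
    # 4. Upsert ContentEmbedding rows
    for item, vector in zip(batch, vectors):
        ContentEmbedding.objects.update_or_create(
            org_id=org_id,
            content_type=item.content_type,
            content_id=item.content_id,
            defaults={"embedding": vector, "text_excerpt": item.text[:1000]},
        )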

4. Retrieval

File: apps/ai_services/retrieval.py (typical pattern)

from uuid import UUID

from django.contrib.auth.models import User
from pgvector.django import CosineDistance

from apps.ai_services.models import ContentEmbedding


def retrieve(question: str, org_id: UUID, user: User, k: int = 8) -> list[Snippet]:
    # 1. Embed the question
    q_vector = embed(question, org_id)

    # 2. Determine user's access scope
    access_scope = derive_access_scope(user, org_id)

    # 3. Vector search, filtered by org and access
    #    (CosineDistance matches the hnsw vector_cosine_ops index above)
    snippets = (
        ContentEmbedding.objects
        .filter(org_id=org_id)
        .filter(access_scope_filter(access_scope))
        .order_by(CosineDistance("embedding", q_vector))[:k]
    )

    # 4. Hydrate with full text + source URL
    return [hydrate(s) for s in snippets]

The crucial line is #3 — the access filter. A chapter member's chatbot retrieval should never include another chapter's content.
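
access_scope_filter can be as simple as a Q-object builder. A hedged sketch, assuming embeddings carry an optional chapter scope (one of the columns elided in the table above) and a scope object exposing is_org_admin and chapter_ids; this is an illustration, not the actual implementation.

from django.db.models import Q


def access_scope_filter(scope) -> Q:
    # Org admins may retrieve anything inside their own org (org_id is already filtered)
    if scope.is_org_admin:
        return Q()
    # Regular members: org-wide content, plus content scoped to their own chapters
    return Q(chapter_id__isnull=True) | Q(chapter_id__in=scope.chapter_ids)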

5. Prompt construction

prompt = f"""
You are GreekManage's assistant for {org.name}. Answer based on the
context below. If the answer isn't in the context, say so. Always cite
sources by their [bracketed numbers].

CONTEXT:
{render_snippets_with_numbers(snippets)}

QUESTION:
{question}
"""

6. Streaming

Channels: Django Channels 4.3 with Redis as the channel layer.

The ChatConsumer opens a WebSocket, runs the retrieval, then streams the LLM response chunk-by-chunk. The frontend renders tokens as they arrive, giving the typing-out effect.

If the user disconnects mid-stream, the consumer cancels the LLM request to save tokens.
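
A hedged sketch of that consumer shape, assuming Channels' AsyncWebsocketConsumer, an org_id captured by the URL route, and hypothetical build_prompt() and stream_llm() helpers (an async generator over provider chunks); none of this is the actual ChatConsumer source.

import asyncio
import json

from channels.db import database_sync_to_async
from channels.generic.websocket import AsyncWebsocketConsumer


class ChatConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        self._task = None
        await self.accept()

    async def receive(self, text_data=None, bytes_data=None):
        question = json.loads(text_data)["question"]
        # Run the answer as a task so disconnect() can cancel it mid-stream
        self._task = asyncio.create_task(self._answer(question))

    async def _answer(self, question):
        org_id = self.scope["url_route"]["kwargs"]["org_id"]  # assumes routing captures org_id
        user = self.scope["user"]
        snippets = await database_sync_to_async(retrieve)(question, org_id, user)
        prompt = build_prompt(question, snippets)              # hypothetical prompt helper
        async for chunk in stream_llm(prompt):                 # hypothetical provider streaming wrapper
            await self.send(text_data=json.dumps({"token": chunk}))
        await self.send(text_data=json.dumps({"done": True}))

    async def disconnect(self, close_code):
        # Cancelling the task is what stops the upstream LLM request and saves tokens
        if self._task and not self._task.done():
            self._task.cancel()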

BYOM (Bring Your Own Model)

Each org configures AIConfig:

class AIConfig(models.Model):
    organization = models.OneToOneField(Organization, ...)
    chat_provider = models.CharField(choices=PROVIDERS)       # anthropic | openai | google
    chat_model = models.CharField()                           # e.g. "claude-sonnet-4-5"
    chat_api_key = EncryptedTextField()                       # encrypted
    embedding_provider = models.CharField(choices=PROVIDERS)
    embedding_model = models.CharField()
    embedding_api_key = EncryptedTextField()
    monthly_query_cap = models.IntegerField(default=10000)
    is_logging_enabled = models.BooleanField(default=True)

The platform falls back to PlatformAIConfig if an org doesn't configure its own. Platform-managed keys are billed via the customer's subscription tier.
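
That fallback can be as small as the sketch below; PlatformAIConfig is assumed to mirror AIConfig's fields and to exist as a single platform-wide row.

from apps.ai_services.models import AIConfig, PlatformAIConfig


def resolve_ai_config(organization):
    try:
        return organization.aiconfig              # org-configured BYOM settings
    except AIConfig.DoesNotExist:
        return PlatformAIConfig.objects.get()     # platform-managed keys, billed by subscription tier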

Logging + observability

When is_logging_enabled=True:

  • Every chat message + response logged to ChatTurn in apps/ai_services/models.py (text + tokens used)
  • Org admins can view + export chat logs for quality review
  • 👍 / 👎 feedback stored on each turn

When disabled, only token counts are stored (for billing) — no message content.
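
A small sketch of that toggle; the ChatTurn field names here are assumptions beyond what the bullets above state (text, token counts, feedback).

from apps.ai_services.models import ChatTurn


def log_turn(config, organization, question, answer, tokens_in, tokens_out):
    ChatTurn.objects.create(
        organization=organization,
        # Message content is stored only when the org has logging enabled
        prompt_text=question if config.is_logging_enabled else "",
        response_text=answer if config.is_logging_enabled else "",
        # Token counts are always stored, since billing needs them
        tokens_in=tokens_in,
        tokens_out=tokens_out,
    )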

Cost control

  • Per-org monthly_query_cap — user gets a "limit reached" message when exceeded
  • LLM token caps per response (default 1024 output)
  • Aggressive context truncation — if retrieved snippets exceed N tokens, lowest-similarity snippets are dropped
  • Failed retrievals (zero relevant snippets) skip the LLM entirely and return a "no context found" message
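
The truncation rule above can be a few lines, sketched here under the assumption that retrieval returns snippets ordered best-match first and that count_tokens() wraps whatever tokenizer the configured provider exposes.

MAX_CONTEXT_TOKENS = 3000  # illustrative budget, not the real default


def truncate_snippets(snippets, budget=MAX_CONTEXT_TOKENS):
    kept, used = [], 0
    for snippet in snippets:                 # already sorted by similarity, best first
        cost = count_tokens(snippet.text)    # hypothetical tokenizer wrapper
        if used + cost > budget:
            break                            # everything after this is lower-similarity, so drop it
        kept.append(snippet)
        used += cost
    return kept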

Failure modes + mitigations

Failure | Mitigation
--- | ---
Provider API outage | Failover to the next configured provider (Anthropic → OpenAI → Google), or a graceful "AI is unavailable" message
Embedding dimension mismatch (BYOM swap) | EmbeddingJob reindexes all content at the new dimensions before chat works again
Hallucination | System prompt forces citations; UI shows sources; thumbs-down feedback is flagged for review
Privacy leak | Per-user access filter in retrieval; E2E tests verify that User A in Chapter X can't retrieve Chapter Y data
Token cost runaway | Monthly query cap + per-response output cap
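
The failover row translates roughly to the sketch below; FAILOVER_ORDER, call_llm(), and ProviderError are illustrative names, not the real module API.

FAILOVER_ORDER = ["anthropic", "openai", "google"]


def chat_with_failover(prompt, configs):
    # configs: provider name -> resolved AIConfig/PlatformAIConfig (only configured providers present)
    for provider in FAILOVER_ORDER:
        config = configs.get(provider)
        if config is None:
            continue
        try:
            return call_llm(provider, config, prompt)   # hypothetical provider wrapper
        except ProviderError:                           # hypothetical "provider is down" exception
            continue
    return "The AI assistant is unavailable right now. Please try again later."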

Privacy guarantees

  1. No cross-org data: retrieval filters by org_id
  2. No cross-chapter for non-admins: retrieval filters by user's chapter scope
  3. PII-stripped where possible: sensitive fields (encrypted columns) never enter embeddings
  4. No training on customer data: API calls are inference-only; opt-out at the provider level when supported
  5. Per-org logging toggle: orgs can disable conversation logging entirely