Observability
GreekManage's observability story today is container logs + audit logs. There's no APM, no distributed tracing, no metrics dashboard yet. This page documents what exists and what's planned.
Current state
| Concern | Tool | Where |
|---|---|---|
| Application logs | Container stdout (Django + Celery) | `kubectl logs <pod>` |
| Database logs | Postgres stdout | `kubectl logs postgres-...` |
| Audit log | DB table + S3 archive | `AuditLog` model + S3 daily archive |
| Error tracking | None | – |
| Metrics | None | – |
| Tracing | None | – |
| Uptime | None | – |
| Security scans | OWASP ZAP | E2E + nightly |
Application logging
Django + Celery write to stdout in JSON format (configurable via LOGGING in settings.py). In production:
```bash
kubectl logs -n greekmanage deploy/backend --tail=200 -f
kubectl logs -n greekmanage deploy/celery --tail=200 -f
```
For aggregated viewing across pods, use your cluster's log aggregator (Loki, Cloudwatch Logs, GCP Logging — depends on your platform).
Log levels
| Module | Level | Why |
|---|---|---|
| Django | INFO | Request lifecycle, ORM warnings |
| Celery | INFO | Task start / complete |
| `apps.common.middleware.AuditMiddleware` | INFO | Audit log writes |
| `apps.ai_services` | INFO | LLM calls (tokens, latency) |
| `apps.authentication.encryption` | WARN | Key rotation events |
| Third-party libs (urllib3, etc.) | WARN | Reduce noise |
Override per environment via LOG_LEVEL env var.
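As a rough illustration of how `LOG_LEVEL` can feed the `LOGGING` dict, here is a minimal sketch assuming python-json-logger supplies the JSON formatter; the project's actual settings.py may be organized differently:

```python
# settings.py (sketch only; assumes the python-json-logger package)
import os

LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "json": {
            "()": "pythonjsonlogger.jsonlogger.JsonFormatter",
            "format": "%(asctime)s %(levelname)s %(name)s %(message)s",
        },
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "json"},
    },
    "root": {"handlers": ["console"], "level": LOG_LEVEL},
    "loggers": {
        "django": {"level": LOG_LEVEL},
        "celery": {"level": LOG_LEVEL},
        "urllib3": {"level": "WARNING"},
    },
}
```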
Audit log
The single source of truth for "who did what when":
- Every sensitive write is logged automatically by `AuditMiddleware`
- Append-only at the DB level (the app-tier user has no UPDATE/DELETE on the table)
- Daily archive to S3 (gzipped JSONL, partitioned by org / year / month / day)
- Per-org retention (default 180 days in DB; longer in S3)
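For illustration, the daily archive job can be as small as the sketch below. The task name, model import path, field names, and `AUDIT_ARCHIVE_BUCKET` variable are assumptions for the example, not the project's actual code:

```python
# Sketch of a daily audit-log archive task (hypothetical names throughout).
import gzip
import io
import json
import os
from datetime import date, timedelta

import boto3
from celery import shared_task

from apps.common.models import AuditLog  # hypothetical import path


@shared_task
def archive_audit_logs(day=None):
    """Archive one day's audit rows to S3 as gzipped JSONL, partitioned per org."""
    target = date.fromisoformat(day) if day else date.today() - timedelta(days=1)
    s3 = boto3.client("s3")
    bucket = os.environ["AUDIT_ARCHIVE_BUCKET"]  # assumed env var

    org_ids = (
        AuditLog.objects.filter(created_at__date=target)
        .values_list("organization_id", flat=True)
        .distinct()
    )
    for org_id in org_ids:
        rows = AuditLog.objects.filter(
            organization_id=org_id, created_at__date=target
        ).values()
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
            for row in rows:
                gz.write(json.dumps(row, default=str).encode() + b"\n")
        key = (
            f"audit/{org_id}/{target.year}/{target.month:02d}/{target.day:02d}/"
            f"audit-{target.isoformat()}.jsonl.gz"
        )
        s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue())
```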
Health checks
Backend exposes /api/health/:
```text
GET /api/health/

200 OK
{
  "status": "healthy",
  "db": "ok",
  "redis": "ok",
  "version": "0.57.0"
}
```
Returns 503 if any dependency is down. Used by:
- Kubernetes readiness + liveness probes
- External uptime monitors (when you wire one up)
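For reference, the view behind such an endpoint is only a few lines. This is a sketch, assuming django-redis provides the Redis connection and with the version string hard-coded for brevity:

```python
# Sketch of a health-check view; the real implementation may check more
# dependencies and read the version from a settings/env value.
from django.db import connection
from django.http import JsonResponse
from django_redis import get_redis_connection  # assumes django-redis


def health(request):
    checks = {}
    try:
        with connection.cursor() as cur:
            cur.execute("SELECT 1")
        checks["db"] = "ok"
    except Exception:
        checks["db"] = "down"
    try:
        get_redis_connection("default").ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "down"

    healthy = all(v == "ok" for v in checks.values())
    return JsonResponse(
        {"status": "healthy" if healthy else "unhealthy", **checks, "version": "0.57.0"},
        status=200 if healthy else 503,
    )
```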
What's missing (priority order)
1. Error tracking — Sentry (recommended)
Why first: production errors are silently lost in container logs unless you tail them. Sentry de-dupes, groups by stack, alerts on first occurrence.
Setup:
```bash
pip install sentry-sdk
```

```python
# settings.py
import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration
from sentry_sdk.integrations.celery import CeleryIntegration

if SENTRY_DSN := env("SENTRY_DSN", default=None):
    sentry_sdk.init(
        dsn=SENTRY_DSN,
        integrations=[DjangoIntegration(), CeleryIntegration()],
        traces_sample_rate=0.1,
        send_default_pii=False,  # don't ship PII
        environment=env("DJANGO_ENV", default="development"),
        release=env("RELEASE_VERSION", default="dev"),
    )
```
Frontend equivalent for browser errors:
```js
import * as Sentry from "@sentry/react";

Sentry.init({
  dsn: import.meta.env.VITE_SENTRY_DSN,
  integrations: [Sentry.browserTracingIntegration()],
  tracesSampleRate: 0.1,
  environment: import.meta.env.MODE,
});
```
2. Metrics — Prometheus + Grafana
Why second: capacity planning, performance regressions, SLO tracking.
Setup:
```bash
pip install django-prometheus celery-prometheus
```

```python
# settings.py
INSTALLED_APPS += ["django_prometheus"]

MIDDLEWARE = [
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    *MIDDLEWARE,
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]
```
Endpoint: /metrics (restrict to cluster-internal scrape).
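django-prometheus ships a URL module that serves the default prometheus_client registry, so exposing the endpoint is a one-line include (the prefix is your choice):

```python
# urls.py
from django.urls import include, path

urlpatterns = [
    # ... existing routes ...
    path("", include("django_prometheus.urls")),  # serves /metrics
]
```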
Useful initial metrics:
- `django_http_requests_total{method, view, status}` – request rate, error rate
- `django_db_query_duration_seconds` – slow queries
- `celery_task_runtime_seconds{task}` – task latency
- `process_resident_memory_bytes` – memory leaks
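Because the exporter serves the default prometheus_client registry, app-specific metrics registered anywhere in the codebase appear on /metrics as well. A sketch with illustrative metric names (e.g. for the LLM calls in apps.ai_services):

```python
# Sketch of custom app metrics; names and labels are illustrative.
from prometheus_client import Counter, Histogram

LLM_CALLS = Counter(
    "greekmanage_llm_calls_total",
    "LLM calls made by apps.ai_services",
    ["provider", "status"],
)
LLM_LATENCY = Histogram(
    "greekmanage_llm_call_seconds",
    "LLM call latency in seconds",
    ["provider"],
)


def call_llm(provider, make_request):
    """Wrap an LLM call (make_request: zero-arg callable) with metrics."""
    with LLM_LATENCY.labels(provider=provider).time():
        try:
            result = make_request()
            LLM_CALLS.labels(provider=provider, status="ok").inc()
            return result
        except Exception:
            LLM_CALLS.labels(provider=provider, status="error").inc()
            raise
```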
Grafana dashboards: import community dashboards for Django + Celery, customize per app.
3. Distributed tracing — OpenTelemetry
Why third: trace a single request across backend → DB → Celery → external API.
Setup:
```bash
pip install opentelemetry-distro opentelemetry-instrumentation-django \
    opentelemetry-instrumentation-celery opentelemetry-instrumentation-psycopg2
opentelemetry-bootstrap --action=install
```

```bash
# No settings.py changes needed; auto-instrumentation is configured via env vars:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_SERVICE_NAME=greekmanage-backend
```
Backends: Tempo, Jaeger, Honeycomb, Datadog (your choice).
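Auto-instrumentation covers Django views, Celery tasks, and psycopg2 queries; spans inside your own code (an LLM call, for example) are added manually via the OpenTelemetry API. The tracer and attribute names below are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("greekmanage.ai_services")  # illustrative name


def summarize(document: str) -> str:
    # Manual span around a hypothetical LLM call; attributes are examples.
    with tracer.start_as_current_span("llm.summarize") as span:
        span.set_attribute("document.chars", len(document))
        summary = document[:200]  # placeholder for the real provider call
        span.set_attribute("summary.chars", len(summary))
        return summary
```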
4. Uptime monitoring — Better Uptime / Pingdom / Healthchecks.io
External GET against /api/health/ every 60s. Alerts on 3 consecutive failures.
Cheap, easy first win — set up before any of the above.
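Until one is wired up, even a trivial script run from a separate host approximates the same behaviour; the URL and alert hook here are placeholders:

```python
# Stopgap uptime checker: poll the health endpoint every 60s, alert on
# 3 consecutive failures. Replace alert() with Slack/PagerDuty/etc.
import time

import requests

HEALTH_URL = "https://greekmanage.example.com/api/health/"  # placeholder URL


def alert(message):
    print(f"ALERT: {message}")


def main():
    failures = 0
    while True:
        try:
            ok = requests.get(HEALTH_URL, timeout=10).status_code == 200
        except requests.RequestException:
            ok = False
        failures = 0 if ok else failures + 1
        if failures == 3:
            alert(f"{HEALTH_URL} failed 3 consecutive checks")
        time.sleep(60)


if __name__ == "__main__":
    main()
```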
5. Log aggregation — Loki / Grafana Cloud Logs
Central place to search across all pods. Set up at the cluster level:
- Promtail → Loki (self-hosted)
- Grafana Cloud Logs (managed)
- AWS CloudWatch Logs (if on AWS)
Tag logs by pod + app for filtering.
6. Frontend RUM — Web Vitals
Capture Core Web Vitals (LCP, CLS, and INP, which replaced FID as a Core Web Vital in 2024) via the web-vitals package and ship them to Sentry / Datadog / your analytics:
```js
// onFID is available in web-vitals v3; it was removed in v4, where INP replaces FID.
import { onCLS, onFID, onINP, onLCP, onTTFB } from "web-vitals";

const send = (metric) => {
  navigator.sendBeacon("/api/web-vitals/", JSON.stringify(metric));
};

onCLS(send); onFID(send); onLCP(send); onINP(send); onTTFB(send);
```
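The /api/web-vitals/ endpoint does not exist yet; if you go this route, the receiving side can start as a plain logging view, roughly like this sketch:

```python
# Sketch of a view to receive web-vitals beacons; route it at /api/web-vitals/.
import json
import logging

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST

logger = logging.getLogger("web_vitals")


@csrf_exempt  # sendBeacon posts without CSRF headers
@require_POST
def web_vitals(request):
    try:
        metric = json.loads(request.body)
    except json.JSONDecodeError:
        return HttpResponse(status=400)
    logger.info(
        "web-vital received",
        extra={"metric_name": metric.get("name"), "metric_value": metric.get("value")},
    )
    return HttpResponse(status=204)
```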
Cost considerations
| Tool | Cost (approx) | Notes |
|---|---|---|
| Sentry | Free up to 5K events / month, then $26+/mo | Most affordable error tracking |
| Better Uptime | Free for 10 monitors | Simplest uptime tool |
| Healthchecks.io | Free for 20 checks | Cron / job monitoring |
| Grafana Cloud | Free tier 10K series, 50GB logs | Easiest hosted Prometheus + Loki |
| Datadog | $15/host/mo + extras | Most expensive but most integrated |
For a typical org running GreekManage in production, Sentry + Better Uptime + self-hosted Prometheus is a reasonable starting baseline (~$30/mo).
Manual debug techniques (until tooling lands)
Tail logs across pods
```bash
kubectl logs -n greekmanage -l app=backend --tail=200 -f --max-log-requests=10
```
Inspect a slow query
Use EXPLAIN ANALYZE in psql:
```bash
kubectl exec -it -n greekmanage postgres-... -- psql -U greekmanage greekmanage
```

```sql
EXPLAIN ANALYZE
SELECT *
FROM members_memberprofile mp
JOIN organizations_membership m ON mp.membership_id = m.id
WHERE m.chapter_id = 'uuid-here';
```
Check Celery queue depth
```bash
kubectl exec -it -n greekmanage redis-... -- redis-cli LLEN celery
```
A growing queue means workers are saturated → scale up celery deployment replicas.
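The same check can be automated (as a periodic task or cron job) so saturation is caught without shelling into the pod; the queue name, env var, and threshold below are assumptions:

```python
# Sketch of an automated queue-depth check using redis-py.
import logging
import os

import redis

logger = logging.getLogger("celery_queue_monitor")
QUEUE_DEPTH_THRESHOLD = 500  # tune to your workload


def check_celery_queue_depth():
    client = redis.Redis.from_url(os.environ["REDIS_URL"])  # assumed env var
    depth = client.llen("celery")  # default Celery queue name
    if depth > QUEUE_DEPTH_THRESHOLD:
        logger.warning("Celery queue depth is %s; workers may be saturated", depth)
    return depth
```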
Find a request in logs by request ID
Backend includes X-Request-Id header in responses. Grep for it across logs:
```bash
kubectl logs -n greekmanage -l app=backend | grep "abc123-..."
```
(Add request_id to every log line via Django middleware — see apps/common/middleware.py.)
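A common shape for that middleware (shown here as an illustrative sketch, not the actual contents of apps/common/middleware.py) is a contextvar plus a logging filter; reference the filter from the LOGGING dict and add %(request_id)s to the formatter:

```python
# Illustrative request-id middleware and logging filter.
import logging
import uuid
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="-")


class RequestIdMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        rid = request.headers.get("X-Request-Id", str(uuid.uuid4()))
        request_id_var.set(rid)
        response = self.get_response(request)
        response["X-Request-Id"] = rid
        return response


class RequestIdFilter(logging.Filter):
    """Attach the current request id to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True
```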
Roadmap (rough priority)
- Sentry integration — biggest immediate value
- Uptime monitoring — cheap, fast
- Prometheus + Grafana — for capacity + perf
- Log aggregation — when scale demands it
- Distributed tracing — when LLM / external API debugging gets hard
- Frontend RUM — when UX perf becomes a focus
Tickets to track each are in the issue tracker (observability label).