Observability
GreekManage's observability story today is container logs + audit logs. There's no APM, no distributed tracing, no metrics dashboard yet. This page documents what exists and what's planned.
Current state
| Concern | Tool | Where |
|---|---|---|
| Application logs | Container stdout (Django + Celery) | `kubectl logs <pod>` |
| Database logs | Postgres stdout | `kubectl logs postgres-...` |
| Audit log | DB table + S3 archive | `AuditLog` model + S3 daily archive |
| Error tracking | None | – |
| Metrics | None | – |
| Tracing | None | – |
| Uptime | None | – |
| Security scans | OWASP ZAP | E2E + nightly |
Application logging
Django + Celery write to stdout in JSON format (configurable via LOGGING in settings.py). In production:
```bash
kubectl logs -n greekmanage deploy/backend --tail=200 -f
kubectl logs -n greekmanage deploy/celery --tail=200 -f
```
For aggregated viewing across pods, use your cluster's log aggregator (Loki, Cloudwatch Logs, GCP Logging — depends on your platform).
Log levels
| Module | Level | Why |
|---|---|---|
| Django | INFO | Request lifecycle, ORM warnings |
| Celery | INFO | Task start / complete |
| `apps.common.middleware.AuditMiddleware` | INFO | Audit log writes |
| `apps.ai_services` | INFO | LLM calls (tokens, latency) |
| `apps.authentication.encryption` | WARN | Key rotation events |
| Third-party libs (urllib3, etc.) | WARN | Reduce noise |
Override per environment via LOG_LEVEL env var.
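As a rough illustration of how `LOG_LEVEL` can feed the `LOGGING` dict, here is a minimal sketch assuming python-json-logger supplies the JSON formatter; the project's actual settings.py may be organized differently:

```python
# settings.py (sketch only; assumes the python-json-logger package)
import os

LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "json": {
            "()": "pythonjsonlogger.jsonlogger.JsonFormatter",
            "format": "%(asctime)s %(levelname)s %(name)s %(message)s",
        },
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "json"},
    },
    "root": {"handlers": ["console"], "level": LOG_LEVEL},
    "loggers": {
        "django": {"level": LOG_LEVEL},
        "celery": {"level": LOG_LEVEL},
        "urllib3": {"level": "WARNING"},
    },
}
```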
Audit log
The single source of truth for "who did what when":
- Every sensitive write is logged automatically by `AuditMiddleware`
- Append-only at the DB level (the app-tier user has no UPDATE/DELETE on the table)
- Daily archive to S3 (gzipped JSONL, partitioned by org / year / month / day)
- Per-org retention (default 180 days in DB; longer in S3)
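For illustration, the daily archive job can be as small as the sketch below. The task name, model import path, field names, and `AUDIT_ARCHIVE_BUCKET` variable are assumptions for the example, not the project's actual code:

```python
# Sketch of a daily audit-log archive task (hypothetical names throughout).
import gzip
import io
import json
import os
from datetime import date, timedelta

import boto3
from celery import shared_task

from apps.common.models import AuditLog  # hypothetical import path


@shared_task
def archive_audit_logs(day=None):
    """Archive one day's audit rows to S3 as gzipped JSONL, partitioned per org."""
    target = date.fromisoformat(day) if day else date.today() - timedelta(days=1)
    s3 = boto3.client("s3")
    bucket = os.environ["AUDIT_ARCHIVE_BUCKET"]  # assumed env var

    org_ids = (
        AuditLog.objects.filter(created_at__date=target)
        .values_list("organization_id", flat=True)
        .distinct()
    )
    for org_id in org_ids:
        rows = AuditLog.objects.filter(
            organization_id=org_id, created_at__date=target
        ).values()
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
            for row in rows:
                gz.write(json.dumps(row, default=str).encode() + b"\n")
        key = (
            f"audit/{org_id}/{target.year}/{target.month:02d}/{target.day:02d}/"
            f"audit-{target.isoformat()}.jsonl.gz"
        )
        s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue())
```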
Health checks
Backend exposes /api/health/:
```text
GET /api/health/

200 OK
{
  "status": "healthy",
  "db": "ok",
  "redis": "ok",
  "version": "0.57.0"
}
```
Returns 503 if any dependency is down. Used by:
- Kubernetes readiness + liveness probes
- External uptime monitors (when you wire one up)
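For reference, the view behind such an endpoint is only a few lines. This is a sketch, assuming django-redis provides the Redis connection and with the version string hard-coded for brevity:

```python
# Sketch of a health-check view; the real implementation may check more
# dependencies and read the version from a settings/env value.
from django.db import connection
from django.http import JsonResponse
from django_redis import get_redis_connection  # assumes django-redis


def health(request):
    checks = {}
    try:
        with connection.cursor() as cur:
            cur.execute("SELECT 1")
        checks["db"] = "ok"
    except Exception:
        checks["db"] = "down"
    try:
        get_redis_connection("default").ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "down"

    healthy = all(v == "ok" for v in checks.values())
    return JsonResponse(
        {"status": "healthy" if healthy else "unhealthy", **checks, "version": "0.57.0"},
        status=200 if healthy else 503,
    )
```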
What's missing (priority order)
1. Error tracking — Sentry (recommended)
Why first: production errors are silently lost in container logs unless you tail them. Sentry de-dupes, groups by stack, alerts on first occurrence.
Setup:
```bash
pip install sentry-sdk
```

```python
# settings.py
import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration
from sentry_sdk.integrations.celery import CeleryIntegration

if SENTRY_DSN := env("SENTRY_DSN", default=None):
    sentry_sdk.init(
        dsn=SENTRY_DSN,
        integrations=[DjangoIntegration(), CeleryIntegration()],
        traces_sample_rate=0.1,
        send_default_pii=False,  # don't ship PII
        environment=env("DJANGO_ENV", default="development"),
        release=env("RELEASE_VERSION", default="dev"),
    )
```
Frontend equivalent for browser errors:
```js
import * as Sentry from "@sentry/react";

Sentry.init({
  dsn: import.meta.env.VITE_SENTRY_DSN,
  integrations: [Sentry.browserTracingIntegration()],
  tracesSampleRate: 0.1,
  environment: import.meta.env.MODE,
});
```
2. Metrics — Prometheus + Grafana
Why second: capacity planning, performance regressions, SLO tracking.
Setup:
```bash
pip install django-prometheus celery-prometheus
```

```python
# settings.py
INSTALLED_APPS += ["django_prometheus"]

MIDDLEWARE = [
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    *MIDDLEWARE,
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]
```
Endpoint: /metrics (restrict to cluster-internal scrape).
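django-prometheus ships a URL module that serves the default prometheus_client registry, so exposing the endpoint is a one-line include (the prefix is your choice):

```python
# urls.py
from django.urls import include, path

urlpatterns = [
    # ... existing routes ...
    path("", include("django_prometheus.urls")),  # serves /metrics
]
```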
Useful initial metrics:
- `django_http_requests_total{method, view, status}` – request rate, error rate
- `django_db_query_duration_seconds` – slow queries
- `celery_task_runtime_seconds{task}` – task latency
- `process_resident_memory_bytes` – memory leaks
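Because the exporter serves the default prometheus_client registry, app-specific metrics registered anywhere in the codebase appear on /metrics as well. A sketch with illustrative metric names (e.g. for the LLM calls in apps.ai_services):

```python
# Sketch of custom app metrics; names and labels are illustrative.
from prometheus_client import Counter, Histogram

LLM_CALLS = Counter(
    "greekmanage_llm_calls_total",
    "LLM calls made by apps.ai_services",
    ["provider", "status"],
)
LLM_LATENCY = Histogram(
    "greekmanage_llm_call_seconds",
    "LLM call latency in seconds",
    ["provider"],
)


def call_llm(provider, make_request):
    """Wrap an LLM call (make_request: zero-arg callable) with metrics."""
    with LLM_LATENCY.labels(provider=provider).time():
        try:
            result = make_request()
            LLM_CALLS.labels(provider=provider, status="ok").inc()
            return result
        except Exception:
            LLM_CALLS.labels(provider=provider, status="error").inc()
            raise
```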
Grafana dashboards: import community dashboards for Django + Celery, customize per app.
3. Distributed tracing — OpenTelemetry
Why third: trace a single request across backend → DB → Celery → external API.
Setup:
```bash
pip install opentelemetry-distro opentelemetry-instrumentation-django \
    opentelemetry-instrumentation-celery opentelemetry-instrumentation-psycopg2
opentelemetry-bootstrap --action=install
```

```bash
# No settings.py changes needed; auto-instrumentation is configured via env vars:
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_SERVICE_NAME=greekmanage-backend
```
Backends: Tempo, Jaeger, Honeycomb, Datadog (your choice).
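Auto-instrumentation covers Django views, Celery tasks, and psycopg2 queries; spans inside your own code (an LLM call, for example) are added manually via the OpenTelemetry API. The tracer and attribute names below are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("greekmanage.ai_services")  # illustrative name


def summarize(document: str) -> str:
    # Manual span around a hypothetical LLM call; attributes are examples.
    with tracer.start_as_current_span("llm.summarize") as span:
        span.set_attribute("document.chars", len(document))
        summary = document[:200]  # placeholder for the real provider call
        span.set_attribute("summary.chars", len(summary))
        return summary
```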
4. Uptime monitoring — Better Uptime / Pingdom / Healthchecks.io
External GET against /api/health/ every 60s. Alerts on 3 consecutive failures.
Cheap, easy first win — set up before any of the above.
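Until one is wired up, even a trivial script run from a separate host approximates the same behaviour; the URL and alert hook here are placeholders:

```python
# Stopgap uptime checker: poll the health endpoint every 60s, alert on
# 3 consecutive failures. Replace alert() with Slack/PagerDuty/etc.
import time

import requests

HEALTH_URL = "https://greekmanage.example.com/api/health/"  # placeholder URL


def alert(message):
    print(f"ALERT: {message}")


def main():
    failures = 0
    while True:
        try:
            ok = requests.get(HEALTH_URL, timeout=10).status_code == 200
        except requests.RequestException:
            ok = False
        failures = 0 if ok else failures + 1
        if failures == 3:
            alert(f"{HEALTH_URL} failed 3 consecutive checks")
        time.sleep(60)


if __name__ == "__main__":
    main()
```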
5. Log aggregation — Loki / Grafana Cloud Logs
Central place to search across all pods. Set up at the cluster level:
- Promtail → Loki (self-hosted)
- Grafana Cloud Logs (managed)
- AWS CloudWatch Logs (if on AWS)
Tag logs by pod + app for filtering.
6. Frontend RUM — Web Vitals
Capture Core Web Vitals (LCP, CLS, and INP, which replaced FID as a Core Web Vital in 2024) via the web-vitals package and ship them to Sentry / Datadog / your analytics:
```js
// onFID is available in web-vitals v3; it was removed in v4, where INP replaces FID.
import { onCLS, onFID, onINP, onLCP, onTTFB } from "web-vitals";

const send = (metric) => {
  navigator.sendBeacon("/api/web-vitals/", JSON.stringify(metric));
};

onCLS(send); onFID(send); onLCP(send); onINP(send); onTTFB(send);
```
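The /api/web-vitals/ endpoint does not exist yet; if you go this route, the receiving side can start as a plain logging view, roughly like this sketch:

```python
# Sketch of a view to receive web-vitals beacons; route it at /api/web-vitals/.
import json
import logging

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST

logger = logging.getLogger("web_vitals")


@csrf_exempt  # sendBeacon posts without CSRF headers
@require_POST
def web_vitals(request):
    try:
        metric = json.loads(request.body)
    except json.JSONDecodeError:
        return HttpResponse(status=400)
    logger.info(
        "web-vital received",
        extra={"metric_name": metric.get("name"), "metric_value": metric.get("value")},
    )
    return HttpResponse(status=204)
```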
Cost considerations
| Tool | Cost (approx) | Notes |
|---|---|---|
| Sentry | Free up to 5K events / month, then $26+/mo | Most affordable error tracking |
| Better Uptime | Free for 10 monitors | Simplest uptime tool |
| Healthchecks.io | Free for 20 checks | Cron / job monitoring |
| Grafana Cloud | Free tier 10K series, 50GB logs | Easiest hosted Prometheus + Loki |
| Datadog | $15/host/mo + extras | Most expensive but most integrated |
For a typical org running GreekManage in production, Sentry + Better Uptime + self-hosted Prometheus is a reasonable starting baseline (~$30/mo).
Manual debug techniques (until tooling lands)
Tail logs across pods
```bash
kubectl logs -n greekmanage -l app=backend --tail=200 -f --max-log-requests=10
```
Inspect a slow query
Use EXPLAIN ANALYZE in psql:
```bash
kubectl exec -it -n greekmanage postgres-... -- psql -U greekmanage greekmanage
```

```sql
EXPLAIN ANALYZE
SELECT *
FROM members_memberprofile mp
JOIN organizations_membership m ON mp.membership_id = m.id
WHERE m.chapter_id = 'uuid-here';
```
Check Celery queue depth
```bash
kubectl exec -it -n greekmanage redis-... -- redis-cli LLEN celery
```
A growing queue means workers are saturated → scale up celery deployment replicas.
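The same check can be automated (as a periodic task or cron job) so saturation is caught without shelling into the pod; the queue name, env var, and threshold below are assumptions:

```python
# Sketch of an automated queue-depth check using redis-py.
import logging
import os

import redis

logger = logging.getLogger("celery_queue_monitor")
QUEUE_DEPTH_THRESHOLD = 500  # tune to your workload


def check_celery_queue_depth():
    client = redis.Redis.from_url(os.environ["REDIS_URL"])  # assumed env var
    depth = client.llen("celery")  # default Celery queue name
    if depth > QUEUE_DEPTH_THRESHOLD:
        logger.warning("Celery queue depth is %s; workers may be saturated", depth)
    return depth
```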
Find a request in logs by request ID
Backend includes X-Request-Id header in responses. Grep for it across logs:
```bash
kubectl logs -n greekmanage -l app=backend | grep "abc123-..."
```
(Add request_id to every log line via Django middleware — see apps/common/middleware.py.)
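A common shape for that middleware (shown here as an illustrative sketch, not the actual contents of apps/common/middleware.py) is a contextvar plus a logging filter; reference the filter from the LOGGING dict and add %(request_id)s to the formatter:

```python
# Illustrative request-id middleware and logging filter.
import logging
import uuid
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="-")


class RequestIdMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        rid = request.headers.get("X-Request-Id", str(uuid.uuid4()))
        request_id_var.set(rid)
        response = self.get_response(request)
        response["X-Request-Id"] = rid
        return response


class RequestIdFilter(logging.Filter):
    """Attach the current request id to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True
```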
Roadmap (rough priority)
- Sentry integration — biggest immediate value
- Uptime monitoring — cheap, fast
- Prometheus + Grafana — for capacity + perf
- Log aggregation — when scale demands it
- Distributed tracing — when LLM / external API debugging gets hard
- Frontend RUM — when UX perf becomes a focus
Tickets to track each are in the issue tracker (observability label).