Senior AI Systems Engineer — Agentic Workflow Runtime
Software Engineering, Data Science
Toronto, ON, Canada · Remote
Full-time · July 2, 2026
Description
Lydia AI - ORCA
The Opportunity
The Role
Key Responsibilities
- Build and scale production-grade Python backend services
- Improve ORCA’s agent runtime, gateway, connector, and persistence architecture
- Design reliable agent workflows using tool calling, function calling, structured outputs, retries, and fallbacks
- Build retrieval and source-grounding systems for enterprise source materials
- Own PostgreSQL / Supabase schema design, migrations, row-level security (RLS), query performance, and workflow state persistence
- Define clean API and service boundaries across runtime, connectors, frontend surfaces, and downstream tools
- Instrument agent runs with logs, traces, evals, and regression tests
- Improve reliability, latency, cost, debuggability, and production recovery
- Harden tenant isolation, permission boundaries, auditability, and enterprise security
- Mentor engineers and raise the technical bar across the platform
What We’re Looking For
- Production backend engineering, ideally in Python, including async execution, task lifecycle, retries, timeouts, cancellation, background workers, and production debugging
- Agentic AI systems, including LLM tool use, function calling, prompt routing, multi-turn state, context handling, retries, fallbacks, and multi-step workflow execution
- Retrieval and source grounding, including RAG, embeddings, semantic search, source search, context assembly, document grounding, and retrieval quality
- PostgreSQL and data modeling, including schema design, migrations, indexing, query performance, transaction boundaries, and debugging query plans
- API and service architecture, including service contracts, versioning, idempotency, failure handling, and clean boundaries between systems
- Evaluation and observability, including traces, logs, eval sets, golden cases, regression testing, and root-cause analysis for agent failures
- Enterprise security, including tenant isolation, permission boundaries, audit logs, least-privilege access, and data leakage prevention
- Operational debugging, including the ability to reason from logs, database state, service behavior, and imperfect production environments
- LangGraph, LangChain, CrewAI, MCP, or similar orchestration frameworks
- Anthropic, OpenAI, or other tool-use / function-calling model APIs
- Multi-model routing
- Telegram Bot API or chat-based workflow interfaces
- Connector server design
- Document ingestion, structured extraction, or source-pack processing
- Supabase RLS at scale
- Multi-tenant SaaS authorization
- CI/CD, deployment, monitoring, and infrastructure hygiene
- Frontend awareness, especially around surfacing workflow state and agent review states to users
- You have scaled production systems before, not just built prototypes
- You are comfortable owning both architecture and implementation
- You can debug complex failures across services, models, tools, and databases
- You think carefully about workflow state, permissions, and data boundaries
- You understand that agentic systems require engineering discipline, not just better prompts
- You can help a fast-moving product mature into a reliable enterprise platform
