LLMOps Engineer
Thrive Career Wellness Platform
Location: Hybrid – Toronto, ON (317 Adelaide Street West)
About the Role
What You’ll Do
- Lead LLM infrastructure efforts across multiple engineering teams, enabling scalable and secure AI-powered features
- Design, build, and operate production-grade LLM systems, including:
  - Model and prompt versioning
  - A/B testing and evaluation workflows
  - Rollback and deployment strategies
- Partner with the AI team to implement prompt management, prompt versioning, and token optimization strategies
- Monitor and optimize:
  - Inference latency and throughput
  - Caching strategies
  - Multi-provider cost management (OpenAI, Anthropic, AWS Bedrock, etc.)
- Build observability pipelines, including quality metrics, error monitoring, evaluation frameworks, and user feedback loops
- Implement and maintain Retrieval-Augmented Generation (RAG) systems, embedding pipelines, and vector databases
- Support fine-tuning workflows and manage model registries for proprietary and open-source models
- Implement AI safety guardrails, content filtering, and compliance measures
- Contribute ~10% of your time to general DevOps initiatives (CI/CD, cloud infrastructure improvements)
- Maintain clear documentation of LLM infrastructure, workflows, and best practices
The Problem You’ll Solve
What We’re Looking For
- 3+ years of experience in LLMOps, MLOps, or production-focused AI/ML roles
- Strong Python programming skills and experience with LLM frameworks and tooling
- Hands-on experience with LLM providers such as OpenAI, Anthropic, AWS Bedrock, Azure, Vertex, or Databricks
- Experience working with vector databases (Pinecone, Weaviate, Qdrant, Chroma, or similar)
- Familiarity with model serving tools (vLLM, TGI, Ray Serve)
- Strong knowledge of Docker, Kubernetes, and cloud infrastructure (AWS preferred)
- Experience with prompt engineering, token optimization, and LLM evaluation metrics
- Familiarity with LLM observability and experimentation tools (LangSmith, Weights & Biases, Phoenix, MLflow)
- Ability to troubleshoot LLM-specific challenges such as latency, hallucinations, and context window constraints
- Strong communication skills and the ability to collaborate with both technical and non-technical stakeholders
Nice to Have
- Experience with open-source LLMs (Llama, Mistral, etc.)
- Knowledge of advanced RAG techniques (hybrid search, re-ranking)
- Exposure to agent frameworks and real-time LLM applications
- Background in traditional MLOps, data engineering, or multimodal models
- Experience working in Ruby on Rails environments
- Understanding of AI safety, governance, and alignment principles
Our Hiring Process
- Talent Acquisition Screen – 30 minutes
- Take-Home Technical Assignment – 3 days to complete
- Hiring Manager Interview (Ali) – 30–45 minutes
- Live PR / Pairing Session with Staff Engineer – 60 minutes
- Meet the Leadership Team
Life at Thrive
- Fast-paced, high-trust environment with meaningful ownership
- Opportunity to shape Thrive’s AI infrastructure from the ground up
- Strong mentorship and long-term career growth
Total Rewards
- 3 weeks' paid vacation + 1-week holiday shutdown
- Health insurance & wellness coverage
- Annual Learning & Development allowance
- Annual workspace allowance
