LLMOps Engineer
Thrive Career Wellness Platform
Job Summary:
Key Responsibilities:
- Lead LLM infrastructure efforts across multiple engineering teams, ensuring scalable, secure, and efficient delivery of AI-powered features.
- Design, build, and maintain production-grade systems for deploying and managing LLMs, including versioning, A/B testing, and rollback strategies.
- Collaborate with the AI team to implement prompt management systems, prompt versioning, and token optimization strategies.
- Monitor and optimize inference latency, throughput, caching, and costs across multiple providers (OpenAI, Anthropic, AWS Bedrock, etc.).
- Develop observability pipelines including quality metrics, evaluation workflows, error monitoring, and user feedback loops.
- Implement and maintain Retrieval-Augmented Generation (RAG) systems, embedding pipelines, and vector database operations.
- Support fine-tuning workflows and manage model registries for both proprietary and open-source models.
- Implement AI safety guardrails, content filtering, and compliance measures to ensure responsible deployment.
- Spend roughly 10% of your time supporting general DevOps initiatives, including CI/CD improvements and cloud infrastructure updates.
- Maintain thorough documentation of all LLM infrastructure, processes, and best practices.
Business Problem the LLMOps Engineer Will Solve:
Ideal Candidate Profile:
- 3+ years of experience in LLMOps, MLOps, or similar production-focused AI/ML roles.
- Strong Python programming skills and familiarity with LLM libraries and frameworks.
- Hands-on experience with LLM providers (OpenAI, Anthropic, AWS Bedrock, Azure, Vertex, Databricks).
- Experience with vector databases such as Pinecone, Weaviate, Qdrant, or Chroma.
- Knowledge of model serving tools (vLLM, TGI, Ray Serve).
- Proficiency with Docker, Kubernetes, and cloud environments (AWS preferred).
- Familiarity with prompt engineering, token optimization, chain-of-thought approaches, and evaluation metrics.
- Experience with LLM-specific tooling (LangSmith, Weights & Biases, Phoenix, MLflow).
- Ability to troubleshoot LLM issues such as high latency, hallucinations, and context window limits.
- Strong communication skills with both technical and non-technical stakeholders.
- Experience with open-source LLMs (Llama, Mistral, etc.).
- Knowledge of advanced RAG techniques including hybrid search and re-ranking.
- Exposure to agent frameworks and real-time LLM applications.
- Background in traditional MLOps, data engineering, or multimodal models.
- Experience with Ruby on Rails.
- Understanding of AI safety and alignment principles.
Our Hiring Process:
Life at Thrive:
- Fast-paced, high-trust environment with significant ownership.
- Opportunity to shape the foundation of Thrive’s AI infrastructure from day one.
- Strong career progression and mentorship opportunities.
Total Rewards Package:
- 3 weeks of paid vacation plus a 1-week holiday shutdown
- Health insurance & wellness coverage
- Yearly Learning & Development Allowance
- Yearly Workspace Allowance
