Senior Software Developer - DevOps and Site Reliability (Job Req #2025-369)

Ross Video

Software Engineering
Ottawa (Nepean), ON, Canada
Posted on Oct 28, 2025

Job overview:

Why Work at Ross Video?

We have a great group of people working together to create and deliver cutting-edge products that look amazing and are easy to use. We go all out so that our customers can have the best possible experience and achieve quality results. With a product focus, continual learning, results-driven processes, and creative thinking, we constantly strive to improve our solutions and to deliver results.

If you've ever watched live television, news, sports, or entertainment, you've seen our products in use. All of the major Hollywood award shows, most professional sports teams, and many of the largest broadcasters in the world use Ross Video technology. Get behind the scenes and learn about what it takes to make live events possible. If that resonates, and you’re someone with integrity, commitment, and a strong drive to deliver great products, we’d love to hear from you.

We are transforming live media production with innovative cloud and hybrid solutions designed for flexibility, scalability, and security. We’re seeking a Senior DevOps and Site Reliability Developer to contribute to the design, scaling, and operation of our infrastructure, CI/CD pipelines, automation workflows, and cloud operations. You’ll collaborate closely with development teams to ensure reliable software delivery, support Sales and Marketing, and engage with stakeholders to drive customer success. As a senior team member, you’ll lead by example, mentor junior engineers, and help evolve the architectural direction of our DevOps and site reliability efforts.

Who you report to: Manager of Software Development - Cloud, Cloud and Enterprise Management.

What we offer

Ross offers competitive salaries and comprehensive health plans, as well as several perks to help you perform at your best. These perks include flexible hours, generous paid time off, a fitness/wellness allowance, an employee share ownership program, development support, and a ton of fun social activities and events! Best of all, you will be part of the Ross Video family, and we’ve got a pretty energizing environment here.

What the job is all about:

Infrastructure & Automation

  • Participate in design discussions and apply best practices for infrastructure design, provisioning, reliability, and security (including access control, encryption, and compliance).
  • Design, provision, and manage scalable, secure cloud infrastructure using Infrastructure as Code (e.g., Terraform).
  • Identify and implement improvements to enhance infrastructure efficiency, reliability, and security.
  • Define and maintain SLAs, SLOs, and error budgets to align reliability goals with business and operational requirements.
  • Continuously improve system performance, security, and reliability through proactive monitoring and root cause analysis.
  • Ensure accurate, up-to-date documentation of infrastructure designs, automation workflows, and operational procedures.

CI/CD & Deployment

  • Design, build, and maintain CI/CD pipelines using GitLab CI/CD or similar tools.
  • Package and deploy containerized applications using Docker and Kubernetes, managing configurations and releases with Helm.
  • Implement quality gates, automated rollback mechanisms, and environment orchestration to support safe, reliable deployments.
  • Contribute to release management practices, including branching strategies, pipeline versioning, and deployment governance.

Cloud Operations and Reliability

  • Administer and optimize cloud infrastructure for availability, scalability, cost efficiency, and compliance, while ensuring minimal disruption during patches, upgrades, and deployments.
  • Design and enforce reliability metrics (SLOs, SLIs, and error budgets) to guide service health and performance targets.
  • Develop and maintain observability frameworks using logs, metrics, and traces with tools such as Prometheus, Grafana, OpenTelemetry, Jaeger, and Loki.
  • Collaborate with security, development, and product teams to align infrastructure initiatives with business and customer needs.
  • Respond to incidents, perform root cause analysis, and apply corrective actions.
  • Implement monitoring, alerting, backup, and disaster recovery strategies.
  • Ensure regulatory and security compliance across systems and environments.

Who you are:

  • Minimum of 5 years of experience in a DevOps, SRE, or Cloud Engineering role.
  • Hands-on experience with public cloud platforms, especially AWS (e.g., EC2, S3, IAM, VPC, EKS); familiarity with Azure or GCP is a plus.
  • Demonstrated success driving improvements in infrastructure or site reliability in collaboration with cross-functional teams.
  • Proficient in Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, OpenTofu, or Crossplane.
  • Experience with configuration management tools like Ansible.
  • Strong scripting skills in Python, Bash, and PowerShell.
  • Solid understanding of containerization with Docker and orchestration with Kubernetes.
  • Experience managing Kubernetes deployments using Helm; familiarity with GitOps tools like ArgoCD or Flux is a plus.
  • Deep knowledge of CI/CD pipelines using tools like GitLab CI/CD, Jenkins, or GitHub Actions.
  • Experience with version control (Git, GitLab).
  • Familiarity with Agile delivery practices and development workflows.
  • Experience implementing and maintaining monitoring and observability using Prometheus, Grafana, Datadog, or New Relic.
  • Knowledge of incident response, performance tuning, and security best practices.
  • Ability to optimize infrastructure for scalability, availability, and cost-efficiency.
  • Hands-on experience with observability stacks (e.g., Prometheus, Grafana, OpenTelemetry, Loki, Jaeger).
  • Deep understanding of alerting, SLI/SLO development, and telemetry pipelines.
  • Experience defining and tracking SLIs, SLOs, and SLAs to measure and maintain service reliability.
  • Proven ability to manage the incident lifecycle, including escalation procedures, resolution practices, and participation in blameless postmortems.
  • Familiarity with risk assessment techniques, capacity planning, and readiness evaluations for production systems.
  • Experience developing automated failure recovery mechanisms and improving mean time to recovery (MTTR) through proactive engineering practices.
  • Excellent verbal and written communication skills.
  • Experience leading technical discussions, architecture reviews, and incident postmortems.
  • Ability to distill complex technical concepts for diverse audiences.
  • Strong interpersonal skills and experience working in collaborative, cross-functional environments.

Bonus points if you have the following:

  • Experience with SQL and NoSQL databases (e.g., MySQL, PostgreSQL, DynamoDB).
  • Experience with Windows and Linux (Ubuntu) administration.
  • Knowledge of microservices architectures and API gateways.
  • Familiarity with high-availability SaaS environments and production systems.
  • Understanding of networking, cybersecurity, and governance practices.
  • Experience in media, entertainment, or broadcast industries.
  • Experience supporting sales/marketing efforts or responding to RFPs.
  • Degree or diploma in Computer Science, Engineering, or a related field.