Member of Technical Staff, Data Acquisition



Toronto, ON, Canada
Posted on Thursday, July 20, 2023
Who are we?
We’re a small, diverse team working at the cutting edge of machine learning. At Cohere, our mission is to build machines that understand the world and to make them safely accessible to all. Language is at the crux of this, but it can be difficult and expensive to parse the syntax, semantics, and context that all work together to give words meaning. The Cohere platform provides API access to Large Language Models that read billions of web pages and learn to understand the meaning, sentiment, and intent of the words we use with a richness never seen before.
We've recently raised our Series C, and we are focused on bringing our technology to market. We will partner with customers so they can build natural language understanding and generation into their products with just a few lines of code.
We’re ambitious: we believe our technology will fundamentally transform how industries interact with natural language. And we have the technical chops to back it up. Cohere’s CEO, Aidan Gomez, is a co-author of the groundbreaking paper “Attention Is All You Need” and was previously part of Google Brain. Our entire technical team is world-class.
We are focused on creating a diverse and inclusive work environment so that all of our team members can thrive. We welcome kind and brilliant people to our team, from wherever they come.
Why this role?
At Cohere, we strive to continually improve our large language models. Academic research and real-world experience have demonstrated that high-quality, diverse datasets can contribute as much to the performance and capabilities of LLMs as the underlying model architecture and training regimen. We at Cohere believe data will play a central role in accelerating the advancement of our already world-class language models.
Data is therefore critical to our success. Our ability to acquire data that is accurate, relevant, and timely is key to improving the quality of our models. We strive to continuously improve our data acquisition processes and systems to ensure that we have the data we need to stay competitive and meet the needs of our customers. We run frequent experiments to learn more about the role of data in model quality, from data mixtures to cleaning techniques to quality control.
This role will be part of the Data Acquisition team, which broadly provides data for training models and is responsible for building and maintaining the infrastructure that acquires, cleans, and formats data for model training. We are looking for a technically skilled, resourceful problem-solver who is able to work in areas of ambiguity and find efficient and sometimes creative solutions. The main responsibility of this role is to improve our internal data acquisition infrastructure, which includes data crawlers, formatters, and integrations with data providers. The person in this role will also work closely with different teams at Cohere to support their data acquisition needs, as well as engage in more experimental work to develop highly informative data signals.
Please Note: We have offices in Toronto, Palo Alto, and London but embrace being remote-first! There are no restrictions on where you can be located for this role.

As a Member of Technical Staff, Data Acquisition, you will:

  • Develop data pipelines to acquire, prepare, and integrate high-quality datasets into model training and evaluation
  • Collaborate with research and product teams to identify, prioritize, and secure new data sources
  • Enhance and develop infrastructure for data management, pipeline orchestration, and MLOps, while avoiding premature optimization
  • Run experiments using new data inputs, preprocessing techniques, and data mixtures

You may be a good fit if:

  • You have more than 2 years of experience working on a software or machine learning engineering team
  • You are proficient in Python and have used distributed processing technologies such as Spark or Dask
  • You have experience building data pipelines and ETL processes for large-scale datasets
  • You have experience working with unstructured and/or human-annotated data (e.g., collecting or assessing sample quality)
  • You have experience with ML frameworks such as TensorFlow, TF Serving, JAX, and XLA/MLIR
  • You have strong communication and problem-solving skills, preferring the right tool for the job even if it’s outside your wheelhouse
  • You feel comfortable actively reading academic literature and researching state-of-the-art NLP best practices
  • You have a demonstrated passion for applied NLP models and products
If some of the above doesn’t line up perfectly with your experience, we still encourage you to apply! If you consider yourself a thoughtful worker, a lifelong learner, and a kind and playful team member, Cohere is the place for you.
We value and celebrate diversity and strive to create an inclusive work environment for all. We welcome applicants of all kinds and are committed to providing an equal opportunity process. Cohere provides accessibility accommodations during the recruitment process. Should you require any accommodation, please let us know and we will work with you to meet your needs.
Our Perks:
🤝 An open and inclusive culture and work environment
🧑‍💻 Work closely with a team on the cutting edge of AI research
🍽 Free daily lunch
🦷 Full health and dental benefits, including a separate budget to take care of your mental health
🐣 100% Parental Leave top-up for 6 months for employees based in Canada, the US, and the UK
🎨 Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
🏙 Remote-flexible, with offices in Toronto, Palo Alto, San Francisco, and London, plus a co-working stipend
✈️ 6 weeks of vacation
Note: This post is co-authored by both Cohere humans and Cohere technology.