Role Overview
We are looking for an AI Applied Data Scientist to contribute to the development of high-quality, diverse, and ethically sourced datasets for training and evaluating generative AI models. You will work hands-on with large language models (LLMs), diffusion frameworks, and other generative architectures to create scalable pipelines for synthetic and real data processing.
This role suits a candidate with solid applied AI experience who is comfortable taking ownership of technical components, collaborating closely with senior researchers and engineers, and contributing to innovation in multi-modal dataset creation and governance.
This position can also be offered as an internship or part-time opportunity for candidates with strong research or technical backgrounds seeking to develop applied experience in generative AI and data science.
Key Responsibilities
Model Research and Evaluation
- Research and evaluate open-source LLMs and generative models (e.g., diffusion models, audio synthesis models, and video generation frameworks) to identify suitable tools for multi-modal synthetic dataset creation.
- Benchmark candidate models and report findings on performance, output quality, and scalability.
Data Generation and Pipeline Development
- Develop and maintain scalable data generation pipelines for large-scale dataset synthesis using GPU-accelerated frameworks and tooling (e.g., PyTorch, TensorFlow, CUDA).
- Support automation, testing, and optimisation of data generation workflows.
Prompt Engineering and Dataset Diversity
- Design and refine prompts and conditioning strategies to ensure demographic, linguistic, and regional diversity in generated datasets.
- Analyse outputs to identify and reduce representational bias.
Data Management and Compliance
- Contribute to the architecture of secure and compliant data pipelines, following UK GDPR, ISO/IEC 27001, and internal governance standards.
- Implement and maintain labelling, data cleaning, and validation workflows for both synthetic and real datasets.
- Ethically source real-world data from open-licence repositories, verifying data provenance and licence terms.
Documentation and Collaboration
- Produce clear technical documentation describing dataset generation logic, configuration parameters, and data lineage.
- Collaborate closely with AI researchers, ML engineers, and data governance specialists to align dataset design with model training objectives.
- Contribute to internal discussions and experimentation on generative data quality and diversity.
Qualifications and Experience
Essential:
- Bachelor’s or Master’s degree in Computer Science, AI, Data Science, or a related field.
- Proven experience working with LLMs and generative AI models (e.g., Stable Diffusion, Mistral, or Llama).
- Proficiency in Python and common ML frameworks such as PyTorch, TensorFlow, or JAX.
- Hands-on experience developing or maintaining GPU-accelerated pipelines for AI or data workflows.
- Understanding of data governance and privacy requirements under UK GDPR.
- Strong analytical, problem-solving, and documentation skills.
Desirable:
- Experience handling multi-modal data (text, audio, image, video).
- Familiarity with MLOps tools (Docker, Airflow, MLflow, or Kubernetes).
- Understanding of data lineage tracking, bias mitigation, or fairness evaluation.
- Awareness of ethical AI and responsible data sourcing principles.
What You’ll Gain
- Experience working with state-of-the-art LLMs and generative models in a research-driven environment.
- Opportunities to collaborate with leading AI researchers and contribute to multi-modal data innovation.
- Training and mentoring in ethical data science, data governance, and scalable pipeline engineering.
- Flexible or hybrid work options and a supportive, growth-oriented culture.
How to Apply
Please send your CV, portfolio or GitHub profile, and a short cover letter outlining your relevant experience to info@datambit.com.