Simon Thelin

Multimodal ML Platform  ·  Data Engineer  ·  AI Creator

Building the data infrastructure that trains the next generation of multimodal AI, at Synthesia and beyond. Over a decade in the data and ML space.

Data Architecture End To End · Lakehouse Architecture · DataOps · RAG & MCP
Industry experience: 10+ years
Infra costs cut: successfully reducing infrastructure cost across Data & AI projects
Query latency reduction: 98%
Global R&D: serving researchers across countries · Synthesia
About

End-to-end ML Platform & Data Engineer.
AI creator.
DataOps fundamentalist.

10+ years building scalable, governed data systems across AWS and GCP — Lakehouse architecture, DataOps pipelines, and production-grade ML infrastructure. Currently Tech Lead at Synthesia, serving R&D researchers across multiple countries as part of a global AI video platform.

DataPains is where I share what I've learned — practical, no-fluff content on data engineering, AI tooling, and the platforms that actually matter in production. Conference speaker at Big Data London and DataNova 2023. Featured on the Data Team Success podcast.

Python · Terraform · AWS · GCP · dbt · Airbyte · Trino · PySpark · Kafka · K8s · Delta Lake · Airflow · ArgoCD · Docker · LLMs · AI Video
Content

DataPains on YouTube

Blog

Latest Writing

All posts on Medium →
Experience

Career

Jul 2024 — Present
Tech Lead — DataOps within ML Platform
Synthesia · London

Owns DataOps strategy and execution, enabling multimodal ML training (audio + video) across AWS and GCP. Delivered a self-service ML training framework with full data lineage and provenance tracking. Architected a Lakehouse platform (Delta + Trino) with federated queries across distributed data domains. Migrated storage from WEKA to S3, reducing infrastructure spend. Led the human annotation evaluation platform: A/B and MUSHRA evaluation frameworks with structured feedback loops for continuous assessment of inference quality. Exploring RAG/TAG (Tool Augmented Generation) for MCP-driven dataset and evaluation workflows.

AWS · GCP · Delta Lake · Trino · MLflow · Airflow · Kubeflow · Argo Workflows · ArgoCD · Spark · PyTorch · Terraform · K8s · Python
May 2022 — Jul 2024
Lead Data Engineer
7bridges · London

Cut compute and storage costs by 50%. Reduced query latency by 98% via a Trino-based Lakehouse architecture on GCP. Championed the build of an end-to-end DataOps platform across batch and streaming. Conference speaker at Big Data London and featured in Starburst's DataNova 2023 success story.

GCP · Trino · Delta Lake · dbt · Airbyte · Spark · Kafka · Airflow · K8s
Dec 2021 — May 2022
Lead Data Engineer (Consultant)
Kainos · London

Contracted to provide data platform architecture guidance and hands-on technical leadership to enterprise clients.

AWS · Python · Terraform
Aug 2020 — Dec 2021
Data Engineer
IMG Arena · London

Built real-time sports data pipelines processing millions of events across streaming and batch. Designed a Lakehouse architecture on Apache Druid. Used PySpark, Kafka, and AWS-native services to power live sports analytics products.

PySpark · Kafka · AWS · Apache Druid · Airflow
Dec 2019 — Jun 2020
Data Engineer
GameAnalytics · London

Improved the PySpark/Hive data pipelines feeding BigQuery. Built the company's data science platform. Introduced Apache Airflow from scratch and established data engineering best practices.

PySpark · BigQuery · GCP · Airflow · Python
Speaking & Press

On stage & on the record

Contact

Let's connect