DataPains on YouTube
@DataPains
ML PLATFORM · DATA ENGINEERING · AI · TOOLS THAT MATTER IN PRODUCTION
Master Apache Spark on Kubernetes and Beyond!
Unlocking the Power of ArgoCD: A Game Changer for Kubernetes Users
Multimodal ML Platform · Data Engineer · AI Creator
Building the data infrastructure that trains the next generation of multimodal AI, at Synthesia and beyond. Over a decade in the data and ML space.
10+ years building scalable, governed data systems across AWS and GCP — Lakehouse architecture, DataOps pipelines, and production-grade ML infrastructure. Currently Tech Lead at Synthesia, serving R&D researchers across multiple countries as part of a global AI video platform.
DataPains is where I share what I've learned — practical, no-fluff content on data engineering, AI tooling, and the platforms that actually matter in production. Conference speaker at Big Data London and DataNova 2023. Featured on the Data Team Success podcast.
MEDIUM
A look at whether F3's table format improvements actually address the deeper challenge of unifying vector and structured data in modern AI stacks.
Read on Medium →
MEDIUM
How to think about cold, inactive data in a Lakehouse — lifecycle policies, tiering strategies, and the cost implications of keeping everything hot.
Read on Medium →
MEDIUM
A mental model for structuring data transformations and semantic layers — where dbt fits, where it doesn't, and how to draw the right boundaries.
Read on Medium →
Owns DataOps strategy and execution enabling multimodal ML training (audio + video) across AWS and GCP. Delivered self-service ML training framework with full data lineage and provenance tracking. Architected Lakehouse platform (Delta + Trino) with federated query across distributed data domains. Migrated WEKA to S3, reducing infrastructure spend. Led human annotation data evaluation platform: A/B and MUSHRA evaluation frameworks with structured feedback loop patterns for continuous inference quality assessment. Exploring RAG/TAG (Tool Augmented Generation) for MCP-driven dataset and evaluation workflows.
50% reduction in compute and storage costs. 98% query latency reduction via Trino-based Lakehouse architecture on GCP. Championed end-to-end DataOps platform build across batch and streaming. Conference speaker at Big Data London and featured in Starburst's DataNova 2023 success story.
Contracted to provide data platform architecture guidance and hands-on technical leadership to enterprise clients.
Built real-time sports data pipelines processing millions of events across streaming and batch. Designed Lakehouse architecture on Apache Druid. PySpark, Kafka, and AWS-native services powering live sports analytics products.
Improved the PySpark/Hive data pipelines that load into BigQuery. Built the company's data science platform. Introduced Apache Airflow from scratch and established data engineering best practices.
Live conference talk on Lakehouse architecture and DataOps in production — one of Europe's largest data engineering events.
Watch on YouTube →
Building a data analytics platform with a Lakehouse at 7bridges, featured by Starburst as a DataNova success story.
Read the story →
In conversation with Ross Webb on building high-performing data teams, ML Platform strategy, DataOps culture, and the realities of end-to-end data engineering at scale.
Listen to the episode →
Published in IOP Conference Series: Materials Science and Engineering. Early application of machine learning to real-world sensor data for indirect tyre pressure monitoring.
View on Google Scholar →