How to build a data pipeline without data: Synthetic data generation and testing with Python

Speaker: Ruan Pretorius

Track: Testing

Type: Talk

Room: Main Hall

Time: Oct 05 (Thu): 11:45

Duration: 0:45

Data pipelines are essential for transforming, validating, and loading data from various sources into a target database or data warehouse. However, building and testing data pipelines can be challenging when the real data is not available, whether because of privacy restrictions, technical limitations, or simply because the data has not yet been collected. How can we ensure that our data pipelines are robust and reliable without access to the actual data?

In this talk, we will share our experience of creating synthetic data to test data pipelines using Python. We will demonstrate how we used statistical methods and Python packages such as Faker and SDV (Synthetic Data Vault) to generate realistic synthetic data for different use cases, such as customer profiles, transactions, and time series. We will also show how we used Flyway to load the synthetic data into a Postgres database and perform repeatable deployments. We will discuss the benefits and challenges of using synthetic data for testing data pipelines, as well as best practices and tips for creating and using synthetic data effectively.
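
To give a flavour of the idea, here is a minimal sketch of seeded synthetic data generation for customer profiles and transactions. It is not the speaker's code: it uses only the Python standard library as a stand-in for packages such as Faker and SDV, and every name in it (make_customers, make_transactions, the name pools, the field names, the log-normal parameters) is a hypothetical choice made for illustration.

```python
import random
from datetime import date, timedelta

# Hypothetical value pools; a library such as Faker would supply richer ones.
FIRST_NAMES = ["Thandi", "Sipho", "Anna", "Pieter", "Lerato"]
LAST_NAMES = ["Nkosi", "Botha", "Naidoo", "Smith", "Dlamini"]


def make_customers(n, rng):
    """Return n synthetic customer profiles as dictionaries."""
    customers = []
    for customer_id in range(1, n + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        customers.append({
            "customer_id": customer_id,
            "name": f"{first} {last}",
            # The id keeps addresses unique even when names repeat.
            "email": f"{first}.{last}{customer_id}@example.com".lower(),
            "signup_date": date(2023, 1, 1) + timedelta(days=rng.randrange(365)),
        })
    return customers


def make_transactions(customers, per_customer, rng):
    """Return synthetic transactions dated on or after each customer's signup."""
    transactions = []
    for customer in customers:
        for _ in range(per_customer):
            transactions.append({
                "transaction_id": len(transactions) + 1,
                "customer_id": customer["customer_id"],
                # Log-normal draws give right-skewed amounts (assumed shape).
                "amount": round(rng.lognormvariate(4.0, 1.0), 2),
                "date": customer["signup_date"] + timedelta(days=rng.randrange(30)),
            })
    return transactions


rng = random.Random(42)  # fixed seed so every test run sees the same data
customers = make_customers(5, rng)
transactions = make_transactions(customers, 3, rng)
```

Seeding the generator makes the data reproducible, so a pipeline test that fails can be rerun against exactly the same rows. Generating transactions from the customer list also keeps foreign keys and date ordering consistent, which is the kind of constraint a pipeline's validation step would normally check.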

This talk is aimed at intermediate-level Python developers who are interested in learning more about synthetic data generation and testing techniques for data pipelines. The talk will include code examples and live demos of the tools and methods we used. By the end of this talk, you will have a better understanding of how to build a data pipeline without data using Python.
