What is LakeFS?
LakeFS is an open-source platform that adds Git-style version control to data lakes. It lets you treat large collections of files (like those stored in cloud storage) as if they were a code repository, so you can create branches, track changes, and roll back data safely.
Let's break it down
- Open-source: Free to use and you can see or modify the code.
- Platform: A set of tools you run on your own computers or cloud.
- Git-style version control: Works like the system developers use for code, letting you save snapshots, create separate “branches” for experiments, and merge changes.
- Data lake: A big storage area (often in S3, Azure Blob, etc.) where raw data files are kept.
- Treat as a repository: You can name a specific state of the data, go back to it, or compare it with other states, just like with source code.
Why does it matter?
Because data is constantly changing, mistakes or bad updates can corrupt analytics and machine-learning models. LakeFS gives teams a safety net to experiment, collaborate, and recover data without costly downtime or data loss.
Where is it used?
- A retail company creates a separate branch of its sales data to test a new pricing algorithm before applying it to the production dataset.
- A biotech research team snapshots raw genomic files, runs different preprocessing pipelines on each branch, and later merges the best results.
- An advertising platform uses LakeFS to roll back a corrupted data ingestion job, restoring the previous clean version in minutes.
- A financial services firm audits data changes by comparing branches, ensuring regulatory compliance.
Good things about it
- Provides full data versioning without moving data out of the lake.
- Enables safe “branch-and-merge” workflows for data scientists and analysts.
- Works with existing storage (S3, GCS, Azure) and query engines (Spark, Presto, Trino).
- Open-source community offers plugins and integrations.
- Reduces risk of accidental data loss and speeds up recovery.
Not-so-good things
- Adds operational complexity; you need to manage an extra service layer.
- Performance overhead can appear on very large metadata catalogs.
- Learning curve for teams unfamiliar with Git concepts applied to data.
- Some advanced features (e.g., fine-grained access control) may require the commercial edition.