AWS has published a technical guide showing how to implement end-to-end lineage for machine learning models using DVC (Data Version Control), Amazon SageMaker AI, and Amazon SageMaker AI MLflow Apps. The approach combines dataset-level and record-level tracking to document every step from data ingestion to model deployment. Two companion notebooks are provided that can be run directly in an AWS account.
The first pattern focuses on dataset-level lineage, capturing metadata about datasets used in training. It logs versioning, preprocessing steps, and source files. The second pattern handles record-level lineage, tracking individual data points through transformations and model predictions. Both patterns run within SageMaker pipelines and store results in MLflow for visualization.
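The difference between the two patterns can be illustrated with a small sketch. This is not code from the AWS notebooks; it uses only the Python standard library, and the function names `record_id` and `dataset_id`, the sample records, and the hashing scheme are all assumptions chosen for illustration. Dataset-level lineage is represented by a single fingerprint for the whole dataset, record-level lineage by one fingerprint per data point.

```python
import hashlib
import json
from typing import Dict, List


def record_id(record: Dict) -> str:
    """Stable fingerprint for one record (hypothetical scheme: SHA-256 of canonical JSON)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dataset_id(records: List[Dict]) -> str:
    """Fingerprint for the whole dataset, independent of record order."""
    digest = hashlib.sha256()
    for rid in sorted(record_id(r) for r in records):
        digest.update(rid.encode("ascii"))
    return digest.hexdigest()


# Made-up sample data, for illustration only.
records = [
    {"customer": "a-001", "amount": 12.5},
    {"customer": "a-002", "amount": 99.0},
]

# Dataset-level lineage: one entry describing the dataset as a whole.
dataset_lineage = {"dataset_id": dataset_id(records), "num_records": len(records)}

# Record-level lineage: one entry per data point, linked back to its dataset.
record_lineage = [
    {"record_id": record_id(r), "dataset_id": dataset_lineage["dataset_id"]}
    for r in records
]
```

In a real pipeline these values would be written to a tracking store rather than kept in memory; the sketch only shows the granularity each pattern works at.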
AWS states the system ensures reproducibility by linking each model version to its exact dataset snapshot. This addresses a common problem in ML workflows: training data that changes without any record of the change. The notebooks include sample datasets and preconfigured SageMaker environments to simplify setup.
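As a rough illustration of linking a model version to a dataset snapshot, the sketch below keeps a small in-memory registry. It is an assumption-laden stand-in, not the AWS implementation: `LineageEntry`, `register`, `verify`, the model version names, and the commit identifiers are all hypothetical, and a real setup would store these values in a tracking server (for example as MLflow run tags) rather than in a Python dictionary.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LineageEntry:
    model_version: str
    data_commit: str   # hypothetical: Git commit that pins the DVC-tracked data
    data_sha256: str   # content hash of the dataset snapshot used for training


def fingerprint(data: bytes) -> str:
    """Content hash of a dataset snapshot."""
    return hashlib.sha256(data).hexdigest()


# In-memory stand-in for a tracking store (assumption for this sketch).
registry: Dict[str, LineageEntry] = {}


def register(model_version: str, data_commit: str, data: bytes) -> LineageEntry:
    """Record which snapshot a model version was trained on."""
    entry = LineageEntry(model_version, data_commit, fingerprint(data))
    registry[model_version] = entry
    return entry


def verify(model_version: str, data: bytes) -> bool:
    """Check that `data` is the exact snapshot recorded for this model version."""
    return registry[model_version].data_sha256 == fingerprint(data)


# Made-up snapshots and commit identifiers, for illustration only.
snapshot_v1 = b"id,label\n1,0\n2,1\n"
snapshot_v2 = b"id,label\n1,0\n2,1\n3,1\n"
register("model-1", "abc1234", snapshot_v1)
register("model-2", "def5678", snapshot_v2)
```

Calling `verify("model-1", snapshot_v2)` returns `False`, which is the kind of undocumented data change the lineage record is meant to expose.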
According to the blog post, the integration reduces debugging time by providing a clear audit trail. Teams can trace model decisions back to specific data records or dataset versions. The solution targets data scientists and ML engineers working in regulated industries or research settings where traceability is critical.
The companion notebooks are available in the AWS Samples GitHub repository. AWS recommends reviewing the documentation before deployment to adjust permissions and resource settings.
Source: aws.amazon.com