Amazon Web Services (AWS) has introduced an integration between Amazon SageMaker Unified Studio and Amazon S3 general purpose buckets that simplifies the use of unstructured data for machine learning (ML) and data analytics. With this integration, teams can fine-tune large language models (LLMs) on unstructured datasets stored in S3. A recent demonstration fine-tunes the Llama 3.2 11B Vision Instruct model for visual question answering (VQA) using SageMaker Unified Studio and S3.
The approach uses the DocVQA dataset from Hugging Face, which includes 39,500 training samples, each comprising an image, a question, and the expected answers. The base Llama 3.2 11B Vision Instruct model achieves an Average Normalized Levenshtein Similarity (ANLS) score of 85.3% on this dataset; ANLS measures how closely a predicted answer matches the reference answers in terms of edit distance. To improve on this baseline, three fine-tuned variants were trained on subsets of 1,000, 5,000, and 10,000 images, with experiments and evaluation tracked in SageMaker’s serverless MLflow.
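To make the ANLS metric concrete, here is a minimal Python sketch of how it is commonly defined for DocVQA-style evaluation. The 0.5 threshold and the lowercasing and whitespace stripping are assumptions based on the usual definition of the metric, not details taken from the AWS demonstration, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average, over questions, of the best thresholded similarity to any reference.

    predictions: list of predicted answer strings.
    references:  list of lists of acceptable answers, one list per question.
    threshold:   assumed cutoff; answers at or beyond this normalized distance score 0.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        pred_norm = pred.strip().lower()
        best = 0.0
        for ref in refs:
            ref_norm = ref.strip().lower()
            longest = max(len(pred_norm), len(ref_norm))
            # Normalized Levenshtein distance in [0, 1]
            nl = levenshtein(pred_norm, ref_norm) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under this definition, a near miss such as "sitten" for "kitten" still earns partial credit, while an answer that differs in half or more of its characters scores zero.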
The workflow runs in four stages orchestrated within SageMaker Unified Studio: data ingestion, preprocessing, model training, and evaluation. It begins with configuring an IAM role that grants read access to the S3 bucket containing the raw DocVQA data. A data producer project catalogs and enriches the dataset, then publishes it to SageMaker Catalog. A data consumer project then subscribes to the dataset, preprocesses it into training subsets, and fine-tunes the LLM on each subset.
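The read-access step can be illustrated with a small Python sketch that builds a read-only IAM policy document for one bucket. This is an assumption about what such a policy might look like, not the policy used in the demonstration: the bucket name and prefix are hypothetical, and a real setup may need further permissions (for example a trust policy for the role, or KMS access for encrypted buckets).

```python
import json


def build_read_only_policy(bucket: str, prefix: str = "") -> dict:
    """Return an IAM policy document allowing list and read access to one bucket.

    bucket and prefix are hypothetical placeholders supplied by the caller.
    """
    prefix = prefix.strip("/")
    object_arn = (f"arn:aws:s3:::{bucket}/{prefix}/*" if prefix
                  else f"arn:aws:s3:::{bucket}/*")

    # s3:ListBucket applies to the bucket itself, not to the objects in it.
    list_statement = {
        "Sid": "ListBucket",
        "Effect": "Allow",
        "Action": ["s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{bucket}"],
    }
    if prefix:
        # Restrict listing to keys under the given prefix.
        list_statement["Condition"] = {
            "StringLike": {"s3:prefix": [f"{prefix}/*"]}
        }

    # s3:GetObject applies to object ARNs under the bucket (and prefix, if any).
    read_statement = {
        "Sid": "ReadObjects",
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": [object_arn],
    }
    return {"Version": "2012-10-17", "Statement": [list_statement, read_statement]}


if __name__ == "__main__":
    # "example-docvqa-raw" and "docvqa" are made-up names for illustration.
    print(json.dumps(build_read_only_policy("example-docvqa-raw", "docvqa"), indent=2))
```

Splitting the list and read permissions into two statements reflects how S3 scopes them: listing is granted on the bucket ARN, while object reads are granted on the object ARNs beneath it.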
Prerequisites for implementing this solution include setting up an AWS account, creating SageMaker Unified Studio domains and projects for data producers and consumers, and ensuring access to a SageMaker-managed serverless MLflow application. Additionally, a service quota increase for p4de.24xlarge compute resources is required to support training workloads.
This architecture shows how organizations can use unstructured data, from customer support logs to financial records, in ML applications. The accompanying Jupyter notebook and code are publicly available on GitHub and serve as a practical guide for replicating the fine-tuning process with SageMaker and Amazon S3.
Read more: aws.amazon.com