Building a Scalable Data Science Environment with AWS ML Services - Part 1

As machine learning (ML) becomes a critical driver of innovation, organizations are increasingly looking for robust, scalable environments to support data scientists in their experimentation, model training, and deployment workflows. AWS offers a comprehensive suite of tools and services, led by Amazon SageMaker, to build such environments while addressing challenges like infrastructure management, cost optimization, and operational efficiency. This blog explores how to create an advanced data science environment using AWS ML services, focusing on Amazon SageMaker's capabilities.

SageMaker Overview

Amazon SageMaker offers ML functionalities that cover the entire ML lifecycle, from initial experimentation to production deployment and ongoing monitoring. It supports the complete data science journey for different personas, including data scientists, data analysts, and MLOps engineers.

SageMaker Capabilities
  • Data Scientists can leverage Studio notebooks for model development, Data Wrangler for visual data preparation, and the Processing service for large-scale data transformation. They also have access to the Training and Tuning services for model optimization and the Hosting service for deployment, enabling them to manage tasks from data preparation to model integration testing.
  • Data Analysts benefit from SageMaker Canvas, a no-code model-building interface, allowing them to train models effortlessly. They can also use Studio notebooks for lightweight data analysis and processing.
  • MLOps Engineers focus on managing and automating ML workflows. They utilize SageMaker Pipelines, Model Registry, and endpoint monitoring for governance and workflow automation. Additionally, they configure infrastructure for processing, training, and hosting, ensuring seamless interactive and automated operations.

Data Science Environment Architecture

Data scientists rely on specialized environments to experiment with diverse datasets and algorithms. These environments should include key tools such as Jupyter Notebook for coding and execution, data processing engines for large-scale processing and feature engineering, and model training services for scalable training. Effective experimentation requires utilities for tracking and managing runs, allowing researchers to organize their work efficiently. Additionally, a code repository and a Docker container repository are essential for storing artifacts like source code and Docker images.

Data science environment architecture

Onboarding SageMaker Users:

SageMaker Studio serves as the primary IDE for data scientists, offering a unified interface to access SageMaker’s capabilities. It features hosted notebooks for experimentation and integrates with backend services for data wrangling, model training, and hosting. Data scientists can interact with these functionalities directly through the interface or programmatically via the SageMaker Python SDK in notebooks or scripts. The following diagram illustrates its key components.

SageMaker Studio Architecture
  • SageMaker Studio Domains: Segregate user environments by grouping user profiles with specific configurations, including IAM roles, tags, and Canvas access permissions.
  • User Profiles: Allow access to different Studio applications like JupyterLab, Code Editor, and Canvas within a domain.
  • Studio Spaces: Required for running JupyterLab and Code Editor, managing storage and resource needs. Each space is dedicated to a single application and includes a storage volume, application type, and base image. Spaces can be private or shared.
  • Onboarding Process:
    • Starts with creating a SageMaker domain (if not already existing).
    • Single-user scenario: Quick setup configures a domain with default settings for fast deployment.
    • Multi-user scenario: Advanced setup enables enterprise-level customization, including authentication, service access, networking, and encryption settings.
    • After domain setup, user profiles are created and assigned access.
    • Users launch SageMaker Studio by selecting their profile within a domain to begin working.
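
Onboarding can also be scripted. Below is a minimal sketch using the boto3 SageMaker client; the domain name, VPC and subnet IDs, role ARN, and profile name are hypothetical placeholders to replace with your own values.

```python
import boto3

sm = boto3.client("sagemaker")

# Create a Studio domain (one-time setup). All identifiers below are
# hypothetical placeholders.
domain = sm.create_domain(
    DomainName="data-science-domain",
    AuthMode="IAM",
    DefaultUserSettings={
        "ExecutionRole": "arn:aws:iam::111122223333:role/SageMakerExecutionRole"
    },
    VpcId="vpc-0abc123",
    SubnetIds=["subnet-0abc123"],
)
domain_id = domain["DomainArn"].split("/")[-1]  # the domain ID is the ARN suffix

# Add a user profile so a data scientist can launch Studio in this domain.
sm.create_user_profile(
    DomainId=domain_id,
    UserProfileName="data-scientist-1",
)
```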

Preparing Data with Data Wrangler:

SageMaker Data Wrangler is a fully managed service designed to assist data scientists and engineers in preparing and analyzing data for ML. Its graphical interface simplifies tasks such as data cleaning, feature engineering, selection, and visualization.

Key Features of Data Wrangler:

  • Data Flow Construction: Users create a pipeline that connects datasets, transformations, and analysis steps. Each data flow is tied to a dedicated EC2 instance for execution.
  • Data Import: Supports multiple sources, including Amazon S3, Athena, Redshift, EMR, Databricks, Snowflake, and SaaS platforms like Datadog, GitHub, and Stripe.
  • Data Preparation & Exploration: Provides tools for cleaning, feature engineering, and prebuilt visualization templates to analyze data and detect outliers.

SageMaker Data Wrangler

Using Data Wrangler begins with importing sample data from various sources for processing and transformation. Once transformations are applied, users can:

  • Export a Recipe: This can be run by SageMaker Processing to apply transformations to the full dataset.
  • Export to a Notebook: Generates a notebook file that can initiate a SageMaker Processing job for data transformation.
  • Store Processed Data: The output can be saved in SageMaker Feature Store or Amazon S3 for further analysis.
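
As a rough illustration of what such an exported notebook drives under the hood, the sketch below launches a script-based SageMaker Processing job with the SageMaker Python SDK; the role ARN, S3 paths, and transform.py script are assumptions, and the code Data Wrangler actually generates will differ.

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

# Hypothetical role ARN; replace with your own execution role.
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Run a transformation script against the full dataset in S3.
# The S3 paths and script name are placeholders.
processor.run(
    code="transform.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)
```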

Creating, storing & sharing features:

When preparing training data, data scientists often need to reuse the same features across different model tasks. Additionally, using the same features for both training and inference helps minimize training-serving skew.

Amazon SageMaker Feature Store provides a centralized solution for managing and sharing ML features. It enables feature discovery and reuse by storing features along with their metadata. The service includes both an online store for low-latency, real-time inference and an offline store for training and batch inference. Its architecture resembles open-source alternatives like Feast but stands out as a fully managed solution.

SageMaker Feature Store

The process begins with reading and processing raw data before ingestion into Feature Store. Data can be streamed into both the online and offline stores or ingested directly into the offline store via batch processing.

SageMaker Feature Store organizes data using FeatureGroups, which define the schema and metadata of stored records. A FeatureGroup functions like a table, where each column represents a feature and each row is a record identified by a unique record identifier and an event timestamp.

The online store is optimized for real-time inference, enabling low-latency reads and high-throughput writes, making it ideal for scenarios requiring quick feature retrieval. In contrast, the offline store supports batch processing and model training, maintaining an append-only structure for storing historical feature data. It is particularly valuable for feature exploration and model training.
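
A minimal end-to-end sketch with the SageMaker Python SDK follows, assuming a small pandas DataFrame of features; the feature group name, S3 URI, and role ARN are hypothetical placeholders.

```python
import time

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# A toy feature table: "customer_id" is the record identifier and
# "event_time" (unix seconds) is required by Feature Store.
df = pd.DataFrame({
    "customer_id": [1, 2],
    "avg_order_value": [52.3, 17.8],
    "event_time": [1700000000.0, 1700000000.0],
})

fg = FeatureGroup(name="customer-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)  # infer the schema from the DataFrame

# The S3 URI and role ARN are hypothetical placeholders.
fg.create(
    s3_uri="s3://my-bucket/feature-store/",
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    enable_online_store=True,  # write to both the online and offline stores
)

# create() is asynchronous; wait until the group is ready before ingesting.
while fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)

fg.ingest(data_frame=df, max_workers=2, wait=True)
```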

Training ML models:

Once training data is prepared, data scientists can use the SageMaker Training service to train models that require specialized infrastructure and large-scale distributed training. Training data is typically stored in Amazon S3, Amazon EFS, or Amazon FSx based on specific needs such as cost, latency, and throughput.

  • SageMaker supports three input modes for ingesting data from S3:
    • File mode: Downloads the full dataset to the local training instance before starting.
    • Pipe mode: Streams data from S3 for faster start times and reduced storage requirements.
    • Fast file mode: Streams data on demand without waiting for the entire dataset to download, simplifying the process and minimizing storage needs.

For high-throughput, low-latency requirements, Amazon FSx for Lustre provides faster file retrieval and is mounted directly to the training instance. EFS offers a similar service at a lower cost, but with higher latency and lower throughput.

Once the data is stored, you can initiate a training job using the AWS Boto3 SDK or the SageMaker Python SDK, providing configuration details such as the Docker image URI, script location, dataset location, and infrastructure setup (see the sketch below). All training jobs run inside containers on SageMaker's managed training infrastructure.

SageMaker Training Architecture
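
As a minimal sketch with the SageMaker Python SDK, the example below trains with the managed PyTorch container and streams data with fast file mode; the role ARN, script, S3 paths, framework version, and instance type are all assumptions to adapt.

```python
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch

# Hypothetical role ARN, script, and hyperparameters; replace with your own.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    framework_version="2.1",
    py_version="py310",
    instance_type="ml.g5.xlarge",
    instance_count=1,
    hyperparameters={"epochs": 10, "lr": 1e-3},
)

# Stream training data on demand instead of downloading it all up front.
train_input = TrainingInput(s3_data="s3://my-bucket/train/",
                            input_mode="FastFile")

estimator.fit({"train": train_input})
```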

SageMaker offers various managed containers for model training, including built-in algorithms for tasks like computer vision, NLP, and tabular regression. These algorithms only require the training data location to start. Additionally, SageMaker provides managed framework containers for popular libraries like scikit-learn, TensorFlow, and PyTorch, which also require a training script along with data sources and infrastructure specifications.

For custom needs, you can use your own custom container, which includes the training scripts and necessary dependencies.

SageMaker automatically tracks all training jobs, storing metadata such as the algorithm, input datasets, hyperparameters, and output locations. It also sends system and algorithm metrics to Amazon CloudWatch for monitoring, while training logs are available in CloudWatch Logs for further analysis and reproducibility. This metadata can be retrieved after the fact, as in the sketch below.
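
For example, a job's stored metadata can be pulled back with boto3 (the job name below is a placeholder):

```python
import boto3

sm = boto3.client("sagemaker")

# "my-training-job" is a placeholder; use the name of a real job.
job = sm.describe_training_job(TrainingJobName="my-training-job")
print(job["TrainingJobStatus"])                      # e.g. "Completed"
print(job["HyperParameters"])                        # recorded hyperparameters
print(job["ModelArtifacts"]["S3ModelArtifacts"])     # output model location
```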

Tuning ML models:

To optimize model performance, the SageMaker Tuning service automates the process of hyperparameter tuning for model training. It supports four hyperparameter tuning strategies:

  • Grid Search: Exhaustively evaluates every combination of values from a predefined hyperparameter grid. It's thorough but inefficient, especially as the number of parameters grows.
  • Random Search: Randomly samples hyperparameter values from defined distributions. It’s more efficient than grid search but may not always find the best combination.
  • Bayesian Search: Uses previous training results to predict the next best set of hyperparameters, making it more efficient than random search by focusing on promising areas.
  • Hyperband: Combines bandit algorithms and successive halving to efficiently allocate resources, evaluating a large number of configurations and gradually focusing on the best-performing ones.

The SageMaker Tuning service automatically sends different hyperparameter values to training jobs, selecting the best ones based on model performance.

SageMaker Tuning Architecture

To use the SageMaker tuning service, you create a tuning job and specify key details like:

  • Tuning strategy
  • Objective metric to optimize
  • Hyperparameters to tune and their ranges
  • Maximum number of training jobs
  • Number of jobs to run in parallel

Once the tuning job starts, it triggers multiple training jobs, passing different hyperparameters to each one based on the selected strategy. The training metrics from these jobs are then used to determine the best hyperparameters for optimizing model performance.
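
To make this concrete, here is a minimal sketch with the SageMaker Python SDK that reuses the estimator and training input from the earlier training example. The objective metric name, log regex, and parameter ranges are assumptions; your training script must emit the metric in a matching format.

```python
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

tuner = HyperparameterTuner(
    estimator=estimator,  # the estimator from the training example
    objective_metric_name="validation:accuracy",          # assumed metric name
    metric_definitions=[{
        "Name": "validation:accuracy",
        "Regex": "val_accuracy=([0-9\\.]+)",              # assumed log format
    }],
    hyperparameter_ranges={
        "lr": ContinuousParameter(1e-5, 1e-2),
        "epochs": IntegerParameter(5, 20),
    },
    strategy="Bayesian",
    max_jobs=20,
    max_parallel_jobs=4,
)

tuner.fit({"train": train_input})
print(tuner.best_training_job())  # name of the best-performing job
```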

Deploying ML models for testing:

Data scientists typically don't deploy models directly for client applications, but they may need to deploy trained models to an API endpoint for performance testing. This is particularly useful for large models that can't be evaluated within a notebook instance, allowing for proper testing and evaluation before full deployment.

SageMaker Hosting Service

The SageMaker Hosting service provides several options for model inference, catering to different use cases:

  • Real-time Inference: Ideal for low-latency, sustained predictions. Options include:
    • Single-model hosting
    • Multiple-model hosting in a single container or across different containers behind one endpoint
  • Serverless Inference: Best for intermittent predictions with idle periods. It eliminates infrastructure management, offering cost savings since you only pay when the service is in use. However, there may be cold-start delays when the model is invoked after inactivity.
  • Asynchronous Inference: Suitable for large payloads that take time to process. Inference requests are queued and processed asynchronously. Input data and results are stored in S3, with notifications provided via Amazon SNS once processing is complete.
  • Batch Inference: Used for large-scale inferences where individual predictions are not needed immediately. It is cost-effective, as infrastructure is only spun up during batch job execution. SageMaker Batch Transform is ideal for this use case.
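
To illustrate the first two options, here is a sketch that reuses the estimator from the training example; the instance type and serverless capacity settings are illustrative, not recommendations.

```python
from sagemaker.serverless import ServerlessInferenceConfig

# Real-time endpoint on a dedicated instance (sustained, low-latency traffic).
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Serverless endpoint for intermittent traffic; expect occasional cold starts.
serverless_predictor = estimator.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=5,
    ),
)

# Clean up endpoints when testing is done to avoid idle costs.
predictor.delete_endpoint()
serverless_predictor.delete_endpoint()
```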

For automation and orchestration of ML workflows, SageMaker Pipelines and AWS Step Functions are available:

  • SageMaker Pipelines: Helps create a Directed Acyclic Graph (DAG) to automate tasks such as data processing, model training, and testing. Similar to Apache Airflow, it enhances efficiency and reproducibility.
  • AWS Step Functions: Provides flexibility and scalability for building automated workflows that can support various ML tasks.

These tools help data scientists improve the efficiency and organization of their iterative ML processes.
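
As a minimal, version-dependent sketch (not a complete workflow), the code below wraps the earlier training job in a one-step SageMaker Pipeline; the pipeline name and role ARN are placeholders, and exact step constructors vary slightly across SageMaker Python SDK versions.

```python
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# Wrap the training job from the earlier example as a pipeline step.
train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": train_input},
)

pipeline = Pipeline(name="experiment-pipeline", steps=[train_step])

# Hypothetical role ARN; replace with your own execution role.
pipeline.upsert(role_arn="arn:aws:iam::111122223333:role/SageMakerExecutionRole")
execution = pipeline.start()
```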

Best practices:

  • Use SageMaker Training for Large-Scale Jobs: Reserve SageMaker Studio notebooks for quick experimentation with small datasets. For large model training, use SageMaker Training service to avoid the high costs of running large EC2 instances constantly.
  • Abstract Infrastructure Details: Simplify the experience by hiding complex infrastructure configurations (e.g., networking, IAM roles, storage options) using environment variables or custom SDK options.
  • Create Self-Service Provisioning: Use AWS Service Catalog to automate user onboarding, reducing bottlenecks in provisioning resources.
  • Leverage Studio Notebook Local Mode: For fast model testing, use local mode in SageMaker Studio, which mimics the training environment locally, speeding up experimentation without additional infrastructure overhead.
  • Set Up Guardrails: Implement AWS service control policies to ensure best practices, such as correct instance types and the use of encryption keys, are followed by data scientists.
  • Regularly Clean Up Unused Resources: Review and delete unused resources (e.g., notebooks, endpoints) to prevent unnecessary costs.
  • Use Spot Instances: For cost savings, consider EC2 Spot Instances for training. Use training checkpoints to resume from the last saved point if instances are interrupted (see the sketch after this list).
  • Use Built-in Algorithms & Managed Containers: Leverage SageMaker’s pre-built algorithms and managed training containers to reduce the need for custom model development and increase efficiency.
  • Automate ML Pipelines: Build automated pipelines for experimentation, model building, and testing to improve efficiency, consistency, and tracking of experiments.
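
As a hedged illustration of managed spot training, the sketch below reuses the PyTorch estimator setup from the training example; the checkpoint S3 URI, time limits, and other values are assumptions.

```python
from sagemaker.pytorch import PyTorch

# Same estimator as before, with managed spot training enabled.
# The role ARN, script, and checkpoint URI are placeholders.
spot_estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    framework_version="2.1",
    py_version="py310",
    instance_type="ml.g5.xlarge",
    instance_count=1,
    use_spot_instances=True,   # run on spare capacity at reduced cost
    max_run=3600,              # max training time, in seconds
    max_wait=7200,             # max total time, including spot interruptions
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # resume point on interruption
)
```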

Following these practices will help you streamline ML workflows, reduce costs, and enhance the overall efficiency of your work in SageMaker Studio. In the next part, we will dive into a hands-on exercise that demonstrates some of SageMaker's capabilities.

References:

For more detailed information and advanced configurations, refer to the following resources:

  • Machine Learning Service - Amazon SageMaker AI - AWS: Build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.
  • Unified data, analytics, and AI: Bringing together comprehensive machine learning and analytics capabilities, the next generation of Amazon SageMaker delivers an integrated platform for data, analytics, and AI. When using Amazon SageMaker, AWS charges you the pricing of each AWS service that you use.