It’s 3 a.m., and a critical fraud detection model is failing silently. The data science team, awakened by frantic calls, blames a subtle infrastructure change. The DevOps team points to an unstable model artifact. Caught in the middle, platform engineers scramble to diagnose a system they didn’t build. This all-too-common chaos stems from a fundamental confusion: conflating model creation (a science problem) with model operation (a systems problem).
Who’s in charge of what? Source: image generated with gpt4o
The solution lies in a clear separation of concerns (though the roles share plenty of common ground), defining two distinct loops: the ML Engineer’s ‘Science Loop’ and the MLOps Engineer’s ‘System Loop.’ Both loops rely on a stable foundation provided by Platform Engineering. To resolve the chaos, we must first define these roles and then map the critical handoff that connects them.
In this post we will focus on the responsibilities of the different roles involved in creating a piece of software that is mainly underpinned by a machine learning model (deep learning included). These roles are:
- Product Manager
- Machine Learning Engineer
- Machine Learning Operations (MLOps) Specialist
- Data Scientist
Machine Learning Engineering
A Machine Learning Engineer (ML Engineer) is a software engineer who specializes in designing, building, and deploying machine learning models and systems. They develop algorithms, optimize data pipelines, and integrate ML solutions into production environments to enable scalable, data-driven applications. Their key focus areas are:
- Model Development: Training, evaluating, and improving ML models.
- Productionization: Deploying models into live systems with robust APIs and infrastructure.
- Optimization: Ensuring performance, scalability, and maintainability of ML workflows.
- Collaboration: Working with data scientists, MLOps engineers, and software teams.
Machine Learning Operations
Machine Learning Operations (MLOps) is a set of practices and tools that combines machine learning, software engineering, and DevOps to streamline the development, deployment, monitoring, and management of machine learning models in production environments. Key focus areas are:
- Automates model training, testing, and deployment pipelines.
- Enables continuous integration and delivery (CI/CD) for ML systems.
- Manages versioning of data, models, and code for reproducibility.
- Provides monitoring and governance of live ML models to detect drift and ensure compliance.
The Two Loops: A Clearer Model for the ML Lifecycle
Effective machine learning at scale requires treating model development and model operations as separate, yet connected, disciplines. One loop focuses on discovery and experimentation; the other on reliability and automation.
The ‘Science Loop’: Where ML Engineers Drive Model Efficacy
The ML Engineer’s mindset is one of discovery and accuracy. They ask: “What is the best model to solve this business problem?” Their work lives in the ‘Science Loop,’ an iterative process of experimentation and refinement. Key activities include deep data analysis, feature engineering, designing novel model architectures, and rigorous hyperparameter tuning. Their world revolves around evaluating precision, recall, and AUC curves to prove a model’s worth.
To achieve this, ML Engineers wield a specific set of tools. Jupyter Notebooks serve as their digital lab for rapid prototyping. Libraries like Scikit-learn, PyTorch, and TensorFlow provide the building blocks for model creation. Experiment tracking platforms such as MLflow Tracking or Weights & Biases are essential for logging thousands of runs to find the optimal solution. As noted by industry observers, ML Engineers focus on the “development, deployment and self-tuning of ML algorithms and models” and are deeply involved in the entire experimentation lifecycle. For instance, an ML Engineer might spend weeks testing novel neural network architectures to lift a customer churn model’s accuracy by a few crucial percentage points. The final output of this loop isn’t a production service; it’s a trained, high-performing model artifact and its associated metadata—a candidate for promotion to the next stage.
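As a minimal illustration of what that looks like in code, the sketch below logs a single experiment run with MLflow Tracking; the dataset, model choice, and metric names are placeholders, not a prescribed workflow.

```python
# Minimal sketch: logging one Science Loop experiment run with MLflow Tracking.
# The dataset, model, and metric names are illustrative, not prescriptive.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real churn dataset.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

params = {"n_estimators": 200, "learning_rate": 0.05, "max_depth": 3}

with mlflow.start_run(run_name="churn-gbm-candidate"):
    mlflow.log_params(params)

    model = GradientBoostingClassifier(**params).fit(X_train, y_train)
    preds = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]

    # The evaluation metrics the Science Loop optimizes for.
    mlflow.log_metrics({
        "precision": precision_score(y_test, preds),
        "recall": recall_score(y_test, preds),
        "auc": roc_auc_score(y_test, proba),
    })

    # The trained artifact plus its metadata is the real output of this loop.
    mlflow.sklearn.log_model(model, artifact_path="model")
```

Each run like this becomes one row in the experiment tracker, making it easy to compare thousands of candidates before promoting one.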
The ‘System Loop’: Where MLOps Engineers Drive Reliability at Scale
Once a model is proven effective, the question shifts from “does it work?” to “will it run reliably?” The MLOps Engineer’s mindset focuses on automation and stability. They ask: “How do we run this model in production reliably, securely, and efficiently?” Their domain is the ‘System Loop,’ which operationalizes the science. Their activities revolve around building CI/CD pipelines for models, automating testing, packaging, and canary deployments. They are responsible for implementing automated monitoring to detect data drift, performance degradation, and security vulnerabilities. Their work also involves auto-scaling infrastructure, defining incident response playbooks, engineering safe rollback strategies, and implementing cost governance.
Their toolkit reflects this production focus. CI/CD engines like GitLab/GitHub Actions or Jenkins automate the path to production. Kubernetes provides the container orchestration layer, while Prometheus and Grafana offer the observability stack. Model-serving tools like Seldon Core manage deployment patterns, and infrastructure-as-code tools like Terraform ensure the entire environment is reproducible. MLOps Engineers are tasked with “deploying, monitoring, and operationally managing ML models” in an automated and repeatable way. For example, an MLOps Engineer might build a pipeline that automatically triggers a model retrain and deployment whenever data drift is detected, ensuring the service heals itself without human intervention.
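As a rough illustration of the automated check behind such a pipeline, the sketch below flags feature drift with a two-sample Kolmogorov-Smirnov test; the threshold and the retrain trigger are assumptions, and production setups typically wire dedicated drift-monitoring tools into the CI/CD system instead.

```python
# Minimal sketch: a scheduled drift check the System Loop could run.
# The significance threshold and the retrain trigger are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # assumed significance threshold for flagging drift


def feature_drifted(reference: np.ndarray, live: np.ndarray) -> bool:
    """Two-sample Kolmogorov-Smirnov test between training-time and live data."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < DRIFT_P_VALUE


def check_and_trigger(reference_batch: np.ndarray, live_batch: np.ndarray) -> None:
    # Compare each feature column of live traffic against the training baseline.
    drifted = [
        i for i in range(reference_batch.shape[1])
        if feature_drifted(reference_batch[:, i], live_batch[:, i])
    ]
    if drifted:
        # A real pipeline would call the CI/CD system's API here
        # (e.g., dispatch a retraining workflow) instead of printing.
        print(f"Drift detected in features {drifted}; triggering retrain pipeline.")
    else:
        print("No significant drift detected.")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(0, 1, size=(1_000, 5))
    live = reference.copy()
    live[:, 2] += 0.8  # simulate a shifted feature in live traffic
    check_and_trigger(reference, live)
```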
Having defined these two distinct loops, we can now examine the critical handoff point where a model artifact crosses the boundary from the science domain to the systems domain.
The Critical Handoff: The Contract Between Science and Systems
The transition from a promising model to a production service cannot be an informal “throw it over the wall” exercise. In mature organizations, this handoff is a formal contract—a well-defined package that transfers ownership from the Science Loop to the System Loop. It’s the point where a promising artifact becomes a serious candidate for a production workload.
This handoff “package” contains several non-negotiable components, much like the contract for a well-defined microservice API (see the sketch after this list):
- Versioned Model Artifact: The serialized model file itself (e.g., model.pkl, saved_model.pb), locked to a specific version and stored in an artifact repository. This ensures reproducibility.
- Dependencies: A precise requirements.txt or conda.yaml file. This prevents the “silent dependency bump” that can break production systems.
- Schema Definition: The expected format for input data and the guaranteed format for the model’s output, often defined as a JSON schema. This serves as the API contract for the model.
- Performance Baseline: The model’s key metrics (e.g., “92% accuracy, 85% recall”) on a versioned, golden test dataset. This sets the bar for future performance and drift detection.
- Resource Profile: An estimate of the expected CPU, memory, or GPU consumption under a defined load. This informs the MLOps engineer how to configure scaling and resource quotas effectively.
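To make the contract concrete, here is a sketch of how such a handoff manifest could be captured in code; the field names and example values are hypothetical, not a standard, and teams often encode the same information as YAML/JSON in the model registry instead.

```python
# Illustrative sketch of a machine-readable handoff package manifest.
# Field names and values are hypothetical, not a standard.
from dataclasses import dataclass


@dataclass
class HandoffPackage:
    model_artifact: str      # e.g., location of the versioned model.pkl in the artifact repo
    model_version: str       # immutable version tag
    dependencies: str        # path to pinned requirements.txt or conda.yaml
    input_schema: dict       # JSON-schema-style contract for request payloads
    output_schema: dict      # guaranteed shape of the model's response
    baseline_metrics: dict   # metrics on the versioned golden test set
    resource_profile: dict   # expected CPU/memory/GPU under a defined load


package = HandoffPackage(
    model_artifact="s3://models/fraud-detection/2024-06-01/model.pkl",  # hypothetical path
    model_version="1.4.2",
    dependencies="requirements.txt",
    input_schema={"type": "object", "properties": {"amount": {"type": "number"}}},
    output_schema={"type": "object", "properties": {"fraud_score": {"type": "number"}}},
    baseline_metrics={"accuracy": 0.92, "recall": 0.85, "test_set": "golden-v7"},
    resource_profile={"cpu": "500m", "memory": "1Gi", "p99_latency_ms": 50},
)
```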
Just as a microservice API contract specifies endpoints, request/response formats, and latency Service Level Objectives (SLOs), this model handoff package provides the MLOps team with everything they need to build a robust, observable, and scalable production service around it. While the term “handoff package” isn’t universally standardized, the concept is a direct application of established software engineering principles. This formal contract, in turn, relies on a stable, standardized foundation, which is the responsibility of the Platform Engineering team.
The Unsung Hero: How Platform Engineering Enables Both Loops
Neither the Science Loop nor the System Loop can operate efficiently at scale without a standardized, reliable foundation. Consider a fintech company where data scientists pushed notebooks directly to production under an ‘agile research’ banner—until a silent dependency bump broke fraud scoring for two hours. Post-mortem, the platform squad baked a model-serving template into their internal developer portal. Now, scientists get a one-click deploy, operations teams receive versioned artifacts, and the pager stays quiet. This illustrates the role of Platform Engineering: building the ‘paved road’ that enables both ML and MLOps engineers to move quickly without breaking things.
Platform Engineering’s core offerings are designed to abstract away infrastructure complexity and enforce standards:
- Compute Abstraction: They provide managed Kubernetes clusters with pre-configured GPU scheduling and node auto-scaling, freeing ML engineers from becoming cloud infrastructure experts.
- Standardized Tooling: They offer and maintain the central CI/CD system, the corporate observability stack (Prometheus, Grafana), and the official artifact repository (Artifactory, Nexus). This prevents tool sprawl and ensures consistency across teams.
- Golden Paths & Templates: They create pre-configured “golden” Docker images with security scanning baked in, reusable Terraform modules for standing up standard environments, and service templates for deploying new models (see the serving-template sketch after this list). This accelerates development and deployment significantly.
- Security & Governance: They manage centralized secret management with tools like HashiCorp Vault, enforce Identity and Access Management (IAM) policies, and configure network policies through service meshes like Istio. This ensures security is a default, not an afterthought.
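As one possible illustration of a “golden path” service template, the sketch below shows a minimal model-serving endpoint; the framework choice, endpoint names, model path, and payload fields are assumptions, and a real template would also bake in auth, logging, and metrics.

```python
# Minimal sketch of a platform-provided model-serving template.
# Endpoint names, model path, and payload fields are illustrative assumptions.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="fraud-scoring-service")
model = joblib.load("model.pkl")  # versioned artifact pulled from the artifact repository


class ScoringRequest(BaseModel):
    features: list[float]  # must match the input schema from the handoff package


@app.post("/score")
def score(request: ScoringRequest) -> dict:
    prediction = model.predict([request.features])[0]
    return {"fraud_score": float(prediction)}


@app.get("/healthz")
def healthz() -> dict:
    # Standard health probe used by Kubernetes liveness checks.
    return {"status": "ok"}
```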
By providing these foundational components, the platform team frees ML Engineers to focus on building better models and empowers MLOps Engineers to focus on automating the system loop, rather than both teams constantly reinventing infrastructure. Spotify’s engineering team famously built an internal platform, ML Home, as a “one-stop shop for machine learning,” underscoring how critical this centralized enablement is for scaling ML practices.
To see how these three roles collaborate, let’s look at how ownership is defined in practice using RACI charts.
Putting it to Work: RACI Charts for the Real World
To eliminate confusion and finger-pointing, ownership must be explicit. A RACI chart—defining who is Responsible, Accountable, Consulted, and Informed—is a simple but powerful tool for clarifying these boundaries. As project management experts note, RACI charts are proven to help clarify roles and support the delivery of complex projects.
Scenario 1: The 10 a.m. Model Retraining Request
- Task: The product team wants to experiment with new hyperparameters for the recommendation engine to improve user engagement.
- RACI Table:
  - Responsible: ML Engineer (executes the experiments and analyzes results)
  - Accountable: Data Science Lead (owns the business outcome of the model)
  - Consulted: MLOps Engineer (advises on potential cost/resource impact of training)
  - Informed: Product Manager (is kept aware of progress and results)
- Reasoning: This is a classic Science Loop activity. The ML Engineer performs the work, but their lead is accountable for the model’s ultimate performance. MLOps is consulted to ensure the experimentation doesn’t disrupt production systems.
Scenario 2: The 2 a.m. Production Model Drift Alert
- Task: PagerDuty fires. The P99 latency of the fraud detection model has breached its SLO, and accuracy has dropped 15% against the baseline.
- RACI Table:
  - Responsible: MLOps Engineer / SRE (executes the immediate rollback to a known good version)
  - Accountable: Head of MLOps / SRE Lead (owns the operational health of the service)
  - Consulted: ML Engineer (is brought in for post-mortem analysis of why drift occurred)
  - Informed: VP Engineering, Business Stakeholders (are notified of the incident and its resolution)
- Reasoning: This is a System Loop failure. The MLOps/SRE team is responsible for immediate mitigation to protect the business. The ML Engineer is consulted later to diagnose the root cause (the “why”), but they don’t answer the pager.
Scenario 3: The New GPU Infrastructure Request
- Task: The R&D team needs a new, isolated environment with A100 GPUs to begin building a new Large Language Model.
- RACI Table:
  - Responsible: Platform Engineer (provisions, configures, and secures the new infrastructure)
  - Accountable: Head of Platform (owns the delivery and maintenance of the platform’s offerings)
  - Consulted: ML Engineer (provides the technical requirements), FinOps (advises on the budget and cost implications)
  - Informed: Data Science Lead (is kept aware of the new capability)
- Reasoning: This is a foundational infrastructure task owned entirely by the Platform team. They are responsible for building the paved road, consulting with their “customers” (the ML team) to ensure it meets their needs.
Conclusion
Scaling machine learning is less about finding a silver-bullet tool and more about establishing clear lines of ownership. The path to moving fast without breaking things is paved with operational clarity. This clarity is built on three pillars: separating the ‘Science Loop’ of ML Engineering from the ‘System Loop’ of MLOps, formalizing the handoff between them with a clear contract, and investing in a Platform Engineering team to provide a stable foundation for both. This structure isn’t bureaucratic overhead; it is the prerequisite for speed and reliability.
Look at your last production ML incident. Was it immediately clear who owned the resolution? If the answer is anything but a confident “yes,” it’s time to draw these lines. Start by mapping your current ML lifecycle, identify where ownership is ambiguous, and use a framework like RACI to bring order to the chaos.