


In most practical applications, data scientists have a set of ML models that could be applied to solve a problem. They run several of these models and see which ones perform best. This is often called racing ML models against each other to choose a winner.
Let us understand this with an example of a business problem. Suppose we want to do customer churn modeling. The model will predict which customers are likely to churn. We collect data on customer product usage patterns, customers' historical interactions with the customer service or support team, and customer demographics, among other data that might be needed. We now have multiple potential models that could solve this business problem. Some possible candidates are:
Logistic Regression: A simple and interpretable model that is great for binary classification tasks, in this case Churn or Retained. However, it may struggle with complex relationships between features.
Random Forest: An ensemble method, which we will discuss in more detail later in the guide, that combines multiple decision trees. It handles non-linear relationships and feature interactions effectively. However, it can be computationally expensive for large datasets.
Gradient Boosting Machine (GBM): GBM is another ensemble method that iteratively trains models on the errors of previous models. It is widely used and often achieves high performance, but it can be prone to overfitting.
Support Vector Machine (SVM): SVM is effective for high-dimensional data and non-linear boundaries but it can be computationally expensive for large datasets.
Neural Network: Finally, we can also leverage neural networks if we have a very large dataset. These powerful models can learn complex patterns from extremely large datasets. However, they require careful tuning of hyperparameters and can be computationally intensive.
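The "racing" idea can be sketched in a few lines with scikit-learn. This is a minimal illustration, not a prescription: the synthetic dataset stands in for real churn data, and the model settings are defaults chosen for brevity.

```python
# Sketch: race several candidate churn models with 5-fold cross-validation.
# The synthetic dataset is a stand-in for real churn data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean ROC-AUC across 5 folds for each candidate; highest score "wins".
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
winner = max(scores, key=scores.get)
print(winner, round(scores[winner], 3))
```

In practice the winner depends on the data; the point is that all candidates are evaluated under the same cross-validation protocol before one is chosen.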
Given so many choices, which one should we use to model churn for our business problem? A simple framework for making this decision is given below.
We have already discussed Steps 1-3 in Module 2 Lesson 1 and Module 2 Lesson 3.
Now, let us focus on Steps 4 to 8 in the above figure to gain a deeper understanding of the process of model selection and model evaluation.
Some questions that we will answer in this article are:
Let us dive right into it.
5.4.1. How to find a set of candidate models to run (Step 4)?
Now let us discuss the crux of this lesson: how do we know which models to choose from the many possible models out there? While the actual answer depends on your business problem, the framework below is a good starting guide.
5.4.1.a. Understand your data - The first step is to take a deep dive into the available data. Data scientists call this step Exploratory Data Analysis (EDA).
This is the most crucial first step in the machine learning workflow, allowing data scientists and analysts to deeply understand the characteristics, quality, and structure of their dataset.
EDA involves visualizing data distributions, identifying missing values, detecting outliers, and understanding feature relationships. This systematic investigation aids in making informed decisions about feature selection, engineering, and appropriate model choices, thus significantly influencing the model's ultimate performance and accuracy. As you do EDA, you will also learn a few things such as:
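Several of these EDA checks come down to a handful of pandas calls. The toy customer table below is purely illustrative; its column names are invented for the example.

```python
import pandas as pd

# Toy customer table with one missing value and one likely outlier (illustrative).
df = pd.DataFrame({
    "monthly_logins": [12, 30, 8, 400, 15],   # 400 looks like an outlier
    "support_tickets": [1, 0, None, 2, 1],    # one missing value
    "churned": [0, 0, 1, 0, 1],
})

summary = df.describe()    # distributions: mean, std, min/max, quartiles
missing = df.isna().sum()  # missing-value count per column
corr = df.corr()           # pairwise feature relationships
print(missing["support_tickets"])  # -> 1
```

Even this quick pass surfaces the issues discussed below: missing values, outliers, and whether a labeled target (`churned`) exists at all.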
5.4.1.a. i. Is the Data Labeled or Unlabeled: Labeled data contains both input variables (features) and output variables. For example, not only do we have customer attribute data for our SaaS application but also data on which customers have left (churned). If so, we can run supervised learning tasks on the data. Unlabeled data, which consists solely of input variables without an explicit target, must either be converted into labeled data or used for unsupervised learning.
5.4.1.a. ii. Is the Output Variable Continuous or Discrete: The output variable (also known as the dependent variable or the response variable) is the outcome that a model is trained to predict. It determines the type of machine learning approach. We can expand the tree above to include the type of output variable.
So, what is the difference between Clustering (Unsupervised Model) and Classification (Supervised Model)? Classification models are supervised models, or in other words, data is labeled, so the names of the classes or labels are already known. In the case of an unsupervised model, on the other hand, the data is unlabeled, so we do not know the names of these classes; hence, we call them clusters. Once we have clusters, we can name those clusters, but initially, we do not know the names of these classes.
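The distinction is visible directly in code: a classifier consumes the labels during training, while a clusterer never sees them and simply invents cluster ids that we can name afterwards. A minimal scikit-learn sketch on synthetic two-group data (illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Classification (supervised): the labels y are part of the training signal.
clf = LogisticRegression().fit(X, y)

# Clustering (unsupervised): only X is passed in. The algorithm assigns
# anonymous cluster ids 0..k-1, which we might later name ourselves
# (e.g. "high-usage" vs "low-usage" customers).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Note that the cluster ids carry no inherent meaning; mapping them to business-meaningful names is a separate, manual step.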
5.4.1.a. iii. Data Volume vs Model Type: The relationship between data volume and model complexity significantly affects model selection. Depending on the size of the data, one method may be preferred over another, as shown below:
5.4.1.a. iv. Imbalanced Datasets: Class imbalance occurs when one class significantly outnumbers another in a dataset, resulting in biased model predictions that overly favor the majority class. Proper handling of class imbalance is essential for developing accurate predictive models that effectively detect rare but critical events, reducing false negatives.
Example: Fraud detection datasets often have fewer than 1% fraudulent transactions. Suitable approaches include Random Forests with balanced weights, XGBoost with specialized loss functions, and SMOTE oversampling to improve model sensitivity towards detecting rare fraudulent activities.
Some techniques used for this purpose are:
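One of the simplest of these techniques, class weighting, is a one-argument change in scikit-learn. The sketch below generates a synthetic dataset with roughly 1% positives to mimic a fraud-style imbalance; all numbers are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with ~1% positive class, mimicking fraud-style imbalance.
X, y = make_classification(n_samples=5000, weights=[0.99], flip_y=0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" re-weights the training loss so the rare class
# is not drowned out by the majority class.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)

# Recall on the rare class is the metric that matters here, not accuracy:
# predicting "not fraud" everywhere would already score ~99% accuracy.
rec = recall_score(y_te, clf.predict(X_te))
print(rec)
```

SMOTE-style oversampling follows the same pattern but modifies the training set itself (e.g. via the `imbalanced-learn` package) rather than the loss weights.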
5.4.1.a. v. Time-Series Data: Time-series data consists of observations collected sequentially over time, with inherent temporal dependencies. Proper analysis ensures accurate forecasting and captures seasonality, trends, and cyclic patterns, which is critical for informed business decision-making.
Examples are financial market forecasting (stock prices, market trends), inventory and demand forecasting for retail and supply chain management, and energy consumption prediction for utility companies.
Some models and methods suitable for time-series data are:
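Whichever model you pick, time-series data must be split chronologically; randomly shuffling the rows leaks future information into training. scikit-learn's `TimeSeriesSplit` illustrates the pattern (the series itself is synthetic):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic ordered series standing in for e.g. daily demand (illustrative).
X = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every training window strictly precedes its test window,
    # so the model never peeks at the future.
    assert train_idx.max() < test_idx.min()
```

The same principle applies outside scikit-learn: ARIMA-style and deep-learning forecasters are likewise validated on windows that come after their training data.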
5.4.1.a. vi. Missing Values Handling: Some models are naturally robust to missing data, such as:
While other models require explicit imputation:
Examples: Customer datasets with incomplete profiles (demographic data missing). Healthcare records where certain patient measurements might be intermittently missing.
5.4.1.a.vii. Outliers in Data: The presence of outliers may impact your model choice, since some models are highly sensitive to them. Linear Regression, Logistic Regression, and Support Vector Machines (SVMs) can all have their predictions skewed by extreme values and require thorough outlier treatment before modeling. Other models are more robust to outliers: Decision Trees, Random Forests, and Gradient Boosting Methods (e.g., XGBoost) partition the data, isolating the impact of outliers and typically requiring less preprocessing.
Examples: Financial datasets with rare extreme market events affecting forecasting models, Real estate price predictions where certain properties have unusually high or low valuations due to unique features.
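The sensitivity itself is easy to demonstrate: a single extreme point drags a least-squares fit far more than a tree-based one. A toy comparison on invented data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Clean relationship y = 2x, plus one extreme outlier appended at the end.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()
X_out = np.vstack([X, [[9.5]]])
y_out = np.append(y, 500.0)  # the outlier

lin = LinearRegression().fit(X_out, y_out)
tree = DecisionTreeRegressor(random_state=0).fit(X_out, y_out)

# The outlier pulls the linear fit's prediction at x=0 far away from the
# true value of 0, while the tree's prediction for small x is unaffected,
# because the outlier is isolated in its own leaf.
print(lin.predict([[0.0]])[0], tree.predict([[0.0]])[0])
```

This is the mechanism behind the "robust" column above: partition-based models localize an outlier's influence, while global fits spread it across the whole prediction surface.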
5.4.1.b. Assess Business Objectives: Model Performance vs. Interpretability
When selecting a machine learning model, businesses must carefully balance model performance against interpretability. This decision significantly affects stakeholder trust, regulatory compliance, transparency in decision-making, and overall business outcomes.
Ultimately, choosing between performance and interpretability depends on your organization's strategic goals, compliance requirements, and stakeholder expectations. Business leaders must carefully assess these factors to select the most suitable model that aligns with their specific needs and business environment.
5.4.1.c. Assess Computational Resources
Understanding and evaluating available computational resources is essential when selecting machine learning models. Computational resources refer to the processing power, memory capacity, storage availability, and specialized hardware (like GPUs and TPUs) that your organization can allocate for running machine learning algorithms.
Aligning the complexity of selected models with your available computational resources ensures efficient use of budget, time, and infrastructure, optimizing overall business outcomes.
5.4.1.d. Select Appropriate Models Based on Prior Knowledge or Academic Research for Similar Data and Problem Spaces
Leveraging existing industry insights, academic research, and prior experiences can significantly streamline and enhance the model selection process. This approach reduces experimentation costs and increases the likelihood of successful outcomes.
By incorporating prior knowledge and proven academic research into your model selection strategy, your organization can confidently select effective and reliable machine learning solutions, enhancing overall project success.
Once you have filtered models based on the previous steps, finalize the set of candidate models by:
Before we dive into the training and evaluation stages, it's important to reflect on the critical role model selection plays in the machine learning pipeline. At this stage, we haven't written a single line of code or trained any models—yet the decisions made here lay the groundwork for everything that follows. By thoroughly analyzing your data, understanding your business objectives, assessing computational constraints, and drawing from both experience and research, you’re ensuring that you're not just building any model—but the right model.
Whether you're a business leader concerned with outcomes or a technical lead focused on performance, this step helps bridge the two worlds. It transforms raw business needs into a clear, strategic plan for predictive modeling. And while model selection is not glamorous, it’s foundational. The smarter and more systematic we are here, the less friction we’ll encounter later in training, deployment, and long-term maintenance.
With the candidate models chosen and aligned with real-world constraints and expectations, we are now ready to move forward—to put our models to the test.