WDIS AI-ML
35 min read
WDIS AI-ML Series: Module 2 Lesson 6: Model Selection and Evaluation Metrics
Written by Vinay Roy
Published on 09th May 2025

In most practical applications, data scientists have several ML models that could be applied to a given problem. They train each candidate and compare the results to see which performs best. This is called racing ML models against each other to choose a winner.

Let us understand this with an example business problem. Suppose we want to model customer churn, that is, predict which customers are likely to churn. We collect data on customer product usage patterns, customers' historical interactions with the customer service or support team, and customer demographics, among other data that might be needed. Several models could be used to solve this business problem. Some possible candidates are:

Logistic Regression: A simple and interpretable model that is well suited to binary classification tasks, in this case Churn or Retained. However, it may struggle with complex relationships between features.

Random Forest: An ensemble method, which we will discuss in more detail later in the guide, that combines multiple decision trees. It handles non-linear relationships and feature interactions effectively. However, it can be computationally expensive for large datasets.

Gradient Boosting Machine (GBM): GBM is another ensemble method that iteratively trains models on the errors of previous models. It is widely used and often achieves high performance, but it can be prone to overfitting.

Support Vector Machine (SVM): SVM is effective for high-dimensional data and non-linear boundaries, but it can be computationally expensive for large datasets.

Neural Network: Finally, we can also use a neural network (NN) if we have a very large dataset. This powerful model can learn complex patterns from extremely large datasets. However, it requires careful tuning of hyperparameters and can be computationally intensive.
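To make the idea of a race concrete, here is a minimal sketch in plain Python. It is not the method used in this series: the two toy models (`MajorityClass`, `NearestCentroid`), the helper names, and the churn-like data are all hypothetical, chosen so the example runs without any external library. In practice, the candidates would be library implementations of the models listed above.

```python
from statistics import mean


class MajorityClass:
    """Baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(sorted(set(y)), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestCentroid:
    """Predicts the class whose feature-wise mean is closest to the row."""

    def fit(self, X, y):
        self.centroids_ = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids_[label] = [mean(col) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [min(self.centroids_, key=lambda c: dist2(x, self.centroids_[c]))
                for x in X]


def cross_val_accuracy(make_model, X, y, k=5):
    """Mean accuracy over k folds; fold i holds out rows i, i+k, i+2k, ..."""
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(X), k))
        X_train = [x for i, x in enumerate(X) if i not in test_idx]
        y_train = [t for i, t in enumerate(y) if i not in test_idx]
        X_test = [X[i] for i in sorted(test_idx)]
        y_test = [y[i] for i in sorted(test_idx)]
        preds = make_model().fit(X_train, y_train).predict(X_test)
        correct = sum(1 for p, t in zip(preds, y_test) if p == t)
        scores.append(correct / len(y_test))
    return mean(scores)


def race(candidates, X, y, k=5):
    """Return (winner_name, {name: score}) for a dict of model factories."""
    results = {name: cross_val_accuracy(factory, X, y, k)
               for name, factory in candidates.items()}
    return max(results, key=results.get), results


# Hypothetical toy data: [monthly_logins, support_tickets]; 1 = churned.
X = [[20 + i, 1] for i in range(12)] + [[3 + i % 3, 5] for i in range(8)]
y = [0] * 12 + [1] * 8
winner, scores = race({"majority": MajorityClass, "centroid": NearestCentroid}, X, y)
print(winner, scores)
```

Every candidate is scored on the same folds, so the comparison is like for like. Including a trivial baseline such as `MajorityClass` shows how much of a model's score comes from the class balance alone.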

Given so many choices, which one should we use to model churn for our business problem? A simple framework for making this choice is given below.

Fig: Framework for racing multiple ML models to identify a winning model

We have already discussed Steps 1 - 3 in Module 2 Lesson 1 and Module 2 Lesson 3.

Now, let us focus on Steps 4 to 8 in the above figure to gain a deeper understanding of the process of model selection and model evaluation.

Some questions that we will answer in this article are:

  1. How do we determine which models to use for the race?
  2. How do we know which model is the winner?
  3. How can we improve a model to improve its chances of winning?
  4. How do we know the winner will stay the winner?

Let us dive right into it.

5.4.1. How to find a set of candidate models to run (Step 4)?

Now let us discuss the crux of this lesson: how do we know which models to choose from the many possible models out there? While the actual answer depends on your business problem, the framework below is a good starting guide.

5.4.1.a. Understand your data - The first step is to take a deep dive into the available data. Data scientists call this step Exploratory Data Analysis (EDA).

This is a crucial first step in the machine learning workflow, allowing data scientists and analysts to understand the characteristics, quality, and structure of their dataset.

EDA involves visualizing data distributions, identifying missing values, detecting outliers, and understanding feature relationships. This systematic investigation aids in making informed decisions about feature selection, engineering, and appropriate model choices, thus significantly influencing the model's ultimate performance and accuracy. As you do EDA, you will also learn a few things such as:

5.4.1.a. i. Is the Data Labeled or Unlabeled: Labeled data contains both input variables (features) and output variables. For example, not only do we have customer attribute data for our SaaS application but also data on which customers have left (churned). If so, then we can run supervised learning tasks on the data. Unlabeled data consists solely of input variables without an explicit target; it must either be converted into labeled data or be used only for unsupervised learning.

  • Examples:
    • Labeled (Supervised): Customer churn prediction, fraud detection.
    • Unlabeled (Unsupervised): Customer segmentation, anomaly detection.

5.4.1.a. ii. Is the Output Variable Continuous or Discrete: The output variable (also known as the dependent variable or the response variable) is the outcome that a model is trained to predict. Its type determines the machine learning approach. We can expand the tree above to include the type of output variable.

  • If the data is labeled and the output variable is
    • Continuous (Regression Model): Applicable only when labeled data with numeric target values is available. Examples: Forecasting quarterly sales, predicting stock prices, and estimating real estate market trends.
    • Discrete (Classification Model): Applicable only when labeled data with categorical target values is present. Examples: Predicting customer churn, categorizing emails as spam or legitimate, and diagnosing water pipe conditions for a water utility company.
  • If the data is unlabeled, there is no predefined output variable, so we can only run unsupervised models. These models are used to uncover inherent patterns or groupings within the data. Common families are:
    • Clustering groups data points into distinct clusters based on similarity or patterns within the data. It helps uncover inherent groupings without prior labels and is useful for tasks like customer segmentation, market analysis, and anomaly detection. Common clustering methods include K-means, DBSCAN, and Hierarchical clustering.
    • Dimensionality Reduction reduces the number of variables or features in the dataset while preserving its essential structure and information. It simplifies data representation, reduces noise, and improves computational efficiency, enabling clearer data visualization and better performance in subsequent modeling. Popular dimensionality reduction techniques include Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE).
    • Association Rule Learning discovers interesting relationships between variables in large datasets. Example applications: Market basket analysis (identifying products frequently bought together).
    • Generative Models learn the underlying distribution of the data in order to generate new, synthetic instances similar to the training data. Notable algorithms include Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). Example applications: Image synthesis, data augmentation.

So, what is the difference between Clustering (Unsupervised Model) and Classification (Supervised Model)? Classification models are supervised models, or in other words, data is labeled, so the names of the classes or labels are already known. In the case of an unsupervised model, on the other hand, the data is unlabeled, so we do not know the names of these classes; hence, we call them clusters. Once we have clusters, we can name those clusters, but initially, we do not know the names of these classes.
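The decision tree described in this subsection can be sketched as a small function. This is only an illustration: the function name `suggest_task` and the `max_classes` cutoff (used to decide when a numeric target has too many distinct values to be treated as a set of classes) are assumptions, not part of the framework above.

```python
def suggest_task(target_values=None, max_classes=20):
    """Suggest a learning task from the target column (None means unlabeled)."""
    if target_values is None:
        # No output variable at all: only unsupervised methods apply.
        return "unsupervised"
    distinct = set(target_values)
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    # Assumption: a numeric target with many distinct values is continuous.
    if all_numeric and len(distinct) > max_classes:
        return "regression"
    return "classification"


print(suggest_task(["churned", "retained", "churned"]))
print(suggest_task([1200.5 + 3.1 * i for i in range(100)]))
print(suggest_task(None))
```

A numeric target is not always continuous: a column coded as 0 and 1 for Retained and Churned is still a classification target, which is why the sketch looks at the number of distinct values rather than the data type alone.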

5.4.1.a. iii. Data Volume vs Model Type: The relationship between data volume and model complexity significantly affects model selection. Depending on the size of the data, some methods may be preferred over others, as shown below. The thresholds are rough rules of thumb rather than hard limits.

  • Small Data (<10k samples):
    • Simpler, interpretable models, such as Logistic Regression, Decision Trees, Naive Bayes, and K-Nearest Neighbors, are typically preferred due to their stability and lower risk of overfitting.
    • Examples: Diagnosing diseases from limited patient records, credit approval based on historical loan data.
  • Medium Data (10k - 1M samples):
    • Intermediate-complexity models, such as Random Forests, Support Vector Machines (SVMs), and Gradient Boosting algorithms (XGBoost, LightGBM), can be beneficial, striking a balance between interpretability and predictive power.
    • Examples: Predicting customer churn rates in telecommunications, forecasting regional sales data.
  • Large Data (>1M samples):
    • Complex models, such as Deep Neural Networks and advanced Gradient Boosting Machines, can leverage large-scale data to capture intricate patterns with a lower risk of overfitting.
    • Examples: Image and speech recognition, language modeling for virtual assistants, high-frequency trading algorithms in finance.
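The size bands above can be written down as a simple lookup, which is handy as a first filter when assembling candidates. The function name is hypothetical, and the cutoffs are the rough guide values from this list, not hard rules.

```python
def candidate_families(n_samples):
    """Map dataset size to the model families suggested above (rough heuristic)."""
    if n_samples < 10_000:
        return ["Logistic Regression", "Decision Tree", "Naive Bayes",
                "K-Nearest Neighbors"]
    if n_samples <= 1_000_000:
        return ["Random Forest", "SVM", "Gradient Boosting"]
    return ["Deep Neural Network", "Gradient Boosting"]


print(candidate_families(5_000))
print(candidate_families(250_000))
```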

5.4.1.a. iv. Imbalanced Datasets: Class imbalance occurs when one class significantly outnumbers another class in a dataset, which can bias model predictions toward the majority class. Proper handling of class imbalance is essential for developing predictive models that detect rare but critical events and reduce false negatives.

Example: Fraud detection datasets often have fewer than 1% fraudulent transactions. Suitable approaches include Random Forests with balanced weights, XGBoost with specialized loss functions, and SMOTE oversampling to improve model sensitivity to rare fraudulent activity.

Some techniques used for this purpose are:

  • Oversampling: Increasing the number of minority class samples through synthetic data generation, such as SMOTE (Synthetic Minority Oversampling Technique).
  • Undersampling: Reducing the number of majority class samples to balance classes.
  • Balanced Class Weights: Adjusting model algorithms (e.g., Random Forest, XGBoost) to give higher importance to the minority class during training.
  • Customized Loss Functions: Using specialized loss functions that penalize misclassification of the minority class more severely, particularly effective in models like XGBoost.
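Two of the techniques above can be sketched in a few lines of plain Python. The weight formula, n_samples / (n_classes * class_count), is the common "balanced" heuristic. The oversampler below is plain random duplication, not SMOTE: SMOTE synthesizes new points by interpolating between minority neighbors, which this sketch does not attempt. Function names and data are hypothetical.

```python
import random
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def random_oversample(X, y, seed=0):
    """Duplicate minority rows at random until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, t in zip(X, y) if t == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Hypothetical fraud-like data: 99 legitimate rows (0) and 1 fraudulent row (1).
labels = [0] * 99 + [1]
features = [[float(i)] for i in range(100)]
print(balanced_class_weights(labels))
_, balanced_labels = random_oversample(features, labels)
print(Counter(balanced_labels))
```

Resampling should be applied to the training portion only, so that the evaluation data keeps the class balance the model will face in production.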

5.4.1.a. v. Time-Series Data: Time-series data consists of observations collected sequentially over time, with inherent temporal dependencies. Proper analysis supports accurate forecasting by capturing seasonality, trends, and cyclic patterns, which is critical for informed business decision-making.

Examples are financial market forecasting (stock prices, market trends), inventory and demand forecasting for retail and supply chain management, and energy consumption prediction for utility companies.

Some models and methods suitable for time-series data are:

  • ARIMA (AutoRegressive Integrated Moving Average): Suitable for linear time-series data exhibiting consistent patterns.
  • Prophet: Developed by Facebook, ideal for handling complex seasonalities, trends, and missing data.
  • Long Short-Term Memory (LSTM) Neural Networks: Effective for modeling non-linear and complex temporal dependencies and capturing long-range dependencies in sequential data.
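To show what a temporal dependency looks like in code, here is a minimal sketch of the autoregressive idea behind the "AR" in ARIMA: each value is modeled as a linear function of the previous one, x[t] = c + phi * x[t-1], fitted by ordinary least squares. This is only the AR(1) piece, with no differencing and no moving-average term, so it is not a full ARIMA implementation; a dedicated library would be used in practice. The function names and the series are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    c = mean_curr - phi * mean_prev
    return c, phi


def forecast_ar1(series, c, phi, steps):
    """Roll the fitted relation forward from the last observed value."""
    forecasts, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        forecasts.append(last)
    return forecasts


# Hypothetical series generated exactly by x[t] = 2 + 0.5 * x[t-1].
demand = [10.0]
for _ in range(7):
    demand.append(2 + 0.5 * demand[-1])

c, phi = fit_ar1(demand)
print(c, phi)
print(forecast_ar1(demand, c, phi, steps=3))
```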

5.4.1.a. vi. Missing Values Handling: Some models can handle missing data natively, depending on the implementation, such as:

  • Decision Trees and Random Forests: Some implementations handle missing data by using surrogate splits or by treating missingness as a separate category; support varies by library.
  • Gradient Boosting Methods (e.g., XGBoost, LightGBM): Internally manage missing values during the model-building process.

Other models require explicit imputation:

  • Linear Models (e.g., Linear and Logistic Regression): Require datasets without missing values, hence necessitating imputation.

Examples: Customer datasets with incomplete profiles (demographic data missing). Healthcare records where certain patient measurements might be intermittently missing.
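For models that need complete data, a simple and common form of imputation is to fill gaps with the median of the observed values. The sketch below assumes missing entries are represented as `None` and also returns a flag for each row, so a downstream model can still use the fact that a value was missing. The function name and the sample column are hypothetical.

```python
from statistics import median


def impute_median(column):
    """Replace None with the median of observed values; also flag what was missing."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in column]
    was_missing = [v is None for v in column]
    return filled, was_missing


# Hypothetical customer ages with two incomplete profiles.
ages = [34, None, 29, 41, None, 38]
filled, was_missing = impute_median(ages)
print(filled)
print(was_missing)
```

The median used for filling should be computed on the training data and reused on new data, so that information from the evaluation set does not leak into training.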

5.4.1.a. vii. Outliers in Data: The presence of outliers may affect your model choice.

  • Models sensitive to outliers, such as Linear Regression, Logistic Regression, and Support Vector Machines (SVMs): Predictions can be heavily influenced or skewed by outliers, so thorough treatment is required before modeling.
  • Models more robust to outliers, such as Decision Trees, Random Forests, and Gradient Boosting Methods (e.g., XGBoost): These models partition the data, which isolates the impact of outliers and typically requires less preprocessing.

Examples: Financial datasets with rare extreme market events that affect forecasting models, and real estate price predictions where certain properties have unusually high or low valuations due to unique features.
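One common way to spot outliers during EDA is the interquartile range (IQR) rule: flag any value more than 1.5 IQRs below the first quartile or above the third. The sketch below applies it to a hypothetical list of property prices and compares the mean with the median to show how a single extreme value shifts an average. The function name, the data, and the 1.5 multiplier are illustrative choices.

```python
from statistics import mean, median, quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical property prices (in thousands) with one unusual valuation.
prices = [210, 225, 230, 240, 250, 255, 260, 270, 2500]
print(iqr_outliers(prices))
print(mean(prices), median(prices))
```

The median stays close to the typical price while the mean is pulled far upward, which is the same effect that distorts models fitted by minimizing squared error.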

5.4.1.b. Assess Business Objectives: Model Performance vs Interpretability

When selecting a machine learning model, businesses must carefully balance model performance against interpretability. This decision significantly affects stakeholder trust, regulatory compliance, transparency in decision-making, and overall business outcomes.

  • Model Performance: Refers to the accuracy, efficiency, and effectiveness with which a model makes predictions. High-performing models accurately predict outcomes, thus providing tangible business value.
  • Interpretability: Refers to the ability of stakeholders to easily understand and explain how a model arrives at its predictions or decisions. Interpretability builds stakeholder trust and ensures compliance with regulatory requirements by providing transparent decision-making.

  • High-Performance Models (e.g., Neural Networks, Gradient Boosting Machines):
    • Advantages: These sophisticated models typically achieve superior predictive accuracy by capturing complex, non-linear patterns and detailed interactions within extensive datasets. They are particularly effective in situations where accurate predictions directly translate into business value.
    • Disadvantages: These models are often perceived as "black boxes" due to their complex internal decision-making processes, making it challenging to clearly articulate how specific predictions are made. This lack of transparency can create challenges in regulated industries or situations requiring clear explanations for stakeholders.
    • Example Use Cases:
      • Image Recognition: Advanced neural networks effectively recognize images for facial recognition in smartphones, autonomous vehicle navigation systems, or medical imaging diagnostics.
      • Voice Assistants: High-performance models interpret voice commands in products like Siri or Alexa, significantly enhancing user experience.
      • Recommendation Engines: High-performance models such as gradient boosting can personalize recommendations on large e-commerce and streaming platforms, improving user engagement and sales.
  • Interpretable Models (e.g., Decision Trees, Logistic Regression):
    • Advantages: These models provide straightforward, understandable explanations for their predictions, making it easy for stakeholders to trust the decision-making process. Clear interpretability also helps meet regulatory standards, simplifies audits, and strengthens stakeholder confidence.
    • Disadvantages: Interpretable models may lack the ability to handle complex data patterns and relationships, potentially resulting in lower accuracy compared to more advanced models, particularly with extensive datasets.
    • Example Use Cases:
      • Healthcare Diagnostics: Decision trees clearly illustrate the reasoning behind patient risk assessments or diagnoses, allowing healthcare professionals to confidently explain their decisions.
      • Financial Credit Scoring: Logistic regression transparently shows which customer attributes influence credit decisions, satisfying stringent regulatory requirements and consumer transparency.
      • Insurance Risk Assessment: Clearly interpretable models allow insurance companies to transparently demonstrate how they calculate risk, premiums, or claims.
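To illustrate why logistic regression counts as interpretable, the sketch below turns a set of coefficients into odds ratios. In a logistic model, a one-unit increase in a feature multiplies the odds of the positive outcome by exp(coefficient), so each feature's influence can be read off directly. The coefficient values and feature names here are invented for illustration and do not come from a fitted model.

```python
import math


def predict_probability(intercept, coefficients, features):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(coef * value))))."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratios(coefficients):
    """exp(coef): multiplicative change in odds per one-unit feature increase."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}


# Hypothetical churn model coefficients (not fitted to real data).
coefs = {"support_tickets": 0.4, "monthly_logins": -0.1}
customer = {"support_tickets": 3, "monthly_logins": 12}

print(odds_ratios(coefs))
print(predict_probability(-1.0, coefs, customer))
```

An odds ratio above 1 means the feature pushes toward churn and one below 1 pushes toward retention, which is the kind of statement a stakeholder or auditor can check without inspecting the model internals.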

Ultimately, choosing between performance and interpretability depends on your organization's strategic goals, compliance requirements, and stakeholder expectations. Business leaders must carefully assess these factors to select the most suitable model that aligns with their specific needs and business environment.

5.4.1.c. Assess Computational Resources

Understanding and evaluating available computational resources is essential when selecting machine learning models. Computational resources refer to the processing power, memory capacity, storage availability, and specialized hardware (like GPUs and TPUs) that your organization can allocate for running machine learning algorithms.

  • Limited Computational Resources:
    • Characteristics: Small businesses or startups with basic IT infrastructure.
    • Preferred Models: Simple and computationally efficient models such as Logistic Regression and Decision Trees.
    • Examples: Real-time customer service chatbots or straightforward predictive analytics tasks for smaller datasets.
  • Moderate Computational Resources:
    • Characteristics: Medium-sized enterprises with standard computing resources such as high-performance CPUs and small-scale cloud solutions.
    • Preferred Models: Intermediate complexity models like Random Forests, Support Vector Machines, and Gradient Boosting methods.
    • Examples: Predictive maintenance in manufacturing, market segmentation analysis for retail operations, and targeted marketing campaigns.
  • Advanced Computational Resources:
    • Characteristics: Large enterprises or tech-centric organizations with extensive IT infrastructure including GPUs, TPUs, and advanced cloud computing capabilities.
    • Preferred Models: Complex algorithms, such as Deep Neural Networks and Transformer-based models (e.g., BERT, GPT).
    • Examples: Complex natural language processing applications, real-time recommendation systems for large e-commerce platforms, and advanced fraud detection systems in financial services.

Aligning the complexity of selected models with your available computational resources ensures efficient use of budget, time, and infrastructure, optimizing overall business outcomes.

5.4.1.d. Select Appropriate Models based on prior knowledge or academic research for similar data and problem space

Leveraging existing industry insights, academic research, and prior experiences can significantly streamline and enhance the model selection process. This approach reduces experimentation costs and increases the likelihood of successful outcomes.

  • Conducting Literature Reviews:
    • Systematically reviewing academic research papers, case studies, and technical reports relevant to your business problem.
    • Example: Selecting predictive models for customer churn based on successful telecommunications industry studies.
  • Industry Benchmarking and Best Practices:
    • Adopting widely recognized standards and models validated through industry-specific benchmarks and best practices.
    • Example: Employing financial forecasting models like ARIMA or LSTM, established as industry standards in financial services.
  • Utilizing Public Competitions and Platforms:
    • Leveraging data science competition platforms (e.g., Kaggle) and model repositories (e.g., Hugging Face, Scikit-learn) to identify effective models for similar tasks.
    • Example: Selecting high-performing recommendation system models from public competitions for implementation in retail platforms.

By incorporating prior knowledge and proven academic research into your model selection strategy, your organization can confidently select effective and reliable machine learning solutions, enhancing overall project success.

Once you have filtered models based on the previous steps, finalize the set of candidate models by:

  1. Consulting Prior Knowledge: Use best practices from past projects or industry standards.
  2. Academic Research: Review recent papers and research for your problem domain.
  3. Benchmarking Libraries: Use platforms like Hugging Face Model Hub, Scikit-learn model selection guide, or Kaggle competitions to find popular models for similar problems.

Before we dive into the training and evaluation stages, it's important to reflect on the critical role model selection plays in the machine learning pipeline. At this stage, we haven't written a single line of code or trained any models—yet the decisions made here lay the groundwork for everything that follows. By thoroughly analyzing your data, understanding your business objectives, assessing computational constraints, and drawing from both experience and research, you’re ensuring that you're not just building any model—but the right model.

Whether you're a business leader concerned with outcomes or a technical lead focused on performance, this step helps bridge the two worlds. It transforms raw business needs into a clear, strategic plan for predictive modeling. And while model selection is not glamorous, it’s foundational. The smarter and more systematic we are here, the less friction we’ll encounter later in training, deployment, and long-term maintenance.

With the candidate models chosen and aligned with real-world constraints and expectations, we are now ready to move forward—to put our models to the test.

Data Science
WDIS AI-ML Series: Module 2 Lesson 6: Model Selection and Evaluation Metrics
Vinay Roy
09th May 2025
Introduction
In most practical applications, data scientists often have a set of ML models that can be applied to solve a problem. Data scientists run a set of ML models and see which ones perform the best. This is called Racing ML models against each other to choose a winner.
Table of Contents

Key Takeaways

WDIS AI-ML Series: Module 2 Lesson 6: Model Selection and Evaluation Metrics

In most practical applications, data scientists often have a set of ML models that can be applied to solve a problem. Data scientists run a set of ML models and see which ones perform the best. This is called Racing ML models against each other to choose a winner.

Let us understand this with an Example of a business problem. Suppose, we want to do customer churn modeling. The model will predict which customers are likely to churn. We collect data on Customer product usage patterns, customer historical interactions with the customer services team or support team, and customer demographics among other data that might be needed. Now we have multiple potential Models that can be used to solve this business problem. Some possible candidates are:

Logistic Regression: A simple and interpretable model, which is great for binary classification tasks, in this case Churn or Retained.  However, this model may struggle with complex relationships between features.Random Forest: It is an ensemble method, which we will discuss in more detail later on in the guide, that combines multiple decision trees. The model is great at handling non-linear relationships and feature interactions effectively. However, this can be computationally expensive for large datasets.

Gradient Boosting Machine (GBM): GBM is another ensemble method that iteratively trains models on the errors of previous models. A widely used model, often achieves high performance but can be prone to overfitting.

Support Vector Machine (SVM): SVM is effective for high-dimensional data and non-linear boundaries but it can be computationally expensive for large datasets.

Neural Network: Finally we can also leverage NN if we have very large amount of data set. This powerful model is used to understand complex patterns on extremely large datasets. However, This model requires careful tuning of Hyperparameters and can be computationally intensive.

Given so many choices, which one should we use for our business problem to model churn? A simple framework on how we model this is given below.

Fig: Framework for racing multiple ML models to identify a winning model

We have already discussed Steps 1 - 3 in the Module 2 Lesson 1 and Module 2 Lesson 3.

Now, let us focus on Steps 4 to 8 in the above figure to gain a deeper understanding of the process of model selection and model evaluation.

Some questions that we will answer in this article are:

  1. How do we determine which models to use for the race?
  2. How do we know which model is the winner?
  3. How can we improve a model to improve its chances of winning?
  4. How do we know the winner will stay the winner?

Let us dive right into it.

5.4.1. How to find a set of candidate models to run (Step 4)?

Now let us discuss the crux of this lesson, how do we know which models to choose from so many possible models out there?While the actual answer depends upon your business problems, the below Framework is a good starting guide.

5.4.1.a. Understand your data - The first step is to get a deep dive into the available data. Data scientists call this step - Exploratory Data Analysis (EDA).

This is the most crucial first step in the machine learning workflow, allowing data scientists and analysts to deeply understand the characteristics, quality, and structure of their dataset.

EDA involves visualizing data distributions, identifying missing values, detecting outliers, and understanding feature relationships. This systematic investigation aids in making informed decisions about feature selection, engineering, and appropriate model choices, thus significantly influencing the model's ultimate performance and accuracy. As you do EDA, you will also learn a few things such as:

5.4.1.a. i. Is the Data Labeled or Unlabeled: Labeled data contains both input variables (features) and output variables. For example, not only do we have customer attribute data for our SaaS application but also data on which customers have left (churned). If so, then we can run supervised learning tasks on the data. Unlabeled data that consists solely of input variables without an explicit target can either be converted into Labeled Data or can only be used to run unsupervised learning.

  • Examples:
    • Labeled (Supervised): Customer churn prediction, fraud detection.
    • Unlabeled (Unsupervised): Customer segmentation, anomaly detection.

5.4.1.a. ii. Is the Output Variable Continuous or Discrete: The output variable  (also known as the dependent variable or the response variable) is the outcome that a model is trained to predict. It determines the type of machine learning approach. We can expand the tree above to include the type of Output variable.

  • If the Data is labeled and Output Variable is
    • Continuous (Regression Model): Applicable only when labeled data with numeric target values are available. Examples: Forecasting quarterly sales, predicting stock prices, and estimating real estate market trends.
    • Output Variable is  Discrete (Classification Model): Applicable only when labeled data with categorical target values are present. Examples: Predicting customer churn, categorizing emails as spam or legitimate, and diagnosing water pipe conditions for a water utility company.
  • If the Data is unlabeled, then there are no predefined output variables; we can only run an unsupervised model. These unsupervised models, such as clustering or dimensionality reduction, are utilized to uncover inherent patterns or groupings within the data.
    • Clustering aims to group data points into distinct clusters based on similarity or patterns within the data. It helps uncover inherent groupings without prior labels and is useful for tasks like customer segmentation, market analysis, and anomaly detection. Common clustering methods include K-means, DBSCAN, and Hierarchical clustering.
    • Dimensionality Reduction focuses on reducing the number of variables or features in the dataset while preserving its essential structure and information. It simplifies data representation, reduces noise, and improves computational efficiency, enabling clearer data visualization and better performance in subsequent modeling. Popular dimensionality reduction techniques include Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE).
    • Association Rule Learning to discover interesting relationships between variables in large datasets. Example applications: Market basket analysis (identifying products frequently bought together).
    • Generative Models to learn the underlying distribution of data to generate new, synthetic instances similar to the training data. Notable algorithms include Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). Example applications: Image synthesis, data augmentation.

So, what is the difference between Clustering (Unsupervised Model) and Classification (Supervised Model)? Classification models are supervised models, or in other words, data is labeled, so the names of the classes or labels are already known. In the case of an unsupervised model, on the other hand, the data is unlabeled, so we do not know the names of these classes; hence, we call them clusters. Once we have clusters, we can name those clusters, but initially, we do not know the names of these classes.

5.4.1.a. iii.  Data Volume vs Model Type: The relationship between data volume and model complexity significantly affects model selection. Depending upon the size of the data, one method may be more preferred than the other as shown below:

  • Small Data (<10k samples):
    • Simpler, interpretable models, such as Logistic Regression, Decision Trees, Naive Bayes, and K-Nearest Neighbors, are typically preferred due to their stability and lower risk of overfitting.
    • Examples: Diagnosing diseases from limited patient records, credit approval based on historical loan data.
  • Medium Data (10k - 1M samples):
    • Intermediate complexity models, such as Random Forests, Support Vector Machines (SVMs), Gradient Boosting algorithms (XGBoost, LightGBM),  can be beneficial, striking a balance between interpretability and predictive power.
    • Examples: Predicting customer churn rates in telecommunications, forecasting regional sales data.
  • Large Data (>1M samples):
    • Complex models, such as  Deep Neural Networks, advanced Gradient Boosting Machines,  leveraging large-scale data, can effectively capture intricate patterns without substantial risk of overfitting.
    • Examples: Image and speech recognition, language modeling for virtual assistants, high-frequency trading algorithms in finance.

5.4.1.a. iv.  Imbalanced Datasets: Class imbalance occurs when one class significantly outnumbers another class in a dataset, resulting in biased model predictions that overly favor the majority class. Proper handling of class imbalance is essential for developing accurate predictive models that effectively detect rare but critical events, reducing false negatives.Example: Fraud detection datasets often have fewer than 1% fraudulent transactions. Suitable approaches include Random Forests with balanced weights, XGBoost with specialized loss functions, and SMOTE oversampling to improve model sensitivity towards detecting rare fraudulent activities.Some techniques used for this purpose are

  • Oversampling: Increasing the number of minority class samples through synthetic data generation, such as SMOTE (Synthetic Minority Oversampling Technique).
  • Undersampling: Reducing the number of majority class samples to balance classes.
  • Balanced Class Weights: Adjusting model algorithms (e.g., Random Forest, XGBoost) to give higher importance to the minority class during training.
  • Customized Loss Functions: Using specialized loss functions that penalize misclassification of the minority class more severely, particularly effective in models like XGBoost.
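
To make the idea of balanced class weights concrete, here is a minimal sketch using only the Python standard library. It applies the common heuristic weight = n_samples / (n_classes × class_count), which is the same formula scikit-learn uses for class_weight="balanced". The function name and the toy fraud labels are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    weight(c) = n_samples / (n_classes * count(c)), so a rare class
    receives a proportionally larger weight during training.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy dataset (illustrative): 1% of transactions are fraudulent
labels = ["legit"] * 990 + ["fraud"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each fraudulent example counts roughly a hundred times as much as a legitimate one, so the two classes contribute equally to the training loss overall.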

5.4.1.a. v.  Time-Series Data: Time-series data consists of observations collected sequentially over time, with inherent temporal dependencies. Proper analysis ensures accurate forecasting and captures seasonality, trends, and cyclic patterns, critical for informed business decision-making.

Examples are financial market forecasting (stock prices, market trends), inventory and demand forecasting for retail and supply chain management, and energy consumption prediction for utility companies.

Some Models and Methods suitable for Time Series Data are:

  • ARIMA (AutoRegressive Integrated Moving Average): Suitable for linear time-series data exhibiting consistent patterns.
  • Prophet: Developed by Facebook, ideal for handling complex seasonalities, trends, and missing data.
  • Long Short-Term Memory (LSTM) Neural Networks: Effective for modeling non-linear and complex temporal dependencies and capturing long-range dependencies in sequential data.
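
Whichever of these models you shortlist, the temporal dependencies mentioned above have one practical consequence: candidates should be compared on data that comes after the data they were trained on, not on a random shuffle. The sketch below shows one common way to do that, an expanding-window split. The function name and parameters are hypothetical, chosen for illustration.

```python
def expanding_window_splits(n_obs, n_folds, test_size):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Each fold trains on everything before its test window, so the model
    is never evaluated on observations older than its training data.
    """
    first_test_start = n_obs - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough observations for the requested folds")
    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Example: 12 monthly observations, 3 folds, 2 months per test window
splits = list(expanding_window_splits(n_obs=12, n_folds=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train on months 0..{train_idx[-1]}, test on months {test_idx}")
```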

5.4.1.a. vi.  Missing Values Handling: Some Models are naturally robust to missing data, such as:

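  • Decision Trees and Random Forests: Depending on the implementation, these models can handle missing data directly, for example by using surrogate splits or by treating missingness as a separate category. Not every library supports this, so check the one you plan to use.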
  • Gradient Boosting Methods (e.g., XGBoost, LightGBM): Internally manage missing values during the model-building process.

While other Models require explicit imputation:

  • Linear Models (e.g., Linear and Logistic Regression): Require datasets without missing values, hence necessitating imputation.

Examples: Customer datasets with incomplete profiles (demographic data missing). Healthcare records where certain patient measurements might be intermittently missing.
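
For models that need complete data, the simplest form of explicit imputation is to replace each missing value with the mean of the observed values in that column. The sketch below assumes missing entries are represented as None; the function name and the sample ages are illustrative only. In practice, the median or a model-based imputer may be a better choice.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Illustrative customer ages with two missing entries
ages = [34, None, 51, 29, None]
print(impute_with_mean(ages))
```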

5.4.1.a.vii.  Outliers in Data: The presence of outliers may affect your model choice, because models differ in how sensitive they are to extreme values:

  • Sensitive to outliers: Linear Regression, Logistic Regression, and Support Vector Machines (SVMs). Their predictions can be heavily skewed by outliers, so the data needs thorough treatment before modeling.
  • More robust to outliers: Decision Trees, Random Forests, and Gradient Boosting Methods (e.g., XGBoost). These models partition the data, which isolates the impact of outliers and typically requires less preprocessing.

Examples: Financial datasets with rare extreme market events affecting forecasting models, Real estate price predictions where certain properties have unusually high or low valuations due to unique features.
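
Before choosing between an outlier-sensitive and an outlier-robust model, it helps to check how many extreme values the data actually contains. One widely used rule of thumb flags points that lie more than 1.5 × IQR beyond the first or third quartile. The sketch below applies that rule with the standard library. The function name and the sample prices are made up for illustration, and the 1.5 multiplier is a convention, not a requirement.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second, third quartile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Illustrative property prices (in thousands); one listing is unusually high
prices = [310, 295, 330, 305, 320, 315, 300, 2_500]
print(iqr_outliers(prices))
```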

5.4.1.b. Assess business objectives: Model Performance vs Interpretability

When selecting a machine learning model, businesses must carefully balance model performance against interpretability. This decision significantly affects stakeholder trust, regulatory compliance, transparency in decision-making, and overall business outcomes.

  • Model Performance: Refers to the accuracy, efficiency, and effectiveness with which a model makes predictions. High-performing models accurately predict outcomes, thus providing tangible business value.
  • Interpretability: Refers to the ability of stakeholders to easily understand and explain how a model arrives at its predictions or decisions. Interpretability builds stakeholder trust and ensures compliance with regulatory requirements by providing transparent decision-making.

  • High-Performance Models (e.g., Neural Networks, Gradient Boosting Machines):
    • Advantages: These sophisticated models typically achieve superior predictive accuracy by capturing complex, non-linear patterns and detailed interactions within extensive datasets. They are particularly effective in situations where accurate predictions directly translate into business value.
    • Disadvantages: These models are often perceived as "black boxes" due to their complex internal decision-making processes, making it challenging to clearly articulate how specific predictions are made. This lack of transparency can create challenges in regulated industries or situations requiring clear explanations for stakeholders.
    • Example Use Cases:
      • Image Recognition: Advanced neural networks effectively recognize images for facial recognition in smartphones, autonomous vehicle navigation systems, or medical imaging diagnostics.
      • Voice Assistants: High-performance models interpret voice commands in products like Siri or Alexa, significantly enhancing user experience.
      • Recommendation Engines: High-performance models such as gradient boosting and deep learning can personalize recommendations on large platforms like Amazon and Netflix, improving user engagement and sales.
  • Interpretable Models (e.g., Decision Trees, Logistic Regression):
    • Advantages: These models provide straightforward, understandable explanations for their predictions, making it easy for stakeholders to trust the decision-making process. Clear interpretability also helps meet regulatory standards, simplifies audits, and strengthens stakeholder confidence.
    • Disadvantages: Interpretable models may lack the ability to handle complex data patterns and relationships, potentially resulting in lower accuracy compared to more advanced models, particularly with extensive datasets.
    • Example Use Cases:
      • Healthcare Diagnostics: Decision trees clearly illustrate the reasoning behind patient risk assessments or diagnoses, allowing healthcare professionals to confidently explain their decisions.
      • Financial Credit Scoring: Logistic regression transparently shows which customer attributes influence credit decisions, satisfying stringent regulatory requirements and consumer transparency.
      • Insurance Risk Assessment: Clearly interpretable models allow insurance companies to transparently demonstrate how they calculate risk, premiums, or claims.

Ultimately, choosing between performance and interpretability depends on your organization's strategic goals, compliance requirements, and stakeholder expectations. Business leaders must carefully assess these factors to select the most suitable model that aligns with their specific needs and business environment.
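
To illustrate what interpretability looks like in practice, consider logistic regression. Each coefficient can be converted into an odds ratio by exponentiating it, which gives stakeholders a direct statement of how a feature moves the odds of the outcome. The coefficients and feature names below are hypothetical and not taken from a real credit model.

```python
import math

# Hypothetical coefficients from a fitted credit-default model (illustrative only)
coefficients = {
    "late_payments_last_year": 0.85,
    "years_with_bank": -0.20,
    "has_savings_account": -0.60,
}


def odds_ratios(coefs):
    """Convert log-odds coefficients into odds ratios via exp(beta)."""
    return {name: math.exp(beta) for name, beta in coefs.items()}


for name, ratio in odds_ratios(coefficients).items():
    # A ratio above 1 raises the odds of default; below 1 lowers them
    print(f"{name}: a one-unit increase multiplies the odds of default by {ratio:.2f}")
```

A neural network offers no equally direct reading of its weights, which is the trade-off this section describes.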

5.4.1.c. Assess Computational Resources

Understanding and evaluating available computational resources is essential when selecting machine learning models. Computational resources refer to the processing power, memory capacity, storage availability, and specialized hardware (like GPUs and TPUs) that your organization can allocate for running machine learning algorithms.

  • Limited Computational Resources:
    • Characteristics: Small businesses or startups with basic IT infrastructure.
    • Preferred Models: Simple and computationally efficient models such as Logistic Regression and Decision Trees.
    • Examples: Real-time customer service chatbots or straightforward predictive analytics tasks for smaller datasets.
  • Moderate Computational Resources:
    • Characteristics: Medium-sized enterprises with standard computing resources such as high-performance CPUs and small-scale cloud solutions.
    • Preferred Models: Intermediate complexity models like Random Forests, Support Vector Machines, and Gradient Boosting methods.
    • Examples: Predictive maintenance in manufacturing, market segmentation analysis for retail operations, and targeted marketing campaigns.
  • Advanced Computational Resources:
    • Characteristics: Large enterprises or tech-centric organizations with extensive IT infrastructure including GPUs, TPUs, and advanced cloud computing capabilities.
    • Preferred Models: Complex algorithms, such as Deep Neural Networks and Transformer-based models (e.g., BERT, GPT).
    • Examples: Complex natural language processing applications, real-time recommendation systems for large e-commerce platforms, and advanced fraud detection systems in financial services.

Aligning the complexity of selected models with your available computational resources ensures efficient use of budget, time, and infrastructure, optimizing overall business outcomes.
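
A quick back-of-the-envelope estimate can show whether a dataset even fits in memory before any model is considered. The sketch below assumes a dense numeric table stored as 8-byte floating-point values. The function name and the 200-feature example are hypothetical, and real memory use during training is usually several times higher than the raw data size.

```python
def dense_matrix_gib(n_rows, n_features, bytes_per_value=8):
    """Approximate in-memory size of a dense numeric table, in GiB."""
    return n_rows * n_features * bytes_per_value / 1024**3


# Illustrative: the same 200-feature table at three different scales
for n_rows in (10_000, 1_000_000, 100_000_000):
    size = dense_matrix_gib(n_rows, 200)
    print(f"{n_rows:>11,} rows x 200 features ~ {size:,.2f} GiB")
```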

5.4.1.d. Select Appropriate Models based on prior knowledge or academic research for similar data and problem space

Leveraging existing industry insights, academic research, and prior experiences can significantly streamline and enhance the model selection process. This approach reduces experimentation costs and increases the likelihood of successful outcomes.

  • Conducting Literature Reviews:
    • Systematically reviewing academic research papers, case studies, and technical reports relevant to your business problem.
    • Example: Selecting predictive models for customer churn based on successful telecommunications industry studies.
  • Industry Benchmarking and Best Practices:
    • Adopting widely recognized standards and models validated through industry-specific benchmarks and best practices.
    • Example: Employing financial forecasting models like ARIMA or LSTM, established as industry standards in financial services.
  • Utilizing Public Competitions and Platforms:
    • Leveraging data science competition platforms (e.g., Kaggle), model repositories (e.g., Hugging Face), and library documentation (e.g., Scikit-learn) to identify effective models for similar tasks.
    • Example: Selecting high-performing recommendation system models from public competitions for implementation in retail platforms.

By incorporating prior knowledge and proven academic research into your model selection strategy, your organization can confidently select effective and reliable machine learning solutions, enhancing overall project success.

Once you have filtered models based on the previous steps, finalize the set of candidate models by:

  1. Consulting Prior Knowledge: Use best practices from past projects or industry standards.
  2. Academic Research: Review recent papers and research for your problem domain.
  3. Benchmarking Libraries: Use platforms like Hugging Face Model Hub, Scikit-learn model selection guide, or Kaggle competitions to find popular models for similar problems.

Before we dive into the training and evaluation stages, it's important to reflect on the critical role model selection plays in the machine learning pipeline. At this stage, we haven't trained a single model, yet the decisions made here lay the groundwork for everything that follows. By thoroughly analyzing your data, understanding your business objectives, assessing computational constraints, and drawing from both experience and research, you ensure that you are building not just any model, but the right model.

Whether you're a business leader concerned with outcomes or a technical lead focused on performance, this step helps bridge the two worlds. It transforms raw business needs into a clear, strategic plan for predictive modeling. And while model selection is not glamorous, it’s foundational. The smarter and more systematic we are here, the less friction we’ll encounter later in training, deployment, and long-term maintenance.

With the candidate models chosen and aligned with real-world constraints and expectations, we are now ready to move forward—to put our models to the test.

About the author:
Vinay Roy
Fractional AI / ML Strategist | ex-CPO | ex-Nvidia | ex-Apple | UC Berkeley