Enhancing Machine Learning Workflows with Large Language Models: A Hybrid Approach


Large Language Models (LLMs) like GPT-4 are considered a foundational base of an intelligent system upon which other layers can be built. However, they could also add immense value when integrated with existing Machine Learning (ML) workflows to augment their outcomes. By embedding LLMs within traditional ML pipelines, we can leverage their advanced language understanding and contextual awareness to significantly improve performance, flexibility, and accuracy while ensuring the reliability and baseline performance of the existing pipeline. 

This approach maximizes the strengths of traditional models and LLMs. It opens up new possibilities for AI applications - making existing systems more innovative and responsive to real-world data nuances. The cherry on top is that there is no added effort to completely re-architect the existing ML architecture to integrate the LLM intelligence.

Traditional Machine Learning Workflow

Traditional ML models, particularly binary and multi-class classifiers, form the backbone of numerous ML applications in production today. A binary classifier differentiates between two classes, like distinguishing between spam and non-spam emails. In contrast, a multi-class classifier can identify multiple categories, such as categorizing news articles into sports, politics, or entertainment. These models learn from vast labeled datasets to make their predictions.

Key Concept: Thresholding in Classification

Since these ML models are trained on classification tasks, they output probabilities for the input belonging to each of the possible classes, which collectively add up to 1. For instance, consider a system with an existing AI model designed for property management (PM) that responds to a prospect’s questions about a specific unit based on existing information in the PM system. This AI model provides two outputs. First, it generates an answer to the question posed by the prospect. Second, it delivers a probability score reflecting confidence in its proposed answer. This is a classification score, where the AI model outputs the probabilities of being confident versus non-confident in the answer it has come up with. 

For instance, the model might output a probability of 0.8 corresponding to its confidence, indicating that it is moderately sure of its response. However, perhaps we want to be cautious and send out the AI model’s generated response only if we are highly confident. This ensures our prospects receive answers that are very likely to be accurate. If the model is not confident enough in the answer it has come up with, we might instead want to have a human review it. To address this, we might set a high confidence threshold and only allow the generated message to be sent to a prospect if the probability of being confident is above, say 0.95.

Deciding the threshold value for decision-making is a critical technique used in classification tasks. It involves setting a cutoff point to differentiate between different classes. This threshold is crucial as it balances precision (avoiding false positives) and recall (identifying true positives). The relative real-world cost of false positive vs false negative predictions is highly domain-specific and dictated by business needs, for which we need to strike an optimal balance by adjusting the threshold.

Incorporating LLMs in the Prediction Stage

The threshold cut-off for an ML model is decided by just one value and only changes rarely once it is decided upon. This is where LLMs can come in –  and act as an extra layer of more intelligent, dynamic, and contextual thresholding. LLMs have an innate ability to understand and interpret complex, nuanced contexts. With their advanced natural language processing capabilities, LLMs can examine the contextual intricacies within the data, leading to more contextually informed and dynamic thresholding decisions.


Tying it back to our AI model example where a response is produced for the prospect’s message in addition to a confident/non-confident set of probability scores for that generated message, instead of classifying a quite confident response with probability 0.94 as not confident (since it is less than a conservative static threshold of  0.95), we could send all responses in a specific confidence range (for example, 0.9 to 0.95) to an LLM and ask it if it is an appropriate message to send or not. This range entails requests for which the model is decently confident but not confident enough to surpass the threshold value. This hybrid system has various advantages:

  • It uses well-trained, pre-existing, and reliable deep-learning models to make classifications that yield accurate results.

  • An ensemble of the existing ML model and LLM uses the general reasoning capabilities of the LLM in conjunction with the task-specific abilities of the existing ML model, avoiding dependence on the exact static threshold value for the outcome.

  • It embeds intelligent verification using LLMs for cases close to the threshold. This could enable better coverage while keeping precision similar.

  • Although an LLM call costs more than inference from an in-house, trained ML model, it's still orders of magnitude cheaper than human verification.

In our use case, this approach enabled us to positively classify and automate 75% of cases that were lower than the threshold and would have otherwise been classified negatively. We achieved this increase in recall while maintaining a similar precision rate, thereby not affecting the quality of positive outcomes while increasing their volume. 

At the same time, it is also essential to consider potential trade-offs that this system ends up making:

  • Integrating LLMs into existing pipelines increases the system's complexity, potentially making it more challenging to maintain, debug, and improve the models over time.

  • The use of LLMs, especially in real-time applications, introduces latency, and processing the additional layer of LLM analysis can slow down the response time, which might be critical in time-sensitive scenarios.

  • There is a dependency on the providers of LLMs for data privacy policies, pricing, security, and performance. Ensuring compliance with data protection regulations becomes extremely important and complicated.

Hybrid ml workflow: Flowchart of how an existing machine learning model is augmented with a large language model to ensure quality responses in a dynamic, context-aware system

Integrating LLMs into traditional ML workflows offers a balanced approach, combining existing models' reliability with LLMs' contextual intelligence. However, there is no free lunch. Organizations must weigh the above-discussed challenges against the potential benefits when considering adopting a hybrid system involving LLMs. Regardless, this emerging hybrid system promises to enhance AI applications, making them more adaptable and responsive to real-world complexities.

Authors and Contributors: Aditya Lahiri, Brandon Davis, Michael Leece, Dimitry Fedosov, Christfried Focke, Tony Froccaro