This blog steps through a process for tackling a data mining and analysis problem using a sample business case and a simple implementation of an industry framework. This is intended as an introduction for those not well versed in the technical details with references and follow-up blogs that dive into topics further.
When approaching a business problem whose solution relies on understanding and making use of the data available to that business, having a framework that enables common terminology and effective communication throughout the project lifecycle is useful.
A popular framework for this, around since the late nineties, is the CRoss Industry Standard Process for Data Mining (CRISP-DM), which breaks the process into six stages. The framework is often modified to fit a data team's approach, with more weight on areas of importance, or broken down into custom substages with artefacts tied to a team's project management and software development tools.
Figure 1. CRISP-DM stage diagram
While these stages seem intuitive when reading through them, depending on the problem in question there may be difficult and complex decisions at each stage, or even the need to revisit previous stages, as with any iteration-based approach. For this example we will use a fictional winery business and its desire to crush the competition, pun intended. We source the data from Kaggle's red and white wine quality datasets.
Grape Expectations Winery has acquired data from a trusted source that contains the chemical composition of wines and their quality rating. They want to use it to increase their revenue and reputation as the Grapest Wine purveyors in the business.
One business initiative they have is to win the local Timboon Region Wine Festival, in which they have consistently placed 2nd and 3rd. They believe that with some data mining and analysis they can gain an edge over the competition.
The success criterion is winning the competition. To do that they need two things: to know how to produce a top-quality wine, and to know which wine to put forward for the competition, as they always produce multiple wines and select one for entry.
System requirements are not complex, as this is a standalone modelling project for now, with a low enough cadence and data volume to be performed on their local systems by some educated employees. This does mean the model and code should be easy for a slightly technical person to use if it is to be deployed; however, for this engagement the data expert will be on board for the entire lifecycle. If deployment is required, that will be a separate business case.
A look at the Data
Figure 2. Sample of the wine quality dataset
The data available is a CSV with a series of quantitative measurements of a particular wine and a rating of its quality by a wine evaluation expert. This brings us to an important assumption to validate with the business.
We assume the judges in the competition this year will have a similar assessment of quality as the judges who produced these quality scores.
If this is not the case then our data may be considered useless as there is no dependency between the target score in the sample data and the target score we are trying to predict.
A simple way to start exploratory data analysis (EDA) is by profiling each column. In this case we use a tool called ydata-profiling to automate this process and generate a visualisation which can serve as a helpful reference to first validate our understanding of what each column is and also provide further insights about the domain, guiding us in the next steps.
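With ydata-profiling, the whole report is a one-liner (`ProfileReport(df).to_file("report.html")`). For intuition, the core of a per-column profile can be approximated in plain pandas; the rows below are toy stand-ins for the Kaggle wine CSV, which you would load locally with `pd.read_csv`:

```python
import pandas as pd

# Toy rows standing in for the Kaggle wine dataset.
df = pd.DataFrame({
    "alcohol":   [9.4, 9.8, 10.5, 11.2],
    "sulphates": [0.56, 0.68, 0.65, None],
    "quality":   [5, 5, 6, 6],
})

summary = df.describe()    # per-column count, mean, std, min/max, quartiles
missing = df.isna().sum()  # null counts per column
print(summary)
print(missing)
```

The generated HTML report adds distributions, correlations and alerts on top of these basic statistics.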
A good start is a correlation plot of the variables. Let's see what contributes highly to quality.
Figure 3. Heatmap of variable correlation, the dark blue indicates positive correlation and red indicates negative correlation. We see alcohol, sulphates and volatile acidity have the highest correlation with quality.
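The heatmap above can be reproduced by ranking each column's correlation with quality; a minimal sketch with illustrative toy data in place of the real CSV:

```python
import pandas as pd

# Toy rows standing in for the wine dataset (load your own with pd.read_csv).
df = pd.DataFrame({
    "alcohol":          [9.4, 9.8, 10.5, 11.2, 12.0, 9.0],
    "volatile acidity": [0.70, 0.88, 0.40, 0.30, 0.28, 0.90],
    "sulphates":        [0.56, 0.68, 0.70, 0.75, 0.80, 0.50],
    "quality":          [5, 5, 6, 7, 7, 4],
})

# Pearson correlation of every column against quality, strongest first.
corr = df.corr()["quality"].drop("quality").sort_values(key=abs, ascending=False)
print(corr)
```

Plotting `df.corr()` with a heatmap library gives the visual form shown in Figure 3.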
What to Focus On
Before looking at the correlations and any other results of the profiling, we need to consider what is worth focusing on. To do this, consider which variables are rigid and which the winery has some control over. For example, we may see a high correlation between pH and quality, but pH may be highly dependent on the altitude the grapes are grown at, which is difficult to change, to say the least. We would then incorporate this into how we approach the problem, perhaps by finding the pH range available to the winery and then filtering the dataset to only wines that fall within that range.
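The filtering step described above is a one-liner in pandas; the pH bounds here are hypothetical, standing in for whatever range the winery can actually achieve:

```python
import pandas as pd

# Toy data; the real dataset has a pH column with similar values.
df = pd.DataFrame({
    "pH":      [3.0, 3.2, 3.5, 3.8, 3.3],
    "quality": [5, 6, 6, 7, 5],
})

# Hypothetical range the winery can achieve (numbers purely illustrative).
PH_MIN, PH_MAX = 3.1, 3.6
achievable = df[df["pH"].between(PH_MIN, PH_MAX)]
print(achievable)
```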
This highlights the importance of linking observations of the data to relevant business processes and objects. The goal of most data mining projects is to influence a decision that is tangible in the real world, which means the data must relate back to the factors in that decision.
For instance, wine density may turn out to be something the winemakers have a lot of control over. Looking further into this, density is highly correlated with alcohol content and several other factors, which makes density itself not the variable to focus on changing, but rather its contributing factors.
What are we Trying to Predict
Assessing the target column indicates that most wines fall in the 5–6 quality range, which means we don't have much information for a model to confidently learn what makes a wine an 8/10 in quality.
Figure 4. Histogram of wine dataset row count by quality
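The imbalance shown in Figure 4 is quick to quantify; toy values below in place of the real target column:

```python
import pandas as pd

# Toy quality column; the real dataset shows the same concentration at 5-6.
quality = pd.Series([5, 5, 6, 5, 6, 7, 6, 5, 8, 6])

counts = quality.value_counts().sort_index()  # rows per quality score
share_high = (quality >= 7).mean()            # fraction of 7+ wines
print(counts)
print(f"7+ share: {share_high:.0%}")
```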
A few factors can influence model and algorithm selection: business context, system requirements and performance. All of these are important to consider during the data exploration phase, which is likely to be revisited many times through the lifecycle. We have only scratched the surface of what can be explored and analysed at this stage; data exploration often takes longer than expected.
At this stage, one might suggest to the winery that a model which accurately predicts 8/10 wines may not be achievable. To simplify the use case and get the most bang for buck, we can convert this into a problem of predicting 7+ quality: the number of candidate wines for the competition is reduced, and the chance of them being of good quality is higher.
The data preparation we are doing will have two stages; common data preparation and model specific data preparation.
Common Data Preparation
Common data preparation focuses on generic data cleaning, such as dealing with nulls, normalisation and feature engineering. We can create some interesting features with this wine dataset: a quick online search suggests that the balance between acidity and sugars impacts the quality of wine, so we can create a new feature equal to the acidity divided by the sugars. After doing this, let's check the correlation analysis again to see whether any of the new features correlate highly with quality.
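The engineered features can be added as simple column arithmetic; the exact formula here (total acidity over residual sugar) is our assumption about the acidity/sugar balance, not domain canon:

```python
import pandas as pd

# Toy rows standing in for the wine dataset.
df = pd.DataFrame({
    "fixed acidity":    [7.4, 7.8, 6.7],
    "volatile acidity": [0.70, 0.88, 0.58],
    "residual sugar":   [1.9, 2.6, 1.8],
})

# Hypothesised balance features: combined acidity and its ratio to sugar.
df["total_acid"] = df["fixed acidity"] + df["volatile acidity"]
df["acid_sugar_ratio"] = df["total_acid"] / df["residual sugar"]
print(df[["total_acid", "acid_sugar_ratio"]])
```

Re-running the correlation analysis on `df` then shows whether the new columns add signal.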
Figure 5. Heatmap of engineered feature correlation, the dark blue indicates positive correlation and red indicates negative correlation.
A few of the engineered features are slightly correlated with quality and may improve our model's output. We also see total_acid and acid/density showing the same characteristics; this could be because density is approximately 1, and normalising the features could be a way of improving this.
Model Specific Preparation
Model-specific preparation includes any encoding required for the classification algorithm we are using (one-hot encoding, for example), or transforming the target column into a binary 0/1 where 1 represents a quality of 7 or greater. In this project we have a single, static and relatively clean dataset, so we do not need much preparation; in general, however, data preparation is one of the most time-consuming stages, particularly if you haven't been thorough with the data exploration.
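The binary target transform mentioned above is a single comparison; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"quality": [4, 5, 6, 7, 8]})

# Binary target: 1 for quality of 7 or greater, else 0.
df["high_quality"] = (df["quality"] >= 7).astype(int)
print(df)
```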
What type of problem is this?
This post focuses more on the CRISP-DM lifecycle and the things one can consider along the way, so diving into which algorithm would actually be best is not the technical focus of this blog. However, I will link the code used so you can dive into the different modelling techniques attempted and described below. As we saw in the data exploration stage, we are trying to predict an integer from 1-10 where most values are 5 or 6. We will try a few techniques and see how they perform. We will also take into account our finding that there are very few high-quality wines to train on, and see whether we can get a better model by reducing the quality value to 1 or 0, with 1 meaning a quality of 7 or higher.
This is similar to the classic Excel trendline you may be familiar with, except we need to round the prediction to the nearest whole number before assessing the results, and we are fitting the data to many variables rather than just the one common in the Excel use. To assess the results we will use a confusion matrix, as it is an easy way to see what the model predicts against the true values of quality. The linear regression results are below:
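The fit-round-evaluate loop can be sketched as follows; the features and labels are synthetic stand-ins for the wine data, not the blog's actual training run:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in: 3 "chemistry" features and integer quality labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.clip(np.round(5 + X[:, 0] + rng.normal(scale=0.5, size=200)), 3, 8).astype(int)

model = LinearRegression().fit(X, y)

# Round continuous predictions to the nearest quality score before scoring.
pred = np.clip(np.round(model.predict(X)), 3, 8).astype(int)
cm = confusion_matrix(y, pred)
print(cm)
```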
Figure 6. Linear regression confusion matrix, predicted vs true.
Let's check out the results after resampling:
Figure 7. Linear regression confusion matrix, predicted vs true, after resampling.
We see this helps pull some of the 7-quality wines up into their correct position, but it also increases how many 5s and 6s are incorrectly predicted higher.
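One simple resampling scheme is naive random oversampling: duplicating rows of the rare quality scores until every class matches the largest one. This is a hedged sketch of that idea (libraries such as imbalanced-learn offer more sophisticated options like SMOTE):

```python
import pandas as pd

# Toy imbalanced data: two 5s, two 6s, one 7, one 8.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.5],
    "quality": [5, 5, 6, 6, 7, 8],
})

# Sample every class (with replacement) up to the size of the largest class.
max_n = df["quality"].value_counts().max()
balanced = df.groupby("quality", group_keys=False).sample(
    n=max_n, replace=True, random_state=0
)
print(balanced["quality"].value_counts())
```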
Next we try multi-class classification: an algorithm that treats each quality value as an independent label and is trained to predict them directly.
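A random forest is used here purely as a common stand-in; the linked code may use a different multi-class algorithm, and the data below is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Synthetic features and integer quality labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = np.clip(np.round(5 + X[:, 0]), 3, 8).astype(int)

# Each distinct quality score becomes its own class.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
cm = confusion_matrix(y, clf.predict(X))
print(cm)
```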
Figure 8. Multi-class model confusion matrix, predicted vs true.
This technique is described in this white paper, and we created a class to implement it in Python.
As mentioned, we can convert the target to simply predict whether a wine is 7 or above in quality. In doing this, we have all the binary classification options available to us; in this example we use a model called XGBoost, optimised for this dataset.
A Holistic and Retrospective Analysis
While in model selection we try to get the model that produces the best output according to standard model evaluation techniques, this evaluation stage is more focused on the business perspective. For our case, we consider all the assumptions we have made, then consider the output of the model and whether it is suitable for influencing a business decision or providing value.
Good quality data is the key to making good business decisions. The switch to a binary prediction of good quality, reducing the number of wines to select from, is valuable, as is the assessment of the important features such as alcohol content, sulphates and volatile acidity.
A good recommendation would be the binary classification model, as it has high recall: of all the 7+ wines, it correctly identifies the most, at the cost of a slightly higher false positive rate. If we assume there is a step after the prediction results are obtained in which experts within the winery assess the identified high-quality wines, then this gives us the best chance of not missing the best candidates.
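The recall/false-positive trade-off falls straight out of the confusion matrix; the counts below are illustrative, not the model's actual results:

```python
# Illustrative confusion-matrix counts for the binary "7+ quality" model.
tp, fn, fp, tn = 18, 2, 10, 70

# Recall: of all true 7+ wines, what fraction did we catch?
recall = tp / (tp + fn)
# False positive rate: fraction of lesser wines wrongly flagged as 7+.
false_positive_rate = fp / (fp + tn)

print(f"recall={recall:.2f}, FPR={false_positive_rate:.3f}")
```

High recall means few strong candidates are missed; the expert tasting step then filters out the false positives.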
Low complexity in this Case
The deployment for this could be as simple as giving the chosen model code/notebook to the winery and educating them on how to input the variables of the wine they want to assess and get a quality prediction, so they can self-serve the analysis. However, some MLOps best practices around cataloguing training data, saving model artefacts and versioning would save time in future.
If they wanted to retrain the model on new data or edit its functionality in any way, this would require some education of how this could be done or a lightweight repetition of the CRISP-DM lifecycle with any new requirements in mind.
This project did not require a complex deployment step, though if it did, we would engage an MLOps lifecycle approach to deploy a model that can be easily validated, maintained, retrained and monitored. This could be a topic for a future post.
We have gone through a lightweight example of using an industry-standard framework to produce a model that can generate business value by identifying high-quality wines as the focus for further assessment. However, we only scratched the surface of the statistical analysis, modelling and business analysis of the problem statement. The aim of this article is to highlight the benefit of breaking down a data project into defined stages and using common terminology, while also revealing the gaps a generic framework may not prescribe artefacts or tasks for; these gaps need to be filled by people with the subject matter expertise and ability to ensure project success.
We also didn’t try some popular classification techniques such as Neural Networks, Naive Bayes or a Support Vector Machine.
Another pertinent point in the design and deployment of an ML system is the importance of being thorough in the early CRISP-DM lifecycle stages, as questions that arise when designing a production system may not have been fully answered during a proof of concept focused only on the design of the model itself.
For instance, when defining a set of rules for model input validation, it is useful to know what assumptions about the data were made for the model to generate relevant and useful outputs. In the wine quality data, we did not explore the constraints on some of the columns, such as pH. The model may equate a high pH and a low level of sulphates with high quality, so if a pH of 7 comes in, the model may indicate a high-quality wine, when in fact a pH of 7 is outside the expected range: the maximum in the training data is 4. A domain expert would also clarify this; a pH of 7 means completely neutral acidity, which doesn't make sense for an alcoholic substance.
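A lightweight version of that validation is to derive per-feature bounds from the training data and warn on anything outside them; the numbers and the `validate` helper here are hypothetical illustrations:

```python
import pandas as pd

# Toy training data; bounds would come from the real training set.
train = pd.DataFrame({"pH": [2.9, 3.2, 3.5, 4.0],
                      "alcohol": [9.0, 10.1, 11.3, 12.8]})

# Derive validation bounds from the observed training ranges.
bounds = {col: (float(train[col].min()), float(train[col].max()))
          for col in train.columns}

def validate(row: dict) -> list:
    """Return a warning per feature that falls outside its training range."""
    return [
        f"{col}={val} outside training range {lo}-{hi}"
        for col, val in row.items()
        for lo, hi in [bounds[col]]
        if not (lo <= val <= hi)
    ]

print(validate({"pH": 7.0, "alcohol": 11.0}))  # the pH of 7 gets flagged
```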
In conclusion, breaking down a data project into defined stages and employing common terminology can bring numerous benefits to its execution. It facilitates clear communication, promotes collaboration and provides a structured approach. However, it's important to acknowledge that individual subject matter expertise, and validating data assumptions and constraints with domain experts, are the key to reaping the full benefits of data-driven initiatives.