CRISP-DM: What is it?
The Most Popular Data Science Lifecycle Framework You’ve Never Heard Of
In a previous post, I made the case that CRISP-DM is the Data Scientist’s secret weapon. But I didn’t go into any detail about what it actually is.
CRISP-DM stands for the CRoss Industry Standard Process for Data Mining. It is a standard process for knowledge discovery consisting of 6 phases that can be applied across a wide range of applications. The 6 phases are Business Understanding, Data Understanding, Data Preparation, Modeling, Model Evaluation, and Deployment.
According to KDnuggets polls in 2007 and 2014, CRISP-DM is the most widely used methodology for data science and analytics projects: over 40% of respondents reported using it.
Overview of CRISP-DM
CRISP-DM is designed to improve the speed, efficiency, and accuracy of data analysis by iterating through 6 phases:
- Business Understanding: Understand the current situation and determine the business goals for the project
- Data Understanding: Gather data sources and data definitions, talk to subject matter experts, conduct exploratory data analysis and data quality checks
- Data Preparation: Select, clean, and format the data, and create any features needed
- Modeling: Select a modeling technique, generate a test design, build one or more models, and assess model performance
- Model Evaluation: Evaluate process and model results in the context of the business problem
- Deployment: Produce deliverables, develop a plan for model monitoring and maintenance
These steps aren’t meant to be followed strictly in order. As the arrows in the standard CRISP-DM diagram indicate, some of the steps lead sequentially from one to the next. However, you will also find yourself iterating between steps. For example, when starting a new project, it is normal to spend a significant amount of time moving back and forth between business understanding, data understanding, and data preparation in order to get a firm grasp on the problem and the data.
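To make the idea of iterating between phases concrete, here is a minimal Python sketch. It models the 6 phases as an enum and checks whether a given walk through them only uses permitted moves. The `ALLOWED_NEXT` map and the `is_valid_path` helper are hypothetical names invented for this illustration, and the specific transitions are an assumption loosely based on the commonly reproduced CRISP-DM diagram plus the back-and-forth described above; they are not an official part of the process.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = "Business Understanding"
    DATA_UNDERSTANDING = "Data Understanding"
    DATA_PREPARATION = "Data Preparation"
    MODELING = "Modeling"
    MODEL_EVALUATION = "Model Evaluation"
    DEPLOYMENT = "Deployment"


# Assumed transitions for illustration only: each phase maps to the
# phases a project might reasonably move to next, including steps back.
ALLOWED_NEXT = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.DATA_UNDERSTANDING, Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.MODEL_EVALUATION},
    Phase.MODEL_EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is an allowed move."""
    return all(nxt in ALLOWED_NEXT[cur] for cur, nxt in zip(path, path[1:]))


# A project that loops back to business understanding once before moving on.
walk = [
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.MODEL_EVALUATION,
    Phase.DEPLOYMENT,
]
print(is_valid_path(walk))  # True

# Jumping straight from business understanding to deployment is not allowed here.
print(is_valid_path([Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT]))  # False
```

In practice, a team would adjust the transition map to match how its own projects actually flow; the point of the sketch is only that the phases form a loop with shortcuts backward, not a one-way checklist.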
History of CRISP-DM
The CRoss Industry Standard Process for Data Mining was the result of special funding by the European Commission (EC) in the late 1990s. The program’s objective was to establish a standard process for data mining, as the name suggests. The project statement notes that the rise in High-Performance Computing and problems with interpreting vast amounts of data had led to a need for a process of knowledge discovery that was “fast, well understood, reliable, and valid across a wide range of applications.”
So, to solve the problem, the EC formed a special interest group to broaden the basis for development and testing without sacrificing the efficiency and effectiveness of a small, tightly-focused consortium. The special interest group would also help facilitate the dissemination and exploitation of the results. The vision was that data warehouse vendors and data mining tool suppliers could exploit the process model to enhance their product and service offerings. The user partners could exploit the results of the project internally to improve business intelligence and decision-making.
The EC saw the business potential and benefit of developing a standardized process for the industry. “[CRISP-DM] will make large data mining projects faster, more efficient, more reliable, more manageable, and less costly. A widely adopted process should foster the development of a multitude of data mining tools which support it, thereby significantly contributing to promoting a profitable use of HPCN technology,” according to the project’s stated objectives.
Conclusion
CRISP-DM is a standardized process for solving real-world problems with data. It is a simple process made up of 6 phases, and you can iterate through these phases in the order that makes sense for your project. While this process hasn’t been updated since its inception, it continues to be the most widely used data science methodology. Moreover, its simplicity makes it adaptable to a wide variety of problems and applications.
The biggest advantage of the CRISP-DM process is that it brings speed and structure to projects by allowing teams to work with data in a more organized and efficient manner. CRISP-DM can also improve the accuracy and speed of data analysis by allowing analysts to work with data in a more standardized format.
Originally published at https://datalabnotes.com on December 13, 2022.