In this article, we will discuss the importance of data and its analysis in designing efficient and effective artificial intelligence (AI). But first, let’s set up an example to help us understand.
A typical problem for a company to solve is anticipating the needs of its customers. For example, an insurer wants to know which insurance coverage would be most suitable for their client. Several factors can consciously or unconsciously influence the client’s choice of insurance (and of insurer!): age, education, health, current and future financial situation, short- and long-term objectives, and so on. With a wide range of clients to serve, it can become difficult for the insurer to keep in mind all possible customer scenarios and recommend the appropriate product. Consequently, an AI could be designed to handle this kind of situation. Assisting the insurer, it could quickly identify the type of client and suggest the insurance and options that would suit them best.
Artificial intelligence is the art of developing a system capable of performing tasks normally performed by humans. How is this done? Just like a child who learns from experience through mistakes and successes, the machine “learns” from the experiences that enable it to accomplish the required task. And just as a child might require the help of a person, so does the AI. In recent years, machine learning methods have proved their effectiveness in “teaching” a system (here, the AI) to perform tasks previously reserved for humans. (To learn more about machine learning, see our blog article: A Non-Technical Guide to Understanding Machine Learning)
And how does one give experience to a machine? Data is information used to characterize a situation, a phenomenon, an element, and so on. It is data, in all its forms, that provides experience to the machine, allowing it to make the correlations necessary to accomplish a task.
If an AI uses the characteristics of the client (age, goals, and so on), it can classify clients into different groups. The AI can then suggest one or more types of coverage depending on what other members of the same group have previously chosen. A person could handle this type of task if the client variables and related products were fairly simple. However, at large scale (think of Amazon selling online), humans need machines: the sheer number of variables, clients, and products requires an AI to handle the task efficiently.
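The group-based suggestion described above can be sketched very simply. The following is a minimal illustration, not a production recommender: the client records, features, and coverage labels are all hypothetical, and the "group" is just the k most similar existing clients.

```python
from collections import Counter
import math

# Hypothetical toy data: (age, income in $k) and the coverage each client chose.
clients = [
    ((25, 40), "basic"),
    ((27, 45), "basic"),
    ((45, 90), "life"),
    ((50, 95), "life"),
    ((52, 100), "life"),
]

def suggest_coverage(new_client, k=3):
    """Suggest the coverage most common among the k most similar clients."""
    by_distance = sorted(clients, key=lambda c: math.dist(new_client, c[0]))
    votes = Counter(coverage for _, coverage in by_distance[:k])
    return votes.most_common(1)[0][0]

# A 48-year-old earning $88k lands closest to the "life" group.
print(suggest_coverage((48, 88)))  # -> life
```

In practice, features would need scaling (age and income live on very different ranges) and the grouping would come from a trained model rather than raw distances, but the principle is the same: clients who look alike tend to choose alike.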
The amount and reliability of the data determine the level of accuracy the AI will have when responding to the business problem. Generally, the larger the amount of data, the better the AI can predict accurately. The collection, distribution, and validation of data are therefore important issues in the creation of solutions involving AI.
But how do we process the data properly so that it can be useful in our AI?
Two words: Data Science
Let’s take a second to define the concept before going further.
Data Science is an interdisciplinary field in which scientific methods, mathematics, statistics, and computer science overlap in order to extract knowledge and insights from data sets. (Source: Wikipedia and simplystats)
To put it simply, Data Science is the art of finding hidden information and choosing what matters. Different associations and causes can exist between the characteristics of a client and their choice of insurance coverage. Using mathematical and statistical tools, along with knowledge already acquired about the context being studied, relationships and correlations can be found between seemingly independent variables. For example, a potential relationship exists between the brand, year, and model of a vehicle and the decision to take replacement cost coverage, but where do we draw the line? A 1998 Civic? A 2010 BMW? Mathematical models can help us predict this. The challenge many companies face today is deciding whether this task should be given to a person or an AI.
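The "where do we draw the line?" question can be made concrete with a one-variable decision stump: scan candidate thresholds and keep the one that best separates the two outcomes. The data below is invented purely for illustration.

```python
# Hypothetical data: vehicle model year and whether the owner took
# replacement cost coverage (newer cars tend to be covered).
data = [
    (1998, False), (2001, False), (2004, False), (2008, False),
    (2012, True), (2015, True), (2018, True), (2021, True),
]

def best_year_cutoff(data):
    """Find the model-year threshold that best predicts taking coverage."""
    years = sorted({year for year, _ in data})

    def accuracy(cutoff):
        # Predict "took coverage" for vehicles at or above the cutoff year.
        correct = sum((year >= cutoff) == took for year, took in data)
        return correct / len(data)

    return max(years, key=accuracy)

print(best_year_cutoff(data))  # -> 2012
```

A real model would weigh brand and model alongside year, but even this tiny sketch shows the core idea: the "line" is not guessed, it is estimated from the data.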
Choosing the right data for AI
Some might ask why it is necessary to understand the context in which AI is applied. If we already have a lot of data, would it not be simpler to use all possible data and let, for example, a neural network train itself, finding the various links needed to accomplish the desired task?
Nope. Here is why.
In more technical terms, too much information means increased variability, and the model becomes unstable. For example, when predicting whether a client will choose a life insurance policy, age and income could each be strongly correlated with the purchase decision, and also with each other. We can then question whether both variables are needed, or how many variables are required at all. If predicting the purchase using income alone is equally accurate, why keep more when you can do the same with less? Identifying the strictly necessary data for machine learning keeps the model simpler and saves time and money.
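One quick way to spot this kind of redundancy is to measure the correlation between candidate variables before modeling. Below is a small sketch using made-up age and income figures; the 0.9 cutoff is an arbitrary choice for illustration, not a universal rule.

```python
import statistics

# Hypothetical client records: age and annual income (in $k) tend to
# move together here, so they carry overlapping information.
ages = [25, 32, 41, 48, 55, 62]
incomes = [38, 52, 68, 80, 91, 99]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(ages, incomes)
print(f"correlation: {r:.3f}")
if abs(r) > 0.9:  # arbitrary threshold for this sketch
    print("Highly correlated: one of the two variables may be redundant")
```

When two features are this strongly correlated, keeping only one often costs little accuracy while making the model easier to train and interpret.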
A short guide to Data Science
Here is a short guide that summarizes where you should start when identifying what data should be used.
Identify and understand the business problem: Knowing what specific need is being addressed is at the core of good data science analysis. With a firm understanding of the problem at hand, you’ll know where to look when searching for data sources.
Understand your data: Having data is good, but you need to know what information it contains to determine whether it can answer the problem at hand. It is important to understand each variable fed into the model and how it impacts the system as a whole. The same applies to the links between the different variables: understanding them makes it possible to better target what you are trying to accomplish.
Prepare your data: This involves cleaning, processing, and filtering. This step is essential before any information is extracted, because it directly impacts accuracy: without a clean data set, results will be poor.
Modeling and evaluation: Using statistical analysis, we want to know whether the pre-established hypotheses hold and whether the information is sufficiently relevant to explain (or predict) the variable of interest. This stage is often perceived as the “black box” of data science.
Deployment: When the hypotheses have been validated and the final model, with the correct inputs, has been found, it can be implemented for use on real-time data. At this point, the model can carry out the task mandated by the business problem.
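The prepare–model–evaluate steps of the guide can be strung together in a few lines. This is a deliberately simplified sketch on invented records: real pipelines would use a held-out test set and a proper learning algorithm rather than a single income threshold.

```python
# Hypothetical raw records: some have missing fields and must be
# cleaned before modeling.
raw = [
    {"age": 25, "income": 40, "bought": False},
    {"age": None, "income": 55, "bought": True},   # missing age -> dropped
    {"age": 45, "income": 90, "bought": True},
    {"age": 30, "income": None, "bought": False},  # missing income -> dropped
    {"age": 52, "income": 100, "bought": True},
    {"age": 28, "income": 45, "bought": False},
]

# Prepare: keep only complete records.
clean = [r for r in raw if r["age"] is not None and r["income"] is not None]

# Model: a deliberately simple rule -- predict "bought" when income
# exceeds the mean income of the cleaned records.
threshold = sum(r["income"] for r in clean) / len(clean)
predict = lambda r: r["income"] > threshold

# Evaluate: accuracy of the rule on the cleaned data.
accuracy = sum(predict(r) == r["bought"] for r in clean) / len(clean)
print(f"threshold={threshold:.1f}, accuracy={accuracy:.2f}")
```

Even this toy version shows why the steps are ordered as they are: the model and its evaluation are only as trustworthy as the cleaning that precedes them.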
Before you go.
Always remember the principles of good data science when designing artificial intelligence. The famous phrase “You are what you eat” is befitting when speaking of AI and its data.
Hopefully, this article has provided you with a better understanding of the principles behind data science and how to apply them properly in the AI and machine learning field. In my next article, we will go through the process of identifying a problem, selecting the data, and cleaning it in preparation for ML/AI, using a real data set! Subscribe to our blog to know when it goes live.
As a young data scientist, I am very interested in scientific research and especially in the statistical and mathematical methods surrounding this field. I’d be happy to hear your comments and questions and discuss them further. You can follow me or contact me on LinkedIn, Twitter or directly by email.