Does data science have a future?

Why the data scientist has to hand over tasks

Against the background that self-learning algorithms are getting better and better at processing data, the question arises whether data scientists will still be needed in the future. Can't data sets also be prepared automatically for machine learning models? This concern is not limited to newcomers who would like to use a machine learning solution and are wondering whether they should simply pour their data directly into existing algorithms. Data scientists themselves, as well as experts in related disciplines such as data analytics, are also discussing what automated machine learning (AutoML) actually means for their future. Where are the limits and possibilities of automation?

Data scientists themselves like to use AutoML, at least in proof-of-concept (PoC) phases, but that does not mean that algorithms threaten their job. It will most likely change only a little in the years to come. To get a clearer picture of the extent to which data science is already threatened with disruption by AutoML algorithms, it is worth taking a look at the individual tasks a data scientist typically goes through in a project.
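To make this concrete, the following is a minimal sketch of what such a PoC could look like in Python. Here, scikit-learn's pipeline and grid search merely stand in for a full AutoML tool, and the data set and column names are purely illustrative assumptions.

```python
# Minimal sketch of an AutoML-style proof of concept (PoC).
# scikit-learn's Pipeline + GridSearchCV stands in for a dedicated AutoML tool;
# the file "orders.csv" and the target column "churned" are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("orders.csv")                      # assumed, already numeric data set
X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])
search = GridSearchCV(
    pipeline,
    param_grid={"model__n_estimators": [100, 300], "model__max_depth": [None, 10]},
    cv=5,
)
search.fit(X_train, y_train)
print("PoC accuracy:", search.score(X_test, y_test))
```

A few lines like these are often enough to judge whether a forecast is feasible at all; what they do not answer is whether the data going in makes sense.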

Understand users and respond to them

The first step in any data science project is to talk to the future users and understand where the problem lies. This even applies to ML applications that can be implemented and used without adaptation. Until the data scientist understands what the problem is or which process is to be optimized, no sensible solution can be offered. It is just as important to later convey the resulting solutions and findings to the users in an understandable way. To do this, questions like these need to be answered:

  • Was the PoC successful?

  • What changes are recommended to make the forecast more accurate?

  • Where are the biggest bottlenecks in the processes?

It is the task of the data scientist to analyze and understand the business processes concerned, including any implications (e.g. effects on other departments). For the foreseeable future, this cognitive task is impossible to automate.

Consolidate data

Before the exciting data science tasks can really start, the data must first be put into a usable state. This, too, means entering into a dialogue with the user or customer. It is necessary to agree on a form of data access, to connect to the different source systems, to link the data and, above all, to filter it intensively. Some of these steps can indeed be automated. In particular, loading data from many different sources has become significantly less complicated in recent years. However, manual effort is still required, because human understanding is needed to know which data is stored where and what it means. The same applies to linking the data.
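A rough sketch of this consolidation step in Python might look like the following; the source systems, file names, join keys and filter criteria are assumptions chosen only for illustration.

```python
# Sketch of consolidating data from several source systems.
# All file names, table names and join keys are hypothetical.
import sqlite3

import pandas as pd

orders = pd.read_csv("erp_orders.csv")                # flat-file export from an ERP system
sensors = pd.read_parquet("sensor_readings.parquet")  # machine data from a data lake

with sqlite3.connect("crm.db") as conn:               # relational source system
    customers = pd.read_sql("SELECT customer_id, segment FROM customers", conn)

# Linking the data: a human still has to know which keys belong together.
merged = (
    orders
    .merge(customers, on="customer_id", how="left")
    .merge(sensors, on="machine_id", how="left")
)

# Intensive filtering: keep only completed orders from the relevant period.
merged = merged[(merged["status"] == "completed") & (merged["order_date"] >= "2022-01-01")]
```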

What remains impossible to automate, on the other hand, is the filtering of the data, the so-called plausibility check. For the success of a project, it is fundamentally important to check the data to ensure that it meets the expected specifications. Data scientists know from experience: it almost never does. Sensors do not always work reliably, stamping times almost always have quality problems, real orders are mixed up with subcontracted orders from the extended workbench, or end customers are labeled "second reminder" even though they never received an invoice or even used a free service.

Most of these data errors cannot be detected automatically because an algorithm lacks the context for this assessment. Everyone understands at first glance that a customer cannot be sent a reminder if there was nothing to pay in the first place. Someone would first have to teach this rule to an algorithm.
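The following sketch shows what teaching such rules to an algorithm could look like in practice; the thresholds, column names and the reminder rule are hypothetical domain assumptions.

```python
# Sketch of rule-based plausibility checks. Every threshold, column name and
# business rule below is an assumed example of domain knowledge that a human
# has to supply explicitly.
import pandas as pd


def plausibility_report(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows that violate domain expectations instead of silently dropping them."""
    issues = pd.DataFrame(index=df.index)
    # Sensors do not always work reliably: physically implausible readings.
    issues["sensor_out_of_range"] = ~df["temperature_c"].between(-40, 120)
    # Stamping times: clocking out before clocking in is a classic quality problem.
    issues["negative_duration"] = df["clock_out"] < df["clock_in"]
    # A customer cannot receive a second reminder without ever getting an invoice.
    issues["reminder_without_invoice"] = (df["dunning_level"] >= 2) & df["invoice_id"].isna()
    return df[issues.any(axis=1)]


# suspicious = plausibility_report(raw_data)  # review the flagged rows with domain experts
```

Each of these checks encodes context that a human had to provide first, which is exactly why this step resists full automation.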

Feature engineering

So-called feature engineering is about processing the raw data in such a way that the ML algorithm can understand it as well as possible. It should be as easy as possible for the algorithm to extract all the information hidden in the data set. Suppose someone wants to predict how successful a movie will be, and the names or IDs of the individual actors in each film are known. The ML algorithm can do little with these IDs. At most, it would memorize the IDs of a few top stars whose participation makes every film a success. A data scientist, however, is able to enrich the information about the actors significantly through feature engineering.

What gender and age are the actors? How successful were the last few films they starred in, both commercially and critically? These and many other factors enable the algorithm to understand whether it is working on a real blockbuster, an art house project or a completely different genre.
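As an illustration, here is a small sketch of how such an enrichment could be expressed in Python; the actor lookup table and its columns are invented purely for the example.

```python
# Sketch of turning opaque actor IDs into features an ML algorithm can use.
# The films and actors tables and their columns are illustrative assumptions.
import pandas as pd

films = pd.DataFrame({
    "film_id": [1, 2],
    "actor_ids": [[101, 102], [103]],
})
actors = pd.DataFrame({
    "actor_id": [101, 102, 103],
    "age": [45, 31, 58],
    "avg_revenue_last_3_films": [120e6, 40e6, 15e6],
    "avg_critic_score_last_3_films": [7.1, 6.4, 8.0],
})

# Replace raw IDs with aggregated information about the cast of each film.
cast = films.explode("actor_ids").merge(actors, left_on="actor_ids", right_on="actor_id")
features = cast.groupby("film_id").agg(
    cast_mean_age=("age", "mean"),
    cast_mean_revenue=("avg_revenue_last_3_films", "mean"),
    cast_mean_critic_score=("avg_critic_score_last_3_films", "mean"),
)
```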

Some of the simpler feature engineering tasks are already automated quite well (one-hot encoding, imputation, etc.). However, these are not the steps with which the quality of the models can be significantly improved. It is much more important to understand the processes behind the data and to incorporate this knowledge into feature engineering. This data enrichment, in combination with data consolidation, is what data scientists spend around 80 percent of their time on and what enables them to generate the greatest added value. Understanding the user's processes and the quality of the data, and making this knowledge usable algorithmically, can only be automated to a very small extent.
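For comparison, the automatable part mentioned above can be captured in a few lines with scikit-learn; the column names are again only assumptions.

```python
# Sketch of the feature-engineering steps that are already well automated:
# imputation and one-hot encoding. Column names are illustrative assumptions.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric = ["budget", "runtime_minutes"]
categorical = ["genre", "country"]

automatable_steps = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
# The domain-driven enrichment described above still has to be built by hand
# before a transformer like this is applied.
```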