State of Fast Feedback in Data Science Projects

Let’s talk about productivity in Data Science and Machine Learning (DSML) projects.

DSML projects can be quite different from software projects: a lot of R&D in a rapidly evolving landscape, working with data, distributions and probabilities instead of code. However, there is one thing in common: an iterative development process matters a lot.

For example, in software engineering, rapid iterations help a lot when debugging complex issues or working through a tricky problem. In product development, the ability to rapidly roll out a new version can be decisive for achieving customer satisfaction. Paul Graham eloquently covers that in his essay “Beating the Averages”.

Likewise, in Data Science and Machine Learning projects, iterations help data scientists to rapidly test their theories and converge towards the solution that will create value. If we assume that 87% of data science projects fail (which looks about right to me), then having a fast feedback loop could help to get to the successful 13% faster.

Yet, there is a problem in the industry with that.

Let’s use a basic data science pipeline as an example. It will have a predefined structure to make collaboration between different teams in the department easier.

There will be the following steps:

  1. Initialise the pipeline run, deriving any per-run variables from the initial config

  2. Load and prepare the training data

  3. Perform model training

  4. Evaluate the model on a separate dataset

  5. Prepare the model for use

  6. Run batch prediction against the resulting model

The de facto language for such pipelines is Python. We can provide a minimal implementation as a console application and run it locally. On my laptop it takes ~0.3-0.5 sec.

That is good enough.
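To make the six steps concrete, here is a minimal sketch of what such a console application could look like. It is not the implementation that was timed above: the toy linear data, the least-squares "model", and every name in it (`RunConfig`, `run_pipeline`, and so on) are assumptions made up for illustration, and it uses only the standard library.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class RunConfig:
    """Initial config (hypothetical fields, for illustration only)."""
    seed: int = 42
    n_samples: int = 200
    train_fraction: float = 0.8


def init_run(config):
    """Step 1: derive per-run variables from the initial config."""
    return {"run_id": f"run-{config.seed}", "rng": random.Random(config.seed)}


def load_data(config, run):
    """Step 2: build toy data y = 2x + 1 + noise and split it into train/eval."""
    rng = run["rng"]
    xs = [rng.uniform(0, 10) for _ in range(config.n_samples)]
    data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in xs]
    cut = int(config.n_samples * config.train_fraction)
    return data[:cut], data[cut:]


def train(train_set):
    """Step 3: fit a straight line with ordinary least squares."""
    n = len(train_set)
    mean_x = sum(x for x, _ in train_set) / n
    mean_y = sum(y for _, y in train_set) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in train_set)
    var = sum((x - mean_x) ** 2 for x, _ in train_set)
    slope = cov / var
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict_one(model, x):
    """Apply the fitted line to a single input."""
    return model["slope"] * x + model["intercept"]


def evaluate(model, eval_set):
    """Step 4: mean squared error on the held-out dataset."""
    errors = [(predict_one(model, x) - y) ** 2 for x, y in eval_set]
    return sum(errors) / len(errors)


def package(model, run):
    """Step 5: prepare the model for use (here: attach run metadata)."""
    return {"run_id": run["run_id"], **model}


def batch_predict(packaged, inputs):
    """Step 6: run batch prediction against the resulting model."""
    return [predict_one(packaged, x) for x in inputs]


def run_pipeline(config):
    """Run all six steps in order and measure the wall-clock time."""
    started = time.perf_counter()
    run = init_run(config)
    train_set, eval_set = load_data(config, run)
    model = train(train_set)
    mse = evaluate(model, eval_set)
    packaged = package(model, run)
    predictions = batch_predict(packaged, [0.0, 1.0, 2.0])
    elapsed = time.perf_counter() - started
    return {"model": packaged, "mse": mse,
            "predictions": predictions, "elapsed_s": elapsed}


if __name__ == "__main__":
    result = run_pipeline(RunConfig())
    print(f"MSE: {result['mse']:.4f}, elapsed: {result['elapsed_s']:.3f}s")
```

Each step is a plain function, so a single step can be rerun or swapped out without touching the rest, which is what keeps the local loop fast.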

If the computation overhead of a real pipeline is 5 minutes, then we could run up to 12 iterations in an hour.

However, the industry-standard way to run these pipelines is via Kubeflow (an ML toolkit for Kubernetes). Google Vertex is one of the most stable implementations.

If we map our pipeline components to a Kubeflow pipeline, we’ll get something like this:

How many experiments per hour can we run here?

At this point, the computation overhead doesn’t even matter. Since it takes 33 minutes per run, we can complete only one experiment per hour.

The execution takes roughly 5000x more time on Vertex than it does on a local machine. Although that time is paid compute time, the biggest hit is not financial but the loss of productivity.
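The arithmetic behind these numbers can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 0.4 sec local overhead is an assumed midpoint of the ~0.3-0.5 sec measured above, and `runs_per_hour` is a hypothetical helper name.

```python
SECONDS_PER_HOUR = 3600


def runs_per_hour(overhead_s, compute_s=0.0):
    """How many back-to-back pipeline runs fit into one hour."""
    return SECONDS_PER_HOUR / (overhead_s + compute_s)


LOCAL_OVERHEAD_S = 0.4        # assumed midpoint of the ~0.3-0.5 sec local run
VERTEX_OVERHEAD_S = 33 * 60   # 33 minutes per run on Vertex
COMPUTE_S = 5 * 60            # hypothetical 5 minutes of real computation

# Local: the overhead is negligible next to the 5 minutes of computation.
local = runs_per_hour(LOCAL_OVERHEAD_S, COMPUTE_S)

# Vertex: the overhead alone already dominates the hour.
vertex = runs_per_hour(VERTEX_OVERHEAD_S)

# Ratio of the two overheads for the same minimal pipeline.
slowdown = VERTEX_OVERHEAD_S / LOCAL_OVERHEAD_S

print(f"local:  {local:.1f} runs per hour")
print(f"vertex: {vertex:.1f} runs per hour")
print(f"overhead ratio: {slowdown:.0f}x")
```

Locally this gives about 12 runs per hour. On Vertex it gives about 1.8, which means only one complete run per hour, and the overhead ratio comes out at about 4950x, in line with the rough 5000x figure.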

And that is the most frustrating problem with the state of the data science pipelines today. Major hosting players make more money from less efficient data science pipelines. This might reduce incentives to prioritize performance-improving changes. This in turn negatively impacts the ability of small data science teams to have fast feedback loops and innovate efficiently.
