Between 22.06 and 23.06, the Swiss Conference on Data Science brings together practitioners, learners and the curious to broaden their knowledge and their network within the field of analytics. The first day opens its doors with a series of workshops, ranging from Managing the End-to-End Machine Learning Lifecycle to discovering how to Develop Fair Algorithms. The second day brings a multitude of seminars at the renowned KKL in Luzern, covering the business implementation of machine learning models as well as the future trends of the entire domain.
For Allgeier, the conference offers the opportunity to exchange ideas and keep its analytics team sharp with new business implementations. One of these was the workshop The Full Machine Learning Lifecycle. Developed by Steffen Terhaar, Tim Rohner, Bernhard Venneman, Spyros Cavadias and Roman Moser from the consulting firm D ONE, the workshop dives into a machine learning (ML) case study that practically implements a machine learning operations (MLOps) pipeline from scoping to deployment using open-source tools, essentially showcasing how the DevOps principles from software engineering translate to data science and machine learning.
To give you a brief overview of what this entails: MLOps is an ML engineering practice that aims to unify ML system development (Dev) and ML system operation (Ops). It serves as a counterpart to the DevOps practice in classical software development, which involves Continuous Integration (CI) and Continuous Deployment (CD). Practicing MLOps means advocating automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management.
The Full Machine Learning Lifecycle workshop brought together approximately 20 data scientists and analysts from all over Europe. The journey starts in the Great Hall of the Metropol Hotel in Zurich, where everyone takes their seats. As the projector powers on and the D ONE team takes center stage, the lecturers explain that the workshop case study focuses on building and productionalizing a machine learning model that predicts turbine malfunctions from a dataset produced by Winji.
To understand why this dataset is so interesting, here is a brief overview of what Winji does: the Zurich-based, data-driven company offers a platform that derives AI-based insights from renewable energy assets and environmental factors, providing wind and solar farms with useful recommendations and accurate forecasts to optimize their overall output.
It should be noted that the workshop does not focus on which ML algorithm is implemented. In the end, this is not relevant: from supervised regression to unsupervised clustering models, every application is different. The focus is always on the pipeline in its entirety and the respective steps within the MLOps life cycle.
As the lecturers conclude their explanations and the participants gain a good overview of the provided data, three important facts come to light:
Since the range of open-source tools is so broad, the lecturers define the open-source stack for the participants as follows:
The entire case study is carried out in Python within Visual Studio Code. Since the workshop is limited to five hours, a virtual machine (VM) with predefined parameters is set up with all the necessary dependencies.
Within the VM, data exploration is done in Jupyter notebooks. The data is versioned and kept track of using Data Version Control (DVC). The modelling algorithms come from scikit-learn, and model tracking is done with MLflow. The entire orchestration, i.e. the pipelines, is built using Airflow, and the deployment of the whole operation is made possible with Docker. The beauty of all of this? It's completely free, of course.
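To make the modelling step of such a pipeline concrete, here is a minimal sketch in scikit-learn. This is not the workshop's actual code: the sensor features, the synthetic labels, and the model choice are all invented for illustration; the real case study works on Winji's turbine data.

```python
# Illustrative only: framing turbine-malfunction prediction as a binary
# classification task with scikit-learn on synthetic "sensor" data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Pretend sensor readings: e.g. wind speed, rotor rpm, bearing temperature
X = rng.normal(size=(n, 3))
# In this toy setup, malfunctions correlate with high bearing temperature
y = (X[:, 2] + 0.3 * rng.normal(size=n) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

f1 = f1_score(y_test, model.predict(X_test))
print(f"held-out F1 score: {f1:.2f}")
```

In the MLOps picture, a script like this is just one node of the pipeline: the data it reads would be pulled from DVC, and its outputs would be handed over to the tracking and deployment steps.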
The lecturers make it clear that certain processes could be covered by the same tool. As an example, MLflow does not only track models; it can also be used to orchestrate pipelines and deploy operations. However, to play to each tool's strengths, every step gets its own tool. The next objective is implementing everything we have learned on our own, hands-on.
As the participants take on the use case through predefined exercises, each step is carefully discussed and implemented. The dos and don'ts of every tool are outlined, as well as how the tools complement each other. Each step of the pipeline, and its importance, is showcased in detail. The workshop is guided, but a certain amount of independent thinking is necessary to progress through each subject.
In summary, the workshop gave the participants a great overview of how to orchestrate, and troubleshoot, an entire ML pipeline into production, using an interesting real-world case and open-source tools.
Where could all of this be implemented in practice? First, it should be noted that the workshop uses open-source tools backed by large communities that consistently update and improve them. The keyword is open source. Plenty of startups and smaller businesses lack the financial resources for the managed cloud services provided by Microsoft, Google, or Amazon. Open-source tools offer a great alternative for building ML pipelines free of charge. The downside is that setting up these pipelines, and then maintaining them, requires more time and, depending on the project, might prove problematic in the long run when it comes to scalability. But as a proof of concept (PoC) to verify the potential impact of ML within a business, implementing the steps covered in this workshop can be the stepping stone a business needs to realize the true potential of ML applications.