wangdatan commented on a change in pull request #23: [SUBMARINE-208] Create Architecture and Requirement Doc
URL: https://github.com/apache/hadoop-submarine/pull/23#discussion_r329614231
########## File path: docs/design/architecture-and-requirements.md ##########
@@ -0,0 +1,351 @@

<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->

# Terminology

| Term | Description |
| -------- | -------- |
| User | A single data scientist or data engineer. A user has resource quotas and credentials. |
| Team | A user belongs to one or more teams. Teams have ACLs for sharing artifacts such as notebook content, models, etc. |
| Admin | Also called SRE; manages users' quotas, credentials, teams, and other components. |
| Project | A project may include one or more notebooks and zero or more running jobs, and can be collaborated on by multiple users who have ACLs on it. |

# Background

Everybody talks about machine learning today, and lots of companies are trying to leverage machine learning to push their business to the next level. Nowadays, as more and more developers and infrastructure software companies come to this field, machine learning becomes more and more achievable.

In the last decade, the software industry has built many open-source tools for machine learning to solve the pain points:

1. It was not easy to build machine learning algorithms manually, such as logistic regression, GBDT, and many other algorithms.
   **Answer to that:** The industry has open-sourced many algorithm libraries, tools, and even pre-trained models, so that data scientists can directly reuse these building blocks and hook them up to their data without knowing the intricate details inside these algorithms and models.

2. It was not easy to achieve "WYSIWYG, what you see is what you get" from IDEs: it was not easy to get output, visualization, and troubleshooting experiences in the same place.
   **Answer to that:** The notebook concept was added to this picture. Notebooks brought the experiences of interactive coding, sharing, visualization, and debugging under the same user interface. There are popular open-source notebooks like Apache Zeppelin and Jupyter.

3. It was not easy to manage dependencies: an ML application that runs on one machine was hard to deploy on another machine because of its many library dependencies.
   **Answer to that:** Containerization became popular and a standard way to package dependencies, making it easier to "build once, run anywhere".

4. Fragmented tools and libraries were hard for ML engineers to learn, and experience gained in one company was not naturally transferable to another company.
   **Answer to that:** A few dominant open-source frameworks reduced the overhead of learning too many different frameworks and concepts. A data scientist can learn a few libraries such as TensorFlow/PyTorch, plus a few high-level wrappers like Keras, and will be able to create machine learning applications from other open-source building blocks.

5. Similarly, models built by one library (such as libsvm) were hard to integrate into a machine learning pipeline, since there was no standard format.
   **Answer to that:** The industry has built successful open-source standard machine learning frameworks such as TensorFlow/PyTorch/Keras, so their formats can be easily shared across tools. There are also efforts to build an even more general model format, such as ONNX.

6. It was hard to build a data pipeline that flows/transforms data from raw data sources into whatever is required by ML applications.
   **Answer to that:** The open-source big data industry plays an important role in providing, simplifying, and unifying the processes and building blocks for data flows, transformations, etc.

The machine learning industry is moving on the right track to solve the major roadblocks. So what are the pain points now for companies that have machine learning needs? How can we help here? To answer this question, let's look at the machine learning workflow first.

## Machine Learning Workflows & Pain points

```
1) From different data sources such as edge, clickstream, logs, etc.
   => Land to data lakes

2) From data lake, data transformation:
   => Data transformations: cleanup, remove invalid rows/columns,
      select columns, sampling, split train/test
      data-set, join table, etc.
   => Data prepared for training.

3) From prepared data:
   => Training, model hyper-parameter tuning, cross-validation, etc.
   => Models saved to storage.

4) From saved models:
   => Model assurance, deployment, A/B testing, etc.
   => Model deployed for online serving or offline scoring.
```

Typically, data scientists are responsible for items 2)-4); item 1) is typically handled by a different team (called the Data Engineering team in many companies; some Data Engineering teams are also responsible for part of the data transformation).
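To make steps 2)-3) concrete, here is a minimal sketch using pandas and scikit-learn. It is an illustration only: the file names and column names (`prepared/clicks.csv`, `col_a`, `col_b`, `label`) are hypothetical, and a real pipeline would add validation, hyper-parameter tuning, and scheduling.

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 2) Data transformation: clean up invalid rows, select columns,
#    split the train/test data set.
df = pd.read_csv("prepared/clicks.csv")          # hypothetical prepared data
df = df.dropna(subset=["label"])                 # remove invalid rows
features = df[["col_a", "col_b"]]                # select columns (hypothetical)
labels = df["label"]
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# 3) Training; hyper-parameter tuning / cross-validation would go here.
model = LogisticRegression().fit(x_train, y_train)
print("test accuracy:", model.score(x_test, y_test))

# Model saved to storage, ready for step 4) (assurance, deployment, A/B testing).
joblib.dump(model, "models/clicks-lr.joblib")
```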
### Pain \#1: Complex workflow/steps from raw data to model; many different tools needed by different steps; hard to make changes to the workflow; not error-proof

It is a complex workflow from raw data to usable models. After talking to many different data scientists, we have learned that a typical procedure to train a new model and push it to production can take months to 1-2 years.

The workflow also requires a wide skill set. For example, data transformation needs tools like Spark/Hive at large scale and tools like Pandas at small scale. Model training needs to switch between XGBoost, TensorFlow, Keras, and PyTorch. Building a data pipeline needs Apache Airflow or Oozie.

Yes, there are great, standardized open-source tools built for many of these purposes. But what about changes that need to be made to a particular part of the data pipeline? What about adding a few columns to the training data for experiments? What about training models, then pushing them to validation and A/B testing before rolling to production? All these steps need jumping between different tools and UIs; it is very hard to make changes, and the procedures are not error-proof.

### Pain \#2: Dependencies on the underlying resource management platform

To run the jobs/services required by a machine learning platform, we need an underlying resource management platform. There are several choices of resource management platforms, and they have distinct advantages and disadvantages.

For example, many machine learning platforms are built on top of K8s. It is relatively easy to get a K8s cluster from a cloud vendor, and easy to orchestrate the services/daemons required by a machine learning platform on K8s. However, K8s doesn't offer good support for jobs like Spark/Flink/Hive. So if your company has Spark/Flink/Hive running on YARN, there are gaps and a significant amount of work to move the required jobs from YARN to K8s. Maintaining a separate K8s cluster is also an overhead for a Hadoop-based data infrastructure.

Similarly, if your company's data pipelines are mostly built on top of cloud resources and SaaS offerings, asking you to install a separate YARN cluster to run a new machine learning platform doesn't make a lot of sense.

### Pain \#3: Data scientists are forced to interact with lower-level platform components

In addition to the above pains, we see that data scientists are forced to learn underlying platform knowledge to be able to build a real-world machine learning workflow.

Most of the data scientists we talked with are experts in ML algorithms/libraries, feature engineering, etc. They are also most familiar with Python and R, and some of them understand Spark, Hive, etc.

If they are asked to interact with lower-level components, such as fine-tuning a Spark job's performance, troubleshooting a job that failed to launch because of resource constraints, or writing a K8s/YARN job spec with volumes mounted and networks set properly, they will scratch their heads and typically cannot perform these operations efficiently.

### Pain \#4: Comply with data security/governance requirements

TODO: Add more details.

### Pain \#5: No good way to reduce routine ML code development

After the data is prepared, the data scientist needs to do several routine tasks to build the ML pipeline. To get a sense of an existing data set, one usually needs to split the data set and compute statistics of the data set. These tasks share a common, duplicated portion of code, which reduces the efficiency of data scientists.

An abstraction layer/framework to help developers boost ML pipeline development could be valuable. Ideally, the developer would only need to fill in callback functions to focus on their key logic, as sketched below.
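A minimal sketch of what such a callback-style abstraction could look like. The `Pipeline` class and its `run` method are hypothetical names for illustration, not an existing Submarine API: the framework owns the routine steps (statistics, train/test split, evaluation), while the data scientist only fills in the `load` and `train` callbacks.

```python
from dataclasses import dataclass
from typing import Any, Callable

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


@dataclass
class Pipeline:
    """Hypothetical framework class: it owns the routine steps; the
    data scientist supplies only the two callbacks below."""
    load: Callable[[], pd.DataFrame]                 # user callback: load data
    train: Callable[[pd.DataFrame, pd.Series], Any]  # user callback: fit model
    label_column: str
    test_size: float = 0.2

    def run(self) -> Any:
        df = self.load()
        # Routine step owned by the framework: data-set statistics.
        print(df.describe())
        # Routine step owned by the framework: train/test split.
        features = df.drop(columns=[self.label_column])
        labels = df[self.label_column]
        x_train, x_test, y_train, y_test = train_test_split(
            features, labels, test_size=self.test_size)
        model = self.train(x_train, y_train)
        # Assumes an sklearn-style estimator with a score() method.
        print("holdout score:", model.score(x_test, y_test))
        return model


# The developer fills in only the callbacks ("key logic"); the file and
# column names below are hypothetical.
pipeline = Pipeline(
    load=lambda: pd.read_csv("prepared/clicks.csv"),
    train=lambda x, y: LogisticRegression().fit(x, y),
    label_column="label",
)
model = pipeline.run()
```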
# Submarine

## Overview

### A little bit of history

Initially, Submarine was built to solve the problems of running deep learning jobs like TensorFlow/PyTorch on Apache Hadoop YARN, allowing admins to monitor launched deep learning jobs and manage generated models.

It was initially part of YARN, with code residing under `hadoop-yarn-applications`. Later, the community decided to make it a subproject of Hadoop because we wanted to support other resource management platforms like K8s. And finally, as we reconsidered Submarine's charter, the Hadoop community voted that it was time to move Submarine to a separate Apache TLP.

Review comment: Addressed.
