liuxunorg commented on a change in pull request #23: Create Architecture and 
Requirement Doc
URL: https://github.com/apache/hadoop-submarine/pull/23#discussion_r329360462
 
 

 ##########
 File path: docs/design/Architecture and Requirement.md
 ##########
 @@ -0,0 +1,336 @@
+# Terminology
+
+| Term | Description |
+| -------- | -------- |
+| User | A single data-scientist/data-engineer. A user has resource quotas and credentials. |
+| Team | A user belongs to one or more teams; teams have ACLs for sharing artifacts such as notebook content, models, etc. |
+| Admin | Also called SRE, who manages users' quotas, credentials, teams, and other components. |
+| Project | A project may include one or more notebooks and zero or more running jobs, and can be collaborated on by multiple users who have ACLs on it. |
+
+
+# Background 
+
+Everybody talks about machine learning today, and lots of companies are trying to leverage it to push their business to the next level. As more and more developers and infrastructure software companies come to this field, machine learning is becoming more and more achievable.
+
+In the last decade, the software industry has built many open-source tools for machine learning to solve these pain points:
+
+1. It was not easy to build machine learning algorithms manually, such as logistic regression, GBDT, and many others.
+   **Answer to that:** The industry has open-sourced many algorithm libraries, tools, and even pre-trained models, so that data scientists can directly reuse these building blocks on their own data without knowing the intricate details inside these algorithms and models.
+
+2. It was not easy to achieve "WYSIWYG, what you see is what you get" from IDEs: it was hard to get output, visualization, and troubleshooting experiences in the same place.
+   **Answer to that:** The notebook concept was added to this picture: notebooks brought interactive coding, sharing, visualization, and debugging experiences under the same user interface. There are popular open-source notebooks such as Apache Zeppelin and Jupyter.
+   
+3. It was not easy to manage dependencies: an ML application that runs on one machine is hard to deploy on another machine because of its many library dependencies.
+   **Answer to that:** Containerization became popular and is now a standard way to package dependencies, making it easier to "build once, run anywhere".
+
+4. Fragmented tools and libraries were hard for ML engineers to learn, and experience gained in one company did not naturally transfer to another.
+   **Answer to that:** A few dominant open-source frameworks reduced the overhead of learning many different frameworks and concepts. A data scientist can learn a few libraries such as TensorFlow/PyTorch and a few high-level wrappers like Keras, and build machine learning applications from other open-source building blocks.
+
+5. Similarly, models built with one library (such as libsvm) were hard to integrate into a machine learning pipeline since there was no standard format.
+   **Answer to that:** The industry has built successful open-source machine learning frameworks such as TensorFlow/PyTorch/Keras, so their formats can be easily shared, and there are efforts to build an even more general model format such as ONNX.
+   
+6. It was hard to build a data pipeline that flows/transforms data from the raw data source into whatever is required by ML applications.
+   **Answer to that:** The open-source big data industry plays an important role in providing, simplifying, and unifying processes and building blocks for data flows, transformations, etc.
+   
+The machine learning industry is moving on the right track to solve these major roadblocks. So what are the pain points now for companies with machine learning needs, and how can we help? To answer this question, let's look at the machine learning workflow first.
+
+## Machine Learning Workflows & Pain points
+
+```
+1) From different data source such as edge, clickstream, logs, etc.
+   => Land to data lakes  
+   
+2) From data lake, data transformation: 
+   => Data transformations: Cleanup, remove invalid rows/columns, 
+                            select columns, sampling, split train/test
+                            data-set, join table, etc.
+   => Data prepared for training.
+                            
+3) From prepared data: 
+   => Training, model hyper-parameter tuning, cross-validation, etc. 
+   => Models saved to storage. 
+   
+4) From saved models: 
+   => Model assurance, deployment, A/B testing, etc.
+   => Model deployed for online serving or offline scoring.
+```
+
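+The four steps above can be sketched as a minimal, self-contained Python pipeline. This is an illustrative toy, not part of the proposed design: the record fields, the threshold "model", and the file name are all made up, and a trivial threshold stands in for real training.
+
+```python
+import json
+import os
+import random
+import tempfile
+
+# 1) "Raw data" landed in a data lake (here: an in-memory list of records).
+raw_rows = [{"feature": float(i), "label": i % 2, "valid": i != 3} for i in range(10)]
+
+# 2) Data transformation: drop invalid rows, select columns, split train/test.
+clean = [{"feature": r["feature"], "label": r["label"]} for r in raw_rows if r["valid"]]
+random.seed(42)
+random.shuffle(clean)
+split = int(0.8 * len(clean))
+train, test = clean[:split], clean[split:]
+
+# 3) "Training": a trivial threshold model standing in for real training.
+threshold = sum(r["feature"] for r in train) / len(train)
+model = {"threshold": threshold}
+
+# 4) Save the model to storage for deployment / offline scoring.
+model_path = os.path.join(tempfile.gettempdir(), "toy_model.json")
+with open(model_path, "w") as f:
+    json.dump(model, f)
+```
+
+Even in this toy form, each step hands off to the next through a different artifact (records, a split data set, a model file), which hints at why jumping between real tools at each boundary is painful.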
+Typically, data scientists are responsible for items 2)-4), while 1) is usually handled by a different team (called the Data Engineering team in many companies; some Data Engineering teams are also responsible for part of the data transformation).
+
+### Pain \#1 Complex workflow/steps from raw data to model: many different tools are needed by different steps, the workflow is hard to change, and the process is not error-proof
+
+Going from raw data to usable models is a complex workflow. After talking to many different data scientists, we have learned that a typical procedure to train a new model and push it to production can take months to 1-2 years.
+
+This workflow also requires a wide skill set. For example, data transformation needs tools like Spark/Hive at large scale and tools like Pandas at small scale. Model training needs to switch between XGBoost, TensorFlow, Keras, and PyTorch. Building a data pipeline needs Apache Airflow or Oozie.
+
+Yes, there are great, standardized open-source tools built for many of these purposes. But what about changes that need to be made to a particular part of the data pipeline? What about adding a few columns to the training data for experiments? What about training models, then pushing them to validation and A/B testing before rolling out to production? All these steps require jumping between different tools and UIs, making changes is very hard, and the procedures are not error-proof.
+
+#### Pain \#2 Dependencies of underlying resource management platform
 
 Review comment:
   Document heading level error.
   Need to change `#### Pain` to `### Pain`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
