Hi,
I know this is a broad question. If this is not the right forum, I would
appreciate pointers to other sites or lists that may be more helpful.
Before posting, I did search on Google, but filtering the results down to
what I actually need hasn't been easy.
Who I am:
- I have done data processing and analytics, but I am relatively new to
the Spark world
What I am looking for:
- Architecture/design of an ML system using Spark
- In particular, looking for best practices that can support/bridge both
Engineering and Data Science teams
Engineering:
- Build a system that meets typical engineering needs: data processing,
scalability, reliability, availability, fault tolerance, etc.
- System monitoring etc.
Data Science:
- Build a system that lets the Data Science team explore data
- Develop models using supervised learning and tune them
Data:
- Batch and incremental updates - mostly structured or semi-structured
(some data from transaction systems, weblogs, click stream etc.)
- Streaming in the near term, but not to begin with
Data Storage:
- Data is expected to grow daily, so the system should be able to handle
big data
- After further analysis, there may be a need to archive some of the
data. That depends on how the ML models are built and how the results
are stored for future use
Data Analysis:
- The usual data-related tasks, such as data cleansing, data
transformation, data partitioning, etc.
- Possibly run models on windows of data, for example the last 1 year,
the last 2 years, etc.
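To illustrate what I mean by running on windows of data, here is a rough
sketch in plain Python (not Spark). The helper name in_window, the record
layout and the dates are all made up for illustration; I assume the same
filter could be expressed on a Spark dataset before training.

```python
from datetime import date, timedelta

def in_window(record_date, as_of, years):
    """Return True if record_date lies within the last `years` years before as_of."""
    # Approximation: a year is treated as 365 days, so leap days are ignored.
    start = as_of - timedelta(days=365 * years)
    return start <= record_date <= as_of

# Hypothetical records: (event_date, value)
records = [
    (date(2011, 1, 10), 10.0),
    (date(2012, 9, 15), 20.0),
    (date(2014, 3, 1), 30.0),
]
as_of = date(2014, 6, 1)

# Each window becomes a separate training set for the same model code.
last_1y = [r for r in records if in_window(r[0], as_of, 1)]
last_2y = [r for r in records if in_window(r[0], as_of, 2)]
```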
ML models:
- Ability to store model versions and previous results
- Compare results of different variants of models
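To make the versioning requirement concrete, below is a minimal sketch of
the kind of bookkeeping I have in mind, in plain Python with JSON files.
The function names (save_run, load_runs, best_run), the directory layout,
the model name "churn" and the metric values are all hypothetical; a real
system would presumably keep this in a database or a shared file system.

```python
import json
import os
import tempfile

def save_run(registry_dir, model_name, version, params, metrics):
    """Write one run's parameters and metrics to <registry_dir>/<model_name>/v<version>.json."""
    model_dir = os.path.join(registry_dir, model_name)
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "v{}.json".format(version))
    with open(path, "w") as f:
        json.dump({"version": version, "params": params, "metrics": metrics}, f)
    return path

def load_runs(registry_dir, model_name):
    """Load every stored run for a model, ordered by version number."""
    model_dir = os.path.join(registry_dir, model_name)
    runs = []
    for name in os.listdir(model_dir):
        with open(os.path.join(model_dir, name)) as f:
            runs.append(json.load(f))
    return sorted(runs, key=lambda r: r["version"])

def best_run(runs, metric):
    """Pick the run with the highest value of the given metric."""
    return max(runs, key=lambda r: r["metrics"][metric])

# Made-up example values, only to show the comparison step.
registry = tempfile.mkdtemp()
save_run(registry, "churn", 1, {"reg_param": 0.1}, {"auc": 0.81})
save_run(registry, "churn", 2, {"reg_param": 0.01}, {"auc": 0.84})

runs = load_runs(registry, "churn")
best = best_run(runs, "auc")
```

Keeping parameters next to metrics is what makes it possible to compare
variants later without rerunning them.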
Consumers:
- RESTful webservice clients to look at the results
*So, the questions I have are:*
1) Are there architectural and design patterns that I can use, based on
industry best practices? In particular:
- data ingestion
- data storage (e.g. whether to go with HDFS or not)
- data partitioning, especially in the Spark world
- running ML models in parallel and combining their results
- consumption of final results by clients (e.g. by pushing results
to Cassandra or another NoSQL database)
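As a toy illustration of "running models in parallel and combining
results", here is a sketch in plain Python using a thread pool. The three
"models" are stand-in functions I made up, and averaging is just one
possible way to combine them; I do not know yet what the recommended
pattern is in Spark.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in "models": each maps one input value to one prediction.
def model_a(x):
    return 2.0 * x

def model_b(x):
    return 2.0 * x + 1.0

def model_c(x):
    return 2.0 * x - 1.0

def score_all(models, inputs):
    """Score the inputs with every model concurrently, then average per input."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        # One task per model; each task scores the full input list.
        futures = [pool.submit(lambda m=m: [m(x) for x in inputs]) for m in models]
        per_model = [f.result() for f in futures]
    # zip(*per_model) groups the predictions for the same input together.
    return [sum(preds) / len(preds) for preds in zip(*per_model)]

combined = score_all([model_a, model_b, model_c], [1.0, 2.0])
```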
Again, I know this is a broad question. Pointers to best practices in
some of these areas, if not all, would be highly appreciated. I am open
to purchasing any books that have relevant information.
Thanks much folks,
Vasu.