[GitHub] samza pull request #874: SAMZA-2058: Integrate the input partition expansion...
GitHub user shanthoosh opened a pull request:

    https://github.com/apache/samza/pull/874

    SAMZA-2058: Integrate the input partition expansion aware SystemStreamGrouper to JobModel generation flow.

SAMZA-1989 added a partition expansion aware SystemStreamPartitionGrouper to Samza. This PR integrates that grouper with Samza's job model generation workflow and makes it work for both the YARN and standalone deployment models.

**Changes:**
1. Added TaskPartitionAssignmentManager to store the task to partition assignments present in the JobModel in the underlying metadata store. This is essential for persisting the task to SystemStreamPartition assignments from the previous run of a Samza job. Currently samza-yarn stores the metadata for an execution of a job in the coordinator stream. The maximum supported Kafka message size within LI is 1 MB. This limitation drove the decision to denormalize the task to SystemStreamPartition map into individual messages and store them in the coordinator stream.
2. Used the existing coordinator stream JSON serde to serialize the task to partition assignments to raw bytes before writing to the coordinator stream, and to deserialize them after reading.
3. Changed JobModelManager to integrate the input partition expansion aware SSPGrouper.
4. Cleaned up code and JavaDoc in the MetadataStore utility classes.

**Testing:**
1. Added unit tests for all the newly added classes and fixed the existing unit tests affected by the changes.
2. Standalone: Wrote a few integration tests in TestZkLocalApplicationRunner to test input stream partition expansion.
3. YARN: Tested this patch with a sample stream-to-table join high-level job from samza-hello-samza. The relevant logs are at https://gist.github.com/shanthoosh/07357bb615d9cbbfa23cc02b98c9d142; they verify that the AM is restarted on partition expansion of the input stream and that correct task to partition assignments are generated.
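The denormalization described in change 1 can be illustrated with a minimal sketch. It is not the code in this PR and does not use Samza's API: the function names, the per-task record layout, and the sample stream names are all hypothetical, and plain JSON stands in for the coordinator stream serde. The only detail taken from the description is the 1 MB message size limit.

```python
import json

# 1 MB limit quoted in the PR description; the constant name is hypothetical.
MAX_MESSAGE_BYTES = 1024 * 1024


def denormalize(task_to_partitions):
    """Split a task -> partitions map into one (key, value) record per task.

    Each record is serialized independently, so no single message has to
    carry the whole assignment map.
    """
    records = []
    for task_name, partitions in sorted(task_to_partitions.items()):
        key = task_name.encode("utf-8")
        value = json.dumps(sorted(partitions)).encode("utf-8")
        if len(key) + len(value) > MAX_MESSAGE_BYTES:
            raise ValueError(f"record for {task_name!r} exceeds the size limit")
        records.append((key, value))
    return records


def rebuild(records):
    """Reassemble the task -> partitions map from the individual records."""
    return {
        key.decode("utf-8"): json.loads(value.decode("utf-8"))
        for key, value in records
    }


# Hypothetical assignments: partition names are illustrative strings only.
assignments = {
    "Task-0": ["kafka.page-views.0", "kafka.page-views.2"],
    "Task-1": ["kafka.page-views.1"],
}
records = denormalize(assignments)
restored = rebuild(records)
```

Writing one small record per entry keeps every message well under the size limit even when the full map is large, at the cost of reading several messages to reconstruct the assignments of the previous run.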
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shanthoosh/samza SEP-5_left-over

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/samza/pull/874.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #874

commit 7bedd46ba98ebb18bdcbf6e3feace7188ac9af20
Author: Shanthoosh Venkataraman
Date: 2018-12-07T02:04:18Z

    Initial commit.

---
[REPORT] Samza - January 2019
## Description:
- Apache Samza is a distributed stream processing engine that is highly configurable to process events from various data sources, including real-time messaging systems (e.g. Kafka) and distributed file systems (e.g. HDFS).

## Issues:
- No issues require board attention

## Activity:
- Samza 1.0 is released:
  - News coverage: https://www.zdnet.com/article/real-time-data-processing-just-got-more-options-linkedin-releases-apache-samza-1-0-streaming/
  - Engineering blogs: https://engineering.linkedin.com/blog/2018/11/samza-1-0--stream-processing-at-massive-scale
  - Major online website refresh: http://samza.apache.org/
- Critical improvement projects completed:
  - Changelog restore parallelization
  - Evaluated HDFS based backup/restore of state stores
- Multiple SEP projects initiated or in-progress:
  - SEP-18: allows manipulating starting offsets and time-based rewind
  - SEP-19: Fast failover for stateful jobs on container failure (i.e. standby container)
  - SEP to come soon: async high-level API
- Beam Samza runner upgrade to use Samza 1.0
- Go and Python support via Beam Samza runner

## Health report:
- Project is in healthy status with 1.0 released in Nov 2018

## PMC changes:
- Currently 15 PMC members.
- Prateek Maheshwari was added to the PMC on Thu Nov 01 2018

## Committer base changes:
- Currently 22 committers.
- New committers:
  - Aditya Toomula was added as a committer on Mon Nov 05 2018
  - Hai Lu was added as a committer on Mon Nov 05 2018

## Releases:
- Last release was 1.0 on Nov 28, 2018

## /dist/ errors: 9
- Project is in healthy status with a major release pending in Oct

## Mailing list activity:
- dev@samza.apache.org:
  - 271 subscribers (down 13 in the last 3 months)
  - 445 emails sent to list (288 in previous quarter)

## JIRA activity:
- 111 JIRA tickets created in the last 3 months
- 57 JIRA tickets closed/resolved in the last 3 months
Re: Draft report to board - Jan 2019
LGTM as well. Thanks, Yi!

-Jake

On Wed, Jan 9, 2019 at 12:41 PM Yi Pan wrote:

> Thanks! Updated inline accordingly.
>
> -Yi
>
> On Wed, Jan 9, 2019 at 12:32 PM Prateek Maheshwari wrote:
>
> > Thanks for the summary Yi. I'd change: "HDFS based backup/restore of
> > state stores" to "Evaluation for HDFS based backup/restore of state
> > stores" since this was an intern project and is not checked in to
> > master. Otherwise LGTM.
> >
> > Thanks,
> > Prateek
> >
> > On Wed, Jan 9, 2019 at 12:28 PM Yi Pan wrote:
> >
> > > Hi, all,
> > >
> > > Our quarterly report is due this Wed (1/9). The following is the draft
> > > report. Please let me know by the end of the day if I missed anything.
> > > Thanks!
> > >
> > > ## Description:
> > > - Apache Samza is a distributed stream processing engine that is highly
> > >   configurable to process events from various data sources, including
> > >   real-time messaging systems (e.g. Kafka) and distributed file systems
> > >   (e.g. HDFS).
> > >
> > > ## Issues:
> > > - No issues require board attention
> > >
> > > ## Activity:
> > > - Samza 1.0 is released:
> > >   - News coverage:
> > >     https://www.zdnet.com/article/real-time-data-processing-just-got-more-options-linkedin-releases-apache-samza-1-0-streaming/
> > >   - Engineering blogs:
> > >     https://engineering.linkedin.com/blog/2018/11/samza-1-0--stream-processing-at-massive-scale
> > >   - Major online website refresh: http://samza.apache.org/
> > > - Critical improvement projects completed:
> > >   - Changelog restore parallelization
> > >   - Evaluation for HDFS based backup/restore of state stores
> > > - Multiple SEP projects initiated or in-progress:
> > >   - SEP-18: allows manipulating starting offsets and time-based rewind
> > >   - SEP-19: Fast failover for stateful jobs on container failure
> > >     (i.e. standby container)
> > >   - SEP to come soon: async high-level API
> > > - Beam Samza runner upgrade to use Samza 1.0
> > > - Go and Python support via Beam Samza runner
> > >
> > > ## Health report:
> > > - Project is in healthy status with 1.0 released in Nov 2018
> > >
> > > ## PMC changes:
> > > - Currently 15 PMC members.
> > > - Prateek Maheshwari was added to the PMC on Thu Nov 01 2018
> > >
> > > ## Committer base changes:
> > > - Currently 22 committers.
> > > - New committers:
> > >   - Aditya Toomula was added as a committer on Mon Nov 05 2018
> > >   - Hai Lu was added as a committer on Mon Nov 05 2018
> > >
> > > ## Releases:
> > > - Last release was 1.0 on Nov 28, 2018
> > >
> > > ## /dist/ errors: 9
> > > - Project is in healthy status with 1.0 released in Nov 2018
> > >
> > > ## Mailing list activity:
> > > - dev@samza.apache.org:
> > >   - 271 subscribers (down 13 in the last 3 months)
> > >   - 445 emails sent to list (288 in previous quarter)
> > >
> > > ## JIRA activity:
> > > - 111 JIRA tickets created in the last 3 months
> > > - 57 JIRA tickets closed/resolved in the last 3 months