[GitHub] samza pull request #874: SAMZA-2058: Integrate the input partition expansion...

2019-01-10 Thread shanthoosh
GitHub user shanthoosh opened a pull request:

https://github.com/apache/samza/pull/874

SAMZA-2058: Integrate the input partition expansion aware 
SystemStreamGrouper to JobModel generation flow.

SAMZA-1989 added a partition expansion aware SystemStreamPartitionGrouper
to Samza. This PR integrates that grouper with Samza's job model generation
workflow and makes it work for both the YARN and standalone deployment models.
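
The description does not spell out the grouping rule itself. As a rough illustration of what "partition expansion aware" grouping can mean, the sketch below assumes a modulo rule: after a stream grows from N to a multiple of N partitions, partition p goes to the task that owned partition p % N in the previous run. The class name, method name, and the modulo rule are assumptions made for this example, not code from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; not the grouper added by SAMZA-1989. */
public class ExpansionAwareGroupingSketch {

    /**
     * Assigns every partition of the expanded stream to the task that owned
     * partition (partitionId % previousPartitionCount) in the previous run.
     * Assumes previous partition ids are 0..N-1 and the new count is a multiple of N.
     */
    public static Map<String, List<Integer>> group(
            Map<Integer, String> previousPartitionToTask, int currentPartitionCount) {
        int previousCount = previousPartitionToTask.size();
        if (previousCount == 0 || currentPartitionCount % previousCount != 0) {
            throw new IllegalArgumentException(
                "current partition count must be a multiple of the previous count");
        }
        Map<String, List<Integer>> taskToPartitions = new TreeMap<>();
        for (int partition = 0; partition < currentPartitionCount; partition++) {
            String task = previousPartitionToTask.get(partition % previousCount);
            if (task == null) {
                throw new IllegalArgumentException("previous partition ids must be 0..N-1");
            }
            taskToPartitions.computeIfAbsent(task, t -> new ArrayList<>()).add(partition);
        }
        return taskToPartitions;
    }

    public static void main(String[] args) {
        Map<Integer, String> previous = new TreeMap<>();
        previous.put(0, "Task-0");
        previous.put(1, "Task-1");
        // The stream expands from 2 to 4 partitions.
        System.out.println(group(previous, 4));
    }
}
```

Under this assumed rule, expanding from 2 to 4 partitions gives partitions 0 and 2 to Task-0 and partitions 1 and 3 to Task-1, so each task keeps the partitions it already had.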

**Changes:** 

1. Added TaskPartitionAssignmentManager to store the task to partition
assignments present in the JobModel to the underlying metadata store.
This is essential for persisting the Task to SystemStreamPartition assignments
from the previous run of a Samza job. Currently samza-yarn stores the metadata
for an execution of a job in the coordinator stream. The maximum supported Kafka
message size within LI is 1 MB. This limitation drove the decision to
denormalize the task to SystemStreamPartition map into individual messages and
store them in the coordinator stream.
2. Used the existing coordinator stream JSON serde to serialize the task to
partition assignments to raw bytes before writing to the coordinator stream,
and to deserialize them after reading.
3. Changed JobModelManager to integrate the input partition expansion
aware SSPGrouper.
4. Cleaned up code and JavaDoc in the MetadataStore utility classes.
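
The denormalization in item 1 can be pictured with a small sketch: instead of writing the whole task to partition map as one large record, each partition gets its own small record. Everything here is hypothetical: the class name, the key/value layout, the in-memory map standing in for the coordinator stream, and the hand-rolled JSON string encoding are illustrative assumptions, not the API added by this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; not the TaskPartitionAssignmentManager from the PR. */
public class TaskPartitionAssignmentSketch {
    // Stand-in for the underlying metadata store: key -> raw bytes.
    private final Map<String, byte[]> store = new HashMap<>();

    /** Writes one small record per partition rather than one record for the whole map. */
    public void writeAssignments(Map<String, List<String>> taskToPartitions) {
        for (Map.Entry<String, List<String>> entry : taskToPartitions.entrySet()) {
            for (String partition : entry.getValue()) {
                // Key: the partition; value: the owning task name as JSON bytes.
                store.put(partition, toJsonBytes(entry.getKey()));
            }
        }
    }

    /** Rebuilds the task to partitions map by reading every record back. */
    public Map<String, List<String>> readAssignments() {
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, byte[]> record : store.entrySet()) {
            String task = fromJsonBytes(record.getValue());
            result.computeIfAbsent(task, t -> new ArrayList<>()).add(record.getKey());
        }
        // The backing HashMap has no defined order, so sort for a stable result.
        result.values().forEach(Collections::sort);
        return result;
    }

    /** Number of individual records currently held in the store. */
    public int recordCount() {
        return store.size();
    }

    // Minimal JSON string encoding; assumes names contain no quotes or backslashes.
    private static byte[] toJsonBytes(String value) {
        return ("\"" + value + "\"").getBytes(StandardCharsets.UTF_8);
    }

    private static String fromJsonBytes(byte[] bytes) {
        String json = new String(bytes, StandardCharsets.UTF_8);
        return json.substring(1, json.length() - 1);
    }

    public static void main(String[] args) {
        Map<String, List<String>> assignments = new HashMap<>();
        assignments.put("Task-0", List.of("kafka.clicks.0", "kafka.clicks.2"));
        assignments.put("Task-1", List.of("kafka.clicks.1"));
        TaskPartitionAssignmentSketch sketch = new TaskPartitionAssignmentSketch();
        sketch.writeAssignments(assignments);
        System.out.println(sketch.recordCount() + " records: " + sketch.readAssignments());
    }
}
```

With this layout each record stays small no matter how many partitions a job consumes, which is the property the 1 MB message limit calls for.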

**Testing**:

1. Added new unit tests for all the newly added classes and fixed the
existing unit tests affected by the changes.
2. Standalone: Wrote a few integration tests in TestZkLocalApplicationRunner
to test input stream partition expansion in standalone.
3. YARN: Tested this patch with a sample stream-to-table join high-level
job from samza-hello-samza. The relevant logs are at
https://gist.github.com/shanthoosh/07357bb615d9cbbfa23cc02b98c9d142 and
verify that the AM is restarted on partition expansion of the input stream and
that correct task to partition assignments are generated.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/shanthoosh/samza SEP-5_left-over

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/samza/pull/874.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #874


commit 7bedd46ba98ebb18bdcbf6e3feace7188ac9af20
Author: Shanthoosh Venkataraman 
Date:   2018-12-07T02:04:18Z

Initial commit.




---


[REPORT] Samza - January 2019

2019-01-10 Thread Yi Pan
## Description:
- Apache Samza is a distributed stream processing engine that is highly
   configurable to process events from various data sources, including
   real-time messaging systems (e.g. Kafka) and distributed file systems
   (e.g. HDFS).

## Issues:
- No issues require board attention

## Activity:
- Samza 1.0 is released:
  - News coverage:
    https://www.zdnet.com/article/real-time-data-processing-just-got-more-options-linkedin-releases-apache-samza-1-0-streaming/
  - Engineering blogs:
    https://engineering.linkedin.com/blog/2018/11/samza-1-0--stream-processing-at-massive-scale
  - Major online website refresh: http://samza.apache.org/
- Critical improvement projects completed:
  - Changelog restore parallelization
  - Evaluated HDFS based backup/restore of state stores
- Multiple SEP projects initiated or in-progress:
  - SEP-18: allows manipulating starting offsets and time-based rewind
  - SEP-19: Fast failover for stateful jobs on container failure (i.e.
    standby container)
  - SEP to come soon: async high-level API
- Beam Samza runner upgrade to use Samza 1.0
- Go and Python support via Beam Samza runner

## Health report:
- Project is in healthy status with 1.0 released in Nov 2018

## PMC changes:

- Currently 15 PMC members.
- Prateek Maheshwari was added to the PMC on Thu Nov 01 2018

## Committer base changes:

- Currently 22 committers.
- New committers:
  - Aditya Toomula was added as a committer on Mon Nov 05 2018
  - Hai Lu was added as a committer on Mon Nov 05 2018

## Releases:

- Last release was 1.0 on Nov 28, 2018

## /dist/ errors: 9
- Project is in healthy status with a major release pending in Oct

## Mailing list activity:

- dev@samza.apache.org:
  - 271 subscribers (down 13 in the last 3 months)
  - 445 emails sent to list (288 in previous quarter)


## JIRA activity:

- 111 JIRA tickets created in the last 3 months
- 57 JIRA tickets closed/resolved in the last 3 months


Re: Draft report to board - Jan 2019

2019-01-10 Thread Jake Maes
LGTM as well.

Thanks, Yi!

-Jake

On Wed, Jan 9, 2019 at 12:41 PM Yi Pan  wrote:

> Thanks! Updated inline accordingly.
>
> -Yi
>
> On Wed, Jan 9, 2019 at 12:32 PM Prateek Maheshwari 
> wrote:
>
> > Thanks for the summary Yi. I'd change: "HDFS based backup/restore of
> > state stores" to "Evaluation for HDFS based backup/restore of state
> > stores" since this was an intern project and is not checked in to
> > master. Otherwise LGTM.
> >
> > Thanks,
> > Prateek
> >
> > On Wed, Jan 9, 2019 at 12:28 PM Yi Pan  wrote:
> > >
> > > Hi, all,
> > >
> > > Our quarterly report is due this Wed (1/9). The following is the draft
> > > report. Please let me know by the end of the day if I missed anything.
> > > Thanks!
> > >
> > > ## Description:
> > >
> > >  - Apache Samza is a distributed stream processing engine that is highly
> > >    configurable to process events from various data sources, including
> > >    real-time messaging systems (e.g. Kafka) and distributed file systems
> > >    (e.g. HDFS).
> > >
> > >
> > >
> > > ## Issues:
> > >
> > >  - No issues require board attention
> > >
> > >
> > >
> > > ## Activity:
> > >
> > >  - Samza 1.0 is released:
> > >    - News coverage:
> > >      https://www.zdnet.com/article/real-time-data-processing-just-got-more-options-linkedin-releases-apache-samza-1-0-streaming/
> > >    - Engineering blogs:
> > >      https://engineering.linkedin.com/blog/2018/11/samza-1-0--stream-processing-at-massive-scale
> > >    - Major online website refresh: http://samza.apache.org/
> > >  - Critical improvement projects completed:
> > >    - Changelog restore parallelization
> > >    - Evaluation for HDFS based backup/restore of state stores
> > >  - Multiple SEP projects initiated or in-progress:
> > >    - SEP-18: allows manipulating starting offsets and time-based rewind
> > >    - SEP-19: Fast failover for stateful jobs on container failure (i.e.
> > >      standby container)
> > >    - SEP to come soon: async high-level API
> > >  - Beam Samza runner upgrade to use Samza 1.0
> > >  - Go and Python support via Beam Samza runner
> > >
> > >
> > >
> > > ## Health report:
> > >
> > >  - Project is in healthy status with 1.0 released in Nov 2018
> > >
> > >
> > >
> > > ## PMC changes:
> > >
> > >
> > >
> > >  - Currently 15 PMC members.
> > >
> > >  - Prateek Maheshwari was added to the PMC on Thu Nov 01 2018
> > >
> > >
> > >
> > > ## Committer base changes:
> > >
> > >
> > >
> > >  - Currently 22 committers.
> > >
> > >  - New committers:
> > >
> > > - Aditya Toomula was added as a committer on Mon Nov 05 2018
> > >
> > > - Hai Lu was added as a committer on Mon Nov 05 2018
> > >
> > >
> > >
> > > ## Releases:
> > >
> > >
> > >
> > >  - Last release was 1.0 on Nov 28, 2018
> > >
> > >
> > >
> > > ## /dist/ errors: 9
> > >
> > >  - Project is in healthy status with 1.0 released in Nov 2018
> > >
> > >
> > >
> > > ## Mailing list activity:
> > >
> > >
> > >
> > >  - dev@samza.apache.org:
> > >
> > > - 271 subscribers (down 13 in the last 3 months)
> > >
> > > - 445 emails sent to list (288 in previous quarter)
> > >
> > >
> > >
> > >
> > >
> > > ## JIRA activity:
> > >
> > >
> > >
> > >  - 111 JIRA tickets created in the last 3 months
> > >
> > >  - 57 JIRA tickets closed/resolved in the last 3 months
> >
>